commits

When scheduling a group of events, constraint checks are done to make
sure all events can go in a group. For example, one of the criteria is
that events in a group cannot use the same PMC. However, a platform
specific PMU may support alternative events for some of the event codes.
During perf_event_open(), if an event group doesn't meet the constraint
check criteria, a further lookup is done to find an alternative event.

By current design, the array of alternative events in PMU code is
expected to be sorted by column 0. This is because in
find_alternative() the return criterion is based on event code
comparison, i.e. "event < ev_alt[i][0]". This optimisation is there
since find_alternative() can be called multiple times. In power10 PMU
code, the alternative event array is not sorted properly and hence
finding an alternative event is broken.

To work with existing logic, fix the alternative event array to be
sorted by column 0 for power10-pmu.c
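
The effect of the ordering can be illustrated with a minimal userspace
sketch. It is modelled only on the description above and is not the
kernel source: find_alternative_sketch(), MAX_ALT = 2 and both tables
are assumptions made for illustration. The codes 0x00002 and 0x500fa are
the ones quoted in this message; 0x100f0 and 0x600f4 are made-up values.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALT 2	/* events per row; assumed for this sketch */

/*
 * Hypothetical lookup: rows are scanned in order and the scan stops as
 * soon as column 0 of a row exceeds the event being looked up.
 * Returns the matching row index, or -1 if no row matched.
 */
static int find_alternative_sketch(unsigned long long event,
				   const unsigned int table[][MAX_ALT],
				   int nrows)
{
	assert(table != NULL);

	for (int i = 0; i < nrows; i++) {
		if (event < table[i][0])
			break;	/* only correct if column 0 is ascending */

		for (int j = 0; j < MAX_ALT && table[i][j]; j++)
			if (event == table[i][j])
				return i;
	}

	return -1;
}

/* The same two rows in two orderings (second row is made up). */
static const unsigned int unsorted_rows[][MAX_ALT] = {
	{ 0x500fa, 0x00002 },	/* larger code placed in column 0 */
	{ 0x100f0, 0x600f4 },
};

static const unsigned int sorted_rows[][MAX_ALT] = {
	{ 0x00002, 0x500fa },
	{ 0x100f0, 0x600f4 },
};
```

With unsorted_rows, a lookup of 0x00002 stops at the first row because
0x00002 < 0x500fa, so no alternative is reported. With sorted_rows the
same lookup returns row 0.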

Results:

When an alternative event could be chosen but is not, events will be
multiplexed, i.e. time sliced where they could actually run
concurrently.

For example, in power10 PM_INST_CMPL_ALT (0x00002) has an alternative
event, PM_INST_CMPL (0x500fa). Without the fix, if a group of events
using PMC1 to PMC4 is used along with PM_INST_CMPL_ALT, it will be time
sliced since all programmable PMCs are already consumed. With the fix,
the alternative event is picked on PMC5 and all events run
concurrently.

Before:

# perf stat -e r00002,r100fc,r200fa,r300fc,r400fc

Performance counter stats for 'system wide':

328668935 r00002 (79.94%)
56501024 r100fc (79.95%)
49564238 r200fa (79.95%)
376 r300fc (80.19%)
660 r400fc (79.97%)

4.039150522 seconds time elapsed

With the fix, since the alternative event is chosen to run on PMC5,
events will be run concurrently.

After:

# perf stat -e r00002,r100fc,r200fa,r300fc,r400fc

Performance counter stats for 'system wide':

23596607 r00002
4907738 r100fc
2283608 r200fa
135 r300fc
248 r400fc

1.664671390 seconds time elapsed

Fixes: a64e697cef23 ("powerpc/perf: power10 Performance Monitoring support")
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220419114828.89843-2-atrajeev@linux.vnet.ibm.com

3y ago

Linus Torvalds

a1901b46

Merge tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

3y ago

Linus Torvalds

9becb688

kvmalloc: use vmalloc_huge for vmalloc allocations

Since commit 559089e0a93d ("vmalloc: replace VM_NO_HUGE_VMAP with
VM_ALLOW_HUGE_VMAP"), the use of hugepage mappings for vmalloc is an
opt-in strategy, because it caused a number of problems that weren't
noticed until x86 enabled it too.

One of the issues was fixed by Nick Piggin in commit 3b8000ae185c
("mm/vmalloc: huge vmalloc backing pages should be split rather than
compound"), but I'm still worried about page protection issues, and
VM_FLUSH_RESET_PERMS in particular.

However, like the hash table allocation case (commit f2edd118d02d:
"page_alloc: use vmalloc_huge for large system hash"), the use of
kvmalloc() should be safe from any such games, since the returned
pointer might be a SLUB allocation, and as such no user should
reasonably be using it in any odd ways.

We also know that the allocations are fairly large, since it falls back
to the vmalloc case only when a kmalloc() fails. So using a hugepage
mapping seems both safe and relevant.

This patch does show a weakness in the opt-in strategy: since the opt-in
flag is in the 'vm_flags', not the usual gfp_t allocation flags, very
few of the usual interfaces actually expose it.

That's not much of an issue in this case that already used one of the
fairly specialized low-level vmalloc interfaces for the allocation, but
for a lot of other vmalloc() users that might want to opt in, it's going
to be very inconvenient.

We'll either have to fix any compatibility problems, or expose it in the
gfp flags (__GFP_COMP would have made a lot of sense) to allow normal
vmalloc() users to use hugepage mappings. That said, the cases that
really matter were probably already taken care of by the hash table
allocation.

Link: https://lore.kernel.org/all/20220415164413.2727220-1-song@kernel.org/
Link: https://lore.kernel.org/all/CAHk-=whao=iosX1s5Z4SF-ZGa-ebAukJoAdUJFk5SPwnofV+Vg@mail.gmail.com/
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3y ago

Shubhrajyoti Datta

e2932d1f

EDAC/synopsys: Read the error count from the correct register

3y ago

Zhipeng Xie

60490e79

perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled

3y ago

Athira Rajeev

0dcad700

powerpc/perf: Fix power9 event alternatives

3y ago

Linus Torvalds

3a69a442

Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

3y ago

Juergen Gross

262fc47a

xen/balloon: don't use PV mode extra memory for zone device allocations

3y ago

Song Liu

f2edd118

page_alloc: use vmalloc_huge for large system hash

3y ago

Linus Torvalds

ce522ba9

Linux 5.18-rc2 v5.18-rc2

3y ago

Alexey Kardashevskiy

26a62b75

KVM: PPC: Fix TCE handling for VFIO

The LoPAPR spec defines a guest visible IOMMU with a variable page size.
Currently QEMU advertises 4K, 64K, 2MB and 16MB pages, and a Linux VM
picks the biggest (16MB). In the case of a passed through PCI device,
there is a hardware IOMMU which does not support all of the above page
sizes: P8 cannot do 2MB and P9 cannot do 16MB. So for each emulated
16MB IOMMU page we may create several smaller mappings ("TCEs") in
the hardware IOMMU.

The code wrongly uses the emulated TCE index instead of the hardware TCE
index in error handling. The problem is easier to see on POWER8 with
multi-level TCE tables (when only the first level is preallocated),
as hash mode uses real mode TCE hypercall handlers.
The kernel starts using indirect tables when VMs get bigger than 128GB
(depending on the max page order).
The very first real mode hcall is going to fail with H_TOO_HARD, as
in real mode we cannot allocate memory for TCEs (we can in virtual
mode). On the way out, however, the code attempts to clear hardware TCEs
using emulated TCE indexes. This corrupts random kernel memory because
it_offset==1<<59 is subtracted from those indexes and the resulting index
is out of the TCE table bounds.

This fixes kvmppc_clear_tce() to use the correct TCE indexes.

While at it, this fixes TCE cache invalidation, which used emulated TCE
indexes instead of the hardware ones. This went unnoticed as 64bit DMA
is used these days: VMs map all RAM in one go and only then do DMA,
and this is when the TCE cache gets populated.

Potentially this could slow down mapping. However, 16MB emulated pages
are normally backed by 64K hardware pages, so it is one write to
the "TCE Kill" per 256 updates, which is not that bad considering the
size of the cache (1024 TCEs or so).
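
The 256 figure follows from the page sizes: 16MB is 1 << 24 and 64K is
1 << 16, so one emulated TCE is backed by 1 << (24 - 16) hardware TCEs.
A minimal sketch of that arithmetic is shown here; both helper names are
hypothetical, and mapping an emulated index to its first hardware index
by multiplication is an assumption of this sketch, not a quote of the
kernel code.

```c
#include <assert.h>

/*
 * Number of hardware TCEs that back one emulated TCE, given the
 * emulated and hardware page shifts (hypothetical helper).
 */
static unsigned long hw_tces_per_emulated(unsigned int emu_shift,
					  unsigned int hw_shift)
{
	assert(emu_shift >= hw_shift);
	return 1UL << (emu_shift - hw_shift);
}

/*
 * First hardware TCE index covered by an emulated TCE index, assuming
 * hardware entries for one emulated page are contiguous.
 */
static unsigned long first_hw_index(unsigned long emu_index,
				    unsigned int emu_shift,
				    unsigned int hw_shift)
{
	return emu_index * hw_tces_per_emulated(emu_shift, hw_shift);
}
```

For 16MB emulated pages on 64K hardware pages,
hw_tces_per_emulated(24, 16) is 256, and emulated index 3 starts at
hardware index 768. Using the emulated index where the hardware one is
expected therefore addresses the wrong entries.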

Fixes: ca1fc489cfa0 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages")

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220420050840.328223-1-aik@ozlabs.ru

3y ago

Linus Torvalds

fbb9c58e

Merge tag 'timers-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

3y ago

Pawan Gupta

400331f8

x86/tsx: Disable TSX development mode at boot

3y ago

Juergen Gross

de2ae403

xen: fix is_xen_pmu()

3y ago

Linus Torvalds

22da5264

Merge tag '5.18-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd

3y ago

Linus Torvalds

8b57b304

Merge tag 'tty-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

3y ago

Michael Ellerman

d2b9be1f

powerpc/time: Always set decrementer in timer_interrupt()

This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
Don't enable MSR[EE] in irq handlers unless perf is in use").

Prior to that commit, we always set the decrementer in
timer_interrupt(), to clear the timer interrupt. Otherwise we could end
up continuously taking timer interrupts.

When high res timers are enabled there is no problem seen with leaving
the decrementer untouched in timer_interrupt(), because it will be
programmed via hrtimer_interrupt() -> tick_program_event() ->
clockevents_program_event() -> decrementer_set_next_event().

However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
see a stall/lockup, because tick_nohz_handler() does not cause a
reprogram of the decrementer, leading to endless timer interrupts.
Example trace:

[ 1.898617][ T7] Freeing initrd memory: 2624K
[ 22.680919][ C1] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 22.682281][ C1] rcu: 0-....: (25 ticks this GP) idle=073/0/0x1 softirq=10/16 fqs=1050
[ 22.682851][ C1] (detected by 1, t=2102 jiffies, g=-1179, q=476)
[ 22.683649][ C1] Sending NMI from CPU 1 to CPUs 0:
[ 22.685252][ C0] NMI backtrace for cpu 0
[ 22.685649][ C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc2-00185-g0faf20a1ad16 #145
[ 22.686393][ C0] NIP: c000000000016d64 LR: c000000000f6cca4 CTR: c00000000019c6e0
[ 22.686774][ C0] REGS: c000000002833590 TRAP: 0500 Not tainted (5.16.0-rc2-00185-g0faf20a1ad16)
[ 22.687222][ C0] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24000222 XER: 00000000
[ 22.688297][ C0] CFAR: c00000000000c854 IRQMASK: 0
...
[ 22.692637][ C0] NIP [c000000000016d64] arch_local_irq_restore+0x174/0x250
[ 22.694443][ C0] LR [c000000000f6cca4] __do_softirq+0xe4/0x3dc
[ 22.695762][ C0] Call Trace:
[ 22.696050][ C0] [c000000002833830] [c000000000f6cc80] __do_softirq+0xc0/0x3dc (unreliable)
[ 22.697377][ C0] [c000000002833920] [c000000000151508] __irq_exit_rcu+0xd8/0x130
[ 22.698739][ C0] [c000000002833950] [c000000000151730] irq_exit+0x20/0x40
[ 22.699938][ C0] [c000000002833970] [c000000000027f40] timer_interrupt+0x270/0x460
[ 22.701119][ C0] [c0000000028339d0] [c0000000000099a8] decrementer_common_virt+0x208/0x210

Possibly this should be fixed in the lowres timing code, but that would
be a generic change and could take some time and may not backport
easily, so for now make the programming of the decrementer unconditional
again in timer_interrupt() to avoid the stall/lockup.

Fixes: 0faf20a1ad16 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use")
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20220420141657.771442-1-mpe@ellerman.id.au

3y ago

Linus Torvalds

0e59732e

Merge tag 'smp-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

3y ago

Jiapeng Chong

9c95bc25

tick/sched: Fix non-kernel-doc comment

3y ago

Pawan Gupta

258f3b8c

x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits

3y ago

Jakub Kądziołka

ff32baa1

xen: don't hang when resuming PCI device

3y ago

Linus Torvalds

f3935926

Merge tag 'arc-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

3y ago

Namjae Jeon

02655a70

ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION

3y ago

Linus Torvalds

95aa17c3

Merge tag 'staging-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

3y ago

Jiri Slaby

dbf3f093

tty: serial: mpc52xx_uart: make rx/tx hooks return unsigned, part II.

3y ago

Linus Torvalds

7e1777f5

Merge tag 'irq-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

3y ago

Steven Price

b7ba6d8d

cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state

3y ago

Paul Gortmaker

40e97e42

tick/nohz: Use WARN_ON_ONCE() to prevent console saturation

3y ago

jianchunfu

309b5172

arch:x86:xen: Remove unnecessary assignment in xen_apic_read()

3y ago

Linus Torvalds

6fc2586d

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

3y ago

Bang Li

c6ed4d84

ARC: remove redundant READ_ONCE() in cmpxchg loop

3y ago

Namjae Jeon

8510a043

ksmbd: increment reference count of parent fp

3y ago

Linus Torvalds

33563138

Merge tag 'driver-core-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

3y ago

Guenter Roeck

20314bac

staging: r8188eu: Fix PPPoE tag insertion on little endian systems

3y ago

Linus Torvalds

31231092

Linux 5.18-rc1 v5.18-rc1

3y ago

Linus Torvalds

9a921a6f

Merge tag 'for-v5.18-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply

3y ago

Rei Yamamoto

08d835df

genirq/affinity: Consider that CPUs on nodes can be unbalanced

3y ago

Nadav Amit

9e949a38

smp: Fix offline cpu check in flush_smp_call_function_queue()

3y ago

Anna-Maria Behnsen

c54bc0fc

timers: Fix warning condition in __run_timers()

3y ago

Juergen Gross

c94b731d

xen/grant-table: remove readonly parameter from functions

3y ago

Linus Torvalds

b51bd23c

Merge tag 'for-linus-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

3y ago

Tom Rix

faad6ceb

scsi: sr: Do not leak information in ioctl

3y ago

Sergey Matyukevich

ac411e41

ARC: atomic: cleanup atomic-llsc definitions

3y ago

Namjae Jeon

50f500b7

ksmbd: remove filename in ksmbd_file

3y ago

Linus Torvalds

f58d3410

Merge tag 'char-misc-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

3y ago

Greg Kroah-Hartman

cdb4f26a

kobject: kobj_type: remove default_attrs

3y ago

Linus Torvalds

09bb8856

Merge tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

3y ago

Linus Torvalds

bd0c7d75

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

3y ago

Yassine Oudjana

581045ed

power: supply: Reset err after not finding static battery

3y ago

Thomas Gleixner

63ef1a8a

Merge tag 'irqchip-fixes-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent

3y ago

Juergen Gross

b0f21263

xen/grant-table: remove gnttab_*transfer*() functions

3y ago

Linux 5.18-rc4 v5.18-rc4

af2d861d

Linus Torvalds

Merge tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

42740a2f

Linus Torvalds

Merge tag 'powerpc-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

5206548f

Linus Torvalds

sched/pelt: Fix attach_entity_load_avg() corner case

40f5aa4c

kuyo chang

Merge tag 'perf_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

f48ffef1

Linus Torvalds

powerpc/perf: Fix 32bit compile

bb82c574

Alexey Kardashevskiy

Linux 5.18-rc3 v5.18-rc3

b2d229d4

Linus Torvalds

Merge tag 'edac_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

b877ca4d

Linus Torvalds

perf/x86/cstate: Add SAPPHIRERAPIDS_X CPU support

528c9f1d

Zhang Rui

powerpc/perf: Fix power10 event alternatives

c6cc9a85

Athira Rajeev

Merge tag 'for-linus-5.18-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

a1901b46

Linus Torvalds

kvmalloc: use vmalloc_huge for vmalloc allocations

9becb688

Linus Torvalds

EDAC/synopsys: Read the error count from the correct register

e2932d1f

Shubhrajyoti Datta

perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled

60490e79

Zhipeng Xie

powerpc/perf: Fix power9 event alternatives

When scheduling a group of events, constraint checks are done to make
sure all events can go in a group. For example, one of the criteria is
that events in a group cannot use the same PMC. However, a platform
specific PMU may support alternative events for some of the event codes.
During perf_event_open(), if an event group doesn't meet the constraint
check criteria, a further lookup is done to find an alternative event.

By current design, the array of alternative events in PMU code is
expected to be sorted by column 0. This is because in
find_alternative() the return criterion is based on event code
comparison, i.e. "event < ev_alt[i][0]". This optimisation is there
since find_alternative() can be called multiple times. In power9 PMU
code, the alternative event array is not sorted properly and hence
finding alternative events is broken.

To work with existing logic, fix the alternative event array to be
sorted by column 0 for power9-pmu.c
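
The ordering requirement stated above can also be checked mechanically.
This is a minimal userspace sketch of such a check, not part of this
patch: alt_table_is_sorted(), MAX_ALT = 2 and both example tables are
hypothetical. The codes 0x3e054 and 0x400f0 are the ones quoted in this
message; 0x200f4 and 0x600f4 are made-up values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ALT 2	/* events per row; assumed for this sketch */

/*
 * Returns true when column 0 of the table is in non-decreasing order,
 * which is what a lookup with an early "event < column 0" exit needs.
 */
static bool alt_table_is_sorted(const unsigned int table[][MAX_ALT],
				int nrows)
{
	assert(table != NULL);

	for (int i = 1; i < nrows; i++)
		if (table[i][0] < table[i - 1][0])
			return false;

	return true;
}

/* Example rows; the 0x200f4 / 0x600f4 row is made up. */
static const unsigned int bad_order[][MAX_ALT] = {
	{ 0x3e054, 0x400f0 },
	{ 0x200f4, 0x600f4 },
};

static const unsigned int good_order[][MAX_ALT] = {
	{ 0x200f4, 0x600f4 },
	{ 0x3e054, 0x400f0 },
};
```

A check of this kind accepts good_order and rejects bad_order, so a
misplaced row would be caught before it silently disables a lookup.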

Results:

With alternative events, multiplexing can be avoided. For example, in
power9 PM_LD_MISS_L1 (0x3e054) has an alternative event,
PM_LD_MISS_L1_ALT (0x400f0). This is an identical event which can be
programmed in a different PMC.

Before:

# perf stat -e r3e054,r300fc

Performance counter stats for 'system wide':

1057860 r3e054 (50.21%)
379 r300fc (49.79%)

0.944329741 seconds time elapsed

Since both the events are using PMC3 in this case, they are
multiplexed here.

After:

# perf stat -e r3e054,r300fc

Performance counter stats for 'system wide':

1006948 r3e054
182 r300fc

Fixes: 91e0bd1e6251 ("powerpc/perf: Add PM_LD_MISS_L1 and PM_BR_2PATH to power9 event list")
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220419114828.89843-1-atrajeev@linux.vnet.ibm.com