commits

Occasionally objtool driven code patching (think .static_call_sites
.retpoline_sites etc..) goes sideways and it tries to patch an
instruction that doesn't match.

Much head-scatching and cursing later the problem is as outlined below
and affects every section that objtool generates for us, very much
including the ORC data. The below uses .static_call_sites because it's
convenient for demonstration purposes, but as mentioned the ORC
sections, .retpoline_sites and __mount_loc are all similarly affected.

Consider:

foo-weak.c:

extern void __SCT__foo(void);

__attribute__((weak)) void foo(void)
{
return __SCT__foo();
}

foo.c:

extern void __SCT__foo(void);
extern void my_foo(void);

void foo(void)
{
my_foo();
return __SCT__foo();
}

These generate the obvious code
(gcc -O2 -fcf-protection=none -fno-asynchronous-unwind-tables -c foo*.c):

foo-weak.o:
0000000000000000 <foo>:
0: e9 00 00 00 00 jmpq 5 <foo+0x5> 1: R_X86_64_PLT32 __SCT__foo-0x4

foo.o:
0000000000000000 <foo>:
0: 48 83 ec 08 sub $0x8,%rsp
4: e8 00 00 00 00 callq 9 <foo+0x9> 5: R_X86_64_PLT32 my_foo-0x4
9: 48 83 c4 08 add $0x8,%rsp
d: e9 00 00 00 00 jmpq 12 <foo+0x12> e: R_X86_64_PLT32 __SCT__foo-0x4

Now, when we link these two files together, you get something like
(ld -r -o foos.o foo-weak.o foo.o):

foos.o:
0000000000000000 <foo-0x10>:
0: e9 00 00 00 00 jmpq 5 <foo-0xb> 1: R_X86_64_PLT32 __SCT__foo-0x4
5: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
f: 90 nop

0000000000000010 <foo>:
10: 48 83 ec 08 sub $0x8,%rsp
14: e8 00 00 00 00 callq 19 <foo+0x9> 15: R_X86_64_PLT32 my_foo-0x4
19: 48 83 c4 08 add $0x8,%rsp
1d: e9 00 00 00 00 jmpq 22 <foo+0x12> 1e: R_X86_64_PLT32 __SCT__foo-0x4

Noting that ld preserves the weak function text, but strips the symbol
off of it (hence objdump doing that funny negative offset thing). This
does lead to 'interesting' unused code issues with objtool when ran on
linked objects, but that seems to be working (fingers crossed).

So far so good.. Now lets consider the objtool static_call output
section (readelf output, old binutils):

foo-weak.o:

Relocation section '.rela.static_call_sites' at offset 0x2c8 contains 1 entry:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000200000002 R_X86_64_PC32 0000000000000000 .text + 0
0000000000000004 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1

foo.o:

Relocation section '.rela.static_call_sites' at offset 0x310 contains 2 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000200000002 R_X86_64_PC32 0000000000000000 .text + d
0000000000000004 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1

foos.o:

Relocation section '.rela.static_call_sites' at offset 0x430 contains 4 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000100000002 R_X86_64_PC32 0000000000000000 .text + 0
0000000000000004 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1
0000000000000008 0000000100000002 R_X86_64_PC32 0000000000000000 .text + 1d
000000000000000c 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1

So we have two patch sites, one in the dead code of the weak foo and one
in the real foo. All is well.

*HOWEVER*, when the toolchain strips unused section symbols it
generates things like this (using new enough binutils):

foo-weak.o:

Relocation section '.rela.static_call_sites' at offset 0x2c8 contains 1 entry:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000200000002 R_X86_64_PC32 0000000000000000 foo + 0
0000000000000004 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1

foo.o:

Relocation section '.rela.static_call_sites' at offset 0x310 contains 2 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000200000002 R_X86_64_PC32 0000000000000000 foo + d
0000000000000004 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1

foos.o:

Relocation section '.rela.static_call_sites' at offset 0x430 contains 4 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000000 0000000100000002 R_X86_64_PC32 0000000000000000 foo + 0
0000000000000004 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1
0000000000000008 0000000100000002 R_X86_64_PC32 0000000000000000 foo + d
000000000000000c 0000000d00000002 R_X86_64_PC32 0000000000000000 __SCT__foo + 1

And now we can see how that foos.o .static_call_sites goes side-ways, we
now have _two_ patch sites in foo. One for the weak symbol at foo+0
(which is no longer a static_call site!) and one at foo+d which is in
fact the right location.

This seems to happen when objtool cannot find a section symbol, in which
case it falls back to any other symbol to key off of, however in this
case that goes terribly wrong!

As such, teach objtool to create a section symbol when there isn't
one.

Fixes: 44f6a7c0755d ("objtool: Fix seg fault with Clang non-section symbols")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20220419203807.655552918@infradead.org

3y ago

Shida Zhang

1fa568e2

bug: Have __warn() prototype defined unconditionally

3y ago

Paolo Bonzini

73331c5d

Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD

3y ago

Linus Torvalds

57ae8a49

Merge tag 'driver-core-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

3y ago

Shin'ichiro Kawasaki

c7d2f89f

bus: fsl-mc-msi: Fix MSI descriptor mutex lock for msi_first_desc()

3y ago

Peter Zijlstra

c087c6e7

objtool: Fix type of reloc::addend

3y ago

Nur Hussein

4cdfc11b

x86/Kconfig: fix the spelling of 'becoming' in X86_KERNEL_IBT config

3y ago

Paolo Bonzini

484c22df

Merge tag 'kvmarm-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

3y ago

Mingwei Zhang

44187235

KVM: x86/mmu: fix potential races when walking host page table

3y ago

Linus Torvalds

e2e5ebec

Merge tag 'char-misc-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

3y ago

Minchan Kim

ad8d8693

kernfs: fix NULL dereferencing in kernfs_remove

kernfs_remove supported NULL kernfs_node param to bail out but revent
per-fs lock change introduced regression that dereferencing the
param without NULL check so kernel goes crash.

This patch checks the NULL kernfs_node in kernfs_remove and if so,
just return.

Quote from bug report by Jirka

```
The bug is triggered by running NAS Parallel benchmark suite on
SuperMicro servers with 2x Xeon(R) Gold 6126 CPU. Here is the error
log:

[ 247.035564] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 247.036009] #PF: supervisor read access in kernel mode
[ 247.036009] #PF: error_code(0x0000) - not-present page
[ 247.036009] PGD 0 P4D 0
[ 247.036009] Oops: 0000 [#1] PREEMPT SMP PTI
[ 247.058060] CPU: 1 PID: 6546 Comm: umount Not tainted
5.16.0393c3714081a53795bbff0e985d24146def6f57f+ #16
[ 247.058060] Hardware name: Supermicro Super Server/X11DDW-L, BIOS
2.0b 03/07/2018
[ 247.058060] RIP: 0010:kernfs_remove+0x8/0x50
[ 247.058060] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 49 c7 c4 f4
ff ff ff eb b2 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
41 54 55 <48> 8b 47 08 48 89 fd 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83
c4 60
[ 247.058060] RSP: 0018:ffffbbfa48a27e48 EFLAGS: 00010246
[ 247.058060] RAX: 0000000000000001 RBX: ffffffff89e31f98 RCX: 0000000080200018
[ 247.058060] RDX: 0000000080200019 RSI: fffff6760786c900 RDI: 0000000000000000
[ 247.058060] RBP: ffffffff89e31f98 R08: ffff926b61b24d00 R09: 0000000080200018
[ 247.122048] R10: ffff926b61b24d00 R11: ffff926a8040c000 R12: ffff927bd09a2000
[ 247.122048] R13: ffffffff89e31fa0 R14: dead000000000122 R15: dead000000000100
[ 247.122048] FS: 00007f01be0a8c40(0000) GS:ffff926fa8e40000(0000)
knlGS:0000000000000000
[ 247.122048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 247.122048] CR2: 0000000000000008 CR3: 00000001145c6003 CR4: 00000000007706e0
[ 247.122048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 247.122048] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 247.122048] PKRU: 55555554
[ 247.122048] Call Trace:
[ 247.122048] <TASK>
[ 247.122048] rdt_kill_sb+0x29d/0x350
[ 247.122048] deactivate_locked_super+0x36/0xa0
[ 247.122048] cleanup_mnt+0x131/0x190
[ 247.122048] task_work_run+0x5c/0x90
[ 247.122048] exit_to_user_mode_prepare+0x229/0x230
[ 247.122048] syscall_exit_to_user_mode+0x18/0x40
[ 247.122048] do_syscall_64+0x48/0x90
[ 247.122048] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 247.122048] RIP: 0033:0x7f01be2d735b
```

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215696
Link: https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/
Fixes: 393c3714081a (kernfs: switch global kernfs_rwsem lock to per-fs lock)
Cc: stable@vger.kernel.org
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20220427172152.3505364-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3y ago

Linus Torvalds

af2d861d

Linux 5.18-rc4 v5.18-rc4

3y ago

Josh Poimboeuf

08feafe8

objtool: Fix function fallthrough detection for vmlinux

3y ago

Josh Poimboeuf

1d08b92f

objtool: Use offstr() to print address of missing ENDBR

3y ago

Paolo Bonzini

e852be8b

kvm: selftests: introduce and use more page size-related constants

3y ago

Marc Zyngier

85ea6b1e

KVM: arm64: Inject exception on out-of-IPA-range translation fault

3y ago

Paolo Bonzini

d495f942

KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT

3y ago

Linus Torvalds

a6b5c5dc

Merge tag 'tty-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

3y ago

Greg Kroah-Hartman

fda05730

Merge tag 'iio-fixes-for-5.18a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio

Pull set of IIO fixes for 5.18 from Jonathan Cameron:
"1st set of IIO fixes for the 5.18 cycle

ad3552r:
- Fix a bug with error codes being stored in unsigned local variable.
- Fix IS_ERR when value is either NULL or not rather than ERR_PTR
ad5446
- Fix shifting of read_raw value.
ad5592r
- Fix missing return value being set for a fwnode property read.
ad7280a
- Wrong variable being used to set thresholds.
admv8818
- Kconfig dependency fix.
ak8975
- Missing regulator disable in error path.
bmi160
- Disable regulators in an error path.
dac5571
- Fix chip id detection for devices with OF bindings.
inv_icm42600
- Handle a case of a missing I2C NACK during initially configuration.
ltc2688
- Fix voltage scaling where integer part was written twice and
decimal part not at all.
scd4x
- Handle error before using value.
sx9310
- Device property parsing against indio_dev->dev.of_node which
hasn't been set yet.
sx9324
- Fix hardware gain related maths.
- Wrong defaults for precharge internal resistance register."

* tag 'iio-fixes-for-5.18a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio:
iio: imu: inv_icm42600: Fix I2C init possible nack
iio: dac: ltc2688: fix voltage scale read
iio:dac:ad3552r: Fix an IS_ERR() vs NULL check
iio: sx9324: Fix default precharge internal resistance register
iio: dac: ad5446: Fix read_raw not returning set value
iio: magnetometer: ak8975: Fix the error handling in ak8975_power_on()
iio:proximity:sx9324: Fix hardware gain read/write
iio:proximity:sx_common: Fix device property parsing on DT systems
iio: adc: ad7280a: Fix wrong variable used when setting thresholds.
iio:filter:admv8818: select REGMAP_SPI for ADMV8818
iio: dac: ad5592r: Fix the missing return value.
iio: dac: dac5571: Fix chip id detection for OF devices
iio:imu:bmi160: disable regulator in error path
iio: scd4x: check return of scd4x_write_and_fetch
iio: dac: ad3552r: fix signedness bug in ad3552r_reset()

3y ago

Greg Kroah-Hartman

c95ce3a2

topology: Fix up build warning in topology_is_visible()

3y ago

Linus Torvalds

42740a2f

Merge tag 'sched_urgent_for_v5.18_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

3y ago

Josh Poimboeuf

34c861e8

objtool: Fix sibling call detection in alternatives

3y ago

Josh Poimboeuf

4baae989

objtool: Print data address for "!ENDBR" data warnings

3y ago

Paolo Bonzini

f18b4aeb

kvm: selftests: do not use bitfields larger than 32-bits for PTEs

3y ago

Alexandru Elisei

8f6379e2

KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set

kvm->arch.arm_pmu is set when userspace attempts to set the first PMU
attribute. As certain attributes are mandatory, arm_pmu ends up always
being set to a valid arm_pmu, otherwise KVM will refuse to run the VCPU.
However, this only happens if the VCPU has the PMU feature. If the VCPU
doesn't have the feature bit set, kvm->arch.arm_pmu will be left
uninitialized and equal to NULL.

KVM doesn't do ID register emulation for 32-bit guests and accesses to the
PMU registers aren't gated by the pmu_visibility() function. This is done
to prevent injecting unexpected undefined exceptions in guests which have
detected the presence of a hardware PMU. But even though the VCPU feature
is missing, KVM still attempts to emulate certain aspects of the PMU when
PMU registers are accessed. This leads to a NULL pointer dereference like
this one, which happens on an odroid-c4 board when running the
kvm-unit-tests pmu-cycle-counter test with kvmtool and without the PMU
feature being set:

[ 454.402699] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000150
[ 454.405865] Mem abort info:
[ 454.408596] ESR = 0x96000004
[ 454.411638] EC = 0x25: DABT (current EL), IL = 32 bits
[ 454.416901] SET = 0, FnV = 0
[ 454.419909] EA = 0, S1PTW = 0
[ 454.423010] FSC = 0x04: level 0 translation fault
[ 454.427841] Data abort info:
[ 454.430687] ISV = 0, ISS = 0x00000004
[ 454.434484] CM = 0, WnR = 0
[ 454.437404] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000c924000
[ 454.443800] [0000000000000150] pgd=0000000000000000, p4d=0000000000000000
[ 454.450528] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 454.456036] Modules linked in:
[ 454.459053] CPU: 1 PID: 267 Comm: kvm-vcpu-0 Not tainted 5.18.0-rc4 #113
[ 454.465697] Hardware name: Hardkernel ODROID-C4 (DT)
[ 454.470612] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 454.477512] pc : kvm_pmu_event_mask.isra.0+0x14/0x74
[ 454.482427] lr : kvm_pmu_set_counter_event_type+0x2c/0x80
[ 454.487775] sp : ffff80000a9839c0
[ 454.491050] x29: ffff80000a9839c0 x28: ffff000000a83a00 x27: 0000000000000000
[ 454.498127] x26: 0000000000000000 x25: 0000000000000000 x24: ffff00000a510000
[ 454.505198] x23: ffff000000a83a00 x22: ffff000003b01000 x21: 0000000000000000
[ 454.512271] x20: 000000000000001f x19: 00000000000003ff x18: 0000000000000000
[ 454.519343] x17: 000000008003fe98 x16: 0000000000000000 x15: 0000000000000000
[ 454.526416] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 454.533489] x11: 000000008003fdbc x10: 0000000000009d20 x9 : 000000000000001b
[ 454.540561] x8 : 0000000000000000 x7 : 0000000000000d00 x6 : 0000000000009d00
[ 454.547633] x5 : 0000000000000037 x4 : 0000000000009d00 x3 : 0d09000000000000
[ 454.554705] x2 : 000000000000001f x1 : 0000000000000000 x0 : 0000000000000000
[ 454.561779] Call trace:
[ 454.564191] kvm_pmu_event_mask.isra.0+0x14/0x74
[ 454.568764] kvm_pmu_set_counter_event_type+0x2c/0x80
[ 454.573766] access_pmu_evtyper+0x128/0x170
[ 454.577905] perform_access+0x34/0x80
[ 454.581527] kvm_handle_cp_32+0x13c/0x160
[ 454.585495] kvm_handle_cp15_32+0x1c/0x30
[ 454.589462] handle_exit+0x70/0x180
[ 454.592912] kvm_arch_vcpu_ioctl_run+0x1c4/0x5e0
[ 454.597485] kvm_vcpu_ioctl+0x23c/0x940
[ 454.601280] __arm64_sys_ioctl+0xa8/0xf0
[ 454.605160] invoke_syscall+0x48/0x114
[ 454.608869] el0_svc_common.constprop.0+0xd4/0xfc
[ 454.613527] do_el0_svc+0x28/0x90
[ 454.616803] el0_svc+0x34/0xb0
[ 454.619822] el0t_64_sync_handler+0xa4/0x130
[ 454.624049] el0t_64_sync+0x18c/0x190
[ 454.627675] Code: a9be7bfd 910003fd f9000bf3 52807ff3 (b9415001)
[ 454.633714] ---[ end trace 0000000000000000 ]---

In this particular case, Linux hasn't detected the presence of a hardware
PMU because the PMU node is missing from the DTB, so userspace would have
been unable to set the VCPU PMU feature even if it attempted it. What
happens is that the 32-bit guest reads ID_DFR0, which advertises the
presence of the PMU, and when it tries to program a counter, it triggers
the NULL pointer dereference because kvm->arch.arm_pmu is NULL.

kvm-arch.arm_pmu was introduced by commit 46b187821472 ("KVM: arm64:
Keep a per-VM pointer to the default PMU"). Until that commit, this
error would be triggered instead:

[ 73.388140] ------------[ cut here ]------------
[ 73.388189] Unknown PMU version 0
[ 73.390420] WARNING: CPU: 1 PID: 264 at arch/arm64/kvm/pmu-emul.c:36 kvm_pmu_event_mask.isra.0+0x6c/0x74
[ 73.399821] Modules linked in:
[ 73.402835] CPU: 1 PID: 264 Comm: kvm-vcpu-0 Not tainted 5.17.0 #114
[ 73.409132] Hardware name: Hardkernel ODROID-C4 (DT)
[ 73.414048] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 73.420948] pc : kvm_pmu_event_mask.isra.0+0x6c/0x74
[ 73.425863] lr : kvm_pmu_event_mask.isra.0+0x6c/0x74
[ 73.430779] sp : ffff80000a8db9b0
[ 73.434055] x29: ffff80000a8db9b0 x28: ffff000000dbaac0 x27: 0000000000000000
[ 73.441131] x26: ffff000000dbaac0 x25: 00000000c600000d x24: 0000000000180720
[ 73.448203] x23: ffff800009ffbe10 x22: ffff00000b612000 x21: 0000000000000000
[ 73.455276] x20: 000000000000001f x19: 0000000000000000 x18: ffffffffffffffff
[ 73.462348] x17: 000000008003fe98 x16: 0000000000000000 x15: 0720072007200720
[ 73.469420] x14: 0720072007200720 x13: ffff800009d32488 x12: 00000000000004e6
[ 73.476493] x11: 00000000000001a2 x10: ffff800009d32488 x9 : ffff800009d32488
[ 73.483565] x8 : 00000000ffffefff x7 : ffff800009d8a488 x6 : ffff800009d8a488
[ 73.490638] x5 : ffff0000f461a9d8 x4 : 0000000000000000 x3 : 0000000000000001
[ 73.497710] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000000dbaac0
[ 73.504784] Call trace:
[ 73.507195] kvm_pmu_event_mask.isra.0+0x6c/0x74
[ 73.511768] kvm_pmu_set_counter_event_type+0x2c/0x80
[ 73.516770] access_pmu_evtyper+0x128/0x16c
[ 73.520910] perform_access+0x34/0x80
[ 73.524532] kvm_handle_cp_32+0x13c/0x160
[ 73.528500] kvm_handle_cp15_32+0x1c/0x30
[ 73.532467] handle_exit+0x70/0x180
[ 73.535917] kvm_arch_vcpu_ioctl_run+0x20c/0x6e0
[ 73.540489] kvm_vcpu_ioctl+0x2b8/0x9e0
[ 73.544283] __arm64_sys_ioctl+0xa8/0xf0
[ 73.548165] invoke_syscall+0x48/0x114
[ 73.551874] el0_svc_common.constprop.0+0xd4/0xfc
[ 73.556531] do_el0_svc+0x28/0x90
[ 73.559808] el0_svc+0x28/0x80
[ 73.562826] el0t_64_sync_handler+0xa4/0x130
[ 73.567054] el0t_64_sync+0x1a0/0x1a4
[ 73.570676] ---[ end trace 0000000000000000 ]---
[ 73.575382] kvm: pmu event creation failed -2

The root cause remains the same: kvm->arch.pmuver was never set to
something sensible because the VCPU feature itself was never set.

The odroid-c4 is somewhat of a special case, because Linux doesn't probe
the PMU. But the above errors can easily be reproduced on any hardware,
with or without a PMU driver, as long as userspace doesn't set the PMU
feature.

Work around the fact that KVM advertises a PMU even when the VCPU feature
is not set by gating all PMU emulation on the feature. The guest can still
access the registers without KVM injecting an undefined exception.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220425145530.723858-1-alexandru.elisei@arm.com

3y ago

Sean Christopherson

86931ff7

KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR

Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):

WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>

On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.

Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.

A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.

Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).

The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.

Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3y ago

Linus Torvalds

da1b4042

Merge tag 'usb-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

3y ago

Daniel Starke

19317433

tty: n_gsm: fix sometimes uninitialized warning in gsm_dlci_modem_output()

3y ago

Christophe Leroy

5b47b751

eeprom: at25: Use DMA safe buffers

Reading EEPROM fails with following warning:

[ 16.357496] ------------[ cut here ]------------
[ 16.357529] fsl_spi b01004c0.spi: rejecting DMA map of vmalloc memory
[ 16.357698] WARNING: CPU: 0 PID: 371 at include/linux/dma-mapping.h:326 fsl_spi_cpm_bufs+0x2a0/0x2d8
[ 16.357775] CPU: 0 PID: 371 Comm: od Not tainted 5.16.11-s3k-dev-01743-g19beecbfe9d6-dirty #109
[ 16.357806] NIP: c03fbc9c LR: c03fbc9c CTR: 00000000
[ 16.357825] REGS: e68d9b20 TRAP: 0700 Not tainted (5.16.11-s3k-dev-01743-g19beecbfe9d6-dirty)
[ 16.357849] MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24002282 XER: 00000000
[ 16.357931]
[ 16.357931] GPR00: c03fbc9c e68d9be0 c26d06a0 00000039 00000001 c0d36364 c0e96428 00000027
[ 16.357931] GPR08: 00000001 00000000 00000023 3fffc000 24002282 100d3dd6 100a2ffc 00000000
[ 16.357931] GPR16: 100cd280 100b0000 00000000 aff54f7e 100d0000 100d0000 00000001 100cf328
[ 16.357931] GPR24: 100cf328 00000000 00000003 e68d9e30 c156b410 e67ab4c0 e68d9d38 c24ab278
[ 16.358253] NIP [c03fbc9c] fsl_spi_cpm_bufs+0x2a0/0x2d8
[ 16.358292] LR [c03fbc9c] fsl_spi_cpm_bufs+0x2a0/0x2d8
[ 16.358325] Call Trace:
[ 16.358336] [e68d9be0] [c03fbc9c] fsl_spi_cpm_bufs+0x2a0/0x2d8 (unreliable)
[ 16.358388] [e68d9c00] [c03fcb44] fsl_spi_bufs.isra.0+0x94/0x1a0
[ 16.358436] [e68d9c20] [c03fd970] fsl_spi_do_one_msg+0x254/0x3dc
[ 16.358483] [e68d9cb0] [c03f7e50] __spi_pump_messages+0x274/0x8a4
[ 16.358529] [e68d9ce0] [c03f9d30] __spi_sync+0x344/0x378
[ 16.358573] [e68d9d20] [c03fb52c] spi_sync+0x34/0x60
[ 16.358616] [e68d9d30] [c03b4dec] at25_ee_read+0x138/0x1a8
[ 16.358667] [e68d9e50] [c04a8fb8] bin_attr_nvmem_read+0x98/0x110
[ 16.358725] [e68d9e60] [c0204b14] kernfs_fop_read_iter+0xc0/0x1fc
[ 16.358774] [e68d9e80] [c0168660] vfs_read+0x284/0x410
[ 16.358821] [e68d9f00] [c016925c] ksys_read+0x6c/0x11c
[ 16.358863] [e68d9f30] [c00160e0] ret_from_syscall+0x0/0x28
...
[ 16.359608] ---[ end trace a4ce3e34afef0cb5 ]---
[ 16.359638] fsl_spi b01004c0.spi: unable to map tx dma

This is due to the AT25 driver using buffers on stack, which is not
possible with CONFIG_VMAP_STACK.

As mentionned in kernel Documentation (Documentation/spi/spi-summary.rst):

- Follow standard kernel rules, and provide DMA-safe buffers in
your messages. That way controller drivers using DMA aren't forced
to make extra copies unless the hardware requires it (e.g. working
around hardware errata that force the use of bounce buffering).

Modify the driver to use a buffer located in the at25 device structure
which is allocated via kmalloc during probe.

Protect writes in this new buffer with the driver's mutex.

Fixes: b587b13a4f67 ("[PATCH] SPI eeprom driver")
Cc: stable <stable@kernel.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/230a9486fc68ea0182df46255e42a51099403642.1648032613.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

3y ago

Fawzi Khaber

b5d6ba09

iio: imu: inv_icm42600: Fix I2C init possible nack

3y ago

Wang Qing

1dc9f1a6

arch_topology: Do not set llc_sibling if llc_id is invalid

3y ago

Linus Torvalds

5206548f

Merge tag 'powerpc-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

3y ago

kuyo chang

40f5aa4c

sched/pelt: Fix attach_entity_load_avg() corner case

3y ago

Josh Poimboeuf

26ff6041

objtool: Don't set 'jump_dest' for sibling calls

3y ago

Josh Poimboeuf

1ab80a0d

x86/xen: Add ANNOTATE_NOENDBR to startup_xen()

3y ago

Mingwei Zhang

683412cc

KVM: SEV: add cache flush to solve SEV cache incoherency issues

3y ago

Will Deacon

2a50fc5f

KVM: arm64: Handle host stage-2 faults from 32-bit EL0

3y ago

Vitaly Kuznetsov

42dcbe7d

KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU

3y ago

Linus Torvalds

e9512f36

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

3y ago

Sean Anderson

03e607cb

usb: phy: generic: Get the vbus supply

3y ago

Maciej W. Rozycki

637674fa

serial: 8250: Correct the clock for EndRun PTP/1588 PCIe device

3y ago

Alessandro Astone

ef38de92

binder: Gracefully handle BINDER_TYPE_FDA objects with num_fds=0

3y ago

Nuno Sá

e7e51eb0

iio: dac: ltc2688: fix voltage scale read

3y ago

Darren Hart

db1e5948

topology: make core_mask include at least cluster_siblings

Ampere Altra defines CPU clusters in the ACPI PPTT. They share a Snoop
Control Unit, but have no shared CPU-side last level cache.

cpu_coregroup_mask() will return a cpumask with weight 1, while
cpu_clustergroup_mask() will return a cpumask with weight 2.

As a result, build_sched_domain() will BUG() once per CPU with:

BUG: arch topology borken
the CLS domain not a subset of the MC domain

The MC level cpumask is then extended to that of the CLS child, and is
later removed entirely as redundant. This sched domain topology is an
improvement over previous topologies, or those built without
SCHED_CLUSTER, particularly for certain latency sensitive workloads.
With the current scheduler model and heuristics, this is a desirable
default topology for Ampere Altra and Altra Max system.

Rather than create a custom sched domains topology structure and
introduce new logic in arch/arm64 to detect these systems, update the
core_mask so coregroup is never a subset of clustergroup, extending it
to cluster_siblings if necessary. Only do this if CONFIG_SCHED_CLUSTER
is enabled to avoid also changing the topology (MC) when
CONFIG_SCHED_CLUSTER is disabled.

This has the added benefit over a custom topology of working for both
symmetric and asymmetric topologies. It does not address systems where
the CLUSTER topology is above a populated MC topology, but these are not
considered today and can be addressed separately if and when they
appear.

The final sched domain topology for a 2 socket Ampere Altra system is
unchanged with or without CONFIG_SCHED_CLUSTER, and the BUG is avoided:

For CPU0:

CONFIG_SCHED_CLUSTER=y
CLS [0-1]
DIE [0-79]
NUMA [0-159]

CONFIG_SCHED_CLUSTER is not set
DIE [0-79]
NUMA [0-159]

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: D. Scott Phillips <scott@os.amperecomputing.com>
Cc: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: <stable@vger.kernel.org> # 5.16.x
Suggested-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
Link: https://lore.kernel.org/r/c8fe9fce7c86ed56b4c455b8c902982dc2303868.1649696956.git.darren@os.amperecomputing.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>