commits

To hot unplug a CPU, the idle task on that CPU calls a few layers of C
code before finally leaving the kernel. When KASAN is in use, poisoned
shadow is left around for each of the active stack frames, and when
shadow call stacks are in use. When shadow call stacks (SCS) are in use
the task's saved SCS SP is left pointing at an arbitrary point within
the task's shadow call stack.

When a CPU is offlined than onlined back into the kernel, this stale
state can adversely affect execution. Stale KASAN shadow can alias new
stackframes and result in bogus KASAN warnings. A stale SCS SP is
effectively a memory leak, and prevents a portion of the shadow call
stack being used. Across a number of hotplug cycles the idle task's
entire shadow call stack can become unusable.

We previously fixed the KASAN issue in commit:

e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug")

... by removing any stale KASAN stack poison immediately prior to
onlining a CPU.

Subsequently in commit:

f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")

... the refactoring left the KASAN and SCS cleanup in one-time idle
thread initialization code rather than something invoked prior to each
CPU being onlined, breaking both as above.

We fixed SCS (but not KASAN) in commit:

63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit")

... but as this runs in the context of the idle task being offlined it's
potentially fragile.

To fix these consistently and more robustly, reset the SCS SP and KASAN
shadow of a CPU's idle task immediately before we online that CPU in
bringup_cpu(). This ensures the idle task always has a consistent state
when it is running, and removes the need to so so when exiting an idle
task.

Whenever any thread is created, dup_task_struct() will give the task a
stack which is free of KASAN shadow, and initialize the task's SCS SP,
so there's no need to specially initialize either for idle thread within
init_idle(), as this was only necessary to handle hotplug cycles.

I've tested this on arm64 with:

* gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
* clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK

... offlining and onlining CPUS with:

| while true; do
| for C in /sys/devices/system/cpu/cpu*/online; do
| echo 0 > $C;
| echo 1 > $C;
| done
| done

Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/

4y ago

Linus Torvalds

13605725

Linux 5.16-rc2 v5.16-rc2

4y ago

Ye Guojin

0466a39b

virtio-blk: modify the value type of num in virtio_queue_rq()

4y ago

Linus Torvalds

d039f388

Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Marco Elver

73743c3b

perf: Ignore sigtrap for tracepoints destined for other tasks

syzbot reported that the warning in perf_sigtrap() fires, saying that
the event's task does not match current:

| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| Modules linked in:
| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| ...
| Call Trace:
| <IRQ>
| irq_work_single+0x106/0x220 kernel/irq_work.c:211
| irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
| irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
| __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
| sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
| </IRQ>
| <TASK>
| asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
| ...
| coredump_task_exit kernel/exit.c:371 [inline]
| do_exit+0x1865/0x25c0 kernel/exit.c:771
| do_group_exit+0xe7/0x290 kernel/exit.c:929
| get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
| arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
| handle_signal_work kernel/entry/common.c:148 [inline]
| exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
| exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
| __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
| syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
| do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
| entry_SYSCALL_64_after_hwframe+0x44/0xae

On x86 this shouldn't happen, which has arch_irq_work_raise().

The test program sets up a perf event with sigtrap set to fire on the
'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().

This happened because the 'sched_wakeup' tracepoint also takes a task
argument passed on to perf_tp_event(), which is used to deliver the
event to that other task.

Since we cannot deliver synchronous signals to other tasks, skip an event if
perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
set, which will avoid ever entering perf_sigtrap() for such events.

Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com

4y ago

Linus Torvalds

40c93d7f

Merge tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Stefano Garzarella

11708ff9

vhost/vsock: cleanup removing `len` variable

4y ago

Linus Torvalds

f8132d62

Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

4y ago

Muchun Song

14c24048

locking/rwsem: Optimize down_read_trylock() under highly contended case

We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.

```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>

#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

volatile int mutex;

void trigger(int cpu, char* ptr, std::size_t sz)
{
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

while (mutex);

for (std::size_t i = 0; i < sz; i += 4096) {
*ptr = '\0';
ptr += 4096;
}
}

int main(int argc, char* argv[])
{
std::size_t sz = 100;

if (argc > 1)
sz = atoi(argv[1]);

auto nproc = std::thread::hardware_concurrency();
std::vector<std::thread> thr;
sz <<= 30;
auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_PRIVATE, -1, 0);
assert(ptr != MAP_FAILED);
char* cptr = static_cast<char*>(ptr);
auto run = sz / nproc;
run = (run >> 12) << 12;

mutex = 1;

for (auto i = 0U; i < nproc; ++i) {
thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
cptr += run;
}

rusage usage_start;
getrusage(RUSAGE_SELF, &usage_start);
auto start = std::chrono::system_clock::now();

mutex = 0;

for (auto& t : thr)
t.join();

rusage usage_end;
getrusage(RUSAGE_SELF, &usage_end);
auto end = std::chrono::system_clock::now();
timeval utime;
timeval stime;
timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
printf("real: %lu\n",
std::chrono::duration_cast<std::chrono::milliseconds>(end -
start).count());

return 0;
}
```

The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.

25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
8.61% [kernel] [k] clear_page_erms
3.89% [kernel] [k] __do_page_fault

The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.

91.89 │ lock cmpxchg %rdx,(%rdi)

Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3aed ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.

By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:

Before Patch After Patch
# of Threads real real reduced by
------------ ------ ------ ----------
1 65,373 65,206 ~0.0%
4 15,467 15,378 ~0.5%
40 6,214 5,528 ~11.0%

For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com

4y ago

Linus Torvalds

af16bdea

Merge tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Reinette Chatre

ac5d272a

x86/sgx: Fix free page accounting

The SGX driver maintains a single global free page counter,
sgx_nr_free_pages, that reflects the number of free pages available
across all NUMA nodes. Correspondingly, a list of free pages is
associated with each NUMA node and sgx_nr_free_pages is updated
every time a page is added or removed from any of the free page
lists. The main usage of sgx_nr_free_pages is by the reclaimer
that runs when it (sgx_nr_free_pages) goes below a watermark
to ensure that there are always some free pages available to, for
example, support efficient page faults.

With sgx_nr_free_pages accessed and modified from a few places
it is essential to ensure that these accesses are done safely but
this is not the case. sgx_nr_free_pages is read without any
protection and updated with inconsistent protection by any one
of the spin locks associated with the individual NUMA nodes.
For example:

CPU_A CPU_B
----- -----
spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
... ...
sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--;

spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);

Since sgx_nr_free_pages may be protected by different spin locks
while being modified from different CPUs, the following scenario
is possible:

CPU_A CPU_B
----- -----
{sgx_nr_free_pages = 100}
spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
sgx_nr_free_pages--; sgx_nr_free_pages--;
/* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */
/* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */
/* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */
spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);

In the above scenario, sgx_nr_free_pages is decremented from two CPUs
but instead of sgx_nr_free_pages ending with a value that is two less
than it started with, it was only decremented by one while the number
of free pages were actually reduced by two. The consequence of
sgx_nr_free_pages not being protected is that its value may not
accurately reflect the actual number of free pages on the system,
impacting the availability of free pages in support of many flows.

The problematic scenario is when the reclaimer does not run because it
believes there to be sufficient free pages while any attempt to allocate
a page fails because there are no free pages available. In the SGX driver
the reclaimer's watermark is only 32 pages so after encountering the
above example scenario 32 times a user space hang is possible when there
are no more free pages because of repeated page faults caused by no
free pages made available.

The following flow was encountered:
asm_exc_page_fault
...
sgx_vma_fault()
sgx_encl_load_page()
sgx_encl_eldu() // Encrypted page needs to be loaded from backing
// storage into newly allocated SGX memory page
sgx_alloc_epc_page() // Allocate a page of SGX memory
__sgx_alloc_epc_page() // Fails, no free SGX memory
...
if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
wake_up(&ksgxd_waitq);
return -EBUSY; // Return -EBUSY giving reclaimer time to run
return -EBUSY;
return -EBUSY;
return VM_FAULT_NOPAGE;

The reclaimer is triggered in above flow with the following code:

static bool sgx_should_reclaim(unsigned long watermark)
{
return sgx_nr_free_pages < watermark &&
!list_empty(&sgx_active_page_list);
}

In the problematic scenario there were no free pages available yet the
value of sgx_nr_free_pages was above the watermark. The allocation of
SGX memory thus always failed because of a lack of free pages while no
free pages were made available because the reclaimer is never started
because of sgx_nr_free_pages' incorrect value. The consequence was that
user space kept encountering VM_FAULT_NOPAGE that caused the same
address to be accessed repeatedly with the same result.

Change the global free page counter to an atomic type that
ensures simultaneous updates are done safely. While doing so, move
the updating of the variable outside of the spin lock critical
section to which it does not belong.

Cc: stable@vger.kernel.org
Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com

4y ago

Stefano Garzarella

49d8c5ff

vhost/vsock: fix incorrect used length reported to the guest

4y ago

Linus Torvalds

0757ca01

Merge tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

4y ago

Steven Rostedt (VMware)

27ff768f

tracing: Test the 'Do not trace this pid' case in create event

4y ago

Waiman Long

d257cc8c

locking/rwsem: Make handoff bit handling more consistent

4y ago

Linus Torvalds

75603b14

Merge tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

4y ago

Song Liu

f3fd84a3

x86/perf: Fix snapshot_branch_stack warning in VM

4y ago

Borislav Petkov

8d48bf82

x86/boot: Pull up cmdline preparation and early param parsing

4y ago

Michael S. Tsirkin

f124034f

Revert "virtio_ring: validate used buffer length"

4y ago

Linus Torvalds

3498e7f2

Merge tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd

4y ago

Alex Williamson

86dc40c7

iommu/vt-d: Fix unmap_pages support

When supporting only the .map and .unmap callbacks of iommu_ops,
the IOMMU driver can make assumptions about the size and alignment
used for mappings based on the driver provided pgsize_bitmap. VT-d
previously used essentially PAGE_MASK for this bitmap as any power
of two mapping was acceptably filled by native page sizes.

However, with the .map_pages and .unmap_pages interface we're now
getting page-size and count arguments. If we simply combine these
as (page-size * count) and make use of the previous map/unmap
functions internally, any size and alignment assumptions are very
different.

As an example, a given vfio device assignment VM will often create
a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff]. On a system that
does not support IOMMU super pages, the unmap_pages interface will
ask to unmap 1024 4KB pages at the base IOVA. dma_pte_clear_level()
will recurse down to level 2 of the page table where the first half
of the pfn range exactly matches the entire pte level. We clear the
pte, increment the pfn by the level size, but (oops) the next pte is
on a new page, so we exit the loop an pop back up a level. When we
then update the pfn based on that higher level, we seem to assume
that the previous pfn value was at the start of the level. In this
case the level size is 256K pfns, which we add to the base pfn and
get a results of 0x7fe00, which is clearly greater than 0x401ff,
so we're done. Meanwhile we never cleared the ptes for the remainder
of the range. When the VM remaps this range, we're overwriting valid
ptes and the VT-d driver complains loudly, as reported by the user
report linked below.

The fix for this seems relatively simple, if each iteration of the
loop in dma_pte_clear_level() is assumed to clear to the end of the
level pte page, then our next pfn should be calculated from level_pfn
rather than our working pfn.

Fixes: 3f34f1259776 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20211002124012.18186-1-ajaygargnsit@gmail.com/
Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>

4y ago

Steven Rostedt (VMware)

a55f224f

tracing: Fix pid filtering when triggers are attached

4y ago

Geert Uytterhoeven

61eb495c

pstore/blk: Use "%lu" to format unsigned long

4y ago

Cédric Le Goater

8e80a73f

powerpc/xive: Change IRQ domain to a tree domain

4y ago

Alexander Antonov

bdc0feee

perf/x86/intel/uncore: Fix IIO event constraints for Snowridge

4y ago

Linus Torvalds

fa55b7dc

Linux 5.16-rc1 v5.16-rc1

4y ago

Michael S. Tsirkin

fcfb65f8

Revert "virtio-net: don't let virtio core to validate used length"

4y ago

Guenter Roeck

00169a92

vmxnet3: Use generic Kconfig option for page size limit

4y ago

Namjae Jeon

178ca6f8

ksmbd: fix memleak in get_file_stream_info()

4y ago

Christophe JAILLET

4e5973dd

iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()

4y ago

Steven Rostedt (VMware)

6cb20650

tracing: Check pid filtering when creating events

4y ago

Linus Torvalds

923dcc5e

Merge branch 'akpm' (patches from Andrew)

4y ago

Christophe Leroy

1e35eba4

powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX

4y ago

Alexander Antonov

3866ae31

perf/x86/intel/uncore: Fix IIO event constraints for Skylake Server

4y ago

Gustavo A. R. Silva

dee2b702

kconfig: Add support for -Wimplicit-fallthrough

4y ago

Michael S. Tsirkin

2b17d9f8

Revert "virtio-blk: don't let virtio core to validate used length"

4y ago

Guenter Roeck

4eec7faf

fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k

4y ago

Namjae Jeon

1ec72153

ksmbd: contain default data stream even if xattr is empty

4y ago

Alex Bee

f7ff3cff

iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568

4y ago

Jiri Olsa

1880ed71

tracing/uprobe: Fix uprobe_perf_open probes iteration

4y ago

Linus Torvalds

61564e7b

Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block

4y ago

David Hildenbrand

c1e63117

proc/vmcore: fix clearing user buffer by properly using clear_user()

To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":

systemd[1]: Starting Kdump Vmcore Save Service...
kdump[420]: Kdump is using the default log level(3).
kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[465]: saving vmcore-dmesg.txt complete
kdump[467]: saving vmcore
BUG: unable to handle page fault for address: 00007f2374e01000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
Oops: 0003 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
Call Trace:
read_vmcore+0x236/0x2c0
proc_reg_read+0x55/0xa0
vfs_read+0x95/0x190
ksys_read+0x4f/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().

To fix, properly use clear_user() when we're dealing with a user buffer.

Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4y ago

Christophe Leroy

5499802b

powerpc/signal32: Fix sigset_t copy

4y ago

Alexander Antonov

e324234e

perf/x86/intel/uncore: Fix filter_tid mask for CHA events on Skylake Server

4y ago

Linus Torvalds

ce49bfc8

Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

4y ago

Michael S. Tsirkin

6318cb88

Revert "virtio-scsi: don't let virtio core to validate used buffer length"

4y ago

Guenter Roeck

1f0e290c

arch: Add generic Kconfig option indicating page size smaller than 64k

4y ago

Namjae Jeon

8e537d14

ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()

4y ago

Joerg Roedel

717e88aa

iommu/amd: Clarify AMD IOMMUv2 initialization messages

4y ago

Linus Torvalds

b100274c

Merge tag 'pinctrl-v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

4y ago

Ming Lei

2b504bd4

blk-mq: don't insert FUA request with data into scheduler queue

4y ago

Ard Biesheuvel

825c43f5

kmap_local: don't assume kmap PTEs are linear arrays in memory

4y ago

Linux 5.16-rc3 v5.16-rc3

d58071a8

Linus Torvalds

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

d06c942e

Linus Torvalds

Merge tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

9557e60b

Linus Torvalds

vdpa_sim: avoid putting an uninitialized iova_domain

bb93ce4b

Longpeng

Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

97891bbf

Linus Torvalds

x86/boot: Mark prepare_command_line() __init

c0f2077b

Borislav Petkov

vhost-vdpa: clean irqs before reseting vdpa device

ea8f17e4

Wu Zongyong

Merge tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

1ed1d3a3

Linus Torvalds

sched/scs: Reset task stack state in bringup_cpu()

dce1ca05

Mark Rutland

Linux 5.16-rc2 v5.16-rc2

13605725

Linus Torvalds

virtio-blk: modify the value type of num in virtio_queue_rq()

0466a39b

Ye Guojin

Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

d039f388

Linus Torvalds

perf: Ignore sigtrap for tracepoints destined for other tasks

73743c3b

Marco Elver

Merge tag 'x86-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

40c93d7f

Linus Torvalds

vhost/vsock: cleanup removing `len` variable

11708ff9

Stefano Garzarella

Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

f8132d62

Linus Torvalds

locking/rwsem: Optimize down_read_trylock() under highly contended case

14c24048

Muchun Song

Merge tag 'perf-urgent-2021-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

af16bdea

Linus Torvalds

x86/sgx: Fix free page accounting

ac5d272a

Reinette Chatre

vhost/vsock: fix incorrect used length reported to the guest

49d8c5ff

Stefano Garzarella

Merge tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

0757ca01

Linus Torvalds

tracing: Test the 'Do not trace this pid' case in create event

27ff768f

Steven Rostedt (VMware)

locking/rwsem: Make handoff bit handling more consistent

There are some inconsistency in the way that the handoff bit is being
handled in readers and writers that lead to a race condition.

Firstly, when a queue head writer set the handoff bit, it will clear
it when the writer is being killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.

Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially make the handoff bit set with empty wait queue.

Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.

To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.

The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty to avoid possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.

While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.

Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com