commits

tjh.dev / kernel

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kernel os linux

Author

Commit

Message

Date

Linus Torvalds

62fb9874

Linux 5.13 v5.13

4y ago

Linus Torvalds

b4b27b9e

Revert "signal: Allow tasks to cache one sigqueue struct"

This reverts commits 4bad58ebc8bc4f20d89cff95417c9b4674769709 (and
399f8dd9a866e107639eabd3c1979cd526ca3a98, which tried to fix it).

I do not believe these are correct, and I'm about to release 5.13, so am
reverting them out of an abundance of caution.

The locking is odd, and appears broken.

On the allocation side (in __sigqueue_alloc()), the locking is somewhat
straightforward: it depends on sighand->siglock. Since one caller
doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
the case with no locks held.

On the freeing side (in sigqueue_cache_or_free()), there is no locking
at all, and the logic instead depends on 'current' being a single
thread, and not able to race with itself.

To make things more exciting, there's also the data race between freeing
a signal and allocating one, which is handled by using WRITE_ONCE() and
READ_ONCE(), and being mutually exclusive wrt the initial state (ie
freeing will only free if the old state was NULL, while allocating will
obviously only use the value if it was non-NULL, so only one or the
other will actually act on the value).

However, while the free->alloc paths do seem mutually exclusive thanks
to just the data value dependency, it's not clear what the memory
ordering constraints are on it. Could writes from the previous
allocation possibly be delayed and seen by the new allocation later,
causing logical inconsistencies?

So it's all very exciting and unusual.

And in particular, it seems that the freeing side is incorrect in
depending on "current" being single-threaded. Yes, 'current' is a
single thread, but in the presence of asynchronous events even a single
thread can have data races.

And such asynchronous events can and do happen, with interrupts causing
signals to be flushed and thus free'd (for example - sending a
SIGCONT/SIGSTOP can happen from interrupt context, and can flush
previously queued process control signals).

So regardless of all the other questions about the memory ordering and
locking for this new cached allocation, the sigqueue_cache_or_free()
assumptions seem to be fundamentally incorrect.

It may be that people will show me the errors of my ways, and tell me
why this is all safe after all. We can reinstate it if so. But my
current belief is that the WRITE_ONCE() that sets the cached entry needs
to be a smp_store_release(), and the READ_ONCE() that finds a cached
entry needs to be a smp_load_acquire() to handle memory ordering
correctly.

And the sequence in sigqueue_cache_or_free() would need to either use a
lock or at least be interrupt-safe some way (perhaps by using something
like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
percpu operations it needs to be interrupt-safe).

Fixes: 399f8dd9a866 ("signal: Prevent sigqueue caching after task got released")
Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4y ago
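
The ordering suggested in the revert message above can be illustrated outside the kernel. What follows is a minimal userspace sketch in C11, not the kernel code. `struct entry`, `cache_slot`, `cache_or_free()` and `alloc_entry()` are hypothetical names. C11 release/acquire atomics stand in for smp_store_release() and smp_load_acquire(), and a full atomic compare-exchange stands in for the interrupt-safe cmpxchg the message mentions, which is stronger than what the message asks for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct sigqueue. */
struct entry {
	int payload;
};

/* One-slot cache; NULL means "empty". Static storage starts out as NULL. */
static _Atomic(struct entry *) cache_slot;

/*
 * Free side: park the entry in the slot if the slot is empty, otherwise
 * really free it. The compare-exchange turns "test for NULL, then store"
 * into one atomic step, so an asynchronous caller cannot slip in between
 * the test and the store. Release ordering publishes earlier writes to *e.
 * Returns 1 if the entry was cached, 0 if it was freed.
 */
static int cache_or_free(struct entry *e)
{
	struct entry *expected = NULL;

	assert(e != NULL);
	if (atomic_compare_exchange_strong_explicit(&cache_slot, &expected, e,
						    memory_order_release,
						    memory_order_relaxed))
		return 1;
	free(e);
	return 0;
}

/*
 * Alloc side: take the cached entry if there is one. Acquire ordering
 * pairs with the release above, so the taker sees the entry's contents
 * as they were when it was parked. Falls back to malloc() when empty.
 */
static struct entry *alloc_entry(void)
{
	struct entry *e = atomic_exchange_explicit(&cache_slot,
						   (struct entry *)NULL,
						   memory_order_acquire);

	if (!e)
		e = malloc(sizeof(*e));
	return e;
}
```

In this sketch, a first call to cache_or_free() parks its entry, a second call finds the slot occupied and frees its entry, and the next alloc_entry() hands back the parked one. Whether the kernel path needs exactly this ordering is the open question the revert message raises; the sketch only shows what the release/acquire pairing looks like.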

Linus Torvalds

625acffd

Merge tag 's390-5.13-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

4y ago

Linus Torvalds

b7050b24

Merge tag 'pinctrl-v5.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

4y ago

Heiko Carstens

67147e96

s390/stack: fix possible register corruption with stack switch helper

4y ago

Linus Torvalds

e2f527b5

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

4y ago

Fabien Dessenne

67e2996f

pinctrl: stm32: fix the reported number of GPIO lines per bank

4y ago

Sven Schnelle

9e3d62d5

s390/topology: clear thread/group maps for offline cpus

4y ago

Linus Torvalds

7ce32ac6

Merge branch 'akpm' (patches from Andrew)

4y ago

Christoph Hellwig

d1b7f920

scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART)

4y ago

Andy Shevchenko

76b7f8fa

pinctrl: microchip-sgpio: Put fwnode in error case during ->probe()

4y ago

Tony Krowiak

8c0795d2

s390/vfio-ap: clean up mdev resources when remove callback invoked

4y ago

Gleb Fotengauer-Malinovskiy

808e9df4

userfaultfd: uapi: fix UFFDIO_CONTINUE ioctl request definition

4y ago

Marek Behún

72a461ad

mailmap: add Marek's other e-mail address and identity without diacritics

4y ago

ManYi Li

7dd753ca

scsi: sr: Return appropriate error code when disk is ejected

4y ago

Linus Torvalds

009c9aa5

Linux 5.13-rc6 v5.13-rc6

4y ago

Sven Schnelle

ca1f4d70

s390: clear pt_regs::flags on irq entry

4y ago

Linus Torvalds

55fcd449

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

4y ago

Marek Behún

ee924d3d

MAINTAINERS: fix Marek's identity again

4y ago

Ming Lei

1e0d4e62

scsi: core: Only put parent device if host state differs from SHOST_CREATED

4y ago

Linus Torvalds

e4e45343

Merge tag 'perf-tools-fixes-for-v5.13-2021-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

4y ago

Sven Schnelle

fc66127d

s390: fix system call restart with multiple signals

4y ago

Linus Torvalds

7764c62f

Merge tag 'devprop-5.13-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

4y ago

Johan Hovold

4ca070ef

i2c: robotfuzz-osif: fix control-request directions

4y ago

Mel Gorman

b3b64ebd

mm/page_alloc: do bulk array bounds check after checking populated elements

4y ago

Ming Lei

11714026

scsi: core: Put .shost_dev in failure path if host state changes to RUNNING

4y ago

Linus Torvalds

960f0716

Merge tag 'nfs-for-5.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

4y ago

Arnaldo Carvalho de Melo

36524112

tools headers cpufeatures: Sync with the kernel sources

4y ago

Linus Torvalds

13311e74

Linux 5.13-rc7 v5.13-rc7

4y ago

Linus Torvalds

b960e014

Merge tag 'for-linus-5.13b-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

4y ago

Heikki Krogerus

5dca69e2

software node: Handle software node injection to an existing device properly

4y ago

Andreas Hecht

3265a7e6

i2c: dev: Add __user annotation

4y ago

Rasmus Villemoes

b08e50dd

mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array

4y ago

Ming Lei

3719f4ff

scsi: core: Fix failure handling of scsi_add_host_with_dma()

4y ago

Linus Torvalds

331a6edb

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

4y ago

Trond Myklebust

c3aba897

NFSv4: Fix second deadlock in nfs4_evict_inode()

4y ago

Leo Yan

197eecb6

perf session: Correct buffer copying when peeking events

4y ago

Linus Torvalds

cba5e972

Merge tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Linus Torvalds

616a99dd

Merge tag 'for-linus-urgent' of git://git.kernel.org/pub/scm/virt/kvm/kvm

4y ago

Juergen Gross

3de218ff

xen/events: reset active flag for lateeoi events later

4y ago

Dan Carpenter

22695837

i2c: cp2615: check for allocation failure in cp2615_i2c_recv()

4y ago

Naoya Horiguchi

ea6d0630

mm/hwpoison: do not lock page again when me_huge_page() successfully recovers

4y ago

Ming Lei

66a834d0

scsi: core: Fix error handling of scsi_host_alloc()

4y ago

Linus Torvalds

8ecfa36c

Merge tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

4y ago

Trond Myklebust

dfe1fe75

NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()

4y ago

Eric W. Biederman

06af8679

coredump: Limit what can interrupt coredumps

4y ago

Linus Torvalds

9df7f15e

Merge tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Odin Ugedal

a7b359fc

sched/fair: Correctly insert cfs_rq's to list on unthrottle

4y ago

Linus Torvalds

94ca94bb

Merge tag 'x86_urgent_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

4y ago

Nicholas Piggin

f8be156b

KVM: do not allow mapping valid but non-reference-counted pages

4y ago

Roger Pau Monne

107866a8

xen-netback: take a reference to the RX task thread

4y ago

Heiner Kallweit

065b6211

i2c: i801: Ensure that SMBHSTSTS_INUSE_STS is cleared when leaving i801_access

4y ago

Aili Yao

47af12ba

mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned

When memory_failure() is called with MF_ACTION_REQUIRED on a page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS
to the affected process, which results in an infinite loop of MCEs.

Currently memory_failure() returns 0 if it's called for an already
hwpoisoned page, so the caller, kill_me_maybe(), could return without
sending SIGBUS to the current process. An action-required MCE is raised
when the current process accesses the broken memory, so without SIGBUS
the current process continues to run and soon accesses the error page
again, running into an MCE loop.

This issue can arise for example in the following scenarios:

- Two or more threads access the poisoned page concurrently. If
  local MCE is enabled, the MCE handler handles each MCE event
  independently. So there's a race among MCE events, and the second or
  later threads fall into the situation in question.

- If there was a preceding memory error event and memory_failure() for
  that event failed to unmap the error page for some reason, a
  subsequent memory access to the error page triggers the MCE loop
  situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned. This allows the memory error
handler to control how it sends signals to userspace. Also make sure
that any process touching a hwpoisoned page gets a SIGBUS even in the
"already hwpoisoned" path of memory_failure(), as is done in the page
fault path.

Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jue Wang <juew@google.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4y ago
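
The return-code change described in the commit above can be modelled in a few lines. This is a simplified userspace sketch, not the kernel implementation: `struct page_model`, `memory_failure_model()` and `handle_mce_model()` are hypothetical names, and the real handlers do far more work (unmapping, isolation, signal delivery). Only the EHWPOISON errno value is real; the fallback define assumes the generic Linux value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* generic Linux errno value; fallback for other systems */
#endif

/* Hypothetical model of a page: only the poison flag matters here. */
struct page_model {
	bool hwpoisoned;
};

/* What the (modelled) MCE handler decides to do for the current task. */
enum mce_action {
	ACT_NONE,	/* first-time handling succeeded, nothing more to do here */
	ACT_SIGBUS	/* the task must receive SIGBUS */
};

/*
 * Model of the fixed behaviour: an already poisoned page is reported
 * with -EHWPOISON instead of 0, so the caller can tell the two cases apart.
 */
static int memory_failure_model(struct page_model *p)
{
	assert(p != NULL);
	if (p->hwpoisoned)
		return -EHWPOISON;
	p->hwpoisoned = true;	/* the real code would also unmap and isolate */
	return 0;
}

/*
 * Model of the caller: with the distinct error code it can make sure a
 * task touching an already poisoned page still gets SIGBUS rather than
 * returning silently and faulting on the same page again.
 */
static enum mce_action handle_mce_model(struct page_model *p)
{
	int ret = memory_failure_model(p);

	if (ret == -EHWPOISON)
		return ACT_SIGBUS;
	return ACT_NONE;
}
```

With a return value of 0 in both cases, as before the fix, the caller in this model could not distinguish "handled just now" from "already poisoned", which is the ambiguity the commit message identifies as the source of the MCE loop.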

Ewan D. Milne

e57f5cd9

scsi: scsi_devinfo: Add blacklist entry for HPE OPEN-V

4y ago

Feng Tang

2e302543

mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reported a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork").

Further debugging shows the regression is caused by that commit changing
the offset of the hot field 'mmap_lock' inside 'struct mm_struct', which
changes its cache alignment.

From the perf data, contention on 'mmap_lock' is very severe and takes
around 95% of CPU cycles. 'mmap_lock' is a rw_semaphore:

	struct rw_semaphore {
		atomic_long_t count;	/* 8 bytes */
		atomic_long_t owner;	/* 8 bytes */
		struct optimistic_spin_queue osq; /* spinner MCS lock */
		...

Before commit 57efa1fe5957 added 'write_protect_seq', the structure
happened to have a near-optimal cache alignment layout, as Linus explained:

"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.

Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, causing
more cache bouncing.

Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
affect its offset:

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is for a 64-bit system with 0day's default kernel config
(similar to RHEL-8.3's config), in which all 3 of these options are 'y'.
The layout can vary with different kernel configs.

Re-laying out a structure is usually a double-edged sword, as it can
help one case but hurt others. In this case, one solution relies on the
newly added 'write_protect_seq' being a 4-byte seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n): placing it into an existing 4-byte hole in
'mm_struct' does not change the alignment of other fields, and it
recovers the lost performance.

Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>