tjh.dev/kernel at 7cec2e16cb62ed597791fb2d266e5ddd5818f1b3

In our production environment, we found many hung tasks that had been
blocked for more than 18 hours. Their call traces look like this:

[346278.191038] __schedule+0x2d8/0x890
[346278.191046] schedule+0x4e/0xb0
[346278.191049] perf_event_free_task+0x220/0x270
[346278.191056] ? init_wait_var_entry+0x50/0x50
[346278.191060] copy_process+0x663/0x18d0
[346278.191068] kernel_clone+0x9d/0x3d0
[346278.191072] __do_sys_clone+0x5d/0x80
[346278.191076] __x64_sys_clone+0x25/0x30
[346278.191079] do_syscall_64+0x5c/0xc0
[346278.191083] ? syscall_exit_to_user_mode+0x27/0x50
[346278.191086] ? do_syscall_64+0x69/0xc0
[346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20
[346278.191092] ? irqentry_exit+0x19/0x30
[346278.191095] ? exc_page_fault+0x89/0x160
[346278.191097] ? asm_exc_page_fault+0x8/0x30
[346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae

The task was waiting for the refcount to drop to 1, but the vmcore
showed that the refcount was already 1. It seems that the task was not
woken up by perf_event_release_kernel() and got stuck forever. The
following scenario may cause the problem.

Thread A                                                Thread B
...                                                     ...
perf_event_free_task                                    perf_event_release_kernel
                                                           ...
                                                           acquire event->child_mutex
                                                           ...
                                                           get_ctx
   ...                                                     release event->child_mutex
   acquire ctx->mutex
   ...
   perf_free_event (acquire/release event->child_mutex)
   ...
   release ctx->mutex
   wait_var_event
                                                           acquire ctx->mutex
                                                           acquire event->child_mutex
                                                           # move existing events to free_list
                                                           release event->child_mutex
                                                           release ctx->mutex
                                                           put_ctx
...                                                     ...

In this case, all events of the ctx have already been freed, so the ctx
cannot be found in free_list and Thread A misses the wakeup. It is thus
necessary to add a wakeup after dropping the reference.
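
The ordering problem can be illustrated outside the kernel. The following
is a minimal user-space sketch using pthreads, not the kernel patch itself:
struct ctx_model, wait_for_last_ref(), put_ref_and_wake() and run_model()
are hypothetical names invented for this illustration. The waiter stands in
for the wait_var_event() in perf_event_free_task(), and the notification
issued after the decrement stands in for the wakeup the commit adds after
put_ctx() (in the kernel, wake_up_var() is the counterpart of
wait_var_event()).

```c
#include <pthread.h>

/* Hypothetical stand-in for a context with a reference count. */
struct ctx_model {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int refcount;
};

/* Thread A analogue: sleep until only one reference remains. */
static void *wait_for_last_ref(void *arg)
{
	struct ctx_model *c = arg;

	pthread_mutex_lock(&c->lock);
	while (c->refcount != 1)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

/* Thread B analogue: drop a reference, then notify any waiter. */
static void put_ref_and_wake(struct ctx_model *c)
{
	pthread_mutex_lock(&c->lock);
	c->refcount--;
	/* Without this notification a sleeping waiter would never re-check. */
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Runs one waiter and one put; returns the final refcount, or -1 on error. */
int run_model(void)
{
	struct ctx_model c;
	pthread_t waiter;
	int final;

	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.cond, NULL);
	c.refcount = 2;	/* one reference per thread in this model */

	if (pthread_create(&waiter, NULL, wait_for_last_ref, &c) != 0)
		return -1;
	put_ref_and_wake(&c);
	pthread_join(waiter, NULL);

	final = c.refcount;
	pthread_cond_destroy(&c.cond);
	pthread_mutex_destroy(&c.lock);
	return final;
}
```

If the pthread_cond_broadcast() call is removed and the waiter goes to sleep
before the decrement, pthread_join() in run_model() never returns, which
mirrors the hung task in the trace above.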

Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()")
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20240513103948.33570-1-haifeng.xu@shopee.com

74751ef5

Haifeng Xu

2 years ago

Merge tag 'mm-hotfixes-stable-2024-06-07-15-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

dc772f82

Linus Torvalds

2 years ago

locking/atomic: scripts: fix ${atomic}_sub_and_test() kerneldoc

f92a59f6

Carlos Llamas

2 years ago

Linux 6.10-rc2

c3f38fa6

Linus Torvalds

2 years ago

v6.10-rc2

Merge tag 'gpio-fixes-for-v6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

e60721bf

Linus Torvalds

2 years ago

nilfs2: fix nilfs_empty_dir() misjudgment and long loop on I/O errors

7373a51e

Ryusuke Konishi

2 years ago

Merge tag 'ata-6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

58d89ee8

Linus Torvalds

2 years ago

Merge tag 'block-6.10-20240607' of git://git.kernel.dk/linux

602079a0

Linus Torvalds

2 years ago

gpio: add missing MODULE_DESCRIPTION() macros

64054eb7

Jeff Johnson

2 years ago

mm: fix xyz_noprof functions calling profiled functions

94159835

Suren Baghdasaryan

2 years ago

Merge tag 'x86-urgent-2024-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

a693b9c9

Linus Torvalds

2 years ago

ata: libata-core: Add ATA_HORKAGE_NOLPM for Apacer AS340

3cb648c4

Niklas Cassel

2 years ago

Merge tag 'io_uring-6.10-20240607' of git://git.kernel.dk/linux

e3391589

Linus Torvalds

2 years ago

branches 3

master 12 hours ago default

nocache-cleanup 3 weeks ago

for-next 1 year ago

tags 927

v7.0

1 week ago latest

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
