tjh.dev/kernel at 5fee0dafba0a5223fa1c1f0dd99af71256449d38

Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().

The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.

Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:

root(#)
/ \
A(#) B(*)
|
C <--- new cgroup
|
D <--- new cgroup

# - Already on leaf cfs_rq list
* - Throttled with PELT frozen

The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:

1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)

propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.

cfs_rq of cgroup B is skipped since its PELT is throttled.

root cfs_rq already exists on cfs_rq leading to
list_add_leaf_cfs_rq() returning early.

The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".

2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".

The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".

!!! Splat !!!

Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on other subtree on the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.

Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue, or
the distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.

While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.

Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com

0e4a169d

K Prateek Nayak

6 months ago

objtool/rust: add one more `noreturn` Rust function

dbdf2a7f

Miguel Ojeda

6 months ago

genirq/chip: Add buslock back in to irq_set_handler()

5d7e45dd

Charles Keepax

6 months ago

Merge tag 'driver-core-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

72761a7e

Linus Torvalds

6 months ago

timekeeping: Fix aux clocks sysfs initialization loop bound

39a9ed0f

Haofeng Li

6 months ago

Linux 6.18-rc2

211ddde0

Linus Torvalds

6 months ago

v6.18-rc2

Merge tag 'firewire-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

818444a6

Linus Torvalds

6 months ago

arch_topology: Fix incorrect error check in topology_parse_cpu_capacity()

2eead193

Kaushlendra Kumar

6 months ago

Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

d9043c79

Linus Torvalds

6 months ago

branches 3

master 1 day ago default

compare

nocache-cleanup 4 weeks ago

compare

for-next 1 year ago

compare

tags 928

v7.1-rc1

1 day ago latest

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.

Configure Feed

Configure Feed

Clone this repository