Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched/core: Dequeue PSI signals for blocked tasks that are delayed

psi_dequeue() for a blocked task expects psi_sched_switch() to clear
the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flag. However,
psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect whether the
task is blocked or still runnable, which no longer holds with
DELAY_DEQUEUE since a blocking task can be left queued on the runqueue.

This can lead to PSI splats similar to:

psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4

when the task is requeued since the TSK_RUNNING flag was not cleared
when the task was blocked.
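
The failure mode described above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: every toy_* name is
hypothetical, the consistency check is a simplified stand-in for the one that
emits the splat, and TOY_RUNNING is given the value 4 purely so that it lines
up with psi_flags=4 and set=4 in the message.

```c
#include <stdbool.h>

/* Toy flag values (assumption); TOY_RUNNING mirrors psi_flags=4 in the splat. */
enum { TOY_IOWAIT = 1, TOY_RUNNING = 4 };

struct toy_task {
	unsigned int psi_flags;	/* flags currently accounted for the task */
	bool on_rq;		/* is the task still queued on the runqueue? */
};

/*
 * Apply a flag change. Returns false when the change is inconsistent,
 * i.e. it sets a flag that is already set or clears one that is not set.
 */
static bool toy_flags_change(struct toy_task *t, unsigned int clear,
			     unsigned int set)
{
	bool ok = !(t->psi_flags & set) && (t->psi_flags & clear) == clear;

	t->psi_flags &= ~clear;
	t->psi_flags |= set;
	return ok;
}

/* Switch-out path: RUNNING is cleared only if 'sleep' says prev blocked. */
static void toy_sched_switch(struct toy_task *prev, bool sleep)
{
	if (sleep)
		toy_flags_change(prev, TOY_RUNNING, 0);
}

/* Requeue path: the task becomes runnable again, so RUNNING is set. */
static bool toy_enqueue(struct toy_task *t)
{
	t->on_rq = true;
	return toy_flags_change(t, 0, TOY_RUNNING);
}

/*
 * Old behaviour: infer "blocked" from the runqueue state alone.
 * Returns true if the later requeue was consistent.
 */
static bool toy_block_then_requeue_old(bool delayed)
{
	struct toy_task t = { .psi_flags = TOY_RUNNING, .on_rq = true };

	if (!delayed)
		t.on_rq = false;	/* ordinary dequeue removes the task */
	/* With a delayed dequeue the task stays queued, so sleep is false. */
	toy_sched_switch(&t, !t.on_rq);
	return toy_enqueue(&t);
}
```

In this model toy_block_then_requeue_old(false) is consistent, while
toy_block_then_requeue_old(true) leaves TOY_RUNNING set across the block and
then tries to set it again on requeue, which is the inconsistency the splat
reports.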

Explicitly communicate that the task was blocked to psi_sched_switch()
even if it was delayed and is still on the runqueue.
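
As a rough sketch of the idea, the difference between inferring the blocked
state from the runqueue and passing it explicitly can be modelled in
userspace as follows. The toy_* names and the wants_to_block, delay_dequeue
and use_fix parameters are hypothetical simplifications, not the kernel's
API; only the shape of the "block" local follows the change in this commit.

```c
#include <stdbool.h>

struct toy_task {
	bool on_rq;	/* still queued on the runqueue? */
	bool delayed;	/* dequeue was deferred (toy stand-in for DELAY_DEQUEUE) */
};

/* Hypothetical stand-in for block_task(): a delayed task stays queued. */
static void toy_block_task(struct toy_task *p, bool delay_dequeue)
{
	if (delay_dequeue) {
		p->delayed = true;
		return;
	}
	p->on_rq = false;
}

/*
 * Returns the "sleep" value that would be handed to the PSI switch hook.
 * use_fix selects the explicit flag; otherwise the runqueue state is used.
 */
static bool toy_schedule(struct toy_task *prev, bool wants_to_block,
			 bool delay_dequeue, bool use_fix)
{
	bool block = false;

	if (wants_to_block) {
		toy_block_task(prev, delay_dequeue);
		block = true;	/* recorded regardless of the queue state */
	}

	return use_fix ? block : !prev->on_rq;
}
```

With a delayed dequeue, the runqueue-based inference reports the task as
still runnable, whereas the explicit flag reports it as blocked. For a plain
preemption (wants_to_block == false) both variants agree.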

[ prateek: Broke off the relevant part from [1], commit message ]

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/ [1]

+3 -1
kernel/sched/core.c
@@ -6537,6 +6537,7 @@
 	 * as a preemption by schedule_debug() and RCU.
 	 */
 	bool preempt = sched_mode > SM_NONE;
+	bool block = false;
 	unsigned long *switch_count;
 	unsigned long prev_state;
 	struct rq_flags rf;
@@ -6622,6 +6623,7 @@
 			 * After this, schedule() must not care about p->state any more.
 			 */
 			block_task(rq, prev, flags);
+			block = true;
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -6667,7 +6669,7 @@
 
 		migrate_disable_switch(rq, prev);
 		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+		psi_sched_switch(prev, next, block);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
 