sched: Fix rq->nr_iowait ordering

  schedule()                                   ttwu()
    deactivate_task();                           if (p->on_rq && ...) // false
                                                   atomic_dec(&task_rq(p)->nr_iowait);
    if (prev->in_iowait)
      atomic_inc(&rq->nr_iowait);

This interleaving allows nr_iowait to be decremented before it gets
incremented, resulting in more dodgy IO-wait numbers than usual.
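
For illustration only, here is a userspace toy (not kernel code and not
part of the patch; the thread names, the sleep-based timing and the plain
C11 atomic are stand-ins for the racing schedule()/ttwu() paths and
rq->nr_iowait). It shows the decrement outrunning the paired increment,
leaving the counter transiently at -1; build with cc -pthread:

  /*
   * Userspace toy: two threads and a C11 atomic stand in for the racing
   * schedule()/ttwu() paths and rq->nr_iowait.  The waker's decrement
   * outruns the sleeper's increment, so the counter briefly reads -1.
   */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <unistd.h>

  static atomic_int nr_iowait;			/* stands in for rq->nr_iowait */

  static void *sleeper(void *arg)		/* models schedule() of an in_iowait task */
  {
  	usleep(1000);				/* lose the race on purpose */
  	atomic_fetch_add(&nr_iowait, 1);	/* mimics atomic_inc(&rq->nr_iowait) */
  	return NULL;
  }

  static void *waker(void *arg)			/* models ttwu() seeing !p->on_rq */
  {
  	atomic_fetch_sub(&nr_iowait, 1);	/* mimics atomic_dec(&task_rq(p)->nr_iowait) */
  	return NULL;
  }

  int main(void)
  {
  	pthread_t s, w;

  	pthread_create(&s, NULL, sleeper, NULL);
  	pthread_create(&w, NULL, waker, NULL);
  	usleep(500);
  	printf("mid-race: %d\n", atomic_load(&nr_iowait));	/* typically -1 */
  	pthread_join(s, NULL);
  	pthread_join(w, NULL);
  	printf("settled:  %d\n", atomic_load(&nr_iowait));	/* 0 */
  	return 0;
  }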

Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.
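
Concretely, as the diff below shows, the decrement now runs either in
ttwu_do_activate(), with the runqueue lock held right before the task is
re-activated, or, for the migrate-on-wakeup case, in try_to_wake_up()
after it has already waited for p->on_cpu to drop to zero. In both cases
the paired increment done by schedule() has already happened, so the
decrement can no longer pass it.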

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net

---
 kernel/sched/core.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2501,7 +2501,12 @@
 #ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
+	else
 #endif
+	if (p->in_iowait) {
+		delayacct_blkio_end(p);
+		atomic_dec(&task_rq(p)->nr_iowait);
+	}
 
 	activate_task(rq, p, en_flags);
 	ttwu_do_wakeup(rq, p, wake_flags, rf);
@@ -2893,11 +2888,6 @@
 	if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
 		goto unlock;
 
-	if (p->in_iowait) {
-		delayacct_blkio_end(p);
-		atomic_dec(&task_rq(p)->nr_iowait);
-	}
-
 #ifdef CONFIG_SMP
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
@@ -2963,6 +2963,11 @@
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
+		if (p->in_iowait) {
+			delayacct_blkio_end(p);
+			atomic_dec(&task_rq(p)->nr_iowait);
+		}
+
 		wake_flags |= WF_MIGRATED;
 		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);