sched/core: Fix TASK_DEAD race in finish_task_switch()

So the problem this patch is trying to address is as follows:

CPU0 CPU1

context_switch(A, B)
ttwu(A)
LOCK A->pi_lock
A->on_cpu == 0
finish_task_switch(A)
prev_state = A->state <-.
WMB |
A->on_cpu = 0; |
UNLOCK rq0->lock |
| context_switch(C, A)
`-- A->state = TASK_DEAD
prev_state == TASK_DEAD
put_task_struct(A)
context_switch(A, C)
finish_task_switch(A)
A->state == TASK_DEAD
put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

authored by Peter Zijlstra and committed by Ingo Molnar 95913d97 049e6dde

Changed files
+8 -7
kernel
sched
+5 -5
kernel/sched/core.c
··· 2517 2517 * If a task dies, then it sets TASK_DEAD in tsk->state and calls 2518 2518 * schedule one last time. The schedule call will never return, and 2519 2519 * the scheduled task must drop that reference. 2520 - * The test for TASK_DEAD must occur while the runqueue locks are 2521 - * still held, otherwise prev could be scheduled on another cpu, die 2522 - * there before we look at prev->state, and then the reference would 2523 - * be dropped twice. 2524 - * Manfred Spraul <manfred@colorfullife.com> 2520 + * 2521 + * We must observe prev->state before clearing prev->on_cpu (in 2522 + * finish_lock_switch), otherwise a concurrent wakeup can get prev 2523 + * running on another CPU and we could rave with its RUNNING -> DEAD 2524 + * transition, resulting in a double drop. 2525 2525 */ 2526 2526 prev_state = prev->state; 2527 2527 vtime_task_switch(prev);
+3 -2
kernel/sched/sched.h
··· 1078 1078 * After ->on_cpu is cleared, the task can be moved to a different CPU. 1079 1079 * We must ensure this doesn't happen until the switch is completely 1080 1080 * finished. 1081 + * 1082 + * Pairs with the control dependency and rmb in try_to_wake_up(). 1081 1083 */ 1082 - smp_wmb(); 1083 - prev->on_cpu = 0; 1084 + smp_store_release(&prev->on_cpu, 0); 1084 1085 #endif 1085 1086 #ifdef CONFIG_DEBUG_SPINLOCK 1086 1087 /* this is a valid case when another task releases the spinlock */