Fix possible runqueue lock starvation in wait_task_inactive()

Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.

He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it won't do while spinning in an
interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section. If there were races in the optimistic wait, or a preemption
event scheduled the process away, we simply re-synchronize and start
over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care. Both the runqueue used and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_. And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Linus Torvalds and committed by Linus Torvalds fa490cfd a0f98a1c

+61 -10
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1159,21 +1159,72 @@
 {
 	unsigned long flags;
 	struct rq *rq;
-	int preempted;
+	struct prio_array *array;
+	int running;
 
 repeat:
-	rq = task_rq_lock(p, &flags);
-	/* Must be off runqueue entirely, not preempted. */
-	if (unlikely(p->array || task_running(rq, p))) {
-		/* If it's preempted, we yield. It could be a while. */
-		preempted = !task_running(rq, p);
-		task_rq_unlock(rq, &flags);
+	/*
+	 * We do the initial early heuristics without holding
+	 * any task-queue locks at all. We'll only try to get
+	 * the runqueue lock when things look like they will
+	 * work out!
+	 */
+	rq = task_rq(p);
+
+	/*
+	 * If the task is actively running on another CPU
+	 * still, just relax and busy-wait without holding
+	 * any locks.
+	 *
+	 * NOTE! Since we don't hold any locks, it's not
+	 * even sure that "rq" stays as the right runqueue!
+	 * But we don't care, since "task_running()" will
+	 * return false if the runqueue has changed and p
+	 * is actually now running somewhere else!
+	 */
+	while (task_running(rq, p))
 		cpu_relax();
-		if (preempted)
-			yield();
+
+	/*
+	 * Ok, time to look more closely! We need the rq
+	 * lock now, to be *sure*. If we're wrong, we'll
+	 * just go back and repeat.
+	 */
+	rq = task_rq_lock(p, &flags);
+	running = task_running(rq, p);
+	array = p->array;
+	task_rq_unlock(rq, &flags);
+
+	/*
+	 * Was it really running after all now that we
+	 * checked with the proper locks actually held?
+	 *
+	 * Oops. Go back and try again..
+	 */
+	if (unlikely(running)) {
+		cpu_relax();
 		goto repeat;
 	}
-	task_rq_unlock(rq, &flags);
+
+	/*
+	 * It's not enough that it's not actively running,
+	 * it must be off the runqueue _entirely_, and not
+	 * preempted!
+	 *
+	 * So if it was still runnable (but just not actively
+	 * running right now), it's preempted, and we should
+	 * yield - it could be a while.
+	 */
+	if (unlikely(array)) {
+		yield();
+		goto repeat;
+	}
+
+	/*
+	 * Ahh, all good. It wasn't running, and it wasn't
+	 * runnable, which means that it will never become
+	 * running in the future either. We're all done!
+	 */
 }
 
 /***