sched: fix overload performance: buddy wakeups

Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this cycles through all
tasks, thrashing the cache.

Reduce cache thrashing by keeping dependent tasks together: run the
newly woken task first. However, by not running the leftmost task first
we could starve tasks, because the wakee can gain unlimited runtime.

Therefore we only run the wakee if it is within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but still lets server/client task groups alternate.
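
As an illustration of that rule, here is a minimal user-space C sketch
of the pick decision. It is not the kernel code: the entity struct, the
pick() helper and the sample vruntime/granularity values are invented
for the example; only the comparison mirrors the pick_next() logic in
the patch below.

/* Illustrative only -- names and numbers are made up, not kernel code. */
#include <stdio.h>
#include <stdint.h>

struct entity {
        const char *name;
        int64_t vruntime;       /* virtual runtime; lower = more entitled */
};

/*
 * Prefer the freshly woken buddy over the leftmost task, but only while
 * its vruntime lags the leftmost by at most one wakeup granularity;
 * otherwise fall back to the leftmost task to preserve fairness.
 */
static struct entity *pick(struct entity *leftmost, struct entity *next,
                           int64_t gran)
{
        int64_t diff;

        if (!next)
                return leftmost;

        diff = next->vruntime - leftmost->vruntime;
        if (diff < 0 || diff > gran)
                return leftmost;

        return next;
}

int main(void)
{
        struct entity client = { "client", 1000 };
        struct entity server = { "server", 1003 };

        /* Inside the window: the wakee (server) runs first. */
        printf("picked: %s\n", pick(&client, &server, 10)->name);

        /* Window exhausted: fairness wins, the leftmost task runs. */
        server.vruntime = 1020;
        printf("picked: %s\n", pick(&client, &server, 10)->name);

        return 0;
}

Note that the patch does not compare against a raw nanosecond value:
pick_next() first converts sysctl_sched_wakeup_granularity into
vruntime units with calc_delta_fair(), taking the runqueue's load into
account.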

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

 kernel/sched.c      |  2 +-
 kernel/sched_fair.c | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -301,7 +301,7 @@ struct cfs_rq {
         /* 'curr' points to currently running entity on this cfs_rq.
          * It is set to NULL otherwise (i.e when none are currently running).
          */
-        struct sched_entity *curr;
+        struct sched_entity *curr, *next;
 
         unsigned long nr_spread_over;
 
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -207,5 +207,8 @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
                 }
         }
 
+        if (cfs_rq->next == se)
+                cfs_rq->next = NULL;
+
         rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
 }
@@ -626,12 +629,32 @@ static void set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
                 se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
+static struct sched_entity *
+pick_next(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+        s64 diff, gran;
+
+        if (!cfs_rq->next)
+                return se;
+
+        diff = cfs_rq->next->vruntime - se->vruntime;
+        if (diff < 0)
+                return se;
+
+        gran = calc_delta_fair(sysctl_sched_wakeup_granularity, &cfs_rq->load);
+        if (diff > gran)
+                return se;
+
+        return cfs_rq->next;
+}
+
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
         struct sched_entity *se = NULL;
 
         if (first_fair(cfs_rq)) {
                 se = __pick_next_entity(cfs_rq);
+                se = pick_next(cfs_rq, se);
                 set_next_entity(cfs_rq, se);
         }
 
@@ -1070,6 +1093,9 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)
                 resched_task(curr);
                 return;
         }
+
+        cfs_rq_of(pse)->next = pse;
+
         /*
          * Batch tasks do not preempt (their preemption is driven by
          * the tick):