sched: fix overload performance: buddy wakeups

Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this will cycle through all
tasks, thrashing the cache.

Reduce cache thrashing by keeping dependent tasks together: run newly
woken tasks first. However, by not running the leftmost task first we
could starve tasks, because the wakee can gain unlimited runtime.

Therefore we only run the wakee if it is within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but alternates between server/client task groups rather than
cycling through every individual task.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Authored by Peter Zijlstra and committed by Ingo Molnar (aa2ac252, 27d11726)
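As an aside for readers who want to poke at the heuristic outside the kernel,
below is a minimal user-space sketch of the selection rule described above. It
mirrors the pick_next() helper added to kernel/sched_fair.c further down, but
struct entity, struct runqueue, pick_buddy() and the numbers in main() are
simplified stand-ins for illustration only; in the patch itself the window is
derived from sysctl_sched_wakeup_granularity via calc_delta_fair() against the
runqueue load.

#include <stdio.h>

/* Simplified stand-ins for struct sched_entity / struct cfs_rq. */
struct entity {
	const char *name;
	long long vruntime;	/* virtual runtime in ns; smaller == more entitled */
};

struct runqueue {
	struct entity *next;	/* last woken task ("buddy"), may be NULL */
};

/*
 * Pick the buddy instead of the leftmost task, but only when the buddy
 * trails the leftmost task by at most 'gran' ns of virtual runtime.
 */
static struct entity *
pick_buddy(struct runqueue *rq, struct entity *leftmost, long long gran)
{
	long long diff;

	if (!rq->next)		/* no freshly woken task recorded */
		return leftmost;

	diff = rq->next->vruntime - leftmost->vruntime;
	if (diff < 0)		/* buddy somehow left of the leftmost; fall back */
		return leftmost;
	if (diff > gran)	/* buddy too far behind in entitlement: fairness wins */
		return leftmost;

	return rq->next;
}

int main(void)
{
	/* Hypothetical numbers, chosen only to exercise both branches. */
	struct entity server = { "server", 1000000 };
	struct entity client = { "client", 1002000 };
	struct runqueue rq = { &client };

	/* With a 4ms granularity the 2us gap is tolerated: the wakee runs. */
	printf("%s\n", pick_buddy(&rq, &server, 4000000)->name);

	/* With a tiny granularity fairness wins: the leftmost runs. */
	printf("%s\n", pick_buddy(&rq, &server, 1000)->name);
	return 0;
}

The point of the window check is that the buddy can only jump the queue while
its vruntime lag stays inside the granularity window, so the leftmost task can
be deferred by at most a bounded amount per wakeup.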

+27 -1
kernel/sched.c (+1 -1)

···
	/* 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
-	struct sched_entity *curr;
+	struct sched_entity *curr, *next;

	unsigned long nr_spread_over;
kernel/sched_fair.c (+26)

···
		}
	}

+	if (cfs_rq->next == se)
+		cfs_rq->next = NULL;
+
	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}
···
	se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

+static struct sched_entity *
+pick_next(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	s64 diff, gran;
+
+	if (!cfs_rq->next)
+		return se;
+
+	diff = cfs_rq->next->vruntime - se->vruntime;
+	if (diff < 0)
+		return se;
+
+	gran = calc_delta_fair(sysctl_sched_wakeup_granularity, &cfs_rq->load);
+	if (diff > gran)
+		return se;
+
+	return cfs_rq->next;
+}
+
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = NULL;

	if (first_fair(cfs_rq)) {
		se = __pick_next_entity(cfs_rq);
+		se = pick_next(cfs_rq, se);
		set_next_entity(cfs_rq, se);
	}
···
		resched_task(curr);
		return;
	}
+
+	cfs_rq_of(pse)->next = pse;
+
	/*
	 * Batch tasks do not preempt (their preemption is driven by
	 * the tick):
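
The three sched_fair.c hunks cooperate as follows: check_preempt_wakeup()
records the freshly woken entity in cfs_rq->next, pick_next_entity() then
prefers that buddy over the leftmost entity as long as it is within the scaled
wakeup granularity, and __dequeue_entity() clears the pointer when the buddy
leaves the tree so it cannot be left dangling.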