This is the CFS scheduler.

80% of CFS's design can be summed up in a single sentence: CFS basically
models an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
physical power and which can run each task at precise equal speed, in
parallel, each at 1/nr_running speed. For example: if there are 2 tasks
running then it runs each at 50% physical power - totally in parallel.

On real hardware, we can run only a single task at once, so while that
one task runs, the other tasks that are waiting for the CPU are at a
disadvantage - the current task gets an unfair amount of CPU time. In
CFS this fairness imbalance is expressed and tracked via the per-task
p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
time the task should now run on the CPU for it to become completely fair
and balanced.

( small detail: on 'ideal' hardware, the p->wait_runtime value would
  always be zero - no task would ever get 'out of balance' from the
  'ideal' share of CPU time. )

CFS's task picking logic is based on this p->wait_runtime value and it
is thus very simple: it always tries to run the task with the largest
p->wait_runtime value. In other words, CFS tries to run the task with
the 'gravest need' for more CPU time.
So CFS always tries to split up
CPU time between runnable tasks as close to 'ideal multitasking
hardware' as possible.

Most of the rest of CFS's design just falls out of this really simple
concept, with a few add-on embellishments like nice levels,
multiprocessing and various algorithm variants to recognize sleepers.

In practice it works like this: the system runs a task a bit, and when
the task schedules (or a scheduler tick happens) the task's CPU usage is
'accounted for': the (small) time it just spent using the physical CPU
is deducted from p->wait_runtime, minus the 'fair share' of CPU time it
would have gotten anyway. Once p->wait_runtime gets low enough so that
another task becomes the 'leftmost task' of the time-ordered rbtree it
maintains (plus a small amount of 'granularity' distance relative to the
leftmost task, so that we do not over-schedule tasks and trash the
cache), the new leftmost task is picked and the current task is
preempted.

The rq->fair_clock value tracks the 'CPU time a runnable task would have
fairly gotten, had it been runnable during that time'. So by using
rq->fair_clock values we can accurately timestamp and measure the
'expected CPU time' a task should have gotten. All runnable tasks are
sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
CFS picks the 'leftmost' task and sticks to it. As the system progresses
forwards, newly woken tasks are put into the tree more and more to the
right - slowly but surely giving a chance for every task to become the
'leftmost task' and thus get on the CPU within a deterministic amount of
time.

Some implementation details:

 - the introduction of Scheduling Classes: an extensible hierarchy of
   scheduler modules.
These modules encapsulate scheduling policy
   details and are handled by the scheduler core without the core
   code assuming too much about them.

 - sched_fair.c implements the 'CFS desktop scheduler': it is a
   replacement for the vanilla scheduler's SCHED_OTHER interactivity
   code.

   I'd like to give credit to Con Kolivas for the general approach here:
   he has proven via RSDL/SD that 'fair scheduling' is possible and that
   it results in better desktop scheduling. Kudos Con!

   The CFS patch uses a completely different approach and implementation
   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
   that of RSDL/SD, which is a high standard to meet :-) Testing
   feedback is welcome to decide this one way or another. [ and, in any
   case, all of SD's logic could be added via a kernel/sched_sd.c module
   as well, if Con is interested in such an approach. ]

   CFS's design is quite radical: it does not use runqueues, it uses a
   time-ordered rbtree to build a 'timeline' of future task execution,
   and thus has no 'array switch' artifacts (by which both the vanilla
   scheduler and RSDL/SD are affected).

   CFS uses nanosecond granularity accounting and does not rely on any
   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
   'timeslices' and has no heuristics whatsoever. There is only one
   central tunable:

      /proc/sys/kernel/sched_granularity_ns

   which can be used to tune the scheduler from 'desktop' (low
   latencies) to 'server' (good batching) workloads. It defaults to a
   setting suitable for desktop workloads.
SCHED_BATCH is handled by the
   CFS scheduler module too.

   Due to its design, the CFS scheduler is not prone to any of the
   'attacks' that exist today against the heuristics of the stock
   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c and massive_intr.c
   all work fine, do not impact interactivity and produce the expected
   behavior.

   The CFS scheduler has much stronger handling of nice levels and
   SCHED_BATCH: both types of workloads are isolated much more
   aggressively than under the vanilla scheduler.

   ( another detail: due to nanosec accounting and timeline sorting,
     sched_yield() support is very simple under CFS, and in fact under
     CFS sched_yield() behaves much better than under any other
     scheduler I have tested so far. )

 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
   way than the vanilla scheduler does. It uses 100 runqueues (for all
   100 RT priority levels, instead of 140 in the vanilla scheduler)
   and it needs no expired array.

 - reworked/sanitized SMP load-balancing: the runqueue-walking
   assumptions are gone from the load-balancing code now, and the
   iterators of the scheduling modules are used instead. The balancing
   code got quite a bit simpler as a result.