···
 mkdir $ARGV[0],0777;
 $state = 0;
 while (<STDIN>) {
-    if (/^\.TH \"[^\"]*\" 4 \"([^\"]*)\"/) {
+    if (/^\.TH \"[^\"]*\" 9 \"([^\"]*)\"/) {
 	if ($state == 1) { close OUT }
 	$state = 1;
-	$fn = "$ARGV[0]/$1.4";
+	$fn = "$ARGV[0]/$1.9";
 	print STDERR "Creating $fn\n";
 	open OUT, ">$fn" or die "can't open $fn: $!\n";
 	print OUT $_;
Documentation/scheduler/sched-design-CFS.txt (+238/-149)
···
-
-This is the CFS scheduler.
-
-80% of CFS's design can be summed up in a single sentence: CFS basically
-models an "ideal, precise multi-tasking CPU" on real hardware.
-
-"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
-physical power and which can run each task at precise equal speed, in
-parallel, each at 1/nr_running speed. For example: if there are 2 tasks
-running then it runs each at 50% physical power - totally in parallel.
-
-On real hardware, we can run only a single task at once, so while that
-one task runs, the other tasks that are waiting for the CPU are at a
-disadvantage - the current task gets an unfair amount of CPU time. In
-CFS this fairness imbalance is expressed and tracked via the per-task
-p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
-time the task should now run on the CPU for it to become completely fair
-and balanced.
-
-( small detail: on 'ideal' hardware, the p->wait_runtime value would
-  always be zero - no task would ever get 'out of balance' from the
-  'ideal' share of CPU time. )
-
-CFS's task picking logic is based on this p->wait_runtime value and it
-is thus very simple: it always tries to run the task with the largest
-p->wait_runtime value. In other words, CFS tries to run the task with
-the 'gravest need' for more CPU time. So CFS always tries to split up
-CPU time between runnable tasks as close to 'ideal multitasking
-hardware' as possible.
-
-Most of the rest of CFS's design just falls out of this really simple
-concept, with a few add-on embellishments like nice levels,
-multiprocessing and various algorithm variants to recognize sleepers.
-
-In practice it works like this: the system runs a task a bit, and when
-the task schedules (or a scheduler tick happens) the task's CPU usage is
-'accounted for': the (small) time it just spent using the physical CPU
-is deducted from p->wait_runtime. [minus the 'fair share' it would have
-gotten anyway]. Once p->wait_runtime gets low enough so that another
-task becomes the 'leftmost task' of the time-ordered rbtree it maintains
-(plus a small amount of 'granularity' distance relative to the leftmost
-task so that we do not over-schedule tasks and trash the cache) then the
-new leftmost task is picked and the current task is preempted.
-
-The rq->fair_clock value tracks the 'CPU time a runnable task would have
-fairly gotten, had it been runnable during that time'. So by using
-rq->fair_clock values we can accurately timestamp and measure the
-'expected CPU time' a task should have gotten. All runnable tasks are
-sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
-CFS picks the 'leftmost' task and sticks to it. As the system progresses
-forwards, newly woken tasks are put into the tree more and more to the
-right - slowly but surely giving a chance for every task to become the
-'leftmost task' and thus get on the CPU within a deterministic amount of
-time.
-
-Some implementation details:
-
- - the introduction of Scheduling Classes: an extensible hierarchy of
-   scheduler modules. These modules encapsulate scheduling policy
-   details and are handled by the scheduler core without the core
-   code assuming about them too much.
-
- - sched_fair.c implements the 'CFS desktop scheduler': it is a
-   replacement for the vanilla scheduler's SCHED_OTHER interactivity
-   code.
-
-   I'd like to give credit to Con Kolivas for the general approach here:
-   he has proven via RSDL/SD that 'fair scheduling' is possible and that
-   it results in better desktop scheduling. Kudos Con!
-
-   The CFS patch uses a completely different approach and implementation
-   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
-   that of RSDL/SD, which is a high standard to meet :-) Testing
-   feedback is welcome to decide this one way or another. [ and, in any
-   case, all of SD's logic could be added via a kernel/sched_sd.c module
-   as well, if Con is interested in such an approach. ]
-
-   CFS's design is quite radical: it does not use runqueues, it uses a
-   time-ordered rbtree to build a 'timeline' of future task execution,
-   and thus has no 'array switch' artifacts (by which both the vanilla
-   scheduler and RSDL/SD are affected).
-
-   CFS uses nanosecond granularity accounting and does not rely on any
-   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
-   'timeslices' and has no heuristics whatsoever. There is only one
-   central tunable (you have to switch on CONFIG_SCHED_DEBUG):
-
-         /proc/sys/kernel/sched_granularity_ns
-
-   which can be used to tune the scheduler from 'desktop' (low
-   latencies) to 'server' (good batching) workloads. It defaults to a
-   setting suitable for desktop workloads. SCHED_BATCH is handled by the
-   CFS scheduler module too.
-
-   Due to its design, the CFS scheduler is not prone to any of the
-   'attacks' that exist today against the heuristics of the stock
-   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
-   work fine and do not impact interactivity and produce the expected
-   behavior.
-
-   the CFS scheduler has a much stronger handling of nice levels and
-   SCHED_BATCH: both types of workloads should be isolated much more
-   agressively than under the vanilla scheduler.
-
-   ( another detail: due to nanosec accounting and timeline sorting,
-     sched_yield() support is very simple under CFS, and in fact under
-     CFS sched_yield() behaves much better than under any other
-     scheduler i have tested so far. )
-
- - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
-   way than the vanilla scheduler does. It uses 100 runqueues (for all
-   100 RT priority levels, instead of 140 in the vanilla scheduler)
-   and it needs no expired array.
-
- - reworked/sanitized SMP load-balancing: the runqueue-walking
-   assumptions are gone from the load-balancing code now, and
-   iterators of the scheduling modules are used. The balancing code got
-   quite a bit simpler as a result.
+                      =============
+                      CFS Scheduler
+                      =============
 
 
-Group scheduler extension to CFS
-================================
+1.  OVERVIEW
 
-Normally the scheduler operates on individual tasks and strives to provide
-fair CPU time to each task. Sometimes, it may be desirable to group tasks
-and provide fair CPU time to each such task group. For example, it may
-be desirable to first provide fair CPU time to each user on the system
-and then to each task belonging to a user.
+CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
+scheduler implemented by Ingo Molnar and merged in Linux 2.6.23.  It is the
+replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
+code.
 
-CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets
-SCHED_NORMAL/BATCH tasks be be grouped and divides CPU time fairly among such
-groups. At present, there are two (mutually exclusive) mechanisms to group
-tasks for CPU bandwidth control purpose:
+80% of CFS's design can be summed up in a single sentence: CFS basically models
+an "ideal, precise multi-tasking CPU" on real hardware.
 
- - Based on user id (CONFIG_FAIR_USER_SCHED)
-   In this option, tasks are grouped according to their user id.
- - Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED)
-   This options lets the administrator create arbitrary groups
-   of tasks, using the "cgroup" pseudo filesystem. See
-   Documentation/cgroups.txt for more information about this
-   filesystem.
+"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
+power and which can run each task at precise equal speed, in parallel, each at
+1/nr_running speed.  For example: if there are 2 tasks running, then it runs
+each at 50% physical power --- i.e., actually in parallel.
+
+On real hardware, we can run only a single task at once, so we have to
+introduce the concept of "virtual runtime."  The virtual runtime of a task
+specifies when its next timeslice would start execution on the ideal
+multi-tasking CPU described above.  In practice, the virtual runtime of a task
+is its actual runtime normalized to the total number of running tasks.
+
+
+
+2.  A FEW IMPLEMENTATION DETAILS
+
+In CFS the virtual runtime is expressed and tracked via the per-task
+p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately
+timestamp and measure the "expected CPU time" a task should have gotten.
+
+[ small detail: on "ideal" hardware, at any time all tasks would have the same
+  p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
+  would ever get "out of balance" from the "ideal" share of CPU time. ]
+
+CFS's task picking logic is based on this p->se.vruntime value and it is thus
+very simple: it always tries to run the task with the smallest p->se.vruntime
+value (i.e., the task which executed least so far).  CFS always tries to split
+up CPU time between runnable tasks as close to "ideal multitasking hardware" as
+possible.
+
+Most of the rest of CFS's design just falls out of this really simple concept,
+with a few add-on embellishments like nice levels, multiprocessing and various
+algorithm variants to recognize sleepers.
+
+
+
+3.  THE RBTREE
+
+CFS's design is quite radical: it does not use the old data structures for the
+runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
+task execution, and thus has no "array switch" artifacts (by which both the
+previous vanilla scheduler and RSDL/SD are affected).
+
+CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
+increasing value tracking the smallest vruntime among all tasks in the
+runqueue.  The total amount of work done by the system is tracked using
+min_vruntime; that value is used to place newly activated entities on the left
+side of the tree as much as possible.
+
+The total number of running tasks in the runqueue is accounted through the
+rq->cfs.load value, which is the sum of the weights of the tasks queued on the
+runqueue.
+
+CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
+p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to
+account for possible wraparounds).  CFS picks the "leftmost" task from this
+tree and sticks to it.
+As the system progresses forwards, the executed tasks are put into the tree
+more and more to the right --- slowly but surely giving a chance for every task
+to become the "leftmost task" and thus get on the CPU within a deterministic
+amount of time.
+
+Summing up, CFS works like this: it runs a task a bit, and when the task
+schedules (or a scheduler tick happens) the task's CPU usage is "accounted
+for": the (small) time it just spent using the physical CPU is added to
+p->se.vruntime.  Once p->se.vruntime gets high enough so that another task
+becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
+small amount of "granularity" distance relative to the leftmost task so that we
+do not over-schedule tasks and trash the cache), then the new leftmost task is
+picked and the current task is preempted.
+
+
+
+4.  SOME FEATURES OF CFS
+
+CFS uses nanosecond granularity accounting and does not rely on any jiffies or
+other HZ detail.  Thus the CFS scheduler has no notion of "timeslices" in the
+way the previous scheduler had, and has no heuristics whatsoever.  There is
+only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
+
+   /proc/sys/kernel/sched_granularity_ns
+
+which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
+"server" (i.e., good batching) workloads.  It defaults to a setting suitable
+for desktop workloads.  SCHED_BATCH is handled by the CFS scheduler module too.
+
+Due to its design, the CFS scheduler is not prone to any of the "attacks" that
+exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
+chew.c, ring-test.c, massive_intr.c all work fine and do not impact
+interactivity and produce the expected behavior.
+
+The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
+than the previous vanilla scheduler: both types of workloads are isolated much
+more aggressively.
+
+SMP load-balancing has been reworked/sanitized: the runqueue-walking
+assumptions are gone from the load-balancing code now, and iterators of the
+scheduling modules are used.  The balancing code got quite a bit simpler as a
+result.
+
+
+
+5.  SCHEDULING POLICIES
+
+CFS implements three scheduling policies:
+
+  - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
+    policy that is used for regular tasks.
+
+  - SCHED_BATCH: Does not preempt nearly as often as regular tasks
+    would, thereby allowing tasks to run longer and make better use of
+    caches, but at the cost of interactivity.  This is well suited for
+    batch jobs.
+
+  - SCHED_IDLE: This is even weaker than nice 19, but it's not a true
+    idle timer scheduler, in order to avoid getting into priority
+    inversion problems which would deadlock the machine.
+
+SCHED_FIFO/_RR are implemented in sched_rt.c and are as specified by
+POSIX.
+
+The command chrt from util-linux-ng 2.13.1.1 can set all of these except
+SCHED_IDLE.
+
+
+
+6.  SCHEDULING CLASSES
+
+The new CFS scheduler has been designed in such a way as to introduce
+"Scheduling Classes," an extensible hierarchy of scheduler modules.  These
+modules encapsulate scheduling policy details and are handled by the scheduler
+core without the core code assuming too much about them.
+
+sched_fair.c implements the CFS scheduler described above.
+
+sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
+the previous vanilla scheduler did.  It uses 100 runqueues (for all 100 RT
+priority levels, instead of 140 in the previous scheduler) and it needs no
+expired array.
+
+Scheduling classes are implemented through the sched_class structure, which
+contains hooks to functions that must be called whenever an interesting event
+occurs.
+
+This is the (partial) list of the hooks:
+
+ - enqueue_task(...)
+
+   Called when a task enters a runnable state.
+   It puts the scheduling entity (task) into the red-black tree and
+   increments the nr_running variable.
+
+ - dequeue_task(...)
+
+   When a task is no longer runnable, this function is called to keep the
+   corresponding scheduling entity out of the red-black tree.  It decrements
+   the nr_running variable.
+
+ - yield_task(...)
+
+   This function is basically just a dequeue followed by an enqueue, unless the
+   compat_yield sysctl is turned on; in that case, it places the scheduling
+   entity at the right-most end of the red-black tree.
+
+ - check_preempt_curr(...)
+
+   This function checks if a task that entered the runnable state should
+   preempt the currently running task.
+
+ - pick_next_task(...)
+
+   This function chooses the most appropriate task eligible to run next.
+
+ - set_curr_task(...)
+
+   This function is called when a task changes its scheduling class or changes
+   its task group.
+
+ - task_tick(...)
+
+   This function is mostly called from time tick functions; it might lead to
+   a process switch.  This drives the running preemption.
+
+ - task_new(...)
+
+   The core scheduler gives the scheduling module an opportunity to manage new
+   task startup.  The CFS scheduling module uses it for group scheduling, while
+   the scheduling module for a real-time task does not use it.
+
+
+
+7.  GROUP SCHEDULER EXTENSIONS TO CFS
+
+Normally, the scheduler operates on individual tasks and strives to provide
+fair CPU time to each task.  Sometimes, it may be desirable to group tasks and
+provide fair CPU time to each such task group.  For example, it may be
+desirable to first provide fair CPU time to each user on the system and then to
+each task belonging to a user.
+
+CONFIG_GROUP_SCHED strives to achieve exactly that.  It lets tasks be grouped
+and divides CPU time fairly among such groups.
+
+CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and
+SCHED_RR) tasks.
+
+CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and
+SCHED_BATCH) tasks.
+
+At present, there are two (mutually exclusive) mechanisms to group tasks for
+CPU bandwidth control purposes:
+
+ - Based on user id (CONFIG_USER_SCHED)
+
+   With this option, tasks are grouped according to their user id.
+
+ - Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)
+
+   This option needs CONFIG_CGROUPS to be defined, and lets the administrator
+   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
+   Documentation/cgroups.txt for more information about this filesystem.
 
 Only one of these options to group tasks can be chosen and not both.
 
-Group scheduler tunables:
-
-When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for
-each new user and a "cpu_share" file is added in that directory.
+When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new
+user and a "cpu_share" file is added in that directory.
 
 	# cd /sys/kernel/uids
 	# cat 512/cpu_share		# Display user 512's CPU share
···
 	2048
 	#
 
-CPU bandwidth between two users are divided in the ratio of their CPU shares.
-For ex: if you would like user "root" to get twice the bandwidth of user
-"guest", then set the cpu_share for both the users such that "root"'s
-cpu_share is twice "guest"'s cpu_share
+CPU bandwidth between two users is divided in the ratio of their CPU shares.
+For example: if you would like user "root" to get twice the bandwidth of user
+"guest," then set the cpu_share for both the users such that "root"'s cpu_share
+is twice "guest"'s cpu_share.
 
-
-When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created
-for each group created using the pseudo filesystem. See example steps
-below to create task groups and modify their CPU share using the "cgroups"
-pseudo filesystem
+When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each
+group created using the pseudo filesystem.  See example steps below to create
+task groups and modify their CPU share using the "cgroups" pseudo filesystem.
 
 	# mkdir /dev/cpuctl
 	# mount -t cgroup -ocpu none /dev/cpuctl
arch/alpha/kernel/smp.c (+3)
···
 	atomic_inc(&init_mm.mm_count);
 	current->active_mm = &init_mm;
 
+	/* inform the notifiers about the new cpu */
+	notify_cpu_starting(cpuid);
+
 	/* Must have completely accurate bogos. */
 	local_irq_enable();
arch/s390/kernel/smp.c
···
 	/* Enable pfault pseudo page faults on this cpu. */
 	pfault_init();
 
+	/* call cpu notifiers */
+	notify_cpu_starting(smp_processor_id());
 	/* Mark this cpu as online */
 	spin_lock(&call_lock);
 	cpu_set(smp_processor_id(), cpu_online_map);
arch/sparc/kernel/sun4d_smp.c
···
 	local_flush_cache_all();
 	local_flush_tlb_all();
 
+	notify_cpu_starting(cpuid);
 	/*
 	 * Unblock the master CPU _only_ when the scheduler state
 	 * of all secondary CPUs will be up-to-date, so after
arch/sparc/kernel/sun4m_smp.c (+2)
···
 	local_flush_cache_all();
 	local_flush_tlb_all();
 
+	notify_cpu_starting(cpuid);
+
 	/* Get our local ticker going. */
 	smp_setup_percpu_timer();
arch/x86/mach-voyager/voyager_smp.c
···
 
 	VDEBUG(("VOYAGER SMP: CPU%d, stack at about %p\n", cpuid, &cpuid));
 
+	notify_cpu_starting(cpuid);
+
 	/* enable interrupts */
 	local_irq_enable();
include/linux/completion.h (+41)
···
 
 #include <linux/wait.h>
 
+/**
+ * struct completion - structure used to maintain state for a "completion"
+ *
+ * This is the opaque structure used to maintain the state for a "completion".
+ * Completions currently use a FIFO to queue threads that have to wait for
+ * the "completion" event.
+ *
+ * See also: complete(), wait_for_completion() (and friends _timeout,
+ * _interruptible, _interruptible_timeout, and _killable), init_completion(),
+ * and macros DECLARE_COMPLETION(), DECLARE_COMPLETION_ONSTACK(), and
+ * INIT_COMPLETION().
+ */
 struct completion {
 	unsigned int done;
 	wait_queue_head_t wait;
···
 #define COMPLETION_INITIALIZER_ONSTACK(work) \
 	({ init_completion(&work); work; })
 
+/**
+ * DECLARE_COMPLETION: - declare and initialize a completion structure
+ * @work: identifier for the completion structure
+ *
+ * This macro declares and initializes a completion structure. Generally used
+ * for static declarations. You should use the _ONSTACK variant for automatic
+ * variables.
+ */
 #define DECLARE_COMPLETION(work) \
 	struct completion work = COMPLETION_INITIALIZER(work)
 
···
  * completions - so we use the _ONSTACK() variant for those that
  * are on the kernel stack:
  */
+/**
+ * DECLARE_COMPLETION_ONSTACK: - declare and initialize a completion structure
+ * @work: identifier for the completion structure
+ *
+ * This macro declares and initializes a completion structure on the kernel
+ * stack.
+ */
 #ifdef CONFIG_LOCKDEP
 # define DECLARE_COMPLETION_ONSTACK(work) \
 	struct completion work = COMPLETION_INITIALIZER_ONSTACK(work)
···
 # define DECLARE_COMPLETION_ONSTACK(work) DECLARE_COMPLETION(work)
 #endif
 
+/**
+ * init_completion: - Initialize a dynamically allocated completion
+ * @x: completion structure that is to be initialized
+ *
+ * This inline function will initialize a dynamically created completion
+ * structure.
+ */
 static inline void init_completion(struct completion *x)
 {
 	x->done = 0;
···
 extern void complete(struct completion *);
 extern void complete_all(struct completion *);
 
+/**
+ * INIT_COMPLETION: - reinitialize a completion structure
+ * @x: completion structure to be reinitialized
+ *
+ * This macro should be used to reinitialize a completion structure so it can
+ * be reused. This is especially important after complete_all() is used.
+ */
#define INIT_COMPLETION(x)	((x).done = 0)
 
include/linux/cpu.h (+1)
···
 #endif
 
 int cpu_up(unsigned int cpu);
+void notify_cpu_starting(unsigned int cpu);
 extern void cpu_hotplug_init(void);
 extern void cpu_maps_update_begin(void);
 extern void cpu_maps_update_done(void);
include/linux/notifier.h (+9/-1)
···
 #define CPU_DOWN_FAILED		0x0006 /* CPU (unsigned)v NOT going down */
 #define CPU_DEAD		0x0007 /* CPU (unsigned)v dead */
 #define CPU_DYING		0x0008 /* CPU (unsigned)v not running any task,
-					* not handling interrupts, soon dead */
+					* not handling interrupts, soon dead.
+					* Called on the dying cpu, interrupts
+					* are already disabled. Must not
+					* sleep, must not fail */
 #define CPU_POST_DEAD		0x0009 /* CPU (unsigned)v dead, cpu_hotplug
 					* lock is dropped */
+#define CPU_STARTING		0x000A /* CPU (unsigned)v soon running.
+					* Called on the new cpu, just before
+					* enabling interrupts. Must not sleep,
+					* must not fail */
 
 /* Used for CPU hotplug events occuring while tasks are frozen due to a suspend
  * operation in progress
···
 #define CPU_DOWN_FAILED_FROZEN	(CPU_DOWN_FAILED | CPU_TASKS_FROZEN)
 #define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)
 #define CPU_DYING_FROZEN	(CPU_DYING | CPU_TASKS_FROZEN)
+#define CPU_STARTING_FROZEN	(CPU_STARTING | CPU_TASKS_FROZEN)
 
 /* Hibernation and suspend events */
 #define PM_HIBERNATION_PREPARE	0x0001 /* Going to hibernate */
include/linux/proportions.h (+1/-1)
···
 	 * snapshot of the last seen global state
 	 * and a lock protecting this state
 	 */
-	int shift;
 	unsigned long period;
+	int shift;
 	spinlock_t lock;	/* protect the snapshot state */
 };
 
include/linux/sched.h (+3/-3)
···
 	 * - everyone except group_exit_task is stopped during signal delivery
 	 *   of fatal signals, group_exit_task processes the signal.
 	 */
-	struct task_struct	*group_exit_task;
 	int			notify_count;
+	struct task_struct	*group_exit_task;
 
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
···
 	void (*yield_task) (struct rq *rq);
 	int  (*select_task_rq)(struct task_struct *p, int sync);
 
-	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
+	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);
 
 	struct task_struct * (*pick_next_task) (struct rq *rq);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
···
 
 struct sched_rt_entity {
 	struct list_head	run_list;
-	unsigned int		time_slice;
 	unsigned long		timeout;
+	unsigned int		time_slice;
 	int			nr_cpus_allowed;
 
 	struct sched_rt_entity	*back;
kernel/cpu.c (+22/-2)
···
 	struct take_cpu_down_param *param = _param;
 	int err;
 
-	raw_notifier_call_chain(&cpu_chain, CPU_DYING | param->mod,
-				param->hcpu);
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0)
 		return err;
+
+	raw_notifier_call_chain(&cpu_chain, CPU_DYING | param->mod,
+				param->hcpu);
 
 	/* Force idle task to run as soon as we yield: it should
 	   immediately notice cpu is offline and die quickly. */
···
 	cpu_maps_update_done();
 }
 #endif /* CONFIG_PM_SLEEP_SMP */
+
+/**
+ * notify_cpu_starting(cpu) - call the CPU_STARTING notifiers
+ * @cpu: cpu that just started
+ *
+ * This function calls the cpu_chain notifiers with CPU_STARTING.
+ * It must be called by the arch code on the new cpu, before the new cpu
+ * enables interrupts and before the "boot" cpu returns from __cpu_up().
+ */
+void notify_cpu_starting(unsigned int cpu)
+{
+	unsigned long val = CPU_STARTING;
+
+#ifdef CONFIG_PM_SLEEP_SMP
+	if (cpu_isset(cpu, frozen_cpus))
+		val = CPU_STARTING_FROZEN;
+#endif /* CONFIG_PM_SLEEP_SMP */
+	raw_notifier_call_chain(&cpu_chain, val, (void *)(long)cpu);
+}
 
 #endif /* CONFIG_SMP */
kernel/cpuset.c (+1/-1)
···
 * that has tasks along with an empty 'mems'. But if we did see such
 * a cpuset, we'd handle it just like we do if its 'cpus' was empty.
 */
-static void scan_for_empty_cpusets(const struct cpuset *root)
+static void scan_for_empty_cpusets(struct cpuset *root)
 {
 	LIST_HEAD(queue);
 	struct cpuset *cp;	/* scans cpusets being updated */
kernel/sched.c (+243/-150)
···
 	rt_b->rt_period_timer.cb_mode = HRTIMER_CB_IRQSAFE_UNLOCKED;
 }
 
+static inline int rt_bandwidth_enabled(void)
+{
+	return sysctl_sched_rt_runtime >= 0;
+}
+
 static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 {
 	ktime_t now;
 
-	if (rt_b->rt_runtime == RUNTIME_INF)
+	if (rt_bandwidth_enabled() && rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
 	if (hrtimer_active(&rt_b->rt_period_timer))
···
 static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity);
 static DEFINE_PER_CPU(struct rt_rq, init_rt_rq) ____cacheline_aligned_in_smp;
 #endif /* CONFIG_RT_GROUP_SCHED */
-#else /* !CONFIG_FAIR_GROUP_SCHED */
+#else /* !CONFIG_USER_SCHED */
 #define root_task_group init_task_group
-#endif /* CONFIG_FAIR_GROUP_SCHED */
+#endif /* CONFIG_USER_SCHED */
 
 /* task_group_lock serializes add/remove of task groups and also changes to
  * a task group's cpu shares.
···
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
-static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
+static inline void check_preempt_curr(struct rq *rq, struct task_struct *p, int sync)
 {
-	rq->curr->sched_class->check_preempt_curr(rq, p);
+	rq->curr->sched_class->check_preempt_curr(rq, p, sync);
 }
 
 static inline int cpu_of(struct rq *rq)
···
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL);
 }
 
-static void init_hrtick(void)
+static inline void init_hrtick(void)
 {
 }
 #endif /* CONFIG_SMP */
···
 	rq->hrtick_timer.function = hrtick;
 	rq->hrtick_timer.cb_mode = HRTIMER_CB_IRQSAFE_PERCPU;
 }
-#else
+#else	/* CONFIG_SCHED_HRTICK */
 static inline void hrtick_clear(struct rq *rq)
 {
 }
···
 static inline void init_hrtick(void)
 {
 }
-#endif
+#endif	/* CONFIG_SCHED_HRTICK */
 
 /*
  * resched_task - mark a task 'to be rescheduled now'.
···
 	update_load_sub(&rq->load, load);
 }
 
+#if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED)
+typedef int (*tg_visitor)(struct task_group *, void *);
+
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ */
+static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
+{
+	struct task_group *parent, *child;
+	int ret;
+
+	rcu_read_lock();
+	parent = &root_task_group;
+down:
+	ret = (*down)(parent, data);
+	if (ret)
+		goto out_unlock;
+	list_for_each_entry_rcu(child, &parent->children, siblings) {
+		parent = child;
+		goto down;
+
+up:
+		continue;
+	}
+	ret = (*up)(parent, data);
+	if (ret)
+		goto out_unlock;
+
+	child = parent;
+	parent = parent->parent;
+	if (parent)
+		goto up;
+out_unlock:
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int tg_nop(struct task_group *tg, void *data)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_SMP
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
···
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-
-typedef void (*tg_visitor)(struct task_group *, int, struct sched_domain *);
-
-/*
- * Iterate the full tree, calling @down when first entering a node and @up when
- * leaving it for the final time.
- */
-static void
-walk_tg_tree(tg_visitor down, tg_visitor up, int cpu, struct sched_domain *sd)
-{
-	struct task_group *parent, *child;
-
-	rcu_read_lock();
-	parent = &root_task_group;
-down:
-	(*down)(parent, cpu, sd);
-	list_for_each_entry_rcu(child, &parent->children, siblings) {
-		parent = child;
-		goto down;
-
-up:
-		continue;
-	}
-	(*up)(parent, cpu, sd);
-
-	child = parent;
-	parent = parent->parent;
-	if (parent)
-		goto up;
-	rcu_read_unlock();
-}
 
 static void __set_se_shares(struct sched_entity *se, unsigned long shares);
 
···
 * This needs to be done in a bottom-up fashion because the rq weight of a
 * parent group depends on the shares of its child groups.
 */
-static void
-tg_shares_up(struct task_group *tg, int cpu, struct sched_domain *sd)
+static int tg_shares_up(struct task_group *tg, void *data)
 {
 	unsigned long rq_weight = 0;
 	unsigned long shares = 0;
+	struct sched_domain *sd = data;
 	int i;
 
 	for_each_cpu_mask(i, sd->span) {
···
 		__update_group_shares_cpu(tg, i, shares, rq_weight);
 		spin_unlock_irqrestore(&rq->lock, flags);
 	}
+
+	return 0;
 }
 
 /*
···
 * This needs to be done in a top-down fashion because the load of a child
 * group is a fraction of its parents load.
 */
-static void
-tg_load_down(struct task_group *tg, int cpu, struct sched_domain *sd)
+static int tg_load_down(struct task_group *tg, void *data)
 {
 	unsigned long load;
+	long cpu = (long)data;
 
 	if (!tg->parent) {
 		load = cpu_rq(cpu)->load.weight;
···
 	}
 
 	tg->cfs_rq[cpu]->h_load = load;
-}
 
-static void
-tg_nop(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
+	return 0;
 }
 
 static void update_shares(struct sched_domain *sd)
···
 
 	if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
 		sd->last_update = now;
-		walk_tg_tree(tg_nop, tg_shares_up, 0, sd);
+		walk_tg_tree(tg_nop, tg_shares_up, sd);
 	}
 }
 
···
 	spin_lock(&rq->lock);
 }
 
-static void update_h_load(int cpu)
+static void update_h_load(long cpu)
 {
-	walk_tg_tree(tg_load_down, tg_nop, cpu, NULL);
+	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
 }
 
 #else
···
 	running = task_running(rq, p);
 	on_rq = p->se.on_rq;
 	ncsw = 0;
-	if (!match_state || p->state == match_state) {
-		ncsw = p->nivcsw + p->nvcsw;
-		if (unlikely(!ncsw))
-			ncsw = 1;
-	}
+	if (!match_state || p->state == match_state)
+		ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 	task_rq_unlock(rq, &flags);
 
 	/*
···
 	trace_mark(kernel_sched_wakeup,
 		"pid %d state %ld ## rq %p task %p rq->curr %p",
 		p->pid, p->state, rq, p, rq->curr);
-	check_preempt_curr(rq, p);
+	check_preempt_curr(rq, p, sync);
 
 	p->state = TASK_RUNNING;
 #ifdef CONFIG_SMP
···
 	trace_mark(kernel_sched_wakeup_new,
 		"pid %d state %ld ## rq %p task %p rq->curr %p",
 		p->pid, p->state, rq, p, rq->curr);
-	check_preempt_curr(rq, p);
+	check_preempt_curr(rq, p, 0);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_wake_up)
 		p->sched_class->task_wake_up(rq, p);
···
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
 	 * to be always true for them.
 	 */
-	check_preempt_curr(this_rq, p);
+	check_preempt_curr(this_rq, p, 0);
 }
 
 /*
···
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
 
+/**
+ * complete: - signals
a single thread waiting on this completion46324632+ * @x: holds the state of this particular completion46334633+ *46344634+ * This will wake up a single thread waiting on this completion. Threads will be46354635+ * awakened in the same order in which they were queued.46364636+ *46374637+ * See also complete_all(), wait_for_completion() and related routines.46384638+ */46454639void complete(struct completion *x)46464640{46474641 unsigned long flags;···46624638}46634639EXPORT_SYMBOL(complete);4664464046414641+/**46424642+ * complete_all: - signals all threads waiting on this completion46434643+ * @x: holds the state of this particular completion46444644+ *46454645+ * This will wake up all threads waiting on this particular completion event.46464646+ */46654647void complete_all(struct completion *x)46664648{46674649 unsigned long flags;···46884658 wait.flags |= WQ_FLAG_EXCLUSIVE;46894659 __add_wait_queue_tail(&x->wait, &wait);46904660 do {46914691- if ((state == TASK_INTERRUPTIBLE &&46924692- signal_pending(current)) ||46934693- (state == TASK_KILLABLE &&46944694- fatal_signal_pending(current))) {46614661+ if (signal_pending_state(state, current)) {46954662 timeout = -ERESTARTSYS;46964663 break;46974664 }···47164689 return timeout;47174690}4718469146924692+/**46934693+ * wait_for_completion: - waits for completion of a task46944694+ * @x: holds the state of this particular completion46954695+ *46964696+ * This waits to be signaled for completion of a specific task. It is NOT46974697+ * interruptible and there is no timeout.46984698+ *46994699+ * See also similar routines (i.e. wait_for_completion_timeout()) with timeout47004700+ * and interrupt capability. 
Also see complete().47014701+ */47194702void __sched wait_for_completion(struct completion *x)47204703{47214704 wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE);47224705}47234706EXPORT_SYMBOL(wait_for_completion);4724470747084708+/**47094709+ * wait_for_completion_timeout: - waits for completion of a task (w/timeout)47104710+ * @x: holds the state of this particular completion47114711+ * @timeout: timeout value in jiffies47124712+ *47134713+ * This waits for either a completion of a specific task to be signaled or for a47144714+ * specified timeout to expire. The timeout is in jiffies. It is not47154715+ * interruptible.47164716+ */47254717unsigned long __sched47264718wait_for_completion_timeout(struct completion *x, unsigned long timeout)47274719{···47484702}47494703EXPORT_SYMBOL(wait_for_completion_timeout);4750470447054705+/**47064706+ * wait_for_completion_interruptible: - waits for completion of a task (w/intr)47074707+ * @x: holds the state of this particular completion47084708+ *47094709+ * This waits for completion of a specific task to be signaled. It is47104710+ * interruptible.47114711+ */47514712int __sched wait_for_completion_interruptible(struct completion *x)47524713{47534714 long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_INTERRUPTIBLE);···47644711}47654712EXPORT_SYMBOL(wait_for_completion_interruptible);4766471347144714+/**47154715+ * wait_for_completion_interruptible_timeout: - waits for completion (w/(to,intr))47164716+ * @x: holds the state of this particular completion47174717+ * @timeout: timeout value in jiffies47184718+ *47194719+ * This waits for either a completion of a specific task to be signaled or for a47204720+ * specified timeout to expire. It is interruptible. 
The timeout is in jiffies.47214721+ */47674722unsigned long __sched47684723wait_for_completion_interruptible_timeout(struct completion *x,47694724 unsigned long timeout)···47804719}47814720EXPORT_SYMBOL(wait_for_completion_interruptible_timeout);4782472147224722+/**47234723+ * wait_for_completion_killable: - waits for completion of a task (killable)47244724+ * @x: holds the state of this particular completion47254725+ *47264726+ * This waits to be signaled for completion of a specific task. It can be47274727+ * interrupted by a kill signal.47284728+ */47834729int __sched wait_for_completion_killable(struct completion *x)47844730{47854731 long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_KILLABLE);···51895121 * Do not allow realtime tasks into groups that have no runtime51905122 * assigned.51915123 */51925192- if (rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0)51245124+ if (rt_bandwidth_enabled() && rt_policy(policy) &&51255125+ task_group(p)->rt_bandwidth.rt_runtime == 0)51935126 return -EPERM;51945127#endif51955128···60265957 set_task_cpu(p, dest_cpu);60275958 if (on_rq) {60285959 activate_task(rq_dest, p, 0);60296029- check_preempt_curr(rq_dest, p);59605960+ check_preempt_curr(rq_dest, p, 0);60305961 }60315962done:60325963 ret = 1;···83118242#ifdef in_atomic83128243 static unsigned long prev_jiffy; /* ratelimiting */8313824483148314- if ((in_atomic() || irqs_disabled()) &&83158315- system_state == SYSTEM_RUNNING && !oops_in_progress) {83168316- if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)83178317- return;83188318- prev_jiffy = jiffies;83198319- printk(KERN_ERR "BUG: sleeping function called from invalid"83208320- " context at %s:%d\n", file, line);83218321- printk("in_atomic():%d, irqs_disabled():%d\n",83228322- in_atomic(), irqs_disabled());83238323- debug_show_held_locks(current);83248324- if (irqs_disabled())83258325- print_irqtrace_events(current);83268326- dump_stack();83278327- }82458245+ if ((!in_atomic() && 
!irqs_disabled()) ||82468246+ system_state != SYSTEM_RUNNING || oops_in_progress)82478247+ return;82488248+ if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)82498249+ return;82508250+ prev_jiffy = jiffies;82518251+82528252+ printk(KERN_ERR82538253+ "BUG: sleeping function called from invalid context at %s:%d\n",82548254+ file, line);82558255+ printk(KERN_ERR82568256+ "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",82578257+ in_atomic(), irqs_disabled(),82588258+ current->pid, current->comm);82598259+82608260+ debug_show_held_locks(current);82618261+ if (irqs_disabled())82628262+ print_irqtrace_events(current);82638263+ dump_stack();83288264#endif83298265}83308266EXPORT_SYMBOL(__might_sleep);···88278753static unsigned long to_ratio(u64 period, u64 runtime)88288754{88298755 if (runtime == RUNTIME_INF)88308830- return 1ULL << 16;87568756+ return 1ULL << 20;8831875788328832- return div64_u64(runtime << 16, period);87588758+ return div64_u64(runtime << 20, period);88338759}88348834-88358835-#ifdef CONFIG_CGROUP_SCHED88368836-static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)88378837-{88388838- struct task_group *tgi, *parent = tg->parent;88398839- unsigned long total = 0;88408840-88418841- if (!parent) {88428842- if (global_rt_period() < period)88438843- return 0;88448844-88458845- return to_ratio(period, runtime) <88468846- to_ratio(global_rt_period(), global_rt_runtime());88478847- }88488848-88498849- if (ktime_to_ns(parent->rt_bandwidth.rt_period) < period)88508850- return 0;88518851-88528852- rcu_read_lock();88538853- list_for_each_entry_rcu(tgi, &parent->children, siblings) {88548854- if (tgi == tg)88558855- continue;88568856-88578857- total += to_ratio(ktime_to_ns(tgi->rt_bandwidth.rt_period),88588858- tgi->rt_bandwidth.rt_runtime);88598859- }88608860- rcu_read_unlock();88618861-88628862- return total + to_ratio(period, runtime) <=88638863- to_ratio(ktime_to_ns(parent->rt_bandwidth.rt_period),88648864- 
parent->rt_bandwidth.rt_runtime);88658865-}88668866-#elif defined CONFIG_USER_SCHED88678867-static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)88688868-{88698869- struct task_group *tgi;88708870- unsigned long total = 0;88718871- unsigned long global_ratio =88728872- to_ratio(global_rt_period(), global_rt_runtime());88738873-88748874- rcu_read_lock();88758875- list_for_each_entry_rcu(tgi, &task_groups, list) {88768876- if (tgi == tg)88778877- continue;88788878-88798879- total += to_ratio(ktime_to_ns(tgi->rt_bandwidth.rt_period),88808880- tgi->rt_bandwidth.rt_runtime);88818881- }88828882- rcu_read_unlock();88838883-88848884- return total + to_ratio(period, runtime) < global_ratio;88858885-}88868886-#endif8887876088888761/* Must be called with tasklist_lock held */88898762static inline int tg_has_rt_tasks(struct task_group *tg)88908763{88918764 struct task_struct *g, *p;87658765+88928766 do_each_thread(g, p) {88938767 if (rt_task(p) && rt_rq_of_se(&p->rt)->tg == tg)88948768 return 1;88958769 } while_each_thread(g, p);87708770+88968771 return 0;87728772+}87738773+87748774+struct rt_schedulable_data {87758775+ struct task_group *tg;87768776+ u64 rt_period;87778777+ u64 rt_runtime;87788778+};87798779+87808780+static int tg_schedulable(struct task_group *tg, void *data)87818781+{87828782+ struct rt_schedulable_data *d = data;87838783+ struct task_group *child;87848784+ unsigned long total, sum = 0;87858785+ u64 period, runtime;87868786+87878787+ period = ktime_to_ns(tg->rt_bandwidth.rt_period);87888788+ runtime = tg->rt_bandwidth.rt_runtime;87898789+87908790+ if (tg == d->tg) {87918791+ period = d->rt_period;87928792+ runtime = d->rt_runtime;87938793+ }87948794+87958795+ /*87968796+ * Cannot have more runtime than the period.87978797+ */87988798+ if (runtime > period && runtime != RUNTIME_INF)87998799+ return -EINVAL;88008800+88018801+ /*88028802+ * Ensure we don't starve existing RT tasks.88038803+ */88048804+ if (rt_bandwidth_enabled() && 
!runtime && tg_has_rt_tasks(tg))88058805+ return -EBUSY;88068806+88078807+ total = to_ratio(period, runtime);88088808+88098809+ /*88108810+ * Nobody can have more than the global setting allows.88118811+ */88128812+ if (total > to_ratio(global_rt_period(), global_rt_runtime()))88138813+ return -EINVAL;88148814+88158815+ /*88168816+ * The sum of our children's runtime should not exceed our own.88178817+ */88188818+ list_for_each_entry_rcu(child, &tg->children, siblings) {88198819+ period = ktime_to_ns(child->rt_bandwidth.rt_period);88208820+ runtime = child->rt_bandwidth.rt_runtime;88218821+88228822+ if (child == d->tg) {88238823+ period = d->rt_period;88248824+ runtime = d->rt_runtime;88258825+ }88268826+88278827+ sum += to_ratio(period, runtime);88288828+ }88298829+88308830+ if (sum > total)88318831+ return -EINVAL;88328832+88338833+ return 0;88348834+}88358835+88368836+static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)88378837+{88388838+ struct rt_schedulable_data data = {88398839+ .tg = tg,88408840+ .rt_period = period,88418841+ .rt_runtime = runtime,88428842+ };88438843+88448844+ return walk_tg_tree(tg_schedulable, tg_nop, &data);88978845}8898884688998847static int tg_set_bandwidth(struct task_group *tg,···8925882989268830 mutex_lock(&rt_constraints_mutex);89278831 read_lock(&tasklist_lock);89288928- if (rt_runtime == 0 && tg_has_rt_tasks(tg)) {89298929- err = -EBUSY;88328832+ err = __rt_schedulable(tg, rt_period, rt_runtime);88338833+ if (err)89308834 goto unlock;89318931- }89328932- if (!__rt_schedulable(tg, rt_period, rt_runtime)) {89338933- err = -EINVAL;89348934- goto unlock;89358935- }8936883589378836 spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);89388837 tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);···8996890589978906static int sched_rt_global_constraints(void)89988907{89998999- struct task_group *tg = &root_task_group;90009000- u64 rt_runtime, rt_period;89088908+ u64 runtime, period;90018909 int ret = 
0;9002891090038911 if (sysctl_sched_rt_period <= 0)90048912 return -EINVAL;9005891390069006- rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);90079007- rt_runtime = tg->rt_bandwidth.rt_runtime;89148914+ runtime = global_rt_runtime();89158915+ period = global_rt_period();89168916+89178917+ /*89188918+ * Sanity check on the sysctl variables.89198919+ */89208920+ if (runtime > period && runtime != RUNTIME_INF)89218921+ return -EINVAL;9008892290098923 mutex_lock(&rt_constraints_mutex);90109010- if (!__rt_schedulable(tg, rt_period, rt_runtime))90119011- ret = -EINVAL;89248924+ read_lock(&tasklist_lock);89258925+ ret = __rt_schedulable(NULL, 0, 0);89268926+ read_unlock(&tasklist_lock);90128927 mutex_unlock(&rt_constraints_mutex);9013892890148929 return ret;···9088899190898992 if (!cgrp->parent) {90908993 /* This is early initialization for the top cgroup */90919091- init_task_group.css.cgroup = cgrp;90928994 return &init_task_group.css;90938995 }90948996···90958999 tg = sched_create_group(parent);90969000 if (IS_ERR(tg))90979001 return ERR_PTR(-ENOMEM);90989098-90999099- /* Bind the cgroup to task_group object we just created */91009100- tg->css.cgroup = cgrp;9101900291029003 return &tg->css;91039004}
+51-171
kernel/sched_fair.c
···409409}410410411411/*412412- * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in413413- * that it favours >=0 over <0.414414- *415415- * -20 |416416- * |417417- * 0 --------+-------418418- * .'419419- * 19 .'420420- *421421- */422422-static unsigned long423423-calc_delta_asym(unsigned long delta, struct sched_entity *se)424424-{425425- struct load_weight lw = {426426- .weight = NICE_0_LOAD,427427- .inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT)428428- };429429-430430- for_each_sched_entity(se) {431431- struct load_weight *se_lw = &se->load;432432- unsigned long rw = cfs_rq_of(se)->load.weight;433433-434434-#ifdef CONFIG_FAIR_SCHED_GROUP435435- struct cfs_rq *cfs_rq = se->my_q;436436- struct task_group *tg = NULL437437-438438- if (cfs_rq)439439- tg = cfs_rq->tg;440440-441441- if (tg && tg->shares < NICE_0_LOAD) {442442- /*443443- * scale shares to what it would have been had444444- * tg->weight been NICE_0_LOAD:445445- *446446- * weight = 1024 * shares / tg->weight447447- */448448- lw.weight *= se->load.weight;449449- lw.weight /= tg->shares;450450-451451- lw.inv_weight = 0;452452-453453- se_lw = &lw;454454- rw += lw.weight - se->load.weight;455455- } else456456-#endif457457-458458- if (se->load.weight < NICE_0_LOAD) {459459- se_lw = &lw;460460- rw += NICE_0_LOAD - se->load.weight;461461- }462462-463463- delta = calc_delta_mine(delta, rw, se_lw);464464- }465465-466466- return delta;467467-}468468-469469-/*470412 * Update the current task's runtime statistics. 
Skip current tasks that471413 * are not in our scheduling class.472414 */···528586 update_load_add(&cfs_rq->load, se->load.weight);529587 if (!parent_entity(se))530588 inc_cpu_load(rq_of(cfs_rq), se->load.weight);531531- if (entity_is_task(se))589589+ if (entity_is_task(se)) {532590 add_cfs_task_weight(cfs_rq, se->load.weight);591591+ list_add(&se->group_node, &cfs_rq->tasks);592592+ }533593 cfs_rq->nr_running++;534594 se->on_rq = 1;535535- list_add(&se->group_node, &cfs_rq->tasks);536595}537596538597static void···542599 update_load_sub(&cfs_rq->load, se->load.weight);543600 if (!parent_entity(se))544601 dec_cpu_load(rq_of(cfs_rq), se->load.weight);545545- if (entity_is_task(se))602602+ if (entity_is_task(se)) {546603 add_cfs_task_weight(cfs_rq, -se->load.weight);604604+ list_del_init(&se->group_node);605605+ }547606 cfs_rq->nr_running--;548607 se->on_rq = 0;549549- list_del_init(&se->group_node);550608}551609552610static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)···10291085 long wl, long wg)10301086{10311087 struct sched_entity *se = tg->se[cpu];10321032- long more_w;1033108810341089 if (!tg->parent)10351090 return wl;···10401097 if (!wl && sched_feat(ASYM_EFF_LOAD))10411098 return wl;1042109910431043- /*10441044- * Instead of using this increment, also add the difference10451045- * between when the shares were last updated and now.10461046- */10471047- more_w = se->my_q->load.weight - se->my_q->rq_weight;10481048- wl += more_w;10491049- wg += more_w;10501050-10511100 for_each_sched_entity(se) {10521052-#define D(n) (likely(n) ? 
(n) : 1)10531053-10541101 long S, rw, s, a, b;11021102+ long more_w;11031103+11041104+ /*11051105+ * Instead of using this increment, also add the difference11061106+ * between when the shares were last updated and now.11071107+ */11081108+ more_w = se->my_q->load.weight - se->my_q->rq_weight;11091109+ wl += more_w;11101110+ wg += more_w;1055111110561112 S = se->my_q->tg->shares;10571113 s = se->my_q->shares;···10591117 a = S*(rw + wl);10601118 b = S*rw + s*wg;1061111910621062- wl = s*(a-b)/D(b);11201120+ wl = s*(a-b);11211121+11221122+ if (likely(b))11231123+ wl /= b;11241124+10631125 /*10641126 * Assume the group is already running and will10651127 * thus already be accounted for in the weight.···10721126 * alter the group weight.10731127 */10741128 wg = 0;10751075-#undef D10761129 }1077113010781131 return wl;···10881143#endif1089114410901145static int10911091-wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,11461146+wake_affine(struct sched_domain *this_sd, struct rq *this_rq,10921147 struct task_struct *p, int prev_cpu, int this_cpu, int sync,10931148 int idx, unsigned long load, unsigned long this_load,10941149 unsigned int imbalance)···11361191 schedstat_inc(p, se.nr_wakeups_affine_attempts);11371192 tl_per_task = cpu_avg_load_per_task(this_cpu);1138119311391139- if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||11401140- balanced) {11941194+ if (balanced || (tl <= load && tl + target_load(prev_cpu, idx) <=11951195+ tl_per_task)) {11411196 /*11421197 * This domain has SD_WAKE_AFFINE and11431198 * p is cache cold in this domain, and···11561211 struct sched_domain *sd, *this_sd = NULL;11571212 int prev_cpu, this_cpu, new_cpu;11581213 unsigned long load, this_load;11591159- struct rq *rq, *this_rq;12141214+ struct rq *this_rq;11601215 unsigned int imbalance;11611216 int idx;1162121711631218 prev_cpu = task_cpu(p);11641164- rq = task_rq(p);11651219 this_cpu = smp_processor_id();11661220 this_rq = 
cpu_rq(this_cpu);11671221 new_cpu = prev_cpu;1168122212231223+ if (prev_cpu == this_cpu)12241224+ goto out;11691225 /*11701226 * 'this_sd' is the first domain that both11711227 * this_cpu and prev_cpu are present in:···11941248 load = source_load(prev_cpu, idx);11951249 this_load = target_load(this_cpu, idx);1196125011971197- if (wake_affine(rq, this_sd, this_rq, p, prev_cpu, this_cpu, sync, idx,12511251+ if (wake_affine(this_sd, this_rq, p, prev_cpu, this_cpu, sync, idx,11981252 load, this_load, imbalance))11991253 return this_cpu;12001200-12011201- if (prev_cpu == this_cpu)12021202- goto out;1203125412041255 /*12051256 * Start passive balancing when half the imbalance_pct···12241281 * + nice tasks.12251282 */12261283 if (sched_feat(ASYM_GRAN))12271227- gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);12281228- else12291229- gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);12841284+ gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load);1230128512311286 return gran;12321287}1233128812341289/*12351235- * Should 'se' preempt 'curr'.12361236- *12371237- * |s112381238- * |s212391239- * |s312401240- * g12411241- * |<--->|c12421242- *12431243- * w(c, s1) = -112441244- * w(c, s2) = 012451245- * w(c, s3) = 112461246- *12471247- */12481248-static int12491249-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)12501250-{12511251- s64 gran, vdiff = curr->vruntime - se->vruntime;12521252-12531253- if (vdiff < 0)12541254- return -1;12551255-12561256- gran = wakeup_gran(curr);12571257- if (vdiff > gran)12581258- return 1;12591259-12601260- return 0;12611261-}12621262-12631263-/* return depth at which a sched entity is present in the hierarchy */12641264-static inline int depth_se(struct sched_entity *se)12651265-{12661266- int depth = 0;12671267-12681268- for_each_sched_entity(se)12691269- depth++;12701270-12711271- return depth;12721272-}12731273-12741274-/*12751290 * Preempt the current task with a newly woken task if needed:12761291 
*/12771277-static void check_preempt_wakeup(struct rq *rq, struct task_struct *p)12921292+static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)12781293{12791294 struct task_struct *curr = rq->curr;12801295 struct cfs_rq *cfs_rq = task_cfs_rq(curr);12811296 struct sched_entity *se = &curr->se, *pse = &p->se;12821282- int se_depth, pse_depth;12971297+ s64 delta_exec;1283129812841299 if (unlikely(rt_prio(p->prio))) {12851300 update_rq_clock(rq);···12521351 cfs_rq_of(pse)->next = pse;1253135212541353 /*13541354+ * We can come here with TIF_NEED_RESCHED already set from new task13551355+ * wake up path.13561356+ */13571357+ if (test_tsk_need_resched(curr))13581358+ return;13591359+13601360+ /*12551361 * Batch tasks do not preempt (their preemption is driven by12561362 * the tick):12571363 */···12681360 if (!sched_feat(WAKEUP_PREEMPT))12691361 return;1270136212711271- /*12721272- * preemption test can be made between sibling entities who are in the12731273- * same cfs_rq i.e who have a common parent. 
Walk up the hierarchy of12741274- * both tasks until we find their ancestors who are siblings of common12751275- * parent.12761276- */12771277-12781278- /* First walk up until both entities are at same depth */12791279- se_depth = depth_se(se);12801280- pse_depth = depth_se(pse);12811281-12821282- while (se_depth > pse_depth) {12831283- se_depth--;12841284- se = parent_entity(se);13631363+ if (sched_feat(WAKEUP_OVERLAP) && sync &&13641364+ se->avg_overlap < sysctl_sched_migration_cost &&13651365+ pse->avg_overlap < sysctl_sched_migration_cost) {13661366+ resched_task(curr);13671367+ return;12851368 }1286136912871287- while (pse_depth > se_depth) {12881288- pse_depth--;12891289- pse = parent_entity(pse);12901290- }12911291-12921292- while (!is_same_group(se, pse)) {12931293- se = parent_entity(se);12941294- pse = parent_entity(pse);12951295- }12961296-12971297- if (wakeup_preempt_entity(se, pse) == 1)13701370+ delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;13711371+ if (delta_exec > wakeup_gran(pse))12981372 resched_task(curr);12991373}13001374···13351445 if (next == &cfs_rq->tasks)13361446 return NULL;1337144713381338- /* Skip over entities that are not tasks */13391339- do {13401340- se = list_entry(next, struct sched_entity, group_node);13411341- next = next->next;13421342- } while (next != &cfs_rq->tasks && !entity_is_task(se));13431343-13441344- if (next == &cfs_rq->tasks)13451345- return NULL;13461346-13471347- cfs_rq->balance_iterator = next;13481348-13491349- if (entity_is_task(se))13501350- p = task_of(se);14481448+ se = list_entry(next, struct sched_entity, group_node);14491449+ p = task_of(se);14501450+ cfs_rq->balance_iterator = next->next;1351145113521452 return p;13531453}···13871507 rcu_read_lock();13881508 update_h_load(busiest_cpu);1389150913901390- list_for_each_entry(tg, &task_groups, list) {15101510+ list_for_each_entry_rcu(tg, &task_groups, list) {13911511 struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];13921512 unsigned 
long busiest_h_load = busiest_cfs_rq->h_load;13931513 unsigned long busiest_weight = busiest_cfs_rq->load.weight;···15001620 * 'current' within the tree based on its new key value.15011621 */15021622 swap(curr->vruntime, se->vruntime);16231623+ resched_task(rq->curr);15031624 }1504162515051626 enqueue_task_fair(rq, p, 0);15061506- resched_task(rq->curr);15071627}1508162815091629/*···15221642 if (p->prio > oldprio)15231643 resched_task(rq->curr);15241644 } else15251525- check_preempt_curr(rq, p);16451645+ check_preempt_curr(rq, p, 0);15261646}1527164715281648/*···15391659 if (running)15401660 resched_task(rq->curr);15411661 else15421542- check_preempt_curr(rq, p);16621662+ check_preempt_curr(rq, p, 0);15431663}1544166415451665/* Account for a task changing its policy or group.
kernel/sched_rt.c
···102102103103static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)104104{105105+ struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;105106 struct sched_rt_entity *rt_se = rt_rq->rt_se;106107107107- if (rt_se && !on_rt_rq(rt_se) && rt_rq->rt_nr_running) {108108- struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;109109-110110- enqueue_rt_entity(rt_se);108108+ if (rt_rq->rt_nr_running) {109109+ if (rt_se && !on_rt_rq(rt_se))110110+ enqueue_rt_entity(rt_se);111111 if (rt_rq->highest_prio < curr->prio)112112 resched_task(curr);113113 }···231231#endif /* CONFIG_RT_GROUP_SCHED */232232233233#ifdef CONFIG_SMP234234+/*235235+ * We ran out of runtime, see if we can borrow some from our neighbours.236236+ */234237static int do_balance_runtime(struct rt_rq *rt_rq)235238{236239 struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);···253250 continue;254251255252 spin_lock(&iter->rt_runtime_lock);253253+ /*254254+ * Either all rqs have inf runtime and there's nothing to steal,255255+ * or __disable_runtime() below sets a specific rq to inf to256256+ * indicate it's been disabled and disallow stealing.257257+ */256258 if (iter->rt_runtime == RUNTIME_INF)257259 goto next;258260261261+ /*262262+ * From runqueues with spare time, take 1/n part of their263263+ * spare time, but no more than our period.264264+ */259265 diff = iter->rt_runtime - iter->rt_time;260266 if (diff > 0) {261267 diff = div_u64((u64)diff, weight);···286274 return more;287275}288276277277+/*278278+ * Ensure this RQ takes back all the runtime it lent to its neighbours.279279+ */289280static void __disable_runtime(struct rq *rq)290281{291282 struct root_domain *rd = rq->rd;···304289305290 spin_lock(&rt_b->rt_runtime_lock);306291 spin_lock(&rt_rq->rt_runtime_lock);292292+ /*293293+ * Either we're all inf and nobody needs to borrow, or we're294294+ * already disabled and thus have nothing to do, or we have295295+ * exactly the right amount of runtime to take out.296296+ */307297 if (rt_rq->rt_runtime == RUNTIME_INF 
||308298 rt_rq->rt_runtime == rt_b->rt_runtime)309299 goto balanced;310300 spin_unlock(&rt_rq->rt_runtime_lock);311301302302+ /*303303+ * Calculate the difference between what we started out with304304+ * and what we currently have; that's the amount of runtime305305+ * we lent and now have to reclaim.306306+ */312307 want = rt_b->rt_runtime - rt_rq->rt_runtime;313308309309+ /*310310+ * Greedy reclaim, take back as much as we can.311311+ */314312 for_each_cpu_mask(i, rd->span) {315313 struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);316314 s64 diff;317315316316+ /*317317+ * Can't reclaim from ourselves or disabled runqueues.318318+ */318319 if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)319320 continue;320321···350319 }351320352321 spin_lock(&rt_rq->rt_runtime_lock);322322+ /*323323+ * We cannot be left wanting - that would mean some runtime324324+ * leaked out of the system.325325+ */353326 BUG_ON(want);354327balanced:328328+ /*329329+ * Disable all the borrow logic by pretending we have inf330330+ * runtime - in which case borrowing doesn't make sense.331331+ */355332 rt_rq->rt_runtime = RUNTIME_INF;356333 spin_unlock(&rt_rq->rt_runtime_lock);357334 spin_unlock(&rt_b->rt_runtime_lock);···382343 if (unlikely(!scheduler_running))383344 return;384345346346+ /*347347+ * Reset each runqueue's bandwidth settings348348+ */385349 for_each_leaf_rt_rq(rt_rq, rq) {386350 struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);387351···431389 int i, idle = 1;432390 cpumask_t span;433391434434- if (rt_b->rt_runtime == RUNTIME_INF)392392+ if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)435393 return 1;436394437395 span = sched_rt_period_mask();···528486 curr->se.sum_exec_runtime += delta_exec;529487 curr->se.exec_start = rq->clock;530488 cpuacct_charge(curr, delta_exec);489489+490490+ if (!rt_bandwidth_enabled())491491+ return;531492532493 for_each_sched_rt_entity(rt_se) {533494 rt_rq = rt_rq_of_se(rt_se);···829784/*830785 * Preempt the current task with a 
newly woken task if needed:831786 */832832-static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p)787787+static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int sync)833788{834789 if (p->prio < rq->curr->prio) {835790 resched_task(rq->curr);