Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branches 'sched/devel', 'sched/cpu-hotplug', 'sched/cpusets' and 'sched/urgent' into sched/core

+691 -491
+2 -2
Documentation/kernel-doc-nano-HOWTO.txt
··· 168 168 mkdir $ARGV[0],0777; 169 169 $state = 0; 170 170 while (<STDIN>) { 171 - if (/^\.TH \"[^\"]*\" 4 \"([^\"]*)\"/) { 171 + if (/^\.TH \"[^\"]*\" 9 \"([^\"]*)\"/) { 172 172 if ($state == 1) { close OUT } 173 173 $state = 1; 174 - $fn = "$ARGV[0]/$1.4"; 174 + $fn = "$ARGV[0]/$1.9"; 175 175 print STDERR "Creating $fn\n"; 176 176 open OUT, ">$fn" or die "can't open $fn: $!\n"; 177 177 print OUT $_;
+238 -149
Documentation/scheduler/sched-design-CFS.txt
··· 1 - 2 - This is the CFS scheduler. 3 - 4 - 80% of CFS's design can be summed up in a single sentence: CFS basically 5 - models an "ideal, precise multi-tasking CPU" on real hardware. 6 - 7 - "Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% 8 - physical power and which can run each task at precise equal speed, in 9 - parallel, each at 1/nr_running speed. For example: if there are 2 tasks 10 - running then it runs each at 50% physical power - totally in parallel. 11 - 12 - On real hardware, we can run only a single task at once, so while that 13 - one task runs, the other tasks that are waiting for the CPU are at a 14 - disadvantage - the current task gets an unfair amount of CPU time. In 15 - CFS this fairness imbalance is expressed and tracked via the per-task 16 - p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of 17 - time the task should now run on the CPU for it to become completely fair 18 - and balanced. 19 - 20 - ( small detail: on 'ideal' hardware, the p->wait_runtime value would 21 - always be zero - no task would ever get 'out of balance' from the 22 - 'ideal' share of CPU time. ) 23 - 24 - CFS's task picking logic is based on this p->wait_runtime value and it 25 - is thus very simple: it always tries to run the task with the largest 26 - p->wait_runtime value. In other words, CFS tries to run the task with 27 - the 'gravest need' for more CPU time. So CFS always tries to split up 28 - CPU time between runnable tasks as close to 'ideal multitasking 29 - hardware' as possible. 30 - 31 - Most of the rest of CFS's design just falls out of this really simple 32 - concept, with a few add-on embellishments like nice levels, 33 - multiprocessing and various algorithm variants to recognize sleepers. 
34 - 35 - In practice it works like this: the system runs a task a bit, and when 36 - the task schedules (or a scheduler tick happens) the task's CPU usage is 37 - 'accounted for': the (small) time it just spent using the physical CPU 38 - is deducted from p->wait_runtime. [minus the 'fair share' it would have 39 - gotten anyway]. Once p->wait_runtime gets low enough so that another 40 - task becomes the 'leftmost task' of the time-ordered rbtree it maintains 41 - (plus a small amount of 'granularity' distance relative to the leftmost 42 - task so that we do not over-schedule tasks and trash the cache) then the 43 - new leftmost task is picked and the current task is preempted. 44 - 45 - The rq->fair_clock value tracks the 'CPU time a runnable task would have 46 - fairly gotten, had it been runnable during that time'. So by using 47 - rq->fair_clock values we can accurately timestamp and measure the 48 - 'expected CPU time' a task should have gotten. All runnable tasks are 49 - sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and 50 - CFS picks the 'leftmost' task and sticks to it. As the system progresses 51 - forwards, newly woken tasks are put into the tree more and more to the 52 - right - slowly but surely giving a chance for every task to become the 53 - 'leftmost task' and thus get on the CPU within a deterministic amount of 54 - time. 55 - 56 - Some implementation details: 57 - 58 - - the introduction of Scheduling Classes: an extensible hierarchy of 59 - scheduler modules. These modules encapsulate scheduling policy 60 - details and are handled by the scheduler core without the core 61 - code assuming about them too much. 62 - 63 - - sched_fair.c implements the 'CFS desktop scheduler': it is a 64 - replacement for the vanilla scheduler's SCHED_OTHER interactivity 65 - code. 
66 - 67 - I'd like to give credit to Con Kolivas for the general approach here: 68 - he has proven via RSDL/SD that 'fair scheduling' is possible and that 69 - it results in better desktop scheduling. Kudos Con! 70 - 71 - The CFS patch uses a completely different approach and implementation 72 - from RSDL/SD. My goal was to make CFS's interactivity quality exceed 73 - that of RSDL/SD, which is a high standard to meet :-) Testing 74 - feedback is welcome to decide this one way or another. [ and, in any 75 - case, all of SD's logic could be added via a kernel/sched_sd.c module 76 - as well, if Con is interested in such an approach. ] 77 - 78 - CFS's design is quite radical: it does not use runqueues, it uses a 79 - time-ordered rbtree to build a 'timeline' of future task execution, 80 - and thus has no 'array switch' artifacts (by which both the vanilla 81 - scheduler and RSDL/SD are affected). 82 - 83 - CFS uses nanosecond granularity accounting and does not rely on any 84 - jiffies or other HZ detail. Thus the CFS scheduler has no notion of 85 - 'timeslices' and has no heuristics whatsoever. There is only one 86 - central tunable (you have to switch on CONFIG_SCHED_DEBUG): 87 - 88 - /proc/sys/kernel/sched_granularity_ns 89 - 90 - which can be used to tune the scheduler from 'desktop' (low 91 - latencies) to 'server' (good batching) workloads. It defaults to a 92 - setting suitable for desktop workloads. SCHED_BATCH is handled by the 93 - CFS scheduler module too. 94 - 95 - Due to its design, the CFS scheduler is not prone to any of the 96 - 'attacks' that exist today against the heuristics of the stock 97 - scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all 98 - work fine and do not impact interactivity and produce the expected 99 - behavior. 100 - 101 - the CFS scheduler has a much stronger handling of nice levels and 102 - SCHED_BATCH: both types of workloads should be isolated much more 103 - agressively than under the vanilla scheduler. 
104 - 105 - ( another detail: due to nanosec accounting and timeline sorting, 106 - sched_yield() support is very simple under CFS, and in fact under 107 - CFS sched_yield() behaves much better than under any other 108 - scheduler i have tested so far. ) 109 - 110 - - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler 111 - way than the vanilla scheduler does. It uses 100 runqueues (for all 112 - 100 RT priority levels, instead of 140 in the vanilla scheduler) 113 - and it needs no expired array. 114 - 115 - - reworked/sanitized SMP load-balancing: the runqueue-walking 116 - assumptions are gone from the load-balancing code now, and 117 - iterators of the scheduling modules are used. The balancing code got 118 - quite a bit simpler as a result. 1 + ============= 2 + CFS Scheduler 3 + ============= 119 4 120 5 121 - Group scheduler extension to CFS 122 - ================================ 6 + 1. OVERVIEW 123 7 124 - Normally the scheduler operates on individual tasks and strives to provide 125 - fair CPU time to each task. Sometimes, it may be desirable to group tasks 126 - and provide fair CPU time to each such task group. For example, it may 127 - be desirable to first provide fair CPU time to each user on the system 128 - and then to each task belonging to a user. 8 + CFS stands for "Completely Fair Scheduler," and is the new "desktop" process 9 + scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the 10 + replacement for the previous vanilla scheduler's SCHED_OTHER interactivity 11 + code. 129 12 130 - CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets 131 - SCHED_NORMAL/BATCH tasks be be grouped and divides CPU time fairly among such 132 - groups. At present, there are two (mutually exclusive) mechanisms to group 133 - tasks for CPU bandwidth control purpose: 13 + 80% of CFS's design can be summed up in a single sentence: CFS basically models 14 + an "ideal, precise multi-tasking CPU" on real hardware. 
134 15 135 - - Based on user id (CONFIG_FAIR_USER_SCHED) 136 - In this option, tasks are grouped according to their user id. 137 - - Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED) 138 - This options lets the administrator create arbitrary groups 139 - of tasks, using the "cgroup" pseudo filesystem. See 140 - Documentation/cgroups.txt for more information about this 141 - filesystem. 16 + "Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical 17 + power and which can run each task at precise equal speed, in parallel, each at 18 + 1/nr_running speed. For example: if there are 2 tasks running, then it runs 19 + each at 50% physical power --- i.e., actually in parallel. 20 + 21 + On real hardware, we can run only a single task at once, so we have to 22 + introduce the concept of "virtual runtime." The virtual runtime of a task 23 + specifies when its next timeslice would start execution on the ideal 24 + multi-tasking CPU described above. In practice, the virtual runtime of a task 25 + is its actual runtime normalized to the total number of running tasks. 26 + 27 + 28 + 29 + 2. FEW IMPLEMENTATION DETAILS 30 + 31 + In CFS the virtual runtime is expressed and tracked via the per-task 32 + p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately 33 + timestamp and measure the "expected CPU time" a task should have gotten. 34 + 35 + [ small detail: on "ideal" hardware, at any time all tasks would have the same 36 + p->se.vruntime value --- i.e., tasks would execute simultaneously and no task 37 + would ever get "out of balance" from the "ideal" share of CPU time. ] 38 + 39 + CFS's task picking logic is based on this p->se.vruntime value and it is thus 40 + very simple: it always tries to run the task with the smallest p->se.vruntime 41 + value (i.e., the task which executed least so far). CFS always tries to split 42 + up CPU time between runnable tasks as close to "ideal multitasking hardware" as 43 + possible. 
44 + 45 + Most of the rest of CFS's design just falls out of this really simple concept, 46 + with a few add-on embellishments like nice levels, multiprocessing and various 47 + algorithm variants to recognize sleepers. 48 + 49 + 50 + 51 + 3. THE RBTREE 52 + 53 + CFS's design is quite radical: it does not use the old data structures for the 54 + runqueues, but it uses a time-ordered rbtree to build a "timeline" of future 55 + task execution, and thus has no "array switch" artifacts (by which both the 56 + previous vanilla scheduler and RSDL/SD are affected). 57 + 58 + CFS also maintains the rq->cfs.min_vruntime value, which is a monotonic 59 + increasing value tracking the smallest vruntime among all tasks in the 60 + runqueue. The total amount of work done by the system is tracked using 61 + min_vruntime; that value is used to place newly activated entities on the left 62 + side of the tree as much as possible. 63 + 64 + The total number of running tasks in the runqueue is accounted through the 65 + rq->cfs.load value, which is the sum of the weights of the tasks queued on the 66 + runqueue. 67 + 68 + CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the 69 + p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to 70 + account for possible wraparounds). CFS picks the "leftmost" task from this 71 + tree and sticks to it. 72 + As the system progresses forwards, the executed tasks are put into the tree 73 + more and more to the right --- slowly but surely giving a chance for every task 74 + to become the "leftmost task" and thus get on the CPU within a deterministic 75 + amount of time. 76 + 77 + Summing up, CFS works like this: it runs a task a bit, and when the task 78 + schedules (or a scheduler tick happens) the task's CPU usage is "accounted 79 + for": the (small) time it just spent using the physical CPU is added to 80 + p->se.vruntime. 
Once p->se.vruntime gets high enough so that another task 81 + becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a 82 + small amount of "granularity" distance relative to the leftmost task so that we 83 + do not over-schedule tasks and trash the cache), then the new leftmost task is 84 + picked and the current task is preempted. 85 + 86 + 87 + 88 + 4. SOME FEATURES OF CFS 89 + 90 + CFS uses nanosecond granularity accounting and does not rely on any jiffies or 91 + other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the 92 + way the previous scheduler had, and has no heuristics whatsoever. There is 93 + only one central tunable (you have to switch on CONFIG_SCHED_DEBUG): 94 + 95 + /proc/sys/kernel/sched_granularity_ns 96 + 97 + which can be used to tune the scheduler from "desktop" (i.e., low latencies) to 98 + "server" (i.e., good batching) workloads. It defaults to a setting suitable 99 + for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too. 100 + 101 + Due to its design, the CFS scheduler is not prone to any of the "attacks" that 102 + exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, 103 + chew.c, ring-test.c, massive_intr.c all work fine and do not impact 104 + interactivity and produce the expected behavior. 105 + 106 + The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH 107 + than the previous vanilla scheduler: both types of workloads are isolated much 108 + more aggressively. 109 + 110 + SMP load-balancing has been reworked/sanitized: the runqueue-walking 111 + assumptions are gone from the load-balancing code now, and iterators of the 112 + scheduling modules are used. The balancing code got quite a bit simpler as a 113 + result. 114 + 115 + 116 + 117 + 5. 
SCHEDULING POLICIES 118 + 119 + CFS implements three scheduling policies: 120 + 121 + - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling 122 + policy that is used for regular tasks. 123 + 124 + - SCHED_BATCH: Does not preempt nearly as often as regular tasks 125 + would, thereby allowing tasks to run longer and make better use of 126 + caches but at the cost of interactivity. This is well suited for 127 + batch jobs. 128 + 129 + - SCHED_IDLE: This is even weaker than nice 19, but it's not a true 130 + idle timer scheduler, in order to avoid getting into priority 131 + inversion problems which would deadlock the machine. 132 + 133 + SCHED_FIFO/_RR are implemented in sched_rt.c and are as specified by 134 + POSIX. 135 + 136 + The command chrt from util-linux-ng 2.13.1.1 can set all of these except 137 + SCHED_IDLE. 138 + 139 + 140 + 141 + 6. SCHEDULING CLASSES 142 + 143 + The new CFS scheduler has been designed in such a way as to introduce "Scheduling 144 + Classes," an extensible hierarchy of scheduler modules. These modules 145 + encapsulate scheduling policy details and are handled by the scheduler core 146 + without the core code assuming too much about them. 147 + 148 + sched_fair.c implements the CFS scheduler described above. 149 + 150 + sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than 151 + the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT 152 + priority levels, instead of 140 in the previous scheduler) and it needs no 153 + expired array. 154 + 155 + Scheduling classes are implemented through the sched_class structure, which 156 + contains hooks to functions that must be called whenever an interesting event 157 + occurs. 158 + 159 + This is the (partial) list of the hooks: 160 + 161 + - enqueue_task(...) 162 + 163 + Called when a task enters a runnable state. 164 + It puts the scheduling entity (task) into the red-black tree and 165 + increments the nr_running variable.
166 + 167 + - dequeue_task(...) 168 + 169 + When a task is no longer runnable, this function is called to keep the 170 + corresponding scheduling entity out of the red-black tree. It decrements 171 + the nr_running variable. 172 + 173 + - yield_task(...) 174 + 175 + This function is basically just a dequeue followed by an enqueue, unless the 176 + compat_yield sysctl is turned on; in that case, it places the scheduling 177 + entity at the right-most end of the red-black tree. 178 + 179 + - check_preempt_curr(...) 180 + 181 + This function checks if a task that entered the runnable state should 182 + preempt the currently running task. 183 + 184 + - pick_next_task(...) 185 + 186 + This function chooses the most appropriate task eligible to run next. 187 + 188 + - set_curr_task(...) 189 + 190 + This function is called when a task changes its scheduling class or changes 191 + its task group. 192 + 193 + - task_tick(...) 194 + 195 + This function is mostly called from time tick functions; it might lead to 196 + a process switch. This drives the running preemption. 197 + 198 + - task_new(...) 199 + 200 + The core scheduler gives the scheduling module an opportunity to manage new 201 + task startup. The CFS scheduling module uses it for group scheduling, while 202 + the scheduling module for a real-time task does not use it. 203 + 204 + 205 + 206 + 7. GROUP SCHEDULER EXTENSIONS TO CFS 207 + 208 + Normally, the scheduler operates on individual tasks and strives to provide 209 + fair CPU time to each task. Sometimes, it may be desirable to group tasks and 210 + provide fair CPU time to each such task group. For example, it may be 211 + desirable to first provide fair CPU time to each user on the system and then to 212 + each task belonging to a user. 213 + 214 + CONFIG_GROUP_SCHED strives to achieve exactly that. It lets tasks be 215 + grouped and divides CPU time fairly among such groups.
216 + 217 + CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and 218 + SCHED_RR) tasks. 219 + 220 + CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and 221 + SCHED_BATCH) tasks. 222 + 223 + At present, there are two (mutually exclusive) mechanisms to group tasks for 224 + CPU bandwidth control purposes: 225 + 226 + - Based on user id (CONFIG_USER_SCHED) 227 + 228 + With this option, tasks are grouped according to their user id. 229 + 230 + - Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED) 231 + 232 + This option needs CONFIG_CGROUPS to be defined, and lets the administrator 233 + create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See 234 + Documentation/cgroups.txt for more information about this filesystem. 142 235 143 236 Only one of these options to group tasks can be chosen and not both. 144 237 145 - Group scheduler tunables: 146 - 147 - When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for 148 - each new user and a "cpu_share" file is added in that directory. 238 + When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new 239 + user and a "cpu_share" file is added in that directory. 149 240 150 241 # cd /sys/kernel/uids 151 242 # cat 512/cpu_share # Display user 512's CPU share ··· 246 155 2048 247 156 # 248 157 249 - CPU bandwidth between two users are divided in the ratio of their CPU shares. 250 - For ex: if you would like user "root" to get twice the bandwidth of user 251 - "guest", then set the cpu_share for both the users such that "root"'s 252 - cpu_share is twice "guest"'s cpu_share 158 + CPU bandwidth between two users is divided in the ratio of their CPU shares. 159 + For example: if you would like user "root" to get twice the bandwidth of user 160 + "guest," then set the cpu_share for both users such that "root"'s cpu_share 161 + is twice "guest"'s cpu_share.
253 162 254 - 255 - When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created 256 - for each group created using the pseudo filesystem. See example steps 257 - below to create task groups and modify their CPU share using the "cgroups" 258 - pseudo filesystem 163 + When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each 164 + group created using the pseudo filesystem. See example steps below to create 165 + task groups and modify their CPU share using the "cgroups" pseudo filesystem. 259 166 260 167 # mkdir /dev/cpuctl 261 168 # mount -t cgroup -ocpu none /dev/cpuctl
+3
arch/alpha/kernel/smp.c
··· 149 149 atomic_inc(&init_mm.mm_count); 150 150 current->active_mm = &init_mm; 151 151 152 + /* inform the notifiers about the new cpu */ 153 + notify_cpu_starting(cpuid); 154 + 152 155 /* Must have completely accurate bogos. */ 153 156 local_irq_enable(); 154 157
+1
arch/arm/kernel/smp.c
··· 277 277 /* 278 278 * Enable local interrupts. 279 279 */ 280 + notify_cpu_starting(cpu); 280 281 local_irq_enable(); 281 282 local_fiq_enable(); 282 283
+1
arch/cris/arch-v32/kernel/smp.c
··· 178 178 unmask_irq(IPI_INTR_VECT); 179 179 unmask_irq(TIMER0_INTR_VECT); 180 180 preempt_disable(); 181 + notify_cpu_starting(cpu); 181 182 local_irq_enable(); 182 183 183 184 cpu_set(cpu, cpu_online_map);
+1
arch/ia64/kernel/smpboot.c
··· 401 401 spin_lock(&vector_lock); 402 402 /* Setup the per cpu irq handling data structures */ 403 403 __setup_vector_irq(cpuid); 404 + notify_cpu_starting(cpuid); 404 405 cpu_set(cpuid, cpu_online_map); 405 406 per_cpu(cpu_state, cpuid) = CPU_ONLINE; 406 407 spin_unlock(&vector_lock);
+2
arch/m32r/kernel/smpboot.c
··· 498 498 { 499 499 int cpu_id = smp_processor_id(); 500 500 501 + notify_cpu_starting(cpu_id); 502 + 501 503 local_irq_enable(); 502 504 503 505 /* Get our bogomips. */
+2
arch/mips/kernel/smp.c
··· 121 121 cpu = smp_processor_id(); 122 122 cpu_data[cpu].udelay_val = loops_per_jiffy; 123 123 124 + notify_cpu_starting(cpu); 125 + 124 126 mp_ops->smp_finish(); 125 127 set_cpu_sibling_map(cpu); 126 128
+1
arch/powerpc/kernel/smp.c
··· 453 453 secondary_cpu_time_init(); 454 454 455 455 ipi_call_lock(); 456 + notify_cpu_starting(cpu); 456 457 cpu_set(cpu, cpu_online_map); 457 458 /* Update sibling maps */ 458 459 base = cpu_first_thread_in_core(cpu);
+2
arch/s390/kernel/smp.c
··· 585 585 /* Enable pfault pseudo page faults on this cpu. */ 586 586 pfault_init(); 587 587 588 + /* call cpu notifiers */ 589 + notify_cpu_starting(smp_processor_id()); 588 590 /* Mark this cpu as online */ 589 591 spin_lock(&call_lock); 590 592 cpu_set(smp_processor_id(), cpu_online_map);
+2
arch/sh/kernel/smp.c
··· 82 82 83 83 preempt_disable(); 84 84 85 + notify_cpu_starting(smp_processor_id()); 86 + 85 87 local_irq_enable(); 86 88 87 89 calibrate_delay();
+1
arch/sparc/kernel/sun4d_smp.c
··· 88 88 local_flush_cache_all(); 89 89 local_flush_tlb_all(); 90 90 91 + notify_cpu_starting(cpuid); 91 92 /* 92 93 * Unblock the master CPU _only_ when the scheduler state 93 94 * of all secondary CPUs will be up-to-date, so after
+2
arch/sparc/kernel/sun4m_smp.c
··· 71 71 local_flush_cache_all(); 72 72 local_flush_tlb_all(); 73 73 74 + notify_cpu_starting(cpuid); 75 + 74 76 /* Get our local ticker going. */ 75 77 smp_setup_percpu_timer(); 76 78
+1
arch/um/kernel/smp.c
··· 85 85 while (!cpu_isset(cpu, smp_commenced_mask)) 86 86 cpu_relax(); 87 87 88 + notify_cpu_starting(cpu); 88 89 cpu_set(cpu, cpu_online_map); 89 90 default_idle(); 90 91 return 0;
+1
arch/x86/kernel/smpboot.c
··· 257 257 end_local_APIC_setup(); 258 258 map_cpu_to_logical_apicid(); 259 259 260 + notify_cpu_starting(cpuid); 260 261 /* 261 262 * Get our bogomips. 262 263 *
+2
arch/x86/mach-voyager/voyager_smp.c
··· 448 448 449 449 VDEBUG(("VOYAGER SMP: CPU%d, stack at about %p\n", cpuid, &cpuid)); 450 450 451 + notify_cpu_starting(cpuid); 452 + 451 453 /* enable interrupts */ 452 454 local_irq_enable(); 453 455
+41
include/linux/completion.h
··· 10 10 11 11 #include <linux/wait.h> 12 12 13 + /** 14 + * struct completion - structure used to maintain state for a "completion" 15 + * 16 + * This is the opaque structure used to maintain the state for a "completion". 17 + * Completions currently use a FIFO to queue threads that have to wait for 18 + * the "completion" event. 19 + * 20 + * See also: complete(), wait_for_completion() (and friends _timeout, 21 + * _interruptible, _interruptible_timeout, and _killable), init_completion(), 22 + * and macros DECLARE_COMPLETION(), DECLARE_COMPLETION_ONSTACK(), and 23 + * INIT_COMPLETION(). 24 + */ 13 25 struct completion { 14 26 unsigned int done; 15 27 wait_queue_head_t wait; ··· 33 21 #define COMPLETION_INITIALIZER_ONSTACK(work) \ 34 22 ({ init_completion(&work); work; }) 35 23 24 + /** 25 + * DECLARE_COMPLETION: - declare and initialize a completion structure 26 + * @work: identifier for the completion structure 27 + * 28 + * This macro declares and initializes a completion structure. Generally used 29 + * for static declarations. You should use the _ONSTACK variant for automatic 30 + * variables. 31 + */ 36 32 #define DECLARE_COMPLETION(work) \ 37 33 struct completion work = COMPLETION_INITIALIZER(work) 38 34 ··· 49 29 * completions - so we use the _ONSTACK() variant for those that 50 30 * are on the kernel stack: 51 31 */ 32 + /** 33 + * DECLARE_COMPLETION_ONSTACK: - declare and initialize a completion structure 34 + * @work: identifier for the completion structure 35 + * 36 + * This macro declares and initializes a completion structure on the kernel 37 + * stack. 
38 + */ 52 39 #ifdef CONFIG_LOCKDEP 53 40 # define DECLARE_COMPLETION_ONSTACK(work) \ 54 41 struct completion work = COMPLETION_INITIALIZER_ONSTACK(work) ··· 63 36 # define DECLARE_COMPLETION_ONSTACK(work) DECLARE_COMPLETION(work) 64 37 #endif 65 38 39 + /** 40 + * init_completion: - Initialize a dynamically allocated completion 41 + * @x: completion structure that is to be initialized 42 + * 43 + * This inline function will initialize a dynamically created completion 44 + * structure. 45 + */ 66 46 static inline void init_completion(struct completion *x) 67 47 { 68 48 x->done = 0; ··· 89 55 extern void complete(struct completion *); 90 56 extern void complete_all(struct completion *); 91 57 58 + /** 59 + * INIT_COMPLETION: - reinitialize a completion structure 60 + * @x: completion structure to be reinitialized 61 + * 62 + * This macro should be used to reinitialize a completion structure so it can 63 + * be reused. This is especially important after complete_all() is used. 64 + */ 92 65 #define INIT_COMPLETION(x) ((x).done = 0) 93 66 94 67
+1
include/linux/cpu.h
··· 69 69 #endif 70 70 71 71 int cpu_up(unsigned int cpu); 72 + void notify_cpu_starting(unsigned int cpu); 72 73 extern void cpu_hotplug_init(void); 73 74 extern void cpu_maps_update_begin(void); 74 75 extern void cpu_maps_update_done(void);
+9 -1
include/linux/notifier.h
··· 213 213 #define CPU_DOWN_FAILED 0x0006 /* CPU (unsigned)v NOT going down */ 214 214 #define CPU_DEAD 0x0007 /* CPU (unsigned)v dead */ 215 215 #define CPU_DYING 0x0008 /* CPU (unsigned)v not running any task, 216 - * not handling interrupts, soon dead */ 216 + * not handling interrupts, soon dead. 217 + * Called on the dying cpu, interrupts 218 + * are already disabled. Must not 219 + * sleep, must not fail */ 217 220 #define CPU_POST_DEAD 0x0009 /* CPU (unsigned)v dead, cpu_hotplug 218 221 * lock is dropped */ 222 + #define CPU_STARTING 0x000A /* CPU (unsigned)v soon running. 223 + * Called on the new cpu, just before 224 + * enabling interrupts. Must not sleep, 225 + * must not fail */ 219 226 220 227 /* Used for CPU hotplug events occuring while tasks are frozen due to a suspend 221 228 * operation in progress ··· 236 229 #define CPU_DOWN_FAILED_FROZEN (CPU_DOWN_FAILED | CPU_TASKS_FROZEN) 237 230 #define CPU_DEAD_FROZEN (CPU_DEAD | CPU_TASKS_FROZEN) 238 231 #define CPU_DYING_FROZEN (CPU_DYING | CPU_TASKS_FROZEN) 232 + #define CPU_STARTING_FROZEN (CPU_STARTING | CPU_TASKS_FROZEN) 239 233 240 234 /* Hibernation and suspend events */ 241 235 #define PM_HIBERNATION_PREPARE 0x0001 /* Going to hibernate */
+1 -1
include/linux/proportions.h
··· 104 104 * snapshot of the last seen global state 105 105 * and a lock protecting this state 106 106 */ 107 - int shift; 108 107 unsigned long period; 108 + int shift; 109 109 spinlock_t lock; /* protect the snapshot state */ 110 110 }; 111 111
+3 -3
include/linux/sched.h
··· 451 451 * - everyone except group_exit_task is stopped during signal delivery 452 452 * of fatal signals, group_exit_task processes the signal. 453 453 */ 454 - struct task_struct *group_exit_task; 455 454 int notify_count; 455 + struct task_struct *group_exit_task; 456 456 457 457 /* thread group stop support, overloads group_exit_code too */ 458 458 int group_stop_count; ··· 897 897 void (*yield_task) (struct rq *rq); 898 898 int (*select_task_rq)(struct task_struct *p, int sync); 899 899 900 - void (*check_preempt_curr) (struct rq *rq, struct task_struct *p); 900 + void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync); 901 901 902 902 struct task_struct * (*pick_next_task) (struct rq *rq); 903 903 void (*put_prev_task) (struct rq *rq, struct task_struct *p); ··· 1010 1010 1011 1011 struct sched_rt_entity { 1012 1012 struct list_head run_list; 1013 - unsigned int time_slice; 1014 1013 unsigned long timeout; 1014 + unsigned int time_slice; 1015 1015 int nr_cpus_allowed; 1016 1016 1017 1017 struct sched_rt_entity *back;
+22 -2
kernel/cpu.c
··· 199 199 struct take_cpu_down_param *param = _param; 200 200 int err; 201 201 202 - raw_notifier_call_chain(&cpu_chain, CPU_DYING | param->mod, 203 - param->hcpu); 204 202 /* Ensure this CPU doesn't handle any more interrupts. */ 205 203 err = __cpu_disable(); 206 204 if (err < 0) 207 205 return err; 206 + 207 + raw_notifier_call_chain(&cpu_chain, CPU_DYING | param->mod, 208 + param->hcpu); 208 209 209 210 /* Force idle task to run as soon as we yield: it should 210 211 immediately notice cpu is offline and die quickly. */ ··· 453 452 cpu_maps_update_done(); 454 453 } 455 454 #endif /* CONFIG_PM_SLEEP_SMP */ 455 + 456 + /** 457 + * notify_cpu_starting(cpu) - call the CPU_STARTING notifiers 458 + * @cpu: cpu that just started 459 + * 460 + * This function calls the cpu_chain notifiers with CPU_STARTING. 461 + * It must be called by the arch code on the new cpu, before the new cpu 462 + * enables interrupts and before the "boot" cpu returns from __cpu_up(). 463 + */ 464 + void notify_cpu_starting(unsigned int cpu) 465 + { 466 + unsigned long val = CPU_STARTING; 467 + 468 + #ifdef CONFIG_PM_SLEEP_SMP 469 + if (cpu_isset(cpu, frozen_cpus)) 470 + val = CPU_STARTING_FROZEN; 471 + #endif /* CONFIG_PM_SLEEP_SMP */ 472 + raw_notifier_call_chain(&cpu_chain, val, (void *)(long)cpu); 473 + } 456 474 457 475 #endif /* CONFIG_SMP */ 458 476
+1 -1
kernel/cpuset.c
··· 1921 1921 * that has tasks along with an empty 'mems'. But if we did see such 1922 1922 * a cpuset, we'd handle it just like we do if its 'cpus' was empty. 1923 1923 */ 1924 - static void scan_for_empty_cpusets(const struct cpuset *root) 1924 + static void scan_for_empty_cpusets(struct cpuset *root) 1925 1925 { 1926 1926 LIST_HEAD(queue); 1927 1927 struct cpuset *cp; /* scans cpusets being updated */
+243 -150
kernel/sched.c
··· 204 204 rt_b->rt_period_timer.cb_mode = HRTIMER_CB_IRQSAFE_UNLOCKED; 205 205 } 206 206 207 + static inline int rt_bandwidth_enabled(void) 208 + { 209 + return sysctl_sched_rt_runtime >= 0; 210 + } 211 + 207 212 static void start_rt_bandwidth(struct rt_bandwidth *rt_b) 208 213 { 209 214 ktime_t now; 210 215 211 - if (rt_b->rt_runtime == RUNTIME_INF) 216 + if (rt_bandwidth_enabled() && rt_b->rt_runtime == RUNTIME_INF) 212 217 return; 213 218 214 219 if (hrtimer_active(&rt_b->rt_period_timer)) ··· 303 298 static DEFINE_PER_CPU(struct sched_rt_entity, init_sched_rt_entity); 304 299 static DEFINE_PER_CPU(struct rt_rq, init_rt_rq) ____cacheline_aligned_in_smp; 305 300 #endif /* CONFIG_RT_GROUP_SCHED */ 306 - #else /* !CONFIG_FAIR_GROUP_SCHED */ 301 + #else /* !CONFIG_USER_SCHED */ 307 302 #define root_task_group init_task_group 308 - #endif /* CONFIG_FAIR_GROUP_SCHED */ 303 + #endif /* CONFIG_USER_SCHED */ 309 304 310 305 /* task_group_lock serializes add/remove of task groups and also changes to 311 306 * a task group's cpu shares. 
··· 609 604 610 605 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); 611 606 612 - static inline void check_preempt_curr(struct rq *rq, struct task_struct *p) 607 + static inline void check_preempt_curr(struct rq *rq, struct task_struct *p, int sync) 613 608 { 614 - rq->curr->sched_class->check_preempt_curr(rq, p); 609 + rq->curr->sched_class->check_preempt_curr(rq, p, sync); 615 610 } 616 611 617 612 static inline int cpu_of(struct rq *rq) ··· 1107 1102 hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay), HRTIMER_MODE_REL); 1108 1103 } 1109 1104 1110 - static void init_hrtick(void) 1105 + static inline void init_hrtick(void) 1111 1106 { 1112 1107 } 1113 1108 #endif /* CONFIG_SMP */ ··· 1126 1121 rq->hrtick_timer.function = hrtick; 1127 1122 rq->hrtick_timer.cb_mode = HRTIMER_CB_IRQSAFE_PERCPU; 1128 1123 } 1129 - #else 1124 + #else /* CONFIG_SCHED_HRTICK */ 1130 1125 static inline void hrtick_clear(struct rq *rq) 1131 1126 { 1132 1127 } ··· 1138 1133 static inline void init_hrtick(void) 1139 1134 { 1140 1135 } 1141 - #endif 1136 + #endif /* CONFIG_SCHED_HRTICK */ 1142 1137 1143 1138 /* 1144 1139 * resched_task - mark a task 'to be rescheduled now'. ··· 1385 1380 update_load_sub(&rq->load, load); 1386 1381 } 1387 1382 1383 + #if (defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)) || defined(CONFIG_RT_GROUP_SCHED) 1384 + typedef int (*tg_visitor)(struct task_group *, void *); 1385 + 1386 + /* 1387 + * Iterate the full tree, calling @down when first entering a node and @up when 1388 + * leaving it for the final time. 
1389 + */ 1390 + static int walk_tg_tree(tg_visitor down, tg_visitor up, void *data) 1391 + { 1392 + struct task_group *parent, *child; 1393 + int ret; 1394 + 1395 + rcu_read_lock(); 1396 + parent = &root_task_group; 1397 + down: 1398 + ret = (*down)(parent, data); 1399 + if (ret) 1400 + goto out_unlock; 1401 + list_for_each_entry_rcu(child, &parent->children, siblings) { 1402 + parent = child; 1403 + goto down; 1404 + 1405 + up: 1406 + continue; 1407 + } 1408 + ret = (*up)(parent, data); 1409 + if (ret) 1410 + goto out_unlock; 1411 + 1412 + child = parent; 1413 + parent = parent->parent; 1414 + if (parent) 1415 + goto up; 1416 + out_unlock: 1417 + rcu_read_unlock(); 1418 + 1419 + return ret; 1420 + } 1421 + 1422 + static int tg_nop(struct task_group *tg, void *data) 1423 + { 1424 + return 0; 1425 + } 1426 + #endif 1427 + 1388 1428 #ifdef CONFIG_SMP 1389 1429 static unsigned long source_load(int cpu, int type); 1390 1430 static unsigned long target_load(int cpu, int type); ··· 1446 1396 } 1447 1397 1448 1398 #ifdef CONFIG_FAIR_GROUP_SCHED 1449 - 1450 - typedef void (*tg_visitor)(struct task_group *, int, struct sched_domain *); 1451 - 1452 - /* 1453 - * Iterate the full tree, calling @down when first entering a node and @up when 1454 - * leaving it for the final time. 
1455 - */ 1456 - static void 1457 - walk_tg_tree(tg_visitor down, tg_visitor up, int cpu, struct sched_domain *sd) 1458 - { 1459 - struct task_group *parent, *child; 1460 - 1461 - rcu_read_lock(); 1462 - parent = &root_task_group; 1463 - down: 1464 - (*down)(parent, cpu, sd); 1465 - list_for_each_entry_rcu(child, &parent->children, siblings) { 1466 - parent = child; 1467 - goto down; 1468 - 1469 - up: 1470 - continue; 1471 - } 1472 - (*up)(parent, cpu, sd); 1473 - 1474 - child = parent; 1475 - parent = parent->parent; 1476 - if (parent) 1477 - goto up; 1478 - rcu_read_unlock(); 1479 - } 1480 1399 1481 1400 static void __set_se_shares(struct sched_entity *se, unsigned long shares); 1482 1401 ··· 1505 1486 * This needs to be done in a bottom-up fashion because the rq weight of a 1506 1487 * parent group depends on the shares of its child groups. 1507 1488 */ 1508 - static void 1509 - tg_shares_up(struct task_group *tg, int cpu, struct sched_domain *sd) 1489 + static int tg_shares_up(struct task_group *tg, void *data) 1510 1490 { 1511 1491 unsigned long rq_weight = 0; 1512 1492 unsigned long shares = 0; 1493 + struct sched_domain *sd = data; 1513 1494 int i; 1514 1495 1515 1496 for_each_cpu_mask(i, sd->span) { ··· 1534 1515 __update_group_shares_cpu(tg, i, shares, rq_weight); 1535 1516 spin_unlock_irqrestore(&rq->lock, flags); 1536 1517 } 1518 + 1519 + return 0; 1537 1520 } 1538 1521 1539 1522 /* ··· 1543 1522 * This needs to be done in a top-down fashion because the load of a child 1544 1523 * group is a fraction of its parents load. 
1545 1524 */ 1546 - static void 1547 - tg_load_down(struct task_group *tg, int cpu, struct sched_domain *sd) 1525 + static int tg_load_down(struct task_group *tg, void *data) 1548 1526 { 1549 1527 unsigned long load; 1528 + long cpu = (long)data; 1550 1529 1551 1530 if (!tg->parent) { 1552 1531 load = cpu_rq(cpu)->load.weight; ··· 1557 1536 } 1558 1537 1559 1538 tg->cfs_rq[cpu]->h_load = load; 1560 - } 1561 1539 1562 - static void 1563 - tg_nop(struct task_group *tg, int cpu, struct sched_domain *sd) 1564 - { 1540 + return 0; 1565 1541 } 1566 1542 1567 1543 static void update_shares(struct sched_domain *sd) ··· 1568 1550 1569 1551 if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) { 1570 1552 sd->last_update = now; 1571 - walk_tg_tree(tg_nop, tg_shares_up, 0, sd); 1553 + walk_tg_tree(tg_nop, tg_shares_up, sd); 1572 1554 } 1573 1555 } 1574 1556 ··· 1579 1561 spin_lock(&rq->lock); 1580 1562 } 1581 1563 1582 - static void update_h_load(int cpu) 1564 + static void update_h_load(long cpu) 1583 1565 { 1584 - walk_tg_tree(tg_load_down, tg_nop, cpu, NULL); 1566 + walk_tg_tree(tg_load_down, tg_nop, (void *)cpu); 1585 1567 } 1586 1568 1587 1569 #else ··· 1939 1921 running = task_running(rq, p); 1940 1922 on_rq = p->se.on_rq; 1941 1923 ncsw = 0; 1942 - if (!match_state || p->state == match_state) { 1943 - ncsw = p->nivcsw + p->nvcsw; 1944 - if (unlikely(!ncsw)) 1945 - ncsw = 1; 1946 - } 1924 + if (!match_state || p->state == match_state) 1925 + ncsw = p->nvcsw | LONG_MIN; /* sets MSB */ 1947 1926 task_rq_unlock(rq, &flags); 1948 1927 1949 1928 /* ··· 2300 2285 trace_mark(kernel_sched_wakeup, 2301 2286 "pid %d state %ld ## rq %p task %p rq->curr %p", 2302 2287 p->pid, p->state, rq, p, rq->curr); 2303 - check_preempt_curr(rq, p); 2288 + check_preempt_curr(rq, p, sync); 2304 2289 2305 2290 p->state = TASK_RUNNING; 2306 2291 #ifdef CONFIG_SMP ··· 2435 2420 trace_mark(kernel_sched_wakeup_new, 2436 2421 "pid %d state %ld ## rq %p task %p rq->curr %p", 2437 2422 p->pid, 
p->state, rq, p, rq->curr); 2438 - check_preempt_curr(rq, p); 2423 + check_preempt_curr(rq, p, 0); 2439 2424 #ifdef CONFIG_SMP 2440 2425 if (p->sched_class->task_wake_up) 2441 2426 p->sched_class->task_wake_up(rq, p); ··· 2895 2880 * Note that idle threads have a prio of MAX_PRIO, for this test 2896 2881 * to be always true for them. 2897 2882 */ 2898 - check_preempt_curr(this_rq, p); 2883 + check_preempt_curr(this_rq, p, 0); 2899 2884 } 2900 2885 2901 2886 /* ··· 4642 4627 } 4643 4628 EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */ 4644 4629 4630 + /** 4631 + * complete: - signals a single thread waiting on this completion 4632 + * @x: holds the state of this particular completion 4633 + * 4634 + * This will wake up a single thread waiting on this completion. Threads will be 4635 + * awakened in the same order in which they were queued. 4636 + * 4637 + * See also complete_all(), wait_for_completion() and related routines. 4638 + */ 4645 4639 void complete(struct completion *x) 4646 4640 { 4647 4641 unsigned long flags; ··· 4662 4638 } 4663 4639 EXPORT_SYMBOL(complete); 4664 4640 4641 + /** 4642 + * complete_all: - signals all threads waiting on this completion 4643 + * @x: holds the state of this particular completion 4644 + * 4645 + * This will wake up all threads waiting on this particular completion event. 
4646 + */ 4665 4647 void complete_all(struct completion *x) 4666 4648 { 4667 4649 unsigned long flags; ··· 4688 4658 wait.flags |= WQ_FLAG_EXCLUSIVE; 4689 4659 __add_wait_queue_tail(&x->wait, &wait); 4690 4660 do { 4691 - if ((state == TASK_INTERRUPTIBLE && 4692 - signal_pending(current)) || 4693 - (state == TASK_KILLABLE && 4694 - fatal_signal_pending(current))) { 4661 + if (signal_pending_state(state, current)) { 4695 4662 timeout = -ERESTARTSYS; 4696 4663 break; 4697 4664 } ··· 4716 4689 return timeout; 4717 4690 } 4718 4691 4692 + /** 4693 + * wait_for_completion: - waits for completion of a task 4694 + * @x: holds the state of this particular completion 4695 + * 4696 + * This waits to be signaled for completion of a specific task. It is NOT 4697 + * interruptible and there is no timeout. 4698 + * 4699 + * See also similar routines (i.e. wait_for_completion_timeout()) with timeout 4700 + * and interrupt capability. Also see complete(). 4701 + */ 4719 4702 void __sched wait_for_completion(struct completion *x) 4720 4703 { 4721 4704 wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE); 4722 4705 } 4723 4706 EXPORT_SYMBOL(wait_for_completion); 4724 4707 4708 + /** 4709 + * wait_for_completion_timeout: - waits for completion of a task (w/timeout) 4710 + * @x: holds the state of this particular completion 4711 + * @timeout: timeout value in jiffies 4712 + * 4713 + * This waits for either a completion of a specific task to be signaled or for a 4714 + * specified timeout to expire. The timeout is in jiffies. It is not 4715 + * interruptible. 
4716 + */ 4725 4717 unsigned long __sched 4726 4718 wait_for_completion_timeout(struct completion *x, unsigned long timeout) 4727 4719 { ··· 4748 4702 } 4749 4703 EXPORT_SYMBOL(wait_for_completion_timeout); 4750 4704 4705 + /** 4706 + * wait_for_completion_interruptible: - waits for completion of a task (w/intr) 4707 + * @x: holds the state of this particular completion 4708 + * 4709 + * This waits for completion of a specific task to be signaled. It is 4710 + * interruptible. 4711 + */ 4751 4712 int __sched wait_for_completion_interruptible(struct completion *x) 4752 4713 { 4753 4714 long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_INTERRUPTIBLE); ··· 4764 4711 } 4765 4712 EXPORT_SYMBOL(wait_for_completion_interruptible); 4766 4713 4714 + /** 4715 + * wait_for_completion_interruptible_timeout: - waits for completion (w/(to,intr)) 4716 + * @x: holds the state of this particular completion 4717 + * @timeout: timeout value in jiffies 4718 + * 4719 + * This waits for either a completion of a specific task to be signaled or for a 4720 + * specified timeout to expire. It is interruptible. The timeout is in jiffies. 4721 + */ 4767 4722 unsigned long __sched 4768 4723 wait_for_completion_interruptible_timeout(struct completion *x, 4769 4724 unsigned long timeout) ··· 4780 4719 } 4781 4720 EXPORT_SYMBOL(wait_for_completion_interruptible_timeout); 4782 4721 4722 + /** 4723 + * wait_for_completion_killable: - waits for completion of a task (killable) 4724 + * @x: holds the state of this particular completion 4725 + * 4726 + * This waits to be signaled for completion of a specific task. It can be 4727 + * interrupted by a kill signal. 4728 + */ 4783 4729 int __sched wait_for_completion_killable(struct completion *x) 4784 4730 { 4785 4731 long t = wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_KILLABLE); ··· 5189 5121 * Do not allow realtime tasks into groups that have no runtime 5190 5122 * assigned. 
5191 5123 */ 5192 - if (rt_policy(policy) && task_group(p)->rt_bandwidth.rt_runtime == 0) 5124 + if (rt_bandwidth_enabled() && rt_policy(policy) && 5125 + task_group(p)->rt_bandwidth.rt_runtime == 0) 5193 5126 return -EPERM; 5194 5127 #endif 5195 5128 ··· 6026 5957 set_task_cpu(p, dest_cpu); 6027 5958 if (on_rq) { 6028 5959 activate_task(rq_dest, p, 0); 6029 - check_preempt_curr(rq_dest, p); 5960 + check_preempt_curr(rq_dest, p, 0); 6030 5961 } 6031 5962 done: 6032 5963 ret = 1; ··· 8311 8242 #ifdef in_atomic 8312 8243 static unsigned long prev_jiffy; /* ratelimiting */ 8313 8244 8314 - if ((in_atomic() || irqs_disabled()) && 8315 - system_state == SYSTEM_RUNNING && !oops_in_progress) { 8316 - if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) 8317 - return; 8318 - prev_jiffy = jiffies; 8319 - printk(KERN_ERR "BUG: sleeping function called from invalid" 8320 - " context at %s:%d\n", file, line); 8321 - printk("in_atomic():%d, irqs_disabled():%d\n", 8322 - in_atomic(), irqs_disabled()); 8323 - debug_show_held_locks(current); 8324 - if (irqs_disabled()) 8325 - print_irqtrace_events(current); 8326 - dump_stack(); 8327 - } 8245 + if ((!in_atomic() && !irqs_disabled()) || 8246 + system_state != SYSTEM_RUNNING || oops_in_progress) 8247 + return; 8248 + if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) 8249 + return; 8250 + prev_jiffy = jiffies; 8251 + 8252 + printk(KERN_ERR 8253 + "BUG: sleeping function called from invalid context at %s:%d\n", 8254 + file, line); 8255 + printk(KERN_ERR 8256 + "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", 8257 + in_atomic(), irqs_disabled(), 8258 + current->pid, current->comm); 8259 + 8260 + debug_show_held_locks(current); 8261 + if (irqs_disabled()) 8262 + print_irqtrace_events(current); 8263 + dump_stack(); 8328 8264 #endif 8329 8265 } 8330 8266 EXPORT_SYMBOL(__might_sleep); ··· 8827 8753 static unsigned long to_ratio(u64 period, u64 runtime) 8828 8754 { 8829 8755 if (runtime == RUNTIME_INF) 8830 - return 
1ULL << 16; 8756 + return 1ULL << 20; 8831 8757 8832 - return div64_u64(runtime << 16, period); 8758 + return div64_u64(runtime << 20, period); 8833 8759 } 8834 - 8835 - #ifdef CONFIG_CGROUP_SCHED 8836 - static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) 8837 - { 8838 - struct task_group *tgi, *parent = tg->parent; 8839 - unsigned long total = 0; 8840 - 8841 - if (!parent) { 8842 - if (global_rt_period() < period) 8843 - return 0; 8844 - 8845 - return to_ratio(period, runtime) < 8846 - to_ratio(global_rt_period(), global_rt_runtime()); 8847 - } 8848 - 8849 - if (ktime_to_ns(parent->rt_bandwidth.rt_period) < period) 8850 - return 0; 8851 - 8852 - rcu_read_lock(); 8853 - list_for_each_entry_rcu(tgi, &parent->children, siblings) { 8854 - if (tgi == tg) 8855 - continue; 8856 - 8857 - total += to_ratio(ktime_to_ns(tgi->rt_bandwidth.rt_period), 8858 - tgi->rt_bandwidth.rt_runtime); 8859 - } 8860 - rcu_read_unlock(); 8861 - 8862 - return total + to_ratio(period, runtime) <= 8863 - to_ratio(ktime_to_ns(parent->rt_bandwidth.rt_period), 8864 - parent->rt_bandwidth.rt_runtime); 8865 - } 8866 - #elif defined CONFIG_USER_SCHED 8867 - static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) 8868 - { 8869 - struct task_group *tgi; 8870 - unsigned long total = 0; 8871 - unsigned long global_ratio = 8872 - to_ratio(global_rt_period(), global_rt_runtime()); 8873 - 8874 - rcu_read_lock(); 8875 - list_for_each_entry_rcu(tgi, &task_groups, list) { 8876 - if (tgi == tg) 8877 - continue; 8878 - 8879 - total += to_ratio(ktime_to_ns(tgi->rt_bandwidth.rt_period), 8880 - tgi->rt_bandwidth.rt_runtime); 8881 - } 8882 - rcu_read_unlock(); 8883 - 8884 - return total + to_ratio(period, runtime) < global_ratio; 8885 - } 8886 - #endif 8887 8760 8888 8761 /* Must be called with tasklist_lock held */ 8889 8762 static inline int tg_has_rt_tasks(struct task_group *tg) 8890 8763 { 8891 8764 struct task_struct *g, *p; 8765 + 8892 8766 do_each_thread(g, p) { 
8893 8767 if (rt_task(p) && rt_rq_of_se(&p->rt)->tg == tg) 8894 8768 return 1; 8895 8769 } while_each_thread(g, p); 8770 + 8896 8771 return 0; 8772 + } 8773 + 8774 + struct rt_schedulable_data { 8775 + struct task_group *tg; 8776 + u64 rt_period; 8777 + u64 rt_runtime; 8778 + }; 8779 + 8780 + static int tg_schedulable(struct task_group *tg, void *data) 8781 + { 8782 + struct rt_schedulable_data *d = data; 8783 + struct task_group *child; 8784 + unsigned long total, sum = 0; 8785 + u64 period, runtime; 8786 + 8787 + period = ktime_to_ns(tg->rt_bandwidth.rt_period); 8788 + runtime = tg->rt_bandwidth.rt_runtime; 8789 + 8790 + if (tg == d->tg) { 8791 + period = d->rt_period; 8792 + runtime = d->rt_runtime; 8793 + } 8794 + 8795 + /* 8796 + * Cannot have more runtime than the period. 8797 + */ 8798 + if (runtime > period && runtime != RUNTIME_INF) 8799 + return -EINVAL; 8800 + 8801 + /* 8802 + * Ensure we don't starve existing RT tasks. 8803 + */ 8804 + if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg)) 8805 + return -EBUSY; 8806 + 8807 + total = to_ratio(period, runtime); 8808 + 8809 + /* 8810 + * Nobody can have more than the global setting allows. 8811 + */ 8812 + if (total > to_ratio(global_rt_period(), global_rt_runtime())) 8813 + return -EINVAL; 8814 + 8815 + /* 8816 + * The sum of our children's runtime should not exceed our own. 
8817 + */ 8818 + list_for_each_entry_rcu(child, &tg->children, siblings) { 8819 + period = ktime_to_ns(child->rt_bandwidth.rt_period); 8820 + runtime = child->rt_bandwidth.rt_runtime; 8821 + 8822 + if (child == d->tg) { 8823 + period = d->rt_period; 8824 + runtime = d->rt_runtime; 8825 + } 8826 + 8827 + sum += to_ratio(period, runtime); 8828 + } 8829 + 8830 + if (sum > total) 8831 + return -EINVAL; 8832 + 8833 + return 0; 8834 + } 8835 + 8836 + static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) 8837 + { 8838 + struct rt_schedulable_data data = { 8839 + .tg = tg, 8840 + .rt_period = period, 8841 + .rt_runtime = runtime, 8842 + }; 8843 + 8844 + return walk_tg_tree(tg_schedulable, tg_nop, &data); 8897 8845 } 8898 8846 8899 8847 static int tg_set_bandwidth(struct task_group *tg, ··· 8925 8829 8926 8830 mutex_lock(&rt_constraints_mutex); 8927 8831 read_lock(&tasklist_lock); 8928 - if (rt_runtime == 0 && tg_has_rt_tasks(tg)) { 8929 - err = -EBUSY; 8832 + err = __rt_schedulable(tg, rt_period, rt_runtime); 8833 + if (err) 8930 8834 goto unlock; 8931 - } 8932 - if (!__rt_schedulable(tg, rt_period, rt_runtime)) { 8933 - err = -EINVAL; 8934 - goto unlock; 8935 - } 8936 8835 8937 8836 spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock); 8938 8837 tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period); ··· 8996 8905 8997 8906 static int sched_rt_global_constraints(void) 8998 8907 { 8999 - struct task_group *tg = &root_task_group; 9000 - u64 rt_runtime, rt_period; 8908 + u64 runtime, period; 9001 8909 int ret = 0; 9002 8910 9003 8911 if (sysctl_sched_rt_period <= 0) 9004 8912 return -EINVAL; 9005 8913 9006 - rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period); 9007 - rt_runtime = tg->rt_bandwidth.rt_runtime; 8914 + runtime = global_rt_runtime(); 8915 + period = global_rt_period(); 8916 + 8917 + /* 8918 + * Sanity check on the sysctl variables. 
8919 + */ 8920 + if (runtime > period && runtime != RUNTIME_INF) 8921 + return -EINVAL; 9008 8922 9009 8923 mutex_lock(&rt_constraints_mutex); 9010 - if (!__rt_schedulable(tg, rt_period, rt_runtime)) 9011 - ret = -EINVAL; 8924 + read_lock(&tasklist_lock); 8925 + ret = __rt_schedulable(NULL, 0, 0); 8926 + read_unlock(&tasklist_lock); 9012 8927 mutex_unlock(&rt_constraints_mutex); 9013 8928 9014 8929 return ret; ··· 9088 8991 9089 8992 if (!cgrp->parent) { 9090 8993 /* This is early initialization for the top cgroup */ 9091 - init_task_group.css.cgroup = cgrp; 9092 8994 return &init_task_group.css; 9093 8995 } 9094 8996 ··· 9095 8999 tg = sched_create_group(parent); 9096 9000 if (IS_ERR(tg)) 9097 9001 return ERR_PTR(-ENOMEM); 9098 - 9099 - /* Bind the cgroup to task_group object we just created */ 9100 - tg->css.cgroup = cgrp; 9101 9002 9102 9003 return &tg->css; 9103 9004 }
+51 -171
kernel/sched_fair.c
··· 409 409 } 410 410 411 411 /* 412 - * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in 413 - * that it favours >=0 over <0. 414 - * 415 - * -20 | 416 - * | 417 - * 0 --------+------- 418 - * .' 419 - * 19 .' 420 - * 421 - */ 422 - static unsigned long 423 - calc_delta_asym(unsigned long delta, struct sched_entity *se) 424 - { 425 - struct load_weight lw = { 426 - .weight = NICE_0_LOAD, 427 - .inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT) 428 - }; 429 - 430 - for_each_sched_entity(se) { 431 - struct load_weight *se_lw = &se->load; 432 - unsigned long rw = cfs_rq_of(se)->load.weight; 433 - 434 - #ifdef CONFIG_FAIR_SCHED_GROUP 435 - struct cfs_rq *cfs_rq = se->my_q; 436 - struct task_group *tg = NULL 437 - 438 - if (cfs_rq) 439 - tg = cfs_rq->tg; 440 - 441 - if (tg && tg->shares < NICE_0_LOAD) { 442 - /* 443 - * scale shares to what it would have been had 444 - * tg->weight been NICE_0_LOAD: 445 - * 446 - * weight = 1024 * shares / tg->weight 447 - */ 448 - lw.weight *= se->load.weight; 449 - lw.weight /= tg->shares; 450 - 451 - lw.inv_weight = 0; 452 - 453 - se_lw = &lw; 454 - rw += lw.weight - se->load.weight; 455 - } else 456 - #endif 457 - 458 - if (se->load.weight < NICE_0_LOAD) { 459 - se_lw = &lw; 460 - rw += NICE_0_LOAD - se->load.weight; 461 - } 462 - 463 - delta = calc_delta_mine(delta, rw, se_lw); 464 - } 465 - 466 - return delta; 467 - } 468 - 469 - /* 470 412 * Update the current task's runtime statistics. Skip current tasks that 471 413 * are not in our scheduling class. 
472 414 */ ··· 528 586 update_load_add(&cfs_rq->load, se->load.weight); 529 587 if (!parent_entity(se)) 530 588 inc_cpu_load(rq_of(cfs_rq), se->load.weight); 531 - if (entity_is_task(se)) 589 + if (entity_is_task(se)) { 532 590 add_cfs_task_weight(cfs_rq, se->load.weight); 591 + list_add(&se->group_node, &cfs_rq->tasks); 592 + } 533 593 cfs_rq->nr_running++; 534 594 se->on_rq = 1; 535 - list_add(&se->group_node, &cfs_rq->tasks); 536 595 } 537 596 538 597 static void ··· 542 599 update_load_sub(&cfs_rq->load, se->load.weight); 543 600 if (!parent_entity(se)) 544 601 dec_cpu_load(rq_of(cfs_rq), se->load.weight); 545 - if (entity_is_task(se)) 602 + if (entity_is_task(se)) { 546 603 add_cfs_task_weight(cfs_rq, -se->load.weight); 604 + list_del_init(&se->group_node); 605 + } 547 606 cfs_rq->nr_running--; 548 607 se->on_rq = 0; 549 - list_del_init(&se->group_node); 550 608 } 551 609 552 610 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se) ··· 1029 1085 long wl, long wg) 1030 1086 { 1031 1087 struct sched_entity *se = tg->se[cpu]; 1032 - long more_w; 1033 1088 1034 1089 if (!tg->parent) 1035 1090 return wl; ··· 1040 1097 if (!wl && sched_feat(ASYM_EFF_LOAD)) 1041 1098 return wl; 1042 1099 1043 - /* 1044 - * Instead of using this increment, also add the difference 1045 - * between when the shares were last updated and now. 1046 - */ 1047 - more_w = se->my_q->load.weight - se->my_q->rq_weight; 1048 - wl += more_w; 1049 - wg += more_w; 1050 - 1051 1100 for_each_sched_entity(se) { 1052 - #define D(n) (likely(n) ? (n) : 1) 1053 - 1054 1101 long S, rw, s, a, b; 1102 + long more_w; 1103 + 1104 + /* 1105 + * Instead of using this increment, also add the difference 1106 + * between when the shares were last updated and now. 
1107 + */ 1108 + more_w = se->my_q->load.weight - se->my_q->rq_weight; 1109 + wl += more_w; 1110 + wg += more_w; 1055 1111 1056 1112 S = se->my_q->tg->shares; 1057 1113 s = se->my_q->shares; ··· 1059 1117 a = S*(rw + wl); 1060 1118 b = S*rw + s*wg; 1061 1119 1062 - wl = s*(a-b)/D(b); 1120 + wl = s*(a-b); 1121 + 1122 + if (likely(b)) 1123 + wl /= b; 1124 + 1063 1125 /* 1064 1126 * Assume the group is already running and will 1065 1127 * thus already be accounted for in the weight. ··· 1072 1126 * alter the group weight. 1073 1127 */ 1074 1128 wg = 0; 1075 - #undef D 1076 1129 } 1077 1130 1078 1131 return wl; ··· 1088 1143 #endif 1089 1144 1090 1145 static int 1091 - wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq, 1146 + wake_affine(struct sched_domain *this_sd, struct rq *this_rq, 1092 1147 struct task_struct *p, int prev_cpu, int this_cpu, int sync, 1093 1148 int idx, unsigned long load, unsigned long this_load, 1094 1149 unsigned int imbalance) ··· 1136 1191 schedstat_inc(p, se.nr_wakeups_affine_attempts); 1137 1192 tl_per_task = cpu_avg_load_per_task(this_cpu); 1138 1193 1139 - if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) || 1140 - balanced) { 1194 + if (balanced || (tl <= load && tl + target_load(prev_cpu, idx) <= 1195 + tl_per_task)) { 1141 1196 /* 1142 1197 * This domain has SD_WAKE_AFFINE and 1143 1198 * p is cache cold in this domain, and ··· 1156 1211 struct sched_domain *sd, *this_sd = NULL; 1157 1212 int prev_cpu, this_cpu, new_cpu; 1158 1213 unsigned long load, this_load; 1159 - struct rq *rq, *this_rq; 1214 + struct rq *this_rq; 1160 1215 unsigned int imbalance; 1161 1216 int idx; 1162 1217 1163 1218 prev_cpu = task_cpu(p); 1164 - rq = task_rq(p); 1165 1219 this_cpu = smp_processor_id(); 1166 1220 this_rq = cpu_rq(this_cpu); 1167 1221 new_cpu = prev_cpu; 1168 1222 1223 + if (prev_cpu == this_cpu) 1224 + goto out; 1169 1225 /* 1170 1226 * 'this_sd' is the first domain that both 1171 1227 * this_cpu and 
prev_cpu are present in: ··· 1194 1248 load = source_load(prev_cpu, idx); 1195 1249 this_load = target_load(this_cpu, idx); 1196 1250 1197 - if (wake_affine(rq, this_sd, this_rq, p, prev_cpu, this_cpu, sync, idx, 1251 + if (wake_affine(this_sd, this_rq, p, prev_cpu, this_cpu, sync, idx, 1198 1252 load, this_load, imbalance)) 1199 1253 return this_cpu; 1200 - 1201 - if (prev_cpu == this_cpu) 1202 - goto out; 1203 1254 1204 1255 /* 1205 1256 * Start passive balancing when half the imbalance_pct ··· 1224 1281 * + nice tasks. 1225 1282 */ 1226 1283 if (sched_feat(ASYM_GRAN)) 1227 - gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se); 1228 - else 1229 - gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se); 1284 + gran = calc_delta_mine(gran, NICE_0_LOAD, &se->load); 1230 1285 1231 1286 return gran; 1232 1287 } 1233 1288 1234 1289 /* 1235 - * Should 'se' preempt 'curr'. 1236 - * 1237 - * |s1 1238 - * |s2 1239 - * |s3 1240 - * g 1241 - * |<--->|c 1242 - * 1243 - * w(c, s1) = -1 1244 - * w(c, s2) = 0 1245 - * w(c, s3) = 1 1246 - * 1247 - */ 1248 - static int 1249 - wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se) 1250 - { 1251 - s64 gran, vdiff = curr->vruntime - se->vruntime; 1252 - 1253 - if (vdiff < 0) 1254 - return -1; 1255 - 1256 - gran = wakeup_gran(curr); 1257 - if (vdiff > gran) 1258 - return 1; 1259 - 1260 - return 0; 1261 - } 1262 - 1263 - /* return depth at which a sched entity is present in the hierarchy */ 1264 - static inline int depth_se(struct sched_entity *se) 1265 - { 1266 - int depth = 0; 1267 - 1268 - for_each_sched_entity(se) 1269 - depth++; 1270 - 1271 - return depth; 1272 - } 1273 - 1274 - /* 1275 1290 * Preempt the current task with a newly woken task if needed: 1276 1291 */ 1277 - static void check_preempt_wakeup(struct rq *rq, struct task_struct *p) 1292 + static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync) 1278 1293 { 1279 1294 struct task_struct *curr = rq->curr; 1280 1295 
struct cfs_rq *cfs_rq = task_cfs_rq(curr); 1281 1296 struct sched_entity *se = &curr->se, *pse = &p->se; 1282 - int se_depth, pse_depth; 1297 + s64 delta_exec; 1283 1298 1284 1299 if (unlikely(rt_prio(p->prio))) { 1285 1300 update_rq_clock(rq); ··· 1252 1351 cfs_rq_of(pse)->next = pse; 1253 1352 1254 1353 /* 1354 + * We can come here with TIF_NEED_RESCHED already set from new task 1355 + * wake up path. 1356 + */ 1357 + if (test_tsk_need_resched(curr)) 1358 + return; 1359 + 1360 + /* 1255 1361 * Batch tasks do not preempt (their preemption is driven by 1256 1362 * the tick): 1257 1363 */ ··· 1268 1360 if (!sched_feat(WAKEUP_PREEMPT)) 1269 1361 return; 1270 1362 1271 - /* 1272 - * preemption test can be made between sibling entities who are in the 1273 - * same cfs_rq i.e who have a common parent. Walk up the hierarchy of 1274 - * both tasks until we find their ancestors who are siblings of common 1275 - * parent. 1276 - */ 1277 - 1278 - /* First walk up until both entities are at same depth */ 1279 - se_depth = depth_se(se); 1280 - pse_depth = depth_se(pse); 1281 - 1282 - while (se_depth > pse_depth) { 1283 - se_depth--; 1284 - se = parent_entity(se); 1363 + if (sched_feat(WAKEUP_OVERLAP) && sync && 1364 + se->avg_overlap < sysctl_sched_migration_cost && 1365 + pse->avg_overlap < sysctl_sched_migration_cost) { 1366 + resched_task(curr); 1367 + return; 1285 1368 } 1286 1369 1287 - while (pse_depth > se_depth) { 1288 - pse_depth--; 1289 - pse = parent_entity(pse); 1290 - } 1291 - 1292 - while (!is_same_group(se, pse)) { 1293 - se = parent_entity(se); 1294 - pse = parent_entity(pse); 1295 - } 1296 - 1297 - if (wakeup_preempt_entity(se, pse) == 1) 1370 + delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime; 1371 + if (delta_exec > wakeup_gran(pse)) 1298 1372 resched_task(curr); 1299 1373 } 1300 1374 ··· 1335 1445 if (next == &cfs_rq->tasks) 1336 1446 return NULL; 1337 1447 1338 - /* Skip over entities that are not tasks */ 1339 - do { 1340 - se = 
list_entry(next, struct sched_entity, group_node); 1341 - next = next->next; 1342 - } while (next != &cfs_rq->tasks && !entity_is_task(se)); 1343 - 1344 - if (next == &cfs_rq->tasks) 1345 - return NULL; 1346 - 1347 - cfs_rq->balance_iterator = next; 1348 - 1349 - if (entity_is_task(se)) 1350 - p = task_of(se); 1448 + se = list_entry(next, struct sched_entity, group_node); 1449 + p = task_of(se); 1450 + cfs_rq->balance_iterator = next->next; 1351 1451 1352 1452 return p; 1353 1453 } ··· 1387 1507 rcu_read_lock(); 1388 1508 update_h_load(busiest_cpu); 1389 1509 1390 - list_for_each_entry(tg, &task_groups, list) { 1510 + list_for_each_entry_rcu(tg, &task_groups, list) { 1391 1511 struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu]; 1392 1512 unsigned long busiest_h_load = busiest_cfs_rq->h_load; 1393 1513 unsigned long busiest_weight = busiest_cfs_rq->load.weight; ··· 1500 1620 * 'current' within the tree based on its new key value. 1501 1621 */ 1502 1622 swap(curr->vruntime, se->vruntime); 1623 + resched_task(rq->curr); 1503 1624 } 1504 1625 1505 1626 enqueue_task_fair(rq, p, 0); 1506 - resched_task(rq->curr); 1507 1627 } 1508 1628 1509 1629 /* ··· 1522 1642 if (p->prio > oldprio) 1523 1643 resched_task(rq->curr); 1524 1644 } else 1525 - check_preempt_curr(rq, p); 1645 + check_preempt_curr(rq, p, 0); 1526 1646 } 1527 1647 1528 1648 /* ··· 1539 1659 if (running) 1540 1660 resched_task(rq->curr); 1541 1661 else 1542 - check_preempt_curr(rq, p); 1662 + check_preempt_curr(rq, p, 0); 1543 1663 } 1544 1664 1545 1665 /* Account for a task changing its policy or group.
+1
kernel/sched_features.h
··· 11 11 SCHED_FEAT(LB_BIAS, 1) 12 12 SCHED_FEAT(LB_WAKEUP_UPDATE, 1) 13 13 SCHED_FEAT(ASYM_EFF_LOAD, 1) 14 + SCHED_FEAT(WAKEUP_OVERLAP, 0)
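The new WAKEUP_OVERLAP feature bit (default off) gates the sync-wakeup fast path added to check_preempt_wakeup(): on a synchronous wakeup where both tasks' average run/sleep overlap is below sysctl_sched_migration_cost, the waker is preempted immediately, which suits pipe-like pairs that rapidly hand off to each other. The predicate, reduced to a standalone sketch (the 0.5 ms migration-cost value is mainline's default; the function and parameter names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t migration_cost_ns = 500000;	/* 0.5 ms default */

/* Sketch of the WAKEUP_OVERLAP test: only a synchronous wakeup between
 * two short-overlap (quickly alternating) tasks triggers an immediate
 * preemption of the waker. */
static int wakeup_overlap_preempt(int feature_on, int sync,
				  uint64_t curr_avg_overlap,
				  uint64_t waker_avg_overlap)
{
	return feature_on && sync &&
	       curr_avg_overlap < migration_cost_ns &&
	       waker_avg_overlap < migration_cost_ns;
}
```

All four conditions must hold; clearing any one of them falls through to the ordinary granularity-based preemption test.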
+3 -3
kernel/sched_idletask.c
··· 14 14 /* 15 15 * Idle tasks are unconditionally rescheduled: 16 16 */ 17 - static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p) 17 + static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int sync) 18 18 { 19 19 resched_task(rq->idle); 20 20 } ··· 76 76 if (running) 77 77 resched_task(rq->curr); 78 78 else 79 - check_preempt_curr(rq, p); 79 + check_preempt_curr(rq, p, 0); 80 80 } 81 81 82 82 static void prio_changed_idle(struct rq *rq, struct task_struct *p, ··· 93 93 if (p->prio > oldprio) 94 94 resched_task(rq->curr); 95 95 } else 96 - check_preempt_curr(rq, p); 96 + check_preempt_curr(rq, p, 0); 97 97 } 98 98 99 99 /*
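Every scheduling class in this series (fair, idle, rt) had to grow the extra sync argument because check_preempt_curr() dispatches through a per-class function pointer. The shape of that dispatch, as a hypothetical miniature of the kernel's sched_class table:

```c
#include <assert.h>
#include <stddef.h>

struct rq_sketch;
struct task_sketch;

/* Per-policy operations table, as sched_class is in the kernel; only
 * the hook this series changes is sketched here. */
struct sched_class_sketch {
	void (*check_preempt_curr)(struct rq_sketch *rq,
				   struct task_sketch *p, int sync);
};

struct task_sketch {
	const struct sched_class_sketch *sched_class;
};

struct rq_sketch {
	struct task_sketch *curr;
	int preempt_requests;	/* stands in for resched_task() */
};

/* Idle policy: unconditionally reschedule, mirroring
 * check_preempt_curr_idle() above. */
static void idle_check_preempt(struct rq_sketch *rq,
			       struct task_sketch *p, int sync)
{
	(void)p; (void)sync;
	rq->preempt_requests++;
}

static const struct sched_class_sketch idle_class_sketch = {
	.check_preempt_curr = idle_check_preempt,
};

/* The generic entry point: one indirect call through the class of the
 * task currently on the runqueue. */
static void check_preempt_curr_sketch(struct rq_sketch *rq,
				      struct task_sketch *p, int sync)
{
	rq->curr->sched_class->check_preempt_curr(rq, p, sync);
}

static int dispatch_demo(void)
{
	struct task_sketch idle = { &idle_class_sketch };
	struct task_sketch waker = { &idle_class_sketch };
	struct rq_sketch rq = { &idle, 0 };

	check_preempt_curr_sketch(&rq, &waker, 0);
	return rq.preempt_requests;
}
```

Because the callers don't know which class they are preempting into, changing the hook's signature forces the mechanical `, int sync` / `, 0` edits visible throughout this diff.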
+51 -6
kernel/sched_rt.c
··· 102 102 103 103 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq) 104 104 { 105 + struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr; 105 106 struct sched_rt_entity *rt_se = rt_rq->rt_se; 106 107 107 - if (rt_se && !on_rt_rq(rt_se) && rt_rq->rt_nr_running) { 108 - struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr; 109 - 110 - enqueue_rt_entity(rt_se); 108 + if (rt_rq->rt_nr_running) { 109 + if (rt_se && !on_rt_rq(rt_se)) 110 + enqueue_rt_entity(rt_se); 111 111 if (rt_rq->highest_prio < curr->prio) 112 112 resched_task(curr); 113 113 } ··· 231 231 #endif /* CONFIG_RT_GROUP_SCHED */ 232 232 233 233 #ifdef CONFIG_SMP 234 + /* 235 + * We ran out of runtime, see if we can borrow some from our neighbours. 236 + */ 234 237 static int do_balance_runtime(struct rt_rq *rt_rq) 235 238 { 236 239 struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq); ··· 253 250 continue; 254 251 255 252 spin_lock(&iter->rt_runtime_lock); 253 + /* 254 + * Either all rqs have inf runtime and there's nothing to steal 255 + * or __disable_runtime() below sets a specific rq to inf to 256 + * indicate it's been disabled and disallow stealing. 257 + */ 256 258 if (iter->rt_runtime == RUNTIME_INF) 257 259 goto next; 258 260 261 + /* 262 + * From runqueues with spare time, take 1/n part of their 263 + * spare time, but no more than our period. 264 + */ 259 265 diff = iter->rt_runtime - iter->rt_time; 260 266 if (diff > 0) { 261 267 diff = div_u64((u64)diff, weight); ··· 286 274 return more; 287 275 } 288 276 277 + /* 278 + * Ensure this RQ takes back all the runtime it lent to its neighbours. 279 + */ 289 280 static void __disable_runtime(struct rq *rq) 290 281 { 291 282 struct root_domain *rd = rq->rd; ··· 304 289 305 290 spin_lock(&rt_b->rt_runtime_lock); 306 291 spin_lock(&rt_rq->rt_runtime_lock); 292 + /* 293 + * Either we're all inf and nobody needs to borrow, or we're 294 + * already disabled and thus have nothing to do, or we have 295 + * exactly the right amount of runtime to take out. 
296 + */
307 297 if (rt_rq->rt_runtime == RUNTIME_INF ||
308 298 rt_rq->rt_runtime == rt_b->rt_runtime)
309 299 goto balanced;
310 300 spin_unlock(&rt_rq->rt_runtime_lock);
311 301
302 + /*
303 + * Calculate the difference between what we started out with
304 + * and what we currently have; that's the amount of runtime
305 + * we lent and now have to reclaim.
306 + */
312 307 want = rt_b->rt_runtime - rt_rq->rt_runtime;
313 308
309 + /*
310 + * Greedy reclaim, take back as much as we can.
311 + */
314 312 for_each_cpu_mask(i, rd->span) {
315 313 struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
316 314 s64 diff;
317 315
316 + /*
317 + * Can't reclaim from ourselves or disabled runqueues.
318 + */
318 319 if (iter == rt_rq || iter->rt_runtime == RUNTIME_INF)
319 320 continue;
320 321
··· 350 319 }
351 320
352 321 spin_lock(&rt_rq->rt_runtime_lock);
322 + /*
323 + * We cannot be left wanting - that would mean some runtime
324 + * leaked out of the system.
325 + */
353 326 BUG_ON(want);
354 327 balanced:
328 + /*
329 + * Disable all the borrow logic by pretending we have inf
330 + * runtime - in which case borrowing doesn't make sense.
331 + */ 355 332 rt_rq->rt_runtime = RUNTIME_INF; 356 333 spin_unlock(&rt_rq->rt_runtime_lock); 357 334 spin_unlock(&rt_b->rt_runtime_lock); ··· 382 343 if (unlikely(!scheduler_running)) 383 344 return; 384 345 346 + /* 347 + * Reset each runqueue's bandwidth settings 348 + */ 385 349 for_each_leaf_rt_rq(rt_rq, rq) { 386 350 struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq); 387 351 ··· 431 389 int i, idle = 1; 432 390 cpumask_t span; 433 391 434 - if (rt_b->rt_runtime == RUNTIME_INF) 392 + if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF) 435 393 return 1; 436 394 437 395 span = sched_rt_period_mask(); ··· 528 486 curr->se.sum_exec_runtime += delta_exec; 529 487 curr->se.exec_start = rq->clock; 530 488 cpuacct_charge(curr, delta_exec); 489 + 490 + if (!rt_bandwidth_enabled()) 491 + return; 531 492 532 493 for_each_sched_rt_entity(rt_se) { 533 494 rt_rq = rt_rq_of_se(rt_se); ··· 829 784 /* 830 785 * Preempt the current task with a newly woken task if needed: 831 786 */ 832 - static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) 787 + static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int sync) 833 788 { 834 789 if (p->prio < rq->curr->prio) { 835 790 resched_task(rq->curr);
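The comments added to do_balance_runtime() describe the borrowing rule: from each neighbour with spare time, take a 1/n share of that spare time, but never let our own runtime exceed our period. A minimal standalone sketch of that arithmetic follows; the field names are hypothetical simplifications of the rt_rq members, not the kernel code itself:

```c
#include <stdint.h>

/*
 * Sketch of the per-neighbour share computed in do_balance_runtime():
 * from a neighbour's spare time (runtime minus time already consumed)
 * we take 1/weight, clipped so our runtime never exceeds our period.
 * All parameter names here are illustrative, not kernel identifiers.
 */
static int64_t borrowable(int64_t iter_runtime, int64_t iter_time,
                          int64_t my_runtime, int64_t my_period,
                          int weight)
{
	int64_t diff = iter_runtime - iter_time;	/* neighbour's spare time */

	if (diff <= 0)
		return 0;				/* nothing to steal */

	diff /= weight;					/* our 1/n share of it */
	if (my_runtime + diff > my_period)		/* cap at our period */
		diff = my_period - my_runtime;
	return diff;
}
```

With weight = 3 CPUs in the span, a neighbour holding 600 units of spare time yields a 200-unit loan, unless that would push the borrower past its period, in which case the loan shrinks to fit.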
+2 -2
kernel/user.c
··· 169 169 { 170 170 struct user_struct *up = container_of(kobj, struct user_struct, kobj); 171 171 172 - return sprintf(buf, "%lu\n", sched_group_rt_runtime(up->tg)); 172 + return sprintf(buf, "%ld\n", sched_group_rt_runtime(up->tg)); 173 173 } 174 174 175 175 static ssize_t cpu_rt_runtime_store(struct kobject *kobj, ··· 180 180 unsigned long rt_runtime; 181 181 int rc; 182 182 183 - sscanf(buf, "%lu", &rt_runtime); 183 + sscanf(buf, "%ld", &rt_runtime); 184 184 185 185 rc = sched_group_set_rt_runtime(up->tg, rt_runtime); 186 186
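The kernel/user.c hunk switches the sysfs show/store conversions from "%lu" to "%ld". The point of the change is that an unlimited rt_runtime is reported to userspace as -1, a negative value, which "%lu" would render as ULONG_MAX. A minimal sketch, with a hypothetical helper name standing in for cpu_rt_runtime_show():

```c
#include <stdio.h>

/*
 * Illustrative only: format a group's rt_runtime the way the sysfs
 * show routine does after this patch.  Because "unlimited" reaches
 * userspace as -1, the signed "%ld" conversion is required; the old
 * "%lu" would print 18446744073709551615 on a 64-bit machine.
 */
static int show_rt_runtime(char *buf, long rt_runtime)
{
	return sprintf(buf, "%ld\n", rt_runtime);
}
```

The matching sscanf() change lets userspace write "-1" back to restore the unlimited setting.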