Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:

- Add the SYSTEM_SCHEDULING bootup state to move various scheduler
debug checks earlier into the bootup. This turns silent and
sporadically deadly bugs into nice, deterministic splats. Fix some
of the splats that triggered. (Thomas Gleixner)

- A round of restructuring and refactoring of the load-balancing and
topology code (Peter Zijlstra)

- Another round of consolidating ~20 years of incremental scheduler code
history: this time in terms of wait-queue nomenclature. (I didn't
get much feedback on these renaming patches, and we can still
easily change any names I might have misplaced, so if anyone hates
a new name, please holler and I'll fix it.) (Ingo Molnar)

- sched/numa improvements, fixes and updates (Rik van Riel)

- Another round of x86/tsc scheduler clock code improvements, in hope
of making it more robust (Peter Zijlstra)

- Improve NOHZ behavior (Frederic Weisbecker)

- Deadline scheduler improvements and fixes (Luca Abeni, Daniel
Bristot de Oliveira)

- Simplify and optimize the topology setup code (Lauro Ramos
Venancio)

- Debloat and decouple scheduler code some more (Nicolas Pitre)

- Simplify code by making better use of llist primitives (Byungchul
Park)

- ... plus other fixes and improvements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
sched/cputime: Refactor the cputime_adjust() code
sched/debug: Expose the number of RT/DL tasks that can migrate
sched/numa: Hide numa_wake_affine() from UP build
sched/fair: Remove effective_load()
sched/numa: Implement NUMA node level wake_affine()
sched/fair: Simplify wake_affine() for the single socket case
sched/numa: Override part of migrate_degrades_locality() when idle balancing
sched/rt: Move RT related code from sched/core.c to sched/rt.c
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
sched/fair: Spare idle load balancing on nohz_full CPUs
nohz: Move idle balancer registration to the idle path
sched/loadavg: Generalize "_idle" naming to "_nohz"
sched/core: Drop the unused try_get_task_struct() helper function
sched/fair: WARN() and refuse to set buddy when !se->on_rq
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
...

+3662 -2625
+1 -1
Documentation/DocBook/kernel-hacking.tmpl
··· 819 819 certain condition is true. They must be used carefully to ensure 820 820 there is no race condition. You declare a 821 821 <type>wait_queue_head_t</type>, and then processes which want to 822 - wait for that condition declare a <type>wait_queue_t</type> 822 + wait for that condition declare a <type>wait_queue_entry_t</type> 823 823 referring to themselves, and place that in the queue. 824 824 </para> 825 825
+6 -6
Documentation/filesystems/autofs4.txt
··· 316 316 struct autofs_v5_packet { 317 317 int proto_version; /* Protocol version */ 318 318 int type; /* Type of packet */ 319 - autofs_wqt_t wait_queue_token; 319 + autofs_wqt_t wait_queue_entry_token; 320 320 __u32 dev; 321 321 __u64 ino; 322 322 __u32 uid; ··· 341 341 `O_DIRECT`) to _pipe2(2)_ so that a read from the pipe will return at 342 342 most one packet, and any unread portion of a packet will be discarded. 343 343 344 - The `wait_queue_token` is a unique number which can identify a 344 + The `wait_queue_entry_token` is a unique number which can identify a 345 345 particular request to be acknowledged. When a message is sent over 346 346 the pipe the affected dentry is marked as either "active" or 347 347 "expiring" and other accesses to it block until the message is 348 348 acknowledged using one of the ioctls below and the relevant 349 - `wait_queue_token`. 349 + `wait_queue_entry_token`. 350 350 351 351 Communicating with autofs: root directory ioctls 352 352 ------------------------------------------------ ··· 358 358 The available ioctl commands are: 359 359 360 360 - **AUTOFS_IOC_READY**: a notification has been handled. The argument 361 - to the ioctl command is the "wait_queue_token" number 361 + to the ioctl command is the "wait_queue_entry_token" number 362 362 corresponding to the notification being acknowledged. 363 363 - **AUTOFS_IOC_FAIL**: similar to above, but indicates failure with 364 364 the error code `ENOENT`. ··· 382 382 struct autofs_packet_expire_multi { 383 383 int proto_version; /* Protocol version */ 384 384 int type; /* Type of packet */ 385 - autofs_wqt_t wait_queue_token; 385 + autofs_wqt_t wait_queue_entry_token; 386 386 int len; 387 387 char name[NAME_MAX+1]; 388 388 }; 389 389 390 390 is required. This is filled in with the name of something 391 391 that can be unmounted or removed. If nothing can be expired, 392 - `errno` is set to `EAGAIN`. Even though a `wait_queue_token` 392 + `errno` is set to `EAGAIN`. 
Even though a `wait_queue_entry_token` 393 393 is present in the structure, no "wait queue" is established 394 394 and no acknowledgment is needed. 395 395 - **AUTOFS_IOC_EXPIRE_MULTI**: This is similar to
+168
Documentation/scheduler/sched-deadline.txt
··· 7 7 0. WARNING 8 8 1. Overview 9 9 2. Scheduling algorithm 10 + 2.1 Main algorithm 11 + 2.2 Bandwidth reclaiming 10 12 3. Scheduling Real-Time Tasks 11 13 3.1 Definitions 12 14 3.2 Schedulability Analysis for Uniprocessor Systems ··· 45 43 46 44 2. Scheduling algorithm 47 45 ================== 46 + 47 + 2.1 Main algorithm 48 + ------------------ 48 49 49 50 SCHED_DEADLINE uses three parameters, named "runtime", "period", and 50 51 "deadline", to schedule tasks. A SCHED_DEADLINE task should receive ··· 116 111 117 112 scheduling deadline = scheduling deadline + period 118 113 remaining runtime = remaining runtime + runtime 114 + 115 + 116 + 2.2 Bandwidth reclaiming 117 + ------------------------ 118 + 119 + Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy 120 + Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled 121 + when flag SCHED_FLAG_RECLAIM is set. 122 + 123 + The following diagram illustrates the state names for tasks handled by GRUB: 124 + 125 + ------------ 126 + (d) | Active | 127 + ------------->| | 128 + | | Contending | 129 + | ------------ 130 + | A | 131 + ---------- | | 132 + | | | | 133 + | Inactive | |(b) | (a) 134 + | | | | 135 + ---------- | | 136 + A | V 137 + | ------------ 138 + | | Active | 139 + --------------| Non | 140 + (c) | Contending | 141 + ------------ 142 + 143 + A task can be in one of the following states: 144 + 145 + - ActiveContending: if it is ready for execution (or executing); 146 + 147 + - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag 148 + time; 149 + 150 + - Inactive: if it is blocked and has surpassed the 0-lag time. 151 + 152 + State transitions: 153 + 154 + (a) When a task blocks, it does not become immediately inactive since its 155 + bandwidth cannot be immediately reclaimed without breaking the 156 + real-time guarantees. It therefore enters a transitional state called 157 + ActiveNonContending. 
The scheduler arms the "inactive timer" to fire at 158 + the 0-lag time, when the task's bandwidth can be reclaimed without 159 + breaking the real-time guarantees. 160 + 161 + The 0-lag time for a task entering the ActiveNonContending state is 162 + computed as 163 + 164 + (runtime * dl_period) 165 + deadline - --------------------- 166 + dl_runtime 167 + 168 + where runtime is the remaining runtime, while dl_runtime and dl_period 169 + are the reservation parameters. 170 + 171 + (b) If the task wakes up before the inactive timer fires, the task re-enters 172 + the ActiveContending state and the "inactive timer" is canceled. 173 + In addition, if the task wakes up on a different runqueue, then 174 + the task's utilization must be removed from the previous runqueue's active 175 + utilization and must be added to the new runqueue's active utilization. 176 + In order to avoid races between a task waking up on a runqueue while the 177 + "inactive timer" is running on a different CPU, the "dl_non_contending" 178 + flag is used to indicate that a task is not on a runqueue but is active 179 + (so, the flag is set when the task blocks and is cleared when the 180 + "inactive timer" fires or when the task wakes up). 181 + 182 + (c) When the "inactive timer" fires, the task enters the Inactive state and 183 + its utilization is removed from the runqueue's active utilization. 184 + 185 + (d) When an inactive task wakes up, it enters the ActiveContending state and 186 + its utilization is added to the active utilization of the runqueue where 187 + it has been enqueued. 188 + 189 + For each runqueue, the algorithm GRUB keeps track of two different bandwidths: 190 + 191 + - Active bandwidth (running_bw): this is the sum of the bandwidths of all 192 + tasks in active state (i.e., ActiveContending or ActiveNonContending); 193 + 194 + - Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the 195 + runqueue, including the tasks in Inactive state. 
196 + 197 + 198 + The algorithm reclaims the bandwidth of the tasks in Inactive state. 199 + It does so by decrementing the runtime of the executing task Ti at a pace equal 200 + to 201 + 202 + dq = -max{ Ui, (1 - Uinact) } dt 203 + 204 + where Uinact is the inactive utilization, computed as (this_bq - running_bw), 205 + and Ui is the bandwidth of task Ti. 206 + 207 + 208 + Let's now see a trivial example of two deadline tasks with runtime equal 209 + to 4 and period equal to 8 (i.e., bandwidth equal to 0.5): 210 + 211 + A Task T1 212 + | 213 + | | 214 + | | 215 + |-------- |---- 216 + | | V 217 + |---|---|---|---|---|---|---|---|--------->t 218 + 0 1 2 3 4 5 6 7 8 219 + 220 + 221 + A Task T2 222 + | 223 + | | 224 + | | 225 + | ------------------------| 226 + | | V 227 + |---|---|---|---|---|---|---|---|--------->t 228 + 0 1 2 3 4 5 6 7 8 229 + 230 + 231 + A running_bw 232 + | 233 + 1 ----------------- ------ 234 + | | | 235 + 0.5- ----------------- 236 + | | 237 + |---|---|---|---|---|---|---|---|--------->t 238 + 0 1 2 3 4 5 6 7 8 239 + 240 + 241 + - Time t = 0: 242 + 243 + Both tasks are ready for execution and therefore in ActiveContending state. 244 + Suppose Task T1 is the first task to start execution. 245 + Since there are no inactive tasks, its runtime is decreased as dq = -1 dt. 246 + 247 + - Time t = 2: 248 + 249 + Suppose that task T1 blocks 250 + Task T1 therefore enters the ActiveNonContending state. Since its remaining 251 + runtime is equal to 2, its 0-lag time is equal to t = 4. 252 + Task T2 start execution, with runtime still decreased as dq = -1 dt since 253 + there are no inactive tasks. 254 + 255 + - Time t = 4: 256 + 257 + This is the 0-lag time for Task T1. Since it didn't woken up in the 258 + meantime, it enters the Inactive state. Its bandwidth is removed from 259 + running_bw. 260 + Task T2 continues its execution. However, its runtime is now decreased as 261 + dq = - 0.5 dt because Uinact = 0.5. 
262 + Task T2 therefore reclaims the bandwidth unused by Task T1. 263 + 264 + - Time t = 8: 265 + 266 + Task T1 wakes up. It enters the ActiveContending state again, and the 267 + running_bw is incremented. 119 268 120 269 121 270 3. Scheduling Real-Time Tasks ··· 489 330 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for 490 331 Global EDF. Proceedings of the 22nd Euromicro Conference on 491 332 Real-Time Systems, 2010. 333 + 15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in 334 + constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time 335 + Systems, 2000. 336 + 16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for 337 + SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), 338 + Dusseldorf, Germany, 2014. 339 + 17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel 340 + or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied 341 + Computing, 2016. 492 342 493 343 494 344 4. Bandwidth management
+1 -1
Documentation/trace/ftrace.txt
··· 1609 1609 <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update 1610 1610 <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz 1611 1611 <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock 1612 - <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit 1612 + <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit 1613 1613 <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit 1614 1614 <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit 1615 1615 <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
+1 -2
arch/arm/kernel/smp.c
··· 555 555 */ 556 556 static void ipi_cpu_stop(unsigned int cpu) 557 557 { 558 - if (system_state == SYSTEM_BOOTING || 559 - system_state == SYSTEM_RUNNING) { 558 + if (system_state <= SYSTEM_RUNNING) { 560 559 raw_spin_lock(&stop_lock); 561 560 pr_crit("CPU%u: stopping\n", cpu); 562 561 dump_stack();
+1 -2
arch/arm64/kernel/smp.c
··· 961 961 cpumask_copy(&mask, cpu_online_mask); 962 962 cpumask_clear_cpu(smp_processor_id(), &mask); 963 963 964 - if (system_state == SYSTEM_BOOTING || 965 - system_state == SYSTEM_RUNNING) 964 + if (system_state <= SYSTEM_RUNNING) 966 965 pr_crit("SMP: stopping secondary CPUs\n"); 967 966 smp_cross_call(&mask, IPI_CPU_STOP); 968 967 }
+1 -2
arch/metag/kernel/smp.c
··· 567 567 { 568 568 unsigned int cpu = smp_processor_id(); 569 569 570 - if (system_state == SYSTEM_BOOTING || 571 - system_state == SYSTEM_RUNNING) { 570 + if (system_state <= SYSTEM_RUNNING) { 572 571 spin_lock(&stop_lock); 573 572 pr_crit("CPU%u: stopping\n", cpu); 574 573 dump_stack();
+1 -1
arch/powerpc/kernel/smp.c
··· 97 97 /* Special case - we inhibit secondary thread startup 98 98 * during boot if the user requests it. 99 99 */ 100 - if (system_state == SYSTEM_BOOTING && cpu_has_feature(CPU_FTR_SMT)) { 100 + if (system_state < SYSTEM_RUNNING && cpu_has_feature(CPU_FTR_SMT)) { 101 101 if (!smt_enabled_at_boot && cpu_thread_in_core(nr) != 0) 102 102 return 0; 103 103 if (smt_enabled_at_boot
+6 -6
arch/x86/events/core.c
··· 2265 2265 void arch_perf_update_userpage(struct perf_event *event, 2266 2266 struct perf_event_mmap_page *userpg, u64 now) 2267 2267 { 2268 - struct cyc2ns_data *data; 2268 + struct cyc2ns_data data; 2269 2269 u64 offset; 2270 2270 2271 2271 userpg->cap_user_time = 0; ··· 2277 2277 if (!using_native_sched_clock() || !sched_clock_stable()) 2278 2278 return; 2279 2279 2280 - data = cyc2ns_read_begin(); 2280 + cyc2ns_read_begin(&data); 2281 2281 2282 - offset = data->cyc2ns_offset + __sched_clock_offset; 2282 + offset = data.cyc2ns_offset + __sched_clock_offset; 2283 2283 2284 2284 /* 2285 2285 * Internal timekeeping for enabled/running/stopped times 2286 2286 * is always in the local_clock domain. 2287 2287 */ 2288 2288 userpg->cap_user_time = 1; 2289 - userpg->time_mult = data->cyc2ns_mul; 2290 - userpg->time_shift = data->cyc2ns_shift; 2289 + userpg->time_mult = data.cyc2ns_mul; 2290 + userpg->time_shift = data.cyc2ns_shift; 2291 2291 userpg->time_offset = offset - now; 2292 2292 2293 2293 /* ··· 2299 2299 userpg->time_zero = offset; 2300 2300 } 2301 2301 2302 - cyc2ns_read_end(data); 2302 + cyc2ns_read_end(); 2303 2303 } 2304 2304 2305 2305 void
+3 -5
arch/x86/include/asm/timer.h
··· 29 29 u32 cyc2ns_mul; 30 30 u32 cyc2ns_shift; 31 31 u64 cyc2ns_offset; 32 - u32 __count; 33 - /* u32 hole */ 34 - }; /* 24 bytes -- do not grow */ 32 + }; /* 16 bytes */ 35 33 36 - extern struct cyc2ns_data *cyc2ns_read_begin(void); 37 - extern void cyc2ns_read_end(struct cyc2ns_data *); 34 + extern void cyc2ns_read_begin(struct cyc2ns_data *); 35 + extern void cyc2ns_read_end(void); 38 36 39 37 #endif /* _ASM_X86_TIMER_H */
+1 -1
arch/x86/kernel/smpboot.c
··· 863 863 if (cpu == 1) 864 864 printk(KERN_INFO "x86: Booting SMP configuration:\n"); 865 865 866 - if (system_state == SYSTEM_BOOTING) { 866 + if (system_state < SYSTEM_RUNNING) { 867 867 if (node != current_node) { 868 868 if (current_node > (-1)) 869 869 pr_cont("\n");
+62 -146
arch/x86/kernel/tsc.c
··· 51 51 static u64 art_to_tsc_offset; 52 52 struct clocksource *art_related_clocksource; 53 53 54 - /* 55 - * Use a ring-buffer like data structure, where a writer advances the head by 56 - * writing a new data entry and a reader advances the tail when it observes a 57 - * new entry. 58 - * 59 - * Writers are made to wait on readers until there's space to write a new 60 - * entry. 61 - * 62 - * This means that we can always use an {offset, mul} pair to compute a ns 63 - * value that is 'roughly' in the right direction, even if we're writing a new 64 - * {offset, mul} pair during the clock read. 65 - * 66 - * The down-side is that we can no longer guarantee strict monotonicity anymore 67 - * (assuming the TSC was that to begin with), because while we compute the 68 - * intersection point of the two clock slopes and make sure the time is 69 - * continuous at the point of switching; we can no longer guarantee a reader is 70 - * strictly before or after the switch point. 71 - * 72 - * It does mean a reader no longer needs to disable IRQs in order to avoid 73 - * CPU-Freq updates messing with his times, and similarly an NMI reader will 74 - * no longer run the risk of hitting half-written state. 75 - */ 76 - 77 54 struct cyc2ns { 78 - struct cyc2ns_data data[2]; /* 0 + 2*24 = 48 */ 79 - struct cyc2ns_data *head; /* 48 + 8 = 56 */ 80 - struct cyc2ns_data *tail; /* 56 + 8 = 64 */ 81 - }; /* exactly fits one cacheline */ 55 + struct cyc2ns_data data[2]; /* 0 + 2*16 = 32 */ 56 + seqcount_t seq; /* 32 + 4 = 36 */ 57 + 58 + }; /* fits one cacheline */ 82 59 83 60 static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns); 84 61 85 - struct cyc2ns_data *cyc2ns_read_begin(void) 62 + void cyc2ns_read_begin(struct cyc2ns_data *data) 86 63 { 87 - struct cyc2ns_data *head; 64 + int seq, idx; 88 65 89 - preempt_disable(); 66 + preempt_disable_notrace(); 90 67 91 - head = this_cpu_read(cyc2ns.head); 92 - /* 93 - * Ensure we observe the entry when we observe the pointer to it. 
94 - * matches the wmb from cyc2ns_write_end(). 95 - */ 96 - smp_read_barrier_depends(); 97 - head->__count++; 98 - barrier(); 68 + do { 69 + seq = this_cpu_read(cyc2ns.seq.sequence); 70 + idx = seq & 1; 99 71 100 - return head; 72 + data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset); 73 + data->cyc2ns_mul = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul); 74 + data->cyc2ns_shift = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift); 75 + 76 + } while (unlikely(seq != this_cpu_read(cyc2ns.seq.sequence))); 101 77 } 102 78 103 - void cyc2ns_read_end(struct cyc2ns_data *head) 79 + void cyc2ns_read_end(void) 104 80 { 105 - barrier(); 106 - /* 107 - * If we're the outer most nested read; update the tail pointer 108 - * when we're done. This notifies possible pending writers 109 - * that we've observed the head pointer and that the other 110 - * entry is now free. 111 - */ 112 - if (!--head->__count) { 113 - /* 114 - * x86-TSO does not reorder writes with older reads; 115 - * therefore once this write becomes visible to another 116 - * cpu, we must be finished reading the cyc2ns_data. 117 - * 118 - * matches with cyc2ns_write_begin(). 119 - */ 120 - this_cpu_write(cyc2ns.tail, head); 121 - } 122 - preempt_enable(); 123 - } 124 - 125 - /* 126 - * Begin writing a new @data entry for @cpu. 127 - * 128 - * Assumes some sort of write side lock; currently 'provided' by the assumption 129 - * that cpufreq will call its notifiers sequentially. 130 - */ 131 - static struct cyc2ns_data *cyc2ns_write_begin(int cpu) 132 - { 133 - struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu); 134 - struct cyc2ns_data *data = c2n->data; 135 - 136 - if (data == c2n->head) 137 - data++; 138 - 139 - /* XXX send an IPI to @cpu in order to guarantee a read? */ 140 - 141 - /* 142 - * When we observe the tail write from cyc2ns_read_end(), 143 - * the cpu must be done with that entry and its safe 144 - * to start writing to it. 
145 - */ 146 - while (c2n->tail == data) 147 - cpu_relax(); 148 - 149 - return data; 150 - } 151 - 152 - static void cyc2ns_write_end(int cpu, struct cyc2ns_data *data) 153 - { 154 - struct cyc2ns *c2n = &per_cpu(cyc2ns, cpu); 155 - 156 - /* 157 - * Ensure the @data writes are visible before we publish the 158 - * entry. Matches the data-depencency in cyc2ns_read_begin(). 159 - */ 160 - smp_wmb(); 161 - 162 - ACCESS_ONCE(c2n->head) = data; 81 + preempt_enable_notrace(); 163 82 } 164 83 165 84 /* ··· 110 191 data->cyc2ns_mul = 0; 111 192 data->cyc2ns_shift = 0; 112 193 data->cyc2ns_offset = 0; 113 - data->__count = 0; 114 194 } 115 195 116 196 static void cyc2ns_init(int cpu) ··· 119 201 cyc2ns_data_init(&c2n->data[0]); 120 202 cyc2ns_data_init(&c2n->data[1]); 121 203 122 - c2n->head = c2n->data; 123 - c2n->tail = c2n->data; 204 + seqcount_init(&c2n->seq); 124 205 } 125 206 126 207 static inline unsigned long long cycles_2_ns(unsigned long long cyc) 127 208 { 128 - struct cyc2ns_data *data, *tail; 209 + struct cyc2ns_data data; 129 210 unsigned long long ns; 130 211 131 - /* 132 - * See cyc2ns_read_*() for details; replicated in order to avoid 133 - * an extra few instructions that came with the abstraction. 134 - * Notable, it allows us to only do the __count and tail update 135 - * dance when its actually needed. 
136 - */ 212 + cyc2ns_read_begin(&data); 137 213 138 - preempt_disable_notrace(); 139 - data = this_cpu_read(cyc2ns.head); 140 - tail = this_cpu_read(cyc2ns.tail); 214 + ns = data.cyc2ns_offset; 215 + ns += mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift); 141 216 142 - if (likely(data == tail)) { 143 - ns = data->cyc2ns_offset; 144 - ns += mul_u64_u32_shr(cyc, data->cyc2ns_mul, data->cyc2ns_shift); 145 - } else { 146 - data->__count++; 147 - 148 - barrier(); 149 - 150 - ns = data->cyc2ns_offset; 151 - ns += mul_u64_u32_shr(cyc, data->cyc2ns_mul, data->cyc2ns_shift); 152 - 153 - barrier(); 154 - 155 - if (!--data->__count) 156 - this_cpu_write(cyc2ns.tail, data); 157 - } 158 - preempt_enable_notrace(); 217 + cyc2ns_read_end(); 159 218 160 219 return ns; 161 220 } 162 221 163 - static void set_cyc2ns_scale(unsigned long khz, int cpu) 222 + static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now) 164 223 { 165 - unsigned long long tsc_now, ns_now; 166 - struct cyc2ns_data *data; 224 + unsigned long long ns_now; 225 + struct cyc2ns_data data; 226 + struct cyc2ns *c2n; 167 227 unsigned long flags; 168 228 169 229 local_irq_save(flags); ··· 150 254 if (!khz) 151 255 goto done; 152 256 153 - data = cyc2ns_write_begin(cpu); 154 - 155 - tsc_now = rdtsc(); 156 257 ns_now = cycles_2_ns(tsc_now); 157 258 158 259 /* ··· 157 264 * time function is continuous; see the comment near struct 158 265 * cyc2ns_data. 159 266 */ 160 - clocks_calc_mult_shift(&data->cyc2ns_mul, &data->cyc2ns_shift, khz, 267 + clocks_calc_mult_shift(&data.cyc2ns_mul, &data.cyc2ns_shift, khz, 161 268 NSEC_PER_MSEC, 0); 162 269 163 270 /* ··· 166 273 * conversion algorithm shifting a 32-bit value (now specifies a 64-bit 167 274 * value) - refer perf_event_mmap_page documentation in perf_event.h. 
168 275 */ 169 - if (data->cyc2ns_shift == 32) { 170 - data->cyc2ns_shift = 31; 171 - data->cyc2ns_mul >>= 1; 276 + if (data.cyc2ns_shift == 32) { 277 + data.cyc2ns_shift = 31; 278 + data.cyc2ns_mul >>= 1; 172 279 } 173 280 174 - data->cyc2ns_offset = ns_now - 175 - mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, data->cyc2ns_shift); 281 + data.cyc2ns_offset = ns_now - 282 + mul_u64_u32_shr(tsc_now, data.cyc2ns_mul, data.cyc2ns_shift); 176 283 177 - cyc2ns_write_end(cpu, data); 284 + c2n = per_cpu_ptr(&cyc2ns, cpu); 285 + 286 + raw_write_seqcount_latch(&c2n->seq); 287 + c2n->data[0] = data; 288 + raw_write_seqcount_latch(&c2n->seq); 289 + c2n->data[1] = data; 178 290 179 291 done: 180 - sched_clock_idle_wakeup_event(0); 292 + sched_clock_idle_wakeup_event(); 181 293 local_irq_restore(flags); 182 294 } 295 + 183 296 /* 184 297 * Scheduler clock - returns current time in nanosec units. 185 298 */ ··· 273 374 tsc_clocksource_reliable = 1; 274 375 if (!strncmp(str, "noirqtime", 9)) 275 376 no_sched_irq_time = 1; 377 + if (!strcmp(str, "unstable")) 378 + mark_tsc_unstable("boot parameter"); 276 379 return 1; 277 380 } 278 381 ··· 887 986 } 888 987 889 988 #ifdef CONFIG_CPU_FREQ 890 - 891 989 /* Frequency scaling support. Adjust the TSC based timer when the cpu frequency 892 990 * changes. 893 991 * ··· 927 1027 if (!(freq->flags & CPUFREQ_CONST_LOOPS)) 928 1028 mark_tsc_unstable("cpufreq changes"); 929 1029 930 - set_cyc2ns_scale(tsc_khz, freq->cpu); 1030 + set_cyc2ns_scale(tsc_khz, freq->cpu, rdtsc()); 931 1031 } 932 1032 933 1033 return 0; ··· 1027 1127 pr_info("Marking TSC unstable due to clocksource watchdog\n"); 1028 1128 } 1029 1129 1130 + static void tsc_cs_tick_stable(struct clocksource *cs) 1131 + { 1132 + if (tsc_unstable) 1133 + return; 1134 + 1135 + if (using_native_sched_clock()) 1136 + sched_clock_tick_stable(); 1137 + } 1138 + 1030 1139 /* 1031 1140 * .mask MUST be CLOCKSOURCE_MASK(64). 
See comment above read_tsc() 1032 1141 */ ··· 1049 1140 .archdata = { .vclock_mode = VCLOCK_TSC }, 1050 1141 .resume = tsc_resume, 1051 1142 .mark_unstable = tsc_cs_mark_unstable, 1143 + .tick_stable = tsc_cs_tick_stable, 1052 1144 }; 1053 1145 1054 1146 void mark_tsc_unstable(char *reason) ··· 1165 1255 static int hpet; 1166 1256 u64 tsc_stop, ref_stop, delta; 1167 1257 unsigned long freq; 1258 + int cpu; 1168 1259 1169 1260 /* Don't bother refining TSC on unstable systems */ 1170 1261 if (check_tsc_unstable()) ··· 1216 1305 /* Inform the TSC deadline clockevent devices about the recalibration */ 1217 1306 lapic_update_tsc_freq(); 1218 1307 1308 + /* Update the sched_clock() rate to match the clocksource one */ 1309 + for_each_possible_cpu(cpu) 1310 + set_cyc2ns_scale(tsc_khz, cpu, tsc_stop); 1311 + 1219 1312 out: 1220 1313 if (boot_cpu_has(X86_FEATURE_ART)) 1221 1314 art_related_clocksource = &clocksource_tsc; ··· 1265 1350 1266 1351 void __init tsc_init(void) 1267 1352 { 1268 - u64 lpj; 1353 + u64 lpj, cyc; 1269 1354 int cpu; 1270 1355 1271 1356 if (!boot_cpu_has(X86_FEATURE_TSC)) { ··· 1305 1390 * speed as the bootup CPU. (cpufreq notifiers will fix this 1306 1391 * up if their speed diverges) 1307 1392 */ 1393 + cyc = rdtsc(); 1308 1394 for_each_possible_cpu(cpu) { 1309 1395 cyc2ns_init(cpu); 1310 - set_cyc2ns_scale(tsc_khz, cpu); 1396 + set_cyc2ns_scale(tsc_khz, cpu, cyc); 1311 1397 } 1312 1398 1313 1399 if (tsc_disabled > 0)
+8 -6
arch/x86/platform/uv/tlb_uv.c
··· 456 456 */ 457 457 static inline unsigned long long cycles_2_ns(unsigned long long cyc) 458 458 { 459 - struct cyc2ns_data *data = cyc2ns_read_begin(); 459 + struct cyc2ns_data data; 460 460 unsigned long long ns; 461 461 462 - ns = mul_u64_u32_shr(cyc, data->cyc2ns_mul, data->cyc2ns_shift); 462 + cyc2ns_read_begin(&data); 463 + ns = mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift); 464 + cyc2ns_read_end(); 463 465 464 - cyc2ns_read_end(data); 465 466 return ns; 466 467 } 467 468 ··· 471 470 */ 472 471 static inline unsigned long long ns_2_cycles(unsigned long long ns) 473 472 { 474 - struct cyc2ns_data *data = cyc2ns_read_begin(); 473 + struct cyc2ns_data data; 475 474 unsigned long long cyc; 476 475 477 - cyc = (ns << data->cyc2ns_shift) / data->cyc2ns_mul; 476 + cyc2ns_read_begin(&data); 477 + cyc = (ns << data.cyc2ns_shift) / data.cyc2ns_mul; 478 + cyc2ns_read_end(); 478 479 479 - cyc2ns_read_end(data); 480 480 return cyc; 481 481 } 482 482
+2 -2
block/blk-mq.c
··· 941 941 return first != NULL; 942 942 } 943 943 944 - static int blk_mq_dispatch_wake(wait_queue_t *wait, unsigned mode, int flags, 944 + static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags, 945 945 void *key) 946 946 { 947 947 struct blk_mq_hw_ctx *hctx; 948 948 949 949 hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait); 950 950 951 - list_del(&wait->task_list); 951 + list_del(&wait->entry); 952 952 clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state); 953 953 blk_mq_run_hw_queue(hctx, true); 954 954 return 1;
+2 -2
block/blk-wbt.c
··· 503 503 } 504 504 505 505 static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw, 506 - wait_queue_t *wait, unsigned long rw) 506 + wait_queue_entry_t *wait, unsigned long rw) 507 507 { 508 508 /* 509 509 * inc it here even if disabled, since we'll dec it at completion. ··· 520 520 * in line to be woken up, wait for our turn. 521 521 */ 522 522 if (waitqueue_active(&rqw->wait) && 523 - rqw->wait.task_list.next != &wait->task_list) 523 + rqw->wait.head.next != &wait->entry) 524 524 return false; 525 525 526 526 return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
+8 -8
block/kyber-iosched.c
··· 99 99 struct list_head rqs[KYBER_NUM_DOMAINS]; 100 100 unsigned int cur_domain; 101 101 unsigned int batching; 102 - wait_queue_t domain_wait[KYBER_NUM_DOMAINS]; 102 + wait_queue_entry_t domain_wait[KYBER_NUM_DOMAINS]; 103 103 atomic_t wait_index[KYBER_NUM_DOMAINS]; 104 104 }; 105 105 ··· 385 385 386 386 for (i = 0; i < KYBER_NUM_DOMAINS; i++) { 387 387 INIT_LIST_HEAD(&khd->rqs[i]); 388 - INIT_LIST_HEAD(&khd->domain_wait[i].task_list); 388 + INIT_LIST_HEAD(&khd->domain_wait[i].entry); 389 389 atomic_set(&khd->wait_index[i], 0); 390 390 } 391 391 ··· 503 503 } 504 504 } 505 505 506 - static int kyber_domain_wake(wait_queue_t *wait, unsigned mode, int flags, 506 + static int kyber_domain_wake(wait_queue_entry_t *wait, unsigned mode, int flags, 507 507 void *key) 508 508 { 509 509 struct blk_mq_hw_ctx *hctx = READ_ONCE(wait->private); 510 510 511 - list_del_init(&wait->task_list); 511 + list_del_init(&wait->entry); 512 512 blk_mq_run_hw_queue(hctx, true); 513 513 return 1; 514 514 } ··· 519 519 { 520 520 unsigned int sched_domain = khd->cur_domain; 521 521 struct sbitmap_queue *domain_tokens = &kqd->domain_tokens[sched_domain]; 522 - wait_queue_t *wait = &khd->domain_wait[sched_domain]; 522 + wait_queue_entry_t *wait = &khd->domain_wait[sched_domain]; 523 523 struct sbq_wait_state *ws; 524 524 int nr; 525 525 ··· 532 532 * run when one becomes available. Note that this is serialized on 533 533 * khd->lock, but we still need to be careful about the waker. 
534 534 */ 535 - if (list_empty_careful(&wait->task_list)) { 535 + if (list_empty_careful(&wait->entry)) { 536 536 init_waitqueue_func_entry(wait, kyber_domain_wake); 537 537 wait->private = hctx; 538 538 ws = sbq_wait_ptr(domain_tokens, ··· 730 730 { \ 731 731 struct blk_mq_hw_ctx *hctx = data; \ 732 732 struct kyber_hctx_data *khd = hctx->sched_data; \ 733 - wait_queue_t *wait = &khd->domain_wait[domain]; \ 733 + wait_queue_entry_t *wait = &khd->domain_wait[domain]; \ 734 734 \ 735 - seq_printf(m, "%d\n", !list_empty_careful(&wait->task_list)); \ 735 + seq_printf(m, "%d\n", !list_empty_careful(&wait->entry)); \ 736 736 return 0; \ 737 737 } 738 738 KYBER_DEBUGFS_DOMAIN_ATTRS(KYBER_READ, read)
+1 -1
drivers/acpi/pci_root.c
··· 523 523 struct acpi_pci_root *root; 524 524 acpi_handle handle = device->handle; 525 525 int no_aspm = 0; 526 - bool hotadd = system_state != SYSTEM_BOOTING; 526 + bool hotadd = system_state == SYSTEM_RUNNING; 527 527 528 528 root = kzalloc(sizeof(struct acpi_pci_root), GFP_KERNEL); 529 529 if (!root)
+1 -1
drivers/base/node.c
··· 377 377 if (!pfn_valid_within(pfn)) 378 378 return -1; 379 379 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT 380 - if (system_state == SYSTEM_BOOTING) 380 + if (system_state < SYSTEM_RUNNING) 381 381 return early_pfn_to_nid(pfn); 382 382 #endif 383 383 page = pfn_to_page(pfn);
+1 -1
drivers/bluetooth/btmrvl_main.c
··· 602 602 struct btmrvl_thread *thread = data; 603 603 struct btmrvl_private *priv = thread->priv; 604 604 struct btmrvl_adapter *adapter = priv->adapter; 605 - wait_queue_t wait; 605 + wait_queue_entry_t wait; 606 606 struct sk_buff *skb; 607 607 ulong flags; 608 608
+1 -1
drivers/char/ipmi/ipmi_watchdog.c
··· 821 821 loff_t *ppos) 822 822 { 823 823 int rv = 0; 824 - wait_queue_t wait; 824 + wait_queue_entry_t wait; 825 825 826 826 if (count <= 0) 827 827 return 0;
+1 -1
drivers/cpufreq/pasemi-cpufreq.c
··· 226 226 * We don't support CPU hotplug. Don't unmap after the system 227 227 * has already made it to a running state. 228 228 */ 229 - if (system_state != SYSTEM_BOOTING) 229 + if (system_state >= SYSTEM_RUNNING) 230 230 return 0; 231 231 232 232 if (sdcasr_mapbase)
+1
drivers/cpuidle/cpuidle.c
··· 220 220 entered_state = target_state->enter(dev, drv, index); 221 221 start_critical_timings(); 222 222 223 + sched_clock_idle_wakeup_event(); 223 224 time_end = ns_to_ktime(local_clock()); 224 225 trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu); 225 226
+1 -1
drivers/gpu/drm/i915/i915_gem_request.h
··· 123 123 * It is used by the driver to then queue the request for execution. 124 124 */ 125 125 struct i915_sw_fence submit; 126 - wait_queue_t submitq; 126 + wait_queue_entry_t submitq; 127 127 wait_queue_head_t execute; 128 128 129 129 /* A list of everyone we wait upon, and everyone who waits upon us.
+17 -18
drivers/gpu/drm/i915/i915_sw_fence.c
···
 			  struct list_head *continuation)
 {
 	wait_queue_head_t *x = &fence->wait;
-	wait_queue_t *pos, *next;
+	wait_queue_entry_t *pos, *next;
 	unsigned long flags;

 	debug_fence_deactivate(fence);
···
 	/*
 	 * To prevent unbounded recursion as we traverse the graph of
-	 * i915_sw_fences, we move the task_list from this, the next ready
-	 * fence, to the tail of the original fence's task_list
+	 * i915_sw_fences, we move the entry list from this, the next ready
+	 * fence, to the tail of the original fence's entry list
 	 * (and so added to the list to be woken).
 	 */

 	spin_lock_irqsave_nested(&x->lock, flags, 1 + !!continuation);
 	if (continuation) {
-		list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
+		list_for_each_entry_safe(pos, next, &x->head, entry) {
 			if (pos->func == autoremove_wake_function)
 				pos->func(pos, TASK_NORMAL, 0, continuation);
 			else
-				list_move_tail(&pos->task_list, continuation);
+				list_move_tail(&pos->entry, continuation);
 		}
 	} else {
 		LIST_HEAD(extra);

 		do {
-			list_for_each_entry_safe(pos, next,
-						 &x->task_list, task_list)
+			list_for_each_entry_safe(pos, next, &x->head, entry)
 				pos->func(pos, TASK_NORMAL, 0, &extra);

 			if (list_empty(&extra))
 				break;

-			list_splice_tail_init(&extra, &x->task_list);
+			list_splice_tail_init(&extra, &x->head);
 		} while (1);
 	}
 	spin_unlock_irqrestore(&x->lock, flags);
···
 	__i915_sw_fence_commit(fence);
 }

-static int i915_sw_fence_wake(wait_queue_t *wq, unsigned mode, int flags, void *key)
+static int i915_sw_fence_wake(wait_queue_entry_t *wq, unsigned mode, int flags, void *key)
 {
-	list_del(&wq->task_list);
+	list_del(&wq->entry);
 	__i915_sw_fence_complete(wq->private, key);
 	i915_sw_fence_put(wq->private);
 	if (wq->flags & I915_SW_FENCE_FLAG_ALLOC)
···
 static bool __i915_sw_fence_check_if_after(struct i915_sw_fence *fence,
 				    const struct i915_sw_fence * const signaler)
 {
-	wait_queue_t *wq;
+	wait_queue_entry_t *wq;

 	if (__test_and_set_bit(I915_SW_FENCE_CHECKED_BIT, &fence->flags))
 		return false;

 	if (fence == signaler)
 		return true;

-	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
+	list_for_each_entry(wq, &fence->wait.head, entry) {
 		if (wq->func != i915_sw_fence_wake)
 			continue;
···
 static void __i915_sw_fence_clear_checked_bit(struct i915_sw_fence *fence)
 {
-	wait_queue_t *wq;
+	wait_queue_entry_t *wq;

 	if (!__test_and_clear_bit(I915_SW_FENCE_CHECKED_BIT, &fence->flags))
 		return;

-	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
+	list_for_each_entry(wq, &fence->wait.head, entry) {
 		if (wq->func != i915_sw_fence_wake)
 			continue;
···
 static int __i915_sw_fence_await_sw_fence(struct i915_sw_fence *fence,
 					  struct i915_sw_fence *signaler,
-					  wait_queue_t *wq, gfp_t gfp)
+					  wait_queue_entry_t *wq, gfp_t gfp)
 {
 	unsigned long flags;
 	int pending;
···
 		pending |= I915_SW_FENCE_FLAG_ALLOC;
 	}

-	INIT_LIST_HEAD(&wq->task_list);
+	INIT_LIST_HEAD(&wq->entry);
 	wq->flags = pending;
 	wq->func = i915_sw_fence_wake;
 	wq->private = i915_sw_fence_get(fence);
···
 	spin_lock_irqsave(&signaler->wait.lock, flags);
 	if (likely(!i915_sw_fence_done(signaler))) {
-		__add_wait_queue_tail(&signaler->wait, wq);
+		__add_wait_queue_entry_tail(&signaler->wait, wq);
 		pending = 1;
 	} else {
 		i915_sw_fence_wake(wq, 0, 0, NULL);
···
 int i915_sw_fence_await_sw_fence(struct i915_sw_fence *fence,
 				 struct i915_sw_fence *signaler,
-				 wait_queue_t *wq)
+				 wait_queue_entry_t *wq)
 {
 	return __i915_sw_fence_await_sw_fence(fence, signaler, wq, 0);
 }
+1 -1
drivers/gpu/drm/i915/i915_sw_fence.h
··· 66 66 67 67 int i915_sw_fence_await_sw_fence(struct i915_sw_fence *fence, 68 68 struct i915_sw_fence *after, 69 - wait_queue_t *wq); 69 + wait_queue_entry_t *wq); 70 70 int i915_sw_fence_await_sw_fence_gfp(struct i915_sw_fence *fence, 71 71 struct i915_sw_fence *after, 72 72 gfp_t gfp);
+1 -1
drivers/gpu/drm/radeon/radeon.h
··· 375 375 unsigned ring; 376 376 bool is_vm_update; 377 377 378 - wait_queue_t fence_wake; 378 + wait_queue_entry_t fence_wake; 379 379 }; 380 380 381 381 int radeon_fence_driver_start_ring(struct radeon_device *rdev, int ring);
+1 -1
drivers/gpu/drm/radeon/radeon_fence.c
··· 158 158 * for the fence locking itself, so unlocked variants are used for 159 159 * fence_signal, and remove_wait_queue. 160 160 */ 161 - static int radeon_fence_check_signaled(wait_queue_t *wait, unsigned mode, int flags, void *key) 161 + static int radeon_fence_check_signaled(wait_queue_entry_t *wait, unsigned mode, int flags, void *key) 162 162 { 163 163 struct radeon_fence *fence; 164 164 u64 seq;
+1 -1
drivers/gpu/vga/vgaarb.c
··· 417 417 { 418 418 struct vga_device *vgadev, *conflict; 419 419 unsigned long flags; 420 - wait_queue_t wait; 420 + wait_queue_entry_t wait; 421 421 int rc = 0; 422 422 423 423 vga_check_first_use();
+1 -1
drivers/infiniband/hw/i40iw/i40iw_main.c
··· 1939 1939 bool i40iw_vf_clear_to_send(struct i40iw_sc_dev *dev) 1940 1940 { 1941 1941 struct i40iw_device *iwdev; 1942 - wait_queue_t wait; 1942 + wait_queue_entry_t wait; 1943 1943 1944 1944 iwdev = dev->back_dev; 1945 1945
+2 -2
drivers/iommu/intel-iommu.c
··· 4315 4315 struct acpi_dmar_atsr *atsr; 4316 4316 struct dmar_atsr_unit *atsru; 4317 4317 4318 - if (system_state != SYSTEM_BOOTING && !intel_iommu_enabled) 4318 + if (system_state >= SYSTEM_RUNNING && !intel_iommu_enabled) 4319 4319 return 0; 4320 4320 4321 4321 atsr = container_of(hdr, struct acpi_dmar_atsr, header); ··· 4565 4565 struct acpi_dmar_atsr *atsr; 4566 4566 struct acpi_dmar_reserved_memory *rmrr; 4567 4567 4568 - if (!intel_iommu_enabled && system_state != SYSTEM_BOOTING) 4568 + if (!intel_iommu_enabled && system_state >= SYSTEM_RUNNING) 4569 4569 return 0; 4570 4570 4571 4571 list_for_each_entry(rmrru, &dmar_rmrr_units, list) {
+1 -1
drivers/iommu/of_iommu.c
··· 103 103 * it never will be. We don't want to defer indefinitely, nor attempt 104 104 * to dereference __iommu_of_table after it's been freed. 105 105 */ 106 - if (system_state > SYSTEM_BOOTING) 106 + if (system_state >= SYSTEM_RUNNING) 107 107 return false; 108 108 109 109 return of_match_node(&__iommu_of_table, np);
+1 -1
drivers/md/bcache/btree.h
··· 207 207 208 208 struct btree_op { 209 209 /* for waiting on btree reserve in btree_split() */ 210 - wait_queue_t wait; 210 + wait_queue_entry_t wait; 211 211 212 212 /* Btree level at which we start taking write locks */ 213 213 short lock;
+2 -2
drivers/net/ethernet/cavium/liquidio/octeon_main.h
··· 144 144 sleep_cond(wait_queue_head_t *wait_queue, int *condition) 145 145 { 146 146 int errno = 0; 147 - wait_queue_t we; 147 + wait_queue_entry_t we; 148 148 149 149 init_waitqueue_entry(&we, current); 150 150 add_wait_queue(wait_queue, &we); ··· 171 171 int *condition, 172 172 int timeout) 173 173 { 174 - wait_queue_t we; 174 + wait_queue_entry_t we; 175 175 176 176 init_waitqueue_entry(&we, current); 177 177 add_wait_queue(wait_queue, &we);
+1 -1
drivers/net/wireless/cisco/airo.c
··· 3066 3066 if (ai->jobs) { 3067 3067 locked = down_interruptible(&ai->sem); 3068 3068 } else { 3069 - wait_queue_t wait; 3069 + wait_queue_entry_t wait; 3070 3070 3071 3071 init_waitqueue_entry(&wait, current); 3072 3072 add_wait_queue(&ai->thr_wait, &wait);
+1 -1
drivers/net/wireless/intersil/hostap/hostap_ioctl.c
··· 2544 2544 ret = -EINVAL; 2545 2545 } 2546 2546 if (local->iw_mode == IW_MODE_MASTER) { 2547 - wait_queue_t __wait; 2547 + wait_queue_entry_t __wait; 2548 2548 init_waitqueue_entry(&__wait, current); 2549 2549 add_wait_queue(&local->hostscan_wq, &__wait); 2550 2550 set_current_state(TASK_INTERRUPTIBLE);
+1 -1
drivers/net/wireless/marvell/libertas/main.c
··· 453 453 { 454 454 struct net_device *dev = data; 455 455 struct lbs_private *priv = dev->ml_priv; 456 - wait_queue_t wait; 456 + wait_queue_entry_t wait; 457 457 458 458 lbs_deb_enter(LBS_DEB_THREAD); 459 459
+1 -1
drivers/rtc/rtc-imxdi.c
··· 709 709 /*If the write wait queue is empty then there is no pending 710 710 operations. It means the interrupt is for DryIce -Security. 711 711 IRQ must be returned as none.*/ 712 - if (list_empty_careful(&imxdi->write_wait.task_list)) 712 + if (list_empty_careful(&imxdi->write_wait.head)) 713 713 return rc; 714 714 715 715 /* DSR_WCF clears itself on DSR read */
+1 -1
drivers/scsi/dpt/dpti_i2o.h
··· 48 48 #include <linux/wait.h> 49 49 typedef wait_queue_head_t adpt_wait_queue_head_t; 50 50 #define ADPT_DECLARE_WAIT_QUEUE_HEAD(wait) DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wait) 51 - typedef wait_queue_t adpt_wait_queue_t; 51 + typedef wait_queue_entry_t adpt_wait_queue_entry_t; 52 52 53 53 /* 54 54 * message structures
+6 -6
drivers/scsi/ips.c
···
 static uint32_t ips_statupd_morpheus(ips_ha_t *);
 static ips_scb_t *ips_getscb(ips_ha_t *);
 static void ips_putq_scb_head(ips_scb_queue_t *, ips_scb_t *);
-static void ips_putq_wait_tail(ips_wait_queue_t *, struct scsi_cmnd *);
+static void ips_putq_wait_tail(ips_wait_queue_entry_t *, struct scsi_cmnd *);
 static void ips_putq_copp_tail(ips_copp_queue_t *,
 			       ips_copp_wait_item_t *);
 static ips_scb_t *ips_removeq_scb_head(ips_scb_queue_t *);
 static ips_scb_t *ips_removeq_scb(ips_scb_queue_t *, ips_scb_t *);
-static struct scsi_cmnd *ips_removeq_wait_head(ips_wait_queue_t *);
-static struct scsi_cmnd *ips_removeq_wait(ips_wait_queue_t *,
+static struct scsi_cmnd *ips_removeq_wait_head(ips_wait_queue_entry_t *);
+static struct scsi_cmnd *ips_removeq_wait(ips_wait_queue_entry_t *,
 					  struct scsi_cmnd *);
 static ips_copp_wait_item_t *ips_removeq_copp(ips_copp_queue_t *,
 					      ips_copp_wait_item_t *);
···
 /* ASSUMED to be called from within the HA lock                             */
 /*                                                                          */
 /****************************************************************************/
-static void ips_putq_wait_tail(ips_wait_queue_t *queue, struct scsi_cmnd *item)
+static void ips_putq_wait_tail(ips_wait_queue_entry_t *queue, struct scsi_cmnd *item)
 {
 	METHOD_TRACE("ips_putq_wait_tail", 1);
···
 /* ASSUMED to be called from within the HA lock                             */
 /*                                                                          */
 /****************************************************************************/
-static struct scsi_cmnd *ips_removeq_wait_head(ips_wait_queue_t *queue)
+static struct scsi_cmnd *ips_removeq_wait_head(ips_wait_queue_entry_t *queue)
 {
 	struct scsi_cmnd *item;
···
 /* ASSUMED to be called from within the HA lock                             */
 /*                                                                          */
 /****************************************************************************/
-static struct scsi_cmnd *ips_removeq_wait(ips_wait_queue_t *queue,
+static struct scsi_cmnd *ips_removeq_wait(ips_wait_queue_entry_t *queue,
 					  struct scsi_cmnd *item)
 {
 	struct scsi_cmnd *p;
+2 -2
drivers/scsi/ips.h
··· 989 989 struct scsi_cmnd *head; 990 990 struct scsi_cmnd *tail; 991 991 int count; 992 - } ips_wait_queue_t; 992 + } ips_wait_queue_entry_t; 993 993 994 994 typedef struct ips_copp_wait_item { 995 995 struct scsi_cmnd *scsi_cmd; ··· 1035 1035 ips_stat_t sp; /* Status packer pointer */ 1036 1036 struct ips_scb *scbs; /* Array of all CCBS */ 1037 1037 struct ips_scb *scb_freelist; /* SCB free list */ 1038 - ips_wait_queue_t scb_waitlist; /* Pending SCB list */ 1038 + ips_wait_queue_entry_t scb_waitlist; /* Pending SCB list */ 1039 1039 ips_copp_queue_t copp_waitlist; /* Pending PT list */ 1040 1040 ips_scb_queue_t scb_activelist; /* Active SCB list */ 1041 1041 IPS_IO_CMD *dummy; /* dummy command */
+3 -3
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
··· 3267 3267 kiblnd_connd(void *arg) 3268 3268 { 3269 3269 spinlock_t *lock = &kiblnd_data.kib_connd_lock; 3270 - wait_queue_t wait; 3270 + wait_queue_entry_t wait; 3271 3271 unsigned long flags; 3272 3272 struct kib_conn *conn; 3273 3273 int timeout; ··· 3521 3521 long id = (long)arg; 3522 3522 struct kib_sched_info *sched; 3523 3523 struct kib_conn *conn; 3524 - wait_queue_t wait; 3524 + wait_queue_entry_t wait; 3525 3525 unsigned long flags; 3526 3526 struct ib_wc wc; 3527 3527 int did_something; ··· 3656 3656 { 3657 3657 rwlock_t *glock = &kiblnd_data.kib_global_lock; 3658 3658 struct kib_dev *dev; 3659 - wait_queue_t wait; 3659 + wait_queue_entry_t wait; 3660 3660 unsigned long flags; 3661 3661 int rc; 3662 3662
+2 -2
drivers/staging/lustre/lnet/klnds/socklnd/socklnd_cb.c
··· 2166 2166 { 2167 2167 spinlock_t *connd_lock = &ksocknal_data.ksnd_connd_lock; 2168 2168 struct ksock_connreq *cr; 2169 - wait_queue_t wait; 2169 + wait_queue_entry_t wait; 2170 2170 int nloops = 0; 2171 2171 int cons_retry = 0; 2172 2172 ··· 2554 2554 int 2555 2555 ksocknal_reaper(void *arg) 2556 2556 { 2557 - wait_queue_t wait; 2557 + wait_queue_entry_t wait; 2558 2558 struct ksock_conn *conn; 2559 2559 struct ksock_sched *sched; 2560 2560 struct list_head enomem_conns;
+1 -1
drivers/staging/lustre/lnet/libcfs/debug.c
··· 361 361 362 362 void libcfs_debug_dumplog(void) 363 363 { 364 - wait_queue_t wait; 364 + wait_queue_entry_t wait; 365 365 struct task_struct *dumper; 366 366 367 367 /* we're being careful to ensure that the kernel thread is
+1 -1
drivers/staging/lustre/lnet/libcfs/tracefile.c
··· 990 990 complete(&tctl->tctl_start); 991 991 992 992 while (1) { 993 - wait_queue_t __wait; 993 + wait_queue_entry_t __wait; 994 994 995 995 pc.pc_want_daemon_pages = 0; 996 996 collect_pages(&pc);
+1 -1
drivers/staging/lustre/lnet/lnet/lib-eq.c
··· 312 312 { 313 313 int tms = *timeout_ms; 314 314 int wait; 315 - wait_queue_t wl; 315 + wait_queue_entry_t wl; 316 316 unsigned long now; 317 317 318 318 if (!tms)
+1 -1
drivers/staging/lustre/lnet/lnet/lib-socket.c
··· 516 516 int 517 517 lnet_sock_accept(struct socket **newsockp, struct socket *sock) 518 518 { 519 - wait_queue_t wait; 519 + wait_queue_entry_t wait; 520 520 struct socket *newsock; 521 521 int rc; 522 522
+3 -3
drivers/staging/lustre/lustre/fid/fid_request.c
··· 192 192 } 193 193 194 194 static int seq_fid_alloc_prep(struct lu_client_seq *seq, 195 - wait_queue_t *link) 195 + wait_queue_entry_t *link) 196 196 { 197 197 if (seq->lcs_update) { 198 198 add_wait_queue(&seq->lcs_waitq, link); ··· 223 223 int seq_client_alloc_fid(const struct lu_env *env, 224 224 struct lu_client_seq *seq, struct lu_fid *fid) 225 225 { 226 - wait_queue_t link; 226 + wait_queue_entry_t link; 227 227 int rc; 228 228 229 229 LASSERT(seq); ··· 290 290 */ 291 291 void seq_client_flush(struct lu_client_seq *seq) 292 292 { 293 - wait_queue_t link; 293 + wait_queue_entry_t link; 294 294 295 295 LASSERT(seq); 296 296 init_waitqueue_entry(&link, current);
+2 -2
drivers/staging/lustre/lustre/include/lustre_lib.h
··· 201 201 sigmask(SIGALRM)) 202 202 203 203 /** 204 - * wait_queue_t of Linux (version < 2.6.34) is a FIFO list for exclusively 204 + * wait_queue_entry_t of Linux (version < 2.6.34) is a FIFO list for exclusively 205 205 * waiting threads, which is not always desirable because all threads will 206 206 * be waken up again and again, even user only needs a few of them to be 207 207 * active most time. This is not good for performance because cache can ··· 228 228 */ 229 229 #define __l_wait_event(wq, condition, info, ret, l_add_wait) \ 230 230 do { \ 231 - wait_queue_t __wait; \ 231 + wait_queue_entry_t __wait; \ 232 232 long __timeout = info->lwi_timeout; \ 233 233 sigset_t __blocked; \ 234 234 int __allow_intr = info->lwi_allow_intr; \
+1 -1
drivers/staging/lustre/lustre/llite/lcommon_cl.c
··· 207 207 static void cl_object_put_last(struct lu_env *env, struct cl_object *obj) 208 208 { 209 209 struct lu_object_header *header = obj->co_lu.lo_header; 210 - wait_queue_t waiter; 210 + wait_queue_entry_t waiter; 211 211 212 212 if (unlikely(atomic_read(&header->loh_ref) != 1)) { 213 213 struct lu_site *site = obj->co_lu.lo_dev->ld_site;
+1 -1
drivers/staging/lustre/lustre/lov/lov_cl_internal.h
··· 370 370 struct ost_lvb lti_lvb; 371 371 struct cl_2queue lti_cl2q; 372 372 struct cl_page_list lti_plist; 373 - wait_queue_t lti_waiter; 373 + wait_queue_entry_t lti_waiter; 374 374 struct cl_attr lti_attr; 375 375 }; 376 376
+1 -1
drivers/staging/lustre/lustre/lov/lov_object.c
··· 371 371 struct lov_layout_raid0 *r0; 372 372 struct lu_site *site; 373 373 struct lu_site_bkt_data *bkt; 374 - wait_queue_t *waiter; 374 + wait_queue_entry_t *waiter; 375 375 376 376 r0 = &lov->u.raid0; 377 377 LASSERT(r0->lo_sub[idx] == los);
+3 -3
drivers/staging/lustre/lustre/obdclass/lu_object.c
··· 556 556 static struct lu_object *htable_lookup(struct lu_site *s, 557 557 struct cfs_hash_bd *bd, 558 558 const struct lu_fid *f, 559 - wait_queue_t *waiter, 559 + wait_queue_entry_t *waiter, 560 560 __u64 *version) 561 561 { 562 562 struct lu_site_bkt_data *bkt; ··· 670 670 struct lu_device *dev, 671 671 const struct lu_fid *f, 672 672 const struct lu_object_conf *conf, 673 - wait_queue_t *waiter) 673 + wait_queue_entry_t *waiter) 674 674 { 675 675 struct lu_object *o; 676 676 struct lu_object *shadow; ··· 750 750 { 751 751 struct lu_site_bkt_data *bkt; 752 752 struct lu_object *obj; 753 - wait_queue_t wait; 753 + wait_queue_entry_t wait; 754 754 755 755 while (1) { 756 756 obj = lu_object_find_try(env, dev, f, conf, &wait);
+1 -1
drivers/vfio/virqfd.c
··· 43 43 queue_work(vfio_irqfd_cleanup_wq, &virqfd->shutdown); 44 44 } 45 45 46 - static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) 46 + static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) 47 47 { 48 48 struct virqfd *virqfd = container_of(wait, struct virqfd, wait); 49 49 unsigned long flags = (unsigned long)key;
+1 -1
drivers/vhost/vhost.c
··· 165 165 add_wait_queue(wqh, &poll->wait); 166 166 } 167 167 168 - static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync, 168 + static int vhost_poll_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, 169 169 void *key) 170 170 { 171 171 struct vhost_poll *poll = container_of(wait, struct vhost_poll, wait);
+1 -1
drivers/vhost/vhost.h
··· 31 31 struct vhost_poll { 32 32 poll_table table; 33 33 wait_queue_head_t *wqh; 34 - wait_queue_t wait; 34 + wait_queue_entry_t wait; 35 35 struct vhost_work work; 36 36 unsigned long mask; 37 37 struct vhost_dev *dev;
+1
drivers/xen/manage.c
··· 190 190 { 191 191 switch (system_state) { 192 192 case SYSTEM_BOOTING: 193 + case SYSTEM_SCHEDULING: 193 194 orderly_poweroff(true); 194 195 break; 195 196 case SYSTEM_RUNNING:
+1 -1
fs/autofs4/autofs_i.h
··· 83 83 struct autofs_wait_queue { 84 84 wait_queue_head_t queue; 85 85 struct autofs_wait_queue *next; 86 - autofs_wqt_t wait_queue_token; 86 + autofs_wqt_t wait_queue_entry_token; 87 87 /* We use the following to see what we are waiting for */ 88 88 struct qstr name; 89 89 u32 dev;
+9 -9
fs/autofs4/waitq.c
···
 	size_t pktsz;

 	pr_debug("wait id = 0x%08lx, name = %.*s, type=%d\n",
-		 (unsigned long) wq->wait_queue_token,
+		 (unsigned long) wq->wait_queue_entry_token,
 		 wq->name.len, wq->name.name, type);

 	memset(&pkt, 0, sizeof(pkt)); /* For security reasons */
···
 		pktsz = sizeof(*mp);

-		mp->wait_queue_token = wq->wait_queue_token;
+		mp->wait_queue_entry_token = wq->wait_queue_entry_token;
 		mp->len = wq->name.len;
 		memcpy(mp->name, wq->name.name, wq->name.len);
 		mp->name[wq->name.len] = '\0';
···
 		pktsz = sizeof(*ep);

-		ep->wait_queue_token = wq->wait_queue_token;
+		ep->wait_queue_entry_token = wq->wait_queue_entry_token;
 		ep->len = wq->name.len;
 		memcpy(ep->name, wq->name.name, wq->name.len);
 		ep->name[wq->name.len] = '\0';
···
 		pktsz = sizeof(*packet);

-		packet->wait_queue_token = wq->wait_queue_token;
+		packet->wait_queue_entry_token = wq->wait_queue_entry_token;
 		packet->len = wq->name.len;
 		memcpy(packet->name, wq->name.name, wq->name.len);
 		packet->name[wq->name.len] = '\0';
···
 			return -ENOMEM;
 		}

-		wq->wait_queue_token = autofs4_next_wait_queue;
+		wq->wait_queue_entry_token = autofs4_next_wait_queue;
 		if (++autofs4_next_wait_queue == 0)
 			autofs4_next_wait_queue = 1;
 		wq->next = sbi->queues;
···
 		}

 		pr_debug("new wait id = 0x%08lx, name = %.*s, nfy=%d\n",
-			 (unsigned long) wq->wait_queue_token, wq->name.len,
+			 (unsigned long) wq->wait_queue_entry_token, wq->name.len,
 			 wq->name.name, notify);

 		/*
···
 	} else {
 		wq->wait_ctr++;
 		pr_debug("existing wait id = 0x%08lx, name = %.*s, nfy=%d\n",
-			 (unsigned long) wq->wait_queue_token, wq->name.len,
+			 (unsigned long) wq->wait_queue_entry_token, wq->name.len,
 			 wq->name.name, notify);
 		mutex_unlock(&sbi->wq_mutex);
 		kfree(qstr.name);
···
 }


-int autofs4_wait_release(struct autofs_sb_info *sbi, autofs_wqt_t wait_queue_token, int status)
+int autofs4_wait_release(struct autofs_sb_info *sbi, autofs_wqt_t wait_queue_entry_token, int status)
 {
 	struct autofs_wait_queue *wq, **wql;

 	mutex_lock(&sbi->wq_mutex);
 	for (wql = &sbi->queues; (wq = *wql) != NULL; wql = &wq->next) {
-		if (wq->wait_queue_token == wait_queue_token)
+		if (wq->wait_queue_entry_token == wait_queue_entry_token)
 			break;
 	}
+2 -2
fs/cachefiles/internal.h
··· 18 18 19 19 #include <linux/fscache-cache.h> 20 20 #include <linux/timer.h> 21 - #include <linux/wait.h> 21 + #include <linux/wait_bit.h> 22 22 #include <linux/cred.h> 23 23 #include <linux/workqueue.h> 24 24 #include <linux/security.h> ··· 97 97 * backing file read tracking 98 98 */ 99 99 struct cachefiles_one_read { 100 - wait_queue_t monitor; /* link into monitored waitqueue */ 100 + wait_queue_entry_t monitor; /* link into monitored waitqueue */ 101 101 struct page *back_page; /* backing file page we're waiting for */ 102 102 struct page *netfs_page; /* netfs page we're going to fill */ 103 103 struct fscache_retrieval *op; /* retrieval op covering this */
+1 -1
fs/cachefiles/namei.c
··· 204 204 wait_queue_head_t *wq; 205 205 206 206 signed long timeout = 60 * HZ; 207 - wait_queue_t wait; 207 + wait_queue_entry_t wait; 208 208 bool requeue; 209 209 210 210 /* if the object we're waiting for is queued for processing,
+2 -2
fs/cachefiles/rdwr.c
··· 21 21 * - we use this to detect read completion of backing pages 22 22 * - the caller holds the waitqueue lock 23 23 */ 24 - static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode, 24 + static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode, 25 25 int sync, void *_key) 26 26 { 27 27 struct cachefiles_one_read *monitor = ··· 48 48 } 49 49 50 50 /* remove from the waitqueue */ 51 - list_del(&wait->task_list); 51 + list_del(&wait->entry); 52 52 53 53 /* move onto the action list and queue for FS-Cache thread pool */ 54 54 ASSERT(monitor->op);
+1
fs/cifs/inode.c
··· 24 24 #include <linux/pagemap.h> 25 25 #include <linux/freezer.h> 26 26 #include <linux/sched/signal.h> 27 + #include <linux/wait_bit.h> 27 28 28 29 #include <asm/div64.h> 29 30 #include "cifsfs.h"
+2 -2
fs/dax.c
··· 84 84 }; 85 85 86 86 struct wait_exceptional_entry_queue { 87 - wait_queue_t wait; 87 + wait_queue_entry_t wait; 88 88 struct exceptional_entry_key key; 89 89 }; 90 90 ··· 108 108 return wait_table + hash; 109 109 } 110 110 111 - static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode, 111 + static int wake_exceptional_entry_func(wait_queue_entry_t *wait, unsigned int mode, 112 112 int sync, void *keyp) 113 113 { 114 114 struct exceptional_entry_key *key = keyp;
+1 -1
fs/eventfd.c
··· 191 191 * This is used to atomically remove a wait queue entry from the eventfd wait 192 192 * queue head, and read/reset the counter value. 193 193 */ 194 - int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, 194 + int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait, 195 195 __u64 *cnt) 196 196 { 197 197 unsigned long flags;
+6 -6
fs/eventpoll.c
··· 244 244 * Wait queue item that will be linked to the target file wait 245 245 * queue head. 246 246 */ 247 - wait_queue_t wait; 247 + wait_queue_entry_t wait; 248 248 249 249 /* The wait queue head that linked the "wait" wait queue item */ 250 250 wait_queue_head_t *whead; ··· 347 347 return !list_empty(p); 348 348 } 349 349 350 - static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_t *p) 350 + static inline struct eppoll_entry *ep_pwq_from_wait(wait_queue_entry_t *p) 351 351 { 352 352 return container_of(p, struct eppoll_entry, wait); 353 353 } 354 354 355 355 /* Get the "struct epitem" from a wait queue pointer */ 356 - static inline struct epitem *ep_item_from_wait(wait_queue_t *p) 356 + static inline struct epitem *ep_item_from_wait(wait_queue_entry_t *p) 357 357 { 358 358 return container_of(p, struct eppoll_entry, wait)->base; 359 359 } ··· 1078 1078 * mechanism. It is called by the stored file descriptors when they 1079 1079 * have events to report. 1080 1080 */ 1081 - static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key) 1081 + static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) 1082 1082 { 1083 1083 int pwake = 0; 1084 1084 unsigned long flags; ··· 1094 1094 * can't use __remove_wait_queue(). whead->lock is held by 1095 1095 * the caller. 1096 1096 */ 1097 - list_del_init(&wait->task_list); 1097 + list_del_init(&wait->entry); 1098 1098 } 1099 1099 1100 1100 spin_lock_irqsave(&ep->lock, flags); ··· 1699 1699 int res = 0, eavail, timed_out = 0; 1700 1700 unsigned long flags; 1701 1701 u64 slack = 0; 1702 - wait_queue_t wait; 1702 + wait_queue_entry_t wait; 1703 1703 ktime_t expires, *to = NULL; 1704 1704 1705 1705 if (timeout > 0) {
+2 -2
fs/fs_pin.c
··· 34 34 35 35 void pin_kill(struct fs_pin *p) 36 36 { 37 - wait_queue_t wait; 37 + wait_queue_entry_t wait; 38 38 39 39 if (!p) { 40 40 rcu_read_unlock(); ··· 61 61 rcu_read_unlock(); 62 62 schedule(); 63 63 rcu_read_lock(); 64 - if (likely(list_empty(&wait.task_list))) 64 + if (likely(list_empty(&wait.entry))) 65 65 break; 66 66 /* OK, we know p couldn't have been freed yet */ 67 67 spin_lock_irq(&p->wait.lock);
+4 -4
fs/inode.c
··· 1892 1892 wait_queue_head_t *wq; 1893 1893 DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW); 1894 1894 wq = bit_waitqueue(&inode->i_state, __I_NEW); 1895 - prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE); 1895 + prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); 1896 1896 spin_unlock(&inode->i_lock); 1897 1897 spin_unlock(&inode_hash_lock); 1898 1898 schedule(); 1899 - finish_wait(wq, &wait.wait); 1899 + finish_wait(wq, &wait.wq_entry); 1900 1900 spin_lock(&inode_hash_lock); 1901 1901 } 1902 1902 ··· 2039 2039 DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); 2040 2040 2041 2041 do { 2042 - prepare_to_wait(wq, &q.wait, TASK_UNINTERRUPTIBLE); 2042 + prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); 2043 2043 if (atomic_read(&inode->i_dio_count)) 2044 2044 schedule(); 2045 2045 } while (atomic_read(&inode->i_dio_count)); 2046 - finish_wait(wq, &q.wait); 2046 + finish_wait(wq, &q.wq_entry); 2047 2047 } 2048 2048 2049 2049 /**
+2 -2
fs/jbd2/journal.c
··· 2579 2579 wait_queue_head_t *wq; 2580 2580 DEFINE_WAIT_BIT(wait, &jinode->i_flags, __JI_COMMIT_RUNNING); 2581 2581 wq = bit_waitqueue(&jinode->i_flags, __JI_COMMIT_RUNNING); 2582 - prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE); 2582 + prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); 2583 2583 spin_unlock(&journal->j_list_lock); 2584 2584 schedule(); 2585 - finish_wait(wq, &wait.wait); 2585 + finish_wait(wq, &wait.wq_entry); 2586 2586 goto restart; 2587 2587 } 2588 2588
+1
fs/nfs/internal.h
··· 7 7 #include <linux/security.h> 8 8 #include <linux/crc32.h> 9 9 #include <linux/nfs_page.h> 10 + #include <linux/wait_bit.h> 10 11 11 12 #define NFS_MS_MASK (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_SYNCHRONOUS) 12 13
+2 -2
fs/nfs/nfs4proc.c
··· 6373 6373 }; 6374 6374 6375 6375 static int 6376 - nfs4_wake_lock_waiter(wait_queue_t *wait, unsigned int mode, int flags, void *key) 6376 + nfs4_wake_lock_waiter(wait_queue_entry_t *wait, unsigned int mode, int flags, void *key) 6377 6377 { 6378 6378 int ret; 6379 6379 struct cb_notify_lock_args *cbnl = key; ··· 6416 6416 .inode = state->inode, 6417 6417 .owner = &owner, 6418 6418 .notified = false }; 6419 - wait_queue_t wait; 6419 + wait_queue_entry_t wait; 6420 6420 6421 6421 /* Don't bother with waitqueue if we don't expect a callback */ 6422 6422 if (!test_bit(NFS_STATE_MAY_NOTIFY_LOCK, &state->flags))
+2 -3
fs/nilfs2/segment.c
··· 2161 2161 } 2162 2162 2163 2163 struct nilfs_segctor_wait_request { 2164 - wait_queue_t wq; 2164 + wait_queue_entry_t wq; 2165 2165 __u32 seq; 2166 2166 int err; 2167 2167 atomic_t done; ··· 2206 2206 unsigned long flags; 2207 2207 2208 2208 spin_lock_irqsave(&sci->sc_wait_request.lock, flags); 2209 - list_for_each_entry_safe(wrq, n, &sci->sc_wait_request.task_list, 2210 - wq.task_list) { 2209 + list_for_each_entry_safe(wrq, n, &sci->sc_wait_request.head, wq.entry) { 2211 2210 if (!atomic_read(&wrq->done) && 2212 2211 nilfs_cnt32_ge(sci->sc_seq_done, wrq->seq)) { 2213 2212 wrq->err = err;
+6 -6
fs/orangefs/orangefs-bufmap.c
··· 46 46 spin_lock(&m->q.lock); 47 47 if (m->c != -1) { 48 48 for (;;) { 49 - if (likely(list_empty(&wait.task_list))) 50 - __add_wait_queue_tail(&m->q, &wait); 49 + if (likely(list_empty(&wait.entry))) 50 + __add_wait_queue_entry_tail(&m->q, &wait); 51 51 set_current_state(TASK_UNINTERRUPTIBLE); 52 52 53 53 if (m->c == -1) ··· 84 84 85 85 do { 86 86 long n = left, t; 87 - if (likely(list_empty(&wait.task_list))) 88 - __add_wait_queue_tail_exclusive(&m->q, &wait); 87 + if (likely(list_empty(&wait.entry))) 88 + __add_wait_queue_entry_tail_exclusive(&m->q, &wait); 89 89 set_current_state(TASK_INTERRUPTIBLE); 90 90 91 91 if (m->c > 0) ··· 108 108 left = -EINTR; 109 109 } while (left > 0); 110 110 111 - if (!list_empty(&wait.task_list)) 112 - list_del(&wait.task_list); 111 + if (!list_empty(&wait.entry)) 112 + list_del(&wait.entry); 113 113 else if (left <= 0 && waitqueue_active(&m->q)) 114 114 __wake_up_locked_key(&m->q, TASK_INTERRUPTIBLE, NULL); 115 115 __set_current_state(TASK_RUNNING);
+1 -1
fs/reiserfs/journal.c
···
 
 static void queue_log_writer(struct super_block *s)
 {
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	struct reiserfs_journal *journal = SB_JOURNAL(s);
 	set_bit(J_WRITERS_QUEUED, &journal->j_state);
+2 -2
fs/select.c
···
 	return table->entry++;
 }
 
-static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int __pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
 	struct poll_wqueues *pwq = wait->private;
 	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
···
 	return default_wake_function(&dummy_wait, mode, sync, key);
 }
 
-static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int pollwake(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
 	struct poll_table_entry *entry;
+1 -1
fs/signalfd.c
···
 	if (likely(!waitqueue_active(wqh)))
 		return;
 
-	/* wait_queue_t->func(POLLFREE) should do remove_wait_queue() */
+	/* wait_queue_entry_t->func(POLLFREE) should do remove_wait_queue() */
 	wake_up_poll(wqh, POLLHUP | POLLFREE);
 }
+15 -15
fs/userfaultfd.c
···
 
 struct userfaultfd_wait_queue {
 	struct uffd_msg msg;
-	wait_queue_t wq;
+	wait_queue_entry_t wq;
 	struct userfaultfd_ctx *ctx;
 	bool waken;
 };
···
 	unsigned long len;
 };
 
-static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
 				     int wake_flags, void *key)
 {
 	struct userfaultfd_wake_range *range = key;
···
 	 * wouldn't be enough, the smp_mb__before_spinlock is
 	 * enough to avoid an explicit smp_mb() here.
 	 */
-	list_del_init(&wq->task_list);
+	list_del_init(&wq->entry);
 out:
 	return ret;
 }
···
 	 * and it's fine not to block on the spinlock. The uwq on this
 	 * kernel stack can be released after the list_del_init.
 	 */
-	if (!list_empty_careful(&uwq.wq.task_list)) {
+	if (!list_empty_careful(&uwq.wq.entry)) {
 		spin_lock(&ctx->fault_pending_wqh.lock);
 		/*
 		 * No need of list_del_init(), the uwq on the stack
 		 * will be freed shortly anyway.
 		 */
-		list_del(&uwq.wq.task_list);
+		list_del(&uwq.wq.entry);
 		spin_unlock(&ctx->fault_pending_wqh.lock);
 	}
 
···
 static inline struct userfaultfd_wait_queue *find_userfault_in(
 		wait_queue_head_t *wqh)
 {
-	wait_queue_t *wq;
+	wait_queue_entry_t *wq;
 	struct userfaultfd_wait_queue *uwq;
 
 	VM_BUG_ON(!spin_is_locked(&wqh->lock));
···
 	if (!waitqueue_active(wqh))
 		goto out;
 	/* walk in reverse to provide FIFO behavior to read userfaults */
-	wq = list_last_entry(&wqh->task_list, typeof(*wq), task_list);
+	wq = list_last_entry(&wqh->head, typeof(*wq), entry);
 	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 out:
 	return uwq;
···
 			 * changes __remove_wait_queue() to use
 			 * list_del_init() in turn breaking the
 			 * !list_empty_careful() check in
-			 * handle_userfault(). The uwq->wq.task_list
+			 * handle_userfault(). The uwq->wq.head list
 			 * must never be empty at any time during the
 			 * refile, or the waitqueue could disappear
 			 * from under us. The "wait_queue_head_t"
 			 * parameter of __remove_wait_queue() is unused
 			 * anyway.
 			 */
-			list_del(&uwq->wq.task_list);
+			list_del(&uwq->wq.entry);
 			__add_wait_queue(&ctx->fault_wqh, &uwq->wq);
 
 			write_seqcount_end(&ctx->refile_seq);
···
 			fork_nctx = (struct userfaultfd_ctx *)
 				(unsigned long)
 				uwq->msg.arg.reserved.reserved1;
-			list_move(&uwq->wq.task_list, &fork_event);
+			list_move(&uwq->wq.entry, &fork_event);
 			spin_unlock(&ctx->event_wqh.lock);
 			ret = 0;
 			break;
···
 	if (!list_empty(&fork_event)) {
 		uwq = list_first_entry(&fork_event,
 				       typeof(*uwq),
-				       wq.task_list);
-		list_del(&uwq->wq.task_list);
+				       wq.entry);
+		list_del(&uwq->wq.entry);
 		__add_wait_queue(&ctx->event_wqh, &uwq->wq);
 		userfaultfd_event_complete(ctx, uwq);
 	}
···
 static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
 	struct userfaultfd_ctx *ctx = f->private_data;
-	wait_queue_t *wq;
+	wait_queue_entry_t *wq;
 	struct userfaultfd_wait_queue *uwq;
 	unsigned long pending = 0, total = 0;
 
 	spin_lock(&ctx->fault_pending_wqh.lock);
-	list_for_each_entry(wq, &ctx->fault_pending_wqh.task_list, task_list) {
+	list_for_each_entry(wq, &ctx->fault_pending_wqh.head, entry) {
 		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 		pending++;
 		total++;
 	}
-	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+	list_for_each_entry(wq, &ctx->fault_wqh.head, entry) {
 		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 		total++;
 	}
+2 -2
fs/xfs/xfs_icache.c
···
 	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_INEW_BIT);
 
 	do {
-		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 		if (!xfs_iflags_test(ip, XFS_INEW))
 			break;
 		schedule();
 	} while (true);
-	finish_wait(wq, &wait.wait);
+	finish_wait(wq, &wait.wq_entry);
 }
 
 /*
+4 -4
fs/xfs/xfs_inode.c
···
 	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);
 
 	do {
-		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait_exclusive(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 		if (xfs_isiflocked(ip))
 			io_schedule();
 	} while (!xfs_iflock_nowait(ip));
 
-	finish_wait(wq, &wait.wait);
+	finish_wait(wq, &wait.wq_entry);
 }
 
 STATIC uint
···
 	xfs_iunpin(ip);
 
 	do {
-		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 		if (xfs_ipincount(ip))
 			io_schedule();
 	} while (xfs_ipincount(ip));
-	finish_wait(wq, &wait.wait);
+	finish_wait(wq, &wait.wq_entry);
 }
 
 void
+1 -1
include/linux/blk-mq.h
···
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
-	wait_queue_t		dispatch_wait;
+	wait_queue_entry_t	dispatch_wait;
 	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
+1
include/linux/clocksource.h
···
 	void			(*suspend)(struct clocksource *cs);
 	void			(*resume)(struct clocksource *cs);
 	void			(*mark_unstable)(struct clocksource *cs);
+	void			(*tick_stable)(struct clocksource *cs);
 
 	/* private: */
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
+28
include/linux/cpumask.h
···
 		(cpu) = cpumask_next_zero((cpu), (mask)),	\
 		(cpu) < nr_cpu_ids;)
 
+extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap);
+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including @start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start)					\
+	for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false);	\
+	     (cpu) < nr_cpumask_bits;						\
+	     (cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
  * @cpu: the (optionally unsigned) integer iterator
···
 	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
 
+static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
+{
+	__set_bit(cpumask_check(cpu), cpumask_bits(dstp));
+}
+
+
 /**
  * cpumask_clear_cpu - clear a cpu in a cpumask
  * @cpu: cpu number (< nr_cpu_ids)
···
 static inline void cpumask_clear_cpu(int cpu, struct cpumask *dstp)
 {
 	clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
+}
+
+static inline void __cpumask_clear_cpu(int cpu, struct cpumask *dstp)
+{
+	__clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
 }
 
 /**
+2 -2
include/linux/eventfd.h
···
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n);
 ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
-int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
 				  __u64 *cnt);
 
 #else /* CONFIG_EVENTFD */
···
 }
 
 static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
-						wait_queue_t *wait, __u64 *cnt)
+						wait_queue_entry_t *wait, __u64 *cnt)
 {
 	return -ENOSYS;
 }
+1 -1
include/linux/fs.h
···
 #define _LINUX_FS_H
 
 #include <linux/linkage.h>
-#include <linux/wait.h>
+#include <linux/wait_bit.h>
 #include <linux/kdev_t.h>
 #include <linux/dcache.h>
 #include <linux/path.h>
+5 -1
include/linux/kernel.h
···
 
 extern bool early_boot_irqs_disabled;
 
-/* Values used for system_state */
+/*
+ * Values used for system_state. Ordering of the states must not be changed
+ * as code checks for <, <=, >, >= STATE.
+ */
 extern enum system_states {
 	SYSTEM_BOOTING,
+	SYSTEM_SCHEDULING,
 	SYSTEM_RUNNING,
 	SYSTEM_HALT,
 	SYSTEM_POWER_OFF,
+1 -1
include/linux/kvm_irqfd.h
···
 struct kvm_kernel_irqfd {
 	/* Used for MSI fast-path */
 	struct kvm *kvm;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	/* Update side is protected by irqfds.lock */
 	struct kvm_kernel_irq_routing_entry irq_entry;
 	seqcount_t irq_entry_sc;
+19
include/linux/llist.h
···
 	for ((pos) = (node); pos; (pos) = (pos)->next)
 
 /**
+ * llist_for_each_safe - iterate over some deleted entries of a lock-less list
+ *			 safe against removal of list entry
+ * @pos:	the &struct llist_node to use as a loop cursor
+ * @n:		another &struct llist_node to use as temporary storage
+ * @node:	the first entry of deleted list entries
+ *
+ * In general, some entries of the lock-less list can be traversed
+ * safely only after being deleted from list, so start with an entry
+ * instead of list head.
+ *
+ * If being used on entries deleted from lock-less list directly, the
+ * traverse order is from the newest to the oldest added entry.  If
+ * you want to traverse from the oldest to the newest, you must
+ * reverse the order by yourself before traversing.
+ */
+#define llist_for_each_safe(pos, n, node)			\
+	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))
+
+/**
  * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
  * @pos:	the type * to use as a loop cursor.
  * @node:	the first entry of deleted list entries.
+1 -1
include/linux/pagemap.h
···
 /*
  * Add an arbitrary waiter to a page's wait queue
  */
-extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
 
 /*
  * Fault everything in given userspace address range in.
+1 -1
include/linux/poll.h
···
 struct poll_table_entry {
 	struct file *filp;
 	unsigned long key;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	wait_queue_head_t *wait_address;
 };
+19 -3
include/linux/sched.h
···
 	u64				dl_runtime;	/* Maximum runtime for each instance	*/
 	u64				dl_deadline;	/* Relative deadline of each instance	*/
 	u64				dl_period;	/* Separation of two instances (period) */
-	u64				dl_bw;		/* dl_runtime / dl_deadline		*/
+	u64				dl_bw;		/* dl_runtime / dl_period		*/
+	u64				dl_density;	/* dl_runtime / dl_deadline		*/
 
 	/*
 	 * Actual scheduling parameters. Initialized with the values above,
···
 	 *
 	 * @dl_yielded tells if task gave up the CPU before consuming
 	 * all its available runtime during the last job.
+	 *
+	 * @dl_non_contending tells if the task is inactive while still
+	 * contributing to the active utilization. In other words, it
+	 * indicates if the inactive timer has been armed and its handler
+	 * has not been executed yet. This flag is useful to avoid race
+	 * conditions between the inactive timer handler and the wakeup
+	 * code.
 	 */
 	int				dl_throttled;
 	int				dl_boosted;
 	int				dl_yielded;
+	int				dl_non_contending;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
 	 * own bandwidth to be enforced, thus we need one timer per task.
 	 */
 	struct hrtimer			dl_timer;
+
+	/*
+	 * Inactive timer, responsible for decreasing the active utilization
+	 * at the "0-lag time". When a -deadline task blocks, it contributes
+	 * to GRUB's active utilization until the "0-lag time", hence a
+	 * timer is needed to decrease the active utilization at the correct
+	 * time.
+	 */
+	struct hrtimer			inactive_timer;
 };
 
 union rcu_special {
···
  * task_xid_vnr()	: virtual id, i.e. the id seen from the pid namespace of
  *			  current.
  * task_xid_nr_ns()	: id seen from the ns specified;
- *
- * set_task_vxid()	: assigns a virtual id to a task;
  *
  * see also pid_nr() etc in include/linux/pid.h
  */
+3 -8
include/linux/sched/clock.h
···
 extern void sched_clock_init(void);
 
 #ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-static inline void sched_clock_init_late(void)
-{
-}
-
 static inline void sched_clock_tick(void)
 {
 }
···
 {
 }
 
-static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
+static inline void sched_clock_idle_wakeup_event(void)
 {
 }
···
 	return sched_clock();
 }
 #else
-extern void sched_clock_init_late(void);
 extern int sched_clock_stable(void);
 extern void clear_sched_clock_stable(void);
···
  */
 extern u64 __sched_clock_offset;
 
-
 extern void sched_clock_tick(void);
+extern void sched_clock_tick_stable(void);
 extern void sched_clock_idle_sleep_event(void);
-extern void sched_clock_idle_wakeup_event(u64 delta_ns);
+extern void sched_clock_idle_wakeup_event(void);
 
 /*
  * As outlined in clock.c, provides a fast, high resolution, nanosecond
+4 -4
include/linux/sched/nohz.h
···
 #endif
 
 #ifdef CONFIG_NO_HZ_COMMON
-void calc_load_enter_idle(void);
-void calc_load_exit_idle(void);
+void calc_load_nohz_start(void);
+void calc_load_nohz_stop(void);
 #else
-static inline void calc_load_enter_idle(void) { }
-static inline void calc_load_exit_idle(void) { }
+static inline void calc_load_nohz_start(void) { }
+static inline void calc_load_nohz_stop(void) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #if defined(CONFIG_NO_HZ_COMMON) && defined(CONFIG_SMP)
-2
include/linux/sched/task.h
···
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
-struct task_struct *try_get_task_struct(struct task_struct **ptask);
-
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
+1 -1
include/linux/sunrpc/sched.h
···
 #include <linux/ktime.h>
 #include <linux/sunrpc/types.h>
 #include <linux/spinlock.h>
-#include <linux/wait.h>
+#include <linux/wait_bit.h>
 #include <linux/workqueue.h>
 #include <linux/sunrpc/xdr.h>
+1 -1
include/linux/vfio.h
···
 	void			(*thread)(void *, void *);
 	void			*data;
 	struct work_struct	inject;
-	wait_queue_t		wait;
+	wait_queue_entry_t	wait;
 	poll_table		pt;
 	struct work_struct	shutdown;
 	struct virqfd		**pvirqfd;
+378 -626
include/linux/wait.h
···
 #include <asm/current.h>
 #include <uapi/linux/wait.h>
 
-typedef struct __wait_queue wait_queue_t;
-typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode, int flags, void *key);
-int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *key);
+typedef struct wait_queue_entry wait_queue_entry_t;
 
-/* __wait_queue::flags */
+typedef int (*wait_queue_func_t)(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key);
+int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int flags, void *key);
+
+/* wait_queue_entry::flags */
 #define WQ_FLAG_EXCLUSIVE	0x01
 #define WQ_FLAG_WOKEN		0x02
 
-struct __wait_queue {
+/*
+ * A single wait-queue entry structure:
+ */
+struct wait_queue_entry {
 	unsigned int		flags;
 	void			*private;
 	wait_queue_func_t	func;
-	struct list_head	task_list;
+	struct list_head	entry;
 };
 
-struct wait_bit_key {
-	void			*flags;
-	int			bit_nr;
-#define WAIT_ATOMIC_T_BIT_NR	-1
-	unsigned long		timeout;
-};
-
-struct wait_bit_queue {
-	struct wait_bit_key	key;
-	wait_queue_t		wait;
-};
-
-struct __wait_queue_head {
+struct wait_queue_head {
 	spinlock_t		lock;
-	struct list_head	task_list;
+	struct list_head	head;
 };
-typedef struct __wait_queue_head wait_queue_head_t;
+typedef struct wait_queue_head wait_queue_head_t;
 
 struct task_struct;
 
···
 /*
  * Macros for declaration and initialisation of the datatypes
  */
 
-#define __WAITQUEUE_INITIALIZER(name, tsk) {				\
-	.private	= tsk,						\
-	.func		= default_wake_function,			\
-	.task_list	= { NULL, NULL } }
+#define __WAITQUEUE_INITIALIZER(name, tsk) {					\
+	.private	= tsk,							\
+	.func		= default_wake_function,				\
+	.entry		= { NULL, NULL } }
 
-#define DECLARE_WAITQUEUE(name, tsk)					\
-	wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
+#define DECLARE_WAITQUEUE(name, tsk)						\
+	struct wait_queue_entry name = __WAITQUEUE_INITIALIZER(name, tsk)
 
-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {				\
-	.lock		= __SPIN_LOCK_UNLOCKED(name.lock),		\
-	.task_list	= { &(name).task_list, &(name).task_list } }
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) {					\
+	.lock		= __SPIN_LOCK_UNLOCKED(name.lock),			\
+	.head		= { &(name).head, &(name).head } }
 
 #define DECLARE_WAIT_QUEUE_HEAD(name) \
-	wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
+	struct wait_queue_head name = __WAIT_QUEUE_HEAD_INITIALIZER(name)
 
-#define __WAIT_BIT_KEY_INITIALIZER(word, bit)				\
-	{ .flags = word, .bit_nr = bit, }
-
-#define __WAIT_ATOMIC_T_KEY_INITIALIZER(p)				\
-	{ .flags = p, .bit_nr = WAIT_ATOMIC_T_BIT_NR, }
-
-extern void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct lock_class_key *);
+extern void __init_waitqueue_head(struct wait_queue_head *wq_head, const char *name, struct lock_class_key *);
 
-#define init_waitqueue_head(q)				\
-	do {						\
-		static struct lock_class_key __key;	\
-							\
-		__init_waitqueue_head((q), #q, &__key);	\
+#define init_waitqueue_head(wq_head)						\
+	do {									\
+		static struct lock_class_key __key;				\
+										\
+		__init_waitqueue_head((wq_head), #wq_head, &__key);		\
 	} while (0)
 
 #ifdef CONFIG_LOCKDEP
 # define __WAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
 	({ init_waitqueue_head(&name); name; })
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) \
-	wait_queue_head_t name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
+	struct wait_queue_head name = __WAIT_QUEUE_HEAD_INIT_ONSTACK(name)
 #else
 # define DECLARE_WAIT_QUEUE_HEAD_ONSTACK(name) DECLARE_WAIT_QUEUE_HEAD(name)
 #endif
 
-static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
+static inline void init_waitqueue_entry(struct wait_queue_entry *wq_entry, struct task_struct *p)
 {
-	q->flags	= 0;
-	q->private	= p;
-	q->func		= default_wake_function;
+	wq_entry->flags		= 0;
+	wq_entry->private	= p;
+	wq_entry->func		= default_wake_function;
 }
 
 static inline void
-init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func)
+init_waitqueue_func_entry(struct wait_queue_entry *wq_entry, wait_queue_func_t func)
 {
-	q->flags	= 0;
-	q->private	= NULL;
-	q->func		= func;
+	wq_entry->flags		= 0;
+	wq_entry->private	= NULL;
+	wq_entry->func		= func;
 }
 
 /**
  * waitqueue_active -- locklessly test for waiters on the queue
- * @q: the waitqueue to test for waiters
+ * @wq_head: the waitqueue to test for waiters
  *
  * returns true if the wait list is not empty
  *
  * NOTE: this function is lockless and requires care, incorrect usage _will_
  * lead to sporadic and non-obvious failure.
  *
- * Use either while holding wait_queue_head_t::lock or when used for wakeups
+ * Use either while holding wait_queue_head::lock or when used for wakeups
  * with an extra smp_mb() like:
  *
  *      CPU0 - waker                    CPU1 - waiter
  *
  *                                      for (;;) {
- *      @cond = true;                     prepare_to_wait(&wq, &wait, state);
+ *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
  *      smp_mb();                         // smp_mb() from set_current_state()
- *      if (waitqueue_active(wq))         if (@cond)
- *        wake_up(wq);                      break;
+ *      if (waitqueue_active(wq_head))    if (@cond)
+ *        wake_up(wq_head);                 break;
  *                                        schedule();
  *                                      }
- *                                      finish_wait(&wq, &wait);
+ *                                      finish_wait(&wq_head, &wait);
  *
  * Because without the explicit smp_mb() it's possible for the
  * waitqueue_active() load to get hoisted over the @cond store such that we'll
···
  * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
  * which (when the lock is uncontended) are of roughly equal cost.
  */
-static inline int waitqueue_active(wait_queue_head_t *q)
+static inline int waitqueue_active(struct wait_queue_head *wq_head)
 {
-	return !list_empty(&q->task_list);
+	return !list_empty(&wq_head->head);
 }
 
 /**
  * wq_has_sleeper - check if there are any waiting processes
- * @wq: wait queue head
+ * @wq_head: wait queue head
  *
- * Returns true if wq has waiting processes
+ * Returns true if wq_head has waiting processes
  *
  * Please refer to the comment for waitqueue_active.
  */
-static inline bool wq_has_sleeper(wait_queue_head_t *wq)
+static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 {
 	/*
 	 * We need to be sure we are in sync with the
 	 * waiting side.
 	 */
 	smp_mb();
-	return waitqueue_active(wq);
+	return waitqueue_active(wq_head);
 }
 
-extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
-extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait);
-extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
+extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
-static inline void __add_wait_queue(wait_queue_head_t *head, wait_queue_t *new)
+static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&new->task_list, &head->task_list);
+	list_add(&wq_entry->entry, &wq_head->head);
 }
 
 /*
  * Used for wake-one threads:
  */
 static inline void
-__add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
+__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue(q, wait);
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+	__add_wait_queue(wq_head, wq_entry);
 }
 
-static inline void __add_wait_queue_tail(wait_queue_head_t *head,
-					 wait_queue_t *new)
+static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add_tail(&new->task_list, &head->task_list);
-}
-
-static inline void
-__add_wait_queue_tail_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
-{
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue_tail(q, wait);
+	list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
 static inline void
-__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
+__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_del(&old->task_list);
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+	__add_wait_queue_entry_tail(wq_head, wq_entry);
 }
 
-typedef int wait_bit_action_f(struct wait_bit_key *, int mode);
-void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
-void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
-void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
-void __wake_up_bit(wait_queue_head_t *, void *, int);
-int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_action_f *, unsigned);
-int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, wait_bit_action_f *, unsigned);
-void wake_up_bit(void *, int);
-void wake_up_atomic_t(atomic_t *);
-int out_of_line_wait_on_bit(void *, int, wait_bit_action_f *, unsigned);
-int out_of_line_wait_on_bit_timeout(void *, int, wait_bit_action_f *, unsigned, unsigned long);
-int out_of_line_wait_on_bit_lock(void *, int, wait_bit_action_f *, unsigned);
-int out_of_line_wait_on_atomic_t(atomic_t *, int (*)(atomic_t *), unsigned);
-wait_queue_head_t *bit_waitqueue(void *, int);
+static inline void
+__remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	list_del(&wq_entry->entry);
+}
+
+void __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
+void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
+void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
+void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr);
+void __wake_up_sync(struct wait_queue_head *wq_head, unsigned int mode, int nr);
 
 #define wake_up(x)			__wake_up(x, TASK_NORMAL, 1, NULL)
 #define wake_up_nr(x, nr)		__wake_up(x, TASK_NORMAL, nr, NULL)
···
 /*
  * Wakeup macros to be used to report events to the targets.
  */
-#define wake_up_poll(x, m)					\
+#define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
-#define wake_up_locked_poll(x, m)				\
+#define wake_up_locked_poll(x, m)					\
 	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
-#define wake_up_interruptible_poll(x, m)			\
+#define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
-#define wake_up_interruptible_sync_poll(x, m)			\
+#define wake_up_interruptible_sync_poll(x, m)				\
 	__wake_up_sync_key((x), TASK_INTERRUPTIBLE, 1, (void *) (m))
 
-#define ___wait_cond_timeout(condition)					\
-({									\
-	bool __cond = (condition);					\
-	if (__cond && !__ret)						\
-		__ret = 1;						\
-	__cond || !__ret;						\
+#define ___wait_cond_timeout(condition)						\
+({										\
+	bool __cond = (condition);						\
+	if (__cond && !__ret)							\
+		__ret = 1;							\
+	__cond || !__ret;							\
 })
 
-#define ___wait_is_interruptible(state)					\
-	(!__builtin_constant_p(state) ||				\
-		state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)	\
+#define ___wait_is_interruptible(state)						\
+	(!__builtin_constant_p(state) ||					\
+		state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)		\
 
-extern void init_wait_entry(wait_queue_t *__wait, int flags);
+extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 
 /*
  * The below macro ___wait_event() has an explicit shadow of the __ret
···
  * otherwise.
  */
 
-#define ___wait_event(wq, condition, state, exclusive, ret, cmd)	\
-({									\
-	__label__ __out;						\
-	wait_queue_t __wait;						\
-	long __ret = ret;	/* explicit shadow */			\
-									\
-	init_wait_entry(&__wait, exclusive ? WQ_FLAG_EXCLUSIVE : 0);	\
-	for (;;) {							\
-		long __int = prepare_to_wait_event(&wq, &__wait, state);\
-									\
-		if (condition)						\
-			break;						\
-									\
-		if (___wait_is_interruptible(state) && __int) {		\
-			__ret = __int;					\
-			goto __out;					\
-		}							\
-									\
-		cmd;							\
-	}								\
-	finish_wait(&wq, &__wait);					\
-__out:	__ret;								\
+#define ___wait_event(wq_head, condition, state, exclusive, ret, cmd)		\
+({										\
+	__label__ __out;							\
+	struct wait_queue_entry __wq_entry;					\
+	long __ret = ret;	/* explicit shadow */				\
+										\
+	init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);	\
+	for (;;) {								\
+		long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
+										\
+		if (condition)							\
+			break;							\
+										\
+		if (___wait_is_interruptible(state) && __int) {			\
+			__ret = __int;						\
+			goto __out;						\
+		}								\
+										\
+		cmd;								\
+	}									\
+	finish_wait(&wq_head, &__wq_entry);					\
+__out:	__ret;									\
 })
 
-#define __wait_event(wq, condition)					\
-	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
+#define __wait_event(wq_head, condition)					\
+	(void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
 			    schedule())
 
 /**
  * wait_event - sleep until a condition gets true
- * @wq: the waitqueue to wait on
+ * @wq_head: the waitqueue to wait on
  * @condition: a C expression for the event to wait for
  *
  * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
  * @condition evaluates to true. The @condition is checked each time
- * the waitqueue @wq is woken up.
+ * the waitqueue @wq_head is woken up.
  *
  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.
  */
-#define wait_event(wq, condition)					\
-do {									\
-	might_sleep();							\
-	if (condition)							\
-		break;							\
-	__wait_event(wq, condition);					\
+#define wait_event(wq_head, condition)						\
+do {										\
+	might_sleep();								\
+	if (condition)								\
+		break;								\
+	__wait_event(wq_head, condition);					\
 } while (0)
 
-#define __io_wait_event(wq, condition)					\
-	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
+#define __io_wait_event(wq_head, condition)					\
+	(void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
 			    io_schedule())
 
 /*
  * io_wait_event() -- like wait_event() but with io_schedule()
  */
-#define io_wait_event(wq, condition)					\
-do {									\
-	might_sleep();							\
-	if (condition)							\
-		break;							\
-	__io_wait_event(wq, condition);					\
+#define io_wait_event(wq_head, condition)					\
+do {										\
+	might_sleep();								\
+	if (condition)								\
+		break;								\
+	__io_wait_event(wq_head, condition);					\
 } while (0)
 
-#define __wait_event_freezable(wq, condition)				\
-	___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0,		\
+#define __wait_event_freezable(wq_head, condition)				\
+	___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0,		\
 			    schedule(); try_to_freeze())
 
 /**
  * wait_event_freezable - sleep (or freeze) until a condition gets true
- * @wq: the waitqueue to wait on
+ * @wq_head: the waitqueue to wait on
  * @condition: a C expression for the event to wait for
  *
  * The process is put to sleep (TASK_INTERRUPTIBLE -- so as not to contribute
  * to system load) until the @condition evaluates to true. The
- * @condition is checked each time the waitqueue @wq is woken up.
+ * @condition is checked each time the waitqueue @wq_head is woken up.
  *
  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.
  */
-#define wait_event_freezable(wq, condition)				\
-({									\
-	int __ret = 0;							\
-	might_sleep();							\
-	if (!(condition))						\
-		__ret = __wait_event_freezable(wq, condition);		\
-	__ret;								\
+#define wait_event_freezable(wq_head, condition)				\
+({										\
+	int __ret = 0;								\
+	might_sleep();								\
+	if (!(condition))							\
+		__ret = __wait_event_freezable(wq_head, condition);		\
+	__ret;									\
 })
 
-#define __wait_event_timeout(wq, condition, timeout)			\
-	___wait_event(wq, ___wait_cond_timeout(condition),		\
-		      TASK_UNINTERRUPTIBLE, 0, timeout,			\
+#define __wait_event_timeout(wq_head, condition, timeout)			\
+	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
+		      TASK_UNINTERRUPTIBLE, 0, timeout,				\
 		      __ret = schedule_timeout(__ret))
 
 /**
  * wait_event_timeout - sleep until a condition gets true or a timeout elapses
- * @wq: the waitqueue to wait on
+ * @wq_head: the waitqueue to wait on
  * @condition: a C expression for the event to wait for
  * @timeout: timeout, in jiffies
  *
  * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the
  * @condition evaluates to true. The @condition is checked each time
- * the waitqueue @wq is woken up.
+ * the waitqueue @wq_head is woken up.
  *
  * wake_up() has to be called after changing any variable that could
  * change the result of the wait condition.
···
  * or the remaining jiffies (at least 1) if the @condition evaluated
  * to %true before the @timeout elapsed.
  */
-#define wait_event_timeout(wq, condition, timeout)			\
-({									\
-	long __ret = timeout;						\
-	might_sleep();							\
-	if (!___wait_cond_timeout(condition))				\
-		__ret = __wait_event_timeout(wq, condition, timeout);	\
-	__ret;								\
+#define wait_event_timeout(wq_head, condition, timeout)				\
+({										\
+	long __ret = timeout;							\
+	might_sleep();								\
+	if (!___wait_cond_timeout(condition))					\
+		__ret = __wait_event_timeout(wq_head, condition, timeout);	\
+	__ret;									\
 })
 
-#define __wait_event_freezable_timeout(wq, condition, timeout)		\
-	___wait_event(wq, ___wait_cond_timeout(condition),		\
-		      TASK_INTERRUPTIBLE, 0, timeout,			\
+#define __wait_event_freezable_timeout(wq_head, condition, timeout)		\
+	___wait_event(wq_head, ___wait_cond_timeout(condition),			\
+		      TASK_INTERRUPTIBLE, 0, timeout,				\
 		      __ret = schedule_timeout(__ret); try_to_freeze())
 
 /*
  * like wait_event_timeout() -- except it uses TASK_INTERRUPTIBLE to avoid
  * increasing load and is freezable.
369 395 */ 370 - #define wait_event_freezable_timeout(wq, condition, timeout) \ 371 - ({ \ 372 - long __ret = timeout; \ 373 - might_sleep(); \ 374 - if (!___wait_cond_timeout(condition)) \ 375 - __ret = __wait_event_freezable_timeout(wq, condition, timeout); \ 376 - __ret; \ 396 + #define wait_event_freezable_timeout(wq_head, condition, timeout) \ 397 + ({ \ 398 + long __ret = timeout; \ 399 + might_sleep(); \ 400 + if (!___wait_cond_timeout(condition)) \ 401 + __ret = __wait_event_freezable_timeout(wq_head, condition, timeout); \ 402 + __ret; \ 377 403 }) 378 404 379 - #define __wait_event_exclusive_cmd(wq, condition, cmd1, cmd2) \ 380 - (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 1, 0, \ 405 + #define __wait_event_exclusive_cmd(wq_head, condition, cmd1, cmd2) \ 406 + (void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 1, 0, \ 381 407 cmd1; schedule(); cmd2) 382 408 /* 383 409 * Just like wait_event_cmd(), except it sets exclusive flag 384 410 */ 385 - #define wait_event_exclusive_cmd(wq, condition, cmd1, cmd2) \ 386 - do { \ 387 - if (condition) \ 388 - break; \ 389 - __wait_event_exclusive_cmd(wq, condition, cmd1, cmd2); \ 411 + #define wait_event_exclusive_cmd(wq_head, condition, cmd1, cmd2) \ 412 + do { \ 413 + if (condition) \ 414 + break; \ 415 + __wait_event_exclusive_cmd(wq_head, condition, cmd1, cmd2); \ 390 416 } while (0) 391 417 392 - #define __wait_event_cmd(wq, condition, cmd1, cmd2) \ 393 - (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0, \ 418 + #define __wait_event_cmd(wq_head, condition, cmd1, cmd2) \ 419 + (void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0, \ 394 420 cmd1; schedule(); cmd2) 395 421 396 422 /** 397 423 * wait_event_cmd - sleep until a condition gets true 398 - * @wq: the waitqueue to wait on 424 + * @wq_head: the waitqueue to wait on 399 425 * @condition: a C expression for the event to wait for 400 426 * @cmd1: the command will be executed before sleep 401 427 * @cmd2: the 
command will be executed after sleep 402 428 * 403 429 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the 404 430 * @condition evaluates to true. The @condition is checked each time 405 - * the waitqueue @wq is woken up. 431 + * the waitqueue @wq_head is woken up. 406 432 * 407 433 * wake_up() has to be called after changing any variable that could 408 434 * change the result of the wait condition. 409 435 */ 410 - #define wait_event_cmd(wq, condition, cmd1, cmd2) \ 411 - do { \ 412 - if (condition) \ 413 - break; \ 414 - __wait_event_cmd(wq, condition, cmd1, cmd2); \ 436 + #define wait_event_cmd(wq_head, condition, cmd1, cmd2) \ 437 + do { \ 438 + if (condition) \ 439 + break; \ 440 + __wait_event_cmd(wq_head, condition, cmd1, cmd2); \ 415 441 } while (0) 416 442 417 - #define __wait_event_interruptible(wq, condition) \ 418 - ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0, \ 443 + #define __wait_event_interruptible(wq_head, condition) \ 444 + ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, \ 419 445 schedule()) 420 446 421 447 /** 422 448 * wait_event_interruptible - sleep until a condition gets true 423 - * @wq: the waitqueue to wait on 449 + * @wq_head: the waitqueue to wait on 424 450 * @condition: a C expression for the event to wait for 425 451 * 426 452 * The process is put to sleep (TASK_INTERRUPTIBLE) until the 427 453 * @condition evaluates to true or a signal is received. 428 - * The @condition is checked each time the waitqueue @wq is woken up. 454 + * The @condition is checked each time the waitqueue @wq_head is woken up. 429 455 * 430 456 * wake_up() has to be called after changing any variable that could 431 457 * change the result of the wait condition. ··· 433 459 * The function will return -ERESTARTSYS if it was interrupted by a 434 460 * signal and 0 if @condition evaluated to true. 
435 461 */ 436 - #define wait_event_interruptible(wq, condition) \ 437 - ({ \ 438 - int __ret = 0; \ 439 - might_sleep(); \ 440 - if (!(condition)) \ 441 - __ret = __wait_event_interruptible(wq, condition); \ 442 - __ret; \ 462 + #define wait_event_interruptible(wq_head, condition) \ 463 + ({ \ 464 + int __ret = 0; \ 465 + might_sleep(); \ 466 + if (!(condition)) \ 467 + __ret = __wait_event_interruptible(wq_head, condition); \ 468 + __ret; \ 443 469 }) 444 470 445 - #define __wait_event_interruptible_timeout(wq, condition, timeout) \ 446 - ___wait_event(wq, ___wait_cond_timeout(condition), \ 447 - TASK_INTERRUPTIBLE, 0, timeout, \ 471 + #define __wait_event_interruptible_timeout(wq_head, condition, timeout) \ 472 + ___wait_event(wq_head, ___wait_cond_timeout(condition), \ 473 + TASK_INTERRUPTIBLE, 0, timeout, \ 448 474 __ret = schedule_timeout(__ret)) 449 475 450 476 /** 451 477 * wait_event_interruptible_timeout - sleep until a condition gets true or a timeout elapses 452 - * @wq: the waitqueue to wait on 478 + * @wq_head: the waitqueue to wait on 453 479 * @condition: a C expression for the event to wait for 454 480 * @timeout: timeout, in jiffies 455 481 * 456 482 * The process is put to sleep (TASK_INTERRUPTIBLE) until the 457 483 * @condition evaluates to true or a signal is received. 458 - * The @condition is checked each time the waitqueue @wq is woken up. 484 + * The @condition is checked each time the waitqueue @wq_head is woken up. 459 485 * 460 486 * wake_up() has to be called after changing any variable that could 461 487 * change the result of the wait condition. ··· 467 493 * to %true before the @timeout elapsed, or -%ERESTARTSYS if it was 468 494 * interrupted by a signal. 
469 495 */ 470 - #define wait_event_interruptible_timeout(wq, condition, timeout) \ 471 - ({ \ 472 - long __ret = timeout; \ 473 - might_sleep(); \ 474 - if (!___wait_cond_timeout(condition)) \ 475 - __ret = __wait_event_interruptible_timeout(wq, \ 476 - condition, timeout); \ 477 - __ret; \ 496 + #define wait_event_interruptible_timeout(wq_head, condition, timeout) \ 497 + ({ \ 498 + long __ret = timeout; \ 499 + might_sleep(); \ 500 + if (!___wait_cond_timeout(condition)) \ 501 + __ret = __wait_event_interruptible_timeout(wq_head, \ 502 + condition, timeout); \ 503 + __ret; \ 478 504 }) 479 505 480 - #define __wait_event_hrtimeout(wq, condition, timeout, state) \ 481 - ({ \ 482 - int __ret = 0; \ 483 - struct hrtimer_sleeper __t; \ 484 - \ 485 - hrtimer_init_on_stack(&__t.timer, CLOCK_MONOTONIC, \ 486 - HRTIMER_MODE_REL); \ 487 - hrtimer_init_sleeper(&__t, current); \ 488 - if ((timeout) != KTIME_MAX) \ 489 - hrtimer_start_range_ns(&__t.timer, timeout, \ 490 - current->timer_slack_ns, \ 491 - HRTIMER_MODE_REL); \ 492 - \ 493 - __ret = ___wait_event(wq, condition, state, 0, 0, \ 494 - if (!__t.task) { \ 495 - __ret = -ETIME; \ 496 - break; \ 497 - } \ 498 - schedule()); \ 499 - \ 500 - hrtimer_cancel(&__t.timer); \ 501 - destroy_hrtimer_on_stack(&__t.timer); \ 502 - __ret; \ 506 + #define __wait_event_hrtimeout(wq_head, condition, timeout, state) \ 507 + ({ \ 508 + int __ret = 0; \ 509 + struct hrtimer_sleeper __t; \ 510 + \ 511 + hrtimer_init_on_stack(&__t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); \ 512 + hrtimer_init_sleeper(&__t, current); \ 513 + if ((timeout) != KTIME_MAX) \ 514 + hrtimer_start_range_ns(&__t.timer, timeout, \ 515 + current->timer_slack_ns, \ 516 + HRTIMER_MODE_REL); \ 517 + \ 518 + __ret = ___wait_event(wq_head, condition, state, 0, 0, \ 519 + if (!__t.task) { \ 520 + __ret = -ETIME; \ 521 + break; \ 522 + } \ 523 + schedule()); \ 524 + \ 525 + hrtimer_cancel(&__t.timer); \ 526 + destroy_hrtimer_on_stack(&__t.timer); \ 527 + __ret; \ 503 528 
}) 504 529 505 530 /** 506 531 * wait_event_hrtimeout - sleep until a condition gets true or a timeout elapses 507 - * @wq: the waitqueue to wait on 532 + * @wq_head: the waitqueue to wait on 508 533 * @condition: a C expression for the event to wait for 509 534 * @timeout: timeout, as a ktime_t 510 535 * 511 536 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the 512 537 * @condition evaluates to true or a signal is received. 513 - * The @condition is checked each time the waitqueue @wq is woken up. 538 + * The @condition is checked each time the waitqueue @wq_head is woken up. 514 539 * 515 540 * wake_up() has to be called after changing any variable that could 516 541 * change the result of the wait condition. ··· 517 544 * The function returns 0 if @condition became true, or -ETIME if the timeout 518 545 * elapsed. 519 546 */ 520 - #define wait_event_hrtimeout(wq, condition, timeout) \ 521 - ({ \ 522 - int __ret = 0; \ 523 - might_sleep(); \ 524 - if (!(condition)) \ 525 - __ret = __wait_event_hrtimeout(wq, condition, timeout, \ 526 - TASK_UNINTERRUPTIBLE); \ 527 - __ret; \ 547 + #define wait_event_hrtimeout(wq_head, condition, timeout) \ 548 + ({ \ 549 + int __ret = 0; \ 550 + might_sleep(); \ 551 + if (!(condition)) \ 552 + __ret = __wait_event_hrtimeout(wq_head, condition, timeout, \ 553 + TASK_UNINTERRUPTIBLE); \ 554 + __ret; \ 528 555 }) 529 556 530 557 /** 531 558 * wait_event_interruptible_hrtimeout - sleep until a condition gets true or a timeout elapses 532 - * @wq: the waitqueue to wait on 559 + * @wq_head: the waitqueue to wait on 533 560 * @condition: a C expression for the event to wait for 534 561 * @timeout: timeout, as a ktime_t 535 562 * 536 563 * The process is put to sleep (TASK_INTERRUPTIBLE) until the 537 564 * @condition evaluates to true or a signal is received. 538 - * The @condition is checked each time the waitqueue @wq is woken up. 565 + * The @condition is checked each time the waitqueue @wq_head is woken up. 
539 566 * 540 567 * wake_up() has to be called after changing any variable that could 541 568 * change the result of the wait condition. ··· 543 570 * The function returns 0 if @condition became true, -ERESTARTSYS if it was 544 571 * interrupted by a signal, or -ETIME if the timeout elapsed. 545 572 */ 546 - #define wait_event_interruptible_hrtimeout(wq, condition, timeout) \ 547 - ({ \ 548 - long __ret = 0; \ 549 - might_sleep(); \ 550 - if (!(condition)) \ 551 - __ret = __wait_event_hrtimeout(wq, condition, timeout, \ 552 - TASK_INTERRUPTIBLE); \ 553 - __ret; \ 573 + #define wait_event_interruptible_hrtimeout(wq, condition, timeout) \ 574 + ({ \ 575 + long __ret = 0; \ 576 + might_sleep(); \ 577 + if (!(condition)) \ 578 + __ret = __wait_event_hrtimeout(wq, condition, timeout, \ 579 + TASK_INTERRUPTIBLE); \ 580 + __ret; \ 554 581 }) 555 582 556 - #define __wait_event_interruptible_exclusive(wq, condition) \ 557 - ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0, \ 583 + #define __wait_event_interruptible_exclusive(wq, condition) \ 584 + ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0, \ 558 585 schedule()) 559 586 560 - #define wait_event_interruptible_exclusive(wq, condition) \ 561 - ({ \ 562 - int __ret = 0; \ 563 - might_sleep(); \ 564 - if (!(condition)) \ 565 - __ret = __wait_event_interruptible_exclusive(wq, condition);\ 566 - __ret; \ 587 + #define wait_event_interruptible_exclusive(wq, condition) \ 588 + ({ \ 589 + int __ret = 0; \ 590 + might_sleep(); \ 591 + if (!(condition)) \ 592 + __ret = __wait_event_interruptible_exclusive(wq, condition); \ 593 + __ret; \ 567 594 }) 568 595 569 - #define __wait_event_killable_exclusive(wq, condition) \ 570 - ___wait_event(wq, condition, TASK_KILLABLE, 1, 0, \ 596 + #define __wait_event_killable_exclusive(wq, condition) \ 597 + ___wait_event(wq, condition, TASK_KILLABLE, 1, 0, \ 571 598 schedule()) 572 599 573 - #define wait_event_killable_exclusive(wq, condition) \ 574 - ({ \ 575 - int __ret = 0; \ 576 - 
might_sleep(); \ 577 - if (!(condition)) \ 578 - __ret = __wait_event_killable_exclusive(wq, condition); \ 579 - __ret; \ 600 + #define wait_event_killable_exclusive(wq, condition) \ 601 + ({ \ 602 + int __ret = 0; \ 603 + might_sleep(); \ 604 + if (!(condition)) \ 605 + __ret = __wait_event_killable_exclusive(wq, condition); \ 606 + __ret; \ 580 607 }) 581 608 582 609 583 - #define __wait_event_freezable_exclusive(wq, condition) \ 584 - ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0, \ 610 + #define __wait_event_freezable_exclusive(wq, condition) \ 611 + ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 1, 0, \ 585 612 schedule(); try_to_freeze()) 586 613 587 - #define wait_event_freezable_exclusive(wq, condition) \ 588 - ({ \ 589 - int __ret = 0; \ 590 - might_sleep(); \ 591 - if (!(condition)) \ 592 - __ret = __wait_event_freezable_exclusive(wq, condition);\ 593 - __ret; \ 614 + #define wait_event_freezable_exclusive(wq, condition) \ 615 + ({ \ 616 + int __ret = 0; \ 617 + might_sleep(); \ 618 + if (!(condition)) \ 619 + __ret = __wait_event_freezable_exclusive(wq, condition); \ 620 + __ret; \ 594 621 }) 595 622 596 - extern int do_wait_intr(wait_queue_head_t *, wait_queue_t *); 597 - extern int do_wait_intr_irq(wait_queue_head_t *, wait_queue_t *); 623 + extern int do_wait_intr(wait_queue_head_t *, wait_queue_entry_t *); 624 + extern int do_wait_intr_irq(wait_queue_head_t *, wait_queue_entry_t *); 598 625 599 - #define __wait_event_interruptible_locked(wq, condition, exclusive, fn) \ 600 - ({ \ 601 - int __ret; \ 602 - DEFINE_WAIT(__wait); \ 603 - if (exclusive) \ 604 - __wait.flags |= WQ_FLAG_EXCLUSIVE; \ 605 - do { \ 606 - __ret = fn(&(wq), &__wait); \ 607 - if (__ret) \ 608 - break; \ 609 - } while (!(condition)); \ 610 - __remove_wait_queue(&(wq), &__wait); \ 611 - __set_current_state(TASK_RUNNING); \ 612 - __ret; \ 626 + #define __wait_event_interruptible_locked(wq, condition, exclusive, fn) \ 627 + ({ \ 628 + int __ret; \ 629 + DEFINE_WAIT(__wait); 
\ 630 + if (exclusive) \ 631 + __wait.flags |= WQ_FLAG_EXCLUSIVE; \ 632 + do { \ 633 + __ret = fn(&(wq), &__wait); \ 634 + if (__ret) \ 635 + break; \ 636 + } while (!(condition)); \ 637 + __remove_wait_queue(&(wq), &__wait); \ 638 + __set_current_state(TASK_RUNNING); \ 639 + __ret; \ 613 640 }) 614 641 615 642 ··· 636 663 * The function will return -ERESTARTSYS if it was interrupted by a 637 664 * signal and 0 if @condition evaluated to true. 638 665 */ 639 - #define wait_event_interruptible_locked(wq, condition) \ 640 - ((condition) \ 666 + #define wait_event_interruptible_locked(wq, condition) \ 667 + ((condition) \ 641 668 ? 0 : __wait_event_interruptible_locked(wq, condition, 0, do_wait_intr)) 642 669 643 670 /** ··· 663 690 * The function will return -ERESTARTSYS if it was interrupted by a 664 691 * signal and 0 if @condition evaluated to true. 665 692 */ 666 - #define wait_event_interruptible_locked_irq(wq, condition) \ 667 - ((condition) \ 693 + #define wait_event_interruptible_locked_irq(wq, condition) \ 694 + ((condition) \ 668 695 ? 0 : __wait_event_interruptible_locked(wq, condition, 0, do_wait_intr_irq)) 669 696 670 697 /** ··· 694 721 * The function will return -ERESTARTSYS if it was interrupted by a 695 722 * signal and 0 if @condition evaluated to true. 696 723 */ 697 - #define wait_event_interruptible_exclusive_locked(wq, condition) \ 698 - ((condition) \ 724 + #define wait_event_interruptible_exclusive_locked(wq, condition) \ 725 + ((condition) \ 699 726 ? 0 : __wait_event_interruptible_locked(wq, condition, 1, do_wait_intr)) 700 727 701 728 /** ··· 725 752 * The function will return -ERESTARTSYS if it was interrupted by a 726 753 * signal and 0 if @condition evaluated to true. 727 754 */ 728 - #define wait_event_interruptible_exclusive_locked_irq(wq, condition) \ 729 - ((condition) \ 755 + #define wait_event_interruptible_exclusive_locked_irq(wq, condition) \ 756 + ((condition) \ 730 757 ? 
0 : __wait_event_interruptible_locked(wq, condition, 1, do_wait_intr_irq)) 731 758 732 759 733 - #define __wait_event_killable(wq, condition) \ 760 + #define __wait_event_killable(wq, condition) \ 734 761 ___wait_event(wq, condition, TASK_KILLABLE, 0, 0, schedule()) 735 762 736 763 /** ··· 748 775 * The function will return -ERESTARTSYS if it was interrupted by a 749 776 * signal and 0 if @condition evaluated to true. 750 777 */ 751 - #define wait_event_killable(wq, condition) \ 752 - ({ \ 753 - int __ret = 0; \ 754 - might_sleep(); \ 755 - if (!(condition)) \ 756 - __ret = __wait_event_killable(wq, condition); \ 757 - __ret; \ 778 + #define wait_event_killable(wq_head, condition) \ 779 + ({ \ 780 + int __ret = 0; \ 781 + might_sleep(); \ 782 + if (!(condition)) \ 783 + __ret = __wait_event_killable(wq_head, condition); \ 784 + __ret; \ 758 785 }) 759 786 760 787 761 - #define __wait_event_lock_irq(wq, condition, lock, cmd) \ 762 - (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0, \ 763 - spin_unlock_irq(&lock); \ 764 - cmd; \ 765 - schedule(); \ 788 + #define __wait_event_lock_irq(wq_head, condition, lock, cmd) \ 789 + (void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 0, \ 790 + spin_unlock_irq(&lock); \ 791 + cmd; \ 792 + schedule(); \ 766 793 spin_lock_irq(&lock)) 767 794 768 795 /** ··· 770 797 * condition is checked under the lock. This 771 798 * is expected to be called with the lock 772 799 * taken. 773 - * @wq: the waitqueue to wait on 800 + * @wq_head: the waitqueue to wait on 774 801 * @condition: a C expression for the event to wait for 775 802 * @lock: a locked spinlock_t, which will be released before cmd 776 803 * and schedule() and reacquired afterwards. ··· 779 806 * 780 807 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the 781 808 * @condition evaluates to true. The @condition is checked each time 782 - * the waitqueue @wq is woken up. 809 + * the waitqueue @wq_head is woken up. 
783 810 * 784 811 * wake_up() has to be called after changing any variable that could 785 812 * change the result of the wait condition. ··· 788 815 * dropped before invoking the cmd and going to sleep and is reacquired 789 816 * afterwards. 790 817 */ 791 - #define wait_event_lock_irq_cmd(wq, condition, lock, cmd) \ 792 - do { \ 793 - if (condition) \ 794 - break; \ 795 - __wait_event_lock_irq(wq, condition, lock, cmd); \ 818 + #define wait_event_lock_irq_cmd(wq_head, condition, lock, cmd) \ 819 + do { \ 820 + if (condition) \ 821 + break; \ 822 + __wait_event_lock_irq(wq_head, condition, lock, cmd); \ 796 823 } while (0) 797 824 798 825 /** ··· 800 827 * condition is checked under the lock. This 801 828 * is expected to be called with the lock 802 829 * taken. 803 - * @wq: the waitqueue to wait on 830 + * @wq_head: the waitqueue to wait on 804 831 * @condition: a C expression for the event to wait for 805 832 * @lock: a locked spinlock_t, which will be released before schedule() 806 833 * and reacquired afterwards. 807 834 * 808 835 * The process is put to sleep (TASK_UNINTERRUPTIBLE) until the 809 836 * @condition evaluates to true. The @condition is checked each time 810 - * the waitqueue @wq is woken up. 837 + * the waitqueue @wq_head is woken up. 811 838 * 812 839 * wake_up() has to be called after changing any variable that could 813 840 * change the result of the wait condition. ··· 815 842 * This is supposed to be called while holding the lock. The lock is 816 843 * dropped before going to sleep and is reacquired afterwards. 
817 844 */ 818 - #define wait_event_lock_irq(wq, condition, lock) \ 819 - do { \ 820 - if (condition) \ 821 - break; \ 822 - __wait_event_lock_irq(wq, condition, lock, ); \ 845 + #define wait_event_lock_irq(wq_head, condition, lock) \ 846 + do { \ 847 + if (condition) \ 848 + break; \ 849 + __wait_event_lock_irq(wq_head, condition, lock, ); \ 823 850 } while (0) 824 851 825 852 826 - #define __wait_event_interruptible_lock_irq(wq, condition, lock, cmd) \ 827 - ___wait_event(wq, condition, TASK_INTERRUPTIBLE, 0, 0, \ 828 - spin_unlock_irq(&lock); \ 829 - cmd; \ 830 - schedule(); \ 853 + #define __wait_event_interruptible_lock_irq(wq_head, condition, lock, cmd) \ 854 + ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, \ 855 + spin_unlock_irq(&lock); \ 856 + cmd; \ 857 + schedule(); \ 831 858 spin_lock_irq(&lock)) 832 859 833 860 /** 834 861 * wait_event_interruptible_lock_irq_cmd - sleep until a condition gets true. 835 862 * The condition is checked under the lock. This is expected to 836 863 * be called with the lock taken. 837 - * @wq: the waitqueue to wait on 864 + * @wq_head: the waitqueue to wait on 838 865 * @condition: a C expression for the event to wait for 839 866 * @lock: a locked spinlock_t, which will be released before cmd and 840 867 * schedule() and reacquired afterwards. ··· 843 870 * 844 871 * The process is put to sleep (TASK_INTERRUPTIBLE) until the 845 872 * @condition evaluates to true or a signal is received. The @condition is 846 - * checked each time the waitqueue @wq is woken up. 873 + * checked each time the waitqueue @wq_head is woken up. 847 874 * 848 875 * wake_up() has to be called after changing any variable that could 849 876 * change the result of the wait condition. ··· 855 882 * The macro will return -ERESTARTSYS if it was interrupted by a signal 856 883 * and 0 if @condition evaluated to true. 
857 884 */ 858 - #define wait_event_interruptible_lock_irq_cmd(wq, condition, lock, cmd) \ 859 - ({ \ 860 - int __ret = 0; \ 861 - if (!(condition)) \ 862 - __ret = __wait_event_interruptible_lock_irq(wq, \ 863 - condition, lock, cmd); \ 864 - __ret; \ 885 + #define wait_event_interruptible_lock_irq_cmd(wq_head, condition, lock, cmd) \ 886 + ({ \ 887 + int __ret = 0; \ 888 + if (!(condition)) \ 889 + __ret = __wait_event_interruptible_lock_irq(wq_head, \ 890 + condition, lock, cmd); \ 891 + __ret; \ 865 892 }) 866 893 867 894 /** 868 895 * wait_event_interruptible_lock_irq - sleep until a condition gets true. 869 896 * The condition is checked under the lock. This is expected 870 897 * to be called with the lock taken. 871 - * @wq: the waitqueue to wait on 898 + * @wq_head: the waitqueue to wait on 872 899 * @condition: a C expression for the event to wait for 873 900 * @lock: a locked spinlock_t, which will be released before schedule() 874 901 * and reacquired afterwards. 875 902 * 876 903 * The process is put to sleep (TASK_INTERRUPTIBLE) until the 877 904 * @condition evaluates to true or signal is received. The @condition is 878 - * checked each time the waitqueue @wq is woken up. 905 + * checked each time the waitqueue @wq_head is woken up. 879 906 * 880 907 * wake_up() has to be called after changing any variable that could 881 908 * change the result of the wait condition. ··· 886 913 * The macro will return -ERESTARTSYS if it was interrupted by a signal 887 914 * and 0 if @condition evaluated to true. 
888 915 */ 889 - #define wait_event_interruptible_lock_irq(wq, condition, lock) \ 890 - ({ \ 891 - int __ret = 0; \ 892 - if (!(condition)) \ 893 - __ret = __wait_event_interruptible_lock_irq(wq, \ 894 - condition, lock,); \ 895 - __ret; \ 916 + #define wait_event_interruptible_lock_irq(wq_head, condition, lock) \ 917 + ({ \ 918 + int __ret = 0; \ 919 + if (!(condition)) \ 920 + __ret = __wait_event_interruptible_lock_irq(wq_head, \ 921 + condition, lock,); \ 922 + __ret; \ 896 923 }) 897 924 898 - #define __wait_event_interruptible_lock_irq_timeout(wq, condition, \ 899 - lock, timeout) \ 900 - ___wait_event(wq, ___wait_cond_timeout(condition), \ 901 - TASK_INTERRUPTIBLE, 0, timeout, \ 902 - spin_unlock_irq(&lock); \ 903 - __ret = schedule_timeout(__ret); \ 925 + #define __wait_event_interruptible_lock_irq_timeout(wq_head, condition, \ 926 + lock, timeout) \ 927 + ___wait_event(wq_head, ___wait_cond_timeout(condition), \ 928 + TASK_INTERRUPTIBLE, 0, timeout, \ 929 + spin_unlock_irq(&lock); \ 930 + __ret = schedule_timeout(__ret); \ 904 931 spin_lock_irq(&lock)); 905 932 906 933 /** 907 934 * wait_event_interruptible_lock_irq_timeout - sleep until a condition gets 908 935 * true or a timeout elapses. The condition is checked under 909 936 * the lock. This is expected to be called with the lock taken. 910 - * @wq: the waitqueue to wait on 937 + * @wq_head: the waitqueue to wait on 911 938 * @condition: a C expression for the event to wait for 912 939 * @lock: a locked spinlock_t, which will be released before schedule() 913 940 * and reacquired afterwards. ··· 915 942 * 916 943 * The process is put to sleep (TASK_INTERRUPTIBLE) until the 917 944 * @condition evaluates to true or signal is received. The @condition is 918 - * checked each time the waitqueue @wq is woken up. 945 + * checked each time the waitqueue @wq_head is woken up. 919 946 * 920 947 * wake_up() has to be called after changing any variable that could 921 948 * change the result of the wait condition. 
··· 927 954 * was interrupted by a signal, and the remaining jiffies otherwise 928 955 * if the condition evaluated to true before the timeout elapsed. 929 956 */ 930 - #define wait_event_interruptible_lock_irq_timeout(wq, condition, lock, \ 931 - timeout) \ 932 - ({ \ 933 - long __ret = timeout; \ 934 - if (!___wait_cond_timeout(condition)) \ 935 - __ret = __wait_event_interruptible_lock_irq_timeout( \ 936 - wq, condition, lock, timeout); \ 937 - __ret; \ 957 + #define wait_event_interruptible_lock_irq_timeout(wq_head, condition, lock, \ 958 + timeout) \ 959 + ({ \ 960 + long __ret = timeout; \ 961 + if (!___wait_cond_timeout(condition)) \ 962 + __ret = __wait_event_interruptible_lock_irq_timeout( \ 963 + wq_head, condition, lock, timeout); \ 964 + __ret; \ 938 965 }) 939 966 940 967 /* 941 968 * Waitqueues which are removed from the waitqueue_head at wakeup time 942 969 */ 943 - void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state); 944 - void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state); 945 - long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state); 946 - void finish_wait(wait_queue_head_t *q, wait_queue_t *wait); 947 - long wait_woken(wait_queue_t *wait, unsigned mode, long timeout); 948 - int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key); 949 - int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key); 950 - int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key); 970 + void prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state); 971 + void prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state); 972 + long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state); 973 + void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry); 974 + long 
wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout);
975 + int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
976 + int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
951 977
952 - #define DEFINE_WAIT_FUNC(name, function) \
953 - wait_queue_t name = { \
954 - .private = current, \
955 - .func = function, \
956 - .task_list = LIST_HEAD_INIT((name).task_list), \
978 + #define DEFINE_WAIT_FUNC(name, function) \
979 + struct wait_queue_entry name = { \
980 + .private = current, \
981 + .func = function, \
982 + .entry = LIST_HEAD_INIT((name).entry), \
957 983 }
958 984
959 985 #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)
960 986
961 - #define DEFINE_WAIT_BIT(name, word, bit) \
962 - struct wait_bit_queue name = { \
963 - .key = __WAIT_BIT_KEY_INITIALIZER(word, bit), \
964 - .wait = { \
965 - .private = current, \
966 - .func = wake_bit_function, \
967 - .task_list = \
968 - LIST_HEAD_INIT((name).wait.task_list), \
969 - }, \
970 - }
971 -
972 - #define init_wait(wait) \
973 - do { \
974 - (wait)->private = current; \
975 - (wait)->func = autoremove_wake_function; \
976 - INIT_LIST_HEAD(&(wait)->task_list); \
977 - (wait)->flags = 0; \
987 + #define init_wait(wait) \
988 + do { \
989 + (wait)->private = current; \
990 + (wait)->func = autoremove_wake_function; \
991 + INIT_LIST_HEAD(&(wait)->entry); \
992 + (wait)->flags = 0; \
978 993 } while (0)
979 -
980 -
981 - extern int bit_wait(struct wait_bit_key *, int);
982 - extern int bit_wait_io(struct wait_bit_key *, int);
983 - extern int bit_wait_timeout(struct wait_bit_key *, int);
984 - extern int bit_wait_io_timeout(struct wait_bit_key *, int);
985 -
986 - /**
987 - * wait_on_bit - wait for a bit to be cleared
988 - * @word: the word being waited on, a kernel virtual address
989 - * @bit: the bit of the word being waited on
990 - * @mode: the task state to sleep in
991 - *
992 - * There is a standard hashed waitqueue table for generic use. This
993 - * is the part of the hashtable's accessor API that waits on a bit.
994 - * For instance, if one were to have waiters on a bitflag, one would
995 - * call wait_on_bit() in threads waiting for the bit to clear.
996 - * One uses wait_on_bit() where one is waiting for the bit to clear,
997 - * but has no intention of setting it.
998 - * Returned value will be zero if the bit was cleared, or non-zero
999 - * if the process received a signal and the mode permitted wakeup
1000 - * on that signal.
1001 - */
1002 - static inline int
1003 - wait_on_bit(unsigned long *word, int bit, unsigned mode)
1004 - {
1005 - might_sleep();
1006 - if (!test_bit(bit, word))
1007 - return 0;
1008 - return out_of_line_wait_on_bit(word, bit,
1009 - bit_wait,
1010 - mode);
1011 - }
1012 -
1013 - /**
1014 - * wait_on_bit_io - wait for a bit to be cleared
1015 - * @word: the word being waited on, a kernel virtual address
1016 - * @bit: the bit of the word being waited on
1017 - * @mode: the task state to sleep in
1018 - *
1019 - * Use the standard hashed waitqueue table to wait for a bit
1020 - * to be cleared. This is similar to wait_on_bit(), but calls
1021 - * io_schedule() instead of schedule() for the actual waiting.
1022 - *
1023 - * Returned value will be zero if the bit was cleared, or non-zero
1024 - * if the process received a signal and the mode permitted wakeup
1025 - * on that signal.
1026 - */
1027 - static inline int
1028 - wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
1029 - {
1030 - might_sleep();
1031 - if (!test_bit(bit, word))
1032 - return 0;
1033 - return out_of_line_wait_on_bit(word, bit,
1034 - bit_wait_io,
1035 - mode);
1036 - }
1037 -
1038 - /**
1039 - * wait_on_bit_timeout - wait for a bit to be cleared or a timeout elapses
1040 - * @word: the word being waited on, a kernel virtual address
1041 - * @bit: the bit of the word being waited on
1042 - * @mode: the task state to sleep in
1043 - * @timeout: timeout, in jiffies
1044 - *
1045 - * Use the standard hashed waitqueue table to wait for a bit
1046 - * to be cleared. This is similar to wait_on_bit(), except also takes a
1047 - * timeout parameter.
1048 - *
1049 - * Returned value will be zero if the bit was cleared before the
1050 - * @timeout elapsed, or non-zero if the @timeout elapsed or process
1051 - * received a signal and the mode permitted wakeup on that signal.
1052 - */
1053 - static inline int
1054 - wait_on_bit_timeout(unsigned long *word, int bit, unsigned mode,
1055 - unsigned long timeout)
1056 - {
1057 - might_sleep();
1058 - if (!test_bit(bit, word))
1059 - return 0;
1060 - return out_of_line_wait_on_bit_timeout(word, bit,
1061 - bit_wait_timeout,
1062 - mode, timeout);
1063 - }
1064 -
1065 - /**
1066 - * wait_on_bit_action - wait for a bit to be cleared
1067 - * @word: the word being waited on, a kernel virtual address
1068 - * @bit: the bit of the word being waited on
1069 - * @action: the function used to sleep, which may take special actions
1070 - * @mode: the task state to sleep in
1071 - *
1072 - * Use the standard hashed waitqueue table to wait for a bit
1073 - * to be cleared, and allow the waiting action to be specified.
1074 - * This is like wait_on_bit() but allows fine control of how the waiting
1075 - * is done.
1076 - *
1077 - * Returned value will be zero if the bit was cleared, or non-zero
1078 - * if the process received a signal and the mode permitted wakeup
1079 - * on that signal.
1080 - */
1081 - static inline int
1082 - wait_on_bit_action(unsigned long *word, int bit, wait_bit_action_f *action,
1083 - unsigned mode)
1084 - {
1085 - might_sleep();
1086 - if (!test_bit(bit, word))
1087 - return 0;
1088 - return out_of_line_wait_on_bit(word, bit, action, mode);
1089 - }
1090 -
1091 - /**
1092 - * wait_on_bit_lock - wait for a bit to be cleared, when wanting to set it
1093 - * @word: the word being waited on, a kernel virtual address
1094 - * @bit: the bit of the word being waited on
1095 - * @mode: the task state to sleep in
1096 - *
1097 - * There is a standard hashed waitqueue table for generic use. This
1098 - * is the part of the hashtable's accessor API that waits on a bit
1099 - * when one intends to set it, for instance, trying to lock bitflags.
1100 - * For instance, if one were to have waiters trying to set bitflag
1101 - * and waiting for it to clear before setting it, one would call
1102 - * wait_on_bit() in threads waiting to be able to set the bit.
1103 - * One uses wait_on_bit_lock() where one is waiting for the bit to
1104 - * clear with the intention of setting it, and when done, clearing it.
1105 - *
1106 - * Returns zero if the bit was (eventually) found to be clear and was
1107 - * set. Returns non-zero if a signal was delivered to the process and
1108 - * the @mode allows that signal to wake the process.
1109 - */
1110 - static inline int
1111 - wait_on_bit_lock(unsigned long *word, int bit, unsigned mode)
1112 - {
1113 - might_sleep();
1114 - if (!test_and_set_bit(bit, word))
1115 - return 0;
1116 - return out_of_line_wait_on_bit_lock(word, bit, bit_wait, mode);
1117 - }
1118 -
1119 - /**
1120 - * wait_on_bit_lock_io - wait for a bit to be cleared, when wanting to set it
1121 - * @word: the word being waited on, a kernel virtual address
1122 - * @bit: the bit of the word being waited on
1123 - * @mode: the task state to sleep in
1124 - *
1125 - * Use the standard hashed waitqueue table to wait for a bit
1126 - * to be cleared and then to atomically set it. This is similar
1127 - * to wait_on_bit(), but calls io_schedule() instead of schedule()
1128 - * for the actual waiting.
1129 - *
1130 - * Returns zero if the bit was (eventually) found to be clear and was
1131 - * set. Returns non-zero if a signal was delivered to the process and
1132 - * the @mode allows that signal to wake the process.
1133 - */
1134 - static inline int
1135 - wait_on_bit_lock_io(unsigned long *word, int bit, unsigned mode)
1136 - {
1137 - might_sleep();
1138 - if (!test_and_set_bit(bit, word))
1139 - return 0;
1140 - return out_of_line_wait_on_bit_lock(word, bit, bit_wait_io, mode);
1141 - }
1142 -
1143 - /**
1144 - * wait_on_bit_lock_action - wait for a bit to be cleared, when wanting to set it
1145 - * @word: the word being waited on, a kernel virtual address
1146 - * @bit: the bit of the word being waited on
1147 - * @action: the function used to sleep, which may take special actions
1148 - * @mode: the task state to sleep in
1149 - *
1150 - * Use the standard hashed waitqueue table to wait for a bit
1151 - * to be cleared and then to set it, and allow the waiting action
1152 - * to be specified.
1153 - * This is like wait_on_bit() but allows fine control of how the waiting
1154 - * is done.
1155 - *
1156 - * Returns zero if the bit was (eventually) found to be clear and was
1157 - * set. Returns non-zero if a signal was delivered to the process and
1158 - * the @mode allows that signal to wake the process.
1159 - */
1160 - static inline int
1161 - wait_on_bit_lock_action(unsigned long *word, int bit, wait_bit_action_f *action,
1162 - unsigned mode)
1163 - {
1164 - might_sleep();
1165 - if (!test_and_set_bit(bit, word))
1166 - return 0;
1167 - return out_of_line_wait_on_bit_lock(word, bit, action, mode);
1168 - }
1169 -
1170 - /**
1171 - * wait_on_atomic_t - Wait for an atomic_t to become 0
1172 - * @val: The atomic value being waited on, a kernel virtual address
1173 - * @action: the function used to sleep, which may take special actions
1174 - * @mode: the task state to sleep in
1175 - *
1176 - * Wait for an atomic_t to become 0. We abuse the bit-wait waitqueue table for
1177 - * the purpose of getting a waitqueue, but we set the key to a bit number
1178 - * outside of the target 'word'.
1179 - */
1180 - static inline
1181 - int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
1182 - {
1183 - might_sleep();
1184 - if (atomic_read(val) == 0)
1185 - return 0;
1186 - return out_of_line_wait_on_atomic_t(val, action, mode);
1187 - }
1188 994
1189 995 #endif /* _LINUX_WAIT_H */
+261
include/linux/wait_bit.h
··· 1 + #ifndef _LINUX_WAIT_BIT_H
2 + #define _LINUX_WAIT_BIT_H
3 +
4 + /*
5 + * Linux wait-bit related types and methods:
6 + */
7 + #include <linux/wait.h>
8 +
9 + struct wait_bit_key {
10 + void *flags;
11 + int bit_nr;
12 + #define WAIT_ATOMIC_T_BIT_NR -1
13 + unsigned long timeout;
14 + };
15 +
16 + struct wait_bit_queue_entry {
17 + struct wait_bit_key key;
18 + struct wait_queue_entry wq_entry;
19 + };
20 +
21 + #define __WAIT_BIT_KEY_INITIALIZER(word, bit) \
22 + { .flags = word, .bit_nr = bit, }
23 +
24 + #define __WAIT_ATOMIC_T_KEY_INITIALIZER(p) \
25 + { .flags = p, .bit_nr = WAIT_ATOMIC_T_BIT_NR, }
26 +
27 + typedef int wait_bit_action_f(struct wait_bit_key *key, int mode);
28 + void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit);
29 + int __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
30 + int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
31 + void wake_up_bit(void *word, int bit);
32 + void wake_up_atomic_t(atomic_t *p);
33 + int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, unsigned int mode);
34 + int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f *action, unsigned int mode, unsigned long timeout);
35 + int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, unsigned int mode);
36 + int out_of_line_wait_on_atomic_t(atomic_t *p, int (*)(atomic_t *), unsigned int mode);
37 + struct wait_queue_head *bit_waitqueue(void *word, int bit);
38 + extern void __init wait_bit_init(void);
39 +
40 + int wake_bit_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key);
41 +
42 + #define DEFINE_WAIT_BIT(name, word, bit) \
43 + struct wait_bit_queue_entry name = { \
44 + .key = __WAIT_BIT_KEY_INITIALIZER(word, bit), \
45 + .wq_entry = { \
46 + .private = current, \
47 + .func = wake_bit_function, \
48 + .entry = \
49 + LIST_HEAD_INIT((name).wq_entry.entry), \
50 + }, \
51 + }
52 +
53 + extern int bit_wait(struct wait_bit_key *key, int bit);
54 + extern int bit_wait_io(struct wait_bit_key *key, int bit);
55 + extern int bit_wait_timeout(struct wait_bit_key *key, int bit);
56 + extern int bit_wait_io_timeout(struct wait_bit_key *key, int bit);
57 +
58 + /**
59 + * wait_on_bit - wait for a bit to be cleared
60 + * @word: the word being waited on, a kernel virtual address
61 + * @bit: the bit of the word being waited on
62 + * @mode: the task state to sleep in
63 + *
64 + * There is a standard hashed waitqueue table for generic use. This
65 + * is the part of the hashtable's accessor API that waits on a bit.
66 + * For instance, if one were to have waiters on a bitflag, one would
67 + * call wait_on_bit() in threads waiting for the bit to clear.
68 + * One uses wait_on_bit() where one is waiting for the bit to clear,
69 + * but has no intention of setting it.
70 + * Returned value will be zero if the bit was cleared, or non-zero
71 + * if the process received a signal and the mode permitted wakeup
72 + * on that signal.
73 + */
74 + static inline int
75 + wait_on_bit(unsigned long *word, int bit, unsigned mode)
76 + {
77 + might_sleep();
78 + if (!test_bit(bit, word))
79 + return 0;
80 + return out_of_line_wait_on_bit(word, bit,
81 + bit_wait,
82 + mode);
83 + }
84 +
85 + /**
86 + * wait_on_bit_io - wait for a bit to be cleared
87 + * @word: the word being waited on, a kernel virtual address
88 + * @bit: the bit of the word being waited on
89 + * @mode: the task state to sleep in
90 + *
91 + * Use the standard hashed waitqueue table to wait for a bit
92 + * to be cleared. This is similar to wait_on_bit(), but calls
93 + * io_schedule() instead of schedule() for the actual waiting.
94 + *
95 + * Returned value will be zero if the bit was cleared, or non-zero
96 + * if the process received a signal and the mode permitted wakeup
97 + * on that signal.
98 + */
99 + static inline int
100 + wait_on_bit_io(unsigned long *word, int bit, unsigned mode)
101 + {
102 + might_sleep();
103 + if (!test_bit(bit, word))
104 + return 0;
105 + return out_of_line_wait_on_bit(word, bit,
106 + bit_wait_io,
107 + mode);
108 + }
109 +
110 + /**
111 + * wait_on_bit_timeout - wait for a bit to be cleared or a timeout elapses
112 + * @word: the word being waited on, a kernel virtual address
113 + * @bit: the bit of the word being waited on
114 + * @mode: the task state to sleep in
115 + * @timeout: timeout, in jiffies
116 + *
117 + * Use the standard hashed waitqueue table to wait for a bit
118 + * to be cleared. This is similar to wait_on_bit(), except also takes a
119 + * timeout parameter.
120 + *
121 + * Returned value will be zero if the bit was cleared before the
122 + * @timeout elapsed, or non-zero if the @timeout elapsed or process
123 + * received a signal and the mode permitted wakeup on that signal.
124 + */
125 + static inline int
126 + wait_on_bit_timeout(unsigned long *word, int bit, unsigned mode,
127 + unsigned long timeout)
128 + {
129 + might_sleep();
130 + if (!test_bit(bit, word))
131 + return 0;
132 + return out_of_line_wait_on_bit_timeout(word, bit,
133 + bit_wait_timeout,
134 + mode, timeout);
135 + }
136 +
137 + /**
138 + * wait_on_bit_action - wait for a bit to be cleared
139 + * @word: the word being waited on, a kernel virtual address
140 + * @bit: the bit of the word being waited on
141 + * @action: the function used to sleep, which may take special actions
142 + * @mode: the task state to sleep in
143 + *
144 + * Use the standard hashed waitqueue table to wait for a bit
145 + * to be cleared, and allow the waiting action to be specified.
146 + * This is like wait_on_bit() but allows fine control of how the waiting
147 + * is done.
148 + *
149 + * Returned value will be zero if the bit was cleared, or non-zero
150 + * if the process received a signal and the mode permitted wakeup
151 + * on that signal.
152 + */
153 + static inline int
154 + wait_on_bit_action(unsigned long *word, int bit, wait_bit_action_f *action,
155 + unsigned mode)
156 + {
157 + might_sleep();
158 + if (!test_bit(bit, word))
159 + return 0;
160 + return out_of_line_wait_on_bit(word, bit, action, mode);
161 + }
162 +
163 + /**
164 + * wait_on_bit_lock - wait for a bit to be cleared, when wanting to set it
165 + * @word: the word being waited on, a kernel virtual address
166 + * @bit: the bit of the word being waited on
167 + * @mode: the task state to sleep in
168 + *
169 + * There is a standard hashed waitqueue table for generic use. This
170 + * is the part of the hashtable's accessor API that waits on a bit
171 + * when one intends to set it, for instance, trying to lock bitflags.
172 + * For instance, if one were to have waiters trying to set bitflag
173 + * and waiting for it to clear before setting it, one would call
174 + * wait_on_bit() in threads waiting to be able to set the bit.
175 + * One uses wait_on_bit_lock() where one is waiting for the bit to
176 + * clear with the intention of setting it, and when done, clearing it.
177 + *
178 + * Returns zero if the bit was (eventually) found to be clear and was
179 + * set. Returns non-zero if a signal was delivered to the process and
180 + * the @mode allows that signal to wake the process.
181 + */
182 + static inline int
183 + wait_on_bit_lock(unsigned long *word, int bit, unsigned mode)
184 + {
185 + might_sleep();
186 + if (!test_and_set_bit(bit, word))
187 + return 0;
188 + return out_of_line_wait_on_bit_lock(word, bit, bit_wait, mode);
189 + }
190 +
191 + /**
192 + * wait_on_bit_lock_io - wait for a bit to be cleared, when wanting to set it
193 + * @word: the word being waited on, a kernel virtual address
194 + * @bit: the bit of the word being waited on
195 + * @mode: the task state to sleep in
196 + *
197 + * Use the standard hashed waitqueue table to wait for a bit
198 + * to be cleared and then to atomically set it. This is similar
199 + * to wait_on_bit(), but calls io_schedule() instead of schedule()
200 + * for the actual waiting.
201 + *
202 + * Returns zero if the bit was (eventually) found to be clear and was
203 + * set. Returns non-zero if a signal was delivered to the process and
204 + * the @mode allows that signal to wake the process.
205 + */
206 + static inline int
207 + wait_on_bit_lock_io(unsigned long *word, int bit, unsigned mode)
208 + {
209 + might_sleep();
210 + if (!test_and_set_bit(bit, word))
211 + return 0;
212 + return out_of_line_wait_on_bit_lock(word, bit, bit_wait_io, mode);
213 + }
214 +
215 + /**
216 + * wait_on_bit_lock_action - wait for a bit to be cleared, when wanting to set it
217 + * @word: the word being waited on, a kernel virtual address
218 + * @bit: the bit of the word being waited on
219 + * @action: the function used to sleep, which may take special actions
220 + * @mode: the task state to sleep in
221 + *
222 + * Use the standard hashed waitqueue table to wait for a bit
223 + * to be cleared and then to set it, and allow the waiting action
224 + * to be specified.
225 + * This is like wait_on_bit() but allows fine control of how the waiting
226 + * is done.
227 + *
228 + * Returns zero if the bit was (eventually) found to be clear and was
229 + * set. Returns non-zero if a signal was delivered to the process and
230 + * the @mode allows that signal to wake the process.
231 + */
232 + static inline int
233 + wait_on_bit_lock_action(unsigned long *word, int bit, wait_bit_action_f *action,
234 + unsigned mode)
235 + {
236 + might_sleep();
237 + if (!test_and_set_bit(bit, word))
238 + return 0;
239 + return out_of_line_wait_on_bit_lock(word, bit, action, mode);
240 + }
241 +
242 + /**
243 + * wait_on_atomic_t - Wait for an atomic_t to become 0
244 + * @val: The atomic value being waited on, a kernel virtual address
245 + * @action: the function used to sleep, which may take special actions
246 + * @mode: the task state to sleep in
247 + *
248 + * Wait for an atomic_t to become 0. We abuse the bit-wait waitqueue table for
249 + * the purpose of getting a waitqueue, but we set the key to a bit number
250 + * outside of the target 'word'.
251 + */
252 + static inline
253 + int wait_on_atomic_t(atomic_t *val, int (*action)(atomic_t *), unsigned mode)
254 + {
255 + might_sleep();
256 + if (atomic_read(val) == 0)
257 + return 0;
258 + return out_of_line_wait_on_atomic_t(val, action, mode);
259 + }
260 +
261 + #endif /* _LINUX_WAIT_BIT_H */
+1 -1
include/net/af_unix.h
··· 62 62 #define UNIX_GC_CANDIDATE 0
63 63 #define UNIX_GC_MAYBE_CYCLE 1
64 64 struct socket_wq peer_wq;
65 - wait_queue_t peer_wake;
65 + wait_queue_entry_t peer_wake;
66 66 };
67 67
68 68 static inline struct unix_sock *unix_sk(const struct sock *sk)
+2 -2
include/uapi/linux/auto_fs.h
··· 26 26 #define AUTOFS_MIN_PROTO_VERSION AUTOFS_PROTO_VERSION
27 27
28 28 /*
29 - * The wait_queue_token (autofs_wqt_t) is part of a structure which is passed
29 + * The wait_queue_entry_token (autofs_wqt_t) is part of a structure which is passed
30 30 * back to the kernel via ioctl from userspace. On architectures where 32- and
31 31 * 64-bit userspace binaries can be executed it's important that the size of
32 32 * autofs_wqt_t stays constant between 32- and 64-bit Linux kernels so that we
··· 49 49
50 50 struct autofs_packet_missing {
51 51 struct autofs_packet_hdr hdr;
52 - autofs_wqt_t wait_queue_token;
52 + autofs_wqt_t wait_queue_entry_token;
53 53 int len;
54 54 char name[NAME_MAX+1];
55 55 };
+2 -2
include/uapi/linux/auto_fs4.h
··· 108 108 /* v4 multi expire (via pipe) */
109 109 struct autofs_packet_expire_multi {
110 110 struct autofs_packet_hdr hdr;
111 - autofs_wqt_t wait_queue_token;
111 + autofs_wqt_t wait_queue_entry_token;
112 112 int len;
113 113 char name[NAME_MAX+1];
114 114 };
··· 123 123 /* autofs v5 common packet struct */
124 124 struct autofs_v5_packet {
125 125 struct autofs_packet_hdr hdr;
126 - autofs_wqt_t wait_queue_token;
126 + autofs_wqt_t wait_queue_entry_token;
127 127 __u32 dev;
128 128 __u64 ino;
129 129 __u32 uid;
+1
include/uapi/linux/sched.h
··· 47 47 * For the sched_{set,get}attr() calls
48 48 */
49 49 #define SCHED_FLAG_RESET_ON_FORK 0x01
50 + #define SCHED_FLAG_RECLAIM 0x02
50 51
51 52 #endif /* _UAPI_LINUX_SCHED_H */
+1
init/Kconfig
··· 809 809
810 810 config CPUSETS
811 811 bool "Cpuset controller"
812 + depends on SMP
812 813 help
813 814 This option will let you create and manage CPUSETs which
814 815 allow dynamically partitioning a system into sets of CPUs and
+22 -5
init/main.c
··· 389 389
390 390 static noinline void __ref rest_init(void)
391 391 {
392 + struct task_struct *tsk;
392 393 int pid;
393 394
394 395 rcu_scheduler_starting();
··· 398 397 * the init task will end up wanting to create kthreads, which, if
399 398 * we schedule it before we create kthreadd, will OOPS.
400 399 */
401 - kernel_thread(kernel_init, NULL, CLONE_FS);
400 + pid = kernel_thread(kernel_init, NULL, CLONE_FS);
401 + /*
402 + * Pin init on the boot CPU. Task migration is not properly working
403 + * until sched_init_smp() has been run. It will set the allowed
404 + * CPUs for init to the non isolated CPUs.
405 + */
406 + rcu_read_lock();
407 + tsk = find_task_by_pid_ns(pid, &init_pid_ns);
408 + set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
409 + rcu_read_unlock();
410 +
402 411 numa_default_policy();
403 412 pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
404 413 rcu_read_lock();
405 414 kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
406 415 rcu_read_unlock();
416 +
417 + /*
418 + * Enable might_sleep() and smp_processor_id() checks.
419 + * They cannot be enabled earlier because with CONFIG_PREEMPT=y
420 + * kernel_thread() would trigger might_sleep() splats. With
421 + * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
422 + * already, but it's stuck on the kthreadd_done completion.
423 + */
424 + system_state = SYSTEM_SCHEDULING;
425 +
407 426 complete(&kthreadd_done);
408 427
409 428 /*
··· 1036 1015 * init can allocate pages on any node
1037 1016 */
1038 1017 set_mems_allowed(node_states[N_MEMORY]);
1039 - /*
1040 - * init can run on any cpu.
1041 - */
1042 - set_cpus_allowed_ptr(current, cpu_all_mask);
1043 1018
1044 1019 cad_pid = task_pid(current);
1045 1020
+4 -4
kernel/async.c
··· 114 114 ktime_t uninitialized_var(calltime), delta, rettime;
115 115
116 116 /* 1) run (and print duration) */
117 - if (initcall_debug && system_state == SYSTEM_BOOTING) {
117 + if (initcall_debug && system_state < SYSTEM_RUNNING) {
118 118 pr_debug("calling %lli_%pF @ %i\n",
119 119 (long long)entry->cookie,
120 120 entry->func, task_pid_nr(current));
121 121 calltime = ktime_get();
122 122 }
123 123 entry->func(entry->data, entry->cookie);
124 - if (initcall_debug && system_state == SYSTEM_BOOTING) {
124 + if (initcall_debug && system_state < SYSTEM_RUNNING) {
125 125 rettime = ktime_get();
126 126 delta = ktime_sub(rettime, calltime);
127 127 pr_debug("initcall %lli_%pF returned 0 after %lld usecs\n",
··· 284 284 {
285 285 ktime_t uninitialized_var(starttime), delta, endtime;
286 286
287 - if (initcall_debug && system_state == SYSTEM_BOOTING) {
287 + if (initcall_debug && system_state < SYSTEM_RUNNING) {
288 288 pr_debug("async_waiting @ %i\n", task_pid_nr(current));
289 289 starttime = ktime_get();
290 290 }
291 291
292 292 wait_event(async_done, lowest_in_progress(domain) >= cookie);
293 293
294 - if (initcall_debug && system_state == SYSTEM_BOOTING) {
294 + if (initcall_debug && system_state < SYSTEM_RUNNING) {
295 295 endtime = ktime_get();
296 296 delta = ktime_sub(endtime, starttime);
297 297
+2 -15
kernel/exit.c
··· 318 318 rcu_read_unlock();
319 319 }
320 320
321 - struct task_struct *try_get_task_struct(struct task_struct **ptask)
322 - {
323 - struct task_struct *task;
324 -
325 - rcu_read_lock();
326 - task = task_rcu_dereference(ptask);
327 - if (task)
328 - get_task_struct(task);
329 - rcu_read_unlock();
330 -
331 - return task;
332 - }
333 -
334 321 /*
335 322 * Determine if a process group is "orphaned", according to the POSIX
336 323 * definition in 2.2.2.52. Orphaned process groups are not to be affected
··· 991 1004 int __user *wo_stat;
992 1005 struct rusage __user *wo_rusage;
993 1006
994 - wait_queue_t child_wait;
1007 + wait_queue_entry_t child_wait;
995 1008 int notask_error;
996 1009 };
··· 1528 1541 return 0;
1529 1542 }
1530 1543
1531 - static int child_wait_callback(wait_queue_t *wait, unsigned mode,
1544 + static int child_wait_callback(wait_queue_entry_t *wait, unsigned mode,
1532 1545 int sync, void *key)
1533 1546 {
1534 1547 struct wait_opts *wo = container_of(wait, struct wait_opts,
+1 -1
kernel/extable.c
··· 75 75 addr < (unsigned long)_etext)
76 76 return 1;
77 77
78 - if (system_state == SYSTEM_BOOTING &&
78 + if (system_state < SYSTEM_RUNNING &&
79 79 init_kernel_text(addr))
80 80 return 1;
81 81 return 0;
+1 -1
kernel/futex.c
··· 225 225 * @requeue_pi_key: the requeue_pi target futex key
226 226 * @bitset: bitset for the optional bitmasked wakeup
227 227 *
228 - * We use this hashed waitqueue, instead of a normal wait_queue_t, so
228 + * We use this hashed waitqueue, instead of a normal wait_queue_entry_t, so
229 229 * we can wake only the relevant ones (hashed queues may be shared).
230 230 *
231 231 * A futex_q has a woken state, just like tasks have TASK_RUNNING.
+1 -1
kernel/printk/printk.c
··· 1175 1175 unsigned long long k;
1176 1176 unsigned long timeout;
1177 1177
1178 - if ((boot_delay == 0 || system_state != SYSTEM_BOOTING)
1178 + if ((boot_delay == 0 || system_state >= SYSTEM_RUNNING)
1179 1179 || suppress_message_printing(level)) {
1180 1180 return;
1181 1181 }
+3 -3
kernel/sched/Makefile
··· 16 16 endif
17 17
18 18 obj-y += core.o loadavg.o clock.o cputime.o
19 - obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
20 - obj-y += wait.o swait.o completion.o idle.o
21 - obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o
19 + obj-y += idle_task.o fair.o rt.o deadline.o
20 + obj-y += wait.o wait_bit.o swait.o completion.o idle.o
21 + obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
22 22 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
23 23 obj-$(CONFIG_SCHEDSTATS) += stats.o
24 24 obj-$(CONFIG_SCHED_DEBUG) += debug.o
+94 -36
kernel/sched/clock.c
··· 64 64 #include <linux/workqueue.h>
65 65 #include <linux/compiler.h>
66 66 #include <linux/tick.h>
67 + #include <linux/init.h>
67 68
68 69 /*
69 70 * Scheduler clock - returns current time in nanosec units.
··· 125 124 return static_branch_likely(&__sched_clock_stable);
126 125 }
127 126
127 + static void __scd_stamp(struct sched_clock_data *scd)
128 + {
129 + scd->tick_gtod = ktime_get_ns();
130 + scd->tick_raw = sched_clock();
131 + }
132 +
128 133 static void __set_sched_clock_stable(void)
129 134 {
130 - struct sched_clock_data *scd = this_scd();
135 + struct sched_clock_data *scd;
131 136
137 + /*
138 + * Since we're still unstable and the tick is already running, we have
139 + * to disable IRQs in order to get a consistent scd->tick* reading.
140 + */
141 + local_irq_disable();
142 + scd = this_scd();
132 143 /*
133 144 * Attempt to make the (initial) unstable->stable transition continuous.
134 145 */
135 146 __sched_clock_offset = (scd->tick_gtod + __gtod_offset) - (scd->tick_raw);
147 + local_irq_enable();
136 148
137 149 printk(KERN_INFO "sched_clock: Marking stable (%lld, %lld)->(%lld, %lld)\n",
138 150 scd->tick_gtod, __gtod_offset,
··· 155 141 tick_dep_clear(TICK_DEP_BIT_CLOCK_UNSTABLE);
156 142 }
157 143
144 + /*
145 + * If we ever get here, we're screwed, because we found out -- typically after
146 + * the fact -- that TSC wasn't good. This means all our clocksources (including
147 + * ktime) could have reported wrong values.
148 + *
149 + * What we do here is an attempt to fix up and continue sort of where we left
150 + * off in a coherent manner.
151 + *
152 + * The only way to fully avoid random clock jumps is to boot with:
153 + * "tsc=unstable".
154 + */
158 155 static void __sched_clock_work(struct work_struct *work)
159 156 {
157 + struct sched_clock_data *scd;
158 + int cpu;
159 +
160 + /* take a current timestamp and set 'now' */
161 + preempt_disable();
162 + scd = this_scd();
163 + __scd_stamp(scd);
164 + scd->clock = scd->tick_gtod + __gtod_offset;
165 + preempt_enable();
166 +
167 + /* clone to all CPUs */
168 + for_each_possible_cpu(cpu)
169 + per_cpu(sched_clock_data, cpu) = *scd;
170 +
171 + printk(KERN_WARNING "TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.\n");
172 + printk(KERN_INFO "sched_clock: Marking unstable (%lld, %lld)<-(%lld, %lld)\n",
173 + scd->tick_gtod, __gtod_offset,
174 + scd->tick_raw, __sched_clock_offset);
175 +
160 176 static_branch_disable(&__sched_clock_stable);
161 177 }
162 178
··· 194 150
195 151 static void __clear_sched_clock_stable(void)
196 152 {
197 - struct sched_clock_data *scd = this_scd();
198 -
199 - /*
200 - * Attempt to make the stable->unstable transition continuous.
201 - *
202 - * Trouble is, this is typically called from the TSC watchdog
203 - * timer, which is late per definition. This means the tick
204 - * values can already be screwy.
205 - *
206 - * Still do what we can.
207 - */
208 - __gtod_offset = (scd->tick_raw + __sched_clock_offset) - (scd->tick_gtod);
209 -
210 - printk(KERN_INFO "sched_clock: Marking unstable (%lld, %lld)<-(%lld, %lld)\n",
211 - scd->tick_gtod, __gtod_offset,
212 - scd->tick_raw, __sched_clock_offset);
153 + if (!sched_clock_stable())
154 + return;
213 155
214 156 tick_dep_set(TICK_DEP_BIT_CLOCK_UNSTABLE);
215 -
216 - if (sched_clock_stable())
217 - schedule_work(&sched_clock_work);
157 + schedule_work(&sched_clock_work);
218 158 }
219 159
220 160 void clear_sched_clock_stable(void)
··· 211 183 __clear_sched_clock_stable();
212 184 }
213 185
214 - void sched_clock_init_late(void)
186 + /*
187 + * We run this as late_initcall() such that it runs after all built-in drivers,
188 + * notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
189 + */
190 + static int __init sched_clock_init_late(void)
215 191 {
216 192 sched_clock_running = 2;
217 193 /*
··· 229 197
230 198 if (__sched_clock_stable_early)
231 199 __set_sched_clock_stable();
200 +
201 + return 0;
232 202 }
203 + late_initcall(sched_clock_init_late);
233 204
234 205 /*
235 206 * min, max except they take wrapping into account
··· 382 347 {
383 348 struct sched_clock_data *scd;
384 349
350 + if (sched_clock_stable())
351 + return;
352 +
353 + if (unlikely(!sched_clock_running))
354 + return;
355 +
385 356 WARN_ON_ONCE(!irqs_disabled());
386 357
387 - /*
388 - * Update these values even if sched_clock_stable(), because it can
389 - * become unstable at any point in time at which point we need some
390 - * values to fall back on.
391 - *
392 - * XXX arguably we can skip this if we expose tsc_clocksource_reliable
393 - */
394 358 scd = this_scd();
395 - scd->tick_raw = sched_clock();
396 - scd->tick_gtod = ktime_get_ns();
359 + __scd_stamp(scd);
360 + sched_clock_local(scd);
361 + }
397 362
398 - if (!sched_clock_stable() && likely(sched_clock_running))
399 - sched_clock_local(scd);
363 + void sched_clock_tick_stable(void)
364 + {
365 + u64 gtod, clock;
366 +
367 + if (!sched_clock_stable())
368 + return;
369 +
370 + /*
371 + * Called under watchdog_lock.
372 + *
373 + * The watchdog just found this TSC to (still) be stable, so now is a
374 + * good moment to update our __gtod_offset. Because once we find the
375 + * TSC to be unstable, any computation will be computing crap.
376 + */
377 + local_irq_disable();
378 + gtod = ktime_get_ns();
379 + clock = sched_clock();
380 + __gtod_offset = (clock + __sched_clock_offset) - gtod;
381 + local_irq_enable();
400 382 }
401 383
402 384 /*
··· 426 374 EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
427 375
428 376 /*
429 - * We just idled delta nanoseconds (called with irqs disabled):
377 + * We just idled; resync with ktime.
430 378 */
431 - void sched_clock_idle_wakeup_event(u64 delta_ns)
379 + void sched_clock_idle_wakeup_event(void)
432 380 {
433 - if (timekeeping_suspended)
381 + unsigned long flags;
382 +
383 + if (sched_clock_stable())
434 384 return;
435 385
386 + if (unlikely(timekeeping_suspended))
387 + return;
388 +
389 + local_irq_save(flags);
436 390 sched_clock_tick();
437 - touch_softlockup_watchdog_sched();
391 + local_irq_restore(flags);
438 392 }
439 393 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
440 394
+1 -1
kernel/sched/completion.c
··· 66 66 if (!x->done) {
67 67 DECLARE_WAITQUEUE(wait, current);
68 68
69 - __add_wait_queue_tail_exclusive(&x->wait, &wait);
69 + __add_wait_queue_entry_tail_exclusive(&x->wait, &wait);
70 70 do {
71 71 if (signal_pending_state(state, current)) {
72 72 timeout = -ERESTARTSYS;
+55 -717
kernel/sched/core.c
··· 10 10 #include <uapi/linux/sched/types.h>
11 11 #include <linux/sched/loadavg.h>
12 12 #include <linux/sched/hotplug.h>
13 + #include <linux/wait_bit.h>
13 14 #include <linux/cpuset.h>
14 15 #include <linux/delayacct.h>
15 16 #include <linux/init_task.h>
··· 789 788 dequeue_task(rq, p, flags);
790 789 }
791 790
792 - void sched_set_stop_task(int cpu, struct task_struct *stop)
793 - {
794 - struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
795 - struct task_struct *old_stop = cpu_rq(cpu)->stop;
796 -
797 - if (stop) {
798 - /*
799 - * Make it appear like a SCHED_FIFO task, its something
800 - * userspace knows about and won't get confused about.
801 - *
802 - * Also, it will make PI more or less work without too
803 - * much confusion -- but then, stop work should not
804 - * rely on PI working anyway.
805 - */
806 - sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);
807 -
808 - stop->sched_class = &stop_sched_class;
809 - }
810 -
811 - cpu_rq(cpu)->stop = stop;
812 -
813 - if (old_stop) {
814 - /*
815 - * Reset it back to a normal scheduling class so that
816 - * it can die in pieces.
817 - */
818 - old_stop->sched_class = &rt_sched_class;
819 - }
820 - }
821 -
822 791 /*
823 792 * __normal_prio - return the priority that is based on the static prio
824 793 */
··· 1559 1588 *avg += diff >> 3;
1560 1589 }
1561 1590
1591 + void sched_set_stop_task(int cpu, struct task_struct *stop)
1592 + {
1593 + struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
1594 + struct task_struct *old_stop = cpu_rq(cpu)->stop;
1595 +
1596 + if (stop) {
1597 + /*
1598 + * Make it appear like a SCHED_FIFO task, its something
1599 + * userspace knows about and won't get confused about.
1600 + *
1601 + * Also, it will make PI more or less work without too
1602 + * much confusion -- but then, stop work should not
1603 + * rely on PI working anyway.
1604 + */
1605 + sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);
1606 +
1607 + stop->sched_class = &stop_sched_class;
1608 + }
1609 +
1610 + cpu_rq(cpu)->stop = stop;
1611 +
1612 + if (old_stop) {
1613 + /*
1614 + * Reset it back to a normal scheduling class so that
1615 + * it can die in pieces.
1616 + */
1617 + old_stop->sched_class = &rt_sched_class;
1618 + }
1619 + }
1620 +
1562 1621 #else
1563 1622
1564 1623 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
··· 1732 1731 {
1733 1732 struct rq *rq = this_rq();
1734 1733 struct llist_node *llist = llist_del_all(&rq->wake_list);
1735 - struct task_struct *p;
1734 + struct task_struct *p, *t;
1736 1735 struct rq_flags rf;
1737 1736
1738 1737 if (!llist)
··· 1741 1740 rq_lock_irqsave(rq, &rf);
1742 1741 update_rq_clock(rq);
1743 1742
1744 - while (llist) {
1745 - int wake_flags = 0;
1746 -
1747 - p = llist_entry(llist, struct task_struct, wake_entry);
1748 - llist = llist_next(llist);
1749 -
1750 - if (p->sched_remote_wakeup)
1751 - wake_flags = WF_MIGRATED;
1752 -
1753 - ttwu_do_activate(rq, p, wake_flags, &rf);
1754 - }
1743 + llist_for_each_entry_safe(p, t, llist, wake_entry)
1744 + ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
1755 1745
1756 1746 rq_unlock_irqrestore(rq, &rf);
1757 1747 }
··· 2140 2148 }
2141 2149
2142 2150 /*
2143 - * This function clears the sched_dl_entity static params.
2144 - */
2145 - void __dl_clear_params(struct task_struct *p)
2146 - {
2147 - struct sched_dl_entity *dl_se = &p->dl;
2148 -
2149 - dl_se->dl_runtime = 0;
2150 - dl_se->dl_deadline = 0;
2151 - dl_se->dl_period = 0;
2152 - dl_se->flags = 0;
2153 - dl_se->dl_bw = 0;
2154 -
2155 - dl_se->dl_throttled = 0;
2156 - dl_se->dl_yielded = 0;
2157 - }
2158 -
2159 -
2160 2151 * Perform scheduler related setup for a newly forked process p.
2162 2153 *
··· 2168 2193
2169 2194 RB_CLEAR_NODE(&p->dl.rb_node);
2170 2195 init_dl_task_timer(&p->dl);
2196 + init_dl_inactive_task_timer(&p->dl);
2171 2197 __dl_clear_params(p);
2172 2198
2173 2199 INIT_LIST_HEAD(&p->rt.run_list);
··· 2406 2430 unsigned long to_ratio(u64 period, u64 runtime)
2407 2431 {
2408 2432 if (runtime == RUNTIME_INF)
2409 - return 1ULL << 20;
2433 + return BW_UNIT;
2410 2434
2411 2435 /*
2412 2436 * Doing this here saves a lot of checks in all
··· 2416 2440 if (period == 0)
2417 2441 return 0;
2418 2442
2419 - return div64_u64(runtime << 20, period);
2443 + return div64_u64(runtime << BW_SHIFT, period);
2420 2444 }
2421 -
2422 - #ifdef CONFIG_SMP
2423 - inline struct dl_bw *dl_bw_of(int i)
2424 - {
2425 - RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
2426 - "sched RCU must be held");
2427 - return &cpu_rq(i)->rd->dl_bw;
2428 - }
2429 -
2430 - static inline int dl_bw_cpus(int i)
2431 - {
2432 - struct root_domain *rd = cpu_rq(i)->rd;
2433 - int cpus = 0;
2434 -
2435 - RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
2436 - "sched RCU must be held");
2437 - for_each_cpu_and(i, rd->span, cpu_active_mask)
2438 - cpus++;
2439 -
2440 - return cpus;
2441 - }
2442 - #else
2443 - inline struct dl_bw *dl_bw_of(int i)
2444 - {
2445 - return &cpu_rq(i)->dl.dl_bw;
2446 - }
2447 -
2448 - static inline int dl_bw_cpus(int i)
2449 - {
2450 - return 1;
2451 - }
2452 - #endif
2453 -
2454 - /*
2455 - * We must be sure that accepting a new task (or allowing changing the
2456 - * parameters of an existing one) is consistent with the bandwidth
2457 - * constraints. If yes, this function also accordingly updates the currently
2458 - * allocated bandwidth to reflect the new situation.
2459 - *
2460 - * This function is called while holding p's rq->lock.
2461 - *
2462 - * XXX we should delay bw change until the task's 0-lag point, see
2464 - */ 2465 - static int dl_overflow(struct task_struct *p, int policy, 2466 - const struct sched_attr *attr) 2467 - { 2468 - 2469 - struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); 2470 - u64 period = attr->sched_period ?: attr->sched_deadline; 2471 - u64 runtime = attr->sched_runtime; 2472 - u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; 2473 - int cpus, err = -1; 2474 - 2475 - /* !deadline task may carry old deadline bandwidth */ 2476 - if (new_bw == p->dl.dl_bw && task_has_dl_policy(p)) 2477 - return 0; 2478 - 2479 - /* 2480 - * Either if a task, enters, leave, or stays -deadline but changes 2481 - * its parameters, we may need to update accordingly the total 2482 - * allocated bandwidth of the container. 2483 - */ 2484 - raw_spin_lock(&dl_b->lock); 2485 - cpus = dl_bw_cpus(task_cpu(p)); 2486 - if (dl_policy(policy) && !task_has_dl_policy(p) && 2487 - !__dl_overflow(dl_b, cpus, 0, new_bw)) { 2488 - __dl_add(dl_b, new_bw); 2489 - err = 0; 2490 - } else if (dl_policy(policy) && task_has_dl_policy(p) && 2491 - !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { 2492 - __dl_clear(dl_b, p->dl.dl_bw); 2493 - __dl_add(dl_b, new_bw); 2494 - err = 0; 2495 - } else if (!dl_policy(policy) && task_has_dl_policy(p)) { 2496 - __dl_clear(dl_b, p->dl.dl_bw); 2497 - err = 0; 2498 - } 2499 - raw_spin_unlock(&dl_b->lock); 2500 - 2501 - return err; 2502 - } 2503 - 2504 - extern void init_dl_bw(struct dl_bw *dl_b); 2505 2445 2506 2446 /* 2507 2447 * wake_up_new_task - wake up a newly created task for the first time. 
··· 3579 3687 exception_exit(prev_state); 3580 3688 } 3581 3689 3582 - int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags, 3690 + int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags, 3583 3691 void *key) 3584 3692 { 3585 3693 return try_to_wake_up(curr->private, mode, wake_flags); ··· 3901 4009 } 3902 4010 3903 4011 /* 3904 - * This function initializes the sched_dl_entity of a newly becoming 3905 - * SCHED_DEADLINE task. 3906 - * 3907 - * Only the static values are considered here, the actual runtime and the 3908 - * absolute deadline will be properly calculated when the task is enqueued 3909 - * for the first time with its new policy. 3910 - */ 3911 - static void 3912 - __setparam_dl(struct task_struct *p, const struct sched_attr *attr) 3913 - { 3914 - struct sched_dl_entity *dl_se = &p->dl; 3915 - 3916 - dl_se->dl_runtime = attr->sched_runtime; 3917 - dl_se->dl_deadline = attr->sched_deadline; 3918 - dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline; 3919 - dl_se->flags = attr->sched_flags; 3920 - dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); 3921 - 3922 - /* 3923 - * Changing the parameters of a task is 'tricky' and we're not doing 3924 - * the correct thing -- also see task_dead_dl() and switched_from_dl(). 3925 - * 3926 - * What we SHOULD do is delay the bandwidth release until the 0-lag 3927 - * point. This would include retaining the task_struct until that time 3928 - * and change dl_overflow() to not immediately decrement the current 3929 - * amount. 3930 - * 3931 - * Instead we retain the current runtime/deadline and let the new 3932 - * parameters take effect after the current reservation period lapses. 3933 - * This is safe (albeit pessimistic) because the 0-lag point is always 3934 - * before the current scheduling deadline. 
3935 - * 3936 - * We can still have temporary overloads because we do not delay the 3937 - * change in bandwidth until that time; so admission control is 3938 - * not on the safe side. It does however guarantee tasks will never 3939 - * consume more than promised. 3940 - */ 3941 - } 3942 - 3943 - /* 3944 4012 * sched_setparam() passes in -1 for its policy, to let the functions 3945 4013 * it calls know not to change it. 3946 4014 */ ··· 3953 4101 p->sched_class = &fair_sched_class; 3954 4102 } 3955 4103 3956 - static void 3957 - __getparam_dl(struct task_struct *p, struct sched_attr *attr) 3958 - { 3959 - struct sched_dl_entity *dl_se = &p->dl; 3960 - 3961 - attr->sched_priority = p->rt_priority; 3962 - attr->sched_runtime = dl_se->dl_runtime; 3963 - attr->sched_deadline = dl_se->dl_deadline; 3964 - attr->sched_period = dl_se->dl_period; 3965 - attr->sched_flags = dl_se->flags; 3966 - } 3967 - 3968 - /* 3969 - * This function validates the new parameters of a -deadline task. 3970 - * We ask for the deadline not being zero, and greater or equal 3971 - * than the runtime, as well as the period of being zero or 3972 - * greater than deadline. Furthermore, we have to be sure that 3973 - * user parameters are above the internal resolution of 1us (we 3974 - * check sched_runtime only since it is always the smaller one) and 3975 - * below 2^63 ns (we have to check both sched_deadline and 3976 - * sched_period, as the latter can be zero). 3977 - */ 3978 - static bool 3979 - __checkparam_dl(const struct sched_attr *attr) 3980 - { 3981 - /* deadline != 0 */ 3982 - if (attr->sched_deadline == 0) 3983 - return false; 3984 - 3985 - /* 3986 - * Since we truncate DL_SCALE bits, make sure we're at least 3987 - * that big. 3988 - */ 3989 - if (attr->sched_runtime < (1ULL << DL_SCALE)) 3990 - return false; 3991 - 3992 - /* 3993 - * Since we use the MSB for wrap-around and sign issues, make 3994 - * sure it's not set (mind that period can be equal to zero). 
3995 - */ 3996 - if (attr->sched_deadline & (1ULL << 63) || 3997 - attr->sched_period & (1ULL << 63)) 3998 - return false; 3999 - 4000 - /* runtime <= deadline <= period (if period != 0) */ 4001 - if ((attr->sched_period != 0 && 4002 - attr->sched_period < attr->sched_deadline) || 4003 - attr->sched_deadline < attr->sched_runtime) 4004 - return false; 4005 - 4006 - return true; 4007 - } 4008 - 4009 4104 /* 4010 4105 * Check the target process has a UID that matches the current process's: 4011 4106 */ ··· 3969 4170 return match; 3970 4171 } 3971 4172 3972 - static bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr) 3973 - { 3974 - struct sched_dl_entity *dl_se = &p->dl; 3975 - 3976 - if (dl_se->dl_runtime != attr->sched_runtime || 3977 - dl_se->dl_deadline != attr->sched_deadline || 3978 - dl_se->dl_period != attr->sched_period || 3979 - dl_se->flags != attr->sched_flags) 3980 - return true; 3981 - 3982 - return false; 3983 - } 3984 - 3985 4173 static int __sched_setscheduler(struct task_struct *p, 3986 4174 const struct sched_attr *attr, 3987 4175 bool user, bool pi) ··· 3983 4197 int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; 3984 4198 struct rq *rq; 3985 4199 3986 - /* May grab non-irq protected spin_locks: */ 3987 - BUG_ON(in_interrupt()); 4200 + /* The pi code expects interrupts enabled */ 4201 + BUG_ON(pi && in_interrupt()); 3988 4202 recheck: 3989 4203 /* Double check policy once rq lock held: */ 3990 4204 if (policy < 0) { ··· 3997 4211 return -EINVAL; 3998 4212 } 3999 4213 4000 - if (attr->sched_flags & ~(SCHED_FLAG_RESET_ON_FORK)) 4214 + if (attr->sched_flags & 4215 + ~(SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_RECLAIM)) 4001 4216 return -EINVAL; 4002 4217 4003 4218 /* ··· 4149 4362 * of a SCHED_DEADLINE task) we need to check if enough bandwidth 4150 4363 * is available. 
4151 4364 */ 4152 - if ((dl_policy(policy) || dl_task(p)) && dl_overflow(p, policy, attr)) { 4365 + if ((dl_policy(policy) || dl_task(p)) && sched_dl_overflow(p, policy, attr)) { 4153 4366 task_rq_unlock(rq, p, &rf); 4154 4367 return -EBUSY; 4155 4368 } ··· 5250 5463 #endif 5251 5464 } 5252 5465 5466 + #ifdef CONFIG_SMP 5467 + 5253 5468 int cpuset_cpumask_can_shrink(const struct cpumask *cur, 5254 5469 const struct cpumask *trial) 5255 5470 { 5256 - int ret = 1, trial_cpus; 5257 - struct dl_bw *cur_dl_b; 5258 - unsigned long flags; 5471 + int ret = 1; 5259 5472 5260 5473 if (!cpumask_weight(cur)) 5261 5474 return ret; 5262 5475 5263 - rcu_read_lock_sched(); 5264 - cur_dl_b = dl_bw_of(cpumask_any(cur)); 5265 - trial_cpus = cpumask_weight(trial); 5266 - 5267 - raw_spin_lock_irqsave(&cur_dl_b->lock, flags); 5268 - if (cur_dl_b->bw != -1 && 5269 - cur_dl_b->bw * trial_cpus < cur_dl_b->total_bw) 5270 - ret = 0; 5271 - raw_spin_unlock_irqrestore(&cur_dl_b->lock, flags); 5272 - rcu_read_unlock_sched(); 5476 + ret = dl_cpuset_cpumask_can_shrink(cur, trial); 5273 5477 5274 5478 return ret; 5275 5479 } ··· 5284 5506 goto out; 5285 5507 } 5286 5508 5287 - #ifdef CONFIG_SMP 5288 5509 if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span, 5289 - cs_cpus_allowed)) { 5290 - unsigned int dest_cpu = cpumask_any_and(cpu_active_mask, 5291 - cs_cpus_allowed); 5292 - struct dl_bw *dl_b; 5293 - bool overflow; 5294 - int cpus; 5295 - unsigned long flags; 5510 + cs_cpus_allowed)) 5511 + ret = dl_task_can_attach(p, cs_cpus_allowed); 5296 5512 5297 - rcu_read_lock_sched(); 5298 - dl_b = dl_bw_of(dest_cpu); 5299 - raw_spin_lock_irqsave(&dl_b->lock, flags); 5300 - cpus = dl_bw_cpus(dest_cpu); 5301 - overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw); 5302 - if (overflow) 5303 - ret = -EBUSY; 5304 - else { 5305 - /* 5306 - * We reserve space for this task in the destination 5307 - * root_domain, as we can't fail after this point. 
5308 - * We will free resources in the source root_domain 5309 - * later on (see set_cpus_allowed_dl()). 5310 - */ 5311 - __dl_add(dl_b, p->dl.dl_bw); 5312 - } 5313 - raw_spin_unlock_irqrestore(&dl_b->lock, flags); 5314 - rcu_read_unlock_sched(); 5315 - 5316 - } 5317 - #endif 5318 5513 out: 5319 5514 return ret; 5320 5515 } 5321 - 5322 - #ifdef CONFIG_SMP 5323 5516 5324 5517 bool sched_smp_initialized __read_mostly; 5325 5518 ··· 5554 5805 5555 5806 static int cpuset_cpu_inactive(unsigned int cpu) 5556 5807 { 5557 - unsigned long flags; 5558 - struct dl_bw *dl_b; 5559 - bool overflow; 5560 - int cpus; 5561 - 5562 5808 if (!cpuhp_tasks_frozen) { 5563 - rcu_read_lock_sched(); 5564 - dl_b = dl_bw_of(cpu); 5565 - 5566 - raw_spin_lock_irqsave(&dl_b->lock, flags); 5567 - cpus = dl_bw_cpus(cpu); 5568 - overflow = __dl_overflow(dl_b, cpus, 0, 0); 5569 - raw_spin_unlock_irqrestore(&dl_b->lock, flags); 5570 - 5571 - rcu_read_unlock_sched(); 5572 - 5573 - if (overflow) 5809 + if (dl_cpu_busy(cpu)) 5574 5810 return -EBUSY; 5575 5811 cpuset_update_active_cpus(); 5576 5812 } else { ··· 5686 5952 cpumask_var_t non_isolated_cpus; 5687 5953 5688 5954 alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL); 5689 - alloc_cpumask_var(&fallback_doms, GFP_KERNEL); 5690 5955 5691 5956 sched_init_numa(); 5692 5957 ··· 5695 5962 * happen. 
5696 5963 */ 5697 5964 mutex_lock(&sched_domains_mutex); 5698 - init_sched_domains(cpu_active_mask); 5965 + sched_init_domains(cpu_active_mask); 5699 5966 cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map); 5700 5967 if (cpumask_empty(non_isolated_cpus)) 5701 5968 cpumask_set_cpu(smp_processor_id(), non_isolated_cpus); ··· 5711 5978 init_sched_dl_class(); 5712 5979 5713 5980 sched_init_smt(); 5714 - sched_clock_init_late(); 5715 5981 5716 5982 sched_smp_initialized = true; 5717 5983 } ··· 5726 5994 void __init sched_init_smp(void) 5727 5995 { 5728 5996 sched_init_granularity(); 5729 - sched_clock_init_late(); 5730 5997 } 5731 5998 #endif /* CONFIG_SMP */ 5732 5999 ··· 5751 6020 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask); 5752 6021 DECLARE_PER_CPU(cpumask_var_t, select_idle_mask); 5753 6022 5754 - #define WAIT_TABLE_BITS 8 5755 - #define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS) 5756 - static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned; 5757 - 5758 - wait_queue_head_t *bit_waitqueue(void *word, int bit) 5759 - { 5760 - const int shift = BITS_PER_LONG == 32 ? 
5 : 6; 5761 - unsigned long val = (unsigned long)word << shift | bit; 5762 - 5763 - return bit_wait_table + hash_long(val, WAIT_TABLE_BITS); 5764 - } 5765 - EXPORT_SYMBOL(bit_waitqueue); 5766 - 5767 6023 void __init sched_init(void) 5768 6024 { 5769 6025 int i, j; 5770 6026 unsigned long alloc_size = 0, ptr; 5771 6027 5772 6028 sched_clock_init(); 5773 - 5774 - for (i = 0; i < WAIT_TABLE_SIZE; i++) 5775 - init_waitqueue_head(bit_wait_table + i); 6029 + wait_bit_init(); 5776 6030 5777 6031 #ifdef CONFIG_FAIR_GROUP_SCHED 5778 6032 alloc_size += 2 * nr_cpu_ids * sizeof(void **); ··· 5909 6193 calc_load_update = jiffies + LOAD_FREQ; 5910 6194 5911 6195 #ifdef CONFIG_SMP 5912 - zalloc_cpumask_var(&sched_domains_tmpmask, GFP_NOWAIT); 5913 6196 /* May be allocated at isolcpus cmdline parse time */ 5914 6197 if (cpu_isolated_map == NULL) 5915 6198 zalloc_cpumask_var(&cpu_isolated_map, GFP_NOWAIT); ··· 5960 6245 5961 6246 if ((preempt_count_equals(preempt_offset) && !irqs_disabled() && 5962 6247 !is_idle_task(current)) || 5963 - system_state != SYSTEM_RUNNING || oops_in_progress) 6248 + system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING || 6249 + oops_in_progress) 5964 6250 return; 6251 + 5965 6252 if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) 5966 6253 return; 5967 6254 prev_jiffy = jiffies; ··· 6218 6501 6219 6502 task_rq_unlock(rq, tsk, &rf); 6220 6503 } 6221 - #endif /* CONFIG_CGROUP_SCHED */ 6222 - 6223 - #ifdef CONFIG_RT_GROUP_SCHED 6224 - /* 6225 - * Ensure that the real time constraints are schedulable. 6226 - */ 6227 - static DEFINE_MUTEX(rt_constraints_mutex); 6228 - 6229 - /* Must be called with tasklist_lock held */ 6230 - static inline int tg_has_rt_tasks(struct task_group *tg) 6231 - { 6232 - struct task_struct *g, *p; 6233 - 6234 - /* 6235 - * Autogroups do not have RT tasks; see autogroup_create(). 
6236 - */ 6237 - if (task_group_is_autogroup(tg)) 6238 - return 0; 6239 - 6240 - for_each_process_thread(g, p) { 6241 - if (rt_task(p) && task_group(p) == tg) 6242 - return 1; 6243 - } 6244 - 6245 - return 0; 6246 - } 6247 - 6248 - struct rt_schedulable_data { 6249 - struct task_group *tg; 6250 - u64 rt_period; 6251 - u64 rt_runtime; 6252 - }; 6253 - 6254 - static int tg_rt_schedulable(struct task_group *tg, void *data) 6255 - { 6256 - struct rt_schedulable_data *d = data; 6257 - struct task_group *child; 6258 - unsigned long total, sum = 0; 6259 - u64 period, runtime; 6260 - 6261 - period = ktime_to_ns(tg->rt_bandwidth.rt_period); 6262 - runtime = tg->rt_bandwidth.rt_runtime; 6263 - 6264 - if (tg == d->tg) { 6265 - period = d->rt_period; 6266 - runtime = d->rt_runtime; 6267 - } 6268 - 6269 - /* 6270 - * Cannot have more runtime than the period. 6271 - */ 6272 - if (runtime > period && runtime != RUNTIME_INF) 6273 - return -EINVAL; 6274 - 6275 - /* 6276 - * Ensure we don't starve existing RT tasks. 6277 - */ 6278 - if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg)) 6279 - return -EBUSY; 6280 - 6281 - total = to_ratio(period, runtime); 6282 - 6283 - /* 6284 - * Nobody can have more than the global setting allows. 6285 - */ 6286 - if (total > to_ratio(global_rt_period(), global_rt_runtime())) 6287 - return -EINVAL; 6288 - 6289 - /* 6290 - * The sum of our children's runtime should not exceed our own. 
6291 - */ 6292 - list_for_each_entry_rcu(child, &tg->children, siblings) { 6293 - period = ktime_to_ns(child->rt_bandwidth.rt_period); 6294 - runtime = child->rt_bandwidth.rt_runtime; 6295 - 6296 - if (child == d->tg) { 6297 - period = d->rt_period; 6298 - runtime = d->rt_runtime; 6299 - } 6300 - 6301 - sum += to_ratio(period, runtime); 6302 - } 6303 - 6304 - if (sum > total) 6305 - return -EINVAL; 6306 - 6307 - return 0; 6308 - } 6309 - 6310 - static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) 6311 - { 6312 - int ret; 6313 - 6314 - struct rt_schedulable_data data = { 6315 - .tg = tg, 6316 - .rt_period = period, 6317 - .rt_runtime = runtime, 6318 - }; 6319 - 6320 - rcu_read_lock(); 6321 - ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data); 6322 - rcu_read_unlock(); 6323 - 6324 - return ret; 6325 - } 6326 - 6327 - static int tg_set_rt_bandwidth(struct task_group *tg, 6328 - u64 rt_period, u64 rt_runtime) 6329 - { 6330 - int i, err = 0; 6331 - 6332 - /* 6333 - * Disallowing the root group RT runtime is BAD, it would disallow the 6334 - * kernel creating (and or operating) RT threads. 6335 - */ 6336 - if (tg == &root_task_group && rt_runtime == 0) 6337 - return -EINVAL; 6338 - 6339 - /* No period doesn't make any sense. 
*/ 6340 - if (rt_period == 0) 6341 - return -EINVAL; 6342 - 6343 - mutex_lock(&rt_constraints_mutex); 6344 - read_lock(&tasklist_lock); 6345 - err = __rt_schedulable(tg, rt_period, rt_runtime); 6346 - if (err) 6347 - goto unlock; 6348 - 6349 - raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock); 6350 - tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period); 6351 - tg->rt_bandwidth.rt_runtime = rt_runtime; 6352 - 6353 - for_each_possible_cpu(i) { 6354 - struct rt_rq *rt_rq = tg->rt_rq[i]; 6355 - 6356 - raw_spin_lock(&rt_rq->rt_runtime_lock); 6357 - rt_rq->rt_runtime = rt_runtime; 6358 - raw_spin_unlock(&rt_rq->rt_runtime_lock); 6359 - } 6360 - raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock); 6361 - unlock: 6362 - read_unlock(&tasklist_lock); 6363 - mutex_unlock(&rt_constraints_mutex); 6364 - 6365 - return err; 6366 - } 6367 - 6368 - static int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us) 6369 - { 6370 - u64 rt_runtime, rt_period; 6371 - 6372 - rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period); 6373 - rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC; 6374 - if (rt_runtime_us < 0) 6375 - rt_runtime = RUNTIME_INF; 6376 - 6377 - return tg_set_rt_bandwidth(tg, rt_period, rt_runtime); 6378 - } 6379 - 6380 - static long sched_group_rt_runtime(struct task_group *tg) 6381 - { 6382 - u64 rt_runtime_us; 6383 - 6384 - if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF) 6385 - return -1; 6386 - 6387 - rt_runtime_us = tg->rt_bandwidth.rt_runtime; 6388 - do_div(rt_runtime_us, NSEC_PER_USEC); 6389 - return rt_runtime_us; 6390 - } 6391 - 6392 - static int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us) 6393 - { 6394 - u64 rt_runtime, rt_period; 6395 - 6396 - rt_period = rt_period_us * NSEC_PER_USEC; 6397 - rt_runtime = tg->rt_bandwidth.rt_runtime; 6398 - 6399 - return tg_set_rt_bandwidth(tg, rt_period, rt_runtime); 6400 - } 6401 - 6402 - static long sched_group_rt_period(struct task_group *tg) 6403 - { 6404 - u64 rt_period_us; 6405 - 
6406 - rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period); 6407 - do_div(rt_period_us, NSEC_PER_USEC); 6408 - return rt_period_us; 6409 - } 6410 - #endif /* CONFIG_RT_GROUP_SCHED */ 6411 - 6412 - #ifdef CONFIG_RT_GROUP_SCHED 6413 - static int sched_rt_global_constraints(void) 6414 - { 6415 - int ret = 0; 6416 - 6417 - mutex_lock(&rt_constraints_mutex); 6418 - read_lock(&tasklist_lock); 6419 - ret = __rt_schedulable(NULL, 0, 0); 6420 - read_unlock(&tasklist_lock); 6421 - mutex_unlock(&rt_constraints_mutex); 6422 - 6423 - return ret; 6424 - } 6425 - 6426 - static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk) 6427 - { 6428 - /* Don't accept realtime tasks when there is no way for them to run */ 6429 - if (rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0) 6430 - return 0; 6431 - 6432 - return 1; 6433 - } 6434 - 6435 - #else /* !CONFIG_RT_GROUP_SCHED */ 6436 - static int sched_rt_global_constraints(void) 6437 - { 6438 - unsigned long flags; 6439 - int i; 6440 - 6441 - raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags); 6442 - for_each_possible_cpu(i) { 6443 - struct rt_rq *rt_rq = &cpu_rq(i)->rt; 6444 - 6445 - raw_spin_lock(&rt_rq->rt_runtime_lock); 6446 - rt_rq->rt_runtime = global_rt_runtime(); 6447 - raw_spin_unlock(&rt_rq->rt_runtime_lock); 6448 - } 6449 - raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags); 6450 - 6451 - return 0; 6452 - } 6453 - #endif /* CONFIG_RT_GROUP_SCHED */ 6454 - 6455 - static int sched_dl_global_validate(void) 6456 - { 6457 - u64 runtime = global_rt_runtime(); 6458 - u64 period = global_rt_period(); 6459 - u64 new_bw = to_ratio(period, runtime); 6460 - struct dl_bw *dl_b; 6461 - int cpu, ret = 0; 6462 - unsigned long flags; 6463 - 6464 - /* 6465 - * Here we want to check the bandwidth not being set to some 6466 - * value smaller than the currently allocated bandwidth in 6467 - * any of the root_domains. 
6468 - * 6469 - * FIXME: Cycling on all the CPUs is overdoing, but simpler than 6470 - * cycling on root_domains... Discussion on different/better 6471 - * solutions is welcome! 6472 - */ 6473 - for_each_possible_cpu(cpu) { 6474 - rcu_read_lock_sched(); 6475 - dl_b = dl_bw_of(cpu); 6476 - 6477 - raw_spin_lock_irqsave(&dl_b->lock, flags); 6478 - if (new_bw < dl_b->total_bw) 6479 - ret = -EBUSY; 6480 - raw_spin_unlock_irqrestore(&dl_b->lock, flags); 6481 - 6482 - rcu_read_unlock_sched(); 6483 - 6484 - if (ret) 6485 - break; 6486 - } 6487 - 6488 - return ret; 6489 - } 6490 - 6491 - static void sched_dl_do_global(void) 6492 - { 6493 - u64 new_bw = -1; 6494 - struct dl_bw *dl_b; 6495 - int cpu; 6496 - unsigned long flags; 6497 - 6498 - def_dl_bandwidth.dl_period = global_rt_period(); 6499 - def_dl_bandwidth.dl_runtime = global_rt_runtime(); 6500 - 6501 - if (global_rt_runtime() != RUNTIME_INF) 6502 - new_bw = to_ratio(global_rt_period(), global_rt_runtime()); 6503 - 6504 - /* 6505 - * FIXME: As above... 
6506 - */ 6507 - for_each_possible_cpu(cpu) { 6508 - rcu_read_lock_sched(); 6509 - dl_b = dl_bw_of(cpu); 6510 - 6511 - raw_spin_lock_irqsave(&dl_b->lock, flags); 6512 - dl_b->bw = new_bw; 6513 - raw_spin_unlock_irqrestore(&dl_b->lock, flags); 6514 - 6515 - rcu_read_unlock_sched(); 6516 - } 6517 - } 6518 - 6519 - static int sched_rt_global_validate(void) 6520 - { 6521 - if (sysctl_sched_rt_period <= 0) 6522 - return -EINVAL; 6523 - 6524 - if ((sysctl_sched_rt_runtime != RUNTIME_INF) && 6525 - (sysctl_sched_rt_runtime > sysctl_sched_rt_period)) 6526 - return -EINVAL; 6527 - 6528 - return 0; 6529 - } 6530 - 6531 - static void sched_rt_do_global(void) 6532 - { 6533 - def_rt_bandwidth.rt_runtime = global_rt_runtime(); 6534 - def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period()); 6535 - } 6536 - 6537 - int sched_rt_handler(struct ctl_table *table, int write, 6538 - void __user *buffer, size_t *lenp, 6539 - loff_t *ppos) 6540 - { 6541 - int old_period, old_runtime; 6542 - static DEFINE_MUTEX(mutex); 6543 - int ret; 6544 - 6545 - mutex_lock(&mutex); 6546 - old_period = sysctl_sched_rt_period; 6547 - old_runtime = sysctl_sched_rt_runtime; 6548 - 6549 - ret = proc_dointvec(table, write, buffer, lenp, ppos); 6550 - 6551 - if (!ret && write) { 6552 - ret = sched_rt_global_validate(); 6553 - if (ret) 6554 - goto undo; 6555 - 6556 - ret = sched_dl_global_validate(); 6557 - if (ret) 6558 - goto undo; 6559 - 6560 - ret = sched_rt_global_constraints(); 6561 - if (ret) 6562 - goto undo; 6563 - 6564 - sched_rt_do_global(); 6565 - sched_dl_do_global(); 6566 - } 6567 - if (0) { 6568 - undo: 6569 - sysctl_sched_rt_period = old_period; 6570 - sysctl_sched_rt_runtime = old_runtime; 6571 - } 6572 - mutex_unlock(&mutex); 6573 - 6574 - return ret; 6575 - } 6576 - 6577 - int sched_rr_handler(struct ctl_table *table, int write, 6578 - void __user *buffer, size_t *lenp, 6579 - loff_t *ppos) 6580 - { 6581 - int ret; 6582 - static DEFINE_MUTEX(mutex); 6583 - 6584 - mutex_lock(&mutex); 6585 
- ret = proc_dointvec(table, write, buffer, lenp, ppos); 6586 - /* 6587 - * Make sure that internally we keep jiffies. 6588 - * Also, writing zero resets the timeslice to default: 6589 - */ 6590 - if (!ret && write) { 6591 - sched_rr_timeslice = 6592 - sysctl_sched_rr_timeslice <= 0 ? RR_TIMESLICE : 6593 - msecs_to_jiffies(sysctl_sched_rr_timeslice); 6594 - } 6595 - mutex_unlock(&mutex); 6596 - return ret; 6597 - } 6598 - 6599 - #ifdef CONFIG_CGROUP_SCHED 6600 6504 6601 6505 static inline struct task_group *css_tg(struct cgroup_subsys_state *css) 6602 6506 {
+5 -11
kernel/sched/cputime.c
··· 615 615 * userspace. Once a task gets some ticks, the monotonicy code at 616 616 * 'update' will ensure things converge to the observed ratio. 617 617 */ 618 - if (stime == 0) { 619 - utime = rtime; 620 - goto update; 618 + if (stime != 0) { 619 + if (utime == 0) 620 + stime = rtime; 621 + else 622 + stime = scale_stime(stime, rtime, stime + utime); 621 623 } 622 624 623 - if (utime == 0) { 624 - stime = rtime; 625 - goto update; 626 - } 627 - 628 - stime = scale_stime(stime, rtime, stime + utime); 629 - 630 - update: 631 625 /* 632 626 * Make sure stime doesn't go backwards; this preserves monotonicity 633 627 * for utime because rtime is monotonic.
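The refactor above collapses the two `goto update` special cases into a single nested conditional. The resulting split can be sketched in isolation; both helpers below are illustrative userspace stand-ins (the kernel's scale_stime() uses an overflow-safe multiply/divide rather than a plain 128-bit cast):

```c
#include <stdint.h>

/* Sketch of the control flow after the cputime_adjust() refactor:
 * split the precise total runtime 'rtime' into system/user time using
 * the sampled stime:utime ratio. stime == 0 leaves everything in utime,
 * utime == 0 gives everything to stime, otherwise scale proportionally. */
static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
	return (uint64_t)(((unsigned __int128)stime * rtime) / total);
}

void split_rtime(uint64_t stime, uint64_t utime, uint64_t rtime,
		 uint64_t *out_stime, uint64_t *out_utime)
{
	if (stime != 0) {
		if (utime == 0)
			stime = rtime;
		else
			stime = scale_stime(stime, rtime, stime + utime);
	}
	*out_stime = stime;
	*out_utime = rtime - stime;
}
```

The behavior is unchanged from the `goto`-based version; the rewrite only makes the three cases read as one decision tree before the monotonicity clamping that follows.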
+851 -43
kernel/sched/deadline.c
··· 17 17 #include "sched.h" 18 18 19 19 #include <linux/slab.h> 20 + #include <uapi/linux/sched/types.h> 20 21 21 22 struct dl_bandwidth def_dl_bandwidth; 22 23 ··· 42 41 static inline int on_dl_rq(struct sched_dl_entity *dl_se) 43 42 { 44 43 return !RB_EMPTY_NODE(&dl_se->rb_node); 44 + } 45 + 46 + #ifdef CONFIG_SMP 47 + static inline struct dl_bw *dl_bw_of(int i) 48 + { 49 + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(), 50 + "sched RCU must be held"); 51 + return &cpu_rq(i)->rd->dl_bw; 52 + } 53 + 54 + static inline int dl_bw_cpus(int i) 55 + { 56 + struct root_domain *rd = cpu_rq(i)->rd; 57 + int cpus = 0; 58 + 59 + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(), 60 + "sched RCU must be held"); 61 + for_each_cpu_and(i, rd->span, cpu_active_mask) 62 + cpus++; 63 + 64 + return cpus; 65 + } 66 + #else 67 + static inline struct dl_bw *dl_bw_of(int i) 68 + { 69 + return &cpu_rq(i)->dl.dl_bw; 70 + } 71 + 72 + static inline int dl_bw_cpus(int i) 73 + { 74 + return 1; 75 + } 76 + #endif 77 + 78 + static inline 79 + void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq) 80 + { 81 + u64 old = dl_rq->running_bw; 82 + 83 + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); 84 + dl_rq->running_bw += dl_bw; 85 + SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */ 86 + SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw); 87 + } 88 + 89 + static inline 90 + void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq) 91 + { 92 + u64 old = dl_rq->running_bw; 93 + 94 + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); 95 + dl_rq->running_bw -= dl_bw; 96 + SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */ 97 + if (dl_rq->running_bw > old) 98 + dl_rq->running_bw = 0; 99 + } 100 + 101 + static inline 102 + void add_rq_bw(u64 dl_bw, struct dl_rq *dl_rq) 103 + { 104 + u64 old = dl_rq->this_bw; 105 + 106 + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); 107 + dl_rq->this_bw += dl_bw; 108 + SCHED_WARN_ON(dl_rq->this_bw < old); /* overflow */ 109 + } 110 + 111 + static inline 112 + void 
sub_rq_bw(u64 dl_bw, struct dl_rq *dl_rq) 113 + { 114 + u64 old = dl_rq->this_bw; 115 + 116 + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock); 117 + dl_rq->this_bw -= dl_bw; 118 + SCHED_WARN_ON(dl_rq->this_bw > old); /* underflow */ 119 + if (dl_rq->this_bw > old) 120 + dl_rq->this_bw = 0; 121 + SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw); 122 + } 123 + 124 + void dl_change_utilization(struct task_struct *p, u64 new_bw) 125 + { 126 + struct rq *rq; 127 + 128 + if (task_on_rq_queued(p)) 129 + return; 130 + 131 + rq = task_rq(p); 132 + if (p->dl.dl_non_contending) { 133 + sub_running_bw(p->dl.dl_bw, &rq->dl); 134 + p->dl.dl_non_contending = 0; 135 + /* 136 + * If the timer handler is currently running and the 137 + * timer cannot be cancelled, inactive_task_timer() 138 + * will see that dl_not_contending is not set, and 139 + * will not touch the rq's active utilization, 140 + * so we are still safe. 141 + */ 142 + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) 143 + put_task_struct(p); 144 + } 145 + sub_rq_bw(p->dl.dl_bw, &rq->dl); 146 + add_rq_bw(new_bw, &rq->dl); 147 + } 148 + 149 + /* 150 + * The utilization of a task cannot be immediately removed from 151 + * the rq active utilization (running_bw) when the task blocks. 152 + * Instead, we have to wait for the so called "0-lag time". 153 + * 154 + * If a task blocks before the "0-lag time", a timer (the inactive 155 + * timer) is armed, and running_bw is decreased when the timer 156 + * fires. 157 + * 158 + * If the task wakes up again before the inactive timer fires, 159 + * the timer is cancelled, whereas if the task wakes up after the 160 + * inactive timer fired (and running_bw has been decreased) the 161 + * task's utilization has to be added to running_bw again. 162 + * A flag in the deadline scheduling entity (dl_non_contending) 163 + * is used to avoid race conditions between the inactive timer handler 164 + * and task wakeups. 
165 + * 166 + * The following diagram shows how running_bw is updated. A task is 167 + * "ACTIVE" when its utilization contributes to running_bw; an 168 + * "ACTIVE contending" task is in the TASK_RUNNING state, while an 169 + * "ACTIVE non contending" task is a blocked task for which the "0-lag time" 170 + * has not passed yet. An "INACTIVE" task is a task for which the "0-lag" 171 + * time already passed, which does not contribute to running_bw anymore. 172 + * +------------------+ 173 + * wakeup | ACTIVE | 174 + * +------------------>+ contending | 175 + * | add_running_bw | | 176 + * | +----+------+------+ 177 + * | | ^ 178 + * | dequeue | | 179 + * +--------+-------+ | | 180 + * | | t >= 0-lag | | wakeup 181 + * | INACTIVE |<---------------+ | 182 + * | | sub_running_bw | | 183 + * +--------+-------+ | | 184 + * ^ | | 185 + * | t < 0-lag | | 186 + * | | | 187 + * | V | 188 + * | +----+------+------+ 189 + * | sub_running_bw | ACTIVE | 190 + * +-------------------+ | 191 + * inactive timer | non contending | 192 + * fired +------------------+ 193 + * 194 + * The task_non_contending() function is invoked when a task 195 + * blocks, and checks if the 0-lag time already passed or 196 + * not (in the first case, it directly updates running_bw; 197 + * in the second case, it arms the inactive timer). 198 + * 199 + * The task_contending() function is invoked when a task wakes 200 + * up, and checks if the task is still in the "ACTIVE non contending" 201 + * state or not (in the second case, it updates running_bw). 
202 + */ 203 + static void task_non_contending(struct task_struct *p) 204 + { 205 + struct sched_dl_entity *dl_se = &p->dl; 206 + struct hrtimer *timer = &dl_se->inactive_timer; 207 + struct dl_rq *dl_rq = dl_rq_of_se(dl_se); 208 + struct rq *rq = rq_of_dl_rq(dl_rq); 209 + s64 zerolag_time; 210 + 211 + /* 212 + * If this is a non-deadline task that has been boosted, 213 + * do nothing 214 + */ 215 + if (dl_se->dl_runtime == 0) 216 + return; 217 + 218 + WARN_ON(hrtimer_active(&dl_se->inactive_timer)); 219 + WARN_ON(dl_se->dl_non_contending); 220 + 221 + zerolag_time = dl_se->deadline - 222 + div64_long((dl_se->runtime * dl_se->dl_period), 223 + dl_se->dl_runtime); 224 + 225 + /* 226 + * Using relative times instead of the absolute "0-lag time" 227 + * allows us to simplify the code 228 + */ 229 + zerolag_time -= rq_clock(rq); 230 + 231 + /* 232 + * If the "0-lag time" already passed, decrease the active 233 + * utilization now, instead of starting a timer 234 + */ 235 + if (zerolag_time < 0) { 236 + if (dl_task(p)) 237 + sub_running_bw(dl_se->dl_bw, dl_rq); 238 + if (!dl_task(p) || p->state == TASK_DEAD) { 239 + struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); 240 + 241 + if (p->state == TASK_DEAD) 242 + sub_rq_bw(p->dl.dl_bw, &rq->dl); 243 + raw_spin_lock(&dl_b->lock); 244 + __dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p))); 245 + __dl_clear_params(p); 246 + raw_spin_unlock(&dl_b->lock); 247 + } 248 + 249 + return; 250 + } 251 + 252 + dl_se->dl_non_contending = 1; 253 + get_task_struct(p); 254 + hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL); 255 + } 256 + 257 + static void task_contending(struct sched_dl_entity *dl_se, int flags) 258 + { 259 + struct dl_rq *dl_rq = dl_rq_of_se(dl_se); 260 + 261 + /* 262 + * If this is a non-deadline task that has been boosted, 263 + * do nothing 264 + */ 265 + if (dl_se->dl_runtime == 0) 266 + return; 267 + 268 + if (flags & ENQUEUE_MIGRATED) 269 + add_rq_bw(dl_se->dl_bw, dl_rq); 270 + 271 + if
(dl_se->dl_non_contending) { 272 + dl_se->dl_non_contending = 0; 273 + /* 274 + * If the timer handler is currently running and the 275 + * timer cannot be cancelled, inactive_task_timer() 276 + * will see that dl_not_contending is not set, and 277 + * will not touch the rq's active utilization, 278 + * so we are still safe. 279 + */ 280 + if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1) 281 + put_task_struct(dl_task_of(dl_se)); 282 + } else { 283 + /* 284 + * Since "dl_non_contending" is not set, the 285 + * task's utilization has already been removed from 286 + * active utilization (either when the task blocked, 287 + * or when the "inactive timer" fired). 288 + * So, add it back. 289 + */ 290 + add_running_bw(dl_se->dl_bw, dl_rq); 291 + } 45 292 } 46 293 47 294 static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq) ··· 332 83 #else 333 84 init_dl_bw(&dl_rq->dl_bw); 334 85 #endif 86 + 87 + dl_rq->running_bw = 0; 88 + dl_rq->this_bw = 0; 89 + init_dl_rq_bw_ratio(dl_rq); 335 90 } 336 91 337 92 #ifdef CONFIG_SMP ··· 737 484 } 738 485 739 486 /* 740 - * When a -deadline entity is queued back on the runqueue, its runtime and 741 - * deadline might need updating. 487 + * Revised wakeup rule [1]: For self-suspending tasks, rather than 488 + * re-initializing the task's runtime and deadline, the revised wakeup 489 + * rule adjusts the task's runtime to keep the task from overrunning its 490 + * density. 742 491 * 743 - * The policy here is that we update the deadline of the entity only if: 744 - * - the current deadline is in the past, 745 - * - using the remaining runtime with the current deadline would make 746 - * the entity exceed its bandwidth. 
492 + * Reasoning: a task may overrun the density if: 493 + * runtime / (deadline - t) > dl_runtime / dl_deadline 494 + * 495 + * Therefore, runtime can be adjusted to: 496 + * runtime = (dl_runtime / dl_deadline) * (deadline - t) 497 + * 498 + * This way, runtime will be equal to the maximum density 499 + * the task can use without breaking any rule. 500 + * 501 + * [1] Luca Abeni, Giuseppe Lipari, and Juri Lelli. 2015. Constant 502 + * bandwidth server revisited. SIGBED Rev. 11, 4 (January 2015), 19-24. 503 + */ 504 + static void 505 + update_dl_revised_wakeup(struct sched_dl_entity *dl_se, struct rq *rq) 506 + { 507 + u64 laxity = dl_se->deadline - rq_clock(rq); 508 + 509 + /* 510 + * If the task has deadline < period, and the deadline is in the past, 511 + * it should already be throttled before this check. 512 + * 513 + * See update_dl_entity() comments for further details. 514 + */ 515 + WARN_ON(dl_time_before(dl_se->deadline, rq_clock(rq))); 516 + 517 + dl_se->runtime = (dl_se->dl_density * laxity) >> BW_SHIFT; 518 + } 519 + 520 + /* 521 + * Regarding the deadline, a task with an implicit deadline has a relative 522 + * deadline == relative period. A task with a constrained deadline has a 523 + * relative deadline <= relative period. 524 + * 525 + * We support constrained deadline tasks. However, there are some restrictions 526 + * applied only to tasks which do not have an implicit deadline. See 527 + * update_dl_entity() to learn more about such restrictions. 528 + * 529 + * dl_is_implicit() returns true if the task has an implicit deadline. 530 + */ 531 + static inline bool dl_is_implicit(struct sched_dl_entity *dl_se) 532 + { 533 + return dl_se->dl_deadline == dl_se->dl_period; 534 + } 535 + 536 + /* 537 + * When a deadline entity is placed in the runqueue, its runtime and deadline 538 + * might need to be updated. This is done by a CBS wakeup rule. There are two 539 + * different rules: 1) the original CBS; and 2) the Revisited CBS. 
540 + * 541 + * When the task is starting a new period, the Original CBS is used. In this 542 + * case, the runtime is replenished and a new absolute deadline is set. 543 + * 544 + * When a task is queued before the beginning of the next period, using the 545 + * remaining runtime and deadline could make the entity overflow; see 546 + * dl_entity_overflow() to learn more about runtime overflow. When such a case 547 + * is detected, the runtime and deadline need to be updated. 548 + * 549 + * If the task has an implicit deadline, i.e., deadline == period, the Original 550 + * CBS is applied: the runtime is replenished and a new absolute deadline is 551 + * set, as in the previous cases. 552 + * 553 + * However, the Original CBS does not work properly for tasks with 554 + * deadline < period, which are said to have a constrained deadline. By 555 + * applying the Original CBS, a constrained deadline task would be able to run 556 + * runtime/deadline in a period. With deadline < period, the task would 557 + * overrun the runtime/period allowed bandwidth, breaking the admission test. 558 + * 559 + * In order to prevent this misbehavior, the Revisited CBS is used for 560 + * constrained deadline tasks when a runtime overflow is detected. In the 561 + * Revisited CBS, rather than replenishing & setting a new absolute deadline, 562 + * the remaining runtime of the task is reduced to avoid runtime overflow. 563 + * Please refer to the comments of the update_dl_revised_wakeup() function to 564 + * learn more about the Revised CBS rule. 
747 565 */ 748 566 static void update_dl_entity(struct sched_dl_entity *dl_se, 749 567 struct sched_dl_entity *pi_se) ··· 824 500 825 501 if (dl_time_before(dl_se->deadline, rq_clock(rq)) || 826 502 dl_entity_overflow(dl_se, pi_se, rq_clock(rq))) { 503 + 504 + if (unlikely(!dl_is_implicit(dl_se) && 505 + !dl_time_before(dl_se->deadline, rq_clock(rq)) && 506 + !dl_se->dl_boosted)){ 507 + update_dl_revised_wakeup(dl_se, rq); 508 + return; 509 + } 510 + 827 511 dl_se->deadline = rq_clock(rq) + pi_se->dl_deadline; 828 512 dl_se->runtime = pi_se->dl_runtime; 829 513 } ··· 925 593 * The task might have changed its scheduling policy to something 926 594 * different than SCHED_DEADLINE (through switched_from_dl()). 927 595 */ 928 - if (!dl_task(p)) { 929 - __dl_clear_params(p); 596 + if (!dl_task(p)) 930 597 goto unlock; 931 - } 932 598 933 599 /* 934 600 * The task might have been boosted by someone else and might be in the ··· 1053 723 if (unlikely(dl_se->dl_boosted || !start_dl_timer(p))) 1054 724 return; 1055 725 dl_se->dl_throttled = 1; 726 + if (dl_se->runtime > 0) 727 + dl_se->runtime = 0; 1056 728 } 1057 729 } 1058 730 ··· 1065 733 } 1066 734 1067 735 extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq); 736 + 737 + /* 738 + * This function implements the GRUB accounting rule: 739 + * according to the GRUB reclaiming algorithm, the runtime is 740 + * not decreased as "dq = -dt", but as 741 + * "dq = -max{u / Umax, (1 - Uinact - Uextra)} dt", 742 + * where u is the utilization of the task, Umax is the maximum reclaimable 743 + * utilization, Uinact is the (per-runqueue) inactive utilization, computed 744 + * as the difference between the "total runqueue utilization" and the 745 + * runqueue active utilization, and Uextra is the (per runqueue) extra 746 + * reclaimable utilization. 747 + * Since rq->dl.running_bw and rq->dl.this_bw contain utilizations 748 + * multiplied by 2^BW_SHIFT, the result has to be shifted right by 749 + * BW_SHIFT. 
750 + * Since rq->dl.bw_ratio contains 1 / Umax multiplied by 2^RATIO_SHIFT, 751 + * dl_bw is multiplied by rq->dl.bw_ratio and shifted right by RATIO_SHIFT. 752 + * Since delta is a 64 bit variable, to have an overflow its value 753 + * should be larger than 2^(64 - 20 - 8), which is more than 64 seconds. 754 + * So, overflow is not an issue here. 755 + */ 756 + u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se) 757 + { 758 + u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */ 759 + u64 u_act; 760 + u64 u_act_min = (dl_se->dl_bw * rq->dl.bw_ratio) >> RATIO_SHIFT; 761 + 762 + /* 763 + * Instead of computing max{u * bw_ratio, (1 - u_inact - u_extra)}, 764 + * we compare u_inact + rq->dl.extra_bw with 765 + * 1 - (u * rq->dl.bw_ratio >> RATIO_SHIFT), because 766 + * u_inact + rq->dl.extra_bw can be larger than 767 + * 1 * (so, 1 - u_inact - rq->dl.extra_bw would be negative 768 + * leading to wrong results) 769 + */ 770 + if (u_inact + rq->dl.extra_bw > BW_UNIT - u_act_min) 771 + u_act = u_act_min; 772 + else 773 + u_act = BW_UNIT - u_inact - rq->dl.extra_bw; 774 + 775 + return (delta * u_act) >> BW_SHIFT; 776 + } 1068 777 1069 778 /* 1070 779 * Update the current task's runtime statistics (provided it is still
p->state == TASK_DEAD) { 830 + struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); 831 + 832 + if (p->state == TASK_DEAD && dl_se->dl_non_contending) { 833 + sub_running_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl)); 834 + sub_rq_bw(p->dl.dl_bw, dl_rq_of_se(&p->dl)); 835 + dl_se->dl_non_contending = 0; 836 + } 837 + 838 + raw_spin_lock(&dl_b->lock); 839 + __dl_clear(dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p))); 840 + raw_spin_unlock(&dl_b->lock); 841 + __dl_clear_params(p); 842 + 843 + goto unlock; 844 + } 845 + if (dl_se->dl_non_contending == 0) 846 + goto unlock; 847 + 848 + sched_clock_tick(); 849 + update_rq_clock(rq); 850 + 851 + sub_running_bw(dl_se->dl_bw, &rq->dl); 852 + dl_se->dl_non_contending = 0; 853 + unlock: 854 + task_rq_unlock(rq, p, &rf); 855 + put_task_struct(p); 856 + 857 + return HRTIMER_NORESTART; 858 + } 859 + 860 + void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se) 861 + { 862 + struct hrtimer *timer = &dl_se->inactive_timer; 863 + 864 + hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); 865 + timer->function = inactive_task_timer; 1191 866 } 1192 867 1193 868 #ifdef CONFIG_SMP ··· 1371 946 * parameters of the task might need updating. Otherwise, 1372 947 * we want a replenishment of its runtime. 
1373 948 */ 1374 - if (flags & ENQUEUE_WAKEUP) 949 + if (flags & ENQUEUE_WAKEUP) { 950 + task_contending(dl_se, flags); 1375 951 update_dl_entity(dl_se, pi_se); 1376 - else if (flags & ENQUEUE_REPLENISH) 952 + } else if (flags & ENQUEUE_REPLENISH) { 1377 953 replenish_dl_entity(dl_se, pi_se); 954 + } 1378 955 1379 956 __enqueue_dl_entity(dl_se); 1380 957 } ··· 1384 957 static void dequeue_dl_entity(struct sched_dl_entity *dl_se) 1385 958 { 1386 959 __dequeue_dl_entity(dl_se); 1387 - } 1388 - 1389 - static inline bool dl_is_constrained(struct sched_dl_entity *dl_se) 1390 - { 1391 - return dl_se->dl_deadline < dl_se->dl_period; 1392 960 } 1393 961 1394 962 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags) ··· 1417 995 * If that is the case, the task will be throttled and 1418 996 * the replenishment timer will be set to the next period. 1419 997 */ 1420 - if (!p->dl.dl_throttled && dl_is_constrained(&p->dl)) 998 + if (!p->dl.dl_throttled && !dl_is_implicit(&p->dl)) 1421 999 dl_check_constrained_dl(&p->dl); 1422 1000 1001 + if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & ENQUEUE_RESTORE) { 1002 + add_rq_bw(p->dl.dl_bw, &rq->dl); 1003 + add_running_bw(p->dl.dl_bw, &rq->dl); 1004 + } 1005 + 1423 1006 /* 1424 - * If p is throttled, we do nothing. In fact, if it exhausted 1007 + * If p is throttled, we do not enqueue it. In fact, if it exhausted 1425 1008 * its budget it needs a replenishment and, since it now is on 1426 1009 * its rq, the bandwidth timer callback (which clearly has not 1427 1010 * run yet) will take care of this. 1011 + * However, the active utilization does not depend on the fact 1012 + * that the task is on the runqueue or not (but depends on the 1013 + * task's state - in GRUB parlance, "inactive" vs "active contending"). 1014 + * In other words, even if a task is throttled its utilization must 1015 + * be counted in the active utilization; hence, we need to call 1016 + * add_running_bw(). 
1428 1017 */ 1429 - if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) 1018 + if (p->dl.dl_throttled && !(flags & ENQUEUE_REPLENISH)) { 1019 + if (flags & ENQUEUE_WAKEUP) 1020 + task_contending(&p->dl, flags); 1021 + 1430 1022 return; 1023 + } 1431 1024 1432 1025 enqueue_dl_entity(&p->dl, pi_se, flags); 1433 1026 ··· 1460 1023 { 1461 1024 update_curr_dl(rq); 1462 1025 __dequeue_task_dl(rq, p, flags); 1026 + 1027 + if (p->on_rq == TASK_ON_RQ_MIGRATING || flags & DEQUEUE_SAVE) { 1028 + sub_running_bw(p->dl.dl_bw, &rq->dl); 1029 + sub_rq_bw(p->dl.dl_bw, &rq->dl); 1030 + } 1031 + 1032 + /* 1033 + * This check allows to start the inactive timer (or to immediately 1034 + * decrease the active utilization, if needed) in two cases: 1035 + * when the task blocks and when it is terminating 1036 + * (p->state == TASK_DEAD). We can handle the two cases in the same 1037 + * way, because from GRUB's point of view the same thing is happening 1038 + * (the task moves from "active contending" to "active non contending" 1039 + * or "inactive") 1040 + */ 1041 + if (flags & DEQUEUE_SLEEP) 1042 + task_non_contending(p); 1463 1043 } 1464 1044 1465 1045 /* ··· 1552 1098 1553 1099 out: 1554 1100 return cpu; 1101 + } 1102 + 1103 + static void migrate_task_rq_dl(struct task_struct *p) 1104 + { 1105 + struct rq *rq; 1106 + 1107 + if (p->state != TASK_WAKING) 1108 + return; 1109 + 1110 + rq = task_rq(p); 1111 + /* 1112 + * Since p->state == TASK_WAKING, set_task_cpu() has been called 1113 + * from try_to_wake_up(). Hence, p->pi_lock is locked, but 1114 + * rq->lock is not... 
So, lock it 1115 + */ 1116 + raw_spin_lock(&rq->lock); 1117 + if (p->dl.dl_non_contending) { 1118 + sub_running_bw(p->dl.dl_bw, &rq->dl); 1119 + p->dl.dl_non_contending = 0; 1120 + /* 1121 + * If the timer handler is currently running and the 1122 + * timer cannot be cancelled, inactive_task_timer() 1123 + * will see that dl_not_contending is not set, and 1124 + * will not touch the rq's active utilization, 1125 + * so we are still safe. 1126 + */ 1127 + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) 1128 + put_task_struct(p); 1129 + } 1130 + sub_rq_bw(p->dl.dl_bw, &rq->dl); 1131 + raw_spin_unlock(&rq->lock); 1555 1132 } 1556 1133 1557 1134 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) ··· 1738 1253 * SCHED_DEADLINE tasks cannot fork and this is achieved through 1739 1254 * sched_fork() 1740 1255 */ 1741 - } 1742 - 1743 - static void task_dead_dl(struct task_struct *p) 1744 - { 1745 - struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); 1746 - 1747 - /* 1748 - * Since we are TASK_DEAD we won't slip out of the domain! 1749 - */ 1750 - raw_spin_lock_irq(&dl_b->lock); 1751 - /* XXX we should retain the bw until 0-lag */ 1752 - dl_b->total_bw -= p->dl.dl_bw; 1753 - raw_spin_unlock_irq(&dl_b->lock); 1754 1256 } 1755 1257 1756 1258 static void set_curr_task_dl(struct rq *rq) ··· 2005 1533 * then possible that next_task has migrated. 2006 1534 */ 2007 1535 task = pick_next_pushable_dl_task(rq); 2008 - if (task_cpu(next_task) == rq->cpu && task == next_task) { 1536 + if (task == next_task) { 2009 1537 /* 2010 1538 * The task is still there. We don't try 2011 1539 * again, some other cpu will pull it when ready. 
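The migrate_task_rq_dl() hunk above removes the waking task's utilization from the source runqueue before the task is placed elsewhere, and the push/pull paths pair every set_task_cpu() with matching sub/add calls on both bandwidth counters. A rough standalone model of that bookkeeping (plain C; `struct dl_rq` here is a two-field stand-in for the kernel structure, and locking plus the SCHED_WARN_ON() checks are omitted):

```c
#include <assert.h>
#include <stdint.h>

/* Two-field stand-in for the kernel's struct dl_rq (assumption). */
struct dl_rq {
	uint64_t running_bw;	/* utilization of "ACTIVE" tasks */
	uint64_t this_bw;	/* utilization of all tasks on this rq */
};

/* Underflow-clamped subtraction, mirroring sub_running_bw()/sub_rq_bw(). */
static void sub_bw(uint64_t *bw, uint64_t dl_bw)
{
	uint64_t old = *bw;

	*bw -= dl_bw;
	if (*bw > old)	/* wrapped around: clamp to 0, as the kernel does */
		*bw = 0;
}

/*
 * Moving a queued task between runqueues: both counters leave the
 * source and join the destination, so the sum over all runqueues is
 * preserved by a migration.
 */
static void migrate_bw(struct dl_rq *src, struct dl_rq *dst, uint64_t dl_bw)
{
	sub_bw(&src->running_bw, dl_bw);
	sub_bw(&src->this_bw, dl_bw);
	dst->this_bw += dl_bw;
	dst->running_bw += dl_bw;
}
```

The clamp on underflow is a sketch of the kernel behavior, which warns via SCHED_WARN_ON() and resets rather than letting an unsigned counter wrap.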
··· 2023 1551 } 2024 1552 2025 1553 deactivate_task(rq, next_task, 0); 1554 + sub_running_bw(next_task->dl.dl_bw, &rq->dl); 1555 + sub_rq_bw(next_task->dl.dl_bw, &rq->dl); 2026 1556 set_task_cpu(next_task, later_rq->cpu); 1557 + add_rq_bw(next_task->dl.dl_bw, &later_rq->dl); 1558 + add_running_bw(next_task->dl.dl_bw, &later_rq->dl); 2027 1559 activate_task(later_rq, next_task, 0); 2028 1560 ret = 1; 2029 1561 ··· 2115 1639 resched = true; 2116 1640 2117 1641 deactivate_task(src_rq, p, 0); 1642 + sub_running_bw(p->dl.dl_bw, &src_rq->dl); 1643 + sub_rq_bw(p->dl.dl_bw, &src_rq->dl); 2118 1644 set_task_cpu(p, this_cpu); 1645 + add_rq_bw(p->dl.dl_bw, &this_rq->dl); 1646 + add_running_bw(p->dl.dl_bw, &this_rq->dl); 2119 1647 activate_task(this_rq, p, 0); 2120 1648 dmin = p->dl.deadline; 2121 1649 ··· 2175 1695 * until we complete the update. 2176 1696 */ 2177 1697 raw_spin_lock(&src_dl_b->lock); 2178 - __dl_clear(src_dl_b, p->dl.dl_bw); 1698 + __dl_clear(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p))); 2179 1699 raw_spin_unlock(&src_dl_b->lock); 2180 1700 } 2181 1701 ··· 2217 1737 static void switched_from_dl(struct rq *rq, struct task_struct *p) 2218 1738 { 2219 1739 /* 2220 - * Start the deadline timer; if we switch back to dl before this we'll 2221 - * continue consuming our current CBS slice. If we stay outside of 2222 - * SCHED_DEADLINE until the deadline passes, the timer will reset the 2223 - * task. 1740 + * task_non_contending() can start the "inactive timer" (if the 0-lag 1741 + * time is in the future). If the task switches back to dl before 1742 + * the "inactive timer" fires, it can continue to consume its current 1743 + * runtime using its current deadline. If it stays outside of 1744 + * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer() 1745 + * will reset the task parameters. 
2224 1746 */ 2225 - if (!start_dl_timer(p)) 2226 - __dl_clear_params(p); 1747 + if (task_on_rq_queued(p) && p->dl.dl_runtime) 1748 + task_non_contending(p); 1749 + 1750 + if (!task_on_rq_queued(p)) 1751 + sub_rq_bw(p->dl.dl_bw, &rq->dl); 1752 + 1753 + /* 1754 + * We cannot use inactive_task_timer() to invoke sub_running_bw() 1755 + * at the 0-lag time, because the task could have been migrated 1756 + * while SCHED_OTHER in the meanwhile. 1757 + */ 1758 + if (p->dl.dl_non_contending) 1759 + p->dl.dl_non_contending = 0; 2227 1760 2228 1761 /* 2229 1762 * Since this might be the only -deadline task on the rq, ··· 2255 1762 */ 2256 1763 static void switched_to_dl(struct rq *rq, struct task_struct *p) 2257 1764 { 1765 + if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1) 1766 + put_task_struct(p); 2258 1767 2259 1768 /* If p is not queued we will update its parameters at next wakeup. */ 2260 - if (!task_on_rq_queued(p)) 2261 - return; 1769 + if (!task_on_rq_queued(p)) { 1770 + add_rq_bw(p->dl.dl_bw, &rq->dl); 2262 1771 1772 + return; 1773 + } 2263 1774 /* 2264 1775 * If p is boosted we already updated its params in 2265 1776 * rt_mutex_setprio()->enqueue_task(..., ENQUEUE_REPLENISH), ··· 2333 1836 2334 1837 #ifdef CONFIG_SMP 2335 1838 .select_task_rq = select_task_rq_dl, 1839 + .migrate_task_rq = migrate_task_rq_dl, 2336 1840 .set_cpus_allowed = set_cpus_allowed_dl, 2337 1841 .rq_online = rq_online_dl, 2338 1842 .rq_offline = rq_offline_dl, ··· 2343 1845 .set_curr_task = set_curr_task_dl, 2344 1846 .task_tick = task_tick_dl, 2345 1847 .task_fork = task_fork_dl, 2346 - .task_dead = task_dead_dl, 2347 1848 2348 1849 .prio_changed = prio_changed_dl, 2349 1850 .switched_from = switched_from_dl, ··· 2350 1853 2351 1854 .update_curr = update_curr_dl, 2352 1855 }; 1856 + 1857 + int sched_dl_global_validate(void) 1858 + { 1859 + u64 runtime = global_rt_runtime(); 1860 + u64 period = global_rt_period(); 1861 + u64 new_bw = to_ratio(period, runtime); 1862 + struct dl_bw *dl_b; 
1863 + int cpu, ret = 0; 1864 + unsigned long flags; 1865 + 1866 + /* 1867 + * Here we want to check that the bandwidth is not being set to some 1868 + * value smaller than the currently allocated bandwidth in 1869 + * any of the root_domains. 1870 + * 1871 + * FIXME: Cycling on all the CPUs is overdoing it, but simpler than 1872 + * cycling on root_domains... Discussion on different/better 1873 + * solutions is welcome! 1874 + */ 1875 + for_each_possible_cpu(cpu) { 1876 + rcu_read_lock_sched(); 1877 + dl_b = dl_bw_of(cpu); 1878 + 1879 + raw_spin_lock_irqsave(&dl_b->lock, flags); 1880 + if (new_bw < dl_b->total_bw) 1881 + ret = -EBUSY; 1882 + raw_spin_unlock_irqrestore(&dl_b->lock, flags); 1883 + 1884 + rcu_read_unlock_sched(); 1885 + 1886 + if (ret) 1887 + break; 1888 + } 1889 + 1890 + return ret; 1891 + } 1892 + 1893 + void init_dl_rq_bw_ratio(struct dl_rq *dl_rq) 1894 + { 1895 + if (global_rt_runtime() == RUNTIME_INF) { 1896 + dl_rq->bw_ratio = 1 << RATIO_SHIFT; 1897 + dl_rq->extra_bw = 1 << BW_SHIFT; 1898 + } else { 1899 + dl_rq->bw_ratio = to_ratio(global_rt_runtime(), 1900 + global_rt_period()) >> (BW_SHIFT - RATIO_SHIFT); 1901 + dl_rq->extra_bw = to_ratio(global_rt_period(), 1902 + global_rt_runtime()); 1903 + } 1904 + } 1905 + 1906 + void sched_dl_do_global(void) 1907 + { 1908 + u64 new_bw = -1; 1909 + struct dl_bw *dl_b; 1910 + int cpu; 1911 + unsigned long flags; 1912 + 1913 + def_dl_bandwidth.dl_period = global_rt_period(); 1914 + def_dl_bandwidth.dl_runtime = global_rt_runtime(); 1915 + 1916 + if (global_rt_runtime() != RUNTIME_INF) 1917 + new_bw = to_ratio(global_rt_period(), global_rt_runtime()); 1918 + 1919 + /* 1920 + * FIXME: As above... 
1921 + */ 1922 + for_each_possible_cpu(cpu) { 1923 + rcu_read_lock_sched(); 1924 + dl_b = dl_bw_of(cpu); 1925 + 1926 + raw_spin_lock_irqsave(&dl_b->lock, flags); 1927 + dl_b->bw = new_bw; 1928 + raw_spin_unlock_irqrestore(&dl_b->lock, flags); 1929 + 1930 + rcu_read_unlock_sched(); 1931 + init_dl_rq_bw_ratio(&cpu_rq(cpu)->dl); 1932 + } 1933 + } 1934 + 1935 + /* 1936 + * We must be sure that accepting a new task (or allowing changes to the 1937 + * parameters of an existing one) is consistent with the bandwidth 1938 + * constraints. If so, this function also updates the currently 1939 + * allocated bandwidth to reflect the new situation. 1940 + * 1941 + * This function is called while holding p's rq->lock. 1942 + */ 1943 + int sched_dl_overflow(struct task_struct *p, int policy, 1944 + const struct sched_attr *attr) 1945 + { 1946 + struct dl_bw *dl_b = dl_bw_of(task_cpu(p)); 1947 + u64 period = attr->sched_period ?: attr->sched_deadline; 1948 + u64 runtime = attr->sched_runtime; 1949 + u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0; 1950 + int cpus, err = -1; 1951 + 1952 + /* !deadline task may carry old deadline bandwidth */ 1953 + if (new_bw == p->dl.dl_bw && task_has_dl_policy(p)) 1954 + return 0; 1955 + 1956 + /* 1957 + * Whether a task enters, leaves, or stays -deadline but changes 1958 + * its parameters, we may need to update the total 1959 + * allocated bandwidth of the container accordingly. 
1960 + */ 1961 + raw_spin_lock(&dl_b->lock); 1962 + cpus = dl_bw_cpus(task_cpu(p)); 1963 + if (dl_policy(policy) && !task_has_dl_policy(p) && 1964 + !__dl_overflow(dl_b, cpus, 0, new_bw)) { 1965 + if (hrtimer_active(&p->dl.inactive_timer)) 1966 + __dl_clear(dl_b, p->dl.dl_bw, cpus); 1967 + __dl_add(dl_b, new_bw, cpus); 1968 + err = 0; 1969 + } else if (dl_policy(policy) && task_has_dl_policy(p) && 1970 + !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) { 1971 + /* 1972 + * XXX this is slightly incorrect: when the task 1973 + * utilization decreases, we should delay the total 1974 + * utilization change until the task's 0-lag point. 1975 + * But this would require to set the task's "inactive 1976 + * timer" when the task is not inactive. 1977 + */ 1978 + __dl_clear(dl_b, p->dl.dl_bw, cpus); 1979 + __dl_add(dl_b, new_bw, cpus); 1980 + dl_change_utilization(p, new_bw); 1981 + err = 0; 1982 + } else if (!dl_policy(policy) && task_has_dl_policy(p)) { 1983 + /* 1984 + * Do not decrease the total deadline utilization here, 1985 + * switched_from_dl() will take care to do it at the correct 1986 + * (0-lag) time. 1987 + */ 1988 + err = 0; 1989 + } 1990 + raw_spin_unlock(&dl_b->lock); 1991 + 1992 + return err; 1993 + } 1994 + 1995 + /* 1996 + * This function initializes the sched_dl_entity of a newly becoming 1997 + * SCHED_DEADLINE task. 1998 + * 1999 + * Only the static values are considered here, the actual runtime and the 2000 + * absolute deadline will be properly calculated when the task is enqueued 2001 + * for the first time with its new policy. 
2002 + */ 2003 + void __setparam_dl(struct task_struct *p, const struct sched_attr *attr) 2004 + { 2005 + struct sched_dl_entity *dl_se = &p->dl; 2006 + 2007 + dl_se->dl_runtime = attr->sched_runtime; 2008 + dl_se->dl_deadline = attr->sched_deadline; 2009 + dl_se->dl_period = attr->sched_period ?: dl_se->dl_deadline; 2010 + dl_se->flags = attr->sched_flags; 2011 + dl_se->dl_bw = to_ratio(dl_se->dl_period, dl_se->dl_runtime); 2012 + dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime); 2013 + } 2014 + 2015 + void __getparam_dl(struct task_struct *p, struct sched_attr *attr) 2016 + { 2017 + struct sched_dl_entity *dl_se = &p->dl; 2018 + 2019 + attr->sched_priority = p->rt_priority; 2020 + attr->sched_runtime = dl_se->dl_runtime; 2021 + attr->sched_deadline = dl_se->dl_deadline; 2022 + attr->sched_period = dl_se->dl_period; 2023 + attr->sched_flags = dl_se->flags; 2024 + } 2025 + 2026 + /* 2027 + * This function validates the new parameters of a -deadline task. 2028 + * We require the deadline to be nonzero and greater than or equal 2029 + * to the runtime, and the period to be either zero or greater than 2030 + * or equal to the deadline. Furthermore, we have to be sure that 2031 + * user parameters are above the internal resolution of 1us (we 2032 + * check sched_runtime only since it is always the smaller one) and 2033 + * below 2^63 ns (we have to check both sched_deadline and 2034 + * sched_period, as the latter can be zero). 2035 + */ 2036 + bool __checkparam_dl(const struct sched_attr *attr) 2037 + { 2038 + /* deadline != 0 */ 2039 + if (attr->sched_deadline == 0) 2040 + return false; 2041 + 2042 + /* 2043 + * Since we truncate DL_SCALE bits, make sure we're at least 2044 + * that big. 2045 + */ 2046 + if (attr->sched_runtime < (1ULL << DL_SCALE)) 2047 + return false; 2048 + 2049 + /* 2050 + * Since we use the MSB for wrap-around and sign issues, make 2051 + * sure it's not set (mind that period can be equal to zero). 
2052 + */ 2053 + if (attr->sched_deadline & (1ULL << 63) || 2054 + attr->sched_period & (1ULL << 63)) 2055 + return false; 2056 + 2057 + /* runtime <= deadline <= period (if period != 0) */ 2058 + if ((attr->sched_period != 0 && 2059 + attr->sched_period < attr->sched_deadline) || 2060 + attr->sched_deadline < attr->sched_runtime) 2061 + return false; 2062 + 2063 + return true; 2064 + } 2065 + 2066 + /* 2067 + * This function clears the sched_dl_entity static params. 2068 + */ 2069 + void __dl_clear_params(struct task_struct *p) 2070 + { 2071 + struct sched_dl_entity *dl_se = &p->dl; 2072 + 2073 + dl_se->dl_runtime = 0; 2074 + dl_se->dl_deadline = 0; 2075 + dl_se->dl_period = 0; 2076 + dl_se->flags = 0; 2077 + dl_se->dl_bw = 0; 2078 + dl_se->dl_density = 0; 2079 + 2080 + dl_se->dl_throttled = 0; 2081 + dl_se->dl_yielded = 0; 2082 + dl_se->dl_non_contending = 0; 2083 + } 2084 + 2085 + bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr) 2086 + { 2087 + struct sched_dl_entity *dl_se = &p->dl; 2088 + 2089 + if (dl_se->dl_runtime != attr->sched_runtime || 2090 + dl_se->dl_deadline != attr->sched_deadline || 2091 + dl_se->dl_period != attr->sched_period || 2092 + dl_se->flags != attr->sched_flags) 2093 + return true; 2094 + 2095 + return false; 2096 + } 2097 + 2098 + #ifdef CONFIG_SMP 2099 + int dl_task_can_attach(struct task_struct *p, const struct cpumask *cs_cpus_allowed) 2100 + { 2101 + unsigned int dest_cpu = cpumask_any_and(cpu_active_mask, 2102 + cs_cpus_allowed); 2103 + struct dl_bw *dl_b; 2104 + bool overflow; 2105 + int cpus, ret; 2106 + unsigned long flags; 2107 + 2108 + rcu_read_lock_sched(); 2109 + dl_b = dl_bw_of(dest_cpu); 2110 + raw_spin_lock_irqsave(&dl_b->lock, flags); 2111 + cpus = dl_bw_cpus(dest_cpu); 2112 + overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw); 2113 + if (overflow) 2114 + ret = -EBUSY; 2115 + else { 2116 + /* 2117 + * We reserve space for this task in the destination 2118 + * root_domain, as we can't fail 
after this point. 2119 + * We will free resources in the source root_domain 2120 + * later on (see set_cpus_allowed_dl()). 2121 + */ 2122 + __dl_add(dl_b, p->dl.dl_bw, cpus); 2123 + ret = 0; 2124 + } 2125 + raw_spin_unlock_irqrestore(&dl_b->lock, flags); 2126 + rcu_read_unlock_sched(); 2127 + return ret; 2128 + } 2129 + 2130 + int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, 2131 + const struct cpumask *trial) 2132 + { 2133 + int ret = 1, trial_cpus; 2134 + struct dl_bw *cur_dl_b; 2135 + unsigned long flags; 2136 + 2137 + rcu_read_lock_sched(); 2138 + cur_dl_b = dl_bw_of(cpumask_any(cur)); 2139 + trial_cpus = cpumask_weight(trial); 2140 + 2141 + raw_spin_lock_irqsave(&cur_dl_b->lock, flags); 2142 + if (cur_dl_b->bw != -1 && 2143 + cur_dl_b->bw * trial_cpus < cur_dl_b->total_bw) 2144 + ret = 0; 2145 + raw_spin_unlock_irqrestore(&cur_dl_b->lock, flags); 2146 + rcu_read_unlock_sched(); 2147 + return ret; 2148 + } 2149 + 2150 + bool dl_cpu_busy(unsigned int cpu) 2151 + { 2152 + unsigned long flags; 2153 + struct dl_bw *dl_b; 2154 + bool overflow; 2155 + int cpus; 2156 + 2157 + rcu_read_lock_sched(); 2158 + dl_b = dl_bw_of(cpu); 2159 + raw_spin_lock_irqsave(&dl_b->lock, flags); 2160 + cpus = dl_bw_cpus(cpu); 2161 + overflow = __dl_overflow(dl_b, cpus, 0, 0); 2162 + raw_spin_unlock_irqrestore(&dl_b->lock, flags); 2163 + rcu_read_unlock_sched(); 2164 + return overflow; 2165 + } 2166 + #endif 2353 2167 2354 2168 #ifdef CONFIG_SCHED_DEBUG 2355 2169 extern void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq);
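The deadline.c diff above gates admission on fixed-point bandwidths produced by to_ratio(). A minimal userspace sketch of that math (plain C; this to_ratio() reimplements the kernel's semantics with a plain division instead of div64_u64(), and checkparam_dl() restates the __checkparam_dl() rules with scalar parameters, both assumptions for the sake of a self-contained example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT	20	/* bandwidth fixed point, as in the kernel */
#define DL_SCALE	10	/* ~1us internal resolution */

/* Bandwidth as a BW_SHIFT-bit fixed-point fraction: runtime/period. */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)	/* guard for this sketch; kernel callers avoid this */
		return 0;
	return (runtime << BW_SHIFT) / period;
}

/* Restates the __checkparam_dl() validation rules from the hunk above. */
static bool checkparam_dl(uint64_t runtime, uint64_t deadline, uint64_t period)
{
	if (deadline == 0)
		return false;
	if (runtime < (1ULL << DL_SCALE))	/* below internal resolution */
		return false;
	if ((deadline >> 63) || (period >> 63))	/* MSB reserved for wrap-around */
		return false;
	/* runtime <= deadline <= period (if period != 0) */
	if ((period != 0 && period < deadline) || deadline < runtime)
		return false;
	return true;
}
```

For example, runtime=10ms, deadline=30ms, period=100ms passes, and its bandwidth is to_ratio(100ms, 10ms), i.e. one tenth of BW_UNIT; sched_dl_overflow() sums such fractions per root domain and rejects a set that exceeds capacity.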
+15 -2
kernel/sched/debug.c
··· 552 552 553 553 #define P(x) \ 554 554 SEQ_printf(m, " .%-30s: %Ld\n", #x, (long long)(rt_rq->x)) 555 + #define PU(x) \ 556 + SEQ_printf(m, " .%-30s: %lu\n", #x, (unsigned long)(rt_rq->x)) 555 557 #define PN(x) \ 556 558 SEQ_printf(m, " .%-30s: %Ld.%06ld\n", #x, SPLIT_NS(rt_rq->x)) 557 559 558 - P(rt_nr_running); 560 + PU(rt_nr_running); 561 + #ifdef CONFIG_SMP 562 + PU(rt_nr_migratory); 563 + #endif 559 564 P(rt_throttled); 560 565 PN(rt_time); 561 566 PN(rt_runtime); 562 567 563 568 #undef PN 569 + #undef PU 564 570 #undef P 565 571 } 566 572 ··· 575 569 struct dl_bw *dl_bw; 576 570 577 571 SEQ_printf(m, "\ndl_rq[%d]:\n", cpu); 578 - SEQ_printf(m, " .%-30s: %ld\n", "dl_nr_running", dl_rq->dl_nr_running); 572 + 573 + #define PU(x) \ 574 + SEQ_printf(m, " .%-30s: %lu\n", #x, (unsigned long)(dl_rq->x)) 575 + 576 + PU(dl_nr_running); 579 577 #ifdef CONFIG_SMP 578 + PU(dl_nr_migratory); 580 579 dl_bw = &cpu_rq(cpu)->rd->dl_bw; 581 580 #else 582 581 dl_bw = &dl_rq->dl_bw; 583 582 #endif 584 583 SEQ_printf(m, " .%-30s: %lld\n", "dl_bw->bw", dl_bw->bw); 585 584 SEQ_printf(m, " .%-30s: %lld\n", "dl_bw->total_bw", dl_bw->total_bw); 585 + 586 + #undef PU 586 587 } 587 588 588 589 extern __read_mostly int sched_clock_running;
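The new `PU()` macro in the debug.c hunk follows the same pattern as the existing `P()`/`PN()` macros: preprocessor stringification (`#x`) prints the field's name next to its value, so the label can never drift out of sync with the member being dumped. A small userspace sketch of the pattern (the struct and field names here are stand-ins, not the kernel's):

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the kernel's rt_rq; field names are illustrative only. */
struct rt_stats {
	unsigned long nr_running;
	long long throttled;
};

/*
 * Like the kernel's PU()/P() macros above: #x stringifies the member
 * name, so the printed label always matches the field being read.
 */
#define PU(buf, s, x) \
	sprintf((buf) + strlen(buf), " .%-30s: %lu\n", #x, \
		(unsigned long)((s)->x))
#define P(buf, s, x) \
	sprintf((buf) + strlen(buf), " .%-30s: %lld\n", #x, \
		(long long)((s)->x))
```

The kernel variant prints into a `seq_file` rather than a buffer, but the stringification trick is identical.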
+193 -258
kernel/sched/fair.c
··· 369 369 } 370 370 371 371 /* Iterate thr' all leaf cfs_rq's on a runqueue */ 372 - #define for_each_leaf_cfs_rq(rq, cfs_rq) \ 373 - list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list) 372 + #define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ 373 + list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list, \ 374 + leaf_cfs_rq_list) 374 375 375 376 /* Do the two (enqueued) entities belong to the same group ? */ 376 377 static inline struct cfs_rq * ··· 464 463 { 465 464 } 466 465 467 - #define for_each_leaf_cfs_rq(rq, cfs_rq) \ 468 - for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL) 466 + #define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) \ 467 + for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos) 469 468 470 469 static inline struct sched_entity *parent_entity(struct sched_entity *se) 471 470 { ··· 1382 1381 static unsigned long source_load(int cpu, int type); 1383 1382 static unsigned long target_load(int cpu, int type); 1384 1383 static unsigned long capacity_of(int cpu); 1385 - static long effective_load(struct task_group *tg, int cpu, long wl, long wg); 1386 1384 1387 1385 /* Cached statistics for all CPUs within a node */ 1388 1386 struct numa_stats { ··· 2469 2469 return; 2470 2470 2471 2471 2472 - down_read(&mm->mmap_sem); 2472 + if (!down_read_trylock(&mm->mmap_sem)) 2473 + return; 2473 2474 vma = find_vma(mm, start); 2474 2475 if (!vma) { 2475 2476 reset_ptenuma_scan(p); ··· 2585 2584 } 2586 2585 } 2587 2586 } 2587 + 2588 + /* 2589 + * Can a task be moved from prev_cpu to this_cpu without causing a load 2590 + * imbalance that would trigger the load balancer? 
2591 + */ 2592 + static inline bool numa_wake_affine(struct sched_domain *sd, 2593 + struct task_struct *p, int this_cpu, 2594 + int prev_cpu, int sync) 2595 + { 2596 + struct numa_stats prev_load, this_load; 2597 + s64 this_eff_load, prev_eff_load; 2598 + 2599 + update_numa_stats(&prev_load, cpu_to_node(prev_cpu)); 2600 + update_numa_stats(&this_load, cpu_to_node(this_cpu)); 2601 + 2602 + /* 2603 + * If sync wakeup then subtract the (maximum possible) 2604 + * effect of the currently running task from the load 2605 + * of the current CPU: 2606 + */ 2607 + if (sync) { 2608 + unsigned long current_load = task_h_load(current); 2609 + 2610 + if (this_load.load > current_load) 2611 + this_load.load -= current_load; 2612 + else 2613 + this_load.load = 0; 2614 + } 2615 + 2616 + /* 2617 + * In low-load situations, where this_cpu's node is idle due to the 2618 + * sync cause above having dropped this_load.load to 0, move the task. 2619 + * Moving to an idle socket will not create a bad imbalance. 2620 + * 2621 + * Otherwise check if the nodes are near enough in load to allow this 2622 + * task to be woken on this_cpu's node. 
2623 + */ 2624 + if (this_load.load > 0) { 2625 + unsigned long task_load = task_h_load(p); 2626 + 2627 + this_eff_load = 100; 2628 + this_eff_load *= prev_load.compute_capacity; 2629 + 2630 + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; 2631 + prev_eff_load *= this_load.compute_capacity; 2632 + 2633 + this_eff_load *= this_load.load + task_load; 2634 + prev_eff_load *= prev_load.load - task_load; 2635 + 2636 + return this_eff_load <= prev_eff_load; 2637 + } 2638 + 2639 + return true; 2640 + } 2588 2641 #else 2589 2642 static void task_tick_numa(struct rq *rq, struct task_struct *curr) 2590 2643 { ··· 2651 2596 static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) 2652 2597 { 2653 2598 } 2599 + 2600 + #ifdef CONFIG_SMP 2601 + static inline bool numa_wake_affine(struct sched_domain *sd, 2602 + struct task_struct *p, int this_cpu, 2603 + int prev_cpu, int sync) 2604 + { 2605 + return true; 2606 + } 2607 + #endif /* !SMP */ 2654 2608 #endif /* CONFIG_NUMA_BALANCING */ 2655 2609 2656 2610 static void ··· 2980 2916 /* 2981 2917 * Step 2: update *_avg. 2982 2918 */ 2983 - sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX); 2919 + sa->load_avg = div_u64(sa->load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); 2984 2920 if (cfs_rq) { 2985 2921 cfs_rq->runnable_load_avg = 2986 - div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX); 2922 + div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX - 1024 + sa->period_contrib); 2987 2923 } 2988 - sa->util_avg = sa->util_sum / LOAD_AVG_MAX; 2924 + sa->util_avg = sa->util_sum / (LOAD_AVG_MAX - 1024 + sa->period_contrib); 2989 2925 2990 2926 return 1; 2991 2927 } ··· 3046 2982 * differential update where we store the last value we propagated. This in 3047 2983 * turn allows skipping updates if the differential is 'small'. 3048 2984 * 3049 - * Updating tg's load_avg is necessary before update_cfs_share() (which is 3050 - * done) and effective_load() (which is not done because it is too costly). 
2985 + * Updating tg's load_avg is necessary before update_cfs_share(). 3051 2986 */ 3052 2987 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) 3053 2988 { ··· 4705 4642 hrtimer_cancel(&cfs_b->slack_timer); 4706 4643 } 4707 4644 4645 + /* 4646 + * Both these cpu hotplug callbacks race against unregister_fair_sched_group() 4647 + * 4648 + * The race is harmless, since modifying bandwidth settings of unhooked group 4649 + * bits doesn't do much. 4650 + */ 4651 + 4652 + /* cpu online calback */ 4708 4653 static void __maybe_unused update_runtime_enabled(struct rq *rq) 4709 4654 { 4710 - struct cfs_rq *cfs_rq; 4655 + struct task_group *tg; 4711 4656 4712 - for_each_leaf_cfs_rq(rq, cfs_rq) { 4713 - struct cfs_bandwidth *cfs_b = &cfs_rq->tg->cfs_bandwidth; 4657 + lockdep_assert_held(&rq->lock); 4658 + 4659 + rcu_read_lock(); 4660 + list_for_each_entry_rcu(tg, &task_groups, list) { 4661 + struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; 4662 + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; 4714 4663 4715 4664 raw_spin_lock(&cfs_b->lock); 4716 4665 cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF; 4717 4666 raw_spin_unlock(&cfs_b->lock); 4718 4667 } 4668 + rcu_read_unlock(); 4719 4669 } 4720 4670 4671 + /* cpu offline callback */ 4721 4672 static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq) 4722 4673 { 4723 - struct cfs_rq *cfs_rq; 4674 + struct task_group *tg; 4724 4675 4725 - for_each_leaf_cfs_rq(rq, cfs_rq) { 4676 + lockdep_assert_held(&rq->lock); 4677 + 4678 + rcu_read_lock(); 4679 + list_for_each_entry_rcu(tg, &task_groups, list) { 4680 + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; 4681 + 4726 4682 if (!cfs_rq->runtime_enabled) 4727 4683 continue; 4728 4684 ··· 4759 4677 if (cfs_rq_throttled(cfs_rq)) 4760 4678 unthrottle_cfs_rq(cfs_rq); 4761 4679 } 4680 + rcu_read_unlock(); 4762 4681 } 4763 4682 4764 4683 #else /* CONFIG_CFS_BANDWIDTH */ ··· 5298 5215 return 0; 5299 5216 } 5300 5217 5301 - #ifdef 
CONFIG_FAIR_GROUP_SCHED 5302 - /* 5303 - * effective_load() calculates the load change as seen from the root_task_group 5304 - * 5305 - * Adding load to a group doesn't make a group heavier, but can cause movement 5306 - * of group shares between cpus. Assuming the shares were perfectly aligned one 5307 - * can calculate the shift in shares. 5308 - * 5309 - * Calculate the effective load difference if @wl is added (subtracted) to @tg 5310 - * on this @cpu and results in a total addition (subtraction) of @wg to the 5311 - * total group weight. 5312 - * 5313 - * Given a runqueue weight distribution (rw_i) we can compute a shares 5314 - * distribution (s_i) using: 5315 - * 5316 - * s_i = rw_i / \Sum rw_j (1) 5317 - * 5318 - * Suppose we have 4 CPUs and our @tg is a direct child of the root group and 5319 - * has 7 equal weight tasks, distributed as below (rw_i), with the resulting 5320 - * shares distribution (s_i): 5321 - * 5322 - * rw_i = { 2, 4, 1, 0 } 5323 - * s_i = { 2/7, 4/7, 1/7, 0 } 5324 - * 5325 - * As per wake_affine() we're interested in the load of two CPUs (the CPU the 5326 - * task used to run on and the CPU the waker is running on), we need to 5327 - * compute the effect of waking a task on either CPU and, in case of a sync 5328 - * wakeup, compute the effect of the current task going to sleep. 5329 - * 5330 - * So for a change of @wl to the local @cpu with an overall group weight change 5331 - * of @wl we can compute the new shares distribution (s'_i) using: 5332 - * 5333 - * s'_i = (rw_i + @wl) / (@wg + \Sum rw_j) (2) 5334 - * 5335 - * Suppose we're interested in CPUs 0 and 1, and want to compute the load 5336 - * differences in waking a task to CPU 0. 
The additional task changes the 5337 - * weight and shares distributions like: 5338 - * 5339 - * rw'_i = { 3, 4, 1, 0 } 5340 - * s'_i = { 3/8, 4/8, 1/8, 0 } 5341 - * 5342 - * We can then compute the difference in effective weight by using: 5343 - * 5344 - * dw_i = S * (s'_i - s_i) (3) 5345 - * 5346 - * Where 'S' is the group weight as seen by its parent. 5347 - * 5348 - * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7) 5349 - * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 - 5350 - * 4/7) times the weight of the group. 5351 - */ 5352 - static long effective_load(struct task_group *tg, int cpu, long wl, long wg) 5353 - { 5354 - struct sched_entity *se = tg->se[cpu]; 5355 - 5356 - if (!tg->parent) /* the trivial, non-cgroup case */ 5357 - return wl; 5358 - 5359 - for_each_sched_entity(se) { 5360 - struct cfs_rq *cfs_rq = se->my_q; 5361 - long W, w = cfs_rq_load_avg(cfs_rq); 5362 - 5363 - tg = cfs_rq->tg; 5364 - 5365 - /* 5366 - * W = @wg + \Sum rw_j 5367 - */ 5368 - W = wg + atomic_long_read(&tg->load_avg); 5369 - 5370 - /* Ensure \Sum rw_j >= rw_i */ 5371 - W -= cfs_rq->tg_load_avg_contrib; 5372 - W += w; 5373 - 5374 - /* 5375 - * w = rw_i + @wl 5376 - */ 5377 - w += wl; 5378 - 5379 - /* 5380 - * wl = S * s'_i; see (2) 5381 - */ 5382 - if (W > 0 && w < W) 5383 - wl = (w * (long)scale_load_down(tg->shares)) / W; 5384 - else 5385 - wl = scale_load_down(tg->shares); 5386 - 5387 - /* 5388 - * Per the above, wl is the new se->load.weight value; since 5389 - * those are clipped to [MIN_SHARES, ...) do so now. See 5390 - * calc_cfs_shares(). 5391 - */ 5392 - if (wl < MIN_SHARES) 5393 - wl = MIN_SHARES; 5394 - 5395 - /* 5396 - * wl = dw_i = S * (s'_i - s_i); see (3) 5397 - */ 5398 - wl -= se->avg.load_avg; 5399 - 5400 - /* 5401 - * Recursively apply this logic to all parent groups to compute 5402 - * the final effective load change on the root group. 
Since 5403 - * only the @tg group gets extra weight, all parent groups can 5404 - * only redistribute existing shares. @wl is the shift in shares 5405 - * resulting from this level per the above. 5406 - */ 5407 - wg = 0; 5408 - } 5409 - 5410 - return wl; 5411 - } 5412 - #else 5413 - 5414 - static long effective_load(struct task_group *tg, int cpu, long wl, long wg) 5415 - { 5416 - return wl; 5417 - } 5418 - 5419 - #endif 5420 - 5421 5218 static void record_wakee(struct task_struct *p) 5422 5219 { 5423 5220 /* ··· 5348 5385 static int wake_affine(struct sched_domain *sd, struct task_struct *p, 5349 5386 int prev_cpu, int sync) 5350 5387 { 5351 - s64 this_load, load; 5352 - s64 this_eff_load, prev_eff_load; 5353 - int idx, this_cpu; 5354 - struct task_group *tg; 5355 - unsigned long weight; 5356 - int balanced; 5357 - 5358 - idx = sd->wake_idx; 5359 - this_cpu = smp_processor_id(); 5360 - load = source_load(prev_cpu, idx); 5361 - this_load = target_load(this_cpu, idx); 5388 + int this_cpu = smp_processor_id(); 5389 + bool affine = false; 5362 5390 5363 5391 /* 5364 - * If sync wakeup then subtract the (maximum possible) 5365 - * effect of the currently running task from the load 5366 - * of the current CPU: 5392 + * Common case: CPUs are in the same socket, and select_idle_sibling() 5393 + * will do its thing regardless of what we return: 5367 5394 */ 5368 - if (sync) { 5369 - tg = task_group(current); 5370 - weight = current->se.avg.load_avg; 5371 - 5372 - this_load += effective_load(tg, this_cpu, -weight, -weight); 5373 - load += effective_load(tg, prev_cpu, 0, -weight); 5374 - } 5375 - 5376 - tg = task_group(p); 5377 - weight = p->se.avg.load_avg; 5378 - 5379 - /* 5380 - * In low-load situations, where prev_cpu is idle and this_cpu is idle 5381 - * due to the sync cause above having dropped this_load to 0, we'll 5382 - * always have an imbalance, but there's really nothing you can do 5383 - * about that, so that's good too. 
5384 - * 5385 - * Otherwise check if either cpus are near enough in load to allow this 5386 - * task to be woken on this_cpu. 5387 - */ 5388 - this_eff_load = 100; 5389 - this_eff_load *= capacity_of(prev_cpu); 5390 - 5391 - prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; 5392 - prev_eff_load *= capacity_of(this_cpu); 5393 - 5394 - if (this_load > 0) { 5395 - this_eff_load *= this_load + 5396 - effective_load(tg, this_cpu, weight, weight); 5397 - 5398 - prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight); 5399 - } 5400 - 5401 - balanced = this_eff_load <= prev_eff_load; 5395 + if (cpus_share_cache(prev_cpu, this_cpu)) 5396 + affine = true; 5397 + else 5398 + affine = numa_wake_affine(sd, p, this_cpu, prev_cpu, sync); 5402 5399 5403 5400 schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts); 5401 + if (affine) { 5402 + schedstat_inc(sd->ttwu_move_affine); 5403 + schedstat_inc(p->se.statistics.nr_wakeups_affine); 5404 + } 5404 5405 5405 - if (!balanced) 5406 - return 0; 5407 - 5408 - schedstat_inc(sd->ttwu_move_affine); 5409 - schedstat_inc(p->se.statistics.nr_wakeups_affine); 5410 - 5411 - return 1; 5406 + return affine; 5412 5407 } 5413 5408 5414 5409 static inline int task_util(struct task_struct *p); ··· 5405 5484 int i; 5406 5485 5407 5486 /* Skip over this group if it has no CPUs allowed */ 5408 - if (!cpumask_intersects(sched_group_cpus(group), 5487 + if (!cpumask_intersects(sched_group_span(group), 5409 5488 &p->cpus_allowed)) 5410 5489 continue; 5411 5490 5412 5491 local_group = cpumask_test_cpu(this_cpu, 5413 - sched_group_cpus(group)); 5492 + sched_group_span(group)); 5414 5493 5415 5494 /* 5416 5495 * Tally up the load of all CPUs in the group and find ··· 5420 5499 runnable_load = 0; 5421 5500 max_spare_cap = 0; 5422 5501 5423 - for_each_cpu(i, sched_group_cpus(group)) { 5502 + for_each_cpu(i, sched_group_span(group)) { 5424 5503 /* Bias balancing toward cpus of our domain */ 5425 5504 if (local_group) 5426 5505 load = 
source_load(i, load_idx); ··· 5523 5602 5524 5603 /* Check if we have any choice: */ 5525 5604 if (group->group_weight == 1) 5526 - return cpumask_first(sched_group_cpus(group)); 5605 + return cpumask_first(sched_group_span(group)); 5527 5606 5528 5607 /* Traverse only the allowed CPUs */ 5529 - for_each_cpu_and(i, sched_group_cpus(group), &p->cpus_allowed) { 5608 + for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) { 5530 5609 if (idle_cpu(i)) { 5531 5610 struct rq *rq = cpu_rq(i); 5532 5611 struct cpuidle_state *idle = idle_get_state(rq); ··· 5560 5639 5561 5640 return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu; 5562 5641 } 5563 - 5564 - /* 5565 - * Implement a for_each_cpu() variant that starts the scan at a given cpu 5566 - * (@start), and wraps around. 5567 - * 5568 - * This is used to scan for idle CPUs; such that not all CPUs looking for an 5569 - * idle CPU find the same CPU. The down-side is that tasks tend to cycle 5570 - * through the LLC domain. 5571 - * 5572 - * Especially tbench is found sensitive to this. 
5573 - */ 5574 - 5575 - static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped) 5576 - { 5577 - int next; 5578 - 5579 - again: 5580 - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1); 5581 - 5582 - if (*wrapped) { 5583 - if (next >= start) 5584 - return nr_cpumask_bits; 5585 - } else { 5586 - if (next >= nr_cpumask_bits) { 5587 - *wrapped = 1; 5588 - n = -1; 5589 - goto again; 5590 - } 5591 - } 5592 - 5593 - return next; 5594 - } 5595 - 5596 - #define for_each_cpu_wrap(cpu, mask, start, wrap) \ 5597 - for ((wrap) = 0, (cpu) = (start)-1; \ 5598 - (cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)), \ 5599 - (cpu) < nr_cpumask_bits; ) 5600 5642 5601 5643 #ifdef CONFIG_SCHED_SMT 5602 5644 ··· 5620 5736 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target) 5621 5737 { 5622 5738 struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); 5623 - int core, cpu, wrap; 5739 + int core, cpu; 5624 5740 5625 5741 if (!static_branch_likely(&sched_smt_present)) 5626 5742 return -1; ··· 5630 5746 5631 5747 cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed); 5632 5748 5633 - for_each_cpu_wrap(core, cpus, target, wrap) { 5749 + for_each_cpu_wrap(core, cpus, target) { 5634 5750 bool idle = true; 5635 5751 5636 5752 for_each_cpu(cpu, cpu_smt_mask(core)) { ··· 5693 5809 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target) 5694 5810 { 5695 5811 struct sched_domain *this_sd; 5696 - u64 avg_cost, avg_idle = this_rq()->avg_idle; 5812 + u64 avg_cost, avg_idle; 5697 5813 u64 time, cost; 5698 5814 s64 delta; 5699 - int cpu, wrap; 5815 + int cpu, nr = INT_MAX; 5700 5816 5701 5817 this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); 5702 5818 if (!this_sd) 5703 5819 return -1; 5704 5820 5705 - avg_cost = this_sd->avg_scan_cost; 5706 - 5707 5821 /* 5708 5822 * Due to large variance we need a large fuzz factor; hackbench in 5709 5823 * particularly is sensitive 
here. 5710 5824 */ 5711 - if (sched_feat(SIS_AVG_CPU) && (avg_idle / 512) < avg_cost) 5825 + avg_idle = this_rq()->avg_idle / 512; 5826 + avg_cost = this_sd->avg_scan_cost + 1; 5827 + 5828 + if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost) 5712 5829 return -1; 5830 + 5831 + if (sched_feat(SIS_PROP)) { 5832 + u64 span_avg = sd->span_weight * avg_idle; 5833 + if (span_avg > 4*avg_cost) 5834 + nr = div_u64(span_avg, avg_cost); 5835 + else 5836 + nr = 4; 5837 + } 5713 5838 5714 5839 time = local_clock(); 5715 5840 5716 - for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) { 5841 + for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { 5842 + if (!--nr) 5843 + return -1; 5717 5844 if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) 5718 5845 continue; 5719 5846 if (idle_cpu(cpu)) ··· 5906 6011 5907 6012 if (affine_sd) { 5908 6013 sd = NULL; /* Prefer wake_affine over balance flags */ 5909 - if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync)) 6014 + if (cpu == prev_cpu) 6015 + goto pick_cpu; 6016 + 6017 + if (wake_affine(affine_sd, p, prev_cpu, sync)) 5910 6018 new_cpu = cpu; 5911 6019 } 5912 6020 5913 6021 if (!sd) { 6022 + pick_cpu: 5914 6023 if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? 
*/ 5915 6024 new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); 5916 6025 ··· 6067 6168 if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE)) 6068 6169 return; 6069 6170 6070 - for_each_sched_entity(se) 6171 + for_each_sched_entity(se) { 6172 + if (SCHED_WARN_ON(!se->on_rq)) 6173 + return; 6071 6174 cfs_rq_of(se)->last = se; 6175 + } 6072 6176 } 6073 6177 6074 6178 static void set_next_buddy(struct sched_entity *se) ··· 6079 6177 if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE)) 6080 6178 return; 6081 6179 6082 - for_each_sched_entity(se) 6180 + for_each_sched_entity(se) { 6181 + if (SCHED_WARN_ON(!se->on_rq)) 6182 + return; 6083 6183 cfs_rq_of(se)->next = se; 6184 + } 6084 6185 } 6085 6186 6086 6187 static void set_skip_buddy(struct sched_entity *se) ··· 6591 6686 if (dst_nid == p->numa_preferred_nid) 6592 6687 return 0; 6593 6688 6689 + /* Leaving a core idle is often worse than degrading locality. */ 6690 + if (env->idle != CPU_NOT_IDLE) 6691 + return -1; 6692 + 6594 6693 if (numa_group) { 6595 6694 src_faults = group_faults(p, src_nid); 6596 6695 dst_faults = group_faults(p, dst_nid); ··· 6879 6970 } 6880 6971 6881 6972 #ifdef CONFIG_FAIR_GROUP_SCHED 6973 + 6974 + static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) 6975 + { 6976 + if (cfs_rq->load.weight) 6977 + return false; 6978 + 6979 + if (cfs_rq->avg.load_sum) 6980 + return false; 6981 + 6982 + if (cfs_rq->avg.util_sum) 6983 + return false; 6984 + 6985 + if (cfs_rq->runnable_load_sum) 6986 + return false; 6987 + 6988 + return true; 6989 + } 6990 + 6882 6991 static void update_blocked_averages(int cpu) 6883 6992 { 6884 6993 struct rq *rq = cpu_rq(cpu); 6885 - struct cfs_rq *cfs_rq; 6994 + struct cfs_rq *cfs_rq, *pos; 6886 6995 struct rq_flags rf; 6887 6996 6888 6997 rq_lock_irqsave(rq, &rf); ··· 6910 6983 * Iterates the task_group tree in a bottom up fashion, see 6911 6984 * list_add_leaf_cfs_rq() for details. 
6912 6985 */ 6913 - for_each_leaf_cfs_rq(rq, cfs_rq) { 6986 + for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) { 6914 6987 struct sched_entity *se; 6915 6988 6916 6989 /* throttled entities do not contribute to load */ ··· 6924 6997 se = cfs_rq->tg->se[cpu]; 6925 6998 if (se && !skip_blocked_update(se)) 6926 6999 update_load_avg(se, 0); 7000 + 7001 + /* 7002 + * There can be a lot of idle CPU cgroups. Don't let fully 7003 + * decayed cfs_rqs linger on the list. 7004 + */ 7005 + if (cfs_rq_is_decayed(cfs_rq)) 7006 + list_del_leaf_cfs_rq(cfs_rq); 6927 7007 } 6928 7008 rq_unlock_irqrestore(rq, &rf); 6929 7009 } ··· 7163 7229 * span the current group. 7164 7230 */ 7165 7231 7166 - for_each_cpu(cpu, sched_group_cpus(sdg)) { 7232 + for_each_cpu(cpu, sched_group_span(sdg)) { 7167 7233 struct sched_group_capacity *sgc; 7168 7234 struct rq *rq = cpu_rq(cpu); 7169 7235 ··· 7342 7408 7343 7409 memset(sgs, 0, sizeof(*sgs)); 7344 7410 7345 - for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { 7411 + for_each_cpu_and(i, sched_group_span(group), env->cpus) { 7346 7412 struct rq *rq = cpu_rq(i); 7347 7413 7348 7414 /* Bias balancing toward cpus of our domain */ ··· 7506 7572 struct sg_lb_stats *sgs = &tmp_sgs; 7507 7573 int local_group; 7508 7574 7509 - local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg)); 7575 + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(sg)); 7510 7576 if (local_group) { 7511 7577 sds->local = sg; 7512 7578 sgs = local; ··· 7861 7927 unsigned long busiest_load = 0, busiest_capacity = 1; 7862 7928 int i; 7863 7929 7864 - for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { 7930 + for_each_cpu_and(i, sched_group_span(group), env->cpus) { 7865 7931 unsigned long capacity, wl; 7866 7932 enum fbq_type rt; 7867 7933 ··· 7967 8033 static int should_we_balance(struct lb_env *env) 7968 8034 { 7969 8035 struct sched_group *sg = env->sd->groups; 7970 - struct cpumask *sg_cpus, *sg_mask; 7971 8036 int cpu, balance_cpu = -1; 7972 8037 
7973 8038 /* ··· 7976 8043 if (env->idle == CPU_NEWLY_IDLE) 7977 8044 return 1; 7978 8045 7979 - sg_cpus = sched_group_cpus(sg); 7980 - sg_mask = sched_group_mask(sg); 7981 8046 /* Try to find first idle cpu */ 7982 - for_each_cpu_and(cpu, sg_cpus, env->cpus) { 7983 - if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu)) 8047 + for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) { 8048 + if (!idle_cpu(cpu)) 7984 8049 continue; 7985 8050 7986 8051 balance_cpu = cpu; ··· 8014 8083 .sd = sd, 8015 8084 .dst_cpu = this_cpu, 8016 8085 .dst_rq = this_rq, 8017 - .dst_grpmask = sched_group_cpus(sd->groups), 8086 + .dst_grpmask = sched_group_span(sd->groups), 8018 8087 .idle = idle, 8019 8088 .loop_break = sched_nr_migrate_break, 8020 8089 .cpus = cpus, ··· 8588 8657 * If this cpu is going down, then nothing needs to be done. 8589 8658 */ 8590 8659 if (!cpu_active(cpu)) 8660 + return; 8661 + 8662 + /* Spare idle load balancing on CPUs that don't want to be disturbed: */ 8663 + if (!is_housekeeping_cpu(cpu)) 8591 8664 return; 8592 8665 8593 8666 if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu))) ··· 9458 9523 #ifdef CONFIG_SCHED_DEBUG 9459 9524 void print_cfs_stats(struct seq_file *m, int cpu) 9460 9525 { 9461 - struct cfs_rq *cfs_rq; 9526 + struct cfs_rq *cfs_rq, *pos; 9462 9527 9463 9528 rcu_read_lock(); 9464 - for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq) 9529 + for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos) 9465 9530 print_cfs_rq(m, cpu, cfs_rq); 9466 9531 rcu_read_unlock(); 9467 9532 }
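The rewritten `wake_affine()` in the fair.c hunk defers to `numa_wake_affine()` only when waker and wakee CPUs do not share a cache, and that function's core is a scaled load comparison: each node's post-migration load is weighted by the *other* node's compute capacity, with the previous node given a bias derived from the domain's `imbalance_pct`. A hedged sketch of just that comparison, with plain numbers in place of real kernel load metrics:

```c
#include <stdbool.h>

/*
 * Sketch of the numa_wake_affine() comparison above: allow the wakeup
 * on this node if its effective load after adding the task stays within
 * the previous node's effective load after removing it. imbalance_pct
 * of e.g. 117 biases the comparison ~8% in favour of staying put.
 * All inputs are illustrative units, not kernel load_avg values.
 */
static bool allow_wake_on_this_node(long this_load, long this_capacity,
				    long prev_load, long prev_capacity,
				    long task_load, int imbalance_pct)
{
	long long this_eff, prev_eff;

	/* An idle target node can't be made worse by the move. */
	if (this_load <= 0)
		return true;

	this_eff = 100LL * prev_capacity * (this_load + task_load);
	prev_eff = (100LL + (imbalance_pct - 100) / 2) * this_capacity *
		   (prev_load - task_load);

	return this_eff <= prev_eff;
}
```

Note the cross-multiplication: scaling each side by the opposite capacity avoids a division while still comparing load per unit of capacity.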
+1 -1
kernel/sched/features.h
··· 55 55 * When doing wakeups, attempt to limit superfluous scans of the LLC domain. 56 56 */ 57 57 SCHED_FEAT(SIS_AVG_CPU, false) 58 + SCHED_FEAT(SIS_PROP, true) 58 59 59 60 /* 60 61 * Issue a WARN when we do multiple update_rq_clock() calls ··· 77 76 SCHED_FEAT(RT_PUSH_IPI, true) 78 77 #endif 79 78 80 - SCHED_FEAT(FORCE_SD_OVERLAP, false) 81 79 SCHED_FEAT(RT_RUNTIME_SHARE, true) 82 80 SCHED_FEAT(LB_MIN, false) 83 81 SCHED_FEAT(ATTACH_AGE_LOAD, true)
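The `SIS_PROP` feature added above gates the new bounded scan in `select_idle_cpu()` (see the fair.c hunk): instead of scanning the whole LLC domain, the number of CPUs examined is made proportional to the ratio of the runqueue's average idle time to the domain's average scan cost, with a floor of 4. A sketch of that bound, with illustrative inputs rather than the kernel's internal units:

```c
/*
 * Sketch of the SIS_PROP scan bound from select_idle_cpu() above:
 * avg_idle is divided by 512 as a fuzz factor (scan results have large
 * variance), avg_cost gets +1 to avoid division by zero, and the span
 * is scaled by their ratio with a floor of 4 CPUs.
 */
static int sis_prop_scan_limit(unsigned long long avg_idle_ns,
			       unsigned long long avg_scan_cost_ns,
			       unsigned int span_weight)
{
	unsigned long long avg_idle = avg_idle_ns / 512;
	unsigned long long avg_cost = avg_scan_cost_ns + 1;
	unsigned long long span_avg = (unsigned long long)span_weight * avg_idle;

	if (span_avg > 4 * avg_cost)
		return (int)(span_avg / avg_cost);
	return 4;
}
```

This is why the new code in `select_idle_cpu()` decrements `nr` per visited CPU and bails out with -1 once the budget is spent: a mostly-busy domain stops burning cycles on a full scan.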
+1
kernel/sched/idle.c
··· 219 219 */ 220 220 221 221 __current_set_polling(); 222 + quiet_vmstat(); 222 223 tick_nohz_idle_enter(); 223 224 224 225 while (!need_resched()) {
+26 -25
kernel/sched/loadavg.c
··· 117 117 * load-average relies on per-cpu sampling from the tick, it is affected by 118 118 * NO_HZ. 119 119 * 120 - * The basic idea is to fold the nr_active delta into a global idle-delta upon 120 + * The basic idea is to fold the nr_active delta into a global NO_HZ-delta upon 121 121 * entering NO_HZ state such that we can include this as an 'extra' cpu delta 122 122 * when we read the global state. 123 123 * ··· 126 126 * - When we go NO_HZ idle during the window, we can negate our sample 127 127 * contribution, causing under-accounting. 128 128 * 129 - * We avoid this by keeping two idle-delta counters and flipping them 129 + * We avoid this by keeping two NO_HZ-delta counters and flipping them 130 130 * when the window starts, thus separating old and new NO_HZ load. 131 131 * 132 132 * The only trick is the slight shift in index flip for read vs write. ··· 137 137 * r:0 0 1 1 0 0 1 1 0 138 138 * w:0 1 1 0 0 1 1 0 0 139 139 * 140 - * This ensures we'll fold the old idle contribution in this window while 140 + * This ensures we'll fold the old NO_HZ contribution in this window while 141 141 * accumlating the new one. 142 142 * 143 - * - When we wake up from NO_HZ idle during the window, we push up our 143 + * - When we wake up from NO_HZ during the window, we push up our 144 144 * contribution, since we effectively move our sample point to a known 145 145 * busy state. 146 146 * 147 147 * This is solved by pushing the window forward, and thus skipping the 148 - * sample, for this cpu (effectively using the idle-delta for this cpu which 148 + * sample, for this cpu (effectively using the NO_HZ-delta for this cpu which 149 149 * was in effect at the time the window opened). This also solves the issue 150 - * of having to deal with a cpu having been in NOHZ idle for multiple 151 - * LOAD_FREQ intervals. 150 + * of having to deal with a cpu having been in NO_HZ for multiple LOAD_FREQ 151 + * intervals. 
152 152 * 153 153 * When making the ILB scale, we should try to pull this in as well. 154 154 */ 155 - static atomic_long_t calc_load_idle[2]; 155 + static atomic_long_t calc_load_nohz[2]; 156 156 static int calc_load_idx; 157 157 158 158 static inline int calc_load_write_idx(void) ··· 167 167 168 168 /* 169 169 * If the folding window started, make sure we start writing in the 170 - * next idle-delta. 170 + * next NO_HZ-delta. 171 171 */ 172 172 if (!time_before(jiffies, READ_ONCE(calc_load_update))) 173 173 idx++; ··· 180 180 return calc_load_idx & 1; 181 181 } 182 182 183 - void calc_load_enter_idle(void) 183 + void calc_load_nohz_start(void) 184 184 { 185 185 struct rq *this_rq = this_rq(); 186 186 long delta; 187 187 188 188 /* 189 - * We're going into NOHZ mode, if there's any pending delta, fold it 190 - * into the pending idle delta. 189 + * We're going into NO_HZ mode, if there's any pending delta, fold it 190 + * into the pending NO_HZ delta. 191 191 */ 192 192 delta = calc_load_fold_active(this_rq, 0); 193 193 if (delta) { 194 194 int idx = calc_load_write_idx(); 195 195 196 - atomic_long_add(delta, &calc_load_idle[idx]); 196 + atomic_long_add(delta, &calc_load_nohz[idx]); 197 197 } 198 198 } 199 199 200 - void calc_load_exit_idle(void) 200 + void calc_load_nohz_stop(void) 201 201 { 202 202 struct rq *this_rq = this_rq(); 203 203 ··· 217 217 this_rq->calc_load_update += LOAD_FREQ; 218 218 } 219 219 220 - static long calc_load_fold_idle(void) 220 + static long calc_load_nohz_fold(void) 221 221 { 222 222 int idx = calc_load_read_idx(); 223 223 long delta = 0; 224 224 225 - if (atomic_long_read(&calc_load_idle[idx])) 226 - delta = atomic_long_xchg(&calc_load_idle[idx], 0); 225 + if (atomic_long_read(&calc_load_nohz[idx])) 226 + delta = atomic_long_xchg(&calc_load_nohz[idx], 0); 227 227 228 228 return delta; 229 229 } ··· 299 299 300 300 /* 301 301 * NO_HZ can leave us missing all per-cpu ticks calling 302 - * calc_load_account_active(), but since an idle 
CPU folds its delta into 303 - * calc_load_tasks_idle per calc_load_account_idle(), all we need to do is fold 304 - * in the pending idle delta if our idle period crossed a load cycle boundary. 302 + * calc_load_fold_active(), but since a NO_HZ CPU folds its delta into 303 + * calc_load_nohz per calc_load_nohz_start(), all we need to do is fold 304 + * in the pending NO_HZ delta if our NO_HZ period crossed a load cycle boundary. 305 305 * 306 306 * Once we've updated the global active value, we need to apply the exponential 307 307 * weights adjusted to the number of cycles missed. ··· 330 330 } 331 331 332 332 /* 333 - * Flip the idle index... 333 + * Flip the NO_HZ index... 334 334 * 335 335 * Make sure we first write the new time then flip the index, so that 336 336 * calc_load_write_idx() will see the new time when it reads the new ··· 341 341 } 342 342 #else /* !CONFIG_NO_HZ_COMMON */ 343 343 344 - static inline long calc_load_fold_idle(void) { return 0; } 344 + static inline long calc_load_nohz_fold(void) { return 0; } 345 345 static inline void calc_global_nohz(void) { } 346 346 347 347 #endif /* CONFIG_NO_HZ_COMMON */ ··· 362 362 return; 363 363 364 364 /* 365 - * Fold the 'old' idle-delta to include all NO_HZ cpus. 365 + * Fold the 'old' NO_HZ-delta to include all NO_HZ cpus. 366 366 */ 367 - delta = calc_load_fold_idle(); 367 + delta = calc_load_nohz_fold(); 368 368 if (delta) 369 369 atomic_long_add(delta, &calc_load_tasks); 370 370 ··· 378 378 WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ); 379 379 380 380 /* 381 - * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk. 381 + * In case we went to NO_HZ for multiple LOAD_FREQ intervals 382 + * catch up in bulk. 382 383 */ 383 384 calc_global_nohz(); 384 385 }
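The loadavg.c hunk above is a pure rename (idle -> NO_HZ), but the two-slot delta scheme its comments describe is worth seeing in isolation: the write index flips one window earlier than the read index, so CPUs entering NO_HZ after a window opens fold their delta into the *next* slot while readers drain the *current* one, keeping old and new NO_HZ contributions separate. A single-threaded toy model (no atomics, helper names hypothetical):

```c
/*
 * Toy model of the dual calc_load_nohz[] slots from the hunk above.
 * Writers that fold after the current window has opened target the
 * next slot; readers always drain the current slot; flipping the
 * index hands the "next" slot's accumulated delta to future readers.
 */
static long calc_load_nohz[2];
static int calc_load_idx;

static void nohz_fold(long delta, int window_open)
{
	/* Write one slot ahead once the window has already opened. */
	calc_load_nohz[(calc_load_idx + !!window_open) & 1] += delta;
}

static long nohz_read(void)
{
	int idx = calc_load_idx & 1;
	long delta = calc_load_nohz[idx];

	calc_load_nohz[idx] = 0;
	return delta;
}

static void nohz_flip_window(void)
{
	calc_load_idx++;
}
```

The real code makes the same distinction with `calc_load_write_idx()` (which peeks at `calc_load_update` to decide whether the window opened) versus `calc_load_read_idx()`, and uses `atomic_long_xchg()` to drain; this model only shows the index discipline.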
+322 -1
kernel/sched/rt.c
··· 840 840 int enqueue = 0; 841 841 struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i); 842 842 struct rq *rq = rq_of_rt_rq(rt_rq); 843 + int skip; 844 + 845 + /* 846 + * When span == cpu_online_mask, taking each rq->lock 847 + * can be time-consuming. Try to avoid it when possible. 848 + */ 849 + raw_spin_lock(&rt_rq->rt_runtime_lock); 850 + skip = !rt_rq->rt_time && !rt_rq->rt_nr_running; 851 + raw_spin_unlock(&rt_rq->rt_runtime_lock); 852 + if (skip) 853 + continue; 843 854 844 855 raw_spin_lock(&rq->lock); 845 856 if (rt_rq->rt_time) { ··· 1830 1819 * pushing. 1831 1820 */ 1832 1821 task = pick_next_pushable_task(rq); 1833 - if (task_cpu(next_task) == rq->cpu && task == next_task) { 1822 + if (task == next_task) { 1834 1823 /* 1835 1824 * The task hasn't migrated, and is still the next 1836 1825 * eligible task, but we failed to find a run-queue ··· 2448 2437 2449 2438 .update_curr = update_curr_rt, 2450 2439 }; 2440 + 2441 + #ifdef CONFIG_RT_GROUP_SCHED 2442 + /* 2443 + * Ensure that the real time constraints are schedulable. 2444 + */ 2445 + static DEFINE_MUTEX(rt_constraints_mutex); 2446 + 2447 + /* Must be called with tasklist_lock held */ 2448 + static inline int tg_has_rt_tasks(struct task_group *tg) 2449 + { 2450 + struct task_struct *g, *p; 2451 + 2452 + /* 2453 + * Autogroups do not have RT tasks; see autogroup_create(). 
2454 + */ 2455 + if (task_group_is_autogroup(tg)) 2456 + return 0; 2457 + 2458 + for_each_process_thread(g, p) { 2459 + if (rt_task(p) && task_group(p) == tg) 2460 + return 1; 2461 + } 2462 + 2463 + return 0; 2464 + } 2465 + 2466 + struct rt_schedulable_data { 2467 + struct task_group *tg; 2468 + u64 rt_period; 2469 + u64 rt_runtime; 2470 + }; 2471 + 2472 + static int tg_rt_schedulable(struct task_group *tg, void *data) 2473 + { 2474 + struct rt_schedulable_data *d = data; 2475 + struct task_group *child; 2476 + unsigned long total, sum = 0; 2477 + u64 period, runtime; 2478 + 2479 + period = ktime_to_ns(tg->rt_bandwidth.rt_period); 2480 + runtime = tg->rt_bandwidth.rt_runtime; 2481 + 2482 + if (tg == d->tg) { 2483 + period = d->rt_period; 2484 + runtime = d->rt_runtime; 2485 + } 2486 + 2487 + /* 2488 + * Cannot have more runtime than the period. 2489 + */ 2490 + if (runtime > period && runtime != RUNTIME_INF) 2491 + return -EINVAL; 2492 + 2493 + /* 2494 + * Ensure we don't starve existing RT tasks. 2495 + */ 2496 + if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg)) 2497 + return -EBUSY; 2498 + 2499 + total = to_ratio(period, runtime); 2500 + 2501 + /* 2502 + * Nobody can have more than the global setting allows. 2503 + */ 2504 + if (total > to_ratio(global_rt_period(), global_rt_runtime())) 2505 + return -EINVAL; 2506 + 2507 + /* 2508 + * The sum of our children's runtime should not exceed our own. 
2509 + */ 2510 + list_for_each_entry_rcu(child, &tg->children, siblings) { 2511 + period = ktime_to_ns(child->rt_bandwidth.rt_period); 2512 + runtime = child->rt_bandwidth.rt_runtime; 2513 + 2514 + if (child == d->tg) { 2515 + period = d->rt_period; 2516 + runtime = d->rt_runtime; 2517 + } 2518 + 2519 + sum += to_ratio(period, runtime); 2520 + } 2521 + 2522 + if (sum > total) 2523 + return -EINVAL; 2524 + 2525 + return 0; 2526 + } 2527 + 2528 + static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime) 2529 + { 2530 + int ret; 2531 + 2532 + struct rt_schedulable_data data = { 2533 + .tg = tg, 2534 + .rt_period = period, 2535 + .rt_runtime = runtime, 2536 + }; 2537 + 2538 + rcu_read_lock(); 2539 + ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data); 2540 + rcu_read_unlock(); 2541 + 2542 + return ret; 2543 + } 2544 + 2545 + static int tg_set_rt_bandwidth(struct task_group *tg, 2546 + u64 rt_period, u64 rt_runtime) 2547 + { 2548 + int i, err = 0; 2549 + 2550 + /* 2551 + * Disallowing the root group RT runtime is BAD, it would disallow the 2552 + * kernel creating (and or operating) RT threads. 2553 + */ 2554 + if (tg == &root_task_group && rt_runtime == 0) 2555 + return -EINVAL; 2556 + 2557 + /* No period doesn't make any sense. 
*/ 2558 + if (rt_period == 0) 2559 + return -EINVAL; 2560 + 2561 + mutex_lock(&rt_constraints_mutex); 2562 + read_lock(&tasklist_lock); 2563 + err = __rt_schedulable(tg, rt_period, rt_runtime); 2564 + if (err) 2565 + goto unlock; 2566 + 2567 + raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock); 2568 + tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period); 2569 + tg->rt_bandwidth.rt_runtime = rt_runtime; 2570 + 2571 + for_each_possible_cpu(i) { 2572 + struct rt_rq *rt_rq = tg->rt_rq[i]; 2573 + 2574 + raw_spin_lock(&rt_rq->rt_runtime_lock); 2575 + rt_rq->rt_runtime = rt_runtime; 2576 + raw_spin_unlock(&rt_rq->rt_runtime_lock); 2577 + } 2578 + raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock); 2579 + unlock: 2580 + read_unlock(&tasklist_lock); 2581 + mutex_unlock(&rt_constraints_mutex); 2582 + 2583 + return err; 2584 + } 2585 + 2586 + int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us) 2587 + { 2588 + u64 rt_runtime, rt_period; 2589 + 2590 + rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period); 2591 + rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC; 2592 + if (rt_runtime_us < 0) 2593 + rt_runtime = RUNTIME_INF; 2594 + 2595 + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime); 2596 + } 2597 + 2598 + long sched_group_rt_runtime(struct task_group *tg) 2599 + { 2600 + u64 rt_runtime_us; 2601 + 2602 + if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF) 2603 + return -1; 2604 + 2605 + rt_runtime_us = tg->rt_bandwidth.rt_runtime; 2606 + do_div(rt_runtime_us, NSEC_PER_USEC); 2607 + return rt_runtime_us; 2608 + } 2609 + 2610 + int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us) 2611 + { 2612 + u64 rt_runtime, rt_period; 2613 + 2614 + rt_period = rt_period_us * NSEC_PER_USEC; 2615 + rt_runtime = tg->rt_bandwidth.rt_runtime; 2616 + 2617 + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime); 2618 + } 2619 + 2620 + long sched_group_rt_period(struct task_group *tg) 2621 + { 2622 + u64 rt_period_us; 2623 + 2624 + rt_period_us = 
ktime_to_ns(tg->rt_bandwidth.rt_period); 2625 + do_div(rt_period_us, NSEC_PER_USEC); 2626 + return rt_period_us; 2627 + } 2628 + 2629 + static int sched_rt_global_constraints(void) 2630 + { 2631 + int ret = 0; 2632 + 2633 + mutex_lock(&rt_constraints_mutex); 2634 + read_lock(&tasklist_lock); 2635 + ret = __rt_schedulable(NULL, 0, 0); 2636 + read_unlock(&tasklist_lock); 2637 + mutex_unlock(&rt_constraints_mutex); 2638 + 2639 + return ret; 2640 + } 2641 + 2642 + int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk) 2643 + { 2644 + /* Don't accept realtime tasks when there is no way for them to run */ 2645 + if (rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0) 2646 + return 0; 2647 + 2648 + return 1; 2649 + } 2650 + 2651 + #else /* !CONFIG_RT_GROUP_SCHED */ 2652 + static int sched_rt_global_constraints(void) 2653 + { 2654 + unsigned long flags; 2655 + int i; 2656 + 2657 + raw_spin_lock_irqsave(&def_rt_bandwidth.rt_runtime_lock, flags); 2658 + for_each_possible_cpu(i) { 2659 + struct rt_rq *rt_rq = &cpu_rq(i)->rt; 2660 + 2661 + raw_spin_lock(&rt_rq->rt_runtime_lock); 2662 + rt_rq->rt_runtime = global_rt_runtime(); 2663 + raw_spin_unlock(&rt_rq->rt_runtime_lock); 2664 + } 2665 + raw_spin_unlock_irqrestore(&def_rt_bandwidth.rt_runtime_lock, flags); 2666 + 2667 + return 0; 2668 + } 2669 + #endif /* CONFIG_RT_GROUP_SCHED */ 2670 + 2671 + static int sched_rt_global_validate(void) 2672 + { 2673 + if (sysctl_sched_rt_period <= 0) 2674 + return -EINVAL; 2675 + 2676 + if ((sysctl_sched_rt_runtime != RUNTIME_INF) && 2677 + (sysctl_sched_rt_runtime > sysctl_sched_rt_period)) 2678 + return -EINVAL; 2679 + 2680 + return 0; 2681 + } 2682 + 2683 + static void sched_rt_do_global(void) 2684 + { 2685 + def_rt_bandwidth.rt_runtime = global_rt_runtime(); 2686 + def_rt_bandwidth.rt_period = ns_to_ktime(global_rt_period()); 2687 + } 2688 + 2689 + int sched_rt_handler(struct ctl_table *table, int write, 2690 + void __user *buffer, size_t *lenp, 2691 + loff_t *ppos) 2692 + 
{ 2693 + int old_period, old_runtime; 2694 + static DEFINE_MUTEX(mutex); 2695 + int ret; 2696 + 2697 + mutex_lock(&mutex); 2698 + old_period = sysctl_sched_rt_period; 2699 + old_runtime = sysctl_sched_rt_runtime; 2700 + 2701 + ret = proc_dointvec(table, write, buffer, lenp, ppos); 2702 + 2703 + if (!ret && write) { 2704 + ret = sched_rt_global_validate(); 2705 + if (ret) 2706 + goto undo; 2707 + 2708 + ret = sched_dl_global_validate(); 2709 + if (ret) 2710 + goto undo; 2711 + 2712 + ret = sched_rt_global_constraints(); 2713 + if (ret) 2714 + goto undo; 2715 + 2716 + sched_rt_do_global(); 2717 + sched_dl_do_global(); 2718 + } 2719 + if (0) { 2720 + undo: 2721 + sysctl_sched_rt_period = old_period; 2722 + sysctl_sched_rt_runtime = old_runtime; 2723 + } 2724 + mutex_unlock(&mutex); 2725 + 2726 + return ret; 2727 + } 2728 + 2729 + int sched_rr_handler(struct ctl_table *table, int write, 2730 + void __user *buffer, size_t *lenp, 2731 + loff_t *ppos) 2732 + { 2733 + int ret; 2734 + static DEFINE_MUTEX(mutex); 2735 + 2736 + mutex_lock(&mutex); 2737 + ret = proc_dointvec(table, write, buffer, lenp, ppos); 2738 + /* 2739 + * Make sure that internally we keep jiffies. 2740 + * Also, writing zero resets the timeslice to default: 2741 + */ 2742 + if (!ret && write) { 2743 + sched_rr_timeslice = 2744 + sysctl_sched_rr_timeslice <= 0 ? RR_TIMESLICE : 2745 + msecs_to_jiffies(sysctl_sched_rr_timeslice); 2746 + } 2747 + mutex_unlock(&mutex); 2748 + return ret; 2749 + } 2451 2750 2452 2751 #ifdef CONFIG_SCHED_DEBUG 2453 2752 extern void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq);
+98 -15
kernel/sched/sched.h
··· 39 39 #include "cpuacct.h" 40 40 41 41 #ifdef CONFIG_SCHED_DEBUG 42 - #define SCHED_WARN_ON(x) WARN_ONCE(x, #x) 42 + # define SCHED_WARN_ON(x) WARN_ONCE(x, #x) 43 43 #else 44 - #define SCHED_WARN_ON(x) ((void)(x)) 44 + # define SCHED_WARN_ON(x) ({ (void)(x), 0; }) 45 45 #endif 46 46 47 47 struct rq; ··· 218 218 return sysctl_sched_rt_runtime >= 0; 219 219 } 220 220 221 - extern struct dl_bw *dl_bw_of(int i); 222 - 223 221 struct dl_bw { 224 222 raw_spinlock_t lock; 225 223 u64 bw, total_bw; 226 224 }; 227 225 226 + static inline void __dl_update(struct dl_bw *dl_b, s64 bw); 227 + 228 228 static inline 229 - void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw) 229 + void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw, int cpus) 230 230 { 231 231 dl_b->total_bw -= tsk_bw; 232 + __dl_update(dl_b, (s32)tsk_bw / cpus); 232 233 } 233 234 234 235 static inline 235 - void __dl_add(struct dl_bw *dl_b, u64 tsk_bw) 236 + void __dl_add(struct dl_bw *dl_b, u64 tsk_bw, int cpus) 236 237 { 237 238 dl_b->total_bw += tsk_bw; 239 + __dl_update(dl_b, -((s32)tsk_bw / cpus)); 238 240 } 239 241 240 242 static inline ··· 246 244 dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw; 247 245 } 248 246 247 + void dl_change_utilization(struct task_struct *p, u64 new_bw); 249 248 extern void init_dl_bw(struct dl_bw *dl_b); 249 + extern int sched_dl_global_validate(void); 250 + extern void sched_dl_do_global(void); 251 + extern int sched_dl_overflow(struct task_struct *p, int policy, 252 + const struct sched_attr *attr); 253 + extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr); 254 + extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr); 255 + extern bool __checkparam_dl(const struct sched_attr *attr); 256 + extern void __dl_clear_params(struct task_struct *p); 257 + extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr); 258 + extern int dl_task_can_attach(struct task_struct *p, 259 + const struct cpumask *cs_cpus_allowed); 
260 + extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, 261 + const struct cpumask *trial); 262 + extern bool dl_cpu_busy(unsigned int cpu); 250 263 251 264 #ifdef CONFIG_CGROUP_SCHED 252 265 ··· 383 366 extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq, 384 367 struct sched_rt_entity *rt_se, int cpu, 385 368 struct sched_rt_entity *parent); 369 + extern int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us); 370 + extern int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us); 371 + extern long sched_group_rt_runtime(struct task_group *tg); 372 + extern long sched_group_rt_period(struct task_group *tg); 373 + extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk); 386 374 387 375 extern struct task_group *sched_create_group(struct task_group *parent); 388 376 extern void sched_online_group(struct task_group *tg, ··· 580 558 #else 581 559 struct dl_bw dl_bw; 582 560 #endif 561 + /* 562 + * "Active utilization" for this runqueue: increased when a 563 + * task wakes up (becomes TASK_RUNNING) and decreased when a 564 + * task blocks 565 + */ 566 + u64 running_bw; 567 + 568 + /* 569 + * Utilization of the tasks "assigned" to this runqueue (including 570 + * the tasks that are in runqueue and the tasks that executed on this 571 + * CPU and blocked). Increased when a task moves to this runqueue, and 572 + * decreased when the task moves away (migrates, changes scheduling 573 + * policy, or terminates). 574 + * This is needed to compute the "inactive utilization" for the 575 + * runqueue (inactive utilization = this_bw - running_bw). 576 + */ 577 + u64 this_bw; 578 + u64 extra_bw; 579 + 580 + /* 581 + * Inverse of the fraction of CPU utilization that can be reclaimed 582 + * by the GRUB algorithm. 
583 + */ 584 + u64 bw_ratio; 583 585 }; 584 586 585 587 #ifdef CONFIG_SMP ··· 652 606 653 607 extern struct root_domain def_root_domain; 654 608 extern struct mutex sched_domains_mutex; 655 - extern cpumask_var_t fallback_doms; 656 - extern cpumask_var_t sched_domains_tmpmask; 657 609 658 610 extern void init_defrootdomain(void); 659 - extern int init_sched_domains(const struct cpumask *cpu_map); 611 + extern int sched_init_domains(const struct cpumask *cpu_map); 660 612 extern void rq_attach_root(struct rq *rq, struct root_domain *rd); 661 613 662 614 #endif /* CONFIG_SMP */ ··· 1069 1025 unsigned long next_update; 1070 1026 int imbalance; /* XXX unrelated to capacity but shared group state */ 1071 1027 1072 - unsigned long cpumask[0]; /* iteration mask */ 1028 + #ifdef CONFIG_SCHED_DEBUG 1029 + int id; 1030 + #endif 1031 + 1032 + unsigned long cpumask[0]; /* balance mask */ 1073 1033 }; 1074 1034 1075 1035 struct sched_group { ··· 1094 1046 unsigned long cpumask[0]; 1095 1047 }; 1096 1048 1097 - static inline struct cpumask *sched_group_cpus(struct sched_group *sg) 1049 + static inline struct cpumask *sched_group_span(struct sched_group *sg) 1098 1050 { 1099 1051 return to_cpumask(sg->cpumask); 1100 1052 } 1101 1053 1102 1054 /* 1103 - * cpumask masking which cpus in the group are allowed to iterate up the domain 1104 - * tree. 1055 + * See build_balance_mask(). 
1105 1056 */ 1106 - static inline struct cpumask *sched_group_mask(struct sched_group *sg) 1057 + static inline struct cpumask *group_balance_mask(struct sched_group *sg) 1107 1058 { 1108 1059 return to_cpumask(sg->sgc->cpumask); 1109 1060 } ··· 1113 1066 */ 1114 1067 static inline unsigned int group_first_cpu(struct sched_group *group) 1115 1068 { 1116 - return cpumask_first(sched_group_cpus(group)); 1069 + return cpumask_first(sched_group_span(group)); 1117 1070 } 1118 1071 1119 1072 extern int group_balance_cpu(struct sched_group *sg); ··· 1469 1422 curr->sched_class->set_curr_task(rq); 1470 1423 } 1471 1424 1425 + #ifdef CONFIG_SMP 1472 1426 #define sched_class_highest (&stop_sched_class) 1427 + #else 1428 + #define sched_class_highest (&dl_sched_class) 1429 + #endif 1473 1430 #define for_each_class(class) \ 1474 1431 for (class = sched_class_highest; class; class = class->next) 1475 1432 ··· 1537 1486 extern struct dl_bandwidth def_dl_bandwidth; 1538 1487 extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime); 1539 1488 extern void init_dl_task_timer(struct sched_dl_entity *dl_se); 1489 + extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se); 1490 + extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq); 1540 1491 1492 + #define BW_SHIFT 20 1493 + #define BW_UNIT (1 << BW_SHIFT) 1494 + #define RATIO_SHIFT 8 1541 1495 unsigned long to_ratio(u64 period, u64 runtime); 1542 1496 1543 1497 extern void init_entity_runnable_average(struct sched_entity *se); ··· 1983 1927 #else 1984 1928 static inline void nohz_balance_exit_idle(unsigned int cpu) { } 1985 1929 #endif 1930 + 1931 + 1932 + #ifdef CONFIG_SMP 1933 + static inline 1934 + void __dl_update(struct dl_bw *dl_b, s64 bw) 1935 + { 1936 + struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw); 1937 + int i; 1938 + 1939 + RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(), 1940 + "sched RCU must be held"); 1941 + for_each_cpu_and(i, rd->span, cpu_active_mask) { 
1942 + struct rq *rq = cpu_rq(i); 1943 + 1944 + rq->dl.extra_bw += bw; 1945 + } 1946 + } 1947 + #else 1948 + static inline 1949 + void __dl_update(struct dl_bw *dl_b, s64 bw) 1950 + { 1951 + struct dl_rq *dl = container_of(dl_b, struct dl_rq, dl_bw); 1952 + 1953 + dl->extra_bw += bw; 1954 + } 1955 + #endif 1956 + 1986 1957 1987 1958 #ifdef CONFIG_IRQ_TIME_ACCOUNTING 1988 1959 struct irqtime {
+339 -91
kernel/sched/topology.c
··· 10 10 11 11 /* Protected by sched_domains_mutex: */ 12 12 cpumask_var_t sched_domains_tmpmask; 13 + cpumask_var_t sched_domains_tmpmask2; 13 14 14 15 #ifdef CONFIG_SCHED_DEBUG 15 16 ··· 36 35 37 36 cpumask_clear(groupmask); 38 37 39 - printk(KERN_DEBUG "%*s domain %d: ", level, "", level); 38 + printk(KERN_DEBUG "%*s domain-%d: ", level, "", level); 40 39 41 40 if (!(sd->flags & SD_LOAD_BALANCE)) { 42 41 printk("does not load-balance\n"); ··· 46 45 return -1; 47 46 } 48 47 49 - printk(KERN_CONT "span %*pbl level %s\n", 48 + printk(KERN_CONT "span=%*pbl level=%s\n", 50 49 cpumask_pr_args(sched_domain_span(sd)), sd->name); 51 50 52 51 if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) { 53 52 printk(KERN_ERR "ERROR: domain->span does not contain " 54 53 "CPU%d\n", cpu); 55 54 } 56 - if (!cpumask_test_cpu(cpu, sched_group_cpus(group))) { 55 + if (!cpumask_test_cpu(cpu, sched_group_span(group))) { 57 56 printk(KERN_ERR "ERROR: domain->groups does not contain" 58 57 " CPU%d\n", cpu); 59 58 } ··· 66 65 break; 67 66 } 68 67 69 - if (!cpumask_weight(sched_group_cpus(group))) { 68 + if (!cpumask_weight(sched_group_span(group))) { 70 69 printk(KERN_CONT "\n"); 71 70 printk(KERN_ERR "ERROR: empty group\n"); 72 71 break; 73 72 } 74 73 75 74 if (!(sd->flags & SD_OVERLAP) && 76 - cpumask_intersects(groupmask, sched_group_cpus(group))) { 75 + cpumask_intersects(groupmask, sched_group_span(group))) { 77 76 printk(KERN_CONT "\n"); 78 77 printk(KERN_ERR "ERROR: repeated CPUs\n"); 79 78 break; 80 79 } 81 80 82 - cpumask_or(groupmask, groupmask, sched_group_cpus(group)); 81 + cpumask_or(groupmask, groupmask, sched_group_span(group)); 83 82 84 - printk(KERN_CONT " %*pbl", 85 - cpumask_pr_args(sched_group_cpus(group))); 86 - if (group->sgc->capacity != SCHED_CAPACITY_SCALE) { 87 - printk(KERN_CONT " (cpu_capacity = %lu)", 88 - group->sgc->capacity); 83 + printk(KERN_CONT " %d:{ span=%*pbl", 84 + group->sgc->id, 85 + cpumask_pr_args(sched_group_span(group))); 86 + 87 + if ((sd->flags 
& SD_OVERLAP) && 88 + !cpumask_equal(group_balance_mask(group), sched_group_span(group))) { 89 + printk(KERN_CONT " mask=%*pbl", 90 + cpumask_pr_args(group_balance_mask(group))); 89 91 } 90 92 93 + if (group->sgc->capacity != SCHED_CAPACITY_SCALE) 94 + printk(KERN_CONT " cap=%lu", group->sgc->capacity); 95 + 96 + if (group == sd->groups && sd->child && 97 + !cpumask_equal(sched_domain_span(sd->child), 98 + sched_group_span(group))) { 99 + printk(KERN_ERR "ERROR: domain->groups does not match domain->child\n"); 100 + } 101 + 102 + printk(KERN_CONT " }"); 103 + 91 104 group = group->next; 105 + 106 + if (group != sd->groups) 107 + printk(KERN_CONT ","); 108 + 92 109 } while (group != sd->groups); 93 110 printk(KERN_CONT "\n"); 94 111 ··· 132 113 return; 133 114 } 134 115 135 - printk(KERN_DEBUG "CPU%d attaching sched-domain:\n", cpu); 116 + printk(KERN_DEBUG "CPU%d attaching sched-domain(s):\n", cpu); 136 117 137 118 for (;;) { 138 119 if (sched_domain_debug_one(sd, cpu, level, sched_domains_tmpmask)) ··· 496 477 }; 497 478 498 479 /* 499 - * Build an iteration mask that can exclude certain CPUs from the upwards 500 - * domain traversal. 480 + * Return the canonical balance CPU for this group, this is the first CPU 481 + * of this group that's also in the balance mask. 501 482 * 502 - * Asymmetric node setups can result in situations where the domain tree is of 503 - * unequal depth, make sure to skip domains that already cover the entire 504 - * range. 483 + * The balance mask are all those CPUs that could actually end up at this 484 + * group. See build_balance_mask(). 505 485 * 506 - * In that case build_sched_domains() will have terminated the iteration early 507 - * and our sibling sd spans will be empty. Domains should always include the 508 - * CPU they're built on, so check that. 486 + * Also see should_we_balance(). 
509 487 */ 510 - static void build_group_mask(struct sched_domain *sd, struct sched_group *sg) 488 + int group_balance_cpu(struct sched_group *sg) 511 489 { 512 - const struct cpumask *span = sched_domain_span(sd); 490 + return cpumask_first(group_balance_mask(sg)); 491 + } 492 + 493 + 494 + /* 495 + * NUMA topology (first read the regular topology blurb below) 496 + * 497 + * Given a node-distance table, for example: 498 + * 499 + * node 0 1 2 3 500 + * 0: 10 20 30 20 501 + * 1: 20 10 20 30 502 + * 2: 30 20 10 20 503 + * 3: 20 30 20 10 504 + * 505 + * which represents a 4 node ring topology like: 506 + * 507 + * 0 ----- 1 508 + * | | 509 + * | | 510 + * | | 511 + * 3 ----- 2 512 + * 513 + * We want to construct domains and groups to represent this. The way we go 514 + * about doing this is to build the domains on 'hops'. For each NUMA level we 515 + * construct the mask of all nodes reachable in @level hops. 516 + * 517 + * For the above NUMA topology that gives 3 levels: 518 + * 519 + * NUMA-2 0-3 0-3 0-3 0-3 520 + * groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2} 521 + * 522 + * NUMA-1 0-1,3 0-2 1-3 0,2-3 523 + * groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3} 524 + * 525 + * NUMA-0 0 1 2 3 526 + * 527 + * 528 + * As can be seen; things don't nicely line up as with the regular topology. 529 + * When we iterate a domain in child domain chunks some nodes can be 530 + * represented multiple times -- hence the "overlap" naming for this part of 531 + * the topology. 532 + * 533 + * In order to minimize this overlap, we only build enough groups to cover the 534 + * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3. 535 + * 536 + * Because: 537 + * 538 + * - the first group of each domain is its child domain; this 539 + * gets us the first 0-1,3 540 + * - the only uncovered node is 2, who's child domain is 1-3. 
541 + * 542 + * However, because of the overlap, computing a unique CPU for each group is 543 + * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both 544 + * groups include the CPUs of Node-0, while those CPUs would not in fact ever 545 + * end up at those groups (they would end up in group: 0-1,3). 546 + * 547 + * To correct this we have to introduce the group balance mask. This mask 548 + * will contain those CPUs in the group that can reach this group given the 549 + * (child) domain tree. 550 + * 551 + * With this we can once again compute balance_cpu and sched_group_capacity 552 + * relations. 553 + * 554 + * XXX include words on how balance_cpu is unique and therefore can be 555 + * used for sched_group_capacity links. 556 + * 557 + * 558 + * Another 'interesting' topology is: 559 + * 560 + * node 0 1 2 3 561 + * 0: 10 20 20 30 562 + * 1: 20 10 20 20 563 + * 2: 20 20 10 20 564 + * 3: 30 20 20 10 565 + * 566 + * Which looks a little like: 567 + * 568 + * 0 ----- 1 569 + * | / | 570 + * | / | 571 + * | / | 572 + * 2 ----- 3 573 + * 574 + * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3 575 + * are not. 576 + * 577 + * This leads to a few particularly weird cases where the sched_domain's are 578 + * not of the same number for each cpu. Consider: 579 + * 580 + * NUMA-2 0-3 0-3 581 + * groups: {0-2},{1-3} {1-3},{0-2} 582 + * 583 + * NUMA-1 0-2 0-3 0-3 1-3 584 + * 585 + * NUMA-0 0 1 2 3 586 + * 587 + */ 588 + 589 + 590 + /* 591 + * Build the balance mask; it contains only those CPUs that can arrive at this 592 + * group and should be considered to continue balancing. 593 + * 594 + * We do this during the group creation pass, therefore the group information 595 + * isn't complete yet, however since each group represents a (child) domain we 596 + * can fully construct this using the sched_domain bits (which are already 597 + * complete). 
598 + */ 599 + static void 600 + build_balance_mask(struct sched_domain *sd, struct sched_group *sg, struct cpumask *mask) 601 + { 602 + const struct cpumask *sg_span = sched_group_span(sg); 513 603 struct sd_data *sdd = sd->private; 514 604 struct sched_domain *sibling; 515 605 int i; 516 606 517 - for_each_cpu(i, span) { 607 + cpumask_clear(mask); 608 + 609 + for_each_cpu(i, sg_span) { 518 610 sibling = *per_cpu_ptr(sdd->sd, i); 519 - if (!cpumask_test_cpu(i, sched_domain_span(sibling))) 611 + 612 + /* 613 + * Can happen in the asymmetric case, where these siblings are 614 + * unused. The mask will not be empty because those CPUs that 615 + * do have the top domain _should_ span the domain. 616 + */ 617 + if (!sibling->child) 520 618 continue; 521 619 522 - cpumask_set_cpu(i, sched_group_mask(sg)); 620 + /* If we would not end up here, we can't continue from here */ 621 + if (!cpumask_equal(sg_span, sched_domain_span(sibling->child))) 622 + continue; 623 + 624 + cpumask_set_cpu(i, mask); 523 625 } 626 + 627 + /* We must not have empty masks here */ 628 + WARN_ON_ONCE(cpumask_empty(mask)); 524 629 } 525 630 526 631 /* 527 - * Return the canonical balance CPU for this group, this is the first CPU 528 - * of this group that's also in the iteration mask. 632 + * XXX: This creates per-node group entries; since the load-balancer will 633 + * immediately access remote memory to construct this group's load-balance 634 + * statistics having the groups node local is of dubious benefit. 
529 635 */ 530 - int group_balance_cpu(struct sched_group *sg) 636 + static struct sched_group * 637 + build_group_from_child_sched_domain(struct sched_domain *sd, int cpu) 531 638 { 532 - return cpumask_first_and(sched_group_cpus(sg), sched_group_mask(sg)); 639 + struct sched_group *sg; 640 + struct cpumask *sg_span; 641 + 642 + sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), 643 + GFP_KERNEL, cpu_to_node(cpu)); 644 + 645 + if (!sg) 646 + return NULL; 647 + 648 + sg_span = sched_group_span(sg); 649 + if (sd->child) 650 + cpumask_copy(sg_span, sched_domain_span(sd->child)); 651 + else 652 + cpumask_copy(sg_span, sched_domain_span(sd)); 653 + 654 + return sg; 655 + } 656 + 657 + static void init_overlap_sched_group(struct sched_domain *sd, 658 + struct sched_group *sg) 659 + { 660 + struct cpumask *mask = sched_domains_tmpmask2; 661 + struct sd_data *sdd = sd->private; 662 + struct cpumask *sg_span; 663 + int cpu; 664 + 665 + build_balance_mask(sd, sg, mask); 666 + cpu = cpumask_first_and(sched_group_span(sg), mask); 667 + 668 + sg->sgc = *per_cpu_ptr(sdd->sgc, cpu); 669 + if (atomic_inc_return(&sg->sgc->ref) == 1) 670 + cpumask_copy(group_balance_mask(sg), mask); 671 + else 672 + WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)); 673 + 674 + /* 675 + * Initialize sgc->capacity such that even if we mess up the 676 + * domains and no possible iteration will get us here, we won't 677 + * die on a /0 trap. 
678 + */ 679 + sg_span = sched_group_span(sg); 680 + sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); 681 + sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; 533 682 } 534 683 535 684 static int 536 685 build_overlap_sched_groups(struct sched_domain *sd, int cpu) 537 686 { 538 - struct sched_group *first = NULL, *last = NULL, *groups = NULL, *sg; 687 + struct sched_group *first = NULL, *last = NULL, *sg; 539 688 const struct cpumask *span = sched_domain_span(sd); 540 689 struct cpumask *covered = sched_domains_tmpmask; 541 690 struct sd_data *sdd = sd->private; ··· 712 525 713 526 cpumask_clear(covered); 714 527 715 - for_each_cpu(i, span) { 528 + for_each_cpu_wrap(i, span, cpu) { 716 529 struct cpumask *sg_span; 717 530 718 531 if (cpumask_test_cpu(i, covered)) ··· 720 533 721 534 sibling = *per_cpu_ptr(sdd->sd, i); 722 535 723 - /* See the comment near build_group_mask(). */ 536 + /* 537 + * Asymmetric node setups can result in situations where the 538 + * domain tree is of unequal depth, make sure to skip domains 539 + * that already cover the entire range. 540 + * 541 + * In that case build_sched_domains() will have terminated the 542 + * iteration early and our sibling sd spans will be empty. 543 + * Domains should always include the CPU they're built on, so 544 + * check that. 
545 + */ 724 546 if (!cpumask_test_cpu(i, sched_domain_span(sibling))) 725 547 continue; 726 548 727 - sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), 728 - GFP_KERNEL, cpu_to_node(cpu)); 729 - 549 + sg = build_group_from_child_sched_domain(sibling, cpu); 730 550 if (!sg) 731 551 goto fail; 732 552 733 - sg_span = sched_group_cpus(sg); 734 - if (sibling->child) 735 - cpumask_copy(sg_span, sched_domain_span(sibling->child)); 736 - else 737 - cpumask_set_cpu(i, sg_span); 738 - 553 + sg_span = sched_group_span(sg); 739 554 cpumask_or(covered, covered, sg_span); 740 555 741 - sg->sgc = *per_cpu_ptr(sdd->sgc, i); 742 - if (atomic_inc_return(&sg->sgc->ref) == 1) 743 - build_group_mask(sd, sg); 744 - 745 - /* 746 - * Initialize sgc->capacity such that even if we mess up the 747 - * domains and no possible iteration will get us here, we won't 748 - * die on a /0 trap. 749 - */ 750 - sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); 751 - sg->sgc->min_capacity = SCHED_CAPACITY_SCALE; 752 - 753 - /* 754 - * Make sure the first group of this domain contains the 755 - * canonical balance CPU. Otherwise the sched_domain iteration 756 - * breaks. See update_sg_lb_stats(). 757 - */ 758 - if ((!groups && cpumask_test_cpu(cpu, sg_span)) || 759 - group_balance_cpu(sg) == cpu) 760 - groups = sg; 556 + init_overlap_sched_group(sd, sg); 761 557 762 558 if (!first) 763 559 first = sg; ··· 749 579 last = sg; 750 580 last->next = first; 751 581 } 752 - sd->groups = groups; 582 + sd->groups = first; 753 583 754 584 return 0; 755 585 ··· 759 589 return -ENOMEM; 760 590 } 761 591 762 - static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg) 592 + 593 + /* 594 + * Package topology (also see the load-balance blurb in fair.c) 595 + * 596 + * The scheduler builds a tree structure to represent a number of important 597 + * topology features. 
+ * By default (default_topology[]) these include:
+ *
+ *  - Simultaneous multithreading (SMT)
+ *  - Multi-Core Cache (MC)
+ *  - Package (DIE)
+ *
+ * Where the last one more or less denotes everything up to a NUMA node.
+ *
+ * The tree consists of 3 primary data structures:
+ *
+ *	sched_domain -> sched_group -> sched_group_capacity
+ *	    ^ ^             ^ ^
+ *          `-'             `-'
+ *
+ * The sched_domains are per-cpu and have a two way link (parent & child) and
+ * denote the ever growing mask of CPUs belonging to that level of topology.
+ *
+ * Each sched_domain has a circular (double) linked list of sched_group's, each
+ * denoting the domains of the level below (or individual CPUs in case of the
+ * first domain level). The sched_group linked by a sched_domain includes the
+ * CPU of that sched_domain [*].
+ *
+ * Take for instance a 2 threaded, 2 core, 2 cache cluster part:
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * DIE  [                             ]
+ * MC   [             ] [             ]
+ * SMT  [     ] [     ] [     ] [     ]
+ *
+ *  - or -
+ *
+ * DIE  0-7 0-7 0-7 0-7 0-7 0-7 0-7 0-7
+ * MC   0-3 0-3 0-3 0-3 4-7 4-7 4-7 4-7
+ * SMT  0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7
+ *
+ * CPU   0   1   2   3   4   5   6   7
+ *
+ * One way to think about it is: sched_domain moves you up and down among these
+ * topology levels, while sched_group moves you sideways through it, at child
+ * domain granularity.
+ *
+ * sched_group_capacity ensures each unique sched_group has shared storage.
+ *
+ * There are two related construction problems, both of which require a CPU
+ * that uniquely identifies each group (for a given domain):
+ *
+ *  - The first is the balance_cpu (see should_we_balance() and the
+ *    load-balance blurb in fair.c); for each group we only want 1 CPU to
+ *    continue balancing at a higher domain.
+ *
+ *  - The second is the sched_group_capacity; we want all identical groups
+ *    to share a single sched_group_capacity.
+ *
+ * These topologies are exclusive by construction: it is impossible for an
+ * SMT thread to belong to multiple cores, or for a core to be part of
+ * multiple caches, so there is a very clear and unique location for each
+ * CPU in the hierarchy.
+ *
+ * Therefore computing a unique CPU for each group is trivial (the iteration
+ * mask is redundant and set all 1s; all CPUs in a group will end up at _that_
+ * group), we can simply pick the first CPU in each group.
+ *
+ *
+ * [*] in other words, the first group of each domain is its child domain.
+ */
+
+static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 {
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
 	struct sched_domain *child = sd->child;
+	struct sched_group *sg;
 
 	if (child)
 		cpu = cpumask_first(sched_domain_span(child));
 
-	if (sg) {
-		*sg = *per_cpu_ptr(sdd->sg, cpu);
-		(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
+	sg = *per_cpu_ptr(sdd->sg, cpu);
+	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 
-		/* For claim_allocations: */
-		atomic_set(&(*sg)->sgc->ref, 1);
+	/* For claim_allocations: */
+	atomic_inc(&sg->ref);
+	atomic_inc(&sg->sgc->ref);
+
+	if (child) {
+		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
+		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
+	} else {
+		cpumask_set_cpu(cpu, sched_group_span(sg));
+		cpumask_set_cpu(cpu, group_balance_mask(sg));
 	}
 
-	return cpu;
+	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
+	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+
+	return sg;
 }
 
 /*
···
 	struct cpumask *covered;
 	int i;
 
-	get_group(cpu, sdd, &sd->groups);
-	atomic_inc(&sd->groups->ref);
-
-	if (cpu != cpumask_first(span))
-		return 0;
-
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
 
 	cpumask_clear(covered);
 
-	for_each_cpu(i, span) {
+	for_each_cpu_wrap(i, span, cpu) {
 		struct sched_group *sg;
-		int group, j;
 
 		if (cpumask_test_cpu(i, covered))
 			continue;
 
-		group = get_group(i, sdd, &sg);
-		cpumask_setall(sched_group_mask(sg));
+		sg = get_group(i, sdd);
 
-		for_each_cpu(j, span) {
-			if (get_group(j, sdd, NULL) != group)
-				continue;
-
-			cpumask_set_cpu(j, covered);
-			cpumask_set_cpu(j, sched_group_cpus(sg));
-		}
+		cpumask_or(covered, covered, sched_group_span(sg));
 
 		if (!first)
 			first = sg;
···
 		last = sg;
 	}
 	last->next = first;
+	sd->groups = first;
 
 	return 0;
 }
···
 	do {
 		int cpu, max_cpu = -1;
 
-		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+		sg->group_weight = cpumask_weight(sched_group_span(sg));
 
 		if (!(sd->flags & SD_ASYM_PACKING))
 			goto next;
 
-		for_each_cpu(cpu, sched_group_cpus(sg)) {
+		for_each_cpu(cpu, sched_group_span(sg)) {
 			if (max_cpu < 0)
 				max_cpu = cpu;
 			else if (sched_asym_prefer(cpu, max_cpu))
···
 			if (!sgc)
 				return -ENOMEM;
 
+#ifdef CONFIG_SCHED_DEBUG
+			sgc->id = j;
+#endif
+
 			*per_cpu_ptr(sdd->sgc, j) = sgc;
 		}
 	}
···
 		sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 		if (tl == sched_domain_topology)
 			*per_cpu_ptr(d.sd, i) = sd;
-		if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
+		if (tl->flags & SDTL_OVERLAP)
 			sd->flags |= SD_OVERLAP;
 		if (cpumask_equal(cpu_map, sched_domain_span(sd)))
 			break;
···
 * cpumask) fails, then fallback to a single sched domain,
 * as determined by the single cpumask fallback_doms.
 */
-cpumask_var_t fallback_doms;
+static cpumask_var_t fallback_doms;
 
 /*
 * arch_update_cpu_topology lets virtualized architectures update the
···
 * For now this just excludes isolated CPUs, but could be used to
 * exclude other special cases in the future.
 */
-int init_sched_domains(const struct cpumask *cpu_map)
+int sched_init_domains(const struct cpumask *cpu_map)
 {
 	int err;
+
+	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
+	zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
+	zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
 	arch_update_cpu_topology();
 	ndoms_cur = 1;
+91 -350
kernel/sched/wait.c
···
 #include <linux/hash.h>
 #include <linux/kthread.h>
 
-void __init_waitqueue_head(wait_queue_head_t *q, const char *name, struct lock_class_key *key)
+void __init_waitqueue_head(struct wait_queue_head *wq_head, const char *name, struct lock_class_key *key)
 {
-	spin_lock_init(&q->lock);
-	lockdep_set_class_and_name(&q->lock, key, name);
-	INIT_LIST_HEAD(&q->task_list);
+	spin_lock_init(&wq_head->lock);
+	lockdep_set_class_and_name(&wq_head->lock, key, name);
+	INIT_LIST_HEAD(&wq_head->head);
 }
 
 EXPORT_SYMBOL(__init_waitqueue_head);
 
-void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
+void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	__add_wait_queue(q, wait);
-	spin_unlock_irqrestore(&q->lock, flags);
+	wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL(add_wait_queue);
 
-void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait)
+void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
 
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	__add_wait_queue_tail(q, wait);
-	spin_unlock_irqrestore(&q->lock, flags);
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue_entry_tail(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
 
-void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
+void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&q->lock, flags);
-	__remove_wait_queue(q, wait);
-	spin_unlock_irqrestore(&q->lock, flags);
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__remove_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL(remove_wait_queue);
···
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static void __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
 			int nr_exclusive, int wake_flags, void *key)
 {
-	wait_queue_t *curr, *next;
+	wait_queue_entry_t *curr, *next;
 
-	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
+	list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
 		unsigned flags = curr->flags;
 
 		if (curr->func(curr, mode, wake_flags, key) &&
···
 /**
 * __wake_up - wake up threads blocked on a waitqueue.
- * @q: the waitqueue
+ * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
···
 * It may be assumed that this function implies a write memory barrier before
 * changing the task state if and only if any tasks are woken up.
 */
-void __wake_up(wait_queue_head_t *q, unsigned int mode,
+void __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 			int nr_exclusive, void *key)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_common(q, mode, nr_exclusive, 0, key);
-	spin_unlock_irqrestore(&q->lock, flags);
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__wake_up_common(wq_head, mode, nr_exclusive, 0, key);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL(__wake_up);
 
 /*
 * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
 */
-void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
+void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr)
 {
-	__wake_up_common(q, mode, nr, 0, NULL);
+	__wake_up_common(wq_head, mode, nr, 0, NULL);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(wq_head, mode, 1, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
 /**
 * __wake_up_sync_key - wake up threads blocked on a waitqueue.
- * @q: the waitqueue
+ * @wq_head: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: opaque value to be passed to wakeup targets
···
 * It may be assumed that this function implies a write memory barrier before
 * changing the task state if and only if any tasks are woken up.
 */
-void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
+void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
 			int nr_exclusive, void *key)
 {
 	unsigned long flags;
 	int wake_flags = 1; /* XXX WF_SYNC */
 
-	if (unlikely(!q))
+	if (unlikely(!wq_head))
 		return;
 
 	if (unlikely(nr_exclusive != 1))
 		wake_flags = 0;
 
-	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_common(q, mode, nr_exclusive, wake_flags, key);
-	spin_unlock_irqrestore(&q->lock, flags);
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__wake_up_common(wq_head, mode, nr_exclusive, wake_flags, key);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync_key);
 
 /*
 * __wake_up_sync - see __wake_up_sync_key()
 */
-void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+void __wake_up_sync(struct wait_queue_head *wq_head, unsigned int mode, int nr_exclusive)
 {
-	__wake_up_sync_key(q, mode, nr_exclusive, NULL);
+	__wake_up_sync_key(wq_head, mode, nr_exclusive, NULL);
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
···
 * loads to move into the critical region).
 */
 void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue(q, wait);
+	wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	if (list_empty(&wq_entry->entry))
+		__add_wait_queue(wq_head, wq_entry);
 	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait);
 
 void
-prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
+prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
 {
 	unsigned long flags;
 
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	if (list_empty(&wq_entry->entry))
+		__add_wait_queue_entry_tail(wq_head, wq_entry);
 	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
-void init_wait_entry(wait_queue_t *wait, int flags)
+void init_wait_entry(struct wait_queue_entry *wq_entry, int flags)
 {
-	wait->flags = flags;
-	wait->private = current;
-	wait->func = autoremove_wake_function;
-	INIT_LIST_HEAD(&wait->task_list);
+	wq_entry->flags = flags;
+	wq_entry->private = current;
+	wq_entry->func = autoremove_wake_function;
+	INIT_LIST_HEAD(&wq_entry->entry);
 }
 EXPORT_SYMBOL(init_wait_entry);
 
-long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
+long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
 {
 	unsigned long flags;
 	long ret = 0;
 
-	spin_lock_irqsave(&q->lock, flags);
+	spin_lock_irqsave(&wq_head->lock, flags);
 	if (unlikely(signal_pending_state(state, current))) {
 		/*
 		 * Exclusive waiter must not fail if it was selected by wakeup,
···
 		 *
 		 * The caller will recheck the condition and return success if
 		 * we were already woken up, we can not miss the event because
-		 * wakeup locks/unlocks the same q->lock.
+		 * wakeup locks/unlocks the same wq_head->lock.
 		 *
 		 * But we need to ensure that set-condition + wakeup after that
 		 * can't see us, it should wake up another exclusive waiter if
 		 * we fail.
 		 */
-		list_del_init(&wait->task_list);
+		list_del_init(&wq_entry->entry);
 		ret = -ERESTARTSYS;
 	} else {
-		if (list_empty(&wait->task_list)) {
-			if (wait->flags & WQ_FLAG_EXCLUSIVE)
-				__add_wait_queue_tail(q, wait);
+		if (list_empty(&wq_entry->entry)) {
+			if (wq_entry->flags & WQ_FLAG_EXCLUSIVE)
+				__add_wait_queue_entry_tail(wq_head, wq_entry);
 			else
-				__add_wait_queue(q, wait);
+				__add_wait_queue(wq_head, wq_entry);
 		}
 		set_current_state(state);
 	}
-	spin_unlock_irqrestore(&q->lock, flags);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
 
 	return ret;
 }
···
 * condition in the caller before they add the wait
 * entry to the wake queue.
 */
-int do_wait_intr(wait_queue_head_t *wq, wait_queue_t *wait)
+int do_wait_intr(wait_queue_head_t *wq, wait_queue_entry_t *wait)
 {
-	if (likely(list_empty(&wait->task_list)))
-		__add_wait_queue_tail(wq, wait);
+	if (likely(list_empty(&wait->entry)))
+		__add_wait_queue_entry_tail(wq, wait);
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	if (signal_pending(current))
···
 }
 EXPORT_SYMBOL(do_wait_intr);
 
-int do_wait_intr_irq(wait_queue_head_t *wq, wait_queue_t *wait)
+int do_wait_intr_irq(wait_queue_head_t *wq, wait_queue_entry_t *wait)
 {
-	if (likely(list_empty(&wait->task_list)))
-		__add_wait_queue_tail(wq, wait);
+	if (likely(list_empty(&wait->entry)))
+		__add_wait_queue_entry_tail(wq, wait);
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	if (signal_pending(current))
···
 /**
 * finish_wait - clean up after waiting in a queue
- * @q: waitqueue waited on
- * @wait: wait descriptor
+ * @wq_head: waitqueue waited on
+ * @wq_entry: wait descriptor
 *
 * Sets current thread back to running state and removes
 * the wait descriptor from the given waitqueue if still
 * queued.
 */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
 
···
 	 * have _one_ other CPU that looks at or modifies
 	 * the list).
 	 */
-	if (!list_empty_careful(&wait->task_list)) {
-		spin_lock_irqsave(&q->lock, flags);
-		list_del_init(&wait->task_list);
-		spin_unlock_irqrestore(&q->lock, flags);
+	if (!list_empty_careful(&wq_entry->entry)) {
+		spin_lock_irqsave(&wq_head->lock, flags);
+		list_del_init(&wq_entry->entry);
+		spin_unlock_irqrestore(&wq_head->lock, flags);
 	}
 }
 EXPORT_SYMBOL(finish_wait);
 
-int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
 {
-	int ret = default_wake_function(wait, mode, sync, key);
+	int ret = default_wake_function(wq_entry, mode, sync, key);
 
 	if (ret)
-		list_del_init(&wait->task_list);
+		list_del_init(&wq_entry->entry);
 	return ret;
 }
 EXPORT_SYMBOL(autoremove_wake_function);
···
 /*
 * DEFINE_WAIT_FUNC(wait, woken_wake_func);
 *
- * add_wait_queue(&wq, &wait);
+ * add_wait_queue(&wq_head, &wait);
 * for (;;) {
 *     if (condition)
 *         break;
 *
 *     p->state = mode;				condition = true;
 *     smp_mb(); // A				smp_wmb(); // C
- *     if (!wait->flags & WQ_FLAG_WOKEN)	wait->flags |= WQ_FLAG_WOKEN;
+ *     if (!(wq_entry->flags & WQ_FLAG_WOKEN))	wq_entry->flags |= WQ_FLAG_WOKEN;
 *         schedule()				try_to_wake_up();
 *     p->state = TASK_RUNNING;		    ~~~~~~~~~~~~~~~~~~
- *     wait->flags &= ~WQ_FLAG_WOKEN;	condition = true;
+ *     wq_entry->flags &= ~WQ_FLAG_WOKEN;	condition = true;
 *     smp_mb() // B				smp_wmb(); // C
- *         wait->flags |= WQ_FLAG_WOKEN;
+ *         wq_entry->flags |= WQ_FLAG_WOKEN;
 * }
- * remove_wait_queue(&wq, &wait);
+ * remove_wait_queue(&wq_head, &wait);
 *
 */
-long wait_woken(wait_queue_t *wait, unsigned mode, long timeout)
+long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
 {
 	set_current_state(mode); /* A */
 	/*
···
 	 * woken_wake_function() such that if we observe WQ_FLAG_WOKEN we must
 	 * also observe all state before the wakeup.
 	 */
-	if (!(wait->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())
+	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !is_kthread_should_stop())
 		timeout = schedule_timeout(timeout);
 	__set_current_state(TASK_RUNNING);
 
···
 	 * condition being true _OR_ WQ_FLAG_WOKEN such that we will not miss
 	 * an event.
 	 */
-	smp_store_mb(wait->flags, wait->flags & ~WQ_FLAG_WOKEN); /* B */
+	smp_store_mb(wq_entry->flags, wq_entry->flags & ~WQ_FLAG_WOKEN); /* B */
 
 	return timeout;
 }
 EXPORT_SYMBOL(wait_woken);
 
-int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+int woken_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *key)
 {
 	/*
 	 * Although this function is called under waitqueue lock, LOCK
···
 	 * and is paired with smp_store_mb() in wait_woken().
 	 */
 	smp_wmb(); /* C */
-	wait->flags |= WQ_FLAG_WOKEN;
+	wq_entry->flags |= WQ_FLAG_WOKEN;
 
-	return default_wake_function(wait, mode, sync, key);
+	return default_wake_function(wq_entry, mode, sync, key);
 }
 EXPORT_SYMBOL(woken_wake_function);
-
-int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
-{
-	struct wait_bit_key *key = arg;
-	struct wait_bit_queue *wait_bit
-		= container_of(wait, struct wait_bit_queue, wait);
-
-	if (wait_bit->key.flags != key->flags ||
-			wait_bit->key.bit_nr != key->bit_nr ||
-			test_bit(key->bit_nr, key->flags))
-		return 0;
-	else
-		return autoremove_wake_function(wait, mode, sync, key);
-}
-EXPORT_SYMBOL(wake_bit_function);
-
-/*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking)
- * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
- * permitted return codes. Nonzero return codes halt waiting and return.
- */
-int __sched
-__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
-	      wait_bit_action_f *action, unsigned mode)
-{
-	int ret = 0;
-
-	do {
-		prepare_to_wait(wq, &q->wait, mode);
-		if (test_bit(q->key.bit_nr, q->key.flags))
-			ret = (*action)(&q->key, mode);
-	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
-	finish_wait(wq, &q->wait);
-	return ret;
-}
-EXPORT_SYMBOL(__wait_on_bit);
-
-int __sched out_of_line_wait_on_bit(void *word, int bit,
-				    wait_bit_action_f *action, unsigned mode)
-{
-	wait_queue_head_t *wq = bit_waitqueue(word, bit);
-	DEFINE_WAIT_BIT(wait, word, bit);
-
-	return __wait_on_bit(wq, &wait, action, mode);
-}
-EXPORT_SYMBOL(out_of_line_wait_on_bit);
-
-int __sched out_of_line_wait_on_bit_timeout(
-	void *word, int bit, wait_bit_action_f *action,
-	unsigned mode, unsigned long timeout)
-{
-	wait_queue_head_t *wq = bit_waitqueue(word, bit);
-	DEFINE_WAIT_BIT(wait, word, bit);
-
-	wait.key.timeout = jiffies + timeout;
-	return __wait_on_bit(wq, &wait, action, mode);
-}
-EXPORT_SYMBOL_GPL(out_of_line_wait_on_bit_timeout);
-
-int __sched
-__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
-		   wait_bit_action_f *action, unsigned mode)
-{
-	int ret = 0;
-
-	for (;;) {
-		prepare_to_wait_exclusive(wq, &q->wait, mode);
-		if (test_bit(q->key.bit_nr, q->key.flags)) {
-			ret = action(&q->key, mode);
-			/*
-			 * See the comment in prepare_to_wait_event().
-			 * finish_wait() does not necessarily takes wq->lock,
-			 * but test_and_set_bit() implies mb() which pairs with
-			 * smp_mb__after_atomic() before wake_up_page().
-			 */
-			if (ret)
-				finish_wait(wq, &q->wait);
-		}
-		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
-			if (!ret)
-				finish_wait(wq, &q->wait);
-			return 0;
-		} else if (ret) {
-			return ret;
-		}
-	}
-}
-EXPORT_SYMBOL(__wait_on_bit_lock);
-
-int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
-					 wait_bit_action_f *action, unsigned mode)
-{
-	wait_queue_head_t *wq = bit_waitqueue(word, bit);
-	DEFINE_WAIT_BIT(wait, word, bit);
-
-	return __wait_on_bit_lock(wq, &wait, action, mode);
-}
-EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
-
-void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
-{
-	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
-	if (waitqueue_active(wq))
-		__wake_up(wq, TASK_NORMAL, 1, &key);
-}
-EXPORT_SYMBOL(__wake_up_bit);
-
-/**
- * wake_up_bit - wake up a waiter on a bit
- * @word: the word being waited on, a kernel virtual address
- * @bit: the bit of the word being waited on
- *
- * There is a standard hashed waitqueue table for generic use. This
- * is the part of the hashtable's accessor API that wakes up waiters
- * on a bit. For instance, if one were to have waiters on a bitflag,
- * one would call wake_up_bit() after clearing the bit.
- *
- * In order for this to function properly, as it uses waitqueue_active()
- * internally, some kind of memory barrier must be done prior to calling
- * this. Typically, this will be smp_mb__after_atomic(), but in some
- * cases where bitflags are manipulated non-atomically under a lock, one
- * may need to use a less regular barrier, such as fs/inode.c's smp_mb(),
- * because spin_unlock() does not guarantee a memory barrier.
- */
-void wake_up_bit(void *word, int bit)
-{
-	__wake_up_bit(bit_waitqueue(word, bit), word, bit);
-}
-EXPORT_SYMBOL(wake_up_bit);
-
-/*
- * Manipulate the atomic_t address to produce a better bit waitqueue table hash
- * index (we're keying off bit -1, but that would produce a horrible hash
- * value).
- */
-static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
-{
-	if (BITS_PER_LONG == 64) {
-		unsigned long q = (unsigned long)p;
-		return bit_waitqueue((void *)(q & ~1), q & 1);
-	}
-	return bit_waitqueue(p, 0);
-}
-
-static int wake_atomic_t_function(wait_queue_t *wait, unsigned mode, int sync,
-				  void *arg)
-{
-	struct wait_bit_key *key = arg;
-	struct wait_bit_queue *wait_bit
-		= container_of(wait, struct wait_bit_queue, wait);
-	atomic_t *val = key->flags;
-
-	if (wait_bit->key.flags != key->flags ||
-	    wait_bit->key.bit_nr != key->bit_nr ||
-	    atomic_read(val) != 0)
-		return 0;
-	return autoremove_wake_function(wait, mode, sync, key);
-}
-
-/*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
- * the actions of __wait_on_atomic_t() are permitted return codes. Nonzero
- * return codes halt waiting and return.
- */
-static __sched
-int __wait_on_atomic_t(wait_queue_head_t *wq, struct wait_bit_queue *q,
-		       int (*action)(atomic_t *), unsigned mode)
-{
-	atomic_t *val;
-	int ret = 0;
-
-	do {
-		prepare_to_wait(wq, &q->wait, mode);
-		val = q->key.flags;
-		if (atomic_read(val) == 0)
-			break;
-		ret = (*action)(val);
-	} while (!ret && atomic_read(val) != 0);
-	finish_wait(wq, &q->wait);
-	return ret;
-}
-
-#define DEFINE_WAIT_ATOMIC_T(name, p)					\
-	struct wait_bit_queue name = {					\
-		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
-		.wait = {						\
-			.private	= current,			\
-			.func		= wake_atomic_t_function,	\
-			.task_list	=				\
-				LIST_HEAD_INIT((name).wait.task_list),	\
-		},							\
-	}
-
-__sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
-					 unsigned mode)
-{
-	wait_queue_head_t *wq = atomic_t_waitqueue(p);
-	DEFINE_WAIT_ATOMIC_T(wait, p);
-
-	return __wait_on_atomic_t(wq, &wait, action, mode);
-}
-EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
-
-/**
- * wake_up_atomic_t - Wake up a waiter on an atomic_t
- * @p: The atomic_t being waited on, a kernel virtual address
- *
- * Wake up anyone waiting for the atomic_t to go to zero.
- *
- * Abuse the bit-waker function and its waitqueue hash table set (the atomic_t
- * check is done by the waiter's wake function, not by the waker itself).
- */
-void wake_up_atomic_t(atomic_t *p)
-{
-	__wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR);
-}
-EXPORT_SYMBOL(wake_up_atomic_t);
-
-__sched int bit_wait(struct wait_bit_key *word, int mode)
-{
-	schedule();
-	if (signal_pending_state(mode, current))
-		return -EINTR;
-	return 0;
-}
-EXPORT_SYMBOL(bit_wait);
-
-__sched int bit_wait_io(struct wait_bit_key *word, int mode)
-{
-	io_schedule();
-	if (signal_pending_state(mode, current))
-		return -EINTR;
-	return 0;
-}
-EXPORT_SYMBOL(bit_wait_io);
-
-__sched int bit_wait_timeout(struct wait_bit_key *word, int mode)
-{
-	unsigned long now = READ_ONCE(jiffies);
-	if (time_after_eq(now, word->timeout))
-		return -EAGAIN;
-	schedule_timeout(word->timeout - now);
-	if (signal_pending_state(mode, current))
-		return -EINTR;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(bit_wait_timeout);
-
-__sched int bit_wait_io_timeout(struct wait_bit_key *word, int mode)
-{
-	unsigned long now = READ_ONCE(jiffies);
-	if (time_after_eq(now, word->timeout))
-		return -EAGAIN;
-	io_schedule_timeout(word->timeout - now);
-	if (signal_pending_state(mode, current))
-		return -EINTR;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(bit_wait_io_timeout);
+286
kernel/sched/wait_bit.c
···
+/*
+ * The implementation of the wait_bit*() and related waiting APIs:
+ */
+#include <linux/wait_bit.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/hash.h>
+
+#define WAIT_TABLE_BITS 8
+#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
+
+static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+
+wait_queue_head_t *bit_waitqueue(void *word, int bit)
+{
+	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
+	unsigned long val = (unsigned long)word << shift | bit;
+
+	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+}
+EXPORT_SYMBOL(bit_waitqueue);
+
+int wake_bit_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync, void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+
+	if (wait_bit->key.flags != key->flags ||
+			wait_bit->key.bit_nr != key->bit_nr ||
+			test_bit(key->bit_nr, key->flags))
+		return 0;
+	else
+		return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+EXPORT_SYMBOL(wake_bit_function);
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking)
+ * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
+ * permitted return codes. Nonzero return codes halt waiting and return.
+ */
+int __sched
+__wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
+	      wait_bit_action_f *action, unsigned mode)
+{
+	int ret = 0;
+
+	do {
+		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
+		if (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags))
+			ret = (*action)(&wbq_entry->key, mode);
+	} while (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags) && !ret);
+	finish_wait(wq_head, &wbq_entry->wq_entry);
+	return ret;
+}
+EXPORT_SYMBOL(__wait_on_bit);
+
+int __sched out_of_line_wait_on_bit(void *word, int bit,
+				    wait_bit_action_f *action, unsigned mode)
+{
+	struct wait_queue_head *wq_head = bit_waitqueue(word, bit);
+	DEFINE_WAIT_BIT(wq_entry, word, bit);
+
+	return __wait_on_bit(wq_head, &wq_entry, action, mode);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_bit);
+
+int __sched out_of_line_wait_on_bit_timeout(
+	void *word, int bit, wait_bit_action_f *action,
+	unsigned mode, unsigned long timeout)
+{
+	struct wait_queue_head *wq_head = bit_waitqueue(word, bit);
+	DEFINE_WAIT_BIT(wq_entry, word, bit);
+
+	wq_entry.key.timeout = jiffies + timeout;
+	return __wait_on_bit(wq_head, &wq_entry, action, mode);
+}
+EXPORT_SYMBOL_GPL(out_of_line_wait_on_bit_timeout);
+
+int __sched
+__wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
+		   wait_bit_action_f *action, unsigned mode)
+{
+	int ret = 0;
+
+	for (;;) {
+		prepare_to_wait_exclusive(wq_head, &wbq_entry->wq_entry, mode);
+		if (test_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags)) {
+			ret = action(&wbq_entry->key, mode);
+			/*
+			 * See the comment in prepare_to_wait_event().
+			 * finish_wait() does not necessarily take wq_head->lock,
+			 * but test_and_set_bit() implies mb() which pairs with
+			 * smp_mb__after_atomic() before wake_up_page().
+			 */
+			if (ret)
+				finish_wait(wq_head, &wbq_entry->wq_entry);
+		}
+		if (!test_and_set_bit(wbq_entry->key.bit_nr, wbq_entry->key.flags)) {
+			if (!ret)
+				finish_wait(wq_head, &wbq_entry->wq_entry);
+			return 0;
+		} else if (ret) {
+			return ret;
+		}
+	}
+}
+EXPORT_SYMBOL(__wait_on_bit_lock);
+
+int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
+					 wait_bit_action_f *action, unsigned mode)
+{
+	struct wait_queue_head *wq_head = bit_waitqueue(word, bit);
+	DEFINE_WAIT_BIT(wq_entry, word, bit);
+
+	return __wait_on_bit_lock(wq_head, &wq_entry, action, mode);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
+
+void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit)
+{
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+	if (waitqueue_active(wq_head))
+		__wake_up(wq_head, TASK_NORMAL, 1, &key);
+}
+EXPORT_SYMBOL(__wake_up_bit);
+
+/**
+ * wake_up_bit - wake up a waiter on a bit
+ * @word: the word being waited on, a kernel virtual address
+ * @bit: the bit of the word being waited on
+ *
+ * There is a standard hashed waitqueue table for generic use. This
+ * is the part of the hashtable's accessor API that wakes up waiters
+ * on a bit. For instance, if one were to have waiters on a bitflag,
+ * one would call wake_up_bit() after clearing the bit.
+ *
+ * In order for this to function properly, as it uses waitqueue_active()
+ * internally, some kind of memory barrier must be done prior to calling
+ * this. Typically, this will be smp_mb__after_atomic(), but in some
+ * cases where bitflags are manipulated non-atomically under a lock, one
+ * may need to use a less regular barrier, such as fs/inode.c's smp_mb(),
+ * because spin_unlock() does not guarantee a memory barrier.
+ */
+void wake_up_bit(void *word, int bit)
+{
+	__wake_up_bit(bit_waitqueue(word, bit), word, bit);
+}
+EXPORT_SYMBOL(wake_up_bit);
+
+/*
+ * Manipulate the atomic_t address to produce a better bit waitqueue table hash
+ * index (we're keying off bit -1, but that would produce a horrible hash
+ * value).
+ */
+static inline wait_queue_head_t *atomic_t_waitqueue(atomic_t *p)
+{
+	if (BITS_PER_LONG == 64) {
+		unsigned long q = (unsigned long)p;
+		return bit_waitqueue((void *)(q & ~1), q & 1);
+	}
+	return bit_waitqueue(p, 0);
+}
+
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned mode, int sync,
+				  void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+	atomic_t *val = key->flags;
+
+	if (wait_bit->key.flags != key->flags ||
+	    wait_bit->key.bit_nr != key->bit_nr ||
+	    atomic_read(val) != 0)
+		return 0;
+	return autoremove_wake_function(wq_entry, mode, sync, key);
+}
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
+ * the actions of __wait_on_atomic_t() are permitted return codes. Nonzero
+ * return codes halt waiting and return.
+ */
+static __sched
+int __wait_on_atomic_t(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_entry,
+		       int (*action)(atomic_t *), unsigned mode)
+{
+	atomic_t *val;
+	int ret = 0;
+
+	do {
+		prepare_to_wait(wq_head, &wbq_entry->wq_entry, mode);
+		val = wbq_entry->key.flags;
+		if (atomic_read(val) == 0)
+			break;
+		ret = (*action)(val);
+	} while (!ret && atomic_read(val) != 0);
+	finish_wait(wq_head, &wbq_entry->wq_entry);
+	return ret;
+}
+
+#define DEFINE_WAIT_ATOMIC_T(name, p)					\
+	struct wait_bit_queue_entry name = {				\
+		.key = __WAIT_ATOMIC_T_KEY_INITIALIZER(p),		\
+		.wq_entry = {						\
+			.private	= current,			\
+			.func		= wake_atomic_t_function,	\
+			.entry		=				\
+				LIST_HEAD_INIT((name).wq_entry.entry),	\
+		},							\
+	}
+
+__sched int out_of_line_wait_on_atomic_t(atomic_t *p, int (*action)(atomic_t *),
+					 unsigned mode)
+{
+	struct wait_queue_head *wq_head = atomic_t_waitqueue(p);
+	DEFINE_WAIT_ATOMIC_T(wq_entry, p);
+
+	return __wait_on_atomic_t(wq_head, &wq_entry, action, mode);
+}
+EXPORT_SYMBOL(out_of_line_wait_on_atomic_t);
+
+/**
+ * wake_up_atomic_t - Wake up a waiter on an atomic_t
+ * @p: The atomic_t being waited on, a kernel virtual address
+ *
+ * Wake up anyone waiting for the atomic_t to go to zero.
+ *
+ * Abuse the bit-waker function and its waitqueue hash table set (the atomic_t
+ * check is done by the waiter's wake function, not by the waker itself).
231 + */ 232 + void wake_up_atomic_t(atomic_t *p) 233 + { 234 + __wake_up_bit(atomic_t_waitqueue(p), p, WAIT_ATOMIC_T_BIT_NR); 235 + } 236 + EXPORT_SYMBOL(wake_up_atomic_t); 237 + 238 + __sched int bit_wait(struct wait_bit_key *word, int mode) 239 + { 240 + schedule(); 241 + if (signal_pending_state(mode, current)) 242 + return -EINTR; 243 + return 0; 244 + } 245 + EXPORT_SYMBOL(bit_wait); 246 + 247 + __sched int bit_wait_io(struct wait_bit_key *word, int mode) 248 + { 249 + io_schedule(); 250 + if (signal_pending_state(mode, current)) 251 + return -EINTR; 252 + return 0; 253 + } 254 + EXPORT_SYMBOL(bit_wait_io); 255 + 256 + __sched int bit_wait_timeout(struct wait_bit_key *word, int mode) 257 + { 258 + unsigned long now = READ_ONCE(jiffies); 259 + if (time_after_eq(now, word->timeout)) 260 + return -EAGAIN; 261 + schedule_timeout(word->timeout - now); 262 + if (signal_pending_state(mode, current)) 263 + return -EINTR; 264 + return 0; 265 + } 266 + EXPORT_SYMBOL_GPL(bit_wait_timeout); 267 + 268 + __sched int bit_wait_io_timeout(struct wait_bit_key *word, int mode) 269 + { 270 + unsigned long now = READ_ONCE(jiffies); 271 + if (time_after_eq(now, word->timeout)) 272 + return -EAGAIN; 273 + io_schedule_timeout(word->timeout - now); 274 + if (signal_pending_state(mode, current)) 275 + return -EINTR; 276 + return 0; 277 + } 278 + EXPORT_SYMBOL_GPL(bit_wait_io_timeout); 279 + 280 + void __init wait_bit_init(void) 281 + { 282 + int i; 283 + 284 + for (i = 0; i < WAIT_TABLE_SIZE; i++) 285 + init_waitqueue_head(bit_wait_table + i); 286 + }
+13 -3
kernel/smp.c
···
 struct call_function_data {
 	struct call_single_data __percpu *csd;
 	cpumask_var_t		cpumask;
+	cpumask_var_t		cpumask_ipi;
 };

 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
···
 	if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
 				     cpu_to_node(cpu)))
 		return -ENOMEM;
+	if (!zalloc_cpumask_var_node(&cfd->cpumask_ipi, GFP_KERNEL,
+				     cpu_to_node(cpu))) {
+		free_cpumask_var(cfd->cpumask);
+		return -ENOMEM;
+	}
 	cfd->csd = alloc_percpu(struct call_single_data);
 	if (!cfd->csd) {
 		free_cpumask_var(cfd->cpumask);
+		free_cpumask_var(cfd->cpumask_ipi);
 		return -ENOMEM;
 	}

···
 	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);

 	free_cpumask_var(cfd->cpumask);
+	free_cpumask_var(cfd->cpumask_ipi);
 	free_percpu(cfd->csd);
 	return 0;
 }
···
 	cfd = this_cpu_ptr(&cfd_data);

 	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-	cpumask_clear_cpu(this_cpu, cfd->cpumask);
+	__cpumask_clear_cpu(this_cpu, cfd->cpumask);

 	/* Some callers race with other cpus changing the passed mask */
 	if (unlikely(!cpumask_weight(cfd->cpumask)))
 		return;

+	cpumask_clear(cfd->cpumask_ipi);
 	for_each_cpu(cpu, cfd->cpumask) {
 		struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);

···
 		csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
 		csd->info = info;
-		llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
+		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 	}

 	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask);
+	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);

 	if (wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
+3
kernel/time/clocksource.c
···
 			continue;
 		}

+		if (cs == curr_clocksource && cs->tick_stable)
+			cs->tick_stable(cs);
+
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
 		    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
 		    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
+6 -5
kernel/time/tick-sched.c
···
 	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
 	ts->idle_active = 0;

-	sched_clock_idle_wakeup_event(0);
+	sched_clock_idle_wakeup_event();
 }

 static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
···
 	 * the scheduler tick in nohz_restart_sched_tick.
 	 */
 	if (!ts->tick_stopped) {
-		nohz_balance_enter_idle(cpu);
-		calc_load_enter_idle();
+		calc_load_nohz_start();
 		cpu_load_update_nohz_start();

 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
···
 	 */
 	timer_clear_idle();

-	calc_load_exit_idle();
+	calc_load_nohz_stop();
 	touch_softlockup_watchdog_sched();
 	/*
 	 * Cancel the scheduled timer and restore the tick
···
 		ts->idle_expires = expires;
 	}

-	if (!was_stopped && ts->tick_stopped)
+	if (!was_stopped && ts->tick_stopped) {
 		ts->idle_jiffies = ts->last_jiffies;
+		nohz_balance_enter_idle(cpu);
+	}
 }
+2 -2
kernel/workqueue.c
···
 EXPORT_SYMBOL_GPL(flush_work);

 struct cwt_wait {
-	wait_queue_t		wait;
+	wait_queue_entry_t		wait;
 	struct work_struct	*work;
 };

-static int cwt_wakefn(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int cwt_wakefn(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
 	struct cwt_wait *cwait = container_of(wait, struct cwt_wait, wait);

+32
lib/cpumask.c
···
 }
 EXPORT_SYMBOL(cpumask_any_but);

+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+	int next;
+
+again:
+	next = cpumask_next(n, mask);
+
+	if (wrap && n < start && next >= start) {
+		return nr_cpumask_bits;
+
+	} else if (next >= nr_cpumask_bits) {
+		wrap = true;
+		n = -1;
+		goto again;
+	}
+
+	return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
 /* These are not inline because of header tangles. */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 /**
+1 -1
lib/smp_processor_id.c
···
 	/*
 	 * It is valid to assume CPU-locality during early bootup:
 	 */
-	if (system_state != SYSTEM_RUNNING)
+	if (system_state < SYSTEM_SCHEDULING)
 		goto out;

 	/*
+6 -6
mm/filemap.c
···
 struct wait_page_queue {
 	struct page *page;
 	int bit_nr;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 };

-static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
+static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
 	struct wait_page_key *key = arg;
 	struct wait_page_queue *wait_page
···
 	struct page *page, int bit_nr, int state, bool lock)
 {
 	struct wait_page_queue wait_page;
-	wait_queue_t *wait = &wait_page.wait;
+	wait_queue_entry_t *wait = &wait_page.wait;
 	int ret = 0;

 	init_wait(wait);
···
 	for (;;) {
 		spin_lock_irq(&q->lock);

-		if (likely(list_empty(&wait->task_list))) {
+		if (likely(list_empty(&wait->entry))) {
 			if (lock)
-				__add_wait_queue_tail_exclusive(q, wait);
+				__add_wait_queue_entry_tail_exclusive(q, wait);
 			else
 				__add_wait_queue(q, wait);
 			SetPageWaiters(page);
···
  *
  * Add an arbitrary @waiter to the wait queue for the nominated @page.
  */
-void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)
 {
 	wait_queue_head_t *q = page_waitqueue(page);
 	unsigned long flags;
+5 -5
mm/memcontrol.c
···
 	 */
 	poll_table pt;
 	wait_queue_head_t *wqh;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	struct work_struct remove;
 };

···

 struct oom_wait_info {
 	struct mem_cgroup *memcg;
-	wait_queue_t	wait;
+	wait_queue_entry_t	wait;
 };

-static int memcg_oom_wake_function(wait_queue_t *wait,
+static int memcg_oom_wake_function(wait_queue_entry_t *wait,
 	unsigned mode, int sync, void *arg)
 {
 	struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
···
 	owait.wait.flags = 0;
 	owait.wait.func = memcg_oom_wake_function;
 	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.task_list);
+	INIT_LIST_HEAD(&owait.wait.entry);

 	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
 	mem_cgroup_mark_under_oom(memcg);
···
  *
  * Called with wqh->lock held and interrupts disabled.
  */
-static int memcg_event_wake(wait_queue_t *wait, unsigned mode,
+static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
 	int sync, void *key)
 {
 	struct mem_cgroup_event *event =
+1 -1
mm/mempool.c
···
 {
 	void *element;
 	unsigned long flags;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	gfp_t gfp_temp;

 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
+3 -3
mm/shmem.c
···
  * entry unconditionally - even if something else had already woken the
  * target.
  */
-static int synchronous_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int synchronous_wake_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
 	int ret = default_wake_function(wait, mode, sync, key);
-	list_del_init(&wait->task_list);
+	list_del_init(&wait->entry);
 	return ret;
 }
···
 			spin_lock(&inode->i_lock);
 			inode->i_private = NULL;
 			wake_up_all(&shmem_falloc_waitq);
-			WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.task_list));
+			WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.head));
 			spin_unlock(&inode->i_lock);
 			error = 0;
 			goto out;
+1 -1
mm/vmscan.c
···
 	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
 	if (IS_ERR(pgdat->kswapd)) {
 		/* failure at boot is fatal */
-		BUG_ON(system_state == SYSTEM_BOOTING);
+		BUG_ON(system_state < SYSTEM_RUNNING);
 		pr_err("Failed to start kswapd on node %d\n", nid);
 		ret = PTR_ERR(pgdat->kswapd);
 		pgdat->kswapd = NULL;
+2 -2
net/9p/trans_fd.c
···

 struct p9_poll_wait {
 	struct p9_conn *conn;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	wait_queue_head_t *wait_addr;
 };

···
 	clear_bit(Wworksched, &m->wsched);
 }

-static int p9_pollwake(wait_queue_t *wait, unsigned int mode, int sync, void *key)
+static int p9_pollwake(wait_queue_entry_t *wait, unsigned int mode, int sync, void *key)
 {
 	struct p9_poll_wait *pwait =
 		container_of(wait, struct p9_poll_wait, wait);
+1 -1
net/bluetooth/bnep/core.c
···
 	struct net_device *dev = s->dev;
 	struct sock *sk = s->sock->sk;
 	struct sk_buff *skb;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	BT_DBG("");
+1 -1
net/bluetooth/cmtp/core.c
···
 	struct cmtp_session *session = arg;
 	struct sock *sk = session->sock->sk;
 	struct sk_buff *skb;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	BT_DBG("session %p", session);
+1 -1
net/bluetooth/hidp/core.c
···
 static int hidp_session_thread(void *arg)
 {
 	struct hidp_session *session = arg;
-	wait_queue_t ctrl_wait, intr_wait;
+	wait_queue_entry_t ctrl_wait, intr_wait;

 	BT_DBG("session %p", session);
+1 -1
net/core/datagram.c
···
 	return sk->sk_type == SOCK_SEQPACKET || sk->sk_type == SOCK_STREAM;
 }

-static int receiver_wake_function(wait_queue_t *wait, unsigned int mode, int sync,
+static int receiver_wake_function(wait_queue_entry_t *wait, unsigned int mode, int sync,
 				  void *key)
 {
 	unsigned long bits = (unsigned long)key;
+2 -2
net/unix/af_unix.c
···
  * are still connected to it and there's no way to inform "a polling
  * implementation" that it should let go of a certain wait queue
  *
- * In order to propagate a wake up, a wait_queue_t of the client
+ * In order to propagate a wake up, a wait_queue_entry_t of the client
  * socket is enqueued on the peer_wait queue of the server socket
  * whose wake function does a wake_up on the ordinary client socket
  * wait queue. This connection is established whenever a write (or
···
  * was relayed.
  */

-static int unix_dgram_peer_wake_relay(wait_queue_t *q, unsigned mode, int flags,
+static int unix_dgram_peer_wake_relay(wait_queue_entry_t *q, unsigned mode, int flags,
 				      void *key)
 {
 	struct unix_sock *u;
+1
security/keys/internal.h
···
 #define _INTERNAL_H

 #include <linux/sched.h>
+#include <linux/wait_bit.h>
 #include <linux/cred.h>
 #include <linux/key-type.h>
 #include <linux/task_work.h>
+1 -1
sound/core/control.c
···
 	struct snd_ctl_event ev;
 	struct snd_kctl_event *kev;
 	while (list_empty(&ctl->events)) {
-		wait_queue_t wait;
+		wait_queue_entry_t wait;
 		if ((file->f_flags & O_NONBLOCK) != 0 || result > 0) {
 			err = -EAGAIN;
 			goto __end_lock;
+1 -1
sound/core/hwdep.c
···
 	int major = imajor(inode);
 	struct snd_hwdep *hw;
 	int err;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	if (major == snd_major) {
 		hw = snd_lookup_minor_data(iminor(inode),
+1 -1
sound/core/init.c
···
  */
 int snd_power_wait(struct snd_card *card, unsigned int power_state)
 {
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	int result = 0;

 	/* fastpath */
+2 -2
sound/core/oss/pcm_oss.c
···
 	ssize_t result = 0;
 	snd_pcm_state_t state;
 	long res;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	runtime = substream->runtime;
 	init_waitqueue_entry(&wait, current);
···
 	struct snd_pcm_oss_file *pcm_oss_file;
 	struct snd_pcm_oss_setup setup[2];
 	int nonblock;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	err = nonseekable_open(inode, file);
 	if (err < 0)
+1 -1
sound/core/pcm_lib.c
···
 {
 	struct snd_pcm_runtime *runtime = substream->runtime;
 	int is_playback = substream->stream == SNDRV_PCM_STREAM_PLAYBACK;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	int err = 0;
 	snd_pcm_uframes_t avail = 0;
 	long wait_time, tout;
+2 -2
sound/core/pcm_native.c
···
 	struct snd_card *card;
 	struct snd_pcm_runtime *runtime;
 	struct snd_pcm_substream *s;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	int result = 0;
 	int nonblock = 0;

···
 static int snd_pcm_open(struct file *file, struct snd_pcm *pcm, int stream)
 {
 	int err;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	if (pcm == NULL) {
 		err = -ENODEV;
+4 -4
sound/core/rawmidi.c
···
 	int err;
 	struct snd_rawmidi *rmidi;
 	struct snd_rawmidi_file *rawmidi_file = NULL;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	if ((file->f_flags & O_APPEND) && !(file->f_flags & O_NONBLOCK))
 		return -EINVAL; /* invalid combination */
···
 	while (count > 0) {
 		spin_lock_irq(&runtime->lock);
 		while (!snd_rawmidi_ready(substream)) {
-			wait_queue_t wait;
+			wait_queue_entry_t wait;
 			if ((file->f_flags & O_NONBLOCK) != 0 || result > 0) {
 				spin_unlock_irq(&runtime->lock);
 				return result > 0 ? result : -EAGAIN;
···
 	while (count > 0) {
 		spin_lock_irq(&runtime->lock);
 		while (!snd_rawmidi_ready_append(substream, count)) {
-			wait_queue_t wait;
+			wait_queue_entry_t wait;
 			if (file->f_flags & O_NONBLOCK) {
 				spin_unlock_irq(&runtime->lock);
 				return result > 0 ? result : -EAGAIN;
···
 	if (file->f_flags & O_DSYNC) {
 		spin_lock_irq(&runtime->lock);
 		while (runtime->avail != runtime->buffer_size) {
-			wait_queue_t wait;
+			wait_queue_entry_t wait;
 			unsigned int last_avail = runtime->avail;
 			init_waitqueue_entry(&wait, current);
 			add_wait_queue(&runtime->sleep, &wait);
+1 -1
sound/core/seq/seq_fifo.c
···
 {
 	struct snd_seq_event_cell *cell;
 	unsigned long flags;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	if (snd_BUG_ON(!f))
 		return -EINVAL;
+1 -1
sound/core/seq/seq_memory.c
···
 	struct snd_seq_event_cell *cell;
 	unsigned long flags;
 	int err = -EAGAIN;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	if (pool == NULL)
 		return -EINVAL;
+1 -1
sound/core/timer.c
···
 	spin_lock_irq(&tu->qlock);
 	while ((long)count - result >= unit) {
 		while (!tu->qused) {
-			wait_queue_t wait;
+			wait_queue_entry_t wait;

 			if ((file->f_flags & O_NONBLOCK) != 0 || result > 0) {
 				err = -EAGAIN;
+1 -1
sound/isa/wavefront/wavefront_synth.c
···
 			      int val, int port, unsigned long timeout)

 {
-	wait_queue_t wait;
+	wait_queue_entry_t wait;

 	init_waitqueue_entry(&wait, current);
 	spin_lock_irq(&dev->irq_lock);
+2 -2
sound/pci/mixart/mixart_core.c
···
 	struct mixart_msg resp;
 	u32 msg_frame = 0; /* set to 0, so it's no notification to wait for, but the answer */
 	int err;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	long timeout;

 	init_waitqueue_entry(&wait, current);
···
 			   struct mixart_msg *request, u32 notif_event)
 {
 	int err;
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	long timeout;

 	if (snd_BUG_ON(!notif_event))
+1 -1
sound/pci/ymfpci/ymfpci_main.c
···

 static void snd_ymfpci_irq_wait(struct snd_ymfpci *chip)
 {
-	wait_queue_t wait;
+	wait_queue_entry_t wait;
 	int loops = 4;

 	while (loops-- > 0) {
+1 -1
virt/kvm/eventfd.c
···
  * Called with wqh->lock held and interrupts disabled
  */
 static int
-irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
+irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(wait, struct kvm_kernel_irqfd, wait);