Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu

Pull RCU updates from Paul E. McKenney:

- Documentation updates.

- Changes permitting use of call_rcu() and friends very early in
boot, for example, before rcu_init() is invoked.

- Miscellaneous fixes.

- Add in-kernel API to enable and disable expediting of normal RCU
grace periods.

- Improve RCU's handling of (hotplug-) outgoing CPUs.

Note: ARM support is lagging a bit here, and these improved
diagnostics might generate (harmless) splats.

- NO_HZ_FULL_SYSIDLE fixes.

- Tiny RCU updates to make it more tiny.

Signed-off-by: Ingo Molnar <mingo@kernel.org>

+948 -437
+23 -22
Documentation/atomic_ops.txt
··· 201 201 atomic_t and return the new counter value after the operation is 202 202 performed. 203 203 204 - Unlike the above routines, it is required that explicit memory 205 - barriers are performed before and after the operation. It must be 206 - done such that all memory operations before and after the atomic 207 - operation calls are strongly ordered with respect to the atomic 208 - operation itself. 204 + Unlike the above routines, it is required that these primitives 205 + include explicit memory barriers that are performed before and after 206 + the operation. It must be done such that all memory operations before 207 + and after the atomic operation calls are strongly ordered with respect 208 + to the atomic operation itself. 209 209 210 210 For example, it should behave as if a smp_mb() call existed both 211 211 before and after the atomic operation. ··· 233 233 given atomic counter. They return a boolean indicating whether the 234 234 resulting counter value was zero or not. 235 235 236 - It requires explicit memory barrier semantics around the operation as 237 - above. 236 + Again, these primitives provide explicit memory barrier semantics around 237 + the atomic operation. 238 238 239 239 int atomic_sub_and_test(int i, atomic_t *v); 240 240 241 241 This is identical to atomic_dec_and_test() except that an explicit 242 - decrement is given instead of the implicit "1". It requires explicit 243 - memory barrier semantics around the operation. 242 + decrement is given instead of the implicit "1". This primitive must 243 + provide explicit memory barrier semantics around the operation. 244 244 245 245 int atomic_add_negative(int i, atomic_t *v); 246 246 247 - The given increment is added to the given atomic counter value. A 248 - boolean is return which indicates whether the resulting counter value 249 - is negative. It requires explicit memory barrier semantics around the 250 - operation. 
247 + The given increment is added to the given atomic counter value. A boolean 248 + is returned which indicates whether the resulting counter value is negative. 249 + This primitive must provide explicit memory barrier semantics around 250 + the operation. 251 251 252 252 Then: 253 253 ··· 257 257 the given new value. It returns the old value that the atomic variable v had 258 258 just before the operation. 259 259 260 - atomic_xchg requires explicit memory barriers around the operation. 260 + atomic_xchg must provide explicit memory barriers around the operation. 261 261 262 262 int atomic_cmpxchg(atomic_t *v, int old, int new); 263 263 ··· 266 266 atomic_cmpxchg will only satisfy its atomicity semantics as long as all 267 267 other accesses of *v are performed through atomic_xxx operations. 268 268 269 - atomic_cmpxchg requires explicit memory barriers around the operation. 269 + atomic_cmpxchg must provide explicit memory barriers around the operation. 270 270 271 271 The semantics for atomic_cmpxchg are the same as those defined for 'cas' 272 272 below. ··· 279 279 returns non zero. If v is equal to u then it returns zero. This is done as 280 280 an atomic operation. 281 281 282 - atomic_add_unless requires explicit memory barriers around the operation 283 - unless it fails (returns 0). 282 + atomic_add_unless must provide explicit memory barriers around the 283 + operation unless it fails (returns 0). 284 284 285 285 atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) 286 286 ··· 460 460 like this occur as well. 461 461 462 462 These routines, like the atomic_t counter operations returning values, 463 - require explicit memory barrier semantics around their execution. 463 + must provide explicit memory barrier semantics around their execution. 
464 + All memory operations before the atomic bit operation call must be 465 + made visible globally before the atomic bit operation is made visible. 466 466 Likewise, the atomic bit operation must be visible globally before any 467 467 subsequent memory operation is made visible. For example: 468 468 ··· 536 536 These non-atomic variants also do not require any special memory 537 537 barrier semantics. 538 538 539 - The routines xchg() and cmpxchg() need the same exact memory barriers 540 - as the atomic and bit operations returning values. 539 + The routines xchg() and cmpxchg() must provide the same exact 540 + memory-barrier semantics as the atomic and bit operations returning 541 + values. 541 542 542 543 Spinlocks and rwlocks have memory barrier expectations as well. 543 544 The rule to follow is simple:
+15 -5
Documentation/kernel-parameters.txt
··· 2968 2968 Set maximum number of finished RCU callbacks to 2969 2969 process in one batch. 2970 2970 2971 + rcutree.gp_init_delay= [KNL] 2972 + Set the number of jiffies to delay each step of 2973 + RCU grace-period initialization. This only has 2974 + effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT is 2975 + set. 2976 + 2971 2977 rcutree.rcu_fanout_leaf= [KNL] 2972 2978 Increase the number of CPUs assigned to each 2973 2979 leaf rcu_node structure. Useful for very large ··· 2997 2991 value is one, and maximum value is HZ. 2998 2992 2999 2993 rcutree.kthread_prio= [KNL,BOOT] 3000 - Set the SCHED_FIFO priority of the RCU 3001 - per-CPU kthreads (rcuc/N). This value is also 3002 - used for the priority of the RCU boost threads 3003 - (rcub/N). Valid values are 1-99 and the default 3004 - is 1 (the least-favored priority). 2994 + Set the SCHED_FIFO priority of the RCU per-CPU 2995 + kthreads (rcuc/N). This value is also used for 2996 + the priority of the RCU boost threads (rcub/N) 2997 + and for the RCU grace-period kthreads (rcu_bh, 2998 + rcu_preempt, and rcu_sched). If RCU_BOOST is 2999 + set, valid values are 1-99 and the default is 1 3000 + (the least-favored priority). Otherwise, when 3001 + RCU_BOOST is not set, valid values are 0-99 and 3002 + the default is zero (non-realtime operation). 3005 3003 3006 3004 rcutree.rcu_nocb_leader_stride= [KNL] 3007 3005 Set the number of NOCB kthread groups, which
+21 -13
Documentation/kernel-per-CPU-kthreads.txt
··· 190 190 on each CPU, including cs_dbs_timer() and od_dbs_timer(). 191 191 WARNING: Please check your CPU specifications to 192 192 make sure that this is safe on your particular system. 193 - d. It is not possible to entirely get rid of OS jitter 194 - from vmstat_update() on CONFIG_SMP=y systems, but you 195 - can decrease its frequency by writing a large value 196 - to /proc/sys/vm/stat_interval. The default value is 197 - HZ, for an interval of one second. Of course, larger 198 - values will make your virtual-memory statistics update 199 - more slowly. Of course, you can also run your workload 200 - at a real-time priority, thus preempting vmstat_update(), 193 + d. As of v3.18, Christoph Lameter's on-demand vmstat workers 194 + commit prevents OS jitter due to vmstat_update() on 195 + CONFIG_SMP=y systems. Before v3.18, it is not possible 196 + to entirely get rid of the OS jitter, but you can 197 + decrease its frequency by writing a large value to 198 + /proc/sys/vm/stat_interval. The default value is HZ, 199 + for an interval of one second. Of course, larger values 200 + will make your virtual-memory statistics update more 201 + slowly. Of course, you can also run your workload at 202 + a real-time priority, thus preempting vmstat_update(), 201 203 but if your workload is CPU-bound, this is a bad idea. 202 204 However, there is an RFC patch from Christoph Lameter 203 205 (based on an earlier one from Gilad Ben-Yossef) that 204 206 reduces or even eliminates vmstat overhead for some 205 207 workloads at https://lkml.org/lkml/2013/9/4/379. 206 - e. If running on high-end powerpc servers, build with 208 + e. Boot with "elevator=noop" to avoid workqueue use by 209 + the block layer. 210 + f. If running on high-end powerpc servers, build with 207 211 CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS 208 212 daemon from running on each CPU every second or so. 
209 213 (This will require editing Kconfig files and will defeat ··· 215 211 due to the rtas_event_scan() function. 216 212 WARNING: Please check your CPU specifications to 217 213 make sure that this is safe on your particular system. 218 - f. If running on Cell Processor, build your kernel with 214 + g. If running on Cell Processor, build your kernel with 219 215 CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from 220 216 spu_gov_work(). 221 217 WARNING: Please check your CPU specifications to 222 218 make sure that this is safe on your particular system. 223 - g. If running on PowerMAC, build your kernel with 219 + h. If running on PowerMAC, build your kernel with 224 220 CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, 225 221 avoiding OS jitter from rackmeter_do_timer(). 226 222 ··· 262 258 To reduce its OS jitter, do at least one of the following: 263 259 1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these 264 260 kthreads from being created in the first place. 265 - 2. Echo a zero to /proc/sys/kernel/watchdog to disable the 261 + 2. Boot with "nosoftlockup=0", which will also prevent these kthreads 262 + from being created. Other related watchdog and softlockup boot 263 + parameters may be found in Documentation/kernel-parameters.txt 264 + and Documentation/watchdog/watchdog-parameters.txt. 265 + 3. Echo a zero to /proc/sys/kernel/watchdog to disable the 266 266 watchdog timer. 267 - 3. Echo a large number of /proc/sys/kernel/watchdog_thresh in 267 + 4. Echo a large number of /proc/sys/kernel/watchdog_thresh in 268 268 order to reduce the frequency of OS jitter due to the watchdog 269 269 timer down to a level that is acceptable for your workload.
+29 -13
Documentation/memory-barriers.txt
··· 592 592 CONTROL DEPENDENCIES 593 593 -------------------- 594 594 595 - A control dependency requires a full read memory barrier, not simply a data 596 - dependency barrier to make it work correctly. Consider the following bit of 597 - code: 595 + A load-load control dependency requires a full read memory barrier, not 596 + simply a data dependency barrier to make it work correctly. Consider the 597 + following bit of code: 598 598 599 599 q = ACCESS_ONCE(a); 600 600 if (q) { ··· 615 615 } 616 616 617 617 However, stores are not speculated. This means that ordering -is- provided 618 - in the following example: 618 + for load-store control dependencies, as in the following example: 619 619 620 620 q = ACCESS_ONCE(a); 621 621 if (q) { 622 622 ACCESS_ONCE(b) = p; 623 623 } 624 624 625 - Please note that ACCESS_ONCE() is not optional! Without the 625 + Control dependencies pair normally with other types of barriers. 626 + That said, please note that ACCESS_ONCE() is not optional! Without the 626 627 ACCESS_ONCE(), the compiler might combine the load from 'a' with other loads from 627 628 'a', and the store to 'b' with other stores to 'b', with possible highly 628 629 counterintuitive effects on ordering. ··· 814 813 barrier() can help to preserve your control dependency. Please 815 814 see the Compiler Barrier section for more information. 816 815 816 + (*) Control dependencies pair normally with other types of barriers. 817 + 817 818 (*) Control dependencies do -not- provide transitivity. If you 818 819 need transitivity, use smp_mb(). 819 820 ··· 826 823 When dealing with CPU-CPU interactions, certain types of memory barrier should 827 824 always be paired. A lack of appropriate pairing is almost certainly an error. 828 825 829 - General barriers pair with each other, though they also pair with 830 - most other types of barriers, albeit without transitivity. 
An acquire 831 - barrier pairs with a release barrier, but both may also pair with other 832 - barriers, including of course general barriers. A write barrier pairs 833 - with a data dependency barrier, an acquire barrier, a release barrier, 834 - a read barrier, or a general barrier. Similarly a read barrier or a 835 - data dependency barrier pairs with a write barrier, an acquire barrier, 836 - a release barrier, or a general barrier: 826 + General barriers pair with each other, though they also pair with most 827 + other types of barriers, albeit without transitivity. An acquire barrier 828 + pairs with a release barrier, but both may also pair with other barriers, 829 + including of course general barriers. A write barrier pairs with a data 830 + dependency barrier, a control dependency, an acquire barrier, a release 831 + barrier, a read barrier, or a general barrier. Similarly a read barrier, 832 + control dependency, or a data dependency barrier pairs with a write 833 + barrier, an acquire barrier, a release barrier, or a general barrier: 837 834 838 835 CPU 1 CPU 2 839 836 =============== =============== ··· 852 849 ACCESS_ONCE(b) = &a; x = ACCESS_ONCE(b); 853 850 <data dependency barrier> 854 851 y = *x; 852 + 853 + Or even: 854 + 855 + CPU 1 CPU 2 856 + =============== =============================== 857 + r1 = ACCESS_ONCE(y); 858 + <general barrier> 859 + ACCESS_ONCE(x) = 1; if (r2 = ACCESS_ONCE(x)) { 860 + <implicit control dependency> 861 + ACCESS_ONCE(y) = 1; 862 + } 863 + 864 + assert(r1 == 0 || r2 == 0); 855 865 856 866 Basically, the read barrier always has to be there, even though it can be of 857 867 the "weaker" type.
+3 -7
Documentation/timers/NO_HZ.txt
··· 158 158 to the need to inform kernel subsystems (such as RCU) about 159 159 the change in mode. 160 160 161 - 3. POSIX CPU timers on adaptive-tick CPUs may miss their deadlines 162 - (perhaps indefinitely) because they currently rely on 163 - scheduling-tick interrupts. This will likely be fixed in 164 - one of two ways: (1) Prevent CPUs with POSIX CPU timers from 165 - entering adaptive-tick mode, or (2) Use hrtimers or other 166 - adaptive-ticks-immune mechanism to cause the POSIX CPU timer to 167 - fire properly. 161 + 3. POSIX CPU timers prevent CPUs from entering adaptive-tick mode. 162 + Real-time applications needing to take actions based on CPU time 163 + consumption need to use other means of doing so. 168 164 169 165 4. If there are more perf events pending than the hardware can 170 166 accommodate, they are normally round-robined so as to collect
+2 -4
arch/blackfin/mach-common/smp.c
··· 413 413 return 0; 414 414 } 415 415 416 - static DECLARE_COMPLETION(cpu_killed); 417 - 418 416 int __cpu_die(unsigned int cpu) 419 417 { 420 - return wait_for_completion_timeout(&cpu_killed, 5000); 418 + return cpu_wait_death(cpu, 5); 421 419 } 422 420 423 421 void cpu_die(void) 424 422 { 425 - complete(&cpu_killed); 423 + (void)cpu_report_death(); 426 424 427 425 atomic_dec(&init_mm.mm_users); 428 426 atomic_dec(&init_mm.mm_count);
+2 -3
arch/metag/kernel/smp.c
··· 261 261 } 262 262 263 263 #ifdef CONFIG_HOTPLUG_CPU 264 - static DECLARE_COMPLETION(cpu_killed); 265 264 266 265 /* 267 266 * __cpu_disable runs on the processor to be shutdown. ··· 298 299 */ 299 300 void __cpu_die(unsigned int cpu) 300 301 { 301 - if (!wait_for_completion_timeout(&cpu_killed, msecs_to_jiffies(1))) 302 + if (!cpu_wait_death(cpu, 1)) 302 303 pr_err("CPU%u: unable to kill\n", cpu); 303 304 } 304 305 ··· 313 314 local_irq_disable(); 314 315 idle_task_exit(); 315 316 316 - complete(&cpu_killed); 317 + (void)cpu_report_death(); 317 318 318 319 asm ("XOR TXENABLE, D0Re0,D0Re0\n"); 319 320 }
-2
arch/x86/include/asm/cpu.h
··· 34 34 #endif 35 35 #endif 36 36 37 - DECLARE_PER_CPU(int, cpu_state); 38 - 39 37 int mwait_usable(const struct cpuinfo_x86 *); 40 38 41 39 #endif /* _ASM_X86_CPU_H */
+1 -1
arch/x86/include/asm/smp.h
··· 150 150 } 151 151 152 152 void cpu_disable_common(void); 153 - void cpu_die_common(unsigned int cpu); 154 153 void native_smp_prepare_boot_cpu(void); 155 154 void native_smp_prepare_cpus(unsigned int max_cpus); 156 155 void native_smp_cpus_done(unsigned int max_cpus); 157 156 int native_cpu_up(unsigned int cpunum, struct task_struct *tidle); 158 157 int native_cpu_disable(void); 158 + int common_cpu_die(unsigned int cpu); 159 159 void native_cpu_die(unsigned int cpu); 160 160 void native_play_dead(void); 161 161 void play_dead_common(void);
+18 -21
arch/x86/kernel/smpboot.c
··· 77 77 #include <asm/realmode.h> 78 78 #include <asm/misc.h> 79 79 80 - /* State of each CPU */ 81 - DEFINE_PER_CPU(int, cpu_state) = { 0 }; 82 - 83 80 /* Number of siblings per CPU package */ 84 81 int smp_num_siblings = 1; 85 82 EXPORT_SYMBOL(smp_num_siblings); ··· 254 257 lock_vector_lock(); 255 258 set_cpu_online(smp_processor_id(), true); 256 259 unlock_vector_lock(); 257 - per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE; 260 + cpu_set_state_online(smp_processor_id()); 258 261 x86_platform.nmi_init(); 259 262 260 263 /* enable local interrupts */ ··· 945 948 */ 946 949 mtrr_save_state(); 947 950 948 - per_cpu(cpu_state, cpu) = CPU_UP_PREPARE; 951 + /* x86 CPUs take themselves offline, so delayed offline is OK. */ 952 + err = cpu_check_up_prepare(cpu); 953 + if (err && err != -EBUSY) 954 + return err; 949 955 950 956 /* the FPU context is blank, nobody can own it */ 951 957 __cpu_disable_lazy_restore(cpu); ··· 1191 1191 switch_to_new_gdt(me); 1192 1192 /* already set me in cpu_online_mask in boot_cpu_init() */ 1193 1193 cpumask_set_cpu(me, cpu_callout_mask); 1194 - per_cpu(cpu_state, me) = CPU_ONLINE; 1194 + cpu_set_state_online(me); 1195 1195 } 1196 1196 1197 1197 void __init native_smp_cpus_done(unsigned int max_cpus) ··· 1318 1318 numa_remove_cpu(cpu); 1319 1319 } 1320 1320 1321 - static DEFINE_PER_CPU(struct completion, die_complete); 1322 - 1323 1321 void cpu_disable_common(void) 1324 1322 { 1325 1323 int cpu = smp_processor_id(); 1326 - 1327 - init_completion(&per_cpu(die_complete, smp_processor_id())); 1328 1324 1329 1325 remove_siblinginfo(cpu); 1330 1326 ··· 1345 1349 return 0; 1346 1350 } 1347 1351 1348 - void cpu_die_common(unsigned int cpu) 1352 + int common_cpu_die(unsigned int cpu) 1349 1353 { 1350 - wait_for_completion_timeout(&per_cpu(die_complete, cpu), HZ); 1351 - } 1354 + int ret = 0; 1352 1355 1353 - void native_cpu_die(unsigned int cpu) 1354 - { 1355 1356 /* We don't do anything here: idle task is faking death itself. 
*/ 1356 1357 1357 - cpu_die_common(cpu); 1358 - 1359 1358 /* They ack this in play_dead() by setting CPU_DEAD */ 1360 - if (per_cpu(cpu_state, cpu) == CPU_DEAD) { 1359 + if (cpu_wait_death(cpu, 5)) { 1361 1360 if (system_state == SYSTEM_RUNNING) 1362 1361 pr_info("CPU %u is now offline\n", cpu); 1363 1362 } else { 1364 1363 pr_err("CPU %u didn't die...\n", cpu); 1364 + ret = -1; 1365 1365 } 1366 + 1367 + return ret; 1368 + } 1369 + 1370 + void native_cpu_die(unsigned int cpu) 1371 + { 1372 + common_cpu_die(cpu); 1366 1373 } 1367 1374 1368 1375 void play_dead_common(void) ··· 1374 1375 reset_lazy_tlbstate(); 1375 1376 amd_e400_remove_cpu(raw_smp_processor_id()); 1376 1377 1377 - mb(); 1378 1378 /* Ack it */ 1379 - __this_cpu_write(cpu_state, CPU_DEAD); 1380 - complete(&per_cpu(die_complete, smp_processor_id())); 1379 + (void)cpu_report_death(); 1381 1380 1382 1381 /* 1383 1382 * With physical CPU hotplug, we should halt the cpu
+25 -21
arch/x86/xen/smp.c
··· 90 90 91 91 set_cpu_online(cpu, true); 92 92 93 - this_cpu_write(cpu_state, CPU_ONLINE); 94 - 95 - wmb(); 93 + cpu_set_state_online(cpu); /* Implies full memory barrier. */ 96 94 97 95 /* We can take interrupts now: we're officially "up". */ 98 96 local_irq_enable(); 99 - 100 - wmb(); /* make sure everything is out */ 101 97 } 102 98 103 99 /* ··· 455 459 xen_setup_timer(cpu); 456 460 xen_init_lock_cpu(cpu); 457 461 458 - per_cpu(cpu_state, cpu) = CPU_UP_PREPARE; 462 + /* 463 + * PV VCPUs are always successfully taken down (see 'while' loop 464 + * in xen_cpu_die()), so -EBUSY is an error. 465 + */ 466 + rc = cpu_check_up_prepare(cpu); 467 + if (rc) 468 + return rc; 459 469 460 470 /* make sure interrupts start blocked */ 461 471 per_cpu(xen_vcpu, cpu)->evtchn_upcall_mask = 1; ··· 481 479 rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL); 482 480 BUG_ON(rc); 483 481 484 - while(per_cpu(cpu_state, cpu) != CPU_ONLINE) { 482 + while (cpu_report_state(cpu) != CPU_ONLINE) 485 483 HYPERVISOR_sched_op(SCHEDOP_yield, NULL); 486 - barrier(); 487 - } 488 484 489 485 return 0; 490 486 } ··· 511 511 schedule_timeout(HZ/10); 512 512 } 513 513 514 - cpu_die_common(cpu); 515 - 516 - xen_smp_intr_free(cpu); 517 - xen_uninit_lock_cpu(cpu); 518 - xen_teardown_timer(cpu); 514 + if (common_cpu_die(cpu) == 0) { 515 + xen_smp_intr_free(cpu); 516 + xen_uninit_lock_cpu(cpu); 517 + xen_teardown_timer(cpu); 518 + } 519 519 } 520 520 521 521 static void xen_play_dead(void) /* used only with HOTPLUG_CPU */ ··· 747 747 static int xen_hvm_cpu_up(unsigned int cpu, struct task_struct *tidle) 748 748 { 749 749 int rc; 750 + 751 + /* 752 + * This can happen if CPU was offlined earlier and 753 + * offlining timed out in common_cpu_die(). 
754 + */ 755 + if (cpu_report_state(cpu) == CPU_DEAD_FROZEN) { 756 + xen_smp_intr_free(cpu); 757 + xen_uninit_lock_cpu(cpu); 758 + } 759 + 750 760 /* 751 761 * xen_smp_intr_init() needs to run before native_cpu_up() 752 762 * so that IPI vectors are set up on the booting CPU before ··· 778 768 return rc; 779 769 } 780 770 781 - static void xen_hvm_cpu_die(unsigned int cpu) 782 - { 783 - xen_cpu_die(cpu); 784 - native_cpu_die(cpu); 785 - } 786 - 787 771 void __init xen_hvm_smp_init(void) 788 772 { 789 773 if (!xen_have_vector_callback) ··· 785 781 smp_ops.smp_prepare_cpus = xen_hvm_smp_prepare_cpus; 786 782 smp_ops.smp_send_reschedule = xen_smp_send_reschedule; 787 783 smp_ops.cpu_up = xen_hvm_cpu_up; 788 - smp_ops.cpu_die = xen_hvm_cpu_die; 784 + smp_ops.cpu_die = xen_cpu_die; 789 785 smp_ops.send_call_func_ipi = xen_smp_send_call_function_ipi; 790 786 smp_ops.send_call_func_single_ipi = xen_smp_send_call_function_single_ipi; 791 787 smp_ops.smp_prepare_boot_cpu = xen_smp_prepare_boot_cpu;
+14
include/linux/cpu.h
··· 95 95 * Called on the new cpu, just before 96 96 * enabling interrupts. Must not sleep, 97 97 * must not fail */ 98 + #define CPU_DYING_IDLE 0x000B /* CPU (unsigned)v dying, reached 99 + * idle loop. */ 100 + #define CPU_BROKEN 0x000C /* CPU (unsigned)v did not die properly, 101 + * perhaps due to preemption. */ 98 102 99 103 /* Used for CPU hotplug events occurring while tasks are frozen due to a suspend 100 104 * operation in progress ··· 274 270 void arch_cpu_idle_enter(void); 275 271 void arch_cpu_idle_exit(void); 276 272 void arch_cpu_idle_dead(void); 273 + 274 + DECLARE_PER_CPU(bool, cpu_dead_idle); 275 + 276 + int cpu_report_state(int cpu); 277 + int cpu_check_up_prepare(int cpu); 278 + void cpu_set_state_online(int cpu); 279 + #ifdef CONFIG_HOTPLUG_CPU 280 + bool cpu_wait_death(unsigned int cpu, int seconds); 281 + bool cpu_report_death(void); 282 + #endif /* #ifdef CONFIG_HOTPLUG_CPU */ 277 283 278 284 #endif /* _LINUX_CPU_H_ */
+6 -1
include/linux/lockdep.h
··· 531 531 # define might_lock_read(lock) do { } while (0) 532 532 #endif 533 533 534 - #ifdef CONFIG_PROVE_RCU 534 + #ifdef CONFIG_LOCKDEP 535 535 void lockdep_rcu_suspicious(const char *file, const int line, const char *s); 536 + #else 537 + static inline void 538 + lockdep_rcu_suspicious(const char *file, const int line, const char *s) 539 + { 540 + } 536 541 #endif 537 542 538 543 #endif /* __LINUX_LOCKDEP_H */
+36 -4
include/linux/rcupdate.h
··· 48 48 49 49 extern int rcu_expedited; /* for sysctl */ 50 50 51 + #ifdef CONFIG_TINY_RCU 52 + /* Tiny RCU doesn't expedite, as its purpose in life is instead to be tiny. */ 53 + static inline bool rcu_gp_is_expedited(void) /* Internal RCU use. */ 54 + { 55 + return false; 56 + } 57 + 58 + static inline void rcu_expedite_gp(void) 59 + { 60 + } 61 + 62 + static inline void rcu_unexpedite_gp(void) 63 + { 64 + } 65 + #else /* #ifdef CONFIG_TINY_RCU */ 66 + bool rcu_gp_is_expedited(void); /* Internal RCU use. */ 67 + void rcu_expedite_gp(void); 68 + void rcu_unexpedite_gp(void); 69 + #endif /* #else #ifdef CONFIG_TINY_RCU */ 70 + 51 71 enum rcutorture_type { 52 72 RCU_FLAVOR, 53 73 RCU_BH_FLAVOR, ··· 215 195 216 196 void synchronize_sched(void); 217 197 198 + /* 199 + * Structure allowing asynchronous waiting on RCU. 200 + */ 201 + struct rcu_synchronize { 202 + struct rcu_head head; 203 + struct completion completion; 204 + }; 205 + void wakeme_after_rcu(struct rcu_head *head); 206 + 218 207 /** 219 208 * call_rcu_tasks() - Queue an RCU for invocation task-based grace period 220 209 * @head: structure to be used for queueing the RCU updates. ··· 287 258 288 259 /* Internal to kernel */ 289 260 void rcu_init(void); 261 + void rcu_end_inkernel_boot(void); 290 262 void rcu_sched_qs(void); 291 263 void rcu_bh_qs(void); 292 264 void rcu_check_callbacks(int user); ··· 296 266 void rcu_idle_exit(void); 297 267 void rcu_irq_enter(void); 298 268 void rcu_irq_exit(void); 269 + int rcu_cpu_notify(struct notifier_block *self, 270 + unsigned long action, void *hcpu); 299 271 300 272 #ifdef CONFIG_RCU_STALL_COMMON 301 273 void rcu_sysrq_start(void); ··· 752 720 * annotated as __rcu. 
753 721 */ 754 722 #define rcu_dereference_check(p, c) \ 755 - __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu) 723 + __rcu_dereference_check((p), (c) || rcu_read_lock_held(), __rcu) 756 724 757 725 /** 758 726 * rcu_dereference_bh_check() - rcu_dereference_bh with debug checking ··· 762 730 * This is the RCU-bh counterpart to rcu_dereference_check(). 763 731 */ 764 732 #define rcu_dereference_bh_check(p, c) \ 765 - __rcu_dereference_check((p), rcu_read_lock_bh_held() || (c), __rcu) 733 + __rcu_dereference_check((p), (c) || rcu_read_lock_bh_held(), __rcu) 766 734 767 735 /** 768 736 * rcu_dereference_sched_check() - rcu_dereference_sched with debug checking ··· 772 740 * This is the RCU-sched counterpart to rcu_dereference_check(). 773 741 */ 774 742 #define rcu_dereference_sched_check(p, c) \ 775 - __rcu_dereference_check((p), rcu_read_lock_sched_held() || (c), \ 743 + __rcu_dereference_check((p), (c) || rcu_read_lock_sched_held(), \ 776 744 __rcu) 777 745 778 746 #define rcu_dereference_raw(p) rcu_dereference_check(p, 1) /*@@@ needed? @@@*/ ··· 965 933 { 966 934 rcu_lockdep_assert(rcu_is_watching(), 967 935 "rcu_read_unlock() used illegally while idle"); 968 - rcu_lock_release(&rcu_lock_map); 969 936 __release(RCU); 970 937 __rcu_read_unlock(); 938 + rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */ 971 939 } 972 940 973 941 /**
+1 -1
include/linux/srcu.h
··· 182 182 * lockdep_is_held() calls. 183 183 */ 184 184 #define srcu_dereference_check(p, sp, c) \ 185 - __rcu_dereference_check((p), srcu_read_lock_held(sp) || (c), __rcu) 185 + __rcu_dereference_check((p), (c) || srcu_read_lock_held(sp), __rcu) 186 186 187 187 /** 188 188 * srcu_dereference - fetch SRCU-protected pointer for later dereferencing
+13
init/Kconfig
··· 791 791 792 792 endchoice 793 793 794 + config RCU_EXPEDITE_BOOT 795 + bool 796 + default n 797 + help 798 + This option enables expedited grace periods at boot time, 799 + as if rcu_expedite_gp() had been invoked early in boot. 800 + The corresponding rcu_unexpedite_gp() is invoked from 801 + rcu_end_inkernel_boot(), which is intended to be invoked 802 + at the end of the kernel-only boot sequence, just before 803 + init is exec'ed. 804 + 805 + Accept the default if unsure. 806 + 794 807 endmenu # "RCU Subsystem" 795 808 796 809 config BUILD_BIN2C
+3 -1
kernel/cpu.c
··· 408 408 * 409 409 * Wait for the stop thread to go away. 410 410 */ 411 - while (!idle_cpu(cpu)) 411 + while (!per_cpu(cpu_dead_idle, cpu)) 412 412 cpu_relax(); 413 + smp_mb(); /* Read from cpu_dead_idle before __cpu_die(). */ 414 + per_cpu(cpu_dead_idle, cpu) = false; 413 415 414 416 /* This actually kills the CPU. */ 415 417 __cpu_die(cpu);
+26 -1
kernel/rcu/rcutorture.c
··· 853 853 static int 854 854 rcu_torture_writer(void *arg) 855 855 { 856 + bool can_expedite = !rcu_gp_is_expedited(); 857 + int expediting = 0; 856 858 unsigned long gp_snap; 857 859 bool gp_cond1 = gp_cond, gp_exp1 = gp_exp, gp_normal1 = gp_normal; 858 860 bool gp_sync1 = gp_sync; ··· 867 865 int nsynctypes = 0; 868 866 869 867 VERBOSE_TOROUT_STRING("rcu_torture_writer task started"); 868 + pr_alert("%s" TORTURE_FLAG 869 + " Grace periods expedited from boot/sysfs for %s,\n", 870 + torture_type, cur_ops->name); 871 + pr_alert("%s" TORTURE_FLAG 872 + " Testing of dynamic grace-period expediting disabled.\n", 873 + torture_type); 870 874 871 875 /* Initialize synctype[] array. If none set, take default. */ 872 - if (!gp_cond1 && !gp_exp1 && !gp_normal1 && !gp_sync) 876 + if (!gp_cond1 && !gp_exp1 && !gp_normal1 && !gp_sync1) 873 877 gp_cond1 = gp_exp1 = gp_normal1 = gp_sync1 = true; 874 878 if (gp_cond1 && cur_ops->get_state && cur_ops->cond_sync) 875 879 synctype[nsynctypes++] = RTWS_COND_GET; ··· 957 949 } 958 950 } 959 951 rcutorture_record_progress(++rcu_torture_current_version); 952 + /* Cycle through nesting levels of rcu_expedite_gp() calls. */ 953 + if (can_expedite && 954 + !(torture_random(&rand) & 0xff & (!!expediting - 1))) { 955 + WARN_ON_ONCE(expediting == 0 && rcu_gp_is_expedited()); 956 + if (expediting >= 0) 957 + rcu_expedite_gp(); 958 + else 959 + rcu_unexpedite_gp(); 960 + if (++expediting > 3) 961 + expediting = -expediting; 962 + } 960 963 rcu_torture_writer_state = RTWS_STUTTER; 961 964 stutter_wait("rcu_torture_writer"); 962 965 } while (!torture_must_stop()); 966 + /* Reset expediting back to unexpedited. */ 967 + if (expediting > 0) 968 + expediting = -expediting; 969 + while (can_expedite && expediting++ < 0) 970 + rcu_unexpedite_gp(); 971 + WARN_ON_ONCE(can_expedite && rcu_gp_is_expedited()); 963 972 rcu_torture_writer_state = RTWS_STOPPING; 964 973 torture_kthread_stopping("rcu_torture_writer"); 965 974 return 0;
+1 -18
kernel/rcu/srcu.c
··· 402 402 } 403 403 EXPORT_SYMBOL_GPL(call_srcu); 404 404 405 - struct rcu_synchronize { 406 - struct rcu_head head; 407 - struct completion completion; 408 - }; 409 - 410 - /* 411 - * Awaken the corresponding synchronize_srcu() instance now that a 412 - * grace period has elapsed. 413 - */ 414 - static void wakeme_after_rcu(struct rcu_head *head) 415 - { 416 - struct rcu_synchronize *rcu; 417 - 418 - rcu = container_of(head, struct rcu_synchronize, head); 419 - complete(&rcu->completion); 420 - } 421 - 422 405 static void srcu_advance_batches(struct srcu_struct *sp, int trycount); 423 406 static void srcu_reschedule(struct srcu_struct *sp); 424 407 ··· 490 507 */ 491 508 void synchronize_srcu(struct srcu_struct *sp) 492 509 { 493 - __synchronize_srcu(sp, rcu_expedited 510 + __synchronize_srcu(sp, rcu_gp_is_expedited() 494 511 ? SYNCHRONIZE_SRCU_EXP_TRYCOUNT 495 512 : SYNCHRONIZE_SRCU_TRYCOUNT); 496 513 }
+1 -13
kernel/rcu/tiny.c
··· 103 103 static int rcu_qsctr_help(struct rcu_ctrlblk *rcp) 104 104 { 105 105 RCU_TRACE(reset_cpu_stall_ticks(rcp)); 106 - if (rcp->rcucblist != NULL && 107 - rcp->donetail != rcp->curtail) { 106 + if (rcp->donetail != rcp->curtail) { 108 107 rcp->donetail = rcp->curtail; 109 108 return 1; 110 109 } ··· 167 168 struct rcu_head *next, *list; 168 169 unsigned long flags; 169 170 RCU_TRACE(int cb_count = 0); 170 - 171 - /* If no RCU callbacks ready to invoke, just return. */ 172 - if (&rcp->rcucblist == rcp->donetail) { 173 - RCU_TRACE(trace_rcu_batch_start(rcp->name, 0, 0, -1)); 174 - RCU_TRACE(trace_rcu_batch_end(rcp->name, 0, 175 - !!ACCESS_ONCE(rcp->rcucblist), 176 - need_resched(), 177 - is_idle_task(current), 178 - false)); 179 - return; 180 - } 181 171 182 172 /* Move the ready-to-invoke callbacks to a local list. */ 183 173 local_irq_save(flags);
+311 -126
kernel/rcu/tree.c
··· 91 91 92 92 #define RCU_STATE_INITIALIZER(sname, sabbr, cr) \ 93 93 DEFINE_RCU_TPS(sname) \ 94 + DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, sname##_data); \ 94 95 struct rcu_state sname##_state = { \ 95 96 .level = { &sname##_state.node[0] }, \ 97 + .rda = &sname##_data, \ 96 98 .call = cr, \ 97 99 .fqs_state = RCU_GP_IDLE, \ 98 100 .gpnum = 0UL - 300UL, \ ··· 103 101 .orphan_nxttail = &sname##_state.orphan_nxtlist, \ 104 102 .orphan_donetail = &sname##_state.orphan_donelist, \ 105 103 .barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \ 106 - .onoff_mutex = __MUTEX_INITIALIZER(sname##_state.onoff_mutex), \ 107 104 .name = RCU_STATE_NAME(sname), \ 108 105 .abbr = sabbr, \ 109 - }; \ 110 - DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, sname##_data) 106 + } 111 107 112 108 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched); 113 109 RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh); ··· 152 152 */ 153 153 static int rcu_scheduler_fully_active __read_mostly; 154 154 155 + static void rcu_init_new_rnp(struct rcu_node *rnp_leaf); 156 + static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf); 155 157 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu); 156 158 static void invoke_rcu_core(void); 157 159 static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp); ··· 161 159 /* rcuc/rcub kthread realtime priority */ 162 160 static int kthread_prio = CONFIG_RCU_KTHREAD_PRIO; 163 161 module_param(kthread_prio, int, 0644); 162 + 163 + /* Delay in jiffies for grace-period initialization delays. */ 164 + static int gp_init_delay = IS_ENABLED(CONFIG_RCU_TORTURE_TEST_SLOW_INIT) 165 + ? 
CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY 166 + : 0; 167 + module_param(gp_init_delay, int, 0644); 164 168 165 169 /* 166 170 * Track the rcutorture test sequence number and the update version ··· 179 171 */ 180 172 unsigned long rcutorture_testseq; 181 173 unsigned long rcutorture_vernum; 174 + 175 + /* 176 + * Compute the mask of online CPUs for the specified rcu_node structure. 177 + * This will not be stable unless the rcu_node structure's ->lock is 178 + * held, but the bit corresponding to the current CPU will be stable 179 + * in most contexts. 180 + */ 181 + unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp) 182 + { 183 + return ACCESS_ONCE(rnp->qsmaskinitnext); 184 + } 182 185 183 186 /* 184 187 * Return true if an RCU grace period is in progress. The ACCESS_ONCE()s ··· 311 292 EXPORT_SYMBOL_GPL(rcu_note_context_switch); 312 293 313 294 /* 314 - * Register a quiesecent state for all RCU flavors. If there is an 295 + * Register a quiescent state for all RCU flavors. If there is an 315 296 * emergency, invoke rcu_momentary_dyntick_idle() to do a heavy-weight 316 297 * dyntick-idle quiescent state visible to other CPUs (but only for those 317 - * RCU flavors in desparate need of a quiescent state, which will normally 298 + * RCU flavors in desperate need of a quiescent state, which will normally 318 299 * be none of them). Either way, do a lightweight quiescent state for 319 300 * all RCU flavors. 320 301 */ ··· 429 410 EXPORT_SYMBOL_GPL(rcu_bh_force_quiescent_state); 430 411 431 412 /* 413 + * Force a quiescent state for RCU-sched. 414 + */ 415 + void rcu_sched_force_quiescent_state(void) 416 + { 417 + force_quiescent_state(&rcu_sched_state); 418 + } 419 + EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state); 420 + 421 + /* 432 422 * Show the state of the grace-period kthreads. 
433 423 */ 434 424 void show_rcu_gp_kthreads(void) ··· 509 481 rcutorture_vernum++; 510 482 } 511 483 EXPORT_SYMBOL_GPL(rcutorture_record_progress); 512 - 513 - /* 514 - * Force a quiescent state for RCU-sched. 515 - */ 516 - void rcu_sched_force_quiescent_state(void) 517 - { 518 - force_quiescent_state(&rcu_sched_state); 519 - } 520 - EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state); 521 484 522 485 /* 523 486 * Does the CPU have callbacks ready to be invoked? ··· 973 954 preempt_disable(); 974 955 rdp = this_cpu_ptr(&rcu_sched_data); 975 956 rnp = rdp->mynode; 976 - ret = (rdp->grpmask & rnp->qsmaskinit) || 957 + ret = (rdp->grpmask & rcu_rnp_online_cpus(rnp)) || 977 958 !rcu_scheduler_fully_active; 978 959 preempt_enable(); 979 960 return ret; ··· 1215 1196 } else { 1216 1197 j = jiffies; 1217 1198 gpa = ACCESS_ONCE(rsp->gp_activity); 1218 - pr_err("All QSes seen, last %s kthread activity %ld (%ld-%ld), jiffies_till_next_fqs=%ld\n", 1199 + pr_err("All QSes seen, last %s kthread activity %ld (%ld-%ld), jiffies_till_next_fqs=%ld, root ->qsmask %#lx\n", 1219 1200 rsp->name, j - gpa, j, gpa, 1220 - jiffies_till_next_fqs); 1201 + jiffies_till_next_fqs, 1202 + rcu_get_root(rsp)->qsmask); 1221 1203 /* In this case, the current CPU might be at fault. */ 1222 1204 sched_show_task(current); 1223 1205 } ··· 1348 1328 } 1349 1329 1350 1330 /* 1331 + * Initialize the specified rcu_data structure's default callback list 1332 + * to empty. The default callback list is the one that is not used by 1333 + * no-callbacks CPUs. 1334 + */ 1335 + static void init_default_callback_list(struct rcu_data *rdp) 1336 + { 1337 + int i; 1338 + 1339 + rdp->nxtlist = NULL; 1340 + for (i = 0; i < RCU_NEXT_SIZE; i++) 1341 + rdp->nxttail[i] = &rdp->nxtlist; 1342 + } 1343 + 1344 + /* 1351 1345 * Initialize the specified rcu_data structure's callback list to empty. 
1352 1346 */ 1353 1347 static void init_callback_list(struct rcu_data *rdp) 1354 1348 { 1355 - int i; 1356 - 1357 1349 if (init_nocb_callback_list(rdp)) 1358 1350 return; 1359 - rdp->nxtlist = NULL; 1360 - for (i = 0; i < RCU_NEXT_SIZE; i++) 1361 - rdp->nxttail[i] = &rdp->nxtlist; 1351 + init_default_callback_list(rdp); 1362 1352 } 1363 1353 1364 1354 /* ··· 1733 1703 */ 1734 1704 static int rcu_gp_init(struct rcu_state *rsp) 1735 1705 { 1706 + unsigned long oldmask; 1736 1707 struct rcu_data *rdp; 1737 1708 struct rcu_node *rnp = rcu_get_root(rsp); 1738 1709 1739 1710 ACCESS_ONCE(rsp->gp_activity) = jiffies; 1740 - rcu_bind_gp_kthread(); 1741 1711 raw_spin_lock_irq(&rnp->lock); 1742 1712 smp_mb__after_unlock_lock(); 1743 1713 if (!ACCESS_ONCE(rsp->gp_flags)) { ··· 1763 1733 trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start")); 1764 1734 raw_spin_unlock_irq(&rnp->lock); 1765 1735 1766 - /* Exclude any concurrent CPU-hotplug operations. */ 1767 - mutex_lock(&rsp->onoff_mutex); 1768 - smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */ 1736 + /* 1737 + * Apply per-leaf buffered online and offline operations to the 1738 + * rcu_node tree. Note that this new grace period need not wait 1739 + * for subsequent online CPUs, and that quiescent-state forcing 1740 + * will handle subsequent offline CPUs. 1741 + */ 1742 + rcu_for_each_leaf_node(rsp, rnp) { 1743 + raw_spin_lock_irq(&rnp->lock); 1744 + smp_mb__after_unlock_lock(); 1745 + if (rnp->qsmaskinit == rnp->qsmaskinitnext && 1746 + !rnp->wait_blkd_tasks) { 1747 + /* Nothing to do on this leaf rcu_node structure. */ 1748 + raw_spin_unlock_irq(&rnp->lock); 1749 + continue; 1750 + } 1751 + 1752 + /* Record old state, apply changes to ->qsmaskinit field. */ 1753 + oldmask = rnp->qsmaskinit; 1754 + rnp->qsmaskinit = rnp->qsmaskinitnext; 1755 + 1756 + /* If zero-ness of ->qsmaskinit changed, propagate up tree. 
*/ 1757 + if (!oldmask != !rnp->qsmaskinit) { 1758 + if (!oldmask) /* First online CPU for this rcu_node. */ 1759 + rcu_init_new_rnp(rnp); 1760 + else if (rcu_preempt_has_tasks(rnp)) /* blocked tasks */ 1761 + rnp->wait_blkd_tasks = true; 1762 + else /* Last offline CPU and can propagate. */ 1763 + rcu_cleanup_dead_rnp(rnp); 1764 + } 1765 + 1766 + /* 1767 + * If all waited-on tasks from prior grace period are 1768 + * done, and if all this rcu_node structure's CPUs are 1769 + * still offline, propagate up the rcu_node tree and 1770 + * clear ->wait_blkd_tasks. Otherwise, if one of this 1771 + * rcu_node structure's CPUs has since come back online, 1772 + * simply clear ->wait_blkd_tasks (but rcu_cleanup_dead_rnp() 1773 + * checks for this, so just call it unconditionally). 1774 + */ 1775 + if (rnp->wait_blkd_tasks && 1776 + (!rcu_preempt_has_tasks(rnp) || 1777 + rnp->qsmaskinit)) { 1778 + rnp->wait_blkd_tasks = false; 1779 + rcu_cleanup_dead_rnp(rnp); 1780 + } 1781 + 1782 + raw_spin_unlock_irq(&rnp->lock); 1783 + } 1769 1784 1770 1785 /* 1771 1786 * Set the quiescent-state-needed bits in all the rcu_node ··· 1832 1757 rcu_preempt_check_blocked_tasks(rnp); 1833 1758 rnp->qsmask = rnp->qsmaskinit; 1834 1759 ACCESS_ONCE(rnp->gpnum) = rsp->gpnum; 1835 - WARN_ON_ONCE(rnp->completed != rsp->completed); 1836 - ACCESS_ONCE(rnp->completed) = rsp->completed; 1760 + if (WARN_ON_ONCE(rnp->completed != rsp->completed)) 1761 + ACCESS_ONCE(rnp->completed) = rsp->completed; 1837 1762 if (rnp == rdp->mynode) 1838 1763 (void)__note_gp_changes(rsp, rnp, rdp); 1839 1764 rcu_preempt_boost_start_gp(rnp); ··· 1843 1768 raw_spin_unlock_irq(&rnp->lock); 1844 1769 cond_resched_rcu_qs(); 1845 1770 ACCESS_ONCE(rsp->gp_activity) = jiffies; 1771 + if (IS_ENABLED(CONFIG_RCU_TORTURE_TEST_SLOW_INIT) && 1772 + gp_init_delay > 0 && 1773 + !(rsp->gpnum % (rcu_num_nodes * 10))) 1774 + schedule_timeout_uninterruptible(gp_init_delay); 1846 1775 } 1847 1776 1848 - mutex_unlock(&rsp->onoff_mutex); 1849 
1777 return 1; 1850 1778 } 1851 1779 ··· 1876 1798 fqs_state = RCU_FORCE_QS; 1877 1799 } else { 1878 1800 /* Handle dyntick-idle and offline CPUs. */ 1879 - isidle = false; 1801 + isidle = true; 1880 1802 force_qs_rnp(rsp, rcu_implicit_dynticks_qs, &isidle, &maxj); 1881 1803 } 1882 1804 /* Clear flag to prevent immediate re-entry. */ ··· 1930 1852 rcu_for_each_node_breadth_first(rsp, rnp) { 1931 1853 raw_spin_lock_irq(&rnp->lock); 1932 1854 smp_mb__after_unlock_lock(); 1855 + WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)); 1856 + WARN_ON_ONCE(rnp->qsmask); 1933 1857 ACCESS_ONCE(rnp->completed) = rsp->gpnum; 1934 1858 rdp = this_cpu_ptr(rsp->rda); 1935 1859 if (rnp == rdp->mynode) ··· 1975 1895 struct rcu_state *rsp = arg; 1976 1896 struct rcu_node *rnp = rcu_get_root(rsp); 1977 1897 1898 + rcu_bind_gp_kthread(); 1978 1899 for (;;) { 1979 1900 1980 1901 /* Handle grace-period start. */ ··· 2143 2062 * Similar to rcu_report_qs_rdp(), for which it is a helper function. 2144 2063 * Allows quiescent states for a group of CPUs to be reported at one go 2145 2064 * to the specified rcu_node structure, though all the CPUs in the group 2146 - * must be represented by the same rcu_node structure (which need not be 2147 - * a leaf rcu_node structure, though it often will be). That structure's 2148 - * lock must be held upon entry, and it is released before return. 2065 + * must be represented by the same rcu_node structure (which need not be a 2066 + * leaf rcu_node structure, though it often will be). The gps parameter 2067 + * is the grace-period snapshot, which means that the quiescent states 2068 + * are valid only if rnp->gpnum is equal to gps. That structure's lock 2069 + * must be held upon entry, and it is released before return. 
2149 2070 */ 2150 2071 static void 2151 2072 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp, 2152 - struct rcu_node *rnp, unsigned long flags) 2073 + struct rcu_node *rnp, unsigned long gps, unsigned long flags) 2153 2074 __releases(rnp->lock) 2154 2075 { 2076 + unsigned long oldmask = 0; 2155 2077 struct rcu_node *rnp_c; 2156 2078 2157 2079 /* Walk up the rcu_node hierarchy. */ 2158 2080 for (;;) { 2159 - if (!(rnp->qsmask & mask)) { 2081 + if (!(rnp->qsmask & mask) || rnp->gpnum != gps) { 2160 2082 2161 - /* Our bit has already been cleared, so done. */ 2083 + /* 2084 + * Our bit has already been cleared, or the 2085 + * relevant grace period is already over, so done. 2086 + */ 2162 2087 raw_spin_unlock_irqrestore(&rnp->lock, flags); 2163 2088 return; 2164 2089 } 2090 + WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */ 2165 2091 rnp->qsmask &= ~mask; 2166 2092 trace_rcu_quiescent_state_report(rsp->name, rnp->gpnum, 2167 2093 mask, rnp->qsmask, rnp->level, ··· 2192 2104 rnp = rnp->parent; 2193 2105 raw_spin_lock_irqsave(&rnp->lock, flags); 2194 2106 smp_mb__after_unlock_lock(); 2195 - WARN_ON_ONCE(rnp_c->qsmask); 2107 + oldmask = rnp_c->qsmask; 2196 2108 } 2197 2109 2198 2110 /* ··· 2201 2113 * to clean up and start the next grace period if one is needed. 2202 2114 */ 2203 2115 rcu_report_qs_rsp(rsp, flags); /* releases rnp->lock. */ 2116 + } 2117 + 2118 + /* 2119 + * Record a quiescent state for all tasks that were previously queued 2120 + * on the specified rcu_node structure and that were blocking the current 2121 + * RCU grace period. The caller must hold the specified rnp->lock with 2122 + * irqs disabled, and this lock is released upon return, but irqs remain 2123 + * disabled. 
2124 + */ 2125 + static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp, 2126 + struct rcu_node *rnp, unsigned long flags) 2127 + __releases(rnp->lock) 2128 + { 2129 + unsigned long gps; 2130 + unsigned long mask; 2131 + struct rcu_node *rnp_p; 2132 + 2133 + if (rcu_state_p == &rcu_sched_state || rsp != rcu_state_p || 2134 + rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) { 2135 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 2136 + return; /* Still need more quiescent states! */ 2137 + } 2138 + 2139 + rnp_p = rnp->parent; 2140 + if (rnp_p == NULL) { 2141 + /* 2142 + * Only one rcu_node structure in the tree, so don't 2143 + * try to report up to its nonexistent parent! 2144 + */ 2145 + rcu_report_qs_rsp(rsp, flags); 2146 + return; 2147 + } 2148 + 2149 + /* Report up the rest of the hierarchy, tracking current ->gpnum. */ 2150 + gps = rnp->gpnum; 2151 + mask = rnp->grpmask; 2152 + raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */ 2153 + raw_spin_lock(&rnp_p->lock); /* irqs already disabled. */ 2154 + smp_mb__after_unlock_lock(); 2155 + rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags); 2204 2156 } 2205 2157 2206 2158 /* ··· 2291 2163 */ 2292 2164 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); 2293 2165 2294 - rcu_report_qs_rnp(mask, rsp, rnp, flags); /* rlses rnp->lock */ 2166 + rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags); 2167 + /* ^^^ Released rnp->lock */ 2295 2168 if (needwake) 2296 2169 rcu_gp_kthread_wake(rsp); 2297 2170 } ··· 2385 2256 rsp->orphan_donetail = rdp->nxttail[RCU_DONE_TAIL]; 2386 2257 } 2387 2258 2388 - /* Finally, initialize the rcu_data structure's list to empty. */ 2259 + /* 2260 + * Finally, initialize the rcu_data structure's list to empty and 2261 + * disallow further callbacks on this CPU. 2262 + */ 2389 2263 init_callback_list(rdp); 2264 + rdp->nxttail[RCU_NEXT_TAIL] = NULL; 2390 2265 } 2391 2266 2392 2267 /* ··· 2488 2355 raw_spin_lock(&rnp->lock); /* irqs already disabled. 
*/ 2489 2356 smp_mb__after_unlock_lock(); /* GP memory ordering. */ 2490 2357 rnp->qsmaskinit &= ~mask; 2358 + rnp->qsmask &= ~mask; 2491 2359 if (rnp->qsmaskinit) { 2492 2360 raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */ 2493 2361 return; 2494 2362 } 2495 2363 raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */ 2496 2364 } 2365 + } 2366 + 2367 + /* 2368 + * The CPU is exiting the idle loop into the arch_cpu_idle_dead() 2369 + * function. We now remove it from the rcu_node tree's ->qsmaskinit 2370 + * bit masks. 2371 + */ 2372 + static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp) 2373 + { 2374 + unsigned long flags; 2375 + unsigned long mask; 2376 + struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu); 2377 + struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */ 2378 + 2379 + /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ 2380 + mask = rdp->grpmask; 2381 + raw_spin_lock_irqsave(&rnp->lock, flags); 2382 + smp_mb__after_unlock_lock(); /* Enforce GP memory-order guarantee. */ 2383 + rnp->qsmaskinitnext &= ~mask; 2384 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 2497 2385 } 2498 2386 2499 2387 /* ··· 2533 2379 /* Adjust any no-longer-needed kthreads. */ 2534 2380 rcu_boost_kthread_setaffinity(rnp, -1); 2535 2381 2536 - /* Exclude any attempts to start a new grace period. */ 2537 - mutex_lock(&rsp->onoff_mutex); 2538 - raw_spin_lock_irqsave(&rsp->orphan_lock, flags); 2539 - 2540 2382 /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */ 2383 + raw_spin_lock_irqsave(&rsp->orphan_lock, flags); 2541 2384 rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp); 2542 2385 rcu_adopt_orphan_cbs(rsp, flags); 2543 2386 raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags); 2544 2387 2545 - /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ 2546 - raw_spin_lock_irqsave(&rnp->lock, flags); 2547 - smp_mb__after_unlock_lock(); /* Enforce GP memory-order guarantee. 
*/ 2548 - rnp->qsmaskinit &= ~rdp->grpmask; 2549 - if (rnp->qsmaskinit == 0 && !rcu_preempt_has_tasks(rnp)) 2550 - rcu_cleanup_dead_rnp(rnp); 2551 - rcu_report_qs_rnp(rdp->grpmask, rsp, rnp, flags); /* Rlses rnp->lock. */ 2552 2388 WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL, 2553 2389 "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n", 2554 2390 cpu, rdp->qlen, rdp->nxtlist); 2555 - init_callback_list(rdp); 2556 - /* Disallow further callbacks on this CPU. */ 2557 - rdp->nxttail[RCU_NEXT_TAIL] = NULL; 2558 - mutex_unlock(&rsp->onoff_mutex); 2559 2391 } 2560 2392 2561 2393 #else /* #ifdef CONFIG_HOTPLUG_CPU */ ··· 2551 2411 } 2552 2412 2553 2413 static void __maybe_unused rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf) 2414 + { 2415 + } 2416 + 2417 + static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp) 2554 2418 { 2555 2419 } 2556 2420 ··· 2733 2589 return; 2734 2590 } 2735 2591 if (rnp->qsmask == 0) { 2736 - rcu_initiate_boost(rnp, flags); /* releases rnp->lock */ 2737 - continue; 2592 + if (rcu_state_p == &rcu_sched_state || 2593 + rsp != rcu_state_p || 2594 + rcu_preempt_blocked_readers_cgp(rnp)) { 2595 + /* 2596 + * No point in scanning bits because they 2597 + * are all zero. But we might need to 2598 + * priority-boost blocked readers. 2599 + */ 2600 + rcu_initiate_boost(rnp, flags); 2601 + /* rcu_initiate_boost() releases rnp->lock */ 2602 + continue; 2603 + } 2604 + if (rnp->parent && 2605 + (rnp->parent->qsmask & rnp->grpmask)) { 2606 + /* 2607 + * Race between grace-period 2608 + * initialization and task exiting RCU 2609 + * read-side critical section: Report. 
2610 + */ 2611 + rcu_report_unblock_qs_rnp(rsp, rnp, flags); 2612 + /* rcu_report_unblock_qs_rnp() rlses ->lock */ 2613 + continue; 2614 + } 2738 2615 } 2739 2616 cpu = rnp->grplo; 2740 2617 bit = 1; 2741 2618 for (; cpu <= rnp->grphi; cpu++, bit <<= 1) { 2742 2619 if ((rnp->qsmask & bit) != 0) { 2743 - if ((rnp->qsmaskinit & bit) != 0) 2744 - *isidle = false; 2620 + if ((rnp->qsmaskinit & bit) == 0) 2621 + *isidle = false; /* Pending hotplug. */ 2745 2622 if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj)) 2746 2623 mask |= bit; 2747 2624 } 2748 2625 } 2749 2626 if (mask != 0) { 2750 - 2751 - /* rcu_report_qs_rnp() releases rnp->lock. */ 2752 - rcu_report_qs_rnp(mask, rsp, rnp, flags); 2753 - continue; 2627 + /* Idle/offline CPUs, report (releases rnp->lock. */ 2628 + rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags); 2629 + } else { 2630 + /* Nothing to do here, so just drop the lock. */ 2631 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 2754 2632 } 2755 - raw_spin_unlock_irqrestore(&rnp->lock, flags); 2756 2633 } 2757 2634 } 2758 2635 ··· 2906 2741 * If called from an extended quiescent state, invoke the RCU 2907 2742 * core in order to force a re-evaluation of RCU's idleness. 2908 2743 */ 2909 - if (!rcu_is_watching() && cpu_online(smp_processor_id())) 2744 + if (!rcu_is_watching()) 2910 2745 invoke_rcu_core(); 2911 2746 2912 2747 /* If interrupts were disabled or CPU offline, don't invoke RCU core. */ ··· 2992 2827 2993 2828 if (cpu != -1) 2994 2829 rdp = per_cpu_ptr(rsp->rda, cpu); 2995 - offline = !__call_rcu_nocb(rdp, head, lazy, flags); 2996 - WARN_ON_ONCE(offline); 2997 - /* _call_rcu() is illegal on offline CPU; leak the callback. */ 2998 - local_irq_restore(flags); 2999 - return; 2830 + if (likely(rdp->mynode)) { 2831 + /* Post-boot, so this should be for a no-CBs CPU. */ 2832 + offline = !__call_rcu_nocb(rdp, head, lazy, flags); 2833 + WARN_ON_ONCE(offline); 2834 + /* Offline CPU, _call_rcu() illegal, leak callback. 
*/ 2835 + local_irq_restore(flags); 2836 + return; 2837 + } 2838 + /* 2839 + * Very early boot, before rcu_init(). Initialize if needed 2840 + * and then drop through to queue the callback. 2841 + */ 2842 + BUG_ON(cpu != -1); 2843 + WARN_ON_ONCE(!rcu_is_watching()); 2844 + if (!likely(rdp->nxtlist)) 2845 + init_default_callback_list(rdp); 3000 2846 } 3001 2847 ACCESS_ONCE(rdp->qlen) = rdp->qlen + 1; 3002 2848 if (lazy) ··· 3130 2954 "Illegal synchronize_sched() in RCU-sched read-side critical section"); 3131 2955 if (rcu_blocking_is_gp()) 3132 2956 return; 3133 - if (rcu_expedited) 2957 + if (rcu_gp_is_expedited()) 3134 2958 synchronize_sched_expedited(); 3135 2959 else 3136 2960 wait_rcu_gp(call_rcu_sched); ··· 3157 2981 "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section"); 3158 2982 if (rcu_blocking_is_gp()) 3159 2983 return; 3160 - if (rcu_expedited) 2984 + if (rcu_gp_is_expedited()) 3161 2985 synchronize_rcu_bh_expedited(); 3162 2986 else 3163 2987 wait_rcu_gp(call_rcu_bh); ··· 3694 3518 EXPORT_SYMBOL_GPL(rcu_barrier_sched); 3695 3519 3696 3520 /* 3521 + * Propagate ->qsmaskinit bits up the rcu_node tree to account for the 3522 + * first CPU in a given leaf rcu_node structure coming online. The caller 3523 + * must hold the corresponding leaf rcu_node ->lock with interrupts 3524 + * disabled. 3525 + */ 3526 + static void rcu_init_new_rnp(struct rcu_node *rnp_leaf) 3527 + { 3528 + long mask; 3529 + struct rcu_node *rnp = rnp_leaf; 3530 + 3531 + for (;;) { 3532 + mask = rnp->grpmask; 3533 + rnp = rnp->parent; 3534 + if (rnp == NULL) 3535 + return; 3536 + raw_spin_lock(&rnp->lock); /* Interrupts already disabled. */ 3537 + rnp->qsmaskinit |= mask; 3538 + raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */ 3539 + } 3540 + } 3541 + 3542 + /* 3697 3543 * Do boot-time initialization of a CPU's per-CPU RCU data.
3698 3544 */ 3699 3545 static void __init ··· 3751 3553 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu); 3752 3554 struct rcu_node *rnp = rcu_get_root(rsp); 3753 3555 3754 - /* Exclude new grace periods. */ 3755 - mutex_lock(&rsp->onoff_mutex); 3756 - 3757 3556 /* Set up local state, ensuring consistent view of global state. */ 3758 3557 raw_spin_lock_irqsave(&rnp->lock, flags); 3759 3558 rdp->beenonline = 1; /* We have now been online. */ 3760 3559 rdp->qlen_last_fqs_check = 0; 3761 3560 rdp->n_force_qs_snap = rsp->n_force_qs; 3762 3561 rdp->blimit = blimit; 3763 - init_callback_list(rdp); /* Re-enable callbacks on this CPU. */ 3562 + if (!rdp->nxtlist) 3563 + init_callback_list(rdp); /* Re-enable callbacks on this CPU. */ 3764 3564 rdp->dynticks->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE; 3765 3565 rcu_sysidle_init_percpu_data(rdp->dynticks); 3766 3566 atomic_set(&rdp->dynticks->dynticks, 3767 3567 (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1); 3768 3568 raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */ 3769 3569 3770 - /* Add CPU to rcu_node bitmasks. */ 3570 + /* 3571 + * Add CPU to leaf rcu_node pending-online bitmask. Any needed 3572 + * propagation up the rcu_node tree will happen at the beginning 3573 + * of the next grace period. 3574 + */ 3771 3575 rnp = rdp->mynode; 3772 3576 mask = rdp->grpmask; 3773 - do { 3774 - /* Exclude any attempts to start a new GP on small systems. */ 3775 - raw_spin_lock(&rnp->lock); /* irqs already disabled. */ 3776 - rnp->qsmaskinit |= mask; 3777 - mask = rnp->grpmask; 3778 - if (rnp == rdp->mynode) { 3779 - /* 3780 - * If there is a grace period in progress, we will 3781 - * set up to wait for it next time we run the 3782 - * RCU core code. 
3783 - */ 3784 - rdp->gpnum = rnp->completed; 3785 - rdp->completed = rnp->completed; 3786 - rdp->passed_quiesce = 0; 3787 - rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_qs_ctr); 3788 - rdp->qs_pending = 0; 3789 - trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl")); 3790 - } 3791 - raw_spin_unlock(&rnp->lock); /* irqs already disabled. */ 3792 - rnp = rnp->parent; 3793 - } while (rnp != NULL && !(rnp->qsmaskinit & mask)); 3794 - local_irq_restore(flags); 3795 - 3796 - mutex_unlock(&rsp->onoff_mutex); 3577 + raw_spin_lock(&rnp->lock); /* irqs already disabled. */ 3578 + smp_mb__after_unlock_lock(); 3579 + rnp->qsmaskinitnext |= mask; 3580 + rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */ 3581 + rdp->completed = rnp->completed; 3582 + rdp->passed_quiesce = false; 3583 + rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_qs_ctr); 3584 + rdp->qs_pending = false; 3585 + trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl")); 3586 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 3797 3587 } 3798 3588 3799 3589 static void rcu_prepare_cpu(int cpu) ··· 3795 3609 /* 3796 3610 * Handle CPU online/offline notification events. 
3797 3611 */ 3798 - static int rcu_cpu_notify(struct notifier_block *self, 3799 - unsigned long action, void *hcpu) 3612 + int rcu_cpu_notify(struct notifier_block *self, 3613 + unsigned long action, void *hcpu) 3800 3614 { 3801 3615 long cpu = (long)hcpu; 3802 3616 struct rcu_data *rdp = per_cpu_ptr(rcu_state_p->rda, cpu); 3803 3617 struct rcu_node *rnp = rdp->mynode; 3804 3618 struct rcu_state *rsp; 3805 3619 3806 - trace_rcu_utilization(TPS("Start CPU hotplug")); 3807 3620 switch (action) { 3808 3621 case CPU_UP_PREPARE: 3809 3622 case CPU_UP_PREPARE_FROZEN: ··· 3822 3637 for_each_rcu_flavor(rsp) 3823 3638 rcu_cleanup_dying_cpu(rsp); 3824 3639 break; 3640 + case CPU_DYING_IDLE: 3641 + for_each_rcu_flavor(rsp) { 3642 + rcu_cleanup_dying_idle_cpu(cpu, rsp); 3643 + } 3644 + break; 3825 3645 case CPU_DEAD: 3826 3646 case CPU_DEAD_FROZEN: 3827 3647 case CPU_UP_CANCELED: ··· 3839 3649 default: 3840 3650 break; 3841 3651 } 3842 - trace_rcu_utilization(TPS("End CPU hotplug")); 3843 3652 return NOTIFY_OK; 3844 3653 } 3845 3654 ··· 3849 3660 case PM_HIBERNATION_PREPARE: 3850 3661 case PM_SUSPEND_PREPARE: 3851 3662 if (nr_cpu_ids <= 256) /* Expediting bad for large systems. */ 3852 - rcu_expedited = 1; 3663 + rcu_expedite_gp(); 3853 3664 break; 3854 3665 case PM_POST_HIBERNATION: 3855 3666 case PM_POST_SUSPEND: 3856 - rcu_expedited = 0; 3667 + if (nr_cpu_ids <= 256) /* Expediting bad for large systems. */ 3668 + rcu_unexpedite_gp(); 3857 3669 break; 3858 3670 default: 3859 3671 break; ··· 3924 3734 * Compute the per-level fanout, either using the exact fanout specified 3925 3735 * or balancing the tree, depending on CONFIG_RCU_FANOUT_EXACT. 
3926 3736 */ 3927 - #ifdef CONFIG_RCU_FANOUT_EXACT 3928 3737 static void __init rcu_init_levelspread(struct rcu_state *rsp) 3929 3738 { 3930 3739 int i; 3931 3740 3932 - rsp->levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf; 3933 - for (i = rcu_num_lvls - 2; i >= 0; i--) 3934 - rsp->levelspread[i] = CONFIG_RCU_FANOUT; 3935 - } 3936 - #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */ 3937 - static void __init rcu_init_levelspread(struct rcu_state *rsp) 3938 - { 3939 - int ccur; 3940 - int cprv; 3941 - int i; 3741 + if (IS_ENABLED(CONFIG_RCU_FANOUT_EXACT)) { 3742 + rsp->levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf; 3743 + for (i = rcu_num_lvls - 2; i >= 0; i--) 3744 + rsp->levelspread[i] = CONFIG_RCU_FANOUT; 3745 + } else { 3746 + int ccur; 3747 + int cprv; 3942 3748 3943 - cprv = nr_cpu_ids; 3944 - for (i = rcu_num_lvls - 1; i >= 0; i--) { 3945 - ccur = rsp->levelcnt[i]; 3946 - rsp->levelspread[i] = (cprv + ccur - 1) / ccur; 3947 - cprv = ccur; 3749 + cprv = nr_cpu_ids; 3750 + for (i = rcu_num_lvls - 1; i >= 0; i--) { 3751 + ccur = rsp->levelcnt[i]; 3752 + rsp->levelspread[i] = (cprv + ccur - 1) / ccur; 3753 + cprv = ccur; 3754 + } 3948 3755 } 3949 3756 } 3950 - #endif /* #else #ifdef CONFIG_RCU_FANOUT_EXACT */ 3951 3757 3952 3758 /* 3953 3759 * Helper function for rcu_init() that initializes one rcu_state structure. ··· 4019 3833 } 4020 3834 } 4021 3835 4022 - rsp->rda = rda; 4023 3836 init_waitqueue_head(&rsp->gp_wq); 4024 3837 rnp = rsp->level[rcu_num_lvls - 1]; 4025 3838 for_each_possible_cpu(i) { ··· 4111 3926 { 4112 3927 int cpu; 4113 3928 3929 + rcu_early_boot_tests(); 3930 + 4114 3931 rcu_bootup_announce(); 4115 3932 rcu_init_geometry(); 4116 3933 rcu_init_one(&rcu_bh_state, &rcu_bh_data); ··· 4129 3942 pm_notifier(rcu_pm_notify, 0); 4130 3943 for_each_online_cpu(cpu) 4131 3944 rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu); 4132 - 4133 - rcu_early_boot_tests(); 4134 3945 } 4135 3946 4136 3947 #include "tree_plugin.h"
+9 -2
kernel/rcu/tree.h
··· 141 141 /* complete (only for PREEMPT_RCU). */ 142 142 unsigned long qsmaskinit; 143 143 /* Per-GP initial value for qsmask & expmask. */ 144 + /* Initialized from ->qsmaskinitnext at the */ 145 + /* beginning of each grace period. */ 146 + unsigned long qsmaskinitnext; 147 + /* Online CPUs for next grace period. */ 144 148 unsigned long grpmask; /* Mask to apply to parent qsmask. */ 145 149 /* Only one bit will be set in this mask. */ 146 150 int grplo; /* lowest-numbered CPU or group here. */ 147 151 int grphi; /* highest-numbered CPU or group here. */ 148 152 u8 grpnum; /* CPU/group number for next level up. */ 149 153 u8 level; /* root is at level 0. */ 154 + bool wait_blkd_tasks;/* Necessary to wait for blocked tasks to */ 155 + /* exit RCU read-side critical sections */ 156 + /* before propagating offline up the */ 157 + /* rcu_node tree? */ 150 158 struct rcu_node *parent; 151 159 struct list_head blkd_tasks; 152 160 /* Tasks blocked in RCU read-side critical */ ··· 456 448 long qlen; /* Total number of callbacks. */ 457 449 /* End of fields guarded by orphan_lock. */ 458 450 459 - struct mutex onoff_mutex; /* Coordinate hotplug & GPs. */ 460 - 461 451 struct mutex barrier_mutex; /* Guards barrier fields. */ 462 452 atomic_t barrier_cpu_count; /* # CPUs waiting on. */ 463 453 struct completion barrier_completion; /* Wake at barrier end. */ ··· 565 559 static void rcu_cleanup_after_idle(void); 566 560 static void rcu_prepare_for_idle(void); 567 561 static void rcu_idle_count_callbacks_posted(void); 562 + static bool rcu_preempt_has_tasks(struct rcu_node *rnp); 568 563 static void print_cpu_stall_info_begin(void); 569 564 static void print_cpu_stall_info(struct rcu_state *rsp, int cpu); 570 565 static void print_cpu_stall_info_end(void);
+132 -135
kernel/rcu/tree_plugin.h
··· 58 58 */ 59 59 static void __init rcu_bootup_announce_oddness(void) 60 60 { 61 - #ifdef CONFIG_RCU_TRACE 62 - pr_info("\tRCU debugfs-based tracing is enabled.\n"); 63 - #endif 64 - #if (defined(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 64) || (!defined(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 32) 65 - pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d\n", 66 - CONFIG_RCU_FANOUT); 67 - #endif 68 - #ifdef CONFIG_RCU_FANOUT_EXACT 69 - pr_info("\tHierarchical RCU autobalancing is disabled.\n"); 70 - #endif 71 - #ifdef CONFIG_RCU_FAST_NO_HZ 72 - pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n"); 73 - #endif 74 - #ifdef CONFIG_PROVE_RCU 75 - pr_info("\tRCU lockdep checking is enabled.\n"); 76 - #endif 77 - #ifdef CONFIG_RCU_TORTURE_TEST_RUNNABLE 78 - pr_info("\tRCU torture testing starts during boot.\n"); 79 - #endif 80 - #if defined(CONFIG_RCU_CPU_STALL_INFO) 81 - pr_info("\tAdditional per-CPU info printed with stalls.\n"); 82 - #endif 83 - #if NUM_RCU_LVL_4 != 0 84 - pr_info("\tFour-level hierarchy is enabled.\n"); 85 - #endif 61 + if (IS_ENABLED(CONFIG_RCU_TRACE)) 62 + pr_info("\tRCU debugfs-based tracing is enabled.\n"); 63 + if ((IS_ENABLED(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 64) || 64 + (!IS_ENABLED(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 32)) 65 + pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d\n", 66 + CONFIG_RCU_FANOUT); 67 + if (IS_ENABLED(CONFIG_RCU_FANOUT_EXACT)) 68 + pr_info("\tHierarchical RCU autobalancing is disabled.\n"); 69 + if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ)) 70 + pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n"); 71 + if (IS_ENABLED(CONFIG_PROVE_RCU)) 72 + pr_info("\tRCU lockdep checking is enabled.\n"); 73 + if (IS_ENABLED(CONFIG_RCU_TORTURE_TEST_RUNNABLE)) 74 + pr_info("\tRCU torture testing starts during boot.\n"); 75 + if (IS_ENABLED(CONFIG_RCU_CPU_STALL_INFO)) 76 + pr_info("\tAdditional per-CPU info printed with stalls.\n"); 77 + if (NUM_RCU_LVL_4 != 0) 78 + pr_info("\tFour-level 
hierarchy is enabled.\n"); 79 + if (CONFIG_RCU_FANOUT_LEAF != 16) 80 + pr_info("\tBuild-time adjustment of leaf fanout to %d.\n", 81 + CONFIG_RCU_FANOUT_LEAF); 86 82 if (rcu_fanout_leaf != CONFIG_RCU_FANOUT_LEAF) 87 83 pr_info("\tBoot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf); 88 84 if (nr_cpu_ids != NR_CPUS) 89 85 pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids); 90 - #ifdef CONFIG_RCU_BOOST 91 - pr_info("\tRCU kthread priority: %d.\n", kthread_prio); 92 - #endif 86 + if (IS_ENABLED(CONFIG_RCU_BOOST)) 87 + pr_info("\tRCU kthread priority: %d.\n", kthread_prio); 93 88 } 94 89 95 90 #ifdef CONFIG_PREEMPT_RCU ··· 175 180 * But first, note that the current CPU must still be 176 181 * on line! 177 182 */ 178 - WARN_ON_ONCE((rdp->grpmask & rnp->qsmaskinit) == 0); 183 + WARN_ON_ONCE((rdp->grpmask & rcu_rnp_online_cpus(rnp)) == 0); 179 184 WARN_ON_ONCE(!list_empty(&t->rcu_node_entry)); 180 185 if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) { 181 186 list_add(&t->rcu_node_entry, rnp->gp_tasks->prev); ··· 228 233 } 229 234 230 235 /* 231 - * Record a quiescent state for all tasks that were previously queued 232 - * on the specified rcu_node structure and that were blocking the current 233 - * RCU grace period. The caller must hold the specified rnp->lock with 234 - * irqs disabled, and this lock is released upon return, but irqs remain 235 - * disabled. 236 - */ 237 - static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags) 238 - __releases(rnp->lock) 239 - { 240 - unsigned long mask; 241 - struct rcu_node *rnp_p; 242 - 243 - if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) { 244 - raw_spin_unlock_irqrestore(&rnp->lock, flags); 245 - return; /* Still need more quiescent states! 
*/ 246 - } 247 - 248 - rnp_p = rnp->parent; 249 - if (rnp_p == NULL) { 250 - /* 251 - * Either there is only one rcu_node in the tree, 252 - * or tasks were kicked up to root rcu_node due to 253 - * CPUs going offline. 254 - */ 255 - rcu_report_qs_rsp(&rcu_preempt_state, flags); 256 - return; 257 - } 258 - 259 - /* Report up the rest of the hierarchy. */ 260 - mask = rnp->grpmask; 261 - raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */ 262 - raw_spin_lock(&rnp_p->lock); /* irqs already disabled. */ 263 - smp_mb__after_unlock_lock(); 264 - rcu_report_qs_rnp(mask, &rcu_preempt_state, rnp_p, flags); 265 - } 266 - 267 - /* 268 236 * Advance a ->blkd_tasks-list pointer to the next entry, instead 269 237 * returning NULL if at the end of the list. 270 238 */ ··· 258 300 */ 259 301 void rcu_read_unlock_special(struct task_struct *t) 260 302 { 261 - bool empty; 262 303 bool empty_exp; 263 304 bool empty_norm; 264 305 bool empty_exp_now; ··· 291 334 } 292 335 293 336 /* Hardware IRQ handlers cannot block, complain if they get here. */ 294 - if (WARN_ON_ONCE(in_irq() || in_serving_softirq())) { 337 + if (in_irq() || in_serving_softirq()) { 338 + lockdep_rcu_suspicious(__FILE__, __LINE__, 339 + "rcu_read_unlock() from irq or softirq with blocking in critical section!!!\n"); 340 + pr_alert("->rcu_read_unlock_special: %#x (b: %d, nq: %d)\n", 341 + t->rcu_read_unlock_special.s, 342 + t->rcu_read_unlock_special.b.blocked, 343 + t->rcu_read_unlock_special.b.need_qs); 295 344 local_irq_restore(flags); 296 345 return; 297 346 } ··· 319 356 break; 320 357 raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */ 321 358 } 322 - empty = !rcu_preempt_has_tasks(rnp); 323 359 empty_norm = !rcu_preempt_blocked_readers_cgp(rnp); 324 360 empty_exp = !rcu_preempted_readers_exp(rnp); 325 361 smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. 
*/ ··· 339 377 #endif /* #ifdef CONFIG_RCU_BOOST */ 340 378 341 379 /* 342 - * If this was the last task on the list, go see if we 343 - * need to propagate ->qsmaskinit bit clearing up the 344 - * rcu_node tree. 345 - */ 346 - if (!empty && !rcu_preempt_has_tasks(rnp)) 347 - rcu_cleanup_dead_rnp(rnp); 348 - 349 - /* 350 380 * If this was the last task on the current list, and if 351 381 * we aren't waiting on any CPUs, report the quiescent state. 352 382 * Note that rcu_report_unblock_qs_rnp() releases rnp->lock, ··· 353 399 rnp->grplo, 354 400 rnp->grphi, 355 401 !!rnp->gp_tasks); 356 - rcu_report_unblock_qs_rnp(rnp, flags); 402 + rcu_report_unblock_qs_rnp(&rcu_preempt_state, 403 + rnp, flags); 357 404 } else { 358 405 raw_spin_unlock_irqrestore(&rnp->lock, flags); 359 406 } ··· 475 520 WARN_ON_ONCE(rnp->qsmask); 476 521 } 477 522 478 - #ifdef CONFIG_HOTPLUG_CPU 479 - 480 - #endif /* #ifdef CONFIG_HOTPLUG_CPU */ 481 - 482 523 /* 483 524 * Check for a quiescent state from the current CPU. When a task blocks, 484 525 * the task is recorded in the corresponding CPU's rcu_node structure, ··· 536 585 "Illegal synchronize_rcu() in RCU read-side critical section"); 537 586 if (!rcu_scheduler_active) 538 587 return; 539 - if (rcu_expedited) 588 + if (rcu_gp_is_expedited()) 540 589 synchronize_rcu_expedited(); 541 590 else 542 591 wait_rcu_gp(call_rcu); ··· 581 630 * recursively up the tree. (Calm down, calm down, we do the recursion 582 631 * iteratively!) 583 632 * 584 - * Most callers will set the "wake" flag, but the task initiating the 585 - * expedited grace period need not wake itself. 586 - * 587 633 * Caller must hold sync_rcu_preempt_exp_mutex. 588 634 */ 589 635 static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp, ··· 615 667 616 668 /* 617 669 * Snapshot the tasks blocking the newly started preemptible-RCU expedited 618 - * grace period for the specified rcu_node structure. 
If there are no such 619 - * tasks, report it up the rcu_node hierarchy. 670 + * grace period for the specified rcu_node structure, phase 1. If there 671 + * are such tasks, set the ->expmask bits up the rcu_node tree and also 672 + * set the ->expmask bits on the leaf rcu_node structures to tell phase 2 673 + * that work is needed here. 620 674 * 621 - * Caller must hold sync_rcu_preempt_exp_mutex and must exclude 622 - * CPU hotplug operations. 675 + * Caller must hold sync_rcu_preempt_exp_mutex. 623 676 */ 624 677 static void 625 - sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp) 678 + sync_rcu_preempt_exp_init1(struct rcu_state *rsp, struct rcu_node *rnp) 626 679 { 627 680 unsigned long flags; 628 - int must_wait = 0; 681 + unsigned long mask; 682 + struct rcu_node *rnp_up; 629 683 630 684 raw_spin_lock_irqsave(&rnp->lock, flags); 631 685 smp_mb__after_unlock_lock(); 686 + WARN_ON_ONCE(rnp->expmask); 687 + WARN_ON_ONCE(rnp->exp_tasks); 632 688 if (!rcu_preempt_has_tasks(rnp)) { 689 + /* No blocked tasks, nothing to do. */ 633 690 raw_spin_unlock_irqrestore(&rnp->lock, flags); 634 - } else { 691 + return; 692 + } 693 + /* Call for Phase 2 and propagate ->expmask bits up the tree. */ 694 + rnp->expmask = 1; 695 + rnp_up = rnp; 696 + while (rnp_up->parent) { 697 + mask = rnp_up->grpmask; 698 + rnp_up = rnp_up->parent; 699 + if (rnp_up->expmask & mask) 700 + break; 701 + raw_spin_lock(&rnp_up->lock); /* irqs already off */ 702 + smp_mb__after_unlock_lock(); 703 + rnp_up->expmask |= mask; 704 + raw_spin_unlock(&rnp_up->lock); /* irqs still off */ 705 + } 706 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 707 + } 708 + 709 + /* 710 + * Snapshot the tasks blocking the newly started preemptible-RCU expedited 711 + * grace period for the specified rcu_node structure, phase 2. If the 712 + * leaf rcu_node structure has its ->expmask field set, check for tasks. 
713 + * If there are some, clear ->expmask and set ->exp_tasks accordingly, 714 + * then initiate RCU priority boosting. Otherwise, clear ->expmask and 715 + * invoke rcu_report_exp_rnp() to clear out the upper-level ->expmask bits, 716 + * enabling rcu_read_unlock_special() to do the bit-clearing. 717 + * 718 + * Caller must hold sync_rcu_preempt_exp_mutex. 719 + */ 720 + static void 721 + sync_rcu_preempt_exp_init2(struct rcu_state *rsp, struct rcu_node *rnp) 722 + { 723 + unsigned long flags; 724 + 725 + raw_spin_lock_irqsave(&rnp->lock, flags); 726 + smp_mb__after_unlock_lock(); 727 + if (!rnp->expmask) { 728 + /* Phase 1 didn't do anything, so Phase 2 doesn't either. */ 729 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 730 + return; 731 + } 732 + 733 + /* Phase 1 is over. */ 734 + rnp->expmask = 0; 735 + 736 + /* 737 + * If there are still blocked tasks, set up ->exp_tasks so that 738 + * rcu_read_unlock_special() will wake us and then boost them. 739 + */ 740 + if (rcu_preempt_has_tasks(rnp)) { 635 741 rnp->exp_tasks = rnp->blkd_tasks.next; 636 742 rcu_initiate_boost(rnp, flags); /* releases rnp->lock */ 637 - must_wait = 1; 743 + return; 638 744 } 639 - if (!must_wait) 640 - rcu_report_exp_rnp(rsp, rnp, false); /* Don't wake self. */ 745 + 746 + /* No longer any blocked tasks, so undo bit setting. */ 747 + raw_spin_unlock_irqrestore(&rnp->lock, flags); 748 + rcu_report_exp_rnp(rsp, rnp, false); 641 749 } 642 750 643 751 /** ··· 710 706 */ 711 707 void synchronize_rcu_expedited(void) 712 708 { 713 - unsigned long flags; 714 709 struct rcu_node *rnp; 715 710 struct rcu_state *rsp = &rcu_preempt_state; 716 711 unsigned long snap; ··· 760 757 /* force all RCU readers onto ->blkd_tasks lists. */ 761 758 synchronize_sched_expedited(); 762 759 763 - /* Initialize ->expmask for all non-leaf rcu_node structures. 
*/ 764 - rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) { 765 - raw_spin_lock_irqsave(&rnp->lock, flags); 766 - smp_mb__after_unlock_lock(); 767 - rnp->expmask = rnp->qsmaskinit; 768 - raw_spin_unlock_irqrestore(&rnp->lock, flags); 769 - } 770 - 771 - /* Snapshot current state of ->blkd_tasks lists. */ 760 + /* 761 + * Snapshot current state of ->blkd_tasks lists into ->expmask. 762 + * Phase 1 sets bits and phase 2 permits rcu_read_unlock_special() 763 + * to start clearing them. Doing this in one phase leads to 764 + * strange races between setting and clearing bits, so just say "no"! 765 + */ 772 766 rcu_for_each_leaf_node(rsp, rnp) 773 - sync_rcu_preempt_exp_init(rsp, rnp); 774 - if (NUM_RCU_NODES > 1) 775 - sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp)); 767 + sync_rcu_preempt_exp_init1(rsp, rnp); 768 + rcu_for_each_leaf_node(rsp, rnp) 769 + sync_rcu_preempt_exp_init2(rsp, rnp); 776 770 777 771 put_online_cpus(); 778 772 ··· 859 859 return 0; 860 860 } 861 861 862 - #ifdef CONFIG_HOTPLUG_CPU 863 - 864 862 /* 865 863 * Because there is no preemptible RCU, there can be no readers blocked. 866 864 */ ··· 866 868 { 867 869 return false; 868 870 } 869 - 870 - #endif /* #ifdef CONFIG_HOTPLUG_CPU */ 871 871 872 872 /* 873 873 * Because preemptible RCU does not exist, we never have to check for ··· 1166 1170 * Returns zero if all is well, a negated errno otherwise. 
1167 1171 */ 1168 1172 static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp, 1169 - struct rcu_node *rnp) 1173 + struct rcu_node *rnp) 1170 1174 { 1171 1175 int rnp_index = rnp - &rsp->node[0]; 1172 1176 unsigned long flags; ··· 1176 1180 if (&rcu_preempt_state != rsp) 1177 1181 return 0; 1178 1182 1179 - if (!rcu_scheduler_fully_active || rnp->qsmaskinit == 0) 1183 + if (!rcu_scheduler_fully_active || rcu_rnp_online_cpus(rnp) == 0) 1180 1184 return 0; 1181 1185 1182 1186 rsp->boost = 1; ··· 1269 1273 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu) 1270 1274 { 1271 1275 struct task_struct *t = rnp->boost_kthread_task; 1272 - unsigned long mask = rnp->qsmaskinit; 1276 + unsigned long mask = rcu_rnp_online_cpus(rnp); 1273 1277 cpumask_var_t cm; 1274 1278 int cpu; 1275 1279 ··· 1941 1945 rhp = ACCESS_ONCE(rdp->nocb_follower_head); 1942 1946 1943 1947 /* Having no rcuo kthread but CBs after scheduler starts is bad! */ 1944 - if (!ACCESS_ONCE(rdp->nocb_kthread) && rhp) { 1948 + if (!ACCESS_ONCE(rdp->nocb_kthread) && rhp && 1949 + rcu_scheduler_fully_active) { 1945 1950 /* RCU callback enqueued before CPU first came online??? */ 1946 1951 pr_err("RCU: Never-onlined no-CBs CPU %d has CB %p\n", 1947 1952 cpu, rhp->func); ··· 2389 2392 pr_info("\tPoll for callbacks from no-CBs CPUs.\n"); 2390 2393 2391 2394 for_each_rcu_flavor(rsp) { 2392 - for_each_cpu(cpu, rcu_nocb_mask) { 2393 - struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu); 2394 - 2395 - /* 2396 - * If there are early callbacks, they will need 2397 - * to be moved to the nocb lists. 
2398 - */ 2399 - WARN_ON_ONCE(rdp->nxttail[RCU_NEXT_TAIL] != 2400 - &rdp->nxtlist && 2401 - rdp->nxttail[RCU_NEXT_TAIL] != NULL); 2402 - init_nocb_callback_list(rdp); 2403 - } 2395 + for_each_cpu(cpu, rcu_nocb_mask) 2396 + init_nocb_callback_list(per_cpu_ptr(rsp->rda, cpu)); 2404 2397 rcu_organize_nocb_kthreads(rsp); 2405 2398 } 2406 2399 } ··· 2527 2540 if (!rcu_is_nocb_cpu(rdp->cpu)) 2528 2541 return false; 2529 2542 2543 + /* If there are early-boot callbacks, move them to nocb lists. */ 2544 + if (rdp->nxtlist) { 2545 + rdp->nocb_head = rdp->nxtlist; 2546 + rdp->nocb_tail = rdp->nxttail[RCU_NEXT_TAIL]; 2547 + atomic_long_set(&rdp->nocb_q_count, rdp->qlen); 2548 + atomic_long_set(&rdp->nocb_q_count_lazy, rdp->qlen_lazy); 2549 + rdp->nxtlist = NULL; 2550 + rdp->qlen = 0; 2551 + rdp->qlen_lazy = 0; 2552 + } 2530 2553 rdp->nxttail[RCU_NEXT_TAIL] = NULL; 2531 2554 return true; 2532 2555 } ··· 2760 2763 2761 2764 /* 2762 2765 * Check to see if the current CPU is idle. Note that usermode execution 2763 - * does not count as idle. The caller must have disabled interrupts. 2766 + * does not count as idle. The caller must have disabled interrupts, 2767 + * and must be running on tick_do_timer_cpu. 2764 2768 */ 2765 2769 static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle, 2766 2770 unsigned long *maxj) ··· 2782 2784 if (!*isidle || rdp->rsp != rcu_state_p || 2783 2785 cpu_is_offline(rdp->cpu) || rdp->cpu == tick_do_timer_cpu) 2784 2786 return; 2785 - if (rcu_gp_in_progress(rdp->rsp)) 2786 - WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); 2787 + /* Verify affinity of current kthread. */ 2788 + WARN_ON_ONCE(smp_processor_id() != tick_do_timer_cpu); 2787 2789 2788 2790 /* Pick up current idle and NMI-nesting counter and check. 
*/ 2789 2791 cur = atomic_read(&rdtp->dynticks_idle); ··· 3066 3068 return; 3067 3069 #ifdef CONFIG_NO_HZ_FULL_SYSIDLE 3068 3070 cpu = tick_do_timer_cpu; 3069 - if (cpu >= 0 && cpu < nr_cpu_ids && raw_smp_processor_id() != cpu) 3071 + if (cpu >= 0 && cpu < nr_cpu_ids) 3070 3072 set_cpus_allowed_ptr(current, cpumask_of(cpu)); 3071 3073 #else /* #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */ 3072 - if (!is_housekeeping_cpu(raw_smp_processor_id())) 3073 - housekeeping_affine(current); 3074 + housekeeping_affine(current); 3074 3075 #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */ 3075 3076 } 3076 3077
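The tree_plugin.h hunk above converts a chain of `#ifdef CONFIG_*` blocks in rcu_bootup_announce_oddness() into plain `if (IS_ENABLED(CONFIG_*))` conditionals, so every branch is always parsed and type-checked and the optimizer discards the dead ones. A minimal user-space sketch of that pattern (the `DEMO_*` names and `demo_announce()` are made up for illustration; the real `IS_ENABLED()` lives in `<linux/kconfig.h>`):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for Kconfig switches (hypothetical names). */
#define DEMO_TRACE_ENABLED 1
#define DEMO_BOOST_ENABLED 0

/* Was a chain of #ifdef/#endif blocks; now ordinary C conditionals. */
static int demo_announce(char *buf, size_t len)
{
	int n = 0;

	if (DEMO_TRACE_ENABLED)
		n += snprintf(buf + n, len - n, "\ttracing is enabled.\n");
	if (DEMO_BOOST_ENABLED)	/* still compiled and type-checked, then dropped */
		n += snprintf(buf + n, len - n, "\tboosting is enabled.\n");
	return n;
}
```

The payoff is that a typo inside a disabled branch now breaks the build everywhere, not only in configs that happen to enable the option.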
+2 -2
kernel/rcu/tree_trace.c
··· 283 283 seq_puts(m, "\n"); 284 284 level = rnp->level; 285 285 } 286 - seq_printf(m, "%lx/%lx %c%c>%c %d:%d ^%d ", 287 - rnp->qsmask, rnp->qsmaskinit, 286 + seq_printf(m, "%lx/%lx->%lx %c%c>%c %d:%d ^%d ", 287 + rnp->qsmask, rnp->qsmaskinit, rnp->qsmaskinitnext, 288 288 ".G"[rnp->gp_tasks != NULL], 289 289 ".E"[rnp->exp_tasks != NULL], 290 290 ".T"[!list_empty(&rnp->blkd_tasks)],
+63 -9
kernel/rcu/update.c
··· 62 62 63 63 module_param(rcu_expedited, int, 0); 64 64 65 + #ifndef CONFIG_TINY_RCU 66 + 67 + static atomic_t rcu_expedited_nesting = 68 + ATOMIC_INIT(IS_ENABLED(CONFIG_RCU_EXPEDITE_BOOT) ? 1 : 0); 69 + 70 + /* 71 + * Should normal grace-period primitives be expedited? Intended for 72 + * use within RCU. Note that this function takes the rcu_expedited 73 + * sysfs/boot variable into account as well as the rcu_expedite_gp() 74 + * nesting. So looping on rcu_unexpedite_gp() until rcu_gp_is_expedited() 75 + * returns false is a -really- bad idea. 76 + */ 77 + bool rcu_gp_is_expedited(void) 78 + { 79 + return rcu_expedited || atomic_read(&rcu_expedited_nesting); 80 + } 81 + EXPORT_SYMBOL_GPL(rcu_gp_is_expedited); 82 + 83 + /** 84 + * rcu_expedite_gp - Expedite future RCU grace periods 85 + * 86 + * After a call to this function, future calls to synchronize_rcu() and 87 + * friends act as the corresponding synchronize_rcu_expedited() function 88 + * had instead been called. 89 + */ 90 + void rcu_expedite_gp(void) 91 + { 92 + atomic_inc(&rcu_expedited_nesting); 93 + } 94 + EXPORT_SYMBOL_GPL(rcu_expedite_gp); 95 + 96 + /** 97 + * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation 98 + * 99 + * Undo a prior call to rcu_expedite_gp(). If all prior calls to 100 + * rcu_expedite_gp() are undone by a subsequent call to rcu_unexpedite_gp(), 101 + * and if the rcu_expedited sysfs/boot parameter is not set, then all 102 + * subsequent calls to synchronize_rcu() and friends will return to 103 + * their normal non-expedited behavior. 104 + */ 105 + void rcu_unexpedite_gp(void) 106 + { 107 + atomic_dec(&rcu_expedited_nesting); 108 + } 109 + EXPORT_SYMBOL_GPL(rcu_unexpedite_gp); 110 + 111 + #endif /* #ifndef CONFIG_TINY_RCU */ 112 + 113 + /* 114 + * Inform RCU of the end of the in-kernel boot sequence. 
115 + */ 116 + void rcu_end_inkernel_boot(void) 117 + { 118 + if (IS_ENABLED(CONFIG_RCU_EXPEDITE_BOOT)) 119 + rcu_unexpedite_gp(); 120 + } 121 + 65 122 #ifdef CONFIG_PREEMPT_RCU 66 123 67 124 /* ··· 256 199 257 200 #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ 258 201 259 - struct rcu_synchronize { 260 - struct rcu_head head; 261 - struct completion completion; 262 - }; 263 - 264 - /* 265 - * Awaken the corresponding synchronize_rcu() instance now that a 266 - * grace period has elapsed. 202 + /** 203 + * wakeme_after_rcu() - Callback function to awaken a task after grace period 204 + * @head: Pointer to rcu_head member within rcu_synchronize structure 205 + * 206 + * Awaken the corresponding task now that a grace period has elapsed. 267 207 */ 268 - static void wakeme_after_rcu(struct rcu_head *head) 208 + void wakeme_after_rcu(struct rcu_head *head) 269 209 { 270 210 struct rcu_synchronize *rcu; 271 211
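The update.c hunk adds the in-kernel expediting API from the merge summary: a nesting counter that keeps normal grace periods expedited while any `rcu_expedite_gp()` call is outstanding, OR'd with the old `rcu_expedited` sysfs/boot flag. A user-space sketch of just that counter logic (the `demo_*` names are made up; the kernel versions are shown in the diff above):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int demo_rcu_expedited;			/* mimics the sysfs/boot flag */
static atomic_int demo_expedited_nesting;	/* mimics rcu_expedited_nesting */

static void demo_expedite_gp(void)
{
	atomic_fetch_add(&demo_expedited_nesting, 1);
}

static void demo_unexpedite_gp(void)
{
	atomic_fetch_sub(&demo_expedited_nesting, 1);
}

/* Expedited while the flag is set or any nesting level is outstanding. */
static bool demo_gp_is_expedited(void)
{
	return demo_rcu_expedited || atomic_load(&demo_expedited_nesting);
}
```

Note the comment in the patch: because the sysfs flag is folded in, looping on `rcu_unexpedite_gp()` until `rcu_gp_is_expedited()` returns false can spin forever.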
+8 -1
kernel/sched/idle.c
··· 210 210 goto exit_idle; 211 211 } 212 212 213 + DEFINE_PER_CPU(bool, cpu_dead_idle); 214 + 213 215 /* 214 216 * Generic idle loop implementation 215 217 * ··· 236 234 check_pgt_cache(); 237 235 rmb(); 238 236 239 - if (cpu_is_offline(smp_processor_id())) 237 + if (cpu_is_offline(smp_processor_id())) { 238 + rcu_cpu_notify(NULL, CPU_DYING_IDLE, 239 + (void *)(long)smp_processor_id()); 240 + smp_mb(); /* all activity before dead. */ 241 + this_cpu_write(cpu_dead_idle, true); 240 242 arch_cpu_idle_dead(); 243 + } 241 244 242 245 local_irq_disable(); 243 246 arch_cpu_idle_enter();
+156
kernel/smpboot.c
··· 4 4 #include <linux/cpu.h> 5 5 #include <linux/err.h> 6 6 #include <linux/smp.h> 7 + #include <linux/delay.h> 7 8 #include <linux/init.h> 8 9 #include <linux/list.h> 9 10 #include <linux/slab.h> ··· 315 314 put_online_cpus(); 316 315 } 317 316 EXPORT_SYMBOL_GPL(smpboot_unregister_percpu_thread); 317 + 318 + static DEFINE_PER_CPU(atomic_t, cpu_hotplug_state) = ATOMIC_INIT(CPU_POST_DEAD); 319 + 320 + /* 321 + * Called to poll specified CPU's state, for example, when waiting for 322 + * a CPU to come online. 323 + */ 324 + int cpu_report_state(int cpu) 325 + { 326 + return atomic_read(&per_cpu(cpu_hotplug_state, cpu)); 327 + } 328 + 329 + /* 330 + * If CPU has died properly, set its state to CPU_UP_PREPARE and 331 + * return success. Otherwise, return -EBUSY if the CPU died after 332 + * cpu_wait_death() timed out. And yet otherwise again, return -EAGAIN 333 + * if cpu_wait_death() timed out and the CPU still hasn't gotten around 334 + * to dying. In the latter two cases, the CPU might not be set up 335 + * properly, but it is up to the arch-specific code to decide. 336 + * Finally, -EIO indicates an unanticipated problem. 337 + * 338 + * Note that it is permissible to omit this call entirely, as is 339 + * done in architectures that do no CPU-hotplug error checking. 340 + */ 341 + int cpu_check_up_prepare(int cpu) 342 + { 343 + if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) { 344 + atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_UP_PREPARE); 345 + return 0; 346 + } 347 + 348 + switch (atomic_read(&per_cpu(cpu_hotplug_state, cpu))) { 349 + 350 + case CPU_POST_DEAD: 351 + 352 + /* The CPU died properly, so just start it up again. */ 353 + atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_UP_PREPARE); 354 + return 0; 355 + 356 + case CPU_DEAD_FROZEN: 357 + 358 + /* 359 + * Timeout during CPU death, so let caller know. 360 + * The outgoing CPU completed its processing, but after 361 + * cpu_wait_death() timed out and reported the error. 
The 362 + * caller is free to proceed, in which case the state 363 + * will be reset properly by cpu_set_state_online(). 364 + * Proceeding despite this -EBUSY return makes sense 365 + * for systems where the outgoing CPUs take themselves 366 + * offline, with no post-death manipulation required from 367 + * a surviving CPU. 368 + */ 369 + return -EBUSY; 370 + 371 + case CPU_BROKEN: 372 + 373 + /* 374 + * The most likely reason we got here is that there was 375 + * a timeout during CPU death, and the outgoing CPU never 376 + * did complete its processing. This could happen on 377 + * a virtualized system if the outgoing VCPU gets preempted 378 + * for more than five seconds, and the user attempts to 379 + * immediately online that same CPU. Trying again later 380 + * might return -EBUSY above, hence -EAGAIN. 381 + */ 382 + return -EAGAIN; 383 + 384 + default: 385 + 386 + /* Should not happen. Famous last words. */ 387 + return -EIO; 388 + } 389 + } 390 + 391 + /* 392 + * Mark the specified CPU online. 393 + * 394 + * Note that it is permissible to omit this call entirely, as is 395 + * done in architectures that do no CPU-hotplug error checking. 396 + */ 397 + void cpu_set_state_online(int cpu) 398 + { 399 + (void)atomic_xchg(&per_cpu(cpu_hotplug_state, cpu), CPU_ONLINE); 400 + } 401 + 402 + #ifdef CONFIG_HOTPLUG_CPU 403 + 404 + /* 405 + * Wait for the specified CPU to exit the idle loop and die. 406 + */ 407 + bool cpu_wait_death(unsigned int cpu, int seconds) 408 + { 409 + int jf_left = seconds * HZ; 410 + int oldstate; 411 + bool ret = true; 412 + int sleep_jf = 1; 413 + 414 + might_sleep(); 415 + 416 + /* The outgoing CPU will normally get done quite quickly. */ 417 + if (atomic_read(&per_cpu(cpu_hotplug_state, cpu)) == CPU_DEAD) 418 + goto update_state; 419 + udelay(5); 420 + 421 + /* But if the outgoing CPU dawdles, wait increasingly long times. 
*/ 422 + while (atomic_read(&per_cpu(cpu_hotplug_state, cpu)) != CPU_DEAD) { 423 + schedule_timeout_uninterruptible(sleep_jf); 424 + jf_left -= sleep_jf; 425 + if (jf_left <= 0) 426 + break; 427 + sleep_jf = DIV_ROUND_UP(sleep_jf * 11, 10); 428 + } 429 + update_state: 430 + oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu)); 431 + if (oldstate == CPU_DEAD) { 432 + /* Outgoing CPU died normally, update state. */ 433 + smp_mb(); /* atomic_read() before update. */ 434 + atomic_set(&per_cpu(cpu_hotplug_state, cpu), CPU_POST_DEAD); 435 + } else { 436 + /* Outgoing CPU still hasn't died, set state accordingly. */ 437 + if (atomic_cmpxchg(&per_cpu(cpu_hotplug_state, cpu), 438 + oldstate, CPU_BROKEN) != oldstate) 439 + goto update_state; 440 + ret = false; 441 + } 442 + return ret; 443 + } 444 + 445 + /* 446 + * Called by the outgoing CPU to report its successful death. Return 447 + * false if this report follows the surviving CPU's timing out. 448 + * 449 + * A separate "CPU_DEAD_FROZEN" is used when the surviving CPU 450 + * timed out. This approach allows architectures to omit calls to 451 + * cpu_check_up_prepare() and cpu_set_state_online() without defeating 452 + * the next cpu_wait_death()'s polling loop. 453 + */ 454 + bool cpu_report_death(void) 455 + { 456 + int oldstate; 457 + int newstate; 458 + int cpu = smp_processor_id(); 459 + 460 + do { 461 + oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu)); 462 + if (oldstate != CPU_BROKEN) 463 + newstate = CPU_DEAD; 464 + else 465 + newstate = CPU_DEAD_FROZEN; 466 + } while (atomic_cmpxchg(&per_cpu(cpu_hotplug_state, cpu), 467 + oldstate, newstate) != oldstate); 468 + return newstate == CPU_DEAD; 469 + } 470 + 471 + #endif /* #ifdef CONFIG_HOTPLUG_CPU */
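The new `cpu_wait_death()` above polls the outgoing CPU's state with a sleep that starts at one jiffy and grows by roughly 10% per attempt via `DIV_ROUND_UP(sleep_jf * 11, 10)`, so a dawdling CPU is checked often at first and progressively less aggressively later. A small sketch of that backoff schedule in isolation (`demo_poll_count()` and the `DEMO_` macro are illustrative, not kernel APIs):

```c
#include <assert.h>

#define DEMO_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Count how many polls fit in a timeout budget of jiffies_left. */
static int demo_poll_count(int jiffies_left)
{
	int sleep_jf = 1;
	int polls = 0;

	while (jiffies_left > 0) {
		jiffies_left -= sleep_jf;
		polls++;
		sleep_jf = DEMO_DIV_ROUND_UP(sleep_jf * 11, 10);
	}
	return polls;
}
```

Because the sleep interval keeps growing, the total number of wakeups over a multi-second timeout stays far below what a fixed one-jiffy poll would cost, while the early iterations still notice a promptly dying CPU within a jiffy or two.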
+25 -10
lib/Kconfig.debug
··· 1180 1180 menu "RCU Debugging" 1181 1181 1182 1182 config PROVE_RCU 1183 - bool "RCU debugging: prove RCU correctness" 1184 - depends on PROVE_LOCKING 1185 - default n 1186 - help 1187 - This feature enables lockdep extensions that check for correct 1188 - use of RCU APIs. This is currently under development. Say Y 1189 - if you want to debug RCU usage or help work on the PROVE_RCU 1190 - feature. 1191 - 1192 - Say N if you are unsure. 1183 + def_bool PROVE_LOCKING 1193 1184 1194 1185 config PROVE_RCU_REPEATEDLY 1195 1186 bool "RCU debugging: don't disable PROVE_RCU on first splat" ··· 1247 1256 boot (you probably don't). 1248 1257 Say N here if you want the RCU torture tests to start only 1249 1258 after being manually enabled via /proc. 1259 + 1260 + config RCU_TORTURE_TEST_SLOW_INIT 1261 + bool "Slow down RCU grace-period initialization to expose races" 1262 + depends on RCU_TORTURE_TEST 1263 + help 1264 + This option makes grace-period initialization block for a 1265 + few jiffies between initializing each pair of consecutive 1266 + rcu_node structures. This helps to expose races involving 1267 + grace-period initialization, in other words, it makes your 1268 + kernel less stable. It can also greatly increase grace-period 1269 + latency, especially on systems with large numbers of CPUs. 1270 + This is useful when torture-testing RCU, but in almost no 1271 + other circumstance. 1272 + 1273 + Say Y here if you want your system to crash and hang more often. 1274 + Say N if you want a sane system. 1275 + 1276 + config RCU_TORTURE_TEST_SLOW_INIT_DELAY 1277 + int "How much to slow down RCU grace-period initialization" 1278 + range 0 5 1279 + default 3 1280 + help 1281 + This option specifies the number of jiffies to wait between 1282 + each rcu_node structure initialization. 1250 1283 1251 1284 config RCU_CPU_STALL_TIMEOUT 1252 1285 int "RCU CPU stall timeout in seconds"
+1 -1
tools/testing/selftests/rcutorture/bin/kvm.sh
··· 310 310 cfr[jn] = cf[j] "." cfrep[cf[j]]; 311 311 } 312 312 if (cpusr[jn] > ncpus && ncpus != 0) 313 - ovf = "(!)"; 313 + ovf = "-ovf"; 314 314 else 315 315 ovf = ""; 316 316 print "echo ", cfr[jn], cpusr[jn] ovf ": Starting build. `date`";
+1
tools/testing/selftests/rcutorture/configs/rcu/CFcommon
··· 1 1 CONFIG_RCU_TORTURE_TEST=y 2 2 CONFIG_PRINTK_TIME=y 3 + CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y