 	RCU rather than SRCU, because RCU is almost always faster and
 	easier to use than is SRCU.
 
+	If you need to enter your read-side critical section in a
+	hardirq or exception handler, and then exit that same read-side
+	critical section in the task that was interrupted, then you need
+	to use srcu_read_lock_raw() and srcu_read_unlock_raw(), which
+	avoid the lockdep checking that would otherwise make this
+	practice illegal.
+
 	Also unlike other forms of RCU, explicit initialization
 	and cleanup is required via init_srcu_struct() and
 	cleanup_srcu_struct().  These are passed a "struct srcu_struct"
Documentation/RCU/rcu.txt (+5, -5)
 
 	Preemptible variants of RCU (CONFIG_TREE_PREEMPT_RCU) get the
 	same effect, but require that the readers manipulate CPU-local
-	counters.  These counters allow limited types of blocking
-	within RCU read-side critical sections.  SRCU also uses
-	CPU-local counters, and permits general blocking within
-	RCU read-side critical sections.  These two variants of
-	RCU detect grace periods by sampling these counters.
+	counters.  These counters allow limited types of blocking within
+	RCU read-side critical sections.  SRCU also uses CPU-local
+	counters, and permits general blocking within RCU read-side
+	critical sections.  These variants of RCU detect grace periods
+	by sampling these counters.
 
 o	If I am running on a uniprocessor kernel, which can only do one
 	thing at a time, why should I wait for a grace period?
Documentation/RCU/stallwarn.txt (+10, -6)
 	CONFIG_TREE_PREEMPT_RCU case, you might see stall-warning
 	messages.
 
+o	A hardware or software issue shuts off the scheduler-clock
+	interrupt on a CPU that is not in dyntick-idle mode.  This
+	problem really has happened, and seems to be most likely to
+	result in RCU CPU stall warnings for CONFIG_NO_HZ=n kernels.
+
 o	A bug in the RCU implementation.
 
 o	A hardware failure.  This is quite unlikely, but has occurred
···
 	This resulted in a series of RCU CPU stall warnings, eventually
 	leading to the realization that the CPU had failed.
 
-The RCU, RCU-sched, and RCU-bh implementations have CPU stall
-warnings.  SRCU does not have its own CPU stall warnings, but its
-calls to synchronize_sched() will result in RCU-sched detecting
-RCU-sched-related CPU stalls.  Please note that RCU only detects
-CPU stalls when there is a grace period in progress.  No grace period,
-no CPU stall warnings.
+The RCU, RCU-sched, and RCU-bh implementations have CPU stall warnings.
+SRCU does not have its own CPU stall warnings, but its calls to
+synchronize_sched() will result in RCU-sched detecting RCU-sched-related
+CPU stalls.  Please note that RCU only detects CPU stalls when there is
+a grace period in progress.  No grace period, no CPU stall warnings.
 
 To diagnose the cause of the stall, inspect the stack traces.
 The offending function will usually be near the top of the stack.
Documentation/RCU/torture.txt (+13)
 	To properly exercise RCU implementations with preemptible
 	read-side critical sections.
 
+onoff_interval
+		The number of seconds between each attempt to execute a
+		randomly selected CPU-hotplug operation.  Defaults to
+		zero, which disables CPU hotplugging.  In HOTPLUG_CPU=n
+		kernels, rcutorture will silently refuse to do any
+		CPU-hotplug operations regardless of what value is
+		specified for onoff_interval.
+
 shuffle_interval
 		The number of seconds to keep the test threads affinitied
 		to a particular subset of the CPUs, defaults to 3 seconds.
 		Used in conjunction with test_no_idle_hz.
+
+shutdown_secs	The number of seconds to run the test before terminating
+		the test and powering off the system.  The default is
+		zero, which disables test termination and system shutdown.
+		This capability is useful for automated testing.
 
 stat_interval	The number of seconds between output of torture
 		statistics (via printk()).  Regardless of the interval,
Documentation/RCU/trace.txt (-4)
 	or one greater than the interrupt-nesting depth otherwise.
 	The number after the second "/" is the NMI nesting depth.
 
-	This field is displayed only for CONFIG_NO_HZ kernels.
-
 o	"df" is the number of times that some other CPU has forced a
 	quiescent state on behalf of this CPU due to this CPU being in
 	dynticks-idle state.
-
-	This field is displayed only for CONFIG_NO_HZ kernels.
 
 o	"of" is the number of times that some other CPU has forced a
 	quiescent state on behalf of this CPU due to this CPU being
Documentation/RCU/whatisRCU.txt (+14, -5)
 1.	What is RCU, Fundamentally?  http://lwn.net/Articles/262464/
 2.	What is RCU? Part 2: Usage   http://lwn.net/Articles/263130/
 3.	RCU part 3: the RCU API      http://lwn.net/Articles/264090/
+4.	The RCU API, 2010 Edition    http://lwn.net/Articles/418853/
 
 
 What is RCU?
···
 
 	srcu_read_lock		synchronize_srcu	N/A
 	srcu_read_unlock	synchronize_srcu_expedited
+	srcu_read_lock_raw
+	srcu_read_unlock_raw
 	srcu_dereference
 
 SRCU:	Initialization/cleanup
···
 
 a.	Will readers need to block?  If so, you need SRCU.
 
-b.	What about the -rt patchset?  If readers would need to block
+b.	Is it necessary to start a read-side critical section in a
+	hardirq handler or exception handler, and then to complete
+	this read-side critical section in the task that was
+	interrupted?  If so, you need SRCU's srcu_read_lock_raw() and
+	srcu_read_unlock_raw() primitives.
+
+c.	What about the -rt patchset?  If readers would need to block
 	in a non-rt kernel, you need SRCU.  If readers would block
 	in a -rt kernel, but not in a non-rt kernel, SRCU is not
 	necessary.
 
-c.	Do you need to treat NMI handlers, hardirq handlers,
+d.	Do you need to treat NMI handlers, hardirq handlers,
 	and code segments with preemption disabled (whether
 	via preempt_disable(), local_irq_save(), local_bh_disable(),
 	or some other mechanism) as if they were explicit RCU readers?
 	If so, you need RCU-sched.
 
-d.	Do you need RCU grace periods to complete even in the face
+e.	Do you need RCU grace periods to complete even in the face
 	of softirq monopolization of one or more of the CPUs?  For
 	example, is your code subject to network-based denial-of-service
 	attacks?  If so, you need RCU-bh.
 
-e.	Is your workload too update-intensive for normal use of
+f.	Is your workload too update-intensive for normal use of
 	RCU, but inappropriate for other synchronization mechanisms?
 	If so, consider SLAB_DESTROY_BY_RCU.  But please be careful!
 
-f.	Otherwise, use RCU.
+g.	Otherwise, use RCU.
 
 Of course, this all assumes that you have determined that RCU is in fact
 the right tool for your job.
Documentation/atomic_ops.txt (+87)
 
 *** YOU HAVE BEEN WARNED! ***
 
+Properly aligned pointers, longs, ints, and chars (and unsigned
+equivalents) may be atomically loaded from and stored to in the same
+sense as described for atomic_read() and atomic_set().  The ACCESS_ONCE()
+macro should be used to prevent the compiler from using optimizations
+that might otherwise optimize accesses out of existence on the one hand,
+or that might create unsolicited accesses on the other.
+
+For example, consider the following code:
+
+	while (a > 0)
+		do_something();
+
+If the compiler can prove that do_something() does not store to the
+variable a, then the compiler is within its rights to transform this to
+the following:
+
+	tmp = a;
+	if (tmp > 0)
+		for (;;)
+			do_something();
+
+If you don't want the compiler to do this (and you probably don't), then
+you should use something like the following:
+
+	while (ACCESS_ONCE(a) > 0)
+		do_something();
+
+Alternatively, you could place a barrier() call in the loop.
+
+For another example, consider the following code:
+
+	tmp_a = a;
+	do_something_with(tmp_a);
+	do_something_else_with(tmp_a);
+
+If the compiler can prove that do_something_with() does not store to the
+variable a, then the compiler is within its rights to manufacture an
+additional load as follows:
+
+	tmp_a = a;
+	do_something_with(tmp_a);
+	tmp_a = a;
+	do_something_else_with(tmp_a);
+
+This could fatally confuse your code if it expected the same value
+to be passed to do_something_with() and do_something_else_with().
+
+The compiler would be likely to manufacture this additional load if
+do_something_with() was an inline function that made very heavy use
+of registers: reloading from variable a could save a flush to the
+stack and later reload.  To prevent the compiler from attacking your
+code in this manner, write the following:
+
+	tmp_a = ACCESS_ONCE(a);
+	do_something_with(tmp_a);
+	do_something_else_with(tmp_a);
+
+For a final example, consider the following code, assuming that the
+variable a is set at boot time before the second CPU is brought online
+and never changed later, so that memory barriers are not needed:
+
+	if (a)
+		b = 9;
+	else
+		b = 42;
+
+The compiler is within its rights to manufacture an additional store
+by transforming the above code into the following:
+
+	b = 42;
+	if (a)
+		b = 9;
+
+This could come as a fatal surprise to other code running concurrently
+that expected b to never have the value 42 if a was zero.  To prevent
+the compiler from doing this, write something like:
+
+	if (a)
+		ACCESS_ONCE(b) = 9;
+	else
+		ACCESS_ONCE(b) = 42;
+
+Don't even -think- about doing this without proper use of memory barriers,
+locks, or atomic operations if variable a can change at runtime!
+
+*** WARNING: ACCESS_ONCE() DOES NOT IMPLY A BARRIER! ***
+
 Now, we move onto the atomic operation interfaces typically implemented with
 the help of assembly code.
Documentation/lockdep-design.txt (+63)
 table, which hash-table can be checked in a lockfree manner.  If the
 locking chain occurs again later on, the hash table tells us that we
 don't have to validate the chain again.
+
+Troubleshooting:
+----------------
+
+The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes.
+Exceeding this number will trigger the following lockdep warning:
+
+	(DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS))
+
+By default, MAX_LOCKDEP_KEYS is currently set to 8191, and typical
+desktop systems have less than 1,000 lock classes, so this warning
+normally results from lock-class leakage or failure to properly
+initialize locks.  These two problems are illustrated below:
+
+1.	Repeated module loading and unloading while running the validator
+	will result in lock-class leakage.  The issue here is that each
+	load of the module will create a new set of lock classes for
+	that module's locks, but module unloading does not remove old
+	classes (see below discussion of reuse of lock classes for why).
+	Therefore, if that module is loaded and unloaded repeatedly,
+	the number of lock classes will eventually reach the maximum.
+
+2.	Using structures such as arrays that have large numbers of
+	locks that are not explicitly initialized.  For example,
+	a hash table with 8192 buckets where each bucket has its own
+	spinlock_t will consume 8192 lock classes -unless- each spinlock
+	is explicitly initialized at runtime, for example, using the
+	run-time spin_lock_init() as opposed to compile-time initializers
+	such as __SPIN_LOCK_UNLOCKED().  Failure to properly initialize
+	the per-bucket spinlocks would guarantee lock-class overflow.
+	In contrast, a loop that called spin_lock_init() on each lock
+	would place all 8192 locks into a single lock class.
+
+	The moral of this story is that you should always explicitly
+	initialize your locks.
+
+One might argue that the validator should be modified to allow
+lock classes to be reused.  However, if you are tempted to make this
+argument, first review the code and think through the changes that would
+be required, keeping in mind that the lock classes to be removed are
+likely to be linked into the lock-dependency graph.  This turns out to
+be harder to do than to say.
+
+Of course, if you do run out of lock classes, the next thing to do is
+to find the offending lock classes.  First, the following command gives
+you the number of lock classes currently in use along with the maximum:
+
+	grep "lock-classes" /proc/lockdep_stats
+
+This command produces the following output on a modest system:
+
+	lock-classes:                          748 [max: 8191]
+
+If the number allocated (748 above) increases continually over time,
+then there is likely a leak.  The following command can be used to
+identify the leaking lock classes:
+
+	grep "BD" /proc/lockdep
+
+Run the command and save the output, then compare against the output from
+a later run of this command to identify the leakers.  This same output
+can also help you find situations where runtime lock initialization has
+been omitted.
arch/arm/kernel/process.c (+4, -2)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		leds_event(led_idle_start);
 		while (!need_resched()) {
 #ifdef CONFIG_HOTPLUG_CPU
···
 			}
 		}
 		leds_event(led_idle_end);
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/avr32/kernel/process.c (+4, -2)
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched())
 			cpu_idle_sleep();
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/blackfin/kernel/process.c (+4, -2)
 #endif
 		if (!idle)
 			idle = default_idle;
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched())
 			idle();
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/microblaze/kernel/process.c (+4, -2)
 		if (!idle)
 			idle = default_idle;
 
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched())
 			idle();
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 
 		preempt_enable_no_resched();
 		schedule();
arch/mips/kernel/process.c (+4, -2)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched() && cpu_online(cpu)) {
 #ifdef CONFIG_MIPS_MT_SMTC
 			extern void smtc_idle_loop_hook(void);
···
 		     system_state == SYSTEM_BOOTING))
 			play_dead();
 #endif
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/openrisc/kernel/idle.c (+4, -2)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 
 		while (!need_resched()) {
 			check_pgt_cache();
···
 			set_thread_flag(TIF_POLLING_NRFLAG);
 		}
 
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/powerpc/kernel/idle.c (+13, -2)
 }
 __setup("powersave=off", powersave_off);
 
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_TRACEPOINTS)
+static const bool idle_uses_rcu = 1;
+#else
+static const bool idle_uses_rcu;
+#endif
+
 /*
  * The body of the idle task.
  */
···
 	set_thread_flag(TIF_POLLING_NRFLAG);
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		if (!idle_uses_rcu)
+			rcu_idle_enter();
+
 		while (!need_resched() && !cpu_should_die()) {
 			ppc64_runlatch_off();
···
 		HMT_medium();
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		if (!idle_uses_rcu)
+			rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		if (cpu_should_die())
 			cpu_die();
arch/powerpc/platforms/iseries/setup.c (+8, -4)
 static void iseries_shared_idle(void)
 {
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched() && !hvlpevent_is_pending()) {
 			local_irq_disable();
 			ppc64_runlatch_off();
···
 		}
 
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 
 		if (hvlpevent_is_pending())
 			process_iSeries_events();
···
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		if (!need_resched()) {
 			while (!need_resched()) {
 				ppc64_runlatch_off();
···
 		}
 
 		ppc64_runlatch_on();
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
 void cpu_idle(void)
 {
 	for (;;) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched())
 			default_idle();
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/sh/kernel/idle.c (+4, -2)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 
 		while (!need_resched()) {
 			check_pgt_cache();
···
 			start_critical_timings();
 		}
 
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched()) {
 			if (cpu_is_offline(cpu))
 				BUG();	/* no HOTPLUG_CPU */
···
 			local_irq_enable();
 			current_thread_info()->status |= TS_POLLING;
 		}
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/tile/mm/fault.c (+2, -2)
 	if (unlikely(tsk->pid < 2)) {
 		panic("Signal %d (code %d) at %#lx sent to %s!",
 		      si_signo, si_code & 0xffff, address,
-		      tsk->pid ? "init" : "the idle task");
+		      is_idle_task(tsk) ? "the idle task" : "init");
 	}
 
 	info.si_signo = si_signo;
···
 
 	if (unlikely(tsk->pid < 2)) {
 		panic("Kernel page fault running %s!",
-		      tsk->pid ? "init" : "the idle task");
+		      is_idle_task(tsk) ? "the idle task" : "init");
 	}
 
 	/*
 {
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched()) {
 			local_irq_disable();
 			stop_critical_timings();
···
 			local_irq_enable();
 			start_critical_timings();
 		}
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/x86/kernel/apic/apic.c (+3, -3)
 	 * Besides, if we don't timer interrupts ignore the global
 	 * interrupt lock, which is the WrongThing (tm) to do.
 	 */
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	local_apic_timer_interrupt();
 	irq_exit();
···
 {
 	u32 v;
 
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	/*
 	 * Check if this really is a spurious interrupt and ACK it
 	 * if it is a vectored one.  Just in case...
···
 		"Illegal register address",	/* APIC Error Bit 7 */
 	};
 
-	exit_idle();
 	irq_enter();
+	exit_idle();
 	/* First tickle the hardware, only then report what went on. -- REW */
 	v0 = apic_read(APIC_ESR);
 	apic_write(APIC_ESR, 0);
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
+		rcu_idle_enter();
 		while (!need_resched()) {
 
 			check_pgt_cache();
···
 				pm_idle();
 			start_critical_timings();
 		}
-		tick_nohz_restart_sched_tick();
+		rcu_idle_exit();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
arch/x86/kernel/process_64.c (+8, -2)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		tick_nohz_stop_sched_tick(1);
+		tick_nohz_idle_enter();
 		while (!need_resched()) {
 
 			rmb();
···
 			enter_idle();
 			/* Don't trace irqs off for idle */
 			stop_critical_timings();
+
+			/* enter_idle() needs rcu for notifiers */
+			rcu_idle_enter();
+
 			if (cpuidle_idle_call())
 				pm_idle();
+
+			rcu_idle_exit();
 			start_critical_timings();
 
 			/* In many cases the interrupt that ended idle
···
 			__exit_idle();
 		}
 
-		tick_nohz_restart_sched_tick();
+		tick_nohz_idle_exit();
 		preempt_enable_no_resched();
 		schedule();
 		preempt_disable();
drivers/base/cpu.c (+7)
 }
 EXPORT_SYMBOL_GPL(get_cpu_sysdev);
 
+bool cpu_is_hotpluggable(unsigned cpu)
+{
+	struct sys_device *dev = get_cpu_sysdev(cpu);
+	return dev && container_of(dev, struct cpu, sysdev)->hotpluggable;
+}
+EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);
+
 int __init cpu_dev_init(void)
 {
 	int err;
include/linux/cpu.h (+1)
 
 extern int register_cpu(struct cpu *cpu, int num);
 extern struct sys_device *get_cpu_sysdev(unsigned cpu);
+extern bool cpu_is_hotpluggable(unsigned cpu);
 
 extern int cpu_add_sysdev_attr(struct sysdev_attribute *attr);
 extern void cpu_remove_sysdev_attr(struct sysdev_attribute *attr);
include/linux/hardirq.h (-21)
 extern void account_system_vtime(struct task_struct *tsk);
 #endif
 
-#if defined(CONFIG_NO_HZ)
 #if defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
-extern void rcu_enter_nohz(void);
-extern void rcu_exit_nohz(void);
-
-static inline void rcu_irq_enter(void)
-{
-	rcu_exit_nohz();
-}
-
-static inline void rcu_irq_exit(void)
-{
-	rcu_enter_nohz();
-}
 
 static inline void rcu_nmi_enter(void)
 {
···
 }
 
 #else
-extern void rcu_irq_enter(void);
-extern void rcu_irq_exit(void);
 extern void rcu_nmi_enter(void);
 extern void rcu_nmi_exit(void);
 #endif
-#else
-# define rcu_irq_enter() do { } while (0)
-# define rcu_irq_exit() do { } while (0)
-# define rcu_nmi_enter() do { } while (0)
-# define rcu_nmi_exit() do { } while (0)
-#endif /* #if defined(CONFIG_NO_HZ) */
 
 /*
  * It is safe to do non-atomic ops on ->hardirq_context,
include/linux/rcupdate.h (+73, -42)
 #if defined(CONFIG_TREE_RCU) || defined(CONFIG_TREE_PREEMPT_RCU)
 extern void rcutorture_record_test_transition(void);
 extern void rcutorture_record_progress(unsigned long vernum);
+extern void do_trace_rcu_torture_read(char *rcutorturename,
+				      struct rcu_head *rhp);
 #else
 static inline void rcutorture_record_test_transition(void)
 {
···
 static inline void rcutorture_record_progress(unsigned long vernum)
 {
 }
+#ifdef CONFIG_RCU_TRACE
+extern void do_trace_rcu_torture_read(char *rcutorturename,
+				      struct rcu_head *rhp);
+#else
+#define do_trace_rcu_torture_read(rcutorturename, rhp) do { } while (0)
+#endif
 #endif
 
 #define UINT_CMP_GE(a, b)	(UINT_MAX / 2 >= (a) - (b))
···
 extern void rcu_bh_qs(int cpu);
 extern void rcu_check_callbacks(int cpu, int user);
 struct notifier_block;
-
-#ifdef CONFIG_NO_HZ
-
-extern void rcu_enter_nohz(void);
-extern void rcu_exit_nohz(void);
-
-#else /* #ifdef CONFIG_NO_HZ */
-
-static inline void rcu_enter_nohz(void)
-{
-}
-
-static inline void rcu_exit_nohz(void)
-{
-}
-
-#endif /* #else #ifdef CONFIG_NO_HZ */
+extern void rcu_idle_enter(void);
+extern void rcu_idle_exit(void);
+extern void rcu_irq_enter(void);
+extern void rcu_irq_exit(void);
 
 /*
  * Infrastructure to implement the synchronize_() primitives in
···
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
+#ifdef CONFIG_PROVE_RCU
+extern int rcu_is_cpu_idle(void);
+#else /* !CONFIG_PROVE_RCU */
+static inline int rcu_is_cpu_idle(void)
+{
+	return 0;
+}
+#endif /* else !CONFIG_PROVE_RCU */
+
+static inline void rcu_lock_acquire(struct lockdep_map *map)
+{
+	WARN_ON_ONCE(rcu_is_cpu_idle());
+	lock_acquire(map, 0, 0, 2, 1, NULL, _THIS_IP_);
+}
+
+static inline void rcu_lock_release(struct lockdep_map *map)
+{
+	WARN_ON_ONCE(rcu_is_cpu_idle());
+	lock_release(map, 1, _THIS_IP_);
+}
+
 extern struct lockdep_map rcu_lock_map;
-# define rcu_read_acquire() \
-		lock_acquire(&rcu_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
-# define rcu_read_release()	lock_release(&rcu_lock_map, 1, _THIS_IP_)
-
 extern struct lockdep_map rcu_bh_lock_map;
-# define rcu_read_acquire_bh() \
-		lock_acquire(&rcu_bh_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
-# define rcu_read_release_bh()	lock_release(&rcu_bh_lock_map, 1, _THIS_IP_)
-
 extern struct lockdep_map rcu_sched_lock_map;
-# define rcu_read_acquire_sched() \
-		lock_acquire(&rcu_sched_lock_map, 0, 0, 2, 1, NULL, _THIS_IP_)
-# define rcu_read_release_sched() \
-		lock_release(&rcu_sched_lock_map, 1, _THIS_IP_)
-
 extern int debug_lockdep_rcu_enabled(void);
 
 /**
···
  *
  * Checks debug_lockdep_rcu_enabled() to prevent false positives during boot
  * and while lockdep is disabled.
+ *
+ * Note that rcu_read_lock() and the matching rcu_read_unlock() must
+ * occur in the same context, for example, it is illegal to invoke
+ * rcu_read_unlock() in process context if the matching rcu_read_lock()
+ * was invoked from within an irq handler.
  */
 static inline int rcu_read_lock_held(void)
 {
 	if (!debug_lockdep_rcu_enabled())
 		return 1;
+	if (rcu_is_cpu_idle())
+		return 0;
 	return lock_is_held(&rcu_lock_map);
 }
···
  *
  * Check debug_lockdep_rcu_enabled() to prevent false positives during boot
  * and while lockdep is disabled.
+ *
+ * Note that if the CPU is in the idle loop from an RCU point of
+ * view (ie: that we are in the section between rcu_idle_enter() and
+ * rcu_idle_exit()) then rcu_read_lock_held() returns false even if the CPU
+ * did an rcu_read_lock().  The reason for this is that RCU ignores CPUs
+ * that are in such a section, considering these as in extended quiescent
+ * state, so such a CPU is effectively never in an RCU read-side critical
+ * section regardless of what RCU primitives it invokes.  This state of
+ * affairs is required --- we need to keep an RCU-free window in idle
+ * where the CPU may possibly enter into low power mode.  This way we can
+ * notice an extended quiescent state to other CPUs that started a grace
+ * period.  Otherwise we would delay any grace period as long as we run in
+ * the idle task.
  */
 #ifdef CONFIG_PREEMPT_COUNT
 static inline int rcu_read_lock_sched_held(void)
···
 
 	if (!debug_lockdep_rcu_enabled())
 		return 1;
+	if (rcu_is_cpu_idle())
+		return 0;
 	if (debug_locks)
 		lockdep_opinion = lock_is_held(&rcu_sched_lock_map);
 	return lockdep_opinion || preempt_count() != 0 || irqs_disabled();
···
 
 #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
-# define rcu_read_acquire()		do { } while (0)
-# define rcu_read_release()		do { } while (0)
-# define rcu_read_acquire_bh()		do { } while (0)
-# define rcu_read_release_bh()		do { } while (0)
-# define rcu_read_acquire_sched()	do { } while (0)
-# define rcu_read_release_sched()	do { } while (0)
+# define rcu_lock_acquire(a)		do { } while (0)
+# define rcu_lock_release(a)		do { } while (0)
 
 static inline int rcu_read_lock_held(void)
 {
···
 {
 	__rcu_read_lock();
 	__acquire(RCU);
-	rcu_read_acquire();
+	rcu_lock_acquire(&rcu_lock_map);
 }
 
 /*
···
  */
 static inline void rcu_read_unlock(void)
 {
-	rcu_read_release();
+	rcu_lock_release(&rcu_lock_map);
 	__release(RCU);
 	__rcu_read_unlock();
 }
···
  * critical sections in interrupt context can use just rcu_read_lock(),
  * though this should at least be commented to avoid confusing people
  * reading the code.
+ *
+ * Note that rcu_read_lock_bh() and the matching rcu_read_unlock_bh()
+ * must occur in the same context, for example, it is illegal to invoke
+ * rcu_read_unlock_bh() from one task if the matching rcu_read_lock_bh()
+ * was invoked from some other task.
  */
 static inline void rcu_read_lock_bh(void)
 {
 	local_bh_disable();
 	__acquire(RCU_BH);
-	rcu_read_acquire_bh();
+	rcu_lock_acquire(&rcu_bh_lock_map);
 }
 
 /*
···
  */
 static inline void rcu_read_unlock_bh(void)
 {
-	rcu_read_release_bh();
+	rcu_lock_release(&rcu_bh_lock_map);
 	__release(RCU_BH);
 	local_bh_enable();
 }
···
  * are being done using call_rcu_sched() or synchronize_rcu_sched().
  * Read-side critical sections can also be introduced by anything that
  * disables preemption, including local_irq_disable() and friends.
+ *
+ * Note that rcu_read_lock_sched() and the matching rcu_read_unlock_sched()
+ * must occur in the same context, for example, it is illegal to invoke
+ * rcu_read_unlock_sched() from process context if the matching
+ * rcu_read_lock_sched() was invoked from an NMI handler.
  */
 static inline void rcu_read_lock_sched(void)
 {
 	preempt_disable();
 	__acquire(RCU_SCHED);
-	rcu_read_acquire_sched();
+	rcu_lock_acquire(&rcu_sched_lock_map);
 }
 
 /* Used by lockdep and tracing: cannot be traced, cannot call lockdep. */
···
  */
 static inline void rcu_read_unlock_sched(void)
 {
-	rcu_read_release_sched();
+	rcu_lock_release(&rcu_sched_lock_map);
 	__release(RCU_SCHED);
 	preempt_enable();
 }
+8
include/linux/sched.h
···20702070extern int sched_setscheduler_nocheck(struct task_struct *, int,20712071 const struct sched_param *);20722072extern struct task_struct *idle_task(int cpu);20732073+/**20742074+ * is_idle_task - is the specified task an idle task?20752075+ * @p: the task in question.20762076+ */20772077+static inline bool is_idle_task(struct task_struct *p)20782078+{20792079+ return p->pid == 0;20802080+}20732081extern struct task_struct *curr_task(int cpu);20742082extern void set_curr_task(int cpu, struct task_struct *p);20752083
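The new helper relies on the long-standing convention that the per-CPU idle ("swapper") tasks are the only tasks with PID 0. A minimal userspace sketch with a stubbed-down task_struct (only the one field the helper inspects):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub: the real struct task_struct has many fields; only pid matters here. */
struct task_struct { int pid; };

/* Idle ("swapper") tasks are the only tasks with PID 0. */
static inline bool is_idle_task(struct task_struct *p)
{
	return p->pid == 0;
}
```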
+74-13
include/linux/srcu.h
···2828#define _LINUX_SRCU_H29293030#include <linux/mutex.h>3131+#include <linux/rcupdate.h>31323233struct srcu_struct_array {3334 int c[2];···6160 __init_srcu_struct((sp), #sp, &__srcu_key); \6261})63626464-# define srcu_read_acquire(sp) \6565- lock_acquire(&(sp)->dep_map, 0, 0, 2, 1, NULL, _THIS_IP_)6666-# define srcu_read_release(sp) \6767- lock_release(&(sp)->dep_map, 1, _THIS_IP_)6868-6963#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */70647165int init_srcu_struct(struct srcu_struct *sp);7272-7373-# define srcu_read_acquire(sp) do { } while (0)7474-# define srcu_read_release(sp) do { } while (0)75667667#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */7768···8390 * read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC,8491 * this assumes we are in an SRCU read-side critical section unless it can8592 * prove otherwise.9393+ *9494+ * Checks debug_lockdep_rcu_enabled() to prevent false positives during boot9595+ * and while lockdep is disabled.9696+ *9797+ * Note that if the CPU is in the idle loop from an RCU point of view9898+ * (ie: that we are in the section between rcu_idle_enter() and9999+ * rcu_idle_exit()) then srcu_read_lock_held() returns false even if100100+ * the CPU did an srcu_read_lock(). The reason for this is that RCU101101+ * ignores CPUs that are in such a section, considering these as in102102+ * extended quiescent state, so such a CPU is effectively never in an103103+ * RCU read-side critical section regardless of what RCU primitives it104104+ * invokes. This state of affairs is required --- we need to keep an105105+ * RCU-free window in idle where the CPU may possibly enter into low106106+ * power mode. This way we can notice an extended quiescent state to107107+ * other CPUs that started a grace period. 
Otherwise we would delay any108108+ * grace period as long as we run in the idle task.86109 */87110static inline int srcu_read_lock_held(struct srcu_struct *sp)88111{8989- if (debug_locks)9090- return lock_is_held(&sp->dep_map);9191- return 1;112112+ if (rcu_is_cpu_idle())113113+ return 0;114114+115115+ if (!debug_lockdep_rcu_enabled())116116+ return 1;117117+118118+ return lock_is_held(&sp->dep_map);92119}9312094121#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */···158145 * one way to indirectly wait on an SRCU grace period is to acquire159146 * a mutex that is held elsewhere while calling synchronize_srcu() or160147 * synchronize_srcu_expedited().148148+ *149149+ * Note that srcu_read_lock() and the matching srcu_read_unlock() must150150+ * occur in the same context, for example, it is illegal to invoke151151+ * srcu_read_unlock() in an irq handler if the matching srcu_read_lock()152152+ * was invoked in process context.161153 */162154static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)163155{164156 int retval = __srcu_read_lock(sp);165157166166- srcu_read_acquire(sp);158158+ rcu_lock_acquire(&(sp)->dep_map);167159 return retval;168160}169161···182164static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)183165 __releases(sp)184166{185185- srcu_read_release(sp);167167+ rcu_lock_release(&(sp)->dep_map);186168 __srcu_read_unlock(sp, idx);169169+}170170+171171+/**172172+ * srcu_read_lock_raw - register a new reader for an SRCU-protected structure.173173+ * @sp: srcu_struct in which to register the new reader.174174+ *175175+ * Enter an SRCU read-side critical section. Similar to srcu_read_lock(),176176+ * but avoids the RCU-lockdep checking. 
This means that it is legal to177177+ * use srcu_read_lock_raw() in one context, for example, in an exception178178+ * handler, and then have the matching srcu_read_unlock_raw() in another179179+ * context, for example in the task that took the exception.180180+ *181181+ * However, the entire SRCU read-side critical section must reside within a182182+ * single task. For example, beware of using srcu_read_lock_raw() in183183+ * a device interrupt handler and srcu_read_unlock_raw() in the interrupted184184+ * task: This will not work if interrupts are threaded.185185+ */186186+static inline int srcu_read_lock_raw(struct srcu_struct *sp)187187+{188188+ unsigned long flags;189189+ int ret;190190+191191+ local_irq_save(flags);192192+ ret = __srcu_read_lock(sp);193193+ local_irq_restore(flags);194194+ return ret;195195+}196196+197197+/**198198+ * srcu_read_unlock_raw - unregister reader from an SRCU-protected structure.199199+ * @sp: srcu_struct in which to unregister the old reader.200200+ * @idx: return value from corresponding srcu_read_lock_raw().201201+ *202202+ * Exit an SRCU read-side critical section without lockdep-RCU checking.203203+ * See srcu_read_lock_raw() for more details.204204+ */205205+static inline void srcu_read_unlock_raw(struct srcu_struct *sp, int idx)206206+{207207+ unsigned long flags;208208+209209+ local_irq_save(flags);210210+ __srcu_read_unlock(sp, idx);211211+ local_irq_restore(flags);187212}188213189214#endif
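The raw variants only disable interrupts around __srcu_read_lock()/__srcu_read_unlock() and skip the lockdep hooks, which is what lets the lock and unlock happen in different contexts of the same task. A rough single-CPU userspace sketch of the underlying two-counter scheme (an assumption about __srcu_read_lock() internals for illustration, not a copy of them): readers bump the counter selected by the low bit of the completed-grace-period count and must hand the same index back at unlock time:

```c
#include <assert.h>

/* Single-"CPU" sketch of SRCU's two per-generation reader counters. */
static int srcu_completed;	/* low bit selects the active counter */
static int srcu_c[2];

/* Readers bump the counter for the current grace-period generation... */
static int __srcu_read_lock(void)
{
	int idx = srcu_completed & 0x1;

	srcu_c[idx]++;
	return idx;
}

/* ...and must pass the same index back, even from a different context. */
static void __srcu_read_unlock(int idx)
{
	srcu_c[idx]--;
}
```

The returned index is why srcu_read_unlock_raw() needs the value from the matching srcu_read_lock_raw(): a grace period may have flipped the active counter in between.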
···241241242242/*243243 * Tracepoint for dyntick-idle entry/exit events. These take a string244244- * as argument: "Start" for entering dyntick-idle mode and "End" for245245- * leaving it.244244+ * as argument: "Start" for entering dyntick-idle mode, "End" for245245+ * leaving it, "--=" for events moving towards idle, and "++=" for events246246+ * moving away from idle. "Error on entry: not idle task" and "Error on247247+ * exit: not idle task" indicate that a non-idle task is erroneously248248+ * toying with the idle loop.249249+ *250250+ * These events also take a pair of numbers, which indicate the nesting251251+ * depth before and after the event of interest. Note that task-related252252+ * events use the upper bits of each number, while interrupt-related253253+ * events use the lower bits.246254 */247255TRACE_EVENT(rcu_dyntick,248256249249- TP_PROTO(char *polarity),257257+ TP_PROTO(char *polarity, long long oldnesting, long long newnesting),250258251251- TP_ARGS(polarity),259259+ TP_ARGS(polarity, oldnesting, newnesting),252260253261 TP_STRUCT__entry(254262 __field(char *, polarity)263263+ __field(long long, oldnesting)264264+ __field(long long, newnesting)255265 ),256266257267 TP_fast_assign(258268 __entry->polarity = polarity;269269+ __entry->oldnesting = oldnesting;270270+ __entry->newnesting = newnesting;259271 ),260272261261- TP_printk("%s", __entry->polarity)273273+ TP_printk("%s %llx %llx", __entry->polarity,274274+ __entry->oldnesting, __entry->newnesting)275275+);276276+277277+/*278278+ * Tracepoint for RCU preparation for idle, the goal being to get RCU279279+ * processing done so that the current CPU can shut off its scheduling280280+ * clock and enter dyntick-idle mode. One way to accomplish this is281281+ * to drain all RCU callbacks from this CPU, and the other is to have282282+ * done everything RCU requires for the current grace period. 
In this283283+ * latter case, the CPU will be awakened at the end of the current grace284284+ * period in order to process the remainder of its callbacks.285285+ *286286+ * These tracepoints take a string as argument:287287+ *288288+ * "No callbacks": Nothing to do, no callbacks on this CPU.289289+ * "In holdoff": Nothing to do, holding off after unsuccessful attempt.290290+ * "Begin holdoff": Attempt failed, don't retry until next jiffy.291291+ * "Dyntick with callbacks": Entering dyntick-idle despite callbacks.292292+ * "More callbacks": Still more callbacks, try again to clear them out.293293+ * "Callbacks drained": All callbacks processed, off to dyntick idle!294294+ * "Timer": Timer fired to cause CPU to continue processing callbacks.295295+ */296296+TRACE_EVENT(rcu_prep_idle,297297+298298+ TP_PROTO(char *reason),299299+300300+ TP_ARGS(reason),301301+302302+ TP_STRUCT__entry(303303+ __field(char *, reason)304304+ ),305305+306306+ TP_fast_assign(307307+ __entry->reason = reason;308308+ ),309309+310310+ TP_printk("%s", __entry->reason)262311);263312264313/*···461412462413/*463414 * Tracepoint for exiting rcu_do_batch after RCU callbacks have been464464- * invoked. The first argument is the name of the RCU flavor and465465- * the second argument is number of callbacks actually invoked.415415+ * invoked. 
The first argument is the name of the RCU flavor,416416+ * the second argument is number of callbacks actually invoked,417417+ * the third argument (cb) is whether or not any of the callbacks that418418+ * were ready to invoke at the beginning of this batch are still419419+ * queued, the fourth argument (nr) is the return value of need_resched(),420420+ * the fifth argument (iit) is 1 if the current task is the idle task,421421+ * and the sixth argument (risk) is the return value from422422+ * rcu_is_callbacks_kthread().466423 */467424TRACE_EVENT(rcu_batch_end,468425469469- TP_PROTO(char *rcuname, int callbacks_invoked),426426+ TP_PROTO(char *rcuname, int callbacks_invoked,427427+ bool cb, bool nr, bool iit, bool risk),470428471471- TP_ARGS(rcuname, callbacks_invoked),429429+ TP_ARGS(rcuname, callbacks_invoked, cb, nr, iit, risk),472430473431 TP_STRUCT__entry(474432 __field(char *, rcuname)475433 __field(int, callbacks_invoked)434434+ __field(bool, cb)435435+ __field(bool, nr)436436+ __field(bool, iit)437437+ __field(bool, risk)476438 ),477439478440 TP_fast_assign(479441 __entry->rcuname = rcuname;480442 __entry->callbacks_invoked = callbacks_invoked;443443+ __entry->cb = cb;444444+ __entry->nr = nr;445445+ __entry->iit = iit;446446+ __entry->risk = risk;481447 ),482448483483- TP_printk("%s CBs-invoked=%d",484484- __entry->rcuname, __entry->callbacks_invoked)449449+ TP_printk("%s CBs-invoked=%d idle=%c%c%c%c",450450+ __entry->rcuname, __entry->callbacks_invoked,451451+ __entry->cb ? 'C' : '.',452452+ __entry->nr ? 'S' : '.',453453+ __entry->iit ? 'I' : '.',454454+ __entry->risk ? 'R' : '.')455455+);456456+457457+/*458458+ * Tracepoint for rcutorture readers. 
The first argument is the name459459+ * of the RCU flavor from rcutorture's viewpoint and the second argument460460+ * is the callback address.461461+ */462462+TRACE_EVENT(rcu_torture_read,463463+464464+ TP_PROTO(char *rcutorturename, struct rcu_head *rhp),465465+466466+ TP_ARGS(rcutorturename, rhp),467467+468468+ TP_STRUCT__entry(469469+ __field(char *, rcutorturename)470470+ __field(struct rcu_head *, rhp)471471+ ),472472+473473+ TP_fast_assign(474474+ __entry->rcutorturename = rcutorturename;475475+ __entry->rhp = rhp;476476+ ),477477+478478+ TP_printk("%s torture read %p",479479+ __entry->rcutorturename, __entry->rhp)485480);486481487482#else /* #ifdef CONFIG_RCU_TRACE */···536443#define trace_rcu_unlock_preempted_task(rcuname, gpnum, pid) do { } while (0)537444#define trace_rcu_quiescent_state_report(rcuname, gpnum, mask, qsmask, level, grplo, grphi, gp_tasks) do { } while (0)538445#define trace_rcu_fqs(rcuname, gpnum, cpu, qsevent) do { } while (0)539539-#define trace_rcu_dyntick(polarity) do { } while (0)446446+#define trace_rcu_dyntick(polarity, oldnesting, newnesting) do { } while (0)447447+#define trace_rcu_prep_idle(reason) do { } while (0)540448#define trace_rcu_callback(rcuname, rhp, qlen) do { } while (0)541449#define trace_rcu_kfree_callback(rcuname, rhp, offset, qlen) do { } while (0)542450#define trace_rcu_batch_start(rcuname, qlen, blimit) do { } while (0)543451#define trace_rcu_invoke_callback(rcuname, rhp) do { } while (0)544452#define trace_rcu_invoke_kfree_callback(rcuname, rhp, offset) do { } while (0)545545-#define trace_rcu_batch_end(rcuname, callbacks_invoked) do { } while (0)453453+#define trace_rcu_batch_end(rcuname, callbacks_invoked, cb, nr, iit, risk) \454454+ do { } while (0)455455+#define trace_rcu_torture_read(rcutorturename, rhp) do { } while (0)546456547457#endif /* #else #ifdef CONFIG_RCU_TRACE */548458
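The extended rcu_batch_end tracepoint packs its four booleans into a fixed-position flag string: 'C', 'S', 'I', 'R' when set, '.' when clear. A small sketch of that formatting (the helper name is invented; the format mirrors the TP_printk() above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Format the four rcu_batch_end flags the way the tracepoint prints them:
 * one fixed position per flag, '.' when the flag is clear. */
static const char *batch_end_flags(bool cb, bool nr, bool iit, bool risk)
{
	static char buf[5];

	snprintf(buf, sizeof(buf), "%c%c%c%c",
		 cb   ? 'C' : '.',	/* callbacks still queued */
		 nr   ? 'S' : '.',	/* need_resched() was set */
		 iit  ? 'I' : '.',	/* running in the idle task */
		 risk ? 'R' : '.');	/* running in the callbacks kthread */
	return buf;
}
```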
+5-5
init/Kconfig
···469469470470config RCU_FAST_NO_HZ471471 bool "Accelerate last non-dyntick-idle CPU's grace periods"472472- depends on TREE_RCU && NO_HZ && SMP472472+ depends on NO_HZ && SMP473473 default n474474 help475475 This option causes RCU to attempt to accelerate grace periods476476- in order to allow the final CPU to enter dynticks-idle state477477- more quickly. On the other hand, this option increases the478478- overhead of the dynticks-idle checking, particularly on systems479479- with large numbers of CPUs.476476+ in order to allow CPUs to enter dynticks-idle state more477477+ quickly. On the other hand, this option increases the overhead478478+ of the dynticks-idle checking, particularly on systems with479479+ large numbers of CPUs.480480481481 Say Y if energy efficiency is critically important, particularly482482 if you have relatively few CPUs.
···636636 (p->exit_state & EXIT_ZOMBIE) ? 'Z' :637637 (p->exit_state & EXIT_DEAD) ? 'E' :638638 (p->state & TASK_INTERRUPTIBLE) ? 'S' : '?';639639- if (p->pid == 0) {639639+ if (is_idle_task(p)) {640640 /* Idle task. Is it really idle, apart from the kdb641641 * interrupt? */642642 if (!kdb_task_has_cpu(p) || kgdb_info[cpu].irq_depth == 1) {
+1-1
kernel/events/core.c
···53625362 regs = get_irq_regs();5363536353645364 if (regs && !perf_exclude_event(event, regs)) {53655365- if (!(event->attr.exclude_idle && current->pid == 0))53655365+ if (!(event->attr.exclude_idle && is_idle_task(current)))53665366 if (perf_event_overflow(event, &data, regs))53675367 ret = HRTIMER_NORESTART;53685368 }
+22
kernel/lockdep.c
···41704170 printk("%s:%d %s!\n", file, line, s);41714171 printk("\nother info that might help us debug this:\n\n");41724172 printk("\nrcu_scheduler_active = %d, debug_locks = %d\n", rcu_scheduler_active, debug_locks);41734173+41744174+ /*41754175+ * If a CPU is in the RCU-free window in idle (ie: in the section41764176+ * between rcu_idle_enter() and rcu_idle_exit()), then RCU41774177+ * considers that CPU to be in an "extended quiescent state",41784178+ * which means that RCU will be completely ignoring that CPU.41794179+ * Therefore, rcu_read_lock() and friends have absolutely no41804180+ * effect on a CPU running in that state. In other words, even if41814181+ * such an RCU-idle CPU has called rcu_read_lock(), RCU might well41824182+ * delete data structures out from under it. RCU really has no41834183+ * choice here: we need to keep an RCU-free window in idle where41844184+ * the CPU may possibly enter into low power mode. This way we can41854185+ * report an extended quiescent state to other CPUs that started a grace41864186+ * period. Otherwise we would delay any grace period as long as we run41874187+ * in the idle task.41884188+ *41894189+ * So complain bitterly if someone does call rcu_read_lock(),41904190+ * rcu_read_lock_bh() and so on from extended quiescent states.41914191+ */41924192+ if (rcu_is_cpu_idle())41934193+ printk("RCU used illegally from extended quiescent state!\n");41944194+41734195 lockdep_print_held_locks(curr);41744196 printk("\nstack backtrace:\n");41754197 dump_stack();
+7
kernel/rcu.h
···3030#endif /* #else #ifdef CONFIG_RCU_TRACE */31313232/*3333+ * Process-level increment to ->dynticks_nesting field. This allows for3434+ * architectures that use half-interrupts and half-exceptions from3535+ * process context.3636+ */3737+#define DYNTICK_TASK_NESTING (LLONG_MAX / 2 - 1)3838+3939+/*3340 * debug_rcu_head_queue()/debug_rcu_head_unqueue() are used internally3441 * by call_rcu() and rcu callback execution, and are therefore not part of the3542 * RCU API. Leaving in rcupdate.h because they are used by all RCU flavors.
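DYNTICK_TASK_NESTING is a huge process-level bias: interrupt entry/exit only adjusts the nesting count by one, so from a non-idle task the count can never reach zero, while rcu_idle_enter() drops it to zero directly. A userspace sketch of the counter discipline (simplified from the kernel/rcutiny.c hunks in this patch; tracing, warnings, and irq disabling omitted):

```c
#include <assert.h>
#include <limits.h>

/* Process-level bias from the patch: huge, so irq-level +/-1 adjustments
 * can never drive a non-idle task's nesting count down to zero. */
#define DYNTICK_TASK_NESTING (LLONG_MAX / 2 - 1)

static long long dynticks_nesting = DYNTICK_TASK_NESTING;

static void rcu_idle_enter(void) { dynticks_nesting = 0; }
static void rcu_idle_exit(void)  { dynticks_nesting = DYNTICK_TASK_NESTING; }
static void rcu_irq_enter(void)  { dynticks_nesting++; }
static void rcu_irq_exit(void)   { dynticks_nesting--; }

/* RCU treats the CPU as idle only when the count is exactly zero. */
static int rcu_is_cpu_idle(void)
{
	return !dynticks_nesting;
}
```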
+12
kernel/rcupdate.c
···9393{9494 if (!debug_lockdep_rcu_enabled())9595 return 1;9696+ if (rcu_is_cpu_idle())9797+ return 0;9698 return in_softirq() || irqs_disabled();9799}98100EXPORT_SYMBOL_GPL(rcu_read_lock_bh_held);···318316};319317EXPORT_SYMBOL_GPL(rcuhead_debug_descr);320318#endif /* #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD */319319+320320+#if defined(CONFIG_TREE_RCU) || defined(CONFIG_TREE_PREEMPT_RCU) || defined(CONFIG_RCU_TRACE)321321+void do_trace_rcu_torture_read(char *rcutorturename, struct rcu_head *rhp)322322+{323323+ trace_rcu_torture_read(rcutorturename, rhp);324324+}325325+EXPORT_SYMBOL_GPL(do_trace_rcu_torture_read);326326+#else327327+#define do_trace_rcu_torture_read(rcutorturename, rhp) do { } while (0)328328+#endif
+133-22
kernel/rcutiny.c
···53535454#include "rcutiny_plugin.h"55555656-#ifdef CONFIG_NO_HZ5656+static long long rcu_dynticks_nesting = DYNTICK_TASK_NESTING;57575858-static long rcu_dynticks_nesting = 1;5959-6060-/*6161- * Enter dynticks-idle mode, which is an extended quiescent state6262- * if we have fully entered that mode (i.e., if the new value of6363- * dynticks_nesting is zero).6464- */6565-void rcu_enter_nohz(void)5858+/* Common code for rcu_idle_enter() and rcu_irq_exit(), see kernel/rcutree.c. */5959+static void rcu_idle_enter_common(long long oldval)6660{6767- if (--rcu_dynticks_nesting == 0)6868- rcu_sched_qs(0); /* implies rcu_bh_qsctr_inc(0) */6161+ if (rcu_dynticks_nesting) {6262+ RCU_TRACE(trace_rcu_dyntick("--=",6363+ oldval, rcu_dynticks_nesting));6464+ return;6565+ }6666+ RCU_TRACE(trace_rcu_dyntick("Start", oldval, rcu_dynticks_nesting));6767+ if (!is_idle_task(current)) {6868+ struct task_struct *idle = idle_task(smp_processor_id());6969+7070+ RCU_TRACE(trace_rcu_dyntick("Error on entry: not idle task",7171+ oldval, rcu_dynticks_nesting));7272+ ftrace_dump(DUMP_ALL);7373+ WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",7474+ current->pid, current->comm,7575+ idle->pid, idle->comm); /* must be idle task! 
*/7676+ }7777+ rcu_sched_qs(0); /* implies rcu_bh_qsctr_inc(0) */6978}70797180/*7272- * Exit dynticks-idle mode, so that we are no longer in an extended7373- * quiescent state.8181+ * Enter idle, which is an extended quiescent state if we have fully8282+ * entered that mode (i.e., if the new value of dynticks_nesting is zero).7483 */7575-void rcu_exit_nohz(void)8484+void rcu_idle_enter(void)7685{8686+ unsigned long flags;8787+ long long oldval;8888+8989+ local_irq_save(flags);9090+ oldval = rcu_dynticks_nesting;9191+ rcu_dynticks_nesting = 0;9292+ rcu_idle_enter_common(oldval);9393+ local_irq_restore(flags);9494+}9595+9696+/*9797+ * Exit an interrupt handler towards idle.9898+ */9999+void rcu_irq_exit(void)100100+{101101+ unsigned long flags;102102+ long long oldval;103103+104104+ local_irq_save(flags);105105+ oldval = rcu_dynticks_nesting;106106+ rcu_dynticks_nesting--;107107+ WARN_ON_ONCE(rcu_dynticks_nesting < 0);108108+ rcu_idle_enter_common(oldval);109109+ local_irq_restore(flags);110110+}111111+112112+/* Common code for rcu_idle_exit() and rcu_irq_enter(), see kernel/rcutree.c. */113113+static void rcu_idle_exit_common(long long oldval)114114+{115115+ if (oldval) {116116+ RCU_TRACE(trace_rcu_dyntick("++=",117117+ oldval, rcu_dynticks_nesting));118118+ return;119119+ }120120+ RCU_TRACE(trace_rcu_dyntick("End", oldval, rcu_dynticks_nesting));121121+ if (!is_idle_task(current)) {122122+ struct task_struct *idle = idle_task(smp_processor_id());123123+124124+ RCU_TRACE(trace_rcu_dyntick("Error on exit: not idle task",125125+ oldval, rcu_dynticks_nesting));126126+ ftrace_dump(DUMP_ALL);127127+ WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",128128+ current->pid, current->comm,129129+ idle->pid, idle->comm); /* must be idle task! 
*/130130+ }131131+}132132+133133+/*134134+ * Exit idle, so that we are no longer in an extended quiescent state.135135+ */136136+void rcu_idle_exit(void)137137+{138138+ unsigned long flags;139139+ long long oldval;140140+141141+ local_irq_save(flags);142142+ oldval = rcu_dynticks_nesting;143143+ WARN_ON_ONCE(oldval != 0);144144+ rcu_dynticks_nesting = DYNTICK_TASK_NESTING;145145+ rcu_idle_exit_common(oldval);146146+ local_irq_restore(flags);147147+}148148+149149+/*150150+ * Enter an interrupt handler, moving away from idle.151151+ */152152+void rcu_irq_enter(void)153153+{154154+ unsigned long flags;155155+ long long oldval;156156+157157+ local_irq_save(flags);158158+ oldval = rcu_dynticks_nesting;77159 rcu_dynticks_nesting++;160160+ WARN_ON_ONCE(rcu_dynticks_nesting == 0);161161+ rcu_idle_exit_common(oldval);162162+ local_irq_restore(flags);78163}791648080-#endif /* #ifdef CONFIG_NO_HZ */165165+#ifdef CONFIG_PROVE_RCU166166+167167+/*168168+ * Test whether RCU thinks that the current CPU is idle.169169+ */170170+int rcu_is_cpu_idle(void)171171+{172172+ return !rcu_dynticks_nesting;173173+}174174+EXPORT_SYMBOL(rcu_is_cpu_idle);175175+176176+#endif /* #ifdef CONFIG_PROVE_RCU */177177+178178+/*179179+ * Test whether the current CPU was interrupted from idle. Nested180180+ * interrupts don't count, we must be running at the first interrupt181181+ * level.182182+ */183183+int rcu_is_cpu_rrupt_from_idle(void)184184+{185185+ return rcu_dynticks_nesting <= 0;186186+}8118782188/*83189 * Helper function for rcu_sched_qs() and rcu_bh_qs().···232126233127/*234128 * Check to see if the scheduling-clock interrupt came from an extended235235- * quiescent state, and, if so, tell RCU about it.129129+ * quiescent state, and, if so, tell RCU about it. This function must130130+ * be called from hardirq context. 
It is normally called from the131131+ * scheduling-clock interrupt.236132 */237133void rcu_check_callbacks(int cpu, int user)238134{239239- if (user ||240240- (idle_cpu(cpu) &&241241- !in_softirq() &&242242- hardirq_count() <= (1 << HARDIRQ_SHIFT)))135135+ if (user || rcu_is_cpu_rrupt_from_idle())243136 rcu_sched_qs(cpu);244137 else if (!in_softirq())245138 rcu_bh_qs(cpu);···259154 /* If no RCU callbacks ready to invoke, just return. */260155 if (&rcp->rcucblist == rcp->donetail) {261156 RCU_TRACE(trace_rcu_batch_start(rcp->name, 0, -1));262262- RCU_TRACE(trace_rcu_batch_end(rcp->name, 0));157157+ RCU_TRACE(trace_rcu_batch_end(rcp->name, 0,158158+ ACCESS_ONCE(rcp->rcucblist),159159+ need_resched(),160160+ is_idle_task(current),161161+ rcu_is_callbacks_kthread()));263162 return;264163 }265164···292183 RCU_TRACE(cb_count++);293184 }294185 RCU_TRACE(rcu_trace_sub_qlen(rcp, cb_count));295295- RCU_TRACE(trace_rcu_batch_end(rcp->name, cb_count));186186+ RCU_TRACE(trace_rcu_batch_end(rcp->name, cb_count, 0, need_resched(),187187+ is_idle_task(current),188188+ rcu_is_callbacks_kthread()));296189}297190298191static void rcu_process_callbacks(struct softirq_action *unused)
+27-2
kernel/rcutiny_plugin.h
···312312 rt_mutex_lock(&mtx);313313 rt_mutex_unlock(&mtx); /* Keep lockdep happy. */314314315315- return rcu_preempt_ctrlblk.boost_tasks != NULL ||316316- rcu_preempt_ctrlblk.exp_tasks != NULL;315315+ return ACCESS_ONCE(rcu_preempt_ctrlblk.boost_tasks) != NULL ||316316+ ACCESS_ONCE(rcu_preempt_ctrlblk.exp_tasks) != NULL;317317}318318319319/*···885885 wake_up(&rcu_kthread_wq);886886}887887888888+#ifdef CONFIG_RCU_TRACE889889+890890+/*891891+ * Is the current CPU running the RCU-callbacks kthread?892892+ * Caller must have preemption disabled.893893+ */894894+static bool rcu_is_callbacks_kthread(void)895895+{896896+ return rcu_kthread_task == current;897897+}898898+899899+#endif /* #ifdef CONFIG_RCU_TRACE */900900+888901/*889902 * This kthread invokes RCU callbacks whose grace periods have890903 * elapsed. It is awakened as needed, and takes the place of the···950937{951938 raise_softirq(RCU_SOFTIRQ);952939}940940+941941+#ifdef CONFIG_RCU_TRACE942942+943943+/*944944+ * There is no callback kthread, so this thread is never it.945945+ */946946+static bool rcu_is_callbacks_kthread(void)947947+{948948+ return false;949949+}950950+951951+#endif /* #ifdef CONFIG_RCU_TRACE */953952954953void rcu_init(void)955954{
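The boost-test hunk above wraps the racy ->boost_tasks/->exp_tasks reads in ACCESS_ONCE(). In kernels of this era that macro is simply a volatile cast, which forces the compiler to emit exactly one load and forbids refetching or tearing. A userspace rendition (variable names invented):

```c
#include <assert.h>

/* ACCESS_ONCE() as defined in kernels of this era: a volatile cast that
 * forces the compiler to emit exactly one access to the lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int *boost_tasks;	/* shared with other threads in the kernel */

static int boost_pending(void)
{
	/* Read the shared pointer exactly once; no compiler refetch. */
	return ACCESS_ONCE(boost_tasks) != (int *)0;
}
```

Without ACCESS_ONCE(), the compiler would be free to reload the pointer between uses, so two tests of the "same" value could disagree.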
+218-7
kernel/rcutorture.c
···6161static int shuffle_interval = 3; /* Interval between shuffles (in sec)*/6262static int stutter = 5; /* Start/stop testing interval (in sec) */6363static int irqreader = 1; /* RCU readers from irq (timers). */6464-static int fqs_duration = 0; /* Duration of bursts (us), 0 to disable. */6565-static int fqs_holdoff = 0; /* Hold time within burst (us). */6464+static int fqs_duration; /* Duration of bursts (us), 0 to disable. */6565+static int fqs_holdoff; /* Hold time within burst (us). */6666static int fqs_stutter = 3; /* Wait time between bursts (s). */6767+static int onoff_interval; /* Wait time between CPU hotplugs, 0=disable. */6868+static int shutdown_secs; /* Shutdown time (s). <=0 for no shutdown. */6769static int test_boost = 1; /* Test RCU prio boost: 0=no, 1=maybe, 2=yes. */6870static int test_boost_interval = 7; /* Interval between boost tests, seconds. */6971static int test_boost_duration = 4; /* Duration of each boost test, seconds. */···9391MODULE_PARM_DESC(fqs_holdoff, "Holdoff time within fqs bursts (us)");9492module_param(fqs_stutter, int, 0444);9593MODULE_PARM_DESC(fqs_stutter, "Wait time between fqs bursts (s)");9494+module_param(onoff_interval, int, 0444);9595+MODULE_PARM_DESC(onoff_interval, "Time between CPU hotplugs (s), 0=disable");9696+module_param(shutdown_secs, int, 0444);9797+MODULE_PARM_DESC(shutdown_secs, "Shutdown time (s), zero to disable.");9698module_param(test_boost, int, 0444);9799MODULE_PARM_DESC(test_boost, "Test RCU prio boost: 0=no, 1=maybe, 2=yes.");98100module_param(test_boost_interval, int, 0444);···125119static struct task_struct *stutter_task;126120static struct task_struct *fqs_task;127121static struct task_struct *boost_tasks[NR_CPUS];122122+static struct task_struct *shutdown_task;123123+#ifdef CONFIG_HOTPLUG_CPU124124+static struct task_struct *onoff_task;125125+#endif /* #ifdef CONFIG_HOTPLUG_CPU */128126129127#define RCU_TORTURE_PIPE_LEN 10130128···159149static long n_rcu_torture_boost_failure;160150static long 
n_rcu_torture_boosts;161151static long n_rcu_torture_timers;152152+static long n_offline_attempts;153153+static long n_offline_successes;154154+static long n_online_attempts;155155+static long n_online_successes;162156static struct list_head rcu_torture_removed;163157static cpumask_var_t shuffle_tmp_mask;164158···174160#define RCUTORTURE_RUNNABLE_INIT 0175161#endif176162int rcutorture_runnable = RCUTORTURE_RUNNABLE_INIT;163163+module_param(rcutorture_runnable, int, 0444);164164+MODULE_PARM_DESC(rcutorture_runnable, "Start rcutorture at boot");177165178166#if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU)179167#define rcu_can_boost() 1···183167#define rcu_can_boost() 0184168#endif /* #else #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */185169170170+static unsigned long shutdown_time; /* jiffies to system shutdown. */186171static unsigned long boost_starttime; /* jiffies of next boost test start. */187172DEFINE_MUTEX(boost_mutex); /* protect setting boost_starttime */188173 /* and boost task create/destroy. */···198181 * Protect fullstop transitions and spawning of kthreads.199182 */200183static DEFINE_MUTEX(fullstop_mutex);184184+185185+/* Forward reference. 
*/186186+static void rcu_torture_cleanup(void);201187202188/*203189 * Detect and respond to a system shutdown.···632612 .name = "srcu"633613};634614615615+static int srcu_torture_read_lock_raw(void) __acquires(&srcu_ctl)616616+{617617+ return srcu_read_lock_raw(&srcu_ctl);618618+}619619+620620+static void srcu_torture_read_unlock_raw(int idx) __releases(&srcu_ctl)621621+{622622+ srcu_read_unlock_raw(&srcu_ctl, idx);623623+}624624+625625+static struct rcu_torture_ops srcu_raw_ops = {626626+ .init = srcu_torture_init,627627+ .cleanup = srcu_torture_cleanup,628628+ .readlock = srcu_torture_read_lock_raw,629629+ .read_delay = srcu_read_delay,630630+ .readunlock = srcu_torture_read_unlock_raw,631631+ .completed = srcu_torture_completed,632632+ .deferred_free = rcu_sync_torture_deferred_free,633633+ .sync = srcu_torture_synchronize,634634+ .cb_barrier = NULL,635635+ .stats = srcu_torture_stats,636636+ .name = "srcu_raw"637637+};638638+635639static void srcu_torture_synchronize_expedited(void)636640{637641 synchronize_srcu_expedited(&srcu_ctl);···957913 return 0;958914}959915916916+void rcutorture_trace_dump(void)917917+{918918+ static atomic_t beenhere = ATOMIC_INIT(0);919919+920920+ if (atomic_read(&beenhere))921921+ return;922922+ if (atomic_xchg(&beenhere, 1) != 0)923923+ return;924924+ do_trace_rcu_torture_read(cur_ops->name, (struct rcu_head *)~0UL);925925+ ftrace_dump(DUMP_ALL);926926+}927927+960928/*961929 * RCU torture reader from timer handler. Dereferences rcu_torture_current,962930 * incrementing the corresponding element of the pipeline array. The···990934 rcu_read_lock_bh_held() ||991935 rcu_read_lock_sched_held() ||992936 srcu_read_lock_held(&srcu_ctl));937937+ do_trace_rcu_torture_read(cur_ops->name, &p->rtort_rcu);993938 if (p == NULL) {994939 /* Leave because rcu_torture_writer is not yet underway */995940 cur_ops->readunlock(idx);···1008951 /* Should not happen, but... 
*/1009952 pipe_count = RCU_TORTURE_PIPE_LEN;1010953 }954954+ if (pipe_count > 1)955955+ rcutorture_trace_dump();1011956 __this_cpu_inc(rcu_torture_count[pipe_count]);1012957 completed = cur_ops->completed() - completed;1013958 if (completed > RCU_TORTURE_PIPE_LEN) {···1053994 rcu_read_lock_bh_held() ||1054995 rcu_read_lock_sched_held() ||1055996 srcu_read_lock_held(&srcu_ctl));997997+ do_trace_rcu_torture_read(cur_ops->name, &p->rtort_rcu);1056998 if (p == NULL) {1057999 /* Wait for rcu_torture_writer to get underway */10581000 cur_ops->readunlock(idx);···10691009 /* Should not happen, but... */10701010 pipe_count = RCU_TORTURE_PIPE_LEN;10711011 }10121012+ if (pipe_count > 1)10131013+ rcutorture_trace_dump();10721014 __this_cpu_inc(rcu_torture_count[pipe_count]);10731015 completed = cur_ops->completed() - completed;10741016 if (completed > RCU_TORTURE_PIPE_LEN) {···11181056 cnt += sprintf(&page[cnt],11191057 "rtc: %p ver: %lu tfle: %d rta: %d rtaf: %d rtf: %d "11201058 "rtmbe: %d rtbke: %ld rtbre: %ld "11211121- "rtbf: %ld rtb: %ld nt: %ld",10591059+ "rtbf: %ld rtb: %ld nt: %ld "10601060+ "onoff: %ld/%ld:%ld/%ld",11221061 rcu_torture_current,11231062 rcu_torture_current_version,11241063 list_empty(&rcu_torture_freelist),···11311068 n_rcu_torture_boost_rterror,11321069 n_rcu_torture_boost_failure,11331070 n_rcu_torture_boosts,11341134- n_rcu_torture_timers);10711071+ n_rcu_torture_timers,10721072+ n_online_successes,10731073+ n_online_attempts,10741074+ n_offline_successes,10751075+ n_offline_attempts);11351076 if (atomic_read(&n_rcu_torture_mberror) != 0 ||11361077 n_rcu_torture_boost_ktrerror != 0 ||11371078 n_rcu_torture_boost_rterror != 0 ||···12991232 "shuffle_interval=%d stutter=%d irqreader=%d "13001233 "fqs_duration=%d fqs_holdoff=%d fqs_stutter=%d "13011234 "test_boost=%d/%d test_boost_interval=%d "13021302- "test_boost_duration=%d\n",12351235+ "test_boost_duration=%d shutdown_secs=%d "12361236+ "onoff_interval=%d\n",13031237 torture_type, tag, 
nrealreaders, nfakewriters,13041238 stat_interval, verbose, test_no_idle_hz, shuffle_interval,13051239 stutter, irqreader, fqs_duration, fqs_holdoff, fqs_stutter,13061240 test_boost, cur_ops->can_boost,13071307- test_boost_interval, test_boost_duration);12411241+ test_boost_interval, test_boost_duration, shutdown_secs,12421242+ onoff_interval);13081243}1309124413101245static struct notifier_block rcutorture_shutdown_nb = {···13551286 mutex_unlock(&boost_mutex);13561287 return 0;13571288}12891289+12901290+/*12911291+ * Cause the rcutorture test to shutdown the system after the test has12921292+ * run for the time specified by the shutdown_secs module parameter.12931293+ */12941294+static int12951295+rcu_torture_shutdown(void *arg)12961296+{12971297+ long delta;12981298+ unsigned long jiffies_snap;12991299+13001300+ VERBOSE_PRINTK_STRING("rcu_torture_shutdown task started");13011301+ jiffies_snap = ACCESS_ONCE(jiffies);13021302+ while (ULONG_CMP_LT(jiffies_snap, shutdown_time) &&13031303+ !kthread_should_stop()) {13041304+ delta = shutdown_time - jiffies_snap;13051305+ if (verbose)13061306+ printk(KERN_ALERT "%s" TORTURE_FLAG13071307+ "rcu_torture_shutdown task: %lu "13081308+ "jiffies remaining\n",13091309+ torture_type, delta);13101310+ schedule_timeout_interruptible(delta);13111311+ jiffies_snap = ACCESS_ONCE(jiffies);13121312+ }13131313+ if (kthread_should_stop()) {13141314+ VERBOSE_PRINTK_STRING("rcu_torture_shutdown task stopping");13151315+ return 0;13161316+ }13171317+13181318+ /* OK, shut down the system. */13191319+13201320+ VERBOSE_PRINTK_STRING("rcu_torture_shutdown task shutting down system");13211321+ shutdown_task = NULL; /* Avoid self-kill deadlock. */13221322+ rcu_torture_cleanup(); /* Get the success/failure message. */13231323+ kernel_power_off(); /* Shut down the system. 
*/13241324+ return 0;13251325+}13261326+13271327+#ifdef CONFIG_HOTPLUG_CPU13281328+13291329+/*13301330+ * Execute random CPU-hotplug operations at the interval specified13311331+ * by the onoff_interval.13321332+ */13331333+static int13341334+rcu_torture_onoff(void *arg)13351335+{13361336+ int cpu;13371337+ int maxcpu = -1;13381338+ DEFINE_RCU_RANDOM(rand);13391339+13401340+ VERBOSE_PRINTK_STRING("rcu_torture_onoff task started");13411341+ for_each_online_cpu(cpu)13421342+ maxcpu = cpu;13431343+ WARN_ON(maxcpu < 0);13441344+ while (!kthread_should_stop()) {13451345+ cpu = (rcu_random(&rand) >> 4) % (maxcpu + 1);13461346+ if (cpu_online(cpu) && cpu_is_hotpluggable(cpu)) {13471347+ if (verbose)13481348+ printk(KERN_ALERT "%s" TORTURE_FLAG13491349+ "rcu_torture_onoff task: offlining %d\n",13501350+ torture_type, cpu);13511351+ n_offline_attempts++;13521352+ if (cpu_down(cpu) == 0) {13531353+ if (verbose)13541354+ printk(KERN_ALERT "%s" TORTURE_FLAG13551355+ "rcu_torture_onoff task: "13561356+ "offlined %d\n",13571357+ torture_type, cpu);13581358+ n_offline_successes++;13591359+ }13601360+ } else if (cpu_is_hotpluggable(cpu)) {13611361+ if (verbose)13621362+ printk(KERN_ALERT "%s" TORTURE_FLAG13631363+ "rcu_torture_onoff task: onlining %d\n",13641364+ torture_type, cpu);13651365+ n_online_attempts++;13661366+ if (cpu_up(cpu) == 0) {13671367+ if (verbose)13681368+ printk(KERN_ALERT "%s" TORTURE_FLAG13691369+ "rcu_torture_onoff task: "13701370+ "onlined %d\n",13711371+ torture_type, cpu);13721372+ n_online_successes++;13731373+ }13741374+ }13751375+ schedule_timeout_interruptible(onoff_interval * HZ);13761376+ }13771377+ VERBOSE_PRINTK_STRING("rcu_torture_onoff task stopping");13781378+ return 0;13791379+}13801380+13811381+static int13821382+rcu_torture_onoff_init(void)13831383+{13841384+ if (onoff_interval <= 0)13851385+ return 0;13861386+ onoff_task = kthread_run(rcu_torture_onoff, NULL, "rcu_torture_onoff");13871387+ if (IS_ERR(onoff_task)) {13881388+ onoff_task = 
NULL;13891389+ return PTR_ERR(onoff_task);13901390+ }13911391+ return 0;13921392+}13931393+13941394+static void rcu_torture_onoff_cleanup(void)13951395+{13961396+ if (onoff_task == NULL)13971397+ return;13981398+ VERBOSE_PRINTK_STRING("Stopping rcu_torture_onoff task");13991399+ kthread_stop(onoff_task);14001400+}14011401+14021402+#else /* #ifdef CONFIG_HOTPLUG_CPU */14031403+14041404+static void14051405+rcu_torture_onoff_init(void)14061406+{14071407+}14081408+14091409+static void rcu_torture_onoff_cleanup(void)14101410+{14111411+}14121412+14131413+#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */1358141413591415static int rcutorture_cpu_notify(struct notifier_block *self,13601416 unsigned long action, void *hcpu)···15851391 for_each_possible_cpu(i)15861392 rcutorture_booster_cleanup(i);15871393 }13941394+ if (shutdown_task != NULL) {13951395+ VERBOSE_PRINTK_STRING("Stopping rcu_torture_shutdown task");13961396+ kthread_stop(shutdown_task);13971397+ }13981398+ rcu_torture_onoff_cleanup();1588139915891400 /* Wait for all RCU callbacks to fire. 
*/15901401···16151416 static struct rcu_torture_ops *torture_ops[] =16161417 { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,16171418 &rcu_bh_ops, &rcu_bh_sync_ops, &rcu_bh_expedited_ops,16181618- &srcu_ops, &srcu_expedited_ops,14191419+ &srcu_ops, &srcu_raw_ops, &srcu_expedited_ops,16191420 &sched_ops, &sched_sync_ops, &sched_expedited_ops, };1620142116211422 mutex_lock(&fullstop_mutex);···18061607 }18071608 }18081609 }16101610+ if (shutdown_secs > 0) {16111611+ shutdown_time = jiffies + shutdown_secs * HZ;16121612+ shutdown_task = kthread_run(rcu_torture_shutdown, NULL,16131613+ "rcu_torture_shutdown");16141614+ if (IS_ERR(shutdown_task)) {16151615+ firsterr = PTR_ERR(shutdown_task);16161616+ VERBOSE_PRINTK_ERRSTRING("Failed to create shutdown");16171617+ shutdown_task = NULL;16181618+ goto unwind;16191619+ }16201620+ }16211621+ rcu_torture_onoff_init();18091622 register_reboot_notifier(&rcutorture_shutdown_nb);18101623 rcutorture_record_test_transition();18111624 mutex_unlock(&fullstop_mutex);
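As an aside, the CPU-hotplug torture loop added above boils down to: pick a random CPU, take it offline if it is online and hotpluggable, otherwise bring it online, counting attempts and successes. The following stand-alone C sketch models that selection logic in user space; the mask-based cpu_is_online()/cpu_is_hotpluggable() stand-ins and the always-succeeding cpu_down()/cpu_up() transitions are assumptions for illustration, not kernel code.

```c
/*
 * User-space model of the rcu_torture_onoff() selection logic above.
 * The online/hotpluggable predicates and the always-successful
 * "cpu_down"/"cpu_up" transitions are illustrative stand-ins.
 */
#include <assert.h>

#define NCPUS 4

static unsigned long online_mask = (1UL << NCPUS) - 1;	/* all online */
static long n_offline_attempts, n_offline_successes;
static long n_online_attempts, n_online_successes;

static int cpu_is_online(int cpu) { return (online_mask >> cpu) & 1; }

/* Assume the boot CPU cannot be hotplugged, as is common. */
static int cpu_is_hotpluggable(int cpu) { return cpu != 0; }

static void torture_onoff_once(unsigned int rnd)
{
	int cpu = (rnd >> 4) % NCPUS;	/* same shift as the rcu_random() use */

	if (cpu_is_online(cpu) && cpu_is_hotpluggable(cpu)) {
		n_offline_attempts++;
		online_mask &= ~(1UL << cpu);	/* model cpu_down() returning 0 */
		n_offline_successes++;
	} else if (cpu_is_hotpluggable(cpu)) {
		n_online_attempts++;
		online_mask |= 1UL << cpu;	/* model cpu_up() returning 0 */
		n_online_successes++;
	}
}
```

Note how an offline-but-hotpluggable CPU falls through to the "else if" branch, which is why the patch needs no separate !cpu_online() test before onlining.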
+212-92
kernel/rcutree.c
···6969 NUM_RCU_LVL_3, \7070 NUM_RCU_LVL_4, /* == MAX_RCU_LVLS */ \7171 }, \7272- .signaled = RCU_GP_IDLE, \7272+ .fqs_state = RCU_GP_IDLE, \7373 .gpnum = -300, \7474 .completed = -300, \7575 .onofflock = __RAW_SPIN_LOCK_UNLOCKED(&structname##_state.onofflock), \···195195}196196EXPORT_SYMBOL_GPL(rcu_note_context_switch);197197198198-#ifdef CONFIG_NO_HZ199198DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {200200- .dynticks_nesting = 1,199199+ .dynticks_nesting = DYNTICK_TASK_NESTING,201200 .dynticks = ATOMIC_INIT(1),202201};203203-#endif /* #ifdef CONFIG_NO_HZ */204202205203static int blimit = 10; /* Maximum callbacks per rcu_do_batch. */206204static int qhimark = 10000; /* If this many pending, ignore blimit. */···326328 return 1;327329 }328330329329- /* If preemptible RCU, no point in sending reschedule IPI. */330330- if (rdp->preemptible)331331- return 0;332332-333333- /* The CPU is online, so send it a reschedule IPI. */331331+ /*332332+ * The CPU is online, so send it a reschedule IPI. This forces333333+ * it through the scheduler, and (inefficiently) also handles cases334334+ * where idle loops fail to inform RCU about the CPU being idle.335335+ */334336 if (rdp->cpu != smp_processor_id())335337 smp_send_reschedule(rdp->cpu);336338 else···341343342344#endif /* #ifdef CONFIG_SMP */343345344344-#ifdef CONFIG_NO_HZ345345-346346-/**347347- * rcu_enter_nohz - inform RCU that current CPU is entering nohz346346+/*347347+ * rcu_idle_enter_common - inform RCU that current CPU is moving towards idle348348 *349349- * Enter nohz mode, in other words, -leave- the mode in which RCU350350- * read-side critical sections can occur. 
(Though RCU read-side351351- * critical sections can occur in irq handlers in nohz mode, a possibility352352- * handled by rcu_irq_enter() and rcu_irq_exit()).349349+ * If the new value of the ->dynticks_nesting counter now is zero,350350+ * we really have entered idle, and must do the appropriate accounting.351351+ * The caller must have disabled interrupts.353352 */354354-void rcu_enter_nohz(void)353353+static void rcu_idle_enter_common(struct rcu_dynticks *rdtp, long long oldval)355354{356356- unsigned long flags;357357- struct rcu_dynticks *rdtp;355355+ trace_rcu_dyntick("Start", oldval, 0);356356+ if (!is_idle_task(current)) {357357+ struct task_struct *idle = idle_task(smp_processor_id());358358359359- local_irq_save(flags);360360- rdtp = &__get_cpu_var(rcu_dynticks);361361- if (--rdtp->dynticks_nesting) {362362- local_irq_restore(flags);363363- return;359359+ trace_rcu_dyntick("Error on entry: not idle task", oldval, 0);360360+ ftrace_dump(DUMP_ALL);361361+ WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",362362+ current->pid, current->comm,363363+ idle->pid, idle->comm); /* must be idle task! */364364 }365365- trace_rcu_dyntick("Start");365365+ rcu_prepare_for_idle(smp_processor_id());366366 /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */367367 smp_mb__before_atomic_inc(); /* See above. */368368 atomic_inc(&rdtp->dynticks);369369 smp_mb__after_atomic_inc(); /* Force ordering with next sojourn. */370370 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);371371- local_irq_restore(flags);372371}373372374374-/*375375- * rcu_exit_nohz - inform RCU that current CPU is leaving nohz373373+/**374374+ * rcu_idle_enter - inform RCU that current CPU is entering idle376375 *377377- * Exit nohz mode, in other words, -enter- the mode in which RCU378378- * read-side critical sections normally occur.376376+ * Enter idle mode, in other words, -leave- the mode in which RCU377377+ * read-side critical sections can occur. 
(Though RCU read-side378378+ * critical sections can occur in irq handlers in idle, a possibility379379+ * handled by irq_enter() and irq_exit().)380380+ *381381+ * We crowbar the ->dynticks_nesting field to zero to allow for382382+ * the possibility of usermode upcalls having messed up our count383383+ * of interrupt nesting level during the prior busy period.379384 */380380-void rcu_exit_nohz(void)385385+void rcu_idle_enter(void)381386{382387 unsigned long flags;388388+ long long oldval;383389 struct rcu_dynticks *rdtp;384390385391 local_irq_save(flags);386392 rdtp = &__get_cpu_var(rcu_dynticks);387387- if (rdtp->dynticks_nesting++) {388388- local_irq_restore(flags);389389- return;390390- }393393+ oldval = rdtp->dynticks_nesting;394394+ rdtp->dynticks_nesting = 0;395395+ rcu_idle_enter_common(rdtp, oldval);396396+ local_irq_restore(flags);397397+}398398+399399+/**400400+ * rcu_irq_exit - inform RCU that current CPU is exiting irq towards idle401401+ *402402+ * Exit from an interrupt handler, which might possibly result in entering403403+ * idle mode, in other words, leaving the mode in which read-side critical404404+ * sections can occur.405405+ *406406+ * This code assumes that the idle loop never does anything that might407407+ * result in unbalanced calls to irq_enter() and irq_exit(). If your408408+ * architecture violates this assumption, RCU will give you what you409409+ * deserve, good and hard. 
But very infrequently and irreproducibly.410410+ *411411+ * Use things like work queues to work around this limitation.412412+ *413413+ * You have been warned.414414+ */415415+void rcu_irq_exit(void)416416+{417417+ unsigned long flags;418418+ long long oldval;419419+ struct rcu_dynticks *rdtp;420420+421421+ local_irq_save(flags);422422+ rdtp = &__get_cpu_var(rcu_dynticks);423423+ oldval = rdtp->dynticks_nesting;424424+ rdtp->dynticks_nesting--;425425+ WARN_ON_ONCE(rdtp->dynticks_nesting < 0);426426+ if (rdtp->dynticks_nesting)427427+ trace_rcu_dyntick("--=", oldval, rdtp->dynticks_nesting);428428+ else429429+ rcu_idle_enter_common(rdtp, oldval);430430+ local_irq_restore(flags);431431+}432432+433433+/*434434+ * rcu_idle_exit_common - inform RCU that current CPU is moving away from idle435435+ *436436+ * If the new value of the ->dynticks_nesting counter was previously zero,437437+ * we really have exited idle, and must do the appropriate accounting.438438+ * The caller must have disabled interrupts.439439+ */440440+static void rcu_idle_exit_common(struct rcu_dynticks *rdtp, long long oldval)441441+{391442 smp_mb__before_atomic_inc(); /* Force ordering w/previous sojourn. */392443 atomic_inc(&rdtp->dynticks);393444 /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */394445 smp_mb__after_atomic_inc(); /* See above. */395446 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));396396- trace_rcu_dyntick("End");447447+ rcu_cleanup_after_idle(smp_processor_id());448448+ trace_rcu_dyntick("End", oldval, rdtp->dynticks_nesting);449449+ if (!is_idle_task(current)) {450450+ struct task_struct *idle = idle_task(smp_processor_id());451451+452452+ trace_rcu_dyntick("Error on exit: not idle task",453453+ oldval, rdtp->dynticks_nesting);454454+ ftrace_dump(DUMP_ALL);455455+ WARN_ONCE(1, "Current pid: %d comm: %s / Idle pid: %d comm: %s",456456+ current->pid, current->comm,457457+ idle->pid, idle->comm); /* must be idle task! 
*/458458+ }459459+}460460+461461+/**462462+ * rcu_idle_exit - inform RCU that current CPU is leaving idle463463+ *464464+ * Exit idle mode, in other words, -enter- the mode in which RCU465465+ * read-side critical sections can occur.466466+ *467467+ * We crowbar the ->dynticks_nesting field to DYNTICK_TASK_NESTING to468468+ * allow for the possibility of usermode upcalls messing up our count469469+ * of interrupt nesting level during the busy period that is just470470+ * now starting.471471+ */472472+void rcu_idle_exit(void)473473+{474474+ unsigned long flags;475475+ struct rcu_dynticks *rdtp;476476+ long long oldval;477477+478478+ local_irq_save(flags);479479+ rdtp = &__get_cpu_var(rcu_dynticks);480480+ oldval = rdtp->dynticks_nesting;481481+ WARN_ON_ONCE(oldval != 0);482482+ rdtp->dynticks_nesting = DYNTICK_TASK_NESTING;483483+ rcu_idle_exit_common(rdtp, oldval);484484+ local_irq_restore(flags);485485+}486486+487487+/**488488+ * rcu_irq_enter - inform RCU that current CPU is entering irq away from idle489489+ *490490+ * Enter an interrupt handler, which might possibly result in exiting491491+ * idle mode, in other words, entering the mode in which read-side critical492492+ * sections can occur.493493+ *494494+ * Note that the Linux kernel is fully capable of entering an interrupt495495+ * handler that it never exits, for example when doing upcalls to496496+ * user mode! This code assumes that the idle loop never does upcalls to497497+ * user mode. If your architecture does do upcalls from the idle loop (or498498+ * does anything else that results in unbalanced calls to the irq_enter()499499+ * and irq_exit() functions), RCU will give you what you deserve, good500500+ * and hard. 
But very infrequently and irreproducibly.501501+ *502502+ * Use things like work queues to work around this limitation.503503+ *504504+ * You have been warned.505505+ */506506+void rcu_irq_enter(void)507507+{508508+ unsigned long flags;509509+ struct rcu_dynticks *rdtp;510510+ long long oldval;511511+512512+ local_irq_save(flags);513513+ rdtp = &__get_cpu_var(rcu_dynticks);514514+ oldval = rdtp->dynticks_nesting;515515+ rdtp->dynticks_nesting++;516516+ WARN_ON_ONCE(rdtp->dynticks_nesting == 0);517517+ if (oldval)518518+ trace_rcu_dyntick("++=", oldval, rdtp->dynticks_nesting);519519+ else520520+ rcu_idle_exit_common(rdtp, oldval);397521 local_irq_restore(flags);398522}399523···562442 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);563443}564444565565-/**566566- * rcu_irq_enter - inform RCU of entry to hard irq context567567- *568568- * If the CPU was idle with dynamic ticks active, this updates the569569- * rdtp->dynticks to let the RCU handling know that the CPU is active.570570- */571571-void rcu_irq_enter(void)572572-{573573- rcu_exit_nohz();574574-}445445+#ifdef CONFIG_PROVE_RCU575446576447/**577577- * rcu_irq_exit - inform RCU of exit from hard irq context448448+ * rcu_is_cpu_idle - see if RCU thinks that the current CPU is idle578449 *579579- * If the CPU was idle with dynamic ticks active, update the rdp->dynticks580580- * to put let the RCU handling be aware that the CPU is going back to idle581581- * with no ticks.450450+ * If the current CPU is in its idle loop and is neither in an interrupt451451+ * or NMI handler, return true.582452 */583583-void rcu_irq_exit(void)453453+int rcu_is_cpu_idle(void)584454{585585- rcu_enter_nohz();455455+ int ret;456456+457457+ preempt_disable();458458+ ret = (atomic_read(&__get_cpu_var(rcu_dynticks).dynticks) & 0x1) == 0;459459+ preempt_enable();460460+ return ret;461461+}462462+EXPORT_SYMBOL(rcu_is_cpu_idle);463463+464464+#endif /* #ifdef CONFIG_PROVE_RCU */465465+466466+/**467467+ * rcu_is_cpu_rrupt_from_idle - see if 
idle or immediately interrupted from idle468468+ *469469+ * If the current CPU is idle or running at a first-level (not nested)470470+ * interrupt from idle, return true. The caller must have at least471471+ * disabled preemption.472472+ */473473+int rcu_is_cpu_rrupt_from_idle(void)474474+{475475+ return __get_cpu_var(rcu_dynticks).dynticks_nesting <= 1;586476}587477588478#ifdef CONFIG_SMP···605475static int dyntick_save_progress_counter(struct rcu_data *rdp)606476{607477 rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);608608- return 0;478478+ return (rdp->dynticks_snap & 0x1) == 0;609479}610480611481/*···641511}642512643513#endif /* #ifdef CONFIG_SMP */644644-645645-#else /* #ifdef CONFIG_NO_HZ */646646-647647-#ifdef CONFIG_SMP648648-649649-static int dyntick_save_progress_counter(struct rcu_data *rdp)650650-{651651- return 0;652652-}653653-654654-static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)655655-{656656- return rcu_implicit_offline_qs(rdp);657657-}658658-659659-#endif /* #ifdef CONFIG_SMP */660660-661661-#endif /* #else #ifdef CONFIG_NO_HZ */662662-663663-int rcu_cpu_stall_suppress __read_mostly;664514665515static void record_gp_stall_check_time(struct rcu_state *rsp)666516{···976866 /* Advance to a new grace period and initialize state. */977867 rsp->gpnum++;978868 trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");979979- WARN_ON_ONCE(rsp->signaled == RCU_GP_INIT);980980- rsp->signaled = RCU_GP_INIT; /* Hold off force_quiescent_state. */869869+ WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);870870+ rsp->fqs_state = RCU_GP_INIT; /* Hold off force_quiescent_state. */981871 rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;982872 record_gp_stall_check_time(rsp);983873···987877 rnp->qsmask = rnp->qsmaskinit;988878 rnp->gpnum = rsp->gpnum;989879 rnp->completed = rsp->completed;990990- rsp->signaled = RCU_SIGNAL_INIT; /* force_quiescent_state OK. 
*/880880+ rsp->fqs_state = RCU_SIGNAL_INIT; /* force_quiescent_state OK */991881 rcu_start_gp_per_cpu(rsp, rnp, rdp);992882 rcu_preempt_boost_start_gp(rnp);993883 trace_rcu_grace_period_init(rsp->name, rnp->gpnum,···10379271038928 rnp = rcu_get_root(rsp);1039929 raw_spin_lock(&rnp->lock); /* irqs already disabled. */10401040- rsp->signaled = RCU_SIGNAL_INIT; /* force_quiescent_state now OK. */930930+ rsp->fqs_state = RCU_SIGNAL_INIT; /* force_quiescent_state now OK. */1041931 raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */1042932 raw_spin_unlock_irqrestore(&rsp->onofflock, flags);1043933}···11019911102992 rsp->completed = rsp->gpnum; /* Declare the grace period complete. */1103993 trace_rcu_grace_period(rsp->name, rsp->completed, "end");11041104- rsp->signaled = RCU_GP_IDLE;994994+ rsp->fqs_state = RCU_GP_IDLE;1105995 rcu_start_gp(rsp, flags); /* releases root node's rnp->lock. */1106996}1107997···13311221 else13321222 raw_spin_unlock_irqrestore(&rnp->lock, flags);13331223 if (need_report & RCU_OFL_TASKS_EXP_GP)13341334- rcu_report_exp_rnp(rsp, rnp);12241224+ rcu_report_exp_rnp(rsp, rnp, true);13351225 rcu_node_kthread_setaffinity(rnp, -1);13361226}13371227···13731263 /* If no callbacks are ready, just return.*/13741264 if (!cpu_has_callbacks_ready_to_invoke(rdp)) {13751265 trace_rcu_batch_start(rsp->name, 0, 0);13761376- trace_rcu_batch_end(rsp->name, 0);12661266+ trace_rcu_batch_end(rsp->name, 0, !!ACCESS_ONCE(rdp->nxtlist),12671267+ need_resched(), is_idle_task(current),12681268+ rcu_is_callbacks_kthread());13771269 return;13781270 }13791271···14031291 debug_rcu_head_unqueue(list);14041292 __rcu_reclaim(rsp->name, list);14051293 list = next;14061406- if (++count >= bl)12941294+ /* Stop only if limit reached and CPU has something to do. 
*/12951295+ if (++count >= bl &&12961296+ (need_resched() ||12971297+ (!is_idle_task(current) && !rcu_is_callbacks_kthread())))14071298 break;14081299 }1409130014101301 local_irq_save(flags);14111411- trace_rcu_batch_end(rsp->name, count);13021302+ trace_rcu_batch_end(rsp->name, count, !!list, need_resched(),13031303+ is_idle_task(current),13041304+ rcu_is_callbacks_kthread());1412130514131306 /* Update count, and requeue any remaining callbacks. */14141307 rdp->qlen -= count;···14511334 * (user mode or idle loop for rcu, non-softirq execution for rcu_bh).14521335 * Also schedule RCU core processing.14531336 *14541454- * This function must be called with hardirqs disabled. It is normally13371337+ * This function must be called from hardirq context. It is normally14551338 * invoked from the scheduling-clock interrupt. If rcu_pending returns14561339 * false, there is no point in invoking rcu_check_callbacks().14571340 */14581341void rcu_check_callbacks(int cpu, int user)14591342{14601343 trace_rcu_utilization("Start scheduler-tick");14611461- if (user ||14621462- (idle_cpu(cpu) && rcu_scheduler_active &&14631463- !in_softirq() && hardirq_count() <= (1 << HARDIRQ_SHIFT))) {13441344+ if (user || rcu_is_cpu_rrupt_from_idle()) {1464134514651346 /*14661347 * Get here if this CPU took its interrupt from user···15721457 goto unlock_fqs_ret; /* no GP in progress, time updated. 
*/15731458 }15741459 rsp->fqs_active = 1;15751575- switch (rsp->signaled) {14601460+ switch (rsp->fqs_state) {15761461 case RCU_GP_IDLE:15771462 case RCU_GP_INIT:15781463···15881473 force_qs_rnp(rsp, dyntick_save_progress_counter);15891474 raw_spin_lock(&rnp->lock); /* irqs already disabled */15901475 if (rcu_gp_in_progress(rsp))15911591- rsp->signaled = RCU_FORCE_QS;14761476+ rsp->fqs_state = RCU_FORCE_QS;15921477 break;1593147815941479 case RCU_FORCE_QS:···19271812 * by the current CPU, even if none need be done immediately, returning19281813 * 1 if so.19291814 */19301930-static int rcu_needs_cpu_quick_check(int cpu)18151815+static int rcu_cpu_has_callbacks(int cpu)19311816{19321817 /* RCU callbacks either ready or pending? */19331818 return per_cpu(rcu_sched_data, cpu).nxtlist ||···20281913 for (i = 0; i < RCU_NEXT_SIZE; i++)20291914 rdp->nxttail[i] = &rdp->nxtlist;20301915 rdp->qlen = 0;20312031-#ifdef CONFIG_NO_HZ20321916 rdp->dynticks = &per_cpu(rcu_dynticks, cpu);20332033-#endif /* #ifdef CONFIG_NO_HZ */19171917+ WARN_ON_ONCE(rdp->dynticks->dynticks_nesting != DYNTICK_TASK_NESTING);19181918+ WARN_ON_ONCE(atomic_read(&rdp->dynticks->dynticks) != 1);20341919 rdp->cpu = cpu;20351920 rdp->rsp = rsp;20361921 raw_spin_unlock_irqrestore(&rnp->lock, flags);···20571942 rdp->qlen_last_fqs_check = 0;20581943 rdp->n_force_qs_snap = rsp->n_force_qs;20591944 rdp->blimit = blimit;19451945+ rdp->dynticks->dynticks_nesting = DYNTICK_TASK_NESTING;19461946+ atomic_set(&rdp->dynticks->dynticks,19471947+ (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);19481948+ rcu_prepare_for_idle_init(cpu);20601949 raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */2061195020621951 /*···21422023 rcu_send_cbs_to_online(&rcu_bh_state);21432024 rcu_send_cbs_to_online(&rcu_sched_state);21442025 rcu_preempt_send_cbs_to_online();20262026+ rcu_cleanup_after_idle(cpu);21452027 break;21462028 case CPU_DEAD:21472029 case CPU_DEAD_FROZEN:
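The dyntick accounting introduced by the rcutree.c changes above can be summarized: ->dynticks is even exactly when the CPU is idle from RCU's perspective, and ->dynticks_nesting counts process/irq nesting, with the counter flipped only when the nesting level crosses zero. A minimal user-space model follows; the DYNTICK_TASK_NESTING value is a stand-in (the hunks only note that process level is worth LLONG_MAX/2), and the WARN_ON/trace hooks are omitted.

```c
/*
 * User-space model of the ->dynticks/->dynticks_nesting accounting
 * above.  DYNTICK_TASK_NESTING's exact kernel value is not shown in
 * this patch, so a stand-in is used here.
 */
#include <assert.h>

#define DYNTICK_TASK_NESTING (1LL << 30)	/* stand-in value */

static long long dynticks_nesting = DYNTICK_TASK_NESTING;
static int dynticks = 1;	/* odd: CPU is not idle */

static int is_rcu_idle(void) { return (dynticks & 1) == 0; }

/* rcu_idle_enter(): crowbar nesting to zero, counter becomes even. */
static void idle_enter(void)
{
	dynticks_nesting = 0;
	dynticks++;
}

/* rcu_irq_enter(): only the first irq from idle flips the counter. */
static void irq_enter(void)
{
	if (dynticks_nesting++ == 0)
		dynticks++;
}

/* rcu_irq_exit(): only the last irq exit back to idle flips it. */
static void irq_exit(void)
{
	if (--dynticks_nesting == 0)
		dynticks++;
}

/* rcu_idle_exit(): restore task-level nesting, counter becomes odd. */
static void idle_exit(void)
{
	dynticks_nesting = DYNTICK_TASK_NESTING;
	dynticks++;
}
```

A nested interrupt taken while already in an interrupt from idle leaves the counter untouched, which is exactly why rcu_irq_exit() only calls rcu_idle_enter_common() when the nesting count reaches zero.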
+12-14
kernel/rcutree.h
···8484 * Dynticks per-CPU state.8585 */8686struct rcu_dynticks {8787- int dynticks_nesting; /* Track irq/process nesting level. */8888- int dynticks_nmi_nesting; /* Track NMI nesting level. */8989- atomic_t dynticks; /* Even value for dynticks-idle, else odd. */8787+ long long dynticks_nesting; /* Track irq/process nesting level. */8888+ /* Process level is worth LLONG_MAX/2. */8989+ int dynticks_nmi_nesting; /* Track NMI nesting level. */9090+ atomic_t dynticks; /* Even value for idle, else odd. */9091};91929293/* RCU's kthread states for tracing. */···275274 /* did other CPU force QS recently? */276275 long blimit; /* Upper limit on a processed batch */277276278278-#ifdef CONFIG_NO_HZ279277 /* 3) dynticks interface. */280278 struct rcu_dynticks *dynticks; /* Shared per-CPU dynticks state. */281279 int dynticks_snap; /* Per-GP tracking for dynticks. */282282-#endif /* #ifdef CONFIG_NO_HZ */283280284281 /* 4) reasons this CPU needed to be kicked by force_quiescent_state */285285-#ifdef CONFIG_NO_HZ286282 unsigned long dynticks_fqs; /* Kicked due to dynticks idle. */287287-#endif /* #ifdef CONFIG_NO_HZ */288283 unsigned long offline_fqs; /* Kicked due to being offline. */289284 unsigned long resched_ipi; /* Sent a resched IPI. */290285···299302 struct rcu_state *rsp;300303};301304302302-/* Values for signaled field in struct rcu_state. */305305+/* Values for fqs_state field in struct rcu_state. */303306#define RCU_GP_IDLE 0 /* No grace period in progress. */304307#define RCU_GP_INIT 1 /* Grace period being initialized. */305308#define RCU_SAVE_DYNTICK 2 /* Need to scan dyntick state. */306309#define RCU_FORCE_QS 3 /* Need to force quiescent state. 
*/307307-#ifdef CONFIG_NO_HZ308310#define RCU_SIGNAL_INIT RCU_SAVE_DYNTICK309309-#else /* #ifdef CONFIG_NO_HZ */310310-#define RCU_SIGNAL_INIT RCU_FORCE_QS311311-#endif /* #else #ifdef CONFIG_NO_HZ */312311313312#define RCU_JIFFIES_TILL_FORCE_QS 3 /* for rsp->jiffies_force_qs */314313···354361355362 /* The following fields are guarded by the root rcu_node's lock. */356363357357- u8 signaled ____cacheline_internodealigned_in_smp;364364+ u8 fqs_state ____cacheline_internodealigned_in_smp;358365 /* Force QS state. */359366 u8 fqs_active; /* force_quiescent_state() */360367 /* is running. */···444451static void rcu_preempt_process_callbacks(void);445452void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));446453#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU)447447-static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp);454454+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,455455+ bool wake);448456#endif /* #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU) */449457static int rcu_preempt_pending(int cpu);450458static int rcu_preempt_needs_cpu(int cpu);···455461static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);456462static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);457463static void invoke_rcu_callbacks_kthread(void);464464+static bool rcu_is_callbacks_kthread(void);458465#ifdef CONFIG_RCU_BOOST459466static void rcu_preempt_do_callbacks(void);460467static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp,···468473#endif /* #ifdef CONFIG_RCU_BOOST */469474static void rcu_cpu_kthread_setrt(int cpu, int to_rt);470475static void __cpuinit rcu_prepare_kthreads(int cpu);476476+static void rcu_prepare_for_idle_init(int cpu);477477+static void rcu_cleanup_after_idle(int cpu);478478+static void rcu_prepare_for_idle(int cpu);471479472480#endif /* #ifndef RCU_TREE_NONCORE */
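The ->dynticks_snap field retained above supports the dyntick-based quiescent-state checks in rcutree.c, where dyntick_save_progress_counter() now returns the snapshot's parity. A simplified user-space sketch of that two-pass detection, with scalar arguments standing in for the rcu_data fields:

```c
/*
 * Sketch of dyntick-based quiescent-state detection: the first
 * force_quiescent_state() pass snapshots each CPU's ->dynticks
 * counter, and a later pass declares a quiescent state if the CPU
 * is idle now (even counter) or has passed through idle since the
 * snapshot (counter advanced).  Scalar stand-ins, not kernel code.
 */
#include <assert.h>

/* First FQS pass: save the counter; even means idle right now. */
static int dyntick_save_progress_counter(int dynticks, int *snap)
{
	*snap = dynticks;
	return (*snap & 0x1) == 0;
}

/* Later FQS pass: quiescent if idle now or counter moved since. */
static int rcu_implicit_dynticks_qs(int dynticks, int snap)
{
	return (dynticks & 0x1) == 0 || dynticks != snap;
}
```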
+229-56
kernel/rcutree_plugin.h
···312312{313313 int empty;314314 int empty_exp;315315+ int empty_exp_now;315316 unsigned long flags;316317 struct list_head *np;317318#ifdef CONFIG_RCU_BOOST···383382 /*384383 * If this was the last task on the current list, and if385384 * we aren't waiting on any CPUs, report the quiescent state.386386- * Note that rcu_report_unblock_qs_rnp() releases rnp->lock.385385+ * Note that rcu_report_unblock_qs_rnp() releases rnp->lock,386386+ * so we must take a snapshot of the expedited state.387387 */388388+ empty_exp_now = !rcu_preempted_readers_exp(rnp);388389 if (!empty && !rcu_preempt_blocked_readers_cgp(rnp)) {389390 trace_rcu_quiescent_state_report("preempt_rcu",390391 rnp->gpnum,···409406 * If this was the last task on the expedited lists,410407 * then we need to report up the rcu_node hierarchy.411408 */412412- if (!empty_exp && !rcu_preempted_readers_exp(rnp))413413- rcu_report_exp_rnp(&rcu_preempt_state, rnp);409409+ if (!empty_exp && empty_exp_now)410410+ rcu_report_exp_rnp(&rcu_preempt_state, rnp, true);414411 } else {415412 local_irq_restore(flags);416413 }···732729 * recursively up the tree. 
(Calm down, calm down, we do the recursion733730 * iteratively!)734731 *732732+ * Most callers will set the "wake" flag, but the task initiating the733733+ * expedited grace period need not wake itself.734734+ *735735 * Caller must hold sync_rcu_preempt_exp_mutex.736736 */737737-static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)737737+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,738738+ bool wake)738739{739740 unsigned long flags;740741 unsigned long mask;···751744 }752745 if (rnp->parent == NULL) {753746 raw_spin_unlock_irqrestore(&rnp->lock, flags);754754- wake_up(&sync_rcu_preempt_exp_wq);747747+ if (wake)748748+ wake_up(&sync_rcu_preempt_exp_wq);755749 break;756750 }757751 mask = rnp->grpmask;···785777 must_wait = 1;786778 }787779 if (!must_wait)788788- rcu_report_exp_rnp(rsp, rnp);780780+ rcu_report_exp_rnp(rsp, rnp, false); /* Don't wake self. */789781}790782791783/*···10771069 * report on tasks preempted in RCU read-side critical sections during10781070 * expedited RCU grace periods.10791071 */10801080-static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)10721072+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,10731073+ bool wake)10811074{10821082- return;10831075}1084107610851077#endif /* #ifdef CONFIG_HOTPLUG_CPU */···1165115711661158#endif /* #else #ifdef CONFIG_RCU_TRACE */1167115911681168-static struct lock_class_key rcu_boost_class;11691169-11701160/*11711161 * Carry out RCU priority boosting on the task indicated by ->exp_tasks11721162 * or ->boost_tasks, advancing the pointer to the next task in the···12271221 */12281222 t = container_of(tb, struct task_struct, rcu_node_entry);12291223 rt_mutex_init_proxy_locked(&mtx, t);12301230- /* Avoid lockdep false positives. This rt_mutex is its own thing. 
*/12311231- lockdep_set_class_and_name(&mtx.wait_lock, &rcu_boost_class,12321232- "rcu_boost_mutex");12331224 t->rcu_boost_mutex = &mtx;12341225 raw_spin_unlock_irqrestore(&rnp->lock, flags);12351226 rt_mutex_lock(&mtx); /* Side effect: boosts task t's priority. */12361227 rt_mutex_unlock(&mtx); /* Keep lockdep happy. */1237122812381238- return rnp->exp_tasks != NULL || rnp->boost_tasks != NULL;12291229+ return ACCESS_ONCE(rnp->exp_tasks) != NULL ||12301230+ ACCESS_ONCE(rnp->boost_tasks) != NULL;12391231}1240123212411233/*···13301326 current != __this_cpu_read(rcu_cpu_kthread_task))13311327 wake_up_process(__this_cpu_read(rcu_cpu_kthread_task));13321328 local_irq_restore(flags);13291329+}13301330+13311331+/*13321332+ * Is the current CPU running the RCU-callbacks kthread?13331333+ * Caller must have preemption disabled.13341334+ */13351335+static bool rcu_is_callbacks_kthread(void)13361336+{13371337+ return __get_cpu_var(rcu_cpu_kthread_task) == current;13331338}1334133913351340/*···17851772 WARN_ON_ONCE(1);17861773}1787177417751775+static bool rcu_is_callbacks_kthread(void)17761776+{17771777+ return false;17781778+}17791779+17881780static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)17891781{17901782}···19251907 * grace period works for us.19261908 */19271909 get_online_cpus();19281928- snap = atomic_read(&sync_sched_expedited_started) - 1;19101910+ snap = atomic_read(&sync_sched_expedited_started);19291911 smp_mb(); /* ensure read is before try_stop_cpus(). */19301912 }19311913···19571939 * 1 if so. This function is part of the RCU implementation; it is -not-19581940 * an exported member of the RCU API.19591941 *19601960- * Because we have preemptible RCU, just check whether this CPU needs19611961- * any flavor of RCU. 
Do not chew up lots of CPU cycles with preemption19621962- disabled in a most-likely vain attempt to cause RCU not to need this CPU.19421942+ * Because we do not have RCU_FAST_NO_HZ, just check whether this CPU needs19431943+ * any flavor of RCU.19631944 */19641945int rcu_needs_cpu(int cpu)19651946{19661966- return rcu_needs_cpu_quick_check(cpu);19471947+ return rcu_cpu_has_callbacks(cpu);19481948+}19491949+19501950+/*19511951+ * Because we do not have RCU_FAST_NO_HZ, don't bother initializing for it.19521952+ */19531953+static void rcu_prepare_for_idle_init(int cpu)19541954+{19551955+}19561956+19571957+/*19581958+ * Because we do not have RCU_FAST_NO_HZ, don't bother cleaning up19591959+ * after it.19601960+ */19611961+static void rcu_cleanup_after_idle(int cpu)19621962+{19631963+}19641964+19651965+/*19661966+ * Do the idle-entry grace-period work, which, because CONFIG_RCU_FAST_NO_HZ=n,19671967+ * is nothing.19681968+ */19691969+static void rcu_prepare_for_idle(int cpu)19701970+{19671971}1968197219691973#else /* #if !defined(CONFIG_RCU_FAST_NO_HZ) */1970197419711971-#define RCU_NEEDS_CPU_FLUSHES 519751975+/*19761976+ * This code is invoked when a CPU goes idle, at which point we want19771977+ * to have the CPU do everything required for RCU so that it can enter19781978+ * the energy-efficient dyntick-idle mode. This is handled by a19791979+ * state machine implemented by rcu_prepare_for_idle() below.19801980+ *19811981+ * The following three preprocessor symbols control this state machine:19821982+ *19831983+ * RCU_IDLE_FLUSHES gives the maximum number of times that we will attempt19841984+ * to satisfy RCU. 
Beyond this point, it is better to incur a periodic19851985+ * scheduling-clock interrupt than to loop through the state machine19861986+ * at full power.19871987+ * RCU_IDLE_OPT_FLUSHES gives the number of RCU_IDLE_FLUSHES that are19881988+ * optional if RCU does not need anything immediately from this19891989+ * CPU, even if this CPU still has RCU callbacks queued. The first19901990+ * times through the state machine are mandatory: we need to give19911991+ * the state machine a chance to communicate a quiescent state19921992+ * to the RCU core.19931993+ * RCU_IDLE_GP_DELAY gives the number of jiffies that a CPU is permitted19941994+ * to sleep in dyntick-idle mode with RCU callbacks pending. This19951995+ * is sized to be roughly one RCU grace period. Those energy-efficiency19961996+ * benchmarkers who might otherwise be tempted to set this to a large19971997+ * number, be warned: Setting RCU_IDLE_GP_DELAY too high can hang your19981998+ * system. And if you are -that- concerned about energy efficiency,19991999+ * just power the system down and be done with it!20002000+ *20012001+ * The values below work well in practice. If future workloads require20022002+ * adjustment, they can be converted into kernel config parameters, though20032003+ * making the state machine smarter might be a better option.20042004+ */20052005+#define RCU_IDLE_FLUSHES 5 /* Number of dyntick-idle tries. */20062006+#define RCU_IDLE_OPT_FLUSHES 3 /* Optional dyntick-idle tries. */20072007+#define RCU_IDLE_GP_DELAY 6 /* Roughly one grace period. */20082008+19722009static DEFINE_PER_CPU(int, rcu_dyntick_drain);19732010static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff);20112011+static DEFINE_PER_CPU(struct hrtimer, rcu_idle_gp_timer);20122012+static ktime_t rcu_idle_gp_wait;1974201319752014/*19761976- * Check to see if any future RCU-related work will need to be done19771977- * by the current CPU, even if none need be done immediately, returning19781978- * 1 if so. 
This function is part of the RCU implementation; it is -not-19791979- * an exported member of the RCU API.20152015+ * Allow the CPU to enter dyntick-idle mode if any of the following holds: (1) There are no20162016+ * callbacks on this CPU, (2) this CPU has not yet attempted to enter20172017+ * dyntick-idle mode, or (3) this CPU is in the process of attempting to20182018+ * enter dyntick-idle mode. Otherwise, if we have recently tried and failed20192019+ * to enter dyntick-idle mode, we refuse to try to enter it. After all,20202020+ * it is better to incur scheduling-clock interrupts than to spin20212021+ * continuously for the same time duration!20222022+ */20232023+int rcu_needs_cpu(int cpu)20242024+{20252025+ /* If no callbacks, RCU doesn't need the CPU. */20262026+ if (!rcu_cpu_has_callbacks(cpu))20272027+ return 0;20282028+ /* Otherwise, RCU needs the CPU only if it recently tried and failed. */20292029+ return per_cpu(rcu_dyntick_holdoff, cpu) == jiffies;20302030+}20312031+20322032+/*20332033+ * Timer handler used to force CPU to start pushing its remaining RCU20342034+ * callbacks in the case where it entered dyntick-idle mode with callbacks20352035+ * pending. 
The handler doesn't really need to do anything because the20362036+ * real work is done upon re-entry to idle, or by the next scheduling-clock20372037+ * interrupt should idle not be re-entered.20382038+ */20392039+static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp)20402040+{20412041+ trace_rcu_prep_idle("Timer");20422042+ return HRTIMER_NORESTART;20432043+}20442044+20452045+/*20462046+ * Initialize the timer used to pull CPUs out of dyntick-idle mode.20472047+ */20482048+static void rcu_prepare_for_idle_init(int cpu)20492049+{20502050+ static int firsttime = 1;20512051+ struct hrtimer *hrtp = &per_cpu(rcu_idle_gp_timer, cpu);20522052+20532053+ hrtimer_init(hrtp, CLOCK_MONOTONIC, HRTIMER_MODE_REL);20542054+ hrtp->function = rcu_idle_gp_timer_func;20552055+ if (firsttime) {20562056+ unsigned int upj = jiffies_to_usecs(RCU_IDLE_GP_DELAY);20572057+20582058+ rcu_idle_gp_wait = ns_to_ktime(upj * (u64)1000);20592059+ firsttime = 0;20602060+ }20612061+}20622062+20632063+/*20642064+ * Clean up for exit from idle. Because we are exiting from idle, there20652065+ * is no longer any point to rcu_idle_gp_timer, so cancel it. This will20662066+ * do nothing if this timer is not active, so just cancel it unconditionally.20672067+ */20682068+static void rcu_cleanup_after_idle(int cpu)20692069+{20702070+ hrtimer_cancel(&per_cpu(rcu_idle_gp_timer, cpu));20712071+}20722072+20732073+/*20742074+ * Check to see if any RCU-related work can be done by the current CPU,20752075+ * and if so, schedule a softirq to get it done. This function is part20762076+ * of the RCU implementation; it is -not- an exported member of the RCU API.19802077 *19811981- * Because we are not supporting preemptible RCU, attempt to accelerate19821982- any current grace periods so that RCU no longer needs this CPU, but19831983- only if all other CPUs are already in dynticks-idle mode. 
This will19841984- * allow the CPU cores to be powered down immediately, as opposed to after19851985- * waiting many milliseconds for grace periods to elapse.20782078+ * The idea is for the current CPU to clear out all work required by the20792079+ * RCU core for the current grace period, so that this CPU can be permitted20802080+ * to enter dyntick-idle mode. In some cases, it will need to be awakened20812081+ * at the end of the grace period by whatever CPU ends the grace period.20822082+ * This allows CPUs to go dyntick-idle more quickly, and to reduce the20832083+ * number of wakeups by a modest integer factor.19862084 *19872085 * Because it is not legal to invoke rcu_process_callbacks() with irqs19882086 * disabled, we do one pass of force_quiescent_state(), then do an19892087 * invoke_rcu_core() to cause rcu_process_callbacks() to be invoked19902088 * later. The per-cpu rcu_dyntick_drain variable controls the sequencing.20892089+ *20902090+ * The caller must have disabled interrupts.19912091 */19921992-int rcu_needs_cpu(int cpu)20922092+static void rcu_prepare_for_idle(int cpu)19932093{19941994- int c = 0;19951995- int snap;19961996- int thatcpu;20942094+ unsigned long flags;1997209519981998- /* Check for being in the holdoff period. */19991999- if (per_cpu(rcu_dyntick_holdoff, cpu) == jiffies)20002000- return rcu_needs_cpu_quick_check(cpu);20962096+ local_irq_save(flags);2001209720022002- /* Don't bother unless we are the last non-dyntick-idle CPU. */20032003- for_each_online_cpu(thatcpu) {20042004- if (thatcpu == cpu)20052005- continue;20062006- snap = atomic_add_return(0, &per_cpu(rcu_dynticks,20072007- thatcpu).dynticks);20082008- smp_mb(); /* Order sampling of snap with end of grace period. 
*/20092009- if ((snap & 0x1) != 0) {20102010- per_cpu(rcu_dyntick_drain, cpu) = 0;20112011- per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;20122012- return rcu_needs_cpu_quick_check(cpu);20132013- }20982098+ /*20992099+ * If there are no callbacks on this CPU, enter dyntick-idle mode.21002100+ * Also reset state to avoid prejudicing later attempts.21012101+ */21022102+ if (!rcu_cpu_has_callbacks(cpu)) {21032103+ per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;21042104+ per_cpu(rcu_dyntick_drain, cpu) = 0;21052105+ local_irq_restore(flags);21062106+ trace_rcu_prep_idle("No callbacks");21072107+ return;21082108+ }21092109+21102110+ /*21112111+ * If in holdoff mode, just return. We will presumably have21122112+ * refrained from disabling the scheduling-clock tick.21132113+ */21142114+ if (per_cpu(rcu_dyntick_holdoff, cpu) == jiffies) {21152115+ local_irq_restore(flags);21162116+ trace_rcu_prep_idle("In holdoff");21172117+ return;20142118 }2015211920162120 /* Check and update the rcu_dyntick_drain sequencing. */20172121 if (per_cpu(rcu_dyntick_drain, cpu) <= 0) {20182122 /* First time through, initialize the counter. */20192019- per_cpu(rcu_dyntick_drain, cpu) = RCU_NEEDS_CPU_FLUSHES;21232123+ per_cpu(rcu_dyntick_drain, cpu) = RCU_IDLE_FLUSHES;21242124+ } else if (per_cpu(rcu_dyntick_drain, cpu) <= RCU_IDLE_OPT_FLUSHES &&21252125+ !rcu_pending(cpu)) {21262126+ /* Can we go dyntick-idle despite still having callbacks? */21272127+ trace_rcu_prep_idle("Dyntick with callbacks");21282128+ per_cpu(rcu_dyntick_drain, cpu) = 0;21292129+ per_cpu(rcu_dyntick_holdoff, cpu) = jiffies - 1;21302130+ hrtimer_start(&per_cpu(rcu_idle_gp_timer, cpu),21312131+ rcu_idle_gp_wait, HRTIMER_MODE_REL);21322132+ return; /* Nothing more to do immediately. */20202133 } else if (--per_cpu(rcu_dyntick_drain, cpu) <= 0) {20212134 /* We have hit the limit, so time to give up. 
*/20222135 per_cpu(rcu_dyntick_holdoff, cpu) = jiffies;20232023- return rcu_needs_cpu_quick_check(cpu);21362136+ local_irq_restore(flags);21372137+ trace_rcu_prep_idle("Begin holdoff");21382138+ invoke_rcu_core(); /* Force the CPU out of dyntick-idle. */21392139+ return;20242140 }2025214120262026- /* Do one step pushing remaining RCU callbacks through. */21422142+ /*21432143+ * Do one step of pushing the remaining RCU callbacks through21442144+ * the RCU core state machine.21452145+ */21462146+#ifdef CONFIG_TREE_PREEMPT_RCU21472147+ if (per_cpu(rcu_preempt_data, cpu).nxtlist) {21482148+ local_irq_restore(flags);21492149+ rcu_preempt_qs(cpu);21502150+ force_quiescent_state(&rcu_preempt_state, 0);21512151+ local_irq_save(flags);21522152+ }21532153+#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */20272154 if (per_cpu(rcu_sched_data, cpu).nxtlist) {21552155+ local_irq_restore(flags);20282156 rcu_sched_qs(cpu);20292157 force_quiescent_state(&rcu_sched_state, 0);20302030- c = c || per_cpu(rcu_sched_data, cpu).nxtlist;21582158+ local_irq_save(flags);20312159 }20322160 if (per_cpu(rcu_bh_data, cpu).nxtlist) {21612161+ local_irq_restore(flags);20332162 rcu_bh_qs(cpu);20342163 force_quiescent_state(&rcu_bh_state, 0);20352035- c = c || per_cpu(rcu_bh_data, cpu).nxtlist;21642164+ local_irq_save(flags);20362165 }2037216620382038- /* If RCU callbacks are still pending, RCU still needs this CPU. */20392039- if (c)21672167+ /*21682168+ * If RCU callbacks are still pending, RCU still needs this CPU.21692169+ * So try forcing the callbacks through the grace period.21702170+ */21712171+ if (rcu_cpu_has_callbacks(cpu)) {21722172+ local_irq_restore(flags);21732173+ trace_rcu_prep_idle("More callbacks");20402174 invoke_rcu_core();20412041- return c;21752175+ } else {21762176+ local_irq_restore(flags);21772177+ trace_rcu_prep_idle("Callbacks drained");21782178+ }20422179}2043218020442181#endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
···579579 struct rt_mutex_waiter *waiter)580580{581581 int ret = 0;582582- int was_disabled;583582584583 for (;;) {585584 /* Try to acquire the lock: */···601602602603 raw_spin_unlock(&lock->wait_lock);603604604604- was_disabled = irqs_disabled();605605- if (was_disabled)606606- local_irq_enable();607607-608605 debug_rt_mutex_print_deadlock(waiter);609606610607 schedule_rt_mutex(lock);611611-612612- if (was_disabled)613613- local_irq_disable();614608615609 raw_spin_lock(&lock->wait_lock);616610 set_current_state(state);
+2-2
kernel/softirq.c
···347347 if (!in_interrupt() && local_softirq_pending())348348 invoke_softirq();349349350350- rcu_irq_exit();351350#ifdef CONFIG_NO_HZ352351 /* Make sure that timer wheel updates are propagated */353352 if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())354354- tick_nohz_stop_sched_tick(0);353353+ tick_nohz_irq_exit();355354#endif355355+ rcu_irq_exit();356356 preempt_enable_no_resched();357357}358358
+60-37
kernel/time/tick-sched.c
···275275}276276EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);277277278278-/**279279- * tick_nohz_stop_sched_tick - stop the idle tick from the idle task280280- *281281- * When the next event is more than a tick into the future, stop the idle tick282282- * Called either from the idle loop or from irq_exit() when an idle period was283283- * just interrupted by an interrupt which did not cause a reschedule.284284- */285285-void tick_nohz_stop_sched_tick(int inidle)278278+static void tick_nohz_stop_sched_tick(struct tick_sched *ts)286279{287287- unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;288288- struct tick_sched *ts;280280+ unsigned long seq, last_jiffies, next_jiffies, delta_jiffies;289281 ktime_t last_update, expires, now;290282 struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;291283 u64 time_delta;292284 int cpu;293285294294- local_irq_save(flags);295295-296286 cpu = smp_processor_id();297287 ts = &per_cpu(tick_cpu_sched, cpu);298298-299299- /*300300- * Call to tick_nohz_start_idle stops the last_update_time from being301301- * updated. Thus, it must not be called in the event we are called from302302- * irq_exit() with the prior state different than idle.303303- */304304- if (!inidle && !ts->inidle)305305- goto end;306306-307307- /*308308- * Set ts->inidle unconditionally. 
Even if the system did not309309- * switch to NOHZ mode the cpu frequency governers rely on the310310- * update of the idle time accounting in tick_nohz_start_idle().311311- */312312- ts->inidle = 1;313288314289 now = tick_nohz_start_idle(cpu, ts);315290···301326 }302327303328 if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))304304- goto end;329329+ return;305330306331 if (need_resched())307307- goto end;332332+ return;308333309334 if (unlikely(local_softirq_pending() && cpu_online(cpu))) {310335 static int ratelimit;···314339 (unsigned int) local_softirq_pending());315340 ratelimit++;316341 }317317- goto end;342342+ return;318343 }319344320345 ts->idle_calls++;···409434 ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);410435 ts->tick_stopped = 1;411436 ts->idle_jiffies = last_jiffies;412412- rcu_enter_nohz();413437 }414438415439 ts->idle_sleeps++;···446472 ts->next_jiffies = next_jiffies;447473 ts->last_jiffies = last_jiffies;448474 ts->sleep_length = ktime_sub(dev->next_event, now);449449-end:450450- local_irq_restore(flags);475475+}476476+477477+/**478478+ * tick_nohz_idle_enter - stop the idle tick from the idle task479479+ *480480+ * When the next event is more than a tick into the future, stop the idle tick.481481+ * Called when we start the idle loop.482482+ *483483+ * The arch is responsible for calling:484484+ *485485+ * - rcu_idle_enter() after its last use of RCU before the CPU is put486486+ * to sleep.487487+ * - rcu_idle_exit() before the first use of RCU after the CPU is woken up.488488+ */489489+void tick_nohz_idle_enter(void)490490+{491491+ struct tick_sched *ts;492492+493493+ WARN_ON_ONCE(irqs_disabled());494494+495495+ local_irq_disable();496496+497497+ ts = &__get_cpu_var(tick_cpu_sched);498498+ /*499499+ * set ts->inidle unconditionally. 
even if the system did not500500+ * switch to nohz mode the cpu frequency governors rely on the501501+ * update of the idle time accounting in tick_nohz_start_idle().502502+ */503503+ ts->inidle = 1;504504+ tick_nohz_stop_sched_tick(ts);505505+506506+ local_irq_enable();507507+}508508+509509+/**510510+ * tick_nohz_irq_exit - update next tick event from interrupt exit511511+ *512512+ * When an interrupt fires while we are idle and it doesn't cause513513+ * a reschedule, it may still add, modify or delete a timer, enqueue514514+ * an RCU callback, etc...515515+ * So we need to re-calculate and reprogram the next tick event.516516+ */517517+void tick_nohz_irq_exit(void)518518+{519519+ struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);520520+521521+ if (!ts->inidle)522522+ return;523523+524524+ tick_nohz_stop_sched_tick(ts);451525}452526453527/**···537515}538516539517/**540540- * tick_nohz_restart_sched_tick - restart the idle tick from the idle task518518+ * tick_nohz_idle_exit - restart the idle tick from the idle task541519 *542520 * Restart the idle tick when the CPU is woken up from idle521521+ * This also exits the RCU extended quiescent state. The CPU522522+ * can use RCU again after this function is called.543523 */544544-void tick_nohz_restart_sched_tick(void)524524+void tick_nohz_idle_exit(void)545525{546526 int cpu = smp_processor_id();547527 struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);···553529 ktime_t now;554530555531 local_irq_disable();532532+556533 if (ts->idle_active || (ts->inidle && ts->tick_stopped))557534 now = ktime_get();558535···567542 }568543569544 ts->inidle = 0;570570-571571- rcu_exit_nohz();572545573546 /* Update jiffies first */574547 select_nohz_load_balancer(0);