Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU update from Ingo Molnar:
"The major features of this tree are:

1. A first version of no-callbacks CPUs. This version prohibits
offlining CPU 0, but only when enabled via CONFIG_RCU_NOCB_CPU=y.
Relaxing this constraint is in progress, but not yet ready
for prime time. These commits were posted to LKML at
https://lkml.org/lkml/2012/10/30/724.

2. Changes to SRCU that allow statically initialized srcu_struct
structures. These commits were posted to LKML at
https://lkml.org/lkml/2012/10/30/296.

3. Restructuring of RCU's debugfs output. These commits were posted
to LKML at https://lkml.org/lkml/2012/10/30/341.

4. Additional CPU-hotplug/RCU improvements, posted to LKML at
https://lkml.org/lkml/2012/10/30/327.
Note that the commit eliminating __stop_machine() was judged to
be too high a risk, so it is deferred to 3.9.

5. Changes to RCU's idle interface, most notably a new module
parameter that redirects normal grace-period operations to
their expedited equivalents. These were posted to LKML at
https://lkml.org/lkml/2012/10/30/739.

6. Additional diagnostics for RCU's CPU stall warning facility,
posted to LKML at https://lkml.org/lkml/2012/10/30/315.
The most notable change reduces the
default RCU CPU stall-warning time from 60 seconds to 21 seconds,
so that it once again happens sooner than the softlockup timeout.

7. Documentation updates, which were posted to LKML at
https://lkml.org/lkml/2012/10/30/280.
A couple of late-breaking changes were posted at
https://lkml.org/lkml/2012/11/16/634 and
https://lkml.org/lkml/2012/11/16/547.

8. Miscellaneous fixes, which were posted to LKML at
https://lkml.org/lkml/2012/10/30/309.

9. Finally, a fix for a lockdep-RCU splat, which was posted to LKML
at https://lkml.org/lkml/2012/11/7/486."
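Item 5's redirection of normal grace periods to their expedited equivalents can be pictured with a short kernel-style sketch. This is illustrative pseudocode, not the series' exact code; the rcu_expedited flag name matches the posted series, but the function bodies here are simplified:

```c
/* Illustrative sketch only: a single flag, settable at boot or via
 * sysfs, steers callers of the normal grace-period primitive onto
 * the expedited path. */
int rcu_expedited;	/* set via the rcu_expedited parameter */

void synchronize_sched(void)
{
	if (rcu_expedited)
		synchronize_sched_expedited();	/* force a grace period now */
	else
		wait_rcu_gp(call_rcu_sched);	/* wait for a normal one */
}
```

The trade-off is the usual one: expedited grace periods complete quickly but disturb all CPUs, so the redirect is opt-in.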

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
context_tracking: New context tracking susbsystem
sched: Mark RCU reader in sched_show_task()
rcu: Separate accounting of callbacks from callback-free CPUs
rcu: Add callback-free CPUs
rcu: Add documentation for the new rcuexp debugfs trace file
rcu: Update documentation for TREE_RCU debugfs tracing
rcu: Reduce default RCU CPU stall warning timeout
rcu: Fix TINY_RCU rcu_is_cpu_rrupt_from_idle check
rcu: Clarify memory-ordering properties of grace-period primitives
rcu: Add new rcutorture module parameters to start/end test messages
rcu: Remove list_for_each_continue_rcu()
rcu: Fix batch-limit size problem
rcu: Add tracing for synchronize_sched_expedited()
rcu: Remove old debugfs interfaces and also RCU flavor name
rcu: split 'rcuhier' to each flavor
rcu: split 'rcugp' to each flavor
rcu: split 'rcuboost' to each flavor
rcu: split 'rcubarrier' to each flavor
rcu: Fix tracing formatting
rcu: Remove the interface "rcudata.csv"
...

+1483 -618
+1 -1
Documentation/RCU/RTFP.txt
···
 
 @article{Kung80
 ,author="H. T. Kung and Q. Lehman"
-,title="Concurrent Maintenance of Binary Search Trees"
+,title="Concurrent Manipulation of Binary Search Trees"
 ,Year="1980"
 ,Month="September"
 ,journal="ACM Transactions on Database Systems"
+8 -9
Documentation/RCU/checklist.txt
···
 	The same cautions apply to call_rcu_bh() and call_rcu_sched().
 
 9.	All RCU list-traversal primitives, which include
-	rcu_dereference(), list_for_each_entry_rcu(),
-	list_for_each_continue_rcu(), and list_for_each_safe_rcu(),
-	must be either within an RCU read-side critical section or
-	must be protected by appropriate update-side locks.  RCU
-	read-side critical sections are delimited by rcu_read_lock()
-	and rcu_read_unlock(), or by similar primitives such as
-	rcu_read_lock_bh() and rcu_read_unlock_bh(), in which case
-	the matching rcu_dereference() primitive must be used in order
-	to keep lockdep happy, in this case, rcu_dereference_bh().
+	rcu_dereference(), list_for_each_entry_rcu(), and
+	list_for_each_safe_rcu(), must be either within an RCU read-side
+	critical section or must be protected by appropriate update-side
+	locks.  RCU read-side critical sections are delimited by
+	rcu_read_lock() and rcu_read_unlock(), or by similar primitives
+	such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
+	case the matching rcu_dereference() primitive must be used in
+	order to keep lockdep happy, in this case, rcu_dereference_bh().
 
 	The reason that it is permissible to use RCU list-traversal
 	primitives when the update-side lock is held is that doing so
+1 -1
Documentation/RCU/listRCU.txt
···
 		audit_copy_rule(&ne->rule, &e->rule);
 		ne->rule.action = newaction;
 		ne->rule.file_count = newfield_count;
-		list_replace_rcu(e, ne);
+		list_replace_rcu(&e->list, &ne->list);
 		call_rcu(&e->rcu, audit_free_rule);
 		return 0;
 	}
+59 -2
Documentation/RCU/rcuref.txt
···
 {					{
     ...					    write_lock(&list_lock);
     atomic_dec(&el->rc, relfunc)	    ...
-    ...					    delete_element
+    ...					    remove_element
 }					    write_unlock(&list_lock);
 					...
 					if (atomic_dec_and_test(&el->rc))
···
 {					{
     ...					    spin_lock(&list_lock);
     if (atomic_dec_and_test(&el->rc))	    ...
-	call_rcu(&el->head, el_free);	    delete_element
+	call_rcu(&el->head, el_free);	    remove_element
     ...					    spin_unlock(&list_lock);
 }					...
 					if (atomic_dec_and_test(&el->rc))
···
 update (write) stream.  In such cases, atomic_inc_not_zero() might be
 overkill, since we hold the update-side spinlock.  One might instead
 use atomic_inc() in such cases.
+
+It is not always convenient to deal with "FAIL" in the
+search_and_reference() code path.  In such cases, the
+atomic_dec_and_test() may be moved from delete() to el_free()
+as follows:
+
+1.					2.
+add()					search_and_reference()
+{					{
+    alloc_object			    rcu_read_lock();
+    ...					    search_for_element
+    atomic_set(&el->rc, 1);		    atomic_inc(&el->rc);
+    spin_lock(&list_lock);		    ...
+
+    add_element				    rcu_read_unlock();
+    ...					}
+    spin_unlock(&list_lock);		4.
+}					delete()
+3.					{
+release_referenced()			    spin_lock(&list_lock);
+{					    ...
+    ...					    remove_element
+    if (atomic_dec_and_test(&el->rc))	    spin_unlock(&list_lock);
+        kfree(el);			    ...
+    ...					    call_rcu(&el->head, el_free);
+}					    ...
+5.					}
+void el_free(struct rcu_head *rhp)
+{
+    release_referenced();
+}
+
+The key point is that the initial reference added by add() is not removed
+until after a grace period has elapsed following removal.  This means that
+search_and_reference() cannot find this element, which means that the value
+of el->rc cannot increase.  Thus, once it reaches zero, there are no
+readers that can or ever will be able to reference the element.  The
+element can therefore safely be freed.  This in turn guarantees that if
+any reader finds the element, that reader may safely acquire a reference
+without checking the value of the reference counter.
+
+In cases where delete() can sleep, synchronize_rcu() can be called from
+delete(), so that el_free() can be subsumed into delete as follows:
+
+4.
+delete()
+{
+	spin_lock(&list_lock);
+	...
+	remove_element
+	spin_unlock(&list_lock);
+	...
+	synchronize_rcu();
+	if (atomic_dec_and_test(&el->rc))
+		kfree(el);
+	...
+}
+211 -171
Documentation/RCU/trace.txt
···
 
 CONFIG_TREE_RCU and CONFIG_TREE_PREEMPT_RCU debugfs Files and Formats
 
-These implementations of RCU provides several debugfs files under the
+These implementations of RCU provide several debugfs directories under the
 top-level directory "rcu":
 
-rcu/rcudata:
+rcu/rcu_bh
+rcu/rcu_preempt
+rcu/rcu_sched
+
+Each directory contains files for the corresponding flavor of RCU.
+Note that rcu/rcu_preempt is only present for CONFIG_TREE_PREEMPT_RCU.
+For CONFIG_TREE_RCU, the RCU flavor maps onto the RCU-sched flavor,
+so that activity for both appears in rcu/rcu_sched.
+
+In addition, the following file appears in the top-level directory:
+rcu/rcutorture.  This file displays rcutorture test progress.  The output
+of "cat rcu/rcutorture" looks as follows:
+
+rcutorture test sequence: 0 (test in progress)
+rcutorture update version number: 615
+
+The first line shows the number of rcutorture tests that have completed
+since boot.  If a test is currently running, the "(test in progress)"
+string will appear as shown above.  The second line shows the number of
+update cycles that the current test has started, or zero if there is
+no test in progress.
+
+
+Within each flavor directory (rcu/rcu_bh, rcu/rcu_sched, and possibly
+also rcu/rcu_preempt) the following files will be present:
+
+rcudata:
 	Displays fields in struct rcu_data.
-rcu/rcudata.csv:
-	Comma-separated values spreadsheet version of rcudata.
-rcu/rcugp:
+rcuexp:
+	Displays statistics for expedited grace periods.
+rcugp:
 	Displays grace-period counters.
-rcu/rcuhier:
+rcuhier:
 	Displays the struct rcu_node hierarchy.
-rcu/rcu_pending:
+rcu_pending:
 	Displays counts of the reasons rcu_pending() decided that RCU had
 	work to do.
-rcu/rcutorture:
-	Displays rcutorture test progress.
-rcu/rcuboost:
+rcuboost:
 	Displays RCU boosting statistics.  Only present if
 	CONFIG_RCU_BOOST=y.
 
-The output of "cat rcu/rcudata" looks as follows:
+The output of "cat rcu/rcu_preempt/rcudata" looks as follows:
 
-rcu_sched:
-  0 c=20972 g=20973 pq=1 pgp=20973 qp=0 dt=545/1/0 df=50 of=0 ql=163 qs=NRW. kt=0/W/0 ktl=ebc3 b=10 ci=153737 co=0 ca=0
-  1 c=20972 g=20973 pq=1 pgp=20973 qp=0 dt=967/1/0 df=58 of=0 ql=634 qs=NRW. kt=0/W/1 ktl=58c b=10 ci=191037 co=0 ca=0
-  2 c=20972 g=20973 pq=1 pgp=20973 qp=0 dt=1081/1/0 df=175 of=0 ql=74 qs=N.W. kt=0/W/2 ktl=da94 b=10 ci=75991 co=0 ca=0
-  3 c=20942 g=20943 pq=1 pgp=20942 qp=1 dt=1846/0/0 df=404 of=0 ql=0 qs=.... kt=0/W/3 ktl=d1cd b=10 ci=72261 co=0 ca=0
-  4 c=20972 g=20973 pq=1 pgp=20973 qp=0 dt=369/1/0 df=83 of=0 ql=48 qs=N.W. kt=0/W/4 ktl=e0e7 b=10 ci=128365 co=0 ca=0
-  5 c=20972 g=20973 pq=1 pgp=20973 qp=0 dt=381/1/0 df=64 of=0 ql=169 qs=NRW. kt=0/W/5 ktl=fb2f b=10 ci=164360 co=0 ca=0
-  6 c=20972 g=20973 pq=1 pgp=20973 qp=0 dt=1037/1/0 df=183 of=0 ql=62 qs=N.W. kt=0/W/6 ktl=d2ad b=10 ci=65663 co=0 ca=0
-  7 c=20897 g=20897 pq=1 pgp=20896 qp=0 dt=1572/0/0 df=382 of=0 ql=0 qs=.... kt=0/W/7 ktl=cf15 b=10 ci=75006 co=0 ca=0
-rcu_bh:
-  0 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=545/1/0 df=6 of=0 ql=0 qs=.... kt=0/W/0 ktl=ebc3 b=10 ci=0 co=0 ca=0
-  1 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=967/1/0 df=3 of=0 ql=0 qs=.... kt=0/W/1 ktl=58c b=10 ci=151 co=0 ca=0
-  2 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=1081/1/0 df=6 of=0 ql=0 qs=.... kt=0/W/2 ktl=da94 b=10 ci=0 co=0 ca=0
-  3 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=1846/0/0 df=8 of=0 ql=0 qs=.... kt=0/W/3 ktl=d1cd b=10 ci=0 co=0 ca=0
-  4 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=369/1/0 df=6 of=0 ql=0 qs=.... kt=0/W/4 ktl=e0e7 b=10 ci=0 co=0 ca=0
-  5 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=381/1/0 df=4 of=0 ql=0 qs=.... kt=0/W/5 ktl=fb2f b=10 ci=0 co=0 ca=0
-  6 c=1480 g=1480 pq=1 pgp=1480 qp=0 dt=1037/1/0 df=6 of=0 ql=0 qs=.... kt=0/W/6 ktl=d2ad b=10 ci=0 co=0 ca=0
-  7 c=1474 g=1474 pq=1 pgp=1473 qp=0 dt=1572/0/0 df=8 of=0 ql=0 qs=.... kt=0/W/7 ktl=cf15 b=10 ci=0 co=0 ca=0
+  0!c=30455 g=30456 pq=1 qp=1 dt=126535/140000000000000/0 df=2002 of=4 ql=0/0 qs=N... b=10 ci=74572 nci=0 co=1131 ca=716
+  1!c=30719 g=30720 pq=1 qp=0 dt=132007/140000000000000/0 df=1874 of=10 ql=0/0 qs=N... b=10 ci=123209 nci=0 co=685 ca=982
+  2!c=30150 g=30151 pq=1 qp=1 dt=138537/140000000000000/0 df=1707 of=8 ql=0/0 qs=N... b=10 ci=80132 nci=0 co=1328 ca=1458
+  3 c=31249 g=31250 pq=1 qp=0 dt=107255/140000000000000/0 df=1749 of=6 ql=0/450 qs=NRW. b=10 ci=151700 nci=0 co=509 ca=622
+  4!c=29502 g=29503 pq=1 qp=1 dt=83647/140000000000000/0 df=965 of=5 ql=0/0 qs=N... b=10 ci=65643 nci=0 co=1373 ca=1521
+  5 c=31201 g=31202 pq=1 qp=1 dt=70422/0/0 df=535 of=7 ql=0/0 qs=.... b=10 ci=58500 nci=0 co=764 ca=698
+  6!c=30253 g=30254 pq=1 qp=1 dt=95363/140000000000000/0 df=780 of=5 ql=0/0 qs=N... b=10 ci=100607 nci=0 co=1414 ca=1353
+  7 c=31178 g=31178 pq=1 qp=0 dt=91536/0/0 df=547 of=4 ql=0/0 qs=.... b=10 ci=109819 nci=0 co=1115 ca=969
 
-The first section lists the rcu_data structures for rcu_sched, the second
-for rcu_bh.  Note that CONFIG_TREE_PREEMPT_RCU kernels will have an
-additional section for rcu_preempt.  Each section has one line per CPU,
-or eight for this 8-CPU system.  The fields are as follows:
+This file has one line per CPU, or eight for this 8-CPU system.
+The fields are as follows:
 
 o	The number at the beginning of each line is the CPU number.
 	CPUs numbers followed by an exclamation mark are offline,
···
 	substantially larger than the number of actual CPUs.
 
 o	"c" is the count of grace periods that this CPU believes have
-	completed.  Offlined CPUs and CPUs in dynticks idle mode may
-	lag quite a ways behind, for example, CPU 6 under "rcu_sched"
-	above, which has been offline through not quite 40,000 RCU grace
-	periods.  It is not unusual to see CPUs lagging by thousands of
-	grace periods.
+	completed.  Offlined CPUs and CPUs in dynticks idle mode may lag
+	quite a ways behind, for example, CPU 4 under "rcu_sched" above,
+	which has been offline through 16 RCU grace periods.  It is not
+	unusual to see offline CPUs lagging by thousands of grace periods.
+	Note that although the grace-period number is an unsigned long,
+	it is printed out as a signed long to allow more human-friendly
+	representation near boot time.
 
 o	"g" is the count of grace periods that this CPU believes have
 	started.  Again, offlined CPUs and CPUs in dynticks idle mode
···
 	CPU has not yet reported that fact, (2) some other CPU has not
 	yet reported for this grace period, or (3) both.
 
-o	"pgp" indicates which grace period the last-observed quiescent
-	state for this CPU corresponds to.  This is important for handling
-	the race between CPU 0 reporting an extended dynticks-idle
-	quiescent state for CPU 1 and CPU 1 suddenly waking up and
-	reporting its own quiescent state.  If CPU 1 was the last CPU
-	for the current grace period, then the CPU that loses this race
-	will attempt to incorrectly mark CPU 1 as having checked in for
-	the next grace period!
-
 o	"qp" indicates that RCU still expects a quiescent state from
 	this CPU.  Offlined CPUs and CPUs in dyntick idle mode might
 	well have qp=1, which is OK: RCU is still ignoring them.
 
 o	"dt" is the current value of the dyntick counter that is incremented
-	when entering or leaving dynticks idle state, either by the
-	scheduler or by irq.  This number is even if the CPU is in
-	dyntick idle mode and odd otherwise.  The number after the first
-	"/" is the interrupt nesting depth when in dyntick-idle state,
-	or one greater than the interrupt-nesting depth otherwise.
-	The number after the second "/" is the NMI nesting depth.
+	when entering or leaving idle, either due to a context switch or
+	due to an interrupt.  This number is even if the CPU is in idle
+	from RCU's viewpoint and odd otherwise.  The number after the
+	first "/" is the interrupt nesting depth when in idle state,
+	or a large number added to the interrupt-nesting depth when
+	running a non-idle task.  Some architectures do not accurately
+	count interrupt nesting when running in non-idle kernel context,
+	which can result in interesting anomalies such as negative
+	interrupt-nesting levels.  The number after the second "/"
+	is the NMI nesting depth.
 
 o	"df" is the number of times that some other CPU has forced a
 	quiescent state on behalf of this CPU due to this CPU being in
-	dynticks-idle state.
+	idle state.
 
 o	"of" is the number of times that some other CPU has forced a
 	quiescent state on behalf of this CPU due to this CPU being
···
 	error, so it makes sense to err conservatively.
 
 o	"ql" is the number of RCU callbacks currently residing on
-	this CPU.  This is the total number of callbacks, regardless
-	of what state they are in (new, waiting for grace period to
-	start, waiting for grace period to end, ready to invoke).
+	this CPU.  The first number is the number of "lazy" callbacks
+	that are known to RCU to only be freeing memory, and the number
+	after the "/" is the total number of callbacks, lazy or not.
+	These counters count callbacks regardless of what phase of
+	grace-period processing that they are in (new, waiting for
+	grace period to start, waiting for grace period to end, ready
+	to invoke).
 
 o	"qs" gives an indication of the state of the callback queue
 	with four characters:
···
 
 	If there are no callbacks in a given one of the above states,
 	the corresponding character is replaced by ".".
+
+o	"b" is the batch limit for this CPU.  If more than this number
+	of RCU callbacks is ready to invoke, then the remainder will
+	be deferred.
+
+o	"ci" is the number of RCU callbacks that have been invoked for
+	this CPU.  Note that ci+nci+ql is the number of callbacks that have
+	been registered in absence of CPU-hotplug activity.
+
+o	"nci" is the number of RCU callbacks that have been offloaded from
+	this CPU.  This will always be zero unless the kernel was built
+	with CONFIG_RCU_NOCB_CPU=y and the "rcu_nocbs=" kernel boot
+	parameter was specified.
+
+o	"co" is the number of RCU callbacks that have been orphaned due to
+	this CPU going offline.  These orphaned callbacks have been moved
+	to an arbitrarily chosen online CPU.
+
+o	"ca" is the number of RCU callbacks that have been adopted by this
+	CPU due to other CPUs going offline.  Note that ci+co-ca+ql is
+	the number of RCU callbacks registered on this CPU.
+
+
+Kernels compiled with CONFIG_RCU_BOOST=y display the following from
+/debug/rcu/rcu_preempt/rcudata:
+
+  0!c=12865 g=12866 pq=1 qp=1 dt=83113/140000000000000/0 df=288 of=11 ql=0/0 qs=N... kt=0/O ktl=944 b=10 ci=60709 nci=0 co=748 ca=871
+  1 c=14407 g=14408 pq=1 qp=0 dt=100679/140000000000000/0 df=378 of=7 ql=0/119 qs=NRW. kt=0/W ktl=9b6 b=10 ci=109740 nci=0 co=589 ca=485
+  2 c=14407 g=14408 pq=1 qp=0 dt=105486/0/0 df=90 of=9 ql=0/89 qs=NRW. kt=0/W ktl=c0c b=10 ci=83113 nci=0 co=533 ca=490
+  3 c=14407 g=14408 pq=1 qp=0 dt=107138/0/0 df=142 of=8 ql=0/188 qs=NRW. kt=0/W ktl=b96 b=10 ci=121114 nci=0 co=426 ca=290
+  4 c=14405 g=14406 pq=1 qp=1 dt=50238/0/0 df=706 of=7 ql=0/0 qs=.... kt=0/W ktl=812 b=10 ci=34929 nci=0 co=643 ca=114
+  5!c=14168 g=14169 pq=1 qp=0 dt=45465/140000000000000/0 df=161 of=11 ql=0/0 qs=N... kt=0/O ktl=b4d b=10 ci=47712 nci=0 co=677 ca=722
+  6 c=14404 g=14405 pq=1 qp=0 dt=59454/0/0 df=94 of=6 ql=0/0 qs=.... kt=0/W ktl=e57 b=10 ci=55597 nci=0 co=701 ca=811
+  7 c=14407 g=14408 pq=1 qp=1 dt=68850/0/0 df=31 of=8 ql=0/0 qs=.... kt=0/W ktl=14bd b=10 ci=77475 nci=0 co=508 ca=1042
+
+This is similar to the output discussed above, but contains the following
+additional fields:
 
 o	"kt" is the per-CPU kernel-thread state.  The digit preceding
 	the first slash is zero if there is no work pending and 1
···
 
 	This field is displayed only for CONFIG_RCU_BOOST kernels.
 
-o	"b" is the batch limit for this CPU.  If more than this number
-	of RCU callbacks is ready to invoke, then the remainder will
-	be deferred.
-
-o	"ci" is the number of RCU callbacks that have been invoked for
-	this CPU.  Note that ci+ql is the number of callbacks that have
-	been registered in absence of CPU-hotplug activity.
-
-o	"co" is the number of RCU callbacks that have been orphaned due to
-	this CPU going offline.  These orphaned callbacks have been moved
-	to an arbitrarily chosen online CPU.
-
-o	"ca" is the number of RCU callbacks that have been adopted due to
-	other CPUs going offline.  Note that ci+co-ca+ql is the number of
-	RCU callbacks registered on this CPU.
-
-There is also an rcu/rcudata.csv file with the same information in
-comma-separated-variable spreadsheet format.
+The output of "cat rcu/rcu_preempt/rcuexp" looks as follows:
+
+s=21872 d=21872 w=0 tf=0 wd1=0 wd2=0 n=0 sc=21872 dt=21872 dl=0 dx=21872
+
+These fields are as follows:
+
+o	"s" is the starting sequence number.
+
+o	"d" is the ending sequence number.  When the starting and ending
+	numbers differ, there is an expedited grace period in progress.
+
+o	"w" is the number of times that the sequence numbers have been
+	in danger of wrapping.
+
+o	"tf" is the number of times that contention has resulted in a
+	failure to begin an expedited grace period.
+
+o	"wd1" and "wd2" are the number of times that an attempt to
+	start an expedited grace period found that someone else had
+	completed an expedited grace period that satisfies the
+	attempted request.  "Our work is done."
+
+o	"n" is number of times that contention was so great that
+	the request was demoted from an expedited grace period to
+	a normal grace period.
+
+o	"sc" is the number of times that the attempt to start a
+	new expedited grace period succeeded.
+
+o	"dt" is the number of times that we attempted to update
+	the "d" counter.
+
+o	"dl" is the number of times that we failed to update the "d"
+	counter.
+
+o	"dx" is the number of times that we succeeded in updating
+	the "d" counter.
 
 
-The output of "cat rcu/rcugp" looks as follows:
+The output of "cat rcu/rcu_preempt/rcugp" looks as follows:
 
-rcu_sched: completed=33062  gpnum=33063
-rcu_bh: completed=464  gpnum=464
+completed=31249  gpnum=31250  age=1  max=18
 
-Again, this output is for both "rcu_sched" and "rcu_bh".  Note that
-kernels built with CONFIG_TREE_PREEMPT_RCU will have an additional
-"rcu_preempt" line.  The fields are taken from the rcu_state structure,
-and are as follows:
+These fields are taken from the rcu_state structure, and are as follows:
 
 o	"completed" is the number of grace periods that have completed.
 	It is comparable to the "c" field from rcu/rcudata in that a
···
 	that the corresponding RCU grace period has completed.
 
 o	"gpnum" is the number of grace periods that have started.  It is
-	comparable to the "g" field from rcu/rcudata in that a CPU
-	whose "g" field matches the value of "gpnum" is aware that the
-	corresponding RCU grace period has started.
+	similarly comparable to the "g" field from rcu/rcudata in that
+	a CPU whose "g" field matches the value of "gpnum" is aware that
+	the corresponding RCU grace period has started.
 
-	If these two fields are equal (as they are for "rcu_bh" above),
-	then there is no grace period in progress, in other words, RCU
-	is idle.  On the other hand, if the two fields differ (as they
-	do for "rcu_sched" above), then an RCU grace period is in progress.
+	If these two fields are equal, then there is no grace period
+	in progress, in other words, RCU is idle.  On the other hand,
+	if the two fields differ (as they are above), then an RCU grace
+	period is in progress.
 
+o	"age" is the number of jiffies that the current grace period
+	has extended for, or zero if there is no grace period currently
+	in effect.
 
-The output of "cat rcu/rcuhier" looks as follows, with very long lines:
+o	"max" is the age in jiffies of the longest-duration grace period
+	thus far.
 
-c=6902 g=6903 s=2 jfq=3 j=72c7 nfqs=13142/nfqsng=0(13142) fqlh=6
-1/1 ..>. 0:127 ^0
-3/3 ..>. 0:35 ^0    0/0 ..>. 36:71 ^1    0/0 ..>. 72:107 ^2    0/0 ..>. 108:127 ^3
-3/3f ..>. 0:5 ^0    2/3 ..>. 6:11 ^1    0/0 ..>. 12:17 ^2    0/0 ..>. 18:23 ^3    0/0 ..>. 24:29 ^4    0/0 ..>. 30:35 ^5    0/0 ..>. 36:41 ^0    0/0 ..>. 42:47 ^1    0/0 ..>. 48:53 ^2    0/0 ..>. 54:59 ^3    0/0 ..>. 60:65 ^4    0/0 ..>. 66:71 ^5    0/0 ..>. 72:77 ^0    0/0 ..>. 78:83 ^1    0/0 ..>. 84:89 ^2    0/0 ..>. 90:95 ^3    0/0 ..>. 96:101 ^4    0/0 ..>. 102:107 ^5    0/0 ..>. 108:113 ^0    0/0 ..>. 114:119 ^1    0/0 ..>. 120:125 ^2    0/0 ..>. 126:127 ^3
-rcu_bh:
-c=-226 g=-226 s=1 jfq=-5701 j=72c7 nfqs=88/nfqsng=0(88) fqlh=0
-0/1 ..>. 0:127 ^0
-0/3 ..>. 0:35 ^0    0/0 ..>. 36:71 ^1    0/0 ..>. 72:107 ^2    0/0 ..>. 108:127 ^3
-0/3f ..>. 0:5 ^0    0/3 ..>. 6:11 ^1    0/0 ..>. 12:17 ^2    0/0 ..>. 18:23 ^3    0/0 ..>. 24:29 ^4    0/0 ..>. 30:35 ^5    0/0 ..>. 36:41 ^0    0/0 ..>. 42:47 ^1    0/0 ..>. 48:53 ^2    0/0 ..>. 54:59 ^3    0/0 ..>. 60:65 ^4    0/0 ..>. 66:71 ^5    0/0 ..>. 72:77 ^0    0/0 ..>. 78:83 ^1    0/0 ..>. 84:89 ^2    0/0 ..>. 90:95 ^3    0/0 ..>. 96:101 ^4    0/0 ..>. 102:107 ^5    0/0 ..>. 108:113 ^0    0/0 ..>. 114:119 ^1    0/0 ..>. 120:125 ^2    0/0 ..>. 126:127 ^3
+The output of "cat rcu/rcu_preempt/rcuhier" looks as follows:
 
-This is once again split into "rcu_sched" and "rcu_bh" portions,
-and CONFIG_TREE_PREEMPT_RCU kernels will again have an additional
-"rcu_preempt" section.  The fields are as follows:
+c=14407 g=14408 s=0 jfq=2 j=c863 nfqs=12040/nfqsng=0(12040) fqlh=1051 oqlen=0/0
+3/3 ..>. 0:7 ^0
+e/e ..>. 0:3 ^0    d/d ..>. 4:7 ^1
 
-o	"c" is exactly the same as "completed" under rcu/rcugp.
+The fields are as follows:
 
-o	"g" is exactly the same as "gpnum" under rcu/rcugp.
+o	"c" is exactly the same as "completed" under rcu/rcu_preempt/rcugp.
 
-o	"s" is the "signaled" state that drives force_quiescent_state()'s
+o	"g" is exactly the same as "gpnum" under rcu/rcu_preempt/rcugp.
+
+o	"s" is the current state of the force_quiescent_state()
 	state machine.
 
 o	"jfq" is the number of jiffies remaining for this grace period
 	before force_quiescent_state() is invoked to help push things
-	along.  Note that CPUs in dyntick-idle mode throughout the grace
-	period will not report on their own, but rather must be check by
-	some other CPU via force_quiescent_state().
+	along.  Note that CPUs in idle mode throughout the grace period
+	will not report on their own, but rather must be check by some
+	other CPU via force_quiescent_state().
 
 o	"j" is the low-order four hex digits of the jiffies counter.
 	Yes, Paul did run into a number of problems that turned out to
···
 
 o	"nfqsng" is the number of useless calls to force_quiescent_state(),
 	where there wasn't actually a grace period active.  This can
-	happen due to races.  The number in parentheses is the difference
+	no longer happen due to grace-period processing being pushed
+	into a kthread.  The number in parentheses is the difference
 	between "nfqs" and "nfqsng", or the number of times that
 	force_quiescent_state() actually did some real work.
···
 	exited immediately (without even being counted in nfqs above)
 	due to contention on ->fqslock.
 
-o	Each element of the form "1/1 0:127 ^0" represents one struct
-	rcu_node.  Each line represents one level of the hierarchy, from
-	root to leaves.  It is best to think of the rcu_data structures
-	as forming yet another level after the leaves.  Note that there
-	might be either one, two, or three levels of rcu_node structures,
-	depending on the relationship between CONFIG_RCU_FANOUT and
-	CONFIG_NR_CPUS.
+o	Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node
+	structure.  Each line represents one level of the hierarchy,
+	from root to leaves.  It is best to think of the rcu_data
+	structures as forming yet another level after the leaves.
+	Note that there might be either one, two, three, or even four
+	levels of rcu_node structures, depending on the relationship
+	between CONFIG_RCU_FANOUT, CONFIG_RCU_FANOUT_LEAF (possibly
+	adjusted using the rcu_fanout_leaf kernel boot parameter), and
+	CONFIG_NR_CPUS (possibly adjusted using the nr_cpu_ids count of
+	possible CPUs for the booting hardware).
 
 o	The numbers separated by the "/" are the qsmask followed
 	by the qsmaskinit.  The qsmask will have one bit
-	set for each entity in the next lower level that
-	has not yet checked in for the current grace period.
+	set for each entity in the next lower level that has
+	not yet checked in for the current grace period ("e"
+	indicating CPUs 5, 6, and 7 in the example above).
 	The qsmaskinit will have one bit for each entity that is
 	currently expected to check in during each grace period.
 	The value of qsmaskinit is assigned to that of qsmask
 	at the beginning of each grace period.
-
-	For example, for "rcu_sched", the qsmask of the first
-	entry of the lowest level is 0x14, meaning that we
-	are still waiting for CPUs 2 and 4 to check in for the
-	current grace period.
 
 o	The characters separated by the ">" indicate the state
 	of the blocked-tasks lists.  A "G" preceding the ">"
···
 	A "." character appears if the corresponding condition
 	does not hold, so that "..>." indicates that no tasks
 	are blocked.  In contrast, "GE>T" indicates maximal
-	inconvenience from blocked tasks.
+	inconvenience from blocked tasks.  CONFIG_TREE_RCU
+	builds of the kernel will always show "..>.".
 
 o	The numbers separated by the ":" are the range of CPUs
 	served by this struct rcu_node.  This can be helpful
 	in working out how the hierarchy is wired together.
 
-	For example, the first entry at the lowest level shows
-	"0:5", indicating that it covers CPUs 0 through 5.
+	For example, the example rcu_node structure shown above
+	has "0:7", indicating that it covers CPUs 0 through 7.
 
 o	The number after the "^" indicates the bit in the
-	next higher level rcu_node structure that this
-	rcu_node structure corresponds to.
+	next higher level rcu_node structure that this rcu_node
+	structure corresponds to.  For example, the "d/d ..>. 4:7
+	^1" has a "1" in this position, indicating that it
+	corresponds to the "1" bit in the "3" shown in the
+	"3/3 ..>. 0:7 ^0" entry on the next level up.
 
-	For example, the first entry at the lowest level shows
-	"^0", indicating that it corresponds to bit zero in
-	the first entry at the middle level.
 
+The output of "cat rcu/rcu_sched/rcu_pending" looks as follows:
 
-The output of "cat rcu/rcu_pending" looks as follows:
+  0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903
+  1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113
+  2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889
+  3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469
+  4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042
+  5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422
+  6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699
+  7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147
 
-rcu_sched:
-  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nn=146741
-  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nn=155792
-  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nn=136629
-  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nn=137723
-  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nn=123110
-  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nn=137456
-  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nn=120834
-  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nn=144888
-rcu_bh:
-  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nn=145314
-  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nn=143180
-  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nn=117936
-  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nn=134863
-  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nn=110671
-  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nn=133235
-  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nn=110921
-  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nn=118542
+The fields are as follows:
 
-As always, this is once again split into "rcu_sched" and "rcu_bh"
-portions, with CONFIG_TREE_PREEMPT_RCU kernels having an additional
-"rcu_preempt" section.  The fields are as follows:
+o	The leading number is the CPU number, with "!" indicating
+	an offline CPU.
 
 o	"np" is the number of times that __rcu_pending() has been invoked
 	for the corresponding flavor of RCU.
···
 o	"gps" is the number of times that a new grace period had started,
 	but this CPU was not yet aware of it.
 
-o	"nn" is the number of times that this CPU needed nothing.  Alert
-	readers will note that the rcu "nn" number for a given CPU very
-	closely matches the rcu_bh "np" number for that same CPU.  This
-	is due to short-circuit evaluation in rcu_pending().
-
-
-The output of "cat rcu/rcutorture" looks as follows:
-
-rcutorture test sequence: 0 (test in progress)
-rcutorture update version number: 615
-
-The first line shows the number of rcutorture tests that have completed
-since boot.  If a test is currently running, the "(test in progress)"
-string will appear as shown above.  The second line shows the number of
-update cycles that the current test has started, or zero if there is
-no test in progress.
+o	"nn" is the number of times that this CPU needed nothing.
 
 
 The output of "cat rcu/rcuboost" looks as follows:
 
-0:5 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=2f95 bt=300f
-    balk: nt=0 egt=989 bt=0 nb=0 ny=0 nos=16
-6:7 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=2f95 bt=300f
-    balk: nt=0 egt=225 bt=0 nb=0 ny=0 nos=6
+0:3 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894
+    balk: nt=0 egt=4695 bt=0 nb=0 ny=56 nos=0
+4:7 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894
+    balk: nt=0 egt=6541 bt=0 nb=0 ny=126 nos=0
 
 This information is output only for rcu_preempt.  Each two-line entry
 corresponds to a leaf rcu_node strcuture.  The fields are as follows:
 
 o	"n:m" is the CPU-number range for the corresponding two-line
 	entry.  In the sample output above, the first entry covers
-	CPUs zero through five and the second entry covers CPUs 6
-	and 7.
+	CPUs zero through three and the second entry covers CPUs four
+	through seven.
 
 o	"tasks=TNEB" gives the state of the various segments of the
 	rnp->blocked_tasks list:
+12 -5
Documentation/RCU/whatisRCU.txt
···
 	{
 		struct foo *fp = container_of(rp, struct foo, rcu);
 
+		foo_cleanup(fp->a);
+
 		kfree(fp);
 	}
 
···
 	function that will be invoked after the completion of all RCU
 	read-side critical sections that might be referencing that
 	data item.
+
+	If the callback for call_rcu() is not doing anything more than calling
+	kfree() on the structure, you can use kfree_rcu() instead of call_rcu()
+	to avoid having to write your own callback:
+
+		kfree_rcu(old_fp, rcu);
 
 Again, see checklist.txt for additional rules governing the use of RCU.
 
···
 
 Also, the presence of synchronize_rcu() means that the RCU version of
 delete() can now block.  If this is a problem, there is a callback-based
-mechanism that never blocks, namely call_rcu(), that can be used in
-place of synchronize_rcu().
+mechanism that never blocks, namely call_rcu() or kfree_rcu(), that can
+be used in place of synchronize_rcu().
 
 
 7.  FULL LIST OF RCU APIs
···
 	list_for_each_entry_rcu
 	hlist_for_each_entry_rcu
 	hlist_nulls_for_each_entry_rcu
-
-	list_for_each_continue_rcu	(to be deprecated in favor of new
-					 list_for_each_entry_continue_rcu)
+	list_for_each_entry_continue_rcu
 
 RCU pointer/list update:
 
···
 	rcu_read_unlock		synchronize_rcu
 	rcu_dereference		synchronize_rcu_expedited
 				call_rcu
+				kfree_rcu
 
 
 bh:	Critical sections	Grace period		Barrier
+21
Documentation/kernel-parameters.txt
···
 	ramdisk_size=	[RAM] Sizes of RAM disks in kilobytes
 			See Documentation/blockdev/ramdisk.txt.
 
+	rcu_nocbs=	[KNL,BOOT]
+			In kernels built with CONFIG_RCU_NOCB_CPU=y, set
+			the specified list of CPUs to be no-callback CPUs.
+			Invocation of these CPUs' RCU callbacks will
+			be offloaded to "rcuoN" kthreads created for
+			that purpose.  This reduces OS jitter on the
+			offloaded CPUs, which can be useful for HPC and
+			real-time workloads.  It can also improve energy
+			efficiency for asymmetric multiprocessors.
+
+	rcu_nocbs_poll	[KNL,BOOT]
+			Rather than requiring that offloaded CPUs
+			(specified by rcu_nocbs= above) explicitly
+			awaken the corresponding "rcuoN" kthreads,
+			make these kthreads poll for callbacks.
+			This improves the real-time response for the
+			offloaded CPUs by relieving them of the need to
+			wake up the corresponding kthread, but degrades
+			energy efficiency by requiring that the kthreads
+			periodically wake up to do the polling.
+
 	rcutree.blimit=	[KNL,BOOT]
 			Set maximum number of finished RCU callbacks to process
 			in one batch.
+5 -4
Documentation/memory-barriers.txt
···
 
 And for:
 
-	*A = X; Y = *A;
+	*A = X; *(A + 4) = Y;
 
-we may get either of:
+we may get any of:
 
-	STORE *A = X; Y = LOAD *A;
-	STORE *A = Y = X;
+	STORE *A = X; STORE *(A + 4) = Y;
+	STORE *(A + 4) = Y; STORE *A = X;
+	STORE {*A, *(A + 4) } = {X, Y};
 
 
 =========================
+8 -7
arch/Kconfig
···
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
-config HAVE_RCU_USER_QS
+config HAVE_CONTEXT_TRACKING
 	bool
 	help
-	  Provide kernel entry/exit hooks necessary for userspace
-	  RCU extended quiescent state. Syscalls need to be wrapped inside
-	  rcu_user_exit()-rcu_user_enter() through the slow path using
-	  TIF_NOHZ flag. Exceptions handlers must be wrapped as well. Irqs
-	  are already protected inside rcu_irq_enter/rcu_irq_exit() but
-	  preemption or signal handling on irq exit still need to be protected.
+	  Provide kernel/user boundaries probes necessary for subsystems
+	  that need it, such as userspace RCU extended quiescent state.
+	  Syscalls need to be wrapped inside user_exit()-user_enter() through
+	  the slow path using TIF_NOHZ flag. Exceptions handlers must be
+	  wrapped as well. Irqs are already protected inside
+	  rcu_irq_enter/rcu_irq_exit() but preemption or signal handling on
+	  irq exit still need to be protected.
 
 config HAVE_VIRT_CPU_ACCOUNTING
 	bool
+1 -1
arch/um/drivers/mconsole_kern.c
···
 	struct task_struct *from = current, *to = arg;
 
 	to->thread.saved_task = from;
-	rcu_switch(from, to);
+	rcu_user_hooks_switch(from, to);
 	switch_to(from, to, from);
 }
 
+1 -1
arch/x86/Kconfig
···
 	select KTIME_SCALAR if X86_32
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
-	select HAVE_RCU_USER_QS if X86_64
+	select HAVE_CONTEXT_TRACKING if X86_64
 	select HAVE_IRQ_TIME_ACCOUNTING
 	select GENERIC_KERNEL_THREAD
 	select GENERIC_KERNEL_EXECVE
+7 -8
arch/x86/include/asm/rcu.h arch/x86/include/asm/context_tracking.h
-#ifndef _ASM_X86_RCU_H
-#define _ASM_X86_RCU_H
+#ifndef _ASM_X86_CONTEXT_TRACKING_H
+#define _ASM_X86_CONTEXT_TRACKING_H
 
 #ifndef __ASSEMBLY__
-
-#include <linux/rcupdate.h>
+#include <linux/context_tracking.h>
 #include <asm/ptrace.h>
 
 static inline void exception_enter(struct pt_regs *regs)
 {
-	rcu_user_exit();
+	user_exit();
 }
 
 static inline void exception_exit(struct pt_regs *regs)
 {
-#ifdef CONFIG_RCU_USER_QS
+#ifdef CONFIG_CONTEXT_TRACKING
 	if (user_mode(regs))
-		rcu_user_enter();
+		user_enter();
 #endif
 }
 
 #else /* __ASSEMBLY__ */
 
-#ifdef CONFIG_RCU_USER_QS
+#ifdef CONFIG_CONTEXT_TRACKING
 # define SCHEDULE_USER call schedule_user
 #else
 # define SCHEDULE_USER call schedule
+1 -1
arch/x86/kernel/entry_64.S
···
 #include <asm/ftrace.h>
 #include <asm/percpu.h>
 #include <asm/asm.h>
-#include <asm/rcu.h>
+#include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <linux/err.h>
 
+4 -3
arch/x86/kernel/ptrace.c
···
 #include <linux/hw_breakpoint.h>
 #include <linux/rcupdate.h>
 #include <linux/module.h>
+#include <linux/context_tracking.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
···
 {
 	long ret = 0;
 
-	rcu_user_exit();
+	user_exit();
 
 	/*
 	 * If we stepped into a sysenter/syscall insn, it trapped in
···
 	 * or do_notify_resume(), in which case we can be in RCU
 	 * user mode.
 	 */
-	rcu_user_exit();
+	user_exit();
 
 	audit_syscall_exit(regs);
 
···
 	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall_exit(regs, step);
 
-	rcu_user_enter();
+	user_enter();
 }
+3 -2
arch/x86/kernel/signal.c
···
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
 #include <linux/uprobes.h>
+#include <linux/context_tracking.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
···
 void
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
-	rcu_user_exit();
+	user_exit();
 
 #ifdef CONFIG_X86_MCE
 	/* notify userspace of pending MCEs */
···
 	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
 		fire_user_return_notifiers();
 
-	rcu_user_enter();
+	user_enter();
 }
 
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
+1 -1
arch/x86/kernel/traps.c
···
 #include <asm/i387.h>
 #include <asm/fpu-internal.h>
 #include <asm/mce.h>
-#include <asm/rcu.h>
+#include <asm/context_tracking.h>
 
 #include <asm/mach_traps.h>
 
+1 -1
arch/x86/mm/fault.c
···
 #include <asm/pgalloc.h>		/* pgd_*(), ... */
 #include <asm/kmemcheck.h>		/* kmemcheck_*(), ... */
 #include <asm/fixmap.h>			/* VSYSCALL_START */
-#include <asm/rcu.h>			/* exception_enter(), ... */
+#include <asm/context_tracking.h>	/* exception_enter(), ... */
 
 /*
  * Page fault error code bits:
+18
include/linux/context_tracking.h
+#ifndef _LINUX_CONTEXT_TRACKING_H
+#define _LINUX_CONTEXT_TRACKING_H
+
+#ifdef CONFIG_CONTEXT_TRACKING
+#include <linux/sched.h>
+
+extern void user_enter(void);
+extern void user_exit(void);
+extern void context_tracking_task_switch(struct task_struct *prev,
+					 struct task_struct *next);
+#else
+static inline void user_enter(void) { }
+static inline void user_exit(void) { }
+static inline void context_tracking_task_switch(struct task_struct *prev,
+						struct task_struct *next) { }
+#endif /* !CONFIG_CONTEXT_TRACKING */
+
+#endif
-17
include/linux/rculist.h
···
 	     &pos->member != (head); \
 	     pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
 
-
-/**
- * list_for_each_continue_rcu
- * @pos:	the &struct list_head to use as a loop cursor.
- * @head:	the head for your list.
- *
- * Iterate over an rcu-protected list, continuing after current point.
- *
- * This list-traversal primitive may safely run concurrently with
- * the _rcu list-mutation primitives such as list_add_rcu()
- * as long as the traversal is guarded by rcu_read_lock().
- */
-#define list_for_each_continue_rcu(pos, head) \
-	for ((pos) = rcu_dereference_raw(list_next_rcu(pos)); \
-		(pos) != (head); \
-		(pos) = rcu_dereference_raw(list_next_rcu(pos)))
-
 /**
  * list_for_each_entry_continue_rcu - continue iteration over list of given type
  * @pos:	the type * to use as a loop cursor.
+27 -2
include/linux/rcupdate.h
···
 * that started after call_rcu() was invoked.  RCU read-side critical
 * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
 * and may be nested.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical sections.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
 */
 extern void call_rcu(struct rcu_head *head,
 		     void (*func)(struct rcu_head *head));
···
 * OR
 * - rcu_read_lock_bh() and rcu_read_unlock_bh(), if in process context.
 * These may be nested.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
 */
 extern void call_rcu_bh(struct rcu_head *head,
 			void (*func)(struct rcu_head *head));
···
 * OR
 * anything that disables preemption.
 * These may be nested.
+ *
+ * See the description of call_rcu() for more detailed information on
+ * memory ordering guarantees.
 */
 extern void call_rcu_sched(struct rcu_head *head,
 			   void (*func)(struct rcu_head *rcu));
···
 extern void rcu_user_exit(void);
 extern void rcu_user_enter_after_irq(void);
 extern void rcu_user_exit_after_irq(void);
-extern void rcu_user_hooks_switch(struct task_struct *prev,
-				  struct task_struct *next);
 #else
 static inline void rcu_user_enter(void) { }
 static inline void rcu_user_exit(void) { }
 static inline void rcu_user_enter_after_irq(void) { }
 static inline void rcu_user_exit_after_irq(void) { }
+static inline void rcu_user_hooks_switch(struct task_struct *prev,
+					 struct task_struct *next) { }
 #endif /* CONFIG_RCU_USER_QS */
 
 extern void exit_rcu(void);
+2 -8
include/linux/sched.h
···
 
 extern unsigned long get_parent_ip(unsigned long addr);
 
+extern void dump_cpu_task(int cpu);
+
 struct seq_file;
 struct cfs_rq;
 struct task_group;
···
 }
 
 #endif
-
-static inline void rcu_switch(struct task_struct *prev,
-			      struct task_struct *next)
-{
-#ifdef CONFIG_RCU_USER_QS
-	rcu_user_hooks_switch(prev, next);
-#endif
-}
 
 static inline void tsk_restore_flags(struct task_struct *task,
 				     unsigned long orig_flags, unsigned long flags)
+34
include/linux/srcu.h
···
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (C) IBM Corporation, 2006
+* Copyright (C) Fujitsu, 2012
 *
 * Author: Paul McKenney <paulmck@us.ibm.com>
+*	   Lai Jiangshan <laijs@cn.fujitsu.com>
 *
 * For detailed explanation of Read-Copy Update mechanism see -
 *		Documentation/RCU/ *.txt
···
 struct rcu_batch {
 	struct rcu_head *head, **tail;
 };
+
+#define RCU_BATCH_INIT(name) { NULL, &(name.head) }
 
 struct srcu_struct {
 	unsigned completed;
···
 	__init_srcu_struct((sp), #sp, &__srcu_key); \
 })
 
+#define __SRCU_DEP_MAP_INIT(srcu_name)	.dep_map = { .name = #srcu_name },
 #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
 
 int init_srcu_struct(struct srcu_struct *sp);
 
+#define __SRCU_DEP_MAP_INIT(srcu_name)
 #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+void process_srcu(struct work_struct *work);
+
+#define __SRCU_STRUCT_INIT(name)					\
+	{								\
+		.completed = -300,					\
+		.per_cpu_ref = &name##_srcu_array,			\
+		.queue_lock = __SPIN_LOCK_UNLOCKED(name.queue_lock),	\
+		.running = false,					\
+		.batch_queue = RCU_BATCH_INIT(name.batch_queue),	\
+		.batch_check0 = RCU_BATCH_INIT(name.batch_check0),	\
+		.batch_check1 = RCU_BATCH_INIT(name.batch_check1),	\
+		.batch_done = RCU_BATCH_INIT(name.batch_done),		\
+		.work = __DELAYED_WORK_INITIALIZER(name.work, process_srcu, 0),\
+		__SRCU_DEP_MAP_INIT(name)				\
+	}
+
+/*
+ * Define and initialize an srcu struct at build time.
+ * Don't call init_srcu_struct() nor cleanup_srcu_struct() on it.
+ */
+#define DEFINE_SRCU(name)						\
+	static DEFINE_PER_CPU(struct srcu_struct_array, name##_srcu_array);\
+	struct srcu_struct name = __SRCU_STRUCT_INIT(name);
+
+#define DEFINE_STATIC_SRCU(name)					\
+	static DEFINE_PER_CPU(struct srcu_struct_array, name##_srcu_array);\
+	static struct srcu_struct name = __SRCU_STRUCT_INIT(name);
 
 /**
  * call_srcu() - Queue a callback for invocation after an SRCU grace period
+1
include/trace/events/rcu.h
···
 * "EarlyExit": rcu_barrier_callback() piggybacked, thus early exit.
 * "Inc1": rcu_barrier_callback() piggyback check counter incremented.
 * "Offline": rcu_barrier_callback() found offline CPU
+ * "OnlineNoCB": rcu_barrier_callback() found online no-CBs CPU.
 * "OnlineQ": rcu_barrier_callback() found online CPU with callbacks.
 * "OnlineNQ": rcu_barrier_callback() found online CPU, no callbacks.
 * "IRQ": An rcu_barrier_callback() callback posted on remote CPU.
+44 -23
init/Kconfig
···
 	  This option enables preemptible-RCU code that is common between
 	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
 
+config CONTEXT_TRACKING
+	bool
+
 config RCU_USER_QS
 	bool "Consider userspace as in RCU extended quiescent state"
-	depends on HAVE_RCU_USER_QS && SMP
+	depends on HAVE_CONTEXT_TRACKING && SMP
+	select CONTEXT_TRACKING
 	help
 	  This option sets hooks on kernel / userspace boundaries and
 	  puts RCU in extended quiescent state when the CPU runs in
 	  userspace. It means that when a CPU runs in userspace, it is
 	  excluded from the global RCU state machine and thus doesn't
-	  to keep the timer tick on for RCU.
+	  try to keep the timer tick on for RCU.
 
 	  Unless you want to hack and help the development of the full
-	  tickless feature, you shouldn't enable this option. It adds
-	  unnecessary overhead.
+	  dynticks mode, you shouldn't enable this option. It also
+	  adds unnecessary overhead.
 
 	  If unsure say N
 
-config RCU_USER_QS_FORCE
-	bool "Force userspace extended QS by default"
-	depends on RCU_USER_QS
+config CONTEXT_TRACKING_FORCE
+	bool "Force context tracking"
+	depends on CONTEXT_TRACKING
 	help
-	  Set the hooks in user/kernel boundaries by default in order to
-	  test this feature that treats userspace as an extended quiescent
-	  state until we have a real user like a full adaptive nohz option.
-
-	  Unless you want to hack and help the development of the full
-	  tickless feature, you shouldn't enable this option. It adds
-	  unnecessary overhead.
-
-	  If unsure say N
+	  Probe on user/kernel boundaries by default in order to
+	  test the features that rely on it such as userspace RCU extended
+	  quiescent states.
+	  This test is there for debugging until we have a real user like the
+	  full dynticks mode.
 
 config RCU_FANOUT
 	int "Tree-based hierarchical RCU fanout value"
···
 	depends on NO_HZ && SMP
 	default n
 	help
-	  This option causes RCU to attempt to accelerate grace periods
-	  in order to allow CPUs to enter dynticks-idle state more
-	  quickly.  On the other hand, this option increases the overhead
-	  of the dynticks-idle checking, particularly on systems with
-	  large numbers of CPUs.
+	  This option causes RCU to attempt to accelerate grace periods in
+	  order to allow CPUs to enter dynticks-idle state more quickly.
+	  On the other hand, this option increases the overhead of the
+	  dynticks-idle checking, thus degrading scheduling latency.
 
-	  Say Y if energy efficiency is critically important, particularly
-	  if you have relatively few CPUs.
+	  Say Y if energy efficiency is critically important, and you don't
+	  care about real-time response.
 
 	  Say N if you are unsure.
 
···
 	  blocking an expedited RCU grace period is boosted immediately.
 
 	  Accept the default if unsure.
+
+config RCU_NOCB_CPU
+	bool "Offload RCU callback processing from boot-selected CPUs"
+	depends on TREE_RCU || TREE_PREEMPT_RCU
+	default n
+	help
+	  Use this option to reduce OS jitter for aggressive HPC or
+	  real-time workloads.  It can also be used to offload RCU
+	  callback invocation to energy-efficient CPUs in battery-powered
+	  asymmetric multiprocessors.
+
+	  This option offloads callback invocation from the set of
+	  CPUs specified at boot time by the rcu_nocbs parameter.
+	  For each such CPU, a kthread ("rcuoN") will be created to
+	  invoke callbacks, where the "N" is the CPU being offloaded.
+	  Nothing prevents this kthread from running on the specified
+	  CPUs, but (1) the kthreads may be preempted between each
+	  callback, and (2) affinity or cgroups can be used to force
+	  the kthreads to run on whatever set of CPUs is desired.
+
+	  Say Y here if you want reduced OS jitter on selected CPUs.
+	  Say N here if you are unsure.
 
 endmenu # "RCU Subsystem"
 
+1
kernel/Makefile
···
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
+83
kernel/context_tracking.c
+#include <linux/context_tracking.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+
+struct context_tracking {
+	/*
+	 * When active is false, hooks are not set to
+	 * minimize overhead: TIF flags are cleared
+	 * and calls to user_enter/exit are ignored. This
+	 * may be further optimized using static keys.
+	 */
+	bool active;
+	enum {
+		IN_KERNEL = 0,
+		IN_USER,
+	} state;
+};
+
+static DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
+#ifdef CONFIG_CONTEXT_TRACKING_FORCE
+	.active = true,
+#endif
+};
+
+void user_enter(void)
+{
+	unsigned long flags;
+
+	/*
+	 * Some contexts may involve an exception occurring in an irq,
+	 * leading to that nesting:
+	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
+	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
+	 * helpers are enough to protect RCU uses inside the exception. So
+	 * just return immediately if we detect we are in an IRQ.
+	 */
+	if (in_interrupt())
+		return;
+
+	WARN_ON_ONCE(!current->mm);
+
+	local_irq_save(flags);
+	if (__this_cpu_read(context_tracking.active) &&
+	    __this_cpu_read(context_tracking.state) != IN_USER) {
+		__this_cpu_write(context_tracking.state, IN_USER);
+		rcu_user_enter();
+	}
+	local_irq_restore(flags);
+}
+
+void user_exit(void)
+{
+	unsigned long flags;
+
+	/*
+	 * Some contexts may involve an exception occurring in an irq,
+	 * leading to that nesting:
+	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
+	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
+	 * helpers are enough to protect RCU uses inside the exception. So
+	 * just return immediately if we detect we are in an IRQ.
+	 */
+	if (in_interrupt())
+		return;
+
+	local_irq_save(flags);
+	if (__this_cpu_read(context_tracking.state) == IN_USER) {
+		__this_cpu_write(context_tracking.state, IN_KERNEL);
+		rcu_user_exit();
+	}
+	local_irq_restore(flags);
+}
+
+void context_tracking_task_switch(struct task_struct *prev,
+				  struct task_struct *next)
+{
+	if (__this_cpu_read(context_tracking.active)) {
+		clear_tsk_thread_flag(prev, TIF_NOHZ);
+		set_tsk_thread_flag(next, TIF_NOHZ);
+	}
+}
+18
kernel/ksysfs.c
···
 }
 KERNEL_ATTR_RO(fscaps);
 
+int rcu_expedited;
+static ssize_t rcu_expedited_show(struct kobject *kobj,
+				  struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", rcu_expedited);
+}
+static ssize_t rcu_expedited_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	if (kstrtoint(buf, 0, &rcu_expedited))
+		return -EINVAL;
+
+	return count;
+}
+KERNEL_ATTR_RW(rcu_expedited);
+
 /*
  * Make /sys/kernel/notes give the raw contents of our kernel .notes section.
  */
···
 	&kexec_crash_size_attr.attr,
 	&vmcoreinfo_attr.attr,
 #endif
+	&rcu_expedited_attr.attr,
 	NULL
 };
 
+2
kernel/rcu.h
···
 	}
 }
 
+extern int rcu_expedited;
+
 #endif /* __LINUX_RCU_H */
+3
kernel/rcupdate.c
···
 #include <linux/export.h>
 #include <linux/hardirq.h>
 #include <linux/delay.h>
+#include <linux/module.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/rcu.h>
 
 #include "rcu.h"
+
+module_param(rcu_expedited, int, 0);
 
 #ifdef CONFIG_PREEMPT_RCU
 
+1 -1
kernel/rcutiny.c
···
 */
 int rcu_is_cpu_rrupt_from_idle(void)
 {
-	return rcu_dynticks_nesting <= 0;
+	return rcu_dynticks_nesting <= 1;
 }
 
 /*
+4 -1
kernel/rcutiny_plugin.h
···
 		return;
 
 	/* Once we get past the fastpath checks, same code as rcu_barrier(). */
-	rcu_barrier();
+	if (rcu_expedited)
+		synchronize_rcu_expedited();
+	else
+		rcu_barrier();
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
+18 -36
kernel/rcutorture.c
···
 
 struct rcu_torture_ops {
 	void (*init)(void);
-	void (*cleanup)(void);
 	int (*readlock)(void);
 	void (*read_delay)(struct rcu_random_state *rrsp);
 	void (*readunlock)(int idx);
···
 
 static struct rcu_torture_ops rcu_ops = {
 	.init		= NULL,
-	.cleanup	= NULL,
 	.readlock	= rcu_torture_read_lock,
 	.read_delay	= rcu_read_delay,
 	.readunlock	= rcu_torture_read_unlock,
···
 
 static struct rcu_torture_ops rcu_sync_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= rcu_torture_read_lock,
 	.read_delay	= rcu_read_delay,
 	.readunlock	= rcu_torture_read_unlock,
···
 
 static struct rcu_torture_ops rcu_expedited_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= rcu_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= rcu_torture_read_unlock,
···
 
 static struct rcu_torture_ops rcu_bh_ops = {
 	.init		= NULL,
-	.cleanup	= NULL,
 	.readlock	= rcu_bh_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= rcu_bh_torture_read_unlock,
···
 
 static struct rcu_torture_ops rcu_bh_sync_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= rcu_bh_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= rcu_bh_torture_read_unlock,
···
 
 static struct rcu_torture_ops rcu_bh_expedited_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= rcu_bh_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= rcu_bh_torture_read_unlock,
···
 * Definitions for srcu torture testing.
 */
 
-static struct srcu_struct srcu_ctl;
-
-static void srcu_torture_init(void)
-{
-	init_srcu_struct(&srcu_ctl);
-	rcu_sync_torture_init();
-}
-
-static void srcu_torture_cleanup(void)
-{
-	synchronize_srcu(&srcu_ctl);
-	cleanup_srcu_struct(&srcu_ctl);
-}
+DEFINE_STATIC_SRCU(srcu_ctl);
 
 static int srcu_torture_read_lock(void) __acquires(&srcu_ctl)
 {
···
 }
 
 static struct rcu_torture_ops srcu_ops = {
-	.init		= srcu_torture_init,
-	.cleanup	= srcu_torture_cleanup,
+	.init		= rcu_sync_torture_init,
 	.readlock	= srcu_torture_read_lock,
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock,
···
 };
 
 static struct rcu_torture_ops srcu_sync_ops = {
-	.init		= srcu_torture_init,
-	.cleanup	= srcu_torture_cleanup,
+	.init		= rcu_sync_torture_init,
 	.readlock	= srcu_torture_read_lock,
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock,
···
 }
 
 static struct rcu_torture_ops srcu_raw_ops = {
-	.init		= srcu_torture_init,
-	.cleanup	= srcu_torture_cleanup,
+	.init		= rcu_sync_torture_init,
 	.readlock	= srcu_torture_read_lock_raw,
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock_raw,
···
 };
 
 static struct rcu_torture_ops srcu_raw_sync_ops = {
-	.init		= srcu_torture_init,
-	.cleanup	= srcu_torture_cleanup,
+	.init		= rcu_sync_torture_init,
 	.readlock	= srcu_torture_read_lock_raw,
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock_raw,
···
 }
 
 static struct rcu_torture_ops srcu_expedited_ops = {
-	.init		= srcu_torture_init,
-	.cleanup	= srcu_torture_cleanup,
+	.init		= rcu_sync_torture_init,
 	.readlock	= srcu_torture_read_lock,
 	.read_delay	= srcu_read_delay,
 	.readunlock	= srcu_torture_read_unlock,
···
 
 static struct rcu_torture_ops sched_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= sched_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= sched_torture_read_unlock,
···
 
 static struct rcu_torture_ops sched_sync_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= sched_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= sched_torture_read_unlock,
···
 
 static struct rcu_torture_ops sched_expedited_ops = {
 	.init		= rcu_sync_torture_init,
-	.cleanup	= NULL,
 	.readlock	= sched_torture_read_lock,
 	.read_delay	= rcu_read_delay,  /* just reuse rcu's version. */
 	.readunlock	= sched_torture_read_unlock,
···
 		"fqs_duration=%d fqs_holdoff=%d fqs_stutter=%d "
 		"test_boost=%d/%d test_boost_interval=%d "
 		"test_boost_duration=%d shutdown_secs=%d "
+		"stall_cpu=%d stall_cpu_holdoff=%d "
+		"n_barrier_cbs=%d "
 		"onoff_interval=%d onoff_holdoff=%d\n",
 		torture_type, tag, nrealreaders, nfakewriters,
 		stat_interval, verbose, test_no_idle_hz, shuffle_interval,
 		stutter, irqreader, fqs_duration, fqs_holdoff, fqs_stutter,
 		test_boost, cur_ops->can_boost,
 		test_boost_interval, test_boost_duration, shutdown_secs,
+		stall_cpu, stall_cpu_holdoff,
+		n_barrier_cbs,
 		onoff_interval, onoff_holdoff);
 }
 
···
 	unsigned long delta;
 	int maxcpu = -1;
 	DEFINE_RCU_RANDOM(rand);
+	int ret;
 	unsigned long starttime;
 
 	VERBOSE_PRINTK_STRING("rcu_torture_onoff task started");
···
 			torture_type, cpu);
 		starttime = jiffies;
 		n_offline_attempts++;
-
if (cpu_down(cpu) == 0) { 1525 + ret = cpu_down(cpu); 1526 + if (ret) { 1527 + if (verbose) 1528 + pr_alert("%s" TORTURE_FLAG 1529 + "rcu_torture_onoff task: offline %d failed: errno %d\n", 1530 + torture_type, cpu, ret); 1531 + } else { 1504 1532 if (verbose) 1505 1533 pr_alert("%s" TORTURE_FLAG 1506 1534 "rcu_torture_onoff task: offlined %d\n", ··· 1920 1936 1921 1937 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */ 1922 1938 1923 - if (cur_ops->cleanup) 1924 - cur_ops->cleanup(); 1925 1939 if (atomic_read(&n_rcu_torture_error) || n_rcu_torture_barrier_error) 1926 1940 rcu_torture_print_module_parms(cur_ops, "End of test: FAILURE"); 1927 1941 else if (n_online_successes != n_online_attempts ||
+222 -125
kernel/rcutree.c
···
 	.level = { &sname##_state.node[0] }, \
 	.call = cr, \
 	.fqs_state = RCU_GP_IDLE, \
-	.gpnum = -300, \
-	.completed = -300, \
-	.onofflock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.onofflock), \
+	.gpnum = 0UL - 300UL, \
+	.completed = 0UL - 300UL, \
+	.orphan_lock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.orphan_lock), \
 	.orphan_nxttail = &sname##_state.orphan_nxtlist, \
 	.orphan_donetail = &sname##_state.orphan_donelist, \
 	.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
···
 DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 	.dynticks_nesting = DYNTICK_TASK_EXIT_IDLE,
 	.dynticks = ATOMIC_INIT(1),
-#if defined(CONFIG_RCU_USER_QS) && !defined(CONFIG_RCU_USER_QS_FORCE)
-	.ignore_user_qs = true,
-#endif
 };

-static int blimit = 10;		/* Maximum callbacks per rcu_do_batch. */
-static int qhimark = 10000;	/* If this many pending, ignore blimit. */
-static int qlowmark = 100;	/* Once only this many pending, use blimit. */
+static long blimit = 10;	/* Maximum callbacks per rcu_do_batch. */
+static long qhimark = 10000;	/* If this many pending, ignore blimit. */
+static long qlowmark = 100;	/* Once only this many pending, use blimit. */

-module_param(blimit, int, 0444);
-module_param(qhimark, int, 0444);
-module_param(qlowmark, int, 0444);
+module_param(blimit, long, 0444);
+module_param(qhimark, long, 0444);
+module_param(qlowmark, long, 0444);

 int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
 int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
···
 static int
 cpu_has_callbacks_ready_to_invoke(struct rcu_data *rdp)
 {
-	return &rdp->nxtlist != rdp->nxttail[RCU_DONE_TAIL];
+	return &rdp->nxtlist != rdp->nxttail[RCU_DONE_TAIL] &&
+	       rdp->nxttail[RCU_DONE_TAIL] != NULL;
 }

 /*
···
 static int
 cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
 {
-	return *rdp->nxttail[RCU_DONE_TAIL +
-			     ACCESS_ONCE(rsp->completed) != rdp->completed] &&
+	struct rcu_head **ntp;
+
+	ntp = rdp->nxttail[RCU_DONE_TAIL +
+			   (ACCESS_ONCE(rsp->completed) != rdp->completed)];
+	return rdp->nxttail[RCU_DONE_TAIL] && ntp && *ntp &&
 	       !rcu_gp_in_progress(rsp);
 }
···
  */
 void rcu_user_enter(void)
 {
-	unsigned long flags;
-	struct rcu_dynticks *rdtp;
-
-	/*
-	 * Some contexts may involve an exception occuring in an irq,
-	 * leading to that nesting:
-	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
-	 * helpers are enough to protect RCU uses inside the exception. So
-	 * just return immediately if we detect we are in an IRQ.
-	 */
-	if (in_interrupt())
-		return;
-
-	WARN_ON_ONCE(!current->mm);
-
-	local_irq_save(flags);
-	rdtp = &__get_cpu_var(rcu_dynticks);
-	if (!rdtp->ignore_user_qs && !rdtp->in_user) {
-		rdtp->in_user = true;
-		rcu_eqs_enter(true);
-	}
-	local_irq_restore(flags);
+	rcu_eqs_enter(1);
 }

 /**
···
  */
 void rcu_user_exit(void)
 {
-	unsigned long flags;
-	struct rcu_dynticks *rdtp;
-
-	/*
-	 * Some contexts may involve an exception occuring in an irq,
-	 * leading to that nesting:
-	 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-	 * This would mess up the dyntick_nesting count though. And rcu_irq_*()
-	 * helpers are enough to protect RCU uses inside the exception. So
-	 * just return immediately if we detect we are in an IRQ.
-	 */
-	if (in_interrupt())
-		return;
-
-	local_irq_save(flags);
-	rdtp = &__get_cpu_var(rcu_dynticks);
-	if (rdtp->in_user) {
-		rdtp->in_user = false;
-		rcu_eqs_exit(true);
-	}
-	local_irq_restore(flags);
+	rcu_eqs_exit(1);
 }

 /**
···
 	return ret;
 }
 EXPORT_SYMBOL(rcu_is_cpu_idle);
-
-#ifdef CONFIG_RCU_USER_QS
-void rcu_user_hooks_switch(struct task_struct *prev,
-			   struct task_struct *next)
-{
-	struct rcu_dynticks *rdtp;
-
-	/* Interrupts are disabled in context switch */
-	rdtp = &__get_cpu_var(rcu_dynticks);
-	if (!rdtp->ignore_user_qs) {
-		clear_tsk_thread_flag(prev, TIF_NOHZ);
-		set_tsk_thread_flag(next, TIF_NOHZ);
-	}
-}
-#endif /* #ifdef CONFIG_RCU_USER_QS */

 #if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
···
 	rsp->jiffies_stall = jiffies + jiffies_till_stall_check();
 }

+/*
+ * Dump stacks of all tasks running on stalled CPUs.  This is a fallback
+ * for architectures that do not implement trigger_all_cpu_backtrace().
+ * The NMI-triggered stack traces are more accurate because they are
+ * printed by the target CPU.
+ */
+static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
+{
+	int cpu;
+	unsigned long flags;
+	struct rcu_node *rnp;
+
+	rcu_for_each_leaf_node(rsp, rnp) {
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		if (rnp->qsmask != 0) {
+			for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++)
+				if (rnp->qsmask & (1UL << cpu))
+					dump_cpu_task(rnp->grplo + cpu);
+		}
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	}
+}
+
 static void print_other_cpu_stall(struct rcu_state *rsp)
 {
 	int cpu;
···
 	unsigned long flags;
 	int ndetected = 0;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	long totqlen = 0;

 	/* Only let one CPU complain about others per time interval. */
···
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);

 	print_cpu_stall_info_end();
-	printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n",
-	       smp_processor_id(), (long)(jiffies - rsp->gp_start));
+	for_each_possible_cpu(cpu)
+		totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+	pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu, q=%lu)\n",
+	       smp_processor_id(), (long)(jiffies - rsp->gp_start),
+	       rsp->gpnum, rsp->completed, totqlen);
 	if (ndetected == 0)
 		printk(KERN_ERR "INFO: Stall ended before state dump start\n");
 	else if (!trigger_all_cpu_backtrace())
-		dump_stack();
+		rcu_dump_cpu_stacks(rsp);

 	/* Complain about tasks blocking the grace period. */
···

 static void print_cpu_stall(struct rcu_state *rsp)
 {
+	int cpu;
 	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root(rsp);
+	long totqlen = 0;

 	/*
 	 * OK, time to rat on ourselves...
···
 	print_cpu_stall_info_begin();
 	print_cpu_stall_info(rsp, smp_processor_id());
 	print_cpu_stall_info_end();
-	printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start);
+	for_each_possible_cpu(cpu)
+		totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen;
+	pr_cont(" (t=%lu jiffies g=%lu c=%lu q=%lu)\n",
+		jiffies - rsp->gp_start, rsp->gpnum, rsp->completed, totqlen);
 	if (!trigger_all_cpu_backtrace())
 		dump_stack();
···
 	rdp->nxtlist = NULL;
 	for (i = 0; i < RCU_NEXT_SIZE; i++)
 		rdp->nxttail[i] = &rdp->nxtlist;
+	init_nocb_callback_list(rdp);
 }

 /*
···
 	    !cpu_needs_another_gp(rsp, rdp)) {
 		/*
 		 * Either we have not yet spawned the grace-period
-		 * task or this CPU does not need another grace period.
+		 * task, this CPU does not need another grace period,
+		 * or a grace period is already in progress.
 		 * Either way, don't start a new grace period.
 		 */
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		return;
 	}

+	/*
+	 * Because there is no grace period in progress right now,
+	 * any callbacks we have up to this point will be satisfied
+	 * by the next grace period.  So promote all callbacks to be
+	 * handled after the end of the next grace period.  If the
+	 * CPU is not yet aware of the end of the previous grace period,
+	 * we need to allow for the callback advancement that will
+	 * occur when it does become aware.  Deadlock prevents us from
+	 * making it aware at this point:  We cannot acquire a leaf
+	 * rcu_node ->lock while holding the root rcu_node ->lock.
+	 */
+	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
+	if (rdp->completed == rsp->completed)
+		rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
+
 	rsp->gp_flags = RCU_GP_FLAG_INIT;
-	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */
+
+	/* Ensure that CPU is aware of completion of last grace period. */
+	rcu_process_gp_end(rsp, rdp);
+	local_irq_restore(flags);
+
+	/* Wake up rcu_gp_kthread() to start the grace period. */
 	wake_up(&rsp->gp_wq);
 }
···
 /*
  * Send the specified CPU's RCU callbacks to the orphanage.  The
  * specified CPU must be offline, and the caller must hold the
- * ->onofflock.
+ * ->orphan_lock.
  */
 static void
 rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp,
 			  struct rcu_node *rnp, struct rcu_data *rdp)
 {
+	/* No-CBs CPUs do not have orphanable callbacks. */
+	if (is_nocb_cpu(rdp->cpu))
+		return;
+
 	/*
 	 * Orphan the callbacks.  First adjust the counts.  This is safe
-	 * because ->onofflock excludes _rcu_barrier()'s adoption of
-	 * the callbacks, thus no memory barrier is required.
+	 * because _rcu_barrier() excludes CPU-hotplug operations, so it
+	 * cannot be running now.  Thus no memory barrier is required.
 	 */
 	if (rdp->nxtlist != NULL) {
 		rsp->qlen_lazy += rdp->qlen_lazy;
···

 /*
  * Adopt the RCU callbacks from the specified rcu_state structure's
- * orphanage.  The caller must hold the ->onofflock.
+ * orphanage.  The caller must hold the ->orphan_lock.
  */
 static void rcu_adopt_orphan_cbs(struct rcu_state *rsp)
 {
 	int i;
 	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
+
+	/* No-CBs CPUs are handled specially. */
+	if (rcu_nocb_adopt_orphan_cbs(rsp, rdp))
+		return;

 	/* Do the accounting first. */
 	rdp->qlen_lazy += rsp->qlen_lazy;
···

 	/* Exclude any attempts to start a new grace period. */
 	mutex_lock(&rsp->onoff_mutex);
-	raw_spin_lock_irqsave(&rsp->onofflock, flags);
+	raw_spin_lock_irqsave(&rsp->orphan_lock, flags);

 	/* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
 	rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
···
 	/*
 	 * We still hold the leaf rcu_node structure lock here, and
 	 * irqs are still disabled.  The reason for this subterfuge is
-	 * because invoking rcu_report_unblock_qs_rnp() with ->onofflock
+	 * because invoking rcu_report_unblock_qs_rnp() with ->orphan_lock
 	 * held leads to deadlock.
 	 */
-	raw_spin_unlock(&rsp->onofflock); /* irqs remain disabled. */
+	raw_spin_unlock(&rsp->orphan_lock); /* irqs remain disabled. */
 	rnp = rdp->mynode;
 	if (need_report & RCU_OFL_TASKS_NORM_GP)
 		rcu_report_unblock_qs_rnp(rnp, flags);
···
 {
 	unsigned long flags;
 	struct rcu_head *next, *list, **tail;
-	int bl, count, count_lazy, i;
+	long bl, count, count_lazy;
+	int i;

 	/* If no callbacks are ready, just return.*/
 	if (!cpu_has_callbacks_ready_to_invoke(rdp)) {
···
 	}
 }

+/*
+ * Helper function for call_rcu() and friends.  The cpu argument will
+ * normally be -1, indicating "currently running CPU".  It may specify
+ * a CPU only if that CPU is a no-CBs CPU.  Currently, only _rcu_barrier()
+ * is expected to specify a CPU.
+ */
 static void
 __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
-	   struct rcu_state *rsp, bool lazy)
+	   struct rcu_state *rsp, int cpu, bool lazy)
 {
 	unsigned long flags;
 	struct rcu_data *rdp;
···
 	rdp = this_cpu_ptr(rsp->rda);

 	/* Add the callback to our list. */
-	if (unlikely(rdp->nxttail[RCU_NEXT_TAIL] == NULL)) {
+	if (unlikely(rdp->nxttail[RCU_NEXT_TAIL] == NULL) || cpu != -1) {
+		int offline;
+
+		if (cpu != -1)
+			rdp = per_cpu_ptr(rsp->rda, cpu);
+		offline = !__call_rcu_nocb(rdp, head, lazy);
+		WARN_ON_ONCE(offline);
 		/* _call_rcu() is illegal on offline CPU; leak the callback. */
-		WARN_ON_ONCE(1);
 		local_irq_restore(flags);
 		return;
 	}
···
  */
 void call_rcu_sched(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
 {
-	__call_rcu(head, func, &rcu_sched_state, 0);
+	__call_rcu(head, func, &rcu_sched_state, -1, 0);
 }
 EXPORT_SYMBOL_GPL(call_rcu_sched);
···
  */
 void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
 {
-	__call_rcu(head, func, &rcu_bh_state, 0);
+	__call_rcu(head, func, &rcu_bh_state, -1, 0);
 }
 EXPORT_SYMBOL_GPL(call_rcu_bh);
···
  * rcu_read_lock_sched().
  *
  * This means that all preempt_disable code sequences, including NMI and
- * hardware-interrupt handlers, in progress on entry will have completed
- * before this primitive returns.  However, this does not guarantee that
- * softirq handlers will have completed, since in some kernels, these
- * handlers can run in process context, and can block.
+ * non-threaded hardware-interrupt handlers, in progress on entry will
+ * have completed before this primitive returns.  However, this does not
+ * guarantee that softirq handlers will have completed, since in some
+ * kernels, these handlers can run in process context, and can block.
+ *
+ * Note that this guarantee implies further memory-ordering guarantees.
+ * On systems with more than one CPU, when synchronize_sched() returns,
+ * each CPU is guaranteed to have executed a full memory barrier since the
+ * end of its last RCU-sched read-side critical section whose beginning
+ * preceded the call to synchronize_sched().  In addition, each CPU having
+ * an RCU read-side critical section that extends beyond the return from
+ * synchronize_sched() is guaranteed to have executed a full memory barrier
+ * after the beginning of synchronize_sched() and before the beginning of
+ * that RCU read-side critical section.  Note that these guarantees include
+ * CPUs that are offline, idle, or executing in user mode, as well as CPUs
+ * that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked synchronize_sched(), which returned
+ * to its caller on CPU B, then both CPU A and CPU B are guaranteed
+ * to have executed a full memory barrier during the execution of
+ * synchronize_sched() -- even if CPU A and CPU B are the same CPU (but
+ * again only if the system has more than one CPU).
  *
  * This primitive provides the guarantees made by the (now removed)
  * synchronize_kernel() API.  In contrast, synchronize_rcu() only
···
 			   "Illegal synchronize_sched() in RCU-sched read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
-	wait_rcu_gp(call_rcu_sched);
+	if (rcu_expedited)
+		synchronize_sched_expedited();
+	else
+		wait_rcu_gp(call_rcu_sched);
 }
 EXPORT_SYMBOL_GPL(synchronize_sched);
···
  * read-side critical sections have completed.  RCU read-side critical
  * sections are delimited by rcu_read_lock_bh() and rcu_read_unlock_bh(),
  * and may be nested.
+ *
+ * See the description of synchronize_sched() for more detailed information
+ * on memory ordering guarantees.
  */
 void synchronize_rcu_bh(void)
 {
···
 			   "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section");
 	if (rcu_blocking_is_gp())
 		return;
-	wait_rcu_gp(call_rcu_bh);
+	if (rcu_expedited)
+		synchronize_rcu_bh_expedited();
+	else
+		wait_rcu_gp(call_rcu_bh);
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu_bh);
-
-static atomic_t sync_sched_expedited_started = ATOMIC_INIT(0);
-static atomic_t sync_sched_expedited_done = ATOMIC_INIT(0);

 static int synchronize_sched_expedited_cpu_stop(void *data)
 {
···
  */
 void synchronize_sched_expedited(void)
 {
-	int firstsnap, s, snap, trycount = 0;
+	long firstsnap, s, snap;
+	int trycount = 0;
+	struct rcu_state *rsp = &rcu_sched_state;

-	/* Note that atomic_inc_return() implies full memory barrier. */
-	firstsnap = snap = atomic_inc_return(&sync_sched_expedited_started);
+	/*
+	 * If we are in danger of counter wrap, just do synchronize_sched().
+	 * By allowing sync_sched_expedited_started to advance no more than
+	 * ULONG_MAX/8 ahead of sync_sched_expedited_done, we are ensuring
+	 * that more than 3.5 billion CPUs would be required to force a
+	 * counter wrap on a 32-bit system.  Quite a few more CPUs would of
+	 * course be required on a 64-bit system.
+	 */
+	if (ULONG_CMP_GE((ulong)atomic_long_read(&rsp->expedited_start),
+			 (ulong)atomic_long_read(&rsp->expedited_done) +
+			 ULONG_MAX / 8)) {
+		synchronize_sched();
+		atomic_long_inc(&rsp->expedited_wrap);
+		return;
+	}
+
+	/*
+	 * Take a ticket.  Note that atomic_inc_return() implies a
+	 * full memory barrier.
+	 */
+	snap = atomic_long_inc_return(&rsp->expedited_start);
+	firstsnap = snap;
 	get_online_cpus();
 	WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id()));
···
 			     synchronize_sched_expedited_cpu_stop,
 			     NULL) == -EAGAIN) {
 		put_online_cpus();
+		atomic_long_inc(&rsp->expedited_tryfail);
+
+		/* Check to see if someone else did our work for us. */
+		s = atomic_long_read(&rsp->expedited_done);
+		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
+			/* ensure test happens before caller kfree */
+			smp_mb__before_atomic_inc(); /* ^^^ */
+			atomic_long_inc(&rsp->expedited_workdone1);
+			return;
+		}

 		/* No joy, try again later.  Or just synchronize_sched(). */
 		if (trycount++ < 10) {
 			udelay(trycount * num_online_cpus());
 		} else {
-			synchronize_sched();
+			wait_rcu_gp(call_rcu_sched);
+			atomic_long_inc(&rsp->expedited_normal);
 			return;
 		}

-		/* Check to see if someone else did our work for us. */
-		s = atomic_read(&sync_sched_expedited_done);
-		if (UINT_CMP_GE((unsigned)s, (unsigned)firstsnap)) {
-			smp_mb(); /* ensure test happens before caller kfree */
+		/* Recheck to see if someone else did our work for us. */
+		s = atomic_long_read(&rsp->expedited_done);
+		if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) {
+			/* ensure test happens before caller kfree */
+			smp_mb__before_atomic_inc(); /* ^^^ */
+			atomic_long_inc(&rsp->expedited_workdone2);
 			return;
 		}

 		/*
 		 * Refetching sync_sched_expedited_started allows later
-		 * callers to piggyback on our grace period.  We subtract
-		 * 1 to get the same token that the last incrementer got.
-		 * We retry after they started, so our grace period works
-		 * for them, and they started after our first try, so their
-		 * grace period works for us.
+		 * callers to piggyback on our grace period.  We retry
+		 * after they started, so our grace period works for them,
+		 * and they started after our first try, so their grace
+		 * period works for us.
 		 */
 		get_online_cpus();
-		snap = atomic_read(&sync_sched_expedited_started);
+		snap = atomic_long_read(&rsp->expedited_start);
 		smp_mb(); /* ensure read is before try_stop_cpus(). */
 	}
+	atomic_long_inc(&rsp->expedited_stoppedcpus);

 	/*
 	 * Everyone up to our most recent fetch is covered by our grace
 	 * period.  Update the counter, but only if our work is still
 	 * relevant -- which it won't be if someone who started later
-	 * than we did beat us to the punch.
+	 * than we did already did their update.
 	 */
 	do {
-		s = atomic_read(&sync_sched_expedited_done);
-		if (UINT_CMP_GE((unsigned)s, (unsigned)snap)) {
-			smp_mb(); /* ensure test happens before caller kfree */
+		atomic_long_inc(&rsp->expedited_done_tries);
+		s = atomic_long_read(&rsp->expedited_done);
+		if (ULONG_CMP_GE((ulong)s, (ulong)snap)) {
+			/* ensure test happens before caller kfree */
+			smp_mb__before_atomic_inc(); /* ^^^ */
+			atomic_long_inc(&rsp->expedited_done_lost);
 			break;
 		}
-	} while (atomic_cmpxchg(&sync_sched_expedited_done, s, snap) != s);
+	} while (atomic_long_cmpxchg(&rsp->expedited_done, s, snap) != s);
+	atomic_long_inc(&rsp->expedited_done_exit);

 	put_online_cpus();
 }
···
 	 * When that callback is invoked, we will know that all of the
 	 * corresponding CPU's preceding callbacks have been invoked.
 	 */
-	for_each_online_cpu(cpu) {
+	for_each_possible_cpu(cpu) {
+		if (!cpu_online(cpu) && !is_nocb_cpu(cpu))
+			continue;
 		rdp = per_cpu_ptr(rsp->rda, cpu);
-		if (ACCESS_ONCE(rdp->qlen)) {
+		if (is_nocb_cpu(cpu)) {
+			_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
+					   rsp->n_barrier_done);
+			atomic_inc(&rsp->barrier_cpu_count);
+			__call_rcu(&rdp->barrier_head, rcu_barrier_callback,
+				   rsp, cpu, 0);
+		} else if (ACCESS_ONCE(rdp->qlen)) {
 			_rcu_barrier_trace(rsp, "OnlineQ", cpu,
 					   rsp->n_barrier_done);
 			smp_call_function_single(cpu, rcu_barrier_func, rsp, 1);
···
 #endif
 	rdp->cpu = cpu;
 	rdp->rsp = rsp;
+	rcu_boot_init_nocb_percpu_data(rdp);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 }
···
 	struct rcu_data *rdp = per_cpu_ptr(rcu_state->rda, cpu);
 	struct rcu_node *rnp = rdp->mynode;
 	struct rcu_state *rsp;
+	int ret = NOTIFY_OK;

 	trace_rcu_utilization("Start CPU hotplug");
 	switch (action) {
···
 		rcu_boost_kthread_setaffinity(rnp, -1);
 		break;
 	case CPU_DOWN_PREPARE:
-		rcu_boost_kthread_setaffinity(rnp, cpu);
+		if (nocb_cpu_expendable(cpu))
+			rcu_boost_kthread_setaffinity(rnp, cpu);
+		else
+			ret = NOTIFY_BAD;
 		break;
 	case CPU_DYING:
 	case CPU_DYING_FROZEN:
···
 		break;
 	}
 	trace_rcu_utilization("End CPU hotplug");
-	return NOTIFY_OK;
+	return ret;
 }

 /*
···
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		rsp->gp_kthread = t;
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		rcu_spawn_nocb_kthreads(rsp);
 	}
 	return 0;
 }
···
 	rcu_init_one(&rcu_sched_state, &rcu_sched_data);
 	rcu_init_one(&rcu_bh_state, &rcu_bh_data);
 	__rcu_init_preempt();
+	rcu_init_nocb();
 	open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

 	/*
+63 -4
kernel/rcutree.h
···
 	long qlen_last_fqs_check;
 					/* qlen at last check for QS forcing */
 	unsigned long n_cbs_invoked;	/* count of RCU cbs invoked. */
+	unsigned long n_nocbs_invoked;	/* count of no-CBs RCU cbs invoked. */
 	unsigned long n_cbs_orphaned;	/* RCU cbs orphaned by dying CPU */
 	unsigned long n_cbs_adopted;	/* RCU cbs adopted from dying CPU */
 	unsigned long n_force_qs_snap;
···
 #ifdef CONFIG_RCU_FAST_NO_HZ
 	struct rcu_head oom_head;
 #endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
+
+	/* 7) Callback offloading. */
+#ifdef CONFIG_RCU_NOCB_CPU
+	struct rcu_head *nocb_head;	/* CBs waiting for kthread. */
+	struct rcu_head **nocb_tail;
+	atomic_long_t nocb_q_count;	/* # CBs waiting for kthread */
+	atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+	int nocb_p_count;		/* # CBs being invoked by kthread */
+	int nocb_p_count_lazy;		/*  (approximate). */
+	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
+	struct task_struct *nocb_kthread;
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */

 	int cpu;
 	struct rcu_state *rsp;
···
 	struct rcu_data __percpu *rda;		/* pointer of percu rcu_data. */
 	void (*call)(struct rcu_head *head,	/* call_rcu() flavor. */
 		     void (*func)(struct rcu_head *head));
+#ifdef CONFIG_RCU_NOCB_CPU
+	void (*call_remote)(struct rcu_head *head,
+		     void (*func)(struct rcu_head *head));
+						/* call_rcu() flavor, but for */
+						/*  placing on remote CPU. */
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */

 	/* The following fields are guarded by the root rcu_node's lock. */
···

 	/* End of fields guarded by root rcu_node's lock. */

-	raw_spinlock_t onofflock ____cacheline_internodealigned_in_smp;
-						/* exclude on/offline and */
-						/* starting new GP. */
+	raw_spinlock_t orphan_lock ____cacheline_internodealigned_in_smp;
+						/* Protect following fields. */
 	struct rcu_head *orphan_nxtlist;	/* Orphaned callbacks that */
 						/*  need a grace period. */
 	struct rcu_head **orphan_nxttail;	/* Tail of above. */
···
 	struct rcu_head **orphan_donetail;	/* Tail of above. */
 	long qlen_lazy;				/* Number of lazy callbacks. */
 	long qlen;				/* Total number of callbacks. */
-	/* End of fields guarded by onofflock. */
+	/* End of fields guarded by orphan_lock. */

 	struct mutex onoff_mutex;		/* Coordinate hotplug & GPs. */
···
 	unsigned long n_barrier_done;		/* ++ at start and end of */
 						/*  _rcu_barrier(). */
 	/* End of fields guarded by barrier_mutex. */
+
+	atomic_long_t expedited_start;		/* Starting ticket. */
+	atomic_long_t expedited_done;		/* Done ticket. */
+	atomic_long_t expedited_wrap;		/* # near-wrap incidents. */
+	atomic_long_t expedited_tryfail;	/* # acquisition failures. */
+	atomic_long_t expedited_workdone1;	/* # done by others #1. */
+	atomic_long_t expedited_workdone2;	/* # done by others #2. */
+	atomic_long_t expedited_normal;		/* # fallbacks to normal. */
+	atomic_long_t expedited_stoppedcpus;	/* # successful stop_cpus. */
+	atomic_long_t expedited_done_tries;	/* # tries to update _done. */
+	atomic_long_t expedited_done_lost;	/* # times beaten to _done. */
+	atomic_long_t expedited_done_exit;	/* # times exited _done loop. */

 	unsigned long jiffies_force_qs;		/* Time at which to invoke */
 						/*  force_quiescent_state(). */
···
 #define RCU_GP_FLAG_FQS 0x2 /* Need grace-period quiescent-state forcing. */

 extern struct list_head rcu_struct_flavors;
+
+/* Sequence through rcu_state structures for each RCU flavor. */
 #define for_each_rcu_flavor(rsp) \
 	list_for_each_entry((rsp), &rcu_struct_flavors, flavors)
···
 static void print_cpu_stall_info_end(void);
 static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static void increment_cpu_stall_ticks(void);
+static bool is_nocb_cpu(int cpu);
+static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
+			    bool lazy);
+static bool rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp,
+				      struct rcu_data *rdp);
+static bool nocb_cpu_expendable(int cpu);
+static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
+static void rcu_spawn_nocb_kthreads(struct rcu_state *rsp);
+static void init_nocb_callback_list(struct rcu_data *rdp);
+static void __init rcu_init_nocb(void);

 #endif /* #ifndef RCU_TREE_NONCORE */
+
+#ifdef CONFIG_RCU_TRACE
+#ifdef CONFIG_RCU_NOCB_CPU
+/* Sum up queue lengths for tracing. */
+static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
+{
+	*ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
+	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) + rdp->nocb_p_count_lazy;
+}
+#else /* #ifdef CONFIG_RCU_NOCB_CPU */
+static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
+{
+	*ql = 0;
+	*qll = 0;
+}
+#endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
+#endif /* #ifdef CONFIG_RCU_TRACE */
+409 -6
kernel/rcutree_plugin.h
··· 25 25 */ 26 26 27 27 #include <linux/delay.h> 28 + #include <linux/gfp.h> 28 29 #include <linux/oom.h> 29 30 #include <linux/smpboot.h> 30 31 ··· 36 35 #else 37 36 #define RCU_BOOST_PRIO RCU_KTHREAD_PRIO 38 37 #endif 38 + 39 + #ifdef CONFIG_RCU_NOCB_CPU 40 + static cpumask_var_t rcu_nocb_mask; /* CPUs to have callbacks offloaded. */ 41 + static bool have_rcu_nocb_mask; /* Was rcu_nocb_mask allocated? */ 42 + static bool rcu_nocb_poll; /* Offload kthread are to poll. */ 43 + module_param(rcu_nocb_poll, bool, 0444); 44 + static char __initdata nocb_buf[NR_CPUS * 5]; 45 + #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ 39 46 40 47 /* 41 48 * Check the RCU kernel configuration parameters and print informative ··· 85 76 printk(KERN_INFO "\tExperimental boot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf); 86 77 if (nr_cpu_ids != NR_CPUS) 87 78 printk(KERN_INFO "\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids); 79 + #ifdef CONFIG_RCU_NOCB_CPU 80 + if (have_rcu_nocb_mask) { 81 + if (cpumask_test_cpu(0, rcu_nocb_mask)) { 82 + cpumask_clear_cpu(0, rcu_nocb_mask); 83 + pr_info("\tCPU 0: illegal no-CBs CPU (cleared).\n"); 84 + } 85 + cpulist_scnprintf(nocb_buf, sizeof(nocb_buf), rcu_nocb_mask); 86 + pr_info("\tExperimental no-CBs CPUs: %s.\n", nocb_buf); 87 + if (rcu_nocb_poll) 88 + pr_info("\tExperimental polled no-CBs CPUs.\n"); 89 + } 90 + #endif /* #ifdef CONFIG_RCU_NOCB_CPU */ 88 91 } 89 92 90 93 #ifdef CONFIG_TREE_PREEMPT_RCU ··· 663 642 */ 664 643 void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu)) 665 644 { 666 - __call_rcu(head, func, &rcu_preempt_state, 0); 645 + __call_rcu(head, func, &rcu_preempt_state, -1, 0); 667 646 } 668 647 EXPORT_SYMBOL_GPL(call_rcu); 669 648 ··· 677 656 void kfree_call_rcu(struct rcu_head *head, 678 657 void (*func)(struct rcu_head *rcu)) 679 658 { 680 - __call_rcu(head, func, &rcu_preempt_state, 1); 659 + __call_rcu(head, func, &rcu_preempt_state, -1, 1); 681 660 } 682 661 
EXPORT_SYMBOL_GPL(kfree_call_rcu); 683 662 ··· 691 670 * concurrently with new RCU read-side critical sections that began while 692 671 * synchronize_rcu() was waiting. RCU read-side critical sections are 693 672 * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. 673 + * 674 + * See the description of synchronize_sched() for more detailed information 675 + * on memory ordering guarantees. 694 676 */ 695 677 void synchronize_rcu(void) 696 678 { ··· 703 679 "Illegal synchronize_rcu() in RCU read-side critical section"); 704 680 if (!rcu_scheduler_active) 705 681 return; 706 - wait_rcu_gp(call_rcu); 682 + if (rcu_expedited) 683 + synchronize_rcu_expedited(); 684 + else 685 + wait_rcu_gp(call_rcu); 707 686 } 708 687 EXPORT_SYMBOL_GPL(synchronize_rcu); 709 688 ··· 784 757 * grace period for the specified rcu_node structure. If there are no such 785 758 * tasks, report it up the rcu_node hierarchy. 786 759 * 787 - * Caller must hold sync_rcu_preempt_exp_mutex and rsp->onofflock. 760 + * Caller must hold sync_rcu_preempt_exp_mutex and must exclude 761 + * CPU hotplug operations. 788 762 */ 789 763 static void 790 764 sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp) ··· 859 831 udelay(trycount * num_online_cpus()); 860 832 } else { 861 833 put_online_cpus(); 862 - synchronize_rcu(); 834 + wait_rcu_gp(call_rcu); 863 835 return; 864 836 } 865 837 } ··· 903 875 904 876 /** 905 877 * rcu_barrier - Wait until all in-flight call_rcu() callbacks complete. 878 + * 879 + * Note that this primitive does not necessarily wait for an RCU grace period 880 + * to complete. For example, if there are no RCU callbacks queued anywhere 881 + * in the system, then rcu_barrier() is within its rights to return 882 + * immediately, without waiting for anything, much less an RCU grace period. 
906 883 */ 907 884 void rcu_barrier(void) 908 885 { ··· 1046 1013 void kfree_call_rcu(struct rcu_head *head, 1047 1014 void (*func)(struct rcu_head *rcu)) 1048 1015 { 1049 - __call_rcu(head, func, &rcu_sched_state, 1); 1016 + __call_rcu(head, func, &rcu_sched_state, -1, 1); 1050 1017 } 1051 1018 EXPORT_SYMBOL_GPL(kfree_call_rcu); 1052 1019 ··· 2125 2092 } 2126 2093 2127 2094 #endif /* #else #ifdef CONFIG_RCU_CPU_STALL_INFO */ 2095 + 2096 + #ifdef CONFIG_RCU_NOCB_CPU 2097 + 2098 + /* 2099 + * Offload callback processing from the boot-time-specified set of CPUs 2100 + * specified by rcu_nocb_mask. For each CPU in the set, there is a 2101 + * kthread created that pulls the callbacks from the corresponding CPU, 2102 + * waits for a grace period to elapse, and invokes the callbacks. 2103 + * The no-CBs CPUs do a wake_up() on their kthread when they insert 2104 + * a callback into any empty list, unless the rcu_nocb_poll boot parameter 2105 + * has been specified, in which case each kthread actively polls its 2106 + * CPU. (Which isn't so great for energy efficiency, but which does 2107 + * reduce RCU's overhead on that CPU.) 2108 + * 2109 + * This is intended to be used in conjunction with Frederic Weisbecker's 2110 + * adaptive-idle work, which would seriously reduce OS jitter on CPUs 2111 + * running CPU-bound user-mode computations. 2112 + * 2113 + * Offloading of callback processing could also in theory be used as 2114 + * an energy-efficiency measure because CPUs with no RCU callbacks 2115 + * queued are more aggressive about entering dyntick-idle mode. 2116 + */ 2117 + 2118 + 2119 + /* Parse the boot-time rcu_nocb_mask CPU list from the kernel parameters. */ 2120 + static int __init rcu_nocb_setup(char *str) 2121 + { 2122 + alloc_bootmem_cpumask_var(&rcu_nocb_mask); 2123 + have_rcu_nocb_mask = true; 2124 + cpulist_parse(str, rcu_nocb_mask); 2125 + return 1; 2126 + } 2127 + __setup("rcu_nocbs=", rcu_nocb_setup); 2128 + 2129 + /* Is the specified CPU a no-CPUs CPU? 
*/ 2130 + static bool is_nocb_cpu(int cpu) 2131 + { 2132 + if (have_rcu_nocb_mask) 2133 + return cpumask_test_cpu(cpu, rcu_nocb_mask); 2134 + return false; 2135 + } 2136 + 2137 + /* 2138 + * Enqueue the specified string of rcu_head structures onto the specified 2139 + * CPU's no-CBs lists. The CPU is specified by rdp, the head of the 2140 + * string by rhp, and the tail of the string by rhtp. The non-lazy/lazy 2141 + * counts are supplied by rhcount and rhcount_lazy. 2142 + * 2143 + * If warranted, also wake up the kthread servicing this CPUs queues. 2144 + */ 2145 + static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, 2146 + struct rcu_head *rhp, 2147 + struct rcu_head **rhtp, 2148 + int rhcount, int rhcount_lazy) 2149 + { 2150 + int len; 2151 + struct rcu_head **old_rhpp; 2152 + struct task_struct *t; 2153 + 2154 + /* Enqueue the callback on the nocb list and update counts. */ 2155 + old_rhpp = xchg(&rdp->nocb_tail, rhtp); 2156 + ACCESS_ONCE(*old_rhpp) = rhp; 2157 + atomic_long_add(rhcount, &rdp->nocb_q_count); 2158 + atomic_long_add(rhcount_lazy, &rdp->nocb_q_count_lazy); 2159 + 2160 + /* If we are not being polled and there is a kthread, awaken it ... */ 2161 + t = ACCESS_ONCE(rdp->nocb_kthread); 2162 + if (rcu_nocb_poll | !t) 2163 + return; 2164 + len = atomic_long_read(&rdp->nocb_q_count); 2165 + if (old_rhpp == &rdp->nocb_head) { 2166 + wake_up(&rdp->nocb_wq); /* ... only if queue was empty ... */ 2167 + rdp->qlen_last_fqs_check = 0; 2168 + } else if (len > rdp->qlen_last_fqs_check + qhimark) { 2169 + wake_up_process(t); /* ... or if many callbacks queued. */ 2170 + rdp->qlen_last_fqs_check = LONG_MAX / 2; 2171 + } 2172 + return; 2173 + } 2174 + 2175 + /* 2176 + * This is a helper for __call_rcu(), which invokes this when the normal 2177 + * callback queue is inoperable. If this is not a no-CBs CPU, this 2178 + * function returns failure back to __call_rcu(), which can complain 2179 + * appropriately. 
2180 + * 2181 + * Otherwise, this function queues the callback where the corresponding 2182 + * "rcuo" kthread can find it. 2183 + */ 2184 + static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, 2185 + bool lazy) 2186 + { 2187 + 2188 + if (!is_nocb_cpu(rdp->cpu)) 2189 + return 0; 2190 + __call_rcu_nocb_enqueue(rdp, rhp, &rhp->next, 1, lazy); 2191 + return 1; 2192 + } 2193 + 2194 + /* 2195 + * Adopt orphaned callbacks on a no-CBs CPU, or return 0 if this is 2196 + * not a no-CBs CPU. 2197 + */ 2198 + static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp, 2199 + struct rcu_data *rdp) 2200 + { 2201 + long ql = rsp->qlen; 2202 + long qll = rsp->qlen_lazy; 2203 + 2204 + /* If this is not a no-CBs CPU, tell the caller to do it the old way. */ 2205 + if (!is_nocb_cpu(smp_processor_id())) 2206 + return 0; 2207 + rsp->qlen = 0; 2208 + rsp->qlen_lazy = 0; 2209 + 2210 + /* First, enqueue the donelist, if any. This preserves CB ordering. */ 2211 + if (rsp->orphan_donelist != NULL) { 2212 + __call_rcu_nocb_enqueue(rdp, rsp->orphan_donelist, 2213 + rsp->orphan_donetail, ql, qll); 2214 + ql = qll = 0; 2215 + rsp->orphan_donelist = NULL; 2216 + rsp->orphan_donetail = &rsp->orphan_donelist; 2217 + } 2218 + if (rsp->orphan_nxtlist != NULL) { 2219 + __call_rcu_nocb_enqueue(rdp, rsp->orphan_nxtlist, 2220 + rsp->orphan_nxttail, ql, qll); 2221 + ql = qll = 0; 2222 + rsp->orphan_nxtlist = NULL; 2223 + rsp->orphan_nxttail = &rsp->orphan_nxtlist; 2224 + } 2225 + return 1; 2226 + } 2227 + 2228 + /* 2229 + * There must be at least one non-no-CBs CPU in operation at any given 2230 + * time, because no-CBs CPUs are not capable of initiating grace periods 2231 + * independently. This function therefore complains if the specified 2232 + * CPU is the last non-no-CBs CPU, allowing the CPU-hotplug system to 2233 + * avoid offlining the last such CPU. (Recursion is a wonderful thing, 2234 + * but you have to have a base case!) 
2235 + */ 2236 + static bool nocb_cpu_expendable(int cpu) 2237 + { 2238 + cpumask_var_t non_nocb_cpus; 2239 + int ret; 2240 + 2241 + /* 2242 + * If there are no no-CB CPUs or if this CPU is not a no-CB CPU, 2243 + * then offlining this CPU is harmless. Let it happen. 2244 + */ 2245 + if (!have_rcu_nocb_mask || is_nocb_cpu(cpu)) 2246 + return 1; 2247 + 2248 + /* If no memory, play it safe and keep the CPU around. */ 2249 + if (!alloc_cpumask_var(&non_nocb_cpus, GFP_NOIO)) 2250 + return 0; 2251 + cpumask_andnot(non_nocb_cpus, cpu_online_mask, rcu_nocb_mask); 2252 + cpumask_clear_cpu(cpu, non_nocb_cpus); 2253 + ret = !cpumask_empty(non_nocb_cpus); 2254 + free_cpumask_var(non_nocb_cpus); 2255 + return ret; 2256 + } 2257 + 2258 + /* 2259 + * Helper structure for remote registry of RCU callbacks. 2260 + * This is needed for when a no-CBs CPU needs to start a grace period. 2261 + * If it just invokes call_rcu(), the resulting callback will be queued, 2262 + * which can result in deadlock. 2263 + */ 2264 + struct rcu_head_remote { 2265 + struct rcu_head *rhp; 2266 + call_rcu_func_t *crf; 2267 + void (*func)(struct rcu_head *rhp); 2268 + }; 2269 + 2270 + /* 2271 + * Register a callback as specified by the rcu_head_remote struct. 2272 + * This function is intended to be invoked via smp_call_function_single(). 2273 + */ 2274 + static void call_rcu_local(void *arg) 2275 + { 2276 + struct rcu_head_remote *rhrp = 2277 + container_of(arg, struct rcu_head_remote, rhp); 2278 + 2279 + rhrp->crf(rhrp->rhp, rhrp->func); 2280 + } 2281 + 2282 + /* 2283 + * Set up an rcu_head_remote structure and the invoke call_rcu_local() 2284 + * on CPU 0 (which is guaranteed to be a non-no-CBs CPU) via 2285 + * smp_call_function_single(). 
2286 + */ 2287 + static void invoke_crf_remote(struct rcu_head *rhp, 2288 + void (*func)(struct rcu_head *rhp), 2289 + call_rcu_func_t crf) 2290 + { 2291 + struct rcu_head_remote rhr; 2292 + 2293 + rhr.rhp = rhp; 2294 + rhr.crf = crf; 2295 + rhr.func = func; 2296 + smp_call_function_single(0, call_rcu_local, &rhr, 1); 2297 + } 2298 + 2299 + /* 2300 + * Helper functions to be passed to wait_rcu_gp(), each of which 2301 + * invokes invoke_crf_remote() to register a callback appropriately. 2302 + */ 2303 + static void __maybe_unused 2304 + call_rcu_preempt_remote(struct rcu_head *rhp, 2305 + void (*func)(struct rcu_head *rhp)) 2306 + { 2307 + invoke_crf_remote(rhp, func, call_rcu); 2308 + } 2309 + static void call_rcu_bh_remote(struct rcu_head *rhp, 2310 + void (*func)(struct rcu_head *rhp)) 2311 + { 2312 + invoke_crf_remote(rhp, func, call_rcu_bh); 2313 + } 2314 + static void call_rcu_sched_remote(struct rcu_head *rhp, 2315 + void (*func)(struct rcu_head *rhp)) 2316 + { 2317 + invoke_crf_remote(rhp, func, call_rcu_sched); 2318 + } 2319 + 2320 + /* 2321 + * Per-rcu_data kthread, but only for no-CBs CPUs. Each kthread invokes 2322 + * callbacks queued by the corresponding no-CBs CPU. 2323 + */ 2324 + static int rcu_nocb_kthread(void *arg) 2325 + { 2326 + int c, cl; 2327 + struct rcu_head *list; 2328 + struct rcu_head *next; 2329 + struct rcu_head **tail; 2330 + struct rcu_data *rdp = arg; 2331 + 2332 + /* Each pass through this loop invokes one batch of callbacks */ 2333 + for (;;) { 2334 + /* If not polling, wait for next batch of callbacks. */ 2335 + if (!rcu_nocb_poll) 2336 + wait_event(rdp->nocb_wq, rdp->nocb_head); 2337 + list = ACCESS_ONCE(rdp->nocb_head); 2338 + if (!list) { 2339 + schedule_timeout_interruptible(1); 2340 + continue; 2341 + } 2342 + 2343 + /* 2344 + * Extract queued callbacks, update counts, and wait 2345 + * for a grace period to elapse. 
2346 + */ 2347 + ACCESS_ONCE(rdp->nocb_head) = NULL; 2348 + tail = xchg(&rdp->nocb_tail, &rdp->nocb_head); 2349 + c = atomic_long_xchg(&rdp->nocb_q_count, 0); 2350 + cl = atomic_long_xchg(&rdp->nocb_q_count_lazy, 0); 2351 + ACCESS_ONCE(rdp->nocb_p_count) += c; 2352 + ACCESS_ONCE(rdp->nocb_p_count_lazy) += cl; 2353 + wait_rcu_gp(rdp->rsp->call_remote); 2354 + 2355 + /* Each pass through the following loop invokes a callback. */ 2356 + trace_rcu_batch_start(rdp->rsp->name, cl, c, -1); 2357 + c = cl = 0; 2358 + while (list) { 2359 + next = list->next; 2360 + /* Wait for enqueuing to complete, if needed. */ 2361 + while (next == NULL && &list->next != tail) { 2362 + schedule_timeout_interruptible(1); 2363 + next = list->next; 2364 + } 2365 + debug_rcu_head_unqueue(list); 2366 + local_bh_disable(); 2367 + if (__rcu_reclaim(rdp->rsp->name, list)) 2368 + cl++; 2369 + c++; 2370 + local_bh_enable(); 2371 + list = next; 2372 + } 2373 + trace_rcu_batch_end(rdp->rsp->name, c, !!list, 0, 0, 1); 2374 + ACCESS_ONCE(rdp->nocb_p_count) -= c; 2375 + ACCESS_ONCE(rdp->nocb_p_count_lazy) -= cl; 2376 + rdp->n_nocbs_invoked += c; 2377 + } 2378 + return 0; 2379 + } 2380 + 2381 + /* Initialize per-rcu_data variables for no-CBs CPUs. */ 2382 + static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) 2383 + { 2384 + rdp->nocb_tail = &rdp->nocb_head; 2385 + init_waitqueue_head(&rdp->nocb_wq); 2386 + } 2387 + 2388 + /* Create a kthread for each RCU flavor for each no-CBs CPU. 
*/ 2389 + static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp) 2390 + { 2391 + int cpu; 2392 + struct rcu_data *rdp; 2393 + struct task_struct *t; 2394 + 2395 + if (rcu_nocb_mask == NULL) 2396 + return; 2397 + for_each_cpu(cpu, rcu_nocb_mask) { 2398 + rdp = per_cpu_ptr(rsp->rda, cpu); 2399 + t = kthread_run(rcu_nocb_kthread, rdp, "rcuo%d", cpu); 2400 + BUG_ON(IS_ERR(t)); 2401 + ACCESS_ONCE(rdp->nocb_kthread) = t; 2402 + } 2403 + } 2404 + 2405 + /* Prevent __call_rcu() from enqueuing callbacks on no-CBs CPUs */ 2406 + static void init_nocb_callback_list(struct rcu_data *rdp) 2407 + { 2408 + if (rcu_nocb_mask == NULL || 2409 + !cpumask_test_cpu(rdp->cpu, rcu_nocb_mask)) 2410 + return; 2411 + rdp->nxttail[RCU_NEXT_TAIL] = NULL; 2412 + } 2413 + 2414 + /* Initialize the ->call_remote fields in the rcu_state structures. */ 2415 + static void __init rcu_init_nocb(void) 2416 + { 2417 + #ifdef CONFIG_PREEMPT_RCU 2418 + rcu_preempt_state.call_remote = call_rcu_preempt_remote; 2419 + #endif /* #ifdef CONFIG_PREEMPT_RCU */ 2420 + rcu_bh_state.call_remote = call_rcu_bh_remote; 2421 + rcu_sched_state.call_remote = call_rcu_sched_remote; 2422 + } 2423 + 2424 + #else /* #ifdef CONFIG_RCU_NOCB_CPU */ 2425 + 2426 + static bool is_nocb_cpu(int cpu) 2427 + { 2428 + return false; 2429 + } 2430 + 2431 + static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp, 2432 + bool lazy) 2433 + { 2434 + return 0; 2435 + } 2436 + 2437 + static bool __maybe_unused rcu_nocb_adopt_orphan_cbs(struct rcu_state *rsp, 2438 + struct rcu_data *rdp) 2439 + { 2440 + return 0; 2441 + } 2442 + 2443 + static bool nocb_cpu_expendable(int cpu) 2444 + { 2445 + return 1; 2446 + } 2447 + 2448 + static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) 2449 + { 2450 + } 2451 + 2452 + static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp) 2453 + { 2454 + } 2455 + 2456 + static void init_nocb_callback_list(struct rcu_data *rdp) 2457 + { 2458 + } 2459 + 2460 + static 
void __init rcu_init_nocb(void) 2461 + { 2462 + } 2463 + 2464 + #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
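The no-CBs enqueue path above relies on a compact lock-free idiom: `xchg()` the tail pointer to the new element's `->next` slot, then store the element through the old tail, while the `rcuo` kthread detaches the whole list with another `xchg()`. A self-contained userspace sketch of the same idiom with C11 atomics (struct and function names are invented for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct cb {
	struct cb *next;
	int id;
};

struct cb_queue {
	struct cb *head;
	_Atomic(struct cb **) tail;	/* address of the last ->next slot */
};

static void q_init(struct cb_queue *q)
{
	q->head = NULL;
	atomic_store(&q->tail, &q->head);
}

/* Producer side, as in __call_rcu_nocb_enqueue(): swap the tail first,
 * then link through the old tail. Between the two steps a consumer can
 * observe a NULL ->next mid-list, which is why rcu_nocb_kthread() spins
 * briefly ("Wait for enqueuing to complete") when it hits one. */
static void q_enqueue(struct cb_queue *q, struct cb *c)
{
	struct cb **old_tail;

	c->next = NULL;
	old_tail = atomic_exchange(&q->tail, &c->next);
	*old_tail = c;
}

/* Consumer side: detach everything, resetting the tail back to &head. */
static struct cb *q_detach_all(struct cb_queue *q)
{
	struct cb *list = q->head;

	q->head = NULL;
	atomic_exchange(&q->tail, &q->head);
	return list;
}
```

The single `xchg()` on the producer side is what lets `call_rcu()` from a no-CBs CPU enqueue without taking a lock.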
+159 -165
kernel/rcutree_trace.c
··· 46 46 #define RCU_TREE_NONCORE 47 47 #include "rcutree.h" 48 48 49 - static int show_rcubarrier(struct seq_file *m, void *unused) 50 - { 51 - struct rcu_state *rsp; 49 + #define ulong2long(a) (*(long *)(&(a))) 52 50 53 - for_each_rcu_flavor(rsp) 54 - seq_printf(m, "%s: bcc: %d nbd: %lu\n", 55 - rsp->name, 56 - atomic_read(&rsp->barrier_cpu_count), 57 - rsp->n_barrier_done); 51 + static int r_open(struct inode *inode, struct file *file, 52 + const struct seq_operations *op) 53 + { 54 + int ret = seq_open(file, op); 55 + if (!ret) { 56 + struct seq_file *m = (struct seq_file *)file->private_data; 57 + m->private = inode->i_private; 58 + } 59 + return ret; 60 + } 61 + 62 + static void *r_start(struct seq_file *m, loff_t *pos) 63 + { 64 + struct rcu_state *rsp = (struct rcu_state *)m->private; 65 + *pos = cpumask_next(*pos - 1, cpu_possible_mask); 66 + if ((*pos) < nr_cpu_ids) 67 + return per_cpu_ptr(rsp->rda, *pos); 68 + return NULL; 69 + } 70 + 71 + static void *r_next(struct seq_file *m, void *v, loff_t *pos) 72 + { 73 + (*pos)++; 74 + return r_start(m, pos); 75 + } 76 + 77 + static void r_stop(struct seq_file *m, void *v) 78 + { 79 + } 80 + 81 + static int show_rcubarrier(struct seq_file *m, void *v) 82 + { 83 + struct rcu_state *rsp = (struct rcu_state *)m->private; 84 + seq_printf(m, "bcc: %d nbd: %lu\n", 85 + atomic_read(&rsp->barrier_cpu_count), 86 + rsp->n_barrier_done); 58 87 return 0; 59 88 } 60 89 61 90 static int rcubarrier_open(struct inode *inode, struct file *file) 62 91 { 63 - return single_open(file, show_rcubarrier, NULL); 92 + return single_open(file, show_rcubarrier, inode->i_private); 64 93 } 65 94 66 95 static const struct file_operations rcubarrier_fops = { 67 96 .owner = THIS_MODULE, 68 97 .open = rcubarrier_open, 69 98 .read = seq_read, 70 - .llseek = seq_lseek, 71 - .release = single_release, 99 + .llseek = no_llseek, 100 + .release = seq_release, 72 101 }; 73 102 74 103 #ifdef CONFIG_RCU_BOOST ··· 113 84 114 85 static void 
print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp) 115 86 { 87 + long ql, qll; 88 + 116 89 if (!rdp->beenonline) 117 90 return; 118 - seq_printf(m, "%3d%cc=%lu g=%lu pq=%d qp=%d", 91 + seq_printf(m, "%3d%cc=%ld g=%ld pq=%d qp=%d", 119 92 rdp->cpu, 120 93 cpu_is_offline(rdp->cpu) ? '!' : ' ', 121 - rdp->completed, rdp->gpnum, 94 + ulong2long(rdp->completed), ulong2long(rdp->gpnum), 122 95 rdp->passed_quiesce, rdp->qs_pending); 123 96 seq_printf(m, " dt=%d/%llx/%d df=%lu", 124 97 atomic_read(&rdp->dynticks->dynticks), ··· 128 97 rdp->dynticks->dynticks_nmi_nesting, 129 98 rdp->dynticks_fqs); 130 99 seq_printf(m, " of=%lu", rdp->offline_fqs); 100 + rcu_nocb_q_lengths(rdp, &ql, &qll); 101 + qll += rdp->qlen_lazy; 102 + ql += rdp->qlen; 131 103 seq_printf(m, " ql=%ld/%ld qs=%c%c%c%c", 132 - rdp->qlen_lazy, rdp->qlen, 104 + qll, ql, 133 105 ".N"[rdp->nxttail[RCU_NEXT_READY_TAIL] != 134 106 rdp->nxttail[RCU_NEXT_TAIL]], 135 107 ".R"[rdp->nxttail[RCU_WAIT_TAIL] != ··· 148 114 per_cpu(rcu_cpu_kthread_loops, rdp->cpu) & 0xffff); 149 115 #endif /* #ifdef CONFIG_RCU_BOOST */ 150 116 seq_printf(m, " b=%ld", rdp->blimit); 151 - seq_printf(m, " ci=%lu co=%lu ca=%lu\n", 152 - rdp->n_cbs_invoked, rdp->n_cbs_orphaned, rdp->n_cbs_adopted); 117 + seq_printf(m, " ci=%lu nci=%lu co=%lu ca=%lu\n", 118 + rdp->n_cbs_invoked, rdp->n_nocbs_invoked, 119 + rdp->n_cbs_orphaned, rdp->n_cbs_adopted); 153 120 } 154 121 155 - static int show_rcudata(struct seq_file *m, void *unused) 122 + static int show_rcudata(struct seq_file *m, void *v) 156 123 { 157 - int cpu; 158 - struct rcu_state *rsp; 159 - 160 - for_each_rcu_flavor(rsp) { 161 - seq_printf(m, "%s:\n", rsp->name); 162 - for_each_possible_cpu(cpu) 163 - print_one_rcu_data(m, per_cpu_ptr(rsp->rda, cpu)); 164 - } 124 + print_one_rcu_data(m, (struct rcu_data *)v); 165 125 return 0; 166 126 } 167 127 128 + static const struct seq_operations rcudate_op = { 129 + .start = r_start, 130 + .next = r_next, 131 + .stop = r_stop, 132 + .show = 
show_rcudata, 133 + }; 134 + 168 135 static int rcudata_open(struct inode *inode, struct file *file) 169 136 { 170 - return single_open(file, show_rcudata, NULL); 137 + return r_open(inode, file, &rcudate_op); 171 138 } 172 139 173 140 static const struct file_operations rcudata_fops = { 174 141 .owner = THIS_MODULE, 175 142 .open = rcudata_open, 176 143 .read = seq_read, 177 - .llseek = seq_lseek, 178 - .release = single_release, 144 + .llseek = no_llseek, 145 + .release = seq_release, 179 146 }; 180 147 181 - static void print_one_rcu_data_csv(struct seq_file *m, struct rcu_data *rdp) 148 + static int show_rcuexp(struct seq_file *m, void *v) 182 149 { 183 - if (!rdp->beenonline) 184 - return; 185 - seq_printf(m, "%d,%s,%lu,%lu,%d,%d", 186 - rdp->cpu, 187 - cpu_is_offline(rdp->cpu) ? "\"N\"" : "\"Y\"", 188 - rdp->completed, rdp->gpnum, 189 - rdp->passed_quiesce, rdp->qs_pending); 190 - seq_printf(m, ",%d,%llx,%d,%lu", 191 - atomic_read(&rdp->dynticks->dynticks), 192 - rdp->dynticks->dynticks_nesting, 193 - rdp->dynticks->dynticks_nmi_nesting, 194 - rdp->dynticks_fqs); 195 - seq_printf(m, ",%lu", rdp->offline_fqs); 196 - seq_printf(m, ",%ld,%ld,\"%c%c%c%c\"", rdp->qlen_lazy, rdp->qlen, 197 - ".N"[rdp->nxttail[RCU_NEXT_READY_TAIL] != 198 - rdp->nxttail[RCU_NEXT_TAIL]], 199 - ".R"[rdp->nxttail[RCU_WAIT_TAIL] != 200 - rdp->nxttail[RCU_NEXT_READY_TAIL]], 201 - ".W"[rdp->nxttail[RCU_DONE_TAIL] != 202 - rdp->nxttail[RCU_WAIT_TAIL]], 203 - ".D"[&rdp->nxtlist != rdp->nxttail[RCU_DONE_TAIL]]); 204 - #ifdef CONFIG_RCU_BOOST 205 - seq_printf(m, ",%d,\"%c\"", 206 - per_cpu(rcu_cpu_has_work, rdp->cpu), 207 - convert_kthread_status(per_cpu(rcu_cpu_kthread_status, 208 - rdp->cpu))); 209 - #endif /* #ifdef CONFIG_RCU_BOOST */ 210 - seq_printf(m, ",%ld", rdp->blimit); 211 - seq_printf(m, ",%lu,%lu,%lu\n", 212 - rdp->n_cbs_invoked, rdp->n_cbs_orphaned, rdp->n_cbs_adopted); 213 - } 150 + struct rcu_state *rsp = (struct rcu_state *)m->private; 214 151 215 - static int 
show_rcudata_csv(struct seq_file *m, void *unused) 216 - { 217 - int cpu; 218 - struct rcu_state *rsp; 219 - 220 - seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pq\","); 221 - seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\","); 222 - seq_puts(m, "\"of\",\"qll\",\"ql\",\"qs\""); 223 - #ifdef CONFIG_RCU_BOOST 224 - seq_puts(m, "\"kt\",\"ktl\""); 225 - #endif /* #ifdef CONFIG_RCU_BOOST */ 226 - seq_puts(m, ",\"b\",\"ci\",\"co\",\"ca\"\n"); 227 - for_each_rcu_flavor(rsp) { 228 - seq_printf(m, "\"%s:\"\n", rsp->name); 229 - for_each_possible_cpu(cpu) 230 - print_one_rcu_data_csv(m, per_cpu_ptr(rsp->rda, cpu)); 231 - } 152 + seq_printf(m, "s=%lu d=%lu w=%lu tf=%lu wd1=%lu wd2=%lu n=%lu sc=%lu dt=%lu dl=%lu dx=%lu\n", 153 + atomic_long_read(&rsp->expedited_start), 154 + atomic_long_read(&rsp->expedited_done), 155 + atomic_long_read(&rsp->expedited_wrap), 156 + atomic_long_read(&rsp->expedited_tryfail), 157 + atomic_long_read(&rsp->expedited_workdone1), 158 + atomic_long_read(&rsp->expedited_workdone2), 159 + atomic_long_read(&rsp->expedited_normal), 160 + atomic_long_read(&rsp->expedited_stoppedcpus), 161 + atomic_long_read(&rsp->expedited_done_tries), 162 + atomic_long_read(&rsp->expedited_done_lost), 163 + atomic_long_read(&rsp->expedited_done_exit)); 232 164 return 0; 233 165 } 234 166 235 - static int rcudata_csv_open(struct inode *inode, struct file *file) 167 + static int rcuexp_open(struct inode *inode, struct file *file) 236 168 { 237 - return single_open(file, show_rcudata_csv, NULL); 169 + return single_open(file, show_rcuexp, inode->i_private); 238 170 } 239 171 240 - static const struct file_operations rcudata_csv_fops = { 172 + static const struct file_operations rcuexp_fops = { 241 173 .owner = THIS_MODULE, 242 - .open = rcudata_csv_open, 174 + .open = rcuexp_open, 243 175 .read = seq_read, 244 - .llseek = seq_lseek, 245 - .release = single_release, 176 + .llseek = no_llseek, 177 + .release = seq_release, 246 178 }; 247 179 248 180 
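The per-CPU trace files are converted from `single_open()` to a real `seq_operations` cursor: `r_start()` maps the file position to the next possible CPU via `cpumask_next()`, and `r_next()` just bumps the position and re-resolves it. The start/next protocol can be sketched in plain C over a sparse bitmask — `next_set_bit()` below stands in for `cpumask_next()` and is not a kernel API:

```c
#include <assert.h>

#define NBITS 16	/* stands in for nr_cpu_ids */

/* Next set bit at or after 'from'; NBITS if none. */
static int next_set_bit(unsigned mask, int from)
{
	int i;

	for (i = from < 0 ? 0 : from; i < NBITS; i++)
		if (mask & (1u << i))
			return i;
	return NBITS;
}

/* seq_file-style cursor: start() resolves a position to an element
 * (or -1 at end of sequence), next() advances and re-resolves. */
static int cursor_start(unsigned mask, long *pos)
{
	*pos = next_set_bit(mask, (int)*pos);
	return *pos < NBITS ? (int)*pos : -1;
}

static int cursor_next(unsigned mask, long *pos)
{
	(*pos)++;
	return cursor_start(mask, pos);
}
```

Because the cursor skips holes in the mask, offline or impossible CPUs never produce a record, which is why `print_one_rcu_data()` no longer needs its own `for_each_possible_cpu()` loop.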
#ifdef CONFIG_RCU_BOOST ··· 254 254 .owner = THIS_MODULE, 255 255 .open = rcu_node_boost_open, 256 256 .read = seq_read, 257 - .llseek = seq_lseek, 257 + .llseek = no_llseek, 258 258 .release = single_release, 259 259 }; 260 260 261 - /* 262 - * Create the rcuboost debugfs entry. Standard error return. 263 - */ 264 - static int rcu_boost_trace_create_file(struct dentry *rcudir) 265 - { 266 - return !debugfs_create_file("rcuboost", 0444, rcudir, NULL, 267 - &rcu_node_boost_fops); 268 - } 269 - 270 - #else /* #ifdef CONFIG_RCU_BOOST */ 271 - 272 - static int rcu_boost_trace_create_file(struct dentry *rcudir) 273 - { 274 - return 0; /* There cannot be an error if we didn't create it! */ 275 - } 276 - 277 - #endif /* #else #ifdef CONFIG_RCU_BOOST */ 261 + #endif /* #ifdef CONFIG_RCU_BOOST */ 278 262 279 263 static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp) 280 264 { ··· 267 283 struct rcu_node *rnp; 268 284 269 285 gpnum = rsp->gpnum; 270 - seq_printf(m, "%s: c=%lu g=%lu s=%d jfq=%ld j=%x ", 271 - rsp->name, rsp->completed, gpnum, rsp->fqs_state, 286 + seq_printf(m, "c=%ld g=%ld s=%d jfq=%ld j=%x ", 287 + ulong2long(rsp->completed), ulong2long(gpnum), 288 + rsp->fqs_state, 272 289 (long)(rsp->jiffies_force_qs - jiffies), 273 290 (int)(jiffies & 0xffff)); 274 291 seq_printf(m, "nfqs=%lu/nfqsng=%lu(%lu) fqlh=%lu oqlen=%ld/%ld\n", ··· 291 306 seq_puts(m, "\n"); 292 307 } 293 308 294 - static int show_rcuhier(struct seq_file *m, void *unused) 309 + static int show_rcuhier(struct seq_file *m, void *v) 295 310 { 296 - struct rcu_state *rsp; 297 - 298 - for_each_rcu_flavor(rsp) 299 - print_one_rcu_state(m, rsp); 311 + struct rcu_state *rsp = (struct rcu_state *)m->private; 312 + print_one_rcu_state(m, rsp); 300 313 return 0; 301 314 } 302 315 303 316 static int rcuhier_open(struct inode *inode, struct file *file) 304 317 { 305 - return single_open(file, show_rcuhier, NULL); 318 + return single_open(file, show_rcuhier, inode->i_private); 306 319 } 307 
320 308 321 static const struct file_operations rcuhier_fops = { 309 322 .owner = THIS_MODULE, 310 323 .open = rcuhier_open, 311 324 .read = seq_read, 312 - .llseek = seq_lseek, 313 - .release = single_release, 325 + .llseek = no_llseek, 326 + .release = seq_release, 314 327 }; 315 328 316 329 static void show_one_rcugp(struct seq_file *m, struct rcu_state *rsp) ··· 321 338 struct rcu_node *rnp = &rsp->node[0]; 322 339 323 340 raw_spin_lock_irqsave(&rnp->lock, flags); 324 - completed = rsp->completed; 325 - gpnum = rsp->gpnum; 326 - if (rsp->completed == rsp->gpnum) 341 + completed = ACCESS_ONCE(rsp->completed); 342 + gpnum = ACCESS_ONCE(rsp->gpnum); 343 + if (completed == gpnum) 327 344 gpage = 0; 328 345 else 329 346 gpage = jiffies - rsp->gp_start; 330 347 gpmax = rsp->gp_max; 331 348 raw_spin_unlock_irqrestore(&rnp->lock, flags); 332 - seq_printf(m, "%s: completed=%ld gpnum=%lu age=%ld max=%ld\n", 333 - rsp->name, completed, gpnum, gpage, gpmax); 349 + seq_printf(m, "completed=%ld gpnum=%ld age=%ld max=%ld\n", 350 + ulong2long(completed), ulong2long(gpnum), gpage, gpmax); 334 351 } 335 352 336 - static int show_rcugp(struct seq_file *m, void *unused) 353 + static int show_rcugp(struct seq_file *m, void *v) 337 354 { 338 - struct rcu_state *rsp; 339 - 340 - for_each_rcu_flavor(rsp) 341 - show_one_rcugp(m, rsp); 355 + struct rcu_state *rsp = (struct rcu_state *)m->private; 356 + show_one_rcugp(m, rsp); 342 357 return 0; 343 358 } 344 359 345 360 static int rcugp_open(struct inode *inode, struct file *file) 346 361 { 347 - return single_open(file, show_rcugp, NULL); 362 + return single_open(file, show_rcugp, inode->i_private); 348 363 } 349 364 350 365 static const struct file_operations rcugp_fops = { 351 366 .owner = THIS_MODULE, 352 367 .open = rcugp_open, 353 368 .read = seq_read, 354 - .llseek = seq_lseek, 355 - .release = single_release, 369 + .llseek = no_llseek, 370 + .release = seq_release, 356 371 }; 357 372 358 373 static void 
print_one_rcu_pending(struct seq_file *m, struct rcu_data *rdp) 359 374 { 375 + if (!rdp->beenonline) 376 + return; 360 377 seq_printf(m, "%3d%cnp=%ld ", 361 378 rdp->cpu, 362 379 cpu_is_offline(rdp->cpu) ? '!' : ' ', ··· 372 389 rdp->n_rp_need_nothing); 373 390 } 374 391 375 - static int show_rcu_pending(struct seq_file *m, void *unused) 392 + static int show_rcu_pending(struct seq_file *m, void *v) 376 393 { 377 - int cpu; 378 - struct rcu_data *rdp; 379 - struct rcu_state *rsp; 380 - 381 - for_each_rcu_flavor(rsp) { 382 - seq_printf(m, "%s:\n", rsp->name); 383 - for_each_possible_cpu(cpu) { 384 - rdp = per_cpu_ptr(rsp->rda, cpu); 385 - if (rdp->beenonline) 386 - print_one_rcu_pending(m, rdp); 387 - } 388 - } 394 + print_one_rcu_pending(m, (struct rcu_data *)v); 389 395 return 0; 390 396 } 391 397 398 + static const struct seq_operations rcu_pending_op = { 399 + .start = r_start, 400 + .next = r_next, 401 + .stop = r_stop, 402 + .show = show_rcu_pending, 403 + }; 404 + 392 405 static int rcu_pending_open(struct inode *inode, struct file *file) 393 406 { 394 - return single_open(file, show_rcu_pending, NULL); 407 + return r_open(inode, file, &rcu_pending_op); 395 408 } 396 409 397 410 static const struct file_operations rcu_pending_fops = { 398 411 .owner = THIS_MODULE, 399 412 .open = rcu_pending_open, 400 413 .read = seq_read, 401 - .llseek = seq_lseek, 402 - .release = single_release, 414 + .llseek = no_llseek, 415 + .release = seq_release, 403 416 }; 404 417 405 418 static int show_rcutorture(struct seq_file *m, void *unused) ··· 425 446 426 447 static int __init rcutree_trace_init(void) 427 448 { 449 + struct rcu_state *rsp; 428 450 struct dentry *retval; 451 + struct dentry *rspdir; 429 452 430 453 rcudir = debugfs_create_dir("rcu", NULL); 431 454 if (!rcudir) 432 455 goto free_out; 433 456 434 - retval = debugfs_create_file("rcubarrier", 0444, rcudir, 435 - NULL, &rcubarrier_fops); 436 - if (!retval) 437 - goto free_out; 457 + for_each_rcu_flavor(rsp) { 458 
+ rspdir = debugfs_create_dir(rsp->name, rcudir); 459 + if (!rspdir) 460 + goto free_out; 438 461 439 - retval = debugfs_create_file("rcudata", 0444, rcudir, 440 - NULL, &rcudata_fops); 441 - if (!retval) 442 - goto free_out; 462 + retval = debugfs_create_file("rcudata", 0444, 463 + rspdir, rsp, &rcudata_fops); 464 + if (!retval) 465 + goto free_out; 443 466 444 - retval = debugfs_create_file("rcudata.csv", 0444, rcudir, 445 - NULL, &rcudata_csv_fops); 446 - if (!retval) 447 - goto free_out; 467 + retval = debugfs_create_file("rcuexp", 0444, 468 + rspdir, rsp, &rcuexp_fops); 469 + if (!retval) 470 + goto free_out; 448 471 449 - if (rcu_boost_trace_create_file(rcudir)) 450 - goto free_out; 472 + retval = debugfs_create_file("rcu_pending", 0444, 473 + rspdir, rsp, &rcu_pending_fops); 474 + if (!retval) 475 + goto free_out; 451 476 452 - retval = debugfs_create_file("rcugp", 0444, rcudir, NULL, &rcugp_fops); 453 - if (!retval) 454 - goto free_out; 477 + retval = debugfs_create_file("rcubarrier", 0444, 478 + rspdir, rsp, &rcubarrier_fops); 479 + if (!retval) 480 + goto free_out; 455 481 456 - retval = debugfs_create_file("rcuhier", 0444, rcudir, 457 - NULL, &rcuhier_fops); 458 - if (!retval) 459 - goto free_out; 482 + #ifdef CONFIG_RCU_BOOST 483 + if (rsp == &rcu_preempt_state) { 484 + retval = debugfs_create_file("rcuboost", 0444, 485 + rspdir, NULL, &rcu_node_boost_fops); 486 + if (!retval) 487 + goto free_out; 488 + } 489 + #endif 460 490 461 - retval = debugfs_create_file("rcu_pending", 0444, rcudir, 462 - NULL, &rcu_pending_fops); 463 - if (!retval) 464 - goto free_out; 491 + retval = debugfs_create_file("rcugp", 0444, 492 + rspdir, rsp, &rcugp_fops); 493 + if (!retval) 494 + goto free_out; 495 + 496 + retval = debugfs_create_file("rcuhier", 0444, 497 + rspdir, rsp, &rcuhier_fops); 498 + if (!retval) 499 + goto free_out; 500 + } 465 501 466 502 retval = debugfs_create_file("rcutorture", 0444, rcudir, 467 503 NULL, &rcutorture_fops);
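Several counters in the trace output above switch from `%lu` to `%ld` via the new `ulong2long()` macro, so a free-running counter sampled just past wraparound prints as a small negative number instead of a huge positive one. The effect is easy to demonstrate, assuming (as the kernel does) that `long` and `unsigned long` share a width and representation:

```c
#include <assert.h>

/* As added in rcutree_trace.c: reinterpret an unsigned long lvalue
 * as signed, in place. */
#define ulong2long(a) (*(long *)(&(a)))

/* Difference of two free-running counters, as a signed value: still
 * meaningful when one operand has wrapped and the other has not. */
static long counter_delta(unsigned long newer, unsigned long older)
{
	unsigned long d = newer - older;

	return ulong2long(d);
}
```

This keeps `c=` and `g=` readable in the debugfs files even across the deliberate near-wrap initial values RCU uses for its grace-period counters.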
+17 -6
kernel/sched/core.c
···
 #include <linux/slab.h>
 #include <linux/init_task.h>
 #include <linux/binfmts.h>
+#include <linux/context_tracking.h>

 #include <asm/switch_to.h>
 #include <asm/tlb.h>
···
 	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
 #endif

+	context_tracking_task_switch(prev, next);
 	/* Here we just switch the register state and the stack. */
-	rcu_switch(prev, next);
 	switch_to(prev, next, prev);

 	barrier();
···
 }
 EXPORT_SYMBOL(schedule);

-#ifdef CONFIG_RCU_USER_QS
+#ifdef CONFIG_CONTEXT_TRACKING
 asmlinkage void __sched schedule_user(void)
 {
 	/*
···
 	 * we haven't yet exited the RCU idle mode. Do it here manually until
 	 * we find a better solution.
 	 */
-	rcu_user_exit();
+	user_exit();
 	schedule();
-	rcu_user_enter();
+	user_enter();
 }
 #endif
···
 	/* Catch callers which need to be fixed */
 	BUG_ON(ti->preempt_count || !irqs_disabled());

-	rcu_user_exit();
+	user_exit();
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
 		local_irq_enable();
···
 void sched_show_task(struct task_struct *p)
 {
 	unsigned long free = 0;
+	int ppid;
 	unsigned state;

 	state = p->state ? __ffs(p->state) + 1 : 0;
···
 #ifdef CONFIG_DEBUG_STACK_USAGE
 	free = stack_not_used(p);
 #endif
+	rcu_read_lock();
+	ppid = task_pid_nr(rcu_dereference(p->real_parent));
+	rcu_read_unlock();
 	printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free,
-		task_pid_nr(p), task_pid_nr(rcu_dereference(p->real_parent)),
+		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);

 	show_stack(p, NULL);
···
 	.base_cftypes = files,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
+
+void dump_cpu_task(int cpu)
+{
+	pr_info("Task dump for CPU %d:\n", cpu);
+	sched_show_task(cpu_curr(cpu));
+}
+11 -5
kernel/srcu.c
···
  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  *
  * Copyright (C) IBM Corporation, 2006
+ * Copyright (C) Fujitsu, 2012
  *
  * Author: Paul McKenney <paulmck@us.ibm.com>
+ *	   Lai Jiangshan <laijs@cn.fujitsu.com>
  *
  * For detailed explanation of Read-Copy Update mechanism see -
  * 		Documentation/RCU/ *.txt
···
 #include <linux/smp.h>
 #include <linux/delay.h>
 #include <linux/srcu.h>
+
+#include <trace/events/rcu.h>
+
+#include "rcu.h"

 /*
  * Initialize an rcu_batch structure to empty.
···
 		rcu_batch_init(from);
 	}
 }
-
-/* single-thread state-machine */
-static void process_srcu(struct work_struct *work);

 static int init_srcu_struct_fields(struct srcu_struct *sp)
 {
···
  */
 void synchronize_srcu(struct srcu_struct *sp)
 {
-	__synchronize_srcu(sp, SYNCHRONIZE_SRCU_TRYCOUNT);
+	__synchronize_srcu(sp, rcu_expedited
+			   ? SYNCHRONIZE_SRCU_EXP_TRYCOUNT
+			   : SYNCHRONIZE_SRCU_TRYCOUNT);
 }
 EXPORT_SYMBOL_GPL(synchronize_srcu);
···
 /*
  * This is the work-queue function that handles SRCU grace periods.
  */
-static void process_srcu(struct work_struct *work)
+void process_srcu(struct work_struct *work)
 {
 	struct srcu_struct *sp;
···
 	srcu_invoke_callbacks(sp);
 	srcu_reschedule(sp);
 }
+EXPORT_SYMBOL_GPL(process_srcu);
+1 -1
lib/Kconfig.debug
···
 	int "RCU CPU stall timeout in seconds"
 	depends on TREE_RCU || TREE_PREEMPT_RCU
 	range 3 300
-	default 60
+	default 21
 	help
 	  If a given RCU grace period extends more than the specified
 	  number of seconds, a CPU stall warning is printed.  If the