docs: rcu: convert some articles from html to ReST

-1391

Documentation/RCU/Design/Data-Structures/Data-Structures.html

··· 1 - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2 - "http://www.w3.org/TR/html4/loose.dtd"> 3 - <html> 4 - <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title> 5 - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> 6 - 7 - <p>December 18, 2016</p> 8 - <p>This article was contributed by Paul E. McKenney</p> 9 - 10 - <h3>Introduction</h3> 11 - 12 - This document describes RCU's major data structures and their relationship 13 - to each other. 14 - 15 - <ol> 16 - <li> <a href="#Data-Structure Relationships"> 17 - Data-Structure Relationships</a> 18 - <li> <a href="#The rcu_state Structure"> 19 - The <tt>rcu_state</tt> Structure</a> 20 - <li> <a href="#The rcu_node Structure"> 21 - The <tt>rcu_node</tt> Structure</a> 22 - <li> <a href="#The rcu_segcblist Structure"> 23 - The <tt>rcu_segcblist</tt> Structure</a> 24 - <li> <a href="#The rcu_data Structure"> 25 - The <tt>rcu_data</tt> Structure</a> 26 - <li> <a href="#The rcu_head Structure"> 27 - The <tt>rcu_head</tt> Structure</a> 28 - <li> <a href="#RCU-Specific Fields in the task_struct Structure"> 29 - RCU-Specific Fields in the <tt>task_struct</tt> Structure</a> 30 - <li> <a href="#Accessor Functions"> 31 - Accessor Functions</a> 32 - </ol> 33 - 34 - <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3> 35 - 36 - <p>RCU is for all intents and purposes a large state machine, and its 37 - data structures maintain the state in such a way as to allow RCU readers 38 - to execute extremely quickly, while also processing the RCU grace periods 39 - requested by updaters in an efficient and extremely scalable fashion. 
40 - The efficiency and scalability of RCU updaters are provided primarily 41 - by a combining tree, as shown below: 42 - 43 - </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%"> 44 - 45 - </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure 46 - containing a tree of <tt>rcu_node</tt> structures. 47 - Each leaf node of the <tt>rcu_node</tt> tree has up to 16 48 - <tt>rcu_data</tt> structures associated with it, so that there 49 - are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures, 50 - one for each possible CPU. 51 - This structure is adjusted at boot time, if needed, to handle the 52 - common case where <tt>nr_cpu_ids</tt> is much less than 53 - <tt>NR_CPUS</tt>. 54 - For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>, 55 - which results in a three-level <tt>rcu_node</tt> tree. 56 - If the actual hardware has only 16 CPUs, RCU will adjust itself 57 - at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node. 58 - 59 - </p><p>The purpose of this combining tree is to allow per-CPU events 60 - such as quiescent states, dyntick-idle transitions, 61 - and CPU hotplug operations to be processed efficiently 62 - and scalably. 63 - Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures, 64 - and other events are recorded by the leaf-level <tt>rcu_node</tt> 65 - structures. 66 - All of these events are combined at each level of the tree until finally 67 - grace periods are completed at the tree's root <tt>rcu_node</tt> 68 - structure. 69 - A grace period can be completed at the root once every CPU 70 - (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task) 71 - has passed through a quiescent state. 72 - Once a grace period has completed, record of that fact is propagated 73 - back down the tree. 
74 - 75 - </p><p>As can be seen from the diagram, on a 64-bit system 76 - a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout 77 - of 64 at the root and a fanout of 16 at the leaves. 78 - 79 - <table> 80 - <tr><th> </th></tr> 81 - <tr><th align="left">Quick Quiz:</th></tr> 82 - <tr><td> 83 - Why isn't the fanout at the leaves also 64? 84 - </td></tr> 85 - <tr><th align="left">Answer:</th></tr> 86 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 87 - Because there are more types of events that affect the leaf-level 88 - <tt>rcu_node</tt> structures than further up the tree. 89 - Therefore, if the leaf <tt>rcu_node</tt> structures have a fanout of 90 - 64, the contention on these structures' <tt>->lock</tt> fields 91 - becomes excessive. 92 - Experimentation on a wide variety of systems has shown that a fanout 93 - of 16 works well for the leaves of the <tt>rcu_node</tt> tree. 94 - </font> 95 - 96 - <p><font color="ffffff">Of course, further experience with 97 - systems having hundreds or thousands of CPUs may demonstrate 98 - that the fanout for the non-leaf <tt>rcu_node</tt> structures 99 - must also be reduced. 100 - Such reduction can be easily carried out when and if it proves 101 - necessary. 102 - In the meantime, if you are using such a system and running into 103 - contention problems on the non-leaf <tt>rcu_node</tt> structures, 104 - you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration 105 - parameter to reduce the non-leaf fanout as needed. 106 - </font> 107 - 108 - <p><font color="ffffff">Kernels built for systems with 109 - strong NUMA characteristics might also need to adjust 110 - <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the 111 - <tt>rcu_node</tt> structures align with hardware boundaries. 112 - However, there has thus far been no need for this. 
113 - </font></td></tr> 114 - <tr><td> </td></tr> 115 - </table> 116 - 117 - <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on 118 - a 32-bit system), then RCU will automatically add more levels to the 119 - tree. 120 - For example, if you are crazy enough to build a 64-bit system with 65,536 121 - CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows: 122 - 123 - </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%"> 124 - 125 - </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system 126 - accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for 127 - 32-bit systems. 128 - On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and 129 - <tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result 130 - in a 16-CPU test using a 4-level tree. 131 - This can be useful for testing large-system capabilities on small test 132 - machines. 133 - 134 - </p><p>This multi-level combining tree allows us to get most of the 135 - performance and scalability 136 - benefits of partitioning, even though RCU grace-period detection is 137 - inherently a global operation. 138 - The trick here is that only the last CPU to report a quiescent state 139 - into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> 140 - structure at the next level up the tree. 141 - This means that at the leaf-level <tt>rcu_node</tt> structure, only 142 - one access out of sixteen will progress up the tree. 143 - For the internal <tt>rcu_node</tt> structures, the situation is even 144 - more extreme: Only one access out of sixty-four will progress up 145 - the tree. 146 - Because the vast majority of the CPUs do not progress up the tree, 147 - the lock contention remains roughly constant up the tree. 
148 - No matter how many CPUs there are in the system, at most 64 quiescent-state 149 - reports per grace period will progress all the way to the root 150 - <tt>rcu_node</tt> structure, thus ensuring that the lock contention 151 - on that root <tt>rcu_node</tt> structure remains acceptably low. 152 - 153 - </p><p>In effect, the combining tree acts like a big shock absorber, 154 - keeping lock contention under control at all tree levels regardless 155 - of the level of loading on the system. 156 - 157 - </p><p>RCU updaters wait for normal grace periods by registering 158 - RCU callbacks, either directly via <tt>call_rcu()</tt> 159 - or indirectly via <tt>synchronize_rcu()</tt> and friends. 160 - RCU callbacks are represented by <tt>rcu_head</tt> structures, 161 - which are queued on <tt>rcu_data</tt> structures while they are 162 - waiting for a grace period to elapse, as shown in the following figure: 163 - 164 - </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%"> 165 - 166 - </p><p>This figure shows how <tt>TREE_RCU</tt>'s and 167 - <tt>PREEMPT_RCU</tt>'s major data structures are related. 168 - Lesser data structures will be introduced with the algorithms that 169 - make use of them. 170 - 171 - </p><p>Note that each of the data structures in the above figure has 172 - its own synchronization: 173 - 174 - <p><ol> 175 - <li> Each <tt>rcu_state</tt> structure has a lock and a mutex, 176 - and some fields are protected by the corresponding root 177 - <tt>rcu_node</tt> structure's lock. 178 - <li> Each <tt>rcu_node</tt> structure has a spinlock. 179 - <li> The fields in <tt>rcu_data</tt> are private to the corresponding 180 - CPU, although a few can be read and written by other CPUs. 181 - </ol> 182 - 183 - <p>It is important to note that different data structures can have 184 - very different ideas about the state of RCU at any given time. 
185 - For but one example, awareness of the start or end of a given RCU 186 - grace period propagates slowly through the data structures. 187 - This slow propagation is absolutely necessary for RCU to have good 188 - read-side performance. 189 - If this balkanized implementation seems foreign to you, one useful 190 - trick is to consider each instance of these data structures to be 191 - a different person, each having the usual slightly different 192 - view of reality. 193 - 194 - </p><p>The general role of each of these data structures is as 195 - follows: 196 - 197 - </p><ol> 198 - <li> <tt>rcu_state</tt>: 199 - This structure forms the interconnection between the 200 - <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, 201 - tracks grace periods, serves as a short-term repository 202 - for callbacks orphaned by CPU-hotplug events, 203 - maintains <tt>rcu_barrier()</tt> state, 204 - tracks expedited grace-period state, 205 - and maintains state used to force quiescent states when 206 - grace periods extend too long. 207 - <li> <tt>rcu_node</tt>: This structure forms the combining 208 - tree that propagates quiescent-state 209 - information from the leaves to the root, and also propagates 210 - grace-period information from the root to the leaves. 211 - It provides local copies of the grace-period state in order 212 - to allow this information to be accessed in a synchronized 213 - manner without suffering the scalability limitations that 214 - would otherwise be imposed by global locking. 215 - In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists 216 - of tasks that have blocked while in their current 217 - RCU read-side critical section. 218 - In <tt>CONFIG_PREEMPT_RCU</tt> with 219 - <tt>CONFIG_RCU_BOOST</tt>, it manages the 220 - per-<tt>rcu_node</tt> priority-boosting 221 - kernel threads (kthreads) and state. 222 - Finally, it records CPU-hotplug state in order to determine 223 - which CPUs should be ignored during a given grace period. 
224 - <li> <tt>rcu_data</tt>: This per-CPU structure is the 225 - focus of quiescent-state detection and RCU callback queuing. 226 - It also tracks its relationship to the corresponding leaf 227 - <tt>rcu_node</tt> structure to allow more-efficient 228 - propagation of quiescent states up the <tt>rcu_node</tt> 229 - combining tree. 230 - Like the <tt>rcu_node</tt> structure, it provides a local 231 - copy of the grace-period information to allow for-free 232 - synchronized 233 - access to this information from the corresponding CPU. 234 - Finally, this structure records past dyntick-idle state 235 - for the corresponding CPU and also tracks statistics. 236 - <li> <tt>rcu_head</tt>: 237 - This structure represents RCU callbacks, and is the 238 - only structure allocated and managed by RCU users. 239 - The <tt>rcu_head</tt> structure is normally embedded 240 - within the RCU-protected data structure. 241 - </ol> 242 - 243 - <p>If all you wanted from this article was a general notion of how 244 - RCU's data structures are related, you are done. 245 - Otherwise, each of the following sections gives more details on 246 - the <tt>rcu_state</tt>, <tt>rcu_node</tt> and <tt>rcu_data</tt> data 247 - structures. 248 - 249 - <h3><a name="The rcu_state Structure"> 250 - The <tt>rcu_state</tt> Structure</a></h3> 251 - 252 - <p>The <tt>rcu_state</tt> structure is the base structure that 253 - represents the state of RCU in the system. 254 - This structure forms the interconnection between the 255 - <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, 256 - tracks grace periods, contains the lock used to 257 - synchronize with CPU-hotplug events, 258 - and maintains state used to force quiescent states when 259 - grace periods extend too long. 260 - 261 - </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed, 262 - singly and in groups, in the following sections. 263 - The more specialized fields are covered in the discussion of their 264 - use. 
265 - 266 - <h5>Relationship to rcu_node and rcu_data Structures</h5> 267 - 268 - This portion of the <tt>rcu_state</tt> structure is declared 269 - as follows: 270 - 271 - <pre> 272 - 1 struct rcu_node node[NUM_RCU_NODES]; 273 - 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; 274 - 3 struct rcu_data __percpu *rda; 275 - </pre> 276 - 277 - <table> 278 - <tr><th> </th></tr> 279 - <tr><th align="left">Quick Quiz:</th></tr> 280 - <tr><td> 281 - Wait a minute! 282 - You said that the <tt>rcu_node</tt> structures formed a tree, 283 - but they are declared as a flat array! 284 - What gives? 285 - </td></tr> 286 - <tr><th align="left">Answer:</th></tr> 287 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 288 - The tree is laid out in the array. 289 - The first node in the array is the head, the next set of nodes in the 290 - array are children of the head node, and so on until the last set of 291 - nodes in the array are the leaves. 292 - </font> 293 - 294 - <p><font color="ffffff">See the following diagrams to see how 295 - this works. 296 - </font></td></tr> 297 - <tr><td> </td></tr> 298 - </table> 299 - 300 - <p>The <tt>rcu_node</tt> tree is embedded into the 301 - <tt>->node[]</tt> array as shown in the following figure: 302 - 303 - </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> 304 - 305 - </p><p>One interesting consequence of this mapping is that a 306 - breadth-first traversal of the tree is implemented as a simple 307 - linear scan of the array, which is in fact what the 308 - <tt>rcu_for_each_node_breadth_first()</tt> macro does. 309 - This macro is used at the beginning and end of grace periods. 
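As a rough illustration of the array mapping described above, the following sketch lays out a small tree in a flat array and then visits it breadth-first with a plain linear scan. The names toy_node, toy_state, toy_init() and the sizes (a three-level tree with a fanout of two) are hypothetical stand-ins chosen for this example, not the kernel's definitions; only the idea of a per-level first-node pointer and a linear scan comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy sizes: 1 root + 2 interior + 4 leaf nodes. */
#define TOY_NUM_LVLS  3
#define TOY_NUM_NODES 7

struct toy_node {
	struct toy_node *parent;	/* NULL for the root. */
	int level;			/* Root is level 0. */
};

struct toy_state {
	struct toy_node node[TOY_NUM_NODES];	/* Tree stored level by level. */
	struct toy_node *level[TOY_NUM_LVLS];	/* First node of each level. */
};

/* Nodes per level for this toy tree with a fanout of two. */
const int toy_levelcnt[TOY_NUM_LVLS] = { 1, 2, 4 };

/* Lay the tree out in the array: root first, leaves last. */
void toy_init(struct toy_state *sp)
{
	int i, lvl;
	struct toy_node *np = sp->node;

	for (lvl = 0; lvl < TOY_NUM_LVLS; lvl++) {
		sp->level[lvl] = np;
		for (i = 0; i < toy_levelcnt[lvl]; i++, np++) {
			np->level = lvl;
			/* With a fanout of 2, child i hangs off parent i / 2. */
			np->parent = lvl ? sp->level[lvl - 1] + i / 2 : NULL;
		}
	}
	assert(np == sp->node + TOY_NUM_NODES);
}

/* Breadth-first traversal is just a linear scan of the array. */
#define toy_for_each_node_breadth_first(sp, np) \
	for ((np) = &(sp)->node[0]; (np) < &(sp)->node[TOY_NUM_NODES]; (np)++)

/* Return 1 if the scan never visits a deeper level before a shallower one. */
int toy_scan_is_breadth_first(struct toy_state *sp)
{
	struct toy_node *np;
	int prev = 0;

	toy_for_each_node_breadth_first(sp, np) {
		if (np->level < prev)
			return 0;
		prev = np->level;
	}
	return 1;
}
```

Because every level occupies a contiguous run of the array, no queue or recursion is needed for the traversal, which is the property the text attributes to this layout.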
310 - 311 - </p><p>Each entry of the <tt>->level</tt> array references 312 - the first <tt>rcu_node</tt> structure on the corresponding level 313 - of the tree, for example, as shown below: 314 - 315 - </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> 316 - 317 - </p><p>The zero<sup>th</sup> element of the array references the root 318 - <tt>rcu_node</tt> structure, the first element references the 319 - first child of the root <tt>rcu_node</tt>, and finally the second 320 - element references the first leaf <tt>rcu_node</tt> structure. 321 - 322 - </p><p>For whatever it is worth, if you draw the tree to be tree-shaped 323 - rather than array-shaped, it is easy to draw a planar representation: 324 - 325 - </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> 326 - 327 - </p><p>Finally, the <tt>->rda</tt> field references a per-CPU 328 - pointer to the corresponding CPU's <tt>rcu_data</tt> structure. 329 - 330 - </p><p>All of these fields are constant once initialization is complete, 331 - and therefore need no protection. 332 - 333 - <h5>Grace-Period Tracking</h5> 334 - 335 - <p>This portion of the <tt>rcu_state</tt> structure is declared 336 - as follows: 337 - 338 - <pre> 339 - 1 unsigned long gp_seq; 340 - </pre> 341 - 342 - <p>RCU grace periods are numbered, and 343 - the <tt>->gp_seq</tt> field contains the current grace-period 344 - sequence number. 345 - The bottom two bits are the state of the current grace period, 346 - which can be zero for not yet started or one for in progress. 347 - In other words, if the bottom two bits of <tt>->gp_seq</tt> are 348 - zero, then RCU is idle. 349 - Any other value in the bottom two bits indicates that something is broken. 350 - This field is protected by the root <tt>rcu_node</tt> structure's 351 - <tt>->lock</tt> field. 352 - 353 - </p><p>There are <tt>->gp_seq</tt> fields 354 - in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures 355 - as well. 
356 - The fields in the <tt>rcu_state</tt> structure represent the 357 - most current value, and those of the other structures are compared 358 - in order to detect the beginnings and ends of grace periods in a distributed 359 - fashion. 360 - The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> 361 - (down the tree from the root to the leaves) to <tt>rcu_data</tt>. 362 - 363 - <h5>Miscellaneous</h5> 364 - 365 - <p>This portion of the <tt>rcu_state</tt> structure is declared 366 - as follows: 367 - 368 - <pre> 369 - 1 unsigned long gp_max; 370 - 2 char abbr; 371 - 3 char *name; 372 - </pre> 373 - 374 - <p>The <tt>->gp_max</tt> field tracks the duration of the longest 375 - grace period in jiffies. 376 - It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. 377 - 378 - <p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish 379 - between preemptible RCU (“rcu_preempt” and “p”) 380 - and non-preemptible RCU (“rcu_sched” and “s”). 381 - These fields are used for diagnostic and tracing purposes. 382 - 383 - <h3><a name="The rcu_node Structure"> 384 - The <tt>rcu_node</tt> Structure</a></h3> 385 - 386 - <p>The <tt>rcu_node</tt> structures form the combining 387 - tree that propagates quiescent-state 388 - information from the leaves to the root and that also propagates 389 - grace-period information from the root down to the leaves. 390 - They provide local copies of the grace-period state in order 391 - to allow this information to be accessed in a synchronized 392 - manner without suffering the scalability limitations that 393 - would otherwise be imposed by global locking. 394 - In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists 395 - of tasks that have blocked while in their current 396 - RCU read-side critical section. 397 - In <tt>CONFIG_PREEMPT_RCU</tt> with 398 - <tt>CONFIG_RCU_BOOST</tt>, they manage the 399 - per-<tt>rcu_node</tt> priority-boosting 400 - kernel threads (kthreads) and state. 
401 - Finally, they record CPU-hotplug state in order to determine 402 - which CPUs should be ignored during a given grace period. 403 - 404 - </p><p>The <tt>rcu_node</tt> structure's fields are discussed, 405 - singly and in groups, in the following sections. 406 - 407 - <h5>Connection to Combining Tree</h5> 408 - 409 - <p>This portion of the <tt>rcu_node</tt> structure is declared 410 - as follows: 411 - 412 - <pre> 413 - 1 struct rcu_node *parent; 414 - 2 u8 level; 415 - 3 u8 grpnum; 416 - 4 unsigned long grpmask; 417 - 5 int grplo; 418 - 6 int grphi; 419 - </pre> 420 - 421 - <p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt> 422 - one level up in the tree, and is <tt>NULL</tt> for the root 423 - <tt>rcu_node</tt>. 424 - The RCU implementation makes heavy use of this field to push quiescent 425 - states up the tree. 426 - The <tt>->level</tt> field gives the level in the tree, with 427 - the root being at level zero, its children at level one, and so on. 428 - The <tt>->grpnum</tt> field gives this node's position within 429 - the children of its parent, so this number can range between 0 and 31 430 - on 32-bit systems and between 0 and 63 on 64-bit systems. 431 - The <tt>->level</tt> and <tt>->grpnum</tt> fields are 432 - used only during initialization and for tracing. 433 - The <tt>->grpmask</tt> field is the bitmask counterpart of 434 - <tt>->grpnum</tt>, and therefore always has exactly one bit set. 435 - This mask is used to clear the bit corresponding to this <tt>rcu_node</tt> 436 - structure in its parent's bitmasks, which are described later. 437 - Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields 438 - contain the lowest and highest numbered CPU served by this 439 - <tt>rcu_node</tt> structure, respectively. 440 - 441 - </p><p>All of these fields are constant, and thus do not require any 442 - synchronization. 
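The way ->parent and ->grpmask cooperate can be sketched in a few lines. The code below is a simplified, single-threaded model with hypothetical names (toy_node, toy_node_init(), toy_report_qs()); it borrows the ->qsmask bitmask that this document covers in its quiescent-state tracking section, and it omits the per-node ->lock that the real implementation would hold at each step. It is meant only to show that a report climbs one level when, and only when, it clears the last pending bit of the current node.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *parent;	/* NULL for the root. */
	unsigned char grpnum;		/* Position among the parent's children. */
	unsigned long grpmask;		/* 1UL << grpnum: exactly one bit set. */
	unsigned long qsmask;		/* Children still owing a quiescent state. */
};

void toy_node_init(struct toy_node *np, struct toy_node *parent,
		   unsigned char grpnum, unsigned long qsmask)
{
	np->parent = parent;
	np->grpnum = grpnum;
	np->grpmask = 1UL << grpnum;
	np->qsmask = qsmask;
}

/*
 * Clear "mask" in np->qsmask and keep climbing while each node's
 * mask becomes empty.  Returns 1 if the root's mask was emptied,
 * meaning that the (toy) grace period may end, and 0 otherwise.
 */
int toy_report_qs(struct toy_node *np, unsigned long mask)
{
	assert(mask != 0);
	for (;;) {
		np->qsmask &= ~mask;
		if (np->qsmask != 0)
			return 0;	/* Siblings still pending: stop here. */
		if (np->parent == NULL)
			return 1;	/* Root is clear. */
		mask = np->grpmask;	/* This node's bit in its parent. */
		np = np->parent;
	}
}
```

With a root that has two leaves of two CPUs each, only the second report into each leaf touches the root, which mirrors the "one access out of sixteen" argument made earlier for real leaf fanouts.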
443 - 444 - <h5>Synchronization</h5> 445 - 446 - <p>This field of the <tt>rcu_node</tt> structure is declared 447 - as follows: 448 - 449 - <pre> 450 - 1 raw_spinlock_t lock; 451 - </pre> 452 - 453 - <p>This field is used to protect the remaining fields in this structure, 454 - unless otherwise stated. 455 - That said, all of the fields in this structure can be accessed without 456 - locking for tracing purposes. 457 - Yes, this can result in confusing traces, but better some tracing confusion 458 - than to be heisenbugged out of existence. 459 - 460 - <h5>Grace-Period Tracking</h5> 461 - 462 - <p>This portion of the <tt>rcu_node</tt> structure is declared 463 - as follows: 464 - 465 - <pre> 466 - 1 unsigned long gp_seq; 467 - 2 unsigned long gp_seq_needed; 468 - </pre> 469 - 470 - <p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are 471 - the counterparts of the field of the same name in the <tt>rcu_state</tt> 472 - structure. 473 - They each may lag up to one step behind their <tt>rcu_state</tt> 474 - counterpart. 475 - If the bottom two bits of a given <tt>rcu_node</tt> structure's 476 - <tt>->gp_seq</tt> field are zero, then this <tt>rcu_node</tt> 477 - structure believes that RCU is idle. 478 - </p><p>The <tt>->gp_seq</tt> field of each <tt>rcu_node</tt> 479 - structure is updated at the beginning and the end 480 - of each grace period. 481 - 482 - <p>The <tt>->gp_seq_needed</tt> fields record the 483 - furthest-in-the-future grace period request seen by the corresponding 484 - <tt>rcu_node</tt> structure. The request is considered fulfilled when 485 - the value of the <tt>->gp_seq</tt> field equals or exceeds that of 486 - the <tt>->gp_seq_needed</tt> field. 487 - 488 - <table> 489 - <tr><th> </th></tr> 490 - <tr><th align="left">Quick Quiz:</th></tr> 491 - <tr><td> 492 - Suppose that this <tt>rcu_node</tt> structure doesn't see 493 - a request for a very long time. 494 - Won't wrapping of the <tt>->gp_seq</tt> field cause 495 - problems? 
496 - </td></tr> 497 - <tr><th align="left">Answer:</th></tr> 498 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 499 - No, because if the <tt>->gp_seq_needed</tt> field lags behind the 500 - <tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field 501 - will be updated at the end of the grace period. 502 - Modulo-arithmetic comparisons therefore will always get the 503 - correct answer, even with wrapping. 504 - </font></td></tr> 505 - <tr><td> </td></tr> 506 - </table> 507 - 508 - <h5>Quiescent-State Tracking</h5> 509 - 510 - <p>These fields manage the propagation of quiescent states up the 511 - combining tree. 512 - 513 - </p><p>This portion of the <tt>rcu_node</tt> structure has fields 514 - as follows: 515 - 516 - <pre> 517 - 1 unsigned long qsmask; 518 - 2 unsigned long expmask; 519 - 3 unsigned long qsmaskinit; 520 - 4 unsigned long expmaskinit; 521 - </pre> 522 - 523 - <p>The <tt>->qsmask</tt> field tracks which of this 524 - <tt>rcu_node</tt> structure's children still need to report 525 - quiescent states for the current normal grace period. 526 - Such children will have a value of 1 in their corresponding bit. 527 - Note that the leaf <tt>rcu_node</tt> structures should be 528 - thought of as having <tt>rcu_data</tt> structures as their 529 - children. 530 - Similarly, the <tt>->expmask</tt> field tracks which 531 - of this <tt>rcu_node</tt> structure's children still need to report 532 - quiescent states for the current expedited grace period. 533 - An expedited grace period has 534 - the same conceptual properties as a normal grace period, but the 535 - expedited implementation accepts extreme CPU overhead to obtain 536 - much lower grace-period latency, for example, consuming a few 537 - tens of microseconds worth of CPU time to reduce grace-period 538 - duration from milliseconds to tens of microseconds. 
539 - The <tt>->qsmaskinit</tt> field tracks which of this 540 - <tt>rcu_node</tt> structure's children cover for at least 541 - one online CPU. 542 - This mask is used to initialize <tt>->qsmask</tt>, 543 - and <tt>->expmaskinit</tt> is used to initialize 544 - <tt>->expmask</tt>, at the beginning of the 545 - normal and expedited grace periods, respectively. 546 - 547 - <table> 548 - <tr><th> </th></tr> 549 - <tr><th align="left">Quick Quiz:</th></tr> 550 - <tr><td> 551 - Why are these bitmasks protected by locking? 552 - Come on, haven't you heard of atomic instructions??? 553 - </td></tr> 554 - <tr><th align="left">Answer:</th></tr> 555 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 556 - Lockless grace-period computation! Such a tantalizing possibility! 557 - </font> 558 - 559 - <p><font color="ffffff">But consider the following sequence of events: 560 - </font> 561 - 562 - <ol> 563 - <li> <font color="ffffff">CPU 0 has been in dyntick-idle 564 - mode for quite some time. 565 - When it wakes up, it notices that the current RCU 566 - grace period needs it to report in, so it sets a 567 - flag where the scheduling clock interrupt will find it. 568 - </font><p> 569 - <li> <font color="ffffff">Meanwhile, CPU 1 is running 570 - <tt>force_quiescent_state()</tt>, 571 - and notices that CPU 0 has been in dyntick-idle mode, 572 - which qualifies as an extended quiescent state. 573 - </font><p> 574 - <li> <font color="ffffff">CPU 0's scheduling clock 575 - interrupt fires in the 576 - middle of an RCU read-side critical section, and notices 577 - that the RCU core needs something, so commences RCU softirq 578 - processing. 579 - </font> 580 - <p> 581 - <li> <font color="ffffff">CPU 0's softirq handler 582 - executes and is just about ready 583 - to report its quiescent state up the <tt>rcu_node</tt> 584 - tree. 585 - </font><p> 586 - <li> <font color="ffffff">But CPU 1 beats it to the punch, 587 - completing the current 588 - grace period and starting a new one. 
589 - </font><p> 590 - <li> <font color="ffffff">CPU 0 now reports its quiescent 591 - state for the wrong 592 - grace period. 593 - That grace period might now end before the RCU read-side 594 - critical section. 595 - If that happens, disaster will ensue. 596 - </font> 597 - </ol> 598 - 599 - <p><font color="ffffff">So the locking is absolutely required in 600 - order to coordinate clearing of the bits with updating of the 601 - grace-period sequence number in <tt>->gp_seq</tt>. 602 - </font></td></tr> 603 - <tr><td> </td></tr> 604 - </table> 605 - 606 - <h5>Blocked-Task Management</h5> 607 - 608 - <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the 609 - midst of their RCU read-side critical sections, and these tasks 610 - must be tracked explicitly. 611 - The details of exactly why and how they are tracked will be covered 612 - in a separate article on RCU read-side processing. 613 - For now, it is enough to know that the <tt>rcu_node</tt> 614 - structure tracks them. 615 - 616 - <pre> 617 - 1 struct list_head blkd_tasks; 618 - 2 struct list_head *gp_tasks; 619 - 3 struct list_head *exp_tasks; 620 - 4 bool wait_blkd_tasks; 621 - </pre> 622 - 623 - <p>The <tt>->blkd_tasks</tt> field is a list header for 624 - the list of blocked and preempted tasks. 625 - As tasks undergo context switches within RCU read-side critical 626 - sections, their <tt>task_struct</tt> structures are enqueued 627 - (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> 628 - field) onto the head of the <tt>->blkd_tasks</tt> list for the 629 - leaf <tt>rcu_node</tt> structure corresponding to the CPU 630 - on which the outgoing context switch executed. 631 - As these tasks later exit their RCU read-side critical sections, 632 - they remove themselves from the list. 633 - This list is therefore in reverse time order, so that if one of the tasks 634 - is blocking the current grace period, all subsequent tasks must 635 - also be blocking that same grace period. 
636 - Therefore, a single pointer into this list suffices to track 637 - all tasks blocking a given grace period. 638 - That pointer is stored in <tt>->gp_tasks</tt> for normal 639 - grace periods and in <tt>->exp_tasks</tt> for expedited 640 - grace periods. 641 - These last two fields are <tt>NULL</tt> if either there is 642 - no grace period in flight or if there are no blocked tasks 643 - preventing that grace period from completing. 644 - If either of these two pointers is referencing a task that 645 - removes itself from the <tt>->blkd_tasks</tt> list, 646 - then that task must advance the pointer to the next task on 647 - the list, or set the pointer to <tt>NULL</tt> if there 648 - are no subsequent tasks on the list. 649 - 650 - </p><p>For example, suppose that tasks T1, T2, and T3 are 651 - all hard-affinitied to the largest-numbered CPU in the system. 652 - Then if task T1 blocked in an RCU read-side 653 - critical section, then an expedited grace period started, 654 - then task T2 blocked in an RCU read-side critical section, 655 - then a normal grace period started, and finally task T3 blocked 656 - in an RCU read-side critical section, then the state of the 657 - last leaf <tt>rcu_node</tt> structure's blocked-task list 658 - would be as shown below: 659 - 660 - </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> 661 - 662 - </p><p>Task T1 is blocking both grace periods, task T2 is 663 - blocking only the normal grace period, and task T3 is blocking 664 - neither grace period. 665 - Note that these tasks will not remove themselves from this list 666 - immediately upon resuming execution. 667 - They will instead remain on the list until they execute the outermost 668 - <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical 669 - section. 670 - 671 - <p> 672 - The <tt>->wait_blkd_tasks</tt> field indicates whether or not 673 - the current grace period is waiting on a blocked task. 
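The T1/T2/T3 scenario above can be replayed with a small model of the blocked-task list. This is a sketch under stated assumptions: toy_list is a hand-rolled circular doubly linked list standing in for the kernel's list_head, the functions toy_task_blocked(), toy_gp_start() and toy_task_unblocked() are hypothetical, and all locking is omitted. It shows only the two rules from the text: newly blocked tasks go on the head of the list, and a departing task that is referenced by ->gp_tasks or ->exp_tasks advances that pointer (or sets it to NULL).

```c
#include <assert.h>
#include <stddef.h>

struct toy_list {
	struct toy_list *next, *prev;
};

struct toy_node {
	struct toy_list blkd_tasks;	/* List header; newest task at the head. */
	struct toy_list *gp_tasks;	/* First task blocking the normal GP. */
	struct toy_list *exp_tasks;	/* First task blocking the expedited GP. */
};

void toy_node_init(struct toy_node *np)
{
	np->blkd_tasks.next = np->blkd_tasks.prev = &np->blkd_tasks;
	np->gp_tasks = NULL;
	np->exp_tasks = NULL;
}

/* A task blocks inside a read-side critical section: add at the head. */
void toy_task_blocked(struct toy_node *np, struct toy_list *t)
{
	assert(t != &np->blkd_tasks);
	t->next = np->blkd_tasks.next;
	t->prev = &np->blkd_tasks;
	np->blkd_tasks.next->prev = t;
	np->blkd_tasks.next = t;
}

/* A grace period starts: every task now on the list blocks it. */
void toy_gp_start(struct toy_node *np, struct toy_list **tasks)
{
	*tasks = np->blkd_tasks.next == &np->blkd_tasks ?
		 NULL : np->blkd_tasks.next;
}

/* Advance one tracking pointer past departing task t, if it points at t. */
static void toy_advance(struct toy_node *np, struct toy_list **tasks,
			struct toy_list *t)
{
	if (*tasks == t)
		*tasks = t->next == &np->blkd_tasks ? NULL : t->next;
}

/* A task ends its critical section: fix up the pointers, then unlink. */
void toy_task_unblocked(struct toy_node *np, struct toy_list *t)
{
	toy_advance(np, &np->gp_tasks, t);
	toy_advance(np, &np->exp_tasks, t);
	t->prev->next = t->next;
	t->next->prev = t->prev;
	t->next = t->prev = t;
}
```

Because insertion is always at the head, everything from a tracking pointer to the tail of the list blocks the corresponding grace period, so one pointer per grace-period type is enough.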
674 - 675 - <h5>Sizing the <tt>rcu_node</tt> Array</h5> 676 - 677 - <p>The <tt>rcu_node</tt> array is sized via a series of 678 - C-preprocessor expressions as follows: 679 - 680 - <pre> 681 - 1 #ifdef CONFIG_RCU_FANOUT 682 - 2 #define RCU_FANOUT CONFIG_RCU_FANOUT 683 - 3 #else 684 - 4 # ifdef CONFIG_64BIT 685 - 5 # define RCU_FANOUT 64 686 - 6 # else 687 - 7 # define RCU_FANOUT 32 688 - 8 # endif 689 - 9 #endif 690 - 10 691 - 11 #ifdef CONFIG_RCU_FANOUT_LEAF 692 - 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF 693 - 13 #else 694 - 14 # ifdef CONFIG_64BIT 695 - 15 # define RCU_FANOUT_LEAF 64 696 - 16 # else 697 - 17 # define RCU_FANOUT_LEAF 32 698 - 18 # endif 699 - 19 #endif 700 - 20 701 - 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) 702 - 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) 703 - 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) 704 - 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) 705 - 25 706 - 26 #if NR_CPUS <= RCU_FANOUT_1 707 - 27 # define RCU_NUM_LVLS 1 708 - 28 # define NUM_RCU_LVL_0 1 709 - 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 710 - 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } 711 - 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } 712 - 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } 713 - 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } 714 - 34 #elif NR_CPUS <= RCU_FANOUT_2 715 - 35 # define RCU_NUM_LVLS 2 716 - 36 # define NUM_RCU_LVL_0 1 717 - 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 718 - 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) 719 - 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } 720 - 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } 721 - 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } 722 - 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } 723 - 43 #elif NR_CPUS <= RCU_FANOUT_3 724 - 44 # define RCU_NUM_LVLS 3 725 - 45 # define NUM_RCU_LVL_0 1 726 - 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) 
727 - 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 728 - 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) 729 - 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } 730 - 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } 731 - 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } 732 - 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } 733 - 53 #elif NR_CPUS <= RCU_FANOUT_4 734 - 54 # define RCU_NUM_LVLS 4 735 - 55 # define NUM_RCU_LVL_0 1 736 - 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) 737 - 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) 738 - 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 739 - 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) 740 - 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } 741 - 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } 742 - 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } 743 - 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } 744 - 64 #else 745 - 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" 746 - 66 #endif 747 - </pre> 748 - 749 - <p>The maximum number of levels in the <tt>rcu_node</tt> structure 750 - is currently limited to four, as specified by lines 21-24 751 - and the structure of the subsequent “if” statement. 752 - For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which 753 - should be sufficient for the next few years at least. 754 - For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which 755 - should see us through the next decade or so. 
756 - This four-level tree also allows kernels built with 757 - <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, 758 - which might be useful in very large systems having eight CPUs per 759 - socket (but please note that no one has yet shown any measurable 760 - performance degradation due to misaligned socket and <tt>rcu_node</tt> 761 - boundaries). 762 - In addition, building kernels with a full four levels of <tt>rcu_node</tt> 763 - tree permits better testing of RCU's combining-tree code. 764 - 765 - </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children 766 - are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. 767 - If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, 768 - it is set based on the word size of the system, which is also 769 - the Kconfig default. 770 - 771 - </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are 772 - handled by each leaf <tt>rcu_node</tt> structure. 773 - Experience has shown that allowing a given leaf <tt>rcu_node</tt> 774 - structure to handle 64 CPUs, as permitted by the number of bits in 775 - the <tt>->qsmask</tt> field on a 64-bit system, results in 776 - excessive contention for the leaf <tt>rcu_node</tt> structures' 777 - <tt>->lock</tt> fields. 778 - The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore 779 - limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. 780 - If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value 781 - selected is based on the word size of the system, just as for 782 - <tt>CONFIG_RCU_FANOUT</tt>. 783 - Lines 11-19 perform this computation. 784 - 785 - </p><p>Lines 21-24 compute the maximum number of CPUs supported by 786 - a single-level (which contains a single <tt>rcu_node</tt> structure), 787 - two-level, three-level, and four-level <tt>rcu_node</tt> tree, 788 - respectively, given the fanout specified by <tt>RCU_FANOUT</tt> 789 - and <tt>RCU_FANOUT_LEAF</tt>. 
790 - These numbers of CPUs are retained in the 791 - <tt>RCU_FANOUT_1</tt>, 792 - <tt>RCU_FANOUT_2</tt>, 793 - <tt>RCU_FANOUT_3</tt>, and 794 - <tt>RCU_FANOUT_4</tt> 795 - C-preprocessor variables, respectively. 796 - 797 - </p><p>These variables are used to control the C-preprocessor <tt>#if</tt> 798 - statement spanning lines 26-66 that computes the number of 799 - <tt>rcu_node</tt> structures required for each level of the tree, 800 - as well as the number of levels required. 801 - The number of levels is placed in the <tt>RCU_NUM_LVLS</tt> 802 - C-preprocessor variable by lines 27, 35, 44, and 54. 803 - The number of <tt>rcu_node</tt> structures for the topmost level 804 - of the tree is always exactly one, and this value is unconditionally 805 - placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. 806 - The rest of the levels (if any) of the <tt>rcu_node</tt> tree 807 - are computed by dividing the maximum number of CPUs by the 808 - fanout supported by the number of levels from the current level down, 809 - rounding up. This computation is performed by lines 37, 810 - 46-47, and 56-58. 811 - Lines 31-33, 40-42, 50-52, and 61-63 create initializers 812 - for lockdep lock-class names. 813 - Finally, lines 64-66 produce an error if the maximum number of 814 - CPUs is too large for the specified fanout. 
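The preprocessor computation above can be restated as an ordinary function, which may make the arithmetic easier to follow. This is only a sketch: <tt>rcu_tree_shape()</tt>, <tt>div_round_up()</tt>, and <tt>MAX_LVLS</tt> are hypothetical names, and the kernel performs this sizing at compile time rather than at run time.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LVLS 4  /* mirrors the four-level limit described above */

/* Round-up integer division, in the spirit of DIV_ROUND_UP(). */
static unsigned long div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Hypothetical helper: fill lvl[] with per-level node counts (root first)
 * and return the number of levels, or 0 if nr_cpus needs more than four.
 */
static int rcu_tree_shape(unsigned long nr_cpus, unsigned long fanout,
			  unsigned long fanout_leaf,
			  unsigned long lvl[MAX_LVLS])
{
	unsigned long cap[MAX_LVLS];  /* cap[i] plays the role of RCU_FANOUT_(i+1) */
	int nlvls, i;

	assert(nr_cpus > 0 && fanout > 1 && fanout_leaf > 0);

	cap[0] = fanout_leaf;
	for (i = 1; i < MAX_LVLS; i++)
		cap[i] = cap[i - 1] * fanout;

	/* Pick the smallest tree that can hold nr_cpus. */
	for (nlvls = 1; nlvls <= MAX_LVLS; nlvls++)
		if (nr_cpus <= cap[nlvls - 1])
			break;
	if (nlvls > MAX_LVLS)
		return 0;  /* corresponds to the #error branch */

	lvl[0] = 1;  /* exactly one root node */
	for (i = 1; i < nlvls; i++)
		lvl[i] = div_round_up(nr_cpus, cap[nlvls - 1 - i]);
	return nlvls;
}
```

For example, 4096 CPUs with a fanout of 64 and a leaf fanout of 16 yield three levels holding 1, 4, and 256 nodes, while 16 CPUs with both fanouts set to 2 yield a four-level tree.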
815 - 816 - <h3><a name="The rcu_segcblist Structure"> 817 - The <tt>rcu_segcblist</tt> Structure</a></h3> 818 - 819 - The <tt>rcu_segcblist</tt> structure maintains a segmented list of 820 - callbacks as follows: 821 - 822 - <pre> 823 - 1 #define RCU_DONE_TAIL 0 824 - 2 #define RCU_WAIT_TAIL 1 825 - 3 #define RCU_NEXT_READY_TAIL 2 826 - 4 #define RCU_NEXT_TAIL 3 827 - 5 #define RCU_CBLIST_NSEGS 4 828 - 6 829 - 7 struct rcu_segcblist { 830 - 8 struct rcu_head *head; 831 - 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; 832 - 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; 833 - 11 long len; 834 - 12 long len_lazy; 835 - 13 }; 836 - </pre> 837 - 838 - <p> 839 - The segments are as follows: 840 - 841 - <ol> 842 - <li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed. 843 - These callbacks are ready to be invoked. 844 - <li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the 845 - current grace period. 846 - Note that different CPUs can have different ideas about which 847 - grace period is current, hence the <tt>->gp_seq</tt> field. 848 - <li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next 849 - grace period to start. 850 - <li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been 851 - associated with a grace period. 852 - </ol> 853 - 854 - <p> 855 - The <tt>->head</tt> pointer references the first callback or 856 - is <tt>NULL</tt> if the list contains no callbacks (which is 857 - <i>not</i> the same as being empty). 858 - Each element of the <tt>->tails[]</tt> array references the 859 - <tt>->next</tt> pointer of the last callback in the corresponding 860 - segment of the list, or the list's <tt>->head</tt> pointer if 861 - that segment and all previous segments are empty. 862 - If the corresponding segment is empty but some previous segment is 863 - not empty, then the array element is identical to its predecessor. 864 - Older callbacks are closer to the head of the list, and new callbacks 865 - are added at the tail. 
866 - This relationship between the <tt>->head</tt> pointer, the 867 - <tt>->tails[]</tt> array, and the callbacks is shown in this 868 - diagram: 869 - 870 - </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> 871 - 872 - </p><p>In this figure, the <tt>->head</tt> pointer references the 873 - first 874 - RCU callback in the list. 875 - The <tt>->tails[RCU_DONE_TAIL]</tt> array element references 876 - the <tt>->head</tt> pointer itself, indicating that none 877 - of the callbacks is ready to invoke. 878 - The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback 879 - CB 2's <tt>->next</tt> pointer, which indicates that 880 - CB 1 and CB 2 are both waiting on the current grace period, 881 - give or take possible disagreements about exactly which grace period 882 - is the current one. 883 - The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element 884 - references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt> 885 - does, which indicates that there are no callbacks waiting on the next 886 - RCU grace period. 887 - The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references 888 - CB 4's <tt>->next</tt> pointer, indicating that all the 889 - remaining RCU callbacks have not yet been assigned to an RCU grace 890 - period. 891 - Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element 892 - always references the last RCU callback's <tt>->next</tt> pointer 893 - unless the callback list is empty, in which case it references 894 - the <tt>->head</tt> pointer. 895 - 896 - <p> 897 - There is one additional important special case for the 898 - <tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt> 899 - when this list is <i>disabled</i>. 900 - Lists are disabled when the corresponding CPU is offline or when 901 - the corresponding CPU's callbacks are offloaded to a kthread, 902 - both of which are described elsewhere. 
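The relationship between <tt>->head</tt> and the <tt>->tails[]</tt> array can be sketched in a few lines of C. The following is a simplified model under stated assumptions: <tt>struct cb</tt>, <tt>struct seglist</tt>, and the three helpers are hypothetical stand-ins, the <tt>->gp_seq[]</tt> and <tt>->len_lazy</tt> fields are left out, and no synchronization is shown.

```c
#include <assert.h>
#include <stddef.h>

#define RCU_DONE_TAIL        0
#define RCU_WAIT_TAIL        1
#define RCU_NEXT_READY_TAIL  2
#define RCU_NEXT_TAIL        3
#define RCU_CBLIST_NSEGS     4

struct cb {                      /* stand-in for struct rcu_head */
	struct cb *next;
};

struct seglist {                 /* stand-in for struct rcu_segcblist */
	struct cb *head;
	struct cb **tails[RCU_CBLIST_NSEGS];
	long len;
};

/* Empty list: every tail pointer references ->head. */
static void seglist_init(struct seglist *sl)
{
	int i;

	sl->head = NULL;
	for (i = 0; i < RCU_CBLIST_NSEGS; i++)
		sl->tails[i] = &sl->head;
	sl->len = 0;
}

/* New callbacks are appended to the RCU_NEXT_TAIL segment. */
static void seglist_enqueue(struct seglist *sl, struct cb *cbp)
{
	assert(sl->tails[RCU_NEXT_TAIL] != NULL);  /* NULL means disabled */
	cbp->next = NULL;
	*sl->tails[RCU_NEXT_TAIL] = cbp;
	sl->tails[RCU_NEXT_TAIL] = &cbp->next;
	sl->len++;
}

/* A segment is empty when its tail equals that of its predecessor. */
static int seglist_seg_empty(struct seglist *sl, int seg)
{
	if (seg == RCU_DONE_TAIL)
		return sl->tails[seg] == &sl->head;
	return sl->tails[seg] == sl->tails[seg - 1];
}
```

After two enqueues on a fresh list, the first three tail pointers still reference <tt>->head</tt>, and only <tt>->tails[RCU_NEXT_TAIL]</tt> references the last callback's <tt>->next</tt> pointer.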
903 - 904 - </p><p>CPUs advance their callbacks from the 905 - <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the 906 - <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments 907 - as grace periods advance. 908 - 909 - </p><p>The <tt>->gp_seq[]</tt> array records grace-period 910 - numbers corresponding to the list segments. 911 - This is what allows different CPUs to have different ideas as to 912 - which is the current grace period while still avoiding premature 913 - invocation of their callbacks. 914 - In particular, this allows CPUs that go idle for extended periods 915 - to determine which of their callbacks are ready to be invoked after 916 - reawakening. 917 - 918 - </p><p>The <tt>->len</tt> counter contains the number of 919 - callbacks in <tt>->head</tt>, and the 920 - <tt>->len_lazy</tt> contains the number of those callbacks that 921 - are known to only free memory, and whose invocation can therefore 922 - be safely deferred. 923 - 924 - <p><b>Important note</b>: It is the <tt>->len</tt> field that 925 - determines whether or not there are callbacks associated with 926 - this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt> 927 - pointer. 928 - The reason for this is that all the ready-to-invoke callbacks 929 - (that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted 930 - all at once at callback-invocation time (<tt>rcu_do_batch</tt>), due 931 - to which <tt>->head</tt> may be set to NULL if there are no not-done 932 - callbacks remaining in the <tt>rcu_segcblist</tt>. 933 - If callback invocation must be postponed, for example, because a 934 - high-priority process just woke up on this CPU, then the remaining 935 - callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and 936 - <tt>->head</tt> once again points to the start of the segment. 937 - In short, the head field can briefly be <tt>NULL</tt> even though the 938 - CPU has callbacks present the entire time. 
939 - Therefore, it is not appropriate to test the <tt>->head</tt> pointer 940 - for <tt>NULL</tt>. 941 - 942 - <p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts 943 - are adjusted only after the corresponding callbacks have been invoked. 944 - This means that the <tt>->len</tt> count is zero only if 945 - the <tt>rcu_segcblist</tt> structure really is devoid of callbacks. 946 - Of course, off-CPU sampling of the <tt>->len</tt> count requires 947 - careful use of appropriate synchronization, for example, memory barriers. 948 - This synchronization can be a bit subtle, particularly in the case 949 - of <tt>rcu_barrier()</tt>. 950 - 951 - <h3><a name="The rcu_data Structure"> 952 - The <tt>rcu_data</tt> Structure</a></h3> 953 - 954 - <p>The <tt>rcu_data</tt> maintains the per-CPU state for the RCU subsystem. 955 - The fields in this structure may be accessed only from the corresponding 956 - CPU (and from tracing) unless otherwise stated. 957 - This structure is the 958 - focus of quiescent-state detection and RCU callback queuing. 959 - It also tracks its relationship to the corresponding leaf 960 - <tt>rcu_node</tt> structure to allow more-efficient 961 - propagation of quiescent states up the <tt>rcu_node</tt> 962 - combining tree. 963 - Like the <tt>rcu_node</tt> structure, it provides a local 964 - copy of the grace-period information to allow for-free 965 - synchronized 966 - access to this information from the corresponding CPU. 967 - Finally, this structure records past dyntick-idle state 968 - for the corresponding CPU and also tracks statistics. 969 - 970 - </p><p>The <tt>rcu_data</tt> structure's fields are discussed, 971 - singly and in groups, in the following sections. 
972 - 973 - <h5>Connection to Other Data Structures</h5> 974 - 975 - <p>This portion of the <tt>rcu_data</tt> structure is declared 976 - as follows: 977 - 978 - <pre> 979 - 1 int cpu; 980 - 2 struct rcu_node *mynode; 981 - 3 unsigned long grpmask; 982 - 4 bool beenonline; 983 - </pre> 984 - 985 - <p>The <tt>->cpu</tt> field contains the number of the 986 - corresponding CPU and the <tt>->mynode</tt> field references the 987 - corresponding <tt>rcu_node</tt> structure. 988 - The <tt>->mynode</tt> is used to propagate quiescent states 989 - up the combining tree. 990 - These two fields are constant and therefore do not require synchronization. 991 - 992 - <p>The <tt>->grpmask</tt> field indicates the bit in 993 - the <tt>->mynode->qsmask</tt> corresponding to this 994 - <tt>rcu_data</tt> structure, and is also used when propagating 995 - quiescent states. 996 - The <tt>->beenonline</tt> flag is set whenever the corresponding 997 - CPU comes online, which means that the debugfs tracing need not dump 998 - out any <tt>rcu_data</tt> structure for which this flag is not set. 999 - 1000 - <h5>Quiescent-State and Grace-Period Tracking</h5> 1001 - 1002 - <p>This portion of the <tt>rcu_data</tt> structure is declared 1003 - as follows: 1004 - 1005 - <pre> 1006 - 1 unsigned long gp_seq; 1007 - 2 unsigned long gp_seq_needed; 1008 - 3 bool cpu_no_qs; 1009 - 4 bool core_needs_qs; 1010 - 5 bool gpwrap; 1011 - </pre> 1012 - 1013 - <p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same 1014 - name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. The 1015 - <tt>->gp_seq_needed</tt> field is the counterpart of the field of the same 1016 - name in the <tt>rcu_node</tt> structure. 
1017 - They may each lag up to one behind their <tt>rcu_node</tt> 1018 - counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and 1019 - <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag 1020 - arbitrarily far behind for CPUs in dyntick-idle mode (but these counters 1021 - will catch up upon exit from dyntick-idle mode). 1022 - If the lower two bits of a given <tt>rcu_data</tt> structure's 1023 - <tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt> 1024 - structure believes that RCU is idle. 1025 - 1026 - <table> 1027 - <tr><th> </th></tr> 1028 - <tr><th align="left">Quick Quiz:</th></tr> 1029 - <tr><td> 1030 - All this replication of the grace period numbers can only cause 1031 - massive confusion. 1032 - Why not just keep a global sequence number and be done with it??? 1033 - </td></tr> 1034 - <tr><th align="left">Answer:</th></tr> 1035 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1036 - Because if there was only a single global sequence 1037 - number, there would need to be a single global lock to allow 1038 - safely accessing and updating it. 1039 - And if we are not going to have a single global lock, we need 1040 - to carefully manage the numbers on a per-node basis. 1041 - Recall from the answer to a previous Quick Quiz that the consequences 1042 - of applying a previously sampled quiescent state to the wrong 1043 - grace period are quite severe. 1044 - </font></td></tr> 1045 - <tr><td> </td></tr> 1046 - </table> 1047 - 1048 - <p>The <tt>->cpu_no_qs</tt> flag indicates that the 1049 - CPU has not yet passed through a quiescent state, 1050 - while the <tt>->core_needs_qs</tt> flag indicates that the 1051 - RCU core needs a quiescent state from the corresponding CPU. 1052 - The <tt>->gpwrap</tt> field indicates that the corresponding 1053 - CPU has remained idle for so long that the 1054 - <tt>gp_seq</tt> counter is in danger of overflow, which 1055 - will cause the CPU to disregard the values of its counters on 1056 - its next exit from idle. 
1057 - 1058 - <h5>RCU Callback Handling</h5> 1059 - 1060 - <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by 1061 - the same CPU that registered them. 1062 - This is strictly a cache-locality optimization: callbacks can and 1063 - do get invoked on CPUs other than the one that registered them. 1064 - After all, if the CPU that registered a given callback has gone 1065 - offline before the callback can be invoked, there really is no other 1066 - choice. 1067 - 1068 - </p><p>This portion of the <tt>rcu_data</tt> structure is declared 1069 - as follows: 1070 - 1071 - <pre> 1072 - 1 struct rcu_segcblist cblist; 1073 - 2 long qlen_last_fqs_check; 1074 - 3 unsigned long n_cbs_invoked; 1075 - 4 unsigned long n_nocbs_invoked; 1076 - 5 unsigned long n_cbs_orphaned; 1077 - 6 unsigned long n_cbs_adopted; 1078 - 7 unsigned long n_force_qs_snap; 1079 - 8 long blimit; 1080 - </pre> 1081 - 1082 - <p>The <tt>->cblist</tt> structure is the segmented callback list 1083 - described earlier. 1084 - The CPU advances the callbacks in its <tt>rcu_data</tt> structure 1085 - whenever it notices that another RCU grace period has completed. 1086 - The CPU detects the completion of an RCU grace period by noticing 1087 - that the value of its <tt>rcu_data</tt> structure's 1088 - <tt>->gp_seq</tt> field differs from that of its leaf 1089 - <tt>rcu_node</tt> structure. 1090 - Recall that each <tt>rcu_node</tt> structure's 1091 - <tt>->gp_seq</tt> field is updated at the beginnings and ends of each 1092 - grace period. 1093 - 1094 - <p> 1095 - The <tt>->qlen_last_fqs_check</tt> and 1096 - <tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent 1097 - states from <tt>call_rcu()</tt> and friends when callback 1098 - lists grow excessively long. 
1099 - 1100 - </p><p>The <tt>->n_cbs_invoked</tt>, 1101 - <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> 1102 - fields count the number of callbacks invoked, 1103 - sent to other CPUs when this CPU goes offline, 1104 - and received from other CPUs when those other CPUs go offline. 1105 - The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks 1106 - are offloaded to a kthread. 1107 - 1108 - <p> 1109 - Finally, the <tt>->blimit</tt> counter is the maximum number of 1110 - RCU callbacks that may be invoked at a given time. 1111 - 1112 - <h5>Dyntick-Idle Handling</h5> 1113 - 1114 - <p>This portion of the <tt>rcu_data</tt> structure is declared 1115 - as follows: 1116 - 1117 - <pre> 1118 - 1 int dynticks_snap; 1119 - 2 unsigned long dynticks_fqs; 1120 - </pre> 1121 - 1122 - The <tt>->dynticks_snap</tt> field is used to take a snapshot 1123 - of the corresponding CPU's dyntick-idle state when forcing 1124 - quiescent states, and is therefore accessed from other CPUs. 1125 - Finally, the <tt>->dynticks_fqs</tt> field is used to 1126 - count the number of times this CPU is determined to be in 1127 - dyntick-idle state, and is used for tracing and debugging purposes. 1128 - 1129 - <p> 1130 - This portion of the rcu_data structure is declared as follows: 1131 - 1132 - <pre> 1133 - 1 long dynticks_nesting; 1134 - 2 long dynticks_nmi_nesting; 1135 - 3 atomic_t dynticks; 1136 - 4 bool rcu_need_heavy_qs; 1137 - 5 bool rcu_urgent_qs; 1138 - </pre> 1139 - 1140 - <p>These fields in the rcu_data structure maintain the per-CPU dyntick-idle 1141 - state for the corresponding CPU. 1142 - The fields may be accessed only from the corresponding CPU (and from tracing) 1143 - unless otherwise stated. 1144 - 1145 - <p>The <tt>->dynticks_nesting</tt> field counts the 1146 - nesting depth of process execution, so that in normal circumstances 1147 - this counter has value zero or one. 
1148 - NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt> 1149 - field. 1150 - Because NMIs cannot be masked, changes to this variable have to be 1151 - undertaken carefully using an algorithm provided by Andy Lutomirski. 1152 - The initial transition from idle adds one, and nested transitions 1153 - add two, so that a nesting level of five is represented by a 1154 - <tt>->dynticks_nmi_nesting</tt> value of nine. 1155 - This counter can therefore be thought of as counting the number 1156 - of reasons why this CPU cannot be permitted to enter dyntick-idle 1157 - mode, aside from process-level transitions. 1158 - 1159 - <p>However, it turns out that when running in non-idle kernel context, 1160 - the Linux kernel is fully capable of entering interrupt handlers that 1161 - never exit and perhaps also vice versa. 1162 - Therefore, whenever the <tt>->dynticks_nesting</tt> field is 1163 - incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field 1164 - is set to a large positive number, and whenever the 1165 - <tt>->dynticks_nesting</tt> field is decremented down to zero, 1166 - the <tt>->dynticks_nmi_nesting</tt> field is set to zero. 1167 - Assuming that the number of misnested interrupts is not sufficient 1168 - to overflow the counter, this approach corrects the 1169 - <tt>->dynticks_nmi_nesting</tt> field every time the corresponding 1170 - CPU enters the idle loop from process context. 1171 - 1172 - </p><p>The <tt>->dynticks</tt> field counts the corresponding 1173 - CPU's transitions to and from either dyntick-idle or user mode, so 1174 - that this counter has an even value when the CPU is in dyntick-idle 1175 - mode or user mode and an odd value otherwise. The transitions to/from 1176 - user mode need to be counted for user mode adaptive-ticks support 1177 - (see timers/NO_HZ.txt). 
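The counter conventions above lend themselves to a small worked example. The sketch below follows only the even/odd and add-one/add-two rules as stated in the text; the helper names are hypothetical, the snapshot comparison is an assumption about how a value saved in <tt>->dynticks_snap</tt> might be used, and the kernel's actual encoding of these counters may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: even means dyntick-idle or user mode, odd otherwise. */
static bool dynticks_in_eqs(unsigned int ctr)
{
	return (ctr & 1) == 0;
}

/*
 * Assumed use of a snapshot: the CPU has passed through dyntick-idle or
 * user mode if it was there at snapshot time or the counter has moved since.
 */
static bool dynticks_qs_since(unsigned int snap, unsigned int now)
{
	return dynticks_in_eqs(snap) || snap != now;
}

/* ->dynticks_nmi_nesting value for a given irq/NMI nesting depth from idle. */
static long nmi_nesting_value(int depth)
{
	assert(depth >= 0);
	return depth == 0 ? 0 : 1 + 2L * (depth - 1);
}
```

For a nesting depth of five, <tt>nmi_nesting_value()</tt> gives 1 + 2*4 = 9, matching the value quoted above.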
1178 - 1179 - </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used 1180 - to record the fact that the RCU core code would really like to 1181 - see a quiescent state from the corresponding CPU, so much so that 1182 - it is willing to call for heavy-weight dyntick-counter operations. 1183 - This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> 1184 - code, which provide a momentary idle sojourn in response. 1185 - 1186 - </p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record 1187 - the fact that the RCU core code would really like to see a quiescent state from 1188 - the corresponding CPU, with the various other fields indicating just how badly 1189 - RCU wants this quiescent state. 1190 - This flag is checked by RCU's context-switch path 1191 - (<tt>rcu_note_context_switch</tt>) and the cond_resched code. 1192 - 1193 - <table> 1194 - <tr><th> </th></tr> 1195 - <tr><th align="left">Quick Quiz:</th></tr> 1196 - <tr><td> 1197 - Why not simply combine the <tt>->dynticks_nesting</tt> 1198 - and <tt>->dynticks_nmi_nesting</tt> counters into a 1199 - single counter that just counts the number of reasons that 1200 - the corresponding CPU is non-idle? 1201 - </td></tr> 1202 - <tr><th align="left">Answer:</th></tr> 1203 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1204 - Because this would fail in the presence of interrupts whose 1205 - handlers never return and of handlers that manage to return 1206 - from a made-up interrupt. 1207 - </font></td></tr> 1208 - <tr><td> </td></tr> 1209 - </table> 1210 - 1211 - <p>Additional fields are present for some special-purpose 1212 - builds, and are discussed separately. 1213 - 1214 - <h3><a name="The rcu_head Structure"> 1215 - The <tt>rcu_head</tt> Structure</a></h3> 1216 - 1217 - <p>Each <tt>rcu_head</tt> structure represents an RCU callback. 1218 - These structures are normally embedded within RCU-protected data 1219 - structures whose algorithms use asynchronous grace periods. 
1220 - In contrast, when using algorithms that block waiting for RCU grace periods, 1221 - RCU users need not provide <tt>rcu_head</tt> structures. 1222 - 1223 - </p><p>The <tt>rcu_head</tt> structure has fields as follows: 1224 - 1225 - <pre> 1226 - 1 struct rcu_head *next; 1227 - 2 void (*func)(struct rcu_head *head); 1228 - </pre> 1229 - 1230 - <p>The <tt>->next</tt> field is used 1231 - to link the <tt>rcu_head</tt> structures together in the 1232 - lists within the <tt>rcu_data</tt> structures. 1233 - The <tt>->func</tt> field is a pointer to the function 1234 - to be called when the callback is ready to be invoked, and 1235 - this function is passed a pointer to the <tt>rcu_head</tt> 1236 - structure. 1237 - However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> 1238 - field to record the offset of the <tt>rcu_head</tt> 1239 - structure within the enclosing RCU-protected data structure. 1240 - 1241 - </p><p>Both of these fields are used internally by RCU. 1242 - From the viewpoint of RCU users, this structure is an 1243 - opaque “cookie”. 1244 - 1245 - <table> 1246 - <tr><th> </th></tr> 1247 - <tr><th align="left">Quick Quiz:</th></tr> 1248 - <tr><td> 1249 - Given that the callback function <tt>->func</tt> 1250 - is passed a pointer to the <tt>rcu_head</tt> structure, 1251 - how is that function supposed to find the beginning of the 1252 - enclosing RCU-protected data structure? 1253 - </td></tr> 1254 - <tr><th align="left">Answer:</th></tr> 1255 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1256 - In actual practice, there is a separate callback function per 1257 - type of RCU-protected data structure. 1258 - The callback function can therefore use the <tt>container_of()</tt> 1259 - macro in the Linux kernel (or other pointer-manipulation facilities 1260 - in other software environments) to find the beginning of the 1261 - enclosing structure. 
1262 - </font></td></tr> 1263 - <tr><td> </td></tr> 1264 - </table> 1265 - 1266 - <h3><a name="RCU-Specific Fields in the task_struct Structure"> 1267 - RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> 1268 - 1269 - <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some 1270 - additional fields in the <tt>task_struct</tt> structure: 1271 - 1272 - <pre> 1273 - 1 #ifdef CONFIG_PREEMPT_RCU 1274 - 2 int rcu_read_lock_nesting; 1275 - 3 union rcu_special rcu_read_unlock_special; 1276 - 4 struct list_head rcu_node_entry; 1277 - 5 struct rcu_node *rcu_blocked_node; 1278 - 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ 1279 - 7 #ifdef CONFIG_TASKS_RCU 1280 - 8 unsigned long rcu_tasks_nvcsw; 1281 - 9 bool rcu_tasks_holdout; 1282 - 10 struct list_head rcu_tasks_holdout_list; 1283 - 11 int rcu_tasks_idle_cpu; 1284 - 12 #endif /* #ifdef CONFIG_TASKS_RCU */ 1285 - </pre> 1286 - 1287 - <p>The <tt>->rcu_read_lock_nesting</tt> field records the 1288 - nesting level for RCU read-side critical sections, and 1289 - the <tt>->rcu_read_unlock_special</tt> field is a bitmask 1290 - that records special conditions that require <tt>rcu_read_unlock()</tt> 1291 - to do additional work. 1292 - The <tt>->rcu_node_entry</tt> field is used to form lists of 1293 - tasks that have blocked within preemptible-RCU read-side critical 1294 - sections and the <tt>->rcu_blocked_node</tt> field references 1295 - the <tt>rcu_node</tt> structure whose list this task is a member of, 1296 - or <tt>NULL</tt> if it is not blocked within a preemptible-RCU 1297 - read-side critical section. 
1298 - 1299 - <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of 1300 - voluntary context switches that this task had undergone at the 1301 - beginning of the current tasks-RCU grace period, 1302 - <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU 1303 - grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt> 1304 - is a list element enqueuing this task on the holdout list, 1305 - and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this 1306 - idle task is running, but only if the task is currently running, 1307 - that is, if the CPU is currently idle. 1308 - 1309 - <h3><a name="Accessor Functions"> 1310 - Accessor Functions</a></h3> 1311 - 1312 - <p>The following listing shows the 1313 - <tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt> and 1314 - <tt>rcu_for_each_leaf_node()</tt> function and macros: 1315 - 1316 - <pre> 1317 - 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) 1318 - 2 { 1319 - 3 return &rsp->node[0]; 1320 - 4 } 1321 - 5 1322 - 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ 1323 - 7 for ((rnp) = &(rsp)->node[0]; \ 1324 - 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 1325 - 9 1326 - 10 #define rcu_for_each_leaf_node(rsp, rnp) \ 1327 - 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ 1328 - 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 1329 - </pre> 1330 - 1331 - <p>The <tt>rcu_get_root()</tt> simply returns a pointer to the 1332 - first element of the specified <tt>rcu_state</tt> structure's 1333 - <tt>->node[]</tt> array, which is the root <tt>rcu_node</tt> 1334 - structure. 1335 - 1336 - </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt> 1337 - macro takes advantage of the layout of the <tt>rcu_node</tt> 1338 - structures in the <tt>rcu_state</tt> structure's 1339 - <tt>->node[]</tt> array, performing a breadth-first traversal by 1340 - simply traversing the array in order. 
1341 - Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only 1342 - the last part of the array, thus traversing only the leaf 1343 - <tt>rcu_node</tt> structures. 1344 - 1345 - <table> 1346 - <tr><th> </th></tr> 1347 - <tr><th align="left">Quick Quiz:</th></tr> 1348 - <tr><td> 1349 - What does 1350 - <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree 1351 - contains only a single node? 1352 - </td></tr> 1353 - <tr><th align="left">Answer:</th></tr> 1354 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1355 - In the single-node case, 1356 - <tt>rcu_for_each_leaf_node()</tt> traverses the single node. 1357 - </font></td></tr> 1358 - <tr><td> </td></tr> 1359 - </table> 1360 - 1361 - <h3><a name="Summary"> 1362 - Summary</a></h3> 1363 - 1364 - So the state of RCU is represented by an <tt>rcu_state</tt> structure, 1365 - which contains a combining tree of <tt>rcu_node</tt> and 1366 - <tt>rcu_data</tt> structures. 1367 - Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle 1368 - state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure. 1369 - 1370 - If you made it this far, you are well prepared to read the code 1371 - walkthroughs in the other articles in this series. 1372 - 1373 - <h3><a name="Acknowledgments"> 1374 - Acknowledgments</a></h3> 1375 - 1376 - I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul 1377 - Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn 1378 - for helping me get this document into a more human-readable state. 1379 - 1380 - <h3><a name="Legal Statement"> 1381 - Legal Statement</a></h3> 1382 - 1383 - <p>This work represents the view of the author and does not necessarily 1384 - represent the view of IBM. 1385 - 1386 - </p><p>Linux is a registered trademark of Linus Torvalds. 1387 - 1388 - </p><p>Other company, product, and service names may be trademarks or 1389 - service marks of others. 1390 - 1391 - </body></html>

+1163

Documentation/RCU/Design/Data-Structures/Data-Structures.rst

··· 1 + =================================================== 2 + A Tour Through TREE_RCU's Data Structures [LWN.net] 3 + =================================================== 4 + 5 + December 18, 2016 6 + 7 + This article was contributed by Paul E. McKenney 8 + 9 + Introduction 10 + ============ 11 + 12 + This document describes RCU's major data structures and their relationship 13 + to each other. 14 + 15 + Data-Structure Relationships 16 + ============================ 17 + 18 + RCU is for all intents and purposes a large state machine, and its 19 + data structures maintain the state in such a way as to allow RCU readers 20 + to execute extremely quickly, while also processing the RCU grace periods 21 + requested by updaters in an efficient and extremely scalable fashion. 22 + The efficiency and scalability of RCU updaters is provided primarily 23 + by a combining tree, as shown below: 24 + 25 + .. kernel-figure:: BigTreeClassicRCU.svg 26 + 27 + This diagram shows an enclosing ``rcu_state`` structure containing a tree 28 + of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up 29 + to 16 ``rcu_data`` structures associated with it, so that there are 30 + ``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. 31 + This structure is adjusted at boot time, if needed, to handle the common 32 + case where ``nr_cpu_ids`` is much less than ``NR_CPUs``. 33 + For example, a number of Linux distributions set ``NR_CPUs=4096``, 34 + which results in a three-level ``rcu_node`` tree. 35 + If the actual hardware has only 16 CPUs, RCU will adjust itself 36 + at boot time, resulting in an ``rcu_node`` tree with only a single node. 37 + 38 + The purpose of this combining tree is to allow per-CPU events 39 + such as quiescent states, dyntick-idle transitions, 40 + and CPU hotplug operations to be processed efficiently 41 + and scalably. 
42 + Quiescent states are recorded by the per-CPU ``rcu_data`` structures, 43 + and other events are recorded by the leaf-level ``rcu_node`` 44 + structures. 45 + All of these events are combined at each level of the tree until finally 46 + grace periods are completed at the tree's root ``rcu_node`` 47 + structure. 48 + A grace period can be completed at the root once every CPU 49 + (or, in the case of ``CONFIG_PREEMPT_RCU``, task) 50 + has passed through a quiescent state. 51 + Once a grace period has completed, record of that fact is propagated 52 + back down the tree. 53 + 54 + As can be seen from the diagram, on a 64-bit system 55 + a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout 56 + of 64 at the root and a fanout of 16 at the leaves. 57 + 58 + +-----------------------------------------------------------------------+ 59 + | **Quick Quiz**: | 60 + +-----------------------------------------------------------------------+ 61 + | Why isn't the fanout at the leaves also 64? | 62 + +-----------------------------------------------------------------------+ 63 + | **Answer**: | 64 + +-----------------------------------------------------------------------+ 65 + | Because there are more types of events that affect the leaf-level | 66 + | ``rcu_node`` structures than further up the tree. Therefore, if the | 67 + | leaf ``rcu_node`` structures have a fanout of 64, the contention on | 68 + | these structures' ``->lock`` becomes excessive. Experimentation | 69 + | on a wide variety of systems has shown that a fanout of 16 works well | 70 + | for the leaves of the ``rcu_node`` tree. | 71 + | | 72 + | Of course, further experience with systems having hundreds or | 73 + | thousands of CPUs may demonstrate that the fanout for the non-leaf | 74 + | ``rcu_node`` structures must also be reduced. Such reduction can be | 75 + | easily carried out when and if it proves necessary.
In the meantime, | 76 + | if you are using such a system and running into contention problems | 77 + | on the non-leaf ``rcu_node`` structures, you may use the | 78 + | ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | 79 + | non-leaf fanout as needed. | 80 + | | 81 + | Kernels built for systems with strong NUMA characteristics might | 82 + | also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | 83 + | the ``rcu_node`` structures align with hardware boundaries. | 84 + | However, there has thus far been no need for this. | 85 + +-----------------------------------------------------------------------+ 86 + 87 + If your system has more than 1,024 CPUs (or more than 512 CPUs on a 88 + 32-bit system), then RCU will automatically add more levels to the tree. 89 + For example, if you are crazy enough to build a 64-bit system with 90 + 65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: 91 + 92 + .. kernel-figure:: HugeTreeClassicRCU.svg 93 + 94 + RCU currently permits up to a four-level tree, which on a 64-bit system 95 + accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for 96 + 32-bit systems. On the other hand, you can set both 97 + ``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as 98 + 2, which would result in a 16-CPU test using a 4-level tree. This can be 99 + useful for testing large-system capabilities on small test machines. 100 + 101 + This multi-level combining tree allows us to get most of the performance 102 + and scalability benefits of partitioning, even though RCU grace-period 103 + detection is inherently a global operation. The trick here is that only 104 + the last CPU to report a quiescent state into a given ``rcu_node`` 105 + structure need advance to the ``rcu_node`` structure at the next level 106 + up the tree. This means that at the leaf-level ``rcu_node`` structure, 107 + only one access out of sixteen will progress up the tree. 
For the 108 + internal ``rcu_node`` structures, the situation is even more extreme: 109 + Only one access out of sixty-four will progress up the tree. Because the 110 + vast majority of the CPUs do not progress up the tree, the lock 111 + contention remains roughly constant up the tree. No matter how many CPUs 112 + there are in the system, at most 64 quiescent-state reports per grace 113 + period will progress all the way to the root ``rcu_node`` structure, 114 + thus ensuring that the lock contention on that root ``rcu_node`` 115 + structure remains acceptably low. 116 + 117 + In effect, the combining tree acts like a big shock absorber, keeping 118 + lock contention under control at all tree levels regardless of the level 119 + of loading on the system. 120 + 121 + RCU updaters wait for normal grace periods by registering RCU callbacks, 122 + either directly via ``call_rcu()`` or indirectly via 123 + ``synchronize_rcu()`` and friends. RCU callbacks are represented by 124 + ``rcu_head`` structures, which are queued on ``rcu_data`` structures 125 + while they are waiting for a grace period to elapse, as shown in the 126 + following figure: 127 + 128 + .. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg 129 + 130 + This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data 131 + structures are related. Lesser data structures will be introduced with 132 + the algorithms that make use of them. 133 + 134 + Note that each of the data structures in the above figure has its own 135 + synchronization: 136 + 137 + #. Each ``rcu_state`` structure has a lock and a mutex, and some fields 138 + are protected by the corresponding root ``rcu_node`` structure's lock. 139 + #. Each ``rcu_node`` structure has a spinlock. 140 + #. The fields in ``rcu_data`` are private to the corresponding CPU, 141 + although a few can be read and written by other CPUs.
142 + 143 + It is important to note that different data structures can have very 144 + different ideas about the state of RCU at any given time. For but one 145 + example, awareness of the start or end of a given RCU grace period 146 + propagates slowly through the data structures. This slow propagation is 147 + absolutely necessary for RCU to have good read-side performance. If this 148 + balkanized implementation seems foreign to you, one useful trick is to 149 + consider each instance of these data structures to be a different 150 + person, each having the usual slightly different view of reality. 151 + 152 + The general role of each of these data structures is as follows: 153 + 154 + #. ``rcu_state``: This structure forms the interconnection between the 155 + ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, 156 + serves as a short-term repository for callbacks orphaned by CPU-hotplug 157 + events, maintains ``rcu_barrier()`` state, tracks expedited 158 + grace-period state, and maintains state used to force quiescent 159 + states when grace periods extend too long. 160 + #. ``rcu_node``: This structure forms the combining tree that propagates 161 + quiescent-state information from the leaves to the root, and also 162 + propagates grace-period information from the root to the leaves. It 163 + provides local copies of the grace-period state in order to allow 164 + this information to be accessed in a synchronized manner without 165 + suffering the scalability limitations that would otherwise be imposed 166 + by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the 167 + lists of tasks that have blocked while in their current RCU read-side 168 + critical section. In ``CONFIG_PREEMPT_RCU`` with 169 + ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` 170 + priority-boosting kernel threads (kthreads) and state.
Finally, it 171 + records CPU-hotplug state in order to determine which CPUs should be 172 + ignored during a given grace period. 173 + #. ``rcu_data``: This per-CPU structure is the focus of quiescent-state 174 + detection and RCU callback queuing. It also tracks its relationship 175 + to the corresponding leaf ``rcu_node`` structure to allow 176 + more-efficient propagation of quiescent states up the ``rcu_node`` 177 + combining tree. Like the ``rcu_node`` structure, it provides a local 178 + copy of the grace-period information to allow for-free synchronized 179 + access to this information from the corresponding CPU. Finally, this 180 + structure records past dyntick-idle state for the corresponding CPU 181 + and also tracks statistics. 182 + #. ``rcu_head``: This structure represents RCU callbacks, and is the 183 + only structure allocated and managed by RCU users. The ``rcu_head`` 184 + structure is normally embedded within the RCU-protected data 185 + structure. 186 + 187 + If all you wanted from this article was a general notion of how RCU's 188 + data structures are related, you are done. Otherwise, each of the 189 + following sections gives more details on the ``rcu_state``, ``rcu_node`` 190 + and ``rcu_data`` data structures. 191 + 192 + The ``rcu_state`` Structure 193 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 194 + 195 + The ``rcu_state`` structure is the base structure that represents the 196 + state of RCU in the system. This structure forms the interconnection 197 + between the ``rcu_node`` and ``rcu_data`` structures, tracks grace 198 + periods, contains the lock used to synchronize with CPU-hotplug events, 199 + and maintains state used to force quiescent states when grace periods 200 + extend too long. 201 + 202 + A few of the ``rcu_state`` structure's fields are discussed, singly and 203 + in groups, in the following sections. The more specialized fields are 204 + covered in the discussion of their use.
205 + 206 + Relationship to rcu_node and rcu_data Structures 207 + '''''''''''''''''''''''''''''''''''''''''''''''' 208 + 209 + This portion of the ``rcu_state`` structure is declared as follows: 210 + 211 + :: 212 + 213 + 1 struct rcu_node node[NUM_RCU_NODES]; 214 + 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; 215 + 3 struct rcu_data __percpu *rda; 216 + 217 + +-----------------------------------------------------------------------+ 218 + | **Quick Quiz**: | 219 + +-----------------------------------------------------------------------+ 220 + | Wait a minute! You said that the ``rcu_node`` structures formed a | 221 + | tree, but they are declared as a flat array! What gives? | 222 + +-----------------------------------------------------------------------+ 223 + | **Answer**: | 224 + +-----------------------------------------------------------------------+ 225 + | The tree is laid out in the array. The first node in the array is the | 226 + | head, the next set of nodes in the array are children of the head | 227 + | node, and so on until the last set of nodes in the array are the | 228 + | leaves. | 229 + | See the following diagrams to see how this works. | 230 + +-----------------------------------------------------------------------+ 231 + 232 + The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown 233 + in the following figure: 234 + 235 + .. kernel-figure:: TreeMapping.svg 236 + 237 + One interesting consequence of this mapping is that a breadth-first 238 + traversal of the tree is implemented as a simple linear scan of the 239 + array, which is in fact what the ``rcu_for_each_node_breadth_first()`` 240 + macro does. This macro is used at the beginning and end of each grace 241 + period. 242 + 243 + Each entry of the ``->level`` array references the first ``rcu_node`` 244 + structure on the corresponding level of the tree, for example, as shown 245 + below: 246 + 247 + ..
kernel-figure:: TreeMappingLevel.svg 248 + 249 + The zero\ :sup:`th` element of the array references the root 250 + ``rcu_node`` structure, the first element references the first child of 251 + the root ``rcu_node``, and finally the second element references the 252 + first leaf ``rcu_node`` structure. 253 + 254 + For whatever it is worth, if you draw the tree to be tree-shaped rather 255 + than array-shaped, it is easy to draw a planar representation: 256 + 257 + .. kernel-figure:: TreeLevel.svg 258 + 259 + Finally, the ``->rda`` field references a per-CPU pointer to the 260 + corresponding CPU's ``rcu_data`` structure. 261 + 262 + All of these fields are constant once initialization is complete, and 263 + therefore need no protection. 264 + 265 + Grace-Period Tracking 266 + ''''''''''''''''''''' 267 + 268 + This portion of the ``rcu_state`` structure is declared as follows: 269 + 270 + :: 271 + 272 + 1 unsigned long gp_seq; 273 + 274 + RCU grace periods are numbered, and the ``->gp_seq`` field contains the 275 + current grace-period sequence number. The bottom two bits are the state 276 + of the current grace period, which can be zero for not yet started or 277 + one for in progress. In other words, if the bottom two bits of 278 + ``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom 279 + two bits indicates that something is broken. This field is protected by 280 + the root ``rcu_node`` structure's ``->lock`` field. 281 + 282 + There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` 283 + structures as well. The fields in the ``rcu_state`` structure represent 284 + the most current value, and those of the other structures are compared 285 + in order to detect the beginnings and ends of grace periods in a 286 + distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` 287 + (down the tree from the root to the leaves) to ``rcu_data``. 
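The ``->gp_seq`` encoding described above can be illustrated with a minimal sketch. It assumes only what the text states: the bottom two bits hold the state (zero for idle, one for in progress) and the remaining bits hold the counter. Every ``sketch_`` name is hypothetical and chosen for this illustration; the kernel's own helpers may be named and structured differently. The wrap-tolerant comparison is the usual modular-arithmetic idiom, included here as an assumption about how copies of the counter could be compared.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* All sketch_ names are hypothetical, chosen for this illustration only. */
#define SKETCH_GP_STATE_BITS 2
#define SKETCH_GP_STATE_MASK ((1UL << SKETCH_GP_STATE_BITS) - 1)

/* Extract the grace-period state held in the bottom two bits. */
static unsigned long sketch_gp_seq_state(unsigned long gp_seq)
{
	return gp_seq & SKETCH_GP_STATE_MASK;
}

/* Extract the grace-period counter held in the remaining upper bits. */
static unsigned long sketch_gp_seq_ctr(unsigned long gp_seq)
{
	return gp_seq >> SKETCH_GP_STATE_BITS;
}

/* True if a grace period is in progress (state one), false if idle (zero). */
static bool sketch_gp_in_progress(unsigned long gp_seq)
{
	unsigned long state = sketch_gp_seq_state(gp_seq);

	assert(state <= 1);	/* per the text, other values mean breakage */
	return state == 1;
}

/* Wrap-tolerant test for "a is at or beyond b". */
static bool sketch_gp_seq_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* True if a local copy has not yet caught up with the rcu_state value. */
static bool sketch_gp_seq_lags(unsigned long local, unsigned long global)
{
	return local != global && sketch_gp_seq_ge(global, local);
}
```

An ``rcu_node`` or ``rcu_data`` copy for which ``sketch_gp_seq_lags()`` returns true has yet to notice a grace-period start or end that the ``rcu_state`` value already reflects.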
288 + 289 + Miscellaneous 290 + ''''''''''''' 291 + 292 + This portion of the ``rcu_state`` structure is declared as follows: 293 + 294 + :: 295 + 296 + 1 unsigned long gp_max; 297 + 2 char abbr; 298 + 3 char *name; 299 + 300 + The ``->gp_max`` field tracks the duration of the longest grace period 301 + in jiffies. It is protected by the root ``rcu_node``'s ``->lock``. 302 + 303 + The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU 304 + (“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). 305 + These fields are used for diagnostic and tracing purposes. 306 + 307 + The ``rcu_node`` Structure 308 + ~~~~~~~~~~~~~~~~~~~~~~~~~~ 309 + 310 + The ``rcu_node`` structures form the combining tree that propagates 311 + quiescent-state information from the leaves to the root and that also 312 + propagates grace-period information from the root down to the leaves. 313 + They provide local copies of the grace-period state in order to allow 314 + this information to be accessed in a synchronized manner without 315 + suffering the scalability limitations that would otherwise be imposed by 316 + global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists 317 + of tasks that have blocked while in their current RCU read-side critical 318 + section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they 319 + manage the per-\ ``rcu_node`` priority-boosting kernel threads 320 + (kthreads) and state. Finally, they record CPU-hotplug state in order to 321 + determine which CPUs should be ignored during a given grace period. 322 + 323 + The ``rcu_node`` structure's fields are discussed, singly and in groups, 324 + in the following sections.
325 + 326 + Connection to Combining Tree 327 + '''''''''''''''''''''''''''' 328 + 329 + This portion of the ``rcu_node`` structure is declared as follows: 330 + 331 + :: 332 + 333 + 1 struct rcu_node *parent; 334 + 2 u8 level; 335 + 3 u8 grpnum; 336 + 4 unsigned long grpmask; 337 + 5 int grplo; 338 + 6 int grphi; 339 + 340 + The ``->parent`` pointer references the ``rcu_node`` one level up in the 341 + tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation 342 + makes heavy use of this field to push quiescent states up the tree. The 343 + ``->level`` field gives the level in the tree, with the root being at 344 + level zero, its children at level one, and so on. The ``->grpnum`` field 345 + gives this node's position within the children of its parent, so this 346 + number can range between 0 and 31 on 32-bit systems and between 0 and 63 347 + on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only 348 + during initialization and for tracing. The ``->grpmask`` field is the 349 + bitmask counterpart of ``->grpnum``, and therefore always has exactly 350 + one bit set. This mask is used to clear the bit corresponding to this 351 + ``rcu_node`` structure in its parent's bitmasks, which are described 352 + later. Finally, the ``->grplo`` and ``->grphi`` fields contain the 353 + lowest and highest numbered CPU served by this ``rcu_node`` structure, 354 + respectively. 355 + 356 + All of these fields are constant, and thus do not require any 357 + synchronization. 358 + 359 + Synchronization 360 + ''''''''''''''' 361 + 362 + This field of the ``rcu_node`` structure is declared as follows: 363 + 364 + :: 365 + 366 + 1 raw_spinlock_t lock; 367 + 368 + This field is used to protect the remaining fields in this structure, 369 + unless otherwise stated. That said, all of the fields in this structure 370 + can be accessed without locking for tracing purposes. 
Yes, this can 371 + result in confusing traces, but better some tracing confusion than to be 372 + heisenbugged out of existence. 373 + 374 + .. _grace-period-tracking-1: 375 + 376 + Grace-Period Tracking 377 + ''''''''''''''''''''' 378 + 379 + This portion of the ``rcu_node`` structure is declared as follows: 380 + 381 + :: 382 + 383 + 1 unsigned long gp_seq; 384 + 2 unsigned long gp_seq_needed; 385 + 386 + The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of 387 + the field of the same name in the ``rcu_state`` structure. They each may 388 + lag up to one step behind their ``rcu_state`` counterpart. If the bottom 389 + two bits of a given ``rcu_node`` structure's ``->gp_seq`` field are zero, 390 + then this ``rcu_node`` structure believes that RCU is idle. 391 + 392 + The ``->gp_seq`` field of each ``rcu_node`` structure is updated at the 393 + beginning and the end of each grace period. 394 + 395 + The ``->gp_seq_needed`` fields record the furthest-in-the-future grace 396 + period request seen by the corresponding ``rcu_node`` structure. The 397 + request is considered fulfilled when the value of the ``->gp_seq`` field 398 + equals or exceeds that of the ``->gp_seq_needed`` field. 399 + 400 + +-----------------------------------------------------------------------+ 401 + | **Quick Quiz**: | 402 + +-----------------------------------------------------------------------+ 403 + | Suppose that this ``rcu_node`` structure doesn't see a request for a | 404 + | very long time. Won't wrapping of the ``->gp_seq`` field cause | 405 + | problems? | 406 + +-----------------------------------------------------------------------+ 407 + | **Answer**: | 408 + +-----------------------------------------------------------------------+ 409 + | No, because if the ``->gp_seq_needed`` field lags behind the | 410 + | ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | 411 + | the end of the grace period.
Modulo-arithmetic comparisons therefore | 412 + | will always get the correct answer, even with wrapping. | 413 + +-----------------------------------------------------------------------+ 414 + 415 + Quiescent-State Tracking 416 + '''''''''''''''''''''''' 417 + 418 + These fields manage the propagation of quiescent states up the combining 419 + tree. 420 + 421 + This portion of the ``rcu_node`` structure has fields as follows: 422 + 423 + :: 424 + 425 + 1 unsigned long qsmask; 426 + 2 unsigned long expmask; 427 + 3 unsigned long qsmaskinit; 428 + 4 unsigned long expmaskinit; 429 + 430 + The ``->qsmask`` field tracks which of this ``rcu_node`` structure's 431 + children still need to report quiescent states for the current normal 432 + grace period. Such children will have a value of 1 in their 433 + corresponding bit. Note that the leaf ``rcu_node`` structures should be 434 + thought of as having ``rcu_data`` structures as their children. 435 + Similarly, the ``->expmask`` field tracks which of this ``rcu_node`` 436 + structure's children still need to report quiescent states for the 437 + current expedited grace period. An expedited grace period has the same 438 + conceptual properties as a normal grace period, but the expedited 439 + implementation accepts extreme CPU overhead to obtain much lower 440 + grace-period latency, for example, consuming a few tens of microseconds 441 + worth of CPU time to reduce grace-period duration from milliseconds to 442 + tens of microseconds. The ``->qsmaskinit`` field tracks which of this 443 + ``rcu_node`` structure's children cover for at least one online CPU. 444 + This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is 445 + used to initialize ``->expmask``, at the beginning of the normal and 446 + expedited grace periods, respectively.
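How ``->qsmask`` and ``->grpmask`` cooperate to push quiescent states up the combining tree can be sketched as follows. This is a simplified, single-threaded illustration rather than the kernel's implementation: it omits the ``->lock`` acquisition that the real code relies on, ignores expedited grace periods, and uses hypothetical ``sketch_`` names throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cut-down node holding only the fields this sketch needs. */
struct sketch_node {
	struct sketch_node *parent;	/* NULL for the root */
	unsigned long grpmask;		/* this node's bit in parent->qsmask */
	unsigned long qsmask;		/* children still owing a report */
	unsigned long qsmaskinit;	/* children covering online CPUs */
};

/* At grace-period start, every covered child owes a quiescent state. */
static void sketch_gp_init(struct sketch_node *np)
{
	np->qsmask = np->qsmaskinit;
}

/*
 * Report a quiescent state for the child identified by mask.
 * Only the last reporter into a node moves up to the parent.
 * Returns true if the root's mask became empty, meaning that the
 * grace period may now end.  The real code holds each node's
 * lock around the corresponding update; this sketch does not.
 */
static bool sketch_report_qs(struct sketch_node *np, unsigned long mask)
{
	assert(mask != 0);
	for (;;) {
		if (!(np->qsmask & mask))
			return false;	/* already reported, nothing to do */
		np->qsmask &= ~mask;
		if (np->qsmask != 0)
			return false;	/* siblings still owe reports */
		if (np->parent == NULL)
			return true;	/* root is now clear */
		mask = np->grpmask;	/* become our bit in the parent */
		np = np->parent;
	}
}
```

With two leaves under one root, reports from all but the last child of a leaf stop at that leaf, which is what keeps contention on the upper levels low.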
447 + 448 + +-----------------------------------------------------------------------+ 449 + | **Quick Quiz**: | 450 + +-----------------------------------------------------------------------+ 451 + | Why are these bitmasks protected by locking? Come on, haven't you | 452 + | heard of atomic instructions??? | 453 + +-----------------------------------------------------------------------+ 454 + | **Answer**: | 455 + +-----------------------------------------------------------------------+ 456 + | Lockless grace-period computation! Such a tantalizing possibility! | 457 + | But consider the following sequence of events: | 458 + | | 459 + | #. CPU 0 has been in dyntick-idle mode for quite some time. When it | 460 + | wakes up, it notices that the current RCU grace period needs it to | 461 + | report in, so it sets a flag where the scheduling clock interrupt | 462 + | will find it. | 463 + | #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | 464 + | notices that CPU 0 has been in dyntick idle mode, which qualifies | 465 + | as an extended quiescent state. | 466 + | #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | 467 + | read-side critical section, and notices that the RCU core needs | 468 + | something, so commences RCU softirq processing. | 469 + | #. CPU 0's softirq handler executes and is just about ready to report | 470 + | its quiescent state up the ``rcu_node`` tree. | 471 + | #. But CPU 1 beats it to the punch, completing the current grace | 472 + | period and starting a new one. | 473 + | #. CPU 0 now reports its quiescent state for the wrong grace period. | 474 + | That grace period might now end before the RCU read-side critical | 475 + | section. If that happens, disaster will ensue. | 476 + | | 477 + | So the locking is absolutely required in order to coordinate clearing | 478 + | of the bits with updating of the grace-period sequence number in | 479 + | ``->gp_seq``. 
| 480 + +-----------------------------------------------------------------------+ 481 + 482 + Blocked-Task Management 483 + ''''''''''''''''''''''' 484 + 485 + ``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU 486 + read-side critical sections, and these tasks must be tracked explicitly. 487 + The details of exactly why and how they are tracked will be covered in a 488 + separate article on RCU read-side processing. For now, it is enough to 489 + know that the ``rcu_node`` structure tracks them. 490 + 491 + :: 492 + 493 + 1 struct list_head blkd_tasks; 494 + 2 struct list_head *gp_tasks; 495 + 3 struct list_head *exp_tasks; 496 + 4 bool wait_blkd_tasks; 497 + 498 + The ``->blkd_tasks`` field is a list header for the list of blocked and 499 + preempted tasks. As tasks undergo context switches within RCU read-side 500 + critical sections, their ``task_struct`` structures are enqueued (via 501 + the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the 502 + ``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding 503 + to the CPU on which the outgoing context switch executed. As these tasks 504 + later exit their RCU read-side critical sections, they remove themselves 505 + from the list. This list is therefore in reverse time order, so that if 506 + one of the tasks is blocking the current grace period, all subsequent 507 + tasks must also be blocking that same grace period. Therefore, a single 508 + pointer into this list suffices to track all tasks blocking a given 509 + grace period. That pointer is stored in ``->gp_tasks`` for normal grace 510 + periods and in ``->exp_tasks`` for expedited grace periods. These last 511 + two fields are ``NULL`` if either there is no grace period in flight or 512 + if there are no blocked tasks preventing that grace period from 513 + completing. 
If either of these two pointers is referencing a task that 514 + removes itself from the ``->blkd_tasks`` list, then that task must 515 + advance the pointer to the next task on the list, or set the pointer to 516 + ``NULL`` if there are no subsequent tasks on the list. 517 + 518 + For example, suppose that tasks T1, T2, and T3 are all hard-affinitied 519 + to the largest-numbered CPU in the system. Then if task T1 blocked in an 520 + RCU read-side critical section, then an expedited grace period started, 521 + then task T2 blocked in an RCU read-side critical section, then a normal 522 + grace period started, and finally task T3 blocked in an RCU read-side 523 + critical section, then the state of the last leaf ``rcu_node`` 524 + structure's blocked-task list would be as shown below: 525 + 526 + .. kernel-figure:: blkd_task.svg 527 + 528 + Task T1 is blocking both grace periods, task T2 is blocking only the 529 + normal grace period, and task T3 is blocking neither grace period. Note 530 + that these tasks will not remove themselves from this list immediately 531 + upon resuming execution. They will instead remain on the list until they 532 + execute the outermost ``rcu_read_unlock()`` that ends their RCU 533 + read-side critical section. 534 + 535 + The ``->wait_blkd_tasks`` field indicates whether or not the current 536 + grace period is waiting on a blocked task.
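The T1, T2, T3 scenario above can be reproduced with a small sketch of the blocked-task list. It is an illustration under stated assumptions, not the kernel's code: it uses a hand-rolled doubly linked list instead of ``struct list_head``, has no locking, and all ``sketch_`` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task's ->rcu_node_entry linkage. */
struct sketch_task {
	struct sketch_task *prev, *next;
};

/* Hypothetical stand-in for the blocked-task fields of a leaf rcu_node. */
struct sketch_blkd {
	struct sketch_task *head;	/* most recently blocked task */
	struct sketch_task *gp_tasks;	/* first task blocking the normal GP */
	struct sketch_task *exp_tasks;	/* first task blocking the expedited GP */
};

/* A task blocks in a read-side critical section: enqueue at the head. */
static void sketch_block(struct sketch_blkd *b, struct sketch_task *t)
{
	assert(t != NULL);
	t->prev = NULL;
	t->next = b->head;
	if (b->head)
		b->head->prev = t;
	b->head = t;
}

/* A new grace period waits on every task already on the list. */
static void sketch_gp_start(struct sketch_blkd *b)
{
	b->gp_tasks = b->head;
}

static void sketch_exp_gp_start(struct sketch_blkd *b)
{
	b->exp_tasks = b->head;
}

/* A task leaves its critical section: fix up the pointers, then unlink. */
static void sketch_unblock(struct sketch_blkd *b, struct sketch_task *t)
{
	if (b->gp_tasks == t)
		b->gp_tasks = t->next;	/* NULL if no subsequent task */
	if (b->exp_tasks == t)
		b->exp_tasks = t->next;
	if (t->prev)
		t->prev->next = t->next;
	else
		b->head = t->next;
	if (t->next)
		t->next->prev = t->prev;
	t->prev = t->next = NULL;
}

/* True if t is at or after first, that is, t blocks that grace period. */
static bool sketch_blocks(const struct sketch_task *first,
			  const struct sketch_task *t)
{
	for (; first; first = first->next)
		if (first == t)
			return true;
	return false;
}
```

Because new tasks are always added at the head, a single pointer per grace period is enough: everything from that pointer to the tail blocked before the grace period started.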
537 + 538 + Sizing the ``rcu_node`` Array 539 + ''''''''''''''''''''''''''''' 540 + 541 + The ``rcu_node`` array is sized via a series of C-preprocessor 542 + expressions as follows: 543 + 544 + :: 545 + 546 + 1 #ifdef CONFIG_RCU_FANOUT 547 + 2 #define RCU_FANOUT CONFIG_RCU_FANOUT 548 + 3 #else 549 + 4 # ifdef CONFIG_64BIT 550 + 5 # define RCU_FANOUT 64 551 + 6 # else 552 + 7 # define RCU_FANOUT 32 553 + 8 # endif 554 + 9 #endif 555 + 10 556 + 11 #ifdef CONFIG_RCU_FANOUT_LEAF 557 + 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF 558 + 13 #else 559 + 14 # ifdef CONFIG_64BIT 560 + 15 # define RCU_FANOUT_LEAF 64 561 + 16 # else 562 + 17 # define RCU_FANOUT_LEAF 32 563 + 18 # endif 564 + 19 #endif 565 + 20 566 + 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) 567 + 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) 568 + 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) 569 + 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) 570 + 25 571 + 26 #if NR_CPUS <= RCU_FANOUT_1 572 + 27 # define RCU_NUM_LVLS 1 573 + 28 # define NUM_RCU_LVL_0 1 574 + 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 575 + 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } 576 + 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } 577 + 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } 578 + 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } 579 + 34 #elif NR_CPUS <= RCU_FANOUT_2 580 + 35 # define RCU_NUM_LVLS 2 581 + 36 # define NUM_RCU_LVL_0 1 582 + 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 583 + 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) 584 + 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } 585 + 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } 586 + 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } 587 + 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } 588 + 43 #elif NR_CPUS <= RCU_FANOUT_3 589 + 44 # define RCU_NUM_LVLS 3 590 + 45 # define NUM_RCU_LVL_0 1 591 + 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, 
RCU_FANOUT_2) 592 + 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 593 + 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) 594 + 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } 595 + 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } 596 + 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } 597 + 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } 598 + 53 #elif NR_CPUS <= RCU_FANOUT_4 599 + 54 # define RCU_NUM_LVLS 4 600 + 55 # define NUM_RCU_LVL_0 1 601 + 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) 602 + 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) 603 + 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 604 + 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) 605 + 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } 606 + 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } 607 + 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } 608 + 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } 609 + 64 #else 610 + 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" 611 + 66 #endif 612 + 613 + The maximum number of levels in the ``rcu_node`` structure is currently 614 + limited to four, as specified by lines 21-24 and the structure of the 615 + subsequent “if” statement. For 32-bit systems, this allows 616 + 16*32*32*32=524,288 CPUs, which should be sufficient for the next few 617 + years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is 618 + allowed, which should see us through the next decade or so. 
This 619 + four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` 620 + to support up to 4096 CPUs, which might be useful in very large systems 621 + having eight CPUs per socket (but please note that no one has yet shown 622 + any measurable performance degradation due to misaligned socket and 623 + ``rcu_node`` boundaries). In addition, building kernels with a full four 624 + levels of ``rcu_node`` tree permits better testing of RCU's 625 + combining-tree code. 626 + 627 + The ``RCU_FANOUT`` symbol controls how many children are permitted at 628 + each non-leaf level of the ``rcu_node`` tree. If the 629 + ``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based 630 + on the word size of the system, which is also the Kconfig default. 631 + 632 + The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by 633 + each leaf ``rcu_node`` structure. Experience has shown that allowing a 634 + given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the 635 + number of bits in the ``->qsmask`` field on a 64-bit system, results in 636 + excessive contention for the leaf ``rcu_node`` structures' ``->lock`` 637 + fields. The number of CPUs per leaf ``rcu_node`` structure is therefore 638 + limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If 639 + ``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based 640 + on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. 641 + Lines 11-19 perform this computation. 642 + 643 + Lines 21-24 compute the maximum number of CPUs supported by a 644 + single-level (which contains a single ``rcu_node`` structure), 645 + two-level, three-level, and four-level ``rcu_node`` tree, respectively, 646 + given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. 647 + These numbers of CPUs are retained in the ``RCU_FANOUT_1``, 648 + ``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor 649 + variables, respectively. 
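The capacity arithmetic behind ``RCU_FANOUT_1`` through ``RCU_FANOUT_4`` can be checked with a run-time sketch. It assumes the fanout values are passed in as parameters instead of coming from Kconfig, and it also derives the per-level node counts that the same listing obtains with ``DIV_ROUND_UP()``. The ``sketch_`` names are hypothetical.

```c
#include <assert.h>

#define SKETCH_MAX_LVLS 4	/* the text's current four-level limit */

/* Levels needed for nr_cpus, or 0 when even four levels are too few. */
static int sketch_num_lvls(unsigned long nr_cpus, unsigned long fanout,
			   unsigned long fanout_leaf)
{
	unsigned long capacity = fanout_leaf;	/* role of RCU_FANOUT_1 */
	int lvls;

	assert(nr_cpus >= 1 && fanout >= 2 && fanout_leaf >= 1);
	for (lvls = 1; lvls <= SKETCH_MAX_LVLS; lvls++) {
		if (nr_cpus <= capacity)
			return lvls;
		capacity *= fanout;	/* RCU_FANOUT_2, _3, _4 in turn */
	}
	return 0;	/* corresponds to the #error branch */
}

/* Nodes at a given level (0 is the root) of a tree with lvls levels. */
static unsigned long sketch_nodes_at(unsigned long nr_cpus,
				     unsigned long fanout,
				     unsigned long fanout_leaf,
				     int lvls, int level)
{
	unsigned long cpus_per_node = fanout_leaf;
	int i;

	assert(lvls >= 1 && level >= 0 && level < lvls);
	if (level == 0)
		return 1;
	/* Each level above the leaves multiplies the CPUs covered per node. */
	for (i = level; i < lvls - 1; i++)
		cpus_per_node *= fanout;
	return (nr_cpus + cpus_per_node - 1) / cpus_per_node;	/* round up */
}
```

For example, with a leaf fanout of 16 and an interior fanout of 64, 4096 CPUs need three levels: one root, four interior nodes, and 256 leaves.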
650 + 651 + These variables are used to control the C-preprocessor ``#if`` statement 652 + spanning lines 26-66 that computes the number of ``rcu_node`` structures 653 + required for each level of the tree, as well as the number of levels 654 + required. The number of levels is placed in the ``RCU_NUM_LVLS`` 655 + C-preprocessor variable by lines 27, 35, 44, and 54. The number of 656 + ``rcu_node`` structures for the topmost level of the tree is always 657 + exactly one, and this value is unconditionally placed into 658 + ``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels 659 + (if any) of the ``rcu_node`` tree are computed by dividing the maximum 660 + number of CPUs by the fanout supported by the number of levels from the 661 + current level down, rounding up. This computation is performed by 662 + lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 61-63 create 663 + initializers for lockdep lock-class names. Finally, lines 64-66 produce 664 + an error if the maximum number of CPUs is too large for the specified 665 + fanout. 666 + 667 + The ``rcu_segcblist`` Structure 668 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 669 + 670 + The ``rcu_segcblist`` structure maintains a segmented list of callbacks 671 + as follows: 672 + 673 + :: 674 + 675 + 1 #define RCU_DONE_TAIL 0 676 + 2 #define RCU_WAIT_TAIL 1 677 + 3 #define RCU_NEXT_READY_TAIL 2 678 + 4 #define RCU_NEXT_TAIL 3 679 + 5 #define RCU_CBLIST_NSEGS 4 680 + 6 681 + 7 struct rcu_segcblist { 682 + 8 struct rcu_head *head; 683 + 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; 684 + 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; 685 + 11 long len; 686 + 12 long len_lazy; 687 + 13 }; 688 + 689 + The segments are as follows: 690 + 691 + #. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These 692 + callbacks are ready to be invoked. 693 + #. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace 694 + period.
Note that different CPUs can have different ideas about which 695 + grace period is current, hence the ``->gp_seq`` field. 696 + #. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period 697 + to start. 698 + #. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a 699 + grace period. 700 + 701 + The ``->head`` pointer references the first callback or is ``NULL`` if 702 + the list contains no callbacks (which is *not* the same as being empty). 703 + Each element of the ``->tails[]`` array references the ``->next`` 704 + pointer of the last callback in the corresponding segment of the list, 705 + or the list's ``->head`` pointer if that segment and all previous 706 + segments are empty. If the corresponding segment is empty but some 707 + previous segment is not empty, then the array element is identical to 708 + its predecessor. Older callbacks are closer to the head of the list, and 709 + new callbacks are added at the tail. This relationship between the 710 + ``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown 711 + in this diagram: 712 + 713 + .. kernel-figure:: nxtlist.svg 714 + 715 + In this figure, the ``->head`` pointer references the first RCU callback 716 + in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the 717 + ``->head`` pointer itself, indicating that none of the callbacks is 718 + ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references 719 + callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 720 + are both waiting on the current grace period, give or take possible 721 + disagreements about exactly which grace period is the current one. The 722 + ``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU 723 + callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that 724 + there are no callbacks waiting on the next RCU grace period. 
The 725 + ``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` 726 + pointer, indicating that all the remaining RCU callbacks have not yet 727 + been assigned to an RCU grace period. Note that the 728 + ``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU 729 + callback's ``->next`` pointer unless the callback list is empty, in 730 + which case it references the ``->head`` pointer. 731 + 732 + There is one additional important special case for the 733 + ``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this 734 + list is *disabled*. Lists are disabled when the corresponding CPU is 735 + offline or when the corresponding CPU's callbacks are offloaded to a 736 + kthread, both of which are described elsewhere. 737 + 738 + CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the 739 + ``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the 740 + ``RCU_DONE_TAIL`` list segments as grace periods advance. 741 + 742 + The ``->gp_seq[]`` array records grace-period numbers corresponding to 743 + the list segments. This is what allows different CPUs to have different 744 + ideas as to which is the current grace period while still avoiding 745 + premature invocation of their callbacks. In particular, this allows CPUs 746 + that go idle for extended periods to determine which of their callbacks 747 + are ready to be invoked after reawakening. 748 + 749 + The ``->len`` counter contains the number of callbacks in ``->head``, 750 + and the ``->len_lazy`` contains the number of those callbacks that are 751 + known to only free memory, and whose invocation can therefore be safely 752 + deferred. 753 + 754 + .. important:: 755 + 756 + It is the ``->len`` field that determines whether or 757 + not there are callbacks associated with this ``rcu_segcblist`` 758 + structure, *not* the ``->head`` pointer. 
The reason for this is that all 759 + the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` 760 + segment) are extracted all at once at callback-invocation time 761 + (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there 762 + are no not-done callbacks remaining in the ``rcu_segcblist``. If 763 + callback invocation must be postponed, for example, because a 764 + high-priority process just woke up on this CPU, then the remaining 765 + callbacks are placed back on the ``RCU_DONE_TAIL`` segment and 766 + ``->head`` once again points to the start of the segment. In short, the 767 + head field can briefly be ``NULL`` even though the CPU has callbacks 768 + present the entire time. Therefore, it is not appropriate to test the 769 + ``->head`` pointer for ``NULL``. 770 + 771 + In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only 772 + after the corresponding callbacks have been invoked. This means that the 773 + ``->len`` count is zero only if the ``rcu_segcblist`` structure really 774 + is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` 775 + count requires careful use of appropriate synchronization, for example, 776 + memory barriers. This synchronization can be a bit subtle, particularly 777 + in the case of ``rcu_barrier()``. 778 + 779 + The ``rcu_data`` Structure 780 + ~~~~~~~~~~~~~~~~~~~~~~~~~~ 781 + 782 + The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The 783 + fields in this structure may be accessed only from the corresponding CPU 784 + (and from tracing) unless otherwise stated. This structure is the focus 785 + of quiescent-state detection and RCU callback queuing. It also tracks 786 + its relationship to the corresponding leaf ``rcu_node`` structure to 787 + allow more-efficient propagation of quiescent states up the ``rcu_node`` 788 + combining tree. 
Like the ``rcu_node`` structure, it provides a local 789 + copy of the grace-period information to allow for-free synchronized 790 + access to this information from the corresponding CPU. Finally, this 791 + structure records past dyntick-idle state for the corresponding CPU and 792 + also tracks statistics. 793 + 794 + The ``rcu_data`` structure's fields are discussed, singly and in groups, 795 + in the following sections. 796 + 797 + Connection to Other Data Structures 798 + ''''''''''''''''''''''''''''''''''' 799 + 800 + This portion of the ``rcu_data`` structure is declared as follows: 801 + 802 + :: 803 + 804 + 1 int cpu; 805 + 2 struct rcu_node *mynode; 806 + 3 unsigned long grpmask; 807 + 4 bool beenonline; 808 + 809 + The ``->cpu`` field contains the number of the corresponding CPU and the 810 + ``->mynode`` field references the corresponding ``rcu_node`` structure. 811 + The ``->mynode`` is used to propagate quiescent states up the combining 812 + tree. These two fields are constant and therefore do not require 813 + synchronization. 814 + 815 + The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` 816 + corresponding to this ``rcu_data`` structure, and is also used when 817 + propagating quiescent states. The ``->beenonline`` flag is set whenever 818 + the corresponding CPU comes online, which means that the debugfs tracing 819 + need not dump out any ``rcu_data`` structure for which this flag is not 820 + set. 821 + 822 + Quiescent-State and Grace-Period Tracking 823 + ''''''''''''''''''''''''''''''''''''''''' 824 + 825 + This portion of the ``rcu_data`` structure is declared as follows: 826 + 827 + :: 828 + 829 + 1 unsigned long gp_seq; 830 + 2 unsigned long gp_seq_needed; 831 + 3 bool cpu_no_qs; 832 + 4 bool core_needs_qs; 833 + 5 bool gpwrap; 834 + 835 + The ``->gp_seq`` field is the counterpart of the field of the same name 836 + in the ``rcu_state`` and ``rcu_node`` structures. 
The 837 + ``->gp_seq_needed`` field is the counterpart of the field of the same 838 + name in the ``rcu_node`` structure. They may each lag up to one behind their 839 + ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and 840 + ``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in 841 + dyntick-idle mode (but these counters will catch up upon exit from 842 + dyntick-idle mode). If the lower two bits of a given ``rcu_data`` 843 + structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure 844 + believes that RCU is idle. 845 + 846 + +-----------------------------------------------------------------------+ 847 + | **Quick Quiz**: | 848 + +-----------------------------------------------------------------------+ 849 + | All this replication of the grace period numbers can only cause | 850 + | massive confusion. Why not just keep a global sequence number and be | 851 + | done with it??? | 852 + +-----------------------------------------------------------------------+ 853 + | **Answer**: | 854 + +-----------------------------------------------------------------------+ 855 + | Because if there were only a single global sequence number, there | 856 + | would need to be a single global lock to allow safely accessing and | 857 + | updating it. And if we are not going to have a single global lock, we | 858 + | need to carefully manage the numbers on a per-node basis. Recall from | 859 + | the answer to a previous Quick Quiz that the consequences of applying | 860 + | a previously sampled quiescent state to the wrong grace period are | 861 + | quite severe. | 862 + +-----------------------------------------------------------------------+ 863 + 864 + The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed 865 + through a quiescent state, while the ``->core_needs_qs`` flag indicates 866 + that the RCU core needs a quiescent state from the corresponding CPU.
867 + The ``->gpwrap`` field indicates that the corresponding CPU has remained 868 + idle for so long that the ``gp_seq`` counter is in danger of overflow, 869 + which will cause the CPU to disregard the values of its counters on its 870 + next exit from idle. 871 + 872 + RCU Callback Handling 873 + ''''''''''''''''''''' 874 + 875 + In the absence of CPU-hotplug events, RCU callbacks are invoked by the 876 + same CPU that registered them. This is strictly a cache-locality 877 + optimization: callbacks can and do get invoked on CPUs other than the 878 + one that registered them. After all, if the CPU that registered a given 879 + callback has gone offline before the callback can be invoked, there 880 + really is no other choice. 881 + 882 + This portion of the ``rcu_data`` structure is declared as follows: 883 + 884 + :: 885 + 886 + 1 struct rcu_segcblist cblist; 887 + 2 long qlen_last_fqs_check; 888 + 3 unsigned long n_cbs_invoked; 889 + 4 unsigned long n_nocbs_invoked; 890 + 5 unsigned long n_cbs_orphaned; 891 + 6 unsigned long n_cbs_adopted; 892 + 7 unsigned long n_force_qs_snap; 893 + 8 long blimit; 894 + 895 + The ``->cblist`` structure is the segmented callback list described 896 + earlier. The CPU advances the callbacks in its ``rcu_data`` structure 897 + whenever it notices that another RCU grace period has completed. The CPU 898 + detects the completion of an RCU grace period by noticing that the value 899 + of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of 900 + its leaf ``rcu_node`` structure. Recall that each ``rcu_node`` 901 + structure's ``->gp_seq`` field is updated at the beginnings and ends of 902 + each grace period. 903 + 904 + The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the 905 + forcing of quiescent states from ``call_rcu()`` and friends when 906 + callback lists grow excessively long. 
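The way callbacks move through the segmented ``->cblist`` can be illustrated with a much-simplified userspace model. Everything here is a sketch under stated assumptions: `struct cb` stands in for `struct rcu_head`, the `segcblist_*()` names are hypothetical, the ``->gp_seq[]`` and ``->len_lazy`` fields are omitted, and `segcblist_advance_one()` unconditionally shifts every callback one segment toward `RCU_DONE_TAIL` rather than consulting grace-period numbers.

```c
#include <assert.h>
#include <stddef.h>

#define RCU_DONE_TAIL		0
#define RCU_WAIT_TAIL		1
#define RCU_NEXT_READY_TAIL	2
#define RCU_NEXT_TAIL		3
#define RCU_CBLIST_NSEGS	4

struct cb {			/* stand-in for struct rcu_head */
	struct cb *next;
};

struct segcblist {		/* simplified model of rcu_segcblist */
	struct cb *head;
	struct cb **tails[RCU_CBLIST_NSEGS];
	long len;
};

/* Empty list: every tail pointer references ->head. */
static void segcblist_init(struct segcblist *l)
{
	int i;

	l->head = NULL;
	for (i = 0; i < RCU_CBLIST_NSEGS; i++)
		l->tails[i] = &l->head;
	l->len = 0;
}

/* New callbacks are added at the tail, in the RCU_NEXT_TAIL segment. */
static void segcblist_enqueue(struct segcblist *l, struct cb *c)
{
	c->next = NULL;
	*l->tails[RCU_NEXT_TAIL] = c;
	l->tails[RCU_NEXT_TAIL] = &c->next;
	l->len++;
}

/* Model assumption: shift every callback one segment toward "done". */
static void segcblist_advance_one(struct segcblist *l)
{
	int i;

	for (i = RCU_DONE_TAIL; i < RCU_NEXT_TAIL; i++)
		l->tails[i] = l->tails[i + 1];
}

/* Count the callbacks in one segment by walking between tail pointers. */
static long segcblist_seg_len(struct segcblist *l, int seg)
{
	struct cb **pp = seg == RCU_DONE_TAIL ? &l->head : l->tails[seg - 1];
	long n = 0;

	while (pp != l->tails[seg]) {
		pp = &(*pp)->next;
		n++;
	}
	return n;
}
```

Note that an empty segment is represented purely by its tail pointer being equal to its predecessor's, which is why advancing is just a matter of copying tail pointers.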
907 + 908 + The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` 909 + fields count the number of callbacks invoked, sent to other CPUs when 910 + this CPU goes offline, and received from other CPUs when those other 911 + CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's 912 + callbacks are offloaded to a kthread. 913 + 914 + Finally, the ``->blimit`` counter is the maximum number of RCU callbacks 915 + that may be invoked at a given time. 916 + 917 + Dyntick-Idle Handling 918 + ''''''''''''''''''''' 919 + 920 + This portion of the ``rcu_data`` structure is declared as follows: 921 + 922 + :: 923 + 924 + 1 int dynticks_snap; 925 + 2 unsigned long dynticks_fqs; 926 + 927 + The ``->dynticks_snap`` field is used to take a snapshot of the 928 + corresponding CPU's dyntick-idle state when forcing quiescent states, 929 + and is therefore accessed from other CPUs. Finally, the 930 + ``->dynticks_fqs`` field is used to count the number of times this CPU 931 + is determined to be in dyntick-idle state, and is used for tracing and 932 + debugging purposes. 933 + 934 + This portion of the rcu_data structure is declared as follows: 935 + 936 + :: 937 + 938 + 1 long dynticks_nesting; 939 + 2 long dynticks_nmi_nesting; 940 + 3 atomic_t dynticks; 941 + 4 bool rcu_need_heavy_qs; 942 + 5 bool rcu_urgent_qs; 943 + 944 + These fields in the rcu_data structure maintain the per-CPU dyntick-idle 945 + state for the corresponding CPU. The fields may be accessed only from 946 + the corresponding CPU (and from tracing) unless otherwise stated. 947 + 948 + The ``->dynticks_nesting`` field counts the nesting depth of process 949 + execution, so that in normal circumstances this counter has value zero 950 + or one. NMIs, irqs, and tracers are counted by the 951 + ``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes 952 + to this variable have to be undertaken carefully using an algorithm 953 + provided by Andy Lutomirski. 
The initial transition from idle adds one, 954 + and nested transitions add two, so that a nesting level of five is 955 + represented by a ``->dynticks_nmi_nesting`` value of nine. This counter 956 + can therefore be thought of as counting the number of reasons why this 957 + CPU cannot be permitted to enter dyntick-idle mode, aside from 958 + process-level transitions. 959 + 960 + However, it turns out that when running in non-idle kernel context, the 961 + Linux kernel is fully capable of entering interrupt handlers that never 962 + exit and perhaps also vice versa. Therefore, whenever the 963 + ``->dynticks_nesting`` field is incremented up from zero, the 964 + ``->dynticks_nmi_nesting`` field is set to a large positive number, and 965 + whenever the ``->dynticks_nesting`` field is decremented down to zero, 966 + the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that 967 + the number of misnested interrupts is not sufficient to overflow the 968 + counter, this approach corrects the ``->dynticks_nmi_nesting`` field 969 + every time the corresponding CPU enters the idle loop from process 970 + context. 971 + 972 + The ``->dynticks`` field counts the corresponding CPU's transitions to 973 + and from either dyntick-idle or user mode, so that this counter has an 974 + even value when the CPU is in dyntick-idle mode or user mode and an odd 975 + value otherwise. The transitions to/from user mode need to be counted 976 + for user mode adaptive-ticks support (see timers/NO_HZ.txt). 977 + 978 + The ``->rcu_need_heavy_qs`` field is used to record the fact that the 979 + RCU core code would really like to see a quiescent state from the 980 + corresponding CPU, so much so that it is willing to call for 981 + heavy-weight dyntick-counter operations. This flag is checked by RCU's 982 + context-switch and ``cond_resched()`` code, which provide a momentary 983 + idle sojourn in response.
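The counting rules for ``->dynticks_nmi_nesting`` and ``->dynticks`` described above can be checked with a small single-threaded model. This is a sketch, not the kernel's algorithm: `struct dyn_state` and the `model_*()` functions are hypothetical, only interrupts arriving on an idle CPU are modeled, and the process-level ``->dynticks_nesting`` field and all atomic operations are ignored.

```c
#include <assert.h>
#include <stdbool.h>

struct dyn_state {		/* hypothetical single-threaded model */
	long nmi_nesting;	/* models ->dynticks_nmi_nesting */
	unsigned int dynticks;	/* models ->dynticks (atomic_t in the text) */
};

/* Interrupt or NMI entry: the first one from idle adds 1, nested ones add 2. */
static void model_irq_enter(struct dyn_state *s)
{
	if (s->nmi_nesting == 0) {
		s->dynticks++;		/* even -> odd: CPU leaves idle */
		s->nmi_nesting = 1;
	} else {
		s->nmi_nesting += 2;
	}
}

/* Interrupt or NMI exit: undo the matching entry. */
static void model_irq_exit(struct dyn_state *s)
{
	assert(s->nmi_nesting > 0);
	if (s->nmi_nesting == 1) {
		s->nmi_nesting = 0;
		s->dynticks++;		/* odd -> even: CPU back in idle */
	} else {
		s->nmi_nesting -= 2;
	}
}

/* An even ->dynticks value means dyntick-idle (or user) mode. */
static bool model_in_dyntick_idle(const struct dyn_state *s)
{
	return (s->dynticks & 1) == 0;
}
```

Five nested entries starting from idle leave the nesting counter at 1 + 2 * 4 = 9, matching the example in the text, and the odd values make it easy to tell "outermost handler" (exactly one) apart from any nested level.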
984 + 985 + Finally, the ``->rcu_urgent_qs`` field is used to record the fact that 986 + the RCU core code would really like to see a quiescent state from the 987 + corresponding CPU, with the various other fields indicating just how 988 + badly RCU wants this quiescent state. This flag is checked by RCU's 989 + context-switch path (``rcu_note_context_switch``) and the cond_resched 990 + code. 991 + 992 + +-----------------------------------------------------------------------+ 993 + | **Quick Quiz**: | 994 + +-----------------------------------------------------------------------+ 995 + | Why not simply combine the ``->dynticks_nesting`` and | 996 + | ``->dynticks_nmi_nesting`` counters into a single counter that just | 997 + | counts the number of reasons that the corresponding CPU is non-idle? | 998 + +-----------------------------------------------------------------------+ 999 + | **Answer**: | 1000 + +-----------------------------------------------------------------------+ 1001 + | Because this would fail in the presence of interrupts whose handlers | 1002 + | never return and of handlers that manage to return from a made-up | 1003 + | interrupt. | 1004 + +-----------------------------------------------------------------------+ 1005 + 1006 + Additional fields are present for some special-purpose builds, and are 1007 + discussed separately. 1008 + 1009 + The ``rcu_head`` Structure 1010 + ~~~~~~~~~~~~~~~~~~~~~~~~~~ 1011 + 1012 + Each ``rcu_head`` structure represents an RCU callback. These structures 1013 + are normally embedded within RCU-protected data structures whose 1014 + algorithms use asynchronous grace periods. In contrast, when using 1015 + algorithms that block waiting for RCU grace periods, RCU users need not 1016 + provide ``rcu_head`` structures. 
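As a rough illustration of embedding an ``rcu_head`` structure in an RCU-protected data structure, consider the following userspace sketch. It assumes a two-field `rcu_head` (a ``->next`` link and a ``->func`` pointer); `struct foo`, `foo_reclaim()`, `model_call_rcu()`, and `model_invoke_all()` are hypothetical names invented for this example, and the `container_of()` definition is a simplified stand-in for the kernel macro. Grace periods and callback ordering are not modeled.

```c
#include <assert.h>
#include <stddef.h>

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo {			/* hypothetical RCU-protected structure */
	int key;
	int reclaimed;
	struct rcu_head rh;	/* embedded callback "cookie" */
};

/* Callback: recover the enclosing structure from the rcu_head pointer. */
static void foo_reclaim(struct rcu_head *head)
{
	struct foo *fp = container_of(head, struct foo, rh);

	fp->reclaimed = 1;	/* real code would typically free fp here */
}

/* Hypothetical stand-in for call_rcu(): queue the callback on a list. */
static void model_call_rcu(struct rcu_head **list, struct rcu_head *head,
			   void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = *list;
	*list = head;
}

/* Pretend a grace period has elapsed, then invoke everything queued. */
static int model_invoke_all(struct rcu_head **list)
{
	int n = 0;

	while (*list) {
		struct rcu_head *head = *list;

		*list = head->next;
		head->func(head);
		n++;
	}
	return n;
}
```

The caller never inspects the `rcu_head` fields directly; it only supplies the embedded structure and a per-type callback function.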
1017 + 1018 + The ``rcu_head`` structure has fields as follows: 1019 + 1020 + :: 1021 + 1022 + 1 struct rcu_head *next; 1023 + 2 void (*func)(struct rcu_head *head); 1024 + 1025 + The ``->next`` field is used to link the ``rcu_head`` structures 1026 + together in the lists within the ``rcu_data`` structures. The ``->func`` 1027 + field is a pointer to the function to be called when the callback is 1028 + ready to be invoked, and this function is passed a pointer to the 1029 + ``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` 1030 + field to record the offset of the ``rcu_head`` structure within the 1031 + enclosing RCU-protected data structure. 1032 + 1033 + Both of these fields are used internally by RCU. From the viewpoint of 1034 + RCU users, this structure is an opaque “cookie”. 1035 + 1036 + +-----------------------------------------------------------------------+ 1037 + | **Quick Quiz**: | 1038 + +-----------------------------------------------------------------------+ 1039 + | Given that the callback function ``->func`` is passed a pointer to | 1040 + | the ``rcu_head`` structure, how is that function supposed to find the | 1041 + | beginning of the enclosing RCU-protected data structure? | 1042 + +-----------------------------------------------------------------------+ 1043 + | **Answer**: | 1044 + +-----------------------------------------------------------------------+ 1045 + | In actual practice, there is a separate callback function per type of | 1046 + | RCU-protected data structure. The callback function can therefore use | 1047 + | the ``container_of()`` macro in the Linux kernel (or other | 1048 + | pointer-manipulation facilities in other software environments) to | 1049 + | find the beginning of the enclosing structure. 
| 1050 + +-----------------------------------------------------------------------+ 1051 + 1052 + RCU-Specific Fields in the ``task_struct`` Structure 1053 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1054 + 1055 + The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in 1056 + the ``task_struct`` structure: 1057 + 1058 + :: 1059 + 1060 + 1 #ifdef CONFIG_PREEMPT_RCU 1061 + 2 int rcu_read_lock_nesting; 1062 + 3 union rcu_special rcu_read_unlock_special; 1063 + 4 struct list_head rcu_node_entry; 1064 + 5 struct rcu_node *rcu_blocked_node; 1065 + 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ 1066 + 7 #ifdef CONFIG_TASKS_RCU 1067 + 8 unsigned long rcu_tasks_nvcsw; 1068 + 9 bool rcu_tasks_holdout; 1069 + 10 struct list_head rcu_tasks_holdout_list; 1070 + 11 int rcu_tasks_idle_cpu; 1071 + 12 #endif /* #ifdef CONFIG_TASKS_RCU */ 1072 + 1073 + The ``->rcu_read_lock_nesting`` field records the nesting level for RCU 1074 + read-side critical sections, and the ``->rcu_read_unlock_special`` field 1075 + is a bitmask that records special conditions that require 1076 + ``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry`` 1077 + field is used to form lists of tasks that have blocked within 1078 + preemptible-RCU read-side critical sections and the 1079 + ``->rcu_blocked_node`` field references the ``rcu_node`` structure whose 1080 + list this task is a member of, or ``NULL`` if it is not blocked within a 1081 + preemptible-RCU read-side critical section. 
1082 + 1083 + The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context 1084 + switches that this task had undergone at the beginning of the current 1085 + tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current 1086 + tasks-RCU grace period is waiting on this task, 1087 + ``->rcu_tasks_holdout_list`` is a list element enqueuing this task on 1088 + the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this 1089 + idle task is running on, but only if the task is currently running, that 1090 + is, if the CPU is currently idle. 1091 + 1092 + Accessor Functions 1093 + ~~~~~~~~~~~~~~~~~~ 1094 + 1095 + The following listing shows the ``rcu_get_root()``, 1096 + ``rcu_for_each_node_breadth_first()`` and ``rcu_for_each_leaf_node()`` 1097 + function and macros: 1098 + 1099 + :: 1100 + 1101 + 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) 1102 + 2 { 1103 + 3 return &rsp->node[0]; 1104 + 4 } 1105 + 5 1106 + 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ 1107 + 7 for ((rnp) = &(rsp)->node[0]; \ 1108 + 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 1109 + 9 1110 + 10 #define rcu_for_each_leaf_node(rsp, rnp) \ 1111 + 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ 1112 + 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 1113 + 1114 + The ``rcu_get_root()`` function simply returns a pointer to the first element of 1115 + the specified ``rcu_state`` structure's ``->node[]`` array, which is the 1116 + root ``rcu_node`` structure. 1117 + 1118 + As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes 1119 + advantage of the layout of the ``rcu_node`` structures in the 1120 + ``rcu_state`` structure's ``->node[]`` array, performing a breadth-first 1121 + traversal by simply traversing the array in order. Similarly, the 1122 + ``rcu_for_each_leaf_node()`` macro traverses only the last part of the 1123 + array, thus traversing only the leaf ``rcu_node`` structures.
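The accessors above can be exercised outside the kernel with a toy tree. The following is a minimal sketch: the accessor function and macros are copied from the listing, but the two-level shape (one root plus four leaves), the stripped-down `rcu_node` and `rcu_state` definitions, and the `model_*()` helpers are assumptions made purely for illustration.

```c
#include <assert.h>

#define NUM_RCU_LVLS	2
#define NUM_RCU_NODES	5	/* one root plus four leaves (arbitrary) */

struct rcu_node {
	int level;		/* 0 for the root, 1 for leaves in this model */
};

struct rcu_state {
	struct rcu_node node[NUM_RCU_NODES];	/* breadth-first layout */
	struct rcu_node *level[NUM_RCU_LVLS];	/* first node of each level */
};

static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
{
	return &rsp->node[0];
}

#define rcu_for_each_node_breadth_first(rsp, rnp) \
	for ((rnp) = &(rsp)->node[0]; \
	     (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

#define rcu_for_each_leaf_node(rsp, rnp) \
	for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
	     (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

/* Lay out the toy tree: node[0] is the root, node[1..4] are leaves. */
static void model_init(struct rcu_state *rsp)
{
	int i;

	rsp->level[0] = &rsp->node[0];
	rsp->level[1] = &rsp->node[1];
	for (i = 0; i < NUM_RCU_NODES; i++)
		rsp->node[i].level = i == 0 ? 0 : 1;
}

/* Visit every node; levels must never decrease in breadth-first order. */
static int model_count_nodes(struct rcu_state *rsp)
{
	struct rcu_node *rnp;
	int prev_level = 0;
	int n = 0;

	rcu_for_each_node_breadth_first(rsp, rnp) {
		assert(rnp->level >= prev_level);
		prev_level = rnp->level;
		n++;
	}
	return n;
}

/* Visit only the leaves, which occupy the tail of the array. */
static int model_count_leaves(struct rcu_state *rsp)
{
	struct rcu_node *rnp;
	int n = 0;

	rcu_for_each_leaf_node(rsp, rnp) {
		assert(rnp->level == NUM_RCU_LVLS - 1);
		n++;
	}
	return n;
}
```

Because the leaves are stored last, both macros share the same termination condition and differ only in where they start.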
1124 + 1125 + +-----------------------------------------------------------------------+ 1126 + | **Quick Quiz**: | 1127 + +-----------------------------------------------------------------------+ 1128 + | What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | 1129 + | contains only a single node? | 1130 + +-----------------------------------------------------------------------+ 1131 + | **Answer**: | 1132 + +-----------------------------------------------------------------------+ 1133 + | In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | 1134 + | single node. | 1135 + +-----------------------------------------------------------------------+ 1136 + 1137 + Summary 1138 + ~~~~~~~ 1139 + 1140 + So the state of RCU is represented by an ``rcu_state`` structure, which 1141 + contains a combining tree of ``rcu_node`` and ``rcu_data`` structures. 1142 + Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state 1143 + is tracked by dynticks-related fields in the ``rcu_data`` structure. If 1144 + you made it this far, you are well prepared to read the code 1145 + walkthroughs in the other articles in this series. 1146 + 1147 + Acknowledgments 1148 + ~~~~~~~~~~~~~~~ 1149 + 1150 + I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul 1151 + Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for 1152 + helping me get this document into a more human-readable state. 1153 + 1154 + Legal Statement 1155 + ~~~~~~~~~~~~~~~ 1156 + 1157 + This work represents the view of the author and does not necessarily 1158 + represent the view of IBM. 1159 + 1160 + Linux is a registered trademark of Linus Torvalds. 1161 + 1162 + Other company, product, and service names may be trademarks or service 1163 + marks of others.

-668

Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html

··· 1 - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2 - "http://www.w3.org/TR/html4/loose.dtd"> 3 - <html> 4 - <head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title> 5 - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> 6 - 7 - <h2>Introduction</h2> 8 - 9 - This document describes RCU's expedited grace periods. 10 - Unlike RCU's normal grace periods, which accept long latencies to attain 11 - high efficiency and minimal disturbance, expedited grace periods accept 12 - lower efficiency and significant disturbance to attain shorter latencies. 13 - 14 - <p> 15 - There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier 16 - third RCU-bh flavor having been implemented in terms of the other two. 17 - Each of the two implementations is covered in its own section. 18 - 19 - <ol> 20 - <li> <a href="#Expedited Grace Period Design"> 21 - Expedited Grace Period Design</a> 22 - <li> <a href="#RCU-preempt Expedited Grace Periods"> 23 - RCU-preempt Expedited Grace Periods</a> 24 - <li> <a href="#RCU-sched Expedited Grace Periods"> 25 - RCU-sched Expedited Grace Periods</a> 26 - <li> <a href="#Expedited Grace Period and CPU Hotplug"> 27 - Expedited Grace Period and CPU Hotplug</a> 28 - <li> <a href="#Expedited Grace Period Refinements"> 29 - Expedited Grace Period Refinements</a> 30 - </ol> 31 - 32 - <h2><a name="Expedited Grace Period Design"> 33 - Expedited Grace Period Design</a></h2> 34 - 35 - <p> 36 - The expedited RCU grace periods cannot be accused of being subtle, 37 - given that they for all intents and purposes hammer every CPU that 38 - has not yet provided a quiescent state for the current expedited 39 - grace period. 40 - The one saving grace is that the hammer has grown a bit smaller 41 - over time: The old call to <tt>try_stop_cpus()</tt> has been 42 - replaced with a set of calls to <tt>smp_call_function_single()</tt>, 43 - each of which results in an IPI to the target CPU. 
44 - The corresponding handler function checks the CPU's state, motivating 45 - a faster quiescent state where possible, and triggering a report 46 - of that quiescent state. 47 - As always for RCU, once everything has spent some time in a quiescent 48 - state, the expedited grace period has completed. 49 - 50 - <p> 51 - The details of the <tt>smp_call_function_single()</tt> handler's 52 - operation depend on the RCU flavor, as described in the following 53 - sections. 54 - 55 - <h2><a name="RCU-preempt Expedited Grace Periods"> 56 - RCU-preempt Expedited Grace Periods</a></h2> 57 - 58 - <p> 59 - <tt>CONFIG_PREEMPT=y</tt> kernels implement RCU-preempt. 60 - The overall flow of the handling of a given CPU by an RCU-preempt 61 - expedited grace period is shown in the following diagram: 62 - 63 - <p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%"> 64 - 65 - <p> 66 - The solid arrows denote direct action, for example, a function call. 67 - The dotted arrows denote indirect action, for example, an IPI 68 - or a state that is reached after some time. 69 - 70 - <p> 71 - If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt> 72 - will ignore it because idle and offline CPUs are already residing 73 - in quiescent states. 74 - Otherwise, the expedited grace period will use 75 - <tt>smp_call_function_single()</tt> to send the CPU an IPI, which 76 - is handled by <tt>rcu_exp_handler()</tt>. 77 - 78 - <p> 79 - However, because this is preemptible RCU, <tt>rcu_exp_handler()</tt> 80 - can check to see if the CPU is currently running in an RCU read-side 81 - critical section. 82 - If not, the handler can immediately report a quiescent state. 83 - Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt> 84 - invocation will provide the needed quiescent-state report. 85 - This flag-setting avoids the previous forced preemption of all 86 - CPUs that might have RCU read-side critical sections. 
87 - In addition, this flag-setting is done so as to avoid increasing 88 - the overhead of the common-case fastpath through the scheduler. 89 - 90 - <p> 91 - Again because this is preemptible RCU, an RCU read-side critical section 92 - can be preempted. 93 - When that happens, RCU will enqueue the task, which will the continue to 94 - block the current expedited grace period until it resumes and finds its 95 - outermost <tt>rcu_read_unlock()</tt>. 96 - The CPU will report a quiescent state just after enqueuing the task because 97 - the CPU is no longer blocking the grace period. 98 - It is instead the preempted task doing the blocking. 99 - The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>, 100 - which is called from <tt>rcu_preempt_note_context_switch()</tt>, which 101 - in turn is called from <tt>rcu_note_context_switch()</tt>, which in 102 - turn is called from the scheduler. 103 - 104 - <table> 105 - <tr><th> </th></tr> 106 - <tr><th align="left">Quick Quiz:</th></tr> 107 - <tr><td> 108 - Why not just have the expedited grace period check the 109 - state of all the CPUs? 110 - After all, that would avoid all those real-time-unfriendly IPIs. 111 - </td></tr> 112 - <tr><th align="left">Answer:</th></tr> 113 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 114 - Because we want the RCU read-side critical sections to run fast, 115 - which means no memory barriers. 116 - Therefore, it is not possible to safely check the state from some 117 - other CPU. 118 - And even if it was possible to safely check the state, it would 119 - still be necessary to IPI the CPU to safely interact with the 120 - upcoming <tt>rcu_read_unlock()</tt> invocation, which means that 121 - the remote state testing would not help the worst-case 122 - latency that real-time applications care about. 
123 - 124 - <p><font color="ffffff">One way to prevent your real-time 125 - application from getting hit with these IPIs is to 126 - build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>. 127 - RCU would then perceive the CPU running your application 128 - as being idle, and it would be able to safely detect that 129 - state without needing to IPI the CPU. 130 - </font></td></tr> 131 - <tr><td> </td></tr> 132 - </table> 133 - 134 - <p> 135 - Please note that this is just the overall flow: 136 - Additional complications can arise due to races with CPUs going idle 137 - or offline, among other things. 138 - 139 - <h2><a name="RCU-sched Expedited Grace Periods"> 140 - RCU-sched Expedited Grace Periods</a></h2> 141 - 142 - <p> 143 - <tt>CONFIG_PREEMPT=n</tt> kernels implement RCU-sched. 144 - The overall flow of the handling of a given CPU by an RCU-sched 145 - expedited grace period is shown in the following diagram: 146 - 147 - <p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%"> 148 - 149 - <p> 150 - As with RCU-preempt, RCU-sched's 151 - <tt>synchronize_rcu_expedited()</tt> ignores offline and 152 - idle CPUs, again because they are in remotely detectable 153 - quiescent states. 154 - However, because the 155 - <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> 156 - leave no trace of their invocation, in general it is not possible to tell 157 - whether or not the current CPU is in an RCU read-side critical section. 158 - The best that RCU-sched's <tt>rcu_exp_handler()</tt> can do is to check 159 - for idle, on the off-chance that the CPU went idle while the IPI 160 - was in flight. 161 - If the CPU is idle, then <tt>rcu_exp_handler()</tt> reports 162 - the quiescent state. 163 - 164 - <p> Otherwise, the handler forces a future context switch by setting the 165 - NEED_RESCHED flag of the current task's thread flag and the CPU preempt 166 - counter. 167 - At the time of the context switch, the CPU reports the quiescent state. 
168 - Should the CPU go offline first, it will report the quiescent state 169 - at that time. 170 - 171 - <h2><a name="Expedited Grace Period and CPU Hotplug"> 172 - Expedited Grace Period and CPU Hotplug</a></h2> 173 - 174 - <p> 175 - The expedited nature of expedited grace periods require a much tighter 176 - interaction with CPU hotplug operations than is required for normal 177 - grace periods. 178 - In addition, attempting to IPI offline CPUs will result in splats, but 179 - failing to IPI online CPUs can result in too-short grace periods. 180 - Neither option is acceptable in production kernels. 181 - 182 - <p> 183 - The interaction between expedited grace periods and CPU hotplug operations 184 - is carried out at several levels: 185 - 186 - <ol> 187 - <li> The number of CPUs that have ever been online is tracked 188 - by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt> 189 - field. 190 - The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt> 191 - field tracks the number of CPUs that have ever been online 192 - at the beginning of an RCU expedited grace period. 193 - Note that this number never decreases, at least in the absence 194 - of a time machine. 195 - <li> The identities of the CPUs that have ever been online is 196 - tracked by the <tt>rcu_node</tt> structure's 197 - <tt>->expmaskinitnext</tt> field. 198 - The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> 199 - field tracks the identities of the CPUs that were online 200 - at least once at the beginning of the most recent RCU 201 - expedited grace period. 
202 - The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and 203 - <tt>->ncpus_snap</tt> fields are used to detect when 204 - new CPUs have come online for the first time, that is, 205 - when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt> 206 - field has changed since the beginning of the last RCU 207 - expedited grace period, which triggers an update of each 208 - <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> 209 - field from its <tt>->expmaskinitnext</tt> field. 210 - <li> Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt> 211 - field is used to initialize that structure's 212 - <tt>->expmask</tt> at the beginning of each RCU 213 - expedited grace period. 214 - This means that only those CPUs that have been online at least 215 - once will be considered for a given grace period. 216 - <li> Any CPU that goes offline will clear its bit in its leaf 217 - <tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt> 218 - field, so any CPU with that bit clear can safely be ignored. 219 - However, it is possible for a CPU coming online or going offline 220 - to have this bit set for some time while <tt>cpu_online</tt> 221 - returns <tt>false</tt>. 222 - <li> For each non-idle CPU that RCU believes is currently online, the grace 223 - period invokes <tt>smp_call_function_single()</tt>. 224 - If this succeeds, the CPU was fully online. 225 - Failure indicates that the CPU is in the process of coming online 226 - or going offline, in which case it is necessary to wait for a 227 - short time period and try again. 228 - The purpose of this wait (or series of waits, as the case may be) 229 - is to permit a concurrent CPU-hotplug operation to complete. 230 - <li> In the case of RCU-sched, one of the last acts of an outgoing CPU 231 - is to invoke <tt>rcu_report_dead()</tt>, which 232 - reports a quiescent state for that CPU. 233 - However, this is likely paranoia-induced redundancy.  
234 - </ol> 235 - 236 - <table> 237 - <tr><th> </th></tr> 238 - <tr><th align="left">Quick Quiz:</th></tr> 239 - <tr><td> 240 - Why all the dancing around with multiple counters and masks 241 - tracking CPUs that were once online? 242 - Why not just have a single set of masks tracking the currently 243 - online CPUs and be done with it? 244 - </td></tr> 245 - <tr><th align="left">Answer:</th></tr> 246 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 247 - Maintaining single set of masks tracking the online CPUs <i>sounds</i> 248 - easier, at least until you try working out all the race conditions 249 - between grace-period initialization and CPU-hotplug operations. 250 - For example, suppose initialization is progressing down the 251 - tree while a CPU-offline operation is progressing up the tree. 252 - This situation can result in bits set at the top of the tree 253 - that have no counterparts at the bottom of the tree. 254 - Those bits will never be cleared, which will result in 255 - grace-period hangs. 256 - In short, that way lies madness, to say nothing of a great many 257 - bugs, hangs, and deadlocks. 258 - 259 - <p><font color="ffffff"> 260 - In contrast, the current multi-mask multi-counter scheme ensures 261 - that grace-period initialization will always see consistent masks 262 - up and down the tree, which brings significant simplifications 263 - over the single-mask method. 264 - 265 - <p><font color="ffffff"> 266 - This is an instance of 267 - <a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff"> 268 - deferring work in order to avoid synchronization</a>. 269 - Lazily recording CPU-hotplug events at the beginning of the next 270 - grace period greatly simplifies maintenance of the CPU-tracking 271 - bitmasks in the <tt>rcu_node</tt> tree. 
272 - </font></td></tr> 273 - <tr><td> </td></tr> 274 - </table> 275 - 276 - <h2><a name="Expedited Grace Period Refinements"> 277 - Expedited Grace Period Refinements</a></h2> 278 - 279 - <ol> 280 - <li> <a href="#Idle-CPU Checks">Idle-CPU checks</a>. 281 - <li> <a href="#Batching via Sequence Counter"> 282 - Batching via sequence counter</a>. 283 - <li> <a href="#Funnel Locking and Wait/Wakeup"> 284 - Funnel locking and wait/wakeup</a>. 285 - <li> <a href="#Use of Workqueues">Use of Workqueues</a>. 286 - <li> <a href="#Stall Warnings">Stall warnings</a>. 287 - <li> <a href="#Mid-Boot Operation">Mid-boot operation</a>. 288 - </ol> 289 - 290 - <h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3> 291 - 292 - <p> 293 - Each expedited grace period checks for idle CPUs when initially forming 294 - the mask of CPUs to be IPIed and again just before IPIing a CPU 295 - (both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>). 296 - If the CPU is idle at any time between those two times, the CPU will 297 - not be IPIed. 298 - Instead, the task pushing the grace period forward will include the 299 - idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>. 300 - 301 - <p> 302 - For RCU-sched, there is an additional check: 303 - If the IPI has interrupted the idle loop, then 304 - <tt>rcu_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt> 305 - to report the corresponding quiescent state. 306 - 307 - <p> 308 - For RCU-preempt, there is no specific check for idle in the 309 - IPI handler (<tt>rcu_exp_handler()</tt>), but because 310 - RCU read-side critical sections are not permitted within the 311 - idle loop, if <tt>rcu_exp_handler()</tt> sees that the CPU is within 312 - RCU read-side critical section, the CPU cannot possibly be idle. 
313 - Otherwise, <tt>rcu_exp_handler()</tt> invokes 314 - <tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent 315 - state, regardless of whether or not that quiescent state was due to 316 - the CPU being idle. 317 - 318 - <p> 319 - In summary, RCU expedited grace periods check for idle when building 320 - the bitmask of CPUs that must be IPIed, just before sending each IPI, 321 - and (either explicitly or implicitly) within the IPI handler. 322 - 323 - <h3><a name="Batching via Sequence Counter"> 324 - Batching via Sequence Counter</a></h3> 325 - 326 - <p> 327 - If each grace-period request was carried out separately, expedited 328 - grace periods would have abysmal scalability and 329 - problematic high-load characteristics. 330 - Because each grace-period operation can serve an unlimited number of 331 - updates, it is important to <i>batch</i> requests, so that a single 332 - expedited grace-period operation will cover all requests in the 333 - corresponding batch. 334 - 335 - <p> 336 - This batching is controlled by a sequence counter named 337 - <tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure. 338 - This counter has an odd value when there is an expedited grace period 339 - in progress and an even value otherwise, so that dividing the counter 340 - value by two gives the number of completed grace periods. 341 - During any given update request, the counter must transition from 342 - even to odd and then back to even, thus indicating that a grace 343 - period has elapsed. 344 - Therefore, if the initial value of the counter is <tt>s</tt>, 345 - the updater must wait until the counter reaches at least the 346 - value <tt>(s+3)&~0x1</tt>. 347 - This counter is managed by the following access functions: 348 - 349 - <ol> 350 - <li> <tt>rcu_exp_gp_seq_start()</tt>, which marks the start of 351 - an expedited grace period. 352 - <li> <tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an 353 - expedited grace period. 
354 - <li> <tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of 355 - the counter. 356 - <li> <tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt> 357 - if a full expedited grace period has elapsed since the 358 - corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>. 359 - </ol> 360 - 361 - <p> 362 - Again, only one request in a given batch need actually carry out 363 - a grace-period operation, which means there must be an efficient 364 - way to identify which of many concurrent requests will initiate 365 - the grace period, and also an efficient way for the 366 - remaining requests to wait for that grace period to complete. 367 - However, that is the topic of the next section. 368 - 369 - <h3><a name="Funnel Locking and Wait/Wakeup"> 370 - Funnel Locking and Wait/Wakeup</a></h3> 371 - 372 - <p> 373 - The natural way to sort out which of a batch of updaters will initiate 374 - the expedited grace period is to use the <tt>rcu_node</tt> combining 375 - tree, as implemented by the <tt>exp_funnel_lock()</tt> function. 376 - The first updater corresponding to a given grace period arriving 377 - at a given <tt>rcu_node</tt> structure records its desired grace-period 378 - sequence number in the <tt>->exp_seq_rq</tt> field and moves up 379 - to the next level in the tree. 380 - Otherwise, if the <tt>->exp_seq_rq</tt> field already contains 381 - the sequence number for the desired grace period or some later one, 382 - the updater blocks on one of four wait queues in the 383 - <tt>->exp_wq[]</tt> array, using the second-from-bottom 384 - and third-from-bottom bits as an index. 385 - An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure 386 - synchronizes access to these fields. 387 - 388 - <p> 389 - An empty <tt>rcu_node</tt> tree is shown in the following diagram, 390 - with the white cells representing the <tt>->exp_seq_rq</tt> field 391 - and the red cells representing the elements of the 392 - <tt>->exp_wq[]</tt> array. 
393 - 394 - <p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%"> 395 - 396 - <p> 397 - The next diagram shows the situation after the arrival of Task A 398 - and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt> 399 - structures, respectively. 400 - The current value of the <tt>rcu_state</tt> structure's 401 - <tt>->expedited_sequence</tt> field is zero, so adding three and 402 - clearing the bottom bit results in the value two, which both tasks 403 - record in the <tt>->exp_seq_rq</tt> field of their respective 404 - <tt>rcu_node</tt> structures: 405 - 406 - <p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%"> 407 - 408 - <p> 409 - Each of Tasks A and B will move up to the root 410 - <tt>rcu_node</tt> structure. 411 - Suppose that Task A wins, recording its desired grace-period sequence 412 - number and resulting in the state shown below: 413 - 414 - <p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%"> 415 - 416 - <p> 417 - Task A now advances to initiate a new grace period, while Task B 418 - moves up to the root <tt>rcu_node</tt> structure, and, seeing that 419 - its desired sequence number is already recorded, blocks on 420 - <tt>->exp_wq[1]</tt>. 421 - 422 - <table> 423 - <tr><th> </th></tr> 424 - <tr><th align="left">Quick Quiz:</th></tr> 425 - <tr><td> 426 - Why <tt>->exp_wq[1]</tt>? 427 - Given that the value of these tasks' desired sequence number is 428 - two, so shouldn't they instead block on <tt>->exp_wq[2]</tt>? 429 - </td></tr> 430 - <tr><th align="left">Answer:</th></tr> 431 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 432 - No. 433 - 434 - <p><font color="ffffff"> 435 - Recall that the bottom bit of the desired sequence number indicates 436 - whether or not a grace period is currently in progress. 437 - It is therefore necessary to shift the sequence number right one 438 - bit position to obtain the number of the grace period. 439 - This results in <tt>->exp_wq[1]</tt>. 
440 - </font></td></tr> 441 - <tr><td> </td></tr> 442 - </table> 443 - 444 - <p> 445 - If Tasks C and D also arrive at this point, they will compute the 446 - same desired grace-period sequence number, and see that both leaf 447 - <tt>rcu_node</tt> structures already have that value recorded. 448 - They will therefore block on their respective <tt>rcu_node</tt> 449 - structures' <tt>->exp_wq[1]</tt> fields, as shown below: 450 - 451 - <p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%"> 452 - 453 - <p> 454 - Task A now acquires the <tt>rcu_state</tt> structure's 455 - <tt>->exp_mutex</tt> and initiates the grace period, which 456 - increments <tt>->expedited_sequence</tt>. 457 - Therefore, if Tasks E and F arrive, they will compute 458 - a desired sequence number of 4 and will record this value as 459 - shown below: 460 - 461 - <p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%"> 462 - 463 - <p> 464 - Tasks E and F will propagate up the <tt>rcu_node</tt> 465 - combining tree, with Task F blocking on the root <tt>rcu_node</tt> 466 - structure and Task E waiting for Task A to finish so that 467 - it can start the next grace period. 468 - The resulting state is as shown below: 469 - 470 - <p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%"> 471 - 472 - <p> 473 - Once the grace period completes, Task A 474 - starts waking up the tasks waiting for this grace period to complete, 475 - increments the <tt>->expedited_sequence</tt>, 476 - acquires the <tt>->exp_wake_mutex</tt> and then releases the 477 - <tt>->exp_mutex</tt>. 478 - This results in the following state: 479 - 480 - <p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%"> 481 - 482 - <p> 483 - Task E can then acquire <tt>->exp_mutex</tt> and increment 484 - <tt>->expedited_sequence</tt> to the value three. 
485 - If new tasks G and H arrive and move up the combining tree at the 486 - same time, the state will be as follows: 487 - 488 - <p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%"> 489 - 490 - <p> 491 - Note that three of the root <tt>rcu_node</tt> structure's 492 - waitqueues are now occupied. 493 - However, at some point, Task A will wake up the 494 - tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting 495 - in the following state: 496 - 497 - <p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%"> 498 - 499 - <p> 500 - Execution will continue with Tasks E and H completing 501 - their grace periods and carrying out their wakeups. 502 - 503 - <table> 504 - <tr><th> </th></tr> 505 - <tr><th align="left">Quick Quiz:</th></tr> 506 - <tr><td> 507 - What happens if Task A takes so long to do its wakeups 508 - that Task E's grace period completes? 509 - </td></tr> 510 - <tr><th align="left">Answer:</th></tr> 511 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 512 - Then Task E will block on the <tt>->exp_wake_mutex</tt>, 513 - which will also prevent it from releasing <tt>->exp_mutex</tt>, 514 - which in turn will prevent the next grace period from starting. 515 - This last is important in preventing overflow of the 516 - <tt>->exp_wq[]</tt> array. 517 - </font></td></tr> 518 - <tr><td> </td></tr> 519 - </table> 520 - 521 - <h3><a name="Use of Workqueues">Use of Workqueues</a></h3> 522 - 523 - <p> 524 - In earlier implementations, the task requesting the expedited 525 - grace period also drove it to completion. 526 - This straightforward approach had the disadvantage of needing to 527 - account for POSIX signals sent to user tasks, 528 - so more recent implementations use the Linux kernel's 529 - <a href="https://www.kernel.org/doc/Documentation/core-api/workqueue.rst">workqueues</a>. 
530 - 531 - <p> 532 - The requesting task still does counter snapshotting and funnel-lock 533 - processing, but the task reaching the top of the funnel lock 534 - does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>) 535 - so that a workqueue kthread does the actual grace-period processing. 536 - Because workqueue kthreads do not accept POSIX signals, grace-period-wait 537 - processing need not allow for POSIX signals. 538 - 539 - In addition, this approach allows wakeups for the previous expedited 540 - grace period to be overlapped with processing for the next expedited 541 - grace period. 542 - Because there are only four sets of waitqueues, it is necessary to 543 - ensure that the previous grace period's wakeups complete before the 544 - next grace period's wakeups start. 545 - This is handled by having the <tt>->exp_mutex</tt> 546 - guard expedited grace-period processing and the 547 - <tt>->exp_wake_mutex</tt> guard wakeups. 548 - The key point is that the <tt>->exp_mutex</tt> is not released 549 - until the first wakeup is complete, which means that the 550 - <tt>->exp_wake_mutex</tt> has already been acquired at that point. 551 - This approach ensures that the previous grace period's wakeups can 552 - be carried out while the current grace period is in progress, but 553 - that these wakeups will complete before the next grace period starts. 554 - This means that only three waitqueues are required, guaranteeing that 555 - the four that are provided are sufficient. 556 - 557 - <h3><a name="Stall Warnings">Stall Warnings</a></h3> 558 - 559 - <p> 560 - Expediting grace periods does nothing to speed things up when RCU 561 - readers take too long, and therefore expedited grace periods check 562 - for stalls just as normal grace periods do. 
563 - 564 - <table> 565 - <tr><th> </th></tr> 566 - <tr><th align="left">Quick Quiz:</th></tr> 567 - <tr><td> 568 - But why not just let the normal grace-period machinery 569 - detect the stalls, given that a given reader must block 570 - both normal and expedited grace periods? 571 - </td></tr> 572 - <tr><th align="left">Answer:</th></tr> 573 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 574 - Because it is quite possible that at a given time there 575 - is no normal grace period in progress, in which case the 576 - normal grace period cannot emit a stall warning. 577 - </font></td></tr> 578 - <tr><td> </td></tr> 579 - </table> 580 - 581 - The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting 582 - for the expedited grace period to end, but with a timeout set to the 583 - current RCU CPU stall-warning time. 584 - If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures 585 - blocking the current grace period are printed. 586 - Each stall warning results in another pass through the loop, but the 587 - second and subsequent passes use longer stall times. 588 - 589 - <h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3> 590 - 591 - <p> 592 - The use of workqueues has the advantage that the expedited 593 - grace-period code need not worry about POSIX signals. 594 - Unfortunately, it has the 595 - corresponding disadvantage that workqueues cannot be used until 596 - they are initialized, which does not happen until some time after 597 - the scheduler spawns the first task. 598 - Given that there are parts of the kernel that really do want to 599 - execute grace periods during this mid-boot “dead zone”, 600 - expedited grace periods must do something else during this time. 601 - 602 - <p> 603 - What they do is to fall back to the old practice of requiring that the 604 - requesting task drive the expedited grace period, as was the case 605 - before the use of workqueues. 
606 - However, the requesting task is only required to drive the grace period 607 - during the mid-boot dead zone. 608 - Before mid-boot, a synchronous grace period is a no-op. 609 - Some time after mid-boot, workqueues are used. 610 - 611 - <p> 612 - Non-expedited non-SRCU synchronous grace periods must also operate 613 - normally during mid-boot. 614 - This is handled by causing non-expedited grace periods to take the 615 - expedited code path during mid-boot. 616 - 617 - <p> 618 - The current code assumes that there are no POSIX signals during 619 - the mid-boot dead zone. 620 - However, if an overwhelming need for POSIX signals somehow arises, 621 - appropriate adjustments can be made to the expedited stall-warning code. 622 - One such adjustment would reinstate the pre-workqueue stall-warning 623 - checks, but only during the mid-boot dead zone. 624 - 625 - <p> 626 - With this refinement, synchronous grace periods can now be used from 627 - task context pretty much any time during the life of the kernel. 628 - That is, aside from some points in the suspend, hibernate, or shutdown 629 - code path. 630 - 631 - <h3><a name="Summary"> 632 - Summary</a></h3> 633 - 634 - <p> 635 - Expedited grace periods use a sequence-number approach to promote 636 - batching, so that a single grace-period operation can serve numerous 637 - requests. 638 - A funnel lock is used to efficiently identify the one task out of 639 - a concurrent group that will request the grace period. 640 - All members of the group will block on waitqueues provided in 641 - the <tt>rcu_node</tt> structure. 642 - The actual grace-period processing is carried out by a workqueue. 643 - 644 - <p> 645 - CPU-hotplug operations are noted lazily in order to prevent the need 646 - for tight synchronization between expedited grace periods and 647 - CPU-hotplug operations. 648 - The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, 649 - at least in the common case. 
650 - RCU-preempt and RCU-sched use different IPI handlers and different 651 - code to respond to the state changes carried out by those handlers, 652 - but otherwise use common code. 653 - 654 - <p> 655 - Quiescent states are tracked using the <tt>rcu_node</tt> tree, 656 - and once all necessary quiescent states have been reported, 657 - all tasks waiting on this expedited grace period are awakened. 658 - A pair of mutexes are used to allow one grace period's wakeups 659 - to proceed concurrently with the next grace period's processing. 660 - 661 - <p> 662 - This combination of mechanisms allows expedited grace periods to 663 - run reasonably efficiently. 664 - However, for non-time-critical tasks, normal grace periods should be 665 - used instead because their longer duration permits much higher 666 - degrees of batching, and thus much lower per-request overheads. 667 - 668 - </body></html>
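The counter arithmetic described under "Batching via Sequence Counter" and "Funnel Locking and Wait/Wakeup" above can be sketched in user-space C. This is only a minimal illustration, assuming the one-bit odd/even encoding that the text describes. The `exp_seq_*()` and `exp_wq_index()` names are hypothetical stand-ins rather than the kernel's `rcu_exp_gp_seq_*()` functions, and no locking or memory ordering is modeled.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model of ->expedited_sequence; not the kernel's code. */
struct exp_seq {
	unsigned long seq;	/* odd: grace period in progress; even: none */
};

/* Mark the start of an expedited grace period: even -> odd. */
static void exp_seq_start(struct exp_seq *es)
{
	assert(!(es->seq & 0x1));
	es->seq++;
}

/* Mark the end of an expedited grace period: odd -> even. */
static void exp_seq_end(struct exp_seq *es)
{
	assert(es->seq & 0x1);
	es->seq++;
}

/*
 * Snapshot: the even value the counter must reach before a full grace
 * period is known to have elapsed, that is, (s+3)&~0x1 from the text.
 */
static unsigned long exp_seq_snap(const struct exp_seq *es)
{
	return (es->seq + 3) & ~0x1UL;
}

/* Has the counter reached snapshot s?  The comparison tolerates wrap. */
static bool exp_seq_done(const struct exp_seq *es, unsigned long s)
{
	return ULONG_MAX / 2 >= es->seq - s;
}

/*
 * Wait-queue index: the second-from-bottom and third-from-bottom bits
 * of the desired sequence number select one of four queues.
 */
static unsigned int exp_wq_index(unsigned long s)
{
	return (s >> 1) & 0x3;
}
```

Under these assumptions, a counter value of zero gives a snapshot of 2 and a wait-queue index of 1, matching the Task A and Task B walkthrough. Once a grace period is in progress (counter value 1), a new snapshot is 4, as computed by Tasks E and F, which maps to index 2.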

+521

Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst

··· 1 + ================================================= 2 + A Tour Through TREE_RCU's Expedited Grace Periods 3 + ================================================= 4 + 5 + Introduction 6 + ============ 7 + 8 + This document describes RCU's expedited grace periods. 9 + Unlike RCU's normal grace periods, which accept long latencies to attain 10 + high efficiency and minimal disturbance, expedited grace periods accept 11 + lower efficiency and significant disturbance to attain shorter latencies. 12 + 13 + There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier 14 + third RCU-bh flavor having been implemented in terms of the other two. 15 + Each of the two implementations is covered in its own section. 16 + 17 + Expedited Grace Period Design 18 + ============================= 19 + 20 + The expedited RCU grace periods cannot be accused of being subtle, 21 + given that they for all intents and purposes hammer every CPU that 22 + has not yet provided a quiescent state for the current expedited 23 + grace period. 24 + The one saving grace is that the hammer has grown a bit smaller 25 + over time: The old call to ``try_stop_cpus()`` has been 26 + replaced with a set of calls to ``smp_call_function_single()``, 27 + each of which results in an IPI to the target CPU. 28 + The corresponding handler function checks the CPU's state, motivating 29 + a faster quiescent state where possible, and triggering a report 30 + of that quiescent state. 31 + As always for RCU, once everything has spent some time in a quiescent 32 + state, the expedited grace period has completed. 33 + 34 + The details of the ``smp_call_function_single()`` handler's 35 + operation depend on the RCU flavor, as described in the following 36 + sections. 37 + 38 + RCU-preempt Expedited Grace Periods 39 + =================================== 40 + 41 + ``CONFIG_PREEMPT=y`` kernels implement RCU-preempt. 
42 + The overall flow of the handling of a given CPU by an RCU-preempt 43 + expedited grace period is shown in the following diagram: 44 + 45 + .. kernel-figure:: ExpRCUFlow.svg 46 + 47 + The solid arrows denote direct action, for example, a function call. 48 + The dotted arrows denote indirect action, for example, an IPI 49 + or a state that is reached after some time. 50 + 51 + If a given CPU is offline or idle, ``synchronize_rcu_expedited()`` 52 + will ignore it because idle and offline CPUs are already residing 53 + in quiescent states. 54 + Otherwise, the expedited grace period will use 55 + ``smp_call_function_single()`` to send the CPU an IPI, which 56 + is handled by ``rcu_exp_handler()``. 57 + 58 + However, because this is preemptible RCU, ``rcu_exp_handler()`` 59 + can check to see if the CPU is currently running in an RCU read-side 60 + critical section. 61 + If not, the handler can immediately report a quiescent state. 62 + Otherwise, it sets flags so that the outermost ``rcu_read_unlock()`` 63 + invocation will provide the needed quiescent-state report. 64 + This flag-setting avoids the previous forced preemption of all 65 + CPUs that might have RCU read-side critical sections. 66 + In addition, this flag-setting is done so as to avoid increasing 67 + the overhead of the common-case fastpath through the scheduler. 68 + 69 + Again because this is preemptible RCU, an RCU read-side critical section 70 + can be preempted. 71 + When that happens, RCU will enqueue the task, which will then continue to 72 + block the current expedited grace period until it resumes and finds its 73 + outermost ``rcu_read_unlock()``. 74 + The CPU will report a quiescent state just after enqueuing the task because 75 + the CPU is no longer blocking the grace period. 76 + It is instead the preempted task doing the blocking. 
77 + The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``, 78 + which is called from ``rcu_preempt_note_context_switch()``, which 79 + in turn is called from ``rcu_note_context_switch()``, which in 80 + turn is called from the scheduler. 81 + 82 + 83 + +-----------------------------------------------------------------------+ 84 + | **Quick Quiz**: | 85 + +-----------------------------------------------------------------------+ 86 + | Why not just have the expedited grace period check the state of all | 87 + | the CPUs? After all, that would avoid all those real-time-unfriendly | 88 + | IPIs. | 89 + +-----------------------------------------------------------------------+ 90 + | **Answer**: | 91 + +-----------------------------------------------------------------------+ 92 + | Because we want the RCU read-side critical sections to run fast, | 93 + | which means no memory barriers. Therefore, it is not possible to | 94 + | safely check the state from some other CPU. And even if it was | 95 + | possible to safely check the state, it would still be necessary to | 96 + | IPI the CPU to safely interact with the upcoming | 97 + | ``rcu_read_unlock()`` invocation, which means that the remote state | 98 + | testing would not help the worst-case latency that real-time | 99 + | applications care about. | 100 + | | 101 + | One way to prevent your real-time application from getting hit with | 102 + | these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU | 103 + | would then perceive the CPU running your application as being idle, | 104 + | and it would be able to safely detect that state without needing to | 105 + | IPI the CPU. | 106 + +-----------------------------------------------------------------------+ 107 + 108 + Please note that this is just the overall flow: Additional complications 109 + can arise due to races with CPUs going idle or offline, among other 110 + things. 
111 + 112 + RCU-sched Expedited Grace Periods 113 + --------------------------------- 114 + 115 + ``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of 116 + the handling of a given CPU by an RCU-sched expedited grace period is 117 + shown in the following diagram: 118 + 119 + .. kernel-figure:: ExpSchedFlow.svg 120 + 121 + As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores 122 + offline and idle CPUs, again because they are in remotely detectable 123 + quiescent states. However, because the ``rcu_read_lock_sched()`` and 124 + ``rcu_read_unlock_sched()`` leave no trace of their invocation, in 125 + general it is not possible to tell whether or not the current CPU is in 126 + an RCU read-side critical section. The best that RCU-sched's 127 + ``rcu_exp_handler()`` can do is to check for idle, on the off-chance 128 + that the CPU went idle while the IPI was in flight. If the CPU is idle, 129 + then ``rcu_exp_handler()`` reports the quiescent state. 130 + 131 + Otherwise, the handler forces a future context switch by setting the 132 + NEED_RESCHED flag of the current task's thread flag and the CPU preempt 133 + counter. At the time of the context switch, the CPU reports the 134 + quiescent state. Should the CPU go offline first, it will report the 135 + quiescent state at that time. 136 + 137 + Expedited Grace Period and CPU Hotplug 138 + -------------------------------------- 139 + 140 + The expedited nature of expedited grace periods require a much tighter 141 + interaction with CPU hotplug operations than is required for normal 142 + grace periods. In addition, attempting to IPI offline CPUs will result 143 + in splats, but failing to IPI online CPUs can result in too-short grace 144 + periods. Neither option is acceptable in production kernels. 145 + 146 + The interaction between expedited grace periods and CPU hotplug 147 + operations is carried out at several levels: 148 + 149 + #. 
The number of CPUs that have ever been online is tracked by the 150 + ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state`` 151 + structure's ``->ncpus_snap`` field tracks the number of CPUs that 152 + have ever been online at the beginning of an RCU expedited grace 153 + period. Note that this number never decreases, at least in the 154 + absence of a time machine. 155 + #. The identities of the CPUs that have ever been online is tracked by 156 + the ``rcu_node`` structure's ``->expmaskinitnext`` field. The 157 + ``rcu_node`` structure's ``->expmaskinit`` field tracks the 158 + identities of the CPUs that were online at least once at the 159 + beginning of the most recent RCU expedited grace period. The 160 + ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are 161 + used to detect when new CPUs have come online for the first time, 162 + that is, when the ``rcu_node`` structure's ``->expmaskinitnext`` 163 + field has changed since the beginning of the last RCU expedited grace 164 + period, which triggers an update of each ``rcu_node`` structure's 165 + ``->expmaskinit`` field from its ``->expmaskinitnext`` field. 166 + #. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to 167 + initialize that structure's ``->expmask`` at the beginning of each 168 + RCU expedited grace period. This means that only those CPUs that have 169 + been online at least once will be considered for a given grace 170 + period. 171 + #. Any CPU that goes offline will clear its bit in its leaf ``rcu_node`` 172 + structure's ``->qsmaskinitnext`` field, so any CPU with that bit 173 + clear can safely be ignored. However, it is possible for a CPU coming 174 + online or going offline to have this bit set for some time while 175 + ``cpu_online`` returns ``false``. 176 + #. For each non-idle CPU that RCU believes is currently online, the 177 + grace period invokes ``smp_call_function_single()``. If this 178 + succeeds, the CPU was fully online. 
Failure indicates that the CPU is 179 + in the process of coming online or going offline, in which case it is 180 + necessary to wait for a short time period and try again. The purpose 181 + of this wait (or series of waits, as the case may be) is to permit a 182 + concurrent CPU-hotplug operation to complete. 183 + #. In the case of RCU-sched, one of the last acts of an outgoing CPU is 184 + to invoke ``rcu_report_dead()``, which reports a quiescent state for 185 + that CPU. However, this is likely paranoia-induced redundancy. 186 + 187 + +-----------------------------------------------------------------------+ 188 + | **Quick Quiz**: | 189 + +-----------------------------------------------------------------------+ 190 + | Why all the dancing around with multiple counters and masks tracking | 191 + | CPUs that were once online? Why not just have a single set of masks | 192 + | tracking the currently online CPUs and be done with it? | 193 + +-----------------------------------------------------------------------+ 194 + | **Answer**: | 195 + +-----------------------------------------------------------------------+ 196 + | Maintaining single set of masks tracking the online CPUs *sounds* | 197 + | easier, at least until you try working out all the race conditions | 198 + | between grace-period initialization and CPU-hotplug operations. For | 199 + | example, suppose initialization is progressing down the tree while a | 200 + | CPU-offline operation is progressing up the tree. This situation can | 201 + | result in bits set at the top of the tree that have no counterparts | 202 + | at the bottom of the tree. Those bits will never be cleared, which | 203 + | will result in grace-period hangs. In short, that way lies madness, | 204 + | to say nothing of a great many bugs, hangs, and deadlocks. 
| 205 + | In contrast, the current multi-mask multi-counter scheme ensures that | 206 + | grace-period initialization will always see consistent masks up and | 207 + | down the tree, which brings significant simplifications over the | 208 + | single-mask method. | 209 + | | 210 + | This is an instance of `deferring work in order to avoid | 211 + | synchronization <http://www.cs.columbia.edu/~library/TR-repository/re | 212 + | ports/reports-1992/cucs-039-92.ps.gz>`__. | 213 + | Lazily recording CPU-hotplug events at the beginning of the next | 214 + | grace period greatly simplifies maintenance of the CPU-tracking | 215 + | bitmasks in the ``rcu_node`` tree. | 216 + +-----------------------------------------------------------------------+ 217 + 218 + Expedited Grace Period Refinements 219 + ---------------------------------- 220 + 221 + Idle-CPU Checks 222 + ~~~~~~~~~~~~~~~ 223 + 224 + Each expedited grace period checks for idle CPUs when initially forming 225 + the mask of CPUs to be IPIed and again just before IPIing a CPU (both 226 + checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is 227 + idle at any time between those two times, the CPU will not be IPIed. 228 + Instead, the task pushing the grace period forward will include the idle 229 + CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``. 230 + 231 + For RCU-sched, there is an additional check: If the IPI has interrupted 232 + the idle loop, then ``rcu_exp_handler()`` invokes 233 + ``rcu_report_exp_rdp()`` to report the corresponding quiescent state. 234 + 235 + For RCU-preempt, there is no specific check for idle in the IPI handler 236 + (``rcu_exp_handler()``), but because RCU read-side critical sections are 237 + not permitted within the idle loop, if ``rcu_exp_handler()`` sees that 238 + the CPU is within an RCU read-side critical section, the CPU cannot 239 + possibly be idle.
Otherwise, ``rcu_exp_handler()`` invokes 240 + ``rcu_report_exp_rdp()`` to report the corresponding quiescent state, 241 + regardless of whether or not that quiescent state was due to the CPU 242 + being idle. 243 + 244 + In summary, RCU expedited grace periods check for idle when building the 245 + bitmask of CPUs that must be IPIed, just before sending each IPI, and 246 + (either explicitly or implicitly) within the IPI handler. 247 + 248 + Batching via Sequence Counter 249 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 250 + 251 + If each grace-period request was carried out separately, expedited grace 252 + periods would have abysmal scalability and problematic high-load 253 + characteristics. Because each grace-period operation can serve an 254 + unlimited number of updates, it is important to *batch* requests, so 255 + that a single expedited grace-period operation will cover all requests 256 + in the corresponding batch. 257 + 258 + This batching is controlled by a sequence counter named 259 + ``->expedited_sequence`` in the ``rcu_state`` structure. This counter 260 + has an odd value when there is an expedited grace period in progress and 261 + an even value otherwise, so that dividing the counter value by two gives 262 + the number of completed grace periods. During any given update request, 263 + the counter must transition from even to odd and then back to even, thus 264 + indicating that a grace period has elapsed. Therefore, if the initial 265 + value of the counter is ``s``, the updater must wait until the counter 266 + reaches at least the value ``(s+3)&~0x1``. This counter is managed by 267 + the following access functions: 268 + 269 + #. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited 270 + grace period. 271 + #. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace 272 + period. 273 + #. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter. 274 + #. 
``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited 275 + grace period has elapsed since the corresponding call to 276 + ``rcu_exp_gp_seq_snap()``. 277 + 278 + Again, only one request in a given batch need actually carry out a 279 + grace-period operation, which means there must be an efficient way to 280 + identify which of many concurrent requests will initiate the grace 281 + period, and that there be an efficient way for the remaining requests to 282 + wait for that grace period to complete. However, that is the topic of 283 + the next section. 284 + 285 + Funnel Locking and Wait/Wakeup 286 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 287 + 288 + The natural way to sort out which of a batch of updaters will initiate 289 + the expedited grace period is to use the ``rcu_node`` combining tree, as 290 + implemented by the ``exp_funnel_lock()`` function. The first updater 291 + corresponding to a given grace period arriving at a given ``rcu_node`` 292 + structure records its desired grace-period sequence number in the 293 + ``->exp_seq_rq`` field and moves up to the next level in the tree. 294 + Otherwise, if the ``->exp_seq_rq`` field already contains the sequence 295 + number for the desired grace period or some later one, the updater 296 + blocks on one of four wait queues in the ``->exp_wq[]`` array, using the 297 + second-from-bottom and third-from-bottom bits as an index. An 298 + ``->exp_lock`` field in the ``rcu_node`` structure synchronizes access 299 + to these fields. 300 + 301 + An empty ``rcu_node`` tree is shown in the following diagram, with the 302 + white cells representing the ``->exp_seq_rq`` field and the red cells 303 + representing the elements of the ``->exp_wq[]`` array. 304 + 305 + .. kernel-figure:: Funnel0.svg 306 + 307 + The next diagram shows the situation after the arrival of Task A and 308 + Task B at the leftmost and rightmost leaf ``rcu_node`` structures, 309 + respectively.
The current value of the ``rcu_state`` structure's 310 + ``->expedited_sequence`` field is zero, so adding three and clearing the 311 + bottom bit results in the value two, which both tasks record in the 312 + ``->exp_seq_rq`` field of their respective ``rcu_node`` structures: 313 + 314 + .. kernel-figure:: Funnel1.svg 315 + 316 + Each of Tasks A and B will move up to the root ``rcu_node`` structure. 317 + Suppose that Task A wins, recording its desired grace-period sequence 318 + number and resulting in the state shown below: 319 + 320 + .. kernel-figure:: Funnel2.svg 321 + 322 + Task A now advances to initiate a new grace period, while Task B moves 323 + up to the root ``rcu_node`` structure, and, seeing that its desired 324 + sequence number is already recorded, blocks on ``->exp_wq[1]``. 325 + 326 + +-----------------------------------------------------------------------+ 327 + | **Quick Quiz**: | 328 + +-----------------------------------------------------------------------+ 329 + | Why ``->exp_wq[1]``? Given that the value of these tasks' desired | 330 + | sequence number is two, shouldn't they instead block on | 331 + | ``->exp_wq[2]``? | 332 + +-----------------------------------------------------------------------+ 333 + | **Answer**: | 334 + +-----------------------------------------------------------------------+ 335 + | No. | 336 + | Recall that the bottom bit of the desired sequence number indicates | 337 + | whether or not a grace period is currently in progress. It is | 338 + | therefore necessary to shift the sequence number right one bit | 339 + | position to obtain the number of the grace period. This results in | 340 + | ``->exp_wq[1]``. | 341 + +-----------------------------------------------------------------------+ 342 + 343 + If Tasks C and D also arrive at this point, they will compute the same 344 + desired grace-period sequence number, and see that both leaf 345 + ``rcu_node`` structures already have that value recorded.
They will 346 + therefore block on their respective ``rcu_node`` structures' 347 + ``->exp_wq[1]`` fields, as shown below: 348 + 349 + .. kernel-figure:: Funnel3.svg 350 + 351 + Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and 352 + initiates the grace period, which increments ``->expedited_sequence``. 353 + Therefore, if Tasks E and F arrive, they will compute a desired sequence 354 + number of 4 and will record this value as shown below: 355 + 356 + .. kernel-figure:: Funnel4.svg 357 + 358 + Tasks E and F will propagate up the ``rcu_node`` combining tree, with 359 + Task F blocking on the root ``rcu_node`` structure and Task E waiting for 360 + Task A to finish so that it can start the next grace period. The 361 + resulting state is as shown below: 362 + 363 + .. kernel-figure:: Funnel5.svg 364 + 365 + Once the grace period completes, Task A starts waking up the tasks 366 + waiting for this grace period to complete, increments the 367 + ``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then 368 + releases the ``->exp_mutex``. This results in the following state: 369 + 370 + .. kernel-figure:: Funnel6.svg 371 + 372 + Task E can then acquire ``->exp_mutex`` and increment 373 + ``->expedited_sequence`` to the value three. If new tasks G and H arrive 374 + and move up the combining tree at the same time, the state will be as 375 + follows: 376 + 377 + .. kernel-figure:: Funnel7.svg 378 + 379 + Note that three of the root ``rcu_node`` structure's waitqueues are now 380 + occupied. However, at some point, Task A will wake up the tasks blocked 381 + on the ``->exp_wq`` waitqueues, resulting in the following state: 382 + 383 + .. kernel-figure:: Funnel8.svg 384 + 385 + Execution will continue with Tasks E and H completing their grace 386 + periods and carrying out their wakeups.
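The sequence-number arithmetic used in the walkthrough above can be modeled in a few lines of userspace C. This is only a sketch of the scheme as the text describes it (bottom bit means "grace period in progress", the next two bits select the waitqueue); the function names ``exp_seq_snap()``, ``exp_seq_done()``, and ``exp_wq_index()`` are hypothetical stand-ins, not the kernel's actual helpers, and the wrap-safe comparison is an assumption about how such a check would typically be written.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the counter described in the text.
 * Given the current counter value s, return the value the counter must
 * reach before a full expedited grace period has elapsed: add three and
 * clear the bottom ("in progress") bit.
 */
static unsigned long exp_seq_snap(unsigned long s)
{
	return (s + 3) & ~0x1UL;
}

/*
 * Return true once the current counter value has reached the snapshot.
 * The unsigned subtraction keeps the comparison valid across counter
 * wraparound (assumed here, as is common for sequence counters).
 */
static bool exp_seq_done(unsigned long cur, unsigned long snap)
{
	return cur - snap <= ULONG_MAX / 2;
}

/*
 * Select one of four waitqueues using the second-from-bottom and
 * third-from-bottom bits of the desired sequence number.
 */
static unsigned int exp_wq_index(unsigned long snap)
{
	return (unsigned int)((snap >> 1) & 0x3);
}
```

With the counter at zero, as for Tasks A and B, the snapshot is two and the waitqueue index is one. With the counter at one (a grace period in progress), as for Tasks E and F, the snapshot is four and the index is two.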
387 + 388 + +-----------------------------------------------------------------------+ 389 + | **Quick Quiz**: | 390 + +-----------------------------------------------------------------------+ 391 + | What happens if Task A takes so long to do its wakeups that Task E's | 392 + | grace period completes? | 393 + +-----------------------------------------------------------------------+ 394 + | **Answer**: | 395 + +-----------------------------------------------------------------------+ 396 + | Then Task E will block on the ``->exp_wake_mutex``, which will also | 397 + | prevent it from releasing ``->exp_mutex``, which in turn will prevent | 398 + | the next grace period from starting. This last is important in | 399 + | preventing overflow of the ``->exp_wq[]`` array. | 400 + +-----------------------------------------------------------------------+ 401 + 402 + Use of Workqueues 403 + ~~~~~~~~~~~~~~~~~ 404 + 405 + In earlier implementations, the task requesting the expedited grace 406 + period also drove it to completion. This straightforward approach had 407 + the disadvantage of needing to account for POSIX signals sent to user 408 + tasks, so more recent implementations use the Linux kernel's 409 + `workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__. 410 + 411 + The requesting task still does counter snapshotting and funnel-lock 412 + processing, but the task reaching the top of the funnel lock does a 413 + ``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a 414 + workqueue kthread does the actual grace-period processing. Because 415 + workqueue kthreads do not accept POSIX signals, grace-period-wait 416 + processing need not allow for POSIX signals. In addition, this approach 417 + allows wakeups for the previous expedited grace period to be overlapped 418 + with processing for the next expedited grace period.
Because there are 419 + only four sets of waitqueues, it is necessary to ensure that the 420 + previous grace period's wakeups complete before the next grace period's 421 + wakeups start. This is handled by having the ``->exp_mutex`` guard 422 + expedited grace-period processing and the ``->exp_wake_mutex`` guard 423 + wakeups. The key point is that the ``->exp_mutex`` is not released until 424 + the first wakeup is complete, which means that the ``->exp_wake_mutex`` 425 + has already been acquired at that point. This approach ensures that the 426 + previous grace period's wakeups can be carried out while the current 427 + grace period is in process, but that these wakeups will complete before 428 + the next grace period starts. This means that only three waitqueues are 429 + required, guaranteeing that the four that are provided are sufficient. 430 + 431 + Stall Warnings 432 + ~~~~~~~~~~~~~~ 433 + 434 + Expediting grace periods does nothing to speed things up when RCU 435 + readers take too long, and therefore expedited grace periods check for 436 + stalls just as normal grace periods do. 437 + 438 + +-----------------------------------------------------------------------+ 439 + | **Quick Quiz**: | 440 + +-----------------------------------------------------------------------+ 441 + | But why not just let the normal grace-period machinery detect the | 442 + | stalls, given that a given reader must block both normal and | 443 + | expedited grace periods? | 444 + +-----------------------------------------------------------------------+ 445 + | **Answer**: | 446 + +-----------------------------------------------------------------------+ 447 + | Because it is quite possible that at a given time there is no normal | 448 + | grace period in progress, in which case the normal grace period | 449 + | cannot emit a stall warning. 
| 450 + +-----------------------------------------------------------------------+ 451 + 452 + The ``synchronize_sched_expedited_wait()`` function loops waiting for 453 + the expedited grace period to end, but with a timeout set to the current 454 + RCU CPU stall-warning time. If this time is exceeded, any CPUs or 455 + ``rcu_node`` structures blocking the current grace period are printed. 456 + Each stall warning results in another pass through the loop, but the 457 + second and subsequent passes use longer stall times. 458 + 459 + Mid-boot operation 460 + ~~~~~~~~~~~~~~~~~~ 461 + 462 + The use of workqueues has the advantage that the expedited grace-period 463 + code need not worry about POSIX signals. Unfortunately, it has the 464 + corresponding disadvantage that workqueues cannot be used until they are 465 + initialized, which does not happen until some time after the scheduler 466 + spawns the first task. Given that there are parts of the kernel that 467 + really do want to execute grace periods during this mid-boot “dead 468 + zone”, expedited grace periods must do something else during this time. 469 + 470 + What they do is to fall back to the old practice of requiring that the 471 + requesting task drive the expedited grace period, as was the case before 472 + the use of workqueues. However, the requesting task is only required to 473 + drive the grace period during the mid-boot dead zone. Before mid-boot, a 474 + synchronous grace period is a no-op. Some time after mid-boot, 475 + workqueues are used. 476 + 477 + Non-expedited non-SRCU synchronous grace periods must also operate 478 + normally during mid-boot. This is handled by causing non-expedited grace 479 + periods to take the expedited code path during mid-boot. 480 + 481 + The current code assumes that there are no POSIX signals during the 482 + mid-boot dead zone.
However, if an overwhelming need for POSIX signals 483 + somehow arises, appropriate adjustments can be made to the expedited 484 + stall-warning code. One such adjustment would reinstate the 485 + pre-workqueue stall-warning checks, but only during the mid-boot dead 486 + zone. 487 + 488 + With this refinement, synchronous grace periods can now be used from 489 + task context pretty much any time during the life of the kernel. That 490 + is, aside from some points in the suspend, hibernate, or shutdown code 491 + path. 492 + 493 + Summary 494 + ~~~~~~~ 495 + 496 + Expedited grace periods use a sequence-number approach to promote 497 + batching, so that a single grace-period operation can serve numerous 498 + requests. A funnel lock is used to efficiently identify the one task out 499 + of a concurrent group that will request the grace period. All members of 500 + the group will block on waitqueues provided in the ``rcu_node`` 501 + structure. The actual grace-period processing is carried out by a 502 + workqueue. 503 + 504 + CPU-hotplug operations are noted lazily in order to prevent the need for 505 + tight synchronization between expedited grace periods and CPU-hotplug 506 + operations. The dyntick-idle counters are used to avoid sending IPIs to 507 + idle CPUs, at least in the common case. RCU-preempt and RCU-sched use 508 + different IPI handlers and different code to respond to the state 509 + changes carried out by those handlers, but otherwise use common code. 510 + 511 + Quiescent states are tracked using the ``rcu_node`` tree, and once all 512 + necessary quiescent states have been reported, all tasks waiting on this 513 + expedited grace period are awakened. A pair of mutexes are used to allow 514 + one grace period's wakeups to proceed concurrently with the next grace 515 + period's processing. 516 + 517 + This combination of mechanisms allows expedited grace periods to run 518 + reasonably efficiently. 
However, for non-time-critical tasks, normal 519 + grace periods should be used instead because their longer duration 520 + permits much higher degrees of batching, and thus much lower per-request 521 + overheads.

-9

Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html

··· 1 - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2 - "http://www.w3.org/TR/html4/loose.dtd"> 3 - <html> 4 - <head><title>A Diagram of TREE_RCU's Grace-Period Memory Ordering</title> 5 - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> 6 - 7 - <p><img src="TreeRCU-gp.svg" alt="TreeRCU-gp.svg"> 8 - 9 - </body></html>

-704

Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html

··· 1 - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2 - "http://www.w3.org/TR/html4/loose.dtd"> 3 - <html> 4 - <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title> 5 - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> 6 - 7 - <p>August 8, 2017</p> 8 - <p>This article was contributed by Paul E. McKenney</p> 9 - 10 - <h3>Introduction</h3> 11 - 12 - <p>This document gives a rough visual overview of how Tree RCU's 13 - grace-period memory ordering guarantee is provided. 14 - 15 - <ol> 16 - <li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> 17 - What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a> 18 - <li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks"> 19 - Tree RCU Grace Period Memory Ordering Building Blocks</a> 20 - <li> <a href="#Tree RCU Grace Period Memory Ordering Components"> 21 - Tree RCU Grace Period Memory Ordering Components</a> 22 - <li> <a href="#Putting It All Together">Putting It All Together</a> 23 - </ol> 24 - 25 - <h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> 26 - What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3> 27 - 28 - <p>RCU grace periods provide extremely strong memory-ordering guarantees 29 - for non-idle non-offline code. 30 - Any code that happens after the end of a given RCU grace period is guaranteed 31 - to see the effects of all accesses prior to the beginning of that grace 32 - period that are within RCU read-side critical sections. 33 - Similarly, any code that happens before the beginning of a given RCU grace 34 - period is guaranteed to see the effects of all accesses following the end 35 - of that grace period that are within RCU read-side critical sections. 36 - 37 - <p>Note well that RCU-sched read-side critical sections include any region 38 - of code for which preemption is disabled. 
39 - Given that each individual machine instruction can be thought of as 40 - an extremely small region of preemption-disabled code, one can think of 41 - <tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids. 42 - 43 - <p>RCU updaters use this guarantee by splitting their updates into 44 - two phases, one of which is executed before the grace period and 45 - the other of which is executed after the grace period. 46 - In the most common use case, phase one removes an element from 47 - a linked RCU-protected data structure, and phase two frees that element. 48 - For this to work, any readers that have witnessed state prior to the 49 - phase-one update (in the common case, removal) must not witness state 50 - following the phase-two update (in the common case, freeing). 51 - 52 - <p>The RCU implementation provides this guarantee using a network 53 - of lock-based critical sections, memory barriers, and per-CPU 54 - processing, as is described in the following sections. 55 - 56 - <h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks"> 57 - Tree RCU Grace Period Memory Ordering Building Blocks</a></h3> 58 - 59 - <p>The workhorse for RCU's grace-period memory ordering is the 60 - critical section for the <tt>rcu_node</tt> structure's 61 - <tt>->lock</tt>. 62 - These critical sections use helper functions for lock acquisition, including 63 - <tt>raw_spin_lock_rcu_node()</tt>, 64 - <tt>raw_spin_lock_irq_rcu_node()</tt>, and 65 - <tt>raw_spin_lock_irqsave_rcu_node()</tt>. 66 - Their lock-release counterparts are 67 - <tt>raw_spin_unlock_rcu_node()</tt>, 68 - <tt>raw_spin_unlock_irq_rcu_node()</tt>, and 69 - <tt>raw_spin_unlock_irqrestore_rcu_node()</tt>, 70 - respectively. 71 - For completeness, a 72 - <tt>raw_spin_trylock_rcu_node()</tt> 73 - is also provided. 
74 - The key point is that the lock-acquisition functions, including 75 - <tt>raw_spin_trylock_rcu_node()</tt>, all invoke 76 - <tt>smp_mb__after_unlock_lock()</tt> immediately after successful 77 - acquisition of the lock. 78 - 79 - <p>Therefore, for any given <tt>rcu_node</tt> structure, any access 80 - happening before one of the above lock-release functions will be seen 81 - by all CPUs as happening before any access happening after a later 82 - one of the above lock-acquisition functions. 83 - Furthermore, any access happening before one of the 84 - above lock-release function on any given CPU will be seen by all 85 - CPUs as happening before any access happening after a later one 86 - of the above lock-acquisition functions executing on that same CPU, 87 - even if the lock-release and lock-acquisition functions are operating 88 - on different <tt>rcu_node</tt> structures. 89 - Tree RCU uses these two ordering guarantees to form an ordering 90 - network among all CPUs that were in any way involved in the grace 91 - period, including any CPUs that came online or went offline during 92 - the grace period in question. 
93 - 94 - <p>The following litmus test exhibits the ordering effects of these 95 - lock-acquisition and lock-release functions: 96 - 97 - <pre> 98 - 1 int x, y, z; 99 - 2 100 - 3 void task0(void) 101 - 4 { 102 - 5 raw_spin_lock_rcu_node(rnp); 103 - 6 WRITE_ONCE(x, 1); 104 - 7 r1 = READ_ONCE(y); 105 - 8 raw_spin_unlock_rcu_node(rnp); 106 - 9 } 107 - 10 108 - 11 void task1(void) 109 - 12 { 110 - 13 raw_spin_lock_rcu_node(rnp); 111 - 14 WRITE_ONCE(y, 1); 112 - 15 r2 = READ_ONCE(z); 113 - 16 raw_spin_unlock_rcu_node(rnp); 114 - 17 } 115 - 18 116 - 19 void task2(void) 117 - 20 { 118 - 21 WRITE_ONCE(z, 1); 119 - 22 smp_mb(); 120 - 23 r3 = READ_ONCE(x); 121 - 24 } 122 - 25 123 - 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); 124 - </pre> 125 - 126 - <p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”, 127 - after all changes have propagated throughout the system. 128 - Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the 129 - acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example 130 - on PowerPC. 131 - The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this 132 - <tt>WARN_ON()</tt> from triggering. 133 - 134 - <p>This approach must be extended to include idle CPUs, which need 135 - RCU's grace-period memory ordering guarantee to extend to any 136 - RCU read-side critical sections preceding and following the current 137 - idle sojourn. 138 - This case is handled by calls to the strongly ordered 139 - <tt>atomic_add_return()</tt> read-modify-write atomic operation that 140 - is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry 141 - time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time. 142 - The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and 143 - <tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke 144 - an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs. 
145 - 146 - <table> 147 - <tr><th> </th></tr> 148 - <tr><th align="left">Quick Quiz:</th></tr> 149 - <tr><td> 150 - But what about CPUs that remain offline for the entire 151 - grace period? 152 - </td></tr> 153 - <tr><th align="left">Answer:</th></tr> 154 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 155 - Such CPUs will be offline at the beginning of the grace period, 156 - so the grace period won't expect quiescent states from them. 157 - Races between grace-period start and CPU-hotplug operations 158 - are mediated by the CPU's leaf <tt>rcu_node</tt> structure's 159 - <tt>->lock</tt> as described above. 160 - </font></td></tr> 161 - <tr><td> </td></tr> 162 - </table> 163 - 164 - <p>The approach must be extended to handle one final case, that 165 - of waking a task blocked in <tt>synchronize_rcu()</tt>. 166 - This task might be affinitied to a CPU that is not yet aware that 167 - the grace period has ended, and thus might not yet be subject to 168 - the grace period's memory ordering. 169 - Therefore, there is an <tt>smp_mb()</tt> after the return from 170 - <tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt> 171 - code path. 172 - 173 - <table> 174 - <tr><th> </th></tr> 175 - <tr><th align="left">Quick Quiz:</th></tr> 176 - <tr><td> 177 - What? Where??? 178 - I don't see any <tt>smp_mb()</tt> after the return from 179 - <tt>wait_for_completion()</tt>!!! 180 - </td></tr> 181 - <tr><th align="left">Answer:</th></tr> 182 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 183 - That would be because I spotted the need for that 184 - <tt>smp_mb()</tt> during the creation of this documentation, 185 - and it is therefore unlikely to hit mainline before v4.14. 186 - Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and 187 - Jonathan Cameron for asking questions that sensitized me 188 - to the rather elaborate sequence of events that demonstrate 189 - the need for this memory barrier. 
190 - </font></td></tr> 191 - <tr><td> </td></tr> 192 - </table> 193 - 194 - <p>Tree RCU's grace--period memory-ordering guarantees rely most 195 - heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt> 196 - field, so much so that it is necessary to abbreviate this pattern 197 - in the diagrams in the next section. 198 - For example, consider the <tt>rcu_prepare_for_idle()</tt> function 199 - shown below, which is one of several functions that enforce ordering 200 - of newly arrived RCU callbacks against future grace periods: 201 - 202 - <pre> 203 - 1 static void rcu_prepare_for_idle(void) 204 - 2 { 205 - 3 bool needwake; 206 - 4 struct rcu_data *rdp; 207 - 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); 208 - 6 struct rcu_node *rnp; 209 - 7 struct rcu_state *rsp; 210 - 8 int tne; 211 - 9 212 - 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || 213 - 11 rcu_is_nocb_cpu(smp_processor_id())) 214 - 12 return; 215 - 13 tne = READ_ONCE(tick_nohz_active); 216 - 14 if (tne != rdtp->tick_nohz_enabled_snap) { 217 - 15 if (rcu_cpu_has_callbacks(NULL)) 218 - 16 invoke_rcu_core(); 219 - 17 rdtp->tick_nohz_enabled_snap = tne; 220 - 18 return; 221 - 19 } 222 - 20 if (!tne) 223 - 21 return; 224 - 22 if (rdtp->all_lazy && 225 - 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { 226 - 24 rdtp->all_lazy = false; 227 - 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; 228 - 26 invoke_rcu_core(); 229 - 27 return; 230 - 28 } 231 - 29 if (rdtp->last_accelerate == jiffies) 232 - 30 return; 233 - 31 rdtp->last_accelerate = jiffies; 234 - 32 for_each_rcu_flavor(rsp) { 235 - 33 rdp = this_cpu_ptr(rsp->rda); 236 - 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) 237 - 35 continue; 238 - 36 rnp = rdp->mynode; 239 - 37 raw_spin_lock_rcu_node(rnp); 240 - 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); 241 - 39 raw_spin_unlock_rcu_node(rnp); 242 - 40 if (needwake) 243 - 41 rcu_gp_kthread_wake(rsp); 244 - 42 } 245 - 43 } 246 - </pre> 247 - 248 - <p>But the only part of 
<tt>rcu_prepare_for_idle()</tt> that really 249 - matters for this discussion are lines 37–39. 250 - We will therefore abbreviate this function as follows: 251 - 252 - </p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg"> 253 - 254 - <p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt> 255 - critical section, with the double line on top representing the additional 256 - <tt>smp_mb__after_unlock_lock()</tt>. 257 - 258 - <h3><a name="Tree RCU Grace Period Memory Ordering Components"> 259 - Tree RCU Grace Period Memory Ordering Components</a></h3> 260 - 261 - <p>Tree RCU's grace-period memory-ordering guarantee is provided by 262 - a number of RCU components: 263 - 264 - <ol> 265 - <li> <a href="#Callback Registry">Callback Registry</a> 266 - <li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a> 267 - <li> <a href="#Self-Reported Quiescent States"> 268 - Self-Reported Quiescent States</a> 269 - <li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a> 270 - <li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a> 271 - <li> <a href="Forcing Quiescent States">Forcing Quiescent States</a> 272 - <li> <a href="Grace-Period Cleanup">Grace-Period Cleanup</a> 273 - <li> <a href="Callback Invocation">Callback Invocation</a> 274 - </ol> 275 - 276 - <p>Each of the following section looks at the corresponding component 277 - in detail. 278 - 279 - <h4><a name="Callback Registry">Callback Registry</a></h4> 280 - 281 - <p>If RCU's grace-period guarantee is to mean anything at all, any 282 - access that happens before a given invocation of <tt>call_rcu()</tt> 283 - must also happen before the corresponding grace period. 
284 - The implementation of this portion of RCU's grace period guarantee 285 - is shown in the following figure: 286 - 287 - </p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg"> 288 - 289 - <p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state, 290 - it provides no ordering guarantees, either for itself or for 291 - phase one of the update (which again will usually be removal of 292 - an element from an RCU-protected data structure). 293 - It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list, 294 - which cannot become associated with a grace period until a later 295 - call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above. 296 - 297 - <p>One set of code paths shown on the left invokes 298 - <tt>rcu_accelerate_cbs()</tt> via 299 - <tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if 300 - the current CPU is inundated with queued <tt>rcu_head</tt> structures) 301 - or more likely from an <tt>RCU_SOFTIRQ</tt> handler. 302 - Another code path in the middle is taken only in kernels built with 303 - <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes 304 - <tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>. 305 - The final code path on the right is taken only in kernels built with 306 - <tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes 307 - <tt>rcu_accelerate_cbs()</tt> via 308 - <tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>, 309 - <tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>, 310 - which in turn is invoked on a surviving CPU after the outgoing 311 - CPU has been completely offlined. 312 - 313 - <p>There are a few other code paths within grace-period processing 314 - that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>. 
315 - However, either way, all of the CPU's recently queued <tt>rcu_head</tt> 316 - structures are associated with a future grace-period number under 317 - the protection of the CPU's leaf <tt>rcu_node</tt> structure's 318 - <tt>->lock</tt>. 319 - In all cases, there is full ordering against any prior critical section 320 - for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and 321 - also full ordering against any of the current task's or CPU's prior critical 322 - sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>. 323 - 324 - <p>The next section will show how this ordering ensures that any 325 - accesses prior to the <tt>call_rcu()</tt> (particularly including phase 326 - one of the update) 327 - happen before the start of the corresponding grace period. 328 - 329 - <table> 330 - <tr><th> </th></tr> 331 - <tr><th align="left">Quick Quiz:</th></tr> 332 - <tr><td> 333 - But what about <tt>synchronize_rcu()</tt>? 334 - </td></tr> 335 - <tr><th align="left">Answer:</th></tr> 336 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 337 - The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt> 338 - to <tt>wait_rcu_gp()</tt>, which invokes it. 339 - So either way, it eventually comes down to <tt>call_rcu()</tt>. 340 - </font></td></tr> 341 - <tr><td> </td></tr> 342 - </table> 343 - 344 - <h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4> 345 - 346 - <p>Grace-period initialization is carried out by 347 - the grace-period kernel thread, which makes several passes over the 348 - <tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function. 349 - This means that showing the full flow of ordering through the 350 - grace-period computation will require duplicating this tree. 351 - If you find this confusing, please note that the state of the 352 - <tt>rcu_node</tt> changes over time, just like Heraclitus's river. 
353 - However, to keep the <tt>rcu_node</tt> river tractable, the 354 - grace-period kernel thread's traversals are presented in multiple 355 - parts, starting in this section with the various phases of 356 - grace-period initialization. 357 - 358 - <p>The first ordering-related grace-period initialization action is to 359 - advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> 360 - grace-period-number counter, as shown below: 361 - 362 - </p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> 363 - 364 - <p>The actual increment is carried out using <tt>smp_store_release()</tt>, 365 - which helps reject false-positive RCU CPU stall detection. 366 - Note that only the root <tt>rcu_node</tt> structure is touched. 367 - 368 - <p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks 369 - based on CPUs having come online or gone offline since the start of 370 - the previous grace period. 371 - In the common case where the number of online CPUs for this <tt>rcu_node</tt> 372 - structure has not transitioned to or from zero, 373 - this pass will scan only the leaf <tt>rcu_node</tt> structures. 374 - However, if the number of online CPUs for a given leaf <tt>rcu_node</tt> 375 - structure has transitioned from zero, 376 - <tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU. 377 - Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt> 378 - structure has transitioned to zero, 379 - <tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU. 380 - The diagram below shows the path of ordering if the leftmost 381 - <tt>rcu_node</tt> structure onlines its first CPU and if the next 382 - <tt>rcu_node</tt> structure has no online CPUs 383 - (or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines 384 - its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs). 
385 - 386 - </p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%"> 387 - 388 - <p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt> 389 - tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's 390 - <tt>->gp_seq</tt> field to the newly advanced value from the 391 - <tt>rcu_state</tt> structure, as shown in the following diagram. 392 - 393 - </p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%"> 394 - 395 - <p>This change will also cause each CPU's next call to 396 - <tt>__note_gp_changes()</tt> 397 - to notice that a new grace period has started, as described in the next 398 - section. 399 - But because the grace-period kthread started the grace period at the 400 - root (with the advancing of the <tt>rcu_state</tt> structure's 401 - <tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt> 402 - structure's <tt>->gp_seq</tt> field, each CPU's observation of 403 - the start of the grace period will happen after the actual start 404 - of the grace period. 405 - 406 - <table> 407 - <tr><th> </th></tr> 408 - <tr><th align="left">Quick Quiz:</th></tr> 409 - <tr><td> 410 - But what about the CPU that started the grace period? 411 - Why wouldn't it see the start of the grace period right when 412 - it started that grace period? 413 - </td></tr> 414 - <tr><th align="left">Answer:</th></tr> 415 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 416 - In some deep philosophical and overly anthropomorphized 417 - sense, yes, the CPU starting the grace period is immediately 418 - aware of having done so. 419 - However, if we instead assume that RCU is not self-aware, 420 - then even the CPU starting the grace period does not really 421 - become aware of the start of this grace period until its 422 - first call to <tt>__note_gp_changes()</tt>. 
423 - On the other hand, this CPU potentially gets early notification 424 - because it invokes <tt>__note_gp_changes()</tt> during its 425 - last <tt>rcu_gp_init()</tt> pass through its leaf 426 - <tt>rcu_node</tt> structure. 427 - </font></td></tr> 428 - <tr><td> </td></tr> 429 - </table> 430 - 431 - <h4><a name="Self-Reported Quiescent States"> 432 - Self-Reported Quiescent States</a></h4> 433 - 434 - <p>When all entities that might block the grace period have reported 435 - quiescent states (or as described in a later section, had quiescent 436 - states reported on their behalf), the grace period can end. 437 - Online non-idle CPUs report their own quiescent states, as shown 438 - in the following diagram: 439 - 440 - </p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%"> 441 - 442 - <p>This is for the last CPU to report a quiescent state, which signals 443 - the end of the grace period. 444 - Earlier quiescent states would push up the <tt>rcu_node</tt> tree 445 - only until they encountered an <tt>rcu_node</tt> structure that 446 - is waiting for additional quiescent states. 447 - However, ordering is nevertheless preserved because some later quiescent 448 - state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>. 449 - 450 - <p>Any number of events can lead up to a CPU invoking 451 - <tt>note_gp_changes</tt> (or alternatively, directly invoking 452 - <tt>__note_gp_changes()</tt>), at which point that CPU will notice 453 - the start of a new grace period while holding its leaf 454 - <tt>rcu_node</tt> lock. 455 - Therefore, all execution shown in this diagram happens after the 456 - start of the grace period. 457 - In addition, this CPU will consider any RCU read-side critical 458 - section that started before the invocation of <tt>__note_gp_changes()</tt> 459 - to have started before the grace period, and thus a critical 460 - section that the grace period must wait on. 
461 - 462 - <table> 463 - <tr><th> </th></tr> 464 - <tr><th align="left">Quick Quiz:</th></tr> 465 - <tr><td> 466 - But an RCU read-side critical section might have started 467 - after the beginning of the grace period 468 - (the advancing of <tt>->gp_seq</tt> from earlier), so why should 469 - the grace period wait on such a critical section? 470 - </td></tr> 471 - <tr><th align="left">Answer:</th></tr> 472 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 473 - It is indeed not necessary for the grace period to wait on such 474 - a critical section. 475 - However, it is permissible to wait on it. 476 - And it is furthermore important to wait on it, as this 477 - lazy approach is far more scalable than a “big bang” 478 - all-at-once grace-period start could possibly be. 479 - </font></td></tr> 480 - <tr><td> </td></tr> 481 - </table> 482 - 483 - <p>If the CPU does a context switch, a quiescent state will be 484 - noted by <tt>rcu_note_context_switch()</tt> on the left. 485 - On the other hand, if the CPU takes a scheduler-clock interrupt 486 - while executing in usermode, a quiescent state will be noted by 487 - <tt>rcu_sched_clock_irq()</tt> on the right. 488 - Either way, the passage through a quiescent state will be noted 489 - in a per-CPU variable. 490 - 491 - <p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on 492 - this CPU (for example, after the next scheduler-clock 493 - interrupt), <tt>rcu_core()</tt> will invoke 494 - <tt>rcu_check_quiescent_state()</tt>, which will notice the 495 - recorded quiescent state, and invoke 496 - <tt>rcu_report_qs_rdp()</tt>. 
497 - If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state 498 - really does apply to the current grace period, it invokes 499 - <tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt> 500 - tree as shown at the bottom of the diagram, clearing bits from 501 - each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field, 502 - and propagating up the tree when the result is zero. 503 - 504 - <p>Note that traversal passes upwards out of a given <tt>rcu_node</tt> 505 - structure only if the current CPU is reporting the last quiescent 506 - state for the subtree headed by that <tt>rcu_node</tt> structure. 507 - A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt> 508 - structure, then there will be a later traversal by another CPU 509 - (or perhaps the same one) that proceeds upwards 510 - from that point, and the <tt>rcu_node</tt> <tt>->lock</tt> 511 - guarantees that the first CPU's quiescent state happens before the 512 - remainder of the second CPU's traversal. 513 - Applying this line of thought repeatedly shows that all CPUs' 514 - quiescent states happen before the last CPU traverses through 515 - the root <tt>rcu_node</tt> structure, the “last CPU” 516 - being the one that clears the last bit in the root <tt>rcu_node</tt> 517 - structure's <tt>->qsmask</tt> field. 518 - 519 - <h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4> 520 - 521 - <p>Due to energy-efficiency considerations, RCU is forbidden from 522 - disturbing idle CPUs. 523 - CPUs are therefore required to notify RCU when entering or leaving idle 524 - state, which they do via fully ordered value-returning atomic operations 525 - on a per-CPU variable. 
526 - The ordering effects are as shown below: 527 - 528 - </p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%"> 529 - 530 - <p>The RCU grace-period kernel thread samples the per-CPU idleness 531 - variable while holding the corresponding CPU's leaf <tt>rcu_node</tt> 532 - structure's <tt>->lock</tt>. 533 - This means that any RCU read-side critical sections that precede the 534 - idle period (the oval near the top of the diagram above) will happen 535 - before the end of the current grace period. 536 - Similarly, the beginning of the current grace period will happen before 537 - any RCU read-side critical sections that follow the 538 - idle period (the oval near the bottom of the diagram above). 539 - 540 - <p>Plumbing this into the full grace-period execution is described 541 - <a href="#Forcing Quiescent States">below</a>. 542 - 543 - <h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4> 544 - 545 - <p>RCU is also forbidden from disturbing offline CPUs, which might well 546 - be powered off and removed from the system completely. 547 - CPUs are therefore required to notify RCU of their comings and goings 548 - as part of the corresponding CPU hotplug operations. 549 - The ordering effects are shown below: 550 - 551 - </p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%"> 552 - 553 - <p>Because CPU hotplug operations are much less frequent than idle transitions, 554 - they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt> 555 - structure's <tt>->lock</tt> and update this structure's 556 - <tt>->qsmaskinitnext</tt>. 557 - The RCU grace-period kernel thread samples this mask to detect CPUs 558 - having gone offline since the beginning of this grace period. 559 - 560 - <p>Plumbing this into the full grace-period execution is described 561 - <a href="#Forcing Quiescent States">below</a>. 
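<p>The upward traversal described under Self-Reported Quiescent States can be sketched in userspace C. This is a minimal, single-threaded sketch rather than the kernel's implementation: <tt>toy_rcu_node</tt> and <tt>toy_report_qs()</tt> are hypothetical names, and the per-node <tt>->lock</tt> acquisitions that provide the real ordering are omitted. The same clearing of <tt>->qsmask</tt> bits applies when the grace-period kernel thread reports quiescent states on behalf of idle or offline CPUs.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's rcu_node. */
struct toy_rcu_node {
	unsigned long qsmask;        /* Children/CPUs still owing a quiescent state. */
	unsigned long grpmask;       /* This node's bit in its parent's qsmask. */
	struct toy_rcu_node *parent; /* NULL for the root node. */
};

/*
 * Clear the bits in mask at rnp and keep moving toward the root for as
 * long as each node's qsmask becomes zero.  Returns true only when the
 * root's qsmask reaches zero, meaning the last quiescent state arrived.
 * The kernel does each step under that node's ->lock; this sketch is
 * single-threaded and therefore takes no locks.
 */
static bool toy_report_qs(struct toy_rcu_node *rnp, unsigned long mask)
{
	assert(rnp != NULL);
	while (rnp != NULL) {
		rnp->qsmask &= ~mask;
		if (rnp->qsmask != 0)
			return false;    /* Others below this node are still pending. */
		mask = rnp->grpmask;     /* Report this whole subtree to the parent. */
		rnp = rnp->parent;
	}
	return true;                     /* Passed upward out of the root. */
}
```

<p>With a root and two leaves of two CPUs each, only the fourth report returns <tt>true</tt>; every earlier report stops at the first node that still has bits set, which mirrors the text's point that a traversal leaves a node only when it reports the last quiescent state for that subtree.</p>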
562 - 563 - <h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4> 564 - 565 - <p>As noted above, idle and offline CPUs cannot report their own 566 - quiescent states, and therefore the grace-period kernel thread 567 - must do the reporting on their behalf. 568 - This process is called “forcing quiescent states”, it is 569 - repeated every few jiffies, and its ordering effects are shown below: 570 - 571 - </p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%"> 572 - 573 - <p>Each pass of quiescent state forcing is guaranteed to traverse the 574 - leaf <tt>rcu_node</tt> structures, and if there are no new quiescent 575 - states due to recently idled and/or offlined CPUs, then only the 576 - leaves are traversed. 577 - However, if there is a newly offlined CPU as illustrated on the left 578 - or a newly idled CPU as illustrated on the right, the corresponding 579 - quiescent state will be driven up towards the root. 580 - As with self-reported quiescent states, the upwards driving stops 581 - once it reaches an <tt>rcu_node</tt> structure that has quiescent 582 - states outstanding from other CPUs. 583 - 584 - <table> 585 - <tr><th> </th></tr> 586 - <tr><th align="left">Quick Quiz:</th></tr> 587 - <tr><td> 588 - The leftmost drive to root stopped before it reached 589 - the root <tt>rcu_node</tt> structure, which means that 590 - there are still CPUs subordinate to that structure on 591 - which the current grace period is waiting. 592 - Given that, how is it possible that the rightmost drive 593 - to root ended the grace period? 594 - </td></tr> 595 - <tr><th align="left">Answer:</th></tr> 596 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 597 - Good analysis! 598 - It is in fact impossible in the absence of bugs in RCU. 599 - But this diagram is complex enough as it is, so simplicity 600 - overrode accuracy. 
601 - You can think of it as poetic license, or you can think of 602 - it as misdirection that is resolved in the 603 - <a href="#Putting It All Together">stitched-together diagram</a>. 604 - </font></td></tr> 605 - <tr><td> </td></tr> 606 - </table> 607 - 608 - <h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4> 609 - 610 - <p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree 611 - breadth-first advancing all the <tt>->gp_seq</tt> fields, then it 612 - advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field. 613 - The ordering effects are shown below: 614 - 615 - </p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%"> 616 - 617 - <p>As indicated by the oval at the bottom of the diagram, once 618 - grace-period cleanup is complete, the next grace period can begin. 619 - 620 - <table> 621 - <tr><th> </th></tr> 622 - <tr><th align="left">Quick Quiz:</th></tr> 623 - <tr><td> 624 - But when precisely does the grace period end? 625 - </td></tr> 626 - <tr><th align="left">Answer:</th></tr> 627 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 628 - There is no useful single point at which the grace period 629 - can be said to end. 630 - The earliest reasonable candidate is as soon as the last 631 - CPU has reported its quiescent state, but it may be some 632 - milliseconds before RCU becomes aware of this. 633 - The latest reasonable candidate is once the <tt>rcu_state</tt> 634 - structure's <tt>->gp_seq</tt> field has been updated, 635 - but it is quite possible that some CPUs have already completed 636 - phase two of their updates by that time. 637 - In short, if you are going to work with RCU, you need to 638 - learn to embrace uncertainty. 
639 - </font></td></tr> 640 - <tr><td> </td></tr> 641 - </table> 642 - 643 - 644 - <h4><a name="Callback Invocation">Callback Invocation</a></h4> 645 - 646 - <p>Once a given CPU's leaf <tt>rcu_node</tt> structure's 647 - <tt>->gp_seq</tt> field has been updated, that CPU can begin 648 - invoking its RCU callbacks that were waiting for this grace period 649 - to end. 650 - These callbacks are identified by <tt>rcu_advance_cbs()</tt>, 651 - which is usually invoked by <tt>__note_gp_changes()</tt>. 652 - As shown in the diagram below, this invocation can be triggered by 653 - the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on 654 - the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on 655 - the right, but only for kernels built with 656 - <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>). 657 - Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in 658 - <tt>rcu_do_batch()</tt> invoking the callbacks, which in turn 659 - allows those callbacks to carry out (either directly or indirectly 660 - via wakeup) the needed phase-two processing for each update. 661 - 662 - </p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%"> 663 - 664 - <p>Please note that callback invocation can also be prompted by any 665 - number of corner-case code paths, for example, when a CPU notes that 666 - it has excessive numbers of callbacks queued. 667 - In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's 668 - <tt>->lock</tt> before invoking callbacks, which preserves the 669 - required ordering against the newly completed grace period. 670 - 671 - <p>However, if the callback function communicates to other CPUs, 672 - for example, doing a wakeup, then it is that function's responsibility 673 - to maintain ordering. 674 - For example, if the callback function wakes up a task that runs on 675 - some other CPU, proper ordering must be in place in both the callback 676 - function and the task being awakened. 
677 - To see why this is important, consider the top half of the 678 - <a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram. 679 - The callback might be running on a CPU corresponding to the leftmost 680 - leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on 681 - a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure, 682 - and the grace-period kernel thread might not yet have reached the 683 - rightmost leaf. 684 - In this case, the grace period's memory ordering might not yet have 685 - reached that CPU, so again the callback function and the awakened 686 - task must supply proper ordering. 687 - 688 - <h3><a name="Putting It All Together">Putting It All Together</a></h3> 689 - 690 - <p>A stitched-together diagram is 691 - <a href="Tree-RCU-Diagram.html">here</a>. 692 - 693 - <h3><a name="Legal Statement"> 694 - Legal Statement</a></h3> 695 - 696 - <p>This work represents the view of the author and does not necessarily 697 - represent the view of IBM. 698 - 699 - </p><p>Linux is a registered trademark of Linus Torvalds. 700 - 701 - </p><p>Other company, product, and service names may be trademarks or 702 - service marks of others. 703 - 704 - </body></html>

+625

Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst

··· 1 + ====================================================== 2 + A Tour Through TREE_RCU's Grace-Period Memory Ordering 3 + ====================================================== 4 + 5 + August 8, 2017 6 + 7 + This article was contributed by Paul E. McKenney 8 + 9 + Introduction 10 + ============ 11 + 12 + This document gives a rough visual overview of how Tree RCU's 13 + grace-period memory ordering guarantee is provided. 14 + 15 + What Is Tree RCU's Grace Period Memory Ordering Guarantee? 16 + ========================================================== 17 + 18 + RCU grace periods provide extremely strong memory-ordering guarantees 19 + for non-idle non-offline code. 20 + Any code that happens after the end of a given RCU grace period is guaranteed 21 + to see the effects of all accesses prior to the beginning of that grace 22 + period that are within RCU read-side critical sections. 23 + Similarly, any code that happens before the beginning of a given RCU grace 24 + period is guaranteed to see the effects of all accesses following the end 25 + of that grace period that are within RCU read-side critical sections. 26 + 27 + Note well that RCU-sched read-side critical sections include any region 28 + of code for which preemption is disabled. 29 + Given that each individual machine instruction can be thought of as 30 + an extremely small region of preemption-disabled code, one can think of 31 + ``synchronize_rcu()`` as ``smp_mb()`` on steroids. 32 + 33 + RCU updaters use this guarantee by splitting their updates into 34 + two phases, one of which is executed before the grace period and 35 + the other of which is executed after the grace period. 36 + In the most common use case, phase one removes an element from 37 + a linked RCU-protected data structure, and phase two frees that element. 
38 + For this to work, any readers that have witnessed state prior to the 39 + phase-one update (in the common case, removal) must not witness state 40 + following the phase-two update (in the common case, freeing). 41 + 42 + The RCU implementation provides this guarantee using a network 43 + of lock-based critical sections, memory barriers, and per-CPU 44 + processing, as is described in the following sections. 45 + 46 + Tree RCU Grace Period Memory Ordering Building Blocks 47 + ===================================================== 48 + 49 + The workhorse for RCU's grace-period memory ordering is the 50 + critical section for the ``rcu_node`` structure's 51 + ``->lock``. These critical sections use helper functions for lock 52 + acquisition, including ``raw_spin_lock_rcu_node()``, 53 + ``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. 54 + Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, 55 + ``raw_spin_unlock_irq_rcu_node()``, and 56 + ``raw_spin_unlock_irqrestore_rcu_node()``, respectively. 57 + For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided. 58 + The key point is that the lock-acquisition functions, including 59 + ``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()`` 60 + immediately after successful acquisition of the lock. 61 + 62 + Therefore, for any given ``rcu_node`` structure, any access 63 + happening before one of the above lock-release functions will be seen 64 + by all CPUs as happening before any access happening after a later 65 + one of the above lock-acquisition functions. 66 + Furthermore, any access happening before one of the 67 + above lock-release functions on any given CPU will be seen by all 68 + CPUs as happening before any access happening after a later one 69 + of the above lock-acquisition functions executing on that same CPU, 70 + even if the lock-release and lock-acquisition functions are operating 71 + on different ``rcu_node`` structures. 
72 + Tree RCU uses these two ordering guarantees to form an ordering 73 + network among all CPUs that were in any way involved in the grace 74 + period, including any CPUs that came online or went offline during 75 + the grace period in question. 76 + 77 + The following litmus test exhibits the ordering effects of these 78 + lock-acquisition and lock-release functions:: 79 + 80 + 1 int x, y, z; 81 + 2 82 + 3 void task0(void) 83 + 4 { 84 + 5 raw_spin_lock_rcu_node(rnp); 85 + 6 WRITE_ONCE(x, 1); 86 + 7 r1 = READ_ONCE(y); 87 + 8 raw_spin_unlock_rcu_node(rnp); 88 + 9 } 89 + 10 90 + 11 void task1(void) 91 + 12 { 92 + 13 raw_spin_lock_rcu_node(rnp); 93 + 14 WRITE_ONCE(y, 1); 94 + 15 r2 = READ_ONCE(z); 95 + 16 raw_spin_unlock_rcu_node(rnp); 96 + 17 } 97 + 18 98 + 19 void task2(void) 99 + 20 { 100 + 21 WRITE_ONCE(z, 1); 101 + 22 smp_mb(); 102 + 23 r3 = READ_ONCE(x); 103 + 24 } 104 + 25 105 + 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); 106 + 107 + The ``WARN_ON()`` is evaluated at “the end of time”, 108 + after all changes have propagated throughout the system. 109 + Without the ``smp_mb__after_unlock_lock()`` provided by the 110 + acquisition functions, this ``WARN_ON()`` could trigger, for example 111 + on PowerPC. 112 + The ``smp_mb__after_unlock_lock()`` invocations prevent this 113 + ``WARN_ON()`` from triggering. 114 + 115 + This approach must be extended to include idle CPUs, which need 116 + RCU's grace-period memory ordering guarantee to extend to any 117 + RCU read-side critical sections preceding and following the current 118 + idle sojourn. 119 + This case is handled by calls to the strongly ordered 120 + ``atomic_add_return()`` read-modify-write atomic operation that 121 + is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry 122 + time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time. 
123 + The grace-period kthread invokes ``rcu_dynticks_snap()`` and 124 + ``rcu_dynticks_in_eqs_since()`` (both of which invoke 125 + an ``atomic_add_return()`` of zero) to detect idle CPUs. 126 + 127 + +-----------------------------------------------------------------------+ 128 + | **Quick Quiz**: | 129 + +-----------------------------------------------------------------------+ 130 + | But what about CPUs that remain offline for the entire grace period? | 131 + +-----------------------------------------------------------------------+ 132 + | **Answer**: | 133 + +-----------------------------------------------------------------------+ 134 + | Such CPUs will be offline at the beginning of the grace period, so | 135 + | the grace period won't expect quiescent states from them. Races | 136 + | between grace-period start and CPU-hotplug operations are mediated | 137 + | by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described | 138 + | above. | 139 + +-----------------------------------------------------------------------+ 140 + 141 + The approach must be extended to handle one final case, that of waking a 142 + task blocked in ``synchronize_rcu()``. This task might be affinitied to 143 + a CPU that is not yet aware that the grace period has ended, and thus 144 + might not yet be subject to the grace period's memory ordering. 145 + Therefore, there is an ``smp_mb()`` after the return from 146 + ``wait_for_completion()`` in the ``synchronize_rcu()`` code path. 147 + 148 + +-----------------------------------------------------------------------+ 149 + | **Quick Quiz**: | 150 + +-----------------------------------------------------------------------+ 151 + | What? Where??? I don't see any ``smp_mb()`` after the return from | 152 + | ``wait_for_completion()``!!! 
| 153 + +-----------------------------------------------------------------------+ 154 + | **Answer**: | 155 + +-----------------------------------------------------------------------+ 156 + | That would be because I spotted the need for that ``smp_mb()`` during | 157 + | the creation of this documentation, and it is therefore unlikely to | 158 + | hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter | 159 + | Zijlstra, and Jonathan Cameron for asking questions that sensitized | 160 + | me to the rather elaborate sequence of events that demonstrate the | 161 + | need for this memory barrier. | 162 + +-----------------------------------------------------------------------+ 163 + 164 + Tree RCU's grace-period memory-ordering guarantees rely most heavily on 165 + the ``rcu_node`` structure's ``->lock`` field, so much so that it is 166 + necessary to abbreviate this pattern in the diagrams in the next 167 + section. For example, consider the ``rcu_prepare_for_idle()`` function 168 + shown below, which is one of several functions that enforce ordering of 169 + newly arrived RCU callbacks against future grace periods: 170 + 171 + :: 172 + 173 + 1 static void rcu_prepare_for_idle(void) 174 + 2 { 175 + 3 bool needwake; 176 + 4 struct rcu_data *rdp; 177 + 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); 178 + 6 struct rcu_node *rnp; 179 + 7 struct rcu_state *rsp; 180 + 8 int tne; 181 + 9 182 + 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || 183 + 11 rcu_is_nocb_cpu(smp_processor_id())) 184 + 12 return; 185 + 13 tne = READ_ONCE(tick_nohz_active); 186 + 14 if (tne != rdtp->tick_nohz_enabled_snap) { 187 + 15 if (rcu_cpu_has_callbacks(NULL)) 188 + 16 invoke_rcu_core(); 189 + 17 rdtp->tick_nohz_enabled_snap = tne; 190 + 18 return; 191 + 19 } 192 + 20 if (!tne) 193 + 21 return; 194 + 22 if (rdtp->all_lazy && 195 + 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { 196 + 24 rdtp->all_lazy = false; 197 + 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; 
198 + 26 invoke_rcu_core(); 199 + 27 return; 200 + 28 } 201 + 29 if (rdtp->last_accelerate == jiffies) 202 + 30 return; 203 + 31 rdtp->last_accelerate = jiffies; 204 + 32 for_each_rcu_flavor(rsp) { 205 + 33 rdp = this_cpu_ptr(rsp->rda); 206 + 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) 207 + 35 continue; 208 + 36 rnp = rdp->mynode; 209 + 37 raw_spin_lock_rcu_node(rnp); 210 + 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); 211 + 39 raw_spin_unlock_rcu_node(rnp); 212 + 40 if (needwake) 213 + 41 rcu_gp_kthread_wake(rsp); 214 + 42 } 215 + 43 } 216 + 217 + But the only part of ``rcu_prepare_for_idle()`` that really matters for 218 + this discussion is lines 37–39. We will therefore abbreviate this 219 + function as follows: 220 + 221 + .. kernel-figure:: rcu_node-lock.svg 222 + 223 + The box represents the ``rcu_node`` structure's ``->lock`` critical 224 + section, with the double line on top representing the additional 225 + ``smp_mb__after_unlock_lock()``. 226 + 227 + Tree RCU Grace Period Memory Ordering Components 228 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 229 + 230 + Tree RCU's grace-period memory-ordering guarantee is provided by a 231 + number of RCU components: 232 + 233 + #. `Callback Registry <#Callback%20Registry>`__ 234 + #. `Grace-Period Initialization <#Grace-Period%20Initialization>`__ 235 + #. `Self-Reported Quiescent 236 + States <#Self-Reported%20Quiescent%20States>`__ 237 + #. `Dynamic Tick Interface <#Dynamic%20Tick%20Interface>`__ 238 + #. `CPU-Hotplug Interface <#CPU-Hotplug%20Interface>`__ 239 + #. `Forcing Quiescent States <#Forcing%20Quiescent%20States>`__ 240 + #. `Grace-Period Cleanup <#Grace-Period%20Cleanup>`__ 241 + #. `Callback Invocation <#Callback%20Invocation>`__ 242 + 243 + Each of the following sections looks at the corresponding component in 244 + detail. 
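Because every component below relies on the ``rcu_node`` ``->lock`` pattern abbreviated above, a userspace analogy may help before going further. The following is a minimal sketch using C11 atomics, not kernel code: ``toy_node_lock``, ``toy_lock_node()``, ``toy_trylock_node()``, and ``toy_unlock_node()`` are hypothetical names, and the sequentially consistent fence is only a conservative stand-in for ``smp_mb__after_unlock_lock()``. The C11 memory model differs from the Linux-kernel memory model, so this shows the shape of the pattern (acquire the lock, then issue a full barrier) rather than its exact guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an rcu_node structure's ->lock. */
struct toy_node_lock {
	atomic_flag locked;
};

#define TOY_NODE_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Spin until the lock is acquired, then issue a full fence. */
static inline void toy_lock_node(struct toy_node_lock *l)
{
	assert(l != NULL);
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		; /* Busy-wait; a real spinlock would also relax the CPU. */
	/* Assumed analogue of smp_mb__after_unlock_lock(). */
	atomic_thread_fence(memory_order_seq_cst);
}

/* Try once; the fence is issued only after a successful acquisition. */
static inline bool toy_trylock_node(struct toy_node_lock *l)
{
	assert(l != NULL);
	if (atomic_flag_test_and_set_explicit(&l->locked,
					      memory_order_acquire))
		return false;
	atomic_thread_fence(memory_order_seq_cst);
	return true;
}

/* Release the lock with release semantics. */
static inline void toy_unlock_node(struct toy_node_lock *l)
{
	assert(l != NULL);
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

In the diagrams, the box corresponds to the region between ``toy_lock_node()`` and ``toy_unlock_node()``, and the double line on top corresponds to the fence issued right after acquisition.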
245 + 246 + Callback Registry 247 + ^^^^^^^^^^^^^^^^^ 248 + 249 + If RCU's grace-period guarantee is to mean anything at all, any access 250 + that happens before a given invocation of ``call_rcu()`` must also 251 + happen before the corresponding grace period. The implementation of this 252 + portion of RCU's grace period guarantee is shown in the following 253 + figure: 254 + 255 + .. kernel-figure:: TreeRCU-callback-registry.svg 256 + 257 + Because ``call_rcu()`` normally acts only on CPU-local state, it 258 + provides no ordering guarantees, either for itself or for phase one of 259 + the update (which again will usually be removal of an element from an 260 + RCU-protected data structure). It simply enqueues the ``rcu_head`` 261 + structure on a per-CPU list, which cannot become associated with a grace 262 + period until a later call to ``rcu_accelerate_cbs()``, as shown in the 263 + diagram above. 264 + 265 + One set of code paths shown on the left invokes ``rcu_accelerate_cbs()`` 266 + via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the 267 + current CPU is inundated with queued ``rcu_head`` structures) or more 268 + likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle 269 + is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which 270 + invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The 271 + final code path on the right is taken only in kernels built with 272 + ``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via 273 + ``rcu_advance_cbs()``, ``rcu_migrate_callbacks``, 274 + ``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn 275 + is invoked on a surviving CPU after the outgoing CPU has been completely 276 + offlined. 277 + 278 + There are a few other code paths within grace-period processing that 279 + opportunistically invoke ``rcu_accelerate_cbs()``. 
However, either way, 280 + all of the CPU's recently queued ``rcu_head`` structures are associated 281 + with a future grace-period number under the protection of the CPU's leaf 282 + ``rcu_node`` structure's ``->lock``. In all cases, there is full 283 + ordering against any prior critical section for that same ``rcu_node`` 284 + structure's ``->lock``, and also full ordering against any of the 285 + current task's or CPU's prior critical sections for any ``rcu_node`` 286 + structure's ``->lock``. 287 + 288 + The next section will show how this ordering ensures that any accesses 289 + prior to the ``call_rcu()`` (particularly including phase one of the 290 + update) happen before the start of the corresponding grace period. 291 + 292 + +-----------------------------------------------------------------------+ 293 + | **Quick Quiz**: | 294 + +-----------------------------------------------------------------------+ 295 + | But what about ``synchronize_rcu()``? | 296 + +-----------------------------------------------------------------------+ 297 + | **Answer**: | 298 + +-----------------------------------------------------------------------+ 299 + | The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, | 300 + | which invokes it. So either way, it eventually comes down to | 301 + | ``call_rcu()``. | 302 + +-----------------------------------------------------------------------+ 303 + 304 + Grace-Period Initialization 305 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 306 + 307 + Grace-period initialization is carried out by the grace-period kernel 308 + thread, which makes several passes over the ``rcu_node`` tree within the 309 + ``rcu_gp_init()`` function. This means that showing the full flow of 310 + ordering through the grace-period computation will require duplicating 311 + this tree. If you find this confusing, please note that the state of the 312 + ``rcu_node`` changes over time, just like Heraclitus's river.
However, 313 + to keep the ``rcu_node`` river tractable, the grace-period kernel 314 + thread's traversals are presented in multiple parts, starting in this 315 + section with the various phases of grace-period initialization. 316 + 317 + The first ordering-related grace-period initialization action is to 318 + advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number 319 + counter, as shown below: 320 + 321 + .. kernel-figure:: TreeRCU-gp-init-1.svg 322 + 323 + The actual increment is carried out using ``smp_store_release()``, which 324 + helps reject false-positive RCU CPU stall detection. Note that only the 325 + root ``rcu_node`` structure is touched. 326 + 327 + The first pass through the ``rcu_node`` tree updates bitmasks based on 328 + CPUs having come online or gone offline since the start of the previous 329 + grace period. In the common case where the number of online CPUs for 330 + this ``rcu_node`` structure has not transitioned to or from zero, this 331 + pass will scan only the leaf ``rcu_node`` structures. However, if the 332 + number of online CPUs for a given leaf ``rcu_node`` structure has 333 + transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the 334 + first incoming CPU. Similarly, if the number of online CPUs for a given 335 + leaf ``rcu_node`` structure has transitioned to zero, 336 + ``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU. 337 + The diagram below shows the path of ordering if the leftmost 338 + ``rcu_node`` structure onlines its first CPU and if the next 339 + ``rcu_node`` structure has no online CPUs (or, alternatively if the 340 + leftmost ``rcu_node`` structure offlines its last CPU and if the next 341 + ``rcu_node`` structure has no online CPUs). 342 + 343 + .. 
kernel-figure:: TreeRCU-gp-init-2.svg 344 + 345 + The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses 346 + breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field 347 + to the newly advanced value from the ``rcu_state`` structure, as shown 348 + in the following diagram. 349 + 350 + .. kernel-figure:: TreeRCU-gp-init-3.svg 351 + 352 + This change will also cause each CPU's next call to 353 + ``__note_gp_changes()`` to notice that a new grace period has started, 354 + as described in the next section. But because the grace-period kthread 355 + started the grace period at the root (with the advancing of the 356 + ``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf 357 + ``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of 358 + the start of the grace period will happen after the actual start of the 359 + grace period. 360 + 361 + +-----------------------------------------------------------------------+ 362 + | **Quick Quiz**: | 363 + +-----------------------------------------------------------------------+ 364 + | But what about the CPU that started the grace period? Why wouldn't it | 365 + | see the start of the grace period right when it started that grace | 366 + | period? | 367 + +-----------------------------------------------------------------------+ 368 + | **Answer**: | 369 + +-----------------------------------------------------------------------+ 370 + | In some deep philosophical and overly anthropomorphized sense, yes, | 371 + | the CPU starting the grace period is immediately aware of having done | 372 + | so. However, if we instead assume that RCU is not self-aware, then | 373 + | even the CPU starting the grace period does not really become aware | 374 + | of the start of this grace period until its first call to | 375 + | ``__note_gp_changes()``.
On the other hand, this CPU potentially gets | 376 + | early notification because it invokes ``__note_gp_changes()`` during | 377 + | its last ``rcu_gp_init()`` pass through its leaf ``rcu_node`` | 378 + | structure. | 379 + +-----------------------------------------------------------------------+ 380 + 381 + Self-Reported Quiescent States 382 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 383 + 384 + When all entities that might block the grace period have reported 385 + quiescent states (or as described in a later section, had quiescent 386 + states reported on their behalf), the grace period can end. Online 387 + non-idle CPUs report their own quiescent states, as shown in the 388 + following diagram: 389 + 390 + .. kernel-figure:: TreeRCU-qs.svg 391 + 392 + This is for the last CPU to report a quiescent state, which signals the 393 + end of the grace period. Earlier quiescent states would push up the 394 + ``rcu_node`` tree only until they encountered an ``rcu_node`` structure 395 + that is waiting for additional quiescent states. However, ordering is 396 + nevertheless preserved because some later quiescent state will acquire 397 + that ``rcu_node`` structure's ``->lock``. 398 + 399 + Any number of events can lead up to a CPU invoking ``note_gp_changes()`` 400 + (or alternatively, directly invoking ``__note_gp_changes()``), at which 401 + point that CPU will notice the start of a new grace period while holding 402 + its leaf ``rcu_node`` lock. Therefore, all execution shown in this 403 + diagram happens after the start of the grace period. In addition, this 404 + CPU will consider any RCU read-side critical section that started before 405 + the invocation of ``__note_gp_changes()`` to have started before the 406 + grace period, and thus a critical section that the grace period must 407 + wait on.
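The upward push of quiescent states described in this section can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel's implementation: ``toy_node``, ``toy_lock()``, ``toy_unlock()`` and ``toy_report_qs()`` are hypothetical names, the ``grpmask`` field is assumed here to hold a node's bit in its parent's ``qsmask``, and a C11 ``atomic_flag`` spinlock stands in for the ``rcu_node`` structure's ``->lock``.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for rcu_node; not the kernel's definition. */
struct toy_node {
	atomic_flag lock;        /* models the ->lock spinlock */
	unsigned long qsmask;    /* children or CPUs still owing a quiescent state */
	unsigned long grpmask;   /* assumed: this node's bit in its parent's qsmask */
	struct toy_node *parent; /* NULL at the root */
};

static void toy_lock(struct toy_node *np)
{
	/* Acquire ordering pairs with the release in toy_unlock(). */
	while (atomic_flag_test_and_set_explicit(&np->lock, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_node *np)
{
	atomic_flag_clear_explicit(&np->lock, memory_order_release);
}

/*
 * Report the quiescent states named by mask at node np, pushing upward
 * while each node's qsmask drops to zero. Returns true only when this
 * report cleared the last bit at the root.
 */
static bool toy_report_qs(struct toy_node *np, unsigned long mask)
{
	assert(np != NULL);
	for (;;) {
		toy_lock(np);
		if ((np->qsmask & mask) == 0) {
			/* Already reported: nothing to propagate. */
			toy_unlock(np);
			return false;
		}
		np->qsmask &= ~mask;
		if (np->qsmask != 0) {
			/*
			 * Others are still outstanding. Whoever clears the
			 * last bit must take this same lock, which orders
			 * this report before the rest of that traversal.
			 */
			toy_unlock(np);
			return false;
		}
		if (np->parent == NULL) {
			toy_unlock(np);
			return true; /* last bit cleared at the root */
		}
		mask = np->grpmask; /* report this whole subtree upward */
		toy_unlock(np);
		np = np->parent;
	}
}
```

In this model a report that stops partway up the tree leaves its effects published under that node's lock, and the later report that continues from the same node acquires the same lock first. That pairing is the toy analogue of the lock-based ordering argument made in the text.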
408 + 409 + +-----------------------------------------------------------------------+ 410 + | **Quick Quiz**: | 411 + +-----------------------------------------------------------------------+ 412 + | But an RCU read-side critical section might have started after the | 413 + | beginning of the grace period (the advancing of ``->gp_seq`` from | 414 + | earlier), so why should the grace period wait on such a critical | 415 + | section? | 416 + +-----------------------------------------------------------------------+ 417 + | **Answer**: | 418 + +-----------------------------------------------------------------------+ 419 + | It is indeed not necessary for the grace period to wait on such a | 420 + | critical section. However, it is permissible to wait on it. And it is | 421 + | furthermore important to wait on it, as this lazy approach is far | 422 + | more scalable than a “big bang” all-at-once grace-period start could | 423 + | possibly be. | 424 + +-----------------------------------------------------------------------+ 425 + 426 + If the CPU does a context switch, a quiescent state will be noted by 427 + ``rcu_note_context_switch()`` on the left. On the other hand, if the CPU 428 + takes a scheduler-clock interrupt while executing in usermode, a 429 + quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right. 430 + Either way, the passage through a quiescent state will be noted in a 431 + per-CPU variable. 432 + 433 + The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for 434 + example, after the next scheduler-clock interrupt), ``rcu_core()`` will 435 + invoke ``rcu_check_quiescent_state()``, which will notice the recorded 436 + quiescent state, and invoke ``rcu_report_qs_rdp()``.
If 437 + ``rcu_report_qs_rdp()`` verifies that the quiescent state really does 438 + apply to the current grace period, it invokes ``rcu_report_rnp()`` which 439 + traverses up the ``rcu_node`` tree as shown at the bottom of the 440 + diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask`` 441 + field, and propagating up the tree when the result is zero. 442 + 443 + Note that traversal passes upwards out of a given ``rcu_node`` structure 444 + only if the current CPU is reporting the last quiescent state for the 445 + subtree headed by that ``rcu_node`` structure. A key point is that if a 446 + CPU's traversal stops at a given ``rcu_node`` structure, then there will 447 + be a later traversal by another CPU (or perhaps the same one) that 448 + proceeds upwards from that point, and the ``rcu_node`` ``->lock`` 449 + guarantees that the first CPU's quiescent state happens before the 450 + remainder of the second CPU's traversal. Applying this line of thought 451 + repeatedly shows that all CPUs' quiescent states happen before the last 452 + CPU traverses through the root ``rcu_node`` structure, the “last CPU” 453 + being the one that clears the last bit in the root ``rcu_node`` 454 + structure's ``->qsmask`` field. 455 + 456 + Dynamic Tick Interface 457 + ^^^^^^^^^^^^^^^^^^^^^^ 458 + 459 + Due to energy-efficiency considerations, RCU is forbidden from 460 + disturbing idle CPUs. CPUs are therefore required to notify RCU when 461 + entering or leaving idle state, which they do via fully ordered 462 + value-returning atomic operations on a per-CPU variable. The ordering 463 + effects are as shown below: 464 + 465 + .. kernel-figure:: TreeRCU-dyntick.svg 466 + 467 + The RCU grace-period kernel thread samples the per-CPU idleness variable 468 + while holding the corresponding CPU's leaf ``rcu_node`` structure's 469 + ``->lock``. 
This means that any RCU read-side critical sections that 470 + precede the idle period (the oval near the top of the diagram above) 471 + will happen before the end of the current grace period. Similarly, the 472 + beginning of the current grace period will happen before any RCU 473 + read-side critical sections that follow the idle period (the oval near 474 + the bottom of the diagram above). 475 + 476 + Plumbing this into the full grace-period execution is described 477 + `below <#Forcing%20Quiescent%20States>`__. 478 + 479 + CPU-Hotplug Interface 480 + ^^^^^^^^^^^^^^^^^^^^^ 481 + 482 + RCU is also forbidden from disturbing offline CPUs, which might well be 483 + powered off and removed from the system completely. CPUs are therefore 484 + required to notify RCU of their comings and goings as part of the 485 + corresponding CPU hotplug operations. The ordering effects are shown 486 + below: 487 + 488 + .. kernel-figure:: TreeRCU-hotplug.svg 489 + 490 + Because CPU hotplug operations are much less frequent than idle 491 + transitions, they are heavier weight, and thus acquire the CPU's leaf 492 + ``rcu_node`` structure's ``->lock`` and update this structure's 493 + ``->qsmaskinitnext``. The RCU grace-period kernel thread samples this 494 + mask to detect CPUs having gone offline since the beginning of this 495 + grace period. 496 + 497 + Plumbing this into the full grace-period execution is described 498 + `below <#Forcing%20Quiescent%20States>`__. 499 + 500 + Forcing Quiescent States 501 + ^^^^^^^^^^^^^^^^^^^^^^^^ 502 + 503 + As noted above, idle and offline CPUs cannot report their own quiescent 504 + states, and therefore the grace-period kernel thread must do the 505 + reporting on their behalf. This process is called “forcing quiescent 506 + states”, it is repeated every few jiffies, and its ordering effects are 507 + shown below: 508 + 509 + .. 
kernel-figure:: TreeRCU-gp-fqs.svg 510 + 511 + Each pass of quiescent state forcing is guaranteed to traverse the leaf 512 + ``rcu_node`` structures, and if there are no new quiescent states due to 513 + recently idled and/or offlined CPUs, then only the leaves are traversed. 514 + However, if there is a newly offlined CPU as illustrated on the left or 515 + a newly idled CPU as illustrated on the right, the corresponding 516 + quiescent state will be driven up towards the root. As with 517 + self-reported quiescent states, the upwards driving stops once it 518 + reaches an ``rcu_node`` structure that has quiescent states outstanding 519 + from other CPUs. 520 + 521 + +-----------------------------------------------------------------------+ 522 + | **Quick Quiz**: | 523 + +-----------------------------------------------------------------------+ 524 + | The leftmost drive to root stopped before it reached the root | 525 + | ``rcu_node`` structure, which means that there are still CPUs | 526 + | subordinate to that structure on which the current grace period is | 527 + | waiting. Given that, how is it possible that the rightmost drive to | 528 + | root ended the grace period? | 529 + +-----------------------------------------------------------------------+ 530 + | **Answer**: | 531 + +-----------------------------------------------------------------------+ 532 + | Good analysis! It is in fact impossible in the absence of bugs in | 533 + | RCU. But this diagram is complex enough as it is, so simplicity | 534 + | overrode accuracy. You can think of it as poetic license, or you can | 535 + | think of it as misdirection that is resolved in the | 536 + | `stitched-together diagram <#Putting%20It%20All%20Together>`__. 
| 537 + +-----------------------------------------------------------------------+ 538 + 539 + Grace-Period Cleanup 540 + ^^^^^^^^^^^^^^^^^^^^ 541 + 542 + Grace-period cleanup first scans the ``rcu_node`` tree breadth-first 543 + advancing all the ``->gp_seq`` fields, then it advances the 544 + ``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are 545 + shown below: 546 + 547 + .. kernel-figure:: TreeRCU-gp-cleanup.svg 548 + 549 + As indicated by the oval at the bottom of the diagram, once grace-period 550 + cleanup is complete, the next grace period can begin. 551 + 552 + +-----------------------------------------------------------------------+ 553 + | **Quick Quiz**: | 554 + +-----------------------------------------------------------------------+ 555 + | But when precisely does the grace period end? | 556 + +-----------------------------------------------------------------------+ 557 + | **Answer**: | 558 + +-----------------------------------------------------------------------+ 559 + | There is no useful single point at which the grace period can be said | 560 + | to end. The earliest reasonable candidate is as soon as the last CPU | 561 + | has reported its quiescent state, but it may be some milliseconds | 562 + | before RCU becomes aware of this. The latest reasonable candidate is | 563 + | once the ``rcu_state`` structure's ``->gp_seq`` field has been | 564 + | updated, but it is quite possible that some CPUs have already | 565 + | completed phase two of their updates by that time. In short, if you | 566 + | are going to work with RCU, you need to learn to embrace uncertainty. | 567 + +-----------------------------------------------------------------------+ 568 + 569 + Callback Invocation 570 + ^^^^^^^^^^^^^^^^^^^ 571 + 572 + Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has 573 + been updated, that CPU can begin invoking its RCU callbacks that were 574 + waiting for this grace period to end. 
These callbacks are identified by 575 + ``rcu_advance_cbs()``, which is usually invoked by 576 + ``__note_gp_changes()``. As shown in the diagram below, this invocation 577 + can be triggered by the scheduling-clock interrupt 578 + (``rcu_sched_clock_irq()`` on the left) or by idle entry 579 + (``rcu_cleanup_after_idle()`` on the right, but only for kernels built 580 + with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is 581 + raised, which results in ``rcu_do_batch()`` invoking the callbacks, 582 + which in turn allows those callbacks to carry out (either directly or 583 + indirectly via wakeup) the needed phase-two processing for each update. 584 + 585 + .. kernel-figure:: TreeRCU-callback-invocation.svg 586 + 587 + Please note that callback invocation can also be prompted by any number 588 + of corner-case code paths, for example, when a CPU notes that it has 589 + excessive numbers of callbacks queued. In all cases, the CPU acquires 590 + its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks, 591 + which preserves the required ordering against the newly completed grace 592 + period. 593 + 594 + However, if the callback function communicates to other CPUs, for 595 + example, doing a wakeup, then it is that function's responsibility to 596 + maintain ordering. For example, if the callback function wakes up a task 597 + that runs on some other CPU, proper ordering must be in place in both the 598 + callback function and the task being awakened. To see why this is 599 + important, consider the top half of the `grace-period 600 + cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be 601 + running on a CPU corresponding to the leftmost leaf ``rcu_node`` 602 + structure, and awaken a task that is to run on a CPU corresponding to 603 + the rightmost leaf ``rcu_node`` structure, and the grace-period kernel 604 + thread might not yet have reached the rightmost leaf.
In this case, the 605 + grace period's memory ordering might not yet have reached that CPU, so 606 + again the callback function and the awakened task must supply proper 607 + ordering. 608 + 609 + Putting It All Together 610 + ~~~~~~~~~~~~~~~~~~~~~~~ 611 + 612 + A stitched-together diagram is here: 613 + 614 + .. kernel-figure:: TreeRCU-gp.svg 615 + 616 + Legal Statement 617 + ~~~~~~~~~~~~~~~ 618 + 619 + This work represents the view of the author and does not necessarily 620 + represent the view of IBM. 621 + 622 + Linux is a registered trademark of Linus Torvalds. 623 + 624 + Other company, product, and service names may be trademarks or service 625 + marks of others.

-3330

Documentation/RCU/Design/Requirements/Requirements.html

··· 1 - <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2 - "http://www.w3.org/TR/html4/loose.dtd"> 3 - <html> 4 - <head><title>A Tour Through RCU's Requirements [LWN.net]</title> 5 - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> 6 - 7 - <h1>A Tour Through RCU's Requirements</h1> 8 - 9 - <p>Copyright IBM Corporation, 2015</p> 10 - <p>Author: Paul E. McKenney</p> 11 - <p><i>The initial version of this document appeared in the 12 - <a href="https://lwn.net/">LWN</a> articles 13 - <a href="https://lwn.net/Articles/652156/">here</a>, 14 - <a href="https://lwn.net/Articles/652677/">here</a>, and 15 - <a href="https://lwn.net/Articles/653326/">here</a>.</i></p> 16 - 17 - <h2>Introduction</h2> 18 - 19 - <p> 20 - Read-copy update (RCU) is a synchronization mechanism that is often 21 - used as a replacement for reader-writer locking. 22 - RCU is unusual in that updaters do not block readers, 23 - which means that RCU's read-side primitives can be exceedingly fast 24 - and scalable. 25 - In addition, updaters can make useful forward progress concurrently 26 - with readers. 27 - However, all this concurrency between RCU readers and updaters does raise 28 - the question of exactly what RCU readers are doing, which in turn 29 - raises the question of exactly what RCU's requirements are. 30 - 31 - <p> 32 - This document therefore summarizes RCU's requirements, and can be thought 33 - of as an informal, high-level specification for RCU. 34 - It is important to understand that RCU's specification is primarily 35 - empirical in nature; 36 - in fact, I learned about many of these requirements the hard way. 37 - This situation might cause some consternation, however, not only 38 - has this learning process been a lot of fun, but it has also been 39 - a great privilege to work with so many people willing to apply 40 - technologies in interesting new ways. 
41 - 42 - <p> 43 - All that aside, here are the categories of currently known RCU requirements: 44 - </p> 45 - 46 - <ol> 47 - <li> <a href="#Fundamental Requirements"> 48 - Fundamental Requirements</a> 49 - <li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a> 50 - <li> <a href="#Parallelism Facts of Life"> 51 - Parallelism Facts of Life</a> 52 - <li> <a href="#Quality-of-Implementation Requirements"> 53 - Quality-of-Implementation Requirements</a> 54 - <li> <a href="#Linux Kernel Complications"> 55 - Linux Kernel Complications</a> 56 - <li> <a href="#Software-Engineering Requirements"> 57 - Software-Engineering Requirements</a> 58 - <li> <a href="#Other RCU Flavors"> 59 - Other RCU Flavors</a> 60 - <li> <a href="#Possible Future Changes"> 61 - Possible Future Changes</a> 62 - </ol> 63 - 64 - <p> 65 - This is followed by a <a href="#Summary">summary</a>, 66 - however, the answers to each quick quiz immediately follows the quiz. 67 - Select the big white space with your mouse to see the answer. 68 - 69 - <h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2> 70 - 71 - <p> 72 - RCU's fundamental requirements are the closest thing RCU has to hard 73 - mathematical requirements. 
74 - These are: 75 - 76 - <ol> 77 - <li> <a href="#Grace-Period Guarantee"> 78 - Grace-Period Guarantee</a> 79 - <li> <a href="#Publish-Subscribe Guarantee"> 80 - Publish-Subscribe Guarantee</a> 81 - <li> <a href="#Memory-Barrier Guarantees"> 82 - Memory-Barrier Guarantees</a> 83 - <li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally"> 84 - RCU Primitives Guaranteed to Execute Unconditionally</a> 85 - <li> <a href="#Guaranteed Read-to-Write Upgrade"> 86 - Guaranteed Read-to-Write Upgrade</a> 87 - </ol> 88 - 89 - <h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3> 90 - 91 - <p> 92 - RCU's grace-period guarantee is unusual in being premeditated: 93 - Jack Slingwine and I had this guarantee firmly in mind when we started 94 - work on RCU (then called “rclock”) in the early 1990s. 95 - That said, the past two decades of experience with RCU have produced 96 - a much more detailed understanding of this guarantee. 97 - 98 - <p> 99 - RCU's grace-period guarantee allows updaters to wait for the completion 100 - of all pre-existing RCU read-side critical sections. 101 - An RCU read-side critical section 102 - begins with the marker <tt>rcu_read_lock()</tt> and ends with 103 - the marker <tt>rcu_read_unlock()</tt>. 104 - These markers may be nested, and RCU treats a nested set as one 105 - big RCU read-side critical section. 106 - Production-quality implementations of <tt>rcu_read_lock()</tt> and 107 - <tt>rcu_read_unlock()</tt> are extremely lightweight, and in 108 - fact have exactly zero overhead in Linux kernels built for production 109 - use with <tt>CONFIG_PREEMPT=n</tt>. 
110 - 111 - <p> 112 - This guarantee allows ordering to be enforced with extremely low 113 - overhead to readers, for example: 114 - 115 - <blockquote> 116 - <pre> 117 - 1 int x, y; 118 - 2 119 - 3 void thread0(void) 120 - 4 { 121 - 5 rcu_read_lock(); 122 - 6 r1 = READ_ONCE(x); 123 - 7 r2 = READ_ONCE(y); 124 - 8 rcu_read_unlock(); 125 - 9 } 126 - 10 127 - 11 void thread1(void) 128 - 12 { 129 - 13 WRITE_ONCE(x, 1); 130 - 14 synchronize_rcu(); 131 - 15 WRITE_ONCE(y, 1); 132 - 16 } 133 - </pre> 134 - </blockquote> 135 - 136 - <p> 137 - Because the <tt>synchronize_rcu()</tt> on line 14 waits for 138 - all pre-existing readers, any instance of <tt>thread0()</tt> that 139 - loads a value of zero from <tt>x</tt> must complete before 140 - <tt>thread1()</tt> stores to <tt>y</tt>, so that instance must 141 - also load a value of zero from <tt>y</tt>. 142 - Similarly, any instance of <tt>thread0()</tt> that loads a value of 143 - one from <tt>y</tt> must have started after the 144 - <tt>synchronize_rcu()</tt> started, and must therefore also load 145 - a value of one from <tt>x</tt>. 146 - Therefore, the outcome: 147 - <blockquote> 148 - <pre> 149 - (r1 == 0 && r2 == 1) 150 - </pre> 151 - </blockquote> 152 - cannot happen. 153 - 154 - <table> 155 - <tr><th> </th></tr> 156 - <tr><th align="left">Quick Quiz:</th></tr> 157 - <tr><td> 158 - Wait a minute! 159 - You said that updaters can make useful forward progress concurrently 160 - with readers, but pre-existing readers will block 161 - <tt>synchronize_rcu()</tt>!!! 162 - Just who are you trying to fool??? 163 - </td></tr> 164 - <tr><th align="left">Answer:</th></tr> 165 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 166 - First, if updaters do not wish to be blocked by readers, they can use 167 - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will 168 - be discussed later. 
169 - Second, even when using <tt>synchronize_rcu()</tt>, the other 170 - update-side code does run concurrently with readers, whether 171 - pre-existing or not. 172 - </font></td></tr> 173 - <tr><td> </td></tr> 174 - </table> 175 - 176 - <p> 177 - This scenario resembles one of the first uses of RCU in 178 - <a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>, 179 - which managed a distributed lock manager's transition into 180 - a state suitable for handling recovery from node failure, 181 - more or less as follows: 182 - 183 - <blockquote> 184 - <pre> 185 - 1 #define STATE_NORMAL 0 186 - 2 #define STATE_WANT_RECOVERY 1 187 - 3 #define STATE_RECOVERING 2 188 - 4 #define STATE_WANT_NORMAL 3 189 - 5 190 - 6 int state = STATE_NORMAL; 191 - 7 192 - 8 void do_something_dlm(void) 193 - 9 { 194 - 10 int state_snap; 195 - 11 196 - 12 rcu_read_lock(); 197 - 13 state_snap = READ_ONCE(state); 198 - 14 if (state_snap == STATE_NORMAL) 199 - 15 do_something(); 200 - 16 else 201 - 17 do_something_carefully(); 202 - 18 rcu_read_unlock(); 203 - 19 } 204 - 20 205 - 21 void start_recovery(void) 206 - 22 { 207 - 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); 208 - 24 synchronize_rcu(); 209 - 25 WRITE_ONCE(state, STATE_RECOVERING); 210 - 26 recovery(); 211 - 27 WRITE_ONCE(state, STATE_WANT_NORMAL); 212 - 28 synchronize_rcu(); 213 - 29 WRITE_ONCE(state, STATE_NORMAL); 214 - 30 } 215 - </pre> 216 - </blockquote> 217 - 218 - <p> 219 - The RCU read-side critical section in <tt>do_something_dlm()</tt> 220 - works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt> 221 - to guarantee that <tt>do_something()</tt> never runs concurrently 222 - with <tt>recovery()</tt>, but with little or no synchronization 223 - overhead in <tt>do_something_dlm()</tt>. 224 - 225 - <table> 226 - <tr><th> </th></tr> 227 - <tr><th align="left">Quick Quiz:</th></tr> 228 - <tr><td> 229 - Why is the <tt>synchronize_rcu()</tt> on line 28 needed? 
230 - </td></tr> 231 - <tr><th align="left">Answer:</th></tr> 232 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 233 - Without that extra grace period, memory reordering could result in 234 - <tt>do_something_dlm()</tt> executing <tt>do_something()</tt> 235 - concurrently with the last bits of <tt>recovery()</tt>. 236 - </font></td></tr> 237 - <tr><td> </td></tr> 238 - </table> 239 - 240 - <p> 241 - In order to avoid fatal problems such as deadlocks, 242 - an RCU read-side critical section must not contain calls to 243 - <tt>synchronize_rcu()</tt>. 244 - Similarly, an RCU read-side critical section must not 245 - contain anything that waits, directly or indirectly, on completion of 246 - an invocation of <tt>synchronize_rcu()</tt>. 247 - 248 - <p> 249 - Although RCU's grace-period guarantee is useful in and of itself, with 250 - <a href="https://lwn.net/Articles/573497/">quite a few use cases</a>, 251 - it would be good to be able to use RCU to coordinate read-side 252 - access to linked data structures. 253 - For this, the grace-period guarantee is not sufficient, as can 254 - be seen in function <tt>add_gp_buggy()</tt> below. 255 - We will look at the reader's code later, but in the meantime, just think of 256 - the reader as locklessly picking up the <tt>gp</tt> pointer, 257 - and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the 258 - <tt>->a</tt> and <tt>->b</tt> fields. 
259 - 260 - <blockquote> 261 - <pre> 262 - 1 bool add_gp_buggy(int a, int b) 263 - 2 { 264 - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 265 - 4 if (!p) 266 - 5 return -ENOMEM; 267 - 6 spin_lock(&gp_lock); 268 - 7 if (rcu_access_pointer(gp)) { 269 - 8 spin_unlock(&gp_lock); 270 - 9 return false; 271 - 10 } 272 - 11 p->a = a; 273 - 12 p->b = b; 274 - 13 gp = p; /* ORDERING BUG */ 275 - 14 spin_unlock(&gp_lock); 276 - 15 return true; 277 - 16 } 278 - </pre> 279 - </blockquote> 280 - 281 - <p> 282 - The problem is that both the compiler and weakly ordered CPUs are within 283 - their rights to reorder this code as follows: 284 - 285 - <blockquote> 286 - <pre> 287 - 1 bool add_gp_buggy_optimized(int a, int b) 288 - 2 { 289 - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 290 - 4 if (!p) 291 - 5 return -ENOMEM; 292 - 6 spin_lock(&gp_lock); 293 - 7 if (rcu_access_pointer(gp)) { 294 - 8 spin_unlock(&gp_lock); 295 - 9 return false; 296 - 10 } 297 - <b>11 gp = p; /* ORDERING BUG */ 298 - 12 p->a = a; 299 - 13 p->b = b;</b> 300 - 14 spin_unlock(&gp_lock); 301 - 15 return true; 302 - 16 } 303 - </pre> 304 - </blockquote> 305 - 306 - <p> 307 - If an RCU reader fetches <tt>gp</tt> just after 308 - <tt>add_gp_buggy_optimized()</tt> executes line 11, 309 - it will see garbage in the <tt>->a</tt> and <tt>->b</tt> 310 - fields. 311 - And this is but one of many ways in which compiler and hardware optimizations 312 - could cause trouble. 313 - Therefore, we clearly need some way to prevent the compiler and the CPU from 314 - reordering in this manner, which brings us to the publish-subscribe 315 - guarantee discussed in the next section. 316 - 317 - <h3><a name="Publish-Subscribe Guarantee">Publish-Subscribe Guarantee</a></h3> 318 - 319 - <p> 320 - RCU's publish-subscribe guarantee allows data to be inserted 321 - into a linked data structure without disrupting RCU readers.
322 - The updater uses <tt>rcu_assign_pointer()</tt> to insert the 323 - new data, and readers use <tt>rcu_dereference()</tt> to 324 - access data, whether new or old. 325 - The following shows an example of insertion: 326 - 327 - <blockquote> 328 - <pre> 329 - 1 bool add_gp(int a, int b) 330 - 2 { 331 - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 332 - 4 if (!p) 333 - 5 return -ENOMEM; 334 - 6 spin_lock(&gp_lock); 335 - 7 if (rcu_access_pointer(gp)) { 336 - 8 spin_unlock(&gp_lock); 337 - 9 return false; 338 - 10 } 339 - 11 p->a = a; 340 - 12 p->b = b; 341 - 13 rcu_assign_pointer(gp, p); 342 - 14 spin_unlock(&gp_lock); 343 - 15 return true; 344 - 16 } 345 - </pre> 346 - </blockquote> 347 - 348 - <p> 349 - The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually 350 - equivalent to a simple assignment statement, but also guarantees 351 - that its assignment will 352 - happen after the two assignments in lines 11 and 12, 353 - similar to the C11 <tt>memory_order_release</tt> store operation. 354 - It also prevents any number of “interesting” compiler 355 - optimizations, for example, the use of <tt>gp</tt> as a scratch 356 - location immediately preceding the assignment. 357 - 358 - <table> 359 - <tr><th> </th></tr> 360 - <tr><th align="left">Quick Quiz:</th></tr> 361 - <tr><td> 362 - But <tt>rcu_assign_pointer()</tt> does nothing to prevent the 363 - two assignments to <tt>p->a</tt> and <tt>p->b</tt> 364 - from being reordered. 365 - Can't that also cause problems? 366 - </td></tr> 367 - <tr><th align="left">Answer:</th></tr> 368 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 369 - No, it cannot. 370 - The readers cannot see either of these two fields until 371 - the assignment to <tt>gp</tt>, by which time both fields are 372 - fully initialized. 373 - So reordering the assignments 374 - to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly 375 - cause any problems.
376 - </font></td></tr> 377 - <tr><td> </td></tr> 378 - </table> 379 - 380 - <p> 381 - It is tempting to assume that the reader need not do anything special 382 - to control its accesses to the RCU-protected data, 383 - as shown in <tt>do_something_gp_buggy()</tt> below: 384 - 385 - <blockquote> 386 - <pre> 387 - 1 bool do_something_gp_buggy(void) 388 - 2 { 389 - 3 rcu_read_lock(); 390 - 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ 391 - 5 if (p) { 392 - 6 do_something(p->a, p->b); 393 - 7 rcu_read_unlock(); 394 - 8 return true; 395 - 9 } 396 - 10 rcu_read_unlock(); 397 - 11 return false; 398 - 12 } 399 - </pre> 400 - </blockquote> 401 - 402 - <p> 403 - However, this temptation must be resisted because there are a 404 - surprisingly large number of ways that the compiler 405 - (to say nothing of 406 - <a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) 407 - can trip this code up. 408 - For but one example, if the compiler were short of registers, it 409 - might choose to refetch from <tt>gp</tt> rather than keeping 410 - a separate copy in <tt>p</tt> as follows: 411 - 412 - <blockquote> 413 - <pre> 414 - 1 bool do_something_gp_buggy_optimized(void) 415 - 2 { 416 - 3 rcu_read_lock(); 417 - 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ 418 - <b> 5 do_something(gp->a, gp->b);</b> 419 - 6 rcu_read_unlock(); 420 - 7 return true; 421 - 8 } 422 - 9 rcu_read_unlock(); 423 - 10 return false; 424 - 11 } 425 - </pre> 426 - </blockquote> 427 - 428 - <p> 429 - If this function ran concurrently with a series of updates that 430 - replaced the current structure with a new one, 431 - the fetches of <tt>gp->a</tt> 432 - and <tt>gp->b</tt> might well come from two different structures, 433 - which could cause serious confusion. 
434 - To prevent this (and much else besides), <tt>do_something_gp()</tt> uses 435 - <tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: 436 - 437 - <blockquote> 438 - <pre> 439 - 1 bool do_something_gp(void) 440 - 2 { 441 - 3 rcu_read_lock(); 442 - 4 p = rcu_dereference(gp); 443 - 5 if (p) { 444 - 6 do_something(p->a, p->b); 445 - 7 rcu_read_unlock(); 446 - 8 return true; 447 - 9 } 448 - 10 rcu_read_unlock(); 449 - 11 return false; 450 - 12 } 451 - </pre> 452 - </blockquote> 453 - 454 - <p> 455 - The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) 456 - memory barriers in the Linux kernel. 457 - Should a 458 - <a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> 459 - ever appear, then <tt>rcu_dereference()</tt> could be implemented 460 - as a <tt>memory_order_consume</tt> load. 461 - Regardless of the exact implementation, a pointer fetched by 462 - <tt>rcu_dereference()</tt> may not be used outside of the 463 - outermost RCU read-side critical section containing that 464 - <tt>rcu_dereference()</tt>, unless protection of 465 - the corresponding data element has been passed from RCU to some 466 - other synchronization mechanism, most commonly locking or 467 - <a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. 468 - 469 - <p> 470 - In short, updaters use <tt>rcu_assign_pointer()</tt> and readers 471 - use <tt>rcu_dereference()</tt>, and these two RCU API elements 472 - work together to ensure that readers have a consistent view of 473 - newly added data elements. 474 - 475 - <p> 476 - Of course, it is also necessary to remove elements from RCU-protected 477 - data structures, for example, using the following process: 478 - 479 - <ol> 480 - <li> Remove the data element from the enclosing structure. 
481 - <li> Wait for all pre-existing RCU read-side critical sections 482 - to complete (because only pre-existing readers can possibly have 483 - a reference to the newly removed data element). 484 - <li> At this point, only the updater has a reference to the 485 - newly removed data element, so it can safely reclaim 486 - the data element, for example, by passing it to <tt>kfree()</tt>. 487 - </ol> 488 - 489 - This process is implemented by <tt>remove_gp_synchronous()</tt>: 490 - 491 - <blockquote> 492 - <pre> 493 - 1 bool remove_gp_synchronous(void) 494 - 2 { 495 - 3 struct foo *p; 496 - 4 497 - 5 spin_lock(&gp_lock); 498 - 6 p = rcu_access_pointer(gp); 499 - 7 if (!p) { 500 - 8 spin_unlock(&gp_lock); 501 - 9 return false; 502 - 10 } 503 - 11 rcu_assign_pointer(gp, NULL); 504 - 12 spin_unlock(&gp_lock); 505 - 13 synchronize_rcu(); 506 - 14 kfree(p); 507 - 15 return true; 508 - 16 } 509 - </pre> 510 - </blockquote> 511 - 512 - <p> 513 - This function is straightforward, with line 13 waiting for a grace 514 - period before line 14 frees the old data element. 515 - This waiting ensures that readers will reach line 7 of 516 - <tt>do_something_gp()</tt> before the data element referenced by 517 - <tt>p</tt> is freed. 518 - The <tt>rcu_access_pointer()</tt> on line 6 is similar to 519 - <tt>rcu_dereference()</tt>, except that: 520 - 521 - <ol> 522 - <li> The value returned by <tt>rcu_access_pointer()</tt> 523 - cannot be dereferenced. 524 - If you want to access the value pointed to as well as 525 - the pointer itself, use <tt>rcu_dereference()</tt> 526 - instead of <tt>rcu_access_pointer()</tt>. 527 - <li> The call to <tt>rcu_access_pointer()</tt> need not be 528 - protected. 529 - In contrast, <tt>rcu_dereference()</tt> must either be 530 - within an RCU read-side critical section or in a code 531 - segment where the pointer cannot change, for example, in 532 - code protected by the corresponding update-side lock. 
533 - </ol> 534 - 535 - <table> 536 - <tr><th> </th></tr> 537 - <tr><th align="left">Quick Quiz:</th></tr> 538 - <tr><td> 539 - Without the <tt>rcu_dereference()</tt> or the 540 - <tt>rcu_access_pointer()</tt>, what destructive optimizations 541 - might the compiler make use of? 542 - </td></tr> 543 - <tr><th align="left">Answer:</th></tr> 544 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 545 - Let's start with what happens to <tt>do_something_gp()</tt> 546 - if it fails to use <tt>rcu_dereference()</tt>. 547 - It could reuse a value formerly fetched from this same pointer. 548 - It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time 549 - manner, resulting in <i>load tearing</i>, in turn resulting in a bytewise 550 - mash-up of two distinct pointer values. 551 - It might even use value-speculation optimizations, where it makes 552 - a wrong guess, but by the time it gets around to checking the 553 - value, an update has changed the pointer to match the wrong guess. 554 - Too bad about any dereferences that returned pre-initialization garbage 555 - in the meantime! 556 - </font> 557 - 558 - <p><font color="ffffff"> 559 - For <tt>remove_gp_synchronous()</tt>, as long as all modifications 560 - to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, 561 - the above optimizations are harmless. 562 - However, <tt>sparse</tt> will complain if you 563 - define <tt>gp</tt> with <tt>__rcu</tt> and then 564 - access it without using 565 - either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. 566 - </font></td></tr> 567 - <tr><td> </td></tr> 568 - </table> 569 - 570 - <p> 571 - In short, RCU's publish-subscribe guarantee is provided by the combination 572 - of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. 573 - This guarantee allows data elements to be safely added to RCU-protected 574 - linked data structures without disrupting RCU readers. 
575 - This guarantee can be used in combination with the grace-period 576 - guarantee to also allow data elements to be removed from RCU-protected 577 - linked data structures, again without disrupting RCU readers. 578 - 579 - <p> 580 - This guarantee was only partially premeditated. 581 - DYNIX/ptx used an explicit memory barrier for publication, but had nothing 582 - resembling <tt>rcu_dereference()</tt> for subscription, nor did it 583 - have anything resembling the <tt>smp_read_barrier_depends()</tt> 584 - that was later subsumed into <tt>rcu_dereference()</tt> and later 585 - still into <tt>READ_ONCE()</tt>. 586 - The need for these operations made itself known quite suddenly at a 587 - late-1990s meeting with the DEC Alpha architects, back in the days when 588 - DEC was still a free-standing company. 589 - It took the Alpha architects a good hour to convince me that any sort 590 - of barrier would ever be needed, and it then took me a good <i>two</i> hours 591 - to convince them that their documentation did not make this point clear. 592 - More recent work with the C and C++ standards committees has provided 593 - much education on tricks and traps from the compiler. 594 - In short, compilers were much less tricky in the early 1990s, but in 595 - 2015, don't even think about omitting <tt>rcu_dereference()</tt>! 596 - 597 - <h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> 598 - 599 - <p> 600 - The previous section's simple linked-data-structure scenario clearly 601 - demonstrates the need for RCU's stringent memory-ordering guarantees on 602 - systems with more than one CPU: 603 - 604 - <ol> 605 - <li> Each CPU that has an RCU read-side critical section that 606 - begins before <tt>synchronize_rcu()</tt> starts is 607 - guaranteed to execute a full memory barrier between the time 608 - that the RCU read-side critical section ends and the time that 609 - <tt>synchronize_rcu()</tt> returns. 
610 - Without this guarantee, a pre-existing RCU read-side critical section 611 - might hold a reference to the newly removed <tt>struct foo</tt> 612 - after the <tt>kfree()</tt> on line 14 of 613 - <tt>remove_gp_synchronous()</tt>. 614 - <li> Each CPU that has an RCU read-side critical section that ends 615 - after <tt>synchronize_rcu()</tt> returns is guaranteed 616 - to execute a full memory barrier between the time that 617 - <tt>synchronize_rcu()</tt> begins and the time that the RCU 618 - read-side critical section begins. 619 - Without this guarantee, a later RCU read-side critical section 620 - running after the <tt>kfree()</tt> on line 14 of 621 - <tt>remove_gp_synchronous()</tt> might 622 - later run <tt>do_something_gp()</tt> and find the 623 - newly deleted <tt>struct foo</tt>. 624 - <li> If the task invoking <tt>synchronize_rcu()</tt> remains 625 - on a given CPU, then that CPU is guaranteed to execute a full 626 - memory barrier sometime during the execution of 627 - <tt>synchronize_rcu()</tt>. 628 - This guarantee ensures that the <tt>kfree()</tt> on 629 - line 14 of <tt>remove_gp_synchronous()</tt> really does 630 - execute after the removal on line 11. 631 - <li> If the task invoking <tt>synchronize_rcu()</tt> migrates 632 - among a group of CPUs during that invocation, then each of the 633 - CPUs in that group is guaranteed to execute a full memory barrier 634 - sometime during the execution of <tt>synchronize_rcu()</tt>. 635 - This guarantee also ensures that the <tt>kfree()</tt> on 636 - line 14 of <tt>remove_gp_synchronous()</tt> really does 637 - execute after the removal on 638 - line 11, but also in the case where the thread executing the 639 - <tt>synchronize_rcu()</tt> migrates in the meantime. 
640 - </ol> 641 - 642 - <table> 643 - <tr><th> </th></tr> 644 - <tr><th align="left">Quick Quiz:</th></tr> 645 - <tr><td> 646 - Given that multiple CPUs can start RCU read-side critical sections 647 - at any time without any ordering whatsoever, how can RCU possibly 648 - tell whether or not a given RCU read-side critical section starts 649 - before a given instance of <tt>synchronize_rcu()</tt>? 650 - </td></tr> 651 - <tr><th align="left">Answer:</th></tr> 652 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 653 - If RCU cannot tell whether or not a given 654 - RCU read-side critical section starts before a 655 - given instance of <tt>synchronize_rcu()</tt>, 656 - then it must assume that the RCU read-side critical section 657 - started first. 658 - In other words, a given instance of <tt>synchronize_rcu()</tt> 659 - can avoid waiting on a given RCU read-side critical section only 660 - if it can prove that <tt>synchronize_rcu()</tt> started first. 661 - </font> 662 - 663 - <p><font color="ffffff"> 664 - A related question is “When <tt>rcu_read_lock()</tt> 665 - doesn't generate any code, why does it matter how it relates 666 - to a grace period?” 667 - The answer is that it is not the relationship of 668 - <tt>rcu_read_lock()</tt> itself that is important, but rather 669 - the relationship of the code within the enclosed RCU read-side 670 - critical section to the code preceding and following the 671 - grace period. 672 - If we take this viewpoint, then a given RCU read-side critical 673 - section begins before a given grace period when some access 674 - preceding the grace period observes the effect of some access 675 - within the critical section, in which case none of the accesses 676 - within the critical section may observe the effects of any 677 - access following the grace period. 
678 - </font> 679 - 680 - <p><font color="ffffff"> 681 - As of late 2016, mathematical models of RCU take this 682 - viewpoint, for example, see slides 62 and 63 683 - of the 684 - <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> 685 - presentation. 686 - </font></td></tr> 687 - <tr><td> </td></tr> 688 - </table> 689 - 690 - <table> 691 - <tr><th> </th></tr> 692 - <tr><th align="left">Quick Quiz:</th></tr> 693 - <tr><td> 694 - The first and second guarantees require unbelievably strict ordering! 695 - Are all these memory barriers <i> really</i> required? 696 - </td></tr> 697 - <tr><th align="left">Answer:</th></tr> 698 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 699 - Yes, they really are required. 700 - To see why the first guarantee is required, consider the following 701 - sequence of events: 702 - </font> 703 - 704 - <ol> 705 - <li> <font color="ffffff"> 706 - CPU 1: <tt>rcu_read_lock()</tt> 707 - </font> 708 - <li> <font color="ffffff"> 709 - CPU 1: <tt>q = rcu_dereference(gp); 710 - /* Very likely to return p. */</tt> 711 - </font> 712 - <li> <font color="ffffff"> 713 - CPU 0: <tt>list_del_rcu(p);</tt> 714 - </font> 715 - <li> <font color="ffffff"> 716 - CPU 0: <tt>synchronize_rcu()</tt> starts. 717 - </font> 718 - <li> <font color="ffffff"> 719 - CPU 1: <tt>do_something_with(q->a); 720 - /* No smp_mb(), so might happen after kfree(). */</tt> 721 - </font> 722 - <li> <font color="ffffff"> 723 - CPU 1: <tt>rcu_read_unlock()</tt> 724 - </font> 725 - <li> <font color="ffffff"> 726 - CPU 0: <tt>synchronize_rcu()</tt> returns. 727 - </font> 728 - <li> <font color="ffffff"> 729 - CPU 0: <tt>kfree(p);</tt> 730 - </font> 731 - </ol> 732 - 733 - <p><font color="ffffff"> 734 - Therefore, there absolutely must be a full memory barrier between the 735 - end of the RCU read-side critical section and the end of the 736 - grace period. 
737 - </font> 738 - 739 - <p><font color="ffffff"> 740 - The sequence of events demonstrating the necessity of the second rule 741 - is roughly similar: 742 - </font> 743 - 744 - <ol> 745 - <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> 746 - </font> 747 - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. 748 - </font> 749 - <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> 750 - </font> 751 - <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); 752 - /* Might return p if no memory barrier. */</tt> 753 - </font> 754 - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. 755 - </font> 756 - <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt> 757 - </font> 758 - <li> <font color="ffffff"> 759 - CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> 760 - </font> 761 - <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> 762 - </font> 763 - </ol> 764 - 765 - <p><font color="ffffff"> 766 - And similarly, without a memory barrier between the beginning of the 767 - grace period and the beginning of the RCU read-side critical section, 768 - CPU 1 might end up accessing the freelist. 769 - </font> 770 - 771 - <p><font color="ffffff"> 772 - The “as if” rule of course applies, so that any 773 - implementation that acts as if the appropriate memory barriers 774 - were in place is a correct implementation. 775 - That said, it is much easier to fool yourself into believing 776 - that you have adhered to the as-if rule than it is to actually 777 - adhere to it! 778 - </font></td></tr> 779 - <tr><td> </td></tr> 780 - </table> 781 - 782 - <table> 783 - <tr><th> </th></tr> 784 - <tr><th align="left">Quick Quiz:</th></tr> 785 - <tr><td> 786 - You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> 787 - generate absolutely no code in some kernel builds. 788 - This means that the compiler might arbitrarily rearrange consecutive 789 - RCU read-side critical sections. 
790 - Given such rearrangement, if a given RCU read-side critical section 791 - is done, how can you be sure that all prior RCU read-side critical 792 - sections are done? 793 - Won't the compiler rearrangements make that impossible to determine? 794 - </td></tr> 795 - <tr><th align="left">Answer:</th></tr> 796 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 797 - In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> 798 - generate absolutely no code, RCU infers quiescent states only at 799 - special locations, for example, within the scheduler. 800 - Because calls to <tt>schedule()</tt> had better prevent calling-code 801 - accesses to shared variables from being rearranged across the call to 802 - <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side 803 - critical section, it will necessarily detect the end of all prior 804 - RCU read-side critical sections, no matter how aggressively the 805 - compiler scrambles the code. 806 - </font> 807 - 808 - <p><font color="ffffff"> 809 - Again, this all assumes that the compiler cannot scramble code across 810 - calls to the scheduler, out of interrupt handlers, into the idle loop, 811 - into user-mode code, and so on. 812 - But if your kernel build allows that sort of scrambling, you have broken 813 - far more than just RCU! 814 - </font></td></tr> 815 - <tr><td> </td></tr> 816 - </table> 817 - 818 - <p> 819 - Note that these memory-barrier requirements do not replace the fundamental 820 - RCU requirement that a grace period wait for all pre-existing readers. 821 - On the contrary, the memory barriers called out in this section must operate in 822 - such a way as to <i>enforce</i> this fundamental requirement. 823 - Of course, different implementations enforce this requirement in different 824 - ways, but enforce it they must. 
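<p>
As a rough illustration of one way an implementation might enforce the
wait for pre-existing readers, the following is a toy userspace sketch,
not the Linux-kernel implementation.
It assumes that quiescent states are reported explicitly per CPU, and all
names (<tt>toy_qs_count</tt>, <tt>toy_note_quiescent_state()</tt>,
<tt>toy_gp_start()</tt>, <tt>toy_gp_done()</tt>, <tt>TOY_NR_CPUS</tt>)
are hypothetical.
The sequentially consistent C11 atomics stand in for the full memory
barriers discussed above, and idle CPUs, CPU hotplug, and scalability are
ignored entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Per-CPU count of quiescent states observed so far (hypothetical). */
static atomic_ulong toy_qs_count[TOY_NR_CPUS];

/* Snapshot of every CPU's counter, taken when a grace period starts. */
struct toy_gp {
	unsigned long snap[TOY_NR_CPUS];
};

/* Call only where the CPU cannot be within a read-side critical section. */
void toy_note_quiescent_state(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	/* Sequentially consistent RMW: stand-in for a full memory barrier. */
	atomic_fetch_add(&toy_qs_count[cpu], 1);
}

/* Begin a grace period by recording each CPU's current counter value. */
void toy_gp_start(struct toy_gp *g)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		g->snap[cpu] = atomic_load(&toy_qs_count[cpu]);
}

/* The grace period may end only after every CPU's counter has advanced. */
bool toy_gp_done(const struct toy_gp *g)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (atomic_load(&toy_qs_count[cpu]) == g->snap[cpu])
			return false;	/* this CPU might still host an old reader */
	return true;
}
```

<p>
In this sketch, a caller would invoke <tt>toy_gp_start()</tt> after
removing an element and then poll <tt>toy_gp_done()</tt> before freeing it.
A single CPU that has not yet reported a quiescent state is enough to
hold the grace period open.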
825 - 826 - <h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> 827 - 828 - <p> 829 - The common-case RCU primitives are unconditional. 830 - They are invoked, they do their job, and they return, with no possibility 831 - of error, and no need to retry. 832 - This is a key RCU design philosophy. 833 - 834 - <p> 835 - However, this philosophy is pragmatic rather than pigheaded. 836 - If someone comes up with a good justification for a particular conditional 837 - RCU primitive, it might well be implemented and added. 838 - After all, this guarantee was reverse-engineered, not premeditated. 839 - The unconditional nature of the RCU primitives was initially an 840 - accident of implementation, and later experience with synchronization 841 - mechanisms that have conditional primitives caused me to elevate this 842 - accident to a guarantee. 843 - Therefore, the justification for adding a conditional primitive to 844 - RCU would need to be based on detailed and compelling use cases. 845 - 846 - <h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> 847 - 848 - <p> 849 - As far as RCU is concerned, it is always possible to carry out an 850 - update within an RCU read-side critical section. 851 - For example, that RCU read-side critical section might search for 852 - a given data element, and then might acquire the update-side 853 - spinlock in order to update that element, all while remaining 854 - in that RCU read-side critical section. 855 - Of course, it is necessary to exit the RCU read-side critical section 856 - before invoking <tt>synchronize_rcu()</tt>. However, this 857 - inconvenience can be avoided through use of the 858 - <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members 859 - described later in this document. 
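<p>
A minimal sketch of such an upgrade follows, reusing the <tt>gp</tt> and
<tt>gp_lock</tt> names from the earlier examples.
The <tt>update_gp_b()</tt> function is hypothetical, and the definitions
at the top are single-threaded userspace stand-ins (assumptions, not the
kernel's definitions) that exist only so the sketch compiles outside the
kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Single-threaded userspace stand-ins (assumptions, not the kernel's
 * definitions) so that the sketch compiles outside the kernel.
 */
typedef struct { bool locked; } spinlock_t;
static int rcu_read_nesting;
static void rcu_read_lock(void) { rcu_read_nesting++; }
static void rcu_read_unlock(void) { assert(rcu_read_nesting > 0); rcu_read_nesting--; }
static void spin_lock(spinlock_t *l) { assert(!l->locked); l->locked = true; }
static void spin_unlock(spinlock_t *l) { assert(l->locked); l->locked = false; }
#define READ_ONCE(x)          (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)      (*(volatile __typeof__(x) *)&(x) = (v))
#define rcu_dereference(p)    READ_ONCE(p)
#define rcu_access_pointer(p) READ_ONCE(p)

struct foo { int a; int b; };
static struct foo *gp;
static spinlock_t gp_lock;

/* Hypothetical helper: look up gp as a reader, then update it in place. */
bool update_gp_b(int b)
{
	struct foo *p;
	bool ret = false;

	rcu_read_lock();
	p = rcu_dereference(gp);
	if (p) {
		spin_lock(&gp_lock);	/* Upgrade while still a reader. */
		/* Recheck under the lock in case p was removed meanwhile. */
		if (p == rcu_access_pointer(gp)) {
			WRITE_ONCE(p->b, b);
			ret = true;
		}
		spin_unlock(&gp_lock);
	}
	rcu_read_unlock();
	return ret;
}
```

<p>
The lookup and the update share one RCU read-side critical section, so
the element cannot be freed between the two steps.
Concurrent readers are not excluded and may observe either the old or the
new value of <tt>->b</tt>.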
860 - 861 - <table> 862 - <tr><th> </th></tr> 863 - <tr><th align="left">Quick Quiz:</th></tr> 864 - <tr><td> 865 - But how does the upgrade-to-write operation exclude other readers? 866 - </td></tr> 867 - <tr><th align="left">Answer:</th></tr> 868 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 869 - It doesn't, just like normal RCU updates, which also do not exclude 870 - RCU readers. 871 - </font></td></tr> 872 - <tr><td> </td></tr> 873 - </table> 874 - 875 - <p> 876 - This guarantee allows lookup code to be shared between read-side 877 - and update-side code, and was premeditated, appearing in the earliest 878 - DYNIX/ptx RCU documentation. 879 - 880 - <h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> 881 - 882 - <p> 883 - RCU provides extremely lightweight readers, and its read-side guarantees, 884 - though quite useful, are correspondingly lightweight. 885 - It is therefore all too easy to assume that RCU is guaranteeing more 886 - than it really is. 887 - Of course, the list of things that RCU does not guarantee is infinitely 888 - long. However, the following sections list a few non-guarantees that 889 - have caused confusion. 890 - Except where otherwise noted, these non-guarantees were premeditated. 
891 - 892 - <ol> 893 - <li> <a href="#Readers Impose Minimal Ordering"> 894 - Readers Impose Minimal Ordering</a> 895 - <li> <a href="#Readers Do Not Exclude Updaters"> 896 - Readers Do Not Exclude Updaters</a> 897 - <li> <a href="#Updaters Only Wait For Old Readers"> 898 - Updaters Only Wait For Old Readers</a> 899 - <li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections"> 900 - Grace Periods Don't Partition Read-Side Critical Sections</a> 901 - <li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods"> 902 - Read-Side Critical Sections Don't Partition Grace Periods</a> 903 - </ol> 904 - 905 - <h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3> 906 - 907 - <p> 908 - Reader-side markers such as <tt>rcu_read_lock()</tt> and 909 - <tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees 910 - except through their interaction with the grace-period APIs such as 911 - <tt>synchronize_rcu()</tt>. 912 - To see this, consider the following pair of threads: 913 - 914 - <blockquote> 915 - <pre> 916 - 1 void thread0(void) 917 - 2 { 918 - 3 rcu_read_lock(); 919 - 4 WRITE_ONCE(x, 1); 920 - 5 rcu_read_unlock(); 921 - 6 rcu_read_lock(); 922 - 7 WRITE_ONCE(y, 1); 923 - 8 rcu_read_unlock(); 924 - 9 } 925 - 10 926 - 11 void thread1(void) 927 - 12 { 928 - 13 rcu_read_lock(); 929 - 14 r1 = READ_ONCE(y); 930 - 15 rcu_read_unlock(); 931 - 16 rcu_read_lock(); 932 - 17 r2 = READ_ONCE(x); 933 - 18 rcu_read_unlock(); 934 - 19 } 935 - </pre> 936 - </blockquote> 937 - 938 - <p> 939 - After <tt>thread0()</tt> and <tt>thread1()</tt> execute 940 - concurrently, it is quite possible to have 941 - 942 - <blockquote> 943 - <pre> 944 - (r1 == 1 && r2 == 0) 945 - </pre> 946 - </blockquote> 947 - 948 - (that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>), 949 - which would not be possible if <tt>rcu_read_lock()</tt> and 950 - <tt>rcu_read_unlock()</tt> had much in the way of ordering 951 - properties. 
952 - But they do not, so the CPU is within its rights 953 - to do significant reordering. 954 - This is by design: Any significant ordering constraints would slow down 955 - these fast-path APIs. 956 - 957 - <table> 958 - <tr><th> </th></tr> 959 - <tr><th align="left">Quick Quiz:</th></tr> 960 - <tr><td> 961 - Can't the compiler also reorder this code? 962 - </td></tr> 963 - <tr><th align="left">Answer:</th></tr> 964 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 965 - No, the volatile casts in <tt>READ_ONCE()</tt> and 966 - <tt>WRITE_ONCE()</tt> prevent the compiler from reordering in 967 - this particular case. 968 - </font></td></tr> 969 - <tr><td> </td></tr> 970 - </table> 971 - 972 - <h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3> 973 - 974 - <p> 975 - Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt> 976 - excludes updates. 977 - All they do is prevent grace periods from ending. 978 - The following example illustrates this: 979 - 980 - <blockquote> 981 - <pre> 982 - 1 void thread0(void) 983 - 2 { 984 - 3 rcu_read_lock(); 985 - 4 r1 = READ_ONCE(y); 986 - 5 if (r1) { 987 - 6 do_something_with_nonzero_x(); 988 - 7 r2 = READ_ONCE(x); 989 - 8 WARN_ON(!r2); /* BUG!!! */ 990 - 9 } 991 - 10 rcu_read_unlock(); 992 - 11 } 993 - 12 994 - 13 void thread1(void) 995 - 14 { 996 - 15 spin_lock(&my_lock); 997 - 16 WRITE_ONCE(x, 1); 998 - 17 WRITE_ONCE(y, 1); 999 - 18 spin_unlock(&my_lock); 1000 - 19 } 1001 - </pre> 1002 - </blockquote> 1003 - 1004 - <p> 1005 - If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt> 1006 - excluded the <tt>thread1()</tt> function's update, 1007 - the <tt>WARN_ON()</tt> could never fire. 1008 - But the fact is that <tt>rcu_read_lock()</tt> does not exclude 1009 - much of anything aside from subsequent grace periods, of which 1010 - <tt>thread1()</tt> has none, so the 1011 - <tt>WARN_ON()</tt> can and does fire. 
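<p>
If a reader really does need to see <tt>x</tt> set whenever it sees
<tt>y</tt> set, the ordering has to come from somewhere other than
<tt>rcu_read_lock()</tt>.
One possibility, sketched below in userspace C11 as an assumption rather
than as anything prescribed by this document, is a release store paired
with an acquire load, which plays roughly the role of the kernel's
<tt>smp_store_release()</tt> and <tt>smp_load_acquire()</tt>.
The function names <tt>updater()</tt> and <tt>reader()</tt> are
hypothetical, and the <tt>assert()</tt> stands in for the
<tt>WARN_ON()</tt> above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int x, y;

/* Updater: write x, then publish y with release ordering. */
void updater(void)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_store_explicit(&y, 1, memory_order_release);
}

/* Reader: returns true once the update is visible to it. */
bool reader(void)
{
	if (!atomic_load_explicit(&y, memory_order_acquire))
		return false;	/* update not yet visible */
	/* Analog of WARN_ON(!r2): cannot fire given the release/acquire pair. */
	assert(atomic_load_explicit(&x, memory_order_relaxed) == 1);
	return true;
}
```

<p>
The reader still does not exclude the updater.
It merely gains the guarantee that observing the store to <tt>y</tt>
implies observing the earlier store to <tt>x</tt>.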
1012 - 1013 - <h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3> 1014 - 1015 - <p> 1016 - It might be tempting to assume that after <tt>synchronize_rcu()</tt> 1017 - completes, there are no readers executing. 1018 - This temptation must be avoided because 1019 - new readers can start immediately after <tt>synchronize_rcu()</tt> 1020 - starts, and <tt>synchronize_rcu()</tt> is under no 1021 - obligation to wait for these new readers. 1022 - 1023 - <table> 1024 - <tr><th> </th></tr> 1025 - <tr><th align="left">Quick Quiz:</th></tr> 1026 - <tr><td> 1027 - Suppose that synchronize_rcu() did wait until <i>all</i> 1028 - readers had completed instead of waiting only on 1029 - pre-existing readers. 1030 - For how long would the updater be able to rely on there 1031 - being no readers? 1032 - </td></tr> 1033 - <tr><th align="left">Answer:</th></tr> 1034 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1035 - For no time at all. 1036 - Even if <tt>synchronize_rcu()</tt> were to wait until 1037 - all readers had completed, a new reader might start immediately after 1038 - <tt>synchronize_rcu()</tt> completed. 1039 - Therefore, the code following 1040 - <tt>synchronize_rcu()</tt> can <i>never</i> rely on there being 1041 - no readers. 1042 - </font></td></tr> 1043 - <tr><td> </td></tr> 1044 - </table> 1045 - 1046 - <h3><a name="Grace Periods Don't Partition Read-Side Critical Sections"> 1047 - Grace Periods Don't Partition Read-Side Critical Sections</a></h3> 1048 - 1049 - <p> 1050 - It is tempting to assume that if any part of one RCU read-side critical 1051 - section precedes a given grace period, and if any part of another RCU 1052 - read-side critical section follows that same grace period, then all of 1053 - the first RCU read-side critical section must precede all of the second. 1054 - However, this just isn't the case: A single grace period does not 1055 - partition the set of RCU read-side critical sections. 
1056 - An example of this situation can be illustrated as follows, where 1057 - <tt>a</tt>, <tt>b</tt>, and <tt>c</tt> are initially all zero: 1058 - 1059 - <blockquote> 1060 - <pre> 1061 - 1 void thread0(void) 1062 - 2 { 1063 - 3 rcu_read_lock(); 1064 - 4 WRITE_ONCE(a, 1); 1065 - 5 WRITE_ONCE(b, 1); 1066 - 6 rcu_read_unlock(); 1067 - 7 } 1068 - 8 1069 - 9 void thread1(void) 1070 - 10 { 1071 - 11 r1 = READ_ONCE(a); 1072 - 12 synchronize_rcu(); 1073 - 13 WRITE_ONCE(c, 1); 1074 - 14 } 1075 - 15 1076 - 16 void thread2(void) 1077 - 17 { 1078 - 18 rcu_read_lock(); 1079 - 19 r2 = READ_ONCE(b); 1080 - 20 r3 = READ_ONCE(c); 1081 - 21 rcu_read_unlock(); 1082 - 22 } 1083 - </pre> 1084 - </blockquote> 1085 - 1086 - <p> 1087 - It turns out that the outcome: 1088 - 1089 - <blockquote> 1090 - <pre> 1091 - (r1 == 1 && r2 == 0 && r3 == 1) 1092 - </pre> 1093 - </blockquote> 1094 - 1095 - is entirely possible. 1096 - The following figure shows how this can happen, with each circled 1097 - <tt>QS</tt> indicating the point at which RCU recorded a 1098 - <i>quiescent state</i> for each thread, that is, a state in which 1099 - RCU knows that the thread cannot be in the midst of an RCU read-side 1100 - critical section that started before the current grace period: 1101 - 1102 - <p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p> 1103 - 1104 - <p> 1105 - If it is necessary to partition RCU read-side critical sections in this 1106 - manner, it is necessary to use two grace periods, where the first 1107 - grace period is known to end before the second grace period starts: 1108 - 1109 - <blockquote> 1110 - <pre> 1111 - 1 void thread0(void) 1112 - 2 { 1113 - 3 rcu_read_lock(); 1114 - 4 WRITE_ONCE(a, 1); 1115 - 5 WRITE_ONCE(b, 1); 1116 - 6 rcu_read_unlock(); 1117 - 7 } 1118 - 8 1119 - 9 void thread1(void) 1120 - 10 { 1121 - 11 r1 = READ_ONCE(a); 1122 - 12 synchronize_rcu(); 1123 - 13 WRITE_ONCE(c, 1); 1124 - 14 } 1125 - 15 1126 - 16 void thread2(void) 1127 - 17 
{ 1128 - 18 r2 = READ_ONCE(c); 1129 - 19 synchronize_rcu(); 1130 - 20 WRITE_ONCE(d, 1); 1131 - 21 } 1132 - 22 1133 - 23 void thread3(void) 1134 - 24 { 1135 - 25 rcu_read_lock(); 1136 - 26 r3 = READ_ONCE(b); 1137 - 27 r4 = READ_ONCE(d); 1138 - 28 rcu_read_unlock(); 1139 - 29 } 1140 - </pre> 1141 - </blockquote> 1142 - 1143 - <p> 1144 - Here, if <tt>(r1 == 1)</tt>, then 1145 - <tt>thread0()</tt>'s write to <tt>b</tt> must happen 1146 - before the end of <tt>thread1()</tt>'s grace period. 1147 - If in addition <tt>(r4 == 1)</tt>, then 1148 - <tt>thread3()</tt>'s read from <tt>b</tt> must happen 1149 - after the beginning of <tt>thread2()</tt>'s grace period. 1150 - If it is also the case that <tt>(r2 == 1)</tt>, then the 1151 - end of <tt>thread1()</tt>'s grace period must precede the 1152 - beginning of <tt>thread2()</tt>'s grace period. 1153 - This means that the two RCU read-side critical sections cannot overlap, 1154 - guaranteeing that <tt>(r3 == 1)</tt>. 1155 - As a result, the outcome: 1156 - 1157 - <blockquote> 1158 - <pre> 1159 - (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) 1160 - </pre> 1161 - </blockquote> 1162 - 1163 - cannot happen. 1164 - 1165 - <p> 1166 - This non-requirement was also non-premeditated, but became apparent 1167 - when studying RCU's interaction with memory ordering. 1168 - 1169 - <h3><a name="Read-Side Critical Sections Don't Partition Grace Periods"> 1170 - Read-Side Critical Sections Don't Partition Grace Periods</a></h3> 1171 - 1172 - <p> 1173 - It is also tempting to assume that if an RCU read-side critical section 1174 - happens between a pair of grace periods, then those grace periods cannot 1175 - overlap. 
1176 - However, this temptation leads nowhere good, as can be illustrated by 1177 - the following, with all variables initially zero: 1178 - 1179 - <blockquote> 1180 - <pre> 1181 - 1 void thread0(void) 1182 - 2 { 1183 - 3 rcu_read_lock(); 1184 - 4 WRITE_ONCE(a, 1); 1185 - 5 WRITE_ONCE(b, 1); 1186 - 6 rcu_read_unlock(); 1187 - 7 } 1188 - 8 1189 - 9 void thread1(void) 1190 - 10 { 1191 - 11 r1 = READ_ONCE(a); 1192 - 12 synchronize_rcu(); 1193 - 13 WRITE_ONCE(c, 1); 1194 - 14 } 1195 - 15 1196 - 16 void thread2(void) 1197 - 17 { 1198 - 18 rcu_read_lock(); 1199 - 19 WRITE_ONCE(d, 1); 1200 - 20 r2 = READ_ONCE(c); 1201 - 21 rcu_read_unlock(); 1202 - 22 } 1203 - 23 1204 - 24 void thread3(void) 1205 - 25 { 1206 - 26 r3 = READ_ONCE(d); 1207 - 27 synchronize_rcu(); 1208 - 28 WRITE_ONCE(e, 1); 1209 - 29 } 1210 - 30 1211 - 31 void thread4(void) 1212 - 32 { 1213 - 33 rcu_read_lock(); 1214 - 34 r4 = READ_ONCE(b); 1215 - 35 r5 = READ_ONCE(e); 1216 - 36 rcu_read_unlock(); 1217 - 37 } 1218 - </pre> 1219 - </blockquote> 1220 - 1221 - <p> 1222 - In this case, the outcome: 1223 - 1224 - <blockquote> 1225 - <pre> 1226 - (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) 1227 - </pre> 1228 - </blockquote> 1229 - 1230 - is entirely possible, as illustrated below: 1231 - 1232 - <p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p> 1233 - 1234 - <p> 1235 - Again, an RCU read-side critical section can overlap almost all of a 1236 - given grace period, just so long as it does not overlap the entire 1237 - grace period. 1238 - As a result, an RCU read-side critical section cannot partition a pair 1239 - of RCU grace periods. 1240 - 1241 - <table> 1242 - <tr><th> </th></tr> 1243 - <tr><th align="left">Quick Quiz:</th></tr> 1244 - <tr><td> 1245 - How long a sequence of grace periods, each separated by an RCU 1246 - read-side critical section, would be required to partition the RCU 1247 - read-side critical sections at the beginning and end of the chain? 
1248 - </td></tr> 1249 - <tr><th align="left">Answer:</th></tr> 1250 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1251 - In theory, an infinite number. 1252 - In practice, an unknown number that is sensitive to both implementation 1253 - details and timing considerations. 1254 - Therefore, even in practice, RCU users must abide by the 1255 - theoretical rather than the practical answer. 1256 - </font></td></tr> 1257 - <tr><td> </td></tr> 1258 - </table> 1259 - 1260 - <h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> 1261 - 1262 - <p> 1263 - These parallelism facts of life are by no means specific to RCU, but 1264 - the RCU implementation must abide by them. 1265 - They therefore bear repeating: 1266 - 1267 - <ol> 1268 - <li> Any CPU or task may be delayed at any time, 1269 - and any attempts to avoid these delays by disabling 1270 - preemption, interrupts, or whatever are completely futile. 1271 - This is most obvious in preemptible user-level 1272 - environments and in virtualized environments (where 1273 - a given guest OS's VCPUs can be preempted at any time by 1274 - the underlying hypervisor), but can also happen in bare-metal 1275 - environments due to ECC errors, NMIs, and other hardware 1276 - events. 1277 - Although a delay of more than about 20 seconds can result 1278 - in splats, the RCU implementation is obligated to use 1279 - algorithms that can tolerate extremely long delays, but where 1280 - “extremely long” is not long enough to allow 1281 - wrap-around when incrementing a 64-bit counter. 1282 - <li> Both the compiler and the CPU can reorder memory accesses. 1283 - Where it matters, RCU must use compiler directives and 1284 - memory-barrier instructions to preserve ordering. 1285 - <li> Conflicting writes to memory locations in any given cache line 1286 - will result in expensive cache misses. 1287 - Greater numbers of concurrent writes and more-frequent 1288 - concurrent writes will result in more dramatic slowdowns. 
1289 - RCU is therefore obligated to use algorithms that have 1290 - sufficient locality to avoid significant performance and 1291 - scalability problems. 1292 - <li> As a rough rule of thumb, only one CPU's worth of processing 1293 - may be carried out under the protection of any given exclusive 1294 - lock. 1295 - RCU must therefore use scalable locking designs. 1296 - <li> Counters are finite, especially on 32-bit systems. 1297 - RCU's use of counters must therefore tolerate counter wrap, 1298 - or be designed such that counter wrap would take way more 1299 - time than a single system is likely to run. 1300 - An uptime of ten years is quite possible, a runtime 1301 - of a century much less so. 1302 - As an example of the latter, RCU's dyntick-idle nesting counter 1303 - allows 54 bits for interrupt nesting level (this counter 1304 - is 64 bits even on a 32-bit system). 1305 - Overflowing this counter requires 2<sup>54</sup> 1306 - half-interrupts on a given CPU without that CPU ever going idle. 1307 - If a half-interrupt happened every microsecond, it would take 1308 - 570 years of runtime to overflow this counter, which is currently 1309 - believed to be an acceptably long time. 1310 - <li> Linux systems can have thousands of CPUs running a single 1311 - Linux kernel in a single shared-memory environment. 1312 - RCU must therefore pay close attention to high-end scalability. 1313 - </ol> 1314 - 1315 - <p> 1316 - This last parallelism fact of life means that RCU must pay special 1317 - attention to the preceding facts of life. 1318 - The idea that Linux might scale to systems with thousands of CPUs would 1319 - have been met with some skepticism in the 1990s, but these requirements 1320 - would otherwise have been unsurprising, even in the early 1990s. 1321 - 1322 - <h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2> 1323 - 1324 - <p> 1325 - These sections list quality-of-implementation requirements.
1326 - Although an RCU implementation that ignores these requirements could 1327 - still be used, it would likely be subject to limitations that would 1328 - make it inappropriate for industrial-strength production use. 1329 - Classes of quality-of-implementation requirements are as follows: 1330 - 1331 - <ol> 1332 - <li> <a href="#Specialization">Specialization</a> 1333 - <li> <a href="#Performance and Scalability">Performance and Scalability</a> 1334 - <li> <a href="#Forward Progress">Forward Progress</a> 1335 - <li> <a href="#Composability">Composability</a> 1336 - <li> <a href="#Corner Cases">Corner Cases</a> 1337 - </ol> 1338 - 1339 - <p> 1340 - Each of these classes is covered in the following sections. 1341 - 1342 - <h3><a name="Specialization">Specialization</a></h3> 1343 - 1344 - <p> 1345 - RCU is and always has been intended primarily for read-mostly situations, 1346 - which means that RCU's read-side primitives are optimized, often at the 1347 - expense of its update-side primitives. 1348 - Experience thus far is captured by the following list of situations: 1349 - 1350 - <ol> 1351 - <li> Read-mostly data, where stale and inconsistent data is not 1352 - a problem: RCU works great! 1353 - <li> Read-mostly data, where data must be consistent: 1354 - RCU works well. 1355 - <li> Read-write data, where data must be consistent: 1356 - RCU <i>might</i> work OK. 1357 - Or not. 1358 - <li> Write-mostly data, where data must be consistent: 1359 - RCU is very unlikely to be the right tool for the job, 1360 - with the following exceptions, where RCU can provide: 1361 - <ol type=a> 1362 - <li> Existence guarantees for update-friendly mechanisms. 1363 - <li> Wait-free read-side primitives for real-time use. 1364 - </ol> 1365 - </ol> 1366 - 1367 - <p> 1368 - This focus on read-mostly situations means that RCU must interoperate 1369 - with other synchronization primitives.
1370 - For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt> 1371 - examples discussed earlier use RCU to protect readers and locking to 1372 - coordinate updaters. 1373 - However, the need extends much farther, requiring that a variety of 1374 - synchronization primitives be legal within RCU read-side critical sections, 1375 - including spinlocks, sequence locks, atomic operations, reference 1376 - counters, and memory barriers. 1377 - 1378 - <table> 1379 - <tr><th> </th></tr> 1380 - <tr><th align="left">Quick Quiz:</th></tr> 1381 - <tr><td> 1382 - What about sleeping locks? 1383 - </td></tr> 1384 - <tr><th align="left">Answer:</th></tr> 1385 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1386 - These are forbidden within Linux-kernel RCU read-side critical 1387 - sections because it is not legal to place a quiescent state 1388 - (in this case, voluntary context switch) within an RCU read-side 1389 - critical section. 1390 - However, sleeping locks may be used within userspace RCU read-side 1391 - critical sections, and also within Linux-kernel sleepable RCU 1392 - <a href="#Sleepable RCU"><font color="ffffff">(SRCU)</font></a> 1393 - read-side critical sections. 1394 - In addition, the -rt patchset turns spinlocks into 1395 - sleeping locks so that the corresponding critical sections 1396 - can be preempted, which also means that these sleeplockified 1397 - spinlocks (but not other sleeping locks!) may be acquired within 1398 - -rt-Linux-kernel RCU read-side critical sections. 1399 - </font> 1400 - 1401 - <p><font color="ffffff"> 1402 - Note that it <i>is</i> legal for a normal RCU read-side 1403 - critical section to conditionally acquire a sleeping lock 1404 - (as in <tt>mutex_trylock()</tt>), but only as long as it does 1405 - not loop indefinitely attempting to conditionally acquire that 1406 - sleeping lock.
1407 - The key point is that things like <tt>mutex_trylock()</tt> 1408 - either return with the mutex held, or return an error indication if 1409 - the mutex was not immediately available. 1410 - Either way, <tt>mutex_trylock()</tt> returns immediately without 1411 - sleeping. 1412 - </font></td></tr> 1413 - <tr><td> </td></tr> 1414 - </table> 1415 - 1416 - <p> 1417 - It often comes as a surprise that many algorithms do not require a 1418 - consistent view of data, but many can function in that mode, 1419 - with network routing being the poster child. 1420 - Internet routing algorithms take significant time to propagate 1421 - updates, so that by the time an update arrives at a given system, 1422 - that system has been sending network traffic the wrong way for 1423 - a considerable length of time. 1424 - Having a few threads continue to send traffic the wrong way for a 1425 - few more milliseconds is clearly not a problem: In the worst case, 1426 - TCP retransmissions will eventually get the data where it needs to go. 1427 - In general, when tracking the state of the universe outside of the 1428 - computer, some level of inconsistency must be tolerated due to 1429 - speed-of-light delays if nothing else. 1430 - 1431 - <p> 1432 - Furthermore, uncertainty about external state is inherent in many cases. 1433 - For example, a pair of veterinarians might use heartbeat to determine 1434 - whether or not a given cat was alive. 1435 - But how long should they wait after the last heartbeat to decide that 1436 - the cat is in fact dead? 1437 - Waiting less than 400 milliseconds makes no sense because this would 1438 - mean that a relaxed cat would be considered to cycle between death 1439 - and life more than 100 times per minute. 1440 - Moreover, just as with human beings, a cat's heart might stop for 1441 - some period of time, so the exact wait period is a judgment call. 
1442 - One of our pair of veterinarians might wait 30 seconds before pronouncing 1443 - the cat dead, while the other might insist on waiting a full minute. 1444 - The two veterinarians would then disagree on the state of the cat during 1445 - the final 30 seconds of the minute following the last heartbeat. 1446 - 1447 - <p> 1448 - Interestingly enough, this same situation applies to hardware. 1449 - When push comes to shove, how do we tell whether or not some 1450 - external server has failed? 1451 - We send messages to it periodically, and declare it failed if we 1452 - don't receive a response within a given period of time. 1453 - Policy decisions can usually tolerate short 1454 - periods of inconsistency. 1455 - The policy was decided some time ago, and is only now being put into 1456 - effect, so a few milliseconds of delay is normally inconsequential. 1457 - 1458 - <p> 1459 - However, there are algorithms that absolutely must see consistent data. 1460 - For example, the translation from a user-level SystemV semaphore 1461 - ID to the corresponding in-kernel data structure is protected by RCU, 1462 - but it is absolutely forbidden to update a semaphore that has just been 1463 - removed. 1464 - In the Linux kernel, this need for consistency is accommodated by acquiring 1465 - spinlocks located in the in-kernel data structure from within 1466 - the RCU read-side critical section, and this is indicated by the 1467 - green box in the figure above. 1468 - Many other techniques may be used, and are in fact used within the 1469 - Linux kernel. 1470 - 1471 - <p> 1472 - In short, RCU is not required to maintain consistency, and other 1473 - mechanisms may be used in concert with RCU when consistency is required. 1474 - RCU's specialization allows it to do its job extremely well, and its 1475 - ability to interoperate with other synchronization mechanisms allows 1476 - the right mix of synchronization tools to be used for a given job.
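<p>
The semaphore case above, where a per-element lock is acquired from within the RCU read-side critical section, can be sketched in userspace C roughly as follows.
This is a minimal illustration under stated assumptions, not the kernel's SystemV semaphore code: <tt>struct sem_entry</tt>, its <tt>removed</tt> flag, and the three helper functions are hypothetical names, a C11 <tt>atomic_flag</tt> stands in for the in-kernel spinlock, and <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> are stubbed out as no-ops so that the sketch is self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* No-op stand-ins so the sketch builds outside the kernel (assumption). */
#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

/* Hypothetical element that a reader reaches via an RCU-protected lookup. */
struct sem_entry {
	atomic_flag lock;	/* Stands in for the per-element spinlock. */
	bool removed;		/* Set by the updater when the element is removed. */
	int value;
};

static void sem_entry_init(struct sem_entry *e)
{
	atomic_flag_clear(&e->lock);
	e->removed = false;
	e->value = 0;
}

static void sem_entry_lock(struct sem_entry *e)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
		continue;
}

static void sem_entry_unlock(struct sem_entry *e)
{
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/*
 * Reader-side update: RCU only keeps the element's memory valid.
 * The per-element lock plus the ->removed check supplies the
 * consistency that RCU by itself does not.
 */
static bool sem_try_update(struct sem_entry *e, int delta)
{
	bool ok = false;

	rcu_read_lock();
	sem_entry_lock(e);
	if (!e->removed) {
		e->value += delta;
		ok = true;
	}
	sem_entry_unlock(e);
	rcu_read_unlock();
	return ok;
}

/* Updater side: mark the element removed under the same lock. */
static void sem_mark_removed(struct sem_entry *e)
{
	sem_entry_lock(e);
	e->removed = true;
	sem_entry_unlock(e);
}
```

A reader that loses the race with removal sees <tt>removed</tt> set and backs off without touching the element, which is the property the semaphore code needs.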
1477 - 1478 - <h3><a name="Performance and Scalability">Performance and Scalability</a></h3> 1479 - 1480 - <p> 1481 - Energy efficiency is a critical component of performance today, 1482 - and Linux-kernel RCU implementations must therefore avoid unnecessarily 1483 - awakening idle CPUs. 1484 - I cannot claim that this requirement was premeditated. 1485 - In fact, I learned of it during a telephone conversation in which I 1486 - was given “frank and open” feedback on the importance 1487 - of energy efficiency in battery-powered systems and on specific 1488 - energy-efficiency shortcomings of the Linux-kernel RCU implementation. 1489 - In my experience, the battery-powered embedded community will consider 1490 - any unnecessary wakeups to be extremely unfriendly acts. 1491 - So much so that mere Linux-kernel-mailing-list posts are 1492 - insufficient to vent their ire. 1493 - 1494 - <p> 1495 - Memory consumption is not particularly important in most 1496 - situations, and has become decreasingly 1497 - so as memory sizes have expanded and memory 1498 - costs have plummeted. 1499 - However, as I learned from Matt Mackall's 1500 - <a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a> 1501 - efforts, memory footprint is critically important on single-CPU systems with 1502 - non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus 1503 - <a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a> 1504 - was born. 1505 - Josh Triplett has since taken over the small-memory banner with his 1506 - <a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a> 1507 - project, which resulted in 1508 - <a href="#Sleepable RCU">SRCU</a> 1509 - becoming optional for those kernels not needing it. 1510 - 1511 - <p> 1512 - The remaining performance requirements are, for the most part, 1513 - unsurprising.
1514 - For example, in keeping with RCU's read-side specialization, 1515 - <tt>rcu_dereference()</tt> should have negligible overhead (for 1516 - example, suppression of a few minor compiler optimizations). 1517 - Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and 1518 - <tt>rcu_read_unlock()</tt> should have exactly zero overhead. 1519 - 1520 - <p> 1521 - In preemptible environments, in the case where the RCU read-side 1522 - critical section was not preempted (as will be the case for the 1523 - highest-priority real-time process), <tt>rcu_read_lock()</tt> and 1524 - <tt>rcu_read_unlock()</tt> should have minimal overhead. 1525 - In particular, they should not contain atomic read-modify-write 1526 - operations, memory-barrier instructions, preemption disabling, 1527 - interrupt disabling, or backwards branches. 1528 - However, in the case where the RCU read-side critical section was preempted, 1529 - <tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts. 1530 - This is why it is better to nest an RCU read-side critical section 1531 - within a preempt-disable region than vice versa, at least in cases 1532 - where that critical section is short enough to avoid unduly degrading 1533 - real-time latencies. 1534 - 1535 - <p> 1536 - The <tt>synchronize_rcu()</tt> grace-period-wait primitive is 1537 - optimized for throughput. 1538 - It may therefore incur several milliseconds of latency in addition to 1539 - the duration of the longest RCU read-side critical section. 1540 - On the other hand, multiple concurrent invocations of 1541 - <tt>synchronize_rcu()</tt> are required to use batching optimizations 1542 - so that they can be satisfied by a single underlying grace-period-wait 1543 - operation. 
1544 - For example, in the Linux kernel, it is not unusual for a single 1545 - grace-period-wait operation to serve more than 1546 - <a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a> 1547 - of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation 1548 - overhead down to nearly zero. 1549 - However, the grace-period optimization is also required to avoid 1550 - measurable degradation of real-time scheduling and interrupt latencies. 1551 - 1552 - <p> 1553 - In some cases, the multi-millisecond <tt>synchronize_rcu()</tt> 1554 - latencies are unacceptable. 1555 - In these cases, <tt>synchronize_rcu_expedited()</tt> may be used 1556 - instead, reducing the grace-period latency down to a few tens of 1557 - microseconds on small systems, at least in cases where the RCU read-side 1558 - critical sections are short. 1559 - There are currently no special latency requirements for 1560 - <tt>synchronize_rcu_expedited()</tt> on large systems, but, 1561 - consistent with the empirical nature of the RCU specification, 1562 - that is subject to change. 1563 - However, there most definitely are scalability requirements: 1564 - A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096 1565 - CPUs should at least make reasonable forward progress. 1566 - In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt> 1567 - is permitted to impose modest degradation of real-time latency 1568 - on non-idle online CPUs. 1569 - Here, “modest” means roughly the same latency 1570 - degradation as a scheduling-clock interrupt. 1571 - 1572 - <p> 1573 - There are a number of situations where even 1574 - <tt>synchronize_rcu_expedited()</tt>'s reduced grace-period 1575 - latency is unacceptable. 
1576 - In these situations, the asynchronous <tt>call_rcu()</tt> can be 1577 - used in place of <tt>synchronize_rcu()</tt> as follows: 1578 - 1579 - <blockquote> 1580 - <pre> 1581 - 1 struct foo { 1582 - 2 int a; 1583 - 3 int b; 1584 - 4 struct rcu_head rh; 1585 - 5 }; 1586 - 6 1587 - 7 static void remove_gp_cb(struct rcu_head *rhp) 1588 - 8 { 1589 - 9 struct foo *p = container_of(rhp, struct foo, rh); 1590 - 10 1591 - 11 kfree(p); 1592 - 12 } 1593 - 13 1594 - 14 bool remove_gp_asynchronous(void) 1595 - 15 { 1596 - 16 struct foo *p; 1597 - 17 1598 - 18 spin_lock(&gp_lock); 1599 - 19 p = rcu_access_pointer(gp); 1600 - 20 if (!p) { 1601 - 21 spin_unlock(&gp_lock); 1602 - 22 return false; 1603 - 23 } 1604 - 24 rcu_assign_pointer(gp, NULL); 1605 - 25 call_rcu(&p->rh, remove_gp_cb); 1606 - 26 spin_unlock(&gp_lock); 1607 - 27 return true; 1608 - 28 } 1609 - </pre> 1610 - </blockquote> 1611 - 1612 - <p> 1613 - A definition of <tt>struct foo</tt> is finally needed, and appears 1614 - on lines 1-5. 1615 - The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt> 1616 - on line 25, and will be invoked after the end of a subsequent 1617 - grace period. 1618 - This gets the same effect as <tt>remove_gp_synchronous()</tt>, 1619 - but without forcing the updater to wait for a grace period to elapse. 1620 - The <tt>call_rcu()</tt> function may be used in a number of 1621 - situations where neither <tt>synchronize_rcu()</tt> nor 1622 - <tt>synchronize_rcu_expedited()</tt> would be legal, 1623 - including within preempt-disable code, <tt>local_bh_disable()</tt> code, 1624 - interrupt-disable code, and interrupt handlers. 1625 - However, even <tt>call_rcu()</tt> is illegal within NMI handlers 1626 - and from idle and offline CPUs. 
1627 - The callback function (<tt>remove_gp_cb()</tt> in this case) will be 1628 - executed within softirq (software interrupt) environment within the 1629 - Linux kernel, 1630 - either within a real softirq handler or under the protection 1631 - of <tt>local_bh_disable()</tt>. 1632 - In both the Linux kernel and in userspace, it is bad practice to 1633 - write an RCU callback function that takes too long. 1634 - Long-running operations should be relegated to separate threads or 1635 - (in the Linux kernel) workqueues. 1636 - 1637 - <table> 1638 - <tr><th> </th></tr> 1639 - <tr><th align="left">Quick Quiz:</th></tr> 1640 - <tr><td> 1641 - Why does line 19 use <tt>rcu_access_pointer()</tt>? 1642 - After all, <tt>call_rcu()</tt> on line 25 stores into the 1643 - structure, which would interact badly with concurrent insertions. 1644 - Doesn't this mean that <tt>rcu_dereference()</tt> is required? 1645 - </td></tr> 1646 - <tr><th align="left">Answer:</th></tr> 1647 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1648 - Presumably the <tt>->gp_lock</tt> acquired on line 18 excludes 1649 - any changes, including any insertions that <tt>rcu_dereference()</tt> 1650 - would protect against. 1651 - Therefore, any insertions will be delayed until after 1652 - <tt>->gp_lock</tt> 1653 - is released on line 26, which in turn means that 1654 - <tt>rcu_access_pointer()</tt> suffices. 1655 - </font></td></tr> 1656 - <tr><td> </td></tr> 1657 - </table> 1658 - 1659 - <p> 1660 - However, all that <tt>remove_gp_cb()</tt> is doing is 1661 - invoking <tt>kfree()</tt> on the data element.
1662 - This is a common idiom, and is supported by <tt>kfree_rcu()</tt>, 1663 - which allows “fire and forget” operation as shown below: 1664 - 1665 - <blockquote> 1666 - <pre> 1667 - 1 struct foo { 1668 - 2 int a; 1669 - 3 int b; 1670 - 4 struct rcu_head rh; 1671 - 5 }; 1672 - 6 1673 - 7 bool remove_gp_faf(void) 1674 - 8 { 1675 - 9 struct foo *p; 1676 - 10 1677 - 11 spin_lock(&gp_lock); 1678 - 12 p = rcu_access_pointer(gp); 1679 - 13 if (!p) { 1680 - 14 spin_unlock(&gp_lock); 1681 - 15 return false; 1682 - 16 } 1683 - 17 rcu_assign_pointer(gp, NULL); 1684 - 18 kfree_rcu(p, rh); 1685 - 19 spin_unlock(&gp_lock); 1686 - 20 return true; 1687 - 21 } 1688 - </pre> 1689 - </blockquote> 1690 - 1691 - <p> 1692 - Note that <tt>remove_gp_faf()</tt> simply invokes 1693 - <tt>kfree_rcu()</tt> and proceeds, without any need to pay any 1694 - further attention to the subsequent grace period and <tt>kfree()</tt>. 1695 - It is permissible to invoke <tt>kfree_rcu()</tt> from the same 1696 - environments as for <tt>call_rcu()</tt>. 1697 - Interestingly enough, DYNIX/ptx had the equivalents of 1698 - <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not 1699 - <tt>synchronize_rcu()</tt>. 1700 - This was due to the fact that RCU was not heavily used within DYNIX/ptx, 1701 - so the very few places that needed something like 1702 - <tt>synchronize_rcu()</tt> simply open-coded it. 1703 - 1704 - <table> 1705 - <tr><th> </th></tr> 1706 - <tr><th align="left">Quick Quiz:</th></tr> 1707 - <tr><td> 1708 - Earlier it was claimed that <tt>call_rcu()</tt> and 1709 - <tt>kfree_rcu()</tt> allowed updaters to avoid being blocked 1710 - by readers. 1711 - But how can that be correct, given that the invocation of the callback 1712 - and the freeing of the memory (respectively) must still wait for 1713 - a grace period to elapse?
1714 - </td></tr> 1715 - <tr><th align="left">Answer:</th></tr> 1716 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 1717 - We could define things this way, but keep in mind that this sort of 1718 - definition would say that updates in garbage-collected languages 1719 - cannot complete until the next time the garbage collector runs, 1720 - which does not seem at all reasonable. 1721 - The key point is that in most cases, an updater using either 1722 - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the 1723 - next update as soon as it has invoked <tt>call_rcu()</tt> or 1724 - <tt>kfree_rcu()</tt>, without having to wait for a subsequent 1725 - grace period. 1726 - </font></td></tr> 1727 - <tr><td> </td></tr> 1728 - </table> 1729 - 1730 - <p> 1731 - But what if the updater must wait for the completion of code to be 1732 - executed after the end of the grace period, but has other tasks 1733 - that can be carried out in the meantime? 1734 - The polling-style <tt>get_state_synchronize_rcu()</tt> and 1735 - <tt>cond_synchronize_rcu()</tt> functions may be used for this 1736 - purpose, as shown below: 1737 - 1738 - <blockquote> 1739 - <pre> 1740 - 1 bool remove_gp_poll(void) 1741 - 2 { 1742 - 3 struct foo *p; 1743 - 4 unsigned long s; 1744 - 5 1745 - 6 spin_lock(&gp_lock); 1746 - 7 p = rcu_access_pointer(gp); 1747 - 8 if (!p) { 1748 - 9 spin_unlock(&gp_lock); 1749 - 10 return false; 1750 - 11 } 1751 - 12 rcu_assign_pointer(gp, NULL); 1752 - 13 spin_unlock(&gp_lock); 1753 - 14 s = get_state_synchronize_rcu(); 1754 - 15 do_something_while_waiting(); 1755 - 16 cond_synchronize_rcu(s); 1756 - 17 kfree(p); 1757 - 18 return true; 1758 - 19 } 1759 - </pre> 1760 - </blockquote> 1761 - 1762 - <p> 1763 - On line 14, <tt>get_state_synchronize_rcu()</tt> obtains a 1764 - “cookie” from RCU, 1765 - then line 15 carries out other tasks, 1766 - and finally, line 16 returns immediately if a grace period has 1767 - elapsed in the meantime, but otherwise waits as required. 
1768 - The need for <tt>get_state_synchronize_rcu()</tt> and 1769 - <tt>cond_synchronize_rcu()</tt> has appeared quite recently, 1770 - so it is too early to tell whether they will stand the test of time. 1771 - 1772 - <p> 1773 - RCU thus provides a range of tools to allow updaters to strike the 1774 - required tradeoff between latency, flexibility and CPU overhead. 1775 - 1776 - <h3><a name="Forward Progress">Forward Progress</a></h3> 1777 - 1778 - <p> 1779 - In theory, delaying grace-period completion and callback invocation 1780 - is harmless. 1781 - In practice, not only are memory sizes finite but also callbacks sometimes 1782 - do wakeups, and sufficiently deferred wakeups can be difficult 1783 - to distinguish from system hangs. 1784 - Therefore, RCU must provide a number of mechanisms to promote forward 1785 - progress. 1786 - 1787 - <p> 1788 - These mechanisms are not foolproof, nor can they be. 1789 - For one simple example, an infinite loop in an RCU read-side critical 1790 - section must by definition prevent later grace periods from ever completing. 1791 - For a more involved example, consider a 64-CPU system built with 1792 - <tt>CONFIG_RCU_NOCB_CPU=y</tt> and booted with <tt>rcu_nocbs=1-63</tt>, 1793 - where CPUs 1 through 63 spin in tight loops that invoke 1794 - <tt>call_rcu()</tt>. 1795 - Even if these tight loops also contain calls to <tt>cond_resched()</tt> 1796 - (thus allowing grace periods to complete), CPU 0 simply will 1797 - not be able to invoke callbacks as fast as the other 63 CPUs can 1798 - register them, at least not until the system runs out of memory. 1799 - In both of these examples, the Spiderman principle applies: With great 1800 - power comes great responsibility. 1801 - However, short of this level of abuse, RCU is required to 1802 - ensure timely completion of grace periods and timely invocation of 1803 - callbacks.
1804 - 1805 - <p> 1806 - RCU takes the following steps to encourage timely completion of 1807 - grace periods: 1808 - 1809 - <ol> 1810 - <li> If a grace period fails to complete within 100 milliseconds, 1811 - RCU causes future invocations of <tt>cond_resched()</tt> on 1812 - the holdout CPUs to provide an RCU quiescent state. 1813 - RCU also causes those CPUs' <tt>need_resched()</tt> invocations 1814 - to return <tt>true</tt>, but only after the corresponding CPU's 1815 - next scheduling-clock interrupt. 1816 - <li> CPUs mentioned in the <tt>nohz_full</tt> kernel boot parameter 1817 - can run indefinitely in the kernel without scheduling-clock 1818 - interrupts, which defeats the above <tt>need_resched()</tt> 1819 - stratagem. 1820 - RCU will therefore invoke <tt>resched_cpu()</tt> on any 1821 - <tt>nohz_full</tt> CPUs still holding out after 1822 - 109 milliseconds. 1823 - <li> In kernels built with <tt>CONFIG_RCU_BOOST=y</tt>, if a given 1824 - task that has been preempted within an RCU read-side critical 1825 - section is holding out for more than 500 milliseconds, 1826 - RCU will resort to priority boosting. 1827 - <li> If a CPU is still holding out 10 seconds into the grace 1828 - period, RCU will invoke <tt>resched_cpu()</tt> on it regardless 1829 - of its <tt>nohz_full</tt> state. 1830 - </ol> 1831 - 1832 - <p> 1833 - The above values are defaults for systems running with <tt>HZ=1000</tt>. 1834 - They will vary as the value of <tt>HZ</tt> varies, and can also be 1835 - changed using the relevant Kconfig options and kernel boot parameters. 1836 - RCU currently does not do much sanity checking of these 1837 - parameters, so please use caution when changing them. 1838 - Note that these forward-progress measures are provided only for RCU, 1839 - not for 1840 - <a href="#Sleepable RCU">SRCU</a> or 1841 - <a href="#Tasks RCU">Tasks RCU</a>.
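<p>
The escalation schedule listed above can be summarized as a small userspace C model.
This is only a sketch for reasoning about the defaults quoted in the text for <tt>HZ=1000</tt>, not the kernel's implementation: the <tt>gp_actions()</tt> function, the <tt>GP_*</tt> constants, and the flag names are all hypothetical, and treating each threshold as inclusive is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Default thresholds quoted in the text for HZ=1000, in milliseconds. */
#define GP_COND_RESCHED_MS	100UL
#define GP_NOHZ_RESCHED_MS	109UL
#define GP_BOOST_MS		500UL
#define GP_FORCE_RESCHED_MS	10000UL

/* Illustrative flag names for the measures applied to one holdout CPU. */
enum gp_action {
	GP_ACT_NONE		= 0,
	GP_ACT_COND_RESCHED_QS	= 1 << 0, /* cond_resched() reports a quiescent state. */
	GP_ACT_RESCHED_CPU	= 1 << 1, /* resched_cpu() is invoked on the CPU. */
	GP_ACT_PRIO_BOOST	= 1 << 2, /* Preempted reader is priority boosted. */
};

/*
 * Return the set of measures in effect for a holdout CPU once the
 * grace period is gp_age_ms milliseconds old.  boost_enabled models
 * CONFIG_RCU_BOOST=y, and preempted_reader models a task preempted
 * within an RCU read-side critical section.
 */
static unsigned int gp_actions(unsigned long gp_age_ms, bool nohz_full,
			       bool boost_enabled, bool preempted_reader)
{
	unsigned int acts = GP_ACT_NONE;

	if (gp_age_ms >= GP_COND_RESCHED_MS)
		acts |= GP_ACT_COND_RESCHED_QS;
	if (nohz_full && gp_age_ms >= GP_NOHZ_RESCHED_MS)
		acts |= GP_ACT_RESCHED_CPU;
	if (boost_enabled && preempted_reader && gp_age_ms >= GP_BOOST_MS)
		acts |= GP_ACT_PRIO_BOOST;
	if (gp_age_ms >= GP_FORCE_RESCHED_MS)
		acts |= GP_ACT_RESCHED_CPU; /* Regardless of nohz_full state. */
	return acts;
}
```

For example, a <tt>nohz_full</tt> CPU that is 150 milliseconds into a grace period gets both the <tt>cond_resched()</tt> treatment and a <tt>resched_cpu()</tt>, while an ordinary CPU at the same point gets only the former.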
1842 - 1843 - <p> 1844 - RCU takes the following steps in <tt>call_rcu()</tt> to encourage timely 1845 - invocation of callbacks when any given non-<tt>rcu_nocbs</tt> CPU has 1846 - 10,000 callbacks, or has 10,000 more callbacks than it had the last time 1847 - encouragement was provided: 1848 - 1849 - <ol> 1850 - <li> Starts a grace period, if one is not already in progress. 1851 - <li> Forces immediate checking for quiescent states, rather than 1852 - waiting for three milliseconds to have elapsed since the 1853 - beginning of the grace period. 1854 - <li> Immediately tags the CPU's callbacks with their grace period 1855 - completion numbers, rather than waiting for the <tt>RCU_SOFTIRQ</tt> 1856 - handler to get around to it. 1857 - <li> Lifts callback-execution batch limits, which speeds up callback 1858 - invocation at the expense of degrading realtime response. 1859 - </ol> 1860 - 1861 - <p> 1862 - Again, these are default values when running at <tt>HZ=1000</tt>, 1863 - and can be overridden. 1864 - Again, these forward-progress measures are provided only for RCU, 1865 - not for 1866 - <a href="#Sleepable RCU">SRCU</a> or 1867 - <a href="#Tasks RCU">Tasks RCU</a>. 1868 - Even for RCU, callback-invocation forward progress for <tt>rcu_nocbs</tt> 1869 - CPUs is much less well-developed, in part because workloads benefiting 1870 - from <tt>rcu_nocbs</tt> CPUs tend to invoke <tt>call_rcu()</tt> 1871 - relatively infrequently. 1872 - If workloads emerge that need both <tt>rcu_nocbs</tt> CPUs and high 1873 - <tt>call_rcu()</tt> invocation rates, then additional forward-progress 1874 - work will be required. 1875 - 1876 - <h3><a name="Composability">Composability</a></h3> 1877 - 1878 - <p> 1879 - Composability has received much attention in recent years, perhaps in part 1880 - due to the collision of multicore hardware with object-oriented techniques 1881 - designed in single-threaded environments for single-threaded use. 
1882 - And in theory, RCU read-side critical sections may be composed, and in 1883 - fact may be nested arbitrarily deeply. 1884 - In practice, as with all real-world implementations of composable 1885 - constructs, there are limitations. 1886 - 1887 - <p> 1888 - Implementations of RCU for which <tt>rcu_read_lock()</tt> 1889 - and <tt>rcu_read_unlock()</tt> generate no code, such as 1890 - Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be 1891 - nested arbitrarily deeply. 1892 - After all, there is no overhead. 1893 - Except that if all these instances of <tt>rcu_read_lock()</tt> 1894 - and <tt>rcu_read_unlock()</tt> are visible to the compiler, 1895 - compilation will eventually fail due to exhausting memory, 1896 - mass storage, or user patience, whichever comes first. 1897 - If the nesting is not visible to the compiler, as is the case with 1898 - mutually recursive functions each in its own translation unit, 1899 - stack overflow will result. 1900 - If the nesting takes the form of loops, perhaps in the guise of tail 1901 - recursion, either the control variable 1902 - will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. 1903 - Nevertheless, this class of RCU implementations is one 1904 - of the most composable constructs in existence. 1905 - 1906 - <p> 1907 - RCU implementations that explicitly track nesting depth 1908 - are limited by the nesting-depth counter. 1909 - For example, the Linux kernel's preemptible RCU limits nesting to 1910 - <tt>INT_MAX</tt>. 1911 - This should suffice for almost all practical purposes. 1912 - That said, a consecutive pair of RCU read-side critical sections 1913 - between which there is an operation that waits for a grace period 1914 - cannot be enclosed in another RCU read-side critical section. 
1915 - This is because it is not legal to wait for a grace period within 1916 - an RCU read-side critical section: To do so would result either 1917 - in deadlock or 1918 - in RCU implicitly splitting the enclosing RCU read-side critical 1919 - section, neither of which is conducive to a long-lived and prosperous 1920 - kernel. 1921 - 1922 - <p> 1923 - It is worth noting that RCU is not alone in limiting composability. 1924 - For example, many transactional-memory implementations prohibit 1925 - composing a pair of transactions separated by an irrevocable 1926 - operation (for example, a network receive operation). 1927 - For another example, lock-based critical sections can be composed 1928 - surprisingly freely, but only if deadlock is avoided. 1929 - 1930 - <p> 1931 - In short, although RCU read-side critical sections are highly composable, 1932 - care is required in some situations, just as is the case for any other 1933 - composable synchronization mechanism. 1934 - 1935 - <h3><a name="Corner Cases">Corner Cases</a></h3> 1936 - 1937 - <p> 1938 - A given RCU workload might have an endless and intense stream of 1939 - RCU read-side critical sections, perhaps even so intense that there 1940 - was never a point in time during which there was not at least one 1941 - RCU read-side critical section in flight. 1942 - RCU cannot allow this situation to block grace periods: As long as 1943 - all the RCU read-side critical sections are finite, grace periods 1944 - must also be finite. 1945 - 1946 - <p> 1947 - That said, preemptible RCU implementations could potentially result 1948 - in RCU read-side critical sections being preempted for long durations, 1949 - which has the effect of creating a long-duration RCU read-side 1950 - critical section. 1951 - This situation can arise only in heavily loaded systems, but systems using 1952 - real-time priorities are of course more vulnerable. 1953 - Therefore, RCU priority boosting is provided to help deal with this 1954 - case. 
1955 - That said, the exact requirements on RCU priority boosting will likely 1956 - evolve as more experience accumulates. 1957 - 1958 - <p> 1959 - Other workloads might have very high update rates. 1960 - Although one can argue that such workloads should instead use 1961 - something other than RCU, the fact remains that RCU must 1962 - handle such workloads gracefully. 1963 - This requirement is another factor driving batching of grace periods, 1964 - but it is also the driving force behind the checks for large numbers 1965 - of queued RCU callbacks in the <tt>call_rcu()</tt> code path. 1966 - Finally, high update rates should not delay RCU read-side critical 1967 - sections, although some small read-side delays can occur when using 1968 - <tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use 1969 - of <tt>smp_call_function_single()</tt>. 1970 - 1971 - <p> 1972 - Although all three of these corner cases were understood in the early 1973 - 1990s, a simple user-level test consisting of <tt>close(open(path))</tt> 1974 - in a tight loop 1975 - in the early 2000s suddenly provided a much deeper appreciation of the 1976 - high-update-rate corner case. 1977 - This test also motivated the addition of some RCU code to react to high update 1978 - rates. For example, if a given CPU finds itself with more than 10,000 1979 - RCU callbacks queued, it will cause RCU to take evasive action by 1980 - more aggressively starting grace periods and more aggressively forcing 1981 - completion of grace-period processing. 1982 - This evasive action causes the grace period to complete more quickly, 1983 - but at the cost of restricting RCU's batching optimizations, thus 1984 - increasing the CPU overhead incurred by that grace period.
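<p>
The high-update-rate check described above can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: the structure, the function names, and the <tt>MODEL_QHIMARK</tt> macro are hypothetical and are not the kernel's implementation.
Only the 10,000-callback threshold is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold taken from the text above; the macro name is hypothetical. */
#define MODEL_QHIMARK 10000

/* Hypothetical per-CPU bookkeeping, not the kernel's actual structure. */
struct model_cblist {
	long qlen;     /* callbacks queued and not yet invoked */
	bool evasive;  /* true while evasive action is requested */
};

/* Model of queueing one callback: count it, then check for overload. */
static void model_call_rcu(struct model_cblist *cl)
{
	cl->qlen++;
	if (cl->qlen > MODEL_QHIMARK)
		cl->evasive = true; /* stand-in for pushing grace periods harder */
}

/* Model of invoking up to n callbacks once their grace period has ended. */
static void model_invoke(struct model_cblist *cl, long n)
{
	assert(n >= 0);
	cl->qlen -= (n < cl->qlen) ? n : cl->qlen;
	if (cl->qlen <= MODEL_QHIMARK)
		cl->evasive = false; /* backlog drained: resume normal batching */
}
```

<p>
The point of the model is the tradeoff stated above: the <tt>evasive</tt> flag buys faster grace-period completion at the cost of batching, so it is cleared as soon as the backlog drops back to the threshold.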
1985 - 1986 - <h2><a name="Software-Engineering Requirements"> 1987 - Software-Engineering Requirements</a></h2> 1988 - 1989 - <p> 1990 - Between Murphy's Law and “To err is human”, it is necessary to 1991 - guard against mishaps and misuse: 1992 - 1993 - <ol> 1994 - <li> It is all too easy to forget to use <tt>rcu_read_lock()</tt> 1995 - everywhere that it is needed, so kernels built with 1996 - <tt>CONFIG_PROVE_RCU=y</tt> will splat if 1997 - <tt>rcu_dereference()</tt> is used outside of an 1998 - RCU read-side critical section. 1999 - Update-side code can use <tt>rcu_dereference_protected()</tt>, 2000 - which takes a 2001 - <a href="https://lwn.net/Articles/371986/">lockdep expression</a> 2002 - to indicate what is providing the protection. 2003 - If the indicated protection is not provided, a lockdep splat 2004 - is emitted. 2005 - 2006 - <p> 2007 - Code shared between readers and updaters can use 2008 - <tt>rcu_dereference_check()</tt>, which also takes a 2009 - lockdep expression, and emits a lockdep splat if neither 2010 - <tt>rcu_read_lock()</tt> nor the indicated protection 2011 - is in place. 2012 - In addition, <tt>rcu_dereference_raw()</tt> is used in those 2013 - (hopefully rare) cases where the required protection cannot 2014 - be easily described. 2015 - Finally, <tt>rcu_read_lock_held()</tt> is provided to 2016 - allow a function to verify that it has been invoked within 2017 - an RCU read-side critical section. 2018 - I was made aware of this set of requirements shortly after Thomas 2019 - Gleixner audited a number of RCU uses. 2020 - <li> A given function might wish to check for RCU-related preconditions 2021 - upon entry, before using any other RCU API. 2022 - The <tt>rcu_lockdep_assert()</tt> does this job, 2023 - asserting the expression in kernels having lockdep enabled 2024 - and doing nothing otherwise. 
2025 - <li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt> 2026 - and <tt>rcu_dereference()</tt>, perhaps (incorrectly) 2027 - substituting a simple assignment. 2028 - To catch this sort of error, a given RCU-protected pointer may be 2029 - tagged with <tt>__rcu</tt>, after which sparse 2030 - will complain about simple-assignment accesses to that pointer. 2031 - Arnd Bergmann made me aware of this requirement, and also 2032 - supplied the needed 2033 - <a href="https://lwn.net/Articles/376011/">patch series</a>. 2034 - <li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt> 2035 - will splat if a data element is passed to <tt>call_rcu()</tt> 2036 - twice in a row, without a grace period in between. 2037 - (This error is similar to a double free.) 2038 - The corresponding <tt>rcu_head</tt> structures that are 2039 - dynamically allocated are automatically tracked, but 2040 - <tt>rcu_head</tt> structures allocated on the stack 2041 - must be initialized with <tt>init_rcu_head_on_stack()</tt> 2042 - and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>. 2043 - Similarly, statically allocated non-stack <tt>rcu_head</tt> 2044 - structures must be initialized with <tt>init_rcu_head()</tt> 2045 - and cleaned up with <tt>destroy_rcu_head()</tt>. 2046 - Mathieu Desnoyers made me aware of this requirement, and also 2047 - supplied the needed 2048 - <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>. 2049 - <li> An infinite loop in an RCU read-side critical section will 2050 - eventually trigger an RCU CPU stall warning splat, with 2051 - the duration of “eventually” being controlled by the 2052 - <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or, 2053 - alternatively, by the 2054 - <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs 2055 - parameter. 2056 - However, RCU is not obligated to produce this splat 2057 - unless there is a grace period waiting on that particular 2058 - RCU read-side critical section. 
2059 - <p> 2060 - Some extreme workloads might intentionally delay 2061 - RCU grace periods, and systems running those workloads can 2062 - be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt> 2063 - to suppress the splats. 2064 - This kernel parameter may also be set via <tt>sysfs</tt>. 2065 - Furthermore, RCU CPU stall warnings are counter-productive 2066 - during sysrq dumps and during panics. 2067 - RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and 2068 - <tt>rcu_sysrq_end()</tt> API members to be called before 2069 - and after long sysrq dumps. 2070 - RCU also supplies the <tt>rcu_panic()</tt> notifier that is 2071 - automatically invoked at the beginning of a panic to suppress 2072 - further RCU CPU stall warnings. 2073 - 2074 - <p> 2075 - This requirement made itself known in the early 1990s, pretty 2076 - much the first time that it was necessary to debug a CPU stall. 2077 - That said, the initial implementation in DYNIX/ptx was quite 2078 - generic in comparison with that of Linux. 2079 - <li> Although it would be very good to detect pointers leaking out 2080 - of RCU read-side critical sections, there is currently no 2081 - good way of doing this. 2082 - One complication is the need to distinguish between pointers 2083 - leaking and pointers that have been handed off from RCU to 2084 - some other synchronization mechanism, for example, reference 2085 - counting. 2086 - <li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related 2087 - information is provided via event tracing. 2088 - <li> Open-coded use of <tt>rcu_assign_pointer()</tt> and 2089 - <tt>rcu_dereference()</tt> to create typical linked 2090 - data structures can be surprisingly error-prone. 2091 - Therefore, RCU-protected 2092 - <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a> 2093 - and, more recently, RCU-protected 2094 - <a href="https://lwn.net/Articles/612100/">hash tables</a> 2095 - are available. 
2096 - Many other special-purpose RCU-protected data structures are 2097 - available in the Linux kernel and the userspace RCU library. 2098 - <li> Some linked structures are created at compile time, but still 2099 - require <tt>__rcu</tt> checking. 2100 - The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this 2101 - purpose. 2102 - <li> It is not necessary to use <tt>rcu_assign_pointer()</tt> 2103 - when creating linked structures that are to be published via 2104 - a single external pointer. 2105 - The <tt>RCU_INIT_POINTER()</tt> macro is provided for 2106 - this task and also for assigning <tt>NULL</tt> pointers 2107 - at runtime. 2108 - </ol> 2109 - 2110 - <p> 2111 - This is not a hard-and-fast list: RCU's diagnostic capabilities will 2112 - continue to be guided by the number and type of usage bugs found 2113 - in real-world RCU usage. 2114 - 2115 - <h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2> 2116 - 2117 - <p> 2118 - The Linux kernel provides an interesting environment for all kinds of 2119 - software, including RCU. 2120 - Some of the relevant points of interest are as follows: 2121 - 2122 - <ol> 2123 - <li> <a href="#Configuration">Configuration</a>. 2124 - <li> <a href="#Firmware Interface">Firmware Interface</a>. 2125 - <li> <a href="#Early Boot">Early Boot</a>. 2126 - <li> <a href="#Interrupts and NMIs"> 2127 - Interrupts and non-maskable interrupts (NMIs)</a>. 2128 - <li> <a href="#Loadable Modules">Loadable Modules</a>. 2129 - <li> <a href="#Hotplug CPU">Hotplug CPU</a>. 2130 - <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. 2131 - <li> <a href="#Tracing and RCU">Tracing and RCU</a>. 2132 - <li> <a href="#Energy Efficiency">Energy Efficiency</a>. 2133 - <li> <a href="#Scheduling-Clock Interrupts and RCU"> 2134 - Scheduling-Clock Interrupts and RCU</a>. 2135 - <li> <a href="#Memory Efficiency">Memory Efficiency</a>.
2136 - <li> <a href="#Performance, Scalability, Response Time, and Reliability"> 2137 - Performance, Scalability, Response Time, and Reliability</a>. 2138 - </ol> 2139 - 2140 - <p> 2141 - This list is probably incomplete, but it does give a feel for the 2142 - most notable Linux-kernel complications. 2143 - Each of the following sections covers one of the above topics. 2144 - 2145 - <h3><a name="Configuration">Configuration</a></h3> 2146 - 2147 - <p> 2148 - RCU's goal is automatic configuration, so that almost nobody 2149 - needs to worry about RCU's <tt>Kconfig</tt> options. 2150 - And for almost all users, RCU does in fact work well 2151 - “out of the box.” 2152 - 2153 - <p> 2154 - However, there are specialized use cases that are handled by 2155 - kernel boot parameters and <tt>Kconfig</tt> options. 2156 - Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users 2157 - about new <tt>Kconfig</tt> options, which requires that almost all of them 2158 - be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option. 2159 - 2160 - <p> 2161 - This all should be quite obvious, but the fact remains that 2162 - Linus Torvalds recently had to 2163 - <a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a> 2164 - me of this requirement. 2165 - 2166 - <h3><a name="Firmware Interface">Firmware Interface</a></h3> 2167 - 2168 - <p> 2169 - In many cases, the kernel obtains information about the system from the 2170 - firmware, and sometimes things are lost in translation. 2171 - Or the translation is accurate, but the original message is bogus. 2172 - 2173 - <p> 2174 - For example, some systems' firmware overreports the number of CPUs, 2175 - sometimes by a large factor. 2176 - If RCU naively believed the firmware, as it used to do, 2177 - it would create too many per-CPU kthreads.
2178 - Although the resulting system will still run correctly, the extra 2179 - kthreads needlessly consume memory and can cause confusion 2180 - when they show up in <tt>ps</tt> listings. 2181 - 2182 - <p> 2183 - RCU must therefore wait for a given CPU to actually come online before 2184 - it can allow itself to believe that the CPU actually exists. 2185 - The resulting “ghost CPUs” (which are never going to 2186 - come online) cause a number of 2187 - <a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>. 2188 - 2189 - <h3><a name="Early Boot">Early Boot</a></h3> 2190 - 2191 - <p> 2192 - The Linux kernel's boot sequence is an interesting process, 2193 - and RCU is used early, even before <tt>rcu_init()</tt> 2194 - is invoked. 2195 - In fact, a number of RCU's primitives can be used as soon as the 2196 - initial task's <tt>task_struct</tt> is available and the 2197 - boot CPU's per-CPU variables are set up. 2198 - The read-side primitives (<tt>rcu_read_lock()</tt>, 2199 - <tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>, 2200 - and <tt>rcu_access_pointer()</tt>) will operate normally very early on, 2201 - as will <tt>rcu_assign_pointer()</tt>. 2202 - 2203 - <p> 2204 - Although <tt>call_rcu()</tt> may be invoked at any 2205 - time during boot, callbacks are not guaranteed to be invoked until after 2206 - all of RCU's kthreads have been spawned, which occurs at 2207 - <tt>early_initcall()</tt> time. 2208 - This delay in callback invocation is due to the fact that RCU does not 2209 - invoke callbacks until it is fully initialized, and this full initialization 2210 - cannot occur until after the scheduler has initialized itself to the 2211 - point where RCU can spawn and run its kthreads. 2212 - In theory, it would be possible to invoke callbacks earlier. 2213 - However, this is not a panacea because there would be severe restrictions 2214 - on what operations those callbacks could invoke.
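<p>
The deferral of early-boot callbacks described above can be modeled in userspace as follows.
This is a sketch under stated assumptions rather than the kernel's code: every name here (<tt>model_rcu</tt>, <tt>model_call_rcu()</tt>, <tt>model_invoke_callbacks()</tt>, and so on) is hypothetical, and a call to <tt>model_invoke_callbacks()</tt> stands in for "a grace period has ended and callback invocation is being attempted."

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head. */
struct model_head {
	struct model_head *next;
	void (*func)(struct model_head *head);
};

/* Hypothetical global RCU state for a single-CPU boot-time model. */
struct model_rcu {
	struct model_head *head;   /* oldest queued callback */
	struct model_head **tail;  /* where the next callback is linked */
	bool kthreads_spawned;     /* set at (modeled) early_initcall() time */
};

static void model_rcu_init(struct model_rcu *r)
{
	r->head = NULL;
	r->tail = &r->head;
	r->kthreads_spawned = false;
}

/* Queueing is always allowed, even before the kthreads exist. */
static void model_call_rcu(struct model_rcu *r, struct model_head *h,
			   void (*func)(struct model_head *head))
{
	h->func = func;
	h->next = NULL;
	*r->tail = h;
	r->tail = &h->next;
}

/* Invoke queued callbacks, but only once the kthreads have been spawned.
 * Returns the number of callbacks invoked. */
static int model_invoke_callbacks(struct model_rcu *r)
{
	int n = 0;

	if (!r->kthreads_spawned)
		return 0;  /* too early: callbacks simply stay queued */
	while (r->head) {
		struct model_head *h = r->head;

		r->head = h->next;  /* advance first, func may free h */
		h->func(h);
		n++;
	}
	r->tail = &r->head;
	return n;
}

/* Example callback used only to observe invocations. */
static int model_cb_count;
static void model_count_cb(struct model_head *head)
{
	(void)head;
	model_cb_count++;
}
```

<p>
In this model, callbacks posted before <tt>kthreads_spawned</tt> is set are neither lost nor rejected. They simply wait, which matches the guarantee stated above: invocation is not promised until RCU is fully initialized.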
2215 - 2216 - <p> 2217 - Perhaps surprisingly, <tt>synchronize_rcu()</tt> and 2218 - <tt>synchronize_rcu_expedited()</tt> 2219 - will operate normally 2220 - during very early boot, the reason being that there is only one CPU 2221 - and preemption is disabled. 2222 - This means that the call to <tt>synchronize_rcu()</tt> (or friends) 2223 - itself is a quiescent 2224 - state and thus a grace period, so the early-boot implementation can 2225 - be a no-op. 2226 - 2227 - <p> 2228 - However, once the scheduler has spawned its first kthread, this early 2229 - boot trick fails for <tt>synchronize_rcu()</tt> (as well as for 2230 - <tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt> 2231 - kernels. 2232 - The reason is that an RCU read-side critical section might be preempted, 2233 - which means that a subsequent <tt>synchronize_rcu()</tt> really does have 2234 - to wait for something, as opposed to simply returning immediately. 2235 - Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of 2236 - its kthreads are spawned, which doesn't happen until some time during 2237 - <tt>early_initcall()</tt> time. 2238 - But this is no excuse: RCU is nevertheless required to correctly handle 2239 - synchronous grace periods during this time period. 2240 - Once all of its kthreads are up and running, RCU starts running 2241 - normally. 2242 - 2243 - <table> 2244 - <tr><th> </th></tr> 2245 - <tr><th align="left">Quick Quiz:</th></tr> 2246 - <tr><td> 2247 - How can RCU possibly handle grace periods before all of its 2248 - kthreads have been spawned??? 2249 - </td></tr> 2250 - <tr><th align="left">Answer:</th></tr> 2251 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 2252 - Very carefully!
2253 - </font> 2254 - 2255 - <p><font color="ffffff"> 2256 - During the “dead zone” between the time that the 2257 - scheduler spawns the first task and the time that all of RCU's 2258 - kthreads have been spawned, all synchronous grace periods are 2259 - handled by the expedited grace-period mechanism. 2260 - At runtime, this expedited mechanism relies on workqueues, but 2261 - during the dead zone the requesting task itself drives the 2262 - desired expedited grace period. 2263 - Because dead-zone execution takes place within task context, 2264 - everything works. 2265 - Once the dead zone ends, expedited grace periods go back to 2266 - using workqueues, as is required to avoid problems that would 2267 - otherwise occur when a user task received a POSIX signal while 2268 - driving an expedited grace period. 2269 - </font> 2270 - 2271 - <p><font color="ffffff"> 2272 - And yes, this does mean that it is unhelpful to send POSIX 2273 - signals to random tasks between the time that the scheduler 2274 - spawns its first kthread and the time that RCU's kthreads 2275 - have all been spawned. 2276 - If there ever turns out to be a good reason for sending POSIX 2277 - signals during that time, appropriate adjustments will be made. 2278 - (If it turns out that POSIX signals are sent during this time for 2279 - no good reason, other adjustments will be made, appropriate 2280 - or otherwise.) 2281 - </font></td></tr> 2282 - <tr><td> </td></tr> 2283 - </table> 2284 - 2285 - <p> 2286 - I learned of these boot-time requirements as a result of a series of 2287 - system hangs. 2288 - 2289 - <h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3> 2290 - 2291 - <p> 2292 - The Linux kernel has interrupts, and RCU read-side critical sections are 2293 - legal within interrupt handlers and within interrupt-disabled regions 2294 - of code, as are invocations of <tt>call_rcu()</tt>. 
2295 - 2296 - <p> 2297 - Some Linux-kernel architectures can enter an interrupt handler from 2298 - non-idle process context, and then just never leave it, instead stealthily 2299 - transitioning back to process context. 2300 - This trick is sometimes used to invoke system calls from inside the kernel. 2301 - These “half-interrupts” mean that RCU has to be very careful 2302 - about how it counts interrupt nesting levels. 2303 - I learned of this requirement the hard way during a rewrite 2304 - of RCU's dyntick-idle code. 2305 - 2306 - <p> 2307 - The Linux kernel has non-maskable interrupts (NMIs), and 2308 - RCU read-side critical sections are legal within NMI handlers. 2309 - Thankfully, RCU update-side primitives, including 2310 - <tt>call_rcu()</tt>, are prohibited within NMI handlers. 2311 - 2312 - <p> 2313 - The name notwithstanding, some Linux-kernel architectures 2314 - can have nested NMIs, which RCU must handle correctly. 2315 - Andy Lutomirski 2316 - <a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> 2317 - with this requirement; 2318 - he also kindly surprised me with 2319 - <a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> 2320 - that meets this requirement. 2321 - 2322 - <p> 2323 - Furthermore, NMI handlers can be interrupted by what appear to RCU 2324 - to be normal interrupts. 2325 - One way that this can happen is for code that directly invokes 2326 - <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> to be called 2327 - from an NMI handler. 2328 - This astonishing fact of life prompted the current code structure, 2329 - which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt> 2330 - and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>. 2331 - And yes, I also learned of this requirement the hard way. 
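<p>
One way to picture the nesting bookkeeping discussed above is the following userspace model, in which interrupt entry and exit are routed through the NMI entry and exit paths, as the text describes.
This is a simplified, single-threaded sketch under stated assumptions: the <tt>model_*</tt> names are hypothetical, and the model deliberately omits the careful ordering that a real implementation needs in order to tolerate an NMI arriving in the middle of an update.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct model_cpu {
	bool watching; /* false while RCU treats this CPU as idle */
	long nesting;  /* handler nesting; odd iff outermost entry was from idle */
};

/* Handler entry: the outermost entry from idle makes RCU watch the CPU
 * and adds 1; every other entry adds 2. */
static void model_nmi_enter(struct model_cpu *c)
{
	long incby = 2;

	if (!c->watching) {
		c->watching = true; /* leave idle so that readers are legal */
		incby = 1;
	}
	c->nesting += incby;
}

/* Handler exit: a count of exactly 1 means that this is the outermost
 * handler and that it interrupted idle, so return the CPU to idle. */
static void model_nmi_exit(struct model_cpu *c)
{
	assert(c->nesting > 0 && c->watching);
	if (c->nesting != 1) {
		c->nesting -= 2;
		return;
	}
	c->nesting = 0;
	c->watching = false;
}

/* Interrupts share the NMI bookkeeping, so arbitrary interleavings of
 * irq and NMI entry/exit keep a single consistent count. */
static void model_irq_enter(struct model_cpu *c) { model_nmi_enter(c); }
static void model_irq_exit(struct model_cpu *c)  { model_nmi_exit(c); }
```

<p>
Because the parity of <tt>nesting</tt> records whether the outermost handler interrupted idle, no separate "came from idle" flag is needed, and an apparent interrupt inside an NMI handler is just one more level of nesting.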
2332 - 2333 - <h3><a name="Loadable Modules">Loadable Modules</a></h3> 2334 - 2335 - <p> 2336 - The Linux kernel has loadable modules, and these modules can 2337 - also be unloaded. 2338 - After a given module has been unloaded, any attempt to call 2339 - one of its functions results in a segmentation fault. 2340 - The module-unload functions must therefore cancel any 2341 - delayed calls to loadable-module functions, for example, 2342 - any outstanding <tt>mod_timer()</tt> must be dealt with 2343 - via <tt>del_timer_sync()</tt> or similar. 2344 - 2345 - <p> 2346 - Unfortunately, there is no way to cancel an RCU callback; 2347 - once you invoke <tt>call_rcu()</tt>, the callback function is 2348 - eventually going to be invoked, unless the system goes down first. 2349 - Because it is normally considered socially irresponsible to crash the system 2350 - in response to a module unload request, we need some other way 2351 - to deal with in-flight RCU callbacks. 2352 - 2353 - <p> 2354 - RCU therefore provides 2355 - <tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>, 2356 - which waits until all in-flight RCU callbacks have been invoked. 2357 - If a module uses <tt>call_rcu()</tt>, its exit function should therefore 2358 - prevent any future invocation of <tt>call_rcu()</tt>, then invoke 2359 - <tt>rcu_barrier()</tt>. 2360 - In theory, the underlying module-unload code could invoke 2361 - <tt>rcu_barrier()</tt> unconditionally, but in practice this would 2362 - incur unacceptable latencies. 2363 - 2364 - <p> 2365 - Nikita Danilov noted this requirement for an analogous filesystem-unmount 2366 - situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU. 2367 - The need for <tt>rcu_barrier()</tt> for module unloading became 2368 - apparent later. 2369 - 2370 - <p> 2371 - <b>Important note</b>: The <tt>rcu_barrier()</tt> function is not, 2372 - repeat, <i>not</i>, obligated to wait for a grace period. 
2373 - It is instead only required to wait for RCU callbacks that have 2374 - already been posted. 2375 - Therefore, if there are no RCU callbacks posted anywhere in the system, 2376 - <tt>rcu_barrier()</tt> is within its rights to return immediately. 2377 - Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not 2378 - necessarily need to wait for a grace period. 2379 - 2380 - <table> 2381 - <tr><th> </th></tr> 2382 - <tr><th align="left">Quick Quiz:</th></tr> 2383 - <tr><td> 2384 - Wait a minute! 2385 - Each RCU callback must wait for a grace period to complete, 2386 - and <tt>rcu_barrier()</tt> must wait for each pre-existing 2387 - callback to be invoked. 2388 - Doesn't <tt>rcu_barrier()</tt> therefore need to wait for 2389 - a full grace period if there is even one callback posted anywhere 2390 - in the system? 2391 - </td></tr> 2392 - <tr><th align="left">Answer:</th></tr> 2393 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 2394 - Absolutely not!!! 2395 - </font> 2396 - 2397 - <p><font color="ffffff"> 2398 - Yes, each RCU callback must wait for a grace period to complete, 2399 - but it might well be partly (or even completely) finished waiting 2400 - by the time <tt>rcu_barrier()</tt> is invoked. 2401 - In that case, <tt>rcu_barrier()</tt> need only wait for the 2402 - remaining portion of the grace period to elapse. 2403 - So even if there are quite a few callbacks posted, 2404 - <tt>rcu_barrier()</tt> might well return quite quickly. 2405 - </font> 2406 - 2407 - <p><font color="ffffff"> 2408 - So if you need to wait for a grace period as well as for all 2409 - pre-existing callbacks, you will need to invoke both 2410 - <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>. 2411 - If latency is a concern, you can always use workqueues 2412 - to invoke them concurrently.
2413 - </font></td></tr> 2414 - <tr><td> </td></tr> 2415 - </table> 2416 - 2417 - <h3><a name="Hotplug CPU">Hotplug CPU</a></h3> 2418 - 2419 - <p> 2420 - The Linux kernel supports CPU hotplug, which means that CPUs 2421 - can come and go. 2422 - It is of course illegal to use any RCU API member from an offline CPU, 2423 - with the exception of <a href="#Sleepable RCU">SRCU</a> read-side 2424 - critical sections. 2425 - This requirement was present from day one in DYNIX/ptx, but 2426 - on the other hand, the Linux kernel's CPU-hotplug implementation 2427 - is “interesting.” 2428 - 2429 - <p> 2430 - The Linux-kernel CPU-hotplug implementation has notifiers that 2431 - are used to allow the various kernel subsystems (including RCU) 2432 - to respond appropriately to a given CPU-hotplug operation. 2433 - Most RCU operations may be invoked from CPU-hotplug notifiers, 2434 - including even synchronous grace-period operations such as 2435 - <tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>. 2436 - 2437 - <p> 2438 - However, all-callback-wait operations such as 2439 - <tt>rcu_barrier()</tt> are not supported in CPU-hotplug notifiers, due to the 2440 - fact that there are phases of CPU-hotplug operations where 2441 - the outgoing CPU's callbacks will not be invoked until after 2442 - the CPU-hotplug operation ends, which could result in deadlock. 2443 - Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations 2444 - during its execution, which results in another type of deadlock 2445 - when invoked from a CPU-hotplug notifier. 2446 - 2447 - <h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3> 2448 - 2449 - <p> 2450 - RCU depends on the scheduler, and the scheduler uses RCU to 2451 - protect some of its data structures. 2452 - The preemptible-RCU <tt>rcu_read_unlock()</tt> 2453 - implementation must therefore be written carefully to avoid deadlocks 2454 - involving the scheduler's runqueue and priority-inheritance locks.
2455 - In particular, <tt>rcu_read_unlock()</tt> must tolerate an 2456 - interrupt where the interrupt handler invokes both 2457 - <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. 2458 - This possibility requires <tt>rcu_read_unlock()</tt> to use 2459 - negative nesting levels to avoid destructive recursion via 2460 - the interrupt handler's use of RCU. 2461 - 2462 - <p> 2463 - This scheduler-RCU requirement came as a 2464 - <a href="https://lwn.net/Articles/453002/">complete surprise</a>. 2465 - 2466 - <p> 2467 - As noted above, RCU makes use of kthreads, and it is necessary to 2468 - avoid excessive CPU-time accumulation by these kthreads. 2469 - This requirement was no surprise, but RCU's violation of it 2470 - when running context-switch-heavy workloads when built with 2471 - <tt>CONFIG_NO_HZ_FULL=y</tt> 2472 - <a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. 2473 - RCU has made good progress towards meeting this requirement, even 2474 - for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, 2475 - but there is room for further improvement. 2476 - 2477 - <p> 2478 - It is forbidden to hold any of the scheduler's runqueue or priority-inheritance 2479 - spinlocks across an <tt>rcu_read_unlock()</tt> unless interrupts have been 2480 - disabled across the entire RCU read-side critical section, that is, 2481 - up to and including the matching <tt>rcu_read_lock()</tt>. 2482 - Violating this restriction can result in deadlocks involving these 2483 - scheduler spinlocks. 2484 - There was hope that this restriction might be lifted when interrupt-disabled 2485 - calls to <tt>rcu_read_unlock()</tt> started deferring the reporting of 2486 - the resulting RCU-preempt quiescent state until the end of the corresponding 2487 - interrupts-disabled region.
2488 - Unfortunately, timely reporting of the corresponding quiescent state 2489 - to expedited grace periods requires a call to <tt>raise_softirq()</tt>, 2490 - which can acquire these scheduler spinlocks. 2491 - In addition, real-time systems using RCU priority boosting 2492 - need this restriction to remain in effect because deferred 2493 - quiescent-state reporting would also defer deboosting, which in turn 2494 - would degrade real-time latencies. 2495 - 2496 - <p> 2497 - In theory, if a given RCU read-side critical section could be 2498 - guaranteed to be less than one second in duration, holding a scheduler 2499 - spinlock across that critical section's <tt>rcu_read_unlock()</tt> 2500 - would require only that preemption be disabled across the entire 2501 - RCU read-side critical section, not interrupts. 2502 - Unfortunately, given the possibility of vCPU preemption, long-running 2503 - interrupts, and so on, it is not possible in practice to guarantee 2504 - that a given RCU read-side critical section will complete in less than 2505 - one second. 2506 - Therefore, as noted above, if scheduler spinlocks are held across 2507 - a given call to <tt>rcu_read_unlock()</tt>, interrupts must be 2508 - disabled across the entire RCU read-side critical section. 2509 - 2510 - <h3><a name="Tracing and RCU">Tracing and RCU</a></h3> 2511 - 2512 - <p> 2513 - It is possible to use tracing on RCU code, but tracing itself 2514 - uses RCU. 2515 - For this reason, <tt>rcu_dereference_raw_notrace()</tt> 2516 - is provided for use by tracing, which avoids the destructive 2517 - recursion that could otherwise ensue. 2518 - This API is also used by virtualization in some architectures, 2519 - where RCU readers execute in environments in which tracing 2520 - cannot be used. 2521 - The tracing folks both located the requirement and provided the 2522 - needed fix, so this surprise requirement was relatively painless. 
2523 - 2524 - <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> 2525 - 2526 - <p> 2527 - Interrupting idle CPUs is considered socially unacceptable, 2528 - especially by people with battery-powered embedded systems. 2529 - RCU therefore conserves energy by detecting which CPUs are 2530 - idle, including tracking CPUs that have been interrupted from idle. 2531 - This is a large part of the energy-efficiency requirement, 2532 - so I learned of this via an irate phone call. 2533 - 2534 - <p> 2535 - Because RCU avoids interrupting idle CPUs, it is illegal to 2536 - execute an RCU read-side critical section on an idle CPU. 2537 - (Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat 2538 - if you try it.) 2539 - The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt> 2540 - event tracing are provided to work around this restriction. 2541 - In addition, <tt>rcu_is_watching()</tt> may be used to 2542 - test whether or not it is currently legal to run RCU read-side 2543 - critical sections on this CPU. 2544 - I learned of the need for diagnostics on the one hand 2545 - and <tt>RCU_NONIDLE()</tt> on the other while inspecting 2546 - idle-loop code. 2547 - Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, 2548 - which is used quite heavily in the idle loop. 2549 - However, there are some restrictions on the code placed within 2550 - <tt>RCU_NONIDLE()</tt>: 2551 - 2552 - <ol> 2553 - <li> Blocking is prohibited. 2554 - In practice, this is not a serious restriction given that idle 2555 - tasks are prohibited from blocking to begin with. 2556 - <li> Although <tt>RCU_NONIDLE()</tt> invocations may be nested, they cannot 2557 - nest indefinitely deeply. 2558 - However, given that they can be nested on the order of a million 2559 - deep, even on 32-bit systems, this should not be a serious 2560 - restriction. 2561 - This nesting limit would probably be reached long after the 2562 - compiler OOMed or the stack overflowed.
2563 - <li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence 2564 - out of that same <tt>RCU_NONIDLE()</tt>. 2565 - For example, the following is grossly illegal: 2566 - 2567 - <blockquote> 2568 - <pre> 2569 - RCU_NONIDLE({ 2570 - do_something(); 2571 - goto bad_idea; /* BUG!!! */ 2572 - do_something_else();}); 2573 - bad_idea: 2574 - </pre> 2575 - </blockquote> 2576 - 2577 - <p> 2578 - It is just as illegal to transfer control into the middle of 2579 - <tt>RCU_NONIDLE()</tt>'s argument. 2580 - Yes, in theory, you could transfer in as long as you also 2581 - transferred out, but in practice you could also expect to get sharply 2582 - worded review comments. 2583 - </ol> 2584 - 2585 - <p> 2586 - It is similarly socially unacceptable to interrupt a 2587 - <tt>nohz_full</tt> CPU running in userspace. 2588 - RCU must therefore track <tt>nohz_full</tt> userspace 2589 - execution. 2590 - To do so, RCU must be able to sample state at two points in 2591 - time, and be able to determine whether or not some other CPU spent 2592 - any time idle and/or executing in userspace. 2593 - 2594 - <p> 2595 - These energy-efficiency requirements have proven quite difficult to 2596 - understand and to meet. For example, there have been more than five 2597 - clean-sheet rewrites of RCU's energy-efficiency code, the last of 2598 - which was finally able to demonstrate 2599 - <a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>. 2600 - As noted earlier, 2601 - I learned of many of these requirements via angry phone calls: 2602 - Flaming me on the Linux-kernel mailing list was apparently not 2603 - sufficient to fully vent their ire at RCU's energy-efficiency bugs!
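<p>
The requirement that RCU sample state at two points in time, and then decide whether another CPU spent any time idle or in userspace in between, can be met with a per-CPU transition counter.
The following userspace sketch assumes a convention in which the counter is even while the CPU is idle (or in <tt>nohz_full</tt> userspace) and odd otherwise.
All of the <tt>cpu_eqs_*</tt> names are hypothetical, and this is an illustration of the idea rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state: even counter means idle or nohz_full
 * userspace, odd counter means in-kernel non-idle execution. */
struct cpu_eqs {
	atomic_ulong ctr;
};

/* Start out in non-idle kernel execution (odd). */
static void cpu_eqs_init(struct cpu_eqs *c)
{
	atomic_init(&c->ctr, 1);
}

/* Called by the CPU itself when going idle or returning to userspace. */
static void cpu_eqs_enter(struct cpu_eqs *c)
{
	unsigned long old = atomic_fetch_add(&c->ctr, 1);

	assert(old & 1); /* must have been non-idle */
}

/* Called by the CPU itself when leaving idle or entering the kernel. */
static void cpu_eqs_exit(struct cpu_eqs *c)
{
	unsigned long old = atomic_fetch_add(&c->ctr, 1);

	assert(!(old & 1)); /* must have been idle */
}

/* First sample, taken remotely without disturbing the sampled CPU. */
static unsigned long cpu_eqs_snap(struct cpu_eqs *c)
{
	return atomic_load(&c->ctr);
}

/* Second sample: true if the CPU was idle at the first sample, or has
 * passed through idle or userspace at any time since then. */
static bool cpu_eqs_since(struct cpu_eqs *c, unsigned long snap)
{
	return !(snap & 1) || atomic_load(&c->ctr) != snap;
}
```

<p>
Note that neither sample requires an interrupt or IPI on the sampled CPU, which is the property that the energy-efficiency requirement is after.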
2604 - 2605 - <h3><a name="Scheduling-Clock Interrupts and RCU"> 2606 - Scheduling-Clock Interrupts and RCU</a></h3> 2607 - 2608 - <p> 2609 - The kernel transitions between in-kernel non-idle execution, userspace 2610 - execution, and the idle loop. 2611 - Depending on kernel configuration, RCU handles these states differently: 2612 - 2613 - <table border=3> 2614 - <tr><th><tt>HZ</tt> Kconfig</th> 2615 - <th>In-Kernel</th> 2616 - <th>Usermode</th> 2617 - <th>Idle</th></tr> 2618 - <tr><th align="left"><tt>HZ_PERIODIC</tt></th> 2619 - <td>Can rely on scheduling-clock interrupt.</td> 2620 - <td>Can rely on scheduling-clock interrupt and its 2621 - detection of interrupt from usermode.</td> 2622 - <td>Can rely on RCU's dyntick-idle detection.</td></tr> 2623 - <tr><th align="left"><tt>NO_HZ_IDLE</tt></th> 2624 - <td>Can rely on scheduling-clock interrupt.</td> 2625 - <td>Can rely on scheduling-clock interrupt and its 2626 - detection of interrupt from usermode.</td> 2627 - <td>Can rely on RCU's dyntick-idle detection.</td></tr> 2628 - <tr><th align="left"><tt>NO_HZ_FULL</tt></th> 2629 - <td>Can only sometimes rely on scheduling-clock interrupt. 2630 - In other cases, it is necessary to bound kernel execution 2631 - times and/or use IPIs.</td> 2632 - <td>Can rely on RCU's dyntick-idle detection.</td> 2633 - <td>Can rely on RCU's dyntick-idle detection.</td></tr> 2634 - </table> 2635 - 2636 - <table> 2637 - <tr><th> </th></tr> 2638 - <tr><th align="left">Quick Quiz:</th></tr> 2639 - <tr><td> 2640 - Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the 2641 - scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> 2642 - and <tt>NO_HZ_IDLE</tt> do? 2643 - </td></tr> 2644 - <tr><th align="left">Answer:</th></tr> 2645 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 2646 - Because, as a performance optimization, <tt>NO_HZ_FULL</tt> 2647 - does not necessarily re-enable the scheduling-clock interrupt 2648 - on entry to each and every system call. 
2649 - </font></td></tr> 2650 - <tr><td> </td></tr> 2651 - </table> 2652 - 2653 - <p> 2654 - However, RCU must be reliably informed as to whether any given 2655 - CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, 2656 - also whether that CPU is executing in usermode, as discussed 2657 - <a href="#Energy Efficiency">earlier</a>. 2658 - It also requires that the scheduling-clock interrupt be enabled when 2659 - RCU needs it to be: 2660 - 2661 - <ol> 2662 - <li> If a CPU is either idle or executing in usermode, and RCU believes 2663 - it is non-idle, the scheduling-clock tick had better be running. 2664 - Otherwise, you will get RCU CPU stall warnings, or at best, 2665 - very long (11-second) grace periods, with a pointless IPI waking 2666 - the CPU from time to time. 2667 - <li> If a CPU is in a portion of the kernel that executes RCU read-side 2668 - critical sections, and RCU believes this CPU to be idle, you will get 2669 - random memory corruption. <b>DON'T DO THIS!!!</b> 2670 - 2671 - <br>This is one reason to test with lockdep, which will complain 2672 - about this sort of thing. 2673 - <li> If a CPU is in a portion of the kernel that is absolutely 2674 - positively no-joking guaranteed to never execute any RCU read-side 2675 - critical sections, and RCU believes this CPU to be idle, 2676 - no problem. This sort of thing is used by some architectures 2677 - for light-weight exception handlers, which can then avoid the 2678 - overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> 2679 - at exception entry and exit, respectively. 2680 - Some go further and avoid the entireties of <tt>irq_enter()</tt> 2681 - and <tt>irq_exit()</tt>. 2682 - 2683 - <br>Just make very sure you are running some of your tests with 2684 - <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths 2685 - was in fact joking about not doing RCU read-side critical sections.
2686 - <li> If a CPU is executing in the kernel with the scheduling-clock 2687 - interrupt disabled and RCU believes this CPU to be non-idle, 2688 - and if the CPU goes idle (from an RCU perspective) every few 2689 - jiffies, no problem. It is usually OK for there to be the 2690 - occasional gap between idle periods of up to a second or so. 2691 - 2692 - <br>If the gap grows too long, you get RCU CPU stall warnings. 2693 - <li> If a CPU is either idle or executing in usermode, and RCU believes 2694 - it to be idle, of course no problem. 2695 - <li> If a CPU is executing in the kernel, the kernel code 2696 - path is passing through quiescent states at a reasonable 2697 - frequency (preferably about once per few jiffies, but the 2698 - occasional excursion to a second or so is usually OK) and the 2699 - scheduling-clock interrupt is enabled, of course no problem. 2700 - 2701 - <br>If the gap between a successive pair of quiescent states grows 2702 - too long, you get RCU CPU stall warnings. 2703 - </ol> 2704 - 2705 - <table> 2706 - <tr><th> </th></tr> 2707 - <tr><th align="left">Quick Quiz:</th></tr> 2708 - <tr><td> 2709 - But what if my driver has a hardware interrupt handler 2710 - that can run for many seconds? 2711 - I cannot invoke <tt>schedule()</tt> from a hardware 2712 - interrupt handler, after all! 2713 - </td></tr> 2714 - <tr><th align="left">Answer:</th></tr> 2715 - <tr><td bgcolor="#ffffff"><font color="ffffff"> 2716 - One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> 2717 - every so often. 2718 - But given that long-running interrupt handlers can cause 2719 - other problems, not least for response time, shouldn't you 2720 - work to keep your interrupt handler's runtime within reasonable 2721 - bounds?
2722 - </font></td></tr> 2723 - <tr><td> </td></tr> 2724 - </table> 2725 - 2726 - <p> 2727 - But as long as RCU is properly informed of kernel state transitions between 2728 - in-kernel execution, usermode execution, and idle, and as long as the 2729 - scheduling-clock interrupt is enabled when RCU needs it to be, you 2730 - can rest assured that the bugs you encounter will be in some other 2731 - part of RCU or some other part of the kernel! 2732 - 2733 - <h3><a name="Memory Efficiency">Memory Efficiency</a></h3> 2734 - 2735 - <p> 2736 - Although small-memory non-realtime systems can simply use Tiny RCU, 2737 - code size is only one aspect of memory efficiency. 2738 - Another aspect is the size of the <tt>rcu_head</tt> structure 2739 - used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>. 2740 - Although this structure contains nothing more than a pair of pointers, 2741 - it does appear in many RCU-protected data structures, including 2742 - some that are size critical. 2743 - The <tt>page</tt> structure is a case in point, as evidenced by 2744 - the many occurrences of the <tt>union</tt> keyword within that structure. 2745 - 2746 - <p> 2747 - This need for memory efficiency is one reason that RCU uses hand-crafted 2748 - singly linked lists to track the <tt>rcu_head</tt> structures that 2749 - are waiting for a grace period to elapse. 2750 - It is also the reason why <tt>rcu_head</tt> structures do not contain 2751 - debug information, such as fields tracking the file and line of the 2752 - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them. 2753 - Although this information might appear in debug-only kernel builds at some 2754 - point, in the meantime, the <tt>->func</tt> field will often provide 2755 - the needed debug information. 2756 - 2757 - <p> 2758 - However, in some cases, the need for memory efficiency leads to even 2759 - more extreme measures. 
2760 - Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field 2761 - shares storage with a great many other structures that are used at 2762 - various points in the corresponding page's lifetime. 2763 - In order to correctly resolve certain 2764 - <a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>, 2765 - the Linux kernel's memory-management subsystem needs a particular bit 2766 - to remain zero during all phases of grace-period processing, 2767 - and that bit happens to map to the bottom bit of the 2768 - <tt>rcu_head</tt> structure's <tt>->next</tt> field. 2769 - RCU makes this guarantee as long as <tt>call_rcu()</tt> 2770 - is used to post the callback, as opposed to <tt>kfree_rcu()</tt> 2771 - or some future “lazy” 2772 - variant of <tt>call_rcu()</tt> that might one day be created for 2773 - energy-efficiency purposes. 2774 - 2775 - <p> 2776 - That said, there are limits. 2777 - RCU requires that the <tt>rcu_head</tt> structure be aligned to a 2778 - two-byte boundary, and passing a misaligned <tt>rcu_head</tt> 2779 - structure to one of the <tt>call_rcu()</tt> family of functions 2780 - will result in a splat. 2781 - It is therefore necessary to exercise caution when packing 2782 - structures containing fields of type <tt>rcu_head</tt>. 2783 - Why not a four-byte or even eight-byte alignment requirement? 2784 - Because the m68k architecture provides only two-byte alignment, 2785 - and thus acts as alignment's least common denominator. 2786 - 2787 - <p> 2788 - The reason for reserving the bottom bit of pointers to 2789 - <tt>rcu_head</tt> structures is to leave the door open to 2790 - “lazy” callbacks whose invocations can safely be deferred. 2791 - Deferring invocation could potentially have energy-efficiency 2792 - benefits, but only if the rate of non-lazy callbacks decreases 2793 - significantly for some important workload. 
2794 - In the meantime, reserving the bottom bit keeps this option open 2795 - in case it one day becomes useful. 2796 - 2797 - <h3><a name="Performance, Scalability, Response Time, and Reliability"> 2798 - Performance, Scalability, Response Time, and Reliability</a></h3> 2799 - 2800 - <p> 2801 - Expanding on the 2802 - <a href="#Performance and Scalability">earlier discussion</a>, 2803 - RCU is used heavily by hot code paths in performance-critical 2804 - portions of the Linux kernel's networking, security, virtualization, 2805 - and scheduling code. 2806 - RCU must therefore use efficient implementations, especially in its 2807 - read-side primitives. 2808 - To that end, it would be good if preemptible RCU's implementation 2809 - of <tt>rcu_read_lock()</tt> could be inlined. However, doing 2810 - this requires resolving <tt>#include</tt> issues with the 2811 - <tt>task_struct</tt> structure. 2812 - 2813 - <p> 2814 - The Linux kernel supports hardware configurations with up to 2815 - 4096 CPUs, which means that RCU must be extremely scalable. 2816 - Algorithms that involve frequent acquisitions of global locks or 2817 - frequent atomic operations on global variables simply cannot be 2818 - tolerated within the RCU implementation. 2819 - RCU therefore makes heavy use of a combining tree based on the 2820 - <tt>rcu_node</tt> structure. 2821 - RCU is required to tolerate all CPUs continuously invoking any 2822 - combination of RCU's runtime primitives with minimal per-operation 2823 - overhead. 2824 - In fact, in many cases, increasing load must <i>decrease</i> the 2825 - per-operation overhead: witness the batching optimizations for 2826 - <tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>, 2827 - <tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>. 2828 - As a general rule, RCU must cheerfully accept whatever the 2829 - rest of the Linux kernel decides to throw at it.
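<p>
The way a combining tree avoids global contention can be sketched with a toy user-space model.
The field names below are borrowed loosely from the <tt>rcu_node</tt> structure, but the <tt>model_</tt> types and functions are hypothetical, and the per-node locking that any real implementation would require is omitted.
Each quiescent-state report touches only one leaf, and moves up a level only when it clears the last bit at the current level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a combining-tree node. */
struct model_node {
	unsigned long qsmask;      /* Children (or CPUs) still owing a quiescent state. */
	unsigned long grpmask;     /* This node's bit in its parent's qsmask. */
	struct model_node *parent; /* NULL for the root. */
};

/*
 * Report quiescent states for the bits in mask at node np.
 * Returns true only when the report empties the root's mask,
 * that is, when the modeled grace period may end.
 */
static inline bool model_report_qs(struct model_node *np, unsigned long mask)
{
	assert(np != NULL);
	for (;;) {
		if (!(np->qsmask & mask))
			return false;	/* Already reported, nothing to do. */
		np->qsmask &= ~mask;
		if (np->qsmask != 0)
			return false;	/* Others at this level are still pending. */
		if (np->parent == NULL)
			return true;	/* Root is empty. */
		mask = np->grpmask;	/* Propagate one level up. */
		np = np->parent;
	}
}
```

<p>
With two leaves of two CPUs each, only the second report on each leaf reaches the root, so the root is touched twice rather than four times.
Larger fanouts reduce root-level traffic correspondingly.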
2830 - 2831 - <p> 2832 - The Linux kernel is used for real-time workloads, especially 2833 - in conjunction with the 2834 - <a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>. 2835 - The real-time-latency response requirements are such that the 2836 - traditional approach of disabling preemption across RCU 2837 - read-side critical sections is inappropriate. 2838 - Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore 2839 - use an RCU implementation that allows RCU read-side critical 2840 - sections to be preempted. 2841 - This requirement made its presence known after users made it 2842 - clear that an earlier 2843 - <a href="https://lwn.net/Articles/107930/">real-time patch</a> 2844 - did not meet their needs, in conjunction with some 2845 - <a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a> 2846 - encountered by a very early version of the -rt patchset. 2847 - 2848 - <p> 2849 - In addition, RCU must make do with a sub-100-microsecond real-time latency 2850 - budget. 2851 - In fact, on smaller systems with the -rt patchset, the Linux kernel 2852 - provides sub-20-microsecond real-time latencies for the whole kernel, 2853 - including RCU. 2854 - RCU's scalability and latency must therefore be sufficient for 2855 - these sorts of configurations. 2856 - To my surprise, the sub-100-microsecond real-time latency budget 2857 - <a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf"> 2858 - applies to even the largest systems [PDF]</a>, 2859 - up to and including systems with 4096 CPUs. 2860 - This real-time requirement motivated the grace-period kthread, which 2861 - also simplified handling of a number of race conditions. 2862 - 2863 - <p> 2864 - RCU must avoid degrading real-time response for CPU-bound threads, whether 2865 - executing in usermode (which is one use case for 2866 - <tt>CONFIG_NO_HZ_FULL=y</tt>) or in the kernel. 
2867 - That said, CPU-bound loops in the kernel must execute 2868 - <tt>cond_resched()</tt> at least once per few tens of milliseconds 2869 - in order to avoid receiving an IPI from RCU. 2870 - 2871 - <p> 2872 - Finally, RCU's status as a synchronization primitive means that 2873 - any RCU failure can result in arbitrary memory corruption that can be 2874 - extremely difficult to debug. 2875 - This means that RCU must be extremely reliable, which in 2876 - practice also means that RCU must have an aggressive stress-test 2877 - suite. 2878 - This stress-test suite is called <tt>rcutorture</tt>. 2879 - 2880 - <p> 2881 - Although the need for <tt>rcutorture</tt> was no surprise, 2882 - the current immense popularity of the Linux kernel is posing 2883 - interesting—and perhaps unprecedented—validation 2884 - challenges. 2885 - To see this, keep in mind that there are well over one billion 2886 - instances of the Linux kernel running today, given Android 2887 - smartphones, Linux-powered televisions, and servers. 2888 - This number can be expected to increase sharply with the advent of 2889 - the celebrated Internet of Things. 2890 - 2891 - <p> 2892 - Suppose that RCU contains a race condition that manifests on average 2893 - once per million years of runtime. 2894 - This bug will be occurring about three times per <i>day</i> across 2895 - the installed base. 2896 - RCU could simply hide behind hardware error rates, given that no one 2897 - should really expect their smartphone to last for a million years. 2898 - However, anyone taking too much comfort from this thought should 2899 - consider the fact that in most jurisdictions, a successful multi-year 2900 - test of a given mechanism, which might include a Linux kernel, 2901 - suffices for a number of types of safety-critical certifications. 2902 - In fact, rumor has it that the Linux kernel is already being used 2903 - in production for safety-critical applications. 
2904 - I don't know about you, but I would feel quite bad if a bug in RCU 2905 - killed someone. 2906 - Which might explain my recent focus on validation and verification. 2907 - 2908 - <h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2> 2909 - 2910 - <p> 2911 - One of the more surprising things about RCU is that there are now 2912 - no fewer than five <i>flavors</i>, or API families. 2913 - In addition, the primary flavor that has been the sole focus up to 2914 - this point has two different implementations, non-preemptible and 2915 - preemptible. 2916 - The other four flavors are listed below, with requirements for each 2917 - described in a separate section. 2918 - 2919 - <ol> 2920 - <li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a> 2921 - <li> <a href="#Sched Flavor">Sched Flavor (Historical)</a> 2922 - <li> <a href="#Sleepable RCU">Sleepable RCU</a> 2923 - <li> <a href="#Tasks RCU">Tasks RCU</a> 2924 - </ol> 2925 - 2926 - <h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3> 2927 - 2928 - <p> 2929 - The RCU-bh flavor of RCU has since been expressed in terms of 2930 - the other RCU flavors as part of a consolidation of the three 2931 - flavors into a single flavor. 2932 - The read-side API remains, and continues to disable softirq and to 2933 - be accounted for by lockdep. 2934 - Much of the material in this section is therefore strictly historical 2935 - in nature. 2936 - 2937 - <p> 2938 - The softirq-disable (AKA “bottom-half”, 2939 - hence the “_bh” abbreviations) 2940 - flavor of RCU, or <i>RCU-bh</i>, was developed by 2941 - Dipankar Sarma to provide a flavor of RCU that could withstand the 2942 - network-based denial-of-service attacks researched by Robert 2943 - Olsson. 
2944 - These attacks placed so much networking load on the system 2945 - that some of the CPUs never exited softirq execution, 2946 - which in turn prevented those CPUs from ever executing a context switch, 2947 - which, in the RCU implementation of that time, prevented grace periods 2948 - from ever ending. 2949 - The result was an out-of-memory condition and a system hang. 2950 - 2951 - <p> 2952 - The solution was the creation of RCU-bh, which does 2953 - <tt>local_bh_disable()</tt> 2954 - across its read-side critical sections, and which uses the transition 2955 - from one type of softirq processing to another as a quiescent state 2956 - in addition to context switch, idle, user mode, and offline. 2957 - This means that RCU-bh grace periods can complete even when some of 2958 - the CPUs execute in softirq indefinitely, thus allowing algorithms 2959 - based on RCU-bh to withstand network-based denial-of-service attacks. 2960 - 2961 - <p> 2962 - Because 2963 - <tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt> 2964 - disable and re-enable softirq handlers, any attempt to start a softirq 2965 - handler during the 2966 - RCU-bh read-side critical section will be deferred. 2967 - In this case, <tt>rcu_read_unlock_bh()</tt> 2968 - will invoke softirq processing, which can take considerable time. 2969 - One can of course argue that this softirq overhead should be associated 2970 - with the code following the RCU-bh read-side critical section rather 2971 - than <tt>rcu_read_unlock_bh()</tt>, but the fact 2972 - is that most profiling tools cannot be expected to make this sort 2973 - of fine distinction. 2974 - For example, suppose that a three-millisecond-long RCU-bh read-side 2975 - critical section executes during a time of heavy networking load.
2976 - There will very likely be an attempt to invoke at least one softirq 2977 - handler during that three milliseconds, but any such invocation will 2978 - be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>. 2979 - This can of course make it appear at first glance as if 2980 - <tt>rcu_read_unlock_bh()</tt> was executing very slowly. 2981 - 2982 - <p> 2983 - The 2984 - <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a> 2985 - includes 2986 - <tt>rcu_read_lock_bh()</tt>, 2987 - <tt>rcu_read_unlock_bh()</tt>, 2988 - <tt>rcu_dereference_bh()</tt>, 2989 - <tt>rcu_dereference_bh_check()</tt>, 2990 - <tt>synchronize_rcu_bh()</tt>, 2991 - <tt>synchronize_rcu_bh_expedited()</tt>, 2992 - <tt>call_rcu_bh()</tt>, 2993 - <tt>rcu_barrier_bh()</tt>, and 2994 - <tt>rcu_read_lock_bh_held()</tt>. 2995 - However, the update-side APIs are now simple wrappers for other RCU 2996 - flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt 2997 - otherwise. 2998 - 2999 - <h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3> 3000 - 3001 - <p> 3002 - The RCU-sched flavor of RCU has since been expressed in terms of 3003 - the other RCU flavors as part of a consolidation of the three 3004 - flavors into a single flavor. 3005 - The read-side API remains, and continues to disable preemption and to 3006 - be accounted for by lockdep. 3007 - Much of the material in this section is therefore strictly historical 3008 - in nature. 3009 - 3010 - <p> 3011 - Before preemptible RCU, waiting for an RCU grace period had the 3012 - side effect of also waiting for all pre-existing interrupt 3013 - and NMI handlers. 3014 - However, there are legitimate preemptible-RCU implementations that 3015 - do not have this property, given that any point in the code outside 3016 - of an RCU read-side critical section can be a quiescent state. 
3017 - Therefore, <i>RCU-sched</i> was created, which follows “classic” 3018 - RCU in that an RCU-sched grace period waits for pre-existing 3019 - interrupt and NMI handlers. 3020 - In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched 3021 - APIs have identical implementations, while kernels built with 3022 - <tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each. 3023 - 3024 - <p> 3025 - Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels, 3026 - <tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt> 3027 - disable and re-enable preemption, respectively. 3028 - This means that if there was a preemption attempt during the 3029 - RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt> 3030 - will enter the scheduler, with all the latency and overhead entailed. 3031 - Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look 3032 - as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly. 3033 - However, the highest-priority task won't be preempted, so that task 3034 - will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations. 3035 - 3036 - <p> 3037 - The 3038 - <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a> 3039 - includes 3040 - <tt>rcu_read_lock_sched()</tt>, 3041 - <tt>rcu_read_unlock_sched()</tt>, 3042 - <tt>rcu_read_lock_sched_notrace()</tt>, 3043 - <tt>rcu_read_unlock_sched_notrace()</tt>, 3044 - <tt>rcu_dereference_sched()</tt>, 3045 - <tt>rcu_dereference_sched_check()</tt>, 3046 - <tt>synchronize_sched()</tt>, 3047 - <tt>synchronize_rcu_sched_expedited()</tt>, 3048 - <tt>call_rcu_sched()</tt>, 3049 - <tt>rcu_barrier_sched()</tt>, and 3050 - <tt>rcu_read_lock_sched_held()</tt>.
3051 - However, anything that disables preemption also marks an RCU-sched 3052 - read-side critical section, including 3053 - <tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>, 3054 - <tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>, 3055 - and so on. 3056 - 3057 - <h3><a name="Sleepable RCU">Sleepable RCU</a></h3> 3058 - 3059 - <p> 3060 - For well over a decade, someone saying “I need to block within 3061 - an RCU read-side critical section” was a reliable indication 3062 - that this someone did not understand RCU. 3063 - After all, if you are always blocking in an RCU read-side critical 3064 - section, you can probably afford to use a higher-overhead synchronization 3065 - mechanism. 3066 - However, that changed with the advent of the Linux kernel's notifiers, 3067 - whose RCU read-side critical 3068 - sections almost never sleep, but sometimes need to. 3069 - This resulted in the introduction of 3070 - <a href="https://lwn.net/Articles/202847/">sleepable RCU</a>, 3071 - or <i>SRCU</i>. 3072 - 3073 - <p> 3074 - SRCU allows different domains to be defined, with each such domain 3075 - defined by an instance of an <tt>srcu_struct</tt> structure. 3076 - A pointer to this structure must be passed in to each SRCU function, 3077 - for example, <tt>synchronize_srcu(&ss)</tt>, where 3078 - <tt>ss</tt> is the <tt>srcu_struct</tt> structure. 3079 - The key benefit of these domains is that a slow SRCU reader in one 3080 - domain does not delay an SRCU grace period in some other domain. 
3081 - That said, one consequence of these domains is that read-side code 3082 - must pass a “cookie” from <tt>srcu_read_lock()</tt> 3083 - to <tt>srcu_read_unlock()</tt>, for example, as follows: 3084 - 3085 - <blockquote> 3086 - <pre> 3087 - 1 int idx; 3088 - 2 3089 - 3 idx = srcu_read_lock(&ss); 3090 - 4 do_something(); 3091 - 5 srcu_read_unlock(&ss, idx); 3092 - </pre> 3093 - </blockquote> 3094 - 3095 - <p> 3096 - As noted above, it is legal to block within SRCU read-side critical sections. 3097 - However, with great power comes great responsibility. 3098 - If you block forever in one of a given domain's SRCU read-side critical 3099 - sections, then that domain's grace periods will also be blocked forever. 3100 - Of course, one good way to block forever is to deadlock, which can 3101 - happen if any operation in a given domain's SRCU read-side critical 3102 - section can wait, either directly or indirectly, for that domain's 3103 - grace period to elapse. 3104 - For example, this results in a self-deadlock: 3105 - 3106 - <blockquote> 3107 - <pre> 3108 - 1 int idx; 3109 - 2 3110 - 3 idx = srcu_read_lock(&ss); 3111 - 4 do_something(); 3112 - 5 synchronize_srcu(&ss); 3113 - 6 srcu_read_unlock(&ss, idx); 3114 - </pre> 3115 - </blockquote> 3116 - 3117 - <p> 3118 - However, if line 5 instead acquired a mutex that was held across 3119 - a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>, 3120 - deadlock would still be possible. 3121 - Furthermore, if line 5 acquired a mutex that was held across 3122 - a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>, 3123 - and if an <tt>ss1</tt>-domain SRCU read-side critical section 3124 - acquired another mutex that was held across an <tt>ss</tt>-domain 3125 - <tt>synchronize_srcu()</tt>, 3126 - deadlock would again be possible. 3127 - Such a deadlock cycle could extend across an arbitrarily large number 3128 - of different SRCU domains. 3129 - Again, with great power comes great responsibility.
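<p>
The role of the cookie, and the reason for the self-deadlock, can both be seen in a toy user-space model of an SRCU-like domain.
This is a sketch only: the <tt>model_srcu</tt> names are hypothetical, the model is single-threaded, it assumes one grace period at a time, and real SRCU uses per-CPU counters plus memory barriers rather than the two plain counters shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one SRCU-like domain. */
struct model_srcu {
	int cur;          /* Index that new readers will use (0 or 1). */
	long nreaders[2]; /* Readers currently inside, per index. */
};

/* Enter a read-side critical section and return the cookie. */
static inline int model_srcu_read_lock(struct model_srcu *sp)
{
	int idx;

	assert(sp != NULL);
	idx = sp->cur;
	sp->nreaders[idx]++;
	return idx;
}

/* Leave the critical section, decrementing the counter named by the cookie. */
static inline void model_srcu_read_unlock(struct model_srcu *sp, int idx)
{
	assert(sp != NULL && (idx == 0 || idx == 1) && sp->nreaders[idx] > 0);
	sp->nreaders[idx]--;
}

/* First half of a modeled grace period: redirect new readers. */
static inline int model_srcu_flip(struct model_srcu *sp)
{
	int old = sp->cur;

	sp->cur = !old;
	return old;
}

/* Second half: a real implementation would block until this is true. */
static inline bool model_srcu_readers_done(const struct model_srcu *sp, int idx)
{
	return sp->nreaders[idx] == 0;
}
```

<p>
Because the index can flip while a reader is inside its critical section, the reader must hand back the cookie so that the matching counter is decremented.
A reader that waited for <tt>model_srcu_readers_done()</tt> on its own index before unlocking would wait forever, which is the self-deadlock described above.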
3130 - 3131 - <p> 3132 - Unlike the other RCU flavors, SRCU read-side critical sections can 3133 - run on idle and even offline CPUs. 3134 - This ability requires that <tt>srcu_read_lock()</tt> and 3135 - <tt>srcu_read_unlock()</tt> contain memory barriers, which means 3136 - that SRCU readers will run a bit slower than would RCU readers. 3137 - It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt> 3138 - API, which, in combination with <tt>srcu_read_unlock()</tt>, 3139 - guarantees a full memory barrier. 3140 - 3141 - <p> 3142 - Also unlike other RCU flavors, <tt>synchronize_srcu()</tt> may <b>not</b> 3143 - be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace 3144 - periods make use of timers and the possibility of timers being temporarily 3145 - “stranded” on the outgoing CPU. 3146 - This stranding of timers means that timers posted to the outgoing CPU 3147 - will not fire until late in the CPU-hotplug process. 3148 - The problem is that if a notifier is waiting on an SRCU grace period, 3149 - that grace period is waiting on a timer, and that timer is stranded on the 3150 - outgoing CPU, then the notifier will never be awakened, in other words, 3151 - deadlock has occurred. 3152 - This same situation of course also prohibits <tt>srcu_barrier()</tt> 3153 - from being invoked from CPU-hotplug notifiers. 3154 - 3155 - <p> 3156 - SRCU also differs from other RCU flavors in that SRCU's expedited and 3157 - non-expedited grace periods are implemented by the same mechanism. 3158 - This means that in the current SRCU implementation, expediting a 3159 - future grace period has the side effect of expediting all prior 3160 - grace periods that have not yet completed. 3161 - (But please note that this is a property of the current implementation, 3162 - not necessarily of future implementations.) 
3163 - In addition, if SRCU has been idle for longer than the interval 3164 - specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter 3165 - (25 microseconds by default), 3166 - and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, 3167 - that invocation will be automatically expedited. 3168 - 3169 - <p> 3170 - As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating 3171 - a locking bottleneck present in prior kernel versions. 3172 - Although this will allow users to put much heavier stress on 3173 - <tt>call_srcu()</tt>, it is important to note that SRCU does not 3174 - yet take any special steps to deal with callback flooding. 3175 - So if you are posting (say) 10,000 SRCU callbacks per second per CPU, 3176 - you are probably totally OK, but if you intend to post (say) 1,000,000 3177 - SRCU callbacks per second per CPU, please run some tests first. 3178 - SRCU just might need a few adjustments to deal with that sort of load. 3179 - Of course, your mileage may vary based on the speed of your CPUs and 3180 - the size of your memory. 3181 - 3182 - <p> 3183 - The 3184 - <a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a> 3185 - includes 3186 - <tt>srcu_read_lock()</tt>, 3187 - <tt>srcu_read_unlock()</tt>, 3188 - <tt>srcu_dereference()</tt>, 3189 - <tt>srcu_dereference_check()</tt>, 3190 - <tt>synchronize_srcu()</tt>, 3191 - <tt>synchronize_srcu_expedited()</tt>, 3192 - <tt>call_srcu()</tt>, 3193 - <tt>srcu_barrier()</tt>, and 3194 - <tt>srcu_read_lock_held()</tt>. 3195 - It also includes 3196 - <tt>DEFINE_SRCU()</tt>, 3197 - <tt>DEFINE_STATIC_SRCU()</tt>, and 3198 - <tt>init_srcu_struct()</tt> 3199 - APIs for defining and initializing <tt>srcu_struct</tt> structures. 3200 - 3201 - <h3><a name="Tasks RCU">Tasks RCU</a></h3> 3202 - 3203 - <p> 3204 - Some forms of tracing use “trampolines” to handle the 3205 - binary rewriting required to install different types of probes.
3206 - It would be good to be able to free old trampolines, which sounds 3207 - like a job for some form of RCU. 3208 - However, because it is necessary to be able to install a trace 3209 - anywhere in the code, it is not possible to use read-side markers 3210 - such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. 3211 - In addition, it does not work to have these markers in the trampoline 3212 - itself, because there would need to be instructions following 3213 - <tt>rcu_read_unlock()</tt>. 3214 - Although <tt>synchronize_rcu()</tt> would guarantee that execution 3215 - reached the <tt>rcu_read_unlock()</tt>, it would not be able to 3216 - guarantee that execution had completely left the trampoline. 3217 - 3218 - <p> 3219 - The solution, in the form of 3220 - <a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>, 3221 - is to have implicit 3222 - read-side critical sections that are delimited by voluntary context 3223 - switches, that is, calls to <tt>schedule()</tt>, 3224 - <tt>cond_resched()</tt>, and 3225 - <tt>synchronize_rcu_tasks()</tt>. 3226 - In addition, transitions to and from userspace execution also delimit 3227 - tasks-RCU read-side critical sections. 3228 - 3229 - <p> 3230 - The tasks-RCU API is quite compact, consisting only of 3231 - <tt>call_rcu_tasks()</tt>, 3232 - <tt>synchronize_rcu_tasks()</tt>, and 3233 - <tt>rcu_barrier_tasks()</tt>. 3234 - In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted, 3235 - so these APIs map to 3236 - <tt>call_rcu()</tt>, 3237 - <tt>synchronize_rcu()</tt>, and 3238 - <tt>rcu_barrier()</tt>, respectively. 3239 - In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted, 3240 - and these three APIs are therefore implemented by separate functions 3241 - that check for voluntary context switches. 
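<p>
The idea of read-side critical sections delimited by voluntary context switches can be modeled in user space as follows.
This is a rough sketch, not the kernel's algorithm: the <tt>model_</tt> names are hypothetical, every task is treated as a holdout at the start of the grace period (a real implementation can skip tasks that are not runnable or are executing in userspace), and all synchronization is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state for the model. */
struct model_task {
	unsigned long nvcsw; /* Count of voluntary context switches so far. */
	unsigned long snap;  /* Value of nvcsw when the grace period started. */
	bool holdout;        /* Still blocking the current grace period? */
};

/* Stand-in for schedule(), cond_resched(), or a trip to userspace. */
static inline void model_voluntary_switch(struct model_task *t)
{
	assert(t != NULL);
	t->nvcsw++;
}

/* Start a modeled grace period: snapshot every task. */
static inline void model_tasks_gp_start(struct model_task *tasks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		tasks[i].snap = tasks[i].nvcsw;
		tasks[i].holdout = true;
	}
}

/*
 * Rescan the holdouts and return how many remain.
 * The modeled grace period ends when this returns zero.
 */
static inline size_t model_tasks_gp_scan(struct model_task *tasks, size_t n)
{
	size_t remaining = 0;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].holdout && tasks[i].nvcsw != tasks[i].snap)
			tasks[i].holdout = false;
		if (tasks[i].holdout)
			remaining++;
	}
	return remaining;
}
```

<p>
Once a task has switched voluntarily after the snapshot, it cannot still be executing in a trampoline that was unlinked before the grace period began, which is the property needed to free old trampolines.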
3242 - 3243 - <h2><a name="Possible Future Changes">Possible Future Changes</a></h2> 3244 - 3245 - <p> 3246 - One of the tricks that RCU uses to attain update-side scalability is 3247 - to increase grace-period latency with increasing numbers of CPUs. 3248 - If this becomes a serious problem, it will be necessary to rework the 3249 - grace-period state machine so as to avoid the need for the additional 3250 - latency. 3251 - 3252 - <p> 3253 - RCU disables CPU hotplug in a few places, perhaps most notably in the 3254 - <tt>rcu_barrier()</tt> operations. 3255 - If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug 3256 - notifiers, it will be necessary to avoid disabling CPU hotplug. 3257 - This would introduce some complexity, so there had better be a <i>very</i> 3258 - good reason. 3259 - 3260 - <p> 3261 - The tradeoff between grace-period latency on the one hand and interruptions 3262 - of other CPUs on the other hand may need to be re-examined. 3263 - The desire is of course for zero grace-period latency as well as zero 3264 - interprocessor interrupts undertaken during an expedited grace period 3265 - operation. 3266 - While this ideal is unlikely to be achievable, it is quite possible that 3267 - further improvements can be made. 3268 - 3269 - <p> 3270 - The multiprocessor implementations of RCU use a combining tree that 3271 - groups CPUs so as to reduce lock contention and increase cache locality. 3272 - However, this combining tree does not spread its memory across NUMA 3273 - nodes nor does it align the CPU groups with hardware features such 3274 - as sockets or cores. 3275 - Such spreading and alignment is currently believed to be unnecessary 3276 - because the hotpath read-side primitives do not access the combining 3277 - tree, nor does <tt>call_rcu()</tt> in the common case. 
3278 - If you believe that your architecture needs such spreading and alignment, 3279 - then your architecture should also benefit from the 3280 - <tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set 3281 - to the number of CPUs in a socket, NUMA node, or whatever. 3282 - If the number of CPUs is too large, use a fraction of the number of 3283 - CPUs. 3284 - If the number of CPUs is a large prime number, well, that certainly 3285 - is an “interesting” architectural choice! 3286 - More flexible arrangements might be considered, but only if 3287 - <tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only 3288 - if the inadequacy has been demonstrated by a carefully run and 3289 - realistic system-level workload. 3290 - 3291 - <p> 3292 - Please note that arrangements that require RCU to remap CPU numbers will 3293 - require extremely good demonstration of need and full exploration of 3294 - alternatives. 3295 - 3296 - <p> 3297 - RCU's various kthreads are reasonably recent additions. 3298 - It is quite likely that adjustments will be required to more gracefully 3299 - handle extreme loads. 3300 - It might also be necessary to be able to relate CPU utilization by 3301 - RCU's kthreads and softirq handlers to the code that instigated this 3302 - CPU utilization. 3303 - For example, RCU callback overhead might be charged back to the 3304 - originating <tt>call_rcu()</tt> instance, though probably not 3305 - in production kernels. 3306 - 3307 - <p> 3308 - Additional work may be required to provide reasonable forward-progress 3309 - guarantees under heavy load for grace periods and for callback 3310 - invocation. 3311 - 3312 - <h2><a name="Summary">Summary</a></h2> 3313 - 3314 - <p> 3315 - This document has presented more than two decades' worth of RCU 3316 - requirements.
3317 - Given that the requirements keep changing, this will not be the last 3318 - word on this subject, but at least it serves to get an important 3319 - subset of the requirements set forth. 3320 - 3321 - <h2><a name="Acknowledgments">Acknowledgments</a></h2> 3322 - 3323 - I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, 3324 - Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and 3325 - Andy Lutomirski for their help in rendering 3326 - this article human readable, and to Michelle Rankin for her support 3327 - of this effort. 3328 - Other contributions are acknowledged in the Linux kernel's git archive. 3329 - 3330 - </body></html>

+2662

Documentation/RCU/Design/Requirements/Requirements.rst

··· 1 + ================================= 2 + A Tour Through RCU's Requirements 3 + ================================= 4 + 5 + Copyright IBM Corporation, 2015 6 + 7 + Author: Paul E. McKenney 8 + 9 + The initial version of this document appeared on 10 + `LWN <https://lwn.net/>`_ in these articles: 11 + `part 1 <https://lwn.net/Articles/652156/>`_, 12 + `part 2 <https://lwn.net/Articles/652677/>`_, and 13 + `part 3 <https://lwn.net/Articles/653326/>`_. 14 + 15 + Introduction 16 + ------------ 17 + 18 + Read-copy update (RCU) is a synchronization mechanism that is often used 19 + as a replacement for reader-writer locking. RCU is unusual in that 20 + updaters do not block readers, which means that RCU's read-side 21 + primitives can be exceedingly fast and scalable. In addition, updaters 22 + can make useful forward progress concurrently with readers. However, all 23 + this concurrency between RCU readers and updaters does raise the 24 + question of exactly what RCU readers are doing, which in turn raises the 25 + question of exactly what RCU's requirements are. 26 + 27 + This document therefore summarizes RCU's requirements, and can be 28 + thought of as an informal, high-level specification for RCU. It is 29 + important to understand that RCU's specification is primarily empirical 30 + in nature; in fact, I learned about many of these requirements the hard 31 + way. This situation might cause some consternation; however, not only 32 + has this learning process been a lot of fun, but it has also been a 33 + great privilege to work with so many people willing to apply 34 + technologies in interesting new ways. 35 + 36 + All that aside, here are the categories of currently known RCU 37 + requirements: 38 + 39 + #. `Fundamental Requirements <#Fundamental%20Requirements>`__ 40 + #. `Fundamental Non-Requirements <#Fundamental%20Non-Requirements>`__ 41 + #. `Parallelism Facts of Life <#Parallelism%20Facts%20of%20Life>`__ 42 + #. 
`Quality-of-Implementation 43 + Requirements <#Quality-of-Implementation%20Requirements>`__ 44 + #. `Linux Kernel Complications <#Linux%20Kernel%20Complications>`__ 45 + #. `Software-Engineering 46 + Requirements <#Software-Engineering%20Requirements>`__ 47 + #. `Other RCU Flavors <#Other%20RCU%20Flavors>`__ 48 + #. `Possible Future Changes <#Possible%20Future%20Changes>`__ 49 + 50 + This is followed by a `summary <#Summary>`__; however, the answer to 51 + each quick quiz immediately follows the quiz. Select the big white space 52 + with your mouse to see the answer. 53 + 54 + Fundamental Requirements 55 + ------------------------ 56 + 57 + RCU's fundamental requirements are the closest thing RCU has to hard 58 + mathematical requirements. These are: 59 + 60 + #. `Grace-Period Guarantee <#Grace-Period%20Guarantee>`__ 61 + #. `Publish-Subscribe Guarantee <#Publish-Subscribe%20Guarantee>`__ 62 + #. `Memory-Barrier Guarantees <#Memory-Barrier%20Guarantees>`__ 63 + #. `RCU Primitives Guaranteed to Execute 64 + Unconditionally <#RCU%20Primitives%20Guaranteed%20to%20Execute%20Unconditionally>`__ 65 + #. `Guaranteed Read-to-Write 66 + Upgrade <#Guaranteed%20Read-to-Write%20Upgrade>`__ 67 + 68 + Grace-Period Guarantee 69 + ~~~~~~~~~~~~~~~~~~~~~~ 70 + 71 + RCU's grace-period guarantee is unusual in being premeditated: Jack 72 + Slingwine and I had this guarantee firmly in mind when we started work 73 + on RCU (then called “rclock”) in the early 1990s. That said, the past 74 + two decades of experience with RCU have produced a much more detailed 75 + understanding of this guarantee. 76 + 77 + RCU's grace-period guarantee allows updaters to wait for the completion 78 + of all pre-existing RCU read-side critical sections. An RCU read-side 79 + critical section begins with the marker ``rcu_read_lock()`` and ends 80 + with the marker ``rcu_read_unlock()``. These markers may be nested, and 81 + RCU treats a nested set as one big RCU read-side critical section. 
82 + Production-quality implementations of ``rcu_read_lock()`` and 83 + ``rcu_read_unlock()`` are extremely lightweight, and in fact have 84 + exactly zero overhead in Linux kernels built for production use with 85 + ``CONFIG_PREEMPT=n``. 86 + 87 + This guarantee allows ordering to be enforced with extremely low 88 + overhead to readers, for example: 89 + 90 + :: 91 + 92 + 1 int x, y; 93 + 2 94 + 3 void thread0(void) 95 + 4 { 96 + 5 rcu_read_lock(); 97 + 6 r1 = READ_ONCE(x); 98 + 7 r2 = READ_ONCE(y); 99 + 8 rcu_read_unlock(); 100 + 9 } 101 + 10 102 + 11 void thread1(void) 103 + 12 { 104 + 13 WRITE_ONCE(x, 1); 105 + 14 synchronize_rcu(); 106 + 15 WRITE_ONCE(y, 1); 107 + 16 } 108 + 109 + Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing 110 + readers, any instance of ``thread0()`` that loads a value of zero from 111 + ``x`` must complete before ``thread1()`` stores to ``y``, so that 112 + instance must also load a value of zero from ``y``. Similarly, any 113 + instance of ``thread0()`` that loads a value of one from ``y`` must have 114 + started after the ``synchronize_rcu()`` started, and must therefore also 115 + load a value of one from ``x``. Therefore, the outcome: 116 + 117 + :: 118 + 119 + (r1 == 0 && r2 == 1) 120 + 121 + cannot happen. 122 + 123 + +-----------------------------------------------------------------------+ 124 + | **Quick Quiz**: | 125 + +-----------------------------------------------------------------------+ 126 + | Wait a minute! You said that updaters can make useful forward | 127 + | progress concurrently with readers, but pre-existing readers will | 128 + | block ``synchronize_rcu()``!!! | 129 + | Just who are you trying to fool??? 
| 130 + +-----------------------------------------------------------------------+ 131 + | **Answer**: | 132 + +-----------------------------------------------------------------------+ 133 + | First, if updaters do not wish to be blocked by readers, they can use | 134 + | ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. | 135 + | Second, even when using ``synchronize_rcu()``, the other update-side | 136 + | code does run concurrently with readers, whether pre-existing or not. | 137 + +-----------------------------------------------------------------------+ 138 + 139 + This scenario resembles one of the first uses of RCU in 140 + `DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a 141 + distributed lock manager's transition into a state suitable for handling 142 + recovery from node failure, more or less as follows: 143 + 144 + :: 145 + 146 + 1 #define STATE_NORMAL 0 147 + 2 #define STATE_WANT_RECOVERY 1 148 + 3 #define STATE_RECOVERING 2 149 + 4 #define STATE_WANT_NORMAL 3 150 + 5 151 + 6 int state = STATE_NORMAL; 152 + 7 153 + 8 void do_something_dlm(void) 154 + 9 { 155 + 10 int state_snap; 156 + 11 157 + 12 rcu_read_lock(); 158 + 13 state_snap = READ_ONCE(state); 159 + 14 if (state_snap == STATE_NORMAL) 160 + 15 do_something(); 161 + 16 else 162 + 17 do_something_carefully(); 163 + 18 rcu_read_unlock(); 164 + 19 } 165 + 20 166 + 21 void start_recovery(void) 167 + 22 { 168 + 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); 169 + 24 synchronize_rcu(); 170 + 25 WRITE_ONCE(state, STATE_RECOVERING); 171 + 26 recovery(); 172 + 27 WRITE_ONCE(state, STATE_WANT_NORMAL); 173 + 28 synchronize_rcu(); 174 + 29 WRITE_ONCE(state, STATE_NORMAL); 175 + 30 } 176 + 177 + The RCU read-side critical section in ``do_something_dlm()`` works with 178 + the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that 179 + ``do_something()`` never runs concurrently with ``recovery()``, but with 180 + little or no synchronization overhead in 
``do_something_dlm()``. 181 + 182 + +-----------------------------------------------------------------------+ 183 + | **Quick Quiz**: | 184 + +-----------------------------------------------------------------------+ 185 + | Why is the ``synchronize_rcu()`` on line 28 needed? | 186 + +-----------------------------------------------------------------------+ 187 + | **Answer**: | 188 + +-----------------------------------------------------------------------+ 189 + | Without that extra grace period, memory reordering could result in | 190 + | ``do_something_dlm()`` executing ``do_something()`` concurrently with | 191 + | the last bits of ``recovery()``. | 192 + +-----------------------------------------------------------------------+ 193 + 194 + In order to avoid fatal problems such as deadlocks, an RCU read-side 195 + critical section must not contain calls to ``synchronize_rcu()``. 196 + Similarly, an RCU read-side critical section must not contain anything 197 + that waits, directly or indirectly, on completion of an invocation of 198 + ``synchronize_rcu()``. 199 + 200 + Although RCU's grace-period guarantee is useful in and of itself, with 201 + `quite a few use cases <https://lwn.net/Articles/573497/>`__, it would 202 + be good to be able to use RCU to coordinate read-side access to linked 203 + data structures. For this, the grace-period guarantee is not sufficient, 204 + as can be seen in function ``add_gp_buggy()`` below. We will look at the 205 + reader's code later, but in the meantime, just think of the reader as 206 + locklessly picking up the ``gp`` pointer, and, if the value loaded is 207 + non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. 
208 + 209 + :: 210 + 211 + 1 bool add_gp_buggy(int a, int b) 212 + 2 { 213 + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 214 + 4 if (!p) 215 + 5 return false; 216 + 6 spin_lock(&gp_lock); 217 + 7 if (rcu_access_pointer(gp)) { 218 + 8 spin_unlock(&gp_lock); 219 + 9 return false; 220 + 10 } 221 + 11 p->a = a; 222 + 12 p->b = b; 223 + 13 gp = p; /* ORDERING BUG */ 224 + 14 spin_unlock(&gp_lock); 225 + 15 return true; 226 + 16 } 227 + 228 + The problem is that both the compiler and weakly ordered CPUs are within 229 + their rights to reorder this code as follows: 230 + 231 + :: 232 + 233 + 1 bool add_gp_buggy_optimized(int a, int b) 234 + 2 { 235 + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 236 + 4 if (!p) 237 + 5 return false; 238 + 6 spin_lock(&gp_lock); 239 + 7 if (rcu_access_pointer(gp)) { 240 + 8 spin_unlock(&gp_lock); 241 + 9 return false; 242 + 10 } 243 + 11 gp = p; /* ORDERING BUG */ 244 + 12 p->a = a; 245 + 13 p->b = b; 246 + 14 spin_unlock(&gp_lock); 247 + 15 return true; 248 + 16 } 249 + 250 + If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized()`` 251 + executes line 11, it will see garbage in the ``->a`` and ``->b`` fields. 252 + And this is but one of many ways in which compiler and hardware 253 + optimizations could cause trouble. Therefore, we clearly need some way 254 + to prevent the compiler and the CPU from reordering in this manner, 255 + which brings us to the publish-subscribe guarantee discussed in the next 256 + section. 257 + 258 + Publish/Subscribe Guarantee 259 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 260 + 261 + RCU's publish-subscribe guarantee allows data to be inserted into a 262 + linked data structure without disrupting RCU readers. The updater uses 263 + ``rcu_assign_pointer()`` to insert the new data, and readers use 264 + ``rcu_dereference()`` to access data, whether new or old. 
The following 265 + shows an example of insertion: 266 + 267 + :: 268 + 269 + 1 bool add_gp(int a, int b) 270 + 2 { 271 + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); 272 + 4 if (!p) 273 + 5 return false; 274 + 6 spin_lock(&gp_lock); 275 + 7 if (rcu_access_pointer(gp)) { 276 + 8 spin_unlock(&gp_lock); 277 + 9 return false; 278 + 10 } 279 + 11 p->a = a; 280 + 12 p->b = b; 281 + 13 rcu_assign_pointer(gp, p); 282 + 14 spin_unlock(&gp_lock); 283 + 15 return true; 284 + 16 } 285 + 286 + The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a 287 + simple assignment statement, but also guarantees that its assignment 288 + will happen after the two assignments in lines 11 and 12, similar to the 289 + C11 ``memory_order_release`` store operation. It also prevents any 290 + number of “interesting” compiler optimizations, for example, the use of 291 + ``gp`` as a scratch location immediately preceding the assignment. 292 + 293 + +-----------------------------------------------------------------------+ 294 + | **Quick Quiz**: | 295 + +-----------------------------------------------------------------------+ 296 + | But ``rcu_assign_pointer()`` does nothing to prevent the two | 297 + | assignments to ``p->a`` and ``p->b`` from being reordered. Can't that | 298 + | also cause problems? | 299 + +-----------------------------------------------------------------------+ 300 + | **Answer**: | 301 + +-----------------------------------------------------------------------+ 302 + | No, it cannot. The readers cannot see either of these two fields | 303 + | until the assignment to ``gp``, by which time both fields are fully | 304 + | initialized. So reordering the assignments to ``p->a`` and ``p->b`` | 305 + | cannot possibly cause any problems. 
| 306 + +-----------------------------------------------------------------------+ 307 + 308 + It is tempting to assume that the reader need not do anything special to 309 + control its accesses to the RCU-protected data, as shown in 310 + ``do_something_gp_buggy()`` below: 311 + 312 + :: 313 + 314 + 1 bool do_something_gp_buggy(void) 315 + 2 { 316 + 3 rcu_read_lock(); 317 + 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ 318 + 5 if (p) { 319 + 6 do_something(p->a, p->b); 320 + 7 rcu_read_unlock(); 321 + 8 return true; 322 + 9 } 323 + 10 rcu_read_unlock(); 324 + 11 return false; 325 + 12 } 326 + 327 + However, this temptation must be resisted because there are a 328 + surprisingly large number of ways that the compiler (to say nothing of 329 + `DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__) 330 + can trip this code up. For but one example, if the compiler were short 331 + of registers, it might choose to refetch from ``gp`` rather than keeping 332 + a separate copy in ``p`` as follows: 333 + 334 + :: 335 + 336 + 1 bool do_something_gp_buggy_optimized(void) 337 + 2 { 338 + 3 rcu_read_lock(); 339 + 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ 340 + 5 do_something(gp->a, gp->b); 341 + 6 rcu_read_unlock(); 342 + 7 return true; 343 + 8 } 344 + 9 rcu_read_unlock(); 345 + 10 return false; 346 + 11 } 347 + 348 + If this function ran concurrently with a series of updates that replaced 349 + the current structure with a new one, the fetches of ``gp->a`` and 350 + ``gp->b`` might well come from two different structures, which could 351 + cause serious confusion. 
To prevent this (and much else besides), 352 + ``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``: 353 + 354 + :: 355 + 356 + 1 bool do_something_gp(void) 357 + 2 { 358 + 3 rcu_read_lock(); 359 + 4 p = rcu_dereference(gp); 360 + 5 if (p) { 361 + 6 do_something(p->a, p->b); 362 + 7 rcu_read_unlock(); 363 + 8 return true; 364 + 9 } 365 + 10 rcu_read_unlock(); 366 + 11 return false; 367 + 12 } 368 + 369 + The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory 370 + barriers in the Linux kernel. Should a `high-quality implementation of 371 + C11 ``memory_order_consume`` 372 + [PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__ 373 + ever appear, then ``rcu_dereference()`` could be implemented as a 374 + ``memory_order_consume`` load. Regardless of the exact implementation, a 375 + pointer fetched by ``rcu_dereference()`` may not be used outside of the 376 + outermost RCU read-side critical section containing that 377 + ``rcu_dereference()``, unless protection of the corresponding data 378 + element has been passed from RCU to some other synchronization 379 + mechanism, most commonly locking or `reference 380 + counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__. 381 + 382 + In short, updaters use ``rcu_assign_pointer()`` and readers use 383 + ``rcu_dereference()``, and these two RCU API elements work together to 384 + ensure that readers have a consistent view of newly added data elements. 385 + 386 + Of course, it is also necessary to remove elements from RCU-protected 387 + data structures, for example, using the following process: 388 + 389 + #. Remove the data element from the enclosing structure. 390 + #. Wait for all pre-existing RCU read-side critical sections to complete 391 + (because only pre-existing readers can possibly have a reference to 392 + the newly removed data element). 393 + #. 
At this point, only the updater has a reference to the newly removed 394 + data element, so it can safely reclaim the data element, for example, 395 + by passing it to ``kfree()``. 396 + 397 + This process is implemented by ``remove_gp_synchronous()``: 398 + 399 + :: 400 + 401 + 1 bool remove_gp_synchronous(void) 402 + 2 { 403 + 3 struct foo *p; 404 + 4 405 + 5 spin_lock(&gp_lock); 406 + 6 p = rcu_access_pointer(gp); 407 + 7 if (!p) { 408 + 8 spin_unlock(&gp_lock); 409 + 9 return false; 410 + 10 } 411 + 11 rcu_assign_pointer(gp, NULL); 412 + 12 spin_unlock(&gp_lock); 413 + 13 synchronize_rcu(); 414 + 14 kfree(p); 415 + 15 return true; 416 + 16 } 417 + 418 + This function is straightforward, with line 13 waiting for a grace 419 + period before line 14 frees the old data element. This waiting ensures 420 + that readers will reach line 7 of ``do_something_gp()`` before the data 421 + element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on 422 + line 6 is similar to ``rcu_dereference()``, except that: 423 + 424 + #. The value returned by ``rcu_access_pointer()`` cannot be 425 + dereferenced. If you want to access the value pointed to as well as 426 + the pointer itself, use ``rcu_dereference()`` instead of 427 + ``rcu_access_pointer()``. 428 + #. The call to ``rcu_access_pointer()`` need not be protected. In 429 + contrast, ``rcu_dereference()`` must either be within an RCU 430 + read-side critical section or in a code segment where the pointer 431 + cannot change, for example, in code protected by the corresponding 432 + update-side lock. 433 + 434 + +-----------------------------------------------------------------------+ 435 + | **Quick Quiz**: | 436 + +-----------------------------------------------------------------------+ 437 + | Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, | 438 + | what destructive optimizations might the compiler make use of? 
| 439 + +-----------------------------------------------------------------------+ 440 + | **Answer**: | 441 + +-----------------------------------------------------------------------+ 442 + | Let's start with what happens to ``do_something_gp()`` if it fails to | 443 + | use ``rcu_dereference()``. It could reuse a value formerly fetched | 444 + | from this same pointer. It could also fetch the pointer from ``gp`` | 445 + | in a byte-at-a-time manner, resulting in *load tearing*, in turn | 446 + | resulting in a bytewise mash-up of two distinct pointer values. It might | 447 + | even use value-speculation optimizations, where it makes a wrong | 448 + | guess, but by the time it gets around to checking the value, an | 449 + | update has changed the pointer to match the wrong guess. Too bad | 450 + | about any dereferences that returned pre-initialization garbage in | 451 + | the meantime! | 452 + | For ``remove_gp_synchronous()``, as long as all modifications to | 453 + | ``gp`` are carried out while holding ``gp_lock``, the above | 454 + | optimizations are harmless. However, ``sparse`` will complain if you | 455 + | define ``gp`` with ``__rcu`` and then access it without using either | 456 + | ``rcu_access_pointer()`` or ``rcu_dereference()``. | 457 + +-----------------------------------------------------------------------+ 458 + 459 + In short, RCU's publish-subscribe guarantee is provided by the 460 + combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This 461 + guarantee allows data elements to be safely added to RCU-protected 462 + linked data structures without disrupting RCU readers. This guarantee 463 + can be used in combination with the grace-period guarantee to also allow 464 + data elements to be removed from RCU-protected linked data structures, 465 + again without disrupting RCU readers. 466 + 467 + This guarantee was only partially premeditated. 
DYNIX/ptx used an 468 + explicit memory barrier for publication, but had nothing resembling 469 + ``rcu_dereference()`` for subscription, nor did it have anything 470 + resembling the ``smp_read_barrier_depends()`` that was later subsumed 471 + into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The 472 + need for these operations made itself known quite suddenly at a 473 + late-1990s meeting with the DEC Alpha architects, back in the days when 474 + DEC was still a free-standing company. It took the Alpha architects a 475 + good hour to convince me that any sort of barrier would ever be needed, 476 + and it then took me a good *two* hours to convince them that their 477 + documentation did not make this point clear. More recent work with the C 478 + and C++ standards committees has provided much education on tricks and 479 + traps from the compiler. In short, compilers were much less tricky in 480 + the early 1990s, but in 2015, don't even think about omitting 481 + ``rcu_dereference()``! 482 + 483 + Memory-Barrier Guarantees 484 + ~~~~~~~~~~~~~~~~~~~~~~~~~ 485 + 486 + The previous section's simple linked-data-structure scenario clearly 487 + demonstrates the need for RCU's stringent memory-ordering guarantees on 488 + systems with more than one CPU: 489 + 490 + #. Each CPU that has an RCU read-side critical section that begins 491 + before ``synchronize_rcu()`` starts is guaranteed to execute a full 492 + memory barrier between the time that the RCU read-side critical 493 + section ends and the time that ``synchronize_rcu()`` returns. Without 494 + this guarantee, a pre-existing RCU read-side critical section might 495 + hold a reference to the newly removed ``struct foo`` after the 496 + ``kfree()`` on line 14 of ``remove_gp_synchronous()``. 497 + #. 
Each CPU that has an RCU read-side critical section that ends after 498 + ``synchronize_rcu()`` returns is guaranteed to execute a full memory 499 + barrier between the time that ``synchronize_rcu()`` begins and the 500 + time that the RCU read-side critical section begins. Without this 501 + guarantee, a later RCU read-side critical section running after the 502 + ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run 503 + ``do_something_gp()`` and find the newly deleted ``struct foo``. 504 + #. If the task invoking ``synchronize_rcu()`` remains on a given CPU, 505 + then that CPU is guaranteed to execute a full memory barrier sometime 506 + during the execution of ``synchronize_rcu()``. This guarantee ensures 507 + that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really 508 + does execute after the removal on line 11. 509 + #. If the task invoking ``synchronize_rcu()`` migrates among a group of 510 + CPUs during that invocation, then each of the CPUs in that group is 511 + guaranteed to execute a full memory barrier sometime during the 512 + execution of ``synchronize_rcu()``. This guarantee also ensures that 513 + the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does 514 + execute after the removal on line 11, but also in the case where the 515 + thread executing the ``synchronize_rcu()`` migrates in the meantime. 516 + 517 + +-----------------------------------------------------------------------+ 518 + | **Quick Quiz**: | 519 + +-----------------------------------------------------------------------+ 520 + | Given that multiple CPUs can start RCU read-side critical sections at | 521 + | any time without any ordering whatsoever, how can RCU possibly tell | 522 + | whether or not a given RCU read-side critical section starts before a | 523 + | given instance of ``synchronize_rcu()``? 
| 524 + +-----------------------------------------------------------------------+ 525 + | **Answer**: | 526 + +-----------------------------------------------------------------------+ 527 + | If RCU cannot tell whether or not a given RCU read-side critical | 528 + | section starts before a given instance of ``synchronize_rcu()``, then | 529 + | it must assume that the RCU read-side critical section started first. | 530 + | In other words, a given instance of ``synchronize_rcu()`` can avoid | 531 + | waiting on a given RCU read-side critical section only if it can | 532 + | prove that ``synchronize_rcu()`` started first. | 533 + | A related question is “When ``rcu_read_lock()`` doesn't generate any | 534 + | code, why does it matter how it relates to a grace period?” The | 535 + | answer is that it is not the relationship of ``rcu_read_lock()`` | 536 + | itself that is important, but rather the relationship of the code | 537 + | within the enclosed RCU read-side critical section to the code | 538 + | preceding and following the grace period. If we take this viewpoint, | 539 + | then a given RCU read-side critical section begins before a given | 540 + | grace period when some access preceding the grace period observes the | 541 + | effect of some access within the critical section, in which case none | 542 + | of the accesses within the critical section may observe the effects | 543 + | of any access following the grace period. | 544 + | | 545 + | As of late 2016, mathematical models of RCU take this viewpoint, for | 546 + | example, see slides 62 and 63 of the `2016 LinuxCon | 547 + | EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 | 548 + | 6.10.04c.LCE.pdf>`__ | 549 + | presentation. 
| 550 + +-----------------------------------------------------------------------+ 551 + 552 + +-----------------------------------------------------------------------+ 553 + | **Quick Quiz**: | 554 + +-----------------------------------------------------------------------+ 555 + | The first and second guarantees require unbelievably strict ordering! | 556 + | Are all these memory barriers *really* required? | 557 + +-----------------------------------------------------------------------+ 558 + | **Answer**: | 559 + +-----------------------------------------------------------------------+ 560 + | Yes, they really are required. To see why the first guarantee is | 561 + | required, consider the following sequence of events: | 562 + | | 563 + | #. CPU 1: ``rcu_read_lock()`` | 564 + | #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | 565 + | #. CPU 0: ``list_del_rcu(p);`` | 566 + | #. CPU 0: ``synchronize_rcu()`` starts. | 567 + | #. CPU 1: ``do_something_with(q->a);`` | 568 + | ``/* No smp_mb(), so might happen after kfree(). */`` | 569 + | #. CPU 1: ``rcu_read_unlock()`` | 570 + | #. CPU 0: ``synchronize_rcu()`` returns. | 571 + | #. CPU 0: ``kfree(p);`` | 572 + | | 573 + | Therefore, there absolutely must be a full memory barrier between the | 574 + | end of the RCU read-side critical section and the end of the grace | 575 + | period. | 576 + | | 577 + | The sequence of events demonstrating the necessity of the second rule | 578 + | is roughly similar: | 579 + | | 580 + | #. CPU 0: ``list_del_rcu(p);`` | 581 + | #. CPU 0: ``synchronize_rcu()`` starts. | 582 + | #. CPU 1: ``rcu_read_lock()`` | 583 + | #. CPU 1: ``q = rcu_dereference(gp);`` | 584 + | ``/* Might return p if no memory barrier. */`` | 585 + | #. CPU 0: ``synchronize_rcu()`` returns. | 586 + | #. CPU 0: ``kfree(p);`` | 587 + | #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | 588 + | #. 
CPU 1: ``rcu_read_unlock()`` | 589 + | | 590 + | And similarly, without a memory barrier between the beginning of the | 591 + | grace period and the beginning of the RCU read-side critical section, | 592 + | CPU 1 might end up accessing the freelist. | 593 + | | 594 + | The “as if” rule of course applies, so that any implementation that | 595 + | acts as if the appropriate memory barriers were in place is a correct | 596 + | implementation. That said, it is much easier to fool yourself into | 597 + | believing that you have adhered to the as-if rule than it is to | 598 + | actually adhere to it! | 599 + +-----------------------------------------------------------------------+ 600 + 601 + +-----------------------------------------------------------------------+ 602 + | **Quick Quiz**: | 603 + +-----------------------------------------------------------------------+ 604 + | You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | 605 + | absolutely no code in some kernel builds. This means that the | 606 + | compiler might arbitrarily rearrange consecutive RCU read-side | 607 + | critical sections. Given such rearrangement, if a given RCU read-side | 608 + | critical section is done, how can you be sure that all prior RCU | 609 + | read-side critical sections are done? Won't the compiler | 610 + | rearrangements make that impossible to determine? | 611 + +-----------------------------------------------------------------------+ 612 + | **Answer**: | 613 + +-----------------------------------------------------------------------+ 614 + | In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | 615 + | absolutely no code, RCU infers quiescent states only at special | 616 + | locations, for example, within the scheduler. 
Because calls to | 617 + | ``schedule()`` had better prevent calling-code accesses to shared | 618 + | variables from being rearranged across the call to ``schedule()``, if | 619 + | RCU detects the end of a given RCU read-side critical section, it | 620 + | will necessarily detect the end of all prior RCU read-side critical | 621 + | sections, no matter how aggressively the compiler scrambles the code. | 622 + | Again, this all assumes that the compiler cannot scramble code across | 623 + | calls to the scheduler, out of interrupt handlers, into the idle | 624 + | loop, into user-mode code, and so on. But if your kernel build allows | 625 + | that sort of scrambling, you have broken far more than just RCU! | 626 + +-----------------------------------------------------------------------+ 627 + 628 + Note that these memory-barrier requirements do not replace the 629 + fundamental RCU requirement that a grace period wait for all 630 + pre-existing readers. On the contrary, the memory barriers called out in 631 + this section must operate in such a way as to *enforce* this fundamental 632 + requirement. Of course, different implementations enforce this 633 + requirement in different ways, but enforce it they must. 634 + 635 + RCU Primitives Guaranteed to Execute Unconditionally 636 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 637 + 638 + The common-case RCU primitives are unconditional. They are invoked, they 639 + do their job, and they return, with no possibility of error, and no need 640 + to retry. This is a key RCU design philosophy. 641 + 642 + However, this philosophy is pragmatic rather than pigheaded. If someone 643 + comes up with a good justification for a particular conditional RCU 644 + primitive, it might well be implemented and added. After all, this 645 + guarantee was reverse-engineered, not premeditated. 
The unconditional 646 + nature of the RCU primitives was initially an accident of 647 + implementation, and later experience with synchronization mechanisms 648 + that have conditional primitives caused me to elevate this accident to a 649 + guarantee. Therefore, the justification for adding a conditional 650 + primitive to RCU would need to be based on detailed and compelling use 651 + cases. 652 + 653 + Guaranteed Read-to-Write Upgrade 654 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 655 + 656 + As far as RCU is concerned, it is always possible to carry out an update 657 + within an RCU read-side critical section. For example, that RCU 658 + read-side critical section might search for a given data element, and 659 + then might acquire the update-side spinlock in order to update that 660 + element, all while remaining in that RCU read-side critical section. Of 661 + course, it is necessary to exit the RCU read-side critical section 662 + before invoking ``synchronize_rcu()``. However, this inconvenience can 663 + be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API 664 + members described later in this document. 665 + 666 + +-----------------------------------------------------------------------+ 667 + | **Quick Quiz**: | 668 + +-----------------------------------------------------------------------+ 669 + | But how does the upgrade-to-write operation exclude other readers? | 670 + +-----------------------------------------------------------------------+ 671 + | **Answer**: | 672 + +-----------------------------------------------------------------------+ 673 + | It doesn't, just like normal RCU updates, which also do not exclude | 674 + | RCU readers. | 675 + +-----------------------------------------------------------------------+ 676 + 677 + This guarantee allows lookup code to be shared between read-side and 678 + update-side code, and was premeditated, appearing in the earliest 679 + DYNIX/ptx RCU documentation.
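The read-to-write upgrade described above can be sketched in userspace C. This is a minimal illustration under stated assumptions, not kernel code: ``rcu_read_lock()`` and ``rcu_read_unlock()`` are replaced by empty stand-in macros (mirroring kernel builds in which they generate no code), a ``pthread`` mutex stands in for the update-side spinlock, and ``struct elem``, ``table``, ``lookup()``, ``read_val()``, and ``upgrade_and_update()`` are hypothetical names invented for the example.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * ASSUMPTION: userspace stand-ins, not the kernel API. They expand to
 * nothing and therefore provide no real protection; they only mark where
 * the RCU read-side critical section begins and ends.
 */
#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

/* Hypothetical element type and table, for illustration only. */
struct elem {
	int key;
	int val;
};

static struct elem table[] = { { 1, 10 }, { 2, 20 }, { 3, 30 } };

/* ASSUMPTION: a mutex stands in for the update-side spinlock. */
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;

/* Lookup helper shared by read-side and update-side code. */
static struct elem *lookup(int key)
{
	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i].key == key)
			return &table[i];
	return NULL;
}

/* Pure reader: searches and reads within one read-side critical section. */
static bool read_val(int key, int *val)
{
	struct elem *p;
	bool found = false;

	rcu_read_lock();
	p = lookup(key);
	if (p) {
		*val = p->val;
		found = true;
	}
	rcu_read_unlock();
	return found;
}

/* Reader that upgrades to updater without leaving the critical section. */
static bool upgrade_and_update(int key, int new_val)
{
	struct elem *p;
	bool found = false;

	rcu_read_lock();
	p = lookup(key);                          /* same lookup as readers */
	if (p) {
		pthread_mutex_lock(&update_lock); /* excludes other updaters only */
		p->val = new_val;
		pthread_mutex_unlock(&update_lock);
		found = true;
	}
	rcu_read_unlock();
	return found;
}
```

Calling ``upgrade_and_update(2, 42)`` followed by ``read_val(2, &v)`` leaves ``v`` equal to 42. Both paths share ``lookup()``, which is the code sharing that this guarantee permits. As the Quick Quiz notes, the lock excludes other updaters, not readers.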
680 + 681 + Fundamental Non-Requirements 682 + ---------------------------- 683 + 684 + RCU provides extremely lightweight readers, and its read-side 685 + guarantees, though quite useful, are correspondingly lightweight. It is 686 + therefore all too easy to assume that RCU is guaranteeing more than it 687 + really is. Of course, the list of things that RCU does not guarantee is 688 + infinitely long, however, the following sections list a few 689 + non-guarantees that have caused confusion. Except where otherwise noted, 690 + these non-guarantees were premeditated. 691 + 692 + #. `Readers Impose Minimal 693 + Ordering <#Readers%20Impose%20Minimal%20Ordering>`__ 694 + #. `Readers Do Not Exclude 695 + Updaters <#Readers%20Do%20Not%20Exclude%20Updaters>`__ 696 + #. `Updaters Only Wait For Old 697 + Readers <#Updaters%20Only%20Wait%20For%20Old%20Readers>`__ 698 + #. `Grace Periods Don't Partition Read-Side Critical 699 + Sections <#Grace%20Periods%20Don't%20Partition%20Read-Side%20Critical%20Sections>`__ 700 + #. `Read-Side Critical Sections Don't Partition Grace 701 + Periods <#Read-Side%20Critical%20Sections%20Don't%20Partition%20Grace%20Periods>`__ 702 + 703 + Readers Impose Minimal Ordering 704 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 705 + 706 + Reader-side markers such as ``rcu_read_lock()`` and 707 + ``rcu_read_unlock()`` provide absolutely no ordering guarantees except 708 + through their interaction with the grace-period APIs such as 709 + ``synchronize_rcu()``. 
To see this, consider the following pair of 710 + threads: 711 + 712 + :: 713 + 714 + 1 void thread0(void) 715 + 2 { 716 + 3 rcu_read_lock(); 717 + 4 WRITE_ONCE(x, 1); 718 + 5 rcu_read_unlock(); 719 + 6 rcu_read_lock(); 720 + 7 WRITE_ONCE(y, 1); 721 + 8 rcu_read_unlock(); 722 + 9 } 723 + 10 724 + 11 void thread1(void) 725 + 12 { 726 + 13 rcu_read_lock(); 727 + 14 r1 = READ_ONCE(y); 728 + 15 rcu_read_unlock(); 729 + 16 rcu_read_lock(); 730 + 17 r2 = READ_ONCE(x); 731 + 18 rcu_read_unlock(); 732 + 19 } 733 + 734 + After ``thread0()`` and ``thread1()`` execute concurrently, it is quite 735 + possible to have 736 + 737 + :: 738 + 739 + (r1 == 1 && r2 == 0) 740 + 741 + (that is, ``y`` appears to have been assigned before ``x``), which would 742 + not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had 743 + much in the way of ordering properties. But they do not, so the CPU is 744 + within its rights to do significant reordering. This is by design: Any 745 + significant ordering constraints would slow down these fast-path APIs. 746 + 747 + +-----------------------------------------------------------------------+ 748 + | **Quick Quiz**: | 749 + +-----------------------------------------------------------------------+ 750 + | Can't the compiler also reorder this code? | 751 + +-----------------------------------------------------------------------+ 752 + | **Answer**: | 753 + +-----------------------------------------------------------------------+ 754 + | No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` | 755 + | prevent the compiler from reordering in this particular case. | 756 + +-----------------------------------------------------------------------+ 757 + 758 + Readers Do Not Exclude Updaters 759 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 760 + 761 + Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` excludes updates. 762 + All they do is to prevent grace periods from ending.
The following 763 + example illustrates this: 764 + 765 + :: 766 + 767 + 1 void thread0(void) 768 + 2 { 769 + 3 rcu_read_lock(); 770 + 4 r1 = READ_ONCE(y); 771 + 5 if (r1) { 772 + 6 do_something_with_nonzero_x(); 773 + 7 r2 = READ_ONCE(x); 774 + 8 WARN_ON(!r2); /* BUG!!! */ 775 + 9 } 776 + 10 rcu_read_unlock(); 777 + 11 } 778 + 12 779 + 13 void thread1(void) 780 + 14 { 781 + 15 spin_lock(&my_lock); 782 + 16 WRITE_ONCE(x, 1); 783 + 17 WRITE_ONCE(y, 1); 784 + 18 spin_unlock(&my_lock); 785 + 19 } 786 + 787 + If the ``thread0()`` function's ``rcu_read_lock()`` excluded the 788 + ``thread1()`` function's update, the ``WARN_ON()`` could never fire. But 789 + the fact is that ``rcu_read_lock()`` does not exclude much of anything 790 + aside from subsequent grace periods, of which ``thread1()`` has none, so 791 + the ``WARN_ON()`` can and does fire. 792 + 793 + Updaters Only Wait For Old Readers 794 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 795 + 796 + It might be tempting to assume that after ``synchronize_rcu()`` 797 + completes, there are no readers executing. This temptation must be 798 + avoided because new readers can start immediately after 799 + ``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no 800 + obligation to wait for these new readers. 801 + 802 + +-----------------------------------------------------------------------+ 803 + | **Quick Quiz**: | 804 + +-----------------------------------------------------------------------+ 805 + | Suppose that synchronize_rcu() did wait until *all* readers had | 806 + | completed instead of waiting only on pre-existing readers. For how | 807 + | long would the updater be able to rely on there being no readers? | 808 + +-----------------------------------------------------------------------+ 809 + | **Answer**: | 810 + +-----------------------------------------------------------------------+ 811 + | For no time at all. 
Even if ``synchronize_rcu()`` were to wait until | 812 + | all readers had completed, a new reader might start immediately after | 813 + | ``synchronize_rcu()`` completed. Therefore, the code following | 814 + | ``synchronize_rcu()`` can *never* rely on there being no readers. | 815 + +-----------------------------------------------------------------------+ 816 + 817 + Grace Periods Don't Partition Read-Side Critical Sections 818 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 819 + 820 + It is tempting to assume that if any part of one RCU read-side critical 821 + section precedes a given grace period, and if any part of another RCU 822 + read-side critical section follows that same grace period, then all of 823 + the first RCU read-side critical section must precede all of the second. 824 + However, this just isn't the case: A single grace period does not 825 + partition the set of RCU read-side critical sections. An example of this 826 + situation can be illustrated as follows, where ``a``, ``b``, and ``c`` 827 + are initially all zero: 828 + 829 + :: 830 + 831 + 1 void thread0(void) 832 + 2 { 833 + 3 rcu_read_lock(); 834 + 4 WRITE_ONCE(a, 1); 835 + 5 WRITE_ONCE(b, 1); 836 + 6 rcu_read_unlock(); 837 + 7 } 838 + 8 839 + 9 void thread1(void) 840 + 10 { 841 + 11 r1 = READ_ONCE(a); 842 + 12 synchronize_rcu(); 843 + 13 WRITE_ONCE(c, 1); 844 + 14 } 845 + 15 846 + 16 void thread2(void) 847 + 17 { 848 + 18 rcu_read_lock(); 849 + 19 r2 = READ_ONCE(b); 850 + 20 r3 = READ_ONCE(c); 851 + 21 rcu_read_unlock(); 852 + 22 } 853 + 854 + It turns out that the outcome: 855 + 856 + :: 857 + 858 + (r1 == 1 && r2 == 0 && r3 == 1) 859 + 860 + is entirely possible.
The following figure shows how this can happen, 861 + with each circled ``QS`` indicating the point at which RCU recorded a 862 + *quiescent state* for each thread, that is, a state in which RCU knows 863 + that the thread cannot be in the midst of an RCU read-side critical 864 + section that started before the current grace period: 865 + 866 + .. kernel-figure:: GPpartitionReaders1.svg 867 + 868 + If it is necessary to partition RCU read-side critical sections in this 869 + manner, it is necessary to use two grace periods, where the first grace 870 + period is known to end before the second grace period starts: 871 + 872 + :: 873 + 874 + 1 void thread0(void) 875 + 2 { 876 + 3 rcu_read_lock(); 877 + 4 WRITE_ONCE(a, 1); 878 + 5 WRITE_ONCE(b, 1); 879 + 6 rcu_read_unlock(); 880 + 7 } 881 + 8 882 + 9 void thread1(void) 883 + 10 { 884 + 11 r1 = READ_ONCE(a); 885 + 12 synchronize_rcu(); 886 + 13 WRITE_ONCE(c, 1); 887 + 14 } 888 + 15 889 + 16 void thread2(void) 890 + 17 { 891 + 18 r2 = READ_ONCE(c); 892 + 19 synchronize_rcu(); 893 + 20 WRITE_ONCE(d, 1); 894 + 21 } 895 + 22 896 + 23 void thread3(void) 897 + 24 { 898 + 25 rcu_read_lock(); 899 + 26 r3 = READ_ONCE(b); 900 + 27 r4 = READ_ONCE(d); 901 + 28 rcu_read_unlock(); 902 + 29 } 903 + 904 + Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen 905 + before the end of ``thread1()``'s grace period. If in addition 906 + ``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after 907 + the beginning of ``thread2()``'s grace period. If it is also the case 908 + that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must 909 + precede the beginning of ``thread2()``'s grace period. This means that 910 + the two RCU read-side critical sections cannot overlap, guaranteeing 911 + that ``(r3 == 1)``. As a result, the outcome: 912 + 913 + :: 914 + 915 + (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) 916 + 917 + cannot happen.
918 + 919 + This non-requirement was also non-premeditated, but became apparent when 920 + studying RCU's interaction with memory ordering. 921 + 922 + Read-Side Critical Sections Don't Partition Grace Periods 923 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 924 + 925 + It is also tempting to assume that if an RCU read-side critical section 926 + happens between a pair of grace periods, then those grace periods cannot 927 + overlap. However, this temptation leads nowhere good, as can be 928 + illustrated by the following, with all variables initially zero: 929 + 930 + :: 931 + 932 + 1 void thread0(void) 933 + 2 { 934 + 3 rcu_read_lock(); 935 + 4 WRITE_ONCE(a, 1); 936 + 5 WRITE_ONCE(b, 1); 937 + 6 rcu_read_unlock(); 938 + 7 } 939 + 8 940 + 9 void thread1(void) 941 + 10 { 942 + 11 r1 = READ_ONCE(a); 943 + 12 synchronize_rcu(); 944 + 13 WRITE_ONCE(c, 1); 945 + 14 } 946 + 15 947 + 16 void thread2(void) 948 + 17 { 949 + 18 rcu_read_lock(); 950 + 19 WRITE_ONCE(d, 1); 951 + 20 r2 = READ_ONCE(c); 952 + 21 rcu_read_unlock(); 953 + 22 } 954 + 23 955 + 24 void thread3(void) 956 + 25 { 957 + 26 r3 = READ_ONCE(d); 958 + 27 synchronize_rcu(); 959 + 28 WRITE_ONCE(e, 1); 960 + 29 } 961 + 30 962 + 31 void thread4(void) 963 + 32 { 964 + 33 rcu_read_lock(); 965 + 34 r4 = READ_ONCE(b); 966 + 35 r5 = READ_ONCE(e); 967 + 36 rcu_read_unlock(); 968 + 37 } 969 + 970 + In this case, the outcome: 971 + 972 + :: 973 + 974 + (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) 975 + 976 + is entirely possible, as illustrated below: 977 + 978 + .. kernel-figure:: ReadersPartitionGP1.svg 979 + 980 + Again, an RCU read-side critical section can overlap almost all of a 981 + given grace period, just so long as it does not overlap the entire grace 982 + period. As a result, an RCU read-side critical section cannot partition 983 + a pair of RCU grace periods. 
984 + 985 + +-----------------------------------------------------------------------+ 986 + | **Quick Quiz**: | 987 + +-----------------------------------------------------------------------+ 988 + | How long a sequence of grace periods, each separated by an RCU | 989 + | read-side critical section, would be required to partition the RCU | 990 + | read-side critical sections at the beginning and end of the chain? | 991 + +-----------------------------------------------------------------------+ 992 + | **Answer**: | 993 + +-----------------------------------------------------------------------+ 994 + | In theory, an infinite number. In practice, an unknown number that is | 995 + | sensitive to both implementation details and timing considerations. | 996 + | Therefore, even in practice, RCU users must abide by the theoretical | 997 + | rather than the practical answer. | 998 + +-----------------------------------------------------------------------+ 999 + 1000 + Parallelism Facts of Life 1001 + ------------------------- 1002 + 1003 + These parallelism facts of life are by no means specific to RCU, but the 1004 + RCU implementation must abide by them. They therefore bear repeating: 1005 + 1006 + #. Any CPU or task may be delayed at any time, and any attempts to avoid 1007 + these delays by disabling preemption, interrupts, or whatever are 1008 + completely futile. This is most obvious in preemptible user-level 1009 + environments and in virtualized environments (where a given guest 1010 + OS's VCPUs can be preempted at any time by the underlying 1011 + hypervisor), but can also happen in bare-metal environments due to 1012 + ECC errors, NMIs, and other hardware events. Although a delay of more 1013 + than about 20 seconds can result in splats, the RCU implementation is 1014 + obligated to use algorithms that can tolerate extremely long delays, 1015 + but where “extremely long” is not long enough to allow wrap-around 1016 + when incrementing a 64-bit counter. 1017 + #. 
Both the compiler and the CPU can reorder memory accesses. Where it 1018 + matters, RCU must use compiler directives and memory-barrier 1019 + instructions to preserve ordering. 1020 + #. Conflicting writes to memory locations in any given cache line will 1021 + result in expensive cache misses. Greater numbers of concurrent 1022 + writes and more-frequent concurrent writes will result in more 1023 + dramatic slowdowns. RCU is therefore obligated to use algorithms that 1024 + have sufficient locality to avoid significant performance and 1025 + scalability problems. 1026 + #. As a rough rule of thumb, only one CPU's worth of processing may be 1027 + carried out under the protection of any given exclusive lock. RCU 1028 + must therefore use scalable locking designs. 1029 + #. Counters are finite, especially on 32-bit systems. RCU's use of 1030 + counters must therefore tolerate counter wrap, or be designed such 1031 + that counter wrap would take way more time than a single system is 1032 + likely to run. An uptime of ten years is quite possible, a runtime of 1033 + a century much less so. As an example of the latter, RCU's 1034 + dyntick-idle nesting counter allows 54 bits for interrupt nesting 1035 + level (this counter is 64 bits even on a 32-bit system). Overflowing 1036 + this counter requires 2\ :sup:`54` half-interrupts on a given CPU 1037 + without that CPU ever going idle. If a half-interrupt happened every 1038 + microsecond, it would take 570 years of runtime to overflow this 1039 + counter, which is currently believed to be an acceptably long time. 1040 + #. Linux systems can have thousands of CPUs running a single Linux 1041 + kernel in a single shared-memory environment. RCU must therefore pay 1042 + close attention to high-end scalability. 1043 + 1044 + This last parallelism fact of life means that RCU must pay special 1045 + attention to the preceding facts of life. 
The idea that Linux might 1046 + scale to systems with thousands of CPUs would have been met with some 1047 + skepticism in the 1990s, but these requirements would otherwise 1048 + have been unsurprising, even in the early 1990s. 1049 + 1050 + Quality-of-Implementation Requirements 1051 + -------------------------------------- 1052 + 1053 + These sections list quality-of-implementation requirements. Although an 1054 + RCU implementation that ignores these requirements could still be used, 1055 + it would likely be subject to limitations that would make it 1056 + inappropriate for industrial-strength production use. Classes of 1057 + quality-of-implementation requirements are as follows: 1058 + 1059 + #. `Specialization <#Specialization>`__ 1060 + #. `Performance and Scalability <#Performance%20and%20Scalability>`__ 1061 + #. `Forward Progress <#Forward%20Progress>`__ 1062 + #. `Composability <#Composability>`__ 1063 + #. `Corner Cases <#Corner%20Cases>`__ 1064 + 1065 + These classes are covered in the following sections. 1066 + 1067 + Specialization 1068 + ~~~~~~~~~~~~~~ 1069 + 1070 + RCU is and always has been intended primarily for read-mostly 1071 + situations, which means that RCU's read-side primitives are optimized, 1072 + often at the expense of its update-side primitives. Experience thus far 1073 + is captured by the following list of situations: 1074 + 1075 + #. Read-mostly data, where stale and inconsistent data is not a problem: 1076 + RCU works great! 1077 + #. Read-mostly data, where data must be consistent: RCU works well. 1078 + #. Read-write data, where data must be consistent: RCU *might* work OK. 1079 + Or not. 1080 + #. Write-mostly data, where data must be consistent: RCU is very 1081 + unlikely to be the right tool for the job, with the following 1082 + exceptions, where RCU can provide: 1083 + 1084 + a. Existence guarantees for update-friendly mechanisms. 1085 + b. Wait-free read-side primitives for real-time use.
1086 + 1087 + This focus on read-mostly situations means that RCU must interoperate 1088 + with other synchronization primitives. For example, the ``add_gp()`` and 1089 + ``remove_gp_synchronous()`` examples discussed earlier use RCU to 1090 + protect readers and locking to coordinate updaters. However, the need 1091 + extends much farther, requiring that a variety of synchronization 1092 + primitives be legal within RCU read-side critical sections, including 1093 + spinlocks, sequence locks, atomic operations, reference counters, and 1094 + memory barriers. 1095 + 1096 + +-----------------------------------------------------------------------+ 1097 + | **Quick Quiz**: | 1098 + +-----------------------------------------------------------------------+ 1099 + | What about sleeping locks? | 1100 + +-----------------------------------------------------------------------+ 1101 + | **Answer**: | 1102 + +-----------------------------------------------------------------------+ 1103 + | These are forbidden within Linux-kernel RCU read-side critical | 1104 + | sections because it is not legal to place a quiescent state (in this | 1105 + | case, voluntary context switch) within an RCU read-side critical | 1106 + | section. However, sleeping locks may be used within userspace RCU | 1107 + | read-side critical sections, and also within Linux-kernel sleepable | 1108 + | RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In | 1109 + | addition, the -rt patchset turns spinlocks into sleeping locks so | 1110 + | that the corresponding critical sections can be preempted, which also | 1111 + | means that these sleeplockified spinlocks (but not other sleeping | 1112 + | locks!) may be acquired within -rt-Linux-kernel RCU read-side critical | 1113 + | sections.
| 1114 + | Note that it *is* legal for a normal RCU read-side critical section | 1115 + | to conditionally acquire a sleeping lock (as in | 1116 + | ``mutex_trylock()``), but only as long as it does not loop | 1117 + | indefinitely attempting to conditionally acquire that sleeping lock. | 1118 + | The key point is that things like ``mutex_trylock()`` either return | 1119 + | with the mutex held, or return an error indication if the mutex was | 1120 + | not immediately available. Either way, ``mutex_trylock()`` returns | 1121 + | immediately without sleeping. | 1122 + +-----------------------------------------------------------------------+ 1123 + 1124 + It often comes as a surprise that many algorithms do not require a 1125 + consistent view of data, but many can function in that mode, with 1126 + network routing being the poster child. Internet routing algorithms take 1127 + significant time to propagate updates, so that by the time an update 1128 + arrives at a given system, that system has been sending network traffic 1129 + the wrong way for a considerable length of time. Having a few threads 1130 + continue to send traffic the wrong way for a few more milliseconds is 1131 + clearly not a problem: In the worst case, TCP retransmissions will 1132 + eventually get the data where it needs to go. In general, when tracking 1133 + the state of the universe outside of the computer, some level of 1134 + inconsistency must be tolerated due to speed-of-light delays if nothing 1135 + else. 1136 + 1137 + Furthermore, uncertainty about external state is inherent in many cases. 1138 + For example, a pair of veterinarians might use heartbeat to determine 1139 + whether or not a given cat was alive. But how long should they wait 1140 + after the last heartbeat to decide that the cat is in fact dead?
Waiting 1141 + less than 400 milliseconds makes no sense because this would mean that a 1142 + relaxed cat would be considered to cycle between death and life more 1143 + than 100 times per minute. Moreover, just as with human beings, a cat's 1144 + heart might stop for some period of time, so the exact wait period is a 1145 + judgment call. One of our pair of veterinarians might wait 30 seconds 1146 + before pronouncing the cat dead, while the other might insist on waiting 1147 + a full minute. The two veterinarians would then disagree on the state of 1148 + the cat during the final 30 seconds of the minute following the last 1149 + heartbeat. 1150 + 1151 + Interestingly enough, this same situation applies to hardware. When push 1152 + comes to shove, how do we tell whether or not some external server has 1153 + failed? We send messages to it periodically, and declare it failed if we 1154 + don't receive a response within a given period of time. Policy decisions 1155 + can usually tolerate short periods of inconsistency. The policy was 1156 + decided some time ago, and is only now being put into effect, so a few 1157 + milliseconds of delay is normally inconsequential. 1158 + 1159 + However, there are algorithms that absolutely must see consistent data. 1160 + For example, the translation from a user-level SystemV semaphore ID 1161 + to the corresponding in-kernel data structure is protected by RCU, but 1162 + it is absolutely forbidden to update a semaphore that has just been 1163 + removed. In the Linux kernel, this need for consistency is accommodated 1164 + by acquiring spinlocks located in the in-kernel data structure from 1165 + within the RCU read-side critical section, and this is indicated by the 1166 + green box in the figure above. Many other techniques may be used, and 1167 + are in fact used within the Linux kernel.
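The lock-inside-reader technique described above can be sketched in userspace C as follows. This is a hedged illustration of the general pattern, not the kernel's SystemV semaphore code: the read-side markers are empty stand-in macros, a ``pthread`` mutex stands in for the per-structure spinlock, and ``struct sem_obj``, its ``deleted`` flag, ``id_table``, ``lookup_and_lock()``, ``sem_add()``, and ``sem_mark_deleted()`` are all hypothetical names introduced for the example.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: empty userspace stand-ins for the kernel read-side markers. */
#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

/* Hypothetical in-kernel-style object reached through an ID lookup. */
struct sem_obj {
	pthread_mutex_t lock; /* ASSUMPTION: stands in for a spinlock */
	bool deleted;         /* set under ->lock when removal begins */
	int value;
};

#define NR_IDS 4

static struct sem_obj obj0 = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.deleted = false,
	.value = 0,
};

/* ID-to-object table; a real design would protect it with RCU. */
static struct sem_obj *id_table[NR_IDS] = { &obj0 };

/* Returns the object with its lock held, or NULL if absent or removed. */
static struct sem_obj *lookup_and_lock(int id)
{
	struct sem_obj *p;

	if (id < 0 || id >= NR_IDS)
		return NULL;
	rcu_read_lock();
	p = id_table[id]; /* a real reader would use rcu_dereference() */
	if (p) {
		pthread_mutex_lock(&p->lock);
		if (p->deleted) { /* lost the race with removal */
			pthread_mutex_unlock(&p->lock);
			p = NULL;
		}
	}
	rcu_read_unlock();
	return p;
}

/* Update that must never touch an object whose removal has begun. */
static bool sem_add(int id, int delta)
{
	struct sem_obj *p = lookup_and_lock(id);

	if (!p)
		return false;
	p->value += delta;
	pthread_mutex_unlock(&p->lock);
	return true;
}

/*
 * First step of removal: mark the object deleted under its lock. A real
 * remover would then unpublish the table entry and wait for a grace
 * period before freeing the object.
 */
static void sem_mark_deleted(int id)
{
	struct sem_obj *p = id_table[id];

	if (!p)
		return;
	pthread_mutex_lock(&p->lock);
	p->deleted = true;
	pthread_mutex_unlock(&p->lock);
}
```

In this sketch, RCU only keeps the object from disappearing while the reader locates it. The per-object lock and the ``deleted`` flag supply the consistency, so ``sem_add()`` fails once ``sem_mark_deleted()`` has run, even for a reader that can still reach the object through the table.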
1168 + 1169 + In short, RCU is not required to maintain consistency, and other 1170 + mechanisms may be used in concert with RCU when consistency is required. 1171 + RCU's specialization allows it to do its job extremely well, and its 1172 + ability to interoperate with other synchronization mechanisms allows the 1173 + right mix of synchronization tools to be used for a given job. 1174 + 1175 + Performance and Scalability 1176 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1177 + 1178 + Energy efficiency is a critical component of performance today, and 1179 + Linux-kernel RCU implementations must therefore avoid unnecessarily 1180 + awakening idle CPUs. I cannot claim that this requirement was 1181 + premeditated. In fact, I learned of it during a telephone conversation 1182 + in which I was given “frank and open” feedback on the importance of 1183 + energy efficiency in battery-powered systems and on specific 1184 + energy-efficiency shortcomings of the Linux-kernel RCU implementation. 1185 + In my experience, the battery-powered embedded community will consider 1186 + any unnecessary wakeups to be extremely unfriendly acts. So much so that 1187 + mere Linux-kernel-mailing-list posts are insufficient to vent their ire. 1188 + 1189 + Memory consumption is not particularly important in most situations, 1190 + and has become decreasingly so as memory sizes have expanded and memory 1191 + costs have plummeted. However, as I learned from Matt Mackall's 1192 + `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory 1193 + footprint is critically important on single-CPU systems with 1194 + non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny 1195 + RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__ 1196 + was born.
Josh Triplett has since taken over the small-memory banner 1197 + with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__ 1198 + project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional 1199 + for those kernels not needing it. 1200 + 1201 + The remaining performance requirements are, for the most part, 1202 + unsurprising. For example, in keeping with RCU's read-side 1203 + specialization, ``rcu_dereference()`` should have negligible overhead 1204 + (for example, suppression of a few minor compiler optimizations). 1205 + Similarly, in non-preemptible environments, ``rcu_read_lock()`` and 1206 + ``rcu_read_unlock()`` should have exactly zero overhead. 1207 + 1208 + In preemptible environments, in the case where the RCU read-side 1209 + critical section was not preempted (as will be the case for the 1210 + highest-priority real-time process), ``rcu_read_lock()`` and 1211 + ``rcu_read_unlock()`` should have minimal overhead. In particular, they 1212 + should not contain atomic read-modify-write operations, memory-barrier 1213 + instructions, preemption disabling, interrupt disabling, or backwards 1214 + branches. However, in the case where the RCU read-side critical section 1215 + was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable 1216 + interrupts. This is why it is better to nest an RCU read-side critical 1217 + section within a preempt-disable region than vice versa, at least in 1218 + cases where that critical section is short enough to avoid unduly 1219 + degrading real-time latencies. 1220 + 1221 + The ``synchronize_rcu()`` grace-period-wait primitive is optimized for 1222 + throughput. It may therefore incur several milliseconds of latency in 1223 + addition to the duration of the longest RCU read-side critical section. 
1224 + On the other hand, multiple concurrent invocations of 1225 + ``synchronize_rcu()`` are required to use batching optimizations so that 1226 + they can be satisfied by a single underlying grace-period-wait 1227 + operation. For example, in the Linux kernel, it is not unusual for a 1228 + single grace-period-wait operation to serve more than `1,000 separate 1229 + invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__ 1230 + of ``synchronize_rcu()``, thus amortizing the per-invocation overhead 1231 + down to nearly zero. However, the grace-period optimization is also 1232 + required to avoid measurable degradation of real-time scheduling and 1233 + interrupt latencies. 1234 + 1235 + In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are 1236 + unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be 1237 + used instead, reducing the grace-period latency down to a few tens of 1238 + microseconds on small systems, at least in cases where the RCU read-side 1239 + critical sections are short. There are currently no special latency 1240 + requirements for ``synchronize_rcu_expedited()`` on large systems, but, 1241 + consistent with the empirical nature of the RCU specification, that is 1242 + subject to change. However, there most definitely are scalability 1243 + requirements: A storm of ``synchronize_rcu_expedited()`` invocations on 1244 + 4096 CPUs should at least make reasonable forward progress. In return 1245 + for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted 1246 + to impose modest degradation of real-time latency on non-idle online 1247 + CPUs. Here, “modest” means roughly the same latency degradation as a 1248 + scheduling-clock interrupt. 1249 + 1250 + There are a number of situations where even 1251 + ``synchronize_rcu_expedited()``'s reduced grace-period latency is 1252 + unacceptable. 
In these situations, the asynchronous ``call_rcu()`` can 1253 + be used in place of ``synchronize_rcu()`` as follows: 1254 + 1255 + :: 1256 + 1257 + 1 struct foo { 1258 + 2 int a; 1259 + 3 int b; 1260 + 4 struct rcu_head rh; 1261 + 5 }; 1262 + 6 1263 + 7 static void remove_gp_cb(struct rcu_head *rhp) 1264 + 8 { 1265 + 9 struct foo *p = container_of(rhp, struct foo, rh); 1266 + 10 1267 + 11 kfree(p); 1268 + 12 } 1269 + 13 1270 + 14 bool remove_gp_asynchronous(void) 1271 + 15 { 1272 + 16 struct foo *p; 1273 + 17 1274 + 18 spin_lock(&gp_lock); 1275 + 19 p = rcu_access_pointer(gp); 1276 + 20 if (!p) { 1277 + 21 spin_unlock(&gp_lock); 1278 + 22 return false; 1279 + 23 } 1280 + 24 rcu_assign_pointer(gp, NULL); 1281 + 25 call_rcu(&p->rh, remove_gp_cb); 1282 + 26 spin_unlock(&gp_lock); 1283 + 27 return true; 1284 + 28 } 1285 + 1286 + A definition of ``struct foo`` is finally needed, and appears on 1287 + lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()`` 1288 + on line 25, and will be invoked after the end of a subsequent grace 1289 + period. This gets the same effect as ``remove_gp_synchronous()``, but 1290 + without forcing the updater to wait for a grace period to elapse. The 1291 + ``call_rcu()`` function may be used in a number of situations where 1292 + neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would 1293 + be legal, including within preempt-disable code, ``local_bh_disable()`` 1294 + code, interrupt-disable code, and interrupt handlers. However, even 1295 + ``call_rcu()`` is illegal within NMI handlers and from idle and offline 1296 + CPUs. The callback function (``remove_gp_cb()`` in this case) will be 1297 + executed within softirq (software interrupt) environment within the 1298 + Linux kernel, either within a real softirq handler or under the 1299 + protection of ``local_bh_disable()``. In both the Linux kernel and in 1300 + userspace, it is bad practice to write an RCU callback function that 1301 + takes too long. 
Long-running operations should be relegated to separate 1302 + threads or (in the Linux kernel) workqueues. 1303 + 1304 + +-----------------------------------------------------------------------+ 1305 + | **Quick Quiz**: | 1306 + +-----------------------------------------------------------------------+ 1307 + | Why does line 19 use ``rcu_access_pointer()``? After all, | 1308 + | ``call_rcu()`` on line 25 stores into the structure, which would | 1309 + | interact badly with concurrent insertions. Doesn't this mean that | 1310 + | ``rcu_dereference()`` is required? | 1311 + +-----------------------------------------------------------------------+ 1312 + | **Answer**: | 1313 + +-----------------------------------------------------------------------+ 1314 + | Presumably the ``->gp_lock`` acquired on line 18 excludes any | 1315 + | changes, including any insertions that ``rcu_dereference()`` would | 1316 + | protect against. Therefore, any insertions will be delayed until | 1317 + | after ``->gp_lock`` is released on line 26, which in turn means that | 1318 + | ``rcu_access_pointer()`` suffices. | 1319 + +-----------------------------------------------------------------------+ 1320 + 1321 + However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on 1322 + the data element.
This is a common idiom, and is supported by 1323 + ``kfree_rcu()``, which allows “fire and forget” operation as shown 1324 + below: 1325 + 1326 + :: 1327 + 1328 + 1 struct foo { 1329 + 2 int a; 1330 + 3 int b; 1331 + 4 struct rcu_head rh; 1332 + 5 }; 1333 + 6 1334 + 7 bool remove_gp_faf(void) 1335 + 8 { 1336 + 9 struct foo *p; 1337 + 10 1338 + 11 spin_lock(&gp_lock); 1339 + 12 p = rcu_dereference(gp); 1340 + 13 if (!p) { 1341 + 14 spin_unlock(&gp_lock); 1342 + 15 return false; 1343 + 16 } 1344 + 17 rcu_assign_pointer(gp, NULL); 1345 + 18 kfree_rcu(p, rh); 1346 + 19 spin_unlock(&gp_lock); 1347 + 20 return true; 1348 + 21 } 1349 + 1350 + Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and 1351 + proceeds, without any need to pay any further attention to the 1352 + subsequent grace period and ``kfree()``. It is permissible to invoke 1353 + ``kfree_rcu()`` from the same environments as for ``call_rcu()``. 1354 + Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()`` 1355 + and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the 1356 + fact that RCU was not heavily used within DYNIX/ptx, so the very few 1357 + places that needed something like ``synchronize_rcu()`` simply 1358 + open-coded it. 1359 + 1360 + +-----------------------------------------------------------------------+ 1361 + | **Quick Quiz**: | 1362 + +-----------------------------------------------------------------------+ 1363 + | Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` | 1364 + | allowed updaters to avoid being blocked by readers. But how can that | 1365 + | be correct, given that the invocation of the callback and the freeing | 1366 + | of the memory (respectively) must still wait for a grace period to | 1367 + | elapse? 
| 1368 + +-----------------------------------------------------------------------+ 1369 + | **Answer**: | 1370 + +-----------------------------------------------------------------------+ 1371 + | We could define things this way, but keep in mind that this sort of | 1372 + | definition would say that updates in garbage-collected languages | 1373 + | cannot complete until the next time the garbage collector runs, which | 1374 + | does not seem at all reasonable. The key point is that in most cases, | 1375 + | an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed | 1376 + | to the next update as soon as it has invoked ``call_rcu()`` or | 1377 + | ``kfree_rcu()``, without having to wait for a subsequent grace | 1378 + | period. | 1379 + +-----------------------------------------------------------------------+ 1380 + 1381 + But what if the updater must wait for the completion of code to be 1382 + executed after the end of the grace period, but has other tasks that can 1383 + be carried out in the meantime? The polling-style 1384 + ``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions 1385 + may be used for this purpose, as shown below: 1386 + 1387 + :: 1388 + 1389 + 1 bool remove_gp_poll(void) 1390 + 2 { 1391 + 3 struct foo *p; 1392 + 4 unsigned long s; 1393 + 5 1394 + 6 spin_lock(&gp_lock); 1395 + 7 p = rcu_access_pointer(gp); 1396 + 8 if (!p) { 1397 + 9 spin_unlock(&gp_lock); 1398 + 10 return false; 1399 + 11 } 1400 + 12 rcu_assign_pointer(gp, NULL); 1401 + 13 spin_unlock(&gp_lock); 1402 + 14 s = get_state_synchronize_rcu(); 1403 + 15 do_something_while_waiting(); 1404 + 16 cond_synchronize_rcu(s); 1405 + 17 kfree(p); 1406 + 18 return true; 1407 + 19 } 1408 + 1409 + On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU, 1410 + then line 15 carries out other tasks, and finally, line 16 returns 1411 + immediately if a grace period has elapsed in the meantime, but otherwise 1412 + waits as required. 
The need for ``get_state_synchronize_rcu`` and 1413 + ``cond_synchronize_rcu()`` has appeared quite recently, so it is too 1414 + early to tell whether they will stand the test of time. 1415 + 1416 + RCU thus provides a range of tools to allow updaters to strike the 1417 + required tradeoff between latency, flexibility and CPU overhead. 1418 + 1419 + Forward Progress 1420 + ~~~~~~~~~~~~~~~~ 1421 + 1422 + In theory, delaying grace-period completion and callback invocation is 1423 + harmless. In practice, not only are memory sizes finite but also 1424 + callbacks sometimes do wakeups, and sufficiently deferred wakeups can be 1425 + difficult to distinguish from system hangs. Therefore, RCU must provide 1426 + a number of mechanisms to promote forward progress. 1427 + 1428 + These mechanisms are not foolproof, nor can they be. For one simple 1429 + example, an infinite loop in an RCU read-side critical section must by 1430 + definition prevent later grace periods from ever completing. For a more 1431 + involved example, consider a 64-CPU system built with 1432 + ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where 1433 + CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even 1434 + if these tight loops also contain calls to ``cond_resched()`` (thus 1435 + allowing grace periods to complete), CPU 0 simply will not be able to 1436 + invoke callbacks as fast as the other 63 CPUs can register them, at 1437 + least not until the system runs out of memory. In both of these 1438 + examples, the Spiderman principle applies: With great power comes great 1439 + responsibility. However, short of this level of abuse, RCU is required 1440 + to ensure timely completion of grace periods and timely invocation of 1441 + callbacks. 1442 + 1443 + RCU takes the following steps to encourage timely completion of grace 1444 + periods: 1445 + 1446 + #. 
If a grace period fails to complete within 100 milliseconds, RCU 1447 + causes future invocations of ``cond_resched()`` on the holdout CPUs 1448 + to provide an RCU quiescent state. RCU also causes those CPUs' 1449 + ``need_resched()`` invocations to return ``true``, but only after the 1450 + corresponding CPU's next scheduling-clock interrupt. 1451 + #. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run 1452 + indefinitely in the kernel without scheduling-clock interrupts, which 1453 + defeats the above ``need_resched()`` stratagem. RCU will therefore 1454 + invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out 1455 + after 109 milliseconds. 1456 + #. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that 1457 + has been preempted within an RCU read-side critical section is 1458 + holding out for more than 500 milliseconds, RCU will resort to 1459 + priority boosting. 1460 + #. If a CPU is still holding out 10 seconds into the grace period, RCU 1461 + will invoke ``resched_cpu()`` on it regardless of its ``nohz_full`` 1462 + state. 1463 + 1464 + The above values are defaults for systems running with ``HZ=1000``. They 1465 + will vary as the value of ``HZ`` varies, and can also be changed using 1466 + the relevant Kconfig options and kernel boot parameters. RCU currently 1467 + does not do much sanity checking of these parameters, so please use 1468 + caution when changing them. Note that these forward-progress measures 1469 + are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks 1470 + RCU <#Tasks%20RCU>`__. 1471 + 1472 + RCU takes the following steps in ``call_rcu()`` to encourage timely 1473 + invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has 1474 + 10,000 callbacks, or has 10,000 more callbacks than it had the last time 1475 + encouragement was provided: 1476 + 1477 + #. Starts a grace period, if one is not already in progress. 1478 + #. 
Forces immediate checking for quiescent states, rather than waiting 1479 + for three milliseconds to have elapsed since the beginning of the 1480 + grace period. 1481 + #. Immediately tags the CPU's callbacks with their grace period 1482 + completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` 1483 + handler to get around to it. 1484 + #. Lifts callback-execution batch limits, which speeds up callback 1485 + invocation at the expense of degrading realtime response. 1486 + 1487 + Again, these are default values when running at ``HZ=1000``, and can be 1488 + overridden. Again, these forward-progress measures are provided only for 1489 + RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks 1490 + RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward 1491 + progress for ``rcu_nocbs`` CPUs is much less well-developed, in part 1492 + because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke 1493 + ``call_rcu()`` relatively infrequently. If workloads emerge that need 1494 + both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then 1495 + additional forward-progress work will be required. 1496 + 1497 + Composability 1498 + ~~~~~~~~~~~~~ 1499 + 1500 + Composability has received much attention in recent years, perhaps in 1501 + part due to the collision of multicore hardware with object-oriented 1502 + techniques designed in single-threaded environments for single-threaded 1503 + use. And in theory, RCU read-side critical sections may be composed, and 1504 + in fact may be nested arbitrarily deeply. In practice, as with all 1505 + real-world implementations of composable constructs, there are 1506 + limitations. 1507 + 1508 + Implementations of RCU for which ``rcu_read_lock()`` and 1509 + ``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when 1510 + ``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there 1511 + is no overhead. 
Except that if all these instances of 1512 + ``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the 1513 + compiler, compilation will eventually fail due to exhausting memory, 1514 + mass storage, or user patience, whichever comes first. If the nesting is 1515 + not visible to the compiler, as is the case with mutually recursive 1516 + functions each in its own translation unit, stack overflow will result. 1517 + If the nesting takes the form of loops, perhaps in the guise of tail 1518 + recursion, either the control variable will overflow or (in the Linux 1519 + kernel) you will get an RCU CPU stall warning. Nevertheless, this class 1520 + of RCU implementations is one of the most composable constructs in 1521 + existence. 1522 + 1523 + RCU implementations that explicitly track nesting depth are limited by 1524 + the nesting-depth counter. For example, the Linux kernel's preemptible 1525 + RCU limits nesting to ``INT_MAX``. This should suffice for almost all 1526 + practical purposes. That said, a consecutive pair of RCU read-side 1527 + critical sections between which there is an operation that waits for a 1528 + grace period cannot be enclosed in another RCU read-side critical 1529 + section. This is because it is not legal to wait for a grace period 1530 + within an RCU read-side critical section: To do so would result either 1531 + in deadlock or in RCU implicitly splitting the enclosing RCU read-side 1532 + critical section, neither of which is conducive to a long-lived and 1533 + prosperous kernel. 1534 + 1535 + It is worth noting that RCU is not alone in limiting composability. For 1536 + example, many transactional-memory implementations prohibit composing a 1537 + pair of transactions separated by an irrevocable operation (for example, 1538 + a network receive operation). For another example, lock-based critical 1539 + sections can be composed surprisingly freely, but only if deadlock is 1540 + avoided. 
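The nesting restriction described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every ``toy_*`` name is hypothetical and is not a kernel API, the nesting counter merely stands in for the one that preemptible RCU maintains, and the grace-period wait is reduced to a legality check rather than an actual wait.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
static int toy_nesting;	/* current read-side nesting depth */

/* Enter a read-side critical section; fails only at the counter limit. */
static bool toy_read_lock(void)
{
	if (toy_nesting == INT_MAX)
		return false;
	toy_nesting++;
	return true;
}

/* Leave the innermost read-side critical section. */
static void toy_read_unlock(void)
{
	assert(toy_nesting > 0);
	toy_nesting--;
}

/*
 * Model of a grace-period wait.  It does not wait for anything; it only
 * reports whether waiting would be legal, that is, whether the caller
 * is outside of all read-side critical sections.
 */
static bool toy_wait_for_grace_period(void)
{
	return toy_nesting == 0;
}

/* Two critical sections separated by a grace-period wait. */
static bool toy_two_sections(void)
{
	bool legal;

	toy_read_lock();
	toy_read_unlock();
	legal = toy_wait_for_grace_period();
	toy_read_lock();
	toy_read_unlock();
	return legal;
}

/* The same sequence enclosed in an outer critical section. */
static bool toy_two_sections_nested(void)
{
	bool legal;

	toy_read_lock();
	legal = toy_two_sections();
	toy_read_unlock();
	return legal;
}
```

In this model, ``toy_two_sections()`` is legal when called on its own, but the identical sequence becomes illegal once ``toy_two_sections_nested()`` wraps it in an outer critical section, which is the composability limit in question.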
1541 + 1542 + In short, although RCU read-side critical sections are highly 1543 + composable, care is required in some situations, just as is the case for 1544 + any other composable synchronization mechanism. 1545 + 1546 + Corner Cases 1547 + ~~~~~~~~~~~~ 1548 + 1549 + A given RCU workload might have an endless and intense stream of RCU 1550 + read-side critical sections, perhaps even so intense that there was 1551 + never a point in time during which there was not at least one RCU 1552 + read-side critical section in flight. RCU cannot allow this situation to 1553 + block grace periods: As long as all the RCU read-side critical sections 1554 + are finite, grace periods must also be finite. 1555 + 1556 + That said, preemptible RCU implementations could potentially result in 1557 + RCU read-side critical sections being preempted for long durations, 1558 + which has the effect of creating a long-duration RCU read-side critical 1559 + section. This situation can arise only in heavily loaded systems, but 1560 + systems using real-time priorities are of course more vulnerable. 1561 + Therefore, RCU priority boosting is provided to help deal with this 1562 + case. That said, the exact requirements on RCU priority boosting will 1563 + likely evolve as more experience accumulates. 1564 + 1565 + Other workloads might have very high update rates. Although one can 1566 + argue that such workloads should instead use something other than RCU, 1567 + the fact remains that RCU must handle such workloads gracefully. This 1568 + requirement is another factor driving batching of grace periods, but it 1569 + is also the driving force behind the checks for large numbers of queued 1570 + RCU callbacks in the ``call_rcu()`` code path. 
Finally, high update 1571 + rates should not delay RCU read-side critical sections, although some 1572 + small read-side delays can occur when using 1573 + ``synchronize_rcu_expedited()``, courtesy of this function's use of 1574 + ``smp_call_function_single()``. 1575 + 1576 + Although all three of these corner cases were understood in the early 1577 + 1990s, a simple user-level test consisting of ``close(open(path))`` in a 1578 + tight loop in the early 2000s suddenly provided a much deeper 1579 + appreciation of the high-update-rate corner case. This test also 1580 + motivated addition of some RCU code to react to high update rates, for 1581 + example, if a given CPU finds itself with more than 10,000 RCU callbacks 1582 + queued, it will cause RCU to take evasive action by more aggressively 1583 + starting grace periods and more aggressively forcing completion of 1584 + grace-period processing. This evasive action causes the grace period to 1585 + complete more quickly, but at the cost of restricting RCU's batching 1586 + optimizations, thus increasing the CPU overhead incurred by that grace 1587 + period. 1588 + 1589 + Software-Engineering Requirements 1590 + --------------------------------- 1591 + 1592 + Between Murphy's Law and “To err is human”, it is necessary to guard 1593 + against mishaps and misuse: 1594 + 1595 + #. It is all too easy to forget to use ``rcu_read_lock()`` everywhere 1596 + that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will 1597 + splat if ``rcu_dereference()`` is used outside of an RCU read-side 1598 + critical section. Update-side code can use 1599 + ``rcu_dereference_protected()``, which takes a `lockdep 1600 + expression <https://lwn.net/Articles/371986/>`__ to indicate what is 1601 + providing the protection. If the indicated protection is not 1602 + provided, a lockdep splat is emitted. 
1603 + Code shared between readers and updaters can use 1604 + ``rcu_dereference_check()``, which also takes a lockdep expression, 1605 + and emits a lockdep splat if neither ``rcu_read_lock()`` nor the 1606 + indicated protection is in place. In addition, 1607 + ``rcu_dereference_raw()`` is used in those (hopefully rare) cases 1608 + where the required protection cannot be easily described. Finally, 1609 + ``rcu_read_lock_held()`` is provided to allow a function to verify 1610 + that it has been invoked within an RCU read-side critical section. I 1611 + was made aware of this set of requirements shortly after Thomas 1612 + Gleixner audited a number of RCU uses. 1613 + #. A given function might wish to check for RCU-related preconditions 1614 + upon entry, before using any other RCU API. The 1615 + ``rcu_lockdep_assert()`` does this job, asserting the expression in 1616 + kernels having lockdep enabled and doing nothing otherwise. 1617 + #. It is also easy to forget to use ``rcu_assign_pointer()`` and 1618 + ``rcu_dereference()``, perhaps (incorrectly) substituting a simple 1619 + assignment. To catch this sort of error, a given RCU-protected 1620 + pointer may be tagged with ``__rcu``, after which sparse will 1621 + complain about simple-assignment accesses to that pointer. Arnd 1622 + Bergmann made me aware of this requirement, and also supplied the 1623 + needed `patch series <https://lwn.net/Articles/376011/>`__. 1624 + #. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if 1625 + a data element is passed to ``call_rcu()`` twice in a row, without a 1626 + grace period in between. (This error is similar to a double free.) 1627 + The corresponding ``rcu_head`` structures that are dynamically 1628 + allocated are automatically tracked, but ``rcu_head`` structures 1629 + allocated on the stack must be initialized with 1630 + ``init_rcu_head_on_stack()`` and cleaned up with 1631 + ``destroy_rcu_head_on_stack()``. 
Similarly, statically allocated 1632 + non-stack ``rcu_head`` structures must be initialized with 1633 + ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``. 1634 + Mathieu Desnoyers made me aware of this requirement, and also 1635 + supplied the needed 1636 + `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__. 1637 + #. An infinite loop in an RCU read-side critical section will eventually 1638 + trigger an RCU CPU stall warning splat, with the duration of 1639 + “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` 1640 + ``Kconfig`` option, or, alternatively, by the 1641 + ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU 1642 + is not obligated to produce this splat unless there is a grace period 1643 + waiting on that particular RCU read-side critical section. 1644 + 1645 + Some extreme workloads might intentionally delay RCU grace periods, 1646 + and systems running those workloads can be booted with 1647 + ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This 1648 + kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU 1649 + stall warnings are counter-productive during sysrq dumps and during 1650 + panics. RCU therefore supplies the ``rcu_sysrq_start()`` and 1651 + ``rcu_sysrq_end()`` API members to be called before and after long 1652 + sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is 1653 + automatically invoked at the beginning of a panic to suppress further 1654 + RCU CPU stall warnings. 1655 + 1656 + This requirement made itself known in the early 1990s, pretty much 1657 + the first time that it was necessary to debug a CPU stall. That said, 1658 + the initial implementation in DYNIX/ptx was quite generic in 1659 + comparison with that of Linux. 1660 + 1661 + #. Although it would be very good to detect pointers leaking out of RCU 1662 + read-side critical sections, there is currently no good way of doing 1663 + this. 
One complication is the need to distinguish between pointers 1664 + leaking and pointers that have been handed off from RCU to some other 1665 + synchronization mechanism, for example, reference counting. 1666 + #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information 1667 + is provided via event tracing. 1668 + #. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()`` 1669 + to create typical linked data structures can be surprisingly 1670 + error-prone. Therefore, RCU-protected `linked 1671 + lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and, 1672 + more recently, RCU-protected `hash 1673 + tables <https://lwn.net/Articles/612100/>`__ are available. Many 1674 + other special-purpose RCU-protected data structures are available in 1675 + the Linux kernel and the userspace RCU library. 1676 + #. Some linked structures are created at compile time, but still require 1677 + ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves 1678 + this purpose. 1679 + #. It is not necessary to use ``rcu_assign_pointer()`` when creating 1680 + linked structures that are to be published via a single external 1681 + pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task 1682 + and also for assigning ``NULL`` pointers at runtime. 1683 + 1684 + This is not a hard-and-fast list: RCU's diagnostic capabilities will 1685 + continue to be guided by the number and type of usage bugs found in 1686 + real-world RCU usage. 1687 + 1688 + Linux Kernel Complications 1689 + -------------------------- 1690 + 1691 + The Linux kernel provides an interesting environment for all kinds of 1692 + software, including RCU. Some of the relevant points of interest are as 1693 + follows: 1694 + 1695 + #. `Configuration <#Configuration>`__. 1696 + #. `Firmware Interface <#Firmware%20Interface>`__. 1697 + #. `Early Boot <#Early%20Boot>`__. 1698 + #. `Interrupts and non-maskable interrupts 1699 + (NMIs) <#Interrupts%20and%20NMIs>`__. 1700 + #. 
`Loadable Modules <#Loadable%20Modules>`__. 1701 + #. `Hotplug CPU <#Hotplug%20CPU>`__. 1702 + #. `Scheduler and RCU <#Scheduler%20and%20RCU>`__. 1703 + #. `Tracing and RCU <#Tracing%20and%20RCU>`__. 1704 + #. `Energy Efficiency <#Energy%20Efficiency>`__. 1705 + #. `Scheduling-Clock Interrupts and 1706 + RCU <#Scheduling-Clock%20Interrupts%20and%20RCU>`__. 1707 + #. `Memory Efficiency <#Memory%20Efficiency>`__. 1708 + #. `Performance, Scalability, Response Time, and 1709 + Reliability <#Performance,%20Scalability,%20Response%20Time,%20and%20Reliability>`__. 1710 + 1711 + This list is probably incomplete, but it does give a feel for the most 1712 + notable Linux-kernel complications. Each of the following sections 1713 + covers one of the above topics. 1714 + 1715 + Configuration 1716 + ~~~~~~~~~~~~~ 1717 + 1718 + RCU's goal is automatic configuration, so that almost nobody needs to 1719 + worry about RCU's ``Kconfig`` options. And for almost all users, RCU 1720 + does in fact work well “out of the box.” 1721 + 1722 + However, there are specialized use cases that are handled by kernel boot 1723 + parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig`` 1724 + system will explicitly ask users about new ``Kconfig`` options, which 1725 + requires that almost all of them be hidden behind a ``CONFIG_RCU_EXPERT`` 1726 + ``Kconfig`` option. 1727 + 1728 + This all should be quite obvious, but the fact remains that Linus 1729 + Torvalds recently had to 1730 + `remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__ 1731 + me of this requirement. 1732 + 1733 + Firmware Interface 1734 + ~~~~~~~~~~~~~~~~~~ 1735 + 1736 + In many cases, the kernel obtains information about the system from the 1737 + firmware, and sometimes things are lost in translation. Or the 1738 + translation is accurate, but the original message is bogus. 
1739 + 1740 + For example, some systems' firmware overreports the number of CPUs, 1741 + sometimes by a large factor. If RCU naively believed the firmware, as it 1742 + used to do, it would create too many per-CPU kthreads. Although the 1743 + resulting system will still run correctly, the extra kthreads needlessly 1744 + consume memory and can cause confusion when they show up in ``ps`` 1745 + listings. 1746 + 1747 + RCU must therefore wait for a given CPU to actually come online before 1748 + it can allow itself to believe that the CPU actually exists. The 1749 + resulting “ghost CPUs” (which are never going to come online) cause a 1750 + number of `interesting 1751 + complications <https://paulmck.livejournal.com/37494.html>`__. 1752 + 1753 + Early Boot 1754 + ~~~~~~~~~~ 1755 + 1756 + The Linux kernel's boot sequence is an interesting process, and RCU is 1757 + used early, even before ``rcu_init()`` is invoked. In fact, a number of 1758 + RCU's primitives can be used as soon as the initial task's 1759 + ``task_struct`` is available and the boot CPU's per-CPU variables are 1760 + set up. The read-side primitives (``rcu_read_lock()``, 1761 + ``rcu_read_unlock()``, ``rcu_dereference()``, and 1762 + ``rcu_access_pointer()``) will operate normally very early on, as will 1763 + ``rcu_assign_pointer()``. 1764 + 1765 + Although ``call_rcu()`` may be invoked at any time during boot, 1766 + callbacks are not guaranteed to be invoked until after all of RCU's 1767 + kthreads have been spawned, which occurs at ``early_initcall()`` time. 1768 + This delay in callback invocation is due to the fact that RCU does not 1769 + invoke callbacks until it is fully initialized, and this full 1770 + initialization cannot occur until after the scheduler has initialized 1771 + itself to the point where RCU can spawn and run its kthreads. 
In theory, 1772 + it would be possible to invoke callbacks earlier; however, this is not a 1773 + panacea because there would be severe restrictions on what operations 1774 + those callbacks could invoke. 1775 + 1776 + Perhaps surprisingly, ``synchronize_rcu()`` and 1777 + ``synchronize_rcu_expedited()`` will operate normally during very early 1778 + boot, the reason being that there is only one CPU and preemption is 1779 + disabled. This means that the call to ``synchronize_rcu()`` (or friends) 1780 + itself is a quiescent state and thus a grace period, so the early-boot 1781 + implementation can be a no-op. 1782 + 1783 + However, once the scheduler has spawned its first kthread, this early 1784 + boot trick fails for ``synchronize_rcu()`` (as well as for 1785 + ``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The 1786 + reason is that an RCU read-side critical section might be preempted, 1787 + which means that a subsequent ``synchronize_rcu()`` really does have to 1788 + wait for something, as opposed to simply returning immediately. 1789 + Unfortunately, ``synchronize_rcu()`` can't do this until all of its 1790 + kthreads are spawned, which doesn't happen until some time during 1791 + ``early_initcall()`` time. But this is no excuse: RCU is nevertheless 1792 + required to correctly handle synchronous grace periods during this time 1793 + period. Once all of its kthreads are up and running, RCU starts running 1794 + normally. 1795 + 1796 + +-----------------------------------------------------------------------+ 1797 + | **Quick Quiz**: | 1798 + +-----------------------------------------------------------------------+ 1799 + | How can RCU possibly handle grace periods before all of its kthreads | 1800 + | have been spawned??? | 1801 + +-----------------------------------------------------------------------+ 1802 + | **Answer**: | 1803 + +-----------------------------------------------------------------------+ 1804 + | Very carefully! 
| 1805 + | During the “dead zone” between the time that the scheduler spawns the | 1806 + | first task and the time that all of RCU's kthreads have been spawned, | 1807 + | all synchronous grace periods are handled by the expedited | 1808 + | grace-period mechanism. At runtime, this expedited mechanism relies | 1809 + | on workqueues, but during the dead zone the requesting task itself | 1810 + | drives the desired expedited grace period. Because dead-zone | 1811 + | execution takes place within task context, everything works. Once the | 1812 + | dead zone ends, expedited grace periods go back to using workqueues, | 1813 + | as is required to avoid problems that would otherwise occur when a | 1814 + | user task received a POSIX signal while driving an expedited grace | 1815 + | period. | 1816 + | | 1817 + | And yes, this does mean that it is unhelpful to send POSIX signals to | 1818 + | random tasks between the time that the scheduler spawns its first | 1819 + | kthread and the time that RCU's kthreads have all been spawned. If | 1820 + | there ever turns out to be a good reason for sending POSIX signals | 1821 + | during that time, appropriate adjustments will be made. (If it turns | 1822 + | out that POSIX signals are sent during this time for no good reason, | 1823 + | other adjustments will be made, appropriate or otherwise.) | 1824 + +-----------------------------------------------------------------------+ 1825 + 1826 + I learned of these boot-time requirements as a result of a series of 1827 + system hangs. 1828 + 1829 + Interrupts and NMIs 1830 + ~~~~~~~~~~~~~~~~~~~ 1831 + 1832 + The Linux kernel has interrupts, and RCU read-side critical sections are 1833 + legal within interrupt handlers and within interrupt-disabled regions of 1834 + code, as are invocations of ``call_rcu()``. 
1835 + 1836 + Some Linux-kernel architectures can enter an interrupt handler from 1837 + non-idle process context, and then just never leave it, instead 1838 + stealthily transitioning back to process context. This trick is 1839 + sometimes used to invoke system calls from inside the kernel. These 1840 + “half-interrupts” mean that RCU has to be very careful about how it 1841 + counts interrupt nesting levels. I learned of this requirement the hard 1842 + way during a rewrite of RCU's dyntick-idle code. 1843 + 1844 + The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side 1845 + critical sections are legal within NMI handlers. Thankfully, RCU 1846 + update-side primitives, including ``call_rcu()``, are prohibited within 1847 + NMI handlers. 1848 + 1849 + The name notwithstanding, some Linux-kernel architectures can have 1850 + nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised 1851 + me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__ 1852 + with this requirement; he also kindly surprised me with `an 1853 + algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__ 1854 + that meets this requirement. 1855 + 1856 + Furthermore, NMI handlers can be interrupted by what appear to RCU to be 1857 + normal interrupts. One way that this can happen is for code that 1858 + directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called 1859 + from an NMI handler. This astonishing fact of life prompted the current 1860 + code structure, which has ``rcu_irq_enter()`` invoking 1861 + ``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``. 1862 + And yes, I also learned of this requirement the hard way. 1863 + 1864 + Loadable Modules 1865 + ~~~~~~~~~~~~~~~~ 1866 + 1867 + The Linux kernel has loadable modules, and these modules can also be 1868 + unloaded. 
After a given module has been unloaded, any attempt to call 1869 + one of its functions results in a segmentation fault. The module-unload 1870 + functions must therefore cancel any delayed calls to loadable-module 1871 + functions, for example, any outstanding ``mod_timer()`` must be dealt 1872 + with via ``del_timer_sync()`` or similar. 1873 + 1874 + Unfortunately, there is no way to cancel an RCU callback; once you 1875 + invoke ``call_rcu()``, the callback function is eventually going to be 1876 + invoked, unless the system goes down first. Because it is normally 1877 + considered socially irresponsible to crash the system in response to a 1878 + module unload request, we need some other way to deal with in-flight RCU 1879 + callbacks. 1880 + 1881 + RCU therefore provides ``rcu_barrier()``, which waits until all 1882 + in-flight RCU callbacks have been invoked. If a module uses 1883 + ``call_rcu()``, its exit function should therefore prevent any future 1884 + invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory, 1885 + the underlying module-unload code could invoke ``rcu_barrier()`` 1886 + unconditionally, but in practice this would incur unacceptable 1887 + latencies. 1888 + 1889 + Nikita Danilov noted this requirement for an analogous 1890 + filesystem-unmount situation, and Dipankar Sarma incorporated 1891 + ``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module 1892 + unloading became apparent later. 1893 + 1894 + .. important:: 1895 + 1896 + The ``rcu_barrier()`` function is not, repeat, 1897 + *not*, obligated to wait for a grace period. It is instead only required 1898 + to wait for RCU callbacks that have already been posted. Therefore, if 1899 + there are no RCU callbacks posted anywhere in the system, 1900 + ``rcu_barrier()`` is within its rights to return immediately. Even if 1901 + there are callbacks posted, ``rcu_barrier()`` does not necessarily need 1902 + to wait for a grace period. 
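The module-exit ordering described above (first prevent new ``call_rcu()`` invocations, then invoke ``rcu_barrier()``, and only then let the module's code go away) can be sketched as a single-threaded user-space model. This is a rough illustration and not kernel code: every ``toy_*`` name is hypothetical, the pending-callback array stands in for RCU's per-CPU callback lists, and the model barrier invokes the callbacks itself, whereas the real ``rcu_barrier()`` only waits for RCU to invoke them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-threaded model; these are not kernel APIs. */
#define TOY_MAX_CBS 8

typedef void (*toy_cb_t)(void);

static toy_cb_t toy_pending[TOY_MAX_CBS];	/* posted, not yet invoked */
static size_t toy_npending;
static bool toy_module_loaded = true;	/* false once module code is gone */
static bool toy_module_stopping;	/* blocks further toy_call_rcu() */
static int toy_late_callbacks;		/* callbacks that ran after unload */

/* A callback living in the module; running it after unload is the bug. */
static void toy_module_cb(void)
{
	if (!toy_module_loaded)
		toy_late_callbacks++;
}

/* Post a callback unless the module has begun shutting down. */
static bool toy_call_rcu(toy_cb_t cb)
{
	if (toy_module_stopping || toy_npending == TOY_MAX_CBS)
		return false;
	toy_pending[toy_npending++] = cb;
	return true;
}

/*
 * Model of rcu_barrier(): returns once every already-posted callback
 * has been invoked.  With nothing posted, it returns immediately.
 */
static void toy_rcu_barrier(void)
{
	for (size_t i = 0; i < toy_npending; i++)
		toy_pending[i]();
	toy_npending = 0;
}

/* Module exit path following the ordering described in the text. */
static void toy_module_exit(void)
{
	toy_module_stopping = true;	/* 1. no future callbacks */
	toy_rcu_barrier();		/* 2. drain in-flight callbacks */
	toy_module_loaded = false;	/* 3. code may now be unloaded */
}
```

Swapping steps 2 and 3 in ``toy_module_exit()`` would cause ``toy_late_callbacks`` to become nonzero, which in a real kernel would correspond to a callback executing code that has already been unloaded.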
1903 + 1904 + +-----------------------------------------------------------------------+ 1905 + | **Quick Quiz**: | 1906 + +-----------------------------------------------------------------------+ 1907 + | Wait a minute! Each RCU callback must wait for a grace period to | 1908 + | complete, and ``rcu_barrier()`` must wait for each pre-existing | 1909 + | callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to | 1910 + | wait for a full grace period if there is even one callback posted | 1911 + | anywhere in the system? | 1912 + +-----------------------------------------------------------------------+ 1913 + | **Answer**: | 1914 + +-----------------------------------------------------------------------+ 1915 + | Absolutely not!!! | 1916 + | Yes, each RCU callback must wait for a grace period to complete, but | 1917 + | it might well be partly (or even completely) finished waiting by the | 1918 + | time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` | 1919 + | need only wait for the remaining portion of the grace period to | 1920 + | elapse. So even if there are quite a few callbacks posted, | 1921 + | ``rcu_barrier()`` might well return quite quickly. | 1922 + | | 1923 + | So if you need to wait for a grace period as well as for all | 1924 + | pre-existing callbacks, you will need to invoke both | 1925 + | ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, | 1926 + | you can always use workqueues to invoke them concurrently. | 1927 + +-----------------------------------------------------------------------+ 1928 + 1929 + Hotplug CPU 1930 + ~~~~~~~~~~~ 1931 + 1932 + The Linux kernel supports CPU hotplug, which means that CPUs can come 1933 + and go. It is of course illegal to use any RCU API member from an 1934 + offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side 1935 + critical sections.
This requirement was present from day one in 1936 + DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug 1937 + implementation is “interesting.” 1938 + 1939 + The Linux-kernel CPU-hotplug implementation has notifiers that are used 1940 + to allow the various kernel subsystems (including RCU) to respond 1941 + appropriately to a given CPU-hotplug operation. Most RCU operations may 1942 + be invoked from CPU-hotplug notifiers, including even synchronous 1943 + grace-period operations such as ``synchronize_rcu()`` and 1944 + ``synchronize_rcu_expedited()``. 1945 + 1946 + However, all-callback-wait operations such as ``rcu_barrier()`` are 1947 + not supported, due to the fact that there are phases of CPU-hotplug 1948 + operations where the outgoing CPU's callbacks will not be invoked until 1949 + after the CPU-hotplug operation ends, which could result in 1950 + deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations 1951 + during its execution, which results in another type of deadlock when 1952 + invoked from a CPU-hotplug notifier. 1953 + 1954 + Scheduler and RCU 1955 + ~~~~~~~~~~~~~~~~~ 1956 + 1957 + RCU depends on the scheduler, and the scheduler uses RCU to protect some 1958 + of its data structures. The preemptible-RCU ``rcu_read_unlock()`` 1959 + implementation must therefore be written carefully to avoid deadlocks 1960 + involving the scheduler's runqueue and priority-inheritance locks. In 1961 + particular, ``rcu_read_unlock()`` must tolerate an interrupt where the 1962 + interrupt handler invokes both ``rcu_read_lock()`` and 1963 + ``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()`` 1964 + to use negative nesting levels to avoid destructive recursion via 1965 + the interrupt handler's use of RCU. 1966 + 1967 + This scheduler-RCU requirement came as a `complete 1968 + surprise <https://lwn.net/Articles/453002/>`__.
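The negative-nesting idea can be sketched with a userspace toy model. This is an illustration under stated assumptions, not the kernel's code: the nesting counter, the "special work pending" flag, and the simulated interrupt hook are all hypothetical ``toy_`` names, and the real implementation differs in many details. The point is only that the outermost unlock parks the nesting level at a large negative value while it runs its slow path, so that a reader inside an interrupt handler arriving at that moment never sees a nesting level of exactly one and therefore never re-enters the slow path.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

static int toy_nesting;             /* models the per-task read-side nesting depth */
static int toy_special_pending;     /* models "the outermost unlock has extra work" */
static int toy_special_runs;        /* number of times the slow path ran */
static void (*toy_irq_hook)(void);  /* simulated interrupt handler, may be NULL */

void toy_rcu_read_lock(void)
{
	toy_nesting++;
}

/*
 * Slow path.  In this model a simulated interrupt arrives before the
 * pending work has been cleared, which is the dangerous window.
 */
static void toy_unlock_special(void)
{
	toy_special_runs++;
	if (toy_irq_hook)
		toy_irq_hook();
	toy_special_pending = 0;
}

void toy_rcu_read_unlock(void)
{
	if (toy_nesting != 1) {
		/* Nested reader, or a reader running while the level is negative. */
		toy_nesting--;
		return;
	}
	toy_nesting = INT_MIN;          /* outermost: mark slow path in progress */
	if (toy_special_pending)
		toy_unlock_special();
	toy_nesting = 0;
}

/* Simulated interrupt handler that contains its own RCU reader. */
void toy_irq_handler(void)
{
	toy_rcu_read_lock();
	toy_rcu_read_unlock();
}
```

Had the model set the nesting level to zero before running the slow path, the handler's own lock/unlock pair would have looked like an outermost reader and recursed into ``toy_unlock_special()`` while the work was still pending.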
1969 + 1970 + As noted above, RCU makes use of kthreads, and it is necessary to avoid 1971 + excessive CPU-time accumulation by these kthreads. This requirement was 1972 + no surprise, but RCU's violation of it when running context-switch-heavy 1973 + workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a 1974 + surprise 1975 + [PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__. 1976 + RCU has made good progress towards meeting this requirement, even for 1977 + context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is 1978 + room for further improvement. 1979 + 1980 + It is forbidden to hold any of the scheduler's runqueue or 1981 + priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless 1982 + interrupts have been disabled across the entire RCU read-side critical 1983 + section, that is, up to and including the matching ``rcu_read_lock()``. 1984 + Violating this restriction can result in deadlocks involving these 1985 + scheduler spinlocks. There was hope that this restriction might be 1986 + lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started 1987 + deferring the reporting of the resulting RCU-preempt quiescent state 1988 + until the end of the corresponding interrupts-disabled region. 1989 + Unfortunately, timely reporting of the corresponding quiescent state to 1990 + expedited grace periods requires a call to ``raise_softirq()``, which 1991 + can acquire these scheduler spinlocks. In addition, real-time systems 1992 + using RCU priority boosting need this restriction to remain in effect 1993 + because deferred quiescent-state reporting would also defer deboosting, 1994 + which in turn would degrade real-time latencies.
1995 + 1996 + In theory, if a given RCU read-side critical section could be guaranteed 1997 + to be less than one second in duration, holding a scheduler spinlock 1998 + across that critical section's ``rcu_read_unlock()`` would require only 1999 + that preemption be disabled across the entire RCU read-side critical 2000 + section, not interrupts. Unfortunately, given the possibility of vCPU 2001 + preemption, long-running interrupts, and so on, it is not possible in 2002 + practice to guarantee that a given RCU read-side critical section will 2003 + complete in less than one second. Therefore, as noted above, if 2004 + scheduler spinlocks are held across a given call to 2005 + ``rcu_read_unlock()``, interrupts must be disabled across the entire RCU 2006 + read-side critical section. 2007 + 2008 + Tracing and RCU 2009 + ~~~~~~~~~~~~~~~ 2010 + 2011 + It is possible to use tracing on RCU code, but tracing itself uses RCU. 2012 + For this reason, ``rcu_dereference_raw_notrace()`` is provided for use 2013 + by tracing, which avoids the destructive recursion that could otherwise 2014 + ensue. This API is also used by virtualization in some architectures, 2015 + where RCU readers execute in environments in which tracing cannot be 2016 + used. The tracing folks both located the requirement and provided the 2017 + needed fix, so this surprise requirement was relatively painless. 2018 + 2019 + Energy Efficiency 2020 + ~~~~~~~~~~~~~~~~~ 2021 + 2022 + Interrupting idle CPUs is considered socially unacceptable, especially 2023 + by people with battery-powered embedded systems. RCU therefore conserves 2024 + energy by detecting which CPUs are idle, including tracking CPUs that 2025 + have been interrupted from idle. This is a large part of the 2026 + energy-efficiency requirement, so I learned of this via an irate phone 2027 + call. 2028 + 2029 + Because RCU avoids interrupting idle CPUs, it is illegal to execute an 2030 + RCU read-side critical section on an idle CPU. 
(Kernels built with 2031 + ``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()`` 2032 + macro and ``_rcuidle`` event tracing are provided to work around this 2033 + restriction. In addition, ``rcu_is_watching()`` may be used to test 2034 + whether or not it is currently legal to run RCU read-side critical 2035 + sections on this CPU. I learned of the need for diagnostics on the one 2036 + hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code. 2037 + Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite 2038 + heavily in the idle loop. However, there are some restrictions on the 2039 + code placed within ``RCU_NONIDLE()``: 2040 + 2041 + #. Blocking is prohibited. In practice, this is not a serious 2042 + restriction given that idle tasks are prohibited from blocking to 2043 + begin with. 2044 + #. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest 2045 + indefinitely deeply. However, given that they can be nested on the 2046 + order of a million deep, even on 32-bit systems, this should not be a 2047 + serious restriction. This nesting limit would probably be reached 2048 + long after the compiler OOMed or the stack overflowed. 2049 + #. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that 2050 + same ``RCU_NONIDLE()``. For example, the following is grossly 2051 + illegal: 2052 + 2053 + :: 2054 + 2055 + RCU_NONIDLE({ 2056 + do_something(); 2057 + goto bad_idea; /* BUG!!! */ 2058 + do_something_else();}); 2059 + bad_idea: 2060 + 2061 + 2062 + It is just as illegal to transfer control into the middle of 2063 + ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in 2064 + as long as you also transferred out, but in practice you could also 2065 + expect to get sharply worded review comments. 2066 + 2067 + It is similarly socially unacceptable to interrupt a ``nohz_full`` CPU 2068 + running in userspace.
RCU must therefore track ``nohz_full`` userspace 2069 + execution. RCU must therefore be able to sample state at two points in 2070 + time, and be able to determine whether or not some other CPU spent any 2071 + time idle and/or executing in userspace. 2072 + 2073 + These energy-efficiency requirements have proven quite difficult to 2074 + understand and to meet, for example, there have been more than five 2075 + clean-sheet rewrites of RCU's energy-efficiency code, the last of which 2076 + was finally able to demonstrate `real energy savings running on real 2077 + hardware 2078 + [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__. 2079 + As noted earlier, I learned of many of these requirements via angry 2080 + phone calls: Flaming me on the Linux-kernel mailing list was apparently 2081 + not sufficient to fully vent their ire at RCU's energy-efficiency bugs! 2082 + 2083 + Scheduling-Clock Interrupts and RCU 2084 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2085 + 2086 + The kernel transitions between in-kernel non-idle execution, userspace 2087 + execution, and the idle loop. Depending on kernel configuration, RCU 2088 + handles these states differently: 2089 + 2090 + +-----------------+------------------+------------------+-----------------+ 2091 + | ``HZ`` Kconfig | In-Kernel | Usermode | Idle | 2092 + +=================+==================+==================+=================+ 2093 + | ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | 2094 + | | scheduling-clock | scheduling-clock | RCU's | 2095 + | | interrupt. | interrupt and | dyntick-idle | 2096 + | | | its detection | detection. | 2097 + | | | of interrupt | | 2098 + | | | from usermode. | | 2099 + +-----------------+------------------+------------------+-----------------+ 2100 + | ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | 2101 + | | scheduling-clock | scheduling-clock | RCU's | 2102 + | | interrupt. 
| interrupt and | dyntick-idle | 2103 + | | | its detection | detection. | 2104 + | | | of interrupt | | 2105 + | | | from usermode. | | 2106 + +-----------------+------------------+------------------+-----------------+ 2107 + | ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | 2108 + | | sometimes rely | RCU's | RCU's | 2109 + | | on | dyntick-idle | dyntick-idle | 2110 + | | scheduling-clock | detection. | detection. | 2111 + | | interrupt. In | | | 2112 + | | other cases, it | | | 2113 + | | is necessary to | | | 2114 + | | bound kernel | | | 2115 + | | execution times | | | 2116 + | | and/or use | | | 2117 + | | IPIs. | | | 2118 + +-----------------+------------------+------------------+-----------------+ 2119 + 2120 + +-----------------------------------------------------------------------+ 2121 + | **Quick Quiz**: | 2122 + +-----------------------------------------------------------------------+ 2123 + | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | 2124 + | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | 2125 + | ``NO_HZ_IDLE`` do? | 2126 + +-----------------------------------------------------------------------+ 2127 + | **Answer**: | 2128 + +-----------------------------------------------------------------------+ 2129 + | Because, as a performance optimization, ``NO_HZ_FULL`` does not | 2130 + | necessarily re-enable the scheduling-clock interrupt on entry to each | 2131 + | and every system call. | 2132 + +-----------------------------------------------------------------------+ 2133 + 2134 + However, RCU must be reliably informed as to whether any given CPU is 2135 + currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that 2136 + CPU is executing in usermode, as discussed 2137 + `earlier <#Energy%20Efficiency>`__. It also requires that the 2138 + scheduling-clock interrupt be enabled when RCU needs it to be: 2139 + 2140 + #. 
If a CPU is either idle or executing in usermode, and RCU believes it 2141 + is non-idle, the scheduling-clock tick had better be running. 2142 + Otherwise, you will get RCU CPU stall warnings. Or at best, very long 2143 + (11-second) grace periods, with a pointless IPI waking the CPU from 2144 + time to time. 2145 + #. If a CPU is in a portion of the kernel that executes RCU read-side 2146 + critical sections, and RCU believes this CPU to be idle, you will get 2147 + random memory corruption. **DON'T DO THIS!!!** 2148 + This is one reason to test with lockdep, which will complain about 2149 + this sort of thing. 2150 + #. If a CPU is in a portion of the kernel that is absolutely positively 2151 + no-joking guaranteed to never execute any RCU read-side critical 2152 + sections, and RCU believes this CPU to be idle, no problem. This 2153 + sort of thing is used by some architectures for light-weight 2154 + exception handlers, which can then avoid the overhead of 2155 + ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and 2156 + exit, respectively. Some go further and avoid the entireties of 2157 + ``irq_enter()`` and ``irq_exit()``. 2158 + Just make very sure you are running some of your tests with 2159 + ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in 2160 + fact joking about not doing RCU read-side critical sections. 2161 + #. If a CPU is executing in the kernel with the scheduling-clock 2162 + interrupt disabled and RCU believes this CPU to be non-idle, and if 2163 + the CPU goes idle (from an RCU perspective) every few jiffies, no 2164 + problem. It is usually OK for there to be the occasional gap between 2165 + idle periods of up to a second or so. 2166 + If the gap grows too long, you get RCU CPU stall warnings. 2167 + #. If a CPU is either idle or executing in usermode, and RCU believes it 2168 + to be idle, of course no problem. 2169 + #.
If a CPU is executing in the kernel, the kernel code path is passing 2170 + through quiescent states at a reasonable frequency (preferably about 2171 + once per few jiffies, but the occasional excursion to a second or so 2172 + is usually OK) and the scheduling-clock interrupt is enabled, of 2173 + course no problem. 2174 + If the gap between a successive pair of quiescent states grows too 2175 + long, you get RCU CPU stall warnings. 2176 + 2177 + +-----------------------------------------------------------------------+ 2178 + | **Quick Quiz**: | 2179 + +-----------------------------------------------------------------------+ 2180 + | But what if my driver has a hardware interrupt handler that can run | 2181 + | for many seconds? I cannot invoke ``schedule()`` from a hardware | 2182 + | interrupt handler, after all! | 2183 + +-----------------------------------------------------------------------+ 2184 + | **Answer**: | 2185 + +-----------------------------------------------------------------------+ 2186 + | One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so | 2187 + | often. But given that long-running interrupt handlers can cause other | 2188 + | problems, not least for response time, shouldn't you work to keep | 2189 + | your interrupt handler's runtime within reasonable bounds? | 2190 + +-----------------------------------------------------------------------+ 2191 + 2192 + But as long as RCU is properly informed of kernel state transitions 2193 + between in-kernel execution, usermode execution, and idle, and as long 2194 + as the scheduling-clock interrupt is enabled when RCU needs it to be, 2195 + you can rest assured that the bugs you encounter will be in some other 2196 + part of RCU or some other part of the kernel! 2197 + 2198 + Memory Efficiency 2199 + ~~~~~~~~~~~~~~~~~ 2200 + 2201 + Although small-memory non-realtime systems can simply use Tiny RCU, code 2202 + size is only one aspect of memory efficiency.
Another aspect is the size 2203 + of the ``rcu_head`` structure used by ``call_rcu()`` and 2204 + ``kfree_rcu()``. Although this structure contains nothing more than a 2205 + pair of pointers, it does appear in many RCU-protected data structures, 2206 + including some that are size critical. The ``page`` structure is a case 2207 + in point, as evidenced by the many occurrences of the ``union`` keyword 2208 + within that structure. 2209 + 2210 + This need for memory efficiency is one reason that RCU uses hand-crafted 2211 + singly linked lists to track the ``rcu_head`` structures that are 2212 + waiting for a grace period to elapse. It is also the reason why 2213 + ``rcu_head`` structures do not contain debug information, such as fields 2214 + tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that 2215 + posted them. Although this information might appear in debug-only kernel 2216 + builds at some point, in the meantime, the ``->func`` field will often 2217 + provide the needed debug information. 2218 + 2219 + However, in some cases, the need for memory efficiency leads to even 2220 + more extreme measures. Returning to the ``page`` structure, the 2221 + ``rcu_head`` field shares storage with a great many other structures 2222 + that are used at various points in the corresponding page's lifetime. In 2223 + order to correctly resolve certain `race 2224 + conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__, 2225 + the Linux kernel's memory-management subsystem needs a particular bit to 2226 + remain zero during all phases of grace-period processing, and that bit 2227 + happens to map to the bottom bit of the ``rcu_head`` structure's 2228 + ``->next`` field. 
RCU makes this guarantee as long as ``call_rcu()`` is 2229 + used to post the callback, as opposed to ``kfree_rcu()`` or some future 2230 + “lazy” variant of ``call_rcu()`` that might one day be created for 2231 + energy-efficiency purposes. 2232 + 2233 + That said, there are limits. RCU requires that the ``rcu_head`` 2234 + structure be aligned to a two-byte boundary, and passing a misaligned 2235 + ``rcu_head`` structure to one of the ``call_rcu()`` family of functions 2236 + will result in a splat. It is therefore necessary to exercise caution 2237 + when packing structures containing fields of type ``rcu_head``. Why not 2238 + a four-byte or even eight-byte alignment requirement? Because the m68k 2239 + architecture provides only two-byte alignment, and thus acts as 2240 + alignment's least common denominator. 2241 + 2242 + The reason for reserving the bottom bit of pointers to ``rcu_head`` 2243 + structures is to leave the door open to “lazy” callbacks whose 2244 + invocations can safely be deferred. Deferring invocation could 2245 + potentially have energy-efficiency benefits, but only if the rate of 2246 + non-lazy callbacks decreases significantly for some important workload. 2247 + In the meantime, reserving the bottom bit keeps this option open in case 2248 + it one day becomes useful. 2249 + 2250 + Performance, Scalability, Response Time, and Reliability 2251 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2252 + 2253 + Expanding on the `earlier 2254 + discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by 2255 + hot code paths in performance-critical portions of the Linux kernel's 2256 + networking, security, virtualization, and scheduling code paths. RCU 2257 + must therefore use efficient implementations, especially in its 2258 + read-side primitives. 
To that end, it would be good if preemptible RCU's 2259 + implementation of ``rcu_read_lock()`` could be inlined, however, doing 2260 + this requires resolving ``#include`` issues with the ``task_struct`` 2261 + structure. 2262 + 2263 + The Linux kernel supports hardware configurations with up to 4096 CPUs, 2264 + which means that RCU must be extremely scalable. Algorithms that involve 2265 + frequent acquisitions of global locks or frequent atomic operations on 2266 + global variables simply cannot be tolerated within the RCU 2267 + implementation. RCU therefore makes heavy use of a combining tree based 2268 + on the ``rcu_node`` structure. RCU is required to tolerate all CPUs 2269 + continuously invoking any combination of RCU's runtime primitives with 2270 + minimal per-operation overhead. In fact, in many cases, increasing load 2271 + must *decrease* the per-operation overhead, witness the batching 2272 + optimizations for ``synchronize_rcu()``, ``call_rcu()``, 2273 + ``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general 2274 + rule, RCU must cheerfully accept whatever the rest of the Linux kernel 2275 + decides to throw at it. 2276 + 2277 + The Linux kernel is used for real-time workloads, especially in 2278 + conjunction with the `-rt 2279 + patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The 2280 + real-time-latency response requirements are such that the traditional 2281 + approach of disabling preemption across RCU read-side critical sections 2282 + is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use 2283 + an RCU implementation that allows RCU read-side critical sections to be 2284 + preempted. 
This requirement made its presence known after users made it 2285 + clear that an earlier `real-time 2286 + patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in 2287 + conjunction with some `RCU 2288 + issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__ 2289 + encountered by a very early version of the -rt patchset. 2290 + 2291 + In addition, RCU must make do with a sub-100-microsecond real-time 2292 + latency budget. In fact, on smaller systems with the -rt patchset, the 2293 + Linux kernel provides sub-20-microsecond real-time latencies for the 2294 + whole kernel, including RCU. RCU's scalability and latency must 2295 + therefore be sufficient for these sorts of configurations. To my 2296 + surprise, the sub-100-microsecond real-time latency budget `applies to 2297 + even the largest systems 2298 + [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__, 2299 + up to and including systems with 4096 CPUs. This real-time requirement 2300 + motivated the grace-period kthread, which also simplified handling of a 2301 + number of race conditions. 2302 + 2303 + RCU must avoid degrading real-time response for CPU-bound threads, 2304 + whether executing in usermode (which is one use case for 2305 + ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in 2306 + the kernel must execute ``cond_resched()`` at least once per few tens of 2307 + milliseconds in order to avoid receiving an IPI from RCU. 2308 + 2309 + Finally, RCU's status as a synchronization primitive means that any RCU 2310 + failure can result in arbitrary memory corruption that can be extremely 2311 + difficult to debug. This means that RCU must be extremely reliable, 2312 + which in practice also means that RCU must have an aggressive 2313 + stress-test suite. This stress-test suite is called ``rcutorture``. 
2314 + 2315 + Although the need for ``rcutorture`` was no surprise, the current 2316 + immense popularity of the Linux kernel is posing interesting—and perhaps 2317 + unprecedented—validation challenges. To see this, keep in mind that 2318 + there are well over one billion instances of the Linux kernel running 2319 + today, given Android smartphones, Linux-powered televisions, and 2320 + servers. This number can be expected to increase sharply with the advent 2321 + of the celebrated Internet of Things. 2322 + 2323 + Suppose that RCU contains a race condition that manifests on average 2324 + once per million years of runtime. This bug will be occurring about 2325 + three times per *day* across the installed base. RCU could simply hide 2326 + behind hardware error rates, given that no one should really expect 2327 + their smartphone to last for a million years. However, anyone taking too 2328 + much comfort from this thought should consider the fact that in most 2329 + jurisdictions, a successful multi-year test of a given mechanism, which 2330 + might include a Linux kernel, suffices for a number of types of 2331 + safety-critical certifications. In fact, rumor has it that the Linux 2332 + kernel is already being used in production for safety-critical 2333 + applications. I don't know about you, but I would feel quite bad if a 2334 + bug in RCU killed someone. Which might explain my recent focus on 2335 + validation and verification. 2336 + 2337 + Other RCU Flavors 2338 + ----------------- 2339 + 2340 + One of the more surprising things about RCU is that there are now no 2341 + fewer than five *flavors*, or API families. In addition, the primary 2342 + flavor that has been the sole focus up to this point has two different 2343 + implementations, non-preemptible and preemptible. The other four flavors 2344 + are listed below, with requirements for each described in a separate 2345 + section. 2346 + 2347 + #. 
`Bottom-Half Flavor (Historical) <#Bottom-Half%20Flavor>`__ 2348 + #. `Sched Flavor (Historical) <#Sched%20Flavor>`__ 2349 + #. `Sleepable RCU <#Sleepable%20RCU>`__ 2350 + #. `Tasks RCU <#Tasks%20RCU>`__ 2351 + 2352 + Bottom-Half Flavor (Historical) 2353 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2354 + 2355 + The RCU-bh flavor of RCU has since been expressed in terms of the other 2356 + RCU flavors as part of a consolidation of the three flavors into a 2357 + single flavor. The read-side API remains, and continues to disable 2358 + softirq and to be accounted for by lockdep. Much of the material in this 2359 + section is therefore strictly historical in nature. 2360 + 2361 + The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) 2362 + flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a 2363 + flavor of RCU that could withstand the network-based denial-of-service 2364 + attacks researched by Robert Olsson. These attacks placed so much 2365 + networking load on the system that some of the CPUs never exited softirq 2366 + execution, which in turn prevented those CPUs from ever executing a 2367 + context switch, which, in the RCU implementation of that time, prevented 2368 + grace periods from ever ending. The result was an out-of-memory 2369 + condition and a system hang. 2370 + 2371 + The solution was the creation of RCU-bh, which does 2372 + ``local_bh_disable()`` across its read-side critical sections, and which 2373 + uses the transition from one type of softirq processing to another as a 2374 + quiescent state in addition to context switch, idle, user mode, and 2375 + offline. This means that RCU-bh grace periods can complete even when 2376 + some of the CPUs execute in softirq indefinitely, thus allowing 2377 + algorithms based on RCU-bh to withstand network-based denial-of-service 2378 + attacks. 
2379 + 2380 + Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and 2381 + re-enable softirq handlers, any attempt to start a softirq handler 2382 + during the RCU-bh read-side critical section will be deferred. In this 2383 + case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can 2384 + take considerable time. One can of course argue that this softirq 2385 + overhead should be associated with the code following the RCU-bh 2386 + read-side critical section rather than ``rcu_read_unlock_bh()``, but the 2387 + fact is that most profiling tools cannot be expected to make this sort 2388 + of fine distinction. For example, suppose that a three-millisecond-long 2389 + RCU-bh read-side critical section executes during a time of heavy 2390 + networking load. There will very likely be an attempt to invoke at least 2391 + one softirq handler during that three milliseconds, but any such 2392 + invocation will be delayed until the time of the 2393 + ``rcu_read_unlock_bh()``. This can of course make it appear at first 2394 + glance as if ``rcu_read_unlock_bh()`` was executing very slowly. 2395 + 2396 + The `RCU-bh 2397 + API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ 2398 + includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``, 2399 + ``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``, 2400 + ``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``, 2401 + ``call_rcu_bh()``, ``rcu_barrier_bh()``, and 2402 + ``rcu_read_lock_bh_held()``. However, the update-side APIs are now 2403 + simple wrappers for other RCU flavors, namely RCU-sched in 2404 + CONFIG_PREEMPT=n kernels and RCU-preempt otherwise. 2405 + 2406 + Sched Flavor (Historical) 2407 + ~~~~~~~~~~~~~~~~~~~~~~~~~ 2408 + 2409 + The RCU-sched flavor of RCU has since been expressed in terms of the 2410 + other RCU flavors as part of a consolidation of the three flavors into a 2411 + single flavor.
The read-side API remains, and continues to disable 2412 + preemption and to be accounted for by lockdep. Much of the material in 2413 + this section is therefore strictly historical in nature. 2414 + 2415 + Before preemptible RCU, waiting for an RCU grace period had the side 2416 + effect of also waiting for all pre-existing interrupt and NMI handlers. 2417 + However, there are legitimate preemptible-RCU implementations that do 2418 + not have this property, given that any point in the code outside of an 2419 + RCU read-side critical section can be a quiescent state. Therefore, 2420 + *RCU-sched* was created, which follows “classic” RCU in that an 2421 + RCU-sched grace period waits for pre-existing interrupt and NMI 2422 + handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and 2423 + RCU-sched APIs have identical implementations, while kernels built with 2424 + ``CONFIG_PREEMPT=y`` provide a separate implementation for each. 2425 + 2426 + Note well that in ``CONFIG_PREEMPT=y`` kernels, 2427 + ``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and 2428 + re-enable preemption, respectively. This means that if there was a 2429 + preemption attempt during the RCU-sched read-side critical section, 2430 + ``rcu_read_unlock_sched()`` will enter the scheduler, with all the 2431 + latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``, 2432 + this can make it look as if ``rcu_read_unlock_sched()`` was executing 2433 + very slowly. However, the highest-priority task won't be preempted, so 2434 + that task will enjoy low-overhead ``rcu_read_unlock_sched()`` 2435 + invocations.
2436 + 2437 + The `RCU-sched 2438 + API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ 2439 + includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``, 2440 + ``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``, 2441 + ``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, 2442 + ``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``, 2443 + ``call_rcu_sched()``, ``rcu_barrier_sched()``, and 2444 + ``rcu_read_lock_sched_held()``. However, anything that disables 2445 + preemption also marks an RCU-sched read-side critical section, including 2446 + ``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and 2447 + ``local_irq_restore()``, and so on. 2448 + 2449 + Sleepable RCU 2450 + ~~~~~~~~~~~~~ 2451 + 2452 + For well over a decade, someone saying “I need to block within an RCU 2453 + read-side critical section” was a reliable indication that this someone 2454 + did not understand RCU. After all, if you are always blocking in an RCU 2455 + read-side critical section, you can probably afford to use a 2456 + higher-overhead synchronization mechanism. However, that changed with 2457 + the advent of the Linux kernel's notifiers, whose RCU read-side critical 2458 + sections almost never sleep, but sometimes need to. This resulted in the 2459 + introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or 2460 + *SRCU*. 2461 + 2462 + SRCU allows different domains to be defined, with each such domain 2463 + defined by an instance of an ``srcu_struct`` structure. A pointer to 2464 + this structure must be passed in to each SRCU function, for example, 2465 + ``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` 2466 + structure. The key benefit of these domains is that a slow SRCU reader 2467 + in one domain does not delay an SRCU grace period in some other domain. 
2468 + That said, one consequence of these domains is that read-side code must 2469 + pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for 2470 + example, as follows: 2471 + 2472 + :: 2473 + 2474 + 1 int idx; 2475 + 2 2476 + 3 idx = srcu_read_lock(&ss); 2477 + 4 do_something(); 2478 + 5 srcu_read_unlock(&ss, idx); 2479 + 2480 + As noted above, it is legal to block within SRCU read-side critical 2481 + sections; however, with great power comes great responsibility. If you 2482 + block forever in one of a given domain's SRCU read-side critical 2483 + sections, then that domain's grace periods will also be blocked forever. 2484 + Of course, one good way to block forever is to deadlock, which can 2485 + happen if any operation in a given domain's SRCU read-side critical 2486 + section can wait, either directly or indirectly, for that domain's grace 2487 + period to elapse. For example, this results in a self-deadlock: 2488 + 2489 + :: 2490 + 2491 + 1 int idx; 2492 + 2 2493 + 3 idx = srcu_read_lock(&ss); 2494 + 4 do_something(); 2495 + 5 synchronize_srcu(&ss); 2496 + 6 srcu_read_unlock(&ss, idx); 2497 + 2498 + However, if line 5 acquired a mutex that was held across a 2499 + ``synchronize_srcu()`` for domain ``ss``, deadlock would still be 2500 + possible. Furthermore, if line 5 acquired a mutex that was held across a 2501 + ``synchronize_srcu()`` for some other domain ``ss1``, and if an 2502 + ``ss1``-domain SRCU read-side critical section acquired another mutex 2503 + that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock 2504 + would again be possible. Such a deadlock cycle could extend across an 2505 + arbitrarily large number of different SRCU domains. Again, with great 2506 + power comes great responsibility. 2507 + 2508 + Unlike the other RCU flavors, SRCU read-side critical sections can run 2509 + on idle and even offline CPUs.
This ability requires that 2510 + ``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers, 2511 + which means that SRCU readers will run a bit slower than would RCU 2512 + readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API, 2513 + which, in combination with ``srcu_read_unlock()``, guarantees a full 2514 + memory barrier. 2515 + 2516 + Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be 2517 + invoked from CPU-hotplug notifiers, due to the fact that SRCU grace 2518 + periods make use of timers and the possibility of timers being 2519 + temporarily “stranded” on the outgoing CPU. This stranding of timers 2520 + means that timers posted to the outgoing CPU will not fire until late in 2521 + the CPU-hotplug process. The problem is that if a notifier is waiting on 2522 + an SRCU grace period, that grace period is waiting on a timer, and that 2523 + timer is stranded on the outgoing CPU, then the notifier will never be 2524 + awakened, in other words, deadlock has occurred. This same situation of 2525 + course also prohibits ``srcu_barrier()`` from being invoked from 2526 + CPU-hotplug notifiers. 2527 + 2528 + SRCU also differs from other RCU flavors in that SRCU's expedited and 2529 + non-expedited grace periods are implemented by the same mechanism. This 2530 + means that in the current SRCU implementation, expediting a future grace 2531 + period has the side effect of expediting all prior grace periods that 2532 + have not yet completed. (But please note that this is a property of the 2533 + current implementation, not necessarily of future implementations.) In 2534 + addition, if SRCU has been idle for longer than the interval specified 2535 + by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds 2536 + by default), and if a ``synchronize_srcu()`` invocation ends this idle 2537 + period, that invocation will be automatically expedited. 
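The “cookie” that read-side code passes from ``srcu_read_lock()`` to ``srcu_read_unlock()``, described earlier in this section, can be illustrated with a small userspace sketch. This is a toy, single-threaded model and not the kernel's implementation: the ``toy_srcu`` structure and every ``toy_``-prefixed function are hypothetical names invented for this illustration, and the model assumes that a domain keeps one reader count per index and that starting a grace period flips the index handed to new readers. Under those assumptions, the cookie is what lets a reader decrement the same count it incremented, even if the index flipped in the meantime.

```c
#include <stdbool.h>

/* Hypothetical toy model of one SRCU domain; NOT the kernel's srcu_struct. */
struct toy_srcu {
	unsigned long flips;	/* number of index flips so far */
	long readers[2];	/* active readers counted under each index */
};

/* Enter a read-side critical section; the returned index is the "cookie". */
static int toy_srcu_read_lock(struct toy_srcu *sp)
{
	int idx = (int)(sp->flips & 1);

	sp->readers[idx]++;
	return idx;
}

/* Leave a read-side critical section, decrementing the count named by the cookie. */
static void toy_srcu_read_unlock(struct toy_srcu *sp, int idx)
{
	sp->readers[idx]--;
}

/* Begin a toy grace period: later readers use the other index. Returns the old index. */
static int toy_srcu_flip(struct toy_srcu *sp)
{
	int old = (int)(sp->flips & 1);

	sp->flips++;
	return old;
}

/* The toy grace period may end once no reader remains under the old index. */
static bool toy_srcu_old_readers_done(const struct toy_srcu *sp, int old)
{
	return sp->readers[old] == 0;
}
```

In this model, a reader that entered before ``toy_srcu_flip()`` holds up only its own domain's toy grace period, and a reader that enters afterwards is counted under the other index and does not delay it. A second ``toy_srcu`` instance has its own counts, which mirrors the point that a slow reader in one domain does not delay grace periods in another.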
2538 + 2539 + As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a 2540 + locking bottleneck present in prior kernel versions. Although this will 2541 + allow users to put much heavier stress on ``call_srcu()``, it is 2542 + important to note that SRCU does not yet take any special steps to deal 2543 + with callback flooding. So if you are posting (say) 10,000 SRCU 2544 + callbacks per second per CPU, you are probably totally OK, but if you 2545 + intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please 2546 + run some tests first. SRCU just might need a few adjustments to deal with 2547 + that sort of load. Of course, your mileage may vary based on the speed 2548 + of your CPUs and the size of your memory. 2549 + 2550 + The `SRCU 2551 + API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ 2552 + includes ``srcu_read_lock()``, ``srcu_read_unlock()``, 2553 + ``srcu_dereference()``, ``srcu_dereference_check()``, 2554 + ``synchronize_srcu()``, ``synchronize_srcu_expedited()``, 2555 + ``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It 2556 + also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and 2557 + ``init_srcu_struct()`` APIs for defining and initializing 2558 + ``srcu_struct`` structures. 2559 + 2560 + Tasks RCU 2561 + ~~~~~~~~~ 2562 + 2563 + Some forms of tracing use “trampolines” to handle the binary rewriting 2564 + required to install different types of probes. It would be good to be 2565 + able to free old trampolines, which sounds like a job for some form of 2566 + RCU. However, because it is necessary to be able to install a trace 2567 + anywhere in the code, it is not possible to use read-side markers such 2568 + as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does 2569 + not work to have these markers in the trampoline itself, because there 2570 + would need to be instructions following ``rcu_read_unlock()``.
Although 2571 + ``synchronize_rcu()`` would guarantee that execution reached the 2572 + ``rcu_read_unlock()``, it would not be able to guarantee that execution 2573 + had completely left the trampoline. 2574 + 2575 + The solution, in the form of `Tasks 2576 + RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side 2577 + critical sections that are delimited by voluntary context switches, that 2578 + is, calls to ``schedule()``, ``cond_resched()``, and 2579 + ``synchronize_rcu_tasks()``. In addition, transitions to and from 2580 + userspace execution also delimit tasks-RCU read-side critical sections. 2581 + 2582 + The tasks-RCU API is quite compact, consisting only of 2583 + ``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and 2584 + ``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines 2585 + cannot be preempted, so these APIs map to ``call_rcu()``, 2586 + ``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In 2587 + ``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these 2588 + three APIs are therefore implemented by separate functions that check 2589 + for voluntary context switches. 2590 + 2591 + Possible Future Changes 2592 + ----------------------- 2593 + 2594 + One of the tricks that RCU uses to attain update-side scalability is to 2595 + increase grace-period latency with increasing numbers of CPUs. If this 2596 + becomes a serious problem, it will be necessary to rework the 2597 + grace-period state machine so as to avoid the need for the additional 2598 + latency. 2599 + 2600 + RCU disables CPU hotplug in a few places, perhaps most notably in the 2601 + ``rcu_barrier()`` operations. If there is a strong reason to use 2602 + ``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to 2603 + avoid disabling CPU hotplug. This would introduce some complexity, so 2604 + there had better be a *very* good reason. 
2605 + 2606 + The tradeoff between grace-period latency on the one hand and 2607 + interruptions of other CPUs on the other hand may need to be 2608 + re-examined. The desire is of course for zero grace-period latency as 2609 + well as zero interprocessor interrupts undertaken during an expedited 2610 + grace period operation. While this ideal is unlikely to be achievable, 2611 + it is quite possible that further improvements can be made. 2612 + 2613 + The multiprocessor implementations of RCU use a combining tree that 2614 + groups CPUs so as to reduce lock contention and increase cache locality. 2615 + However, this combining tree does not spread its memory across NUMA 2616 + nodes nor does it align the CPU groups with hardware features such as 2617 + sockets or cores. Such spreading and alignment is currently believed to 2618 + be unnecessary because the hotpath read-side primitives do not access 2619 + the combining tree, nor does ``call_rcu()`` in the common case. If you 2620 + believe that your architecture needs such spreading and alignment, then 2621 + your architecture should also benefit from the 2622 + ``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the 2623 + number of CPUs in a socket, NUMA node, or whatever. If the number of 2624 + CPUs is too large, use a fraction of the number of CPUs. If the number 2625 + of CPUs is a large prime number, well, that certainly is an 2626 + “interesting” architectural choice! More flexible arrangements might be 2627 + considered, but only if ``rcutree.rcu_fanout_leaf`` has proven 2628 + inadequate, and only if the inadequacy has been demonstrated by a 2629 + carefully run and realistic system-level workload. 2630 + 2631 + Please note that arrangements that require RCU to remap CPU numbers will 2632 + require extremely good demonstration of need and full exploration of 2633 + alternatives. 2634 + 2635 + RCU's various kthreads are reasonably recent additions. 
It is quite 2636 + likely that adjustments will be required to more gracefully handle 2637 + extreme loads. It might also be necessary to be able to relate CPU 2638 + utilization by RCU's kthreads and softirq handlers to the code that 2639 + instigated this CPU utilization. For example, RCU callback overhead 2640 + might be charged back to the originating ``call_rcu()`` instance, though 2641 + probably not in production kernels. 2642 + 2643 + Additional work may be required to provide reasonable forward-progress 2644 + guarantees under heavy load for grace periods and for callback 2645 + invocation. 2646 + 2647 + Summary 2648 + ------- 2649 + 2650 + This document has presented more than two decades' worth of RCU 2651 + requirements. Given that the requirements keep changing, this will not 2652 + be the last word on this subject, but at least it serves to get an 2653 + important subset of the requirements set forth. 2654 + 2655 + Acknowledgments 2656 + --------------- 2657 + 2658 + I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg 2659 + Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy 2660 + Lutomirski for their help in rendering this article human readable, and 2661 + to Michelle Rankin for her support of this effort. Other contributions 2662 + are acknowledged in the Linux kernel's git archive.

+5

Documentation/RCU/index.rst

··· 11 11 listRCU 12 12 UP 13 13 14 + Design/Memory-Ordering/Tree-RCU-Memory-Ordering 15 + Design/Expedited-Grace-Periods/Expedited-Grace-Periods 16 + Design/Requirements/Requirements 17 + Design/Data-Structures/Data-Structures 18 + 14 19 .. only:: subproject and html 15 20 16 21 Indices

+2 -2

Documentation/RCU/whatisRCU.txt

··· 302 302 must prohibit. The rcu_dereference_protected() variant takes 303 303 a lockdep expression to indicate which locks must be acquired 304 304 by the caller. If the indicated protection is not provided, 305 - a lockdep splat is emitted. See RCU/Design/Requirements/Requirements.html 305 + a lockdep splat is emitted. See Documentation/RCU/Design/Requirements/Requirements.rst 306 306 and the API's code comments for more details and example usage. 307 307 308 308 The following diagram shows how each API communicates among the ··· 630 630 promotes synchronize_rcu() to a full memory barrier in compliance with 631 631 the "Memory-Barrier Guarantees" listed in: 632 632 633 - Documentation/RCU/Design/Requirements/Requirements.html. 633 + Documentation/RCU/Design/Requirements/Requirements.rst 634 634 635 635 It is possible to nest rcu_read_lock(), since reader-writer locks may 636 636 be recursively acquired. Note also that rcu_read_lock() is immune