.. _whatisrcu_doc:

What is RCU? -- "Read, Copy, Update"
======================================

Please note that the "What is RCU?" LWN series is an excellent place
to start learning about RCU:

| 1. What is RCU, Fundamentally? https://lwn.net/Articles/262464/
| 2. What is RCU? Part 2: Usage https://lwn.net/Articles/263130/
| 3. RCU part 3: the RCU API https://lwn.net/Articles/264090/
| 4. The RCU API, 2010 Edition https://lwn.net/Articles/418853/
|    2010 Big API Table https://lwn.net/Articles/419086/
| 5. The RCU API, 2014 Edition https://lwn.net/Articles/609904/
|    2014 Big API Table https://lwn.net/Articles/609973/
| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/
|    2019 Big API Table https://lwn.net/Articles/777165/
| 7. The RCU API, 2024 Edition https://lwn.net/Articles/988638/
|    2024 Background Information https://lwn.net/Articles/988641/
|    2024 Big API Table https://lwn.net/Articles/988666/

For those preferring video:

| 1. Unraveling RCU Mysteries: Fundamentals https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries
| 2. Unraveling RCU Mysteries: Additional Use Cases https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases


What is RCU?

RCU is a synchronization mechanism, added to the Linux kernel during
the 2.5 development effort, that is optimized for read-mostly
situations. Although RCU is actually quite simple, making effective use
of it requires you to think differently about your code. Another part
of the problem is the mistaken assumption that there is "one true way" to
describe and to use RCU. Instead, the experience has been that different
people must take different paths to arrive at an understanding of RCU,
depending on their experiences and use cases. This document provides
several different paths, as follows:

:ref:`1. RCU OVERVIEW <1_whatisRCU>`

:ref:`2. WHAT IS RCU'S CORE API? <2_whatisRCU>`

:ref:`3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? <3_whatisRCU>`

:ref:`4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? <4_whatisRCU>`

:ref:`5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? <5_whatisRCU>`

:ref:`6. ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>`

:ref:`7. ANALOGY WITH REFERENCE COUNTING <7_whatisRCU>`

:ref:`8. FULL LIST OF RCU APIs <8_whatisRCU>`

:ref:`9. ANSWERS TO QUICK QUIZZES <9_whatisRCU>`

People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
some point. People who prefer to start with an API that they can then
experiment with should focus on Section 2. People who prefer to start
with example uses should focus on Sections 3 and 4. People who need to
understand the RCU implementation should focus on Section 5, then dive
into the kernel source code. People who reason best by analogy should
focus on Sections 6 and 7. Section 8 serves as an index to the docbook
API documentation, and Section 9 is the traditional answer key.

So, start with the section that makes the most sense to you and your
preferred method of learning. If you need to know everything about
everything, feel free to read the whole thing -- but if you are really
that type of person, you have perused the source code and will therefore
never need this document anyway. ;-)

.. _1_whatisRCU:

1. RCU OVERVIEW
----------------

The basic idea behind RCU is to split updates into "removal" and
"reclamation" phases. The removal phase removes references to data items
within a data structure (possibly by replacing them with references to
new versions of these data items), and can run concurrently with readers.
The reason that it is safe to run the removal phase concurrently with
readers is that the semantics of modern CPUs guarantee that readers will
see either the old or the new version of the data structure rather than
a partially updated reference. The reclamation phase does the work of
reclaiming (e.g., freeing) the data items removed from the data structure
during the removal phase. Because reclaiming data items can disrupt any
readers concurrently referencing those data items, the reclamation phase
must not start until readers no longer hold references to those data items.

Splitting the update into removal and reclamation phases permits the
updater to perform the removal phase immediately, and to defer the
reclamation phase until all readers active during the removal phase have
completed, either by blocking until they finish or by registering a
callback that is invoked after they finish. Only readers that are active
during the removal phase need be considered, because any reader starting
after the removal phase will be unable to gain a reference to the removed
data items, and therefore cannot be disrupted by the reclamation phase.

So the typical RCU update sequence goes something like the following:

a. Remove pointers to a data structure, so that subsequent
   readers cannot gain a reference to it.

b. Wait for all previous readers to complete their RCU read-side
   critical sections.

c. At this point, there cannot be any readers who hold references
   to the data structure, so it now may safely be reclaimed
   (e.g., kfree()d).

Step (b) above is the key idea underlying RCU's deferred destruction.
The ability to wait until all readers are done allows RCU readers to
use much lighter-weight synchronization, in some cases, absolutely no
synchronization at all. In contrast, in more conventional lock-based
schemes, readers must use heavy-weight synchronization in order to
prevent an updater from deleting the data structure out from under them.
This is because lock-based updaters typically update data items in place,
and must therefore exclude readers. In contrast, RCU-based updaters
typically take advantage of the fact that writes to single aligned
pointers are atomic on modern CPUs, allowing atomic insertion, removal,
and replacement of data items in a linked structure without disrupting
readers. Concurrent RCU readers can then continue accessing the old
versions, and can dispense with the atomic operations, memory barriers,
and communications cache misses that are so expensive on present-day
SMP computer systems, even in the absence of lock contention.

In the three-step procedure shown above, the updater is performing both
the removal and the reclamation step, but it is often helpful for an
entirely different thread to do the reclamation, as is in fact the case
in the Linux kernel's directory-entry cache (dcache). Even if the same
thread performs both the update step (step (a) above) and the reclamation
step (step (c) above), it is often helpful to think of them separately.
For example, RCU readers and updaters need not communicate at all,
but RCU provides implicit low-overhead communication between readers
and reclaimers, namely, in step (b) above.

So how the heck can a reclaimer tell when a reader is done, given
that readers are not doing any sort of synchronization operations???
Read on to learn about how RCU's API makes this easy.

.. _2_whatisRCU:

2. WHAT IS RCU'S CORE API?
---------------------------

The core RCU API is quite small:

a. rcu_read_lock()
b. rcu_read_unlock()
c. synchronize_rcu() / call_rcu()
d. rcu_assign_pointer()
e. rcu_dereference()

There are many other members of the RCU API, but the rest can be
expressed in terms of these five, though most implementations instead
express synchronize_rcu() in terms of the call_rcu() callback API.

The five core RCU APIs are described below; the other 18 will be
enumerated later. See the kernel docbook documentation for more info,
or look directly at the function header comments.

rcu_read_lock()
^^^^^^^^^^^^^^^
	void rcu_read_lock(void);

	This temporal primitive is used by a reader to inform the
	reclaimer that the reader is entering an RCU read-side critical
	section. It is illegal to block while in an RCU read-side
	critical section, though kernels built with CONFIG_PREEMPT_RCU
	can preempt RCU read-side critical sections. Any RCU-protected
	data structure accessed during an RCU read-side critical section
	is guaranteed to remain unreclaimed for the full duration of that
	critical section. Reference counts may be used in conjunction
	with RCU to maintain longer-term references to data structures.

	Note that anything that disables bottom halves, preemption,
	or interrupts also enters an RCU read-side critical section.
	Acquiring a spinlock also enters an RCU read-side critical
	section, even for spinlocks that do not disable preemption,
	as is the case in kernels built with CONFIG_PREEMPT_RT=y.
	Sleeplocks do *not* enter RCU read-side critical sections.

rcu_read_unlock()
^^^^^^^^^^^^^^^^^
	void rcu_read_unlock(void);

	This temporal primitive is used by a reader to inform the
	reclaimer that the reader is exiting an RCU read-side critical
	section. Anything that enables bottom halves, preemption,
	or interrupts also exits an RCU read-side critical section.
	Releasing a spinlock also exits an RCU read-side critical section.

	Note that RCU read-side critical sections may be nested and/or
	overlapping.

synchronize_rcu()
^^^^^^^^^^^^^^^^^
	void synchronize_rcu(void);

	This temporal primitive marks the end of updater code and the
	beginning of reclaimer code. It does this by blocking until
	all pre-existing RCU read-side critical sections on all CPUs
	have completed. Note that synchronize_rcu() will **not**
	necessarily wait for any subsequent RCU read-side critical
	sections to complete. For example, consider the following
	sequence of events::

	         CPU 0                 CPU 1                     CPU 2
	     -----------------     -------------------------  ---------------
	     1.  rcu_read_lock()
	     2.                    enters synchronize_rcu()
	     3.                                                rcu_read_lock()
	     4.  rcu_read_unlock()
	     5.                    exits synchronize_rcu()
	     6.                                                rcu_read_unlock()

	To reiterate, synchronize_rcu() waits only for ongoing RCU
	read-side critical sections to complete, not necessarily for
	any that begin after synchronize_rcu() is invoked.

	Of course, synchronize_rcu() does not necessarily return
	**immediately** after the last pre-existing RCU read-side critical
	section completes. For one thing, there might well be scheduling
	delays. For another thing, many RCU implementations process
	requests in batches in order to improve efficiencies, which can
	further delay synchronize_rcu().

	Since synchronize_rcu() is the API that must figure out when
	readers are done, its implementation is key to RCU. For RCU
	to be useful in all but the most read-intensive situations,
	synchronize_rcu()'s overhead must also be quite small.

	The call_rcu() API is an asynchronous callback form of
	synchronize_rcu(), and is described in more detail in a later
	section. Instead of blocking, it registers a function and
	argument which are invoked after all ongoing RCU read-side
	critical sections have completed. This callback variant is
	particularly useful in situations where it is illegal to block
	or where update-side performance is critically important.

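The relationship runs in the other direction as well: a blocking
grace-period wait can be layered on top of the callback API, which is
roughly how implementations express synchronize_rcu() in terms of
call_rcu(). The userspace sketch below shows only that layering; every
name carrying a _toy suffix is a hypothetical stand-in, and
call_rcu_toy() invokes its callback immediately, which is valid here
only because this sketch has no readers to wait for:

```c
/* Pared-down stand-in for the kernel's struct rcu_head. */
struct rcu_head_toy {
	void (*func)(struct rcu_head_toy *head);
};

static int toy_grace_periods;

/* A real call_rcu() queues the callback and invokes it only after a
 * grace period elapses; this stand-in invokes it immediately, since
 * there are no readers here to wait for. */
static void call_rcu_toy(struct rcu_head_toy *head,
			 void (*func)(struct rcu_head_toy *head))
{
	head->func = func;
	head->func(head);	/* pretend the grace period has elapsed */
}

/* Blocking wait layered on the callback API: register a callback whose
 * only job is to signal completion, then wait for that signal. */
struct rcu_synchronize_toy {
	struct rcu_head_toy head;	/* must be first: see wakeme_toy() */
	volatile int done;
};

static void wakeme_toy(struct rcu_head_toy *head)
{
	struct rcu_synchronize_toy *s = (struct rcu_synchronize_toy *)head;

	s->done = 1;
	toy_grace_periods++;
}

static void synchronize_rcu_toy(void)
{
	struct rcu_synchronize_toy s = { .done = 0 };

	call_rcu_toy(&s.head, wakeme_toy);
	while (!s.done)
		;	/* a real version would sleep on a completion */
}
```

The cast in wakeme_toy() is legal because the rcu_head_toy is the first
member of the enclosing structure; a less fragile version would use
container_of(), described in Section 4.
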
	However, the call_rcu() API should not be used lightly, as use
	of the synchronize_rcu() API generally results in simpler code.
	In addition, the synchronize_rcu() API has the nice property
	of automatically limiting update rate should grace periods
	be delayed. This property results in system resilience in the
	face of denial-of-service attacks. Code using call_rcu() should
	limit update rate in order to gain this same sort of resilience.
	See checklist.rst for some approaches to limiting the update rate.

rcu_assign_pointer()
^^^^^^^^^^^^^^^^^^^^
	void rcu_assign_pointer(p, typeof(p) v);

	Yes, rcu_assign_pointer() **is** implemented as a macro, though
	it would be cool to be able to declare a function in this manner.
	(And there has been some discussion of adding overloaded functions
	to the C language, so who knows?)

	The updater uses this spatial macro to assign a new value to an
	RCU-protected pointer, in order to safely communicate the change
	in value from the updater to the reader. This is a spatial (as
	opposed to temporal) macro. It does not evaluate to an rvalue,
	but it does provide any compiler directives and memory-barrier
	instructions required for a given compiler or CPU architecture.
	Its ordering properties are that of a store-release operation,
	that is, any prior loads and stores required to initialize the
	structure are ordered before the store that publishes the pointer
	to that structure.

	Perhaps just as important, rcu_assign_pointer() serves to document
	(1) which pointers are protected by RCU and (2) the point at which
	a given structure becomes accessible to other CPUs. That said,
	rcu_assign_pointer() is most frequently used indirectly, via
	the _rcu list-manipulation primitives such as list_add_rcu().

rcu_dereference()
^^^^^^^^^^^^^^^^^
	typeof(p) rcu_dereference(p);

	Like rcu_assign_pointer(), rcu_dereference() must be implemented
	as a macro.

	The reader uses the spatial rcu_dereference() macro to fetch
	an RCU-protected pointer, which returns a value that may
	then be safely dereferenced. Note that rcu_dereference()
	does not actually dereference the pointer; instead, it
	protects the pointer for later dereferencing. It also
	executes any needed memory-barrier instructions for a given
	CPU architecture. Currently, only Alpha needs memory barriers
	within rcu_dereference() -- on other CPUs, it compiles to a
	volatile load. However, no mainstream C compilers respect
	address dependencies, so rcu_dereference() uses volatile casts,
	which, in combination with the coding guidelines listed in
	rcu_dereference.rst, prevent current compilers from breaking
	these dependencies.

	Common coding practice uses rcu_dereference() to copy an
	RCU-protected pointer to a local variable, then dereferences
	this local variable, for example as follows::

		p = rcu_dereference(head.next);
		return p->data;

	However, in this case, one could just as easily combine these
	into one statement::

		return rcu_dereference(head.next)->data;

	If you are going to be fetching multiple fields from the
	RCU-protected structure, using the local variable is of
	course preferred. Repeated rcu_dereference() calls look
	ugly, do not guarantee that the same pointer will be returned
	if an update happened while in the critical section, and incur
	unnecessary overhead on Alpha CPUs.

	Note that the value returned by rcu_dereference() is valid
	only within the enclosing RCU read-side critical section [1]_.
	For example, the following is **not** legal::

		rcu_read_lock();
		p = rcu_dereference(head.next);
		rcu_read_unlock();
		x = p->address;	/* BUG!!! */
		rcu_read_lock();
		y = p->data;	/* BUG!!! */
		rcu_read_unlock();

	Holding a reference from one RCU read-side critical section
	to another is just as illegal as holding a reference from
	one lock-based critical section to another! Similarly,
	using a reference outside of the critical section in which
	it was acquired is just as illegal as doing so with normal
	locking.

	As with rcu_assign_pointer(), an important function of
	rcu_dereference() is to document which pointers are protected by
	RCU, in particular, flagging a pointer that is subject to changing
	at any time, including immediately after the rcu_dereference().
	And, again like rcu_assign_pointer(), rcu_dereference() is
	typically used indirectly, via the _rcu list-manipulation
	primitives, such as list_for_each_entry_rcu() [2]_.

.. [1] The variant rcu_dereference_protected() can be used outside
       of an RCU read-side critical section as long as the usage is
       protected by locks acquired by the update-side code. This variant
       avoids the lockdep warning that would happen when using (for
       example) rcu_dereference() without rcu_read_lock() protection.
       Using rcu_dereference_protected() also has the advantage
       of permitting compiler optimizations that rcu_dereference()
       must prohibit. The rcu_dereference_protected() variant takes
       a lockdep expression to indicate which locks must be acquired
       by the caller. If the indicated protection is not provided,
       a lockdep splat is emitted. See Design/Requirements/Requirements.rst
       and the API's code comments for more details and example usage.

.. [2] If the list_for_each_entry_rcu() instance might be used by
       update-side code as well as by RCU readers, then an additional
       lockdep expression can be added to its list of arguments.
       For example, given an additional "lockdep_is_held(&mylock)"
       argument, the RCU lockdep code would complain only if this
       instance was invoked outside of an RCU read-side critical
       section and without the protection of mylock.

The following diagram shows how each API communicates among the
reader, updater, and reclaimer.
::


	     rcu_assign_pointer()
	                             +--------+
	     +---------------------->| reader |---------+
	     |                       +--------+         |
	     |                            |             |
	     |                            |             | Protect:
	     |                            |             |   rcu_read_lock()
	     |                            |             |   rcu_read_unlock()
	     |      rcu_dereference()     |             |
	+---------+                       |             |
	| updater |<----------------------+             |
	+---------+                                     V
	     |                                    +-----------+
	     +----------------------------------->| reclaimer |
	                                          +-----------+
	                                            Defer:
	                                            synchronize_rcu() & call_rcu()


The RCU infrastructure observes the temporal sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.
The rcu_assign_pointer() and rcu_dereference() invocations communicate
spatial changes via stores to and loads from the RCU-protected pointer in
question.

There are at least three flavors of RCU usage in the Linux kernel. The
diagram above shows the most common one. On the updater side, the
rcu_assign_pointer(), synchronize_rcu() and call_rcu() primitives used
are the same for all three flavors. However, for protection (on the
reader side), the primitives used vary depending on the flavor:

a. rcu_read_lock() / rcu_read_unlock()
   rcu_dereference()

b. rcu_read_lock_bh() / rcu_read_unlock_bh()
   local_bh_disable() / local_bh_enable()
   rcu_dereference_bh()

c. rcu_read_lock_sched() / rcu_read_unlock_sched()
   preempt_disable() / preempt_enable()
   local_irq_save() / local_irq_restore()
   hardirq enter / hardirq exit
   NMI enter / NMI exit
   rcu_dereference_sched()

These three flavors are used as follows:

a. RCU applied to normal data structures.

b. RCU applied to networking data structures that may be subjected
   to remote denial-of-service attacks.

c. RCU applied to scheduler and interrupt/NMI-handler tasks.

Again, most uses will be of (a). The (b) and (c) cases are important
for specialized uses, but are relatively uncommon. The SRCU, RCU-Tasks,
RCU-Tasks-Rude, and RCU-Tasks-Trace flavors have similar relationships
among their assorted primitives.

.. _3_whatisRCU:

3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
-----------------------------------------------

This section shows a simple use of the core RCU API to protect a
global pointer to a dynamically allocated structure. More-typical
uses of RCU may be found in listRCU.rst and NMI-RCU.rst.
::

	struct foo {
		int a;
		char b;
		long c;
	};
	DEFINE_SPINLOCK(foo_mutex);

	struct foo __rcu *gbl_foo;

	/*
	 * Create a new struct foo that is the same as the one currently
	 * pointed to by gbl_foo, except that field "a" is replaced
	 * with "new_a". Points gbl_foo to the new structure, and
	 * frees up the old structure after a grace period.
	 *
	 * Uses rcu_assign_pointer() to ensure that concurrent readers
	 * see the initialized version of the new structure.
	 *
	 * Uses synchronize_rcu() to ensure that any readers that might
	 * have references to the old structure complete before freeing
	 * the old structure.
	 */
	void foo_update_a(int new_a)
	{
		struct foo *new_fp;
		struct foo *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		spin_lock(&foo_mutex);
		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
		*new_fp = *old_fp;
		new_fp->a = new_a;
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_mutex);
		synchronize_rcu();
		kfree(old_fp);
	}

	/*
	 * Return the value of field "a" of the current gbl_foo
	 * structure. Use rcu_read_lock() and rcu_read_unlock()
	 * to ensure that the structure does not get deleted out
	 * from under us, and use rcu_dereference() to ensure that
	 * we see the initialized version of the structure (important
	 * for DEC Alpha and for people reading the code).
	 */
	int foo_get_a(void)
	{
		int retval;

		rcu_read_lock();
		retval = rcu_dereference(gbl_foo)->a;
		rcu_read_unlock();
		return retval;
	}

So, to sum up:

- Use rcu_read_lock() and rcu_read_unlock() to guard RCU
  read-side critical sections.

- Within an RCU read-side critical section, use rcu_dereference()
  to dereference RCU-protected pointers.

- Use some solid design (such as locks or semaphores) to
  keep concurrent updates from interfering with each other.

- Use rcu_assign_pointer() to update an RCU-protected pointer.
  This primitive protects concurrent readers from the updater,
  **not** concurrent updates from each other! You therefore still
  need to use locking (or something similar) to keep concurrent
  rcu_assign_pointer() primitives from interfering with each other.

- Use synchronize_rcu() **after** removing a data element from an
  RCU-protected data structure, but **before** reclaiming/freeing
  the data element, in order to wait for the completion of all
  RCU read-side critical sections that might be referencing that
  data item.

See checklist.rst for additional rules to follow when using RCU.
And again, more-typical uses of RCU may be found in listRCU.rst
and NMI-RCU.rst.

.. _4_whatisRCU:

4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
--------------------------------------------

In the example above, foo_update_a() blocks until a grace period elapses.
This is quite simple, but in some cases one cannot afford to wait so
long -- there might be other high-priority work to be done.

In such cases, one uses call_rcu() rather than synchronize_rcu().
The call_rcu() API is as follows::

	void call_rcu(struct rcu_head *head, rcu_callback_t func);

This function invokes func(head) after a grace period has elapsed.
This invocation might happen from either softirq or process context,
so the function is not permitted to block. The foo struct needs to
have an rcu_head structure added, perhaps as follows::

	struct foo {
		int a;
		char b;
		long c;
		struct rcu_head rcu;
	};

The foo_update_a() function might then be written as follows::

	/*
	 * Create a new struct foo that is the same as the one currently
	 * pointed to by gbl_foo, except that field "a" is replaced
	 * with "new_a". Points gbl_foo to the new structure, and
	 * frees up the old structure after a grace period.
	 *
	 * Uses rcu_assign_pointer() to ensure that concurrent readers
	 * see the initialized version of the new structure.
	 *
	 * Uses call_rcu() to ensure that any readers that might have
	 * references to the old structure complete before freeing the
	 * old structure.
	 */
	void foo_update_a(int new_a)
	{
		struct foo *new_fp;
		struct foo *old_fp;

		new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
		spin_lock(&foo_mutex);
		old_fp = rcu_dereference_protected(gbl_foo, lockdep_is_held(&foo_mutex));
		*new_fp = *old_fp;
		new_fp->a = new_a;
		rcu_assign_pointer(gbl_foo, new_fp);
		spin_unlock(&foo_mutex);
		call_rcu(&old_fp->rcu, foo_reclaim);
	}

The foo_reclaim() function might appear as follows::

	void foo_reclaim(struct rcu_head *rp)
	{
		struct foo *fp = container_of(rp, struct foo, rcu);

		foo_cleanup(fp->a);

		kfree(fp);
	}

The container_of() primitive is a macro that, given a pointer into a
struct, the type of the struct, and the pointed-to field within the
struct, returns a pointer to the beginning of the struct.
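
A userspace rendition makes this pointer arithmetic concrete. The
container_of() below is a pared-down version of the kernel macro (the
kernel's adds type checking), and struct rcu_head here is a local
stand-in defined only so the example compiles on its own:

```c
#include <stddef.h>

/* Pared-down userspace rendition of the kernel's container_of():
 * step back from a pointer to a member to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Local stand-in for the kernel's struct rcu_head. */
struct rcu_head {
	void (*func)(struct rcu_head *head);
};

struct foo {
	int a;
	char b;
	long c;
	struct rcu_head rcu;
};

/* Given only &fp->rcu, recover fp itself, as foo_reclaim() does. */
static struct foo *foo_from_rcu(struct rcu_head *rp)
{
	return container_of(rp, struct foo, rcu);
}
```

Subtracting offsetof(struct foo, rcu) from the rcu_head pointer lands on
the start of the enclosing struct foo, which is why the callback can be
handed an embedded rcu_head yet still kfree() the whole structure.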

The use of call_rcu() permits the caller of foo_update_a() to
immediately regain control, without needing to worry further about the
old version of the newly updated element. It also clearly shows the
RCU distinction between updater, namely foo_update_a(), and reclaimer,
namely foo_reclaim().

The summary of advice is the same as for the previous section, except
that we are now using call_rcu() rather than synchronize_rcu():

- Use call_rcu() **after** removing a data element from an
  RCU-protected data structure in order to register a callback
  function that will be invoked after the completion of all RCU
  read-side critical sections that might be referencing that
  data item.

If the callback for call_rcu() is not doing anything more than calling
kfree() on the structure, you can use kfree_rcu() instead of call_rcu()
to avoid having to write your own callback::

	kfree_rcu(old_fp, rcu);

If the occasional sleep is permitted, the single-argument form may
be used, omitting the rcu_head structure from struct foo::

	kfree_rcu_mightsleep(old_fp);

This variant almost never blocks, but might do so by invoking
synchronize_rcu() in response to memory-allocation failure.

Again, see checklist.rst for additional rules governing the use of RCU.
.. _5_whatisRCU:

5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
------------------------------------------------

One of the nice things about RCU is that it has extremely simple "toy"
implementations that are a good first step towards understanding the
production-quality implementations in the Linux kernel. This section
presents two such "toy" implementations of RCU, one that is implemented
in terms of familiar locking primitives, and another that more closely
resembles "classic" RCU. Both are way too simple for real-world use,
lacking both functionality and performance. However, they are useful
in getting a feel for how RCU works. See kernel/rcu/update.c for a
production-quality implementation, and see:

	https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit

for papers describing the Linux kernel RCU implementation. The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
more details on the current implementation as of early 2004.


5A. "TOY" IMPLEMENTATION #1: LOCKING
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section presents a "toy" RCU implementation that is based on
familiar locking primitives. Its overhead makes it a non-starter for
real-life use, as does its lack of scalability. It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
one read-side critical section to another. It also assumes recursive
reader-writer locks: If you try this with non-recursive locks, and
you allow nested rcu_read_lock() calls, you can deadlock.

However, it is probably the easiest implementation to relate to, so it
is a good starting point.

It is extremely simple::

	static DEFINE_RWLOCK(rcu_gp_mutex);

	void rcu_read_lock(void)
	{
		read_lock(&rcu_gp_mutex);
	}

	void rcu_read_unlock(void)
	{
		read_unlock(&rcu_gp_mutex);
	}

	void synchronize_rcu(void)
	{
		write_lock(&rcu_gp_mutex);
		smp_mb__after_spinlock();
		write_unlock(&rcu_gp_mutex);
	}

[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
much. But here are simplified versions anyway. And whatever you do,
don't forget about them when submitting patches making use of RCU!]::

	#define rcu_assign_pointer(p, v) \
	({ \
		smp_store_release(&(p), (v)); \
	})

	#define rcu_dereference(p) \
	({ \
		typeof(p) _________p1 = READ_ONCE(p); \
		(_________p1); \
	})


The rcu_read_lock() and rcu_read_unlock() primitives read-acquire
and release a global reader-writer lock. The synchronize_rcu()
primitive write-acquires this same lock, then releases it. This means
that once synchronize_rcu() exits, all RCU read-side critical sections
that were in progress before synchronize_rcu() was called are guaranteed
to have completed -- there is no way that synchronize_rcu() would have
been able to write-acquire the lock otherwise. The smp_mb__after_spinlock()
promotes synchronize_rcu() to a full memory barrier in compliance with
the "Memory-Barrier Guarantees" listed in:

	Design/Requirements/Requirements.rst

It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired. Note also that rcu_read_lock() is immune
from deadlock (an important property of RCU). The reason for this is
that the only thing that can block rcu_read_lock() is a synchronize_rcu().
But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
so there can be no deadlock cycle.

.. _quiz_1:

Quick Quiz #1:
	Why is this argument naive? How could a deadlock
	occur when using this algorithm in a real-world Linux
	kernel? How could this deadlock be avoided?

:ref:`Answers to Quick Quiz <9_whatisRCU>`
728
7295B. "TOY" EXAMPLE #2: CLASSIC RCU
730^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
731This section presents a "toy" RCU implementation that is based on
732"classic RCU". It is also short on performance (but only for updates) and
733on features such as hotplug CPU and the ability to run in CONFIG_PREEMPTION
734kernels. The definitions of rcu_dereference() and rcu_assign_pointer()
735are the same as those shown in the preceding section, so they are omitted.
736::
737
738 void rcu_read_lock(void) { }
739
740 void rcu_read_unlock(void) { }
741
742 void synchronize_rcu(void)
743 {
744 int cpu;
745
746 for_each_possible_cpu(cpu)
747 run_on(cpu);
748 }
749
750Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
751This is the great strength of classic RCU in a non-preemptive kernel:
752read-side overhead is precisely zero, at least on non-Alpha CPUs.
753And there is absolutely no way that rcu_read_lock() can possibly
754participate in a deadlock cycle!
755
756The implementation of synchronize_rcu() simply schedules itself on each
757CPU in turn. The run_on() primitive can be implemented straightforwardly
758in terms of the sched_setaffinity() primitive. Of course, a somewhat less
759"toy" implementation would restore the affinity upon completion rather
760than just leaving all tasks running on the last CPU, but when I said
761"toy", I meant **toy**!

So how the heck is this supposed to work???

Remember that it is illegal to block while in an RCU read-side critical
section. Therefore, if a given CPU executes a context switch, we know
that it must have completed all preceding RCU read-side critical sections.
Once **all** CPUs have executed a context switch, then **all** preceding
RCU read-side critical sections will have completed.

So, suppose that we remove a data item from its structure and then invoke
synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed
that there are no RCU read-side critical sections holding a reference
to that data item, so we can safely reclaim it.

.. _quiz_2:

Quick Quiz #2:
	Give an example where Classic RCU's read-side
	overhead is **negative**.

:ref:`Answers to Quick Quiz <9_whatisRCU>`

.. _quiz_3:

Quick Quiz #3:
	If it is illegal to block in an RCU read-side
	critical section, what the heck do you do in
	CONFIG_PREEMPT_RT, where normal spinlocks can block???

:ref:`Answers to Quick Quiz <9_whatisRCU>`

.. _6_whatisRCU:

6. ANALOGY WITH READER-WRITER LOCKING
--------------------------------------

Although RCU can be used in many different ways, a very common use of
RCU is analogous to reader-writer locking. The following unified
diff shows how closely related RCU and reader-writer locking can be.
::

	@@ -5,5 +5,5 @@ struct el {
		int data;
		/* Other data fields */
	};
	-rwlock_t listmutex;
	+spinlock_t listmutex;
	struct el head;

	@@ -13,15 +14,15 @@
		struct list_head *lp;
		struct el *p;

	-	read_lock(&listmutex);
	-	list_for_each_entry(p, head, lp) {
	+	rcu_read_lock();
	+	list_for_each_entry_rcu(p, head, lp) {
			if (p->key == key) {
				*result = p->data;
	-			read_unlock(&listmutex);
	+			rcu_read_unlock();
				return 1;
			}
		}
	-	read_unlock(&listmutex);
	+	rcu_read_unlock();
		return 0;
	}

	@@ -29,15 +30,16 @@
	{
		struct el *p;

	-	write_lock(&listmutex);
	+	spin_lock(&listmutex);
		list_for_each_entry(p, head, lp) {
			if (p->key == key) {
	-			list_del(&p->list);
	-			write_unlock(&listmutex);
	+			list_del_rcu(&p->list);
	+			spin_unlock(&listmutex);
	+			synchronize_rcu();
				kfree(p);
				return 1;
			}
		}
	-	write_unlock(&listmutex);
	+	spin_unlock(&listmutex);
		return 0;
	}

Or, for those who prefer a side-by-side listing::

	 1 struct el {                          1 struct el {
	 2   struct list_head list;             2   struct list_head list;
	 3   long key;                          3   long key;
	 4   spinlock_t mutex;                  4   spinlock_t mutex;
	 5   int data;                          5   int data;
	 6   /* Other data fields */            6   /* Other data fields */
	 7 };                                   7 };
	 8 rwlock_t listmutex;                  8 spinlock_t listmutex;
	 9 struct el head;                      9 struct el head;

::

	 1 int search(long key, int *result)    1 int search(long key, int *result)
	 2 {                                    2 {
	 3   struct list_head *lp;              3   struct list_head *lp;
	 4   struct el *p;                      4   struct el *p;
	 5                                      5
	 6   read_lock(&listmutex);             6   rcu_read_lock();
	 7   list_for_each_entry(p, head, lp) { 7   list_for_each_entry_rcu(p, head, lp) {
	 8     if (p->key == key) {             8     if (p->key == key) {
	 9       *result = p->data;             9       *result = p->data;
	10       read_unlock(&listmutex);      10       rcu_read_unlock();
	11       return 1;                     11       return 1;
	12     }                               12     }
	13   }                                 13   }
	14   read_unlock(&listmutex);          14   rcu_read_unlock();
	15   return 0;                         15   return 0;
	16 }                                   16 }

::

	 1 int delete(long key)                 1 int delete(long key)
	 2 {                                    2 {
	 3   struct el *p;                      3   struct el *p;
	 4                                      4
	 5   write_lock(&listmutex);            5   spin_lock(&listmutex);
	 6   list_for_each_entry(p, head, lp) { 6   list_for_each_entry(p, head, lp) {
	 7     if (p->key == key) {             7     if (p->key == key) {
	 8       list_del(&p->list);            8       list_del_rcu(&p->list);
	 9       write_unlock(&listmutex);      9       spin_unlock(&listmutex);
	                                       10       synchronize_rcu();
	10       kfree(p);                     11       kfree(p);
	11       return 1;                     12       return 1;
	12     }                               13     }
	13   }                                 14   }
	14   write_unlock(&listmutex);         15   spin_unlock(&listmutex);
	15   return 0;                         16   return 0;
	16 }                                   17 }

Either way, the differences are quite small. Read-side locking moves
to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from
a reader-writer lock to a simple spinlock, and a synchronize_rcu()
precedes the kfree().

However, there is one potential catch: the read-side and update-side
critical sections can now run concurrently. In many cases, this will
not be a problem, but it is necessary to check carefully regardless.
For example, if multiple independent list updates must be seen as
a single atomic update, converting to RCU will require special care.

Also, the presence of synchronize_rcu() means that the RCU version of
delete() can now block. If this is a problem, there is a callback-based
mechanism that never blocks, namely call_rcu() or kfree_rcu(), which can
be used in place of synchronize_rcu().

.. _7_whatisRCU:

7. ANALOGY WITH REFERENCE COUNTING
-----------------------------------

The reader-writer analogy (illustrated by the previous section) is not
always the best way to think about using RCU. Another helpful analogy
considers RCU an effective reference count on everything which is
protected by RCU.

A reference count typically does not prevent the referenced object's
values from changing, but does prevent changes to type -- particularly the
gross change of type that happens when that object's memory is freed and
re-allocated for some other purpose. Once a type-safe reference to the
object is obtained, some other mechanism is needed to ensure consistent
access to the data in the object. This could involve taking a spinlock,
but with RCU the typical approach is to perform reads with SMP-aware
operations such as smp_load_acquire(), to perform updates with atomic
read-modify-write operations, and to provide the necessary ordering.
RCU provides a number of support functions that embed the required
operations and ordering, such as the list_for_each_entry_rcu() macro
used in the previous section.

A more focused view of the reference counting behavior is that,
between rcu_read_lock() and rcu_read_unlock(), any reference taken with
rcu_dereference() on a pointer marked as ``__rcu`` can be treated as
though a reference-count on that object has been temporarily increased.
This prevents the object from changing type. Exactly what this means
will depend on normal expectations of objects of that type, but it
typically includes that spinlocks can still be safely locked, normal
reference counters can be safely manipulated, and ``__rcu`` pointers
can be safely dereferenced.

Some operations that one might expect to see on an object for
which an RCU reference is held include:

 - Copying out data that is guaranteed to be stable by the object's type.
 - Using kref_get_unless_zero() or similar to get a longer-term
   reference. This may fail of course.
 - Acquiring a spinlock in the object, and checking if the object still
   is the expected object and if so, manipulating it freely.

The understanding that RCU provides a reference that only prevents a
change of type is particularly visible with objects allocated from a
slab cache marked ``SLAB_TYPESAFE_BY_RCU``. RCU operations may yield a
reference to an object from such a cache that has been concurrently freed
and the memory reallocated to a completely different object, though of
the same type. In this case RCU doesn't even protect the identity of the
object from changing, only its type. So the object found may not be the
one expected, but it will be one where it is safe to take a reference
(and then potentially acquire a spinlock), allowing subsequent code
to check whether the identity matches expectations. It is tempting
to simply acquire the spinlock without first taking the reference, but
unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
initialized after each and every call to kmem_cache_alloc(), which renders
reference-free spinlock acquisition completely unsafe. Therefore, when
using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
If using refcount_t, the specialized refcount_{add|inc}_not_zero_acquire()
and refcount_set_release() APIs should be used to ensure correct operation
ordering when verifying object identity and when initializing newly
allocated objects. The acquire fence in refcount_{add|inc}_not_zero_acquire()
ensures that identity checks happen *after* the reference count is taken.
refcount_set_release() should be called after a newly allocated object is
fully initialized, and its release fence ensures that the new values are
visible *before* the refcount can be successfully taken by other users.
Once refcount_set_release() is called, the object should be considered
visible to other tasks.
(Those willing to initialize their locks in a kmem_cache constructor
may also use locking, including cache-friendly sequence locking.)

With traditional reference counting -- such as that implemented by the
kref library in Linux -- there is typically code that runs when the last
reference to an object is dropped. With kref, this is the function
passed to kref_put(). When RCU is being used, such finalization code
must not be run until all ``__rcu`` pointers referencing the object have
been updated, and then a grace period has passed. Every remaining
globally visible pointer to the object must be considered to be a
potential counted reference, and the finalization code is typically run
using call_rcu() only after all those pointers have been changed.

To see how to choose between these two analogies -- of RCU as a
reader-writer lock and RCU as a reference counting system -- it is useful
to reflect on the scale of the thing being protected. The reader-writer
lock analogy looks at larger multi-part objects such as a linked list
and shows how RCU can facilitate concurrency while elements are added
to, and removed from, the list. The reference-count analogy considers
the individual objects and how they can be accessed safely
within whatever whole they are a part of.

.. _8_whatisRCU:

8. FULL LIST OF RCU APIs
-------------------------

The RCU APIs are documented in docbook-format header comments in the
Linux-kernel source code, but it helps to have a full list of the
APIs, since there does not appear to be a way to categorize them
in docbook. Here is the list, by category.

RCU list traversal::

	list_entry_rcu
	list_entry_lockless
	list_first_entry_rcu
	list_first_or_null_rcu
	list_tail_rcu
	list_next_rcu
	list_next_or_null_rcu
	list_for_each_entry_rcu
	list_for_each_entry_continue_rcu
	list_for_each_entry_from_rcu
	list_for_each_entry_lockless
	hlist_first_rcu
	hlist_next_rcu
	hlist_pprev_rcu
	hlist_for_each_entry_rcu
	hlist_for_each_entry_rcu_notrace
	hlist_for_each_entry_rcu_bh
	hlist_for_each_entry_from_rcu
	hlist_for_each_entry_continue_rcu
	hlist_for_each_entry_continue_rcu_bh
	hlist_nulls_first_rcu
	hlist_nulls_next_rcu
	hlist_nulls_for_each_entry_rcu
	hlist_nulls_for_each_entry_safe
	hlist_bl_first_rcu
	hlist_bl_for_each_entry_rcu

RCU pointer/list update::

	rcu_assign_pointer
	rcu_replace_pointer
	INIT_LIST_HEAD_RCU
	list_add_rcu
	list_add_tail_rcu
	list_del_rcu
	list_replace_rcu
	list_splice_init_rcu
	list_splice_tail_init_rcu
	hlist_add_behind_rcu
	hlist_add_before_rcu
	hlist_add_head_rcu
	hlist_add_tail_rcu
	hlist_del_rcu
	hlist_del_init_rcu
	hlist_replace_rcu
	hlist_nulls_del_init_rcu
	hlist_nulls_del_rcu
	hlist_nulls_add_head_rcu
	hlist_nulls_add_tail_rcu
	hlist_nulls_add_fake
	hlists_swap_heads_rcu
	hlist_bl_add_head_rcu
	hlist_bl_del_rcu
	hlist_bl_set_first_rcu

RCU::

	Critical sections             Grace period                               Barrier

	rcu_read_lock                 synchronize_net                            rcu_barrier
	rcu_read_unlock               synchronize_rcu
	guard(rcu)()                  synchronize_rcu_expedited
	scoped_guard(rcu)             synchronize_rcu_mult
	rcu_dereference               call_rcu
	rcu_dereference_check         call_rcu_hurry
	rcu_dereference_protected     kfree_rcu
	rcu_read_lock_held            kvfree_rcu
	rcu_read_lock_any_held        kfree_rcu_mightsleep
	rcu_pointer_handoff           cond_synchronize_rcu
	unrcu_pointer                 cond_synchronize_rcu_full
	                              cond_synchronize_rcu_expedited
	                              cond_synchronize_rcu_expedited_full
	                              get_completed_synchronize_rcu
	                              get_completed_synchronize_rcu_full
	                              get_state_synchronize_rcu
	                              get_state_synchronize_rcu_full
	                              poll_state_synchronize_rcu
	                              poll_state_synchronize_rcu_full
	                              same_state_synchronize_rcu
	                              same_state_synchronize_rcu_full
	                              start_poll_synchronize_rcu
	                              start_poll_synchronize_rcu_full
	                              start_poll_synchronize_rcu_expedited
	                              start_poll_synchronize_rcu_expedited_full

bh::

	Critical sections                Grace period     Barrier

	rcu_read_lock_bh                 [Same as RCU]    [Same as RCU]
	rcu_read_unlock_bh
	[local_bh_disable]
	[and friends]
	rcu_dereference_bh
	rcu_dereference_bh_check
	rcu_dereference_bh_protected
	rcu_read_lock_bh_held

sched::

	Critical sections                Grace period     Barrier

	rcu_read_lock_sched              [Same as RCU]    [Same as RCU]
	rcu_read_unlock_sched
	[preempt_disable]
	[and friends]
	rcu_read_lock_sched_notrace
	rcu_read_unlock_sched_notrace
	rcu_dereference_sched
	rcu_dereference_sched_check
	rcu_dereference_sched_protected
	rcu_read_lock_sched_held


RCU: Initialization/cleanup/ordering::

	RCU_INIT_POINTER
	RCU_INITIALIZER
	RCU_POINTER_INITIALIZER
	init_rcu_head
	destroy_rcu_head
	init_rcu_head_on_stack
	destroy_rcu_head_on_stack
	SLAB_TYPESAFE_BY_RCU


RCU: Quiescent states and control::

	cond_resched_tasks_rcu_qs
	rcu_all_qs
	rcu_softirq_qs_periodic
	rcu_end_inkernel_boot
	rcu_expedite_gp
	rcu_gp_is_expedited
	rcu_unexpedite_gp
	rcu_cpu_stall_reset
	rcu_head_after_call_rcu
	rcu_is_watching


RCU-sync primitive::

	rcu_sync_is_idle
	rcu_sync_init
	rcu_sync_enter
	rcu_sync_exit
	rcu_sync_dtor


RCU-Tasks::

	Critical sections    Grace period             Barrier

	N/A                  call_rcu_tasks           rcu_barrier_tasks
	                     synchronize_rcu_tasks


RCU-Tasks-Rude::

	Critical sections    Grace period                  Barrier

	N/A                  synchronize_rcu_tasks_rude    rcu_barrier_tasks_rude
	                     call_rcu_tasks_rude


RCU-Tasks-Trace::

	Critical sections                Grace period                   Barrier

	rcu_read_lock_trace              call_rcu_tasks_trace           rcu_barrier_tasks_trace
	rcu_read_unlock_trace            synchronize_rcu_tasks_trace
	guard(rcu_tasks_trace)()
	scoped_guard(rcu_tasks_trace)


SRCU list traversal::

	list_for_each_entry_srcu
	hlist_for_each_entry_srcu


SRCU::

	Critical sections             Grace period                             Barrier

	srcu_read_lock                call_srcu                                srcu_barrier
	srcu_read_unlock              synchronize_srcu
	srcu_read_lock_fast           synchronize_srcu_expedited
	srcu_read_unlock_fast         get_state_synchronize_srcu
	srcu_read_lock_nmisafe        start_poll_synchronize_srcu
	srcu_read_unlock_nmisafe      start_poll_synchronize_srcu_expedited
	srcu_read_lock_notrace        poll_state_synchronize_srcu
	srcu_read_unlock_notrace
	srcu_down_read
	srcu_up_read
	srcu_down_read_fast
	srcu_up_read_fast
	guard(srcu)()
	scoped_guard(srcu)
	srcu_read_lock_held
	srcu_dereference
	srcu_dereference_check
	srcu_dereference_notrace


SRCU: Initialization/cleanup/ordering::

	DEFINE_SRCU
	DEFINE_STATIC_SRCU
	init_srcu_struct
	cleanup_srcu_struct
	smp_mb__after_srcu_read_unlock

All: lockdep-checked RCU utility APIs::

	RCU_LOCKDEP_WARN
	rcu_sleep_check

All: Unchecked RCU-protected pointer access::

	rcu_dereference_raw

All: Unchecked RCU-protected pointer access with dereferencing prohibited::

	rcu_access_pointer

See the comment headers in the source code (or the docbook generated
from them) for more information.

However, given that there are no fewer than four families of RCU APIs
in the Linux kernel, how do you choose which one to use? The following
list can be helpful:

a. Will readers need to block? If so, you need SRCU.

b. Will readers need to block and are you doing tracing, for
   example, ftrace or BPF? If so, you need RCU-tasks,
   RCU-tasks-rude, and/or RCU-tasks-trace.

c. What about the -rt patchset? If readers would need to block in
   a non-rt kernel, you need SRCU. If readers would block when
   acquiring spinlocks in a -rt kernel, but not in a non-rt kernel,
   SRCU is not necessary. (The -rt patchset turns spinlocks into
   sleeplocks, hence this distinction.)

d. Do you need to treat NMI handlers, hardirq handlers,
   and code segments with preemption disabled (whether
   via preempt_disable(), local_irq_save(), local_bh_disable(),
   or some other mechanism) as if they were explicit RCU readers?
   If so, RCU-sched readers are the only choice that will work
   for you, but since about v4.20 you can use the vanilla RCU
   update primitives.

e. Do you need RCU grace periods to complete even in the face of
   softirq monopolization of one or more of the CPUs? For example,
   is your code subject to network-based denial-of-service attacks?
   If so, you should disable softirq across your readers, for
   example, by using rcu_read_lock_bh(). Since about v4.20 you
   can use the vanilla RCU update primitives.

f. Is your workload too update-intensive for normal use of
   RCU, but inappropriate for other synchronization mechanisms?
   If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
   named SLAB_DESTROY_BY_RCU). But please be careful!

g. Do you need read-side critical sections that are respected even
   on CPUs that are deep in the idle loop, during entry to or exit
   from user-mode execution, or on an offlined CPU? If so, SRCU
   and RCU Tasks Trace are the only choices that will work for you,
   with SRCU being strongly preferred in almost all cases.

h. Otherwise, use RCU.

Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.

.. _9_whatisRCU:

9. ANSWERS TO QUICK QUIZZES
----------------------------

Quick Quiz #1:
	Why is this argument naive? How could a deadlock
	occur when using this algorithm in a real-world Linux
	kernel? [Referring to the lock-based "toy" RCU
	algorithm.]

Answer:
	Consider the following sequence of events:

	1. CPU 0 acquires some unrelated lock, call it
	   "problematic_lock", disabling irq via
	   spin_lock_irqsave().

	2. CPU 1 enters synchronize_rcu(), write-acquiring
	   rcu_gp_mutex.

	3. CPU 0 enters rcu_read_lock(), but must wait
	   because CPU 1 holds rcu_gp_mutex.

	4. CPU 1 is interrupted, and the irq handler
	   attempts to acquire problematic_lock.

	The system is now deadlocked.

	One way to avoid this deadlock is to use an approach like
	that of CONFIG_PREEMPT_RT, where all normal spinlocks
	become blocking locks, and all irq handlers execute in
	the context of special tasks. In this case, in step 4
	above, the irq handler would block, allowing CPU 1 to
	release rcu_gp_mutex, avoiding the deadlock.

	Even in the absence of deadlock, this RCU implementation
	allows latency to "bleed" from readers to other
	readers through synchronize_rcu(). To see this,
	consider task A in an RCU read-side critical section
	(thus read-holding rcu_gp_mutex), task B blocked
	attempting to write-acquire rcu_gp_mutex, and
	task C blocked in rcu_read_lock() attempting to
	read-acquire rcu_gp_mutex. Task A's RCU read-side
	latency is holding up task C, albeit indirectly via
	task B.

	Realtime RCU implementations therefore use a counter-based
	approach where tasks in RCU read-side critical sections
	cannot be blocked by tasks executing synchronize_rcu().

:ref:`Back to Quick Quiz #1 <quiz_1>`

Quick Quiz #2:
	Give an example where Classic RCU's read-side
	overhead is **negative**.

Answer:
	Imagine a single-CPU system with a non-CONFIG_PREEMPTION
	kernel where a routing table is used by process-context
	code, but can be updated by irq-context code (for example,
	by an "ICMP REDIRECT" packet). The usual way of handling
	this would be to have the process-context code disable
	interrupts while searching the routing table. Use of
	RCU allows such interrupt-disabling to be dispensed with.
	Thus, without RCU, you pay the cost of disabling interrupts,
	and with RCU you don't.

	One can argue that the overhead of RCU in this
	case is negative with respect to the single-CPU
	interrupt-disabling approach. Others might argue that
	the overhead of RCU is merely zero, and that replacing
	the positive overhead of the interrupt-disabling scheme
	with the zero-overhead RCU scheme does not constitute
	negative overhead.

	In real life, of course, things are more complex. But
	even the theoretical possibility of negative overhead for
	a synchronization primitive is a bit unexpected. ;-)

:ref:`Back to Quick Quiz #2 <quiz_2>`

Quick Quiz #3:
	If it is illegal to block in an RCU read-side
	critical section, what the heck do you do in
	CONFIG_PREEMPT_RT, where normal spinlocks can block???

Answer:
	Just as CONFIG_PREEMPT_RT permits preemption of spinlock
	critical sections, it permits preemption of RCU
	read-side critical sections. It also permits
	spinlocks blocking while in RCU read-side critical
	sections.

	Why the apparent inconsistency? Because it is
	possible to use priority boosting to keep the RCU
	grace periods short if need be (for example, if running
	short of memory). In contrast, if blocking waiting
	for (say) network reception, there is no way to know
	what should be boosted. Especially given that the
	process we need to boost might well be a human being
	who just went out for a pizza or something. And although
	a computer-operated cattle prod might arouse serious
	interest, it might also provoke serious objections.
	Besides, how does the computer know what pizza parlor
	the human being went to???

:ref:`Back to Quick Quiz #3 <quiz_3>`

ACKNOWLEDGEMENTS

My thanks to the people who helped make this human-readable, including
Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.


For more information, see http://www.rdrop.com/users/paulmck/RCU.