tools/memory-model: Document categories of ordering primitives

+17

tools/memory-model/Documentation/README

···

 o	You are new to Linux-kernel concurrency: simple.txt

+o	You have some background in Linux-kernel concurrency, and would
+	like an overview of the types of low-level concurrency primitives
+	that the Linux kernel provides: ordering.txt
+
+	Here, "low level" means atomic operations to single variables.
+
 o	You are familiar with the Linux-kernel concurrency primitives
 	that you need, and just want to get started with LKMM litmus
 	tests: litmus-tests.txt
···
 o	You are familiar with Linux-kernel concurrency, and would
 	like a detailed intuitive understanding of LKMM, including
 	situations involving more than two threads: recipes.txt
+
+o	You would like a detailed understanding of what your compiler can
+	and cannot do to control dependencies: control-dependencies.txt

 o	You are familiar with Linux-kernel concurrency and the use of
 	LKMM, and would like a quick reference: cheatsheet.txt
···
 cheatsheet.txt
 	Quick-reference guide to the Linux-kernel memory model.

+control-dependencies.txt
+	Guide to preventing compiler optimizations from destroying
+	your control dependencies.
+
 explanation.txt
 	Detailed description of the memory model.

 litmus-tests.txt
 	The format, features, capabilities, and limitations of the litmus
 	tests that LKMM can evaluate.
+
+ordering.txt
+	Overview of the Linux kernel's low-level memory-ordering
+	primitives by category.

 recipes.txt
 	Common memory-ordering patterns.

+258

tools/memory-model/Documentation/control-dependencies.txt

+CONTROL DEPENDENCIES
+====================
+
+A major difficulty with control dependencies is that current compilers
+do not support them. One purpose of this document is therefore to
+help you prevent your compiler from breaking your code. However,
+control dependencies also pose other challenges, which leads to the
+second purpose of this document, namely to help you to avoid breaking
+your own code, even in the absence of help from your compiler.
+
+One such challenge is that control dependencies order only later stores.
+Therefore, a load-load control dependency will not preserve ordering
+unless a read memory barrier is provided. Consider the following code:
+
+	q = READ_ONCE(a);
+	if (q)
+		p = READ_ONCE(b);
+
+This is not guaranteed to provide any ordering because some types of CPUs
+are permitted to predict the result of the load from "b". This prediction
+can cause other CPUs to see this load as having happened before the load
+from "a". This means that an explicit read barrier is required, for example
+as follows:
+
+	q = READ_ONCE(a);
+	if (q) {
+		smp_rmb();
+		p = READ_ONCE(b);
+	}
+
+However, stores are not speculated. This means that ordering is
+(usually) guaranteed for load-store control dependencies, as in the
+following example:
+
+	q = READ_ONCE(a);
+	if (q)
+		WRITE_ONCE(b, 1);
+
+Control dependencies can pair with each other and with other types
+of ordering. But please note that neither the READ_ONCE() nor the
+WRITE_ONCE() is optional. Without the READ_ONCE(), the compiler might
+fuse the load from "a" with other loads. Without the WRITE_ONCE(),
+the compiler might fuse the store to "b" with other stores. Worse yet,
+the compiler might convert the store into a load and a check followed
+by a store, and this compiler-generated load would not be ordered by
+the control dependency.
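For readers who want to experiment outside the kernel tree, the load-store control dependency described above can be sketched in userspace C. The READ_ONCE() and WRITE_ONCE() macros below are simplified volatile-cast stand-ins written for this sketch (an assumption, not the kernel's definitions), and the function name sketch_store_if_set() and the variables "a" and "b" are hypothetical. The sketch shows only the shape of the code: ISO C itself promises no ordering here, so any ordering still comes from the hardware and from the compiler leaving the conditional in place.

```c
#include <assert.h>

/*
 * Simplified userspace stand-ins for the kernel macros (assumption:
 * GNU C __typeof__ is available). The volatile casts force the compiler
 * to emit exactly one load or store for each use.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

int a;	/* hypothetical shared flag, set by some other thread */
int b;	/* hypothetical shared variable, read by some other thread */

/*
 * Load-store control dependency: the store to "b" is executed only
 * when the value loaded from "a" is non-zero. Returns the loaded value.
 */
int sketch_store_if_set(void)
{
	int q = READ_ONCE(a);

	if (q)
		WRITE_ONCE(b, 1);
	return q;
}
```

Called from a single thread, sketch_store_if_set() leaves "b" untouched while "a" is zero and sets "b" to one otherwise. The ordering questions discussed in this document arise only once other threads access "a" and "b" concurrently.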
+
+Furthermore, if the compiler is able to prove that the value of variable
+"a" is always non-zero, it would be well within its rights to optimize
+the original example by eliminating the "if" statement as follows:
+
+	q = a;
+	b = 1;  /* BUG: Compiler and CPU can both reorder!!! */
+
+So don't leave out either the READ_ONCE() or the WRITE_ONCE().
+In particular, although READ_ONCE() does force the compiler to emit a
+load, it does *not* force the compiler to actually use the loaded value.
+
+It is tempting to try to use control dependencies to enforce ordering on
+identical stores on both branches of the "if" statement as follows:
+
+	q = READ_ONCE(a);
+	if (q) {
+		barrier();
+		WRITE_ONCE(b, 1);
+		do_something();
+	} else {
+		barrier();
+		WRITE_ONCE(b, 1);
+		do_something_else();
+	}
+
+Unfortunately, current compilers will transform this as follows at high
+optimization levels:
+
+	q = READ_ONCE(a);
+	barrier();
+	WRITE_ONCE(b, 1);  /* BUG: No ordering vs. load from a!!! */
+	if (q) {
+		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
+		do_something();
+	} else {
+		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
+		do_something_else();
+	}
+
+Now there is no conditional between the load from "a" and the store to
+"b", which means that the CPU is within its rights to reorder them: The
+conditional is absolutely required, and must be present in the final
+assembly code, after all of the compiler and link-time optimizations
+have been applied. Therefore, if you need ordering in this example,
+you must use explicit memory ordering, for example, smp_store_release():
+
+	q = READ_ONCE(a);
+	if (q) {
+		smp_store_release(&b, 1);
+		do_something();
+	} else {
+		smp_store_release(&b, 1);
+		do_something_else();
+	}
+
+Without explicit memory ordering, control-dependency-based ordering is
+guaranteed only when the stores differ, for example:
+
+	q = READ_ONCE(a);
+	if (q) {
+		WRITE_ONCE(b, 1);
+		do_something();
+	} else {
+		WRITE_ONCE(b, 2);
+		do_something_else();
+	}
+
+The initial READ_ONCE() is still required to prevent the compiler from
+knowing too much about the value of "a".
+
+But please note that you need to be careful what you do with the local
+variable "q", otherwise the compiler might be able to guess the value
+and again remove the conditional branch that is absolutely required to
+preserve ordering. For example:
+
+	q = READ_ONCE(a);
+	if (q % MAX) {
+		WRITE_ONCE(b, 1);
+		do_something();
+	} else {
+		WRITE_ONCE(b, 2);
+		do_something_else();
+	}
+
+If MAX is compile-time defined to be 1, then the compiler knows that
+(q % MAX) must be equal to zero, regardless of the value of "q".
+The compiler is therefore within its rights to transform the above code
+into the following:
+
+	q = READ_ONCE(a);
+	WRITE_ONCE(b, 2);
+	do_something_else();
+
+Given this transformation, the CPU is not required to respect the ordering
+between the load from variable "a" and the store to variable "b". It is
+tempting to add a barrier(), but this does not help. The conditional
+is gone, and the barrier won't bring it back. Therefore, if you need
+to rely on control dependencies to produce this ordering, you should
+make sure that MAX is greater than one, perhaps as follows:
+
+	q = READ_ONCE(a);
+	BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
+	if (q % MAX) {
+		WRITE_ONCE(b, 1);
+		do_something();
+	} else {
+		WRITE_ONCE(b, 2);
+		do_something_else();
+	}
+
+Please note once again that each leg of the "if" statement absolutely
+must store different values to "b". As in previous examples, if the two
+values were identical, the compiler could pull this store outside of the
+"if" statement, destroying the control dependency's ordering properties.
+
+You must also be careful to avoid relying too much on boolean short-circuit
+evaluation. Consider this example:
+
+	q = READ_ONCE(a);
+	if (q || 1 > 0)
+		WRITE_ONCE(b, 1);
+
+Because the first condition cannot fault and the second condition is
+always true, the compiler can transform this example as follows, again
+destroying the control dependency's ordering:
+
+	q = READ_ONCE(a);
+	WRITE_ONCE(b, 1);
+
+This is yet another example showing the importance of preventing the
+compiler from out-guessing your code. Again, although READ_ONCE() really
+does force the compiler to emit code for a given load, the compiler is
+within its rights to discard the loaded value.
+
+In addition, control dependencies apply only to the then-clause and
+else-clause of the "if" statement in question. In particular, they do
+not necessarily order the code following the entire "if" statement:
+
+	q = READ_ONCE(a);
+	if (q) {
+		WRITE_ONCE(b, 1);
+	} else {
+		WRITE_ONCE(b, 2);
+	}
+	WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from "a". */
+
+It is tempting to argue that there in fact is ordering because the
+compiler cannot reorder volatile accesses and also cannot reorder
+the writes to "b" with the condition. Unfortunately for this line
+of reasoning, the compiler might compile the two writes to "b" as
+conditional-move instructions, as in this fanciful pseudo-assembly
+language:
+
+	ld r1,a
+	cmp r1,$0
+	cmov,ne r4,$1
+	cmov,eq r4,$2
+	st r4,b
+	st $1,c
+
+The control dependencies would then extend only to the pair of cmov
+instructions and the store depending on them. This means that a weakly
+ordered CPU would have no dependency of any sort between the load from
+"a" and the store to "c". In short, control dependencies provide ordering
+only to the stores in the then-clause and else-clause of the "if" statement
+in question (including functions invoked by those two clauses), and not
+to code following that "if" statement.
+
+
+In summary:
+
+  (*) Control dependencies can order prior loads against later stores.
+      However, they do *not* guarantee any other sort of ordering:
+      Not prior loads against later loads, nor prior stores against
+      later anything. If you need these other forms of ordering, use
+      smp_load_acquire(), smp_store_release(), or, in the case of prior
+      stores and later loads, smp_mb().
+
+  (*) If both legs of the "if" statement contain identical stores to
+      the same variable, then you must explicitly order those stores,
+      either by preceding both of them with smp_mb() or by using
+      smp_store_release(). Please note that it is *not* sufficient to use
+      barrier() at beginning and end of each leg of the "if" statement
+      because, as shown by the example above, optimizing compilers can
+      destroy the control dependency while respecting the letter of the
+      barrier() law.
+
+  (*) Control dependencies require at least one run-time conditional
+      between the prior load and the subsequent store, and this
+      conditional must involve the prior load. If the compiler is able
+      to optimize the conditional away, it will have also optimized
+      away the ordering. Careful use of READ_ONCE() and WRITE_ONCE()
+      can help to preserve the needed conditional.
+
+  (*) Control dependencies require that the compiler avoid reordering the
+      dependency into nonexistence. Careful use of READ_ONCE() or
+      atomic{,64}_read() can help to preserve your control dependency.
+
+  (*) Control dependencies apply only to the then-clause and else-clause
+      of the "if" statement containing the control dependency, including
+      any functions that these two clauses call. Control dependencies
+      do *not* apply to code beyond the end of that "if" statement.
+
+  (*) Control dependencies pair normally with other types of barriers.
+
+  (*) Control dependencies do *not* provide multicopy atomicity. If you
+      need all the CPUs to agree on the ordering of a given store against
+      all other accesses, use smp_mb().
+
+  (*) Compilers do not understand control dependencies. It is therefore
+      your job to ensure that they do not break your code.

+556

tools/memory-model/Documentation/ordering.txt

+This document gives an overview of the categories of memory-ordering
+operations provided by the Linux-kernel memory model (LKMM).
+
+
+Categories of Ordering
+======================
+
+This section lists LKMM's three top-level categories of memory-ordering
+operations in decreasing order of strength:
+
+1.	Barriers (also known as "fences"). A barrier orders some or
+	all of the CPU's prior operations against some or all of its
+	subsequent operations.
+
+2.	Ordered memory accesses. These operations order themselves
+	against some or all of the CPU's prior accesses or some or all
+	of the CPU's subsequent accesses, depending on the subcategory
+	of the operation.
+
+3.	Unordered accesses, as the name indicates, have no ordering
+	properties except to the extent that they interact with an
+	operation in the previous categories. This being the real world,
+	some of these "unordered" operations provide limited ordering
+	in some special situations.
+
+Each of the above categories is described in more detail by one of the
+following sections.
+
+
+Barriers
+========
+
+Each of the following categories of barriers is described in its own
+subsection below:
+
+a.	Full memory barriers.
+
+b.	Read-modify-write (RMW) ordering augmentation barriers.
+
+c.	Write memory barrier.
+
+d.	Read memory barrier.
+
+e.	Compiler barrier.
+
+Note well that many of these primitives generate absolutely no code
+in kernels built with CONFIG_SMP=n. Therefore, if you are writing
+a device driver, which must correctly order accesses to a physical
+device even in kernels built with CONFIG_SMP=n, please use the
+ordering primitives provided for that purpose. For example, instead of
+smp_mb(), use mb(). See the "Linux Kernel Device Drivers" book or the
+https://lwn.net/Articles/698014/ article for more information.
+
+
+Full Memory Barriers
+--------------------
+
+The Linux-kernel primitives that provide full ordering include:
+
+o	The smp_mb() full memory barrier.
+
+o	Value-returning RMW atomic operations whose names do not end in
+	_acquire, _release, or _relaxed.
+
+o	RCU's grace-period primitives.
+
+First, the smp_mb() full memory barrier orders all of the CPU's prior
+accesses against all subsequent accesses from the viewpoint of all CPUs.
+In other words, all CPUs will agree that any earlier action taken
+by that CPU happened before any later action taken by that same CPU.
+For example, consider the following:
+
+	WRITE_ONCE(x, 1);
+	smp_mb(); // Order store to x before load from y.
+	r1 = READ_ONCE(y);
+
+All CPUs will agree that the store to "x" happened before the load
+from "y", as indicated by the comment. And yes, please comment your
+memory-ordering primitives. It is surprisingly hard to remember their
+purpose after even a few months.
+
+Second, some RMW atomic operations provide full ordering. These
+operations include value-returning RMW atomic operations (that is, those
+with non-void return types) whose names do not end in _acquire, _release,
+or _relaxed. Examples include atomic_add_return(), atomic_dec_and_test(),
+cmpxchg(), and xchg(). Note that conditional RMW atomic operations such
+as cmpxchg() are only guaranteed to provide ordering when they succeed.
+When RMW atomic operations provide full ordering, they partition the
+CPU's accesses into three groups:
+
+1.	All code that executed prior to the RMW atomic operation.
+
+2.	The RMW atomic operation itself.
+
+3.	All code that executed after the RMW atomic operation.
+
+All CPUs will agree that any operation in a given partition happened
+before any operation in a higher-numbered partition.
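The three partitions can be made concrete with a userspace approximation written in C11 atomics. This is only a sketch under stated assumptions: sketch_add_return() and sketch_cmpxchg_ok() are hypothetical names, and the C11 memory model is not the same thing as LKMM, so the full fences below merely stand in for the "everything before" and "everything after" boundaries around a relaxed RMW operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Approximation of a fully ordered value-returning RMW operation:
 * a full fence, then the RMW itself, then another full fence.
 * Returns the new value, in the style of atomic_add_return().
 */
int sketch_add_return(atomic_int *v, int i)
{
	int old;

	atomic_thread_fence(memory_order_seq_cst);	/* Partition 1 ends here. */
	old = atomic_fetch_add_explicit(v, i, memory_order_relaxed);	/* Partition 2. */
	atomic_thread_fence(memory_order_seq_cst);	/* Partition 3 starts here. */
	return old + i;
}

/*
 * Conditional RMW operation: the store happens only if *v == expected.
 * Returns true on success. A failed compare-and-exchange performs no
 * store, which is why callers must not count on ordering from failures.
 */
bool sketch_cmpxchg_ok(atomic_int *v, int expected, int desired)
{
	return atomic_compare_exchange_strong(v, &expected, desired);
}
```

A caller that needs ordering from sketch_cmpxchg_ok() would check the return value and fall back to some other ordering mechanism on failure, mirroring the cmpxchg() caveat above.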
+
+In contrast, non-value-returning RMW atomic operations (that is, those
+with void return types) do not guarantee any ordering whatsoever. Nor do
+value-returning RMW atomic operations whose names end in _relaxed.
+Examples of the former include atomic_inc() and atomic_dec(),
+while examples of the latter include atomic_cmpxchg_relaxed() and
+atomic_xchg_relaxed(). Similarly, value-returning non-RMW atomic
+operations such as atomic_read() do not guarantee full ordering, and
+are covered in the later section on unordered operations.
+
+Value-returning RMW atomic operations whose names end in _acquire or
+_release provide limited ordering, and will be described later in this
+document.
+
+Finally, RCU's grace-period primitives provide full ordering. These
+primitives include synchronize_rcu(), synchronize_rcu_expedited(),
+synchronize_srcu() and so on. However, these primitives have orders
+of magnitude greater overhead than smp_mb(), atomic_xchg(), and so on.
+Furthermore, RCU's grace-period primitives can only be invoked in
+sleepable contexts. Therefore, RCU's grace-period primitives are
+typically instead used to provide ordering against RCU read-side critical
+sections, as documented in their comment headers. But of course if you
+need a synchronize_rcu() to interact with readers, it costs you nothing
+to also rely on its additional full-memory-barrier semantics. Just please
+carefully comment this, otherwise your future self will hate you.
+
+
+RMW Ordering Augmentation Barriers
+----------------------------------
+
+As noted in the previous section, non-value-returning RMW operations
+such as atomic_inc() and atomic_dec() guarantee no ordering whatsoever.
+Nevertheless, a number of popular CPU families, including x86, provide
+full ordering for these primitives. One way to obtain full ordering on
+all architectures is to add a call to smp_mb():
+
+	WRITE_ONCE(x, 1);
+	atomic_inc(&my_counter);
+	smp_mb(); // Inefficient on x86!!!
+	r1 = READ_ONCE(y);
+
+This works, but the added smp_mb() adds needless overhead for
+x86, on which atomic_inc() provides full ordering all by itself.
+The smp_mb__after_atomic() primitive can be used instead:
+
+	WRITE_ONCE(x, 1);
+	atomic_inc(&my_counter);
+	smp_mb__after_atomic(); // Order store to x before load from y.
+	r1 = READ_ONCE(y);
+
+The smp_mb__after_atomic() primitive emits code only on CPUs whose
+atomic_inc() implementations do not guarantee full ordering, thus
+incurring no unnecessary overhead on x86. There are a number of
+variations on the smp_mb__*() theme:
+
+o	smp_mb__before_atomic(), which provides full ordering prior
+	to an unordered RMW atomic operation.
+
+o	smp_mb__after_atomic(), which, as shown above, provides full
+	ordering subsequent to an unordered RMW atomic operation.
+
+o	smp_mb__after_spinlock(), which provides full ordering subsequent
+	to a successful spinlock acquisition. Note that spin_lock() is
+	always successful but spin_trylock() might not be.
+
+o	smp_mb__after_srcu_read_unlock(), which provides full ordering
+	subsequent to an srcu_read_unlock().
+
+It is bad practice to place code between the smp_mb__*() primitive and the
+operation whose ordering it is augmenting. The reason is that the
+ordering of this intervening code will differ from one CPU architecture
+to another.
+
+
+Write Memory Barrier
+--------------------
+
+The Linux kernel's write memory barrier is smp_wmb(). If a CPU executes
+the following code:
+
+	WRITE_ONCE(x, 1);
+	smp_wmb();
+	WRITE_ONCE(y, 1);
+
+Then any given CPU will see the write to "x" as having happened before
+the write to "y". However, you are usually better off using a release
+store, as described in the "Release Operations" section below.
+
+Note that smp_wmb() might fail to provide ordering for unmarked C-language
+stores because profile-driven optimization could determine that the
+value being overwritten is almost always equal to the new value. Such a
+compiler might then reasonably decide to transform "x = 1" and "y = 1"
+as follows:
+
+	if (x != 1)
+		x = 1;
+	smp_wmb(); // BUG: does not order the reads!!!
+	if (y != 1)
+		y = 1;
+
+Therefore, if you need to use smp_wmb() with unmarked C-language writes,
+you will need to make sure that none of the compilers used to build
+the Linux kernel carry out this sort of transformation, both now and in
+the future.
+
+
+Read Memory Barrier
+-------------------
+
+The Linux kernel's read memory barrier is smp_rmb(). If a CPU executes
+the following code:
+
+	r0 = READ_ONCE(y);
+	smp_rmb();
+	r1 = READ_ONCE(x);
+
+Then any given CPU will see the read from "y" as having preceded the read from
+"x". However, you are usually better off using an acquire load, as described
+in the "Acquire Operations" section below.
+
+Compiler Barrier
+----------------
+
+The Linux kernel's compiler barrier is barrier(). This primitive
+prohibits compiler code-motion optimizations that might move memory
+references across the point in the code containing the barrier(), but
+does not constrain hardware memory ordering. For example, this can be
+used to prevent the compiler from moving code across an infinite loop:
+
+	WRITE_ONCE(x, 1);
+	while (dontstop)
+		barrier();
+	r1 = READ_ONCE(y);
+
+Without the barrier(), the compiler would be within its rights to move the
+WRITE_ONCE() to follow the loop. This code motion could be problematic
+in the case where an interrupt handler terminates the loop. Another way
+to handle this is to use READ_ONCE() for the load of "dontstop".
+
+Note that the barriers discussed previously use barrier() or its low-level
+equivalent in their implementations.
+
+
+Ordered Memory Accesses
+=======================
+
+The Linux kernel provides a wide variety of ordered memory accesses:
+
+a.	Release operations.
+
+b.	Acquire operations.
+
+c.	RCU read-side ordering.
+
+d.	Control dependencies.
+
+Each of the above categories has its own section below.
+
+
+Release Operations
+------------------
+
+Release operations include smp_store_release(), atomic_set_release(),
+rcu_assign_pointer(), and value-returning RMW operations whose names
+end in _release. These operations order their own store against all
+of the CPU's prior memory accesses. Release operations often provide
+improved readability and performance compared to explicit barriers.
+For example, use of smp_store_release() saves a line compared to the
+smp_wmb() example above:
+
+	WRITE_ONCE(x, 1);
+	smp_store_release(&y, 1);
+
+More important, smp_store_release() makes it easier to connect up the
+different pieces of the concurrent algorithm. The variable stored to
+by the smp_store_release(), in this case "y", will normally be used in
+an acquire operation in other parts of the concurrent algorithm.
+
+To see the performance advantages, suppose that the above example read
+from "x" instead of writing to it. Then an smp_wmb() could not guarantee
+ordering, and an smp_mb() would be needed instead:
+
+	r1 = READ_ONCE(x);
+	smp_mb();
+	WRITE_ONCE(y, 1);
+
+But smp_mb() often incurs much higher overhead than does
+smp_store_release(), which still provides the needed ordering of "x"
+against "y". On x86, the version using smp_store_release() might compile
+to a simple load instruction followed by a simple store instruction.
+In contrast, the smp_mb() compiles to an expensive instruction that
+provides the needed ordering.
+
+There is a wide variety of release operations:
+
+o	Store operations, including not only the aforementioned
+	smp_store_release(), but also atomic_set_release(), and
+	atomic_long_set_release().
+
+o	RCU's rcu_assign_pointer() operation. This is the same as
+	smp_store_release() except that: (1) It takes the pointer to
+	be assigned to instead of a pointer to that pointer, (2) It
+	is intended to be used in conjunction with rcu_dereference()
+	and similar rather than smp_load_acquire(), and (3) It checks
+	for an RCU-protected pointer in "sparse" runs.
+
+o	Value-returning RMW operations whose names end in _release,
+	such as atomic_fetch_add_release() and cmpxchg_release().
+	Note that release ordering is guaranteed only against the
+	memory-store portion of the RMW operation, and not against the
+	memory-load portion. Note also that conditional operations such
+	as cmpxchg_release() are only guaranteed to provide ordering
+	when they succeed.
+
+As mentioned earlier, release operations are often paired with acquire
+operations, which are the subject of the next section.
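The release store and its acquire-side partner can be tried out in userspace with C11 atomics. In this sketch, memory_order_release and memory_order_acquire are used as rough analogues of smp_store_release() and smp_load_acquire(), which is an approximation rather than an exact LKMM equivalence. The names sketch_producer() and sketch_consumer() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_int x;	/* payload, initially zero */
atomic_int y;	/* flag that publishes the payload, initially zero */

/* Writer side: plain marked store to x, then release store to y. */
void sketch_producer(void)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);	/* Like WRITE_ONCE(x, 1). */
	atomic_store_explicit(&y, 1, memory_order_release);	/* Order store to x before store to y. */
}

/* Reader side: acquire load from y, then plain marked load from x. */
void sketch_consumer(int *r0, int *r1)
{
	*r0 = atomic_load_explicit(&y, memory_order_acquire);	/* Order load from y before load from x. */
	*r1 = atomic_load_explicit(&x, memory_order_relaxed);	/* Like READ_ONCE(x). */
}
```

When the two functions run concurrently, a consumer that sees the flag set (*r0 == 1) must also see the payload (*r1 == 1). Searching the code for the variable named in the release store is usually the quickest way to find the matching acquire.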
+
+
+Acquire Operations
+------------------
+
+Acquire operations include smp_load_acquire(), atomic_read_acquire(),
+and value-returning RMW operations whose names end in _acquire. These
+operations order their own load against all of the CPU's subsequent
+memory accesses. Acquire operations often provide improved performance
+and readability compared to explicit barriers. For example, use of
+smp_load_acquire() saves a line compared to the smp_rmb() example above:
+
+	r0 = smp_load_acquire(&y);
+	r1 = READ_ONCE(x);
+
+As with smp_store_release(), this also makes it easier to connect
+the different pieces of the concurrent algorithm by looking for the
+smp_store_release() that stores to "y". In addition, smp_load_acquire()
+improves upon smp_rmb() by ordering against subsequent stores as well
+as against subsequent loads.
+
+There are a couple of categories of acquire operations:
+
+o	Load operations, including not only the aforementioned
+	smp_load_acquire(), but also atomic_read_acquire(), and
+	atomic64_read_acquire().
+
+o	Value-returning RMW operations whose names end in _acquire,
+	such as atomic_xchg_acquire() and atomic_cmpxchg_acquire().
+	Note that acquire ordering is guaranteed only against the
+	memory-load portion of the RMW operation, and not against the
+	memory-store portion. Note also that conditional operations
+	such as atomic_cmpxchg_acquire() are only guaranteed to provide
+	ordering when they succeed.
+
+Symmetry being what it is, acquire operations are often paired with the
+release operations covered earlier. For example, consider the following
+code, where task0() and task1() execute concurrently:
+
+	void task0(void)
+	{
+		WRITE_ONCE(x, 1);
+		smp_store_release(&y, 1);
+	}
+
+	void task1(void)
+	{
+		r0 = smp_load_acquire(&y);
+		r1 = READ_ONCE(x);
+	}
+
+If "x" and "y" are both initially zero, then either r0's final value
+will be zero or r1's final value will be one, thus providing the required
+ordering.
+
+
+RCU Read-Side Ordering
+----------------------
+
+This category includes read-side markers such as rcu_read_lock()
+and rcu_read_unlock() as well as pointer-traversal primitives such as
+rcu_dereference() and srcu_dereference().
+
+Compared to locking primitives and RMW atomic operations, markers
+for RCU read-side critical sections incur very low overhead because
+they interact only with the corresponding grace-period primitives.
+For example, the rcu_read_lock() and rcu_read_unlock() markers interact
+with synchronize_rcu(), synchronize_rcu_expedited(), and call_rcu().
+The way this works is that if a given call to synchronize_rcu() cannot
+prove that it started before a given call to rcu_read_lock(), then
+that synchronize_rcu() must block until the matching rcu_read_unlock()
+is reached. For more information, please see the synchronize_rcu()
+docbook header comment and the material in Documentation/RCU.
+
+RCU's pointer-traversal primitives, including rcu_dereference() and
+srcu_dereference(), order their load (which must be a pointer) against any
+of the CPU's subsequent memory accesses whose address has been calculated
+from the value loaded. There is said to be an *address dependency*
+from the value returned by the rcu_dereference() or srcu_dereference()
+to that subsequent memory access.
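The publish-then-traverse pattern behind an address dependency can be sketched in userspace with C11 atomics. This is a sketch under stated assumptions: struct sketch_cfg, sketch_publish(), and sketch_read_timeout() are hypothetical names, the release store merely stands in for rcu_assign_pointer(), and the acquire load is a deliberately stronger substitute for rcu_dereference(), which relies on dependency ordering instead. No grace periods or reclamation are modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_cfg {		/* hypothetical data reached through the pointer */
	int timeout;
	int retries;
};

static struct sketch_cfg cfg_storage;
static _Atomic(struct sketch_cfg *) cfg_ptr;	/* NULL until published */

/* Updater: initialize the structure, then publish a pointer to it. */
void sketch_publish(int timeout, int retries)
{
	cfg_storage.timeout = timeout;
	cfg_storage.retries = retries;
	/* Stand-in for rcu_assign_pointer(): order initialization before publication. */
	atomic_store_explicit(&cfg_ptr, &cfg_storage, memory_order_release);
}

/* Reader: load the pointer, then access memory whose address depends on it. */
int sketch_read_timeout(int fallback)
{
	/* Stand-in for rcu_dereference(), using acquire instead of a dependency. */
	struct sketch_cfg *p = atomic_load_explicit(&cfg_ptr, memory_order_acquire);

	if (!p)
		return fallback;
	return p->timeout;	/* The address of this load is computed from "p". */
}
```

A reader that observes the published pointer is thereby guaranteed to observe the initialized fields. In the kernel, the same guarantee comes from the address dependency itself, which is why rcu_dereference() can be cheaper than an acquire load on weakly ordered CPUs.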
395 + 396 + A call to rcu_dereference() for a given RCU-protected pointer is 397 + usually paired with a call to a call to rcu_assign_pointer() for that 398 + same pointer in much the same way that a call to smp_load_acquire() is 399 + paired with a call to smp_store_release(). Calls to rcu_dereference() 400 + and rcu_assign_pointer are often buried in other APIs, for example, 401 + the RCU list API members defined in include/linux/rculist.h. For more 402 + information, please see the docbook headers in that file, the most 403 + recent LWN article on the RCU API (https://lwn.net/Articles/777036/), 404 + and of course the material in Documentation/RCU. 405 + 406 + If the pointer value is manipulated between the rcu_dereference() 407 + that returned it and a later dereference(), please read 408 + Documentation/RCU/rcu_dereference.rst. It can also be quite helpful to 409 + review uses in the Linux kernel. 410 + 411 + 412 + Control Dependencies 413 + -------------------- 414 + 415 + A control dependency extends from a marked load (READ_ONCE() or stronger) 416 + through an "if" condition to a marked store (WRITE_ONCE() or stronger) 417 + that is executed only by one of the legs of that "if" statement. 418 + Control dependencies are so named because they are mediated by 419 + control-flow instructions such as comparisons and conditional branches. 420 + 421 + In short, you can use a control dependency to enforce ordering between 422 + an READ_ONCE() and a WRITE_ONCE() when there is an "if" condition 423 + between them. The canonical example is as follows: 424 + 425 + q = READ_ONCE(a); 426 + if (q) 427 + WRITE_ONCE(b, 1); 428 + 429 + In this case, all CPUs would see the read from "a" as happening before 430 + the write to "b". 431 + 432 + However, control dependencies are easily destroyed by compiler 433 + optimizations, so any use of control dependencies must take into account 434 + all of the compilers used to build the Linux kernel. 
Please see the 435 + "control-dependencies.txt" file for more information. 436 + 437 + 438 + Unordered Accesses 439 + ================== 440 + 441 + Each of these two categories of unordered accesses has a section below: 442 + 443 + a. Unordered marked operations. 444 + 445 + b. Unmarked C-language accesses. 446 + 447 + 448 + Unordered Marked Operations 449 + --------------------------- 450 + 451 + Unordered operations to different variables are just that, unordered. 452 + However, if a group of CPUs apply these operations to a single variable, 453 + all the CPUs will agree on the operation order. Of course, the ordering 454 + of unordered marked accesses can also be constrained using the mechanisms 455 + described earlier in this document. 456 + 457 + These operations come in three categories: 458 + 459 + o Marked writes, such as WRITE_ONCE() and atomic_set(). These 460 + primitives required the compiler to emit the corresponding store 461 + instructions in the expected execution order, thus suppressing 462 + a number of destructive optimizations. However, they provide no 463 + hardware ordering guarantees, and in fact many CPUs will happily 464 + reorder marked writes with each other or with other unordered 465 + operations, unless these operations are to the same variable. 466 + 467 + o Marked reads, such as READ_ONCE() and atomic_read(). These 468 + primitives required the compiler to emit the corresponding load 469 + instructions in the expected execution order, thus suppressing 470 + a number of destructive optimizations. However, they provide no 471 + hardware ordering guarantees, and in fact many CPUs will happily 472 + reorder marked reads with each other or with other unordered 473 + operations, unless these operations are to the same variable. 474 + 475 + o Unordered RMW atomic operations. 
These are non-value-returning 476 + RMW atomic operations whose names do not end in _acquire or 477 + _release, and also value-returning RMW operations whose names 478 + end in _relaxed. Examples include atomic_add(), atomic_or(), 479 + and atomic64_fetch_xor_relaxed(). These operations do carry 480 + out the specified RMW operation atomically, for example, five 481 + concurrent atomic_inc() operations applied to a given variable 482 + will reliably increase the value of that variable by five. 483 + However, many CPUs will happily reorder these operations with 484 + each other or with other unordered operations. 485 + 486 + This category of operations can be efficiently ordered using 487 + smp_mb__before_atomic() and smp_mb__after_atomic(), as was 488 + discussed in the "RMW Ordering Augmentation Barriers" section. 489 + 490 + In short, these operations can be freely reordered unless they are all 491 + operating on a single variable or unless they are constrained by one of 492 + the operations called out earlier in this document. 493 + 494 + 495 + Unmarked C-Language Accesses 496 + ---------------------------- 497 + 498 + Unmarked C-language accesses are normal variable accesses to normal 499 + variables, that is, to variables that are not "volatile" and are not 500 + C11 atomic variables. These operations provide no ordering guarantees, 501 + and further do not guarantee "atomic" access. For example, the compiler 502 + might (and sometimes does) split a plain C-language store into multiple 503 + smaller stores. A load from that same variable running on some other 504 + CPU while such a store is executing might see a value that is a mashup 505 + of the old value and the new value. 506 + 507 + Unmarked C-language accesses are unordered, and are also subject to 508 + any number of compiler optimizations, many of which can break your 509 + concurrent code. 
It is possible to use unmarked C-language accesses for 510 + shared variables that are subject to concurrent access, but great care 511 + is required on an ongoing basis. The compiler-constraining barrier() 512 + primitive can be helpful, as can the various ordering primitives discussed 513 + in this document. It nevertheless bears repeating that use of unmarked 514 + C-language accesses requires careful attention not just to your code, 515 + but to all the compilers that might be used to build it. Such compilers 516 + might replace a series of loads with a single load, and might replace 517 + a series of stores with a single store. Some compilers will even split 518 + a single store into multiple smaller stores. 519 + 520 + But there are some ways of using unmarked C-language accesses for shared 521 + variables without such worries: 522 + 523 + o Guard all accesses to a given variable by a particular lock, 524 + so that there are never concurrent conflicting accesses to 525 + that variable. (There are "conflicting accesses" when 526 + (1) at least one of the concurrent accesses to a variable is an 527 + unmarked C-language access and (2) at least one of those 528 + accesses is a write, whether marked or not.) 529 + 530 + o As above, but using other synchronization primitives such 531 + as reader-writer locks or sequence locks. 532 + 533 + o Use locking or other means to ensure that all concurrent accesses 534 + to a given variable are reads. 535 + 536 + o Restrict use of a given variable to statistics or heuristics 537 + where the occasional bogus value can be tolerated. 538 + 539 + o Declare the accessed variables as C11 atomics. 540 + https://lwn.net/Articles/691128/ 541 + 542 + o Declare the accessed variables as "volatile". 543 + 544 + If you need to live more dangerously, please do take the time to 545 + understand the compilers. One place to start is these two LWN 546 + articles: 547 + 548 + Who's afraid of a big bad optimizing compiler?
549 + https://lwn.net/Articles/793253 550 + Calibrating your fear of big bad optimizing compilers 551 + https://lwn.net/Articles/799218 552 + 553 + Used properly, unmarked C-language accesses can reduce overhead on 554 + fastpaths. However, the price is great care and continual attention 555 + to your compiler as new versions come out and as new optimizations 556 + are enabled.
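Two of the safer options listed above, a single guarding lock and C11
atomics, can be illustrated with a rough userspace sketch. It is not
kernel code: it assumes POSIX threads and C11 <stdatomic.h> in place of
the kernel's own primitives, and the names atomic_counter, locked_counter,
counter_lock, worker(), and run_counter_demo() are hypothetical. The
relaxed atomic increment is loosely analogous to an unordered RMW
operation such as atomic_inc(): it provides atomicity but no ordering.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_THREADS 5	/* hypothetical thread count for this sketch */

/* Option 1: a C11 atomic updated by a relaxed (unordered) RMW operation. */
static atomic_int atomic_counter;

/* Option 2: a plain int whose every concurrent access holds one lock. */
static int locked_counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *unused)
{
	(void)unused;

	/* Atomic but unordered: no increment is lost, nothing is ordered. */
	atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);

	/* Plain C-language access, safe only because the lock is held. */
	pthread_mutex_lock(&counter_lock);
	locked_counter++;
	pthread_mutex_unlock(&counter_lock);
	return NULL;
}

/* Runs NR_THREADS workers, returning 0 on success and -1 on failure. */
static int run_counter_demo(int *atomic_result, int *locked_result)
{
	pthread_t tids[NR_THREADS];
	int created;
	int ret = 0;
	int i;

	assert(atomic_result != NULL && locked_result != NULL);

	/* No other threads exist yet, so these initializations cannot race. */
	atomic_store(&atomic_counter, 0);
	locked_counter = 0;

	for (created = 0; created < NR_THREADS; created++) {
		if (pthread_create(&tids[created], NULL, worker, NULL)) {
			ret = -1;
			break;
		}
	}

	/* pthread_join() orders each worker's updates before the reads below. */
	for (i = 0; i < created; i++)
		pthread_join(tids[i], NULL);

	*atomic_result = atomic_load(&atomic_counter);
	*locked_result = locked_counter;
	return ret;
}
```

A caller would pass two int pointers to run_counter_demo() and, on a zero
return, expect both results to equal NR_THREADS. Removing the lock around
locked_counter++ would turn that increment into a conflicting access in
the sense defined above, and the final count would no longer be reliable.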