		Semantics and Behavior of Atomic and
			 Bitmask Operations

			    David S. Miller

	This document is intended to serve as a guide to Linux port
maintainers on how to implement atomic counter, bitops, and spinlock
interfaces properly.

	The atomic_t type should be defined as a signed integer.
Also, it should be made opaque such that any kind of cast to a normal
C integer type will fail.  Something like the following should
suffice:

	typedef struct { volatile int counter; } atomic_t;

	The first operations to implement for atomic_t's are the
initializers and plain reads.

	#define ATOMIC_INIT(i)		{ (i) }
	#define atomic_set(v, i)	((v)->counter = (i))

The first macro is used in definitions, such as:

static atomic_t my_counter = ATOMIC_INIT(1);

The second interface can be used at runtime, as in:

	struct foo { atomic_t counter; };
	...

	struct foo *k;

	k = kmalloc(sizeof(*k), GFP_KERNEL);
	if (!k)
		return -ENOMEM;
	atomic_set(&k->counter, 0);

Next, we have:

	#define atomic_read(v)	((v)->counter)

which simply reads the current value of the counter.

Now, we move onto the actual atomic operation interfaces.

	void atomic_add(int i, atomic_t *v);
	void atomic_sub(int i, atomic_t *v);
	void atomic_inc(atomic_t *v);
	void atomic_dec(atomic_t *v);

These four routines add and subtract integral values to/from the given
atomic_t value.  The first two routines pass explicit integers by
which to make the adjustment, whereas the latter two use an implicit
adjustment value of "1".

One very important aspect of these routines is that they DO NOT
require any explicit memory barriers.  They need only perform the
atomic_t counter update in an SMP safe manner.

Next, we have:

	int atomic_inc_return(atomic_t *v);
	int atomic_dec_return(atomic_t *v);

These routines add 1 and subtract 1, respectively, from the given
atomic_t and return the new counter value after the operation is
performed.

Unlike the above routines, it is required that explicit memory
barriers are performed before and after the operation.  It must be
done such that all memory operations before and after the atomic
operation calls are strongly ordered with respect to the atomic
operation itself.

For example, it should behave as if a smp_mb() call existed both
before and after the atomic operation.

If the atomic instructions used in an implementation provide explicit
memory barrier semantics which satisfy the above requirements, that is
fine as well.

Let's move on:

	int atomic_add_return(int i, atomic_t *v);
	int atomic_sub_return(int i, atomic_t *v);

These behave just like atomic_{inc,dec}_return() except that an
explicit counter adjustment is given instead of the implicit "1".
This means that like atomic_{inc,dec}_return(), the memory barrier
semantics are required.

Next:

	int atomic_inc_and_test(atomic_t *v);
	int atomic_dec_and_test(atomic_t *v);

These two routines increment and decrement by 1, respectively, the
given atomic counter.  They return a boolean indicating whether the
resulting counter value was zero or not.

They require explicit memory barrier semantics around the operation,
as above.

	int atomic_sub_and_test(int i, atomic_t *v);

This is identical to atomic_dec_and_test() except that an explicit
decrement is given instead of the implicit "1".  It requires explicit
memory barrier semantics around the operation.

	int atomic_add_negative(int i, atomic_t *v);

The given increment is added to the given atomic counter value.  A
boolean is returned which indicates whether the resulting counter
value is negative.  It requires explicit memory barrier semantics
around the operation.
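To make the distinction between the two groups concrete, here is a
rough sketch of how the non-value-returning and value-returning
interfaces might be mixed in practice.  The counters and helpers below
(packets, next_id, record_packet(), allocate_id()) are hypothetical,
invented purely for illustration:

static atomic_t packets = ATOMIC_INIT(0);	/* plain event counter */
static atomic_t next_id = ATOMIC_INIT(0);	/* unique ID allocator */

static void record_packet(void)
{
	/* No value is returned, so no memory barriers are implied;
	 * only an SMP safe update of the counter is guaranteed.
	 */
	atomic_inc(&packets);
}

static int allocate_id(void)
{
	/* The new value is returned and used, so the implementation
	 * must behave as if smp_mb() were issued both before and
	 * after the increment.
	 */
	return atomic_inc_return(&next_id);
}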
If a caller requires memory barrier semantics around an atomic_t
operation which does not return a value, a set of interfaces are
defined which accomplish this:

	void smp_mb__before_atomic_dec(void);
	void smp_mb__after_atomic_dec(void);
	void smp_mb__before_atomic_inc(void);
	void smp_mb__after_atomic_inc(void);

For example, smp_mb__before_atomic_dec() can be used like so:

	obj->dead = 1;
	smp_mb__before_atomic_dec();
	atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation.  In the above example, it guarantees that the assignment of
"1" to obj->dead will be globally visible to other cpus before the
atomic counter decrement.

Without the explicit smp_mb__before_atomic_dec() call, the
implementation could legally allow the atomic counter update to become
visible to other cpus before the "obj->dead = 1;" assignment.

The other three interfaces listed are used to provide explicit
ordering with respect to memory operations after an atomic_dec() call
(smp_mb__after_atomic_dec()) and around atomic_inc() calls
(smp_mb__{before,after}_atomic_inc()).
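The atomic_inc() variants bracket the counter update in the same way.
Here is a rough combined sketch; the obj->request_ready, obj->pending,
and obj->queued_at fields are hypothetical, used only for
illustration:

	/* Make the request contents visible before the pending
	 * counter update can be observed ...
	 */
	obj->request_ready = 1;
	smp_mb__before_atomic_inc();
	atomic_inc(&obj->pending);
	smp_mb__after_atomic_inc();
	/* ... and make the counter update visible before the
	 * timestamp store below.
	 */
	obj->queued_at = jiffies;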
A missing memory barrier in the cases where they are required by the
atomic_t implementation above can have disastrous results.  Here is
an example, which follows a pattern occurring frequently in the Linux
kernel.  It is the use of atomic counters to implement reference
counting, and it works such that once the counter falls to zero it can
be guaranteed that no other entity can be accessing the object:

static void obj_list_add(struct obj *obj)
{
	obj->active = 1;
	list_add(&obj->list, &global_list);
}

static void obj_list_del(struct obj *obj)
{
	list_del(&obj->list);
	obj->active = 0;
}

static void obj_destroy(struct obj *obj)
{
	BUG_ON(obj->active);
	kfree(obj);
}

struct obj *obj_list_peek(struct list_head *head)
{
	if (!list_empty(head)) {
		struct obj *obj;

		obj = list_entry(head->next, struct obj, list);
		atomic_inc(&obj->refcnt);
		return obj;
	}
	return NULL;
}

void obj_poke(void)
{
	struct obj *obj;

	spin_lock(&global_list_lock);
	obj = obj_list_peek(&global_list);
	spin_unlock(&global_list_lock);

	if (obj) {
		obj->ops->poke(obj);
		if (atomic_dec_and_test(&obj->refcnt))
			obj_destroy(obj);
	}
}

void obj_timeout(struct obj *obj)
{
	spin_lock(&global_list_lock);
	obj_list_del(obj);
	spin_unlock(&global_list_lock);

	if (atomic_dec_and_test(&obj->refcnt))
		obj_destroy(obj);
}

(This is a simplification of the ARP queue management in the
 generic neighbour discovery code of the networking.  Olaf Kirch
 found a bug with respect to memory barriers in kfree_skb() that
 exposed the atomic_t memory barrier requirements quite clearly.)

Given the above scheme, it must be the case that the obj->active
update done by the obj list deletion be visible to other processors
before the atomic counter decrement is performed.

Otherwise, the counter could fall to zero, yet obj->active would still
be set, thus triggering the assertion in obj_destroy().  The error
sequence looks like this:

	cpu 0				cpu 1
	obj_poke()			obj_timeout()
	obj = obj_list_peek();
	... gains ref to obj, refcnt=2
					obj_list_del(obj);
					obj->active = 0 ...
					... visibility delayed ...
	atomic_dec_and_test()
	... refcnt drops to 1 ...
					atomic_dec_and_test()
					... refcount drops to 0 ...
					obj_destroy()
					BUG() triggers since obj->active
					still seen as one
	obj->active update visibility occurs

With the memory barrier semantics required of the atomic_t operations
which return values, the above sequence of memory visibility can never
happen.  Specifically, in the above case the atomic_dec_and_test()
counter decrement would not become globally visible until the
obj->active update does.

As a historical note, 32-bit Sparc used to only allow usage of
24 bits of its atomic_t type.  This was because it used 8 bits
as a spinlock for SMP safety.  Sparc32 lacked a "compare and swap"
type instruction.  However, 32-bit Sparc has since been moved over
to a "hash table of spinlocks" scheme, which allows the full 32-bit
counter to be realized.  Essentially, an array of spinlocks is
indexed into based upon the address of the atomic_t being operated
on, and that lock protects the atomic operation.  Parisc uses the
same scheme.

Another note is that the atomic_t operations returning values are
extremely slow on an old 386.

We will now cover the atomic bitmask operations.  You will find that
their SMP and memory barrier semantics are similar in shape and scope
to the atomic_t ops above.

Native atomic bit operations are defined to operate on objects aligned
to the size of an "unsigned long" C data type, and are at least of
that size.  The endianness of the bits within each "unsigned long" is
the native endianness of the cpu.

	void set_bit(unsigned long nr, volatile unsigned long *addr);
	void clear_bit(unsigned long nr, volatile unsigned long *addr);
	void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number
indicated by "nr" in the bit mask pointed to by "addr".

They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces.

	int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.

WARNING! It is incredibly important that the value be a boolean,
ie. "0" or "1".  Do not try to be fancy and save a few instructions by
declaring the above to return "long" and just returning something like
"old_val & mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up is the thread_info
flag operations.  Routines such as test_and_set_ti_thread_flag() chop
the return value into an int.  There are other places where things
like this occur as well.
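To illustrate the truncation pitfall, here is a deliberately broken,
hypothetical variant alongside the correct idiom.  Atomicity is
ignored in this sketch since only the return value is at issue:

/* BROKEN: returns the raw masked word instead of a boolean. */
static long broken_test_and_set_bit(unsigned long nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long *p = addr + nr / BITS_PER_LONG;
	unsigned long old = *p;

	*p = old | mask;
	return old & mask;	/* for nr = 32 on 64-bit: 0x100000000 */
}

/* CORRECT: the result is squashed down to "0" or "1". */
static int fixed_test_and_set_bit(unsigned long nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long *p = addr + nr / BITS_PER_LONG;
	unsigned long old = *p;

	*p = old | mask;
	return (old & mask) != 0;
}

A caller that stores the broken variant's return value in an "int"
loses the upper 32 bits on a 64-bit machine, so a set bit at position
32 or higher is reported as clear.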
The test_and_*_bit() routines, like the atomic_t counter operations
returning values, require explicit memory barrier semantics around
their execution.  All memory operations before the atomic bit
operation call must be made visible globally before the atomic bit
operation is made visible.  Likewise, the atomic bit operation must be
visible globally before any subsequent memory operation is made
visible.  For example:

	obj->dead = 1;
	if (test_and_set_bit(0, &obj->flags))
		/* ... */;
	obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible.  Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

Finally there is the basic operation:

	int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);

which returns a boolean indicating whether bit "nr" is set in the
bitmask pointed to by "addr".

If explicit memory barriers are required around clear_bit() (which
does not return a value, and thus does not need to provide memory
barrier semantics), two interfaces are provided:

	void smp_mb__before_clear_bit(void);
	void smp_mb__after_clear_bit(void);

They are used as follows, and are akin to their atomic_t operation
brothers:

	/* All memory operations before this call will
	 * be globally visible before the clear_bit().
	 */
	smp_mb__before_clear_bit();
	clear_bit( ... );

	/* The clear_bit() will be visible before all
	 * subsequent memory operations.
	 */
	smp_mb__after_clear_bit();

Finally, there are non-atomic versions of the bitmask operations
provided.  They are used in contexts where some other higher-level SMP
locking scheme is being used to protect the bitmask, and thus less
expensive non-atomic operations may be used in the implementation.
They have names similar to the above bitmask operation interfaces,
except that two underscores are prefixed to the interface name.

	void __set_bit(unsigned long nr, volatile unsigned long *addr);
	void __clear_bit(unsigned long nr, volatile unsigned long *addr);
	void __change_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
	int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

These non-atomic variants also do not require any special memory
barrier semantics.
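For instance, a bitmask that is only ever modified under a spinlock
can safely use the cheaper variants.  A minimal sketch, where
obj->lock, obj->flags, and the OBJ_DIRTY/OBJ_CLEAN bit numbers are
hypothetical:

	spin_lock(&obj->lock);

	/* The lock already serializes all updates of obj->flags,
	 * so the non-atomic variants are sufficient here.
	 */
	__set_bit(OBJ_DIRTY, &obj->flags);
	__clear_bit(OBJ_CLEAN, &obj->flags);

	spin_unlock(&obj->lock);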
The routines xchg() and cmpxchg() need the same exact memory barriers
as the atomic and bit operations returning values.

Spinlocks and rwlocks have memory barrier expectations as well.
The rule to follow is simple:

1) When acquiring a lock, the implementation must make it globally
   visible before any subsequent memory operation.

2) When releasing a lock, the implementation must make it such that
   all previous memory operations are globally visible before the
   lock release.

Which finally brings us to _atomic_dec_and_lock().  There is an
architecture-neutral version implemented in lib/dec_and_lock.c,
but most platforms will wish to optimize this in assembler.

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);

Atomically decrement the given counter, and if it will drop to zero
atomically acquire the given spinlock and perform the decrement
of the counter to zero.  If it does not drop to zero, do nothing
with the spinlock.

It is actually pretty simple to get the memory barrier correct.
Simply satisfy the spinlock grab requirements, which is to make
sure the spinlock operation is globally visible before any
subsequent memory operation.

We can demonstrate this operation more clearly if we define
an abstract atomic operation:

	long cas(long *mem, long old, long new);

"cas" stands for "compare and swap".  It atomically:

1) Compares "old" with the value currently at "mem".
2) If they are equal, "new" is written to "mem".
3) Regardless, the current value at "mem" is returned.

As an example usage, here is what an atomic counter update
might look like:

void example_atomic_inc(long *counter)
{
	long old, new, ret;

	while (1) {
		old = *counter;
		new = old + 1;

		ret = cas(counter, old, new);
		if (ret == old)
			break;
	}
}

Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():

int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
{
	long old, new, ret;
	int went_to_zero;

	went_to_zero = 0;
	while (1) {
		old = atomic_read(atomic);
		new = old - 1;
		if (new == 0) {
			went_to_zero = 1;
			spin_lock(lock);
		}
		ret = cas(atomic, old, new);
		if (ret == old)
			break;
		if (went_to_zero) {
			spin_unlock(lock);
			went_to_zero = 0;
		}
	}

	return went_to_zero;
}

Now, as far as memory barriers go, as long as spin_lock()
strictly orders all subsequent memory operations (including
the cas()) with respect to itself, things will be fine.

Said another way, _atomic_dec_and_lock() must guarantee that
a counter dropping to zero is never made visible before the
spinlock being acquired.

Note that this also means that for the case where the counter
is not dropping to zero, there are no memory ordering
requirements.
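For completeness, here is a rough sketch of how a caller might use
this interface to implement the reference-count-plus-list pattern
shown earlier, reusing struct obj, global_list_lock, obj_list_del()
and obj_destroy() from the reference counting example above:

void obj_put(struct obj *obj)
{
	if (_atomic_dec_and_lock(&obj->refcnt, &global_list_lock)) {
		/* This was the last reference, and the spinlock is
		 * now held, so no other cpu can find the object on
		 * the list and take a new reference to it.
		 */
		obj_list_del(obj);
		spin_unlock(&global_list_lock);
		obj_destroy(obj);
	}
}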