Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

atomics: Provide rcuref - scalable reference counting

atomic_t based reference counting, including refcount_t, uses
atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
implemented with an atomic_try_cmpxchg() loop. High contention on the
reference count leads to retry loops and scales badly. There is nothing to
improve on this implementation as the semantics have to be preserved.

Provide rcuref as a scalable alternative which is suitable for RCU
managed objects. Like refcount_t, it comes with overflow and underflow
detection and mitigation.

rcuref treats the underlying atomic_t as an unsigned integer and partitions
this space into zones:

0x00000000 - 0x7FFFFFFF valid zone (1 .. (INT_MAX + 1) references)
0x80000000 - 0xBFFFFFFF saturation zone
0xC0000000 - 0xFFFFFFFE dead zone
0xFFFFFFFF no reference
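
The zone partitioning can be illustrated with a small user-space sketch (not kernel code; the constants mirror the kernel's RCUREF_* values and the predicate names are made up for this illustration):

```c
#include <stdbool.h>

/* Zone boundaries from the table above. The values mirror the kernel's
 * RCUREF_* constants; this is a user-space illustration, not kernel code. */
#define ONEREF    0x00000000U	/* counter value for exactly one reference */
#define MAXREF    0x7FFFFFFFU	/* last value of the valid zone */
#define SATURATED 0xA0000000U	/* midpoint of the saturation zone */
#define DEAD      0xE0000000U	/* midpoint of the dead zone */
#define NOREF     0xFFFFFFFFU	/* no reference held */

/* The counter represents a live reference iff it is in the valid zone. */
static bool in_valid_zone(unsigned int cnt)
{
	return cnt <= MAXREF;
}

static bool in_saturation_zone(unsigned int cnt)
{
	return cnt >= 0x80000000U && cnt <= 0xBFFFFFFFU;
}

/* NOREF (0xFFFFFFFF) is deliberately excluded: it is not part of the dead
 * zone but the "no reference" marker the put() slowpath looks for. */
static bool in_dead_zone(unsigned int cnt)
{
	return cnt >= 0xC0000000U && cnt <= 0xFFFFFFFEU;
}
```

Note that the counter is biased: a value of 0 means one held reference, which is why the valid zone covers 1 .. (INT_MAX + 1) references.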

rcuref_get() unconditionally increments the reference count with
atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
reference count with atomic_add_negative_release().

This unconditional increment avoids the inc_not_zero() problem, but
requires a more complex implementation on the put() side when the count
drops from 0 to -1.
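
The fast paths can be modeled in user space with C11 atomics (a sketch only; model_get()/model_put() are hypothetical names, and the kernel's atomic_add_negative() sign check is emulated with fetch-add/fetch-sub):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* User-space model: one reference is counter value 0, i.e. the counter
 * is biased by -1 relative to the number of held references. */
typedef struct { atomic_int refcnt; } rcuref_model_t;

/* get() fast path: unconditional increment, sign-checked afterwards. */
static bool model_get(rcuref_model_t *ref)
{
	/* A non-negative result means the counter stayed in the valid zone. */
	if (atomic_fetch_add_explicit(&ref->refcnt, 1, memory_order_relaxed) + 1 >= 0)
		return true;
	/* Saturation/dead zone handling would go to a slowpath here. */
	return false;
}

/* put() fast path: unconditional decrement, sign-checked afterwards.
 * A negative result signals the 0 -> -1 transition (last reference). */
static bool model_put(rcuref_model_t *ref)
{
	if (atomic_fetch_sub_explicit(&ref->refcnt, 1, memory_order_release) - 1 >= 0)
		return false;	/* references remain */
	/* Last drop: the real code now tries to cmpxchg() NOREF to DEAD. */
	return true;
}
```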

When this transition is detected, the put() slowpath attempts to mark the
reference count dead by setting it to the midpoint of the dead zone with a
single atomic_cmpxchg_release() operation. This operation can fail due to a
concurrent rcuref_get() elevating the reference count from -1 to 0 again.

If the unconditional increment in rcuref_get() hits a reference count which
is marked dead (or saturated) it will detect it after the fact and bring
the reference count back to the midpoint of the respective zone. The zones
provide enough tolerance to make it practically impossible to escape from
a zone.
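
The dead-marking cmpxchg and the after-the-fact zone correction can be sketched the same way (again a user-space model with assumed M_* names mirroring the kernel constants, not the kernel API):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define M_MAXREF    0x7FFFFFFFU
#define M_SATURATED 0xA0000000U	/* midpoint of the saturation zone */
#define M_RELEASED  0xC0000000U	/* first value of the dead zone */
#define M_DEAD      0xE0000000U	/* midpoint of the dead zone */
#define M_NOREF     0xFFFFFFFFU	/* counter after the 0 -> -1 transition */

/* put() slowpath once the counter reached NOREF: try to mark it DEAD.
 * A concurrent get() may have elevated the counter again; then the
 * cmpxchg fails and the caller must not destroy the object. */
static bool model_put_slowpath(_Atomic unsigned int *refcnt)
{
	unsigned int expected = M_NOREF;

	return atomic_compare_exchange_strong(refcnt, &expected, M_DEAD);
}

/* get() slowpath: the increment already happened. If the counter landed
 * in the dead zone, restore the midpoint and fail; if it landed in the
 * saturation zone, pin it to that zone's midpoint and succeed. */
static bool model_get_slowpath(_Atomic unsigned int *refcnt)
{
	unsigned int cnt = atomic_load(refcnt);

	if (cnt >= M_RELEASED) {
		atomic_store(refcnt, M_DEAD);
		return false;
	}
	if (cnt > M_MAXREF)
		atomic_store(refcnt, M_SATURATED);
	return true;
}
```

With 2^28 values of headroom on either side of each midpoint, racing unconditional increments and decrements cannot realistically push the counter out of its zone before it is reset.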

The racy implementation of rcuref_put() requires protecting rcuref_put()
against a grace period ending in order to prevent a subtle use after
free. As RCU is the only mechanism which allows protecting against that,
it is not possible to fully replace the atomic_inc_not_zero() based
implementation of refcount_t with this scheme.

The final drop is slightly more expensive than the atomic_dec_return()
counterpart, but that's not the case which this is optimized for. The
optimization is for the high frequency get()/put() pairs and their
scalability.

The performance of an uncontended rcuref_get()/put() pair where the put()
is not dropping the last reference is still on par with the plain atomic
operations, while at the same time providing overflow and underflow
detection and mitigation.

The performance of rcuref compared to plain atomic_inc_not_zero() and
atomic_dec_return() based reference counting under contention:

- Micro benchmark: All CPUs running an increment/decrement loop on an
elevated reference count, which means the 0 to -1 transition never
happens.

The performance gain depends on microarchitecture and the number of
CPUs and has been observed in the range of 1.3X to 4.7X.

- Conversion of dst_entry::__refcnt to rcuref and testing with the
localhost memtier/memcached benchmark. That benchmark shows the
reference count contention prominently.

The performance gain depends on microarchitecture and the number of
CPUs and has been observed in the range of 1.1X to 2.6X over the
previous fix for the false sharing issue vs. struct
dst_entry::__refcnt.

When memtier is run over a real 1Gb network connection, there is a
small gain on top of the false sharing fix. The two changes combined
result in a 2%-5% total gain for that networked test.
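
The elevated-count increment/decrement loop of the micro benchmark above can be approximated in user space along these lines (a sketch with arbitrary thread and iteration counts, using pthreads and C11 atomics):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define NITERS   100000

/* The counter starts elevated so the 0 -> -1 transition never happens
 * and only the fast paths are exercised, as in the micro benchmark. */
static atomic_int refcnt = 1000;

static void *churn(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++) {
		/* get(): unconditional relaxed increment */
		atomic_fetch_add_explicit(&refcnt, 1, memory_order_relaxed);
		/* put(): unconditional release decrement */
		atomic_fetch_sub_explicit(&refcnt, 1, memory_order_release);
	}
	return NULL;
}

/* Run the loop on NTHREADS threads and return the final counter value. */
static int run_benchmark(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, churn, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&refcnt);
}
```

The contrast with an atomic_inc_not_zero() style cmpxchg loop is that every iteration here completes in a single atomic instruction regardless of contention.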

Reported-by: Wangyang Guo <wangyang.guo@intel.com>
Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230323102800.158429195@linutronix.de

Authored by Thomas Gleixner, committed by Peter Zijlstra
ee1ee6db e5ab9eff

+443 -1
+155
include/linux/rcuref.h
/* SPDX-License-Identifier: GPL-2.0-only */
#ifndef _LINUX_RCUREF_H
#define _LINUX_RCUREF_H

#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/limits.h>
#include <linux/lockdep.h>
#include <linux/preempt.h>
#include <linux/rcupdate.h>

#define RCUREF_ONEREF		0x00000000U
#define RCUREF_MAXREF		0x7FFFFFFFU
#define RCUREF_SATURATED	0xA0000000U
#define RCUREF_RELEASED		0xC0000000U
#define RCUREF_DEAD		0xE0000000U
#define RCUREF_NOREF		0xFFFFFFFFU

/**
 * rcuref_init - Initialize a rcuref reference count with the given reference count
 * @ref:	Pointer to the reference count
 * @cnt:	The initial reference count typically '1'
 */
static inline void rcuref_init(rcuref_t *ref, unsigned int cnt)
{
	atomic_set(&ref->refcnt, cnt - 1);
}

/**
 * rcuref_read - Read the number of held reference counts of a rcuref
 * @ref:	Pointer to the reference count
 *
 * Return: The number of held references (0 ... N)
 */
static inline unsigned int rcuref_read(rcuref_t *ref)
{
	unsigned int c = atomic_read(&ref->refcnt);

	/* Return 0 if within the DEAD zone. */
	return c >= RCUREF_RELEASED ? 0 : c + 1;
}

extern __must_check bool rcuref_get_slowpath(rcuref_t *ref);

/**
 * rcuref_get - Acquire one reference on a rcuref reference count
 * @ref:	Pointer to the reference count
 *
 * Similar to atomic_inc_not_zero() but saturates at RCUREF_MAXREF.
 *
 * Provides no memory ordering, it is assumed the caller has guaranteed the
 * object memory to be stable (RCU, etc.). It does provide a control dependency
 * and thereby orders future stores. See documentation in lib/rcuref.c
 *
 * Return:
 *	False if the attempt to acquire a reference failed. This happens
 *	when the last reference has been put already
 *
 *	True if a reference was successfully acquired
 */
static inline __must_check bool rcuref_get(rcuref_t *ref)
{
	/*
	 * Unconditionally increase the reference count. The saturation and
	 * dead zones provide enough tolerance for this.
	 */
	if (likely(!atomic_add_negative_relaxed(1, &ref->refcnt)))
		return true;

	/* Handle the cases inside the saturation and dead zones */
	return rcuref_get_slowpath(ref);
}

extern __must_check bool rcuref_put_slowpath(rcuref_t *ref);

/*
 * Internal helper. Do not invoke directly.
 */
static __always_inline __must_check bool __rcuref_put(rcuref_t *ref)
{
	RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(),
			 "suspicious rcuref_put_rcusafe() usage");
	/*
	 * Unconditionally decrease the reference count. The saturation and
	 * dead zones provide enough tolerance for this.
	 */
	if (likely(!atomic_add_negative_release(-1, &ref->refcnt)))
		return false;

	/*
	 * Handle the last reference drop and cases inside the saturation
	 * and dead zones.
	 */
	return rcuref_put_slowpath(ref);
}

/**
 * rcuref_put_rcusafe -- Release one reference for a rcuref reference count RCU safe
 * @ref:	Pointer to the reference count
 *
 * Provides release memory ordering, such that prior loads and stores are done
 * before, and provides an acquire ordering on success such that free()
 * must come after.
 *
 * Can be invoked from contexts, which guarantee that no grace period can
 * happen which would free the object concurrently if the decrement drops
 * the last reference and the slowpath races against a concurrent get() and
 * put() pair. rcu_read_lock()'ed and atomic contexts qualify.
 *
 * Return:
 *	True if this was the last reference with no future references
 *	possible. This signals the caller that it can safely release the
 *	object which is protected by the reference counter.
 *
 *	False if there are still active references or the put() raced
 *	with a concurrent get()/put() pair. Caller is not allowed to
 *	release the protected object.
 */
static inline __must_check bool rcuref_put_rcusafe(rcuref_t *ref)
{
	return __rcuref_put(ref);
}

/**
 * rcuref_put -- Release one reference for a rcuref reference count
 * @ref:	Pointer to the reference count
 *
 * Can be invoked from any context.
 *
 * Provides release memory ordering, such that prior loads and stores are done
 * before, and provides an acquire ordering on success such that free()
 * must come after.
 *
 * Return:
 *
 *	True if this was the last reference with no future references
 *	possible. This signals the caller that it can safely schedule the
 *	object, which is protected by the reference counter, for
 *	deconstruction.
 *
 *	False if there are still active references or the put() raced
 *	with a concurrent get()/put() pair. Caller is not allowed to
 *	deconstruct the protected object.
 */
static inline __must_check bool rcuref_put(rcuref_t *ref)
{
	bool released;

	preempt_disable();
	released = __rcuref_put(ref);
	preempt_enable();
	return released;
}

#endif
+6
include/linux/types.h
 } atomic64_t;
 #endif

+typedef struct {
+	atomic_t refcnt;
+} rcuref_t;
+
+#define RCUREF_INIT(i)	{ .refcnt = ATOMIC_INIT(i - 1) }
+
 struct list_head {
 	struct list_head *next, *prev;
 };
+1 -1
lib/Makefile
 	 list_sort.o uuid.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
-	 once.o refcount.o usercopy.o errseq.o bucket_locks.o \
+	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
 	 generic-radix-tree.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
+281
lib/rcuref.c
// SPDX-License-Identifier: GPL-2.0-only

/*
 * rcuref - A scalable reference count implementation for RCU managed objects
 *
 * rcuref is provided to replace open coded reference count implementations
 * based on atomic_t. It protects explicitly RCU managed objects which can
 * be visible even after the last reference has been dropped and the object
 * is heading towards destruction.
 *
 * A common usage pattern is:
 *
 * get()
 *	rcu_read_lock();
 *	p = get_ptr();
 *	if (p && !atomic_inc_not_zero(&p->refcnt))
 *		p = NULL;
 *	rcu_read_unlock();
 *	return p;
 *
 * put()
 *	if (!atomic_dec_return(&p->refcnt)) {
 *		remove_ptr(p);
 *		kfree_rcu(p, rcu);
 *	}
 *
 * atomic_inc_not_zero() is implemented with a try_cmpxchg() loop which has
 * O(N^2) behaviour under contention with N concurrent operations.
 *
 * rcuref uses atomic_add_negative_relaxed() for the fast path, which scales
 * better under contention.
 *
 * Why not refcount?
 * =================
 *
 * In principle it should be possible to make refcount use the rcuref
 * scheme, but the destruction race described below cannot be prevented
 * unless the protected object is RCU managed.
 *
 * Theory of operation
 * ===================
 *
 * rcuref uses an unsigned integer reference counter. As long as the
 * counter value is greater than or equal to RCUREF_ONEREF and not larger
 * than RCUREF_MAXREF the reference is alive:
 *
 *	ONEREF	MAXREF		SATURATED	RELEASED	DEAD		NOREF
 *	0	0x7FFFFFFF	0xA0000000	0xC0000000	0xE0000000	0xFFFFFFFF
 *
 *	0x00000000 - 0x7FFFFFFF		valid zone
 *	0x80000000 - 0xBFFFFFFF		saturation zone
 *	0xC0000000 - 0xFFFFFFFE		dead zone
 *	0xFFFFFFFF			no reference
 *
 * The get() and put() operations do unconditional increments and
 * decrements. The result is checked after the operation. This optimizes
 * for the fast path.
 *
 * If the reference count is saturated or dead, then the increments and
 * decrements are not harmful as the reference count still stays in the
 * respective zones and is always set back to SATURATED resp. DEAD. The
 * zones have room for 2^28 racing operations in each direction, which
 * makes it practically impossible to escape the zones.
 *
 * Once the last reference is dropped the reference count becomes
 * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
 * slowpath then tries to set the reference count from RCUREF_NOREF to
 * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
 * concurrent rcuref_get() can acquire the reference count and bring it
 * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
 *
 * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
 * DEAD + 1, which is inside the dead zone. If that happens the reference
 * count is put back to DEAD.
 *
 * The actual race is possible due to the unconditional increment and
 * decrements in rcuref_get() and rcuref_put():
 *
 *	T1				T2
 *	get()				put()
 *					if (atomic_add_negative(-1, &ref->refcnt))
 *		succeeds->		atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);
 *
 *	atomic_add_negative(1, &ref->refcnt);	<- Elevates refcount to DEAD + 1
 *
 * As the result of T1's add is negative, the get() goes into the slow path
 * and observes refcnt being in the dead zone which makes the operation fail.
 *
 * Possible critical states:
 *
 *	Context		Counter	References	Operation
 *	T1		0	1		init()
 *	T2		1	2		get()
 *	T1		0	1		put()
 *	T2	       -1	0		put() tries to mark dead
 *	T1		0	1		get()
 *	T2		0	1		put() mark dead fails
 *	T1	       -1	0		put() tries to mark dead
 *	T1	     DEAD	0		put() mark dead succeeds
 *	T2	   DEAD+1	0		get() fails and puts it back to DEAD
 *
 * Of course there are more complex scenarios, but the above illustrates
 * the working principle. The rest is left to the imagination of the
 * reader.
 *
 * Deconstruction race
 * ===================
 *
 * The release operation must be protected by prohibiting a grace period in
 * order to prevent a possible use after free:
 *
 *	T1				T2
 *	put()				get()
 *	// ref->refcnt = ONEREF
 *	if (!atomic_add_negative(-1, &ref->refcnt))
 *		return false;				<- Not taken
 *
 *	// ref->refcnt == NOREF
 *	--> preemption
 *					// Elevates ref->refcnt to ONEREF
 *					if (!atomic_add_negative(1, &ref->refcnt))
 *						return true;	<- taken
 *
 *					if (put(&p->ref)) {	<-- Succeeds
 *						remove_pointer(p);
 *						kfree_rcu(p, rcu);
 *					}
 *
 *		RCU grace period ends, object is freed
 *
 *	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF
 *
 * This is prevented by disabling preemption around the put() operation as
 * that's in most kernel configurations cheaper than a rcu_read_lock() /
 * rcu_read_unlock() pair and in many cases even a NOOP. In any case it
 * prevents the grace period which keeps the object alive until all put()
 * operations complete.
 *
 * Saturation protection
 * =====================
 *
 * The reference count has a saturation limit RCUREF_MAXREF (INT_MAX).
 * Once this is exceeded the reference count becomes stale by setting it
 * to RCUREF_SATURATED, which will cause a memory leak, but it prevents
 * wrap arounds which obviously cause worse problems than a memory
 * leak. When saturation is reached a warning is emitted.
 *
 * Race conditions
 * ===============
 *
 * All reference count increment/decrement operations are unconditional and
 * only verified after the fact. This optimizes for the good case and takes
 * the occasional race vs. a dead or already saturated refcount into
 * account. The saturation and dead zones are large enough to accommodate
 * for that.
 *
 * Memory ordering
 * ===============
 *
 * Memory ordering rules are slightly relaxed wrt regular atomic_t functions
 * and provide only what is strictly required for refcounts.
 *
 * The increments are fully relaxed; these will not provide ordering. The
 * rationale is that whatever is used to obtain the object to increase the
 * reference count on will provide the ordering. For locked data
 * structures, it's the lock acquire, for RCU/lockless data structures it's
 * the dependent load.
 *
 * rcuref_get() provides a control dependency ordering future stores which
 * ensures that the object is not modified when acquiring a reference
 * fails.
 *
 * rcuref_put() provides release order, i.e. all prior loads and stores
 * will be issued before. It also provides a control dependency ordering
 * against the subsequent destruction of the object.
 *
 * If rcuref_put() successfully dropped the last reference and marked the
 * object DEAD it also provides acquire ordering.
 */

#include <linux/export.h>
#include <linux/rcuref.h>

/**
 * rcuref_get_slowpath - Slowpath of rcuref_get()
 * @ref:	Pointer to the reference count
 *
 * Invoked when the reference count is outside of the valid zone.
 *
 * Return:
 *	False if the reference count was already marked dead
 *
 *	True if the reference count is saturated, which prevents the
 *	object from being deconstructed ever.
 */
bool rcuref_get_slowpath(rcuref_t *ref)
{
	unsigned int cnt = atomic_read(&ref->refcnt);

	/*
	 * If the reference count was already marked dead, undo the
	 * increment so it stays in the middle of the dead zone and return
	 * fail.
	 */
	if (cnt >= RCUREF_RELEASED) {
		atomic_set(&ref->refcnt, RCUREF_DEAD);
		return false;
	}

	/*
	 * If it was saturated, warn and mark it so. In case the increment
	 * was already on a saturated value restore the saturation
	 * marker. This keeps it in the middle of the saturation zone and
	 * prevents the reference count from overflowing. This leaks the
	 * object memory, but prevents the obvious reference count overflow
	 * damage.
	 */
	if (WARN_ONCE(cnt > RCUREF_MAXREF, "rcuref saturated - leaking memory"))
		atomic_set(&ref->refcnt, RCUREF_SATURATED);
	return true;
}
EXPORT_SYMBOL_GPL(rcuref_get_slowpath);

/**
 * rcuref_put_slowpath - Slowpath of __rcuref_put()
 * @ref:	Pointer to the reference count
 *
 * Invoked when the reference count is outside of the valid zone.
 *
 * Return:
 *	True if this was the last reference with no future references
 *	possible. This signals the caller that it can safely schedule the
 *	object, which is protected by the reference counter, for
 *	deconstruction.
 *
 *	False if there are still active references or the put() raced
 *	with a concurrent get()/put() pair. Caller is not allowed to
 *	deconstruct the protected object.
 */
bool rcuref_put_slowpath(rcuref_t *ref)
{
	unsigned int cnt = atomic_read(&ref->refcnt);

	/* Did this drop the last reference? */
	if (likely(cnt == RCUREF_NOREF)) {
		/*
		 * Carefully try to set the reference count to RCUREF_DEAD.
		 *
		 * This can fail if a concurrent get() operation has
		 * elevated it again or the corresponding put() even marked
		 * it dead already. Both are valid situations and do not
		 * require a retry. If this fails the caller is not
		 * allowed to deconstruct the object.
		 */
		if (atomic_cmpxchg_release(&ref->refcnt, RCUREF_NOREF, RCUREF_DEAD) != RCUREF_NOREF)
			return false;

		/*
		 * The caller can safely schedule the object for
		 * deconstruction. Provide acquire ordering.
		 */
		smp_acquire__after_ctrl_dep();
		return true;
	}

	/*
	 * If the reference count was already in the dead zone, then this
	 * put() operation is imbalanced. Warn, put the reference count back to
	 * DEAD and tell the caller to not deconstruct the object.
	 */
	if (WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")) {
		atomic_set(&ref->refcnt, RCUREF_DEAD);
		return false;
	}

	/*
	 * This is a put() operation on a saturated refcount. Restore the
	 * mean saturation value and tell the caller to not deconstruct the
	 * object.
	 */
	if (cnt > RCUREF_MAXREF)
		atomic_set(&ref->refcnt, RCUREF_SATURATED);
	return false;
}
EXPORT_SYMBOL_GPL(rcuref_put_slowpath);