seqcount: replace smp_rmb() in read_seqcount() with load acquire

Many architectures support load-acquire instructions, which can replace
an explicit memory barrier and save some cycles.

A typical sequence

	do {
		seq = read_seqcount_begin(&s);
		<something>
	} while (read_seqcount_retry(&s, seq));

requires 13 cycles on a Neoverse N1 arm64 core (an Ampere Altra, to be
specific) for an empty loop. Two read memory barriers are needed, one
for each of the seqcount_*() functions, as sketched below.
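Roughly, the two barriers sit here (a simplified sketch of the
pre-patch helpers from include/linux/seqlock.h, not the verbatim
kernel source):

	/* Barrier #1: order the load of the sequence count before
	 * the loads inside the critical section. */
	#define read_seqcount_begin(s)				\
	({							\
		unsigned _seq = READ_ONCE((s)->sequence);	\
								\
		smp_rmb();					\
		_seq;						\
	})

	/* Barrier #2: order the critical-section loads before the
	 * re-read of the sequence count. */
	#define read_seqcount_retry(s, start)			\
	({							\
		smp_rmb();					\
		READ_ONCE((s)->sequence) != (start);		\
	})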

We can replace the first read barrier with a load acquire of the
seqcount, which saves us one barrier.

On the Altra doing so reduces the cycle count from 13 to 8.
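Concretely, the patch swaps the two-instruction pattern for a single
acquiring load. A sketch (the helper names here are illustrative, and
exact code generation depends on compiler and configuration):

	/* Old model: plain load, then an explicit read barrier.
	 * On arm64 this compiles to an LDR followed by a DMB ISHLD. */
	static inline unsigned read_seq_rmb(const seqcount_t *s)
	{
		unsigned seq = READ_ONCE(s->sequence);

		smp_rmb();
		return seq;
	}

	/* New model: the ordering is part of the load itself.
	 * On arm64 this compiles to a single LDAR. */
	static inline unsigned read_seq_acquire(const seqcount_t *s)
	{
		return smp_load_acquire(&s->sequence);
	}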

According to ARM, this is a general improvement for the ARM64
architecture and not specific to any particular processor.

See

https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

"Weaker ordering requirements that are imposed by Load-Acquire and
Store-Release instructions allow for micro-architectural
optimizations, which could reduce some of the performance impacts that
are otherwise imposed by an explicit memory barrier.

If the ordering requirement is satisfied using either a Load-Acquire
or Store-Release, then it would be preferable to use these
instructions instead of a DMB"

[ NOTE! This is my original minimal patch that unconditionally switches
over to using smp_load_acquire(), instead of the much more involved
and subtle patch that Christoph Lameter wrote, which made it
conditional.

But Christoph gets authorship credit because I had initially thought
that we needed the more complex model, and Christoph ran with it and
did the work. Only after looking at code generation for all the
relevant architectures did I come to the conclusion that nobody
actually needs the old "smp_rmb()" model.

Even architectures without native load-acquire support generally do as
well or better with smp_load_acquire(); see the fallback sketch after
this note.

So credit to Christoph, but if this then causes issues on other
architectures, put the blame solidly on me.

Also note that, as part of the ruthless simplification, this gets rid of the
overly subtle optimization where some code uses a non-barrier version
of the sequence count (see the __read_seqcount_begin() users in
fs/namei.c). They then play games with their own barriers and/or with
nested sequence counts.

Those optimizations are literally meaningless on x86, and questionable
elsewhere. If somebody can show that they matter, we need to re-do
them more cleanly than "use an internal helper". - Linus ]
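For reference: on an architecture that does not provide its own
acquire primitive, the generic fallback is a plain load followed by a
full barrier, roughly like this (a simplified sketch of the
asm-generic definition, not the verbatim kernel source):

	#define smp_load_acquire(p)				\
	({							\
		typeof(*(p)) ___p1 = READ_ONCE(*(p));		\
		smp_mb();	/* at least as strong as smp_rmb() */	\
		___p1;						\
	})

So the worst case is trading the smp_rmb() after the load for a full
barrier after it, and architectures with a real acquire load skip the
explicit barrier entirely.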

Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Link: https://lore.kernel.org/all/20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Christoph Lameter (Ampere), committed by Linus Torvalds
d0dd066a de5cb0dc

+5 -20
include/linux/seqlock.h
···
 static __always_inline unsigned						\
 __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
 {									\
-	unsigned seq = READ_ONCE(s->seqcount.sequence);			\
+	unsigned seq = smp_load_acquire(&s->seqcount.sequence);	\
 									\
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
 		return seq;						\
···
 		 * Re-read the sequence counter since the (possibly	\
 		 * preempted) writer made progress.			\
 		 */							\
-		seq = READ_ONCE(s->seqcount.sequence);			\
+		seq = smp_load_acquire(&s->seqcount.sequence);		\
 	}								\
 									\
 	return seq;							\
···
 
 static inline unsigned __seqprop_sequence(const seqcount_t *s)
 {
-	return READ_ONCE(s->sequence);
+	return smp_load_acquire(&s->sequence);
 }
 
 static inline bool __seqprop_preemptible(const seqcount_t *s)
···
 #define seqprop_assert(s)	__seqprop(s, assert)(s)
 
 /**
- * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
+ * __read_seqcount_begin() - begin a seqcount_t read section
  * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
- *
- * __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb()
- * barrier. Callers should ensure that smp_rmb() or equivalent ordering is
- * provided before actually loading any of the variables that are to be
- * protected in this critical section.
- *
- * Use carefully, only in critical code, and comment how the barrier is
- * provided.
  *
  * Return: count to be passed to read_seqcount_retry()
  */
···
  *
  * Return: count to be passed to read_seqcount_retry()
  */
-#define raw_read_seqcount_begin(s)					\
-({									\
-	unsigned _seq = __read_seqcount_begin(s);			\
-									\
-	smp_rmb();							\
-	_seq;								\
-})
+#define raw_read_seqcount_begin(s) __read_seqcount_begin(s)
 
 /**
  * read_seqcount_begin() - begin a seqcount_t read critical section
···
 ({									\
 	unsigned __seq = seqprop_sequence(s);				\
 									\
-	smp_rmb();							\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
 	__seq;								\
 })
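As a way to experiment with the resulting read protocol outside the
kernel, here is a minimal self-contained userspace model using C11
atomics in place of the kernel primitives (all names are made up for
illustration; this is a sketch of the pattern, not the kernel
implementation):

	/* seqcount_model.c: build with "cc -O2 seqcount_model.c" */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct seqcount {
		atomic_uint sequence;
	};

	static atomic_uint payload;	/* datum protected by the seqcount */

	/* Reader entry: one acquire load replaces the old READ_ONCE() +
	 * smp_rmb() pair. Spin while the count is odd (writer active). */
	static unsigned model_read_begin(const struct seqcount *s)
	{
		unsigned seq;

		while ((seq = atomic_load_explicit(&s->sequence,
						   memory_order_acquire)) & 1)
			;
		return seq;
	}

	/* Reader exit: order the critical-section loads before re-reading
	 * the count (the kernel keeps an smp_rmb() here, in
	 * read_seqcount_retry()). */
	static bool model_read_retry(const struct seqcount *s, unsigned start)
	{
		atomic_thread_fence(memory_order_acquire);
		return atomic_load_explicit(&s->sequence,
					    memory_order_relaxed) != start;
	}

	/* Writer: bump the count to odd, update, bump back to even.
	 * Release fences keep the three stores in order. */
	static void model_write(struct seqcount *s, unsigned value)
	{
		atomic_fetch_add_explicit(&s->sequence, 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
		atomic_store_explicit(&payload, value, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
		atomic_fetch_add_explicit(&s->sequence, 1, memory_order_relaxed);
	}

	int main(void)
	{
		struct seqcount sc = { .sequence = 0 };
		unsigned seq, v;

		model_write(&sc, 42);
		do {
			seq = model_read_begin(&sc);
			v = atomic_load_explicit(&payload, memory_order_relaxed);
		} while (model_read_retry(&sc, seq));

		printf("read %u at sequence %u\n", v, seq);
		return 0;
	}

The acquire load in model_read_begin() corresponds to the
smp_load_acquire() introduced by this patch; the acquire fence in
model_read_retry() corresponds to the smp_rmb() that remains in the
retry path.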