Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

srcu: Clarify comments on memory barrier "E"

There is an smp_mb() named "E" in srcu_flip() immediately before the
increment (flip) of the srcu_struct structure's ->srcu_idx.

The purpose of E is to order the preceding scan's read of lock counters
against the flipping of the ->srcu_idx, in order to prevent new readers
from continuing to use the old ->srcu_idx value, which might needlessly
extend the grace period.

However, this ordering is already enforced because of the control
dependency between the preceding scan and the ->srcu_idx flip.
This control dependency exists because atomic_long_read() is used
to scan the counts, because WRITE_ONCE() is used to flip ->srcu_idx,
and because ->srcu_idx is not flipped until the ->srcu_lock_count[] and
->srcu_unlock_count[] counts match. And such a match cannot happen when
there is an in-flight reader that started before the flip (observation
courtesy Mathieu Desnoyers).
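
The scan-then-flip shape described above can be sketched in user-space C11
as follows. This is only an illustrative model, not the kernel code: the
names model_srcu, model_readers_idle() and model_try_flip() are hypothetical,
and the per-CPU counter sums are collapsed to a single counter per index.
C11 relaxed atomics do not formally promise control-dependency ordering
either, so the sketch keeps the fence that corresponds to E.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the srcu_struct fields discussed above. */
struct model_srcu {
	atomic_long lock_count[2];
	atomic_long unlock_count[2];
	atomic_uint idx;
};

/* Scan one index: read unlocks, full fence (A), then read locks. */
static bool model_readers_idle(struct model_srcu *sp, unsigned int i)
{
	long unlocks, locks;

	assert(i < 2);
	unlocks = atomic_load_explicit(&sp->unlock_count[i],
				       memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* A */
	locks = atomic_load_explicit(&sp->lock_count[i],
				     memory_order_relaxed);
	return locks == unlocks;
}

/* Flip idx only if the scan of the currently inactive index matched. */
static bool model_try_flip(struct model_srcu *sp)
{
	unsigned int cur = atomic_load_explicit(&sp->idx,
						memory_order_relaxed);

	if (!model_readers_idle(sp, (cur & 1) ^ 1))
		return false;	/* Mismatch: the store below is never reached. */

	/* The store is conditional on the loads above (control dependency). */
	atomic_thread_fence(memory_order_seq_cst);	/* E, kept as in the commit */
	atomic_store_explicit(&sp->idx, cur + 1, memory_order_relaxed);
	return true;
}
```

The point of the sketch is structural: the store to idx sits behind a branch
whose condition is computed from the counter loads, which is the shape that
gives rise to the control dependency in the kernel code.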

The litmus test below (courtesy of Frederic Weisbecker, with changes
for ctrldep by Boqun and Joel) shows this:

C srcu
(*
 * bad condition: P0's first scan (SCAN1) saw P1's idx=1 LOCK count inc, though P1 saw flip.
 *
 * So basically, the ->po ordering on both P0 and P1 is enforced via ->ppo
 * (control deps) on both sides, and both P0 and P1 are interconnected by ->rf
 * relations. Combining the ->ppo with ->rf, a cycle is impossible.
 *)

{}

// updater
P0(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int lock1;
	int unlock1;
	int lock0;
	int unlock0;

	// SCAN1
	unlock1 = READ_ONCE(*UNLOCK1);
	smp_mb(); // A
	lock1 = READ_ONCE(*LOCK1);

	// FLIP
	if (lock1 == unlock1) { // Control dep
		smp_mb(); // E // Remove E and still passes.
		WRITE_ONCE(*IDX, 1);
		smp_mb(); // D

		// SCAN2
		unlock0 = READ_ONCE(*UNLOCK0);
		smp_mb(); // A
		lock0 = READ_ONCE(*LOCK0);
	}
}

// reader
P1(int *IDX, int *LOCK0, int *UNLOCK0, int *LOCK1, int *UNLOCK1)
{
	int tmp;
	int idx1;
	int idx2;

	// 1st reader
	idx1 = READ_ONCE(*IDX);
	if (idx1 == 0) { // Control dep
		tmp = READ_ONCE(*LOCK0);
		WRITE_ONCE(*LOCK0, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK0);
		WRITE_ONCE(*UNLOCK0, tmp + 1);
	} else {
		tmp = READ_ONCE(*LOCK1);
		WRITE_ONCE(*LOCK1, tmp + 1);
		smp_mb(); /* B and C */
		tmp = READ_ONCE(*UNLOCK1);
		WRITE_ONCE(*UNLOCK1, tmp + 1);
	}
}

exists (0:lock1=1 /\ 1:idx1=1)

More complicated litmus tests with multiple SRCU readers also show that
memory barrier E is not needed.

This commit therefore clarifies the comment on memory barrier E.

Why not also remove that redundant smp_mb()?

Because control dependencies are quite fragile due to their not being
recognized by most compilers and tools. Control dependencies therefore
exact an ongoing maintenance burden, and such a burden cannot be justified
in this slowpath. Therefore, that smp_mb() stays until such time as
its overhead becomes a measurable problem in a real workload running on
a real production system, or until such time as compilers start paying
attention to this sort of control dependency.
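
As a general illustration of that fragility (not something that applies
to srcu_flip() as currently written), the kernel's memory-barriers
documentation describes how a store that appears in both legs of a
branch can be hoisted above the branch, which removes the control
dependency. The sketch below uses hypothetical function names and plain
volatile accesses as stand-ins for READ_ONCE() and WRITE_ONCE(). Both
functions are indistinguishable to single-threaded code, which is why a
compiler is free to turn the first into the second.

```c
/* As written: the store to *flag looks conditional on the load of *cond. */
static int store_in_both_branches(volatile int *cond, volatile int *flag)
{
	int path;

	if (*cond) {
		*flag = 1;
		path = 1;
	} else {
		*flag = 1;
		path = 2;
	}
	return path;
}

/* What a compiler may emit instead: the identical stores are merged. */
static int store_hoisted(volatile int *cond, volatile int *flag)
{
	int c = *cond;

	*flag = 1;	/* No longer behind a branch: no control dependency. */
	return c ? 1 : 2;
}
```

In the hoisted form a weakly ordered CPU has no conditional branch between
the load and the store, so nothing but an explicit barrier orders them.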

Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

+27 -7
kernel/rcu/srcutree.c
···
 static void srcu_flip(struct srcu_struct *ssp)
 {
 	/*
-	 * Ensure that if this updater saw a given reader's increment
-	 * from __srcu_read_lock(), that reader was using an old value
-	 * of ->srcu_idx. Also ensure that if a given reader sees the
-	 * new value of ->srcu_idx, this updater's earlier scans cannot
-	 * have seen that reader's increments (which is OK, because this
-	 * grace period need not wait on that reader).
+	 * Because the flip of ->srcu_idx is executed only if the
+	 * preceding call to srcu_readers_active_idx_check() found that
+	 * the ->srcu_unlock_count[] and ->srcu_lock_count[] sums matched
+	 * and because that summing uses atomic_long_read(), there is
+	 * ordering due to a control dependency between that summing and
+	 * the WRITE_ONCE() in this call to srcu_flip(). This ordering
+	 * ensures that if this updater saw a given reader's increment from
+	 * __srcu_read_lock(), that reader was using a value of ->srcu_idx
+	 * from before the previous call to srcu_flip(), which should be
+	 * quite rare. This ordering thus helps forward progress because
+	 * the grace period could otherwise be delayed by additional
+	 * calls to __srcu_read_lock() using that old (soon to be new)
+	 * value of ->srcu_idx.
+	 *
+	 * This sum-equality check and ordering also ensures that if
+	 * a given call to __srcu_read_lock() uses the new value of
+	 * ->srcu_idx, this updater's earlier scans cannot have seen
+	 * that reader's increments, which is all to the good, because
+	 * this grace period need not wait on that reader. After all,
+	 * if those earlier scans had seen that reader, there would have
+	 * been a sum mismatch and this code would not be reached.
+	 *
+	 * This means that the following smp_mb() is redundant, but
+	 * it stays until either (1) Compilers learn about this sort of
+	 * control dependency or (2) Some production workload running on
+	 * a production system is unduly delayed by this slowpath smp_mb().
 	 */
 	smp_mb(); /* E */ /* Pairs with B and C. */
 
-	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1);
+	WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); // Flip the counter.
 
 	/*
 	 * Ensure that if the updater misses an __srcu_read_unlock()