Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

rcu: Handle gpnum/completed wrap while dyntick idle

Subtle race conditions can result if a CPU stays in dyntick-idle mode
long enough for the ->gpnum and ->completed fields to wrap. For
example, consider the following sequence of events:

o CPU 1 encounters a quiescent state while waiting for grace period
5 to complete, but then enters dyntick-idle mode.

o While CPU 1 is in dyntick-idle mode, the grace-period counters
wrap around so that the grace period number is now 4.

o Just as CPU 1 exits dyntick-idle mode, grace period 4 completes
and grace period 5 begins.

o The quiescent state that CPU 1 passed through during the old
grace period 5 looks like it applies to the new grace period
5. Therefore, the new grace period 5 completes without CPU 1
having passed through a quiescent state.

This could clearly be a fatal surprise to any long-running RCU read-side
critical section that happened to be running on CPU 1 at the time. At one
time, this was not a problem, given that it takes significant time for
the grace-period counters to overflow even on 32-bit systems. However,
with the advent of NO_HZ_FULL and SMP embedded systems, arbitrarily long
idle periods are now becoming quite feasible. It is therefore time to
close this race.

This commit therefore closes this race by having the quiescent-state
forcing code detect when a CPU has fallen too far behind, and set a new
rcu_data field, ->gpwrap, when this happens.
Whenever this new ->gpwrap field is set, the CPU's ->gpnum and ->completed
fields are known to be untrustworthy, and can be ignored, along with
any associated quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

+15 -6
+12 -5
kernel/rcu/tree.c
···
 		trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
 		return 1;
 	} else {
+		if (ULONG_CMP_LT(ACCESS_ONCE(rdp->gpnum) + ULONG_MAX / 4,
+				 rdp->mynode->gpnum))
+			ACCESS_ONCE(rdp->gpwrap) = true;
 		return 0;
 	}
 }
···
 	bool ret;

 	/* Handle the ends of any preceding grace periods first. */
-	if (rdp->completed == rnp->completed) {
+	if (rdp->completed == rnp->completed &&
+	    !unlikely(ACCESS_ONCE(rdp->gpwrap))) {

 		/* No grace period end, so just accelerate recent callbacks. */
 		ret = rcu_accelerate_cbs(rsp, rnp, rdp);
···
 		trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuend"));
 	}

-	if (rdp->gpnum != rnp->gpnum) {
+	if (rdp->gpnum != rnp->gpnum || unlikely(ACCESS_ONCE(rdp->gpwrap))) {
 		/*
 		 * If the current grace period is waiting for this CPU,
 		 * set up to detect a quiescent state, otherwise don't
···
 		rdp->passed_quiesce = 0;
 		rdp->qs_pending = !!(rnp->qsmask & rdp->grpmask);
 		zero_cpu_stall_ticks(rdp);
+		ACCESS_ONCE(rdp->gpwrap) = false;
 	}
 	return ret;
 }
···
 	local_irq_save(flags);
 	rnp = rdp->mynode;
 	if ((rdp->gpnum == ACCESS_ONCE(rnp->gpnum) &&
-	     rdp->completed == ACCESS_ONCE(rnp->completed)) || /* w/out lock. */
+	     rdp->completed == ACCESS_ONCE(rnp->completed) &&
+	     !unlikely(ACCESS_ONCE(rdp->gpwrap))) || /* w/out lock. */
 	    !raw_spin_trylock(&rnp->lock)) { /* irqs already off, so later. */
 		local_irq_restore(flags);
 		return;
···
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	smp_mb__after_unlock_lock();
 	if (rdp->passed_quiesce == 0 || rdp->gpnum != rnp->gpnum ||
-	    rnp->completed == rnp->gpnum) {
+	    rnp->completed == rnp->gpnum || rdp->gpwrap) {

 		/*
 		 * The grace period in which this quiescent state was
···
 	}

 	/* Has a new RCU grace period started? */
-	if (ACCESS_ONCE(rnp->gpnum) != rdp->gpnum) { /* outside lock */
+	if (ACCESS_ONCE(rnp->gpnum) != rdp->gpnum ||
+	    unlikely(ACCESS_ONCE(rdp->gpwrap))) { /* outside lock */
 		rdp->n_rp_gp_started++;
 		return 1;
 	}
+1
kernel/rcu/tree.h
···
 	bool passed_quiesce;	/* User-mode/idle loop etc. */
 	bool qs_pending;	/* Core waits for quiesc state. */
 	bool beenonline;	/* CPU online at least once. */
+	bool gpwrap;		/* Possible gpnum/completed wrap. */
 	struct rcu_node *mynode;	/* This CPU's leaf of hierarchy */
 	unsigned long grpmask;	/* Mask to apply to leaf qsmask. */
 #ifdef CONFIG_RCU_CPU_STALL_INFO
+2 -1
kernel/rcu/tree_plugin.h
···
 	 * completed since we last checked and there are
 	 * callbacks not yet ready to invoke.
 	 */
-	if (rdp->completed != rnp->completed &&
+	if ((rdp->completed != rnp->completed ||
+	     unlikely(ACCESS_ONCE(rdp->gpwrap))) &&
 	    rdp->nxttail[RCU_DONE_TAIL] != rdp->nxttail[RCU_NEXT_TAIL])
 		note_gp_changes(rsp, rdp);