
locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures

All the locking-related cmpxchg() calls in the following functions are
replaced with their _acquire variants:

- pv_queued_spin_steal_lock()
- trylock_clear_pending()

This change should help performance on architectures that use LL/SC.

The cmpxchg() in pv_kick_node() is replaced with a relaxed version plus
an explicit memory barrier, so that the write of next->locked and the
read of pn->state are fully ordered whether the cmpxchg() succeeds or
fails, without affecting performance on non-LL/SC architectures.

On a 2-socket 12-core 96-thread Power8 system with pvqspinlock
explicitly enabled, the performance of a locking microbenchmark
with and without this patch on a 4.13-rc4 kernel with Xinhui's PPC
qspinlock patch was as follows:

  # of thread     w/o patch      with patch     % Change
  -----------     ---------      ----------     --------
       8        5054.8 Mop/s    5209.4 Mop/s     +3.1%
      16        3985.0 Mop/s    4015.0 Mop/s     +0.8%
      32        2378.2 Mop/s    2396.0 Mop/s     +0.7%

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1502741222-24360-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

authored by Waiman Long and committed by Ingo Molnar
34d54f3d 966a9671

+17 -7
kernel/locking/qspinlock_paravirt.h
@@ -72,7 +72,7 @@
 	struct __qspinlock *l = (void *)lock;
 
 	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
-	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
 		qstat_inc(qstat_pv_lock_stealing, true);
 		return true;
 	}
@@ -101,16 +101,16 @@
 
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
- * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
- * just to be sure that it will get it.
+ * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
+ * lock just to be sure that it will get it.
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
 
 	return !READ_ONCE(l->locked) &&
-	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
-			== _Q_PENDING_VAL);
+	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+			_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
 static __always_inline void set_pending(struct qspinlock *lock)
@@ -138,7 +138,7 @@
 	 */
 	old = val;
 	new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-	val = atomic_cmpxchg(&lock->val, old, new);
+	val = atomic_cmpxchg_acquire(&lock->val, old, new);
 
 	if (val == old)
 		return 1;
@@ -362,8 +362,18 @@
 	 * observe its next->locked value and advance itself.
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+	 *
+	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * must be ordered before the read of pn->state in the cmpxchg()
+	 * below for the code to work correctly. To guarantee full ordering
+	 * irrespective of the success or failure of the cmpxchg(),
+	 * a relaxed version with explicit barrier is used. The control
+	 * dependency will order the reading of pn->state before any
+	 * subsequent writes.
 	 */
-	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
+	smp_mb__before_atomic();
+	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
+			!= vcpu_halted)
 		return;
 
 	/*