Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull more locking changes from Ingo Molnar:
"This is the second round of locking tree updates for v3.16, offering
large system scalability improvements:

- optimistic spinning for rwsems, from Davidlohr Bueso.

- 'qrwlocks' core code and x86 enablement, from Waiman Long and PeterZ"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, locking/rwlocks: Enable qrwlocks on x86
locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks
locking/mutexes: Documentation update/rewrite
locking/rwsem: Fix checkpatch.pl warnings
locking/rwsem: Fix warnings for CONFIG_RWSEM_GENERIC_SPINLOCK
locking/rwsem: Support optimistic spinning

+733 -146
+131 -113
Documentation/mutex-design.txt
··· 1 1 Generic Mutex Subsystem 2 2 3 3 started by Ingo Molnar <mingo@redhat.com> 4 + updated by Davidlohr Bueso <davidlohr@hp.com> 4 5 5 - "Why on earth do we need a new mutex subsystem, and what's wrong 6 - with semaphores?" 6 + What are mutexes? 7 + ----------------- 7 8 8 - firstly, there's nothing wrong with semaphores. But if the simpler 9 - mutex semantics are sufficient for your code, then there are a couple 10 - of advantages of mutexes: 9 + In the Linux kernel, mutexes refer to a particular locking primitive 10 + that enforces serialization on shared memory systems, and not only to 11 + the generic term referring to 'mutual exclusion' found in academia 12 + or similar theoretical text books. Mutexes are sleeping locks which 13 + behave similarly to binary semaphores, and were introduced in 2006[1] 14 + as an alternative to these. This new data structure provided a number 15 + of advantages, including simpler interfaces, and at that time smaller 16 + code (see Disadvantages). 11 17 12 - - 'struct mutex' is smaller on most architectures: E.g. on x86, 13 - 'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes. 14 - A smaller structure size means less RAM footprint, and better 15 - CPU-cache utilization. 18 + [1] http://lwn.net/Articles/164802/ 16 19 17 - - tighter code. On x86 i get the following .text sizes when 18 - switching all mutex-alike semaphores in the kernel to the mutex 19 - subsystem: 20 + Implementation 21 + -------------- 20 22 21 - text data bss dec hex filename 22 - 3280380 868188 396860 4545428 455b94 vmlinux-semaphore 23 - 3255329 865296 396732 4517357 44eded vmlinux-mutex 23 + Mutexes are represented by 'struct mutex', defined in include/linux/mutex.h 24 + and implemented in kernel/locking/mutex.c. 
These locks use a three 25 + state atomic counter (->count) to represent the different possible 26 + transitions that can occur during the lifetime of a lock: 24 27 25 - that's 25051 bytes of code saved, or a 0.76% win - off the hottest 26 - codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%) 27 - Smaller code means better icache footprint, which is one of the 28 - major optimization goals in the Linux kernel currently. 28 + 1: unlocked 29 + 0: locked, no waiters 30 + negative: locked, with potential waiters 29 31 30 - - the mutex subsystem is slightly faster and has better scalability for 31 - contended workloads. On an 8-way x86 system, running a mutex-based 32 - kernel and testing creat+unlink+close (of separate, per-task files) 33 - in /tmp with 16 parallel tasks, the average number of ops/sec is: 32 + In its most basic form it also includes a wait-queue and a spinlock 33 + that serializes access to it. CONFIG_SMP systems can also include 34 + a pointer to the lock task owner (->owner) as well as a spinner MCS 35 + lock (->osq), both described below in (ii). 34 36 35 - Semaphores: Mutexes: 37 + When acquiring a mutex, there are three possible paths that can be 38 + taken, depending on the state of the lock: 36 39 37 - $ ./test-mutex V 16 10 $ ./test-mutex V 16 10 38 - 8 CPUs, running 16 tasks. 8 CPUs, running 16 tasks. 39 - checking VFS performance. checking VFS performance. 40 - avg loops/sec: 34713 avg loops/sec: 84153 41 - CPU utilization: 63% CPU utilization: 22% 40 + (i) fastpath: tries to atomically acquire the lock by decrementing the 41 + counter. If it was already taken by another task it goes to the next 42 + possible path. This logic is architecture specific. On x86-64, the 43 + locking fastpath is 2 instructions: 42 44 43 - i.e. in this workload, the mutex based kernel was 2.4 times faster 44 - than the semaphore based kernel, _and_ it also had 2.8 times less CPU 45 - utilization. 
(In terms of 'ops per CPU cycle', the semaphore kernel 46 - performed 551 ops/sec per 1% of CPU time used, while the mutex kernel 47 - performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times 48 - more efficient.) 49 - 50 - the scalability difference is visible even on a 2-way P4 HT box: 51 - 52 - Semaphores: Mutexes: 53 - 54 - $ ./test-mutex V 16 10 $ ./test-mutex V 16 10 55 - 4 CPUs, running 16 tasks. 8 CPUs, running 16 tasks. 56 - checking VFS performance. checking VFS performance. 57 - avg loops/sec: 127659 avg loops/sec: 181082 58 - CPU utilization: 100% CPU utilization: 34% 59 - 60 - (the straight performance advantage of mutexes is 41%, the per-cycle 61 - efficiency of mutexes is 4.1 times better.) 62 - 63 - - there are no fastpath tradeoffs, the mutex fastpath is just as tight 64 - as the semaphore fastpath. On x86, the locking fastpath is 2 65 - instructions: 66 - 67 - c0377ccb <mutex_lock>: 68 - c0377ccb: f0 ff 08 lock decl (%eax) 69 - c0377cce: 78 0e js c0377cde <.text..lock.mutex> 70 - c0377cd0: c3 ret 45 + 0000000000000e10 <mutex_lock>: 46 + e21: f0 ff 0b lock decl (%rbx) 47 + e24: 79 08 jns e2e <mutex_lock+0x1e> 71 48 72 49 the unlocking fastpath is equally tight: 73 50 74 - c0377cd1 <mutex_unlock>: 75 - c0377cd1: f0 ff 00 lock incl (%eax) 76 - c0377cd4: 7e 0f jle c0377ce5 <.text..lock.mutex+0x7> 77 - c0377cd6: c3 ret 51 + 0000000000000bc0 <mutex_unlock>: 52 + bc8: f0 ff 07 lock incl (%rdi) 53 + bcb: 7f 0a jg bd7 <mutex_unlock+0x17> 78 54 79 - - 'struct mutex' semantics are well-defined and are enforced if 80 - CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have 81 - virtually no debugging code or instrumentation. 
The mutex subsystem 82 - checks and enforces the following rules: 83 55 84 - * - only one task can hold the mutex at a time 85 - * - only the owner can unlock the mutex 86 - * - multiple unlocks are not permitted 87 - * - recursive locking is not permitted 88 - * - a mutex object must be initialized via the API 89 - * - a mutex object must not be initialized via memset or copying 90 - * - task may not exit with mutex held 91 - * - memory areas where held locks reside must not be freed 92 - * - held mutexes must not be reinitialized 93 - * - mutexes may not be used in hardware or software interrupt 94 - * contexts such as tasklets and timers 56 + (ii) midpath: aka optimistic spinning, tries to spin for acquisition 57 + while the lock owner is running and there are no other tasks ready 58 + to run that have higher priority (need_resched). The rationale is 59 + that if the lock owner is running, it is likely to release the lock 60 + soon. The mutex spinners are queued up using MCS lock so that only 61 + one spinner can compete for the mutex. 95 62 96 - furthermore, there are also convenience features in the debugging 97 - code: 63 + The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spinlock 64 + with the desirable properties of being fair and with each cpu trying 65 + to acquire the lock spinning on a local variable. It avoids expensive 66 + cacheline bouncing that common test-and-set spinlock implementations 67 + incur. An MCS-like lock is specially tailored for optimistic spinning 68 + for sleeping lock implementation. An important feature of the customized 69 + MCS lock is that it has the extra property that spinners are able to exit 70 + the MCS spinlock queue when they need to reschedule. This further helps 71 + avoid situations where MCS spinners that need to reschedule would continue 72 + waiting to spin on mutex owner, only to go directly to slowpath upon 73 + obtaining the MCS lock. 
98 74 99 - * - uses symbolic names of mutexes, whenever they are printed in debug output 100 - * - point-of-acquire tracking, symbolic lookup of function names 101 - * - list of all locks held in the system, printout of them 102 - * - owner tracking 103 - * - detects self-recursing locks and prints out all relevant info 104 - * - detects multi-task circular deadlocks and prints out all affected 105 - * locks and tasks (and only those tasks) 75 + 76 + (iii) slowpath: last resort, if the lock is still unable to be acquired, 77 + the task is added to the wait-queue and sleeps until woken up by the 78 + unlock path. Under normal circumstances it blocks as TASK_UNINTERRUPTIBLE. 79 + 80 + While formally kernel mutexes are sleepable locks, it is path (ii) that 81 + makes them more practically a hybrid type. By simply not interrupting a 82 + task and busy-waiting for a few cycles instead of immediately sleeping, 83 + the performance of this lock has been seen to significantly improve a 84 + number of workloads. Note that this technique is also used for rw-semaphores. 85 + 86 + Semantics 87 + --------- 88 + 89 + The mutex subsystem checks and enforces the following rules: 90 + 91 + - Only one task can hold the mutex at a time. 92 + - Only the owner can unlock the mutex. 93 + - Multiple unlocks are not permitted. 94 + - Recursive locking/unlocking is not permitted. 95 + - A mutex must only be initialized via the API (see below). 96 + - A task may not exit with a mutex held. 97 + - Memory areas where held locks reside must not be freed. 98 + - Held mutexes must not be reinitialized. 99 + - Mutexes may not be used in hardware or software interrupt 100 + contexts such as tasklets and timers. 101 + 102 + These semantics are fully enforced when CONFIG_DEBUG_MUTEXES is enabled. 
103 + In addition, the mutex debugging code also implements a number of other 104 + features that make lock debugging easier and faster: 105 + 106 + - Uses symbolic names of mutexes, whenever they are printed 107 + in debug output. 108 + - Point-of-acquire tracking, symbolic lookup of function names, 109 + list of all locks held in the system, printout of them. 110 + - Owner tracking. 111 + - Detects self-recursing locks and prints out all relevant info. 112 + - Detects multi-task circular deadlocks and prints out all affected 113 + locks and tasks (and only those tasks). 114 + 115 + 116 + Interfaces 117 + ---------- 118 + Statically define the mutex: 119 + DEFINE_MUTEX(name); 120 + 121 + Dynamically initialize the mutex: 122 + mutex_init(mutex); 123 + 124 + Acquire the mutex, uninterruptible: 125 + void mutex_lock(struct mutex *lock); 126 + void mutex_lock_nested(struct mutex *lock, unsigned int subclass); 127 + int mutex_trylock(struct mutex *lock); 128 + 129 + Acquire the mutex, interruptible: 130 + int mutex_lock_interruptible_nested(struct mutex *lock, 131 + unsigned int subclass); 132 + int mutex_lock_interruptible(struct mutex *lock); 133 + 134 + Acquire the mutex, interruptible, if dec to 0: 135 + int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); 136 + 137 + Unlock the mutex: 138 + void mutex_unlock(struct mutex *lock); 139 + 140 + Test if the mutex is taken: 141 + int mutex_is_locked(struct mutex *lock); 106 142 107 143 Disadvantages 108 144 ------------- 109 145 110 - The stricter mutex API means you cannot use mutexes the same way you 111 - can use semaphores: e.g. they cannot be used from an interrupt context, 112 - nor can they be unlocked from a different context that which acquired 113 - it. [ I'm not aware of any other (e.g. performance) disadvantages from 114 - using mutexes at the moment, please let me know if you find any. ] 146 + Unlike its original design and purpose, 'struct mutex' is larger than 147 + most locks in the kernel. 
E.g: on x86-64 it is 40 bytes, almost twice 148 + as large as 'struct semaphore' (24 bytes) and 8 bytes shy of the 149 + 'struct rw_semaphore' variant. Larger structure sizes mean more CPU 150 + cache and memory footprint. 115 151 116 - Implementation of mutexes 117 - ------------------------- 152 + When to use mutexes 153 + ------------------- 118 154 119 - 'struct mutex' is the new mutex type, defined in include/linux/mutex.h and 120 - implemented in kernel/locking/mutex.c. It is a counter-based mutex with a 121 - spinlock and a wait-list. The counter has 3 states: 1 for "unlocked", 0 for 122 - "locked" and negative numbers (usually -1) for "locked, potential waiters 123 - queued". 124 - 125 - the APIs of 'struct mutex' have been streamlined: 126 - 127 - DEFINE_MUTEX(name); 128 - 129 - mutex_init(mutex); 130 - 131 - void mutex_lock(struct mutex *lock); 132 - int mutex_lock_interruptible(struct mutex *lock); 133 - int mutex_trylock(struct mutex *lock); 134 - void mutex_unlock(struct mutex *lock); 135 - int mutex_is_locked(struct mutex *lock); 136 - void mutex_lock_nested(struct mutex *lock, unsigned int subclass); 137 - int mutex_lock_interruptible_nested(struct mutex *lock, 138 - unsigned int subclass); 139 - int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock); 155 + Unless the strict semantics of mutexes are unsuitable and/or the critical 156 + region prevents the lock from being shared, always prefer them to any other 157 + locking primitive.
+1
arch/x86/Kconfig
··· 121 121 select MODULES_USE_ELF_RELA if X86_64 122 122 select CLONE_BACKWARDS if X86_32 123 123 select ARCH_USE_BUILTIN_BSWAP 124 + select ARCH_USE_QUEUE_RWLOCK 124 125 select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION 125 126 select OLD_SIGACTION if X86_32 126 127 select COMPAT_OLD_SIGACTION if IA32_EMULATION
+17
arch/x86/include/asm/qrwlock.h
··· 1 + #ifndef _ASM_X86_QRWLOCK_H 2 + #define _ASM_X86_QRWLOCK_H 3 + 4 + #include <asm-generic/qrwlock_types.h> 5 + 6 + #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) 7 + #define queue_write_unlock queue_write_unlock 8 + static inline void queue_write_unlock(struct qrwlock *lock) 9 + { 10 + barrier(); 11 + ACCESS_ONCE(*(u8 *)&lock->cnts) = 0; 12 + } 13 + #endif 14 + 15 + #include <asm-generic/qrwlock.h> 16 + 17 + #endif /* _ASM_X86_QRWLOCK_H */
+4
arch/x86/include/asm/spinlock.h
··· 187 187 cpu_relax(); 188 188 } 189 189 190 + #ifndef CONFIG_QUEUE_RWLOCK 190 191 /* 191 192 * Read-write spinlocks, allowing multiple readers 192 193 * but only one writer. ··· 270 269 asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0" 271 270 : "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory"); 272 271 } 272 + #else 273 + #include <asm/qrwlock.h> 274 + #endif /* CONFIG_QUEUE_RWLOCK */ 273 275 274 276 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock) 275 277 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
+4
arch/x86/include/asm/spinlock_types.h
··· 34 34 35 35 #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } 36 36 37 + #ifdef CONFIG_QUEUE_RWLOCK 38 + #include <asm-generic/qrwlock_types.h> 39 + #else 37 40 #include <asm/rwlock.h> 41 + #endif 38 42 39 43 #endif /* _ASM_X86_SPINLOCK_TYPES_H */
+166
include/asm-generic/qrwlock.h
··· 1 + /* 2 + * Queue read/write lock 3 + * 4 + * This program is free software; you can redistribute it and/or modify 5 + * it under the terms of the GNU General Public License as published by 6 + * the Free Software Foundation; either version 2 of the License, or 7 + * (at your option) any later version. 8 + * 9 + * This program is distributed in the hope that it will be useful, 10 + * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 + * GNU General Public License for more details. 13 + * 14 + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. 15 + * 16 + * Authors: Waiman Long <waiman.long@hp.com> 17 + */ 18 + #ifndef __ASM_GENERIC_QRWLOCK_H 19 + #define __ASM_GENERIC_QRWLOCK_H 20 + 21 + #include <linux/atomic.h> 22 + #include <asm/barrier.h> 23 + #include <asm/processor.h> 24 + 25 + #include <asm-generic/qrwlock_types.h> 26 + 27 + /* 28 + * Writer states & reader shift and bias 29 + */ 30 + #define _QW_WAITING 1 /* A writer is waiting */ 31 + #define _QW_LOCKED 0xff /* A writer holds the lock */ 32 + #define _QW_WMASK 0xff /* Writer mask */ 33 + #define _QR_SHIFT 8 /* Reader count shift */ 34 + #define _QR_BIAS (1U << _QR_SHIFT) 35 + 36 + /* 37 + * External function declarations 38 + */ 39 + extern void queue_read_lock_slowpath(struct qrwlock *lock); 40 + extern void queue_write_lock_slowpath(struct qrwlock *lock); 41 + 42 + /** 43 + * queue_read_can_lock- would read_trylock() succeed? 44 + * @lock: Pointer to queue rwlock structure 45 + */ 46 + static inline int queue_read_can_lock(struct qrwlock *lock) 47 + { 48 + return !(atomic_read(&lock->cnts) & _QW_WMASK); 49 + } 50 + 51 + /** 52 + * queue_write_can_lock- would write_trylock() succeed? 
53 + * @lock: Pointer to queue rwlock structure 54 + */ 55 + static inline int queue_write_can_lock(struct qrwlock *lock) 56 + { 57 + return !atomic_read(&lock->cnts); 58 + } 59 + 60 + /** 61 + * queue_read_trylock - try to acquire read lock of a queue rwlock 62 + * @lock : Pointer to queue rwlock structure 63 + * Return: 1 if lock acquired, 0 if failed 64 + */ 65 + static inline int queue_read_trylock(struct qrwlock *lock) 66 + { 67 + u32 cnts; 68 + 69 + cnts = atomic_read(&lock->cnts); 70 + if (likely(!(cnts & _QW_WMASK))) { 71 + cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts); 72 + if (likely(!(cnts & _QW_WMASK))) 73 + return 1; 74 + atomic_sub(_QR_BIAS, &lock->cnts); 75 + } 76 + return 0; 77 + } 78 + 79 + /** 80 + * queue_write_trylock - try to acquire write lock of a queue rwlock 81 + * @lock : Pointer to queue rwlock structure 82 + * Return: 1 if lock acquired, 0 if failed 83 + */ 84 + static inline int queue_write_trylock(struct qrwlock *lock) 85 + { 86 + u32 cnts; 87 + 88 + cnts = atomic_read(&lock->cnts); 89 + if (unlikely(cnts)) 90 + return 0; 91 + 92 + return likely(atomic_cmpxchg(&lock->cnts, 93 + cnts, cnts | _QW_LOCKED) == cnts); 94 + } 95 + /** 96 + * queue_read_lock - acquire read lock of a queue rwlock 97 + * @lock: Pointer to queue rwlock structure 98 + */ 99 + static inline void queue_read_lock(struct qrwlock *lock) 100 + { 101 + u32 cnts; 102 + 103 + cnts = atomic_add_return(_QR_BIAS, &lock->cnts); 104 + if (likely(!(cnts & _QW_WMASK))) 105 + return; 106 + 107 + /* The slowpath will decrement the reader count, if necessary. */ 108 + queue_read_lock_slowpath(lock); 109 + } 110 + 111 + /** 112 + * queue_write_lock - acquire write lock of a queue rwlock 113 + * @lock : Pointer to queue rwlock structure 114 + */ 115 + static inline void queue_write_lock(struct qrwlock *lock) 116 + { 117 + /* Optimize for the unfair lock case where the fair flag is 0. 
*/ 118 + if (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0) 119 + return; 120 + 121 + queue_write_lock_slowpath(lock); 122 + } 123 + 124 + /** 125 + * queue_read_unlock - release read lock of a queue rwlock 126 + * @lock : Pointer to queue rwlock structure 127 + */ 128 + static inline void queue_read_unlock(struct qrwlock *lock) 129 + { 130 + /* 131 + * Atomically decrement the reader count 132 + */ 133 + smp_mb__before_atomic(); 134 + atomic_sub(_QR_BIAS, &lock->cnts); 135 + } 136 + 137 + #ifndef queue_write_unlock 138 + /** 139 + * queue_write_unlock - release write lock of a queue rwlock 140 + * @lock : Pointer to queue rwlock structure 141 + */ 142 + static inline void queue_write_unlock(struct qrwlock *lock) 143 + { 144 + /* 145 + * If the writer field is atomic, it can be cleared directly. 146 + * Otherwise, an atomic subtraction will be used to clear it. 147 + */ 148 + smp_mb__before_atomic(); 149 + atomic_sub(_QW_LOCKED, &lock->cnts); 150 + } 151 + #endif 152 + 153 + /* 154 + * Remapping rwlock architecture specific functions to the corresponding 155 + * queue rwlock functions. 156 + */ 157 + #define arch_read_can_lock(l) queue_read_can_lock(l) 158 + #define arch_write_can_lock(l) queue_write_can_lock(l) 159 + #define arch_read_lock(l) queue_read_lock(l) 160 + #define arch_write_lock(l) queue_write_lock(l) 161 + #define arch_read_trylock(l) queue_read_trylock(l) 162 + #define arch_write_trylock(l) queue_write_trylock(l) 163 + #define arch_read_unlock(l) queue_read_unlock(l) 164 + #define arch_write_unlock(l) queue_write_unlock(l) 165 + 166 + #endif /* __ASM_GENERIC_QRWLOCK_H */
+21
include/asm-generic/qrwlock_types.h
··· 1 + #ifndef __ASM_GENERIC_QRWLOCK_TYPES_H 2 + #define __ASM_GENERIC_QRWLOCK_TYPES_H 3 + 4 + #include <linux/types.h> 5 + #include <asm/spinlock_types.h> 6 + 7 + /* 8 + * The queue read/write lock data structure 9 + */ 10 + 11 + typedef struct qrwlock { 12 + atomic_t cnts; 13 + arch_spinlock_t lock; 14 + } arch_rwlock_t; 15 + 16 + #define __ARCH_RW_LOCK_UNLOCKED { \ 17 + .cnts = ATOMIC_INIT(0), \ 18 + .lock = __ARCH_SPIN_LOCK_UNLOCKED, \ 19 + } 20 + 21 + #endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
+22 -3
include/linux/rwsem.h
··· 16 16 17 17 #include <linux/atomic.h> 18 18 19 + struct optimistic_spin_queue; 19 20 struct rw_semaphore; 20 21 21 22 #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK ··· 24 23 #else 25 24 /* All arch specific implementations share the same struct */ 26 25 struct rw_semaphore { 27 - long count; 28 - raw_spinlock_t wait_lock; 29 - struct list_head wait_list; 26 + long count; 27 + raw_spinlock_t wait_lock; 28 + struct list_head wait_list; 29 + #ifdef CONFIG_SMP 30 + /* 31 + * Write owner. Used as a speculative check to see 32 + * if the owner is running on the cpu. 33 + */ 34 + struct task_struct *owner; 35 + struct optimistic_spin_queue *osq; /* spinner MCS lock */ 36 + #endif 30 37 #ifdef CONFIG_DEBUG_LOCK_ALLOC 31 38 struct lockdep_map dep_map; 32 39 #endif ··· 64 55 # define __RWSEM_DEP_MAP_INIT(lockname) 65 56 #endif 66 57 58 + #if defined(CONFIG_SMP) && !defined(CONFIG_RWSEM_GENERIC_SPINLOCK) 59 + #define __RWSEM_INITIALIZER(name) \ 60 + { RWSEM_UNLOCKED_VALUE, \ 61 + __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ 62 + LIST_HEAD_INIT((name).wait_list), \ 63 + NULL, /* owner */ \ 64 + NULL /* mcs lock */ \ 65 + __RWSEM_DEP_MAP_INIT(name) } 66 + #else 67 67 #define __RWSEM_INITIALIZER(name) \ 68 68 { RWSEM_UNLOCKED_VALUE, \ 69 69 __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ 70 70 LIST_HEAD_INIT((name).wait_list) \ 71 71 __RWSEM_DEP_MAP_INIT(name) } 72 + #endif 72 73 73 74 #define DECLARE_RWSEM(name) \ 74 75 struct rw_semaphore name = __RWSEM_INITIALIZER(name)
+7
kernel/Kconfig.locks
··· 223 223 config MUTEX_SPIN_ON_OWNER 224 224 def_bool y 225 225 depends on SMP && !DEBUG_MUTEXES 226 + 227 + config ARCH_USE_QUEUE_RWLOCK 228 + bool 229 + 230 + config QUEUE_RWLOCK 231 + def_bool y if ARCH_USE_QUEUE_RWLOCK 232 + depends on SMP
+1
kernel/locking/Makefile
··· 24 24 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o 25 25 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o 26 26 obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o 27 + obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o 27 28 obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
+133
kernel/locking/qrwlock.c
··· 1 + /* 2 + * Queue read/write lock 3 + * 4 + * This program is free software; you can redistribute it and/or modify 5 + * it under the terms of the GNU General Public License as published by 6 + * the Free Software Foundation; either version 2 of the License, or 7 + * (at your option) any later version. 8 + * 9 + * This program is distributed in the hope that it will be useful, 10 + * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 + * GNU General Public License for more details. 13 + * 14 + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. 15 + * 16 + * Authors: Waiman Long <waiman.long@hp.com> 17 + */ 18 + #include <linux/smp.h> 19 + #include <linux/bug.h> 20 + #include <linux/cpumask.h> 21 + #include <linux/percpu.h> 22 + #include <linux/hardirq.h> 23 + #include <linux/mutex.h> 24 + #include <asm/qrwlock.h> 25 + 26 + /** 27 + * rspin_until_writer_unlock - inc reader count & spin until writer is gone 28 + * @lock : Pointer to queue rwlock structure 29 + * @writer: Current queue rwlock writer status byte 30 + * 31 + * In interrupt context or at the head of the queue, the reader will just 32 + * increment the reader count & wait until the writer releases the lock. 
33 + */ 34 + static __always_inline void 35 + rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts) 36 + { 37 + while ((cnts & _QW_WMASK) == _QW_LOCKED) { 38 + arch_mutex_cpu_relax(); 39 + cnts = smp_load_acquire((u32 *)&lock->cnts); 40 + } 41 + } 42 + 43 + /** 44 + * queue_read_lock_slowpath - acquire read lock of a queue rwlock 45 + * @lock: Pointer to queue rwlock structure 46 + */ 47 + void queue_read_lock_slowpath(struct qrwlock *lock) 48 + { 49 + u32 cnts; 50 + 51 + /* 52 + * Readers come here when they cannot get the lock without waiting 53 + */ 54 + if (unlikely(in_interrupt())) { 55 + /* 56 + * Readers in interrupt context will spin until the lock is 57 + * available without waiting in the queue. 58 + */ 59 + cnts = smp_load_acquire((u32 *)&lock->cnts); 60 + rspin_until_writer_unlock(lock, cnts); 61 + return; 62 + } 63 + atomic_sub(_QR_BIAS, &lock->cnts); 64 + 65 + /* 66 + * Put the reader into the wait queue 67 + */ 68 + arch_spin_lock(&lock->lock); 69 + 70 + /* 71 + * At the head of the wait queue now, wait until the writer state 72 + * goes to 0 and then try to increment the reader count and get 73 + * the lock. It is possible that an incoming writer may steal the 74 + * lock in the interim, so it is necessary to check the writer byte 75 + * to make sure that the write lock isn't taken. 
76 + */ 77 + while (atomic_read(&lock->cnts) & _QW_WMASK) 78 + arch_mutex_cpu_relax(); 79 + 80 + cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS; 81 + rspin_until_writer_unlock(lock, cnts); 82 + 83 + /* 84 + * Signal the next one in queue to become queue head 85 + */ 86 + arch_spin_unlock(&lock->lock); 87 + } 88 + EXPORT_SYMBOL(queue_read_lock_slowpath); 89 + 90 + /** 91 + * queue_write_lock_slowpath - acquire write lock of a queue rwlock 92 + * @lock : Pointer to queue rwlock structure 93 + */ 94 + void queue_write_lock_slowpath(struct qrwlock *lock) 95 + { 96 + u32 cnts; 97 + 98 + /* Put the writer into the wait queue */ 99 + arch_spin_lock(&lock->lock); 100 + 101 + /* Try to acquire the lock directly if no reader is present */ 102 + if (!atomic_read(&lock->cnts) && 103 + (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0)) 104 + goto unlock; 105 + 106 + /* 107 + * Set the waiting flag to notify readers that a writer is pending, 108 + * or wait for a previous writer to go away. 109 + */ 110 + for (;;) { 111 + cnts = atomic_read(&lock->cnts); 112 + if (!(cnts & _QW_WMASK) && 113 + (atomic_cmpxchg(&lock->cnts, cnts, 114 + cnts | _QW_WAITING) == cnts)) 115 + break; 116 + 117 + arch_mutex_cpu_relax(); 118 + } 119 + 120 + /* When no more readers, set the locked flag */ 121 + for (;;) { 122 + cnts = atomic_read(&lock->cnts); 123 + if ((cnts == _QW_WAITING) && 124 + (atomic_cmpxchg(&lock->cnts, _QW_WAITING, 125 + _QW_LOCKED) == _QW_WAITING)) 126 + break; 127 + 128 + arch_mutex_cpu_relax(); 129 + } 130 + unlock: 131 + arch_spin_unlock(&lock->lock); 132 + } 133 + EXPORT_SYMBOL(queue_write_lock_slowpath);
+196 -29
kernel/locking/rwsem-xadd.c
··· 5 5 * 6 6 * Writer lock-stealing by Alex Shi <alex.shi@intel.com> 7 7 * and Michel Lespinasse <walken@google.com> 8 + * 9 + * Optimistic spinning by Tim Chen <tim.c.chen@intel.com> 10 + * and Davidlohr Bueso <davidlohr@hp.com>. Based on mutexes. 8 11 */ 9 12 #include <linux/rwsem.h> 10 13 #include <linux/sched.h> 11 14 #include <linux/init.h> 12 15 #include <linux/export.h> 16 + #include <linux/sched/rt.h> 17 + 18 + #include "mcs_spinlock.h" 13 19 14 20 /* 15 21 * Guide to the rw_semaphore's count field for common values. ··· 82 76 sem->count = RWSEM_UNLOCKED_VALUE; 83 77 raw_spin_lock_init(&sem->wait_lock); 84 78 INIT_LIST_HEAD(&sem->wait_list); 79 + #ifdef CONFIG_SMP 80 + sem->owner = NULL; 81 + sem->osq = NULL; 82 + #endif 85 83 } 86 84 87 85 EXPORT_SYMBOL(__init_rwsem); ··· 200 190 } 201 191 202 192 /* 203 - * wait for the read lock to be granted 193 + * Wait for the read lock to be granted 204 194 */ 205 195 __visible 206 196 struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem) ··· 247 237 return sem; 248 238 } 249 239 240 + static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem) 241 + { 242 + if (!(count & RWSEM_ACTIVE_MASK)) { 243 + /* try acquiring the write lock */ 244 + if (sem->count == RWSEM_WAITING_BIAS && 245 + cmpxchg(&sem->count, RWSEM_WAITING_BIAS, 246 + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS) { 247 + if (!list_is_singular(&sem->wait_list)) 248 + rwsem_atomic_update(RWSEM_WAITING_BIAS, sem); 249 + return true; 250 + } 251 + } 252 + return false; 253 + } 254 + 255 + #ifdef CONFIG_SMP 250 256 /* 251 - * wait until we successfully acquire the write lock 257 + * Try to acquire write lock before the writer has been put on wait queue. 
258 + */ 259 + static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem) 260 + { 261 + long old, count = ACCESS_ONCE(sem->count); 262 + 263 + while (true) { 264 + if (!(count == 0 || count == RWSEM_WAITING_BIAS)) 265 + return false; 266 + 267 + old = cmpxchg(&sem->count, count, count + RWSEM_ACTIVE_WRITE_BIAS); 268 + if (old == count) 269 + return true; 270 + 271 + count = old; 272 + } 273 + } 274 + 275 + static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) 276 + { 277 + struct task_struct *owner; 278 + bool on_cpu = true; 279 + 280 + if (need_resched()) 281 + return 0; 282 + 283 + rcu_read_lock(); 284 + owner = ACCESS_ONCE(sem->owner); 285 + if (owner) 286 + on_cpu = owner->on_cpu; 287 + rcu_read_unlock(); 288 + 289 + /* 290 + * If sem->owner is not set, the rwsem owner may have 291 + * just acquired it and not set the owner yet or the rwsem 292 + * has been released. 293 + */ 294 + return on_cpu; 295 + } 296 + 297 + static inline bool owner_running(struct rw_semaphore *sem, 298 + struct task_struct *owner) 299 + { 300 + if (sem->owner != owner) 301 + return false; 302 + 303 + /* 304 + * Ensure we emit the owner->on_cpu, dereference _after_ checking 305 + * sem->owner still matches owner, if that fails, owner might 306 + * point to free()d memory, if it still matches, the rcu_read_lock() 307 + * ensures the memory stays valid. 308 + */ 309 + barrier(); 310 + 311 + return owner->on_cpu; 312 + } 313 + 314 + static noinline 315 + bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner) 316 + { 317 + rcu_read_lock(); 318 + while (owner_running(sem, owner)) { 319 + if (need_resched()) 320 + break; 321 + 322 + arch_mutex_cpu_relax(); 323 + } 324 + rcu_read_unlock(); 325 + 326 + /* 327 + * We break out the loop above on need_resched() or when the 328 + * owner changed, which is a sign for heavy contention. Return 329 + * success only when sem->owner is NULL. 
330 + */ 331 + return sem->owner == NULL; 332 + } 333 + 334 + static bool rwsem_optimistic_spin(struct rw_semaphore *sem) 335 + { 336 + struct task_struct *owner; 337 + bool taken = false; 338 + 339 + preempt_disable(); 340 + 341 + /* sem->wait_lock should not be held when doing optimistic spinning */ 342 + if (!rwsem_can_spin_on_owner(sem)) 343 + goto done; 344 + 345 + if (!osq_lock(&sem->osq)) 346 + goto done; 347 + 348 + while (true) { 349 + owner = ACCESS_ONCE(sem->owner); 350 + if (owner && !rwsem_spin_on_owner(sem, owner)) 351 + break; 352 + 353 + /* wait_lock will be acquired if write_lock is obtained */ 354 + if (rwsem_try_write_lock_unqueued(sem)) { 355 + taken = true; 356 + break; 357 + } 358 + 359 + /* 360 + * When there's no owner, we might have preempted between the 361 + * owner acquiring the lock and setting the owner field. If 362 + * we're an RT task that will live-lock because we won't let 363 + * the owner complete. 364 + */ 365 + if (!owner && (need_resched() || rt_task(current))) 366 + break; 367 + 368 + /* 369 + * The cpu_relax() call is a compiler barrier which forces 370 + * everything in this loop to be re-loaded. We don't need 371 + * memory barriers as we'll eventually observe the right 372 + * values at the cost of a few extra spins. 
373 +		 */
374 +		arch_mutex_cpu_relax();
375 +	}
376 +	osq_unlock(&sem->osq);
377 + done:
378 +	preempt_enable();
379 +	return taken;
380 + }
381 +
382 + #else
383 + static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
384 + {
385 +	return false;
386 + }
387 + #endif
388 +
389 + /*
390 +  * Wait until we successfully acquire the write lock
252 391	  */
253 392	 __visible
254 393	 struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
255 394	 {
256     -	long count, adjustment = -RWSEM_ACTIVE_WRITE_BIAS;
    395 +	long count;
    396 +	bool waiting = true; /* any queued threads before us */
257 397		struct rwsem_waiter waiter;
258     -	struct task_struct *tsk = current;
259 398
260     -	/* set up my own style of waitqueue */
261     -	waiter.task = tsk;
    399 +	/* undo write bias from down_write operation, stop active locking */
    400 +	count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
    401 +
    402 +	/* do optimistic spinning and steal lock if possible */
    403 +	if (rwsem_optimistic_spin(sem))
    404 +		return sem;
    405 +
    406 +	/*
    407 +	 * Optimistic spinning failed, proceed to the slowpath
    408 +	 * and block until we can acquire the sem.
    409 +	 */
    410 +	waiter.task = current;
262 411		waiter.type = RWSEM_WAITING_FOR_WRITE;
263 412
264 413		raw_spin_lock_irq(&sem->wait_lock);
    414 +
    415 +	/* account for this before adding a new element to the list */
265 416		if (list_empty(&sem->wait_list))
266     -		adjustment += RWSEM_WAITING_BIAS;
    417 +		waiting = false;
    418 +
267 419		list_add_tail(&waiter.list, &sem->wait_list);
268 420
269 421		/* we're now waiting on the lock, but no longer actively locking */
270     -	count = rwsem_atomic_update(adjustment, sem);
    422 +	if (waiting) {
    423 +		count = ACCESS_ONCE(sem->count);
271 424
272     -	/* If there were already threads queued before us and there are no
273     -	 * active writers, the lock must be read owned; so we try to wake
274     -	 * any read locks that were queued ahead of us.
	 */
275     -	if (count > RWSEM_WAITING_BIAS &&
276     -	    adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
277     -		sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
    425 +		/*
    426 +		 * If there were already threads queued before us and there are
    427 +		 * no active writers, the lock must be read owned; so we try to
    428 +		 * wake any read locks that were queued ahead of us.
    429 +		 */
    430 +		if (count > RWSEM_WAITING_BIAS)
    431 +			sem = __rwsem_do_wake(sem, RWSEM_WAKE_READERS);
    432 +
    433 +	} else
    434 +		count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);
278 435
279 436		/* wait until we successfully acquire the lock */
280     -	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
    437 +	set_current_state(TASK_UNINTERRUPTIBLE);
281 438		while (true) {
282     -		if (!(count & RWSEM_ACTIVE_MASK)) {
283     -			/* Try acquiring the write lock. */
284     -			count = RWSEM_ACTIVE_WRITE_BIAS;
285     -			if (!list_is_singular(&sem->wait_list))
286     -				count += RWSEM_WAITING_BIAS;
287     -
288     -			if (sem->count == RWSEM_WAITING_BIAS &&
289     -			    cmpxchg(&sem->count, RWSEM_WAITING_BIAS, count) ==
290     -			    RWSEM_WAITING_BIAS)
291     -				break;
292     -		}
293     -
    439 +		if (rwsem_try_write_lock(count, sem))
    440 +			break;
294 441			raw_spin_unlock_irq(&sem->wait_lock);
295 442
296 443			/* Block until there are no active lockers. */
297 444			do {
298 445				schedule();
299     -			set_task_state(tsk, TASK_UNINTERRUPTIBLE);
    446 +			set_current_state(TASK_UNINTERRUPTIBLE);
300 447			} while ((count = sem->count) & RWSEM_ACTIVE_MASK);
301 448
302 449			raw_spin_lock_irq(&sem->wait_lock);
303 450		}
    451 +	__set_current_state(TASK_RUNNING);
304 452
305 453		list_del(&waiter.list);
306 454		raw_spin_unlock_irq(&sem->wait_lock);
307     -	tsk->state = TASK_RUNNING;
308 455
309 456		return sem;
310 457	 }
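The lock-steal fast path added above (rwsem_try_write_lock_unqueued) can be sketched in user space with C11 atomics. This is a simplified analogue, not the kernel implementation: the bias constants, the counter, and the function name below are illustrative stand-ins for the kernel's RWSEM_* values, and the real code uses cmpxchg()/ACCESS_ONCE() rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's bias constants. */
#define WAITING_BIAS       (-1L)  /* waiters queued, lock not held */
#define ACTIVE_WRITE_BIAS    1L   /* one active writer */

static _Atomic long sem_count = 0;

/* Try to grab the write lock without queueing: only legal while the
 * count shows no active lockers (0, or just the waiting bias). */
static bool try_write_lock_unqueued(void)
{
	long count = atomic_load(&sem_count);

	for (;;) {
		if (!(count == 0 || count == WAITING_BIAS))
			return false;

		/* On failure, 'count' is refreshed with the current value
		 * and the eligibility check runs again -- the same retry
		 * structure as the kernel's cmpxchg loop. */
		if (atomic_compare_exchange_weak(&sem_count, &count,
						 count + ACTIVE_WRITE_BIAS))
			return true;
	}
}
```

Note that the steal leaves any waiting bias in place: a stealer holding the lock with waiters queued produces WAITING_BIAS + ACTIVE_WRITE_BIAS, so queued threads are still accounted for when the writer eventually releases.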
+30 -1
kernel/locking/rwsem.c
··· 12 12
13 13	 #include <linux/atomic.h>
14 14
    15 + #if defined(CONFIG_SMP) && defined(CONFIG_RWSEM_XCHGADD_ALGORITHM)
    16 + static inline void rwsem_set_owner(struct rw_semaphore *sem)
    17 + {
    18 +	sem->owner = current;
    19 + }
    20 +
    21 + static inline void rwsem_clear_owner(struct rw_semaphore *sem)
    22 + {
    23 +	sem->owner = NULL;
    24 + }
    25 +
    26 + #else
    27 + static inline void rwsem_set_owner(struct rw_semaphore *sem)
    28 + {
    29 + }
    30 +
    31 + static inline void rwsem_clear_owner(struct rw_semaphore *sem)
    32 + {
    33 + }
    34 + #endif
    35 +
15 36	 /*
16 37	  * lock for reading
17 38	  */
··· 69 48		rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
70 49
71 50		LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
    51 +	rwsem_set_owner(sem);
72 52	 }
73 53
74 54	 EXPORT_SYMBOL(down_write);
··· 81 59	 {
82 60		int ret = __down_write_trylock(sem);
83 61
84     -	if (ret == 1)
    62 +	if (ret == 1) {
85 63			rwsem_acquire(&sem->dep_map, 0, 1, _RET_IP_);
    64 +		rwsem_set_owner(sem);
    65 +	}
    66 +
86 67		return ret;
87 68	 }
88 69
··· 110 85	 {
111 86		rwsem_release(&sem->dep_map, 1, _RET_IP_);
112 87
    88 +	rwsem_clear_owner(sem);
113 89		__up_write(sem);
114 90	 }
115 91
··· 125 99	  * lockdep: a downgraded write will live on as a write
126 100	  * dependency.
127 101	  */
    102 +	rwsem_clear_owner(sem);
128 103		__downgrade_write(sem);
129 104	 }
130 105
··· 149 122		rwsem_acquire_nest(&sem->dep_map, 0, 0, nest, _RET_IP_);
150 123
151 124		LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
    125 +	rwsem_set_owner(sem);
152 126	 }
153 127
154 128	 EXPORT_SYMBOL(_down_write_nest_lock);
··· 169 141		rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
170 142
171 143		LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
    144 +	rwsem_set_owner(sem);
172 145	 }
173 146
174 147	 EXPORT_SYMBOL(down_write_nested);
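The owner tracking added above exists purely to feed the spinning heuristic: a spinner should only burn cycles while the writer that holds the lock is actually running on a CPU. A minimal sketch of that relationship, with made-up types ('struct task', 'struct rwsem_like') standing in for task_struct and rw_semaphore, and without the RCU protection the kernel needs to dereference the owner safely:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for task_struct and rw_semaphore. */
struct task { bool on_cpu; };

struct rwsem_like {
	struct task *owner;	/* writer holding the lock, or NULL */
};

/* Heuristic corresponding to rwsem_can_spin_on_owner(): spinning only
 * pays off while the owner is running. A NULL owner is treated
 * optimistically -- the lock may just have been released, or the new
 * owner has not stored itself into the field yet. */
static bool can_spin_on_owner(struct rwsem_like *sem)
{
	struct task *owner = sem->owner;

	return owner ? owner->on_cpu : true;
}
```

This also shows why the CONFIG_RWSEM_GENERIC_SPINLOCK build gets empty rwsem_set_owner()/rwsem_clear_owner() stubs: without optimistic spinning there is no consumer of the owner field, so maintaining it would be pure overhead.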