Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

futex: add requeue-pi documentation

Add Documentation/futex-requeue-pi.txt describing the motivation for the
newly added FUTEX_*REQUEUE_PI op codes and their implementation.

[ Impact: add documentation ]

Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jakub Jelinek <jakub@redhat.com>
LKML-Reference: <4A03634E.3080609@us.ibm.com>
[ reformatted the file ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Authored by Darren Hart and committed by Ingo Molnar
b30505c8 ba9c22f2

+131
Documentation/futex-requeue-pi.txt
Futex Requeue PI
----------------

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic [see rt-mutex-design.txt]. For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document. Priority inheritance is abbreviated throughout as
"PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls:

        /* caller must lock mutex */
        pthread_cond_wait(cond, mutex)
        {
                lock(cond->__data.__lock);
                unlock(mutex);
                do {
                        unlock(cond->__data.__lock);
                        futex_wait(cond->__data.__futex);
                        lock(cond->__data.__lock);
                } while (...);
                unlock(cond->__data.__lock);
                lock(mutex);
        }

        pthread_cond_broadcast(cond)
        {
                lock(cond->__data.__lock);
                unlock(cond->__data.__lock);
                futex_requeue(cond->__data.__futex, cond->mutex);
        }

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvars, the kernel needs to
be able to requeue tasks to PI futexes.
This support implies that upon a successful futex_wait system call,
the caller would return to user space already holding the PI futex.
The glibc implementation would be modified as follows:

        /* caller must lock mutex */
        pthread_cond_wait_pi(cond, mutex)
        {
                lock(cond->__data.__lock);
                unlock(mutex);
                do {
                        unlock(cond->__data.__lock);
                        futex_wait_requeue_pi(cond->__data.__futex);
                        lock(cond->__data.__lock);
                } while (...);
                unlock(cond->__data.__lock);
                /* the kernel acquired the mutex for us */
        }

        pthread_cond_broadcast_pi(cond)
        {
                lock(cond->__data.__lock);
                unlock(cond->__data.__lock);
                futex_requeue_pi(cond->__data.__futex, cond->mutex);
        }

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new futex op codes provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.
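
As a rough illustration of the user-space side of this interface, the
sketch below wraps the two op codes in thin helpers around the raw
futex system call. It assumes a Linux system whose <linux/futex.h>
spells the requeue op code FUTEX_CMP_REQUEUE_PI, and it follows the
argument layout described in futex(2). The helper names
futex_wait_requeue_pi() and futex_cmp_requeue_pi() are hypothetical
and are not glibc or kernel functions.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: block on cond_futex while it still holds
 * 'expected', waiting to be requeued onto pi_futex. The timeout, if
 * not NULL, is an absolute time.
 */
static long futex_wait_requeue_pi(uint32_t *cond_futex, uint32_t expected,
                                  const struct timespec *abs_timeout,
                                  uint32_t *pi_futex)
{
        return syscall(SYS_futex, cond_futex, FUTEX_WAIT_REQUEUE_PI,
                       expected, abs_timeout, pi_futex, 0);
}

/*
 * Hypothetical helper: requeue waiters from cond_futex to pi_futex,
 * provided cond_futex still holds 'expected'. nr_wake is fixed at 1;
 * nr_requeue travels in the slot otherwise used for the timeout.
 */
static long futex_cmp_requeue_pi(uint32_t *cond_futex, uint32_t expected,
                                 int nr_requeue, uint32_t *pi_futex)
{
        return syscall(SYS_futex, cond_futex, FUTEX_CMP_REQUEUE_PI,
                       1, (unsigned long)nr_requeue, pi_futex, expected);
}
```

Both helpers return -1 and set errno on failure, for example when the
futex word no longer holds the expected value.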

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this operation is
still handled by futex_requeue() (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
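
The wake-or-requeue rule can be pictured with a small user-space toy
model. This is only a sketch under simplifying assumptions, not the
kernel implementation: struct toy_pi_lock, toy_nr_requeue() and
toy_requeue_pi() are hypothetical names, tasks are plain integers
listed highest priority first, and the rt_mutex is reduced to an
owner field and a waiter count. The point is that a task is woken
only when the lock could be taken on its behalf, so the lock never
ends up with waiters and no owner.

```c
#include <limits.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Toy stand-in for a PI futex and its rt_mutex (not kernel code). */
struct toy_pi_lock {
        int owner;      /* task id of the owner, or NO_OWNER */
        int nr_waiters; /* tasks queued behind the owner */
};

/* nr_requeue as described above: INT_MAX for broadcast, 0 for signal. */
static int toy_nr_requeue(int broadcast)
{
        return broadcast ? INT_MAX : 0;
}

/*
 * Handle up to nr_wake + nr_requeue tasks from waiters[]. A task is
 * woken only if the lock is free and can be acquired for it; every
 * other task is queued as a waiter. Returns the number of tasks
 * woken, or -1 if nr_wake is not 1. *handled receives the number of
 * tasks processed.
 */
static int toy_requeue_pi(struct toy_pi_lock *lock, const int *waiters,
                          size_t nr_waiting, int nr_wake, int nr_requeue,
                          size_t *handled)
{
        /* Widen before adding so that 1 + INT_MAX cannot overflow. */
        long long limit = (long long)nr_wake + nr_requeue;
        int woken = 0;
        size_t i;

        if (nr_wake != 1)
                return -1;

        for (i = 0; i < nr_waiting && (long long)i < limit; i++) {
                if (lock->owner == NO_OWNER) {
                        lock->owner = waiters[i]; /* acquired for the task */
                        woken++;
                } else {
                        lock->nr_waiters++;       /* queued behind the owner */
                }
        }
        *handled = i;
        return woken;
}
```

In this model, a broadcast issued while the caller holds the mutex
wakes no task and queues every waiter, while a broadcast on a free
lock wakes exactly the top waiter and queues the rest.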