
fs/locks: create a tree of dependent requests.

When we find an existing lock which conflicts with a request,
and the request wants to wait, we currently add the request
to a list. When the lock is removed, the whole list is woken.
This can cause the thundering-herd problem.
To reduce the problem, we make use of the (new) fact that
a pending request can itself have a list of blocked requests.
When we find a conflict, we look through the existing blocked requests.
If any one of them blocks the new request, the new request is attached
below that request, otherwise it is added to the list of blocked
requests, which are now known to be mutually non-conflicting.

This way, when the lock is released, only a set of mutually
non-conflicting requests will be woken; the rest can stay asleep.
If a woken request cannot be granted and needs to be requeued,
all the other requests it blocks will then be woken.

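As a rough illustration, the scheme described above can be modelled in userspace C. This is a hypothetical sketch, not kernel code: struct req, the plain byte-range conflict() test, and the fixed-size child array are stand-ins for struct file_lock, the real per-lock-type conflict predicates, and the fl_blocked_requests lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical stand-in for struct file_lock: a byte-range request. */
struct req {
	long start, end;                   /* inclusive byte range */
	struct req *blocker;               /* parent in the tree, NULL if root */
	struct req *blocked[MAX_CHILDREN]; /* waiters this request blocks */
	int nblocked;
};

/* Simplified conflict test: the ranges overlap (the kernel also
 * checks lock type and owner). */
static bool conflict(const struct req *a, const struct req *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* Insert a waiter beneath `blocker`, descending past any existing
 * waiter that also blocks it, so that siblings never conflict. */
static void insert_block(struct req *blocker, struct req *waiter)
{
new_blocker:
	for (int i = 0; i < blocker->nblocked; i++)
		if (conflict(blocker->blocked[i], waiter)) {
			blocker = blocker->blocked[i];
			goto new_blocker;
		}
	waiter->blocker = blocker;
	blocker->blocked[blocker->nblocked++] = waiter;
}

/* Release a lock: each immediate child becomes the root of its own
 * tree and would be woken; grandchildren stay asleep under their
 * parents.  Returns the number of requests woken. */
static int release(struct req *lock)
{
	int n = lock->nblocked;
	for (int i = 0; i < n; i++)
		lock->blocked[i]->blocker = NULL;
	lock->nblocked = 0;
	return n;
}
```

For example, inserting waiters [5,14], [5,9] and [0,2] under an applied lock [0,9] places [5,9] beneath [5,14] (they conflict with each other), so releasing the applied lock wakes only the two mutually non-conflicting waiters.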
To make this more concrete:

If you have a many-core machine and many threads all wanting to
briefly lock a given file (udev is known to do this), you can get quite
poor performance.

When one thread releases a lock, it wakes up all other threads that
are waiting (classic thundering-herd) - one will get the lock and the
others go to sleep.
When you have few cores, this is not very noticeable: by the time the
4th or 5th thread gets enough CPU time to try to claim the lock, the
earlier threads have claimed it, done what was needed, and released.
So with few cores, many of the threads don't end up contending.
With 50+ cores, lots of threads can get the CPU at the same time,
and the contention can easily be measured.

This patchset creates a tree of pending lock requests in which siblings
don't conflict and each lock request does conflict with its parent.
When a lock is released, only requests which don't conflict with each
other are woken.

Testing shows that lock-acquisitions-per-second is now fairly stable
even as the number of contending processes goes to 1000. Without this
patch, locks-per-second drops off steeply after a few tens of
processes.

There is a small cost to this extra complexity.
At 20 processes running a particular test on 72 cores, lock
acquisitions per second drop from 1.8 million to 1.4 million with
this patch. At 100 processes, this patch still provides 1.4 million
per second, while without it the rate is about 700,000.

Reported-and-tested-by: Martin Wilck <mwilck@suse.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>

Authored by NeilBrown, committed by Jeff Layton. fd7732e0 c0e15908

+63 -6
fs/locks.c
···
  * Leases and LOCK_MAND
  * Matthew Wilcox <willy@debian.org>, June, 2000.
  * Stephen Rothwell <sfr@canb.auug.org.au>, June, 2000.
+ *
+ * Locking conflicts and dependencies:
+ * If multiple threads attempt to lock the same byte (or flock the same file)
+ * only one can be granted the lock, and the others must wait their turn.
+ * The first lock has been "applied" or "granted", the others are "waiting"
+ * and are "blocked" by the "applied" lock.
+ *
+ * Waiting and applied locks are all kept in trees whose properties are:
+ *
+ *	- the root of a tree may be an applied or waiting lock.
+ *	- every other node in the tree is a waiting lock that
+ *	  conflicts with every ancestor of that node.
+ *
+ * Every such tree begins life as a waiting singleton which obviously
+ * satisfies the above properties.
+ *
+ * The only ways we modify trees preserve these properties:
+ *
+ *	1. We may add a new leaf node, but only after first verifying that it
+ *	   conflicts with all of its ancestors.
+ *	2. We may remove the root of a tree, creating a new singleton
+ *	   tree from the root and N new trees rooted in the immediate
+ *	   children.
+ *	3. If the root of a tree is not currently an applied lock, we may
+ *	   apply it (if possible).
+ *	4. We may upgrade the root of the tree (either extend its range,
+ *	   or upgrade its entire range from read to write).
+ *
+ * When an applied lock is modified in a way that reduces or downgrades any
+ * part of its range, we remove all its children (2 above).  This particularly
+ * happens when a lock is unlocked.
+ *
+ * For each of those child trees we "wake up" the thread which is
+ * waiting for the lock so it can continue handling as follows: if the
+ * root of the tree applies, we do so (3).  If it doesn't, it must
+ * conflict with some applied lock.  We remove (wake up) all of its
+ * children (2), and add it as a new leaf to the tree rooted in the
+ * applied lock (1).  We then repeat the process recursively with those
+ * children.
+ *
  */
 
 #include <linux/capability.h>
···
  * but by ensuring that the flc_lock is also held on insertions we can avoid
  * taking the blocked_lock_lock in some cases when we see that the
  * fl_blocked_requests list is empty.
+ *
+ * Rather than just adding to the list, we check for conflicts with any existing
+ * waiters, and add beneath any waiter that blocks the new waiter.
+ * Thus wakeups don't happen until needed.
  */
 static void __locks_insert_block(struct file_lock *blocker,
-				 struct file_lock *waiter)
+				 struct file_lock *waiter,
+				 bool conflict(struct file_lock *,
+					       struct file_lock *))
 {
+	struct file_lock *fl;
 	BUG_ON(!list_empty(&waiter->fl_blocked_member));
+
+new_blocker:
+	list_for_each_entry(fl, &blocker->fl_blocked_requests, fl_blocked_member)
+		if (conflict(fl, waiter)) {
+			blocker = fl;
+			goto new_blocker;
+		}
 	waiter->fl_blocker = blocker;
 	list_add_tail(&waiter->fl_blocked_member, &blocker->fl_blocked_requests);
 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
···
 
 /* Must be called with flc_lock held. */
 static void locks_insert_block(struct file_lock *blocker,
-			       struct file_lock *waiter)
+			       struct file_lock *waiter,
+			       bool conflict(struct file_lock *,
+					     struct file_lock *))
 {
 	spin_lock(&blocked_lock_lock);
-	__locks_insert_block(blocker, waiter);
+	__locks_insert_block(blocker, waiter, conflict);
 	spin_unlock(&blocked_lock_lock);
 }
···
 		if (!(request->fl_flags & FL_SLEEP))
 			goto out;
 		error = FILE_LOCK_DEFERRED;
-		locks_insert_block(fl, request);
+		locks_insert_block(fl, request, flock_locks_conflict);
 		goto out;
 	}
 	if (request->fl_flags & FL_ACCESS)
···
 			spin_lock(&blocked_lock_lock);
 			if (likely(!posix_locks_deadlock(request, fl))) {
 				error = FILE_LOCK_DEFERRED;
-				__locks_insert_block(fl, request);
+				__locks_insert_block(fl, request,
+						     posix_locks_conflict);
 			}
 			spin_unlock(&blocked_lock_lock);
 			goto out;
···
 		break_time -= jiffies;
 		if (break_time == 0)
 			break_time++;
-		locks_insert_block(fl, new_fl);
+		locks_insert_block(fl, new_fl, leases_conflict);
 		trace_break_lease_block(inode, new_fl);
 		spin_unlock(&ctx->flc_lock);
 		percpu_up_read_preempt_enable(&file_rwsem);
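The wake-up-and-retry step described in the new comment (wake a subtree's root, and if it still conflicts with an applied lock, wake its children and re-enter as a leaf) can also be sketched in userspace. As before, struct req and the byte-range conflict() are illustrative stand-ins for struct file_lock and the real conflict predicates, not the kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Illustrative stand-in for struct file_lock. */
struct req {
	long start, end;                   /* inclusive byte range */
	struct req *blocker;               /* parent, NULL if root/woken */
	struct req *blocked[MAX_CHILDREN]; /* waiters this request blocks */
	int nblocked;
};

static bool conflict(const struct req *a, const struct req *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/* A woken waiter that still conflicts with an applied lock: first wake
 * (detach) all of its own children (rule 2 in the comment above), then
 * re-enter the applied lock's tree as a leaf (rule 1), descending past
 * any existing waiter that also blocks it. */
static void requeue(struct req *applied, struct req *waiter)
{
	for (int i = 0; i < waiter->nblocked; i++)
		waiter->blocked[i]->blocker = NULL; /* children woken */
	waiter->nblocked = 0;

new_blocker:
	for (int i = 0; i < applied->nblocked; i++)
		if (conflict(applied->blocked[i], waiter)) {
			applied = applied->blocked[i];
			goto new_blocker;
		}
	waiter->blocker = applied;
	applied->blocked[applied->nblocked++] = waiter;
}
```

The detached children are themselves woken and retried the same way, which is the recursion the comment ends on.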