futex: Handle early deadlock return correctly

commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is no longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_start_proxy_lock()
due to a detected deadlock was not updated accordingly. That allows a
concurrent unlocker to observe inconsistent state, which triggers the
warning in the
unlock path.

 futex_lock_pi()                    futex_unlock_pi()
   lock(hb->lock)
   queue(hb_waiter)                 lock(hb->lock)
   lock(rtmutex->wait_lock)
   unlock(hb->lock)
                                    // acquired hb->lock
                                    hb_waiter = futex_top_waiter()
                                    lock(rtmutex->wait_lock)
   __rt_mutex_start_proxy_lock()
      ---> fail
           remove(rtmutex_waiter);
      ---> returns -EDEADLOCK
   unlock(rtmutex->wait_lock)
                                    // acquired wait_lock
                                    wake_futex_pi()
                                    rt_mutex_next_owner()
                                      --> returns NULL
                                      --> WARN

   lock(hb->lock)
   unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_start_proxy_lock(), as this lets the unlocker observe a waiter in
the hash bucket but no waiter on the rtmutex, i.e. inconsistent state.
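
For illustration, the check that fires is roughly the following (condensed
from futex_unlock_pi()/wake_futex_pi(); locking and error handling omitted):

	top_waiter = futex_top_waiter(hb, &key);  /* hb waiter is visible */
	...
	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
	if (WARN_ON_ONCE(!new_owner))  /* rtmutex wait list already empty */
		ret = -EAGAIN;

Once the unlocker has observed the hash bucket waiter, it rightfully expects
a matching waiter on the rtmutex, which the early removal just took out.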

The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task has reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_start_proxy_lock() in the same way and
let the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_start_proxy_lock() gains the rtmutex waiter
removal for the failure case, so that the other call sites continue to
operate correctly.
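
Condensed, the resulting flow in futex_lock_pi() is roughly as follows (the
kernel/futex.c hunk below has the full details):

	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);

	if (ret) {
		if (ret == 1)
			ret = 0;
		goto cleanup;		/* rt_waiter stays enqueued */
	}
	...
cleanup:
	spin_lock(q.lock_ptr);		/* reacquire hb->lock first, then let
					   rt_mutex_cleanup_proxy_lock() remove
					   the waiter consistently */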

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.

Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

---
 kernel/futex.c           | 28 ++++++++++++++++++----------
 kernel/locking/rtmutex.c | 37 ++++++++++++++++++++++++++++++-------
 2 files changed, 50 insertions(+), 15 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@
 	 * and BUG when futex_unlock_pi() interleaves with this.
 	 *
 	 * Therefore acquire wait_lock while holding hb->lock, but drop the
-	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
-	 * serializes against futex_unlock_pi() as that does the exact same
-	 * lock handoff sequence.
+	 * latter before calling __rt_mutex_start_proxy_lock(). This
+	 * interleaves with futex_unlock_pi() -- which does a similar lock
+	 * handoff -- such that the latter can observe the futex_q::pi_state
+	 * before __rt_mutex_start_proxy_lock() is done.
 	 */
 	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 	spin_unlock(q.lock_ptr);
+	/*
+	 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+	 * such that futex_unlock_pi() is guaranteed to observe the waiter when
+	 * it sees the futex_q::pi_state.
+	 */
 	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
 	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
 
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
-
-		spin_lock(q.lock_ptr);
-		goto no_block;
+		goto cleanup;
 	}
 
-
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+cleanup:
 	spin_lock(q.lock_ptr);
 	/*
-	 * If we failed to acquire the lock (signal/timeout), we must
+	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
-	 * wait lists consistent.
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+	 * lists consistent.
 	 *
 	 * In particular; it is important that futex_unlock_pi() can not
 	 * observe this inconsistency.
@@ -3013,6 +3017,10 @@
 	 * there is no point where we hold neither; and therefore
 	 * wake_futex_pi() must observe a state consistent with what we
 	 * observed.
+	 *
+	 * In particular; this forces __rt_mutex_start_proxy() to
+	 * complete such that we're guaranteed to observe the
+	 * rt_waiter. Also see the WARN in wake_futex_pi().
 	 */
 	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	spin_unlock(&hb->lock);

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,11 +1726,32 @@
 	rt_mutex_set_owner(lock, NULL);
 }
 
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:	the rt_mutex to take
+ * @waiter:	the pre-initialized rt_mutex_waiter
+ * @task:	the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
 int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
 
+	lockdep_assert_held(&lock->wait_lock);
+
 	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
 
@@ -1749,9 +1770,6 @@
 		ret = 0;
 	}
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
@@ -1763,12 +1781,18 @@
  * @waiter:	the pre-initialized rt_mutex_waiter
  * @task:	the task to prepare
  *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
  * Returns:
  *  0 - task blocked on lock
  *  1 - acquired the lock for task, caller should wake it up
  * <0 - error
  *
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
  */
 int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@
 
 	raw_spin_lock_irq(&lock->wait_lock);
 	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	if (unlikely(ret))
+		remove_waiter(lock, waiter);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1845,7 +1871,8 @@
  * @lock:	the rt_mutex we were woken on
  * @waiter:	the pre-initialized rt_mutex_waiter
  *
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
 *
 * Unless we acquired the lock; we're still enqueued on the wait-list and can
 * in fact still be granted ownership until we're removed. Therefore we can