futex: Handle early deadlock return correctly

commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
rtmutex") changed the locking rules in the futex code so that the hash
bucket lock is no longer held while the waiter is enqueued into the
rtmutex wait list. This made the lock and the unlock path symmetric, but
unfortunately the possible early exit from __rt_mutex_start_proxy_lock() due to
a detected deadlock was not updated accordingly. That allows a concurrent
unlocker to observe inconsistent state which triggers the warning in the
unlock path.

futex_lock_pi()                    futex_unlock_pi()
  lock(hb->lock)
  queue(hb_waiter)                 lock(hb->lock)
  lock(rtmutex->wait_lock)
  unlock(hb->lock)
                                   // acquired hb->lock
                                   hb_waiter = futex_top_waiter()
                                   lock(rtmutex->wait_lock)
  __rt_mutex_start_proxy_lock()
     ---> fail
          remove(rtmutex_waiter);
     ---> returns -EDEADLOCK
  unlock(rtmutex->wait_lock)
                                   // acquired wait_lock
                                   wake_futex_pi()
                                   rt_mutex_next_owner()
                                     --> returns NULL
                                     --> WARN

  lock(hb->lock)
  unqueue(hb_waiter)

The problem is caused by the remove(rtmutex_waiter) in the failure case of
__rt_mutex_start_proxy_lock(), as this lets the unlocker observe a waiter in
the hash bucket but no waiter on the rtmutex, i.e. inconsistent state.
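
Condensed from the futex_lock_pi() slow path in the patch below (error
handling elided), the pre-fix window looks like this:

	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
	spin_unlock(q.lock_ptr);	/* hb->lock dropped, hb_waiter stays queued */
	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
	/*
	 * On a detected deadlock this removed rt_waiter again before
	 * returning, so a concurrent futex_unlock_pi() taking wait_lock
	 * next found the hb_waiter via futex_top_waiter() but saw
	 * rt_mutex_next_owner() == NULL and hit the WARN.
	 */
	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);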

The original commit handles this correctly for the other early return cases
(timeout, signal) by delaying the removal of the rtmutex waiter until the
returning task has reacquired the hash bucket lock.

Treat the failure case of __rt_mutex_start_proxy_lock() in the same way and
let the existing cleanup code handle the eventual handover of the rtmutex
gracefully. The regular rt_mutex_start_proxy_lock() gains the rtmutex waiter
removal for the failure case, so that the other callsites still operate
correctly.
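
Condensed (timer setup and error paths elided), the fixed slow path keeps
the rtmutex waiter enqueued on failure and removes it only once hb->lock
is held again:

	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);

	if (ret) {
		if (ret == 1)
			ret = 0;
		goto cleanup;		/* rt_waiter stays enqueued */
	}

	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);

cleanup:
	spin_lock(q.lock_ptr);		/* hb->lock first, then fix up the rtmutex state */
	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
		ret = 0;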

Add proper comments to the code so all these details are fully documented.

Thanks to Peter for helping with the analysis and writing the really
valuable code comments.
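
For reference, the early -EDEADLK return can be driven from userspace with
an ABBA pattern on two PI futexes. The following is a minimal illustrative
sketch, not the original reproducer; the helpers and the sleep()-based
timing are ad hoc, and actually hitting the race would additionally need a
concurrent FUTEX_UNLOCK_PI on the contended futex word. Build with
"gcc -pthread":

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_uint f1, f2;		/* futex words: 0 = free, else owner TID */
static pthread_barrier_t barrier;

static int lock_pi(atomic_uint *f)
{
	unsigned int old = 0;

	/* Userspace fast path of the PI protocol: CAS 0 -> TID. */
	if (atomic_compare_exchange_strong(f, &old, (unsigned int)syscall(SYS_gettid)))
		return 0;
	/* Contended: the kernel enqueues us as an rtmutex waiter. */
	return syscall(SYS_futex, f, FUTEX_LOCK_PI, 0, NULL, NULL, 0) ? -errno : 0;
}

static void unlock_pi(atomic_uint *f)
{
	unsigned int tid = (unsigned int)syscall(SYS_gettid);

	/* FUTEX_WAITERS set means the kernel must hand the lock over. */
	if (!atomic_compare_exchange_strong(f, &tid, 0))
		syscall(SYS_futex, f, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}

static void *thr(void *arg)
{
	lock_pi(&f2);
	pthread_barrier_wait(&barrier);	/* both locks held, cycle armed */
	sleep(1);			/* crude: wait for main to block on f2 */
	if (lock_pi(&f1) == -EDEADLK)	/* kernel walks the PI chain, finds the loop */
		fprintf(stderr, "FUTEX_LOCK_PI returned EDEADLK\n");
	unlock_pi(&f2);			/* hand f2 over to main */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_barrier_init(&barrier, NULL, 2);
	lock_pi(&f1);
	pthread_create(&t, NULL, thr, NULL);
	pthread_barrier_wait(&barrier);
	lock_pi(&f2);			/* blocks until thr drops f2 */
	unlock_pi(&f2);
	unlock_pi(&f1);
	pthread_join(t, NULL);
	return 0;
}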

Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Stefan Liebler <stli@linux.ibm.com>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

---
 kernel/futex.c           | 28 ++++++++++++++--------
 kernel/locking/rtmutex.c | 37 +++++++++++++++++++++---
 2 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2861,35 +2861,39 @@
 	 * and BUG when futex_unlock_pi() interleaves with this.
 	 *
 	 * Therefore acquire wait_lock while holding hb->lock, but drop the
-	 * latter before calling rt_mutex_start_proxy_lock(). This still fully
-	 * serializes against futex_unlock_pi() as that does the exact same
-	 * lock handoff sequence.
+	 * latter before calling __rt_mutex_start_proxy_lock(). This
+	 * interleaves with futex_unlock_pi() -- which does a similar lock
+	 * handoff -- such that the latter can observe the futex_q::pi_state
+	 * before __rt_mutex_start_proxy_lock() is done.
 	 */
 	raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
 	spin_unlock(q.lock_ptr);
+	/*
+	 * __rt_mutex_start_proxy_lock() unconditionally enqueues the @rt_waiter
+	 * such that futex_unlock_pi() is guaranteed to observe the waiter when
+	 * it sees the futex_q::pi_state.
+	 */
 	ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
 	raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);
 
 	if (ret) {
 		if (ret == 1)
 			ret = 0;
-
-		spin_lock(q.lock_ptr);
-		goto no_block;
+		goto cleanup;
 	}
 
-
 	if (unlikely(to))
 		hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
 
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
+cleanup:
 	spin_lock(q.lock_ptr);
 	/*
-	 * If we failed to acquire the lock (signal/timeout), we must
+	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex
-	 * wait lists consistent.
+	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
+	 * lists consistent.
 	 *
 	 * In particular; it is important that futex_unlock_pi() can not
 	 * observe this inconsistency.
@@ -3013,6 +3017,10 @@
 		 * there is no point where we hold neither; and therefore
 		 * wake_futex_pi() must observe a state consistent with what we
 		 * observed.
+		 *
+		 * In particular; this forces __rt_mutex_start_proxy() to
+		 * complete such that we're guaranteed to observe the
+		 * rt_waiter. Also see the WARN in wake_futex_pi().
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 		spin_unlock(&hb->lock);

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1726,11 +1726,32 @@
 	rt_mutex_set_owner(lock, NULL);
 }
 
+/**
+ * __rt_mutex_start_proxy_lock() - Start lock acquisition for another task
+ * @lock:		the rt_mutex to take
+ * @waiter:		the pre-initialized rt_mutex_waiter
+ * @task:		the task to prepare
+ *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: does _NOT_ remove the @waiter on failure; must either call
+ * rt_mutex_wait_proxy_lock() or rt_mutex_cleanup_proxy_lock() after this.
+ *
+ * Returns:
+ *  0 - task blocked on lock
+ *  1 - acquired the lock for task, caller should wake it up
+ * <0 - error
+ *
+ * Special API call for PI-futex support.
+ */
 int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task)
 {
 	int ret;
+
+	lockdep_assert_held(&lock->wait_lock);
 
 	if (try_to_take_rt_mutex(lock, task, NULL))
 		return 1;
@@ -1749,9 +1770,6 @@
 		ret = 0;
 	}
 
-	if (unlikely(ret))
-		remove_waiter(lock, waiter);
-
 	debug_rt_mutex_print_deadlock(waiter);
 
 	return ret;
@@ -1763,12 +1781,18 @@
  * @waiter:		the pre-initialized rt_mutex_waiter
  * @task:		the task to prepare
  *
+ * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
+ * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
+ *
+ * NOTE: unlike __rt_mutex_start_proxy_lock this _DOES_ remove the @waiter
+ * on failure.
+ *
  * Returns:
  *  0 - task blocked on lock
  *  1 - acquired the lock for task, caller should wake it up
  * <0 - error
  *
- * Special API call for FUTEX_REQUEUE_PI support.
+ * Special API call for PI-futex support.
  */
 int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
@@ -1778,6 +1802,8 @@
 
 	raw_spin_lock_irq(&lock->wait_lock);
 	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
+	if (unlikely(ret))
+		remove_waiter(lock, waiter);
 	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
@@ -1845,7 +1871,8 @@
  * @lock:		the rt_mutex we were woken on
  * @waiter:		the pre-initialized rt_mutex_waiter
  *
- * Attempt to clean up after a failed rt_mutex_wait_proxy_lock().
+ * Attempt to clean up after a failed __rt_mutex_start_proxy_lock() or
+ * rt_mutex_wait_proxy_lock().
  *
  * Unless we acquired the lock; we're still enqueued on the wait-list and can
 * in fact still be granted ownership until we're removed. Therefore we can