
mm: add vma_start_write_killable()

Patch series "vma_start_write_killable", v2.

When we added the VMA lock, we made a major oversight in not adding a
killable variant. That can run us into trouble where a thread takes the
VMA lock for read (eg handling a page fault) and then goes out to lunch
for an hour (eg doing reclaim). Another thread tries to modify the VMA,
taking the mmap_lock for write, then attempts to lock the VMA for write.
That blocks on the first thread, and ensures that every other page fault
now tries to take the mmap_lock for read. Because everything's in an
uninterruptible sleep, we can't kill the task, which makes me angry.

This patchset just adds vma_start_write_killable() and converts one caller
to use it. Most callers are somewhat tricky to convert, so expect individual
follow-up patches per call site, each needing careful analysis to make sure
we do proper cleanup.


This patch (of 2):

The vma can be held read-locked for a substantial period of time, eg if
memory allocation needs to go into reclaim. It's useful to be able to
send fatal signals to threads which are waiting for the write lock.

Link: https://lkml.kernel.org/r/20251110203204.1454057-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20251110203204.1454057-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Chris Li <chriscli@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Matthew Wilcox (Oracle); committed by Andrew Morton
2197bb60 3a47e877

4 files changed, 69 insertions(+), 12 deletions(-)
+8 -1
Documentation/mm/process_addrs.rst

···
 * **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
   as a read/write semaphore in practice. A VMA read lock is obtained via
   :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
-  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
+  write lock via vma_start_write() or vma_start_write_killable()
+  (all VMA write locks are unlocked
   automatically when the mmap write lock is released). To take a VMA write lock
   you **must** have already acquired an :c:func:`!mmap_write_lock`.
 * **rmap locks** - When trying to access VMAs through the reverse mapping via a
···
 Stack expansion throws up additional complexities in that we cannot permit there
 to be racing page faults, as a result we invoke :c:func:`!vma_start_write` to
 prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
+
+------------------------
+Functions and structures
+------------------------
+
+.. kernel-doc:: include/linux/mmap_lock.h
+28 -2
include/linux/mmap_lock.h

···
 	return (vma->vm_lock_seq == *mm_lock_seq);
 }
 
-void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq);
+int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
+		int state);
 
 /*
  * Begin writing to a VMA.
···
 	if (__is_vma_write_locked(vma, &mm_lock_seq))
 		return;
 
-	__vma_start_write(vma, mm_lock_seq);
+	__vma_start_write(vma, mm_lock_seq, TASK_UNINTERRUPTIBLE);
+}
+
+/**
+ * vma_start_write_killable - Begin writing to a VMA.
+ * @vma: The VMA we are going to modify.
+ *
+ * Exclude concurrent readers under the per-VMA lock until the currently
+ * write-locked mmap_lock is dropped or downgraded.
+ *
+ * Context: May sleep while waiting for readers to drop the vma read lock.
+ * Caller must already hold the mmap_lock for write.
+ *
+ * Return: 0 for a successful acquisition.  -EINTR if a fatal signal was
+ * received.
+ */
+static inline __must_check
+int vma_start_write_killable(struct vm_area_struct *vma)
+{
+	unsigned int mm_lock_seq;
+
+	if (__is_vma_write_locked(vma, &mm_lock_seq))
+		return 0;
+	return __vma_start_write(vma, mm_lock_seq, TASK_KILLABLE);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
···
 static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
+static inline __must_check
+int vma_start_write_killable(struct vm_area_struct *vma) { return 0; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 	{ mmap_assert_write_locked(vma->vm_mm); }
 static inline void vma_assert_attached(struct vm_area_struct *vma) {}
+25 -9
mm/mmap_lock.c

···
 
 #ifdef CONFIG_MMU
 #ifdef CONFIG_PER_VMA_LOCK
-static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
+/*
+ * Return value: 0 if vma detached,
+ *		 1 if vma attached with no readers,
+ *		 -EINTR if signal received,
+ */
+static inline int __vma_enter_locked(struct vm_area_struct *vma,
+		bool detaching, int state)
 {
+	int err;
 	unsigned int tgt_refcnt = VMA_LOCK_OFFSET;
 
 	/* Additional refcnt if the vma is attached. */
···
 	 * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
 	 */
 	if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
-		return false;
+		return 0;
 
 	rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
-	rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+	err = rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
 		   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
-		   TASK_UNINTERRUPTIBLE);
+		   state);
+	if (err) {
+		rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+		return err;
+	}
 	lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
 
-	return true;
+	return 1;
 }
 
 static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
···
 	rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
 }
 
-void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
+int __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq,
+		int state)
 {
-	bool locked;
+	int locked;
 
 	/*
 	 * __vma_enter_locked() returns false immediately if the vma is not
 	 * attached, otherwise it waits until refcnt is indicating that vma
 	 * is attached with no readers.
 	 */
-	locked = __vma_enter_locked(vma, false);
+	locked = __vma_enter_locked(vma, false, state);
+	if (locked < 0)
+		return locked;
 
 	/*
 	 * We should use WRITE_ONCE() here because we can have concurrent reads
···
 		__vma_exit_locked(vma, &detached);
 		WARN_ON_ONCE(detached);	/* vma should remain attached */
 	}
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(__vma_start_write);
···
 	 */
 	if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
 		/* Wait until vma is detached with no readers. */
-		if (__vma_enter_locked(vma, true, TASK_UNINTERRUPTIBLE)) {
 			bool detached;
 
 			__vma_exit_locked(vma, &detached);
+8
tools/testing/vma/vma_internal.h

···
 	vma->vm_lock_seq++;
 }
 
+static inline __must_check
+int vma_start_write_killable(struct vm_area_struct *vma)
+{
+	/* Used to indicate to tests that a write operation has begun. */
+	vma->vm_lock_seq++;
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,