Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: replace vm_lock and detached flag with a reference count

rw_semaphore is a sizable structure of 40 bytes and consumes considerable
space in each vm_area_struct. However, vma_lock has two important
properties that allow the rw_semaphore to be replaced with a simpler
structure:

1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.

2. Only one writer at a time will ever try to write-lock a vma_lock,
because writers first take mmap_lock in write mode.

Because of these constraints, full rw_semaphore functionality is not
needed, and we can replace both the rw_semaphore and the vma->detached
flag with a refcount (vm_refcnt).

When a vma is in the detached state, vm_refcnt is 0, and only a call to
vma_mark_attached() can take it out of this state. Note that, unlike
before, both vma_mark_attached() and vma_mark_detached() are now required
to be called only after the vma has been write-locked. vma_mark_attached()
changes vm_refcnt to 1 to indicate that the vma has been attached to the
vma tree.
When a reader takes the read lock, it increments vm_refcnt, unless the top
usable bit of vm_refcnt (0x40000000) is set, indicating the presence of a
writer. When a writer takes the write lock, it sets the top usable bit to
indicate its presence. If there are readers, the writer will wait on the
newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in
write mode first, there can be only one writer at a time. The last reader
to release the lock will signal the writer to wake up. The refcount might
overflow if there are many competing readers, in which case read-locking
will fail. Readers are expected to handle such failures.

In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop
to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma
with no readers;
5. vm_refcnt overflow is handled by the readers.

While this vm_lock replacement does not yet result in a smaller
vm_area_struct (it stays at 256 bytes due to cacheline alignment), it
allows for further size optimization by structure member regrouping to
bring the size of vm_area_struct below 192 bytes.

[surenb@google.com: fix a crash due to vma_end_read() that should have been removed]
Link: https://lkml.kernel.org/r/20250220200208.323769-1-surenb@google.com
Link: https://lkml.kernel.org/r/20250213224655.1680278-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Klara Modin <klarasmodin@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Suren Baghdasaryan, committed by Andrew Morton
f35ab95c 4e0dbe10
+218 -106 total

include/linux/mm.h: +86 -44
···
 #include <linux/memremap.h>
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
+#include <linux/rcuwait.h>

 struct mempolicy;
 struct anon_vma;
···
 #endif /* CONFIG_NUMA_BALANCING */

 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_lock_init(struct vm_area_struct *vma)
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 {
-        init_rwsem(&vma->vm_lock.lock);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+        static struct lock_class_key lockdep_key;
+
+        lockdep_init_map(&vma->vmlock_dep_map, "vm_lock", &lockdep_key, 0);
+#endif
+        if (reset_refcnt)
+                refcount_set(&vma->vm_refcnt, 0);
         vma->vm_lock_seq = UINT_MAX;
+}
+
+static inline bool is_vma_writer_only(int refcnt)
+{
+        /*
+         * With a writer and no readers, refcnt is VMA_LOCK_OFFSET if the vma
+         * is detached and (VMA_LOCK_OFFSET + 1) if it is attached. Waiting on
+         * a detached vma happens only in vma_mark_detached() and is a rare
+         * case, therefore most of the time there will be no unnecessary wakeup.
+         */
+        return refcnt & VMA_LOCK_OFFSET && refcnt <= VMA_LOCK_OFFSET + 1;
+}
+
+static inline void vma_refcount_put(struct vm_area_struct *vma)
+{
+        /* Use a copy of vm_mm in case vma is freed after we drop vm_refcnt */
+        struct mm_struct *mm = vma->vm_mm;
+        int oldcnt;
+
+        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+        if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
+
+                if (is_vma_writer_only(oldcnt - 1))
+                        rcuwait_wake_up(&mm->vma_writer_wait);
+        }
 }

 /*
  * Try to read-lock a vma. The function is allowed to occasionally yield false
  * locked result to avoid performance overhead, in which case we fall back to
  * using mmap_lock. The function should never yield false unlocked result.
+ * Returns the vma on success, NULL on failure to lock and EAGAIN if vma got
+ * detached.
  */
-static inline bool vma_start_read(struct vm_area_struct *vma)
+static inline struct vm_area_struct *vma_start_read(struct vm_area_struct *vma)
 {
+        int oldcnt;
+
         /*
          * Check before locking. A race might cause false locked result.
          * We can use READ_ONCE() for the mm_lock_seq here, and don't need
···
          * need ordering is below.
          */
         if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq.sequence))
-                return false;
-
-        if (unlikely(down_read_trylock(&vma->vm_lock.lock) == 0))
-                return false;
+                return NULL;

         /*
-         * Overflow might produce false locked result.
+         * If VMA_LOCK_OFFSET is set, __refcount_inc_not_zero_limited_acquire()
+         * will fail because VMA_REF_LIMIT is less than VMA_LOCK_OFFSET.
+         * Acquire fence is required here to avoid reordering against later
+         * vm_lock_seq check and checks inside lock_vma_under_rcu().
+         */
+        if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
+                                                              VMA_REF_LIMIT))) {
+                /* return EAGAIN if vma got detached from under us */
+                return oldcnt ? NULL : ERR_PTR(-EAGAIN);
+        }
+
+        rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
+        /*
+         * Overflow of vm_lock_seq/mm_lock_seq might produce false locked result.
          * False unlocked result is impossible because we modify and check
-         * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
+         * vma->vm_lock_seq under vma->vm_refcnt protection and mm->mm_lock_seq
          * modification invalidates all existing locks.
          *
          * We must use ACQUIRE semantics for the mm_lock_seq so that if we are
···
          * This pairs with RELEASE semantics in vma_end_write_all().
          */
         if (unlikely(vma->vm_lock_seq == raw_read_seqcount(&vma->vm_mm->mm_lock_seq))) {
-                up_read(&vma->vm_lock.lock);
-                return false;
+                vma_refcount_put(vma);
+                return NULL;
         }
-        return true;
+
+        return vma;
 }

 /*
···
  */
 static inline bool vma_start_read_locked_nested(struct vm_area_struct *vma, int subclass)
 {
+        int oldcnt;
+
         mmap_assert_locked(vma->vm_mm);
-        down_read_nested(&vma->vm_lock.lock, subclass);
+        if (unlikely(!__refcount_inc_not_zero_limited_acquire(&vma->vm_refcnt, &oldcnt,
+                                                              VMA_REF_LIMIT)))
+                return false;
+
+        rwsem_acquire_read(&vma->vmlock_dep_map, 0, 1, _RET_IP_);
         return true;
 }
···
  */
 static inline bool vma_start_read_locked(struct vm_area_struct *vma)
 {
-        mmap_assert_locked(vma->vm_mm);
-        down_read(&vma->vm_lock.lock);
-        return true;
+        return vma_start_read_locked_nested(vma, 0);
 }

 static inline void vma_end_read(struct vm_area_struct *vma)
 {
-        rcu_read_lock(); /* keeps vma alive till the end of up_read */
-        up_read(&vma->vm_lock.lock);
-        rcu_read_unlock();
+        vma_refcount_put(vma);
 }

 /* WARNING! Can only be used if mmap_lock is expected to be write-locked */
···

 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
-        if (!rwsem_is_locked(&vma->vm_lock.lock))
-                vma_assert_write_locked(vma);
+        unsigned int mm_lock_seq;
+
+        VM_BUG_ON_VMA(refcount_read(&vma->vm_refcnt) <= 1 &&
+                      !__is_vma_write_locked(vma, &mm_lock_seq), vma);
 }

+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
 static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-        WARN_ON_ONCE(vma->detached);
+        WARN_ON_ONCE(!refcount_read(&vma->vm_refcnt));
 }

 static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-        WARN_ON_ONCE(!vma->detached);
+        WARN_ON_ONCE(refcount_read(&vma->vm_refcnt));
 }

 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-        vma_assert_detached(vma);
-        vma->detached = false;
-}
-
-static inline void vma_mark_detached(struct vm_area_struct *vma)
-{
-        /* When detaching vma should be write-locked */
         vma_assert_write_locked(vma);
-        vma_assert_attached(vma);
-        vma->detached = true;
+        vma_assert_detached(vma);
+        refcount_set(&vma->vm_refcnt, 1);
 }

-static inline bool is_vma_detached(struct vm_area_struct *vma)
-{
-        return vma->detached;
-}
+void vma_mark_detached(struct vm_area_struct *vma);

 static inline void release_fault_lock(struct vm_fault *vmf)
 {
···

 #else /* CONFIG_PER_VMA_LOCK */

-static inline void vma_lock_init(struct vm_area_struct *vma) {}
-static inline bool vma_start_read(struct vm_area_struct *vma)
-                { return false; }
+static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
+static inline struct vm_area_struct *vma_start_read(struct vm_area_struct *vma)
+                { return NULL; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
···
         vma->vm_mm = mm;
         vma->vm_ops = &vma_dummy_vm_ops;
         INIT_LIST_HEAD(&vma->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-        /* vma is not locked, can't use vma_mark_detached() */
-        vma->detached = true;
-#endif
         vma_numab_state_init(vma);
-        vma_lock_init(vma);
+        vma_lock_init(vma, false);
 }

 /* Use when VMA is not part of the VMA tree and needs no locking */
include/linux/mm_types.h: +10 -12
···
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
+#include <linux/types.h>

 #include <asm/mmu.h>
···
 }
 #endif

-struct vma_lock {
-        struct rw_semaphore lock;
-};
+#define VMA_LOCK_OFFSET 0x40000000
+#define VMA_REF_LIMIT   (VMA_LOCK_OFFSET - 1)

 struct vma_numab_state {
         /*
···

 #ifdef CONFIG_PER_VMA_LOCK
         /*
-         * Flag to indicate areas detached from the mm->mm_mt tree.
-         * Unstable RCU readers are allowed to read this.
-         */
-        bool detached;
-
-        /*
          * Can only be written (using WRITE_ONCE()) while holding both:
          * - mmap_lock (in write mode)
-         * - vm_lock->lock (in write mode)
+         * - vm_refcnt bit at VMA_LOCK_OFFSET is set
          * Can be read reliably while holding one of:
          * - mmap_lock (in read or write mode)
-         * - vm_lock->lock (in read or write mode)
+         * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
          * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
          * while holding nothing (except RCU to keep the VMA struct allocated).
          *
···
         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_PER_VMA_LOCK
         /* Unstable RCU readers are allowed to read this. */
-        struct vma_lock vm_lock ____cacheline_aligned_in_smp;
+        refcount_t vm_refcnt ____cacheline_aligned_in_smp;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+        struct lockdep_map vmlock_dep_map;
+#endif
 #endif
 } __randomize_layout;
···
                                          * by mmlist_lock
                                          */
 #ifdef CONFIG_PER_VMA_LOCK
+        struct rcuwait vma_writer_wait;
         /*
          * This field has lock-like semantics, meaning it is sometimes
          * accessed with ACQUIRE/RELEASE semantics.
kernel/fork.c: +6 -7
···
          * will be reinitialized.
          */
         data_race(memcpy(new, orig, sizeof(*new)));
-        vma_lock_init(new);
+        vma_lock_init(new, true);
         INIT_LIST_HEAD(&new->anon_vma_chain);
-#ifdef CONFIG_PER_VMA_LOCK
-        /* vma is not locked, can't use vma_mark_detached() */
-        new->detached = true;
-#endif
         vma_numab_state_init(new);
         dup_anon_vma_name(orig, new);
···

 void __vm_area_free(struct vm_area_struct *vma)
 {
+        /* The vma should be detached while being destroyed. */
+        vma_assert_detached(vma);
         vma_numab_state_free(vma);
         free_anon_vma_name(vma);
         kmem_cache_free(vm_area_cachep, vma);
···
         struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                   vm_rcu);

-        /* The vma should not be locked while being destroyed. */
-        VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock.lock), vma);
         __vm_area_free(vma);
 }
 #endif
···
 {
         init_rwsem(&mm->mmap_lock);
         mm_lock_seqcount_init(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+        rcuwait_init(&mm->vma_writer_wait);
+#endif
 }

 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm/init-mm.c: +1
···
         .arg_lock       = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
         .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
 #ifdef CONFIG_PER_VMA_LOCK
+        .vma_writer_wait = __RCUWAIT_INITIALIZER(init_mm.vma_writer_wait),
         .mm_lock_seq    = SEQCNT_ZERO(init_mm.mm_lock_seq),
 #endif
         .user_ns        = &init_user_ns,
mm/memory.c: +80 -10
···
 #endif

 #ifdef CONFIG_PER_VMA_LOCK
+static inline bool __vma_enter_locked(struct vm_area_struct *vma, bool detaching)
+{
+        unsigned int tgt_refcnt = VMA_LOCK_OFFSET;
+
+        /* Additional refcnt if the vma is attached. */
+        if (!detaching)
+                tgt_refcnt++;
+
+        /*
+         * If vma is detached then only vma_mark_attached() can raise the
+         * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
+         */
+        if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
+                return false;
+
+        rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
+        rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
+                   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
+                   TASK_UNINTERRUPTIBLE);
+        lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
+
+        return true;
+}
+
+static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
+{
+        *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
+        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
+}
+
 void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
 {
-        down_write(&vma->vm_lock.lock);
+        bool locked;
+
+        /*
+         * __vma_enter_locked() returns false immediately if the vma is not
+         * attached, otherwise it waits until refcnt is indicating that vma
+         * is attached with no readers.
+         */
+        locked = __vma_enter_locked(vma, false);
+
         /*
          * We should use WRITE_ONCE() here because we can have concurrent reads
          * from the early lockless pessimistic check in vma_start_read().
···
          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
          */
         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
-        up_write(&vma->vm_lock.lock);
+
+        if (locked) {
+                bool detached;
+
+                __vma_exit_locked(vma, &detached);
+                WARN_ON_ONCE(detached); /* vma should remain attached */
+        }
 }
 EXPORT_SYMBOL_GPL(__vma_start_write);
+
+void vma_mark_detached(struct vm_area_struct *vma)
+{
+        vma_assert_write_locked(vma);
+        vma_assert_attached(vma);
+
+        /*
+         * We are the only writer, so no need to use vma_refcount_put().
+         * The condition below is unlikely because the vma has been already
+         * write-locked and readers can increment vm_refcnt only temporarily
+         * before they check vm_lock_seq, realize the vma is locked and drop
+         * back the vm_refcnt. That is a narrow window for observing a raised
+         * vm_refcnt.
+         */
+        if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+                /* Wait until vma is detached with no readers. */
+                if (__vma_enter_locked(vma, true)) {
+                        bool detached;
+
+                        __vma_exit_locked(vma, &detached);
+                        WARN_ON_ONCE(!detached);
+                }
+        }
+}

 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
···
         if (!vma)
                 goto inval;

-        if (!vma_start_read(vma))
-                goto inval;
+        vma = vma_start_read(vma);
+        if (IS_ERR_OR_NULL(vma)) {
+                /* Check if the VMA got isolated after we found it */
+                if (PTR_ERR(vma) == -EAGAIN) {
+                        count_vm_vma_lock_event(VMA_LOCK_MISS);
+                        /* The area was replaced with another one */
+                        goto retry;
+                }

-        /* Check if the VMA got isolated after we found it */
-        if (is_vma_detached(vma)) {
-                vma_end_read(vma);
-                count_vm_vma_lock_event(VMA_LOCK_MISS);
-                /* The area was replaced with another one */
-                goto retry;
+                /* Failed to lock the VMA */
+                goto inval;
         }
         /*
          * At this point, we have a stable reference to a VMA: The VMA is
tools/testing/vma/linux/atomic.h: +5
···
 #define atomic_set(x, y)        uatomic_set(x, y)
 #define U8_MAX                  UCHAR_MAX

+#ifndef atomic_cmpxchg_relaxed
+#define atomic_cmpxchg_relaxed          uatomic_cmpxchg
+#define atomic_cmpxchg_release          uatomic_cmpxchg
+#endif /* atomic_cmpxchg_relaxed */
+
 #endif  /* _LINUX_ATOMIC_H */
tools/testing/vma/vma_internal.h: +30 -33
···
 #include <linux/maple_tree.h>
 #include <linux/mm.h>
 #include <linux/rbtree.h>
-#include <linux/rwsem.h>
+#include <linux/refcount.h>

 extern unsigned long stack_guard_gap;
 #ifdef CONFIG_MMU
···
  */
 #define pr_warn_once pr_err

-typedef struct refcount_struct {
-        atomic_t refs;
-} refcount_t;
-
 struct kref {
         refcount_t refcount;
 };
···
         unsigned long flags;    /* Must use atomic bitops to access */
 };

-struct vma_lock {
-        struct rw_semaphore lock;
-};
-
-
 struct file {
         struct address_space    *f_mapping;
 };
+
+#define VMA_LOCK_OFFSET 0x40000000

 struct vm_area_struct {
         /* The first cache line has the info for VMA tree walking. */
···
         };

 #ifdef CONFIG_PER_VMA_LOCK
-        /* Flag to indicate areas detached from the mm->mm_mt tree */
-        bool detached;
-
         /*
          * Can only be written (using WRITE_ONCE()) while holding both:
          * - mmap_lock (in write mode)
-         * - vm_lock.lock (in write mode)
+         * - vm_refcnt bit at VMA_LOCK_OFFSET is set
          * Can be read reliably while holding one of:
          * - mmap_lock (in read or write mode)
-         * - vm_lock.lock (in read or write mode)
+         * - vm_refcnt bit at VMA_LOCK_OFFSET is set or vm_refcnt > 1
          * Can be read unreliably (using READ_ONCE()) for pessimistic bailout
          * while holding nothing (except RCU to keep the VMA struct allocated).
          *
···
          * slowpath.
          */
         unsigned int vm_lock_seq;
-        struct vma_lock vm_lock;
 #endif

         /*
···
         struct vma_numab_state *numab_state;    /* NUMA Balancing state */
 #endif
         struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_PER_VMA_LOCK
+        /* Unstable RCU readers are allowed to read this. */
+        refcount_t vm_refcnt;
+#endif
 } __randomize_layout;

 struct vm_fault {};
···
         return mas_find(&vmi->mas, ULONG_MAX);
 }

-static inline void vma_lock_init(struct vm_area_struct *vma)
-{
-        init_rwsem(&vma->vm_lock.lock);
-        vma->vm_lock_seq = UINT_MAX;
-}
-
+/*
+ * WARNING: to avoid racing with vma_mark_attached()/vma_mark_detached(), these
+ * assertions should be made either under mmap_write_lock or when the object
+ * has been isolated under mmap_write_lock, ensuring no competing writers.
+ */
 static inline void vma_assert_attached(struct vm_area_struct *vma)
 {
-        WARN_ON_ONCE(vma->detached);
+        WARN_ON_ONCE(!refcount_read(&vma->vm_refcnt));
 }

 static inline void vma_assert_detached(struct vm_area_struct *vma)
 {
-        WARN_ON_ONCE(!vma->detached);
+        WARN_ON_ONCE(refcount_read(&vma->vm_refcnt));
 }

 static inline void vma_assert_write_locked(struct vm_area_struct *);
 static inline void vma_mark_attached(struct vm_area_struct *vma)
 {
-        vma->detached = false;
+        vma_assert_write_locked(vma);
+        vma_assert_detached(vma);
+        refcount_set(&vma->vm_refcnt, 1);
 }

 static inline void vma_mark_detached(struct vm_area_struct *vma)
 {
-        /* When detaching vma should be write-locked */
         vma_assert_write_locked(vma);
-        vma->detached = true;
+        vma_assert_attached(vma);
+        /* We are the only writer, so no need to use vma_refcount_put(). */
+        if (unlikely(!refcount_dec_and_test(&vma->vm_refcnt))) {
+                /*
+                 * Reader must have temporarily raised vm_refcnt but it will
+                 * drop it without using the vma since vma is write-locked.
+                 */
+        }
 }

 extern const struct vm_operations_struct vma_dummy_vm_ops;
···
         vma->vm_mm = mm;
         vma->vm_ops = &vma_dummy_vm_ops;
         INIT_LIST_HEAD(&vma->anon_vma_chain);
-        /* vma is not locked, can't use vma_mark_detached() */
-        vma->detached = true;
-        vma_lock_init(vma);
+        vma->vm_lock_seq = UINT_MAX;
 }

 static inline struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
···
                 return NULL;

         memcpy(new, orig, sizeof(*new));
-        vma_lock_init(new);
+        refcount_set(&new->vm_refcnt, 0);
+        new->vm_lock_seq = UINT_MAX;
         INIT_LIST_HEAD(&new->anon_vma_chain);
-        /* vma is not locked, can't use vma_mark_detached() */
-        new->detached = true;

         return new;
 }