Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: introduce VMA flags bitmap type

It is useful to transition to using a bitmap for VMA flags so we can avoid
running out of flags, especially for 32-bit kernels which are constrained
to 32 flags, necessitating some features to be limited to 64-bit kernels
only.

By doing so, we remove any constraint on the number of VMA flags moving
forwards no matter the platform and can decide in future to extend beyond
64 if required.

We start by declaring an opaque type, vma_flags_t (which resembles the
mm_struct flags type mm_flags_t), setting it to precisely the same size
as vm_flags_t, and placing it in a union with vm_flags in the VMA
declaration.

We additionally update struct vm_area_desc equivalently, placing the new
opaque type in a union with vm_flags.

This change therefore does not impact the size of struct vm_area_struct or
struct vm_area_desc.
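
The size claim can be checked in isolation. The following is a minimal
userspace sketch, not the kernel's code: struct demo_vma and
demo_roundtrip() are hypothetical names, and BITS_PER_LONG and
BITS_TO_LONGS() are simplified stand-ins for the kernel macros.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace stand-ins; the kernel's own definitions differ in detail. */
#define BITS_PER_LONG     (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n)  (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define NUM_VMA_FLAG_BITS BITS_PER_LONG

typedef unsigned long vm_flags_t;

/* One-word bitmap, mirroring the shape of the new opaque type. */
typedef struct {
	unsigned long __vma_flags[BITS_TO_LONGS(NUM_VMA_FLAG_BITS)];
} vma_flags_t;

/* Hypothetical reduced VMA: only the flags union is modelled. */
struct demo_vma {
	union {
		const vm_flags_t vm_flags;	/* legacy read-only view */
		vma_flags_t flags;		/* new bitmap view */
	};
};

/* While the bitmap is one word, the union adds no bytes. */
_Static_assert(sizeof(vma_flags_t) == sizeof(vm_flags_t),
	       "bitmap must match the legacy flags word");
_Static_assert(sizeof(struct demo_vma) == sizeof(vm_flags_t),
	       "union must not grow the structure");

/* Writes through the bitmap view and reads back through the legacy view. */
static unsigned long demo_roundtrip(unsigned long value)
{
	struct demo_vma vma = { .flags = { { 0 } } };

	vma.flags.__vma_flags[0] = value;
	return vma.vm_flags;
}
```

Both views alias the same storage, which is why existing readers of
vm_flags keep working while writers move to the bitmap.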

In order for the change to be iterative and to avoid impacting
performance, we designate VM_xxx declared bitmap flag values as those
which must exist in the first system word of the VMA flags bitmap.

We therefore declare vma_flags_clear_all(), vma_flags_overwrite_word(),
vma_flags_overwrite_word_once(), vma_flags_set_word() and
vma_flags_clear_word() in order to allow us to update the existing
vm_flags_*() functions to utilise these helpers.

This is a stepping stone towards converting users to the VMA flags bitmap
and behaves precisely as before.
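
To illustrate the "first system word" rule, here is a minimal userspace
model of the word helpers. It is a sketch under stated assumptions: the
bitmap is deliberately made two words wide (the patch itself keeps it at
one word), ACCESS_PRIVATE() is dropped, and demo_word_ops() is a
hypothetical driver function.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG     (sizeof(unsigned long) * CHAR_BIT)
/* Assumption for illustration only: two words, so an upper word exists. */
#define NUM_VMA_FLAG_BITS (2 * BITS_PER_LONG)
#define VMA_FLAG_WORDS    (NUM_VMA_FLAG_BITS / BITS_PER_LONG)

typedef struct {
	unsigned long __vma_flags[VMA_FLAG_WORDS];
} vma_flags_t;

/* Clears every word of the bitmap. */
static void vma_flags_clear_all(vma_flags_t *flags)
{
	memset(flags->__vma_flags, 0, sizeof(flags->__vma_flags));
}

/* Overwrites word 0 only; later words are left alone. */
static void vma_flags_overwrite_word(vma_flags_t *flags, unsigned long value)
{
	flags->__vma_flags[0] = value;
}

/* Sets bits in word 0 only. */
static void vma_flags_set_word(vma_flags_t *flags, unsigned long value)
{
	flags->__vma_flags[0] |= value;
}

/* Clears bits in word 0 only. */
static void vma_flags_clear_word(vma_flags_t *flags, unsigned long value)
{
	flags->__vma_flags[0] &= ~value;
}

/* Hypothetical driver: returns word 0 after a short sequence of updates. */
static unsigned long demo_word_ops(vma_flags_t *flags)
{
	vma_flags_clear_all(flags);
	flags->__vma_flags[1] = 0x1UL;		/* pretend a high flag is set */

	vma_flags_overwrite_word(flags, 0x3UL);	/* word 0 = 0x3 */
	vma_flags_set_word(flags, 0x4UL);	/* word 0 = 0x7 */
	vma_flags_clear_word(flags, 0x1UL);	/* word 0 = 0x6 */

	/* The word helpers never touch anything past word 0. */
	assert(flags->__vma_flags[1] == 0x1UL);
	return flags->__vma_flags[0];
}
```

Because the legacy VM_xxx values all live in word 0, each helper reduces
to the same single-word operation that the old vm_flags_*() code
performed.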

By doing this, we can eliminate the existing private vma->__vm_flags field
in the vma->vm_flags union and replace it with a field of the newly
introduced opaque type vma_flags_t, which we name flags, so the new bitmap
field is referred to as vma->flags.

We update vma_flag_[test, set]_atomic() to account for the change also.

We adapt vm_flags_reset_once() to clear any bits above the first system
word with a plain write, while providing write-once semantics only for the
first system word (which it is presumed the caller requires - and in all
current use cases this is so).

As we currently specify the VMA flags bitmap size as exactly BITS_PER_LONG
bits, the clearing step is a no-op, but it is defensive in preparation for
a future change that increases this.
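
The reset-once behaviour described above can be sketched in userspace as
follows. This is an illustration under assumptions, not the kernel
implementation: the bitmap is assumed to be two words wide so that the
clearing step does something, WRITE_ONCE() is a simplified volatile-store
stand-in, and demo_flags_reset_once() is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG     (sizeof(unsigned long) * CHAR_BIT)
/* Assumption for illustration only: a future two-word bitmap. */
#define NUM_VMA_FLAG_BITS (2 * BITS_PER_LONG)
#define VMA_FLAG_WORDS    (NUM_VMA_FLAG_BITS / BITS_PER_LONG)

/* Simplified stand-in for the kernel's WRITE_ONCE(): one volatile store. */
#define WRITE_ONCE(x, val) (*(volatile unsigned long *)&(x) = (val))

typedef struct {
	unsigned long __vma_flags[VMA_FLAG_WORDS];
} vma_flags_t;

/* Hypothetical model of the reset-once path for a multi-word bitmap. */
static void demo_flags_reset_once(vma_flags_t *flags, unsigned long value)
{
	unsigned long *bitmap;

	assert(flags != NULL);
	bitmap = flags->__vma_flags;

	/*
	 * Words above the first are cleared with plain writes. With a
	 * one-word bitmap this condition is constant-false and the branch
	 * disappears, which is the no-op case.
	 */
	if (NUM_VMA_FLAG_BITS > BITS_PER_LONG)
		memset(&bitmap[1], 0, (VMA_FLAG_WORDS - 1) * sizeof(*bitmap));

	/* Only the first word gets the single, untorn store. */
	WRITE_ONCE(bitmap[0], value);
}
```

The design choice is that lockless readers are assumed to care only about
the legacy flags in the first word, so only that word needs the
write-once store.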

We additionally update the VMA userland test declarations to implement the
same changes there.

Finally, we update the Rust code to reference vma->vm_flags on update
rather than vma->__vm_flags, which has been removed. This is safe for now,
although it implicitly performs a const cast.

Once we introduce flag helpers we can improve this more.

No functional change intended.

Link: https://lkml.kernel.org/r/bab179d7b153ac12f221b7d65caac2759282cfe9.1764064557.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Alice Ryhl <aliceryhl@google.com> [rust]
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Lorenzo Stoakes and committed by Andrew Morton
9ea35a25 4c613f51

+202 -38
+19 -5
include/linux/mm.h
···
 				 vm_flags_t flags)
 {
 	VM_WARN_ON_ONCE(!pgtable_supports_soft_dirty() && (flags & VM_SOFTDIRTY));
-	ACCESS_PRIVATE(vma, __vm_flags) = flags;
+	vma_flags_clear_all(&vma->flags);
+	vma_flags_overwrite_word(&vma->flags, flags);
 }
 
 /*
···
 				       vm_flags_t flags)
 {
 	vma_assert_write_locked(vma);
-	WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
+	/*
+	 * If VMA flags exist beyond the first system word, also clear these. It
+	 * is assumed the write once behaviour is required only for the first
+	 * system word.
+	 */
+	if (NUM_VMA_FLAG_BITS > BITS_PER_LONG) {
+		unsigned long *bitmap = ACCESS_PRIVATE(&vma->flags, __vma_flags);
+
+		bitmap_zero(&bitmap[1], NUM_VMA_FLAG_BITS - BITS_PER_LONG);
+	}
+
+	vma_flags_overwrite_word_once(&vma->flags, flags);
 }
 
 static inline void vm_flags_set(struct vm_area_struct *vma,
 				vm_flags_t flags)
 {
 	vma_start_write(vma);
-	ACCESS_PRIVATE(vma, __vm_flags) |= flags;
+	vma_flags_set_word(&vma->flags, flags);
 }
 
 static inline void vm_flags_clear(struct vm_area_struct *vma,
···
 {
 	VM_WARN_ON_ONCE(!pgtable_supports_soft_dirty() && (flags & VM_SOFTDIRTY));
 	vma_start_write(vma);
-	ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
+	vma_flags_clear_word(&vma->flags, flags);
 }
 
 /*
···
 static inline void vma_flag_set_atomic(struct vm_area_struct *vma,
 				       vma_flag_t bit)
 {
+	unsigned long *bitmap = ACCESS_PRIVATE(&vma->flags, __vma_flags);
+
 	/* mmap read lock/VMA read lock must be held. */
 	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
 		vma_assert_locked(vma);
 
 	if (__vma_flag_atomic_valid(vma, bit))
-		set_bit((__force int)bit, &ACCESS_PRIVATE(vma, __vm_flags));
+		set_bit((__force int)bit, bitmap);
 }
 
 /*
+62 -2
include/linux/mm_types.h
···
 };
 
 /*
+ * Opaque type representing current VMA (vm_area_struct) flag state. Must be
+ * accessed via vma_flags_xxx() helper functions.
+ */
+#define NUM_VMA_FLAG_BITS BITS_PER_LONG
+typedef struct {
+	DECLARE_BITMAP(__vma_flags, NUM_VMA_FLAG_BITS);
+} __private vma_flags_t;
+
+/*
  * Describes a VMA that is about to be mmap()'ed. Drivers may choose to
  * manipulate mutable fields which will cause those fields to be updated in the
  * resultant VMA.
···
 	/* Mutable fields. Populated with initial state. */
 	pgoff_t pgoff;
 	struct file *vm_file;
-	vm_flags_t vm_flags;
+	union {
+		vm_flags_t vm_flags;
+		vma_flags_t vma_flags;
+	};
 	pgprot_t page_prot;
 
 	/* Write-only fields. */
···
 	/*
 	 * Flags, see mm.h.
 	 * To modify use vm_flags_{init|reset|set|clear|mod} functions.
+	 * Preferably, use vma_flags_xxx() functions.
 	 */
 	union {
+		/* Temporary while VMA flags are being converted. */
 		const vm_flags_t vm_flags;
-		vm_flags_t __private __vm_flags;
+		vma_flags_t flags;
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
···
 	struct pfnmap_track_ctx *pfnmap_track_ctx;
 #endif
 } __randomize_layout;
+
+/* Clears all bits in the VMA flags bitmap, non-atomically. */
+static inline void vma_flags_clear_all(vma_flags_t *flags)
+{
+	bitmap_zero(ACCESS_PRIVATE(flags, __vma_flags), NUM_VMA_FLAG_BITS);
+}
+
+/*
+ * Copy value to the first system word of VMA flags, non-atomically.
+ *
+ * IMPORTANT: This does not overwrite bytes past the first system word. The
+ * caller must account for this.
+ */
+static inline void vma_flags_overwrite_word(vma_flags_t *flags, unsigned long value)
+{
+	*ACCESS_PRIVATE(flags, __vma_flags) = value;
+}
+
+/*
+ * Copy value to the first system word of VMA flags ONCE, non-atomically.
+ *
+ * IMPORTANT: This does not overwrite bytes past the first system word. The
+ * caller must account for this.
+ */
+static inline void vma_flags_overwrite_word_once(vma_flags_t *flags, unsigned long value)
+{
+	unsigned long *bitmap = ACCESS_PRIVATE(flags, __vma_flags);
+
+	WRITE_ONCE(*bitmap, value);
+}
+
+/* Update the first system word of VMA flags setting bits, non-atomically. */
+static inline void vma_flags_set_word(vma_flags_t *flags, unsigned long value)
+{
+	unsigned long *bitmap = ACCESS_PRIVATE(flags, __vma_flags);
+
+	*bitmap |= value;
+}
+
+/* Update the first system word of VMA flags clearing bits, non-atomically. */
+static inline void vma_flags_clear_word(vma_flags_t *flags, unsigned long value)
+{
+	unsigned long *bitmap = ACCESS_PRIVATE(flags, __vma_flags);
+
+	*bitmap &= ~value;
+}
 
 #ifdef CONFIG_NUMA
 #define vma_policy(vma) ((vma)->vm_policy)
+1 -1
rust/kernel/mm/virt.rs
···
         // SAFETY: This is not a data race: the vma is undergoing initial setup, so it's not yet
         // shared. Additionally, `VmaNew` is `!Sync`, so it cannot be used to write in parallel.
         // The caller promises that this does not set the flags to an invalid value.
-        unsafe { (*self.as_ptr()).__bindgen_anon_2.__vm_flags = flags };
+        unsafe { (*self.as_ptr()).__bindgen_anon_2.vm_flags = flags };
     }
 
     /// Set the `VM_MIXEDMAP` flag on this vma.
+120 -30
tools/testing/vma/vma_internal.h
···
 	__private DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
 } mm_flags_t;
 
+/*
+ * Opaque type representing current VMA (vm_area_struct) flag state. Must be
+ * accessed via vma_flags_xxx() helper functions.
+ */
+#define NUM_VMA_FLAG_BITS BITS_PER_LONG
+typedef struct {
+	DECLARE_BITMAP(__vma_flags, NUM_VMA_FLAG_BITS);
+} __private vma_flags_t;
+
 struct mm_struct {
 	struct maple_tree mm_mt;
 	int map_count;			/* number of VMAs */
···
 	/* Mutable fields. Populated with initial state. */
 	pgoff_t pgoff;
 	struct file *vm_file;
-	vm_flags_t vm_flags;
+	union {
+		vm_flags_t vm_flags;
+		vma_flags_t vma_flags;
+	};
 	pgprot_t page_prot;
 
 	/* Write-only fields. */
···
 	 */
 	union {
 		const vm_flags_t vm_flags;
-		vm_flags_t __private __vm_flags;
+		vma_flags_t flags;
 	};
 
 #ifdef CONFIG_PER_VMA_LOCK
···
 	return true;
 }
 
-static inline void vm_flags_init(struct vm_area_struct *vma,
-				 vm_flags_t flags)
-{
-	vma->__vm_flags = flags;
-}
-
-static inline void vm_flags_set(struct vm_area_struct *vma,
-				vm_flags_t flags)
-{
-	vma_start_write(vma);
-	vma->__vm_flags |= flags;
-}
-
-static inline void vm_flags_clear(struct vm_area_struct *vma,
-				  vm_flags_t flags)
-{
-	vma_start_write(vma);
-	vma->__vm_flags &= ~flags;
-}
-
 static inline int shmem_zero_setup(struct vm_area_struct *vma)
 {
 	return 0;
···
 {
 }
 
-# define ACCESS_PRIVATE(p, member) ((p)->member)
+#define ACCESS_PRIVATE(p, member) ((p)->member)
+
+#define bitmap_size(nbits)	(ALIGN(nbits, BITS_PER_LONG) / BITS_PER_BYTE)
+
+static __always_inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
+{
+	unsigned int len = bitmap_size(nbits);
+
+	if (small_const_nbits(nbits))
+		*dst = 0;
+	else
+		memset(dst, 0, len);
+}
 
 static inline bool mm_flags_test(int flag, const struct mm_struct *mm)
 {
 	return test_bit(flag, ACCESS_PRIVATE(&mm->flags, __mm_flags));
+}
+
+/* Clears all bits in the VMA flags bitmap, non-atomically. */
+static inline void vma_flags_clear_all(vma_flags_t *flags)
+{
+	bitmap_zero(ACCESS_PRIVATE(flags, __vma_flags), NUM_VMA_FLAG_BITS);
+}
+
+/*
+ * Copy value to the first system word of VMA flags, non-atomically.
+ *
+ * IMPORTANT: This does not overwrite bytes past the first system word. The
+ * caller must account for this.
+ */
+static inline void vma_flags_overwrite_word(vma_flags_t *flags, unsigned long value)
+{
+	*ACCESS_PRIVATE(flags, __vma_flags) = value;
+}
+
+/*
+ * Copy value to the first system word of VMA flags ONCE, non-atomically.
+ *
+ * IMPORTANT: This does not overwrite bytes past the first system word. The
+ * caller must account for this.
+ */
+static inline void vma_flags_overwrite_word_once(vma_flags_t *flags, unsigned long value)
+{
+	unsigned long *bitmap = ACCESS_PRIVATE(flags, __vma_flags);
+
+	WRITE_ONCE(*bitmap, value);
+}
+
+/* Update the first system word of VMA flags setting bits, non-atomically. */
+static inline void vma_flags_set_word(vma_flags_t *flags, unsigned long value)
+{
+	unsigned long *bitmap = ACCESS_PRIVATE(flags, __vma_flags);
+
+	*bitmap |= value;
+}
+
+/* Update the first system word of VMA flags clearing bits, non-atomically. */
+static inline void vma_flags_clear_word(vma_flags_t *flags, unsigned long value)
+{
+	unsigned long *bitmap = ACCESS_PRIVATE(flags, __vma_flags);
+
+	*bitmap &= ~value;
+}
+
+
+/* Use when VMA is not part of the VMA tree and needs no locking */
+static inline void vm_flags_init(struct vm_area_struct *vma,
+				 vm_flags_t flags)
+{
+	vma_flags_clear_all(&vma->flags);
+	vma_flags_overwrite_word(&vma->flags, flags);
+}
+
+/*
+ * Use when VMA is part of the VMA tree and modifications need coordination
+ * Note: vm_flags_reset and vm_flags_reset_once do not lock the vma and
+ * it should be locked explicitly beforehand.
+ */
+static inline void vm_flags_reset(struct vm_area_struct *vma,
+				  vm_flags_t flags)
+{
+	vma_assert_write_locked(vma);
+	vm_flags_init(vma, flags);
+}
+
+static inline void vm_flags_reset_once(struct vm_area_struct *vma,
+				       vm_flags_t flags)
+{
+	vma_assert_write_locked(vma);
+	/*
+	 * The user should only be interested in avoiding reordering of
+	 * assignment to the first word.
+	 */
+	vma_flags_clear_all(&vma->flags);
+	vma_flags_overwrite_word_once(&vma->flags, flags);
+}
+
+static inline void vm_flags_set(struct vm_area_struct *vma,
+				vm_flags_t flags)
+{
+	vma_start_write(vma);
+	vma_flags_set_word(&vma->flags, flags);
+}
+
+static inline void vm_flags_clear(struct vm_area_struct *vma,
+				  vm_flags_t flags)
+{
+	vma_start_write(vma);
+	vma_flags_clear_word(&vma->flags, flags);
 }
 
 /*
···
 				 struct list_head *uf)
 {
 	return 0;
-}
-
-static inline void vm_flags_reset(struct vm_area_struct *vma, vm_flags_t flags)
-{
-	vm_flags_t *dst = (vm_flags_t *)(&vma->vm_flags);
-
-	*dst = flags;
 }
 
 #endif /* __MM_VMA_INTERNAL_H */