Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"25 hotfixes. 16 are cc:stable. 19 are MM and 6 are non-MM.

The usual bunch of singletons and doubletons - please see the relevant
changelogs for details"

* tag 'mm-hotfixes-stable-2024-12-21-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
mm: huge_memory: handle strsep not finding delimiter
alloc_tag: fix set_codetag_empty() when !CONFIG_MEM_ALLOC_PROFILING_DEBUG
alloc_tag: fix module allocation tags populated area calculation
mm/codetag: clear tags before swap
mm/vmstat: fix a W=1 clang compiler warning
mm: convert partially_mapped set/clear operations to be atomic
nilfs2: fix buffer head leaks in calls to truncate_inode_pages()
vmalloc: fix accounting with i915
mm/page_alloc: don't call pfn_to_page() on possibly non-existent PFN in split_large_buddy()
fork: avoid inappropriate uprobe access to invalid mm
nilfs2: prevent use of deleted inode
zram: fix uninitialized ZRAM not releasing backing device
zram: refuse to use zero sized block device as backing device
mm: use clear_user_(high)page() for arch with special user folio handling
mm: introduce cpu_icache_is_aliasing() across all architectures
mm: add RCU annotation to pte_offset_map(_lock)
mm: correctly reference merged VMA
mm: use aligned address in copy_user_gigantic_page()
mm: use aligned address in clear_gigantic_page()
mm: shmem: fix ShmemHugePages at swapout
...

+1049 -102

.mailmap (+1)

     Wolfram Sang <wsa@kernel.org> <wsa@the-dreams.de>
     Yakir Yang <kuankuan.y@gmail.com> <ykk@rock-chips.com>
     Yanteng Si <si.yanteng@linux.dev> <siyanteng@loongson.cn>
    +Ying Huang <huang.ying.caritas@gmail.com> <ying.huang@intel.com>
     Yusuke Goda <goda.yusuke@renesas.com>
     Zack Rusin <zack.rusin@broadcom.com> <zackr@vmware.com>
     Zhu Yanjun <zyjzyj2000@gmail.com> <yanjunz@nvidia.com>
Documentation/mm/process_addrs.rst (+850)
=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity and which can be acquired
  via :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice.
  A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be stabilised via
  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
  locks as the reverse mapping locks, or 'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieve is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`.
  This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
  *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read
             lock, attempting to do the reverse is invalid as it can result in
             deadlock - if another task already holds an mmap write lock and
             attempts to acquire a VMA write lock that will deadlock on the VMA
             read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with
          other readers and write locks exclusive against all others holding
          the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in.
                                                                                 setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page tables than
          five the kernel cleverly 'folds' page table levels, that is stubbing
          out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`).
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

.. warning:: Page tables are normally only traversed in regions covered by VMAs.
             If you want to traverse page tables in areas that might not be
             covered by VMAs, heavier locking is required.
             See :c:func:`!walk_page_range_novma` for details.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
          but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to acquire
          lock B, while thread 2 holds lock B and tries to acquire lock A.

          Both threads are now deadlocked on each other. However, had they
          attempted to acquire locks in the same order, one would have waited
          for the other to complete its work and no deadlock would have
          occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                        i_pages lock (widely used)
                          lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                  in arch-dependent flush_dcache_mmap_lock,
                                  within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem            (truncate_pagecache)
    ->private_lock          (__free_pte->block_dirty_folio)
      ->swap_lock           (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock       (acquired by fs in truncate path)
      ->i_mmap_rwsem        (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock (various, mainly in memory.c)
        ->i_pages lock      (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock       (filemap_fault)
      ->lock_page           (filemap_fault, access_process_vm)

  ->i_rwsem                 (generic_perform_write)
    ->mmap_lock             (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                 (fs/fs-writeback.c)
    ->i_pages lock          (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock         (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock             (try_to_unmap_one)
    ->private_lock          (try_to_unmap_one)
    ->i_pages lock          (try_to_unmap_one)
    ->lruvec->lru_lock      (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock      (check_pte_range->folio_isolate_lru)
    ->private_lock          (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock          (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock       (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock         (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock       (zap_pte_range->set_page_dirty)
    ->inode->i_lock         (zap_pte_range->set_page_dirty)
    ->private_lock          (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write); doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read
  lock; but that only suffices for readers that can tolerate racing with
  concurrent page table updates such that an empty PTE is observed (in a page
  table that has actually already been detached and marked for RCU freeing)
  while another new page table has been installed in the same location and
  filled with entries. Writers normally need to take the PTE lock and
  revalidate that the PMD entry still refers to the same PTE-level page table.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.

Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations run in parallel (though holding the VMA stable), and
functionality like GUP-fast locklessly traverses (that is, reads) page tables,
without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken.
In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).
593 +
594 + When allocating a P4D, PUD or PMD and setting the relevant entry in the above
595 + PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
596 + acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
597 + :c:func:`!__pmd_alloc` respectively.
598 +
599 + .. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
600 +    :c:func:`!pud_lockptr` in turn; however, at the time of writing it ultimately
601 +    references the :c:member:`!mm->page_table_lock`.
602 +
603 + Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
604 + :c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
605 + physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
606 + :c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
607 + :c:func:`!__pte_alloc`.
608 +
609 + Finally, modifying the contents of the PTE requires special treatment, as the
610 + PTE page table lock must be acquired whenever we want stable and exclusive
611 + access to entries contained within a PTE table, especially when we wish to
612 + modify them.
613 +
614 + This is performed via :c:func:`!pte_offset_map_lock`, which carefully checks to
615 + ensure that the PTE hasn't changed from under us, ultimately invoking
616 + :c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
617 + the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
618 + must be released via :c:func:`!pte_unmap_unlock`.
619 +
620 + .. note:: There are some variants on this, such as
621 +    :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable, but
622 +    for brevity we do not explore this. See the comment for
623 +    :c:func:`!__pte_offset_map_lock` for more details.
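The lock-then-revalidate shape of :c:func:`!pte_offset_map_lock` can be illustrated with a small userspace model - ordinary C with a pthread mutex standing in for the PTE spin lock. All names below are illustrative stand-ins, not kernel API; in the kernel the revalidation compares the PMD value itself and the lock lives in the :c:struct:`!struct ptdesc`.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative model only: a "PMD entry" is simply a pointer to a PTE table. */
struct pte_table { unsigned long entries[512]; };

static pthread_mutex_t pte_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pte_table *pmd_entry;	/* may be changed by concurrent tasks */

void set_pmd_entry(struct pte_table *table) { pmd_entry = table; }

/*
 * Model of the pte_offset_map_lock() pattern: snapshot the PMD entry
 * locklessly, take the PTE lock, then confirm that the PMD entry still
 * refers to the same PTE table. On failure the caller must retry or bail.
 */
struct pte_table *map_and_lock_pte_table(void)
{
	struct pte_table *snap = pmd_entry;	/* lockless read */

	if (!snap)
		return NULL;			/* nothing to map */
	pthread_mutex_lock(&pte_lock);
	if (pmd_entry != snap) {		/* changed under us, e.g. THP collapse */
		pthread_mutex_unlock(&pte_lock);
		return NULL;
	}
	return snap;				/* locked and revalidated */
}

void unlock_pte_table(void) { pthread_mutex_unlock(&pte_lock); }
```

The essential point the model captures is that the lockless snapshot alone proves nothing; only the recheck performed after the lock is held makes the table stable for the caller.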
624 +
625 + When modifying data in ranges we typically only wish to allocate higher page
626 + tables as necessary, using these locks to avoid races or overwriting anything,
627 + and set/clear data at the PTE level as required (for instance when page faulting
628 + or zapping).
629 +
630 + A typical pattern taken when traversing page table entries to install a new
631 + mapping is to optimistically determine whether the page table entry in the table
632 + above is empty and, if so, only then acquire the page table lock, checking
633 + again to see if it was allocated underneath us.
634 +
635 + This allows for a traversal with page table locks only being taken when
636 + required. An example of this is :c:func:`!__pud_alloc`.
637 +
638 + At the leaf page table, that is the PTE, we can't entirely rely on this pattern
639 + as we have separate PMD and PTE locks and a THP collapse, for instance, might
640 + have eliminated the PMD entry as well as the PTE from under us.
641 +
642 + This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
643 + for the PTE, carefully checking it is as expected, before acquiring the
644 + PTE-specific lock, and then *again* checking that the PMD entry is as expected.
645 +
646 + If a THP collapse (or similar) were to occur then the lock on both pages would
647 + be acquired, so we can ensure this is prevented while the PTE lock is held.
648 +
649 + Installing entries this way ensures mutual exclusion on write.
650 +
651 + Page table freeing
652 + ^^^^^^^^^^^^^^^^^^
653 +
654 + Tearing down page tables themselves is something that requires significant
655 + care. There must be no way that page tables designated for removal can be
656 + traversed or referenced by concurrent tasks.
657 +
658 + It is insufficient to simply hold an mmap write lock and VMA lock (which will
659 + prevent racing faults and rmap operations), as a file-backed mapping can be
660 + truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.
661 +
662 + As a result, no VMA which can be accessed via the reverse mapping (either
663 + through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
664 + address_space->i_mmap` interval trees) can have its page tables torn down.
665 +
666 + The operation is typically performed via :c:func:`!free_pgtables`, which assumes
667 + either the mmap write lock has been taken (as specified by its
668 + :c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
669 +
670 + It carefully removes the VMA from all reverse mappings; however, it is important
671 + that no new mappings overlap these and that no route remains to permit access to
672 + addresses within the range whose page tables are being torn down.
673 +
674 + Additionally, it assumes that a zap has already been performed and steps have
675 + been taken to ensure that no further page table entries can be installed between
676 + the zap and the invocation of :c:func:`!free_pgtables`.
677 +
678 + Since it is assumed that all such steps have been taken, page table entries are
679 + cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
680 + :c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).
681 +
682 + .. note:: It is possible for leaf page tables to be torn down independent of
683 +    the page tables above them, as is done by
684 +    :c:func:`!retract_page_tables`, which is performed under the i_mmap
685 +    read lock, PMD, and PTE page table locks, without this level of care.
686 +
687 + Page table moving
688 + ^^^^^^^^^^^^^^^^^
689 +
690 + Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
691 + page tables).
Most notable of these is :c:func:`!mremap`, which is capable of
692 + moving higher level page tables.
693 +
694 + In these instances, it is required that **all** locks are taken, that is
695 + the mmap lock, the VMA lock and the relevant rmap locks.
696 +
697 + You can observe this in the :c:func:`!mremap` implementation in the functions
698 + :c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks`, which perform the rmap
699 + side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
700 +
701 + VMA lock internals
702 + ------------------
703 +
704 + Overview
705 + ^^^^^^^^
706 +
707 + VMA read locking is entirely optimistic - if the lock is contended or a competing
708 + write has started, then we do not obtain a read lock.
709 +
710 + A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
711 + calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
712 + critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
713 + before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
714 +
715 + VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
716 + their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
717 + via :c:func:`!vma_end_read`.
718 +
719 + VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
720 + VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
721 + acquired. An mmap write lock **must** be held for the duration of the VMA write
722 + lock; releasing or downgrading the mmap write lock also releases the VMA write
723 + lock, so there is no :c:func:`!vma_end_write` function.
724 +
725 + Note that a semaphore write lock is not held across a VMA lock. Rather, a
726 + sequence number is used for serialisation, and the write semaphore is only
727 + acquired at the point of write lock to update this.
728 +
729 + This ensures the semantics we require - VMA write locks provide exclusive write
730 + access to the VMA.
731 +
732 + Implementation details
733 + ^^^^^^^^^^^^^^^^^^^^^^
734 +
735 + The VMA lock mechanism is designed to be a lightweight means of avoiding the use
736 + of the heavily contended mmap lock. It is implemented using a combination of a
737 + read/write semaphore and sequence numbers belonging to the containing
738 + :c:struct:`!struct mm_struct` and the VMA.
739 +
740 + Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
741 + operation, i.e. it tries to acquire a read lock but returns false if it is
742 + unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
743 + called to release the VMA read lock.
744 +
745 + Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
746 + been called first, establishing that we are in an RCU critical section upon VMA
747 + read lock acquisition. Once acquired, the RCU lock can be released as it is only
748 + required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu`, which
749 + is the interface a user should use.
750 +
751 + Writing requires the mmap to be write-locked and the VMA lock to be acquired via
752 + :c:func:`!vma_start_write`; however, the write lock is released by the termination or
753 + downgrade of the mmap write lock, so no :c:func:`!vma_end_write` is required.
754 +
755 + All this is achieved by the use of per-mm and per-VMA sequence counts, which are
756 + used in order to reduce complexity, especially for operations which write-lock
757 + multiple VMAs at once.
758 +
759 + If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
760 + sequence count :c:member:`!vma->vm_lock_seq`, then the VMA is write-locked. If
761 + they differ, then it is not.
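The sequence-count rule above (equal counts mean write-locked, and bumping the mm count unlocks every VMA at once) can be captured in a deliberately simplified userspace model. The struct and function names here are illustrative, not the kernel's actual types:

```c
/* Simplified model of the per-mm and per-VMA sequence counts. */
struct mm_model  { int mm_lock_seq; };
struct vma_model { int vm_lock_seq; };

/* Model of vma_start_write(): record the current mm sequence number
 * in the VMA, marking it write-locked. */
void model_vma_start_write(struct mm_model *mm, struct vma_model *vma)
{
	vma->vm_lock_seq = mm->mm_lock_seq;
}

/* Model of vma_end_write_all() (invoked on mmap_write_unlock()):
 * bump the mm sequence number, implicitly releasing every VMA
 * write lock in this mm at the same time. */
void model_vma_end_write_all(struct mm_model *mm)
{
	mm->mm_lock_seq++;
}

/* A VMA is write-locked exactly when the two counts are equal. */
int model_vma_is_write_locked(const struct mm_model *mm,
			      const struct vma_model *vma)
{
	return vma->vm_lock_seq == mm->mm_lock_seq;
}
```

The real implementation additionally serialises against readers via the :c:member:`!vma->vm_lock` semaphore, which this model omits.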
762 +
763 + Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
764 + :c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked, which
765 + also increments :c:member:`!mm->mm_lock_seq` via
766 + :c:func:`!mm_lock_seqcount_end`.
767 +
768 + This way, we ensure that, regardless of the VMA's sequence number, a write lock
769 + is never incorrectly indicated and that, when we release an mmap write lock, we
770 + efficiently release **all** VMA write locks contained within the mmap at the
771 + same time.
772 +
773 + Since the mmap write lock is exclusive against others who hold it, the automatic
774 + release of any VMA locks on its release makes sense, as you would never want to
775 + keep VMAs locked across entirely separate write operations. It also maintains
776 + correct lock ordering.
777 +
778 + Each time a VMA read lock is acquired, we acquire a read lock on the
779 + :c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
780 + the sequence count of the VMA does not match that of the mm.
781 +
782 + If it does, the read lock fails. If it does not, we hold the lock, excluding
783 + writers, but permitting other readers, who will also obtain this lock under RCU.
784 +
785 + Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
786 + are also RCU safe, so the whole read lock operation is guaranteed to function
787 + correctly.
788 +
789 + On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
790 + read/write semaphore, before setting the VMA's sequence number under this lock,
791 + also simultaneously holding the mmap write lock.
792 +
793 + This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
794 + until these are finished and mutual exclusion is achieved.
795 +
796 + After setting the VMA's sequence number, the lock is released, avoiding
797 + complexity with a long-term held write lock.
798 +
799 + This clever combination of a read/write semaphore and sequence count allows for
800 + fast RCU-based per-VMA lock acquisition (especially on page fault, though
801 + utilised elsewhere) with minimal complexity around lock ordering.
802 +
803 + mmap write lock downgrading
804 + ---------------------------
805 +
806 + When an mmap write lock is held one has exclusive access to resources within the
807 + mmap (with the usual caveats about requiring VMA write locks to avoid races with
808 + tasks holding VMA read locks).
809 +
810 + It is then possible to **downgrade** from a write lock to a read lock via
811 + :c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
812 + implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
813 + importantly does not relinquish the mmap lock while downgrading, therefore
814 + keeping the locked virtual address space stable.
815 +
816 + An interesting consequence of this is that downgraded locks are exclusive
817 + against any other task possessing a downgraded lock (since a racing task would
818 + have to acquire a write lock first to downgrade it, and the downgraded lock
819 + prevents a new write lock from being obtained until the original lock is
820 + released).
821 +
822 + For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
823 + another showing which locks exclude the others:
824 +
825 + .. list-table:: Lock exclusivity
826 +    :widths: 5 5 5 5
827 +    :header-rows: 1
828 +    :stub-columns: 1
829 +
830 +    * -
831 +      - R
832 +      - D
833 +      - W
834 +    * - R
835 +      - N
836 +      - N
837 +      - Y
838 +    * - D
839 +      - N
840 +      - Y
841 +      - Y
842 +    * - W
843 +      - Y
844 +      - Y
845 +      - Y
846 +
847 + Here a Y indicates the locks in the matching row/column are mutually exclusive,
848 + and N indicates that they are not.
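The exclusivity rules in the table can be expressed as a small predicate and checked exhaustively. This is a purely illustrative userspace sketch, not kernel code:

```c
/* Illustrative model of mmap lock kinds: read, downgraded write, write. */
enum mmap_lock_kind { LK_READ, LK_DOWNGRADED, LK_WRITE };

/*
 * Returns 1 if a holder of 'a' excludes a would-be holder of 'b'.
 * Read locks share with reads and downgraded locks; a downgraded
 * lock still excludes writers and other downgraded locks; a write
 * lock excludes everything.
 */
int excludes(enum mmap_lock_kind a, enum mmap_lock_kind b)
{
	if (a == LK_WRITE || b == LK_WRITE)
		return 1;
	if (a == LK_DOWNGRADED && b == LK_DOWNGRADED)
		return 1;
	return 0;
}
```

Note that the predicate is symmetric, matching the symmetry of the table.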
849 +
850 + Stack expansion
851 + ---------------
852 +
853 + Stack expansion throws up additional complexities in that we cannot permit there
854 + to be racing page faults; as a result, we invoke :c:func:`!vma_start_write` to
855 + prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
+1
arch/arc/Kconfig
··· 6 6 config ARC 7 7 def_bool y 8 8 select ARC_TIMERS 9 + select ARCH_HAS_CPU_CACHE_ALIASING 9 10 select ARCH_HAS_CACHE_LINE_SIZE 10 11 select ARCH_HAS_DEBUG_VM_PGTABLE 11 12 select ARCH_HAS_DMA_PREP_COHERENT
+8
arch/arc/include/asm/cachetype.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef __ASM_ARC_CACHETYPE_H 3 + #define __ASM_ARC_CACHETYPE_H 4 + 5 + #define cpu_dcache_is_aliasing() false 6 + #define cpu_icache_is_aliasing() true 7 + 8 + #endif
+10 -5
drivers/block/zram/zram_drv.c
··· 614 614 } 615 615 616 616 nr_pages = i_size_read(inode) >> PAGE_SHIFT; 617 + /* Refuse to use zero sized device (also prevents self reference) */ 618 + if (!nr_pages) { 619 + err = -EINVAL; 620 + goto out; 621 + } 622 + 617 623 bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long); 618 624 bitmap = kvzalloc(bitmap_sz, GFP_KERNEL); 619 625 if (!bitmap) { ··· 1444 1438 size_t num_pages = disksize >> PAGE_SHIFT; 1445 1439 size_t index; 1446 1440 1441 + if (!zram->table) 1442 + return; 1443 + 1447 1444 /* Free all pages that are still in this zram device */ 1448 1445 for (index = 0; index < num_pages; index++) 1449 1446 zram_free_page(zram, index); 1450 1447 1451 1448 zs_destroy_pool(zram->mem_pool); 1452 1449 vfree(zram->table); 1450 + zram->table = NULL; 1453 1451 } 1454 1452 1455 1453 static bool zram_meta_alloc(struct zram *zram, u64 disksize) ··· 2329 2319 down_write(&zram->init_lock); 2330 2320 2331 2321 zram->limit_pages = 0; 2332 - 2333 - if (!init_done(zram)) { 2334 - up_write(&zram->init_lock); 2335 - return; 2336 - } 2337 2322 2338 2323 set_capacity_and_notify(zram->disk, 0); 2339 2324 part_stat_set_all(zram->disk->part0, 0);
+1 -1
fs/hugetlbfs/inode.c
··· 825 825 error = PTR_ERR(folio); 826 826 goto out; 827 827 } 828 - folio_zero_user(folio, ALIGN_DOWN(addr, hpage_size)); 828 + folio_zero_user(folio, addr); 829 829 __folio_mark_uptodate(folio); 830 830 error = hugetlb_add_to_page_cache(folio, mapping, index); 831 831 if (unlikely(error)) {
+1
fs/nilfs2/btnode.c
··· 35 35 ii->i_flags = 0; 36 36 memset(&ii->i_bmap_data, 0, sizeof(struct nilfs_bmap)); 37 37 mapping_set_gfp_mask(btnc_inode->i_mapping, GFP_NOFS); 38 + btnc_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops; 38 39 } 39 40 40 41 void nilfs_btnode_cache_clear(struct address_space *btnc)
+1 -1
fs/nilfs2/gcinode.c
··· 163 163 164 164 inode->i_mode = S_IFREG; 165 165 mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS); 166 - inode->i_mapping->a_ops = &empty_aops; 166 + inode->i_mapping->a_ops = &nilfs_buffer_cache_aops; 167 167 168 168 ii->i_flags = 0; 169 169 nilfs_bmap_init_gc(ii->i_bmap);
+12 -1
fs/nilfs2/inode.c
··· 276 276 .is_partially_uptodate = block_is_partially_uptodate, 277 277 }; 278 278 279 + const struct address_space_operations nilfs_buffer_cache_aops = { 280 + .invalidate_folio = block_invalidate_folio, 281 + }; 282 + 279 283 static int nilfs_insert_inode_locked(struct inode *inode, 280 284 struct nilfs_root *root, 281 285 unsigned long ino) ··· 548 544 inode = nilfs_iget_locked(sb, root, ino); 549 545 if (unlikely(!inode)) 550 546 return ERR_PTR(-ENOMEM); 551 - if (!(inode->i_state & I_NEW)) 547 + 548 + if (!(inode->i_state & I_NEW)) { 549 + if (!inode->i_nlink) { 550 + iput(inode); 551 + return ERR_PTR(-ESTALE); 552 + } 552 553 return inode; 554 + } 553 555 554 556 err = __nilfs_read_inode(sb, root, ino, inode); 555 557 if (unlikely(err)) { ··· 685 675 NILFS_I(s_inode)->i_flags = 0; 686 676 memset(NILFS_I(s_inode)->i_bmap, 0, sizeof(struct nilfs_bmap)); 687 677 mapping_set_gfp_mask(s_inode->i_mapping, GFP_NOFS); 678 + s_inode->i_mapping->a_ops = &nilfs_buffer_cache_aops; 688 679 689 680 err = nilfs_attach_btree_node_cache(s_inode); 690 681 if (unlikely(err)) {
+5
fs/nilfs2/namei.c
··· 67 67 inode = NULL; 68 68 } else { 69 69 inode = nilfs_iget(dir->i_sb, NILFS_I(dir)->i_root, ino); 70 + if (inode == ERR_PTR(-ESTALE)) { 71 + nilfs_error(dir->i_sb, 72 + "deleted inode referenced: %lu", ino); 73 + return ERR_PTR(-EIO); 74 + } 70 75 } 71 76 72 77 return d_splice_alias(inode, dentry);
+1
fs/nilfs2/nilfs.h
··· 401 401 extern const struct inode_operations nilfs_file_inode_operations; 402 402 extern const struct file_operations nilfs_file_operations; 403 403 extern const struct address_space_operations nilfs_aops; 404 + extern const struct address_space_operations nilfs_buffer_cache_aops; 404 405 extern const struct inode_operations nilfs_dir_inode_operations; 405 406 extern const struct inode_operations nilfs_special_inode_operations; 406 407 extern const struct inode_operations nilfs_symlink_inode_operations;
+5 -22
fs/ocfs2/localalloc.c
··· 971 971 start = count = 0; 972 972 left = le32_to_cpu(alloc->id1.bitmap1.i_total); 973 973 974 - while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) < 975 - left) { 976 - if (bit_off == start) { 974 + while (1) { 975 + bit_off = ocfs2_find_next_zero_bit(bitmap, left, start); 976 + if ((bit_off < left) && (bit_off == start)) { 977 977 count++; 978 978 start++; 979 979 continue; ··· 998 998 } 999 999 } 1000 1000 1001 + if (bit_off >= left) 1002 + break; 1001 1003 count = 1; 1002 1004 start = bit_off + 1; 1003 - } 1004 - 1005 - /* clear the contiguous bits until the end boundary */ 1006 - if (count) { 1007 - blkno = la_start_blk + 1008 - ocfs2_clusters_to_blocks(osb->sb, 1009 - start - count); 1010 - 1011 - trace_ocfs2_sync_local_to_main_free( 1012 - count, start - count, 1013 - (unsigned long long)la_start_blk, 1014 - (unsigned long long)blkno); 1015 - 1016 - status = ocfs2_release_clusters(handle, 1017 - main_bm_inode, 1018 - main_bm_bh, blkno, 1019 - count); 1020 - if (status < 0) 1021 - mlog_errno(status); 1022 1005 } 1023 1006 1024 1007 bail:
+7 -2
include/linux/alloc_tag.h
··· 63 63 #else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */ 64 64 65 65 static inline bool is_codetag_empty(union codetag_ref *ref) { return false; } 66 - static inline void set_codetag_empty(union codetag_ref *ref) {} 66 + 67 + static inline void set_codetag_empty(union codetag_ref *ref) 68 + { 69 + if (ref) 70 + ref->ct = NULL; 71 + } 67 72 68 73 #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */ 69 74 ··· 140 135 #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG 141 136 static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag) 142 137 { 143 - WARN_ONCE(ref && ref->ct, 138 + WARN_ONCE(ref && ref->ct && !is_codetag_empty(ref), 144 139 "alloc_tag was not cleared (got tag for %s:%u)\n", 145 140 ref->ct->filename, ref->ct->lineno); 146 141
+6
include/linux/cacheinfo.h
··· 155 155 156 156 #ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING 157 157 #define cpu_dcache_is_aliasing() false 158 + #define cpu_icache_is_aliasing() cpu_dcache_is_aliasing() 158 159 #else 159 160 #include <asm/cachetype.h> 161 + 162 + #ifndef cpu_icache_is_aliasing 163 + #define cpu_icache_is_aliasing() cpu_dcache_is_aliasing() 164 + #endif 165 + 160 166 #endif 161 167 162 168 #endif /* _LINUX_CACHEINFO_H */
+7 -1
include/linux/highmem.h
··· 224 224 struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, 225 225 unsigned long vaddr) 226 226 { 227 - return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr); 227 + struct folio *folio; 228 + 229 + folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vaddr); 230 + if (folio && user_alloc_needs_zeroing()) 231 + clear_user_highpage(&folio->page, vaddr); 232 + 233 + return folio; 228 234 } 229 235 #endif 230 236
+29 -2
include/linux/mm.h
··· 31 31 #include <linux/kasan.h> 32 32 #include <linux/memremap.h> 33 33 #include <linux/slab.h> 34 + #include <linux/cacheinfo.h> 34 35 35 36 struct mempolicy; 36 37 struct anon_vma; ··· 3011 3010 lruvec_stat_sub_folio(folio, NR_PAGETABLE); 3012 3011 } 3013 3012 3014 - pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp); 3013 + pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp); 3014 + static inline pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 3015 + pmd_t *pmdvalp) 3016 + { 3017 + pte_t *pte; 3018 + 3019 + __cond_lock(RCU, pte = ___pte_offset_map(pmd, addr, pmdvalp)); 3020 + return pte; 3021 + } 3015 3022 static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr) 3016 3023 { 3017 3024 return __pte_offset_map(pmd, addr, NULL); ··· 3032 3023 { 3033 3024 pte_t *pte; 3034 3025 3035 - __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp)); 3026 + __cond_lock(RCU, __cond_lock(*ptlp, 3027 + pte = __pte_offset_map_lock(mm, pmd, addr, ptlp))); 3036 3028 return pte; 3037 3029 } 3038 3030 ··· 4184 4174 return 0; 4185 4175 } 4186 4176 #endif 4177 + 4178 + /* 4179 + * user_alloc_needs_zeroing checks if a user folio from page allocator needs to 4180 + * be zeroed or not. 4181 + */ 4182 + static inline bool user_alloc_needs_zeroing(void) 4183 + { 4184 + /* 4185 + * for user folios, arch with cache aliasing requires cache flush and 4186 + * arc changes folio->flags to make icache coherent with dcache, so 4187 + * always return false to make caller use 4188 + * clear_user_page()/clear_user_highpage(). 4189 + */ 4190 + return cpu_dcache_is_aliasing() || cpu_icache_is_aliasing() || 4191 + !static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, 4192 + &init_on_alloc); 4193 + } 4187 4194 4188 4195 int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *status); 4189 4196 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
+2 -10
include/linux/page-flags.h
··· 862 862 ClearPageHead(page); 863 863 } 864 864 FOLIO_FLAG(large_rmappable, FOLIO_SECOND_PAGE) 865 - FOLIO_TEST_FLAG(partially_mapped, FOLIO_SECOND_PAGE) 866 - /* 867 - * PG_partially_mapped is protected by deferred_split split_queue_lock, 868 - * so its safe to use non-atomic set/clear. 869 - */ 870 - __FOLIO_SET_FLAG(partially_mapped, FOLIO_SECOND_PAGE) 871 - __FOLIO_CLEAR_FLAG(partially_mapped, FOLIO_SECOND_PAGE) 865 + FOLIO_FLAG(partially_mapped, FOLIO_SECOND_PAGE) 872 866 #else 873 867 FOLIO_FLAG_FALSE(large_rmappable) 874 - FOLIO_TEST_FLAG_FALSE(partially_mapped) 875 - __FOLIO_SET_FLAG_NOOP(partially_mapped) 876 - __FOLIO_CLEAR_FLAG_NOOP(partially_mapped) 868 + FOLIO_FLAG_FALSE(partially_mapped) 877 869 #endif 878 870 879 871 #define PG_head_mask ((1UL << PG_head))
+1 -1
include/linux/vmstat.h
··· 515 515 516 516 static inline const char *lru_list_name(enum lru_list lru) 517 517 { 518 - return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_" 518 + return node_stat_name(NR_LRU_BASE + (enum node_stat_item)lru) + 3; // skip "nr_" 519 519 } 520 520 521 521 #if defined(CONFIG_VM_EVENT_COUNTERS) || defined(CONFIG_MEMCG)
+6 -7
kernel/fork.c
··· 639 639 LIST_HEAD(uf); 640 640 VMA_ITERATOR(vmi, mm, 0); 641 641 642 - uprobe_start_dup_mmap(); 643 - if (mmap_write_lock_killable(oldmm)) { 644 - retval = -EINTR; 645 - goto fail_uprobe_end; 646 - } 642 + if (mmap_write_lock_killable(oldmm)) 643 + return -EINTR; 647 644 flush_cache_dup_mm(oldmm); 648 645 uprobe_dup_mmap(oldmm, mm); 649 646 /* ··· 779 782 dup_userfaultfd_complete(&uf); 780 783 else 781 784 dup_userfaultfd_fail(&uf); 782 - fail_uprobe_end: 783 - uprobe_end_dup_mmap(); 784 785 return retval; 785 786 786 787 fail_nomem_anon_vma_fork: ··· 1687 1692 if (!mm_init(mm, tsk, mm->user_ns)) 1688 1693 goto fail_nomem; 1689 1694 1695 + uprobe_start_dup_mmap(); 1690 1696 err = dup_mmap(mm, oldmm); 1691 1697 if (err) 1692 1698 goto free_pt; 1699 + uprobe_end_dup_mmap(); 1693 1700 1694 1701 mm->hiwater_rss = get_mm_rss(mm); 1695 1702 mm->hiwater_vm = mm->total_vm; ··· 1706 1709 mm->binfmt = NULL; 1707 1710 mm_init_owner(mm, NULL); 1708 1711 mmput(mm); 1712 + if (err) 1713 + uprobe_end_dup_mmap(); 1709 1714 1710 1715 fail_nomem: 1711 1716 return NULL;
+36 -5
lib/alloc_tag.c
··· 209 209 return; 210 210 } 211 211 212 + /* 213 + * Clear tag references to avoid debug warning when using 214 + * __alloc_tag_ref_set() with non-empty reference. 215 + */ 216 + set_codetag_empty(&ref_old); 217 + set_codetag_empty(&ref_new); 218 + 212 219 /* swap tags */ 213 220 __alloc_tag_ref_set(&ref_old, tag_new); 214 221 update_page_tag_ref(handle_old, &ref_old); ··· 408 401 409 402 static int vm_module_tags_populate(void) 410 403 { 411 - unsigned long phys_size = vm_module_tags->nr_pages << PAGE_SHIFT; 404 + unsigned long phys_end = ALIGN_DOWN(module_tags.start_addr, PAGE_SIZE) + 405 + (vm_module_tags->nr_pages << PAGE_SHIFT); 406 + unsigned long new_end = module_tags.start_addr + module_tags.size; 412 407 413 - if (phys_size < module_tags.size) { 408 + if (phys_end < new_end) { 414 409 struct page **next_page = vm_module_tags->pages + vm_module_tags->nr_pages; 415 - unsigned long addr = module_tags.start_addr + phys_size; 410 + unsigned long old_shadow_end = ALIGN(phys_end, MODULE_ALIGN); 411 + unsigned long new_shadow_end = ALIGN(new_end, MODULE_ALIGN); 416 412 unsigned long more_pages; 417 413 unsigned long nr; 418 414 419 - more_pages = ALIGN(module_tags.size - phys_size, PAGE_SIZE) >> PAGE_SHIFT; 415 + more_pages = ALIGN(new_end - phys_end, PAGE_SIZE) >> PAGE_SHIFT; 420 416 nr = alloc_pages_bulk_array_node(GFP_KERNEL | __GFP_NOWARN, 421 417 NUMA_NO_NODE, more_pages, next_page); 422 418 if (nr < more_pages || 423 - vmap_pages_range(addr, addr + (nr << PAGE_SHIFT), PAGE_KERNEL, 419 + vmap_pages_range(phys_end, phys_end + (nr << PAGE_SHIFT), PAGE_KERNEL, 424 420 next_page, PAGE_SHIFT) < 0) { 425 421 /* Clean up and error out */ 426 422 for (int i = 0; i < nr; i++) 427 423 __free_page(next_page[i]); 428 424 return -ENOMEM; 429 425 } 426 + 430 427 vm_module_tags->nr_pages += nr; 428 + 429 + /* 430 + * Kasan allocates 1 byte of shadow for every 8 bytes of data. 
431 + * When kasan_alloc_module_shadow allocates shadow memory, 432 + * its unit of allocation is a page. 433 + * Therefore, here we need to align to MODULE_ALIGN. 434 + */ 435 + if (old_shadow_end < new_shadow_end) 436 + kasan_alloc_module_shadow((void *)old_shadow_end, 437 + new_shadow_end - old_shadow_end, 438 + GFP_KERNEL); 431 439 } 440 + 441 + /* 442 + * Mark the pages as accessible, now that they are mapped. 443 + * With hardware tag-based KASAN, marking is skipped for 444 + * non-VM_ALLOC mappings, see __kasan_unpoison_vmalloc(). 445 + */ 446 + kasan_unpoison_vmalloc((void *)module_tags.start_addr, 447 + new_end - module_tags.start_addr, 448 + KASAN_VMALLOC_PROT_NORMAL); 432 449 433 450 return 0; 434 451 }
+10 -9
mm/huge_memory.c
··· 1176 1176 folio_throttle_swaprate(folio, gfp); 1177 1177 1178 1178 /* 1179 - * When a folio is not zeroed during allocation (__GFP_ZERO not used), 1180 - * folio_zero_user() is used to make sure that the page corresponding 1181 - * to the faulting address will be hot in the cache after zeroing. 1179 + * When a folio is not zeroed during allocation (__GFP_ZERO not used) 1180 + * or user folios require special handling, folio_zero_user() is used to 1181 + * make sure that the page corresponding to the faulting address will be 1182 + * hot in the cache after zeroing. 1182 1183 */ 1183 - if (!alloc_zeroed()) 1184 + if (user_alloc_needs_zeroing()) 1184 1185 folio_zero_user(folio, addr); 1185 1186 /* 1186 1187 * The memory barrier inside __folio_mark_uptodate makes sure that ··· 3577 3576 !list_empty(&folio->_deferred_list)) { 3578 3577 ds_queue->split_queue_len--; 3579 3578 if (folio_test_partially_mapped(folio)) { 3580 - __folio_clear_partially_mapped(folio); 3579 + folio_clear_partially_mapped(folio); 3581 3580 mod_mthp_stat(folio_order(folio), 3582 3581 MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); 3583 3582 } ··· 3689 3688 if (!list_empty(&folio->_deferred_list)) { 3690 3689 ds_queue->split_queue_len--; 3691 3690 if (folio_test_partially_mapped(folio)) { 3692 - __folio_clear_partially_mapped(folio); 3691 + folio_clear_partially_mapped(folio); 3693 3692 mod_mthp_stat(folio_order(folio), 3694 3693 MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); 3695 3694 } ··· 3733 3732 spin_lock_irqsave(&ds_queue->split_queue_lock, flags); 3734 3733 if (partially_mapped) { 3735 3734 if (!folio_test_partially_mapped(folio)) { 3736 - __folio_set_partially_mapped(folio); 3735 + folio_set_partially_mapped(folio); 3737 3736 if (folio_test_pmd_mappable(folio)) 3738 3737 count_vm_event(THP_DEFERRED_SPLIT_PAGE); 3739 3738 count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED); ··· 3826 3825 } else { 3827 3826 /* We lost race with folio_put() */ 3828 3827 if 
(folio_test_partially_mapped(folio)) { 3829 - __folio_clear_partially_mapped(folio); 3828 + folio_clear_partially_mapped(folio); 3830 3829 mod_mthp_stat(folio_order(folio), 3831 3830 MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1); 3832 3831 } ··· 4169 4168 size_t input_len = strlen(input_buf); 4170 4169 4171 4170 tok = strsep(&buf, ","); 4172 - if (tok) { 4171 + if (tok && buf) { 4173 4172 strscpy(file_path, tok); 4174 4173 } else { 4175 4174 ret = -EINVAL;
+2 -3
mm/hugetlb.c
··· 5340 5340 break; 5341 5341 } 5342 5342 ret = copy_user_large_folio(new_folio, pte_folio, 5343 - ALIGN_DOWN(addr, sz), dst_vma); 5343 + addr, dst_vma); 5344 5344 folio_put(pte_folio); 5345 5345 if (ret) { 5346 5346 folio_put(new_folio); ··· 6643 6643 *foliop = NULL; 6644 6644 goto out; 6645 6645 } 6646 - ret = copy_user_large_folio(folio, *foliop, 6647 - ALIGN_DOWN(dst_addr, size), dst_vma); 6646 + ret = copy_user_large_folio(folio, *foliop, dst_addr, dst_vma); 6648 6647 folio_put(*foliop); 6649 6648 *foliop = NULL; 6650 6649 if (ret) {
-6
mm/internal.h
··· 1285 1285 void touch_pmd(struct vm_area_struct *vma, unsigned long addr, 1286 1286 pmd_t *pmd, bool write); 1287 1287 1288 - static inline bool alloc_zeroed(void) 1289 - { 1290 - return static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, 1291 - &init_on_alloc); 1292 - } 1293 - 1294 1288 /* 1295 1289 * Parses a string with mem suffixes into its order. Useful to parse kernel 1296 1290 * parameters.
+10 -8
mm/memory.c
··· 4733 4733 folio_throttle_swaprate(folio, gfp); 4734 4734 /* 4735 4735 * When a folio is not zeroed during allocation 4736 - * (__GFP_ZERO not used), folio_zero_user() is used 4737 - * to make sure that the page corresponding to the 4738 - * faulting address will be hot in the cache after 4739 - * zeroing. 4736 + * (__GFP_ZERO not used) or user folios require special 4737 + * handling, folio_zero_user() is used to make sure 4738 + * that the page corresponding to the faulting address 4739 + * will be hot in the cache after zeroing. 4740 4740 */ 4741 - if (!alloc_zeroed()) 4741 + if (user_alloc_needs_zeroing()) 4742 4742 folio_zero_user(folio, vmf->address); 4743 4743 return folio; 4744 4744 } ··· 6815 6815 return 0; 6816 6816 } 6817 6817 6818 - static void clear_gigantic_page(struct folio *folio, unsigned long addr, 6818 + static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint, 6819 6819 unsigned int nr_pages) 6820 6820 { 6821 + unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio)); 6821 6822 int i; 6822 6823 6823 6824 might_sleep(); ··· 6852 6851 } 6853 6852 6854 6853 static int copy_user_gigantic_page(struct folio *dst, struct folio *src, 6855 - unsigned long addr, 6854 + unsigned long addr_hint, 6856 6855 struct vm_area_struct *vma, 6857 6856 unsigned int nr_pages) 6858 6857 { 6859 - int i; 6858 + unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(dst)); 6860 6859 struct page *dst_page; 6861 6860 struct page *src_page; 6861 + int i; 6862 6862 6863 6863 for (i = 0; i < nr_pages; i++) { 6864 6864 dst_page = folio_page(dst, i);
mm/page_alloc.c (+4 -2)
···
 	if (order > pageblock_order)
 		order = pageblock_order;
 
-	while (pfn != end) {
+	do {
 		int mt = get_pfnblock_migratetype(page, pfn);
 
 		__free_one_page(page, pfn, zone, order, mt, fpi);
 		pfn += 1 << order;
+		if (pfn == end)
+			break;
 		page = pfn_to_page(pfn);
-	}
+	} while (1);
 }
 
 static void free_one_page(struct zone *zone, struct page *page,
mm/pgtable-generic.c (+1 -1)
···
 static void pmdp_get_lockless_end(unsigned long irqflags) { }
 #endif
 
-pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
 	unsigned long irqflags;
 	pmd_t pmdval;
mm/shmem.c (+12 -10)
···
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static void shmem_update_stats(struct folio *folio, int nr_pages)
+{
+	if (folio_test_pmd_mappable(folio))
+		__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr_pages);
+	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+	__lruvec_stat_mod_folio(folio, NR_SHMEM, nr_pages);
+}
+
 /*
  * Somewhat like filemap_add_folio, but error if expected item has gone.
  */
···
 		xas_store(&xas, folio);
 		if (xas_error(&xas))
 			goto unlock;
-		if (folio_test_pmd_mappable(folio))
-			__lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, nr);
-		__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr);
-		__lruvec_stat_mod_folio(folio, NR_SHMEM, nr);
+		shmem_update_stats(folio, nr);
 		mapping->nrpages += nr;
 unlock:
 		xas_unlock_irq(&xas);
···
 	error = shmem_replace_entry(mapping, folio->index, folio, radswap);
 	folio->mapping = NULL;
 	mapping->nrpages -= nr;
-	__lruvec_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
-	__lruvec_stat_mod_folio(folio, NR_SHMEM, -nr);
+	shmem_update_stats(folio, -nr);
 	xa_unlock_irq(&mapping->i_pages);
 	folio_put_refs(folio, nr);
 	BUG_ON(error);
···
 	}
 	if (!error) {
 		mem_cgroup_replace_folio(old, new);
-		__lruvec_stat_mod_folio(new, NR_FILE_PAGES, nr_pages);
-		__lruvec_stat_mod_folio(new, NR_SHMEM, nr_pages);
-		__lruvec_stat_mod_folio(old, NR_FILE_PAGES, -nr_pages);
-		__lruvec_stat_mod_folio(old, NR_SHMEM, -nr_pages);
+		shmem_update_stats(new, nr_pages);
+		shmem_update_stats(old, -nr_pages);
 	}
 	xa_unlock_irq(&swap_mapping->i_pages);
mm/vma.c (+4 -1)
···
 
 	/* If flags changed, we might be able to merge, so try again. */
 	if (map.retry_merge) {
+		struct vm_area_struct *merged;
 		VMG_MMAP_STATE(vmg, &map, vma);
 
 		vma_iter_config(map.vmi, map.addr, map.end);
-		vma_merge_existing_range(&vmg);
+		merged = vma_merge_existing_range(&vmg);
+		if (merged)
+			vma = merged;
 	}
 
 	__mmap_complete(&map, vma);
mm/vmalloc.c (+4 -2)
···
 		struct page *page = vm->pages[i];
 
 		BUG_ON(!page);
-		mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
+		if (!(vm->flags & VM_MAP_PUT_PAGES))
+			mod_memcg_page_state(page, MEMCG_VMALLOC, -1);
 		/*
 		 * High-order allocs for huge vmallocs are split, so
 		 * can be freed as an array of order-0 allocations
···
 		__free_page(page);
 		cond_resched();
 	}
-	atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
+	if (!(vm->flags & VM_MAP_PUT_PAGES))
+		atomic_long_sub(vm->nr_pages, &nr_vmalloc_pages);
 	kvfree(vm->pages);
 	kfree(vm);
 }
tools/testing/selftests/memfd/memfd_test.c (+12 -2)
···
 #include <fcntl.h>
 #include <linux/memfd.h>
 #include <sched.h>
+#include <stdbool.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <signal.h>
···
 	close(fd);
 }
 
+static bool pid_ns_supported(void)
+{
+	return access("/proc/self/ns/pid", F_OK) == 0;
+}
+
 int main(int argc, char **argv)
 {
 	pid_t pid;
···
 	test_seal_grow();
 	test_seal_resize();
 
-	test_sysctl_simple();
-	test_sysctl_nested();
+	if (pid_ns_supported()) {
+		test_sysctl_simple();
+		test_sysctl_nested();
+	} else {
+		printf("PID namespaces are not supported; skipping sysctl tests\n");
+	}
 
 	test_share_dup("SHARE-DUP", "");
 	test_share_mmap("SHARE-MMAP", "");