Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

docs/mm: add VMA locks documentation

Locking around VMAs is complicated and confusing. While we have a number
of disparate comments scattered around the place, we seem to be reaching a
level of complexity that justifies a serious effort at clearly documenting
how locks are expected to be used when it comes to interacting with
mm_struct and vm_area_struct objects.

This is especially pertinent as regards the efforts to find sensible
abstractions for these fundamental objects in kernel rust code whose
compiler strictly requires some means of expressing these rules (and
through this expression, self-document these requirements as well as
enforce them).

The document limits scope to mmap and VMA locks and those that are
immediately adjacent and relevant to them - so additionally covers page
table locking as this is so very closely tied to VMA operations (and
relies upon us handling these correctly).

The document tries to cover some of the nastier and more confusing edge
cases and concerns especially around lock ordering and page table
teardown.

The document is split between generally useful information for users of mm
interfaces, and separately a section intended for mm kernel developers
providing a discussion around internal implementation details.

[lorenzo.stoakes@oracle.com: v3]
Link: https://lkml.kernel.org/r/20241114205402.859737-1-lorenzo.stoakes@oracle.com
[lorenzo.stoakes@oracle.com: docs/mm: minor corrections]
Link: https://lkml.kernel.org/r/d3de735a-25ae-4eb2-866c-a9624fe6f795@lucifer.local
[jannh@google.com: docs/mm: add more warnings around page table access]
Link: https://lkml.kernel.org/r/20241118-vma-docs-addition1-onv3-v2-1-c9d5395b72ee@google.com
Link: https://lkml.kernel.org/r/20241108135708.48567-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Lorenzo Stoakes, committed by Andrew Morton
dbf8be82 78d4f34e

+850
Documentation/mm/process_addrs.rst
=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
   architectures which use :c:struct:`!vsyscall` and is a global static
   object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
   they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be stabilised via
  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
  locks as the reverse mapping locks, or 'rmap locks' for brevity.

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieves is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
  *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
   means that without a VMA write lock, page faults will run concurrently with
   whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read
   lock, attempting to do the reverse is invalid as it can result in deadlock -
   if another task already holds an mmap write lock and attempts to acquire a
   VMA write lock, that will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
   permits concurrent readers. However a write lock can only be obtained
   once all readers have left the critical region (and pending readers
   made to wait).

   This renders read locks on a read/write semaphore concurrent with other
   readers and write locks exclusive against all others holding the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
   are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW'd     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
   then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
   trees at the same time, so all of these fields might be utilised at
   once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contain entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page tables than
   five the kernel cleverly 'folds' page table levels, that is stubbing
   out functions related to the skipped levels. This allows us to
   conceptually act as if there were always five levels, even if the
   compiler might, in practice, eliminate any code relating to missing
   ones.
There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`).
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
   lock as it, like zapping, does not fundamentally modify the identity
   of what is being mapped.
328 + 329 + **Traversing** and **zapping** ranges can be performed holding any one of the 330 + locks described in the terminology section above - that is the mmap lock, the 331 + VMA lock or either of the reverse mapping locks. 332 + 333 + That is - as long as you keep the relevant VMA **stable** - you are good to go 334 + ahead and perform these operations on page tables (though internally, kernel 335 + operations that perform writes also acquire internal page table locks to 336 + serialise - see the page table implementation detail section for more details). 337 + 338 + When **installing** page table entries, the mmap or VMA lock must be held to 339 + keep the VMA stable. We explore why this is in the page table locking details 340 + section below. 341 + 342 + .. warning:: Page tables are normally only traversed in regions covered by VMAs. 343 + If you want to traverse page tables in areas that might not be 344 + covered by VMAs, heavier locking is required. 345 + See :c:func:`!walk_page_range_novma` for details. 346 + 347 + **Freeing** page tables is an entirely internal memory management operation and 348 + has special requirements (see the page freeing section below for more details). 349 + 350 + .. warning:: When **freeing** page tables, it must not be possible for VMAs 351 + containing the ranges those page tables map to be accessible via 352 + the reverse mapping. 353 + 354 + The :c:func:`!free_pgtables` function removes the relevant VMAs 355 + from the reverse mappings, but no other VMAs can be permitted to be 356 + accessible and span the specified range. 357 + 358 + Lock ordering 359 + ------------- 360 + 361 + As we have multiple locks across the kernel which may or may not be taken at the 362 + same time as explicit mm or VMA locks, we have to be wary of lock inversion, and 363 + the **order** in which locks are acquired and released becomes very important. 364 + 365 + .. 
note:: Lock inversion occurs when two threads need to acquire multiple locks, 366 + but in doing so inadvertently cause a mutual deadlock. 367 + 368 + For example, consider thread 1 which holds lock A and tries to acquire lock B, 369 + while thread 2 holds lock B and tries to acquire lock A. 370 + 371 + Both threads are now deadlocked on each other. However, had they attempted to 372 + acquire locks in the same order, one would have waited for the other to 373 + complete its work and no deadlock would have occurred. 374 + 375 + The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required 376 + ordering of locks within memory management code: 377 + 378 + .. code-block:: 379 + 380 + inode->i_rwsem (while writing or truncating, not reading or faulting) 381 + mm->mmap_lock 382 + mapping->invalidate_lock (in filemap_fault) 383 + folio_lock 384 + hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below) 385 + vma_start_write 386 + mapping->i_mmap_rwsem 387 + anon_vma->rwsem 388 + mm->page_table_lock or pte_lock 389 + swap_lock (in swap_duplicate, swap_info_get) 390 + mmlist_lock (in mmput, drain_mmlist and others) 391 + mapping->private_lock (in block_dirty_folio) 392 + i_pages lock (widely used) 393 + lruvec->lru_lock (in folio_lruvec_lock_irq) 394 + inode->i_lock (in set_page_dirty's __mark_inode_dirty) 395 + bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) 396 + sb_lock (within inode_lock in fs/fs-writeback.c) 397 + i_pages lock (widely used, in set_page_dirty, 398 + in arch-dependent flush_dcache_mmap_lock, 399 + within bdi.wb->list_lock in __sync_single_inode) 400 + 401 + There is also a file-system specific lock ordering comment located at the top of 402 + :c:macro:`!mm/filemap.c`: 403 + 404 + .. 
code-block:: 405 + 406 + ->i_mmap_rwsem (truncate_pagecache) 407 + ->private_lock (__free_pte->block_dirty_folio) 408 + ->swap_lock (exclusive_swap_page, others) 409 + ->i_pages lock 410 + 411 + ->i_rwsem 412 + ->invalidate_lock (acquired by fs in truncate path) 413 + ->i_mmap_rwsem (truncate->unmap_mapping_range) 414 + 415 + ->mmap_lock 416 + ->i_mmap_rwsem 417 + ->page_table_lock or pte_lock (various, mainly in memory.c) 418 + ->i_pages lock (arch-dependent flush_dcache_mmap_lock) 419 + 420 + ->mmap_lock 421 + ->invalidate_lock (filemap_fault) 422 + ->lock_page (filemap_fault, access_process_vm) 423 + 424 + ->i_rwsem (generic_perform_write) 425 + ->mmap_lock (fault_in_readable->do_page_fault) 426 + 427 + bdi->wb.list_lock 428 + sb_lock (fs/fs-writeback.c) 429 + ->i_pages lock (__sync_single_inode) 430 + 431 + ->i_mmap_rwsem 432 + ->anon_vma.lock (vma_merge) 433 + 434 + ->anon_vma.lock 435 + ->page_table_lock or pte_lock (anon_vma_prepare and various) 436 + 437 + ->page_table_lock or pte_lock 438 + ->swap_lock (try_to_unmap_one) 439 + ->private_lock (try_to_unmap_one) 440 + ->i_pages lock (try_to_unmap_one) 441 + ->lruvec->lru_lock (follow_page_mask->mark_page_accessed) 442 + ->lruvec->lru_lock (check_pte_range->folio_isolate_lru) 443 + ->private_lock (folio_remove_rmap_pte->set_page_dirty) 444 + ->i_pages lock (folio_remove_rmap_pte->set_page_dirty) 445 + bdi.wb->list_lock (folio_remove_rmap_pte->set_page_dirty) 446 + ->inode->i_lock (folio_remove_rmap_pte->set_page_dirty) 447 + bdi.wb->list_lock (zap_pte_range->set_page_dirty) 448 + ->inode->i_lock (zap_pte_range->set_page_dirty) 449 + ->private_lock (zap_pte_range->block_dirty_folio) 450 + 451 + Please check the current state of these comments which may have changed since 452 + the time of writing of this document. 453 + 454 + ------------------------------ 455 + Locking Implementation Details 456 + ------------------------------ 457 + 458 + .. 
warning:: Locking rules for PTE-level page tables are very different from 459 + locking rules for page tables at other levels. 460 + 461 + Page table locking details 462 + -------------------------- 463 + 464 + In addition to the locks described in the terminology section above, we have 465 + additional locks dedicated to page tables: 466 + 467 + * **Higher level page table locks** - Higher level page tables, that is PGD, P4D 468 + and PUD each make use of the process address space granularity 469 + :c:member:`!mm->page_table_lock` lock when modified. 470 + 471 + * **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks 472 + either kept within the folios describing the page tables or allocated 473 + separated and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is 474 + set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are 475 + mapped into higher memory (if a 32-bit system) and carefully locked via 476 + :c:func:`!pte_offset_map_lock`. 477 + 478 + These locks represent the minimum required to interact with each page table 479 + level, but there are further requirements. 480 + 481 + Importantly, note that on a **traversal** of page tables, sometimes no such 482 + locks are taken. However, at the PTE level, at least concurrent page table 483 + deletion must be prevented (using RCU) and the page table must be mapped into 484 + high memory, see below. 485 + 486 + Whether care is taken on reading the page table entries depends on the 487 + architecture, see the section on atomicity below. 488 + 489 + Locking rules 490 + ^^^^^^^^^^^^^ 491 + 492 + We establish basic locking rules when interacting with page tables: 493 + 494 + * When changing a page table entry the page table lock for that page table 495 + **must** be held, except if you can safely assume nobody can access the page 496 + tables concurrently (such as on invocation of :c:func:`!free_pgtables`). 
497 + * Reads from and writes to page table entries must be *appropriately* 498 + atomic. See the section on atomicity below for details. 499 + * Populating previously empty entries requires that the mmap or VMA locks are 500 + held (read or write), doing so with only rmap locks would be dangerous (see 501 + the warning below). 502 + * As mentioned previously, zapping can be performed while simply keeping the VMA 503 + stable, that is holding any one of the mmap, VMA or rmap locks. 504 + 505 + .. warning:: Populating previously empty entries is dangerous as, when unmapping 506 + VMAs, :c:func:`!vms_clear_ptes` has a window of time between 507 + zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via 508 + :c:func:`!free_pgtables`), where the VMA is still visible in the 509 + rmap tree. :c:func:`!free_pgtables` assumes that the zap has 510 + already been performed and removes PTEs unconditionally (along with 511 + all other page tables in the freed range), so installing new PTE 512 + entries could leak memory and also cause other unexpected and 513 + dangerous behaviour. 514 + 515 + There are additional rules applicable when moving page tables, which we discuss 516 + in the section on this topic below. 517 + 518 + PTE-level page tables are different from page tables at other levels, and there 519 + are extra requirements for accessing them: 520 + 521 + * On 32-bit architectures, they may be in high memory (meaning they need to be 522 + mapped into kernel memory to be accessible). 523 + * When empty, they can be unlinked and RCU-freed while holding an mmap lock or 524 + rmap lock for reading in combination with the PTE and PMD page table locks. 525 + In particular, this happens in :c:func:`!retract_page_tables` when handling 526 + :c:macro:`!MADV_COLLAPSE`. 
527 + So accessing PTE-level page tables requires at least holding an RCU read lock; 528 + but that only suffices for readers that can tolerate racing with concurrent 529 + page table updates such that an empty PTE is observed (in a page table that 530 + has actually already been detached and marked for RCU freeing) while another 531 + new page table has been installed in the same location and filled with 532 + entries. Writers normally need to take the PTE lock and revalidate that the 533 + PMD entry still refers to the same PTE-level page table. 534 + 535 + To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or 536 + :c:func:`!pte_offset_map` can be used depending on stability requirements. 537 + These map the page table into kernel memory if required, take the RCU lock, and 538 + depending on variant, may also look up or acquire the PTE lock. 539 + See the comment on :c:func:`!__pte_offset_map_lock`. 540 + 541 + Atomicity 542 + ^^^^^^^^^ 543 + 544 + Regardless of page table locks, the MMU hardware concurrently updates accessed 545 + and dirty bits (perhaps more, depending on architecture). Additionally, page 546 + table traversal operations in parallel (though holding the VMA stable) and 547 + functionality like GUP-fast locklessly traverses (that is reads) page tables, 548 + without even keeping the VMA stable at all. 549 + 550 + When performing a page table traversal and keeping the VMA stable, whether a 551 + read must be performed once and only once or not depends on the architecture 552 + (for instance x86-64 does not require any special precautions). 553 + 554 + If a write is being performed, or if a read informs whether a write takes place 555 + (on an installation of a page table entry say, for instance in 556 + :c:func:`!__pud_install`), special care must always be taken. 
In these cases we 557 + can never assume that page table locks give us entirely exclusive access, and 558 + must retrieve page table entries once and only once. 559 + 560 + If we are reading page table entries, then we need only ensure that the compiler 561 + does not rearrange our loads. This is achieved via :c:func:`!pXXp_get` 562 + functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`, 563 + :c:func:`!pmdp_get`, and :c:func:`!ptep_get`. 564 + 565 + Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads 566 + the page table entry only once. 567 + 568 + However, if we wish to manipulate an existing page table entry and care about 569 + the previously stored data, we must go further and use an hardware atomic 570 + operation as, for example, in :c:func:`!ptep_get_and_clear`. 571 + 572 + Equally, operations that do not rely on the VMA being held stable, such as 573 + GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like 574 + :c:func:`!gup_fast_pte_range`), must very carefully interact with page table 575 + entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for 576 + higher level page table levels. 577 + 578 + Writes to page table entries must also be appropriately atomic, as established 579 + by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`, 580 + :c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`. 581 + 582 + Equally functions which clear page table entries must be appropriately atomic, 583 + as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`, 584 + :c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and 585 + :c:func:`!pte_clear`. 586 + 587 + Page table installation 588 + ^^^^^^^^^^^^^^^^^^^^^^^ 589 + 590 + Page table installation is performed with the VMA held stable explicitly by an 591 + mmap or VMA lock in read or write mode (see the warning in the locking rules 592 + section for details as to why). 
When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
   references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock`, which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but,
   for brevity, we do not explore this. See the comment for
   :c:func:`!__pte_offset_map_lock` for more details.
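The map-then-revalidate behaviour of :c:func:`!pte_offset_map_lock` can be
approximated in user-space C. This is a hedged sketch, not kernel code: the
simulated PMD entry, the mutex standing in for the PTE page table lock, and the
:c:func:`!model_pte_map_lock` helper are all invented stand-ins.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the lockless-read / lock / revalidate dance
 * performed by __pte_offset_map_lock(); every name is an invented
 * stand-in for the kernel machinery. */
static _Atomic uint64_t pmd_entry;	/* simulated PMD entry */
static pthread_mutex_t pte_lock = PTHREAD_MUTEX_INITIALIZER;

static bool model_pte_map_lock(uint64_t *snapshot)
{
	/* Lockless read, as with pmdp_get() under RCU. */
	uint64_t pmd = atomic_load(&pmd_entry);

	if (pmd == 0)
		return false;		/* no PTE table to map */

	pthread_mutex_lock(&pte_lock);	/* PTE-level page table lock */

	/* Revalidate: a concurrent THP collapse may have changed the PMD. */
	if (atomic_load(&pmd_entry) != pmd) {
		pthread_mutex_unlock(&pte_lock);
		return false;
	}
	*snapshot = pmd;
	return true;			/* locked and validated */
}

static void model_pte_unmap_unlock(void)
{
	pthread_mutex_unlock(&pte_lock);
}
```

The key point the sketch captures is that the PMD value read before taking the
lock is checked *again* afterwards, and the caller must be prepared for failure.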
When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty and, if so, only then acquire the page table lock, checking again
to see whether it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.

At the leaf page table, that is the PTE, we can't entirely rely on this pattern,
as we have separate PMD and PTE locks and a THP collapse, for instance, might
have eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur, then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way in which page tables designated for removal can be
traversed or referenced by concurrent tasks.
It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings; however, it is important
that no new ones overlap them, and that no route remains by which addresses
within the range whose page tables are being torn down can be accessed.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independently of
   the page tables above them, as is done by :c:func:`!retract_page_tables`,
   which is performed under the i_mmap read lock and the PMD and PTE page
   table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is, PUD, P4D and
PGD page tables).
Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks`, which perform the
rmap side of lock acquisition, invoked ultimately by
:c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a
competing write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
via :c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances
where a VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock
is always acquired. An mmap write lock **must** be held for the duration of the
VMA write lock; releasing or downgrading the mmap write lock also releases the
VMA write lock, so there is no :c:func:`!vma_end_write` function.

Note that a semaphore write lock is not held across a VMA lock. Rather, a
sequence number is used for serialisation, and the write semaphore is only
acquired at the point of write lock in order to update this sequence number.
This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
read/write semaphore and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released, as it is
only required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu`,
which is the interface a user should use.

Writing requires the mmap lock to be write-locked and the VMA lock to be
acquired via :c:func:`!vma_start_write`; however, the write lock is released by
the termination or downgrade of the mmap write lock, so no
:c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq`, then the VMA is write-locked. If
they differ, then it is not.
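The sequence-count rule can be captured in a few lines of user-space C. This is
a hedged sketch: the :c:struct:`!mm_model` and :c:struct:`!vma_model` structures
and the :c:func:`!model_` helpers are invented names that merely echo the kernel
fields described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the sequence-count rule: field names mirror
 * mm->mm_lock_seq and vma->vm_lock_seq, but this is not kernel code. */
struct mm_model { int mm_lock_seq; };
struct vma_model { int vm_lock_seq; };

/* Equal sequence counts mean the VMA is write-locked. */
static bool vma_is_write_locked(const struct mm_model *mm,
				const struct vma_model *vma)
{
	return vma->vm_lock_seq == mm->mm_lock_seq;
}

/* Model of vma_start_write(): record the current mm sequence number. */
static void model_vma_start_write(struct mm_model *mm, struct vma_model *vma)
{
	vma->vm_lock_seq = mm->mm_lock_seq;
}

/* Model of vma_end_write_all(): one increment unlocks every VMA at once. */
static void model_vma_end_write_all(struct mm_model *mm)
{
	mm->mm_lock_seq++;
}
```

Note how a single increment of the mm-wide count releases every VMA write lock
simultaneously, without touching any per-VMA state.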
Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked, which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated, and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we acquire a read lock on the
:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
the sequence count of the VMA does not match that of the mm.

If it does, the read lock fails. If it does not, we hold the lock, excluding
writers but permitting other readers, who will also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
read/write semaphore, before setting the VMA's sequence number under this lock,
while also simultaneously holding the mmap write lock.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the lock is released, avoiding
complexity with a long-term held write lock.
This clever combination of a read/write semaphore and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held, one has exclusive access to resources within
the mmap (with the usual caveats about requiring VMA write locks to avoid races
with tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another, showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and an N indicates that they are not.
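As a sanity check, the table can be encoded directly in a few lines of C. The
:c:func:`!excludes` helper below is purely illustrative, transcribing the
exclusivity matrix above rather than reflecting any kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* The exclusivity table above, encoded directly: true means the two lock
 * kinds are mutually exclusive. Purely illustrative. */
enum lock_kind { R, D, W };	/* read, downgraded write, write */

static bool excludes(enum lock_kind a, enum lock_kind b)
{
	static const bool table[3][3] = {
		/*	  R	 D	W    */
		[R] = { false, false, true  },
		[D] = { false, true,  true  },
		[W] = { true,  true,  true  },
	};
	return table[a][b];
}
```

Encoding it this way makes the two properties discussed above easy to verify:
the relation is symmetric, and a downgraded lock excludes other downgraded
locks while still coexisting with readers.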
Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults; as a result, we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.