Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/khugepaged: take the right locks for page table retraction

Pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these locks needs to be held, and it does
not need to be held in exclusive mode.

Under those circumstances, the rules for concurrent access to page table
entries are:

- Terminal page table entries (entries that don't point to another page
  table) can be arbitrarily changed under the page table lock, with the
  exception that they always need to be consistent for hardware page
  table walks and lockless_pages_from_mm(). This includes that they can
  be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
  cannot be modified; readers are allowed to READ_ONCE() an entry, verify
  that it is non-terminal, and then assume that its value will stay as-is.

Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; instead, retraction requires taking, in exclusive mode, all
the higher-level locks under which a page walk of the relevant range can be
started.

The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.

Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Jann Horn, committed by Andrew Morton
8d3c106e 829ae0f8

+51 -4
mm/khugepaged.c
···
 	return SCAN_SUCCEED;
 }
 
+/*
+ * A note about locking:
+ * Trying to take the page table spinlocks would be useless here because those
+ * are only used to synchronize:
+ *
+ * - modifying terminal entries (ones that point to a data page, not to another
+ *   page table)
+ * - installing *new* non-terminal entries
+ *
+ * Instead, we need roughly the same kind of protection as free_pgtables() or
+ * mm_take_all_locks() (but only for a single VMA):
+ * The mmap lock together with this VMA's rmap locks covers all paths towards
+ * the page table entries we're messing with here, except for hardware page
+ * table walks and lockless_pages_from_mm().
+ */
 static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t *pmdp)
 {
-	spinlock_t *ptl;
 	pmd_t pmd;
 
 	mmap_assert_write_locked(mm);
-	ptl = pmd_lock(vma->vm_mm, pmdp);
+	if (vma->vm_file)
+		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
+	/*
+	 * All anon_vmas attached to the VMA have the same root and are
+	 * therefore locked by the same lock.
+	 */
+	if (vma->anon_vma)
+		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
+
 	pmd = pmdp_collapse_flush(vma, addr, pmdp);
-	spin_unlock(ptl);
 	mm_dec_nr_ptes(mm);
 	page_table_check_pte_clear_range(mm, addr, pmd);
 	pte_free(mm, pmd_pgtable(pmd));
···
 	if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
 		return SCAN_VMA_CHECK;
 
+	/*
+	 * Symmetry with retract_page_tables(): Exclude MAP_PRIVATE mappings
+	 * that got written to. Without this, we'd have to also lock the
+	 * anon_vma if one exists.
+	 */
+	if (vma->anon_vma)
+		return SCAN_VMA_CHECK;
+
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
 	if (userfaultfd_wp(vma))
 		return SCAN_PTE_UFFD_WP;
···
 		goto drop_hpage;
 	}
 
+	/*
+	 * We need to lock the mapping so that from here on, only GUP-fast and
+	 * hardware page walks can access the parts of the page tables that
+	 * we're operating on.
+	 * See collapse_and_free_pmd().
+	 */
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	/*
+	 * This spinlock should be unnecessary: Nobody else should be accessing
+	 * the page tables under spinlock protection here, only
+	 * lockless_pages_from_mm() and the hardware page walker can access page
+	 * tables while all the high-level locks are held in write mode.
+	 */
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
 	result = SCAN_FAIL;
 
···
 	/* step 4: remove pte entries */
 	collapse_and_free_pmd(mm, vma, haddr, pmd);
 
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+
 maybe_install_pmd:
 	/* step 5: install pmd entry */
 	result = install_pmd
···
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	goto drop_hpage;
 }
···
 	 * An alternative would be drop the check, but check that page
 	 * table is clear before calling pmdp_collapse_flush() under
 	 * ptl. It has higher chance to recover THP for the VMA, but
-	 * has higher cost too.
+	 * has higher cost too. It would also probably require locking
+	 * the anon_vma.
 	 */
 	if (vma->anon_vma) {
 		result = SCAN_PAGE_ANON;