Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: rmap use pte lock not mmap_sem to set PageMlocked

KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
found mapped in a VM_LOCKED vma) is ineffective against races with
exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
tearing down an mm.

But that's okay, those races are benign; and although we've believed for
years in that ugly down_read_trylock(), it's unsuitable for the job, and
frustrates the good intention of setting PageMlocked when it fails.

It just doesn't matter if here we read vm_flags an instant before or after
a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
syscalls (or exit) work their way up the address space (taking pt locks
after updating vm_flags) to establish the final state.

We do still need to be careful never to mark a page Mlocked (hence
unevictable) by any race that will not be corrected shortly after. The
page lock protects from many of the races, but not all (a page is not
necessarily locked when it's unmapped). But the pte lock we just dropped
is good to cover the rest (and serializes even with
munlock_vma_pages_all(), so no special barriers required): now hold on to
the pte lock while calling mlock_vma_page(). Is that lock ordering safe?
Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
calls the complementary clear_page_mlock().

This fixes the following case (though not a case which anyone has
complained of), which mmap_sem did not: truncation's preliminary
unmap_mapping_range() is supposed to remove even the anonymous COWs of
filecache pages, and that might race with try_to_unmap_one() on a
VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
freed. The pte lock protects against this.

You could say that it also protects against the more ordinary case, racing
with the preliminary unmapping of a filecache page itself: but in our
current tree, that's independently protected by i_mmap_rwsem; and that
race would be why "Bad page state (mlocked)" was seen before commit
48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem").

Vlastimil Babka points out another race which this patch protects against.
try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
leaving PageMlocked and unevictable when it should be evictable. mmap_sem
is ineffective because exit_mmap() does not hold it; page lock ineffective
because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
is effective because __munlock_pagevec_fill() takes it to get the page,
after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.

Kirill Shutemov points out that if the compiler chooses to implement a
"vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
with an intermediate store of unrelated bits set, since I'm here foregoing
its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does
not appear to be an immediate problem, but we may want to define vm_flags
accessors in future, to guard against such a possibility.

While we're here, make a related optimization in try_to_unmap_one(): if
it's doing TTU_MUNLOCK, then there's no point at all in descending the
page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes,
that can change racily, but it can change racily even without the
optimization: it's not critical. Far better not to waste time here.

Stopped short of separating try_to_munlock_one() from try_to_unmap_one()
on this occasion, but that's probably the sensible next step - with a
rename, given that try_to_munlock()'s business is to try to set Mlocked.

Updated the unevictable-lru Documentation, to remove its reference to mmap
semaphore, but found a few more updates needed in just that area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Hugh Dickins, committed by Linus Torvalds
commit b87537d9 (parent 7a14239a)

 Documentation/vm/unevictable-lru.txt | 16 insertions(+), 45 deletions(-)
 mm/rmap.c                            | 11 insertions(+), 25 deletions(-)
 2 files changed, 27 insertions(+), 70 deletions(-)
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -531,37 +531,20 @@
 
 try_to_unmap() is always called, by either vmscan for reclaim or for page
 migration, with the argument page locked and isolated from the LRU. Separate
-functions handle anonymous and mapped file pages, as these types of pages have
-different reverse map mechanisms.
+functions handle anonymous and mapped file and KSM pages, as these types of
+pages have different reverse map lookup mechanisms, with different locking.
+In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
+it will call try_to_unmap_one() for every VMA which might contain the page.
 
- (*) try_to_unmap_anon()
+When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
+VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
+and return SWAP_MLOCK to indicate that the page is unevictable: and the scan
+stops there.
 
-     To unmap anonymous pages, each VMA in the list anchored in the anon_vma
-     must be visited - at least until a VM_LOCKED VMA is encountered. If the
-     page is being unmapped for migration, VM_LOCKED VMAs do not stop the
-     process because mlocked pages are migratable. However, for reclaim, if
-     the page is mapped into a VM_LOCKED VMA, the scan stops.
-
-     try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
-     the mm_struct to which the VMA belongs. If this is successful, it will
-     mlock the page via mlock_vma_page() - we wouldn't have gotten to
-     try_to_unmap_anon() if the page were already mlocked - and will return
-     SWAP_MLOCK, indicating that the page is unevictable.
-
-     If the mmap semaphore cannot be acquired, we are not sure whether the page
-     is really unevictable or not. In this case, try_to_unmap_anon() will
-     return SWAP_AGAIN.
-
- (*) try_to_unmap_file()
-
-     Unmapping of a mapped file page works the same as for anonymous mappings,
-     except that the scan visits all VMAs that map the page's index/page offset
-     in the page's mapping's reverse map interval search tree.
-
-     As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
-     page, try_to_unmap_file() will attempt to acquire the associated
-     mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
-     is successful, and SWAP_AGAIN, if not.
+mlock_vma_page() is called while holding the page table's lock (in addition
+to the page lock, and the rmap lock): to serialize against concurrent mlock or
+munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
+holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
 try_to_munlock() REVERSE MAP SCAN
@@ -577,22 +560,15 @@
 introduced a variant of try_to_unmap() called try_to_munlock().
 
 try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file pages with an additional argument specifying unlock versus unmap
+mapped file and KSM pages with a flag argument specifying unlock versus unmap
 processing. Again, these functions walk the respective reverse maps looking
 for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case,
-the functions attempt to acquire the associated mmap semaphore, mlock the page
-via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
-pre-clearing of the page's PG_mlocked done by munlock_vma_page.
-
-If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
-semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
-recycle the page on the inactive list and hope that it has better luck with the
-page next time.
+the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This
+undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
 Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
-However, the scan can terminate when it encounters a VM_LOCKED VMA and can
-successfully acquire the VMA's mmap semaphore for read and mlock the page.
+However, the scan can terminate when it encounters a VM_LOCKED VMA.
 Although try_to_munlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
@@ -619,11 +595,6 @@
 
 (3) mlocked pages that could not be isolated from the LRU and moved to the
     unevictable list in mlock_vma_page().
-
-(4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
-    acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
-    munlock_vma_page() was forced to let the page back on to the normal LRU
-    list for vmscan to handle.
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
 inactive lists to the appropriate zone's unevictable list.
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1304,6 +1304,10 @@
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
+	/* munlock has nothing to gain from examining un-locked vmas */
+	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
+		goto out;
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
@@ -1314,9 +1318,12 @@
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!(flags & TTU_IGNORE_MLOCK)) {
-		if (vma->vm_flags & VM_LOCKED)
-			goto out_mlock;
-
+		if (vma->vm_flags & VM_LOCKED) {
+			/* Holding pte lock, we do *not* need mmap_sem here */
+			mlock_vma_page(page);
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
 		if (flags & TTU_MUNLOCK)
 			goto out_unmap;
 	}
@@ -1421,29 +1428,8 @@
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
+	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
-
-out_mlock:
-	pte_unmap_unlock(pte, ptl);
-
-
-	/*
-	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
-	 * unstable result and race. Plus, We can't wait here because
-	 * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
-	 * if trylock failed, the page remain in evictable lru and later
-	 * vmscan could retry to move the page to unevictable lru if the
-	 * page is actually mlocked.
-	 */
-	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
-		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
-			ret = SWAP_MLOCK;
-		}
-		up_read(&vma->vm_mm->mmap_sem);
-	}
-	return ret;
 }