Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather

As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
where we perform so many IPI broadcasts when unsharing hugetlb PMD page
tables that it severely regresses some workloads.

In particular, when we fork()+exit(), or when we munmap() a large
area backed by many shared PMD tables, we perform one IPI broadcast per
unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during
exit(), it is sufficient to send a single IPI broadcast (as long as
we respect locking rules) instead of one per PMD table.

Locking prevents any of these PMD tables from getting reused before
we drop the lock.

(2) When we are not the last sharer (> 2 users including us), there is
no need to send the IPI broadcast. The shared PMD tables cannot
become exclusive (fully unshared) before an IPI will be broadcasted
by the last sharer.

Concurrent GUP-fast could walk into a PMD table just before we
unshared it. It could then succeed in grabbing a page from the
shared page table even after munmap() etc. succeeded (and suppressed
an IPI). But there is no difference compared to GUP-fast just
sleeping for a while after grabbing the page and re-enabling IRQs.

Most importantly, GUP-fast will never walk into page tables that are
no longer shared, because the last sharer will issue an IPI
broadcast.

(if ever required, checking whether the PUD changed in GUP-fast
after grabbing the page like we do in the PTE case could handle
this)

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
infrastructure so we can implement these optimizations and demystify the
code at least a bit. Extend the mmu_gather infrastructure to be able to
deal with our special hugetlb PMD table sharing implementation.

To make initialization of the mmu_gather easier when working on a single
VMA (in particular, when dealing with hugetlb), provide
tlb_gather_mmu_vma().

We'll consolidate the handling for (full) unsharing of PMD tables in
tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be
prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
require an explicit earlier call to tlb_flush_unshared_tables().

From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
that the expected lock protecting us from concurrent unsharing+reuse is
still held.

Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
tlb_flush_unshared_tables() was properly called earlier.

Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing:

There are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without
CONFIG_MMU_GATHER_RCU_TABLE_FREE.

Here, the assumption is that the previous TLB flush would send an
IPI to all relevant CPUs. Careful: some architectures like x86 only
send IPIs to all relevant CPUs when tlb->freed_tables is set.

The relevant architectures should be selecting
MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
kernels and it might have been problematic before this patch.

Also, the arch flushing behavior (independent of IPIs) is different
when tlb->freed_tables is set. Do we have to enlighten them to also
take care of tlb->unshared_tables? So far we didn't care, so
hopefully we are fine. Of course, we could be setting
tlb->freed_tables as well, but that might then unnecessarily flush
too much, because the semantics of tlb->freed_tables are a bit
fuzzy.

This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with
CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.

Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
we still issue IPIs during TLB flushes and don't actually need the
second tlb_remove_table_sync_one().

This optimization can be implemented on top of this, by checking e.g., in
tlb_remove_table_sync_one() whether we really need IPIs. But as
described in (1), it really must honor tlb->freed_tables then to
send IPIs to all relevant CPUs.

Notes on TLB flushing changes:

(1) Flushing for non-shared PMD tables

We're converting from flush_hugetlb_tlb_range() to
tlb_remove_huge_tlb_entry(). Given that we properly initialize the
MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to
__unmap_hugepage_range(), that should be fine.

(2) Flushing for shared PMD tables

We're converting from various things (flush_hugetlb_tlb_range(),
tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range().

tlb_flush_pmd_range() achieves the same that
tlb_remove_huge_tlb_entry() would achieve in these scenarios.
Note that tlb_remove_huge_tlb_entry() also calls
__tlb_remove_tlb_entry(), however that is only implemented on
powerpc, which does not support PMD table sharing.

Similar to (1), tlb_gather_mmu_vma() should make sure that TLB
flushing keeps on working as expected.

Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
concern, as we are holding the i_mmap_lock the whole time, preventing
concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until
this is fixed.

[david@kernel.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org
Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman <loberman@redhat.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by David Hildenbrand (Red Hat); committed by Andrew Morton.
8ce720d5 a8682d50 (+208 -66)
include/asm-generic/tlb.h (+75 -2)

···
  *
  * The mmu_gather API consists of:
  *
- *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_finish_mmu()
+ *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_gather_mmu_vma() /
+ *    tlb_finish_mmu()
  *
  * start and finish a mmu_gather
  *
···
 	unsigned int vma_huge : 1;
 	unsigned int vma_pfn : 1;
 
+	/*
+	 * Did we unshare (unmap) any shared page tables? For now only
+	 * used for hugetlb PMD table sharing.
+	 */
+	unsigned int unshared_tables : 1;
+
+	/*
+	 * Did we unshare any page tables such that they are now exclusive
+	 * and could get reused+modified by the new owner? When setting this
+	 * flag, "unshared_tables" will be set as well. For now only used
+	 * for hugetlb PMD table sharing.
+	 */
+	unsigned int fully_unshared_tables : 1;
+
 	unsigned int batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
···
 	tlb->cleared_pmds = 0;
 	tlb->cleared_puds = 0;
 	tlb->cleared_p4ds = 0;
+	tlb->unshared_tables = 0;
 	/*
 	 * Do not reset mmu_gather::vma_* fields here, we do not
 	 * call into tlb_start_vma() again to set them if there is an
···
 	 * these bits.
 	 */
 	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
-	      tlb->cleared_puds || tlb->cleared_p4ds))
+	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
 		return;
 
 	tlb_flush(tlb);
···
 	return true;
 }
 #endif
+
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
+		unsigned long addr)
+{
+	/*
+	 * The caller must make sure that concurrent unsharing + exclusive
+	 * reuse is impossible until tlb_flush_unshared_tables() was called.
+	 */
+	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
+	ptdesc_pmd_pts_dec(pt);
+
+	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
+	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
+
+	/*
+	 * If the page table is now exclusively owned, we fully unshared
+	 * a page table.
+	 */
+	if (!ptdesc_pmd_is_shared(pt))
+		tlb->fully_unshared_tables = true;
+	tlb->unshared_tables = true;
+}
+
+static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
+{
+	/*
+	 * As soon as the caller drops locks to allow for reuse of
+	 * previously-shared tables, these tables could get modified and
+	 * even reused outside of hugetlb context, so we have to make sure that
+	 * any page table walkers (incl. TLB, GUP-fast) are aware of that
+	 * change.
+	 *
+	 * Even if we are not fully unsharing a PMD table, we must
+	 * flush the TLB for the unsharer now.
+	 */
+	if (tlb->unshared_tables)
+		tlb_flush_mmu_tlbonly(tlb);
+
+	/*
+	 * Similarly, we must make sure that concurrent GUP-fast will not
+	 * walk previously-shared page tables that are getting modified+reused
+	 * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
+	 *
+	 * We only perform this when we are the last sharer of a page table,
+	 * as the IPI will reach all CPUs: any GUP-fast.
+	 *
+	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
+	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
+	 * required IPIs already for us.
+	 */
+	if (tlb->fully_unshared_tables) {
+		tlb_remove_table_sync_one();
+		tlb->fully_unshared_tables = false;
+	}
+}
+#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
 
 #endif /* CONFIG_MMU */
include/linux/hugetlb.h (+10 -5)

···
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, pte_t *ptep);
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep);
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 		unsigned long *start, unsigned long *end);
···
 	return NULL;
 }
 
-static inline int huge_pmd_unshare(struct mm_struct *mm,
-		struct vm_area_struct *vma,
-		unsigned long addr, pte_t *ptep)
+static inline int huge_pmd_unshare(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
 	return 0;
+}
+
+static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
+		struct vm_area_struct *vma)
+{
 }
 
 static inline void adjust_range_if_pmd_sharing_possible(
include/linux/mm_types.h (+1)

···
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
+void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
 struct vm_fault;
mm/hugetlb.c (+72 -51)

···
 	unsigned long last_addr_mask;
 	pte_t *src_pte, *dst_pte;
 	struct mmu_notifier_range range;
-	bool shared_pmd = false;
+	struct mmu_gather tlb;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
 				old_end);
···
 	 * range.
 	 */
 	flush_cache_range(vma, range.start, range.end);
+	tlb_gather_mmu_vma(&tlb, vma);
 
 	mmu_notifier_invalidate_range_start(&range);
 	last_addr_mask = hugetlb_mask_last_page(h);
···
 		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
 			continue;
 
-		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
-			shared_pmd = true;
+		if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
 			old_addr |= last_addr_mask;
 			new_addr |= last_addr_mask;
 			continue;
···
 			break;
 
 		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
+		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
 	}
 
-	if (shared_pmd)
-		flush_hugetlb_tlb_range(vma, range.start, range.end);
-	else
-		flush_hugetlb_tlb_range(vma, old_end - len, old_end);
+	tlb_flush_mmu_tlbonly(&tlb);
+	huge_pmd_unshare_flush(&tlb, vma);
+
 	mmu_notifier_invalidate_range_end(&range);
 	i_mmap_unlock_write(mapping);
 	hugetlb_vma_unlock_write(vma);
+	tlb_finish_mmu(&tlb);
 
 	return len + old_addr - old_end;
 }
···
 	unsigned long sz = huge_page_size(h);
 	bool adjust_reservation;
 	unsigned long last_addr_mask;
-	bool force_flush = false;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
···
 		}
 
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
 			spin_unlock(ptl);
-			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
-			force_flush = true;
 			address |= last_addr_mask;
 			continue;
 		}
···
 	}
 	tlb_end_vma(tlb, vma);
 
-	/*
-	 * There is nothing protecting a previously-shared page table that we
-	 * unshared through huge_pmd_unshare() from getting freed after we
-	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
-	 * succeeded, flush the range corresponding to the pud.
-	 */
-	if (force_flush)
-		tlb_flush_mmu_tlbonly(tlb);
+	huge_pmd_unshare_flush(tlb, vma);
 }
 
 void __hugetlb_zap_begin(struct vm_area_struct *vma,
···
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	long pages = 0, psize = huge_page_size(h);
-	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	struct mmu_gather tlb;
 
 	/*
 	 * In the case of shared PMDs, the area to flush could be beyond
···
 
 	BUG_ON(address >= end);
 	flush_cache_range(vma, range.start, range.end);
+	tlb_gather_mmu_vma(&tlb, vma);
 
 	mmu_notifier_invalidate_range_start(&range);
 	hugetlb_vma_lock_write(vma);
···
 			}
 		}
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		if (huge_pmd_unshare(&tlb, vma, address, ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
 			 * shouldn't happen at all.  Warn about it if it
···
 			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
 			pages++;
 			spin_unlock(ptl);
-			shared_pmd = true;
 			address |= last_addr_mask;
 			continue;
 		}
···
 			pte = huge_pte_clear_uffd_wp(pte);
 		huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 		pages++;
+		tlb_remove_huge_tlb_entry(h, &tlb, ptep, address);
 	}
 
next:
 		spin_unlock(ptl);
 		cond_resched();
 	}
-	/*
-	 * There is nothing protecting a previously-shared page table that we
-	 * unshared through huge_pmd_unshare() from getting freed after we
-	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
-	 * succeeded, flush the range corresponding to the pud.
-	 */
-	if (shared_pmd)
-		flush_hugetlb_tlb_range(vma, range.start, range.end);
-	else
-		flush_hugetlb_tlb_range(vma, start, end);
+
+	tlb_flush_mmu_tlbonly(&tlb);
+	huge_pmd_unshare_flush(&tlb, vma);
 	/*
 	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
 	 * downgrading page table protection not changing it to point to a new
···
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
 	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
 
 	return pages > 0 ? (pages << h->order) : pages;
 }
···
 	return pte;
 }
 
-/*
- * unmap huge page backed by shared pte.
+/**
+ * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
+ * @tlb: the current mmu_gather.
+ * @vma: the vma covering the pmd table.
+ * @addr: the address we are trying to unshare.
+ * @ptep: pointer into the (pmd) page table.
  *
- * Called with page table lock held.
+ * Called with the page table lock held, the i_mmap_rwsem held in write mode
+ * and the hugetlb vma lock held in write mode.
  *
- * returns: 1 successfully unmapped a shared pte page
- *	    0 the underlying pte page is not shared, or it is the last user
+ * Note: The caller must call huge_pmd_unshare_flush() before dropping the
+ * i_mmap_rwsem.
+ *
+ * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
+ * was not a shared PMD table.
  */
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, pte_t *ptep)
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep)
 {
 	unsigned long sz = huge_page_size(hstate_vma(vma));
+	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd = pgd_offset(mm, addr);
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	pud_t *pud = pud_offset(p4d, addr);
···
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	hugetlb_vma_assert_locked(vma);
 	pud_clear(pud);
-	/*
-	 * Once our caller drops the rmap lock, some other process might be
-	 * using this page table as a normal, non-hugetlb page table.
-	 * Wait for pending gup_fast() in other threads to finish before letting
-	 * that happen.
-	 */
-	tlb_remove_table_sync_one();
-	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
+
+	tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
+
 	mm_dec_nr_pmds(mm);
 	return 1;
+}
+
+/*
+ * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
+ * @tlb: the current mmu_gather.
+ * @vma: the vma covering the pmd table.
+ *
+ * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
+ * unsharing with concurrent page table walkers.
+ *
+ * This function must be called after a sequence of huge_pmd_unshare()
+ * calls while still holding the i_mmap_rwsem.
+ */
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	/*
+	 * We must synchronize page table unsharing such that nobody will
+	 * try reusing a previously-shared page table while it might still
+	 * be in use by previous sharers (TLB, GUP_fast).
+	 */
+	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+	tlb_flush_unshared_tables(tlb);
 }
 
 #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
···
 	return NULL;
 }
 
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, pte_t *ptep)
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep)
 {
 	return 0;
+}
+
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
 }
 
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
···
 	unsigned long sz = huge_page_size(h);
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
 	unsigned long address;
 	spinlock_t *ptl;
 	pte_t *ptep;
···
 		return;
 
 	flush_cache_range(vma, start, end);
+	tlb_gather_mmu_vma(&tlb, vma);
+
 	/*
 	 * No need to call adjust_range_if_pmd_sharing_possible(), because
 	 * we have already done the PUD_SIZE alignment.
···
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
-		huge_pmd_unshare(mm, vma, address, ptep);
+		huge_pmd_unshare(&tlb, vma, address, ptep);
 		spin_unlock(ptl);
 	}
-	flush_hugetlb_tlb_range(vma, start, end);
+	huge_pmd_unshare_flush(&tlb, vma);
 	if (take_locks) {
 		i_mmap_unlock_write(vma->vm_file->f_mapping);
 		hugetlb_vma_unlock_write(vma);
···
 	 * Documentation/mm/mmu_notifier.rst.
 	 */
 	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
 }
 
 /*
mm/mmu_gather.c (+33)

···
 #include <linux/swap.h>
 #include <linux/rmap.h>
 #include <linux/pgalloc.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlb.h>
 
···
 #endif
 	tlb->vma_pfn = 0;
 
+	tlb->fully_unshared_tables = 0;
 	__tlb_reset_range(tlb);
 	inc_tlb_flush_pending(tlb->mm);
 }
···
 }
 
 /**
+ * tlb_gather_mmu_vma - initialize an mmu_gather structure for operating on a
+ *		single VMA
+ * @tlb: the mmu_gather structure to initialize
+ * @vma: the vm_area_struct
+ *
+ * Called to initialize an (on-stack) mmu_gather structure for operating on
+ * a single VMA. In contrast to tlb_gather_mmu(), calling this function will
+ * not require another call to tlb_start_vma(). In contrast to tlb_start_vma(),
+ * this function will *not* call flush_cache_range().
+ *
+ * For hugetlb VMAs, this function will also initialize the mmu_gather
+ * page_size accordingly, not requiring a separate call to
+ * tlb_change_page_size().
+ */
+void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	tlb_gather_mmu(tlb, vma->vm_mm);
+	tlb_update_vma_flags(tlb, vma);
+	if (is_vm_hugetlb_page(vma))
+		/* All entries have the same size. */
+		tlb_change_page_size(tlb, huge_page_size(hstate_vma(vma)));
+}
+
+/**
  * tlb_finish_mmu - finish an mmu_gather structure
  * @tlb: the mmu_gather structure to finish
  *
···
  */
 void tlb_finish_mmu(struct mmu_gather *tlb)
 {
+	/*
+	 * We expect an earlier huge_pmd_unshare_flush() call to sort this out,
+	 * due to complicated locking requirements with page table unsharing.
+	 */
+	VM_WARN_ON_ONCE(tlb->fully_unshared_tables);
+
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
 	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
mm/rmap.c (+17 -8)

···
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/migrate.h>
···
 	 * if unsuccessful.
 	 */
 	if (!anon) {
+		struct mmu_gather tlb;
+
 		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 		if (!hugetlb_vma_trylock_write(vma))
 			goto walk_abort;
-		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+
+		tlb_gather_mmu_vma(&tlb, vma);
+		if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
 			hugetlb_vma_unlock_write(vma);
-			flush_tlb_range(vma,
-					range.start, range.end);
+			huge_pmd_unshare_flush(&tlb, vma);
+			tlb_finish_mmu(&tlb);
 			/*
 			 * The PMD table was unmapped,
 			 * consequently unmapping the folio.
···
 			goto walk_done;
 		}
 		hugetlb_vma_unlock_write(vma);
+		tlb_finish_mmu(&tlb);
 	}
 	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 	if (pte_dirty(pteval))
···
 	 * fail if unsuccessful.
 	 */
 	if (!anon) {
+		struct mmu_gather tlb;
+
 		VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 		if (!hugetlb_vma_trylock_write(vma)) {
 			page_vma_mapped_walk_done(&pvmw);
 			ret = false;
 			break;
 		}
-		if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-			hugetlb_vma_unlock_write(vma);
-			flush_tlb_range(vma,
-					range.start, range.end);
 
+		tlb_gather_mmu_vma(&tlb, vma);
+		if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
+			hugetlb_vma_unlock_write(vma);
+			huge_pmd_unshare_flush(&tlb, vma);
+			tlb_finish_mmu(&tlb);
 			/*
 			 * The PMD table was unmapped,
 			 * consequently unmapping the folio.
···
 			break;
 		}
 		hugetlb_vma_unlock_write(vma);
+		tlb_finish_mmu(&tlb);
 	}
 	/* Nuke the hugetlb page table entry */
 	pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);