
arm64, tlbflush: don't TLBI broadcast if page reused in write fault

A multi-threaded customer workload with a large memory footprint uses
fork()/exec() to run external programs every few tens of seconds. When
running the workload on an arm64 server, a significant share of CPU
cycles is spent in the TLB flushing functions; when running it on an
x86_64 server, it is not. This makes performance on arm64 much worse
than on x86_64.

While the workload runs, fork()/exec() write-protects all pages in the
parent process, so a subsequent memory write in the parent triggers a
write protection fault. The page fault handler then makes the PTE/PMD
writable if the page can be reused, which is almost always the case in
this workload. On arm64, to avoid write protection faults on other
CPUs, the page fault handler flushes the TLB globally with a TLBI
broadcast after changing the PTE/PMD. However, this isn't always
necessary. Firstly, it is safe to leave stale read-only TLB entries
behind as long as they are eventually flushed. Secondly, if the memory
footprint is large, it is quite likely that the original read-only
PTE/PMD isn't cached in any remote TLB at all. In fact, on x86_64 the
page fault handler doesn't flush remote TLBs in this situation, which
benefits performance considerably.

To improve performance on arm64, make the write protection fault
handler flush the TLB locally, instead of globally via TLBI broadcast,
after making the PTE/PMD writable. If stale read-only TLB entries
remain on remote CPUs, the page fault handler on those CPUs will treat
the resulting fault as spurious and flush the stale entries locally.

To test the patchset, usemem.c from vm-scalability
(https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
was extended to call fork()/exec() periodically. To mimic the customer
workload, usemem was run with 4 threads accessing 100GB of memory and
calling fork()/exec() every 40 seconds. With the patchset, the usemem
score improves by ~40.6%, and the share of cycles spent in the TLB
flush functions drops from ~50.5% to ~0.3% in the perf profile.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

cb1fa2e9 79301c7d

---
 arch/arm64/include/asm/pgtable.h  | 14 +++++++++-----
 arch/arm64/include/asm/tlbflush.h | 56 ++++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/contpte.c           |  3 +--
 arch/arm64/mm/fault.c             |  8 ++++++--
 4 files changed, 72 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -130,12 +130,16 @@
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * Outside of a few very special situations (e.g. hibernation), we always
- * use broadcast TLB invalidation instructions, therefore a spurious page
- * fault on one CPU which has been handled concurrently by another CPU
- * does not need to perform additional invalidation.
+ * We use local TLB invalidation instruction when reusing page in
+ * write protection fault handler to avoid TLBI broadcast in the hot
+ * path. This will cause spurious page faults if stale read-only TLB
+ * entries exist.
 */
-#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
+#define flush_tlb_fix_spurious_fault(vma, address, ptep)	\
+	local_flush_tlb_page_nonotify(vma, address)
+
+#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp)	\
+	local_flush_tlb_page_nonotify(vma, address)
 
 /*
  * ZERO_PAGE is a global shared page that is always zero: used
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -249,6 +249,19 @@
 *		cannot be easily determined, the value TLBI_TTL_UNKNOWN will
 *		perform a non-hinted invalidation.
 *
+ *	local_flush_tlb_page(vma, addr)
+ *		Local variant of flush_tlb_page(). Stale TLB entries may
+ *		remain in remote CPUs.
+ *
+ *	local_flush_tlb_page_nonotify(vma, addr)
+ *		Same as local_flush_tlb_page() except MMU notifier will not be
+ *		called.
+ *
+ *	local_flush_tlb_contpte(vma, addr)
+ *		Invalidate the virtual-address range
+ *		'[addr, addr+CONT_PTE_SIZE)' mapped with contpte on local CPU
+ *		for the user address space corresponding to 'vma->mm'. Stale
+ *		TLB entries may remain in remote CPUs.
 *
 *	Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
 *	on top of these routines, since that is our interface to the mmu_gather
@@ -280,6 +293,33 @@
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
+}
+
+static inline void __local_flush_tlb_page_nonotify_nosync(struct mm_struct *mm,
+							  unsigned long uaddr)
+{
+	unsigned long addr;
+
+	dsb(nshst);
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
+	__tlbi(vale1, addr);
+	__tlbi_user(vale1, addr);
+}
+
+static inline void local_flush_tlb_page_nonotify(struct vm_area_struct *vma,
+						 unsigned long uaddr)
+{
+	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
+	dsb(nsh);
+}
+
+static inline void local_flush_tlb_page(struct vm_area_struct *vma,
+					unsigned long uaddr)
+{
+	__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
+						    (uaddr & PAGE_MASK) + PAGE_SIZE);
+	dsb(nsh);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -470,6 +510,22 @@
 	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
 				 last_level, tlb_level);
 	dsb(ish);
+}
+
+static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	unsigned long asid;
+
+	addr = round_down(addr, CONT_PTE_SIZE);
+
+	dsb(nshst);
+	asid = ASID(vma->vm_mm);
+	__flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid,
+			     3, true, lpa2_is_enabled());
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr,
+						    addr + CONT_PTE_SIZE);
+	dsb(nsh);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -622,8 +622,7 @@
 		__ptep_set_access_flags(vma, addr, ptep, entry, 0);
 
 		if (dirty)
-			__flush_tlb_range(vma, start_addr, addr,
-					  PAGE_SIZE, true, 3);
+			local_flush_tlb_contpte(vma, start_addr);
 	} else {
 		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
 		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -233,9 +233,13 @@
 		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);
 
-	/* Invalidate a stale read-only entry */
+	/*
+	 * Invalidate the local stale read-only entry. Remote stale entries
+	 * may still cause page faults and be invalidated via
+	 * flush_tlb_fix_spurious_fault().
+	 */
 	if (dirty)
-		flush_tlb_page(vma, address);
+		local_flush_tlb_page(vma, address);
 	return 1;
 }
 