
arm64: support batched/deferred tlb shootdown during page reclamation/migration

On x86, batched and deferred tlb shootdown has led to a 90% performance
increase in tlb shootdown. On arm64, HW can do tlb shootdown without
software IPI, but the synchronous tlbi is still quite expensive.

Even running the simplest program which requires swapout can
demonstrate this:
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

int main()
{
#define SIZE (1 * 1024 * 1024)
	volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	memset(p, 0x88, SIZE);

	for (int k = 0; k < 10000; k++) {
		/* swap in */
		for (int i = 0; i < SIZE; i += 4096) {
			(void)p[i];
		}

		/* swap out */
		madvise(p, SIZE, MADV_PAGEOUT);
	}
}

Perf result on a Snapdragon 888 with 8 cores, using zRAM
as the swap block device:

~ # perf record taskset -c 4 ./a.out
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
~ # perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 60K of event 'cycles'
# Event count (approx.): 35706225414
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ......
#
21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
3.49% a.out [kernel.kallsyms] [k] memset64
1.63% a.out [kernel.kallsyms] [k] clear_page
1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
1.23% a.out [kernel.kallsyms] [k] xas_load
1.15% a.out [kernel.kallsyms] [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark, which
swaps in/out a page mapped by only one process. If the page is mapped by
multiple processes, typically more than 100 on a phone, the overhead would
be much higher, as the tlb flush has to be run once per mapping of a single
page. In addition, tlb flush overhead increases with the number of CPU
cores due to the poor scalability of tlb shootdown in HW, so ARM64 servers
should expect much higher overhead.
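
For context, the per-mapping cost comes from the generic ptep_clear_flush()
path; a simplified sketch (close to mm/pgtable-generic.c, with details
trimmed) is below. On arm64, the flush_tlb_page() it calls expands to a
broadcast TLBI followed by a DSB ISH, so a page shared by N processes pays
for N synchronous waits.

/*
 * Simplified sketch of the generic ptep_clear_flush(): each cleared
 * mapping triggers flush_tlb_page(), which on arm64 is a broadcast TLBI
 * plus a synchronous DSB ISH wait.
 */
pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
		       pte_t *ptep)
{
	struct mm_struct *mm = (vma)->vm_mm;
	pte_t pte;

	pte = ptep_get_and_clear(mm, address, ptep);
	if (pte_accessible(mm, pte))
		flush_tlb_page(vma, address);	/* TLBI ...IS + DSB ISH, per mapping */
	return pte;
}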

Further perf annotate shows that 95% of the cpu time of ptep_clear_flush()
is actually spent in the final dsb(), waiting for the completion of the tlb
flush. This gives us a very good opportunity to leverage the existing
batched tlb flush support in the kernel. The minimal modification is to
issue only the asynchronous tlbi in the first stage, and to issue the dsb
in the second stage, when we actually have to synchronize.
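
As a rough sketch of how the generic batched-unmap code in mm/rmap.c drives
the new arch hooks (simplified; bookkeeping such as dirty/writable tracking
and the pending-flush accounting is elided), the two stages look like this:

/*
 * Stage 1: while the rmap walk unmaps pages, only queue the invalidation;
 * on arm64 arch_tlbbatch_add_pending() issues the TLBI broadcast without
 * waiting on a DSB.
 */
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
				      unsigned long uaddr)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
	tlb_ubc->flush_required = true;
	/* dirty/writable bookkeeping elided */
}

/*
 * Stage 2: once the whole batch has been unmapped, synchronize once;
 * on arm64 arch_tlbbatch_flush() is a single dsb(ish).
 */
void try_to_unmap_flush(void)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	if (!tlb_ubc->flush_required)
		return;

	arch_tlbbatch_flush(&tlb_ubc->arch);
	tlb_ubc->flush_required = false;
	tlb_ubc->writable = false;
}

With this split, the rmap walk over all mappings issues only asynchronous
tlbi instructions, and a single dsb(ish) at the end waits for all of them
at once.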

With the above micro-benchmark, the elapsed time to finish the
program decreases by around 5%.

Typical elapsed time w/o patch:
~ # time taskset -c 4 ./a.out
0.21user 14.34system 0:14.69elapsed
w/ patch:
~ # time taskset -c 4 ./a.out
0.22user 13.45system 0:13.80elapsed

Also tested with the benchmark from this commit on a Kunpeng920 arm64
server; an improvement of around 12.5% was observed with the command
`time ./swap_bench`.

        w/o          w/
real    0m13.460s    0m11.771s
user    0m0.248s     0m0.279s
sys     0m12.039s    0m11.458s

Originally, a 16.99% overhead of ptep_clear_flush() was noticed there,
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush

It has been tested on 4-, 8- and 128-CPU platforms and shows benefit on
large systems, but may show no improvement on small systems such as a
4-CPU platform.

This patch also improves the performance of page migration. Using pmbench
and migrating the pages of pmbench between node 0 and node 1 for 100 times
for 1G of memory, this patch decreases the time used by around 20% (from
18.338318910 sec to 13.981866350 sec) and saves the time spent in
ptep_clear_flush().
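
The pmbench command lines are not part of this commit message; as a purely
hypothetical stand-in, a loop of the following shape (using the
migrate_pages(2) syscall from libnuma's <numaif.h>, and moving the caller's
own pages rather than pmbench's) exercises the same
ptep_clear_flush()-heavy migration path:

/*
 * Hypothetical stand-in for the migration test: touch 1G of anonymous
 * memory, then bounce it between NUMA node 0 and node 1 a hundred times.
 * Build with: gcc -O2 migrate_bench.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
#define SIZE (1UL << 30)			/* 1G, as in the test above */
	unsigned long node0 = 1UL << 0;		/* nodemask for node 0 */
	unsigned long node1 = 1UL << 1;		/* nodemask for node 1 */
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0x88, SIZE);			/* fault all pages in */

	for (int k = 0; k < 100; k++) {
		/* pid 0 means "the calling process" */
		if (migrate_pages(0, 8 * sizeof(node0), &node0, &node1) < 0 ||
		    migrate_pages(0, 8 * sizeof(node0), &node1, &node0) < 0) {
			perror("migrate_pages");
			return 1;
		}
	}
	return 0;
}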

Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Barry Song and committed by Andrew Morton (43b3dfdd db6c1f6f)

4 files changed: +55 -4

Documentation/features/vm/TLB/arch-support.txt (+1 -1)

--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
arch/arm64/Kconfig (+1)

--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -96,6 +96,7 @@
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
 	select ARCH_SUPPORTS_PER_VMA_LOCK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
arch/arm64/include/asm/tlbbatch.h (new file, +12)

--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
arch/arm64/include/asm/tlbflush.h (+41 -3)

--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,21 +254,59 @@
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
-					 unsigned long uaddr)
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
+					   unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
+}
+
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
 }
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
 	flush_tlb_page_nosync(vma, uaddr);
+	dsb(ish);
+}
+
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	/*
+	 * TLB flush deferral is not required on systems which are affected by
+	 * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
+	 * will have two consecutive TLBI instructions with a dsb(ish) in between
+	 * defeating the purpose (i.e save overall 'dsb ish' cost).
+	 */
+	if (unlikely(cpus_have_const_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+	return true;
+}
+
+static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	dsb(ish);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
 	dsb(ish);
 }
 