Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge yet more updates from Andrew Morton:
"This is the material which was staged after willystuff in linux-next.

Subsystems affected by this patch series: mm (debug, selftests,
pagecache, thp, rmap, migration, kasan, hugetlb, pagemap, madvise),
and selftests"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (113 commits)
selftests: kselftest framework: provide "finished" helper
mm: madvise: MADV_DONTNEED_LOCKED
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
mm: generalize ARCH_HAS_FILTER_PGPROT
mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
mm: warn on deleting redirtied only if accounted
mm/huge_memory: remove stale locking logic from __split_huge_pmd()
mm/huge_memory: remove stale page_trans_huge_mapcount()
mm/swapfile: remove stale reuse_swap_page()
mm/khugepaged: remove reuse_swap_page() usage
mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
mm: streamline COW logic in do_swap_page()
mm: slightly clarify KSM logic in do_swap_page()
mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
mm: optimize do_wp_page() for exclusive pages in the swapcache
mm/huge_memory: make is_transparent_hugepage() static
userfaultfd/selftests: enable hugetlb remap and remove event testing
selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
mm: enable MADV_DONTNEED for hugetlb mappings
kasan: disable LOCKDEP when printing reports
...

+2473 -971
+11 -6
Documentation/dev-tools/kasan.rst
··· 30 30 31 31 The hardware KASAN mode (#3) relies on hardware to perform the checks but 32 32 still requires a compiler version that supports memory tagging instructions. 33 - This mode is supported in GCC 10+ and Clang 11+. 33 + This mode is supported in GCC 10+ and Clang 12+. 34 34 35 35 Both software KASAN modes work with SLUB and SLAB memory allocators, 36 36 while the hardware tag-based KASAN currently only supports SLUB. ··· 206 206 Asymmetric mode: a bad access is detected synchronously on reads and 207 207 asynchronously on writes. 208 208 209 + - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc 210 + allocations (default: ``on``). 211 + 209 212 - ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack 210 213 traces collection (default: ``on``). 211 214 ··· 282 279 pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently 283 280 reserved to tag freed memory regions. 284 281 285 - Software tag-based KASAN currently only supports tagging of slab and page_alloc 286 - memory. 282 + Software tag-based KASAN currently only supports tagging of slab, page_alloc, 283 + and vmalloc memory. 287 284 288 285 Hardware tag-based KASAN 289 286 ~~~~~~~~~~~~~~~~~~~~~~~~ ··· 306 303 pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently 307 304 reserved to tag freed memory regions. 308 305 309 - Hardware tag-based KASAN currently only supports tagging of slab and page_alloc 310 - memory. 306 + Hardware tag-based KASAN currently only supports tagging of slab, page_alloc, 307 + and VM_ALLOC-based vmalloc memory. 311 308 312 309 If the hardware does not support MTE (pre ARMv8.5), hardware tag-based KASAN 313 310 will not be enabled. In this case, all KASAN boot parameters are ignored. ··· 321 318 322 319 Shadow memory 323 320 ------------- 321 + 322 + The contents of this section are only applicable to software KASAN modes. 
324 323 325 324 The kernel maps memory in several different parts of the address space. 326 325 The range of kernel virtual addresses is large: there is not enough real ··· 354 349 355 350 With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the 356 351 cost of greater memory usage. Currently, this is supported on x86, 357 - riscv, s390, and powerpc. 352 + arm64, riscv, s390, and powerpc. 358 353 359 354 This works by hooking into vmalloc and vmap and dynamically 360 355 allocating real shadow memory to back the mappings.
+59 -6
Documentation/vm/page_owner.rst
··· 78 78 79 79 2) Enable page owner: add "page_owner=on" to boot cmdline. 80 80 81 - 3) Do the job what you want to debug 81 + 3) Do the job that you want to debug. 82 82 83 83 4) Analyze information from page owner:: 84 84 ··· 89 89 90 90 Page allocated via order XXX, ... 91 91 PFN XXX ... 92 - // Detailed stack 92 + // Detailed stack 93 93 94 94 Page allocated via order XXX, ... 95 95 PFN XXX ... 96 - // Detailed stack 96 + // Detailed stack 97 97 98 98 The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows 99 99 in buf, uses regexp to extract the page order value, counts the times 100 - and pages of buf, and finally sorts them according to the times. 100 + and pages of buf, and finally sorts them according to the parameter(s). 101 101 102 102 See the result about who allocated each page 103 103 in the ``sorted_page_owner.txt``. General output:: 104 104 105 105 XXX times, XXX pages: 106 106 Page allocated via order XXX, ... 107 - // Detailed stack 107 + // Detailed stack 108 108 109 109 By default, ``page_owner_sort`` is sorted according to the times of buf. 110 - If you want to sort by the pages nums of buf, use the ``-m`` parameter. 110 + If you want to sort by the page nums of buf, use the ``-m`` parameter. 111 + The detailed parameters are: 112 + 113 + fundamental function: 114 + 115 + Sort: 116 + -a Sort by memory allocation time. 117 + -m Sort by total memory. 118 + -p Sort by pid. 119 + -P Sort by tgid. 120 + -n Sort by task command name. 121 + -r Sort by memory release time. 122 + -s Sort by stack trace. 123 + -t Sort by times (default). 124 + 125 + additional function: 126 + 127 + Cull: 128 + -c Cull by comparing stacktrace instead of total block. 129 + --cull <rules> 130 + Specify culling rules. Culling syntax is key[,key[,...]]. Choose a 131 + multi-letter key from the **STANDARD FORMAT SPECIFIERS** section.
132 + 133 + 134 + <rules> is a single argument in the form of a comma-separated list, 135 + which offers a way to specify individual culling rules. The recognized 136 + keywords are described in the **STANDARD FORMAT SPECIFIERS** section below. 137 + <rules> can be specified by the sequence of keys k1,k2, ..., as described in 138 + the **STANDARD FORMAT SPECIFIERS** section below. Mixed use of abbreviated and 139 + complete-form keys is allowed. 140 + 141 + 142 + Examples: 143 + ./page_owner_sort <input> <output> --cull=stacktrace 144 + ./page_owner_sort <input> <output> --cull=st,pid,name 145 + ./page_owner_sort <input> <output> --cull=n,f 146 + 147 + Filter: 148 + -f Filter out the information of blocks whose memory has been released. 149 + 150 + Select: 151 + --pid <PID> Select by pid. 152 + --tgid <TGID> Select by tgid. 153 + --name <command> Select by task command name. 154 + 155 + STANDARD FORMAT SPECIFIERS 156 + ========================== 157 + 158 + KEY LONG DESCRIPTION 159 + p pid process ID 160 + tg tgid thread group ID 161 + n name task command name 162 + f free whether the page has been released or not 163 + st stacktrace stack trace of the page allocation
+2
arch/alpha/include/uapi/asm/mman.h
··· 74 74 #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 75 75 #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 76 76 77 + #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ 78 + 77 79 /* compatibility flags */ 78 80 #define MAP_FILE 0 79 81
+1 -1
arch/arm64/Kconfig
··· 208 208 select IOMMU_DMA if IOMMU_SUPPORT 209 209 select IRQ_DOMAIN 210 210 select IRQ_FORCED_THREADING 211 - select KASAN_VMALLOC if KASAN_GENERIC 211 + select KASAN_VMALLOC if KASAN 212 212 select MODULES_USE_ELF_RELA 213 213 select NEED_DMA_MAP_STATE 214 214 select NEED_SG_DMA_LENGTH
+6
arch/arm64/include/asm/vmalloc.h
··· 25 25 26 26 #endif 27 27 28 + #define arch_vmap_pgprot_tagged arch_vmap_pgprot_tagged 29 + static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot) 30 + { 31 + return pgprot_tagged(prot); 32 + } 33 + 28 34 #endif /* _ASM_ARM64_VMALLOC_H */
+4 -1
arch/arm64/include/asm/vmap_stack.h
··· 17 17 */ 18 18 static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node) 19 19 { 20 + void *p; 21 + 20 22 BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK)); 21 23 22 - return __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node, 24 + p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node, 23 25 __builtin_return_address(0)); 26 + return kasan_reset_tag(p); 24 27 } 25 28 26 29 #endif /* __ASM_VMAP_STACK_H */
+3 -2
arch/arm64/kernel/module.c
··· 58 58 PAGE_KERNEL, 0, NUMA_NO_NODE, 59 59 __builtin_return_address(0)); 60 60 61 - if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { 61 + if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) { 62 62 vfree(p); 63 63 return NULL; 64 64 } 65 65 66 - return p; 66 + /* Memory is intended to be executable, reset the pointer tag. */ 67 + return kasan_reset_tag(p); 67 68 } 68 69 69 70 enum aarch64_reloc_op {
+1 -1
arch/arm64/mm/pageattr.c
··· 85 85 */ 86 86 area = find_vm_area((void *)addr); 87 87 if (!area || 88 - end > (unsigned long)area->addr + area->size || 88 + end > (unsigned long)kasan_reset_tag(area->addr) + area->size || 89 89 !(area->flags & VM_ALLOC)) 90 90 return -EINVAL; 91 91
+2 -1
arch/arm64/net/bpf_jit_comp.c
··· 1304 1304 1305 1305 void *bpf_jit_alloc_exec(unsigned long size) 1306 1306 { 1307 - return vmalloc(size); 1307 + /* Memory is intended to be executable, reset the pointer tag. */ 1308 + return kasan_reset_tag(vmalloc(size)); 1308 1309 } 1309 1310 1310 1311 void bpf_jit_free_exec(void *addr)
+2
arch/mips/include/uapi/asm/mman.h
··· 101 101 #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 102 102 #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 103 103 104 + #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ 105 + 104 106 /* compatibility flags */ 105 107 #define MAP_FILE 0 106 108
+2
arch/parisc/include/uapi/asm/mman.h
··· 55 55 #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 56 56 #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 57 57 58 + #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ 59 + 58 60 #define MADV_MERGEABLE 65 /* KSM may merge identical pages */ 59 61 #define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */ 60 62
-1
arch/powerpc/mm/book3s64/trace.c
··· 3 3 * This file is for defining trace points and trace related helpers. 4 4 */ 5 5 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 6 - #define CREATE_TRACE_POINTS 7 6 #include <trace/events/thp.h> 8 7 #endif
+1 -1
arch/s390/kernel/module.c
··· 45 45 p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END, 46 46 gfp_mask, PAGE_KERNEL_EXEC, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, 47 47 __builtin_return_address(0)); 48 - if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { 48 + if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) { 49 49 vfree(p); 50 50 return NULL; 51 51 }
-3
arch/x86/Kconfig
··· 337 337 config ARCH_HAS_CPU_RELAX 338 338 def_bool y 339 339 340 - config ARCH_HAS_FILTER_PGPROT 341 - def_bool y 342 - 343 340 config ARCH_HIBERNATION_POSSIBLE 344 341 def_bool y 345 342
+1 -1
arch/x86/kernel/module.c
··· 78 78 MODULES_END, gfp_mask, 79 79 PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE, 80 80 __builtin_return_address(0)); 81 - if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) { 81 + if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) { 82 82 vfree(p); 83 83 return NULL; 84 84 }
-1
arch/x86/mm/init.c
··· 31 31 * We need to define the tracepoints somewhere, and tlb.c 32 32 * is only compiled when SMP=y. 33 33 */ 34 - #define CREATE_TRACE_POINTS 35 34 #include <trace/events/tlb.h> 36 35 37 36 #include "mm_internal.h"
+2
arch/xtensa/include/uapi/asm/mman.h
··· 109 109 #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 110 110 #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 111 111 112 + #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ 113 + 112 114 /* compatibility flags */ 113 115 #define MAP_FILE 0 114 116
+26 -9
include/linux/gfp.h
··· 54 54 #define ___GFP_THISNODE 0x200000u 55 55 #define ___GFP_ACCOUNT 0x400000u 56 56 #define ___GFP_ZEROTAGS 0x800000u 57 - #define ___GFP_SKIP_KASAN_POISON 0x1000000u 57 + #ifdef CONFIG_KASAN_HW_TAGS 58 + #define ___GFP_SKIP_ZERO 0x1000000u 59 + #define ___GFP_SKIP_KASAN_UNPOISON 0x2000000u 60 + #define ___GFP_SKIP_KASAN_POISON 0x4000000u 61 + #else 62 + #define ___GFP_SKIP_ZERO 0 63 + #define ___GFP_SKIP_KASAN_UNPOISON 0 64 + #define ___GFP_SKIP_KASAN_POISON 0 65 + #endif 58 66 #ifdef CONFIG_LOCKDEP 59 - #define ___GFP_NOLOCKDEP 0x2000000u 67 + #define ___GFP_NOLOCKDEP 0x8000000u 60 68 #else 61 69 #define ___GFP_NOLOCKDEP 0 62 70 #endif ··· 240 232 * 241 233 * %__GFP_ZERO returns a zeroed page on success. 242 234 * 243 - * %__GFP_ZEROTAGS returns a page with zeroed memory tags on success, if 244 - * __GFP_ZERO is set. 235 + * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself 236 + * is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that 237 + * __GFP_SKIP_ZERO is not set). This flag is intended for optimization: setting 238 + * memory tags at the same time as zeroing memory has minimal additional 239 + * performance impact. 245 240 * 246 - * %__GFP_SKIP_KASAN_POISON returns a page which does not need to be poisoned 247 - * on deallocation. Typically used for userspace pages. Currently only has an 248 - * effect in HW tags mode. 241 + * %__GFP_SKIP_KASAN_UNPOISON makes KASAN skip unpoisoning on page allocation. 242 + * Only effective in HW_TAGS mode. 243 + * 244 + * %__GFP_SKIP_KASAN_POISON makes KASAN skip poisoning on page deallocation. 245 + * Typically used for userspace pages. Only effective in HW_TAGS mode. 
249 246 */ 250 247 #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) 251 248 #define __GFP_COMP ((__force gfp_t)___GFP_COMP) 252 249 #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) 253 250 #define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS) 254 - #define __GFP_SKIP_KASAN_POISON ((__force gfp_t)___GFP_SKIP_KASAN_POISON) 251 + #define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO) 252 + #define __GFP_SKIP_KASAN_UNPOISON ((__force gfp_t)___GFP_SKIP_KASAN_UNPOISON) 253 + #define __GFP_SKIP_KASAN_POISON ((__force gfp_t)___GFP_SKIP_KASAN_POISON) 255 254 256 255 /* Disable lockdep for GFP context tracking */ 257 256 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) 258 257 259 258 /* Room for N __GFP_FOO bits */ 260 - #define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP)) 259 + #define __GFP_BITS_SHIFT (24 + \ 260 + 3 * IS_ENABLED(CONFIG_KASAN_HW_TAGS) + \ 261 + IS_ENABLED(CONFIG_LOCKDEP)) 261 262 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) 262 263 263 264 /**
-6
include/linux/huge_mm.h
··· 183 183 184 184 void prep_transhuge_page(struct page *page); 185 185 void free_transhuge_page(struct page *page); 186 - bool is_transparent_hugepage(struct page *page); 187 186 188 187 bool can_split_folio(struct folio *folio, int *pextra_pins); 189 188 int split_huge_page_to_list(struct page *page, struct list_head *list); ··· 339 340 } 340 341 341 342 static inline void prep_transhuge_page(struct page *page) {} 342 - 343 - static inline bool is_transparent_hugepage(struct page *page) 344 - { 345 - return false; 346 - } 347 343 348 344 #define transparent_hugepage_flags 0UL 349 345
+63 -45
include/linux/kasan.h
··· 19 19 #include <linux/linkage.h> 20 20 #include <asm/kasan.h> 21 21 22 - /* kasan_data struct is used in KUnit tests for KASAN expected failures */ 23 - struct kunit_kasan_expectation { 24 - bool report_found; 25 - }; 26 - 27 22 #endif 23 + 24 + typedef unsigned int __bitwise kasan_vmalloc_flags_t; 25 + 26 + #define KASAN_VMALLOC_NONE 0x00u 27 + #define KASAN_VMALLOC_INIT 0x01u 28 + #define KASAN_VMALLOC_VM_ALLOC 0x02u 29 + #define KASAN_VMALLOC_PROT_NORMAL 0x04u 28 30 29 31 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) 30 32 ··· 86 84 87 85 #ifdef CONFIG_KASAN_HW_TAGS 88 86 89 - void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags); 90 - void kasan_free_pages(struct page *page, unsigned int order); 91 - 92 87 #else /* CONFIG_KASAN_HW_TAGS */ 93 - 94 - static __always_inline void kasan_alloc_pages(struct page *page, 95 - unsigned int order, gfp_t flags) 96 - { 97 - /* Only available for integrated init. */ 98 - BUILD_BUG(); 99 - } 100 - 101 - static __always_inline void kasan_free_pages(struct page *page, 102 - unsigned int order) 103 - { 104 - /* Only available for integrated init. 
*/ 105 - BUILD_BUG(); 106 - } 107 88 108 89 #endif /* CONFIG_KASAN_HW_TAGS */ 109 90 ··· 267 282 return true; 268 283 } 269 284 270 - 271 - bool kasan_save_enable_multi_shot(void); 272 - void kasan_restore_multi_shot(bool enabled); 273 - 274 285 #else /* CONFIG_KASAN */ 275 286 276 287 static inline slab_flags_t kasan_never_merge(void) ··· 395 414 396 415 #ifdef CONFIG_KASAN_VMALLOC 397 416 417 + #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) 418 + 419 + void kasan_populate_early_vm_area_shadow(void *start, unsigned long size); 398 420 int kasan_populate_vmalloc(unsigned long addr, unsigned long size); 399 - void kasan_poison_vmalloc(const void *start, unsigned long size); 400 - void kasan_unpoison_vmalloc(const void *start, unsigned long size); 401 421 void kasan_release_vmalloc(unsigned long start, unsigned long end, 402 422 unsigned long free_region_start, 403 423 unsigned long free_region_end); 404 424 405 - void kasan_populate_early_vm_area_shadow(void *start, unsigned long size); 425 + #else /* CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS */ 406 426 407 - #else /* CONFIG_KASAN_VMALLOC */ 408 - 427 + static inline void kasan_populate_early_vm_area_shadow(void *start, 428 + unsigned long size) 429 + { } 409 430 static inline int kasan_populate_vmalloc(unsigned long start, 410 431 unsigned long size) 411 432 { 412 433 return 0; 413 434 } 414 - 415 - static inline void kasan_poison_vmalloc(const void *start, unsigned long size) 416 - { } 417 - static inline void kasan_unpoison_vmalloc(const void *start, unsigned long size) 418 - { } 419 435 static inline void kasan_release_vmalloc(unsigned long start, 420 436 unsigned long end, 421 437 unsigned long free_region_start, 422 - unsigned long free_region_end) {} 438 + unsigned long free_region_end) { } 439 + 440 + #endif /* CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS */ 441 + 442 + void *__kasan_unpoison_vmalloc(const void *start, unsigned long size, 443 + kasan_vmalloc_flags_t flags); 444 + 
static __always_inline void *kasan_unpoison_vmalloc(const void *start, 445 + unsigned long size, 446 + kasan_vmalloc_flags_t flags) 447 + { 448 + if (kasan_enabled()) 449 + return __kasan_unpoison_vmalloc(start, size, flags); 450 + return (void *)start; 451 + } 452 + 453 + void __kasan_poison_vmalloc(const void *start, unsigned long size); 454 + static __always_inline void kasan_poison_vmalloc(const void *start, 455 + unsigned long size) 456 + { 457 + if (kasan_enabled()) 458 + __kasan_poison_vmalloc(start, size); 459 + } 460 + 461 + #else /* CONFIG_KASAN_VMALLOC */ 423 462 424 463 static inline void kasan_populate_early_vm_area_shadow(void *start, 425 - unsigned long size) 464 + unsigned long size) { } 465 + static inline int kasan_populate_vmalloc(unsigned long start, 466 + unsigned long size) 467 + { 468 + return 0; 469 + } 470 + static inline void kasan_release_vmalloc(unsigned long start, 471 + unsigned long end, 472 + unsigned long free_region_start, 473 + unsigned long free_region_end) { } 474 + 475 + static inline void *kasan_unpoison_vmalloc(const void *start, 476 + unsigned long size, 477 + kasan_vmalloc_flags_t flags) 478 + { 479 + return (void *)start; 480 + } 481 + static inline void kasan_poison_vmalloc(const void *start, unsigned long size) 426 482 { } 427 483 428 484 #endif /* CONFIG_KASAN_VMALLOC */ ··· 468 450 !defined(CONFIG_KASAN_VMALLOC) 469 451 470 452 /* 471 - * These functions provide a special case to support backing module 472 - * allocations with real shadow memory. With KASAN vmalloc, the special 473 - * case is unnecessary, as the work is handled in the generic case. 453 + * These functions allocate and free shadow memory for kernel modules. 454 + * They are only required when KASAN_VMALLOC is not supported, as otherwise 455 + * shadow memory is allocated by the generic vmalloc handlers. 
474 456 */ 475 - int kasan_module_alloc(void *addr, size_t size, gfp_t gfp_mask); 476 - void kasan_free_shadow(const struct vm_struct *vm); 457 + int kasan_alloc_module_shadow(void *addr, size_t size, gfp_t gfp_mask); 458 + void kasan_free_module_shadow(const struct vm_struct *vm); 477 459 478 460 #else /* (CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS) && !CONFIG_KASAN_VMALLOC */ 479 461 480 - static inline int kasan_module_alloc(void *addr, size_t size, gfp_t gfp_mask) { return 0; } 481 - static inline void kasan_free_shadow(const struct vm_struct *vm) {} 462 + static inline int kasan_alloc_module_shadow(void *addr, size_t size, gfp_t gfp_mask) { return 0; } 463 + static inline void kasan_free_module_shadow(const struct vm_struct *vm) {} 482 464 483 465 #endif /* (CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS) && !CONFIG_KASAN_VMALLOC */ 484 466
-5
include/linux/mm.h
··· 834 834 return folio_mapcount(page_folio(page)); 835 835 } 836 836 837 - int page_trans_huge_mapcount(struct page *page); 838 837 #else 839 838 static inline int total_mapcount(struct page *page) 840 - { 841 - return page_mapcount(page); 842 - } 843 - static inline int page_trans_huge_mapcount(struct page *page) 844 839 { 845 840 return page_mapcount(page); 846 841 }
+1 -1
include/linux/page-flags.h
··· 481 481 TESTSETFLAG_FALSE(uname, lname) TESTCLEARFLAG_FALSE(uname, lname) 482 482 483 483 __PAGEFLAG(Locked, locked, PF_NO_TAIL) 484 - PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) 484 + PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) 485 485 PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL) 486 486 PAGEFLAG(Referenced, referenced, PF_HEAD) 487 487 TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
+1 -2
include/linux/pagemap.h
··· 1009 1009 { 1010 1010 __folio_mark_dirty(page_folio(page), mapping, warn); 1011 1011 } 1012 - void folio_account_cleaned(struct folio *folio, struct address_space *mapping, 1013 - struct bdi_writeback *wb); 1012 + void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb); 1014 1013 void __folio_cancel_dirty(struct folio *folio); 1015 1014 static inline void folio_cancel_dirty(struct folio *folio) 1016 1015 {
-4
include/linux/swap.h
··· 515 515 extern int swp_swapcount(swp_entry_t entry); 516 516 extern struct swap_info_struct *page_swap_info(struct page *); 517 517 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry); 518 - extern bool reuse_swap_page(struct page *); 519 518 extern int try_to_free_swap(struct page *); 520 519 struct backing_dev_info; 521 520 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages); ··· 679 680 { 680 681 return 0; 681 682 } 682 - 683 - #define reuse_swap_page(page) \ 684 - (page_trans_huge_mapcount(page) == 1) 685 683 686 684 static inline int try_to_free_swap(struct page *page) 687 685 {
+7 -11
include/linux/vmalloc.h
··· 35 35 #define VM_DEFER_KMEMLEAK 0 36 36 #endif 37 37 38 - /* 39 - * VM_KASAN is used slightly differently depending on CONFIG_KASAN_VMALLOC. 40 - * 41 - * If IS_ENABLED(CONFIG_KASAN_VMALLOC), VM_KASAN is set on a vm_struct after 42 - * shadow memory has been mapped. It's used to handle allocation errors so that 43 - * we don't try to poison shadow on free if it was never allocated. 44 - * 45 - * Otherwise, VM_KASAN is set for kasan_module_alloc() allocations and used to 46 - * determine which allocations need the module shadow freed. 47 - */ 48 - 49 38 /* bits [20..32] reserved for arch specific ioremap internals */ 50 39 51 40 /* ··· 112 123 static inline int arch_vmap_pte_supported_shift(unsigned long size) 113 124 { 114 125 return PAGE_SHIFT; 126 + } 127 + #endif 128 + 129 + #ifndef arch_vmap_pgprot_tagged 130 + static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot) 131 + { 132 + return prot; 115 133 } 116 134 #endif 117 135
-1
include/trace/events/huge_memory.h
··· 29 29 EM( SCAN_VMA_NULL, "vma_null") \ 30 30 EM( SCAN_VMA_CHECK, "vma_check_failed") \ 31 31 EM( SCAN_ADDRESS_RANGE, "not_suitable_address_range") \ 32 - EM( SCAN_SWAP_CACHE_PAGE, "page_swap_cache") \ 33 32 EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\ 34 33 EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \ 35 34 EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
+31
include/trace/events/migrate.h
··· 105 105 __print_symbolic(__entry->reason, MIGRATE_REASON)) 106 106 ); 107 107 108 + DECLARE_EVENT_CLASS(migration_pte, 109 + 110 + TP_PROTO(unsigned long addr, unsigned long pte, int order), 111 + 112 + TP_ARGS(addr, pte, order), 113 + 114 + TP_STRUCT__entry( 115 + __field(unsigned long, addr) 116 + __field(unsigned long, pte) 117 + __field(int, order) 118 + ), 119 + 120 + TP_fast_assign( 121 + __entry->addr = addr; 122 + __entry->pte = pte; 123 + __entry->order = order; 124 + ), 125 + 126 + TP_printk("addr=%lx, pte=%lx order=%d", __entry->addr, __entry->pte, __entry->order) 127 + ); 128 + 129 + DEFINE_EVENT(migration_pte, set_migration_pte, 130 + TP_PROTO(unsigned long addr, unsigned long pte, int order), 131 + TP_ARGS(addr, pte, order) 132 + ); 133 + 134 + DEFINE_EVENT(migration_pte, remove_migration_pte, 135 + TP_PROTO(unsigned long addr, unsigned long pte, int order), 136 + TP_ARGS(addr, pte, order) 137 + ); 138 + 108 139 #endif /* _TRACE_MIGRATE_H */ 109 140 110 141 /* This part must be outside protection */
+11 -3
include/trace/events/mmflags.h
··· 49 49 {(unsigned long)__GFP_RECLAIM, "__GFP_RECLAIM"}, \ 50 50 {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ 51 51 {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\ 52 - {(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"}, \ 53 - {(unsigned long)__GFP_SKIP_KASAN_POISON,"__GFP_SKIP_KASAN_POISON"}\ 52 + {(unsigned long)__GFP_ZEROTAGS, "__GFP_ZEROTAGS"} \ 53 + 54 + #ifdef CONFIG_KASAN_HW_TAGS 55 + #define __def_gfpflag_names_kasan , \ 56 + {(unsigned long)__GFP_SKIP_ZERO, "__GFP_SKIP_ZERO"}, \ 57 + {(unsigned long)__GFP_SKIP_KASAN_POISON, "__GFP_SKIP_KASAN_POISON"}, \ 58 + {(unsigned long)__GFP_SKIP_KASAN_UNPOISON, "__GFP_SKIP_KASAN_UNPOISON"} 59 + #else 60 + #define __def_gfpflag_names_kasan 61 + #endif 54 62 55 63 #define show_gfp_flags(flags) \ 56 64 (flags) ? __print_flags(flags, "|", \ 57 - __def_gfpflag_names \ 65 + __def_gfpflag_names __def_gfpflag_names_kasan \ 58 66 ) : "none" 59 67 60 68 #ifdef CONFIG_MMU
+27
include/trace/events/thp.h
··· 48 48 TP_printk("hugepage update at addr 0x%lx and pte = 0x%lx clr = 0x%lx, set = 0x%lx", __entry->addr, __entry->pte, __entry->clr, __entry->set) 49 49 ); 50 50 51 + DECLARE_EVENT_CLASS(migration_pmd, 52 + 53 + TP_PROTO(unsigned long addr, unsigned long pmd), 54 + 55 + TP_ARGS(addr, pmd), 56 + 57 + TP_STRUCT__entry( 58 + __field(unsigned long, addr) 59 + __field(unsigned long, pmd) 60 + ), 61 + 62 + TP_fast_assign( 63 + __entry->addr = addr; 64 + __entry->pmd = pmd; 65 + ), 66 + TP_printk("addr=%lx, pmd=%lx", __entry->addr, __entry->pmd) 67 + ); 68 + 69 + DEFINE_EVENT(migration_pmd, set_migration_pmd, 70 + TP_PROTO(unsigned long addr, unsigned long pmd), 71 + TP_ARGS(addr, pmd) 72 + ); 73 + 74 + DEFINE_EVENT(migration_pmd, remove_migration_pmd, 75 + TP_PROTO(unsigned long addr, unsigned long pmd), 76 + TP_ARGS(addr, pmd) 77 + ); 51 78 #endif /* _TRACE_THP_H */ 52 79 53 80 /* This part must be outside protection */
+2
include/uapi/asm-generic/mman-common.h
··· 75 75 #define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ 76 76 #define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ 77 77 78 + #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */ 79 + 78 80 /* compatibility flags */ 79 81 #define MAP_FILE 0 80 82
+6 -3
kernel/fork.c
··· 286 286 if (!s) 287 287 continue; 288 288 289 - /* Mark stack accessible for KASAN. */ 289 + /* Reset stack metadata. */ 290 290 kasan_unpoison_range(s->addr, THREAD_SIZE); 291 291 292 + stack = kasan_reset_tag(s->addr); 293 + 292 294 /* Clear stale pointers from reused stack. */ 293 - memset(s->addr, 0, THREAD_SIZE); 295 + memset(stack, 0, THREAD_SIZE); 294 296 295 297 if (memcg_charge_kernel_stack(s)) { 296 298 vfree(s->addr); ··· 300 298 } 301 299 302 300 tsk->stack_vm_area = s; 303 - tsk->stack = s->addr; 301 + tsk->stack = stack; 304 302 return 0; 305 303 } 306 304 ··· 328 326 * so cache the vm_struct. 329 327 */ 330 328 tsk->stack_vm_area = vm; 329 + stack = kasan_reset_tag(stack); 331 330 tsk->stack = stack; 332 331 return 0; 333 332 }
+8 -4
kernel/scs.c
··· 32 32 for (i = 0; i < NR_CACHED_SCS; i++) { 33 33 s = this_cpu_xchg(scs_cache[i], NULL); 34 34 if (s) { 35 - kasan_unpoison_vmalloc(s, SCS_SIZE); 35 + s = kasan_unpoison_vmalloc(s, SCS_SIZE, 36 + KASAN_VMALLOC_PROT_NORMAL); 36 37 memset(s, 0, SCS_SIZE); 37 - return s; 38 + goto out; 38 39 } 39 40 } 40 41 41 - return __vmalloc_node_range(SCS_SIZE, 1, VMALLOC_START, VMALLOC_END, 42 + s = __vmalloc_node_range(SCS_SIZE, 1, VMALLOC_START, VMALLOC_END, 42 43 GFP_SCS, PAGE_KERNEL, 0, node, 43 44 __builtin_return_address(0)); 45 + 46 + out: 47 + return kasan_reset_tag(s); 44 48 } 45 49 46 50 void *scs_alloc(int node) ··· 82 78 if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL) 83 79 return; 84 80 85 - kasan_unpoison_vmalloc(s, SCS_SIZE); 81 + kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL); 86 82 vfree_atomic(s); 87 83 } 88 84
+9 -9
lib/Kconfig.kasan
··· 178 178 memory consumption. 179 179 180 180 config KASAN_VMALLOC 181 - bool "Back mappings in vmalloc space with real shadow memory" 182 - depends on KASAN_GENERIC && HAVE_ARCH_KASAN_VMALLOC 181 + bool "Check accesses to vmalloc allocations" 182 + depends on HAVE_ARCH_KASAN_VMALLOC 183 183 help 184 - By default, the shadow region for vmalloc space is the read-only 185 - zero page. This means that KASAN cannot detect errors involving 186 - vmalloc space. 184 + This mode makes KASAN check accesses to vmalloc allocations for 185 + validity. 187 186 188 - Enabling this option will hook in to vmap/vmalloc and back those 189 - mappings with real shadow memory allocated on demand. This allows 190 - for KASAN to detect more sorts of errors (and to support vmapped 191 - stacks), but at the cost of higher memory usage. 187 + With software KASAN modes, checking is done for all types of vmalloc 188 + allocations. Enabling this option leads to higher memory usage. 189 + 190 + With hardware tag-based KASAN, only VM_ALLOC mappings are checked. 191 + There is no additional memory usage. 192 192 193 193 config KASAN_KUNIT_TEST 194 194 tristate "KUnit-compatible tests of KASAN bug detection capabilities" if !KUNIT_ALL_TESTS
+214 -27
lib/test_kasan.c
··· 19 19 #include <linux/uaccess.h> 20 20 #include <linux/io.h> 21 21 #include <linux/vmalloc.h> 22 + #include <linux/set_memory.h> 22 23 23 24 #include <asm/page.h> 24 25 ··· 37 36 int kasan_int_result; 38 37 39 38 static struct kunit_resource resource; 40 - static struct kunit_kasan_expectation fail_data; 39 + static struct kunit_kasan_status test_status; 41 40 static bool multishot; 42 41 43 42 /* ··· 54 53 } 55 54 56 55 multishot = kasan_save_enable_multi_shot(); 57 - fail_data.report_found = false; 56 + test_status.report_found = false; 57 + test_status.sync_fault = false; 58 58 kunit_add_named_resource(test, NULL, NULL, &resource, 59 - "kasan_data", &fail_data); 59 + "kasan_status", &test_status); 60 60 return 0; 61 61 } 62 62 63 63 static void kasan_test_exit(struct kunit *test) 64 64 { 65 65 kasan_restore_multi_shot(multishot); 66 - KUNIT_EXPECT_FALSE(test, fail_data.report_found); 66 + KUNIT_EXPECT_FALSE(test, test_status.report_found); 67 67 } 68 68 69 69 /** 70 70 * KUNIT_EXPECT_KASAN_FAIL() - check that the executed expression produces a 71 71 * KASAN report; causes a test failure otherwise. This relies on a KUnit 72 - * resource named "kasan_data". Do not use this name for KUnit resources 72 + * resource named "kasan_status". Do not use this name for KUnit resources 73 73 * outside of KASAN tests. 74 74 * 75 - * For hardware tag-based KASAN in sync mode, when a tag fault happens, tag 75 + * For hardware tag-based KASAN, when a synchronous tag fault happens, tag 76 76 * checking is auto-disabled. When this happens, this test handler reenables 77 77 * tag checking. As tag checking can be only disabled or enabled per CPU, 78 78 * this handler disables migration (preemption). 79 79 * 80 - * Since the compiler doesn't see that the expression can change the fail_data 80 + * Since the compiler doesn't see that the expression can change the test_status 81 81 * fields, it can reorder or optimize away the accesses to those fields. 
82 82 * Use READ/WRITE_ONCE() for the accesses and compiler barriers around the 83 83 * expression to prevent that. 84 84 * 85 - * In between KUNIT_EXPECT_KASAN_FAIL checks, fail_data.report_found is kept as 86 - * false. This allows detecting KASAN reports that happen outside of the checks 87 - * by asserting !fail_data.report_found at the start of KUNIT_EXPECT_KASAN_FAIL 88 - * and in kasan_test_exit. 85 + * In between KUNIT_EXPECT_KASAN_FAIL checks, test_status.report_found is kept 86 + * as false. This allows detecting KASAN reports that happen outside of the 87 + * checks by asserting !test_status.report_found at the start of 88 + * KUNIT_EXPECT_KASAN_FAIL and in kasan_test_exit. 89 89 */ 90 90 #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do { \ 91 91 if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \ 92 92 kasan_sync_fault_possible()) \ 93 93 migrate_disable(); \ 94 - KUNIT_EXPECT_FALSE(test, READ_ONCE(fail_data.report_found)); \ 94 + KUNIT_EXPECT_FALSE(test, READ_ONCE(test_status.report_found)); \ 95 95 barrier(); \ 96 96 expression; \ 97 97 barrier(); \ 98 - if (!READ_ONCE(fail_data.report_found)) { \ 98 + if (kasan_async_fault_possible()) \ 99 + kasan_force_async_fault(); \ 100 + if (!READ_ONCE(test_status.report_found)) { \ 99 101 KUNIT_FAIL(test, KUNIT_SUBTEST_INDENT "KASAN failure " \ 100 102 "expected in \"" #expression \ 101 103 "\", but none occurred"); \ 102 104 } \ 103 - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS)) { \ 104 - if (READ_ONCE(fail_data.report_found)) \ 105 - kasan_enable_tagging_sync(); \ 105 + if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \ 106 + kasan_sync_fault_possible()) { \ 107 + if (READ_ONCE(test_status.report_found) && \ 108 + READ_ONCE(test_status.sync_fault)) \ 109 + kasan_enable_tagging(); \ 106 110 migrate_enable(); \ 107 111 } \ 108 - WRITE_ONCE(fail_data.report_found, false); \ 112 + WRITE_ONCE(test_status.report_found, false); \ 109 113 } while (0) 110 114 111 115 #define KASAN_TEST_NEEDS_CONFIG_ON(test, config) do { \ ··· 786 780 
static void kasan_stack_oob(struct kunit *test) 787 781 { 788 782 char stack_array[10]; 789 - /* See comment in kasan_global_oob. */ 783 + /* See comment in kasan_global_oob_right. */ 790 784 char *volatile array = stack_array; 791 785 char *p = &array[ARRAY_SIZE(stack_array) + OOB_TAG_OFF]; 792 786 ··· 799 793 { 800 794 volatile int i = 10; 801 795 char alloca_array[i]; 802 - /* See comment in kasan_global_oob. */ 796 + /* See comment in kasan_global_oob_right. */ 803 797 char *volatile array = alloca_array; 804 798 char *p = array - 1; 805 799 ··· 814 808 { 815 809 volatile int i = 10; 816 810 char alloca_array[i]; 817 - /* See comment in kasan_global_oob. */ 811 + /* See comment in kasan_global_oob_right. */ 818 812 char *volatile array = alloca_array; 819 813 char *p = array + i; 820 814 ··· 1063 1057 KUNIT_EXPECT_KASAN_FAIL(test, kfree_sensitive(ptr)); 1064 1058 } 1065 1059 1066 - static void vmalloc_oob(struct kunit *test) 1060 + static void vmalloc_helpers_tags(struct kunit *test) 1067 1061 { 1068 - void *area; 1062 + void *ptr; 1063 + 1064 + /* This test is intended for tag-based modes. */ 1065 + KASAN_TEST_NEEDS_CONFIG_OFF(test, CONFIG_KASAN_GENERIC); 1069 1066 1070 1067 KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_VMALLOC); 1071 1068 1069 + ptr = vmalloc(PAGE_SIZE); 1070 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 1071 + 1072 + /* Check that the returned pointer is tagged. */ 1073 + KUNIT_EXPECT_GE(test, (u8)get_tag(ptr), (u8)KASAN_TAG_MIN); 1074 + KUNIT_EXPECT_LT(test, (u8)get_tag(ptr), (u8)KASAN_TAG_KERNEL); 1075 + 1076 + /* Make sure exported vmalloc helpers handle tagged pointers. */ 1077 + KUNIT_ASSERT_TRUE(test, is_vmalloc_addr(ptr)); 1078 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, vmalloc_to_page(ptr)); 1079 + 1080 + #if !IS_MODULE(CONFIG_KASAN_KUNIT_TEST) 1081 + { 1082 + int rv; 1083 + 1084 + /* Make sure vmalloc'ed memory permissions can be changed. 
*/ 1085 + rv = set_memory_ro((unsigned long)ptr, 1); 1086 + KUNIT_ASSERT_GE(test, rv, 0); 1087 + rv = set_memory_rw((unsigned long)ptr, 1); 1088 + KUNIT_ASSERT_GE(test, rv, 0); 1089 + } 1090 + #endif 1091 + 1092 + vfree(ptr); 1093 + } 1094 + 1095 + static void vmalloc_oob(struct kunit *test) 1096 + { 1097 + char *v_ptr, *p_ptr; 1098 + struct page *page; 1099 + size_t size = PAGE_SIZE / 2 - KASAN_GRANULE_SIZE - 5; 1100 + 1101 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_VMALLOC); 1102 + 1103 + v_ptr = vmalloc(size); 1104 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, v_ptr); 1105 + 1106 + OPTIMIZER_HIDE_VAR(v_ptr); 1107 + 1072 1108 /* 1073 - * We have to be careful not to hit the guard page. 1109 + * We have to be careful not to hit the guard page in vmalloc tests. 1074 1110 * The MMU will catch that and crash us. 1075 1111 */ 1076 - area = vmalloc(3000); 1077 - KUNIT_ASSERT_NOT_ERR_OR_NULL(test, area); 1078 1112 1079 - KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)area)[3100]); 1080 - vfree(area); 1113 + /* Make sure in-bounds accesses are valid. */ 1114 + v_ptr[0] = 0; 1115 + v_ptr[size - 1] = 0; 1116 + 1117 + /* 1118 + * An unaligned access past the requested vmalloc size. 1119 + * Only generic KASAN can precisely detect these. 1120 + */ 1121 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 1122 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)v_ptr)[size]); 1123 + 1124 + /* An aligned access into the first out-of-bounds granule. */ 1125 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)v_ptr)[size + 5]); 1126 + 1127 + /* Check that in-bounds accesses to the physical page are valid. 
*/ 1128 + page = vmalloc_to_page(v_ptr); 1129 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, page); 1130 + p_ptr = page_address(page); 1131 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p_ptr); 1132 + p_ptr[0] = 0; 1133 + 1134 + vfree(v_ptr); 1135 + 1136 + /* 1137 + * We can't check for use-after-unmap bugs in this nor in the following 1138 + * vmalloc tests, as the page might be fully unmapped and accessing it 1139 + * will crash the kernel. 1140 + */ 1141 + } 1142 + 1143 + static void vmap_tags(struct kunit *test) 1144 + { 1145 + char *p_ptr, *v_ptr; 1146 + struct page *p_page, *v_page; 1147 + 1148 + /* 1149 + * This test is specifically crafted for the software tag-based mode, 1150 + * the only tag-based mode that poisons vmap mappings. 1151 + */ 1152 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_SW_TAGS); 1153 + 1154 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_VMALLOC); 1155 + 1156 + p_page = alloc_pages(GFP_KERNEL, 1); 1157 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p_page); 1158 + p_ptr = page_address(p_page); 1159 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p_ptr); 1160 + 1161 + v_ptr = vmap(&p_page, 1, VM_MAP, PAGE_KERNEL); 1162 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, v_ptr); 1163 + 1164 + /* 1165 + * We can't check for out-of-bounds bugs in this nor in the following 1166 + * vmalloc tests, as allocations have page granularity and accessing 1167 + * the guard page will crash the kernel. 1168 + */ 1169 + 1170 + KUNIT_EXPECT_GE(test, (u8)get_tag(v_ptr), (u8)KASAN_TAG_MIN); 1171 + KUNIT_EXPECT_LT(test, (u8)get_tag(v_ptr), (u8)KASAN_TAG_KERNEL); 1172 + 1173 + /* Make sure that in-bounds accesses through both pointers work. */ 1174 + *p_ptr = 0; 1175 + *v_ptr = 0; 1176 + 1177 + /* Make sure vmalloc_to_page() correctly recovers the page pointer. 
*/ 1178 + v_page = vmalloc_to_page(v_ptr); 1179 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, v_page); 1180 + KUNIT_EXPECT_PTR_EQ(test, p_page, v_page); 1181 + 1182 + vunmap(v_ptr); 1183 + free_pages((unsigned long)p_ptr, 1); 1184 + } 1185 + 1186 + static void vm_map_ram_tags(struct kunit *test) 1187 + { 1188 + char *p_ptr, *v_ptr; 1189 + struct page *page; 1190 + 1191 + /* 1192 + * This test is specifically crafted for the software tag-based mode, 1193 + * the only tag-based mode that poisons vm_map_ram mappings. 1194 + */ 1195 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_SW_TAGS); 1196 + 1197 + page = alloc_pages(GFP_KERNEL, 1); 1198 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, page); 1199 + p_ptr = page_address(page); 1200 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p_ptr); 1201 + 1202 + v_ptr = vm_map_ram(&page, 1, -1); 1203 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, v_ptr); 1204 + 1205 + KUNIT_EXPECT_GE(test, (u8)get_tag(v_ptr), (u8)KASAN_TAG_MIN); 1206 + KUNIT_EXPECT_LT(test, (u8)get_tag(v_ptr), (u8)KASAN_TAG_KERNEL); 1207 + 1208 + /* Make sure that in-bounds accesses through both pointers work. */ 1209 + *p_ptr = 0; 1210 + *v_ptr = 0; 1211 + 1212 + vm_unmap_ram(v_ptr, 1); 1213 + free_pages((unsigned long)p_ptr, 1); 1214 + } 1215 + 1216 + static void vmalloc_percpu(struct kunit *test) 1217 + { 1218 + char __percpu *ptr; 1219 + int cpu; 1220 + 1221 + /* 1222 + * This test is specifically crafted for the software tag-based mode, 1223 + * the only tag-based mode that poisons percpu mappings. 1224 + */ 1225 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_SW_TAGS); 1226 + 1227 + ptr = __alloc_percpu(PAGE_SIZE, PAGE_SIZE); 1228 + 1229 + for_each_possible_cpu(cpu) { 1230 + char *c_ptr = per_cpu_ptr(ptr, cpu); 1231 + 1232 + KUNIT_EXPECT_GE(test, (u8)get_tag(c_ptr), (u8)KASAN_TAG_MIN); 1233 + KUNIT_EXPECT_LT(test, (u8)get_tag(c_ptr), (u8)KASAN_TAG_KERNEL); 1234 + 1235 + /* Make sure that in-bounds accesses don't crash the kernel. 
*/ 1236 + *c_ptr = 0; 1237 + } 1238 + 1239 + free_percpu(ptr); 1081 1240 } 1082 1241 1083 1242 /* ··· 1275 1104 KUNIT_EXPECT_GE(test, (u8)get_tag(ptr), (u8)KASAN_TAG_MIN); 1276 1105 KUNIT_EXPECT_LT(test, (u8)get_tag(ptr), (u8)KASAN_TAG_KERNEL); 1277 1106 free_pages((unsigned long)ptr, order); 1107 + } 1108 + 1109 + if (!IS_ENABLED(CONFIG_KASAN_VMALLOC)) 1110 + return; 1111 + 1112 + for (i = 0; i < 256; i++) { 1113 + size = (get_random_int() % 1024) + 1; 1114 + ptr = vmalloc(size); 1115 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 1116 + KUNIT_EXPECT_GE(test, (u8)get_tag(ptr), (u8)KASAN_TAG_MIN); 1117 + KUNIT_EXPECT_LT(test, (u8)get_tag(ptr), (u8)KASAN_TAG_KERNEL); 1118 + vfree(ptr); 1278 1119 } 1279 1120 } 1280 1121 ··· 1393 1210 KUNIT_CASE(kasan_bitops_generic), 1394 1211 KUNIT_CASE(kasan_bitops_tags), 1395 1212 KUNIT_CASE(kmalloc_double_kzfree), 1213 + KUNIT_CASE(vmalloc_helpers_tags), 1396 1214 KUNIT_CASE(vmalloc_oob), 1215 + KUNIT_CASE(vmap_tags), 1216 + KUNIT_CASE(vm_map_ram_tags), 1217 + KUNIT_CASE(vmalloc_percpu), 1397 1218 KUNIT_CASE(match_all_not_assigned), 1398 1219 KUNIT_CASE(match_all_ptr_tag), 1399 1220 KUNIT_CASE(match_all_mem_tag),
+5 -3
lib/vsprintf.c
··· 2906 2906 { 2907 2907 int i; 2908 2908 2909 + if (unlikely(!size)) 2910 + return 0; 2911 + 2909 2912 i = vsnprintf(buf, size, fmt, args); 2910 2913 2911 2914 if (likely(i < size)) 2912 2915 return i; 2913 - if (size != 0) 2914 - return size - 1; 2915 - return 0; 2916 + 2917 + return size - 1; 2916 2918 } 2917 2919 EXPORT_SYMBOL(vscnprintf); 2918 2920
+3
mm/Kconfig
··· 762 762 register alias named "current_stack_pointer", this config can be 763 763 selected. 764 764 765 + config ARCH_HAS_FILTER_PGPROT 766 + bool 767 + 765 768 config ARCH_HAS_PTE_DEVMAP 766 769 bool 767 770
-1
mm/debug.c
··· 261 261 if (page_init_poisoning) 262 262 memset(page, PAGE_POISON_PATTERN, size); 263 263 } 264 - EXPORT_SYMBOL_GPL(page_init_poison); 265 264 #endif /* CONFIG_DEBUG_VM */
+30 -33
mm/filemap.c
··· 152 152 153 153 VM_BUG_ON_FOLIO(folio_mapped(folio), folio); 154 154 if (!IS_ENABLED(CONFIG_DEBUG_VM) && unlikely(folio_mapped(folio))) { 155 - int mapcount; 156 - 157 155 pr_alert("BUG: Bad page cache in process %s pfn:%05lx\n", 158 156 current->comm, folio_pfn(folio)); 159 157 dump_page(&folio->page, "still mapped when deleted"); 160 158 dump_stack(); 161 159 add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 162 160 163 - mapcount = page_mapcount(&folio->page); 164 - if (mapping_exiting(mapping) && 165 - folio_ref_count(folio) >= mapcount + 2) { 166 - /* 167 - * All vmas have already been torn down, so it's 168 - * a good bet that actually the folio is unmapped, 169 - * and we'd prefer not to leak it: if we're wrong, 170 - * some other bad page check should catch it later. 171 - */ 172 - page_mapcount_reset(&folio->page); 173 - folio_ref_sub(folio, mapcount); 161 + if (mapping_exiting(mapping) && !folio_test_large(folio)) { 162 + int mapcount = page_mapcount(&folio->page); 163 + 164 + if (folio_ref_count(folio) >= mapcount + 2) { 165 + /* 166 + * All vmas have already been torn down, so it's 167 + * a good bet that actually the page is unmapped 168 + * and we'd rather not leak it: if we're wrong, 169 + * another bad page check should catch it later. 170 + */ 171 + page_mapcount_reset(&folio->page); 172 + folio_ref_sub(folio, mapcount); 173 + } 174 174 } 175 175 } 176 176 ··· 193 193 /* 194 194 * At this point folio must be either written or cleaned by 195 195 * truncate. Dirty folio here signals a bug and loss of 196 - * unwritten data. 196 + * unwritten data - on ordinary filesystems. 197 197 * 198 - * This fixes dirty accounting after removing the folio entirely 198 + * But it's harmless on in-memory filesystems like tmpfs; and can 199 + * occur when a driver which did get_user_pages() sets page dirty 200 + * before putting it, while the inode is being finally evicted. 
201 + * 202 + * Below fixes dirty accounting after removing the folio entirely 199 203 * but leaves the dirty flag set: it has no effect for truncated 200 204 * folio and anyway will be cleared before returning folio to 201 205 * buddy allocator. 202 206 */ 203 - if (WARN_ON_ONCE(folio_test_dirty(folio))) 204 - folio_account_cleaned(folio, mapping, 205 - inode_to_wb(mapping->host)); 207 + if (WARN_ON_ONCE(folio_test_dirty(folio) && 208 + mapping_can_writeback(mapping))) 209 + folio_account_cleaned(folio, inode_to_wb(mapping->host)); 206 210 } 207 211 208 212 /* ··· 1189 1185 } 1190 1186 1191 1187 /* 1192 - * It is possible for other pages to have collided on the waitqueue 1193 - * hash, so in that case check for a page match. That prevents a long- 1194 - * term waiter 1188 + * It's possible to miss clearing waiters here, when we woke our page 1189 + * waiters, but the hashed waitqueue has waiters for other pages on it. 1190 + * That's okay, it's a rare case. The next waker will clear it. 1195 1191 * 1196 - * It is still possible to miss a case here, when we woke page waiters 1197 - * and removed them from the waitqueue, but there are still other 1198 - * page waiters. 1192 + * Note that, depending on the page pool (buddy, hugetlb, ZONE_DEVICE, 1193 + * other), the flag may be cleared in the course of freeing the page; 1194 + * but that is not required for correctness. 1199 1195 */ 1200 - if (!waitqueue_active(q) || !key.page_match) { 1196 + if (!waitqueue_active(q) || !key.page_match) 1201 1197 folio_clear_waiters(folio); 1202 - /* 1203 - * It's possible to miss clearing Waiters here, when we woke 1204 - * our page waiters, but the hashed waitqueue has waiters for 1205 - * other pages on it. 1206 - * 1207 - * That's okay, it's a rare case. The next waker will clear it. 1208 - */ 1209 - } 1198 + 1210 1199 spin_unlock_irqrestore(&q->lock, flags); 1211 1200 } 1212 1201
+19 -90
mm/huge_memory.c
··· 40 40 #include <asm/pgalloc.h> 41 41 #include "internal.h" 42 42 43 + #define CREATE_TRACE_POINTS 44 + #include <trace/events/thp.h> 45 + 43 46 /* 44 47 * By default, transparent hugepage support is disabled in order to avoid 45 48 * risking an increased memory footprint for applications that are not ··· 533 530 set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); 534 531 } 535 532 536 - bool is_transparent_hugepage(struct page *page) 533 + static inline bool is_transparent_hugepage(struct page *page) 537 534 { 538 535 if (!PageCompound(page)) 539 536 return false; ··· 542 539 return is_huge_zero_page(page) || 543 540 page[1].compound_dtor == TRANSHUGE_PAGE_DTOR; 544 541 } 545 - EXPORT_SYMBOL_GPL(is_transparent_hugepage); 546 542 547 543 static unsigned long __thp_get_unmapped_area(struct file *filp, 548 544 unsigned long addr, unsigned long len, ··· 1303 1301 page = pmd_page(orig_pmd); 1304 1302 VM_BUG_ON_PAGE(!PageHead(page), page); 1305 1303 1306 - /* Lock page for reuse_swap_page() */ 1307 1304 if (!trylock_page(page)) { 1308 1305 get_page(page); 1309 1306 spin_unlock(vmf->ptl); ··· 1318 1317 } 1319 1318 1320 1319 /* 1321 - * We can only reuse the page if nobody else maps the huge page or it's 1322 - * part. 1320 + * See do_wp_page(): we can only map the page writable if there are 1321 + * no additional references. Note that we always drain the LRU 1322 + * pagevecs immediately after adding a THP. 
1323 1323 */ 1324 - if (reuse_swap_page(page)) { 1324 + if (page_count(page) > 1 + PageSwapCache(page) * thp_nr_pages(page)) 1325 + goto unlock_fallback; 1326 + if (PageSwapCache(page)) 1327 + try_to_free_swap(page); 1328 + if (page_count(page) == 1) { 1325 1329 pmd_t entry; 1326 1330 entry = pmd_mkyoung(orig_pmd); 1327 1331 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); ··· 1337 1331 return VM_FAULT_WRITE; 1338 1332 } 1339 1333 1334 + unlock_fallback: 1340 1335 unlock_page(page); 1341 1336 spin_unlock(vmf->ptl); 1342 1337 fallback: ··· 2133 2126 { 2134 2127 spinlock_t *ptl; 2135 2128 struct mmu_notifier_range range; 2136 - bool do_unlock_folio = false; 2137 - pmd_t _pmd; 2138 2129 2139 2130 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, 2140 2131 address & HPAGE_PMD_MASK, ··· 2151 2146 goto out; 2152 2147 } 2153 2148 2154 - repeat: 2155 - if (pmd_trans_huge(*pmd)) { 2156 - if (!folio) { 2157 - folio = page_folio(pmd_page(*pmd)); 2158 - /* 2159 - * An anonymous page must be locked, to ensure that a 2160 - * concurrent reuse_swap_page() sees stable mapcount; 2161 - * but reuse_swap_page() is not used on shmem or file, 2162 - * and page lock must not be taken when zap_pmd_range() 2163 - * calls __split_huge_pmd() while i_mmap_lock is held. 
2164 - */ 2165 - if (folio_test_anon(folio)) { 2166 - if (unlikely(!folio_trylock(folio))) { 2167 - folio_get(folio); 2168 - _pmd = *pmd; 2169 - spin_unlock(ptl); 2170 - folio_lock(folio); 2171 - spin_lock(ptl); 2172 - if (unlikely(!pmd_same(*pmd, _pmd))) { 2173 - folio_unlock(folio); 2174 - folio_put(folio); 2175 - folio = NULL; 2176 - goto repeat; 2177 - } 2178 - folio_put(folio); 2179 - } 2180 - do_unlock_folio = true; 2181 - } 2182 - } 2183 - } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) 2184 - goto out; 2185 - __split_huge_pmd_locked(vma, pmd, range.start, freeze); 2149 + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || 2150 + is_pmd_migration_entry(*pmd)) 2151 + __split_huge_pmd_locked(vma, pmd, range.start, freeze); 2152 + 2186 2153 out: 2187 2154 spin_unlock(ptl); 2188 - if (do_unlock_folio) 2189 - folio_unlock(folio); 2190 2155 /* 2191 2156 * No need to double call mmu_notifier->invalidate_range() callback. 2192 2157 * They are 3 cases to consider inside __split_huge_pmd_locked(): ··· 2449 2474 */ 2450 2475 put_page(subpage); 2451 2476 } 2452 - } 2453 - 2454 - /* 2455 - * This calculates accurately how many mappings a transparent hugepage 2456 - * has (unlike page_mapcount() which isn't fully accurate). This full 2457 - * accuracy is primarily needed to know if copy-on-write faults can 2458 - * reuse the page and change the mapping to read-write instead of 2459 - * copying them. At the same time this returns the total_mapcount too. 2460 - * 2461 - * The function returns the highest mapcount any one of the subpages 2462 - * has. If the return value is one, even if different processes are 2463 - * mapping different subpages of the transparent hugepage, they can 2464 - * all reuse it, because each process is reusing a different subpage. 2465 - * 2466 - * The total_mapcount is instead counting all virtual mappings of the 2467 - * subpages. 
If the total_mapcount is equal to "one", it tells the 2468 - * caller all mappings belong to the same "mm" and in turn the 2469 - * anon_vma of the transparent hugepage can become the vma->anon_vma 2470 - * local one as no other process may be mapping any of the subpages. 2471 - * 2472 - * It would be more accurate to replace page_mapcount() with 2473 - * page_trans_huge_mapcount(), however we only use 2474 - * page_trans_huge_mapcount() in the copy-on-write faults where we 2475 - * need full accuracy to avoid breaking page pinning, because 2476 - * page_trans_huge_mapcount() is slower than page_mapcount(). 2477 - */ 2478 - int page_trans_huge_mapcount(struct page *page) 2479 - { 2480 - int i, ret; 2481 - 2482 - /* hugetlbfs shouldn't call it */ 2483 - VM_BUG_ON_PAGE(PageHuge(page), page); 2484 - 2485 - if (likely(!PageTransCompound(page))) 2486 - return atomic_read(&page->_mapcount) + 1; 2487 - 2488 - page = compound_head(page); 2489 - 2490 - ret = 0; 2491 - for (i = 0; i < thp_nr_pages(page); i++) { 2492 - int mapcount = atomic_read(&page[i]._mapcount) + 1; 2493 - ret = max(ret, mapcount); 2494 - } 2495 - 2496 - if (PageDoubleMap(page)) 2497 - ret -= 1; 2498 - 2499 - return ret + compound_mapcount(page); 2500 2477 } 2501 2478 2502 2479 /* Racy check whether the huge page can be split */ ··· 3058 3131 set_pmd_at(mm, address, pvmw->pmd, pmdswp); 3059 3132 page_remove_rmap(page, vma, true); 3060 3133 put_page(page); 3134 + trace_set_migration_pmd(address, pmd_val(pmdswp)); 3061 3135 } 3062 3136 3063 3137 void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) ··· 3091 3163 3092 3164 /* No need to invalidate - it was non-present before */ 3093 3165 update_mmu_cache_pmd(vma, address, pvmw->pmd); 3166 + trace_remove_migration_pmd(address, pmd_val(pmde)); 3094 3167 } 3095 3168 #endif
+1 -1
mm/kasan/Makefile
··· 35 35 CFLAGS_hw_tags.o := $(CC_FLAGS_KASAN_RUNTIME) 36 36 CFLAGS_sw_tags.o := $(CC_FLAGS_KASAN_RUNTIME) 37 37 38 - obj-$(CONFIG_KASAN) := common.o report.o 38 + obj-y := common.o report.o 39 39 obj-$(CONFIG_KASAN_GENERIC) += init.o generic.o report_generic.o shadow.o quarantine.o 40 40 obj-$(CONFIG_KASAN_HW_TAGS) += hw_tags.o report_hw_tags.o tags.o report_tags.o 41 41 obj-$(CONFIG_KASAN_SW_TAGS) += init.o report_sw_tags.o shadow.o sw_tags.o tags.o report_tags.o
+2 -2
mm/kasan/common.c
··· 387 387 } 388 388 389 389 /* 390 - * The object will be poisoned by kasan_free_pages() or 390 + * The object will be poisoned by kasan_poison_pages() or 391 391 * kasan_slab_free_mempool(). 392 392 */ 393 393 ··· 538 538 return NULL; 539 539 540 540 /* 541 - * The object has already been unpoisoned by kasan_alloc_pages() for 541 + * The object has already been unpoisoned by kasan_unpoison_pages() for 542 542 * alloc_pages() or by kasan_krealloc() for krealloc(). 543 543 */ 544 544
+165 -48
mm/kasan/hw_tags.c
··· 32 32 KASAN_ARG_MODE_ASYMM, 33 33 }; 34 34 35 + enum kasan_arg_vmalloc { 36 + KASAN_ARG_VMALLOC_DEFAULT, 37 + KASAN_ARG_VMALLOC_OFF, 38 + KASAN_ARG_VMALLOC_ON, 39 + }; 40 + 35 41 enum kasan_arg_stacktrace { 36 42 KASAN_ARG_STACKTRACE_DEFAULT, 37 43 KASAN_ARG_STACKTRACE_OFF, ··· 46 40 47 41 static enum kasan_arg kasan_arg __ro_after_init; 48 42 static enum kasan_arg_mode kasan_arg_mode __ro_after_init; 49 - static enum kasan_arg_stacktrace kasan_arg_stacktrace __ro_after_init; 43 + static enum kasan_arg_vmalloc kasan_arg_vmalloc __initdata; 44 + static enum kasan_arg_stacktrace kasan_arg_stacktrace __initdata; 50 45 51 - /* Whether KASAN is enabled at all. */ 46 + /* 47 + * Whether KASAN is enabled at all. 48 + * The value remains false until KASAN is initialized by kasan_init_hw_tags(). 49 + */ 52 50 DEFINE_STATIC_KEY_FALSE(kasan_flag_enabled); 53 51 EXPORT_SYMBOL(kasan_flag_enabled); 54 52 55 - /* Whether the selected mode is synchronous/asynchronous/asymmetric.*/ 53 + /* 54 + * Whether the selected mode is synchronous, asynchronous, or asymmetric. 55 + * Defaults to KASAN_MODE_SYNC. 56 + */ 56 57 enum kasan_mode kasan_mode __ro_after_init; 57 58 EXPORT_SYMBOL_GPL(kasan_mode); 58 59 60 + /* Whether to enable vmalloc tagging. */ 61 + DEFINE_STATIC_KEY_TRUE(kasan_flag_vmalloc); 62 + 59 63 /* Whether to collect alloc/free stack traces. 
*/ 60 - DEFINE_STATIC_KEY_FALSE(kasan_flag_stacktrace); 64 + DEFINE_STATIC_KEY_TRUE(kasan_flag_stacktrace); 61 65 62 66 /* kasan=off/on */ 63 67 static int __init early_kasan_flag(char *arg) ··· 105 89 } 106 90 early_param("kasan.mode", early_kasan_mode); 107 91 92 + /* kasan.vmalloc=off/on */ 93 + static int __init early_kasan_flag_vmalloc(char *arg) 94 + { 95 + if (!arg) 96 + return -EINVAL; 97 + 98 + if (!strcmp(arg, "off")) 99 + kasan_arg_vmalloc = KASAN_ARG_VMALLOC_OFF; 100 + else if (!strcmp(arg, "on")) 101 + kasan_arg_vmalloc = KASAN_ARG_VMALLOC_ON; 102 + else 103 + return -EINVAL; 104 + 105 + return 0; 106 + } 107 + early_param("kasan.vmalloc", early_kasan_flag_vmalloc); 108 + 108 109 /* kasan.stacktrace=off/on */ 109 110 static int __init early_kasan_flag_stacktrace(char *arg) 110 111 { ··· 149 116 return "sync"; 150 117 } 151 118 152 - /* kasan_init_hw_tags_cpu() is called for each CPU. */ 119 + /* 120 + * kasan_init_hw_tags_cpu() is called for each CPU. 121 + * Not marked as __init as a CPU can be hot-plugged after boot. 122 + */ 153 123 void kasan_init_hw_tags_cpu(void) 154 124 { 155 125 /* ··· 160 124 * as this function is only called for MTE-capable hardware. 161 125 */ 162 126 163 - /* If KASAN is disabled via command line, don't initialize it. */ 127 + /* 128 + * If KASAN is disabled via command line, don't initialize it. 129 + * When this function is called, kasan_flag_enabled is not yet 130 + * set by kasan_init_hw_tags(). Thus, check kasan_arg instead. 131 + */ 164 132 if (kasan_arg == KASAN_ARG_OFF) 165 133 return; 166 134 ··· 172 132 * Enable async or asymm modes only when explicitly requested 173 133 * through the command line. 
174 134 */ 175 - if (kasan_arg_mode == KASAN_ARG_MODE_ASYNC) 176 - hw_enable_tagging_async(); 177 - else if (kasan_arg_mode == KASAN_ARG_MODE_ASYMM) 178 - hw_enable_tagging_asymm(); 179 - else 180 - hw_enable_tagging_sync(); 135 + kasan_enable_tagging(); 181 136 } 182 137 183 138 /* kasan_init_hw_tags() is called once on boot CPU. */ ··· 186 151 if (kasan_arg == KASAN_ARG_OFF) 187 152 return; 188 153 189 - /* Enable KASAN. */ 190 - static_branch_enable(&kasan_flag_enabled); 191 - 192 154 switch (kasan_arg_mode) { 193 155 case KASAN_ARG_MODE_DEFAULT: 194 - /* 195 - * Default to sync mode. 196 - */ 197 - fallthrough; 156 + /* Default is specified by kasan_mode definition. */ 157 + break; 198 158 case KASAN_ARG_MODE_SYNC: 199 - /* Sync mode enabled. */ 200 159 kasan_mode = KASAN_MODE_SYNC; 201 160 break; 202 161 case KASAN_ARG_MODE_ASYNC: 203 - /* Async mode enabled. */ 204 162 kasan_mode = KASAN_MODE_ASYNC; 205 163 break; 206 164 case KASAN_ARG_MODE_ASYMM: 207 - /* Asymm mode enabled. */ 208 165 kasan_mode = KASAN_MODE_ASYMM; 166 + break; 167 + } 168 + 169 + switch (kasan_arg_vmalloc) { 170 + case KASAN_ARG_VMALLOC_DEFAULT: 171 + /* Default is specified by kasan_flag_vmalloc definition. */ 172 + break; 173 + case KASAN_ARG_VMALLOC_OFF: 174 + static_branch_disable(&kasan_flag_vmalloc); 175 + break; 176 + case KASAN_ARG_VMALLOC_ON: 177 + static_branch_enable(&kasan_flag_vmalloc); 209 178 break; 210 179 } 211 180 212 181 switch (kasan_arg_stacktrace) { 213 182 case KASAN_ARG_STACKTRACE_DEFAULT: 214 - /* Default to enabling stack trace collection. */ 215 - static_branch_enable(&kasan_flag_stacktrace); 183 + /* Default is specified by kasan_flag_stacktrace definition. */ 216 184 break; 217 185 case KASAN_ARG_STACKTRACE_OFF: 218 - /* Do nothing, kasan_flag_stacktrace keeps its default value. 
*/ 186 + static_branch_disable(&kasan_flag_stacktrace); 219 187 break; 220 188 case KASAN_ARG_STACKTRACE_ON: 221 189 static_branch_enable(&kasan_flag_stacktrace); 222 190 break; 223 191 } 224 192 225 - pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, stacktrace=%s)\n", 193 + /* KASAN is now initialized, enable it. */ 194 + static_branch_enable(&kasan_flag_enabled); 195 + 196 + pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, stacktrace=%s)\n", 226 197 kasan_mode_info(), 198 + kasan_vmalloc_enabled() ? "on" : "off", 227 199 kasan_stack_collection_enabled() ? "on" : "off"); 228 200 } 229 201 230 - void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags) 202 + #ifdef CONFIG_KASAN_VMALLOC 203 + 204 + static void unpoison_vmalloc_pages(const void *addr, u8 tag) 231 205 { 206 + struct vm_struct *area; 207 + int i; 208 + 232 209 /* 233 - * This condition should match the one in post_alloc_hook() in 234 - * page_alloc.c. 210 + * As hardware tag-based KASAN only tags VM_ALLOC vmalloc allocations 211 + * (see the comment in __kasan_unpoison_vmalloc), all of the pages 212 + * should belong to a single area. 
235 213 */ 236 - bool init = !want_init_on_free() && want_init_on_alloc(flags); 214 + area = find_vm_area((void *)addr); 215 + if (WARN_ON(!area)) 216 + return; 237 217 238 - if (flags & __GFP_SKIP_KASAN_POISON) 239 - SetPageSkipKASanPoison(page); 218 + for (i = 0; i < area->nr_pages; i++) { 219 + struct page *page = area->pages[i]; 240 220 241 - if (flags & __GFP_ZEROTAGS) { 242 - int i; 243 - 244 - for (i = 0; i != 1 << order; ++i) 245 - tag_clear_highpage(page + i); 246 - } else { 247 - kasan_unpoison_pages(page, order, init); 221 + page_kasan_tag_set(page, tag); 248 222 } 249 223 } 250 224 251 - void kasan_free_pages(struct page *page, unsigned int order) 225 + void *__kasan_unpoison_vmalloc(const void *start, unsigned long size, 226 + kasan_vmalloc_flags_t flags) 227 + { 228 + u8 tag; 229 + unsigned long redzone_start, redzone_size; 230 + 231 + if (!kasan_vmalloc_enabled()) 232 + return (void *)start; 233 + 234 + if (!is_vmalloc_or_module_addr(start)) 235 + return (void *)start; 236 + 237 + /* 238 + * Skip unpoisoning and assigning a pointer tag for non-VM_ALLOC 239 + * mappings as: 240 + * 241 + * 1. Unlike the software KASAN modes, hardware tag-based KASAN only 242 + * supports tagging physical memory. Therefore, it can only tag a 243 + * single mapping of normal physical pages. 244 + * 2. Hardware tag-based KASAN can only tag memory mapped with special 245 + * mapping protection bits, see arch_vmalloc_pgprot_modify(). 246 + * As non-VM_ALLOC mappings can be mapped outside of vmalloc code, 247 + * providing these bits would require tracking all non-VM_ALLOC 248 + * mappers. 249 + * 250 + * Thus, for VM_ALLOC mappings, hardware tag-based KASAN only tags 251 + * the first virtual mapping, which is created by vmalloc(). 252 + * Tagging the page_alloc memory backing that vmalloc() allocation is 253 + * skipped, see ___GFP_SKIP_KASAN_UNPOISON. 254 + * 255 + * For non-VM_ALLOC allocations, page_alloc memory is tagged as usual. 
256 + */ 257 + if (!(flags & KASAN_VMALLOC_VM_ALLOC)) 258 + return (void *)start; 259 + 260 + /* 261 + * Don't tag executable memory. 262 + * The kernel doesn't tolerate having the PC register tagged. 263 + */ 264 + if (!(flags & KASAN_VMALLOC_PROT_NORMAL)) 265 + return (void *)start; 266 + 267 + tag = kasan_random_tag(); 268 + start = set_tag(start, tag); 269 + 270 + /* Unpoison and initialize memory up to size. */ 271 + kasan_unpoison(start, size, flags & KASAN_VMALLOC_INIT); 272 + 273 + /* 274 + * Explicitly poison and initialize the in-page vmalloc() redzone. 275 + * Unlike software KASAN modes, hardware tag-based KASAN doesn't 276 + * unpoison memory when populating shadow for vmalloc() space. 277 + */ 278 + redzone_start = round_up((unsigned long)start + size, 279 + KASAN_GRANULE_SIZE); 280 + redzone_size = round_up(redzone_start, PAGE_SIZE) - redzone_start; 281 + kasan_poison((void *)redzone_start, redzone_size, KASAN_TAG_INVALID, 282 + flags & KASAN_VMALLOC_INIT); 283 + 284 + /* 285 + * Set per-page tag flags to allow accessing physical memory for the 286 + * vmalloc() mapping through page_address(vmalloc_to_page()). 287 + */ 288 + unpoison_vmalloc_pages(start, tag); 289 + 290 + return (void *)start; 291 + } 292 + 293 + void __kasan_poison_vmalloc(const void *start, unsigned long size) 252 294 { 253 295 /* 254 - * This condition should match the one in free_pages_prepare() in 255 - * page_alloc.c. 296 + * No tagging here. 297 + * The physical pages backing the vmalloc() allocation are poisoned 298 + * through the usual page_alloc paths. 
256 299 */ 257 - bool init = want_init_on_free(); 258 - 259 - kasan_poison_pages(page, order, init); 260 300 } 301 + 302 + #endif 261 303 262 304 #if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST) 263 305 264 - void kasan_enable_tagging_sync(void) 306 + void kasan_enable_tagging(void) 265 307 { 266 - hw_enable_tagging_sync(); 308 + if (kasan_arg_mode == KASAN_ARG_MODE_ASYNC) 309 + hw_enable_tagging_async(); 310 + else if (kasan_arg_mode == KASAN_ARG_MODE_ASYMM) 311 + hw_enable_tagging_asymm(); 312 + else 313 + hw_enable_tagging_sync(); 267 314 } 268 - EXPORT_SYMBOL_GPL(kasan_enable_tagging_sync); 315 + EXPORT_SYMBOL_GPL(kasan_enable_tagging); 269 316 270 317 void kasan_force_async_fault(void) 271 318 {
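The in-page redzone placement that `__kasan_unpoison_vmalloc()` sets up above can be sketched in plain C. The constant values are assumptions (16-byte MTE granules, 4 KiB pages), and `round_up_to()` stands in for the kernel's `round_up()`:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 16-byte MTE granules, 4 KiB pages. */
#define KASAN_GRANULE_SIZE 16ULL
#define PAGE_SIZE 4096ULL

/* Stand-in for the kernel's round_up() (align must be a power of two). */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Mirrors the redzone arithmetic in __kasan_unpoison_vmalloc(): the redzone
 * starts at the first granule boundary past the allocation and runs to the
 * end of the last page, covering the in-page slack of the vmalloc() area.
 */
static void vmalloc_redzone(uint64_t start, uint64_t size,
			    uint64_t *redzone_start, uint64_t *redzone_size)
{
	*redzone_start = round_up_to(start + size, KASAN_GRANULE_SIZE);
	*redzone_size = round_up_to(*redzone_start, PAGE_SIZE) - *redzone_start;
}
```

When the allocation already ends exactly on a page boundary, the computed redzone size is zero and the poisoning is a no-op.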
+44 -12
mm/kasan/kasan.h
··· 12 12 #include <linux/static_key.h> 13 13 #include "../slab.h" 14 14 15 - DECLARE_STATIC_KEY_FALSE(kasan_flag_stacktrace); 15 + DECLARE_STATIC_KEY_TRUE(kasan_flag_vmalloc); 16 + DECLARE_STATIC_KEY_TRUE(kasan_flag_stacktrace); 16 17 17 18 enum kasan_mode { 18 19 KASAN_MODE_SYNC, ··· 22 21 }; 23 22 24 23 extern enum kasan_mode kasan_mode __ro_after_init; 24 + 25 + static inline bool kasan_vmalloc_enabled(void) 26 + { 27 + return static_branch_likely(&kasan_flag_vmalloc); 28 + } 25 29 26 30 static inline bool kasan_stack_collection_enabled(void) 27 31 { ··· 77 71 #define KASAN_PAGE_REDZONE 0xFE /* redzone for kmalloc_large allocations */ 78 72 #define KASAN_KMALLOC_REDZONE 0xFC /* redzone inside slub object */ 79 73 #define KASAN_KMALLOC_FREE 0xFB /* object was freed (kmem_cache_free/kfree) */ 80 - #define KASAN_KMALLOC_FREETRACK 0xFA /* object was freed and has free track set */ 74 + #define KASAN_VMALLOC_INVALID 0xF8 /* unallocated space in vmapped page */ 81 75 #else 82 76 #define KASAN_FREE_PAGE KASAN_TAG_INVALID 83 77 #define KASAN_PAGE_REDZONE KASAN_TAG_INVALID 84 78 #define KASAN_KMALLOC_REDZONE KASAN_TAG_INVALID 85 79 #define KASAN_KMALLOC_FREE KASAN_TAG_INVALID 86 - #define KASAN_KMALLOC_FREETRACK KASAN_TAG_INVALID 80 + #define KASAN_VMALLOC_INVALID KASAN_TAG_INVALID /* only for SW_TAGS */ 87 81 #endif 88 82 83 + #ifdef CONFIG_KASAN_GENERIC 84 + 85 + #define KASAN_KMALLOC_FREETRACK 0xFA /* object was freed and has free track set */ 89 86 #define KASAN_GLOBAL_REDZONE 0xF9 /* redzone for global variable */ 90 - #define KASAN_VMALLOC_INVALID 0xF8 /* unallocated space in vmapped page */ 91 87 92 88 /* 93 89 * Stack redzone shadow values ··· 118 110 #define KASAN_ABI_VERSION 1 119 111 #endif 120 112 113 + #endif /* CONFIG_KASAN_GENERIC */ 114 + 121 115 /* Metadata layout customization. 
*/ 122 116 #define META_BYTES_PER_BLOCK 1 123 117 #define META_BLOCKS_PER_ROW 16 ··· 127 117 #define META_MEM_BYTES_PER_ROW (META_BYTES_PER_ROW * KASAN_GRANULE_SIZE) 128 118 #define META_ROWS_AROUND_ADDR 2 129 119 130 - struct kasan_access_info { 131 - const void *access_addr; 132 - const void *first_bad_addr; 120 + enum kasan_report_type { 121 + KASAN_REPORT_ACCESS, 122 + KASAN_REPORT_INVALID_FREE, 123 + }; 124 + 125 + struct kasan_report_info { 126 + enum kasan_report_type type; 127 + void *access_addr; 128 + void *first_bad_addr; 133 129 size_t access_size; 134 130 bool is_write; 135 131 unsigned long ip; ··· 220 204 #endif 221 205 }; 222 206 207 + #if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST) 208 + /* Used in KUnit-compatible KASAN tests. */ 209 + struct kunit_kasan_status { 210 + bool report_found; 211 + bool sync_fault; 212 + }; 213 + #endif 214 + 223 215 struct kasan_alloc_meta *kasan_get_alloc_meta(struct kmem_cache *cache, 224 216 const void *object); 225 217 #ifdef CONFIG_KASAN_GENERIC ··· 245 221 246 222 static inline bool addr_has_metadata(const void *addr) 247 223 { 248 - return (addr >= kasan_shadow_to_mem((void *)KASAN_SHADOW_START)); 224 + return (kasan_reset_tag(addr) >= 225 + kasan_shadow_to_mem((void *)KASAN_SHADOW_START)); 249 226 } 250 227 251 228 /** ··· 276 251 #endif 277 252 278 253 void *kasan_find_first_bad_addr(void *addr, size_t size); 279 - const char *kasan_get_bug_type(struct kasan_access_info *info); 254 + const char *kasan_get_bug_type(struct kasan_report_info *info); 280 255 void kasan_metadata_fetch_row(char *buffer, void *row); 281 256 282 - #if defined(CONFIG_KASAN_GENERIC) && defined(CONFIG_KASAN_STACK) 257 + #if defined(CONFIG_KASAN_STACK) 283 258 void kasan_print_address_stack_frame(const void *addr); 284 259 #else 285 260 static inline void kasan_print_address_stack_frame(const void *addr) { } ··· 365 340 366 341 #if defined(CONFIG_KASAN_HW_TAGS) && IS_ENABLED(CONFIG_KASAN_KUNIT_TEST) 367 342 368 - void 
kasan_enable_tagging_sync(void); 343 + void kasan_enable_tagging(void); 369 344 void kasan_force_async_fault(void); 370 345 371 346 #else /* CONFIG_KASAN_HW_TAGS || CONFIG_KASAN_KUNIT_TEST */ 372 347 373 - static inline void kasan_enable_tagging_sync(void) { } 348 + static inline void kasan_enable_tagging(void) { } 374 349 static inline void kasan_force_async_fault(void) { } 375 350 376 351 #endif /* CONFIG_KASAN_HW_TAGS || CONFIG_KASAN_KUNIT_TEST */ ··· 490 465 static inline bool kasan_arch_is_ready(void) { return true; } 491 466 #elif !defined(CONFIG_KASAN_GENERIC) || !defined(CONFIG_KASAN_OUTLINE) 492 467 #error kasan_arch_is_ready only works in KASAN generic outline mode! 468 + #endif 469 + 470 + #if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST) || IS_ENABLED(CONFIG_KASAN_MODULE_TEST) 471 + 472 + bool kasan_save_enable_multi_shot(void); 473 + void kasan_restore_multi_shot(bool enabled); 474 + 493 475 #endif 494 476 495 477 /*
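The tag values defined in the kasan.h hunks above (0xFF for native, untagged pointers; 0xFE reserved for freed or invalid memory) live in the top byte of a 64-bit pointer under arm64's Top Byte Ignore. A minimal sketch of the `set_tag()`/`get_tag()`/`kasan_reset_tag()` helpers, using plain integers rather than real pointers (the bit positions are the arm64 convention, assumed here):

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_TAG_KERNEL  0xFFu	/* native untagged pointers */
#define KASAN_TAG_INVALID 0xFEu	/* inaccessible (redzone/freed) memory */
#define KASAN_TAG_SHIFT   56	/* tag lives in pointer bits 56..63 */

/* Replace the top byte of an address with the given tag. */
static uint64_t set_tag(uint64_t addr, uint8_t tag)
{
	return (addr & ~(0xFFULL << KASAN_TAG_SHIFT)) |
	       ((uint64_t)tag << KASAN_TAG_SHIFT);
}

/* Extract the tag from an address. */
static uint8_t get_tag(uint64_t addr)
{
	return (uint8_t)(addr >> KASAN_TAG_SHIFT);
}

/* Mirrors kasan_reset_tag(): restore the native 0xFF top byte. */
static uint64_t reset_tag(uint64_t addr)
{
	return set_tag(addr, KASAN_TAG_KERNEL);
}
```

This is why the report path above compares `kasan_reset_tag(addr)` against the shadow start in `addr_has_metadata()`: the tag bits would otherwise perturb the address comparison.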
+209 -147
mm/kasan/report.c
··· 13 13 #include <linux/ftrace.h> 14 14 #include <linux/init.h> 15 15 #include <linux/kernel.h> 16 + #include <linux/lockdep.h> 16 17 #include <linux/mm.h> 17 18 #include <linux/printk.h> 18 19 #include <linux/sched.h> ··· 65 64 } 66 65 early_param("kasan.fault", early_kasan_fault); 67 66 67 + static int __init kasan_set_multi_shot(char *str) 68 + { 69 + set_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags); 70 + return 1; 71 + } 72 + __setup("kasan_multi_shot", kasan_set_multi_shot); 73 + 74 + /* 75 + * Used to suppress reports within kasan_disable/enable_current() critical 76 + * sections, which are used for marking accesses to slab metadata. 77 + */ 78 + static bool report_suppressed(void) 79 + { 80 + #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) 81 + if (current->kasan_depth) 82 + return true; 83 + #endif 84 + return false; 85 + } 86 + 87 + /* 88 + * Used to avoid reporting more than one KASAN bug unless kasan_multi_shot 89 + * is enabled. Note that KASAN tests effectively enable kasan_multi_shot 90 + * for their duration. 
91 + */
92 + static bool report_enabled(void)
93 + {
94 + if (test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags))
95 + return true;
96 + return !test_and_set_bit(KASAN_BIT_REPORTED, &kasan_flags);
97 + }
98 + 
99 + #if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST) || IS_ENABLED(CONFIG_KASAN_MODULE_TEST)
100 + 
68 101 bool kasan_save_enable_multi_shot(void)
69 102 {
70 103 return test_and_set_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags);
··· 112 77 }
113 78 EXPORT_SYMBOL_GPL(kasan_restore_multi_shot);
114 79 
115 - static int __init kasan_set_multi_shot(char *str)
116 - {
117 - set_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags);
118 - return 1;
119 - }
120 - __setup("kasan_multi_shot", kasan_set_multi_shot);
80 + #endif
121 81 
122 - static void print_error_description(struct kasan_access_info *info)
82 + #if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
83 + static void update_kunit_status(bool sync)
123 84 {
85 + struct kunit *test;
86 + struct kunit_resource *resource;
87 + struct kunit_kasan_status *status;
88 + 
89 + test = current->kunit_test;
90 + if (!test)
91 + return;
92 + 
93 + resource = kunit_find_named_resource(test, "kasan_status");
94 + if (!resource) {
95 + kunit_set_failure(test);
96 + return;
97 + }
98 + 
99 + status = (struct kunit_kasan_status *)resource->data;
100 + WRITE_ONCE(status->report_found, true);
101 + WRITE_ONCE(status->sync_fault, sync);
102 + 
103 + kunit_put_resource(resource);
104 + }
105 + #else
106 + static void update_kunit_status(bool sync) { }
107 + #endif
108 + 
109 + static DEFINE_SPINLOCK(report_lock);
110 + 
111 + static void start_report(unsigned long *flags, bool sync)
112 + {
113 + /* Respect the /proc/sys/kernel/traceoff_on_warning interface. */
114 + disable_trace_on_warning();
115 + /* Update status of the currently running KASAN test. */
116 + update_kunit_status(sync);
117 + /* Do not allow LOCKDEP mangling KASAN reports. */
118 + lockdep_off();
119 + /* Make sure we don't end up in a loop. 
*/ 120 + kasan_disable_current(); 121 + spin_lock_irqsave(&report_lock, *flags); 122 + pr_err("==================================================================\n"); 123 + } 124 + 125 + static void end_report(unsigned long *flags, void *addr) 126 + { 127 + if (addr) 128 + trace_error_report_end(ERROR_DETECTOR_KASAN, 129 + (unsigned long)addr); 130 + pr_err("==================================================================\n"); 131 + spin_unlock_irqrestore(&report_lock, *flags); 132 + if (panic_on_warn && !test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags)) 133 + panic("panic_on_warn set ...\n"); 134 + if (kasan_arg_fault == KASAN_ARG_FAULT_PANIC) 135 + panic("kasan.fault=panic set ...\n"); 136 + add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 137 + lockdep_on(); 138 + kasan_enable_current(); 139 + } 140 + 141 + static void print_error_description(struct kasan_report_info *info) 142 + { 143 + if (info->type == KASAN_REPORT_INVALID_FREE) { 144 + pr_err("BUG: KASAN: double-free or invalid-free in %pS\n", 145 + (void *)info->ip); 146 + return; 147 + } 148 + 124 149 pr_err("BUG: KASAN: %s in %pS\n", 125 150 kasan_get_bug_type(info), (void *)info->ip); 126 151 if (info->access_size) ··· 191 96 pr_err("%s at addr %px by task %s/%d\n", 192 97 info->is_write ? "Write" : "Read", 193 98 info->access_addr, current->comm, task_pid_nr(current)); 194 - } 195 - 196 - static DEFINE_SPINLOCK(report_lock); 197 - 198 - static void start_report(unsigned long *flags) 199 - { 200 - /* 201 - * Make sure we don't end up in loop. 
202 - */ 203 - kasan_disable_current(); 204 - spin_lock_irqsave(&report_lock, *flags); 205 - pr_err("==================================================================\n"); 206 - } 207 - 208 - static void end_report(unsigned long *flags, unsigned long addr) 209 - { 210 - if (!kasan_async_fault_possible()) 211 - trace_error_report_end(ERROR_DETECTOR_KASAN, addr); 212 - pr_err("==================================================================\n"); 213 - add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 214 - spin_unlock_irqrestore(&report_lock, *flags); 215 - if (panic_on_warn && !test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags)) 216 - panic("panic_on_warn set ...\n"); 217 - if (kasan_arg_fault == KASAN_ARG_FAULT_PANIC) 218 - panic("kasan.fault=panic set ...\n"); 219 - kasan_enable_current(); 220 99 } 221 100 222 101 static void print_track(struct kasan_track *track, const char *prefix) ··· 230 161 pr_err("The buggy address belongs to the object at %px\n" 231 162 " which belongs to the cache %s of size %d\n", 232 163 object, cache->name, cache->object_size); 233 - 234 - if (!addr) 235 - return; 236 164 237 165 if (access_addr < object_addr) { 238 166 rel_type = "to the left"; ··· 319 253 void *object = nearest_obj(cache, slab, addr); 320 254 321 255 describe_object(cache, object, addr, tag); 256 + pr_err("\n"); 322 257 } 323 258 324 259 if (kernel_or_module_addr(addr) && !init_task_stack_addr(addr)) { 325 260 pr_err("The buggy address belongs to the variable:\n"); 326 261 pr_err(" %pS\n", addr); 262 + pr_err("\n"); 263 + } 264 + 265 + if (object_is_on_stack(addr)) { 266 + /* 267 + * Currently, KASAN supports printing frame information only 268 + * for accesses to the task's own stack. 
269 + */ 270 + kasan_print_address_stack_frame(addr); 271 + pr_err("\n"); 272 + } 273 + 274 + if (is_vmalloc_addr(addr)) { 275 + struct vm_struct *va = find_vm_area(addr); 276 + 277 + if (va) { 278 + pr_err("The buggy address belongs to the virtual mapping at\n" 279 + " [%px, %px) created by:\n" 280 + " %pS\n", 281 + va->addr, va->addr + va->size, va->caller); 282 + pr_err("\n"); 283 + 284 + page = vmalloc_to_page(page); 285 + } 327 286 } 328 287 329 288 if (page) { 330 - pr_err("The buggy address belongs to the page:\n"); 289 + pr_err("The buggy address belongs to the physical page:\n"); 331 290 dump_page(page, "kasan: bad access detected"); 291 + pr_err("\n"); 332 292 } 333 - 334 - kasan_print_address_stack_frame(addr); 335 293 } 336 294 337 295 static bool meta_row_is_guilty(const void *row, const void *addr) ··· 414 324 } 415 325 } 416 326 417 - static bool report_enabled(void) 327 + static void print_report(struct kasan_report_info *info) 418 328 { 419 - #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) 420 - if (current->kasan_depth) 421 - return false; 422 - #endif 423 - if (test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags)) 424 - return true; 425 - return !test_and_set_bit(KASAN_BIT_REPORTED, &kasan_flags); 426 - } 329 + void *tagged_addr = info->access_addr; 330 + void *untagged_addr = kasan_reset_tag(tagged_addr); 331 + u8 tag = get_tag(tagged_addr); 427 332 428 - #if IS_ENABLED(CONFIG_KUNIT) 429 - static void kasan_update_kunit_status(struct kunit *cur_test) 430 - { 431 - struct kunit_resource *resource; 432 - struct kunit_kasan_expectation *kasan_data; 333 + print_error_description(info); 334 + if (addr_has_metadata(untagged_addr)) 335 + kasan_print_tags(tag, info->first_bad_addr); 336 + pr_err("\n"); 433 337 434 - resource = kunit_find_named_resource(cur_test, "kasan_data"); 435 - 436 - if (!resource) { 437 - kunit_set_failure(cur_test); 438 - return; 338 + if (addr_has_metadata(untagged_addr)) { 339 + 
print_address_description(untagged_addr, tag); 340 + print_memory_metadata(info->first_bad_addr); 341 + } else { 342 + dump_stack_lvl(KERN_ERR); 439 343 } 440 - 441 - kasan_data = (struct kunit_kasan_expectation *)resource->data; 442 - WRITE_ONCE(kasan_data->report_found, true); 443 - kunit_put_resource(resource); 444 344 } 445 - #endif /* IS_ENABLED(CONFIG_KUNIT) */ 446 345 447 - void kasan_report_invalid_free(void *object, unsigned long ip) 346 + void kasan_report_invalid_free(void *ptr, unsigned long ip) 448 347 { 449 348 unsigned long flags; 450 - u8 tag = get_tag(object); 349 + struct kasan_report_info info; 451 350 452 - object = kasan_reset_tag(object); 351 + /* 352 + * Do not check report_suppressed(), as an invalid-free cannot be 353 + * caused by accessing slab metadata and thus should not be 354 + * suppressed by kasan_disable/enable_current() critical sections. 355 + */ 356 + if (unlikely(!report_enabled())) 357 + return; 453 358 454 - #if IS_ENABLED(CONFIG_KUNIT) 455 - if (current->kunit_test) 456 - kasan_update_kunit_status(current->kunit_test); 457 - #endif /* IS_ENABLED(CONFIG_KUNIT) */ 359 + start_report(&flags, true); 458 360 459 - start_report(&flags); 460 - pr_err("BUG: KASAN: double-free or invalid-free in %pS\n", (void *)ip); 461 - kasan_print_tags(tag, object); 462 - pr_err("\n"); 463 - print_address_description(object, tag); 464 - pr_err("\n"); 465 - print_memory_metadata(object); 466 - end_report(&flags, (unsigned long)object); 361 + info.type = KASAN_REPORT_INVALID_FREE; 362 + info.access_addr = ptr; 363 + info.first_bad_addr = kasan_reset_tag(ptr); 364 + info.access_size = 0; 365 + info.is_write = false; 366 + info.ip = ip; 367 + 368 + print_report(&info); 369 + 370 + end_report(&flags, ptr); 371 + } 372 + 373 + /* 374 + * kasan_report() is the only reporting function that uses 375 + * user_access_save/restore(): kasan_report_invalid_free() cannot be called 376 + * from a UACCESS region, and kasan_report_async() is not used on x86. 
377 + */ 378 + bool kasan_report(unsigned long addr, size_t size, bool is_write, 379 + unsigned long ip) 380 + { 381 + bool ret = true; 382 + void *ptr = (void *)addr; 383 + unsigned long ua_flags = user_access_save(); 384 + unsigned long irq_flags; 385 + struct kasan_report_info info; 386 + 387 + if (unlikely(report_suppressed()) || unlikely(!report_enabled())) { 388 + ret = false; 389 + goto out; 390 + } 391 + 392 + start_report(&irq_flags, true); 393 + 394 + info.type = KASAN_REPORT_ACCESS; 395 + info.access_addr = ptr; 396 + info.first_bad_addr = kasan_find_first_bad_addr(ptr, size); 397 + info.access_size = size; 398 + info.is_write = is_write; 399 + info.ip = ip; 400 + 401 + print_report(&info); 402 + 403 + end_report(&irq_flags, ptr); 404 + 405 + out: 406 + user_access_restore(ua_flags); 407 + 408 + return ret; 467 409 } 468 410 469 411 #ifdef CONFIG_KASAN_HW_TAGS ··· 503 381 { 504 382 unsigned long flags; 505 383 506 - #if IS_ENABLED(CONFIG_KUNIT) 507 - if (current->kunit_test) 508 - kasan_update_kunit_status(current->kunit_test); 509 - #endif /* IS_ENABLED(CONFIG_KUNIT) */ 384 + /* 385 + * Do not check report_suppressed(), as kasan_disable/enable_current() 386 + * critical sections do not affect Hardware Tag-Based KASAN. 
387 + */ 388 + if (unlikely(!report_enabled())) 389 + return; 510 390 511 - start_report(&flags); 391 + start_report(&flags, false); 512 392 pr_err("BUG: KASAN: invalid-access\n"); 513 - pr_err("Asynchronous mode enabled: no access details available\n"); 393 + pr_err("Asynchronous fault: no details available\n"); 514 394 pr_err("\n"); 515 395 dump_stack_lvl(KERN_ERR); 516 - end_report(&flags, 0); 396 + end_report(&flags, NULL); 517 397 } 518 398 #endif /* CONFIG_KASAN_HW_TAGS */ 519 - 520 - static void __kasan_report(unsigned long addr, size_t size, bool is_write, 521 - unsigned long ip) 522 - { 523 - struct kasan_access_info info; 524 - void *tagged_addr; 525 - void *untagged_addr; 526 - unsigned long flags; 527 - 528 - #if IS_ENABLED(CONFIG_KUNIT) 529 - if (current->kunit_test) 530 - kasan_update_kunit_status(current->kunit_test); 531 - #endif /* IS_ENABLED(CONFIG_KUNIT) */ 532 - 533 - disable_trace_on_warning(); 534 - 535 - tagged_addr = (void *)addr; 536 - untagged_addr = kasan_reset_tag(tagged_addr); 537 - 538 - info.access_addr = tagged_addr; 539 - if (addr_has_metadata(untagged_addr)) 540 - info.first_bad_addr = 541 - kasan_find_first_bad_addr(tagged_addr, size); 542 - else 543 - info.first_bad_addr = untagged_addr; 544 - info.access_size = size; 545 - info.is_write = is_write; 546 - info.ip = ip; 547 - 548 - start_report(&flags); 549 - 550 - print_error_description(&info); 551 - if (addr_has_metadata(untagged_addr)) 552 - kasan_print_tags(get_tag(tagged_addr), info.first_bad_addr); 553 - pr_err("\n"); 554 - 555 - if (addr_has_metadata(untagged_addr)) { 556 - print_address_description(untagged_addr, get_tag(tagged_addr)); 557 - pr_err("\n"); 558 - print_memory_metadata(info.first_bad_addr); 559 - } else { 560 - dump_stack_lvl(KERN_ERR); 561 - } 562 - 563 - end_report(&flags, addr); 564 - } 565 - 566 - bool kasan_report(unsigned long addr, size_t size, bool is_write, 567 - unsigned long ip) 568 - { 569 - unsigned long flags = user_access_save(); 570 - bool 
ret = false; 571 - 572 - if (likely(report_enabled())) { 573 - __kasan_report(addr, size, is_write, ip); 574 - ret = true; 575 - } 576 - 577 - user_access_restore(flags); 578 - 579 - return ret; 580 - } 581 399 582 400 #ifdef CONFIG_KASAN_INLINE 583 401 /*
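The one-report-unless-multi-shot policy that `report_enabled()` implements in the hunks above boils down to an atomic test-and-set on `kasan_flags`. A single-threaded sketch (plain bools stand in for the kernel's atomic bitops and the `KASAN_BIT_MULTI_SHOT`/`KASAN_BIT_REPORTED` bits):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for KASAN_BIT_MULTI_SHOT / KASAN_BIT_REPORTED in kasan_flags. */
static bool multi_shot;
static bool reported;

/* Single-threaded stand-in for test_and_set_bit(). */
static bool test_and_set_reported(void)
{
	bool old = reported;

	reported = true;
	return old;
}

/* Mirrors report_enabled(): without kasan_multi_shot, only the first
 * report after boot is printed; later ones are suppressed. */
static bool report_enabled(void)
{
	if (multi_shot)
		return true;
	return !test_and_set_reported();
}
```

This is also why the KASAN KUnit tests save and restore the multi-shot bit via `kasan_save_enable_multi_shot()`: each test case needs its own report to fire.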
+16 -18
mm/kasan/report_generic.c
··· 34 34 { 35 35 void *p = addr; 36 36 37 + if (!addr_has_metadata(p)) 38 + return p; 39 + 37 40 while (p < addr + size && !(*(u8 *)kasan_mem_to_shadow(p))) 38 41 p += KASAN_GRANULE_SIZE; 42 + 39 43 return p; 40 44 } 41 45 42 - static const char *get_shadow_bug_type(struct kasan_access_info *info) 46 + static const char *get_shadow_bug_type(struct kasan_report_info *info) 43 47 { 44 48 const char *bug_type = "unknown-crash"; 45 49 u8 *shadow_addr; ··· 95 91 return bug_type; 96 92 } 97 93 98 - static const char *get_wild_bug_type(struct kasan_access_info *info) 94 + static const char *get_wild_bug_type(struct kasan_report_info *info) 99 95 { 100 96 const char *bug_type = "unknown-crash"; 101 97 ··· 109 105 return bug_type; 110 106 } 111 107 112 - const char *kasan_get_bug_type(struct kasan_access_info *info) 108 + const char *kasan_get_bug_type(struct kasan_report_info *info) 113 109 { 114 110 /* 115 111 * If access_size is a negative number, then it has reason to be ··· 184 180 return; 185 181 186 182 pr_err("\n"); 187 - pr_err("this frame has %lu %s:\n", num_objects, 183 + pr_err("This frame has %lu %s:\n", num_objects, 188 184 num_objects == 1 ? "object" : "objects"); 189 185 190 186 while (num_objects--) { ··· 215 211 } 216 212 } 217 213 214 + /* Returns true only if the address is on the current task's stack. */ 218 215 static bool __must_check get_address_stack_frame_info(const void *addr, 219 216 unsigned long *offset, 220 217 const char **frame_descr, ··· 228 223 const unsigned long *frame; 229 224 230 225 BUILD_BUG_ON(IS_ENABLED(CONFIG_STACK_GROWSUP)); 231 - 232 - /* 233 - * NOTE: We currently only support printing frame information for 234 - * accesses to the task's own stack. 
235 - */ 236 - if (!object_is_on_stack(addr)) 237 - return false; 238 226 239 227 aligned_addr = round_down((unsigned long)addr, sizeof(long)); 240 228 mem_ptr = round_down(aligned_addr, KASAN_GRANULE_SIZE); ··· 267 269 const char *frame_descr; 268 270 const void *frame_pc; 269 271 272 + if (WARN_ON(!object_is_on_stack(addr))) 273 + return; 274 + 275 + pr_err("The buggy address belongs to stack of task %s/%d\n", 276 + current->comm, task_pid_nr(current)); 277 + 270 278 if (!get_address_stack_frame_info(addr, &offset, &frame_descr, 271 279 &frame_pc)) 272 280 return; 273 281 274 - /* 275 - * get_address_stack_frame_info only returns true if the given addr is 276 - * on the current task's stack. 277 - */ 278 - pr_err("\n"); 279 - pr_err("addr %px is located in stack of task %s/%d at offset %lu in frame:\n", 280 - addr, current->comm, task_pid_nr(current), offset); 282 + pr_err(" and is located at offset %lu in frame:\n", offset); 281 283 pr_err(" %pS\n", frame_pc); 282 284 283 285 if (!frame_descr)
+1
mm/kasan/report_hw_tags.c
··· 17 17 18 18 void *kasan_find_first_bad_addr(void *addr, size_t size) 19 19 { 20 + /* Return the same value regardless of whether addr_has_metadata(). */ 20 21 return kasan_reset_tag(addr); 21 22 } 22 23
+16
mm/kasan/report_sw_tags.c
··· 16 16 #include <linux/mm.h> 17 17 #include <linux/printk.h> 18 18 #include <linux/sched.h> 19 + #include <linux/sched/task_stack.h> 19 20 #include <linux/slab.h> 20 21 #include <linux/stackdepot.h> 21 22 #include <linux/stacktrace.h> ··· 36 35 void *p = kasan_reset_tag(addr); 37 36 void *end = p + size; 38 37 38 + if (!addr_has_metadata(p)) 39 + return p; 40 + 39 41 while (p < end && tag == *(u8 *)kasan_mem_to_shadow(p)) 40 42 p += KASAN_GRANULE_SIZE; 43 + 41 44 return p; 42 45 } 43 46 ··· 56 51 57 52 pr_err("Pointer tag: [%02x], memory tag: [%02x]\n", addr_tag, *shadow); 58 53 } 54 + 55 + #ifdef CONFIG_KASAN_STACK 56 + void kasan_print_address_stack_frame(const void *addr) 57 + { 58 + if (WARN_ON(!object_is_on_stack(addr))) 59 + return; 60 + 61 + pr_err("The buggy address belongs to stack of task %s/%d\n", 62 + current->comm, task_pid_nr(current)); 63 + } 64 + #endif
+1 -1
mm/kasan/report_tags.c
··· 7 7 #include "kasan.h" 8 8 #include "../slab.h" 9 9 10 - const char *kasan_get_bug_type(struct kasan_access_info *info) 10 + const char *kasan_get_bug_type(struct kasan_report_info *info) 11 11 { 12 12 #ifdef CONFIG_KASAN_TAGS_IDENTIFY 13 13 struct kasan_alloc_meta *alloc_meta;
+41 -23
mm/kasan/shadow.c
··· 345 345 return 0; 346 346 } 347 347 348 - /* 349 - * Poison the shadow for a vmalloc region. Called as part of the 350 - * freeing process at the time the region is freed. 351 - */ 352 - void kasan_poison_vmalloc(const void *start, unsigned long size) 353 - { 354 - if (!is_vmalloc_or_module_addr(start)) 355 - return; 356 - 357 - size = round_up(size, KASAN_GRANULE_SIZE); 358 - kasan_poison(start, size, KASAN_VMALLOC_INVALID, false); 359 - } 360 - 361 - void kasan_unpoison_vmalloc(const void *start, unsigned long size) 362 - { 363 - if (!is_vmalloc_or_module_addr(start)) 364 - return; 365 - 366 - kasan_unpoison(start, size, false); 367 - } 368 - 369 348 static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr, 370 349 void *unused) 371 350 { ··· 475 496 } 476 497 } 477 498 499 + void *__kasan_unpoison_vmalloc(const void *start, unsigned long size, 500 + kasan_vmalloc_flags_t flags) 501 + { 502 + /* 503 + * Software KASAN modes unpoison both VM_ALLOC and non-VM_ALLOC 504 + * mappings, so the KASAN_VMALLOC_VM_ALLOC flag is ignored. 505 + * Software KASAN modes can't optimize zeroing memory by combining it 506 + * with setting memory tags, so the KASAN_VMALLOC_INIT flag is ignored. 507 + */ 508 + 509 + if (!is_vmalloc_or_module_addr(start)) 510 + return (void *)start; 511 + 512 + /* 513 + * Don't tag executable memory with the tag-based mode. 514 + * The kernel doesn't tolerate having the PC register tagged. 515 + */ 516 + if (IS_ENABLED(CONFIG_KASAN_SW_TAGS) && 517 + !(flags & KASAN_VMALLOC_PROT_NORMAL)) 518 + return (void *)start; 519 + 520 + start = set_tag(start, kasan_random_tag()); 521 + kasan_unpoison(start, size, false); 522 + return (void *)start; 523 + } 524 + 525 + /* 526 + * Poison the shadow for a vmalloc region. Called as part of the 527 + * freeing process at the time the region is freed. 
528 + */ 529 + void __kasan_poison_vmalloc(const void *start, unsigned long size) 530 + { 531 + if (!is_vmalloc_or_module_addr(start)) 532 + return; 533 + 534 + size = round_up(size, KASAN_GRANULE_SIZE); 535 + kasan_poison(start, size, KASAN_VMALLOC_INVALID, false); 536 + } 537 + 478 538 #else /* CONFIG_KASAN_VMALLOC */ 479 539 480 - int kasan_module_alloc(void *addr, size_t size, gfp_t gfp_mask) 540 + int kasan_alloc_module_shadow(void *addr, size_t size, gfp_t gfp_mask) 481 541 { 482 542 void *ret; 483 543 size_t scaled_size; ··· 552 534 return -ENOMEM; 553 535 } 554 536 555 - void kasan_free_shadow(const struct vm_struct *vm) 537 + void kasan_free_module_shadow(const struct vm_struct *vm) 556 538 { 557 539 if (vm->flags & VM_KASAN) 558 540 vfree(kasan_mem_to_shadow(vm->addr));
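The shadow.c routines above all rest on the generic-mode address translation: one shadow byte per 8-byte granule, located at `(addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET`. A sketch of `kasan_mem_to_shadow()` with the x86-64 offset (the offset constant is an assumption for that architecture's config):

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3
/* Assumed x86-64 value of CONFIG_KASAN_SHADOW_OFFSET. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL

/* Mirrors kasan_mem_to_shadow() for the generic mode: divide the address
 * by the granule size and relocate into the shadow region. */
static uint64_t mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}
```

Because eight bytes of memory share one shadow byte, poisoning and unpoisoning in this file always rounds sizes to `KASAN_GRANULE_SIZE` before touching the shadow.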
-11
mm/khugepaged.c
··· 46 46 SCAN_VMA_NULL, 47 47 SCAN_VMA_CHECK, 48 48 SCAN_ADDRESS_RANGE, 49 - SCAN_SWAP_CACHE_PAGE, 50 49 SCAN_DEL_PAGE_LRU, 51 50 SCAN_ALLOC_HUGE_PAGE_FAIL, 52 51 SCAN_CGROUP_CHARGE_FAIL, ··· 680 681 if (!is_refcount_suitable(page)) { 681 682 unlock_page(page); 682 683 result = SCAN_PAGE_COUNT; 683 - goto out; 684 - } 685 - if (!pte_write(pteval) && PageSwapCache(page) && 686 - !reuse_swap_page(page)) { 687 - /* 688 - * Page is in the swap cache and cannot be re-used. 689 - * It cannot be collapsed into a THP. 690 - */ 691 - unlock_page(page); 692 - result = SCAN_SWAP_CACHE_PAGE; 693 684 goto out; 694 685 } 695 686
+35 -4
mm/madvise.c
··· 52 52 case MADV_REMOVE: 53 53 case MADV_WILLNEED: 54 54 case MADV_DONTNEED: 55 + case MADV_DONTNEED_LOCKED: 55 56 case MADV_COLD: 56 57 case MADV_PAGEOUT: 57 58 case MADV_FREE: ··· 505 504 506 505 static inline bool can_madv_lru_vma(struct vm_area_struct *vma) 507 506 { 508 - return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)); 507 + return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP|VM_HUGETLB)); 509 508 } 510 509 511 510 static long madvise_cold(struct vm_area_struct *vma, ··· 778 777 return 0; 779 778 } 780 779 780 + static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma, 781 + unsigned long start, 782 + unsigned long *end, 783 + int behavior) 784 + { 785 + if (!is_vm_hugetlb_page(vma)) { 786 + unsigned int forbidden = VM_PFNMAP; 787 + 788 + if (behavior != MADV_DONTNEED_LOCKED) 789 + forbidden |= VM_LOCKED; 790 + 791 + return !(vma->vm_flags & forbidden); 792 + } 793 + 794 + if (behavior != MADV_DONTNEED && behavior != MADV_DONTNEED_LOCKED) 795 + return false; 796 + if (start & ~huge_page_mask(hstate_vma(vma))) 797 + return false; 798 + 799 + *end = ALIGN(*end, huge_page_size(hstate_vma(vma))); 800 + return true; 801 + } 802 + 781 803 static long madvise_dontneed_free(struct vm_area_struct *vma, 782 804 struct vm_area_struct **prev, 783 805 unsigned long start, unsigned long end, ··· 809 785 struct mm_struct *mm = vma->vm_mm; 810 786 811 787 *prev = vma; 812 - if (!can_madv_lru_vma(vma)) 788 + if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior)) 813 789 return -EINVAL; 814 790 815 791 if (!userfaultfd_remove(vma, start, end)) { ··· 831 807 */ 832 808 return -ENOMEM; 833 809 } 834 - if (!can_madv_lru_vma(vma)) 810 + /* 811 + * Potential end adjustment for hugetlb vma is OK as 812 + * the check below keeps end within vma. 
813 + */ 814 + if (!madvise_dontneed_free_valid_vma(vma, start, &end, 815 + behavior)) 835 816 return -EINVAL; 836 817 if (end > vma->vm_end) { 837 818 /* ··· 856 827 VM_WARN_ON(start >= end); 857 828 } 858 829 859 - if (behavior == MADV_DONTNEED) 830 + if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) 860 831 return madvise_dontneed_single_vma(vma, start, end); 861 832 else if (behavior == MADV_FREE) 862 833 return madvise_free_single_vma(vma, start, end); ··· 995 966 return madvise_pageout(vma, prev, start, end); 996 967 case MADV_FREE: 997 968 case MADV_DONTNEED: 969 + case MADV_DONTNEED_LOCKED: 998 970 return madvise_dontneed_free(vma, prev, start, end, behavior); 999 971 case MADV_POPULATE_READ: 1000 972 case MADV_POPULATE_WRITE: ··· 1126 1096 case MADV_REMOVE: 1127 1097 case MADV_WILLNEED: 1128 1098 case MADV_DONTNEED: 1099 + case MADV_DONTNEED_LOCKED: 1129 1100 case MADV_FREE: 1130 1101 case MADV_COLD: 1131 1102 case MADV_PAGEOUT:
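For hugetlb VMAs, the `madvise_dontneed_free_valid_vma()` helper added above rejects a `start` that is not hugepage-aligned and rounds `end` up to the next hugepage boundary. The check can be sketched as follows (a fixed 2 MiB hugepage size is an assumption; the kernel derives it from `huge_page_size(hstate_vma(vma))`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed hugepage size; the kernel uses huge_page_size(hstate_vma(vma)). */
#define HUGE_PAGE_SIZE (2ULL * 1024 * 1024)

/* Mirrors the hugetlb branch of madvise_dontneed_free_valid_vma():
 * reject an unaligned start, round end up to the next hugepage. */
static bool hugetlb_range_valid(uint64_t start, uint64_t *end)
{
	if (start & (HUGE_PAGE_SIZE - 1))
		return false;
	*end = (*end + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
	return true;
}
```

The rounded-up `end` is safe because, as the comment in the hunk notes, the caller's `end > vma->vm_end` check afterwards keeps the operation within the VMA.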
+91 -36
mm/memory.c
··· 3287 3287 if (PageAnon(vmf->page)) { 3288 3288 struct page *page = vmf->page; 3289 3289 3290 - /* PageKsm() doesn't necessarily raise the page refcount */ 3291 - if (PageKsm(page) || page_count(page) != 1) 3290 + /* 3291 + * We have to verify under page lock: these early checks are 3292 + * just an optimization to avoid locking the page and freeing 3293 + * the swapcache if there is little hope that we can reuse. 3294 + * 3295 + * PageKsm() doesn't necessarily raise the page refcount. 3296 + */ 3297 + if (PageKsm(page) || page_count(page) > 3) 3298 + goto copy; 3299 + if (!PageLRU(page)) 3300 + /* 3301 + * Note: We cannot easily detect+handle references from 3302 + * remote LRU pagevecs or references to PageLRU() pages. 3303 + */ 3304 + lru_add_drain(); 3305 + if (page_count(page) > 1 + PageSwapCache(page)) 3292 3306 goto copy; 3293 3307 if (!trylock_page(page)) 3294 3308 goto copy; 3295 - if (PageKsm(page) || page_mapcount(page) != 1 || page_count(page) != 1) { 3309 + if (PageSwapCache(page)) 3310 + try_to_free_swap(page); 3311 + if (PageKsm(page) || page_count(page) != 1) { 3296 3312 unlock_page(page); 3297 3313 goto copy; 3298 3314 } 3299 3315 /* 3300 - * Ok, we've got the only map reference, and the only 3301 - * page count reference, and the page is locked, 3302 - * it's dark out, and we're wearing sunglasses. Hit it. 3316 + * Ok, we've got the only page reference from our mapping 3317 + * and the page is locked, it's dark out, and we're wearing 3318 + * sunglasses. Hit it. 
3303 3319 */
3304 3320 unlock_page(page);
3305 3321 wp_page_reuse(vmf);
··· 3388 3372 details.even_cows = false;
3389 3373 details.single_folio = folio;
3390 3374 
3391 - i_mmap_lock_write(mapping);
3375 + i_mmap_lock_read(mapping);
3392 3376 if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
3393 3377 unmap_mapping_range_tree(&mapping->i_mmap, first_index,
3394 3378 last_index, &details);
3395 - i_mmap_unlock_write(mapping);
3379 + i_mmap_unlock_read(mapping);
3396 3380 }
3397 3381 
3398 3382 /**
··· 3418 3402 if (last_index < first_index)
3419 3403 last_index = ULONG_MAX;
3420 3404 
3421 - i_mmap_lock_write(mapping);
3405 + i_mmap_lock_read(mapping);
3422 3406 if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
3423 3407 unmap_mapping_range_tree(&mapping->i_mmap, first_index,
3424 3408 last_index, &details);
3425 - i_mmap_unlock_write(mapping);
3409 + i_mmap_unlock_read(mapping);
3426 3410 }
3427 3411 EXPORT_SYMBOL_GPL(unmap_mapping_pages);
3428 3412 
··· 3487 3471 
3488 3472 mmu_notifier_invalidate_range_end(&range);
3489 3473 return 0;
3474 + }
3475 + 
3476 + static inline bool should_try_to_free_swap(struct page *page,
3477 + struct vm_area_struct *vma,
3478 + unsigned int fault_flags)
3479 + {
3480 + if (!PageSwapCache(page))
3481 + return false;
3482 + if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
3483 + PageMlocked(page))
3484 + return true;
3485 + /*
3486 + * If we want to map a page that's in the swapcache writable, we
3487 + * have to detect via the refcount if we're really the exclusive
3488 + * user. Try freeing the swapcache to get rid of the swapcache
3489 + * reference only in case it's likely that we'll be the exclusive user. 
3490 + */ 3491 + return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) && 3492 + page_count(page) == 2; 3490 3493 } 3491 3494 3492 3495 /* ··· 3626 3591 goto out_release; 3627 3592 } 3628 3593 3629 - /* 3630 - * Make sure try_to_free_swap or reuse_swap_page or swapoff did not 3631 - * release the swapcache from under us. The page pin, and pte_same 3632 - * test below, are not enough to exclude that. Even if it is still 3633 - * swapcache, we need to check that the page's swap has not changed. 3634 - */ 3635 - if (unlikely((!PageSwapCache(page) || 3636 - page_private(page) != entry.val)) && swapcache) 3637 - goto out_page; 3594 + if (swapcache) { 3595 + /* 3596 + * Make sure try_to_free_swap or swapoff did not release the 3597 + * swapcache from under us. The page pin, and pte_same test 3598 + * below, are not enough to exclude that. Even if it is still 3599 + * swapcache, we need to check that the page's swap has not 3600 + * changed. 3601 + */ 3602 + if (unlikely(!PageSwapCache(page) || 3603 + page_private(page) != entry.val)) 3604 + goto out_page; 3638 3605 3639 - page = ksm_might_need_to_copy(page, vma, vmf->address); 3640 - if (unlikely(!page)) { 3641 - ret = VM_FAULT_OOM; 3642 - page = swapcache; 3643 - goto out_page; 3606 + /* 3607 + * KSM sometimes has to copy on read faults, for example, if 3608 + * page->index of !PageKSM() pages would be nonlinear inside the 3609 + * anon VMA -- PageKSM() is lost on actual swapout. 3610 + */ 3611 + page = ksm_might_need_to_copy(page, vma, vmf->address); 3612 + if (unlikely(!page)) { 3613 + ret = VM_FAULT_OOM; 3614 + page = swapcache; 3615 + goto out_page; 3616 + } 3617 + 3618 + /* 3619 + * If we want to map a page that's in the swapcache writable, we 3620 + * have to detect via the refcount if we're really the exclusive 3621 + * owner. Try removing the extra reference from the local LRU 3622 + * pagevecs if required. 
3623 + */ 3624 + if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache && 3625 + !PageKsm(page) && !PageLRU(page)) 3626 + lru_add_drain(); 3644 3627 } 3645 3628 3646 3629 cgroup_throttle_swaprate(page, GFP_KERNEL); ··· 3677 3624 } 3678 3625 3679 3626 /* 3680 - * The page isn't present yet, go ahead with the fault. 3681 - * 3682 - * Be careful about the sequence of operations here. 3683 - * To get its accounting right, reuse_swap_page() must be called 3684 - * while the page is counted on swap but not yet in mapcount i.e. 3685 - * before page_add_anon_rmap() and swap_free(); try_to_free_swap() 3686 - * must be called after the swap_free(), or it will never succeed. 3627 + * Remove the swap entry and conditionally try to free up the swapcache. 3628 + * We're already holding a reference on the page but haven't mapped it 3629 + * yet. 3687 3630 */ 3631 + swap_free(entry); 3632 + if (should_try_to_free_swap(page, vma, vmf->flags)) 3633 + try_to_free_swap(page); 3688 3634 3689 3635 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); 3690 3636 dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS); 3691 3637 pte = mk_pte(page, vma->vm_page_prot); 3692 - if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) { 3638 + 3639 + /* 3640 + * Same logic as in do_wp_page(); however, optimize for fresh pages 3641 + * that are certainly not shared because we just allocated them without 3642 + * exposing them to the swapcache. 
3643 + */ 3644 + if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && 3645 + (page != swapcache || page_count(page) == 1)) { 3693 3646 pte = maybe_mkwrite(pte_mkdirty(pte), vma); 3694 3647 vmf->flags &= ~FAULT_FLAG_WRITE; 3695 3648 ret |= VM_FAULT_WRITE; ··· 3721 3662 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); 3722 3663 arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); 3723 3664 3724 - swap_free(entry); 3725 - if (mem_cgroup_swap_full(page) || 3726 - (vma->vm_flags & VM_LOCKED) || PageMlocked(page)) 3727 - try_to_free_swap(page); 3728 3665 unlock_page(page); 3729 3666 if (page != swapcache && swapcache) { 3730 3667 /*
-2
mm/memremap.c
··· 456 456 if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free)) 457 457 return; 458 458 459 - __ClearPageWaiters(page); 460 - 461 459 mem_cgroup_uncharge(page_folio(page)); 462 460 463 461 /*
+3 -1
mm/migrate.c
··· 53 53 54 54 #include <asm/tlbflush.h> 55 55 56 - #define CREATE_TRACE_POINTS 57 56 #include <trace/events/migrate.h> 58 57 59 58 #include "internal.h" ··· 247 248 } 248 249 if (vma->vm_flags & VM_LOCKED) 249 250 mlock_page_drain(smp_processor_id()); 251 + 252 + trace_remove_migration_pte(pvmw.address, pte_val(pte), 253 + compound_order(new)); 250 254 251 255 /* No need to invalidate - it was non-present before */ 252 256 update_mmu_cache(vma, pvmw.address, pvmw.pte);
+8 -10
mm/page-writeback.c
··· 2465 2465 * 2466 2466 * Caller must hold lock_page_memcg(). 2467 2467 */ 2468 - void folio_account_cleaned(struct folio *folio, struct address_space *mapping, 2469 - struct bdi_writeback *wb) 2468 + void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb) 2470 2469 { 2471 - if (mapping_can_writeback(mapping)) { 2472 - long nr = folio_nr_pages(folio); 2473 - lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr); 2474 - zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr); 2475 - wb_stat_mod(wb, WB_RECLAIMABLE, -nr); 2476 - task_io_account_cancelled_write(nr * PAGE_SIZE); 2477 - } 2470 + long nr = folio_nr_pages(folio); 2471 + 2472 + lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr); 2473 + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr); 2474 + wb_stat_mod(wb, WB_RECLAIMABLE, -nr); 2475 + task_io_account_cancelled_write(nr * PAGE_SIZE); 2478 2476 } 2479 2477 2480 2478 /* ··· 2681 2683 wb = unlocked_inode_to_wb_begin(inode, &cookie); 2682 2684 2683 2685 if (folio_test_clear_dirty(folio)) 2684 - folio_account_cleaned(folio, mapping, wb); 2686 + folio_account_cleaned(folio, wb); 2685 2687 2686 2688 unlocked_inode_to_wb_end(inode, &cookie); 2687 2689 folio_memcg_unlock(folio);
+105 -49
mm/page_alloc.c
··· 378 378 */ 379 379 static DEFINE_STATIC_KEY_TRUE(deferred_pages); 380 380 381 - /* 382 - * Calling kasan_poison_pages() only after deferred memory initialization 383 - * has completed. Poisoning pages during deferred memory init will greatly 384 - * lengthen the process and cause problem in large memory systems as the 385 - * deferred pages initialization is done with interrupt disabled. 386 - * 387 - * Assuming that there will be no reference to those newly initialized 388 - * pages before they are ever allocated, this should have no effect on 389 - * KASAN memory tracking as the poison will be properly inserted at page 390 - * allocation time. The only corner case is when pages are allocated by 391 - * on-demand allocation and then freed again before the deferred pages 392 - * initialization is done, but this is not likely to happen. 393 - */ 394 - static inline bool should_skip_kasan_poison(struct page *page, fpi_t fpi_flags) 381 + static inline bool deferred_pages_enabled(void) 395 382 { 396 - return static_branch_unlikely(&deferred_pages) || 397 - (!IS_ENABLED(CONFIG_KASAN_GENERIC) && 398 - (fpi_flags & FPI_SKIP_KASAN_POISON)) || 399 - PageSkipKASanPoison(page); 383 + return static_branch_unlikely(&deferred_pages); 400 384 } 401 385 402 386 /* Returns true if the struct page for the pfn is uninitialised */ ··· 431 447 return false; 432 448 } 433 449 #else 434 - static inline bool should_skip_kasan_poison(struct page *page, fpi_t fpi_flags) 450 + static inline bool deferred_pages_enabled(void) 435 451 { 436 - return (!IS_ENABLED(CONFIG_KASAN_GENERIC) && 437 - (fpi_flags & FPI_SKIP_KASAN_POISON)) || 438 - PageSkipKASanPoison(page); 452 + return false; 439 453 } 440 454 441 455 static inline bool early_page_uninitialised(unsigned long pfn) ··· 1249 1267 return ret; 1250 1268 } 1251 1269 1252 - static void kernel_init_free_pages(struct page *page, int numpages, bool zero_tags) 1270 + /* 1271 + * Skip KASAN memory poisoning when either: 1272 + * 1273 + * 1. 
Deferred memory initialization has not yet completed, 1274 + * see the explanation below. 1275 + * 2. Skipping poisoning is requested via FPI_SKIP_KASAN_POISON, 1276 + * see the comment next to it. 1277 + * 3. Skipping poisoning is requested via __GFP_SKIP_KASAN_POISON, 1278 + * see the comment next to it. 1279 + * 1280 + * Poisoning pages during deferred memory init will greatly lengthen the 1281 + * process and cause problems in large memory systems as the deferred pages 1282 + * initialization is done with interrupts disabled. 1283 + * 1284 + * Assuming that there will be no reference to those newly initialized 1285 + * pages before they are ever allocated, this should have no effect on 1286 + * KASAN memory tracking as the poison will be properly inserted at page 1287 + * allocation time. The only corner case is when pages are allocated by 1288 + * on-demand allocation and then freed again before the deferred pages 1289 + * initialization is done, but this is not likely to happen. 1290 + */ 1291 + static inline bool should_skip_kasan_poison(struct page *page, fpi_t fpi_flags) 1292 + { 1293 + return deferred_pages_enabled() || 1294 + (!IS_ENABLED(CONFIG_KASAN_GENERIC) && 1295 + (fpi_flags & FPI_SKIP_KASAN_POISON)) || 1296 + PageSkipKASanPoison(page); 1297 + } 1298 + 1299 + static void kernel_init_free_pages(struct page *page, int numpages) 1253 1300 { 1254 1301 int i; 1255 - 1256 - if (zero_tags) { 1257 - for (i = 0; i < numpages; i++) 1258 - tag_clear_highpage(page + i); 1259 - return; 1260 - } 1261 1302 1262 1303 /* s390's use of memset() could override KASAN redzones. 
*/ 1263 1304 kasan_disable_current(); ··· 1297 1292 unsigned int order, bool check_free, fpi_t fpi_flags) 1298 1293 { 1299 1294 int bad = 0; 1300 - bool skip_kasan_poison = should_skip_kasan_poison(page, fpi_flags); 1295 + bool init = want_init_on_free(); 1301 1296 1302 1297 VM_BUG_ON_PAGE(PageTail(page), page); 1303 1298 ··· 1364 1359 1365 1360 /* 1366 1361 * As memory initialization might be integrated into KASAN, 1367 - * kasan_free_pages and kernel_init_free_pages must be 1362 + * KASAN poisoning and memory initialization code must be 1368 1363 * kept together to avoid discrepancies in behavior. 1369 1364 * 1370 1365 * With hardware tag-based KASAN, memory tags must be set before the 1371 1366 * page becomes unavailable via debug_pagealloc or arch_free_page. 1372 1367 */ 1373 - if (kasan_has_integrated_init()) { 1374 - if (!skip_kasan_poison) 1375 - kasan_free_pages(page, order); 1376 - } else { 1377 - bool init = want_init_on_free(); 1368 + if (!should_skip_kasan_poison(page, fpi_flags)) { 1369 + kasan_poison_pages(page, order, init); 1378 1370 1379 - if (init) 1380 - kernel_init_free_pages(page, 1 << order, false); 1381 - if (!skip_kasan_poison) 1382 - kasan_poison_pages(page, order, init); 1371 + /* Memory is already initialized if KASAN did it internally. */ 1372 + if (kasan_has_integrated_init()) 1373 + init = false; 1383 1374 } 1375 + if (init) 1376 + kernel_init_free_pages(page, 1 << order); 1384 1377 1385 1378 /* 1386 1379 * arch_free_page() can make the page's contents inaccessible. s390 ··· 2343 2340 } 2344 2341 #endif /* CONFIG_DEBUG_VM */ 2345 2342 2343 + static inline bool should_skip_kasan_unpoison(gfp_t flags, bool init_tags) 2344 + { 2345 + /* Don't skip if a software KASAN mode is enabled. */ 2346 + if (IS_ENABLED(CONFIG_KASAN_GENERIC) || 2347 + IS_ENABLED(CONFIG_KASAN_SW_TAGS)) 2348 + return false; 2349 + 2350 + /* Skip, if hardware tag-based KASAN is not enabled. 
*/ 2351 + if (!kasan_hw_tags_enabled()) 2352 + return true; 2353 + 2354 + /* 2355 + * With hardware tag-based KASAN enabled, skip if either: 2356 + * 2357 + * 1. Memory tags have already been cleared via tag_clear_highpage(). 2358 + * 2. Skipping has been requested via __GFP_SKIP_KASAN_UNPOISON. 2359 + */ 2360 + return init_tags || (flags & __GFP_SKIP_KASAN_UNPOISON); 2361 + } 2362 + 2363 + static inline bool should_skip_init(gfp_t flags) 2364 + { 2365 + /* Don't skip, if hardware tag-based KASAN is not enabled. */ 2366 + if (!kasan_hw_tags_enabled()) 2367 + return false; 2368 + 2369 + /* For hardware tag-based KASAN, skip if requested. */ 2370 + return (flags & __GFP_SKIP_ZERO); 2371 + } 2372 + 2346 2373 inline void post_alloc_hook(struct page *page, unsigned int order, 2347 2374 gfp_t gfp_flags) 2348 2375 { 2376 + bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) && 2377 + !should_skip_init(gfp_flags); 2378 + bool init_tags = init && (gfp_flags & __GFP_ZEROTAGS); 2379 + 2349 2380 set_page_private(page, 0); 2350 2381 set_page_refcounted(page); 2351 2382 ··· 2395 2358 2396 2359 /* 2397 2360 * As memory initialization might be integrated into KASAN, 2398 - * kasan_alloc_pages and kernel_init_free_pages must be 2361 + * KASAN unpoisoning and memory initialization code must be 2399 2362 * kept together to avoid discrepancies in behavior. 2400 2363 */ 2401 - if (kasan_has_integrated_init()) { 2402 - kasan_alloc_pages(page, order, gfp_flags); 2403 - } else { 2404 - bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags); 2405 2364 2406 - kasan_unpoison_pages(page, order, init); 2407 - if (init) 2408 - kernel_init_free_pages(page, 1 << order, 2409 - gfp_flags & __GFP_ZEROTAGS); 2365 + /* 2366 + * If memory tags should be zeroed (which happens only when memory 2367 + * should be initialized as well). 2368 + */ 2369 + if (init_tags) { 2370 + int i; 2371 + 2372 + /* Initialize both memory and tags. 
*/ 2373 + for (i = 0; i != 1 << order; ++i) 2374 + tag_clear_highpage(page + i); 2375 + 2376 + /* Note that memory is already initialized by the loop above. */ 2377 + init = false; 2410 2378 } 2379 + if (!should_skip_kasan_unpoison(gfp_flags, init_tags)) { 2380 + /* Unpoison shadow memory or set memory tags. */ 2381 + kasan_unpoison_pages(page, order, init); 2382 + 2383 + /* Note that memory is already initialized by KASAN. */ 2384 + if (kasan_has_integrated_init()) 2385 + init = false; 2386 + } 2387 + /* If memory is still not initialized, do it now. */ 2388 + if (init) 2389 + kernel_init_free_pages(page, 1 << order); 2390 + /* Propagate __GFP_SKIP_KASAN_POISON to page flags. */ 2391 + if (kasan_hw_tags_enabled() && (gfp_flags & __GFP_SKIP_KASAN_POISON)) 2392 + SetPageSkipKASanPoison(page); 2411 2393 2412 2394 set_page_owner(page, order, gfp_flags); 2413 2395 page_table_check_alloc(page, order);
+57 -14
mm/page_owner.c
··· 10 10 #include <linux/migrate.h> 11 11 #include <linux/stackdepot.h> 12 12 #include <linux/seq_file.h> 13 + #include <linux/memcontrol.h> 13 14 #include <linux/sched/clock.h> 14 15 15 16 #include "internal.h" ··· 29 28 depot_stack_handle_t free_handle; 30 29 u64 ts_nsec; 31 30 u64 free_ts_nsec; 31 + char comm[TASK_COMM_LEN]; 32 32 pid_t pid; 33 + pid_t tgid; 33 34 }; 34 35 35 36 static bool page_owner_enabled = false; ··· 166 163 page_owner->gfp_mask = gfp_mask; 167 164 page_owner->last_migrate_reason = -1; 168 165 page_owner->pid = current->pid; 166 + page_owner->tgid = current->tgid; 169 167 page_owner->ts_nsec = local_clock(); 168 + strlcpy(page_owner->comm, current->comm, 169 + sizeof(page_owner->comm)); 170 170 __set_bit(PAGE_EXT_OWNER, &page_ext->flags); 171 171 __set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); 172 172 ··· 235 229 old_page_owner->last_migrate_reason; 236 230 new_page_owner->handle = old_page_owner->handle; 237 231 new_page_owner->pid = old_page_owner->pid; 232 + new_page_owner->tgid = old_page_owner->tgid; 238 233 new_page_owner->ts_nsec = old_page_owner->ts_nsec; 239 234 new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; 235 + strcpy(new_page_owner->comm, old_page_owner->comm); 240 236 241 237 /* 242 238 * We don't clear the bit on the old folio as it's going to be freed ··· 333 325 seq_putc(m, '\n'); 334 326 } 335 327 328 + /* 329 + * Looking for memcg information and print it out 330 + */ 331 + static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret, 332 + struct page *page) 333 + { 334 + #ifdef CONFIG_MEMCG 335 + unsigned long memcg_data; 336 + struct mem_cgroup *memcg; 337 + bool online; 338 + char name[80]; 339 + 340 + rcu_read_lock(); 341 + memcg_data = READ_ONCE(page->memcg_data); 342 + if (!memcg_data) 343 + goto out_unlock; 344 + 345 + if (memcg_data & MEMCG_DATA_OBJCGS) 346 + ret += scnprintf(kbuf + ret, count - ret, 347 + "Slab cache page\n"); 348 + 349 + memcg = page_memcg_check(page); 350 + if 
(!memcg) 351 + goto out_unlock; 352 + 353 + online = (memcg->css.flags & CSS_ONLINE); 354 + cgroup_name(memcg->css.cgroup, name, sizeof(name)); 355 + ret += scnprintf(kbuf + ret, count - ret, 356 + "Charged %sto %smemcg %s\n", 357 + PageMemcgKmem(page) ? "(via objcg) " : "", 358 + online ? "" : "offline ", 359 + name); 360 + out_unlock: 361 + rcu_read_unlock(); 362 + #endif /* CONFIG_MEMCG */ 363 + 364 + return ret; 365 + } 366 + 336 367 static ssize_t 337 368 print_page_owner(char __user *buf, size_t count, unsigned long pfn, 338 369 struct page *page, struct page_owner *page_owner, ··· 385 338 if (!kbuf) 386 339 return -ENOMEM; 387 340 388 - ret = snprintf(kbuf, count, 389 - "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", 341 + ret = scnprintf(kbuf, count, 342 + "Page allocated via order %u, mask %#x(%pGg), pid %d, tgid %d (%s), ts %llu ns, free_ts %llu ns\n", 390 343 page_owner->order, page_owner->gfp_mask, 391 344 &page_owner->gfp_mask, page_owner->pid, 345 + page_owner->tgid, page_owner->comm, 392 346 page_owner->ts_nsec, page_owner->free_ts_nsec); 393 - 394 - if (ret >= count) 395 - goto err; 396 347 397 348 /* Print information relevant to grouping pages by mobility */ 398 349 pageblock_mt = get_pageblock_migratetype(page); 399 350 page_mt = gfp_migratetype(page_owner->gfp_mask); 400 - ret += snprintf(kbuf + ret, count - ret, 351 + ret += scnprintf(kbuf + ret, count - ret, 401 352 "PFN %lu type %s Block %lu type %s Flags %pGp\n", 402 353 pfn, 403 354 migratetype_names[page_mt], ··· 403 358 migratetype_names[pageblock_mt], 404 359 &page->flags); 405 360 406 - if (ret >= count) 407 - goto err; 408 - 409 361 ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0); 410 362 if (ret >= count) 411 363 goto err; 412 364 413 365 if (page_owner->last_migrate_reason != -1) { 414 - ret += snprintf(kbuf + ret, count - ret, 366 + ret += scnprintf(kbuf + ret, count - ret, 415 367 "Page has been migrated, last migrate reason: 
%s\n", 416 368 migrate_reason_names[page_owner->last_migrate_reason]); 417 - if (ret >= count) 418 - goto err; 419 369 } 370 + 371 + ret = print_page_owner_memcg(kbuf, count, ret, page); 420 372 421 373 ret += snprintf(kbuf + ret, count - ret, "\n"); 422 374 if (ret >= count) ··· 457 415 else 458 416 pr_alert("page_owner tracks the page as freed\n"); 459 417 460 - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n", 418 + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, tgid %d (%s), ts %llu, free_ts %llu\n", 461 419 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, 462 - page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); 420 + page_owner->pid, page_owner->tgid, page_owner->comm, 421 + page_owner->ts_nsec, page_owner->free_ts_nsec); 463 422 464 423 handle = READ_ONCE(page_owner->handle); 465 424 if (!handle)
+44 -18
mm/rmap.c
··· 76 76 77 77 #include <asm/tlbflush.h> 78 78 79 + #define CREATE_TRACE_POINTS 79 80 #include <trace/events/tlb.h> 81 + #include <trace/events/migrate.h> 80 82 81 83 #include "internal.h" 82 84 ··· 1238 1236 void page_add_file_rmap(struct page *page, 1239 1237 struct vm_area_struct *vma, bool compound) 1240 1238 { 1241 - int i, nr = 1; 1239 + int i, nr = 0; 1242 1240 1243 1241 VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); 1244 1242 lock_page_memcg(page); 1245 1243 if (compound && PageTransHuge(page)) { 1246 1244 int nr_pages = thp_nr_pages(page); 1247 1245 1248 - for (i = 0, nr = 0; i < nr_pages; i++) { 1246 + for (i = 0; i < nr_pages; i++) { 1249 1247 if (atomic_inc_and_test(&page[i]._mapcount)) 1250 1248 nr++; 1251 1249 } ··· 1273 1271 VM_WARN_ON_ONCE(!PageLocked(page)); 1274 1272 SetPageDoubleMap(compound_head(page)); 1275 1273 } 1276 - if (!atomic_inc_and_test(&page->_mapcount)) 1277 - goto out; 1274 + if (atomic_inc_and_test(&page->_mapcount)) 1275 + nr++; 1278 1276 } 1279 - __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); 1280 1277 out: 1278 + if (nr) 1279 + __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); 1281 1280 unlock_page_memcg(page); 1282 1281 1283 1282 mlock_vma_page(page, vma, compound); ··· 1286 1283 1287 1284 static void page_remove_file_rmap(struct page *page, bool compound) 1288 1285 { 1289 - int i, nr = 1; 1286 + int i, nr = 0; 1290 1287 1291 1288 VM_BUG_ON_PAGE(compound && !PageHead(page), page); 1292 1289 ··· 1301 1298 if (compound && PageTransHuge(page)) { 1302 1299 int nr_pages = thp_nr_pages(page); 1303 1300 1304 - for (i = 0, nr = 0; i < nr_pages; i++) { 1301 + for (i = 0; i < nr_pages; i++) { 1305 1302 if (atomic_add_negative(-1, &page[i]._mapcount)) 1306 1303 nr++; 1307 1304 } 1308 1305 if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) 1309 - return; 1306 + goto out; 1310 1307 if (PageSwapBacked(page)) 1311 1308 __mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED, 1312 1309 -nr_pages); ··· 1314 1311 
__mod_lruvec_page_state(page, NR_FILE_PMDMAPPED, 1315 1312 -nr_pages); 1316 1313 } else { 1317 - if (!atomic_add_negative(-1, &page->_mapcount)) 1318 - return; 1314 + if (atomic_add_negative(-1, &page->_mapcount)) 1315 + nr++; 1319 1316 } 1320 - 1321 - /* 1322 - * We use the irq-unsafe __{inc|mod}_lruvec_page_state because 1323 - * these counters are not modified in interrupt context, and 1324 - * pte lock(a spinlock) is held, which implies preemption disabled. 1325 - */ 1326 - __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr); 1317 + out: 1318 + if (nr) 1319 + __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr); 1327 1320 } 1328 1321 1329 1322 static void page_remove_anon_compound_rmap(struct page *page) ··· 1588 1589 1589 1590 /* MADV_FREE page check */ 1590 1591 if (!folio_test_swapbacked(folio)) { 1591 - if (!folio_test_dirty(folio)) { 1592 + int ref_count, map_count; 1593 + 1594 + /* 1595 + * Synchronize with gup_pte_range(): 1596 + * - clear PTE; barrier; read refcount 1597 + * - inc refcount; barrier; read PTE 1598 + */ 1599 + smp_mb(); 1600 + 1601 + ref_count = folio_ref_count(folio); 1602 + map_count = folio_mapcount(folio); 1603 + 1604 + /* 1605 + * Order reads for page refcount and dirty flag 1606 + * (see comments in __remove_mapping()). 1607 + */ 1608 + smp_rmb(); 1609 + 1610 + /* 1611 + * The only page refs must be one from isolation 1612 + * plus the rmap(s) (dropped by discard:). 
1613 + */ 1614 + if (ref_count == 1 + map_count && 1615 + !folio_test_dirty(folio)) { 1592 1616 /* Invalidate as we cleared the pte */ 1593 1617 mmu_notifier_invalidate_range(mm, 1594 1618 address, address + PAGE_SIZE); ··· 1874 1852 if (pte_swp_uffd_wp(pteval)) 1875 1853 swp_pte = pte_swp_mkuffd_wp(swp_pte); 1876 1854 set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); 1855 + trace_set_migration_pte(pvmw.address, pte_val(swp_pte), 1856 + compound_order(&folio->page)); 1877 1857 /* 1878 1858 * No need to invalidate here it will synchronize on 1879 1859 * against the special swap migration pte. ··· 1944 1920 if (pte_uffd_wp(pteval)) 1945 1921 swp_pte = pte_swp_mkuffd_wp(swp_pte); 1946 1922 set_pte_at(mm, address, pvmw.pte, swp_pte); 1923 + trace_set_migration_pte(address, pte_val(swp_pte), 1924 + compound_order(&folio->page)); 1947 1925 /* 1948 1926 * No need to invalidate here it will synchronize on 1949 1927 * against the special swap migration pte.
-4
mm/swap.c
··· 97 97 mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); 98 98 count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); 99 99 } 100 - __ClearPageWaiters(page); 101 100 } 102 101 103 102 static void __put_single_page(struct page *page) ··· 151 152 continue; 152 153 } 153 154 /* Cannot be PageLRU because it's passed to us using the lru */ 154 - __ClearPageWaiters(page); 155 155 } 156 156 157 157 free_unref_page_list(pages); ··· 968 970 dec_zone_page_state(page, NR_MLOCK); 969 971 count_vm_event(UNEVICTABLE_PGCLEARED); 970 972 } 971 - 972 - __ClearPageWaiters(page); 973 973 974 974 list_add(&page->lru, &pages_to_free); 975 975 }
-104
mm/swapfile.c
··· 1167 1167 return NULL; 1168 1168 } 1169 1169 1170 - static struct swap_info_struct *swap_info_get(swp_entry_t entry) 1171 - { 1172 - struct swap_info_struct *p; 1173 - 1174 - p = _swap_info_get(entry); 1175 - if (p) 1176 - spin_lock(&p->lock); 1177 - return p; 1178 - } 1179 - 1180 1170 static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry, 1181 1171 struct swap_info_struct *q) 1182 1172 { ··· 1589 1599 if (si) 1590 1600 return swap_page_trans_huge_swapped(si, entry); 1591 1601 return false; 1592 - } 1593 - 1594 - static int page_trans_huge_map_swapcount(struct page *page, 1595 - int *total_swapcount) 1596 - { 1597 - int i, map_swapcount, _total_swapcount; 1598 - unsigned long offset = 0; 1599 - struct swap_info_struct *si; 1600 - struct swap_cluster_info *ci = NULL; 1601 - unsigned char *map = NULL; 1602 - int swapcount = 0; 1603 - 1604 - /* hugetlbfs shouldn't call it */ 1605 - VM_BUG_ON_PAGE(PageHuge(page), page); 1606 - 1607 - if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) { 1608 - if (PageSwapCache(page)) 1609 - swapcount = page_swapcount(page); 1610 - if (total_swapcount) 1611 - *total_swapcount = swapcount; 1612 - return swapcount + page_trans_huge_mapcount(page); 1613 - } 1614 - 1615 - page = compound_head(page); 1616 - 1617 - _total_swapcount = map_swapcount = 0; 1618 - if (PageSwapCache(page)) { 1619 - swp_entry_t entry; 1620 - 1621 - entry.val = page_private(page); 1622 - si = _swap_info_get(entry); 1623 - if (si) { 1624 - map = si->swap_map; 1625 - offset = swp_offset(entry); 1626 - } 1627 - } 1628 - if (map) 1629 - ci = lock_cluster(si, offset); 1630 - for (i = 0; i < HPAGE_PMD_NR; i++) { 1631 - int mapcount = atomic_read(&page[i]._mapcount) + 1; 1632 - if (map) { 1633 - swapcount = swap_count(map[offset + i]); 1634 - _total_swapcount += swapcount; 1635 - } 1636 - map_swapcount = max(map_swapcount, mapcount + swapcount); 1637 - } 1638 - unlock_cluster(ci); 1639 - 1640 - if (PageDoubleMap(page)) 1641 - 
map_swapcount -= 1; 1642 - 1643 - if (total_swapcount) 1644 - *total_swapcount = _total_swapcount; 1645 - 1646 - return map_swapcount + compound_mapcount(page); 1647 - } 1648 - 1649 - /* 1650 - * We can write to an anon page without COW if there are no other references 1651 - * to it. And as a side-effect, free up its swap: because the old content 1652 - * on disk will never be read, and seeking back there to write new content 1653 - * later would only waste time away from clustering. 1654 - */ 1655 - bool reuse_swap_page(struct page *page) 1656 - { 1657 - int count, total_swapcount; 1658 - 1659 - VM_BUG_ON_PAGE(!PageLocked(page), page); 1660 - if (unlikely(PageKsm(page))) 1661 - return false; 1662 - count = page_trans_huge_map_swapcount(page, &total_swapcount); 1663 - if (count == 1 && PageSwapCache(page) && 1664 - (likely(!PageTransCompound(page)) || 1665 - /* The remaining swap count will be freed soon */ 1666 - total_swapcount == page_swapcount(page))) { 1667 - if (!PageWriteback(page)) { 1668 - page = compound_head(page); 1669 - delete_from_swap_cache(page); 1670 - SetPageDirty(page); 1671 - } else { 1672 - swp_entry_t entry; 1673 - struct swap_info_struct *p; 1674 - 1675 - entry.val = page_private(page); 1676 - p = swap_info_get(entry); 1677 - if (p->flags & SWP_STABLE_WRITES) { 1678 - spin_unlock(&p->lock); 1679 - return false; 1680 - } 1681 - spin_unlock(&p->lock); 1682 - } 1683 - } 1684 - 1685 - return count <= 1; 1686 1602 } 1687 1603 1688 1604 /*
+84 -15
mm/vmalloc.c
··· 74 74 75 75 bool is_vmalloc_addr(const void *x) 76 76 { 77 - unsigned long addr = (unsigned long)x; 77 + unsigned long addr = (unsigned long)kasan_reset_tag(x); 78 78 79 79 return addr >= VMALLOC_START && addr < VMALLOC_END; 80 80 } ··· 631 631 * just put it in the vmalloc space. 632 632 */ 633 633 #if defined(CONFIG_MODULES) && defined(MODULES_VADDR) 634 - unsigned long addr = (unsigned long)x; 634 + unsigned long addr = (unsigned long)kasan_reset_tag(x); 635 635 if (addr >= MODULES_VADDR && addr < MODULES_END) 636 636 return 1; 637 637 #endif ··· 795 795 struct vmap_area *va = NULL; 796 796 struct rb_node *n = vmap_area_root.rb_node; 797 797 798 + addr = (unsigned long)kasan_reset_tag((void *)addr); 799 + 798 800 while (n) { 799 801 struct vmap_area *tmp; 800 802 ··· 817 815 static struct vmap_area *__find_vmap_area(unsigned long addr) 818 816 { 819 817 struct rb_node *n = vmap_area_root.rb_node; 818 + 819 + addr = (unsigned long)kasan_reset_tag((void *)addr); 820 820 821 821 while (n) { 822 822 struct vmap_area *va; ··· 2170 2166 void vm_unmap_ram(const void *mem, unsigned int count) 2171 2167 { 2172 2168 unsigned long size = (unsigned long)count << PAGE_SHIFT; 2173 - unsigned long addr = (unsigned long)mem; 2169 + unsigned long addr = (unsigned long)kasan_reset_tag(mem); 2174 2170 struct vmap_area *va; 2175 2171 2176 2172 might_sleep(); ··· 2231 2227 mem = (void *)addr; 2232 2228 } 2233 2229 2234 - kasan_unpoison_vmalloc(mem, size); 2235 - 2236 2230 if (vmap_pages_range(addr, addr + size, PAGE_KERNEL, 2237 2231 pages, PAGE_SHIFT) < 0) { 2238 2232 vm_unmap_ram(mem, count); 2239 2233 return NULL; 2240 2234 } 2235 + 2236 + /* 2237 + * Mark the pages as accessible, now that they are mapped. 2238 + * With hardware tag-based KASAN, marking is skipped for 2239 + * non-VM_ALLOC mappings, see __kasan_unpoison_vmalloc(). 
2240 + */ 2241 + mem = kasan_unpoison_vmalloc(mem, size, KASAN_VMALLOC_PROT_NORMAL); 2241 2242 2242 2243 return mem; 2243 2244 } ··· 2469 2460 return NULL; 2470 2461 } 2471 2462 2472 - kasan_unpoison_vmalloc((void *)va->va_start, requested_size); 2473 - 2474 2463 setup_vmalloc_vm(area, va, flags, caller); 2464 + 2465 + /* 2466 + * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a 2467 + * best-effort approach, as they can be mapped outside of vmalloc code. 2468 + * For VM_ALLOC mappings, the pages are marked as accessible after 2469 + * getting mapped in __vmalloc_node_range(). 2470 + * With hardware tag-based KASAN, marking is skipped for 2471 + * non-VM_ALLOC mappings, see __kasan_unpoison_vmalloc(). 2472 + */ 2473 + if (!(flags & VM_ALLOC)) 2474 + area->addr = kasan_unpoison_vmalloc(area->addr, requested_size, 2475 + KASAN_VMALLOC_PROT_NORMAL); 2475 2476 2476 2477 return area; 2477 2478 } ··· 2566 2547 va->vm = NULL; 2567 2548 spin_unlock(&vmap_area_lock); 2568 2549 2569 - kasan_free_shadow(vm); 2550 + kasan_free_module_shadow(vm); 2570 2551 free_unmap_vmap_area(va); 2571 2552 2572 2553 return vm; ··· 3090 3071 const void *caller) 3091 3072 { 3092 3073 struct vm_struct *area; 3093 - void *addr; 3074 + void *ret; 3075 + kasan_vmalloc_flags_t kasan_flags = KASAN_VMALLOC_NONE; 3094 3076 unsigned long real_size = size; 3095 3077 unsigned long real_align = align; 3096 3078 unsigned int shift = PAGE_SHIFT; ··· 3144 3124 goto fail; 3145 3125 } 3146 3126 3147 - addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); 3148 - if (!addr) 3127 + /* 3128 + * Prepare arguments for __vmalloc_area_node() and 3129 + * kasan_unpoison_vmalloc(). 3130 + */ 3131 + if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL)) { 3132 + if (kasan_hw_tags_enabled()) { 3133 + /* 3134 + * Modify protection bits to allow tagging. 3135 + * This must be done before mapping. 
3136 + */ 3137 + prot = arch_vmap_pgprot_tagged(prot); 3138 + 3139 + /* 3140 + * Skip page_alloc poisoning and zeroing for physical 3141 + * pages backing VM_ALLOC mapping. Memory is instead 3142 + * poisoned and zeroed by kasan_unpoison_vmalloc(). 3143 + */ 3144 + gfp_mask |= __GFP_SKIP_KASAN_UNPOISON | __GFP_SKIP_ZERO; 3145 + } 3146 + 3147 + /* Take note that the mapping is PAGE_KERNEL. */ 3148 + kasan_flags |= KASAN_VMALLOC_PROT_NORMAL; 3149 + } 3150 + 3151 + /* Allocate physical pages and map them into vmalloc space. */ 3152 + ret = __vmalloc_area_node(area, gfp_mask, prot, shift, node); 3153 + if (!ret) 3149 3154 goto fail; 3155 + 3156 + /* 3157 + * Mark the pages as accessible, now that they are mapped. 3158 + * The init condition should match the one in post_alloc_hook() 3159 + * (except for the should_skip_init() check) to make sure that memory 3160 + * is initialized under the same conditions regardless of the enabled 3161 + * KASAN mode. 3162 + * Tag-based KASAN modes only assign tags to normal non-executable 3163 + * allocations, see __kasan_unpoison_vmalloc(). 3164 + */ 3165 + kasan_flags |= KASAN_VMALLOC_VM_ALLOC; 3166 + if (!want_init_on_free() && want_init_on_alloc(gfp_mask)) 3167 + kasan_flags |= KASAN_VMALLOC_INIT; 3168 + /* KASAN_VMALLOC_PROT_NORMAL already set if required. 
*/ 3169 + area->addr = kasan_unpoison_vmalloc(area->addr, real_size, kasan_flags); 3150 3170 3151 3171 /* 3152 3172 * In this function, newly allocated vm_struct has VM_UNINITIALIZED ··· 3199 3139 if (!(vm_flags & VM_DEFER_KMEMLEAK)) 3200 3140 kmemleak_vmalloc(area, size, gfp_mask); 3201 3141 3202 - return addr; 3142 + return area->addr; 3203 3143 3204 3144 fail: 3205 3145 if (shift > PAGE_SHIFT) { ··· 3483 3423 char *vaddr, *buf_start = buf; 3484 3424 unsigned long buflen = count; 3485 3425 unsigned long n; 3426 + 3427 + addr = kasan_reset_tag(addr); 3486 3428 3487 3429 /* Don't allow overflow */ 3488 3430 if ((unsigned long) addr + count < count) ··· 3871 3809 for (area = 0; area < nr_vms; area++) { 3872 3810 if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area])) 3873 3811 goto err_free_shadow; 3874 - 3875 - kasan_unpoison_vmalloc((void *)vas[area]->va_start, 3876 - sizes[area]); 3877 3812 } 3878 3813 3879 3814 /* insert all vm's */ ··· 3882 3823 pcpu_get_vm_areas); 3883 3824 } 3884 3825 spin_unlock(&vmap_area_lock); 3826 + 3827 + /* 3828 + * Mark allocated areas as accessible. Do it now as a best-effort 3829 + * approach, as they can be mapped outside of vmalloc code. 3830 + * With hardware tag-based KASAN, marking is skipped for 3831 + * non-VM_ALLOC mappings, see __kasan_unpoison_vmalloc(). 3832 + */ 3833 + for (area = 0; area < nr_vms; area++) 3834 + vms[area]->addr = kasan_unpoison_vmalloc(vms[area]->addr, 3835 + vms[area]->size, KASAN_VMALLOC_PROT_NORMAL); 3885 3836 3886 3837 kfree(vas); 3887 3838 return vms;
+10
tools/testing/selftests/kselftest.h
··· 28 28 * 29 29 * When all tests are finished, clean up and exit the program with one of: 30 30 * 31 + * ksft_finished(); 31 32 * ksft_exit(condition); 32 33 * ksft_exit_pass(); 33 34 * ksft_exit_fail(); ··· 235 234 else \ 236 235 ksft_exit_fail(); \ 237 236 } while (0) 237 + 238 + /** 239 + * ksft_finished() - Exit selftest with success if all tests passed 240 + */ 241 + #define ksft_finished() \ 242 + ksft_exit(ksft_plan == \ 243 + ksft_cnt.ksft_pass + \ 244 + ksft_cnt.ksft_xfail + \ 245 + ksft_cnt.ksft_xskip) 238 246 239 247 static inline int ksft_exit_fail_msg(const char *msg, ...) 240 248 {
+1
tools/testing/selftests/vm/.gitignore
··· 3 3 hugepage-mremap 4 4 hugepage-shm 5 5 hugepage-vmemmap 6 + hugetlb-madvise 6 7 khugepaged 7 8 map_hugetlb 8 9 map_populate
+1
tools/testing/selftests/vm/Makefile
··· 30 30 TEST_GEN_FILES = compaction_test 31 31 TEST_GEN_FILES += gup_test 32 32 TEST_GEN_FILES += hmm-tests 33 + TEST_GEN_FILES += hugetlb-madvise 33 34 TEST_GEN_FILES += hugepage-mmap 34 35 TEST_GEN_FILES += hugepage-mremap 35 36 TEST_GEN_FILES += hugepage-shm
+2 -1
tools/testing/selftests/vm/gup_test.c
··· 10 10 #include <assert.h> 11 11 #include "../../../../mm/gup_test.h" 12 12 13 + #include "util.h" 14 + 13 15 #define MB (1UL << 20) 14 - #define PAGE_SIZE sysconf(_SC_PAGESIZE) 15 16 16 17 /* Just the flags we need, copied from mm.h: */ 17 18 #define FOLL_WRITE 0x01 /* check pte is writable */
+410
tools/testing/selftests/vm/hugetlb-madvise.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * hugepage-madvise: 4 + * 5 + * Basic functional testing of madvise MADV_DONTNEED and MADV_REMOVE 6 + * on hugetlb mappings. 7 + * 8 + * Before running this test, make sure the administrator has pre-allocated 9 + * at least MIN_FREE_PAGES hugetlb pages and they are free. In addition, 10 + * the test takes an argument that is the path to a file in a hugetlbfs 11 + * filesystem. Therefore, a hugetlbfs filesystem must be mounted on some 12 + * directory. 13 + */ 14 + 15 + #include <stdlib.h> 16 + #include <stdio.h> 17 + #include <unistd.h> 18 + #include <sys/mman.h> 19 + #define __USE_GNU 20 + #include <fcntl.h> 21 + 22 + #define USAGE "USAGE: %s <hugepagefile_name>\n" 23 + #define MIN_FREE_PAGES 20 24 + #define NR_HUGE_PAGES 10 /* common number of pages to map/allocate */ 25 + 26 + #define validate_free_pages(exp_free) \ 27 + do { \ 28 + int fhp = get_free_hugepages(); \ 29 + if (fhp != (exp_free)) { \ 30 + printf("Unexpected number of free huge " \ 31 + "pages line %d\n", __LINE__); \ 32 + exit(1); \ 33 + } \ 34 + } while (0) 35 + 36 + unsigned long huge_page_size; 37 + unsigned long base_page_size; 38 + 39 + /* 40 + * default_huge_page_size copied from mlock2-tests.c 41 + */ 42 + unsigned long default_huge_page_size(void) 43 + { 44 + unsigned long hps = 0; 45 + char *line = NULL; 46 + size_t linelen = 0; 47 + FILE *f = fopen("/proc/meminfo", "r"); 48 + 49 + if (!f) 50 + return 0; 51 + while (getline(&line, &linelen, f) > 0) { 52 + if (sscanf(line, "Hugepagesize: %lu kB", &hps) == 1) { 53 + hps <<= 10; 54 + break; 55 + } 56 + } 57 + 58 + free(line); 59 + fclose(f); 60 + return hps; 61 + } 62 + 63 + unsigned long get_free_hugepages(void) 64 + { 65 + unsigned long fhp = 0; 66 + char *line = NULL; 67 + size_t linelen = 0; 68 + FILE *f = fopen("/proc/meminfo", "r"); 69 + 70 + if (!f) 71 + return fhp; 72 + while (getline(&line, &linelen, f) > 0) { 73 + if (sscanf(line, "HugePages_Free: %lu", &fhp) == 1) 74 + break; 
75 + } 76 + 77 + free(line); 78 + fclose(f); 79 + return fhp; 80 + } 81 + 82 + void write_fault_pages(void *addr, unsigned long nr_pages) 83 + { 84 + unsigned long i; 85 + 86 + for (i = 0; i < nr_pages; i++) 87 + *((unsigned long *)(addr + (i * huge_page_size))) = i; 88 + } 89 + 90 + void read_fault_pages(void *addr, unsigned long nr_pages) 91 + { 92 + unsigned long i, tmp; 93 + 94 + for (i = 0; i < nr_pages; i++) 95 + tmp += *((unsigned long *)(addr + (i * huge_page_size))); 96 + } 97 + 98 + int main(int argc, char **argv) 99 + { 100 + unsigned long free_hugepages; 101 + void *addr, *addr2; 102 + int fd; 103 + int ret; 104 + 105 + if (argc != 2) { 106 + printf(USAGE, argv[0]); 107 + exit(1); 108 + } 109 + 110 + huge_page_size = default_huge_page_size(); 111 + if (!huge_page_size) { 112 + printf("Unable to determine huge page size, exiting!\n"); 113 + exit(1); 114 + } 115 + base_page_size = sysconf(_SC_PAGE_SIZE); 116 + if (!huge_page_size) { 117 + printf("Unable to determine base page size, exiting!\n"); 118 + exit(1); 119 + } 120 + 121 + free_hugepages = get_free_hugepages(); 122 + if (free_hugepages < MIN_FREE_PAGES) { 123 + printf("Not enough free huge pages to test, exiting!\n"); 124 + exit(1); 125 + } 126 + 127 + fd = open(argv[1], O_CREAT | O_RDWR, 0755); 128 + if (fd < 0) { 129 + perror("Open failed"); 130 + exit(1); 131 + } 132 + 133 + /* 134 + * Test validity of MADV_DONTNEED addr and length arguments. mmap 135 + * size is NR_HUGE_PAGES + 2. One page at the beginning and end of 136 + * the mapping will be unmapped so we KNOW there is nothing mapped 137 + * there. 
138 + */ 139 + addr = mmap(NULL, (NR_HUGE_PAGES + 2) * huge_page_size, 140 + PROT_READ | PROT_WRITE, 141 + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 142 + -1, 0); 143 + if (addr == MAP_FAILED) { 144 + perror("mmap"); 145 + exit(1); 146 + } 147 + if (munmap(addr, huge_page_size) || 148 + munmap(addr + (NR_HUGE_PAGES + 1) * huge_page_size, 149 + huge_page_size)) { 150 + perror("munmap"); 151 + exit(1); 152 + } 153 + addr = addr + huge_page_size; 154 + 155 + write_fault_pages(addr, NR_HUGE_PAGES); 156 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 157 + 158 + /* addr before mapping should fail */ 159 + ret = madvise(addr - base_page_size, NR_HUGE_PAGES * huge_page_size, 160 + MADV_DONTNEED); 161 + if (!ret) { 162 + printf("Unexpected success of madvise call with invalid addr line %d\n", 163 + __LINE__); 164 + exit(1); 165 + } 166 + 167 + /* addr + length after mapping should fail */ 168 + ret = madvise(addr, (NR_HUGE_PAGES * huge_page_size) + base_page_size, 169 + MADV_DONTNEED); 170 + if (!ret) { 171 + printf("Unexpected success of madvise call with invalid length line %d\n", 172 + __LINE__); 173 + exit(1); 174 + } 175 + 176 + (void)munmap(addr, NR_HUGE_PAGES * huge_page_size); 177 + 178 + /* 179 + * Test alignment of MADV_DONTNEED addr and length arguments 180 + */ 181 + addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size, 182 + PROT_READ | PROT_WRITE, 183 + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 184 + -1, 0); 185 + if (addr == MAP_FAILED) { 186 + perror("mmap"); 187 + exit(1); 188 + } 189 + write_fault_pages(addr, NR_HUGE_PAGES); 190 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 191 + 192 + /* addr is not huge page size aligned and should fail */ 193 + ret = madvise(addr + base_page_size, 194 + NR_HUGE_PAGES * huge_page_size - base_page_size, 195 + MADV_DONTNEED); 196 + if (!ret) { 197 + printf("Unexpected success of madvise call with unaligned start address %d\n", 198 + __LINE__); 199 + exit(1); 200 + } 201 + 202 + /* addr + length should be 
aligned up to huge page size */ 203 + if (madvise(addr, 204 + ((NR_HUGE_PAGES - 1) * huge_page_size) + base_page_size, 205 + MADV_DONTNEED)) { 206 + perror("madvise"); 207 + exit(1); 208 + } 209 + 210 + /* should free all pages in mapping */ 211 + validate_free_pages(free_hugepages); 212 + 213 + (void)munmap(addr, NR_HUGE_PAGES * huge_page_size); 214 + 215 + /* 216 + * Test MADV_DONTNEED on anonymous private mapping 217 + */ 218 + addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size, 219 + PROT_READ | PROT_WRITE, 220 + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 221 + -1, 0); 222 + if (addr == MAP_FAILED) { 223 + perror("mmap"); 224 + exit(1); 225 + } 226 + write_fault_pages(addr, NR_HUGE_PAGES); 227 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 228 + 229 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) { 230 + perror("madvise"); 231 + exit(1); 232 + } 233 + 234 + /* should free all pages in mapping */ 235 + validate_free_pages(free_hugepages); 236 + 237 + (void)munmap(addr, NR_HUGE_PAGES * huge_page_size); 238 + 239 + /* 240 + * Test MADV_DONTNEED on private mapping of hugetlb file 241 + */ 242 + if (fallocate(fd, 0, 0, NR_HUGE_PAGES * huge_page_size)) { 243 + perror("fallocate"); 244 + exit(1); 245 + } 246 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 247 + 248 + addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size, 249 + PROT_READ | PROT_WRITE, 250 + MAP_PRIVATE, fd, 0); 251 + if (addr == MAP_FAILED) { 252 + perror("mmap"); 253 + exit(1); 254 + } 255 + 256 + /* read should not consume any pages */ 257 + read_fault_pages(addr, NR_HUGE_PAGES); 258 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 259 + 260 + /* madvise should not free any pages */ 261 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) { 262 + perror("madvise"); 263 + exit(1); 264 + } 265 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 266 + 267 + /* writes should allocate private pages */ 268 + write_fault_pages(addr, NR_HUGE_PAGES); 269 + 
validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES)); 270 + 271 + /* madvise should free private pages */ 272 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) { 273 + perror("madvise"); 274 + exit(1); 275 + } 276 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 277 + 278 + /* writes should allocate private pages */ 279 + write_fault_pages(addr, NR_HUGE_PAGES); 280 + validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES)); 281 + 282 + /* 283 + * The fallocate below certainly should free the pages associated 284 + * with the file. However, pages in the private mapping are also 285 + * freed. This is not the 'correct' behavior, but is expected 286 + * because this is how it has worked since the initial hugetlb 287 + * implementation. 288 + */ 289 + if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 290 + 0, NR_HUGE_PAGES * huge_page_size)) { 291 + perror("fallocate"); 292 + exit(1); 293 + } 294 + validate_free_pages(free_hugepages); 295 + 296 + (void)munmap(addr, NR_HUGE_PAGES * huge_page_size); 297 + 298 + /* 299 + * Test MADV_DONTNEED on shared mapping of hugetlb file 300 + */ 301 + if (fallocate(fd, 0, 0, NR_HUGE_PAGES * huge_page_size)) { 302 + perror("fallocate"); 303 + exit(1); 304 + } 305 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 306 + 307 + addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size, 308 + PROT_READ | PROT_WRITE, 309 + MAP_SHARED, fd, 0); 310 + if (addr == MAP_FAILED) { 311 + perror("mmap"); 312 + exit(1); 313 + } 314 + 315 + /* write should not consume any pages */ 316 + write_fault_pages(addr, NR_HUGE_PAGES); 317 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 318 + 319 + /* madvise should not free any pages */ 320 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) { 321 + perror("madvise"); 322 + exit(1); 323 + } 324 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 325 + 326 + /* 327 + * Test MADV_REMOVE on shared mapping of hugetlb file 328 + * 329 + * madvise is 
same as hole punch and should free all pages. 330 + */ 331 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_REMOVE)) { 332 + perror("madvise"); 333 + exit(1); 334 + } 335 + validate_free_pages(free_hugepages); 336 + (void)munmap(addr, NR_HUGE_PAGES * huge_page_size); 337 + 338 + /* 339 + * Test MADV_REMOVE on shared and private mapping of hugetlb file 340 + */ 341 + if (fallocate(fd, 0, 0, NR_HUGE_PAGES * huge_page_size)) { 342 + perror("fallocate"); 343 + exit(1); 344 + } 345 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 346 + 347 + addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size, 348 + PROT_READ | PROT_WRITE, 349 + MAP_SHARED, fd, 0); 350 + if (addr == MAP_FAILED) { 351 + perror("mmap"); 352 + exit(1); 353 + } 354 + 355 + /* shared write should not consume any additional pages */ 356 + write_fault_pages(addr, NR_HUGE_PAGES); 357 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 358 + 359 + addr2 = mmap(NULL, NR_HUGE_PAGES * huge_page_size, 360 + PROT_READ | PROT_WRITE, 361 + MAP_PRIVATE, fd, 0); 362 + if (addr2 == MAP_FAILED) { 363 + perror("mmap"); 364 + exit(1); 365 + } 366 + 367 + /* private read should not consume any pages */ 368 + read_fault_pages(addr2, NR_HUGE_PAGES); 369 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 370 + 371 + /* private write should consume additional pages */ 372 + write_fault_pages(addr2, NR_HUGE_PAGES); 373 + validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES)); 374 + 375 + /* madvise of shared mapping should not free any pages */ 376 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) { 377 + perror("madvise"); 378 + exit(1); 379 + } 380 + validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES)); 381 + 382 + /* madvise of private mapping should free private pages */ 383 + if (madvise(addr2, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) { 384 + perror("madvise"); 385 + exit(1); 386 + } 387 + validate_free_pages(free_hugepages - NR_HUGE_PAGES); 388 + 389 + /* private write 
should consume additional pages again */ 390 + write_fault_pages(addr2, NR_HUGE_PAGES); 391 + validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES)); 392 + 393 + /* 394 + * madvise should free both file and private pages although this is 395 + * not correct. private pages should not be freed, but this is 396 + * expected. See comment associated with FALLOC_FL_PUNCH_HOLE call. 397 + */ 398 + if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_REMOVE)) { 399 + perror("madvise"); 400 + exit(1); 401 + } 402 + validate_free_pages(free_hugepages); 403 + 404 + (void)munmap(addr, NR_HUGE_PAGES * huge_page_size); 405 + (void)munmap(addr2, NR_HUGE_PAGES * huge_page_size); 406 + 407 + close(fd); 408 + unlink(argv[1]); 409 + return 0; 410 + }
+1 -37
tools/testing/selftests/vm/ksm_tests.c
··· 12 12 13 13 #include "../kselftest.h" 14 14 #include "../../../../include/vdso/time64.h" 15 + #include "util.h" 15 16 16 17 #define KSM_SYSFS_PATH "/sys/kernel/mm/ksm/" 17 18 #define KSM_FP(s) (KSM_SYSFS_PATH s) ··· 22 21 #define KSM_USE_ZERO_PAGES_DEFAULT false 23 22 #define KSM_MERGE_ACROSS_NODES_DEFAULT true 24 23 #define MB (1ul << 20) 25 - 26 - #define PAGE_SHIFT 12 27 - #define HPAGE_SHIFT 21 28 - 29 - #define PAGE_SIZE (1 << PAGE_SHIFT) 30 - #define HPAGE_SIZE (1 << HPAGE_SHIFT) 31 - 32 - #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0) 33 - #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1)) 34 24 35 25 struct ksm_sysfs { 36 26 unsigned long max_page_sharing; ··· 446 454 numa_free(numa2_map_ptr, page_size); 447 455 printf("Not OK\n"); 448 456 return KSFT_FAIL; 449 - } 450 - 451 - int64_t allocate_transhuge(void *ptr, int pagemap_fd) 452 - { 453 - uint64_t ent[2]; 454 - 455 - /* drop pmd */ 456 - if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE, 457 - MAP_FIXED | MAP_ANONYMOUS | 458 - MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr) 459 - errx(2, "mmap transhuge"); 460 - 461 - if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE)) 462 - err(2, "MADV_HUGEPAGE"); 463 - 464 - /* allocate transparent huge page */ 465 - *(volatile void **)ptr = ptr; 466 - 467 - if (pread(pagemap_fd, ent, sizeof(ent), 468 - (uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent)) 469 - err(2, "read pagemap"); 470 - 471 - if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) && 472 - PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) && 473 - !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1))) 474 - return PAGEMAP_PFN(ent[0]); 475 - 476 - return -1; 477 457 } 478 458 479 459 static int ksm_merge_hugepages_time(int mapping, int prot, int timeout, size_t map_size)
+1 -1
tools/testing/selftests/vm/memfd_secret.c
··· 282 282 283 283 close(fd); 284 284 285 - ksft_exit(!ksft_get_fail_cnt()); 285 + ksft_finished(); 286 286 } 287 287 288 288 #else /* __NR_memfd_secret */
+13 -2
tools/testing/selftests/vm/run_vmtests.sh
··· 131 131 echo "[PASS]" 132 132 fi 133 133 134 + echo "-----------------------" 135 + echo "running hugetlb-madvise" 136 + echo "-----------------------" 137 + ./hugetlb-madvise $mnt/madvise-test 138 + if [ $? -ne 0 ]; then 139 + echo "[FAIL]" 140 + exitcode=1 141 + else 142 + echo "[PASS]" 143 + fi 144 + rm -f $mnt/madvise-test 145 + 134 146 echo "NOTE: The above hugetlb tests provide minimal coverage. Use" 135 147 echo " https://github.com/libhugetlbfs/libhugetlbfs.git for" 136 148 echo " hugetlb regression testing." ··· 208 196 echo "---------------------------" 209 197 # Test requires source and destination huge pages. Size of source 210 198 # (half_ufd_size_MB) is passed as argument to test. 211 - ./userfaultfd hugetlb $half_ufd_size_MB 32 $mnt/ufd_test_file 199 + ./userfaultfd hugetlb $half_ufd_size_MB 32 212 200 if [ $? -ne 0 ]; then 213 201 echo "[FAIL]" 214 202 exitcode=1 215 203 else 216 204 echo "[PASS]" 217 205 fi 218 - rm -f $mnt/ufd_test_file 219 206 220 207 echo "-------------------------" 221 208 echo "running userfaultfd_shmem"
+3 -38
tools/testing/selftests/vm/transhuge-stress.c
··· 15 15 #include <fcntl.h> 16 16 #include <string.h> 17 17 #include <sys/mman.h> 18 + #include "util.h" 18 19 19 - #define PAGE_SHIFT 12 20 - #define HPAGE_SHIFT 21 21 - 22 - #define PAGE_SIZE (1 << PAGE_SHIFT) 23 - #define HPAGE_SIZE (1 << HPAGE_SHIFT) 24 - 25 - #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0) 26 - #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1)) 27 - 28 - int pagemap_fd; 29 20 int backing_fd = -1; 30 21 int mmap_flags = MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE; 31 22 #define PROT_RW (PROT_READ | PROT_WRITE) 32 - 33 - int64_t allocate_transhuge(void *ptr) 34 - { 35 - uint64_t ent[2]; 36 - 37 - /* drop pmd */ 38 - if (mmap(ptr, HPAGE_SIZE, PROT_RW, MAP_FIXED | mmap_flags, 39 - backing_fd, 0) != ptr) 40 - errx(2, "mmap transhuge"); 41 - 42 - if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE)) 43 - err(2, "MADV_HUGEPAGE"); 44 - 45 - /* allocate transparent huge page */ 46 - *(volatile void **)ptr = ptr; 47 - 48 - if (pread(pagemap_fd, ent, sizeof(ent), 49 - (uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent)) 50 - err(2, "read pagemap"); 51 - 52 - if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) && 53 - PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) && 54 - !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1))) 55 - return PAGEMAP_PFN(ent[0]); 56 - 57 - return -1; 58 - } 59 23 60 24 int main(int argc, char **argv) 61 25 { ··· 31 67 double s; 32 68 uint8_t *map; 33 69 size_t map_len; 70 + int pagemap_fd; 34 71 35 72 ram = sysconf(_SC_PHYS_PAGES); 36 73 if (ram > SIZE_MAX / sysconf(_SC_PAGESIZE) / 4) ··· 87 122 for (p = ptr; p < ptr + len; p += HPAGE_SIZE) { 88 123 int64_t pfn; 89 124 90 - pfn = allocate_transhuge(p); 125 + pfn = allocate_transhuge(p, pagemap_fd); 91 126 92 127 if (pfn < 0) { 93 128 nr_failed++;
+35 -34
tools/testing/selftests/vm/userfaultfd.c
··· 89 89 static bool map_shared; 90 90 static int shm_fd; 91 91 static int huge_fd; 92 - static char *huge_fd_off0; 93 92 static unsigned long long *count_verify; 94 93 static int uffd = -1; 95 94 static int uffd_flags, finished, *pipefd; ··· 127 128 "./userfaultfd anon 100 99999\n\n" 128 129 "# Run share memory test on 1GiB region with 99 bounces:\n" 129 130 "./userfaultfd shmem 1000 99\n\n" 130 - "# Run hugetlb memory test on 256MiB region with 50 bounces (using /dev/hugepages/hugefile):\n" 131 - "./userfaultfd hugetlb 256 50 /dev/hugepages/hugefile\n\n" 132 - "# Run the same hugetlb test but using shmem:\n" 131 + "# Run hugetlb memory test on 256MiB region with 50 bounces:\n" 132 + "./userfaultfd hugetlb 256 50\n\n" 133 + "# Run the same hugetlb test but using shared file:\n" 133 134 "./userfaultfd hugetlb_shared 256 50 /dev/hugepages/hugefile\n\n" 134 135 "# 10MiB-~6GiB 999 bounces anonymous test, " 135 136 "continue forever unless an error triggers\n" ··· 226 227 227 228 static void hugetlb_release_pages(char *rel_area) 228 229 { 229 - if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 230 - rel_area == huge_fd_off0 ? 0 : nr_pages * page_size, 231 - nr_pages * page_size)) 232 - err("fallocate() failed"); 230 + if (!map_shared) { 231 + if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) 232 + err("madvise(MADV_DONTNEED) failed"); 233 + } else { 234 + if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) 235 + err("madvise(MADV_REMOVE) failed"); 236 + } 233 237 } 234 238 235 239 static void hugetlb_allocate_area(void **alloc_area) ··· 240 238 void *area_alias = NULL; 241 239 char **alloc_area_alias; 242 240 243 - *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, 244 - (map_shared ? MAP_SHARED : MAP_PRIVATE) | 245 - MAP_HUGETLB | 246 - (*alloc_area == area_src ? 0 : MAP_NORESERVE), 247 - huge_fd, *alloc_area == area_src ? 
0 : 248 - nr_pages * page_size); 241 + if (!map_shared) 242 + *alloc_area = mmap(NULL, 243 + nr_pages * page_size, 244 + PROT_READ | PROT_WRITE, 245 + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | 246 + (*alloc_area == area_src ? 0 : MAP_NORESERVE), 247 + -1, 248 + 0); 249 + else 250 + *alloc_area = mmap(NULL, 251 + nr_pages * page_size, 252 + PROT_READ | PROT_WRITE, 253 + MAP_SHARED | 254 + (*alloc_area == area_src ? 0 : MAP_NORESERVE), 255 + huge_fd, 256 + *alloc_area == area_src ? 0 : nr_pages * page_size); 249 257 if (*alloc_area == MAP_FAILED) 250 258 err("mmap of hugetlbfs file failed"); 251 259 252 260 if (map_shared) { 253 - area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, 254 - MAP_SHARED | MAP_HUGETLB, 255 - huge_fd, *alloc_area == area_src ? 0 : 256 - nr_pages * page_size); 261 + area_alias = mmap(NULL, 262 + nr_pages * page_size, 263 + PROT_READ | PROT_WRITE, 264 + MAP_SHARED, 265 + huge_fd, 266 + *alloc_area == area_src ? 0 : nr_pages * page_size); 257 267 if (area_alias == MAP_FAILED) 258 268 err("mmap of hugetlb file alias failed"); 259 269 } 260 270 261 271 if (*alloc_area == area_src) { 262 - huge_fd_off0 = *alloc_area; 263 272 alloc_area_alias = &area_src_alias; 264 273 } else { 265 274 alloc_area_alias = &area_dst_alias; ··· 283 270 { 284 271 if (!map_shared) 285 272 return; 286 - /* 287 - * We can't zap just the pagetable with hugetlbfs because 288 - * MADV_DONTEED won't work. So exercise -EEXIST on a alias 289 - * mapping where the pagetables are not established initially, 290 - * this way we'll exercise the -EEXEC at the fs level. 
291 - */ 273 + 292 274 *start = (unsigned long) area_dst_alias + offset; 293 275 } 294 276 ··· 436 428 uffd = -1; 437 429 } 438 430 439 - huge_fd_off0 = NULL; 440 431 munmap_area((void **)&area_src); 441 432 munmap_area((void **)&area_src_alias); 442 433 munmap_area((void **)&area_dst); ··· 933 926 struct sigaction act; 934 927 unsigned long signalled = 0; 935 928 936 - if (test_type != TEST_HUGETLB) 937 - split_nr_pages = (nr_pages + 1) / 2; 938 - else 939 - split_nr_pages = nr_pages; 929 + split_nr_pages = (nr_pages + 1) / 2; 940 930 941 931 if (signal_test) { 942 932 sigbuf = &jbuf; ··· 989 985 990 986 if (signal_test) 991 987 return signalled != split_nr_pages; 992 - 993 - if (test_type == TEST_HUGETLB) 994 - return 0; 995 988 996 989 area_dst = mremap(area_dst, nr_pages * page_size, nr_pages * page_size, 997 990 MREMAP_MAYMOVE | MREMAP_FIXED, area_src); ··· 1677 1676 } 1678 1677 nr_pages = nr_pages_per_cpu * nr_cpus; 1679 1678 1680 - if (test_type == TEST_HUGETLB) { 1679 + if (test_type == TEST_HUGETLB && map_shared) { 1681 1680 if (argc < 5) 1682 1681 usage(); 1683 1682 huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
+69
tools/testing/selftests/vm/util.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + 3 + #ifndef __KSELFTEST_VM_UTIL_H 4 + #define __KSELFTEST_VM_UTIL_H 5 + 6 + #include <stdint.h> 7 + #include <sys/mman.h> 8 + #include <err.h> 9 + #include <string.h> /* ffsl() */ 10 + #include <unistd.h> /* _SC_PAGESIZE */ 11 + 12 + static unsigned int __page_size; 13 + static unsigned int __page_shift; 14 + 15 + static inline unsigned int page_size(void) 16 + { 17 + if (!__page_size) 18 + __page_size = sysconf(_SC_PAGESIZE); 19 + return __page_size; 20 + } 21 + 22 + static inline unsigned int page_shift(void) 23 + { 24 + if (!__page_shift) 25 + __page_shift = (ffsl(page_size()) - 1); 26 + return __page_shift; 27 + } 28 + 29 + #define PAGE_SHIFT (page_shift()) 30 + #define PAGE_SIZE (page_size()) 31 + /* 32 + * On ppc64 this will only work with radix 2M hugepage size 33 + */ 34 + #define HPAGE_SHIFT 21 35 + #define HPAGE_SIZE (1 << HPAGE_SHIFT) 36 + 37 + #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0) 38 + #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1)) 39 + 40 + 41 + static inline int64_t allocate_transhuge(void *ptr, int pagemap_fd) 42 + { 43 + uint64_t ent[2]; 44 + 45 + /* drop pmd */ 46 + if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE, 47 + MAP_FIXED | MAP_ANONYMOUS | 48 + MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr) 49 + errx(2, "mmap transhuge"); 50 + 51 + if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE)) 52 + err(2, "MADV_HUGEPAGE"); 53 + 54 + /* allocate transparent huge page */ 55 + *(volatile void **)ptr = ptr; 56 + 57 + if (pread(pagemap_fd, ent, sizeof(ent), 58 + (uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent)) 59 + err(2, "read pagemap"); 60 + 61 + if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) && 62 + PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) && 63 + !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1))) 64 + return PAGEMAP_PFN(ent[0]); 65 + 66 + return -1; 67 + } 68 + 69 + #endif
+444 -56
tools/vm/page_owner_sort.c
··· 20 20 #include <string.h> 21 21 #include <regex.h> 22 22 #include <errno.h> 23 + #include <linux/types.h> 24 + #include <getopt.h> 25 + 26 + #define bool int 27 + #define true 1 28 + #define false 0 29 + #define TASK_COMM_LEN 16 23 30 24 31 struct block_list { 25 32 char *txt; 33 + char *comm; // task command name 34 + char *stacktrace; 35 + __u64 ts_nsec; 36 + __u64 free_ts_nsec; 26 37 int len; 27 38 int num; 28 39 int page_num; 40 + pid_t pid; 41 + pid_t tgid; 29 42 }; 30 - 31 - static int sort_by_memory; 43 + enum FILTER_BIT { 44 + FILTER_UNRELEASE = 1<<1, 45 + FILTER_PID = 1<<2, 46 + FILTER_TGID = 1<<3, 47 + FILTER_COMM = 1<<4 48 + }; 49 + enum CULL_BIT { 50 + CULL_UNRELEASE = 1<<1, 51 + CULL_PID = 1<<2, 52 + CULL_TGID = 1<<3, 53 + CULL_COMM = 1<<4, 54 + CULL_STACKTRACE = 1<<5 55 + }; 56 + struct filter_condition { 57 + pid_t tgid; 58 + pid_t pid; 59 + char comm[TASK_COMM_LEN]; 60 + }; 61 + static struct filter_condition fc; 32 62 static regex_t order_pattern; 63 + static regex_t pid_pattern; 64 + static regex_t tgid_pattern; 65 + static regex_t comm_pattern; 66 + static regex_t ts_nsec_pattern; 67 + static regex_t free_ts_nsec_pattern; 33 68 static struct block_list *list; 34 69 static int list_size; 35 70 static int max_size; 36 - 37 - struct block_list *block_head; 71 + static int cull; 72 + static int filter; 38 73 39 74 int read_block(char *buf, int buf_size, FILE *fin) 40 75 { ··· 93 58 return strcmp(l1->txt, l2->txt); 94 59 } 95 60 61 + static int compare_stacktrace(const void *p1, const void *p2) 62 + { 63 + const struct block_list *l1 = p1, *l2 = p2; 64 + 65 + return strcmp(l1->stacktrace, l2->stacktrace); 66 + } 67 + 96 68 static int compare_num(const void *p1, const void *p2) 97 69 { 98 70 const struct block_list *l1 = p1, *l2 = p2; ··· 114 72 return l2->page_num - l1->page_num; 115 73 } 116 74 117 - static int get_page_num(char *buf) 75 + static int compare_pid(const void *p1, const void *p2) 118 76 { 119 - int err, val_len, order_val; 120 - 
char order_str[4] = {0}; 121 - char *endptr; 77 + const struct block_list *l1 = p1, *l2 = p2; 78 + 79 + return l1->pid - l2->pid; 80 + } 81 + 82 + static int compare_tgid(const void *p1, const void *p2) 83 + { 84 + const struct block_list *l1 = p1, *l2 = p2; 85 + 86 + return l1->tgid - l2->tgid; 87 + } 88 + 89 + static int compare_comm(const void *p1, const void *p2) 90 + { 91 + const struct block_list *l1 = p1, *l2 = p2; 92 + 93 + return strcmp(l1->comm, l2->comm); 94 + } 95 + 96 + static int compare_ts(const void *p1, const void *p2) 97 + { 98 + const struct block_list *l1 = p1, *l2 = p2; 99 + 100 + return l1->ts_nsec < l2->ts_nsec ? -1 : 1; 101 + } 102 + 103 + static int compare_free_ts(const void *p1, const void *p2) 104 + { 105 + const struct block_list *l1 = p1, *l2 = p2; 106 + 107 + return l1->free_ts_nsec < l2->free_ts_nsec ? -1 : 1; 108 + } 109 + 110 + 111 + static int compare_release(const void *p1, const void *p2) 112 + { 113 + const struct block_list *l1 = p1, *l2 = p2; 114 + 115 + if (!l1->free_ts_nsec && !l2->free_ts_nsec) 116 + return 0; 117 + if (l1->free_ts_nsec && l2->free_ts_nsec) 118 + return 0; 119 + return l1->free_ts_nsec ? 
1 : -1;
+}
+
+
+static int compare_cull_condition(const void *p1, const void *p2)
+{
+	if (cull == 0)
+		return compare_txt(p1, p2);
+	if ((cull & CULL_STACKTRACE) && compare_stacktrace(p1, p2))
+		return compare_stacktrace(p1, p2);
+	if ((cull & CULL_PID) && compare_pid(p1, p2))
+		return compare_pid(p1, p2);
+	if ((cull & CULL_TGID) && compare_tgid(p1, p2))
+		return compare_tgid(p1, p2);
+	if ((cull & CULL_COMM) && compare_comm(p1, p2))
+		return compare_comm(p1, p2);
+	if ((cull & CULL_UNRELEASE) && compare_release(p1, p2))
+		return compare_release(p1, p2);
+	return 0;
+}
+
+static int search_pattern(regex_t *pattern, char *pattern_str, char *buf)
+{
+	int err, val_len;
 	regmatch_t pmatch[2];
 
-	err = regexec(&order_pattern, buf, 2, pmatch, REG_NOTBOL);
+	err = regexec(pattern, buf, 2, pmatch, REG_NOTBOL);
 	if (err != 0 || pmatch[1].rm_so == -1) {
-		printf("no order pattern in %s\n", buf);
-		return 0;
+		printf("no matching pattern in %s\n", buf);
+		return -1;
 	}
 	val_len = pmatch[1].rm_eo - pmatch[1].rm_so;
-	if (val_len > 2) /* max_order should not exceed 2 digits */
-		goto wrong_order;
 
-	memcpy(order_str, buf + pmatch[1].rm_so, val_len);
+	memcpy(pattern_str, buf + pmatch[1].rm_so, val_len);
 
+	return 0;
+}
+
+static void check_regcomp(regex_t *pattern, const char *regex)
+{
+	int err;
+
+	err = regcomp(pattern, regex, REG_EXTENDED | REG_NEWLINE);
+	if (err != 0 || pattern->re_nsub != 1) {
+		printf("Invalid pattern %s code %d\n", regex, err);
+		exit(1);
+	}
+}
+
+static char **explode(char sep, const char *str, int *size)
+{
+	int count = 0, len = strlen(str);
+	int lastindex = -1, j = 0;
+
+	for (int i = 0; i < len; i++)
+		if (str[i] == sep)
+			count++;
+	char **ret = calloc(++count, sizeof(char *));
+
+	for (int i = 0; i < len; i++) {
+		if (str[i] == sep) {
+			ret[j] = calloc(i - lastindex, sizeof(char));
+			memcpy(ret[j++], str + lastindex + 1, i - lastindex - 1);
+			lastindex = i;
+		}
+	}
+	if (lastindex <= len - 1) {
+		ret[j] = calloc(len - lastindex, sizeof(char));
+		memcpy(ret[j++], str + lastindex + 1, strlen(str) - 1 - lastindex);
+	}
+	*size = j;
+	return ret;
+}
+
+static void free_explode(char **arr, int size)
+{
+	for (int i = 0; i < size; i++)
+		free(arr[i]);
+	free(arr);
+}
+
+#define FIELD_BUFF 25
+
+static int get_page_num(char *buf)
+{
+	int order_val;
+	char order_str[FIELD_BUFF] = {0};
+	char *endptr;
+
+	search_pattern(&order_pattern, order_str, buf);
 	errno = 0;
 	order_val = strtol(order_str, &endptr, 10);
-	if (errno != 0 || endptr == order_str || *endptr != '\0')
-		goto wrong_order;
+	if (order_val > 64 || errno != 0 || endptr == order_str || *endptr != '\0') {
+		printf("wrong order in follow buf:\n%s\n", buf);
+		return 0;
+	}
 
 	return 1 << order_val;
+}
 
-wrong_order:
-	printf("wrong order in follow buf:\n%s\n", buf);
-	return 0;
+static pid_t get_pid(char *buf)
+{
+	pid_t pid;
+	char pid_str[FIELD_BUFF] = {0};
+	char *endptr;
+
+	search_pattern(&pid_pattern, pid_str, buf);
+	errno = 0;
+	pid = strtol(pid_str, &endptr, 10);
+	if (errno != 0 || endptr == pid_str || *endptr != '\0') {
+		printf("wrong/invalid pid in follow buf:\n%s\n", buf);
+		return -1;
+	}
+
+	return pid;
+
+}
+
+static pid_t get_tgid(char *buf)
+{
+	pid_t tgid;
+	char tgid_str[FIELD_BUFF] = {0};
+	char *endptr;
+
+	search_pattern(&tgid_pattern, tgid_str, buf);
+	errno = 0;
+	tgid = strtol(tgid_str, &endptr, 10);
+	if (errno != 0 || endptr == tgid_str || *endptr != '\0') {
+		printf("wrong/invalid tgid in follow buf:\n%s\n", buf);
+		return -1;
+	}
+
+	return tgid;
+
+}
+
+static __u64 get_ts_nsec(char *buf)
+{
+	__u64 ts_nsec;
+	char ts_nsec_str[FIELD_BUFF] = {0};
+	char *endptr;
+
+	search_pattern(&ts_nsec_pattern, ts_nsec_str, buf);
+	errno = 0;
+	ts_nsec = strtoull(ts_nsec_str, &endptr, 10);
+	if (errno != 0 || endptr == ts_nsec_str || *endptr != '\0') {
+		printf("wrong ts_nsec in follow buf:\n%s\n", buf);
+		return -1;
+	}
+
+	return ts_nsec;
+}
+
+static __u64 get_free_ts_nsec(char *buf)
+{
+	__u64 free_ts_nsec;
+	char free_ts_nsec_str[FIELD_BUFF] = {0};
+	char *endptr;
+
+	search_pattern(&free_ts_nsec_pattern, free_ts_nsec_str, buf);
+	errno = 0;
+	free_ts_nsec = strtoull(free_ts_nsec_str, &endptr, 10);
+	if (errno != 0 || endptr == free_ts_nsec_str || *endptr != '\0') {
+		printf("wrong free_ts_nsec in follow buf:\n%s\n", buf);
+		return -1;
+	}
+
+	return free_ts_nsec;
+}
+
+static char *get_comm(char *buf)
+{
+	char *comm_str = malloc(TASK_COMM_LEN);
+
+	memset(comm_str, 0, TASK_COMM_LEN);
+
+	search_pattern(&comm_pattern, comm_str, buf);
+	errno = 0;
+	if (errno != 0) {
+		printf("wrong comm in follow buf:\n%s\n", buf);
+		return NULL;
+	}
+
+	return comm_str;
+}
+
+static bool is_need(char *buf)
+{
+	if ((filter & FILTER_UNRELEASE) && get_free_ts_nsec(buf) != 0)
+		return false;
+	if ((filter & FILTER_PID) && get_pid(buf) != fc.pid)
+		return false;
+	if ((filter & FILTER_TGID) && get_tgid(buf) != fc.tgid)
+		return false;
+
+	char *comm = get_comm(buf);
+
+	if ((filter & FILTER_COMM) &&
+	strncmp(comm, fc.comm, TASK_COMM_LEN) != 0) {
+		free(comm);
+		return false;
+	}
+	return true;
 }
 
 static void add_list(char *buf, int len)
 {
 	if (list_size != 0 &&
-		len == list[list_size-1].len &&
-		memcmp(buf, list[list_size-1].txt, len) == 0) {
+	    len == list[list_size-1].len &&
+	    memcmp(buf, list[list_size-1].txt, len) == 0) {
 		list[list_size-1].num++;
 		list[list_size-1].page_num += get_page_num(buf);
 		return;
···
 		printf("max_size too small??\n");
 		exit(1);
 	}
+	if (!is_need(buf))
+		return;
+	list[list_size].pid = get_pid(buf);
+	list[list_size].tgid = get_tgid(buf);
+	list[list_size].comm = get_comm(buf);
 	list[list_size].txt = malloc(len+1);
+	if (!list[list_size].txt) {
+		printf("Out of memory\n");
+		exit(1);
+	}
+	memcpy(list[list_size].txt, buf, len);
+	list[list_size].txt[len] = 0;
 	list[list_size].len = len;
 	list[list_size].num = 1;
 	list[list_size].page_num = get_page_num(buf);
-	memcpy(list[list_size].txt, buf, len);
-	list[list_size].txt[len] = 0;
+
+	list[list_size].stacktrace = strchr(list[list_size].txt, '\n') ?: "";
+	if (*list[list_size].stacktrace == '\n')
+		list[list_size].stacktrace++;
+	list[list_size].ts_nsec = get_ts_nsec(buf);
+	list[list_size].free_ts_nsec = get_free_ts_nsec(buf);
 	list_size++;
 	if (list_size % 1000 == 0) {
 		printf("loaded %d\r", list_size);
···
 	}
 }
 
+static bool parse_cull_args(const char *arg_str)
+{
+	int size = 0;
+	char **args = explode(',', arg_str, &size);
+
+	for (int i = 0; i < size; ++i)
+		if (!strcmp(args[i], "pid") || !strcmp(args[i], "p"))
+			cull |= CULL_PID;
+		else if (!strcmp(args[i], "tgid") || !strcmp(args[i], "tg"))
+			cull |= CULL_TGID;
+		else if (!strcmp(args[i], "name") || !strcmp(args[i], "n"))
+			cull |= CULL_COMM;
+		else if (!strcmp(args[i], "stacktrace") || !strcmp(args[i], "st"))
+			cull |= CULL_STACKTRACE;
+		else if (!strcmp(args[i], "free") || !strcmp(args[i], "f"))
+			cull |= CULL_UNRELEASE;
+		else {
+			free_explode(args, size);
+			return false;
+		}
+	free_explode(args, size);
+	return true;
+}
+
 #define BUF_SIZE	(128 * 1024)
 
 static void usage(void)
 {
-	printf("Usage: ./page_owner_sort [-m] <input> <output>\n"
-		"-m	Sort by total memory. If this option is unset, sort by times\n"
+	printf("Usage: ./page_owner_sort [OPTIONS] <input> <output>\n"
+		"-m\t\tSort by total memory.\n"
+		"-s\t\tSort by the stack trace.\n"
+		"-t\t\tSort by times (default).\n"
+		"-p\t\tSort by pid.\n"
+		"-P\t\tSort by tgid.\n"
+		"-n\t\tSort by task command name.\n"
+		"-a\t\tSort by memory allocate time.\n"
+		"-r\t\tSort by memory release time.\n"
+		"-c\t\tCull by comparing stacktrace instead of total block.\n"
+		"-f\t\tFilter out the information of blocks whose memory has been released.\n"
+		"--pid <PID>\tSelect by pid. This selects the information of blocks whose process ID number equals to <PID>.\n"
+		"--tgid <TGID>\tSelect by tgid. This selects the information of blocks whose Thread Group ID number equals to <TGID>.\n"
+		"--name <command>\n\t\tSelect by command name. This selects the information of blocks whose command name identical to <command>.\n"
+		"--cull <rules>\tCull by user-defined rules. <rules> is a single argument in the form of a comma-separated list with some common fields predefined\n"
 	);
 }
 
 int main(int argc, char **argv)
 {
+	int (*cmp)(const void *, const void *) = compare_num;
 	FILE *fin, *fout;
-	char *buf;
+	char *buf, *endptr;
 	int ret, i, count;
-	struct block_list *list2;
 	struct stat st;
-	int err;
 	int opt;
+	struct option longopts[] = {
+		{ "pid", required_argument, NULL, 1 },
+		{ "tgid", required_argument, NULL, 2 },
+		{ "name", required_argument, NULL, 3 },
+		{ "cull", required_argument, NULL, 4 },
+		{ 0, 0, 0, 0},
+	};
 
-	while ((opt = getopt(argc, argv, "m")) != -1)
+	while ((opt = getopt_long(argc, argv, "acfmnprstP", longopts, NULL)) != -1)
 		switch (opt) {
+		case 'a':
+			cmp = compare_ts;
+			break;
+		case 'c':
+			cull = cull | CULL_STACKTRACE;
+			break;
+		case 'f':
+			filter = filter | FILTER_UNRELEASE;
+			break;
 		case 'm':
-			sort_by_memory = 1;
+			cmp = compare_page_num;
+			break;
+		case 'p':
+			cmp = compare_pid;
+			break;
+		case 'r':
+			cmp = compare_free_ts;
+			break;
+		case 's':
+			cmp = compare_stacktrace;
+			break;
+		case 't':
+			cmp = compare_num;
+			break;
+		case 'P':
+			cmp = compare_tgid;
+			break;
+		case 'n':
+			cmp = compare_comm;
+			break;
+		case 1:
+			filter = filter | FILTER_PID;
+			errno = 0;
+			fc.pid = strtol(optarg, &endptr, 10);
+			if (errno != 0 || endptr == optarg || *endptr != '\0') {
+				printf("wrong/invalid pid in from the command line:%s\n", optarg);
+				exit(1);
+			}
+			break;
+		case 2:
+			filter = filter | FILTER_TGID;
+			errno = 0;
+			fc.tgid = strtol(optarg, &endptr, 10);
+			if (errno != 0 || endptr == optarg || *endptr != '\0') {
+				printf("wrong/invalid tgid in from the command line:%s\n", optarg);
+				exit(1);
+			}
+			break;
+		case 3:
+			filter = filter | FILTER_COMM;
+			strncpy(fc.comm, optarg, TASK_COMM_LEN);
+			fc.comm[TASK_COMM_LEN-1] = '\0';
+			break;
+		case 4:
+			if (!parse_cull_args(optarg)) {
+				printf("wrong argument after --cull in from the command line:%s\n",
+						optarg);
+				exit(1);
+			}
 			break;
 		default:
 			usage();
···
 		exit(1);
 	}
 
-	err = regcomp(&order_pattern, "order\\s*([0-9]*),", REG_EXTENDED|REG_NEWLINE);
-	if (err != 0 || order_pattern.re_nsub != 1) {
-		printf("%s: Invalid pattern 'order\\s*([0-9]*),' code %d\n",
-				argv[0], err);
-		exit(1);
-	}
-
+	check_regcomp(&order_pattern, "order\\s*([0-9]*),");
+	check_regcomp(&pid_pattern, "pid\\s*([0-9]*),");
+	check_regcomp(&tgid_pattern, "tgid\\s*([0-9]*) ");
+	check_regcomp(&comm_pattern, "tgid\\s*[0-9]*\\s*\\((.*)\\),\\s*ts");
+	check_regcomp(&ts_nsec_pattern, "ts\\s*([0-9]*)\\s*ns,");
+	check_regcomp(&free_ts_nsec_pattern, "free_ts\\s*([0-9]*)\\s*ns");
 	fstat(fileno(fin), &st);
 	max_size = st.st_size / 100; /* hack ... */
···
 
 	printf("sorting ....\n");
 
-	qsort(list, list_size, sizeof(list[0]), compare_txt);
-
-	list2 = malloc(sizeof(*list) * list_size);
-	if (!list2) {
-		printf("Out of memory\n");
-		exit(1);
-	}
+	qsort(list, list_size, sizeof(list[0]), compare_cull_condition);
 
 	printf("culling\n");
 
 	for (i = count = 0; i < list_size; i++) {
 		if (count == 0 ||
-		    strcmp(list2[count-1].txt, list[i].txt) != 0) {
-			list2[count++] = list[i];
+		    compare_cull_condition((void *)(&list[count-1]), (void *)(&list[i])) != 0) {
+			list[count++] = list[i];
 		} else {
-			list2[count-1].num += list[i].num;
-			list2[count-1].page_num += list[i].page_num;
+			list[count-1].num += list[i].num;
+			list[count-1].page_num += list[i].page_num;
 		}
 	}
 
-	if (sort_by_memory)
-		qsort(list2, count, sizeof(list[0]), compare_page_num);
-	else
-		qsort(list2, count, sizeof(list[0]), compare_num);
+	qsort(list, count, sizeof(list[0]), cmp);
 
-	for (i = 0; i < count; i++)
-		fprintf(fout, "%d times, %d pages:\n%s\n",
-			list2[i].num, list2[i].page_num, list2[i].txt);
-
+	for (i = 0; i < count; i++) {
+		if (cull == 0)
+			fprintf(fout, "%d times, %d pages:\n%s\n",
+					list[i].num, list[i].page_num, list[i].txt);
+		else {
+			fprintf(fout, "%d times, %d pages",
+					list[i].num, list[i].page_num);
+			if (cull & CULL_PID || filter & FILTER_PID)
+				fprintf(fout, ", PID %d", list[i].pid);
+			if (cull & CULL_TGID || filter & FILTER_TGID)
+				fprintf(fout, ", TGID %d", list[i].pid);
+			if (cull & CULL_COMM || filter & FILTER_COMM)
+				fprintf(fout, ", task_comm_name: %s", list[i].comm);
+			if (cull & CULL_UNRELEASE)
+				fprintf(fout, " (%s)",
+						list[i].free_ts_nsec ? "UNRELEASED" : "RELEASED");
+			if (cull & CULL_STACKTRACE)
+				fprintf(fout, ":\n%s", list[i].stacktrace);
+			fprintf(fout, "\n");
+		}
+	}
 	regfree(&order_pattern);
+	regfree(&pid_pattern);
+	regfree(&tgid_pattern);
+	regfree(&comm_pattern);
+	regfree(&ts_nsec_pattern);
+	regfree(&free_ts_nsec_pattern);
 	return 0;
 }