Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"173 patches.

Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...

+3519 -1723
+24
Documentation/ABI/testing/sysfs-kernel-mm-numa
··· 1 + What: /sys/kernel/mm/numa/ 2 + Date: June 2021 3 + Contact: Linux memory management mailing list <linux-mm@kvack.org> 4 + Description: Interface for NUMA 5 + 6 + What: /sys/kernel/mm/numa/demotion_enabled 7 + Date: June 2021 8 + Contact: Linux memory management mailing list <linux-mm@kvack.org> 9 + Description: Enable/disable demoting pages during reclaim 10 + 11 + Page migration during reclaim is intended for systems 12 + with tiered memory configurations. These systems have 13 + multiple types of memory with varied performance 14 + characteristics instead of plain NUMA systems where 15 + the same kind of memory is found at varied distances. 16 + Allowing page migration during reclaim enables these 17 + systems to migrate pages from fast tiers to slow tiers 18 + when the fast tier is under pressure. This migration 19 + is performed before swap. It may move data to a NUMA 20 + node that does not fall into the cpuset of the 21 + allocating process which might be construed to violate 22 + the guarantees of cpusets. This should not be enabled 23 + on systems which need strict cpuset location 24 + guarantees.
+11 -4
Documentation/admin-guide/mm/numa_memory_policy.rst
··· 245 245 address range or file. During system boot up, the temporary 246 246 interleaved system default policy works in this mode. 247 247 248 + MPOL_PREFERRED_MANY 249 + This mode specifies that the allocation should be preferably 250 + satisfied from the nodemask specified in the policy. If there is 251 + memory pressure on all nodes in the nodemask, the allocation 252 + can fall back to all existing numa nodes. This is effectively 253 + MPOL_PREFERRED allowed for a mask rather than a single node. 254 + 248 255 NUMA memory policy supports the following optional mode flags: 249 256 250 257 MPOL_F_STATIC_NODES ··· 260 253 nodes changes after the memory policy has been defined. 261 254 262 255 Without this flag, any time a mempolicy is rebound because of a 263 - change in the set of allowed nodes, the node (Preferred) or 264 - nodemask (Bind, Interleave) is remapped to the new set of 265 - allowed nodes. This may result in nodes being used that were 266 - previously undesired. 256 + change in the set of allowed nodes, the preferred nodemask (Preferred 257 + Many), preferred node (Preferred) or nodemask (Bind, Interleave) is 258 + remapped to the new set of allowed nodes. This may result in nodes 259 + being used that were previously undesired. 267 260 268 261 With this flag, if the user-specified nodes overlap with the 269 262 nodes allowed by the task's cpuset, then the memory policy is
+2 -1
Documentation/admin-guide/sysctl/vm.rst
··· 118 118 119 119 This tunable takes a value in the range [0, 100] with a default value of 120 120 20. This tunable determines how aggressively compaction is done in the 121 - background. Setting it to 0 disables proactive compaction. 121 + background. Writing a non-zero value to this tunable will immediately 122 + trigger proactive compaction. Setting it to 0 disables proactive compaction. 122 123 123 124 Note that compaction has a non-trivial system-wide impact as pages 124 125 belonging to different processes are moved around, which could also lead
+32 -44
Documentation/core-api/cachetlb.rst
··· 271 271 272 272 ``void flush_dcache_page(struct page *page)`` 273 273 274 - Any time the kernel writes to a page cache page, _OR_ 275 - the kernel is about to read from a page cache page and 276 - user space shared/writable mappings of this page potentially 277 - exist, this routine is called. 274 + This routine must be called when: 275 + 276 + a) the kernel did write to a page that is in the page cache page 277 + and / or in high memory 278 + b) the kernel is about to read from a page cache page and user space 279 + shared/writable mappings of this page potentially exist. Note 280 + that {get,pin}_user_pages{_fast} already call flush_dcache_page 281 + on any page found in the user address space and thus driver 282 + code rarely needs to take this into account. 278 283 279 284 .. note:: 280 285 ··· 289 284 handling vfs symlinks in the page cache need not call 290 285 this interface at all. 291 286 292 - The phrase "kernel writes to a page cache page" means, 293 - specifically, that the kernel executes store instructions 294 - that dirty data in that page at the page->virtual mapping 295 - of that page. It is important to flush here to handle 296 - D-cache aliasing, to make sure these kernel stores are 297 - visible to user space mappings of that page. 287 + The phrase "kernel writes to a page cache page" means, specifically, 288 + that the kernel executes store instructions that dirty data in that 289 + page at the page->virtual mapping of that page. It is important to 290 + flush here to handle D-cache aliasing, to make sure these kernel stores 291 + are visible to user space mappings of that page. 298 292 299 - The corollary case is just as important, if there are users 300 - which have shared+writable mappings of this file, we must make 301 - sure that kernel reads of these pages will see the most recent 302 - stores done by the user.
293 + The corollary case is just as important, if there are users which have 294 + shared+writable mappings of this file, we must make sure that kernel 295 + reads of these pages will see the most recent stores done by the user. 303 296 304 - If D-cache aliasing is not an issue, this routine may 305 - simply be defined as a nop on that architecture. 297 + If D-cache aliasing is not an issue, this routine may simply be defined 298 + as a nop on that architecture. 306 299 307 - There is a bit set aside in page->flags (PG_arch_1) as 308 - "architecture private". The kernel guarantees that, 309 - for pagecache pages, it will clear this bit when such 310 - a page first enters the pagecache. 300 + There is a bit set aside in page->flags (PG_arch_1) as "architecture 301 + private". The kernel guarantees that, for pagecache pages, it will 302 + clear this bit when such a page first enters the pagecache. 311 303 312 - This allows these interfaces to be implemented much more 313 - efficiently. It allows one to "defer" (perhaps indefinitely) 314 - the actual flush if there are currently no user processes 315 - mapping this page. See sparc64's flush_dcache_page and 316 - update_mmu_cache implementations for an example of how to go 317 - about doing this. 304 + This allows these interfaces to be implemented much more efficiently. 305 + It allows one to "defer" (perhaps indefinitely) the actual flush if 306 + there are currently no user processes mapping this page. See sparc64's 307 + flush_dcache_page and update_mmu_cache implementations for an example 308 + of how to go about doing this. 318 309 319 - The idea is, first at flush_dcache_page() time, if 320 - page->mapping->i_mmap is an empty tree, just mark the architecture 321 - private page flag bit. Later, in update_mmu_cache(), a check is 322 - made of this flag bit, and if set the flush is done and the flag 323 - bit is cleared. 
310 + The idea is, first at flush_dcache_page() time, if page_file_mapping() 311 + returns a mapping, and mapping_mapped on that mapping returns %false, 312 + just mark the architecture private page flag bit. Later, in 313 + update_mmu_cache(), a check is made of this flag bit, and if set the 314 + flush is done and the flag bit is cleared. 324 315 325 316 .. important:: 326 317 ··· 351 350 implementation is a nop (and should remain so for all coherent 352 351 architectures). For incoherent architectures, it should flush 353 352 the cache of the page at vmaddr. 354 - 355 - ``void flush_kernel_dcache_page(struct page *page)`` 356 - 357 - When the kernel needs to modify a user page is has obtained 358 - with kmap, it calls this function after all modifications are 359 - complete (but before kunmapping it) to bring the underlying 360 - page up to date. It is assumed here that the user has no 361 - incoherent cached copies (i.e. the original page was obtained 362 - from a mechanism like get_user_pages()). The default 363 - implementation is a nop and should remain so on all coherent 364 - architectures. On incoherent architectures, this should flush 365 - the kernel cache for page (using page_address(page)). 366 - 367 353 368 354 ``void flush_icache_range(unsigned long start, unsigned long end)`` 369 355
+8 -5
Documentation/dev-tools/kasan.rst
··· 181 181 With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This 182 182 effectively disables ``panic_on_warn`` for KASAN reports. 183 183 184 + Alternatively, independent of ``panic_on_warn`` the ``kasan.fault=`` boot 185 + parameter can be used to control panic and reporting behaviour: 186 + 187 + - ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN 188 + report or also panic the kernel (default: ``report``). The panic happens even 189 + if ``kasan_multi_shot`` is enabled. 190 + 184 191 Hardware tag-based KASAN mode (see the section about various modes below) is 185 192 intended for use in production as a security mitigation. Therefore, it supports 186 - boot parameters that allow disabling KASAN or controlling its features. 193 + additional boot parameters that allow disabling KASAN or controlling features: 187 194 188 195 - ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``). 189 196 ··· 205 198 206 199 - ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack 207 200 traces collection (default: ``on``). 208 - 209 - - ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN 210 - report or also panic the kernel (default: ``report``). The panic happens even 211 - if ``kasan_multi_shot`` is enabled. 212 201 213 202 Implementation details 214 203 ----------------------
-9
Documentation/translations/zh_CN/core-api/cachetlb.rst
··· 298 298 用。默认的实现是nop(对于所有相干的架构应该保持这样)。对于不一致性 299 299 的架构,它应该刷新vmaddr处的页面缓存。 300 300 301 - ``void flush_kernel_dcache_page(struct page *page)`` 302 - 303 - 当内核需要修改一个用kmap获得的用户页时,它会在所有修改完成后(但在 304 - kunmapping之前)调用这个函数,以使底层页面达到最新状态。这里假定用 305 - 户没有不一致性的缓存副本(即原始页面是从类似get_user_pages()的机制 306 - 中获得的)。默认的实现是一个nop,在所有相干的架构上都应该如此。在不 307 - 一致性的架构上,这应该刷新内核缓存中的页面(使用page_address(page))。 308 - 309 - 310 301 ``void flush_icache_range(unsigned long start, unsigned long end)`` 311 302 312 303 当内核存储到它将执行的地址中时(例如在加载模块时),这个函数被调用。
-1
Documentation/vm/hwpoison.rst
··· 180 180 =========== 181 181 - Not all page types are supported and never will. Most kernel internal 182 182 objects cannot be recovered, only LRU pages for now. 183 - - Right now hugepage support is missing. 184 183 185 184 --- 186 185 Andi Kleen, Oct 2009
+2
arch/alpha/kernel/syscalls/syscall.tbl
··· 486 486 554 common landlock_create_ruleset sys_landlock_create_ruleset 487 487 555 common landlock_add_rule sys_landlock_add_rule 488 488 556 common landlock_restrict_self sys_landlock_restrict_self 489 + # 557 reserved for memfd_secret 490 + 558 common process_mrelease sys_process_mrelease
+1 -3
arch/arm/include/asm/cacheflush.h
··· 291 291 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 292 292 extern void flush_dcache_page(struct page *); 293 293 294 + #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 294 295 static inline void flush_kernel_vmap_range(void *addr, int size) 295 296 { 296 297 if ((cache_is_vivt() || cache_is_vipt_aliasing())) ··· 312 311 if (PageAnon(page)) 313 312 __flush_anon_page(vma, page, vmaddr); 314 313 } 315 - 316 - #define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 317 - extern void flush_kernel_dcache_page(struct page *); 318 314 319 315 #define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages) 320 316 #define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages)
+7 -13
arch/arm/kernel/setup.c
··· 1012 1012 unsigned long long lowmem_max = __pa(high_memory - 1) + 1; 1013 1013 if (crash_max > lowmem_max) 1014 1014 crash_max = lowmem_max; 1015 - crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max, 1016 - crash_size, CRASH_ALIGN); 1015 + 1016 + crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 1017 + CRASH_ALIGN, crash_max); 1017 1018 if (!crash_base) { 1018 1019 pr_err("crashkernel reservation failed - No suitable area found.\n"); 1019 1020 return; 1020 1021 } 1021 1022 } else { 1023 + unsigned long long crash_max = crash_base + crash_size; 1022 1024 unsigned long long start; 1023 1025 1024 - start = memblock_find_in_range(crash_base, 1025 - crash_base + crash_size, 1026 - crash_size, SECTION_SIZE); 1027 - if (start != crash_base) { 1026 + start = memblock_phys_alloc_range(crash_size, SECTION_SIZE, 1027 + crash_base, crash_max); 1028 + if (!start) { 1028 1029 pr_err("crashkernel reservation failed - memory is in use.\n"); 1029 1030 return; 1030 1031 } 1031 - } 1032 - 1033 - ret = memblock_reserve(crash_base, crash_size); 1034 - if (ret < 0) { 1035 - pr_warn("crashkernel reservation failed - memory is in use (0x%lx)\n", 1036 - (unsigned long)crash_base); 1037 - return; 1038 1032 } 1039 1033 1040 1034 pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System RAM: %ldMB)\n",
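This hunk, like the arm64, mips, riscv, and KVM ones below, follows the same pattern from the "memblock: make memblock_find_in_range method private" commit: the open-coded find-then-reserve pair becomes a single allocation call. A schematic (not a compilable unit) of the before/after shape, with error handling elided:

```c
/* Before: two steps, and callers sometimes forgot the reserve
 * (or the alignment/validity checks around it). */
base = memblock_find_in_range(start, end, size, align);
if (!base)
	return;			/* no suitable area */
memblock_reserve(base, size);

/* After: one call searches [start, end) and reserves the block;
 * note the different argument order (size, align, start, end). */
base = memblock_phys_alloc_range(size, align, start, end);
if (!base)
	return;			/* no suitable area */
```

Keeping the search and the reservation in one memblock API call is what lets this series drop the separate memblock_reserve() calls and the duplicated region checks in each architecture's crashkernel code.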
-33
arch/arm/mm/flush.c
··· 346 346 EXPORT_SYMBOL(flush_dcache_page); 347 347 348 348 /* 349 - * Ensure cache coherency for the kernel mapping of this page. We can 350 - * assume that the page is pinned via kmap. 351 - * 352 - * If the page only exists in the page cache and there are no user 353 - * space mappings, this is a no-op since the page was already marked 354 - * dirty at creation. Otherwise, we need to flush the dirty kernel 355 - * cache lines directly. 356 - */ 357 - void flush_kernel_dcache_page(struct page *page) 358 - { 359 - if (cache_is_vivt() || cache_is_vipt_aliasing()) { 360 - struct address_space *mapping; 361 - 362 - mapping = page_mapping_file(page); 363 - 364 - if (!mapping || mapping_mapped(mapping)) { 365 - void *addr; 366 - 367 - addr = page_address(page); 368 - /* 369 - * kmap_atomic() doesn't set the page virtual 370 - * address for highmem pages, and 371 - * kunmap_atomic() takes care of cache 372 - * flushing already. 373 - */ 374 - if (!IS_ENABLED(CONFIG_HIGHMEM) || addr) 375 - __cpuc_flush_dcache_area(addr, PAGE_SIZE); 376 - } 377 - } 378 - } 379 - EXPORT_SYMBOL(flush_kernel_dcache_page); 380 - 381 - /* 382 349 * Flush an anonymous page so that users of get_user_pages() 383 350 * can safely access the data. The expected sequence is: 384 351 *
-6
arch/arm/mm/nommu.c
··· 166 166 } 167 167 EXPORT_SYMBOL(flush_dcache_page); 168 168 169 - void flush_kernel_dcache_page(struct page *page) 170 - { 171 - __cpuc_flush_dcache_area(page_address(page), PAGE_SIZE); 172 - } 173 - EXPORT_SYMBOL(flush_kernel_dcache_page); 174 - 175 169 void copy_to_user_page(struct vm_area_struct *vma, struct page *page, 176 170 unsigned long uaddr, void *dst, const void *src, 177 171 unsigned long len)
+2
arch/arm/tools/syscall.tbl
··· 460 460 444 common landlock_create_ruleset sys_landlock_create_ruleset 461 461 445 common landlock_add_rule sys_landlock_add_rule 462 462 446 common landlock_restrict_self sys_landlock_restrict_self 463 + # 447 reserved for memfd_secret 464 + 448 common process_mrelease sys_process_mrelease
+1 -1
arch/arm64/include/asm/unistd.h
··· 38 38 #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) 39 39 #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) 40 40 41 - #define __NR_compat_syscalls 447 41 + #define __NR_compat_syscalls 449 42 42 #endif 43 43 44 44 #define __ARCH_WANT_SYS_CLONE
+2
arch/arm64/include/asm/unistd32.h
··· 901 901 __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) 902 902 #define __NR_landlock_restrict_self 446 903 903 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) 904 + #define __NR_process_mrelease 448 905 + __SYSCALL(__NR_process_mrelease, sys_process_mrelease) 904 906 905 907 /* 906 908 * Please add new compat syscalls above this comment and update
+3 -6
arch/arm64/kvm/hyp/reserved_mem.c
··· 92 92 * this is unmapped from the host stage-2, and fallback to PAGE_SIZE. 93 93 */ 94 94 hyp_mem_size = hyp_mem_pages << PAGE_SHIFT; 95 - hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(), 96 - ALIGN(hyp_mem_size, PMD_SIZE), 97 - PMD_SIZE); 95 + hyp_mem_base = memblock_phys_alloc(ALIGN(hyp_mem_size, PMD_SIZE), 96 + PMD_SIZE); 98 97 if (!hyp_mem_base) 99 - hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(), 100 - hyp_mem_size, PAGE_SIZE); 98 + hyp_mem_base = memblock_phys_alloc(hyp_mem_size, PAGE_SIZE); 101 99 else 102 100 hyp_mem_size = ALIGN(hyp_mem_size, PMD_SIZE); 103 101 ··· 103 105 kvm_err("Failed to reserve hyp memory\n"); 104 106 return; 105 107 } 106 - memblock_reserve(hyp_mem_base, hyp_mem_size); 107 108 108 109 kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20, 109 110 hyp_mem_base);
+11 -25
arch/arm64/mm/init.c
··· 74 74 static void __init reserve_crashkernel(void) 75 75 { 76 76 unsigned long long crash_base, crash_size; 77 + unsigned long long crash_max = arm64_dma_phys_limit; 77 78 int ret; 78 79 79 80 ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(), ··· 85 84 86 85 crash_size = PAGE_ALIGN(crash_size); 87 86 88 - if (crash_base == 0) { 89 - /* Current arm64 boot protocol requires 2MB alignment */ 90 - crash_base = memblock_find_in_range(0, arm64_dma_phys_limit, 91 - crash_size, SZ_2M); 92 - if (crash_base == 0) { 93 - pr_warn("cannot allocate crashkernel (size:0x%llx)\n", 94 - crash_size); 95 - return; 96 - } 97 - } else { 98 - /* User specifies base address explicitly. */ 99 - if (!memblock_is_region_memory(crash_base, crash_size)) { 100 - pr_warn("cannot reserve crashkernel: region is not memory\n"); 101 - return; 102 - } 87 + /* User specifies base address explicitly. */ 88 + if (crash_base) 89 + crash_max = crash_base + crash_size; 103 90 104 - if (memblock_is_region_reserved(crash_base, crash_size)) { 105 - pr_warn("cannot reserve crashkernel: region overlaps reserved memory\n"); 106 - return; 107 - } 108 - 109 - if (!IS_ALIGNED(crash_base, SZ_2M)) { 110 - pr_warn("cannot reserve crashkernel: base address is not 2MB aligned\n"); 111 - return; 112 - } 91 + /* Current arm64 boot protocol requires 2MB alignment */ 92 + crash_base = memblock_phys_alloc_range(crash_size, SZ_2M, 93 + crash_base, crash_max); 94 + if (!crash_base) { 95 + pr_warn("cannot allocate crashkernel (size:0x%llx)\n", 96 + crash_size); 97 + return; 113 98 } 114 - memblock_reserve(crash_base, crash_size); 115 99 116 100 pr_info("crashkernel reserved: 0x%016llx - 0x%016llx (%lld MB)\n", 117 101 crash_base, crash_base + crash_size, crash_size >> 20);
-11
arch/csky/abiv1/cacheflush.c
··· 56 56 } 57 57 } 58 58 59 - void flush_kernel_dcache_page(struct page *page) 60 - { 61 - struct address_space *mapping; 62 - 63 - mapping = page_mapping_file(page); 64 - 65 - if (!mapping || mapping_mapped(mapping)) 66 - dcache_wbinv_all(); 67 - } 68 - EXPORT_SYMBOL(flush_kernel_dcache_page); 69 - 70 59 void flush_cache_range(struct vm_area_struct *vma, unsigned long start, 71 60 unsigned long end) 72 61 {
+1 -3
arch/csky/abiv1/inc/abi/cacheflush.h
··· 14 14 #define flush_cache_page(vma, page, pfn) cache_wbinv_all() 15 15 #define flush_cache_dup_mm(mm) cache_wbinv_all() 16 16 17 - #define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 18 - extern void flush_kernel_dcache_page(struct page *); 19 - 20 17 #define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages) 21 18 #define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages) 22 19 20 + #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 23 21 static inline void flush_kernel_vmap_range(void *addr, int size) 24 22 { 25 23 dcache_wbinv_all();
+1 -2
arch/csky/kernel/probes/kprobes.c
··· 283 283 * normal page fault. 284 284 */ 285 285 regs->pc = (unsigned long) cur->addr; 286 - if (!instruction_pointer(regs)) 287 - BUG(); 286 + BUG_ON(!instruction_pointer(regs)); 288 287 289 288 if (kcb->kprobe_status == KPROBE_REENTER) 290 289 restore_previous_kprobe(kcb);
-2
arch/ia64/include/asm/meminit.h
··· 29 29 }; 30 30 31 31 extern struct rsvd_region rsvd_region[IA64_MAX_RSVD_REGIONS + 1]; 32 - extern int num_rsvd_regions; 33 32 34 33 extern void find_memory (void); 35 34 extern void reserve_memory (void); ··· 39 40 extern int find_max_min_low_pfn (u64, u64, void *); 40 41 41 42 extern unsigned long vmcore_find_descriptor_size(unsigned long address); 42 - extern int reserve_elfcorehdr(u64 *start, u64 *end); 43 43 44 44 /* 45 45 * For rounding an address to the next IA64_GRANULE_SIZE or order
+1 -1
arch/ia64/kernel/acpi.c
··· 906 906 /* 907 907 * acpi_suspend_lowlevel() - save kernel state and suspend. 908 908 * 909 - * TBD when when IA64 starts to support suspend... 909 + * TBD when IA64 starts to support suspend... 910 910 */ 911 911 int acpi_suspend_lowlevel(void) { return 0; }
+26 -27
arch/ia64/kernel/setup.c
··· 131 131 * We use a special marker for the end of memory and it uses the extra (+1) slot 132 132 */ 133 133 struct rsvd_region rsvd_region[IA64_MAX_RSVD_REGIONS + 1] __initdata; 134 - int num_rsvd_regions __initdata; 134 + static int num_rsvd_regions __initdata; 135 135 136 136 137 137 /* ··· 324 324 static inline void __init setup_crashkernel(unsigned long total, int *n) 325 325 {} 326 326 #endif 327 + 328 + #ifdef CONFIG_CRASH_DUMP 329 + static int __init reserve_elfcorehdr(u64 *start, u64 *end) 330 + { 331 + u64 length; 332 + 333 + /* We get the address using the kernel command line, 334 + * but the size is extracted from the EFI tables. 335 + * Both address and size are required for reservation 336 + * to work properly. 337 + */ 338 + 339 + if (!is_vmcore_usable()) 340 + return -EINVAL; 341 + 342 + if ((length = vmcore_find_descriptor_size(elfcorehdr_addr)) == 0) { 343 + vmcore_unusable(); 344 + return -EINVAL; 345 + } 346 + 347 + *start = (unsigned long)__va(elfcorehdr_addr); 348 + *end = *start + length; 349 + return 0; 350 + } 351 + #endif /* CONFIG_CRASH_DUMP */ 327 352 328 353 /** 329 354 * reserve_memory - setup reserved memory areas ··· 546 521 return 0; 547 522 } 548 523 early_param("nomca", setup_nomca); 549 - 550 - #ifdef CONFIG_CRASH_DUMP 551 - int __init reserve_elfcorehdr(u64 *start, u64 *end) 552 - { 553 - u64 length; 554 - 555 - /* We get the address using the kernel command line, 556 - * but the size is extracted from the EFI tables. 557 - * Both address and size are required for reservation 558 - * to work properly. 559 - */ 560 - 561 - if (!is_vmcore_usable()) 562 - return -EINVAL; 563 - 564 - if ((length = vmcore_find_descriptor_size(elfcorehdr_addr)) == 0) { 565 - vmcore_unusable(); 566 - return -EINVAL; 567 - } 568 - 569 - *start = (unsigned long)__va(elfcorehdr_addr); 570 - *end = *start + length; 571 - return 0; 572 - } 573 - 574 - #endif /* CONFIG_PROC_VMCORE */ 575 524 576 525 void __init 577 526 setup_arch (char **cmdline_p)
+2
arch/ia64/kernel/syscalls/syscall.tbl
··· 367 367 444 common landlock_create_ruleset sys_landlock_create_ruleset 368 368 445 common landlock_add_rule sys_landlock_add_rule 369 369 446 common landlock_restrict_self sys_landlock_restrict_self 370 + # 447 reserved for memfd_secret 371 + 448 common process_mrelease sys_process_mrelease
+2
arch/m68k/kernel/syscalls/syscall.tbl
··· 446 446 444 common landlock_create_ruleset sys_landlock_create_ruleset 447 447 445 common landlock_add_rule sys_landlock_add_rule 448 448 446 common landlock_restrict_self sys_landlock_restrict_self 449 + # 447 reserved for memfd_secret 450 + 448 common process_mrelease sys_process_mrelease
+1 -2
arch/microblaze/include/asm/page.h
··· 112 112 # define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT) 113 113 114 114 # define ARCH_PFN_OFFSET (memory_start >> PAGE_SHIFT) 115 - # define pfn_valid(pfn) ((pfn) < (max_mapnr + ARCH_PFN_OFFSET)) 116 - 115 + # define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && (pfn) < (max_mapnr + ARCH_PFN_OFFSET)) 117 116 # endif /* __ASSEMBLY__ */ 118 117 119 118 #define virt_addr_valid(vaddr) (pfn_valid(virt_to_pfn(vaddr)))
-2
arch/microblaze/include/asm/pgtable.h
··· 443 443 444 444 asmlinkage void __init mmu_init(void); 445 445 446 - void __init *early_get_page(void); 447 - 448 446 #endif /* __ASSEMBLY__ */ 449 447 #endif /* __KERNEL__ */ 450 448
+2
arch/microblaze/kernel/syscalls/syscall.tbl
··· 452 452 444 common landlock_create_ruleset sys_landlock_create_ruleset 453 453 445 common landlock_add_rule sys_landlock_add_rule 454 454 446 common landlock_restrict_self sys_landlock_restrict_self 455 + # 447 reserved for memfd_secret 456 + 448 common process_mrelease sys_process_mrelease
-12
arch/microblaze/mm/init.c
··· 265 265 dma_contiguous_reserve(memory_start + lowmem_size - 1); 266 266 } 267 267 268 - /* This is only called until mem_init is done. */ 269 - void __init *early_get_page(void) 270 - { 271 - /* 272 - * Mem start + kernel_tlb -> here is limit 273 - * because of mem mapping from head.S 274 - */ 275 - return memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE, 276 - MEMBLOCK_LOW_LIMIT, memory_start + kernel_tlb, 277 - NUMA_NO_NODE); 278 - } 279 - 280 268 void * __ref zalloc_maybe_bootmem(size_t size, gfp_t mask) 281 269 { 282 270 void *p;
+8 -9
arch/microblaze/mm/pgtable.c
··· 33 33 #include <linux/init.h> 34 34 #include <linux/mm_types.h> 35 35 #include <linux/pgtable.h> 36 + #include <linux/memblock.h> 36 37 37 38 #include <asm/pgalloc.h> 38 39 #include <linux/io.h> ··· 243 242 244 243 __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm) 245 244 { 246 - pte_t *pte; 247 - if (mem_init_done) { 248 - pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); 249 - } else { 250 - pte = (pte_t *)early_get_page(); 251 - if (pte) 252 - clear_page(pte); 253 - } 254 - return pte; 245 + if (mem_init_done) 246 + return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); 247 + else 248 + return memblock_alloc_try_nid(PAGE_SIZE, PAGE_SIZE, 249 + MEMBLOCK_LOW_LIMIT, 250 + memory_start + kernel_tlb, 251 + NUMA_NO_NODE); 255 252 } 256 253 257 254 void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t flags)
+1 -7
arch/mips/include/asm/cacheflush.h
··· 125 125 kunmap_coherent(); 126 126 } 127 127 128 - #define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 129 - static inline void flush_kernel_dcache_page(struct page *page) 130 - { 131 - BUG_ON(cpu_has_dc_aliases && PageHighMem(page)); 132 - flush_dcache_page(page); 133 - } 134 - 128 + #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 135 129 /* 136 130 * For now flush_kernel_vmap_range and invalidate_kernel_vmap_range both do a 137 131 * cache writeback and invalidate operation.
+6 -8
arch/mips/kernel/setup.c
··· 452 452 return; 453 453 454 454 if (crash_base <= 0) { 455 - crash_base = memblock_find_in_range(CRASH_ALIGN, CRASH_ADDR_MAX, 456 - crash_size, CRASH_ALIGN); 455 + crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 456 + CRASH_ALIGN, 457 + CRASH_ADDR_MAX); 457 458 if (!crash_base) { 458 459 pr_warn("crashkernel reservation failed - No suitable area found.\n"); 459 460 return; ··· 462 461 } else { 463 462 unsigned long long start; 464 463 465 - start = memblock_find_in_range(crash_base, crash_base + crash_size, 466 - crash_size, 1); 464 + start = memblock_phys_alloc_range(crash_size, 1, 465 + crash_base, 466 + crash_base + crash_size); 467 467 if (start != crash_base) { 468 468 pr_warn("Invalid memory region reserved for crash kernel\n"); 469 469 return; ··· 658 656 mips_reserve_vmcore(); 659 657 660 658 mips_parse_crashkernel(); 661 - #ifdef CONFIG_KEXEC 662 - if (crashk_res.start != crashk_res.end) 663 - memblock_reserve(crashk_res.start, resource_size(&crashk_res)); 664 - #endif 665 659 device_tree_init(); 666 660 667 661 /*
+2
arch/mips/kernel/syscalls/syscall_n32.tbl
··· 385 385 444 n32 landlock_create_ruleset sys_landlock_create_ruleset 386 386 445 n32 landlock_add_rule sys_landlock_add_rule 387 387 446 n32 landlock_restrict_self sys_landlock_restrict_self 388 + # 447 reserved for memfd_secret 389 + 448 n32 process_mrelease sys_process_mrelease
+2
arch/mips/kernel/syscalls/syscall_n64.tbl
··· 361 361 444 n64 landlock_create_ruleset sys_landlock_create_ruleset 362 362 445 n64 landlock_add_rule sys_landlock_add_rule 363 363 446 n64 landlock_restrict_self sys_landlock_restrict_self 364 + # 447 reserved for memfd_secret 365 + 448 n64 process_mrelease sys_process_mrelease
+2
arch/mips/kernel/syscalls/syscall_o32.tbl
··· 434 434 444 o32 landlock_create_ruleset sys_landlock_create_ruleset 435 435 445 o32 landlock_add_rule sys_landlock_add_rule 436 436 446 o32 landlock_restrict_self sys_landlock_restrict_self 437 + # 447 reserved for memfd_secret 438 + 448 o32 process_mrelease sys_process_mrelease
+1 -2
arch/nds32/include/asm/cacheflush.h
··· 36 36 void flush_anon_page(struct vm_area_struct *vma, 37 37 struct page *page, unsigned long vaddr); 38 38 39 - #define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 40 - void flush_kernel_dcache_page(struct page *page); 39 + #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 41 40 void flush_kernel_vmap_range(void *addr, int size); 42 41 void invalidate_kernel_vmap_range(void *addr, int size); 43 42 #define flush_dcache_mmap_lock(mapping) xa_lock_irq(&(mapping)->i_pages)
-9
arch/nds32/mm/cacheflush.c
··· 318 318 local_irq_restore(flags); 319 319 } 320 320 321 - void flush_kernel_dcache_page(struct page *page) 322 - { 323 - unsigned long flags; 324 - local_irq_save(flags); 325 - cpu_dcache_wbinval_page((unsigned long)page_address(page)); 326 - local_irq_restore(flags); 327 - } 328 - EXPORT_SYMBOL(flush_kernel_dcache_page); 329 - 330 321 void flush_kernel_vmap_range(void *addr, int size) 331 322 { 332 323 unsigned long flags;
+2 -6
arch/parisc/include/asm/cacheflush.h
··· 36 36 void flush_cache_all(void); 37 37 void flush_cache_mm(struct mm_struct *mm); 38 38 39 - #define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 40 39 void flush_kernel_dcache_page_addr(void *addr); 41 - static inline void flush_kernel_dcache_page(struct page *page) 42 - { 43 - flush_kernel_dcache_page_addr(page_address(page)); 44 - } 45 40 46 41 #define flush_kernel_dcache_range(start,size) \ 47 42 flush_kernel_dcache_range_asm((start), (start)+(size)); 48 43 44 + #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 49 45 void flush_kernel_vmap_range(void *vaddr, int size); 50 46 void invalidate_kernel_vmap_range(void *vaddr, int size); 51 47 ··· 55 59 #define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages) 56 60 57 61 #define flush_icache_page(vma,page) do { \ 58 - flush_kernel_dcache_page(page); \ 62 + flush_kernel_dcache_page_addr(page_address(page)); \ 59 63 flush_kernel_icache_page(page_address(page)); \ 60 64 } while (0) 61 65
+1 -2
arch/parisc/kernel/cache.c
··· 334 334 return; 335 335 } 336 336 337 - flush_kernel_dcache_page(page); 337 + flush_kernel_dcache_page_addr(page_address(page)); 338 338 339 339 if (!mapping) 340 340 return; ··· 375 375 376 376 /* Defined in arch/parisc/kernel/pacache.S */ 377 377 EXPORT_SYMBOL(flush_kernel_dcache_range_asm); 378 - EXPORT_SYMBOL(flush_kernel_dcache_page_asm); 379 378 EXPORT_SYMBOL(flush_data_cache_local); 380 379 EXPORT_SYMBOL(flush_kernel_icache_range_asm); 381 380
+2
arch/parisc/kernel/syscalls/syscall.tbl
··· 444 444 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 445 445 common landlock_add_rule sys_landlock_add_rule 446 446 446 common landlock_restrict_self sys_landlock_restrict_self 447 + # 447 reserved for memfd_secret 448 + 448 common process_mrelease sys_process_mrelease
+2
arch/powerpc/kernel/syscalls/syscall.tbl
··· 526 526 444 common landlock_create_ruleset sys_landlock_create_ruleset 527 527 445 common landlock_add_rule sys_landlock_add_rule 528 528 446 common landlock_restrict_self sys_landlock_restrict_self 529 + # 447 reserved for memfd_secret 530 + 448 common process_mrelease sys_process_mrelease
+1 -3
arch/powerpc/platforms/pseries/hotplug-memory.c
··· 211 211 static struct memory_block *lmb_to_memblock(struct drmem_lmb *lmb) 212 212 { 213 213 unsigned long section_nr; 214 - struct mem_section *mem_sect; 215 214 struct memory_block *mem_block; 216 215 217 216 section_nr = pfn_to_section_nr(PFN_DOWN(lmb->base_addr)); 218 - mem_sect = __nr_to_section(section_nr); 219 217 220 - mem_block = find_memory_block(mem_sect); 218 + mem_block = find_memory_block(section_nr); 221 219 return mem_block; 222 220 } 223 221
+15 -31
arch/riscv/mm/init.c
··· 819 819 820 820 crash_size = PAGE_ALIGN(crash_size); 821 821 822 - if (crash_base == 0) { 823 - /* 824 - * Current riscv boot protocol requires 2MB alignment for 825 - * RV64 and 4MB alignment for RV32 (hugepage size) 826 - */ 827 - crash_base = memblock_find_in_range(search_start, search_end, 828 - crash_size, PMD_SIZE); 829 - 830 - if (crash_base == 0) { 831 - pr_warn("crashkernel: couldn't allocate %lldKB\n", 832 - crash_size >> 10); 833 - return; 834 - } 835 - } else { 836 - /* User specifies base address explicitly. */ 837 - if (!memblock_is_region_memory(crash_base, crash_size)) { 838 - pr_warn("crashkernel: requested region is not memory\n"); 839 - return; 840 - } 841 - 842 - if (memblock_is_region_reserved(crash_base, crash_size)) { 843 - pr_warn("crashkernel: requested region is reserved\n"); 844 - return; 845 - } 846 - 847 - 848 - if (!IS_ALIGNED(crash_base, PMD_SIZE)) { 849 - pr_warn("crashkernel: requested region is misaligned\n"); 850 - return; 851 - } 822 + if (crash_base) { 823 + search_start = crash_base; 824 + search_end = crash_base + crash_size; 852 825 } 853 - memblock_reserve(crash_base, crash_size); 826 + 827 + /* 828 + * Current riscv boot protocol requires 2MB alignment for 829 + * RV64 and 4MB alignment for RV32 (hugepage size) 830 + */ 831 + crash_base = memblock_phys_alloc_range(crash_size, PMD_SIZE, 832 + search_start, search_end); 833 + if (crash_base == 0) { 834 + pr_warn("crashkernel: couldn't allocate %lldKB\n", 835 + crash_size >> 10); 836 + return; 837 + } 854 838 855 839 pr_info("crashkernel: reserved 0x%016llx - 0x%016llx (%lld MB)\n", 856 840 crash_base, crash_base + crash_size, crash_size >> 20);
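The riscv hunk above collapses two code paths (kernel-chosen base vs. user-supplied base) into a single allocation call by narrowing the search window. A rough userspace sketch of that control flow, with a stub standing in for memblock_phys_alloc_range() (all names and the stub's behavior here are illustrative, not the kernel API):

```c
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Stub allocator: succeeds iff an aligned region of the requested size
 * fits in [start, end).  The real memblock_phys_alloc_range() also
 * reserves the region it returns; 0 means failure in both. */
static uint64_t stub_phys_alloc_range(uint64_t size, uint64_t align,
                                      uint64_t start, uint64_t end)
{
    uint64_t base = ALIGN_UP(start, align);
    return (base && base + size <= end) ? base : 0;
}

/* Mirrors the patched flow: an explicit crash_base no longer needs its
 * own is-memory/is-reserved/is-aligned checks -- it just shrinks the
 * window to exactly [crash_base, crash_base + crash_size) and lets the
 * allocator accept or reject it. */
static uint64_t reserve_crashkernel_sketch(uint64_t crash_base,
                                           uint64_t crash_size,
                                           uint64_t search_start,
                                           uint64_t search_end)
{
    const uint64_t align = 2ULL << 20;  /* 2MB (RV64 hugepage), per the hunk */

    if (crash_base) {
        search_start = crash_base;
        search_end = crash_base + crash_size;
    }
    return stub_phys_alloc_range(crash_size, align, search_start, search_end);
}
```

A misaligned user-supplied base no longer gets a dedicated warning; it simply fails the allocation, which the caller already handles.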
+6 -3
arch/s390/kernel/setup.c
··· 677 677 return; 678 678 } 679 679 low = crash_base ?: low; 680 - crash_base = memblock_find_in_range(low, high, crash_size, 681 - KEXEC_CRASH_MEM_ALIGN); 680 + crash_base = memblock_phys_alloc_range(crash_size, 681 + KEXEC_CRASH_MEM_ALIGN, 682 + low, high); 682 683 } 683 684 684 685 if (!crash_base) { ··· 688 687 return; 689 688 } 690 689 691 - if (register_memory_notifier(&kdump_mem_nb)) 690 + if (register_memory_notifier(&kdump_mem_nb)) { 691 + memblock_free(crash_base, crash_size); 692 692 return; 693 + } 693 694 694 695 if (!oldmem_data.start && MACHINE_IS_VM) 695 696 diag10_range(PFN_DOWN(crash_base), PFN_DOWN(crash_size));
+2
arch/s390/kernel/syscalls/syscall.tbl
··· 449 449 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset 450 450 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule 451 451 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self 452 + # 447 reserved for memfd_secret 453 + 448 common process_mrelease sys_process_mrelease sys_process_mrelease
+1 -1
arch/s390/mm/fault.c
··· 822 822 break; 823 823 case KERNEL_FAULT: 824 824 page = phys_to_page(addr); 825 - if (unlikely(!try_get_page(page))) 825 + if (unlikely(!try_get_compound_head(page, 1))) 826 826 break; 827 827 rc = arch_make_page_accessible(page); 828 828 put_page(page);
+2 -6
arch/sh/include/asm/cacheflush.h
··· 63 63 if (boot_cpu_data.dcache.n_aliases && PageAnon(page)) 64 64 __flush_anon_page(page, vmaddr); 65 65 } 66 + 67 + #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 66 68 static inline void flush_kernel_vmap_range(void *addr, int size) 67 69 { 68 70 __flush_wback_region(addr, size); ··· 72 70 static inline void invalidate_kernel_vmap_range(void *addr, int size) 73 71 { 74 72 __flush_invalidate_region(addr, size); 75 - } 76 - 77 - #define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 78 - static inline void flush_kernel_dcache_page(struct page *page) 79 - { 80 - flush_dcache_page(page); 81 73 } 82 74 83 75 extern void copy_to_user_page(struct vm_area_struct *vma,
+2
arch/sh/kernel/syscalls/syscall.tbl
··· 449 449 444 common landlock_create_ruleset sys_landlock_create_ruleset 450 450 445 common landlock_add_rule sys_landlock_add_rule 451 451 446 common landlock_restrict_self sys_landlock_restrict_self 452 + # 447 reserved for memfd_secret 453 + 448 common process_mrelease sys_process_mrelease
+2
arch/sparc/kernel/syscalls/syscall.tbl
··· 492 492 444 common landlock_create_ruleset sys_landlock_create_ruleset 493 493 445 common landlock_add_rule sys_landlock_add_rule 494 494 446 common landlock_restrict_self sys_landlock_restrict_self 495 + # 447 reserved for memfd_secret 496 + 448 common process_mrelease sys_process_mrelease
+1
arch/x86/entry/syscalls/syscall_32.tbl
··· 452 452 445 i386 landlock_add_rule sys_landlock_add_rule 453 453 446 i386 landlock_restrict_self sys_landlock_restrict_self 454 454 447 i386 memfd_secret sys_memfd_secret 455 + 448 i386 process_mrelease sys_process_mrelease
+1
arch/x86/entry/syscalls/syscall_64.tbl
··· 369 369 445 common landlock_add_rule sys_landlock_add_rule 370 370 446 common landlock_restrict_self sys_landlock_restrict_self 371 371 447 common memfd_secret sys_memfd_secret 372 + 448 common process_mrelease sys_process_mrelease 372 373 373 374 # 374 375 # Due to a historical design error, certain syscalls are numbered differently
+2 -3
arch/x86/kernel/aperture_64.c
··· 109 109 * memory. Unfortunately we cannot move it up because that would 110 110 * make the IOMMU useless. 111 111 */ 112 - addr = memblock_find_in_range(GART_MIN_ADDR, GART_MAX_ADDR, 113 - aper_size, aper_size); 112 + addr = memblock_phys_alloc_range(aper_size, aper_size, 113 + GART_MIN_ADDR, GART_MAX_ADDR); 114 114 if (!addr) { 115 115 pr_err("Cannot allocate aperture memory hole [mem %#010lx-%#010lx] (%uKB)\n", 116 116 addr, addr + aper_size - 1, aper_size >> 10); 117 117 return 0; 118 118 } 119 - memblock_reserve(addr, aper_size); 120 119 pr_info("Mapping aperture over RAM [mem %#010lx-%#010lx] (%uKB)\n", 121 120 addr, addr + aper_size - 1, aper_size >> 10); 122 121 register_nosave_region(addr >> PAGE_SHIFT,
+3 -3
arch/x86/kernel/ldt.c
··· 154 154 if (num_entries > LDT_ENTRIES) 155 155 return NULL; 156 156 157 - new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL); 157 + new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT); 158 158 if (!new_ldt) 159 159 return NULL; 160 160 ··· 168 168 * than PAGE_SIZE. 169 169 */ 170 170 if (alloc_size > PAGE_SIZE) 171 - new_ldt->entries = vzalloc(alloc_size); 171 + new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO); 172 172 else 173 - new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL); 173 + new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); 174 174 175 175 if (!new_ldt->entries) { 176 176 kfree(new_ldt);
+15 -8
arch/x86/mm/init.c
··· 127 127 unsigned long ret = 0; 128 128 129 129 if (min_pfn_mapped < max_pfn_mapped) { 130 - ret = memblock_find_in_range( 130 + ret = memblock_phys_alloc_range( 131 + PAGE_SIZE * num, PAGE_SIZE, 131 132 min_pfn_mapped << PAGE_SHIFT, 132 - max_pfn_mapped << PAGE_SHIFT, 133 - PAGE_SIZE * num , PAGE_SIZE); 133 + max_pfn_mapped << PAGE_SHIFT); 134 134 } 135 - if (ret) 136 - memblock_reserve(ret, PAGE_SIZE * num); 137 - else if (can_use_brk_pgt) 135 + if (!ret && can_use_brk_pgt) 138 136 ret = __pa(extend_brk(PAGE_SIZE * num, PAGE_SIZE)); 139 137 140 138 if (!ret) ··· 608 610 unsigned long addr; 609 611 unsigned long mapped_ram_size = 0; 610 612 611 - /* xen has big range in reserved near end of ram, skip it at first.*/ 612 - addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE); 613 + /* 614 + * Systems that have many reserved areas near top of the memory, 615 + * e.g. QEMU with less than 1G RAM and EFI enabled, or Xen, will 616 + * require lots of 4K mappings which may exhaust pgt_buf. 617 + * Start with top-most PMD_SIZE range aligned at PMD_SIZE to ensure 618 + * there is enough mapped memory that can be allocated from 619 + * memblock. 620 + */ 621 + addr = memblock_phys_alloc_range(PMD_SIZE, PMD_SIZE, map_start, 622 + map_end); 623 + memblock_free(addr, PMD_SIZE); 613 624 real_end = addr + PMD_SIZE; 614 625 615 626 /* step_size need to be small so pgt_buf from BRK could cover it */
+2 -3
arch/x86/mm/numa.c
··· 376 376 cnt++; 377 377 size = cnt * cnt * sizeof(numa_distance[0]); 378 378 379 - phys = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), 380 - size, PAGE_SIZE); 379 + phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, 380 + PFN_PHYS(max_pfn_mapped)); 381 381 if (!phys) { 382 382 pr_warn("Warning: can't allocate distance table!\n"); 383 383 /* don't retry until explicitly reset */ 384 384 numa_distance = (void *)1LU; 385 385 return -ENOMEM; 386 386 } 387 - memblock_reserve(phys, size); 388 387 389 388 numa_distance = __va(phys); 390 389 numa_distance_cnt = cnt;
+2 -3
arch/x86/mm/numa_emulation.c
··· 447 447 if (numa_dist_cnt) { 448 448 u64 phys; 449 449 450 - phys = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), 451 - phys_size, PAGE_SIZE); 450 + phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0, 451 + PFN_PHYS(max_pfn_mapped)); 452 452 if (!phys) { 453 453 pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n"); 454 454 goto no_emu; 455 455 } 456 - memblock_reserve(phys, phys_size); 457 456 phys_dist = __va(phys); 458 457 459 458 for (i = 0; i < numa_dist_cnt; i++)
+1 -1
arch/x86/realmode/init.c
··· 28 28 WARN_ON(slab_is_available()); 29 29 30 30 /* Has to be under 1M so we can execute real-mode AP code. */ 31 - mem = memblock_find_in_range(0, 1<<20, size, PAGE_SIZE); 31 + mem = memblock_phys_alloc_range(size, PAGE_SIZE, 0, 1<<20); 32 32 if (!mem) 33 33 pr_info("No sub-1M memory is available for the trampoline\n"); 34 34 else
+2
arch/xtensa/kernel/syscalls/syscall.tbl
··· 417 417 444 common landlock_create_ruleset sys_landlock_create_ruleset 418 418 445 common landlock_add_rule sys_landlock_add_rule 419 419 446 common landlock_restrict_self sys_landlock_restrict_self 420 + # 447 reserved for memfd_secret 421 + 448 common process_mrelease sys_process_mrelease
+1 -1
block/blk-map.c
··· 309 309 310 310 static void bio_invalidate_vmalloc_pages(struct bio *bio) 311 311 { 312 - #ifdef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 312 + #ifdef ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 313 313 if (bio->bi_private && !op_is_write(bio_op(bio))) { 314 314 unsigned long i, len = 0; 315 315
+2 -3
drivers/acpi/tables.c
··· 583 583 } 584 584 585 585 acpi_tables_addr = 586 - memblock_find_in_range(0, ACPI_TABLE_UPGRADE_MAX_PHYS, 587 - all_tables_size, PAGE_SIZE); 586 + memblock_phys_alloc_range(all_tables_size, PAGE_SIZE, 587 + 0, ACPI_TABLE_UPGRADE_MAX_PHYS); 588 588 if (!acpi_tables_addr) { 589 589 WARN_ON(1); 590 590 return; ··· 599 599 * Both memblock_reserve and e820__range_add (via arch_reserve_mem_area) 600 600 * works fine. 601 601 */ 602 - memblock_reserve(acpi_tables_addr, all_tables_size); 603 602 arch_reserve_mem_area(acpi_tables_addr, all_tables_size); 604 603 605 604 /*
+1 -4
drivers/base/arch_numa.c
··· 279 279 int i, j; 280 280 281 281 size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]); 282 - phys = memblock_find_in_range(0, PFN_PHYS(max_pfn), 283 - size, PAGE_SIZE); 282 + phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, PFN_PHYS(max_pfn)); 284 283 if (WARN_ON(!phys)) 285 284 return -ENOMEM; 286 - 287 - memblock_reserve(phys, size); 288 285 289 286 numa_distance = __va(phys); 290 287 numa_distance_cnt = nr_node_ids;
+2 -2
drivers/base/memory.c
··· 578 578 /* 579 579 * Called under device_hotplug_lock. 580 580 */ 581 - struct memory_block *find_memory_block(struct mem_section *section) 581 + struct memory_block *find_memory_block(unsigned long section_nr) 582 582 { 583 - unsigned long block_id = memory_block_id(__section_nr(section)); 583 + unsigned long block_id = memory_block_id(section_nr); 584 584 585 585 return find_memory_block_by_id(block_id); 586 586 }
-4
drivers/mmc/host/jz4740_mmc.c
··· 578 578 } 579 579 } 580 580 data->bytes_xfered += miter->length; 581 - 582 - /* This can go away once MIPS implements 583 - * flush_kernel_dcache_page */ 584 - flush_dcache_page(miter->page); 585 581 } 586 582 sg_miter_stop(miter); 587 583
+1 -1
drivers/mmc/host/mmc_spi.c
··· 941 941 942 942 /* discard mappings */ 943 943 if (direction == DMA_FROM_DEVICE) 944 - flush_kernel_dcache_page(sg_page(sg)); 944 + flush_dcache_page(sg_page(sg)); 945 945 kunmap(sg_page(sg)); 946 946 if (dma_dev) 947 947 dma_unmap_page(dma_dev, dma_addr, PAGE_SIZE, dir);
+8 -4
drivers/of/of_reserved_mem.c
··· 33 33 phys_addr_t *res_base) 34 34 { 35 35 phys_addr_t base; 36 + int err = 0; 36 37 37 38 end = !end ? MEMBLOCK_ALLOC_ANYWHERE : end; 38 39 align = !align ? SMP_CACHE_BYTES : align; 39 - base = memblock_find_in_range(start, end, size, align); 40 + base = memblock_phys_alloc_range(size, align, start, end); 40 41 if (!base) 41 42 return -ENOMEM; 42 43 43 44 *res_base = base; 44 - if (nomap) 45 - return memblock_mark_nomap(base, size); 45 + if (nomap) { 46 + err = memblock_mark_nomap(base, size); 47 + if (err) 48 + memblock_free(base, size); 49 + } 46 50 47 - return memblock_reserve(base, size); 51 + return err; 48 52 } 49 53 50 54 /*
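The hunk above switches the reserved-memory setup from find-then-reserve to a single memblock_phys_alloc_range() call, which means a failed memblock_mark_nomap() now has to release the region it just obtained. The shape of that unwind, with stubs standing in for the memblock calls (stub names and the fixed 1MB address are invented for illustration):

```c
#include <stdint.h>

static int nomap_should_fail;   /* knob driving the stub below */
static int freed_regions;

static uint64_t stub_phys_alloc(uint64_t size)
{
    return size ? 0x100000 : 0;          /* pretend allocation at 1MB */
}

static int stub_mark_nomap(uint64_t base, uint64_t size)
{
    (void)base; (void)size;
    return nomap_should_fail ? -22 : 0;  /* -EINVAL */
}

static void stub_free(uint64_t base, uint64_t size)
{
    (void)base; (void)size;
    freed_regions++;
}

/* Mirrors the patched early_init_dt_alloc_reserved_memory_arch(): the
 * allocation is already reserved, so the only cleanup left is undoing
 * it when the nomap marking fails. */
static int alloc_reserved_mem_sketch(uint64_t size, int nomap,
                                     uint64_t *res_base)
{
    uint64_t base = stub_phys_alloc(size);
    int err = 0;

    if (!base)
        return -12;                      /* -ENOMEM */

    *res_base = base;
    if (nomap) {
        err = stub_mark_nomap(base, size);
        if (err)
            stub_free(base, size);       /* don't leak the reservation */
    }
    return err;
}
```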
+2 -1
fs/drop_caches.c
··· 3 3 * Implement the manual drop-all-pagecache function 4 4 */ 5 5 6 + #include <linux/pagemap.h> 6 7 #include <linux/kernel.h> 7 8 #include <linux/mm.h> 8 9 #include <linux/fs.h> ··· 28 27 * we need to reschedule to avoid softlockups. 29 28 */ 30 29 if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || 31 - (inode->i_mapping->nrpages == 0 && !need_resched())) { 30 + (mapping_empty(inode->i_mapping) && !need_resched())) { 32 31 spin_unlock(&inode->i_lock); 33 32 continue; 34 33 }
+5 -3
fs/exec.c
··· 217 217 * We are doing an exec(). 'current' is the process 218 218 * doing the exec and bprm->mm is the new process's mm. 219 219 */ 220 + mmap_read_lock(bprm->mm); 220 221 ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags, 221 222 &page, NULL, NULL); 223 + mmap_read_unlock(bprm->mm); 222 224 if (ret <= 0) 223 225 return NULL; 224 226 ··· 576 574 } 577 575 578 576 if (kmapped_page) { 579 - flush_kernel_dcache_page(kmapped_page); 577 + flush_dcache_page(kmapped_page); 580 578 kunmap(kmapped_page); 581 579 put_arg_page(kmapped_page); 582 580 } ··· 594 592 ret = 0; 595 593 out: 596 594 if (kmapped_page) { 597 - flush_kernel_dcache_page(kmapped_page); 595 + flush_dcache_page(kmapped_page); 598 596 kunmap(kmapped_page); 599 597 put_arg_page(kmapped_page); 600 598 } ··· 636 634 kaddr = kmap_atomic(page); 637 635 flush_arg_page(bprm, pos & PAGE_MASK, page); 638 636 memcpy(kaddr + offset_in_page(pos), arg, bytes_to_copy); 639 - flush_kernel_dcache_page(page); 637 + flush_dcache_page(page); 640 638 kunmap_atomic(kaddr); 641 639 put_arg_page(page); 642 640 }
+2 -1
fs/fcntl.c
··· 1051 1051 __FMODE_EXEC | __FMODE_NONOTIFY)); 1052 1052 1053 1053 fasync_cache = kmem_cache_create("fasync_cache", 1054 - sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL); 1054 + sizeof(struct fasync_struct), 0, 1055 + SLAB_PANIC | SLAB_ACCOUNT, NULL); 1055 1056 return 0; 1056 1057 } 1057 1058
+14 -14
fs/fs-writeback.c
··· 406 406 inc_wb_stat(new_wb, WB_WRITEBACK); 407 407 } 408 408 409 + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) { 410 + atomic_dec(&old_wb->writeback_inodes); 411 + atomic_inc(&new_wb->writeback_inodes); 412 + } 413 + 409 414 wb_get(new_wb); 410 415 411 416 /* ··· 1039 1034 * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs 1040 1035 * @bdi_id: target bdi id 1041 1036 * @memcg_id: target memcg css id 1042 - * @nr: number of pages to write, 0 for best-effort dirty flushing 1043 1037 * @reason: reason why some writeback work initiated 1044 1038 * @done: target wb_completion 1045 1039 * 1046 1040 * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id 1047 1041 * with the specified parameters. 1048 1042 */ 1049 - int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr, 1043 + int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, 1050 1044 enum wb_reason reason, struct wb_completion *done) 1051 1045 { 1052 1046 struct backing_dev_info *bdi; 1053 1047 struct cgroup_subsys_state *memcg_css; 1054 1048 struct bdi_writeback *wb; 1055 1049 struct wb_writeback_work *work; 1050 + unsigned long dirty; 1056 1051 int ret; 1057 1052 1058 1053 /* lookup bdi and memcg */ ··· 1081 1076 } 1082 1077 1083 1078 /* 1084 - * If @nr is zero, the caller is attempting to write out most of 1079 + * The caller is attempting to write out most of 1085 1080 * the currently dirty pages. Let's take the current dirty page 1086 1081 * count and inflate it by 25% which should be large enough to 1087 1082 * flush out most dirty pages while avoiding getting livelocked by 1088 1083 * concurrent dirtiers. 1084 + * 1085 + * BTW the memcg stats are flushed periodically and this is best-effort 1086 + * estimation, so some potential error is ok. 
1089 1087 */ 1090 - if (!nr) { 1091 - unsigned long filepages, headroom, dirty, writeback; 1092 - 1093 - mem_cgroup_wb_stats(wb, &filepages, &headroom, &dirty, 1094 - &writeback); 1095 - nr = dirty * 10 / 8; 1096 - } 1088 + dirty = memcg_page_state(mem_cgroup_from_css(memcg_css), NR_FILE_DIRTY); 1089 + dirty = dirty * 10 / 8; 1097 1090 1098 1091 /* issue the writeback work */ 1099 1092 work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN); 1100 1093 if (work) { 1101 - work->nr_pages = nr; 1094 + work->nr_pages = dirty; 1102 1095 work->sync_mode = WB_SYNC_NONE; 1103 1096 work->range_cyclic = 1; 1104 1097 work->reason = reason; ··· 2002 1999 static long wb_writeback(struct bdi_writeback *wb, 2003 2000 struct wb_writeback_work *work) 2004 2001 { 2005 - unsigned long wb_start = jiffies; 2006 2002 long nr_pages = work->nr_pages; 2007 2003 unsigned long dirtied_before = jiffies; 2008 2004 struct inode *inode; ··· 2054 2052 else 2055 2053 progress = __writeback_inodes_wb(wb, work); 2056 2054 trace_writeback_written(wb, work); 2057 - 2058 - wb_update_bandwidth(wb, wb_start); 2059 2055 2060 2056 /* 2061 2057 * Did we write something? Try for more
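With the @nr parameter gone, cgroup_writeback_by_id() always computes its own target: the memcg's NR_FILE_DIRTY count inflated by 25% (dirty * 10 / 8), large enough to flush most dirty pages without getting livelocked by concurrent dirtiers. The integer arithmetic, pulled out for clarity:

```c
/* 25% headroom in integer arithmetic, as used in the hunk:
 * dirty * 10 / 8 == dirty * 1.25, truncated toward zero. */
static unsigned long inflate_dirty_target(unsigned long dirty)
{
    return dirty * 10 / 8;
}
```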
+2 -2
fs/fs_context.c
··· 254 254 struct fs_context *fc; 255 255 int ret = -ENOMEM; 256 256 257 - fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL); 257 + fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL_ACCOUNT); 258 258 if (!fc) 259 259 return ERR_PTR(-ENOMEM); 260 260 ··· 649 649 */ 650 650 static int legacy_init_fs_context(struct fs_context *fc) 651 651 { 652 - fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL); 652 + fc->fs_private = kzalloc(sizeof(struct legacy_fs_context), GFP_KERNEL_ACCOUNT); 653 653 if (!fc->fs_private) 654 654 return -ENOMEM; 655 655 fc->ops = &legacy_fs_context_ops;
+1 -1
fs/inode.c
··· 770 770 return LRU_ROTATE; 771 771 } 772 772 773 - if (inode_has_buffers(inode) || inode->i_data.nrpages) { 773 + if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { 774 774 __iget(inode); 775 775 spin_unlock(&inode->i_lock); 776 776 spin_unlock(lru_lock);
+4 -2
fs/locks.c
··· 2941 2941 int i; 2942 2942 2943 2943 flctx_cache = kmem_cache_create("file_lock_ctx", 2944 - sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL); 2944 + sizeof(struct file_lock_context), 0, 2945 + SLAB_PANIC | SLAB_ACCOUNT, NULL); 2945 2946 2946 2947 filelock_cache = kmem_cache_create("file_lock_cache", 2947 - sizeof(struct file_lock), 0, SLAB_PANIC, NULL); 2948 + sizeof(struct file_lock), 0, 2949 + SLAB_PANIC | SLAB_ACCOUNT, NULL); 2948 2950 2949 2951 for_each_possible_cpu(i) { 2950 2952 struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
+7 -1
fs/namei.c
··· 4089 4089 return -EPERM; 4090 4090 4091 4091 inode_lock(target); 4092 - if (is_local_mountpoint(dentry)) 4092 + if (IS_SWAPFILE(target)) 4093 + error = -EPERM; 4094 + else if (is_local_mountpoint(dentry)) 4093 4095 error = -EBUSY; 4094 4096 else { 4095 4097 error = security_inode_unlink(dir, dentry); ··· 4598 4596 lock_two_nondirectories(source, target); 4599 4597 else if (target) 4600 4598 inode_lock(target); 4599 + 4600 + error = -EPERM; 4601 + if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target))) 4602 + goto out; 4601 4603 4602 4604 error = -EBUSY; 4603 4605 if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+4 -3
fs/namespace.c
··· 203 203 goto out_free_cache; 204 204 205 205 if (name) { 206 - mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL); 206 + mnt->mnt_devname = kstrdup_const(name, 207 + GFP_KERNEL_ACCOUNT); 207 208 if (!mnt->mnt_devname) 208 209 goto out_free_id; 209 210 } ··· 3371 3370 if (!ucounts) 3372 3371 return ERR_PTR(-ENOSPC); 3373 3372 3374 - new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL); 3373 + new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT); 3375 3374 if (!new_ns) { 3376 3375 dec_mnt_namespaces(ucounts); 3377 3376 return ERR_PTR(-ENOMEM); ··· 4307 4306 int err; 4308 4307 4309 4308 mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount), 4310 - 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); 4309 + 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); 4311 4310 4312 4311 mount_hashtable = alloc_large_system_hash("Mount-cache", 4313 4312 sizeof(struct hlist_head),
+13 -1
fs/ocfs2/dlmglue.c
··· 16 16 #include <linux/debugfs.h> 17 17 #include <linux/seq_file.h> 18 18 #include <linux/time.h> 19 + #include <linux/delay.h> 19 20 #include <linux/quotaops.h> 20 21 #include <linux/sched/signal.h> 21 22 ··· 2722 2721 return status; 2723 2722 } 2724 2723 } 2725 - return tmp_oh ? 1 : 0; 2724 + return 1; 2726 2725 } 2727 2726 2728 2727 void ocfs2_inode_unlock_tracker(struct inode *inode, ··· 3913 3912 spin_unlock_irqrestore(&lockres->l_lock, flags); 3914 3913 ret = ocfs2_downconvert_lock(osb, lockres, new_level, set_lvb, 3915 3914 gen); 3915 + /* The dlm lock convert is being cancelled in background, 3916 + * ocfs2_cancel_convert() is asynchronous in fs/dlm, 3917 + * requeue it, try again later. 3918 + */ 3919 + if (ret == -EBUSY) { 3920 + ctl->requeue = 1; 3921 + mlog(ML_BASTS, "lockres %s, ReQ: Downconvert busy\n", 3922 + lockres->l_name); 3923 + ret = 0; 3924 + msleep(20); 3925 + } 3916 3926 3917 3927 leave: 3918 3928 if (ret)
-1
fs/ocfs2/quota_global.c
··· 357 357 } 358 358 oinfo->dqi_gi.dqi_sb = sb; 359 359 oinfo->dqi_gi.dqi_type = type; 360 - ocfs2_qinfo_lock_res_init(&oinfo->dqi_gqlock, oinfo); 361 360 oinfo->dqi_gi.dqi_entry_size = sizeof(struct ocfs2_global_disk_dqblk); 362 361 oinfo->dqi_gi.dqi_ops = &ocfs2_global_ops; 363 362 oinfo->dqi_gqi_bh = NULL;
+2
fs/ocfs2/quota_local.c
··· 702 702 info->dqi_priv = oinfo; 703 703 oinfo->dqi_type = type; 704 704 INIT_LIST_HEAD(&oinfo->dqi_chunk); 705 + oinfo->dqi_gqinode = NULL; 706 + ocfs2_qinfo_lock_res_init(&oinfo->dqi_gqlock, oinfo); 705 707 oinfo->dqi_rec = NULL; 706 708 oinfo->dqi_lqi_bh = NULL; 707 709 oinfo->dqi_libh = NULL;
+1 -1
fs/pipe.c
··· 191 191 */ 192 192 bool generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) 193 193 { 194 - return try_get_page(buf->page); 194 + return try_get_compound_head(buf->page, 1); 195 195 } 196 196 EXPORT_SYMBOL(generic_pipe_buf_get); 197 197
+2 -2
fs/select.c
··· 655 655 goto out_nofds; 656 656 657 657 alloc_size = 6 * size; 658 - bits = kvmalloc(alloc_size, GFP_KERNEL); 658 + bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT); 659 659 if (!bits) 660 660 goto out_nofds; 661 661 } ··· 1000 1000 1001 1001 len = min(todo, POLLFD_PER_PAGE); 1002 1002 walk = walk->next = kmalloc(struct_size(walk, entries, len), 1003 - GFP_KERNEL); 1003 + GFP_KERNEL_ACCOUNT); 1004 1004 if (!walk) { 1005 1005 err = -ENOMEM; 1006 1006 goto out_fds;
+57 -59
fs/userfaultfd.c
··· 33 33 34 34 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly; 35 35 36 - enum userfaultfd_state { 37 - UFFD_STATE_WAIT_API, 38 - UFFD_STATE_RUNNING, 39 - }; 40 - 41 36 /* 42 37 * Start with fault_pending_wqh and fault_wqh so they're more likely 43 38 * to be in the same cacheline. ··· 64 69 unsigned int flags; 65 70 /* features requested from the userspace */ 66 71 unsigned int features; 67 - /* state machine */ 68 - enum userfaultfd_state state; 69 72 /* released */ 70 73 bool released; 71 74 /* memory mappings are changing because of non-cooperative event */ 72 - bool mmap_changing; 75 + atomic_t mmap_changing; 73 76 /* mm with one ore more vmas attached to this userfaultfd_ctx */ 74 77 struct mm_struct *mm; 75 78 }; ··· 96 103 unsigned long start; 97 104 unsigned long len; 98 105 }; 106 + 107 + /* internal indication that UFFD_API ioctl was successfully executed */ 108 + #define UFFD_FEATURE_INITIALIZED (1u << 31) 109 + 110 + static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx) 111 + { 112 + return ctx->features & UFFD_FEATURE_INITIALIZED; 113 + } 99 114 100 115 static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode, 101 116 int wake_flags, void *key) ··· 624 623 * already released. 
625 624 */ 626 625 out: 627 - WRITE_ONCE(ctx->mmap_changing, false); 626 + atomic_dec(&ctx->mmap_changing); 627 + VM_BUG_ON(atomic_read(&ctx->mmap_changing) < 0); 628 628 userfaultfd_ctx_put(ctx); 629 629 } 630 630 ··· 668 666 669 667 refcount_set(&ctx->refcount, 1); 670 668 ctx->flags = octx->flags; 671 - ctx->state = UFFD_STATE_RUNNING; 672 669 ctx->features = octx->features; 673 670 ctx->released = false; 674 - ctx->mmap_changing = false; 671 + atomic_set(&ctx->mmap_changing, 0); 675 672 ctx->mm = vma->vm_mm; 676 673 mmgrab(ctx->mm); 677 674 678 675 userfaultfd_ctx_get(octx); 679 - WRITE_ONCE(octx->mmap_changing, true); 676 + atomic_inc(&octx->mmap_changing); 680 677 fctx->orig = octx; 681 678 fctx->new = ctx; 682 679 list_add_tail(&fctx->list, fcs); ··· 722 721 if (ctx->features & UFFD_FEATURE_EVENT_REMAP) { 723 722 vm_ctx->ctx = ctx; 724 723 userfaultfd_ctx_get(ctx); 725 - WRITE_ONCE(ctx->mmap_changing, true); 724 + atomic_inc(&ctx->mmap_changing); 726 725 } else { 727 726 /* Drop uffd context if remap feature not enabled */ 728 727 vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; ··· 767 766 return true; 768 767 769 768 userfaultfd_ctx_get(ctx); 770 - WRITE_ONCE(ctx->mmap_changing, true); 769 + atomic_inc(&ctx->mmap_changing); 771 770 mmap_read_unlock(mm); 772 771 773 772 msg_init(&ewq.msg); ··· 811 810 return -ENOMEM; 812 811 813 812 userfaultfd_ctx_get(ctx); 814 - WRITE_ONCE(ctx->mmap_changing, true); 813 + atomic_inc(&ctx->mmap_changing); 815 814 unmap_ctx->ctx = ctx; 816 815 unmap_ctx->start = start; 817 816 unmap_ctx->end = end; ··· 944 943 945 944 poll_wait(file, &ctx->fd_wqh, wait); 946 945 947 - switch (ctx->state) { 948 - case UFFD_STATE_WAIT_API: 946 + if (!userfaultfd_is_initialized(ctx)) 949 947 return EPOLLERR; 950 - case UFFD_STATE_RUNNING: 951 - /* 952 - * poll() never guarantees that read won't block. 953 - * userfaults can be waken before they're read(). 
954 - */ 955 - if (unlikely(!(file->f_flags & O_NONBLOCK))) 956 - return EPOLLERR; 957 - /* 958 - * lockless access to see if there are pending faults 959 - * __pollwait last action is the add_wait_queue but 960 - * the spin_unlock would allow the waitqueue_active to 961 - * pass above the actual list_add inside 962 - * add_wait_queue critical section. So use a full 963 - * memory barrier to serialize the list_add write of 964 - * add_wait_queue() with the waitqueue_active read 965 - * below. 966 - */ 967 - ret = 0; 968 - smp_mb(); 969 - if (waitqueue_active(&ctx->fault_pending_wqh)) 970 - ret = EPOLLIN; 971 - else if (waitqueue_active(&ctx->event_wqh)) 972 - ret = EPOLLIN; 973 948 974 - return ret; 975 - default: 976 - WARN_ON_ONCE(1); 949 + /* 950 + * poll() never guarantees that read won't block. 951 + * userfaults can be waken before they're read(). 952 + */ 953 + if (unlikely(!(file->f_flags & O_NONBLOCK))) 977 954 return EPOLLERR; 978 - } 955 + /* 956 + * lockless access to see if there are pending faults 957 + * __pollwait last action is the add_wait_queue but 958 + * the spin_unlock would allow the waitqueue_active to 959 + * pass above the actual list_add inside 960 + * add_wait_queue critical section. So use a full 961 + * memory barrier to serialize the list_add write of 962 + * add_wait_queue() with the waitqueue_active read 963 + * below. 
964 + */ 965 + ret = 0; 966 + smp_mb(); 967 + if (waitqueue_active(&ctx->fault_pending_wqh)) 968 + ret = EPOLLIN; 969 + else if (waitqueue_active(&ctx->event_wqh)) 970 + ret = EPOLLIN; 971 + 972 + return ret; 979 973 } 980 974 981 975 static const struct file_operations userfaultfd_fops; ··· 1165 1169 int no_wait = file->f_flags & O_NONBLOCK; 1166 1170 struct inode *inode = file_inode(file); 1167 1171 1168 - if (ctx->state == UFFD_STATE_WAIT_API) 1172 + if (!userfaultfd_is_initialized(ctx)) 1169 1173 return -EINVAL; 1170 1174 1171 1175 for (;;) { ··· 1696 1700 user_uffdio_copy = (struct uffdio_copy __user *) arg; 1697 1701 1698 1702 ret = -EAGAIN; 1699 - if (READ_ONCE(ctx->mmap_changing)) 1703 + if (atomic_read(&ctx->mmap_changing)) 1700 1704 goto out; 1701 1705 1702 1706 ret = -EFAULT; ··· 1753 1757 user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg; 1754 1758 1755 1759 ret = -EAGAIN; 1756 - if (READ_ONCE(ctx->mmap_changing)) 1760 + if (atomic_read(&ctx->mmap_changing)) 1757 1761 goto out; 1758 1762 1759 1763 ret = -EFAULT; ··· 1803 1807 struct userfaultfd_wake_range range; 1804 1808 bool mode_wp, mode_dontwake; 1805 1809 1806 - if (READ_ONCE(ctx->mmap_changing)) 1810 + if (atomic_read(&ctx->mmap_changing)) 1807 1811 return -EAGAIN; 1808 1812 1809 1813 user_uffdio_wp = (struct uffdio_writeprotect __user *) arg; ··· 1851 1855 user_uffdio_continue = (struct uffdio_continue __user *)arg; 1852 1856 1853 1857 ret = -EAGAIN; 1854 - if (READ_ONCE(ctx->mmap_changing)) 1858 + if (atomic_read(&ctx->mmap_changing)) 1855 1859 goto out; 1856 1860 1857 1861 ret = -EFAULT; ··· 1904 1908 static inline unsigned int uffd_ctx_features(__u64 user_features) 1905 1909 { 1906 1910 /* 1907 - * For the current set of features the bits just coincide 1911 + * For the current set of features the bits just coincide. Set 1912 + * UFFD_FEATURE_INITIALIZED to mark the features as enabled. 
1908 1913 */ 1909 - return (unsigned int)user_features; 1914 + return (unsigned int)user_features | UFFD_FEATURE_INITIALIZED; 1910 1915 } 1911 1916 1912 1917 /* ··· 1920 1923 { 1921 1924 struct uffdio_api uffdio_api; 1922 1925 void __user *buf = (void __user *)arg; 1926 + unsigned int ctx_features; 1923 1927 int ret; 1924 1928 __u64 features; 1925 1929 1926 - ret = -EINVAL; 1927 - if (ctx->state != UFFD_STATE_WAIT_API) 1928 - goto out; 1929 1930 ret = -EFAULT; 1930 1931 if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api))) 1931 1932 goto out; ··· 1947 1952 ret = -EFAULT; 1948 1953 if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api))) 1949 1954 goto out; 1950 - ctx->state = UFFD_STATE_RUNNING; 1955 + 1951 1956 /* only enable the requested features for this uffd context */ 1952 - ctx->features = uffd_ctx_features(features); 1957 + ctx_features = uffd_ctx_features(features); 1958 + ret = -EINVAL; 1959 + if (cmpxchg(&ctx->features, 0, ctx_features) != 0) 1960 + goto err_out; 1961 + 1953 1962 ret = 0; 1954 1963 out: 1955 1964 return ret; ··· 1970 1971 int ret = -EINVAL; 1971 1972 struct userfaultfd_ctx *ctx = file->private_data; 1972 1973 1973 - if (cmd != UFFDIO_API && ctx->state == UFFD_STATE_WAIT_API) 1974 + if (cmd != UFFDIO_API && !userfaultfd_is_initialized(ctx)) 1974 1975 return -EINVAL; 1975 1976 1976 1977 switch(cmd) { ··· 2084 2085 refcount_set(&ctx->refcount, 1); 2085 2086 ctx->flags = flags; 2086 2087 ctx->features = 0; 2087 - ctx->state = UFFD_STATE_WAIT_API; 2088 2088 ctx->released = false; 2089 - ctx->mmap_changing = false; 2089 + atomic_set(&ctx->mmap_changing, 0); 2090 2090 ctx->mm = current->mm; 2091 2091 /* prevent the mm struct to be freed */ 2092 2092 mmgrab(ctx->mm);
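The userfaultfd rework above retires the explicit state machine in favor of a bit stolen from the features word: UFFD_FEATURE_INITIALIZED is set via cmpxchg() from 0, so only the first UFFDIO_API call can succeed and the enum plus ctx->state field disappear. The same one-shot handshake in portable C11 atomics (a sketch of the idea, not the kernel code):

```c
#include <stdatomic.h>

#define UFFD_FEATURE_INITIALIZED (1u << 31)

struct uffd_ctx_sketch {
    _Atomic unsigned int features;   /* 0 until UFFD_API succeeds */
};

/* Mirrors userfaultfd_api() after the patch: fold the "initialized"
 * flag into the features the caller asked for, and let a single
 * compare-and-swap from 0 decide which caller wins. */
static int uffd_api_sketch(struct uffd_ctx_sketch *ctx,
                           unsigned int user_features)
{
    unsigned int expected = 0;
    unsigned int ctx_features = user_features | UFFD_FEATURE_INITIALIZED;

    if (!atomic_compare_exchange_strong(&ctx->features, &expected,
                                        ctx_features))
        return -22;                  /* -EINVAL: already initialized */
    return 0;
}

static int uffd_is_initialized(struct uffd_ctx_sketch *ctx)
{
    return (atomic_load(&ctx->features) & UFFD_FEATURE_INITIALIZED) != 0;
}
```

Because the flag lives in the same word as the features, every later ioctl can gate on initialization with a single load, which is what the reworked userfaultfd_poll() and userfaultfd_read() do.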
+2
include/linux/backing-dev-defs.h
··· 116 116 struct list_head b_dirty_time; /* time stamps are dirty */ 117 117 spinlock_t list_lock; /* protects the b_* lists */ 118 118 119 + atomic_t writeback_inodes; /* number of inodes under writeback */ 119 120 struct percpu_counter stat[NR_WB_STAT_ITEMS]; 120 121 121 122 unsigned long congested; /* WB_[a]sync_congested flags */ ··· 143 142 spinlock_t work_lock; /* protects work_list & dwork scheduling */ 144 143 struct list_head work_list; 145 144 struct delayed_work dwork; /* work item used for writeback */ 145 + struct delayed_work bw_dwork; /* work item used for bandwidth estimate */ 146 146 147 147 unsigned long dirty_sleep; /* last wait */ 148 148
+19
include/linux/backing-dev.h
··· 288 288 return inode->i_wb; 289 289 } 290 290 291 + static inline struct bdi_writeback *inode_to_wb_wbc( 292 + struct inode *inode, 293 + struct writeback_control *wbc) 294 + { 295 + /* 296 + * If wbc does not have inode attached, it means cgroup writeback was 297 + * disabled when wbc started. Just use the default wb in that case. 298 + */ 299 + return wbc->wb ? wbc->wb : &inode_to_bdi(inode)->wb; 300 + } 301 + 291 302 /** 292 303 * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction 293 304 * @inode: target inode ··· 376 365 { 377 366 return &inode_to_bdi(inode)->wb; 378 367 } 368 + 369 + static inline struct bdi_writeback *inode_to_wb_wbc( 370 + struct inode *inode, 371 + struct writeback_control *wbc) 372 + { 373 + return inode_to_wb(inode); 374 + } 375 + 379 376 380 377 static inline struct bdi_writeback * 381 378 unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie)
+1 -1
include/linux/buffer_head.h
··· 409 409 static inline int remove_inode_buffers(struct inode *inode) { return 1; } 410 410 static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; } 411 411 static inline void invalidate_bh_lrus_cpu(int cpu) {} 412 - static inline bool has_bh_in_lru(int cpu, void *dummy) { return 0; } 412 + static inline bool has_bh_in_lru(int cpu, void *dummy) { return false; } 413 413 #define buffer_heads_over_limit 0 414 414 415 415 #endif /* CONFIG_BLOCK */
+2
include/linux/compaction.h
··· 84 84 extern unsigned int sysctl_compaction_proactiveness; 85 85 extern int sysctl_compaction_handler(struct ctl_table *table, int write, 86 86 void *buffer, size_t *length, loff_t *ppos); 87 + extern int compaction_proactiveness_sysctl_handler(struct ctl_table *table, 88 + int write, void *buffer, size_t *length, loff_t *ppos); 87 89 extern int sysctl_extfrag_threshold; 88 90 extern int sysctl_compact_unevictable_allowed; 89 91
+1 -4
include/linux/highmem.h
··· 130 130 } 131 131 #endif 132 132 133 - #ifndef ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE 134 - static inline void flush_kernel_dcache_page(struct page *page) 135 - { 136 - } 133 + #ifndef ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 137 134 static inline void flush_kernel_vmap_range(void *vaddr, int size) 138 135 { 139 136 }
+12
include/linux/hugetlb_cgroup.h
··· 121 121 css_put(&h_cg->css); 122 122 } 123 123 124 + static inline void resv_map_dup_hugetlb_cgroup_uncharge_info( 125 + struct resv_map *resv_map) 126 + { 127 + if (resv_map->css) 128 + css_get(resv_map->css); 129 + } 130 + 124 131 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, 125 132 struct hugetlb_cgroup **ptr); 126 133 extern int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages, ··· 203 196 } 204 197 205 198 static inline void hugetlb_cgroup_put_rsvd_cgroup(struct hugetlb_cgroup *h_cg) 199 + { 200 + } 201 + 202 + static inline void resv_map_dup_hugetlb_cgroup_uncharge_info( 203 + struct resv_map *resv_map) 206 204 { 207 205 } 208 206
-2
include/linux/memblock.h
··· 99 99 static inline void memblock_discard(void) {} 100 100 #endif 101 101 102 - phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end, 103 - phys_addr_t size, phys_addr_t align); 104 102 void memblock_allow_resize(void); 105 103 int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid); 106 104 int memblock_add(phys_addr_t base, phys_addr_t size);
+61 -43
include/linux/memcontrol.h
··· 105 105 unsigned int generation; 106 106 }; 107 107 108 - struct lruvec_stat { 109 - long count[NR_VM_NODE_STAT_ITEMS]; 110 - }; 111 - 112 - struct batched_lruvec_stat { 113 - s32 count[NR_VM_NODE_STAT_ITEMS]; 114 - }; 115 - 116 108 /* 117 109 * Bitmap and deferred work of shrinker::id corresponding to memcg-aware 118 110 * shrinkers, which have elements charged to this memcg. ··· 115 123 unsigned long *map; 116 124 }; 117 125 126 + struct lruvec_stats_percpu { 127 + /* Local (CPU and cgroup) state */ 128 + long state[NR_VM_NODE_STAT_ITEMS]; 129 + 130 + /* Delta calculation for lockless upward propagation */ 131 + long state_prev[NR_VM_NODE_STAT_ITEMS]; 132 + }; 133 + 134 + struct lruvec_stats { 135 + /* Aggregated (CPU and subtree) state */ 136 + long state[NR_VM_NODE_STAT_ITEMS]; 137 + 138 + /* Pending child counts during tree propagation */ 139 + long state_pending[NR_VM_NODE_STAT_ITEMS]; 140 + }; 141 + 118 142 /* 119 143 * per-node information in memory controller. 120 144 */ 121 145 struct mem_cgroup_per_node { 122 146 struct lruvec lruvec; 123 147 124 - /* 125 - * Legacy local VM stats. This should be struct lruvec_stat and 126 - * cannot be optimized to struct batched_lruvec_stat. Because 127 - * the threshold of the lruvec_stat_cpu can be as big as 128 - * MEMCG_CHARGE_BATCH * PAGE_SIZE. It can fit into s32. But this 129 - * filed has no upper limit. 
130 - */ 131 - struct lruvec_stat __percpu *lruvec_stat_local; 132 - 133 - /* Subtree VM stats (batched updates) */ 134 - struct batched_lruvec_stat __percpu *lruvec_stat_cpu; 135 - atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS]; 148 + struct lruvec_stats_percpu __percpu *lruvec_stats_percpu; 149 + struct lruvec_stats lruvec_stats; 136 150 137 151 unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; 138 152 ··· 593 595 } 594 596 #endif 595 597 596 - static __always_inline bool memcg_stat_item_in_bytes(int idx) 597 - { 598 - if (idx == MEMCG_PERCPU_B) 599 - return true; 600 - return vmstat_item_in_bytes(idx); 601 - } 602 - 603 598 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 604 599 { 605 600 return (memcg == root_mem_cgroup); ··· 684 693 page_counter_read(&memcg->memory); 685 694 } 686 695 687 - int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); 696 + int __mem_cgroup_charge(struct page *page, struct mm_struct *mm, 697 + gfp_t gfp_mask); 698 + static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm, 699 + gfp_t gfp_mask) 700 + { 701 + if (mem_cgroup_disabled()) 702 + return 0; 703 + return __mem_cgroup_charge(page, mm, gfp_mask); 704 + } 705 + 688 706 int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, 689 707 gfp_t gfp, swp_entry_t entry); 690 708 void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); 691 709 692 - void mem_cgroup_uncharge(struct page *page); 693 - void mem_cgroup_uncharge_list(struct list_head *page_list); 710 + void __mem_cgroup_uncharge(struct page *page); 711 + static inline void mem_cgroup_uncharge(struct page *page) 712 + { 713 + if (mem_cgroup_disabled()) 714 + return; 715 + __mem_cgroup_uncharge(page); 716 + } 717 + 718 + void __mem_cgroup_uncharge_list(struct list_head *page_list); 719 + static inline void mem_cgroup_uncharge_list(struct list_head *page_list) 720 + { 721 + if (mem_cgroup_disabled()) 722 + return; 723 + 
__mem_cgroup_uncharge_list(page_list); 724 + } 694 725 695 726 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage); 696 727 ··· 897 884 return !!(memcg->css.flags & CSS_ONLINE); 898 885 } 899 886 900 - /* 901 - * For memory reclaim. 902 - */ 903 - int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); 904 - 905 887 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, 906 888 int zid, int nr_pages); 907 889 ··· 963 955 local_irq_restore(flags); 964 956 } 965 957 958 + static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) 959 + { 960 + return READ_ONCE(memcg->vmstats.state[idx]); 961 + } 962 + 966 963 static inline unsigned long lruvec_page_state(struct lruvec *lruvec, 967 964 enum node_stat_item idx) 968 965 { 969 966 struct mem_cgroup_per_node *pn; 970 - long x; 971 967 972 968 if (mem_cgroup_disabled()) 973 969 return node_page_state(lruvec_pgdat(lruvec), idx); 974 970 975 971 pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 976 - x = atomic_long_read(&pn->lruvec_stat[idx]); 977 - #ifdef CONFIG_SMP 978 - if (x < 0) 979 - x = 0; 980 - #endif 981 - return x; 972 + return READ_ONCE(pn->lruvec_stats.state[idx]); 982 973 } 983 974 984 975 static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, ··· 992 985 993 986 pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 994 987 for_each_possible_cpu(cpu) 995 - x += per_cpu(pn->lruvec_stat_local->count[idx], cpu); 988 + x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu); 996 989 #ifdef CONFIG_SMP 997 990 if (x < 0) 998 991 x = 0; 999 992 #endif 1000 993 return x; 1001 994 } 995 + 996 + void mem_cgroup_flush_stats(void); 1002 997 1003 998 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, 1004 999 int val); ··· 1400 1391 { 1401 1392 } 1402 1393 1394 + static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) 1395 + { 1396 + return 0; 1397 + } 1398 + 1403 1399 
static inline unsigned long lruvec_page_state(struct lruvec *lruvec, 1404 1400 enum node_stat_item idx) 1405 1401 { ··· 1415 1401 enum node_stat_item idx) 1416 1402 { 1417 1403 return node_page_state(lruvec_pgdat(lruvec), idx); 1404 + } 1405 + 1406 + static inline void mem_cgroup_flush_stats(void) 1407 + { 1418 1408 } 1419 1409 1420 1410 static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+1 -1
include/linux/memory.h
··· 90 90 void remove_memory_block_devices(unsigned long start, unsigned long size); 91 91 extern void memory_dev_init(void); 92 92 extern int memory_notify(unsigned long val, void *v); 93 - extern struct memory_block *find_memory_block(struct mem_section *); 93 + extern struct memory_block *find_memory_block(unsigned long section_nr); 94 94 typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *); 95 95 extern int walk_memory_blocks(unsigned long start, unsigned long size, 96 96 void *arg, walk_memory_blocks_func_t func);
+16
include/linux/mempolicy.h
··· 184 184 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long); 185 185 extern void mpol_put_task_policy(struct task_struct *); 186 186 187 + extern bool numa_demotion_enabled; 188 + 189 + static inline bool mpol_is_preferred_many(struct mempolicy *pol) 190 + { 191 + return (pol->mode == MPOL_PREFERRED_MANY); 192 + } 193 + 194 + 187 195 #else 188 196 189 197 struct mempolicy {}; ··· 300 292 { 301 293 return NULL; 302 294 } 295 + 296 + #define numa_demotion_enabled false 297 + 298 + static inline bool mpol_is_preferred_many(struct mempolicy *pol) 299 + { 300 + return false; 301 + } 302 + 303 303 #endif /* CONFIG_NUMA */ 304 304 #endif
+12 -2
include/linux/migrate.h
··· 28 28 MR_NUMA_MISPLACED, 29 29 MR_CONTIG_RANGE, 30 30 MR_LONGTERM_PIN, 31 + MR_DEMOTION, 31 32 MR_TYPES 32 33 }; 33 34 ··· 42 41 struct page *newpage, struct page *page, 43 42 enum migrate_mode mode); 44 43 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, 45 - unsigned long private, enum migrate_mode mode, int reason); 44 + unsigned long private, enum migrate_mode mode, int reason, 45 + unsigned int *ret_succeeded); 46 46 extern struct page *alloc_migration_target(struct page *page, unsigned long private); 47 47 extern int isolate_movable_page(struct page *page, isolate_mode_t mode); 48 48 ··· 58 56 static inline void putback_movable_pages(struct list_head *l) {} 59 57 static inline int migrate_pages(struct list_head *l, new_page_t new, 60 58 free_page_t free, unsigned long private, enum migrate_mode mode, 61 - int reason) 59 + int reason, unsigned int *ret_succeeded) 62 60 { return -ENOSYS; } 63 61 static inline struct page *alloc_migration_target(struct page *page, 64 62 unsigned long private) ··· 168 166 int migrate_vma_setup(struct migrate_vma *args); 169 167 void migrate_vma_pages(struct migrate_vma *migrate); 170 168 void migrate_vma_finalize(struct migrate_vma *migrate); 169 + int next_demotion_node(int node); 170 + 171 + #else /* CONFIG_MIGRATION disabled: */ 172 + 173 + static inline int next_demotion_node(int node) 174 + { 175 + return NUMA_NO_NODE; 176 + } 171 177 172 178 #endif /* CONFIG_MIGRATION */ 173 179
+4 -13
include/linux/mm.h
··· 1216 1216 } 1217 1217 1218 1218 bool __must_check try_grab_page(struct page *page, unsigned int flags); 1219 - __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs, 1220 - unsigned int flags); 1219 + struct page *try_grab_compound_head(struct page *page, int refs, 1220 + unsigned int flags); 1221 1221 1222 - 1223 - static inline __must_check bool try_get_page(struct page *page) 1224 - { 1225 - page = compound_head(page); 1226 - if (WARN_ON_ONCE(page_ref_count(page) <= 0)) 1227 - return false; 1228 - page_ref_inc(page); 1229 - return true; 1230 - } 1222 + struct page *try_get_compound_head(struct page *page, int refs); 1231 1223 1232 1224 static inline void put_page(struct page *page) 1233 1225 { ··· 1841 1849 struct kvec; 1842 1850 int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, 1843 1851 struct page **pages); 1844 - int get_kernel_page(unsigned long start, int write, struct page **pages); 1845 1852 struct page *get_dump_page(unsigned long addr); 1846 1853 1847 1854 extern int try_to_release_page(struct page * page, gfp_t gfp_mask); ··· 3112 3121 extern int unpoison_memory(unsigned long pfn); 3113 3122 extern int sysctl_memory_failure_early_kill; 3114 3123 extern int sysctl_memory_failure_recovery; 3115 - extern void shake_page(struct page *p, int access); 3124 + extern void shake_page(struct page *p); 3116 3125 extern atomic_long_t num_poisoned_pages __read_mostly; 3117 3126 extern int soft_offline_page(unsigned long pfn, int flags); 3118 3127
+2 -2
include/linux/mmzone.h
··· 846 846 enum zone_type kcompactd_highest_zoneidx; 847 847 wait_queue_head_t kcompactd_wait; 848 848 struct task_struct *kcompactd; 849 + bool proactive_compact_trigger; 849 850 #endif 850 851 /* 851 852 * This is a per-node reserve of pages that are not available ··· 1343 1342 return NULL; 1344 1343 return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK]; 1345 1344 } 1346 - extern unsigned long __section_nr(struct mem_section *ms); 1347 1345 extern size_t mem_section_usage_size(void); 1348 1346 1349 1347 /* ··· 1365 1365 #define SECTION_TAINT_ZONE_DEVICE (1UL<<4) 1366 1366 #define SECTION_MAP_LAST_BIT (1UL<<5) 1367 1367 #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) 1368 - #define SECTION_NID_SHIFT 3 1368 + #define SECTION_NID_SHIFT 6 1369 1369 1370 1370 static inline struct page *__section_mem_map_addr(struct mem_section *section) 1371 1371 {
+2 -2
include/linux/pagemap.h
··· 736 736 /* 737 737 * Fault everything in given userspace address range in. 738 738 */ 739 - static inline int fault_in_pages_writeable(char __user *uaddr, int size) 739 + static inline int fault_in_pages_writeable(char __user *uaddr, size_t size) 740 740 { 741 741 char __user *end = uaddr + size - 1; 742 742 ··· 763 763 return 0; 764 764 } 765 765 766 - static inline int fault_in_pages_readable(const char __user *uaddr, int size) 766 + static inline int fault_in_pages_readable(const char __user *uaddr, size_t size) 767 767 { 768 768 volatile char c; 769 769 const char __user *end = uaddr + size - 1;
+5 -5
include/linux/sched/mm.h
··· 174 174 } 175 175 176 176 #ifdef CONFIG_LOCKDEP 177 - extern void __fs_reclaim_acquire(void); 178 - extern void __fs_reclaim_release(void); 177 + extern void __fs_reclaim_acquire(unsigned long ip); 178 + extern void __fs_reclaim_release(unsigned long ip); 179 179 extern void fs_reclaim_acquire(gfp_t gfp_mask); 180 180 extern void fs_reclaim_release(gfp_t gfp_mask); 181 181 #else 182 - static inline void __fs_reclaim_acquire(void) { } 183 - static inline void __fs_reclaim_release(void) { } 182 + static inline void __fs_reclaim_acquire(unsigned long ip) { } 183 + static inline void __fs_reclaim_release(unsigned long ip) { } 184 184 static inline void fs_reclaim_acquire(gfp_t gfp_mask) { } 185 185 static inline void fs_reclaim_release(gfp_t gfp_mask) { } 186 186 #endif ··· 306 306 { 307 307 struct mem_cgroup *old; 308 308 309 - if (in_interrupt()) { 309 + if (!in_task()) { 310 310 old = this_cpu_read(int_active_memcg); 311 311 this_cpu_write(int_active_memcg, memcg); 312 312 } else {
+21 -4
include/linux/shmem_fs.h
··· 18 18 unsigned long flags; 19 19 unsigned long alloced; /* data pages alloced to file */ 20 20 unsigned long swapped; /* subtotal assigned to swap */ 21 + pgoff_t fallocend; /* highest fallocate endindex */ 21 22 struct list_head shrinklist; /* shrinkable hpage inodes */ 22 23 struct list_head swaplist; /* chain of maybes on swap */ 23 24 struct shared_policy policy; /* NUMA memory alloc policy */ ··· 32 31 struct percpu_counter used_blocks; /* How many are allocated */ 33 32 unsigned long max_inodes; /* How many inodes are allowed */ 34 33 unsigned long free_inodes; /* How many are left for allocation */ 35 - spinlock_t stat_lock; /* Serialize shmem_sb_info changes */ 34 + raw_spinlock_t stat_lock; /* Serialize shmem_sb_info changes */ 36 35 umode_t mode; /* Mount mode for root directory */ 37 36 unsigned char huge; /* Whether to try for hugepages */ 38 37 kuid_t uid; /* Mount uid for root directory */ ··· 86 85 extern int shmem_unuse(unsigned int type, bool frontswap, 87 86 unsigned long *fs_pages_to_unuse); 88 87 89 - extern bool shmem_huge_enabled(struct vm_area_struct *vma); 88 + extern bool shmem_is_huge(struct vm_area_struct *vma, 89 + struct inode *inode, pgoff_t index); 90 + static inline bool shmem_huge_enabled(struct vm_area_struct *vma) 91 + { 92 + return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff); 93 + } 90 94 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma); 91 95 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping, 92 96 pgoff_t start, pgoff_t end); ··· 99 93 /* Flag allocation requirements to shmem_getpage */ 100 94 enum sgp_type { 101 95 SGP_READ, /* don't exceed i_size, don't allocate page */ 96 + SGP_NOALLOC, /* similar, but fail on hole or use fallocated page */ 102 97 SGP_CACHE, /* don't exceed i_size, may allocate page */ 103 - SGP_NOHUGE, /* like SGP_CACHE, but no huge pages */ 104 - SGP_HUGE, /* like SGP_CACHE, huge pages preferred */ 105 98 SGP_WRITE, /* may exceed i_size, may 
allocate !Uptodate page */ 106 99 SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */ 107 100 }; ··· 122 117 if (!file || !file->f_mapping) 123 118 return false; 124 119 return shmem_mapping(file->f_mapping); 120 + } 121 + 122 + /* 123 + * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages 124 + * beyond i_size's notion of EOF, which fallocate has committed to reserving: 125 + * which split_huge_page() must therefore not delete. This use of a single 126 + * "fallocend" per inode errs on the side of not deleting a reservation when 127 + * in doubt: there are plenty of cases when it preserves unreserved pages. 128 + */ 129 + static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof) 130 + { 131 + return max(eof, SHMEM_I(inode)->fallocend); 125 132 } 126 133 127 134 extern bool shmem_charge(struct inode *inode, long pages);
+24 -4
include/linux/swap.h
··· 408 408 409 409 extern void check_move_unevictable_pages(struct pagevec *pvec); 410 410 411 - extern int kswapd_run(int nid); 411 + extern void kswapd_run(int nid); 412 412 extern void kswapd_stop(int nid); 413 413 414 414 #ifdef CONFIG_SWAP ··· 721 721 #endif 722 722 723 723 #if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) 724 - extern void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask); 724 + extern void __cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask); 725 + static inline void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask) 726 + { 727 + if (mem_cgroup_disabled()) 728 + return; 729 + __cgroup_throttle_swaprate(page, gfp_mask); 730 + } 725 731 #else 726 732 static inline void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask) 727 733 { ··· 736 730 737 731 #ifdef CONFIG_MEMCG_SWAP 738 732 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry); 739 - extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry); 740 - extern void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); 733 + extern int __mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry); 734 + static inline int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) 735 + { 736 + if (mem_cgroup_disabled()) 737 + return 0; 738 + return __mem_cgroup_try_charge_swap(page, entry); 739 + } 740 + 741 + extern void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); 742 + static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) 743 + { 744 + if (mem_cgroup_disabled()) 745 + return; 746 + __mem_cgroup_uncharge_swap(entry, nr_pages); 747 + } 748 + 741 749 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); 742 750 extern bool mem_cgroup_swap_full(struct page *page); 743 751 #else
+1
include/linux/syscalls.h
··· 915 915 asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); 916 916 asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec, 917 917 size_t vlen, int behavior, unsigned int flags); 918 + asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags); 918 919 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, 919 920 unsigned long prot, unsigned long pgoff, 920 921 unsigned long flags);
+4 -4
include/linux/userfaultfd_k.h
··· 60 60 61 61 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, 62 62 unsigned long src_start, unsigned long len, 63 - bool *mmap_changing, __u64 mode); 63 + atomic_t *mmap_changing, __u64 mode); 64 64 extern ssize_t mfill_zeropage(struct mm_struct *dst_mm, 65 65 unsigned long dst_start, 66 66 unsigned long len, 67 - bool *mmap_changing); 67 + atomic_t *mmap_changing); 68 68 extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start, 69 - unsigned long len, bool *mmap_changing); 69 + unsigned long len, atomic_t *mmap_changing); 70 70 extern int mwriteprotect_range(struct mm_struct *dst_mm, 71 71 unsigned long start, unsigned long len, 72 - bool enable_wp, bool *mmap_changing); 72 + bool enable_wp, atomic_t *mmap_changing); 73 73 74 74 /* mm helpers */ 75 75 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+2
include/linux/vm_event_item.h
··· 33 33 PGREUSE, 34 34 PGSTEAL_KSWAPD, 35 35 PGSTEAL_DIRECT, 36 + PGDEMOTE_KSWAPD, 37 + PGDEMOTE_DIRECT, 36 38 PGSCAN_KSWAPD, 37 39 PGSCAN_DIRECT, 38 40 PGSCAN_DIRECT_THROTTLE,
+1 -1
include/linux/vmpressure.h
··· 37 37 extern void vmpressure_init(struct vmpressure *vmpr); 38 38 extern void vmpressure_cleanup(struct vmpressure *vmpr); 39 39 extern struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg); 40 - extern struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr); 40 + extern struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr); 41 41 extern int vmpressure_register_event(struct mem_cgroup *memcg, 42 42 struct eventfd_ctx *eventfd, 43 43 const char *args);
+2 -2
include/linux/writeback.h
··· 218 218 void wbc_detach_inode(struct writeback_control *wbc); 219 219 void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page, 220 220 size_t bytes); 221 - int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages, 221 + int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, 222 222 enum wb_reason reason, struct wb_completion *done); 223 223 void cgroup_writeback_umount(void); 224 224 bool cleanup_offline_cgwb(struct bdi_writeback *wb); ··· 374 374 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty); 375 375 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh); 376 376 377 - void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time); 377 + void wb_update_bandwidth(struct bdi_writeback *wb); 378 378 void balance_dirty_pages_ratelimited(struct address_space *mapping); 379 379 bool wb_over_bg_thresh(struct bdi_writeback *wb); 380 380
+2 -1
include/trace/events/migrate.h
··· 21 21 EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ 22 22 EM( MR_NUMA_MISPLACED, "numa_misplaced") \ 23 23 EM( MR_CONTIG_RANGE, "contig_range") \ 24 - EMe(MR_LONGTERM_PIN, "longterm_pin") 24 + EM( MR_LONGTERM_PIN, "longterm_pin") \ 25 + EMe(MR_DEMOTION, "demotion") 25 26 26 27 /* 27 28 * First define the enums in the above macros to be exported to userspace
+3 -1
include/uapi/asm-generic/unistd.h
··· 877 877 #define __NR_memfd_secret 447 878 878 __SYSCALL(__NR_memfd_secret, sys_memfd_secret) 879 879 #endif 880 + #define __NR_process_mrelease 448 881 + __SYSCALL(__NR_process_mrelease, sys_process_mrelease) 880 882 881 883 #undef __NR_syscalls 882 - #define __NR_syscalls 448 884 + #define __NR_syscalls 449 883 885 884 886 /* 885 887 * 32 bit systems traditionally used different
+1
include/uapi/linux/mempolicy.h
··· 22 22 MPOL_BIND, 23 23 MPOL_INTERLEAVE, 24 24 MPOL_LOCAL, 25 + MPOL_PREFERRED_MANY, 25 26 MPOL_MAX, /* always last member of enum */ 26 27 }; 27 28
+1 -1
ipc/msg.c
··· 147 147 key_t key = params->key; 148 148 int msgflg = params->flg; 149 149 150 - msq = kmalloc(sizeof(*msq), GFP_KERNEL); 150 + msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT); 151 151 if (unlikely(!msq)) 152 152 return -ENOMEM; 153 153
+1 -1
ipc/namespace.c
··· 42 42 goto fail; 43 43 44 44 err = -ENOMEM; 45 - ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL); 45 + ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT); 46 46 if (ns == NULL) 47 47 goto fail_dec; 48 48
+5 -4
ipc/sem.c
··· 514 514 if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0])) 515 515 return NULL; 516 516 517 - sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL); 517 + sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT); 518 518 if (unlikely(!sma)) 519 519 return NULL; 520 520 ··· 1855 1855 1856 1856 undo_list = current->sysvsem.undo_list; 1857 1857 if (!undo_list) { 1858 - undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL); 1858 + undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT); 1859 1859 if (undo_list == NULL) 1860 1860 return -ENOMEM; 1861 1861 spin_lock_init(&undo_list->lock); ··· 1941 1941 1942 1942 /* step 2: allocate new undo structure */ 1943 1943 new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, 1944 - GFP_KERNEL); 1944 + GFP_KERNEL_ACCOUNT); 1945 1945 if (!new) { 1946 1946 ipc_rcu_putref(&sma->sem_perm, sem_rcu_free); 1947 1947 return ERR_PTR(-ENOMEM); ··· 2005 2005 if (nsops > ns->sc_semopm) 2006 2006 return -E2BIG; 2007 2007 if (nsops > SEMOPM_FAST) { 2008 - sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL); 2008 + sops = kvmalloc_array(nsops, sizeof(*sops), 2009 + GFP_KERNEL_ACCOUNT); 2009 2010 if (sops == NULL) 2010 2011 return -ENOMEM; 2011 2012 }
+1 -1
ipc/shm.c
··· 619 619 ns->shm_tot + numpages > ns->shm_ctlall) 620 620 return -ENOSPC; 621 621 622 - shp = kmalloc(sizeof(*shp), GFP_KERNEL); 622 + shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT); 623 623 if (unlikely(!shp)) 624 624 return -ENOMEM; 625 625
+1 -1
kernel/cgroup/namespace.c
··· 24 24 struct cgroup_namespace *new_ns; 25 25 int ret; 26 26 27 - new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL); 27 + new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT); 28 28 if (!new_ns) 29 29 return ERR_PTR(-ENOMEM); 30 30 ret = ns_alloc_inum(&new_ns->ns);
+1 -1
kernel/nsproxy.c
··· 568 568 569 569 int __init nsproxy_cache_init(void) 570 570 { 571 - nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC); 571 + nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT); 572 572 return 0; 573 573 }
+3 -2
kernel/pid_namespace.c
··· 51 51 mutex_lock(&pid_caches_mutex); 52 52 /* Name collision forces to do allocation under mutex. */ 53 53 if (!*pkc) 54 - *pkc = kmem_cache_create(name, len, 0, SLAB_HWCACHE_ALIGN, 0); 54 + *pkc = kmem_cache_create(name, len, 0, 55 + SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, 0); 55 56 mutex_unlock(&pid_caches_mutex); 56 57 /* current can fail, but someone else can succeed. */ 57 58 return READ_ONCE(*pkc); ··· 450 449 451 450 static __init int pid_namespaces_init(void) 452 451 { 453 - pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); 452 + pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT); 454 453 455 454 #ifdef CONFIG_CHECKPOINT_RESTORE 456 455 register_sysctl_paths(kern_path, pid_ns_ctl_table);
+1 -1
kernel/signal.c
··· 4726 4726 { 4727 4727 siginfo_buildtime_checks(); 4728 4728 4729 - sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC); 4729 + sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT); 4730 4730 } 4731 4731 4732 4732 #ifdef CONFIG_KGDB_KDB
+1
kernel/sys_ni.c
··· 289 289 COND_SYSCALL(mincore); 290 290 COND_SYSCALL(madvise); 291 291 COND_SYSCALL(process_madvise); 292 + COND_SYSCALL(process_mrelease); 292 293 COND_SYSCALL(remap_file_pages); 293 294 COND_SYSCALL(mbind); 294 295 COND_SYSCALL_COMPAT(mbind);
+1 -1
kernel/sysctl.c
··· 2912 2912 .data = &sysctl_compaction_proactiveness, 2913 2913 .maxlen = sizeof(sysctl_compaction_proactiveness), 2914 2914 .mode = 0644, 2915 - .proc_handler = proc_dointvec_minmax, 2915 + .proc_handler = compaction_proactiveness_sysctl_handler, 2916 2916 .extra1 = SYSCTL_ZERO, 2917 2917 .extra2 = &one_hundred, 2918 2918 },
+2 -2
kernel/time/namespace.c
··· 88 88 goto fail; 89 89 90 90 err = -ENOMEM; 91 - ns = kmalloc(sizeof(*ns), GFP_KERNEL); 91 + ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT); 92 92 if (!ns) 93 93 goto fail_dec; 94 94 95 95 refcount_set(&ns->ns.count, 1); 96 96 97 - ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO); 97 + ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO); 98 98 if (!ns->vvar_page) 99 99 goto fail_free; 100 100
+2 -2
kernel/time/posix-timers.c
··· 273 273 static __init int init_posix_timers(void) 274 274 { 275 275 posix_timers_cache = kmem_cache_create("posix_timers_cache", 276 - sizeof (struct k_itimer), 0, SLAB_PANIC, 277 - NULL); 276 + sizeof(struct k_itimer), 0, 277 + SLAB_PANIC | SLAB_ACCOUNT, NULL); 278 278 return 0; 279 279 } 280 280 __initcall(init_posix_timers);
+1 -1
kernel/user_namespace.c
··· 1385 1385 1386 1386 static __init int user_namespaces_init(void) 1387 1387 { 1388 - user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC); 1388 + user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT); 1389 1389 return 0; 1390 1390 } 1391 1391 subsys_initcall(user_namespaces_init);
+2 -3
lib/scatterlist.c
··· 918 918 miter->__offset += miter->consumed; 919 919 miter->__remaining -= miter->consumed; 920 920 921 - if ((miter->__flags & SG_MITER_TO_SG) && 922 - !PageSlab(miter->page)) 923 - flush_kernel_dcache_page(miter->page); 921 + if (miter->__flags & SG_MITER_TO_SG) 922 + flush_dcache_page(miter->page); 924 923 925 924 if (miter->__flags & SG_MITER_ATOMIC) { 926 925 WARN_ON_ONCE(preemptible());
+57 -23
lib/test_kasan.c
··· 120 120 static void kmalloc_oob_right(struct kunit *test) 121 121 { 122 122 char *ptr; 123 - size_t size = 123; 123 + size_t size = 128 - KASAN_GRANULE_SIZE - 5; 124 124 125 125 ptr = kmalloc(size, GFP_KERNEL); 126 126 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 127 127 128 - KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + OOB_TAG_OFF] = 'x'); 128 + /* 129 + * An unaligned access past the requested kmalloc size. 130 + * Only generic KASAN can precisely detect these. 131 + */ 132 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 133 + KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 'x'); 134 + 135 + /* 136 + * An aligned access into the first out-of-bounds granule that falls 137 + * within the aligned kmalloc object. 138 + */ 139 + KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + 5] = 'y'); 140 + 141 + /* Out-of-bounds access past the aligned kmalloc object. */ 142 + KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = 143 + ptr[size + KASAN_GRANULE_SIZE + 5]); 144 + 129 145 kfree(ptr); 130 146 } 131 147 ··· 165 149 ptr = kmalloc_node(size, GFP_KERNEL, 0); 166 150 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 167 151 168 - KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0); 152 + KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = ptr[size]); 169 153 kfree(ptr); 170 154 } 171 155 ··· 201 185 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 202 186 kfree(ptr); 203 187 204 - KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = 0); 188 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[0]); 205 189 } 206 190 207 191 static void kmalloc_pagealloc_invalid_free(struct kunit *test) ··· 235 219 ptr = page_address(pages); 236 220 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 237 221 238 - KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0); 222 + KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = ptr[size]); 239 223 free_pages((unsigned long)ptr, order); 240 224 } 241 225 ··· 250 234 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 251 235 free_pages((unsigned long)ptr, order); 252 236 253 - KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = 0); 237 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char 
*)ptr)[0]); 254 238 } 255 239 256 240 static void kmalloc_large_oob_right(struct kunit *test) ··· 426 410 kfree(ptr1); 427 411 } 428 412 413 + /* 414 + * Note: in the memset tests below, the written range touches both valid and 415 + * invalid memory. This makes sure that the instrumentation does not only check 416 + * the starting address but the whole range. 417 + */ 418 + 429 419 static void kmalloc_oob_memset_2(struct kunit *test) 430 420 { 431 421 char *ptr; 432 - size_t size = 8; 422 + size_t size = 128 - KASAN_GRANULE_SIZE; 433 423 434 424 ptr = kmalloc(size, GFP_KERNEL); 435 425 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 436 426 437 - KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 7 + OOB_TAG_OFF, 0, 2)); 427 + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 1, 0, 2)); 438 428 kfree(ptr); 439 429 } 440 430 441 431 static void kmalloc_oob_memset_4(struct kunit *test) 442 432 { 443 433 char *ptr; 444 - size_t size = 8; 434 + size_t size = 128 - KASAN_GRANULE_SIZE; 445 435 446 436 ptr = kmalloc(size, GFP_KERNEL); 447 437 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 448 438 449 - KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 5 + OOB_TAG_OFF, 0, 4)); 439 + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 3, 0, 4)); 450 440 kfree(ptr); 451 441 } 452 - 453 442 454 443 static void kmalloc_oob_memset_8(struct kunit *test) 455 444 { 456 445 char *ptr; 457 - size_t size = 8; 446 + size_t size = 128 - KASAN_GRANULE_SIZE; 458 447 459 448 ptr = kmalloc(size, GFP_KERNEL); 460 449 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 461 450 462 - KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 1 + OOB_TAG_OFF, 0, 8)); 451 + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 7, 0, 8)); 463 452 kfree(ptr); 464 453 } 465 454 466 455 static void kmalloc_oob_memset_16(struct kunit *test) 467 456 { 468 457 char *ptr; 469 - size_t size = 16; 458 + size_t size = 128 - KASAN_GRANULE_SIZE; 470 459 471 460 ptr = kmalloc(size, GFP_KERNEL); 472 461 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 473 462 474 - 
KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 1 + OOB_TAG_OFF, 0, 16)); 463 + KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 15, 0, 16)); 475 464 kfree(ptr); 476 465 } 477 466 478 467 static void kmalloc_oob_in_memset(struct kunit *test) 479 468 { 480 469 char *ptr; 481 - size_t size = 666; 470 + size_t size = 128 - KASAN_GRANULE_SIZE; 482 471 483 472 ptr = kmalloc(size, GFP_KERNEL); 484 473 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 485 474 486 - KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr, 0, size + 5 + OOB_TAG_OFF)); 475 + KUNIT_EXPECT_KASAN_FAIL(test, 476 + memset(ptr, 0, size + KASAN_GRANULE_SIZE)); 487 477 kfree(ptr); 488 478 } 489 479 ··· 499 477 size_t size = 64; 500 478 volatile size_t invalid_size = -2; 501 479 480 + /* 481 + * Hardware tag-based mode doesn't check memmove for negative size. 482 + * As a result, this test introduces a side-effect memory corruption, 483 + * which can result in a crash. 484 + */ 485 + KASAN_TEST_NEEDS_CONFIG_OFF(test, CONFIG_KASAN_HW_TAGS); 486 + 502 487 ptr = kmalloc(size, GFP_KERNEL); 503 488 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 504 489 505 490 memset((char *)ptr, 0, 64); 506 - 507 491 KUNIT_EXPECT_KASAN_FAIL(test, 508 492 memmove((char *)ptr, (char *)ptr + 4, invalid_size)); 509 493 kfree(ptr); ··· 524 496 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); 525 497 526 498 kfree(ptr); 527 - KUNIT_EXPECT_KASAN_FAIL(test, *(ptr + 8) = 'x'); 499 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[8]); 528 500 } 529 501 530 502 static void kmalloc_uaf_memset(struct kunit *test) 531 503 { 532 504 char *ptr; 533 505 size_t size = 33; 506 + 507 + /* 508 + * Only generic KASAN uses quarantine, which is required to avoid a 509 + * kernel memory corruption this test causes. 
510 + */ 511 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_KASAN_GENERIC); 534 512 535 513 ptr = kmalloc(size, GFP_KERNEL); 536 514 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr); ··· 569 535 goto again; 570 536 } 571 537 572 - KUNIT_EXPECT_KASAN_FAIL(test, ptr1[40] = 'x'); 538 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr1)[40]); 573 539 KUNIT_EXPECT_PTR_NE(test, ptr1, ptr2); 574 540 575 541 kfree(ptr2); ··· 716 682 ptr[size] = 'x'; 717 683 718 684 /* This one must. */ 719 - KUNIT_EXPECT_KASAN_FAIL(test, ptr[real_size] = 'y'); 685 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size]); 720 686 721 687 kfree(ptr); 722 688 } ··· 735 701 kfree(ptr); 736 702 737 703 KUNIT_EXPECT_KASAN_FAIL(test, ksize(ptr)); 738 - KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = *ptr); 739 - KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = *(ptr + size)); 704 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[0]); 705 + KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]); 740 706 } 741 707 742 708 static void kasan_stack_oob(struct kunit *test)
+9 -11
lib/test_kasan_module.c
··· 15 15 16 16 #include "../mm/kasan/kasan.h" 17 17 18 - #define OOB_TAG_OFF (IS_ENABLED(CONFIG_KASAN_GENERIC) ? 0 : KASAN_GRANULE_SIZE) 19 - 20 18 static noinline void __init copy_user_test(void) 21 19 { 22 20 char *kmem; 23 21 char __user *usermem; 24 - size_t size = 10; 22 + size_t size = 128 - KASAN_GRANULE_SIZE; 25 23 int __maybe_unused unused; 26 24 27 25 kmem = kmalloc(size, GFP_KERNEL); ··· 36 38 } 37 39 38 40 pr_info("out-of-bounds in copy_from_user()\n"); 39 - unused = copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); 41 + unused = copy_from_user(kmem, usermem, size + 1); 40 42 41 43 pr_info("out-of-bounds in copy_to_user()\n"); 42 - unused = copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); 44 + unused = copy_to_user(usermem, kmem, size + 1); 43 45 44 46 pr_info("out-of-bounds in __copy_from_user()\n"); 45 - unused = __copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); 47 + unused = __copy_from_user(kmem, usermem, size + 1); 46 48 47 49 pr_info("out-of-bounds in __copy_to_user()\n"); 48 - unused = __copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); 50 + unused = __copy_to_user(usermem, kmem, size + 1); 49 51 50 52 pr_info("out-of-bounds in __copy_from_user_inatomic()\n"); 51 - unused = __copy_from_user_inatomic(kmem, usermem, size + 1 + OOB_TAG_OFF); 53 + unused = __copy_from_user_inatomic(kmem, usermem, size + 1); 52 54 53 55 pr_info("out-of-bounds in __copy_to_user_inatomic()\n"); 54 - unused = __copy_to_user_inatomic(usermem, kmem, size + 1 + OOB_TAG_OFF); 56 + unused = __copy_to_user_inatomic(usermem, kmem, size + 1); 55 57 56 58 pr_info("out-of-bounds in strncpy_from_user()\n"); 57 - unused = strncpy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); 59 + unused = strncpy_from_user(kmem, usermem, size + 1); 58 60 59 61 vm_munmap((unsigned long)usermem, PAGE_SIZE); 60 62 kfree(kmem); ··· 71 73 struct kasan_rcu_info, rcu); 72 74 73 75 kfree(fp); 74 - fp->i = 1; 76 + ((volatile struct kasan_rcu_info *)fp)->i; 75 77 } 76 78 77 79 static 
noinline void __init kasan_rcu_uaf(void)
+4 -1
lib/test_vmalloc.c
··· 35 35 __param(int, test_loop_count, 1000000, 36 36 "Set test loop counter"); 37 37 38 + __param(int, nr_pages, 0, 39 + "Set number of pages for fix_size_alloc_test(default: 1)"); 40 + 38 41 __param(int, run_test_mask, INT_MAX, 39 42 "Set tests specified in the mask.\n\n" 40 43 "\t\tid: 1, name: fix_size_alloc_test\n" ··· 265 262 int i; 266 263 267 264 for (i = 0; i < test_loop_count; i++) { 268 - ptr = vmalloc(3 * PAGE_SIZE); 265 + ptr = vmalloc((nr_pages > 0 ? nr_pages:1) * PAGE_SIZE); 269 266 270 267 if (!ptr) 271 268 return -1;
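For context, the new `nr_pages` parameter is consumed like the module's other `__param` knobs at load time. A hypothetical invocation (assuming the module is built as `test_vmalloc.ko`; mask bit 1 selects `fix_size_alloc_test` per the help text above):

```sh
# Run only fix_size_alloc_test, allocating 4 pages per vmalloc()
# call instead of the default single page; results land in the log.
modprobe test_vmalloc nr_pages=4 run_test_mask=1
dmesg | tail
```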
+11
mm/backing-dev.c
··· 271 271 spin_unlock_bh(&wb->work_lock); 272 272 } 273 273 274 + static void wb_update_bandwidth_workfn(struct work_struct *work) 275 + { 276 + struct bdi_writeback *wb = container_of(to_delayed_work(work), 277 + struct bdi_writeback, bw_dwork); 278 + 279 + wb_update_bandwidth(wb); 280 + } 281 + 274 282 /* 275 283 * Initial write bandwidth: 100 MB/s 276 284 */ ··· 301 293 INIT_LIST_HEAD(&wb->b_dirty_time); 302 294 spin_lock_init(&wb->list_lock); 303 295 296 + atomic_set(&wb->writeback_inodes, 0); 304 297 wb->bw_time_stamp = jiffies; 305 298 wb->balanced_dirty_ratelimit = INIT_BW; 306 299 wb->dirty_ratelimit = INIT_BW; ··· 311 302 spin_lock_init(&wb->work_lock); 312 303 INIT_LIST_HEAD(&wb->work_list); 313 304 INIT_DELAYED_WORK(&wb->dwork, wb_workfn); 305 + INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn); 314 306 wb->dirty_sleep = jiffies; 315 307 316 308 err = fprop_local_init_percpu(&wb->completions, gfp); ··· 360 350 mod_delayed_work(bdi_wq, &wb->dwork, 0); 361 351 flush_delayed_work(&wb->dwork); 362 352 WARN_ON(!list_empty(&wb->work_list)); 353 + flush_delayed_work(&wb->bw_dwork); 363 354 } 364 355 365 356 static void wb_exit(struct bdi_writeback *wb)
+2 -2
mm/bootmem_info.c
··· 39 39 } 40 40 41 41 #ifndef CONFIG_SPARSEMEM_VMEMMAP 42 - static void register_page_bootmem_info_section(unsigned long start_pfn) 42 + static void __init register_page_bootmem_info_section(unsigned long start_pfn) 43 43 { 44 44 unsigned long mapsize, section_nr, i; 45 45 struct mem_section *ms; ··· 74 74 75 75 } 76 76 #else /* CONFIG_SPARSEMEM_VMEMMAP */ 77 - static void register_page_bootmem_info_section(unsigned long start_pfn) 77 + static void __init register_page_bootmem_info_section(unsigned long start_pfn) 78 78 { 79 79 unsigned long mapsize, section_nr, i; 80 80 struct mem_section *ms;
+55 -12
mm/compaction.c
··· 2398 2398 2399 2399 err = migrate_pages(&cc->migratepages, compaction_alloc, 2400 2400 compaction_free, (unsigned long)cc, cc->mode, 2401 - MR_COMPACTION); 2401 + MR_COMPACTION, NULL); 2402 2402 2403 2403 trace_mm_compaction_migratepages(cc->nr_migratepages, err, 2404 2404 &cc->migratepages); ··· 2706 2706 */ 2707 2707 unsigned int __read_mostly sysctl_compaction_proactiveness = 20; 2708 2708 2709 + int compaction_proactiveness_sysctl_handler(struct ctl_table *table, int write, 2710 + void *buffer, size_t *length, loff_t *ppos) 2711 + { 2712 + int rc, nid; 2713 + 2714 + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); 2715 + if (rc) 2716 + return rc; 2717 + 2718 + if (write && sysctl_compaction_proactiveness) { 2719 + for_each_online_node(nid) { 2720 + pg_data_t *pgdat = NODE_DATA(nid); 2721 + 2722 + if (pgdat->proactive_compact_trigger) 2723 + continue; 2724 + 2725 + pgdat->proactive_compact_trigger = true; 2726 + wake_up_interruptible(&pgdat->kcompactd_wait); 2727 + } 2728 + } 2729 + 2730 + return 0; 2731 + } 2732 + 2709 2733 /* 2710 2734 * This is the entry point for compacting all nodes via 2711 2735 * /proc/sys/vm/compact_memory ··· 2774 2750 2775 2751 static inline bool kcompactd_work_requested(pg_data_t *pgdat) 2776 2752 { 2777 - return pgdat->kcompactd_max_order > 0 || kthread_should_stop(); 2753 + return pgdat->kcompactd_max_order > 0 || kthread_should_stop() || 2754 + pgdat->proactive_compact_trigger; 2778 2755 } 2779 2756 2780 2757 static bool kcompactd_node_suitable(pg_data_t *pgdat) ··· 2910 2885 { 2911 2886 pg_data_t *pgdat = (pg_data_t *)p; 2912 2887 struct task_struct *tsk = current; 2913 - unsigned int proactive_defer = 0; 2888 + long default_timeout = msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC); 2889 + long timeout = default_timeout; 2914 2890 2915 2891 const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); 2916 2892 ··· 2926 2900 while (!kthread_should_stop()) { 2927 2901 unsigned long pflags; 2928 2902 2903 + /* 
2904 + * Avoid the unnecessary wakeup for proactive compaction 2905 + * when it is disabled. 2906 + */ 2907 + if (!sysctl_compaction_proactiveness) 2908 + timeout = MAX_SCHEDULE_TIMEOUT; 2929 2909 trace_mm_compaction_kcompactd_sleep(pgdat->node_id); 2930 2910 if (wait_event_freezable_timeout(pgdat->kcompactd_wait, 2931 - kcompactd_work_requested(pgdat), 2932 - msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) { 2911 + kcompactd_work_requested(pgdat), timeout) && 2912 + !pgdat->proactive_compact_trigger) { 2933 2913 2934 2914 psi_memstall_enter(&pflags); 2935 2915 kcompactd_do_work(pgdat); 2936 2916 psi_memstall_leave(&pflags); 2917 + /* 2918 + * Reset the timeout value. The defer timeout from 2919 + * proactive compaction is lost here but that is fine 2920 + * as the condition of the zone changing substantially 2921 + * then carrying on with the previous defer interval is 2922 + * not useful. 2923 + */ 2924 + timeout = default_timeout; 2937 2925 continue; 2938 2926 } 2939 2927 2940 - /* kcompactd wait timeout */ 2928 + /* 2929 + * Start the proactive work with default timeout. Based 2930 + * on the fragmentation score, this timeout is updated. 2931 + */ 2932 + timeout = default_timeout; 2941 2933 if (should_proactive_compact_node(pgdat)) { 2942 2934 unsigned int prev_score, score; 2943 2935 2944 - if (proactive_defer) { 2945 - proactive_defer--; 2946 - continue; 2947 - } 2948 2936 prev_score = fragmentation_score_node(pgdat); 2949 2937 proactive_compact_node(pgdat); 2950 2938 score = fragmentation_score_node(pgdat); ··· 2966 2926 * Defer proactive compaction if the fragmentation 2967 2927 * score did not go down i.e. no progress made. 2968 2928 */ 2969 - proactive_defer = score < prev_score ?
2970 - 0 : 1 << COMPACT_MAX_DEFER_SHIFT; 2929 + if (unlikely(score >= prev_score)) 2930 + timeout = 2931 + default_timeout << COMPACT_MAX_DEFER_SHIFT; 2971 2932 } 2933 + if (unlikely(pgdat->proactive_compact_trigger)) 2934 + pgdat->proactive_compact_trigger = false; 2972 2935 } 2973 2936 2974 2937 return 0;
+557 -357
mm/debug_vm_pgtable.c
··· 29 29 #include <linux/start_kernel.h> 30 30 #include <linux/sched/mm.h> 31 31 #include <linux/io.h> 32 + 33 + #include <asm/cacheflush.h> 32 34 #include <asm/pgalloc.h> 33 35 #include <asm/tlbflush.h> 34 36 ··· 60 58 #define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) 61 59 #define RANDOM_NZVALUE GENMASK(7, 0) 62 60 63 - static void __init pte_basic_tests(unsigned long pfn, int idx) 61 + struct pgtable_debug_args { 62 + struct mm_struct *mm; 63 + struct vm_area_struct *vma; 64 + 65 + pgd_t *pgdp; 66 + p4d_t *p4dp; 67 + pud_t *pudp; 68 + pmd_t *pmdp; 69 + pte_t *ptep; 70 + 71 + p4d_t *start_p4dp; 72 + pud_t *start_pudp; 73 + pmd_t *start_pmdp; 74 + pgtable_t start_ptep; 75 + 76 + unsigned long vaddr; 77 + pgprot_t page_prot; 78 + pgprot_t page_prot_none; 79 + 80 + bool is_contiguous_page; 81 + unsigned long pud_pfn; 82 + unsigned long pmd_pfn; 83 + unsigned long pte_pfn; 84 + 85 + unsigned long fixed_pgd_pfn; 86 + unsigned long fixed_p4d_pfn; 87 + unsigned long fixed_pud_pfn; 88 + unsigned long fixed_pmd_pfn; 89 + unsigned long fixed_pte_pfn; 90 + }; 91 + 92 + static void __init pte_basic_tests(struct pgtable_debug_args *args, int idx) 64 93 { 65 94 pgprot_t prot = protection_map[idx]; 66 - pte_t pte = pfn_pte(pfn, prot); 95 + pte_t pte = pfn_pte(args->fixed_pte_pfn, prot); 67 96 unsigned long val = idx, *ptr = &val; 68 97 69 98 pr_debug("Validating PTE basic (%pGv)\n", ptr); ··· 119 86 WARN_ON(!pte_dirty(pte_wrprotect(pte_mkdirty(pte)))); 120 87 } 121 88 122 - static void __init pte_advanced_tests(struct mm_struct *mm, 123 - struct vm_area_struct *vma, pte_t *ptep, 124 - unsigned long pfn, unsigned long vaddr, 125 - pgprot_t prot) 89 + static void __init pte_advanced_tests(struct pgtable_debug_args *args) 126 90 { 91 + struct page *page; 127 92 pte_t pte; 128 93 129 94 /* 130 95 * Architectures optimize set_pte_at by avoiding TLB flush. 131 96 * This requires set_pte_at to be not used to update an 132 97 * existing pte entry. 
Clear pte before we do set_pte_at 98 + * 99 + * flush_dcache_page() is called after set_pte_at() to clear 100 + * PG_arch_1 for the page on ARM64. The page flag isn't cleared 101 + * when it's released and page allocation check will fail when 102 + * the page is allocated again. For architectures other than ARM64, 103 + * the unexpected overhead of cache flushing is acceptable. 133 104 */ 105 + page = (args->pte_pfn != ULONG_MAX) ? pfn_to_page(args->pte_pfn) : NULL; 106 + if (!page) 107 + return; 134 108 135 109 pr_debug("Validating PTE advanced\n"); 136 - pte = pfn_pte(pfn, prot); 137 - set_pte_at(mm, vaddr, ptep, pte); 138 - ptep_set_wrprotect(mm, vaddr, ptep); 139 - pte = ptep_get(ptep); 110 + pte = pfn_pte(args->pte_pfn, args->page_prot); 111 + set_pte_at(args->mm, args->vaddr, args->ptep, pte); 112 + flush_dcache_page(page); 113 + ptep_set_wrprotect(args->mm, args->vaddr, args->ptep); 114 + pte = ptep_get(args->ptep); 140 115 WARN_ON(pte_write(pte)); 141 - ptep_get_and_clear(mm, vaddr, ptep); 142 - pte = ptep_get(ptep); 116 + ptep_get_and_clear(args->mm, args->vaddr, args->ptep); 117 + pte = ptep_get(args->ptep); 143 118 WARN_ON(!pte_none(pte)); 144 119 145 - pte = pfn_pte(pfn, prot); 120 + pte = pfn_pte(args->pte_pfn, args->page_prot); 146 121 pte = pte_wrprotect(pte); 147 122 pte = pte_mkclean(pte); 148 - set_pte_at(mm, vaddr, ptep, pte); 123 + set_pte_at(args->mm, args->vaddr, args->ptep, pte); 124 + flush_dcache_page(page); 149 125 pte = pte_mkwrite(pte); 150 126 pte = pte_mkdirty(pte); 151 - ptep_set_access_flags(vma, vaddr, ptep, pte, 1); 152 - pte = ptep_get(ptep); 127 + ptep_set_access_flags(args->vma, args->vaddr, args->ptep, pte, 1); 128 + pte = ptep_get(args->ptep); 153 129 WARN_ON(!(pte_write(pte) && pte_dirty(pte))); 154 - ptep_get_and_clear_full(mm, vaddr, ptep, 1); 155 - pte = ptep_get(ptep); 130 + ptep_get_and_clear_full(args->mm, args->vaddr, args->ptep, 1); 131 + pte = ptep_get(args->ptep); 156 132 WARN_ON(!pte_none(pte)); 157 133 158 - pte = 
pfn_pte(pfn, prot); 134 + pte = pfn_pte(args->pte_pfn, args->page_prot); 159 135 pte = pte_mkyoung(pte); 160 - set_pte_at(mm, vaddr, ptep, pte); 161 - ptep_test_and_clear_young(vma, vaddr, ptep); 162 - pte = ptep_get(ptep); 136 + set_pte_at(args->mm, args->vaddr, args->ptep, pte); 137 + flush_dcache_page(page); 138 + ptep_test_and_clear_young(args->vma, args->vaddr, args->ptep); 139 + pte = ptep_get(args->ptep); 163 140 WARN_ON(pte_young(pte)); 164 141 } 165 142 166 - static void __init pte_savedwrite_tests(unsigned long pfn, pgprot_t prot) 143 + static void __init pte_savedwrite_tests(struct pgtable_debug_args *args) 167 144 { 168 - pte_t pte = pfn_pte(pfn, prot); 145 + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot_none); 169 146 170 147 if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) 171 148 return; ··· 186 143 } 187 144 188 145 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 189 - static void __init pmd_basic_tests(unsigned long pfn, int idx) 146 + static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx) 190 147 { 191 148 pgprot_t prot = protection_map[idx]; 192 149 unsigned long val = idx, *ptr = &val; ··· 196 153 return; 197 154 198 155 pr_debug("Validating PMD basic (%pGv)\n", ptr); 199 - pmd = pfn_pmd(pfn, prot); 156 + pmd = pfn_pmd(args->fixed_pmd_pfn, prot); 200 157 201 158 /* 202 159 * This test needs to be executed after the given page table entry ··· 224 181 WARN_ON(!pmd_bad(pmd_mkhuge(pmd))); 225 182 } 226 183 227 - static void __init pmd_advanced_tests(struct mm_struct *mm, 228 - struct vm_area_struct *vma, pmd_t *pmdp, 229 - unsigned long pfn, unsigned long vaddr, 230 - pgprot_t prot, pgtable_t pgtable) 184 + static void __init pmd_advanced_tests(struct pgtable_debug_args *args) 231 185 { 186 + struct page *page; 232 187 pmd_t pmd; 188 + unsigned long vaddr = args->vaddr; 233 189 234 190 if (!has_transparent_hugepage()) 235 191 return; 236 192 193 + page = (args->pmd_pfn != ULONG_MAX) ? 
pfn_to_page(args->pmd_pfn) : NULL; 194 + if (!page) 195 + return; 196 + 197 + /* 198 + * flush_dcache_page() is called after set_pmd_at() to clear 199 + * PG_arch_1 for the page on ARM64. The page flag isn't cleared 200 + * when it's released and page allocation check will fail when 201 + * the page is allocated again. For architectures other than ARM64, 202 + * the unexpected overhead of cache flushing is acceptable. 203 + */ 237 204 pr_debug("Validating PMD advanced\n"); 238 205 /* Align the address wrt HPAGE_PMD_SIZE */ 239 206 vaddr &= HPAGE_PMD_MASK; 240 207 241 - pgtable_trans_huge_deposit(mm, pmdp, pgtable); 208 + pgtable_trans_huge_deposit(args->mm, args->pmdp, args->start_ptep); 242 209 243 - pmd = pfn_pmd(pfn, prot); 244 - set_pmd_at(mm, vaddr, pmdp, pmd); 245 - pmdp_set_wrprotect(mm, vaddr, pmdp); 246 - pmd = READ_ONCE(*pmdp); 210 + pmd = pfn_pmd(args->pmd_pfn, args->page_prot); 211 + set_pmd_at(args->mm, vaddr, args->pmdp, pmd); 212 + flush_dcache_page(page); 213 + pmdp_set_wrprotect(args->mm, vaddr, args->pmdp); 214 + pmd = READ_ONCE(*args->pmdp); 247 215 WARN_ON(pmd_write(pmd)); 248 - pmdp_huge_get_and_clear(mm, vaddr, pmdp); 249 - pmd = READ_ONCE(*pmdp); 216 + pmdp_huge_get_and_clear(args->mm, vaddr, args->pmdp); 217 + pmd = READ_ONCE(*args->pmdp); 250 218 WARN_ON(!pmd_none(pmd)); 251 219 252 - pmd = pfn_pmd(pfn, prot); 220 + pmd = pfn_pmd(args->pmd_pfn, args->page_prot); 253 221 pmd = pmd_wrprotect(pmd); 254 222 pmd = pmd_mkclean(pmd); 255 - set_pmd_at(mm, vaddr, pmdp, pmd); 223 + set_pmd_at(args->mm, vaddr, args->pmdp, pmd); 224 + flush_dcache_page(page); 256 225 pmd = pmd_mkwrite(pmd); 257 226 pmd = pmd_mkdirty(pmd); 258 - pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1); 259 - pmd = READ_ONCE(*pmdp); 227 + pmdp_set_access_flags(args->vma, vaddr, args->pmdp, pmd, 1); 228 + pmd = READ_ONCE(*args->pmdp); 260 229 WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd))); 261 - pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1); 262 - pmd = READ_ONCE(*pmdp); 230 + 
pmdp_huge_get_and_clear_full(args->vma, vaddr, args->pmdp, 1); 231 + pmd = READ_ONCE(*args->pmdp); 263 232 WARN_ON(!pmd_none(pmd)); 264 233 265 - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); 234 + pmd = pmd_mkhuge(pfn_pmd(args->pmd_pfn, args->page_prot)); 266 235 pmd = pmd_mkyoung(pmd); 267 - set_pmd_at(mm, vaddr, pmdp, pmd); 268 - pmdp_test_and_clear_young(vma, vaddr, pmdp); 269 - pmd = READ_ONCE(*pmdp); 236 + set_pmd_at(args->mm, vaddr, args->pmdp, pmd); 237 + flush_dcache_page(page); 238 + pmdp_test_and_clear_young(args->vma, vaddr, args->pmdp); 239 + pmd = READ_ONCE(*args->pmdp); 270 240 WARN_ON(pmd_young(pmd)); 271 241 272 242 /* Clear the pte entries */ 273 - pmdp_huge_get_and_clear(mm, vaddr, pmdp); 274 - pgtable = pgtable_trans_huge_withdraw(mm, pmdp); 243 + pmdp_huge_get_and_clear(args->mm, vaddr, args->pmdp); 244 + pgtable_trans_huge_withdraw(args->mm, args->pmdp); 275 245 } 276 246 277 - static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) 247 + static void __init pmd_leaf_tests(struct pgtable_debug_args *args) 278 248 { 279 249 pmd_t pmd; 280 250 ··· 295 239 return; 296 240 297 241 pr_debug("Validating PMD leaf\n"); 298 - pmd = pfn_pmd(pfn, prot); 242 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); 299 243 300 244 /* 301 245 * PMD based THP is a leaf entry. 
··· 304 248 WARN_ON(!pmd_leaf(pmd)); 305 249 } 306 250 307 - static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) 251 + static void __init pmd_savedwrite_tests(struct pgtable_debug_args *args) 308 252 { 309 253 pmd_t pmd; 310 254 ··· 315 259 return; 316 260 317 261 pr_debug("Validating PMD saved write\n"); 318 - pmd = pfn_pmd(pfn, prot); 262 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot_none); 319 263 WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd)))); 320 264 WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd)))); 321 265 } 322 266 323 267 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 324 - static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) 268 + static void __init pud_basic_tests(struct pgtable_debug_args *args, int idx) 325 269 { 326 270 pgprot_t prot = protection_map[idx]; 327 271 unsigned long val = idx, *ptr = &val; ··· 331 275 return; 332 276 333 277 pr_debug("Validating PUD basic (%pGv)\n", ptr); 334 - pud = pfn_pud(pfn, prot); 278 + pud = pfn_pud(args->fixed_pud_pfn, prot); 335 279 336 280 /* 337 281 * This test needs to be executed after the given page table entry ··· 352 296 WARN_ON(pud_dirty(pud_wrprotect(pud_mkclean(pud)))); 353 297 WARN_ON(!pud_dirty(pud_wrprotect(pud_mkdirty(pud)))); 354 298 355 - if (mm_pmd_folded(mm)) 299 + if (mm_pmd_folded(args->mm)) 356 300 return; 357 301 358 302 /* ··· 362 306 WARN_ON(!pud_bad(pud_mkhuge(pud))); 363 307 } 364 308 365 - static void __init pud_advanced_tests(struct mm_struct *mm, 366 - struct vm_area_struct *vma, pud_t *pudp, 367 - unsigned long pfn, unsigned long vaddr, 368 - pgprot_t prot) 309 + static void __init pud_advanced_tests(struct pgtable_debug_args *args) 369 310 { 311 + struct page *page; 312 + unsigned long vaddr = args->vaddr; 370 313 pud_t pud; 371 314 372 315 if (!has_transparent_hugepage()) 373 316 return; 374 317 318 + page = (args->pud_pfn != ULONG_MAX) ? 
pfn_to_page(args->pud_pfn) : NULL; 319 + if (!page) 320 + return; 321 + 322 + /* 323 + * flush_dcache_page() is called after set_pud_at() to clear 324 + * PG_arch_1 for the page on ARM64. The page flag isn't cleared 325 + * when it's released and page allocation check will fail when 326 + * the page is allocated again. For architectures other than ARM64, 327 + * the unexpected overhead of cache flushing is acceptable. 328 + */ 375 329 pr_debug("Validating PUD advanced\n"); 376 330 /* Align the address wrt HPAGE_PUD_SIZE */ 377 331 vaddr &= HPAGE_PUD_MASK; 378 332 379 - pud = pfn_pud(pfn, prot); 380 - set_pud_at(mm, vaddr, pudp, pud); 381 - pudp_set_wrprotect(mm, vaddr, pudp); 382 - pud = READ_ONCE(*pudp); 333 + pud = pfn_pud(args->pud_pfn, args->page_prot); 334 + set_pud_at(args->mm, vaddr, args->pudp, pud); 335 + flush_dcache_page(page); 336 + pudp_set_wrprotect(args->mm, vaddr, args->pudp); 337 + pud = READ_ONCE(*args->pudp); 383 338 WARN_ON(pud_write(pud)); 384 339 385 340 #ifndef __PAGETABLE_PMD_FOLDED 386 - pudp_huge_get_and_clear(mm, vaddr, pudp); 387 - pud = READ_ONCE(*pudp); 341 + pudp_huge_get_and_clear(args->mm, vaddr, args->pudp); 342 + pud = READ_ONCE(*args->pudp); 388 343 WARN_ON(!pud_none(pud)); 389 344 #endif /* __PAGETABLE_PMD_FOLDED */ 390 - pud = pfn_pud(pfn, prot); 345 + pud = pfn_pud(args->pud_pfn, args->page_prot); 391 346 pud = pud_wrprotect(pud); 392 347 pud = pud_mkclean(pud); 393 - set_pud_at(mm, vaddr, pudp, pud); 348 + set_pud_at(args->mm, vaddr, args->pudp, pud); 349 + flush_dcache_page(page); 394 350 pud = pud_mkwrite(pud); 395 351 pud = pud_mkdirty(pud); 396 - pudp_set_access_flags(vma, vaddr, pudp, pud, 1); 397 - pud = READ_ONCE(*pudp); 352 + pudp_set_access_flags(args->vma, vaddr, args->pudp, pud, 1); 353 + pud = READ_ONCE(*args->pudp); 398 354 WARN_ON(!(pud_write(pud) && pud_dirty(pud))); 399 355 400 356 #ifndef __PAGETABLE_PMD_FOLDED 401 - pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); 402 - pud = READ_ONCE(*pudp); 357 + 
pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1); 358 + pud = READ_ONCE(*args->pudp); 403 359 WARN_ON(!pud_none(pud)); 404 360 #endif /* __PAGETABLE_PMD_FOLDED */ 405 361 406 - pud = pfn_pud(pfn, prot); 362 + pud = pfn_pud(args->pud_pfn, args->page_prot); 407 363 pud = pud_mkyoung(pud); 408 - set_pud_at(mm, vaddr, pudp, pud); 409 - pudp_test_and_clear_young(vma, vaddr, pudp); 410 - pud = READ_ONCE(*pudp); 364 + set_pud_at(args->mm, vaddr, args->pudp, pud); 365 + flush_dcache_page(page); 366 + pudp_test_and_clear_young(args->vma, vaddr, args->pudp); 367 + pud = READ_ONCE(*args->pudp); 411 368 WARN_ON(pud_young(pud)); 412 369 413 - pudp_huge_get_and_clear(mm, vaddr, pudp); 370 + pudp_huge_get_and_clear(args->mm, vaddr, args->pudp); 414 371 } 415 372 416 - static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) 373 + static void __init pud_leaf_tests(struct pgtable_debug_args *args) 417 374 { 418 375 pud_t pud; 419 376 ··· 434 365 return; 435 366 436 367 pr_debug("Validating PUD leaf\n"); 437 - pud = pfn_pud(pfn, prot); 368 + pud = pfn_pud(args->fixed_pud_pfn, args->page_prot); 438 369 /* 439 370 * PUD based THP is a leaf entry. 
440 371 */ ··· 442 373 WARN_ON(!pud_leaf(pud)); 443 374 } 444 375 #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 445 - static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } 446 - static void __init pud_advanced_tests(struct mm_struct *mm, 447 - struct vm_area_struct *vma, pud_t *pudp, 448 - unsigned long pfn, unsigned long vaddr, 449 - pgprot_t prot) 450 - { 451 - } 452 - static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 376 + static void __init pud_basic_tests(struct pgtable_debug_args *args, int idx) { } 377 + static void __init pud_advanced_tests(struct pgtable_debug_args *args) { } 378 + static void __init pud_leaf_tests(struct pgtable_debug_args *args) { } 453 379 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 454 380 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 455 - static void __init pmd_basic_tests(unsigned long pfn, int idx) { } 456 - static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } 457 - static void __init pmd_advanced_tests(struct mm_struct *mm, 458 - struct vm_area_struct *vma, pmd_t *pmdp, 459 - unsigned long pfn, unsigned long vaddr, 460 - pgprot_t prot, pgtable_t pgtable) 461 - { 462 - } 463 - static void __init pud_advanced_tests(struct mm_struct *mm, 464 - struct vm_area_struct *vma, pud_t *pudp, 465 - unsigned long pfn, unsigned long vaddr, 466 - pgprot_t prot) 467 - { 468 - } 469 - static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { } 470 - static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 471 - static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } 381 + static void __init pmd_basic_tests(struct pgtable_debug_args *args, int idx) { } 382 + static void __init pud_basic_tests(struct pgtable_debug_args *args, int idx) { } 383 + static void __init pmd_advanced_tests(struct pgtable_debug_args *args) { } 384 + static void __init pud_advanced_tests(struct pgtable_debug_args *args) { 
} 385 + static void __init pmd_leaf_tests(struct pgtable_debug_args *args) { } 386 + static void __init pud_leaf_tests(struct pgtable_debug_args *args) { } 387 + static void __init pmd_savedwrite_tests(struct pgtable_debug_args *args) { } 472 388 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 473 389 474 390 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 475 - static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) 391 + static void __init pmd_huge_tests(struct pgtable_debug_args *args) 476 392 { 477 393 pmd_t pmd; 478 394 479 - if (!arch_vmap_pmd_supported(prot)) 395 + if (!arch_vmap_pmd_supported(args->page_prot)) 480 396 return; 481 397 482 398 pr_debug("Validating PMD huge\n"); ··· 469 415 * X86 defined pmd_set_huge() verifies that the given 470 416 * PMD is not a populated non-leaf entry. 471 417 */ 472 - WRITE_ONCE(*pmdp, __pmd(0)); 473 - WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); 474 - WARN_ON(!pmd_clear_huge(pmdp)); 475 - pmd = READ_ONCE(*pmdp); 418 + WRITE_ONCE(*args->pmdp, __pmd(0)); 419 + WARN_ON(!pmd_set_huge(args->pmdp, __pfn_to_phys(args->fixed_pmd_pfn), args->page_prot)); 420 + WARN_ON(!pmd_clear_huge(args->pmdp)); 421 + pmd = READ_ONCE(*args->pmdp); 476 422 WARN_ON(!pmd_none(pmd)); 477 423 } 478 424 479 - static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 425 + static void __init pud_huge_tests(struct pgtable_debug_args *args) 480 426 { 481 427 pud_t pud; 482 428 483 - if (!arch_vmap_pud_supported(prot)) 429 + if (!arch_vmap_pud_supported(args->page_prot)) 484 430 return; 485 431 486 432 pr_debug("Validating PUD huge\n"); ··· 488 434 * X86 defined pud_set_huge() verifies that the given 489 435 * PUD is not a populated non-leaf entry. 
490 436 */ 491 - WRITE_ONCE(*pudp, __pud(0)); 492 - WARN_ON(!pud_set_huge(pudp, __pfn_to_phys(pfn), prot)); 493 - WARN_ON(!pud_clear_huge(pudp)); 494 - pud = READ_ONCE(*pudp); 437 + WRITE_ONCE(*args->pudp, __pud(0)); 438 + WARN_ON(!pud_set_huge(args->pudp, __pfn_to_phys(args->fixed_pud_pfn), args->page_prot)); 439 + WARN_ON(!pud_clear_huge(args->pudp)); 440 + pud = READ_ONCE(*args->pudp); 495 441 WARN_ON(!pud_none(pud)); 496 442 } 497 443 #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ 498 - static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } 499 - static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { } 444 + static void __init pmd_huge_tests(struct pgtable_debug_args *args) { } 445 + static void __init pud_huge_tests(struct pgtable_debug_args *args) { } 500 446 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 501 447 502 - static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot) 448 + static void __init p4d_basic_tests(struct pgtable_debug_args *args) 503 449 { 504 450 p4d_t p4d; 505 451 ··· 508 454 WARN_ON(!p4d_same(p4d, p4d)); 509 455 } 510 456 511 - static void __init pgd_basic_tests(unsigned long pfn, pgprot_t prot) 457 + static void __init pgd_basic_tests(struct pgtable_debug_args *args) 512 458 { 513 459 pgd_t pgd; 514 460 ··· 518 464 } 519 465 520 466 #ifndef __PAGETABLE_PUD_FOLDED 521 - static void __init pud_clear_tests(struct mm_struct *mm, pud_t *pudp) 467 + static void __init pud_clear_tests(struct pgtable_debug_args *args) 522 468 { 523 - pud_t pud = READ_ONCE(*pudp); 469 + pud_t pud = READ_ONCE(*args->pudp); 524 470 525 - if (mm_pmd_folded(mm)) 471 + if (mm_pmd_folded(args->mm)) 526 472 return; 527 473 528 474 pr_debug("Validating PUD clear\n"); 529 475 pud = __pud(pud_val(pud) | RANDOM_ORVALUE); 530 - WRITE_ONCE(*pudp, pud); 531 - pud_clear(pudp); 532 - pud = READ_ONCE(*pudp); 476 + WRITE_ONCE(*args->pudp, pud); 477 + pud_clear(args->pudp); 478 + pud = READ_ONCE(*args->pudp); 533 479 
WARN_ON(!pud_none(pud)); 534 480 } 535 481 536 - static void __init pud_populate_tests(struct mm_struct *mm, pud_t *pudp, 537 - pmd_t *pmdp) 482 + static void __init pud_populate_tests(struct pgtable_debug_args *args) 538 483 { 539 484 pud_t pud; 540 485 541 - if (mm_pmd_folded(mm)) 486 + if (mm_pmd_folded(args->mm)) 542 487 return; 543 488 544 489 pr_debug("Validating PUD populate\n"); ··· 545 492 * This entry points to next level page table page. 546 493 * Hence this must not qualify as pud_bad(). 547 494 */ 548 - pud_populate(mm, pudp, pmdp); 549 - pud = READ_ONCE(*pudp); 495 + pud_populate(args->mm, args->pudp, args->start_pmdp); 496 + pud = READ_ONCE(*args->pudp); 550 497 WARN_ON(pud_bad(pud)); 551 498 } 552 499 #else /* !__PAGETABLE_PUD_FOLDED */ 553 - static void __init pud_clear_tests(struct mm_struct *mm, pud_t *pudp) { } 554 - static void __init pud_populate_tests(struct mm_struct *mm, pud_t *pudp, 555 - pmd_t *pmdp) 556 - { 557 - } 500 + static void __init pud_clear_tests(struct pgtable_debug_args *args) { } 501 + static void __init pud_populate_tests(struct pgtable_debug_args *args) { } 558 502 #endif /* PAGETABLE_PUD_FOLDED */ 559 503 560 504 #ifndef __PAGETABLE_P4D_FOLDED 561 - static void __init p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp) 505 + static void __init p4d_clear_tests(struct pgtable_debug_args *args) 562 506 { 563 - p4d_t p4d = READ_ONCE(*p4dp); 507 + p4d_t p4d = READ_ONCE(*args->p4dp); 564 508 565 - if (mm_pud_folded(mm)) 509 + if (mm_pud_folded(args->mm)) 566 510 return; 567 511 568 512 pr_debug("Validating P4D clear\n"); 569 513 p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE); 570 - WRITE_ONCE(*p4dp, p4d); 571 - p4d_clear(p4dp); 572 - p4d = READ_ONCE(*p4dp); 514 + WRITE_ONCE(*args->p4dp, p4d); 515 + p4d_clear(args->p4dp); 516 + p4d = READ_ONCE(*args->p4dp); 573 517 WARN_ON(!p4d_none(p4d)); 574 518 } 575 519 576 - static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp, 577 - pud_t *pudp) 520 + static void __init 
p4d_populate_tests(struct pgtable_debug_args *args) 578 521 { 579 522 p4d_t p4d; 580 523 581 - if (mm_pud_folded(mm)) 524 + if (mm_pud_folded(args->mm)) 582 525 return; 583 526 584 527 pr_debug("Validating P4D populate\n"); ··· 582 533 * This entry points to next level page table page. 583 534 * Hence this must not qualify as p4d_bad(). 584 535 */ 585 - pud_clear(pudp); 586 - p4d_clear(p4dp); 587 - p4d_populate(mm, p4dp, pudp); 588 - p4d = READ_ONCE(*p4dp); 536 + pud_clear(args->pudp); 537 + p4d_clear(args->p4dp); 538 + p4d_populate(args->mm, args->p4dp, args->start_pudp); 539 + p4d = READ_ONCE(*args->p4dp); 589 540 WARN_ON(p4d_bad(p4d)); 590 541 } 591 542 592 - static void __init pgd_clear_tests(struct mm_struct *mm, pgd_t *pgdp) 543 + static void __init pgd_clear_tests(struct pgtable_debug_args *args) 593 544 { 594 - pgd_t pgd = READ_ONCE(*pgdp); 545 + pgd_t pgd = READ_ONCE(*(args->pgdp)); 595 546 596 - if (mm_p4d_folded(mm)) 547 + if (mm_p4d_folded(args->mm)) 597 548 return; 598 549 599 550 pr_debug("Validating PGD clear\n"); 600 551 pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE); 601 - WRITE_ONCE(*pgdp, pgd); 602 - pgd_clear(pgdp); 603 - pgd = READ_ONCE(*pgdp); 552 + WRITE_ONCE(*args->pgdp, pgd); 553 + pgd_clear(args->pgdp); 554 + pgd = READ_ONCE(*args->pgdp); 604 555 WARN_ON(!pgd_none(pgd)); 605 556 } 606 557 607 - static void __init pgd_populate_tests(struct mm_struct *mm, pgd_t *pgdp, 608 - p4d_t *p4dp) 558 + static void __init pgd_populate_tests(struct pgtable_debug_args *args) 609 559 { 610 560 pgd_t pgd; 611 561 612 - if (mm_p4d_folded(mm)) 562 + if (mm_p4d_folded(args->mm)) 613 563 return; 614 564 615 565 pr_debug("Validating PGD populate\n"); ··· 616 568 * This entry points to next level page table page. 617 569 * Hence this must not qualify as pgd_bad(). 
618 570 */ 619 - p4d_clear(p4dp); 620 - pgd_clear(pgdp); 621 - pgd_populate(mm, pgdp, p4dp); 622 - pgd = READ_ONCE(*pgdp); 571 + p4d_clear(args->p4dp); 572 + pgd_clear(args->pgdp); 573 + pgd_populate(args->mm, args->pgdp, args->start_p4dp); 574 + pgd = READ_ONCE(*args->pgdp); 623 575 WARN_ON(pgd_bad(pgd)); 624 576 } 625 577 #else /* !__PAGETABLE_P4D_FOLDED */ 626 - static void __init p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp) { } 627 - static void __init pgd_clear_tests(struct mm_struct *mm, pgd_t *pgdp) { } 628 - static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp, 629 - pud_t *pudp) 630 - { 631 - } 632 - static void __init pgd_populate_tests(struct mm_struct *mm, pgd_t *pgdp, 633 - p4d_t *p4dp) 634 - { 635 - } 578 + static void __init p4d_clear_tests(struct pgtable_debug_args *args) { } 579 + static void __init pgd_clear_tests(struct pgtable_debug_args *args) { } 580 + static void __init p4d_populate_tests(struct pgtable_debug_args *args) { } 581 + static void __init pgd_populate_tests(struct pgtable_debug_args *args) { } 636 582 #endif /* PAGETABLE_P4D_FOLDED */ 637 583 638 - static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep, 639 - unsigned long pfn, unsigned long vaddr, 640 - pgprot_t prot) 584 + static void __init pte_clear_tests(struct pgtable_debug_args *args) 641 585 { 642 - pte_t pte = pfn_pte(pfn, prot); 586 + struct page *page; 587 + pte_t pte = pfn_pte(args->pte_pfn, args->page_prot); 643 588 589 + page = (args->pte_pfn != ULONG_MAX) ? pfn_to_page(args->pte_pfn) : NULL; 590 + if (!page) 591 + return; 592 + 593 + /* 594 + * flush_dcache_page() is called after set_pte_at() to clear 595 + * PG_arch_1 for the page on ARM64. The page flag isn't cleared 596 + * when it's released and page allocation check will fail when 597 + * the page is allocated again. For architectures other than ARM64, 598 + * the unexpected overhead of cache flushing is acceptable. 
599 + */ 644 600 pr_debug("Validating PTE clear\n"); 645 601 #ifndef CONFIG_RISCV 646 602 pte = __pte(pte_val(pte) | RANDOM_ORVALUE); 647 603 #endif 648 - set_pte_at(mm, vaddr, ptep, pte); 604 + set_pte_at(args->mm, args->vaddr, args->ptep, pte); 605 + flush_dcache_page(page); 649 606 barrier(); 650 - pte_clear(mm, vaddr, ptep); 651 - pte = ptep_get(ptep); 607 + pte_clear(args->mm, args->vaddr, args->ptep); 608 + pte = ptep_get(args->ptep); 652 609 WARN_ON(!pte_none(pte)); 653 610 } 654 611 655 - static void __init pmd_clear_tests(struct mm_struct *mm, pmd_t *pmdp) 612 + static void __init pmd_clear_tests(struct pgtable_debug_args *args) 656 613 { 657 - pmd_t pmd = READ_ONCE(*pmdp); 614 + pmd_t pmd = READ_ONCE(*args->pmdp); 658 615 659 616 pr_debug("Validating PMD clear\n"); 660 617 pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE); 661 - WRITE_ONCE(*pmdp, pmd); 662 - pmd_clear(pmdp); 663 - pmd = READ_ONCE(*pmdp); 618 + WRITE_ONCE(*args->pmdp, pmd); 619 + pmd_clear(args->pmdp); 620 + pmd = READ_ONCE(*args->pmdp); 664 621 WARN_ON(!pmd_none(pmd)); 665 622 } 666 623 667 - static void __init pmd_populate_tests(struct mm_struct *mm, pmd_t *pmdp, 668 - pgtable_t pgtable) 624 + static void __init pmd_populate_tests(struct pgtable_debug_args *args) 669 625 { 670 626 pmd_t pmd; 671 627 ··· 678 626 * This entry points to next level page table page. 679 627 * Hence this must not qualify as pmd_bad(). 
680 628 */ 681 - pmd_populate(mm, pmdp, pgtable); 682 - pmd = READ_ONCE(*pmdp); 629 + pmd_populate(args->mm, args->pmdp, args->start_ptep); 630 + pmd = READ_ONCE(*args->pmdp); 683 631 WARN_ON(pmd_bad(pmd)); 684 632 } 685 633 686 - static void __init pte_special_tests(unsigned long pfn, pgprot_t prot) 634 + static void __init pte_special_tests(struct pgtable_debug_args *args) 687 635 { 688 - pte_t pte = pfn_pte(pfn, prot); 636 + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); 689 637 690 638 if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) 691 639 return; ··· 694 642 WARN_ON(!pte_special(pte_mkspecial(pte))); 695 643 } 696 644 697 - static void __init pte_protnone_tests(unsigned long pfn, pgprot_t prot) 645 + static void __init pte_protnone_tests(struct pgtable_debug_args *args) 698 646 { 699 - pte_t pte = pfn_pte(pfn, prot); 647 + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot_none); 700 648 701 649 if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) 702 650 return; ··· 707 655 } 708 656 709 657 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 710 - static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot) 658 + static void __init pmd_protnone_tests(struct pgtable_debug_args *args) 711 659 { 712 660 pmd_t pmd; 713 661 ··· 718 666 return; 719 667 720 668 pr_debug("Validating PMD protnone\n"); 721 - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); 669 + pmd = pmd_mkhuge(pfn_pmd(args->fixed_pmd_pfn, args->page_prot_none)); 722 670 WARN_ON(!pmd_protnone(pmd)); 723 671 WARN_ON(!pmd_present(pmd)); 724 672 } 725 673 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 726 - static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot) { } 674 + static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { } 727 675 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 728 676 729 677 #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP 730 - static void __init pte_devmap_tests(unsigned long pfn, pgprot_t prot) 678 + static void __init pte_devmap_tests(struct pgtable_debug_args *args) 731 679 { 732 - 
pte_t pte = pfn_pte(pfn, prot); 680 + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); 733 681 734 682 pr_debug("Validating PTE devmap\n"); 735 683 WARN_ON(!pte_devmap(pte_mkdevmap(pte))); 736 684 } 737 685 738 686 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 739 - static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) 687 + static void __init pmd_devmap_tests(struct pgtable_debug_args *args) 740 688 { 741 689 pmd_t pmd; 742 690 ··· 744 692 return; 745 693 746 694 pr_debug("Validating PMD devmap\n"); 747 - pmd = pfn_pmd(pfn, prot); 695 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); 748 696 WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd))); 749 697 } 750 698 751 699 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 752 - static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) 700 + static void __init pud_devmap_tests(struct pgtable_debug_args *args) 753 701 { 754 702 pud_t pud; 755 703 ··· 757 705 return; 758 706 759 707 pr_debug("Validating PUD devmap\n"); 760 - pud = pfn_pud(pfn, prot); 708 + pud = pfn_pud(args->fixed_pud_pfn, args->page_prot); 761 709 WARN_ON(!pud_devmap(pud_mkdevmap(pud))); 762 710 } 763 711 #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 764 - static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { } 712 + static void __init pud_devmap_tests(struct pgtable_debug_args *args) { } 765 713 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 766 714 #else /* CONFIG_TRANSPARENT_HUGEPAGE */ 767 - static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) { } 768 - static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { } 715 + static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { } 716 + static void __init pud_devmap_tests(struct pgtable_debug_args *args) { } 769 717 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 770 718 #else 771 - static void __init pte_devmap_tests(unsigned long pfn, pgprot_t prot) { } 772 - static void __init pmd_devmap_tests(unsigned long 
pfn, pgprot_t prot) { } 773 - static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { } 719 + static void __init pte_devmap_tests(struct pgtable_debug_args *args) { } 720 + static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { } 721 + static void __init pud_devmap_tests(struct pgtable_debug_args *args) { } 774 722 #endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ 775 723 776 - static void __init pte_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 724 + static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args) 777 725 { 778 - pte_t pte = pfn_pte(pfn, prot); 726 + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); 779 727 780 728 if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) 781 729 return; ··· 785 733 WARN_ON(pte_soft_dirty(pte_clear_soft_dirty(pte))); 786 734 } 787 735 788 - static void __init pte_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 736 + static void __init pte_swap_soft_dirty_tests(struct pgtable_debug_args *args) 789 737 { 790 - pte_t pte = pfn_pte(pfn, prot); 738 + pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); 791 739 792 740 if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) 793 741 return; ··· 798 746 } 799 747 800 748 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 801 - static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 749 + static void __init pmd_soft_dirty_tests(struct pgtable_debug_args *args) 802 750 { 803 751 pmd_t pmd; 804 752 ··· 809 757 return; 810 758 811 759 pr_debug("Validating PMD soft dirty\n"); 812 - pmd = pfn_pmd(pfn, prot); 760 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); 813 761 WARN_ON(!pmd_soft_dirty(pmd_mksoft_dirty(pmd))); 814 762 WARN_ON(pmd_soft_dirty(pmd_clear_soft_dirty(pmd))); 815 763 } 816 764 817 - static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 765 + static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) 818 766 { 819 767 pmd_t pmd; 820 768 ··· 826 774 return; 827 775 828 776 
pr_debug("Validating PMD swap soft dirty\n"); 829 - pmd = pfn_pmd(pfn, prot); 777 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); 830 778 WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd))); 831 779 WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd))); 832 780 } 833 781 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 834 - static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { } 835 - static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 836 - { 837 - } 782 + static void __init pmd_soft_dirty_tests(struct pgtable_debug_args *args) { } 783 + static void __init pmd_swap_soft_dirty_tests(struct pgtable_debug_args *args) { } 838 784 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 839 785 840 - static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot) 786 + static void __init pte_swap_tests(struct pgtable_debug_args *args) 841 787 { 842 788 swp_entry_t swp; 843 789 pte_t pte; 844 790 845 791 pr_debug("Validating PTE swap\n"); 846 - pte = pfn_pte(pfn, prot); 792 + pte = pfn_pte(args->fixed_pte_pfn, args->page_prot); 847 793 swp = __pte_to_swp_entry(pte); 848 794 pte = __swp_entry_to_pte(swp); 849 - WARN_ON(pfn != pte_pfn(pte)); 795 + WARN_ON(args->fixed_pte_pfn != pte_pfn(pte)); 850 796 } 851 797 852 798 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION 853 - static void __init pmd_swap_tests(unsigned long pfn, pgprot_t prot) 799 + static void __init pmd_swap_tests(struct pgtable_debug_args *args) 854 800 { 855 801 swp_entry_t swp; 856 802 pmd_t pmd; ··· 857 807 return; 858 808 859 809 pr_debug("Validating PMD swap\n"); 860 - pmd = pfn_pmd(pfn, prot); 810 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); 861 811 swp = __pmd_to_swp_entry(pmd); 862 812 pmd = __swp_entry_to_pmd(swp); 863 - WARN_ON(pfn != pmd_pfn(pmd)); 813 + WARN_ON(args->fixed_pmd_pfn != pmd_pfn(pmd)); 864 814 } 865 815 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */ 866 - static void __init pmd_swap_tests(unsigned long pfn, pgprot_t prot) { } 816 + 
static void __init pmd_swap_tests(struct pgtable_debug_args *args) { } 867 817 #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ 868 818 869 - static void __init swap_migration_tests(void) 819 + static void __init swap_migration_tests(struct pgtable_debug_args *args) 870 820 { 871 821 struct page *page; 872 822 swp_entry_t swp; ··· 874 824 if (!IS_ENABLED(CONFIG_MIGRATION)) 875 825 return; 876 826 877 - pr_debug("Validating swap migration\n"); 878 827 /* 879 828 * swap_migration_tests() requires a dedicated page as it needs to 880 829 * be locked before creating a migration entry from it. Locking the 881 830 * page that actually maps kernel text ('start_kernel') can be real 882 - * problematic. Lets allocate a dedicated page explicitly for this 883 - * purpose that will be freed subsequently. 831 + * problematic. Lets use the allocated page explicitly for this 832 + * purpose. 884 833 */ 885 - page = alloc_page(GFP_KERNEL); 886 - if (!page) { 887 - pr_err("page allocation failed\n"); 834 + page = (args->pte_pfn != ULONG_MAX) ? pfn_to_page(args->pte_pfn) : NULL; 835 + if (!page) 888 836 return; 889 - } 837 + 838 + pr_debug("Validating swap migration\n"); 890 839 891 840 /* 892 841 * make_migration_entry() expects given page to be ··· 904 855 WARN_ON(!is_migration_entry(swp)); 905 856 WARN_ON(is_writable_migration_entry(swp)); 906 857 __ClearPageLocked(page); 907 - __free_page(page); 908 858 } 909 859 910 860 #ifdef CONFIG_HUGETLB_PAGE 911 - static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) 861 + static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) 912 862 { 913 863 struct page *page; 914 864 pte_t pte; ··· 917 869 * Accessing the page associated with the pfn is safe here, 918 870 * as it was previously derived from a real kernel symbol. 
919 871 */ 920 - page = pfn_to_page(pfn); 921 - pte = mk_huge_pte(page, prot); 872 + page = pfn_to_page(args->fixed_pmd_pfn); 873 + pte = mk_huge_pte(page, args->page_prot); 922 874 923 875 WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte))); 924 876 WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte)))); 925 877 WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte)))); 926 878 927 879 #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB 928 - pte = pfn_pte(pfn, prot); 880 + pte = pfn_pte(args->fixed_pmd_pfn, args->page_prot); 929 881 930 882 WARN_ON(!pte_huge(pte_mkhuge(pte))); 931 883 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ 932 884 } 933 885 #else /* !CONFIG_HUGETLB_PAGE */ 934 - static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { } 886 + static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { } 935 887 #endif /* CONFIG_HUGETLB_PAGE */ 936 888 937 889 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 938 - static void __init pmd_thp_tests(unsigned long pfn, pgprot_t prot) 890 + static void __init pmd_thp_tests(struct pgtable_debug_args *args) 939 891 { 940 892 pmd_t pmd; 941 893 ··· 954 906 * needs to return true. pmd_present() should be true whenever 955 907 * pmd_trans_huge() returns true. 
956 908 */ 957 - pmd = pfn_pmd(pfn, prot); 909 + pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot); 958 910 WARN_ON(!pmd_trans_huge(pmd_mkhuge(pmd))); 959 911 960 912 #ifndef __HAVE_ARCH_PMDP_INVALIDATE ··· 964 916 } 965 917 966 918 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 967 - static void __init pud_thp_tests(unsigned long pfn, pgprot_t prot) 919 + static void __init pud_thp_tests(struct pgtable_debug_args *args) 968 920 { 969 921 pud_t pud; 970 922 ··· 972 924 return; 973 925 974 926 pr_debug("Validating PUD based THP\n"); 975 - pud = pfn_pud(pfn, prot); 927 + pud = pfn_pud(args->fixed_pud_pfn, args->page_prot); 976 928 WARN_ON(!pud_trans_huge(pud_mkhuge(pud))); 977 929 978 930 /* ··· 984 936 */ 985 937 } 986 938 #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 987 - static void __init pud_thp_tests(unsigned long pfn, pgprot_t prot) { } 939 + static void __init pud_thp_tests(struct pgtable_debug_args *args) { } 988 940 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 989 941 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 990 - static void __init pmd_thp_tests(unsigned long pfn, pgprot_t prot) { } 991 - static void __init pud_thp_tests(unsigned long pfn, pgprot_t prot) { } 942 + static void __init pmd_thp_tests(struct pgtable_debug_args *args) { } 943 + static void __init pud_thp_tests(struct pgtable_debug_args *args) { } 992 944 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 993 945 994 946 static unsigned long __init get_random_vaddr(void) ··· 1003 955 return random_vaddr; 1004 956 } 1005 957 1006 - static int __init debug_vm_pgtable(void) 958 + static void __init destroy_args(struct pgtable_debug_args *args) 1007 959 { 1008 - struct vm_area_struct *vma; 1009 - struct mm_struct *mm; 1010 - pgd_t *pgdp; 1011 - p4d_t *p4dp, *saved_p4dp; 1012 - pud_t *pudp, *saved_pudp; 1013 - pmd_t *pmdp, *saved_pmdp, pmd; 1014 - pte_t *ptep; 1015 - pgtable_t saved_ptep; 1016 - pgprot_t prot, protnone; 1017 - phys_addr_t paddr; 1018 - unsigned long vaddr, pte_aligned, 
pmd_aligned; 1019 - unsigned long pud_aligned, p4d_aligned, pgd_aligned; 1020 - spinlock_t *ptl = NULL; 1021 - int idx; 960 + struct page *page = NULL; 1022 961 1023 - pr_info("Validating architecture page table helpers\n"); 1024 - prot = vm_get_page_prot(VMFLAGS); 1025 - vaddr = get_random_vaddr(); 1026 - mm = mm_alloc(); 1027 - if (!mm) { 1028 - pr_err("mm_struct allocation failed\n"); 1029 - return 1; 962 + /* Free (huge) page */ 963 + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 964 + IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && 965 + has_transparent_hugepage() && 966 + args->pud_pfn != ULONG_MAX) { 967 + if (args->is_contiguous_page) { 968 + free_contig_range(args->pud_pfn, 969 + (1 << (HPAGE_PUD_SHIFT - PAGE_SHIFT))); 970 + } else { 971 + page = pfn_to_page(args->pud_pfn); 972 + __free_pages(page, HPAGE_PUD_SHIFT - PAGE_SHIFT); 973 + } 974 + 975 + args->pud_pfn = ULONG_MAX; 976 + args->pmd_pfn = ULONG_MAX; 977 + args->pte_pfn = ULONG_MAX; 1030 978 } 1031 979 980 + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 981 + has_transparent_hugepage() && 982 + args->pmd_pfn != ULONG_MAX) { 983 + if (args->is_contiguous_page) { 984 + free_contig_range(args->pmd_pfn, (1 << HPAGE_PMD_ORDER)); 985 + } else { 986 + page = pfn_to_page(args->pmd_pfn); 987 + __free_pages(page, HPAGE_PMD_ORDER); 988 + } 989 + 990 + args->pmd_pfn = ULONG_MAX; 991 + args->pte_pfn = ULONG_MAX; 992 + } 993 + 994 + if (args->pte_pfn != ULONG_MAX) { 995 + page = pfn_to_page(args->pte_pfn); 996 + __free_pages(page, 0); 997 + 998 + args->pte_pfn = ULONG_MAX; 999 + } 1000 + 1001 + /* Free page table entries */ 1002 + if (args->start_ptep) { 1003 + pte_free(args->mm, args->start_ptep); 1004 + mm_dec_nr_ptes(args->mm); 1005 + } 1006 + 1007 + if (args->start_pmdp) { 1008 + pmd_free(args->mm, args->start_pmdp); 1009 + mm_dec_nr_pmds(args->mm); 1010 + } 1011 + 1012 + if (args->start_pudp) { 1013 + pud_free(args->mm, args->start_pudp); 1014 + mm_dec_nr_puds(args->mm); 1015 + } 1016 + 1017 + if 
(args->start_p4dp) 1018 + p4d_free(args->mm, args->start_p4dp); 1019 + 1020 + /* Free vma and mm struct */ 1021 + if (args->vma) 1022 + vm_area_free(args->vma); 1023 + 1024 + if (args->mm) 1025 + mmdrop(args->mm); 1026 + } 1027 + 1028 + static struct page * __init 1029 + debug_vm_pgtable_alloc_huge_page(struct pgtable_debug_args *args, int order) 1030 + { 1031 + struct page *page = NULL; 1032 + 1033 + #ifdef CONFIG_CONTIG_ALLOC 1034 + if (order >= MAX_ORDER) { 1035 + page = alloc_contig_pages((1 << order), GFP_KERNEL, 1036 + first_online_node, NULL); 1037 + if (page) { 1038 + args->is_contiguous_page = true; 1039 + return page; 1040 + } 1041 + } 1042 + #endif 1043 + 1044 + if (order < MAX_ORDER) 1045 + page = alloc_pages(GFP_KERNEL, order); 1046 + 1047 + return page; 1048 + } 1049 + 1050 + static int __init init_args(struct pgtable_debug_args *args) 1051 + { 1052 + struct page *page = NULL; 1053 + phys_addr_t phys; 1054 + int ret = 0; 1055 + 1032 1056 /* 1057 + * Initialize the debugging data. 1058 + * 1033 1059 * __P000 (or even __S000) will help create page table entries with 1034 1060 * PROT_NONE permission as required for pxx_protnone_tests(). 
1035 1061 */ 1036 - protnone = __P000; 1062 + memset(args, 0, sizeof(*args)); 1063 + args->vaddr = get_random_vaddr(); 1064 + args->page_prot = vm_get_page_prot(VMFLAGS); 1065 + args->page_prot_none = __P000; 1066 + args->is_contiguous_page = false; 1067 + args->pud_pfn = ULONG_MAX; 1068 + args->pmd_pfn = ULONG_MAX; 1069 + args->pte_pfn = ULONG_MAX; 1070 + args->fixed_pgd_pfn = ULONG_MAX; 1071 + args->fixed_p4d_pfn = ULONG_MAX; 1072 + args->fixed_pud_pfn = ULONG_MAX; 1073 + args->fixed_pmd_pfn = ULONG_MAX; 1074 + args->fixed_pte_pfn = ULONG_MAX; 1037 1075 1038 - vma = vm_area_alloc(mm); 1039 - if (!vma) { 1040 - pr_err("vma allocation failed\n"); 1041 - return 1; 1076 + /* Allocate mm and vma */ 1077 + args->mm = mm_alloc(); 1078 + if (!args->mm) { 1079 + pr_err("Failed to allocate mm struct\n"); 1080 + ret = -ENOMEM; 1081 + goto error; 1042 1082 } 1083 + 1084 + args->vma = vm_area_alloc(args->mm); 1085 + if (!args->vma) { 1086 + pr_err("Failed to allocate vma\n"); 1087 + ret = -ENOMEM; 1088 + goto error; 1089 + } 1090 + 1091 + /* 1092 + * Allocate page table entries. They will be modified in the tests. 1093 + * Lets save the page table entries so that they can be released 1094 + * when the tests are completed. 
1095 + */ 1096 + args->pgdp = pgd_offset(args->mm, args->vaddr); 1097 + args->p4dp = p4d_alloc(args->mm, args->pgdp, args->vaddr); 1098 + if (!args->p4dp) { 1099 + pr_err("Failed to allocate p4d entries\n"); 1100 + ret = -ENOMEM; 1101 + goto error; 1102 + } 1103 + args->start_p4dp = p4d_offset(args->pgdp, 0UL); 1104 + WARN_ON(!args->start_p4dp); 1105 + 1106 + args->pudp = pud_alloc(args->mm, args->p4dp, args->vaddr); 1107 + if (!args->pudp) { 1108 + pr_err("Failed to allocate pud entries\n"); 1109 + ret = -ENOMEM; 1110 + goto error; 1111 + } 1112 + args->start_pudp = pud_offset(args->p4dp, 0UL); 1113 + WARN_ON(!args->start_pudp); 1114 + 1115 + args->pmdp = pmd_alloc(args->mm, args->pudp, args->vaddr); 1116 + if (!args->pmdp) { 1117 + pr_err("Failed to allocate pmd entries\n"); 1118 + ret = -ENOMEM; 1119 + goto error; 1120 + } 1121 + args->start_pmdp = pmd_offset(args->pudp, 0UL); 1122 + WARN_ON(!args->start_pmdp); 1123 + 1124 + if (pte_alloc(args->mm, args->pmdp)) { 1125 + pr_err("Failed to allocate pte entries\n"); 1126 + ret = -ENOMEM; 1127 + goto error; 1128 + } 1129 + args->start_ptep = pmd_pgtable(READ_ONCE(*args->pmdp)); 1130 + WARN_ON(!args->start_ptep); 1043 1131 1044 1132 /* 1045 1133 * PFN for mapping at PTE level is determined from a standard kernel ··· 1184 1000 * exist on the platform but that does not really matter as pfn_pxx() 1185 1001 * helpers will still create appropriate entries for the test. This 1186 1002 * helps avoid large memory block allocations to be used for mapping 1187 - * at higher page table levels. 1003 + * at higher page table levels in some of the tests. 
1188 1004 */ 1189 - paddr = __pa_symbol(&start_kernel); 1005 + phys = __pa_symbol(&start_kernel); 1006 + args->fixed_pgd_pfn = __phys_to_pfn(phys & PGDIR_MASK); 1007 + args->fixed_p4d_pfn = __phys_to_pfn(phys & P4D_MASK); 1008 + args->fixed_pud_pfn = __phys_to_pfn(phys & PUD_MASK); 1009 + args->fixed_pmd_pfn = __phys_to_pfn(phys & PMD_MASK); 1010 + args->fixed_pte_pfn = __phys_to_pfn(phys & PAGE_MASK); 1011 + WARN_ON(!pfn_valid(args->fixed_pte_pfn)); 1190 1012 1191 - pte_aligned = (paddr & PAGE_MASK) >> PAGE_SHIFT; 1192 - pmd_aligned = (paddr & PMD_MASK) >> PAGE_SHIFT; 1193 - pud_aligned = (paddr & PUD_MASK) >> PAGE_SHIFT; 1194 - p4d_aligned = (paddr & P4D_MASK) >> PAGE_SHIFT; 1195 - pgd_aligned = (paddr & PGDIR_MASK) >> PAGE_SHIFT; 1196 - WARN_ON(!pfn_valid(pte_aligned)); 1197 - 1198 - pgdp = pgd_offset(mm, vaddr); 1199 - p4dp = p4d_alloc(mm, pgdp, vaddr); 1200 - pudp = pud_alloc(mm, p4dp, vaddr); 1201 - pmdp = pmd_alloc(mm, pudp, vaddr); 1202 1013 /* 1203 - * Allocate pgtable_t 1014 + * Allocate (huge) pages because some of the tests need to access 1015 + * the data in the pages. The corresponding tests will be skipped 1016 + * if we fail to allocate (huge) pages. 1204 1017 */ 1205 - if (pte_alloc(mm, pmdp)) { 1206 - pr_err("pgtable allocation failed\n"); 1207 - return 1; 1018 + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 1019 + IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && 1020 + has_transparent_hugepage()) { 1021 + page = debug_vm_pgtable_alloc_huge_page(args, 1022 + HPAGE_PUD_SHIFT - PAGE_SHIFT); 1023 + if (page) { 1024 + args->pud_pfn = page_to_pfn(page); 1025 + args->pmd_pfn = args->pud_pfn; 1026 + args->pte_pfn = args->pud_pfn; 1027 + return 0; 1028 + } 1208 1029 } 1209 1030 1210 - /* 1211 - * Save all the page table page addresses as the page table 1212 - * entries will be used for testing with random or garbage 1213 - * values. These saved addresses will be used for freeing 1214 - * page table pages. 
1215 - */ 1216 - pmd = READ_ONCE(*pmdp); 1217 - saved_p4dp = p4d_offset(pgdp, 0UL); 1218 - saved_pudp = pud_offset(p4dp, 0UL); 1219 - saved_pmdp = pmd_offset(pudp, 0UL); 1220 - saved_ptep = pmd_pgtable(pmd); 1031 + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 1032 + has_transparent_hugepage()) { 1033 + page = debug_vm_pgtable_alloc_huge_page(args, HPAGE_PMD_ORDER); 1034 + if (page) { 1035 + args->pmd_pfn = page_to_pfn(page); 1036 + args->pte_pfn = args->pmd_pfn; 1037 + return 0; 1038 + } 1039 + } 1040 + 1041 + page = alloc_pages(GFP_KERNEL, 0); 1042 + if (page) 1043 + args->pte_pfn = page_to_pfn(page); 1044 + 1045 + return 0; 1046 + 1047 + error: 1048 + destroy_args(args); 1049 + return ret; 1050 + } 1051 + 1052 + static int __init debug_vm_pgtable(void) 1053 + { 1054 + struct pgtable_debug_args args; 1055 + spinlock_t *ptl = NULL; 1056 + int idx, ret; 1057 + 1058 + pr_info("Validating architecture page table helpers\n"); 1059 + ret = init_args(&args); 1060 + if (ret) 1061 + return ret; 1221 1062 1222 1063 /* 1223 1064 * Iterate over the protection_map[] to make sure that all ··· 1251 1042 * given page table entry. 1252 1043 */ 1253 1044 for (idx = 0; idx < ARRAY_SIZE(protection_map); idx++) { 1254 - pte_basic_tests(pte_aligned, idx); 1255 - pmd_basic_tests(pmd_aligned, idx); 1256 - pud_basic_tests(mm, pud_aligned, idx); 1045 + pte_basic_tests(&args, idx); 1046 + pmd_basic_tests(&args, idx); 1047 + pud_basic_tests(&args, idx); 1257 1048 } 1258 1049 1259 1050 /* ··· 1263 1054 * the above iteration for now to save some test execution 1264 1055 * time. 
1265 1056 */ 1266 - p4d_basic_tests(p4d_aligned, prot); 1267 - pgd_basic_tests(pgd_aligned, prot); 1057 + p4d_basic_tests(&args); 1058 + pgd_basic_tests(&args); 1268 1059 1269 - pmd_leaf_tests(pmd_aligned, prot); 1270 - pud_leaf_tests(pud_aligned, prot); 1060 + pmd_leaf_tests(&args); 1061 + pud_leaf_tests(&args); 1271 1062 1272 - pte_savedwrite_tests(pte_aligned, protnone); 1273 - pmd_savedwrite_tests(pmd_aligned, protnone); 1063 + pte_savedwrite_tests(&args); 1064 + pmd_savedwrite_tests(&args); 1274 1065 1275 - pte_special_tests(pte_aligned, prot); 1276 - pte_protnone_tests(pte_aligned, protnone); 1277 - pmd_protnone_tests(pmd_aligned, protnone); 1066 + pte_special_tests(&args); 1067 + pte_protnone_tests(&args); 1068 + pmd_protnone_tests(&args); 1278 1069 1279 - pte_devmap_tests(pte_aligned, prot); 1280 - pmd_devmap_tests(pmd_aligned, prot); 1281 - pud_devmap_tests(pud_aligned, prot); 1070 + pte_devmap_tests(&args); 1071 + pmd_devmap_tests(&args); 1072 + pud_devmap_tests(&args); 1282 1073 1283 - pte_soft_dirty_tests(pte_aligned, prot); 1284 - pmd_soft_dirty_tests(pmd_aligned, prot); 1285 - pte_swap_soft_dirty_tests(pte_aligned, prot); 1286 - pmd_swap_soft_dirty_tests(pmd_aligned, prot); 1074 + pte_soft_dirty_tests(&args); 1075 + pmd_soft_dirty_tests(&args); 1076 + pte_swap_soft_dirty_tests(&args); 1077 + pmd_swap_soft_dirty_tests(&args); 1287 1078 1288 - pte_swap_tests(pte_aligned, prot); 1289 - pmd_swap_tests(pmd_aligned, prot); 1079 + pte_swap_tests(&args); 1080 + pmd_swap_tests(&args); 1290 1081 1291 - swap_migration_tests(); 1082 + swap_migration_tests(&args); 1292 1083 1293 - pmd_thp_tests(pmd_aligned, prot); 1294 - pud_thp_tests(pud_aligned, prot); 1084 + pmd_thp_tests(&args); 1085 + pud_thp_tests(&args); 1295 1086 1296 - hugetlb_basic_tests(pte_aligned, prot); 1087 + hugetlb_basic_tests(&args); 1297 1088 1298 1089 /* 1299 1090 * Page table modifying tests. They need to hold 1300 1091 * proper page table lock. 
1301 1092 */ 1302 1093 1303 - ptep = pte_offset_map_lock(mm, pmdp, vaddr, &ptl); 1304 - pte_clear_tests(mm, ptep, pte_aligned, vaddr, prot); 1305 - pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); 1306 - pte_unmap_unlock(ptep, ptl); 1094 + args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl); 1095 + pte_clear_tests(&args); 1096 + pte_advanced_tests(&args); 1097 + pte_unmap_unlock(args.ptep, ptl); 1307 1098 1308 - ptl = pmd_lock(mm, pmdp); 1309 - pmd_clear_tests(mm, pmdp); 1310 - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); 1311 - pmd_huge_tests(pmdp, pmd_aligned, prot); 1312 - pmd_populate_tests(mm, pmdp, saved_ptep); 1099 + ptl = pmd_lock(args.mm, args.pmdp); 1100 + pmd_clear_tests(&args); 1101 + pmd_advanced_tests(&args); 1102 + pmd_huge_tests(&args); 1103 + pmd_populate_tests(&args); 1313 1104 spin_unlock(ptl); 1314 1105 1315 - ptl = pud_lock(mm, pudp); 1316 - pud_clear_tests(mm, pudp); 1317 - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); 1318 - pud_huge_tests(pudp, pud_aligned, prot); 1319 - pud_populate_tests(mm, pudp, saved_pmdp); 1106 + ptl = pud_lock(args.mm, args.pudp); 1107 + pud_clear_tests(&args); 1108 + pud_advanced_tests(&args); 1109 + pud_huge_tests(&args); 1110 + pud_populate_tests(&args); 1320 1111 spin_unlock(ptl); 1321 1112 1322 - spin_lock(&mm->page_table_lock); 1323 - p4d_clear_tests(mm, p4dp); 1324 - pgd_clear_tests(mm, pgdp); 1325 - p4d_populate_tests(mm, p4dp, saved_pudp); 1326 - pgd_populate_tests(mm, pgdp, saved_p4dp); 1327 - spin_unlock(&mm->page_table_lock); 1113 + spin_lock(&(args.mm->page_table_lock)); 1114 + p4d_clear_tests(&args); 1115 + pgd_clear_tests(&args); 1116 + p4d_populate_tests(&args); 1117 + pgd_populate_tests(&args); 1118 + spin_unlock(&(args.mm->page_table_lock)); 1328 1119 1329 - p4d_free(mm, saved_p4dp); 1330 - pud_free(mm, saved_pudp); 1331 - pmd_free(mm, saved_pmdp); 1332 - pte_free(mm, saved_ptep); 1333 - 1334 - vm_area_free(vma); 1335 - 
mm_dec_nr_puds(mm); 1336 - mm_dec_nr_pmds(mm); 1337 - mm_dec_nr_ptes(mm); 1338 - mmdrop(mm); 1120 + destroy_args(&args); 1339 1121 return 0; 1340 1122 } 1341 1123 late_initcall(debug_vm_pgtable);
+6 -9
mm/filemap.c
··· 260 260 void delete_from_page_cache(struct page *page) 261 261 { 262 262 struct address_space *mapping = page_mapping(page); 263 - unsigned long flags; 264 263 265 264 BUG_ON(!PageLocked(page)); 266 - xa_lock_irqsave(&mapping->i_pages, flags); 265 + xa_lock_irq(&mapping->i_pages); 267 266 __delete_from_page_cache(page, NULL); 268 - xa_unlock_irqrestore(&mapping->i_pages, flags); 267 + xa_unlock_irq(&mapping->i_pages); 269 268 270 269 page_cache_free_page(mapping, page); 271 270 } ··· 336 337 struct pagevec *pvec) 337 338 { 338 339 int i; 339 - unsigned long flags; 340 340 341 341 if (!pagevec_count(pvec)) 342 342 return; 343 343 344 - xa_lock_irqsave(&mapping->i_pages, flags); 344 + xa_lock_irq(&mapping->i_pages); 345 345 for (i = 0; i < pagevec_count(pvec); i++) { 346 346 trace_mm_filemap_delete_from_page_cache(pvec->pages[i]); 347 347 348 348 unaccount_page_cache_page(mapping, pvec->pages[i]); 349 349 } 350 350 page_cache_delete_batch(mapping, pvec); 351 - xa_unlock_irqrestore(&mapping->i_pages, flags); 351 + xa_unlock_irq(&mapping->i_pages); 352 352 353 353 for (i = 0; i < pagevec_count(pvec); i++) 354 354 page_cache_free_page(mapping, pvec->pages[i]); ··· 839 841 void (*freepage)(struct page *) = mapping->a_ops->freepage; 840 842 pgoff_t offset = old->index; 841 843 XA_STATE(xas, &mapping->i_pages, offset); 842 - unsigned long flags; 843 844 844 845 VM_BUG_ON_PAGE(!PageLocked(old), old); 845 846 VM_BUG_ON_PAGE(!PageLocked(new), new); ··· 850 853 851 854 mem_cgroup_migrate(old, new); 852 855 853 - xas_lock_irqsave(&xas, flags); 856 + xas_lock_irq(&xas); 854 857 xas_store(&xas, new); 855 858 856 859 old->mapping = NULL; ··· 863 866 __dec_lruvec_page_state(old, NR_SHMEM); 864 867 if (PageSwapBacked(new)) 865 868 __inc_lruvec_page_state(new, NR_SHMEM); 866 - xas_unlock_irqrestore(&xas, flags); 869 + xas_unlock_irq(&xas); 867 870 if (freepage) 868 871 freepage(old); 869 872 put_page(old);
+54 -55
mm/gup.c
··· 62 62 put_page(page); 63 63 } 64 64 65 - /* 66 - * Return the compound head page with ref appropriately incremented, 67 - * or NULL if that failed. 65 + /** 66 + * try_get_compound_head() - return the compound head page with refcount 67 + * appropriately incremented, or NULL if that failed. 68 + * 69 + * This handles potential refcount overflow correctly. It also works correctly 70 + * for various lockless get_user_pages()-related callers, due to the use of 71 + * page_cache_add_speculative(). 72 + * 73 + * Even though the name includes "compound_head", this function is still 74 + * appropriate for callers that have a non-compound @page to get. 75 + * 76 + * @page: pointer to page to be gotten 77 + * @refs: the value to add to the page's refcount 78 + * 79 + * Return: head page (with refcount appropriately incremented) for success, or 80 + * NULL upon failure. 68 81 */ 69 - static inline struct page *try_get_compound_head(struct page *page, int refs) 82 + struct page *try_get_compound_head(struct page *page, int refs) 70 83 { 71 84 struct page *head = compound_head(page); 72 85 ··· 105 92 return head; 106 93 } 107 94 108 - /* 95 + /** 109 96 * try_grab_compound_head() - attempt to elevate a page's refcount, by a 110 97 * flags-dependent amount. 98 + * 99 + * Even though the name includes "compound_head", this function is still 100 + * appropriate for callers that have a non-compound @page to get. 101 + * 102 + * @page: pointer to page to be grabbed 103 + * @refs: the value to (effectively) add to the page's refcount 104 + * @flags: gup flags: these are the FOLL_* flag values. 111 105 * 112 106 * "grab" names in this file mean, "look at flags to decide whether to use 113 107 * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount. ··· 123 103 * same time. (That's true throughout the get_user_pages*() and 124 104 * pin_user_pages*() APIs.) Cases: 125 105 * 126 - * FOLL_GET: page's refcount will be incremented by 1. 
127 - * FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS. 106 + * FOLL_GET: page's refcount will be incremented by @refs. 107 + * 108 + * FOLL_PIN on compound pages that are > two pages long: page's refcount will 109 + * be incremented by @refs, and page[2].hpage_pinned_refcount will be 110 + * incremented by @refs * GUP_PIN_COUNTING_BIAS. 111 + * 112 + * FOLL_PIN on normal pages, or compound pages that are two pages long: 113 + * page's refcount will be incremented by @refs * GUP_PIN_COUNTING_BIAS. 128 114 * 129 115 * Return: head page (with refcount appropriately incremented) for success, or 130 116 * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's 131 117 * considered failure, and furthermore, a likely bug in the caller, so a warning 132 118 * is also emitted. 133 119 */ 134 - __maybe_unused struct page *try_grab_compound_head(struct page *page, 135 - int refs, unsigned int flags) 120 + struct page *try_grab_compound_head(struct page *page, 121 + int refs, unsigned int flags) 136 122 { 137 123 if (flags & FOLL_GET) 138 124 return try_get_compound_head(page, refs); 139 125 else if (flags & FOLL_PIN) { 140 - int orig_refs = refs; 141 - 142 126 /* 143 127 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a 144 128 * right zone, so fail and let the caller fall back to the slow ··· 167 143 * 168 144 * However, be sure to *also* increment the normal page refcount 169 145 * field at least once, so that the page really is pinned. 146 + * That's why the refcount from the earlier 147 + * try_get_compound_head() is left intact. 170 148 */ 171 149 if (hpage_pincount_available(page)) 172 150 hpage_pincount_add(page, refs); ··· 176 150 page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1)); 177 151 178 152 mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 179 - orig_refs); 153 + refs); 180 154 181 155 return page; 182 156 } ··· 212 186 * @flags: gup flags: these are the FOLL_* flag values. 
213 187 * 214 188 * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same 215 - * time. Cases: 216 - * 217 - * FOLL_GET: page's refcount will be incremented by 1. 218 - * FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS. 189 + * time. Cases: please see the try_grab_compound_head() documentation, with 190 + * "refs=1". 219 191 * 220 192 * Return: true for success, or if no action was required (if neither FOLL_PIN 221 193 * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or ··· 221 197 */ 222 198 bool __must_check try_grab_page(struct page *page, unsigned int flags) 223 199 { 224 - WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN)); 200 + if (!(flags & (FOLL_GET | FOLL_PIN))) 201 + return true; 225 202 226 - if (flags & FOLL_GET) 227 - return try_get_page(page); 228 - else if (flags & FOLL_PIN) { 229 - int refs = 1; 230 - 231 - page = compound_head(page); 232 - 233 - if (WARN_ON_ONCE(page_ref_count(page) <= 0)) 234 - return false; 235 - 236 - if (hpage_pincount_available(page)) 237 - hpage_pincount_add(page, 1); 238 - else 239 - refs = GUP_PIN_COUNTING_BIAS; 240 - 241 - /* 242 - * Similar to try_grab_compound_head(): even if using the 243 - * hpage_pincount_add/_sub() routines, be sure to 244 - * *also* increment the normal page refcount field at least 245 - * once, so that the page really is pinned. 246 - */ 247 - page_ref_add(page, refs); 248 - 249 - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1); 250 - } 251 - 252 - return true; 203 + return try_grab_compound_head(page, 1, flags); 253 204 } 254 205 255 206 /** ··· 1150 1151 * We must stop here. 
1151 1152 */ 1152 1153 BUG_ON(gup_flags & FOLL_NOWAIT); 1153 - BUG_ON(ret != 0); 1154 1154 goto out; 1155 1155 } 1156 1156 continue; ··· 1274 1276 bool *unlocked) 1275 1277 { 1276 1278 struct vm_area_struct *vma; 1277 - vm_fault_t ret, major = 0; 1279 + vm_fault_t ret; 1278 1280 1279 1281 address = untagged_addr(address); 1280 1282 ··· 1294 1296 return -EINTR; 1295 1297 1296 1298 ret = handle_mm_fault(vma, address, fault_flags, NULL); 1297 - major |= ret & VM_FAULT_MAJOR; 1298 1299 if (ret & VM_FAULT_ERROR) { 1299 1300 int err = vm_fault_to_errno(ret, 0); 1300 1301 ··· 1472 1475 unsigned long nr_pages = (end - start) / PAGE_SIZE; 1473 1476 int gup_flags; 1474 1477 1475 - VM_BUG_ON(start & ~PAGE_MASK); 1476 - VM_BUG_ON(end & ~PAGE_MASK); 1478 + VM_BUG_ON(!PAGE_ALIGNED(start)); 1479 + VM_BUG_ON(!PAGE_ALIGNED(end)); 1477 1480 VM_BUG_ON_VMA(start < vma->vm_start, vma); 1478 1481 VM_BUG_ON_VMA(end > vma->vm_end, vma); 1479 1482 mmap_assert_locked(mm); ··· 1772 1775 if (!list_empty(&movable_page_list)) { 1773 1776 ret = migrate_pages(&movable_page_list, alloc_migration_target, 1774 1777 NULL, (unsigned long)&mtc, MIGRATE_SYNC, 1775 - MR_LONGTERM_PIN); 1778 + MR_LONGTERM_PIN, NULL); 1776 1779 if (ret && !list_empty(&movable_page_list)) 1777 1780 putback_movable_pages(&movable_page_list); 1778 1781 } ··· 2241 2244 { 2242 2245 int nr_start = *nr; 2243 2246 struct dev_pagemap *pgmap = NULL; 2247 + int ret = 1; 2244 2248 2245 2249 do { 2246 2250 struct page *page = pfn_to_page(pfn); ··· 2249 2251 pgmap = get_dev_pagemap(pfn, pgmap); 2250 2252 if (unlikely(!pgmap)) { 2251 2253 undo_dev_pagemap(nr, nr_start, flags, pages); 2252 - return 0; 2254 + ret = 0; 2255 + break; 2253 2256 } 2254 2257 SetPageReferenced(page); 2255 2258 pages[*nr] = page; 2256 2259 if (unlikely(!try_grab_page(page, flags))) { 2257 2260 undo_dev_pagemap(nr, nr_start, flags, pages); 2258 - return 0; 2261 + ret = 0; 2262 + break; 2259 2263 } 2260 2264 (*nr)++; 2261 2265 pfn++; 2262 2266 } while (addr += 
PAGE_SIZE, addr != end); 2263 2267 2264 - if (pgmap) 2265 - put_dev_pagemap(pgmap); 2266 - return 1; 2268 + put_dev_pagemap(pgmap); 2269 + return ret; 2267 2270 } 2268 2271 2269 2272 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
+4 -28
mm/huge_memory.c
··· 1440 1440 goto out; 1441 1441 } 1442 1442 1443 - /* 1444 - * Since we took the NUMA fault, we must have observed the !accessible 1445 - * bit. Make sure all other CPUs agree with that, to avoid them 1446 - * modifying the page we're about to migrate. 1447 - * 1448 - * Must be done under PTL such that we'll observe the relevant 1449 - * inc_tlb_flush_pending(). 1450 - * 1451 - * We are not sure a pending tlb flush here is for a huge page 1452 - * mapping or not. Hence use the tlb range variant 1453 - */ 1454 - if (mm_tlb_flush_pending(vma->vm_mm)) { 1455 - flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE); 1456 - /* 1457 - * change_huge_pmd() released the pmd lock before 1458 - * invalidating the secondary MMUs sharing the primary 1459 - * MMU pagetables (with ->invalidate_range()). The 1460 - * mmu_notifier_invalidate_range_end() (which 1461 - * internally calls ->invalidate_range()) in 1462 - * change_pmd_range() will run after us, so we can't 1463 - * rely on it here and we need an explicit invalidate. 1464 - */ 1465 - mmu_notifier_invalidate_range(vma->vm_mm, haddr, 1466 - haddr + HPAGE_PMD_SIZE); 1467 - } 1468 - 1469 1443 pmd = pmd_modify(oldpmd, vma->vm_page_prot); 1470 1444 page = vm_normal_page_pmd(vma, haddr, pmd); 1471 1445 if (!page) ··· 2428 2454 2429 2455 for (i = nr - 1; i >= 1; i--) { 2430 2456 __split_huge_page_tail(head, i, lruvec, list); 2431 - /* Some pages can be beyond i_size: drop them from page cache */ 2457 + /* Some pages can be beyond EOF: drop them from page cache */ 2432 2458 if (head[i].index >= end) { 2433 2459 ClearPageDirty(head + i); 2434 2460 __delete_from_page_cache(head + i, NULL); 2435 - if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head)) 2461 + if (shmem_mapping(head->mapping)) 2436 2462 shmem_uncharge(head->mapping->host, 1); 2437 2463 put_page(head + i); 2438 2464 } else if (!PageAnon(page)) { ··· 2660 2686 * head page lock is good enough to serialize the trimming. 
2661 2687 */ 2662 2688 end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); 2689 + if (shmem_mapping(mapping)) 2690 + end = shmem_fallocend(mapping->host, end); 2663 2691 } 2664 2692 2665 2693 /*
+133 -38
mm/hugetlb.c
··· 1072 1072 int nid = page_to_nid(page); 1073 1073 1074 1074 lockdep_assert_held(&hugetlb_lock); 1075 + VM_BUG_ON_PAGE(page_count(page), page); 1076 + 1075 1077 list_move(&page->lru, &h->hugepage_freelists[nid]); 1076 1078 h->free_huge_pages++; 1077 1079 h->free_huge_pages_node[nid]++; ··· 1145 1143 unsigned long address, int avoid_reserve, 1146 1144 long chg) 1147 1145 { 1148 - struct page *page; 1146 + struct page *page = NULL; 1149 1147 struct mempolicy *mpol; 1150 1148 gfp_t gfp_mask; 1151 1149 nodemask_t *nodemask; ··· 1166 1164 1167 1165 gfp_mask = htlb_alloc_mask(h); 1168 1166 nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask); 1169 - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); 1167 + 1168 + if (mpol_is_preferred_many(mpol)) { 1169 + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); 1170 + 1171 + /* Fallback to all nodes if page==NULL */ 1172 + nodemask = NULL; 1173 + } 1174 + 1175 + if (!page) 1176 + page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); 1177 + 1170 1178 if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { 1171 1179 SetHPageRestoreReserve(page); 1172 1180 h->resv_huge_pages--; ··· 1380 1368 h->surplus_huge_pages_node[nid]--; 1381 1369 } 1382 1370 1371 + /* 1372 + * Very subtle 1373 + * 1374 + * For non-gigantic pages set the destructor to the normal compound 1375 + * page dtor. This is needed in case someone takes an additional 1376 + * temporary ref to the page, and freeing is delayed until they drop 1377 + * their reference. 1378 + * 1379 + * For gigantic pages set the destructor to the null dtor. This 1380 + * destructor will never be called. Before freeing the gigantic 1381 + * page destroy_compound_gigantic_page will turn the compound page 1382 + * into a simple group of pages. After this the destructor does not 1383 + * apply. 1384 + * 1385 + * This handles the case where more than one ref is held when and 1386 + * after update_and_free_page is called. 
1387 + */ 1383 1388 set_page_refcounted(page); 1384 - set_compound_page_dtor(page, NULL_COMPOUND_DTOR); 1389 + if (hstate_is_gigantic(h)) 1390 + set_compound_page_dtor(page, NULL_COMPOUND_DTOR); 1391 + else 1392 + set_compound_page_dtor(page, COMPOUND_PAGE_DTOR); 1385 1393 1386 1394 h->nr_huge_pages--; 1387 1395 h->nr_huge_pages_node[nid]--; ··· 1431 1399 SetHPageVmemmapOptimized(page); 1432 1400 1433 1401 /* 1434 - * This page is now managed by the hugetlb allocator and has 1435 - * no users -- drop the last reference. 1402 + * This page is about to be managed by the hugetlb allocator and 1403 + * should have no users. Drop our reference, and check for others 1404 + * just in case. 1436 1405 */ 1437 1406 zeroed = put_page_testzero(page); 1438 - VM_BUG_ON_PAGE(!zeroed, page); 1407 + if (!zeroed) 1408 + /* 1409 + * It is VERY unlikely someone else has taken a ref on 1410 + * the page. In this case, we simply return as the 1411 + * hugetlb destructor (free_huge_page) will be called 1412 + * when this other ref is dropped. 1413 + */ 1414 + return; 1415 + 1439 1416 arch_clear_hugepage_flags(page); 1440 1417 enqueue_huge_page(h, page); 1441 1418 } ··· 1698 1657 * cache adding could take a ref on a 'to be' tail page. 1699 1658 * We need to respect any increased ref count, and only set 1700 1659 * the ref count to zero if count is currently 1. If count 1701 - * is not 1, we call synchronize_rcu in the hope that a rcu 1702 - * grace period will cause ref count to drop and then retry. 1703 - * If count is still inflated on retry we return an error and 1704 - * must discard the pages. 1660 + * is not 1, we return an error. An error return indicates 1661 + * the set of pages can not be converted to a gigantic page. 1662 + * The caller who allocated the pages should then discard the 1663 + * pages using the appropriate free interface. 
1705 1664 */ 1706 1665 if (!page_ref_freeze(p, 1)) { 1707 - pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n"); 1708 - synchronize_rcu(); 1709 - if (!page_ref_freeze(p, 1)) 1710 - goto out_error; 1666 + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); 1667 + goto out_error; 1711 1668 } 1712 1669 set_page_count(p, 0); 1713 1670 set_compound_head(p, page); ··· 1869 1830 retry = true; 1870 1831 goto retry; 1871 1832 } 1872 - pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); 1873 1833 return NULL; 1874 1834 } 1875 1835 } ··· 2058 2020 * Allocates a fresh surplus page from the page allocator. 2059 2021 */ 2060 2022 static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, 2061 - int nid, nodemask_t *nmask) 2023 + int nid, nodemask_t *nmask, bool zero_ref) 2062 2024 { 2063 2025 struct page *page = NULL; 2026 + bool retry = false; 2064 2027 2065 2028 if (hstate_is_gigantic(h)) 2066 2029 return NULL; ··· 2071 2032 goto out_unlock; 2072 2033 spin_unlock_irq(&hugetlb_lock); 2073 2034 2035 + retry: 2074 2036 page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL); 2075 2037 if (!page) 2076 2038 return NULL; ··· 2089 2049 spin_unlock_irq(&hugetlb_lock); 2090 2050 put_page(page); 2091 2051 return NULL; 2092 - } else { 2093 - h->surplus_huge_pages++; 2094 - h->surplus_huge_pages_node[page_to_nid(page)]++; 2095 2052 } 2053 + 2054 + if (zero_ref) { 2055 + /* 2056 + * Caller requires a page with zero ref count. 2057 + * We will drop ref count here. If someone else is holding 2058 + * a ref, the page will be freed when they drop it. Abuse 2059 + * temporary page flag to accomplish this. 2060 + */ 2061 + SetHPageTemporary(page); 2062 + if (!put_page_testzero(page)) { 2063 + /* 2064 + * Unexpected inflated ref count on freshly allocated 2065 + * huge. Retry once. 
2066 + */ 2067 + pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n"); 2068 + spin_unlock_irq(&hugetlb_lock); 2069 + if (retry) 2070 + return NULL; 2071 + 2072 + retry = true; 2073 + goto retry; 2074 + } 2075 + ClearHPageTemporary(page); 2076 + } 2077 + 2078 + h->surplus_huge_pages++; 2079 + h->surplus_huge_pages_node[page_to_nid(page)]++; 2096 2080 2097 2081 out_unlock: 2098 2082 spin_unlock_irq(&hugetlb_lock); ··· 2152 2088 struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h, 2153 2089 struct vm_area_struct *vma, unsigned long addr) 2154 2090 { 2155 - struct page *page; 2091 + struct page *page = NULL; 2156 2092 struct mempolicy *mpol; 2157 2093 gfp_t gfp_mask = htlb_alloc_mask(h); 2158 2094 int nid; 2159 2095 nodemask_t *nodemask; 2160 2096 2161 2097 nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask); 2162 - page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask); 2163 - mpol_cond_put(mpol); 2098 + if (mpol_is_preferred_many(mpol)) { 2099 + gfp_t gfp = gfp_mask | __GFP_NOWARN; 2164 2100 2101 + gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); 2102 + page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false); 2103 + 2104 + /* Fallback to all nodes if page==NULL */ 2105 + nodemask = NULL; 2106 + } 2107 + 2108 + if (!page) 2109 + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false); 2110 + mpol_cond_put(mpol); 2165 2111 return page; 2166 2112 } 2167 2113 ··· 2241 2167 spin_unlock_irq(&hugetlb_lock); 2242 2168 for (i = 0; i < needed; i++) { 2243 2169 page = alloc_surplus_huge_page(h, htlb_alloc_mask(h), 2244 - NUMA_NO_NODE, NULL); 2170 + NUMA_NO_NODE, NULL, true); 2245 2171 if (!page) { 2246 2172 alloc_ok = false; 2247 2173 break; ··· 2282 2208 2283 2209 /* Free the needed pages to the hugetlb pool */ 2284 2210 list_for_each_entry_safe(page, tmp, &surplus_list, lru) { 2285 - int zeroed; 2286 - 2287 2211 if ((--needed) < 0) 2288 2212 break; 2289 - /* 2290 - * This page is now managed by the hugetlb allocator and 
has 2291 - * no users -- drop the buddy allocator's reference. 2292 - */ 2293 - zeroed = put_page_testzero(page); 2294 - VM_BUG_ON_PAGE(!zeroed, page); 2213 + /* Add the page to the hugetlb allocator */ 2295 2214 enqueue_huge_page(h, page); 2296 2215 } 2297 2216 free: 2298 2217 spin_unlock_irq(&hugetlb_lock); 2299 2218 2300 - /* Free unnecessary surplus pages to the buddy allocator */ 2219 + /* 2220 + * Free unnecessary surplus pages to the buddy allocator. 2221 + * Pages have no ref count, call free_huge_page directly. 2222 + */ 2301 2223 list_for_each_entry_safe(page, tmp, &surplus_list, lru) 2302 - put_page(page); 2224 + free_huge_page(page); 2303 2225 spin_lock_irq(&hugetlb_lock); 2304 2226 2305 2227 return ret; ··· 2604 2534 { 2605 2535 gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; 2606 2536 int nid = page_to_nid(old_page); 2537 + bool alloc_retry = false; 2607 2538 struct page *new_page; 2608 2539 int ret = 0; 2609 2540 ··· 2615 2544 * the pool. This simplifies and let us do most of the processing 2616 2545 * under the lock. 2617 2546 */ 2547 + alloc_retry: 2618 2548 new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL); 2619 2549 if (!new_page) 2620 2550 return -ENOMEM; 2551 + /* 2552 + * If all goes well, this page will be directly added to the free 2553 + * list in the pool. For this the ref count needs to be zero. 2554 + * Attempt to drop now, and retry once if needed. It is VERY 2555 + * unlikely there is another ref on the page. 2556 + * 2557 + * If someone else has a reference to the page, it will be freed 2558 + * when they drop their ref. Abuse temporary page flag to accomplish 2559 + * this. Retry once if there is an inflated ref count. 
2560 + */ 2561 + SetHPageTemporary(new_page); 2562 + if (!put_page_testzero(new_page)) { 2563 + if (alloc_retry) 2564 + return -EBUSY; 2565 + 2566 + alloc_retry = true; 2567 + goto alloc_retry; 2568 + } 2569 + ClearHPageTemporary(new_page); 2570 + 2621 2571 __prep_new_huge_page(h, new_page); 2622 2572 2623 2573 retry: ··· 2678 2586 remove_hugetlb_page(h, old_page, false); 2679 2587 2680 2588 /* 2681 - * Reference count trick is needed because allocator gives us 2682 - * referenced page but the pool requires pages with 0 refcount. 2589 + * Ref count on new page is already zero as it was dropped 2590 + * earlier. It can be directly added to the pool free list. 2683 2591 */ 2684 2592 __prep_account_new_huge_page(h, nid); 2685 - page_ref_dec(new_page); 2686 2593 enqueue_huge_page(h, new_page); 2687 2594 2688 2595 /* ··· 2695 2604 2696 2605 free_new: 2697 2606 spin_unlock_irq(&hugetlb_lock); 2607 + /* Page has a zero ref count, but needs a ref to be freed */ 2608 + set_page_refcounted(new_page); 2698 2609 update_and_free_page(h, new_page, false); 2699 2610 2700 2611 return ret; ··· 2921 2828 prep_new_huge_page(h, page, page_to_nid(page)); 2922 2829 put_page(page); /* add to the hugepage allocator */ 2923 2830 } else { 2831 + /* VERY unlikely inflated ref count on a tail page */ 2924 2832 free_gigantic_page(page, huge_page_order(h)); 2925 - pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); 2926 2833 } 2927 2834 2928 2835 /* ··· 4126 4033 * after this open call completes. It is therefore safe to take a 4127 4034 * new reference here without additional locking. 4128 4035 */ 4129 - if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) 4036 + if (resv && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) { 4037 + resv_map_dup_hugetlb_cgroup_uncharge_info(resv); 4130 4038 kref_get(&resv->refs); 4039 + } 4131 4040 } 4132 4041 4133 4042 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
+1 -1
mm/hwpoison-inject.c
··· 30 30 if (!hwpoison_filter_enable) 31 31 goto inject; 32 32 33 - shake_page(hpage, 0); 33 + shake_page(hpage); 34 34 /* 35 35 * This implies unable to support non-LRU pages. 36 36 */
+9
mm/internal.h
··· 211 211 extern void zone_pcp_disable(struct zone *zone); 212 212 extern void zone_pcp_enable(struct zone *zone); 213 213 214 + extern void *memmap_alloc(phys_addr_t size, phys_addr_t align, 215 + phys_addr_t min_addr, 216 + int nid, bool exact_nid); 217 + 214 218 #if defined CONFIG_COMPACTION || defined CONFIG_CMA 215 219 216 220 /* ··· 543 539 544 540 #ifdef CONFIG_NUMA 545 541 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int); 542 + extern int find_next_best_node(int node, nodemask_t *used_node_mask); 546 543 #else 547 544 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask, 548 545 unsigned int order) 549 546 { 550 547 return NODE_RECLAIM_NOSCAN; 548 + } 549 + static inline int find_next_best_node(int node, nodemask_t *used_node_mask) 550 + { 551 + return NUMA_NO_NODE; 551 552 } 552 553 #endif 553 554
-43
mm/kasan/hw_tags.c
··· 37 37 KASAN_ARG_STACKTRACE_ON, 38 38 }; 39 39 40 - enum kasan_arg_fault { 41 - KASAN_ARG_FAULT_DEFAULT, 42 - KASAN_ARG_FAULT_REPORT, 43 - KASAN_ARG_FAULT_PANIC, 44 - }; 45 - 46 40 static enum kasan_arg kasan_arg __ro_after_init; 47 41 static enum kasan_arg_mode kasan_arg_mode __ro_after_init; 48 42 static enum kasan_arg_stacktrace kasan_arg_stacktrace __ro_after_init; 49 - static enum kasan_arg_fault kasan_arg_fault __ro_after_init; 50 43 51 44 /* Whether KASAN is enabled at all. */ 52 45 DEFINE_STATIC_KEY_FALSE(kasan_flag_enabled); ··· 51 58 52 59 /* Whether to collect alloc/free stack traces. */ 53 60 DEFINE_STATIC_KEY_FALSE(kasan_flag_stacktrace); 54 - 55 - /* Whether to panic or print a report and disable tag checking on fault. */ 56 - bool kasan_flag_panic __ro_after_init; 57 61 58 62 /* kasan=off/on */ 59 63 static int __init early_kasan_flag(char *arg) ··· 102 112 return 0; 103 113 } 104 114 early_param("kasan.stacktrace", early_kasan_flag_stacktrace); 105 - 106 - /* kasan.fault=report/panic */ 107 - static int __init early_kasan_fault(char *arg) 108 - { 109 - if (!arg) 110 - return -EINVAL; 111 - 112 - if (!strcmp(arg, "report")) 113 - kasan_arg_fault = KASAN_ARG_FAULT_REPORT; 114 - else if (!strcmp(arg, "panic")) 115 - kasan_arg_fault = KASAN_ARG_FAULT_PANIC; 116 - else 117 - return -EINVAL; 118 - 119 - return 0; 120 - } 121 - early_param("kasan.fault", early_kasan_fault); 122 115 123 116 /* kasan_init_hw_tags_cpu() is called for each CPU. */ 124 117 void kasan_init_hw_tags_cpu(void) ··· 165 192 break; 166 193 case KASAN_ARG_STACKTRACE_ON: 167 194 static_branch_enable(&kasan_flag_stacktrace); 168 - break; 169 - } 170 - 171 - switch (kasan_arg_fault) { 172 - case KASAN_ARG_FAULT_DEFAULT: 173 - /* 174 - * Default to no panic on report. 175 - * Do nothing, kasan_flag_panic keeps its default value. 176 - */ 177 - break; 178 - case KASAN_ARG_FAULT_REPORT: 179 - /* Do nothing, kasan_flag_panic keeps its default value. 
*/ 180 - break; 181 - case KASAN_ARG_FAULT_PANIC: 182 - /* Enable panic on report. */ 183 - kasan_flag_panic = true; 184 195 break; 185 196 } 186 197
-1
mm/kasan/kasan.h
··· 38 38 39 39 #endif 40 40 41 - extern bool kasan_flag_panic __ro_after_init; 42 41 extern bool kasan_flag_async __ro_after_init; 43 42 44 43 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
+26 -3
mm/kasan/report.c
··· 39 39 #define KASAN_BIT_REPORTED 0 40 40 #define KASAN_BIT_MULTI_SHOT 1 41 41 42 + enum kasan_arg_fault { 43 + KASAN_ARG_FAULT_DEFAULT, 44 + KASAN_ARG_FAULT_REPORT, 45 + KASAN_ARG_FAULT_PANIC, 46 + }; 47 + 48 + static enum kasan_arg_fault kasan_arg_fault __ro_after_init = KASAN_ARG_FAULT_DEFAULT; 49 + 50 + /* kasan.fault=report/panic */ 51 + static int __init early_kasan_fault(char *arg) 52 + { 53 + if (!arg) 54 + return -EINVAL; 55 + 56 + if (!strcmp(arg, "report")) 57 + kasan_arg_fault = KASAN_ARG_FAULT_REPORT; 58 + else if (!strcmp(arg, "panic")) 59 + kasan_arg_fault = KASAN_ARG_FAULT_PANIC; 60 + else 61 + return -EINVAL; 62 + 63 + return 0; 64 + } 65 + early_param("kasan.fault", early_kasan_fault); 66 + 42 67 bool kasan_save_enable_multi_shot(void) 43 68 { 44 69 return test_and_set_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags); ··· 127 102 panic_on_warn = 0; 128 103 panic("panic_on_warn set ...\n"); 129 104 } 130 - #ifdef CONFIG_KASAN_HW_TAGS 131 - if (kasan_flag_panic) 105 + if (kasan_arg_fault == KASAN_ARG_FAULT_PANIC) 132 106 panic("kasan.fault=panic set ...\n"); 133 - #endif 134 107 kasan_enable_current(); 135 108 } 136 109
+1 -1
mm/khugepaged.c
··· 1721 1721 xas_unlock_irq(&xas); 1722 1722 /* swap in or instantiate fallocated page */ 1723 1723 if (shmem_getpage(mapping->host, index, &page, 1724 - SGP_NOHUGE)) { 1724 + SGP_NOALLOC)) { 1725 1725 result = SCAN_FAIL; 1726 1726 goto xa_unlocked; 1727 1727 }
+4 -4
mm/ksm.c
··· 259 259 static unsigned long ksm_stable_node_dups; 260 260 261 261 /* Delay in pruning stale stable_node_dups in the stable_node_chains */ 262 - static int ksm_stable_node_chains_prune_millisecs = 2000; 262 + static unsigned int ksm_stable_node_chains_prune_millisecs = 2000; 263 263 264 264 /* Maximum number of page slots sharing a stable node */ 265 265 static int ksm_max_page_sharing = 256; ··· 3105 3105 struct kobj_attribute *attr, 3106 3106 const char *buf, size_t count) 3107 3107 { 3108 - unsigned long msecs; 3108 + unsigned int msecs; 3109 3109 int err; 3110 3110 3111 - err = kstrtoul(buf, 10, &msecs); 3112 - if (err || msecs > UINT_MAX) 3111 + err = kstrtouint(buf, 10, &msecs); 3112 + if (err) 3113 3113 return -EINVAL; 3114 3114 3115 3115 ksm_stable_node_chains_prune_millisecs = msecs;
+1
mm/madvise.c
··· 1048 1048 switch (behavior) { 1049 1049 case MADV_COLD: 1050 1050 case MADV_PAGEOUT: 1051 + case MADV_WILLNEED: 1051 1052 return true; 1052 1053 default: 1053 1054 return false;
+5 -17
mm/memblock.c
··· 315 315 * Return: 316 316 * Found address on success, 0 on failure. 317 317 */ 318 - phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start, 318 + static phys_addr_t __init_memblock memblock_find_in_range(phys_addr_t start, 319 319 phys_addr_t end, phys_addr_t size, 320 320 phys_addr_t align) 321 321 { ··· 1496 1496 phys_addr_t min_addr, phys_addr_t max_addr, 1497 1497 int nid) 1498 1498 { 1499 - void *ptr; 1500 - 1501 1499 memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", 1502 1500 __func__, (u64)size, (u64)align, nid, &min_addr, 1503 1501 &max_addr, (void *)_RET_IP_); 1504 1502 1505 - ptr = memblock_alloc_internal(size, align, 1506 - min_addr, max_addr, nid, true); 1507 - if (ptr && size > 0) 1508 - page_init_poison(ptr, size); 1509 - 1510 - return ptr; 1503 + return memblock_alloc_internal(size, align, min_addr, max_addr, nid, 1504 + true); 1511 1505 } 1512 1506 1513 1507 /** ··· 1528 1534 phys_addr_t min_addr, phys_addr_t max_addr, 1529 1535 int nid) 1530 1536 { 1531 - void *ptr; 1532 - 1533 1537 memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", 1534 1538 __func__, (u64)size, (u64)align, nid, &min_addr, 1535 1539 &max_addr, (void *)_RET_IP_); 1536 1540 1537 - ptr = memblock_alloc_internal(size, align, 1538 - min_addr, max_addr, nid, false); 1539 - if (ptr && size > 0) 1540 - page_init_poison(ptr, size); 1541 - 1542 - return ptr; 1541 + return memblock_alloc_internal(size, align, min_addr, max_addr, nid, 1542 + false); 1543 1543 } 1544 1544 1545 1545 /**
+101 -127
mm/memcontrol.c
··· 103 103 return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap; 104 104 } 105 105 106 + /* memcg and lruvec stats flushing */ 107 + static void flush_memcg_stats_dwork(struct work_struct *w); 108 + static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); 109 + static void flush_memcg_stats_work(struct work_struct *w); 110 + static DECLARE_WORK(stats_flush_work, flush_memcg_stats_work); 111 + static DEFINE_PER_CPU(unsigned int, stats_flush_threshold); 112 + static DEFINE_SPINLOCK(stats_flush_lock); 113 + 106 114 #define THRESHOLDS_EVENTS_TARGET 128 107 115 #define SOFTLIMIT_EVENTS_TARGET 1024 108 116 ··· 256 248 return &memcg->vmpressure; 257 249 } 258 250 259 - struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr) 251 + struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr) 260 252 { 261 - return &container_of(vmpr, struct mem_cgroup, vmpressure)->css; 253 + return container_of(vmpr, struct mem_cgroup, vmpressure); 262 254 } 263 255 264 256 #ifdef CONFIG_MEMCG_KMEM ··· 654 646 } 655 647 656 648 /* idx can be of type enum memcg_stat_item or node_stat_item. */ 657 - static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) 658 - { 659 - long x = READ_ONCE(memcg->vmstats.state[idx]); 660 - #ifdef CONFIG_SMP 661 - if (x < 0) 662 - x = 0; 663 - #endif 664 - return x; 665 - } 666 - 667 - /* idx can be of type enum memcg_stat_item or node_stat_item. 
*/ 668 649 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) 669 650 { 670 651 long x = 0; ··· 668 671 return x; 669 672 } 670 673 671 - static struct mem_cgroup_per_node * 672 - parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) 673 - { 674 - struct mem_cgroup *parent; 675 - 676 - parent = parent_mem_cgroup(pn->memcg); 677 - if (!parent) 678 - return NULL; 679 - return parent->nodeinfo[nid]; 680 - } 681 - 682 674 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, 683 675 int val) 684 676 { 685 677 struct mem_cgroup_per_node *pn; 686 678 struct mem_cgroup *memcg; 687 - long x, threshold = MEMCG_CHARGE_BATCH; 688 679 689 680 pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 690 681 memcg = pn->memcg; ··· 681 696 __mod_memcg_state(memcg, idx, val); 682 697 683 698 /* Update lruvec */ 684 - __this_cpu_add(pn->lruvec_stat_local->count[idx], val); 685 - 686 - if (vmstat_item_in_bytes(idx)) 687 - threshold <<= PAGE_SHIFT; 688 - 689 - x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); 690 - if (unlikely(abs(x) > threshold)) { 691 - pg_data_t *pgdat = lruvec_pgdat(lruvec); 692 - struct mem_cgroup_per_node *pi; 693 - 694 - for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) 695 - atomic_long_add(x, &pi->lruvec_stat[idx]); 696 - x = 0; 697 - } 698 - __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); 699 + __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); 700 + if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH)) 701 + queue_work(system_unbound_wq, &stats_flush_work); 699 702 } 700 703 701 704 /** ··· 878 905 879 906 static __always_inline struct mem_cgroup *active_memcg(void) 880 907 { 881 - if (in_interrupt()) 908 + if (!in_task()) 882 909 return this_cpu_read(int_active_memcg); 883 910 else 884 911 return current->active_memcg; ··· 2178 2205 unsigned long flags; 2179 2206 2180 2207 /* 2181 - * The only protection from memory hotplug vs. 
drain_stock races is 2182 - * that we always operate on local CPU stock here with IRQ disabled 2208 + * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs. 2209 + * drain_stock races is that we always operate on local CPU stock 2210 + * here with IRQ disabled 2183 2211 */ 2184 2212 local_irq_save(flags); 2185 2213 ··· 2247 2273 if (memcg && stock->nr_pages && 2248 2274 mem_cgroup_is_descendant(memcg, root_memcg)) 2249 2275 flush = true; 2250 - if (obj_stock_flush_required(stock, root_memcg)) 2276 + else if (obj_stock_flush_required(stock, root_memcg)) 2251 2277 flush = true; 2252 2278 rcu_read_unlock(); 2253 2279 ··· 2263 2289 mutex_unlock(&percpu_charge_mutex); 2264 2290 } 2265 2291 2266 - static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) 2267 - { 2268 - int nid; 2269 - 2270 - for_each_node(nid) { 2271 - struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; 2272 - unsigned long stat[NR_VM_NODE_STAT_ITEMS]; 2273 - struct batched_lruvec_stat *lstatc; 2274 - int i; 2275 - 2276 - lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); 2277 - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { 2278 - stat[i] = lstatc->count[i]; 2279 - lstatc->count[i] = 0; 2280 - } 2281 - 2282 - do { 2283 - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 2284 - atomic_long_add(stat[i], &pn->lruvec_stat[i]); 2285 - } while ((pn = parent_nodeinfo(pn, nid))); 2286 - } 2287 - } 2288 - 2289 2292 static int memcg_hotplug_cpu_dead(unsigned int cpu) 2290 2293 { 2291 2294 struct memcg_stock_pcp *stock; 2292 - struct mem_cgroup *memcg; 2293 2295 2294 2296 stock = &per_cpu(memcg_stock, cpu); 2295 2297 drain_stock(stock); 2296 - 2297 - for_each_mem_cgroup(memcg) 2298 - memcg_flush_lruvec_page_state(memcg, cpu); 2299 2298 2300 2299 return 0; 2301 2300 } ··· 4063 4116 { 4064 4117 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4065 4118 4066 - if (val > 100) 4119 + if (val > 200) 4067 4120 return -EINVAL; 4068 4121 4069 4122 if (!mem_cgroup_is_root(memcg)) ··· 4615 4668 
atomic_read(&frn->done.cnt) == 1) { 4616 4669 frn->at = 0; 4617 4670 trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id); 4618 - cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, 0, 4671 + cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, 4619 4672 WB_REASON_FOREIGN_FLUSH, 4620 4673 &frn->done); 4621 4674 } ··· 4839 4892 4840 4893 vfs_poll(efile.file, &event->pt); 4841 4894 4842 - spin_lock(&memcg->event_list_lock); 4895 + spin_lock_irq(&memcg->event_list_lock); 4843 4896 list_add(&event->list, &memcg->event_list); 4844 - spin_unlock(&memcg->event_list_lock); 4897 + spin_unlock_irq(&memcg->event_list_lock); 4845 4898 4846 4899 fdput(cfile); 4847 4900 fdput(efile); ··· 5076 5129 if (!pn) 5077 5130 return 1; 5078 5131 5079 - pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, 5080 - GFP_KERNEL_ACCOUNT); 5081 - if (!pn->lruvec_stat_local) { 5082 - kfree(pn); 5083 - return 1; 5084 - } 5085 - 5086 - pn->lruvec_stat_cpu = alloc_percpu_gfp(struct batched_lruvec_stat, 5087 - GFP_KERNEL_ACCOUNT); 5088 - if (!pn->lruvec_stat_cpu) { 5089 - free_percpu(pn->lruvec_stat_local); 5132 + pn->lruvec_stats_percpu = alloc_percpu_gfp(struct lruvec_stats_percpu, 5133 + GFP_KERNEL_ACCOUNT); 5134 + if (!pn->lruvec_stats_percpu) { 5090 5135 kfree(pn); 5091 5136 return 1; 5092 5137 } ··· 5099 5160 if (!pn) 5100 5161 return; 5101 5162 5102 - free_percpu(pn->lruvec_stat_cpu); 5103 - free_percpu(pn->lruvec_stat_local); 5163 + free_percpu(pn->lruvec_stats_percpu); 5104 5164 kfree(pn); 5105 5165 } 5106 5166 ··· 5115 5177 5116 5178 static void mem_cgroup_free(struct mem_cgroup *memcg) 5117 5179 { 5118 - int cpu; 5119 - 5120 5180 memcg_wb_domain_exit(memcg); 5121 - /* 5122 - * Flush percpu lruvec stats to guarantee the value 5123 - * correctness on parent's and all ancestor levels. 
5124 - */ 5125 - for_each_online_cpu(cpu) 5126 - memcg_flush_lruvec_page_state(memcg, cpu); 5127 5181 __mem_cgroup_free(memcg); 5128 5182 } 5129 5183 ··· 5251 5321 /* Online state pins memcg ID, memcg ID pins CSS */ 5252 5322 refcount_set(&memcg->id.ref, 1); 5253 5323 css_get(css); 5324 + 5325 + if (unlikely(mem_cgroup_is_root(memcg))) 5326 + queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 5327 + 2UL*HZ); 5254 5328 return 0; 5255 5329 } 5256 5330 ··· 5268 5334 * Notify userspace about cgroup removing only after rmdir of cgroup 5269 5335 * directory to avoid race between userspace and kernelspace. 5270 5336 */ 5271 - spin_lock(&memcg->event_list_lock); 5337 + spin_lock_irq(&memcg->event_list_lock); 5272 5338 list_for_each_entry_safe(event, tmp, &memcg->event_list, list) { 5273 5339 list_del_init(&event->list); 5274 5340 schedule_work(&event->remove); 5275 5341 } 5276 - spin_unlock(&memcg->event_list_lock); 5342 + spin_unlock_irq(&memcg->event_list_lock); 5277 5343 5278 5344 page_counter_set_min(&memcg->memory, 0); 5279 5345 page_counter_set_low(&memcg->memory, 0); ··· 5346 5412 memcg_wb_domain_size_changed(memcg); 5347 5413 } 5348 5414 5415 + void mem_cgroup_flush_stats(void) 5416 + { 5417 + if (!spin_trylock(&stats_flush_lock)) 5418 + return; 5419 + 5420 + cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup); 5421 + spin_unlock(&stats_flush_lock); 5422 + } 5423 + 5424 + static void flush_memcg_stats_dwork(struct work_struct *w) 5425 + { 5426 + mem_cgroup_flush_stats(); 5427 + queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); 5428 + } 5429 + 5430 + static void flush_memcg_stats_work(struct work_struct *w) 5431 + { 5432 + mem_cgroup_flush_stats(); 5433 + } 5434 + 5349 5435 static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) 5350 5436 { 5351 5437 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 5352 5438 struct mem_cgroup *parent = parent_mem_cgroup(memcg); 5353 5439 struct memcg_vmstats_percpu *statc; 
5354 5440 long delta, v; 5355 - int i; 5441 + int i, nid; 5356 5442 5357 5443 statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); 5358 5444 ··· 5419 5465 memcg->vmstats.events[i] += delta; 5420 5466 if (parent) 5421 5467 parent->vmstats.events_pending[i] += delta; 5468 + } 5469 + 5470 + for_each_node_state(nid, N_MEMORY) { 5471 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; 5472 + struct mem_cgroup_per_node *ppn = NULL; 5473 + struct lruvec_stats_percpu *lstatc; 5474 + 5475 + if (parent) 5476 + ppn = parent->nodeinfo[nid]; 5477 + 5478 + lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu); 5479 + 5480 + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { 5481 + delta = pn->lruvec_stats.state_pending[i]; 5482 + if (delta) 5483 + pn->lruvec_stats.state_pending[i] = 0; 5484 + 5485 + v = READ_ONCE(lstatc->state[i]); 5486 + if (v != lstatc->state_prev[i]) { 5487 + delta += v - lstatc->state_prev[i]; 5488 + lstatc->state_prev[i] = v; 5489 + } 5490 + 5491 + if (!delta) 5492 + continue; 5493 + 5494 + pn->lruvec_stats.state[i] += delta; 5495 + if (ppn) 5496 + ppn->lruvec_stats.state_pending[i] += delta; 5497 + } 5422 5498 } 5423 5499 } 5424 5500 ··· 6383 6399 int i; 6384 6400 struct mem_cgroup *memcg = mem_cgroup_from_seq(m); 6385 6401 6402 + cgroup_rstat_flush(memcg->css.cgroup); 6403 + 6386 6404 for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { 6387 6405 int nid; 6388 6406 ··· 6690 6704 atomic_long_read(&parent->memory.children_low_usage))); 6691 6705 } 6692 6706 6693 - static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg, 6694 - gfp_t gfp) 6707 + static int charge_memcg(struct page *page, struct mem_cgroup *memcg, gfp_t gfp) 6695 6708 { 6696 6709 unsigned int nr_pages = thp_nr_pages(page); 6697 6710 int ret; ··· 6711 6726 } 6712 6727 6713 6728 /** 6714 - * mem_cgroup_charge - charge a newly allocated page to a cgroup 6729 + * __mem_cgroup_charge - charge a newly allocated page to a cgroup 6715 6730 * @page: page to charge 6716 6731 * @mm: mm context 
of the victim 6717 6732 * @gfp_mask: reclaim mode ··· 6724 6739 * 6725 6740 * Returns 0 on success. Otherwise, an error code is returned. 6726 6741 */ 6727 - int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) 6742 + int __mem_cgroup_charge(struct page *page, struct mm_struct *mm, 6743 + gfp_t gfp_mask) 6728 6744 { 6729 6745 struct mem_cgroup *memcg; 6730 6746 int ret; 6731 6747 6732 - if (mem_cgroup_disabled()) 6733 - return 0; 6734 - 6735 6748 memcg = get_mem_cgroup_from_mm(mm); 6736 - ret = __mem_cgroup_charge(page, memcg, gfp_mask); 6749 + ret = charge_memcg(page, memcg, gfp_mask); 6737 6750 css_put(&memcg->css); 6738 6751 6739 6752 return ret; ··· 6766 6783 memcg = get_mem_cgroup_from_mm(mm); 6767 6784 rcu_read_unlock(); 6768 6785 6769 - ret = __mem_cgroup_charge(page, memcg, gfp); 6786 + ret = charge_memcg(page, memcg, gfp); 6770 6787 6771 6788 css_put(&memcg->css); 6772 6789 return ret; ··· 6902 6919 } 6903 6920 6904 6921 /** 6905 - * mem_cgroup_uncharge - uncharge a page 6922 + * __mem_cgroup_uncharge - uncharge a page 6906 6923 * @page: page to uncharge 6907 6924 * 6908 - * Uncharge a page previously charged with mem_cgroup_charge(). 6925 + * Uncharge a page previously charged with __mem_cgroup_charge(). 6909 6926 */ 6910 - void mem_cgroup_uncharge(struct page *page) 6927 + void __mem_cgroup_uncharge(struct page *page) 6911 6928 { 6912 6929 struct uncharge_gather ug; 6913 - 6914 - if (mem_cgroup_disabled()) 6915 - return; 6916 6930 6917 6931 /* Don't touch page->lru of any random page, pre-check: */ 6918 6932 if (!page_memcg(page)) ··· 6921 6941 } 6922 6942 6923 6943 /** 6924 - * mem_cgroup_uncharge_list - uncharge a list of page 6944 + * __mem_cgroup_uncharge_list - uncharge a list of page 6925 6945 * @page_list: list of pages to uncharge 6926 6946 * 6927 6947 * Uncharge a list of pages previously charged with 6928 - * mem_cgroup_charge(). 6948 + * __mem_cgroup_charge(). 
6929 6949 */ 6930 - void mem_cgroup_uncharge_list(struct list_head *page_list) 6950 + void __mem_cgroup_uncharge_list(struct list_head *page_list) 6931 6951 { 6932 6952 struct uncharge_gather ug; 6933 6953 struct page *page; 6934 - 6935 - if (mem_cgroup_disabled()) 6936 - return; 6937 6954 6938 6955 uncharge_gather_clear(&ug); 6939 6956 list_for_each_entry(page, page_list, lru) ··· 7221 7244 } 7222 7245 7223 7246 /** 7224 - * mem_cgroup_try_charge_swap - try charging swap space for a page 7247 + * __mem_cgroup_try_charge_swap - try charging swap space for a page 7225 7248 * @page: page being added to swap 7226 7249 * @entry: swap entry to charge 7227 7250 * ··· 7229 7252 * 7230 7253 * Returns 0 on success, -ENOMEM on failure. 7231 7254 */ 7232 - int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) 7255 + int __mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) 7233 7256 { 7234 7257 unsigned int nr_pages = thp_nr_pages(page); 7235 7258 struct page_counter *counter; 7236 7259 struct mem_cgroup *memcg; 7237 7260 unsigned short oldid; 7238 - 7239 - if (mem_cgroup_disabled()) 7240 - return 0; 7241 7261 7242 7262 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 7243 7263 return 0; ··· 7271 7297 } 7272 7298 7273 7299 /** 7274 - * mem_cgroup_uncharge_swap - uncharge swap space 7300 + * __mem_cgroup_uncharge_swap - uncharge swap space 7275 7301 * @entry: swap entry to uncharge 7276 7302 * @nr_pages: the amount of swap space to uncharge 7277 7303 */ 7278 - void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) 7304 + void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) 7279 7305 { 7280 7306 struct mem_cgroup *memcg; 7281 7307 unsigned short id;
+24 -27
mm/memory-failure.c
··· 68 68 69 69 static bool __page_handle_poison(struct page *page) 70 70 { 71 - bool ret; 71 + int ret; 72 72 73 73 zone_pcp_disable(page_zone(page)); 74 74 ret = dissolve_free_huge_page(page); ··· 76 76 ret = take_page_off_buddy(page); 77 77 zone_pcp_enable(page_zone(page)); 78 78 79 - return ret; 79 + return ret > 0; 80 80 } 81 81 82 82 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release) ··· 282 282 283 283 /* 284 284 * Unknown page type encountered. Try to check whether it can turn PageLRU by 285 - * lru_add_drain_all, or a free page by reclaiming slabs when possible. 285 + * lru_add_drain_all. 286 286 */ 287 - void shake_page(struct page *p, int access) 287 + void shake_page(struct page *p) 288 288 { 289 289 if (PageHuge(p)) 290 290 return; ··· 296 296 } 297 297 298 298 /* 299 - * Only call shrink_node_slabs here (which would also shrink 300 - * other caches) if access is not potentially fatal. 299 + * TODO: Could shrink slab caches here if a lightweight range-based 300 + * shrinker will be available. 301 301 */ 302 - if (access) 303 - drop_slab_node(page_to_nid(p)); 304 302 } 305 303 EXPORT_SYMBOL_GPL(shake_page); 306 304 ··· 389 391 /* 390 392 * Kill the processes that have been collected earlier. 391 393 * 392 - * Only do anything when DOIT is set, otherwise just free the list 393 - * (this is used for clean pages which do not need killing) 394 + * Only do anything when FORCEKILL is set, otherwise just free the 395 + * list (this is used for clean pages which do not need killing) 394 396 * Also when FAIL is set do a force kill because something went 395 397 * wrong earlier. 
396 398 */ ··· 630 632 { 631 633 struct hwp_walk *hwp = (struct hwp_walk *)walk->private; 632 634 int ret = 0; 633 - pte_t *ptep; 635 + pte_t *ptep, *mapped_pte; 634 636 spinlock_t *ptl; 635 637 636 638 ptl = pmd_trans_huge_lock(pmdp, walk->vma); ··· 643 645 if (pmd_trans_unstable(pmdp)) 644 646 goto out; 645 647 646 - ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl); 648 + mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, 649 + addr, &ptl); 647 650 for (; addr != end; ptep++, addr += PAGE_SIZE) { 648 651 ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT, 649 652 hwp->pfn, &hwp->tk); 650 653 if (ret == 1) 651 654 break; 652 655 } 653 - pte_unmap_unlock(ptep - 1, ptl); 656 + pte_unmap_unlock(mapped_pte, ptl); 654 657 out: 655 658 cond_resched(); 656 659 return ret; ··· 1203 1204 * page, retry. 1204 1205 */ 1205 1206 if (pass++ < 3) { 1206 - shake_page(p, 1); 1207 + shake_page(p); 1207 1208 goto try_again; 1208 1209 } 1209 1210 ret = -EIO; ··· 1220 1221 */ 1221 1222 if (pass++ < 3) { 1222 1223 put_page(p); 1223 - shake_page(p, 1); 1224 + shake_page(p); 1224 1225 count_increased = false; 1225 1226 goto try_again; 1226 1227 } ··· 1228 1229 ret = -EIO; 1229 1230 } 1230 1231 out: 1232 + if (ret == -EIO) 1233 + dump_page(p, "hwpoison: unhandlable page"); 1234 + 1231 1235 return ret; 1232 1236 } 1233 1237 ··· 1272 1270 * the pages and send SIGBUS to the processes if the data was dirty. 1273 1271 */ 1274 1272 static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, 1275 - int flags, struct page **hpagep) 1273 + int flags, struct page *hpage) 1276 1274 { 1277 1275 enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC; 1278 1276 struct address_space *mapping; 1279 1277 LIST_HEAD(tokill); 1280 1278 bool unmap_success; 1281 1279 int kill = 1, forcekill; 1282 - struct page *hpage = *hpagep; 1283 1280 bool mlocked = PageMlocked(hpage); 1284 1281 1285 1282 /* ··· 1370 1369 * shake_page() again to ensure that it's flushed. 
1371 1370 */ 1372 1371 if (mlocked) 1373 - shake_page(hpage, 0); 1372 + shake_page(hpage); 1374 1373 1375 1374 /* 1376 1375 * Now that the dirty bit has been propagated to the ··· 1503 1502 goto out; 1504 1503 } 1505 1504 1506 - if (!hwpoison_user_mappings(p, pfn, flags, &head)) { 1505 + if (!hwpoison_user_mappings(p, pfn, flags, head)) { 1507 1506 action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED); 1508 1507 res = -EBUSY; 1509 1508 goto out; ··· 1519 1518 struct dev_pagemap *pgmap) 1520 1519 { 1521 1520 struct page *page = pfn_to_page(pfn); 1522 - const bool unmap_success = true; 1523 1521 unsigned long size = 0; 1524 1522 struct to_kill *tk; 1525 1523 LIST_HEAD(tokill); ··· 1590 1590 start = (page->index << PAGE_SHIFT) & ~(size - 1); 1591 1591 unmap_mapping_range(page->mapping, start, size, 0); 1592 1592 } 1593 - kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags); 1593 + kill_procs(&tokill, flags & MF_MUST_KILL, false, pfn, flags); 1594 1594 rc = 0; 1595 1595 unlock: 1596 1596 dax_unlock_page(page, cookie); ··· 1724 1724 * The check (unnecessarily) ignores LRU pages being isolated and 1725 1725 * walked by the page reclaim code, however that's not a big loss. 1726 1726 */ 1727 - shake_page(p, 0); 1727 + shake_page(p); 1728 1728 1729 1729 lock_page(p); 1730 1730 ··· 1783 1783 * Now take care of user space mappings. 1784 1784 * Abort on fail: __delete_from_page_cache() assumes unmapped page. 
1785 1785 */ 1786 - if (!hwpoison_user_mappings(p, pfn, flags, &p)) { 1786 + if (!hwpoison_user_mappings(p, pfn, flags, p)) { 1787 1787 action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED); 1788 1788 res = -EBUSY; 1789 1789 goto unlock_page; ··· 2099 2099 2100 2100 if (isolate_page(hpage, &pagelist)) { 2101 2101 ret = migrate_pages(&pagelist, alloc_migration_target, NULL, 2102 - (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE); 2102 + (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE, NULL); 2103 2103 if (!ret) { 2104 2104 bool release = !huge; 2105 2105 ··· 2208 2208 try_again = false; 2209 2209 goto retry; 2210 2210 } 2211 - } else if (ret == -EIO) { 2212 - pr_info("%s: %#lx: unknown page type: %lx (%pGp)\n", 2213 - __func__, pfn, page->flags, &page->flags); 2214 2211 } 2215 2212 2216 2213 return ret;
+1 -1
mm/memory_hotplug.c
··· 1469 1469 if (nodes_empty(nmask)) 1470 1470 node_set(mtc.nid, nmask); 1471 1471 ret = migrate_pages(&source, alloc_migration_target, NULL, 1472 - (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG); 1472 + (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG, NULL); 1473 1473 if (ret) { 1474 1474 list_for_each_entry(page, &source, lru) { 1475 1475 if (__ratelimit(&migrate_rs)) {
+146 -31
mm/mempolicy.c
··· 31 31 * but useful to set in a VMA when you have a non default 32 32 * process policy. 33 33 * 34 + * preferred many Try a set of nodes first before normal fallback. This is 35 + * similar to preferred without the special case. 36 + * 34 37 * default Allocate on the local node first, or when on a VMA 35 38 * use the process policy. This is what Linux always did 36 39 * in a NUMA aware kernel and still does by, ahem, default. ··· 192 189 nodes_onto(*ret, tmp, *rel); 193 190 } 194 191 195 - static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes) 192 + static int mpol_new_nodemask(struct mempolicy *pol, const nodemask_t *nodes) 196 193 { 197 194 if (nodes_empty(*nodes)) 198 195 return -EINVAL; ··· 207 204 208 205 nodes_clear(pol->nodes); 209 206 node_set(first_node(*nodes), pol->nodes); 210 - return 0; 211 - } 212 - 213 - static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes) 214 - { 215 - if (nodes_empty(*nodes)) 216 - return -EINVAL; 217 - pol->nodes = *nodes; 218 207 return 0; 219 208 } 220 209 ··· 389 394 .rebind = mpol_rebind_default, 390 395 }, 391 396 [MPOL_INTERLEAVE] = { 392 - .create = mpol_new_interleave, 397 + .create = mpol_new_nodemask, 393 398 .rebind = mpol_rebind_nodemask, 394 399 }, 395 400 [MPOL_PREFERRED] = { ··· 397 402 .rebind = mpol_rebind_preferred, 398 403 }, 399 404 [MPOL_BIND] = { 400 - .create = mpol_new_bind, 405 + .create = mpol_new_nodemask, 401 406 .rebind = mpol_rebind_nodemask, 402 407 }, 403 408 [MPOL_LOCAL] = { 404 409 .rebind = mpol_rebind_default, 410 + }, 411 + [MPOL_PREFERRED_MANY] = { 412 + .create = mpol_new_nodemask, 413 + .rebind = mpol_rebind_preferred, 405 414 }, 406 415 }; 407 416 ··· 899 900 case MPOL_BIND: 900 901 case MPOL_INTERLEAVE: 901 902 case MPOL_PREFERRED: 903 + case MPOL_PREFERRED_MANY: 902 904 *nodes = p->nodes; 903 905 break; 904 906 case MPOL_LOCAL: ··· 1084 1084 1085 1085 if (!list_empty(&pagelist)) { 1086 1086 err = migrate_pages(&pagelist, 
alloc_migration_target, NULL, 1087 - (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); 1087 + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL); 1088 1088 if (err) 1089 1089 putback_movable_pages(&pagelist); 1090 1090 } ··· 1338 1338 if (!list_empty(&pagelist)) { 1339 1339 WARN_ON_ONCE(flags & MPOL_MF_LAZY); 1340 1340 nr_failed = migrate_pages(&pagelist, new_page, NULL, 1341 - start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); 1341 + start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, NULL); 1342 1342 if (nr_failed) 1343 1343 putback_movable_pages(&pagelist); 1344 1344 } ··· 1446 1446 { 1447 1447 *flags = *mode & MPOL_MODE_FLAGS; 1448 1448 *mode &= ~MPOL_MODE_FLAGS; 1449 - if ((unsigned int)(*mode) >= MPOL_MAX) 1449 + 1450 + if ((unsigned int)(*mode) >= MPOL_MAX) 1450 1451 return -EINVAL; 1451 1452 if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) 1452 1453 return -EINVAL; ··· 1876 1875 */ 1877 1876 nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) 1878 1877 { 1878 + int mode = policy->mode; 1879 + 1879 1880 /* Lower zones don't get a nodemask applied for MPOL_BIND */ 1880 - if (unlikely(policy->mode == MPOL_BIND) && 1881 - apply_policy_zone(policy, gfp_zone(gfp)) && 1882 - cpuset_nodemask_valid_mems_allowed(&policy->nodes)) 1881 + if (unlikely(mode == MPOL_BIND) && 1882 + apply_policy_zone(policy, gfp_zone(gfp)) && 1883 + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) 1884 + return &policy->nodes; 1885 + 1886 + if (mode == MPOL_PREFERRED_MANY) 1883 1887 return &policy->nodes; 1884 1888 1885 1889 return NULL; 1886 1890 } 1887 1891 1888 - /* Return the node id preferred by the given mempolicy, or the given id */ 1892 + /* 1893 + * Return the preferred node id for 'prefer' mempolicy, and return 1894 + * the given id for all other policies. 1895 + * 1896 + * policy_node() is always coupled with policy_nodemask(), which 1897 + * secures the nodemask limit for 'bind' and 'prefer-many' policy. 
1898 + */ 1889 1899 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) 1890 1900 { 1891 1901 if (policy->mode == MPOL_PREFERRED) { ··· 1934 1922 struct mempolicy *policy; 1935 1923 int node = numa_mem_id(); 1936 1924 1937 - if (in_interrupt()) 1925 + if (!in_task()) 1938 1926 return node; 1939 1927 1940 1928 policy = current->mempolicy; ··· 1948 1936 case MPOL_INTERLEAVE: 1949 1937 return interleave_nodes(policy); 1950 1938 1951 - case MPOL_BIND: { 1939 + case MPOL_BIND: 1940 + case MPOL_PREFERRED_MANY: 1941 + { 1952 1942 struct zoneref *z; 1953 1943 1954 1944 /* ··· 2022 2008 * @addr: address in @vma for shared policy lookup and interleave policy 2023 2009 * @gfp_flags: for requested zone 2024 2010 * @mpol: pointer to mempolicy pointer for reference counted mempolicy 2025 - * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask 2011 + * @nodemask: pointer to nodemask pointer for 'bind' and 'prefer-many' policy 2026 2012 * 2027 2013 * Returns a nid suitable for a huge page allocation and a pointer 2028 2014 * to the struct mempolicy for conditional unref after allocation. 2029 - * If the effective policy is 'BIND, returns a pointer to the mempolicy's 2030 - * @nodemask for filtering the zonelist. 2015 + * If the effective policy is 'bind' or 'prefer-many', returns a pointer 2016 + * to the mempolicy's @nodemask for filtering the zonelist. 
2031 2017 * 2032 2018 * Must be protected by read_mems_allowed_begin() 2033 2019 */ ··· 2035 2021 struct mempolicy **mpol, nodemask_t **nodemask) 2036 2022 { 2037 2023 int nid; 2024 + int mode; 2038 2025 2039 2026 *mpol = get_vma_policy(vma, addr); 2040 - *nodemask = NULL; /* assume !MPOL_BIND */ 2027 + *nodemask = NULL; 2028 + mode = (*mpol)->mode; 2041 2029 2042 - if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) { 2030 + if (unlikely(mode == MPOL_INTERLEAVE)) { 2043 2031 nid = interleave_nid(*mpol, vma, addr, 2044 2032 huge_page_shift(hstate_vma(vma))); 2045 2033 } else { 2046 2034 nid = policy_node(gfp_flags, *mpol, numa_node_id()); 2047 - if ((*mpol)->mode == MPOL_BIND) 2035 + if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) 2048 2036 *nodemask = &(*mpol)->nodes; 2049 2037 } 2050 2038 return nid; ··· 2079 2063 mempolicy = current->mempolicy; 2080 2064 switch (mempolicy->mode) { 2081 2065 case MPOL_PREFERRED: 2066 + case MPOL_PREFERRED_MANY: 2082 2067 case MPOL_BIND: 2083 2068 case MPOL_INTERLEAVE: 2084 2069 *mask = mempolicy->nodes; ··· 2145 2128 return page; 2146 2129 } 2147 2130 2131 + static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order, 2132 + int nid, struct mempolicy *pol) 2133 + { 2134 + struct page *page; 2135 + gfp_t preferred_gfp; 2136 + 2137 + /* 2138 + * This is a two pass approach. The first pass will only try the 2139 + * preferred nodes but skip the direct reclaim and allow the 2140 + * allocation to fail, while the second pass will try all the 2141 + * nodes in system. 2142 + */ 2143 + preferred_gfp = gfp | __GFP_NOWARN; 2144 + preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); 2145 + page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes); 2146 + if (!page) 2147 + page = __alloc_pages(gfp, order, numa_node_id(), NULL); 2148 + 2149 + return page; 2150 + } 2151 + 2148 2152 /** 2149 2153 * alloc_pages_vma - Allocate a page for a VMA. 2150 2154 * @gfp: GFP flags. 
··· 2201 2163 goto out; 2202 2164 } 2203 2165 2166 + if (pol->mode == MPOL_PREFERRED_MANY) { 2167 + page = alloc_pages_preferred_many(gfp, order, node, pol); 2168 + mpol_cond_put(pol); 2169 + goto out; 2170 + } 2171 + 2204 2172 if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) { 2205 2173 int hpage_node = node; 2206 2174 ··· 2217 2173 * node and don't fall back to other nodes, as the cost of 2218 2174 * remote accesses would likely offset THP benefits. 2219 2175 * 2220 - * If the policy is interleave, or does not allow the current 2176 + * If the policy is interleave or does not allow the current 2221 2177 * node in its nodemask, we allocate the standard way. 2222 2178 */ 2223 2179 if (pol->mode == MPOL_PREFERRED) ··· 2284 2240 */ 2285 2241 if (pol->mode == MPOL_INTERLEAVE) 2286 2242 page = alloc_page_interleave(gfp, order, interleave_nodes(pol)); 2243 + else if (pol->mode == MPOL_PREFERRED_MANY) 2244 + page = alloc_pages_preferred_many(gfp, order, 2245 + numa_node_id(), pol); 2287 2246 else 2288 2247 page = __alloc_pages(gfp, order, 2289 2248 policy_node(gfp, pol, numa_node_id()), ··· 2358 2311 case MPOL_BIND: 2359 2312 case MPOL_INTERLEAVE: 2360 2313 case MPOL_PREFERRED: 2314 + case MPOL_PREFERRED_MANY: 2361 2315 return !!nodes_equal(a->nodes, b->nodes); 2362 2316 case MPOL_LOCAL: 2363 2317 return true; ··· 2473 2425 * node id. Policy determination "mimics" alloc_page_vma(). 2474 2426 * Called from fault path where we know the vma and faulting address. 2475 2427 * 2476 - * Return: -1 if the page is in a node that is valid for this policy, or a 2477 - * suitable node ID to allocate a replacement page from. 2428 + * Return: NUMA_NO_NODE if the page is in a node that is valid for this 2429 + * policy, or a suitable node ID to allocate a replacement page from. 
2478 2430 */ 2479 2431 int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr) 2480 2432 { ··· 2485 2437 int thiscpu = raw_smp_processor_id(); 2486 2438 int thisnid = cpu_to_node(thiscpu); 2487 2439 int polnid = NUMA_NO_NODE; 2488 - int ret = -1; 2440 + int ret = NUMA_NO_NODE; 2489 2441 2490 2442 pol = get_vma_policy(vma, addr); 2491 2443 if (!(pol->flags & MPOL_F_MOF)) ··· 2499 2451 break; 2500 2452 2501 2453 case MPOL_PREFERRED: 2454 + if (node_isset(curnid, pol->nodes)) 2455 + goto out; 2502 2456 polnid = first_node(pol->nodes); 2503 2457 break; 2504 2458 ··· 2515 2465 break; 2516 2466 goto out; 2517 2467 } 2468 + fallthrough; 2518 2469 2470 + case MPOL_PREFERRED_MANY: 2519 2471 /* 2520 - * allows binding to multiple nodes. 2521 2472 * use current page if in policy nodemask, 2522 2473 * else select nearest allowed node, if any. 2523 2474 * If no allowed nodes, use current [!misplaced]. ··· 2880 2829 [MPOL_BIND] = "bind", 2881 2830 [MPOL_INTERLEAVE] = "interleave", 2882 2831 [MPOL_LOCAL] = "local", 2832 + [MPOL_PREFERRED_MANY] = "prefer (many)", 2883 2833 }; 2884 2834 2885 2835 ··· 2959 2907 if (!nodelist) 2960 2908 err = 0; 2961 2909 goto out; 2910 + case MPOL_PREFERRED_MANY: 2962 2911 case MPOL_BIND: 2963 2912 /* 2964 2913 * Insist on a nodelist ··· 3046 2993 case MPOL_LOCAL: 3047 2994 break; 3048 2995 case MPOL_PREFERRED: 2996 + case MPOL_PREFERRED_MANY: 3049 2997 case MPOL_BIND: 3050 2998 case MPOL_INTERLEAVE: 3051 2999 nodes = pol->nodes; ··· 3075 3021 p += scnprintf(p, buffer + maxlen - p, ":%*pbl", 3076 3022 nodemask_pr_args(&nodes)); 3077 3023 } 3024 + 3025 + bool numa_demotion_enabled = false; 3026 + 3027 + #ifdef CONFIG_SYSFS 3028 + static ssize_t numa_demotion_enabled_show(struct kobject *kobj, 3029 + struct kobj_attribute *attr, char *buf) 3030 + { 3031 + return sysfs_emit(buf, "%s\n", 3032 + numa_demotion_enabled? 
"true" : "false"); 3033 + } 3034 + 3035 + static ssize_t numa_demotion_enabled_store(struct kobject *kobj, 3036 + struct kobj_attribute *attr, 3037 + const char *buf, size_t count) 3038 + { 3039 + if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1)) 3040 + numa_demotion_enabled = true; 3041 + else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1)) 3042 + numa_demotion_enabled = false; 3043 + else 3044 + return -EINVAL; 3045 + 3046 + return count; 3047 + } 3048 + 3049 + static struct kobj_attribute numa_demotion_enabled_attr = 3050 + __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, 3051 + numa_demotion_enabled_store); 3052 + 3053 + static struct attribute *numa_attrs[] = { 3054 + &numa_demotion_enabled_attr.attr, 3055 + NULL, 3056 + }; 3057 + 3058 + static const struct attribute_group numa_attr_group = { 3059 + .attrs = numa_attrs, 3060 + }; 3061 + 3062 + static int __init numa_init_sysfs(void) 3063 + { 3064 + int err; 3065 + struct kobject *numa_kobj; 3066 + 3067 + numa_kobj = kobject_create_and_add("numa", mm_kobj); 3068 + if (!numa_kobj) { 3069 + pr_err("failed to create numa kobject\n"); 3070 + return -ENOMEM; 3071 + } 3072 + err = sysfs_create_group(numa_kobj, &numa_attr_group); 3073 + if (err) { 3074 + pr_err("failed to register numa group\n"); 3075 + goto delete_obj; 3076 + } 3077 + return 0; 3078 + 3079 + delete_obj: 3080 + kobject_put(numa_kobj); 3081 + return err; 3082 + } 3083 + subsys_initcall(numa_init_sysfs); 3084 + #endif
+312 -3
mm/migrate.c
··· 49 49 #include <linux/sched/mm.h> 50 50 #include <linux/ptrace.h> 51 51 #include <linux/oom.h> 52 + #include <linux/memory.h> 52 53 53 54 #include <asm/tlbflush.h> 54 55 ··· 1100 1099 return rc; 1101 1100 } 1102 1101 1102 + 1103 + /* 1104 + * node_demotion[] example: 1105 + * 1106 + * Consider a system with two sockets. Each socket has 1107 + * three classes of memory attached: fast, medium and slow. 1108 + * Each memory class is placed in its own NUMA node. The 1109 + * CPUs are placed in the node with the "fast" memory. The 1110 + * 6 NUMA nodes (0-5) might be split among the sockets like 1111 + * this: 1112 + * 1113 + * Socket A: 0, 1, 2 1114 + * Socket B: 3, 4, 5 1115 + * 1116 + * When Node 0 fills up, its memory should be migrated to 1117 + * Node 1. When Node 1 fills up, it should be migrated to 1118 + * Node 2. The migration path start on the nodes with the 1119 + * processors (since allocations default to this node) and 1120 + * fast memory, progress through medium and end with the 1121 + * slow memory: 1122 + * 1123 + * 0 -> 1 -> 2 -> stop 1124 + * 3 -> 4 -> 5 -> stop 1125 + * 1126 + * This is represented in the node_demotion[] like this: 1127 + * 1128 + * { 1, // Node 0 migrates to 1 1129 + * 2, // Node 1 migrates to 2 1130 + * -1, // Node 2 does not migrate 1131 + * 4, // Node 3 migrates to 4 1132 + * 5, // Node 4 migrates to 5 1133 + * -1} // Node 5 does not migrate 1134 + */ 1135 + 1136 + /* 1137 + * Writes to this array occur without locking. Cycles are 1138 + * not allowed: Node X demotes to Y which demotes to X... 1139 + * 1140 + * If multiple reads are performed, a single rcu_read_lock() 1141 + * must be held over all reads to ensure that no cycles are 1142 + * observed. 1143 + */ 1144 + static int node_demotion[MAX_NUMNODES] __read_mostly = 1145 + {[0 ... 
MAX_NUMNODES - 1] = NUMA_NO_NODE}; 1146 + 1147 + /** 1148 + * next_demotion_node() - Get the next node in the demotion path 1149 + * @node: The starting node to lookup the next node 1150 + * 1151 + * Return: node id for next memory node in the demotion path hierarchy 1152 + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep 1153 + * @node online or guarantee that it *continues* to be the next demotion 1154 + * target. 1155 + */ 1156 + int next_demotion_node(int node) 1157 + { 1158 + int target; 1159 + 1160 + /* 1161 + * node_demotion[] is updated without excluding this 1162 + * function from running. RCU doesn't provide any 1163 + * compiler barriers, so the READ_ONCE() is required 1164 + * to avoid compiler reordering or read merging. 1165 + * 1166 + * Make sure to use RCU over entire code blocks if 1167 + * node_demotion[] reads need to be consistent. 1168 + */ 1169 + rcu_read_lock(); 1170 + target = READ_ONCE(node_demotion[node]); 1171 + rcu_read_unlock(); 1172 + 1173 + return target; 1174 + } 1175 + 1103 1176 /* 1104 1177 * Obtain the lock on page, remove all ptes and migrate the page 1105 1178 * to the newly allocated page in newpage. ··· 1429 1354 * @mode: The migration mode that specifies the constraints for 1430 1355 * page migration, if any. 1431 1356 * @reason: The reason for page migration. 1357 + * @ret_succeeded: Set to the number of pages migrated successfully if 1358 + * the caller passes a non-NULL pointer. 1432 1359 * 1433 1360 * The function returns after 10 attempts or if no pages are movable any more 1434 1361 * because the list has become empty or no retryable pages exist any more. 
··· 1441 1364 */ 1442 1365 int migrate_pages(struct list_head *from, new_page_t get_new_page, 1443 1366 free_page_t put_new_page, unsigned long private, 1444 - enum migrate_mode mode, int reason) 1367 + enum migrate_mode mode, int reason, unsigned int *ret_succeeded) 1445 1368 { 1446 1369 int retry = 1; 1447 1370 int thp_retry = 1; ··· 1596 1519 if (!swapwrite) 1597 1520 current->flags &= ~PF_SWAPWRITE; 1598 1521 1522 + if (ret_succeeded) 1523 + *ret_succeeded = nr_succeeded; 1524 + 1599 1525 return rc; 1600 1526 } 1601 1527 ··· 1668 1588 }; 1669 1589 1670 1590 err = migrate_pages(pagelist, alloc_migration_target, NULL, 1671 - (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); 1591 + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, NULL); 1672 1592 if (err) 1673 1593 putback_movable_pages(pagelist); 1674 1594 return err; ··· 2183 2103 2184 2104 list_add(&page->lru, &migratepages); 2185 2105 nr_remaining = migrate_pages(&migratepages, *new, NULL, node, 2186 - MIGRATE_ASYNC, MR_NUMA_MISPLACED); 2106 + MIGRATE_ASYNC, MR_NUMA_MISPLACED, NULL); 2187 2107 if (nr_remaining) { 2188 2108 if (!list_empty(&migratepages)) { 2189 2109 list_del(&page->lru); ··· 3062 2982 } 3063 2983 EXPORT_SYMBOL(migrate_vma_finalize); 3064 2984 #endif /* CONFIG_DEVICE_PRIVATE */ 2985 + 2986 + #if defined(CONFIG_MEMORY_HOTPLUG) 2987 + /* Disable reclaim-based migration. */ 2988 + static void __disable_all_migrate_targets(void) 2989 + { 2990 + int node; 2991 + 2992 + for_each_online_node(node) 2993 + node_demotion[node] = NUMA_NO_NODE; 2994 + } 2995 + 2996 + static void disable_all_migrate_targets(void) 2997 + { 2998 + __disable_all_migrate_targets(); 2999 + 3000 + /* 3001 + * Ensure that the "disable" is visible across the system. 3002 + * Readers will see either a combination of before+disable 3003 + * state or disable+after. They will never see before and 3004 + * after state together. 
3005 + * 3006 + * The before+after state together might have cycles and 3007 + * could cause readers to do things like loop until this 3008 + * function finishes. This ensures they can only see a 3009 + * single "bad" read and would, for instance, only loop 3010 + * once. 3011 + */ 3012 + synchronize_rcu(); 3013 + } 3014 + 3015 + /* 3016 + * Find an automatic demotion target for 'node'. 3017 + * Failing here is OK. It might just indicate 3018 + * being at the end of a chain. 3019 + */ 3020 + static int establish_migrate_target(int node, nodemask_t *used) 3021 + { 3022 + int migration_target; 3023 + 3024 + /* 3025 + * Can not set a migration target on a 3026 + * node with it already set. 3027 + * 3028 + * No need for READ_ONCE() here since this 3029 + * in the write path for node_demotion[]. 3030 + * This should be the only thread writing. 3031 + */ 3032 + if (node_demotion[node] != NUMA_NO_NODE) 3033 + return NUMA_NO_NODE; 3034 + 3035 + migration_target = find_next_best_node(node, used); 3036 + if (migration_target == NUMA_NO_NODE) 3037 + return NUMA_NO_NODE; 3038 + 3039 + node_demotion[node] = migration_target; 3040 + 3041 + return migration_target; 3042 + } 3043 + 3044 + /* 3045 + * When memory fills up on a node, memory contents can be 3046 + * automatically migrated to another node instead of 3047 + * discarded at reclaim. 3048 + * 3049 + * Establish a "migration path" which will start at nodes 3050 + * with CPUs and will follow the priorities used to build the 3051 + * page allocator zonelists. 3052 + * 3053 + * The difference here is that cycles must be avoided. If 3054 + * node0 migrates to node1, then neither node1, nor anything 3055 + * node1 migrates to can migrate to node0. 3056 + * 3057 + * This function can run simultaneously with readers of 3058 + * node_demotion[]. However, it can not run simultaneously 3059 + * with itself. Exclusion is provided by memory hotplug events 3060 + * being single-threaded. 
3061 + */ 3062 + static void __set_migration_target_nodes(void) 3063 + { 3064 + nodemask_t next_pass = NODE_MASK_NONE; 3065 + nodemask_t this_pass = NODE_MASK_NONE; 3066 + nodemask_t used_targets = NODE_MASK_NONE; 3067 + int node; 3068 + 3069 + /* 3070 + * Avoid any oddities like cycles that could occur 3071 + * from changes in the topology. This will leave 3072 + * a momentary gap when migration is disabled. 3073 + */ 3074 + disable_all_migrate_targets(); 3075 + 3076 + /* 3077 + * Allocations go close to CPUs, first. Assume that 3078 + * the migration path starts at the nodes with CPUs. 3079 + */ 3080 + next_pass = node_states[N_CPU]; 3081 + again: 3082 + this_pass = next_pass; 3083 + next_pass = NODE_MASK_NONE; 3084 + /* 3085 + * To avoid cycles in the migration "graph", ensure 3086 + * that migration sources are not future targets by 3087 + * setting them in 'used_targets'. Do this only 3088 + * once per pass so that multiple source nodes can 3089 + * share a target node. 3090 + * 3091 + * 'used_targets' will become unavailable in future 3092 + * passes. This limits some opportunities for 3093 + * multiple source nodes to share a destination. 3094 + */ 3095 + nodes_or(used_targets, used_targets, this_pass); 3096 + for_each_node_mask(node, this_pass) { 3097 + int target_node = establish_migrate_target(node, &used_targets); 3098 + 3099 + if (target_node == NUMA_NO_NODE) 3100 + continue; 3101 + 3102 + /* 3103 + * Visit targets from this pass in the next pass. 3104 + * Eventually, every node will have been part of 3105 + * a pass, and will become set in 'used_targets'. 3106 + */ 3107 + node_set(target_node, next_pass); 3108 + } 3109 + /* 3110 + * 'next_pass' contains nodes which became migration 3111 + * targets in this pass. Make additional passes until 3112 + * no more migrations targets are available. 3113 + */ 3114 + if (!nodes_empty(next_pass)) 3115 + goto again; 3116 + } 3117 + 3118 + /* 3119 + * For callers that do not hold get_online_mems() already. 
3120 + */ 3121 + static void set_migration_target_nodes(void) 3122 + { 3123 + get_online_mems(); 3124 + __set_migration_target_nodes(); 3125 + put_online_mems(); 3126 + } 3127 + 3128 + /* 3129 + * React to hotplug events that might affect the migration targets 3130 + * like events that online or offline NUMA nodes. 3131 + * 3132 + * The ordering is also currently dependent on which nodes have 3133 + * CPUs. That means we need CPU on/offline notification too. 3134 + */ 3135 + static int migration_online_cpu(unsigned int cpu) 3136 + { 3137 + set_migration_target_nodes(); 3138 + return 0; 3139 + } 3140 + 3141 + static int migration_offline_cpu(unsigned int cpu) 3142 + { 3143 + set_migration_target_nodes(); 3144 + return 0; 3145 + } 3146 + 3147 + /* 3148 + * This leaves migrate-on-reclaim transiently disabled between 3149 + * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs 3150 + * whether reclaim-based migration is enabled or not, which 3151 + * ensures that the user can turn reclaim-based migration at 3152 + * any time without needing to recalculate migration targets. 3153 + * 3154 + * These callbacks already hold get_online_mems(). That is why 3155 + * __set_migration_target_nodes() can be used as opposed to 3156 + * set_migration_target_nodes(). 3157 + */ 3158 + static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, 3159 + unsigned long action, void *arg) 3160 + { 3161 + switch (action) { 3162 + case MEM_GOING_OFFLINE: 3163 + /* 3164 + * Make sure there are not transient states where 3165 + * an offline node is a migration target. This 3166 + * will leave migration disabled until the offline 3167 + * completes and the MEM_OFFLINE case below runs. 3168 + */ 3169 + disable_all_migrate_targets(); 3170 + break; 3171 + case MEM_OFFLINE: 3172 + case MEM_ONLINE: 3173 + /* 3174 + * Recalculate the target nodes once the node 3175 + * reaches its final state (online or offline). 
3176 + */ 3177 + __set_migration_target_nodes(); 3178 + break; 3179 + case MEM_CANCEL_OFFLINE: 3180 + /* 3181 + * MEM_GOING_OFFLINE disabled all the migration 3182 + * targets. Reenable them. 3183 + */ 3184 + __set_migration_target_nodes(); 3185 + break; 3186 + case MEM_GOING_ONLINE: 3187 + case MEM_CANCEL_ONLINE: 3188 + break; 3189 + } 3190 + 3191 + return notifier_from_errno(0); 3192 + } 3193 + 3194 + static int __init migrate_on_reclaim_init(void) 3195 + { 3196 + int ret; 3197 + 3198 + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim", 3199 + migration_online_cpu, 3200 + migration_offline_cpu); 3201 + /* 3202 + * In the unlikely case that this fails, the automatic 3203 + * migration targets may become suboptimal for nodes 3204 + * where N_CPU changes. With such a small impact in a 3205 + * rare case, do not bother trying to do anything special. 3206 + */ 3207 + WARN_ON(ret < 0); 3208 + 3209 + hotplug_memory_notifier(migrate_on_reclaim_callback, 100); 3210 + return 0; 3211 + } 3212 + late_initcall(migrate_on_reclaim_init); 3213 + #endif /* CONFIG_MEMORY_HOTPLUG */
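The two-socket example in the node_demotion[] comment can be reproduced with a small Python sketch of __set_migration_target_nodes(): breadth-first passes starting at the CPU nodes, where each source in a pass claims a nearest not-yet-used node. The distance model and the nearest-node selection below are illustrative stand-ins for find_next_best_node(), not the kernel's actual zonelist-based heuristic.

```python
NUMA_NO_NODE = -1

def build_demotion_order(nr_nodes, cpu_nodes, distance):
    """Mirror the pass structure of __set_migration_target_nodes().

    Sources are merged into 'used' once per pass, so a source can never
    later become a target -- this is what rules out cycles.
    """
    demotion = {n: NUMA_NO_NODE for n in range(nr_nodes)}
    used = set()
    this_pass = set(cpu_nodes)          # allocations start near CPUs
    while this_pass:
        used |= this_pass
        next_pass = set()
        for node in sorted(this_pass):
            # stand-in for find_next_best_node(): nearest unclaimed node
            free = [t for t in range(nr_nodes)
                    if t not in used and t not in next_pass]
            if not free:
                continue                # end of a chain, stays NUMA_NO_NODE
            target = min(free, key=lambda t: distance(node, t))
            demotion[node] = target
            next_pass.add(target)       # targets are visited next pass
        this_pass = next_pass
    return demotion

# Socket A holds nodes 0-2 (fast..slow), socket B holds 3-5; CPUs sit on
# the fast nodes 0 and 3. Cross-socket hops carry a large penalty.
def dist(a, b):
    return abs(a - b) * 10 + (0 if a // 3 == b // 3 else 100)
```

With six nodes and CPU nodes {0, 3}, this yields {0: 1, 1: 2, 2: -1, 3: 4, 4: 5, 5: -1}, matching the 0 -> 1 -> 2 and 3 -> 4 -> 5 chains described in the comment block above.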
+3 -4
mm/mmap.c
··· 534 534 { 535 535 struct rb_node **__rb_link, *__rb_parent, *rb_prev; 536 536 537 + mmap_assert_locked(mm); 537 538 __rb_link = &mm->mm_rb.rb_node; 538 539 rb_prev = __rb_parent = NULL; 539 540 ··· 2298 2297 struct rb_node *rb_node; 2299 2298 struct vm_area_struct *vma; 2300 2299 2300 + mmap_assert_locked(mm); 2301 2301 /* Check the cache first. */ 2302 2302 vma = vmacache_find(mm, addr); 2303 2303 if (likely(vma)) ··· 2988 2986 if (mmap_write_lock_killable(mm)) 2989 2987 return -EINTR; 2990 2988 2991 - vma = find_vma(mm, start); 2989 + vma = vma_lookup(mm, start); 2992 2990 2993 2991 if (!vma || !(vma->vm_flags & VM_SHARED)) 2994 - goto out; 2995 - 2996 - if (start < vma->vm_start) 2997 2992 goto out; 2998 2993 2999 2994 if (start + size > vma->vm_end) {
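The mmap.c hunk replaces find_vma() plus an explicit `start < vma->vm_start` check with vma_lookup(), which only returns a VMA that actually contains the address. A Python sketch of the two lookups, modelling VMAs as (start, end) pairs, shows why the extra branch became redundant:

```python
def find_vma(vmas, addr):
    """First VMA with vm_end > addr; note it may begin *above* addr."""
    for start, end in sorted(vmas):
        if end > addr:
            return (start, end)
    return None

def vma_lookup(vmas, addr):
    """Only a VMA with vm_start <= addr < vm_end, else None."""
    vma = find_vma(vmas, addr)
    if vma and addr < vma[0]:
        return None
    return vma
```

With vmas = [(0x1000, 0x2000), (0x3000, 0x4000)], find_vma() at 0x2500 returns the second VMA while vma_lookup() returns None; that gap case is exactly what the removed `start < vma->vm_start` branch guarded against.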
+1 -1
mm/mremap.c
··· 686 686 if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) { 687 687 /* OOM: unable to split vma, just get accounts right */ 688 688 if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) 689 - vm_acct_memory(new_len >> PAGE_SHIFT); 689 + vm_acct_memory(old_len >> PAGE_SHIFT); 690 690 excess = 0; 691 691 } 692 692
+70
mm/oom_kill.c
··· 28 28 #include <linux/sched/task.h> 29 29 #include <linux/sched/debug.h> 30 30 #include <linux/swap.h> 31 + #include <linux/syscalls.h> 31 32 #include <linux/timex.h> 32 33 #include <linux/jiffies.h> 33 34 #include <linux/cpuset.h> ··· 1141 1140 return; 1142 1141 out_of_memory(&oc); 1143 1142 mutex_unlock(&oom_lock); 1143 + } 1144 + 1145 + SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) 1146 + { 1147 + #ifdef CONFIG_MMU 1148 + struct mm_struct *mm = NULL; 1149 + struct task_struct *task; 1150 + struct task_struct *p; 1151 + unsigned int f_flags; 1152 + bool reap = true; 1153 + struct pid *pid; 1154 + long ret = 0; 1155 + 1156 + if (flags) 1157 + return -EINVAL; 1158 + 1159 + pid = pidfd_get_pid(pidfd, &f_flags); 1160 + if (IS_ERR(pid)) 1161 + return PTR_ERR(pid); 1162 + 1163 + task = get_pid_task(pid, PIDTYPE_TGID); 1164 + if (!task) { 1165 + ret = -ESRCH; 1166 + goto put_pid; 1167 + } 1168 + 1169 + /* 1170 + * Make sure to choose a thread which still has a reference to mm 1171 + * during the group exit 1172 + */ 1173 + p = find_lock_task_mm(task); 1174 + if (!p) { 1175 + ret = -ESRCH; 1176 + goto put_task; 1177 + } 1178 + 1179 + mm = p->mm; 1180 + mmgrab(mm); 1181 + 1182 + /* If the work has been done already, just exit with success */ 1183 + if (test_bit(MMF_OOM_SKIP, &mm->flags)) 1184 + reap = false; 1185 + else if (!task_will_free_mem(p)) { 1186 + reap = false; 1187 + ret = -EINVAL; 1188 + } 1189 + task_unlock(p); 1190 + 1191 + if (!reap) 1192 + goto drop_mm; 1193 + 1194 + if (mmap_read_lock_killable(mm)) { 1195 + ret = -EINTR; 1196 + goto drop_mm; 1197 + } 1198 + if (!__oom_reap_task_mm(mm)) 1199 + ret = -EAGAIN; 1200 + mmap_read_unlock(mm); 1201 + 1202 + drop_mm: 1203 + mmdrop(mm); 1204 + put_task: 1205 + put_task_struct(task); 1206 + put_pid: 1207 + put_pid(pid); 1208 + return ret; 1209 + #else 1210 + return -ENOSYS; 1211 + #endif /* CONFIG_MMU */ 1144 1212 }
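The new syscall is meant to be called on a pidfd (e.g. from pidfd_open()) right after delivering SIGKILL, so a supervisor can reclaim the victim's memory without waiting for exit_mmap(). A hedged ctypes sketch of a userspace wrapper; the syscall number 448 is the asm-generic/x86-64 value for process_mrelease, the wrapper name is ours, and on kernels without this patch the call simply fails with ENOSYS:

```python
import ctypes

__NR_process_mrelease = 448   # asm-generic value; verify for your arch

_libc = ctypes.CDLL(None, use_errno=True)

def process_mrelease(pidfd: int, flags: int = 0) -> int:
    """Return 0 on success, or -errno on failure (kernel-style)."""
    ret = _libc.syscall(__NR_process_mrelease, pidfd, flags)
    if ret < 0:
        return -ctypes.get_errno()
    return 0
```

Per the implementation above, a nonzero flags argument is rejected with -EINVAL, a task that is not already dying fails the task_will_free_mem() check (also -EINVAL), and an mm that was already reaped (MMF_OOM_SKIP set) returns success.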
+83 -38
mm/page-writeback.c
··· 183 183 static void wb_min_max_ratio(struct bdi_writeback *wb, 184 184 unsigned long *minp, unsigned long *maxp) 185 185 { 186 - unsigned long this_bw = wb->avg_write_bandwidth; 186 + unsigned long this_bw = READ_ONCE(wb->avg_write_bandwidth); 187 187 unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth); 188 188 unsigned long long min = wb->bdi->min_ratio; 189 189 unsigned long long max = wb->bdi->max_ratio; ··· 892 892 static void wb_position_ratio(struct dirty_throttle_control *dtc) 893 893 { 894 894 struct bdi_writeback *wb = dtc->wb; 895 - unsigned long write_bw = wb->avg_write_bandwidth; 895 + unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth); 896 896 unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); 897 897 unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); 898 898 unsigned long wb_thresh = dtc->wb_thresh; ··· 1115 1115 &wb->bdi->tot_write_bandwidth) <= 0); 1116 1116 } 1117 1117 wb->write_bandwidth = bw; 1118 - wb->avg_write_bandwidth = avg; 1118 + WRITE_ONCE(wb->avg_write_bandwidth, avg); 1119 1119 } 1120 1120 1121 1121 static void update_dirty_limit(struct dirty_throttle_control *dtc) ··· 1147 1147 dom->dirty_limit = limit; 1148 1148 } 1149 1149 1150 - static void domain_update_bandwidth(struct dirty_throttle_control *dtc, 1151 - unsigned long now) 1150 + static void domain_update_dirty_limit(struct dirty_throttle_control *dtc, 1151 + unsigned long now) 1152 1152 { 1153 1153 struct wb_domain *dom = dtc_dom(dtc); 1154 1154 ··· 1324 1324 else 1325 1325 dirty_ratelimit -= step; 1326 1326 1327 - wb->dirty_ratelimit = max(dirty_ratelimit, 1UL); 1327 + WRITE_ONCE(wb->dirty_ratelimit, max(dirty_ratelimit, 1UL)); 1328 1328 wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit; 1329 1329 1330 1330 trace_bdi_dirty_ratelimit(wb, dirty_rate, task_ratelimit); ··· 1332 1332 1333 1333 static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc, 1334 1334 struct dirty_throttle_control 
*mdtc, 1335 - unsigned long start_time, 1336 1335 bool update_ratelimit) 1337 1336 { 1338 1337 struct bdi_writeback *wb = gdtc->wb; 1339 1338 unsigned long now = jiffies; 1340 - unsigned long elapsed = now - wb->bw_time_stamp; 1339 + unsigned long elapsed; 1341 1340 unsigned long dirtied; 1342 1341 unsigned long written; 1343 1342 1344 - lockdep_assert_held(&wb->list_lock); 1343 + spin_lock(&wb->list_lock); 1345 1344 1346 1345 /* 1347 - * rate-limit, only update once every 200ms. 1346 + * Lockless checks for elapsed time are racy and delayed update after 1347 + * IO completion doesn't do it at all (to make sure written pages are 1348 + * accounted reasonably quickly). Make sure elapsed >= 1 to avoid 1349 + * division errors. 1348 1350 */ 1349 - if (elapsed < BANDWIDTH_INTERVAL) 1350 - return; 1351 - 1351 + elapsed = max(now - wb->bw_time_stamp, 1UL); 1352 1352 dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]); 1353 1353 written = percpu_counter_read(&wb->stat[WB_WRITTEN]); 1354 1354 1355 - /* 1356 - * Skip quiet periods when disk bandwidth is under-utilized. 1357 - * (at least 1s idle time between two flusher runs) 1358 - */ 1359 - if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time)) 1360 - goto snapshot; 1361 - 1362 1355 if (update_ratelimit) { 1363 - domain_update_bandwidth(gdtc, now); 1356 + domain_update_dirty_limit(gdtc, now); 1364 1357 wb_update_dirty_ratelimit(gdtc, dirtied, elapsed); 1365 1358 1366 1359 /* ··· 1361 1368 * compiler has no way to figure that out. Help it. 
1362 1369 */ 1363 1370 if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) { 1364 - domain_update_bandwidth(mdtc, now); 1371 + domain_update_dirty_limit(mdtc, now); 1365 1372 wb_update_dirty_ratelimit(mdtc, dirtied, elapsed); 1366 1373 } 1367 1374 } 1368 1375 wb_update_write_bandwidth(wb, elapsed, written); 1369 1376 1370 - snapshot: 1371 1377 wb->dirtied_stamp = dirtied; 1372 1378 wb->written_stamp = written; 1373 - wb->bw_time_stamp = now; 1379 + WRITE_ONCE(wb->bw_time_stamp, now); 1380 + spin_unlock(&wb->list_lock); 1374 1381 } 1375 1382 1376 - void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time) 1383 + void wb_update_bandwidth(struct bdi_writeback *wb) 1377 1384 { 1378 1385 struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; 1379 1386 1380 - __wb_update_bandwidth(&gdtc, NULL, start_time, false); 1387 + __wb_update_bandwidth(&gdtc, NULL, false); 1388 + } 1389 + 1390 + /* Interval after which we consider wb idle and don't estimate bandwidth */ 1391 + #define WB_BANDWIDTH_IDLE_JIF (HZ) 1392 + 1393 + static void wb_bandwidth_estimate_start(struct bdi_writeback *wb) 1394 + { 1395 + unsigned long now = jiffies; 1396 + unsigned long elapsed = now - READ_ONCE(wb->bw_time_stamp); 1397 + 1398 + if (elapsed > WB_BANDWIDTH_IDLE_JIF && 1399 + !atomic_read(&wb->writeback_inodes)) { 1400 + spin_lock(&wb->list_lock); 1401 + wb->dirtied_stamp = wb_stat(wb, WB_DIRTIED); 1402 + wb->written_stamp = wb_stat(wb, WB_WRITTEN); 1403 + WRITE_ONCE(wb->bw_time_stamp, now); 1404 + spin_unlock(&wb->list_lock); 1405 + } 1381 1406 } 1382 1407 1383 1408 /* ··· 1418 1407 static unsigned long wb_max_pause(struct bdi_writeback *wb, 1419 1408 unsigned long wb_dirty) 1420 1409 { 1421 - unsigned long bw = wb->avg_write_bandwidth; 1410 + unsigned long bw = READ_ONCE(wb->avg_write_bandwidth); 1422 1411 unsigned long t; 1423 1412 1424 1413 /* ··· 1440 1429 unsigned long dirty_ratelimit, 1441 1430 int *nr_dirtied_pause) 1442 1431 { 1443 - long hi = 
ilog2(wb->avg_write_bandwidth); 1444 - long lo = ilog2(wb->dirty_ratelimit); 1432 + long hi = ilog2(READ_ONCE(wb->avg_write_bandwidth)); 1433 + long lo = ilog2(READ_ONCE(wb->dirty_ratelimit)); 1445 1434 long t; /* target pause */ 1446 1435 long pause; /* estimated next pause */ 1447 1436 int pages; /* target nr_dirtied_pause */ ··· 1721 1710 if (dirty_exceeded && !wb->dirty_exceeded) 1722 1711 wb->dirty_exceeded = 1; 1723 1712 1724 - if (time_is_before_jiffies(wb->bw_time_stamp + 1725 - BANDWIDTH_INTERVAL)) { 1726 - spin_lock(&wb->list_lock); 1727 - __wb_update_bandwidth(gdtc, mdtc, start_time, true); 1728 - spin_unlock(&wb->list_lock); 1729 - } 1713 + if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) + 1714 + BANDWIDTH_INTERVAL)) 1715 + __wb_update_bandwidth(gdtc, mdtc, true); 1730 1716 1731 1717 /* throttle according to the chosen dtc */ 1732 - dirty_ratelimit = wb->dirty_ratelimit; 1718 + dirty_ratelimit = READ_ONCE(wb->dirty_ratelimit); 1733 1719 task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >> 1734 1720 RATELIMIT_CALC_SHIFT; 1735 1721 max_pause = wb_max_pause(wb, sdtc->wb_dirty); ··· 2353 2345 int do_writepages(struct address_space *mapping, struct writeback_control *wbc) 2354 2346 { 2355 2347 int ret; 2348 + struct bdi_writeback *wb; 2356 2349 2357 2350 if (wbc->nr_to_write <= 0) 2358 2351 return 0; 2352 + wb = inode_to_wb_wbc(mapping->host, wbc); 2353 + wb_bandwidth_estimate_start(wb); 2359 2354 while (1) { 2360 2355 if (mapping->a_ops->writepages) 2361 2356 ret = mapping->a_ops->writepages(mapping, wbc); ··· 2369 2358 cond_resched(); 2370 2359 congestion_wait(BLK_RW_ASYNC, HZ/50); 2371 2360 } 2361 + /* 2362 + * Usually few pages are written by now from those we've just submitted 2363 + * but if there's constant writeback being submitted, this makes sure 2364 + * writeback bandwidth is updated once in a while. 
2365 + */ 2366 + if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) + 2367 + BANDWIDTH_INTERVAL)) 2368 + wb_update_bandwidth(wb); 2372 2369 return ret; 2373 2370 } 2374 2371 ··· 2748 2729 } 2749 2730 EXPORT_SYMBOL(clear_page_dirty_for_io); 2750 2731 2732 + static void wb_inode_writeback_start(struct bdi_writeback *wb) 2733 + { 2734 + atomic_inc(&wb->writeback_inodes); 2735 + } 2736 + 2737 + static void wb_inode_writeback_end(struct bdi_writeback *wb) 2738 + { 2739 + atomic_dec(&wb->writeback_inodes); 2740 + /* 2741 + * Make sure estimate of writeback throughput gets updated after 2742 + * writeback completed. We delay the update by BANDWIDTH_INTERVAL 2743 + * (which is the interval other bandwidth updates use for batching) so 2744 + * that if multiple inodes end writeback at a similar time, they get 2745 + * batched into one bandwidth update. 2746 + */ 2747 + queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL); 2748 + } 2749 + 2751 2750 int test_clear_page_writeback(struct page *page) 2752 2751 { 2753 2752 struct address_space *mapping = page_mapping(page); ··· 2787 2750 2788 2751 dec_wb_stat(wb, WB_WRITEBACK); 2789 2752 __wb_writeout_inc(wb); 2753 + if (!mapping_tagged(mapping, 2754 + PAGECACHE_TAG_WRITEBACK)) 2755 + wb_inode_writeback_end(wb); 2790 2756 } 2791 2757 } 2792 2758 ··· 2832 2792 PAGECACHE_TAG_WRITEBACK); 2833 2793 2834 2794 xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK); 2835 - if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) 2836 - inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK); 2795 + if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) { 2796 + struct bdi_writeback *wb = inode_to_wb(inode); 2797 + 2798 + inc_wb_stat(wb, WB_WRITEBACK); 2799 + if (!on_wblist) 2800 + wb_inode_writeback_start(wb); 2801 + } 2837 2802 2838 2803 /* 2839 2804 * We can come through here when swapping anonymous
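With the 200ms rate-limit check gone from __wb_update_bandwidth(), updates driven from IO completion can legally see zero elapsed jiffies, and the `max(now - wb->bw_time_stamp, 1UL)` clamp is what keeps the bandwidth division safe. A tiny Python sketch of that estimator core (HZ value and page-based units are illustrative):

```python
HZ = 250  # jiffies per second; configuration-dependent

def write_bandwidth(written_pages, now, bw_time_stamp):
    """Pages/second written since the last stamp.

    elapsed is clamped to one jiffy so a back-to-back completion
    update cannot divide by zero, mirroring the fix above.
    """
    elapsed = max(now - bw_time_stamp, 1)
    return written_pages * HZ // elapsed
```

At now == bw_time_stamp the clamp kicks in and the estimate is simply attributed to a single jiffy instead of raising a division error.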
+38 -24
mm/page_alloc.c
··· 4211 4211 if (tsk_is_oom_victim(current) || 4212 4212 (current->flags & (PF_MEMALLOC | PF_EXITING))) 4213 4213 filter &= ~SHOW_MEM_FILTER_NODES; 4214 - if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM)) 4214 + if (!in_task() || !(gfp_mask & __GFP_DIRECT_RECLAIM)) 4215 4215 filter &= ~SHOW_MEM_FILTER_NODES; 4216 4216 4217 4217 show_mem(filter, nodemask); ··· 4549 4549 return true; 4550 4550 } 4551 4551 4552 - void __fs_reclaim_acquire(void) 4552 + void __fs_reclaim_acquire(unsigned long ip) 4553 4553 { 4554 - lock_map_acquire(&__fs_reclaim_map); 4554 + lock_acquire_exclusive(&__fs_reclaim_map, 0, 0, NULL, ip); 4555 4555 } 4556 4556 4557 - void __fs_reclaim_release(void) 4557 + void __fs_reclaim_release(unsigned long ip) 4558 4558 { 4559 - lock_map_release(&__fs_reclaim_map); 4559 + lock_release(&__fs_reclaim_map, ip); 4560 4560 } 4561 4561 4562 4562 void fs_reclaim_acquire(gfp_t gfp_mask) ··· 4565 4565 4566 4566 if (__need_reclaim(gfp_mask)) { 4567 4567 if (gfp_mask & __GFP_FS) 4568 - __fs_reclaim_acquire(); 4568 + __fs_reclaim_acquire(_RET_IP_); 4569 4569 4570 4570 #ifdef CONFIG_MMU_NOTIFIER 4571 4571 lock_map_acquire(&__mmu_notifier_invalidate_range_start_map); ··· 4582 4582 4583 4583 if (__need_reclaim(gfp_mask)) { 4584 4584 if (gfp_mask & __GFP_FS) 4585 - __fs_reclaim_release(); 4585 + __fs_reclaim_release(_RET_IP_); 4586 4586 } 4587 4587 } 4588 4588 EXPORT_SYMBOL_GPL(fs_reclaim_release); ··· 4697 4697 * comment for __cpuset_node_allowed(). 4698 4698 */ 4699 4699 alloc_flags &= ~ALLOC_CPUSET; 4700 - } else if (unlikely(rt_task(current)) && !in_interrupt()) 4700 + } else if (unlikely(rt_task(current)) && in_task()) 4701 4701 alloc_flags |= ALLOC_HARDER; 4702 4702 4703 4703 alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags); ··· 5157 5157 * When we are in the interrupt context, it is irrelevant 5158 5158 * to the current task context. It means that any node ok. 
5159 5159 */ 5160 - if (!in_interrupt() && !ac->nodemask) 5160 + if (in_task() && !ac->nodemask) 5161 5161 ac->nodemask = &cpuset_current_mems_allowed; 5162 5162 else 5163 5163 *alloc_flags |= ALLOC_CPUSET; ··· 5903 5903 " unevictable:%lu dirty:%lu writeback:%lu\n" 5904 5904 " slab_reclaimable:%lu slab_unreclaimable:%lu\n" 5905 5905 " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" 5906 + " kernel_misc_reclaimable:%lu\n" 5906 5907 " free:%lu free_pcp:%lu free_cma:%lu\n", 5907 5908 global_node_page_state(NR_ACTIVE_ANON), 5908 5909 global_node_page_state(NR_INACTIVE_ANON), ··· 5920 5919 global_node_page_state(NR_SHMEM), 5921 5920 global_node_page_state(NR_PAGETABLE), 5922 5921 global_zone_page_state(NR_BOUNCE), 5922 + global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE), 5923 5923 global_zone_page_state(NR_FREE_PAGES), 5924 5924 free_pcp, 5925 5925 global_zone_page_state(NR_FREE_CMA_PAGES)); ··· 6157 6155 * 6158 6156 * Return: node id of the found node or %NUMA_NO_NODE if no node is found. 
6159 6157 */ 6160 - static int find_next_best_node(int node, nodemask_t *used_node_mask) 6158 + int find_next_best_node(int node, nodemask_t *used_node_mask) 6161 6159 { 6162 6160 int n, val; 6163 6161 int min_val = INT_MAX; ··· 6642 6640 } 6643 6641 } 6644 6642 6645 - #if !defined(CONFIG_FLATMEM) 6646 6643 /* 6647 6644 * Only struct pages that correspond to ranges defined by memblock.memory 6648 6645 * are zeroed and initialized by going through __init_single_page() during ··· 6686 6685 pr_info("On node %d, zone %s: %lld pages in unavailable ranges", 6687 6686 node, zone_names[zone], pgcnt); 6688 6687 } 6689 - #else 6690 - static inline void init_unavailable_range(unsigned long spfn, 6691 - unsigned long epfn, 6692 - int zone, int node) 6693 - { 6694 - } 6695 - #endif 6696 6688 6697 6689 static void __init memmap_init_zone_range(struct zone *zone, 6698 6690 unsigned long start_pfn, ··· 6715 6721 { 6716 6722 unsigned long start_pfn, end_pfn; 6717 6723 unsigned long hole_pfn = 0; 6718 - int i, j, zone_id, nid; 6724 + int i, j, zone_id = 0, nid; 6719 6725 6720 6726 for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { 6721 6727 struct pglist_data *node = NODE_DATA(nid); ··· 6746 6752 if (hole_pfn < end_pfn) 6747 6753 #endif 6748 6754 init_unavailable_range(hole_pfn, end_pfn, zone_id, nid); 6755 + } 6756 + 6757 + void __init *memmap_alloc(phys_addr_t size, phys_addr_t align, 6758 + phys_addr_t min_addr, int nid, bool exact_nid) 6759 + { 6760 + void *ptr; 6761 + 6762 + if (exact_nid) 6763 + ptr = memblock_alloc_exact_nid_raw(size, align, min_addr, 6764 + MEMBLOCK_ALLOC_ACCESSIBLE, 6765 + nid); 6766 + else 6767 + ptr = memblock_alloc_try_nid_raw(size, align, min_addr, 6768 + MEMBLOCK_ALLOC_ACCESSIBLE, 6769 + nid); 6770 + 6771 + if (ptr && size > 0) 6772 + page_init_poison(ptr, size); 6773 + 6774 + return ptr; 6749 6775 } 6750 6776 6751 6777 static int zone_batchsize(struct zone *zone) ··· 7515 7501 } 7516 7502 7517 7503 #ifdef CONFIG_FLATMEM 7518 - 
static void __ref alloc_node_mem_map(struct pglist_data *pgdat) 7504 + static void __init alloc_node_mem_map(struct pglist_data *pgdat) 7519 7505 { 7520 7506 unsigned long __maybe_unused start = 0; 7521 7507 unsigned long __maybe_unused offset = 0; ··· 7539 7525 end = pgdat_end_pfn(pgdat); 7540 7526 end = ALIGN(end, MAX_ORDER_NR_PAGES); 7541 7527 size = (end - start) * sizeof(struct page); 7542 - map = memblock_alloc_node(size, SMP_CACHE_BYTES, 7543 - pgdat->node_id); 7528 + map = memmap_alloc(size, SMP_CACHE_BYTES, MEMBLOCK_LOW_LIMIT, 7529 + pgdat->node_id, false); 7544 7530 if (!map) 7545 7531 panic("Failed to allocate %ld bytes for node %d memory map\n", 7546 7532 size, pgdat->node_id); ··· 7561 7547 #endif 7562 7548 } 7563 7549 #else 7564 - static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { } 7550 + static inline void alloc_node_mem_map(struct pglist_data *pgdat) { } 7565 7551 #endif /* CONFIG_FLATMEM */ 7566 7552 7567 7553 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT ··· 8990 8976 cc->nr_migratepages -= nr_reclaimed; 8991 8977 8992 8978 ret = migrate_pages(&cc->migratepages, alloc_migration_target, 8993 - NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE); 8979 + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE, NULL); 8994 8980 8995 8981 /* 8996 8982 * On -ENOMEM, migrate_pages() bails out right away. It is pointless
+10 -3
mm/page_isolation.c
··· 287 287 unsigned long pfn, flags; 288 288 struct page *page; 289 289 struct zone *zone; 290 + int ret; 290 291 291 292 /* 292 293 * Note: pageblock_nr_pages != MAX_ORDER. Then, chunks of free pages ··· 300 299 break; 301 300 } 302 301 page = __first_valid_page(start_pfn, end_pfn - start_pfn); 303 - if ((pfn < end_pfn) || !page) 304 - return -EBUSY; 302 + if ((pfn < end_pfn) || !page) { 303 + ret = -EBUSY; 304 + goto out; 305 + } 306 + 305 307 /* Check all pages are free or marked as ISOLATED */ 306 308 zone = page_zone(page); 307 309 spin_lock_irqsave(&zone->lock, flags); 308 310 pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn, isol_flags); 309 311 spin_unlock_irqrestore(&zone->lock, flags); 310 312 313 + ret = pfn < end_pfn ? -EBUSY : 0; 314 + 315 + out: 311 316 trace_test_pages_isolated(start_pfn, end_pfn, pfn); 312 317 313 - return pfn < end_pfn ? -EBUSY : 0; 318 + return ret; 314 319 }
-3
mm/percpu.c
··· 1520 1520 * Pages in [@page_start,@page_end) have been populated to @chunk. Update 1521 1521 * the bookkeeping information accordingly. Must be called after each 1522 1522 * successful population. 1523 - * 1524 - * If this is @for_alloc, do not increment pcpu_nr_empty_pop_pages because it 1525 - * is to serve an allocation in that area. 1526 1523 */ 1527 1524 static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start, 1528 1525 int page_end)
+123 -148
mm/shmem.c
··· 38 38 #include <linux/hugetlb.h> 39 39 #include <linux/frontswap.h> 40 40 #include <linux/fs_parser.h> 41 - 42 - #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */ 41 + #include <linux/swapfile.h> 43 42 44 43 static struct vfsmount *shm_mnt; 45 44 ··· 136 137 } 137 138 #endif 138 139 139 - static bool shmem_should_replace_page(struct page *page, gfp_t gfp); 140 - static int shmem_replace_page(struct page **pagep, gfp_t gfp, 141 - struct shmem_inode_info *info, pgoff_t index); 142 140 static int shmem_swapin_page(struct inode *inode, pgoff_t index, 143 141 struct page **pagep, enum sgp_type sgp, 144 142 gfp_t gfp, struct vm_area_struct *vma, ··· 274 278 ino_t ino; 275 279 276 280 if (!(sb->s_flags & SB_KERNMOUNT)) { 277 - spin_lock(&sbinfo->stat_lock); 281 + raw_spin_lock(&sbinfo->stat_lock); 278 282 if (sbinfo->max_inodes) { 279 283 if (!sbinfo->free_inodes) { 280 - spin_unlock(&sbinfo->stat_lock); 284 + raw_spin_unlock(&sbinfo->stat_lock); 281 285 return -ENOSPC; 282 286 } 283 287 sbinfo->free_inodes--; ··· 300 304 } 301 305 *inop = ino; 302 306 } 303 - spin_unlock(&sbinfo->stat_lock); 307 + raw_spin_unlock(&sbinfo->stat_lock); 304 308 } else if (inop) { 305 309 /* 306 310 * __shmem_file_setup, one of our callers, is lock-free: it ··· 315 319 * to worry about things like glibc compatibility. 
316 320 */ 317 321 ino_t *next_ino; 322 + 318 323 next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); 319 324 ino = *next_ino; 320 325 if (unlikely(ino % SHMEM_INO_BATCH == 0)) { 321 - spin_lock(&sbinfo->stat_lock); 326 + raw_spin_lock(&sbinfo->stat_lock); 322 327 ino = sbinfo->next_ino; 323 328 sbinfo->next_ino += SHMEM_INO_BATCH; 324 - spin_unlock(&sbinfo->stat_lock); 329 + raw_spin_unlock(&sbinfo->stat_lock); 325 330 if (unlikely(is_zero_ino(ino))) 326 331 ino++; 327 332 } ··· 338 341 { 339 342 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 340 343 if (sbinfo->max_inodes) { 341 - spin_lock(&sbinfo->stat_lock); 344 + raw_spin_lock(&sbinfo->stat_lock); 342 345 sbinfo->free_inodes++; 343 - spin_unlock(&sbinfo->stat_lock); 346 + raw_spin_unlock(&sbinfo->stat_lock); 344 347 } 345 348 } 346 349 ··· 471 474 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 472 475 /* ifdef here to avoid bloating shmem.o when not necessary */ 473 476 474 - static int shmem_huge __read_mostly; 477 + static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER; 478 + 479 + bool shmem_is_huge(struct vm_area_struct *vma, 480 + struct inode *inode, pgoff_t index) 481 + { 482 + loff_t i_size; 483 + 484 + if (shmem_huge == SHMEM_HUGE_DENY) 485 + return false; 486 + if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) || 487 + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))) 488 + return false; 489 + if (shmem_huge == SHMEM_HUGE_FORCE) 490 + return true; 491 + 492 + switch (SHMEM_SB(inode->i_sb)->huge) { 493 + case SHMEM_HUGE_ALWAYS: 494 + return true; 495 + case SHMEM_HUGE_WITHIN_SIZE: 496 + index = round_up(index, HPAGE_PMD_NR); 497 + i_size = round_up(i_size_read(inode), PAGE_SIZE); 498 + if (i_size >= HPAGE_PMD_SIZE && (i_size >> PAGE_SHIFT) >= index) 499 + return true; 500 + fallthrough; 501 + case SHMEM_HUGE_ADVISE: 502 + if (vma && (vma->vm_flags & VM_HUGEPAGE)) 503 + return true; 504 + fallthrough; 505 + default: 506 + return false; 507 + } 508 + } 475 509 476 510 #if defined(CONFIG_SYSFS) 477 511 static int 
shmem_parse_huge(const char *str) ··· 673 645 674 646 #define shmem_huge SHMEM_HUGE_DENY 675 647 648 + bool shmem_is_huge(struct vm_area_struct *vma, 649 + struct inode *inode, pgoff_t index) 650 + { 651 + return false; 652 + } 653 + 676 654 static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, 677 655 struct shrink_control *sc, unsigned long nr_to_split) 678 656 { 679 657 return 0; 680 658 } 681 659 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 682 - 683 - static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo) 684 - { 685 - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && 686 - (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) && 687 - shmem_huge != SHMEM_HUGE_DENY) 688 - return true; 689 - return false; 690 - } 691 660 692 661 /* 693 662 * Like add_to_page_cache_locked, but error if expected item has gone. ··· 930 905 if (lend == -1) 931 906 end = -1; /* unsigned, so actually very big */ 932 907 908 + if (info->fallocend > start && info->fallocend <= end && !unfalloc) 909 + info->fallocend = start; 910 + 933 911 pagevec_init(&pvec); 934 912 index = start; 935 913 while (index < end && find_lock_entries(mapping, index, end - 1, ··· 1066 1038 { 1067 1039 struct inode *inode = path->dentry->d_inode; 1068 1040 struct shmem_inode_info *info = SHMEM_I(inode); 1069 - struct shmem_sb_info *sb_info = SHMEM_SB(inode->i_sb); 1070 1041 1071 1042 if (info->alloced - info->swapped != inode->i_mapping->nrpages) { 1072 1043 spin_lock_irq(&info->lock); ··· 1074 1047 } 1075 1048 generic_fillattr(&init_user_ns, inode, stat); 1076 1049 1077 - if (is_huge_enabled(sb_info)) 1050 + if (shmem_is_huge(NULL, inode, 0)) 1078 1051 stat->blksize = HPAGE_PMD_SIZE; 1079 1052 1080 1053 return 0; ··· 1085 1058 { 1086 1059 struct inode *inode = d_inode(dentry); 1087 1060 struct shmem_inode_info *info = SHMEM_I(inode); 1088 - struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 1089 1061 int error; 1090 1062 1091 1063 error = setattr_prepare(&init_user_ns, dentry, attr); 
··· 1120 1094 if (oldsize > holebegin) 1121 1095 unmap_mapping_range(inode->i_mapping, 1122 1096 holebegin, 0, 1); 1123 - 1124 - /* 1125 - * Part of the huge page can be beyond i_size: subject 1126 - * to shrink under memory pressure. 1127 - */ 1128 - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { 1129 - spin_lock(&sbinfo->shrinklist_lock); 1130 - /* 1131 - * _careful to defend against unlocked access to 1132 - * ->shrink_list in shmem_unused_huge_shrink() 1133 - */ 1134 - if (list_empty_careful(&info->shrinklist)) { 1135 - list_add_tail(&info->shrinklist, 1136 - &sbinfo->shrinklist); 1137 - sbinfo->shrinklist_len++; 1138 - } 1139 - spin_unlock(&sbinfo->shrinklist_lock); 1140 - } 1141 1097 } 1142 1098 } 1143 1099 ··· 1163 1155 shmem_free_inode(inode->i_sb); 1164 1156 clear_inode(inode); 1165 1157 } 1166 - 1167 - extern struct swap_info_struct *swap_info[]; 1168 1158 1169 1159 static int shmem_find_swap_entries(struct address_space *mapping, 1170 1160 pgoff_t start, unsigned int nr_entries, ··· 1344 1338 swp_entry_t swap; 1345 1339 pgoff_t index; 1346 1340 1347 - VM_BUG_ON_PAGE(PageCompound(page), page); 1341 + /* 1342 + * If /sys/kernel/mm/transparent_hugepage/shmem_enabled is "always" or 1343 + * "force", drivers/gpu/drm/i915/gem/i915_gem_shmem.c gets huge pages, 1344 + * and its shmem_writeback() needs them to be split when swapping. 
1345 + */ 1346 + if (PageTransCompound(page)) { 1347 + /* Ensure the subpages are still dirty */ 1348 + SetPageDirty(page); 1349 + if (split_huge_page(page) < 0) 1350 + goto redirty; 1351 + ClearPageDirty(page); 1352 + } 1353 + 1348 1354 BUG_ON(!PageLocked(page)); 1349 1355 mapping = page->mapping; 1350 1356 index = page->index; ··· 1471 1453 { 1472 1454 struct mempolicy *mpol = NULL; 1473 1455 if (sbinfo->mpol) { 1474 - spin_lock(&sbinfo->stat_lock); /* prevent replace/use races */ 1456 + raw_spin_lock(&sbinfo->stat_lock); /* prevent replace/use races */ 1475 1457 mpol = sbinfo->mpol; 1476 1458 mpol_get(mpol); 1477 - spin_unlock(&sbinfo->stat_lock); 1459 + raw_spin_unlock(&sbinfo->stat_lock); 1478 1460 } 1479 1461 return mpol; 1480 1462 } ··· 1816 1798 struct shmem_sb_info *sbinfo; 1817 1799 struct mm_struct *charge_mm; 1818 1800 struct page *page; 1819 - enum sgp_type sgp_huge = sgp; 1820 1801 pgoff_t hindex = index; 1821 1802 gfp_t huge_gfp; 1822 1803 int error; ··· 1824 1807 1825 1808 if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) 1826 1809 return -EFBIG; 1827 - if (sgp == SGP_NOHUGE || sgp == SGP_HUGE) 1828 - sgp = SGP_CACHE; 1829 1810 repeat: 1830 1811 if (sgp <= SGP_CACHE && 1831 1812 ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) { ··· 1855 1840 return error; 1856 1841 } 1857 1842 1858 - if (page) 1843 + if (page) { 1859 1844 hindex = page->index; 1860 - if (page && sgp == SGP_WRITE) 1861 - mark_page_accessed(page); 1862 - 1863 - /* fallocated page? */ 1864 - if (page && !PageUptodate(page)) { 1845 + if (sgp == SGP_WRITE) 1846 + mark_page_accessed(page); 1847 + if (PageUptodate(page)) 1848 + goto out; 1849 + /* fallocated page */ 1865 1850 if (sgp != SGP_READ) 1866 1851 goto clear; 1867 1852 unlock_page(page); 1868 1853 put_page(page); 1869 - page = NULL; 1870 - hindex = index; 1871 1854 } 1872 - if (page || sgp == SGP_READ) 1873 - goto out; 1874 1855 1875 1856 /* 1876 - * Fast cache lookup did not find it: 1877 - * bring it back from swap or allocate. 
1857 + * SGP_READ: succeed on hole, with NULL page, letting caller zero. 1858 + * SGP_NOALLOC: fail on hole, with NULL page, letting caller fail. 1859 + */ 1860 + *pagep = NULL; 1861 + if (sgp == SGP_READ) 1862 + return 0; 1863 + if (sgp == SGP_NOALLOC) 1864 + return -ENOENT; 1865 + 1866 + /* 1867 + * Fast cache lookup and swap lookup did not find it: allocate. 1878 1868 */ 1879 1869 1880 1870 if (vma && userfaultfd_missing(vma)) { ··· 1887 1867 return 0; 1888 1868 } 1889 1869 1890 - /* shmem_symlink() */ 1891 - if (!shmem_mapping(mapping)) 1870 + /* Never use a huge page for shmem_symlink() */ 1871 + if (S_ISLNK(inode->i_mode)) 1892 1872 goto alloc_nohuge; 1893 - if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE) 1873 + if (!shmem_is_huge(vma, inode, index)) 1894 1874 goto alloc_nohuge; 1895 - if (shmem_huge == SHMEM_HUGE_FORCE) 1896 - goto alloc_huge; 1897 - switch (sbinfo->huge) { 1898 - case SHMEM_HUGE_NEVER: 1899 - goto alloc_nohuge; 1900 - case SHMEM_HUGE_WITHIN_SIZE: { 1901 - loff_t i_size; 1902 - pgoff_t off; 1903 1875 1904 - off = round_up(index, HPAGE_PMD_NR); 1905 - i_size = round_up(i_size_read(inode), PAGE_SIZE); 1906 - if (i_size >= HPAGE_PMD_SIZE && 1907 - i_size >> PAGE_SHIFT >= off) 1908 - goto alloc_huge; 1909 - 1910 - fallthrough; 1911 - } 1912 - case SHMEM_HUGE_ADVISE: 1913 - if (sgp_huge == SGP_HUGE) 1914 - goto alloc_huge; 1915 - /* TODO: implement fadvise() hints */ 1916 - goto alloc_nohuge; 1917 - } 1918 - 1919 - alloc_huge: 1920 1876 huge_gfp = vma_thp_gfp_mask(vma); 1921 1877 huge_gfp = limit_gfp_mask(huge_gfp, gfp); 1922 1878 page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true); ··· 2048 2052 struct vm_area_struct *vma = vmf->vma; 2049 2053 struct inode *inode = file_inode(vma->vm_file); 2050 2054 gfp_t gfp = mapping_gfp_mask(inode->i_mapping); 2051 - enum sgp_type sgp; 2052 2055 int err; 2053 2056 vm_fault_t ret = VM_FAULT_LOCKED; 2054 2057 ··· 2110 2115 spin_unlock(&inode->i_lock); 2111 2116 } 2112 2117 2113 - sgp = 
SGP_CACHE; 2114 - 2115 - if ((vma->vm_flags & VM_NOHUGEPAGE) || 2116 - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 2117 - sgp = SGP_NOHUGE; 2118 - else if (vma->vm_flags & VM_HUGEPAGE) 2119 - sgp = SGP_HUGE; 2120 - 2121 - err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp, 2118 + err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE, 2122 2119 gfp, vma, vmf, &ret); 2123 2120 if (err) 2124 2121 return vmf_error(err); ··· 2642 2655 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 2643 2656 struct shmem_inode_info *info = SHMEM_I(inode); 2644 2657 struct shmem_falloc shmem_falloc; 2645 - pgoff_t start, index, end; 2658 + pgoff_t start, index, end, undo_fallocend; 2646 2659 int error; 2647 2660 2648 2661 if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) ··· 2711 2724 inode->i_private = &shmem_falloc; 2712 2725 spin_unlock(&inode->i_lock); 2713 2726 2714 - for (index = start; index < end; index++) { 2727 + /* 2728 + * info->fallocend is only relevant when huge pages might be 2729 + * involved: to prevent split_huge_page() freeing fallocated 2730 + * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size. 2731 + */ 2732 + undo_fallocend = info->fallocend; 2733 + if (info->fallocend < end) 2734 + info->fallocend = end; 2735 + 2736 + for (index = start; index < end; ) { 2715 2737 struct page *page; 2716 2738 2717 2739 /* ··· 2734 2738 else 2735 2739 error = shmem_getpage(inode, index, &page, SGP_FALLOC); 2736 2740 if (error) { 2741 + info->fallocend = undo_fallocend; 2737 2742 /* Remove the !PageUptodate pages we added */ 2738 2743 if (index > start) { 2739 2744 shmem_undo_range(inode, ··· 2744 2747 goto undone; 2745 2748 } 2746 2749 2750 + index++; 2751 + /* 2752 + * Here is a more important optimization than it appears: 2753 + * a second SGP_FALLOC on the same huge page will clear it, 2754 + * making it PageUptodate and un-undoable if we fail later. 
2755 + */ 2756 + if (PageTransCompound(page)) { 2757 + index = round_up(index, HPAGE_PMD_NR); 2758 + /* Beware 32-bit wraparound */ 2759 + if (!index) 2760 + index--; 2761 + } 2762 + 2747 2763 /* 2748 2764 * Inform shmem_writepage() how far we have reached. 2749 2765 * No need for lock or barrier: we have the page lock. 2750 2766 */ 2751 - shmem_falloc.next++; 2752 2767 if (!PageUptodate(page)) 2753 - shmem_falloc.nr_falloced++; 2768 + shmem_falloc.nr_falloced += index - shmem_falloc.next; 2769 + shmem_falloc.next = index; 2754 2770 2755 2771 /* 2756 2772 * If !PageUptodate, leave it that way so that freeable pages ··· 3498 3488 struct shmem_options *ctx = fc->fs_private; 3499 3489 struct shmem_sb_info *sbinfo = SHMEM_SB(fc->root->d_sb); 3500 3490 unsigned long inodes; 3491 + struct mempolicy *mpol = NULL; 3501 3492 const char *err; 3502 3493 3503 - spin_lock(&sbinfo->stat_lock); 3494 + raw_spin_lock(&sbinfo->stat_lock); 3504 3495 inodes = sbinfo->max_inodes - sbinfo->free_inodes; 3505 3496 if ((ctx->seen & SHMEM_SEEN_BLOCKS) && ctx->blocks) { 3506 3497 if (!sbinfo->max_blocks) { ··· 3546 3535 * Preserve previous mempolicy unless mpol remount option was specified. 
3547 3536 */ 3548 3537 if (ctx->mpol) { 3549 - mpol_put(sbinfo->mpol); 3538 + mpol = sbinfo->mpol; 3550 3539 sbinfo->mpol = ctx->mpol; /* transfers initial ref */ 3551 3540 ctx->mpol = NULL; 3552 3541 } 3553 - spin_unlock(&sbinfo->stat_lock); 3542 + raw_spin_unlock(&sbinfo->stat_lock); 3543 + mpol_put(mpol); 3554 3544 return 0; 3555 3545 out: 3556 - spin_unlock(&sbinfo->stat_lock); 3546 + raw_spin_unlock(&sbinfo->stat_lock); 3557 3547 return invalfc(fc, "%s", err); 3558 3548 } 3559 3549 ··· 3625 3613 struct shmem_options *ctx = fc->fs_private; 3626 3614 struct inode *inode; 3627 3615 struct shmem_sb_info *sbinfo; 3628 - int err = -ENOMEM; 3629 3616 3630 3617 /* Round up to L1_CACHE_BYTES to resist false sharing */ 3631 3618 sbinfo = kzalloc(max((int)sizeof(struct shmem_sb_info), ··· 3670 3659 sbinfo->mpol = ctx->mpol; 3671 3660 ctx->mpol = NULL; 3672 3661 3673 - spin_lock_init(&sbinfo->stat_lock); 3662 + raw_spin_lock_init(&sbinfo->stat_lock); 3674 3663 if (percpu_counter_init(&sbinfo->used_blocks, 0, GFP_KERNEL)) 3675 3664 goto failed; 3676 3665 spin_lock_init(&sbinfo->shrinklist_lock); ··· 3702 3691 3703 3692 failed: 3704 3693 shmem_put_super(sb); 3705 - return err; 3694 + return -ENOMEM; 3706 3695 } 3707 3696 3708 3697 static int shmem_get_tree(struct fs_context *fc) ··· 3918 3907 if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) 3919 3908 SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; 3920 3909 else 3921 - shmem_huge = 0; /* just in case it was patched */ 3910 + shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ 3922 3911 #endif 3923 3912 return 0; 3924 3913 ··· 3986 3975 struct kobj_attribute shmem_enabled_attr = 3987 3976 __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store); 3988 3977 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */ 3989 - 3990 - #ifdef CONFIG_TRANSPARENT_HUGEPAGE 3991 - bool shmem_huge_enabled(struct vm_area_struct *vma) 3992 - { 3993 - struct inode *inode = file_inode(vma->vm_file); 3994 - 
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 3995 - loff_t i_size; 3996 - pgoff_t off; 3997 - 3998 - if (!transhuge_vma_enabled(vma, vma->vm_flags)) 3999 - return false; 4000 - if (shmem_huge == SHMEM_HUGE_FORCE) 4001 - return true; 4002 - if (shmem_huge == SHMEM_HUGE_DENY) 4003 - return false; 4004 - switch (sbinfo->huge) { 4005 - case SHMEM_HUGE_NEVER: 4006 - return false; 4007 - case SHMEM_HUGE_ALWAYS: 4008 - return true; 4009 - case SHMEM_HUGE_WITHIN_SIZE: 4010 - off = round_up(vma->vm_pgoff, HPAGE_PMD_NR); 4011 - i_size = round_up(i_size_read(inode), PAGE_SIZE); 4012 - if (i_size >= HPAGE_PMD_SIZE && 4013 - i_size >> PAGE_SHIFT >= off) 4014 - return true; 4015 - fallthrough; 4016 - case SHMEM_HUGE_ADVISE: 4017 - /* TODO: implement fadvise() hints */ 4018 - return (vma->vm_flags & VM_HUGEPAGE); 4019 - default: 4020 - VM_BUG_ON(1); 4021 - return false; 4022 - } 4023 - } 4024 - #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 4025 3978 4026 3979 #else /* !CONFIG_SHMEM */ 4027 3980
+9 -37
mm/sparse.c
··· 109 109 } 110 110 #endif 111 111 112 - #ifdef CONFIG_SPARSEMEM_EXTREME 113 - unsigned long __section_nr(struct mem_section *ms) 114 - { 115 - unsigned long root_nr; 116 - struct mem_section *root = NULL; 117 - 118 - for (root_nr = 0; root_nr < NR_SECTION_ROOTS; root_nr++) { 119 - root = __nr_to_section(root_nr * SECTIONS_PER_ROOT); 120 - if (!root) 121 - continue; 122 - 123 - if ((ms >= root) && (ms < (root + SECTIONS_PER_ROOT))) 124 - break; 125 - } 126 - 127 - VM_BUG_ON(!root); 128 - 129 - return (root_nr * SECTIONS_PER_ROOT) + (ms - root); 130 - } 131 - #else 132 - unsigned long __section_nr(struct mem_section *ms) 133 - { 134 - return (unsigned long)(ms - mem_section[0]); 135 - } 136 - #endif 137 - 138 112 /* 139 113 * During early boot, before section_mem_map is used for an actual 140 114 * mem_map, we use section_mem_map to store the section's NUMA ··· 117 143 */ 118 144 static inline unsigned long sparse_encode_early_nid(int nid) 119 145 { 120 - return (nid << SECTION_NID_SHIFT); 146 + return ((unsigned long)nid << SECTION_NID_SHIFT); 121 147 } 122 148 123 149 static inline int sparse_early_nid(struct mem_section *section) ··· 161 187 * those loops early. 
162 188 */ 163 189 unsigned long __highest_present_section_nr; 164 - static void section_mark_present(struct mem_section *ms) 190 + static void __section_mark_present(struct mem_section *ms, 191 + unsigned long section_nr) 165 192 { 166 - unsigned long section_nr = __section_nr(ms); 167 - 168 193 if (section_nr > __highest_present_section_nr) 169 194 __highest_present_section_nr = section_nr; 170 195 ··· 253 280 if (!ms->section_mem_map) { 254 281 ms->section_mem_map = sparse_encode_early_nid(nid) | 255 282 SECTION_IS_ONLINE; 256 - section_mark_present(ms); 283 + __section_mark_present(ms, section); 257 284 } 258 285 } 259 286 } ··· 321 348 static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat) 322 349 { 323 350 #ifndef CONFIG_NUMA 324 - return __pa_symbol(pgdat); 351 + VM_BUG_ON(pgdat != &contig_page_data); 352 + return __pa_symbol(&contig_page_data); 325 353 #else 326 354 return __pa(pgdat); 327 355 #endif ··· 436 462 if (map) 437 463 return map; 438 464 439 - map = memblock_alloc_try_nid_raw(size, size, addr, 440 - MEMBLOCK_ALLOC_ACCESSIBLE, nid); 465 + map = memmap_alloc(size, size, addr, nid, false); 441 466 if (!map) 442 467 panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n", 443 468 __func__, size, PAGE_SIZE, nid, &addr); ··· 463 490 * and we want it to be properly aligned to the section size - this is 464 491 * especially the case for VMEMMAP which maps memmap to PMDs 465 492 */ 466 - sparsemap_buf = memblock_alloc_exact_nid_raw(size, section_map_size(), 467 - addr, MEMBLOCK_ALLOC_ACCESSIBLE, nid); 493 + sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true); 468 494 sparsemap_buf_end = sparsemap_buf + size; 469 495 } 470 496 ··· 906 934 907 935 ms = __nr_to_section(section_nr); 908 936 set_section_nid(section_nr, nid); 909 - section_mark_present(ms); 937 + __section_mark_present(ms, section_nr); 910 938 911 939 /* Align memmap to section boundary in the subsection case */ 912 940 if 
(section_nr_to_pfn(section_nr) != start_pfn)
-22
mm/swap.c
··· 179 179 } 180 180 EXPORT_SYMBOL_GPL(get_kernel_pages); 181 181 182 - /* 183 - * get_kernel_page() - pin a kernel page in memory 184 - * @start: starting kernel address 185 - * @write: pinning for read/write, currently ignored 186 - * @pages: array that receives pointer to the page pinned. 187 - * Must be at least nr_segs long. 188 - * 189 - * Returns 1 if page is pinned. If the page was not pinned, returns 190 - * -errno. The page returned must be released with a put_page() call 191 - * when it is finished with. 192 - */ 193 - int get_kernel_page(unsigned long start, int write, struct page **pages) 194 - { 195 - const struct kvec kiov = { 196 - .iov_base = (void *)start, 197 - .iov_len = PAGE_SIZE 198 - }; 199 - 200 - return get_kernel_pages(&kiov, 1, write, pages); 201 - } 202 - EXPORT_SYMBOL_GPL(get_kernel_page); 203 - 204 182 static void pagevec_lru_move_fn(struct pagevec *pvec, 205 183 void (*move_fn)(struct page *page, struct lruvec *lruvec)) 206 184 {
+7 -1
mm/swapfile.c
··· 3130 3130 struct filename *name; 3131 3131 struct file *swap_file = NULL; 3132 3132 struct address_space *mapping; 3133 + struct dentry *dentry; 3133 3134 int prio; 3134 3135 int error; 3135 3136 union swap_header *swap_header; ··· 3174 3173 3175 3174 p->swap_file = swap_file; 3176 3175 mapping = swap_file->f_mapping; 3176 + dentry = swap_file->f_path.dentry; 3177 3177 inode = mapping->host; 3178 3178 3179 3179 error = claim_swapfile(p, inode); ··· 3182 3180 goto bad_swap; 3183 3181 3184 3182 inode_lock(inode); 3183 + if (d_unlinked(dentry) || cant_mount(dentry)) { 3184 + error = -ENOENT; 3185 + goto bad_swap_unlock_inode; 3186 + } 3185 3187 if (IS_SWAPFILE(inode)) { 3186 3188 error = -EBUSY; 3187 3189 goto bad_swap_unlock_inode; ··· 3779 3773 } 3780 3774 3781 3775 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) 3782 - void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask) 3776 + void __cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask) 3783 3777 { 3784 3778 struct swap_info_struct *si, *next; 3785 3779 int nid = page_to_nid(page);
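The mm/swapfile.c change adds an early check to swapon(): if the candidate file's dentry is already unlinked (d_unlinked()) or blocked from mounting (cant_mount()), the syscall bails out with -ENOENT before any swap setup. A hedged sketch of that guard, with plain booleans standing in for the two dentry predicates:

```c
#include <stdbool.h>

/* Illustrative stand-in for the new swapon() guard: `unlinked` and
 * `cannot_mount` model d_unlinked(dentry) and cant_mount(dentry). */
static int swapfile_dentry_check(bool unlinked, bool cannot_mount)
{
    if (unlinked || cannot_mount)
        return -2;      /* -ENOENT: refuse a doomed swap file early */
    return 0;           /* safe to proceed with swap setup */
}
```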
+13 -15
mm/truncate.c
··· 484 484 index = indices[i]; 485 485 486 486 if (xa_is_value(page)) { 487 - invalidate_exceptional_entry(mapping, index, 488 - page); 487 + count += invalidate_exceptional_entry(mapping, 488 + index, 489 + page); 489 490 continue; 490 491 } 491 492 index += thp_nr_pages(page) - 1; ··· 514 513 } 515 514 516 515 /** 517 - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode 518 - * @mapping: the address_space which holds the pages to invalidate 516 + * invalidate_mapping_pages - Invalidate all clean, unlocked cache of one inode 517 + * @mapping: the address_space which holds the cache to invalidate 519 518 * @start: the offset 'from' which to invalidate 520 519 * @end: the offset 'to' which to invalidate (inclusive) 521 520 * 522 - * This function only removes the unlocked pages, if you want to 523 - * remove all the pages of one inode, you must call truncate_inode_pages. 521 + * This function removes pages that are clean, unmapped and unlocked, 522 + * as well as shadow entries. It will not block on IO activity. 524 523 * 525 - * invalidate_mapping_pages() will not block on IO activity. It will not 526 - * invalidate pages which are dirty, locked, under writeback or mapped into 527 - * pagetables. 524 + * If you want to remove all the pages of one inode, regardless of 525 + * their use and writeback state, use truncate_inode_pages(). 
528 526 * 529 - * Return: the number of the pages that were invalidated 527 + * Return: the number of the cache entries that were invalidated 530 528 */ 531 529 unsigned long invalidate_mapping_pages(struct address_space *mapping, 532 530 pgoff_t start, pgoff_t end) ··· 561 561 static int 562 562 invalidate_complete_page2(struct address_space *mapping, struct page *page) 563 563 { 564 - unsigned long flags; 565 - 566 564 if (page->mapping != mapping) 567 565 return 0; 568 566 569 567 if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) 570 568 return 0; 571 569 572 - xa_lock_irqsave(&mapping->i_pages, flags); 570 + xa_lock_irq(&mapping->i_pages); 573 571 if (PageDirty(page)) 574 572 goto failed; 575 573 576 574 BUG_ON(page_has_private(page)); 577 575 __delete_from_page_cache(page, NULL); 578 - xa_unlock_irqrestore(&mapping->i_pages, flags); 576 + xa_unlock_irq(&mapping->i_pages); 579 577 580 578 if (mapping->a_ops->freepage) 581 579 mapping->a_ops->freepage(page); ··· 581 583 put_page(page); /* pagecache ref */ 582 584 return 1; 583 585 failed: 584 - xa_unlock_irqrestore(&mapping->i_pages, flags); 586 + xa_unlock_irq(&mapping->i_pages); 585 587 return 0; 586 588 } 587 589
+8 -7
mm/userfaultfd.c
··· 483 483 unsigned long src_start, 484 484 unsigned long len, 485 485 enum mcopy_atomic_mode mcopy_mode, 486 - bool *mmap_changing, 486 + atomic_t *mmap_changing, 487 487 __u64 mode) 488 488 { 489 489 struct vm_area_struct *dst_vma; ··· 517 517 * request the user to retry later 518 518 */ 519 519 err = -EAGAIN; 520 - if (mmap_changing && READ_ONCE(*mmap_changing)) 520 + if (mmap_changing && atomic_read(mmap_changing)) 521 521 goto out_unlock; 522 522 523 523 /* ··· 650 650 651 651 ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, 652 652 unsigned long src_start, unsigned long len, 653 - bool *mmap_changing, __u64 mode) 653 + atomic_t *mmap_changing, __u64 mode) 654 654 { 655 655 return __mcopy_atomic(dst_mm, dst_start, src_start, len, 656 656 MCOPY_ATOMIC_NORMAL, mmap_changing, mode); 657 657 } 658 658 659 659 ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start, 660 - unsigned long len, bool *mmap_changing) 660 + unsigned long len, atomic_t *mmap_changing) 661 661 { 662 662 return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_ZEROPAGE, 663 663 mmap_changing, 0); 664 664 } 665 665 666 666 ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start, 667 - unsigned long len, bool *mmap_changing) 667 + unsigned long len, atomic_t *mmap_changing) 668 668 { 669 669 return __mcopy_atomic(dst_mm, start, 0, len, MCOPY_ATOMIC_CONTINUE, 670 670 mmap_changing, 0); 671 671 } 672 672 673 673 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, 674 - unsigned long len, bool enable_wp, bool *mmap_changing) 674 + unsigned long len, bool enable_wp, 675 + atomic_t *mmap_changing) 675 676 { 676 677 struct vm_area_struct *dst_vma; 677 678 pgprot_t newprot; ··· 695 694 * request the user to retry later 696 695 */ 697 696 err = -EAGAIN; 698 - if (mmap_changing && READ_ONCE(*mmap_changing)) 697 + if (mmap_changing && atomic_read(mmap_changing)) 699 698 goto out_unlock; 700 699 701 700 err = -ENOENT;
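The mm/userfaultfd.c hunk switches mmap_changing from a plain bool read with READ_ONCE() to an atomic_t read with atomic_read(). A rough userspace model of the resulting retry gate, using C11 atomics (the helper names are invented for illustration and are not the kernel's API):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int mmap_changing;   /* nonzero while the mapping is in flux */

static void layout_change_begin(void) { atomic_fetch_add(&mmap_changing, 1); }
static void layout_change_end(void)   { atomic_fetch_sub(&mmap_changing, 1); }

/* Mirrors the -EAGAIN check in __mcopy_atomic(): a copy may only proceed
 * when no layout change is pending; otherwise the caller retries later. */
static bool may_copy(void)
{
    return atomic_load(&mmap_changing) == 0;
}
```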
+63 -16
mm/vmalloc.c
··· 787 787 return atomic_long_read(&nr_vmalloc_pages); 788 788 } 789 789 790 + static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) 791 + { 792 + struct vmap_area *va = NULL; 793 + struct rb_node *n = vmap_area_root.rb_node; 794 + 795 + while (n) { 796 + struct vmap_area *tmp; 797 + 798 + tmp = rb_entry(n, struct vmap_area, rb_node); 799 + if (tmp->va_end > addr) { 800 + va = tmp; 801 + if (tmp->va_start <= addr) 802 + break; 803 + 804 + n = n->rb_left; 805 + } else 806 + n = n->rb_right; 807 + } 808 + 809 + return va; 810 + } 811 + 790 812 static struct vmap_area *__find_vmap_area(unsigned long addr) 791 813 { 792 814 struct rb_node *n = vmap_area_root.rb_node; ··· 1501 1479 int node, gfp_t gfp_mask) 1502 1480 { 1503 1481 struct vmap_area *va; 1482 + unsigned long freed; 1504 1483 unsigned long addr; 1505 1484 int purged = 0; 1506 1485 int ret; ··· 1565 1542 goto retry; 1566 1543 } 1567 1544 1568 - if (gfpflags_allow_blocking(gfp_mask)) { 1569 - unsigned long freed = 0; 1570 - blocking_notifier_call_chain(&vmap_notify_list, 0, &freed); 1571 - if (freed > 0) { 1572 - purged = 0; 1573 - goto retry; 1574 - } 1545 + freed = 0; 1546 + blocking_notifier_call_chain(&vmap_notify_list, 0, &freed); 1547 + 1548 + if (freed > 0) { 1549 + purged = 0; 1550 + goto retry; 1575 1551 } 1576 1552 1577 1553 if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) ··· 2801 2779 2802 2780 static inline unsigned int 2803 2781 vm_area_alloc_pages(gfp_t gfp, int nid, 2804 - unsigned int order, unsigned long nr_pages, struct page **pages) 2782 + unsigned int order, unsigned int nr_pages, struct page **pages) 2805 2783 { 2806 2784 unsigned int nr_allocated = 0; 2807 2785 ··· 2811 2789 * to fails, fallback to a single page allocator that is 2812 2790 * more permissive. 
2813 2791 */ 2814 - if (!order) 2815 - nr_allocated = alloc_pages_bulk_array_node( 2816 - gfp, nid, nr_pages, pages); 2817 - else 2792 + if (!order) { 2793 + while (nr_allocated < nr_pages) { 2794 + unsigned int nr, nr_pages_request; 2795 + 2796 + /* 2797 + * A maximum allowed request is hard-coded and is 100 2798 + * pages per call. That is done in order to prevent a 2799 + * long preemption off scenario in the bulk-allocator 2800 + * so the range is [1:100]. 2801 + */ 2802 + nr_pages_request = min(100U, nr_pages - nr_allocated); 2803 + 2804 + nr = alloc_pages_bulk_array_node(gfp, nid, 2805 + nr_pages_request, pages + nr_allocated); 2806 + 2807 + nr_allocated += nr; 2808 + cond_resched(); 2809 + 2810 + /* 2811 + * If zero or pages were obtained partly, 2812 + * fallback to a single page allocator. 2813 + */ 2814 + if (nr != nr_pages_request) 2815 + break; 2816 + } 2817 + } else 2818 2818 /* 2819 2819 * Compound pages required for remap_vmalloc_page if 2820 2820 * high-order pages. ··· 2860 2816 for (i = 0; i < (1U << order); i++) 2861 2817 pages[nr_allocated + i] = page + i; 2862 2818 2863 - if (gfpflags_allow_blocking(gfp)) 2864 - cond_resched(); 2865 - 2819 + cond_resched(); 2866 2820 nr_allocated += 1U << order; 2867 2821 } 2868 2822 ··· 3309 3267 count = -(unsigned long) addr; 3310 3268 3311 3269 spin_lock(&vmap_area_lock); 3312 - va = __find_vmap_area((unsigned long)addr); 3270 + va = find_vmap_area_exceed_addr((unsigned long)addr); 3313 3271 if (!va) 3314 3272 goto finished; 3273 + 3274 + /* no intersects with alive vmap_area */ 3275 + if ((unsigned long)addr + count <= va->va_start) 3276 + goto finished; 3277 + 3315 3278 list_for_each_entry_from(va, &vmap_area_list, list) { 3316 3279 if (!count) 3317 3280 break;
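The order-0 path in vm_area_alloc_pages() now issues bulk requests in capped batches: at most 100 pages per call, to bound the preemption-off window inside the bulk allocator, with cond_resched() between batches and a fall back to the single-page allocator on a short return. The batching arithmetic can be sketched like this (bulk_get() is a toy stand-in for alloc_pages_bulk_array_node(); `avail_per_call` models how many pages the bulk source can actually hand out):

```c
#include <stddef.h>

/* Toy bulk source: hands out at most `avail` items per call, mimicking a
 * bulk allocator that may return fewer pages than requested. */
static size_t bulk_get(size_t want, size_t avail)
{
    return want < avail ? want : avail;
}

/* Sketch of the capped-batch loop: clamp each request to 100 so no single
 * call runs too long; a short return stops batching so the caller can fall
 * back to single-item allocation for the remainder. */
size_t alloc_batched(size_t nr_total, size_t avail_per_call)
{
    size_t nr_allocated = 0;

    while (nr_allocated < nr_total) {
        size_t request = nr_total - nr_allocated;

        if (request > 100)
            request = 100;          /* hard cap, as in the kernel loop */

        size_t got = bulk_get(request, avail_per_call);
        nr_allocated += got;

        if (got != request)         /* partial batch: stop, fall back */
            break;
    }
    return nr_allocated;
}
```

With a cooperative source, 250 pages arrive as batches of 100, 100 and 50; if the source can only deliver 60 per call, the first short batch ends the loop.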
+7 -3
mm/vmpressure.c
··· 74 74 75 75 static struct vmpressure *vmpressure_parent(struct vmpressure *vmpr) 76 76 { 77 - struct cgroup_subsys_state *css = vmpressure_to_css(vmpr); 78 - struct mem_cgroup *memcg = mem_cgroup_from_css(css); 77 + struct mem_cgroup *memcg = vmpressure_to_memcg(vmpr); 79 78 80 79 memcg = parent_mem_cgroup(memcg); 81 80 if (!memcg) ··· 239 240 void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree, 240 241 unsigned long scanned, unsigned long reclaimed) 241 242 { 242 - struct vmpressure *vmpr = memcg_to_vmpressure(memcg); 243 + struct vmpressure *vmpr; 244 + 245 + if (mem_cgroup_disabled()) 246 + return; 247 + 248 + vmpr = memcg_to_vmpressure(memcg); 243 249 244 250 /* 245 251 * Here we only want to account pressure that userland is able to
+174 -36
mm/vmscan.c
··· 41 41 #include <linux/kthread.h> 42 42 #include <linux/freezer.h> 43 43 #include <linux/memcontrol.h> 44 + #include <linux/migrate.h> 44 45 #include <linux/delayacct.h> 45 46 #include <linux/sysctl.h> 46 47 #include <linux/oom.h> ··· 121 120 122 121 /* The file pages on the current node are dangerously low */ 123 122 unsigned int file_is_tiny:1; 123 + 124 + /* Always discard instead of demoting to lower tier memory */ 125 + unsigned int no_demotion:1; 124 126 125 127 /* Allocation order */ 126 128 s8 order; ··· 522 518 return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]); 523 519 } 524 520 521 + static bool can_demote(int nid, struct scan_control *sc) 522 + { 523 + if (!numa_demotion_enabled) 524 + return false; 525 + if (sc) { 526 + if (sc->no_demotion) 527 + return false; 528 + /* It is pointless to do demotion in memcg reclaim */ 529 + if (cgroup_reclaim(sc)) 530 + return false; 531 + } 532 + if (next_demotion_node(nid) == NUMA_NO_NODE) 533 + return false; 534 + 535 + return true; 536 + } 537 + 538 + static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, 539 + int nid, 540 + struct scan_control *sc) 541 + { 542 + if (memcg == NULL) { 543 + /* 544 + * For non-memcg reclaim, is there 545 + * space in any swap device? 546 + */ 547 + if (get_nr_swap_pages() > 0) 548 + return true; 549 + } else { 550 + /* Is the memcg below its swap limit? */ 551 + if (mem_cgroup_get_nr_swap_pages(memcg) > 0) 552 + return true; 553 + } 554 + 555 + /* 556 + * The page can not be swapped. 557 + * 558 + * Can it be reclaimed from this node via demotion? 559 + */ 560 + return can_demote(nid, sc); 561 + } 562 + 525 563 /* 526 564 * This misses isolated pages which are not accounted for to save counters. 
527 565 * As the data only determines if reclaim or compaction continues, it is ··· 575 529 576 530 nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + 577 531 zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); 578 - if (get_nr_swap_pages() > 0) 532 + if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL)) 579 533 nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + 580 534 zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); 581 535 ··· 939 893 void drop_slab_node(int nid) 940 894 { 941 895 unsigned long freed; 896 + int shift = 0; 942 897 943 898 do { 944 899 struct mem_cgroup *memcg = NULL; ··· 952 905 do { 953 906 freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); 954 907 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); 955 - } while (freed > 10); 908 + } while ((freed >> shift++) > 1); 956 909 } 957 910 958 911 void drop_slab(void) ··· 1099 1052 static int __remove_mapping(struct address_space *mapping, struct page *page, 1100 1053 bool reclaimed, struct mem_cgroup *target_memcg) 1101 1054 { 1102 - unsigned long flags; 1103 1055 int refcount; 1104 1056 void *shadow = NULL; 1105 1057 1106 1058 BUG_ON(!PageLocked(page)); 1107 1059 BUG_ON(mapping != page_mapping(page)); 1108 1060 1109 - xa_lock_irqsave(&mapping->i_pages, flags); 1061 + xa_lock_irq(&mapping->i_pages); 1110 1062 /* 1111 1063 * The non racy check for a busy page. 
1112 1064 * ··· 1146 1100 if (reclaimed && !mapping_exiting(mapping)) 1147 1101 shadow = workingset_eviction(page, target_memcg); 1148 1102 __delete_from_swap_cache(page, swap, shadow); 1149 - xa_unlock_irqrestore(&mapping->i_pages, flags); 1103 + xa_unlock_irq(&mapping->i_pages); 1150 1104 put_swap_page(page, swap); 1151 1105 } else { 1152 1106 void (*freepage)(struct page *); ··· 1172 1126 !mapping_exiting(mapping) && !dax_mapping(mapping)) 1173 1127 shadow = workingset_eviction(page, target_memcg); 1174 1128 __delete_from_page_cache(page, shadow); 1175 - xa_unlock_irqrestore(&mapping->i_pages, flags); 1129 + xa_unlock_irq(&mapping->i_pages); 1176 1130 1177 1131 if (freepage != NULL) 1178 1132 freepage(page); ··· 1181 1135 return 1; 1182 1136 1183 1137 cannot_free: 1184 - xa_unlock_irqrestore(&mapping->i_pages, flags); 1138 + xa_unlock_irq(&mapping->i_pages); 1185 1139 return 0; 1186 1140 } ··· 1310 1264 mapping->a_ops->is_dirty_writeback(page, dirty, writeback); 1311 1265 } 1312 1266 1267 + static struct page *alloc_demote_page(struct page *page, unsigned long node) 1268 + { 1269 + struct migration_target_control mtc = { 1270 + /* 1271 + * Allocate from 'node', or fail quickly and quietly. 1272 + * When this happens, 'page' will likely just be discarded 1273 + * instead of migrated. 1274 + */ 1275 + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | 1276 + __GFP_THISNODE | __GFP_NOWARN | 1277 + __GFP_NOMEMALLOC | GFP_NOWAIT, 1278 + .nid = node 1279 + }; 1280 + 1281 + return alloc_migration_target(page, (unsigned long)&mtc); 1282 + } 1283 + 1284 + /* 1285 + * Take pages on @demote_pages and attempt to demote them to 1286 + * another node. Pages which are not demoted are left on 1287 + * @demote_pages. 
1288 + */ 1289 + static unsigned int demote_page_list(struct list_head *demote_pages, 1290 + struct pglist_data *pgdat) 1291 + { 1292 + int target_nid = next_demotion_node(pgdat->node_id); 1293 + unsigned int nr_succeeded; 1294 + int err; 1295 + 1296 + if (list_empty(demote_pages)) 1297 + return 0; 1298 + 1299 + if (target_nid == NUMA_NO_NODE) 1300 + return 0; 1301 + 1302 + /* Demotion ignores all cpuset and mempolicy settings */ 1303 + err = migrate_pages(demote_pages, alloc_demote_page, NULL, 1304 + target_nid, MIGRATE_ASYNC, MR_DEMOTION, 1305 + &nr_succeeded); 1306 + 1307 + if (current_is_kswapd()) 1308 + __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded); 1309 + else 1310 + __count_vm_events(PGDEMOTE_DIRECT, nr_succeeded); 1311 + 1312 + return nr_succeeded; 1313 + } 1314 + 1313 1315 /* 1314 1316 * shrink_page_list() returns the number of reclaimed pages 1315 1317 */ ··· 1369 1275 { 1370 1276 LIST_HEAD(ret_pages); 1371 1277 LIST_HEAD(free_pages); 1278 + LIST_HEAD(demote_pages); 1372 1279 unsigned int nr_reclaimed = 0; 1373 1280 unsigned int pgactivate = 0; 1281 + bool do_demote_pass; 1374 1282 1375 1283 memset(stat, 0, sizeof(*stat)); 1376 1284 cond_resched(); 1285 + do_demote_pass = can_demote(pgdat->node_id, sc); 1377 1286 1287 + retry: 1378 1288 while (!list_empty(page_list)) { 1379 1289 struct address_space *mapping; 1380 1290 struct page *page; ··· 1525 1427 case PAGEREF_RECLAIM: 1526 1428 case PAGEREF_RECLAIM_CLEAN: 1527 1429 ; /* try to reclaim the page below */ 1430 + } 1431 + 1432 + /* 1433 + * Before reclaiming the page, try to relocate 1434 + * its contents to another node. 
1435 + */ 1436 + if (do_demote_pass && 1437 + (thp_migration_supported() || !PageTransHuge(page))) { 1438 + list_add(&page->lru, &demote_pages); 1439 + unlock_page(page); 1440 + continue; 1528 1441 } 1529 1442 1530 1443 /* ··· 1733 1624 /* follow __remove_mapping for reference */ 1734 1625 if (!page_ref_freeze(page, 1)) 1735 1626 goto keep_locked; 1736 - if (PageDirty(page)) { 1737 - page_ref_unfreeze(page, 1); 1738 - goto keep_locked; 1739 - } 1740 - 1627 + /* 1628 + * The page has only one reference left, which is 1629 + * from the isolation. After the caller puts the 1630 + * page back on lru and drops the reference, the 1631 + * page will be freed anyway. It doesn't matter 1632 + * which lru it goes to. So we don't bother checking 1633 + * PageDirty here. 1634 + */ 1741 1635 count_vm_event(PGLAZYFREED); 1742 1636 count_memcg_page_event(page, PGLAZYFREED); 1743 1637 } else if (!mapping || !__remove_mapping(mapping, page, true, ··· 1792 1680 list_add(&page->lru, &ret_pages); 1793 1681 VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page); 1794 1682 } 1683 + /* 'page_list' is always empty here */ 1684 + 1685 + /* Migrate pages selected for demotion */ 1686 + nr_reclaimed += demote_page_list(&demote_pages, pgdat); 1687 + /* Pages that could not be demoted are still in @demote_pages */ 1688 + if (!list_empty(&demote_pages)) { 1689 + /* Pages which failed to demote go back on @page_list for retry: */ 1690 + list_splice_init(&demote_pages, page_list); 1691 + do_demote_pass = false; 1692 + goto retry; 1693 + } 1795 1694 1796 1695 pgactivate = stat->nr_activate[0] + stat->nr_activate[1]; 1797 1696 ··· 1821 1698 { 1822 1699 struct scan_control sc = { 1823 1700 .gfp_mask = GFP_KERNEL, 1824 - .priority = DEF_PRIORITY, 1825 1701 .may_unmap = 1, 1826 1702 }; 1827 1703 struct reclaim_stat stat; ··· 2445 2323 unsigned int noreclaim_flag; 2446 2324 struct scan_control sc = { 2447 2325 .gfp_mask = GFP_KERNEL, 2448 - .priority = DEF_PRIORITY, 2449 2326 .may_writepage = 1, 
2450 2327 .may_unmap = 1, 2451 2328 .may_swap = 1, 2329 + .no_demotion = 1, 2452 2330 }; 2453 2331 2454 2332 noreclaim_flag = memalloc_noreclaim_save(); ··· 2574 2452 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, 2575 2453 unsigned long *nr) 2576 2454 { 2455 + struct pglist_data *pgdat = lruvec_pgdat(lruvec); 2577 2456 struct mem_cgroup *memcg = lruvec_memcg(lruvec); 2578 2457 unsigned long anon_cost, file_cost, total_cost; 2579 2458 int swappiness = mem_cgroup_swappiness(memcg); ··· 2585 2462 enum lru_list lru; 2586 2463 2587 2464 /* If we have no swap space, do not bother scanning anon pages. */ 2588 - if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) { 2465 + if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) { 2589 2466 scan_balance = SCAN_FILE; 2590 2467 goto out; 2591 2468 } ··· 2768 2645 } 2769 2646 } 2770 2647 2648 + /* 2649 + * Anonymous LRU management is a waste if there is 2650 + * ultimately no way to reclaim the memory. 2651 + */ 2652 + static bool can_age_anon_pages(struct pglist_data *pgdat, 2653 + struct scan_control *sc) 2654 + { 2655 + /* Aging the anon LRU is valuable if swap is present: */ 2656 + if (total_swap_pages > 0) 2657 + return true; 2658 + 2659 + /* Also valuable if anon pages can be demoted: */ 2660 + return can_demote(pgdat->node_id, sc); 2661 + } 2662 + 2771 2663 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) 2772 2664 { 2773 2665 unsigned long nr[NR_LRU_LISTS]; ··· 2892 2754 * Even if we did not try to evict anon pages at all, we want to 2893 2755 * rebalance the anon lru active/inactive ratio. 
2894 2756 */ 2895 - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) 2757 + if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) && 2758 + inactive_is_low(lruvec, LRU_INACTIVE_ANON)) 2896 2759 shrink_active_list(SWAP_CLUSTER_MAX, lruvec, 2897 2760 sc, LRU_ACTIVE_ANON); 2898 2761 } ··· 2963 2824 */ 2964 2825 pages_for_compaction = compact_gap(sc->order); 2965 2826 inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE); 2966 - if (get_nr_swap_pages() > 0) 2827 + if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) 2967 2828 inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON); 2968 2829 2969 2830 return inactive_lru_pages > pages_for_compaction; ··· 3037 2898 target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); 3038 2899 3039 2900 again: 2901 + /* 2902 + * Flush the memory cgroup stats, so that we read accurate per-memcg 2903 + * lruvec stats for heuristics. 2904 + */ 2905 + mem_cgroup_flush_stats(); 2906 + 3040 2907 memset(&sc->nr, 0, sizeof(sc->nr)); 3041 2908 3042 2909 nr_reclaimed = sc->nr_reclaimed; ··· 3579 3434 * blocked waiting on the same lock. Instead, throttle for up to a 3580 3435 * second before continuing. 
3581 3436 */ 3582 - if (!(gfp_mask & __GFP_FS)) { 3437 + if (!(gfp_mask & __GFP_FS)) 3583 3438 wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, 3584 3439 allow_direct_reclaim(pgdat), HZ); 3440 + else 3441 + /* Throttle until kswapd wakes the process */ 3442 + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, 3443 + allow_direct_reclaim(pgdat)); 3585 3444 3586 - goto check_pending; 3587 - } 3588 - 3589 - /* Throttle until kswapd wakes the process */ 3590 - wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, 3591 - allow_direct_reclaim(pgdat)); 3592 - 3593 - check_pending: 3594 3445 if (fatal_signal_pending(current)) 3595 3446 return true; 3596 3447 ··· 3724 3583 struct mem_cgroup *memcg; 3725 3584 struct lruvec *lruvec; 3726 3585 3727 - if (!total_swap_pages) 3586 + if (!can_age_anon_pages(pgdat, sc)) 3728 3587 return; 3729 3588 3730 3589 lruvec = mem_cgroup_lruvec(NULL, pgdat); ··· 3953 3812 3954 3813 set_task_reclaim_state(current, &sc.reclaim_state); 3955 3814 psi_memstall_enter(&pflags); 3956 - __fs_reclaim_acquire(); 3815 + __fs_reclaim_acquire(_THIS_IP_); 3957 3816 3958 3817 count_vm_event(PAGEOUTRUN); 3959 3818 ··· 4079 3938 wake_up_all(&pgdat->pfmemalloc_wait); 4080 3939 4081 3940 /* Check if kswapd should be suspending */ 4082 - __fs_reclaim_release(); 3941 + __fs_reclaim_release(_THIS_IP_); 4083 3942 ret = try_to_freeze(); 4084 - __fs_reclaim_acquire(); 3943 + __fs_reclaim_acquire(_THIS_IP_); 4085 3944 if (ret || kthread_should_stop()) 4086 3945 break; 4087 3946 ··· 4133 3992 } 4134 3993 4135 3994 snapshot_refaults(NULL, pgdat); 4136 - __fs_reclaim_release(); 3995 + __fs_reclaim_release(_THIS_IP_); 4137 3996 psi_memstall_leave(&pflags); 4138 3997 set_task_reclaim_state(current, NULL); 4139 3998 ··· 4431 4290 * This kswapd start function will be called by init and node-hot-add. 4432 4291 * On node-hot-add, kswapd will be moved to proper cpus if cpus are hot-added. 
4433 4292 */ 4434 - int kswapd_run(int nid) 4293 + void kswapd_run(int nid) 4435 4294 { 4436 4295 pg_data_t *pgdat = NODE_DATA(nid); 4437 - int ret = 0; 4438 4296 4439 4297 if (pgdat->kswapd) 4440 - return 0; 4298 + return; 4441 4299 4442 4300 pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid); 4443 4301 if (IS_ERR(pgdat->kswapd)) { 4444 4302 /* failure at boot is fatal */ 4445 4303 BUG_ON(system_state < SYSTEM_RUNNING); 4446 4304 pr_err("Failed to start kswapd on node %d\n", nid); 4447 - ret = PTR_ERR(pgdat->kswapd); 4448 4305 pgdat->kswapd = NULL; 4449 4306 } 4450 - return ret; 4451 4307 } 4452 4308 4453 4309 /*
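The demotion gating introduced in mm/vmscan.c above reduces to a small pure predicate, which is easy to model outside the kernel. The sketch below is a hedged userspace rendering of the can_demote()/can_reclaim_anon_pages() logic — the struct and function names here are illustrative stand-ins, not the kernel API, and the kernel state (the global knob, the scan_control flags, the demotion target) is passed in explicitly:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the relevant scan_control fields. */
struct scan_ctl {
	bool no_demotion;	/* caller opted out (e.g. hibernation shrink) */
	bool cgroup_reclaim;	/* memcg reclaim: demotion is pointless */
};

/* Mirrors can_demote(): demotion needs the global sysfs knob enabled,
 * a non-memcg reclaim context that did not opt out, and a valid target. */
static bool can_demote_model(bool numa_demotion_enabled,
			     const struct scan_ctl *sc, int next_demotion_node)
{
	if (!numa_demotion_enabled)
		return false;
	if (sc && (sc->no_demotion || sc->cgroup_reclaim))
		return false;
	return next_demotion_node != NUMA_NO_NODE;
}

/* Mirrors can_reclaim_anon_pages(): anon pages are worth scanning when
 * swap has room or, failing that, when the node can demote. */
static bool can_reclaim_anon_model(long nr_swap_pages,
				   bool numa_demotion_enabled,
				   const struct scan_ctl *sc,
				   int next_demotion_node)
{
	if (nr_swap_pages > 0)
		return true;
	return can_demote_model(numa_demotion_enabled, sc, next_demotion_node);
}
```

The design point survives the simplification: on a tiered-memory system with no (or full) swap, anonymous LRU scanning remains useful as long as the node still has a lower tier to demote into.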
+8 -17
mm/vmstat.c
··· 204 204 * 205 205 * Some sample thresholds: 206 206 * 207 - * Threshold Processors (fls) Zonesize fls(mem+1) 207 + * Threshold Processors (fls) Zonesize fls(mem)+1 208 208 * ------------------------------------------------------------------ 209 209 * 8 1 1 0.9-1 GB 4 210 210 * 16 2 2 0.9-1 GB 4 ··· 1217 1217 "pgreuse", 1218 1218 "pgsteal_kswapd", 1219 1219 "pgsteal_direct", 1220 + "pgdemote_kswapd", 1221 + "pgdemote_direct", 1220 1222 "pgscan_kswapd", 1221 1223 "pgscan_direct", 1222 1224 "pgscan_direct_throttle", ··· 1454 1452 } 1455 1453 1456 1454 /* Print out the free pages at each order for each migratetype */ 1457 - static int pagetypeinfo_showfree(struct seq_file *m, void *arg) 1455 + static void pagetypeinfo_showfree(struct seq_file *m, void *arg) 1458 1456 { 1459 1457 int order; 1460 1458 pg_data_t *pgdat = (pg_data_t *)arg; ··· 1466 1464 seq_putc(m, '\n'); 1467 1465 1468 1466 walk_zones_in_node(m, pgdat, true, false, pagetypeinfo_showfree_print); 1469 - 1470 - return 0; 1471 1467 } 1472 1468 1473 1469 static void pagetypeinfo_showblockcount_print(struct seq_file *m, ··· 1501 1501 } 1502 1502 1503 1503 /* Print out the number of pageblocks for each migratetype */ 1504 - static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg) 1504 + static void pagetypeinfo_showblockcount(struct seq_file *m, void *arg) 1505 1505 { 1506 1506 int mtype; 1507 1507 pg_data_t *pgdat = (pg_data_t *)arg; ··· 1512 1512 seq_putc(m, '\n'); 1513 1513 walk_zones_in_node(m, pgdat, true, false, 1514 1514 pagetypeinfo_showblockcount_print); 1515 - 1516 - return 0; 1517 1515 } 1518 1516 1519 1517 /* ··· 1872 1874 } 1873 1875 1874 1876 /* 1875 - * Switch off vmstat processing and then fold all the remaining differentials 1876 - * until the diffs stay at zero. The function is used by NOHZ and can only be 1877 - * invoked when tick processing is not active. 1878 - */ 1879 - /* 1880 1877 * Check if the diffs for a certain cpu indicate that 1881 1878 * an update is needed. 
1882 1879 */ ··· 1887 1894 /* 1888 1895 * The fast way of checking if there are any vmstat diffs. 1889 1896 */ 1890 - if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS * 1891 - sizeof(pzstats->vm_stat_diff[0]))) 1897 + if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff))) 1892 1898 return true; 1893 1899 1894 1900 if (last_pgdat == zone->zone_pgdat) 1895 1901 continue; 1896 1902 last_pgdat = zone->zone_pgdat; 1897 1903 n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu); 1898 - if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS * 1899 - sizeof(n->vm_node_stat_diff[0]))) 1900 - return true; 1904 + if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff))) 1905 + return true; 1901 1906 } 1902 1907 return false; 1903 1908 }
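The memchr_inv() cleanup in mm/vmstat.c above works because vm_stat_diff and vm_node_stat_diff are true array members, so sizeof on the member already equals the element count times the element size; the hand-written `NR_VM_ZONE_STAT_ITEMS * sizeof(...[0])` product was redundant. A small standalone C check of that identity (the struct here is a made-up miniature, not the kernel's per-cpu struct, and plain memcmp() stands in for the kernel's memchr_inv()):

```c
#include <assert.h>
#include <string.h>

#define N_ITEMS 8

struct stats {
	long diff[N_ITEMS];	/* a true array member, like vm_stat_diff */
};

/* Return nonzero when every element of s->diff is zero. */
static int all_zero(const struct stats *s)
{
	/* sizeof(s->diff) covers the whole array, i.e. it equals
	 * N_ITEMS * sizeof(s->diff[0]) -- exactly the product the
	 * old kernel code spelled out by hand. */
	static const struct stats zeroes;

	return memcmp(s->diff, zeroes.diff, sizeof(s->diff)) == 0;
}
```

Note this only holds for array members; if `diff` were a pointer, sizeof would yield the pointer size instead, which is why the simplification is safe here but not in general.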
+9 -4
security/tomoyo/domain.c
··· 897 897 struct tomoyo_page_dump *dump) 898 898 { 899 899 struct page *page; 900 + #ifdef CONFIG_MMU 901 + int ret; 902 + #endif 900 903 901 904 /* dump->data is released by tomoyo_find_next_domain(). */ 902 905 if (!dump->data) { ··· 912 909 /* 913 910 * This is called at execve() time in order to dig around 914 911 * in the argv/environment of the new process 915 - * (represented by bprm). 'current' is the process doing 916 - the execve(). 912 + * (represented by bprm). 917 913 */ 918 - if (get_user_pages_remote(bprm->mm, pos, 1, 919 - FOLL_FORCE, &page, NULL, NULL) <= 0) 914 + mmap_read_lock(bprm->mm); 915 + ret = get_user_pages_remote(bprm->mm, pos, 1, 916 + FOLL_FORCE, &page, NULL, NULL); 917 + mmap_read_unlock(bprm->mm); 918 + if (ret <= 0) 920 919 return false; 921 920 #else 922 921 page = bprm->page[pos / PAGE_SIZE];
-1
tools/testing/scatterlist/linux/mm.h
··· 127 127 #define kmemleak_free(a) 128 128 129 129 #define PageSlab(p) (0) 130 - #define flush_kernel_dcache_page(p) 131 130 132 131 #define MAX_ERRNO 4095 133 132
+1
tools/testing/selftests/vm/.gitignore
··· 27 27 memfd_secret 28 28 local_config.* 29 29 split_huge_page_test 30 + ksm_tests
+3
tools/testing/selftests/vm/Makefile
··· 45 45 TEST_GEN_FILES += transhuge-stress 46 46 TEST_GEN_FILES += userfaultfd 47 47 TEST_GEN_FILES += split_huge_page_test 48 + TEST_GEN_FILES += ksm_tests 48 49 49 50 ifeq ($(MACHINE),x86_64) 50 51 CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32) ··· 145 144 146 145 # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty. 147 146 $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS) 147 + 148 + $(OUTPUT)/ksm_tests: LDLIBS += -lnuma 148 149 149 150 local_config.mk local_config.h: check_config.sh 150 151 /bin/sh ./check_config.sh $(CC)
+4 -1
tools/testing/selftests/vm/charge_reserved_hugetlb.sh
··· 1 1 #!/bin/sh 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 + # Kselftest framework requirement - SKIP code is 4. 5 + ksft_skip=4 6 + 4 7 set -e 5 8 6 9 if [[ $(id -u) -ne 0 ]]; then 7 10 echo "This test must be run as root. Skipping..." 8 - exit 0 11 + exit $ksft_skip 9 12 fi 10 13 11 14 fault_limit_file=limit_in_bytes
+4 -1
tools/testing/selftests/vm/hugetlb_reparenting_test.sh
··· 1 1 #!/bin/bash 2 2 # SPDX-License-Identifier: GPL-2.0 3 3 4 + # Kselftest framework requirement - SKIP code is 4. 5 + ksft_skip=4 6 + 4 7 set -e 5 8 6 9 if [[ $(id -u) -ne 0 ]]; then 7 10 echo "This test must be run as root. Skipping..." 8 - exit 0 11 + exit $ksft_skip 9 12 fi 10 13 11 14 usage_file=usage_in_bytes
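Both hugetlb scripts above now exit with the kselftest framework's reserved SKIP status (4) instead of 0 when not run as root, so a harness can tell "skipped" apart from "passed". The same convention in C, as a small sketch (the helper name is made up for testability; the numeric codes match the ones defined in tools/testing/selftests/kselftest.h):

```c
#include <assert.h>

/* kselftest framework exit codes */
#define KSFT_PASS 0
#define KSFT_FAIL 1
#define KSFT_SKIP 4

/* Decide the exit status for a root-only test: an unprivileged run is
 * reported as a skip, never silently as a pass. */
static int root_only_test_status(unsigned int uid, int test_passed)
{
	if (uid != 0)
		return KSFT_SKIP;	/* cannot run: skip, don't fake a pass */
	return test_passed ? KSFT_PASS : KSFT_FAIL;
}
```

In a real selftest the uid would come from geteuid() and the result fed straight to exit(), e.g. `exit(root_only_test_status(geteuid(), ok));`.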
+662
tools/testing/selftests/vm/ksm_tests.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + #include <sys/mman.h> 4 + #include <stdbool.h> 5 + #include <time.h> 6 + #include <string.h> 7 + #include <numa.h> 8 + 9 + #include "../kselftest.h" 10 + #include "../../../../include/vdso/time64.h" 11 + 12 + #define KSM_SYSFS_PATH "/sys/kernel/mm/ksm/" 13 + #define KSM_FP(s) (KSM_SYSFS_PATH s) 14 + #define KSM_SCAN_LIMIT_SEC_DEFAULT 120 15 + #define KSM_PAGE_COUNT_DEFAULT 10l 16 + #define KSM_PROT_STR_DEFAULT "rw" 17 + #define KSM_USE_ZERO_PAGES_DEFAULT false 18 + #define KSM_MERGE_ACROSS_NODES_DEFAULT true 19 + #define MB (1ul << 20) 20 + 21 + struct ksm_sysfs { 22 + unsigned long max_page_sharing; 23 + unsigned long merge_across_nodes; 24 + unsigned long pages_to_scan; 25 + unsigned long run; 26 + unsigned long sleep_millisecs; 27 + unsigned long stable_node_chains_prune_millisecs; 28 + unsigned long use_zero_pages; 29 + }; 30 + 31 + enum ksm_test_name { 32 + CHECK_KSM_MERGE, 33 + CHECK_KSM_UNMERGE, 34 + CHECK_KSM_ZERO_PAGE_MERGE, 35 + CHECK_KSM_NUMA_MERGE, 36 + KSM_MERGE_TIME, 37 + KSM_COW_TIME 38 + }; 39 + 40 + static int ksm_write_sysfs(const char *file_path, unsigned long val) 41 + { 42 + FILE *f = fopen(file_path, "w"); 43 + 44 + if (!f) { 45 + fprintf(stderr, "f %s\n", file_path); 46 + perror("fopen"); 47 + return 1; 48 + } 49 + if (fprintf(f, "%lu", val) < 0) { 50 + perror("fprintf"); 51 + return 1; 52 + } 53 + fclose(f); 54 + 55 + return 0; 56 + } 57 + 58 + static int ksm_read_sysfs(const char *file_path, unsigned long *val) 59 + { 60 + FILE *f = fopen(file_path, "r"); 61 + 62 + if (!f) { 63 + fprintf(stderr, "f %s\n", file_path); 64 + perror("fopen"); 65 + return 1; 66 + } 67 + if (fscanf(f, "%lu", val) != 1) { 68 + perror("fscanf"); 69 + return 1; 70 + } 71 + fclose(f); 72 + 73 + return 0; 74 + } 75 + 76 + static int str_to_prot(char *prot_str) 77 + { 78 + int prot = 0; 79 + 80 + if ((strchr(prot_str, 'r')) != NULL) 81 + prot |= PROT_READ; 82 + if ((strchr(prot_str, 'w')) != NULL) 83 + prot |= 
PROT_WRITE; 84 + if ((strchr(prot_str, 'x')) != NULL) 85 + prot |= PROT_EXEC; 86 + 87 + return prot; 88 + } 89 + 90 + static void print_help(void) 91 + { 92 + printf("usage: ksm_tests [-h] <test type> [-a prot] [-p page_count] [-l timeout]\n" 93 + "[-z use_zero_pages] [-m merge_across_nodes] [-s size]\n"); 94 + 95 + printf("Supported <test type>:\n" 96 + " -M (page merging)\n" 97 + " -Z (zero pages merging)\n" 98 + " -N (merging of pages in different NUMA nodes)\n" 99 + " -U (page unmerging)\n" 100 + " -P evaluate merging time and speed.\n" 101 + " For this test, the size of duplicated memory area (in MiB)\n" 102 + " must be provided using -s option\n" 103 + " -C evaluate the time required to break COW of merged pages.\n\n"); 104 + 105 + printf(" -a: specify the access protections of pages.\n" 106 + " <prot> must be of the form [rwx].\n" 107 + " Default: %s\n", KSM_PROT_STR_DEFAULT); 108 + printf(" -p: specify the number of pages to test.\n" 109 + " Default: %ld\n", KSM_PAGE_COUNT_DEFAULT); 110 + printf(" -l: limit the maximum running time (in seconds) for a test.\n" 111 + " Default: %d seconds\n", KSM_SCAN_LIMIT_SEC_DEFAULT); 112 + printf(" -z: change use_zero_pages tunable\n" 113 + " Default: %d\n", KSM_USE_ZERO_PAGES_DEFAULT); 114 + printf(" -m: change merge_across_nodes tunable\n" 115 + " Default: %d\n", KSM_MERGE_ACROSS_NODES_DEFAULT); 116 + printf(" -s: the size of duplicated memory area (in MiB)\n"); 117 + 118 + exit(0); 119 + } 120 + 121 + static void *allocate_memory(void *ptr, int prot, int mapping, char data, size_t map_size) 122 + { 123 + void *map_ptr = mmap(ptr, map_size, PROT_WRITE, mapping, -1, 0); 124 + 125 + if (!map_ptr) { 126 + perror("mmap"); 127 + return NULL; 128 + } 129 + memset(map_ptr, data, map_size); 130 + if (mprotect(map_ptr, map_size, prot)) { 131 + perror("mprotect"); 132 + munmap(map_ptr, map_size); 133 + return NULL; 134 + } 135 + 136 + return map_ptr; 137 + } 138 + 139 + static int ksm_do_scan(int scan_count, struct timespec 
start_time, int timeout) 140 + { 141 + struct timespec cur_time; 142 + unsigned long cur_scan, init_scan; 143 + 144 + if (ksm_read_sysfs(KSM_FP("full_scans"), &init_scan)) 145 + return 1; 146 + cur_scan = init_scan; 147 + 148 + while (cur_scan < init_scan + scan_count) { 149 + if (ksm_read_sysfs(KSM_FP("full_scans"), &cur_scan)) 150 + return 1; 151 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &cur_time)) { 152 + perror("clock_gettime"); 153 + return 1; 154 + } 155 + if ((cur_time.tv_sec - start_time.tv_sec) > timeout) { 156 + printf("Scan time limit exceeded\n"); 157 + return 1; 158 + } 159 + } 160 + 161 + return 0; 162 + } 163 + 164 + static int ksm_merge_pages(void *addr, size_t size, struct timespec start_time, int timeout) 165 + { 166 + if (madvise(addr, size, MADV_MERGEABLE)) { 167 + perror("madvise"); 168 + return 1; 169 + } 170 + if (ksm_write_sysfs(KSM_FP("run"), 1)) 171 + return 1; 172 + 173 + /* Since merging occurs only after 2 scans, make sure to get at least 2 full scans */ 174 + if (ksm_do_scan(2, start_time, timeout)) 175 + return 1; 176 + 177 + return 0; 178 + } 179 + 180 + static bool assert_ksm_pages_count(long dupl_page_count) 181 + { 182 + unsigned long max_page_sharing, pages_sharing, pages_shared; 183 + 184 + if (ksm_read_sysfs(KSM_FP("pages_shared"), &pages_shared) || 185 + ksm_read_sysfs(KSM_FP("pages_sharing"), &pages_sharing) || 186 + ksm_read_sysfs(KSM_FP("max_page_sharing"), &max_page_sharing)) 187 + return false; 188 + 189 + /* 190 + * Since there must be at least 2 pages for merging and 1 page can be 191 + * shared with the limited number of pages (max_page_sharing), sometimes 192 + * there are 'leftover' pages that cannot be merged. For example, if there 193 + * are 11 pages and max_page_sharing = 10, then only 10 pages will be 194 + * merged and the 11th page won't be affected. 
As a result, when the number 195 + * of duplicate pages is divided by max_page_sharing and the remainder is 1, 196 + * pages_shared and pages_sharing values will be equal between dupl_page_count 197 + * and dupl_page_count - 1. 198 + */ 199 + if (dupl_page_count % max_page_sharing == 1 || dupl_page_count % max_page_sharing == 0) { 200 + if (pages_shared == dupl_page_count / max_page_sharing && 201 + pages_sharing == pages_shared * (max_page_sharing - 1)) 202 + return true; 203 + } else { 204 + if (pages_shared == (dupl_page_count / max_page_sharing + 1) && 205 + pages_sharing == dupl_page_count - pages_shared) 206 + return true; 207 + } 208 + 209 + return false; 210 + } 211 + 212 + static int ksm_save_def(struct ksm_sysfs *ksm_sysfs) 213 + { 214 + if (ksm_read_sysfs(KSM_FP("max_page_sharing"), &ksm_sysfs->max_page_sharing) || 215 + ksm_read_sysfs(KSM_FP("merge_across_nodes"), &ksm_sysfs->merge_across_nodes) || 216 + ksm_read_sysfs(KSM_FP("sleep_millisecs"), &ksm_sysfs->sleep_millisecs) || 217 + ksm_read_sysfs(KSM_FP("pages_to_scan"), &ksm_sysfs->pages_to_scan) || 218 + ksm_read_sysfs(KSM_FP("run"), &ksm_sysfs->run) || 219 + ksm_read_sysfs(KSM_FP("stable_node_chains_prune_millisecs"), 220 + &ksm_sysfs->stable_node_chains_prune_millisecs) || 221 + ksm_read_sysfs(KSM_FP("use_zero_pages"), &ksm_sysfs->use_zero_pages)) 222 + return 1; 223 + 224 + return 0; 225 + } 226 + 227 + static int ksm_restore(struct ksm_sysfs *ksm_sysfs) 228 + { 229 + if (ksm_write_sysfs(KSM_FP("max_page_sharing"), ksm_sysfs->max_page_sharing) || 230 + ksm_write_sysfs(KSM_FP("merge_across_nodes"), ksm_sysfs->merge_across_nodes) || 231 + ksm_write_sysfs(KSM_FP("pages_to_scan"), ksm_sysfs->pages_to_scan) || 232 + ksm_write_sysfs(KSM_FP("run"), ksm_sysfs->run) || 233 + ksm_write_sysfs(KSM_FP("sleep_millisecs"), ksm_sysfs->sleep_millisecs) || 234 + ksm_write_sysfs(KSM_FP("stable_node_chains_prune_millisecs"), 235 + ksm_sysfs->stable_node_chains_prune_millisecs) || 236 + 
ksm_write_sysfs(KSM_FP("use_zero_pages"), ksm_sysfs->use_zero_pages)) 237 + return 1; 238 + 239 + return 0; 240 + } 241 + 242 + static int check_ksm_merge(int mapping, int prot, long page_count, int timeout, size_t page_size) 243 + { 244 + void *map_ptr; 245 + struct timespec start_time; 246 + 247 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 248 + perror("clock_gettime"); 249 + return KSFT_FAIL; 250 + } 251 + 252 + /* fill pages with the same data and merge them */ 253 + map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count); 254 + if (!map_ptr) 255 + return KSFT_FAIL; 256 + 257 + if (ksm_merge_pages(map_ptr, page_size * page_count, start_time, timeout)) 258 + goto err_out; 259 + 260 + /* verify that the right number of pages are merged */ 261 + if (assert_ksm_pages_count(page_count)) { 262 + printf("OK\n"); 263 + munmap(map_ptr, page_size * page_count); 264 + return KSFT_PASS; 265 + } 266 + 267 + err_out: 268 + printf("Not OK\n"); 269 + munmap(map_ptr, page_size * page_count); 270 + return KSFT_FAIL; 271 + } 272 + 273 + static int check_ksm_unmerge(int mapping, int prot, int timeout, size_t page_size) 274 + { 275 + void *map_ptr; 276 + struct timespec start_time; 277 + int page_count = 2; 278 + 279 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 280 + perror("clock_gettime"); 281 + return KSFT_FAIL; 282 + } 283 + 284 + /* fill pages with the same data and merge them */ 285 + map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count); 286 + if (!map_ptr) 287 + return KSFT_FAIL; 288 + 289 + if (ksm_merge_pages(map_ptr, page_size * page_count, start_time, timeout)) 290 + goto err_out; 291 + 292 + /* change 1 byte in each of the 2 pages -- KSM must automatically unmerge them */ 293 + memset(map_ptr, '-', 1); 294 + memset(map_ptr + page_size, '+', 1); 295 + 296 + /* get at least 1 scan, so KSM can detect that the pages were modified */ 297 + if (ksm_do_scan(1, start_time, timeout)) 298 + goto err_out; 299 + 300 
+ /* check that unmerging was successful and 0 pages are currently merged */ 301 + if (assert_ksm_pages_count(0)) { 302 + printf("OK\n"); 303 + munmap(map_ptr, page_size * page_count); 304 + return KSFT_PASS; 305 + } 306 + 307 + err_out: 308 + printf("Not OK\n"); 309 + munmap(map_ptr, page_size * page_count); 310 + return KSFT_FAIL; 311 + } 312 + 313 + static int check_ksm_zero_page_merge(int mapping, int prot, long page_count, int timeout, 314 + bool use_zero_pages, size_t page_size) 315 + { 316 + void *map_ptr; 317 + struct timespec start_time; 318 + 319 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 320 + perror("clock_gettime"); 321 + return KSFT_FAIL; 322 + } 323 + 324 + if (ksm_write_sysfs(KSM_FP("use_zero_pages"), use_zero_pages)) 325 + return KSFT_FAIL; 326 + 327 + /* fill pages with zero and try to merge them */ 328 + map_ptr = allocate_memory(NULL, prot, mapping, 0, page_size * page_count); 329 + if (!map_ptr) 330 + return KSFT_FAIL; 331 + 332 + if (ksm_merge_pages(map_ptr, page_size * page_count, start_time, timeout)) 333 + goto err_out; 334 + 335 + /* 336 + * verify that the right number of pages are merged: 337 + * 1) if use_zero_pages is set to 1, empty pages are merged 338 + * with the kernel zero page instead of with each other; 339 + * 2) if use_zero_pages is set to 0, empty pages are not treated specially 340 + * and merged as usual. 
341 + */ 342 + if (use_zero_pages && !assert_ksm_pages_count(0)) 343 + goto err_out; 344 + else if (!use_zero_pages && !assert_ksm_pages_count(page_count)) 345 + goto err_out; 346 + 347 + printf("OK\n"); 348 + munmap(map_ptr, page_size * page_count); 349 + return KSFT_PASS; 350 + 351 + err_out: 352 + printf("Not OK\n"); 353 + munmap(map_ptr, page_size * page_count); 354 + return KSFT_FAIL; 355 + } 356 + 357 + static int check_ksm_numa_merge(int mapping, int prot, int timeout, bool merge_across_nodes, 358 + size_t page_size) 359 + { 360 + void *numa1_map_ptr, *numa2_map_ptr; 361 + struct timespec start_time; 362 + int page_count = 2; 363 + 364 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 365 + perror("clock_gettime"); 366 + return KSFT_FAIL; 367 + } 368 + 369 + if (numa_available() < 0) { 370 + perror("NUMA support not enabled"); 371 + return KSFT_SKIP; 372 + } 373 + if (numa_max_node() < 1) { 374 + printf("At least 2 NUMA nodes must be available\n"); 375 + return KSFT_SKIP; 376 + } 377 + if (ksm_write_sysfs(KSM_FP("merge_across_nodes"), merge_across_nodes)) 378 + return KSFT_FAIL; 379 + 380 + /* allocate 2 pages in 2 different NUMA nodes and fill them with the same data */ 381 + numa1_map_ptr = numa_alloc_onnode(page_size, 0); 382 + numa2_map_ptr = numa_alloc_onnode(page_size, 1); 383 + if (!numa1_map_ptr || !numa2_map_ptr) { 384 + perror("numa_alloc_onnode"); 385 + return KSFT_FAIL; 386 + } 387 + 388 + memset(numa1_map_ptr, '*', page_size); 389 + memset(numa2_map_ptr, '*', page_size); 390 + 391 + /* try to merge the pages */ 392 + if (ksm_merge_pages(numa1_map_ptr, page_size, start_time, timeout) || 393 + ksm_merge_pages(numa2_map_ptr, page_size, start_time, timeout)) 394 + goto err_out; 395 + 396 + /* 397 + * verify that the right number of pages are merged: 398 + * 1) if merge_across_nodes was enabled, 2 duplicate pages will be merged; 399 + * 2) if merge_across_nodes = 0, there must be 0 merged pages, since there is 400 + * only 1 unique page in 
each node and they can't be shared. 401 + */ 402 + if (merge_across_nodes && !assert_ksm_pages_count(page_count)) 403 + goto err_out; 404 + else if (!merge_across_nodes && !assert_ksm_pages_count(0)) 405 + goto err_out; 406 + 407 + numa_free(numa1_map_ptr, page_size); 408 + numa_free(numa2_map_ptr, page_size); 409 + printf("OK\n"); 410 + return KSFT_PASS; 411 + 412 + err_out: 413 + numa_free(numa1_map_ptr, page_size); 414 + numa_free(numa2_map_ptr, page_size); 415 + printf("Not OK\n"); 416 + return KSFT_FAIL; 417 + } 418 + 419 + static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size) 420 + { 421 + void *map_ptr; 422 + struct timespec start_time, end_time; 423 + unsigned long scan_time_ns; 424 + 425 + map_size *= MB; 426 + 427 + map_ptr = allocate_memory(NULL, prot, mapping, '*', map_size); 428 + if (!map_ptr) 429 + return KSFT_FAIL; 430 + 431 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 432 + perror("clock_gettime"); 433 + goto err_out; 434 + } 435 + if (ksm_merge_pages(map_ptr, map_size, start_time, timeout)) 436 + goto err_out; 437 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &end_time)) { 438 + perror("clock_gettime"); 439 + goto err_out; 440 + } 441 + 442 + scan_time_ns = (end_time.tv_sec - start_time.tv_sec) * NSEC_PER_SEC + 443 + (end_time.tv_nsec - start_time.tv_nsec); 444 + 445 + printf("Total size: %lu MiB\n", map_size / MB); 446 + printf("Total time: %ld.%09ld s\n", scan_time_ns / NSEC_PER_SEC, 447 + scan_time_ns % NSEC_PER_SEC); 448 + printf("Average speed: %.3f MiB/s\n", (map_size / MB) / 449 + ((double)scan_time_ns / NSEC_PER_SEC)); 450 + 451 + munmap(map_ptr, map_size); 452 + return KSFT_PASS; 453 + 454 + err_out: 455 + printf("Not OK\n"); 456 + munmap(map_ptr, map_size); 457 + return KSFT_FAIL; 458 + } 459 + 460 + static int ksm_cow_time(int mapping, int prot, int timeout, size_t page_size) 461 + { 462 + void *map_ptr; 463 + struct timespec start_time, end_time; 464 + unsigned long cow_time_ns; 465 + 466 + /* page_count 
must be less than 2*page_size */ 467 + size_t page_count = 4000; 468 + 469 + map_ptr = allocate_memory(NULL, prot, mapping, '*', page_size * page_count); 470 + if (!map_ptr) 471 + return KSFT_FAIL; 472 + 473 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 474 + perror("clock_gettime"); 475 + return KSFT_FAIL; 476 + } 477 + for (size_t i = 0; i < page_count - 1; i = i + 2) 478 + memset(map_ptr + page_size * i, '-', 1); 479 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &end_time)) { 480 + perror("clock_gettime"); 481 + return KSFT_FAIL; 482 + } 483 + 484 + cow_time_ns = (end_time.tv_sec - start_time.tv_sec) * NSEC_PER_SEC + 485 + (end_time.tv_nsec - start_time.tv_nsec); 486 + 487 + printf("Total size: %lu MiB\n\n", (page_size * page_count) / MB); 488 + printf("Not merged pages:\n"); 489 + printf("Total time: %ld.%09ld s\n", cow_time_ns / NSEC_PER_SEC, 490 + cow_time_ns % NSEC_PER_SEC); 491 + printf("Average speed: %.3f MiB/s\n\n", ((page_size * (page_count / 2)) / MB) / 492 + ((double)cow_time_ns / NSEC_PER_SEC)); 493 + 494 + /* Create 2000 pairs of duplicate pages */ 495 + for (size_t i = 0; i < page_count - 1; i = i + 2) { 496 + memset(map_ptr + page_size * i, '+', i / 2 + 1); 497 + memset(map_ptr + page_size * (i + 1), '+', i / 2 + 1); 498 + } 499 + if (ksm_merge_pages(map_ptr, page_size * page_count, start_time, timeout)) 500 + goto err_out; 501 + 502 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) { 503 + perror("clock_gettime"); 504 + goto err_out; 505 + } 506 + for (size_t i = 0; i < page_count - 1; i = i + 2) 507 + memset(map_ptr + page_size * i, '-', 1); 508 + if (clock_gettime(CLOCK_MONOTONIC_RAW, &end_time)) { 509 + perror("clock_gettime"); 510 + goto err_out; 511 + } 512 + 513 + cow_time_ns = (end_time.tv_sec - start_time.tv_sec) * NSEC_PER_SEC + 514 + (end_time.tv_nsec - start_time.tv_nsec); 515 + 516 + printf("Merged pages:\n"); 517 + printf("Total time: %ld.%09ld s\n", cow_time_ns / NSEC_PER_SEC, 518 + cow_time_ns % NSEC_PER_SEC); 519 + 
printf("Average speed: %.3f MiB/s\n", ((page_size * (page_count / 2)) / MB) / 520 + ((double)cow_time_ns / NSEC_PER_SEC)); 521 + 522 + munmap(map_ptr, page_size * page_count); 523 + return KSFT_PASS; 524 + 525 + err_out: 526 + printf("Not OK\n"); 527 + munmap(map_ptr, page_size * page_count); 528 + return KSFT_FAIL; 529 + } 530 + 531 + int main(int argc, char *argv[]) 532 + { 533 + int ret, opt; 534 + int prot = 0; 535 + int ksm_scan_limit_sec = KSM_SCAN_LIMIT_SEC_DEFAULT; 536 + long page_count = KSM_PAGE_COUNT_DEFAULT; 537 + size_t page_size = sysconf(_SC_PAGESIZE); 538 + struct ksm_sysfs ksm_sysfs_old; 539 + int test_name = CHECK_KSM_MERGE; 540 + bool use_zero_pages = KSM_USE_ZERO_PAGES_DEFAULT; 541 + bool merge_across_nodes = KSM_MERGE_ACROSS_NODES_DEFAULT; 542 + long size_MB = 0; 543 + 544 + while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPC")) != -1) { 545 + switch (opt) { 546 + case 'a': 547 + prot = str_to_prot(optarg); 548 + break; 549 + case 'p': 550 + page_count = atol(optarg); 551 + if (page_count <= 0) { 552 + printf("The number of pages must be greater than 0\n"); 553 + return KSFT_FAIL; 554 + } 555 + break; 556 + case 'l': 557 + ksm_scan_limit_sec = atoi(optarg); 558 + if (ksm_scan_limit_sec <= 0) { 559 + printf("Timeout value must be greater than 0\n"); 560 + return KSFT_FAIL; 561 + } 562 + break; 563 + case 'h': 564 + print_help(); 565 + break; 566 + case 'z': 567 + if (strcmp(optarg, "0") == 0) 568 + use_zero_pages = 0; 569 + else 570 + use_zero_pages = 1; 571 + break; 572 + case 'm': 573 + if (strcmp(optarg, "0") == 0) 574 + merge_across_nodes = 0; 575 + else 576 + merge_across_nodes = 1; 577 + break; 578 + case 's': 579 + size_MB = atoi(optarg); 580 + if (size_MB <= 0) { 581 + printf("Size must be greater than 0\n"); 582 + return KSFT_FAIL; 583 + } 584 + case 'M': 585 + break; 586 + case 'U': 587 + test_name = CHECK_KSM_UNMERGE; 588 + break; 589 + case 'Z': 590 + test_name = CHECK_KSM_ZERO_PAGE_MERGE; 591 + break; 592 + case 'N': 593 + 
test_name = CHECK_KSM_NUMA_MERGE; 594 + break; 595 + case 'P': 596 + test_name = KSM_MERGE_TIME; 597 + break; 598 + case 'C': 599 + test_name = KSM_COW_TIME; 600 + break; 601 + default: 602 + return KSFT_FAIL; 603 + } 604 + } 605 + 606 + if (prot == 0) 607 + prot = str_to_prot(KSM_PROT_STR_DEFAULT); 608 + 609 + if (access(KSM_SYSFS_PATH, F_OK)) { 610 + printf("Config KSM not enabled\n"); 611 + return KSFT_SKIP; 612 + } 613 + 614 + if (ksm_save_def(&ksm_sysfs_old)) { 615 + printf("Cannot save default tunables\n"); 616 + return KSFT_FAIL; 617 + } 618 + 619 + if (ksm_write_sysfs(KSM_FP("run"), 2) || 620 + ksm_write_sysfs(KSM_FP("sleep_millisecs"), 0) || 621 + ksm_write_sysfs(KSM_FP("merge_across_nodes"), 1) || 622 + ksm_write_sysfs(KSM_FP("pages_to_scan"), page_count)) 623 + return KSFT_FAIL; 624 + 625 + switch (test_name) { 626 + case CHECK_KSM_MERGE: 627 + ret = check_ksm_merge(MAP_PRIVATE | MAP_ANONYMOUS, prot, page_count, 628 + ksm_scan_limit_sec, page_size); 629 + break; 630 + case CHECK_KSM_UNMERGE: 631 + ret = check_ksm_unmerge(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec, 632 + page_size); 633 + break; 634 + case CHECK_KSM_ZERO_PAGE_MERGE: 635 + ret = check_ksm_zero_page_merge(MAP_PRIVATE | MAP_ANONYMOUS, prot, page_count, 636 + ksm_scan_limit_sec, use_zero_pages, page_size); 637 + break; 638 + case CHECK_KSM_NUMA_MERGE: 639 + ret = check_ksm_numa_merge(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec, 640 + merge_across_nodes, page_size); 641 + break; 642 + case KSM_MERGE_TIME: 643 + if (size_MB == 0) { 644 + printf("Option '-s' is required.\n"); 645 + return KSFT_FAIL; 646 + } 647 + ret = ksm_merge_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec, 648 + size_MB); 649 + break; 650 + case KSM_COW_TIME: 651 + ret = ksm_cow_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec, 652 + page_size); 653 + break; 654 + } 655 + 656 + if (ksm_restore(&ksm_sysfs_old)) { 657 + printf("Cannot restore default tunables\n"); 658 + return 
KSFT_FAIL; 659 + } 660 + 661 + return ret; 662 + }
+1 -1
tools/testing/selftests/vm/mlock-random-test.c
··· 70 70 } 71 71 } 72 72 73 - perror("cann't parse VmLck in /proc/self/status\n"); 73 + perror("cannot parse VmLck in /proc/self/status\n"); 74 74 fclose(f); 75 75 return -1; 76 76 }
+96
tools/testing/selftests/vm/run_vmtests.sh
··· 377 377 exitcode=1 378 378 fi 379 379 380 + echo "-------------------------------------------------------" 381 + echo "running KSM MADV_MERGEABLE test with 10 identical pages" 382 + echo "-------------------------------------------------------" 383 + ./ksm_tests -M -p 10 384 + ret_val=$? 385 + 386 + if [ $ret_val -eq 0 ]; then 387 + echo "[PASS]" 388 + elif [ $ret_val -eq $ksft_skip ]; then 389 + echo "[SKIP]" 390 + exitcode=$ksft_skip 391 + else 392 + echo "[FAIL]" 393 + exitcode=1 394 + fi 395 + 396 + echo "------------------------" 397 + echo "running KSM unmerge test" 398 + echo "------------------------" 399 + ./ksm_tests -U 400 + ret_val=$? 401 + 402 + if [ $ret_val -eq 0 ]; then 403 + echo "[PASS]" 404 + elif [ $ret_val -eq $ksft_skip ]; then 405 + echo "[SKIP]" 406 + exitcode=$ksft_skip 407 + else 408 + echo "[FAIL]" 409 + exitcode=1 410 + fi 411 + 412 + echo "----------------------------------------------------------" 413 + echo "running KSM test with 10 zero pages and use_zero_pages = 0" 414 + echo "----------------------------------------------------------" 415 + ./ksm_tests -Z -p 10 -z 0 416 + ret_val=$? 417 + 418 + if [ $ret_val -eq 0 ]; then 419 + echo "[PASS]" 420 + elif [ $ret_val -eq $ksft_skip ]; then 421 + echo "[SKIP]" 422 + exitcode=$ksft_skip 423 + else 424 + echo "[FAIL]" 425 + exitcode=1 426 + fi 427 + 428 + echo "----------------------------------------------------------" 429 + echo "running KSM test with 10 zero pages and use_zero_pages = 1" 430 + echo "----------------------------------------------------------" 431 + ./ksm_tests -Z -p 10 -z 1 432 + ret_val=$? 
433 + 434 + if [ $ret_val -eq 0 ]; then 435 + echo "[PASS]" 436 + elif [ $ret_val -eq $ksft_skip ]; then 437 + echo "[SKIP]" 438 + exitcode=$ksft_skip 439 + else 440 + echo "[FAIL]" 441 + exitcode=1 442 + fi 443 + 444 + echo "-------------------------------------------------------------" 445 + echo "running KSM test with 2 NUMA nodes and merge_across_nodes = 1" 446 + echo "-------------------------------------------------------------" 447 + ./ksm_tests -N -m 1 448 + ret_val=$? 449 + 450 + if [ $ret_val -eq 0 ]; then 451 + echo "[PASS]" 452 + elif [ $ret_val -eq $ksft_skip ]; then 453 + echo "[SKIP]" 454 + exitcode=$ksft_skip 455 + else 456 + echo "[FAIL]" 457 + exitcode=1 458 + fi 459 + 460 + echo "-------------------------------------------------------------" 461 + echo "running KSM test with 2 NUMA nodes and merge_across_nodes = 0" 462 + echo "-------------------------------------------------------------" 463 + ./ksm_tests -N -m 0 464 + ret_val=$? 465 + 466 + if [ $ret_val -eq 0 ]; then 467 + echo "[PASS]" 468 + elif [ $ret_val -eq $ksft_skip ]; then 469 + echo "[SKIP]" 470 + exitcode=$ksft_skip 471 + else 472 + echo "[FAIL]" 473 + exitcode=1 474 + fi 475 + 380 476 exit $exitcode 381 477 382 478 exit $exitcode
+13
tools/testing/selftests/vm/userfaultfd.c
··· 566 566 } 567 567 } 568 568 569 + static void wake_range(int ufd, unsigned long addr, unsigned long len) 570 + { 571 + struct uffdio_range uffdio_wake; 572 + 573 + uffdio_wake.start = addr; 574 + uffdio_wake.len = len; 575 + 576 + if (ioctl(ufd, UFFDIO_WAKE, &uffdio_wake)) 577 + fprintf(stderr, "error waking %lu\n", 578 + addr), exit(1); 579 + } 580 + 569 581 static int __copy_page(int ufd, unsigned long offset, bool retry) 570 582 { 571 583 struct uffdio_copy uffdio_copy; ··· 597 585 if (uffdio_copy.copy != -EEXIST) 598 586 err("UFFDIO_COPY error: %"PRId64, 599 587 (int64_t)uffdio_copy.copy); 588 + wake_range(ufd, uffdio_copy.dst, page_size); 600 589 } else if (uffdio_copy.copy != page_size) { 601 590 err("UFFDIO_COPY error: %"PRId64, (int64_t)uffdio_copy.copy); 602 591 } else {