Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"191 patches.

Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virtual address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONTIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
...

+3775 -2822
+6
Documentation/admin-guide/kernel-parameters.txt
···
3591 3591 off: turn off poisoning (default)
3592 3592 on: turn on poisoning
3593 3593
3594 + page_reporting.page_reporting_order=
3595 + [KNL] Minimal page reporting order
3596 + Format: <integer>
3597 + Adjust the minimal page reporting order. The page
3598 + reporting is disabled when it exceeds (MAX_ORDER-1).
3599 +
3594 3600 panic= [KNL] Kernel behaviour on panic: delay <timeout>
3595 3601 timeout > 0: seconds before rebooting
3596 3602 timeout = 0: wait forever
+2 -2
Documentation/admin-guide/lockup-watchdogs.rst
···
39 39 subsystems are present.
40 40
41 41 A periodic hrtimer runs to generate interrupts and kick the watchdog
42 - task. An NMI perf event is generated every "watchdog_thresh"
42 + job. An NMI perf event is generated every "watchdog_thresh"
43 43 (compile-time initialized to 10 and configurable through sysctl of the
44 44 same name) seconds to check for hardlockups. If any CPU in the system
45 45 does not receive any hrtimer interrupt during that time the
···
47 47 generate a kernel warning or call panic, depending on the
48 48 configuration.
49 49
50 - The watchdog task is a high priority kernel thread that updates a
50 + The watchdog job runs in a stop scheduling thread that updates a
51 51 timestamp every time it is scheduled. If that timestamp is not updated
52 52 for 2*watchdog_thresh seconds (the softlockup threshold) the
53 53 'softlockup detector' (coded inside the hrtimer callback function)
+5 -5
Documentation/admin-guide/sysctl/kernel.rst
···
1297 1297 = =================================
1298 1298
1299 1299 The soft lockup detector monitors CPUs for threads that are hogging the CPUs
1300 - without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
1301 - from running. The mechanism depends on the CPUs ability to respond to timer
1302 - interrupts which are needed for the 'watchdog/N' threads to be woken up by
1303 - the watchdog timer function, otherwise the NMI watchdog — if enabled — can
1304 - detect a hard lockup condition.
1300 + without rescheduling voluntarily, and thus prevent the 'migration/N' threads
1301 + from running, causing the watchdog work fail to execute. The mechanism depends
1302 + on the CPUs ability to respond to timer interrupts which are needed for the
1303 + watchdog work to be queued by the watchdog timer function, otherwise the NMI
1304 + watchdog — if enabled — can detect a hard lockup condition.
1305 1305
1306 1306
1307 1307 stack_erasing
+22 -20
Documentation/admin-guide/sysctl/vm.rst
···
64 64 - overcommit_ratio
65 65 - page-cluster
66 66 - panic_on_oom
67 - - percpu_pagelist_fraction
67 + - percpu_pagelist_high_fraction
68 68 - stat_interval
69 69 - stat_refresh
70 70 - numa_stat
···
790 790 why oom happens. You can get snapshot.
791 791
792 792
793 - percpu_pagelist_fraction
794 - ========================
793 + percpu_pagelist_high_fraction
794 + =============================
795 795
796 - This is the fraction of pages at most (high mark pcp->high) in each zone that
797 - are allocated for each per cpu page list. The min value for this is 8. It
798 - means that we don't allow more than 1/8th of pages in each zone to be
799 - allocated in any single per_cpu_pagelist. This entry only changes the value
800 - of hot per cpu pagelists. User can specify a number like 100 to allocate
801 - 1/100th of each zone to each per cpu page list.
796 + This is the fraction of pages in each zone that are can be stored to
797 + per-cpu page lists. It is an upper boundary that is divided depending
798 + on the number of online CPUs. The min value for this is 8 which means
799 + that we do not allow more than 1/8th of pages in each zone to be stored
800 + on per-cpu page lists. This entry only changes the value of hot per-cpu
801 + page lists. A user can specify a number like 100 to allocate 1/100th of
802 + each zone between per-cpu lists.
802 803
803 - The batch value of each per cpu pagelist is also updated as a result. It is
804 - set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
804 + The batch value of each per-cpu page list remains the same regardless of
805 + the value of the high fraction so allocation latencies are unaffected.
805 806
806 - The initial value is zero. Kernel does not use this value at boot time to set
807 - the high water marks for each per cpu page list. If the user writes '0' to this
808 - sysctl, it will revert to this default behavior.
807 + The initial value is zero. Kernel uses this value to set the high pcp->high
808 + mark based on the low watermark for the zone and the number of local
809 + online CPUs. If the user writes '0' to this sysctl, it will revert to
810 + this default behavior.
809 811
810 812
811 813 stat_interval
···
938 936
939 937 To make it sensible with respect to the watermark_scale_factor
940 938 parameter, the unit is in fractions of 10,000. The default value of
941 - 15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
942 - watermark will be reclaimed in the event of a pageblock being mixed due
943 - to fragmentation. The level of reclaim is determined by the number of
944 - fragmentation events that occurred in the recent past. If this value is
945 - smaller than a pageblock then a pageblocks worth of pages will be reclaimed
946 - (e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
939 + 15,000 means that up to 150% of the high watermark will be reclaimed in the
940 + event of a pageblock being mixed due to fragmentation. The level of reclaim
941 + is determined by the number of fragmentation events that occurred in the
942 + recent past. If this value is smaller than a pageblock then a pageblocks
943 + worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
944 + of 0 will disable the feature.
947 945
948 946
949 947 watermark_scale_factor
+4 -5
Documentation/dev-tools/kasan.rst
···
447 447
448 448 When a test fails due to a missing KASAN report::
449 449
450 - # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:629
451 - Expected kasan_data->report_expected == kasan_data->report_found, but
452 - kasan_data->report_expected == 1
453 - kasan_data->report_found == 0
454 - not ok 28 - kmalloc_double_kzfree
450 + # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:974
451 + KASAN failure expected in "kfree_sensitive(ptr)", but none occurred
452 + not ok 44 - kmalloc_double_kzfree
453 +
455 454
456 455 At the end the cumulative status of all KASAN tests is printed. On success::
457 456
+2 -43
Documentation/vm/memory-model.rst
···
14 14 completely distinct addresses. And, don't forget about NUMA, where
15 15 different memory banks are attached to different CPUs.
16 16
17 - Linux abstracts this diversity using one of the three memory models:
18 - FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
17 + Linux abstracts this diversity using one of the two memory models:
18 + FLATMEM and SPARSEMEM. Each architecture defines what
19 19 memory models it supports, what the default memory model is and
20 20 whether it is possible to manually override that default.
21 -
22 - .. note::
23 - At time of this writing, DISCONTIGMEM is considered deprecated,
24 - although it is still in use by several architectures.
25 21
26 22 All the memory models track the status of physical page frames using
27 23 struct page arranged in one or more arrays.
···
58 62
59 63 The `ARCH_PFN_OFFSET` defines the first page frame number for
60 64 systems with physical memory starting at address different from 0.
61 -
62 - DISCONTIGMEM
63 - ============
64 -
65 - The DISCONTIGMEM model treats the physical memory as a collection of
66 - `nodes` similarly to how Linux NUMA support does. For each node Linux
67 - constructs an independent memory management subsystem represented by
68 - `struct pglist_data` (or `pg_data_t` for short). Among other
69 - things, `pg_data_t` holds the `node_mem_map` array that maps
70 - physical pages belonging to that node. The `node_start_pfn` field of
71 - `pg_data_t` is the number of the first page frame belonging to that
72 - node.
73 -
74 - The architecture setup code should call :c:func:`free_area_init_node` for
75 - each node in the system to initialize the `pg_data_t` object and its
76 - `node_mem_map`.
77 -
78 - Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
79 - every physical page frame in a node has a `struct page` entry in the
80 - `node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
81 - `flags` field of the `struct page` encodes the node number of the
82 - node hosting that page.
83 -
84 - The conversion between a PFN and the `struct page` in the
85 - DISCONTIGMEM model became slightly more complex as it has to determine
86 - which node hosts the physical page and which `pg_data_t` object
87 - holds the `struct page`.
88 -
89 - Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
90 - to convert PFN to the node number. The opposite conversion helper
91 - :c:func:`page_to_nid` is generic as it uses the node number encoded in
92 - page->flags.
93 -
94 - Once the node number is known, the PFN can be used to index
95 - appropriate `node_mem_map` array to access the `struct page` and
96 - the offset of the `struct page` from the `node_mem_map` plus
97 - `node_start_pfn` is the PFN of that page.
98 65
99 66 SPARSEMEM
100 67 =========
-22
arch/alpha/Kconfig
··· 549 549 MARVEL support can handle a maximum of 32 CPUs, all the others 550 550 with working support have a maximum of 4 CPUs. 551 551 552 - config ARCH_DISCONTIGMEM_ENABLE 553 - bool "Discontiguous Memory Support" 554 - depends on BROKEN 555 - help 556 - Say Y to support efficient handling of discontiguous physical memory, 557 - for architectures which are either NUMA (Non-Uniform Memory Access) 558 - or have huge holes in the physical address space for other reasons. 559 - See <file:Documentation/vm/numa.rst> for more. 560 - 561 552 config ARCH_SPARSEMEM_ENABLE 562 553 bool "Sparse Memory Support" 563 554 help 564 555 Say Y to support efficient handling of discontiguous physical memory, 565 556 for systems that have huge holes in the physical address space. 566 - 567 - config NUMA 568 - bool "NUMA Support (EXPERIMENTAL)" 569 - depends on DISCONTIGMEM && BROKEN 570 - help 571 - Say Y to compile the kernel to support NUMA (Non-Uniform Memory 572 - Access). This option is for configuring high-end multiprocessor 573 - server machines. If in doubt, say N. 574 557 575 558 config ALPHA_WTINT 576 559 bool "Use WTINT" if ALPHA_SRM || ALPHA_GENERIC ··· 578 595 so you might as well say Y here. 579 596 580 597 If unsure, say N. 581 - 582 - config NODES_SHIFT 583 - int 584 - default "7" 585 - depends on NEED_MULTIPLE_NODES 586 598 587 599 # LARGE_VMALLOC is racy, if you *really* need it then fix it first 588 600 config ALPHA_LARGE_VMALLOC
-6
arch/alpha/include/asm/machvec.h
···
99 99
100 100 const char *vector_name;
101 101
102 - /* NUMA information */
103 - int (*pa_to_nid)(unsigned long);
104 - int (*cpuid_to_nid)(int);
105 - unsigned long (*node_mem_start)(int);
106 - unsigned long (*node_mem_size)(int);
107 -
108 102 /* System specific parameters. */
109 103 union {
110 104 struct {
-100
arch/alpha/include/asm/mmzone.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 2 - /* 3 - * Written by Kanoj Sarcar (kanoj@sgi.com) Aug 99 4 - * Adapted for the alpha wildfire architecture Jan 2001. 5 - */ 6 - #ifndef _ASM_MMZONE_H_ 7 - #define _ASM_MMZONE_H_ 8 - 9 - #ifdef CONFIG_DISCONTIGMEM 10 - 11 - #include <asm/smp.h> 12 - 13 - /* 14 - * Following are macros that are specific to this numa platform. 15 - */ 16 - 17 - extern pg_data_t node_data[]; 18 - 19 - #define alpha_pa_to_nid(pa) \ 20 - (alpha_mv.pa_to_nid \ 21 - ? alpha_mv.pa_to_nid(pa) \ 22 - : (0)) 23 - #define node_mem_start(nid) \ 24 - (alpha_mv.node_mem_start \ 25 - ? alpha_mv.node_mem_start(nid) \ 26 - : (0UL)) 27 - #define node_mem_size(nid) \ 28 - (alpha_mv.node_mem_size \ 29 - ? alpha_mv.node_mem_size(nid) \ 30 - : ((nid) ? (0UL) : (~0UL))) 31 - 32 - #define pa_to_nid(pa) alpha_pa_to_nid(pa) 33 - #define NODE_DATA(nid) (&node_data[(nid)]) 34 - 35 - #define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn) 36 - 37 - #if 1 38 - #define PLAT_NODE_DATA_LOCALNR(p, n) \ 39 - (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn) 40 - #else 41 - static inline unsigned long 42 - PLAT_NODE_DATA_LOCALNR(unsigned long p, int n) 43 - { 44 - unsigned long temp; 45 - temp = p >> PAGE_SHIFT; 46 - return temp - PLAT_NODE_DATA(n)->gendata.node_start_pfn; 47 - } 48 - #endif 49 - 50 - /* 51 - * Following are macros that each numa implementation must define. 52 - */ 53 - 54 - /* 55 - * Given a kernel address, find the home node of the underlying memory. 56 - */ 57 - #define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr)) 58 - 59 - /* 60 - * Given a kaddr, LOCAL_BASE_ADDR finds the owning node of the memory 61 - * and returns the kaddr corresponding to first physical page in the 62 - * node's mem_map. 
63 - */ 64 - #define LOCAL_BASE_ADDR(kaddr) \ 65 - ((unsigned long)__va(NODE_DATA(kvaddr_to_nid(kaddr))->node_start_pfn \ 66 - << PAGE_SHIFT)) 67 - 68 - /* XXX: FIXME -- nyc */ 69 - #define kern_addr_valid(kaddr) (0) 70 - 71 - #define mk_pte(page, pgprot) \ 72 - ({ \ 73 - pte_t pte; \ 74 - unsigned long pfn; \ 75 - \ 76 - pfn = page_to_pfn(page) << 32; \ 77 - pte_val(pte) = pfn | pgprot_val(pgprot); \ 78 - \ 79 - pte; \ 80 - }) 81 - 82 - #define pte_page(x) \ 83 - ({ \ 84 - unsigned long kvirt; \ 85 - struct page * __xx; \ 86 - \ 87 - kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT)); \ 88 - __xx = virt_to_page(kvirt); \ 89 - \ 90 - __xx; \ 91 - }) 92 - 93 - #define pfn_to_nid(pfn) pa_to_nid(((u64)(pfn) << PAGE_SHIFT)) 94 - #define pfn_valid(pfn) \ 95 - (((pfn) - node_start_pfn(pfn_to_nid(pfn))) < \ 96 - node_spanned_pages(pfn_to_nid(pfn))) \ 97 - 98 - #endif /* CONFIG_DISCONTIGMEM */ 99 - 100 - #endif /* _ASM_MMZONE_H_ */
-4
arch/alpha/include/asm/pgtable.h
··· 206 206 #define page_to_pa(page) (page_to_pfn(page) << PAGE_SHIFT) 207 207 #define pte_pfn(pte) (pte_val(pte) >> 32) 208 208 209 - #ifndef CONFIG_DISCONTIGMEM 210 209 #define pte_page(pte) pfn_to_page(pte_pfn(pte)) 211 210 #define mk_pte(page, pgprot) \ 212 211 ({ \ ··· 214 215 pte_val(pte) = (page_to_pfn(page) << 32) | pgprot_val(pgprot); \ 215 216 pte; \ 216 217 }) 217 - #endif 218 218 219 219 extern inline pte_t pfn_pte(unsigned long physpfn, pgprot_t pgprot) 220 220 { pte_t pte; pte_val(pte) = (PHYS_TWIDDLE(physpfn) << 32) | pgprot_val(pgprot); return pte; } ··· 328 330 #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) 329 331 #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) 330 332 331 - #ifndef CONFIG_DISCONTIGMEM 332 333 #define kern_addr_valid(addr) (1) 333 - #endif 334 334 335 335 #define pte_ERROR(e) \ 336 336 printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e))
-39
arch/alpha/include/asm/topology.h
··· 7 7 #include <linux/numa.h> 8 8 #include <asm/machvec.h> 9 9 10 - #ifdef CONFIG_NUMA 11 - static inline int cpu_to_node(int cpu) 12 - { 13 - int node; 14 - 15 - if (!alpha_mv.cpuid_to_nid) 16 - return 0; 17 - 18 - node = alpha_mv.cpuid_to_nid(cpu); 19 - 20 - #ifdef DEBUG_NUMA 21 - BUG_ON(node < 0); 22 - #endif 23 - 24 - return node; 25 - } 26 - 27 - extern struct cpumask node_to_cpumask_map[]; 28 - /* FIXME: This is dumb, recalculating every time. But simple. */ 29 - static const struct cpumask *cpumask_of_node(int node) 30 - { 31 - int cpu; 32 - 33 - if (node == NUMA_NO_NODE) 34 - return cpu_all_mask; 35 - 36 - cpumask_clear(&node_to_cpumask_map[node]); 37 - 38 - for_each_online_cpu(cpu) { 39 - if (cpu_to_node(cpu) == node) 40 - cpumask_set_cpu(cpu, node_to_cpumask_map[node]); 41 - } 42 - 43 - return &node_to_cpumask_map[node]; 44 - } 45 - 46 - #define cpumask_of_pcibus(bus) (cpu_online_mask) 47 - 48 - #endif /* !CONFIG_NUMA */ 49 10 # include <asm-generic/topology.h> 50 11 51 12 #endif /* _ASM_ALPHA_TOPOLOGY_H */
+3 -50
arch/alpha/kernel/core_marvel.c
··· 287 287 /* 288 288 * Set up window 0 for scatter-gather 8MB at 8MB. 289 289 */ 290 - hose->sg_isa = iommu_arena_new_node(marvel_cpuid_to_nid(io7->pe), 291 - hose, 0x00800000, 0x00800000, 0); 290 + hose->sg_isa = iommu_arena_new_node(0, hose, 0x00800000, 0x00800000, 0); 292 291 hose->sg_isa->align_entry = 8; /* cache line boundary */ 293 292 csrs->POx_WBASE[0].csr = 294 293 hose->sg_isa->dma_base | wbase_m_ena | wbase_m_sg; ··· 304 305 /* 305 306 * Set up window 2 for scatter-gather (up-to) 1GB at 3GB. 306 307 */ 307 - hose->sg_pci = iommu_arena_new_node(marvel_cpuid_to_nid(io7->pe), 308 - hose, 0xc0000000, 0x40000000, 0); 308 + hose->sg_pci = iommu_arena_new_node(0, hose, 0xc0000000, 0x40000000, 0); 309 309 hose->sg_pci->align_entry = 8; /* cache line boundary */ 310 310 csrs->POx_WBASE[2].csr = 311 311 hose->sg_pci->dma_base | wbase_m_ena | wbase_m_sg; ··· 841 843 EXPORT_SYMBOL(marvel_ioread8); 842 844 EXPORT_SYMBOL(marvel_iowrite8); 843 845 #endif 844 - 846 + 845 847 /* 846 - * NUMA Support 847 - */ 848 - /********** 849 - * FIXME - for now each cpu is a node by itself 850 - * -- no real support for striped mode 851 - ********** 852 - */ 853 - int 854 - marvel_pa_to_nid(unsigned long pa) 855 - { 856 - int cpuid; 857 - 858 - if ((pa >> 43) & 1) /* I/O */ 859 - cpuid = (~(pa >> 35) & 0xff); 860 - else /* mem */ 861 - cpuid = ((pa >> 34) & 0x3) | ((pa >> (37 - 2)) & (0x1f << 2)); 862 - 863 - return marvel_cpuid_to_nid(cpuid); 864 - } 865 - 866 - int 867 - marvel_cpuid_to_nid(int cpuid) 868 - { 869 - return cpuid; 870 - } 871 - 872 - unsigned long 873 - marvel_node_mem_start(int nid) 874 - { 875 - unsigned long pa; 876 - 877 - pa = (nid & 0x3) | ((nid & (0x1f << 2)) << 1); 878 - pa <<= 34; 879 - 880 - return pa; 881 - } 882 - 883 - unsigned long 884 - marvel_node_mem_size(int nid) 885 - { 886 - return 16UL * 1024 * 1024 * 1024; /* 16GB */ 887 - } 888 - 889 - 890 - /* 891 848 * AGP GART Support. 892 849 */ 893 850 #include <linux/agp_backend.h>
+1 -28
arch/alpha/kernel/core_wildfire.c
··· 434 434 return PCIBIOS_SUCCESSFUL; 435 435 } 436 436 437 - struct pci_ops wildfire_pci_ops = 437 + struct pci_ops wildfire_pci_ops = 438 438 { 439 439 .read = wildfire_read_config, 440 440 .write = wildfire_write_config, 441 441 }; 442 - 443 - 444 - /* 445 - * NUMA Support 446 - */ 447 - int wildfire_pa_to_nid(unsigned long pa) 448 - { 449 - return pa >> 36; 450 - } 451 - 452 - int wildfire_cpuid_to_nid(int cpuid) 453 - { 454 - /* assume 4 CPUs per node */ 455 - return cpuid >> 2; 456 - } 457 - 458 - unsigned long wildfire_node_mem_start(int nid) 459 - { 460 - /* 64GB per node */ 461 - return (unsigned long)nid * (64UL * 1024 * 1024 * 1024); 462 - } 463 - 464 - unsigned long wildfire_node_mem_size(int nid) 465 - { 466 - /* 64GB per node */ 467 - return 64UL * 1024 * 1024 * 1024; 468 - } 469 442 470 443 #if DEBUG_DUMP_REGS 471 444
-29
arch/alpha/kernel/pci_iommu.c
··· 71 71 if (align < mem_size) 72 72 align = mem_size; 73 73 74 - 75 - #ifdef CONFIG_DISCONTIGMEM 76 - 77 - arena = memblock_alloc_node(sizeof(*arena), align, nid); 78 - if (!NODE_DATA(nid) || !arena) { 79 - printk("%s: couldn't allocate arena from node %d\n" 80 - " falling back to system-wide allocation\n", 81 - __func__, nid); 82 - arena = memblock_alloc(sizeof(*arena), SMP_CACHE_BYTES); 83 - if (!arena) 84 - panic("%s: Failed to allocate %zu bytes\n", __func__, 85 - sizeof(*arena)); 86 - } 87 - 88 - arena->ptes = memblock_alloc_node(sizeof(*arena), align, nid); 89 - if (!NODE_DATA(nid) || !arena->ptes) { 90 - printk("%s: couldn't allocate arena ptes from node %d\n" 91 - " falling back to system-wide allocation\n", 92 - __func__, nid); 93 - arena->ptes = memblock_alloc(mem_size, align); 94 - if (!arena->ptes) 95 - panic("%s: Failed to allocate %lu bytes align=0x%lx\n", 96 - __func__, mem_size, align); 97 - } 98 - 99 - #else /* CONFIG_DISCONTIGMEM */ 100 - 101 74 arena = memblock_alloc(sizeof(*arena), SMP_CACHE_BYTES); 102 75 if (!arena) 103 76 panic("%s: Failed to allocate %zu bytes\n", __func__, ··· 79 106 if (!arena->ptes) 80 107 panic("%s: Failed to allocate %lu bytes align=0x%lx\n", 81 108 __func__, mem_size, align); 82 - 83 - #endif /* CONFIG_DISCONTIGMEM */ 84 109 85 110 spin_lock_init(&arena->lock); 86 111 arena->hose = hose;
-8
arch/alpha/kernel/proto.h
··· 49 49 extern void marvel_kill_arch(int); 50 50 extern void marvel_machine_check(unsigned long, unsigned long); 51 51 extern void marvel_pci_tbi(struct pci_controller *, dma_addr_t, dma_addr_t); 52 - extern int marvel_pa_to_nid(unsigned long); 53 - extern int marvel_cpuid_to_nid(int); 54 - extern unsigned long marvel_node_mem_start(int); 55 - extern unsigned long marvel_node_mem_size(int); 56 52 extern struct _alpha_agp_info *marvel_agp_info(void); 57 53 struct io7 *marvel_find_io7(int pe); 58 54 struct io7 *marvel_next_io7(struct io7 *prev); ··· 97 101 extern void wildfire_kill_arch(int); 98 102 extern void wildfire_machine_check(unsigned long vector, unsigned long la_ptr); 99 103 extern void wildfire_pci_tbi(struct pci_controller *, dma_addr_t, dma_addr_t); 100 - extern int wildfire_pa_to_nid(unsigned long); 101 - extern int wildfire_cpuid_to_nid(int); 102 - extern unsigned long wildfire_node_mem_start(int); 103 - extern unsigned long wildfire_node_mem_size(int); 104 104 105 105 /* console.c */ 106 106 #ifdef CONFIG_VGA_HOSE
-16
arch/alpha/kernel/setup.c
··· 79 79 unsigned long alpha_verbose_mcheck = CONFIG_VERBOSE_MCHECK_ON; 80 80 #endif 81 81 82 - #ifdef CONFIG_NUMA 83 - struct cpumask node_to_cpumask_map[MAX_NUMNODES] __read_mostly; 84 - EXPORT_SYMBOL(node_to_cpumask_map); 85 - #endif 86 - 87 82 /* Which processor we booted from. */ 88 83 int boot_cpuid; 89 84 ··· 300 305 } 301 306 #endif 302 307 303 - #ifndef CONFIG_DISCONTIGMEM 304 308 static void __init 305 309 setup_memory(void *kernel_end) 306 310 { ··· 383 389 } 384 390 #endif /* CONFIG_BLK_DEV_INITRD */ 385 391 } 386 - #else 387 - extern void setup_memory(void *); 388 - #endif /* !CONFIG_DISCONTIGMEM */ 389 392 390 393 int __init 391 394 page_is_ram(unsigned long pfn) ··· 607 616 #endif 608 617 #ifdef CONFIG_VERBOSE_MCHECK 609 618 "VERBOSE_MCHECK " 610 - #endif 611 - 612 - #ifdef CONFIG_DISCONTIGMEM 613 - "DISCONTIGMEM " 614 - #ifdef CONFIG_NUMA 615 - "NUMA " 616 - #endif 617 619 #endif 618 620 619 621 #ifdef CONFIG_DEBUG_SPINLOCK
-5
arch/alpha/kernel/sys_marvel.c
···
461 461 .kill_arch = marvel_kill_arch,
462 462 .pci_map_irq = marvel_map_irq,
463 463 .pci_swizzle = common_swizzle,
464 -
465 - .pa_to_nid = marvel_pa_to_nid,
466 - .cpuid_to_nid = marvel_cpuid_to_nid,
467 - .node_mem_start = marvel_node_mem_start,
468 - .node_mem_size = marvel_node_mem_size,
469 464 };
470 465 ALIAS_MV(marvel_ev7)
-5
arch/alpha/kernel/sys_wildfire.c
···
337 337 .kill_arch = wildfire_kill_arch,
338 338 .pci_map_irq = wildfire_map_irq,
339 339 .pci_swizzle = common_swizzle,
340 -
341 - .pa_to_nid = wildfire_pa_to_nid,
342 - .cpuid_to_nid = wildfire_cpuid_to_nid,
343 - .node_mem_start = wildfire_node_mem_start,
344 - .node_mem_size = wildfire_node_mem_size,
345 340 };
346 341 ALIAS_MV(wildfire)
-2
arch/alpha/mm/Makefile
···
6 6 ccflags-y := -Werror
7 7
8 8 obj-y := init.o fault.o
9 -
10 - obj-$(CONFIG_DISCONTIGMEM) += numa.o
-3
arch/alpha/mm/init.c
···
235 235 return kernel_end;
236 236 }
237 237
238 -
239 - #ifndef CONFIG_DISCONTIGMEM
240 238 /*
241 239 * paging_init() sets up the memory map.
242 240 */
···
255 257 /* Initialize the kernel's ZERO_PGE. */
256 258 memset((void *)ZERO_PGE, 0, PAGE_SIZE);
257 259 }
258 - #endif /* CONFIG_DISCONTIGMEM */
259 260
260 261 #if defined(CONFIG_ALPHA_GENERIC) || defined(CONFIG_ALPHA_SRM)
261 262 void
-223
arch/alpha/mm/numa.c
··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - /* 3 - * linux/arch/alpha/mm/numa.c 4 - * 5 - * DISCONTIGMEM NUMA alpha support. 6 - * 7 - * Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE 8 - */ 9 - 10 - #include <linux/types.h> 11 - #include <linux/kernel.h> 12 - #include <linux/mm.h> 13 - #include <linux/memblock.h> 14 - #include <linux/swap.h> 15 - #include <linux/initrd.h> 16 - #include <linux/pfn.h> 17 - #include <linux/module.h> 18 - 19 - #include <asm/hwrpb.h> 20 - #include <asm/sections.h> 21 - 22 - pg_data_t node_data[MAX_NUMNODES]; 23 - EXPORT_SYMBOL(node_data); 24 - 25 - #undef DEBUG_DISCONTIG 26 - #ifdef DEBUG_DISCONTIG 27 - #define DBGDCONT(args...) printk(args) 28 - #else 29 - #define DBGDCONT(args...) 30 - #endif 31 - 32 - #define for_each_mem_cluster(memdesc, _cluster, i) \ 33 - for ((_cluster) = (memdesc)->cluster, (i) = 0; \ 34 - (i) < (memdesc)->numclusters; (i)++, (_cluster)++) 35 - 36 - static void __init show_mem_layout(void) 37 - { 38 - struct memclust_struct * cluster; 39 - struct memdesc_struct * memdesc; 40 - int i; 41 - 42 - /* Find free clusters, and init and free the bootmem accordingly. 
*/ 43 - memdesc = (struct memdesc_struct *) 44 - (hwrpb->mddt_offset + (unsigned long) hwrpb); 45 - 46 - printk("Raw memory layout:\n"); 47 - for_each_mem_cluster(memdesc, cluster, i) { 48 - printk(" memcluster %2d, usage %1lx, start %8lu, end %8lu\n", 49 - i, cluster->usage, cluster->start_pfn, 50 - cluster->start_pfn + cluster->numpages); 51 - } 52 - } 53 - 54 - static void __init 55 - setup_memory_node(int nid, void *kernel_end) 56 - { 57 - extern unsigned long mem_size_limit; 58 - struct memclust_struct * cluster; 59 - struct memdesc_struct * memdesc; 60 - unsigned long start_kernel_pfn, end_kernel_pfn; 61 - unsigned long start, end; 62 - unsigned long node_pfn_start, node_pfn_end; 63 - unsigned long node_min_pfn, node_max_pfn; 64 - int i; 65 - int show_init = 0; 66 - 67 - /* Find the bounds of current node */ 68 - node_pfn_start = (node_mem_start(nid)) >> PAGE_SHIFT; 69 - node_pfn_end = node_pfn_start + (node_mem_size(nid) >> PAGE_SHIFT); 70 - 71 - /* Find free clusters, and init and free the bootmem accordingly. */ 72 - memdesc = (struct memdesc_struct *) 73 - (hwrpb->mddt_offset + (unsigned long) hwrpb); 74 - 75 - /* find the bounds of this node (node_min_pfn/node_max_pfn) */ 76 - node_min_pfn = ~0UL; 77 - node_max_pfn = 0UL; 78 - for_each_mem_cluster(memdesc, cluster, i) { 79 - /* Bit 0 is console/PALcode reserved. Bit 1 is 80 - non-volatile memory -- we might want to mark 81 - this for later. 
*/ 82 - if (cluster->usage & 3) 83 - continue; 84 - 85 - start = cluster->start_pfn; 86 - end = start + cluster->numpages; 87 - 88 - if (start >= node_pfn_end || end <= node_pfn_start) 89 - continue; 90 - 91 - if (!show_init) { 92 - show_init = 1; 93 - printk("Initializing bootmem allocator on Node ID %d\n", nid); 94 - } 95 - printk(" memcluster %2d, usage %1lx, start %8lu, end %8lu\n", 96 - i, cluster->usage, cluster->start_pfn, 97 - cluster->start_pfn + cluster->numpages); 98 - 99 - if (start < node_pfn_start) 100 - start = node_pfn_start; 101 - if (end > node_pfn_end) 102 - end = node_pfn_end; 103 - 104 - if (start < node_min_pfn) 105 - node_min_pfn = start; 106 - if (end > node_max_pfn) 107 - node_max_pfn = end; 108 - } 109 - 110 - if (mem_size_limit && node_max_pfn > mem_size_limit) { 111 - static int msg_shown = 0; 112 - if (!msg_shown) { 113 - msg_shown = 1; 114 - printk("setup: forcing memory size to %ldK (from %ldK).\n", 115 - mem_size_limit << (PAGE_SHIFT - 10), 116 - node_max_pfn << (PAGE_SHIFT - 10)); 117 - } 118 - node_max_pfn = mem_size_limit; 119 - } 120 - 121 - if (node_min_pfn >= node_max_pfn) 122 - return; 123 - 124 - /* Update global {min,max}_low_pfn from node information. */ 125 - if (node_min_pfn < min_low_pfn) 126 - min_low_pfn = node_min_pfn; 127 - if (node_max_pfn > max_low_pfn) 128 - max_pfn = max_low_pfn = node_max_pfn; 129 - 130 - #if 0 /* we'll try this one again in a little while */ 131 - /* Cute trick to make sure our local node data is on local memory */ 132 - node_data[nid] = (pg_data_t *)(__va(node_min_pfn << PAGE_SHIFT)); 133 - #endif 134 - printk(" Detected node memory: start %8lu, end %8lu\n", 135 - node_min_pfn, node_max_pfn); 136 - 137 - DBGDCONT(" DISCONTIG: node_data[%d] is at 0x%p\n", nid, NODE_DATA(nid)); 138 - 139 - /* Find the bounds of kernel memory. 
*/ 140 - start_kernel_pfn = PFN_DOWN(KERNEL_START_PHYS); 141 - end_kernel_pfn = PFN_UP(virt_to_phys(kernel_end)); 142 - 143 - if (!nid && (node_max_pfn < end_kernel_pfn || node_min_pfn > start_kernel_pfn)) 144 - panic("kernel loaded out of ram"); 145 - 146 - memblock_add_node(PFN_PHYS(node_min_pfn), 147 - (node_max_pfn - node_min_pfn) << PAGE_SHIFT, nid); 148 - 149 - /* Zone start phys-addr must be 2^(MAX_ORDER-1) aligned. 150 - Note that we round this down, not up - node memory 151 - has much larger alignment than 8Mb, so it's safe. */ 152 - node_min_pfn &= ~((1UL << (MAX_ORDER-1))-1); 153 - 154 - NODE_DATA(nid)->node_start_pfn = node_min_pfn; 155 - NODE_DATA(nid)->node_present_pages = node_max_pfn - node_min_pfn; 156 - 157 - node_set_online(nid); 158 - } 159 - 160 - void __init 161 - setup_memory(void *kernel_end) 162 - { 163 - unsigned long kernel_size; 164 - int nid; 165 - 166 - show_mem_layout(); 167 - 168 - nodes_clear(node_online_map); 169 - 170 - min_low_pfn = ~0UL; 171 - max_low_pfn = 0UL; 172 - for (nid = 0; nid < MAX_NUMNODES; nid++) 173 - setup_memory_node(nid, kernel_end); 174 - 175 - kernel_size = virt_to_phys(kernel_end) - KERNEL_START_PHYS; 176 - memblock_reserve(KERNEL_START_PHYS, kernel_size); 177 - 178 - #ifdef CONFIG_BLK_DEV_INITRD 179 - initrd_start = INITRD_START; 180 - if (initrd_start) { 181 - extern void *move_initrd(unsigned long); 182 - 183 - initrd_end = initrd_start+INITRD_SIZE; 184 - printk("Initial ramdisk at: 0x%p (%lu bytes)\n", 185 - (void *) initrd_start, INITRD_SIZE); 186 - 187 - if ((void *)initrd_end > phys_to_virt(PFN_PHYS(max_low_pfn))) { 188 - if (!move_initrd(PFN_PHYS(max_low_pfn))) 189 - printk("initrd extends beyond end of memory " 190 - "(0x%08lx > 0x%p)\ndisabling initrd\n", 191 - initrd_end, 192 - phys_to_virt(PFN_PHYS(max_low_pfn))); 193 - } else { 194 - nid = kvaddr_to_nid(initrd_start); 195 - memblock_reserve(virt_to_phys((void *)initrd_start), 196 - INITRD_SIZE); 197 - } 198 - } 199 - #endif /* 
CONFIG_BLK_DEV_INITRD */ 200 - } 201 - 202 - void __init paging_init(void) 203 - { 204 - unsigned long max_zone_pfn[MAX_NR_ZONES] = {0, }; 205 - unsigned long dma_local_pfn; 206 - 207 - /* 208 - * The old global MAX_DMA_ADDRESS per-arch API doesn't fit 209 - * in the NUMA model, for now we convert it to a pfn and 210 - * we interpret this pfn as a local per-node information. 211 - * This issue isn't very important since none of these machines 212 - * have legacy ISA slots anyways. 213 - */ 214 - dma_local_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT; 215 - 216 - max_zone_pfn[ZONE_DMA] = dma_local_pfn; 217 - max_zone_pfn[ZONE_NORMAL] = max_pfn; 218 - 219 - free_area_init(max_zone_pfn); 220 - 221 - /* Initialize the kernel's ZERO_PGE. */ 222 - memset((void *)ZERO_PGE, 0, PAGE_SIZE); 223 - }
-13
arch/arc/Kconfig
··· 62 62 config GENERIC_CSUM 63 63 def_bool y 64 64 65 - config ARCH_DISCONTIGMEM_ENABLE 66 - def_bool n 67 - depends on BROKEN 68 - 69 65 config ARCH_FLATMEM_ENABLE 70 66 def_bool y 71 67 ··· 339 343 bool "16MB" 340 344 341 345 endchoice 342 - 343 - config NODES_SHIFT 344 - int "Maximum NUMA Nodes (as a power of 2)" 345 - default "0" if !DISCONTIGMEM 346 - default "1" if DISCONTIGMEM 347 - depends on NEED_MULTIPLE_NODES 348 - help 349 - Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory 350 - zones. 351 346 352 347 config ARC_COMPACT_IRQ_LEVELS 353 348 depends on ISA_ARCOMPACT
-40
arch/arc/include/asm/mmzone.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0-only */ 2 - /* 3 - * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com) 4 - */ 5 - 6 - #ifndef _ASM_ARC_MMZONE_H 7 - #define _ASM_ARC_MMZONE_H 8 - 9 - #ifdef CONFIG_DISCONTIGMEM 10 - 11 - extern struct pglist_data node_data[]; 12 - #define NODE_DATA(nid) (&node_data[nid]) 13 - 14 - static inline int pfn_to_nid(unsigned long pfn) 15 - { 16 - int is_end_low = 1; 17 - 18 - if (IS_ENABLED(CONFIG_ARC_HAS_PAE40)) 19 - is_end_low = pfn <= virt_to_pfn(0xFFFFFFFFUL); 20 - 21 - /* 22 - * node 0: lowmem: 0x8000_0000 to 0xFFFF_FFFF 23 - * node 1: HIGHMEM w/o PAE40: 0x0 to 0x7FFF_FFFF 24 - * HIGHMEM with PAE40: 0x1_0000_0000 to ... 25 - */ 26 - if (pfn >= ARCH_PFN_OFFSET && is_end_low) 27 - return 0; 28 - 29 - return 1; 30 - } 31 - 32 - static inline int pfn_valid(unsigned long pfn) 33 - { 34 - int nid = pfn_to_nid(pfn); 35 - 36 - return (pfn <= node_end_pfn(nid)); 37 - } 38 - #endif /* CONFIG_DISCONTIGMEM */ 39 - 40 - #endif
+4 -4
arch/arc/kernel/troubleshoot.c
··· 83 83 * non-inclusive vma 84 84 */ 85 85 mmap_read_lock(active_mm); 86 - vma = find_vma(active_mm, address); 86 + vma = vma_lookup(active_mm, address); 87 87 88 - /* check against the find_vma( ) behaviour which returns the next VMA 89 - * if the container VMA is not found 88 + /* Lookup the vma at the address and report if the container VMA is not 89 + * found 90 90 */ 91 - if (vma && (vma->vm_start <= address)) { 91 + if (vma) { 92 92 char buf[ARC_PATH_MAX]; 93 93 char *nm = "?"; 94 94
+5 -16
arch/arc/mm/init.c
··· 32 32 EXPORT_SYMBOL(arch_pfn_offset); 33 33 #endif 34 34 35 - #ifdef CONFIG_DISCONTIGMEM 36 - struct pglist_data node_data[MAX_NUMNODES] __read_mostly; 37 - EXPORT_SYMBOL(node_data); 38 - #endif 39 - 40 35 long __init arc_get_mem_sz(void) 41 36 { 42 37 return low_mem_sz; ··· 134 139 135 140 #ifdef CONFIG_HIGHMEM 136 141 /* 137 - * Populate a new node with highmem 138 - * 139 142 * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based) 140 - * than addresses in normal ala low memory (0x8000_0000 based). 143 + * than addresses in normal aka low memory (0x8000_0000 based). 141 144 * Even with PAE, the huge peripheral space hole would waste a lot of 142 - * mem with single mem_map[]. This warrants a mem_map per region design. 143 - * Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM. 144 - * 145 - * DISCONTIGMEM in turns requires multiple nodes. node 0 above is 146 - * populated with normal memory zone while node 1 only has highmem 145 + * mem with single contiguous mem_map[]. 146 + * Thus when HIGHMEM on ARC is enabled the memory map corresponding 147 + * to the hole is freed and ARC specific version of pfn_valid() 148 + * handles the hole in the memory map. 147 149 */ 148 - #ifdef CONFIG_DISCONTIGMEM 149 - node_set_online(1); 150 - #endif 151 150 152 151 min_high_pfn = PFN_DOWN(high_mem_start); 153 152 max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
+3 -10
arch/arm/include/asm/tlbflush.h
··· 253 253 * space. 254 254 * - mm - mm_struct describing address space 255 255 * 256 - * flush_tlb_range(mm,start,end) 256 + * flush_tlb_range(vma,start,end) 257 257 * 258 258 * Invalidate a range of TLB entries in the specified 259 259 * address space. ··· 261 261 * - start - start address (may not be aligned) 262 262 * - end - end address (exclusive, may not be aligned) 263 263 * 264 - * flush_tlb_page(vaddr,vma) 264 + * flush_tlb_page(vma, uaddr) 265 265 * 266 266 * Invalidate the specified page in the specified address range. 267 + * - vma - vm_area_struct describing address range 267 268 * - vaddr - virtual address (may not be aligned) 268 - * - vma - vma_struct describing address range 269 - * 270 - * flush_kern_tlb_page(kaddr) 271 - * 272 - * Invalidate the TLB entry for the specified page. The address 273 - * will be in the kernels virtual memory space. Current uses 274 - * only require the D-TLB to be invalidated. 275 - * - kaddr - Kernel virtual memory address 276 269 */ 277 270 278 271 /*
+1 -1
arch/arm/mm/tlb-v6.S
··· 24 24 * 25 25 * - start - start address (may not be aligned) 26 26 * - end - end address (exclusive, may not be aligned) 27 - * - vma - vma_struct describing address range 27 + * - vma - vm_area_struct describing address range 28 28 * 29 29 * It is assumed that: 30 30 * - the "Invalidate single entry" instruction will invalidate
+1 -1
arch/arm/mm/tlb-v7.S
··· 23 23 * 24 24 * - start - start address (may not be aligned) 25 25 * - end - end address (exclusive, may not be aligned) 26 - * - vma - vma_struct describing address range 26 + * - vma - vm_area_struct describing address range 27 27 * 28 28 * It is assumed that: 29 29 * - the "Invalidate single entry" instruction will invalidate
+1 -1
arch/arm64/Kconfig
··· 1035 1035 int "Maximum NUMA Nodes (as a power of 2)" 1036 1036 range 1 10 1037 1037 default "4" 1038 - depends on NEED_MULTIPLE_NODES 1038 + depends on NUMA 1039 1039 help 1040 1040 Specify the maximum number of NUMA Nodes available on the target 1041 1041 system. Increases memory reserved to accommodate various tables.
+1 -1
arch/arm64/kvm/mmu.c
··· 929 929 * get block mapping for device MMIO region. 930 930 */ 931 931 mmap_read_lock(current->mm); 932 - vma = find_vma_intersection(current->mm, hva, hva + 1); 932 + vma = vma_lookup(current->mm, hva); 933 933 if (unlikely(!vma)) { 934 934 kvm_err("Failed to find VMA for hva 0x%lx\n", hva); 935 935 mmap_read_unlock(current->mm);
-2
arch/h8300/kernel/setup.c
··· 69 69 70 70 static void __init bootmem_init(void) 71 71 { 72 - struct memblock_region *region; 73 - 74 72 memory_end = memory_start = 0; 75 73 76 74 /* Find main memory where is the kernel */
+1 -1
arch/ia64/Kconfig
··· 302 302 int "Max num nodes shift(3-10)" 303 303 range 3 10 304 304 default "10" 305 - depends on NEED_MULTIPLE_NODES 305 + depends on NUMA 306 306 help 307 307 This option specifies the maximum number of nodes in your SSI system. 308 308 MAX_NUMNODES will be 2^(This value).
+1 -1
arch/ia64/include/asm/pal.h
··· 1086 1086 1087 1087 /* 1088 1088 * Get the ratios for processor frequency, bus frequency and interval timer to 1089 - * to base frequency of the platform 1089 + * the base frequency of the platform 1090 1090 */ 1091 1091 static inline s64 1092 1092 ia64_pal_freq_ratios (struct pal_freq_ratio *proc_ratio, struct pal_freq_ratio *bus_ratio,
+1 -1
arch/ia64/include/asm/spinlock.h
··· 26 26 * the queue, and the other indicating the current tail. The lock is acquired 27 27 * by atomically noting the tail and incrementing it by one (thus adding 28 28 * ourself to the queue and noting our position), then waiting until the head 29 - * becomes equal to the the initial value of the tail. 29 + * becomes equal to the initial value of the tail. 30 30 * The pad bits in the middle are used to prevent the next_ticket number 31 31 * overflowing into the now_serving number. 32 32 *
+1 -1
arch/ia64/include/asm/uv/uv_hub.h
··· 257 257 return 0; 258 258 } 259 259 260 - /* Convert a cpu number to the the UV blade number */ 260 + /* Convert a cpu number to the UV blade number */ 261 261 static inline int uv_cpu_to_blade_id(int cpu) 262 262 { 263 263 return 0;
+1 -1
arch/ia64/kernel/efi_stub.S
··· 7 7 * 8 8 * This stub allows us to make EFI calls in physical mode with interrupts 9 9 * turned off. We need this because we can't call SetVirtualMap() until 10 - * the kernel has booted far enough to allow allocation of struct vma_struct 10 + * the kernel has booted far enough to allow allocation of struct vm_area_struct 11 11 * entries (which we would need to map stuff with memory attributes other 12 12 * than uncached or writeback...). Since the GetTime() service gets called 13 13 * earlier than that, we need to be able to make physical mode EFI calls from
+1 -1
arch/ia64/kernel/mca_drv.c
··· 343 343 344 344 /* - 2 - */ 345 345 sect_min_size = sal_log_sect_min_sizes[0]; 346 - for (i = 1; i < sizeof sal_log_sect_min_sizes/sizeof(size_t); i++) 346 + for (i = 1; i < ARRAY_SIZE(sal_log_sect_min_sizes); i++) 347 347 if (sect_min_size > sal_log_sect_min_sizes[i]) 348 348 sect_min_size = sal_log_sect_min_sizes[i]; 349 349
+2 -3
arch/ia64/kernel/topology.c
··· 3 3 * License. See the file "COPYING" in the main directory of this archive 4 4 * for more details. 5 5 * 6 - * This file contains NUMA specific variables and functions which can 7 - * be split away from DISCONTIGMEM and are used on NUMA machines with 8 - * contiguous memory. 6 + * This file contains NUMA specific variables and functions which are used on 7 + * NUMA machines with contiguous memory. 9 8 * 2002/08/07 Erich Focht <efocht@ess.nec.de> 10 9 * Populate cpu entries in sysfs for non-numa systems as well 11 10 * Intel Corporation - Ashok Raj
+2 -3
arch/ia64/mm/numa.c
··· 3 3 * License. See the file "COPYING" in the main directory of this archive 4 4 * for more details. 5 5 * 6 - * This file contains NUMA specific variables and functions which can 7 - * be split away from DISCONTIGMEM and are used on NUMA machines with 8 - * contiguous memory. 6 + * This file contains NUMA specific variables and functions which are used on 7 + * NUMA machines with contiguous memory. 9 8 * 10 9 * 2002/08/07 Erich Focht <efocht@ess.nec.de> 11 10 */
-10
arch/m68k/Kconfig.cpu
··· 408 408 order" to save memory that could be wasted for unused memory map. 409 409 Say N if not sure. 410 410 411 - config ARCH_DISCONTIGMEM_ENABLE 412 - depends on BROKEN 413 - def_bool MMU && !SINGLE_MEMORY_CHUNK 414 - 415 411 config FORCE_MAX_ZONEORDER 416 412 int "Maximum zone order" if ADVANCED 417 413 depends on !SINGLE_MEMORY_CHUNK ··· 446 450 bool 447 451 depends on MAC 448 452 default y 449 - 450 - config NODES_SHIFT 451 - int 452 - default "3" 453 - depends on DISCONTIGMEM 454 453 455 454 config CPU_HAS_NO_BITFIELDS 456 455 bool ··· 544 553 The ColdFire CPU cache is set into Copy-back mode. 545 554 endchoice 546 555 endif 547 -
-10
arch/m68k/include/asm/mmzone.h
··· 1 - /* SPDX-License-Identifier: GPL-2.0 */ 2 - #ifndef _ASM_M68K_MMZONE_H_ 3 - #define _ASM_M68K_MMZONE_H_ 4 - 5 - extern pg_data_t pg_data_map[]; 6 - 7 - #define NODE_DATA(nid) (&pg_data_map[nid]) 8 - #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map) 9 - 10 - #endif /* _ASM_M68K_MMZONE_H_ */
+1 -1
arch/m68k/include/asm/page.h
··· 62 62 #include <asm/page_no.h> 63 63 #endif 64 64 65 - #if !defined(CONFIG_MMU) || defined(CONFIG_DISCONTIGMEM) 65 + #ifndef CONFIG_MMU 66 66 #define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT)) 67 67 #define __pfn_to_phys(pfn) PFN_PHYS(pfn) 68 68 #endif
-35
arch/m68k/include/asm/page_mm.h
··· 126 126 127 127 extern int m68k_virt_to_node_shift; 128 128 129 - #ifndef CONFIG_DISCONTIGMEM 130 - #define __virt_to_node(addr) (&pg_data_map[0]) 131 - #else 132 - extern struct pglist_data *pg_data_table[]; 133 - 134 - static inline __attribute_const__ int __virt_to_node_shift(void) 135 - { 136 - int shift; 137 - 138 - asm ( 139 - "1: moveq #0,%0\n" 140 - m68k_fixup(%c1, 1b) 141 - : "=d" (shift) 142 - : "i" (m68k_fixup_vnode_shift)); 143 - return shift; 144 - } 145 - 146 - #define __virt_to_node(addr) (pg_data_table[(unsigned long)(addr) >> __virt_to_node_shift()]) 147 - #endif 148 - 149 129 #define virt_to_page(addr) ({ \ 150 130 pfn_to_page(virt_to_pfn(addr)); \ 151 131 }) ··· 133 153 pfn_to_virt(page_to_pfn(page)); \ 134 154 }) 135 155 136 - #ifdef CONFIG_DISCONTIGMEM 137 - #define pfn_to_page(pfn) ({ \ 138 - unsigned long __pfn = (pfn); \ 139 - struct pglist_data *pgdat; \ 140 - pgdat = __virt_to_node((unsigned long)pfn_to_virt(__pfn)); \ 141 - pgdat->node_mem_map + (__pfn - pgdat->node_start_pfn); \ 142 - }) 143 - #define page_to_pfn(_page) ({ \ 144 - const struct page *__p = (_page); \ 145 - struct pglist_data *pgdat; \ 146 - pgdat = &pg_data_map[page_to_nid(__p)]; \ 147 - ((__p) - pgdat->node_mem_map) + pgdat->node_start_pfn; \ 148 - }) 149 - #else 150 156 #define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT) 151 157 #include <asm-generic/memory_model.h> 152 - #endif 153 158 154 159 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && (unsigned long)(kaddr) < (unsigned long)high_memory) 155 160 #define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn))
+1 -1
arch/m68k/include/asm/tlbflush.h
··· 263 263 BUG(); 264 264 } 265 265 266 - static inline void flush_tlb_range(struct mm_struct *mm, 266 + static inline void flush_tlb_range(struct vm_area_struct *vma, 267 267 unsigned long start, unsigned long end) 268 268 { 269 269 BUG();
+2 -2
arch/m68k/kernel/sys_m68k.c
··· 402 402 * to this process. 403 403 */ 404 404 mmap_read_lock(current->mm); 405 - vma = find_vma(current->mm, addr); 406 - if (!vma || addr < vma->vm_start || addr + len > vma->vm_end) 405 + vma = vma_lookup(current->mm, addr); 406 + if (!vma || addr + len > vma->vm_end) 407 407 goto out_unlock; 408 408 } 409 409
-20
arch/m68k/mm/init.c
··· 44 44 45 45 int m68k_virt_to_node_shift; 46 46 47 - #ifdef CONFIG_DISCONTIGMEM 48 - pg_data_t pg_data_map[MAX_NUMNODES]; 49 - EXPORT_SYMBOL(pg_data_map); 50 - 51 - pg_data_t *pg_data_table[65]; 52 - EXPORT_SYMBOL(pg_data_table); 53 - #endif 54 - 55 47 void __init m68k_setup_node(int node) 56 48 { 57 - #ifdef CONFIG_DISCONTIGMEM 58 - struct m68k_mem_info *info = m68k_memory + node; 59 - int i, end; 60 - 61 - i = (unsigned long)phys_to_virt(info->addr) >> __virt_to_node_shift(); 62 - end = (unsigned long)phys_to_virt(info->addr + info->size - 1) >> __virt_to_node_shift(); 63 - for (; i <= end; i++) { 64 - if (pg_data_table[i]) 65 - pr_warn("overlap at %u for chunk %u\n", i, node); 66 - pg_data_table[i] = pg_data_map + node; 67 - } 68 - #endif 69 49 node_set_online(node); 70 50 } 71 51
+1 -1
arch/mips/Kconfig
··· 2867 2867 config NODES_SHIFT 2868 2868 int 2869 2869 default "6" 2870 - depends on NEED_MULTIPLE_NODES 2870 + depends on NUMA 2871 2871 2872 2872 config HW_PERF_EVENTS 2873 2873 bool "Enable hardware performance counter support for perf events"
+1 -7
arch/mips/include/asm/mmzone.h
··· 8 8 9 9 #include <asm/page.h> 10 10 11 - #ifdef CONFIG_NEED_MULTIPLE_NODES 11 + #ifdef CONFIG_NUMA 12 12 # include <mmzone.h> 13 13 #endif 14 14 ··· 19 19 #ifndef nid_to_addrbase 20 20 #define nid_to_addrbase(nid) 0 21 21 #endif 22 - 23 - #ifdef CONFIG_DISCONTIGMEM 24 - 25 - #define pfn_to_nid(pfn) pa_to_nid((pfn) << PAGE_SHIFT) 26 - 27 - #endif /* CONFIG_DISCONTIGMEM */ 28 22 29 23 #endif /* _ASM_MMZONE_H_ */
+1 -1
arch/mips/include/asm/page.h
··· 239 239 240 240 /* pfn_valid is defined in linux/mmzone.h */ 241 241 242 - #elif defined(CONFIG_NEED_MULTIPLE_NODES) 242 + #elif defined(CONFIG_NUMA) 243 243 244 244 #define pfn_valid(pfn) \ 245 245 ({ \
+1 -3
arch/mips/kernel/traps.c
··· 784 784 int process_fpemu_return(int sig, void __user *fault_addr, unsigned long fcr31) 785 785 { 786 786 int si_code; 787 - struct vm_area_struct *vma; 788 787 789 788 switch (sig) { 790 789 case 0: ··· 799 800 800 801 case SIGSEGV: 801 802 mmap_read_lock(current->mm); 802 - vma = find_vma(current->mm, (unsigned long)fault_addr); 803 - if (vma && (vma->vm_start <= (unsigned long)fault_addr)) 803 + if (vma_lookup(current->mm, (unsigned long)fault_addr)) 804 804 si_code = SEGV_ACCERR; 805 805 else 806 806 si_code = SEGV_MAPERR;
+2 -5
arch/mips/mm/init.c
··· 394 394 } 395 395 } 396 396 397 - #ifndef CONFIG_NEED_MULTIPLE_NODES 397 + #ifndef CONFIG_NUMA 398 398 void __init paging_init(void) 399 399 { 400 400 unsigned long max_zone_pfns[MAX_NR_ZONES]; ··· 454 454 BUILD_BUG_ON(IS_ENABLED(CONFIG_32BIT) && (_PFN_SHIFT > PAGE_SHIFT)); 455 455 456 456 #ifdef CONFIG_HIGHMEM 457 - #ifdef CONFIG_DISCONTIGMEM 458 - #error "CONFIG_HIGHMEM and CONFIG_DISCONTIGMEM dont work together yet" 459 - #endif 460 457 max_mapnr = highend_pfn ? highend_pfn : max_low_pfn; 461 458 #else 462 459 max_mapnr = max_low_pfn; ··· 473 476 0x80000000 - 4, KCORE_TEXT); 474 477 #endif 475 478 } 476 - #endif /* !CONFIG_NEED_MULTIPLE_NODES */ 479 + #endif /* !CONFIG_NUMA */ 477 480 478 481 void free_init_pages(const char *what, unsigned long begin, unsigned long end) 479 482 {
-6
arch/nds32/include/asm/memory.h
··· 76 76 * virt_to_page(k) convert a _valid_ virtual address to struct page * 77 77 * virt_addr_valid(k) indicates whether a virtual address is valid 78 78 */ 79 - #ifndef CONFIG_DISCONTIGMEM 80 - 81 79 #define ARCH_PFN_OFFSET PHYS_PFN_OFFSET 82 80 #define pfn_valid(pfn) ((pfn) >= PHYS_PFN_OFFSET && (pfn) < (PHYS_PFN_OFFSET + max_mapnr)) 83 81 84 82 #define virt_to_page(kaddr) (pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)) 85 83 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && (unsigned long)(kaddr) < (unsigned long)high_memory) 86 - 87 - #else /* CONFIG_DISCONTIGMEM */ 88 - #error CONFIG_DISCONTIGMEM is not supported yet. 89 - #endif /* !CONFIG_DISCONTIGMEM */ 90 84 91 85 #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT) 92 86
+1 -1
arch/openrisc/include/asm/tlbflush.h
··· 25 25 * - flush_tlb_all() flushes all processes TLBs 26 26 * - flush_tlb_mm(mm) flushes the specified mm context TLB's 27 27 * - flush_tlb_page(vma, vmaddr) flushes one page 28 - * - flush_tlb_range(mm, start, end) flushes a range of pages 28 + * - flush_tlb_range(vma, start, end) flushes a range of pages 29 29 */ 30 30 extern void local_flush_tlb_all(void); 31 31 extern void local_flush_tlb_mm(struct mm_struct *mm);
+1 -1
arch/powerpc/Kconfig
··· 671 671 int 672 672 default "8" if PPC64 673 673 default "4" 674 - depends on NEED_MULTIPLE_NODES 674 + depends on NUMA 675 675 676 676 config USE_PERCPU_NUMA_NODE_ID 677 677 def_bool y
+2 -2
arch/powerpc/include/asm/mmzone.h
··· 18 18 * flags field of the struct page 19 19 */ 20 20 21 - #ifdef CONFIG_NEED_MULTIPLE_NODES 21 + #ifdef CONFIG_NUMA 22 22 23 23 extern struct pglist_data *node_data[]; 24 24 /* ··· 41 41 42 42 #else 43 43 #define memory_hotplug_max() memblock_end_of_DRAM() 44 - #endif /* CONFIG_NEED_MULTIPLE_NODES */ 44 + #endif /* CONFIG_NUMA */ 45 45 #ifdef CONFIG_FA_DUMP 46 46 #define __HAVE_ARCH_RESERVED_KERNEL_PAGES 47 47 #endif
+1 -1
arch/powerpc/kernel/setup_64.c
··· 788 788 size_t align) 789 789 { 790 790 const unsigned long goal = __pa(MAX_DMA_ADDRESS); 791 - #ifdef CONFIG_NEED_MULTIPLE_NODES 791 + #ifdef CONFIG_NUMA 792 792 int node = early_cpu_to_node(cpu); 793 793 void *ptr; 794 794
+1 -1
arch/powerpc/kernel/smp.c
··· 1047 1047 zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu), 1048 1048 GFP_KERNEL, cpu_to_node(cpu)); 1049 1049 1050 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1050 + #ifdef CONFIG_NUMA 1051 1051 /* 1052 1052 * numa_node_id() works after this. 1053 1053 */
+2 -2
arch/powerpc/kexec/core.c
··· 68 68 void arch_crash_save_vmcoreinfo(void) 69 69 { 70 70 71 - #ifdef CONFIG_NEED_MULTIPLE_NODES 71 + #ifdef CONFIG_NUMA 72 72 VMCOREINFO_SYMBOL(node_data); 73 73 VMCOREINFO_LENGTH(node_data, MAX_NUMNODES); 74 74 #endif 75 - #ifndef CONFIG_NEED_MULTIPLE_NODES 75 + #ifndef CONFIG_NUMA 76 76 VMCOREINFO_SYMBOL(contig_page_data); 77 77 #endif 78 78 #if defined(CONFIG_PPC64) && defined(CONFIG_SPARSEMEM_VMEMMAP)
+2 -2
arch/powerpc/kvm/book3s_hv.c
··· 4924 4924 /* Look up the VMA for the start of this memory slot */ 4925 4925 hva = memslot->userspace_addr; 4926 4926 mmap_read_lock(kvm->mm); 4927 - vma = find_vma(kvm->mm, hva); 4928 - if (!vma || vma->vm_start > hva || (vma->vm_flags & VM_IO)) 4927 + vma = vma_lookup(kvm->mm, hva); 4928 + if (!vma || (vma->vm_flags & VM_IO)) 4929 4929 goto up_out; 4930 4930 4931 4931 psize = vma_kernel_pagesize(vma);
+1 -1
arch/powerpc/kvm/book3s_hv_uvmem.c
··· 615 615 616 616 /* Fetch the VMA if addr is not in the latest fetched one */ 617 617 if (!vma || addr >= vma->vm_end) { 618 - vma = find_vma_intersection(kvm->mm, addr, addr+1); 618 + vma = vma_lookup(kvm->mm, addr); 619 619 if (!vma) { 620 620 pr_err("Can't find VMA for gfn:0x%lx\n", gfn); 621 621 break;
+1 -1
arch/powerpc/mm/Makefile
··· 13 13 obj-$(CONFIG_PPC_MMU_NOHASH) += nohash/ 14 14 obj-$(CONFIG_PPC_BOOK3S_32) += book3s32/ 15 15 obj-$(CONFIG_PPC_BOOK3S_64) += book3s64/ 16 - obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o 16 + obj-$(CONFIG_NUMA) += numa.o 17 17 obj-$(CONFIG_PPC_MM_SLICES) += slice.o 18 18 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o 19 19 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
+2 -2
arch/powerpc/mm/mem.c
··· 127 127 } 128 128 #endif 129 129 130 - #ifndef CONFIG_NEED_MULTIPLE_NODES 130 + #ifndef CONFIG_NUMA 131 131 void __init mem_topology_setup(void) 132 132 { 133 133 max_low_pfn = max_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT; ··· 162 162 163 163 return 0; 164 164 } 165 - #else /* CONFIG_NEED_MULTIPLE_NODES */ 165 + #else /* CONFIG_NUMA */ 166 166 static int __init mark_nonram_nosave(void) 167 167 { 168 168 return 0;
+1 -1
arch/riscv/Kconfig
··· 332 332 int "Maximum NUMA Nodes (as a power of 2)" 333 333 range 1 10 334 334 default "2" 335 - depends on NEED_MULTIPLE_NODES 335 + depends on NUMA 336 336 help 337 337 Specify the maximum number of NUMA Nodes available on the target 338 338 system. Increases memory reserved to accommodate various tables.
+1 -1
arch/s390/Kconfig
··· 475 475 476 476 config NODES_SHIFT 477 477 int 478 - depends on NEED_MULTIPLE_NODES 478 + depends on NUMA 479 479 default "1" 480 480 481 481 config SCHED_SMT
-2
arch/s390/include/asm/pgtable.h
··· 344 344 #define PTRS_PER_P4D _CRST_ENTRIES 345 345 #define PTRS_PER_PGD _CRST_ENTRIES 346 346 347 - #define MAX_PTRS_PER_P4D PTRS_PER_P4D 348 - 349 347 /* 350 348 * Segment table and region3 table entry encoding 351 349 * (R = read-only, I = invalid, y = young bit):
+2 -2
arch/sh/include/asm/mmzone.h
··· 2 2 #ifndef __ASM_SH_MMZONE_H 3 3 #define __ASM_SH_MMZONE_H 4 4 5 - #ifdef CONFIG_NEED_MULTIPLE_NODES 5 + #ifdef CONFIG_NUMA 6 6 #include <linux/numa.h> 7 7 8 8 extern struct pglist_data *node_data[]; ··· 31 31 setup_bootmem_node(int nid, unsigned long start, unsigned long end) 32 32 { 33 33 } 34 - #endif /* CONFIG_NEED_MULTIPLE_NODES */ 34 + #endif /* CONFIG_NUMA */ 35 35 36 36 /* Platform specific mem init */ 37 37 void __init plat_mem_setup(void);
+1 -1
arch/sh/kernel/topology.c
··· 46 46 { 47 47 int i, ret; 48 48 49 - #ifdef CONFIG_NEED_MULTIPLE_NODES 49 + #ifdef CONFIG_NUMA 50 50 for_each_online_node(i) 51 51 register_one_node(i); 52 52 #endif
+1 -1
arch/sh/mm/Kconfig
··· 120 120 int 121 121 default "3" if CPU_SUBTYPE_SHX3 122 122 default "1" 123 - depends on NEED_MULTIPLE_NODES 123 + depends on NUMA 124 124 125 125 config ARCH_FLATMEM_ENABLE 126 126 def_bool y
+1 -1
arch/sh/mm/init.c
··· 211 211 212 212 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); 213 213 214 - #ifdef CONFIG_NEED_MULTIPLE_NODES 214 + #ifdef CONFIG_NUMA 215 215 NODE_DATA(nid) = memblock_alloc_try_nid( 216 216 sizeof(struct pglist_data), 217 217 SMP_CACHE_BYTES, MEMBLOCK_LOW_LIMIT,
+1 -1
arch/sparc/Kconfig
··· 265 265 int "Maximum NUMA Nodes (as a power of 2)" 266 266 range 4 5 if SPARC64 267 267 default "5" 268 - depends on NEED_MULTIPLE_NODES 268 + depends on NUMA 269 269 help 270 270 Specify the maximum number of NUMA Nodes available on the target 271 271 system. Increases memory reserved to accommodate various tables.
+2 -2
arch/sparc/include/asm/mmzone.h
··· 2 2 #ifndef _SPARC64_MMZONE_H 3 3 #define _SPARC64_MMZONE_H 4 4 5 - #ifdef CONFIG_NEED_MULTIPLE_NODES 5 + #ifdef CONFIG_NUMA 6 6 7 7 #include <linux/cpumask.h> 8 8 ··· 13 13 extern int numa_cpu_lookup_table[]; 14 14 extern cpumask_t numa_cpumask_lookup_table[]; 15 15 16 - #endif /* CONFIG_NEED_MULTIPLE_NODES */ 16 + #endif /* CONFIG_NUMA */ 17 17 18 18 #endif /* _SPARC64_MMZONE_H */
+1 -1
arch/sparc/kernel/smp_64.c
··· 1543 1543 size_t align) 1544 1544 { 1545 1545 const unsigned long goal = __pa(MAX_DMA_ADDRESS); 1546 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1546 + #ifdef CONFIG_NUMA 1547 1547 int node = cpu_to_node(cpu); 1548 1548 void *ptr; 1549 1549
+6 -6
arch/sparc/mm/init_64.c
··· 903 903 static struct node_mem_mask node_masks[MAX_NUMNODES]; 904 904 static int num_node_masks; 905 905 906 - #ifdef CONFIG_NEED_MULTIPLE_NODES 906 + #ifdef CONFIG_NUMA 907 907 908 908 struct mdesc_mlgroup { 909 909 u64 node; ··· 1059 1059 { 1060 1060 struct pglist_data *p; 1061 1061 unsigned long start_pfn, end_pfn; 1062 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1062 + #ifdef CONFIG_NUMA 1063 1063 1064 1064 NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data), 1065 1065 SMP_CACHE_BYTES, nid); ··· 1080 1080 1081 1081 static void init_node_masks_nonnuma(void) 1082 1082 { 1083 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1083 + #ifdef CONFIG_NUMA 1084 1084 int i; 1085 1085 #endif 1086 1086 ··· 1090 1090 node_masks[0].match = 0; 1091 1091 num_node_masks = 1; 1092 1092 1093 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1093 + #ifdef CONFIG_NUMA 1094 1094 for (i = 0; i < NR_CPUS; i++) 1095 1095 numa_cpu_lookup_table[i] = 0; 1096 1096 ··· 1098 1098 #endif 1099 1099 } 1100 1100 1101 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1101 + #ifdef CONFIG_NUMA 1102 1102 struct pglist_data *node_data[MAX_NUMNODES]; 1103 1103 1104 1104 EXPORT_SYMBOL(numa_cpu_lookup_table); ··· 2487 2487 2488 2488 static void __init register_page_bootmem_info(void) 2489 2489 { 2490 - #ifdef CONFIG_NEED_MULTIPLE_NODES 2490 + #ifdef CONFIG_NUMA 2491 2491 int i; 2492 2492 2493 2493 for_each_online_node(i)
+1 -1
arch/x86/Kconfig
··· 1597 1597 default "10" if MAXSMP 1598 1598 default "6" if X86_64 1599 1599 default "3" 1600 - depends on NEED_MULTIPLE_NODES 1600 + depends on NUMA 1601 1601 help 1602 1602 Specify the maximum number of NUMA Nodes available on the target 1603 1603 system. Increases memory reserved to accommodate various tables.
+2 -2
arch/x86/ia32/ia32_aout.c
··· 203 203 error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text, 204 204 PROT_READ | PROT_EXEC, 205 205 MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | 206 - MAP_EXECUTABLE | MAP_32BIT, 206 + MAP_32BIT, 207 207 fd_offset); 208 208 209 209 if (error != N_TXTADDR(ex)) ··· 212 212 error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data, 213 213 PROT_READ | PROT_WRITE | PROT_EXEC, 214 214 MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | 215 - MAP_EXECUTABLE | MAP_32BIT, 215 + MAP_32BIT, 216 216 fd_offset + ex.a_text); 217 217 if (error != N_DATADDR(ex)) 218 218 return error;
+11 -2
arch/x86/kernel/cpu/mce/core.c
··· 1257 1257 { 1258 1258 struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); 1259 1259 int flags = MF_ACTION_REQUIRED; 1260 + int ret; 1260 1261 1261 1262 pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); 1262 1263 1263 1264 if (!p->mce_ripv) 1264 1265 flags |= MF_MUST_KILL; 1265 1266 1266 - if (!memory_failure(p->mce_addr >> PAGE_SHIFT, flags) && 1267 - !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { 1267 + ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); 1268 + if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { 1268 1269 set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); 1269 1270 sync_core(); 1270 1271 return; 1271 1272 } 1273 + 1274 + /* 1275 + * -EHWPOISON from memory_failure() means that it already sent SIGBUS 1276 + * to the current process with the proper error info, so no need to 1277 + * send SIGBUS here again. 1278 + */ 1279 + if (ret == -EHWPOISON) 1280 + return; 1272 1281 1273 1282 if (p->mce_vaddr != (void __user *)-1l) { 1274 1283 force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
+2 -2
arch/x86/kernel/cpu/sgx/encl.h
··· 91 91 { 92 92 struct vm_area_struct *result; 93 93 94 - result = find_vma(mm, addr); 95 - if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start) 94 + result = vma_lookup(mm, addr); 95 + if (!result || result->vm_ops != &sgx_vm_ops) 96 96 return -EINVAL; 97 97 98 98 *vma = result;
+3 -3
arch/x86/kernel/setup_percpu.c
··· 66 66 */ 67 67 static bool __init pcpu_need_numa(void) 68 68 { 69 - #ifdef CONFIG_NEED_MULTIPLE_NODES 69 + #ifdef CONFIG_NUMA 70 70 pg_data_t *last = NULL; 71 71 unsigned int cpu; 72 72 ··· 101 101 unsigned long align) 102 102 { 103 103 const unsigned long goal = __pa(MAX_DMA_ADDRESS); 104 - #ifdef CONFIG_NEED_MULTIPLE_NODES 104 + #ifdef CONFIG_NUMA 105 105 int node = early_cpu_to_node(cpu); 106 106 void *ptr; 107 107 ··· 140 140 141 141 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to) 142 142 { 143 - #ifdef CONFIG_NEED_MULTIPLE_NODES 143 + #ifdef CONFIG_NUMA 144 144 if (early_cpu_to_node(from) == early_cpu_to_node(to)) 145 145 return LOCAL_DISTANCE; 146 146 else
+2 -2
arch/x86/mm/init_32.c
··· 651 651 highmem_pfn_init(); 652 652 } 653 653 654 - #ifndef CONFIG_NEED_MULTIPLE_NODES 654 + #ifndef CONFIG_NUMA 655 655 void __init initmem_init(void) 656 656 { 657 657 #ifdef CONFIG_HIGHMEM ··· 677 677 678 678 setup_bootmem_allocator(); 679 679 } 680 - #endif /* !CONFIG_NEED_MULTIPLE_NODES */ 680 + #endif /* !CONFIG_NUMA */ 681 681 682 682 void __init setup_bootmem_allocator(void) 683 683 {
-4
arch/xtensa/include/asm/page.h
··· 192 192 #define pfn_valid(pfn) \ 193 193 ((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr) 194 194 195 - #ifdef CONFIG_DISCONTIGMEM 196 - # error CONFIG_DISCONTIGMEM not supported 197 - #endif 198 - 199 195 #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) 200 196 #define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT) 201 197 #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
+2 -2
arch/xtensa/include/asm/tlbflush.h
··· 26 26 * 27 27 * - flush_tlb_all() flushes all processes TLB entries 28 28 * - flush_tlb_mm(mm) flushes the specified mm context TLB entries 29 - * - flush_tlb_page(mm, vmaddr) flushes a single page 30 - * - flush_tlb_range(mm, start, end) flushes a range of pages 29 + * - flush_tlb_page(vma, page) flushes a single page 30 + * - flush_tlb_range(vma, vmaddr, end) flushes a range of pages 31 31 */ 32 32 33 33 void local_flush_tlb_all(void);
+10 -8
drivers/base/node.c
··· 482 482 static ssize_t node_read_numastat(struct device *dev, 483 483 struct device_attribute *attr, char *buf) 484 484 { 485 + fold_vm_numa_events(); 485 486 return sysfs_emit(buf, 486 487 "numa_hit %lu\n" 487 488 "numa_miss %lu\n" ··· 490 489 "interleave_hit %lu\n" 491 490 "local_node %lu\n" 492 491 "other_node %lu\n", 493 - sum_zone_numa_state(dev->id, NUMA_HIT), 494 - sum_zone_numa_state(dev->id, NUMA_MISS), 495 - sum_zone_numa_state(dev->id, NUMA_FOREIGN), 496 - sum_zone_numa_state(dev->id, NUMA_INTERLEAVE_HIT), 497 - sum_zone_numa_state(dev->id, NUMA_LOCAL), 498 - sum_zone_numa_state(dev->id, NUMA_OTHER)); 492 + sum_zone_numa_event_state(dev->id, NUMA_HIT), 493 + sum_zone_numa_event_state(dev->id, NUMA_MISS), 494 + sum_zone_numa_event_state(dev->id, NUMA_FOREIGN), 495 + sum_zone_numa_event_state(dev->id, NUMA_INTERLEAVE_HIT), 496 + sum_zone_numa_event_state(dev->id, NUMA_LOCAL), 497 + sum_zone_numa_event_state(dev->id, NUMA_OTHER)); 499 498 } 500 499 static DEVICE_ATTR(numastat, 0444, node_read_numastat, NULL); 501 500 ··· 513 512 sum_zone_node_page_state(nid, i)); 514 513 515 514 #ifdef CONFIG_NUMA 516 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) 515 + fold_vm_numa_events(); 516 + for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++) 517 517 len += sysfs_emit_at(buf, len, "%s %lu\n", 518 518 numa_stat_name(i), 519 - sum_zone_numa_state(nid, i)); 519 + sum_zone_numa_event_state(nid, i)); 520 520 521 521 #endif 522 522 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+214 -44
drivers/block/loop.c
··· 71 71 #include <linux/writeback.h> 72 72 #include <linux/completion.h> 73 73 #include <linux/highmem.h> 74 - #include <linux/kthread.h> 75 74 #include <linux/splice.h> 76 75 #include <linux/sysfs.h> 77 76 #include <linux/miscdevice.h> ··· 78 79 #include <linux/uio.h> 79 80 #include <linux/ioprio.h> 80 81 #include <linux/blk-cgroup.h> 82 + #include <linux/sched/mm.h> 81 83 82 84 #include "loop.h" 83 85 84 86 #include <linux/uaccess.h> 87 + 88 + #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ) 85 89 86 90 static DEFINE_IDR(loop_index_idr); 87 91 static DEFINE_MUTEX(loop_ctl_mutex); ··· 517 515 { 518 516 struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb); 519 517 520 - if (cmd->css) 521 - css_put(cmd->css); 522 518 cmd->ret = ret; 523 519 lo_rw_aio_do_completion(cmd); 524 520 } ··· 577 577 cmd->iocb.ki_complete = lo_rw_aio_complete; 578 578 cmd->iocb.ki_flags = IOCB_DIRECT; 579 579 cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); 580 - if (cmd->css) 581 - kthread_associate_blkcg(cmd->css); 582 580 583 581 if (rw == WRITE) 584 582 ret = call_write_iter(file, &cmd->iocb, &iter); ··· 584 586 ret = call_read_iter(file, &cmd->iocb, &iter); 585 587 586 588 lo_rw_aio_do_completion(cmd); 587 - kthread_associate_blkcg(NULL); 588 589 589 590 if (ret != -EIOCBQUEUED) 590 591 cmd->iocb.ki_complete(&cmd->iocb, ret, 0); ··· 918 921 q->limits.discard_alignment = 0; 919 922 } 920 923 921 - static void loop_unprepare_queue(struct loop_device *lo) 922 - { 923 - kthread_flush_worker(&lo->worker); 924 - kthread_stop(lo->worker_task); 925 - } 924 + struct loop_worker { 925 + struct rb_node rb_node; 926 + struct work_struct work; 927 + struct list_head cmd_list; 928 + struct list_head idle_list; 929 + struct loop_device *lo; 930 + struct cgroup_subsys_state *blkcg_css; 931 + unsigned long last_ran_at; 932 + }; 926 933 927 - static int loop_kthread_worker_fn(void *worker_ptr) 928 - { 929 - current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; 930 - return kthread_worker_fn(worker_ptr); 931 - } 934 + static void loop_workfn(struct work_struct *work); 935 + static void loop_rootcg_workfn(struct work_struct *work); 936 + static void loop_free_idle_workers(struct timer_list *timer); 932 937 933 - static int loop_prepare_queue(struct loop_device *lo) 938 + #ifdef CONFIG_BLK_CGROUP 939 + static inline int queue_on_root_worker(struct cgroup_subsys_state *css) 934 940 { 935 - kthread_init_worker(&lo->worker); 936 - lo->worker_task = kthread_run(loop_kthread_worker_fn, 937 - &lo->worker, "loop%d", lo->lo_number); 938 - if (IS_ERR(lo->worker_task)) 939 - return -ENOMEM; 940 - set_user_nice(lo->worker_task, MIN_NICE); 941 - return 0; 941 + return !css || css == blkcg_root_css; 942 + } 943 + #else 944 + static inline int queue_on_root_worker(struct cgroup_subsys_state *css) 945 + { 946 + return !css; 947 + } 948 + #endif 949 + 950 + static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd) 951 + { 952 + struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL; 953 + struct loop_worker *cur_worker, *worker = NULL; 954 + struct work_struct *work; 955 + struct list_head *cmd_list; 956 + 957 + spin_lock_irq(&lo->lo_work_lock); 958 + 959 + if (queue_on_root_worker(cmd->blkcg_css)) 960 + goto queue_work; 961 + 962 + node = &lo->worker_tree.rb_node; 963 + 964 + while (*node) { 965 + parent = *node; 966 + cur_worker = container_of(*node, struct loop_worker, rb_node); 967 + if (cur_worker->blkcg_css == cmd->blkcg_css) { 968 + worker = cur_worker; 969 + break; 970 + } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) { 971 + node = &(*node)->rb_left; 972 + } else { 973 + node = &(*node)->rb_right; 974 + } 975 + } 976 + if (worker) 977 + goto queue_work; 978 + 979 + worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN); 980 + /* 981 + * In the event we cannot allocate a worker, just queue on the 982 + * rootcg worker and issue the I/O as the rootcg 983 + */ 984 + if (!worker) { 985 + 
cmd->blkcg_css = NULL; 986 + if (cmd->memcg_css) 987 + css_put(cmd->memcg_css); 988 + cmd->memcg_css = NULL; 989 + goto queue_work; 990 + } 991 + 992 + worker->blkcg_css = cmd->blkcg_css; 993 + css_get(worker->blkcg_css); 994 + INIT_WORK(&worker->work, loop_workfn); 995 + INIT_LIST_HEAD(&worker->cmd_list); 996 + INIT_LIST_HEAD(&worker->idle_list); 997 + worker->lo = lo; 998 + rb_link_node(&worker->rb_node, parent, node); 999 + rb_insert_color(&worker->rb_node, &lo->worker_tree); 1000 + queue_work: 1001 + if (worker) { 1002 + /* 1003 + * We need to remove from the idle list here while 1004 + * holding the lock so that the idle timer doesn't 1005 + * free the worker 1006 + */ 1007 + if (!list_empty(&worker->idle_list)) 1008 + list_del_init(&worker->idle_list); 1009 + work = &worker->work; 1010 + cmd_list = &worker->cmd_list; 1011 + } else { 1012 + work = &lo->rootcg_work; 1013 + cmd_list = &lo->rootcg_cmd_list; 1014 + } 1015 + list_add_tail(&cmd->list_entry, cmd_list); 1016 + queue_work(lo->workqueue, work); 1017 + spin_unlock_irq(&lo->lo_work_lock); 942 1018 } 943 1019 944 1020 static void loop_update_rotational(struct loop_device *lo) ··· 1197 1127 !file->f_op->write_iter) 1198 1128 lo->lo_flags |= LO_FLAGS_READ_ONLY; 1199 1129 1200 - error = loop_prepare_queue(lo); 1201 - if (error) 1130 + lo->workqueue = alloc_workqueue("loop%d", 1131 + WQ_UNBOUND | WQ_FREEZABLE, 1132 + 0, 1133 + lo->lo_number); 1134 + if (!lo->workqueue) { 1135 + error = -ENOMEM; 1202 1136 goto out_unlock; 1137 + } 1203 1138 1204 1139 set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0); 1205 1140 1141 + INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn); 1142 + INIT_LIST_HEAD(&lo->rootcg_cmd_list); 1143 + INIT_LIST_HEAD(&lo->idle_worker_list); 1144 + lo->worker_tree = RB_ROOT; 1145 + timer_setup(&lo->timer, loop_free_idle_workers, 1146 + TIMER_DEFERRABLE); 1206 1147 lo->use_dio = lo->lo_flags & LO_FLAGS_DIRECT_IO; 1207 1148 lo->lo_device = bdev; 1208 1149 lo->lo_backing_file = file; 
··· 1281 1200 int err = 0; 1282 1201 bool partscan = false; 1283 1202 int lo_number; 1203 + struct loop_worker *pos, *worker; 1284 1204 1285 1205 mutex_lock(&lo->lo_mutex); 1286 1206 if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) { ··· 1300 1218 1301 1219 /* freeze request queue during the transition */ 1302 1220 blk_mq_freeze_queue(lo->lo_queue); 1221 + 1222 + destroy_workqueue(lo->workqueue); 1223 + spin_lock_irq(&lo->lo_work_lock); 1224 + list_for_each_entry_safe(worker, pos, &lo->idle_worker_list, 1225 + idle_list) { 1226 + list_del(&worker->idle_list); 1227 + rb_erase(&worker->rb_node, &lo->worker_tree); 1228 + css_put(worker->blkcg_css); 1229 + kfree(worker); 1230 + } 1231 + spin_unlock_irq(&lo->lo_work_lock); 1232 + del_timer_sync(&lo->timer); 1303 1233 1304 1234 spin_lock_irq(&lo->lo_lock); 1305 1235 lo->lo_backing_file = NULL; ··· 1349 1255 1350 1256 partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev; 1351 1257 lo_number = lo->lo_number; 1352 - loop_unprepare_queue(lo); 1353 1258 out_unlock: 1354 1259 mutex_unlock(&lo->lo_mutex); 1355 1260 if (partscan) { ··· 2101 2008 } 2102 2009 2103 2010 /* always use the first bio's css */ 2011 + cmd->blkcg_css = NULL; 2012 + cmd->memcg_css = NULL; 2104 2013 #ifdef CONFIG_BLK_CGROUP 2105 - if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) { 2106 - cmd->css = &bio_blkcg(rq->bio)->css; 2107 - css_get(cmd->css); 2108 - } else 2014 + if (rq->bio && rq->bio->bi_blkg) { 2015 + cmd->blkcg_css = &bio_blkcg(rq->bio)->css; 2016 + #ifdef CONFIG_MEMCG 2017 + cmd->memcg_css = 2018 + cgroup_get_e_css(cmd->blkcg_css->cgroup, 2019 + &memory_cgrp_subsys); 2109 2020 #endif 2110 - cmd->css = NULL; 2111 - kthread_queue_work(&lo->worker, &cmd->work); 2021 + } 2022 + #endif 2023 + loop_queue_work(lo, cmd); 2112 2024 2113 2025 return BLK_STS_OK; 2114 2026 } ··· 2124 2026 const bool write = op_is_write(req_op(rq)); 2125 2027 struct loop_device *lo = rq->q->queuedata; 2126 2028 int ret = 0; 2029 + struct mem_cgroup *old_memcg = NULL; 2127 2030 
2128 2031 if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) { 2129 2032 ret = -EIO; 2130 2033 goto failed; 2131 2034 } 2132 2035 2036 + if (cmd->blkcg_css) 2037 + kthread_associate_blkcg(cmd->blkcg_css); 2038 + if (cmd->memcg_css) 2039 + old_memcg = set_active_memcg( 2040 + mem_cgroup_from_css(cmd->memcg_css)); 2041 + 2133 2042 ret = do_req_filebacked(lo, rq); 2043 + 2044 + if (cmd->blkcg_css) 2045 + kthread_associate_blkcg(NULL); 2046 + 2047 + if (cmd->memcg_css) { 2048 + set_active_memcg(old_memcg); 2049 + css_put(cmd->memcg_css); 2050 + } 2134 2051 failed: 2135 2052 /* complete non-aio request */ 2136 2053 if (!cmd->use_aio || ret) { ··· 2158 2045 } 2159 2046 } 2160 2047 2161 - static void loop_queue_work(struct kthread_work *work) 2048 + static void loop_set_timer(struct loop_device *lo) 2162 2049 { 2163 - struct loop_cmd *cmd = 2164 - container_of(work, struct loop_cmd, work); 2165 - 2166 - loop_handle_cmd(cmd); 2050 + timer_reduce(&lo->timer, jiffies + LOOP_IDLE_WORKER_TIMEOUT); 2167 2051 } 2168 2052 2169 - static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq, 2170 - unsigned int hctx_idx, unsigned int numa_node) 2053 + static void loop_process_work(struct loop_worker *worker, 2054 + struct list_head *cmd_list, struct loop_device *lo) 2171 2055 { 2172 - struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq); 2056 + int orig_flags = current->flags; 2057 + struct loop_cmd *cmd; 2173 2058 2174 - kthread_init_work(&cmd->work, loop_queue_work); 2175 - return 0; 2059 + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; 2060 + spin_lock_irq(&lo->lo_work_lock); 2061 + while (!list_empty(cmd_list)) { 2062 + cmd = container_of( 2063 + cmd_list->next, struct loop_cmd, list_entry); 2064 + list_del(cmd_list->next); 2065 + spin_unlock_irq(&lo->lo_work_lock); 2066 + 2067 + loop_handle_cmd(cmd); 2068 + cond_resched(); 2069 + 2070 + spin_lock_irq(&lo->lo_work_lock); 2071 + } 2072 + 2073 + /* 2074 + * We only add to the idle list if there are no pending cmds 
2075 + * *and* the worker will not run again which ensures that it 2076 + * is safe to free any worker on the idle list 2077 + */ 2078 + if (worker && !work_pending(&worker->work)) { 2079 + worker->last_ran_at = jiffies; 2080 + list_add_tail(&worker->idle_list, &lo->idle_worker_list); 2081 + loop_set_timer(lo); 2082 + } 2083 + spin_unlock_irq(&lo->lo_work_lock); 2084 + current->flags = orig_flags; 2085 + } 2086 + 2087 + static void loop_workfn(struct work_struct *work) 2088 + { 2089 + struct loop_worker *worker = 2090 + container_of(work, struct loop_worker, work); 2091 + loop_process_work(worker, &worker->cmd_list, worker->lo); 2092 + } 2093 + 2094 + static void loop_rootcg_workfn(struct work_struct *work) 2095 + { 2096 + struct loop_device *lo = 2097 + container_of(work, struct loop_device, rootcg_work); 2098 + loop_process_work(NULL, &lo->rootcg_cmd_list, lo); 2099 + } 2100 + 2101 + static void loop_free_idle_workers(struct timer_list *timer) 2102 + { 2103 + struct loop_device *lo = container_of(timer, struct loop_device, timer); 2104 + struct loop_worker *pos, *worker; 2105 + 2106 + spin_lock_irq(&lo->lo_work_lock); 2107 + list_for_each_entry_safe(worker, pos, &lo->idle_worker_list, 2108 + idle_list) { 2109 + if (time_is_after_jiffies(worker->last_ran_at + 2110 + LOOP_IDLE_WORKER_TIMEOUT)) 2111 + break; 2112 + list_del(&worker->idle_list); 2113 + rb_erase(&worker->rb_node, &lo->worker_tree); 2114 + css_put(worker->blkcg_css); 2115 + kfree(worker); 2116 + } 2117 + if (!list_empty(&lo->idle_worker_list)) 2118 + loop_set_timer(lo); 2119 + spin_unlock_irq(&lo->lo_work_lock); 2176 2120 } 2177 2121 2178 2122 static const struct blk_mq_ops loop_mq_ops = { 2179 2123 .queue_rq = loop_queue_rq, 2180 - .init_request = loop_init_request, 2181 2124 .complete = lo_complete_rq, 2182 2125 }; 2183 2126 ··· 2322 2153 mutex_init(&lo->lo_mutex); 2323 2154 lo->lo_number = i; 2324 2155 spin_lock_init(&lo->lo_lock); 2156 + spin_lock_init(&lo->lo_work_lock); 2325 2157 disk->major = 
LOOP_MAJOR; 2326 2158 disk->first_minor = i << part_shift; 2327 2159 disk->fops = &lo_fops;
+10 -5
drivers/block/loop.h
··· 14 14 #include <linux/blk-mq.h> 15 15 #include <linux/spinlock.h> 16 16 #include <linux/mutex.h> 17 - #include <linux/kthread.h> 18 17 #include <uapi/linux/loop.h> 19 18 20 19 /* Possible states of device */ ··· 54 55 55 56 spinlock_t lo_lock; 56 57 int lo_state; 57 - struct kthread_worker worker; 58 - struct task_struct *worker_task; 58 + spinlock_t lo_work_lock; 59 + struct workqueue_struct *workqueue; 60 + struct work_struct rootcg_work; 61 + struct list_head rootcg_cmd_list; 62 + struct list_head idle_worker_list; 63 + struct rb_root worker_tree; 64 + struct timer_list timer; 59 65 bool use_dio; 60 66 bool sysfs_inited; 61 67 ··· 71 67 }; 72 68 73 69 struct loop_cmd { 74 - struct kthread_work work; 70 + struct list_head list_entry; 75 71 bool use_aio; /* use AIO interface to handle I/O */ 76 72 atomic_t ref; /* only for aio */ 77 73 long ret; 78 74 struct kiocb iocb; 79 75 struct bio_vec *bvec; 80 - struct cgroup_subsys_state *css; 76 + struct cgroup_subsys_state *blkcg_css; 77 + struct cgroup_subsys_state *memcg_css; 81 78 }; 82 79 83 80 /* Support for loadable transfer modules */
+1 -1
drivers/dax/device.c
··· 337 337 } 338 338 339 339 static const struct address_space_operations dev_dax_aops = { 340 - .set_page_dirty = noop_set_page_dirty, 340 + .set_page_dirty = __set_page_dirty_no_writeback, 341 341 .invalidatepage = noop_invalidatepage, 342 342 }; 343 343
+2 -2
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
··· 709 709 } 710 710 711 711 mmap_read_lock(mm); 712 - vma = find_vma(mm, start); 713 - if (unlikely(!vma || start < vma->vm_start)) { 712 + vma = vma_lookup(mm, start); 713 + if (unlikely(!vma)) { 714 714 r = -EFAULT; 715 715 goto out_unlock; 716 716 }
+1 -1
drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c
··· 871 871 872 872 pr_debug("igt_mmap(%s, %d) @ %lx\n", obj->mm.region->name, type, addr); 873 873 874 - area = find_vma(current->mm, addr); 874 + area = vma_lookup(current->mm, addr); 875 875 if (!area) { 876 876 pr_err("%s: Did not create a vm_area_struct for the mmap\n", 877 877 obj->mm.region->name);
+1 -1
drivers/media/common/videobuf2/frame_vector.c
··· 64 64 do { 65 65 unsigned long *nums = frame_vector_pfns(vec); 66 66 67 - vma = find_vma_intersection(mm, start, start + 1); 67 + vma = vma_lookup(mm, start); 68 68 if (!vma) 69 69 break; 70 70
+2 -2
drivers/misc/sgi-gru/grufault.c
··· 49 49 { 50 50 struct vm_area_struct *vma; 51 51 52 - vma = find_vma(current->mm, vaddr); 53 - if (vma && vma->vm_start <= vaddr && vma->vm_ops == &gru_vm_ops) 52 + vma = vma_lookup(current->mm, vaddr); 53 + if (vma && vma->vm_ops == &gru_vm_ops) 54 54 return vma; 55 55 return NULL; 56 56 }
+1 -1
drivers/vfio/vfio_iommu_type1.c
··· 567 567 vaddr = untagged_addr(vaddr); 568 568 569 569 retry: 570 - vma = find_vma_intersection(mm, vaddr, vaddr + 1); 570 + vma = vma_lookup(mm, vaddr); 571 571 572 572 if (vma && vma->vm_flags & VM_PFNMAP) { 573 573 ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+17
drivers/virtio/virtio_balloon.c
··· 993 993 goto out_unregister_oom; 994 994 } 995 995 996 + /* 997 + * The default page reporting order is @pageblock_order, which 998 + * corresponds to 512MB in size on ARM64 when 64KB base page 999 + * size is used. The page reporting won't be triggered if the 1000 + * freeing page can't come up with a free area like that huge. 1001 + * So we specify the page reporting order to 5, corresponding 1002 + * to 2MB. It helps to avoid THP splitting if 4KB base page 1003 + * size is used by host. 1004 + * 1005 + * Ideally, the page reporting order is selected based on the 1006 + * host's base page size. However, it needs more work to report 1007 + * that value. The hard-coded order would be fine currently. 1008 + */ 1009 + #if defined(CONFIG_ARM64) && defined(CONFIG_ARM64_64K_PAGES) 1010 + vb->pr_dev_info.order = 5; 1011 + #endif 1012 + 996 1013 err = page_reporting_register(&vb->pr_dev_info); 997 1014 if (err) 998 1015 goto out_unregister_oom;
+1
fs/adfs/inode.c
··· 73 73 } 74 74 75 75 static const struct address_space_operations adfs_aops = { 76 + .set_page_dirty = __set_page_dirty_buffers, 76 77 .readpage = adfs_readpage, 77 78 .writepage = adfs_writepage, 78 79 .write_begin = adfs_write_begin,
+2
fs/affs/file.c
··· 453 453 } 454 454 455 455 const struct address_space_operations affs_aops = { 456 + .set_page_dirty = __set_page_dirty_buffers, 456 457 .readpage = affs_readpage, 457 458 .writepage = affs_writepage, 458 459 .write_begin = affs_write_begin, ··· 834 833 } 835 834 836 835 const struct address_space_operations affs_aops_ofs = { 836 + .set_page_dirty = __set_page_dirty_buffers, 837 837 .readpage = affs_readpage_ofs, 838 838 //.writepage = affs_writepage_ofs, 839 839 .write_begin = affs_write_begin_ofs,
+1
fs/bfs/file.c
··· 188 188 } 189 189 190 190 const struct address_space_operations bfs_aops = { 191 + .set_page_dirty = __set_page_dirty_buffers, 191 192 .readpage = bfs_readpage, 192 193 .writepage = bfs_writepage, 193 194 .write_begin = bfs_write_begin,
+2 -2
fs/binfmt_aout.c
··· 222 222 223 223 error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text, 224 224 PROT_READ | PROT_EXEC, 225 - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE, 225 + MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE, 226 226 fd_offset); 227 227 228 228 if (error != N_TXTADDR(ex)) ··· 230 230 231 231 error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data, 232 232 PROT_READ | PROT_WRITE | PROT_EXEC, 233 - MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE, 233 + MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE, 234 234 fd_offset + ex.a_text); 235 235 if (error != N_DATADDR(ex)) 236 236 return error;
+1 -1
fs/binfmt_elf.c
··· 1070 1070 elf_prot = make_prot(elf_ppnt->p_flags, &arch_state, 1071 1071 !!interpreter, false); 1072 1072 1073 - elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE; 1073 + elf_flags = MAP_PRIVATE | MAP_DENYWRITE; 1074 1074 1075 1075 vaddr = elf_ppnt->p_vaddr; 1076 1076 /*
+2 -9
fs/binfmt_elf_fdpic.c
··· 928 928 { 929 929 struct elf32_fdpic_loadseg *seg; 930 930 struct elf32_phdr *phdr; 931 - unsigned long load_addr, base = ULONG_MAX, top = 0, maddr = 0, mflags; 931 + unsigned long load_addr, base = ULONG_MAX, top = 0, maddr = 0; 932 932 int loop, ret; 933 933 934 934 load_addr = params->load_addr; ··· 948 948 } 949 949 950 950 /* allocate one big anon block for everything */ 951 - mflags = MAP_PRIVATE; 952 - if (params->flags & ELF_FDPIC_FLAG_EXECUTABLE) 953 - mflags |= MAP_EXECUTABLE; 954 - 955 951 maddr = vm_mmap(NULL, load_addr, top - base, 956 - PROT_READ | PROT_WRITE | PROT_EXEC, mflags, 0); 952 + PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE, 0); 957 953 if (IS_ERR_VALUE(maddr)) 958 954 return (int) maddr; 959 955 ··· 1042 1046 if (phdr->p_flags & PF_X) prot |= PROT_EXEC; 1043 1047 1044 1048 flags = MAP_PRIVATE | MAP_DENYWRITE; 1045 - if (params->flags & ELF_FDPIC_FLAG_EXECUTABLE) 1046 - flags |= MAP_EXECUTABLE; 1047 - 1048 1049 maddr = 0; 1049 1050 1050 1051 switch (params->flags & ELF_FDPIC_FLAG_ARRANGEMENT) {
+1 -1
fs/binfmt_flat.c
··· 573 573 pr_debug("ROM mapping of file (we hope)\n"); 574 574 575 575 textpos = vm_mmap(bprm->file, 0, text_len, PROT_READ|PROT_EXEC, 576 - MAP_PRIVATE|MAP_EXECUTABLE, 0); 576 + MAP_PRIVATE, 0); 577 577 if (!textpos || IS_ERR_VALUE(textpos)) { 578 578 ret = textpos; 579 579 if (!textpos)
+1
fs/block_dev.c
··· 1754 1754 } 1755 1755 1756 1756 static const struct address_space_operations def_blk_aops = { 1757 + .set_page_dirty = __set_page_dirty_buffers, 1757 1758 .readpage = blkdev_readpage, 1758 1759 .readahead = blkdev_readahead, 1759 1760 .writepage = blkdev_writepage,
-25
fs/buffer.c
··· 589 589 EXPORT_SYMBOL(mark_buffer_dirty_inode); 590 590 591 591 /* 592 - * Mark the page dirty, and set it dirty in the page cache, and mark the inode 593 - * dirty. 594 - * 595 - * If warn is true, then emit a warning if the page is not uptodate and has 596 - * not been truncated. 597 - * 598 - * The caller must hold lock_page_memcg(). 599 - */ 600 - void __set_page_dirty(struct page *page, struct address_space *mapping, 601 - int warn) 602 - { 603 - unsigned long flags; 604 - 605 - xa_lock_irqsave(&mapping->i_pages, flags); 606 - if (page->mapping) { /* Race with truncate? */ 607 - WARN_ON_ONCE(warn && !PageUptodate(page)); 608 - account_page_dirtied(page, mapping); 609 - __xa_set_mark(&mapping->i_pages, page_index(page), 610 - PAGECACHE_TAG_DIRTY); 611 - } 612 - xa_unlock_irqrestore(&mapping->i_pages, flags); 613 - } 614 - EXPORT_SYMBOL_GPL(__set_page_dirty); 615 - 616 - /* 617 592 * Add a page to the dirty page list. 618 593 * 619 594 * It is a sad fact of life that this function is called from several places
+1 -7
fs/configfs/inode.c
··· 28 28 static struct lock_class_key default_group_class[MAX_LOCK_DEPTH]; 29 29 #endif 30 30 31 - static const struct address_space_operations configfs_aops = { 32 - .readpage = simple_readpage, 33 - .write_begin = simple_write_begin, 34 - .write_end = simple_write_end, 35 - }; 36 - 37 31 static const struct inode_operations configfs_inode_operations ={ 38 32 .setattr = configfs_setattr, 39 33 }; ··· 108 114 struct inode * inode = new_inode(s); 109 115 if (inode) { 110 116 inode->i_ino = get_next_ino(); 111 - inode->i_mapping->a_ops = &configfs_aops; 117 + inode->i_mapping->a_ops = &ram_aops; 112 118 inode->i_op = &configfs_inode_operations; 113 119 114 120 if (sd->s_iattr) {
+2 -1
fs/dax.c
··· 488 488 struct address_space *mapping, unsigned int order) 489 489 { 490 490 unsigned long index = xas->xa_index; 491 - bool pmd_downgrade = false; /* splitting PMD entry into PTE entries? */ 491 + bool pmd_downgrade; /* splitting PMD entry into PTE entries? */ 492 492 void *entry; 493 493 494 494 retry: 495 + pmd_downgrade = false; 495 496 xas_lock_irq(xas); 496 497 entry = get_unlocked_entry(xas, order); 497 498
+13
fs/ecryptfs/mmap.c
··· 533 533 return block; 534 534 } 535 535 536 + #include <linux/buffer_head.h> 537 + 536 538 const struct address_space_operations ecryptfs_aops = { 539 + /* 540 + * XXX: This is pretty broken for multiple reasons: ecryptfs does not 541 + * actually use buffer_heads, and ecryptfs will crash without 542 + * CONFIG_BLOCK. But it matches the behavior before the default for 543 + * address_space_operations without the ->set_page_dirty method was 544 + * cleaned up, so this is the best we can do without maintainer 545 + * feedback. 546 + */ 547 + #ifdef CONFIG_BLOCK 548 + .set_page_dirty = __set_page_dirty_buffers, 549 + #endif 537 550 .writepage = ecryptfs_writepage, 538 551 .readpage = ecryptfs_readpage, 539 552 .write_begin = ecryptfs_write_begin,
+1
fs/exfat/inode.c
··· 491 491 } 492 492 493 493 static const struct address_space_operations exfat_aops = { 494 + .set_page_dirty = __set_page_dirty_buffers, 494 495 .readpage = exfat_readpage, 495 496 .readahead = exfat_readahead, 496 497 .writepage = exfat_writepage,
+3 -1
fs/ext2/inode.c
··· 961 961 } 962 962 963 963 const struct address_space_operations ext2_aops = { 964 + .set_page_dirty = __set_page_dirty_buffers, 964 965 .readpage = ext2_readpage, 965 966 .readahead = ext2_readahead, 966 967 .writepage = ext2_writepage, ··· 976 975 }; 977 976 978 977 const struct address_space_operations ext2_nobh_aops = { 978 + .set_page_dirty = __set_page_dirty_buffers, 979 979 .readpage = ext2_readpage, 980 980 .readahead = ext2_readahead, 981 981 .writepage = ext2_nobh_writepage, ··· 992 990 static const struct address_space_operations ext2_dax_aops = { 993 991 .writepages = ext2_dax_writepages, 994 992 .direct_IO = noop_direct_IO, 995 - .set_page_dirty = noop_set_page_dirty, 993 + .set_page_dirty = __set_page_dirty_no_writeback, 996 994 .invalidatepage = noop_invalidatepage, 997 995 }; 998 996
+1 -1
fs/ext4/inode.c
··· 3701 3701 static const struct address_space_operations ext4_dax_aops = { 3702 3702 .writepages = ext4_dax_writepages, 3703 3703 .direct_IO = noop_direct_IO, 3704 - .set_page_dirty = noop_set_page_dirty, 3704 + .set_page_dirty = __set_page_dirty_no_writeback, 3705 3705 .bmap = ext4_bmap, 3706 3706 .invalidatepage = noop_invalidatepage, 3707 3707 .swap_activate = ext4_iomap_swap_activate,
+1
fs/fat/inode.c
··· 342 342 } 343 343 344 344 static const struct address_space_operations fat_aops = { 345 + .set_page_dirty = __set_page_dirty_buffers, 345 346 .readpage = fat_readpage, 346 347 .readahead = fat_readahead, 347 348 .writepage = fat_writepage,
+235 -97
fs/fs-writeback.c
··· 131 131 return false; 132 132 } 133 133 134 - /** 135 - * inode_io_list_del_locked - remove an inode from its bdi_writeback IO list 136 - * @inode: inode to be removed 137 - * @wb: bdi_writeback @inode is being removed from 138 - * 139 - * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and 140 - * clear %WB_has_dirty_io if all are empty afterwards. 141 - */ 142 - static void inode_io_list_del_locked(struct inode *inode, 143 - struct bdi_writeback *wb) 144 - { 145 - assert_spin_locked(&wb->list_lock); 146 - assert_spin_locked(&inode->i_lock); 147 - 148 - inode->i_state &= ~I_SYNC_QUEUED; 149 - list_del_init(&inode->i_io_list); 150 - wb_io_lists_depopulated(wb); 151 - } 152 - 153 134 static void wb_wakeup(struct bdi_writeback *wb) 154 135 { 155 136 spin_lock_bh(&wb->work_lock); ··· 225 244 /* one round can affect upto 5 slots */ 226 245 #define WB_FRN_MAX_IN_FLIGHT 1024 /* don't queue too many concurrently */ 227 246 247 + /* 248 + * Maximum inodes per isw. A specific value has been chosen to make 249 + * struct inode_switch_wbs_context fit into 1024 bytes kmalloc. 250 + */ 251 + #define WB_MAX_INODES_PER_ISW ((1024UL - sizeof(struct inode_switch_wbs_context)) \ 252 + / sizeof(struct inode *)) 253 + 228 254 static atomic_t isw_nr_in_flight = ATOMIC_INIT(0); 229 255 static struct workqueue_struct *isw_wq; 230 256 ··· 265 277 wb_put(wb); 266 278 } 267 279 EXPORT_SYMBOL_GPL(__inode_attach_wb); 280 + 281 + /** 282 + * inode_cgwb_move_to_attached - put the inode onto wb->b_attached list 283 + * @inode: inode of interest with i_lock held 284 + * @wb: target bdi_writeback 285 + * 286 + * Remove the inode from wb's io lists and if necessarily put onto b_attached 287 + * list. Only inodes attached to cgwb's are kept on this list. 
288 + */ 289 + static void inode_cgwb_move_to_attached(struct inode *inode, 290 + struct bdi_writeback *wb) 291 + { 292 + assert_spin_locked(&wb->list_lock); 293 + assert_spin_locked(&inode->i_lock); 294 + 295 + inode->i_state &= ~I_SYNC_QUEUED; 296 + if (wb != &wb->bdi->wb) 297 + list_move(&inode->i_io_list, &wb->b_attached); 298 + else 299 + list_del_init(&inode->i_io_list); 300 + wb_io_lists_depopulated(wb); 301 + } 268 302 269 303 /** 270 304 * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it ··· 342 332 } 343 333 344 334 struct inode_switch_wbs_context { 345 - struct inode *inode; 346 - struct bdi_writeback *new_wb; 335 + struct rcu_work work; 347 336 348 - struct rcu_head rcu_head; 349 - struct work_struct work; 337 + /* 338 + * Multiple inodes can be switched at once. The switching procedure 339 + * consists of two parts, separated by a RCU grace period. To make 340 + * sure that the second part is executed for each inode gone through 341 + * the first part, all inode pointers are placed into a NULL-terminated 342 + * array embedded into struct inode_switch_wbs_context. Otherwise 343 + * an inode could be left in a non-consistent state. 
344 + */ 345 + struct bdi_writeback *new_wb; 346 + struct inode *inodes[]; 350 347 }; 351 348 352 349 static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) ··· 366 349 up_write(&bdi->wb_switch_rwsem); 367 350 } 368 351 369 - static void inode_switch_wbs_work_fn(struct work_struct *work) 352 + static bool inode_do_switch_wbs(struct inode *inode, 353 + struct bdi_writeback *old_wb, 354 + struct bdi_writeback *new_wb) 370 355 { 371 - struct inode_switch_wbs_context *isw = 372 - container_of(work, struct inode_switch_wbs_context, work); 373 - struct inode *inode = isw->inode; 374 - struct backing_dev_info *bdi = inode_to_bdi(inode); 375 356 struct address_space *mapping = inode->i_mapping; 376 - struct bdi_writeback *old_wb = inode->i_wb; 377 - struct bdi_writeback *new_wb = isw->new_wb; 378 357 XA_STATE(xas, &mapping->i_pages, 0); 379 358 struct page *page; 380 359 bool switched = false; 381 360 382 - /* 383 - * If @inode switches cgwb membership while sync_inodes_sb() is 384 - * being issued, sync_inodes_sb() might miss it. Synchronize. 385 - */ 386 - down_read(&bdi->wb_switch_rwsem); 387 - 388 - /* 389 - * By the time control reaches here, RCU grace period has passed 390 - * since I_WB_SWITCH assertion and all wb stat update transactions 391 - * between unlocked_inode_to_wb_begin/end() are guaranteed to be 392 - * synchronizing against the i_pages lock. 393 - * 394 - * Grabbing old_wb->list_lock, inode->i_lock and the i_pages lock 395 - * gives us exclusion against all wb related operations on @inode 396 - * including IO list manipulations and stat updates. 
397 - */ 398 - if (old_wb < new_wb) { 399 - spin_lock(&old_wb->list_lock); 400 - spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING); 401 - } else { 402 - spin_lock(&new_wb->list_lock); 403 - spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING); 404 - } 405 361 spin_lock(&inode->i_lock); 406 362 xa_lock_irq(&mapping->i_pages); 407 363 408 364 /* 409 - * Once I_FREEING is visible under i_lock, the eviction path owns 410 - * the inode and we shouldn't modify ->i_io_list. 365 + * Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction 366 + * path owns the inode and we shouldn't modify ->i_io_list. 411 367 */ 412 - if (unlikely(inode->i_state & I_FREEING)) 368 + if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE))) 413 369 goto skip_switch; 414 370 415 371 trace_inode_switch_wbs(inode, old_wb, new_wb); ··· 409 419 wb_get(new_wb); 410 420 411 421 /* 412 - * Transfer to @new_wb's IO list if necessary. The specific list 413 - * @inode was on is ignored and the inode is put on ->b_dirty which 414 - * is always correct including from ->b_dirty_time. The transfer 415 - * preserves @inode->dirtied_when ordering. 422 + * Transfer to @new_wb's IO list if necessary. If the @inode is dirty, 423 + * the specific list @inode was on is ignored and the @inode is put on 424 + * ->b_dirty which is always correct including from ->b_dirty_time. 425 + * The transfer preserves @inode->dirtied_when ordering. If the @inode 426 + * was clean, it means it was on the b_attached list, so move it onto 427 + * the b_attached list of @new_wb. 
416 428 */ 417 429 if (!list_empty(&inode->i_io_list)) { 418 - struct inode *pos; 419 - 420 - inode_io_list_del_locked(inode, old_wb); 421 430 inode->i_wb = new_wb; 422 - list_for_each_entry(pos, &new_wb->b_dirty, i_io_list) 423 - if (time_after_eq(inode->dirtied_when, 424 - pos->dirtied_when)) 425 - break; 426 - inode_io_list_move_locked(inode, new_wb, pos->i_io_list.prev); 431 + 432 + if (inode->i_state & I_DIRTY_ALL) { 433 + struct inode *pos; 434 + 435 + list_for_each_entry(pos, &new_wb->b_dirty, i_io_list) 436 + if (time_after_eq(inode->dirtied_when, 437 + pos->dirtied_when)) 438 + break; 439 + inode_io_list_move_locked(inode, new_wb, 440 + pos->i_io_list.prev); 441 + } else { 442 + inode_cgwb_move_to_attached(inode, new_wb); 443 + } 427 444 } else { 428 445 inode->i_wb = new_wb; 429 446 } ··· 449 452 450 453 xa_unlock_irq(&mapping->i_pages); 451 454 spin_unlock(&inode->i_lock); 455 + 456 + return switched; 457 + } 458 + 459 + static void inode_switch_wbs_work_fn(struct work_struct *work) 460 + { 461 + struct inode_switch_wbs_context *isw = 462 + container_of(to_rcu_work(work), struct inode_switch_wbs_context, work); 463 + struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]); 464 + struct bdi_writeback *old_wb = isw->inodes[0]->i_wb; 465 + struct bdi_writeback *new_wb = isw->new_wb; 466 + unsigned long nr_switched = 0; 467 + struct inode **inodep; 468 + 469 + /* 470 + * If @inode switches cgwb membership while sync_inodes_sb() is 471 + * being issued, sync_inodes_sb() might miss it. Synchronize. 472 + */ 473 + down_read(&bdi->wb_switch_rwsem); 474 + 475 + /* 476 + * By the time control reaches here, RCU grace period has passed 477 + * since I_WB_SWITCH assertion and all wb stat update transactions 478 + * between unlocked_inode_to_wb_begin/end() are guaranteed to be 479 + * synchronizing against the i_pages lock. 
480 + * 481 + * Grabbing old_wb->list_lock, inode->i_lock and the i_pages lock 482 + * gives us exclusion against all wb related operations on @inode 483 + * including IO list manipulations and stat updates. 484 + */ 485 + if (old_wb < new_wb) { 486 + spin_lock(&old_wb->list_lock); 487 + spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING); 488 + } else { 489 + spin_lock(&new_wb->list_lock); 490 + spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING); 491 + } 492 + 493 + for (inodep = isw->inodes; *inodep; inodep++) { 494 + WARN_ON_ONCE((*inodep)->i_wb != old_wb); 495 + if (inode_do_switch_wbs(*inodep, old_wb, new_wb)) 496 + nr_switched++; 497 + } 498 + 452 499 spin_unlock(&new_wb->list_lock); 453 500 spin_unlock(&old_wb->list_lock); 454 501 455 502 up_read(&bdi->wb_switch_rwsem); 456 503 457 - if (switched) { 504 + if (nr_switched) { 458 505 wb_wakeup(new_wb); 459 - wb_put(old_wb); 506 + wb_put_many(old_wb, nr_switched); 460 507 } 508 + 509 + for (inodep = isw->inodes; *inodep; inodep++) 510 + iput(*inodep); 461 511 wb_put(new_wb); 462 - 463 - iput(inode); 464 512 kfree(isw); 465 - 466 513 atomic_dec(&isw_nr_in_flight); 467 514 } 468 515 469 - static void inode_switch_wbs_rcu_fn(struct rcu_head *rcu_head) 516 + static bool inode_prepare_wbs_switch(struct inode *inode, 517 + struct bdi_writeback *new_wb) 470 518 { 471 - struct inode_switch_wbs_context *isw = container_of(rcu_head, 472 - struct inode_switch_wbs_context, rcu_head); 519 + /* 520 + * Paired with smp_mb() in cgroup_writeback_umount(). 521 + * isw_nr_in_flight must be increased before checking SB_ACTIVE and 522 + * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0 523 + * in cgroup_writeback_umount() and the isw_wq will be not flushed. 
524 + */ 525 + smp_mb(); 473 526 474 - /* needs to grab bh-unsafe locks, bounce to work item */ 475 - INIT_WORK(&isw->work, inode_switch_wbs_work_fn); 476 - queue_work(isw_wq, &isw->work); 527 + /* while holding I_WB_SWITCH, no one else can update the association */ 528 + spin_lock(&inode->i_lock); 529 + if (!(inode->i_sb->s_flags & SB_ACTIVE) || 530 + inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) || 531 + inode_to_wb(inode) == new_wb) { 532 + spin_unlock(&inode->i_lock); 533 + return false; 534 + } 535 + inode->i_state |= I_WB_SWITCH; 536 + __iget(inode); 537 + spin_unlock(&inode->i_lock); 538 + 539 + return true; 477 540 } 478 541 479 542 /** ··· 558 501 if (atomic_read(&isw_nr_in_flight) > WB_FRN_MAX_IN_FLIGHT) 559 502 return; 560 503 561 - isw = kzalloc(sizeof(*isw), GFP_ATOMIC); 504 + isw = kzalloc(sizeof(*isw) + 2 * sizeof(struct inode *), GFP_ATOMIC); 562 505 if (!isw) 563 506 return; 507 + 508 + atomic_inc(&isw_nr_in_flight); 564 509 565 510 /* find and pin the new wb */ 566 511 rcu_read_lock(); ··· 573 514 if (!isw->new_wb) 574 515 goto out_free; 575 516 576 - /* while holding I_WB_SWITCH, no one else can update the association */ 577 - spin_lock(&inode->i_lock); 578 - if (!(inode->i_sb->s_flags & SB_ACTIVE) || 579 - inode->i_state & (I_WB_SWITCH | I_FREEING) || 580 - inode_to_wb(inode) == isw->new_wb) { 581 - spin_unlock(&inode->i_lock); 517 + if (!inode_prepare_wbs_switch(inode, isw->new_wb)) 582 518 goto out_free; 583 - } 584 - inode->i_state |= I_WB_SWITCH; 585 - __iget(inode); 586 - spin_unlock(&inode->i_lock); 587 519 588 - isw->inode = inode; 520 + isw->inodes[0] = inode; 589 521 590 522 /* 591 523 * In addition to synchronizing among switchers, I_WB_SWITCH tells ··· 584 534 * lock so that stat transfer can synchronize against them. 585 535 * Let's continue after I_WB_SWITCH is guaranteed to be visible. 
586 536 */ 587 - call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn); 588 - 589 - atomic_inc(&isw_nr_in_flight); 537 + INIT_RCU_WORK(&isw->work, inode_switch_wbs_work_fn); 538 + queue_rcu_work(isw_wq, &isw->work); 590 539 return; 591 540 592 541 out_free: 542 + atomic_dec(&isw_nr_in_flight); 593 543 if (isw->new_wb) 594 544 wb_put(isw->new_wb); 595 545 kfree(isw); 546 + } 547 + 548 + /** 549 + * cleanup_offline_cgwb - detach associated inodes 550 + * @wb: target wb 551 + * 552 + * Switch all inodes attached to @wb to a nearest living ancestor's wb in order 553 + * to eventually release the dying @wb. Returns %true if not all inodes were 554 + * switched and the function has to be restarted. 555 + */ 556 + bool cleanup_offline_cgwb(struct bdi_writeback *wb) 557 + { 558 + struct cgroup_subsys_state *memcg_css; 559 + struct inode_switch_wbs_context *isw; 560 + struct inode *inode; 561 + int nr; 562 + bool restart = false; 563 + 564 + isw = kzalloc(sizeof(*isw) + WB_MAX_INODES_PER_ISW * 565 + sizeof(struct inode *), GFP_KERNEL); 566 + if (!isw) 567 + return restart; 568 + 569 + atomic_inc(&isw_nr_in_flight); 570 + 571 + for (memcg_css = wb->memcg_css->parent; memcg_css; 572 + memcg_css = memcg_css->parent) { 573 + isw->new_wb = wb_get_create(wb->bdi, memcg_css, GFP_KERNEL); 574 + if (isw->new_wb) 575 + break; 576 + } 577 + if (unlikely(!isw->new_wb)) 578 + isw->new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ 579 + 580 + nr = 0; 581 + spin_lock(&wb->list_lock); 582 + list_for_each_entry(inode, &wb->b_attached, i_io_list) { 583 + if (!inode_prepare_wbs_switch(inode, isw->new_wb)) 584 + continue; 585 + 586 + isw->inodes[nr++] = inode; 587 + 588 + if (nr >= WB_MAX_INODES_PER_ISW - 1) { 589 + restart = true; 590 + break; 591 + } 592 + } 593 + spin_unlock(&wb->list_lock); 594 + 595 + /* no attached inodes? 
bail out */ 596 + if (nr == 0) { 597 + atomic_dec(&isw_nr_in_flight); 598 + wb_put(isw->new_wb); 599 + kfree(isw); 600 + return restart; 601 + } 602 + 603 + /* 604 + * In addition to synchronizing among switchers, I_WB_SWITCH tells 605 + * the RCU protected stat update paths to grab the i_page 606 + * lock so that stat transfer can synchronize against them. 607 + * Let's continue after I_WB_SWITCH is guaranteed to be visible. 608 + */ 609 + INIT_RCU_WORK(&isw->work, inode_switch_wbs_work_fn); 610 + queue_rcu_work(isw_wq, &isw->work); 611 + 612 + return restart; 596 613 } 597 614 598 615 /** ··· 1117 1000 */ 1118 1001 void cgroup_writeback_umount(void) 1119 1002 { 1003 + /* 1004 + * SB_ACTIVE should be reliably cleared before checking 1005 + * isw_nr_in_flight, see generic_shutdown_super(). 1006 + */ 1007 + smp_mb(); 1008 + 1120 1009 if (atomic_read(&isw_nr_in_flight)) { 1121 1010 /* 1122 1011 * Use rcu_barrier() to wait for all pending callbacks to ··· 1146 1023 1147 1024 static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } 1148 1025 static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } 1026 + 1027 + static void inode_cgwb_move_to_attached(struct inode *inode, 1028 + struct bdi_writeback *wb) 1029 + { 1030 + assert_spin_locked(&wb->list_lock); 1031 + assert_spin_locked(&inode->i_lock); 1032 + 1033 + inode->i_state &= ~I_SYNC_QUEUED; 1034 + list_del_init(&inode->i_io_list); 1035 + wb_io_lists_depopulated(wb); 1036 + } 1149 1037 1150 1038 static struct bdi_writeback * 1151 1039 locked_inode_to_wb_and_lock_list(struct inode *inode) ··· 1258 1124 1259 1125 wb = inode_to_wb_and_lock_list(inode); 1260 1126 spin_lock(&inode->i_lock); 1261 - inode_io_list_del_locked(inode, wb); 1127 + 1128 + inode->i_state &= ~I_SYNC_QUEUED; 1129 + list_del_init(&inode->i_io_list); 1130 + wb_io_lists_depopulated(wb); 1131 + 1262 1132 spin_unlock(&inode->i_lock); 1263 1133 spin_unlock(&wb->list_lock); 1264 1134 } ··· 1575 1437 inode->i_state &= 
~I_SYNC_QUEUED; 1576 1438 } else { 1577 1439 /* The inode is clean. Remove from writeback lists. */ 1578 - inode_io_list_del_locked(inode, wb); 1440 + inode_cgwb_move_to_attached(inode, wb); 1579 1441 } 1580 1442 } 1581 1443 ··· 1727 1589 * responsible for the writeback lists. 1728 1590 */ 1729 1591 if (!(inode->i_state & I_DIRTY_ALL)) 1730 - inode_io_list_del_locked(inode, wb); 1592 + inode_cgwb_move_to_attached(inode, wb); 1731 1593 spin_unlock(&wb->list_lock); 1732 1594 inode_sync_complete(inode); 1733 1595 out:
+2 -1
fs/fuse/dax.c
··· 9 9 #include <linux/delay.h> 10 10 #include <linux/dax.h> 11 11 #include <linux/uio.h> 12 + #include <linux/pagemap.h> 12 13 #include <linux/pfn_t.h> 13 14 #include <linux/iomap.h> 14 15 #include <linux/interval_tree.h> ··· 1330 1329 static const struct address_space_operations fuse_dax_file_aops = { 1331 1330 .writepages = fuse_dax_writepages, 1332 1331 .direct_IO = noop_direct_IO, 1333 - .set_page_dirty = noop_set_page_dirty, 1332 + .set_page_dirty = __set_page_dirty_no_writeback, 1334 1333 .invalidatepage = noop_invalidatepage, 1335 1334 }; 1336 1335
+1 -1
fs/gfs2/aops.c
··· 784 784 .writepages = gfs2_writepages, 785 785 .readpage = gfs2_readpage, 786 786 .readahead = gfs2_readahead, 787 - .set_page_dirty = iomap_set_page_dirty, 787 + .set_page_dirty = __set_page_dirty_nobuffers, 788 788 .releasepage = iomap_releasepage, 789 789 .invalidatepage = iomap_invalidatepage, 790 790 .bmap = gfs2_bmap,
+2
fs/gfs2/meta_io.c
··· 89 89 } 90 90 91 91 const struct address_space_operations gfs2_meta_aops = { 92 + .set_page_dirty = __set_page_dirty_buffers, 92 93 .writepage = gfs2_aspace_writepage, 93 94 .releasepage = gfs2_releasepage, 94 95 }; 95 96 96 97 const struct address_space_operations gfs2_rgrp_aops = { 98 + .set_page_dirty = __set_page_dirty_buffers, 97 99 .writepage = gfs2_aspace_writepage, 98 100 .releasepage = gfs2_releasepage, 99 101 };
+2
fs/hfs/inode.c
··· 159 159 } 160 160 161 161 const struct address_space_operations hfs_btree_aops = { 162 + .set_page_dirty = __set_page_dirty_buffers, 162 163 .readpage = hfs_readpage, 163 164 .writepage = hfs_writepage, 164 165 .write_begin = hfs_write_begin, ··· 169 168 }; 170 169 171 170 const struct address_space_operations hfs_aops = { 171 + .set_page_dirty = __set_page_dirty_buffers, 172 172 .readpage = hfs_readpage, 173 173 .writepage = hfs_writepage, 174 174 .write_begin = hfs_write_begin,
+2
fs/hfsplus/inode.c
··· 156 156 } 157 157 158 158 const struct address_space_operations hfsplus_btree_aops = { 159 + .set_page_dirty = __set_page_dirty_buffers, 159 160 .readpage = hfsplus_readpage, 160 161 .writepage = hfsplus_writepage, 161 162 .write_begin = hfsplus_write_begin, ··· 166 165 }; 167 166 168 167 const struct address_space_operations hfsplus_aops = { 168 + .set_page_dirty = __set_page_dirty_buffers, 169 169 .readpage = hfsplus_readpage, 170 170 .writepage = hfsplus_writepage, 171 171 .write_begin = hfsplus_write_begin,
+1
fs/hpfs/file.c
··· 196 196 } 197 197 198 198 const struct address_space_operations hpfs_aops = { 199 + .set_page_dirty = __set_page_dirty_buffers, 199 200 .readpage = hpfs_readpage, 200 201 .writepage = hpfs_writepage, 201 202 .readahead = hpfs_readahead,
+1 -26
fs/iomap/buffered-io.c
··· 640 640 return status; 641 641 } 642 642 643 - int 644 - iomap_set_page_dirty(struct page *page) 645 - { 646 - struct address_space *mapping = page_mapping(page); 647 - int newly_dirty; 648 - 649 - if (unlikely(!mapping)) 650 - return !TestSetPageDirty(page); 651 - 652 - /* 653 - * Lock out page's memcg migration to keep PageDirty 654 - * synchronized with per-memcg dirty page counters. 655 - */ 656 - lock_page_memcg(page); 657 - newly_dirty = !TestSetPageDirty(page); 658 - if (newly_dirty) 659 - __set_page_dirty(page, mapping, 0); 660 - unlock_page_memcg(page); 661 - 662 - if (newly_dirty) 663 - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 664 - return newly_dirty; 665 - } 666 - EXPORT_SYMBOL_GPL(iomap_set_page_dirty); 667 - 668 643 static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len, 669 644 size_t copied, struct page *page) 670 645 { ··· 659 684 if (unlikely(copied < len && !PageUptodate(page))) 660 685 return 0; 661 686 iomap_set_range_uptodate(page, offset_in_page(pos), len); 662 - iomap_set_page_dirty(page); 687 + __set_page_dirty_nobuffers(page); 663 688 return copied; 664 689 } 665 690
+1
fs/jfs/inode.c
··· 356 356 } 357 357 358 358 const struct address_space_operations jfs_aops = { 359 + .set_page_dirty = __set_page_dirty_buffers, 359 360 .readpage = jfs_readpage, 360 361 .readahead = jfs_readahead, 361 362 .writepage = jfs_writepage,
+1 -7
fs/kernfs/inode.c
··· 17 17 18 18 #include "kernfs-internal.h" 19 19 20 - static const struct address_space_operations kernfs_aops = { 21 - .readpage = simple_readpage, 22 - .write_begin = simple_write_begin, 23 - .write_end = simple_write_end, 24 - }; 25 - 26 20 static const struct inode_operations kernfs_iops = { 27 21 .permission = kernfs_iop_permission, 28 22 .setattr = kernfs_iop_setattr, ··· 197 203 { 198 204 kernfs_get(kn); 199 205 inode->i_private = kn; 200 - inode->i_mapping->a_ops = &kernfs_aops; 206 + inode->i_mapping->a_ops = &ram_aops; 201 207 inode->i_op = &kernfs_iops; 202 208 inode->i_generation = kernfs_gen(kn); 203 209
+14 -30
fs/libfs.c
··· 512 512 } 513 513 EXPORT_SYMBOL(simple_setattr); 514 514 515 - int simple_readpage(struct file *file, struct page *page) 515 + static int simple_readpage(struct file *file, struct page *page) 516 516 { 517 517 clear_highpage(page); 518 518 flush_dcache_page(page); ··· 520 520 unlock_page(page); 521 521 return 0; 522 522 } 523 - EXPORT_SYMBOL(simple_readpage); 524 523 525 524 int simple_write_begin(struct file *file, struct address_space *mapping, 526 525 loff_t pos, unsigned len, unsigned flags, ··· 567 568 * 568 569 * Use *ONLY* with simple_readpage() 569 570 */ 570 - int simple_write_end(struct file *file, struct address_space *mapping, 571 + static int simple_write_end(struct file *file, struct address_space *mapping, 571 572 loff_t pos, unsigned len, unsigned copied, 572 573 struct page *page, void *fsdata) 573 574 { ··· 596 597 597 598 return copied; 598 599 } 599 - EXPORT_SYMBOL(simple_write_end); 600 + 601 + /* 602 + * Provides ramfs-style behavior: data in the pagecache, but no writeback. 603 + */ 604 + const struct address_space_operations ram_aops = { 605 + .readpage = simple_readpage, 606 + .write_begin = simple_write_begin, 607 + .write_end = simple_write_end, 608 + .set_page_dirty = __set_page_dirty_no_writeback, 609 + }; 610 + EXPORT_SYMBOL(ram_aops); 600 611 601 612 /* 602 613 * the inodes created here are not hashed. If you use iunique to generate ··· 1171 1162 } 1172 1163 EXPORT_SYMBOL(noop_fsync); 1173 1164 1174 - int noop_set_page_dirty(struct page *page) 1175 - { 1176 - /* 1177 - * Unlike __set_page_dirty_no_writeback that handles dirty page 1178 - * tracking in the page object, dax does all dirty tracking in 1179 - * the inode address_space in response to mkwrite faults. In the 1180 - * dax case we only need to worry about potentially dirty CPU 1181 - * caches, not dirty page cache pages to write back. 1182 - * 1183 - * This callback is defined to prevent fallback to 1184 - * __set_page_dirty_buffers() in set_page_dirty(). 
1185 - */ 1186 - return 0; 1187 - } 1188 - EXPORT_SYMBOL_GPL(noop_set_page_dirty); 1189 - 1190 1165 void noop_invalidatepage(struct page *page, unsigned int offset, 1191 1166 unsigned int length) 1192 1167 { ··· 1201 1208 } 1202 1209 EXPORT_SYMBOL(kfree_link); 1203 1210 1204 - /* 1205 - * nop .set_page_dirty method so that people can use .page_mkwrite on 1206 - * anon inodes. 1207 - */ 1208 - static int anon_set_page_dirty(struct page *page) 1209 - { 1210 - return 0; 1211 - }; 1212 - 1213 1211 struct inode *alloc_anon_inode(struct super_block *s) 1214 1212 { 1215 1213 static const struct address_space_operations anon_aops = { 1216 - .set_page_dirty = anon_set_page_dirty, 1214 + .set_page_dirty = __set_page_dirty_no_writeback, 1217 1215 }; 1218 1216 struct inode *inode = new_inode_pseudo(s); 1219 1217
+1
fs/minix/inode.c
··· 442 442 } 443 443 444 444 static const struct address_space_operations minix_aops = { 445 + .set_page_dirty = __set_page_dirty_buffers, 445 446 .readpage = minix_readpage, 446 447 .writepage = minix_writepage, 447 448 .write_begin = minix_write_begin,
+1
fs/nilfs2/mdt.c
··· 434 434 435 435 436 436 static const struct address_space_operations def_mdt_aops = { 437 + .set_page_dirty = __set_page_dirty_buffers, 437 438 .writepage = nilfs_mdt_write_page, 438 439 }; 439 440
+1 -1
fs/ntfs/inode.c
··· 477 477 } 478 478 file_name_attr = (FILE_NAME_ATTR*)((u8*)attr + 479 479 le16_to_cpu(attr->data.resident.value_offset)); 480 - p2 = (u8*)attr + le32_to_cpu(attr->data.resident.value_length); 480 + p2 = (u8 *)file_name_attr + le32_to_cpu(attr->data.resident.value_length); 481 481 if (p2 < (u8*)attr || p2 > p) 482 482 goto err_corrupt_attr; 483 483 /* This attribute is ok, but is it in the $Extend directory? */
+2 -2
fs/ocfs2/aops.c
··· 632 632 } 633 633 634 634 if (PageUptodate(page)) { 635 - if (!buffer_uptodate(bh)) 636 - set_buffer_uptodate(bh); 635 + set_buffer_uptodate(bh); 637 636 } else if (!buffer_uptodate(bh) && !buffer_delay(bh) && 638 637 !buffer_new(bh) && 639 638 ocfs2_should_read_blk(inode, page, block_start) && ··· 2453 2454 } 2454 2455 2455 2456 const struct address_space_operations ocfs2_aops = { 2457 + .set_page_dirty = __set_page_dirty_buffers, 2456 2458 .readpage = ocfs2_readpage, 2457 2459 .readahead = ocfs2_readahead, 2458 2460 .writepage = ocfs2_writepage,
+3 -4
fs/ocfs2/cluster/heartbeat.c
··· 1442 1442 for (i = 0; i < ARRAY_SIZE(o2hb_live_slots); i++) 1443 1443 INIT_LIST_HEAD(&o2hb_live_slots[i]); 1444 1444 1445 - INIT_LIST_HEAD(&o2hb_node_events); 1446 - 1447 1445 memset(o2hb_live_node_bitmap, 0, sizeof(o2hb_live_node_bitmap)); 1448 1446 memset(o2hb_region_bitmap, 0, sizeof(o2hb_region_bitmap)); 1449 1447 memset(o2hb_live_region_bitmap, 0, sizeof(o2hb_live_region_bitmap)); ··· 1596 1598 struct o2hb_region *reg = to_o2hb_region(item); 1597 1599 unsigned long long tmp; 1598 1600 char *p = (char *)page; 1601 + ssize_t ret; 1599 1602 1600 1603 if (reg->hr_bdev) 1601 1604 return -EINVAL; 1602 1605 1603 - tmp = simple_strtoull(p, &p, 0); 1604 - if (!p || (*p && (*p != '\n'))) 1606 + ret = kstrtoull(p, 0, &tmp); 1607 + if (ret) 1605 1608 return -EINVAL; 1606 1609 1607 1610 reg->hr_start_block = tmp;
+1 -1
fs/ocfs2/cluster/nodemanager.c
··· 824 824 825 825 static int __init init_o2nm(void) 826 826 { 827 - int ret = -1; 827 + int ret; 828 828 829 829 o2hb_init(); 830 830
+1 -1
fs/ocfs2/dlm/dlmmaster.c
··· 2977 2977 struct dlm_lock_resource *res) 2978 2978 { 2979 2979 enum dlm_lockres_list idx; 2980 - struct list_head *queue = &res->granted; 2980 + struct list_head *queue; 2981 2981 struct dlm_lock *lock; 2982 2982 int noderef; 2983 2983 u8 nodenum = O2NM_MAX_NODES;
+1 -5
fs/ocfs2/filecheck.c
··· 326 326 ret = snprintf(buf + total, remain, "%lu\t\t%u\t%s\n", 327 327 p->fe_ino, p->fe_done, 328 328 ocfs2_filecheck_error(p->fe_status)); 329 - if (ret < 0) { 330 - total = ret; 331 - break; 332 - } 333 - if (ret == remain) { 329 + if (ret >= remain) { 334 330 /* snprintf() didn't fit */ 335 331 total = -E2BIG; 336 332 break;
+2 -6
fs/ocfs2/stackglue.c
··· 500 500 list_for_each_entry(p, &ocfs2_stack_list, sp_list) { 501 501 ret = snprintf(buf, remain, "%s\n", 502 502 p->sp_name); 503 - if (ret < 0) { 504 - total = ret; 505 - break; 506 - } 507 - if (ret == remain) { 503 + if (ret >= remain) { 508 504 /* snprintf() didn't fit */ 509 505 total = -E2BIG; 510 506 break; ··· 527 531 if (active_stack) { 528 532 ret = snprintf(buf, PAGE_SIZE, "%s\n", 529 533 active_stack->sp_name); 530 - if (ret == PAGE_SIZE) 534 + if (ret >= PAGE_SIZE) 531 535 ret = -E2BIG; 532 536 } 533 537 spin_unlock(&ocfs2_stack_lock);
+1
fs/omfs/file.c
··· 372 372 }; 373 373 374 374 const struct address_space_operations omfs_aops = { 375 + .set_page_dirty = __set_page_dirty_buffers, 375 376 .readpage = omfs_readpage, 376 377 .readahead = omfs_readahead, 377 378 .writepage = omfs_writepage,
+1 -1
fs/proc/task_mmu.c
··· 1047 1047 return false; 1048 1048 if (!is_cow_mapping(vma->vm_flags)) 1049 1049 return false; 1050 - if (likely(!atomic_read(&vma->vm_mm->has_pinned))) 1050 + if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))) 1051 1051 return false; 1052 1052 page = vm_normal_page(vma, addr, pte); 1053 1053 if (!page)
+1 -8
fs/ramfs/inode.c
··· 53 53 static const struct super_operations ramfs_ops; 54 54 static const struct inode_operations ramfs_dir_inode_operations; 55 55 56 - static const struct address_space_operations ramfs_aops = { 57 - .readpage = simple_readpage, 58 - .write_begin = simple_write_begin, 59 - .write_end = simple_write_end, 60 - .set_page_dirty = __set_page_dirty_no_writeback, 61 - }; 62 - 63 56 struct inode *ramfs_get_inode(struct super_block *sb, 64 57 const struct inode *dir, umode_t mode, dev_t dev) 65 58 { ··· 61 68 if (inode) { 62 69 inode->i_ino = get_next_ino(); 63 70 inode_init_owner(&init_user_ns, inode, dir, mode); 64 - inode->i_mapping->a_ops = &ramfs_aops; 71 + inode->i_mapping->a_ops = &ram_aops; 65 72 mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); 66 73 mapping_set_unevictable(inode->i_mapping); 67 74 inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+4 -1
fs/squashfs/block.c
··· 226 226 bio_free_pages(bio); 227 227 bio_put(bio); 228 228 out: 229 - if (res < 0) 229 + if (res < 0) { 230 230 ERROR("Failed to read block 0x%llx: %d\n", index, res); 231 + if (msblk->panic_on_errors) 232 + panic("squashfs read failed"); 233 + } 231 234 232 235 return res; 233 236 }
+1
fs/squashfs/squashfs_fs_sb.h
··· 65 65 unsigned int fragments; 66 66 int xattr_ids; 67 67 unsigned int ids; 68 + bool panic_on_errors; 68 69 }; 69 70 #endif
+86
fs/squashfs/super.c
··· 18 18 19 19 #include <linux/fs.h> 20 20 #include <linux/fs_context.h> 21 + #include <linux/fs_parser.h> 21 22 #include <linux/vfs.h> 22 23 #include <linux/slab.h> 23 24 #include <linux/mutex.h> 25 + #include <linux/seq_file.h> 24 26 #include <linux/pagemap.h> 25 27 #include <linux/init.h> 26 28 #include <linux/module.h> ··· 38 36 39 37 static struct file_system_type squashfs_fs_type; 40 38 static const struct super_operations squashfs_super_ops; 39 + 40 + enum Opt_errors { 41 + Opt_errors_continue, 42 + Opt_errors_panic, 43 + }; 44 + 45 + enum squashfs_param { 46 + Opt_errors, 47 + }; 48 + 49 + struct squashfs_mount_opts { 50 + enum Opt_errors errors; 51 + }; 52 + 53 + static const struct constant_table squashfs_param_errors[] = { 54 + {"continue", Opt_errors_continue }, 55 + {"panic", Opt_errors_panic }, 56 + {} 57 + }; 58 + 59 + static const struct fs_parameter_spec squashfs_fs_parameters[] = { 60 + fsparam_enum("errors", Opt_errors, squashfs_param_errors), 61 + {} 62 + }; 63 + 64 + static int squashfs_parse_param(struct fs_context *fc, struct fs_parameter *param) 65 + { 66 + struct squashfs_mount_opts *opts = fc->fs_private; 67 + struct fs_parse_result result; 68 + int opt; 69 + 70 + opt = fs_parse(fc, squashfs_fs_parameters, param, &result); 71 + if (opt < 0) 72 + return opt; 73 + 74 + switch (opt) { 75 + case Opt_errors: 76 + opts->errors = result.uint_32; 77 + break; 78 + default: 79 + return -EINVAL; 80 + } 81 + 82 + return 0; 83 + } 41 84 42 85 static const struct squashfs_decompressor *supported_squashfs_filesystem( 43 86 struct fs_context *fc, ··· 114 67 115 68 static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc) 116 69 { 70 + struct squashfs_mount_opts *opts = fc->fs_private; 117 71 struct squashfs_sb_info *msblk; 118 72 struct squashfs_super_block *sblk = NULL; 119 73 struct inode *root; ··· 132 84 return -ENOMEM; 133 85 } 134 86 msblk = sb->s_fs_info; 87 + 88 + msblk->panic_on_errors = (opts->errors == Opt_errors_panic); 
135 89 136 90 msblk->devblksize = sb_min_blocksize(sb, SQUASHFS_DEVBLK_SIZE); 137 91 msblk->devblksize_log2 = ffz(~msblk->devblksize); ··· 400 350 401 351 static int squashfs_reconfigure(struct fs_context *fc) 402 352 { 353 + struct super_block *sb = fc->root->d_sb; 354 + struct squashfs_sb_info *msblk = sb->s_fs_info; 355 + struct squashfs_mount_opts *opts = fc->fs_private; 356 + 403 357 sync_filesystem(fc->root->d_sb); 404 358 fc->sb_flags |= SB_RDONLY; 359 + 360 + msblk->panic_on_errors = (opts->errors == Opt_errors_panic); 361 + 405 362 return 0; 363 + } 364 + 365 + static void squashfs_free_fs_context(struct fs_context *fc) 366 + { 367 + kfree(fc->fs_private); 406 368 } 407 369 408 370 static const struct fs_context_operations squashfs_context_ops = { 409 371 .get_tree = squashfs_get_tree, 372 + .free = squashfs_free_fs_context, 373 + .parse_param = squashfs_parse_param, 410 374 .reconfigure = squashfs_reconfigure, 411 375 }; 412 376 377 + static int squashfs_show_options(struct seq_file *s, struct dentry *root) 378 + { 379 + struct super_block *sb = root->d_sb; 380 + struct squashfs_sb_info *msblk = sb->s_fs_info; 381 + 382 + if (msblk->panic_on_errors) 383 + seq_puts(s, ",errors=panic"); 384 + else 385 + seq_puts(s, ",errors=continue"); 386 + 387 + return 0; 388 + } 389 + 413 390 static int squashfs_init_fs_context(struct fs_context *fc) 414 391 { 392 + struct squashfs_mount_opts *opts; 393 + 394 + opts = kzalloc(sizeof(*opts), GFP_KERNEL); 395 + if (!opts) 396 + return -ENOMEM; 397 + 398 + fc->fs_private = opts; 415 399 fc->ops = &squashfs_context_ops; 416 400 return 0; 417 401 } ··· 565 481 .owner = THIS_MODULE, 566 482 .name = "squashfs", 567 483 .init_fs_context = squashfs_init_fs_context, 484 + .parameters = squashfs_fs_parameters, 568 485 .kill_sb = kill_block_super, 569 486 .fs_flags = FS_REQUIRES_DEV 570 487 }; ··· 576 491 .free_inode = squashfs_free_inode, 577 492 .statfs = squashfs_statfs, 578 493 .put_super = squashfs_put_super, 494 + 
.show_options = squashfs_show_options, 579 495 }; 580 496 581 497 module_init(init_squashfs_fs);
+1
fs/sysv/itree.c
··· 495 495 } 496 496 497 497 const struct address_space_operations sysv_aops = { 498 + .set_page_dirty = __set_page_dirty_buffers, 498 499 .readpage = sysv_readpage, 499 500 .writepage = sysv_writepage, 500 501 .write_begin = sysv_write_begin,
+1
fs/udf/file.c
··· 125 125 } 126 126 127 127 const struct address_space_operations udf_adinicb_aops = { 128 + .set_page_dirty = __set_page_dirty_buffers, 128 129 .readpage = udf_adinicb_readpage, 129 130 .writepage = udf_adinicb_writepage, 130 131 .write_begin = udf_adinicb_write_begin,
+1
fs/udf/inode.c
··· 235 235 } 236 236 237 237 const struct address_space_operations udf_aops = { 238 + .set_page_dirty = __set_page_dirty_buffers, 238 239 .readpage = udf_readpage, 239 240 .readahead = udf_readahead, 240 241 .writepage = udf_writepage,
+1
fs/ufs/inode.c
··· 526 526 } 527 527 528 528 const struct address_space_operations ufs_aops = { 529 + .set_page_dirty = __set_page_dirty_buffers, 529 530 .readpage = ufs_readpage, 530 531 .writepage = ufs_writepage, 531 532 .write_begin = ufs_write_begin,
+2 -2
fs/xfs/xfs_aops.c
··· 561 561 .readahead = xfs_vm_readahead, 562 562 .writepage = xfs_vm_writepage, 563 563 .writepages = xfs_vm_writepages, 564 - .set_page_dirty = iomap_set_page_dirty, 564 + .set_page_dirty = __set_page_dirty_nobuffers, 565 565 .releasepage = iomap_releasepage, 566 566 .invalidatepage = iomap_invalidatepage, 567 567 .bmap = xfs_vm_bmap, ··· 575 575 const struct address_space_operations xfs_dax_aops = { 576 576 .writepages = xfs_dax_writepages, 577 577 .direct_IO = noop_direct_IO, 578 - .set_page_dirty = noop_set_page_dirty, 578 + .set_page_dirty = __set_page_dirty_no_writeback, 579 579 .invalidatepage = noop_invalidatepage, 580 580 .swap_activate = xfs_iomap_swapfile_activate, 581 581 };
+2 -2
fs/zonefs/super.c
··· 5 5 * Copyright (C) 2019 Western Digital Corporation or its affiliates. 6 6 */ 7 7 #include <linux/module.h> 8 - #include <linux/fs.h> 8 + #include <linux/pagemap.h> 9 9 #include <linux/magic.h> 10 10 #include <linux/iomap.h> 11 11 #include <linux/init.h> ··· 185 185 .readahead = zonefs_readahead, 186 186 .writepage = zonefs_writepage, 187 187 .writepages = zonefs_writepages, 188 - .set_page_dirty = iomap_set_page_dirty, 188 + .set_page_dirty = __set_page_dirty_nobuffers, 189 189 .releasepage = iomap_releasepage, 190 190 .invalidatepage = iomap_invalidatepage, 191 191 .migratepage = iomap_migrate_page,
+4 -33
include/asm-generic/memory_model.h
··· 6 6 7 7 #ifndef __ASSEMBLY__ 8 8 9 + /* 10 + * supports 3 memory models. 11 + */ 9 12 #if defined(CONFIG_FLATMEM) 10 13 11 14 #ifndef ARCH_PFN_OFFSET 12 15 #define ARCH_PFN_OFFSET (0UL) 13 16 #endif 14 17 15 - #elif defined(CONFIG_DISCONTIGMEM) 16 - 17 - #ifndef arch_pfn_to_nid 18 - #define arch_pfn_to_nid(pfn) pfn_to_nid(pfn) 19 - #endif 20 - 21 - #ifndef arch_local_page_offset 22 - #define arch_local_page_offset(pfn, nid) \ 23 - ((pfn) - NODE_DATA(nid)->node_start_pfn) 24 - #endif 25 - 26 - #endif /* CONFIG_DISCONTIGMEM */ 27 - 28 - /* 29 - * supports 3 memory models. 30 - */ 31 - #if defined(CONFIG_FLATMEM) 32 - 33 18 #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET)) 34 19 #define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + \ 35 20 ARCH_PFN_OFFSET) 36 - #elif defined(CONFIG_DISCONTIGMEM) 37 - 38 - #define __pfn_to_page(pfn) \ 39 - ({ unsigned long __pfn = (pfn); \ 40 - unsigned long __nid = arch_pfn_to_nid(__pfn); \ 41 - NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\ 42 - }) 43 - 44 - #define __page_to_pfn(pg) \ 45 - ({ const struct page *__pg = (pg); \ 46 - struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \ 47 - (unsigned long)(__pg - __pgdat->node_mem_map) + \ 48 - __pgdat->node_start_pfn; \ 49 - }) 50 21 51 22 #elif defined(CONFIG_SPARSEMEM_VMEMMAP) 52 23 ··· 41 70 struct mem_section *__sec = __pfn_to_section(__pfn); \ 42 71 __section_mem_map_addr(__sec) + __pfn; \ 43 72 }) 44 - #endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */ 73 + #endif /* CONFIG_FLATMEM/SPARSEMEM */ 45 74 46 75 /* 47 76 * Convert a physical address to a Page Frame Number and back
-1
include/asm-generic/pgtable-nop4d.h
··· 9 9 typedef struct { pgd_t pgd; } p4d_t; 10 10 11 11 #define P4D_SHIFT PGDIR_SHIFT 12 - #define MAX_PTRS_PER_P4D 1 13 12 #define PTRS_PER_P4D 1 14 13 #define P4D_SIZE (1UL << P4D_SHIFT) 15 14 #define P4D_MASK (~(P4D_SIZE-1))
+1 -1
include/asm-generic/topology.h
··· 45 45 #endif 46 46 47 47 #ifndef cpumask_of_node 48 - #ifdef CONFIG_NEED_MULTIPLE_NODES 48 + #ifdef CONFIG_NUMA 49 49 #define cpumask_of_node(node) ((node) == 0 ? cpu_online_mask : cpu_none_mask) 50 50 #else 51 51 #define cpumask_of_node(node) ((void)(node), cpu_online_mask)
+3 -2
include/kunit/test.h
··· 515 515 void *match_data) 516 516 { 517 517 struct kunit_resource *res, *found = NULL; 518 + unsigned long flags; 518 519 519 - spin_lock(&test->lock); 520 + spin_lock_irqsave(&test->lock, flags); 520 521 521 522 list_for_each_entry_reverse(res, &test->resources, node) { 522 523 if (match(test, res, (void *)match_data)) { ··· 527 526 } 528 527 } 529 528 530 - spin_unlock(&test->lock); 529 + spin_unlock_irqrestore(&test->lock, flags); 531 530 532 531 return found; 533 532 }
+18 -2
include/linux/backing-dev-defs.h
··· 154 154 struct cgroup_subsys_state *blkcg_css; /* and blkcg */ 155 155 struct list_head memcg_node; /* anchored at memcg->cgwb_list */ 156 156 struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */ 157 + struct list_head b_attached; /* attached inodes, protected by list_lock */ 158 + struct list_head offline_node; /* anchored at offline_cgwbs */ 157 159 158 160 union { 159 161 struct work_struct release_work; ··· 241 239 /** 242 240 * wb_put - decrement a wb's refcount 243 241 * @wb: bdi_writeback to put 242 + * @nr: number of references to put 244 243 */ 245 - static inline void wb_put(struct bdi_writeback *wb) 244 + static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) 246 245 { 247 246 if (WARN_ON_ONCE(!wb->bdi)) { 248 247 /* ··· 254 251 } 255 252 256 253 if (wb != &wb->bdi->wb) 257 - percpu_ref_put(&wb->refcnt); 254 + percpu_ref_put_many(&wb->refcnt, nr); 255 + } 256 + 257 + /** 258 + * wb_put - decrement a wb's refcount 259 + * @wb: bdi_writeback to put 260 + */ 261 + static inline void wb_put(struct bdi_writeback *wb) 262 + { 263 + wb_put_many(wb, 1); 258 264 } 259 265 260 266 /** ··· 289 277 } 290 278 291 279 static inline void wb_put(struct bdi_writeback *wb) 280 + { 281 + } 282 + 283 + static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) 292 284 { 293 285 } 294 286
+1 -1
include/linux/cpuhotplug.h
··· 54 54 CPUHP_MM_MEMCQ_DEAD, 55 55 CPUHP_PERCPU_CNT_DEAD, 56 56 CPUHP_RADIX_DEAD, 57 - CPUHP_PAGE_ALLOC_DEAD, 57 + CPUHP_PAGE_ALLOC, 58 58 CPUHP_NET_DEV_DEAD, 59 59 CPUHP_PCI_XGENE_DEAD, 60 60 CPUHP_IOMMU_IOVA_DEAD,
+1 -5
include/linux/fs.h
··· 3417 3417 extern void simple_recursive_removal(struct dentry *, 3418 3418 void (*callback)(struct dentry *)); 3419 3419 extern int noop_fsync(struct file *, loff_t, loff_t, int); 3420 - extern int noop_set_page_dirty(struct page *page); 3421 3420 extern void noop_invalidatepage(struct page *page, unsigned int offset, 3422 3421 unsigned int length); 3423 3422 extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter); 3424 3423 extern int simple_empty(struct dentry *); 3425 - extern int simple_readpage(struct file *file, struct page *page); 3426 3424 extern int simple_write_begin(struct file *file, struct address_space *mapping, 3427 3425 loff_t pos, unsigned len, unsigned flags, 3428 3426 struct page **pagep, void **fsdata); 3429 - extern int simple_write_end(struct file *file, struct address_space *mapping, 3430 - loff_t pos, unsigned len, unsigned copied, 3431 - struct page *page, void *fsdata); 3427 + extern const struct address_space_operations ram_aops; 3432 3428 extern int always_delete_dentry(const struct dentry *); 3433 3429 extern struct inode *alloc_anon_inode(struct super_block *); 3434 3430 extern int simple_nosetlease(struct file *, long, struct file_lock **, void **);
+11 -2
include/linux/gfp.h
··· 506 506 * There are two zonelists per node, one for all zones with memory and 507 507 * one containing just zones from the node the zonelist belongs to. 508 508 * 509 - * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets 510 - * optimized to &contig_page_data at compile-time. 509 + * For the case of non-NUMA systems the NODE_DATA() gets optimized to 510 + * &contig_page_data at compile-time. 511 511 */ 512 512 static inline struct zonelist *node_zonelist(int nid, gfp_t flags) 513 513 { ··· 546 546 alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array) 547 547 { 548 548 return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array); 549 + } 550 + 551 + static inline unsigned long 552 + alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array) 553 + { 554 + if (nid == NUMA_NO_NODE) 555 + nid = numa_mem_id(); 556 + 557 + return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array); 549 558 } 550 559 551 560 /*
-1
include/linux/iomap.h
··· 159 159 const struct iomap_ops *ops); 160 160 int iomap_readpage(struct page *page, const struct iomap_ops *ops); 161 161 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops); 162 - int iomap_set_page_dirty(struct page *page); 163 162 int iomap_is_partially_uptodate(struct page *page, unsigned long from, 164 163 unsigned long count); 165 164 int iomap_releasepage(struct page *page, gfp_t gfp_mask);
+3 -4
include/linux/kasan.h
··· 18 18 19 19 /* kasan_data struct is used in KUnit tests for KASAN expected failures */ 20 20 struct kunit_kasan_expectation { 21 - bool report_expected; 22 21 bool report_found; 23 22 }; 24 23 ··· 41 42 #endif 42 43 43 44 extern unsigned char kasan_early_shadow_page[PAGE_SIZE]; 44 - extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE + PTE_HWTABLE_PTRS]; 45 - extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD]; 46 - extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD]; 45 + extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE + PTE_HWTABLE_PTRS]; 46 + extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD]; 47 + extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD]; 47 48 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D]; 48 49 49 50 int kasan_populate_early_shadow(const void *shadow_start,
+2
include/linux/kernel.h
··· 357 357 extern __scanf(2, 0) 358 358 int vsscanf(const char *, const char *, va_list); 359 359 360 + extern int no_hash_pointers_enable(char *str); 361 + 360 362 extern int get_option(char **str, int *pint); 361 363 extern char *get_options(const char *str, int nints, int *ints); 362 364 extern unsigned long long memparse(const char *ptr, char **retptr);
+1 -1
include/linux/kthread.h
··· 18 18 * @threadfn: the function to run in the thread 19 19 * @data: data pointer for @threadfn() 20 20 * @namefmt: printf-style format string for the thread name 21 - * @arg...: arguments for @namefmt. 21 + * @arg: arguments for @namefmt. 22 22 * 23 23 * This macro will create a kthread on the current node, leaving it in 24 24 * the stopped state. This is just a helper for kthread_create_on_node();
+3 -3
include/linux/memblock.h
··· 50 50 phys_addr_t base; 51 51 phys_addr_t size; 52 52 enum memblock_flags flags; 53 - #ifdef CONFIG_NEED_MULTIPLE_NODES 53 + #ifdef CONFIG_NUMA 54 54 int nid; 55 55 #endif 56 56 }; ··· 347 347 int memblock_set_node(phys_addr_t base, phys_addr_t size, 348 348 struct memblock_type *type, int nid); 349 349 350 - #ifdef CONFIG_NEED_MULTIPLE_NODES 350 + #ifdef CONFIG_NUMA 351 351 static inline void memblock_set_region_node(struct memblock_region *r, int nid) 352 352 { 353 353 r->nid = nid; ··· 366 366 { 367 367 return 0; 368 368 } 369 - #endif /* CONFIG_NEED_MULTIPLE_NODES */ 369 + #endif /* CONFIG_NUMA */ 370 370 371 371 /* Flags for memblock allocation APIs */ 372 372 #define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
+21 -33
include/linux/memcontrol.h
··· 192 192 struct memcg_padding { 193 193 char x[0]; 194 194 } ____cacheline_internodealigned_in_smp; 195 - #define MEMCG_PADDING(name) struct memcg_padding name; 195 + #define MEMCG_PADDING(name) struct memcg_padding name 196 196 #else 197 197 #define MEMCG_PADDING(name) 198 198 #endif ··· 349 349 struct deferred_split deferred_split_queue; 350 350 #endif 351 351 352 - struct mem_cgroup_per_node *nodeinfo[0]; 353 - /* WARNING: nodeinfo must be the last member here */ 352 + struct mem_cgroup_per_node *nodeinfo[]; 354 353 }; 355 354 356 355 /* ··· 742 743 /** 743 744 * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page 744 745 * @page: the page 745 - * @pgdat: pgdat of the page 746 746 * 747 747 * This function relies on page->mem_cgroup being stable. 748 748 */ 749 - static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, 750 - struct pglist_data *pgdat) 749 + static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) 751 750 { 751 + pg_data_t *pgdat = page_pgdat(page); 752 752 struct mem_cgroup *memcg = page_memcg(page); 753 753 754 754 VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page); 755 755 return mem_cgroup_lruvec(memcg, pgdat); 756 - } 757 - 758 - static inline bool lruvec_holds_page_lru_lock(struct page *page, 759 - struct lruvec *lruvec) 760 - { 761 - pg_data_t *pgdat = page_pgdat(page); 762 - const struct mem_cgroup *memcg; 763 - struct mem_cgroup_per_node *mz; 764 - 765 - if (mem_cgroup_disabled()) 766 - return lruvec == &pgdat->__lruvec; 767 - 768 - mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 769 - memcg = page_memcg(page) ? 
: root_mem_cgroup; 770 - 771 - return lruvec->pgdat == pgdat && mz->memcg == memcg; 772 756 } 773 757 774 758 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); ··· 1203 1221 return &pgdat->__lruvec; 1204 1222 } 1205 1223 1206 - static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, 1207 - struct pglist_data *pgdat) 1208 - { 1209 - return &pgdat->__lruvec; 1210 - } 1211 - 1212 - static inline bool lruvec_holds_page_lru_lock(struct page *page, 1213 - struct lruvec *lruvec) 1224 + static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) 1214 1225 { 1215 1226 pg_data_t *pgdat = page_pgdat(page); 1216 1227 1217 - return lruvec == &pgdat->__lruvec; 1228 + return &pgdat->__lruvec; 1218 1229 } 1219 1230 1220 1231 static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) ··· 1226 1251 } 1227 1252 1228 1253 static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) 1254 + { 1255 + return NULL; 1256 + } 1257 + 1258 + static inline 1259 + struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css) 1229 1260 { 1230 1261 return NULL; 1231 1262 } ··· 1497 1516 spin_unlock_irqrestore(&lruvec->lru_lock, flags); 1498 1517 } 1499 1518 1519 + /* Test requires a stable page->memcg binding, see page_memcg() */ 1520 + static inline bool page_matches_lruvec(struct page *page, struct lruvec *lruvec) 1521 + { 1522 + return lruvec_pgdat(lruvec) == page_pgdat(page) && 1523 + lruvec_memcg(lruvec) == page_memcg(page); 1524 + } 1525 + 1500 1526 /* Don't lock again iff page's lruvec locked */ 1501 1527 static inline struct lruvec *relock_page_lruvec_irq(struct page *page, 1502 1528 struct lruvec *locked_lruvec) 1503 1529 { 1504 1530 if (locked_lruvec) { 1505 - if (lruvec_holds_page_lru_lock(page, locked_lruvec)) 1531 + if (page_matches_lruvec(page, locked_lruvec)) 1506 1532 return locked_lruvec; 1507 1533 1508 1534 unlock_page_lruvec_irq(locked_lruvec); ··· 1523 1535 struct lruvec *locked_lruvec, 
unsigned long *flags) 1524 1536 { 1525 1537 if (locked_lruvec) { 1526 - if (lruvec_holds_page_lru_lock(page, locked_lruvec)) 1538 + if (page_matches_lruvec(page, locked_lruvec)) 1527 1539 return locked_lruvec; 1528 1540 1529 1541 unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+40 -13
include/linux/mm.h
··· 46 46 47 47 void init_mm_internals(void); 48 48 49 - #ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */ 49 + #ifndef CONFIG_NUMA /* Don't use mapnrs, do it properly */ 50 50 extern unsigned long max_mapnr; 51 51 52 52 static inline void set_max_mapnr(unsigned long limit) ··· 234 234 int __add_to_page_cache_locked(struct page *page, struct address_space *mapping, 235 235 pgoff_t index, gfp_t gfp, void **shadowp); 236 236 237 + #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) 237 238 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) 239 + #else 240 + #define nth_page(page,n) ((page) + (n)) 241 + #endif 238 242 239 243 /* to align the pointer to the (next) page boundary */ 240 244 #define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE) ··· 1345 1341 if (!is_cow_mapping(vma->vm_flags)) 1346 1342 return false; 1347 1343 1348 - if (!atomic_read(&vma->vm_mm->has_pinned)) 1344 + if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)) 1349 1345 return false; 1350 1346 1351 1347 return page_maybe_dma_pinned(page); ··· 1854 1850 extern void do_invalidatepage(struct page *page, unsigned int offset, 1855 1851 unsigned int length); 1856 1852 1857 - void __set_page_dirty(struct page *, struct address_space *, int warn); 1858 - int __set_page_dirty_nobuffers(struct page *page); 1859 - int __set_page_dirty_no_writeback(struct page *page); 1860 1853 int redirty_page_for_writepage(struct writeback_control *wbc, 1861 1854 struct page *page); 1862 - void account_page_dirtied(struct page *page, struct address_space *mapping); 1863 1855 void account_page_cleaned(struct page *page, struct address_space *mapping, 1864 1856 struct bdi_writeback *wb); 1865 1857 int set_page_dirty(struct page *page); ··· 2420 2420 extern char __init_begin[], __init_end[]; 2421 2421 2422 2422 return free_reserved_area(&__init_begin, &__init_end, 2423 - poison, "unused kernel"); 2423 + poison, "unused kernel image (initmem)"); 2424 2424 } 2425 2425 2426 2426 static 
inline unsigned long get_num_physpages(void) ··· 2460 2460 unsigned long *start_pfn, unsigned long *end_pfn); 2461 2461 extern unsigned long find_min_pfn_with_active_regions(void); 2462 2462 2463 - #ifndef CONFIG_NEED_MULTIPLE_NODES 2463 + #ifndef CONFIG_NUMA 2464 2464 static inline int early_pfn_to_nid(unsigned long pfn) 2465 2465 { 2466 2466 return 0; ··· 2474 2474 extern void memmap_init_range(unsigned long, int, unsigned long, 2475 2475 unsigned long, unsigned long, enum meminit_context, 2476 2476 struct vmem_altmap *, int migratetype); 2477 - extern void memmap_init_zone(struct zone *zone); 2478 2477 extern void setup_per_zone_wmarks(void); 2479 2478 extern int __meminit init_per_zone_wmark_min(void); 2480 2479 extern void mem_init(void); ··· 2680 2681 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr, 2681 2682 struct vm_area_struct **pprev); 2682 2683 2683 - /* Look up the first VMA which intersects the interval start_addr..end_addr-1, 2684 - NULL if none. Assume start_addr < end_addr. */ 2685 - static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr) 2684 + /** 2685 + * find_vma_intersection() - Look up the first VMA which intersects the interval 2686 + * @mm: The process address space. 2687 + * @start_addr: The inclusive start user address. 2688 + * @end_addr: The exclusive end user address. 2689 + * 2690 + * Returns: The first VMA within the provided range, %NULL otherwise. Assumes 2691 + * start_addr < end_addr. 
2692 + */ 2693 + static inline 2694 + struct vm_area_struct *find_vma_intersection(struct mm_struct *mm, 2695 + unsigned long start_addr, 2696 + unsigned long end_addr) 2686 2697 { 2687 - struct vm_area_struct * vma = find_vma(mm,start_addr); 2698 + struct vm_area_struct *vma = find_vma(mm, start_addr); 2688 2699 2689 2700 if (vma && end_addr <= vma->vm_start) 2690 2701 vma = NULL; 2702 + return vma; 2703 + } 2704 + 2705 + /** 2706 + * vma_lookup() - Find a VMA at a specific address 2707 + * @mm: The process address space. 2708 + * @addr: The user address. 2709 + * 2710 + * Return: The vm_area_struct at the given address, %NULL otherwise. 2711 + */ 2712 + static inline 2713 + struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr) 2714 + { 2715 + struct vm_area_struct *vma = find_vma(mm, addr); 2716 + 2717 + if (vma && addr < vma->vm_start) 2718 + vma = NULL; 2719 + 2691 2720 return vma; 2692 2721 } 2693 2722
-10
include/linux/mm_types.h
··· 435 435 */ 436 436 atomic_t mm_count; 437 437 438 - /** 439 - * @has_pinned: Whether this mm has pinned any pages. This can 440 - * be either replaced in the future by @pinned_vm when it 441 - * becomes stable, or grow into a counter on its own. We're 442 - * aggresive on this bit now - even if the pinned pages were 443 - * unpinned later on, we'll still keep this bit set for the 444 - * lifecycle of this mm just for simplicity. 445 - */ 446 - atomic_t has_pinned; 447 - 448 438 #ifdef CONFIG_MMU 449 439 atomic_long_t pgtables_bytes; /* PTE page table pages */ 450 440 #endif
+2
include/linux/mman.h
··· 31 31 /* 32 32 * The historical set of flags that all mmap implementations implicitly 33 33 * support when a ->mmap_validate() op is not provided in file_operations. 34 + * 35 + * MAP_EXECUTABLE is completely ignored throughout the kernel. 34 36 */ 35 37 #define LEGACY_MAP_MASK (MAP_SHARED \ 36 38 | MAP_PRIVATE \
+1 -2
include/linux/mmdebug.h
··· 9 9 struct vm_area_struct; 10 10 struct mm_struct; 11 11 12 - extern void dump_page(struct page *page, const char *reason); 13 - extern void __dump_page(struct page *page, const char *reason); 12 + void dump_page(struct page *page, const char *reason); 14 13 void dump_vma(const struct vm_area_struct *vma); 15 14 void dump_mm(const struct mm_struct *mm); 16 15
+54 -36
include/linux/mmzone.h
··· 20 20 #include <linux/atomic.h> 21 21 #include <linux/mm_types.h> 22 22 #include <linux/page-flags.h> 23 + #include <linux/local_lock.h> 23 24 #include <asm/page.h> 24 25 25 26 /* Free memory management - zoned buddy allocator. */ ··· 135 134 NUMA_INTERLEAVE_HIT, /* interleaver preferred this zone */ 136 135 NUMA_LOCAL, /* allocation from local node */ 137 136 NUMA_OTHER, /* allocation from other node */ 138 - NR_VM_NUMA_STAT_ITEMS 137 + NR_VM_NUMA_EVENT_ITEMS 139 138 }; 140 139 #else 141 - #define NR_VM_NUMA_STAT_ITEMS 0 140 + #define NR_VM_NUMA_EVENT_ITEMS 0 142 141 #endif 143 142 144 143 enum zone_stat_item { ··· 333 332 NR_WMARK 334 333 }; 335 334 335 + /* 336 + * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER plus one additional 337 + * for pageblock size for THP if configured. 338 + */ 339 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 340 + #define NR_PCP_THP 1 341 + #else 342 + #define NR_PCP_THP 0 343 + #endif 344 + #define NR_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP)) 345 + 346 + /* 347 + * Shift to encode migratetype and order in the same integer, with order 348 + * in the least significant bits. 
349 + */ 350 + #define NR_PCP_ORDER_WIDTH 8 351 + #define NR_PCP_ORDER_MASK ((1<<NR_PCP_ORDER_WIDTH) - 1) 352 + 336 353 #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) 337 354 #define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) 338 355 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) 339 356 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) 340 357 358 + /* Fields and list protected by pagesets local_lock in page_alloc.c */ 341 359 struct per_cpu_pages { 342 360 int count; /* number of pages in the list */ 343 361 int high; /* high watermark, emptying needed */ 344 362 int batch; /* chunk size for buddy add/remove */ 363 + short free_factor; /* batch scaling factor during free */ 364 + #ifdef CONFIG_NUMA 365 + short expire; /* When 0, remote pagesets are drained */ 366 + #endif 345 367 346 368 /* Lists of pages, one per migrate type stored on the pcp-lists */ 347 - struct list_head lists[MIGRATE_PCPTYPES]; 369 + struct list_head lists[NR_PCP_LISTS]; 348 370 }; 349 371 350 - struct per_cpu_pageset { 351 - struct per_cpu_pages pcp; 352 - #ifdef CONFIG_NUMA 353 - s8 expire; 354 - u16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS]; 355 - #endif 372 + struct per_cpu_zonestat { 356 373 #ifdef CONFIG_SMP 357 - s8 stat_threshold; 358 374 s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS]; 375 + s8 stat_threshold; 376 + #endif 377 + #ifdef CONFIG_NUMA 378 + /* 379 + * Low priority inaccurate counters that are only folded 380 + * on demand. Use a large type to avoid the overhead of 381 + * folding during refresh_cpu_vm_stats. 
382 + */ 383 + unsigned long vm_numa_event[NR_VM_NUMA_EVENT_ITEMS]; 359 384 #endif 360 385 }; 361 386 ··· 511 484 int node; 512 485 #endif 513 486 struct pglist_data *zone_pgdat; 514 - struct per_cpu_pageset __percpu *pageset; 487 + struct per_cpu_pages __percpu *per_cpu_pageset; 488 + struct per_cpu_zonestat __percpu *per_cpu_zonestats; 515 489 /* 516 490 * the high and batch values are copied to individual pagesets for 517 491 * faster access ··· 647 619 ZONE_PADDING(_pad3_) 648 620 /* Zone statistics */ 649 621 atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; 650 - atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS]; 622 + atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS]; 651 623 } ____cacheline_internodealigned_in_smp; 652 624 653 625 enum pgdat_flags { ··· 665 637 ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks. 666 638 * Cleared when kswapd is woken. 667 639 */ 640 + ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */ 668 641 }; 669 642 670 643 static inline unsigned long zone_managed_pages(struct zone *zone) ··· 767 738 struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1]; 768 739 }; 769 740 770 - #ifndef CONFIG_DISCONTIGMEM 771 - /* The array of struct pages - for discontigmem use pgdat->lmem_map */ 741 + /* 742 + * The array of struct pages for flatmem. 743 + * It must be declared for SPARSEMEM as well because there are configurations 744 + * that rely on that. 
745 + */ 772 746 extern struct page *mem_map; 773 - #endif 774 747 775 748 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 776 749 struct deferred_split { ··· 806 775 struct zonelist node_zonelists[MAX_ZONELISTS]; 807 776 808 777 int nr_zones; /* number of populated zones in this node */ 809 - #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ 778 + #ifdef CONFIG_FLATMEM /* means !SPARSEMEM */ 810 779 struct page *node_mem_map; 811 780 #ifdef CONFIG_PAGE_EXTENSION 812 781 struct page_ext *node_page_ext; ··· 896 865 897 866 #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) 898 867 #define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages) 899 - #ifdef CONFIG_FLAT_NODE_MEM_MAP 868 + #ifdef CONFIG_FLATMEM 900 869 #define pgdat_page_nr(pgdat, pagenr) ((pgdat)->node_mem_map + (pagenr)) 901 870 #else 902 871 #define pgdat_page_nr(pgdat, pagenr) pfn_to_page((pgdat)->node_start_pfn + (pagenr)) ··· 1013 982 1014 983 extern int movable_zone; 1015 984 1016 - #ifdef CONFIG_HIGHMEM 1017 - static inline int zone_movable_is_highmem(void) 1018 - { 1019 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1020 - return movable_zone == ZONE_HIGHMEM; 1021 - #else 1022 - return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM; 1023 - #endif 1024 - } 1025 - #endif 1026 - 1027 985 static inline int is_highmem_idx(enum zone_type idx) 1028 986 { 1029 987 #ifdef CONFIG_HIGHMEM 1030 988 return (idx == ZONE_HIGHMEM || 1031 - (idx == ZONE_MOVABLE && zone_movable_is_highmem())); 989 + (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM)); 1032 990 #else 1033 991 return 0; 1034 992 #endif ··· 1049 1029 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; 1050 1030 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *, 1051 1031 size_t *, loff_t *); 1052 - int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, 1032 + int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *, int, 1053 1033 void *, size_t *, loff_t *); 1054 1034 int 
sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, 1055 1035 void *, size_t *, loff_t *); ··· 1057 1037 void *, size_t *, loff_t *); 1058 1038 int numa_zonelist_order_handler(struct ctl_table *, int, 1059 1039 void *, size_t *, loff_t *); 1060 - extern int percpu_pagelist_fraction; 1040 + extern int percpu_pagelist_high_fraction; 1061 1041 extern char numa_zonelist_order[]; 1062 1042 #define NUMA_ZONELIST_ORDER_LEN 16 1063 1043 1064 - #ifndef CONFIG_NEED_MULTIPLE_NODES 1044 + #ifndef CONFIG_NUMA 1065 1045 1066 1046 extern struct pglist_data contig_page_data; 1067 1047 #define NODE_DATA(nid) (&contig_page_data) 1068 1048 #define NODE_MEM_MAP(nid) mem_map 1069 1049 1070 - #else /* CONFIG_NEED_MULTIPLE_NODES */ 1050 + #else /* CONFIG_NUMA */ 1071 1051 1072 1052 #include <asm/mmzone.h> 1073 1053 1074 - #endif /* !CONFIG_NEED_MULTIPLE_NODES */ 1054 + #endif /* !CONFIG_NUMA */ 1075 1055 1076 1056 extern struct pglist_data *first_online_pgdat(void); 1077 1057 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat); ··· 1220 1200 #ifdef CONFIG_SPARSEMEM 1221 1201 1222 1202 /* 1223 - * SECTION_SHIFT #bits space required to store a section # 1224 - * 1225 1203 * PA_SECTION_SHIFT physical address to/from section number 1226 1204 * PFN_SECTION_SHIFT pfn to/from section number 1227 1205 */
+5 -5
include/linux/page-flags.h
··· 180 180 181 181 #ifndef __GENERATING_BOUNDS_H 182 182 183 - struct page; /* forward declaration */ 184 - 185 - static inline struct page *compound_head(struct page *page) 183 + static inline unsigned long _compound_head(const struct page *page) 186 184 { 187 185 unsigned long head = READ_ONCE(page->compound_head); 188 186 189 187 if (unlikely(head & 1)) 190 - return (struct page *) (head - 1); 191 - return page; 188 + return head - 1; 189 + return (unsigned long)page; 192 190 } 191 + 192 + #define compound_head(page) ((typeof(page))_compound_head(page)) 193 193 194 194 static __always_inline int PageTail(struct page *page) 195 195 {
+3 -3
include/linux/page_owner.h
··· 14 14 extern void __split_page_owner(struct page *page, unsigned int nr); 15 15 extern void __copy_page_owner(struct page *oldpage, struct page *newpage); 16 16 extern void __set_page_owner_migrate_reason(struct page *page, int reason); 17 - extern void __dump_page_owner(struct page *page); 17 + extern void __dump_page_owner(const struct page *page); 18 18 extern void pagetypeinfo_showmixedcount_print(struct seq_file *m, 19 19 pg_data_t *pgdat, struct zone *zone); 20 20 ··· 46 46 if (static_branch_unlikely(&page_owner_inited)) 47 47 __set_page_owner_migrate_reason(page, reason); 48 48 } 49 - static inline void dump_page_owner(struct page *page) 49 + static inline void dump_page_owner(const struct page *page) 50 50 { 51 51 if (static_branch_unlikely(&page_owner_inited)) 52 52 __dump_page_owner(page); ··· 69 69 static inline void set_page_owner_migrate_reason(struct page *page, int reason) 70 70 { 71 71 } 72 - static inline void dump_page_owner(struct page *page) 72 + static inline void dump_page_owner(const struct page *page) 73 73 { 74 74 } 75 75 #endif /* CONFIG_PAGE_OWNER */
+2 -2
include/linux/page_ref.h
··· 62 62 63 63 #endif 64 64 65 - static inline int page_ref_count(struct page *page) 65 + static inline int page_ref_count(const struct page *page) 66 66 { 67 67 return atomic_read(&page->_refcount); 68 68 } 69 69 70 - static inline int page_count(struct page *page) 70 + static inline int page_count(const struct page *page) 71 71 { 72 72 return atomic_read(&compound_head(page)->_refcount); 73 73 }
+3
include/linux/page_reporting.h
··· 18 18 19 19 /* Current state of page reporting */ 20 20 atomic_t state; 21 + 22 + /* Minimal order of page reporting */ 23 + unsigned int order; 21 24 }; 22 25 23 26 /* Tear-down and bring-up for page reporting devices */
+1 -1
include/linux/pageblock-flags.h
··· 54 54 /* Forward declaration */ 55 55 struct page; 56 56 57 - unsigned long get_pfnblock_flags_mask(struct page *page, 57 + unsigned long get_pfnblock_flags_mask(const struct page *page, 58 58 unsigned long pfn, 59 59 unsigned long mask); 60 60
+4
include/linux/pagemap.h
··· 702 702 extern void end_page_writeback(struct page *page); 703 703 void wait_for_stable_page(struct page *page); 704 704 705 + void __set_page_dirty(struct page *, struct address_space *, int warn); 706 + int __set_page_dirty_nobuffers(struct page *page); 707 + int __set_page_dirty_no_writeback(struct page *page); 708 + 705 709 void page_endio(struct page *page, bool is_write, int err); 706 710 707 711 /**
+22
include/linux/pgtable.h
··· 1592 1592 #define pte_leaf_size(x) PAGE_SIZE 1593 1593 #endif 1594 1594 1595 + /* 1596 + * Some architectures have MMUs that are configurable or selectable at boot 1597 + * time. These lead to variable PTRS_PER_x. For statically allocated arrays it 1598 + * helps to have a static maximum value. 1599 + */ 1600 + 1601 + #ifndef MAX_PTRS_PER_PTE 1602 + #define MAX_PTRS_PER_PTE PTRS_PER_PTE 1603 + #endif 1604 + 1605 + #ifndef MAX_PTRS_PER_PMD 1606 + #define MAX_PTRS_PER_PMD PTRS_PER_PMD 1607 + #endif 1608 + 1609 + #ifndef MAX_PTRS_PER_PUD 1610 + #define MAX_PTRS_PER_PUD PTRS_PER_PUD 1611 + #endif 1612 + 1613 + #ifndef MAX_PTRS_PER_P4D 1614 + #define MAX_PTRS_PER_P4D PTRS_PER_P4D 1615 + #endif 1616 + 1595 1617 #endif /* _LINUX_PGTABLE_H */
+5
include/linux/printk.h
··· 206 206 __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); 207 207 void dump_stack_print_info(const char *log_lvl); 208 208 void show_regs_print_info(const char *log_lvl); 209 + extern asmlinkage void dump_stack_lvl(const char *log_lvl) __cold; 209 210 extern asmlinkage void dump_stack(void) __cold; 210 211 extern void printk_safe_flush(void); 211 212 extern void printk_safe_flush_on_panic(void); ··· 267 266 } 268 267 269 268 static inline void show_regs_print_info(const char *log_lvl) 269 + { 270 + } 271 + 272 + static inline void dump_stack_lvl(const char *log_lvl) 270 273 { 271 274 } 272 275
+8
include/linux/sched/coredump.h
··· 73 73 #define MMF_OOM_VICTIM 25 /* mm is the oom victim */ 74 74 #define MMF_OOM_REAP_QUEUED 26 /* mm was queued for oom_reaper */ 75 75 #define MMF_MULTIPROCESS 27 /* mm is shared between processes */ 76 + /* 77 + * MMF_HAS_PINNED: Whether this mm has pinned any pages. This can be either 78 + * replaced in the future by mm.pinned_vm when it becomes stable, or grow into 79 + * a counter on its own. We're aggressive on this bit for now: even if the 80 + * pinned pages were unpinned later on, we'll still keep this bit set for the 81 + * lifecycle of this mm, just for simplicity. 82 + */ 83 + #define MMF_HAS_PINNED 28 /* FOLL_PIN has run, never cleared */ 76 84 #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) 77 85 78 86 #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\

+47 -12
include/linux/slab.h
··· 305 305 /* 306 306 * Whenever changing this, take care of that kmalloc_type() and 307 307 * create_kmalloc_caches() still work as intended. 308 + * 309 + * KMALLOC_NORMAL can contain only unaccounted objects whereas KMALLOC_CGROUP 310 + * is for accounted but unreclaimable and non-dma objects. All the other 311 + * kmem caches can have both accounted and unaccounted objects. 308 312 */ 309 313 enum kmalloc_cache_type { 310 314 KMALLOC_NORMAL = 0, 315 + #ifndef CONFIG_ZONE_DMA 316 + KMALLOC_DMA = KMALLOC_NORMAL, 317 + #endif 318 + #ifndef CONFIG_MEMCG_KMEM 319 + KMALLOC_CGROUP = KMALLOC_NORMAL, 320 + #else 321 + KMALLOC_CGROUP, 322 + #endif 311 323 KMALLOC_RECLAIM, 312 324 #ifdef CONFIG_ZONE_DMA 313 325 KMALLOC_DMA, ··· 331 319 extern struct kmem_cache * 332 320 kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; 333 321 322 + /* 323 + * Define gfp bits that should not be set for KMALLOC_NORMAL. 324 + */ 325 + #define KMALLOC_NOT_NORMAL_BITS \ 326 + (__GFP_RECLAIMABLE | \ 327 + (IS_ENABLED(CONFIG_ZONE_DMA) ? __GFP_DMA : 0) | \ 328 + (IS_ENABLED(CONFIG_MEMCG_KMEM) ? __GFP_ACCOUNT : 0)) 329 + 334 330 static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) 335 331 { 336 - #ifdef CONFIG_ZONE_DMA 337 332 /* 338 333 * The most common case is KMALLOC_NORMAL, so test for it 339 - * with a single branch for both flags. 334 + * with a single branch for all the relevant flags. 340 335 */ 341 - if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0)) 336 + if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0)) 342 337 return KMALLOC_NORMAL; 343 338 344 339 /* 345 - * At least one of the flags has to be set. If both are, __GFP_DMA 346 - * is more important. 340 + * At least one of the flags has to be set. Their priorities in 341 + * decreasing order are: 342 + * 1) __GFP_DMA 343 + * 2) __GFP_RECLAIMABLE 344 + * 3) __GFP_ACCOUNT 347 345 */ 348 - return flags & __GFP_DMA ? 
KMALLOC_DMA : KMALLOC_RECLAIM; 349 - #else 350 - return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL; 351 - #endif 346 + if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA)) 347 + return KMALLOC_DMA; 348 + if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE)) 349 + return KMALLOC_RECLAIM; 350 + else 351 + return KMALLOC_CGROUP; 352 352 } 353 353 354 354 /* ··· 370 346 * 1 = 65 .. 96 bytes 371 347 * 2 = 129 .. 192 bytes 372 348 * n = 2^(n-1)+1 .. 2^n 349 + * 350 + * Note: __kmalloc_index() is compile-time optimized, and not runtime optimized; 351 + * typical usage is via kmalloc_index() and therefore evaluated at compile-time. 352 + * Callers where !size_is_constant should only be test modules, where runtime 353 + * overheads of __kmalloc_index() can be tolerated. Also see kmalloc_slab(). 373 354 */ 374 - static __always_inline unsigned int kmalloc_index(size_t size) 355 + static __always_inline unsigned int __kmalloc_index(size_t size, 356 + bool size_is_constant) 375 357 { 376 358 if (!size) 377 359 return 0; ··· 412 382 if (size <= 8 * 1024 * 1024) return 23; 413 383 if (size <= 16 * 1024 * 1024) return 24; 414 384 if (size <= 32 * 1024 * 1024) return 25; 415 - if (size <= 64 * 1024 * 1024) return 26; 416 - BUG(); 385 + 386 + if ((IS_ENABLED(CONFIG_CC_IS_GCC) || CONFIG_CLANG_VERSION >= 110000) 387 + && !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant) 388 + BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()"); 389 + else 390 + BUG(); 417 391 418 392 /* Will never be reached. Needed because the compiler may complain */ 419 393 return -1; 420 394 } 395 + #define kmalloc_index(s) __kmalloc_index(s, true) 421 396 #endif /* !CONFIG_SLOB */ 422 397 423 398 void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
+17 -2
include/linux/swap.h
··· 177 177 SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */ 178 178 SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */ 179 179 SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */ 180 - SWP_VALID = (1 << 13), /* swap is valid to be operated on? */ 181 180 /* add others here before... */ 182 181 SWP_SCANNING = (1 << 14), /* refcount in scan_swap_map */ 183 182 }; ··· 239 240 * The in-memory structure used to track swap areas. 240 241 */ 241 242 struct swap_info_struct { 243 + struct percpu_ref users; /* indicate and keep swap device valid. */ 242 244 unsigned long flags; /* SWP_USED etc: see above */ 243 245 signed short prio; /* swap priority of this type */ 244 246 struct plist_node list; /* entry in swap_active_head */ ··· 260 260 struct block_device *bdev; /* swap device or bdev of swap file */ 261 261 struct file *swap_file; /* seldom referenced */ 262 262 unsigned int old_block_size; /* seldom referenced */ 263 + struct completion comp; /* seldom referenced */ 263 264 #ifdef CONFIG_FRONTSWAP 264 265 unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ 265 266 atomic_t frontswap_pages; /* frontswap pages in-use counter */ ··· 446 445 extern void delete_from_swap_cache(struct page *); 447 446 extern void clear_shadow_from_swap_cache(int type, unsigned long begin, 448 447 unsigned long end); 448 + extern void free_swap_cache(struct page *); 449 449 extern void free_page_and_swap_cache(struct page *); 450 450 extern void free_pages_and_swap_cache(struct page **, int); 451 451 extern struct page *lookup_swap_cache(swp_entry_t entry, ··· 513 511 514 512 static inline void put_swap_device(struct swap_info_struct *si) 515 513 { 516 - rcu_read_unlock(); 514 + percpu_ref_put(&si->users); 517 515 } 518 516 519 517 #else /* CONFIG_SWAP */ ··· 526 524 static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry) 527 525 { 528 526 return NULL; 527 + } 528 + 529 + static inline struct 
swap_info_struct *get_swap_device(swp_entry_t entry) 530 + { 531 + return NULL; 532 + } 533 + 534 + static inline void put_swap_device(struct swap_info_struct *si) 535 + { 529 536 } 530 537 531 538 #define swap_address_space(entry) (NULL) ··· 551 540 put_page(page) 552 541 #define free_pages_and_swap_cache(pages, nr) \ 553 542 release_pages((pages), (nr)); 543 + 544 + static inline void free_swap_cache(struct page *page) 545 + { 546 + } 554 547 555 548 static inline void show_swap_cache_info(void) 556 549 {
+5
include/linux/swapops.h
··· 330 330 return swp_type(entry) == SWP_HWPOISON; 331 331 } 332 332 333 + static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry) 334 + { 335 + return swp_offset(entry); 336 + } 337 + 333 338 static inline void num_poisoned_pages_inc(void) 334 339 { 335 340 atomic_long_inc(&num_poisoned_pages);
+40 -27
include/linux/vmstat.h
··· 138 138 * Zone and node-based page accounting with per cpu differentials. 139 139 */ 140 140 extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS]; 141 - extern atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS]; 142 141 extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS]; 142 + extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS]; 143 143 144 144 #ifdef CONFIG_NUMA 145 - static inline void zone_numa_state_add(long x, struct zone *zone, 146 - enum numa_stat_item item) 145 + static inline void zone_numa_event_add(long x, struct zone *zone, 146 + enum numa_stat_item item) 147 147 { 148 - atomic_long_add(x, &zone->vm_numa_stat[item]); 149 - atomic_long_add(x, &vm_numa_stat[item]); 148 + atomic_long_add(x, &zone->vm_numa_event[item]); 149 + atomic_long_add(x, &vm_numa_event[item]); 150 150 } 151 151 152 - static inline unsigned long global_numa_state(enum numa_stat_item item) 153 - { 154 - long x = atomic_long_read(&vm_numa_stat[item]); 155 - 156 - return x; 157 - } 158 - 159 - static inline unsigned long zone_numa_state_snapshot(struct zone *zone, 152 + static inline unsigned long zone_numa_event_state(struct zone *zone, 160 153 enum numa_stat_item item) 161 154 { 162 - long x = atomic_long_read(&zone->vm_numa_stat[item]); 163 - int cpu; 155 + return atomic_long_read(&zone->vm_numa_event[item]); 156 + } 164 157 165 - for_each_online_cpu(cpu) 166 - x += per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item]; 167 - 168 - return x; 158 + static inline unsigned long 159 + global_numa_event_state(enum numa_stat_item item) 160 + { 161 + return atomic_long_read(&vm_numa_event[item]); 169 162 } 170 163 #endif /* CONFIG_NUMA */ 171 164 ··· 229 236 #ifdef CONFIG_SMP 230 237 int cpu; 231 238 for_each_online_cpu(cpu) 232 - x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item]; 239 + x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item]; 233 240 234 241 if (x < 0) 235 242 x = 0; ··· 238 245 } 239 246 240 247 #ifdef CONFIG_NUMA 241 - extern void 
__inc_numa_state(struct zone *zone, enum numa_stat_item item); 248 + /* See __count_vm_event comment on why raw_cpu_inc is used. */ 249 + static inline void 250 + __count_numa_event(struct zone *zone, enum numa_stat_item item) 251 + { 252 + struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats; 253 + 254 + raw_cpu_inc(pzstats->vm_numa_event[item]); 255 + } 256 + 257 + static inline void 258 + __count_numa_events(struct zone *zone, enum numa_stat_item item, long delta) 259 + { 260 + struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats; 261 + 262 + raw_cpu_add(pzstats->vm_numa_event[item], delta); 263 + } 264 + 242 265 extern unsigned long sum_zone_node_page_state(int node, 243 266 enum zone_stat_item item); 244 - extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item); 267 + extern unsigned long sum_zone_numa_event_state(int node, enum numa_stat_item item); 245 268 extern unsigned long node_page_state(struct pglist_data *pgdat, 246 269 enum node_stat_item item); 247 270 extern unsigned long node_page_state_pages(struct pglist_data *pgdat, 248 271 enum node_stat_item item); 272 + extern void fold_vm_numa_events(void); 249 273 #else 250 274 #define sum_zone_node_page_state(node, item) global_zone_page_state(item) 251 275 #define node_page_state(node, item) global_node_page_state(item) 252 276 #define node_page_state_pages(node, item) global_node_page_state_pages(item) 277 + static inline void fold_vm_numa_events(void) 278 + { 279 + } 253 280 #endif /* CONFIG_NUMA */ 254 281 255 282 #ifdef CONFIG_SMP ··· 304 291 int vmstat_refresh(struct ctl_table *, int write, void *buffer, size_t *lenp, 305 292 loff_t *ppos); 306 293 307 - void drain_zonestat(struct zone *zone, struct per_cpu_pageset *); 294 + void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *); 308 295 309 296 int calculate_pressure_threshold(struct zone *zone); 310 297 int calculate_normal_threshold(struct zone *zone); ··· 412 399 static inline void 
quiet_vmstat(void) { } 413 400 414 401 static inline void drain_zonestat(struct zone *zone, 415 - struct per_cpu_pageset *pset) { } 402 + struct per_cpu_zonestat *pzstats) { } 416 403 #endif /* CONFIG_SMP */ 417 404 418 405 static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages, ··· 441 428 static inline const char *node_stat_name(enum node_stat_item item) 442 429 { 443 430 return vmstat_text[NR_VM_ZONE_STAT_ITEMS + 444 - NR_VM_NUMA_STAT_ITEMS + 431 + NR_VM_NUMA_EVENT_ITEMS + 445 432 item]; 446 433 } 447 434 ··· 453 440 static inline const char *writeback_stat_name(enum writeback_stat_item item) 454 441 { 455 442 return vmstat_text[NR_VM_ZONE_STAT_ITEMS + 456 - NR_VM_NUMA_STAT_ITEMS + 443 + NR_VM_NUMA_EVENT_ITEMS + 457 444 NR_VM_NODE_STAT_ITEMS + 458 445 item]; 459 446 } ··· 462 449 static inline const char *vm_event_name(enum vm_event_item item) 463 450 { 464 451 return vmstat_text[NR_VM_ZONE_STAT_ITEMS + 465 - NR_VM_NUMA_STAT_ITEMS + 452 + NR_VM_NUMA_EVENT_ITEMS + 466 453 NR_VM_NODE_STAT_ITEMS + 467 454 NR_VM_WRITEBACK_STAT_ITEMS + 468 455 item];
+1
include/linux/writeback.h
··· 221 221 int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages, 222 222 enum wb_reason reason, struct wb_completion *done); 223 223 void cgroup_writeback_umount(void); 224 + bool cleanup_offline_cgwb(struct bdi_writeback *wb); 224 225 225 226 /** 226 227 * inode_attach_wb - associate an inode with its wb
+2 -2
include/trace/events/cma.h
··· 31 31 __entry->align = align; 32 32 ), 33 33 34 - TP_printk("name=%s pfn=%lx page=%p count=%lu align=%u", 34 + TP_printk("name=%s pfn=0x%lx page=%p count=%lu align=%u", 35 35 __get_str(name), 36 36 __entry->pfn, 37 37 __entry->page, ··· 60 60 __entry->count = count; 61 61 ), 62 62 63 - TP_printk("name=%s pfn=%lx page=%p count=%lu", 63 + TP_printk("name=%s pfn=0x%lx page=%p count=%lu", 64 64 __get_str(name), 65 65 __entry->pfn, 66 66 __entry->page,
+1 -1
include/trace/events/filemap.h
··· 36 36 __entry->s_dev = page->mapping->host->i_rdev; 37 37 ), 38 38 39 - TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu", 39 + TP_printk("dev %d:%d ino %lx page=%p pfn=0x%lx ofs=%lu", 40 40 MAJOR(__entry->s_dev), MINOR(__entry->s_dev), 41 41 __entry->i_ino, 42 42 pfn_to_page(__entry->pfn),
+6 -6
include/trace/events/kmem.h
··· 173 173 __entry->order = order; 174 174 ), 175 175 176 - TP_printk("page=%p pfn=%lu order=%d", 176 + TP_printk("page=%p pfn=0x%lx order=%d", 177 177 pfn_to_page(__entry->pfn), 178 178 __entry->pfn, 179 179 __entry->order) ··· 193 193 __entry->pfn = page_to_pfn(page); 194 194 ), 195 195 196 - TP_printk("page=%p pfn=%lu order=0", 196 + TP_printk("page=%p pfn=0x%lx order=0", 197 197 pfn_to_page(__entry->pfn), 198 198 __entry->pfn) 199 199 ); ··· 219 219 __entry->migratetype = migratetype; 220 220 ), 221 221 222 - TP_printk("page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s", 222 + TP_printk("page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s", 223 223 __entry->pfn != -1UL ? pfn_to_page(__entry->pfn) : NULL, 224 224 __entry->pfn != -1UL ? __entry->pfn : 0, 225 225 __entry->order, ··· 245 245 __entry->migratetype = migratetype; 246 246 ), 247 247 248 - TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d", 248 + TP_printk("page=%p pfn=0x%lx order=%u migratetype=%d percpu_refill=%d", 249 249 __entry->pfn != -1UL ? pfn_to_page(__entry->pfn) : NULL, 250 250 __entry->pfn != -1UL ? __entry->pfn : 0, 251 251 __entry->order, ··· 278 278 __entry->migratetype = migratetype; 279 279 ), 280 280 281 - TP_printk("page=%p pfn=%lu order=%d migratetype=%d", 281 + TP_printk("page=%p pfn=0x%lx order=%d migratetype=%d", 282 282 pfn_to_page(__entry->pfn), __entry->pfn, 283 283 __entry->order, __entry->migratetype) 284 284 ); ··· 312 312 get_pageblock_migratetype(page)); 313 313 ), 314 314 315 - TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d", 315 + TP_printk("page=%p pfn=0x%lx alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d", 316 316 pfn_to_page(__entry->pfn), 317 317 __entry->pfn, 318 318 __entry->alloc_order,
+2 -2
include/trace/events/page_pool.h
··· 60 60 __entry->pfn = page_to_pfn(page); 61 61 ), 62 62 63 - TP_printk("page_pool=%p page=%p pfn=%lu release=%u", 63 + TP_printk("page_pool=%p page=%p pfn=0x%lx release=%u", 64 64 __entry->pool, __entry->page, __entry->pfn, __entry->release) 65 65 ); 66 66 ··· 85 85 __entry->pfn = page_to_pfn(page); 86 86 ), 87 87 88 - TP_printk("page_pool=%p page=%p pfn=%lu hold=%u", 88 + TP_printk("page_pool=%p page=%p pfn=0x%lx hold=%u", 89 89 __entry->pool, __entry->page, __entry->pfn, __entry->hold) 90 90 ); 91 91
+2 -2
include/trace/events/pagemap.h
··· 46 46 ), 47 47 48 48 /* Flag format is based on page-types.c formatting for pagemap */ 49 - TP_printk("page=%p pfn=%lu lru=%d flags=%s%s%s%s%s%s", 49 + TP_printk("page=%p pfn=0x%lx lru=%d flags=%s%s%s%s%s%s", 50 50 __entry->page, 51 51 __entry->pfn, 52 52 __entry->lru, ··· 75 75 ), 76 76 77 77 /* Flag format is based on page-types.c formatting for pagemap */ 78 - TP_printk("page=%p pfn=%lu", __entry->page, __entry->pfn) 78 + TP_printk("page=%p pfn=0x%lx", __entry->page, __entry->pfn) 79 79 80 80 ); 81 81
+1 -1
include/trace/events/vmscan.h
··· 330 330 page_is_file_lru(page)); 331 331 ), 332 332 333 - TP_printk("page=%p pfn=%lu flags=%s", 333 + TP_printk("page=%p pfn=0x%lx flags=%s", 334 334 pfn_to_page(__entry->pfn), 335 335 __entry->pfn, 336 336 show_reclaim_flags(__entry->reclaim_flags))
+1
kernel/cgroup/cgroup.c
··· 577 577 rcu_read_unlock(); 578 578 return css; 579 579 } 580 + EXPORT_SYMBOL_GPL(cgroup_get_e_css); 580 581 581 582 static void cgroup_get_live(struct cgroup *cgrp) 582 583 {
+2 -2
kernel/crash_core.c
··· 455 455 VMCOREINFO_SYMBOL(_stext); 456 456 VMCOREINFO_SYMBOL(vmap_area_list); 457 457 458 - #ifndef CONFIG_NEED_MULTIPLE_NODES 458 + #ifndef CONFIG_NUMA 459 459 VMCOREINFO_SYMBOL(mem_map); 460 460 VMCOREINFO_SYMBOL(contig_page_data); 461 461 #endif ··· 484 484 VMCOREINFO_OFFSET(page, compound_head); 485 485 VMCOREINFO_OFFSET(pglist_data, node_zones); 486 486 VMCOREINFO_OFFSET(pglist_data, nr_zones); 487 - #ifdef CONFIG_FLAT_NODE_MEM_MAP 487 + #ifdef CONFIG_FLATMEM 488 488 VMCOREINFO_OFFSET(pglist_data, node_mem_map); 489 489 #endif 490 490 VMCOREINFO_OFFSET(pglist_data, node_start_pfn);
-2
kernel/events/core.c
··· 8309 8309 8310 8310 if (vma->vm_flags & VM_DENYWRITE) 8311 8311 flags |= MAP_DENYWRITE; 8312 - if (vma->vm_flags & VM_MAYEXEC) 8313 - flags |= MAP_EXECUTABLE; 8314 8312 if (vma->vm_flags & VM_LOCKED) 8315 8313 flags |= MAP_LOCKED; 8316 8314 if (is_vm_hugetlb_page(vma))
+2 -2
kernel/events/uprobes.c
··· 2047 2047 struct vm_area_struct *vma; 2048 2048 2049 2049 mmap_read_lock(mm); 2050 - vma = find_vma(mm, bp_vaddr); 2051 - if (vma && vma->vm_start <= bp_vaddr) { 2050 + vma = vma_lookup(mm, bp_vaddr); 2051 + if (vma) { 2052 2052 if (valid_vma(vma, false)) { 2053 2053 struct inode *inode = file_inode(vma->vm_file); 2054 2054 loff_t offset = vaddr_to_offset(vma, bp_vaddr);
-1
kernel/fork.c
··· 1035 1035 mm_pgtables_bytes_init(mm); 1036 1036 mm->map_count = 0; 1037 1037 mm->locked_vm = 0; 1038 - atomic_set(&mm->has_pinned, 0); 1039 1038 atomic64_set(&mm->pinned_vm, 0); 1040 1039 memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); 1041 1040 spin_lock_init(&mm->page_table_lock);
+12 -7
kernel/kthread.c
··· 1162 1162 * modify @dwork's timer so that it expires after @delay. If @delay is zero, 1163 1163 * @work is guaranteed to be queued immediately. 1164 1164 * 1165 - * Return: %true if @dwork was pending and its timer was modified, 1166 - * %false otherwise. 1165 + * Return: %false if @dwork was idle and queued, %true otherwise. 1167 1166 * 1168 1167 * A special case is when the work is being canceled in parallel. 1169 1168 * It might be caused either by the real kthread_cancel_delayed_work_sync() 1170 1169 * or yet another kthread_mod_delayed_work() call. We let the other command 1171 - * win and return %false here. The caller is supposed to synchronize these 1172 - * operations a reasonable way. 1170 + * win and return %true here. The return value can be used for reference 1171 + * counting and the number of queued works stays the same. Anyway, the caller 1172 + * is supposed to synchronize these operations a reasonable way. 1173 1173 * 1174 1174 * This function is safe to call from any context including IRQ handler. 1175 1175 * See __kthread_cancel_work() and kthread_delayed_work_timer_fn() ··· 1181 1181 { 1182 1182 struct kthread_work *work = &dwork->work; 1183 1183 unsigned long flags; 1184 - int ret = false; 1184 + int ret; 1185 1185 1186 1186 raw_spin_lock_irqsave(&worker->lock, flags); 1187 1187 1188 1188 /* Do not bother with canceling when never queued. */ 1189 - if (!work->worker) 1189 + if (!work->worker) { 1190 + ret = false; 1190 1191 goto fast_queue; 1192 + } 1191 1193 1192 1194 /* Work must not be used with >1 worker, see kthread_queue_work() */ 1193 1195 WARN_ON_ONCE(work->worker != worker); ··· 1207 1205 * be used for reference counting. 1208 1206 */ 1209 1207 kthread_cancel_delayed_work_timer(work, &flags); 1210 - if (work->canceling) 1208 + if (work->canceling) { 1209 + /* The number of works in the queue does not change. 
*/ 1210 + ret = true; 1211 1211 goto out; 1212 + } 1212 1213 ret = __kthread_cancel_work(work); 1213 1214 1214 1215 fast_queue:
+4 -4
kernel/sysctl.c
··· 2921 2921 .extra2 = &one_thousand, 2922 2922 }, 2923 2923 { 2924 - .procname = "percpu_pagelist_fraction", 2925 - .data = &percpu_pagelist_fraction, 2926 - .maxlen = sizeof(percpu_pagelist_fraction), 2924 + .procname = "percpu_pagelist_high_fraction", 2925 + .data = &percpu_pagelist_high_fraction, 2926 + .maxlen = sizeof(percpu_pagelist_high_fraction), 2927 2927 .mode = 0644, 2928 - .proc_handler = percpu_pagelist_fraction_sysctl_handler, 2928 + .proc_handler = percpu_pagelist_high_fraction_sysctl_handler, 2929 2929 .extra1 = SYSCTL_ZERO, 2930 2930 }, 2931 2931 {
+4 -8
kernel/watchdog.c
··· 92 92 * own hardlockup detector. 93 93 * 94 94 * watchdog_nmi_enable/disable can be implemented to start and stop when 95 - * softlockup watchdog threads start and stop. The arch must select the 95 + * softlockup watchdog start and stop. The arch must select the 96 96 * SOFTLOCKUP_DETECTOR Kconfig. 97 97 */ 98 98 int __weak watchdog_nmi_enable(unsigned int cpu) ··· 335 335 static DEFINE_PER_CPU(struct cpu_stop_work, softlockup_stop_work); 336 336 337 337 /* 338 - * The watchdog thread function - touches the timestamp. 338 + * The watchdog feed function - touches the timestamp. 339 339 * 340 340 * It only runs once every sample_period seconds (4 seconds by 341 341 * default) to reset the softlockup timestamp. If this gets delayed ··· 558 558 } 559 559 560 560 /* 561 - * Create the watchdog thread infrastructure and configure the detector(s). 562 - * 563 - * The threads are not unparked as watchdog_allowed_mask is empty. When 564 - * the threads are successfully initialized, take the proper locks and 565 - * unpark the threads in the watchdog_cpumask if the watchdog is enabled. 561 + * Create the watchdog infrastructure and configure the detector(s). 566 562 */ 567 563 static __init void lockup_detector_setup(void) 568 564 { ··· 624 628 625 629 #ifdef CONFIG_SYSCTL 626 630 627 - /* Propagate any changes to the watchdog threads */ 631 + /* Propagate any changes to the watchdog infrastructure */ 628 632 static void proc_watchdog_update(void) 629 633 { 630 634 /* Remove impossible cpus to keep sysctl output clean. */
+15
lib/Kconfig.debug
··· 313 313 config PAHOLE_HAS_SPLIT_BTF 314 314 def_bool $(success, test `$(PAHOLE) --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "119") 315 315 316 + config PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT 317 + def_bool $(success, test `$(PAHOLE) --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "122") 318 + 316 319 config DEBUG_INFO_BTF_MODULES 317 320 def_bool y 318 321 depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF ··· 2429 2426 help 2430 2427 This builds the bits unit test. 2431 2428 Tests the logic of macros defined in bits.h. 2429 + For more information on KUnit and unit tests in general please refer 2430 + to the KUnit documentation in Documentation/dev-tools/kunit/. 2431 + 2432 + If unsure, say N. 2433 + 2434 + config SLUB_KUNIT_TEST 2435 + tristate "KUnit test for SLUB cache error detection" if !KUNIT_ALL_TESTS 2436 + depends on SLUB_DEBUG && KUNIT 2437 + default KUNIT_ALL_TESTS 2438 + help 2439 + This builds SLUB allocator unit test. 2440 + Tests SLUB cache debugging functionality. 2432 2441 For more information on KUnit and unit tests in general please refer 2433 2442 to the KUnit documentation in Documentation/dev-tools/kunit/. 2434 2443
+14 -2
lib/Kconfig.kasan
··· 12 12 config HAVE_ARCH_KASAN_VMALLOC 13 13 bool 14 14 15 + config ARCH_DISABLE_KASAN_INLINE 16 + bool 17 + help 18 + An architecture might not support inline instrumentation. 19 + When this option is selected, inline and stack instrumentation are 20 + disabled. 21 + 15 22 config CC_HAS_KASAN_GENERIC 16 23 def_bool $(cc-option, -fsanitize=kernel-address) 17 24 ··· 137 130 138 131 config KASAN_INLINE 139 132 bool "Inline instrumentation" 133 + depends on !ARCH_DISABLE_KASAN_INLINE 140 134 help 141 135 Compiler directly inserts code checking shadow memory before 142 136 memory accesses. This is faster than outline (in some workloads ··· 149 141 config KASAN_STACK 150 142 bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && !COMPILE_TEST 151 143 depends on KASAN_GENERIC || KASAN_SW_TAGS 144 + depends on !ARCH_DISABLE_KASAN_INLINE 152 145 default y if CC_IS_GCC 153 146 help 154 147 The LLVM stack address sanitizer has a know problem that ··· 163 154 but clang users can still enable it for builds without 164 155 CONFIG_COMPILE_TEST. On gcc it is assumed to always be safe 165 156 to use and enabled by default. 157 + If the architecture disables inline instrumentation, stack 158 + instrumentation is also disabled as it adds inline-style 159 + instrumentation that is run unconditionally. 166 160 167 - config KASAN_SW_TAGS_IDENTIFY 161 + config KASAN_TAGS_IDENTIFY 168 162 bool "Enable memory corruption identification" 169 - depends on KASAN_SW_TAGS 163 + depends on KASAN_SW_TAGS || KASAN_HW_TAGS 170 164 help 171 165 This option enables best-effort identification of bug type 172 166 (use-after-free or out-of-bounds) at the cost of increased
+1
lib/Makefile
··· 355 355 obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o 356 356 obj-$(CONFIG_BITS_TEST) += test_bits.o 357 357 obj-$(CONFIG_CMDLINE_KUNIT_TEST) += cmdline_kunit.o 358 + obj-$(CONFIG_SLUB_KUNIT_TEST) += slub_kunit.o 358 359 359 360 obj-$(CONFIG_GENERIC_LIB_DEVMEM_IS_ALLOWED) += devmem_is_allowed.o
+11 -5
lib/dump_stack.c
··· 73 73 dump_stack_print_info(log_lvl); 74 74 } 75 75 76 - static void __dump_stack(void) 76 + static void __dump_stack(const char *log_lvl) 77 77 { 78 - dump_stack_print_info(KERN_DEFAULT); 79 - show_stack(NULL, NULL, KERN_DEFAULT); 78 + dump_stack_print_info(log_lvl); 79 + show_stack(NULL, NULL, log_lvl); 80 80 } 81 81 82 82 /** ··· 84 84 * 85 85 * Architectures can override this implementation by implementing its own. 86 86 */ 87 - asmlinkage __visible void dump_stack(void) 87 + asmlinkage __visible void dump_stack_lvl(const char *log_lvl) 88 88 { 89 89 unsigned long flags; 90 90 ··· 93 93 * against other CPUs 94 94 */ 95 95 printk_cpu_lock_irqsave(flags); 96 - __dump_stack(); 96 + __dump_stack(log_lvl); 97 97 printk_cpu_unlock_irqrestore(flags); 98 + } 99 + EXPORT_SYMBOL(dump_stack_lvl); 100 + 101 + asmlinkage __visible void dump_stack(void) 102 + { 103 + dump_stack_lvl(KERN_DEFAULT); 98 104 } 99 105 EXPORT_SYMBOL(dump_stack);
+11 -7
lib/kunit/test.c
··· 475 475 void *data) 476 476 { 477 477 int ret = 0; 478 + unsigned long flags; 478 479 479 480 res->free = free; 480 481 kref_init(&res->refcount); ··· 488 487 res->data = data; 489 488 } 490 489 491 - spin_lock(&test->lock); 490 + spin_lock_irqsave(&test->lock, flags); 492 491 list_add_tail(&res->node, &test->resources); 493 492 /* refcount for list is established by kref_init() */ 494 - spin_unlock(&test->lock); 493 + spin_unlock_irqrestore(&test->lock, flags); 495 494 496 495 return ret; 497 496 } ··· 549 548 550 549 void kunit_remove_resource(struct kunit *test, struct kunit_resource *res) 551 550 { 552 - spin_lock(&test->lock); 551 + unsigned long flags; 552 + 553 + spin_lock_irqsave(&test->lock, flags); 553 554 list_del(&res->node); 554 - spin_unlock(&test->lock); 555 + spin_unlock_irqrestore(&test->lock, flags); 555 556 kunit_put_resource(res); 556 557 } 557 558 EXPORT_SYMBOL_GPL(kunit_remove_resource); ··· 633 630 void kunit_cleanup(struct kunit *test) 634 631 { 635 632 struct kunit_resource *res; 633 + unsigned long flags; 636 634 637 635 /* 638 636 * test->resources is a stack - each allocation must be freed in the ··· 645 641 * protect against the current node being deleted, not the next. 646 642 */ 647 643 while (true) { 648 - spin_lock(&test->lock); 644 + spin_lock_irqsave(&test->lock, flags); 649 645 if (list_empty(&test->resources)) { 650 - spin_unlock(&test->lock); 646 + spin_unlock_irqrestore(&test->lock, flags); 651 647 break; 652 648 } 653 649 res = list_last_entry(&test->resources, ··· 658 654 * resource, and this can't happen if the test->lock 659 655 * is held. 660 656 */ 661 - spin_unlock(&test->lock); 657 + spin_unlock_irqrestore(&test->lock, flags); 662 658 kunit_remove_resource(test, res); 663 659 } 664 660 current->kunit_test = NULL;
+152
lib/slub_kunit.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include <kunit/test.h> 3 + #include <linux/mm.h> 4 + #include <linux/slab.h> 5 + #include <linux/module.h> 6 + #include <linux/kernel.h> 7 + #include "../mm/slab.h" 8 + 9 + static struct kunit_resource resource; 10 + static int slab_errors; 11 + 12 + static void test_clobber_zone(struct kunit *test) 13 + { 14 + struct kmem_cache *s = kmem_cache_create("TestSlub_RZ_alloc", 64, 0, 15 + SLAB_RED_ZONE, NULL); 16 + u8 *p = kmem_cache_alloc(s, GFP_KERNEL); 17 + 18 + kasan_disable_current(); 19 + p[64] = 0x12; 20 + 21 + validate_slab_cache(s); 22 + KUNIT_EXPECT_EQ(test, 2, slab_errors); 23 + 24 + kasan_enable_current(); 25 + kmem_cache_free(s, p); 26 + kmem_cache_destroy(s); 27 + } 28 + 29 + #ifndef CONFIG_KASAN 30 + static void test_next_pointer(struct kunit *test) 31 + { 32 + struct kmem_cache *s = kmem_cache_create("TestSlub_next_ptr_free", 64, 0, 33 + SLAB_POISON, NULL); 34 + u8 *p = kmem_cache_alloc(s, GFP_KERNEL); 35 + unsigned long tmp; 36 + unsigned long *ptr_addr; 37 + 38 + kmem_cache_free(s, p); 39 + 40 + ptr_addr = (unsigned long *)(p + s->offset); 41 + tmp = *ptr_addr; 42 + p[s->offset] = 0x12; 43 + 44 + /* 45 + * Expecting three errors. 46 + * One for the corrupted freechain and the other one for the wrong 47 + * count of objects in use. The third error is fixing broken cache. 48 + */ 49 + validate_slab_cache(s); 50 + KUNIT_EXPECT_EQ(test, 3, slab_errors); 51 + 52 + /* 53 + * Try to repair corrupted freepointer. 54 + * Still expecting two errors. The first for the wrong count 55 + * of objects in use. 56 + * The second error is for fixing broken cache. 57 + */ 58 + *ptr_addr = tmp; 59 + slab_errors = 0; 60 + 61 + validate_slab_cache(s); 62 + KUNIT_EXPECT_EQ(test, 2, slab_errors); 63 + 64 + /* 65 + * Previous validation repaired the count of objects in use. 66 + * Now expecting no error. 
67 + */ 68 + slab_errors = 0; 69 + validate_slab_cache(s); 70 + KUNIT_EXPECT_EQ(test, 0, slab_errors); 71 + 72 + kmem_cache_destroy(s); 73 + } 74 + 75 + static void test_first_word(struct kunit *test) 76 + { 77 + struct kmem_cache *s = kmem_cache_create("TestSlub_1th_word_free", 64, 0, 78 + SLAB_POISON, NULL); 79 + u8 *p = kmem_cache_alloc(s, GFP_KERNEL); 80 + 81 + kmem_cache_free(s, p); 82 + *p = 0x78; 83 + 84 + validate_slab_cache(s); 85 + KUNIT_EXPECT_EQ(test, 2, slab_errors); 86 + 87 + kmem_cache_destroy(s); 88 + } 89 + 90 + static void test_clobber_50th_byte(struct kunit *test) 91 + { 92 + struct kmem_cache *s = kmem_cache_create("TestSlub_50th_word_free", 64, 0, 93 + SLAB_POISON, NULL); 94 + u8 *p = kmem_cache_alloc(s, GFP_KERNEL); 95 + 96 + kmem_cache_free(s, p); 97 + p[50] = 0x9a; 98 + 99 + validate_slab_cache(s); 100 + KUNIT_EXPECT_EQ(test, 2, slab_errors); 101 + 102 + kmem_cache_destroy(s); 103 + } 104 + #endif 105 + 106 + static void test_clobber_redzone_free(struct kunit *test) 107 + { 108 + struct kmem_cache *s = kmem_cache_create("TestSlub_RZ_free", 64, 0, 109 + SLAB_RED_ZONE, NULL); 110 + u8 *p = kmem_cache_alloc(s, GFP_KERNEL); 111 + 112 + kasan_disable_current(); 113 + kmem_cache_free(s, p); 114 + p[64] = 0xab; 115 + 116 + validate_slab_cache(s); 117 + KUNIT_EXPECT_EQ(test, 2, slab_errors); 118 + 119 + kasan_enable_current(); 120 + kmem_cache_destroy(s); 121 + } 122 + 123 + static int test_init(struct kunit *test) 124 + { 125 + slab_errors = 0; 126 + 127 + kunit_add_named_resource(test, NULL, NULL, &resource, 128 + "slab_errors", &slab_errors); 129 + return 0; 130 + } 131 + 132 + static struct kunit_case test_cases[] = { 133 + KUNIT_CASE(test_clobber_zone), 134 + 135 + #ifndef CONFIG_KASAN 136 + KUNIT_CASE(test_next_pointer), 137 + KUNIT_CASE(test_first_word), 138 + KUNIT_CASE(test_clobber_50th_byte), 139 + #endif 140 + 141 + KUNIT_CASE(test_clobber_redzone_free), 142 + {} 143 + }; 144 + 145 + static struct kunit_suite test_suite = { 146 + .name = 
"slub_test", 147 + .init = test_init, 148 + .test_cases = test_cases, 149 + }; 150 + kunit_test_suite(test_suite); 151 + 152 + MODULE_LICENSE("GPL");
+2 -3
lib/test_hmm.c
··· 686 686 687 687 mmap_read_lock(mm); 688 688 for (addr = start; addr < end; addr = next) { 689 - vma = find_vma(mm, addr); 690 - if (!vma || addr < vma->vm_start || 691 - !(vma->vm_flags & VM_READ)) { 689 + vma = vma_lookup(mm, addr); 690 + if (!vma || !(vma->vm_flags & VM_READ)) { 692 691 ret = -EINVAL; 693 692 goto out; 694 693 }
+5 -6
lib/test_kasan.c
··· 55 55 multishot = kasan_save_enable_multi_shot(); 56 56 kasan_set_tagging_report_once(false); 57 57 fail_data.report_found = false; 58 - fail_data.report_expected = false; 59 58 kunit_add_named_resource(test, NULL, NULL, &resource, 60 59 "kasan_data", &fail_data); 61 60 return 0; ··· 93 94 !kasan_async_mode_enabled()) \ 94 95 migrate_disable(); \ 95 96 KUNIT_EXPECT_FALSE(test, READ_ONCE(fail_data.report_found)); \ 96 - WRITE_ONCE(fail_data.report_expected, true); \ 97 97 barrier(); \ 98 98 expression; \ 99 99 barrier(); \ 100 - KUNIT_EXPECT_EQ(test, \ 101 - READ_ONCE(fail_data.report_expected), \ 102 - READ_ONCE(fail_data.report_found)); \ 100 + if (!READ_ONCE(fail_data.report_found)) { \ 101 + KUNIT_FAIL(test, KUNIT_SUBTEST_INDENT "KASAN failure " \ 102 + "expected in \"" #expression \ 103 + "\", but none occurred"); \ 104 + } \ 103 105 if (IS_ENABLED(CONFIG_KASAN_HW_TAGS)) { \ 104 106 if (READ_ONCE(fail_data.report_found)) \ 105 107 kasan_enable_tagging_sync(); \ 106 108 migrate_enable(); \ 107 109 } \ 108 110 WRITE_ONCE(fail_data.report_found, false); \ 109 - WRITE_ONCE(fail_data.report_expected, false); \ 110 111 } while (0) 111 112 112 113 #define KASAN_TEST_NEEDS_CONFIG_ON(test, config) do { \
+1 -1
lib/vsprintf.c
··· 2224 2224 bool no_hash_pointers __ro_after_init; 2225 2225 EXPORT_SYMBOL_GPL(no_hash_pointers); 2226 2226 2227 - static int __init no_hash_pointers_enable(char *str) 2227 + int __init no_hash_pointers_enable(char *str) 2228 2228 { 2229 2229 if (no_hash_pointers) 2230 2230 return 0;
+2 -34
mm/Kconfig
··· 19 19 20 20 config FLATMEM_MANUAL 21 21 bool "Flat Memory" 22 - depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || ARCH_FLATMEM_ENABLE 22 + depends on !ARCH_SPARSEMEM_ENABLE || ARCH_FLATMEM_ENABLE 23 23 help 24 24 This option is best suited for non-NUMA systems with 25 25 flat address space. The FLATMEM is the most efficient ··· 31 31 choose "Sparse Memory". 32 32 33 33 If unsure, choose this option (Flat Memory) over any other. 34 - 35 - config DISCONTIGMEM_MANUAL 36 - bool "Discontiguous Memory" 37 - depends on ARCH_DISCONTIGMEM_ENABLE 38 - help 39 - This option provides enhanced support for discontiguous 40 - memory systems, over FLATMEM. These systems have holes 41 - in their physical address spaces, and this option provides 42 - more efficient handling of these holes. 43 - 44 - Although "Discontiguous Memory" is still used by several 45 - architectures, it is considered deprecated in favor of 46 - "Sparse Memory". 47 - 48 - If unsure, choose "Sparse Memory" over this option. 49 34 50 35 config SPARSEMEM_MANUAL 51 36 bool "Sparse Memory" ··· 47 62 48 63 endchoice 49 64 50 - config DISCONTIGMEM 51 - def_bool y 52 - depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || DISCONTIGMEM_MANUAL 53 - 54 65 config SPARSEMEM 55 66 def_bool y 56 67 depends on (!SELECT_MEMORY_MODEL && ARCH_SPARSEMEM_ENABLE) || SPARSEMEM_MANUAL 57 68 58 69 config FLATMEM 59 70 def_bool y 60 - depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL 61 - 62 - config FLAT_NODE_MEM_MAP 63 - def_bool y 64 - depends on !SPARSEMEM 65 - 66 - # 67 - # Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's 68 - # to represent different areas of memory. This variable allows 69 - # those dependencies to exist individually. 70 - # 71 - config NEED_MULTIPLE_NODES 72 - def_bool y 73 - depends on DISCONTIGMEM || NUMA 71 + depends on !SPARSEMEM || FLATMEM_MANUAL 74 72 75 73 # 76 74 # SPARSEMEM_EXTREME (which is the default) does some bootmem
+64 -2
mm/backing-dev.c
··· 371 371 #include <linux/memcontrol.h> 372 372 373 373 /* 374 - * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list. 375 - * bdi->cgwb_tree is also RCU protected. 374 + * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and 375 + * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected. 376 376 */ 377 377 static DEFINE_SPINLOCK(cgwb_lock); 378 378 static struct workqueue_struct *cgwb_release_wq; 379 + 380 + static LIST_HEAD(offline_cgwbs); 381 + static void cleanup_offline_cgwbs_workfn(struct work_struct *work); 382 + static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn); 379 383 380 384 static void cgwb_release_workfn(struct work_struct *work) 381 385 { ··· 399 395 400 396 fprop_local_destroy_percpu(&wb->memcg_completions); 401 397 percpu_ref_exit(&wb->refcnt); 398 + 399 + spin_lock_irq(&cgwb_lock); 400 + list_del(&wb->offline_node); 401 + spin_unlock_irq(&cgwb_lock); 402 + 402 403 wb_exit(wb); 404 + WARN_ON_ONCE(!list_empty(&wb->b_attached)); 403 405 kfree_rcu(wb, rcu); 404 406 } 405 407 ··· 423 413 WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); 424 414 list_del(&wb->memcg_node); 425 415 list_del(&wb->blkcg_node); 416 + list_add(&wb->offline_node, &offline_cgwbs); 426 417 percpu_ref_kill(&wb->refcnt); 427 418 } 428 419 ··· 483 472 484 473 wb->memcg_css = memcg_css; 485 474 wb->blkcg_css = blkcg_css; 475 + INIT_LIST_HEAD(&wb->b_attached); 486 476 INIT_WORK(&wb->release_work, cgwb_release_workfn); 487 477 set_bit(WB_registered, &wb->state); 488 478 ··· 645 633 mutex_unlock(&bdi->cgwb_release_mutex); 646 634 } 647 635 636 + /* 637 + * cleanup_offline_cgwbs_workfn - try to release dying cgwbs 638 + * 639 + * Try to release dying cgwbs by switching attached inodes to the nearest 640 + * living ancestor's writeback. Processed wbs are placed at the end 641 + * of the list to guarantee the forward progress. 
642 + */ 643 + static void cleanup_offline_cgwbs_workfn(struct work_struct *work) 644 + { 645 + struct bdi_writeback *wb; 646 + LIST_HEAD(processed); 647 + 648 + spin_lock_irq(&cgwb_lock); 649 + 650 + while (!list_empty(&offline_cgwbs)) { 651 + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback, 652 + offline_node); 653 + list_move(&wb->offline_node, &processed); 654 + 655 + /* 656 + * If wb is dirty, cleaning up the writeback by switching 657 + * attached inodes will result in an effective removal of any 658 + * bandwidth restrictions, which isn't the goal. Instead, 659 + * it can be postponed until the next time, when all io 660 + * will be likely completed. If in the meantime some inodes 661 + * will get re-dirtied, they should be eventually switched to 662 + * a new cgwb. 663 + */ 664 + if (wb_has_dirty_io(wb)) 665 + continue; 666 + 667 + if (!wb_tryget(wb)) 668 + continue; 669 + 670 + spin_unlock_irq(&cgwb_lock); 671 + while (cleanup_offline_cgwb(wb)) 672 + cond_resched(); 673 + spin_lock_irq(&cgwb_lock); 674 + 675 + wb_put(wb); 676 + } 677 + 678 + if (!list_empty(&processed)) 679 + list_splice_tail(&processed, &offline_cgwbs); 680 + 681 + spin_unlock_irq(&cgwb_lock); 682 + } 683 + 648 684 /** 649 685 * wb_memcg_offline - kill all wb's associated with a memcg being offlined 650 686 * @memcg: memcg being offlined ··· 709 649 cgwb_kill(wb); 710 650 memcg_cgwb_list->next = NULL; /* prevent new wb's */ 711 651 spin_unlock_irq(&cgwb_lock); 652 + 653 + queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work); 712 654 } 713 655 714 656 /**
+1 -1
mm/compaction.c
··· 1028 1028
 	if (!TestClearPageLRU(page))
 		goto isolate_fail_put;
 
-	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	lruvec = mem_cgroup_page_lruvec(page);
 
 	/* If we already hold the lock, we can skip some rechecking */
 	if (lruvec != locked) {
+7 -18
mm/debug.c
··· 42 42
 	{0, NULL}
 };
 
-void __dump_page(struct page *page, const char *reason)
+static void __dump_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 	struct address_space *mapping;
-	bool page_poisoned = PagePoisoned(page);
 	bool compound = PageCompound(page);
 	/*
 	 * Accessing the pageblock without the zone lock. It could change to
··· 56 57
 	bool page_cma = is_migrate_cma_page(page);
 	int mapcount;
 	char *type = "";
-
-	/*
-	 * If struct page is poisoned don't access Page*() functions as that
-	 * leads to recursive loop. Page*() check for poisoned pages, and calls
-	 * dump_page() when detected.
-	 */
-	if (page_poisoned) {
-		pr_warn("page:%px is uninitialized and poisoned", page);
-		goto hex_only;
-	}
 
 	if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
 		/*
··· 162 173
 
 	pr_warn("%sflags: %#lx(%pGp)%s\n", type, head->flags, &head->flags,
 		page_cma ? " CMA" : "");
-
-hex_only:
 	print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
 			sizeof(unsigned long), page,
 			sizeof(struct page), false);
··· 169 182
 	print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
 			sizeof(unsigned long), head,
 			sizeof(struct page), false);
-
-	if (reason)
-		pr_warn("page dumped because: %s\n", reason);
 }
 
 void dump_page(struct page *page, const char *reason)
 {
-	__dump_page(page, reason);
+	if (PagePoisoned(page))
+		pr_warn("page:%p is uninitialized and poisoned", page);
+	else
+		__dump_page(page);
+	if (reason)
+		pr_warn("page dumped because: %s\n", reason);
 	dump_page_owner(page);
 }
 EXPORT_SYMBOL(dump_page);
+51 -12
mm/debug_vm_pgtable.c
··· 146 146
 static void __init pmd_basic_tests(unsigned long pfn, int idx)
 {
 	pgprot_t prot = protection_map[idx];
-	pmd_t pmd = pfn_pmd(pfn, prot);
 	unsigned long val = idx, *ptr = &val;
+	pmd_t pmd;
 
 	if (!has_transparent_hugepage())
 		return;
 
 	pr_debug("Validating PMD basic (%pGv)\n", ptr);
+	pmd = pfn_pmd(pfn, prot);
 
 	/*
 	 * This test needs to be executed after the given page table entry
··· 186 185
 				      unsigned long pfn, unsigned long vaddr,
 				      pgprot_t prot, pgtable_t pgtable)
 {
-	pmd_t pmd = pfn_pmd(pfn, prot);
+	pmd_t pmd;
 
 	if (!has_transparent_hugepage())
 		return;
··· 233 232
 
 static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)
 {
-	pmd_t pmd = pfn_pmd(pfn, prot);
+	pmd_t pmd;
+
+	if (!has_transparent_hugepage())
+		return;
 
 	pr_debug("Validating PMD leaf\n");
+	pmd = pfn_pmd(pfn, prot);
+
 	/*
 	 * PMD based THP is a leaf entry.
 	 */
··· 273 267
 
 static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 {
-	pmd_t pmd = pfn_pmd(pfn, prot);
+	pmd_t pmd;
 
 	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
 		return;
 
+	if (!has_transparent_hugepage())
+		return;
+
 	pr_debug("Validating PMD saved write\n");
+	pmd = pfn_pmd(pfn, prot);
 	WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd))));
 	WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd))));
 }
··· 291 281
 static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx)
 {
 	pgprot_t prot = protection_map[idx];
-	pud_t pud = pfn_pud(pfn, prot);
 	unsigned long val = idx, *ptr = &val;
+	pud_t pud;
 
 	if (!has_transparent_hugepage())
 		return;
 
 	pr_debug("Validating PUD basic (%pGv)\n", ptr);
+	pud = pfn_pud(pfn, prot);
 
 	/*
 	 * This test needs to be executed after the given page table entry
··· 334 323
 				      unsigned long pfn, unsigned long vaddr,
 				      pgprot_t prot)
 {
-	pud_t pud = pfn_pud(pfn, prot);
+	pud_t pud;
 
 	if (!has_transparent_hugepage())
 		return;
··· 343 332
 	/* Align the address wrt HPAGE_PUD_SIZE */
 	vaddr &= HPAGE_PUD_MASK;
 
+	pud = pfn_pud(pfn, prot);
 	set_pud_at(mm, vaddr, pudp, pud);
 	pudp_set_wrprotect(mm, vaddr, pudp);
 	pud = READ_ONCE(*pudp);
··· 382 370
 
 static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot)
 {
-	pud_t pud = pfn_pud(pfn, prot);
+	pud_t pud;
+
+	if (!has_transparent_hugepage())
+		return;
 
 	pr_debug("Validating PUD leaf\n");
+	pud = pfn_pud(pfn, prot);
 	/*
 	 * PUD based THP is a leaf entry.
 	 */
··· 670 654
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot)
 {
-	pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
+	pmd_t pmd;
 
 	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
 		return;
 
+	if (!has_transparent_hugepage())
+		return;
+
 	pr_debug("Validating PMD protnone\n");
+	pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
 	WARN_ON(!pmd_protnone(pmd));
 	WARN_ON(!pmd_present(pmd));
 }
··· 699 679
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot)
 {
-	pmd_t pmd = pfn_pmd(pfn, prot);
+	pmd_t pmd;
+
+	if (!has_transparent_hugepage())
+		return;
 
 	pr_debug("Validating PMD devmap\n");
+	pmd = pfn_pmd(pfn, prot);
 	WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot)
 {
-	pud_t pud = pfn_pud(pfn, prot);
+	pud_t pud;
+
+	if (!has_transparent_hugepage())
+		return;
 
 	pr_debug("Validating PUD devmap\n");
+	pud = pfn_pud(pfn, prot);
 	WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
 }
 #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
··· 761 733
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot)
 {
-	pmd_t pmd = pfn_pmd(pfn, prot);
+	pmd_t pmd;
 
 	if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY))
 		return;
 
+	if (!has_transparent_hugepage())
+		return;
+
 	pr_debug("Validating PMD soft dirty\n");
+	pmd = pfn_pmd(pfn, prot);
 	WARN_ON(!pmd_soft_dirty(pmd_mksoft_dirty(pmd)));
 	WARN_ON(pmd_soft_dirty(pmd_clear_soft_dirty(pmd)));
 }
 
 static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot)
 {
-	pmd_t pmd = pfn_pmd(pfn, prot);
+	pmd_t pmd;
 
 	if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) ||
 	    !IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION))
 		return;
 
+	if (!has_transparent_hugepage())
+		return;
+
 	pr_debug("Validating PMD swap soft dirty\n");
+	pmd = pfn_pmd(pfn, prot);
 	WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd)));
 	WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd)));
 }
··· 815 779
 {
 	swp_entry_t swp;
 	pmd_t pmd;
+
+	if (!has_transparent_hugepage())
+		return;
 
 	pr_debug("Validating PMD swap\n");
 	pmd = pfn_pmd(pfn, prot);
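The recurring change in this file fixes the same hazard each time: an initializer like `pmd_t pmd = pfn_pmd(pfn, prot);` in the declaration runs before the `has_transparent_hugepage()` guard, so the helper is evaluated even when the feature is absent. The patched shape declares first, checks the gate, and only then initializes. A hedged userspace sketch of that before/after shape (the feature flag and helper names here are stand-ins, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

static bool feature_present;	/* stand-in for has_transparent_hugepage() */
static int helper_calls;	/* counts evaluations of the "unsafe" helper */

/* Stand-in for pfn_pmd(): must only run when the feature exists. */
static int make_entry(void)
{
	helper_calls++;
	return 42;
}

/* Patched shape: declare, check the capability gate, then initialize. */
static int validate_entry(void)
{
	int entry;

	if (!feature_present)
		return -1;	/* bail out before touching the helper */

	entry = make_entry();
	return entry;
}
```

With the old shape (`int entry = make_entry();` at the declaration), `helper_calls` would increment even on the early-return path, which is exactly what the diff above is eliminating.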
+2 -3
mm/dmapool.c
··· 62 62
 static DEFINE_MUTEX(pools_lock);
 static DEFINE_MUTEX(pools_reg_lock);
 
-static ssize_t
-show_pools(struct device *dev, struct device_attribute *attr, char *buf)
+static ssize_t pools_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	unsigned temp;
 	unsigned size;
··· 102 103
 	return PAGE_SIZE - size;
 }
 
-static DEVICE_ATTR(pools, 0444, show_pools, NULL);
+static DEVICE_ATTR_RO(pools);
 
 /**
  * dma_pool_create - Creates a pool of consistent memory blocks, for dma.
+1 -1
mm/filemap.c
··· 872 872
 	page->index = offset;
 
 	if (!huge) {
-		error = mem_cgroup_charge(page, current->mm, gfp);
+		error = mem_cgroup_charge(page, NULL, gfp);
 		if (error)
 			goto error;
 		charged = true;
+56 -17
mm/gup.c
··· 44 44
 	atomic_sub(refs, compound_pincount_ptr(page));
 }
 
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+#ifdef CONFIG_DEBUG_VM
+	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
+		return;
+#endif
+
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
··· 73 56
 		return NULL;
 	if (unlikely(!page_cache_add_speculative(head, refs)))
 		return NULL;
+
+	/*
+	 * At this point we have a stable reference to the head page; but it
+	 * could be that between the compound_head() lookup and the refcount
+	 * increment, the compound page was split, in which case we'd end up
+	 * holding a reference on a page that has nothing to do with the page
+	 * we were given anymore.
+	 * So now that the head page is stable, recheck that the pages still
+	 * belong together.
+	 */
+	if (unlikely(compound_head(page) != head)) {
+		put_page_refs(head, refs);
+		return NULL;
+	}
+
 	return head;
 }
 
··· 128 96
 		return NULL;
 
 	/*
+	 * CAUTION: Don't use compound_head() on the page before this
+	 * point, the result won't be stable.
+	 */
+	page = try_get_compound_head(page, refs);
+	if (!page)
+		return NULL;
+
+	/*
 	 * When pinning a compound page of order > 1 (which is what
 	 * hpage_pincount_available() checks for), use an exact count to
 	 * track it, via hpage_pincount_add/_sub().
··· 143 103
 	 * However, be sure to *also* increment the normal page refcount
 	 * field at least once, so that the page really is pinned.
 	 */
-	if (!hpage_pincount_available(page))
-		refs *= GUP_PIN_COUNTING_BIAS;
-
-	page = try_get_compound_head(page, refs);
-	if (!page)
-		return NULL;
-
 	if (hpage_pincount_available(page))
 		hpage_pincount_add(page, refs);
+	else
+		page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
 
 	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
 			    orig_refs);
··· 170 135
 		refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
-	/*
-	 * Calling put_page() for each ref is unnecessarily slow. Only the last
-	 * ref needs a put_page().
-	 */
-	if (refs > 1)
-		page_ref_sub(page, refs - 1);
-	put_page(page);
+	put_page_refs(page, refs);
 }
 
 /**
··· 419 391
 	put_compound_head(head, ntails, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_pages);
+
+/*
+ * Set the MMF_HAS_PINNED if not set yet; after set it'll be there for the mm's
+ * lifecycle. Avoid setting the bit unless necessary, or it might cause write
+ * cache bouncing on large SMP machines for concurrent pinned gups.
+ */
+static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
+{
+	if (!test_bit(MMF_HAS_PINNED, mm_flags))
+		set_bit(MMF_HAS_PINNED, mm_flags);
+}
 
 #ifdef CONFIG_MMU
 static struct page *no_page_table(struct vm_area_struct *vma,
··· 1332 1293
 	}
 
 	if (flags & FOLL_PIN)
-		atomic_set(&mm->has_pinned, 1);
+		mm_set_has_pinned_flag(&mm->flags);
 
 	/*
 	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
··· 2653 2614
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
-		atomic_set(&current->mm->has_pinned, 1);
+		mm_set_has_pinned_flag(&current->mm->flags);
 
 	if (!(gup_flags & FOLL_FAST_ONLY))
 		might_lock_read(&current->mm->mmap_lock);
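The new `put_page_refs()` helper encodes one optimization: dropping N references costs one plain subtraction for N-1 of them, and only the final drop goes through the full release path that may free the object. A hedged userspace sketch of that shape with C11 atomics (the names are illustrative; the kernel's `put_page()` does much more than this stub):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int refcount;
static int free_calls;

/* Stand-in for put_page(): drop one ref, "free" on the last one. */
static void put_one(void)
{
	if (atomic_fetch_sub(&refcount, 1) == 1)
		free_calls++;
}

/*
 * Shape of put_page_refs(): a single atomic subtraction covers
 * refs - 1 references, then one real put that may free the object.
 */
static void put_refs(int refs)
{
	assert(atomic_load(&refcount) >= refs);	/* mirrors the VM_WARN_ON */
	if (refs > 1)
		atomic_fetch_sub(&refcount, refs - 1);
	put_one();
}
```

Calling `put_one()` in a loop would be semantically identical but would pay the "is this the last ref?" branch and cacheline traffic N times instead of once.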
+2
mm/hugetlb.c
··· 5938 5938
 		*hugetlb = true;
 		if (HPageFreed(page) || HPageMigratable(page))
 			ret = get_page_unless_zero(page);
+		else
+			ret = -EBUSY;
 	}
 	spin_unlock_irq(&hugetlb_lock);
 	return ret;
+7 -2
mm/internal.h
··· 116 116
 extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 
 /*
+ * in mm/memcontrol.c:
+ */
+extern bool cgroup_memory_nokmem;
+
+/*
  * in mm/page_alloc.c
  */
 
··· 203 198
 			gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
-extern void free_unref_page(struct page *page);
+extern void free_unref_page(struct page *page, unsigned int order);
 extern void free_unref_page_list(struct list_head *list);
 
-extern void zone_pcp_update(struct zone *zone);
+extern void zone_pcp_update(struct zone *zone, int cpu_online);
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
 extern void zone_pcp_enable(struct zone *zone);
+2 -2
mm/kasan/Makefile
··· 37 37
 
 obj-$(CONFIG_KASAN) := common.o report.o
 obj-$(CONFIG_KASAN_GENERIC) += init.o generic.o report_generic.o shadow.o quarantine.o
-obj-$(CONFIG_KASAN_HW_TAGS) += hw_tags.o report_hw_tags.o
-obj-$(CONFIG_KASAN_SW_TAGS) += init.o report_sw_tags.o shadow.o sw_tags.o
+obj-$(CONFIG_KASAN_HW_TAGS) += hw_tags.o report_hw_tags.o tags.o report_tags.o
+obj-$(CONFIG_KASAN_SW_TAGS) += init.o report_sw_tags.o shadow.o sw_tags.o tags.o report_tags.o
+6
mm/kasan/common.c
··· 51 51
 {
 	current->kasan_depth++;
 }
+EXPORT_SYMBOL(kasan_enable_current);
 
 void kasan_disable_current(void)
 {
 	current->kasan_depth--;
 }
+EXPORT_SYMBOL(kasan_disable_current);
+
 #endif /* CONFIG_KASAN_GENERIC || CONFIG_KASAN_SW_TAGS */
 
 void __kasan_unpoison_range(const void *address, size_t size)
··· 330 327
 {
 	u8 tag;
 	void *tagged_object;
+
+	if (!kasan_arch_is_ready())
+		return false;
 
 	tag = get_tag(object);
 	tagged_object = object;
+3
mm/kasan/generic.c
··· 163 163
 					size_t size, bool write,
 					unsigned long ret_ip)
 {
+	if (!kasan_arch_is_ready())
+		return true;
+
 	if (unlikely(size == 0))
 		return true;
 
-22
mm/kasan/hw_tags.c
··· 216 216
 	pr_info("KernelAddressSanitizer initialized\n");
 }
 
-void kasan_set_free_info(struct kmem_cache *cache,
-			 void *object, u8 tag)
-{
-	struct kasan_alloc_meta *alloc_meta;
-
-	alloc_meta = kasan_get_alloc_meta(cache, object);
-	if (alloc_meta)
-		kasan_set_track(&alloc_meta->free_track[0], GFP_NOWAIT);
-}
-
-struct kasan_track *kasan_get_free_track(struct kmem_cache *cache,
-					 void *object, u8 tag)
-{
-	struct kasan_alloc_meta *alloc_meta;
-
-	alloc_meta = kasan_get_alloc_meta(cache, object);
-	if (!alloc_meta)
-		return NULL;
-
-	return &alloc_meta->free_track[0];
-}
-
 void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags)
 {
 	/*
+3 -3
mm/kasan/init.c
··· 41 41
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
 	return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
··· 53 53
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
 	return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
··· 64 64
 	return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE + PTE_HWTABLE_PTRS]
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE + PTE_HWTABLE_PTRS]
 	__page_aligned_bss;
 
 static inline bool kasan_pte_table(pmd_t pmd)
+8 -2
mm/kasan/kasan.h
··· 153 153
 	depot_stack_handle_t stack;
 };
 
-#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
+#if defined(CONFIG_KASAN_TAGS_IDENTIFY) && defined(CONFIG_KASAN_SW_TAGS)
 #define KASAN_NR_FREE_STACKS 5
 #else
 #define KASAN_NR_FREE_STACKS 1
··· 170 170
 #else
 	struct kasan_track free_track[KASAN_NR_FREE_STACKS];
 #endif
-#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
+#ifdef CONFIG_KASAN_TAGS_IDENTIFY
 	u8 free_pointer_tag[KASAN_NR_FREE_STACKS];
 	u8 free_track_idx;
 #endif
··· 448 448
 static inline void kasan_poison_last_granule(const void *address, size_t size) { }
 
 #endif /* CONFIG_KASAN_GENERIC */
+
+#ifndef kasan_arch_is_ready
+static inline bool kasan_arch_is_ready(void)	{ return true; }
+#elif !defined(CONFIG_KASAN_GENERIC) || !defined(CONFIG_KASAN_OUTLINE)
+#error kasan_arch_is_ready only works in KASAN generic outline mode!
+#endif
 
 /*
  * Exported functions for interfaces called from assembly or from generated
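The `kasan_arch_is_ready()` hunk uses a common kernel header idiom: an architecture may `#define` its own gate (and define the symbol to itself so the guard sees it); otherwise the header supplies an always-true inline default, and every call site can be written unconditionally. A minimal sketch of the idiom with illustrative names:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * An "arch header" that wants to override would do, before this point:
 *	static inline bool arch_is_ready(void) { ... }
 *	#define arch_is_ready arch_is_ready
 */
#ifndef arch_is_ready
/* Default: no late-initialized state, so the subsystem is always ready. */
static inline bool arch_is_ready(void) { return true; }
#endif

/* Call sites stay unconditional: the default makes the check free. */
static int checked_access(void)
{
	if (!arch_is_ready())
		return 0;	/* skip work before the arch is set up */
	return 1;		/* do the real check */
}
```

The define-the-macro-to-itself trick lets `#ifndef` distinguish "arch provided an override" from "use the default" without a separate Kconfig symbol.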
+3 -3
mm/kasan/report.c
··· 230 230
 {
 	struct page *page = kasan_addr_to_page(addr);
 
-	dump_stack();
+	dump_stack_lvl(KERN_ERR);
 	pr_err("\n");
 
 	if (page && PageSlab(page)) {
··· 375 375
 	pr_err("BUG: KASAN: invalid-access\n");
 	pr_err("Asynchronous mode enabled: no access details available\n");
 	pr_err("\n");
-	dump_stack();
+	dump_stack_lvl(KERN_ERR);
 	end_report(&flags, 0);
 }
 #endif /* CONFIG_KASAN_HW_TAGS */
··· 420 420
 		pr_err("\n");
 		print_memory_metadata(info.first_bad_addr);
 	} else {
-		dump_stack();
+		dump_stack_lvl(KERN_ERR);
 	}
 
 	end_report(&flags, addr);
-5
mm/kasan/report_hw_tags.c
··· 15 15
 
 #include "kasan.h"
 
-const char *kasan_get_bug_type(struct kasan_access_info *info)
-{
-	return "invalid-access";
-}
-
 void *kasan_find_first_bad_addr(void *addr, size_t size)
 {
 	return kasan_reset_tag(addr);
-43
mm/kasan/report_sw_tags.c
··· 29 29
 #include "kasan.h"
 #include "../slab.h"
 
-const char *kasan_get_bug_type(struct kasan_access_info *info)
-{
-#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
-	struct kasan_alloc_meta *alloc_meta;
-	struct kmem_cache *cache;
-	struct page *page;
-	const void *addr;
-	void *object;
-	u8 tag;
-	int i;
-
-	tag = get_tag(info->access_addr);
-	addr = kasan_reset_tag(info->access_addr);
-	page = kasan_addr_to_page(addr);
-	if (page && PageSlab(page)) {
-		cache = page->slab_cache;
-		object = nearest_obj(cache, page, (void *)addr);
-		alloc_meta = kasan_get_alloc_meta(cache, object);
-
-		if (alloc_meta) {
-			for (i = 0; i < KASAN_NR_FREE_STACKS; i++) {
-				if (alloc_meta->free_pointer_tag[i] == tag)
-					return "use-after-free";
-			}
-		}
-		return "out-of-bounds";
-	}
-
-#endif
-	/*
-	 * If access_size is a negative number, then it has reason to be
-	 * defined as out-of-bounds bug type.
-	 *
-	 * Casting negative numbers to size_t would indeed turn up as
-	 * a large size_t and its value will be larger than ULONG_MAX/2,
-	 * so that this can qualify as out-of-bounds.
-	 */
-	if (info->access_addr + info->access_size < info->access_addr)
-		return "out-of-bounds";
-
-	return "invalid-access";
-}
-
 void *kasan_find_first_bad_addr(void *addr, size_t size)
 {
 	u8 tag = get_tag(addr);
+51
mm/kasan/report_tags.c
···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2014 Samsung Electronics Co., Ltd.
+ * Copyright (c) 2020 Google, Inc.
+ */
+
+#include "kasan.h"
+#include "../slab.h"
+
+const char *kasan_get_bug_type(struct kasan_access_info *info)
+{
+#ifdef CONFIG_KASAN_TAGS_IDENTIFY
+	struct kasan_alloc_meta *alloc_meta;
+	struct kmem_cache *cache;
+	struct page *page;
+	const void *addr;
+	void *object;
+	u8 tag;
+	int i;
+
+	tag = get_tag(info->access_addr);
+	addr = kasan_reset_tag(info->access_addr);
+	page = kasan_addr_to_page(addr);
+	if (page && PageSlab(page)) {
+		cache = page->slab_cache;
+		object = nearest_obj(cache, page, (void *)addr);
+		alloc_meta = kasan_get_alloc_meta(cache, object);
+
+		if (alloc_meta) {
+			for (i = 0; i < KASAN_NR_FREE_STACKS; i++) {
+				if (alloc_meta->free_pointer_tag[i] == tag)
+					return "use-after-free";
+			}
+		}
+		return "out-of-bounds";
+	}
+#endif
+
+	/*
+	 * If access_size is a negative number, then it has reason to be
+	 * defined as out-of-bounds bug type.
+	 *
+	 * Casting negative numbers to size_t would indeed turn up as
+	 * a large size_t and its value will be larger than ULONG_MAX/2,
+	 * so that this can qualify as out-of-bounds.
+	 */
+	if (info->access_addr + info->access_size < info->access_addr)
+		return "out-of-bounds";
+
+	return "invalid-access";
+}
+6
mm/kasan/shadow.c
··· 73 73
 {
 	void *shadow_start, *shadow_end;
 
+	if (!kasan_arch_is_ready())
+		return;
+
 	/*
 	 * Perform shadow offset calculation based on untagged address, as
 	 * some of the callers (e.g. kasan_poison_object_data) pass tagged
··· 102 99
 #ifdef CONFIG_KASAN_GENERIC
 void kasan_poison_last_granule(const void *addr, size_t size)
 {
+	if (!kasan_arch_is_ready())
+		return;
+
 	if (size & KASAN_GRANULE_MASK) {
 		u8 *shadow = (u8 *)kasan_mem_to_shadow(addr + size);
 		*shadow = size & KASAN_GRANULE_MASK;
-41
mm/kasan/sw_tags.c
··· 167 167
 }
 EXPORT_SYMBOL(__hwasan_tag_memory);
 
-void kasan_set_free_info(struct kmem_cache *cache,
-			 void *object, u8 tag)
-{
-	struct kasan_alloc_meta *alloc_meta;
-	u8 idx = 0;
-
-	alloc_meta = kasan_get_alloc_meta(cache, object);
-	if (!alloc_meta)
-		return;
-
-#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
-	idx = alloc_meta->free_track_idx;
-	alloc_meta->free_pointer_tag[idx] = tag;
-	alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS;
-#endif
-
-	kasan_set_track(&alloc_meta->free_track[idx], GFP_NOWAIT);
-}
-
-struct kasan_track *kasan_get_free_track(struct kmem_cache *cache,
-					 void *object, u8 tag)
-{
-	struct kasan_alloc_meta *alloc_meta;
-	int i = 0;
-
-	alloc_meta = kasan_get_alloc_meta(cache, object);
-	if (!alloc_meta)
-		return NULL;
-
-#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
-	for (i = 0; i < KASAN_NR_FREE_STACKS; i++) {
-		if (alloc_meta->free_pointer_tag[i] == tag)
-			break;
-	}
-	if (i == KASAN_NR_FREE_STACKS)
-		i = alloc_meta->free_track_idx;
-#endif
-
-	return &alloc_meta->free_track[i];
-}
-
 void kasan_tag_mismatch(unsigned long addr, unsigned long access_info,
 			unsigned long ret_ip)
 {
+59
mm/kasan/tags.c
···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file contains common tag-based KASAN code.
+ *
+ * Copyright (c) 2018 Google, Inc.
+ * Copyright (c) 2020 Google, Inc.
+ */
+
+#include <linux/init.h>
+#include <linux/kasan.h>
+#include <linux/kernel.h>
+#include <linux/memory.h>
+#include <linux/mm.h>
+#include <linux/static_key.h>
+#include <linux/string.h>
+#include <linux/types.h>
+
+#include "kasan.h"
+
+void kasan_set_free_info(struct kmem_cache *cache,
+			 void *object, u8 tag)
+{
+	struct kasan_alloc_meta *alloc_meta;
+	u8 idx = 0;
+
+	alloc_meta = kasan_get_alloc_meta(cache, object);
+	if (!alloc_meta)
+		return;
+
+#ifdef CONFIG_KASAN_TAGS_IDENTIFY
+	idx = alloc_meta->free_track_idx;
+	alloc_meta->free_pointer_tag[idx] = tag;
+	alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS;
+#endif
+
+	kasan_set_track(&alloc_meta->free_track[idx], GFP_NOWAIT);
+}
+
+struct kasan_track *kasan_get_free_track(struct kmem_cache *cache,
+					 void *object, u8 tag)
+{
+	struct kasan_alloc_meta *alloc_meta;
+	int i = 0;
+
+	alloc_meta = kasan_get_alloc_meta(cache, object);
+	if (!alloc_meta)
+		return NULL;
+
+#ifdef CONFIG_KASAN_TAGS_IDENTIFY
+	for (i = 0; i < KASAN_NR_FREE_STACKS; i++) {
+		if (alloc_meta->free_pointer_tag[i] == tag)
+			break;
+	}
+	if (i == KASAN_NR_FREE_STACKS)
+		i = alloc_meta->free_track_idx;
+#endif
+
+	return &alloc_meta->free_track[i];
+}
+3 -2
mm/kfence/kfence_test.c
··· 197 197
 
 static inline size_t kmalloc_cache_alignment(size_t size)
 {
-	return kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)]->align;
+	return kmalloc_caches[kmalloc_type(GFP_KERNEL)][__kmalloc_index(size, false)]->align;
 }
 
 /* Must always inline to match stack trace against caller. */
··· 267 267
 
 	if (is_kfence_address(alloc)) {
 		struct page *page = virt_to_head_page(alloc);
-		struct kmem_cache *s = test_cache ?: kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)];
+		struct kmem_cache *s = test_cache ?:
+				kmalloc_caches[kmalloc_type(GFP_KERNEL)][__kmalloc_index(size, false)];
 
 		/*
 		 * Verify that various helpers return the right values
+12 -6
mm/kmemleak.c
··· 219 219
 static unsigned long jiffies_min_age;
 static unsigned long jiffies_last_scan;
 /* delay between automatic memory scannings */
-static signed long jiffies_scan_wait;
+static unsigned long jiffies_scan_wait;
 /* enables or disables the task stacks scanning */
 static int kmemleak_stack_scan = 1;
 /* protects the memory scanning, parameters and debug/kmemleak file access */
··· 1567 1567
 	}
 
 	while (!kthread_should_stop()) {
-		signed long timeout = jiffies_scan_wait;
+		signed long timeout = READ_ONCE(jiffies_scan_wait);
 
 		mutex_lock(&scan_mutex);
 		kmemleak_scan();
··· 1807 1807
 	else if (strncmp(buf, "scan=off", 8) == 0)
 		stop_scan_thread();
 	else if (strncmp(buf, "scan=", 5) == 0) {
-		unsigned long secs;
+		unsigned secs;
+		unsigned long msecs;
 
-		ret = kstrtoul(buf + 5, 0, &secs);
+		ret = kstrtouint(buf + 5, 0, &secs);
 		if (ret < 0)
 			goto out;
+
+		msecs = secs * MSEC_PER_SEC;
+		if (msecs > UINT_MAX)
+			msecs = UINT_MAX;
+
 		stop_scan_thread();
-		if (secs) {
-			jiffies_scan_wait = msecs_to_jiffies(secs * 1000);
+		if (msecs) {
+			WRITE_ONCE(jiffies_scan_wait, msecs_to_jiffies(msecs));
 			start_scan_thread();
 		}
 	} else if (strncmp(buf, "scan", 4) == 0)
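The kmemleak hunk fixes two things: torn reads of the shared wait value (hence `READ_ONCE`/`WRITE_ONCE`) and an overflow in `secs * 1000`. The overflow half works by widening before the multiply and clamping before the narrowing consumer (`msecs_to_jiffies()` takes an `unsigned int`). A userspace sketch of that arithmetic, assuming an LP64 target where `unsigned long` is 64-bit, as in the kernel code above:

```c
#include <limits.h>

#define MSEC_PER_SEC 1000UL

/*
 * Widen before multiplying, clamp before narrowing: the product is
 * computed in 64-bit unsigned long, then capped at UINT_MAX so a
 * 32-bit consumer never sees a wrapped value.
 */
static unsigned int secs_to_clamped_msecs(unsigned int secs)
{
	unsigned long msecs = (unsigned long)secs * MSEC_PER_SEC;

	if (msecs > UINT_MAX)
		msecs = UINT_MAX;
	return (unsigned int)msecs;
}
```

Without the widening cast, `secs * 1000` would be evaluated in 32 bits and any input above about 4.29 million seconds would silently wrap instead of saturating.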
+2 -4
mm/ksm.c
··· 521 521
 	struct vm_area_struct *vma;
 	if (ksm_test_exit(mm))
 		return NULL;
-	vma = find_vma(mm, addr);
-	if (!vma || vma->vm_start > addr)
-		return NULL;
-	if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+	vma = vma_lookup(mm, addr);
+	if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 		return NULL;
 	return vma;
 }
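The ksm hunk is one of many call sites converted to `vma_lookup()`, which is `find_vma()` plus the `vma->vm_start > addr` bounds check that callers kept open-coding: `find_vma()` returns the first VMA *ending* above the address, even if the address falls in a gap before it. A hedged userspace sketch of the two semantics over a sorted array of half-open ranges (illustrative types, not the kernel's):

```c
#include <stddef.h>

struct range {
	unsigned long start, end;	/* [start, end), like a VMA */
};

/* find_vma()-style: first range ending above addr, even across a gap. */
static const struct range *find_range(const struct range *r, size_t n,
				      unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < r[i].end)
			return &r[i];
	return NULL;
}

/* vma_lookup()-style: only a range that actually contains addr. */
static const struct range *lookup_range(const struct range *r, size_t n,
					unsigned long addr)
{
	const struct range *hit = find_range(r, n, addr);

	if (hit && hit->start > addr)
		return NULL;	/* addr sits in a gap before hit */
	return hit;
}
```

The helper lets callers that only care about "is this address mapped here" drop the easy-to-forget gap check.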
+4 -4
mm/memblock.c
··· 92 92
  * system initialization completes.
  */
 
-#ifndef CONFIG_NEED_MULTIPLE_NODES
+#ifndef CONFIG_NUMA
 struct pglist_data __refdata contig_page_data;
 EXPORT_SYMBOL(contig_page_data);
 #endif
··· 607 607
 	 * area, insert that portion.
 	 */
 	if (rbase > base) {
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
 		WARN_ON(nid != memblock_get_region_node(rgn));
 #endif
 		WARN_ON(flags != rgn->flags);
··· 1205 1205
 int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
 				      struct memblock_type *type, int nid)
 {
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
 	int start_rgn, end_rgn;
 	int i, ret;
 
··· 1849 1849
 		size = rgn->size;
 		end = base + size - 1;
 		flags = rgn->flags;
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
 		if (memblock_get_region_node(rgn) != MAX_NUMNODES)
 			snprintf(nid_buf, sizeof(nid_buf), " on node %d",
 				 memblock_get_region_node(rgn));
+276 -89
mm/memcontrol.c
··· 78 78
 
 /* Active memory cgroup to use from an interrupt context */
 DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
 
 /* Socket memory accounting disabled? */
 static bool cgroup_memory_nosocket;
 
 /* Kernel memory accounting disabled? */
-static bool cgroup_memory_nokmem;
+bool cgroup_memory_nokmem;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
··· 262 261
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
 	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
-	struct mem_cgroup *memcg;
 	unsigned int nr_bytes;
 	unsigned int nr_pages;
 	unsigned long flags;
··· 290 290
 	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
 	nr_pages = nr_bytes >> PAGE_SHIFT;
 
-	spin_lock_irqsave(&css_set_lock, flags);
-	memcg = obj_cgroup_memcg(objcg);
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
+
+	spin_lock_irqsave(&css_set_lock, flags);
 	list_del(&objcg->list);
-	mem_cgroup_put(memcg);
 	spin_unlock_irqrestore(&css_set_lock, flags);
 
 	percpu_ref_exit(ref);
··· 329 330
 
 	spin_lock_irq(&css_set_lock);
 
-	/* Move active objcg to the parent's list */
-	xchg(&objcg->memcg, parent);
-	css_get(&parent->css);
-	list_add(&objcg->list, &parent->objcg_list);
-
-	/* Move already reparented objcgs to the parent's list */
-	list_for_each_entry(iter, &memcg->objcg_list, list) {
-		css_get(&parent->css);
-		xchg(&iter->memcg, parent);
-		css_put(&memcg->css);
-	}
+	/* 1) Ready to reparent active objcg. */
+	list_add(&objcg->list, &memcg->objcg_list);
+	/* 2) Reparent active objcg and already reparented objcgs to parent. */
+	list_for_each_entry(iter, &memcg->objcg_list, list)
+		WRITE_ONCE(iter->memcg, parent);
+	/* 3) Move already reparented objcgs to the parent's list */
 	list_splice(&memcg->objcg_list, &parent->objcg_list);
 
 	spin_unlock_irq(&css_set_lock);
··· 776 782
 	rcu_read_unlock();
 }
 
+/*
+ * mod_objcg_mlstate() may be called with irq enabled, so
+ * mod_memcg_lruvec_state() should be used.
+ */
+static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
+				     struct pglist_data *pgdat,
+				     enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	mod_memcg_lruvec_state(lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 /**
  * __count_memcg_events - account VM events in a cgroup
  * @memcg: the memory cgroup
··· 898 886
 }
 EXPORT_SYMBOL(mem_cgroup_from_task);
 
+static __always_inline struct mem_cgroup *active_memcg(void)
+{
+	if (in_interrupt())
+		return this_cpu_read(int_active_memcg);
+	else
+		return current->active_memcg;
+}
+
 /**
  * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg.
  * @mm: mm from which memcg should be extracted. It can be NULL.
  *
- * Obtain a reference on mm->memcg and returns it if successful. Otherwise
- * root_mem_cgroup is returned. However if mem_cgroup is disabled, NULL is
- * returned.
+ * Obtain a reference on mm->memcg and returns it if successful. If mm
+ * is NULL, then the memcg is chosen as follows:
+ * 1) The active memcg, if set.
+ * 2) current->mm->memcg, if available
+ * 3) root memcg
+ * If mem_cgroup is disabled, NULL is returned.
 */
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
··· 924 901
 	if (mem_cgroup_disabled())
 		return NULL;
 
+	/*
+	 * Page cache insertions can happen without an
+	 * actual mm context, e.g. during disk probing
+	 * on boot, loopback IO, acct() writes etc.
+	 *
+	 * No need to css_get on root memcg as the reference
+	 * counting is disabled on the root level in the
+	 * cgroup core. See CSS_NO_REF.
+	 */
+	if (unlikely(!mm)) {
+		memcg = active_memcg();
+		if (unlikely(memcg)) {
+			/* remote memcg must hold a ref */
+			css_get(&memcg->css);
+			return memcg;
+		}
+		mm = current->mm;
+		if (unlikely(!mm))
+			return root_mem_cgroup;
+	}
+
 	rcu_read_lock();
 	do {
-		/*
-		 * Page cache insertions can happen without an
-		 * actual mm context, e.g. during disk probing
-		 * on boot, loopback IO, acct() writes etc.
-		 */
-		if (unlikely(!mm))
+		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!memcg))
 			memcg = root_mem_cgroup;
-		else {
-			memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
-			if (unlikely(!memcg))
-				memcg = root_mem_cgroup;
-		}
 	} while (!css_tryget(&memcg->css));
 	rcu_read_unlock();
 	return memcg;
 }
 EXPORT_SYMBOL(get_mem_cgroup_from_mm);
-
-static __always_inline struct mem_cgroup *active_memcg(void)
-{
-	if (in_interrupt())
-		return this_cpu_read(int_active_memcg);
-	else
-		return current->active_memcg;
-}
 
 static __always_inline bool memcg_kmem_bypass(void)
 {
··· 1205 1178
 struct lruvec *lock_page_lruvec(struct page *page)
 {
 	struct lruvec *lruvec;
-	struct pglist_data *pgdat = page_pgdat(page);
 
-	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	lruvec = mem_cgroup_page_lruvec(page);
 	spin_lock(&lruvec->lru_lock);
 
 	lruvec_memcg_debug(lruvec, page);
··· 1217 1191
 struct lruvec *lock_page_lruvec_irq(struct page *page)
 {
 	struct lruvec *lruvec;
-	struct pglist_data *pgdat = page_pgdat(page);
 
-	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	lruvec = mem_cgroup_page_lruvec(page);
 	spin_lock_irq(&lruvec->lru_lock);
 
 	lruvec_memcg_debug(lruvec, page);
··· 1229 1204
 struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
 {
 	struct lruvec *lruvec;
-	struct pglist_data *pgdat = page_pgdat(page);
 
-	lruvec = mem_cgroup_page_lruvec(page, pgdat);
+	lruvec = mem_cgroup_page_lruvec(page);
 	spin_lock_irqsave(&lruvec->lru_lock, *flags);
 
 	lruvec_memcg_debug(lruvec, page);
··· 2064 2040
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
+struct obj_stock {
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
+	unsigned int nr_bytes;
+	int nr_slab_reclaimable_b;
+	int nr_slab_unreclaimable_b;
+#else
+	int dummy[0];
+#endif
+};
+
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-
-#ifdef CONFIG_MEMCG_KMEM
-	struct obj_cgroup *cached_objcg;
-	unsigned int nr_bytes;
-#endif
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
··· 2090 2057
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
2096 2063 2097 2064 #else 2098 - static inline void drain_obj_stock(struct memcg_stock_pcp *stock) 2065 + static inline void drain_obj_stock(struct obj_stock *stock) 2099 2066 { 2100 2067 } 2101 2068 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, ··· 2104 2071 return false; 2105 2072 } 2106 2073 #endif 2074 + 2075 + /* 2076 + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable 2077 + * sequence used in this case to access content from object stock is slow. 2078 + * To optimize for user context access, there are now two object stocks for 2079 + * task context and interrupt context access respectively. 2080 + * 2081 + * The task context object stock can be accessed by disabling preemption only 2082 + * which is cheap in non-preempt kernel. The interrupt context object stock 2083 + * can only be accessed after disabling interrupt. User context code can 2084 + * access interrupt object stock, but not vice versa. 2085 + */ 2086 + static inline struct obj_stock *get_obj_stock(unsigned long *pflags) 2087 + { 2088 + struct memcg_stock_pcp *stock; 2089 + 2090 + if (likely(in_task())) { 2091 + *pflags = 0UL; 2092 + preempt_disable(); 2093 + stock = this_cpu_ptr(&memcg_stock); 2094 + return &stock->task_obj; 2095 + } 2096 + 2097 + local_irq_save(*pflags); 2098 + stock = this_cpu_ptr(&memcg_stock); 2099 + return &stock->irq_obj; 2100 + } 2101 + 2102 + static inline void put_obj_stock(unsigned long flags) 2103 + { 2104 + if (likely(in_task())) 2105 + preempt_enable(); 2106 + else 2107 + local_irq_restore(flags); 2108 + } 2107 2109 2108 2110 /** 2109 2111 * consume_stock: Try to consume stocked charge on this cpu. 
··· 2206 2138 local_irq_save(flags); 2207 2139 2208 2140 stock = this_cpu_ptr(&memcg_stock); 2209 - drain_obj_stock(stock); 2141 + drain_obj_stock(&stock->irq_obj); 2142 + if (in_task()) 2143 + drain_obj_stock(&stock->task_obj); 2210 2144 drain_stock(stock); 2211 2145 clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); 2212 2146 ··· 2574 2504 css_put(&memcg->css); 2575 2505 } 2576 2506 2577 - static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, 2578 - unsigned int nr_pages) 2507 + static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, 2508 + unsigned int nr_pages) 2579 2509 { 2580 2510 unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages); 2581 2511 int nr_retries = MAX_RECLAIM_RETRIES; ··· 2587 2517 bool drained = false; 2588 2518 unsigned long pflags; 2589 2519 2590 - if (mem_cgroup_is_root(memcg)) 2591 - return 0; 2592 2520 retry: 2593 2521 if (consume_stock(memcg, nr_pages)) 2594 2522 return 0; ··· 2766 2698 return 0; 2767 2699 } 2768 2700 2701 + static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, 2702 + unsigned int nr_pages) 2703 + { 2704 + if (mem_cgroup_is_root(memcg)) 2705 + return 0; 2706 + 2707 + return try_charge_memcg(memcg, gfp_mask, nr_pages); 2708 + } 2709 + 2769 2710 #if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU) 2770 2711 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) 2771 2712 { ··· 2816 2739 } 2817 2740 2818 2741 #ifdef CONFIG_MEMCG_KMEM 2742 + /* 2743 + * The allocated objcg pointers array is not accounted directly. 2744 + * Moreover, it should not come from DMA buffer and is not readily 2745 + * reclaimable. So those GFP bits should be masked off. 
2746 + */ 2747 + #define OBJCGS_CLEAR_MASK (__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT) 2748 + 2819 2749 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, 2820 2750 gfp_t gfp, bool new_page) 2821 2751 { ··· 2830 2746 unsigned long memcg_data; 2831 2747 void *vec; 2832 2748 2749 + gfp &= ~OBJCGS_CLEAR_MASK; 2833 2750 vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp, 2834 2751 page_to_nid(page)); 2835 2752 if (!vec) ··· 3010 2925 3011 2926 memcg = get_mem_cgroup_from_objcg(objcg); 3012 2927 3013 - ret = try_charge(memcg, gfp, nr_pages); 2928 + ret = try_charge_memcg(memcg, gfp, nr_pages); 3014 2929 if (ret) 3015 2930 goto out; 3016 2931 ··· 3080 2995 obj_cgroup_put(objcg); 3081 2996 } 3082 2997 2998 + void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, 2999 + enum node_stat_item idx, int nr) 3000 + { 3001 + unsigned long flags; 3002 + struct obj_stock *stock = get_obj_stock(&flags); 3003 + int *bytes; 3004 + 3005 + /* 3006 + * Save vmstat data in stock and skip vmstat array update unless 3007 + * accumulating over a page of vmstat data or when pgdat or idx 3008 + * changes. 3009 + */ 3010 + if (stock->cached_objcg != objcg) { 3011 + drain_obj_stock(stock); 3012 + obj_cgroup_get(objcg); 3013 + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) 3014 + ? 
atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; 3015 + stock->cached_objcg = objcg; 3016 + stock->cached_pgdat = pgdat; 3017 + } else if (stock->cached_pgdat != pgdat) { 3018 + /* Flush the existing cached vmstat data */ 3019 + if (stock->nr_slab_reclaimable_b) { 3020 + mod_objcg_mlstate(objcg, pgdat, NR_SLAB_RECLAIMABLE_B, 3021 + stock->nr_slab_reclaimable_b); 3022 + stock->nr_slab_reclaimable_b = 0; 3023 + } 3024 + if (stock->nr_slab_unreclaimable_b) { 3025 + mod_objcg_mlstate(objcg, pgdat, NR_SLAB_UNRECLAIMABLE_B, 3026 + stock->nr_slab_unreclaimable_b); 3027 + stock->nr_slab_unreclaimable_b = 0; 3028 + } 3029 + stock->cached_pgdat = pgdat; 3030 + } 3031 + 3032 + bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? &stock->nr_slab_reclaimable_b 3033 + : &stock->nr_slab_unreclaimable_b; 3034 + /* 3035 + * Even for large object >= PAGE_SIZE, the vmstat data will still be 3036 + * cached locally at least once before pushing it out. 3037 + */ 3038 + if (!*bytes) { 3039 + *bytes = nr; 3040 + nr = 0; 3041 + } else { 3042 + *bytes += nr; 3043 + if (abs(*bytes) > PAGE_SIZE) { 3044 + nr = *bytes; 3045 + *bytes = 0; 3046 + } else { 3047 + nr = 0; 3048 + } 3049 + } 3050 + if (nr) 3051 + mod_objcg_mlstate(objcg, pgdat, idx, nr); 3052 + 3053 + put_obj_stock(flags); 3054 + } 3055 + 3083 3056 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) 3084 3057 { 3085 - struct memcg_stock_pcp *stock; 3086 3058 unsigned long flags; 3059 + struct obj_stock *stock = get_obj_stock(&flags); 3087 3060 bool ret = false; 3088 3061 3089 - local_irq_save(flags); 3090 - 3091 - stock = this_cpu_ptr(&memcg_stock); 3092 3062 if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) { 3093 3063 stock->nr_bytes -= nr_bytes; 3094 3064 ret = true; 3095 3065 } 3096 3066 3097 - local_irq_restore(flags); 3067 + put_obj_stock(flags); 3098 3068 3099 3069 return ret; 3100 3070 } 3101 3071 3102 - static void drain_obj_stock(struct memcg_stock_pcp *stock) 3072 + static void 
drain_obj_stock(struct obj_stock *stock) 3103 3073 { 3104 3074 struct obj_cgroup *old = stock->cached_objcg; 3105 3075 ··· 3182 3042 stock->nr_bytes = 0; 3183 3043 } 3184 3044 3045 + /* 3046 + * Flush the vmstat data in current stock 3047 + */ 3048 + if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) { 3049 + if (stock->nr_slab_reclaimable_b) { 3050 + mod_objcg_mlstate(old, stock->cached_pgdat, 3051 + NR_SLAB_RECLAIMABLE_B, 3052 + stock->nr_slab_reclaimable_b); 3053 + stock->nr_slab_reclaimable_b = 0; 3054 + } 3055 + if (stock->nr_slab_unreclaimable_b) { 3056 + mod_objcg_mlstate(old, stock->cached_pgdat, 3057 + NR_SLAB_UNRECLAIMABLE_B, 3058 + stock->nr_slab_unreclaimable_b); 3059 + stock->nr_slab_unreclaimable_b = 0; 3060 + } 3061 + stock->cached_pgdat = NULL; 3062 + } 3063 + 3185 3064 obj_cgroup_put(old); 3186 3065 stock->cached_objcg = NULL; 3187 3066 } ··· 3210 3051 { 3211 3052 struct mem_cgroup *memcg; 3212 3053 3213 - if (stock->cached_objcg) { 3214 - memcg = obj_cgroup_memcg(stock->cached_objcg); 3054 + if (in_task() && stock->task_obj.cached_objcg) { 3055 + memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg); 3056 + if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) 3057 + return true; 3058 + } 3059 + if (stock->irq_obj.cached_objcg) { 3060 + memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg); 3215 3061 if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) 3216 3062 return true; 3217 3063 } ··· 3224 3060 return false; 3225 3061 } 3226 3062 3227 - static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) 3063 + static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, 3064 + bool allow_uncharge) 3228 3065 { 3229 - struct memcg_stock_pcp *stock; 3230 3066 unsigned long flags; 3067 + struct obj_stock *stock = get_obj_stock(&flags); 3068 + unsigned int nr_pages = 0; 3231 3069 3232 - local_irq_save(flags); 3233 - 3234 - stock = this_cpu_ptr(&memcg_stock); 3235 3070 if (stock->cached_objcg 
!= objcg) { /* reset if necessary */ 3236 3071 drain_obj_stock(stock); 3237 3072 obj_cgroup_get(objcg); 3238 3073 stock->cached_objcg = objcg; 3239 - stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0); 3074 + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) 3075 + ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; 3076 + allow_uncharge = true; /* Allow uncharge when objcg changes */ 3240 3077 } 3241 3078 stock->nr_bytes += nr_bytes; 3242 3079 3243 - if (stock->nr_bytes > PAGE_SIZE) 3244 - drain_obj_stock(stock); 3080 + if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) { 3081 + nr_pages = stock->nr_bytes >> PAGE_SHIFT; 3082 + stock->nr_bytes &= (PAGE_SIZE - 1); 3083 + } 3245 3084 3246 - local_irq_restore(flags); 3085 + put_obj_stock(flags); 3086 + 3087 + if (nr_pages) 3088 + obj_cgroup_uncharge_pages(objcg, nr_pages); 3247 3089 } 3248 3090 3249 3091 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) ··· 3261 3091 return 0; 3262 3092 3263 3093 /* 3264 - * In theory, memcg->nr_charged_bytes can have enough 3094 + * In theory, objcg->nr_charged_bytes can have enough 3265 3095 * pre-charged bytes to satisfy the allocation. However, 3266 - * flushing memcg->nr_charged_bytes requires two atomic 3267 - * operations, and memcg->nr_charged_bytes can't be big, 3268 - * so it's better to ignore it and try grab some new pages. 3269 - * memcg->nr_charged_bytes will be flushed in 3270 - * refill_obj_stock(), called from this function or 3271 - * independently later. 3096 + * flushing objcg->nr_charged_bytes requires two atomic 3097 + * operations, and objcg->nr_charged_bytes can't be big. 3098 + * The shared objcg->nr_charged_bytes can also become a 3099 + * performance bottleneck if all tasks of the same memcg are 3100 + * trying to update it. So it's better to ignore it and try 3101 + * grab some new pages. The stock's nr_bytes will be flushed to 3102 + * objcg->nr_charged_bytes later on when objcg changes. 
3103 + * 3104 + * The stock's nr_bytes may contain enough pre-charged bytes 3105 + * to allow one less page from being charged, but we can't rely 3106 + * on the pre-charged bytes not being changed outside of 3107 + * consume_obj_stock() or refill_obj_stock(). So ignore those 3108 + * pre-charged bytes as well when charging pages. To avoid a 3109 + * page uncharge right after a page charge, we set the 3110 + * allow_uncharge flag to false when calling refill_obj_stock() 3111 + * to temporarily allow the pre-charged bytes to exceed the page 3112 + * size limit. The maximum reachable value of the pre-charged 3113 + * bytes is (sizeof(object) + PAGE_SIZE - 2) if there is no data 3114 + * race. 3272 3115 */ 3273 3116 nr_pages = size >> PAGE_SHIFT; 3274 3117 nr_bytes = size & (PAGE_SIZE - 1); ··· 3291 3108 3292 3109 ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages); 3293 3110 if (!ret && nr_bytes) 3294 - refill_obj_stock(objcg, PAGE_SIZE - nr_bytes); 3111 + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, false); 3295 3112 3296 3113 return ret; 3297 3114 } 3298 3115 3299 3116 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) 3300 3117 { 3301 - refill_obj_stock(objcg, size); 3118 + refill_obj_stock(objcg, size, true); 3302 3119 } 3303 3120 3304 3121 #endif /* CONFIG_MEMCG_KMEM */ ··· 6724 6541 * @gfp_mask: reclaim mode 6725 6542 * 6726 6543 * Try to charge @page to the memcg that @mm belongs to, reclaiming 6727 - * pages according to @gfp_mask if necessary. 6544 + * pages according to @gfp_mask if necessary. if @mm is NULL, try to 6545 + * charge to the active memcg. 6728 6546 * 6729 6547 * Do not use this for pages allocated for swapin. 
6730 6548 * ··· 6855 6671 unsigned long nr_pages; 6856 6672 struct mem_cgroup *memcg; 6857 6673 struct obj_cgroup *objcg; 6674 + bool use_objcg = PageMemcgKmem(page); 6858 6675 6859 6676 VM_BUG_ON_PAGE(PageLRU(page), page); 6860 6677 ··· 6864 6679 * page memcg or objcg at this point, we have fully 6865 6680 * exclusive access to the page. 6866 6681 */ 6867 - if (PageMemcgKmem(page)) { 6682 + if (use_objcg) { 6868 6683 objcg = __page_objcg(page); 6869 6684 /* 6870 6685 * This get matches the put at the end of the function and ··· 6892 6707 6893 6708 nr_pages = compound_nr(page); 6894 6709 6895 - if (PageMemcgKmem(page)) { 6710 + if (use_objcg) { 6896 6711 ug->nr_memory += nr_pages; 6897 6712 ug->nr_kmem += nr_pages; 6898 6713 ··· 6991 6806 /* Force-charge the new page. The old one will be freed soon */ 6992 6807 nr_pages = thp_nr_pages(newpage); 6993 6808 6994 - page_counter_charge(&memcg->memory, nr_pages); 6995 - if (do_memsw_account()) 6996 - page_counter_charge(&memcg->memsw, nr_pages); 6809 + if (!mem_cgroup_is_root(memcg)) { 6810 + page_counter_charge(&memcg->memory, nr_pages); 6811 + if (do_memsw_account()) 6812 + page_counter_charge(&memcg->memsw, nr_pages); 6813 + } 6997 6814 6998 6815 css_get(&memcg->css); 6999 6816 commit_charge(newpage, memcg);
+261 -89
mm/memory-failure.c
··· 56 56 #include <linux/kfifo.h> 57 57 #include <linux/ratelimit.h> 58 58 #include <linux/page-isolation.h> 59 + #include <linux/pagewalk.h> 59 60 #include "internal.h" 60 61 #include "ras/ras_event.h" 61 62 ··· 555 554 collect_procs_file(page, tokill, force_early); 556 555 } 557 556 557 + struct hwp_walk { 558 + struct to_kill tk; 559 + unsigned long pfn; 560 + int flags; 561 + }; 562 + 563 + static void set_to_kill(struct to_kill *tk, unsigned long addr, short shift) 564 + { 565 + tk->addr = addr; 566 + tk->size_shift = shift; 567 + } 568 + 569 + static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift, 570 + unsigned long poisoned_pfn, struct to_kill *tk) 571 + { 572 + unsigned long pfn = 0; 573 + 574 + if (pte_present(pte)) { 575 + pfn = pte_pfn(pte); 576 + } else { 577 + swp_entry_t swp = pte_to_swp_entry(pte); 578 + 579 + if (is_hwpoison_entry(swp)) 580 + pfn = hwpoison_entry_to_pfn(swp); 581 + } 582 + 583 + if (!pfn || pfn != poisoned_pfn) 584 + return 0; 585 + 586 + set_to_kill(tk, addr, shift); 587 + return 1; 588 + } 589 + 590 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 591 + static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr, 592 + struct hwp_walk *hwp) 593 + { 594 + pmd_t pmd = *pmdp; 595 + unsigned long pfn; 596 + unsigned long hwpoison_vaddr; 597 + 598 + if (!pmd_present(pmd)) 599 + return 0; 600 + pfn = pmd_pfn(pmd); 601 + if (pfn <= hwp->pfn && hwp->pfn < pfn + HPAGE_PMD_NR) { 602 + hwpoison_vaddr = addr + ((hwp->pfn - pfn) << PAGE_SHIFT); 603 + set_to_kill(&hwp->tk, hwpoison_vaddr, PAGE_SHIFT); 604 + return 1; 605 + } 606 + return 0; 607 + } 608 + #else 609 + static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr, 610 + struct hwp_walk *hwp) 611 + { 612 + return 0; 613 + } 614 + #endif 615 + 616 + static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, 617 + unsigned long end, struct mm_walk *walk) 618 + { 619 + struct hwp_walk *hwp = (struct hwp_walk *)walk->private; 620 + int ret = 0; 621 + 
pte_t *ptep; 622 + spinlock_t *ptl; 623 + 624 + ptl = pmd_trans_huge_lock(pmdp, walk->vma); 625 + if (ptl) { 626 + ret = check_hwpoisoned_pmd_entry(pmdp, addr, hwp); 627 + spin_unlock(ptl); 628 + goto out; 629 + } 630 + 631 + if (pmd_trans_unstable(pmdp)) 632 + goto out; 633 + 634 + ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp, addr, &ptl); 635 + for (; addr != end; ptep++, addr += PAGE_SIZE) { 636 + ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT, 637 + hwp->pfn, &hwp->tk); 638 + if (ret == 1) 639 + break; 640 + } 641 + pte_unmap_unlock(ptep - 1, ptl); 642 + out: 643 + cond_resched(); 644 + return ret; 645 + } 646 + 647 + #ifdef CONFIG_HUGETLB_PAGE 648 + static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask, 649 + unsigned long addr, unsigned long end, 650 + struct mm_walk *walk) 651 + { 652 + struct hwp_walk *hwp = (struct hwp_walk *)walk->private; 653 + pte_t pte = huge_ptep_get(ptep); 654 + struct hstate *h = hstate_vma(walk->vma); 655 + 656 + return check_hwpoisoned_entry(pte, addr, huge_page_shift(h), 657 + hwp->pfn, &hwp->tk); 658 + } 659 + #else 660 + #define hwpoison_hugetlb_range NULL 661 + #endif 662 + 663 + static struct mm_walk_ops hwp_walk_ops = { 664 + .pmd_entry = hwpoison_pte_range, 665 + .hugetlb_entry = hwpoison_hugetlb_range, 666 + }; 667 + 668 + /* 669 + * Sends SIGBUS to the current process with error info. 670 + * 671 + * This function is intended to handle "Action Required" MCEs on already 672 + * hardware poisoned pages. They could happen, for example, when 673 + * memory_failure() failed to unmap the error page at the first call, or 674 + * when multiple local machine checks happened on different CPUs. 675 + * 676 + * MCE handler currently has no easy access to the error virtual address, 677 + * so this function walks page table to find it. The returned virtual address 678 + * is proper in most cases, but it could be wrong when the application 679 + * process has multiple entries mapping the error page. 
680 + */ 681 + static int kill_accessing_process(struct task_struct *p, unsigned long pfn, 682 + int flags) 683 + { 684 + int ret; 685 + struct hwp_walk priv = { 686 + .pfn = pfn, 687 + }; 688 + priv.tk.tsk = p; 689 + 690 + mmap_read_lock(p->mm); 691 + ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops, 692 + (void *)&priv); 693 + if (ret == 1 && priv.tk.addr) 694 + kill_proc(&priv.tk, pfn, flags); 695 + mmap_read_unlock(p->mm); 696 + return ret ? -EFAULT : -EHWPOISON; 697 + } 698 + 558 699 static const char *action_name[] = { 559 700 [MF_IGNORED] = "Ignored", 560 701 [MF_FAILED] = "Failed", ··· 1117 974 return PageLRU(page) || __PageMovable(page); 1118 975 } 1119 976 1120 - /** 1121 - * __get_hwpoison_page() - Get refcount for memory error handling: 1122 - * @page: raw error page (hit by memory error) 1123 - * 1124 - * Return: return 0 if failed to grab the refcount, otherwise true (some 1125 - * non-zero value.) 1126 - */ 1127 977 static int __get_hwpoison_page(struct page *page) 1128 978 { 1129 979 struct page *head = compound_head(page); ··· 1161 1025 return 0; 1162 1026 } 1163 1027 1164 - /* 1165 - * Safely get reference count of an arbitrary page. 1166 - * 1167 - * Returns 0 for a free page, 1 for an in-use page, 1168 - * -EIO for a page-type we cannot handle and -EBUSY if we raced with an 1169 - * allocation. 1170 - * We only incremented refcount in case the page was already in-use and it 1171 - * is a known type we can handle. 1172 - */ 1173 1028 static int get_any_page(struct page *p, unsigned long flags) 1174 1029 { 1175 1030 int ret = 0, pass = 0; ··· 1170 1043 count_increased = true; 1171 1044 1172 1045 try_again: 1173 - if (!count_increased && !__get_hwpoison_page(p)) { 1174 - if (page_count(p)) { 1175 - /* We raced with an allocation, retry. */ 1176 - if (pass++ < 3) 1177 - goto try_again; 1178 - ret = -EBUSY; 1179 - } else if (!PageHuge(p) && !is_free_buddy_page(p)) { 1180 - /* We raced with put_page, retry. 
*/ 1181 - if (pass++ < 3) 1182 - goto try_again; 1183 - ret = -EIO; 1184 - } 1185 - } else { 1186 - if (PageHuge(p) || HWPoisonHandlable(p)) { 1187 - ret = 1; 1188 - } else { 1189 - /* 1190 - * A page we cannot handle. Check whether we can turn 1191 - * it into something we can handle. 1192 - */ 1193 - if (pass++ < 3) { 1194 - put_page(p); 1195 - shake_page(p, 1); 1196 - count_increased = false; 1197 - goto try_again; 1046 + if (!count_increased) { 1047 + ret = __get_hwpoison_page(p); 1048 + if (!ret) { 1049 + if (page_count(p)) { 1050 + /* We raced with an allocation, retry. */ 1051 + if (pass++ < 3) 1052 + goto try_again; 1053 + ret = -EBUSY; 1054 + } else if (!PageHuge(p) && !is_free_buddy_page(p)) { 1055 + /* We raced with put_page, retry. */ 1056 + if (pass++ < 3) 1057 + goto try_again; 1058 + ret = -EIO; 1198 1059 } 1199 - put_page(p); 1200 - ret = -EIO; 1060 + goto out; 1061 + } else if (ret == -EBUSY) { 1062 + /* We raced with freeing huge page to buddy, retry. */ 1063 + if (pass++ < 3) 1064 + goto try_again; 1065 + goto out; 1201 1066 } 1202 1067 } 1203 1068 1069 + if (PageHuge(p) || HWPoisonHandlable(p)) { 1070 + ret = 1; 1071 + } else { 1072 + /* 1073 + * A page we cannot handle. Check whether we can turn 1074 + * it into something we can handle. 
1075 + */ 1076 + if (pass++ < 3) { 1077 + put_page(p); 1078 + shake_page(p, 1); 1079 + count_increased = false; 1080 + goto try_again; 1081 + } 1082 + put_page(p); 1083 + ret = -EIO; 1084 + } 1085 + out: 1204 1086 return ret; 1205 1087 } 1206 1088 1207 - static int get_hwpoison_page(struct page *p, unsigned long flags, 1208 - enum mf_flags ctxt) 1089 + /** 1090 + * get_hwpoison_page() - Get refcount for memory error handling 1091 + * @p: Raw error page (hit by memory error) 1092 + * @flags: Flags controlling behavior of error handling 1093 + * 1094 + * get_hwpoison_page() takes a page refcount of an error page to handle memory 1095 + * error on it, after checking that the error page is in a well-defined state 1096 + * (defined as a page-type we can successfully handle the memory error on it, 1097 + * such as LRU page and hugetlb page). 1098 + * 1099 + * Memory error handling could be triggered at any time on any type of page, 1100 + * so it's prone to race with typical memory management lifecycle (like 1101 + * allocation and free). So to avoid such races, get_hwpoison_page() takes 1102 + * extra care for the error page's state (as done in __get_hwpoison_page()), 1103 + * and has some retry logic in get_any_page(). 1104 + * 1105 + * Return: 0 on failure, 1106 + * 1 on success for in-use pages in a well-defined state, 1107 + * -EIO for pages on which we can not handle memory errors, 1108 + * -EBUSY when get_hwpoison_page() has raced with page lifecycle 1109 + * operations like allocation and free.
1110 + */ 1111 + static int get_hwpoison_page(struct page *p, unsigned long flags) 1209 1112 { 1210 1113 int ret; 1211 1114 1212 1115 zone_pcp_disable(page_zone(p)); 1213 - if (ctxt == MF_SOFT_OFFLINE) 1214 - ret = get_any_page(p, flags); 1215 - else 1216 - ret = __get_hwpoison_page(p); 1116 + ret = get_any_page(p, flags); 1217 1117 zone_pcp_enable(page_zone(p)); 1218 1118 1219 1119 return ret; ··· 1421 1267 if (TestSetPageHWPoison(head)) { 1422 1268 pr_err("Memory failure: %#lx: already hardware poisoned\n", 1423 1269 pfn); 1424 - return -EHWPOISON; 1270 + res = -EHWPOISON; 1271 + if (flags & MF_ACTION_REQUIRED) 1272 + res = kill_accessing_process(current, page_to_pfn(head), flags); 1273 + return res; 1425 1274 } 1426 1275 1427 1276 num_poisoned_pages_inc(); 1428 1277 1429 - if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) { 1430 - /* 1431 - * Check "filter hit" and "race with other subpage." 1432 - */ 1433 - lock_page(head); 1434 - if (PageHWPoison(head)) { 1435 - if ((hwpoison_filter(p) && TestClearPageHWPoison(p)) 1436 - || (p != head && TestSetPageHWPoison(head))) { 1437 - num_poisoned_pages_dec(); 1438 - unlock_page(head); 1439 - return 0; 1278 + if (!(flags & MF_COUNT_INCREASED)) { 1279 + res = get_hwpoison_page(p, flags); 1280 + if (!res) { 1281 + /* 1282 + * Check "filter hit" and "race with other subpage." 1283 + */ 1284 + lock_page(head); 1285 + if (PageHWPoison(head)) { 1286 + if ((hwpoison_filter(p) && TestClearPageHWPoison(p)) 1287 + || (p != head && TestSetPageHWPoison(head))) { 1288 + num_poisoned_pages_dec(); 1289 + unlock_page(head); 1290 + return 0; 1291 + } 1440 1292 } 1293 + unlock_page(head); 1294 + res = MF_FAILED; 1295 + if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { 1296 + page_ref_inc(p); 1297 + res = MF_RECOVERED; 1298 + } 1299 + action_result(pfn, MF_MSG_FREE_HUGE, res); 1300 + return res == MF_RECOVERED ? 
0 : -EBUSY; 1301 + } else if (res < 0) { 1302 + action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); 1303 + return -EBUSY; 1441 1304 } 1442 - unlock_page(head); 1443 - res = MF_FAILED; 1444 - if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { 1445 - page_ref_inc(p); 1446 - res = MF_RECOVERED; 1447 - } 1448 - action_result(pfn, MF_MSG_FREE_HUGE, res); 1449 - return res == MF_RECOVERED ? 0 : -EBUSY; 1450 1305 } 1451 1306 1452 1307 lock_page(head); ··· 1639 1476 pr_err("Memory failure: %#lx: already hardware poisoned\n", 1640 1477 pfn); 1641 1478 res = -EHWPOISON; 1479 + if (flags & MF_ACTION_REQUIRED) 1480 + res = kill_accessing_process(current, pfn, flags); 1642 1481 goto unlock_mutex; 1643 1482 } 1644 1483 ··· 1658 1493 * In fact it's dangerous to directly bump up page count from 0, 1659 1494 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch. 1660 1495 */ 1661 - if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) { 1662 - if (is_free_buddy_page(p)) { 1663 - if (take_page_off_buddy(p)) { 1664 - page_ref_inc(p); 1665 - res = MF_RECOVERED; 1666 - } else { 1667 - /* We lost the race, try again */ 1668 - if (retry) { 1669 - ClearPageHWPoison(p); 1670 - num_poisoned_pages_dec(); 1671 - retry = false; 1672 - goto try_again; 1496 + if (!(flags & MF_COUNT_INCREASED)) { 1497 + res = get_hwpoison_page(p, flags); 1498 + if (!res) { 1499 + if (is_free_buddy_page(p)) { 1500 + if (take_page_off_buddy(p)) { 1501 + page_ref_inc(p); 1502 + res = MF_RECOVERED; 1503 + } else { 1504 + /* We lost the race, try again */ 1505 + if (retry) { 1506 + ClearPageHWPoison(p); 1507 + num_poisoned_pages_dec(); 1508 + retry = false; 1509 + goto try_again; 1510 + } 1511 + res = MF_FAILED; 1673 1512 } 1674 - res = MF_FAILED; 1513 + action_result(pfn, MF_MSG_BUDDY, res); 1514 + res = res == MF_RECOVERED ? 
0 : -EBUSY; 1515 + } else { 1516 + action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); 1517 + res = -EBUSY; 1675 1518 } 1676 - action_result(pfn, MF_MSG_BUDDY, res); 1677 - res = res == MF_RECOVERED ? 0 : -EBUSY; 1678 - } else { 1679 - action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED); 1519 + goto unlock_mutex; 1520 + } else if (res < 0) { 1521 + action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); 1680 1522 res = -EBUSY; 1523 + goto unlock_mutex; 1681 1524 } 1682 - goto unlock_mutex; 1683 1525 } 1684 1526 1685 1527 if (PageTransHuge(hpage)) { ··· 1964 1792 return 0; 1965 1793 } 1966 1794 1967 - if (!get_hwpoison_page(p, flags, 0)) { 1795 + if (!get_hwpoison_page(p, flags)) { 1968 1796 if (TestClearPageHWPoison(p)) 1969 1797 num_poisoned_pages_dec(); 1970 1798 unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n", ··· 2180 2008 2181 2009 retry: 2182 2010 get_online_mems(); 2183 - ret = get_hwpoison_page(page, flags, MF_SOFT_OFFLINE); 2011 + ret = get_hwpoison_page(page, flags); 2184 2012 put_online_mems(); 2185 2013 2186 2014 if (ret > 0) {
+15 -7
mm/memory.c
··· 90 90 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. 91 91 #endif 92 92 93 - #ifndef CONFIG_NEED_MULTIPLE_NODES 94 - /* use the per-pgdat data instead for discontigmem - mbligh */ 93 + #ifndef CONFIG_NUMA 95 94 unsigned long max_mapnr; 96 95 EXPORT_SYMBOL(max_mapnr); 97 96 ··· 3022 3023 munlock_vma_page(old_page); 3023 3024 unlock_page(old_page); 3024 3025 } 3026 + if (page_copied) 3027 + free_swap_cache(old_page); 3025 3028 put_page(old_page); 3026 3029 } 3027 3030 return page_copied ? VM_FAULT_WRITE : 0; ··· 3048 3047 * The function expects the page to be locked or other protection against 3049 3048 * concurrent faults / writeback (such as DAX radix tree locks). 3050 3049 * 3051 - * Return: %VM_FAULT_WRITE on success, %0 when PTE got changed before 3050 + * Return: %0 on success, %VM_FAULT_NOPAGE when PTE got changed before 3052 3051 * we acquired PTE lock. 3053 3052 */ 3054 3053 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf) ··· 3354 3353 { 3355 3354 struct vm_area_struct *vma = vmf->vma; 3356 3355 struct page *page = NULL, *swapcache; 3356 + struct swap_info_struct *si = NULL; 3357 3357 swp_entry_t entry; 3358 3358 pte_t pte; 3359 3359 int locked; ··· 3382 3380 goto out; 3383 3381 } 3384 3382 3383 + /* Prevent swapoff from happening to us. 
*/ 3384 + si = get_swap_device(entry); 3385 + if (unlikely(!si)) 3386 + goto out; 3385 3387 3386 3388 delayacct_set_flag(current, DELAYACCT_PF_SWAPIN); 3387 3389 page = lookup_swap_cache(entry, vma, vmf->address); 3388 3390 swapcache = page; 3389 3391 3390 3392 if (!page) { 3391 - struct swap_info_struct *si = swp_swap_info(entry); 3392 - 3393 3393 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && 3394 3394 __swap_count(entry) == 1) { 3395 3395 /* skip swapcache */ ··· 3560 3556 unlock: 3561 3557 pte_unmap_unlock(vmf->pte, vmf->ptl); 3562 3558 out: 3559 + if (si) 3560 + put_swap_device(si); 3563 3561 return ret; 3564 3562 out_nomap: 3565 3563 pte_unmap_unlock(vmf->pte, vmf->ptl); ··· 3573 3567 unlock_page(swapcache); 3574 3568 put_page(swapcache); 3575 3569 } 3570 + if (si) 3571 + put_swap_device(si); 3576 3572 return ret; 3577 3573 } 3578 3574 ··· 4993 4985 * Check if this is a VM_IO | VM_PFNMAP VMA, which 4994 4986 * we can access using slightly different code. 4995 4987 */ 4996 - vma = find_vma(mm, addr); 4997 - if (!vma || vma->vm_start > addr) 4988 + vma = vma_lookup(mm, addr); 4989 + if (!vma) 4998 4990 break; 4999 4991 if (vma->vm_ops && vma->vm_ops->access) 5000 4992 ret = vma->vm_ops->access(vma, addr, buf,
+3 -3
mm/memory_hotplug.c
··· 961 961 node_states_set_node(nid, &arg); 962 962 if (need_zonelists_rebuild) 963 963 build_all_zonelists(NULL); 964 - zone_pcp_update(zone); 965 964 966 965 /* Basic onlining is complete, allow allocation of onlined pages. */ 967 966 undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE); ··· 973 974 */ 974 975 shuffle_zone(zone); 975 976 977 + /* reinitialise watermarks and update pcp limits */ 976 978 init_per_zone_wmark_min(); 977 979 978 980 kswapd_run(nid); ··· 1829 1829 adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); 1830 1830 adjust_present_page_count(zone, -nr_pages); 1831 1831 1832 + /* reinitialise watermarks and update pcp limits */ 1832 1833 init_per_zone_wmark_min(); 1833 1834 1834 1835 if (!populated_zone(zone)) { 1835 1836 zone_pcp_reset(zone); 1836 1837 build_all_zonelists(NULL); 1837 - } else 1838 - zone_pcp_update(zone); 1838 + } 1839 1839 1840 1840 node_states_clear_node(node, &arg); 1841 1841 if (arg.status_change_nid >= 0) {
+2 -2
mm/mempolicy.c
··· 975 975 * want to return MPOL_DEFAULT in this case. 976 976 */ 977 977 mmap_read_lock(mm); 978 - vma = find_vma_intersection(mm, addr, addr+1); 978 + vma = vma_lookup(mm, addr); 979 979 if (!vma) { 980 980 mmap_read_unlock(mm); 981 981 return -EFAULT; ··· 2150 2150 return page; 2151 2151 if (page && page_to_nid(page) == nid) { 2152 2152 preempt_disable(); 2153 - __inc_numa_state(page_zone(page), NUMA_INTERLEAVE_HIT); 2153 + __count_numa_event(page_zone(page), NUMA_INTERLEAVE_HIT); 2154 2154 preempt_enable(); 2155 2155 } 2156 2156 return page;
+2 -2
mm/migrate.c
··· 1834 1834 struct page *page; 1835 1835 int err = -EFAULT; 1836 1836 1837 - vma = find_vma(mm, addr); 1838 - if (!vma || addr < vma->vm_start) 1837 + vma = vma_lookup(mm, addr); 1838 + if (!vma) 1839 1839 goto set_status; 1840 1840 1841 1841 /* FOLL_DUMP to ignore special (like zero) pages */
+24 -30
mm/mmap.c
··· 1457 1457 return addr; 1458 1458 1459 1459 if (flags & MAP_FIXED_NOREPLACE) { 1460 - struct vm_area_struct *vma = find_vma(mm, addr); 1461 - 1462 - if (vma && vma->vm_start < addr + len) 1460 + if (find_vma_intersection(mm, addr, addr + len)) 1463 1461 return -EEXIST; 1464 1462 } 1465 1463 ··· 1631 1633 return PTR_ERR(file); 1632 1634 } 1633 1635 1634 - flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE); 1636 + flags &= ~MAP_DENYWRITE; 1635 1637 1636 1638 retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); 1637 1639 out_fput: ··· 2800 2802 return __split_vma(mm, vma, addr, new_below); 2801 2803 } 2802 2804 2805 + static inline void 2806 + unlock_range(struct vm_area_struct *start, unsigned long limit) 2807 + { 2808 + struct mm_struct *mm = start->vm_mm; 2809 + struct vm_area_struct *tmp = start; 2810 + 2811 + while (tmp && tmp->vm_start < limit) { 2812 + if (tmp->vm_flags & VM_LOCKED) { 2813 + mm->locked_vm -= vma_pages(tmp); 2814 + munlock_vma_pages_all(tmp); 2815 + } 2816 + 2817 + tmp = tmp->vm_next; 2818 + } 2819 + } 2820 + 2803 2821 /* Munmap is split into 2 main parts -- this part which finds 2804 2822 * what needs doing, and the areas themselves, which do the 2805 2823 * work. This now handles partial unmappings. ··· 2842 2828 */ 2843 2829 arch_unmap(mm, start, end); 2844 2830 2845 - /* Find the first overlapping VMA */ 2846 - vma = find_vma(mm, start); 2831 + /* Find the first overlapping VMA where start < vma->vm_end */ 2832 + vma = find_vma_intersection(mm, start, end); 2847 2833 if (!vma) 2848 2834 return 0; 2849 2835 prev = vma->vm_prev; 2850 - /* we have start < vma->vm_end */ 2851 - 2852 - /* if it doesn't overlap, we have nothing.. */ 2853 - if (vma->vm_start >= end) 2854 - return 0; 2855 2836 2856 2837 /* 2857 2838 * If we need to split any vma, do it now to save pain later. 
··· 2899 2890 /* 2900 2891 * unlock any mlock()ed ranges before detaching vmas 2901 2892 */ 2902 - if (mm->locked_vm) { 2903 - struct vm_area_struct *tmp = vma; 2904 - while (tmp && tmp->vm_start < end) { 2905 - if (tmp->vm_flags & VM_LOCKED) { 2906 - mm->locked_vm -= vma_pages(tmp); 2907 - munlock_vma_pages_all(tmp); 2908 - } 2909 - 2910 - tmp = tmp->vm_next; 2911 - } 2912 - } 2893 + if (mm->locked_vm) 2894 + unlock_range(vma, end); 2913 2895 2914 2896 /* Detach vmas from rbtree */ 2915 2897 if (!detach_vmas_to_be_unmapped(mm, vma, prev, end)) ··· 3185 3185 mmap_write_unlock(mm); 3186 3186 } 3187 3187 3188 - if (mm->locked_vm) { 3189 - vma = mm->mmap; 3190 - while (vma) { 3191 - if (vma->vm_flags & VM_LOCKED) 3192 - munlock_vma_pages_all(vma); 3193 - vma = vma->vm_next; 3194 - } 3195 - } 3188 + if (mm->locked_vm) 3189 + unlock_range(mm->mmap, ULONG_MAX); 3196 3190 3197 3191 arch_exit_mmap(mm); 3198 3192
+22 -11
mm/mmap_lock.c
··· 11 11 #include <linux/rcupdate.h> 12 12 #include <linux/smp.h> 13 13 #include <linux/trace_events.h> 14 + #include <linux/local_lock.h> 14 15 15 16 EXPORT_TRACEPOINT_SYMBOL(mmap_lock_start_locking); 16 17 EXPORT_TRACEPOINT_SYMBOL(mmap_lock_acquire_returned); ··· 40 39 */ 41 40 #define CONTEXT_COUNT 4 42 41 43 - static DEFINE_PER_CPU(char __rcu *, memcg_path_buf); 42 + struct memcg_path { 43 + local_lock_t lock; 44 + char __rcu *buf; 45 + local_t buf_idx; 46 + }; 47 + static DEFINE_PER_CPU(struct memcg_path, memcg_paths) = { 48 + .lock = INIT_LOCAL_LOCK(lock), 49 + .buf_idx = LOCAL_INIT(0), 50 + }; 51 + 44 52 static char **tmp_bufs; 45 - static DEFINE_PER_CPU(int, memcg_path_buf_idx); 46 53 47 54 /* Called with reg_lock held. */ 48 55 static void free_memcg_path_bufs(void) 49 56 { 57 + struct memcg_path *memcg_path; 50 58 int cpu; 51 59 char **old = tmp_bufs; 52 60 53 61 for_each_possible_cpu(cpu) { 54 - *(old++) = rcu_dereference_protected( 55 - per_cpu(memcg_path_buf, cpu), 62 + memcg_path = per_cpu_ptr(&memcg_paths, cpu); 63 + *(old++) = rcu_dereference_protected(memcg_path->buf, 56 64 lockdep_is_held(&reg_lock)); 57 - rcu_assign_pointer(per_cpu(memcg_path_buf, cpu), NULL); 65 + rcu_assign_pointer(memcg_path->buf, NULL); 58 66 } 59 67 60 68 /* Wait for inflight memcg_path_buf users to finish. */ ··· 98 88 new = kmalloc(MEMCG_PATH_BUF_SIZE * CONTEXT_COUNT, GFP_KERNEL); 99 89 if (new == NULL) 100 90 goto out_fail_free; 101 - rcu_assign_pointer(per_cpu(memcg_path_buf, cpu), new); 91 + rcu_assign_pointer(per_cpu_ptr(&memcg_paths, cpu)->buf, new); 102 92 /* Don't need to wait for inflights, they'd have gotten NULL. 
*/ 103 93 } 104 94 ··· 132 122 133 123 static inline char *get_memcg_path_buf(void) 134 124 { 125 + struct memcg_path *memcg_path = this_cpu_ptr(&memcg_paths); 135 126 char *buf; 136 127 int idx; 137 128 138 129 rcu_read_lock(); 139 - buf = rcu_dereference(*this_cpu_ptr(&memcg_path_buf)); 130 + buf = rcu_dereference(memcg_path->buf); 140 131 if (buf == NULL) { 141 132 rcu_read_unlock(); 142 133 return NULL; 143 134 } 144 - idx = this_cpu_add_return(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE) - 135 + idx = local_add_return(MEMCG_PATH_BUF_SIZE, &memcg_path->buf_idx) - 145 136 MEMCG_PATH_BUF_SIZE; 146 137 return &buf[idx]; 147 138 } 148 139 149 140 static inline void put_memcg_path_buf(void) 150 141 { 151 - this_cpu_sub(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE); 142 + local_sub(MEMCG_PATH_BUF_SIZE, &this_cpu_ptr(&memcg_paths)->buf_idx); 152 143 rcu_read_unlock(); 153 144 } 154 145 ··· 190 179 #define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ 191 180 do { \ 192 181 const char *memcg_path; \ 193 - preempt_disable(); \ 182 + local_lock(&memcg_paths.lock); \ 194 183 memcg_path = get_mm_memcg_path(mm); \ 195 184 trace_mmap_lock_##type(mm, \ 196 185 memcg_path != NULL ? memcg_path : "", \ 197 186 ##__VA_ARGS__); \ 198 187 if (likely(memcg_path != NULL)) \ 199 188 put_memcg_path_buf(); \ 200 - preempt_enable(); \ 189 + local_unlock(&memcg_paths.lock); \ 201 190 } while (0) 202 191 203 192 #else /* !CONFIG_MEMCG */
+3 -2
mm/mremap.c
··· 634 634 unsigned long *p) 635 635 { 636 636 struct mm_struct *mm = current->mm; 637 - struct vm_area_struct *vma = find_vma(mm, addr); 637 + struct vm_area_struct *vma; 638 638 unsigned long pgoff; 639 639 640 - if (!vma || vma->vm_start > addr) 640 + vma = vma_lookup(mm, addr); 641 + if (!vma) 641 642 return ERR_PTR(-EFAULT); 642 643 643 644 /*
+1 -1
mm/nommu.c
··· 1296 1296 goto out; 1297 1297 } 1298 1298 1299 - flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE); 1299 + flags &= ~MAP_DENYWRITE; 1300 1300 1301 1301 retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff); 1302 1302
+53 -36
mm/page-writeback.c
··· 32 32 #include <linux/sysctl.h> 33 33 #include <linux/cpu.h> 34 34 #include <linux/syscalls.h> 35 - #include <linux/buffer_head.h> /* __set_page_dirty_buffers */ 36 35 #include <linux/pagevec.h> 37 36 #include <linux/timer.h> 38 37 #include <linux/sched/rt.h> ··· 844 845 * ^ pos_ratio 845 846 * | 846 847 * | |<===== global dirty control scope ======>| 847 - * 2.0 .............* 848 + * 2.0 * * * * * * * 848 849 * | .* 849 850 * | . * 850 851 * | . * ··· 1868 1869 * which was newly dirtied. The function will periodically check the system's 1869 1870 * dirty state and will initiate writeback if needed. 1870 1871 * 1871 - * On really big machines, get_writeback_state is expensive, so try to avoid 1872 - * calling it too often (ratelimiting). But once we're over the dirty memory 1873 - * limit we decrease the ratelimiting by a lot, to prevent individual processes 1874 - * from overshooting the limit by (ratelimit_pages) each. 1872 + * Once we're over the dirty memory limit we decrease the ratelimiting 1873 + * by a lot, to prevent individual processes from overshooting the limit 1874 + * by (ratelimit_pages) each. 1875 1875 */ 1876 1876 void balance_dirty_pages_ratelimited(struct address_space *mapping) 1877 1877 { ··· 1943 1945 struct dirty_throttle_control * const gdtc = &gdtc_stor; 1944 1946 struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? 
1945 1947 &mdtc_stor : NULL; 1948 + unsigned long reclaimable; 1949 + unsigned long thresh; 1946 1950 1947 1951 /* 1948 1952 * Similar to balance_dirty_pages() but ignores pages being written ··· 1957 1957 if (gdtc->dirty > gdtc->bg_thresh) 1958 1958 return true; 1959 1959 1960 - if (wb_stat(wb, WB_RECLAIMABLE) > 1961 - wb_calc_thresh(gdtc->wb, gdtc->bg_thresh)) 1960 + thresh = wb_calc_thresh(gdtc->wb, gdtc->bg_thresh); 1961 + if (thresh < 2 * wb_stat_error()) 1962 + reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); 1963 + else 1964 + reclaimable = wb_stat(wb, WB_RECLAIMABLE); 1965 + 1966 + if (reclaimable > thresh) 1962 1967 return true; 1963 1968 1964 1969 if (mdtc) { ··· 1977 1972 if (mdtc->dirty > mdtc->bg_thresh) 1978 1973 return true; 1979 1974 1980 - if (wb_stat(wb, WB_RECLAIMABLE) > 1981 - wb_calc_thresh(mdtc->wb, mdtc->bg_thresh)) 1975 + thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh); 1976 + if (thresh < 2 * wb_stat_error()) 1977 + reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); 1978 + else 1979 + reclaimable = wb_stat(wb, WB_RECLAIMABLE); 1980 + 1981 + if (reclaimable > thresh) 1982 1982 return true; 1983 1983 } 1984 1984 ··· 2055 2045 /* 2056 2046 * If ratelimit_pages is too high then we can get into dirty-data overload 2057 2047 * if a large number of processes all perform writes at the same time. 2058 - * If it is too low then SMP machines will call the (expensive) 2059 - * get_writeback_state too often. 2060 2048 * 2061 2049 * Here we set ratelimit_pages to a level which ensures that when all CPUs are 2062 2050 * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory ··· 2417 2409 return !TestSetPageDirty(page); 2418 2410 return 0; 2419 2411 } 2412 + EXPORT_SYMBOL(__set_page_dirty_no_writeback); 2420 2413 2421 2414 /* 2422 2415 * Helper function for set_page_dirty family. ··· 2426 2417 * 2427 2418 * NOTE: This relies on being atomic wrt interrupts. 
2428 2419 */ 2429 - void account_page_dirtied(struct page *page, struct address_space *mapping) 2420 + static void account_page_dirtied(struct page *page, 2421 + struct address_space *mapping) 2430 2422 { 2431 2423 struct inode *inode = mapping->host; 2432 2424 ··· 2446 2436 inc_wb_stat(wb, WB_DIRTIED); 2447 2437 task_io_account_write(PAGE_SIZE); 2448 2438 current->nr_dirtied++; 2449 - this_cpu_inc(bdp_ratelimits); 2439 + __this_cpu_inc(bdp_ratelimits); 2450 2440 2451 2441 mem_cgroup_track_foreign_dirty(page, wb); 2452 2442 } ··· 2469 2459 } 2470 2460 2471 2461 /* 2462 + * Mark the page dirty, and set it dirty in the page cache, and mark the inode 2463 + * dirty. 2464 + * 2465 + * If warn is true, then emit a warning if the page is not uptodate and has 2466 + * not been truncated. 2467 + * 2468 + * The caller must hold lock_page_memcg(). 2469 + */ 2470 + void __set_page_dirty(struct page *page, struct address_space *mapping, 2471 + int warn) 2472 + { 2473 + unsigned long flags; 2474 + 2475 + xa_lock_irqsave(&mapping->i_pages, flags); 2476 + if (page->mapping) { /* Race with truncate? */ 2477 + WARN_ON_ONCE(warn && !PageUptodate(page)); 2478 + account_page_dirtied(page, mapping); 2479 + __xa_set_mark(&mapping->i_pages, page_index(page), 2480 + PAGECACHE_TAG_DIRTY); 2481 + } 2482 + xa_unlock_irqrestore(&mapping->i_pages, flags); 2483 + } 2484 + 2485 + /* 2472 2486 * For address_spaces which do not use buffers. Just tag the page as dirty in 2473 2487 * the xarray. 
2474 2488 * ··· 2509 2475 lock_page_memcg(page); 2510 2476 if (!TestSetPageDirty(page)) { 2511 2477 struct address_space *mapping = page_mapping(page); 2512 - unsigned long flags; 2513 2478 2514 2479 if (!mapping) { 2515 2480 unlock_page_memcg(page); 2516 2481 return 1; 2517 2482 } 2518 - 2519 - xa_lock_irqsave(&mapping->i_pages, flags); 2520 - BUG_ON(page_mapping(page) != mapping); 2521 - WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); 2522 - account_page_dirtied(page, mapping); 2523 - __xa_set_mark(&mapping->i_pages, page_index(page), 2524 - PAGECACHE_TAG_DIRTY); 2525 - xa_unlock_irqrestore(&mapping->i_pages, flags); 2483 + __set_page_dirty(page, mapping, !PagePrivate(page)); 2526 2484 unlock_page_memcg(page); 2527 2485 2528 2486 if (mapping->host) { ··· 2572 2546 /* 2573 2547 * Dirty a page. 2574 2548 * 2575 - * For pages with a mapping this should be done under the page lock 2576 - * for the benefit of asynchronous memory errors who prefer a consistent 2577 - * dirty state. This rule can be broken in some special cases, 2578 - * but should be better not to. 2579 - * 2580 - * If the mapping doesn't provide a set_page_dirty a_op, then 2581 - * just fall through and assume that it wants buffer_heads. 2549 + * For pages with a mapping this should be done under the page lock for the 2550 + * benefit of asynchronous memory errors who prefer a consistent dirty state. 2551 + * This rule can be broken in some special cases, but should be better not to. 
2582 2552 */ 2583 2553 int set_page_dirty(struct page *page) 2584 2554 { ··· 2582 2560 2583 2561 page = compound_head(page); 2584 2562 if (likely(mapping)) { 2585 - int (*spd)(struct page *) = mapping->a_ops->set_page_dirty; 2586 2563 /* 2587 2564 * readahead/lru_deactivate_page could remain 2588 2565 * PG_readahead/PG_reclaim due to race with end_page_writeback ··· 2594 2573 */ 2595 2574 if (PageReclaim(page)) 2596 2575 ClearPageReclaim(page); 2597 - #ifdef CONFIG_BLOCK 2598 - if (!spd) 2599 - spd = __set_page_dirty_buffers; 2600 - #endif 2601 - return (*spd)(page); 2576 + return mapping->a_ops->set_page_dirty(page); 2602 2577 } 2603 2578 if (!PageDirty(page)) { 2604 2579 if (!TestSetPageDirty(page))
+530 -276
mm/page_alloc.c
··· 120 120 121 121 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 122 122 static DEFINE_MUTEX(pcp_batch_high_lock); 123 - #define MIN_PERCPU_PAGELIST_FRACTION (8) 123 + #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) 124 + 125 + struct pagesets { 126 + local_lock_t lock; 127 + #if defined(CONFIG_DEBUG_INFO_BTF) && \ 128 + !defined(CONFIG_DEBUG_LOCK_ALLOC) && \ 129 + !defined(CONFIG_PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT) 130 + /* 131 + * pahole 1.21 and earlier gets confused by zero-sized per-CPU 132 + * variables and produces invalid BTF. Ensure that 133 + * sizeof(struct pagesets) != 0 for older versions of pahole. 134 + */ 135 + char __pahole_hack; 136 + #warning "pahole too old to support zero-sized struct pagesets" 137 + #endif 138 + }; 139 + static DEFINE_PER_CPU(struct pagesets, pagesets) = { 140 + .lock = INIT_LOCAL_LOCK(lock), 141 + }; 124 142 125 143 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 126 144 DEFINE_PER_CPU(int, numa_node); ··· 193 175 unsigned long totalreserve_pages __read_mostly; 194 176 unsigned long totalcma_pages __read_mostly; 195 177 196 - int percpu_pagelist_fraction; 178 + int percpu_pagelist_high_fraction; 197 179 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; 198 180 DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc); 199 181 EXPORT_SYMBOL(init_on_alloc); ··· 349 331 350 332 int min_free_kbytes = 1024; 351 333 int user_min_free_kbytes = -1; 352 - #ifdef CONFIG_DISCONTIGMEM 353 - /* 354 - * DiscontigMem defines memory ranges as separate pg_data_t even if the ranges 355 - * are not on separate NUMA nodes. Functionally this works but with 356 - * watermark_boost_factor, it can reclaim prematurely as the ranges can be 357 - * quite small. By default, do not boost watermarks on discontigmem as in 358 - * many cases very high-order allocations like THP are likely to be 359 - * unsupported and the premature reclaim offsets the advantage of long-term 360 - * fragmentation avoidance. 
361 - */ 362 - int watermark_boost_factor __read_mostly; 363 - #else 364 334 int watermark_boost_factor __read_mostly = 15000; 365 - #endif 366 335 int watermark_scale_factor = 10; 367 336 368 337 static unsigned long nr_kernel_pages __initdata; ··· 474 469 #endif 475 470 476 471 /* Return a pointer to the bitmap storing bits affecting a block of pages */ 477 - static inline unsigned long *get_pageblock_bitmap(struct page *page, 472 + static inline unsigned long *get_pageblock_bitmap(const struct page *page, 478 473 unsigned long pfn) 479 474 { 480 475 #ifdef CONFIG_SPARSEMEM ··· 484 479 #endif /* CONFIG_SPARSEMEM */ 485 480 } 486 481 487 - static inline int pfn_to_bitidx(struct page *page, unsigned long pfn) 482 + static inline int pfn_to_bitidx(const struct page *page, unsigned long pfn) 488 483 { 489 484 #ifdef CONFIG_SPARSEMEM 490 485 pfn &= (PAGES_PER_SECTION-1); ··· 495 490 } 496 491 497 492 static __always_inline 498 - unsigned long __get_pfnblock_flags_mask(struct page *page, 493 + unsigned long __get_pfnblock_flags_mask(const struct page *page, 499 494 unsigned long pfn, 500 495 unsigned long mask) 501 496 { ··· 520 515 * 521 516 * Return: pageblock_bits flags 522 517 */ 523 - unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn, 524 - unsigned long mask) 518 + unsigned long get_pfnblock_flags_mask(const struct page *page, 519 + unsigned long pfn, unsigned long mask) 525 520 { 526 521 return __get_pfnblock_flags_mask(page, pfn, mask); 527 522 } 528 523 529 - static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn) 524 + static __always_inline int get_pfnblock_migratetype(const struct page *page, 525 + unsigned long pfn) 530 526 { 531 527 return __get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); 532 528 } ··· 659 653 660 654 pr_alert("BUG: Bad page state in process %s pfn:%05lx\n", 661 655 current->comm, page_to_pfn(page)); 662 - __dump_page(page, reason); 663 - dump_page_owner(page); 656 + 
dump_page(page, reason); 664 657 665 658 print_modules(); 666 659 dump_stack(); ··· 667 662 /* Leave bad fields for debug, except PageBuddy could make trouble */ 668 663 page_mapcount_reset(page); /* remove PageBuddy */ 669 664 add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 665 + } 666 + 667 + static inline unsigned int order_to_pindex(int migratetype, int order) 668 + { 669 + int base = order; 670 + 671 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 672 + if (order > PAGE_ALLOC_COSTLY_ORDER) { 673 + VM_BUG_ON(order != pageblock_order); 674 + base = PAGE_ALLOC_COSTLY_ORDER + 1; 675 + } 676 + #else 677 + VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); 678 + #endif 679 + 680 + return (MIGRATE_PCPTYPES * base) + migratetype; 681 + } 682 + 683 + static inline int pindex_to_order(unsigned int pindex) 684 + { 685 + int order = pindex / MIGRATE_PCPTYPES; 686 + 687 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 688 + if (order > PAGE_ALLOC_COSTLY_ORDER) { 689 + order = pageblock_order; 690 + VM_BUG_ON(order != pageblock_order); 691 + } 692 + #else 693 + VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); 694 + #endif 695 + 696 + return order; 697 + } 698 + 699 + static inline bool pcp_allowed_order(unsigned int order) 700 + { 701 + if (order <= PAGE_ALLOC_COSTLY_ORDER) 702 + return true; 703 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 704 + if (order == pageblock_order) 705 + return true; 706 + #endif 707 + return false; 708 + } 709 + 710 + static inline void free_the_page(struct page *page, unsigned int order) 711 + { 712 + if (pcp_allowed_order(order)) /* Via pcp? 
*/ 713 + free_unref_page(page, order); 714 + else 715 + __free_pages_ok(page, order, FPI_NONE); 670 716 } 671 717 672 718 /* ··· 738 682 void free_compound_page(struct page *page) 739 683 { 740 684 mem_cgroup_uncharge(page); 741 - __free_pages_ok(page, compound_order(page), FPI_NONE); 685 + free_the_page(page, compound_order(page)); 742 686 } 743 687 744 688 void prep_compound_page(struct page *page, unsigned int order) ··· 1401 1345 * to pcp lists. With debug_pagealloc also enabled, they are also rechecked when 1402 1346 * moved from pcp lists to free lists. 1403 1347 */ 1404 - static bool free_pcp_prepare(struct page *page) 1348 + static bool free_pcp_prepare(struct page *page, unsigned int order) 1405 1349 { 1406 - return free_pages_prepare(page, 0, true, FPI_NONE); 1350 + return free_pages_prepare(page, order, true, FPI_NONE); 1407 1351 } 1408 1352 1409 1353 static bool bulkfree_pcp_prepare(struct page *page) ··· 1420 1364 * debug_pagealloc enabled, they are checked also immediately when being freed 1421 1365 * to the pcp lists. 
1422 1366 */ 1423 - static bool free_pcp_prepare(struct page *page) 1367 + static bool free_pcp_prepare(struct page *page, unsigned int order) 1424 1368 { 1425 1369 if (debug_pagealloc_enabled_static()) 1426 - return free_pages_prepare(page, 0, true, FPI_NONE); 1370 + return free_pages_prepare(page, order, true, FPI_NONE); 1427 1371 else 1428 - return free_pages_prepare(page, 0, false, FPI_NONE); 1372 + return free_pages_prepare(page, order, false, FPI_NONE); 1429 1373 } 1430 1374 1431 1375 static bool bulkfree_pcp_prepare(struct page *page) ··· 1457 1401 static void free_pcppages_bulk(struct zone *zone, int count, 1458 1402 struct per_cpu_pages *pcp) 1459 1403 { 1460 - int migratetype = 0; 1404 + int pindex = 0; 1461 1405 int batch_free = 0; 1406 + int nr_freed = 0; 1407 + unsigned int order; 1462 1408 int prefetch_nr = READ_ONCE(pcp->batch); 1463 1409 bool isolated_pageblocks; 1464 1410 struct page *page, *tmp; ··· 1471 1413 * below while (list_empty(list)) loop. 1472 1414 */ 1473 1415 count = min(pcp->count, count); 1474 - while (count) { 1416 + while (count > 0) { 1475 1417 struct list_head *list; 1476 1418 1477 1419 /* ··· 1483 1425 */ 1484 1426 do { 1485 1427 batch_free++; 1486 - if (++migratetype == MIGRATE_PCPTYPES) 1487 - migratetype = 0; 1488 - list = &pcp->lists[migratetype]; 1428 + if (++pindex == NR_PCP_LISTS) 1429 + pindex = 0; 1430 + list = &pcp->lists[pindex]; 1489 1431 } while (list_empty(list)); 1490 1432 1491 1433 /* This is the only non-empty list. Free them all. 
*/ 1492 - if (batch_free == MIGRATE_PCPTYPES) 1434 + if (batch_free == NR_PCP_LISTS) 1493 1435 batch_free = count; 1494 1436 1437 + order = pindex_to_order(pindex); 1438 + BUILD_BUG_ON(MAX_ORDER >= (1<<NR_PCP_ORDER_WIDTH)); 1495 1439 do { 1496 1440 page = list_last_entry(list, struct page, lru); 1497 1441 /* must delete to avoid corrupting pcp list */ 1498 1442 list_del(&page->lru); 1499 - pcp->count--; 1443 + nr_freed += 1 << order; 1444 + count -= 1 << order; 1500 1445 1501 1446 if (bulkfree_pcp_prepare(page)) 1502 1447 continue; 1448 + 1449 + /* Encode order with the migratetype */ 1450 + page->index <<= NR_PCP_ORDER_WIDTH; 1451 + page->index |= order; 1503 1452 1504 1453 list_add_tail(&page->lru, &head); 1505 1454 ··· 1523 1458 prefetch_buddy(page); 1524 1459 prefetch_nr--; 1525 1460 } 1526 - } while (--count && --batch_free && !list_empty(list)); 1461 + } while (count > 0 && --batch_free && !list_empty(list)); 1527 1462 } 1463 + pcp->count -= nr_freed; 1528 1464 1465 + /* 1466 + * local_lock_irq held so equivalent to spin_lock_irqsave for 1467 + * both PREEMPT_RT and non-PREEMPT_RT configurations. 
1468 + */ 1529 1469 spin_lock(&zone->lock); 1530 1470 isolated_pageblocks = has_isolate_pageblock(zone); 1531 1471 ··· 1540 1470 */ 1541 1471 list_for_each_entry_safe(page, tmp, &head, lru) { 1542 1472 int mt = get_pcppage_migratetype(page); 1473 + 1474 + /* mt has been encoded with the order (see above) */ 1475 + order = mt & NR_PCP_ORDER_MASK; 1476 + mt >>= NR_PCP_ORDER_WIDTH; 1477 + 1543 1478 /* MIGRATE_ISOLATE page should not go to pcplists */ 1544 1479 VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); 1545 1480 /* Pageblock could have been isolated meanwhile */ 1546 1481 if (unlikely(isolated_pageblocks)) 1547 1482 mt = get_pageblock_migratetype(page); 1548 1483 1549 - __free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE); 1550 - trace_mm_page_pcpu_drain(page, 0, mt); 1484 + __free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE); 1485 + trace_mm_page_pcpu_drain(page, order, mt); 1551 1486 } 1552 1487 spin_unlock(&zone->lock); 1553 1488 } ··· 1562 1487 unsigned int order, 1563 1488 int migratetype, fpi_t fpi_flags) 1564 1489 { 1565 - spin_lock(&zone->lock); 1490 + unsigned long flags; 1491 + 1492 + spin_lock_irqsave(&zone->lock, flags); 1566 1493 if (unlikely(has_isolate_pageblock(zone) || 1567 1494 is_migrate_isolate(migratetype))) { 1568 1495 migratetype = get_pfnblock_migratetype(page, pfn); 1569 1496 } 1570 1497 __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); 1571 - spin_unlock(&zone->lock); 1498 + spin_unlock_irqrestore(&zone->lock, flags); 1572 1499 } 1573 1500 1574 1501 static void __meminit __init_single_page(struct page *page, unsigned long pfn, ··· 1653 1576 unsigned long flags; 1654 1577 int migratetype; 1655 1578 unsigned long pfn = page_to_pfn(page); 1579 + struct zone *zone = page_zone(page); 1656 1580 1657 1581 if (!free_pages_prepare(page, order, true, fpi_flags)) 1658 1582 return; 1659 1583 1660 1584 migratetype = get_pfnblock_migratetype(page, pfn); 1661 - local_irq_save(flags); 1585 + 1586 + 
spin_lock_irqsave(&zone->lock, flags); 1587 + if (unlikely(has_isolate_pageblock(zone) || 1588 + is_migrate_isolate(migratetype))) { 1589 + migratetype = get_pfnblock_migratetype(page, pfn); 1590 + } 1591 + __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); 1592 + spin_unlock_irqrestore(&zone->lock, flags); 1593 + 1662 1594 __count_vm_events(PGFREE, 1 << order); 1663 - free_one_page(page_zone(page), page, pfn, order, migratetype, 1664 - fpi_flags); 1665 - local_irq_restore(flags); 1666 1595 } 1667 1596 1668 1597 void __free_pages_core(struct page *page, unsigned int order) ··· 1700 1617 __free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON); 1701 1618 } 1702 1619 1703 - #ifdef CONFIG_NEED_MULTIPLE_NODES 1620 + #ifdef CONFIG_NUMA 1704 1621 1705 1622 /* 1706 1623 * During memory init memblocks map pfns to nids. The search is expensive and ··· 1750 1667 1751 1668 return nid; 1752 1669 } 1753 - #endif /* CONFIG_NEED_MULTIPLE_NODES */ 1670 + #endif /* CONFIG_NUMA */ 1754 1671 1755 1672 void __init memblock_free_pages(struct page *page, unsigned long pfn, 1756 1673 unsigned int order) ··· 2236 2153 2237 2154 /* Block until all are initialised */ 2238 2155 wait_for_completion(&pgdat_init_all_done_comp); 2239 - 2240 - /* 2241 - * The number of managed pages has changed due to the initialisation 2242 - * so the pcpu batch and high limits needs to be updated or the limits 2243 - * will be artificially small. 2244 - */ 2245 - for_each_populated_zone(zone) 2246 - zone_pcp_update(zone); 2247 2156 2248 2157 /* 2249 2158 * We initialized the rest of the deferred pages. Permanently disable ··· 3042 2967 { 3043 2968 int i, allocated = 0; 3044 2969 2970 + /* 2971 + * local_lock_irq held so equivalent to spin_lock_irqsave for 2972 + * both PREEMPT_RT and non-PREEMPT_RT configurations. 
2973 + */ 3045 2974 spin_lock(&zone->lock); 3046 2975 for (i = 0; i < count; ++i) { 3047 2976 struct page *page = __rmqueue(zone, order, migratetype, ··· 3098 3019 unsigned long flags; 3099 3020 int to_drain, batch; 3100 3021 3101 - local_irq_save(flags); 3022 + local_lock_irqsave(&pagesets.lock, flags); 3102 3023 batch = READ_ONCE(pcp->batch); 3103 3024 to_drain = min(pcp->count, batch); 3104 3025 if (to_drain > 0) 3105 3026 free_pcppages_bulk(zone, to_drain, pcp); 3106 - local_irq_restore(flags); 3027 + local_unlock_irqrestore(&pagesets.lock, flags); 3107 3028 } 3108 3029 #endif 3109 3030 ··· 3117 3038 static void drain_pages_zone(unsigned int cpu, struct zone *zone) 3118 3039 { 3119 3040 unsigned long flags; 3120 - struct per_cpu_pageset *pset; 3121 3041 struct per_cpu_pages *pcp; 3122 3042 3123 - local_irq_save(flags); 3124 - pset = per_cpu_ptr(zone->pageset, cpu); 3043 + local_lock_irqsave(&pagesets.lock, flags); 3125 3044 3126 - pcp = &pset->pcp; 3045 + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); 3127 3046 if (pcp->count) 3128 3047 free_pcppages_bulk(zone, pcp->count, pcp); 3129 - local_irq_restore(flags); 3048 + 3049 + local_unlock_irqrestore(&pagesets.lock, flags); 3130 3050 } 3131 3051 3132 3052 /* ··· 3223 3145 * disables preemption as part of its processing 3224 3146 */ 3225 3147 for_each_online_cpu(cpu) { 3226 - struct per_cpu_pageset *pcp; 3148 + struct per_cpu_pages *pcp; 3227 3149 struct zone *z; 3228 3150 bool has_pcps = false; 3229 3151 ··· 3234 3156 */ 3235 3157 has_pcps = true; 3236 3158 } else if (zone) { 3237 - pcp = per_cpu_ptr(zone->pageset, cpu); 3238 - if (pcp->pcp.count) 3159 + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); 3160 + if (pcp->count) 3239 3161 has_pcps = true; 3240 3162 } else { 3241 3163 for_each_populated_zone(z) { 3242 - pcp = per_cpu_ptr(z->pageset, cpu); 3243 - if (pcp->pcp.count) { 3164 + pcp = per_cpu_ptr(z->per_cpu_pageset, cpu); 3165 + if (pcp->count) { 3244 3166 has_pcps = true; 3245 3167 break; 3246 3168 } ··· 
3333 3255 } 3334 3256 #endif /* CONFIG_PM */ 3335 3257 3336 - static bool free_unref_page_prepare(struct page *page, unsigned long pfn) 3258 + static bool free_unref_page_prepare(struct page *page, unsigned long pfn, 3259 + unsigned int order) 3337 3260 { 3338 3261 int migratetype; 3339 3262 3340 - if (!free_pcp_prepare(page)) 3263 + if (!free_pcp_prepare(page, order)) 3341 3264 return false; 3342 3265 3343 3266 migratetype = get_pfnblock_migratetype(page, pfn); ··· 3346 3267 return true; 3347 3268 } 3348 3269 3349 - static void free_unref_page_commit(struct page *page, unsigned long pfn) 3270 + static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch) 3271 + { 3272 + int min_nr_free, max_nr_free; 3273 + 3274 + /* Check for PCP disabled or boot pageset */ 3275 + if (unlikely(high < batch)) 3276 + return 1; 3277 + 3278 + /* Leave at least pcp->batch pages on the list */ 3279 + min_nr_free = batch; 3280 + max_nr_free = high - batch; 3281 + 3282 + /* 3283 + * Double the number of pages freed each time there is subsequent 3284 + * freeing of pages without any allocation. 
3285 + */ 3286 + batch <<= pcp->free_factor; 3287 + if (batch < max_nr_free) 3288 + pcp->free_factor++; 3289 + batch = clamp(batch, min_nr_free, max_nr_free); 3290 + 3291 + return batch; 3292 + } 3293 + 3294 + static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone) 3295 + { 3296 + int high = READ_ONCE(pcp->high); 3297 + 3298 + if (unlikely(!high)) 3299 + return 0; 3300 + 3301 + if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) 3302 + return high; 3303 + 3304 + /* 3305 + * If reclaim is active, limit the number of pages that can be 3306 + * stored on pcp lists 3307 + */ 3308 + return min(READ_ONCE(pcp->batch) << 2, high); 3309 + } 3310 + 3311 + static void free_unref_page_commit(struct page *page, unsigned long pfn, 3312 + int migratetype, unsigned int order) 3350 3313 { 3351 3314 struct zone *zone = page_zone(page); 3352 3315 struct per_cpu_pages *pcp; 3316 + int high; 3317 + int pindex; 3318 + 3319 + __count_vm_event(PGFREE); 3320 + pcp = this_cpu_ptr(zone->per_cpu_pageset); 3321 + pindex = order_to_pindex(migratetype, order); 3322 + list_add(&page->lru, &pcp->lists[pindex]); 3323 + pcp->count += 1 << order; 3324 + high = nr_pcp_high(pcp, zone); 3325 + if (pcp->count >= high) { 3326 + int batch = READ_ONCE(pcp->batch); 3327 + 3328 + free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp); 3329 + } 3330 + } 3331 + 3332 + /* 3333 + * Free a pcp page 3334 + */ 3335 + void free_unref_page(struct page *page, unsigned int order) 3336 + { 3337 + unsigned long flags; 3338 + unsigned long pfn = page_to_pfn(page); 3353 3339 int migratetype; 3354 3340 3355 - migratetype = get_pcppage_migratetype(page); 3356 - __count_vm_event(PGFREE); 3341 + if (!free_unref_page_prepare(page, pfn, order)) 3342 + return; 3357 3343 3358 3344 /* 3359 3345 * We only track unmovable, reclaimable and movable on pcp lists. 
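The two new helpers above drive the adaptive drain batching: `nr_pcp_free()` doubles the drain size via `free_factor` on consecutive frees and clamps it between `batch` and `high - batch`, while `nr_pcp_high()` caps the effective high watermark at `4 * batch` while reclaim is active. A minimal standalone sketch of that arithmetic (the struct and helper names are stand-ins for the kernel's `struct per_cpu_pages` and helpers, not the kernel API itself):

```c
#include <assert.h>

/* Stand-in for the struct per_cpu_pages fields used by this logic. */
struct pcp_sketch {
	int high;        /* pages allowed on the list before draining */
	int batch;       /* chunk size used for refills and drains */
	int free_factor; /* scales the drain batch on consecutive frees */
};

static int clamp_int(int v, int lo, int hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* How many pages to drain once pcp->count reaches the high mark. */
static int nr_pcp_free_sketch(struct pcp_sketch *pcp, int high, int batch)
{
	int min_nr_free, max_nr_free;

	if (high < batch)		/* PCP disabled or boot pageset */
		return 1;

	min_nr_free = batch;		/* keep at least one batch cached */
	max_nr_free = high - batch;

	/* 1x, 2x, 4x, ... the batch on repeated frees with no allocation */
	batch <<= pcp->free_factor;
	if (batch < max_nr_free)
		pcp->free_factor++;

	return clamp_int(batch, min_nr_free, max_nr_free);
}

/* Effective high mark; limited to 4*batch while reclaim is running. */
static int nr_pcp_high_sketch(int high, int batch, int reclaim_active)
{
	if (!high)
		return 0;
	if (!reclaim_active)
		return high;
	return (batch << 2) < high ? batch << 2 : high;
}
```

With `high = 100`, `batch = 10`, the first drain frees 10 pages, the next consecutive drain 20, and so on up to 90, which matches the "double the number of pages freed" comment in the hunk.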
3360 - * Free ISOLATE pages back to the allocator because they are being 3346 + * Place ISOLATE pages on the isolated list because they are being 3361 3347 * offlined but treat HIGHATOMIC as movable pages so we can get those 3362 3348 * areas back if necessary. Otherwise, we may have to free 3363 3349 * excessively into the page allocator 3364 3350 */ 3365 - if (migratetype >= MIGRATE_PCPTYPES) { 3351 + migratetype = get_pcppage_migratetype(page); 3352 + if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { 3366 3353 if (unlikely(is_migrate_isolate(migratetype))) { 3367 - free_one_page(zone, page, pfn, 0, migratetype, 3368 - FPI_NONE); 3354 + free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE); 3369 3355 return; 3370 3356 } 3371 3357 migratetype = MIGRATE_MOVABLE; 3372 3358 } 3373 3359 3374 - pcp = &this_cpu_ptr(zone->pageset)->pcp; 3375 - list_add(&page->lru, &pcp->lists[migratetype]); 3376 - pcp->count++; 3377 - if (pcp->count >= READ_ONCE(pcp->high)) 3378 - free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp); 3379 - } 3380 - 3381 - /* 3382 - * Free a 0-order page 3383 - */ 3384 - void free_unref_page(struct page *page) 3385 - { 3386 - unsigned long flags; 3387 - unsigned long pfn = page_to_pfn(page); 3388 - 3389 - if (!free_unref_page_prepare(page, pfn)) 3390 - return; 3391 - 3392 - local_irq_save(flags); 3393 - free_unref_page_commit(page, pfn); 3394 - local_irq_restore(flags); 3360 + local_lock_irqsave(&pagesets.lock, flags); 3361 + free_unref_page_commit(page, pfn, migratetype, order); 3362 + local_unlock_irqrestore(&pagesets.lock, flags); 3395 3363 } 3396 3364 3397 3365 /* ··· 3449 3323 struct page *page, *next; 3450 3324 unsigned long flags, pfn; 3451 3325 int batch_count = 0; 3326 + int migratetype; 3452 3327 3453 3328 /* Prepare pages for freeing */ 3454 3329 list_for_each_entry_safe(page, next, list, lru) { 3455 3330 pfn = page_to_pfn(page); 3456 - if (!free_unref_page_prepare(page, pfn)) 3331 + if (!free_unref_page_prepare(page, pfn, 0)) 
3457 3332 list_del(&page->lru); 3333 + 3334 + /* 3335 + * Free isolated pages directly to the allocator, see 3336 + * comment in free_unref_page. 3337 + */ 3338 + migratetype = get_pcppage_migratetype(page); 3339 + if (unlikely(migratetype >= MIGRATE_PCPTYPES)) { 3340 + if (unlikely(is_migrate_isolate(migratetype))) { 3341 + list_del(&page->lru); 3342 + free_one_page(page_zone(page), page, pfn, 0, 3343 + migratetype, FPI_NONE); 3344 + continue; 3345 + } 3346 + 3347 + /* 3348 + * Non-isolated types over MIGRATE_PCPTYPES get added 3349 + * to the MIGRATE_MOVABLE pcp list. 3350 + */ 3351 + set_pcppage_migratetype(page, MIGRATE_MOVABLE); 3352 + } 3353 + 3458 3354 set_page_private(page, pfn); 3459 3355 } 3460 3356 3461 - local_irq_save(flags); 3357 + local_lock_irqsave(&pagesets.lock, flags); 3462 3358 list_for_each_entry_safe(page, next, list, lru) { 3463 - unsigned long pfn = page_private(page); 3464 - 3359 + pfn = page_private(page); 3465 3360 set_page_private(page, 0); 3361 + migratetype = get_pcppage_migratetype(page); 3466 3362 trace_mm_page_free_batched(page); 3467 - free_unref_page_commit(page, pfn); 3363 + free_unref_page_commit(page, pfn, migratetype, 0); 3468 3364 3469 3365 /* 3470 3366 * Guard against excessive IRQ disabled times when we get 3471 3367 * a large list of pages to free. 3472 3368 */ 3473 3369 if (++batch_count == SWAP_CLUSTER_MAX) { 3474 - local_irq_restore(flags); 3370 + local_unlock_irqrestore(&pagesets.lock, flags); 3475 3371 batch_count = 0; 3476 - local_irq_save(flags); 3372 + local_lock_irqsave(&pagesets.lock, flags); 3477 3373 } 3478 3374 } 3479 - local_irq_restore(flags); 3375 + local_unlock_irqrestore(&pagesets.lock, flags); 3480 3376 } 3481 3377 3482 3378 /* ··· 3597 3449 * 3598 3450 * Must be called with interrupts disabled. 
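Both `free_unref_page()` and `free_unref_page_list()` above now resolve the pcp list type the same way: ISOLATE pages bypass the pcp and go straight back to the buddy allocator, and anything else at or above `MIGRATE_PCPTYPES` (e.g. HIGHATOMIC) is filed under MOVABLE. A sketch of that dispatch, using a hypothetical enum that only mimics the ordering the kernel relies on:

```c
#include <assert.h>

/* Hypothetical ordering; mirrors only the property the code depends on:
 * the real per-cpu list types sort below MT_PCPTYPES. */
enum mt_sketch {
	MT_UNMOVABLE,
	MT_MOVABLE,
	MT_RECLAIMABLE,
	MT_PCPTYPES,	/* count marker, like MIGRATE_PCPTYPES */
	MT_HIGHATOMIC,
	MT_ISOLATE,
};

/*
 * Which pcp list a freed page should join; -1 means "bypass the pcp
 * and free directly to the buddy allocator" (the ISOLATE case, used
 * while memory is being offlined).
 */
static int pcp_target_migratetype_sketch(int mt)
{
	if (mt >= MT_PCPTYPES) {
		if (mt == MT_ISOLATE)
			return -1;	/* free_one_page() path */
		return MT_MOVABLE;	/* HIGHATOMIC treated as movable */
	}
	return mt;
}
```

Treating HIGHATOMIC as movable lets those reserved areas be reclaimed later, as the hunk's comment explains, instead of freeing excessively into the page allocator.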
3599 3451 */ 3600 - static inline void zone_statistics(struct zone *preferred_zone, struct zone *z) 3452 + static inline void zone_statistics(struct zone *preferred_zone, struct zone *z, 3453 + long nr_account) 3601 3454 { 3602 3455 #ifdef CONFIG_NUMA 3603 3456 enum numa_stat_item local_stat = NUMA_LOCAL; ··· 3611 3462 local_stat = NUMA_OTHER; 3612 3463 3613 3464 if (zone_to_nid(z) == zone_to_nid(preferred_zone)) 3614 - __inc_numa_state(z, NUMA_HIT); 3465 + __count_numa_events(z, NUMA_HIT, nr_account); 3615 3466 else { 3616 - __inc_numa_state(z, NUMA_MISS); 3617 - __inc_numa_state(preferred_zone, NUMA_FOREIGN); 3467 + __count_numa_events(z, NUMA_MISS, nr_account); 3468 + __count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account); 3618 3469 } 3619 - __inc_numa_state(z, local_stat); 3470 + __count_numa_events(z, local_stat, nr_account); 3620 3471 #endif 3621 3472 } 3622 3473 3623 3474 /* Remove page from the per-cpu list, caller must protect the list */ 3624 3475 static inline 3625 - struct page *__rmqueue_pcplist(struct zone *zone, int migratetype, 3476 + struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, 3477 + int migratetype, 3626 3478 unsigned int alloc_flags, 3627 3479 struct per_cpu_pages *pcp, 3628 3480 struct list_head *list) ··· 3632 3482 3633 3483 do { 3634 3484 if (list_empty(list)) { 3635 - pcp->count += rmqueue_bulk(zone, 0, 3636 - READ_ONCE(pcp->batch), list, 3485 + int batch = READ_ONCE(pcp->batch); 3486 + int alloced; 3487 + 3488 + /* 3489 + * Scale batch relative to order if batch implies 3490 + * free pages can be stored on the PCP. Batch can 3491 + * be 1 for small zones or for boot pagesets which 3492 + * should never store free pages as the pages may 3493 + * belong to arbitrary zones. 
3494 + */ 3495 + if (batch > 1) 3496 + batch = max(batch >> order, 2); 3497 + alloced = rmqueue_bulk(zone, order, 3498 + batch, list, 3637 3499 migratetype, alloc_flags); 3500 + 3501 + pcp->count += alloced << order; 3638 3502 if (unlikely(list_empty(list))) 3639 3503 return NULL; 3640 3504 } 3641 3505 3642 3506 page = list_first_entry(list, struct page, lru); 3643 3507 list_del(&page->lru); 3644 - pcp->count--; 3508 + pcp->count -= 1 << order; 3645 3509 } while (check_new_pcp(page)); 3646 3510 3647 3511 return page; ··· 3663 3499 3664 3500 /* Lock and remove page from the per-cpu list */ 3665 3501 static struct page *rmqueue_pcplist(struct zone *preferred_zone, 3666 - struct zone *zone, gfp_t gfp_flags, 3667 - int migratetype, unsigned int alloc_flags) 3502 + struct zone *zone, unsigned int order, 3503 + gfp_t gfp_flags, int migratetype, 3504 + unsigned int alloc_flags) 3668 3505 { 3669 3506 struct per_cpu_pages *pcp; 3670 3507 struct list_head *list; 3671 3508 struct page *page; 3672 3509 unsigned long flags; 3673 3510 3674 - local_irq_save(flags); 3675 - pcp = &this_cpu_ptr(zone->pageset)->pcp; 3676 - list = &pcp->lists[migratetype]; 3677 - page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list); 3511 + local_lock_irqsave(&pagesets.lock, flags); 3512 + 3513 + /* 3514 + * On allocation, reduce the number of pages that are batch freed. 3515 + * See nr_pcp_free() where free_factor is increased for subsequent 3516 + * frees. 
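Since pcp lists can now hold high-order pages, the refill path above scales `pcp->batch` (a page count) down by the order, while still requesting at least two chunks so a second allocation of the same order can be served from the pcp; `pcp->count` is bumped by pages, not list entries. A standalone sketch of that scaling (helper names are mine, not the kernel's):

```c
#include <assert.h>

/*
 * How many order-sized chunks to bulk-allocate when refilling an empty
 * pcp list. A batch of 1 (small zones, boot pagesets) is left alone,
 * since those pagesets should never cache free pages.
 */
static int pcp_refill_batch(int batch, unsigned int order)
{
	if (batch > 1) {
		batch >>= order;	/* batch is in pages, list is in chunks */
		if (batch < 2)
			batch = 2;	/* always grab at least two chunks */
	}
	return batch;
}

/* pcp->count tracks pages, so each list entry contributes 1 << order. */
static int pcp_count_after_refill(int count, int alloced, unsigned int order)
{
	return count + (alloced << order);
}
```

For example, a page batch of 63 becomes 7 order-3 chunks, and even at order 9 the refill still asks for 2 chunks rather than 0.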
3517 + */ 3518 + pcp = this_cpu_ptr(zone->per_cpu_pageset); 3519 + pcp->free_factor >>= 1; 3520 + list = &pcp->lists[order_to_pindex(migratetype, order)]; 3521 + page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); 3522 + local_unlock_irqrestore(&pagesets.lock, flags); 3678 3523 if (page) { 3679 3524 __count_zid_vm_events(PGALLOC, page_zonenum(page), 1); 3680 - zone_statistics(preferred_zone, zone); 3525 + zone_statistics(preferred_zone, zone, 1); 3681 3526 } 3682 - local_irq_restore(flags); 3683 3527 return page; 3684 3528 } 3685 3529 ··· 3703 3531 unsigned long flags; 3704 3532 struct page *page; 3705 3533 3706 - if (likely(order == 0)) { 3534 + if (likely(pcp_allowed_order(order))) { 3707 3535 /* 3708 3536 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and 3709 3537 * we need to skip it when CMA area isn't allowed. 3710 3538 */ 3711 3539 if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA || 3712 3540 migratetype != MIGRATE_MOVABLE) { 3713 - page = rmqueue_pcplist(preferred_zone, zone, gfp_flags, 3714 - migratetype, alloc_flags); 3541 + page = rmqueue_pcplist(preferred_zone, zone, order, 3542 + gfp_flags, migratetype, alloc_flags); 3715 3543 goto out; 3716 3544 } 3717 3545 } ··· 3739 3567 if (!page) 3740 3568 page = __rmqueue(zone, order, migratetype, alloc_flags); 3741 3569 } while (page && check_new_pages(page, order)); 3742 - spin_unlock(&zone->lock); 3743 3570 if (!page) 3744 3571 goto failed; 3572 + 3745 3573 __mod_zone_freepage_state(zone, -(1 << order), 3746 3574 get_pcppage_migratetype(page)); 3575 + spin_unlock_irqrestore(&zone->lock, flags); 3747 3576 3748 3577 __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order); 3749 - zone_statistics(preferred_zone, zone); 3750 - local_irq_restore(flags); 3578 + zone_statistics(preferred_zone, zone, 1); 3751 3579 3752 3580 out: 3753 3581 /* Separate test+clear to avoid unnecessary atomics */ ··· 3760 3588 return page; 3761 3589 3762 3590 failed: 3763 - 
local_irq_restore(flags); 3591 + spin_unlock_irqrestore(&zone->lock, flags); 3764 3592 return NULL; 3765 3593 } 3766 3594 ··· 4434 4262 enum compact_priority priority = *compact_priority; 4435 4263 4436 4264 if (!order) 4265 + return false; 4266 + 4267 + if (fatal_signal_pending(current)) 4437 4268 return false; 4438 4269 4439 4270 if (compaction_made_progress(compact_result)) ··· 5231 5056 struct alloc_context ac; 5232 5057 gfp_t alloc_gfp; 5233 5058 unsigned int alloc_flags = ALLOC_WMARK_LOW; 5234 - int nr_populated = 0; 5059 + int nr_populated = 0, nr_account = 0; 5235 5060 5236 5061 if (unlikely(nr_pages <= 0)) 5237 5062 return 0; ··· 5288 5113 goto failed; 5289 5114 5290 5115 /* Attempt the batch allocation */ 5291 - local_irq_save(flags); 5292 - pcp = &this_cpu_ptr(zone->pageset)->pcp; 5293 - pcp_list = &pcp->lists[ac.migratetype]; 5116 + local_lock_irqsave(&pagesets.lock, flags); 5117 + pcp = this_cpu_ptr(zone->per_cpu_pageset); 5118 + pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)]; 5294 5119 5295 5120 while (nr_populated < nr_pages) { 5296 5121 ··· 5300 5125 continue; 5301 5126 } 5302 5127 5303 - page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags, 5128 + page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, 5304 5129 pcp, pcp_list); 5305 5130 if (unlikely(!page)) { 5306 5131 /* Try and get at least one page */ ··· 5308 5133 goto failed_irq; 5309 5134 break; 5310 5135 } 5311 - 5312 - /* 5313 - * Ideally this would be batched but the best way to do 5314 - * that cheaply is to first convert zone_statistics to 5315 - * be inaccurate per-cpu counter like vm_events to avoid 5316 - * a RMW cycle then do the accounting with IRQs enabled. 
5317 - */ 5318 - __count_zid_vm_events(PGALLOC, zone_idx(zone), 1); 5319 - zone_statistics(ac.preferred_zoneref->zone, zone); 5136 + nr_account++; 5320 5137 5321 5138 prep_new_page(page, 0, gfp, 0); 5322 5139 if (page_list) ··· 5318 5151 nr_populated++; 5319 5152 } 5320 5153 5321 - local_irq_restore(flags); 5154 + local_unlock_irqrestore(&pagesets.lock, flags); 5155 + 5156 + __count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); 5157 + zone_statistics(ac.preferred_zoneref->zone, zone, nr_account); 5322 5158 5323 5159 return nr_populated; 5324 5160 5325 5161 failed_irq: 5326 - local_irq_restore(flags); 5162 + local_unlock_irqrestore(&pagesets.lock, flags); 5327 5163 5328 5164 failed: 5329 5165 page = __alloc_pages(gfp, 0, preferred_nid, nodemask); ··· 5432 5262 return __get_free_pages(gfp_mask | __GFP_ZERO, 0); 5433 5263 } 5434 5264 EXPORT_SYMBOL(get_zeroed_page); 5435 - 5436 - static inline void free_the_page(struct page *page, unsigned int order) 5437 - { 5438 - if (order == 0) /* Via pcp? */ 5439 - free_unref_page(page); 5440 - else 5441 - __free_pages_ok(page, order, FPI_NONE); 5442 - } 5443 5265 5444 5266 /** 5445 5267 * __free_pages - Free pages allocated with alloc_pages(). 
··· 5891 5729 continue; 5892 5730 5893 5731 for_each_online_cpu(cpu) 5894 - free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count; 5732 + free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count; 5895 5733 } 5896 5734 5897 5735 printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" ··· 5983 5821 5984 5822 free_pcp = 0; 5985 5823 for_each_online_cpu(cpu) 5986 - free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count; 5824 + free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count; 5987 5825 5988 5826 show_node(zone); 5989 5827 printk(KERN_CONT ··· 6024 5862 K(zone_page_state(zone, NR_MLOCK)), 6025 5863 K(zone_page_state(zone, NR_BOUNCE)), 6026 5864 K(free_pcp), 6027 - K(this_cpu_read(zone->pageset->pcp.count)), 5865 + K(this_cpu_read(zone->per_cpu_pageset->count)), 6028 5866 K(zone_page_state(zone, NR_FREE_CMA_PAGES))); 6029 5867 printk("lowmem_reserve[]:"); 6030 5868 for (i = 0; i < MAX_NR_ZONES; i++) ··· 6351 6189 * not check if the processor is online before following the pageset pointer. 6352 6190 * Other parts of the kernel may not check if the zone is available. 6353 6191 */ 6354 - static void pageset_init(struct per_cpu_pageset *p); 6192 + static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats); 6355 6193 /* These effectively disable the pcplists in the boot pageset completely */ 6356 6194 #define BOOT_PAGESET_HIGH 0 6357 6195 #define BOOT_PAGESET_BATCH 1 6358 - static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset); 6196 + static DEFINE_PER_CPU(struct per_cpu_pages, boot_pageset); 6197 + static DEFINE_PER_CPU(struct per_cpu_zonestat, boot_zonestats); 6359 6198 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); 6360 6199 6361 6200 static void __build_all_zonelists(void *data) ··· 6423 6260 * (a chicken-egg dilemma). 
6424 6261 */ 6425 6262 for_each_possible_cpu(cpu) 6426 - pageset_init(&per_cpu(boot_pageset, cpu)); 6263 + per_cpu_pages_init(&per_cpu(boot_pageset, cpu), &per_cpu(boot_zonestats, cpu)); 6427 6264 6428 6265 mminit_verify_zonelist(); 6429 6266 cpuset_init_current_mems_allowed(); ··· 6575 6412 return; 6576 6413 6577 6414 /* 6578 - * The call to memmap_init_zone should have already taken care 6415 + * The call to memmap_init should have already taken care 6579 6416 * of the pages reserved for the memmap, so we can just jump to 6580 6417 * the end of that region and start processing the device pages. 6581 6418 */ ··· 6636 6473 } 6637 6474 } 6638 6475 6639 - #if !defined(CONFIG_FLAT_NODE_MEM_MAP) 6476 + #if !defined(CONFIG_FLATMEM) 6640 6477 /* 6641 6478 * Only struct pages that correspond to ranges defined by memblock.memory 6642 6479 * are zeroed and initialized by going through __init_single_page() during 6643 - * memmap_init_zone(). 6480 + * memmap_init_zone_range(). 6644 6481 * 6645 6482 * But, there could be struct pages that correspond to holes in 6646 6483 * memblock.memory. This can happen because of the following reasons: ··· 6659 6496 * zone/node above the hole except for the trailing pages in the last 6660 6497 * section that will be appended to the zone/node below. 
6661 6498 */ 6662 - static u64 __meminit init_unavailable_range(unsigned long spfn, 6663 - unsigned long epfn, 6664 - int zone, int node) 6499 + static void __init init_unavailable_range(unsigned long spfn, 6500 + unsigned long epfn, 6501 + int zone, int node) 6665 6502 { 6666 6503 unsigned long pfn; 6667 6504 u64 pgcnt = 0; ··· 6677 6514 pgcnt++; 6678 6515 } 6679 6516 6680 - return pgcnt; 6517 + if (pgcnt) 6518 + pr_info("On node %d, zone %s: %lld pages in unavailable ranges", 6519 + node, zone_names[zone], pgcnt); 6681 6520 } 6682 6521 #else 6683 - static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn, 6684 - int zone, int node) 6522 + static inline void init_unavailable_range(unsigned long spfn, 6523 + unsigned long epfn, 6524 + int zone, int node) 6685 6525 { 6686 - return 0; 6687 6526 } 6688 6527 #endif 6689 6528 6690 - void __meminit __weak memmap_init_zone(struct zone *zone) 6529 + static void __init memmap_init_zone_range(struct zone *zone, 6530 + unsigned long start_pfn, 6531 + unsigned long end_pfn, 6532 + unsigned long *hole_pfn) 6691 6533 { 6692 6534 unsigned long zone_start_pfn = zone->zone_start_pfn; 6693 6535 unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages; 6694 - int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone); 6695 - static unsigned long hole_pfn; 6536 + int nid = zone_to_nid(zone), zone_id = zone_idx(zone); 6537 + 6538 + start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn); 6539 + end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn); 6540 + 6541 + if (start_pfn >= end_pfn) 6542 + return; 6543 + 6544 + memmap_init_range(end_pfn - start_pfn, nid, zone_id, start_pfn, 6545 + zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); 6546 + 6547 + if (*hole_pfn < start_pfn) 6548 + init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid); 6549 + 6550 + *hole_pfn = end_pfn; 6551 + } 6552 + 6553 + static void __init memmap_init(void) 6554 + { 6696 6555 unsigned long start_pfn, end_pfn; 6697 - u64 
pgcnt = 0; 6556 + unsigned long hole_pfn = 0; 6557 + int i, j, zone_id, nid; 6698 6558 6699 - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { 6700 - start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn); 6701 - end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn); 6559 + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { 6560 + struct pglist_data *node = NODE_DATA(nid); 6702 6561 6703 - if (end_pfn > start_pfn) 6704 - memmap_init_range(end_pfn - start_pfn, nid, 6705 - zone_id, start_pfn, zone_end_pfn, 6706 - MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); 6562 + for (j = 0; j < MAX_NR_ZONES; j++) { 6563 + struct zone *zone = node->node_zones + j; 6707 6564 6708 - if (hole_pfn < start_pfn) 6709 - pgcnt += init_unavailable_range(hole_pfn, start_pfn, 6710 - zone_id, nid); 6711 - hole_pfn = end_pfn; 6565 + if (!populated_zone(zone)) 6566 + continue; 6567 + 6568 + memmap_init_zone_range(zone, start_pfn, end_pfn, 6569 + &hole_pfn); 6570 + zone_id = j; 6571 + } 6712 6572 } 6713 6573 6714 6574 #ifdef CONFIG_SPARSEMEM 6715 6575 /* 6716 - * Initialize the hole in the range [zone_end_pfn, section_end]. 6717 - * If zone boundary falls in the middle of a section, this hole 6718 - * will be re-initialized during the call to this function for the 6719 - * higher zone. 6576 + * Initialize the memory map for hole in the range [memory_end, 6577 + * section_end]. 6578 + * Append the pages in this hole to the highest zone in the last 6579 + * node. 
6580 + * The call to init_unavailable_range() is outside the ifdef to 6581 + * silence the compiler warning about zone_id set but not used; 6582 + * for FLATMEM it is a nop anyway 6720 6583 + */ 6721 - end_pfn = round_up(zone_end_pfn, PAGES_PER_SECTION); 6584 + end_pfn = round_up(end_pfn, PAGES_PER_SECTION); 6722 6585 if (hole_pfn < end_pfn) 6723 - pgcnt += init_unavailable_range(hole_pfn, end_pfn, 6724 - zone_id, nid); 6725 6586 #endif 6726 - 6727 - if (pgcnt) 6728 - pr_info(" %s zone: %llu pages in unavailable ranges\n", 6729 - zone->name, pgcnt); 6587 + init_unavailable_range(hole_pfn, end_pfn, zone_id, nid); 6730 6588 } 6731 6589 6732 6590 static int zone_batchsize(struct zone *zone) ··· 6756 6572 int batch; 6757 6573 6758 6574 /* 6759 - * The per-cpu-pages pools are set to around 1000th of the 6760 - * size of the zone. 6575 + * The number of pages to batch allocate is either ~0.1% 6576 + * of the zone or 1MB, whichever is smaller. The batch 6577 + * size is striking a balance between allocation latency 6578 + * and zone lock contention. 6761 6579 */ 6762 - batch = zone_managed_pages(zone) / 1024; 6763 - /* But no more than a meg. */ 6764 - if (batch * PAGE_SIZE > 1024 * 1024) 6765 - batch = (1024 * 1024) / PAGE_SIZE; 6580 + batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE); 6766 6581 batch /= 4; /* We effectively *= 4 below */ 6767 6582 if (batch < 1) 6768 6583 batch = 1; ··· 6798 6615 #endif 6799 6616 } 6800 6617 6618 + static int zone_highsize(struct zone *zone, int batch, int cpu_online) 6619 + { 6620 + #ifdef CONFIG_MMU 6621 + int high; 6622 + int nr_split_cpus; 6623 + unsigned long total_pages; 6624 + 6625 + if (!percpu_pagelist_high_fraction) { 6626 + /* 6627 + * By default, the high value of the pcp is based on the zone 6628 + * low watermark so that if they are full then background 6629 + * reclaim will not be started prematurely.
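The rewritten `zone_batchsize()` comment above pins down the formula: roughly 0.1% of the zone's managed pages, capped at 1MB worth of pages, then quartered because callers effectively multiply by four again. A sketch of just the arithmetic shown in this hunk (the function's elided remainder adjusts the value further, so this is not the full kernel computation):

```c
#include <assert.h>

/*
 * ~0.1% of the zone or 1MB worth of pages, whichever is smaller,
 * then divided by 4 (the caller effectively multiplies by 4 again),
 * with a floor of 1 page.
 */
static long zone_batchsize_sketch(unsigned long managed_pages, long page_size)
{
	long cap = (1024 * 1024) / page_size;	/* 1MB expressed in pages */
	long batch = (long)(managed_pages >> 10);

	if (batch > cap)
		batch = cap;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	return batch;
}
```

On a 4KB-page system, any zone of 1GB or more hits the 1MB cap and gets a batch of 64 pages, while tiny zones bottom out at 1.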
6630 + */ 6631 + total_pages = low_wmark_pages(zone); 6632 + } else { 6633 + /* 6634 + * If percpu_pagelist_high_fraction is configured, the high 6635 + * value is based on a fraction of the managed pages in the 6636 + * zone. 6637 + */ 6638 + total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction; 6639 + } 6640 + 6641 + /* 6642 + * Split the high value across all online CPUs local to the zone. Note 6643 + * that early in boot that CPUs may not be online yet and that during 6644 + * CPU hotplug that the cpumask is not yet updated when a CPU is being 6645 + * onlined. For memory nodes that have no CPUs, split pcp->high across 6646 + * all online CPUs to mitigate the risk that reclaim is triggered 6647 + * prematurely due to pages stored on pcp lists. 6648 + */ 6649 + nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online; 6650 + if (!nr_split_cpus) 6651 + nr_split_cpus = num_online_cpus(); 6652 + high = total_pages / nr_split_cpus; 6653 + 6654 + /* 6655 + * Ensure high is at least batch*4. The multiple is based on the 6656 + * historical relationship between high and batch. 6657 + */ 6658 + high = max(high, batch << 2); 6659 + 6660 + return high; 6661 + #else 6662 + return 0; 6663 + #endif 6664 + } 6665 + 6801 6666 /* 6802 6667 * pcp->high and pcp->batch values are related and generally batch is lower 6803 6668 * than high. 
They are also related to pcp->count such that count is lower ··· 6869 6638 WRITE_ONCE(pcp->high, high); 6870 6639 } 6871 6640 6872 - static void pageset_init(struct per_cpu_pageset *p) 6641 + static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats) 6873 6642 { 6874 - struct per_cpu_pages *pcp; 6875 - int migratetype; 6643 + int pindex; 6876 6644 6877 - memset(p, 0, sizeof(*p)); 6645 + memset(pcp, 0, sizeof(*pcp)); 6646 + memset(pzstats, 0, sizeof(*pzstats)); 6878 6647 6879 - pcp = &p->pcp; 6880 - for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++) 6881 - INIT_LIST_HEAD(&pcp->lists[migratetype]); 6648 + for (pindex = 0; pindex < NR_PCP_LISTS; pindex++) 6649 + INIT_LIST_HEAD(&pcp->lists[pindex]); 6882 6650 6883 6651 /* 6884 6652 * Set batch and high values safe for a boot pageset. A true percpu ··· 6887 6657 */ 6888 6658 pcp->high = BOOT_PAGESET_HIGH; 6889 6659 pcp->batch = BOOT_PAGESET_BATCH; 6660 + pcp->free_factor = 0; 6890 6661 } 6891 6662 6892 6663 static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high, 6893 6664 unsigned long batch) 6894 6665 { 6895 - struct per_cpu_pageset *p; 6666 + struct per_cpu_pages *pcp; 6896 6667 int cpu; 6897 6668 6898 6669 for_each_possible_cpu(cpu) { 6899 - p = per_cpu_ptr(zone->pageset, cpu); 6900 - pageset_update(&p->pcp, high, batch); 6670 + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); 6671 + pageset_update(pcp, high, batch); 6901 6672 } 6902 6673 } 6903 6674 6904 6675 /* 6905 6676 * Calculate and set new high and batch values for all per-cpu pagesets of a 6906 - * zone, based on the zone's size and the percpu_pagelist_fraction sysctl. 6677 + * zone based on the zone's size. 
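The new `zone_highsize()` above picks a total budget (the zone's low watermark by default, or `managed_pages / percpu_pagelist_high_fraction` when the sysctl is set), splits it across the CPUs local to the zone, falls back to all online CPUs for memory-only nodes, and floors the result at `4 * batch`. A standalone sketch of that computation, with the zone and cpumask lookups replaced by plain parameters (names are mine):

```c
#include <assert.h>

/*
 * Per-CPU "high" mark for a zone's pcp lists.
 * nr_local_cpus should already include a CPU that is mid-onlining,
 * mirroring the cpu_online adjustment in the kernel version.
 */
static int zone_highsize_sketch(unsigned long low_wmark,
				unsigned long managed_pages,
				int high_fraction,	/* 0 = use low watermark */
				int nr_local_cpus,
				int nr_online_cpus,
				int batch)
{
	unsigned long total_pages;
	int high;

	if (!high_fraction)
		total_pages = low_wmark;	/* full pcps won't start reclaim early */
	else
		total_pages = managed_pages / high_fraction;

	if (!nr_local_cpus)			/* cpuless (memory-only) node */
		nr_local_cpus = nr_online_cpus;
	high = total_pages / nr_local_cpus;

	/* historical relationship: high is at least 4*batch */
	return high < (batch << 2) ? batch << 2 : high;
}
```

For a zone with a 4096-page low watermark on a 4-CPU node, each CPU's pcp high is 1024 pages; the same zone on a cpuless node splits across all online CPUs instead.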
6907 6678 */ 6908 - static void zone_set_pageset_high_and_batch(struct zone *zone) 6679 + static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online) 6909 6680 { 6910 - unsigned long new_high, new_batch; 6681 + int new_high, new_batch; 6911 6682 6912 - if (percpu_pagelist_fraction) { 6913 - new_high = zone_managed_pages(zone) / percpu_pagelist_fraction; 6914 - new_batch = max(1UL, new_high / 4); 6915 - if ((new_high / 4) > (PAGE_SHIFT * 8)) 6916 - new_batch = PAGE_SHIFT * 8; 6917 - } else { 6918 - new_batch = zone_batchsize(zone); 6919 - new_high = 6 * new_batch; 6920 - new_batch = max(1UL, 1 * new_batch); 6921 - } 6683 + new_batch = max(1, zone_batchsize(zone)); 6684 + new_high = zone_highsize(zone, new_batch, cpu_online); 6922 6685 6923 6686 if (zone->pageset_high == new_high && 6924 6687 zone->pageset_batch == new_batch) ··· 6925 6702 6926 6703 void __meminit setup_zone_pageset(struct zone *zone) 6927 6704 { 6928 - struct per_cpu_pageset *p; 6929 6705 int cpu; 6930 6706 6931 - zone->pageset = alloc_percpu(struct per_cpu_pageset); 6707 + /* Size may be 0 on !SMP && !NUMA */ 6708 + if (sizeof(struct per_cpu_zonestat) > 0) 6709 + zone->per_cpu_zonestats = alloc_percpu(struct per_cpu_zonestat); 6710 + 6711 + zone->per_cpu_pageset = alloc_percpu(struct per_cpu_pages); 6932 6712 for_each_possible_cpu(cpu) { 6933 - p = per_cpu_ptr(zone->pageset, cpu); 6934 - pageset_init(p); 6713 + struct per_cpu_pages *pcp; 6714 + struct per_cpu_zonestat *pzstats; 6715 + 6716 + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); 6717 + pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); 6718 + per_cpu_pages_init(pcp, pzstats); 6935 6719 } 6936 6720 6937 - zone_set_pageset_high_and_batch(zone); 6721 + zone_set_pageset_high_and_batch(zone, 0); 6938 6722 } 6939 6723 6940 6724 /* ··· 6965 6735 * the nodes these zones are associated with. 
6966 6736 */ 6967 6737 for_each_possible_cpu(cpu) { 6968 - struct per_cpu_pageset *pcp = &per_cpu(boot_pageset, cpu); 6969 - memset(pcp->vm_numa_stat_diff, 0, 6970 - sizeof(pcp->vm_numa_stat_diff)); 6738 + struct per_cpu_zonestat *pzstats = &per_cpu(boot_zonestats, cpu); 6739 + memset(pzstats->vm_numa_event, 0, 6740 + sizeof(pzstats->vm_numa_event)); 6971 6741 } 6972 6742 #endif 6973 6743 ··· 6983 6753 * relies on the ability of the linker to provide the 6984 6754 * offset of a (static) per cpu variable into the per cpu area. 6985 6755 */ 6986 - zone->pageset = &boot_pageset; 6756 + zone->per_cpu_pageset = &boot_pageset; 6757 + zone->per_cpu_zonestats = &boot_zonestats; 6987 6758 zone->pageset_high = BOOT_PAGESET_HIGH; 6988 6759 zone->pageset_batch = BOOT_PAGESET_BATCH; 6989 6760 6990 6761 if (populated_zone(zone)) 6991 - printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n", 6992 - zone->name, zone->present_pages, 6993 - zone_batchsize(zone)); 6762 + pr_debug(" %s zone: %lu pages, LIFO batch:%u\n", zone->name, 6763 + zone->present_pages, zone_batchsize(zone)); 6994 6764 } 6995 6765 6996 6766 void __meminit init_currently_empty_zone(struct zone *zone, ··· 7260 7030 7261 7031 pgdat->node_spanned_pages = totalpages; 7262 7032 pgdat->node_present_pages = realtotalpages; 7263 - printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, 7264 - realtotalpages); 7033 + pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages); 7265 7034 } 7266 7035 7267 7036 #ifndef CONFIG_SPARSEMEM ··· 7460 7231 if (freesize >= memmap_pages) { 7461 7232 freesize -= memmap_pages; 7462 7233 if (memmap_pages) 7463 - printk(KERN_DEBUG 7464 - " %s zone: %lu pages used for memmap\n", 7465 - zone_names[j], memmap_pages); 7234 + pr_debug(" %s zone: %lu pages used for memmap\n", 7235 + zone_names[j], memmap_pages); 7466 7236 } else 7467 - pr_warn(" %s zone: %lu pages exceeds freesize %lu\n", 7237 + pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n", 7468 7238 
zone_names[j], memmap_pages, freesize); 7469 7239 } 7470 7240 7471 7241 /* Account for reserved pages */ 7472 7242 if (j == 0 && freesize > dma_reserve) { 7473 7243 freesize -= dma_reserve; 7474 - printk(KERN_DEBUG " %s zone: %lu pages reserved\n", 7475 - zone_names[0], dma_reserve); 7244 + pr_debug(" %s zone: %lu pages reserved\n", zone_names[0], dma_reserve); 7476 7245 } 7477 7246 7478 7247 if (!is_highmem_idx(j)) ··· 7493 7266 set_pageblock_order(); 7494 7267 setup_usemap(zone); 7495 7268 init_currently_empty_zone(zone, zone->zone_start_pfn, size); 7496 - memmap_init_zone(zone); 7497 7269 } 7498 7270 } 7499 7271 7500 - #ifdef CONFIG_FLAT_NODE_MEM_MAP 7272 + #ifdef CONFIG_FLATMEM 7501 7273 static void __ref alloc_node_mem_map(struct pglist_data *pgdat) 7502 7274 { 7503 7275 unsigned long __maybe_unused start = 0; ··· 7531 7305 pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n", 7532 7306 __func__, pgdat->node_id, (unsigned long)pgdat, 7533 7307 (unsigned long)pgdat->node_mem_map); 7534 - #ifndef CONFIG_NEED_MULTIPLE_NODES 7308 + #ifndef CONFIG_NUMA 7535 7309 /* 7536 7310 * With no DISCONTIG, the global mem_map is just set as node 0's 7537 7311 */ ··· 7544 7318 } 7545 7319 #else 7546 7320 static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { } 7547 - #endif /* CONFIG_FLAT_NODE_MEM_MAP */ 7321 + #endif /* CONFIG_FLATMEM */ 7548 7322 7549 7323 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT 7550 7324 static inline void pgdat_set_deferred_range(pg_data_t *pgdat) ··· 8018 7792 node_set_state(nid, N_MEMORY); 8019 7793 check_for_memory(pgdat, nid); 8020 7794 } 7795 + 7796 + memmap_init(); 8021 7797 } 8022 7798 8023 7799 static int __init cmdline_parse_core(char *p, unsigned long *core, ··· 8196 7968 8197 7969 static int page_alloc_cpu_dead(unsigned int cpu) 8198 7970 { 7971 + struct zone *zone; 8199 7972 8200 7973 lru_add_drain_cpu(cpu); 8201 7974 drain_pages(cpu); ··· 8217 7988 * race with what we are doing. 
8218 7989 */ 8219 7990 cpu_vm_stats_fold(cpu); 7991 + 7992 + for_each_populated_zone(zone) 7993 + zone_pcp_update(zone, 0); 7994 + 7995 + return 0; 7996 + } 7997 + 7998 + static int page_alloc_cpu_online(unsigned int cpu) 7999 + { 8000 + struct zone *zone; 8001 + 8002 + for_each_populated_zone(zone) 8003 + zone_pcp_update(zone, 1); 8220 8004 return 0; 8221 8005 } 8222 8006 ··· 8255 8013 hashdist = 0; 8256 8014 #endif 8257 8015 8258 - ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD, 8259 - "mm/page_alloc:dead", NULL, 8016 + ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC, 8017 + "mm/page_alloc:pcp", 8018 + page_alloc_cpu_online, 8260 8019 page_alloc_cpu_dead); 8261 8020 WARN_ON(ret < 0); 8262 8021 } ··· 8320 8077 unsigned long managed_pages = 0; 8321 8078 8322 8079 for (j = i + 1; j < MAX_NR_ZONES; j++) { 8323 - if (clear) { 8324 - zone->lowmem_reserve[j] = 0; 8325 - } else { 8326 - struct zone *upper_zone = &pgdat->node_zones[j]; 8080 + struct zone *upper_zone = &pgdat->node_zones[j]; 8327 8081 8328 - managed_pages += zone_managed_pages(upper_zone); 8082 + managed_pages += zone_managed_pages(upper_zone); 8083 + 8084 + if (clear) 8085 + zone->lowmem_reserve[j] = 0; 8086 + else 8329 8087 zone->lowmem_reserve[j] = managed_pages / ratio; 8330 - } 8331 8088 } 8332 8089 } 8333 8090 } ··· 8407 8164 */ 8408 8165 void setup_per_zone_wmarks(void) 8409 8166 { 8167 + struct zone *zone; 8410 8168 static DEFINE_SPINLOCK(lock); 8411 8169 8412 8170 spin_lock(&lock); 8413 8171 __setup_per_zone_wmarks(); 8414 8172 spin_unlock(&lock); 8173 + 8174 + /* 8175 + * The watermark sizes have changed so update the pcpu batch 8176 + * and high limits or the limits may be inappropriate. 8177 + */ 8178 + for_each_zone(zone) 8179 + zone_pcp_update(zone, 0); 8415 8180 } 8416 8181 8417 8182 /* ··· 8598 8347 } 8599 8348 8600 8349 /* 8601 - * percpu_pagelist_fraction - changes the pcp->high for each zone on each 8602 - * cpu.
It is the fraction of total pages in each zone that a hot per cpu 8350 + * percpu_pagelist_high_fraction - changes the pcp->high for each zone on each 8351 + * cpu. It is the fraction of total pages in each zone that a hot per cpu 8603 8352 * pagelist can have before it gets flushed back to buddy allocator. 8604 8353 */ 8605 - int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write, 8606 - void *buffer, size_t *length, loff_t *ppos) 8354 + int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table, 8355 + int write, void *buffer, size_t *length, loff_t *ppos) 8607 8356 { 8608 8357 struct zone *zone; 8609 - int old_percpu_pagelist_fraction; 8358 + int old_percpu_pagelist_high_fraction; 8610 8359 int ret; 8611 8360 8612 8361 mutex_lock(&pcp_batch_high_lock); 8613 - old_percpu_pagelist_fraction = percpu_pagelist_fraction; 8362 + old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction; 8614 8363 8615 8364 ret = proc_dointvec_minmax(table, write, buffer, length, ppos); 8616 8365 if (!write || ret < 0) 8617 8366 goto out; 8618 8367 8619 8368 /* Sanity checking to avoid pcp imbalance */ 8620 - if (percpu_pagelist_fraction && 8621 - percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) { 8622 - percpu_pagelist_fraction = old_percpu_pagelist_fraction; 8369 + if (percpu_pagelist_high_fraction && 8370 + percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) { 8371 + percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction; 8623 8372 ret = -EINVAL; 8624 8373 goto out; 8625 8374 } 8626 8375 8627 8376 /* No change? 
*/ 8628 - if (percpu_pagelist_fraction == old_percpu_pagelist_fraction) 8377 + if (percpu_pagelist_high_fraction == old_percpu_pagelist_high_fraction) 8629 8378 goto out; 8630 8379 8631 8380 for_each_populated_zone(zone) 8632 - zone_set_pageset_high_and_batch(zone); 8381 + zone_set_pageset_high_and_batch(zone, 0); 8633 8382 out: 8634 8383 mutex_unlock(&pcp_batch_high_lock); 8635 8384 return ret; ··· 8984 8733 8985 8734 lru_cache_enable(); 8986 8735 if (ret < 0) { 8987 - alloc_contig_dump_pages(&cc->migratepages); 8736 + if (ret == -EBUSY) 8737 + alloc_contig_dump_pages(&cc->migratepages); 8988 8738 putback_movable_pages(&cc->migratepages); 8989 8739 return ret; 8990 8740 } ··· 9258 9006 * The zone indicated has a new number of managed_pages; batch sizes and percpu 9259 9007 * page high values need to be recalculated. 9260 9008 */ 9261 - void __meminit zone_pcp_update(struct zone *zone) 9009 + void zone_pcp_update(struct zone *zone, int cpu_online) 9262 9010 { 9263 9011 mutex_lock(&pcp_batch_high_lock); 9264 - zone_set_pageset_high_and_batch(zone); 9012 + zone_set_pageset_high_and_batch(zone, cpu_online); 9265 9013 mutex_unlock(&pcp_batch_high_lock); 9266 9014 } 9267 9015 ··· 9289 9037 void zone_pcp_reset(struct zone *zone) 9290 9038 { 9291 9039 int cpu; 9292 - struct per_cpu_pageset *pset; 9040 + struct per_cpu_zonestat *pzstats; 9293 9041 9294 - if (zone->pageset != &boot_pageset) { 9042 + if (zone->per_cpu_pageset != &boot_pageset) { 9295 9043 for_each_online_cpu(cpu) { 9296 - pset = per_cpu_ptr(zone->pageset, cpu); 9297 - drain_zonestat(zone, pset); 9044 + pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); 9045 + drain_zonestat(zone, pzstats); 9298 9046 } 9299 - free_percpu(zone->pageset); 9300 - zone->pageset = &boot_pageset; 9047 + free_percpu(zone->per_cpu_pageset); 9048 + free_percpu(zone->per_cpu_zonestats); 9049 + zone->per_cpu_pageset = &boot_pageset; 9050 + zone->per_cpu_zonestats = &boot_zonestats; 9301 9051 } 9302 9052 } 9303 9053
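The sanity check in the renamed percpu_pagelist_high_fraction handler can be read in isolation: zero disables the fraction, while a nonzero value below the minimum is rejected and the previous setting restored. A minimal userspace sketch of that logic (the function name is illustrative; the minimum of 8 matches the kernel's MIN_PERCPU_PAGELIST_HIGH_FRACTION but is restated here as an assumption):

```c
#include <errno.h>

#define MIN_PERCPU_PAGELIST_HIGH_FRACTION 8	/* assumed kernel minimum */

static int percpu_pagelist_high_fraction;

/*
 * Mimics the sysctl handler's check: 0 disables the fraction, and a
 * nonzero value below the minimum is rejected, restoring the old value
 * to avoid a pcp imbalance.
 */
static int set_pcp_high_fraction(int new_fraction)
{
	int old = percpu_pagelist_high_fraction;

	percpu_pagelist_high_fraction = new_fraction;
	if (percpu_pagelist_high_fraction &&
	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) {
		percpu_pagelist_high_fraction = old;
		return -EINVAL;
	}
	return 0;
}
```

In the kernel the accepted value then feeds zone_set_pageset_high_and_batch() for each populated zone, under pcp_batch_high_lock.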
+1 -1
mm/page_ext.c
··· 191 191 panic("Out of memory"); 192 192 } 193 193 194 - #else /* CONFIG_FLAT_NODE_MEM_MAP */ 194 + #else /* CONFIG_FLATMEM */ 195 195 196 196 struct page_ext *lookup_page_ext(const struct page *page) 197 197 {
+1 -1
mm/page_owner.c
··· 392 392 return -ENOMEM; 393 393 } 394 394 395 - void __dump_page_owner(struct page *page) 395 + void __dump_page_owner(const struct page *page) 396 396 { 397 397 struct page_ext *page_ext = lookup_page_ext(page); 398 398 struct page_owner *page_owner;
+15 -4
mm/page_reporting.c
··· 4 4 #include <linux/page_reporting.h> 5 5 #include <linux/gfp.h> 6 6 #include <linux/export.h> 7 + #include <linux/module.h> 7 8 #include <linux/delay.h> 8 9 #include <linux/scatterlist.h> 9 10 10 11 #include "page_reporting.h" 11 12 #include "internal.h" 13 + 14 + unsigned int page_reporting_order = MAX_ORDER; 15 + module_param(page_reporting_order, uint, 0644); 16 + MODULE_PARM_DESC(page_reporting_order, "Set page reporting order"); 12 17 13 18 #define PAGE_REPORTING_DELAY (2 * HZ) 14 19 static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly; ··· 36 31 return; 37 32 38 33 /* 39 - * If reporting is already active there is nothing we need to do. 40 - * Test against 0 as that represents PAGE_REPORTING_IDLE. 34 + * If reporting is already active there is nothing we need to do. 35 + * Test against 0 as that represents PAGE_REPORTING_IDLE. 41 36 */ 42 37 state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED); 43 38 if (state != PAGE_REPORTING_IDLE) ··· 234 229 235 230 /* Generate minimum watermark to be able to guarantee progress */ 236 231 watermark = low_wmark_pages(zone) + 237 - (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER); 232 + (PAGE_REPORTING_CAPACITY << page_reporting_order); 238 233 239 234 /* 240 235 * Cancel request if insufficient free memory or if we failed ··· 244 239 return err; 245 240 246 241 /* Process each free list starting from lowest order/mt */ 247 - for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) { 242 + for (order = page_reporting_order; order < MAX_ORDER; order++) { 248 243 for (mt = 0; mt < MIGRATE_TYPES; mt++) { 249 244 /* We do not pull pages from the isolate free list */ 250 245 if (is_migrate_isolate(mt)) ··· 328 323 err = -EBUSY; 329 324 goto err_out; 330 325 } 326 + 327 + /* 328 + * Update the page reporting order if it's specified by driver. 329 + * Otherwise, it falls back to @pageblock_order. 330 + */ 331 + page_reporting_order = prdev->order ? 
: pageblock_order; 331 332 332 333 /* initialize state and work structures */ 333 334 atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
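The `prdev->order ? : pageblock_order` assignment above uses the GNU conditional-with-omitted-operand extension: `a ? : b` evaluates to `a` when `a` is nonzero and to `b` otherwise, evaluating `a` only once. A standalone sketch (function and parameter names are illustrative, not kernel symbols):

```c
/*
 * GCC/Clang extension: "x ? : y" is shorthand for "x ? x : y", except
 * that x is evaluated only once. page_reporting uses it to fall back
 * to a default order when the driver leaves its order field at 0.
 */
static unsigned int pick_order(unsigned int driver_order,
			       unsigned int fallback_order)
{
	return driver_order ? : fallback_order;
}
```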
+2 -3
mm/page_reporting.h
··· 10 10 #include <linux/pgtable.h> 11 11 #include <linux/scatterlist.h> 12 12 13 - #define PAGE_REPORTING_MIN_ORDER pageblock_order 14 - 15 13 #ifdef CONFIG_PAGE_REPORTING 16 14 DECLARE_STATIC_KEY_FALSE(page_reporting_enabled); 15 + extern unsigned int page_reporting_order; 17 16 void __page_reporting_notify(void); 18 17 19 18 static inline bool page_reported(struct page *page) ··· 37 38 return; 38 39 39 40 /* Determine if we have crossed reporting threshold */ 40 - if (order < PAGE_REPORTING_MIN_ORDER) 41 + if (order < page_reporting_order) 41 42 return; 42 43 43 44 /* This will add a few cycles, but should be called infrequently */
+53 -5
mm/pagewalk.c
··· 58 58 return err; 59 59 } 60 60 61 + #ifdef CONFIG_ARCH_HAS_HUGEPD 62 + static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr, 63 + unsigned long end, struct mm_walk *walk, int pdshift) 64 + { 65 + int err = 0; 66 + const struct mm_walk_ops *ops = walk->ops; 67 + int shift = hugepd_shift(*phpd); 68 + int page_size = 1 << shift; 69 + 70 + if (!ops->pte_entry) 71 + return 0; 72 + 73 + if (addr & (page_size - 1)) 74 + return 0; 75 + 76 + for (;;) { 77 + pte_t *pte; 78 + 79 + spin_lock(&walk->mm->page_table_lock); 80 + pte = hugepte_offset(*phpd, addr, pdshift); 81 + err = ops->pte_entry(pte, addr, addr + page_size, walk); 82 + spin_unlock(&walk->mm->page_table_lock); 83 + 84 + if (err) 85 + break; 86 + if (addr >= end - page_size) 87 + break; 88 + addr += page_size; 89 + } 90 + return err; 91 + } 92 + #else 93 + static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr, 94 + unsigned long end, struct mm_walk *walk, int pdshift) 95 + { 96 + return 0; 97 + } 98 + #endif 99 + 61 100 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, 62 101 struct mm_walk *walk) 63 102 { ··· 147 108 goto again; 148 109 } 149 110 150 - err = walk_pte_range(pmd, addr, next, walk); 111 + if (is_hugepd(__hugepd(pmd_val(*pmd)))) 112 + err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT); 113 + else 114 + err = walk_pte_range(pmd, addr, next, walk); 151 115 if (err) 152 116 break; 153 117 } while (pmd++, addr = next, addr != end); ··· 199 157 if (pud_none(*pud)) 200 158 goto again; 201 159 202 - err = walk_pmd_range(pud, addr, next, walk); 160 + if (is_hugepd(__hugepd(pud_val(*pud)))) 161 + err = walk_hugepd_range((hugepd_t *)pud, addr, next, walk, PUD_SHIFT); 162 + else 163 + err = walk_pmd_range(pud, addr, next, walk); 203 164 if (err) 204 165 break; 205 166 } while (pud++, addr = next, addr != end); ··· 234 189 if (err) 235 190 break; 236 191 } 237 - if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) 192 + if 
(is_hugepd(__hugepd(p4d_val(*p4d)))) 193 + err = walk_hugepd_range((hugepd_t *)p4d, addr, next, walk, P4D_SHIFT); 194 + else if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) 238 195 err = walk_pud_range(p4d, addr, next, walk); 239 196 if (err) 240 197 break; ··· 271 224 if (err) 272 225 break; 273 226 } 274 - if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || 275 - ops->pte_entry) 227 + if (is_hugepd(__hugepd(pgd_val(*pgd)))) 228 + err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT); 229 + else if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry) 276 230 err = walk_p4d_range(pgd, addr, next, walk); 277 231 if (err) 278 232 break;
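The new walk_hugepd_range() bails out on a start address that is not aligned to the huge page size, then steps through the range one huge page at a time, stopping at the first callback error or at the end of the range. The stepping logic can be modelled in userspace (the function below is a stand-in that counts steps instead of invoking ops->pte_entry):

```c
/*
 * Sketch of walk_hugepd_range()'s loop: an unaligned start address is
 * skipped entirely (returns 0 steps); otherwise the range is visited
 * one page_size stride at a time, with the same end-of-range test as
 * the kernel loop (addr >= end - page_size).
 */
static int count_hugepd_steps(unsigned long addr, unsigned long end,
			      unsigned long page_size)
{
	int steps = 0;

	if (addr & (page_size - 1))
		return 0;

	for (;;) {
		steps++;		/* kernel: ops->pte_entry(...) */
		if (addr >= end - page_size)
			break;
		addr += page_size;
	}
	return steps;
}
```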
+15 -3
mm/shmem.c
··· 1695 1695 { 1696 1696 struct address_space *mapping = inode->i_mapping; 1697 1697 struct shmem_inode_info *info = SHMEM_I(inode); 1698 - struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm; 1699 - struct page *page; 1698 + struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL; 1699 + struct swap_info_struct *si; 1700 + struct page *page = NULL; 1700 1701 swp_entry_t swap; 1701 1702 int error; 1702 1703 ··· 1705 1704 swap = radix_to_swp_entry(*pagep); 1706 1705 *pagep = NULL; 1707 1706 1707 + /* Prevent swapoff from happening to us. */ 1708 + si = get_swap_device(swap); 1709 + if (!si) { 1710 + error = -EINVAL; 1711 + goto failed; 1712 + } 1708 1713 /* Look it up and read it in.. */ 1709 1714 page = lookup_swap_cache(swap, NULL, 0); 1710 1715 if (!page) { ··· 1772 1765 swap_free(swap); 1773 1766 1774 1767 *pagep = page; 1768 + if (si) 1769 + put_swap_device(si); 1775 1770 return 0; 1776 1771 failed: 1777 1772 if (!shmem_confirm_swap(mapping, index, swap)) ··· 1783 1774 unlock_page(page); 1784 1775 put_page(page); 1785 1776 } 1777 + 1778 + if (si) 1779 + put_swap_device(si); 1786 1780 1787 1781 return error; 1788 1782 } ··· 1828 1816 } 1829 1817 1830 1818 sbinfo = SHMEM_SB(inode->i_sb); 1831 - charge_mm = vma ? vma->vm_mm : current->mm; 1819 + charge_mm = vma ? vma->vm_mm : NULL; 1832 1820 1833 1821 page = pagecache_get_page(mapping, index, 1834 1822 FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
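The shmem change follows a standard get/put stabilization pattern: take a reference that pins the object (here the swap_info_struct, via get_swap_device()) before touching it, fail gracefully if the object is already being torn down, and drop the reference on every exit path. A userspace sketch of the shape (the struct, refcounting, and names are illustrative stand-ins, not kernel APIs):

```c
#include <errno.h>
#include <stddef.h>

struct device_like {
	int refs;	/* 0 means the object is being torn down */
};

/* get_swap_device() analogue: NULL when the object is already dying. */
static struct device_like *get_ref(struct device_like *d)
{
	if (d->refs == 0)
		return NULL;
	d->refs++;
	return d;
}

static void put_ref(struct device_like *d)
{
	d->refs--;
}

/*
 * Every path out of the critical section pairs the get with a put,
 * mirroring the put_swap_device() calls added on both the success and
 * the failure path of shmem_swapin_page().
 */
static int use_device(struct device_like *d)
{
	struct device_like *ref = get_ref(d);

	if (!ref)
		return -EINVAL;
	/* ... work with the pinned object ... */
	put_ref(ref);
	return 0;
}
```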
+9 -15
mm/slab.h
··· 215 215 DECLARE_STATIC_KEY_FALSE(slub_debug_enabled); 216 216 #endif 217 217 extern void print_tracking(struct kmem_cache *s, void *object); 218 + long validate_slab_cache(struct kmem_cache *s); 218 219 #else 219 220 static inline void print_tracking(struct kmem_cache *s, void *object) 220 221 { ··· 240 239 #ifdef CONFIG_MEMCG_KMEM 241 240 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, 242 241 gfp_t gfp, bool new_page); 242 + void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, 243 + enum node_stat_item idx, int nr); 243 244 244 245 static inline void memcg_free_page_obj_cgroups(struct page *page) 245 246 { ··· 286 283 return true; 287 284 } 288 285 289 - static inline void mod_objcg_state(struct obj_cgroup *objcg, 290 - struct pglist_data *pgdat, 291 - enum node_stat_item idx, int nr) 292 - { 293 - struct mem_cgroup *memcg; 294 - struct lruvec *lruvec; 295 - 296 - rcu_read_lock(); 297 - memcg = obj_cgroup_memcg(objcg); 298 - lruvec = mem_cgroup_lruvec(memcg, pgdat); 299 - mod_memcg_lruvec_state(lruvec, idx, nr); 300 - rcu_read_unlock(); 301 - } 302 - 303 286 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, 304 287 struct obj_cgroup *objcg, 305 288 gfp_t flags, size_t size, ··· 298 309 if (!memcg_kmem_enabled() || !objcg) 299 310 return; 300 311 301 - flags &= ~__GFP_ACCOUNT; 302 312 for (i = 0; i < size; i++) { 303 313 if (likely(p[i])) { 304 314 page = virt_to_head_page(p[i]); ··· 617 629 (c->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON))); 618 630 return false; 619 631 } 632 + 633 + #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SLUB_DEBUG) 634 + void debugfs_slab_release(struct kmem_cache *); 635 + #else 636 + static inline void debugfs_slab_release(struct kmem_cache *s) { } 637 + #endif 620 638 621 639 #ifdef CONFIG_PRINTK 622 640 #define KS_ADDRS_COUNT 16
+41 -19
mm/slab_common.c
··· 377 377 378 378 if (err) { 379 379 if (flags & SLAB_PANIC) 380 - panic("kmem_cache_create: Failed to create slab '%s'. Error %d\n", 381 - name, err); 380 + panic("%s: Failed to create slab '%s'. Error %d\n", 381 + __func__, name, err); 382 382 else { 383 - pr_warn("kmem_cache_create(%s) failed with error %d\n", 384 - name, err); 383 + pr_warn("%s(%s) failed with error %d\n", 384 + __func__, name, err); 385 385 dump_stack(); 386 386 } 387 387 return NULL; ··· 448 448 rcu_barrier(); 449 449 450 450 list_for_each_entry_safe(s, s2, &to_destroy, list) { 451 + debugfs_slab_release(s); 451 452 kfence_shutdown_cache(s); 452 453 #ifdef SLAB_SUPPORTS_SYSFS 453 454 sysfs_slab_release(s); ··· 476 475 schedule_work(&slab_caches_to_rcu_destroy_work); 477 476 } else { 478 477 kfence_shutdown_cache(s); 478 + debugfs_slab_release(s); 479 479 #ifdef SLAB_SUPPORTS_SYSFS 480 480 sysfs_slab_unlink(s); 481 481 sysfs_slab_release(s); ··· 510 508 511 509 err = shutdown_cache(s); 512 510 if (err) { 513 - pr_err("kmem_cache_destroy %s: Slab cache still has objects\n", 514 - s->name); 511 + pr_err("%s %s: Slab cache still has objects\n", 512 + __func__, s->name); 515 513 dump_stack(); 516 514 } 517 515 out_unlock: ··· 738 736 } 739 737 740 738 #ifdef CONFIG_ZONE_DMA 741 - #define INIT_KMALLOC_INFO(__size, __short_size) \ 742 - { \ 743 - .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ 744 - .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ 745 - .name[KMALLOC_DMA] = "dma-kmalloc-" #__short_size, \ 746 - .size = __size, \ 747 - } 739 + #define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz, 748 740 #else 741 + #define KMALLOC_DMA_NAME(sz) 742 + #endif 743 + 744 + #ifdef CONFIG_MEMCG_KMEM 745 + #define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz, 746 + #else 747 + #define KMALLOC_CGROUP_NAME(sz) 748 + #endif 749 + 749 750 #define INIT_KMALLOC_INFO(__size, __short_size) \ 750 751 { \ 751 752 .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ 
752 753 .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ 754 + KMALLOC_CGROUP_NAME(__short_size) \ 755 + KMALLOC_DMA_NAME(__short_size) \ 753 756 .size = __size, \ 754 757 } 755 - #endif 756 758 757 759 /* 758 760 * kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time. 759 - * kmalloc_index() supports up to 2^26=64MB, so the final entry of the table is 760 - * kmalloc-67108864. 761 + * kmalloc_index() supports up to 2^25=32MB, so the final entry of the table is 762 + * kmalloc-32M. 761 763 */ 762 764 const struct kmalloc_info_struct kmalloc_info[] __initconst = { 763 765 INIT_KMALLOC_INFO(0, 0), ··· 789 783 INIT_KMALLOC_INFO(4194304, 4M), 790 784 INIT_KMALLOC_INFO(8388608, 8M), 791 785 INIT_KMALLOC_INFO(16777216, 16M), 792 - INIT_KMALLOC_INFO(33554432, 32M), 793 - INIT_KMALLOC_INFO(67108864, 64M) 786 + INIT_KMALLOC_INFO(33554432, 32M) 794 787 }; 795 788 796 789 /* ··· 842 837 static void __init 843 838 new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags) 844 839 { 845 - if (type == KMALLOC_RECLAIM) 840 + if (type == KMALLOC_RECLAIM) { 846 841 flags |= SLAB_RECLAIM_ACCOUNT; 842 + } else if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_CGROUP)) { 843 + if (cgroup_memory_nokmem) { 844 + kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx]; 845 + return; 846 + } 847 + flags |= SLAB_ACCOUNT; 848 + } 847 849 848 850 kmalloc_caches[type][idx] = create_kmalloc_cache( 849 851 kmalloc_info[idx].name[type], 850 852 kmalloc_info[idx].size, flags, 0, 851 853 kmalloc_info[idx].size); 854 + 855 + /* 856 + * If CONFIG_MEMCG_KMEM is enabled, disable cache merging for 857 + * KMALLOC_NORMAL caches. 
858 + */ 859 + if (IS_ENABLED(CONFIG_MEMCG_KMEM) && (type == KMALLOC_NORMAL)) 860 + kmalloc_caches[type][idx]->refcount = -1; 852 861 } 853 862 854 863 /* ··· 875 856 int i; 876 857 enum kmalloc_cache_type type; 877 858 859 + /* 860 + * Including KMALLOC_CGROUP if CONFIG_MEMCG_KMEM defined 861 + */ 878 862 for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) { 879 863 for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) { 880 864 if (!kmalloc_caches[type][i])
+252 -166
mm/slub.c
··· 36 36 #include <linux/prefetch.h> 37 37 #include <linux/memcontrol.h> 38 38 #include <linux/random.h> 39 + #include <kunit/test.h> 39 40 41 + #include <linux/debugfs.h> 40 42 #include <trace/events/kmem.h> 41 43 42 44 #include "internal.h" ··· 119 117 */ 120 118 121 119 #ifdef CONFIG_SLUB_DEBUG 120 + 122 121 #ifdef CONFIG_SLUB_DEBUG_ON 123 122 DEFINE_STATIC_KEY_TRUE(slub_debug_enabled); 124 123 #else 125 124 DEFINE_STATIC_KEY_FALSE(slub_debug_enabled); 126 125 #endif 127 - #endif 126 + 127 + static inline bool __slub_debug_enabled(void) 128 + { 129 + return static_branch_unlikely(&slub_debug_enabled); 130 + } 131 + 132 + #else /* CONFIG_SLUB_DEBUG */ 133 + 134 + static inline bool __slub_debug_enabled(void) 135 + { 136 + return false; 137 + } 138 + 139 + #endif /* CONFIG_SLUB_DEBUG */ 128 140 129 141 static inline bool kmem_cache_debug(struct kmem_cache *s) 130 142 { ··· 169 153 * 170 154 * - Variable sizing of the per node arrays 171 155 */ 172 - 173 - /* Enable to test recovery from slab corruption on boot */ 174 - #undef SLUB_RESILIENCY_TEST 175 156 176 157 /* Enable to log cmpxchg failures */ 177 158 #undef SLUB_DEBUG_CMPXCHG ··· 237 224 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; } 238 225 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p) 239 226 { return 0; } 227 + #endif 228 + 229 + #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_SLUB_DEBUG) 230 + static void debugfs_slab_add(struct kmem_cache *); 231 + #else 232 + static inline void debugfs_slab_add(struct kmem_cache *s) { } 240 233 #endif 241 234 242 235 static inline void stat(const struct kmem_cache *s, enum stat_item si) ··· 468 449 static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)]; 469 450 static DEFINE_SPINLOCK(object_map_lock); 470 451 452 + #if IS_ENABLED(CONFIG_KUNIT) 453 + static bool slab_add_kunit_errors(void) 454 + { 455 + struct kunit_resource *resource; 456 + 457 + if (likely(!current->kunit_test)) 458 + return false; 459 + 460 + 
resource = kunit_find_named_resource(current->kunit_test, "slab_errors"); 461 + if (!resource) 462 + return false; 463 + 464 + (*(int *)resource->data)++; 465 + kunit_put_resource(resource); 466 + return true; 467 + } 468 + #else 469 + static inline bool slab_add_kunit_errors(void) { return false; } 470 + #endif 471 + 471 472 /* 472 473 * Determine a map of object in use on a page. 473 474 * ··· 708 669 pr_err("=============================================================================\n"); 709 670 pr_err("BUG %s (%s): %pV\n", s->name, print_tainted(), &vaf); 710 671 pr_err("-----------------------------------------------------------------------------\n\n"); 711 - 712 - add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 713 672 va_end(args); 714 673 } 715 674 675 + __printf(2, 3) 716 676 static void slab_fix(struct kmem_cache *s, char *fmt, ...) 717 677 { 718 678 struct va_format vaf; 719 679 va_list args; 680 + 681 + if (slab_add_kunit_errors()) 682 + return; 720 683 721 684 va_start(args, fmt); 722 685 vaf.fmt = fmt; ··· 783 742 void object_err(struct kmem_cache *s, struct page *page, 784 743 u8 *object, char *reason) 785 744 { 745 + if (slab_add_kunit_errors()) 746 + return; 747 + 786 748 slab_bug(s, "%s", reason); 787 749 print_trailer(s, page, object); 750 + add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 788 751 } 789 752 790 753 static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page, ··· 797 752 va_list args; 798 753 char buf[100]; 799 754 755 + if (slab_add_kunit_errors()) 756 + return; 757 + 800 758 va_start(args, fmt); 801 759 vsnprintf(buf, sizeof(buf), fmt, args); 802 760 va_end(args); 803 761 slab_bug(s, "%s", buf); 804 762 print_page_info(page); 805 763 dump_stack(); 764 + add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 806 765 } 807 766 808 767 static void init_object(struct kmem_cache *s, void *object, u8 val) ··· 828 779 static void restore_bytes(struct kmem_cache *s, char *message, u8 data, 829 780 void *from, void 
*to) 830 781 { 831 - slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data); 782 + slab_fix(s, "Restoring %s 0x%p-0x%p=0x%x", message, from, to - 1, data); 832 783 memset(from, data, to - from); 833 784 } 834 785 ··· 850 801 while (end > fault && end[-1] == value) 851 802 end--; 852 803 804 + if (slab_add_kunit_errors()) 805 + goto skip_bug_print; 806 + 853 807 slab_bug(s, "%s overwritten", what); 854 808 pr_err("0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n", 855 809 fault, end - 1, fault - addr, 856 810 fault[0], value); 857 811 print_trailer(s, page, object); 812 + add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 858 813 814 + skip_bug_print: 859 815 restore_bytes(s, what, value, fault, end); 860 816 return 0; 861 817 } ··· 1082 1028 slab_err(s, page, "Wrong number of objects. Found %d but should be %d", 1083 1029 page->objects, max_objects); 1084 1030 page->objects = max_objects; 1085 - slab_fix(s, "Number of objects adjusted."); 1031 + slab_fix(s, "Number of objects adjusted"); 1086 1032 } 1087 1033 if (page->inuse != page->objects - nr) { 1088 1034 slab_err(s, page, "Wrong object count. 
Counter is %d but counted were %d", 1089 1035 page->inuse, page->objects - nr); 1090 1036 page->inuse = page->objects - nr; 1091 - slab_fix(s, "Object count adjusted."); 1037 + slab_fix(s, "Object count adjusted"); 1092 1038 } 1093 1039 return search == NULL; 1094 1040 } ··· 1452 1398 out: 1453 1399 if (slub_debug != 0 || slub_debug_string) 1454 1400 static_branch_enable(&slub_debug_enabled); 1401 + else 1402 + static_branch_disable(&slub_debug_enabled); 1455 1403 if ((static_branch_unlikely(&init_on_alloc) || 1456 1404 static_branch_unlikely(&init_on_free)) && 1457 1405 (slub_debug & SLAB_POISON)) ··· 4509 4453 if (debug_guardpage_minorder()) 4510 4454 slub_max_order = 0; 4511 4455 4456 + /* Print slub debugging pointers without hashing */ 4457 + if (__slub_debug_enabled()) 4458 + no_hash_pointers_enable(NULL); 4459 + 4512 4460 kmem_cache_node = &boot_kmem_cache_node; 4513 4461 kmem_cache = &boot_kmem_cache; 4514 4462 ··· 4600 4540 err = sysfs_slab_add(s); 4601 4541 if (err) 4602 4542 __kmem_cache_release(s); 4543 + 4544 + if (s->flags & SLAB_STORE_USER) 4545 + debugfs_slab_add(s); 4603 4546 4604 4547 return err; 4605 4548 } ··· 4712 4649 validate_slab(s, page); 4713 4650 count++; 4714 4651 } 4715 - if (count != n->nr_partial) 4652 + if (count != n->nr_partial) { 4716 4653 pr_err("SLUB %s: %ld partial slabs counted but counter=%ld\n", 4717 4654 s->name, count, n->nr_partial); 4655 + slab_add_kunit_errors(); 4656 + } 4718 4657 4719 4658 if (!(s->flags & SLAB_STORE_USER)) 4720 4659 goto out; ··· 4725 4660 validate_slab(s, page); 4726 4661 count++; 4727 4662 } 4728 - if (count != atomic_long_read(&n->nr_slabs)) 4663 + if (count != atomic_long_read(&n->nr_slabs)) { 4729 4664 pr_err("SLUB: %s %ld slabs counted but counter=%ld\n", 4730 4665 s->name, count, atomic_long_read(&n->nr_slabs)); 4666 + slab_add_kunit_errors(); 4667 + } 4731 4668 4732 4669 out: 4733 4670 spin_unlock_irqrestore(&n->list_lock, flags); 4734 4671 return count; 4735 4672 } 4736 4673 4737 - static 
long validate_slab_cache(struct kmem_cache *s) 4674 + long validate_slab_cache(struct kmem_cache *s) 4738 4675 { 4739 4676 int node; 4740 4677 unsigned long count = 0; ··· 4748 4681 4749 4682 return count; 4750 4683 } 4684 + EXPORT_SYMBOL(validate_slab_cache); 4685 + 4686 + #ifdef CONFIG_DEBUG_FS 4751 4687 /* 4752 4688 * Generate lists of code addresses where slabcache objects are allocated 4753 4689 * and freed. ··· 4773 4703 unsigned long count; 4774 4704 struct location *loc; 4775 4705 }; 4706 + 4707 + static struct dentry *slab_debugfs_root; 4776 4708 4777 4709 static void free_loc_track(struct loc_track *t) 4778 4710 { ··· 4892 4820 add_location(t, s, get_track(s, p, alloc)); 4893 4821 put_map(map); 4894 4822 } 4895 - 4896 - static int list_locations(struct kmem_cache *s, char *buf, 4897 - enum track_item alloc) 4898 - { 4899 - int len = 0; 4900 - unsigned long i; 4901 - struct loc_track t = { 0, 0, NULL }; 4902 - int node; 4903 - struct kmem_cache_node *n; 4904 - 4905 - if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location), 4906 - GFP_KERNEL)) { 4907 - return sysfs_emit(buf, "Out of memory\n"); 4908 - } 4909 - /* Push back cpu slabs */ 4910 - flush_all(s); 4911 - 4912 - for_each_kmem_cache_node(s, node, n) { 4913 - unsigned long flags; 4914 - struct page *page; 4915 - 4916 - if (!atomic_long_read(&n->nr_slabs)) 4917 - continue; 4918 - 4919 - spin_lock_irqsave(&n->list_lock, flags); 4920 - list_for_each_entry(page, &n->partial, slab_list) 4921 - process_slab(&t, s, page, alloc); 4922 - list_for_each_entry(page, &n->full, slab_list) 4923 - process_slab(&t, s, page, alloc); 4924 - spin_unlock_irqrestore(&n->list_lock, flags); 4925 - } 4926 - 4927 - for (i = 0; i < t.count; i++) { 4928 - struct location *l = &t.loc[i]; 4929 - 4930 - len += sysfs_emit_at(buf, len, "%7ld ", l->count); 4931 - 4932 - if (l->addr) 4933 - len += sysfs_emit_at(buf, len, "%pS", (void *)l->addr); 4934 - else 4935 - len += sysfs_emit_at(buf, len, "<not-available>"); 4936 - 4937 - if 
(l->sum_time != l->min_time) 4938 - len += sysfs_emit_at(buf, len, " age=%ld/%ld/%ld", 4939 - l->min_time, 4940 - (long)div_u64(l->sum_time, 4941 - l->count), 4942 - l->max_time); 4943 - else 4944 - len += sysfs_emit_at(buf, len, " age=%ld", l->min_time); 4945 - 4946 - if (l->min_pid != l->max_pid) 4947 - len += sysfs_emit_at(buf, len, " pid=%ld-%ld", 4948 - l->min_pid, l->max_pid); 4949 - else 4950 - len += sysfs_emit_at(buf, len, " pid=%ld", 4951 - l->min_pid); 4952 - 4953 - if (num_online_cpus() > 1 && 4954 - !cpumask_empty(to_cpumask(l->cpus))) 4955 - len += sysfs_emit_at(buf, len, " cpus=%*pbl", 4956 - cpumask_pr_args(to_cpumask(l->cpus))); 4957 - 4958 - if (nr_online_nodes > 1 && !nodes_empty(l->nodes)) 4959 - len += sysfs_emit_at(buf, len, " nodes=%*pbl", 4960 - nodemask_pr_args(&l->nodes)); 4961 - 4962 - len += sysfs_emit_at(buf, len, "\n"); 4963 - } 4964 - 4965 - free_loc_track(&t); 4966 - if (!t.count) 4967 - len += sysfs_emit_at(buf, len, "No data\n"); 4968 - 4969 - return len; 4970 - } 4823 + #endif /* CONFIG_DEBUG_FS */ 4971 4824 #endif /* CONFIG_SLUB_DEBUG */ 4972 - 4973 - #ifdef SLUB_RESILIENCY_TEST 4974 - static void __init resiliency_test(void) 4975 - { 4976 - u8 *p; 4977 - int type = KMALLOC_NORMAL; 4978 - 4979 - BUILD_BUG_ON(KMALLOC_MIN_SIZE > 16 || KMALLOC_SHIFT_HIGH < 10); 4980 - 4981 - pr_err("SLUB resiliency testing\n"); 4982 - pr_err("-----------------------\n"); 4983 - pr_err("A. Corruption after allocation\n"); 4984 - 4985 - p = kzalloc(16, GFP_KERNEL); 4986 - p[16] = 0x12; 4987 - pr_err("\n1. kmalloc-16: Clobber Redzone/next pointer 0x12->0x%p\n\n", 4988 - p + 16); 4989 - 4990 - validate_slab_cache(kmalloc_caches[type][4]); 4991 - 4992 - /* Hmmm... The next two are dangerous */ 4993 - p = kzalloc(32, GFP_KERNEL); 4994 - p[32 + sizeof(void *)] = 0x34; 4995 - pr_err("\n2. 
kmalloc-32: Clobber next pointer/next slab 0x34 -> -0x%p\n", 4996 - p); 4997 - pr_err("If allocated object is overwritten then not detectable\n\n"); 4998 - 4999 - validate_slab_cache(kmalloc_caches[type][5]); 5000 - p = kzalloc(64, GFP_KERNEL); 5001 - p += 64 + (get_cycles() & 0xff) * sizeof(void *); 5002 - *p = 0x56; 5003 - pr_err("\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n", 5004 - p); 5005 - pr_err("If allocated object is overwritten then not detectable\n\n"); 5006 - validate_slab_cache(kmalloc_caches[type][6]); 5007 - 5008 - pr_err("\nB. Corruption after free\n"); 5009 - p = kzalloc(128, GFP_KERNEL); 5010 - kfree(p); 5011 - *p = 0x78; 5012 - pr_err("1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p); 5013 - validate_slab_cache(kmalloc_caches[type][7]); 5014 - 5015 - p = kzalloc(256, GFP_KERNEL); 5016 - kfree(p); 5017 - p[50] = 0x9a; 5018 - pr_err("\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n", p); 5019 - validate_slab_cache(kmalloc_caches[type][8]); 5020 - 5021 - p = kzalloc(512, GFP_KERNEL); 5022 - kfree(p); 5023 - p[512] = 0xab; 5024 - pr_err("\n3. 
kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p); 5025 - validate_slab_cache(kmalloc_caches[type][9]); 5026 - } 5027 - #else 5028 - #ifdef CONFIG_SYSFS 5029 - static void resiliency_test(void) {}; 5030 - #endif 5031 - #endif /* SLUB_RESILIENCY_TEST */ 5032 4825 5033 4826 #ifdef CONFIG_SYSFS 5034 4827 enum slab_stat_type { ··· 5282 5345 } 5283 5346 SLAB_ATTR(validate); 5284 5347 5285 - static ssize_t alloc_calls_show(struct kmem_cache *s, char *buf) 5286 - { 5287 - if (!(s->flags & SLAB_STORE_USER)) 5288 - return -ENOSYS; 5289 - return list_locations(s, buf, TRACK_ALLOC); 5290 - } 5291 - SLAB_ATTR_RO(alloc_calls); 5292 - 5293 - static ssize_t free_calls_show(struct kmem_cache *s, char *buf) 5294 - { 5295 - if (!(s->flags & SLAB_STORE_USER)) 5296 - return -ENOSYS; 5297 - return list_locations(s, buf, TRACK_FREE); 5298 - } 5299 - SLAB_ATTR_RO(free_calls); 5300 5348 #endif /* CONFIG_SLUB_DEBUG */ 5301 5349 5302 5350 #ifdef CONFIG_FAILSLAB ··· 5445 5523 &poison_attr.attr, 5446 5524 &store_user_attr.attr, 5447 5525 &validate_attr.attr, 5448 - &alloc_calls_attr.attr, 5449 - &free_calls_attr.attr, 5450 5526 #endif 5451 5527 #ifdef CONFIG_ZONE_DMA 5452 5528 &cache_dma_attr.attr, ··· 5726 5806 } 5727 5807 5728 5808 mutex_unlock(&slab_mutex); 5729 - resiliency_test(); 5730 5809 return 0; 5731 5810 } 5732 5811 5733 5812 __initcall(slab_sysfs_init); 5734 5813 #endif /* CONFIG_SYSFS */ 5735 5814 5815 + #if defined(CONFIG_SLUB_DEBUG) && defined(CONFIG_DEBUG_FS) 5816 + static int slab_debugfs_show(struct seq_file *seq, void *v) 5817 + { 5818 + 5819 + struct location *l; 5820 + unsigned int idx = *(unsigned int *)v; 5821 + struct loc_track *t = seq->private; 5822 + 5823 + if (idx < t->count) { 5824 + l = &t->loc[idx]; 5825 + 5826 + seq_printf(seq, "%7ld ", l->count); 5827 + 5828 + if (l->addr) 5829 + seq_printf(seq, "%pS", (void *)l->addr); 5830 + else 5831 + seq_puts(seq, "<not-available>"); 5832 + 5833 + if (l->sum_time != l->min_time) { 5834 + seq_printf(seq, " 
age=%ld/%llu/%ld", 5835 + l->min_time, div_u64(l->sum_time, l->count), 5836 + l->max_time); 5837 + } else 5838 + seq_printf(seq, " age=%ld", l->min_time); 5839 + 5840 + if (l->min_pid != l->max_pid) 5841 + seq_printf(seq, " pid=%ld-%ld", l->min_pid, l->max_pid); 5842 + else 5843 + seq_printf(seq, " pid=%ld", 5844 + l->min_pid); 5845 + 5846 + if (num_online_cpus() > 1 && !cpumask_empty(to_cpumask(l->cpus))) 5847 + seq_printf(seq, " cpus=%*pbl", 5848 + cpumask_pr_args(to_cpumask(l->cpus))); 5849 + 5850 + if (nr_online_nodes > 1 && !nodes_empty(l->nodes)) 5851 + seq_printf(seq, " nodes=%*pbl", 5852 + nodemask_pr_args(&l->nodes)); 5853 + 5854 + seq_puts(seq, "\n"); 5855 + } 5856 + 5857 + if (!idx && !t->count) 5858 + seq_puts(seq, "No data\n"); 5859 + 5860 + return 0; 5861 + } 5862 + 5863 + static void slab_debugfs_stop(struct seq_file *seq, void *v) 5864 + { 5865 + } 5866 + 5867 + static void *slab_debugfs_next(struct seq_file *seq, void *v, loff_t *ppos) 5868 + { 5869 + struct loc_track *t = seq->private; 5870 + 5871 + v = ppos; 5872 + ++*ppos; 5873 + if (*ppos <= t->count) 5874 + return v; 5875 + 5876 + return NULL; 5877 + } 5878 + 5879 + static void *slab_debugfs_start(struct seq_file *seq, loff_t *ppos) 5880 + { 5881 + return ppos; 5882 + } 5883 + 5884 + static const struct seq_operations slab_debugfs_sops = { 5885 + .start = slab_debugfs_start, 5886 + .next = slab_debugfs_next, 5887 + .stop = slab_debugfs_stop, 5888 + .show = slab_debugfs_show, 5889 + }; 5890 + 5891 + static int slab_debug_trace_open(struct inode *inode, struct file *filep) 5892 + { 5893 + 5894 + struct kmem_cache_node *n; 5895 + enum track_item alloc; 5896 + int node; 5897 + struct loc_track *t = __seq_open_private(filep, &slab_debugfs_sops, 5898 + sizeof(struct loc_track)); 5899 + struct kmem_cache *s = file_inode(filep)->i_private; 5900 + 5901 + if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0) 5902 + alloc = TRACK_ALLOC; 5903 + else 5904 + alloc = TRACK_FREE; 5905 + 5906 + 
if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) 5907 + return -ENOMEM; 5908 + 5909 + /* Push back cpu slabs */ 5910 + flush_all(s); 5911 + 5912 + for_each_kmem_cache_node(s, node, n) { 5913 + unsigned long flags; 5914 + struct page *page; 5915 + 5916 + if (!atomic_long_read(&n->nr_slabs)) 5917 + continue; 5918 + 5919 + spin_lock_irqsave(&n->list_lock, flags); 5920 + list_for_each_entry(page, &n->partial, slab_list) 5921 + process_slab(t, s, page, alloc); 5922 + list_for_each_entry(page, &n->full, slab_list) 5923 + process_slab(t, s, page, alloc); 5924 + spin_unlock_irqrestore(&n->list_lock, flags); 5925 + } 5926 + 5927 + return 0; 5928 + } 5929 + 5930 + static int slab_debug_trace_release(struct inode *inode, struct file *file) 5931 + { 5932 + struct seq_file *seq = file->private_data; 5933 + struct loc_track *t = seq->private; 5934 + 5935 + free_loc_track(t); 5936 + return seq_release_private(inode, file); 5937 + } 5938 + 5939 + static const struct file_operations slab_debugfs_fops = { 5940 + .open = slab_debug_trace_open, 5941 + .read = seq_read, 5942 + .llseek = seq_lseek, 5943 + .release = slab_debug_trace_release, 5944 + }; 5945 + 5946 + static void debugfs_slab_add(struct kmem_cache *s) 5947 + { 5948 + struct dentry *slab_cache_dir; 5949 + 5950 + if (unlikely(!slab_debugfs_root)) 5951 + return; 5952 + 5953 + slab_cache_dir = debugfs_create_dir(s->name, slab_debugfs_root); 5954 + 5955 + debugfs_create_file("alloc_traces", 0400, 5956 + slab_cache_dir, s, &slab_debugfs_fops); 5957 + 5958 + debugfs_create_file("free_traces", 0400, 5959 + slab_cache_dir, s, &slab_debugfs_fops); 5960 + } 5961 + 5962 + void debugfs_slab_release(struct kmem_cache *s) 5963 + { 5964 + debugfs_remove_recursive(debugfs_lookup(s->name, slab_debugfs_root)); 5965 + } 5966 + 5967 + static int __init slab_debugfs_init(void) 5968 + { 5969 + struct kmem_cache *s; 5970 + 5971 + slab_debugfs_root = debugfs_create_dir("slab", NULL); 5972 + 5973 + list_for_each_entry(s, 
&slab_caches, list) 5974 + if (s->flags & SLAB_STORE_USER) 5975 + debugfs_slab_add(s); 5976 + 5977 + return 0; 5978 + 5979 + } 5980 + __initcall(slab_debugfs_init); 5981 + #endif 5736 5982 /* 5737 5983 * The /proc/slabinfo ABI 5738 5984 */
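The `alloc_traces`/`free_traces` files added above use the seq_file cursor protocol: `->start` hands back a position, `->next` advances it until it walks past `t->count`, and `->show` renders one location record (including the one-past-the-end call that prints "No data" for an empty track). A minimal userspace sketch of that cursor contract, with no kernel APIs (`loc_sketch` and the helper names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch (not kernel code) of the cursor pattern behind
 * slab_debugfs_sops: ->start returns a position, ->next advances it,
 * ->show renders one record.  Like slab_debugfs_next(), next still
 * returns a valid cursor at *ppos == count so show runs once more
 * (where the kernel version can print "No data"). */
struct loc_sketch { long count; };

static void *sketch_start(long *ppos) { return ppos; }

static void *sketch_next(struct loc_sketch *t, long *ppos)
{
	++*ppos;
	return (*ppos <= t->count) ? ppos : NULL;
}

/* Drive the iterator; returns how many in-range records were "shown". */
static long sketch_walk(struct loc_sketch *t)
{
	long pos = 0, shown = 0;
	void *v = sketch_start(&pos);

	while (v) {
		if (pos < t->count)	/* ->show prints only in-range indexes */
			shown++;
		v = sketch_next(t, &pos);
	}
	return shown;
}
```

The same shape explains why `slab_debug_trace_open()` only has to fill `loc_track` once: the seq_file core replays start/next/show over the snapshot on every read.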
+1 -1
mm/sparse.c
··· 346 346 347 347 static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat) 348 348 { 349 - #ifndef CONFIG_NEED_MULTIPLE_NODES 349 + #ifndef CONFIG_NUMA 350 350 return __pa_symbol(pgdat); 351 351 #else 352 352 return __pa(pgdat);
+2 -2
mm/swap.c
··· 95 95 { 96 96 __page_cache_release(page); 97 97 mem_cgroup_uncharge(page); 98 - free_unref_page(page); 98 + free_unref_page(page, 0); 99 99 } 100 100 101 101 static void __put_compound_page(struct page *page) ··· 313 313 314 314 void lru_note_cost_page(struct page *page) 315 315 { 316 - lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)), 316 + lru_note_cost(mem_cgroup_page_lruvec(page), 317 317 page_is_file_lru(page), thp_nr_pages(page)); 318 318 } 319 319
-2
mm/swap_slots.c
··· 43 43 static DEFINE_MUTEX(swap_slots_cache_enable_mutex); 44 44 45 45 static void __drain_swap_slots_cache(unsigned int type); 46 - static void deactivate_swap_slots_cache(void); 47 - static void reactivate_swap_slots_cache(void); 48 46 49 47 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled) 50 48 #define SLOTS_CACHE 0x1
+7 -13
mm/swap_state.c
··· 114 114 SetPageSwapCache(page); 115 115 116 116 do { 117 - unsigned long nr_shadows = 0; 118 - 119 117 xas_lock_irq(&xas); 120 118 xas_create_range(&xas); 121 119 if (xas_error(&xas)) ··· 122 124 VM_BUG_ON_PAGE(xas.xa_index != idx + i, page); 123 125 old = xas_load(&xas); 124 126 if (xa_is_value(old)) { 125 - nr_shadows++; 126 127 if (shadowp) 127 128 *shadowp = old; 128 129 } ··· 257 260 void *old; 258 261 259 262 for (;;) { 260 - unsigned long nr_shadows = 0; 261 263 swp_entry_t entry = swp_entry(type, curr); 262 264 struct address_space *address_space = swap_address_space(entry); 263 265 XA_STATE(xas, &address_space->i_pages, curr); ··· 266 270 if (!xa_is_value(old)) 267 271 continue; 268 272 xas_store(&xas, NULL); 269 - nr_shadows++; 270 273 } 271 274 xa_unlock_irq(&address_space->i_pages); 272 275 ··· 286 291 * try_to_free_swap() _with_ the lock. 287 292 * - Marcelo 288 293 */ 289 - static inline void free_swap_cache(struct page *page) 294 + void free_swap_cache(struct page *page) 290 295 { 291 296 if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) { 292 297 try_to_free_swap(page); ··· 693 698 694 699 void exit_swap_address_space(unsigned int type) 695 700 { 696 - kvfree(swapper_spaces[type]); 701 + int i; 702 + struct address_space *spaces = swapper_spaces[type]; 703 + 704 + for (i = 0; i < nr_swapper_spaces[type]; i++) 705 + VM_WARN_ON_ONCE(!mapping_empty(&spaces[i])); 706 + kvfree(spaces); 697 707 nr_swapper_spaces[type] = 0; 698 708 swapper_spaces[type] = NULL; 699 709 } ··· 721 721 { 722 722 struct vm_area_struct *vma = vmf->vma; 723 723 unsigned long ra_val; 724 - swp_entry_t entry; 725 724 unsigned long faddr, pfn, fpfn; 726 725 unsigned long start, end; 727 726 pte_t *pte, *orig_pte; ··· 738 739 739 740 faddr = vmf->address; 740 741 orig_pte = pte = pte_offset_map(vmf->pmd, faddr); 741 - entry = pte_to_swp_entry(*pte); 742 - if ((unlikely(non_swap_entry(entry)))) { 743 - pte_unmap(orig_pte); 744 - return; 745 - } 746 742 747 743 
fpfn = PFN_DOWN(faddr); 748 744 ra_val = GET_SWAP_RA_VAL(vma);
+86 -91
mm/swapfile.c
··· 39 39 #include <linux/export.h> 40 40 #include <linux/swap_slots.h> 41 41 #include <linux/sort.h> 42 + #include <linux/completion.h> 42 43 43 44 #include <asm/tlbflush.h> 44 45 #include <linux/swapops.h> ··· 100 99 101 100 static struct swap_info_struct *swap_type_to_swap_info(int type) 102 101 { 103 - if (type >= READ_ONCE(nr_swapfiles)) 102 + if (type >= MAX_SWAPFILES) 104 103 return NULL; 105 104 106 - smp_rmb(); /* Pairs with smp_wmb in alloc_swap_info. */ 107 - return READ_ONCE(swap_info[type]); 105 + return READ_ONCE(swap_info[type]); /* rcu_dereference() */ 108 106 } 109 107 110 108 static inline unsigned char swap_count(unsigned char ent) ··· 452 452 unsigned int idx) 453 453 { 454 454 /* 455 - * If scan_swap_map() can't find a free cluster, it will check 455 + * If scan_swap_map_slots() can't find a free cluster, it will check 456 456 * si->swap_map directly. To make sure the discarding cluster isn't 457 - * taken by scan_swap_map(), mark the swap entries bad (occupied). It 458 - * will be cleared after discard 457 + * taken by scan_swap_map_slots(), mark the swap entries bad (occupied). 458 + * It will be cleared after discard 459 459 */ 460 460 memset(si->swap_map + idx * SWAPFILE_CLUSTER, 461 461 SWAP_MAP_BAD, SWAPFILE_CLUSTER); ··· 509 509 spin_lock(&si->lock); 510 510 swap_do_scheduled_discard(si); 511 511 spin_unlock(&si->lock); 512 + } 513 + 514 + static void swap_users_ref_free(struct percpu_ref *ref) 515 + { 516 + struct swap_info_struct *si; 517 + 518 + si = container_of(ref, struct swap_info_struct, users); 519 + complete(&si->comp); 512 520 } 513 521 514 522 static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) ··· 588 580 } 589 581 590 582 /* 591 - * It's possible scan_swap_map() uses a free cluster in the middle of free 583 + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free 592 584 * cluster list. Avoiding such abuse to avoid list corruption. 
593 585 */ 594 586 static bool ··· 1036 1028 swap_range_free(si, offset, SWAPFILE_CLUSTER); 1037 1029 } 1038 1030 1039 - static unsigned long scan_swap_map(struct swap_info_struct *si, 1040 - unsigned char usage) 1041 - { 1042 - swp_entry_t entry; 1043 - int n_ret; 1044 - 1045 - n_ret = scan_swap_map_slots(si, usage, 1, &entry); 1046 - 1047 - if (n_ret) 1048 - return swp_offset(entry); 1049 - else 1050 - return 0; 1051 - 1052 - } 1053 - 1054 1031 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) 1055 1032 { 1056 1033 unsigned long size = swap_entry_size(entry_size); ··· 1098 1105 nextsi: 1099 1106 /* 1100 1107 * if we got here, it's likely that si was almost full before, 1101 - * and since scan_swap_map() can drop the si->lock, multiple 1102 - * callers probably all tried to get a page from the same si 1103 - * and it filled up before we could get one; or, the si filled 1104 - * up between us dropping swap_avail_lock and taking si->lock. 1105 - * Since we dropped the swap_avail_lock, the swap_avail_head 1106 - * list may have been modified; so if next is still in the 1107 - * swap_avail_head list then try it, otherwise start over 1108 - * if we have not gotten any slots. 1108 + * and since scan_swap_map_slots() can drop the si->lock, 1109 + * multiple callers probably all tried to get a page from the 1110 + * same si and it filled up before we could get one; or, the si 1111 + * filled up between us dropping swap_avail_lock and taking 1112 + * si->lock. Since we dropped the swap_avail_lock, the 1113 + * swap_avail_head list may have been modified; so if next is 1114 + * still in the swap_avail_head list then try it, otherwise 1115 + * start over if we have not gotten any slots. 
1109 1116 */ 1110 1117 if (plist_node_empty(&next->avail_lists[node])) 1111 1118 goto start_over; ··· 1119 1126 &nr_swap_pages); 1120 1127 noswap: 1121 1128 return n_ret; 1122 - } 1123 - 1124 - /* The only caller of this function is now suspend routine */ 1125 - swp_entry_t get_swap_page_of_type(int type) 1126 - { 1127 - struct swap_info_struct *si = swap_type_to_swap_info(type); 1128 - pgoff_t offset; 1129 - 1130 - if (!si) 1131 - goto fail; 1132 - 1133 - spin_lock(&si->lock); 1134 - if (si->flags & SWP_WRITEOK) { 1135 - /* This is called for allocating swap entry, not cache */ 1136 - offset = scan_swap_map(si, 1); 1137 - if (offset) { 1138 - atomic_long_dec(&nr_swap_pages); 1139 - spin_unlock(&si->lock); 1140 - return swp_entry(type, offset); 1141 - } 1142 - } 1143 - spin_unlock(&si->lock); 1144 - fail: 1145 - return (swp_entry_t) {0}; 1146 1129 } 1147 1130 1148 1131 static struct swap_info_struct *__swap_info_get(swp_entry_t entry) ··· 1239 1270 * via preventing the swap device from being swapoff, until 1240 1271 * put_swap_device() is called. Otherwise return NULL. 1241 1272 * 1242 - * The entirety of the RCU read critical section must come before the 1243 - * return from or after the call to synchronize_rcu() in 1244 - * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is 1245 - * true, the si->map, si->cluster_info, etc. must be valid in the 1246 - * critical section. 1247 - * 1248 1273 * Notice that swapoff or swapoff+swapon can still happen before the 1249 - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock() 1250 - * in put_swap_device() if there isn't any other way to prevent 1251 - * swapoff, such as page lock, page table lock, etc. The caller must 1252 - * be prepared for that. For example, the following situation is 1253 - * possible. 
1274 + * percpu_ref_tryget_live() in get_swap_device() or after the 1275 + * percpu_ref_put() in put_swap_device() if there isn't any other way 1276 + * to prevent swapoff, such as page lock, page table lock, etc. The 1277 + * caller must be prepared for that. For example, the following 1278 + * situation is possible. 1254 1279 * 1255 1280 * CPU1 CPU2 1256 1281 * do_swap_page() ··· 1272 1309 si = swp_swap_info(entry); 1273 1310 if (!si) 1274 1311 goto bad_nofile; 1275 - 1276 - rcu_read_lock(); 1277 - if (data_race(!(si->flags & SWP_VALID))) 1278 - goto unlock_out; 1312 + if (!percpu_ref_tryget_live(&si->users)) 1313 + goto out; 1314 + /* 1315 + * Guarantee the si->users are checked before accessing other 1316 + * fields of swap_info_struct. 1317 + * 1318 + * Paired with the spin_unlock() after setup_swap_info() in 1319 + * enable_swap_info(). 1320 + */ 1321 + smp_rmb(); 1279 1322 offset = swp_offset(entry); 1280 1323 if (offset >= si->max) 1281 - goto unlock_out; 1324 + goto put_out; 1282 1325 1283 1326 return si; 1284 1327 bad_nofile: 1285 1328 pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); 1286 1329 out: 1287 1330 return NULL; 1288 - unlock_out: 1289 - rcu_read_unlock(); 1331 + put_out: 1332 + percpu_ref_put(&si->users); 1290 1333 return NULL; 1291 1334 } 1292 1335 ··· 1772 1803 } 1773 1804 1774 1805 #ifdef CONFIG_HIBERNATION 1806 + 1807 + swp_entry_t get_swap_page_of_type(int type) 1808 + { 1809 + struct swap_info_struct *si = swap_type_to_swap_info(type); 1810 + swp_entry_t entry = {0}; 1811 + 1812 + if (!si) 1813 + goto fail; 1814 + 1815 + /* This is called for allocating swap entry, not cache */ 1816 + spin_lock(&si->lock); 1817 + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) 1818 + atomic_long_dec(&nr_swap_pages); 1819 + spin_unlock(&si->lock); 1820 + fail: 1821 + return entry; 1822 + } 1823 + 1775 1824 /* 1776 1825 * Find the swap type that corresponds to given device (if any). 
1777 1826 * ··· 2453 2466 2454 2467 static void _enable_swap_info(struct swap_info_struct *p) 2455 2468 { 2456 - p->flags |= SWP_WRITEOK | SWP_VALID; 2469 + p->flags |= SWP_WRITEOK; 2457 2470 atomic_long_add(p->pages, &nr_swap_pages); 2458 2471 total_swap_pages += p->pages; 2459 2472 ··· 2484 2497 spin_unlock(&p->lock); 2485 2498 spin_unlock(&swap_lock); 2486 2499 /* 2487 - * Guarantee swap_map, cluster_info, etc. fields are valid 2488 - * between get/put_swap_device() if SWP_VALID bit is set 2500 + * Finished initializing swap device, now it's safe to reference it. 2489 2501 */ 2490 - synchronize_rcu(); 2502 + percpu_ref_resurrect(&p->users); 2491 2503 spin_lock(&swap_lock); 2492 2504 spin_lock(&p->lock); 2493 2505 _enable_swap_info(p); ··· 2602 2616 2603 2617 reenable_swap_slots_cache_unlock(); 2604 2618 2605 - spin_lock(&swap_lock); 2606 - spin_lock(&p->lock); 2607 - p->flags &= ~SWP_VALID; /* mark swap device as invalid */ 2608 - spin_unlock(&p->lock); 2609 - spin_unlock(&swap_lock); 2610 2619 /* 2611 - * wait for swap operations protected by get/put_swap_device() 2612 - * to complete 2620 + * Wait for swap operations protected by get/put_swap_device() 2621 + * to complete. 2622 + * 2623 + * We need synchronize_rcu() here to protect the accessing to 2624 + * the swap cache data structure. 
2613 2625 */ 2626 + percpu_ref_kill(&p->users); 2614 2627 synchronize_rcu(); 2628 + wait_for_completion(&p->comp); 2615 2629 2616 2630 flush_work(&p->discard_work); 2617 2631 ··· 2627 2641 spin_lock(&p->lock); 2628 2642 drain_mmlist(); 2629 2643 2630 - /* wait for anyone still in scan_swap_map */ 2644 + /* wait for anyone still in scan_swap_map_slots */ 2631 2645 p->highest_bit = 0; /* cuts scans short */ 2632 2646 while (p->flags >= SWP_SCANNING) { 2633 2647 spin_unlock(&p->lock); ··· 2843 2857 if (!p) 2844 2858 return ERR_PTR(-ENOMEM); 2845 2859 2860 + if (percpu_ref_init(&p->users, swap_users_ref_free, 2861 + PERCPU_REF_INIT_DEAD, GFP_KERNEL)) { 2862 + kvfree(p); 2863 + return ERR_PTR(-ENOMEM); 2864 + } 2865 + 2846 2866 spin_lock(&swap_lock); 2847 2867 for (type = 0; type < nr_swapfiles; type++) { 2848 2868 if (!(swap_info[type]->flags & SWP_USED)) ··· 2856 2864 } 2857 2865 if (type >= MAX_SWAPFILES) { 2858 2866 spin_unlock(&swap_lock); 2867 + percpu_ref_exit(&p->users); 2859 2868 kvfree(p); 2860 2869 return ERR_PTR(-EPERM); 2861 2870 } 2862 2871 if (type >= nr_swapfiles) { 2863 2872 p->type = type; 2864 - WRITE_ONCE(swap_info[type], p); 2865 2873 /* 2866 - * Write swap_info[type] before nr_swapfiles, in case a 2867 - * racing procfs swap_start() or swap_next() is reading them. 2868 - * (We never shrink nr_swapfiles, we never free this entry.) 2874 + * Publish the swap_info_struct after initializing it. 2875 + * Note that kvzalloc() above zeroes all its fields. 
2869 2876 */ 2870 - smp_wmb(); 2871 - WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1); 2877 + smp_store_release(&swap_info[type], p); /* rcu_assign_pointer() */ 2878 + nr_swapfiles++; 2872 2879 } else { 2873 2880 defer = p; 2874 2881 p = swap_info[type]; ··· 2882 2891 plist_node_init(&p->avail_lists[i], 0); 2883 2892 p->flags = SWP_USED; 2884 2893 spin_unlock(&swap_lock); 2885 - kvfree(defer); 2894 + if (defer) { 2895 + percpu_ref_exit(&defer->users); 2896 + kvfree(defer); 2897 + } 2886 2898 spin_lock_init(&p->lock); 2887 2899 spin_lock_init(&p->cont_lock); 2900 + init_completion(&p->comp); 2888 2901 2889 2902 return p; 2890 2903 }
+76 -47
mm/vmalloc.c
··· 2567 2567 2568 2568 BUG_ON(!page); 2569 2569 __free_pages(page, page_order); 2570 + cond_resched(); 2570 2571 } 2571 2572 atomic_long_sub(area->nr_pages, &nr_vmalloc_pages); 2572 2573 ··· 2759 2758 EXPORT_SYMBOL_GPL(vmap_pfn); 2760 2759 #endif /* CONFIG_VMAP_PFN */ 2761 2760 2761 + static inline unsigned int 2762 + vm_area_alloc_pages(gfp_t gfp, int nid, 2763 + unsigned int order, unsigned long nr_pages, struct page **pages) 2764 + { 2765 + unsigned int nr_allocated = 0; 2766 + 2767 + /* 2768 + * For order-0 pages we make use of bulk allocator, if 2769 + * the page array is partly or not at all populated due 2770 + * to fails, fallback to a single page allocator that is 2771 + * more permissive. 2772 + */ 2773 + if (!order) 2774 + nr_allocated = alloc_pages_bulk_array_node( 2775 + gfp, nid, nr_pages, pages); 2776 + else 2777 + /* 2778 + * Compound pages required for remap_vmalloc_page if 2779 + * high-order pages. 2780 + */ 2781 + gfp |= __GFP_COMP; 2782 + 2783 + /* High-order pages or fallback path if "bulk" fails. */ 2784 + while (nr_allocated < nr_pages) { 2785 + struct page *page; 2786 + int i; 2787 + 2788 + page = alloc_pages_node(nid, gfp, order); 2789 + if (unlikely(!page)) 2790 + break; 2791 + 2792 + /* 2793 + * Careful, we allocate and map page-order pages, but 2794 + * tracking is done per PAGE_SIZE page so as to keep the 2795 + * vm_struct APIs independent of the physical/mapped size. 
2796 + */ 2797 + for (i = 0; i < (1U << order); i++) 2798 + pages[nr_allocated + i] = page + i; 2799 + 2800 + if (gfpflags_allow_blocking(gfp)) 2801 + cond_resched(); 2802 + 2803 + nr_allocated += 1U << order; 2804 + } 2805 + 2806 + return nr_allocated; 2807 + } 2808 + 2762 2809 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, 2763 2810 pgprot_t prot, unsigned int page_shift, 2764 2811 int node) ··· 2817 2768 unsigned long array_size; 2818 2769 unsigned int nr_small_pages = size >> PAGE_SHIFT; 2819 2770 unsigned int page_order; 2820 - struct page **pages; 2821 - unsigned int i; 2822 2771 2823 2772 array_size = (unsigned long)nr_small_pages * sizeof(struct page *); 2824 2773 gfp_mask |= __GFP_NOWARN; ··· 2825 2778 2826 2779 /* Please note that the recursion is strictly bounded. */ 2827 2780 if (array_size > PAGE_SIZE) { 2828 - pages = __vmalloc_node(array_size, 1, nested_gfp, node, 2781 + area->pages = __vmalloc_node(array_size, 1, nested_gfp, node, 2829 2782 area->caller); 2830 2783 } else { 2831 - pages = kmalloc_node(array_size, nested_gfp, node); 2784 + area->pages = kmalloc_node(array_size, nested_gfp, node); 2832 2785 } 2833 2786 2834 - if (!pages) { 2835 - free_vm_area(area); 2787 + if (!area->pages) { 2836 2788 warn_alloc(gfp_mask, NULL, 2837 - "vmalloc size %lu allocation failure: " 2838 - "page array size %lu allocation failed", 2839 - nr_small_pages * PAGE_SIZE, array_size); 2789 + "vmalloc error: size %lu, failed to allocated page array size %lu", 2790 + nr_small_pages * PAGE_SIZE, array_size); 2791 + free_vm_area(area); 2840 2792 return NULL; 2841 2793 } 2842 2794 2843 - area->pages = pages; 2844 - area->nr_pages = nr_small_pages; 2845 2795 set_vm_area_page_order(area, page_shift - PAGE_SHIFT); 2846 - 2847 2796 page_order = vm_area_page_order(area); 2848 2797 2849 - /* 2850 - * Careful, we allocate and map page_order pages, but tracking is done 2851 - * per PAGE_SIZE page so as to keep the vm_struct APIs independent of 2852 - * 
the physical/mapped size. 2853 - */ 2854 - for (i = 0; i < area->nr_pages; i += 1U << page_order) { 2855 - struct page *page; 2856 - int p; 2798 + area->nr_pages = vm_area_alloc_pages(gfp_mask, node, 2799 + page_order, nr_small_pages, area->pages); 2857 2800 2858 - /* Compound pages required for remap_vmalloc_page */ 2859 - page = alloc_pages_node(node, gfp_mask | __GFP_COMP, page_order); 2860 - if (unlikely(!page)) { 2861 - /* Successfully allocated i pages, free them in __vfree() */ 2862 - area->nr_pages = i; 2863 - atomic_long_add(area->nr_pages, &nr_vmalloc_pages); 2864 - warn_alloc(gfp_mask, NULL, 2865 - "vmalloc size %lu allocation failure: " 2866 - "page order %u allocation failed", 2867 - area->nr_pages * PAGE_SIZE, page_order); 2868 - goto fail; 2869 - } 2870 - 2871 - for (p = 0; p < (1U << page_order); p++) 2872 - area->pages[i + p] = page + p; 2873 - 2874 - if (gfpflags_allow_blocking(gfp_mask)) 2875 - cond_resched(); 2876 - } 2877 2801 atomic_long_add(area->nr_pages, &nr_vmalloc_pages); 2878 2802 2879 - if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0) { 2803 + /* 2804 + * If not enough pages were obtained to accomplish an 2805 + * allocation request, free them via __vfree() if any. 
2806 + */ 2807 + if (area->nr_pages != nr_small_pages) { 2880 2808 warn_alloc(gfp_mask, NULL, 2881 - "vmalloc size %lu allocation failure: " 2882 - "failed to map pages", 2883 - area->nr_pages * PAGE_SIZE); 2809 + "vmalloc error: size %lu, page order %u, failed to allocate pages", 2810 + area->nr_pages * PAGE_SIZE, page_order); 2811 + goto fail; 2812 + } 2813 + 2814 + if (vmap_pages_range(addr, addr + size, prot, area->pages, 2815 + page_shift) < 0) { 2816 + warn_alloc(gfp_mask, NULL, 2817 + "vmalloc error: size %lu, failed to map pages", 2818 + area->nr_pages * PAGE_SIZE); 2884 2819 goto fail; 2885 2820 } 2886 2821 ··· 2907 2878 2908 2879 if ((size >> PAGE_SHIFT) > totalram_pages()) { 2909 2880 warn_alloc(gfp_mask, NULL, 2910 - "vmalloc size %lu allocation failure: " 2911 - "exceeds total pages", real_size); 2881 + "vmalloc error: size %lu, exceeds total pages", 2882 + real_size); 2912 2883 return NULL; 2913 2884 } 2914 2885 ··· 2939 2910 gfp_mask, caller); 2940 2911 if (!area) { 2941 2912 warn_alloc(gfp_mask, NULL, 2942 - "vmalloc size %lu allocation failure: " 2943 - "vm_struct allocation failed", real_size); 2913 + "vmalloc error: size %lu, vm_struct allocation failed", 2914 + real_size); 2944 2915 goto fail; 2945 2916 } 2946 2917
+39 -4
mm/vmscan.c
··· 2015 2015 * 2016 2016 * Returns the number of pages moved to the given lruvec. 2017 2017 */ 2018 - static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec, 2019 - struct list_head *list) 2018 + static unsigned int move_pages_to_lru(struct lruvec *lruvec, 2019 + struct list_head *list) 2020 2020 { 2021 2021 int nr_pages, nr_moved = 0; 2022 2022 LIST_HEAD(pages_to_free); ··· 2063 2063 * All pages were isolated from the same lruvec (and isolation 2064 2064 * inhibits memcg migration). 2065 2065 */ 2066 - VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page); 2066 + VM_BUG_ON_PAGE(!page_matches_lruvec(page, lruvec), page); 2067 2067 add_page_to_lru_list(page, lruvec); 2068 2068 nr_pages = thp_nr_pages(page); 2069 2069 nr_moved += nr_pages; ··· 2096 2096 * shrink_inactive_list() is a helper for shrink_node(). It returns the number 2097 2097 * of reclaimed pages 2098 2098 */ 2099 - static noinline_for_stack unsigned long 2099 + static unsigned long 2100 2100 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, 2101 2101 struct scan_control *sc, enum lru_list lru) 2102 2102 { ··· 3722 3722 return sc->nr_scanned >= sc->nr_to_reclaim; 3723 3723 } 3724 3724 3725 + /* Page allocator PCP high watermark is lowered if reclaim is active. 
*/ 3726 + static inline void 3727 + update_reclaim_active(pg_data_t *pgdat, int highest_zoneidx, bool active) 3728 + { 3729 + int i; 3730 + struct zone *zone; 3731 + 3732 + for (i = 0; i <= highest_zoneidx; i++) { 3733 + zone = pgdat->node_zones + i; 3734 + 3735 + if (!managed_zone(zone)) 3736 + continue; 3737 + 3738 + if (active) 3739 + set_bit(ZONE_RECLAIM_ACTIVE, &zone->flags); 3740 + else 3741 + clear_bit(ZONE_RECLAIM_ACTIVE, &zone->flags); 3742 + } 3743 + } 3744 + 3745 + static inline void 3746 + set_reclaim_active(pg_data_t *pgdat, int highest_zoneidx) 3747 + { 3748 + update_reclaim_active(pgdat, highest_zoneidx, true); 3749 + } 3750 + 3751 + static inline void 3752 + clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx) 3753 + { 3754 + update_reclaim_active(pgdat, highest_zoneidx, false); 3755 + } 3756 + 3725 3757 /* 3726 3758 * For kswapd, balance_pgdat() will reclaim pages across a node from zones 3727 3759 * that are eligible for use by the caller until at least one zone is ··· 3806 3774 boosted = nr_boost_reclaim; 3807 3775 3808 3776 restart: 3777 + set_reclaim_active(pgdat, highest_zoneidx); 3809 3778 sc.priority = DEF_PRIORITY; 3810 3779 do { 3811 3780 unsigned long nr_reclaimed = sc.nr_reclaimed; ··· 3940 3907 pgdat->kswapd_failures++; 3941 3908 3942 3909 out: 3910 + clear_reclaim_active(pgdat, highest_zoneidx); 3911 + 3943 3912 /* If reclaim was boosted, account for the reclaim done in this pass */ 3944 3913 if (boosted) { 3945 3914 unsigned long flags;
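The `set_reclaim_active()`/`clear_reclaim_active()` pair above brackets the `balance_pgdat()` retry loop so the page allocator can cap PCP lists while kswapd is working: every zone up to `highest_zoneidx` gets `ZONE_RECLAIM_ACTIVE` set at `restart:` and cleared at `out:`. A toy sketch of that bracketing (plain bools replace the per-zone flag bit; `SKETCH_MAX_ZONES` and the managed-zone check are simplified away):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_ZONES 4	/* invented bound for the sketch */

static bool zone_reclaim_active[SKETCH_MAX_ZONES];

/* Mirrors update_reclaim_active(): one flag write per eligible zone. */
static void update_reclaim_active_sketch(int highest_zoneidx, bool active)
{
	int i;

	for (i = 0; i <= highest_zoneidx && i < SKETCH_MAX_ZONES; i++)
		zone_reclaim_active[i] = active;
}
```

Because the flag is set before the first reclaim pass and cleared on every exit path through `out:`, allocators never observe a stale "reclaim active" hint after kswapd goes back to sleep.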
+112 -154
mm/vmstat.c
··· 31 31 32 32 #include "internal.h" 33 33 34 - #define NUMA_STATS_THRESHOLD (U16_MAX - 2) 35 - 36 34 #ifdef CONFIG_NUMA 37 35 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT; 38 36 ··· 39 41 { 40 42 int item, cpu; 41 43 42 - for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) { 43 - atomic_long_set(&zone->vm_numa_stat[item], 0); 44 - for_each_online_cpu(cpu) 45 - per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] 44 + for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) { 45 + atomic_long_set(&zone->vm_numa_event[item], 0); 46 + for_each_online_cpu(cpu) { 47 + per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_numa_event[item] 46 48 = 0; 49 + } 47 50 } 48 51 } 49 52 ··· 62 63 { 63 64 int item; 64 65 65 - for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) 66 - atomic_long_set(&vm_numa_stat[item], 0); 66 + for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) 67 + atomic_long_set(&vm_numa_event[item], 0); 67 68 } 68 69 69 70 static void invalid_numa_statistics(void) ··· 160 161 * vm_stat contains the global counters 161 162 */ 162 163 atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp; 163 - atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS] __cacheline_aligned_in_smp; 164 164 atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS] __cacheline_aligned_in_smp; 165 + atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS] __cacheline_aligned_in_smp; 165 166 EXPORT_SYMBOL(vm_zone_stat); 166 - EXPORT_SYMBOL(vm_numa_stat); 167 167 EXPORT_SYMBOL(vm_node_stat); 168 168 169 169 #ifdef CONFIG_SMP ··· 264 266 for_each_online_cpu(cpu) { 265 267 int pgdat_threshold; 266 268 267 - per_cpu_ptr(zone->pageset, cpu)->stat_threshold 269 + per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold 268 270 = threshold; 269 271 270 272 /* Base nodestat threshold on the largest populated zone. 
*/ ··· 301 303 302 304 threshold = (*calculate_pressure)(zone); 303 305 for_each_online_cpu(cpu) 304 - per_cpu_ptr(zone->pageset, cpu)->stat_threshold 306 + per_cpu_ptr(zone->per_cpu_zonestats, cpu)->stat_threshold 305 307 = threshold; 306 308 } 307 309 } ··· 314 316 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, 315 317 long delta) 316 318 { 317 - struct per_cpu_pageset __percpu *pcp = zone->pageset; 319 + struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats; 318 320 s8 __percpu *p = pcp->vm_stat_diff + item; 319 321 long x; 320 322 long t; ··· 387 389 */ 388 390 void __inc_zone_state(struct zone *zone, enum zone_stat_item item) 389 391 { 390 - struct per_cpu_pageset __percpu *pcp = zone->pageset; 392 + struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats; 391 393 s8 __percpu *p = pcp->vm_stat_diff + item; 392 394 s8 v, t; 393 395 ··· 433 435 434 436 void __dec_zone_state(struct zone *zone, enum zone_stat_item item) 435 437 { 436 - struct per_cpu_pageset __percpu *pcp = zone->pageset; 438 + struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats; 437 439 s8 __percpu *p = pcp->vm_stat_diff + item; 438 440 s8 v, t; 439 441 ··· 493 495 static inline void mod_zone_state(struct zone *zone, 494 496 enum zone_stat_item item, long delta, int overstep_mode) 495 497 { 496 - struct per_cpu_pageset __percpu *pcp = zone->pageset; 498 + struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats; 497 499 s8 __percpu *p = pcp->vm_stat_diff + item; 498 500 long o, n, t, z; 499 501 ··· 704 706 * Fold a differential into the global counters. 705 707 * Returns the number of counters updated. 
706 708 */ 707 - #ifdef CONFIG_NUMA 708 - static int fold_diff(int *zone_diff, int *numa_diff, int *node_diff) 709 - { 710 - int i; 711 - int changes = 0; 712 - 713 - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) 714 - if (zone_diff[i]) { 715 - atomic_long_add(zone_diff[i], &vm_zone_stat[i]); 716 - changes++; 717 - } 718 - 719 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) 720 - if (numa_diff[i]) { 721 - atomic_long_add(numa_diff[i], &vm_numa_stat[i]); 722 - changes++; 723 - } 724 - 725 - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 726 - if (node_diff[i]) { 727 - atomic_long_add(node_diff[i], &vm_node_stat[i]); 728 - changes++; 729 - } 730 - return changes; 731 - } 732 - #else 733 709 static int fold_diff(int *zone_diff, int *node_diff) 734 710 { 735 711 int i; ··· 722 750 } 723 751 return changes; 724 752 } 725 - #endif /* CONFIG_NUMA */ 753 + 754 + #ifdef CONFIG_NUMA 755 + static void fold_vm_zone_numa_events(struct zone *zone) 756 + { 757 + unsigned long zone_numa_events[NR_VM_NUMA_EVENT_ITEMS] = { 0, }; 758 + int cpu; 759 + enum numa_stat_item item; 760 + 761 + for_each_online_cpu(cpu) { 762 + struct per_cpu_zonestat *pzstats; 763 + 764 + pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); 765 + for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) 766 + zone_numa_events[item] += xchg(&pzstats->vm_numa_event[item], 0); 767 + } 768 + 769 + for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++) 770 + zone_numa_event_add(zone_numa_events[item], zone, item); 771 + } 772 + 773 + void fold_vm_numa_events(void) 774 + { 775 + struct zone *zone; 776 + 777 + for_each_populated_zone(zone) 778 + fold_vm_zone_numa_events(zone); 779 + } 780 + #endif 726 781 727 782 /* 728 783 * Update the zone counters for the current cpu. 
··· 773 774 struct zone *zone; 774 775 int i; 775 776 int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; 776 - #ifdef CONFIG_NUMA 777 - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; 778 - #endif 779 777 int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; 780 778 int changes = 0; 781 779 782 780 for_each_populated_zone(zone) { 783 - struct per_cpu_pageset __percpu *p = zone->pageset; 781 + struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats; 782 + #ifdef CONFIG_NUMA 783 + struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset; 784 + #endif 784 785 785 786 for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { 786 787 int v; 787 788 788 - v = this_cpu_xchg(p->vm_stat_diff[i], 0); 789 + v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0); 789 790 if (v) { 790 791 791 792 atomic_long_add(v, &zone->vm_stat[i]); 792 793 global_zone_diff[i] += v; 793 794 #ifdef CONFIG_NUMA 794 795 /* 3 seconds idle till flush */ 795 - __this_cpu_write(p->expire, 3); 796 + __this_cpu_write(pcp->expire, 3); 796 797 #endif 797 798 } 798 799 } 799 800 #ifdef CONFIG_NUMA 800 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) { 801 - int v; 802 - 803 - v = this_cpu_xchg(p->vm_numa_stat_diff[i], 0); 804 - if (v) { 805 - 806 - atomic_long_add(v, &zone->vm_numa_stat[i]); 807 - global_numa_diff[i] += v; 808 - __this_cpu_write(p->expire, 3); 809 - } 810 - } 811 801 812 802 if (do_pagesets) { 813 803 cond_resched(); ··· 807 819 * Check if there are pages remaining in this pageset 808 820 * if not then there is nothing to expire. 809 821 */ 810 - if (!__this_cpu_read(p->expire) || 811 - !__this_cpu_read(p->pcp.count)) 822 + if (!__this_cpu_read(pcp->expire) || 823 + !__this_cpu_read(pcp->count)) 812 824 continue; 813 825 814 826 /* 815 827 * We never drain zones local to this processor. 
816 828 */ 817 829 if (zone_to_nid(zone) == numa_node_id()) { 818 - __this_cpu_write(p->expire, 0); 830 + __this_cpu_write(pcp->expire, 0); 819 831 continue; 820 832 } 821 833 822 - if (__this_cpu_dec_return(p->expire)) 834 + if (__this_cpu_dec_return(pcp->expire)) 823 835 continue; 824 836 825 - if (__this_cpu_read(p->pcp.count)) { 826 - drain_zone_pages(zone, this_cpu_ptr(&p->pcp)); 837 + if (__this_cpu_read(pcp->count)) { 838 + drain_zone_pages(zone, this_cpu_ptr(pcp)); 827 839 changes++; 828 840 } 829 841 } ··· 844 856 } 845 857 } 846 858 847 - #ifdef CONFIG_NUMA 848 - changes += fold_diff(global_zone_diff, global_numa_diff, 849 - global_node_diff); 850 - #else 851 859 changes += fold_diff(global_zone_diff, global_node_diff); 852 - #endif 853 860 return changes; 854 861 } 855 862 ··· 859 876 struct zone *zone; 860 877 int i; 861 878 int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; 862 - #ifdef CONFIG_NUMA 863 - int global_numa_diff[NR_VM_NUMA_STAT_ITEMS] = { 0, }; 864 - #endif 865 879 int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, }; 866 880 867 881 for_each_populated_zone(zone) { 868 - struct per_cpu_pageset *p; 882 + struct per_cpu_zonestat *pzstats; 869 883 870 - p = per_cpu_ptr(zone->pageset, cpu); 884 + pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); 871 885 872 - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) 873 - if (p->vm_stat_diff[i]) { 886 + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { 887 + if (pzstats->vm_stat_diff[i]) { 874 888 int v; 875 889 876 - v = p->vm_stat_diff[i]; 877 - p->vm_stat_diff[i] = 0; 890 + v = pzstats->vm_stat_diff[i]; 891 + pzstats->vm_stat_diff[i] = 0; 878 892 atomic_long_add(v, &zone->vm_stat[i]); 879 893 global_zone_diff[i] += v; 880 894 } 881 - 895 + } 882 896 #ifdef CONFIG_NUMA 883 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) 884 - if (p->vm_numa_stat_diff[i]) { 885 - int v; 897 + for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++) { 898 + if (pzstats->vm_numa_event[i]) { 899 + unsigned long v; 886 900 887 - v = 
p->vm_numa_stat_diff[i]; 888 - p->vm_numa_stat_diff[i] = 0; 889 - atomic_long_add(v, &zone->vm_numa_stat[i]); 890 - global_numa_diff[i] += v; 901 + v = pzstats->vm_numa_event[i]; 902 + pzstats->vm_numa_event[i] = 0; 903 + zone_numa_event_add(v, zone, i); 891 904 } 905 + } 892 906 #endif 893 907 } 894 908 ··· 905 925 } 906 926 } 907 927 908 - #ifdef CONFIG_NUMA 909 - fold_diff(global_zone_diff, global_numa_diff, global_node_diff); 910 - #else 911 928 fold_diff(global_zone_diff, global_node_diff); 912 - #endif 913 929 } 914 930 915 931 /* 916 932 * this is only called if !populated_zone(zone), which implies no other users of 917 933 * pset->vm_stat_diff[] exist. 918 934 */ 919 - void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) 935 + void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats) 920 936 { 937 + unsigned long v; 921 938 int i; 922 939 923 - for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) 924 - if (pset->vm_stat_diff[i]) { 925 - int v = pset->vm_stat_diff[i]; 926 - pset->vm_stat_diff[i] = 0; 927 - atomic_long_add(v, &zone->vm_stat[i]); 928 - atomic_long_add(v, &vm_zone_stat[i]); 940 + for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { 941 + if (pzstats->vm_stat_diff[i]) { 942 + v = pzstats->vm_stat_diff[i]; 943 + pzstats->vm_stat_diff[i] = 0; 944 + zone_page_state_add(v, zone, i); 929 945 } 930 - 931 - #ifdef CONFIG_NUMA 932 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) 933 - if (pset->vm_numa_stat_diff[i]) { 934 - int v = pset->vm_numa_stat_diff[i]; 935 - 936 - pset->vm_numa_stat_diff[i] = 0; 937 - atomic_long_add(v, &zone->vm_numa_stat[i]); 938 - atomic_long_add(v, &vm_numa_stat[i]); 939 - } 940 - #endif 941 - } 942 - #endif 943 - 944 - #ifdef CONFIG_NUMA 945 - void __inc_numa_state(struct zone *zone, 946 - enum numa_stat_item item) 947 - { 948 - struct per_cpu_pageset __percpu *pcp = zone->pageset; 949 - u16 __percpu *p = pcp->vm_numa_stat_diff + item; 950 - u16 v; 951 - 952 - v = __this_cpu_inc_return(*p); 953 - 954 - if 
(unlikely(v > NUMA_STATS_THRESHOLD)) { 955 - zone_numa_state_add(v, zone, item); 956 - __this_cpu_write(*p, 0); 957 946 } 958 - } 959 947 948 + #ifdef CONFIG_NUMA 949 + for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++) { 950 + if (pzstats->vm_numa_event[i]) { 951 + v = pzstats->vm_numa_event[i]; 952 + pzstats->vm_numa_event[i] = 0; 953 + zone_numa_event_add(v, zone, i); 954 + } 955 + } 956 + #endif 957 + } 958 + #endif 959 + 960 + #ifdef CONFIG_NUMA 960 961 /* 961 962 * Determine the per node value of a stat item. This function 962 963 * is called frequently in a NUMA machine, so try to be as ··· 956 995 return count; 957 996 } 958 997 959 - /* 960 - * Determine the per node value of a numa stat item. To avoid deviation, 961 - * the per cpu stat number in vm_numa_stat_diff[] is also included. 962 - */ 963 - unsigned long sum_zone_numa_state(int node, 998 + /* Determine the per node value of a numa stat item. */ 999 + unsigned long sum_zone_numa_event_state(int node, 964 1000 enum numa_stat_item item) 965 1001 { 966 1002 struct zone *zones = NODE_DATA(node)->node_zones; 967 - int i; 968 1003 unsigned long count = 0; 1004 + int i; 969 1005 970 1006 for (i = 0; i < MAX_NR_ZONES; i++) 971 - count += zone_numa_state_snapshot(zones + i, item); 1007 + count += zone_numa_event_state(zones + i, item); 972 1008 973 1009 return count; 974 1010 } ··· 1644 1686 zone_page_state(zone, i)); 1645 1687 1646 1688 #ifdef CONFIG_NUMA 1647 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) 1689 + for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++) 1648 1690 seq_printf(m, "\n %-12s %lu", numa_stat_name(i), 1649 - zone_numa_state_snapshot(zone, i)); 1691 + zone_numa_event_state(zone, i)); 1650 1692 #endif 1651 1693 1652 1694 seq_printf(m, "\n pagesets"); 1653 1695 for_each_online_cpu(i) { 1654 - struct per_cpu_pageset *pageset; 1696 + struct per_cpu_pages *pcp; 1697 + struct per_cpu_zonestat __maybe_unused *pzstats; 1655 1698 1656 - pageset = per_cpu_ptr(zone->pageset, i); 1699 + pcp = 
per_cpu_ptr(zone->per_cpu_pageset, i); 1657 1700 seq_printf(m, 1658 1701 "\n cpu: %i" 1659 1702 "\n count: %i" 1660 1703 "\n high: %i" 1661 1704 "\n batch: %i", 1662 1705 i, 1663 - pageset->pcp.count, 1664 - pageset->pcp.high, 1665 - pageset->pcp.batch); 1706 + pcp->count, 1707 + pcp->high, 1708 + pcp->batch); 1666 1709 #ifdef CONFIG_SMP 1710 + pzstats = per_cpu_ptr(zone->per_cpu_zonestats, i); 1667 1711 seq_printf(m, "\n vm stats threshold: %d", 1668 - pageset->stat_threshold); 1712 + pzstats->stat_threshold); 1669 1713 #endif 1670 1714 } 1671 1715 seq_printf(m, ··· 1700 1740 }; 1701 1741 1702 1742 #define NR_VMSTAT_ITEMS (NR_VM_ZONE_STAT_ITEMS + \ 1703 - NR_VM_NUMA_STAT_ITEMS + \ 1743 + NR_VM_NUMA_EVENT_ITEMS + \ 1704 1744 NR_VM_NODE_STAT_ITEMS + \ 1705 1745 NR_VM_WRITEBACK_STAT_ITEMS + \ 1706 1746 (IS_ENABLED(CONFIG_VM_EVENT_COUNTERS) ? \ ··· 1715 1755 return NULL; 1716 1756 1717 1757 BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) < NR_VMSTAT_ITEMS); 1758 + fold_vm_numa_events(); 1718 1759 v = kmalloc_array(NR_VMSTAT_ITEMS, sizeof(unsigned long), GFP_KERNEL); 1719 1760 m->private = v; 1720 1761 if (!v) ··· 1725 1764 v += NR_VM_ZONE_STAT_ITEMS; 1726 1765 1727 1766 #ifdef CONFIG_NUMA 1728 - for (i = 0; i < NR_VM_NUMA_STAT_ITEMS; i++) 1729 - v[i] = global_numa_state(i); 1730 - v += NR_VM_NUMA_STAT_ITEMS; 1767 + for (i = 0; i < NR_VM_NUMA_EVENT_ITEMS; i++) 1768 + v[i] = global_numa_event_state(i); 1769 + v += NR_VM_NUMA_EVENT_ITEMS; 1731 1770 #endif 1732 1771 1733 1772 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { ··· 1888 1927 struct zone *zone; 1889 1928 1890 1929 for_each_populated_zone(zone) { 1891 - struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); 1930 + struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu); 1892 1931 struct per_cpu_nodestat *n; 1932 + 1893 1933 /* 1894 1934 * The fast way of checking if there are any vmstat diffs. 
1895 1935 */ 1896 - if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS * 1897 - sizeof(p->vm_stat_diff[0]))) 1936 + if (memchr_inv(pzstats->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS * 1937 + sizeof(pzstats->vm_stat_diff[0]))) 1898 1938 return true; 1899 - #ifdef CONFIG_NUMA 1900 - if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS * 1901 - sizeof(p->vm_numa_stat_diff[0]))) 1902 - return true; 1903 - #endif 1939 + 1904 1940 if (last_pgdat == zone->zone_pgdat) 1905 1941 continue; 1906 1942 last_pgdat = zone->zone_pgdat;
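The vmstat changes above repeatedly use one folding pattern: zero each per-CPU counter with xchg() and accumulate the returned value into a global total, so concurrent increments are never lost or double-counted. A minimal user-space sketch of that pattern in C11 (the array sizes and names here are hypothetical stand-ins for pzstats->vm_numa_event and zone_numa_event_add(), not the kernel's actual per-CPU machinery):

```c
#include <stdatomic.h>

#define NR_CPUS   4
#define NR_ITEMS  2

/* Stand-ins for the kernel's per-CPU event counters and the
 * zone-wide totals they are folded into. */
static _Atomic unsigned long percpu_event[NR_CPUS][NR_ITEMS];
static unsigned long global_event[NR_ITEMS];

/* Mimics fold_vm_zone_numa_events(): drain every CPU's slot into
 * the global counters. atomic_exchange() reads and zeroes the slot
 * in one step, so an increment racing with the fold lands either in
 * the value we fold now or in the slot for the next fold. */
static void fold_events(void)
{
	int cpu, item;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (item = 0; item < NR_ITEMS; item++)
			global_event[item] +=
				atomic_exchange(&percpu_event[cpu][item], 0);
}
```

The same shape appears in both fold_vm_zone_numa_events() and the hot refresh path: the global counter is only approximate between folds, which is exactly the trade-off the series makes by turning NUMA stats into fold-on-read events.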
+1 -1
mm/workingset.c
··· 408 408 memcg = page_memcg_rcu(page); 409 409 if (!mem_cgroup_disabled() && !memcg) 410 410 goto out; 411 - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); 411 + lruvec = mem_cgroup_page_lruvec(page); 412 412 workingset_age_nonresident(lruvec, thp_nr_pages(page)); 413 413 out: 414 414 rcu_read_unlock();
+2 -2
net/ipv4/tcp.c
··· 2095 2095 2096 2096 mmap_read_lock(current->mm); 2097 2097 2098 - vma = find_vma(current->mm, address); 2099 - if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops) { 2098 + vma = vma_lookup(current->mm, address); 2099 + if (!vma || vma->vm_ops != &tcp_vm_ops) { 2100 2100 mmap_read_unlock(current->mm); 2101 2101 return -EINVAL; 2102 2102 }
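The tcp.c hunk swaps an open-coded find_vma() plus vm_start check for the new vma_lookup() helper, which returns a VMA only if it actually contains the address. A simplified user-space model of why the two forms are equivalent (the struct and singly linked list here are illustrative stand-ins for the kernel's VMA tree, not its real data structure):

```c
#include <stddef.h>

/* Simplified stand-in for a process's VMAs: a sorted list of
 * non-overlapping [vm_start, vm_end) ranges. */
struct vma {
	unsigned long vm_start, vm_end;
	struct vma *next;
};

/* find_vma() semantics: first VMA with vm_end > addr. Note that the
 * returned VMA may begin *above* addr (addr falls in a gap). */
static struct vma *find_vma(struct vma *head, unsigned long addr)
{
	struct vma *v;

	for (v = head; v; v = v->next)
		if (v->vm_end > addr)
			return v;
	return NULL;
}

/* vma_lookup() semantics: only a VMA that actually contains addr,
 * folding the "vma->vm_start > addr" check into the lookup. */
static struct vma *vma_lookup(struct vma *head, unsigned long addr)
{
	struct vma *v = find_vma(head, addr);

	return (v && v->vm_start <= addr) ? v : NULL;
}
```

This is why the callers in tcp.c and kvm_main.c can drop their explicit bounds checks: the "!vma || vma->vm_start > address" condition collapses to a single "!vma" test on the vma_lookup() result.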
+39 -37
scripts/kconfig/streamline_config.pl
··· 601 601 sub in_preserved_kconfigs { 602 602 my $kconfig = $config2kfile{$_[0]}; 603 603 if (!defined($kconfig)) { 604 - return 0; 604 + return 0; 605 605 } 606 606 foreach my $excl (@preserved_kconfigs) { 607 - if($kconfig =~ /^$excl/) { 608 - return 1; 609 - } 607 + if($kconfig =~ /^$excl/) { 608 + return 1; 609 + } 610 610 } 611 611 return 0; 612 612 } ··· 629 629 } 630 630 631 631 if (/CONFIG_MODULE_SIG_KEY="(.+)"/) { 632 - my $orig_cert = $1; 633 - my $default_cert = "certs/signing_key.pem"; 632 + my $orig_cert = $1; 633 + my $default_cert = "certs/signing_key.pem"; 634 634 635 - # Check that the logic in this script still matches the one in Kconfig 636 - if (!defined($depends{"MODULE_SIG_KEY"}) || 637 - $depends{"MODULE_SIG_KEY"} !~ /"\Q$default_cert\E"/) { 638 - print STDERR "WARNING: MODULE_SIG_KEY assertion failure, ", 639 - "update needed to ", __FILE__, " line ", __LINE__, "\n"; 640 - print; 641 - } elsif ($orig_cert ne $default_cert && ! -f $orig_cert) { 642 - print STDERR "Module signature verification enabled but ", 643 - "module signing key \"$orig_cert\" not found. Resetting ", 644 - "signing key to default value.\n"; 645 - print "CONFIG_MODULE_SIG_KEY=\"$default_cert\"\n"; 646 - } else { 647 - print; 648 - } 649 - next; 635 + # Check that the logic in this script still matches the one in Kconfig 636 + if (!defined($depends{"MODULE_SIG_KEY"}) || 637 + $depends{"MODULE_SIG_KEY"} !~ /"\Q$default_cert\E"/) { 638 + print STDERR "WARNING: MODULE_SIG_KEY assertion failure, ", 639 + "update needed to ", __FILE__, " line ", __LINE__, "\n"; 640 + print; 641 + } elsif ($orig_cert ne $default_cert && ! -f $orig_cert) { 642 + print STDERR "Module signature verification enabled but ", 643 + "module signing key \"$orig_cert\" not found. 
Resetting ", 644 + "signing key to default value.\n"; 645 + print "CONFIG_MODULE_SIG_KEY=\"$default_cert\"\n"; 646 + } else { 647 + print; 648 + } 649 + next; 650 650 } 651 651 652 652 if (/CONFIG_SYSTEM_TRUSTED_KEYS="(.+)"/) { 653 - my $orig_keys = $1; 653 + my $orig_keys = $1; 654 654 655 - if (! -f $orig_keys) { 656 - print STDERR "System keyring enabled but keys \"$orig_keys\" ", 657 - "not found. Resetting keys to default value.\n"; 658 - print "CONFIG_SYSTEM_TRUSTED_KEYS=\"\"\n"; 659 - } else { 660 - print; 661 - } 662 - next; 655 + if (! -f $orig_keys) { 656 + print STDERR "System keyring enabled but keys \"$orig_keys\" ", 657 + "not found. Resetting keys to default value.\n"; 658 + print "CONFIG_SYSTEM_TRUSTED_KEYS=\"\"\n"; 659 + } else { 660 + print; 661 + } 662 + next; 663 663 } 664 664 665 665 if (/^(CONFIG.*)=(m|y)/) { 666 - if (in_preserved_kconfigs($1)) { 667 - dprint "Preserve config $1"; 668 - print; 669 - next; 670 - } 666 + if (in_preserved_kconfigs($1)) { 667 + dprint "Preserve config $1"; 668 + print; 669 + next; 670 + } 671 671 if (defined($configs{$1})) { 672 672 if ($localyesconfig) { 673 - $setconfigs{$1} = 'y'; 673 + $setconfigs{$1} = 'y'; 674 674 print "$1=y\n"; 675 675 next; 676 676 } else { 677 - $setconfigs{$1} = $2; 677 + $setconfigs{$1} = $2; 678 678 } 679 679 } elsif ($2 eq "m") { 680 680 print "# $1 is not set\n"; ··· 702 702 print STDERR "\n"; 703 703 } 704 704 } 705 + 706 + # vim: softtabstop=4
+16
scripts/spelling.txt
··· 22 22 absoulte||absolute 23 23 acccess||access 24 24 acceess||access 25 + accelaration||acceleration 25 26 acceleratoin||acceleration 26 27 accelleration||acceleration 27 28 accesing||accessing ··· 265 264 calulate||calculate 266 265 cancelation||cancellation 267 266 cancle||cancel 267 + canot||cannot 268 268 capabilites||capabilities 269 269 capabilties||capabilities 270 270 capabilty||capability ··· 496 494 dimention||dimension 497 495 dimesions||dimensions 498 496 diconnected||disconnected 497 + disabed||disabled 498 + disble||disable 499 499 disgest||digest 500 + disired||desired 500 501 dispalying||displaying 501 502 diplay||display 502 503 directon||direction ··· 715 710 heirarchically||hierarchically 716 711 heirarchy||hierarchy 717 712 helpfull||helpful 713 + hearbeat||heartbeat 718 714 heterogenous||heterogeneous 719 715 hexdecimal||hexadecimal 720 716 hybernate||hibernate ··· 995 989 notifcations||notifications 996 990 notifed||notified 997 991 notity||notify 992 + nubmer||number 998 993 numebr||number 999 994 numner||number 1000 995 obtaion||obtain ··· 1021 1014 ommitted||omitted 1022 1015 onself||oneself 1023 1016 ony||only 1017 + openning||opening 1024 1018 operatione||operation 1025 1019 opertaions||operations 1020 + opportunies||opportunities 1026 1021 optionnal||optional 1027 1022 optmizations||optimizations 1028 1023 orientatied||orientated ··· 1120 1111 preform||perform 1121 1112 premption||preemption 1122 1113 prepaired||prepared 1114 + prepate||prepare 1123 1115 preperation||preparation 1124 1116 preprare||prepare 1125 1117 pressre||pressure ··· 1133 1123 privilage||privilege 1134 1124 priviledge||privilege 1135 1125 priviledges||privileges 1126 + privleges||privileges 1136 1127 probaly||probably 1137 1128 procceed||proceed 1138 1129 proccesors||processors ··· 1178 1167 psudo||pseudo 1179 1168 psuedo||pseudo 1180 1169 psychadelic||psychedelic 1170 + purgable||purgeable 1181 1171 pwoer||power 1182 1172 queing||queuing 1183 1173 
quering||querying ··· 1192 1180 recepient||recipient 1193 1181 recevied||received 1194 1182 receving||receiving 1183 + recievd||received 1195 1184 recieved||received 1196 1185 recieve||receive 1197 1186 reciever||receiver ··· 1241 1228 representaion||representation 1242 1229 reqeust||request 1243 1230 reqister||register 1231 + requed||requeued 1244 1232 requestied||requested 1245 1233 requiere||require 1246 1234 requirment||requirement ··· 1346 1332 singed||signed 1347 1333 sleeped||slept 1348 1334 sliped||slipped 1335 + softwade||software 1349 1336 softwares||software 1350 1337 soley||solely 1351 1338 souce||source ··· 1525 1510 unitialized||uninitialized 1526 1511 unkmown||unknown 1527 1512 unknonw||unknown 1513 + unknouwn||unknown 1528 1514 unknow||unknown 1529 1515 unkown||unknown 1530 1516 unamed||unnamed
+65 -31
tools/testing/selftests/vm/gup_test.c
··· 6 6 #include <sys/mman.h> 7 7 #include <sys/stat.h> 8 8 #include <sys/types.h> 9 + #include <pthread.h> 10 + #include <assert.h> 9 11 #include "../../../../mm/gup_test.h" 10 12 11 13 #define MB (1UL << 20) ··· 16 14 /* Just the flags we need, copied from mm.h: */ 17 15 #define FOLL_WRITE 0x01 /* check pte is writable */ 18 16 #define FOLL_TOUCH 0x02 /* mark page accessed */ 17 + 18 + static unsigned long cmd = GUP_FAST_BENCHMARK; 19 + static int gup_fd, repeats = 1; 20 + static unsigned long size = 128 * MB; 21 + /* Serialize prints */ 22 + static pthread_mutex_t print_mutex = PTHREAD_MUTEX_INITIALIZER; 19 23 20 24 static char *cmd_to_str(unsigned long cmd) 21 25 { ··· 42 34 return "Unknown command"; 43 35 } 44 36 37 + void *gup_thread(void *data) 38 + { 39 + struct gup_test gup = *(struct gup_test *)data; 40 + int i; 41 + 42 + /* Only report timing information on the *_BENCHMARK commands: */ 43 + if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) || 44 + (cmd == PIN_LONGTERM_BENCHMARK)) { 45 + for (i = 0; i < repeats; i++) { 46 + gup.size = size; 47 + if (ioctl(gup_fd, cmd, &gup)) 48 + perror("ioctl"), exit(1); 49 + 50 + pthread_mutex_lock(&print_mutex); 51 + printf("%s: Time: get:%lld put:%lld us", 52 + cmd_to_str(cmd), gup.get_delta_usec, 53 + gup.put_delta_usec); 54 + if (gup.size != size) 55 + printf(", truncated (size: %lld)", gup.size); 56 + printf("\n"); 57 + pthread_mutex_unlock(&print_mutex); 58 + } 59 + } else { 60 + gup.size = size; 61 + if (ioctl(gup_fd, cmd, &gup)) { 62 + perror("ioctl"); 63 + exit(1); 64 + } 65 + 66 + pthread_mutex_lock(&print_mutex); 67 + printf("%s: done\n", cmd_to_str(cmd)); 68 + if (gup.size != size) 69 + printf("Truncated (size: %lld)\n", gup.size); 70 + pthread_mutex_unlock(&print_mutex); 71 + } 72 + 73 + return NULL; 74 + } 75 + 45 76 int main(int argc, char **argv) 46 77 { 47 78 struct gup_test gup = { 0 }; 48 - unsigned long size = 128 * MB; 49 - int i, fd, filed, opt, nr_pages = 1, thp = -1, repeats = 1, 
write = 1; 50 - unsigned long cmd = GUP_FAST_BENCHMARK; 79 + int filed, i, opt, nr_pages = 1, thp = -1, write = 1, nthreads = 1, ret; 51 80 int flags = MAP_PRIVATE, touch = 0; 52 81 char *file = "/dev/zero"; 82 + pthread_t *tid; 53 83 char *p; 54 84 55 - while ((opt = getopt(argc, argv, "m:r:n:F:f:abctTLUuwWSHpz")) != -1) { 85 + while ((opt = getopt(argc, argv, "m:r:n:F:f:abcj:tTLUuwWSHpz")) != -1) { 56 86 switch (opt) { 57 87 case 'a': 58 88 cmd = PIN_FAST_BENCHMARK; ··· 119 73 case 'F': 120 74 /* strtol, so you can pass flags in hex form */ 121 75 gup.gup_flags = strtol(optarg, 0, 0); 76 + break; 77 + case 'j': 78 + nthreads = atoi(optarg); 122 79 break; 123 80 case 'm': 124 81 size = atoi(optarg) * MB; ··· 203 154 if (write) 204 155 gup.gup_flags |= FOLL_WRITE; 205 156 206 - fd = open("/sys/kernel/debug/gup_test", O_RDWR); 207 - if (fd == -1) { 157 + gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR); 158 + if (gup_fd == -1) { 208 159 perror("open"); 209 160 exit(1); 210 161 } ··· 234 185 p[0] = 0; 235 186 } 236 187 237 - /* Only report timing information on the *_BENCHMARK commands: */ 238 - if ((cmd == PIN_FAST_BENCHMARK) || (cmd == GUP_FAST_BENCHMARK) || 239 - (cmd == PIN_LONGTERM_BENCHMARK)) { 240 - for (i = 0; i < repeats; i++) { 241 - gup.size = size; 242 - if (ioctl(fd, cmd, &gup)) 243 - perror("ioctl"), exit(1); 244 - 245 - printf("%s: Time: get:%lld put:%lld us", 246 - cmd_to_str(cmd), gup.get_delta_usec, 247 - gup.put_delta_usec); 248 - if (gup.size != size) 249 - printf(", truncated (size: %lld)", gup.size); 250 - printf("\n"); 251 - } 252 - } else { 253 - gup.size = size; 254 - if (ioctl(fd, cmd, &gup)) { 255 - perror("ioctl"); 256 - exit(1); 257 - } 258 - 259 - printf("%s: done\n", cmd_to_str(cmd)); 260 - if (gup.size != size) 261 - printf("Truncated (size: %lld)\n", gup.size); 188 + tid = malloc(sizeof(pthread_t) * nthreads); 189 + assert(tid); 190 + for (i = 0; i < nthreads; i++) { 191 + ret = pthread_create(&tid[i], NULL, gup_thread, &gup); 192 
+ assert(ret == 0); 262 193 } 194 + for (i = 0; i < nthreads; i++) { 195 + ret = pthread_join(tid[i], NULL); 196 + assert(ret == 0); 197 + } 198 + free(tid); 263 199 264 200 return 0; 265 201 }
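The new -j option fans the ioctl loop out across worker threads using the standard pthread create/join skeleton, with a mutex serializing the per-thread prints. A minimal self-contained sketch of that skeleton (the worker body here is a placeholder for gup_thread(), and the shared counter merely stands in for the work each thread does):

```c
#include <pthread.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;	/* shared state the workers update */

/* Placeholder for gup_thread(): take the lock, do one unit of
 * work, release the lock. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	counter++;
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Launch and reap NTHREADS workers, as main() does with the
 * tid[] array in gup_test.c. Returns 0 on success. */
static int run_workers(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		if (pthread_create(&tid[i], NULL, worker, NULL))
			return -1;
	for (i = 0; i < NTHREADS; i++)
		if (pthread_join(tid[i], NULL))
			return -1;
	return 0;
}
```

As in the selftest, every thread receives the same argument; gup_test.c copies the shared struct gup_test into a thread-local variable at the top of gup_thread() so each thread can modify gup.size without racing its siblings.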
+4
tools/vm/page_owner_sort.c
··· 132 132 qsort(list, list_size, sizeof(list[0]), compare_txt); 133 133 134 134 list2 = malloc(sizeof(*list) * list_size); 135 + if (!list2) { 136 + printf("Out of memory\n"); 137 + exit(1); 138 + } 135 139 136 140 printf("culling\n"); 137 141
+1 -1
virt/kvm/kvm_main.c
··· 2290 2290 } 2291 2291 2292 2292 retry: 2293 - vma = find_vma_intersection(current->mm, addr, addr + 1); 2293 + vma = vma_lookup(current->mm, addr); 2294 2294 2295 2295 if (vma == NULL) 2296 2296 pfn = KVM_PFN_ERR_FAULT;