Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:

- a few hotfixes

- various misc updates

- ocfs2 updates

- most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (108 commits)
mm, memory_hotplug: move movable_node to the hotplug proper
mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
mm, memory_hotplug: drop artificial restriction on online/offline
mm: memcontrol: account slab stats per lruvec
mm: memcontrol: per-lruvec stats infrastructure
mm: memcontrol: use generic mod_memcg_page_state for kmem pages
mm: memcontrol: use the node-native slab memory counters
mm: vmstat: move slab statistics from zone to node counters
mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()
mm/zswap.c: improve a size determination in zswap_frontswap_init()
mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()
mm/swapfile.c: sort swap entries before free
mm/oom_kill: count global and memory cgroup oom kills
mm: per-cgroup memory reclaim stats
mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects
mm: kmemleak: factor object reference updating out of scan_block()
mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures
mm, mempolicy: don't check cpuset seqlock where it doesn't matter
mm, cpuset: always use seqlock when changing task's nodemask
mm, mempolicy: simplify rebinding mempolicies when updating cpusets
...

+3095 -1725
+13 -4
Documentation/admin-guide/kernel-parameters.txt
···
 	that the amount of memory usable for all allocations
 	is not too small.
 
-	movable_node	[KNL] Boot-time switch to enable the effects
-	of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+	movable_node	[KNL] Boot-time switch to make hotpluggable memory
+	NUMA nodes movable. This means that the memory
+	of such nodes will be usable only for movable
+	allocations, which rules out almost all kernel
+	allocations. Use with caution!
 
 	MTD_Partition=	[MTD]
 	Format: <name>,<region-number>,<size>,<offset>
···
 	slab_nomerge	[MM]
 	Disable merging of slabs with similar size. May be
 	necessary if there is some reason to distinguish
-	allocs to different slabs. Debug options disable
-	merging on their own.
+	allocs to different slabs, especially in hardened
+	environments, where heap overflows and attacker
+	control of the slab layout can usually be
+	frustrated by disabling merging. This will reduce
+	most of the exposure of a heap attack to a single
+	cache (risks via metadata attacks are mostly
+	unchanged). Debug options disable merging on their
+	own.
 	For more information see Documentation/vm/slub.txt.
 
 	slab_max_order=	[MM, SLAB]
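Both movable_node and slab_nomerge above are boot-time switches, not runtime knobs; a running system only reveals whether they were passed on the command line. A minimal sketch of checking for them (the sample command line is made up for illustration; on a live system you would read /proc/cmdline, and note that word-exact matching deliberately ignores parameters carrying values such as slab_max_order=3):

```python
def boot_params_present(cmdline, params):
    """Report which boot parameters appear as standalone words in a
    kernel command line string. Simplification: only matches bare
    flags, not key=value parameters."""
    words = set(cmdline.split())
    return {p: p in words for p in params}

# Hypothetical command line for illustration only.
sample = "BOOT_IMAGE=/vmlinuz-4.13 root=/dev/sda1 ro movable_node slab_nomerge"
print(boot_params_present(sample, ["movable_node", "slab_nomerge"]))
# -> {'movable_node': True, 'slab_nomerge': True}
```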
+44 -4
Documentation/cgroup-v2.txt
···
 
 	The number of times the cgroup's memory usage was
 	about to go over the max boundary. If direct reclaim
-	fails to bring it down, the OOM killer is invoked.
+	fails to bring it down, the cgroup goes to OOM state.
 
   oom
 
-	The number of times the OOM killer has been invoked in
-	the cgroup. This may not exactly match the number of
-	processes killed but should generally be close.
+	The number of times the cgroup's memory usage reached
+	the limit and an allocation was about to fail.
+
+	Depending on context, the result could be an invocation
+	of the OOM killer with a retried allocation, or a failed
+	allocation.
+
+	A failed allocation, in its turn, could be returned to
+	userspace as -ENOMEM or silently ignored in cases like
+	disk readahead. For now, OOM in a memory cgroup kills
+	tasks only if the shortage happened inside a page fault.
+
+  oom_kill
+
+	The number of processes belonging to this cgroup
+	killed by any kind of OOM killer.
 
   memory.stat
 
···
   workingset_nodereclaim
 
 	Number of times a shadow node has been reclaimed
+
+  pgrefill
+
+	Amount of scanned pages (in an active LRU list)
+
+  pgscan
+
+	Amount of scanned pages (in an inactive LRU list)
+
+  pgsteal
+
+	Amount of reclaimed pages
+
+  pgactivate
+
+	Amount of pages moved to the active LRU list
+
+  pgdeactivate
+
+	Amount of pages moved to the inactive LRU list
+
+  pglazyfree
+
+	Amount of pages postponed to be freed under memory pressure
+
+  pglazyfreed
+
+	Amount of reclaimed lazyfree pages
 
   memory.swap.current
 
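The memory.events counters documented above (including the new oom_kill) use cgroup v2's flat keyed format, one "key value" pair per line. A small sketch of how a monitoring tool might parse such a file (the sample text is invented; a real reader would open /sys/fs/cgroup/<group>/memory.events):

```python
def parse_flat_keyed(text):
    """Parse the flat keyed format used by cgroup v2 files such as
    memory.events and memory.stat: one "key value" pair per line."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        counters[key] = int(value)
    return counters

# Hypothetical memory.events contents after the oom_kill counter landed.
sample = """\
low 0
high 127
max 4
oom 2
oom_kill 1
"""
counters = parse_flat_keyed(sample)
print(counters["oom"], counters["oom_kill"])  # -> 2 1
```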
+1
Documentation/dev-tools/kmemleak.rst
··· 150 150 - ``kmemleak_init`` - initialize kmemleak 151 151 - ``kmemleak_alloc`` - notify of a memory block allocation 152 152 - ``kmemleak_alloc_percpu`` - notify of a percpu memory block allocation 153 + - ``kmemleak_vmalloc`` - notify of a vmalloc() memory allocation 153 154 - ``kmemleak_free`` - notify of a memory block freeing 154 155 - ``kmemleak_free_part`` - notify of a partial memory block freeing 155 156 - ``kmemleak_free_percpu`` - notify of a percpu memory block freeing
+63
Documentation/vm/ksm.txt
···
 it is only effective for pages merged after the change.
 Default: 0 (normal KSM behaviour as in earlier releases)
 
+max_page_sharing - Maximum sharing allowed for each KSM page. This
+                   enforces a deduplication limit to keep the virtual
+                   memory rmap lists from growing too large. The
+                   minimum value is 2, as a newly created KSM page
+                   will have at least two sharers. The rmap walk has
+                   O(N) complexity, where N is the number of
+                   rmap_items (i.e. virtual mappings) sharing the
+                   page, which is in turn capped by max_page_sharing.
+                   So this effectively spreads the linear O(N)
+                   computational complexity of the rmap walk context
+                   over different KSM pages. The ksmd walk over the
+                   stable_node "chains" is also O(N), but N is the
+                   number of stable_node "dups", not the number of
+                   rmap_items, so it does not have a significant
+                   impact on ksmd performance. In practice the best
+                   stable_node "dup" candidate is kept and found at
+                   the head of the "dups" list. The higher this
+                   value, the faster KSM merges the memory (because
+                   there are fewer stable_node dups queued into the
+                   stable_node chain->hlist to check for pruning) and
+                   the higher the deduplication factor, but the
+                   slower the worst-case rmap walk can be for any
+                   given KSM page. A slower rmap walk means higher
+                   latency for certain virtual memory operations
+                   happening during swapping, compaction, NUMA
+                   balancing and page migration, in turn decreasing
+                   responsiveness for the callers of those virtual
+                   memory operations. The scheduler latency of other
+                   tasks not involved with the VM operations doing
+                   the rmap walk is not affected by this parameter,
+                   as the rmap walks are always schedule-friendly
+                   themselves.
+
+stable_node_chains_prune_millisecs - How frequently to walk the whole
+                   list of stable_node "dups" linked into the
+                   stable_node "chains" in order to prune stale
+                   stable_nodes. Smaller millisecs values free up the
+                   KSM metadata with lower latency, but make ksmd use
+                   more CPU during the scan. This only applies to the
+                   stable_node chains, so it is a noop if no KSM page
+                   has hit max_page_sharing yet (there would be no
+                   stable_node chains in that case).
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
 
 pages_shared - how many shared pages are being used
···
 pages_volatile - how many pages changing too fast to be placed in a tree
 full_scans - how many times all mergeable areas have been scanned
 
+stable_node_chains - number of stable node chains allocated; this is
+                     effectively the number of KSM pages that hit the
+                     max_page_sharing limit
+stable_node_dups - number of stable node dups queued into the
+                   stable_node chains
+
 A high ratio of pages_sharing to pages_shared indicates good sharing, but
 a high ratio of pages_unshared to pages_sharing indicates wasted effort.
 pages_volatile embraces several different kinds of activity, but a high
 proportion there would also indicate poor use of madvise MADV_MERGEABLE.
+
+The maximum possible pages_sharing/pages_shared ratio is limited by the
+max_page_sharing tunable. To increase the ratio, max_page_sharing must
+be increased accordingly.
+
+The stable_node_dups/stable_node_chains ratio is also affected by the
+max_page_sharing tunable, and a high ratio may indicate fragmentation
+in the stable_node dups. This could be solved by introducing
+fragmentation algorithms in ksmd which would refile rmap_items from
+one stable_node dup to another, in order to free up stable_node "dups"
+with few rmap_items in them; but that may increase ksmd CPU usage and
+possibly slow down the read-only computations on the KSM pages of the
+applications.
 
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
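The cap described above implies that identical content spanning more than max_page_sharing mappings is backed by several stable_node "dups" on one chain. A rough arithmetic model of the resulting dup count (the default cap of 256 is an assumption here; check /sys/kernel/mm/ksm/max_page_sharing on a real system):

```python
import math

def ksm_dups_needed(n_mappings, max_page_sharing=256):
    """Minimum number of stable_node "dups" (distinct physical KSM
    pages holding identical content) so that no dup is shared by more
    than max_page_sharing virtual mappings. The default of 256 is an
    assumed value, not read from the running kernel."""
    if max_page_sharing < 2:
        raise ValueError("the minimum allowed value is 2")
    return math.ceil(n_mappings / max_page_sharing)

# 10,000 identical guest pages under the assumed default cap of 256:
print(ksm_dups_needed(10_000))  # -> 40
```

This also makes the documented ratio bound concrete: with one dup per chain, pages_sharing/pages_shared can never exceed max_page_sharing.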
+1 -1
arch/arm64/Kconfig
··· 13 13 select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI 14 14 select ARCH_HAS_ELF_RANDOMIZE 15 15 select ARCH_HAS_GCOV_PROFILE_ALL 16 - select ARCH_HAS_GIGANTIC_PAGE 16 + select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA 17 17 select ARCH_HAS_KCOV 18 18 select ARCH_HAS_SET_MEMORY 19 19 select ARCH_HAS_SG_CHAIN
+4
arch/arm64/include/asm/hugetlb.h
··· 83 83 extern void huge_ptep_clear_flush(struct vm_area_struct *vma, 84 84 unsigned long addr, pte_t *ptep); 85 85 86 + #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE 87 + static inline bool gigantic_page_supported(void) { return true; } 88 + #endif 89 + 86 90 #endif /* __ASM_HUGETLB_H */
+22 -31
arch/arm64/mm/hugetlbpage.c
··· 42 42 } 43 43 44 44 static int find_num_contig(struct mm_struct *mm, unsigned long addr, 45 - pte_t *ptep, pte_t pte, size_t *pgsize) 45 + pte_t *ptep, size_t *pgsize) 46 46 { 47 47 pgd_t *pgd = pgd_offset(mm, addr); 48 48 pud_t *pud; 49 49 pmd_t *pmd; 50 50 51 51 *pgsize = PAGE_SIZE; 52 - if (!pte_cont(pte)) 53 - return 1; 54 52 pud = pud_offset(pgd, addr); 55 53 pmd = pmd_offset(pud, addr); 56 54 if ((pte_t *)pmd == ptep) { ··· 63 65 { 64 66 size_t pgsize; 65 67 int i; 66 - int ncontig = find_num_contig(mm, addr, ptep, pte, &pgsize); 68 + int ncontig; 67 69 unsigned long pfn; 68 70 pgprot_t hugeprot; 69 71 70 - if (ncontig == 1) { 72 + if (!pte_cont(pte)) { 71 73 set_pte_at(mm, addr, ptep, pte); 72 74 return; 73 75 } 74 76 77 + ncontig = find_num_contig(mm, addr, ptep, &pgsize); 75 78 pfn = pte_pfn(pte); 76 79 hugeprot = __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte)); 77 80 for (i = 0; i < ncontig; i++) { ··· 131 132 return pte; 132 133 } 133 134 134 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 135 + pte_t *huge_pte_offset(struct mm_struct *mm, 136 + unsigned long addr, unsigned long sz) 135 137 { 136 138 pgd_t *pgd; 137 139 pud_t *pud; ··· 184 184 if (pte_cont(*ptep)) { 185 185 int ncontig, i; 186 186 size_t pgsize; 187 - pte_t *cpte; 188 187 bool is_dirty = false; 189 188 190 - cpte = huge_pte_offset(mm, addr); 191 - ncontig = find_num_contig(mm, addr, cpte, *cpte, &pgsize); 189 + ncontig = find_num_contig(mm, addr, ptep, &pgsize); 192 190 /* save the 1st pte to return */ 193 - pte = ptep_get_and_clear(mm, addr, cpte); 191 + pte = ptep_get_and_clear(mm, addr, ptep); 194 192 for (i = 1, addr += pgsize; i < ncontig; ++i, addr += pgsize) { 195 193 /* 196 194 * If HW_AFDBM is enabled, then the HW could 197 195 * turn on the dirty bit for any of the page 198 196 * in the set, so check them all. 
199 197 */ 200 - ++cpte; 201 - if (pte_dirty(ptep_get_and_clear(mm, addr, cpte))) 198 + ++ptep; 199 + if (pte_dirty(ptep_get_and_clear(mm, addr, ptep))) 202 200 is_dirty = true; 203 201 } 204 202 if (is_dirty) ··· 212 214 unsigned long addr, pte_t *ptep, 213 215 pte_t pte, int dirty) 214 216 { 215 - pte_t *cpte; 216 - 217 217 if (pte_cont(pte)) { 218 218 int ncontig, i, changed = 0; 219 219 size_t pgsize = 0; ··· 221 225 __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ 222 226 pte_val(pte)); 223 227 224 - cpte = huge_pte_offset(vma->vm_mm, addr); 225 - pfn = pte_pfn(*cpte); 226 - ncontig = find_num_contig(vma->vm_mm, addr, cpte, 227 - *cpte, &pgsize); 228 - for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize) { 229 - changed |= ptep_set_access_flags(vma, addr, cpte, 228 + pfn = pte_pfn(pte); 229 + ncontig = find_num_contig(vma->vm_mm, addr, ptep, 230 + &pgsize); 231 + for (i = 0; i < ncontig; ++i, ++ptep, addr += pgsize) { 232 + changed |= ptep_set_access_flags(vma, addr, ptep, 230 233 pfn_pte(pfn, 231 234 hugeprot), 232 235 dirty); ··· 242 247 { 243 248 if (pte_cont(*ptep)) { 244 249 int ncontig, i; 245 - pte_t *cpte; 246 250 size_t pgsize = 0; 247 251 248 - cpte = huge_pte_offset(mm, addr); 249 - ncontig = find_num_contig(mm, addr, cpte, *cpte, &pgsize); 250 - for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize) 251 - ptep_set_wrprotect(mm, addr, cpte); 252 + ncontig = find_num_contig(mm, addr, ptep, &pgsize); 253 + for (i = 0; i < ncontig; ++i, ++ptep, addr += pgsize) 254 + ptep_set_wrprotect(mm, addr, ptep); 252 255 } else { 253 256 ptep_set_wrprotect(mm, addr, ptep); 254 257 } ··· 257 264 { 258 265 if (pte_cont(*ptep)) { 259 266 int ncontig, i; 260 - pte_t *cpte; 261 267 size_t pgsize = 0; 262 268 263 - cpte = huge_pte_offset(vma->vm_mm, addr); 264 - ncontig = find_num_contig(vma->vm_mm, addr, cpte, 265 - *cpte, &pgsize); 266 - for (i = 0; i < ncontig; ++i, ++cpte, addr += pgsize) 267 - ptep_clear_flush(vma, addr, cpte); 269 + ncontig = 
find_num_contig(vma->vm_mm, addr, ptep, 270 + &pgsize); 271 + for (i = 0; i < ncontig; ++i, ++ptep, addr += pgsize) 272 + ptep_clear_flush(vma, addr, ptep); 268 273 } else { 269 274 ptep_clear_flush(vma, addr, ptep); 270 275 }
-1
arch/hexagon/include/asm/pgtable.h
··· 24 24 /* 25 25 * Page table definitions for Qualcomm Hexagon processor. 26 26 */ 27 - #include <linux/swap.h> 28 27 #include <asm/page.h> 29 28 #define __ARCH_USE_5LEVEL_HACK 30 29 #include <asm-generic/pgtable-nopmd.h>
-1
arch/hexagon/kernel/asm-offsets.c
··· 25 25 #include <linux/compat.h> 26 26 #include <linux/types.h> 27 27 #include <linux/sched.h> 28 - #include <linux/mm.h> 29 28 #include <linux/interrupt.h> 30 29 #include <linux/kbuild.h> 31 30 #include <asm/ptrace.h>
+1
arch/hexagon/mm/vm_tlb.c
··· 24 24 * be instantiated for it, differently from a native build. 25 25 */ 26 26 #include <linux/mm.h> 27 + #include <linux/sched.h> 27 28 #include <asm/page.h> 28 29 #include <asm/hexagon_vm.h> 29 30
+2 -2
arch/ia64/mm/hugetlbpage.c
··· 44 44 } 45 45 46 46 pte_t * 47 - huge_pte_offset (struct mm_struct *mm, unsigned long addr) 47 + huge_pte_offset (struct mm_struct *mm, unsigned long addr, unsigned long sz) 48 48 { 49 49 unsigned long taddr = htlbpage_to_page(addr); 50 50 pgd_t *pgd; ··· 92 92 if (REGION_NUMBER(addr) != RGN_HPAGE) 93 93 return ERR_PTR(-EINVAL); 94 94 95 - ptep = huge_pte_offset(mm, addr); 95 + ptep = huge_pte_offset(mm, addr, HPAGE_SIZE); 96 96 if (!ptep || pte_none(*ptep)) 97 97 return NULL; 98 98 page = pte_page(*ptep);
+2 -9
arch/ia64/mm/init.c
··· 646 646 } 647 647 648 648 #ifdef CONFIG_MEMORY_HOTPLUG 649 - int arch_add_memory(int nid, u64 start, u64 size, bool for_device) 649 + int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock) 650 650 { 651 - pg_data_t *pgdat; 652 - struct zone *zone; 653 651 unsigned long start_pfn = start >> PAGE_SHIFT; 654 652 unsigned long nr_pages = size >> PAGE_SHIFT; 655 653 int ret; 656 654 657 - pgdat = NODE_DATA(nid); 658 - 659 - zone = pgdat->node_zones + 660 - zone_for_memory(nid, start, size, ZONE_NORMAL, for_device); 661 - ret = __add_pages(nid, zone, start_pfn, nr_pages); 662 - 655 + ret = __add_pages(nid, start_pfn, nr_pages, want_memblock); 663 656 if (ret) 664 657 printk("%s: Problem encountered in __add_pages() as ret=%d\n", 665 658 __func__, ret);
+2 -1
arch/metag/mm/hugetlbpage.c
··· 74 74 return pte; 75 75 } 76 76 77 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 77 + pte_t *huge_pte_offset(struct mm_struct *mm, 78 + unsigned long addr, unsigned long sz) 78 79 { 79 80 pgd_t *pgd; 80 81 pud_t *pud;
+2 -1
arch/mips/mm/hugetlbpage.c
··· 36 36 return pte; 37 37 } 38 38 39 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 39 + pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, 40 + unsigned long sz) 40 41 { 41 42 pgd_t *pgd; 42 43 pud_t *pud;
+2
arch/mn10300/include/asm/Kbuild
··· 1 1 2 2 generic-y += barrier.h 3 3 generic-y += clkdev.h 4 + generic-y += device.h 4 5 generic-y += exec.h 5 6 generic-y += extable.h 7 + generic-y += fb.h 6 8 generic-y += irq_work.h 7 9 generic-y += mcs_spinlock.h 8 10 generic-y += mm-arch-hooks.h
-1
arch/mn10300/include/asm/device.h
··· 1 - #include <asm-generic/device.h>
-23
arch/mn10300/include/asm/fb.h
··· 1 - /* MN10300 Frame buffer stuff 2 - * 3 - * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved. 4 - * Written by David Howells (dhowells@redhat.com) 5 - * 6 - * This program is free software; you can redistribute it and/or 7 - * modify it under the terms of the GNU General Public Licence 8 - * as published by the Free Software Foundation; either version 9 - * 2 of the Licence, or (at your option) any later version. 10 - */ 11 - #ifndef _ASM_FB_H 12 - #define _ASM_FB_H 13 - 14 - #include <linux/fb.h> 15 - 16 - #define fb_pgprotect(...) do {} while (0) 17 - 18 - static inline int fb_is_primary_device(struct fb_info *info) 19 - { 20 - return 0; 21 - } 22 - 23 - #endif /* _ASM_FB_H */
+2 -1
arch/parisc/mm/hugetlbpage.c
··· 69 69 return pte; 70 70 } 71 71 72 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 72 + pte_t *huge_pte_offset(struct mm_struct *mm, 73 + unsigned long addr, unsigned long sz) 73 74 { 74 75 pgd_t *pgd; 75 76 pud_t *pud;
+10
arch/powerpc/include/asm/book3s/64/hugetlb.h
··· 50 50 else 51 51 return entry; 52 52 } 53 + 54 + #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE 55 + static inline bool gigantic_page_supported(void) 56 + { 57 + if (radix_enabled()) 58 + return true; 59 + return false; 60 + } 61 + #endif 62 + 53 63 #endif
+34 -52
arch/powerpc/mm/hugetlbpage.c
··· 17 17 #include <linux/memblock.h> 18 18 #include <linux/bootmem.h> 19 19 #include <linux/moduleparam.h> 20 + #include <linux/swap.h> 21 + #include <linux/swapops.h> 20 22 #include <asm/pgtable.h> 21 23 #include <asm/pgalloc.h> 22 24 #include <asm/tlb.h> ··· 57 55 58 56 #define hugepd_none(hpd) (hpd_val(hpd) == 0) 59 57 60 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 58 + pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz) 61 59 { 62 60 /* Only called for hugetlbfs pages, hence can ignore THP */ 63 61 return __find_linux_pte_or_hugepte(mm->pgd, addr, NULL, NULL); ··· 619 617 } while (addr = next, addr != end); 620 618 } 621 619 622 - /* 623 - * We are holding mmap_sem, so a parallel huge page collapse cannot run. 624 - * To prevent hugepage split, disable irq. 625 - */ 626 - struct page * 627 - follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) 620 + struct page *follow_huge_pd(struct vm_area_struct *vma, 621 + unsigned long address, hugepd_t hpd, 622 + int flags, int pdshift) 628 623 { 629 - bool is_thp; 630 - pte_t *ptep, pte; 631 - unsigned shift; 632 - unsigned long mask, flags; 633 - struct page *page = ERR_PTR(-EINVAL); 624 + pte_t *ptep; 625 + spinlock_t *ptl; 626 + struct page *page = NULL; 627 + unsigned long mask; 628 + int shift = hugepd_shift(hpd); 629 + struct mm_struct *mm = vma->vm_mm; 634 630 635 - local_irq_save(flags); 636 - ptep = find_linux_pte_or_hugepte(mm->pgd, address, &is_thp, &shift); 637 - if (!ptep) 638 - goto no_page; 639 - pte = READ_ONCE(*ptep); 640 - /* 641 - * Verify it is a huge page else bail. 642 - * Transparent hugepages are handled by generic code. We can skip them 643 - * here. 
644 - */ 645 - if (!shift || is_thp) 646 - goto no_page; 631 + retry: 632 + ptl = &mm->page_table_lock; 633 + spin_lock(ptl); 647 634 648 - if (!pte_present(pte)) { 649 - page = NULL; 650 - goto no_page; 635 + ptep = hugepte_offset(hpd, address, pdshift); 636 + if (pte_present(*ptep)) { 637 + mask = (1UL << shift) - 1; 638 + page = pte_page(*ptep); 639 + page += ((address & mask) >> PAGE_SHIFT); 640 + if (flags & FOLL_GET) 641 + get_page(page); 642 + } else { 643 + if (is_hugetlb_entry_migration(*ptep)) { 644 + spin_unlock(ptl); 645 + __migration_entry_wait(mm, ptep, ptl); 646 + goto retry; 647 + } 651 648 } 652 - mask = (1UL << shift) - 1; 653 - page = pte_page(pte); 654 - if (page) 655 - page += (address & mask) / PAGE_SIZE; 656 - 657 - no_page: 658 - local_irq_restore(flags); 649 + spin_unlock(ptl); 659 650 return page; 660 - } 661 - 662 - struct page * 663 - follow_huge_pmd(struct mm_struct *mm, unsigned long address, 664 - pmd_t *pmd, int write) 665 - { 666 - BUG(); 667 - return NULL; 668 - } 669 - 670 - struct page * 671 - follow_huge_pud(struct mm_struct *mm, unsigned long address, 672 - pud_t *pud, int write) 673 - { 674 - BUG(); 675 - return NULL; 676 651 } 677 652 678 653 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, ··· 742 763 * Hash: 16M and 16G 743 764 */ 744 765 if (radix_enabled()) { 745 - if (mmu_psize != MMU_PAGE_2M) 746 - return -EINVAL; 766 + if (mmu_psize != MMU_PAGE_2M) { 767 + if (cpu_has_feature(CPU_FTR_POWER9_DD1) || 768 + (mmu_psize != MMU_PAGE_1G)) 769 + return -EINVAL; 770 + } 747 771 } else { 748 772 if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) 749 773 return -EINVAL;
+2 -10
arch/powerpc/mm/mem.c
··· 126 126 return -ENODEV; 127 127 } 128 128 129 - int arch_add_memory(int nid, u64 start, u64 size, bool for_device) 129 + int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock) 130 130 { 131 - struct pglist_data *pgdata; 132 - struct zone *zone; 133 131 unsigned long start_pfn = start >> PAGE_SHIFT; 134 132 unsigned long nr_pages = size >> PAGE_SHIFT; 135 133 int rc; 136 134 137 135 resize_hpt_for_hotplug(memblock_phys_mem_size()); 138 - 139 - pgdata = NODE_DATA(nid); 140 136 141 137 start = (unsigned long)__va(start); 142 138 rc = create_section_mapping(start, start + size); ··· 143 147 return -EFAULT; 144 148 } 145 149 146 - /* this should work for most non-highmem platforms */ 147 - zone = pgdata->node_zones + 148 - zone_for_memory(nid, start, size, 0, for_device); 149 - 150 - return __add_pages(nid, zone, start_pfn, nr_pages); 150 + return __add_pages(nid, start_pfn, nr_pages, want_memblock); 151 151 } 152 152 153 153 #ifdef CONFIG_MEMORY_HOTREMOVE
+6
arch/powerpc/platforms/Kconfig.cputype
··· 344 344 config PPC_RADIX_MMU 345 345 bool "Radix MMU Support" 346 346 depends on PPC_BOOK3S_64 347 + select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA 347 348 default y 348 349 help 349 350 Enable support for the Power ISA 3.0 Radix style MMU. Currently this 350 351 is only implemented by IBM Power9 CPUs, if you don't have one of them 351 352 you can probably disable this. 353 + 354 + config ARCH_ENABLE_HUGEPAGE_MIGRATION 355 + def_bool y 356 + depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION 357 + 352 358 353 359 config PPC_MMU_NOHASH 354 360 def_bool y
+1 -1
arch/s390/Kconfig
··· 68 68 select ARCH_HAS_DEVMEM_IS_ALLOWED 69 69 select ARCH_HAS_ELF_RANDOMIZE 70 70 select ARCH_HAS_GCOV_PROFILE_ALL 71 - select ARCH_HAS_GIGANTIC_PAGE 71 + select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA 72 72 select ARCH_HAS_KCOV 73 73 select ARCH_HAS_SET_MEMORY 74 74 select ARCH_HAS_SG_CHAIN
+4 -1
arch/s390/include/asm/hugetlb.h
··· 39 39 #define arch_clear_hugepage_flags(page) do { } while (0) 40 40 41 41 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr, 42 - pte_t *ptep) 42 + pte_t *ptep, unsigned long sz) 43 43 { 44 44 if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3) 45 45 pte_val(*ptep) = _REGION3_ENTRY_EMPTY; ··· 112 112 return pte_modify(pte, newprot); 113 113 } 114 114 115 + #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE 116 + static inline bool gigantic_page_supported(void) { return true; } 117 + #endif 115 118 #endif /* _ASM_S390_HUGETLB_H */
+2 -1
arch/s390/mm/hugetlbpage.c
··· 180 180 return (pte_t *) pmdp; 181 181 } 182 182 183 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 183 + pte_t *huge_pte_offset(struct mm_struct *mm, 184 + unsigned long addr, unsigned long sz) 184 185 { 185 186 pgd_t *pgdp; 186 187 p4d_t *p4dp;
+3 -29
arch/s390/mm/init.c
··· 166 166 } 167 167 168 168 #ifdef CONFIG_MEMORY_HOTPLUG 169 - int arch_add_memory(int nid, u64 start, u64 size, bool for_device) 169 + int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock) 170 170 { 171 - unsigned long zone_start_pfn, zone_end_pfn, nr_pages; 172 171 unsigned long start_pfn = PFN_DOWN(start); 173 172 unsigned long size_pages = PFN_DOWN(size); 174 - pg_data_t *pgdat = NODE_DATA(nid); 175 - struct zone *zone; 176 - int rc, i; 173 + int rc; 177 174 178 175 rc = vmem_add_mapping(start, size); 179 176 if (rc) 180 177 return rc; 181 178 182 - for (i = 0; i < MAX_NR_ZONES; i++) { 183 - zone = pgdat->node_zones + i; 184 - if (zone_idx(zone) != ZONE_MOVABLE) { 185 - /* Add range within existing zone limits, if possible */ 186 - zone_start_pfn = zone->zone_start_pfn; 187 - zone_end_pfn = zone->zone_start_pfn + 188 - zone->spanned_pages; 189 - } else { 190 - /* Add remaining range to ZONE_MOVABLE */ 191 - zone_start_pfn = start_pfn; 192 - zone_end_pfn = start_pfn + size_pages; 193 - } 194 - if (start_pfn < zone_start_pfn || start_pfn >= zone_end_pfn) 195 - continue; 196 - nr_pages = (start_pfn + size_pages > zone_end_pfn) ? 197 - zone_end_pfn - start_pfn : size_pages; 198 - rc = __add_pages(nid, zone, start_pfn, nr_pages); 199 - if (rc) 200 - break; 201 - start_pfn += nr_pages; 202 - size_pages -= nr_pages; 203 - if (!size_pages) 204 - break; 205 - } 179 + rc = __add_pages(nid, start_pfn, size_pages, want_memblock); 206 180 if (rc) 207 181 vmem_remove_mapping(start, size); 208 182 return rc;
+2 -1
arch/sh/mm/hugetlbpage.c
··· 42 42 return pte; 43 43 } 44 44 45 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 45 + pte_t *huge_pte_offset(struct mm_struct *mm, 46 + unsigned long addr, unsigned long sz) 46 47 { 47 48 pgd_t *pgd; 48 49 pud_t *pud;
+2 -8
arch/sh/mm/init.c
··· 485 485 #endif 486 486 487 487 #ifdef CONFIG_MEMORY_HOTPLUG 488 - int arch_add_memory(int nid, u64 start, u64 size, bool for_device) 488 + int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock) 489 489 { 490 - pg_data_t *pgdat; 491 490 unsigned long start_pfn = PFN_DOWN(start); 492 491 unsigned long nr_pages = size >> PAGE_SHIFT; 493 492 int ret; 494 493 495 - pgdat = NODE_DATA(nid); 496 - 497 494 /* We only have ZONE_NORMAL, so this is easy.. */ 498 - ret = __add_pages(nid, pgdat->node_zones + 499 - zone_for_memory(nid, start, size, ZONE_NORMAL, 500 - for_device), 501 - start_pfn, nr_pages); 495 + ret = __add_pages(nid, start_pfn, nr_pages, want_memblock); 502 496 if (unlikely(ret)) 503 497 printk("%s: Failed, __add_pages() == %d\n", __func__, ret); 504 498
+2 -1
arch/sparc/mm/hugetlbpage.c
··· 277 277 return pte; 278 278 } 279 279 280 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 280 + pte_t *huge_pte_offset(struct mm_struct *mm, 281 + unsigned long addr, unsigned long sz) 281 282 { 282 283 pgd_t *pgd; 283 284 pud_t *pud;
+2 -1
arch/tile/mm/hugetlbpage.c
··· 102 102 return ptep; 103 103 } 104 104 105 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 105 + pte_t *huge_pte_offset(struct mm_struct *mm, 106 + unsigned long addr, unsigned long sz) 106 107 { 107 108 pgd_t *pgd; 108 109 pud_t *pud;
+11
arch/tile/mm/pgtable.c
··· 503 503 } 504 504 EXPORT_SYMBOL(ioremap_prot); 505 505 506 + #if !defined(CONFIG_PCI) || !defined(CONFIG_TILEGX) 507 + /* ioremap is conditionally declared in pci_gx.c */ 508 + 509 + void __iomem *ioremap(resource_size_t phys_addr, unsigned long size) 510 + { 511 + return NULL; 512 + } 513 + EXPORT_SYMBOL(ioremap); 514 + 515 + #endif 516 + 506 517 /* Unmap an MMIO VA mapping. */ 507 518 void iounmap(volatile void __iomem *addr_in) 508 519 {
+2 -1
arch/x86/Kconfig
··· 22 22 def_bool y 23 23 depends on 64BIT 24 24 # Options that are inherently 64-bit kernel only: 25 - select ARCH_HAS_GIGANTIC_PAGE 25 + select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA 26 26 select ARCH_SUPPORTS_INT128 27 27 select ARCH_USE_CMPXCHG_LOCKREF 28 28 select HAVE_ARCH_SOFT_DIRTY ··· 72 72 select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH 73 73 select ARCH_WANT_FRAME_POINTERS 74 74 select ARCH_WANTS_DYNAMIC_TASK_STRUCT 75 + select ARCH_WANTS_THP_SWAP if X86_64 75 76 select BUILDTIME_EXTABLE_SORT 76 77 select CLKEVT_I8253 77 78 select CLOCKSOURCE_VALIDATE_LAST_CYCLE
+4
arch/x86/include/asm/hugetlb.h
··· 85 85 { 86 86 } 87 87 88 + #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE 89 + static inline bool gigantic_page_supported(void) { return true; } 90 + #endif 91 + 88 92 #endif /* _ASM_X86_HUGETLB_H */
+1 -1
arch/x86/mm/hugetlbpage.c
··· 33 33 if (!vma || !is_vm_hugetlb_page(vma)) 34 34 return ERR_PTR(-EINVAL); 35 35 36 - pte = huge_pte_offset(mm, address); 36 + pte = huge_pte_offset(mm, address, vma_mmu_pagesize(vma)); 37 37 38 38 /* hugetlb should be locked, and hence, prefaulted */ 39 39 WARN_ON(!pte || pte_none(*pte));
+2 -5
arch/x86/mm/init_32.c
··· 823 823 } 824 824 825 825 #ifdef CONFIG_MEMORY_HOTPLUG 826 - int arch_add_memory(int nid, u64 start, u64 size, bool for_device) 826 + int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock) 827 827 { 828 - struct pglist_data *pgdata = NODE_DATA(nid); 829 - struct zone *zone = pgdata->node_zones + 830 - zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device); 831 828 unsigned long start_pfn = start >> PAGE_SHIFT; 832 829 unsigned long nr_pages = size >> PAGE_SHIFT; 833 830 834 - return __add_pages(nid, zone, start_pfn, nr_pages); 831 + return __add_pages(nid, start_pfn, nr_pages, want_memblock); 835 832 } 836 833 837 834 #ifdef CONFIG_MEMORY_HOTREMOVE
+2 -9
arch/x86/mm/init_64.c
··· 772 772 } 773 773 } 774 774 775 - /* 776 - * Memory is added always to NORMAL zone. This means you will never get 777 - * additional DMA/DMA32 memory. 778 - */ 779 - int arch_add_memory(int nid, u64 start, u64 size, bool for_device) 775 + int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock) 780 776 { 781 - struct pglist_data *pgdat = NODE_DATA(nid); 782 - struct zone *zone = pgdat->node_zones + 783 - zone_for_memory(nid, start, size, ZONE_NORMAL, for_device); 784 777 unsigned long start_pfn = start >> PAGE_SHIFT; 785 778 unsigned long nr_pages = size >> PAGE_SHIFT; 786 779 int ret; 787 780 788 781 init_memory_mapping(start, start + size); 789 782 790 - ret = __add_pages(nid, zone, start_pfn, nr_pages); 783 + ret = __add_pages(nid, start_pfn, nr_pages, want_memblock); 791 784 WARN_ON_ONCE(ret); 792 785 793 786 /* update max_pfn, max_low_pfn and high_memory */
+45 -38
drivers/base/memory.c
··· 128 128 int ret = 1; 129 129 struct memory_block *mem = to_memory_block(dev); 130 130 131 + if (mem->state != MEM_ONLINE) 132 + goto out; 133 + 131 134 for (i = 0; i < sections_per_block; i++) { 132 135 if (!present_section_nr(mem->start_section_nr + i)) 133 136 continue; ··· 138 135 ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION); 139 136 } 140 137 138 + out: 141 139 return sprintf(buf, "%d\n", ret); 142 140 } 143 141 ··· 392 388 struct device_attribute *attr, char *buf) 393 389 { 394 390 struct memory_block *mem = to_memory_block(dev); 395 - unsigned long start_pfn, end_pfn; 396 - unsigned long valid_start, valid_end, valid_pages; 391 + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); 397 392 unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 398 - struct zone *zone; 399 - int zone_shift = 0; 393 + unsigned long valid_start_pfn, valid_end_pfn; 394 + bool append = false; 395 + int nid; 400 396 401 - start_pfn = section_nr_to_pfn(mem->start_section_nr); 402 - end_pfn = start_pfn + nr_pages; 403 - 404 - /* The block contains more than one zone can not be offlined. */ 405 - if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end)) 397 + /* 398 + * The block contains more than one zone can not be offlined. 399 + * This can happen e.g. for ZONE_DMA and ZONE_DMA32 400 + */ 401 + if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, &valid_start_pfn, &valid_end_pfn)) 406 402 return sprintf(buf, "none\n"); 407 403 408 - zone = page_zone(pfn_to_page(valid_start)); 409 - valid_pages = valid_end - valid_start; 404 + start_pfn = valid_start_pfn; 405 + nr_pages = valid_end_pfn - start_pfn; 410 406 411 - /* MMOP_ONLINE_KEEP */ 412 - sprintf(buf, "%s", zone->name); 413 - 414 - /* MMOP_ONLINE_KERNEL */ 415 - zone_can_shift(valid_start, valid_pages, ZONE_NORMAL, &zone_shift); 416 - if (zone_shift) { 417 - strcat(buf, " "); 418 - strcat(buf, (zone + zone_shift)->name); 407 + /* 408 + * Check the existing zone. 
Make sure that we do that only on the 409 + * online nodes otherwise the page_zone is not reliable 410 + */ 411 + if (mem->state == MEM_ONLINE) { 412 + strcat(buf, page_zone(pfn_to_page(start_pfn))->name); 413 + goto out; 419 414 } 420 415 421 - /* MMOP_ONLINE_MOVABLE */ 422 - zone_can_shift(valid_start, valid_pages, ZONE_MOVABLE, &zone_shift); 423 - if (zone_shift) { 424 - strcat(buf, " "); 425 - strcat(buf, (zone + zone_shift)->name); 416 + nid = pfn_to_nid(start_pfn); 417 + if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) { 418 + strcat(buf, default_zone_for_pfn(nid, start_pfn, nr_pages)->name); 419 + append = true; 426 420 } 427 421 422 + if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE)) { 423 + if (append) 424 + strcat(buf, " "); 425 + strcat(buf, NODE_DATA(nid)->node_zones[ZONE_MOVABLE].name); 426 + } 427 + out: 428 428 strcat(buf, "\n"); 429 429 430 430 return strlen(buf); ··· 693 685 return 0; 694 686 } 695 687 696 - static bool is_zone_device_section(struct mem_section *ms) 697 - { 698 - struct page *page; 699 - 700 - page = sparse_decode_mem_map(ms->section_mem_map, __section_nr(ms)); 701 - return is_zone_device_page(page); 702 - } 703 - 704 688 /* 705 689 * need an interface for the VM to add new memory regions, 706 690 * but without onlining it. ··· 701 701 { 702 702 int ret = 0; 703 703 struct memory_block *mem; 704 - 705 - if (is_zone_device_section(section)) 706 - return 0; 707 704 708 705 mutex_lock(&mem_sysfs_mutex); 709 706 ··· 738 741 { 739 742 struct memory_block *mem; 740 743 741 - if (is_zone_device_section(section)) 742 - return 0; 743 - 744 744 mutex_lock(&mem_sysfs_mutex); 745 + 746 + /* 747 + * Some users of the memory hotplug do not want/need memblock to 748 + * track all sections. Skip over those. 
749 + */ 745 750 mem = find_memory_block(section); 751 + if (!mem) 752 + goto out_unlock; 753 + 746 754 unregister_mem_sect_under_nodes(mem, __section_nr(section)); 747 755 748 756 mem->section_count--; ··· 756 754 else 757 755 put_device(&mem->dev); 758 756 757 + out_unlock: 759 758 mutex_unlock(&mem_sysfs_mutex); 760 759 return 0; 761 760 } ··· 823 820 */ 824 821 mutex_lock(&mem_sysfs_mutex); 825 822 for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block) { 823 + /* Don't iterate over sections we know are !present: */ 824 + if (i > __highest_present_section_nr) 825 + break; 826 + 826 827 err = add_memory_block(i); 827 828 if (!ret) 828 829 ret = err;
+25 -47
drivers/base/node.c
··· 129 129 nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), 130 130 nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), 131 131 nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), 132 - nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) + 133 - sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)), 134 - nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)), 132 + nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE) + 133 + node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)), 134 + nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE)), 135 135 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 136 - nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)), 136 + nid, K(node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)), 137 137 nid, K(node_page_state(pgdat, NR_ANON_THPS) * 138 138 HPAGE_PMD_NR), 139 139 nid, K(node_page_state(pgdat, NR_SHMEM_THPS) * ··· 141 141 nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) * 142 142 HPAGE_PMD_NR)); 143 143 #else 144 - nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE))); 144 + nid, K(node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE))); 145 145 #endif 146 146 n += hugetlb_report_node_meminfo(nid, buf + n); 147 147 return n; ··· 368 368 } 369 369 370 370 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE 371 - #define page_initialized(page) (page->lru.next) 372 - 373 371 static int __ref get_nid_for_pfn(unsigned long pfn) 374 372 { 375 - struct page *page; 376 - 377 373 if (!pfn_valid_within(pfn)) 378 374 return -1; 379 375 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT 380 376 if (system_state < SYSTEM_RUNNING) 381 377 return early_pfn_to_nid(pfn); 382 378 #endif 383 - page = pfn_to_page(pfn); 384 - if (!page_initialized(page)) 385 - return -1; 386 379 return pfn_to_nid(pfn); 387 380 } 388 381 ··· 461 468 return 0; 462 469 } 463 470 464 - static int link_mem_sections(int nid) 471 + int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) 465 472 { 466 - unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn; 467 - unsigned long end_pfn 
= start_pfn + NODE_DATA(nid)->node_spanned_pages; 473 + unsigned long end_pfn = start_pfn + nr_pages; 468 474 unsigned long pfn; 469 475 struct memory_block *mem_blk = NULL; 470 476 int err = 0; ··· 551 559 return NOTIFY_OK; 552 560 } 553 561 #endif /* CONFIG_HUGETLBFS */ 554 - #else /* !CONFIG_MEMORY_HOTPLUG_SPARSE */ 555 - 556 - static int link_mem_sections(int nid) { return 0; } 557 - #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ 562 + #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ 558 563 559 564 #if !defined(CONFIG_MEMORY_HOTPLUG_SPARSE) || \ 560 565 !defined(CONFIG_HUGETLBFS) ··· 565 576 566 577 #endif 567 578 568 - int register_one_node(int nid) 579 + int __register_one_node(int nid) 569 580 { 570 - int error = 0; 581 + int p_node = parent_node(nid); 582 + struct node *parent = NULL; 583 + int error; 571 584 int cpu; 572 585 573 - if (node_online(nid)) { 574 - int p_node = parent_node(nid); 575 - struct node *parent = NULL; 586 + if (p_node != nid) 587 + parent = node_devices[p_node]; 576 588 577 - if (p_node != nid) 578 - parent = node_devices[p_node]; 589 + node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL); 590 + if (!node_devices[nid]) 591 + return -ENOMEM; 579 592 580 - node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL); 581 - if (!node_devices[nid]) 582 - return -ENOMEM; 593 + error = register_node(node_devices[nid], nid, parent); 583 594 584 - error = register_node(node_devices[nid], nid, parent); 585 - 586 - /* link cpu under this node */ 587 - for_each_present_cpu(cpu) { 588 - if (cpu_to_node(cpu) == nid) 589 - register_cpu_under_node(cpu, nid); 590 - } 591 - 592 - /* link memory sections under this node */ 593 - error = link_mem_sections(nid); 594 - 595 - /* initialize work queue for memory hot plug */ 596 - init_node_hugetlb_work(nid); 595 + /* link cpu under this node */ 596 + for_each_present_cpu(cpu) { 597 + if (cpu_to_node(cpu) == nid) 598 + register_cpu_under_node(cpu, nid); 597 599 } 598 600 599 - return error; 601 + /* initialize 
work queue for memory hot plug */ 602 + init_node_hugetlb_work(nid); 600 603 604 + return error; 601 605 } 602 606 603 607 void unregister_one_node(int nid) ··· 639 657 #ifdef CONFIG_HIGHMEM 640 658 [N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY), 641 659 #endif 642 - #ifdef CONFIG_MOVABLE_NODE 643 660 [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY), 644 - #endif 645 661 [N_CPU] = _NODE_ATTR(has_cpu, N_CPU), 646 662 }; 647 663 ··· 650 670 #ifdef CONFIG_HIGHMEM 651 671 &node_state_attr[N_HIGH_MEMORY].attr.attr, 652 672 #endif 653 - #ifdef CONFIG_MOVABLE_NODE 654 673 &node_state_attr[N_MEMORY].attr.attr, 655 - #endif 656 674 &node_state_attr[N_CPU].attr.attr, 657 675 NULL 658 676 };
+2
drivers/block/zram/zram_drv.c
··· 469 469 zram_slot_unlock(zram, index); 470 470 471 471 atomic64_inc(&zram->stats.same_pages); 472 + atomic64_inc(&zram->stats.pages_stored); 472 473 return true; 473 474 } 474 475 kunmap_atomic(mem); ··· 525 524 zram_clear_flag(zram, index, ZRAM_SAME); 526 525 zram_set_element(zram, index, 0); 527 526 atomic64_dec(&zram->stats.same_pages); 527 + atomic64_dec(&zram->stats.pages_stored); 528 528 return; 529 529 } 530 530
+1 -3
drivers/sh/intc/virq.c
··· 94 94 } 95 95 96 96 entry = kzalloc(sizeof(struct intc_virq_list), GFP_ATOMIC); 97 - if (!entry) { 98 - pr_err("can't allocate VIRQ mapping for %d\n", virq); 97 + if (!entry) 99 98 return -ENOMEM; 100 - } 101 99 102 100 entry->irq = virq; 103 101
+1 -1
fs/dax.c
··· 1213 1213 case IOMAP_MAPPED: 1214 1214 if (iomap.flags & IOMAP_F_NEW) { 1215 1215 count_vm_event(PGMAJFAULT); 1216 - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); 1216 + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); 1217 1217 major = VM_FAULT_MAJOR; 1218 1218 } 1219 1219 error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
+4 -14
fs/dcache.c
··· 3546 3546 3547 3547 static void __init dcache_init_early(void) 3548 3548 { 3549 - unsigned int loop; 3550 - 3551 3549 /* If hashes are distributed across NUMA nodes, defer 3552 3550 * hash allocation until vmalloc space is available. 3553 3551 */ ··· 3557 3559 sizeof(struct hlist_bl_head), 3558 3560 dhash_entries, 3559 3561 13, 3560 - HASH_EARLY, 3562 + HASH_EARLY | HASH_ZERO, 3561 3563 &d_hash_shift, 3562 3564 &d_hash_mask, 3563 3565 0, 3564 3566 0); 3565 - 3566 - for (loop = 0; loop < (1U << d_hash_shift); loop++) 3567 - INIT_HLIST_BL_HEAD(dentry_hashtable + loop); 3568 3567 } 3569 3568 3570 3569 static void __init dcache_init(void) 3571 3570 { 3572 - unsigned int loop; 3573 - 3574 - /* 3571 + /* 3575 3572 * A constructor could be added for stable state like the lists, 3576 3573 * but it is probably not worth it because of the cache nature 3577 - * of the dcache. 3574 + * of the dcache. 3578 3575 */ 3579 3576 dentry_cache = KMEM_CACHE(dentry, 3580 3577 SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT); ··· 3583 3590 sizeof(struct hlist_bl_head), 3584 3591 dhash_entries, 3585 3592 13, 3586 - 0, 3593 + HASH_ZERO, 3587 3594 &d_hash_shift, 3588 3595 &d_hash_mask, 3589 3596 0, 3590 3597 0); 3591 - 3592 - for (loop = 0; loop < (1U << d_hash_shift); loop++) 3593 - INIT_HLIST_BL_HEAD(dentry_hashtable + loop); 3594 3598 } 3595 3599 3596 3600 /* SLAB cache for __getname() consumers */
+4 -18
fs/file.c
··· 30 30 unsigned int sysctl_nr_open_max = 31 31 __const_min(INT_MAX, ~(size_t)0/sizeof(void *)) & -BITS_PER_LONG; 32 32 33 - static void *alloc_fdmem(size_t size) 34 - { 35 - /* 36 - * Very large allocations can stress page reclaim, so fall back to 37 - * vmalloc() if the allocation size will be considered "large" by the VM. 38 - */ 39 - if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { 40 - void *data = kmalloc(size, GFP_KERNEL_ACCOUNT | 41 - __GFP_NOWARN | __GFP_NORETRY); 42 - if (data != NULL) 43 - return data; 44 - } 45 - return __vmalloc(size, GFP_KERNEL_ACCOUNT, PAGE_KERNEL); 46 - } 47 - 48 33 static void __free_fdtable(struct fdtable *fdt) 49 34 { 50 35 kvfree(fdt->fd); ··· 116 131 if (!fdt) 117 132 goto out; 118 133 fdt->max_fds = nr; 119 - data = alloc_fdmem(nr * sizeof(struct file *)); 134 + data = kvmalloc_array(nr, sizeof(struct file *), GFP_KERNEL_ACCOUNT); 120 135 if (!data) 121 136 goto out_fdt; 122 137 fdt->fd = data; 123 138 124 - data = alloc_fdmem(max_t(size_t, 125 - 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES)); 139 + data = kvmalloc(max_t(size_t, 140 + 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES), 141 + GFP_KERNEL_ACCOUNT); 126 142 if (!data) 127 143 goto out_arr; 128 144 fdt->open_fds = data;
+2 -12
fs/inode.c
··· 1915 1915 */ 1916 1916 void __init inode_init_early(void) 1917 1917 { 1918 - unsigned int loop; 1919 - 1920 1918 /* If hashes are distributed across NUMA nodes, defer 1921 1919 * hash allocation until vmalloc space is available. 1922 1920 */ ··· 1926 1928 sizeof(struct hlist_head), 1927 1929 ihash_entries, 1928 1930 14, 1929 - HASH_EARLY, 1931 + HASH_EARLY | HASH_ZERO, 1930 1932 &i_hash_shift, 1931 1933 &i_hash_mask, 1932 1934 0, 1933 1935 0); 1934 - 1935 - for (loop = 0; loop < (1U << i_hash_shift); loop++) 1936 - INIT_HLIST_HEAD(&inode_hashtable[loop]); 1937 1936 } 1938 1937 1939 1938 void __init inode_init(void) 1940 1939 { 1941 - unsigned int loop; 1942 - 1943 1940 /* inode slab cache */ 1944 1941 inode_cachep = kmem_cache_create("inode_cache", 1945 1942 sizeof(struct inode), ··· 1952 1959 sizeof(struct hlist_head), 1953 1960 ihash_entries, 1954 1961 14, 1955 - 0, 1962 + HASH_ZERO, 1956 1963 &i_hash_shift, 1957 1964 &i_hash_mask, 1958 1965 0, 1959 1966 0); 1960 - 1961 - for (loop = 0; loop < (1U << i_hash_shift); loop++) 1962 - INIT_HLIST_HEAD(&inode_hashtable[loop]); 1963 1967 } 1964 1968 1965 1969 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
+2 -8
fs/namespace.c
··· 3239 3239 3240 3240 void __init mnt_init(void) 3241 3241 { 3242 - unsigned u; 3243 3242 int err; 3244 3243 3245 3244 mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount), ··· 3247 3248 mount_hashtable = alloc_large_system_hash("Mount-cache", 3248 3249 sizeof(struct hlist_head), 3249 3250 mhash_entries, 19, 3250 - 0, 3251 + HASH_ZERO, 3251 3252 &m_hash_shift, &m_hash_mask, 0, 0); 3252 3253 mountpoint_hashtable = alloc_large_system_hash("Mountpoint-cache", 3253 3254 sizeof(struct hlist_head), 3254 3255 mphash_entries, 19, 3255 - 0, 3256 + HASH_ZERO, 3256 3257 &mp_hash_shift, &mp_hash_mask, 0, 0); 3257 3258 3258 3259 if (!mount_hashtable || !mountpoint_hashtable) 3259 3260 panic("Failed to allocate mount hash table\n"); 3260 - 3261 - for (u = 0; u <= m_hash_mask; u++) 3262 - INIT_HLIST_HEAD(&mount_hashtable[u]); 3263 - for (u = 0; u <= mp_hash_mask; u++) 3264 - INIT_HLIST_HEAD(&mountpoint_hashtable[u]); 3265 3261 3266 3262 kernfs_init(); 3267 3263
+1 -1
fs/ncpfs/mmap.c
··· 89 89 * -- nyc 90 90 */ 91 91 count_vm_event(PGMAJFAULT); 92 - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); 92 + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); 93 93 return VM_FAULT_MAJOR; 94 94 } 95 95
+1
fs/ocfs2/cluster/netdebug.c
··· 426 426 struct o2net_sock_container *dummy_sc = sd->dbg_sock; 427 427 428 428 o2net_debug_del_sc(dummy_sc); 429 + kfree(dummy_sc); 429 430 return seq_release_private(inode, file); 430 431 } 431 432
+1 -1
fs/ocfs2/inode.c
··· 136 136 struct inode *ocfs2_iget(struct ocfs2_super *osb, u64 blkno, unsigned flags, 137 137 int sysfile_type) 138 138 { 139 - int rc = 0; 139 + int rc = -ESTALE; 140 140 struct inode *inode = NULL; 141 141 struct super_block *sb = osb->sb; 142 142 struct ocfs2_find_inode_args args;
+2 -3
fs/ocfs2/ocfs2_fs.h
··· 25 25 #ifndef _OCFS2_FS_H 26 26 #define _OCFS2_FS_H 27 27 28 + #include <linux/magic.h> 29 + 28 30 /* Version */ 29 31 #define OCFS2_MAJOR_REV_LEVEL 0 30 32 #define OCFS2_MINOR_REV_LEVEL 90 ··· 57 55 */ 58 56 #define OCFS2_MIN_BLOCKSIZE 512 59 57 #define OCFS2_MAX_BLOCKSIZE OCFS2_MIN_CLUSTERSIZE 60 - 61 - /* Filesystem magic number */ 62 - #define OCFS2_SUPER_MAGIC 0x7461636f 63 58 64 59 /* Object signatures */ 65 60 #define OCFS2_SUPER_BLOCK_SIGNATURE "OCFSV2"
+1 -1
fs/ocfs2/stackglue.c
··· 631 631 NULL, 632 632 }; 633 633 634 - static struct attribute_group ocfs2_attr_group = { 634 + static const struct attribute_group ocfs2_attr_group = { 635 635 .attrs = ocfs2_attrs, 636 636 }; 637 637
+5 -7
fs/userfaultfd.c
··· 214 214 * hugepmd ranges. 215 215 */ 216 216 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, 217 + struct vm_area_struct *vma, 217 218 unsigned long address, 218 219 unsigned long flags, 219 220 unsigned long reason) ··· 225 224 226 225 VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); 227 226 228 - pte = huge_pte_offset(mm, address); 227 + pte = huge_pte_offset(mm, address, vma_mmu_pagesize(vma)); 229 228 if (!pte) 230 229 goto out; 231 230 ··· 244 243 } 245 244 #else 246 245 static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, 246 + struct vm_area_struct *vma, 247 247 unsigned long address, 248 248 unsigned long flags, 249 249 unsigned long reason) ··· 450 448 must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags, 451 449 reason); 452 450 else 453 - must_wait = userfaultfd_huge_must_wait(ctx, vmf->address, 451 + must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma, 452 + vmf->address, 454 453 vmf->flags, reason); 455 454 up_read(&mm->mmap_sem); 456 455 ··· 1117 1114 static void __wake_userfault(struct userfaultfd_ctx *ctx, 1118 1115 struct userfaultfd_wake_range *range) 1119 1116 { 1120 - unsigned long start, end; 1121 - 1122 - start = range->start; 1123 - end = range->start + range->len; 1124 - 1125 1117 spin_lock(&ctx->fault_pending_wqh.lock); 1126 1118 /* wake all in the range and autoremove */ 1127 1119 if (waitqueue_active(&ctx->fault_pending_wqh))
+3 -1
include/asm-generic/hugetlb.h
··· 31 31 return pte_modify(pte, newprot); 32 32 } 33 33 34 + #ifndef huge_pte_clear 34 35 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr, 35 - pte_t *ptep) 36 + pte_t *ptep, unsigned long sz) 36 37 { 37 38 pte_clear(mm, addr, ptep); 38 39 } 40 + #endif 39 41 40 42 #endif /* _ASM_GENERIC_HUGETLB_H */
+1
include/linux/bootmem.h
··· 358 358 #define HASH_EARLY 0x00000001 /* Allocating during early boot? */ 359 359 #define HASH_SMALL 0x00000002 /* sub-page allocation allowed, min 360 360 * shift passed via *_hash_shift */ 361 + #define HASH_ZERO 0x00000004 /* Zero allocated hash table */ 361 362 362 363 /* Only NUMA needs hash distribution. 64bit NUMA architectures have 363 364 * sufficient vmalloc space.
-8
include/linux/compiler-clang.h
··· 15 15 * with any version that can compile the kernel 16 16 */ 17 17 #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__) 18 - 19 - /* 20 - * GCC does not warn about unused static inline functions for 21 - * -Wunused-function. This turns out to avoid the need for complex #ifdef 22 - * directives. Suppress the warning in clang as well. 23 - */ 24 - #undef inline 25 - #define inline inline __attribute__((unused)) notrace
+11 -7
include/linux/compiler-gcc.h
··· 66 66 67 67 /* 68 68 * Force always-inline if the user requests it so via the .config, 69 - * or if gcc is too old: 69 + * or if gcc is too old. 70 + * GCC does not warn about unused static inline functions for 71 + * -Wunused-function. This turns out to avoid the need for complex #ifdef 72 + * directives. Suppress the warning in clang as well by using "unused" 73 + * function attribute, which is redundant but not harmful for gcc. 70 74 */ 71 75 #if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) || \ 72 76 !defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4) 73 - #define inline inline __attribute__((always_inline)) notrace 74 - #define __inline__ __inline__ __attribute__((always_inline)) notrace 75 - #define __inline __inline __attribute__((always_inline)) notrace 77 + #define inline inline __attribute__((always_inline,unused)) notrace 78 + #define __inline__ __inline__ __attribute__((always_inline,unused)) notrace 79 + #define __inline __inline __attribute__((always_inline,unused)) notrace 76 80 #else 77 81 /* A lot of inline functions can cause havoc with function tracing */ 78 - #define inline inline notrace 79 - #define __inline__ __inline__ notrace 80 - #define __inline __inline notrace 82 + #define inline inline __attribute__((unused)) notrace 83 + #define __inline__ __inline__ __attribute__((unused)) notrace 84 + #define __inline __inline __attribute__((unused)) notrace 81 85 #endif 82 86 83 87 #define __always_inline inline __attribute__((always_inline))
+1 -4
include/linux/filter.h
··· 16 16 #include <linux/sched.h> 17 17 #include <linux/capability.h> 18 18 #include <linux/cryptohash.h> 19 + #include <linux/set_memory.h> 19 20 20 21 #include <net/sch_generic.h> 21 - 22 - #ifdef CONFIG_ARCH_HAS_SET_MEMORY 23 - #include <asm/set_memory.h> 24 - #endif 25 22 26 23 #include <uapi/linux/filter.h> 27 24 #include <uapi/linux/bpf.h>
+5 -6
include/linux/gfp.h
··· 432 432 #endif 433 433 434 434 struct page * 435 - __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, 436 - struct zonelist *zonelist, nodemask_t *nodemask); 435 + __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, 436 + nodemask_t *nodemask); 437 437 438 438 static inline struct page * 439 - __alloc_pages(gfp_t gfp_mask, unsigned int order, 440 - struct zonelist *zonelist) 439 + __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid) 441 440 { 442 - return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); 441 + return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL); 443 442 } 444 443 445 444 /* ··· 451 452 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); 452 453 VM_WARN_ON(!node_online(nid)); 453 454 454 - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); 455 + return __alloc_pages(gfp_mask, order, nid); 455 456 } 456 457 457 458 /*
+7
include/linux/huge_mm.h
··· 113 113 extern void prep_transhuge_page(struct page *page); 114 114 extern void free_transhuge_page(struct page *page); 115 115 116 + bool can_split_huge_page(struct page *page, int *pextra_pins); 116 117 int split_huge_page_to_list(struct page *page, struct list_head *list); 117 118 static inline int split_huge_page(struct page *page) 118 119 { ··· 232 231 233 232 #define thp_get_unmapped_area NULL 234 233 234 + static inline bool 235 + can_split_huge_page(struct page *page, int *pextra_pins) 236 + { 237 + BUILD_BUG(); 238 + return false; 239 + } 235 240 static inline int 236 241 split_huge_page_to_list(struct page *page, struct list_head *list) 237 242 {
+59 -26
include/linux/hugetlb.h
··· 14 14 struct user_struct; 15 15 struct mmu_gather; 16 16 17 + #ifndef is_hugepd 18 + /* 19 + * Some architectures requires a hugepage directory format that is 20 + * required to support multiple hugepage sizes. For example 21 + * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables" 22 + * introduced the same on powerpc. This allows for a more flexible hugepage 23 + * pagetable layout. 24 + */ 25 + typedef struct { unsigned long pd; } hugepd_t; 26 + #define is_hugepd(hugepd) (0) 27 + #define __hugepd(x) ((hugepd_t) { (x) }) 28 + static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, 29 + unsigned pdshift, unsigned long end, 30 + int write, struct page **pages, int *nr) 31 + { 32 + return 0; 33 + } 34 + #else 35 + extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr, 36 + unsigned pdshift, unsigned long end, 37 + int write, struct page **pages, int *nr); 38 + #endif 39 + 40 + 17 41 #ifdef CONFIG_HUGETLB_PAGE 18 42 19 43 #include <linux/mempolicy.h> ··· 137 113 138 114 pte_t *huge_pte_alloc(struct mm_struct *mm, 139 115 unsigned long addr, unsigned long sz); 140 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr); 116 + pte_t *huge_pte_offset(struct mm_struct *mm, 117 + unsigned long addr, unsigned long sz); 141 118 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); 142 119 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, 143 120 int write); 121 + struct page *follow_huge_pd(struct vm_area_struct *vma, 122 + unsigned long address, hugepd_t hpd, 123 + int flags, int pdshift); 144 124 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address, 145 125 pmd_t *pmd, int flags); 146 126 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address, 147 127 pud_t *pud, int flags); 128 + struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address, 129 + pgd_t *pgd, int flags); 130 + 148 131 int pmd_huge(pmd_t pmd); 149 132 int 
pud_huge(pud_t pud); 150 133 unsigned long hugetlb_change_protection(struct vm_area_struct *vma, 151 134 unsigned long address, unsigned long end, pgprot_t newprot); 152 135 136 + bool is_hugetlb_entry_migration(pte_t pte); 153 137 #else /* !CONFIG_HUGETLB_PAGE */ 154 138 155 139 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma) ··· 179 147 static inline void hugetlb_show_meminfo(void) 180 148 { 181 149 } 150 + #define follow_huge_pd(vma, addr, hpd, flags, pdshift) NULL 182 151 #define follow_huge_pmd(mm, addr, pmd, flags) NULL 183 152 #define follow_huge_pud(mm, addr, pud, flags) NULL 153 + #define follow_huge_pgd(mm, addr, pgd, flags) NULL 184 154 #define prepare_hugepage_range(file, addr, len) (-EINVAL) 185 155 #define pmd_huge(x) 0 186 156 #define pud_huge(x) 0 ··· 191 157 #define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; }) 192 158 #define hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \ 193 159 src_addr, pagep) ({ BUG(); 0; }) 194 - #define huge_pte_offset(mm, address) 0 160 + #define huge_pte_offset(mm, address, sz) 0 195 161 static inline int dequeue_hwpoisoned_huge_page(struct page *page) 196 162 { 197 163 return 0; ··· 249 215 BUG(); 250 216 return 0; 251 217 } 252 - #endif 253 - 254 - #ifndef is_hugepd 255 - /* 256 - * Some architectures requires a hugepage directory format that is 257 - * required to support multiple hugepage sizes. For example 258 - * a4fe3ce76 "powerpc/mm: Allow more flexible layouts for hugepage pagetables" 259 - * introduced the same on powerpc. This allows for a more flexible hugepage 260 - * pagetable layout. 
261 - */ 262 - typedef struct { unsigned long pd; } hugepd_t; 263 - #define is_hugepd(hugepd) (0) 264 - #define __hugepd(x) ((hugepd_t) { (x) }) 265 - static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr, 266 - unsigned pdshift, unsigned long end, 267 - int write, struct page **pages, int *nr) 268 - { 269 - return 0; 270 - } 271 - #else 272 - extern int gup_huge_pd(hugepd_t hugepd, unsigned long addr, 273 - unsigned pdshift, unsigned long end, 274 - int write, struct page **pages, int *nr); 275 218 #endif 276 219 277 220 #define HUGETLB_ANON_FILE "anon_hugepage" ··· 477 466 static inline bool hugepage_migration_supported(struct hstate *h) 478 467 { 479 468 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION 480 - return huge_page_shift(h) == PMD_SHIFT; 469 + if ((huge_page_shift(h) == PMD_SHIFT) || 470 + (huge_page_shift(h) == PGDIR_SHIFT)) 471 + return true; 472 + else 473 + return false; 481 474 #else 482 475 return false; 483 476 #endif ··· 516 501 { 517 502 atomic_long_sub(l, &mm->hugetlb_usage); 518 503 } 504 + 505 + #ifndef set_huge_swap_pte_at 506 + static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr, 507 + pte_t *ptep, pte_t pte, unsigned long sz) 508 + { 509 + set_huge_pte_at(mm, addr, ptep, pte); 510 + } 511 + #endif 519 512 #else /* CONFIG_HUGETLB_PAGE */ 520 513 struct hstate {}; 521 514 #define alloc_huge_page(v, a, r) NULL ··· 541 518 #define vma_mmu_pagesize(v) PAGE_SIZE 542 519 #define huge_page_order(h) 0 543 520 #define huge_page_shift(h) PAGE_SHIFT 521 + static inline bool hstate_is_gigantic(struct hstate *h) 522 + { 523 + return false; 524 + } 525 + 544 526 static inline unsigned int pages_per_huge_page(struct hstate *h) 545 527 { 546 528 return 1; ··· 571 543 } 572 544 573 545 static inline void hugetlb_count_sub(long l, struct mm_struct *mm) 546 + { 547 + } 548 + 549 + static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr, 550 + pte_t *ptep, pte_t pte, unsigned long sz) 574 551 { 
575 552 } 576 553 #endif /* CONFIG_HUGETLB_PAGE */
+7
include/linux/kmemleak.h
··· 22 22 #define __KMEMLEAK_H 23 23 24 24 #include <linux/slab.h> 25 + #include <linux/vmalloc.h> 25 26 26 27 #ifdef CONFIG_DEBUG_KMEMLEAK 27 28 ··· 31 30 gfp_t gfp) __ref; 32 31 extern void kmemleak_alloc_percpu(const void __percpu *ptr, size_t size, 33 32 gfp_t gfp) __ref; 33 + extern void kmemleak_vmalloc(const struct vm_struct *area, size_t size, 34 + gfp_t gfp) __ref; 34 35 extern void kmemleak_free(const void *ptr) __ref; 35 36 extern void kmemleak_free_part(const void *ptr, size_t size) __ref; 36 37 extern void kmemleak_free_percpu(const void __percpu *ptr) __ref; ··· 82 79 } 83 80 static inline void kmemleak_alloc_percpu(const void __percpu *ptr, size_t size, 84 81 gfp_t gfp) 82 + { 83 + } 84 + static inline void kmemleak_vmalloc(const struct vm_struct *area, size_t size, 85 + gfp_t gfp) 85 86 { 86 87 } 87 88 static inline void kmemleak_free(const void *ptr)
-25
include/linux/memblock.h
··· 57 57 58 58 extern struct memblock memblock; 59 59 extern int memblock_debug; 60 - #ifdef CONFIG_MOVABLE_NODE 61 - /* If movable_node boot option specified */ 62 - extern bool movable_node_enabled; 63 - #endif /* CONFIG_MOVABLE_NODE */ 64 60 65 61 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK 66 62 #define __init_memblock __meminit ··· 165 169 i != (u64)ULLONG_MAX; \ 166 170 __next_reserved_mem_region(&i, p_start, p_end)) 167 171 168 - #ifdef CONFIG_MOVABLE_NODE 169 172 static inline bool memblock_is_hotpluggable(struct memblock_region *m) 170 173 { 171 174 return m->flags & MEMBLOCK_HOTPLUG; 172 175 } 173 - 174 - static inline bool __init_memblock movable_node_is_enabled(void) 175 - { 176 - return movable_node_enabled; 177 - } 178 - #else 179 - static inline bool memblock_is_hotpluggable(struct memblock_region *m) 180 - { 181 - return false; 182 - } 183 - static inline bool movable_node_is_enabled(void) 184 - { 185 - return false; 186 - } 187 - #endif 188 176 189 177 static inline bool memblock_is_mirror(struct memblock_region *m) 190 178 { ··· 276 296 277 297 phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align); 278 298 279 - #ifdef CONFIG_MOVABLE_NODE 280 299 /* 281 300 * Set the allocation direction to bottom-up or top-down. 282 301 */ ··· 293 314 { 294 315 return memblock.bottom_up; 295 316 } 296 - #else 297 - static inline void __init memblock_set_bottom_up(bool enable) {} 298 - static inline bool memblock_bottom_up(void) { return false; } 299 - #endif 300 317 301 318 /* Flags for memblock_alloc_base() amd __memblock_alloc_base() */ 302 319 #define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0)
+265 -53
include/linux/memcontrol.h
··· 26 26 #include <linux/page_counter.h> 27 27 #include <linux/vmpressure.h> 28 28 #include <linux/eventfd.h> 29 - #include <linux/mmzone.h> 29 + #include <linux/mm.h> 30 + #include <linux/vmstat.h> 30 31 #include <linux/writeback.h> 31 32 #include <linux/page-flags.h> 32 33 ··· 45 44 MEMCG_SOCK, 46 45 /* XXX: why are these zone and not node counters? */ 47 46 MEMCG_KERNEL_STACK_KB, 48 - MEMCG_SLAB_RECLAIMABLE, 49 - MEMCG_SLAB_UNRECLAIMABLE, 50 47 MEMCG_NR_STAT, 51 48 }; 52 49 ··· 99 100 unsigned int generation; 100 101 }; 101 102 103 + struct lruvec_stat { 104 + long count[NR_VM_NODE_STAT_ITEMS]; 105 + }; 106 + 102 107 /* 103 108 * per-zone information in memory controller. 104 109 */ 105 110 struct mem_cgroup_per_node { 106 111 struct lruvec lruvec; 112 + struct lruvec_stat __percpu *lruvec_stat; 107 113 unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; 108 114 109 115 struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1]; ··· 361 357 } 362 358 struct mem_cgroup *mem_cgroup_from_id(unsigned short id); 363 359 360 + static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec) 361 + { 362 + struct mem_cgroup_per_node *mz; 363 + 364 + if (mem_cgroup_disabled()) 365 + return NULL; 366 + 367 + mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 368 + return mz->memcg; 369 + } 370 + 364 371 /** 365 372 * parent_mem_cgroup - find the accounting parent of a memcg 366 373 * @memcg: memcg whose parent to find ··· 502 487 return val; 503 488 } 504 489 490 + static inline void __mod_memcg_state(struct mem_cgroup *memcg, 491 + enum memcg_stat_item idx, int val) 492 + { 493 + if (!mem_cgroup_disabled()) 494 + __this_cpu_add(memcg->stat->count[idx], val); 495 + } 496 + 505 497 static inline void mod_memcg_state(struct mem_cgroup *memcg, 506 498 enum memcg_stat_item idx, int val) 507 499 { 508 500 if (!mem_cgroup_disabled()) 509 501 this_cpu_add(memcg->stat->count[idx], val); 510 - } 511 - 512 - static inline void inc_memcg_state(struct mem_cgroup 
*memcg, 513 - enum memcg_stat_item idx) 514 - { 515 - mod_memcg_state(memcg, idx, 1); 516 - } 517 - 518 - static inline void dec_memcg_state(struct mem_cgroup *memcg, 519 - enum memcg_stat_item idx) 520 - { 521 - mod_memcg_state(memcg, idx, -1); 522 502 } 523 503 524 504 /** ··· 533 523 * 534 524 * Kernel pages are an exception to this, since they'll never move. 535 525 */ 526 + static inline void __mod_memcg_page_state(struct page *page, 527 + enum memcg_stat_item idx, int val) 528 + { 529 + if (page->mem_cgroup) 530 + __mod_memcg_state(page->mem_cgroup, idx, val); 531 + } 532 + 536 533 static inline void mod_memcg_page_state(struct page *page, 537 534 enum memcg_stat_item idx, int val) 538 535 { ··· 547 530 mod_memcg_state(page->mem_cgroup, idx, val); 548 531 } 549 532 550 - static inline void inc_memcg_page_state(struct page *page, 551 - enum memcg_stat_item idx) 533 + static inline unsigned long lruvec_page_state(struct lruvec *lruvec, 534 + enum node_stat_item idx) 552 535 { 553 - mod_memcg_page_state(page, idx, 1); 536 + struct mem_cgroup_per_node *pn; 537 + long val = 0; 538 + int cpu; 539 + 540 + if (mem_cgroup_disabled()) 541 + return node_page_state(lruvec_pgdat(lruvec), idx); 542 + 543 + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 544 + for_each_possible_cpu(cpu) 545 + val += per_cpu(pn->lruvec_stat->count[idx], cpu); 546 + 547 + if (val < 0) 548 + val = 0; 549 + 550 + return val; 554 551 } 555 552 556 - static inline void dec_memcg_page_state(struct page *page, 557 - enum memcg_stat_item idx) 553 + static inline void __mod_lruvec_state(struct lruvec *lruvec, 554 + enum node_stat_item idx, int val) 558 555 { 559 - mod_memcg_page_state(page, idx, -1); 556 + struct mem_cgroup_per_node *pn; 557 + 558 + __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); 559 + if (mem_cgroup_disabled()) 560 + return; 561 + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 562 + __mod_memcg_state(pn->memcg, idx, val); 563 + 
__this_cpu_add(pn->lruvec_stat->count[idx], val); 564 + } 565 + 566 + static inline void mod_lruvec_state(struct lruvec *lruvec, 567 + enum node_stat_item idx, int val) 568 + { 569 + struct mem_cgroup_per_node *pn; 570 + 571 + mod_node_page_state(lruvec_pgdat(lruvec), idx, val); 572 + if (mem_cgroup_disabled()) 573 + return; 574 + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 575 + mod_memcg_state(pn->memcg, idx, val); 576 + this_cpu_add(pn->lruvec_stat->count[idx], val); 577 + } 578 + 579 + static inline void __mod_lruvec_page_state(struct page *page, 580 + enum node_stat_item idx, int val) 581 + { 582 + struct mem_cgroup_per_node *pn; 583 + 584 + __mod_node_page_state(page_pgdat(page), idx, val); 585 + if (mem_cgroup_disabled() || !page->mem_cgroup) 586 + return; 587 + __mod_memcg_state(page->mem_cgroup, idx, val); 588 + pn = page->mem_cgroup->nodeinfo[page_to_nid(page)]; 589 + __this_cpu_add(pn->lruvec_stat->count[idx], val); 590 + } 591 + 592 + static inline void mod_lruvec_page_state(struct page *page, 593 + enum node_stat_item idx, int val) 594 + { 595 + struct mem_cgroup_per_node *pn; 596 + 597 + mod_node_page_state(page_pgdat(page), idx, val); 598 + if (mem_cgroup_disabled() || !page->mem_cgroup) 599 + return; 600 + mod_memcg_state(page->mem_cgroup, idx, val); 601 + pn = page->mem_cgroup->nodeinfo[page_to_nid(page)]; 602 + this_cpu_add(pn->lruvec_stat->count[idx], val); 560 603 } 561 604 562 605 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 563 606 gfp_t gfp_mask, 564 607 unsigned long *total_scanned); 565 608 566 - static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, 567 - enum vm_event_item idx) 609 + static inline void count_memcg_events(struct mem_cgroup *memcg, 610 + enum vm_event_item idx, 611 + unsigned long count) 612 + { 613 + if (!mem_cgroup_disabled()) 614 + this_cpu_add(memcg->stat->events[idx], count); 615 + } 616 + 617 + static inline void count_memcg_page_event(struct page *page, 618 + 
enum memcg_stat_item idx) 619 + { 620 + if (page->mem_cgroup) 621 + count_memcg_events(page->mem_cgroup, idx, 1); 622 + } 623 + 624 + static inline void count_memcg_event_mm(struct mm_struct *mm, 625 + enum vm_event_item idx) 568 626 { 569 627 struct mem_cgroup *memcg; 570 628 ··· 648 556 649 557 rcu_read_lock(); 650 558 memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); 651 - if (likely(memcg)) 559 + if (likely(memcg)) { 652 560 this_cpu_inc(memcg->stat->events[idx]); 561 + if (idx == OOM_KILL) 562 + cgroup_file_notify(&memcg->events_file); 563 + } 653 564 rcu_read_unlock(); 654 565 } 655 566 #ifdef CONFIG_TRANSPARENT_HUGEPAGE ··· 770 675 return NULL; 771 676 } 772 677 678 + static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec) 679 + { 680 + return NULL; 681 + } 682 + 773 683 static inline bool mem_cgroup_online(struct mem_cgroup *memcg) 774 684 { 775 685 return true; ··· 845 745 return 0; 846 746 } 847 747 748 + static inline void __mod_memcg_state(struct mem_cgroup *memcg, 749 + enum memcg_stat_item idx, 750 + int nr) 751 + { 752 + } 753 + 848 754 static inline void mod_memcg_state(struct mem_cgroup *memcg, 849 755 enum memcg_stat_item idx, 850 756 int nr) 851 757 { 852 758 } 853 759 854 - static inline void inc_memcg_state(struct mem_cgroup *memcg, 855 - enum memcg_stat_item idx) 856 - { 857 - } 858 - 859 - static inline void dec_memcg_state(struct mem_cgroup *memcg, 860 - enum memcg_stat_item idx) 760 + static inline void __mod_memcg_page_state(struct page *page, 761 + enum memcg_stat_item idx, 762 + int nr) 861 763 { 862 764 } 863 765 ··· 869 767 { 870 768 } 871 769 872 - static inline void inc_memcg_page_state(struct page *page, 873 - enum memcg_stat_item idx) 770 + static inline unsigned long lruvec_page_state(struct lruvec *lruvec, 771 + enum node_stat_item idx) 874 772 { 773 + return node_page_state(lruvec_pgdat(lruvec), idx); 875 774 } 876 775 877 - static inline void dec_memcg_page_state(struct page *page, 878 - enum 
memcg_stat_item idx) 776 + static inline void __mod_lruvec_state(struct lruvec *lruvec, 777 + enum node_stat_item idx, int val) 879 778 { 779 + __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); 780 + } 781 + 782 + static inline void mod_lruvec_state(struct lruvec *lruvec, 783 + enum node_stat_item idx, int val) 784 + { 785 + mod_node_page_state(lruvec_pgdat(lruvec), idx, val); 786 + } 787 + 788 + static inline void __mod_lruvec_page_state(struct page *page, 789 + enum node_stat_item idx, int val) 790 + { 791 + __mod_node_page_state(page_pgdat(page), idx, val); 792 + } 793 + 794 + static inline void mod_lruvec_page_state(struct page *page, 795 + enum node_stat_item idx, int val) 796 + { 797 + mod_node_page_state(page_pgdat(page), idx, val); 880 798 } 881 799 882 800 static inline ··· 911 789 { 912 790 } 913 791 792 + static inline void count_memcg_events(struct mem_cgroup *memcg, 793 + enum vm_event_item idx, 794 + unsigned long count) 795 + { 796 + } 797 + 798 + static inline void count_memcg_page_event(struct page *page, 799 + enum memcg_stat_item idx) 800 + { 801 + } 802 + 914 803 static inline 915 - void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) 804 + void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) 916 805 { 917 806 } 918 807 #endif /* CONFIG_MEMCG */ 808 + 809 + static inline void __inc_memcg_state(struct mem_cgroup *memcg, 810 + enum memcg_stat_item idx) 811 + { 812 + __mod_memcg_state(memcg, idx, 1); 813 + } 814 + 815 + static inline void __dec_memcg_state(struct mem_cgroup *memcg, 816 + enum memcg_stat_item idx) 817 + { 818 + __mod_memcg_state(memcg, idx, -1); 819 + } 820 + 821 + static inline void __inc_memcg_page_state(struct page *page, 822 + enum memcg_stat_item idx) 823 + { 824 + __mod_memcg_page_state(page, idx, 1); 825 + } 826 + 827 + static inline void __dec_memcg_page_state(struct page *page, 828 + enum memcg_stat_item idx) 829 + { 830 + __mod_memcg_page_state(page, idx, -1); 831 + } 832 + 
833 + static inline void __inc_lruvec_state(struct lruvec *lruvec, 834 + enum node_stat_item idx) 835 + { 836 + __mod_lruvec_state(lruvec, idx, 1); 837 + } 838 + 839 + static inline void __dec_lruvec_state(struct lruvec *lruvec, 840 + enum node_stat_item idx) 841 + { 842 + __mod_lruvec_state(lruvec, idx, -1); 843 + } 844 + 845 + static inline void __inc_lruvec_page_state(struct page *page, 846 + enum node_stat_item idx) 847 + { 848 + __mod_lruvec_page_state(page, idx, 1); 849 + } 850 + 851 + static inline void __dec_lruvec_page_state(struct page *page, 852 + enum node_stat_item idx) 853 + { 854 + __mod_lruvec_page_state(page, idx, -1); 855 + } 856 + 857 + static inline void inc_memcg_state(struct mem_cgroup *memcg, 858 + enum memcg_stat_item idx) 859 + { 860 + mod_memcg_state(memcg, idx, 1); 861 + } 862 + 863 + static inline void dec_memcg_state(struct mem_cgroup *memcg, 864 + enum memcg_stat_item idx) 865 + { 866 + mod_memcg_state(memcg, idx, -1); 867 + } 868 + 869 + static inline void inc_memcg_page_state(struct page *page, 870 + enum memcg_stat_item idx) 871 + { 872 + mod_memcg_page_state(page, idx, 1); 873 + } 874 + 875 + static inline void dec_memcg_page_state(struct page *page, 876 + enum memcg_stat_item idx) 877 + { 878 + mod_memcg_page_state(page, idx, -1); 879 + } 880 + 881 + static inline void inc_lruvec_state(struct lruvec *lruvec, 882 + enum node_stat_item idx) 883 + { 884 + mod_lruvec_state(lruvec, idx, 1); 885 + } 886 + 887 + static inline void dec_lruvec_state(struct lruvec *lruvec, 888 + enum node_stat_item idx) 889 + { 890 + mod_lruvec_state(lruvec, idx, -1); 891 + } 892 + 893 + static inline void inc_lruvec_page_state(struct page *page, 894 + enum node_stat_item idx) 895 + { 896 + mod_lruvec_page_state(page, idx, 1); 897 + } 898 + 899 + static inline void dec_lruvec_page_state(struct page *page, 900 + enum node_stat_item idx) 901 + { 902 + mod_lruvec_page_state(page, idx, -1); 903 + } 919 904 920 905 #ifdef CONFIG_CGROUP_WRITEBACK 921 906 ··· 1115 
886 return memcg ? memcg->kmemcg_id : -1; 1116 887 } 1117 888 1118 - /** 1119 - * memcg_kmem_update_page_stat - update kmem page state statistics 1120 - * @page: the page 1121 - * @idx: page state item to account 1122 - * @val: number of pages (positive or negative) 1123 - */ 1124 - static inline void memcg_kmem_update_page_stat(struct page *page, 1125 - enum memcg_stat_item idx, int val) 1126 - { 1127 - if (memcg_kmem_enabled() && page->mem_cgroup) 1128 - this_cpu_add(page->mem_cgroup->stat->count[idx], val); 1129 - } 1130 - 1131 889 #else 1132 890 #define for_each_memcg_cache_index(_idx) \ 1133 891 for (; NULL; ) ··· 1137 921 { 1138 922 } 1139 923 1140 - static inline void memcg_kmem_update_page_stat(struct page *page, 1141 - enum memcg_stat_item idx, int val) 1142 - { 1143 - } 1144 924 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 1145 925 1146 926 #endif /* _LINUX_MEMCONTROL_H */
+43 -10
include/linux/memory_hotplug.h
··· 14 14 struct resource; 15 15 16 16 #ifdef CONFIG_MEMORY_HOTPLUG 17 + /* 18 + * Return page for the valid pfn only if the page is online. All pfn 19 + * walkers which rely on the fully initialized page->flags and others 20 + * should use this rather than pfn_valid && pfn_to_page 21 + */ 22 + #define pfn_to_online_page(pfn) \ 23 + ({ \ 24 + struct page *___page = NULL; \ 25 + unsigned long ___nr = pfn_to_section_nr(pfn); \ 26 + \ 27 + if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr))\ 28 + ___page = pfn_to_page(pfn); \ 29 + ___page; \ 30 + }) 17 31 18 32 /* 19 33 * Types for free bootmem stored in page->lru.next. These have to be in ··· 115 101 extern int try_online_node(int nid); 116 102 117 103 extern bool memhp_auto_online; 104 + /* If movable_node boot option specified */ 105 + extern bool movable_node_enabled; 106 + static inline bool movable_node_is_enabled(void) 107 + { 108 + return movable_node_enabled; 109 + } 118 110 119 111 #ifdef CONFIG_MEMORY_HOTREMOVE 120 112 extern bool is_pageblock_removable_nolock(struct page *page); ··· 129 109 unsigned long nr_pages); 130 110 #endif /* CONFIG_MEMORY_HOTREMOVE */ 131 111 132 - /* reasonably generic interface to expand the physical pages in a zone */ 133 - extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn, 134 - unsigned long nr_pages); 112 + /* reasonably generic interface to expand the physical pages */ 113 + extern int __add_pages(int nid, unsigned long start_pfn, 114 + unsigned long nr_pages, bool want_memblock); 135 115 136 116 #ifdef CONFIG_NUMA 137 117 extern int memory_add_physaddr_to_nid(u64 start); ··· 223 203 extern void clear_zone_contiguous(struct zone *zone); 224 204 225 205 #else /* ! 
CONFIG_MEMORY_HOTPLUG */ 206 + #define pfn_to_online_page(pfn) \ 207 + ({ \ 208 + struct page *___page = NULL; \ 209 + if (pfn_valid(pfn)) \ 210 + ___page = pfn_to_page(pfn); \ 211 + ___page; \ 212 + }) 213 + 226 214 /* 227 215 * Stub functions for when hotplug is off 228 216 */ ··· 272 244 static inline void mem_hotplug_begin(void) {} 273 245 static inline void mem_hotplug_done(void) {} 274 246 247 + static inline bool movable_node_is_enabled(void) 248 + { 249 + return false; 250 + } 275 251 #endif /* ! CONFIG_MEMORY_HOTPLUG */ 276 252 277 253 #ifdef CONFIG_MEMORY_HOTREMOVE ··· 306 274 void *arg, int (*func)(struct memory_block *, void *)); 307 275 extern int add_memory(int nid, u64 start, u64 size); 308 276 extern int add_memory_resource(int nid, struct resource *resource, bool online); 309 - extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default, 310 - bool for_device); 311 - extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device); 277 + extern int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock); 278 + extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, 279 + unsigned long nr_pages); 312 280 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); 313 281 extern bool is_memblock_offlined(struct memory_block *mem); 314 282 extern void remove_memory(int nid, u64 start, u64 size); 315 - extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn); 283 + extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn); 316 284 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, 317 285 unsigned long map_offset); 318 286 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, 319 287 unsigned long pnum); 320 - extern bool zone_can_shift(unsigned long pfn, unsigned long nr_pages, 321 - enum zone_type target, int *zone_shift); 322 - 288 + extern bool allow_online_pfn_range(int nid, unsigned 
long pfn, unsigned long nr_pages, 289 + int online_type); 290 + extern struct zone *default_zone_for_pfn(int nid, unsigned long pfn, 291 + unsigned long nr_pages); 323 292 #endif /* __LINUX_MEMORY_HOTPLUG_H */
+5 -7
include/linux/mempolicy.h
··· 142 142 143 143 extern void numa_default_policy(void); 144 144 extern void numa_policy_init(void); 145 - extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new, 146 - enum mpol_rebind_step step); 145 + extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new); 147 146 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new); 148 147 149 - extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, 148 + extern int huge_node(struct vm_area_struct *vma, 150 149 unsigned long addr, gfp_t gfp_flags, 151 150 struct mempolicy **mpol, nodemask_t **nodemask); 152 151 extern bool init_nodemask_of_mempolicy(nodemask_t *mask); ··· 259 260 } 260 261 261 262 static inline void mpol_rebind_task(struct task_struct *tsk, 262 - const nodemask_t *new, 263 - enum mpol_rebind_step step) 263 + const nodemask_t *new) 264 264 { 265 265 } 266 266 ··· 267 269 { 268 270 } 269 271 270 - static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma, 272 + static inline int huge_node(struct vm_area_struct *vma, 271 273 unsigned long addr, gfp_t gfp_flags, 272 274 struct mempolicy **mpol, nodemask_t **nodemask) 273 275 { 274 276 *mpol = NULL; 275 277 *nodemask = NULL; 276 - return node_zonelist(0, gfp_flags); 278 + return 0; 277 279 } 278 280 279 281 static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+50 -9
include/linux/mmzone.h
··· 125 125 NR_ZONE_UNEVICTABLE, 126 126 NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ 127 127 NR_MLOCK, /* mlock()ed pages found and moved off LRU */ 128 - NR_SLAB_RECLAIMABLE, 129 - NR_SLAB_UNRECLAIMABLE, 130 128 NR_PAGETABLE, /* used for pagetables */ 131 129 NR_KERNEL_STACK_KB, /* measured in KiB */ 132 130 /* Second 128 byte cacheline */ ··· 150 152 NR_INACTIVE_FILE, /* " " " " " */ 151 153 NR_ACTIVE_FILE, /* " " " " " */ 152 154 NR_UNEVICTABLE, /* " " " " " */ 155 + NR_SLAB_RECLAIMABLE, 156 + NR_SLAB_UNRECLAIMABLE, 153 157 NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ 154 158 NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ 155 159 WORKINGSET_REFAULT, ··· 533 533 } 534 534 535 535 /* 536 + * Return true if [start_pfn, start_pfn + nr_pages) range has a non-empty 537 + * intersection with the given zone 538 + */ 539 + static inline bool zone_intersects(struct zone *zone, 540 + unsigned long start_pfn, unsigned long nr_pages) 541 + { 542 + if (zone_is_empty(zone)) 543 + return false; 544 + if (start_pfn >= zone_end_pfn(zone) || 545 + start_pfn + nr_pages <= zone->zone_start_pfn) 546 + return false; 547 + 548 + return true; 549 + } 550 + 551 + /* 536 552 * The "priority" of VM scanning is how much of the queues we will scan in one 537 553 * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the 538 554 * queues ("queue_length >> 12") during an aging round. 
··· 788 772 MEMMAP_EARLY, 789 773 MEMMAP_HOTPLUG, 790 774 }; 791 - extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn, 775 + extern void init_currently_empty_zone(struct zone *zone, unsigned long start_pfn, 792 776 unsigned long size); 793 777 794 778 extern void lruvec_init(struct lruvec *lruvec); ··· 1160 1144 */ 1161 1145 #define SECTION_MARKED_PRESENT (1UL<<0) 1162 1146 #define SECTION_HAS_MEM_MAP (1UL<<1) 1163 - #define SECTION_MAP_LAST_BIT (1UL<<2) 1147 + #define SECTION_IS_ONLINE (1UL<<2) 1148 + #define SECTION_MAP_LAST_BIT (1UL<<3) 1164 1149 #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) 1165 - #define SECTION_NID_SHIFT 2 1150 + #define SECTION_NID_SHIFT 3 1166 1151 1167 1152 static inline struct page *__section_mem_map_addr(struct mem_section *section) 1168 1153 { ··· 1192 1175 return valid_section(__nr_to_section(nr)); 1193 1176 } 1194 1177 1178 + static inline int online_section(struct mem_section *section) 1179 + { 1180 + return (section && (section->section_mem_map & SECTION_IS_ONLINE)); 1181 + } 1182 + 1183 + static inline int online_section_nr(unsigned long nr) 1184 + { 1185 + return online_section(__nr_to_section(nr)); 1186 + } 1187 + 1188 + #ifdef CONFIG_MEMORY_HOTPLUG 1189 + void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn); 1190 + #ifdef CONFIG_MEMORY_HOTREMOVE 1191 + void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn); 1192 + #endif 1193 + #endif 1194 + 1195 1195 static inline struct mem_section *__pfn_to_section(unsigned long pfn) 1196 1196 { 1197 1197 return __nr_to_section(pfn_to_section_nr(pfn)); 1198 1198 } 1199 + 1200 + extern int __highest_present_section_nr; 1199 1201 1200 1202 #ifndef CONFIG_HAVE_ARCH_PFN_VALID 1201 1203 static inline int pfn_valid(unsigned long pfn) ··· 1287 1251 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 1288 1252 /* 1289 1253 * pfn_valid() is meant to be able to tell if a given PFN has valid memmap 1290 - * associated with it or not. 
In FLATMEM, it is expected that holes always 1291 - * have valid memmap as long as there is valid PFNs either side of the hole. 1292 - * In SPARSEMEM, it is assumed that a valid section has a memmap for the 1293 - * entire section. 1254 + * associated with it or not. This means that a struct page exists for this 1255 + * pfn. The caller cannot assume the page is fully initialized in general. 1256 + * Hotplugable pages might not have been onlined yet. pfn_to_online_page() 1257 + * will ensure the struct page is fully online and initialized. Special pages 1258 + * (e.g. ZONE_DEVICE) are never onlined and should be treated accordingly. 1259 + * 1260 + * In FLATMEM, it is expected that holes always have valid memmap as long as 1261 + * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed 1262 + * that a valid section has a memmap for the entire section. 1294 1263 * 1295 1264 * However, an ARM, and maybe other embedded architectures in the future 1296 1265 * free memmap backing holes to save memory on the assumption the memmap is
+34 -1
include/linux/node.h
··· 30 30 extern struct node *node_devices[]; 31 31 typedef void (*node_registration_func_t)(struct node *); 32 32 33 + #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA) 34 + extern int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages); 35 + #else 36 + static inline int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) 37 + { 38 + return 0; 39 + } 40 + #endif 41 + 33 42 extern void unregister_node(struct node *node); 34 43 #ifdef CONFIG_NUMA 35 - extern int register_one_node(int nid); 44 + /* Core of the node registration - only memory hotplug should use this */ 45 + extern int __register_one_node(int nid); 46 + 47 + /* Registers an online node */ 48 + static inline int register_one_node(int nid) 49 + { 50 + int error = 0; 51 + 52 + if (node_online(nid)) { 53 + struct pglist_data *pgdat = NODE_DATA(nid); 54 + 55 + error = __register_one_node(nid); 56 + if (error) 57 + return error; 58 + /* link memory sections under this node */ 59 + error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages); 60 + } 61 + 62 + return error; 63 + } 64 + 36 65 extern void unregister_one_node(int nid); 37 66 extern int register_cpu_under_node(unsigned int cpu, unsigned int nid); 38 67 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); ··· 75 46 node_registration_func_t unregister); 76 47 #endif 77 48 #else 49 + static inline int __register_one_node(int nid) 50 + { 51 + return 0; 52 + } 78 53 static inline int register_one_node(int nid) 79 54 { 80 55 return 0;
-4
include/linux/nodemask.h
··· 387 387 #else 388 388 N_HIGH_MEMORY = N_NORMAL_MEMORY, 389 389 #endif 390 - #ifdef CONFIG_MOVABLE_NODE 391 390 N_MEMORY, /* The node has memory(regular, high, movable) */ 392 - #else 393 - N_MEMORY = N_HIGH_MEMORY, 394 - #endif 395 391 N_CPU, /* The node has one or more cpus */ 396 392 NR_NODE_STATES 397 393 };
+5 -2
include/linux/page-flags.h
··· 326 326 #ifdef CONFIG_SWAP 327 327 static __always_inline int PageSwapCache(struct page *page) 328 328 { 329 + #ifdef CONFIG_THP_SWAP 330 + page = compound_head(page); 331 + #endif 329 332 return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags); 330 333 331 334 } 332 - SETPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND) 333 - CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND) 335 + SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) 336 + CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) 334 337 #else 335 338 PAGEFLAG_FALSE(SwapCache) 336 339 #endif
+1 -1
include/linux/sched.h
··· 904 904 #ifdef CONFIG_NUMA 905 905 /* Protected by alloc_lock: */ 906 906 struct mempolicy *mempolicy; 907 - short il_next; 907 + short il_prev; 908 908 short pref_node_fork; 909 909 #endif 910 910 #ifdef CONFIG_NUMA_BALANCING
+20
include/linux/set_memory.h
··· 1 + /* 2 + * Copyright 2017, Michael Ellerman, IBM Corporation. 3 + * 4 + * This program is free software; you can redistribute it and/or 5 + * modify it under the terms of the GNU General Public License version 6 + * 2 as published by the Free Software Foundation; 7 + */ 8 + #ifndef _LINUX_SET_MEMORY_H_ 9 + #define _LINUX_SET_MEMORY_H_ 10 + 11 + #ifdef CONFIG_ARCH_HAS_SET_MEMORY 12 + #include <asm/set_memory.h> 13 + #else 14 + static inline int set_memory_ro(unsigned long addr, int numpages) { return 0; } 15 + static inline int set_memory_rw(unsigned long addr, int numpages) { return 0; } 16 + static inline int set_memory_x(unsigned long addr, int numpages) { return 0; } 17 + static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; } 18 + #endif 19 + 20 + #endif /* _LINUX_SET_MEMORY_H_ */
+33 -1
include/linux/slub_def.h
··· 41 41 void **freelist; /* Pointer to next available object */ 42 42 unsigned long tid; /* Globally unique transaction id */ 43 43 struct page *page; /* The slab from which we are allocating */ 44 + #ifdef CONFIG_SLUB_CPU_PARTIAL 44 45 struct page *partial; /* Partially allocated frozen slabs */ 46 + #endif 45 47 #ifdef CONFIG_SLUB_STATS 46 48 unsigned stat[NR_SLUB_STAT_ITEMS]; 47 49 #endif 48 50 }; 51 + 52 + #ifdef CONFIG_SLUB_CPU_PARTIAL 53 + #define slub_percpu_partial(c) ((c)->partial) 54 + 55 + #define slub_set_percpu_partial(c, p) \ 56 + ({ \ 57 + slub_percpu_partial(c) = (p)->next; \ 58 + }) 59 + 60 + #define slub_percpu_partial_read_once(c) READ_ONCE(slub_percpu_partial(c)) 61 + #else 62 + #define slub_percpu_partial(c) NULL 63 + 64 + #define slub_set_percpu_partial(c, p) 65 + 66 + #define slub_percpu_partial_read_once(c) NULL 67 + #endif // CONFIG_SLUB_CPU_PARTIAL 49 68 50 69 /* 51 70 * Word size structure that can be atomically updated or read and that ··· 86 67 int size; /* The size of an object including meta data */ 87 68 int object_size; /* The size of an object without meta data */ 88 69 int offset; /* Free pointer offset. */ 70 + #ifdef CONFIG_SLUB_CPU_PARTIAL 89 71 int cpu_partial; /* Number of per cpu partial objects to keep around */ 72 + #endif 90 73 struct kmem_cache_order_objects oo; 91 74 92 75 /* Allocation and freeing of slabs */ ··· 100 79 int inuse; /* Offset to metadata */ 101 80 int align; /* Alignment */ 102 81 int reserved; /* Reserved bytes at the end of slabs */ 82 + int red_left_pad; /* Left redzone padding size */ 103 83 const char *name; /* Name (only for display!) 
*/ 104 84 struct list_head list; /* List of slab caches */ 105 - int red_left_pad; /* Left redzone padding size */ 106 85 #ifdef CONFIG_SYSFS 107 86 struct kobject kobj; /* For sysfs */ 108 87 struct work_struct kobj_remove_work; ··· 132 111 133 112 struct kmem_cache_node *node[MAX_NUMNODES]; 134 113 }; 114 + 115 + #ifdef CONFIG_SLUB_CPU_PARTIAL 116 + #define slub_cpu_partial(s) ((s)->cpu_partial) 117 + #define slub_set_cpu_partial(s, n) \ 118 + ({ \ 119 + slub_cpu_partial(s) = (n); \ 120 + }) 121 + #else 122 + #define slub_cpu_partial(s) (0) 123 + #define slub_set_cpu_partial(s, n) 124 + #endif // CONFIG_SLUB_CPU_PARTIAL 135 125 136 126 #ifdef CONFIG_SYSFS 137 127 #define SLAB_SUPPORTS_SYSFS
+10 -9
include/linux/swap.h
··· 353 353 >> SWAP_ADDRESS_SPACE_SHIFT]) 354 354 extern unsigned long total_swapcache_pages(void); 355 355 extern void show_swap_cache_info(void); 356 - extern int add_to_swap(struct page *, struct list_head *list); 356 + extern int add_to_swap(struct page *page); 357 357 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); 358 358 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); 359 359 extern void __delete_from_swap_cache(struct page *); ··· 386 386 } 387 387 388 388 extern void si_swapinfo(struct sysinfo *); 389 - extern swp_entry_t get_swap_page(void); 389 + extern swp_entry_t get_swap_page(struct page *page); 390 + extern void put_swap_page(struct page *page, swp_entry_t entry); 390 391 extern swp_entry_t get_swap_page_of_type(int); 391 - extern int get_swap_pages(int n, swp_entry_t swp_entries[]); 392 + extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]); 392 393 extern int add_swap_count_continuation(swp_entry_t, gfp_t); 393 394 extern void swap_shmem_alloc(swp_entry_t); 394 395 extern int swap_duplicate(swp_entry_t); 395 396 extern int swapcache_prepare(swp_entry_t); 396 397 extern void swap_free(swp_entry_t); 397 - extern void swapcache_free(swp_entry_t); 398 398 extern void swapcache_free_entries(swp_entry_t *entries, int n); 399 399 extern int free_swap_and_cache(swp_entry_t); 400 400 extern int swap_type_of(dev_t, sector_t, struct block_device **); ··· 453 453 { 454 454 } 455 455 456 - static inline void swapcache_free(swp_entry_t swp) 456 + static inline void put_swap_page(struct page *page, swp_entry_t swp) 457 457 { 458 458 } 459 459 ··· 473 473 return NULL; 474 474 } 475 475 476 - static inline int add_to_swap(struct page *page, struct list_head *list) 476 + static inline int add_to_swap(struct page *page) 477 477 { 478 478 return 0; 479 479 } ··· 515 515 return 0; 516 516 } 517 517 518 - static inline swp_entry_t get_swap_page(void) 518 + static inline swp_entry_t get_swap_page(struct page 
*page) 519 519 { 520 520 swp_entry_t entry; 521 521 entry.val = 0; ··· 548 548 #ifdef CONFIG_MEMCG_SWAP 549 549 extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry); 550 550 extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry); 551 - extern void mem_cgroup_uncharge_swap(swp_entry_t entry); 551 + extern void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages); 552 552 extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); 553 553 extern bool mem_cgroup_swap_full(struct page *page); 554 554 #else ··· 562 562 return 0; 563 563 } 564 564 565 - static inline void mem_cgroup_uncharge_swap(swp_entry_t entry) 565 + static inline void mem_cgroup_uncharge_swap(swp_entry_t entry, 566 + unsigned int nr_pages) 566 567 { 567 568 } 568 569
+4 -2
include/linux/swap_cgroup.h
··· 7 7 8 8 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, 9 9 unsigned short old, unsigned short new); 10 - extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); 10 + extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, 11 + unsigned int nr_ents); 11 12 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); 12 13 extern int swap_cgroup_swapon(int type, unsigned long max_pages); 13 14 extern void swap_cgroup_swapoff(int type); ··· 16 15 #else 17 16 18 17 static inline 19 - unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) 18 + unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, 19 + unsigned int nr_ents) 20 20 { 21 21 return 0; 22 22 }
+1
include/linux/vm_event_item.h
··· 41 41 KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY, 42 42 PAGEOUTRUN, PGROTATED, 43 43 DROP_PAGECACHE, DROP_SLAB, 44 + OOM_KILL, 44 45 #ifdef CONFIG_NUMA_BALANCING 45 46 NUMA_PTE_UPDATES, 46 47 NUMA_HUGE_PTE_UPDATES,
-1
include/linux/vmstat.h
··· 3 3 4 4 #include <linux/types.h> 5 5 #include <linux/percpu.h> 6 - #include <linux/mm.h> 7 6 #include <linux/mmzone.h> 8 7 #include <linux/vm_event_item.h> 9 8 #include <linux/atomic.h>
+1
include/uapi/linux/magic.h
··· 42 42 #define MSDOS_SUPER_MAGIC 0x4d44 /* MD */ 43 43 #define NCP_SUPER_MAGIC 0x564c /* Guess, what 0x564c is :-) */ 44 44 #define NFS_SUPER_MAGIC 0x6969 45 + #define OCFS2_SUPER_MAGIC 0x7461636f 45 46 #define OPENPROM_SUPER_MAGIC 0x9fa1 46 47 #define QNX4_SUPER_MAGIC 0x002f /* qnx4 fs detection */ 47 48 #define QNX6_SUPER_MAGIC 0x68191122 /* qnx6 fs detection */
-8
include/uapi/linux/mempolicy.h
··· 24 24 MPOL_MAX, /* always last member of enum */ 25 25 }; 26 26 27 - enum mpol_rebind_step { 28 - MPOL_REBIND_ONCE, /* do rebind work at once(not by two step) */ 29 - MPOL_REBIND_STEP1, /* first step(set all the newly nodes) */ 30 - MPOL_REBIND_STEP2, /* second step(clean all the disallowed nodes)*/ 31 - MPOL_REBIND_NSTEP, 32 - }; 33 - 34 27 /* Flags for set_mempolicy */ 35 28 #define MPOL_F_STATIC_NODES (1 << 15) 36 29 #define MPOL_F_RELATIVE_NODES (1 << 14) ··· 58 65 */ 59 66 #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ 60 67 #define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ 61 - #define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */ 62 68 #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ 63 69 #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ 64 70
+14
init/Kconfig
··· 1548 1548 1549 1549 endchoice 1550 1550 1551 + config SLAB_MERGE_DEFAULT 1552 + bool "Allow slab caches to be merged" 1553 + default y 1554 + help 1555 + For reduced kernel memory fragmentation, slab caches can be 1556 + merged when they share the same size and other characteristics. 1557 + This carries a risk of kernel heap overflows being able to 1558 + overwrite objects from merged caches (and more easily control 1559 + cache layout), which makes such heap attacks easier to exploit 1560 + by attackers. By keeping caches unmerged, these kinds of exploits 1561 + can usually only damage objects in the same cache. To disable 1562 + merging at runtime, "slab_nomerge" can be passed on the kernel 1563 + command line. 1564 + 1551 1565 config SLAB_FREELIST_RANDOM 1552 1566 default n 1553 1567 depends on SLAB || SLUB
+9 -24
kernel/cgroup/cpuset.c
··· 1038 1038 * @tsk: the task to change 1039 1039 * @newmems: new nodes that the task will be set 1040 1040 * 1041 - * In order to avoid seeing no nodes if the old and new nodes are disjoint, 1042 - * we structure updates as setting all new allowed nodes, then clearing newly 1043 - * disallowed ones. 1041 + * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed 1042 + * and rebind an eventual tasks' mempolicy. If the task is allocating in 1043 + * parallel, it might temporarily see an empty intersection, which results in 1044 + * a seqlock check and retry before OOM or allocation failure. 1044 1045 */ 1045 1046 static void cpuset_change_task_nodemask(struct task_struct *tsk, 1046 1047 nodemask_t *newmems) 1047 1048 { 1048 - bool need_loop; 1049 - 1050 1049 task_lock(tsk); 1051 - /* 1052 - * Determine if a loop is necessary if another thread is doing 1053 - * read_mems_allowed_begin(). If at least one node remains unchanged and 1054 - * tsk does not have a mempolicy, then an empty nodemask will not be 1055 - * possible when mems_allowed is larger than a word. 1056 - */ 1057 - need_loop = task_has_mempolicy(tsk) || 1058 - !nodes_intersects(*newmems, tsk->mems_allowed); 1059 1050 1060 - if (need_loop) { 1061 - local_irq_disable(); 1062 - write_seqcount_begin(&tsk->mems_allowed_seq); 1063 - } 1051 + local_irq_disable(); 1052 + write_seqcount_begin(&tsk->mems_allowed_seq); 1064 1053 1065 1054 nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); 1066 - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); 1067 - 1068 - mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); 1055 + mpol_rebind_task(tsk, newmems); 1069 1056 tsk->mems_allowed = *newmems; 1070 1057 1071 - if (need_loop) { 1072 - write_seqcount_end(&tsk->mems_allowed_seq); 1073 - local_irq_enable(); 1074 - } 1058 + write_seqcount_end(&tsk->mems_allowed_seq); 1059 + local_irq_enable(); 1075 1060 1076 1061 task_unlock(tsk); 1077 1062 }
-1
kernel/exit.c
··· 51 51 #include <linux/task_io_accounting_ops.h> 52 52 #include <linux/tracehook.h> 53 53 #include <linux/fs_struct.h> 54 - #include <linux/userfaultfd_k.h> 55 54 #include <linux/init_task.h> 56 55 #include <linux/perf_event.h> 57 56 #include <trace/events/sched.h>
+1 -1
kernel/extable.c
··· 69 69 return 0; 70 70 } 71 71 72 - int core_kernel_text(unsigned long addr) 72 + int notrace core_kernel_text(unsigned long addr) 73 73 { 74 74 if (addr >= (unsigned long)_stext && 75 75 addr < (unsigned long)_etext)
+4 -4
kernel/fork.c
··· 326 326 } 327 327 328 328 /* All stack pages belong to the same memcg. */ 329 - memcg_kmem_update_page_stat(vm->pages[0], MEMCG_KERNEL_STACK_KB, 330 - account * (THREAD_SIZE / 1024)); 329 + mod_memcg_page_state(vm->pages[0], MEMCG_KERNEL_STACK_KB, 330 + account * (THREAD_SIZE / 1024)); 331 331 } else { 332 332 /* 333 333 * All stack pages are in the same zone and belong to the ··· 338 338 mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB, 339 339 THREAD_SIZE / 1024 * account); 340 340 341 - memcg_kmem_update_page_stat(first_page, MEMCG_KERNEL_STACK_KB, 342 - account * (THREAD_SIZE / 1024)); 341 + mod_memcg_page_state(first_page, MEMCG_KERNEL_STACK_KB, 342 + account * (THREAD_SIZE / 1024)); 343 343 } 344 344 } 345 345
+2 -1
kernel/locking/qspinlock_paravirt.h
··· 193 193 */ 194 194 pv_lock_hash = alloc_large_system_hash("PV qspinlock", 195 195 sizeof(struct pv_hash_entry), 196 - pv_hash_size, 0, HASH_EARLY, 196 + pv_hash_size, 0, 197 + HASH_EARLY | HASH_ZERO, 197 198 &pv_lock_hash_bits, NULL, 198 199 pv_hash_size, pv_hash_size); 199 200 }
+5 -1
kernel/memremap.c
··· 358 358 goto err_pfn_remap; 359 359 360 360 mem_hotplug_begin(); 361 - error = arch_add_memory(nid, align_start, align_size, true); 361 + error = arch_add_memory(nid, align_start, align_size, false); 362 + if (!error) 363 + move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], 364 + align_start >> PAGE_SHIFT, 365 + align_size >> PAGE_SHIFT); 362 366 mem_hotplug_done(); 363 367 if (error) 364 368 goto err_add_memory;
+1 -3
kernel/module.c
··· 49 49 #include <linux/rculist.h> 50 50 #include <linux/uaccess.h> 51 51 #include <asm/cacheflush.h> 52 - #ifdef CONFIG_STRICT_MODULE_RWX 53 - #include <asm/set_memory.h> 54 - #endif 52 + #include <linux/set_memory.h> 55 53 #include <asm/mmu_context.h> 56 54 #include <linux/license.h> 57 55 #include <asm/sections.h>
+2 -5
kernel/pid.c
··· 575 575 */ 576 576 void __init pidhash_init(void) 577 577 { 578 - unsigned int i, pidhash_size; 578 + unsigned int pidhash_size; 579 579 580 580 pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18, 581 - HASH_EARLY | HASH_SMALL, 581 + HASH_EARLY | HASH_SMALL | HASH_ZERO, 582 582 &pidhash_shift, NULL, 583 583 0, 4096); 584 584 pidhash_size = 1U << pidhash_shift; 585 - 586 - for (i = 0; i < pidhash_size; i++) 587 - INIT_HLIST_HEAD(&pid_hash[i]); 588 585 } 589 586 590 587 void __init pidmap_init(void)
+1 -3
kernel/power/snapshot.c
··· 30 30 #include <linux/slab.h> 31 31 #include <linux/compiler.h> 32 32 #include <linux/ktime.h> 33 + #include <linux/set_memory.h> 33 34 34 35 #include <linux/uaccess.h> 35 36 #include <asm/mmu_context.h> 36 37 #include <asm/pgtable.h> 37 38 #include <asm/tlbflush.h> 38 39 #include <asm/io.h> 39 - #ifdef CONFIG_ARCH_HAS_SET_MEMORY 40 - #include <asm/set_memory.h> 41 - #endif 42 40 43 41 #include "power.h" 44 42
+12 -26
mm/Kconfig
··· 149 149 config MEMORY_ISOLATION 150 150 bool 151 151 152 - config MOVABLE_NODE 153 - bool "Enable to assign a node which has only movable memory" 154 - depends on HAVE_MEMBLOCK 155 - depends on NO_BOOTMEM 156 - depends on X86_64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG 157 - depends on NUMA 158 - default n 159 - help 160 - Allow a node to have only movable memory. Pages used by the kernel, 161 - such as direct mapping pages cannot be migrated. So the corresponding 162 - memory device cannot be hotplugged. This option allows the following 163 - two things: 164 - - When the system is booting, node full of hotpluggable memory can 165 - be arranged to have only movable memory so that the whole node can 166 - be hot-removed. (need movable_node boot option specified). 167 - - After the system is up, the option allows users to online all the 168 - memory of a node as movable memory so that the whole node can be 169 - hot-removed. 170 - 171 - Users who don't use the memory hotplug feature are fine with this 172 - option on since they don't specify movable_node boot option or they 173 - don't online memory as movable. 174 - 175 - Say Y here if you want to hotplug a whole node. 176 - Say N here if you want kernel to use memory on all nodes evenly. 177 - 178 152 # 179 153 # Only be set on architectures that have completely implemented memory hotplug 180 154 # feature. If you are not sure, don't touch it. ··· 419 445 memory footprint of applications without a guaranteed 420 446 benefit. 421 447 endchoice 448 + 449 + config ARCH_WANTS_THP_SWAP 450 + def_bool n 451 + 452 + config THP_SWAP 453 + def_bool y 454 + depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP 455 + help 456 + Swap transparent huge pages in one piece, without splitting. 457 + XXX: For now this only does clustered swap space allocation. 458 + 459 + For selection by architectures with reasonable THP sizes. 422 460 423 461 config TRANSPARENT_HUGE_PAGECACHE 424 462 def_bool y
+2 -3
mm/compaction.c
··· 236 236 237 237 cond_resched(); 238 238 239 - if (!pfn_valid(pfn)) 239 + page = pfn_to_online_page(pfn); 240 + if (!page) 240 241 continue; 241 - 242 - page = pfn_to_page(pfn); 243 242 if (zone != page_zone(page)) 244 243 continue; 245 244
+1 -1
mm/filemap.c
··· 2265 2265 /* No page in the page cache at all */ 2266 2266 do_sync_mmap_readahead(vmf->vma, ra, file, offset); 2267 2267 count_vm_event(PGMAJFAULT); 2268 - mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT); 2268 + count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); 2269 2269 ret = VM_FAULT_MAJOR; 2270 2270 retry_find: 2271 2271 page = find_get_page(mapping, offset);
+137 -66
mm/gup.c
··· 208 208 return no_page_table(vma, flags); 209 209 } 210 210 211 - /** 212 - * follow_page_mask - look up a page descriptor from a user-virtual address 213 - * @vma: vm_area_struct mapping @address 214 - * @address: virtual address to look up 215 - * @flags: flags modifying lookup behaviour 216 - * @page_mask: on output, *page_mask is set according to the size of the page 217 - * 218 - * @flags can have FOLL_ flags set, defined in <linux/mm.h> 219 - * 220 - * Returns the mapped (struct page *), %NULL if no mapping exists, or 221 - * an error pointer if there is a mapping to something not represented 222 - * by a page descriptor (see also vm_normal_page()). 223 - */ 224 - struct page *follow_page_mask(struct vm_area_struct *vma, 225 - unsigned long address, unsigned int flags, 226 - unsigned int *page_mask) 211 + static struct page *follow_pmd_mask(struct vm_area_struct *vma, 212 + unsigned long address, pud_t *pudp, 213 + unsigned int flags, unsigned int *page_mask) 227 214 { 228 - pgd_t *pgd; 229 - p4d_t *p4d; 230 - pud_t *pud; 231 215 pmd_t *pmd; 232 216 spinlock_t *ptl; 233 217 struct page *page; 234 218 struct mm_struct *mm = vma->vm_mm; 235 219 236 - *page_mask = 0; 237 - 238 - page = follow_huge_addr(mm, address, flags & FOLL_WRITE); 239 - if (!IS_ERR(page)) { 240 - BUG_ON(flags & FOLL_GET); 241 - return page; 242 - } 243 - 244 - pgd = pgd_offset(mm, address); 245 - if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) 246 - return no_page_table(vma, flags); 247 - p4d = p4d_offset(pgd, address); 248 - if (p4d_none(*p4d)) 249 - return no_page_table(vma, flags); 250 - BUILD_BUG_ON(p4d_huge(*p4d)); 251 - if (unlikely(p4d_bad(*p4d))) 252 - return no_page_table(vma, flags); 253 - pud = pud_offset(p4d, address); 254 - if (pud_none(*pud)) 255 - return no_page_table(vma, flags); 256 - if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { 257 - page = follow_huge_pud(mm, address, pud, flags); 258 - if (page) 259 - return page; 260 - return no_page_table(vma, flags); 261 - } 
262 - if (pud_devmap(*pud)) { 263 - ptl = pud_lock(mm, pud); 264 - page = follow_devmap_pud(vma, address, pud, flags); 265 - spin_unlock(ptl); 266 - if (page) 267 - return page; 268 - } 269 - if (unlikely(pud_bad(*pud))) 270 - return no_page_table(vma, flags); 271 - 272 - pmd = pmd_offset(pud, address); 220 + pmd = pmd_offset(pudp, address); 273 221 if (pmd_none(*pmd)) 274 222 return no_page_table(vma, flags); 275 223 if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) { 276 224 page = follow_huge_pmd(mm, address, pmd, flags); 225 + if (page) 226 + return page; 227 + return no_page_table(vma, flags); 228 + } 229 + if (is_hugepd(__hugepd(pmd_val(*pmd)))) { 230 + page = follow_huge_pd(vma, address, 231 + __hugepd(pmd_val(*pmd)), flags, 232 + PMD_SHIFT); 277 233 if (page) 278 234 return page; 279 235 return no_page_table(vma, flags); ··· 275 319 return ret ? ERR_PTR(ret) : 276 320 follow_page_pte(vma, address, pmd, flags); 277 321 } 278 - 279 322 page = follow_trans_huge_pmd(vma, address, pmd, flags); 280 323 spin_unlock(ptl); 281 324 *page_mask = HPAGE_PMD_NR - 1; 282 325 return page; 326 + } 327 + 328 + 329 + static struct page *follow_pud_mask(struct vm_area_struct *vma, 330 + unsigned long address, p4d_t *p4dp, 331 + unsigned int flags, unsigned int *page_mask) 332 + { 333 + pud_t *pud; 334 + spinlock_t *ptl; 335 + struct page *page; 336 + struct mm_struct *mm = vma->vm_mm; 337 + 338 + pud = pud_offset(p4dp, address); 339 + if (pud_none(*pud)) 340 + return no_page_table(vma, flags); 341 + if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { 342 + page = follow_huge_pud(mm, address, pud, flags); 343 + if (page) 344 + return page; 345 + return no_page_table(vma, flags); 346 + } 347 + if (is_hugepd(__hugepd(pud_val(*pud)))) { 348 + page = follow_huge_pd(vma, address, 349 + __hugepd(pud_val(*pud)), flags, 350 + PUD_SHIFT); 351 + if (page) 352 + return page; 353 + return no_page_table(vma, flags); 354 + } 355 + if (pud_devmap(*pud)) { 356 + ptl = pud_lock(mm, pud); 357 + 
page = follow_devmap_pud(vma, address, pud, flags); 358 + spin_unlock(ptl); 359 + if (page) 360 + return page; 361 + } 362 + if (unlikely(pud_bad(*pud))) 363 + return no_page_table(vma, flags); 364 + 365 + return follow_pmd_mask(vma, address, pud, flags, page_mask); 366 + } 367 + 368 + 369 + static struct page *follow_p4d_mask(struct vm_area_struct *vma, 370 + unsigned long address, pgd_t *pgdp, 371 + unsigned int flags, unsigned int *page_mask) 372 + { 373 + p4d_t *p4d; 374 + struct page *page; 375 + 376 + p4d = p4d_offset(pgdp, address); 377 + if (p4d_none(*p4d)) 378 + return no_page_table(vma, flags); 379 + BUILD_BUG_ON(p4d_huge(*p4d)); 380 + if (unlikely(p4d_bad(*p4d))) 381 + return no_page_table(vma, flags); 382 + 383 + if (is_hugepd(__hugepd(p4d_val(*p4d)))) { 384 + page = follow_huge_pd(vma, address, 385 + __hugepd(p4d_val(*p4d)), flags, 386 + P4D_SHIFT); 387 + if (page) 388 + return page; 389 + return no_page_table(vma, flags); 390 + } 391 + return follow_pud_mask(vma, address, p4d, flags, page_mask); 392 + } 393 + 394 + /** 395 + * follow_page_mask - look up a page descriptor from a user-virtual address 396 + * @vma: vm_area_struct mapping @address 397 + * @address: virtual address to look up 398 + * @flags: flags modifying lookup behaviour 399 + * @page_mask: on output, *page_mask is set according to the size of the page 400 + * 401 + * @flags can have FOLL_ flags set, defined in <linux/mm.h> 402 + * 403 + * Returns the mapped (struct page *), %NULL if no mapping exists, or 404 + * an error pointer if there is a mapping to something not represented 405 + * by a page descriptor (see also vm_normal_page()). 
406 + */ 407 + struct page *follow_page_mask(struct vm_area_struct *vma, 408 + unsigned long address, unsigned int flags, 409 + unsigned int *page_mask) 410 + { 411 + pgd_t *pgd; 412 + struct page *page; 413 + struct mm_struct *mm = vma->vm_mm; 414 + 415 + *page_mask = 0; 416 + 417 + /* make this handle hugepd */ 418 + page = follow_huge_addr(mm, address, flags & FOLL_WRITE); 419 + if (!IS_ERR(page)) { 420 + BUG_ON(flags & FOLL_GET); 421 + return page; 422 + } 423 + 424 + pgd = pgd_offset(mm, address); 425 + 426 + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) 427 + return no_page_table(vma, flags); 428 + 429 + if (pgd_huge(*pgd)) { 430 + page = follow_huge_pgd(mm, address, pgd, flags); 431 + if (page) 432 + return page; 433 + return no_page_table(vma, flags); 434 + } 435 + if (is_hugepd(__hugepd(pgd_val(*pgd)))) { 436 + page = follow_huge_pd(vma, address, 437 + __hugepd(pgd_val(*pgd)), flags, 438 + PGDIR_SHIFT); 439 + if (page) 440 + return page; 441 + return no_page_table(vma, flags); 442 + } 443 + 444 + return follow_p4d_mask(vma, address, pgd, flags, page_mask); 283 445 } 284 446 285 447 static int get_gate_page(struct mm_struct *mm, unsigned long address, ··· 1423 1349 return __gup_device_huge_pmd(orig, addr, end, pages, nr); 1424 1350 1425 1351 refs = 0; 1426 - head = pmd_page(orig); 1427 - page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); 1352 + page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); 1428 1353 do { 1429 - VM_BUG_ON_PAGE(compound_head(page) != head, page); 1430 1354 pages[*nr] = page; 1431 1355 (*nr)++; 1432 1356 page++; 1433 1357 refs++; 1434 1358 } while (addr += PAGE_SIZE, addr != end); 1435 1359 1360 + head = compound_head(pmd_page(orig)); 1436 1361 if (!page_cache_add_speculative(head, refs)) { 1437 1362 *nr -= refs; 1438 1363 return 0; ··· 1461 1388 return __gup_device_huge_pud(orig, addr, end, pages, nr); 1462 1389 1463 1390 refs = 0; 1464 - head = pud_page(orig); 1465 - page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT); 1391 + page 
= pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); 1466 1392 do { 1467 - VM_BUG_ON_PAGE(compound_head(page) != head, page); 1468 1393 pages[*nr] = page; 1469 1394 (*nr)++; 1470 1395 page++; 1471 1396 refs++; 1472 1397 } while (addr += PAGE_SIZE, addr != end); 1473 1398 1399 + head = compound_head(pud_page(orig)); 1474 1400 if (!page_cache_add_speculative(head, refs)) { 1475 1401 *nr -= refs; 1476 1402 return 0; ··· 1498 1426 1499 1427 BUILD_BUG_ON(pgd_devmap(orig)); 1500 1428 refs = 0; 1501 - head = pgd_page(orig); 1502 - page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT); 1429 + page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT); 1503 1430 do { 1504 - VM_BUG_ON_PAGE(compound_head(page) != head, page); 1505 1431 pages[*nr] = page; 1506 1432 (*nr)++; 1507 1433 page++; 1508 1434 refs++; 1509 1435 } while (addr += PAGE_SIZE, addr != end); 1510 1436 1437 + head = compound_head(pgd_page(orig)); 1511 1438 if (!page_cache_add_speculative(head, refs)) { 1512 1439 *nr -= refs; 1513 1440 return 0;
+24 -7
mm/huge_memory.c
··· 1575 1575 get_page(page); 1576 1576 spin_unlock(ptl); 1577 1577 split_huge_page(page); 1578 - put_page(page); 1579 1578 unlock_page(page); 1579 + put_page(page); 1580 1580 goto out_unlocked; 1581 1581 } 1582 1582 ··· 2203 2203 * atomic_set() here would be safe on all archs (and not only on x86), 2204 2204 * it's safer to use atomic_inc()/atomic_add(). 2205 2205 */ 2206 - if (PageAnon(head)) { 2206 + if (PageAnon(head) && !PageSwapCache(head)) { 2207 2207 page_ref_inc(page_tail); 2208 2208 } else { 2209 2209 /* Additional pin to radix tree */ ··· 2214 2214 page_tail->flags |= (head->flags & 2215 2215 ((1L << PG_referenced) | 2216 2216 (1L << PG_swapbacked) | 2217 + (1L << PG_swapcache) | 2217 2218 (1L << PG_mlocked) | 2218 2219 (1L << PG_uptodate) | 2219 2220 (1L << PG_active) | ··· 2277 2276 ClearPageCompound(head); 2278 2277 /* See comment in __split_huge_page_tail() */ 2279 2278 if (PageAnon(head)) { 2280 - page_ref_inc(head); 2279 + /* Additional pin to radix tree of swap cache */ 2280 + if (PageSwapCache(head)) 2281 + page_ref_add(head, 2); 2282 + else 2283 + page_ref_inc(head); 2281 2284 } else { 2282 2285 /* Additional pin to radix tree */ 2283 2286 page_ref_add(head, 2); ··· 2390 2385 return ret; 2391 2386 } 2392 2387 2388 + /* Racy check whether the huge page can be split */ 2389 + bool can_split_huge_page(struct page *page, int *pextra_pins) 2390 + { 2391 + int extra_pins; 2392 + 2393 + /* Additional pins from radix tree */ 2394 + if (PageAnon(page)) 2395 + extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0; 2396 + else 2397 + extra_pins = HPAGE_PMD_NR; 2398 + if (pextra_pins) 2399 + *pextra_pins = extra_pins; 2400 + return total_mapcount(page) == page_count(page) - extra_pins - 1; 2401 + } 2402 + 2393 2403 /* 2394 2404 * This function splits huge page into normal pages. @page can point to any 2395 2405 * subpage of huge page to split. Split doesn't change the position of @page. 
··· 2452 2432 ret = -EBUSY; 2453 2433 goto out; 2454 2434 } 2455 - extra_pins = 0; 2456 2435 mapping = NULL; 2457 2436 anon_vma_lock_write(anon_vma); 2458 2437 } else { ··· 2463 2444 goto out; 2464 2445 } 2465 2446 2466 - /* Addidional pins from radix tree */ 2467 - extra_pins = HPAGE_PMD_NR; 2468 2447 anon_vma = NULL; 2469 2448 i_mmap_lock_read(mapping); 2470 2449 } ··· 2471 2454 * Racy check if we can split the page, before freeze_page() will 2472 2455 * split PMDs 2473 2456 */ 2474 - if (total_mapcount(head) != page_count(head) - extra_pins - 1) { 2457 + if (!can_split_huge_page(head, &extra_pins)) { 2475 2458 ret = -EBUSY; 2476 2459 goto out_unlock; 2477 2460 }
+69 -29
mm/hugetlb.c
··· 867 867 h->free_huge_pages_node[nid]++; 868 868 } 869 869 870 - static struct page *dequeue_huge_page_node(struct hstate *h, int nid) 870 + static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) 871 871 { 872 872 struct page *page; 873 873 ··· 887 887 return page; 888 888 } 889 889 890 + static struct page *dequeue_huge_page_node(struct hstate *h, int nid) 891 + { 892 + struct page *page; 893 + int node; 894 + 895 + if (nid != NUMA_NO_NODE) 896 + return dequeue_huge_page_node_exact(h, nid); 897 + 898 + for_each_online_node(node) { 899 + page = dequeue_huge_page_node_exact(h, node); 900 + if (page) 901 + return page; 902 + } 903 + return NULL; 904 + } 905 + 890 906 /* Movability of hugepages depends on migration support. */ 891 907 static inline gfp_t htlb_alloc_mask(struct hstate *h) 892 908 { ··· 920 904 struct page *page = NULL; 921 905 struct mempolicy *mpol; 922 906 nodemask_t *nodemask; 907 + gfp_t gfp_mask; 908 + int nid; 923 909 struct zonelist *zonelist; 924 910 struct zone *zone; 925 911 struct zoneref *z; ··· 942 924 943 925 retry_cpuset: 944 926 cpuset_mems_cookie = read_mems_allowed_begin(); 945 - zonelist = huge_zonelist(vma, address, 946 - htlb_alloc_mask(h), &mpol, &nodemask); 927 + gfp_mask = htlb_alloc_mask(h); 928 + nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask); 929 + zonelist = node_zonelist(nid, gfp_mask); 947 930 948 931 for_each_zone_zonelist_nodemask(zone, z, zonelist, 949 932 MAX_NR_ZONES - 1, nodemask) { 950 - if (cpuset_zone_allowed(zone, htlb_alloc_mask(h))) { 933 + if (cpuset_zone_allowed(zone, gfp_mask)) { 951 934 page = dequeue_huge_page_node(h, zone_to_nid(zone)); 952 935 if (page) { 953 936 if (avoid_reserve) ··· 1043 1024 ((node = hstate_next_node_to_free(hs, mask)) || 1); \ 1044 1025 nr_nodes--) 1045 1026 1046 - #if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && \ 1047 - ((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \ 1048 - defined(CONFIG_CMA)) 1027 + #ifdef 
CONFIG_ARCH_HAS_GIGANTIC_PAGE 1049 1028 static void destroy_compound_gigantic_page(struct page *page, 1050 1029 unsigned int order) 1051 1030 { ··· 1175 1158 return 0; 1176 1159 } 1177 1160 1178 - static inline bool gigantic_page_supported(void) { return true; } 1179 - #else 1161 + #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */ 1180 1162 static inline bool gigantic_page_supported(void) { return false; } 1181 1163 static inline void free_gigantic_page(struct page *page, unsigned int order) { } 1182 1164 static inline void destroy_compound_gigantic_page(struct page *page, ··· 1561 1545 do { 1562 1546 struct page *page; 1563 1547 struct mempolicy *mpol; 1564 - struct zonelist *zl; 1548 + int nid; 1565 1549 nodemask_t *nodemask; 1566 1550 1567 1551 cpuset_mems_cookie = read_mems_allowed_begin(); 1568 - zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask); 1552 + nid = huge_node(vma, addr, gfp, &mpol, &nodemask); 1569 1553 mpol_cond_put(mpol); 1570 - page = __alloc_pages_nodemask(gfp, order, zl, nodemask); 1554 + page = __alloc_pages_nodemask(gfp, order, nid, nodemask); 1571 1555 if (page) 1572 1556 return page; 1573 1557 } while (read_mems_allowed_retry(cpuset_mems_cookie)); ··· 3201 3185 update_mmu_cache(vma, address, ptep); 3202 3186 } 3203 3187 3204 - static int is_hugetlb_entry_migration(pte_t pte) 3188 + bool is_hugetlb_entry_migration(pte_t pte) 3205 3189 { 3206 3190 swp_entry_t swp; 3207 3191 3208 3192 if (huge_pte_none(pte) || pte_present(pte)) 3209 - return 0; 3193 + return false; 3210 3194 swp = pte_to_swp_entry(pte); 3211 3195 if (non_swap_entry(swp) && is_migration_entry(swp)) 3212 - return 1; 3196 + return true; 3213 3197 else 3214 - return 0; 3198 + return false; 3215 3199 } 3216 3200 3217 3201 static int is_hugetlb_entry_hwpoisoned(pte_t pte) ··· 3249 3233 3250 3234 for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { 3251 3235 spinlock_t *src_ptl, *dst_ptl; 3252 - src_pte = huge_pte_offset(src, addr); 3236 + src_pte = huge_pte_offset(src, addr, sz); 
3253 3237 if (!src_pte) 3254 3238 continue; 3255 3239 dst_pte = huge_pte_alloc(dst, addr, sz); ··· 3279 3263 */ 3280 3264 make_migration_entry_read(&swp_entry); 3281 3265 entry = swp_entry_to_pte(swp_entry); 3282 - set_huge_pte_at(src, addr, src_pte, entry); 3266 + set_huge_swap_pte_at(src, addr, src_pte, 3267 + entry, sz); 3283 3268 } 3284 - set_huge_pte_at(dst, addr, dst_pte, entry); 3269 + set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); 3285 3270 } else { 3286 3271 if (cow) { 3287 3272 huge_ptep_set_wrprotect(src, addr, src_pte); ··· 3334 3317 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 3335 3318 address = start; 3336 3319 for (; address < end; address += sz) { 3337 - ptep = huge_pte_offset(mm, address); 3320 + ptep = huge_pte_offset(mm, address, sz); 3338 3321 if (!ptep) 3339 3322 continue; 3340 3323 ··· 3355 3338 * unmapped and its refcount is dropped, so just clear pte here. 3356 3339 */ 3357 3340 if (unlikely(!pte_present(pte))) { 3358 - huge_pte_clear(mm, address, ptep); 3341 + huge_pte_clear(mm, address, ptep, sz); 3359 3342 spin_unlock(ptl); 3360 3343 continue; 3361 3344 } ··· 3552 3535 unmap_ref_private(mm, vma, old_page, address); 3553 3536 BUG_ON(huge_pte_none(pte)); 3554 3537 spin_lock(ptl); 3555 - ptep = huge_pte_offset(mm, address & huge_page_mask(h)); 3538 + ptep = huge_pte_offset(mm, address & huge_page_mask(h), 3539 + huge_page_size(h)); 3556 3540 if (likely(ptep && 3557 3541 pte_same(huge_ptep_get(ptep), pte))) 3558 3542 goto retry_avoidcopy; ··· 3592 3574 * before the page tables are altered 3593 3575 */ 3594 3576 spin_lock(ptl); 3595 - ptep = huge_pte_offset(mm, address & huge_page_mask(h)); 3577 + ptep = huge_pte_offset(mm, address & huge_page_mask(h), 3578 + huge_page_size(h)); 3596 3579 if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) { 3597 3580 ClearPagePrivate(new_page); 3598 3581 ··· 3880 3861 3881 3862 address &= huge_page_mask(h); 3882 3863 3883 - ptep = huge_pte_offset(mm, address); 3864 + ptep = 
huge_pte_offset(mm, address, huge_page_size(h)); 3884 3865 if (ptep) { 3885 3866 entry = huge_ptep_get(ptep); 3886 3867 if (unlikely(is_hugetlb_entry_migration(entry))) { ··· 4137 4118 * 4138 4119 * Note that page table lock is not held when pte is null. 4139 4120 */ 4140 - pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); 4121 + pte = huge_pte_offset(mm, vaddr & huge_page_mask(h), 4122 + huge_page_size(h)); 4141 4123 if (pte) 4142 4124 ptl = huge_pte_lock(h, mm, pte); 4143 4125 absent = !pte || huge_pte_none(huge_ptep_get(pte)); ··· 4277 4257 i_mmap_lock_write(vma->vm_file->f_mapping); 4278 4258 for (; address < end; address += huge_page_size(h)) { 4279 4259 spinlock_t *ptl; 4280 - ptep = huge_pte_offset(mm, address); 4260 + ptep = huge_pte_offset(mm, address, huge_page_size(h)); 4281 4261 if (!ptep) 4282 4262 continue; 4283 4263 ptl = huge_pte_lock(h, mm, ptep); ··· 4299 4279 4300 4280 make_migration_entry_read(&entry); 4301 4281 newpte = swp_entry_to_pte(entry); 4302 - set_huge_pte_at(mm, address, ptep, newpte); 4282 + set_huge_swap_pte_at(mm, address, ptep, 4283 + newpte, huge_page_size(h)); 4303 4284 pages++; 4304 4285 } 4305 4286 spin_unlock(ptl); ··· 4542 4521 4543 4522 saddr = page_table_shareable(svma, vma, addr, idx); 4544 4523 if (saddr) { 4545 - spte = huge_pte_offset(svma->vm_mm, saddr); 4524 + spte = huge_pte_offset(svma->vm_mm, saddr, 4525 + vma_mmu_pagesize(svma)); 4546 4526 if (spte) { 4547 4527 get_page(virt_to_page(spte)); 4548 4528 break; ··· 4639 4617 return pte; 4640 4618 } 4641 4619 4642 - pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) 4620 + pte_t *huge_pte_offset(struct mm_struct *mm, 4621 + unsigned long addr, unsigned long sz) 4643 4622 { 4644 4623 pgd_t *pgd; 4645 4624 p4d_t *p4d; ··· 4673 4650 int write) 4674 4651 { 4675 4652 return ERR_PTR(-EINVAL); 4653 + } 4654 + 4655 + struct page * __weak 4656 + follow_huge_pd(struct vm_area_struct *vma, 4657 + unsigned long address, hugepd_t hpd, int flags, int pdshift) 4658 
+ { 4659 + WARN(1, "hugepd follow called with no support for hugepage directory format\n"); 4660 + return NULL; 4676 4661 } 4677 4662 4678 4663 struct page * __weak ··· 4728 4697 return NULL; 4729 4698 4730 4699 return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT); 4700 + } 4701 + 4702 + struct page * __weak 4703 + follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags) 4704 + { 4705 + if (flags & FOLL_GET) 4706 + return NULL; 4707 + 4708 + return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT); 4731 4709 } 4732 4710 4733 4711 #ifdef CONFIG_MEMORY_FAILURE
+114 -22
mm/kmemleak.c
··· 150 150 */ 151 151 struct kmemleak_object { 152 152 spinlock_t lock; 153 - unsigned long flags; /* object status flags */ 153 + unsigned int flags; /* object status flags */ 154 154 struct list_head object_list; 155 155 struct list_head gray_list; 156 156 struct rb_node rb_node; ··· 159 159 atomic_t use_count; 160 160 unsigned long pointer; 161 161 size_t size; 162 + /* pass surplus references to this pointer */ 163 + unsigned long excess_ref; 162 164 /* minimum number of a pointers found before it is considered leak */ 163 165 int min_count; 164 166 /* the total number of pointers found pointing to this object */ ··· 255 253 KMEMLEAK_NOT_LEAK, 256 254 KMEMLEAK_IGNORE, 257 255 KMEMLEAK_SCAN_AREA, 258 - KMEMLEAK_NO_SCAN 256 + KMEMLEAK_NO_SCAN, 257 + KMEMLEAK_SET_EXCESS_REF 259 258 }; 260 259 261 260 /* ··· 265 262 */ 266 263 struct early_log { 267 264 int op_type; /* kmemleak operation type */ 268 - const void *ptr; /* allocated/freed memory block */ 269 - size_t size; /* memory block size */ 270 265 int min_count; /* minimum reference count */ 266 + const void *ptr; /* allocated/freed memory block */ 267 + union { 268 + size_t size; /* memory block size */ 269 + unsigned long excess_ref; /* surplus reference passing */ 270 + }; 271 271 unsigned long trace[MAX_TRACE]; /* stack trace */ 272 272 unsigned int trace_len; /* stack trace length */ 273 273 }; ··· 399 393 object->comm, object->pid, object->jiffies); 400 394 pr_notice(" min_count = %d\n", object->min_count); 401 395 pr_notice(" count = %d\n", object->count); 402 - pr_notice(" flags = 0x%lx\n", object->flags); 396 + pr_notice(" flags = 0x%x\n", object->flags); 403 397 pr_notice(" checksum = %u\n", object->checksum); 404 398 pr_notice(" backtrace:\n"); 405 399 print_stack_trace(&trace, 4); ··· 568 562 object->flags = OBJECT_ALLOCATED; 569 563 object->pointer = ptr; 570 564 object->size = size; 565 + object->excess_ref = 0; 571 566 object->min_count = min_count; 572 567 object->count = 0; /* white color 
initially */ 573 568 object->jiffies = jiffies; ··· 802 795 } 803 796 804 797 /* 798 + * Any surplus references (object already gray) to 'ptr' are passed to 799 + * 'excess_ref'. This is used in the vmalloc() case where a pointer to 800 + * vm_struct may be used as an alternative reference to the vmalloc'ed object 801 + * (see free_thread_stack()). 802 + */ 803 + static void object_set_excess_ref(unsigned long ptr, unsigned long excess_ref) 804 + { 805 + unsigned long flags; 806 + struct kmemleak_object *object; 807 + 808 + object = find_and_get_object(ptr, 0); 809 + if (!object) { 810 + kmemleak_warn("Setting excess_ref on unknown object at 0x%08lx\n", 811 + ptr); 812 + return; 813 + } 814 + 815 + spin_lock_irqsave(&object->lock, flags); 816 + object->excess_ref = excess_ref; 817 + spin_unlock_irqrestore(&object->lock, flags); 818 + put_object(object); 819 + } 820 + 821 + /* 805 822 * Set the OBJECT_NO_SCAN flag for the object corresponding to the give 806 823 * pointer. Such object will not be scanned by kmemleak but references to it 807 824 * are searched. ··· 939 908 * @gfp: kmalloc() flags used for kmemleak internal memory allocations 940 909 * 941 910 * This function is called from the kernel allocators when a new object 942 - * (memory block) is allocated (kmem_cache_alloc, kmalloc, vmalloc etc.). 911 + * (memory block) is allocated (kmem_cache_alloc, kmalloc etc.). 943 912 */ 944 913 void __ref kmemleak_alloc(const void *ptr, size_t size, int min_count, 945 914 gfp_t gfp) ··· 981 950 log_early(KMEMLEAK_ALLOC_PERCPU, ptr, size, 0); 982 951 } 983 952 EXPORT_SYMBOL_GPL(kmemleak_alloc_percpu); 953 + 954 + /** 955 + * kmemleak_vmalloc - register a newly vmalloc'ed object 956 + * @area: pointer to vm_struct 957 + * @size: size of the object 958 + * @gfp: __vmalloc() flags used for kmemleak internal memory allocations 959 + * 960 + * This function is called from the vmalloc() kernel allocator when a new 961 + * object (memory block) is allocated. 
962 + */ 963 + void __ref kmemleak_vmalloc(const struct vm_struct *area, size_t size, gfp_t gfp) 964 + { 965 + pr_debug("%s(0x%p, %zu)\n", __func__, area, size); 966 + 967 + /* 968 + * A min_count = 2 is needed because vm_struct contains a reference to 969 + * the virtual address of the vmalloc'ed block. 970 + */ 971 + if (kmemleak_enabled) { 972 + create_object((unsigned long)area->addr, size, 2, gfp); 973 + object_set_excess_ref((unsigned long)area, 974 + (unsigned long)area->addr); 975 + } else if (kmemleak_early_log) { 976 + log_early(KMEMLEAK_ALLOC, area->addr, size, 2); 977 + /* reusing early_log.size for storing area->addr */ 978 + log_early(KMEMLEAK_SET_EXCESS_REF, 979 + area, (unsigned long)area->addr, 0); 980 + } 981 + } 982 + EXPORT_SYMBOL_GPL(kmemleak_vmalloc); 984 983 985 984 /** 986 985 * kmemleak_free - unregister a previously registered object ··· 1249 1188 } 1250 1189 1251 1190 /* 1191 + * Update an object's references. object->lock must be held by the caller. 1192 + */ 1193 + static void update_refs(struct kmemleak_object *object) 1194 + { 1195 + if (!color_white(object)) { 1196 + /* non-orphan, ignored or new */ 1197 + return; 1198 + } 1199 + 1200 + /* 1201 + * Increase the object's reference count (number of pointers to the 1202 + * memory block). If this count reaches the required minimum, the 1203 + * object's color will become gray and it will be added to the 1204 + * gray_list. 1205 + */ 1206 + object->count++; 1207 + if (color_gray(object)) { 1208 + /* put_object() called when removing from gray_list */ 1209 + WARN_ON(!get_object(object)); 1210 + list_add_tail(&object->gray_list, &gray_list); 1211 + } 1212 + } 1213 + 1214 + /* 1252 1215 * Memory scanning is a long process and it needs to be interruptable. This 1253 1216 * function checks whether such interrupt condition occurred. 
1254 1217 */ ··· 1309 1224 for (ptr = start; ptr < end; ptr++) { 1310 1225 struct kmemleak_object *object; 1311 1226 unsigned long pointer; 1227 + unsigned long excess_ref; 1312 1228 1313 1229 if (scan_should_stop()) 1314 1230 break; ··· 1345 1259 * enclosed by scan_mutex. 1346 1260 */ 1347 1261 spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); 1348 - if (!color_white(object)) { 1349 - /* non-orphan, ignored or new */ 1350 - spin_unlock(&object->lock); 1351 - continue; 1352 - } 1353 - 1354 - /* 1355 - * Increase the object's reference count (number of pointers 1356 - * to the memory block). If this count reaches the required 1357 - * minimum, the object's color will become gray and it will be 1358 - * added to the gray_list. 1359 - */ 1360 - object->count++; 1262 + /* only pass surplus references (object already gray) */ 1361 1263 if (color_gray(object)) { 1362 - /* put_object() called when removing from gray_list */ 1363 - WARN_ON(!get_object(object)); 1364 - list_add_tail(&object->gray_list, &gray_list); 1264 + excess_ref = object->excess_ref; 1265 + /* no need for update_refs() if object already gray */ 1266 + } else { 1267 + excess_ref = 0; 1268 + update_refs(object); 1365 1269 } 1366 1270 spin_unlock(&object->lock); 1271 + 1272 + if (excess_ref) { 1273 + object = lookup_object(excess_ref, 0); 1274 + if (!object) 1275 + continue; 1276 + if (object == scanned) 1277 + /* circular reference, ignore */ 1278 + continue; 1279 + spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); 1280 + update_refs(object); 1281 + spin_unlock(&object->lock); 1282 + } 1367 1283 } 1368 1284 read_unlock_irqrestore(&kmemleak_lock, flags); 1369 1285 } ··· 2067 1979 break; 2068 1980 case KMEMLEAK_NO_SCAN: 2069 1981 kmemleak_no_scan(log->ptr); 1982 + break; 1983 + case KMEMLEAK_SET_EXCESS_REF: 1984 + object_set_excess_ref((unsigned long)log->ptr, 1985 + log->excess_ref); 2070 1986 break; 2071 1987 default: 2072 1988 kmemleak_warn("Unknown early log operation: %d\n",
+754 -66
mm/ksm.c
··· 128 128 * struct stable_node - node of the stable rbtree 129 129 * @node: rb node of this ksm page in the stable tree 130 130 * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list 131 + * @hlist_dup: linked into the stable_node->hlist with a stable_node chain 131 132 * @list: linked into migrate_nodes, pending placement in the proper node tree 132 133 * @hlist: hlist head of rmap_items using this ksm page 133 134 * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid) 135 + * @chain_prune_time: time of the last full garbage collection 136 + * @rmap_hlist_len: number of rmap_item entries in hlist or STABLE_NODE_CHAIN 134 137 * @nid: NUMA node id of stable tree in which linked (may not match kpfn) 135 138 */ 136 139 struct stable_node { ··· 141 138 struct rb_node node; /* when node of stable tree */ 142 139 struct { /* when listed for migration */ 143 140 struct list_head *head; 144 - struct list_head list; 141 + struct { 142 + struct hlist_node hlist_dup; 143 + struct list_head list; 144 + }; 145 145 }; 146 146 }; 147 147 struct hlist_head hlist; 148 - unsigned long kpfn; 148 + union { 149 + unsigned long kpfn; 150 + unsigned long chain_prune_time; 151 + }; 152 + /* 153 + * STABLE_NODE_CHAIN can be any negative number in 154 + * rmap_hlist_len negative range, but better not -1 to be able 155 + * to reliably detect underflows. 
156 + */ 157 + #define STABLE_NODE_CHAIN -1024 158 + int rmap_hlist_len; 149 159 #ifdef CONFIG_NUMA 150 160 int nid; 151 161 #endif ··· 208 192 209 193 /* Recently migrated nodes of stable tree, pending proper placement */ 210 194 static LIST_HEAD(migrate_nodes); 195 + #define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev) 211 196 212 197 #define MM_SLOTS_HASH_BITS 10 213 198 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS); ··· 235 218 236 219 /* The number of rmap_items in use: to calculate pages_volatile */ 237 220 static unsigned long ksm_rmap_items; 221 + 222 + /* The number of stable_node chains */ 223 + static unsigned long ksm_stable_node_chains; 224 + 225 + /* The number of stable_node dups linked to the stable_node chains */ 226 + static unsigned long ksm_stable_node_dups; 227 + 228 + /* Delay in pruning stale stable_node_dups in the stable_node_chains */ 229 + static int ksm_stable_node_chains_prune_millisecs = 2000; 230 + 231 + /* Maximum number of page slots sharing a stable node */ 232 + static int ksm_max_page_sharing = 256; 238 233 239 234 /* Number of pages ksmd should scan in one batch */ 240 235 static unsigned int ksm_thread_pages_to_scan = 100; ··· 316 287 mm_slot_cache = NULL; 317 288 } 318 289 290 + static __always_inline bool is_stable_node_chain(struct stable_node *chain) 291 + { 292 + return chain->rmap_hlist_len == STABLE_NODE_CHAIN; 293 + } 294 + 295 + static __always_inline bool is_stable_node_dup(struct stable_node *dup) 296 + { 297 + return dup->head == STABLE_NODE_DUP_HEAD; 298 + } 299 + 300 + static inline void stable_node_chain_add_dup(struct stable_node *dup, 301 + struct stable_node *chain) 302 + { 303 + VM_BUG_ON(is_stable_node_dup(dup)); 304 + dup->head = STABLE_NODE_DUP_HEAD; 305 + VM_BUG_ON(!is_stable_node_chain(chain)); 306 + hlist_add_head(&dup->hlist_dup, &chain->hlist); 307 + ksm_stable_node_dups++; 308 + } 309 + 310 + static inline void __stable_node_dup_del(struct stable_node *dup) 311 + { 312 
+ VM_BUG_ON(!is_stable_node_dup(dup)); 313 + hlist_del(&dup->hlist_dup); 314 + ksm_stable_node_dups--; 315 + } 316 + 317 + static inline void stable_node_dup_del(struct stable_node *dup) 318 + { 319 + VM_BUG_ON(is_stable_node_chain(dup)); 320 + if (is_stable_node_dup(dup)) 321 + __stable_node_dup_del(dup); 322 + else 323 + rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid)); 324 + #ifdef CONFIG_DEBUG_VM 325 + dup->head = NULL; 326 + #endif 327 + } 328 + 319 329 static inline struct rmap_item *alloc_rmap_item(void) 320 330 { 321 331 struct rmap_item *rmap_item; ··· 385 317 386 318 static inline void free_stable_node(struct stable_node *stable_node) 387 319 { 320 + VM_BUG_ON(stable_node->rmap_hlist_len && 321 + !is_stable_node_chain(stable_node)); 388 322 kmem_cache_free(stable_node_cache, stable_node); 389 323 } 390 324 ··· 568 498 return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn)); 569 499 } 570 500 501 + static struct stable_node *alloc_stable_node_chain(struct stable_node *dup, 502 + struct rb_root *root) 503 + { 504 + struct stable_node *chain = alloc_stable_node(); 505 + VM_BUG_ON(is_stable_node_chain(dup)); 506 + if (likely(chain)) { 507 + INIT_HLIST_HEAD(&chain->hlist); 508 + chain->chain_prune_time = jiffies; 509 + chain->rmap_hlist_len = STABLE_NODE_CHAIN; 510 + #if defined (CONFIG_DEBUG_VM) && defined(CONFIG_NUMA) 511 + chain->nid = -1; /* debug */ 512 + #endif 513 + ksm_stable_node_chains++; 514 + 515 + /* 516 + * Put the stable node chain in the first dimension of 517 + * the stable tree and at the same time remove the old 518 + * stable node. 519 + */ 520 + rb_replace_node(&dup->node, &chain->node, root); 521 + 522 + /* 523 + * Move the old stable node to the second dimension 524 + * queued in the hlist_dup. The invariant is that all 525 + * dup stable_nodes in the chain->hlist point to pages 526 + * that are wrprotected and have the exact same 527 + * content. 
528 + */ 529 + stable_node_chain_add_dup(dup, chain); 530 + } 531 + return chain; 532 + } 533 + 534 + static inline void free_stable_node_chain(struct stable_node *chain, 535 + struct rb_root *root) 536 + { 537 + rb_erase(&chain->node, root); 538 + free_stable_node(chain); 539 + ksm_stable_node_chains--; 540 + } 541 + 571 542 static void remove_node_from_stable_tree(struct stable_node *stable_node) 572 543 { 573 544 struct rmap_item *rmap_item; 545 + 546 + /* check it's not STABLE_NODE_CHAIN or negative */ 547 + BUG_ON(stable_node->rmap_hlist_len < 0); 574 548 575 549 hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { 576 550 if (rmap_item->hlist.next) 577 551 ksm_pages_sharing--; 578 552 else 579 553 ksm_pages_shared--; 554 + VM_BUG_ON(stable_node->rmap_hlist_len <= 0); 555 + stable_node->rmap_hlist_len--; 580 556 put_anon_vma(rmap_item->anon_vma); 581 557 rmap_item->address &= PAGE_MASK; 582 558 cond_resched(); 583 559 } 584 560 561 + /* 562 + * We need the second aligned pointer of the migrate_nodes 563 + * list_head to stay clear from the rb_parent_color union 564 + * (aligned and different than any node) and also different 565 + * from &migrate_nodes. This will verify that future list.h changes 566 + * don't break STABLE_NODE_DUP_HEAD. 
567 + */ 568 + #if GCC_VERSION >= 40903 /* only recent gcc can handle it */ 569 + BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes); 570 + BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1); 571 + #endif 572 + 585 573 if (stable_node->head == &migrate_nodes) 586 574 list_del(&stable_node->list); 587 575 else 588 - rb_erase(&stable_node->node, 589 - root_stable_tree + NUMA(stable_node->nid)); 576 + stable_node_dup_del(stable_node); 590 577 free_stable_node(stable_node); 591 578 } 592 579 ··· 762 635 ksm_pages_sharing--; 763 636 else 764 637 ksm_pages_shared--; 638 + VM_BUG_ON(stable_node->rmap_hlist_len <= 0); 639 + stable_node->rmap_hlist_len--; 765 640 766 641 put_anon_vma(rmap_item->anon_vma); 767 642 rmap_item->address &= PAGE_MASK; ··· 872 743 return err; 873 744 } 874 745 746 + static int remove_stable_node_chain(struct stable_node *stable_node, 747 + struct rb_root *root) 748 + { 749 + struct stable_node *dup; 750 + struct hlist_node *hlist_safe; 751 + 752 + if (!is_stable_node_chain(stable_node)) { 753 + VM_BUG_ON(is_stable_node_dup(stable_node)); 754 + if (remove_stable_node(stable_node)) 755 + return true; 756 + else 757 + return false; 758 + } 759 + 760 + hlist_for_each_entry_safe(dup, hlist_safe, 761 + &stable_node->hlist, hlist_dup) { 762 + VM_BUG_ON(!is_stable_node_dup(dup)); 763 + if (remove_stable_node(dup)) 764 + return true; 765 + } 766 + BUG_ON(!hlist_empty(&stable_node->hlist)); 767 + free_stable_node_chain(stable_node, root); 768 + return false; 769 + } 770 + 875 771 static int remove_all_stable_nodes(void) 876 772 { 877 773 struct stable_node *stable_node, *next; ··· 907 753 while (root_stable_tree[nid].rb_node) { 908 754 stable_node = rb_entry(root_stable_tree[nid].rb_node, 909 755 struct stable_node, node); 910 - if (remove_stable_node(stable_node)) { 756 + if (remove_stable_node_chain(stable_node, 757 + root_stable_tree + nid)) { 911 758 err = -EBUSY; 912 759 break; /* proceed to next nid */ 913 760 } ··· 1293 1138 return err ? 
NULL : page; 1294 1139 } 1295 1140 1141 + static __always_inline 1142 + bool __is_page_sharing_candidate(struct stable_node *stable_node, int offset) 1143 + { 1144 + VM_BUG_ON(stable_node->rmap_hlist_len < 0); 1145 + /* 1146 + * Check that at least one mapping still exists, otherwise 1147 + * there's not much point to merge and share with this 1148 + * stable_node, as the underlying tree_page of the other 1149 + * sharer is going to be freed soon. 1150 + */ 1151 + return stable_node->rmap_hlist_len && 1152 + stable_node->rmap_hlist_len + offset < ksm_max_page_sharing; 1153 + } 1154 + 1155 + static __always_inline 1156 + bool is_page_sharing_candidate(struct stable_node *stable_node) 1157 + { 1158 + return __is_page_sharing_candidate(stable_node, 0); 1159 + } 1160 + 1161 + struct page *stable_node_dup(struct stable_node **_stable_node_dup, 1162 + struct stable_node **_stable_node, 1163 + struct rb_root *root, 1164 + bool prune_stale_stable_nodes) 1165 + { 1166 + struct stable_node *dup, *found = NULL, *stable_node = *_stable_node; 1167 + struct hlist_node *hlist_safe; 1168 + struct page *_tree_page, *tree_page = NULL; 1169 + int nr = 0; 1170 + int found_rmap_hlist_len; 1171 + 1172 + if (!prune_stale_stable_nodes || 1173 + time_before(jiffies, stable_node->chain_prune_time + 1174 + msecs_to_jiffies( 1175 + ksm_stable_node_chains_prune_millisecs))) 1176 + prune_stale_stable_nodes = false; 1177 + else 1178 + stable_node->chain_prune_time = jiffies; 1179 + 1180 + hlist_for_each_entry_safe(dup, hlist_safe, 1181 + &stable_node->hlist, hlist_dup) { 1182 + cond_resched(); 1183 + /* 1184 + * We must walk all stable_node_dup to prune the stale 1185 + * stable nodes during lookup. 1186 + * 1187 + * get_ksm_page can drop the nodes from the 1188 + * stable_node->hlist if they point to freed pages 1189 + * (that's why we do a _safe walk). The "dup" 1190 + * stable_node parameter itself will be freed from 1191 + * under us if it returns NULL.
1192 + */ 1193 + _tree_page = get_ksm_page(dup, false); 1194 + if (!_tree_page) 1195 + continue; 1196 + nr += 1; 1197 + if (is_page_sharing_candidate(dup)) { 1198 + if (!found || 1199 + dup->rmap_hlist_len > found_rmap_hlist_len) { 1200 + if (found) 1201 + put_page(tree_page); 1202 + found = dup; 1203 + found_rmap_hlist_len = found->rmap_hlist_len; 1204 + tree_page = _tree_page; 1205 + 1206 + /* skip put_page for found dup */ 1207 + if (!prune_stale_stable_nodes) 1208 + break; 1209 + continue; 1210 + } 1211 + } 1212 + put_page(_tree_page); 1213 + } 1214 + 1215 + if (found) { 1216 + /* 1217 + * nr is counting all dups in the chain only if 1218 + * prune_stale_stable_nodes is true, otherwise we may 1219 + * break the loop at nr == 1 even if there are 1220 + * multiple entries. 1221 + */ 1222 + if (prune_stale_stable_nodes && nr == 1) { 1223 + /* 1224 + * If there's not just one entry it would 1225 + * corrupt memory, better BUG_ON. In KSM 1226 + * context with no lock held it's not even 1227 + * fatal. 1228 + */ 1229 + BUG_ON(stable_node->hlist.first->next); 1230 + 1231 + /* 1232 + * There's just one entry and it is below the 1233 + * deduplication limit so drop the chain. 1234 + */ 1235 + rb_replace_node(&stable_node->node, &found->node, 1236 + root); 1237 + free_stable_node(stable_node); 1238 + ksm_stable_node_chains--; 1239 + ksm_stable_node_dups--; 1240 + /* 1241 + * NOTE: the caller depends on the stable_node 1242 + * to be equal to stable_node_dup if the chain 1243 + * was collapsed. 1244 + */ 1245 + *_stable_node = found; 1246 + /* 1247 + * Just for robustness as stable_node is 1248 + * otherwise left as a stable pointer, the 1249 + * compiler shall optimize it away at build 1250 + * time.
1251 + */ 1252 + stable_node = NULL; 1253 + } else if (stable_node->hlist.first != &found->hlist_dup && 1254 + __is_page_sharing_candidate(found, 1)) { 1255 + /* 1256 + * If the found stable_node dup can accept one 1257 + * more future merge (in addition to the one 1258 + * that is underway) and is not at the head of 1259 + * the chain, put it there so next search will 1260 + * be quicker in the !prune_stale_stable_nodes 1261 + * case. 1262 + * 1263 + * NOTE: it would be inaccurate to use nr > 1 1264 + * instead of checking the hlist.first pointer 1265 + * directly, because in the 1266 + * prune_stale_stable_nodes case "nr" isn't 1267 + * the position of the found dup in the chain, 1268 + * but the total number of dups in the chain. 1269 + */ 1270 + hlist_del(&found->hlist_dup); 1271 + hlist_add_head(&found->hlist_dup, 1272 + &stable_node->hlist); 1273 + } 1274 + } 1275 + 1276 + *_stable_node_dup = found; 1277 + return tree_page; 1278 + } 1279 + 1280 + static struct stable_node *stable_node_dup_any(struct stable_node *stable_node, 1281 + struct rb_root *root) 1282 + { 1283 + if (!is_stable_node_chain(stable_node)) 1284 + return stable_node; 1285 + if (hlist_empty(&stable_node->hlist)) { 1286 + free_stable_node_chain(stable_node, root); 1287 + return NULL; 1288 + } 1289 + return hlist_entry(stable_node->hlist.first, 1290 + typeof(*stable_node), hlist_dup); 1291 + } 1292 + 1293 + /* 1294 + * Like for get_ksm_page, this function can free the *_stable_node and 1295 + * *_stable_node_dup if the returned tree_page is NULL. 1296 + * 1297 + * It can also free and overwrite *_stable_node with the found 1298 + * stable_node_dup if the chain is collapsed (in which case 1299 + * *_stable_node will be equal to *_stable_node_dup like if the chain 1300 + * never existed). It's up to the caller to verify tree_page is not 1301 + * NULL before dereferencing *_stable_node or *_stable_node_dup. 
1302 + * 1303 + * *_stable_node_dup is really a second output parameter of this 1304 + * function and will be overwritten in all cases, the caller doesn't 1305 + * need to initialize it. 1306 + */ 1307 + static struct page *__stable_node_chain(struct stable_node **_stable_node_dup, 1308 + struct stable_node **_stable_node, 1309 + struct rb_root *root, 1310 + bool prune_stale_stable_nodes) 1311 + { 1312 + struct stable_node *stable_node = *_stable_node; 1313 + if (!is_stable_node_chain(stable_node)) { 1314 + if (is_page_sharing_candidate(stable_node)) { 1315 + *_stable_node_dup = stable_node; 1316 + return get_ksm_page(stable_node, false); 1317 + } 1318 + /* 1319 + * _stable_node_dup set to NULL means the stable_node 1320 + * reached the ksm_max_page_sharing limit. 1321 + */ 1322 + *_stable_node_dup = NULL; 1323 + return NULL; 1324 + } 1325 + return stable_node_dup(_stable_node_dup, _stable_node, root, 1326 + prune_stale_stable_nodes); 1327 + } 1328 + 1329 + static __always_inline struct page *chain_prune(struct stable_node **s_n_d, 1330 + struct stable_node **s_n, 1331 + struct rb_root *root) 1332 + { 1333 + return __stable_node_chain(s_n_d, s_n, root, true); 1334 + } 1335 + 1336 + static __always_inline struct page *chain(struct stable_node **s_n_d, 1337 + struct stable_node *s_n, 1338 + struct rb_root *root) 1339 + { 1340 + struct stable_node *old_stable_node = s_n; 1341 + struct page *tree_page; 1342 + 1343 + tree_page = __stable_node_chain(s_n_d, &s_n, root, false); 1344 + /* not pruning dups so s_n cannot have changed */ 1345 + VM_BUG_ON(s_n != old_stable_node); 1346 + return tree_page; 1347 + } 1348 + 1296 1349 /* 1297 1350 * stable_tree_search - search for page inside the stable tree 1298 1351 * ··· 1516 1153 struct rb_root *root; 1517 1154 struct rb_node **new; 1518 1155 struct rb_node *parent; 1519 - struct stable_node *stable_node; 1156 + struct stable_node *stable_node, *stable_node_dup, *stable_node_any; 1520 1157 struct stable_node *page_node; 1521 
1158 1522 1159 page_node = page_stable_node(page); ··· 1538 1175 1539 1176 cond_resched(); 1540 1177 stable_node = rb_entry(*new, struct stable_node, node); 1541 - tree_page = get_ksm_page(stable_node, false); 1178 + stable_node_any = NULL; 1179 + tree_page = chain_prune(&stable_node_dup, &stable_node, root); 1180 + /* 1181 + * NOTE: stable_node may have been freed by 1182 + * chain_prune() if the returned stable_node_dup is 1183 + * not NULL. stable_node_dup may have been inserted in 1184 + * the rbtree instead as a regular stable_node (in 1185 + * order to collapse the stable_node chain if a single 1186 + * stable_node dup was found in it). In such case the 1187 + * stable_node is overwritten by the callee to point 1188 + * to the stable_node_dup that was collapsed in the 1189 + * stable rbtree and stable_node will be equal to 1190 + * stable_node_dup like if the chain never existed. 1191 + */ 1192 + if (!stable_node_dup) { 1193 + /* 1194 + * Either all stable_node dups were full in 1195 + * this stable_node chain, or this chain was 1196 + * empty and should be rb_erased. 1197 + */ 1198 + stable_node_any = stable_node_dup_any(stable_node, 1199 + root); 1200 + if (!stable_node_any) { 1201 + /* rb_erase just run */ 1202 + goto again; 1203 + } 1204 + /* 1205 + * Take any of the stable_node dups page of 1206 + * this stable_node chain to let the tree walk 1207 + * continue. All KSM pages belonging to the 1208 + * stable_node dups in a stable_node chain 1209 + * have the same content and they're 1210 + * wrprotected at all times. Any will work 1211 + * fine to continue the walk.
1212 + */ 1213 + tree_page = get_ksm_page(stable_node_any, false); 1214 + } 1215 + VM_BUG_ON(!stable_node_dup ^ !!stable_node_any); 1542 1216 if (!tree_page) { 1543 1217 /* 1544 1218 * If we walked over a stale stable_node, ··· 1598 1198 else if (ret > 0) 1599 1199 new = &parent->rb_right; 1600 1200 else { 1201 + if (page_node) { 1202 + VM_BUG_ON(page_node->head != &migrate_nodes); 1203 + /* 1204 + * Test if the migrated page should be merged 1205 + * into a stable node dup. If the mapcount is 1206 + * 1 we can migrate it with another KSM page 1207 + * without adding it to the chain. 1208 + */ 1209 + if (page_mapcount(page) > 1) 1210 + goto chain_append; 1211 + } 1212 + 1213 + if (!stable_node_dup) { 1214 + /* 1215 + * If the stable_node is a chain and 1216 + * we got a payload match in memcmp 1217 + * but we cannot merge the scanned 1218 + * page in any of the existing 1219 + * stable_node dups because they're 1220 + * all full, we need to wait for the 1221 + * scanned page to find itself a match 1222 + * in the unstable tree to create a 1223 + * brand new KSM page to add later to 1224 + * the dups of this stable_node. 1225 + */ 1226 + return NULL; 1227 + } 1228 + 1601 1229 /* 1602 1230 * Lock and unlock the stable_node's page (which 1603 1231 * might already have been migrated) so that page ··· 1633 1205 * It would be more elegant to return stable_node 1634 1206 * than kpage, but that involves more changes. 1635 1207 */ 1636 - tree_page = get_ksm_page(stable_node, true); 1637 - if (tree_page) { 1638 - unlock_page(tree_page); 1639 - if (get_kpfn_nid(stable_node->kpfn) != 1640 - NUMA(stable_node->nid)) { 1641 - put_page(tree_page); 1642 - goto replace; 1643 - } 1644 - return tree_page; 1645 - } 1646 - /* 1647 - * There is now a place for page_node, but the tree may 1648 - * have been rebalanced, so re-evaluate parent and new.
1649 - */ 1650 - if (page_node) 1208 + tree_page = get_ksm_page(stable_node_dup, true); 1209 + if (unlikely(!tree_page)) 1210 + /* 1211 + * The tree may have been rebalanced, 1212 + * so re-evaluate parent and new. 1213 + */ 1651 1214 goto again; 1652 - return NULL; 1215 + unlock_page(tree_page); 1216 + 1217 + if (get_kpfn_nid(stable_node_dup->kpfn) != 1218 + NUMA(stable_node_dup->nid)) { 1219 + put_page(tree_page); 1220 + goto replace; 1221 + } 1222 + return tree_page; 1653 1223 } 1654 1224 } 1655 1225 ··· 1658 1232 DO_NUMA(page_node->nid = nid); 1659 1233 rb_link_node(&page_node->node, parent, new); 1660 1234 rb_insert_color(&page_node->node, root); 1661 - get_page(page); 1662 - return page; 1235 + out: 1236 + if (is_page_sharing_candidate(page_node)) { 1237 + get_page(page); 1238 + return page; 1239 + } else 1240 + return NULL; 1663 1241 1664 1242 replace: 1665 - if (page_node) { 1666 - list_del(&page_node->list); 1667 - DO_NUMA(page_node->nid = nid); 1668 - rb_replace_node(&stable_node->node, &page_node->node, root); 1669 - get_page(page); 1243 + /* 1244 + * If stable_node was a chain and chain_prune collapsed it, 1245 + * stable_node has been updated to be the new regular 1246 + * stable_node. A collapse of the chain is indistinguishable 1247 + * from the case there was no chain in the stable 1248 + * rbtree. Otherwise stable_node is the chain and 1249 + * stable_node_dup is the dup to replace. 
1250 + */ 1251 + if (stable_node_dup == stable_node) { 1252 + VM_BUG_ON(is_stable_node_chain(stable_node_dup)); 1253 + VM_BUG_ON(is_stable_node_dup(stable_node_dup)); 1254 + /* there is no chain */ 1255 + if (page_node) { 1256 + VM_BUG_ON(page_node->head != &migrate_nodes); 1257 + list_del(&page_node->list); 1258 + DO_NUMA(page_node->nid = nid); 1259 + rb_replace_node(&stable_node_dup->node, 1260 + &page_node->node, 1261 + root); 1262 + if (is_page_sharing_candidate(page_node)) 1263 + get_page(page); 1264 + else 1265 + page = NULL; 1266 + } else { 1267 + rb_erase(&stable_node_dup->node, root); 1268 + page = NULL; 1269 + } 1670 1270 } else { 1671 - rb_erase(&stable_node->node, root); 1672 - page = NULL; 1271 + VM_BUG_ON(!is_stable_node_chain(stable_node)); 1272 + __stable_node_dup_del(stable_node_dup); 1273 + if (page_node) { 1274 + VM_BUG_ON(page_node->head != &migrate_nodes); 1275 + list_del(&page_node->list); 1276 + DO_NUMA(page_node->nid = nid); 1277 + stable_node_chain_add_dup(page_node, stable_node); 1278 + if (is_page_sharing_candidate(page_node)) 1279 + get_page(page); 1280 + else 1281 + page = NULL; 1282 + } else { 1283 + page = NULL; 1284 + } 1673 1285 } 1674 - stable_node->head = &migrate_nodes; 1675 - list_add(&stable_node->list, stable_node->head); 1286 + stable_node_dup->head = &migrate_nodes; 1287 + list_add(&stable_node_dup->list, stable_node_dup->head); 1676 1288 return page; 1289 + 1290 + chain_append: 1291 + /* stable_node_dup could be null if it reached the limit */ 1292 + if (!stable_node_dup) 1293 + stable_node_dup = stable_node_any; 1294 + /* 1295 + * If stable_node was a chain and chain_prune collapsed it, 1296 + * stable_node has been updated to be the new regular 1297 + * stable_node. A collapse of the chain is indistinguishable 1298 + * from the case there was no chain in the stable 1299 + * rbtree. Otherwise stable_node is the chain and 1300 + * stable_node_dup is the dup to replace. 
1301 + */ 1302 + if (stable_node_dup == stable_node) { 1303 + VM_BUG_ON(is_stable_node_chain(stable_node_dup)); 1304 + VM_BUG_ON(is_stable_node_dup(stable_node_dup)); 1305 + /* chain is missing so create it */ 1306 + stable_node = alloc_stable_node_chain(stable_node_dup, 1307 + root); 1308 + if (!stable_node) 1309 + return NULL; 1310 + } 1311 + /* 1312 + * Add this stable_node dup that was 1313 + * migrated to the stable_node chain 1314 + * of the current nid for this page 1315 + * content. 1316 + */ 1317 + VM_BUG_ON(!is_stable_node_chain(stable_node)); 1318 + VM_BUG_ON(!is_stable_node_dup(stable_node_dup)); 1319 + VM_BUG_ON(page_node->head != &migrate_nodes); 1320 + list_del(&page_node->list); 1321 + DO_NUMA(page_node->nid = nid); 1322 + stable_node_chain_add_dup(page_node, stable_node); 1323 + goto out; 1677 1324 } 1678 1325 1679 1326 /* ··· 1763 1264 struct rb_root *root; 1764 1265 struct rb_node **new; 1765 1266 struct rb_node *parent; 1766 - struct stable_node *stable_node; 1267 + struct stable_node *stable_node, *stable_node_dup, *stable_node_any; 1268 + bool need_chain = false; 1767 1269 1768 1270 kpfn = page_to_pfn(kpage); 1769 1271 nid = get_kpfn_nid(kpfn); ··· 1779 1279 1780 1280 cond_resched(); 1781 1281 stable_node = rb_entry(*new, struct stable_node, node); 1782 - tree_page = get_ksm_page(stable_node, false); 1282 + stable_node_any = NULL; 1283 + tree_page = chain(&stable_node_dup, stable_node, root); 1284 + if (!stable_node_dup) { 1285 + /* 1286 + * Either all stable_node dups were full in 1287 + * this stable_node chain, or this chain was 1288 + * empty and should be rb_erased. 1289 + */ 1290 + stable_node_any = stable_node_dup_any(stable_node, 1291 + root); 1292 + if (!stable_node_any) { 1293 + /* rb_erase just run */ 1294 + goto again; 1295 + } 1296 + /* 1297 + * Take any of the stable_node dups page of 1298 + * this stable_node chain to let the tree walk 1299 + * continue. 
All KSM pages belonging to the 1300 + * stable_node dups in a stable_node chain 1301 + * have the same content and they're 1302 + * wrprotected at all times. Any will work 1303 + * fine to continue the walk. 1304 + */ 1305 + tree_page = get_ksm_page(stable_node_any, false); 1306 + } 1307 + VM_BUG_ON(!stable_node_dup ^ !!stable_node_any); 1783 1308 if (!tree_page) { 1784 1309 /* 1785 1310 * If we walked over a stale stable_node, ··· 1827 1302 else if (ret > 0) 1828 1303 new = &parent->rb_right; 1829 1304 else { 1830 - /* 1831 - * It is not a bug that stable_tree_search() didn't 1832 - * find this node: because at that time our page was 1833 - * not yet write-protected, so may have changed since. 1834 - */ 1835 - return NULL; 1305 + need_chain = true; 1306 + break; 1836 1307 } 1837 1308 } 1838 1309 1839 - stable_node = alloc_stable_node(); 1840 - if (!stable_node) 1310 + stable_node_dup = alloc_stable_node(); 1311 + if (!stable_node_dup) 1841 1312 return NULL; 1842 1313 1843 - INIT_HLIST_HEAD(&stable_node->hlist); 1844 - stable_node->kpfn = kpfn; 1845 - set_page_stable_node(kpage, stable_node); 1846 - DO_NUMA(stable_node->nid = nid); 1847 - rb_link_node(&stable_node->node, parent, new); 1848 - rb_insert_color(&stable_node->node, root); 1314 + INIT_HLIST_HEAD(&stable_node_dup->hlist); 1315 + stable_node_dup->kpfn = kpfn; 1316 + set_page_stable_node(kpage, stable_node_dup); 1317 + stable_node_dup->rmap_hlist_len = 0; 1318 + DO_NUMA(stable_node_dup->nid = nid); 1319 + if (!need_chain) { 1320 + rb_link_node(&stable_node_dup->node, parent, new); 1321 + rb_insert_color(&stable_node_dup->node, root); 1322 + } else { 1323 + if (!is_stable_node_chain(stable_node)) { 1324 + struct stable_node *orig = stable_node; 1325 + /* chain is missing so create it */ 1326 + stable_node = alloc_stable_node_chain(orig, root); 1327 + if (!stable_node) { 1328 + free_stable_node(stable_node_dup); 1329 + return NULL; 1330 + } 1331 + } 1332 + stable_node_chain_add_dup(stable_node_dup, 
stable_node); 1333 + } 1849 - return stable_node; 1335 + return stable_node_dup; 1851 1336 } 1852 1337 1853 1338 /* ··· 1947 1412 * the same ksm page. 1948 1413 */ 1949 1414 static void stable_tree_append(struct rmap_item *rmap_item, 1950 - struct stable_node *stable_node) 1415 + struct stable_node *stable_node, 1416 + bool max_page_sharing_bypass) 1951 1417 { 1418 + /* 1419 + * rmap won't find this mapping if we don't insert the 1420 + * rmap_item in the right stable_node 1421 + * duplicate. page_migration could break later if rmap breaks, 1422 + * so we can as well crash here. We really need to check for 1423 + * rmap_hlist_len == STABLE_NODE_CHAIN, but we can as well check 1424 + * for other negative values as an underflow if detected here 1425 + * for the first time (and not when decreasing rmap_hlist_len) 1426 + * would be a sign of memory corruption in the stable_node. 1427 + */ 1428 + BUG_ON(stable_node->rmap_hlist_len < 0); 1429 + 1430 + stable_node->rmap_hlist_len++; 1431 + if (!max_page_sharing_bypass) 1432 + /* possibly non fatal but unexpected overflow, only warn */ 1433 + WARN_ON_ONCE(stable_node->rmap_hlist_len > 1434 + ksm_max_page_sharing); 1435 + 1952 1436 rmap_item->head = stable_node; 1953 1437 rmap_item->address |= STABLE_FLAG; 1954 1438 hlist_add_head(&rmap_item->hlist, &stable_node->hlist); ··· 1995 1441 struct page *kpage; 1996 1442 unsigned int checksum; 1997 1443 int err; 1444 + bool max_page_sharing_bypass = false; 1998 1445 1999 1446 stable_node = page_stable_node(page); 2000 1447 if (stable_node) { 2001 1448 if (stable_node->head != &migrate_nodes && 2002 - get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) { 2003 - rb_erase(&stable_node->node, 2004 - root_stable_tree + NUMA(stable_node->nid)); 1449 + get_kpfn_nid(READ_ONCE(stable_node->kpfn)) != 1450 + NUMA(stable_node->nid)) { 1451 + stable_node_dup_del(stable_node); 2005 1452 stable_node->head = &migrate_nodes; 2006 1453 list_add(&stable_node->list, stable_node->head);
2007 1454 } 2008 1455 if (stable_node->head != &migrate_nodes && 2009 1456 rmap_item->head == stable_node) 2010 1457 return; 1458 + /* 1459 + * If it's a KSM fork, allow it to go over the sharing limit 1460 + * without warnings. 1461 + */ 1462 + if (!is_page_sharing_candidate(stable_node)) 1463 + max_page_sharing_bypass = true; 2011 1464 } 2012 1465 2013 1466 /* We first start with searching the page inside the stable tree */ ··· 2034 1473 * add its rmap_item to the stable tree. 2035 1474 */ 2036 1475 lock_page(kpage); 2037 - stable_tree_append(rmap_item, page_stable_node(kpage)); 1476 + stable_tree_append(rmap_item, page_stable_node(kpage), 1477 + max_page_sharing_bypass); 2038 1478 unlock_page(kpage); 2039 1479 } 2040 1480 put_page(kpage); ··· 2085 1523 lock_page(kpage); 2086 1524 stable_node = stable_tree_insert(kpage); 2087 1525 if (stable_node) { 2088 - stable_tree_append(tree_rmap_item, stable_node); 2089 - stable_tree_append(rmap_item, stable_node); 1526 + stable_tree_append(tree_rmap_item, stable_node, 1527 + false); 1528 + stable_tree_append(rmap_item, stable_node, 1529 + false); 2090 1530 } 2091 1531 unlock_page(kpage); 2092 1532 ··· 2592 2028 } 2593 2029 } 2594 2030 2031 + static bool stable_node_dup_remove_range(struct stable_node *stable_node, 2032 + unsigned long start_pfn, 2033 + unsigned long end_pfn) 2034 + { 2035 + if (stable_node->kpfn >= start_pfn && 2036 + stable_node->kpfn < end_pfn) { 2037 + /* 2038 + * Don't get_ksm_page, page has already gone: 2039 + * which is why we keep kpfn instead of page* 2040 + */ 2041 + remove_node_from_stable_tree(stable_node); 2042 + return true; 2043 + } 2044 + return false; 2045 + } 2046 + 2047 + static bool stable_node_chain_remove_range(struct stable_node *stable_node, 2048 + unsigned long start_pfn, 2049 + unsigned long end_pfn, 2050 + struct rb_root *root) 2051 + { 2052 + struct stable_node *dup; 2053 + struct hlist_node *hlist_safe; 2054 + 2055 + if (!is_stable_node_chain(stable_node)) { 2056 + 
VM_BUG_ON(is_stable_node_dup(stable_node)); 2057 + return stable_node_dup_remove_range(stable_node, start_pfn, 2058 + end_pfn); 2059 + } 2060 + 2061 + hlist_for_each_entry_safe(dup, hlist_safe, 2062 + &stable_node->hlist, hlist_dup) { 2063 + VM_BUG_ON(!is_stable_node_dup(dup)); 2064 + stable_node_dup_remove_range(dup, start_pfn, end_pfn); 2065 + } 2066 + if (hlist_empty(&stable_node->hlist)) { 2067 + free_stable_node_chain(stable_node, root); 2068 + return true; /* notify caller that tree was rebalanced */ 2069 + } else 2070 + return false; 2071 + } 2072 + 2595 2073 static void ksm_check_stable_tree(unsigned long start_pfn, 2596 2074 unsigned long end_pfn) 2597 2075 { ··· 2645 2039 node = rb_first(root_stable_tree + nid); 2646 2040 while (node) { 2647 2041 stable_node = rb_entry(node, struct stable_node, node); 2648 - if (stable_node->kpfn >= start_pfn && 2649 - stable_node->kpfn < end_pfn) { 2650 - /* 2651 - * Don't get_ksm_page, page has already gone: 2652 - * which is why we keep kpfn instead of page* 2653 - */ 2654 - remove_node_from_stable_tree(stable_node); 2042 + if (stable_node_chain_remove_range(stable_node, 2043 + start_pfn, end_pfn, 2044 + root_stable_tree + 2045 + nid)) 2655 2046 node = rb_first(root_stable_tree + nid); 2656 - } else 2047 + else 2657 2048 node = rb_next(node); 2658 2049 cond_resched(); 2659 2050 } ··· 2896 2293 } 2897 2294 KSM_ATTR(use_zero_pages); 2898 2295 2296 + static ssize_t max_page_sharing_show(struct kobject *kobj, 2297 + struct kobj_attribute *attr, char *buf) 2298 + { 2299 + return sprintf(buf, "%u\n", ksm_max_page_sharing); 2300 + } 2301 + 2302 + static ssize_t max_page_sharing_store(struct kobject *kobj, 2303 + struct kobj_attribute *attr, 2304 + const char *buf, size_t count) 2305 + { 2306 + int err; 2307 + int knob; 2308 + 2309 + err = kstrtoint(buf, 10, &knob); 2310 + if (err) 2311 + return err; 2312 + /* 2313 + * When a KSM page is created it is shared by 2 mappings. 
This 2314 + * being a signed comparison, it implicitly verifies it's not 2315 + * negative. 2316 + */ 2317 + if (knob < 2) 2318 + return -EINVAL; 2319 + 2320 + if (READ_ONCE(ksm_max_page_sharing) == knob) 2321 + return count; 2322 + 2323 + mutex_lock(&ksm_thread_mutex); 2324 + wait_while_offlining(); 2325 + if (ksm_max_page_sharing != knob) { 2326 + if (ksm_pages_shared || remove_all_stable_nodes()) 2327 + err = -EBUSY; 2328 + else 2329 + ksm_max_page_sharing = knob; 2330 + } 2331 + mutex_unlock(&ksm_thread_mutex); 2332 + 2333 + return err ? err : count; 2334 + } 2335 + KSM_ATTR(max_page_sharing); 2336 + 2899 2337 static ssize_t pages_shared_show(struct kobject *kobj, 2900 2338 struct kobj_attribute *attr, char *buf) 2901 2339 { ··· 2975 2331 } 2976 2332 KSM_ATTR_RO(pages_volatile); 2977 2333 2334 + static ssize_t stable_node_dups_show(struct kobject *kobj, 2335 + struct kobj_attribute *attr, char *buf) 2336 + { 2337 + return sprintf(buf, "%lu\n", ksm_stable_node_dups); 2338 + } 2339 + KSM_ATTR_RO(stable_node_dups); 2340 + 2341 + static ssize_t stable_node_chains_show(struct kobject *kobj, 2342 + struct kobj_attribute *attr, char *buf) 2343 + { 2344 + return sprintf(buf, "%lu\n", ksm_stable_node_chains); 2345 + } 2346 + KSM_ATTR_RO(stable_node_chains); 2347 + 2348 + static ssize_t 2349 + stable_node_chains_prune_millisecs_show(struct kobject *kobj, 2350 + struct kobj_attribute *attr, 2351 + char *buf) 2352 + { 2353 + return sprintf(buf, "%u\n", ksm_stable_node_chains_prune_millisecs); 2354 + } 2355 + 2356 + static ssize_t 2357 + stable_node_chains_prune_millisecs_store(struct kobject *kobj, 2358 + struct kobj_attribute *attr, 2359 + const char *buf, size_t count) 2360 + { 2361 + unsigned long msecs; 2362 + int err; 2363 + 2364 + err = kstrtoul(buf, 10, &msecs); 2365 + if (err || msecs > UINT_MAX) 2366 + return -EINVAL; 2367 + 2368 + ksm_stable_node_chains_prune_millisecs = msecs; 2369 + 2370 + return count; 2371 + } 2372 + 
KSM_ATTR(stable_node_chains_prune_millisecs); 2373 + 2978 2374 static ssize_t full_scans_show(struct kobject *kobj, 2979 2375 struct kobj_attribute *attr, char *buf) 2980 2376 { ··· 3034 2350 #ifdef CONFIG_NUMA 3035 2351 &merge_across_nodes_attr.attr, 3036 2352 #endif 2353 + &max_page_sharing_attr.attr, 2354 + &stable_node_chains_attr.attr, 2355 + &stable_node_dups_attr.attr, 2356 + &stable_node_chains_prune_millisecs_attr.attr, 3037 2357 &use_zero_pages_attr.attr, 3038 2358 NULL, 3039 2359 };
-3
mm/memblock.c
··· 54 54 }; 55 55 56 56 int memblock_debug __initdata_memblock; 57 - #ifdef CONFIG_MOVABLE_NODE 58 - bool movable_node_enabled __initdata_memblock = false; 59 - #endif 60 57 static bool system_has_some_mirror __initdata_memblock = false; 61 58 static int memblock_can_resize __initdata_memblock; 62 59 static int memblock_memory_in_slab __initdata_memblock = 0;
+52 -29
mm/memcontrol.c
··· 2376 2376 2377 2377 #ifdef CONFIG_MEMCG_SWAP 2378 2378 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, 2379 - bool charge) 2379 + int nr_entries) 2380 2380 { 2381 - int val = (charge) ? 1 : -1; 2382 - this_cpu_add(memcg->stat->count[MEMCG_SWAP], val); 2381 + this_cpu_add(memcg->stat->count[MEMCG_SWAP], nr_entries); 2383 2382 } 2384 2383 2385 2384 /** ··· 2404 2405 new_id = mem_cgroup_id(to); 2405 2406 2406 2407 if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) { 2407 - mem_cgroup_swap_statistics(from, false); 2408 - mem_cgroup_swap_statistics(to, true); 2408 + mem_cgroup_swap_statistics(from, -1); 2409 + mem_cgroup_swap_statistics(to, 1); 2409 2410 return 0; 2410 2411 } 2411 2412 return -EINVAL; ··· 3573 3574 3574 3575 seq_printf(sf, "oom_kill_disable %d\n", memcg->oom_kill_disable); 3575 3576 seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); 3577 + seq_printf(sf, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL)); 3576 3578 return 0; 3577 3579 } 3578 3580 ··· 4122 4122 if (!pn) 4123 4123 return 1; 4124 4124 4125 + pn->lruvec_stat = alloc_percpu(struct lruvec_stat); 4126 + if (!pn->lruvec_stat) { 4127 + kfree(pn); 4128 + return 1; 4129 + } 4130 + 4125 4131 lruvec_init(&pn->lruvec); 4126 4132 pn->usage_in_excess = 0; 4127 4133 pn->on_tree = false; ··· 4139 4133 4140 4134 static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) 4141 4135 { 4142 - kfree(memcg->nodeinfo[node]); 4136 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; 4137 + 4138 + free_percpu(pn->lruvec_stat); 4139 + kfree(pn); 4143 4140 } 4144 4141 4145 4142 static void __mem_cgroup_free(struct mem_cgroup *memcg) ··· 5174 5165 seq_printf(m, "high %lu\n", memcg_sum_events(memcg, MEMCG_HIGH)); 5175 5166 seq_printf(m, "max %lu\n", memcg_sum_events(memcg, MEMCG_MAX)); 5176 5167 seq_printf(m, "oom %lu\n", memcg_sum_events(memcg, MEMCG_OOM)); 5168 + seq_printf(m, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL)); 5177 5169 5178 5170 return 
0; 5179 5171 } ··· 5207 5197 seq_printf(m, "kernel_stack %llu\n", 5208 5198 (u64)stat[MEMCG_KERNEL_STACK_KB] * 1024); 5209 5199 seq_printf(m, "slab %llu\n", 5210 - (u64)(stat[MEMCG_SLAB_RECLAIMABLE] + 5211 - stat[MEMCG_SLAB_UNRECLAIMABLE]) * PAGE_SIZE); 5200 + (u64)(stat[NR_SLAB_RECLAIMABLE] + 5201 + stat[NR_SLAB_UNRECLAIMABLE]) * PAGE_SIZE); 5212 5202 seq_printf(m, "sock %llu\n", 5213 5203 (u64)stat[MEMCG_SOCK] * PAGE_SIZE); 5214 5204 ··· 5232 5222 } 5233 5223 5234 5224 seq_printf(m, "slab_reclaimable %llu\n", 5235 - (u64)stat[MEMCG_SLAB_RECLAIMABLE] * PAGE_SIZE); 5225 + (u64)stat[NR_SLAB_RECLAIMABLE] * PAGE_SIZE); 5236 5226 seq_printf(m, "slab_unreclaimable %llu\n", 5237 - (u64)stat[MEMCG_SLAB_UNRECLAIMABLE] * PAGE_SIZE); 5227 + (u64)stat[NR_SLAB_UNRECLAIMABLE] * PAGE_SIZE); 5238 5228 5239 5229 /* Accumulated memory events */ 5240 5230 5241 5231 seq_printf(m, "pgfault %lu\n", events[PGFAULT]); 5242 5232 seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT]); 5233 + 5234 + seq_printf(m, "pgrefill %lu\n", events[PGREFILL]); 5235 + seq_printf(m, "pgscan %lu\n", events[PGSCAN_KSWAPD] + 5236 + events[PGSCAN_DIRECT]); 5237 + seq_printf(m, "pgsteal %lu\n", events[PGSTEAL_KSWAPD] + 5238 + events[PGSTEAL_DIRECT]); 5239 + seq_printf(m, "pgactivate %lu\n", events[PGACTIVATE]); 5240 + seq_printf(m, "pgdeactivate %lu\n", events[PGDEACTIVATE]); 5241 + seq_printf(m, "pglazyfree %lu\n", events[PGLAZYFREE]); 5242 + seq_printf(m, "pglazyfreed %lu\n", events[PGLAZYFREED]); 5243 5243 5244 5244 seq_printf(m, "workingset_refault %lu\n", 5245 5245 stat[WORKINGSET_REFAULT]); ··· 5465 5445 * let's not wait for it. The page already received a 5466 5446 * memory+swap charge, drop the swap entry duplicate. 5467 5447 */ 5468 - mem_cgroup_uncharge_swap(entry); 5448 + mem_cgroup_uncharge_swap(entry, nr_pages); 5469 5449 } 5470 5450 } 5471 5451 ··· 5893 5873 * ancestor for the swap instead and transfer the memory+swap charge. 
5894 5874 */ 5895 5875 swap_memcg = mem_cgroup_id_get_online(memcg); 5896 - oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg)); 5876 + oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1); 5897 5877 VM_BUG_ON_PAGE(oldid, page); 5898 - mem_cgroup_swap_statistics(swap_memcg, true); 5878 + mem_cgroup_swap_statistics(swap_memcg, 1); 5899 5879 5900 5880 page->mem_cgroup = NULL; 5901 5881 ··· 5922 5902 css_put(&memcg->css); 5923 5903 } 5924 5904 5925 - /* 5926 - * mem_cgroup_try_charge_swap - try charging a swap entry 5905 + /** 5906 + * mem_cgroup_try_charge_swap - try charging swap space for a page 5927 5907 * @page: page being added to swap 5928 5908 * @entry: swap entry to charge 5929 5909 * 5930 - * Try to charge @entry to the memcg that @page belongs to. 5910 + * Try to charge @page's memcg for the swap space at @entry. 5931 5911 * 5932 5912 * Returns 0 on success, -ENOMEM on failure. 5933 5913 */ 5934 5914 int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry) 5935 5915 { 5936 - struct mem_cgroup *memcg; 5916 + unsigned int nr_pages = hpage_nr_pages(page); 5937 5917 struct page_counter *counter; 5918 + struct mem_cgroup *memcg; 5938 5919 unsigned short oldid; 5939 5920 5940 5921 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account) ··· 5950 5929 memcg = mem_cgroup_id_get_online(memcg); 5951 5930 5952 5931 if (!mem_cgroup_is_root(memcg) && 5953 - !page_counter_try_charge(&memcg->swap, 1, &counter)) { 5932 + !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { 5954 5933 mem_cgroup_id_put(memcg); 5955 5934 return -ENOMEM; 5956 5935 } 5957 5936 5958 - oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg)); 5937 + /* Get references for the tail pages, too */ 5938 + if (nr_pages > 1) 5939 + mem_cgroup_id_get_many(memcg, nr_pages - 1); 5940 + oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages); 5959 5941 VM_BUG_ON_PAGE(oldid, page); 5960 - mem_cgroup_swap_statistics(memcg, true); 5942 + 
mem_cgroup_swap_statistics(memcg, nr_pages); 5961 5943 5962 5944 return 0; 5963 5945 } 5964 5946 5965 5947 /** 5966 - * mem_cgroup_uncharge_swap - uncharge a swap entry 5948 + * mem_cgroup_uncharge_swap - uncharge swap space 5967 5949 * @entry: swap entry to uncharge 5968 - * 5969 - * Drop the swap charge associated with @entry. 5950 + * @nr_pages: the amount of swap space to uncharge 5970 5951 */ 5971 - void mem_cgroup_uncharge_swap(swp_entry_t entry) 5952 + void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) 5972 5953 { 5973 5954 struct mem_cgroup *memcg; 5974 5955 unsigned short id; ··· 5978 5955 if (!do_swap_account) 5979 5956 return; 5980 5957 5981 - id = swap_cgroup_record(entry, 0); 5958 + id = swap_cgroup_record(entry, 0, nr_pages); 5982 5959 rcu_read_lock(); 5983 5960 memcg = mem_cgroup_from_id(id); 5984 5961 if (memcg) { 5985 5962 if (!mem_cgroup_is_root(memcg)) { 5986 5963 if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) 5987 - page_counter_uncharge(&memcg->swap, 1); 5964 + page_counter_uncharge(&memcg->swap, nr_pages); 5988 5965 else 5989 - page_counter_uncharge(&memcg->memsw, 1); 5966 + page_counter_uncharge(&memcg->memsw, nr_pages); 5990 5967 } 5991 - mem_cgroup_swap_statistics(memcg, false); 5992 - mem_cgroup_id_put(memcg); 5968 + mem_cgroup_swap_statistics(memcg, -nr_pages); 5969 + mem_cgroup_id_put_many(memcg, nr_pages); 5993 5970 } 5994 5971 rcu_read_unlock(); 5995 5972 }
+9 -4
mm/memory-failure.c
··· 1492 1492 static struct page *new_page(struct page *p, unsigned long private, int **x) 1493 1493 { 1494 1494 int nid = page_to_nid(p); 1495 - if (PageHuge(p)) 1496 - return alloc_huge_page_node(page_hstate(compound_head(p)), 1497 - nid); 1498 - else 1495 + if (PageHuge(p)) { 1496 + struct hstate *hstate = page_hstate(compound_head(p)); 1497 + 1498 + if (hstate_is_gigantic(hstate)) 1499 + return alloc_huge_page_node(hstate, NUMA_NO_NODE); 1500 + 1501 + return alloc_huge_page_node(hstate, nid); 1502 + } else { 1499 1503 return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0); 1504 + } 1500 1505 } 1501 1506 1502 1507 /*
+2 -4
mm/memory.c
··· 2719 2719 /* Had to read the page from swap area: Major fault */ 2720 2720 ret = VM_FAULT_MAJOR; 2721 2721 count_vm_event(PGMAJFAULT); 2722 - mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); 2722 + count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); 2723 2723 } else if (PageHWPoison(page)) { 2724 2724 /* 2725 2725 * hwpoisoned dirty swapcache pages are kept for killing ··· 3837 3837 __set_current_state(TASK_RUNNING); 3838 3838 3839 3839 count_vm_event(PGFAULT); 3840 - mem_cgroup_count_vm_event(vma->vm_mm, PGFAULT); 3840 + count_memcg_event_mm(vma->vm_mm, PGFAULT); 3841 3841 3842 3842 /* do counter updates before entering really critical section. */ 3843 3843 check_sync_rss_stat(current); ··· 4014 4014 goto out; 4015 4015 4016 4016 ptep = pte_offset_map_lock(mm, pmd, address, ptlp); 4017 - if (!ptep) 4018 - goto out; 4019 4017 if (!pte_present(*ptep)) 4020 4018 goto unlock; 4021 4019 *ptepp = ptep;
+174 -365
mm/memory_hotplug.c
··· 79 79 #define memhp_lock_acquire() lock_map_acquire(&mem_hotplug.dep_map) 80 80 #define memhp_lock_release() lock_map_release(&mem_hotplug.dep_map) 81 81 82 + bool movable_node_enabled = false; 83 + 82 84 #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE 83 85 bool memhp_auto_online; 84 86 #else ··· 302 300 } 303 301 #endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */ 304 302 305 - static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn, 306 - unsigned long end_pfn) 307 - { 308 - unsigned long old_zone_end_pfn; 309 - 310 - zone_span_writelock(zone); 311 - 312 - old_zone_end_pfn = zone_end_pfn(zone); 313 - if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn) 314 - zone->zone_start_pfn = start_pfn; 315 - 316 - zone->spanned_pages = max(old_zone_end_pfn, end_pfn) - 317 - zone->zone_start_pfn; 318 - 319 - zone_span_writeunlock(zone); 320 - } 321 - 322 - static void resize_zone(struct zone *zone, unsigned long start_pfn, 323 - unsigned long end_pfn) 324 - { 325 - zone_span_writelock(zone); 326 - 327 - if (end_pfn - start_pfn) { 328 - zone->zone_start_pfn = start_pfn; 329 - zone->spanned_pages = end_pfn - start_pfn; 330 - } else { 331 - /* 332 - * make it consist as free_area_init_core(), 333 - * if spanned_pages = 0, then keep start_pfn = 0 334 - */ 335 - zone->zone_start_pfn = 0; 336 - zone->spanned_pages = 0; 337 - } 338 - 339 - zone_span_writeunlock(zone); 340 - } 341 - 342 - static void fix_zone_id(struct zone *zone, unsigned long start_pfn, 343 - unsigned long end_pfn) 344 - { 345 - enum zone_type zid = zone_idx(zone); 346 - int nid = zone->zone_pgdat->node_id; 347 - unsigned long pfn; 348 - 349 - for (pfn = start_pfn; pfn < end_pfn; pfn++) 350 - set_page_links(pfn_to_page(pfn), zid, nid, pfn); 351 - } 352 - 353 - /* Can fail with -ENOMEM from allocating a wait table with vmalloc() or 354 - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */ 355 - static int __ref ensure_zone_is_initialized(struct zone *zone, 356 - unsigned 
long start_pfn, unsigned long num_pages) 357 - { 358 - if (!zone_is_initialized(zone)) 359 - return init_currently_empty_zone(zone, start_pfn, num_pages); 360 - 361 - return 0; 362 - } 363 - 364 - static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2, 365 - unsigned long start_pfn, unsigned long end_pfn) 303 + static int __meminit __add_section(int nid, unsigned long phys_start_pfn, 304 + bool want_memblock) 366 305 { 367 306 int ret; 368 - unsigned long flags; 369 - unsigned long z1_start_pfn; 370 - 371 - ret = ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn); 372 - if (ret) 373 - return ret; 374 - 375 - pgdat_resize_lock(z1->zone_pgdat, &flags); 376 - 377 - /* can't move pfns which are higher than @z2 */ 378 - if (end_pfn > zone_end_pfn(z2)) 379 - goto out_fail; 380 - /* the move out part must be at the left most of @z2 */ 381 - if (start_pfn > z2->zone_start_pfn) 382 - goto out_fail; 383 - /* must included/overlap */ 384 - if (end_pfn <= z2->zone_start_pfn) 385 - goto out_fail; 386 - 387 - /* use start_pfn for z1's start_pfn if z1 is empty */ 388 - if (!zone_is_empty(z1)) 389 - z1_start_pfn = z1->zone_start_pfn; 390 - else 391 - z1_start_pfn = start_pfn; 392 - 393 - resize_zone(z1, z1_start_pfn, end_pfn); 394 - resize_zone(z2, end_pfn, zone_end_pfn(z2)); 395 - 396 - pgdat_resize_unlock(z1->zone_pgdat, &flags); 397 - 398 - fix_zone_id(z1, start_pfn, end_pfn); 399 - 400 - return 0; 401 - out_fail: 402 - pgdat_resize_unlock(z1->zone_pgdat, &flags); 403 - return -1; 404 - } 405 - 406 - static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2, 407 - unsigned long start_pfn, unsigned long end_pfn) 408 - { 409 - int ret; 410 - unsigned long flags; 411 - unsigned long z2_end_pfn; 412 - 413 - ret = ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn); 414 - if (ret) 415 - return ret; 416 - 417 - pgdat_resize_lock(z1->zone_pgdat, &flags); 418 - 419 - /* can't move pfns which are lower than @z1 */ 420 - if 
(z1->zone_start_pfn > start_pfn) 421 - goto out_fail; 422 - /* the move out part mast at the right most of @z1 */ 423 - if (zone_end_pfn(z1) > end_pfn) 424 - goto out_fail; 425 - /* must included/overlap */ 426 - if (start_pfn >= zone_end_pfn(z1)) 427 - goto out_fail; 428 - 429 - /* use end_pfn for z2's end_pfn if z2 is empty */ 430 - if (!zone_is_empty(z2)) 431 - z2_end_pfn = zone_end_pfn(z2); 432 - else 433 - z2_end_pfn = end_pfn; 434 - 435 - resize_zone(z1, z1->zone_start_pfn, start_pfn); 436 - resize_zone(z2, start_pfn, z2_end_pfn); 437 - 438 - pgdat_resize_unlock(z1->zone_pgdat, &flags); 439 - 440 - fix_zone_id(z2, start_pfn, end_pfn); 441 - 442 - return 0; 443 - out_fail: 444 - pgdat_resize_unlock(z1->zone_pgdat, &flags); 445 - return -1; 446 - } 447 - 448 - static struct zone * __meminit move_pfn_range(int zone_shift, 449 - unsigned long start_pfn, unsigned long end_pfn) 450 - { 451 - struct zone *zone = page_zone(pfn_to_page(start_pfn)); 452 - int ret = 0; 453 - 454 - if (zone_shift < 0) 455 - ret = move_pfn_range_left(zone + zone_shift, zone, 456 - start_pfn, end_pfn); 457 - else if (zone_shift) 458 - ret = move_pfn_range_right(zone, zone + zone_shift, 459 - start_pfn, end_pfn); 460 - 461 - if (ret) 462 - return NULL; 463 - 464 - return zone + zone_shift; 465 - } 466 - 467 - static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn, 468 - unsigned long end_pfn) 469 - { 470 - unsigned long old_pgdat_end_pfn = pgdat_end_pfn(pgdat); 471 - 472 - if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn) 473 - pgdat->node_start_pfn = start_pfn; 474 - 475 - pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) - 476 - pgdat->node_start_pfn; 477 - } 478 - 479 - static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn) 480 - { 481 - struct pglist_data *pgdat = zone->zone_pgdat; 482 - int nr_pages = PAGES_PER_SECTION; 483 - int nid = pgdat->node_id; 484 - int zone_type; 485 - unsigned long flags, 
pfn; 486 - int ret; 487 - 488 - zone_type = zone - pgdat->node_zones; 489 - ret = ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages); 490 - if (ret) 491 - return ret; 492 - 493 - pgdat_resize_lock(zone->zone_pgdat, &flags); 494 - grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages); 495 - grow_pgdat_span(zone->zone_pgdat, phys_start_pfn, 496 - phys_start_pfn + nr_pages); 497 - pgdat_resize_unlock(zone->zone_pgdat, &flags); 498 - memmap_init_zone(nr_pages, nid, zone_type, 499 - phys_start_pfn, MEMMAP_HOTPLUG); 500 - 501 - /* online_page_range is called later and expects pages reserved */ 502 - for (pfn = phys_start_pfn; pfn < phys_start_pfn + nr_pages; pfn++) { 503 - if (!pfn_valid(pfn)) 504 - continue; 505 - 506 - SetPageReserved(pfn_to_page(pfn)); 507 - } 508 - return 0; 509 - } 510 - 511 - static int __meminit __add_section(int nid, struct zone *zone, 512 - unsigned long phys_start_pfn) 513 - { 514 - int ret; 307 + int i; 515 308 516 309 if (pfn_valid(phys_start_pfn)) 517 310 return -EEXIST; 518 311 519 - ret = sparse_add_one_section(zone, phys_start_pfn); 520 - 312 + ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn); 521 313 if (ret < 0) 522 314 return ret; 523 315 524 - ret = __add_zone(zone, phys_start_pfn); 316 + /* 317 + * Make all the pages reserved so that nobody will stumble over half 318 + * initialized state. 319 + * FIXME: We also have to associate it with a node because pfn_to_node 320 + * relies on having page with the proper node. 
321 + */ 322 + for (i = 0; i < PAGES_PER_SECTION; i++) { 323 + unsigned long pfn = phys_start_pfn + i; 324 + struct page *page; 325 + if (!pfn_valid(pfn)) 326 + continue; 525 327 526 - if (ret < 0) 527 - return ret; 328 + page = pfn_to_page(pfn); 329 + set_page_node(page, nid); 330 + SetPageReserved(page); 331 + } 332 + 333 + if (!want_memblock) 334 + return 0; 528 335 529 336 return register_new_memory(nid, __pfn_to_section(phys_start_pfn)); 530 337 } ··· 344 533 * call this function after deciding the zone to which to 345 534 * add the new pages. 346 535 */ 347 - int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn, 348 - unsigned long nr_pages) 536 + int __ref __add_pages(int nid, unsigned long phys_start_pfn, 537 + unsigned long nr_pages, bool want_memblock) 349 538 { 350 539 unsigned long i; 351 540 int err = 0; 352 541 int start_sec, end_sec; 353 542 struct vmem_altmap *altmap; 354 - 355 - clear_zone_contiguous(zone); 356 543 357 544 /* during initialize mem_map, align hot-added range to section */ 358 545 start_sec = pfn_to_section_nr(phys_start_pfn); ··· 371 562 } 372 563 373 564 for (i = start_sec; i <= end_sec; i++) { 374 - err = __add_section(nid, zone, section_nr_to_pfn(i)); 565 + err = __add_section(nid, section_nr_to_pfn(i), want_memblock); 375 566 376 567 /* 377 568 * EEXIST is finally dealt with by ioresource collision ··· 384 575 } 385 576 vmemmap_populate_print_last(); 386 577 out: 387 - set_zone_contiguous(zone); 388 578 return err; 389 579 } 390 580 EXPORT_SYMBOL_GPL(__add_pages); ··· 747 939 unsigned long i; 748 940 unsigned long onlined_pages = *(unsigned long *)arg; 749 941 struct page *page; 942 + 750 943 if (PageReserved(pfn_to_page(start_pfn))) 751 944 for (i = 0; i < nr_pages; i++) { 752 945 page = pfn_to_page(start_pfn + i); 753 946 (*online_page_callback)(page); 754 947 onlined_pages++; 755 948 } 949 + 950 + online_mem_sections(start_pfn, start_pfn + nr_pages); 951 + 756 952 *(unsigned long *)arg = 
onlined_pages; 757 953 return 0; 758 954 } 759 - 760 - #ifdef CONFIG_MOVABLE_NODE 761 - /* 762 - * When CONFIG_MOVABLE_NODE, we permit onlining of a node which doesn't have 763 - * normal memory. 764 - */ 765 - static bool can_online_high_movable(struct zone *zone) 766 - { 767 - return true; 768 - } 769 - #else /* CONFIG_MOVABLE_NODE */ 770 - /* ensure every online node has NORMAL memory */ 771 - static bool can_online_high_movable(struct zone *zone) 772 - { 773 - return node_state(zone_to_nid(zone), N_NORMAL_MEMORY); 774 - } 775 - #endif /* CONFIG_MOVABLE_NODE */ 776 955 777 956 /* check which state of node_states will be changed when online memory */ 778 957 static void node_states_check_changes_online(unsigned long nr_pages, ··· 835 1040 node_set_state(node, N_MEMORY); 836 1041 } 837 1042 838 - bool zone_can_shift(unsigned long pfn, unsigned long nr_pages, 839 - enum zone_type target, int *zone_shift) 1043 + bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages, int online_type) 840 1044 { 841 - struct zone *zone = page_zone(pfn_to_page(pfn)); 842 - enum zone_type idx = zone_idx(zone); 843 - int i; 1045 + struct pglist_data *pgdat = NODE_DATA(nid); 1046 + struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE]; 1047 + struct zone *default_zone = default_zone_for_pfn(nid, pfn, nr_pages); 844 1048 845 - *zone_shift = 0; 846 - 847 - if (idx < target) { 848 - /* pages must be at end of current zone */ 849 - if (pfn + nr_pages != zone_end_pfn(zone)) 850 - return false; 851 - 852 - /* no zones in use between current zone and target */ 853 - for (i = idx + 1; i < target; i++) 854 - if (zone_is_initialized(zone - idx + i)) 855 - return false; 1049 + /* 1050 + * TODO there shouldn't be any inherent reason to have ZONE_NORMAL 1051 + * physically before ZONE_MOVABLE. All we need is they do not 1052 + * overlap. Historically we didn't allow ZONE_NORMAL after ZONE_MOVABLE 1053 + * though so let's stick with it for simplicity for now. 
1054 + * TODO make sure we do not overlap with ZONE_DEVICE 1055 + */ 1056 + if (online_type == MMOP_ONLINE_KERNEL) { 1057 + if (zone_is_empty(movable_zone)) 1058 + return true; 1059 + return movable_zone->zone_start_pfn >= pfn + nr_pages; 1060 + } else if (online_type == MMOP_ONLINE_MOVABLE) { 1061 + return zone_end_pfn(default_zone) <= pfn; 856 1062 } 857 1063 858 - if (target < idx) { 859 - /* pages must be at beginning of current zone */ 860 - if (pfn != zone->zone_start_pfn) 861 - return false; 1064 + /* MMOP_ONLINE_KEEP will always succeed and inherits the current zone */ 1065 + return online_type == MMOP_ONLINE_KEEP; 1066 + } 862 1067 863 - /* no zones in use between current zone and target */ 864 - for (i = target + 1; i < idx; i++) 865 - if (zone_is_initialized(zone - idx + i)) 866 - return false; 1068 + static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn, 1069 + unsigned long nr_pages) 1070 + { 1071 + unsigned long old_end_pfn = zone_end_pfn(zone); 1072 + 1073 + if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn) 1074 + zone->zone_start_pfn = start_pfn; 1075 + 1076 + zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn; 1077 + } 1078 + 1079 + static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn, 1080 + unsigned long nr_pages) 1081 + { 1082 + unsigned long old_end_pfn = pgdat_end_pfn(pgdat); 1083 + 1084 + if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn) 1085 + pgdat->node_start_pfn = start_pfn; 1086 + 1087 + pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn; 1088 + } 1089 + 1090 + void __ref move_pfn_range_to_zone(struct zone *zone, 1091 + unsigned long start_pfn, unsigned long nr_pages) 1092 + { 1093 + struct pglist_data *pgdat = zone->zone_pgdat; 1094 + int nid = pgdat->node_id; 1095 + unsigned long flags; 1096 + 1097 + if (zone_is_empty(zone)) 1098 + init_currently_empty_zone(zone, 
start_pfn, nr_pages); 1099 + 1100 + clear_zone_contiguous(zone); 1101 + 1102 + /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */ 1103 + pgdat_resize_lock(pgdat, &flags); 1104 + zone_span_writelock(zone); 1105 + resize_zone_range(zone, start_pfn, nr_pages); 1106 + zone_span_writeunlock(zone); 1107 + resize_pgdat_range(pgdat, start_pfn, nr_pages); 1108 + pgdat_resize_unlock(pgdat, &flags); 1109 + 1110 + /* 1111 + * TODO now we have a visible range of pages which are not associated 1112 + * with their zone properly. Not nice but set_pfnblock_flags_mask 1113 + * expects the zone spans the pfn range. All the pages in the range 1114 + * are reserved so nobody should be touching them so we should be safe 1115 + */ 1116 + memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, MEMMAP_HOTPLUG); 1117 + 1118 + set_zone_contiguous(zone); 1119 + } 1120 + 1121 + /* 1122 + * Returns a default kernel memory zone for the given pfn range. 1123 + * If no kernel zone covers this pfn range it will automatically go 1124 + * to the ZONE_NORMAL. 1125 + */ 1126 + struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn, 1127 + unsigned long nr_pages) 1128 + { 1129 + struct pglist_data *pgdat = NODE_DATA(nid); 1130 + int zid; 1131 + 1132 + for (zid = 0; zid <= ZONE_NORMAL; zid++) { 1133 + struct zone *zone = &pgdat->node_zones[zid]; 1134 + 1135 + if (zone_intersects(zone, start_pfn, nr_pages)) 1136 + return zone; 867 1137 } 868 1138 869 - *zone_shift = target - idx; 870 - return true; 1139 + return &pgdat->node_zones[ZONE_NORMAL]; 1140 + } 1141 + 1142 + /* 1143 + * Associates the given pfn range with the given node and the zone appropriate 1144 + * for the given online type. 
1145 + */ 1146 + static struct zone * __meminit move_pfn_range(int online_type, int nid, 1147 + unsigned long start_pfn, unsigned long nr_pages) 1148 + { 1149 + struct pglist_data *pgdat = NODE_DATA(nid); 1150 + struct zone *zone = default_zone_for_pfn(nid, start_pfn, nr_pages); 1151 + 1152 + if (online_type == MMOP_ONLINE_KEEP) { 1153 + struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE]; 1154 + /* 1155 + * MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use 1156 + * movable zone if that is not possible (e.g. we are within 1157 + * or past the existing movable zone) 1158 + */ 1159 + if (!allow_online_pfn_range(nid, start_pfn, nr_pages, 1160 + MMOP_ONLINE_KERNEL)) 1161 + zone = movable_zone; 1162 + } else if (online_type == MMOP_ONLINE_MOVABLE) { 1163 + zone = &pgdat->node_zones[ZONE_MOVABLE]; 1164 + } 1165 + 1166 + move_pfn_range_to_zone(zone, start_pfn, nr_pages); 1167 + return zone; 871 1168 } 872 1169 873 1170 /* Must be protected by mem_hotplug_begin() */ ··· 972 1085 int nid; 973 1086 int ret; 974 1087 struct memory_notify arg; 975 - int zone_shift = 0; 976 1088 977 - /* 978 - * This doesn't need a lock to do pfn_to_page(). 979 - * The section can't be removed here because of the 980 - * memory_block->state_mutex. 
981 - */ 982 - zone = page_zone(pfn_to_page(pfn)); 983 - 984 - if ((zone_idx(zone) > ZONE_NORMAL || 985 - online_type == MMOP_ONLINE_MOVABLE) && 986 - !can_online_high_movable(zone)) 1089 + nid = pfn_to_nid(pfn); 1090 + if (!allow_online_pfn_range(nid, pfn, nr_pages, online_type)) 987 1091 return -EINVAL; 988 1092 989 - if (online_type == MMOP_ONLINE_KERNEL) { 990 - if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift)) 991 - return -EINVAL; 992 - } else if (online_type == MMOP_ONLINE_MOVABLE) { 993 - if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift)) 994 - return -EINVAL; 995 - } 996 - 997 - zone = move_pfn_range(zone_shift, pfn, pfn + nr_pages); 998 - if (!zone) 999 - return -EINVAL; 1093 + /* associate pfn range with the zone */ 1094 + zone = move_pfn_range(online_type, nid, pfn, nr_pages); 1000 1095 1001 1096 arg.start_pfn = pfn; 1002 1097 arg.nr_pages = nr_pages; 1003 1098 node_states_check_changes_online(nr_pages, zone, &arg); 1004 - 1005 - nid = zone_to_nid(zone); 1006 1099 1007 1100 ret = memory_notify(MEM_GOING_ONLINE, &arg); 1008 1101 ret = notifier_to_errno(ret); ··· 1178 1311 return 0; 1179 1312 } 1180 1313 1181 - /* 1182 - * If movable zone has already been setup, newly added memory should be check. 1183 - * If its address is higher than movable zone, it should be added as movable. 1184 - * Without this check, movable zone may overlap with other zone. 
1185 - */ 1186 - static int should_add_memory_movable(int nid, u64 start, u64 size) 1187 - { 1188 - unsigned long start_pfn = start >> PAGE_SHIFT; 1189 - pg_data_t *pgdat = NODE_DATA(nid); 1190 - struct zone *movable_zone = pgdat->node_zones + ZONE_MOVABLE; 1191 - 1192 - if (zone_is_empty(movable_zone)) 1193 - return 0; 1194 - 1195 - if (movable_zone->zone_start_pfn <= start_pfn) 1196 - return 1; 1197 - 1198 - return 0; 1199 - } 1200 - 1201 - int zone_for_memory(int nid, u64 start, u64 size, int zone_default, 1202 - bool for_device) 1203 - { 1204 - #ifdef CONFIG_ZONE_DEVICE 1205 - if (for_device) 1206 - return ZONE_DEVICE; 1207 - #endif 1208 - if (should_add_memory_movable(nid, start, size)) 1209 - return ZONE_MOVABLE; 1210 - 1211 - return zone_default; 1212 - } 1213 - 1214 1314 static int online_memory_block(struct memory_block *mem, void *arg) 1215 1315 { 1216 1316 return device_online(&mem->dev); ··· 1223 1389 } 1224 1390 1225 1391 /* call arch's memory hotadd */ 1226 - ret = arch_add_memory(nid, start, size, false); 1392 + ret = arch_add_memory(nid, start, size, true); 1227 1393 1228 1394 if (ret < 0) 1229 1395 goto error; ··· 1232 1398 node_set_online(nid); 1233 1399 1234 1400 if (new_node) { 1235 - ret = register_one_node(nid); 1401 + unsigned long start_pfn = start >> PAGE_SHIFT; 1402 + unsigned long nr_pages = size >> PAGE_SHIFT; 1403 + 1404 + ret = __register_one_node(nid); 1405 + if (ret) 1406 + goto register_fail; 1407 + 1408 + /* 1409 + * link memory sections under this node. This is already 1410 + * done when creatig memory section in register_new_memory 1411 + * but that depends to have the node registered so offline 1412 + * nodes have to go through register_node. 1413 + * TODO clean up this mess. 1414 + */ 1415 + ret = link_mem_sections(nid, start_pfn, nr_pages); 1416 + register_fail: 1236 1417 /* 1237 1418 * If sysfs file of new node can't create, cpu on the node 1238 1419 * can't be hot-added. There is no rollback way now. 
··· 1441 1592 gfp_mask |= __GFP_HIGHMEM; 1442 1593 1443 1594 if (!nodes_empty(nmask)) 1444 - new_page = __alloc_pages_nodemask(gfp_mask, 0, 1445 - node_zonelist(nid, gfp_mask), &nmask); 1595 + new_page = __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask); 1446 1596 if (!new_page) 1447 - new_page = __alloc_pages(gfp_mask, 0, 1448 - node_zonelist(nid, gfp_mask)); 1597 + new_page = __alloc_pages(gfp_mask, 0, nid); 1449 1598 1450 1599 return new_page; 1451 1600 } ··· 1572 1725 return offlined; 1573 1726 } 1574 1727 1575 - #ifdef CONFIG_MOVABLE_NODE 1576 - /* 1577 - * When CONFIG_MOVABLE_NODE, we permit offlining of a node which doesn't have 1578 - * normal memory. 1579 - */ 1580 - static bool can_offline_normal(struct zone *zone, unsigned long nr_pages) 1581 - { 1582 - return true; 1583 - } 1584 - #else /* CONFIG_MOVABLE_NODE */ 1585 - /* ensure the node has NORMAL memory if it is still online */ 1586 - static bool can_offline_normal(struct zone *zone, unsigned long nr_pages) 1587 - { 1588 - struct pglist_data *pgdat = zone->zone_pgdat; 1589 - unsigned long present_pages = 0; 1590 - enum zone_type zt; 1591 - 1592 - for (zt = 0; zt <= ZONE_NORMAL; zt++) 1593 - present_pages += pgdat->node_zones[zt].present_pages; 1594 - 1595 - if (present_pages > nr_pages) 1596 - return true; 1597 - 1598 - present_pages = 0; 1599 - for (; zt <= ZONE_MOVABLE; zt++) 1600 - present_pages += pgdat->node_zones[zt].present_pages; 1601 - 1602 - /* 1603 - * we can't offline the last normal memory until all 1604 - * higher memory is offlined. 
1605 - */ 1606 - return present_pages == 0; 1607 - } 1608 - #endif /* CONFIG_MOVABLE_NODE */ 1609 - 1610 1728 static int __init cmdline_parse_movable_node(char *p) 1611 1729 { 1612 - #ifdef CONFIG_MOVABLE_NODE 1730 + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP 1613 1731 movable_node_enabled = true; 1614 1732 #else 1615 - pr_warn("movable_node option not supported\n"); 1733 + pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n"); 1616 1734 #endif 1617 1735 return 0; 1618 1736 } ··· 1698 1886 zone = page_zone(pfn_to_page(valid_start)); 1699 1887 node = zone_to_nid(zone); 1700 1888 nr_pages = end_pfn - start_pfn; 1701 - 1702 - if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages)) 1703 - return -EINVAL; 1704 1889 1705 1890 /* set above range as isolated */ 1706 1891 ret = start_isolate_page_range(start_pfn, end_pfn,
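The hotplug migration path above (`new_page` in the first hunk) tries a nodemask-constrained allocation first and only falls back to an unconstrained one on the preferred node if that fails. A minimal sketch of that two-step fallback, with a toy allocator standing in for `__alloc_pages_nodemask()`/`__alloc_pages()` (all names and the bitmask nodemask are illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

struct toy_page { int nid; };

/* Pretend allocator: succeeds only if nid is allowed by the optional
 * nodemask AND the node actually has online memory. */
static struct toy_page *alloc_on(int nid, const unsigned int *nmask,
                                 unsigned int online_mask,
                                 struct toy_page *slot)
{
    if (nmask && !(*nmask & (1u << nid)))
        return NULL;            /* nid excluded by the nodemask */
    if (!(online_mask & (1u << nid)))
        return NULL;            /* no usable memory on this node */
    slot->nid = nid;
    return slot;
}

/* Mirrors the shape of new_page(): constrained attempt first,
 * then retry on the preferred node without the mask. */
static struct toy_page *toy_new_page(int nid, unsigned int nmask,
                                     unsigned int online_mask,
                                     struct toy_page *slot)
{
    struct toy_page *page = NULL;

    if (nmask)                  /* !nodes_empty(nmask) analogue */
        page = alloc_on(nid, &nmask, online_mask, slot);
    if (!page)
        page = alloc_on(nid, NULL, online_mask, slot);
    return page;
}
```

The point of the ordering is that pages are kept near the preferred nodes when possible, but offlining still makes progress when the mask and the available memory do not intersect.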
+44 -137
mm/mempolicy.c
··· 146 146 147 147 static const struct mempolicy_operations { 148 148 int (*create)(struct mempolicy *pol, const nodemask_t *nodes); 149 - /* 150 - * If read-side task has no lock to protect task->mempolicy, write-side 151 - * task will rebind the task->mempolicy by two step. The first step is 152 - * setting all the newly nodes, and the second step is cleaning all the 153 - * disallowed nodes. In this way, we can avoid finding no node to alloc 154 - * page. 155 - * If we have a lock to protect task->mempolicy in read-side, we do 156 - * rebind directly. 157 - * 158 - * step: 159 - * MPOL_REBIND_ONCE - do rebind work at once 160 - * MPOL_REBIND_STEP1 - set all the newly nodes 161 - * MPOL_REBIND_STEP2 - clean all the disallowed nodes 162 - */ 163 - void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes, 164 - enum mpol_rebind_step step); 149 + void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes); 165 150 } mpol_ops[MPOL_MAX]; 166 151 167 152 static inline int mpol_store_user_nodemask(const struct mempolicy *pol) ··· 289 304 kmem_cache_free(policy_cache, p); 290 305 } 291 306 292 - static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes, 293 - enum mpol_rebind_step step) 307 + static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes) 294 308 { 295 309 } 296 310 297 - /* 298 - * step: 299 - * MPOL_REBIND_ONCE - do rebind work at once 300 - * MPOL_REBIND_STEP1 - set all the newly nodes 301 - * MPOL_REBIND_STEP2 - clean all the disallowed nodes 302 - */ 303 - static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes, 304 - enum mpol_rebind_step step) 311 + static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) 305 312 { 306 313 nodemask_t tmp; 307 314 ··· 302 325 else if (pol->flags & MPOL_F_RELATIVE_NODES) 303 326 mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes); 304 327 else { 305 - /* 306 - * if step == 1, we use ->w.cpuset_mems_allowed to cache 
the 307 - * result 308 - */ 309 - if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP1) { 310 - nodes_remap(tmp, pol->v.nodes, 311 - pol->w.cpuset_mems_allowed, *nodes); 312 - pol->w.cpuset_mems_allowed = step ? tmp : *nodes; 313 - } else if (step == MPOL_REBIND_STEP2) { 314 - tmp = pol->w.cpuset_mems_allowed; 315 - pol->w.cpuset_mems_allowed = *nodes; 316 - } else 317 - BUG(); 328 + nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed, 329 + *nodes); 330 + pol->w.cpuset_mems_allowed = tmp; 318 331 } 319 332 320 333 if (nodes_empty(tmp)) 321 334 tmp = *nodes; 322 335 323 - if (step == MPOL_REBIND_STEP1) 324 - nodes_or(pol->v.nodes, pol->v.nodes, tmp); 325 - else if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP2) 326 - pol->v.nodes = tmp; 327 - else 328 - BUG(); 329 - 330 - if (!node_isset(current->il_next, tmp)) { 331 - current->il_next = next_node_in(current->il_next, tmp); 332 - if (current->il_next >= MAX_NUMNODES) 333 - current->il_next = numa_node_id(); 334 - } 336 + pol->v.nodes = tmp; 335 337 } 336 338 337 339 static void mpol_rebind_preferred(struct mempolicy *pol, 338 - const nodemask_t *nodes, 339 - enum mpol_rebind_step step) 340 + const nodemask_t *nodes) 340 341 { 341 342 nodemask_t tmp; 342 343 ··· 340 385 /* 341 386 * mpol_rebind_policy - Migrate a policy to a different set of nodes 342 387 * 343 - * If read-side task has no lock to protect task->mempolicy, write-side 344 - * task will rebind the task->mempolicy by two step. The first step is 345 - * setting all the newly nodes, and the second step is cleaning all the 346 - * disallowed nodes. In this way, we can avoid finding no node to alloc 347 - * page. 348 - * If we have a lock to protect task->mempolicy in read-side, we do 349 - * rebind directly. 350 - * 351 - * step: 352 - * MPOL_REBIND_ONCE - do rebind work at once 353 - * MPOL_REBIND_STEP1 - set all the newly nodes 354 - * MPOL_REBIND_STEP2 - clean all the disallowed nodes 388 + * Per-vma policies are protected by mmap_sem. 
Allocations using per-task 389 + * policies are protected by task->mems_allowed_seq to prevent a premature 390 + * OOM/allocation failure due to parallel nodemask modification. 355 391 */ 356 - static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask, 357 - enum mpol_rebind_step step) 392 + static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask) 358 393 { 359 394 if (!pol) 360 395 return; 361 - if (!mpol_store_user_nodemask(pol) && step == MPOL_REBIND_ONCE && 396 + if (!mpol_store_user_nodemask(pol) && 362 397 nodes_equal(pol->w.cpuset_mems_allowed, *newmask)) 363 398 return; 364 399 365 - if (step == MPOL_REBIND_STEP1 && (pol->flags & MPOL_F_REBINDING)) 366 - return; 367 - 368 - if (step == MPOL_REBIND_STEP2 && !(pol->flags & MPOL_F_REBINDING)) 369 - BUG(); 370 - 371 - if (step == MPOL_REBIND_STEP1) 372 - pol->flags |= MPOL_F_REBINDING; 373 - else if (step == MPOL_REBIND_STEP2) 374 - pol->flags &= ~MPOL_F_REBINDING; 375 - else if (step >= MPOL_REBIND_NSTEP) 376 - BUG(); 377 - 378 - mpol_ops[pol->mode].rebind(pol, newmask, step); 400 + mpol_ops[pol->mode].rebind(pol, newmask); 379 401 } 380 402 381 403 /* ··· 362 430 * Called with task's alloc_lock held. 
363 431 */ 364 432 365 - void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new, 366 - enum mpol_rebind_step step) 433 + void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new) 367 434 { 368 - mpol_rebind_policy(tsk->mempolicy, new, step); 435 + mpol_rebind_policy(tsk->mempolicy, new); 369 436 } 370 437 371 438 /* ··· 379 448 380 449 down_write(&mm->mmap_sem); 381 450 for (vma = mm->mmap; vma; vma = vma->vm_next) 382 - mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE); 451 + mpol_rebind_policy(vma->vm_policy, new); 383 452 up_write(&mm->mmap_sem); 384 453 } 385 454 ··· 743 812 } 744 813 old = current->mempolicy; 745 814 current->mempolicy = new; 746 - if (new && new->mode == MPOL_INTERLEAVE && 747 - nodes_weight(new->v.nodes)) 748 - current->il_next = first_node(new->v.nodes); 815 + if (new && new->mode == MPOL_INTERLEAVE) 816 + current->il_prev = MAX_NUMNODES-1; 749 817 task_unlock(current); 750 818 mpol_put(old); 751 819 ret = 0; ··· 846 916 *policy = err; 847 917 } else if (pol == current->mempolicy && 848 918 pol->mode == MPOL_INTERLEAVE) { 849 - *policy = current->il_next; 919 + *policy = next_node_in(current->il_prev, pol->v.nodes); 850 920 } else { 851 921 err = -EINVAL; 852 922 goto out; ··· 1606 1676 return NULL; 1607 1677 } 1608 1678 1609 - /* Return a zonelist indicated by gfp for node representing a mempolicy */ 1610 - static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy, 1611 - int nd) 1679 + /* Return the node id preferred by the given mempolicy, or the given id */ 1680 + static int policy_node(gfp_t gfp, struct mempolicy *policy, 1681 + int nd) 1612 1682 { 1613 1683 if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) 1614 1684 nd = policy->v.preferred_node; ··· 1621 1691 WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE)); 1622 1692 } 1623 1693 1624 - return node_zonelist(nd, gfp); 1694 + return nd; 1625 1695 } 1626 1696 1627 1697 /* Do dynamic interleaving 
for a process */ 1628 1698 static unsigned interleave_nodes(struct mempolicy *policy) 1629 1699 { 1630 - unsigned nid, next; 1700 + unsigned next; 1631 1701 struct task_struct *me = current; 1632 1702 1633 - nid = me->il_next; 1634 - next = next_node_in(nid, policy->v.nodes); 1703 + next = next_node_in(me->il_prev, policy->v.nodes); 1635 1704 if (next < MAX_NUMNODES) 1636 - me->il_next = next; 1637 - return nid; 1705 + me->il_prev = next; 1706 + return next; 1638 1707 } 1639 1708 1640 1709 /* ··· 1728 1799 1729 1800 #ifdef CONFIG_HUGETLBFS 1730 1801 /* 1731 - * huge_zonelist(@vma, @addr, @gfp_flags, @mpol) 1802 + * huge_node(@vma, @addr, @gfp_flags, @mpol) 1732 1803 * @vma: virtual memory area whose policy is sought 1733 1804 * @addr: address in @vma for shared policy lookup and interleave policy 1734 1805 * @gfp_flags: for requested zone 1735 1806 * @mpol: pointer to mempolicy pointer for reference counted mempolicy 1736 1807 * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask 1737 1808 * 1738 - * Returns a zonelist suitable for a huge page allocation and a pointer 1809 + * Returns a nid suitable for a huge page allocation and a pointer 1739 1810 * to the struct mempolicy for conditional unref after allocation. 1740 1811 * If the effective policy is 'BIND, returns a pointer to the mempolicy's 1741 1812 * @nodemask for filtering the zonelist. 
1742 1813 * 1743 1814 * Must be protected by read_mems_allowed_begin() 1744 1815 */ 1745 - struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr, 1746 - gfp_t gfp_flags, struct mempolicy **mpol, 1747 - nodemask_t **nodemask) 1816 + int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, 1817 + struct mempolicy **mpol, nodemask_t **nodemask) 1748 1818 { 1749 - struct zonelist *zl; 1819 + int nid; 1750 1820 1751 1821 *mpol = get_vma_policy(vma, addr); 1752 1822 *nodemask = NULL; /* assume !MPOL_BIND */ 1753 1823 1754 1824 if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) { 1755 - zl = node_zonelist(interleave_nid(*mpol, vma, addr, 1756 - huge_page_shift(hstate_vma(vma))), gfp_flags); 1825 + nid = interleave_nid(*mpol, vma, addr, 1826 + huge_page_shift(hstate_vma(vma))); 1757 1827 } else { 1758 - zl = policy_zonelist(gfp_flags, *mpol, numa_node_id()); 1828 + nid = policy_node(gfp_flags, *mpol, numa_node_id()); 1759 1829 if ((*mpol)->mode == MPOL_BIND) 1760 1830 *nodemask = &(*mpol)->v.nodes; 1761 1831 } 1762 - return zl; 1832 + return nid; 1763 1833 } 1764 1834 1765 1835 /* ··· 1860 1932 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order, 1861 1933 unsigned nid) 1862 1934 { 1863 - struct zonelist *zl; 1864 1935 struct page *page; 1865 1936 1866 - zl = node_zonelist(nid, gfp); 1867 - page = __alloc_pages(gfp, order, zl); 1868 - if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0])) 1937 + page = __alloc_pages(gfp, order, nid); 1938 + if (page && page_to_nid(page) == nid) 1869 1939 inc_zone_page_state(page, NUMA_INTERLEAVE_HIT); 1870 1940 return page; 1871 1941 } ··· 1897 1971 { 1898 1972 struct mempolicy *pol; 1899 1973 struct page *page; 1900 - unsigned int cpuset_mems_cookie; 1901 - struct zonelist *zl; 1974 + int preferred_nid; 1902 1975 nodemask_t *nmask; 1903 1976 1904 - retry_cpuset: 1905 1977 pol = get_vma_policy(vma, addr); 1906 - cpuset_mems_cookie = read_mems_allowed_begin(); 1907 1978 
1908 1979 if (pol->mode == MPOL_INTERLEAVE) { 1909 1980 unsigned nid; ··· 1938 2015 } 1939 2016 1940 2017 nmask = policy_nodemask(gfp, pol); 1941 - zl = policy_zonelist(gfp, pol, node); 1942 - page = __alloc_pages_nodemask(gfp, order, zl, nmask); 2018 + preferred_nid = policy_node(gfp, pol, node); 2019 + page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask); 1943 2020 mpol_cond_put(pol); 1944 2021 out: 1945 - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) 1946 - goto retry_cpuset; 1947 2022 return page; 1948 2023 } 1949 2024 ··· 1959 2038 * Allocate a page from the kernel page pool. When not in 1960 2039 * interrupt context and apply the current process NUMA policy. 1961 2040 * Returns NULL when no page can be allocated. 1962 - * 1963 - * Don't call cpuset_update_task_memory_state() unless 1964 - * 1) it's ok to take cpuset_sem (can WAIT), and 1965 - * 2) allocating for current task (not interrupt). 1966 2041 */ 1967 2042 struct page *alloc_pages_current(gfp_t gfp, unsigned order) 1968 2043 { 1969 2044 struct mempolicy *pol = &default_policy; 1970 2045 struct page *page; 1971 - unsigned int cpuset_mems_cookie; 1972 2046 1973 2047 if (!in_interrupt() && !(gfp & __GFP_THISNODE)) 1974 2048 pol = get_task_policy(current); 1975 - 1976 - retry_cpuset: 1977 - cpuset_mems_cookie = read_mems_allowed_begin(); 1978 2049 1979 2050 /* 1980 2051 * No reference counting needed for current->mempolicy ··· 1976 2063 page = alloc_page_interleave(gfp, order, interleave_nodes(pol)); 1977 2064 else 1978 2065 page = __alloc_pages_nodemask(gfp, order, 1979 - policy_zonelist(gfp, pol, numa_node_id()), 2066 + policy_node(gfp, pol, numa_node_id()), 1980 2067 policy_nodemask(gfp, pol)); 1981 - 1982 - if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) 1983 - goto retry_cpuset; 1984 2068 1985 2069 return page; 1986 2070 } ··· 2022 2112 2023 2113 if (current_cpuset_is_being_rebound()) { 2024 2114 nodemask_t mems = cpuset_mems_allowed(current); 2025 
- if (new->flags & MPOL_F_REBINDING) 2026 - mpol_rebind_policy(new, &mems, MPOL_REBIND_STEP2); 2027 - else 2028 - mpol_rebind_policy(new, &mems, MPOL_REBIND_ONCE); 2115 + mpol_rebind_policy(new, &mems); 2029 2116 } 2030 2117 atomic_set(&new->refcnt, 1); 2031 2118 return new;
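The mempolicy hunks above replace `current->il_next` ("node to use next") with `current->il_prev` ("node used last"), and `set_mempolicy` now seeds `il_prev` with `MAX_NUMNODES-1` so the first allocation wraps around to the first allowed node. The round-robin step can be sketched over a plain bitmask standing in for the nodemask (struct and constants here are illustrative, not kernel API):

```c
#include <assert.h>

#define MAX_NODES 32

/* Toy next_node_in(): first set bit strictly after prev, wrapping;
 * returns MAX_NODES for an empty mask ("no node"). */
static int toy_next_node_in(int prev, unsigned int mask)
{
    for (int i = 1; i <= MAX_NODES; i++) {
        int nid = (prev + i) % MAX_NODES;
        if (mask & (1u << nid))
            return nid;
    }
    return MAX_NODES;
}

struct toy_task { int il_prev; };

/* Mirrors the rewritten interleave_nodes(): pick the node after
 * il_prev and remember it as the new il_prev. */
static int toy_interleave_nodes(struct toy_task *t, unsigned int allowed)
{
    int next = toy_next_node_in(t->il_prev, allowed);

    if (next < MAX_NODES)
        t->il_prev = next;
    return next;
}
```

Tracking the last-used node instead of precomputing the next one is what lets the rebind code drop the two-step `MPOL_REBIND_STEP1`/`STEP2` dance: there is no cached "next" value that a concurrent nodemask update could invalidate.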
+11 -10
mm/migrate.c
··· 227 227 if (is_write_migration_entry(entry)) 228 228 pte = maybe_mkwrite(pte, vma); 229 229 230 + flush_dcache_page(new); 230 231 #ifdef CONFIG_HUGETLB_PAGE 231 232 if (PageHuge(new)) { 232 233 pte = pte_mkhuge(pte); 233 234 pte = arch_make_huge_pte(pte, vma, new, 0); 234 - } 235 - #endif 236 - flush_dcache_page(new); 237 - set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); 238 - 239 - if (PageHuge(new)) { 235 + set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); 240 236 if (PageAnon(new)) 241 237 hugepage_add_anon_rmap(new, vma, pvmw.address); 242 238 else 243 239 page_dup_rmap(new, true); 244 - } else if (PageAnon(new)) 245 - page_add_anon_rmap(new, vma, pvmw.address, false); 246 - else 247 - page_add_file_rmap(new, false); 240 + } else 241 + #endif 242 + { 243 + set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); 248 244 245 + if (PageAnon(new)) 246 + page_add_anon_rmap(new, vma, pvmw.address, false); 247 + else 248 + page_add_file_rmap(new, false); 249 + } 249 250 if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new)) 250 251 mlock_vma_page(new); 251 252
+1 -1
mm/mmap.c
··· 94 94 * w: (no) no 95 95 * x: (yes) yes 96 96 */ 97 - pgprot_t protection_map[16] = { 97 + pgprot_t protection_map[16] __ro_after_init = { 98 98 __P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111, 99 99 __S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111 100 100 };
-2
mm/mprotect.c
··· 58 58 * reading. 59 59 */ 60 60 pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); 61 - if (!pte) 62 - return 0; 63 61 64 62 /* Get target node for single threaded private VMAs */ 65 63 if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
+1 -1
mm/nobootmem.c
··· 118 118 unsigned long end_pfn = min_t(unsigned long, 119 119 PFN_DOWN(end), max_low_pfn); 120 120 121 - if (start_pfn > end_pfn) 121 + if (start_pfn >= end_pfn) 122 122 return 0; 123 123 124 124 __free_pages_memory(start_pfn, end_pfn);
+5
mm/oom_kill.c
··· 876 876 /* Get a reference to safely compare mm after task_unlock(victim) */ 877 877 mm = victim->mm; 878 878 mmgrab(mm); 879 + 880 + /* Raise event before sending signal: task reaper must see this */ 881 + count_vm_event(OOM_KILL); 882 + count_memcg_event_mm(mm, OOM_KILL); 883 + 879 884 /* 880 885 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent 881 886 * the OOM victim from depleting the memory reserves from the user
+5 -10
mm/page-writeback.c
··· 2433 2433 inode_attach_wb(inode, page); 2434 2434 wb = inode_to_wb(inode); 2435 2435 2436 - inc_memcg_page_state(page, NR_FILE_DIRTY); 2437 - __inc_node_page_state(page, NR_FILE_DIRTY); 2436 + __inc_lruvec_page_state(page, NR_FILE_DIRTY); 2438 2437 __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); 2439 2438 __inc_node_page_state(page, NR_DIRTIED); 2440 2439 __inc_wb_stat(wb, WB_RECLAIMABLE); ··· 2454 2455 struct bdi_writeback *wb) 2455 2456 { 2456 2457 if (mapping_cap_account_dirty(mapping)) { 2457 - dec_memcg_page_state(page, NR_FILE_DIRTY); 2458 - dec_node_page_state(page, NR_FILE_DIRTY); 2458 + dec_lruvec_page_state(page, NR_FILE_DIRTY); 2459 2459 dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); 2460 2460 dec_wb_stat(wb, WB_RECLAIMABLE); 2461 2461 task_io_account_cancelled_write(PAGE_SIZE); ··· 2710 2712 */ 2711 2713 wb = unlocked_inode_to_wb_begin(inode, &locked); 2712 2714 if (TestClearPageDirty(page)) { 2713 - dec_memcg_page_state(page, NR_FILE_DIRTY); 2714 - dec_node_page_state(page, NR_FILE_DIRTY); 2715 + dec_lruvec_page_state(page, NR_FILE_DIRTY); 2715 2716 dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); 2716 2717 dec_wb_stat(wb, WB_RECLAIMABLE); 2717 2718 ret = 1; ··· 2756 2759 ret = TestClearPageWriteback(page); 2757 2760 } 2758 2761 if (ret) { 2759 - dec_memcg_page_state(page, NR_WRITEBACK); 2760 - dec_node_page_state(page, NR_WRITEBACK); 2762 + dec_lruvec_page_state(page, NR_WRITEBACK); 2761 2763 dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); 2762 2764 inc_node_page_state(page, NR_WRITTEN); 2763 2765 } ··· 2810 2814 ret = TestSetPageWriteback(page); 2811 2815 } 2812 2816 if (!ret) { 2813 - inc_memcg_page_state(page, NR_WRITEBACK); 2814 - inc_node_page_state(page, NR_WRITEBACK); 2817 + inc_lruvec_page_state(page, NR_WRITEBACK); 2815 2818 inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); 2816 2819 } 2817 2820 unlock_page_memcg(page);
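In the writeback hunks above, every paired `*_memcg_page_state()` + `*_node_page_state()` call collapses into a single `*_lruvec_page_state()` call that updates the node counter and the cgroup counter together. A rough sketch of that consolidation, with toy structs in place of `pglist_data`/`mem_cgroup` (names and layout are illustrative only):

```c
#include <assert.h>

enum toy_stat { TOY_FILE_DIRTY, TOY_WRITEBACK, TOY_NR_STAT };

struct toy_node  { long stat[TOY_NR_STAT]; };  /* per-node counters */
struct toy_memcg { long stat[TOY_NR_STAT]; };  /* per-cgroup counters */

struct toy_lruvec {
    struct toy_node  *node;
    struct toy_memcg *memcg;   /* NULL for the root/no-cgroup case */
};

/* One entry point keeps both counter hierarchies in sync, so callers
 * like account_page_dirtied() can no longer update one and forget
 * the other. */
static void toy_mod_lruvec_state(struct toy_lruvec *lruvec,
                                 enum toy_stat idx, long delta)
{
    lruvec->node->stat[idx] += delta;
    if (lruvec->memcg)
        lruvec->memcg->stat[idx] += delta;
}
```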
+94 -40
mm/page_alloc.c
··· 113 113 #ifdef CONFIG_HIGHMEM 114 114 [N_HIGH_MEMORY] = { { [0] = 1UL } }, 115 115 #endif 116 - #ifdef CONFIG_MOVABLE_NODE 117 116 [N_MEMORY] = { { [0] = 1UL } }, 118 - #endif 119 117 [N_CPU] = { { [0] = 1UL } }, 120 118 #endif /* NUMA */ 121 119 }; ··· 509 511 /* 510 512 * Temporary debugging check for pages not lying within a given zone. 511 513 */ 512 - static int bad_range(struct zone *zone, struct page *page) 514 + static int __maybe_unused bad_range(struct zone *zone, struct page *page) 513 515 { 514 516 if (page_outside_zone_boundaries(zone, page)) 515 517 return 1; ··· 519 521 return 0; 520 522 } 521 523 #else 522 - static inline int bad_range(struct zone *zone, struct page *page) 524 + static inline int __maybe_unused bad_range(struct zone *zone, struct page *page) 523 525 { 524 526 return 0; 525 527 } ··· 1295 1297 #endif 1296 1298 1297 1299 #ifdef CONFIG_NODES_SPAN_OTHER_NODES 1298 - static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, 1299 - struct mminit_pfnnid_cache *state) 1300 + static inline bool __meminit __maybe_unused 1301 + meminit_pfn_in_nid(unsigned long pfn, int node, 1302 + struct mminit_pfnnid_cache *state) 1300 1303 { 1301 1304 int nid; 1302 1305 ··· 1319 1320 { 1320 1321 return true; 1321 1322 } 1322 - static inline bool __meminit meminit_pfn_in_nid(unsigned long pfn, int node, 1323 - struct mminit_pfnnid_cache *state) 1323 + static inline bool __meminit __maybe_unused 1324 + meminit_pfn_in_nid(unsigned long pfn, int node, 1325 + struct mminit_pfnnid_cache *state) 1324 1326 { 1325 1327 return true; 1326 1328 } ··· 1365 1365 if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) 1366 1366 return NULL; 1367 1367 1368 - start_page = pfn_to_page(start_pfn); 1368 + start_page = pfn_to_online_page(start_pfn); 1369 + if (!start_page) 1370 + return NULL; 1369 1371 1370 1372 if (page_zone(start_page) != zone) 1371 1373 return NULL; ··· 3675 3673 return false; 3676 3674 } 3677 3675 3676 + static inline bool 3677 + 
check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) 3678 + { 3679 + /* 3680 + * It's possible that cpuset's mems_allowed and the nodemask from 3681 + * mempolicy don't intersect. This should be normally dealt with by 3682 + * policy_nodemask(), but it's possible to race with cpuset update in 3683 + * such a way the check therein was true, and then it became false 3684 + * before we got our cpuset_mems_cookie here. 3685 + * This assumes that for all allocations, ac->nodemask can come only 3686 + * from MPOL_BIND mempolicy (whose documented semantics is to be ignored 3687 + * when it does not intersect with the cpuset restrictions) or the 3688 + * caller can deal with a violated nodemask. 3689 + */ 3690 + if (cpusets_enabled() && ac->nodemask && 3691 + !cpuset_nodemask_valid_mems_allowed(ac->nodemask)) { 3692 + ac->nodemask = NULL; 3693 + return true; 3694 + } 3695 + 3696 + /* 3697 + * When updating a task's mems_allowed or mempolicy nodemask, it is 3698 + * possible to race with parallel threads in such a way that our 3699 + * allocation can fail while the mask is being updated. If we are about 3700 + * to fail, check if the cpuset changed during allocation and if so, 3701 + * retry. 3702 + */ 3703 + if (read_mems_allowed_retry(cpuset_mems_cookie)) 3704 + return true; 3705 + 3706 + return false; 3707 + } 3708 + 3678 3709 static inline struct page * 3679 3710 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, 3680 3711 struct alloc_context *ac) ··· 3903 3868 &compaction_retries)) 3904 3869 goto retry; 3905 3870 3906 - /* 3907 - * It's possible we raced with cpuset update so the OOM would be 3908 - * premature (see below the nopage: label for full explanation). 
3909 - */ 3910 - if (read_mems_allowed_retry(cpuset_mems_cookie)) 3871 + 3872 + /* Deal with possible cpuset update races before we start OOM killing */ 3873 + if (check_retry_cpuset(cpuset_mems_cookie, ac)) 3911 3874 goto retry_cpuset; 3912 3875 3913 3876 /* Reclaim has failed us, start killing things */ ··· 3926 3893 } 3927 3894 3928 3895 nopage: 3929 - /* 3930 - * When updating a task's mems_allowed or mempolicy nodemask, it is 3931 - * possible to race with parallel threads in such a way that our 3932 - * allocation can fail while the mask is being updated. If we are about 3933 - * to fail, check if the cpuset changed during allocation and if so, 3934 - * retry. 3935 - */ 3936 - if (read_mems_allowed_retry(cpuset_mems_cookie)) 3896 + /* Deal with possible cpuset update races before we fail */ 3897 + if (check_retry_cpuset(cpuset_mems_cookie, ac)) 3937 3898 goto retry_cpuset; 3938 3899 3939 3900 /* ··· 3978 3951 } 3979 3952 3980 3953 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, 3981 - struct zonelist *zonelist, nodemask_t *nodemask, 3954 + int preferred_nid, nodemask_t *nodemask, 3982 3955 struct alloc_context *ac, gfp_t *alloc_mask, 3983 3956 unsigned int *alloc_flags) 3984 3957 { 3985 3958 ac->high_zoneidx = gfp_zone(gfp_mask); 3986 - ac->zonelist = zonelist; 3959 + ac->zonelist = node_zonelist(preferred_nid, gfp_mask); 3987 3960 ac->nodemask = nodemask; 3988 3961 ac->migratetype = gfpflags_to_migratetype(gfp_mask); 3989 3962 ··· 4028 4001 * This is the 'heart' of the zoned buddy allocator. 
4029 4002 */ 4030 4003 struct page * 4031 - __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, 4032 - struct zonelist *zonelist, nodemask_t *nodemask) 4004 + __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, 4005 + nodemask_t *nodemask) 4033 4006 { 4034 4007 struct page *page; 4035 4008 unsigned int alloc_flags = ALLOC_WMARK_LOW; ··· 4037 4010 struct alloc_context ac = { }; 4038 4011 4039 4012 gfp_mask &= gfp_allowed_mask; 4040 - if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags)) 4013 + if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) 4041 4014 return NULL; 4042 4015 4043 4016 finalise_ac(gfp_mask, order, &ac); ··· 4641 4614 " present:%lukB" 4642 4615 " managed:%lukB" 4643 4616 " mlocked:%lukB" 4644 - " slab_reclaimable:%lukB" 4645 - " slab_unreclaimable:%lukB" 4646 4617 " kernel_stack:%lukB" 4647 4618 " pagetables:%lukB" 4648 4619 " bounce:%lukB" ··· 4662 4637 K(zone->present_pages), 4663 4638 K(zone->managed_pages), 4664 4639 K(zone_page_state(zone, NR_MLOCK)), 4665 - K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)), 4666 - K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)), 4667 4640 zone_page_state(zone, NR_KERNEL_STACK_KB), 4668 4641 K(zone_page_state(zone, NR_PAGETABLE)), 4669 4642 K(zone_page_state(zone, NR_BOUNCE)), ··· 5147 5124 */ 5148 5125 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch); 5149 5126 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset); 5127 + static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); 5150 5128 static void setup_zone_pageset(struct zone *zone); 5151 5129 5152 5130 /* ··· 5552 5528 zone_batchsize(zone)); 5553 5529 } 5554 5530 5555 - int __meminit init_currently_empty_zone(struct zone *zone, 5531 + void __meminit init_currently_empty_zone(struct zone *zone, 5556 5532 unsigned long zone_start_pfn, 5557 5533 unsigned long size) 5558 5534 { ··· 5570 5546 5571 5547 
zone_init_free_lists(zone); 5572 5548 zone->initialized = 1; 5573 - 5574 - return 0; 5575 5549 } 5576 5550 5577 5551 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP ··· 6027 6005 { 6028 6006 enum zone_type j; 6029 6007 int nid = pgdat->node_id; 6030 - int ret; 6031 6008 6032 6009 pgdat_resize_init(pgdat); 6033 6010 #ifdef CONFIG_NUMA_BALANCING ··· 6047 6026 pgdat_page_ext_init(pgdat); 6048 6027 spin_lock_init(&pgdat->lru_lock); 6049 6028 lruvec_init(node_lruvec(pgdat)); 6029 + 6030 + pgdat->per_cpu_nodestats = &boot_nodestats; 6050 6031 6051 6032 for (j = 0; j < MAX_NR_ZONES; j++) { 6052 6033 struct zone *zone = pgdat->node_zones + j; ··· 6110 6087 6111 6088 set_pageblock_order(); 6112 6089 setup_usemap(pgdat, zone, zone_start_pfn, size); 6113 - ret = init_currently_empty_zone(zone, zone_start_pfn, size); 6114 - BUG_ON(ret); 6090 + init_currently_empty_zone(zone, zone_start_pfn, size); 6115 6091 memmap_init(size, nid, j, zone_start_pfn); 6116 6092 } 6117 6093 } ··· 7204 7182 #endif 7205 7183 7206 7184 /* 7185 + * Adaptive scale is meant to reduce sizes of hash tables on large memory 7186 + * machines. As memory size is increased the scale is also increased but at 7187 + * slower pace. Starting from ADAPT_SCALE_BASE (64G), every time memory 7188 + * quadruples the scale is increased by one, which means the size of hash table 7189 + * only doubles, instead of quadrupling as well. 7190 + * Because 32-bit systems cannot have large physical memory, where this scaling 7191 + * makes sense, it is disabled on such platforms. 
7192 + */ 7193 + #if __BITS_PER_LONG > 32 7194 + #define ADAPT_SCALE_BASE (64ul << 30) 7195 + #define ADAPT_SCALE_SHIFT 2 7196 + #define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> PAGE_SHIFT) 7197 + #endif 7198 + 7199 + /* 7207 7200 * allocate a large system hash table from bootmem 7208 7201 * - it is assumed that the hash table must contain an exact power-of-2 7209 7202 * quantity of entries ··· 7237 7200 unsigned long long max = high_limit; 7238 7201 unsigned long log2qty, size; 7239 7202 void *table = NULL; 7203 + gfp_t gfp_flags; 7240 7204 7241 7205 /* allow the kernel cmdline to have a say */ 7242 7206 if (!numentries) { ··· 7248 7210 /* It isn't necessary when PAGE_SIZE >= 1MB */ 7249 7211 if (PAGE_SHIFT < 20) 7250 7212 numentries = round_up(numentries, (1<<20)/PAGE_SIZE); 7213 + 7214 + #if __BITS_PER_LONG > 32 7215 + if (!high_limit) { 7216 + unsigned long adapt; 7217 + 7218 + for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries; 7219 + adapt <<= ADAPT_SCALE_SHIFT) 7220 + scale++; 7221 + } 7222 + #endif 7251 7223 7252 7224 /* limit to 1 bucket per 2^scale bytes of low memory */ 7253 7225 if (scale > PAGE_SHIFT) ··· 7292 7244 7293 7245 log2qty = ilog2(numentries); 7294 7246 7247 + /* 7248 + * memblock allocator returns zeroed memory already, so HASH_ZERO is 7249 + * currently not used when HASH_EARLY is specified. 7250 + */ 7251 + gfp_flags = (flags & HASH_ZERO) ? 
GFP_ATOMIC | __GFP_ZERO : GFP_ATOMIC; 7295 7252 do { 7296 7253 size = bucketsize << log2qty; 7297 7254 if (flags & HASH_EARLY) 7298 7255 table = memblock_virt_alloc_nopanic(size, 0); 7299 7256 else if (hashdist) 7300 - table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL); 7257 + table = __vmalloc(size, gfp_flags, PAGE_KERNEL); 7301 7258 else { 7302 7259 /* 7303 7260 * If bucketsize is not a power-of-two, we may free ··· 7310 7257 * alloc_pages_exact() automatically does 7311 7258 */ 7312 7259 if (get_order(size) < MAX_ORDER) { 7313 - table = alloc_pages_exact(size, GFP_ATOMIC); 7314 - kmemleak_alloc(table, size, 1, GFP_ATOMIC); 7260 + table = alloc_pages_exact(size, gfp_flags); 7261 + kmemleak_alloc(table, size, 1, gfp_flags); 7315 7262 } 7316 7263 } 7317 7264 } while (!table && size > PAGE_SIZE && --log2qty); ··· 7713 7660 break; 7714 7661 if (pfn == end_pfn) 7715 7662 return; 7663 + offline_mem_sections(pfn, end_pfn); 7716 7664 zone = page_zone(pfn_to_page(pfn)); 7717 7665 spin_lock_irqsave(&zone->lock, flags); 7718 7666 pfn = start_pfn;
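The adaptive-scale loop added to `alloc_large_system_hash()` above implements the comment's rule: past 64G, every quadrupling of memory bumps `scale` by one, so the hash table (sized roughly as low-mem bytes right-shifted by `scale`) only doubles per step. A standalone sketch of just that arithmetic, assuming 4K pages on a 64-bit `unsigned long` (the kernel's `PAGE_SHIFT` varies by architecture):

```c
#include <assert.h>

#define TOY_PAGE_SHIFT     12                 /* 4K pages, assumed */
#define ADAPT_SCALE_BASE   (64ul << 30)       /* 64G */
#define ADAPT_SCALE_SHIFT  2                  /* quadrupling step */
#define ADAPT_SCALE_NPAGES (ADAPT_SCALE_BASE >> TOY_PAGE_SHIFT)

/* Same loop shape as the patch: one extra scale step per quadrupling
 * of numentries (pages) beyond the 64G baseline. */
static int toy_adapt_scale(unsigned long numentries, int scale)
{
    unsigned long adapt;

    for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries;
         adapt <<= ADAPT_SCALE_SHIFT)
        scale++;
    return scale;
}
```

With 4K pages, 64G is 2^24 pages; 256G (one quadrupling) adds one to `scale`, 1T adds two, so table size grows by 2x per 4x of memory.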
+18 -8
mm/page_isolation.c
··· 138 138 __first_valid_page(unsigned long pfn, unsigned long nr_pages) 139 139 { 140 140 int i; 141 - for (i = 0; i < nr_pages; i++) 142 - if (pfn_valid_within(pfn + i)) 143 - break; 144 - if (unlikely(i == nr_pages)) 145 - return NULL; 146 - return pfn_to_page(pfn + i); 141 + 142 + for (i = 0; i < nr_pages; i++) { 143 + struct page *page; 144 + 145 + if (!pfn_valid_within(pfn + i)) 146 + continue; 147 + page = pfn_to_online_page(pfn + i); 148 + if (!page) 149 + continue; 150 + return page; 151 + } 152 + return NULL; 147 153 } 148 154 149 155 /* ··· 190 184 undo: 191 185 for (pfn = start_pfn; 192 186 pfn < undo_pfn; 193 - pfn += pageblock_nr_pages) 194 - unset_migratetype_isolate(pfn_to_page(pfn), migratetype); 187 + pfn += pageblock_nr_pages) { 188 + struct page *page = pfn_to_online_page(pfn); 189 + if (!page) 190 + continue; 191 + unset_migratetype_isolate(page, migratetype); 192 + } 195 193 196 194 return -EBUSY; 197 195 }
+2 -1
mm/page_vma_mapped.c
··· 116 116 117 117 if (unlikely(PageHuge(pvmw->page))) { 118 118 /* when pud is not present, pte will be NULL */ 119 - pvmw->pte = huge_pte_offset(mm, pvmw->address); 119 + pvmw->pte = huge_pte_offset(mm, pvmw->address, 120 + PAGE_SIZE << compound_order(page)); 120 121 if (!pvmw->pte) 121 122 return false; 122 123
+2 -1
mm/pagewalk.c
··· 180 180 struct hstate *h = hstate_vma(vma); 181 181 unsigned long next; 182 182 unsigned long hmask = huge_page_mask(h); 183 + unsigned long sz = huge_page_size(h); 183 184 pte_t *pte; 184 185 int err = 0; 185 186 186 187 do { 187 188 next = hugetlb_entry_end(h, addr, end); 188 - pte = huge_pte_offset(walk->mm, addr & hmask); 189 + pte = huge_pte_offset(walk->mm, addr & hmask, sz); 189 190 if (pte && walk->hugetlb_entry) 190 191 err = walk->hugetlb_entry(pte, hmask, addr, next, walk); 191 192 if (err)
+8 -7
mm/rmap.c
··· 1145 1145 if (!atomic_inc_and_test(&page->_mapcount)) 1146 1146 goto out; 1147 1147 } 1148 - __mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, nr); 1149 - mod_memcg_page_state(page, NR_FILE_MAPPED, nr); 1148 + __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); 1150 1149 out: 1151 1150 unlock_page_memcg(page); 1152 1151 } ··· 1180 1181 } 1181 1182 1182 1183 /* 1183 - * We use the irq-unsafe __{inc|mod}_zone_page_state because 1184 + * We use the irq-unsafe __{inc|mod}_lruvec_page_state because 1184 1185 * these counters are not modified in interrupt context, and 1185 1186 * pte lock(a spinlock) is held, which implies preemption disabled. 1186 1187 */ 1187 - __mod_node_page_state(page_pgdat(page), NR_FILE_MAPPED, -nr); 1188 - mod_memcg_page_state(page, NR_FILE_MAPPED, -nr); 1188 + __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr); 1189 1189 1190 1190 if (unlikely(PageMlocked(page))) 1191 1191 clear_page_mlock(page); ··· 1365 1367 update_hiwater_rss(mm); 1366 1368 1367 1369 if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) { 1370 + pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 1368 1371 if (PageHuge(page)) { 1369 1372 int nr = 1 << compound_order(page); 1370 1373 hugetlb_count_sub(nr, mm); 1374 + set_huge_swap_pte_at(mm, address, 1375 + pvmw.pte, pteval, 1376 + vma_mmu_pagesize(vma)); 1371 1377 } else { 1372 1378 dec_mm_counter(mm, mm_counter(page)); 1379 + set_pte_at(mm, address, pvmw.pte, pteval); 1373 1380 } 1374 1381 1375 - pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); 1376 - set_pte_at(mm, address, pvmw.pte, pteval); 1377 1382 } else if (pte_unused(pteval)) { 1378 1383 /* 1379 1384 * The guest indicated that the page content is of no
+3 -4
mm/shmem.c
··· 1291 1291 SetPageUptodate(page); 1292 1292 } 1293 1293 1294 - swap = get_swap_page(); 1294 + swap = get_swap_page(page); 1295 1295 if (!swap.val) 1296 1296 goto redirty; 1297 1297 ··· 1327 1327 1328 1328 mutex_unlock(&shmem_swaplist_mutex); 1329 1329 free_swap: 1330 - swapcache_free(swap); 1330 + put_swap_page(page, swap); 1331 1331 redirty: 1332 1332 set_page_dirty(page); 1333 1333 if (wbc->for_reclaim) ··· 1646 1646 if (fault_type) { 1647 1647 *fault_type |= VM_FAULT_MAJOR; 1648 1648 count_vm_event(PGMAJFAULT); 1649 - mem_cgroup_count_vm_event(charge_mm, 1650 - PGMAJFAULT); 1649 + count_memcg_event_mm(charge_mm, PGMAJFAULT); 1651 1650 } 1652 1651 /* Here we actually start the io */ 1653 1652 page = shmem_swapin(swap, gfp, info, index);
+6 -14
mm/slab.c
··· 1425 1425 1426 1426 nr_pages = (1 << cachep->gfporder); 1427 1427 if (cachep->flags & SLAB_RECLAIM_ACCOUNT) 1428 - add_zone_page_state(page_zone(page), 1429 - NR_SLAB_RECLAIMABLE, nr_pages); 1428 + mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages); 1430 1429 else 1431 - add_zone_page_state(page_zone(page), 1432 - NR_SLAB_UNRECLAIMABLE, nr_pages); 1430 + mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages); 1433 1431 1434 1432 __SetPageSlab(page); 1435 1433 /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */ ··· 1457 1459 kmemcheck_free_shadow(page, order); 1458 1460 1459 1461 if (cachep->flags & SLAB_RECLAIM_ACCOUNT) 1460 - sub_zone_page_state(page_zone(page), 1461 - NR_SLAB_RECLAIMABLE, nr_freed); 1462 + mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed); 1462 1463 else 1463 - sub_zone_page_state(page_zone(page), 1464 - NR_SLAB_UNRECLAIMABLE, nr_freed); 1464 + mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed); 1465 1465 1466 1466 BUG_ON(!PageSlab(page)); 1467 1467 __ClearPageSlabPfmemalloc(page); ··· 2036 2040 * unaligned accesses for some archs when redzoning is used, and makes 2037 2041 * sure any on-slab bufctl's are also correctly aligned. 2038 2042 */ 2039 - if (size & (BYTES_PER_WORD - 1)) { 2040 - size += (BYTES_PER_WORD - 1); 2041 - size &= ~(BYTES_PER_WORD - 1); 2042 - } 2043 + size = ALIGN(size, BYTES_PER_WORD); 2043 2044 2044 2045 if (flags & SLAB_RED_ZONE) { 2045 2046 ralign = REDZONE_ALIGN; 2046 2047 /* If redzoning, ensure that the second redzone is suitably 2047 2048 * aligned, by adjusting the object size accordingly. */ 2048 - size += REDZONE_ALIGN - 1; 2049 - size &= ~(REDZONE_ALIGN - 1); 2049 + size = ALIGN(size, REDZONE_ALIGN); 2050 2050 } 2051 2051 2052 2052 /* 3) caller mandated alignment */
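The mm/slab.c hunk swaps the open-coded round-up (`size += mask; size &= ~mask;`) for `ALIGN()`. The arithmetic is identical for power-of-two alignments, which `BYTES_PER_WORD` and `REDZONE_ALIGN` are; a one-line reconstruction:

```c
#include <assert.h>

/* Power-of-two round-up, as ALIGN() does in include/linux/kernel.h.
 * Equivalent to the removed open-coded form:
 *	size += (a - 1);
 *	size &= ~(a - 1);
 */
#define ALIGN_DEMO(x, a) (((x) + (a) - 1) & ~((a) - 1))
```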
+1 -17
mm/slab.h
··· 274 274 gfp_t gfp, int order, 275 275 struct kmem_cache *s) 276 276 { 277 - int ret; 278 - 279 277 if (!memcg_kmem_enabled()) 280 278 return 0; 281 279 if (is_root_cache(s)) 282 280 return 0; 283 - 284 - ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg); 285 - if (ret) 286 - return ret; 287 - 288 - memcg_kmem_update_page_stat(page, 289 - (s->flags & SLAB_RECLAIM_ACCOUNT) ? 290 - MEMCG_SLAB_RECLAIMABLE : MEMCG_SLAB_UNRECLAIMABLE, 291 - 1 << order); 292 - return 0; 281 + return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg); 293 282 } 294 283 295 284 static __always_inline void memcg_uncharge_slab(struct page *page, int order, ··· 286 297 { 287 298 if (!memcg_kmem_enabled()) 288 299 return; 289 - 290 - memcg_kmem_update_page_stat(page, 291 - (s->flags & SLAB_RECLAIM_ACCOUNT) ? 292 - MEMCG_SLAB_RECLAIMABLE : MEMCG_SLAB_UNRECLAIMABLE, 293 - -(1 << order)); 294 300 memcg_kmem_uncharge(page, order); 295 301 } 296 302
+2 -3
mm/slab_common.c
··· 47 47 48 48 /* 49 49 * Merge control. If this is set then no merging of slab caches will occur. 50 - * (Could be removed. This was introduced to pacify the merge skeptics.) 51 50 */ 52 - static int slab_nomerge; 51 + static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT); 53 52 54 53 static int __init setup_slab_nomerge(char *str) 55 54 { 56 - slab_nomerge = 1; 55 + slab_nomerge = true; 57 56 return 1; 58 57 } 59 58
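`slab_nomerge` now takes its compile-time default from `CONFIG_SLAB_MERGE_DEFAULT` via `IS_ENABLED()`, which evaluates to 1 or 0 whether or not the Kconfig macro is defined. A userspace reconstruction of that preprocessor trick, modeled on include/linux/kconfig.h; `CONFIG_SLAB_MERGE_DEFAULT` is defined locally here purely for the demo:

```c
#include <assert.h>

/* If the option is #defined to 1, the token paste below produces
 * ARG_PLACEHOLDER_1, which expands to "0," and shifts a 1 into the
 * second-argument slot. If the option is undefined, the paste yields
 * junk that stays in the first slot and the result is 0. */
#define CONFIG_SLAB_MERGE_DEFAULT 1	/* pretend this came from Kconfig */

#define ARG_PLACEHOLDER_1 0,
#define take_second_arg(ignored, val, ...) val
#define is_defined_pass2(arg1_or_junk) take_second_arg(arg1_or_junk 1, 0)
#define is_defined_pass1(val) is_defined_pass2(ARG_PLACEHOLDER_##val)
#define is_defined(x) is_defined_pass1(x)	/* extra hop: expand x first */
#define IS_ENABLED_DEMO(option) is_defined(option)

/* Mirrors: static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT); */
static int slab_nomerge_demo = !IS_ENABLED_DEMO(CONFIG_SLAB_MERGE_DEFAULT);
```

Because the result is an integer constant expression, it can initialize a static, which is exactly what the hunk relies on.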
+59 -54
mm/slub.c
··· 1615 1615 if (!page) 1616 1616 return NULL; 1617 1617 1618 - mod_zone_page_state(page_zone(page), 1618 + mod_lruvec_page_state(page, 1619 1619 (s->flags & SLAB_RECLAIM_ACCOUNT) ? 1620 1620 NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, 1621 1621 1 << oo_order(oo)); ··· 1655 1655 1656 1656 kmemcheck_free_shadow(page, compound_order(page)); 1657 1657 1658 - mod_zone_page_state(page_zone(page), 1658 + mod_lruvec_page_state(page, 1659 1659 (s->flags & SLAB_RECLAIM_ACCOUNT) ? 1660 1660 NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, 1661 1661 -pages); ··· 1829 1829 stat(s, CPU_PARTIAL_NODE); 1830 1830 } 1831 1831 if (!kmem_cache_has_cpu_partial(s) 1832 - || available > s->cpu_partial / 2) 1832 + || available > slub_cpu_partial(s) / 2) 1833 1833 break; 1834 1834 1835 1835 } ··· 1993 1993 * Remove the cpu slab 1994 1994 */ 1995 1995 static void deactivate_slab(struct kmem_cache *s, struct page *page, 1996 - void *freelist) 1996 + void *freelist, struct kmem_cache_cpu *c) 1997 1997 { 1998 1998 enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE }; 1999 1999 struct kmem_cache_node *n = get_node(s, page_to_nid(page)); ··· 2132 2132 discard_slab(s, page); 2133 2133 stat(s, FREE_SLAB); 2134 2134 } 2135 + 2136 + c->page = NULL; 2137 + c->freelist = NULL; 2135 2138 } 2136 2139 2137 2140 /* ··· 2269 2266 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) 2270 2267 { 2271 2268 stat(s, CPUSLAB_FLUSH); 2272 - deactivate_slab(s, c->page, c->freelist); 2269 + deactivate_slab(s, c->page, c->freelist, c); 2273 2270 2274 2271 c->tid = next_tid(c->tid); 2275 - c->page = NULL; 2276 - c->freelist = NULL; 2277 2272 } 2278 2273 2279 2274 /* ··· 2303 2302 struct kmem_cache *s = info; 2304 2303 struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); 2305 2304 2306 - return c->page || c->partial; 2305 + return c->page || slub_percpu_partial(c); 2307 2306 } 2308 2307 2309 2308 static void flush_all(struct kmem_cache *s) ··· 2522 2521 2523 2522 if 
(unlikely(!node_match(page, searchnode))) { 2524 2523 stat(s, ALLOC_NODE_MISMATCH); 2525 - deactivate_slab(s, page, c->freelist); 2526 - c->page = NULL; 2527 - c->freelist = NULL; 2524 + deactivate_slab(s, page, c->freelist, c); 2528 2525 goto new_slab; 2529 2526 } 2530 2527 } ··· 2533 2534 * information when the page leaves the per-cpu allocator 2534 2535 */ 2535 2536 if (unlikely(!pfmemalloc_match(page, gfpflags))) { 2536 - deactivate_slab(s, page, c->freelist); 2537 - c->page = NULL; 2538 - c->freelist = NULL; 2537 + deactivate_slab(s, page, c->freelist, c); 2539 2538 goto new_slab; 2540 2539 } 2541 2540 ··· 2565 2568 2566 2569 new_slab: 2567 2570 2568 - if (c->partial) { 2569 - page = c->page = c->partial; 2570 - c->partial = page->next; 2571 + if (slub_percpu_partial(c)) { 2572 + page = c->page = slub_percpu_partial(c); 2573 + slub_set_percpu_partial(c, page); 2571 2574 stat(s, CPU_PARTIAL_ALLOC); 2572 - c->freelist = NULL; 2573 2575 goto redo; 2574 2576 } 2575 2577 ··· 2588 2592 !alloc_debug_processing(s, page, freelist, addr)) 2589 2593 goto new_slab; /* Slab failed checks. Next slab needed */ 2590 2594 2591 - deactivate_slab(s, page, get_freepointer(s, freelist)); 2592 - c->page = NULL; 2593 - c->freelist = NULL; 2595 + deactivate_slab(s, page, get_freepointer(s, freelist), c); 2594 2596 return freelist; 2595 2597 } 2596 2598 ··· 3404 3410 s->min_partial = min; 3405 3411 } 3406 3412 3413 + static void set_cpu_partial(struct kmem_cache *s) 3414 + { 3415 + #ifdef CONFIG_SLUB_CPU_PARTIAL 3416 + /* 3417 + * cpu_partial determined the maximum number of objects kept in the 3418 + * per cpu partial lists of a processor. 3419 + * 3420 + * Per cpu partial lists mainly contain slabs that just have one 3421 + * object freed. If they are used for allocation then they can be 3422 + * filled up again with minimal effort. The slab will never hit the 3423 + * per node partial lists and therefore no locking will be required. 
3424 + * 3425 + * This setting also determines 3426 + * 3427 + * A) The number of objects from per cpu partial slabs dumped to the 3428 + * per node list when we reach the limit. 3429 + * B) The number of objects in cpu partial slabs to extract from the 3430 + * per node list when we run out of per cpu objects. We only fetch 3431 + * 50% to keep some capacity around for frees. 3432 + */ 3433 + if (!kmem_cache_has_cpu_partial(s)) 3434 + s->cpu_partial = 0; 3435 + else if (s->size >= PAGE_SIZE) 3436 + s->cpu_partial = 2; 3437 + else if (s->size >= 1024) 3438 + s->cpu_partial = 6; 3439 + else if (s->size >= 256) 3440 + s->cpu_partial = 13; 3441 + else 3442 + s->cpu_partial = 30; 3443 + #endif 3444 + } 3445 + 3407 3446 /* 3408 3447 * calculate_sizes() determines the order and the distribution of data within 3409 3448 * a slab object. ··· 3595 3568 */ 3596 3569 set_min_partial(s, ilog2(s->size) / 2); 3597 3570 3598 - /* 3599 - * cpu_partial determined the maximum number of objects kept in the 3600 - * per cpu partial lists of a processor. 3601 - * 3602 - * Per cpu partial lists mainly contain slabs that just have one 3603 - * object freed. If they are used for allocation then they can be 3604 - * filled up again with minimal effort. The slab will never hit the 3605 - * per node partial lists and therefore no locking will be required. 3606 - * 3607 - * This setting also determines 3608 - * 3609 - * A) The number of objects from per cpu partial slabs dumped to the 3610 - * per node list when we reach the limit. 3611 - * B) The number of objects in cpu partial slabs to extract from the 3612 - * per node list when we run out of per cpu objects. We only fetch 3613 - * 50% to keep some capacity around for frees. 
3614 - */ 3615 - if (!kmem_cache_has_cpu_partial(s)) 3616 - s->cpu_partial = 0; 3617 - else if (s->size >= PAGE_SIZE) 3618 - s->cpu_partial = 2; 3619 - else if (s->size >= 1024) 3620 - s->cpu_partial = 6; 3621 - else if (s->size >= 256) 3622 - s->cpu_partial = 13; 3623 - else 3624 - s->cpu_partial = 30; 3571 + set_cpu_partial(s); 3625 3572 3626 3573 #ifdef CONFIG_NUMA 3627 3574 s->remote_node_defrag_ratio = 1000; ··· 3982 3981 * Disable empty slabs caching. Used to avoid pinning offline 3983 3982 * memory cgroups by kmem pages that can be freed. 3984 3983 */ 3985 - s->cpu_partial = 0; 3984 + slub_set_cpu_partial(s, 0); 3986 3985 s->min_partial = 0; 3987 3986 3988 3987 /* ··· 4761 4760 total += x; 4762 4761 nodes[node] += x; 4763 4762 4764 - page = READ_ONCE(c->partial); 4763 + page = slub_percpu_partial_read_once(c); 4765 4764 if (page) { 4766 4765 node = page_to_nid(page); 4767 4766 if (flags & SO_TOTAL) ··· 4922 4921 4923 4922 static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf) 4924 4923 { 4925 - return sprintf(buf, "%u\n", s->cpu_partial); 4924 + return sprintf(buf, "%u\n", slub_cpu_partial(s)); 4926 4925 } 4927 4926 4928 4927 static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf, ··· 4937 4936 if (objects && !kmem_cache_has_cpu_partial(s)) 4938 4937 return -EINVAL; 4939 4938 4940 - s->cpu_partial = objects; 4939 + slub_set_cpu_partial(s, objects); 4941 4940 flush_all(s); 4942 4941 return length; 4943 4942 } ··· 4989 4988 int len; 4990 4989 4991 4990 for_each_online_cpu(cpu) { 4992 - struct page *page = per_cpu_ptr(s->cpu_slab, cpu)->partial; 4991 + struct page *page; 4992 + 4993 + page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu)); 4993 4994 4994 4995 if (page) { 4995 4996 pages += page->pages; ··· 5003 5000 5004 5001 #ifdef CONFIG_SMP 5005 5002 for_each_online_cpu(cpu) { 5006 - struct page *page = per_cpu_ptr(s->cpu_slab, cpu) ->partial; 5003 + struct page *page; 5004 + 5005 + page = 
slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu)); 5007 5006 5008 5007 if (page && len < PAGE_SIZE - 20) 5009 5008 len += sprintf(buf + len, " C%d=%d(%d)", cpu,
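The new `set_cpu_partial()` helper factors out the sizing table that caps how many objects sit on a CPU's partial lists: the larger the object, the fewer are cached. A userspace sketch of just that table, assuming a 4096-byte page (the real code compares against `PAGE_SIZE`):

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* Sizing table applied by set_cpu_partial(): bigger objects get a
 * smaller per-cpu partial budget; 0 when the feature is compiled out. */
static unsigned int cpu_partial_for_size(size_t size, int has_cpu_partial)
{
	if (!has_cpu_partial)
		return 0;
	else if (size >= DEMO_PAGE_SIZE)
		return 2;
	else if (size >= 1024)
		return 6;
	else if (size >= 256)
		return 13;
	else
		return 30;
}
```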
+89 -15
mm/sparse.c
··· 168 168 } 169 169 } 170 170 171 + /* 172 + * There are a number of times that we loop over NR_MEM_SECTIONS, 173 + * looking for section_present() on each. But, when we have very 174 + * large physical address spaces, NR_MEM_SECTIONS can also be 175 + * very large which makes the loops quite long. 176 + * 177 + * Keeping track of this gives us an easy way to break out of 178 + * those loops early. 179 + */ 180 + int __highest_present_section_nr; 181 + static void section_mark_present(struct mem_section *ms) 182 + { 183 + int section_nr = __section_nr(ms); 184 + 185 + if (section_nr > __highest_present_section_nr) 186 + __highest_present_section_nr = section_nr; 187 + 188 + ms->section_mem_map |= SECTION_MARKED_PRESENT; 189 + } 190 + 191 + static inline int next_present_section_nr(int section_nr) 192 + { 193 + do { 194 + section_nr++; 195 + if (present_section_nr(section_nr)) 196 + return section_nr; 197 + } while ((section_nr < NR_MEM_SECTIONS) && 198 + (section_nr <= __highest_present_section_nr)); 199 + 200 + return -1; 201 + } 202 + #define for_each_present_section_nr(start, section_nr) \ 203 + for (section_nr = next_present_section_nr(start-1); \ 204 + ((section_nr >= 0) && \ 205 + (section_nr < NR_MEM_SECTIONS) && \ 206 + (section_nr <= __highest_present_section_nr)); \ 207 + section_nr = next_present_section_nr(section_nr)) 208 + 171 209 /* Record a memory area against a node. 
*/ 172 210 void __init memory_present(int nid, unsigned long start, unsigned long end) 173 211 { ··· 221 183 set_section_nid(section, nid); 222 184 223 185 ms = __nr_to_section(section); 224 - if (!ms->section_mem_map) 186 + if (!ms->section_mem_map) { 225 187 ms->section_mem_map = sparse_encode_early_nid(nid) | 226 - SECTION_MARKED_PRESENT; 188 + SECTION_IS_ONLINE; 189 + section_mark_present(ms); 190 + } 227 191 } 228 192 } 229 193 ··· 516 476 int nodeid_begin = 0; 517 477 unsigned long pnum_begin = 0; 518 478 519 - for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) { 479 + for_each_present_section_nr(0, pnum) { 520 480 struct mem_section *ms; 521 481 522 - if (!present_section_nr(pnum)) 523 - continue; 524 482 ms = __nr_to_section(pnum); 525 483 nodeid_begin = sparse_early_nid(ms); 526 484 pnum_begin = pnum; 527 485 break; 528 486 } 529 487 map_count = 1; 530 - for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) { 488 + for_each_present_section_nr(pnum_begin + 1, pnum) { 531 489 struct mem_section *ms; 532 490 int nodeid; 533 491 534 - if (!present_section_nr(pnum)) 535 - continue; 536 492 ms = __nr_to_section(pnum); 537 493 nodeid = sparse_early_nid(ms); 538 494 if (nodeid == nodeid_begin) { ··· 597 561 (void *)map_map); 598 562 #endif 599 563 600 - for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) { 601 - if (!present_section_nr(pnum)) 602 - continue; 603 - 564 + for_each_present_section_nr(0, pnum) { 604 565 usemap = usemap_map[pnum]; 605 566 if (!usemap) 606 567 continue; ··· 623 590 } 624 591 625 592 #ifdef CONFIG_MEMORY_HOTPLUG 593 + 594 + /* Mark all memory sections within the pfn range as online */ 595 + void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn) 596 + { 597 + unsigned long pfn; 598 + 599 + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { 600 + unsigned long section_nr = pfn_to_section_nr(pfn); 601 + struct mem_section *ms; 602 + 603 + /* onlining code should never touch invalid ranges */ 604 + if
(WARN_ON(!valid_section_nr(section_nr))) 605 + continue; 606 + 607 + ms = __nr_to_section(section_nr); 608 + ms->section_mem_map |= SECTION_IS_ONLINE; 609 + } 610 + } 611 + 612 + #ifdef CONFIG_MEMORY_HOTREMOVE 613 + /* Mark all memory sections within the pfn range as offline */ 614 + void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn) 615 + { 616 + unsigned long pfn; 617 + 618 + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { 619 + unsigned long section_nr = pfn_to_section_nr(pfn); 620 + struct mem_section *ms; 621 + 622 + /* 623 + * TODO this needs some double checking. Offlining code makes 624 + * sure to check pfn_valid but those checks might be just bogus 625 + */ 626 + if (WARN_ON(!valid_section_nr(section_nr))) 627 + continue; 628 + 629 + ms = __nr_to_section(section_nr); 630 + ms->section_mem_map &= ~SECTION_IS_ONLINE; 631 + } 632 + } 633 + #endif 634 + 626 635 #ifdef CONFIG_SPARSEMEM_VMEMMAP 627 636 static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid) 628 637 { ··· 761 686 * set. If this is <=0, then that means that the passed-in 762 687 * map was not consumed and must be freed. 763 688 */ 764 - int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn) 689 + int __meminit sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn) 765 690 { 766 691 unsigned long section_nr = pfn_to_section_nr(start_pfn); 767 - struct pglist_data *pgdat = zone->zone_pgdat; 768 692 struct mem_section *ms; 769 693 struct page *memmap; 770 694 unsigned long *usemap; ··· 796 722 797 723 memset(memmap, 0, sizeof(struct page) * PAGES_PER_SECTION); 798 724 799 - ms->section_mem_map |= SECTION_MARKED_PRESENT; 725 + section_mark_present(ms); 800 726 801 727 ret = sparse_init_one_section(ms, section_nr, memmap, usemap); 802 728
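The new `section_mark_present()` / `next_present_section_nr()` pair in mm/sparse.c trades one int of state (`__highest_present_section_nr`) for early termination of loops that previously scanned all of `NR_MEM_SECTIONS`. A toy model of the same idea; names and array size are illustrative:

```c
#include <assert.h>

#define DEMO_NR_SECTIONS 4096	/* stands in for NR_MEM_SECTIONS */

static unsigned char demo_present[DEMO_NR_SECTIONS];
static int demo_highest_present = -1;

/* Analogue of section_mark_present(): remember the highest index marked
 * present so later scans can stop early. */
static void demo_mark_present(int nr)
{
	if (nr > demo_highest_present)
		demo_highest_present = nr;
	demo_present[nr] = 1;
}

/* Analogue of next_present_section_nr(): returns -1 once past the
 * highest present index, instead of scanning all DEMO_NR_SECTIONS. */
static int demo_next_present(int nr)
{
	while (++nr <= demo_highest_present) {
		if (demo_present[nr])
			return nr;
	}
	return -1;
}
```

With sparse physical address spaces this matters: the scan cost now tracks the highest present section rather than the size of the possible section space.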
+1
mm/swap.c
··· 591 591 add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); 592 592 593 593 __count_vm_events(PGLAZYFREE, hpage_nr_pages(page)); 594 + count_memcg_page_event(page, PGLAZYFREE); 594 595 update_page_reclaim_stat(lruvec, 1, 0); 595 596 } 596 597 }
+30 -10
mm/swap_cgroup.c
··· 61 61 return -ENOMEM; 62 62 } 63 63 64 + static struct swap_cgroup *__lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl, 65 + pgoff_t offset) 66 + { 67 + struct page *mappage; 68 + struct swap_cgroup *sc; 69 + 70 + mappage = ctrl->map[offset / SC_PER_PAGE]; 71 + sc = page_address(mappage); 72 + return sc + offset % SC_PER_PAGE; 73 + } 74 + 64 75 static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent, 65 76 struct swap_cgroup_ctrl **ctrlp) 66 77 { 67 78 pgoff_t offset = swp_offset(ent); 68 79 struct swap_cgroup_ctrl *ctrl; 69 - struct page *mappage; 70 - struct swap_cgroup *sc; 71 80 72 81 ctrl = &swap_cgroup_ctrl[swp_type(ent)]; 73 82 if (ctrlp) 74 83 *ctrlp = ctrl; 75 - 76 - mappage = ctrl->map[offset / SC_PER_PAGE]; 77 - sc = page_address(mappage); 78 - return sc + offset % SC_PER_PAGE; 84 + return __lookup_swap_cgroup(ctrl, offset); 79 85 } 80 86 81 87 /** ··· 114 108 } 115 109 116 110 /** 117 - * swap_cgroup_record - record mem_cgroup for this swp_entry. 118 - * @ent: swap entry to be recorded into 111 + * swap_cgroup_record - record mem_cgroup for a set of swap entries 112 + * @ent: the first swap entry to be recorded into 119 113 * @id: mem_cgroup to be recorded 114 + * @nr_ents: number of swap entries to be recorded 120 115 * 121 116 * Returns old value at success, 0 at failure. 122 117 * (Of course, old value can be 0.) 
123 118 */ 124 - unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) 119 + unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, 120 + unsigned int nr_ents) 125 121 { 126 122 struct swap_cgroup_ctrl *ctrl; 127 123 struct swap_cgroup *sc; 128 124 unsigned short old; 129 125 unsigned long flags; 126 + pgoff_t offset = swp_offset(ent); 127 + pgoff_t end = offset + nr_ents; 130 128 131 129 sc = lookup_swap_cgroup(ent, &ctrl); 132 130 133 131 spin_lock_irqsave(&ctrl->lock, flags); 134 132 old = sc->id; 135 - sc->id = id; 133 + for (;;) { 134 + VM_BUG_ON(sc->id != old); 135 + sc->id = id; 136 + offset++; 137 + if (offset == end) 138 + break; 139 + if (offset % SC_PER_PAGE) 140 + sc++; 141 + else 142 + sc = __lookup_swap_cgroup(ctrl, offset); 143 + } 136 144 spin_unlock_irqrestore(&ctrl->lock, flags); 137 145 138 146 return old;
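The reworked `swap_cgroup_record()` batches updates over `nr_ents` consecutive entries, advancing a pointer within the current backing page and re-running the two-level lookup only when the offset crosses an `SC_PER_PAGE` boundary. A userspace sketch with a deliberately tiny entries-per-page value so the boundary crossing is visible; all names are illustrative:

```c
#include <assert.h>

#define SC_PER_PAGE_DEMO 4	/* entries per backing page; tiny for the demo */
#define DEMO_PAGES 8

/* Two-level map standing in for ctrl->map[]: the first index picks the
 * backing page, the second the slot within it, mirroring
 * __lookup_swap_cgroup()'s offset / SC_PER_PAGE and offset % SC_PER_PAGE. */
static unsigned short demo_map[DEMO_PAGES][SC_PER_PAGE_DEMO];

static unsigned short *demo_lookup(unsigned long offset)
{
	return &demo_map[offset / SC_PER_PAGE_DEMO][offset % SC_PER_PAGE_DEMO];
}

/* Batched record over nr consecutive entries; returns the old id of the
 * first entry, matching the kernel function's contract. */
static unsigned short demo_record(unsigned long offset, unsigned short id,
				  unsigned int nr)
{
	unsigned long end = offset + nr;
	unsigned short *sc = demo_lookup(offset);
	unsigned short old = *sc;

	for (;;) {
		*sc = id;
		offset++;
		if (offset == end)
			break;
		if (offset % SC_PER_PAGE_DEMO)
			sc++;			/* same backing page */
		else
			sc = demo_lookup(offset); /* crossed a page boundary */
	}
	return old;
}
```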
+12 -4
mm/swap_slots.c
··· 263 263 264 264 cache->cur = 0; 265 265 if (swap_slot_cache_active) 266 - cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, cache->slots); 266 + cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, false, 267 + cache->slots); 267 268 268 269 return cache->nr; 269 270 } ··· 302 301 return 0; 303 302 } 304 303 305 - swp_entry_t get_swap_page(void) 304 + swp_entry_t get_swap_page(struct page *page) 306 305 { 307 306 swp_entry_t entry, *pentry; 308 307 struct swap_slots_cache *cache; 308 + 309 + entry.val = 0; 310 + 311 + if (PageTransHuge(page)) { 312 + if (IS_ENABLED(CONFIG_THP_SWAP)) 313 + get_swap_pages(1, true, &entry); 314 + return entry; 315 + } 309 316 310 317 /* 311 318 * Preemption is allowed here, because we may sleep ··· 326 317 */ 327 318 cache = raw_cpu_ptr(&swp_slots); 328 319 329 - entry.val = 0; 330 320 if (check_cache_active()) { 331 321 mutex_lock(&cache->alloc_lock); 332 322 if (cache->slots) { ··· 345 337 return entry; 346 338 } 347 339 348 - get_swap_pages(1, &entry); 340 + get_swap_pages(1, false, &entry); 349 341 350 342 return entry; 351 343 }
+53 -45
mm/swap_state.c
··· 19 19 #include <linux/migrate.h> 20 20 #include <linux/vmalloc.h> 21 21 #include <linux/swap_slots.h> 22 + #include <linux/huge_mm.h> 22 23 23 24 #include <asm/pgtable.h> 24 25 ··· 39 38 static unsigned int nr_swapper_spaces[MAX_SWAPFILES]; 40 39 41 40 #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) 41 + #define ADD_CACHE_INFO(x, nr) do { swap_cache_info.x += (nr); } while (0) 42 42 43 43 static struct { 44 44 unsigned long add_total; ··· 92 90 */ 93 91 int __add_to_swap_cache(struct page *page, swp_entry_t entry) 94 92 { 95 - int error; 93 + int error, i, nr = hpage_nr_pages(page); 96 94 struct address_space *address_space; 95 + pgoff_t idx = swp_offset(entry); 97 96 98 97 VM_BUG_ON_PAGE(!PageLocked(page), page); 99 98 VM_BUG_ON_PAGE(PageSwapCache(page), page); 100 99 VM_BUG_ON_PAGE(!PageSwapBacked(page), page); 101 100 102 - get_page(page); 101 + page_ref_add(page, nr); 103 102 SetPageSwapCache(page); 104 - set_page_private(page, entry.val); 105 103 106 104 address_space = swap_address_space(entry); 107 105 spin_lock_irq(&address_space->tree_lock); 108 - error = radix_tree_insert(&address_space->page_tree, 109 - swp_offset(entry), page); 110 - if (likely(!error)) { 111 - address_space->nrpages++; 112 - __inc_node_page_state(page, NR_FILE_PAGES); 113 - INC_CACHE_INFO(add_total); 106 + for (i = 0; i < nr; i++) { 107 + set_page_private(page + i, entry.val + i); 108 + error = radix_tree_insert(&address_space->page_tree, 109 + idx + i, page + i); 110 + if (unlikely(error)) 111 + break; 114 112 } 115 - spin_unlock_irq(&address_space->tree_lock); 116 - 117 - if (unlikely(error)) { 113 + if (likely(!error)) { 114 + address_space->nrpages += nr; 115 + __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); 116 + ADD_CACHE_INFO(add_total, nr); 117 + } else { 118 118 /* 119 119 * Only the context which have set SWAP_HAS_CACHE flag 120 120 * would call add_to_swap_cache(). 121 121 * So add_to_swap_cache() doesn't returns -EEXIST. 
122 122 */ 123 123 VM_BUG_ON(error == -EEXIST); 124 - set_page_private(page, 0UL); 124 + set_page_private(page + i, 0UL); 125 + while (i--) { 126 + radix_tree_delete(&address_space->page_tree, idx + i); 127 + set_page_private(page + i, 0UL); 128 + } 125 129 ClearPageSwapCache(page); 126 - put_page(page); 130 + page_ref_sub(page, nr); 127 131 } 132 + spin_unlock_irq(&address_space->tree_lock); 128 133 129 134 return error; 130 135 } ··· 141 132 { 142 133 int error; 143 134 144 - error = radix_tree_maybe_preload(gfp_mask); 135 + error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page)); 145 136 if (!error) { 146 137 error = __add_to_swap_cache(page, entry); 147 138 radix_tree_preload_end(); ··· 155 146 */ 156 147 void __delete_from_swap_cache(struct page *page) 157 148 { 158 - swp_entry_t entry; 159 149 struct address_space *address_space; 150 + int i, nr = hpage_nr_pages(page); 151 + swp_entry_t entry; 152 + pgoff_t idx; 160 153 161 154 VM_BUG_ON_PAGE(!PageLocked(page), page); 162 155 VM_BUG_ON_PAGE(!PageSwapCache(page), page); ··· 166 155 167 156 entry.val = page_private(page); 168 157 address_space = swap_address_space(entry); 169 - radix_tree_delete(&address_space->page_tree, swp_offset(entry)); 170 - set_page_private(page, 0); 158 + idx = swp_offset(entry); 159 + for (i = 0; i < nr; i++) { 160 + radix_tree_delete(&address_space->page_tree, idx + i); 161 + set_page_private(page + i, 0); 162 + } 171 163 ClearPageSwapCache(page); 172 - address_space->nrpages--; 173 - __dec_node_page_state(page, NR_FILE_PAGES); 174 - INC_CACHE_INFO(del_total); 164 + address_space->nrpages -= nr; 165 + __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); 166 + ADD_CACHE_INFO(del_total, nr); 175 167 } 176 168 177 169 /** ··· 184 170 * Allocate swap space for the page and add the page to the 185 171 * swap cache. Caller needs to hold the page lock. 
186 172 */ 187 - int add_to_swap(struct page *page, struct list_head *list) 173 + int add_to_swap(struct page *page) 188 174 { 189 175 swp_entry_t entry; 190 176 int err; ··· 192 178 VM_BUG_ON_PAGE(!PageLocked(page), page); 193 179 VM_BUG_ON_PAGE(!PageUptodate(page), page); 194 180 195 - entry = get_swap_page(); 181 + entry = get_swap_page(page); 196 182 if (!entry.val) 197 183 return 0; 198 184 199 - if (mem_cgroup_try_charge_swap(page, entry)) { 200 - swapcache_free(entry); 201 - return 0; 202 - } 203 - 204 - if (unlikely(PageTransHuge(page))) 205 - if (unlikely(split_huge_page_to_list(page, list))) { 206 - swapcache_free(entry); 207 - return 0; 208 - } 185 + if (mem_cgroup_try_charge_swap(page, entry)) 186 + goto fail; 209 187 210 188 /* 211 189 * Radix-tree node allocations from PF_MEMALLOC contexts could ··· 212 206 */ 213 207 err = add_to_swap_cache(page, entry, 214 208 __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); 215 - 216 - if (!err) { 217 - return 1; 218 - } else { /* -ENOMEM radix-tree allocation failure */ 209 + /* -ENOMEM radix-tree allocation failure */ 210 + if (err) 219 211 /* 220 212 * add_to_swap_cache() doesn't return -EEXIST, so we can safely 221 213 * clear SWAP_HAS_CACHE flag. 
222 214 */ 223 - swapcache_free(entry); 224 - return 0; 225 - } 215 + goto fail; 216 + 217 + return 1; 218 + 219 + fail: 220 + put_swap_page(page, entry); 221 + return 0; 226 222 } 227 223 228 224 /* ··· 245 237 __delete_from_swap_cache(page); 246 238 spin_unlock_irq(&address_space->tree_lock); 247 239 248 - swapcache_free(entry); 249 - put_page(page); 240 + put_swap_page(page, entry); 241 + page_ref_sub(page, hpage_nr_pages(page)); 250 242 } 251 243 252 244 /* ··· 303 295 304 296 page = find_get_page(swap_address_space(entry), swp_offset(entry)); 305 297 306 - if (page) { 298 + if (page && likely(!PageTransCompound(page))) { 307 299 INC_CACHE_INFO(find_success); 308 300 if (TestClearPageReadahead(page)) 309 301 atomic_inc(&swapin_readahead_hits); ··· 397 389 * add_to_swap_cache() doesn't return -EEXIST, so we can safely 398 390 * clear SWAP_HAS_CACHE flag. 399 391 */ 400 - swapcache_free(entry); 392 + put_swap_page(new_page, entry); 401 393 } while (err != -ENOMEM); 402 394 403 395 if (new_page) ··· 514 506 gfp_mask, vma, addr); 515 507 if (!page) 516 508 continue; 517 - if (offset != entry_offset) 509 + if (offset != entry_offset && likely(!PageTransCompound(page))) 518 510 SetPageReadahead(page); 519 511 put_page(page); 520 512 }
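For THP swap, `__add_to_swap_cache()` above now inserts `hpage_nr_pages()` consecutive radix-tree slots under one tree lock, and on failure deletes the entries already inserted so the operation stays all-or-nothing. A toy model of that insert-with-rollback shape, using a plain slot array in place of the radix tree:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 64

/* Toy "radix tree": a slot array where NULL means empty. */
static const void *demo_tree[DEMO_SLOTS];

static int demo_insert(unsigned long idx, const void *item)
{
	if (demo_tree[idx])
		return -17;	/* stand-in for -EEXIST */
	demo_tree[idx] = item;
	return 0;
}

/* Shape of the reworked __add_to_swap_cache(): try nr consecutive
 * inserts, and on failure delete the ones already inserted so the
 * caller never sees a half-inserted compound page. */
static int demo_insert_range(unsigned long idx, const void *item,
			     unsigned int nr)
{
	unsigned int i;
	int error = 0;

	for (i = 0; i < nr; i++) {
		error = demo_insert(idx + i, item);
		if (error)
			break;
	}
	if (error) {
		while (i--)		/* roll back partial progress */
			demo_tree[idx + i] = NULL;
	}
	return error;
}
```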
+219 -70
mm/swapfile.c
··· 37 37 #include <linux/swapfile.h> 38 38 #include <linux/export.h> 39 39 #include <linux/swap_slots.h> 40 + #include <linux/sort.h> 40 41 41 42 #include <asm/pgtable.h> 42 43 #include <asm/tlbflush.h> ··· 200 199 } 201 200 } 202 201 202 + #ifdef CONFIG_THP_SWAP 203 + #define SWAPFILE_CLUSTER HPAGE_PMD_NR 204 + #else 203 205 #define SWAPFILE_CLUSTER 256 206 + #endif 204 207 #define LATENCY_LIMIT 256 205 208 206 209 static inline void cluster_set_flag(struct swap_cluster_info *info, ··· 379 374 schedule_work(&si->discard_work); 380 375 } 381 376 377 + static void __free_cluster(struct swap_info_struct *si, unsigned long idx) 378 + { 379 + struct swap_cluster_info *ci = si->cluster_info; 380 + 381 + cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); 382 + cluster_list_add_tail(&si->free_clusters, ci, idx); 383 + } 384 + 382 385 /* 383 386 * Doing discard actually. After a cluster discard is finished, the cluster 384 387 * will be added to free cluster list. caller should hold si->lock. ··· 407 394 408 395 spin_lock(&si->lock); 409 396 ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); 410 - cluster_set_flag(ci, CLUSTER_FLAG_FREE); 411 - unlock_cluster(ci); 412 - cluster_list_add_tail(&si->free_clusters, info, idx); 413 - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); 397 + __free_cluster(si, idx); 414 398 memset(si->swap_map + idx * SWAPFILE_CLUSTER, 415 399 0, SWAPFILE_CLUSTER); 416 400 unlock_cluster(ci); ··· 425 415 spin_unlock(&si->lock); 426 416 } 427 417 418 + static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) 419 + { 420 + struct swap_cluster_info *ci = si->cluster_info; 421 + 422 + VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); 423 + cluster_list_del_first(&si->free_clusters, ci); 424 + cluster_set_count_flag(ci + idx, 0, 0); 425 + } 426 + 427 + static void free_cluster(struct swap_info_struct *si, unsigned long idx) 428 + { 429 + struct swap_cluster_info *ci = si->cluster_info + idx; 430 + 431 + VM_BUG_ON(cluster_count(ci) != 0); 
432 + /* 433 + * If the swap is discardable, prepare discard the cluster 434 + * instead of free it immediately. The cluster will be freed 435 + * after discard. 436 + */ 437 + if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == 438 + (SWP_WRITEOK | SWP_PAGE_DISCARD)) { 439 + swap_cluster_schedule_discard(si, idx); 440 + return; 441 + } 442 + 443 + __free_cluster(si, idx); 444 + } 445 + 428 446 /* 429 447 * The cluster corresponding to page_nr will be used. The cluster will be 430 448 * removed from free cluster list and its usage counter will be increased. ··· 464 426 465 427 if (!cluster_info) 466 428 return; 467 - if (cluster_is_free(&cluster_info[idx])) { 468 - VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx); 469 - cluster_list_del_first(&p->free_clusters, cluster_info); 470 - cluster_set_count_flag(&cluster_info[idx], 0, 0); 471 - } 429 + if (cluster_is_free(&cluster_info[idx])) 430 + alloc_cluster(p, idx); 472 431 473 432 VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); 474 433 cluster_set_count(&cluster_info[idx], ··· 489 454 cluster_set_count(&cluster_info[idx], 490 455 cluster_count(&cluster_info[idx]) - 1); 491 456 492 - if (cluster_count(&cluster_info[idx]) == 0) { 493 - /* 494 - * If the swap is discardable, prepare discard the cluster 495 - * instead of free it immediately. The cluster will be freed 496 - * after discard. 
497 - */ 498 - if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == 499 - (SWP_WRITEOK | SWP_PAGE_DISCARD)) { 500 - swap_cluster_schedule_discard(p, idx); 501 - return; 502 - } 503 - 504 - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); 505 - cluster_list_add_tail(&p->free_clusters, cluster_info, idx); 506 - } 457 + if (cluster_count(&cluster_info[idx]) == 0) 458 + free_cluster(p, idx); 507 459 } 508 460 509 461 /* ··· 578 556 *offset = tmp; 579 557 *scan_base = tmp; 580 558 return found_free; 559 + } 560 + 561 + static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, 562 + unsigned int nr_entries) 563 + { 564 + unsigned int end = offset + nr_entries - 1; 565 + 566 + if (offset == si->lowest_bit) 567 + si->lowest_bit += nr_entries; 568 + if (end == si->highest_bit) 569 + si->highest_bit -= nr_entries; 570 + si->inuse_pages += nr_entries; 571 + if (si->inuse_pages == si->pages) { 572 + si->lowest_bit = si->max; 573 + si->highest_bit = 0; 574 + spin_lock(&swap_avail_lock); 575 + plist_del(&si->avail_list, &swap_avail_head); 576 + spin_unlock(&swap_avail_lock); 577 + } 578 + } 579 + 580 + static void swap_range_free(struct swap_info_struct *si, unsigned long offset, 581 + unsigned int nr_entries) 582 + { 583 + unsigned long end = offset + nr_entries - 1; 584 + void (*swap_slot_free_notify)(struct block_device *, unsigned long); 585 + 586 + if (offset < si->lowest_bit) 587 + si->lowest_bit = offset; 588 + if (end > si->highest_bit) { 589 + bool was_full = !si->highest_bit; 590 + 591 + si->highest_bit = end; 592 + if (was_full && (si->flags & SWP_WRITEOK)) { 593 + spin_lock(&swap_avail_lock); 594 + WARN_ON(!plist_node_empty(&si->avail_list)); 595 + if (plist_node_empty(&si->avail_list)) 596 + plist_add(&si->avail_list, &swap_avail_head); 597 + spin_unlock(&swap_avail_lock); 598 + } 599 + } 600 + atomic_long_add(nr_entries, &nr_swap_pages); 601 + si->inuse_pages -= nr_entries; 602 + if (si->flags & SWP_BLKDEV) 603 + swap_slot_free_notify 
= 604 + si->bdev->bd_disk->fops->swap_slot_free_notify; 605 + else 606 + swap_slot_free_notify = NULL; 607 + while (offset <= end) { 608 + frontswap_invalidate_page(si->type, offset); 609 + if (swap_slot_free_notify) 610 + swap_slot_free_notify(si->bdev, offset); 611 + offset++; 612 + } 581 613 } 582 614 583 615 static int scan_swap_map_slots(struct swap_info_struct *si, ··· 752 676 inc_cluster_info_page(si, si->cluster_info, offset); 753 677 unlock_cluster(ci); 754 678 755 - if (offset == si->lowest_bit) 756 - si->lowest_bit++; 757 - if (offset == si->highest_bit) 758 - si->highest_bit--; 759 - si->inuse_pages++; 760 - if (si->inuse_pages == si->pages) { 761 - si->lowest_bit = si->max; 762 - si->highest_bit = 0; 763 - spin_lock(&swap_avail_lock); 764 - plist_del(&si->avail_list, &swap_avail_head); 765 - spin_unlock(&swap_avail_lock); 766 - } 679 + swap_range_alloc(si, offset, 1); 767 680 si->cluster_next = offset + 1; 768 681 slots[n_ret++] = swp_entry(si->type, offset); 769 682 ··· 831 766 return n_ret; 832 767 } 833 768 769 + #ifdef CONFIG_THP_SWAP 770 + static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) 771 + { 772 + unsigned long idx; 773 + struct swap_cluster_info *ci; 774 + unsigned long offset, i; 775 + unsigned char *map; 776 + 777 + if (cluster_list_empty(&si->free_clusters)) 778 + return 0; 779 + 780 + idx = cluster_list_first(&si->free_clusters); 781 + offset = idx * SWAPFILE_CLUSTER; 782 + ci = lock_cluster(si, offset); 783 + alloc_cluster(si, idx); 784 + cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0); 785 + 786 + map = si->swap_map + offset; 787 + for (i = 0; i < SWAPFILE_CLUSTER; i++) 788 + map[i] = SWAP_HAS_CACHE; 789 + unlock_cluster(ci); 790 + swap_range_alloc(si, offset, SWAPFILE_CLUSTER); 791 + *slot = swp_entry(si->type, offset); 792 + 793 + return 1; 794 + } 795 + 796 + static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) 797 + { 798 + unsigned long offset = idx * SWAPFILE_CLUSTER; 799 + 
struct swap_cluster_info *ci; 800 + 801 + ci = lock_cluster(si, offset); 802 + cluster_set_count_flag(ci, 0, 0); 803 + free_cluster(si, idx); 804 + unlock_cluster(ci); 805 + swap_range_free(si, offset, SWAPFILE_CLUSTER); 806 + } 807 + #else 808 + static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) 809 + { 810 + VM_WARN_ON_ONCE(1); 811 + return 0; 812 + } 813 + #endif /* CONFIG_THP_SWAP */ 814 + 834 815 static unsigned long scan_swap_map(struct swap_info_struct *si, 835 816 unsigned char usage) 836 817 { ··· 892 781 893 782 } 894 783 895 - int get_swap_pages(int n_goal, swp_entry_t swp_entries[]) 784 + int get_swap_pages(int n_goal, bool cluster, swp_entry_t swp_entries[]) 896 785 { 786 + unsigned long nr_pages = cluster ? SWAPFILE_CLUSTER : 1; 897 787 struct swap_info_struct *si, *next; 898 788 long avail_pgs; 899 789 int n_ret = 0; 900 790 901 - avail_pgs = atomic_long_read(&nr_swap_pages); 791 + /* Only single cluster request supported */ 792 + WARN_ON_ONCE(n_goal > 1 && cluster); 793 + 794 + avail_pgs = atomic_long_read(&nr_swap_pages) / nr_pages; 902 795 if (avail_pgs <= 0) 903 796 goto noswap; 904 797 ··· 912 797 if (n_goal > avail_pgs) 913 798 n_goal = avail_pgs; 914 799 915 - atomic_long_sub(n_goal, &nr_swap_pages); 800 + atomic_long_sub(n_goal * nr_pages, &nr_swap_pages); 916 801 917 802 spin_lock(&swap_avail_lock); 918 803 ··· 938 823 spin_unlock(&si->lock); 939 824 goto nextsi; 940 825 } 941 - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, 942 - n_goal, swp_entries); 826 + if (cluster) 827 + n_ret = swap_alloc_cluster(si, swp_entries); 828 + else 829 + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, 830 + n_goal, swp_entries); 943 831 spin_unlock(&si->lock); 944 - if (n_ret) 832 + if (n_ret || cluster) 945 833 goto check_out; 946 834 pr_debug("scan_swap_map of si %d failed to find offset\n", 947 835 si->type); ··· 970 852 971 853 check_out: 972 854 if (n_ret < n_goal) 973 - atomic_long_add((long) (n_goal-n_ret), &nr_swap_pages); 
855 + atomic_long_add((long)(n_goal - n_ret) * nr_pages, 856 + &nr_swap_pages); 974 857 noswap: 975 858 return n_ret; 976 859 } ··· 1127 1008 dec_cluster_info_page(p, p->cluster_info, offset); 1128 1009 unlock_cluster(ci); 1129 1010 1130 - mem_cgroup_uncharge_swap(entry); 1131 - if (offset < p->lowest_bit) 1132 - p->lowest_bit = offset; 1133 - if (offset > p->highest_bit) { 1134 - bool was_full = !p->highest_bit; 1135 - 1136 - p->highest_bit = offset; 1137 - if (was_full && (p->flags & SWP_WRITEOK)) { 1138 - spin_lock(&swap_avail_lock); 1139 - WARN_ON(!plist_node_empty(&p->avail_list)); 1140 - if (plist_node_empty(&p->avail_list)) 1141 - plist_add(&p->avail_list, 1142 - &swap_avail_head); 1143 - spin_unlock(&swap_avail_lock); 1144 - } 1145 - } 1146 - atomic_long_inc(&nr_swap_pages); 1147 - p->inuse_pages--; 1148 - frontswap_invalidate_page(p->type, offset); 1149 - if (p->flags & SWP_BLKDEV) { 1150 - struct gendisk *disk = p->bdev->bd_disk; 1151 - 1152 - if (disk->fops->swap_slot_free_notify) 1153 - disk->fops->swap_slot_free_notify(p->bdev, 1154 - offset); 1155 - } 1011 + mem_cgroup_uncharge_swap(entry, 1); 1012 + swap_range_free(p, offset, 1); 1156 1013 } 1157 1014 1158 1015 /* ··· 1149 1054 /* 1150 1055 * Called after dropping swapcache to decrease refcnt to swap entries. 
1151 1056 */ 1152 - void swapcache_free(swp_entry_t entry) 1057 + static void swapcache_free(swp_entry_t entry) 1153 1058 { 1154 1059 struct swap_info_struct *p; 1155 1060 ··· 1158 1063 if (!__swap_entry_free(p, entry, SWAP_HAS_CACHE)) 1159 1064 free_swap_slot(entry); 1160 1065 } 1066 + } 1067 + 1068 + #ifdef CONFIG_THP_SWAP 1069 + static void swapcache_free_cluster(swp_entry_t entry) 1070 + { 1071 + unsigned long offset = swp_offset(entry); 1072 + unsigned long idx = offset / SWAPFILE_CLUSTER; 1073 + struct swap_cluster_info *ci; 1074 + struct swap_info_struct *si; 1075 + unsigned char *map; 1076 + unsigned int i; 1077 + 1078 + si = swap_info_get(entry); 1079 + if (!si) 1080 + return; 1081 + 1082 + ci = lock_cluster(si, offset); 1083 + map = si->swap_map + offset; 1084 + for (i = 0; i < SWAPFILE_CLUSTER; i++) { 1085 + VM_BUG_ON(map[i] != SWAP_HAS_CACHE); 1086 + map[i] = 0; 1087 + } 1088 + unlock_cluster(ci); 1089 + mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER); 1090 + swap_free_cluster(si, idx); 1091 + spin_unlock(&si->lock); 1092 + } 1093 + #else 1094 + static inline void swapcache_free_cluster(swp_entry_t entry) 1095 + { 1096 + } 1097 + #endif /* CONFIG_THP_SWAP */ 1098 + 1099 + void put_swap_page(struct page *page, swp_entry_t entry) 1100 + { 1101 + if (!PageTransHuge(page)) 1102 + swapcache_free(entry); 1103 + else 1104 + swapcache_free_cluster(entry); 1105 + } 1106 + 1107 + static int swp_entry_cmp(const void *ent1, const void *ent2) 1108 + { 1109 + const swp_entry_t *e1 = ent1, *e2 = ent2; 1110 + 1111 + return (int)swp_type(*e1) - (int)swp_type(*e2); 1161 1112 } 1162 1113 1163 1114 void swapcache_free_entries(swp_entry_t *entries, int n) ··· 1216 1075 1217 1076 prev = NULL; 1218 1077 p = NULL; 1078 + 1079 + /* 1080 + * Sort swap entries by swap device, so each lock is only taken once. 1081 + * nr_swapfiles isn't absolutely correct, but the overhead of sort() is 1082 + * so low that it isn't necessary to optimize further. 
1083 + */ 1084 + if (nr_swapfiles > 1) 1085 + sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL); 1219 1086 for (i = 0; i < n; ++i) { 1220 1087 p = swap_info_get_cont(entries[i], prev); 1221 1088 if (p)
+1 -6
mm/vmalloc.c
··· 1770 1770	 */
1771 1771	clear_vm_uninitialized_flag(area);
1772 1772
1773 -	/*
1774 -	 * A ref_count = 2 is needed because vm_struct allocated in
1775 -	 * __get_vm_area_node() contains a reference to the virtual address of
1776 -	 * the vmalloc'ed block.
1777 -	 */
1778 -	kmemleak_alloc(addr, real_size, 2, gfp_mask);
1773 +	kmemleak_vmalloc(area, size, gfp_mask);
1779 1774
1780 1775	return addr;
1781 1776
+60 -17
mm/vmscan.c
··· 708 708 mem_cgroup_swapout(page, swap); 709 709 __delete_from_swap_cache(page); 710 710 spin_unlock_irqrestore(&mapping->tree_lock, flags); 711 - swapcache_free(swap); 711 + put_swap_page(page, swap); 712 712 } else { 713 713 void (*freepage)(struct page *); 714 714 void *shadow = NULL; ··· 1125 1125 !PageSwapCache(page)) { 1126 1126 if (!(sc->gfp_mask & __GFP_IO)) 1127 1127 goto keep_locked; 1128 - if (!add_to_swap(page, page_list)) 1128 + if (PageTransHuge(page)) { 1129 + /* cannot split THP, skip it */ 1130 + if (!can_split_huge_page(page, NULL)) 1131 + goto activate_locked; 1132 + /* 1133 + * Split pages without a PMD map right 1134 + * away. Chances are some or all of the 1135 + * tail pages can be freed without IO. 1136 + */ 1137 + if (!compound_mapcount(page) && 1138 + split_huge_page_to_list(page, page_list)) 1139 + goto activate_locked; 1140 + } 1141 + if (!add_to_swap(page)) { 1142 + if (!PageTransHuge(page)) 1143 + goto activate_locked; 1144 + /* Split THP and swap individual base pages */ 1145 + if (split_huge_page_to_list(page, page_list)) 1146 + goto activate_locked; 1147 + if (!add_to_swap(page)) 1148 + goto activate_locked; 1149 + } 1150 + 1151 + /* XXX: We don't support THP writes */ 1152 + if (PageTransHuge(page) && 1153 + split_huge_page_to_list(page, page_list)) { 1154 + delete_from_swap_cache(page); 1129 1155 goto activate_locked; 1156 + } 1157 + 1130 1158 may_enter_fs = 1; 1131 1159 1132 1160 /* Adding to swap updated mapping */ ··· 1294 1266 } 1295 1267 1296 1268 count_vm_event(PGLAZYFREED); 1269 + count_memcg_page_event(page, PGLAZYFREED); 1297 1270 } else if (!mapping || !__remove_mapping(mapping, page, true)) 1298 1271 goto keep_locked; 1299 1272 /* ··· 1324 1295 if (!PageMlocked(page)) { 1325 1296 SetPageActive(page); 1326 1297 pgactivate++; 1298 + count_memcg_page_event(page, PGACTIVATE); 1327 1299 } 1328 1300 keep_locked: 1329 1301 unlock_page(page); ··· 1764 1734 __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); 1765 
1735 reclaim_stat->recent_scanned[file] += nr_taken; 1766 1736 1767 - if (global_reclaim(sc)) { 1768 - if (current_is_kswapd()) 1737 + if (current_is_kswapd()) { 1738 + if (global_reclaim(sc)) 1769 1739 __count_vm_events(PGSCAN_KSWAPD, nr_scanned); 1770 - else 1740 + count_memcg_events(lruvec_memcg(lruvec), PGSCAN_KSWAPD, 1741 + nr_scanned); 1742 + } else { 1743 + if (global_reclaim(sc)) 1771 1744 __count_vm_events(PGSCAN_DIRECT, nr_scanned); 1745 + count_memcg_events(lruvec_memcg(lruvec), PGSCAN_DIRECT, 1746 + nr_scanned); 1772 1747 } 1773 1748 spin_unlock_irq(&pgdat->lru_lock); 1774 1749 ··· 1785 1750 1786 1751 spin_lock_irq(&pgdat->lru_lock); 1787 1752 1788 - if (global_reclaim(sc)) { 1789 - if (current_is_kswapd()) 1753 + if (current_is_kswapd()) { 1754 + if (global_reclaim(sc)) 1790 1755 __count_vm_events(PGSTEAL_KSWAPD, nr_reclaimed); 1791 - else 1756 + count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_KSWAPD, 1757 + nr_reclaimed); 1758 + } else { 1759 + if (global_reclaim(sc)) 1792 1760 __count_vm_events(PGSTEAL_DIRECT, nr_reclaimed); 1761 + count_memcg_events(lruvec_memcg(lruvec), PGSTEAL_DIRECT, 1762 + nr_reclaimed); 1793 1763 } 1794 1764 1795 1765 putback_inactive_pages(lruvec, &page_list); ··· 1939 1899 } 1940 1900 } 1941 1901 1942 - if (!is_active_lru(lru)) 1902 + if (!is_active_lru(lru)) { 1943 1903 __count_vm_events(PGDEACTIVATE, nr_moved); 1904 + count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 1905 + nr_moved); 1906 + } 1944 1907 1945 1908 return nr_moved; 1946 1909 } ··· 1981 1938 reclaim_stat->recent_scanned[file] += nr_taken; 1982 1939 1983 1940 __count_vm_events(PGREFILL, nr_scanned); 1941 + count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned); 1984 1942 1985 1943 spin_unlock_irq(&pgdat->lru_lock); 1986 1944 ··· 3011 2967 unsigned long nr_reclaimed; 3012 2968 struct scan_control sc = { 3013 2969 .nr_to_reclaim = SWAP_CLUSTER_MAX, 3014 - .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)), 2970 + .gfp_mask = 
current_gfp_context(gfp_mask), 3015 2971 .reclaim_idx = gfp_zone(gfp_mask), 3016 2972 .order = order, 3017 2973 .nodemask = nodemask, ··· 3026 2982 * 1 is returned so that the page allocator does not OOM kill at this 3027 2983 * point. 3028 2984 */ 3029 - if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask)) 2985 + if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask)) 3030 2986 return 1; 3031 2987 3032 2988 trace_mm_vmscan_direct_reclaim_begin(order, 3033 2989 sc.may_writepage, 3034 - gfp_mask, 2990 + sc.gfp_mask, 3035 2991 sc.reclaim_idx); 3036 2992 3037 2993 nr_reclaimed = do_try_to_free_pages(zonelist, &sc); ··· 3818 3774 const unsigned long nr_pages = 1 << order; 3819 3775 struct task_struct *p = current; 3820 3776 struct reclaim_state reclaim_state; 3821 - int classzone_idx = gfp_zone(gfp_mask); 3822 3777 unsigned int noreclaim_flag; 3823 3778 struct scan_control sc = { 3824 3779 .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), 3825 - .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)), 3780 + .gfp_mask = current_gfp_context(gfp_mask), 3826 3781 .order = order, 3827 3782 .priority = NODE_RECLAIM_PRIORITY, 3828 3783 .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE), 3829 3784 .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), 3830 3785 .may_swap = 1, 3831 - .reclaim_idx = classzone_idx, 3786 + .reclaim_idx = gfp_zone(gfp_mask), 3832 3787 }; 3833 3788 3834 3789 cond_resched(); ··· 3838 3795 */ 3839 3796 noreclaim_flag = memalloc_noreclaim_save(); 3840 3797 p->flags |= PF_SWAPWRITE; 3841 - lockdep_set_current_reclaim_state(gfp_mask); 3798 + lockdep_set_current_reclaim_state(sc.gfp_mask); 3842 3799 reclaim_state.reclaimed_slab = 0; 3843 3800 p->reclaim_state = &reclaim_state; 3844 3801 ··· 3874 3831 * unmapped file backed pages. 
3875 3832 */ 3876 3833 if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages && 3877 - sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages) 3834 + node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages) 3878 3835 return NODE_RECLAIM_FULL; 3879 3836 3880 3837 /*
+13 -13
mm/vmstat.c
··· 928 928 "nr_zone_unevictable", 929 929 "nr_zone_write_pending", 930 930 "nr_mlock", 931 - "nr_slab_reclaimable", 932 - "nr_slab_unreclaimable", 933 931 "nr_page_table_pages", 934 932 "nr_kernel_stack", 935 933 "nr_bounce", ··· 950 952 "nr_inactive_file", 951 953 "nr_active_file", 952 954 "nr_unevictable", 955 + "nr_slab_reclaimable", 956 + "nr_slab_unreclaimable", 953 957 "nr_isolated_anon", 954 958 "nr_isolated_file", 955 959 "workingset_refault", ··· 1018 1018 1019 1019 "drop_pagecache", 1020 1020 "drop_slab", 1021 + "oom_kill", 1021 1022 1022 1023 #ifdef CONFIG_NUMA_BALANCING 1023 1024 "numa_pte_updates", ··· 1224 1223 for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) { 1225 1224 struct page *page; 1226 1225 1227 - if (!pfn_valid(pfn)) 1226 + page = pfn_to_online_page(pfn); 1227 + if (!page) 1228 1228 continue; 1229 - 1230 - page = pfn_to_page(pfn); 1231 1229 1232 1230 /* Watch for unexpected holes punched in the memmap */ 1233 1231 if (!memmap_valid_within(pfn, page, zone)) ··· 1322 1322 return seq_open(file, &fragmentation_op); 1323 1323 } 1324 1324 1325 - static const struct file_operations fragmentation_file_operations = { 1325 + static const struct file_operations buddyinfo_file_operations = { 1326 1326 .open = fragmentation_open, 1327 1327 .read = seq_read, 1328 1328 .llseek = seq_lseek, ··· 1341 1341 return seq_open(file, &pagetypeinfo_op); 1342 1342 } 1343 1343 1344 - static const struct file_operations pagetypeinfo_file_ops = { 1344 + static const struct file_operations pagetypeinfo_file_operations = { 1345 1345 .open = pagetypeinfo_open, 1346 1346 .read = seq_read, 1347 1347 .llseek = seq_lseek, ··· 1463 1463 return seq_open(file, &zoneinfo_op); 1464 1464 } 1465 1465 1466 - static const struct file_operations proc_zoneinfo_file_operations = { 1466 + static const struct file_operations zoneinfo_file_operations = { 1467 1467 .open = zoneinfo_open, 1468 1468 .read = seq_read, 1469 1469 .llseek = seq_lseek, ··· 1552 1552 return 
seq_open(file, &vmstat_op); 1553 1553 } 1554 1554 1555 - static const struct file_operations proc_vmstat_file_operations = { 1555 + static const struct file_operations vmstat_file_operations = { 1556 1556 .open = vmstat_open, 1557 1557 .read = seq_read, 1558 1558 .llseek = seq_lseek, ··· 1785 1785 start_shepherd_timer(); 1786 1786 #endif 1787 1787 #ifdef CONFIG_PROC_FS 1788 - proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations); 1789 - proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops); 1790 - proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations); 1791 - proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations); 1788 + proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations); 1789 + proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations); 1790 + proc_create("vmstat", 0444, NULL, &vmstat_file_operations); 1791 + proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations); 1792 1792 #endif 1793 1793 } 1794 1794
+3 -6
mm/workingset.c
··· 288 288	 */
289 289	refault_distance = (refault - eviction) & EVICTION_MASK;
290 290
291 -	inc_node_state(pgdat, WORKINGSET_REFAULT);
292 -	inc_memcg_state(memcg, WORKINGSET_REFAULT);
291 +	inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
293 292
294 293	if (refault_distance <= active_file) {
295 -		inc_node_state(pgdat, WORKINGSET_ACTIVATE);
296 -		inc_memcg_state(memcg, WORKINGSET_ACTIVATE);
294 +		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
297 295		rcu_read_unlock();
298 296		return true;
299 297	}
··· 472 474	}
473 475	if (WARN_ON_ONCE(node->exceptional))
474 476		goto out_invalid;
475 -	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
476 -	inc_memcg_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
477 +	inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
477 478	__radix_tree_delete_node(&mapping->page_tree, node,
478 479			workingset_update_node, mapping);
479 480
+4 -7
mm/zswap.c
··· 371 371	u8 *dst;
372 372
373 373	dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
374 -	if (!dst) {
375 -		pr_err("can't allocate compressor buffer\n");
374 +	if (!dst)
376 375		return -ENOMEM;
377 -	}
376 +
378 377	per_cpu(zswap_dstmem, cpu) = dst;
379 378	return 0;
380 379	}
··· 514 515	}
515 516
516 517	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
517 -	if (!pool) {
518 -		pr_err("pool alloc failed\n");
518 +	if (!pool)
519 519		return NULL;
520 -	}
521 520
522 521	/* unique name for each pool specifically required by zsmalloc */
523 522	snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count));
··· 1155 1158	{
1156 1159	struct zswap_tree *tree;
1157 1160
1158 -	tree = kzalloc(sizeof(struct zswap_tree), GFP_KERNEL);
1161 +	tree = kzalloc(sizeof(*tree), GFP_KERNEL);
1159 1162	if (!tree) {
1160 1163		pr_err("alloc failed, zswap disabled for swap type %d\n", type);
1161 1164		return;
+2
scripts/gen_initramfs_list.sh
··· 271 271	case "$arg" in
272 272		"-u")	# map $1 to uid=0 (root)
273 273			root_uid="$1"
274 +			[ "$root_uid" = "-1" ] && root_uid=$(id -u || echo 0)
274 275			shift
275 276			;;
276 277		"-g")	# map $1 to gid=0 (root)
277 278			root_gid="$1"
279 +			[ "$root_gid" = "-1" ] && root_gid=$(id -g || echo 0)
278 280			shift
279 281			;;
280 282		"-d")	# display default initramfs list
+31
scripts/spelling.txt
··· 54 54 addional||additional 55 55 additionaly||additionally 56 56 addres||address 57 + adddress||address 57 58 addreses||addresses 58 59 addresss||address 59 60 aditional||additional ··· 131 130 artifical||artificial 132 131 artillary||artillery 133 132 asign||assign 133 + asser||assert 134 134 assertation||assertion 135 135 assiged||assigned 136 136 assigment||assignment ··· 151 149 attachement||attachment 152 150 attched||attached 153 151 attemps||attempts 152 + attemping||attempting 154 153 attruibutes||attributes 155 154 authentification||authentication 156 155 automaticaly||automatically ··· 208 205 calucate||calculate 209 206 calulate||calculate 210 207 cancelation||cancellation 208 + cancle||cancel 211 209 capabilites||capabilities 212 210 capabitilies||capabilities 213 211 capatibilities||capabilities 212 + capapbilities||capabilities 214 213 carefuly||carefully 215 214 cariage||carriage 216 215 catagory||category ··· 221 216 challanges||challenges 222 217 chanell||channel 223 218 changable||changeable 219 + chanined||chained 224 220 channle||channel 225 221 channnel||channel 226 222 charachter||character ··· 247 241 clared||cleared 248 242 closeing||closing 249 243 clustred||clustered 244 + coexistance||coexistence 250 245 collapsable||collapsible 251 246 colorfull||colorful 252 247 comand||command ··· 276 269 completly||completely 277 270 complient||compliant 278 271 componnents||components 272 + compoment||component 279 273 compres||compress 280 274 compresion||compression 281 275 comression||compression ··· 323 315 correponds||corresponds 324 316 correspoding||corresponding 325 317 cotrol||control 318 + cound||could 326 319 couter||counter 327 320 coutner||counter 328 321 cryptocraphic||cryptographic ··· 335 326 deamon||daemon 336 327 decompres||decompress 337 328 decription||description 329 + dectected||detected 338 330 defailt||default 339 331 defferred||deferred 340 332 definate||definite ··· 353 343 delares||declares 354 344 delaring||declaring 
355 345 delemiter||delimiter 346 + demodualtor||demodulator 347 + demension||dimension 356 348 dependancies||dependencies 357 349 dependancy||dependency 358 350 dependant||dependent ··· 369 357 desctiptor||descriptor 370 358 desriptor||descriptor 371 359 desriptors||descriptors 360 + destionation||destination 372 361 destory||destroy 373 362 destoryed||destroyed 374 363 destorys||destroys 375 364 destroied||destroyed 376 365 detabase||database 366 + deteced||detected 377 367 develope||develop 378 368 developement||development 379 369 developped||developed ··· 433 419 encorporating||incorporating 434 420 encrupted||encrypted 435 421 encrypiton||encryption 422 + encryptio||encryption 436 423 endianess||endianness 437 424 enhaced||enhanced 438 425 enlightnment||enlightenment 426 + entrys||entries 439 427 enocded||encoded 440 428 enterily||entirely 441 429 enviroiment||environment ··· 455 439 excecutable||executable 456 440 exceded||exceeded 457 441 excellant||excellent 442 + exeed||exceed 458 443 existance||existence 459 444 existant||existent 460 445 exixt||exist ··· 484 467 faireness||fairness 485 468 falied||failed 486 469 faliure||failure 470 + fallbck||fallback 487 471 familar||familiar 488 472 fatser||faster 489 473 feauture||feature ··· 582 564 independantly||independently 583 565 independed||independent 584 566 indiate||indicate 567 + indicat||indicate 585 568 inexpect||inexpected 586 569 infomation||information 587 570 informatiom||information ··· 701 682 messgaes||messages 702 683 messsage||message 703 684 messsages||messages 685 + micropone||microphone 704 686 microprocesspr||microprocessor 705 687 milliseonds||milliseconds 706 688 minium||minimum ··· 713 693 mispelled||misspelled 714 694 mispelt||misspelt 715 695 mising||missing 696 + missmanaged||mismanaged 697 + missmatch||mismatch 716 698 miximum||maximum 717 699 mmnemonic||mnemonic 718 700 mnay||many 719 701 modulues||modules 720 702 momery||memory 703 + memomry||memory 721 704 monochorome||monochrome 
722 705 monochromo||monochrome 723 706 monocrome||monochrome ··· 821 798 peroid||period 822 799 persistance||persistence 823 800 persistant||persistent 801 + plalform||platform 824 802 platfrom||platform 825 803 plattform||platform 826 804 pleaes||please ··· 834 810 positon||position 835 811 possibilites||possibilities 836 812 powerfull||powerful 813 + preapre||prepare 837 814 preceeded||preceded 838 815 preceeding||preceding 839 816 preceed||precede ··· 893 868 psychadelic||psychedelic 894 869 pwoer||power 895 870 quering||querying 871 + randomally||randomly 896 872 raoming||roaming 897 873 reasearcher||researcher 898 874 reasearchers||researchers ··· 921 895 refrence||reference 922 896 registerd||registered 923 897 registeresd||registered 898 + registerred||registered 924 899 registes||registers 925 900 registraration||registration 926 901 regsiter||register ··· 950 923 requst||request 951 924 reseting||resetting 952 925 resizeable||resizable 926 + resouce||resource 953 927 resouces||resources 954 928 resoures||resources 955 929 responce||response ··· 966 938 reuest||request 967 939 reuqest||request 968 940 reutnred||returned 941 + revsion||revision 969 942 rmeoved||removed 970 943 rmeove||remove 971 944 rmeoves||removes ··· 1128 1099 transision||transition 1129 1100 transmittd||transmitted 1130 1101 transormed||transformed 1102 + trasfer||transfer 1131 1103 trasmission||transmission 1132 1104 treshold||threshold 1133 1105 trigerring||triggering ··· 1197 1167 virtiual||virtual 1198 1168 visiters||visitors 1199 1169 vitual||virtual 1170 + wakeus||wakeups 1200 1171 wating||waiting 1201 1172 wether||whether 1202 1173 whataver||whatever
+10 -14
usr/Kconfig
··· 36 36 depends on INITRAMFS_SOURCE!="" 37 37 default "0" 38 38 help 39 - This setting is only meaningful if the INITRAMFS_SOURCE is 40 - contains a directory. Setting this user ID (UID) to something 41 - other than "0" will cause all files owned by that UID to be 42 - owned by user root in the initial ramdisk image. 39 + If INITRAMFS_SOURCE points to a directory, files owned by this UID 40 + (-1 = current user) will be owned by root in the resulting image. 43 41 44 42 If you are not sure, leave it set to "0". 45 43 ··· 46 48 depends on INITRAMFS_SOURCE!="" 47 49 default "0" 48 50 help 49 - This setting is only meaningful if the INITRAMFS_SOURCE is 50 - contains a directory. Setting this group ID (GID) to something 51 - other than "0" will cause all files owned by that GID to be 52 - owned by group root in the initial ramdisk image. 51 + If INITRAMFS_SOURCE points to a directory, files owned by this GID 52 + (-1 = current group) will be owned by root in the resulting image. 53 53 54 54 If you are not sure, leave it set to "0". 55 55 56 56 config RD_GZIP 57 - bool "Support initial ramdisks compressed using gzip" 57 + bool "Support initial ramdisk/ramfs compressed using gzip" 58 58 depends on BLK_DEV_INITRD 59 59 default y 60 60 select DECOMPRESS_GZIP ··· 61 65 If unsure, say Y. 62 66 63 67 config RD_BZIP2 64 - bool "Support initial ramdisks compressed using bzip2" 68 + bool "Support initial ramdisk/ramfs compressed using bzip2" 65 69 default y 66 70 depends on BLK_DEV_INITRD 67 71 select DECOMPRESS_BZIP2 ··· 70 74 If unsure, say N. 71 75 72 76 config RD_LZMA 73 - bool "Support initial ramdisks compressed using LZMA" 77 + bool "Support initial ramdisk/ramfs compressed using LZMA" 74 78 default y 75 79 depends on BLK_DEV_INITRD 76 80 select DECOMPRESS_LZMA ··· 79 83 If unsure, say N. 
80 84 81 85 config RD_XZ 82 - bool "Support initial ramdisks compressed using XZ" 86 + bool "Support initial ramdisk/ramfs compressed using XZ" 83 87 depends on BLK_DEV_INITRD 84 88 default y 85 89 select DECOMPRESS_XZ ··· 88 92 If unsure, say N. 89 93 90 94 config RD_LZO 91 - bool "Support initial ramdisks compressed using LZO" 95 + bool "Support initial ramdisk/ramfs compressed using LZO" 92 96 default y 93 97 depends on BLK_DEV_INITRD 94 98 select DECOMPRESS_LZO ··· 97 101 If unsure, say N. 98 102 99 103 config RD_LZ4 100 - bool "Support initial ramdisks compressed using LZ4" 104 + bool "Support initial ramdisk/ramfs compressed using LZ4" 101 105 default y 102 106 depends on BLK_DEV_INITRD 103 107 select DECOMPRESS_LZ4