Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:

- a few MM hotfixes

- kthread, tools, scripts, ntfs and ocfs2

- some of MM

Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hotfixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...

+4906 -3437
+1 -1
Documentation/admin-guide/kernel-parameters.txt
··· 4693 4693 fragmentation. Defaults to 1 for systems with 4694 4694 more than 32MB of RAM, 0 otherwise. 4695 4695 4696 - slub_debug[=options[,slabs]] [MM, SLUB] 4696 + slub_debug[=options[,slabs][;[options[,slabs]]...] [MM, SLUB] 4697 4697 Enabling slub_debug allows one to determine the 4698 4698 culprit if slab objects become corrupted. Enabling 4699 4699 slub_debug can create guard zones around objects and
+5 -5
Documentation/dev-tools/kasan.rst
··· 13 13 memory access, and therefore requires a compiler version that supports that. 14 14 15 15 Generic KASAN is supported in both GCC and Clang. With GCC it requires version 16 - 4.9.2 or later for basic support and version 5.0 or later for detection of 17 - out-of-bounds accesses for stack and global variables and for inline 18 - instrumentation mode (see the Usage section). With Clang it requires version 19 - 7.0.0 or later and it doesn't support detection of out-of-bounds accesses for 20 - global variables yet. 16 + 8.3.0 or later. With Clang it requires version 7.0.0 or later, but detection of 17 + out-of-bounds accesses for global variables is only supported since Clang 11. 21 18 22 19 Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later. 23 20 ··· 189 192 function calls GCC directly inserts the code to check the shadow memory. 190 193 This option significantly enlarges kernel but it gives x1.1-x2 performance 191 194 boost over outline instrumented kernel. 195 + 196 + Generic KASAN prints up to 2 call_rcu() call stacks in reports, the last one 197 + and the second to last. 192 198 193 199 Software tag-based KASAN 194 200 ~~~~~~~~~~~~~~~~~~~~~~~~
+1 -1
Documentation/filesystems/dlmfs.rst
··· 12 12 13 13 :Project web page: http://ocfs2.wiki.kernel.org 14 14 :Tools web page: https://github.com/markfasheh/ocfs2-tools 15 - :OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ 15 + :OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/ 16 16 17 17 All code copyright 2005 Oracle except when otherwise noted. 18 18
+1 -1
Documentation/filesystems/ocfs2.rst
··· 14 14 15 15 Project web page: http://ocfs2.wiki.kernel.org 16 16 Tools git tree: https://github.com/markfasheh/ocfs2-tools 17 - OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ 17 + OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/ 18 18 19 19 All code copyright 2005 Oracle except when otherwise noted. 20 20
+18
Documentation/filesystems/tmpfs.rst
··· 150 150 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. 151 151 152 152 153 + tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode 154 + numbers: 155 + 156 + ======= ======================== 157 + inode64 Use 64-bit inode numbers 158 + inode32 Use 32-bit inode numbers 159 + ======= ======================== 160 + 161 + On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time. 162 + On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default. inode64 avoids the 163 + possibility of multiple files with the same inode number on a single device; 164 + but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached - 165 + if a long-lived tmpfs is accessed by 32-bit applications so ancient that 166 + opening a file larger than 2GiB fails with EINVAL. 167 + 168 + 153 169 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' 154 170 will give you tmpfs instance on /mytmpfs which can allocate 10GB 155 171 RAM/SWAP in 10240 inodes and it is only accessible by root. ··· 177 161 Hugh Dickins, 4 June 2007 178 162 :Updated: 179 163 KOSAKI Motohiro, 16 Mar 2010 164 + :Updated: 165 + Chris Down, 13 July 2020
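The inode64/inode32 trade-off described in the tmpfs hunk above can be sketched with a small model. This is a hedged illustration, not the kernel's allocator: tmpfs hands out inode numbers from a growing counter, and `inode32` effectively keeps only the low 32 bits, so two files allocated 2^32 apart would collide, while `inode64` keeps them distinct (at the cost of EOVERFLOW for ancient 32-bit stat() callers once numbers pass 32 bits). The counter values here are invented for illustration.

```python
def inode32(ino64):
    """Model of the inode32 behaviour: only the low 32 bits of the
    (conceptual) 64-bit inode counter are exposed."""
    return ino64 & 0xFFFFFFFF

# Two hypothetical files whose counter values are 2^32 apart:
a, b = 5, (1 << 32) + 5
assert a != b                     # inode64 keeps them distinct
assert inode32(a) == inode32(b)   # inode32 wraps: same inode number

# The EOVERFLOW hazard: a legacy 32-bit st_ino field cannot hold
# a number that needs 33 bits.
def fits_legacy_stat(ino):
    return ino < 2**32

print(fits_legacy_stat(2**32 - 1))  # True
print(fits_legacy_stat(2**32))      # False
```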
+258
Documentation/vm/arch_pgtable_helpers.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + .. _arch_page_table_helpers: 4 + 5 + =============================== 6 + Architecture Page Table Helpers 7 + =============================== 8 + 9 + Generic MM expects architectures (with MMU) to provide helpers to create, access 10 + and modify page table entries at various level for different memory functions. 11 + These page table helpers need to conform to a common semantics across platforms. 12 + Following tables describe the expected semantics which can also be tested during 13 + boot via CONFIG_DEBUG_VM_PGTABLE option. All future changes in here or the debug 14 + test need to be in sync. 15 + 16 + ====================== 17 + PTE Page Table Helpers 18 + ====================== 19 + 20 + +---------------------------+--------------------------------------------------+ 21 + | pte_same | Tests whether both PTE entries are the same | 22 + +---------------------------+--------------------------------------------------+ 23 + | pte_bad | Tests a non-table mapped PTE | 24 + +---------------------------+--------------------------------------------------+ 25 + | pte_present | Tests a valid mapped PTE | 26 + +---------------------------+--------------------------------------------------+ 27 + | pte_young | Tests a young PTE | 28 + +---------------------------+--------------------------------------------------+ 29 + | pte_dirty | Tests a dirty PTE | 30 + +---------------------------+--------------------------------------------------+ 31 + | pte_write | Tests a writable PTE | 32 + +---------------------------+--------------------------------------------------+ 33 + | pte_special | Tests a special PTE | 34 + +---------------------------+--------------------------------------------------+ 35 + | pte_protnone | Tests a PROT_NONE PTE | 36 + +---------------------------+--------------------------------------------------+ 37 + | pte_devmap | Tests a ZONE_DEVICE mapped PTE | 38 + 
+---------------------------+--------------------------------------------------+ 39 + | pte_soft_dirty | Tests a soft dirty PTE | 40 + +---------------------------+--------------------------------------------------+ 41 + | pte_swp_soft_dirty | Tests a soft dirty swapped PTE | 42 + +---------------------------+--------------------------------------------------+ 43 + | pte_mkyoung | Creates a young PTE | 44 + +---------------------------+--------------------------------------------------+ 45 + | pte_mkold | Creates an old PTE | 46 + +---------------------------+--------------------------------------------------+ 47 + | pte_mkdirty | Creates a dirty PTE | 48 + +---------------------------+--------------------------------------------------+ 49 + | pte_mkclean | Creates a clean PTE | 50 + +---------------------------+--------------------------------------------------+ 51 + | pte_mkwrite | Creates a writable PTE | 52 + +---------------------------+--------------------------------------------------+ 53 + | pte_mkwrprotect | Creates a write protected PTE | 54 + +---------------------------+--------------------------------------------------+ 55 + | pte_mkspecial | Creates a special PTE | 56 + +---------------------------+--------------------------------------------------+ 57 + | pte_mkdevmap | Creates a ZONE_DEVICE mapped PTE | 58 + +---------------------------+--------------------------------------------------+ 59 + | pte_mksoft_dirty | Creates a soft dirty PTE | 60 + +---------------------------+--------------------------------------------------+ 61 + | pte_clear_soft_dirty | Clears a soft dirty PTE | 62 + +---------------------------+--------------------------------------------------+ 63 + | pte_swp_mksoft_dirty | Creates a soft dirty swapped PTE | 64 + +---------------------------+--------------------------------------------------+ 65 + | pte_swp_clear_soft_dirty | Clears a soft dirty swapped PTE | 66 + 
+---------------------------+--------------------------------------------------+ 67 + | pte_mknotpresent | Invalidates a mapped PTE | 68 + +---------------------------+--------------------------------------------------+ 69 + | ptep_get_and_clear | Clears a PTE | 70 + +---------------------------+--------------------------------------------------+ 71 + | ptep_get_and_clear_full | Clears a PTE | 72 + +---------------------------+--------------------------------------------------+ 73 + | ptep_test_and_clear_young | Clears young from a PTE | 74 + +---------------------------+--------------------------------------------------+ 75 + | ptep_set_wrprotect | Converts into a write protected PTE | 76 + +---------------------------+--------------------------------------------------+ 77 + | ptep_set_access_flags | Converts into a more permissive PTE | 78 + +---------------------------+--------------------------------------------------+ 79 + 80 + ====================== 81 + PMD Page Table Helpers 82 + ====================== 83 + 84 + +---------------------------+--------------------------------------------------+ 85 + | pmd_same | Tests whether both PMD entries are the same | 86 + +---------------------------+--------------------------------------------------+ 87 + | pmd_bad | Tests a non-table mapped PMD | 88 + +---------------------------+--------------------------------------------------+ 89 + | pmd_leaf | Tests a leaf mapped PMD | 90 + +---------------------------+--------------------------------------------------+ 91 + | pmd_huge | Tests a HugeTLB mapped PMD | 92 + +---------------------------+--------------------------------------------------+ 93 + | pmd_trans_huge | Tests a Transparent Huge Page (THP) at PMD | 94 + +---------------------------+--------------------------------------------------+ 95 + | pmd_present | Tests a valid mapped PMD | 96 + +---------------------------+--------------------------------------------------+ 97 + | pmd_young | Tests a young PMD | 98 + 
+---------------------------+--------------------------------------------------+ 99 + | pmd_dirty | Tests a dirty PMD | 100 + +---------------------------+--------------------------------------------------+ 101 + | pmd_write | Tests a writable PMD | 102 + +---------------------------+--------------------------------------------------+ 103 + | pmd_special | Tests a special PMD | 104 + +---------------------------+--------------------------------------------------+ 105 + | pmd_protnone | Tests a PROT_NONE PMD | 106 + +---------------------------+--------------------------------------------------+ 107 + | pmd_devmap | Tests a ZONE_DEVICE mapped PMD | 108 + +---------------------------+--------------------------------------------------+ 109 + | pmd_soft_dirty | Tests a soft dirty PMD | 110 + +---------------------------+--------------------------------------------------+ 111 + | pmd_swp_soft_dirty | Tests a soft dirty swapped PMD | 112 + +---------------------------+--------------------------------------------------+ 113 + | pmd_mkyoung | Creates a young PMD | 114 + +---------------------------+--------------------------------------------------+ 115 + | pmd_mkold | Creates an old PMD | 116 + +---------------------------+--------------------------------------------------+ 117 + | pmd_mkdirty | Creates a dirty PMD | 118 + +---------------------------+--------------------------------------------------+ 119 + | pmd_mkclean | Creates a clean PMD | 120 + +---------------------------+--------------------------------------------------+ 121 + | pmd_mkwrite | Creates a writable PMD | 122 + +---------------------------+--------------------------------------------------+ 123 + | pmd_mkwrprotect | Creates a write protected PMD | 124 + +---------------------------+--------------------------------------------------+ 125 + | pmd_mkspecial | Creates a special PMD | 126 + +---------------------------+--------------------------------------------------+ 127 + | pmd_mkdevmap | Creates a 
ZONE_DEVICE mapped PMD | 128 + +---------------------------+--------------------------------------------------+ 129 + | pmd_mksoft_dirty | Creates a soft dirty PMD | 130 + +---------------------------+--------------------------------------------------+ 131 + | pmd_clear_soft_dirty | Clears a soft dirty PMD | 132 + +---------------------------+--------------------------------------------------+ 133 + | pmd_swp_mksoft_dirty | Creates a soft dirty swapped PMD | 134 + +---------------------------+--------------------------------------------------+ 135 + | pmd_swp_clear_soft_dirty | Clears a soft dirty swapped PMD | 136 + +---------------------------+--------------------------------------------------+ 137 + | pmd_mkinvalid | Invalidates a mapped PMD [1] | 138 + +---------------------------+--------------------------------------------------+ 139 + | pmd_set_huge | Creates a PMD huge mapping | 140 + +---------------------------+--------------------------------------------------+ 141 + | pmd_clear_huge | Clears a PMD huge mapping | 142 + +---------------------------+--------------------------------------------------+ 143 + | pmdp_get_and_clear | Clears a PMD | 144 + +---------------------------+--------------------------------------------------+ 145 + | pmdp_get_and_clear_full | Clears a PMD | 146 + +---------------------------+--------------------------------------------------+ 147 + | pmdp_test_and_clear_young | Clears young from a PMD | 148 + +---------------------------+--------------------------------------------------+ 149 + | pmdp_set_wrprotect | Converts into a write protected PMD | 150 + +---------------------------+--------------------------------------------------+ 151 + | pmdp_set_access_flags | Converts into a more permissive PMD | 152 + +---------------------------+--------------------------------------------------+ 153 + 154 + ====================== 155 + PUD Page Table Helpers 156 + ====================== 157 + 158 + 
+---------------------------+--------------------------------------------------+ 159 + | pud_same | Tests whether both PUD entries are the same | 160 + +---------------------------+--------------------------------------------------+ 161 + | pud_bad | Tests a non-table mapped PUD | 162 + +---------------------------+--------------------------------------------------+ 163 + | pud_leaf | Tests a leaf mapped PUD | 164 + +---------------------------+--------------------------------------------------+ 165 + | pud_huge | Tests a HugeTLB mapped PUD | 166 + +---------------------------+--------------------------------------------------+ 167 + | pud_trans_huge | Tests a Transparent Huge Page (THP) at PUD | 168 + +---------------------------+--------------------------------------------------+ 169 + | pud_present | Tests a valid mapped PUD | 170 + +---------------------------+--------------------------------------------------+ 171 + | pud_young | Tests a young PUD | 172 + +---------------------------+--------------------------------------------------+ 173 + | pud_dirty | Tests a dirty PUD | 174 + +---------------------------+--------------------------------------------------+ 175 + | pud_write | Tests a writable PUD | 176 + +---------------------------+--------------------------------------------------+ 177 + | pud_devmap | Tests a ZONE_DEVICE mapped PUD | 178 + +---------------------------+--------------------------------------------------+ 179 + | pud_mkyoung | Creates a young PUD | 180 + +---------------------------+--------------------------------------------------+ 181 + | pud_mkold | Creates an old PUD | 182 + +---------------------------+--------------------------------------------------+ 183 + | pud_mkdirty | Creates a dirty PUD | 184 + +---------------------------+--------------------------------------------------+ 185 + | pud_mkclean | Creates a clean PUD | 186 + +---------------------------+--------------------------------------------------+ 187 + | pud_mkwrite | 
Creates a writable PUD | 188 + +---------------------------+--------------------------------------------------+ 189 + | pud_mkwrprotect | Creates a write protected PUD | 190 + +---------------------------+--------------------------------------------------+ 191 + | pud_mkdevmap | Creates a ZONE_DEVICE mapped PUD | 192 + +---------------------------+--------------------------------------------------+ 193 + | pud_mkinvalid | Invalidates a mapped PUD [1] | 194 + +---------------------------+--------------------------------------------------+ 195 + | pud_set_huge | Creates a PUD huge mapping | 196 + +---------------------------+--------------------------------------------------+ 197 + | pud_clear_huge | Clears a PUD huge mapping | 198 + +---------------------------+--------------------------------------------------+ 199 + | pudp_get_and_clear | Clears a PUD | 200 + +---------------------------+--------------------------------------------------+ 201 + | pudp_get_and_clear_full | Clears a PUD | 202 + +---------------------------+--------------------------------------------------+ 203 + | pudp_test_and_clear_young | Clears young from a PUD | 204 + +---------------------------+--------------------------------------------------+ 205 + | pudp_set_wrprotect | Converts into a write protected PUD | 206 + +---------------------------+--------------------------------------------------+ 207 + | pudp_set_access_flags | Converts into a more permissive PUD | 208 + +---------------------------+--------------------------------------------------+ 209 + 210 + ========================== 211 + HugeTLB Page Table Helpers 212 + ========================== 213 + 214 + +---------------------------+--------------------------------------------------+ 215 + | pte_huge | Tests a HugeTLB | 216 + +---------------------------+--------------------------------------------------+ 217 + | pte_mkhuge | Creates a HugeTLB | 218 + +---------------------------+--------------------------------------------------+ 
219 + | huge_pte_dirty | Tests a dirty HugeTLB | 220 + +---------------------------+--------------------------------------------------+ 221 + | huge_pte_write | Tests a writable HugeTLB | 222 + +---------------------------+--------------------------------------------------+ 223 + | huge_pte_mkdirty | Creates a dirty HugeTLB | 224 + +---------------------------+--------------------------------------------------+ 225 + | huge_pte_mkwrite | Creates a writable HugeTLB | 226 + +---------------------------+--------------------------------------------------+ 227 + | huge_pte_mkwrprotect | Creates a write protected HugeTLB | 228 + +---------------------------+--------------------------------------------------+ 229 + | huge_ptep_get_and_clear | Clears a HugeTLB | 230 + +---------------------------+--------------------------------------------------+ 231 + | huge_ptep_set_wrprotect | Converts into a write protected HugeTLB | 232 + +---------------------------+--------------------------------------------------+ 233 + | huge_ptep_set_access_flags | Converts into a more permissive HugeTLB | 234 + +---------------------------+--------------------------------------------------+ 235 + 236 + ======================== 237 + SWAP Page Table Helpers 238 + ======================== 239 + 240 + +---------------------------+--------------------------------------------------+ 241 + | __pte_to_swp_entry | Creates a swapped entry (arch) from a mapped PTE | 242 + +---------------------------+--------------------------------------------------+ 243 + | __swp_to_pte_entry | Creates a mapped PTE from a swapped entry (arch) | 244 + +---------------------------+--------------------------------------------------+ 245 + | __pmd_to_swp_entry | Creates a swapped entry (arch) from a mapped PMD | 246 + +---------------------------+--------------------------------------------------+ 247 + | __swp_to_pmd_entry | Creates a mapped PMD from a swapped entry (arch) | 248 + 
+---------------------------+--------------------------------------------------+ 249 + | is_migration_entry | Tests a migration (read or write) swapped entry | 250 + +---------------------------+--------------------------------------------------+ 251 + | is_write_migration_entry | Tests a write migration swapped entry | 252 + +---------------------------+--------------------------------------------------+ 253 + | make_migration_entry_read | Converts into read migration swapped entry | 254 + +---------------------------+--------------------------------------------------+ 255 + | make_migration_entry | Creates a migration swapped entry (read or write)| 256 + +---------------------------+--------------------------------------------------+ 257 + 258 + [1] https://lore.kernel.org/linux-mm/20181017020930.GN30832@redhat.com/
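The tables above follow a consistent naming convention: `*_mk*()` helpers return a new entry with an attribute set, bare predicates (`pte_dirty()`, `pte_young()`) test it, and the `mkclean`/`mkold` variants drop it. A minimal sketch of that convention, with flag bit values invented purely for illustration (real helpers manipulate architecture-defined hardware bits):

```python
# Hypothetical flag bits; real positions are architecture-specific.
DIRTY, YOUNG, WRITE = 0x1, 0x2, 0x4

def pte_mkdirty(pte): return pte | DIRTY    # Creates a dirty PTE
def pte_mkclean(pte): return pte & ~DIRTY   # Creates a clean PTE
def pte_mkyoung(pte): return pte | YOUNG    # Creates a young PTE
def pte_mkold(pte):   return pte & ~YOUNG   # Creates an old PTE
def pte_dirty(pte):   return bool(pte & DIRTY)  # Tests a dirty PTE
def pte_young(pte):   return bool(pte & YOUNG)  # Tests a young PTE

pte = pte_mkdirty(pte_mkyoung(0))
print(pte_dirty(pte), pte_young(pte))   # True True
print(pte_dirty(pte_mkclean(pte)))      # False
```

The point of documenting these semantics centrally is that CONFIG_DEBUG_VM_PGTABLE can then verify each architecture's helpers against the same expected behaviour at boot.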
+3 -6
Documentation/vm/memory-model.rst
··· 141 141 `mem_section` objects and the number of rows is calculated to fit 142 142 all the memory sections. 143 143 144 - The architecture setup code should call :c:func:`memory_present` for 145 - each active memory range or use :c:func:`memblocks_present` or 146 - :c:func:`sparse_memory_present_with_active_regions` wrappers to 147 - initialize the memory sections. Next, the actual memory maps should be 148 - set up using :c:func:`sparse_init`. 144 + The architecture setup code should call sparse_init() to 145 + initialize the memory sections and the memory maps. 149 146 150 147 With SPARSEMEM there are two possible ways to convert a PFN to the 151 148 corresponding `struct page` - a "classic sparse" and "sparse ··· 175 178 devices. This storage is represented with :c:type:`struct vmem_altmap` 176 179 that is eventually passed to vmemmap_populate() through a long chain 177 180 of function calls. The vmemmap_populate() implementation may use the 178 - `vmem_altmap` along with :c:func:`altmap_alloc_block_buf` helper to 181 + `vmem_altmap` along with :c:func:`vmemmap_alloc_block_buf` helper to 179 182 allocate memory map on the persistent memory device. 180 183 181 184 ZONE_DEVICE
+29 -8
Documentation/vm/slub.rst
··· 41 41 Enable options only for select slabs (no spaces 42 42 after a comma) 43 43 44 + Multiple blocks of options for all slabs or selected slabs can be given, with 45 + blocks of options delimited by ';'. The last of "all slabs" blocks is applied 46 + to all slabs except those that match one of the "select slabs" block. Options 47 + of the first "select slabs" blocks that matches the slab's name are applied. 48 + 44 49 Possible debug options are:: 45 50 46 51 F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS ··· 88 83 89 84 slub_debug=O 90 85 91 - In case you forgot to enable debugging on the kernel command line: It is 92 - possible to enable debugging manually when the kernel is up. Look at the 93 - contents of:: 86 + You can apply different options to different list of slab names, using blocks 87 + of options. This will enable red zoning for dentry and user tracking for 88 + kmalloc. All other slabs will not get any debugging enabled:: 89 + 90 + slub_debug=Z,dentry;U,kmalloc-* 91 + 92 + You can also enable options (e.g. sanity checks and poisoning) for all caches 93 + except some that are deemed too performance critical and don't need to be 94 + debugged by specifying global debug options followed by a list of slab names 95 + with "-" as options:: 96 + 97 + slub_debug=FZ;-,zs_handle,zspage 98 + 99 + The state of each debug option for a slab can be found in the respective files 100 + under:: 94 101 95 102 /sys/kernel/slab/<slab name>/ 96 103 97 - Look at the writable files. Writing 1 to them will enable the 98 - corresponding debug option. All options can be set on a slab that does 99 - not contain objects. If the slab already contains objects then sanity checks 100 - and tracing may only be enabled. The other options may cause the realignment 101 - of objects. 104 + If the file contains 1, the option is enabled, 0 means disabled. 
The debug 105 + options from the ``slub_debug`` parameter translate to the following files:: 106 + 107 + F sanity_checks 108 + Z red_zone 109 + P poison 110 + U store_user 111 + T trace 112 + A failslab 102 113 103 114 Careful with tracing: It may spew out lots of information and never stop if 104 115 used on the wrong slab.
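The multi-block matching rules the slub.rst hunk describes (first matching "select slabs" block wins; otherwise the last "all slabs" block applies; "-" disables debugging) can be modeled in a few lines. This is a hedged sketch of the documented semantics, not the kernel's parser; `slub_debug_options` is an invented name, and glob matching stands in for the kernel's cache-name matching.

```python
import fnmatch

def slub_debug_options(param, slab):
    """Return the debug option string that the documented rules would
    apply to cache `slab` for a given slub_debug= value."""
    global_opts = ""
    for block in param.split(";"):
        opts, _, slabs = block.partition(",")
        if not slabs:              # an "all slabs" block: last one wins
            global_opts = opts
            continue
        for pattern in slabs.split(","):
            if fnmatch.fnmatch(slab, pattern):
                # First matching "select slabs" block applies;
                # "-" means no debugging for these caches.
                return "" if opts == "-" else opts
    return global_opts

# Red zoning for dentry, user tracking for the kmalloc caches:
print(slub_debug_options("Z,dentry;U,kmalloc-*", "kmalloc-64"))   # U
# Everything debugged except the zsmalloc caches:
print(slub_debug_options("FZ;-,zs_handle,zspage", "dentry"))      # FZ
print(slub_debug_options("FZ;-,zs_handle,zspage", "zs_handle"))   # (empty)
```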
+1 -20
arch/alpha/include/asm/pgalloc.h
··· 5 5 #include <linux/mm.h> 6 6 #include <linux/mmzone.h> 7 7 8 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 8 + #include <asm-generic/pgalloc.h> 9 9 10 10 /* 11 11 * Allocate and free page tables. The xxx_kernel() versions are ··· 33 33 } 34 34 35 35 extern pgd_t *pgd_alloc(struct mm_struct *mm); 36 - 37 - static inline void 38 - pgd_free(struct mm_struct *mm, pgd_t *pgd) 39 - { 40 - free_page((unsigned long)pgd); 41 - } 42 - 43 - static inline pmd_t * 44 - pmd_alloc_one(struct mm_struct *mm, unsigned long address) 45 - { 46 - pmd_t *ret = (pmd_t *)__get_free_page(GFP_PGTABLE_USER); 47 - return ret; 48 - } 49 - 50 - static inline void 51 - pmd_free(struct mm_struct *mm, pmd_t *pmd) 52 - { 53 - free_page((unsigned long)pmd); 54 - } 55 36 56 37 #endif /* _ALPHA_PGALLOC_H */
-1
arch/alpha/include/asm/tlbflush.h
··· 5 5 #include <linux/mm.h> 6 6 #include <linux/sched.h> 7 7 #include <asm/compiler.h> 8 - #include <asm/pgalloc.h> 9 8 10 9 #ifndef __EXTERN_INLINE 11 10 #define __EXTERN_INLINE extern inline
-1
arch/alpha/kernel/core_irongate.c
··· 302 302 #include <linux/agp_backend.h> 303 303 #include <linux/agpgart.h> 304 304 #include <linux/export.h> 305 - #include <asm/pgalloc.h> 306 305 307 306 #define GET_PAGE_DIR_OFF(addr) (addr >> 22) 308 307 #define GET_PAGE_DIR_IDX(addr) (GET_PAGE_DIR_OFF(addr))
-1
arch/alpha/kernel/core_marvel.c
··· 23 23 #include <asm/ptrace.h> 24 24 #include <asm/smp.h> 25 25 #include <asm/gct.h> 26 - #include <asm/pgalloc.h> 27 26 #include <asm/tlbflush.h> 28 27 #include <asm/vga.h> 29 28
-1
arch/alpha/kernel/core_titan.c
··· 20 20 21 21 #include <asm/ptrace.h> 22 22 #include <asm/smp.h> 23 - #include <asm/pgalloc.h> 24 23 #include <asm/tlbflush.h> 25 24 #include <asm/vga.h> 26 25
-2
arch/alpha/kernel/machvec_impl.h
··· 7 7 * This file has goodies to help simplify instantiation of machine vectors. 8 8 */ 9 9 10 - #include <asm/pgalloc.h> 11 - 12 10 /* Whee. These systems don't have an HAE: 13 11 IRONGATE, MARVEL, POLARIS, TSUNAMI, TITAN, WILDFIRE 14 12 Fix things up for the GENERIC kernel by defining the HAE address
-1
arch/alpha/kernel/smp.c
··· 36 36 37 37 #include <asm/io.h> 38 38 #include <asm/irq.h> 39 - #include <asm/pgalloc.h> 40 39 #include <asm/mmu_context.h> 41 40 #include <asm/tlbflush.h> 42 41
-1
arch/alpha/mm/numa.c
··· 17 17 #include <linux/module.h> 18 18 19 19 #include <asm/hwrpb.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/sections.h> 22 21 23 22 pg_data_t node_data[MAX_NUMNODES];
-1
arch/arc/mm/fault.c
··· 13 13 #include <linux/kdebug.h> 14 14 #include <linux/perf_event.h> 15 15 #include <linux/mm_types.h> 16 - #include <asm/pgalloc.h> 17 16 #include <asm/mmu.h> 18 17 19 18 /*
-1
arch/arc/mm/init.c
··· 14 14 #include <linux/module.h> 15 15 #include <linux/highmem.h> 16 16 #include <asm/page.h> 17 - #include <asm/pgalloc.h> 18 17 #include <asm/sections.h> 19 18 #include <asm/arcregs.h> 20 19
+1 -11
arch/arm/include/asm/pgalloc.h
··· 22 22 23 23 #ifdef CONFIG_ARM_LPAE 24 24 25 - static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 26 - { 27 - return (pmd_t *)get_zeroed_page(GFP_KERNEL); 28 - } 29 - 30 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 31 - { 32 - BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); 33 - free_page((unsigned long)pmd); 34 - } 35 - 36 25 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) 37 26 { 38 27 set_pud(pud, __pud(__pa(pmd) | PMD_TYPE_TABLE)); ··· 65 76 66 77 #define __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL 67 78 #define __HAVE_ARCH_PTE_ALLOC_ONE 79 + #define __HAVE_ARCH_PGD_FREE 68 80 #include <asm-generic/pgalloc.h> 69 81 70 82 static inline pte_t *
-1
arch/arm/include/asm/tlb.h
··· 27 27 #else /* !CONFIG_MMU */ 28 28 29 29 #include <linux/swap.h> 30 - #include <asm/pgalloc.h> 31 30 #include <asm/tlbflush.h> 32 31 33 32 static inline void __tlb_remove_table(void *_table)
-1
arch/arm/kernel/machine_kexec.c
··· 11 11 #include <linux/irq.h> 12 12 #include <linux/memblock.h> 13 13 #include <linux/of_fdt.h> 14 - #include <asm/pgalloc.h> 15 14 #include <asm/mmu_context.h> 16 15 #include <asm/cacheflush.h> 17 16 #include <asm/fncpy.h>
-1
arch/arm/kernel/smp.c
··· 37 37 #include <asm/idmap.h> 38 38 #include <asm/topology.h> 39 39 #include <asm/mmu_context.h> 40 - #include <asm/pgalloc.h> 41 40 #include <asm/procinfo.h> 42 41 #include <asm/processor.h> 43 42 #include <asm/sections.h>
-1
arch/arm/kernel/suspend.c
··· 7 7 #include <asm/bugs.h> 8 8 #include <asm/cacheflush.h> 9 9 #include <asm/idmap.h> 10 - #include <asm/pgalloc.h> 11 10 #include <asm/memory.h> 12 11 #include <asm/smp_plat.h> 13 12 #include <asm/suspend.h>
-1
arch/arm/mach-omap2/omap-mpuss-lowpower.c
··· 42 42 #include <asm/cacheflush.h> 43 43 #include <asm/tlbflush.h> 44 44 #include <asm/smp_scu.h> 45 - #include <asm/pgalloc.h> 46 45 #include <asm/suspend.h> 47 46 #include <asm/virt.h> 48 47 #include <asm/hardware/cache-l2x0.h>
-1
arch/arm/mm/hugetlbpage.c
··· 17 17 #include <asm/mman.h> 18 18 #include <asm/tlb.h> 19 19 #include <asm/tlbflush.h> 20 - #include <asm/pgalloc.h> 21 20 22 21 /* 23 22 * On ARM, huge pages are backed by pmd's rather than pte's, so we do a lot
+2 -7
arch/arm/mm/init.c
··· 243 243 (phys_addr_t)max_low_pfn << PAGE_SHIFT); 244 244 245 245 /* 246 - * Sparsemem tries to allocate bootmem in memory_present(), 247 - * so must be done after the fixed reservations 248 - */ 249 - memblocks_present(); 250 - 251 - /* 252 - * sparse_init() needs the bootmem allocator up and running. 246 + * sparse_init() tries to allocate memory from memblock, so must be 247 + * done after the fixed reservations 253 248 */ 254 249 sparse_init(); 255 250
+1
arch/arm/mm/mmu.c
··· 29 29 #include <asm/traps.h> 30 30 #include <asm/procinfo.h> 31 31 #include <asm/memory.h> 32 + #include <asm/pgalloc.h> 32 33 33 34 #include <asm/mach/arch.h> 34 35 #include <asm/mach/map.h>
+2 -37
arch/arm64/include/asm/pgalloc.h
··· 13 13 #include <asm/cacheflush.h> 14 14 #include <asm/tlbflush.h> 15 15 16 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 16 + #define __HAVE_ARCH_PGD_FREE 17 + #include <asm-generic/pgalloc.h> 17 18 18 19 #define PGD_SIZE (PTRS_PER_PGD * sizeof(pgd_t)) 19 20 20 21 #if CONFIG_PGTABLE_LEVELS > 2 21 - 22 - static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 23 - { 24 - gfp_t gfp = GFP_PGTABLE_USER; 25 - struct page *page; 26 - 27 - if (mm == &init_mm) 28 - gfp = GFP_PGTABLE_KERNEL; 29 - 30 - page = alloc_page(gfp); 31 - if (!page) 32 - return NULL; 33 - if (!pgtable_pmd_page_ctor(page)) { 34 - __free_page(page); 35 - return NULL; 36 - } 37 - return page_address(page); 38 - } 39 - 40 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmdp) 41 - { 42 - BUG_ON((unsigned long)pmdp & (PAGE_SIZE-1)); 43 - pgtable_pmd_page_dtor(virt_to_page(pmdp)); 44 - free_page((unsigned long)pmdp); 45 - } 46 22 47 23 static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot) 48 24 { ··· 37 61 #endif /* CONFIG_PGTABLE_LEVELS > 2 */ 38 62 39 63 #if CONFIG_PGTABLE_LEVELS > 3 40 - 41 - static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) 42 - { 43 - return (pud_t *)__get_free_page(GFP_PGTABLE_USER); 44 - } 45 - 46 - static inline void pud_free(struct mm_struct *mm, pud_t *pudp) 47 - { 48 - BUG_ON((unsigned long)pudp & (PAGE_SIZE-1)); 49 - free_page((unsigned long)pudp); 50 - } 51 64 52 65 static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot) 53 66 {
+1 -1
arch/arm64/kernel/setup.c
··· 276 276 277 277 u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID }; 278 278 279 - void __init setup_arch(char **cmdline_p) 279 + void __init __no_sanitize_address setup_arch(char **cmdline_p) 280 280 { 281 281 init_mm.start_code = (unsigned long) _text; 282 282 init_mm.end_code = (unsigned long) _etext;
-1
arch/arm64/kernel/smp.c
··· 43 43 #include <asm/kvm_mmu.h> 44 44 #include <asm/mmu_context.h> 45 45 #include <asm/numa.h> 46 - #include <asm/pgalloc.h> 47 46 #include <asm/processor.h> 48 47 #include <asm/smp_plat.h> 49 48 #include <asm/sections.h>
-1
arch/arm64/mm/hugetlbpage.c
··· 17 17 #include <asm/mman.h> 18 18 #include <asm/tlb.h> 19 19 #include <asm/tlbflush.h> 20 - #include <asm/pgalloc.h> 21 20 22 21 /* 23 22 * HugeTLB Support Matrix
+2 -4
arch/arm64/mm/init.c
··· 430 430 #endif 431 431 432 432 /* 433 - * Sparsemem tries to allocate bootmem in memory_present(), so must be 434 - * done after the fixed reservations. 433 + * sparse_init() tries to allocate memory from memblock, so must be 434 + * done after the fixed reservations 435 435 */ 436 - memblocks_present(); 437 - 438 436 sparse_init(); 439 437 zone_sizes_init(min, max); 440 438
-1
arch/arm64/mm/ioremap.c
··· 16 16 17 17 #include <asm/fixmap.h> 18 18 #include <asm/tlbflush.h> 19 - #include <asm/pgalloc.h> 20 19 21 20 static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size, 22 21 pgprot_t prot, void *caller)
+39 -20
arch/arm64/mm/mmu.c
··· 35 35 #include <asm/mmu_context.h> 36 36 #include <asm/ptdump.h> 37 37 #include <asm/tlbflush.h> 38 + #include <asm/pgalloc.h> 38 39 39 40 #define NO_BLOCK_MAPPINGS BIT(0) 40 41 #define NO_CONT_MAPPINGS BIT(1) ··· 761 760 } 762 761 763 762 #ifdef CONFIG_MEMORY_HOTPLUG 764 - static void free_hotplug_page_range(struct page *page, size_t size) 763 + static void free_hotplug_page_range(struct page *page, size_t size, 764 + struct vmem_altmap *altmap) 765 765 { 766 - WARN_ON(PageReserved(page)); 767 - free_pages((unsigned long)page_address(page), get_order(size)); 766 + if (altmap) { 767 + vmem_altmap_free(altmap, size >> PAGE_SHIFT); 768 + } else { 769 + WARN_ON(PageReserved(page)); 770 + free_pages((unsigned long)page_address(page), get_order(size)); 771 + } 768 772 } 769 773 770 774 static void free_hotplug_pgtable_page(struct page *page) 771 775 { 772 - free_hotplug_page_range(page, PAGE_SIZE); 776 + free_hotplug_page_range(page, PAGE_SIZE, NULL); 773 777 } 774 778 775 779 static bool pgtable_range_aligned(unsigned long start, unsigned long end, ··· 797 791 } 798 792 799 793 static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, 800 - unsigned long end, bool free_mapped) 794 + unsigned long end, bool free_mapped, 795 + struct vmem_altmap *altmap) 801 796 { 802 797 pte_t *ptep, pte; 803 798 ··· 812 805 pte_clear(&init_mm, addr, ptep); 813 806 flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 814 807 if (free_mapped) 815 - free_hotplug_page_range(pte_page(pte), PAGE_SIZE); 808 + free_hotplug_page_range(pte_page(pte), 809 + PAGE_SIZE, altmap); 816 810 } while (addr += PAGE_SIZE, addr < end); 817 811 } 818 812 819 813 static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, 820 - unsigned long end, bool free_mapped) 814 + unsigned long end, bool free_mapped, 815 + struct vmem_altmap *altmap) 821 816 { 822 817 unsigned long next; 823 818 pmd_t *pmdp, pmd; ··· 842 833 flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 843 834 if (free_mapped) 844 835 free_hotplug_page_range(pmd_page(pmd), 845 - PMD_SIZE); 836 + PMD_SIZE, altmap); 846 837 continue; 847 838 } 848 839 WARN_ON(!pmd_table(pmd)); 849 - unmap_hotplug_pte_range(pmdp, addr, next, free_mapped); 840 + unmap_hotplug_pte_range(pmdp, addr, next, free_mapped, altmap); 850 841 } while (addr = next, addr < end); 851 842 } 852 843 853 844 static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr, 854 - unsigned long end, bool free_mapped) 845 + unsigned long end, bool free_mapped, 846 + struct vmem_altmap *altmap) 855 847 { 856 848 unsigned long next; 857 849 pud_t *pudp, pud; ··· 875 865 flush_tlb_kernel_range(addr, addr + PAGE_SIZE); 876 866 if (free_mapped) 877 867 free_hotplug_page_range(pud_page(pud), 878 - PUD_SIZE); 868 + PUD_SIZE, altmap); 879 869 continue; 880 870 } 881 871 WARN_ON(!pud_table(pud)); 882 - unmap_hotplug_pmd_range(pudp, addr, next, free_mapped); 872 + unmap_hotplug_pmd_range(pudp, addr, next, free_mapped, altmap); 883 873 } while (addr = next, addr < end); 884 874 } 885 875 886 876 static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr, 887 - unsigned long end, bool free_mapped) 877 + unsigned long end, bool free_mapped, 878 + struct vmem_altmap *altmap) 888 879 { 889 880 unsigned long next; 890 881 p4d_t *p4dp, p4d; ··· 898 887 continue; 899 888 900 889 WARN_ON(!p4d_present(p4d)); 901 - unmap_hotplug_pud_range(p4dp, addr, next, free_mapped); 890 + unmap_hotplug_pud_range(p4dp, addr, next, free_mapped, altmap); 902 891 } while (addr = next, addr < end); 903 892 } 904 893 905 894 static void unmap_hotplug_range(unsigned long addr, unsigned long end, 906 - bool free_mapped) 895 + bool free_mapped, struct vmem_altmap *altmap) 907 896 { 908 897 unsigned long next; 909 898 pgd_t *pgdp, pgd; 899 + 900 + /* 901 + * altmap can only be used as vmemmap mapping backing memory. 902 + * In case the backing memory itself is not being freed, then 903 + * altmap is irrelevant. Warn about this inconsistency when 904 + * encountered. 905 + */ 906 + WARN_ON(!free_mapped && altmap); 910 907 911 908 do { 912 909 next = pgd_addr_end(addr, end); ··· 924 905 continue; 925 906 926 907 WARN_ON(!pgd_present(pgd)); 927 - unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped); 908 + unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped, altmap); 928 909 } while (addr = next, addr < end); 929 910 } 930 911 ··· 1088 1069 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, 1089 1070 struct vmem_altmap *altmap) 1090 1071 { 1091 - return vmemmap_populate_basepages(start, end, node); 1072 + return vmemmap_populate_basepages(start, end, node, altmap); 1092 1073 } 1093 1074 #else /* !ARM64_SWAPPER_USES_SECTION_MAPS */ 1094 1075 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, ··· 1120 1101 if (pmd_none(READ_ONCE(*pmdp))) { 1121 1102 void *p = NULL; 1122 1103 1123 - p = vmemmap_alloc_block_buf(PMD_SIZE, node); 1104 + p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); 1124 1105 if (!p) 1125 1106 return -ENOMEM; 1126 1107 ··· 1138 1119 #ifdef CONFIG_MEMORY_HOTPLUG 1139 1120 WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END)); 1140 1121 1141 - unmap_hotplug_range(start, end, true); 1122 + unmap_hotplug_range(start, end, true, altmap); 1142 1123 free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END); 1143 1124 #endif 1144 1125 } ··· 1429 1410 WARN_ON(pgdir != init_mm.pgd); 1430 1411 WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END)); 1431 1412 1432 - unmap_hotplug_range(start, end, false); 1413 + unmap_hotplug_range(start, end, false, NULL); 1433 1414 free_empty_tables(start, end, PAGE_OFFSET, PAGE_END); 1434 1415 }
+1 -6
arch/csky/include/asm/pgalloc.h
··· 9 9 #include <linux/sched.h> 10 10 11 11 #define __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL 12 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 12 + #include <asm-generic/pgalloc.h> 13 13 14 14 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, 15 15 pte_t *pte) ··· 40 40 (pte + i)->pte_low = _PAGE_GLOBAL; 41 41 42 42 return pte; 43 - } 44 - 45 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 46 - { 47 - free_pages((unsigned long)pgd, PGD_ORDER); 48 43 } 49 44 50 45 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
-1
arch/csky/kernel/smp.c
··· 23 23 #include <asm/traps.h> 24 24 #include <asm/sections.h> 25 25 #include <asm/mmu_context.h> 26 - #include <asm/pgalloc.h> 27 26 #ifdef CONFIG_CPU_HAS_FPU 28 27 #include <abi/fpu.h> 29 28 #endif
+1 -6
arch/hexagon/include/asm/pgalloc.h
··· 11 11 #include <asm/mem-layout.h> 12 12 #include <asm/atomic.h> 13 13 14 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 14 + #include <asm-generic/pgalloc.h> 15 15 16 16 extern unsigned long long kmap_generation; 17 17 ··· 39 39 mm->context.ptbase = __pa(pgd); 40 40 41 41 return pgd; 42 - } 43 - 44 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 45 - { 46 - free_page((unsigned long) pgd); 47 42 } 48 43 49 44 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
-24
arch/ia64/include/asm/pgalloc.h
··· 29 29 return (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); 30 30 } 31 31 32 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 33 - { 34 - free_page((unsigned long)pgd); 35 - } 36 - 37 32 #if CONFIG_PGTABLE_LEVELS == 4 38 33 static inline void 39 34 p4d_populate(struct mm_struct *mm, p4d_t * p4d_entry, pud_t * pud) ··· 36 41 p4d_val(*p4d_entry) = __pa(pud); 37 42 } 38 43 39 - static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) 40 - { 41 - return (pud_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); 42 - } 43 - 44 - static inline void pud_free(struct mm_struct *mm, pud_t *pud) 45 - { 46 - free_page((unsigned long)pud); 47 - } 48 44 #define __pud_free_tlb(tlb, pud, address) pud_free((tlb)->mm, pud) 49 45 #endif /* CONFIG_PGTABLE_LEVELS == 4 */ 50 46 ··· 43 57 pud_populate(struct mm_struct *mm, pud_t * pud_entry, pmd_t * pmd) 44 58 { 45 59 pud_val(*pud_entry) = __pa(pmd); 46 - } 47 - 48 - static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 49 - { 50 - return (pmd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); 51 - } 52 - 53 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 54 - { 55 - free_page((unsigned long)pmd); 56 60 } 57 61 58 62 #define __pmd_free_tlb(tlb, pmd, address) pmd_free((tlb)->mm, pmd)
-1
arch/ia64/include/asm/tlb.h
··· 42 42 #include <linux/pagemap.h> 43 43 #include <linux/swap.h> 44 44 45 - #include <asm/pgalloc.h> 46 45 #include <asm/processor.h> 47 46 #include <asm/tlbflush.h> 48 47
-1
arch/ia64/kernel/process.c
··· 40 40 #include <asm/elf.h> 41 41 #include <asm/irq.h> 42 42 #include <asm/kexec.h> 43 - #include <asm/pgalloc.h> 44 43 #include <asm/processor.h> 45 44 #include <asm/sal.h> 46 45 #include <asm/switch_to.h>
-1
arch/ia64/kernel/smp.c
··· 39 39 #include <asm/io.h> 40 40 #include <asm/irq.h> 41 41 #include <asm/page.h> 42 - #include <asm/pgalloc.h> 43 42 #include <asm/processor.h> 44 43 #include <asm/ptrace.h> 45 44 #include <asm/sal.h>
-1
arch/ia64/kernel/smpboot.c
··· 49 49 #include <asm/irq.h> 50 50 #include <asm/mca.h> 51 51 #include <asm/page.h> 52 - #include <asm/pgalloc.h> 53 52 #include <asm/processor.h> 54 53 #include <asm/ptrace.h> 55 54 #include <asm/sal.h>
-1
arch/ia64/mm/contig.c
··· 21 21 #include <linux/swap.h> 22 22 23 23 #include <asm/meminit.h> 24 - #include <asm/pgalloc.h> 25 24 #include <asm/sections.h> 26 25 #include <asm/mca.h> 27 26
+1 -3
arch/ia64/mm/discontig.c
··· 24 24 #include <linux/efi.h> 25 25 #include <linux/nodemask.h> 26 26 #include <linux/slab.h> 27 - #include <asm/pgalloc.h> 28 27 #include <asm/tlb.h> 29 28 #include <asm/meminit.h> 30 29 #include <asm/numa.h> ··· 600 601 601 602 max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT; 602 603 603 - sparse_memory_present_with_active_regions(MAX_NUMNODES); 604 604 sparse_init(); 605 605 606 606 #ifdef CONFIG_VIRTUAL_MEM_MAP ··· 654 656 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, 655 657 struct vmem_altmap *altmap) 656 658 { 657 - return vmemmap_populate_basepages(start, end, node); 659 + return vmemmap_populate_basepages(start, end, node, NULL); 658 660 } 659 661 660 662 void vmemmap_free(unsigned long start, unsigned long end,
-1
arch/ia64/mm/hugetlbpage.c
··· 18 18 #include <linux/sysctl.h> 19 19 #include <linux/log2.h> 20 20 #include <asm/mman.h> 21 - #include <asm/pgalloc.h> 22 21 #include <asm/tlb.h> 23 22 #include <asm/tlbflush.h> 24 23
-1
arch/ia64/mm/tlb.c
··· 27 27 28 28 #include <asm/delay.h> 29 29 #include <asm/mmu_context.h> 30 - #include <asm/pgalloc.h> 31 30 #include <asm/pal.h> 32 31 #include <asm/tlbflush.h> 33 32 #include <asm/dma.h>
+1 -1
arch/m68k/include/asm/mmu_context.h
··· 222 222 223 223 #include <asm/setup.h> 224 224 #include <asm/page.h> 225 - #include <asm/pgalloc.h> 225 + #include <asm/cacheflush.h> 226 226 227 227 static inline int init_new_context(struct task_struct *tsk, 228 228 struct mm_struct *mm)
+1 -6
arch/m68k/include/asm/sun3_pgalloc.h
··· 13 13 14 14 #include <asm/tlb.h> 15 15 16 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 16 + #include <asm-generic/pgalloc.h> 17 17 18 18 extern const char bad_pmd_string[]; 19 19 ··· 39 39 * inside the pgd, so has no extra memory associated with it. 40 40 */ 41 41 #define pmd_free(mm, x) do { } while (0) 42 - 43 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 44 - { 45 - free_page((unsigned long) pgd); 46 - } 47 42 48 43 static inline pgd_t * pgd_alloc(struct mm_struct *mm) 49 44 {
+1 -1
arch/m68k/kernel/dma.c
··· 15 15 #include <linux/vmalloc.h> 16 16 #include <linux/export.h> 17 17 18 - #include <asm/pgalloc.h> 18 + #include <asm/cacheflush.h> 19 19 20 20 #if defined(CONFIG_MMU) && !defined(CONFIG_COLDFIRE) 21 21 void arch_dma_prep_coherent(struct page *page, size_t size)
+1 -2
arch/m68k/kernel/traps.c
··· 35 35 #include <asm/fpu.h> 36 36 #include <linux/uaccess.h> 37 37 #include <asm/traps.h> 38 - #include <asm/pgalloc.h> 39 38 #include <asm/machdep.h> 40 39 #include <asm/siginfo.h> 41 - 40 + #include <asm/tlbflush.h> 42 41 43 42 static const char *vec_names[] = { 44 43 [VEC_RESETSP] = "RESET SP",
+1 -1
arch/m68k/mm/cache.c
··· 8 8 */ 9 9 10 10 #include <linux/module.h> 11 - #include <asm/pgalloc.h> 11 + #include <asm/cacheflush.h> 12 12 #include <asm/traps.h> 13 13 14 14
-1
arch/m68k/mm/fault.c
··· 15 15 16 16 #include <asm/setup.h> 17 17 #include <asm/traps.h> 18 - #include <asm/pgalloc.h> 19 18 20 19 extern void die_if_kernel(char *, struct pt_regs *, long); 21 20
+1 -1
arch/m68k/mm/kmap.c
··· 19 19 #include <asm/setup.h> 20 20 #include <asm/segment.h> 21 21 #include <asm/page.h> 22 - #include <asm/pgalloc.h> 23 22 #include <asm/io.h> 23 + #include <asm/tlbflush.h> 24 24 25 25 #undef DEBUG 26 26
+1
arch/m68k/mm/mcfmmu.c
··· 20 20 #include <asm/mmu_context.h> 21 21 #include <asm/mcf_pgalloc.h> 22 22 #include <asm/tlbflush.h> 23 + #include <asm/pgalloc.h> 23 24 24 25 #define KMAPAREA(x) ((x >= VMALLOC_START) && (x < KMAP_END)) 25 26
-1
arch/m68k/mm/memory.c
··· 17 17 #include <asm/setup.h> 18 18 #include <asm/segment.h> 19 19 #include <asm/page.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/traps.h> 22 21 #include <asm/machdep.h> 23 22
+1 -1
arch/m68k/sun3x/dvma.c
··· 22 22 #include <asm/dvma.h> 23 23 #include <asm/io.h> 24 24 #include <asm/page.h> 25 - #include <asm/pgalloc.h> 25 + #include <asm/tlbflush.h> 26 26 27 27 /* IOMMU support */ 28 28
-6
arch/microblaze/include/asm/pgalloc.h
··· 28 28 return (pgd_t *)__get_free_pages(GFP_KERNEL|__GFP_ZERO, 0); 29 29 } 30 30 31 - static inline void free_pgd(pgd_t *pgd) 32 - { 33 - free_page((unsigned long)pgd); 34 - } 35 - 36 - #define pgd_free(mm, pgd) free_pgd(pgd) 37 31 #define pgd_alloc(mm) get_pgd() 38 32 39 33 #define pmd_pgtable(pmd) pmd_page(pmd)
-1
arch/microblaze/include/asm/tlbflush.h
··· 15 15 #include <asm/processor.h> /* For TASK_SIZE */ 16 16 #include <asm/mmu.h> 17 17 #include <asm/page.h> 18 - #include <asm/pgalloc.h> 19 18 20 19 extern void _tlbie(unsigned long address); 21 20 extern void _tlbia(void);
-1
arch/microblaze/kernel/process.c
··· 18 18 #include <linux/tick.h> 19 19 #include <linux/bitops.h> 20 20 #include <linux/ptrace.h> 21 - #include <asm/pgalloc.h> 22 21 #include <linux/uaccess.h> /* for USER_DS macros */ 23 22 #include <asm/cacheflush.h> 24 23
-1
arch/microblaze/kernel/signal.c
··· 35 35 #include <asm/entry.h> 36 36 #include <asm/ucontext.h> 37 37 #include <linux/uaccess.h> 38 - #include <asm/pgalloc.h> 39 38 #include <linux/syscalls.h> 40 39 #include <asm/cacheflush.h> 41 40 #include <asm/syscalls.h>
-3
arch/microblaze/mm/init.c
··· 172 172 &memblock.memory, 0); 173 173 } 174 174 175 - /* XXX need to clip this if using highmem? */ 176 - sparse_memory_present_with_active_regions(0); 177 - 178 175 paging_init(); 179 176 } 180 177
+3 -16
arch/mips/include/asm/pgalloc.h
··· 13 13 #include <linux/mm.h> 14 14 #include <linux/sched.h> 15 15 16 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 16 + #define __HAVE_ARCH_PMD_ALLOC_ONE 17 + #define __HAVE_ARCH_PUD_ALLOC_ONE 18 + #include <asm-generic/pgalloc.h> 17 19 18 20 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, 19 21 pte_t *pte) ··· 49 47 extern void pgd_init(unsigned long page); 50 48 extern pgd_t *pgd_alloc(struct mm_struct *mm); 51 49 52 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 53 - { 54 - free_pages((unsigned long)pgd, PGD_ORDER); 55 - } 56 - 57 50 #define __pte_free_tlb(tlb,pte,address) \ 58 51 do { \ 59 52 pgtable_pte_page_dtor(pte); \ ··· 67 70 return pmd; 68 71 } 69 72 70 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 71 - { 72 - free_pages((unsigned long)pmd, PMD_ORDER); 73 - } 74 - 75 73 #define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) 76 74 77 75 #endif ··· 81 89 if (pud) 82 90 pud_init((unsigned long)pud, (unsigned long)invalid_pmd_table); 83 91 return pud; 84 - } 85 - 86 - static inline void pud_free(struct mm_struct *mm, pud_t *pud) 87 - { 88 - free_pages((unsigned long)pud, PUD_ORDER); 89 92 } 90 93 91 94 static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
-8
arch/mips/kernel/setup.c
··· 371 371 #endif 372 372 } 373 373 374 - 375 - /* 376 - * In any case the added to the memblock memory regions 377 - * (highmem/lowmem, available/reserved, etc) are considered 378 - * as present, so inform sparsemem about them. 379 - */ 380 - memblocks_present(); 381 - 382 374 /* 383 375 * Reserve initrd memory if needed. 384 376 */
-1
arch/mips/loongson64/numa.c
··· 220 220 cpumask_clear(&__node_cpumask[node]); 221 221 } 222 222 } 223 - memblocks_present(); 224 223 max_low_pfn = PHYS_PFN(memblock_end_of_DRAM()); 225 224 226 225 for (cpu = 0; cpu < loongson_sysconf.nr_cpus; cpu++) {
-2
arch/mips/sgi-ip27/ip27-memory.c
··· 402 402 } 403 403 __node_data[node] = &null_node; 404 404 } 405 - 406 - memblocks_present(); 407 405 } 408 406 409 407 void __init prom_free_prom_memory(void)
-1
arch/mips/sgi-ip32/ip32-memory.c
··· 14 14 #include <asm/ip32/crime.h> 15 15 #include <asm/bootinfo.h> 16 16 #include <asm/page.h> 17 - #include <asm/pgalloc.h> 18 17 19 18 extern void crime_init(void); 20 19
+2
arch/nds32/mm/mm-nds32.c
··· 2 2 // Copyright (C) 2005-2017 Andes Technology Corporation 3 3 4 4 #include <linux/init_task.h> 5 + 6 + #define __HAVE_ARCH_PGD_FREE 5 7 #include <asm/pgalloc.h> 6 8 7 9 #define FIRST_KERNEL_PGD_NR (USER_PTRS_PER_PGD)
+1 -6
arch/nios2/include/asm/pgalloc.h
··· 12 12 13 13 #include <linux/mm.h> 14 14 15 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 15 + #include <asm-generic/pgalloc.h> 16 16 17 17 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, 18 18 pte_t *pte) ··· 33 33 extern void pmd_init(unsigned long page, unsigned long pagetable); 34 34 35 35 extern pgd_t *pgd_alloc(struct mm_struct *mm); 36 - 37 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 38 - { 39 - free_pages((unsigned long)pgd, PGD_ORDER); 40 - } 41 36 42 37 #define __pte_free_tlb(tlb, pte, addr) \ 43 38 do { \
+3 -30
arch/openrisc/include/asm/pgalloc.h
··· 20 20 #include <linux/mm.h> 21 21 #include <linux/memblock.h> 22 22 23 + #define __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL 24 + #include <asm-generic/pgalloc.h> 25 + 23 26 extern int mem_init_done; 24 27 25 28 #define pmd_populate_kernel(mm, pmd, pte) \ ··· 64 61 } 65 62 #endif 66 63 67 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 68 - { 69 - free_page((unsigned long)pgd); 70 - } 71 - 72 64 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm); 73 - 74 - static inline struct page *pte_alloc_one(struct mm_struct *mm) 75 - { 76 - struct page *pte; 77 - pte = alloc_pages(GFP_KERNEL, 0); 78 - if (!pte) 79 - return NULL; 80 - clear_page(page_address(pte)); 81 - if (!pgtable_pte_page_ctor(pte)) { 82 - __free_page(pte); 83 - return NULL; 84 - } 85 - return pte; 86 - } 87 - 88 - static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 89 - { 90 - free_page((unsigned long)pte); 91 - } 92 - 93 - static inline void pte_free(struct mm_struct *mm, struct page *pte) 94 - { 95 - pgtable_pte_page_dtor(pte); 96 - __free_page(pte); 97 - } 98 65 99 66 #define __pte_free_tlb(tlb, pte, addr) \ 100 67 do { \
-1
arch/openrisc/include/asm/tlbflush.h
··· 17 17 18 18 #include <linux/mm.h> 19 19 #include <asm/processor.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/current.h> 22 21 #include <linux/sched.h> 23 22
-1
arch/openrisc/kernel/or32_ksyms.c
··· 26 26 #include <asm/io.h> 27 27 #include <asm/hardirq.h> 28 28 #include <asm/delay.h> 29 - #include <asm/pgalloc.h> 30 29 31 30 #define DECLARE_EXPORT(name) extern void name(void); EXPORT_SYMBOL(name) 32 31
-1
arch/parisc/include/asm/mmu_context.h
··· 5 5 #include <linux/mm.h> 6 6 #include <linux/sched.h> 7 7 #include <linux/atomic.h> 8 - #include <asm/pgalloc.h> 9 8 #include <asm-generic/mm_hooks.h> 10 9 11 10 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+3 -9
arch/parisc/include/asm/pgalloc.h
··· 10 10 11 11 #include <asm/cache.h> 12 12 13 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 13 + #define __HAVE_ARCH_PMD_FREE 14 + #define __HAVE_ARCH_PGD_FREE 15 + #include <asm-generic/pgalloc.h> 14 16 15 17 /* Allocate the top level pgd (page directory) 16 18 * ··· 65 63 { 66 64 set_pud(pud, __pud((PxD_FLAG_PRESENT | PxD_FLAG_VALID) + 67 65 (__u32)(__pa((unsigned long)pmd) >> PxD_VALUE_SHIFT))); 68 - } 69 - 70 - static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) 71 - { 72 - pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL, PMD_ORDER); 73 - if (pmd) 74 - memset(pmd, 0, PAGE_SIZE<<PMD_ORDER); 75 - return pmd; 76 66 } 77 67 78 68 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
-1
arch/parisc/kernel/cache.c
··· 24 24 #include <asm/cacheflush.h> 25 25 #include <asm/tlbflush.h> 26 26 #include <asm/page.h> 27 - #include <asm/pgalloc.h> 28 27 #include <asm/processor.h> 29 28 #include <asm/sections.h> 30 29 #include <asm/shmparam.h>
-1
arch/parisc/kernel/pci-dma.c
··· 32 32 #include <asm/dma.h> /* for DMA_CHUNK_SIZE */ 33 33 #include <asm/io.h> 34 34 #include <asm/page.h> /* get_order */ 35 - #include <asm/pgalloc.h> 36 35 #include <linux/uaccess.h> 37 36 #include <asm/tlbflush.h> /* for purge_tlb_*() macros */ 38 37
-1
arch/parisc/kernel/process.c
··· 47 47 #include <asm/assembly.h> 48 48 #include <asm/pdc.h> 49 49 #include <asm/pdc_chassis.h> 50 - #include <asm/pgalloc.h> 51 50 #include <asm/unwind.h> 52 51 #include <asm/sections.h> 53 52
-1
arch/parisc/kernel/signal.c
··· 30 30 #include <asm/ucontext.h> 31 31 #include <asm/rt_sigframe.h> 32 32 #include <linux/uaccess.h> 33 - #include <asm/pgalloc.h> 34 33 #include <asm/cacheflush.h> 35 34 #include <asm/asm-offsets.h> 36 35
-1
arch/parisc/kernel/smp.c
··· 39 39 #include <asm/irq.h> /* for CPU_IRQ_REGION and friends */ 40 40 #include <asm/mmu_context.h> 41 41 #include <asm/page.h> 42 - #include <asm/pgalloc.h> 43 42 #include <asm/processor.h> 44 43 #include <asm/ptrace.h> 45 44 #include <asm/unistd.h>
-1
arch/parisc/mm/hugetlbpage.c
··· 15 15 #include <linux/sysctl.h> 16 16 17 17 #include <asm/mman.h> 18 - #include <asm/pgalloc.h> 19 18 #include <asm/tlb.h> 20 19 #include <asm/tlbflush.h> 21 20 #include <asm/cacheflush.h>
-5
arch/parisc/mm/init.c
··· 689 689 flush_cache_all_local(); /* start with known state */ 690 690 flush_tlb_all_local(NULL); 691 691 692 - /* 693 - * Mark all memblocks as present for sparsemem using 694 - * memory_present() and then initialize sparsemem. 695 - */ 696 - memblocks_present(); 697 692 sparse_init(); 698 693 parisc_bootmem_free(); 699 694 }
+1 -1
arch/parisc/mm/ioremap.c
··· 11 11 #include <linux/errno.h> 12 12 #include <linux/module.h> 13 13 #include <linux/io.h> 14 - #include <asm/pgalloc.h> 14 + #include <linux/mm.h> 15 15 16 16 /* 17 17 * Generic mapping function (not visible outside):
-1
arch/powerpc/include/asm/tlb.h
··· 12 12 #ifndef __powerpc64__ 13 13 #include <linux/pgtable.h> 14 14 #endif 15 - #include <asm/pgalloc.h> 16 15 #ifndef __powerpc64__ 17 16 #include <asm/page.h> 18 17 #include <asm/mmu.h>
-1
arch/powerpc/mm/book3s64/hash_hugetlbpage.c
··· 10 10 11 11 #include <linux/mm.h> 12 12 #include <linux/hugetlb.h> 13 - #include <asm/pgalloc.h> 14 13 #include <asm/cacheflush.h> 15 14 #include <asm/machdep.h> 16 15
-1
arch/powerpc/mm/book3s64/hash_pgtable.c
··· 9 9 #include <linux/mm_types.h> 10 10 #include <linux/mm.h> 11 11 12 - #include <asm/pgalloc.h> 13 12 #include <asm/sections.h> 14 13 #include <asm/mmu.h> 15 14 #include <asm/tlb.h>
-1
arch/powerpc/mm/book3s64/hash_tlb.c
··· 21 21 #include <linux/mm.h> 22 22 #include <linux/percpu.h> 23 23 #include <linux/hardirq.h> 24 - #include <asm/pgalloc.h> 25 24 #include <asm/tlbflush.h> 26 25 #include <asm/tlb.h> 27 26 #include <asm/bug.h>
-1
arch/powerpc/mm/book3s64/radix_hugetlbpage.c
··· 2 2 #include <linux/mm.h> 3 3 #include <linux/hugetlb.h> 4 4 #include <linux/security.h> 5 - #include <asm/pgalloc.h> 6 5 #include <asm/cacheflush.h> 7 6 #include <asm/machdep.h> 8 7 #include <asm/mman.h>
-1
arch/powerpc/mm/init_32.c
··· 29 29 #include <linux/slab.h> 30 30 #include <linux/hugetlb.h> 31 31 32 - #include <asm/pgalloc.h> 33 32 #include <asm/prom.h> 34 33 #include <asm/io.h> 35 34 #include <asm/mmu.h>
+2 -2
arch/powerpc/mm/init_64.c
··· 225 225 * fall back to system memory if the altmap allocation fail. 226 226 */ 227 227 if (altmap && !altmap_cross_boundary(altmap, start, page_size)) { 228 - p = altmap_alloc_block_buf(page_size, altmap); 228 + p = vmemmap_alloc_block_buf(page_size, node, altmap); 229 229 if (!p) 230 230 pr_debug("altmap block allocation failed, falling back to system memory"); 231 231 } 232 232 if (!p) 233 - p = vmemmap_alloc_block_buf(page_size, node); 233 + p = vmemmap_alloc_block_buf(page_size, node, NULL); 234 234 if (!p) 235 235 return -ENOMEM; 236 236
-1
arch/powerpc/mm/kasan/8xx.c
··· 5 5 #include <linux/kasan.h> 6 6 #include <linux/memblock.h> 7 7 #include <linux/hugetlb.h> 8 - #include <asm/pgalloc.h> 9 8 10 9 static int __init 11 10 kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block)
-1
arch/powerpc/mm/kasan/book3s_32.c
··· 4 4 5 5 #include <linux/kasan.h> 6 6 #include <linux/memblock.h> 7 - #include <asm/pgalloc.h> 8 7 #include <mm/mmu_decl.h> 9 8 10 9 int __init kasan_init_region(void *start, size_t size)
-3
arch/powerpc/mm/mem.c
··· 34 34 #include <linux/dma-direct.h> 35 35 #include <linux/kprobes.h> 36 36 37 - #include <asm/pgalloc.h> 38 37 #include <asm/prom.h> 39 38 #include <asm/io.h> 40 39 #include <asm/mmu_context.h> ··· 178 179 179 180 void __init initmem_init(void) 180 181 { 181 - /* XXX need to clip this if using highmem? */ 182 - sparse_memory_present_with_active_regions(0); 183 182 sparse_init(); 184 183 } 185 184
-1
arch/powerpc/mm/nohash/40x.c
··· 32 32 #include <linux/highmem.h> 33 33 #include <linux/memblock.h> 34 34 35 - #include <asm/pgalloc.h> 36 35 #include <asm/prom.h> 37 36 #include <asm/io.h> 38 37 #include <asm/mmu_context.h>
-1
arch/powerpc/mm/nohash/8xx.c
··· 13 13 #include <asm/fixmap.h> 14 14 #include <asm/code-patching.h> 15 15 #include <asm/inst.h> 16 - #include <asm/pgalloc.h> 17 16 18 17 #include <mm/mmu_decl.h> 19 18
-1
arch/powerpc/mm/nohash/fsl_booke.c
··· 37 37 #include <linux/highmem.h> 38 38 #include <linux/memblock.h> 39 39 40 - #include <asm/pgalloc.h> 41 40 #include <asm/prom.h> 42 41 #include <asm/io.h> 43 42 #include <asm/mmu_context.h>
-1
arch/powerpc/mm/nohash/kaslr_booke.c
··· 15 15 #include <linux/libfdt.h> 16 16 #include <linux/crash_core.h> 17 17 #include <asm/cacheflush.h> 18 - #include <asm/pgalloc.h> 19 18 #include <asm/prom.h> 20 19 #include <asm/kdump.h> 21 20 #include <mm/mmu_decl.h>
+1
arch/powerpc/mm/nohash/tlb.c
··· 34 34 #include <linux/of_fdt.h> 35 35 #include <linux/hugetlb.h> 36 36 37 + #include <asm/pgalloc.h> 37 38 #include <asm/tlbflush.h> 38 39 #include <asm/tlb.h> 39 40 #include <asm/code-patching.h>
-1
arch/powerpc/mm/numa.c
··· 953 953 954 954 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); 955 955 setup_node_data(nid, start_pfn, end_pfn); 956 - sparse_memory_present_with_active_regions(nid); 957 956 } 958 957 959 958 sparse_init();
-1
arch/powerpc/mm/pgtable.c
··· 23 23 #include <linux/percpu.h> 24 24 #include <linux/hardirq.h> 25 25 #include <linux/hugetlb.h> 26 - #include <asm/pgalloc.h> 27 26 #include <asm/tlbflush.h> 28 27 #include <asm/tlb.h> 29 28 #include <asm/hugetlb.h>
-1
arch/powerpc/mm/pgtable_64.c
··· 31 31 #include <linux/slab.h> 32 32 #include <linux/hugetlb.h> 33 33 34 - #include <asm/pgalloc.h> 35 34 #include <asm/page.h> 36 35 #include <asm/prom.h> 37 36 #include <asm/mmu_context.h>
+1 -1
arch/powerpc/mm/ptdump/hashpagetable.c
··· 17 17 #include <linux/seq_file.h> 18 18 #include <linux/const.h> 19 19 #include <asm/page.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/plpar_wrappers.h> 22 21 #include <linux/memblock.h> 23 22 #include <asm/firmware.h> 23 + #include <asm/pgalloc.h> 24 24 25 25 struct pg_state { 26 26 struct seq_file *seq;
-1
arch/powerpc/mm/ptdump/ptdump.c
··· 21 21 #include <asm/fixmap.h> 22 22 #include <linux/const.h> 23 23 #include <asm/page.h> 24 - #include <asm/pgalloc.h> 25 24 #include <asm/hugetlb.h> 26 25 27 26 #include <mm/mmu_decl.h>
-1
arch/powerpc/platforms/pseries/cmm.c
··· 26 26 #include <asm/firmware.h> 27 27 #include <asm/hvcall.h> 28 28 #include <asm/mmu.h> 29 - #include <asm/pgalloc.h> 30 29 #include <linux/uaccess.h> 31 30 #include <linux/memory.h> 32 31 #include <asm/plpar_wrappers.h>
+1 -17
arch/riscv/include/asm/pgalloc.h
··· 11 11 #include <asm/tlb.h> 12 12 13 13 #ifdef CONFIG_MMU 14 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 14 + #include <asm-generic/pgalloc.h> 15 15 16 16 static inline void pmd_populate_kernel(struct mm_struct *mm, 17 17 pmd_t *pmd, pte_t *pte) ··· 55 55 return pgd; 56 56 } 57 57 58 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 59 - { 60 - free_page((unsigned long)pgd); 61 - } 62 - 63 58 #ifndef __PAGETABLE_PMD_FOLDED 64 - 65 - static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 66 - { 67 - return (pmd_t *)__get_free_page( 68 - GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO); 69 - } 70 - 71 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 72 - { 73 - free_page((unsigned long)pmd); 74 - } 75 59 76 60 #define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) 77 61
-1
arch/riscv/mm/fault.c
··· 14 14 #include <linux/signal.h> 15 15 #include <linux/uaccess.h> 16 16 17 - #include <asm/pgalloc.h> 18 17 #include <asm/ptrace.h> 19 18 #include <asm/tlbflush.h> 20 19
+1 -2
arch/riscv/mm/init.c
··· 570 570 void __init paging_init(void) 571 571 { 572 572 setup_vm_final(); 573 - memblocks_present(); 574 573 sparse_init(); 575 574 setup_zero_page(); 576 575 zone_sizes_init(); ··· 580 581 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, 581 582 struct vmem_altmap *altmap) 582 583 { 583 - return vmemmap_populate_basepages(start, end, node); 584 + return vmemmap_populate_basepages(start, end, node, NULL); 584 585 } 585 586 #endif
+2 -2
arch/s390/crypto/prng.c
··· 249 249 { 250 250 pr_debug("The prng module stopped " 251 251 "after running in triple DES mode\n"); 252 - kzfree(prng_data); 252 + kfree_sensitive(prng_data); 253 253 } 254 254 255 255 ··· 442 442 static void prng_sha512_deinstantiate(void) 443 443 { 444 444 pr_debug("The prng module stopped after running in SHA-512 mode\n"); 445 - kzfree(prng_data); 445 + kfree_sensitive(prng_data); 446 446 } 447 447 448 448
-1
arch/s390/include/asm/tlb.h
··· 36 36 #define p4d_free_tlb p4d_free_tlb 37 37 #define pud_free_tlb pud_free_tlb 38 38 39 - #include <asm/pgalloc.h> 40 39 #include <asm/tlbflush.h> 41 40 #include <asm-generic/tlb.h> 42 41
-1
arch/s390/include/asm/tlbflush.h
··· 5 5 #include <linux/mm.h> 6 6 #include <linux/sched.h> 7 7 #include <asm/processor.h> 8 - #include <asm/pgalloc.h> 9 8 10 9 /* 11 10 * Flush all TLB entries on the local CPU.
-1
arch/s390/kernel/machine_kexec.c
··· 16 16 #include <linux/debug_locks.h> 17 17 #include <asm/cio.h> 18 18 #include <asm/setup.h> 19 - #include <asm/pgalloc.h> 20 19 #include <asm/smp.h> 21 20 #include <asm/ipl.h> 22 21 #include <asm/diag.h>
-1
arch/s390/kernel/ptrace.c
··· 25 25 #include <linux/compat.h> 26 26 #include <trace/syscall.h> 27 27 #include <asm/page.h> 28 - #include <asm/pgalloc.h> 29 28 #include <linux/uaccess.h> 30 29 #include <asm/unistd.h> 31 30 #include <asm/switch_to.h>
-1
arch/s390/kvm/diag.c
··· 10 10 11 11 #include <linux/kvm.h> 12 12 #include <linux/kvm_host.h> 13 - #include <asm/pgalloc.h> 14 13 #include <asm/gmap.h> 15 14 #include <asm/virtio-ccw.h> 16 15 #include "kvm-s390.h"
-1
arch/s390/kvm/priv.c
··· 22 22 #include <asm/ebcdic.h> 23 23 #include <asm/sysinfo.h> 24 24 #include <asm/page-states.h> 25 - #include <asm/pgalloc.h> 26 25 #include <asm/gmap.h> 27 26 #include <asm/io.h> 28 27 #include <asm/ptrace.h>
-1
arch/s390/kvm/pv.c
··· 9 9 #include <linux/kvm_host.h> 10 10 #include <linux/pagemap.h> 11 11 #include <linux/sched/signal.h> 12 - #include <asm/pgalloc.h> 13 12 #include <asm/gmap.h> 14 13 #include <asm/uv.h> 15 14 #include <asm/mman.h>
-1
arch/s390/mm/cmm.c
··· 21 21 #include <linux/oom.h> 22 22 #include <linux/uaccess.h> 23 23 24 - #include <asm/pgalloc.h> 25 24 #include <asm/diag.h> 26 25 27 26 #ifdef CONFIG_CMM_IUCV
-1
arch/s390/mm/init.c
··· 115 115 __load_psw_mask(psw.mask); 116 116 kasan_free_early_identity(); 117 117 118 - sparse_memory_present_with_active_regions(MAX_NUMNODES); 119 118 sparse_init(); 120 119 zone_dma_bits = 31; 121 120 memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
-1
arch/s390/mm/mmap.c
··· 17 17 #include <linux/random.h> 18 18 #include <linux/compat.h> 19 19 #include <linux/security.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/elf.h> 22 21 23 22 static unsigned long stack_maxrandom_size(void)
-1
arch/s390/mm/pgtable.c
··· 19 19 #include <linux/ksm.h> 20 20 #include <linux/mman.h> 21 21 22 - #include <asm/pgalloc.h> 23 22 #include <asm/tlb.h> 24 23 #include <asm/tlbflush.h> 25 24 #include <asm/mmu_context.h>
+4
arch/sh/include/asm/pgalloc.h
··· 3 3 #define __ASM_SH_PGALLOC_H 4 4 5 5 #include <asm/page.h> 6 + 7 + #define __HAVE_ARCH_PMD_ALLOC_ONE 8 + #define __HAVE_ARCH_PMD_FREE 9 + #define __HAVE_ARCH_PGD_FREE 6 10 #include <asm-generic/pgalloc.h> 7 11 8 12 extern pgd_t *pgd_alloc(struct mm_struct *);
-1
arch/sh/kernel/idle.c
··· 14 14 #include <linux/irqflags.h> 15 15 #include <linux/smp.h> 16 16 #include <linux/atomic.h> 17 - #include <asm/pgalloc.h> 18 17 #include <asm/smp.h> 19 18 #include <asm/bl_bit.h> 20 19
-1
arch/sh/kernel/machine_kexec.c
··· 14 14 #include <linux/ftrace.h> 15 15 #include <linux/suspend.h> 16 16 #include <linux/memblock.h> 17 - #include <asm/pgalloc.h> 18 17 #include <asm/mmu_context.h> 19 18 #include <asm/io.h> 20 19 #include <asm/cacheflush.h>
-1
arch/sh/mm/cache-sh3.c
··· 16 16 #include <asm/cache.h> 17 17 #include <asm/io.h> 18 18 #include <linux/uaccess.h> 19 - #include <asm/pgalloc.h> 20 19 #include <asm/mmu_context.h> 21 20 #include <asm/cacheflush.h> 22 21
-1
arch/sh/mm/cache-sh7705.c
··· 20 20 #include <asm/cache.h> 21 21 #include <asm/io.h> 22 22 #include <linux/uaccess.h> 23 - #include <asm/pgalloc.h> 24 23 #include <asm/mmu_context.h> 25 24 #include <asm/cacheflush.h> 26 25
-1
arch/sh/mm/hugetlbpage.c
··· 17 17 #include <linux/sysctl.h> 18 18 19 19 #include <asm/mman.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/tlb.h> 22 21 #include <asm/tlbflush.h> 23 22 #include <asm/cacheflush.h>
+1 -6
arch/sh/mm/init.c
··· 27 27 #include <asm/sections.h> 28 28 #include <asm/setup.h> 29 29 #include <asm/cache.h> 30 + #include <asm/pgalloc.h> 30 31 #include <linux/sizes.h> 31 32 32 33 pgd_t swapper_pg_dir[PTRS_PER_PGD]; ··· 241 240 242 241 plat_mem_setup(); 243 242 244 - for_each_memblock(memory, reg) { 245 - int nid = memblock_get_region_node(reg); 246 - 247 - memory_present(nid, memblock_region_memory_base_pfn(reg), 248 - memblock_region_memory_end_pfn(reg)); 249 - } 250 243 sparse_init(); 251 244 } 252 245
-1
arch/sh/mm/ioremap_fixed.c
··· 18 18 #include <linux/proc_fs.h> 19 19 #include <asm/fixmap.h> 20 20 #include <asm/page.h> 21 - #include <asm/pgalloc.h> 22 21 #include <asm/addrspace.h> 23 22 #include <asm/cacheflush.h> 24 23 #include <asm/tlbflush.h>
-3
arch/sh/mm/numa.c
··· 53 53 54 54 /* It's up */ 55 55 node_set_online(nid); 56 - 57 - /* Kick sparsemem */ 58 - sparse_memory_present_with_active_regions(nid); 59 56 }
-1
arch/sh/mm/tlb-sh3.c
··· 21 21 22 22 #include <asm/io.h> 23 23 #include <linux/uaccess.h> 24 - #include <asm/pgalloc.h> 25 24 #include <asm/mmu_context.h> 26 25 #include <asm/cacheflush.h> 27 26
-1
arch/sparc/include/asm/ide.h
··· 13 13 14 14 #include <asm/io.h> 15 15 #ifdef CONFIG_SPARC64 16 - #include <asm/pgalloc.h> 17 16 #include <asm/spitfire.h> 18 17 #include <asm/cacheflush.h> 19 18 #include <asm/page.h>
-1
arch/sparc/include/asm/tlb_64.h
··· 4 4 5 5 #include <linux/swap.h> 6 6 #include <linux/pagemap.h> 7 - #include <asm/pgalloc.h> 8 7 #include <asm/tlbflush.h> 9 8 #include <asm/mmu_context.h> 10 9
-1
arch/sparc/kernel/leon_smp.c
··· 38 38 #include <asm/delay.h> 39 39 #include <asm/irq.h> 40 40 #include <asm/page.h> 41 - #include <asm/pgalloc.h> 42 41 #include <asm/oplib.h> 43 42 #include <asm/cpudata.h> 44 43 #include <asm/asi.h>
-1
arch/sparc/kernel/process_32.c
··· 34 34 #include <asm/oplib.h> 35 35 #include <linux/uaccess.h> 36 36 #include <asm/page.h> 37 - #include <asm/pgalloc.h> 38 37 #include <asm/delay.h> 39 38 #include <asm/processor.h> 40 39 #include <asm/psr.h>
-1
arch/sparc/kernel/signal_32.c
··· 23 23 24 24 #include <linux/uaccess.h> 25 25 #include <asm/ptrace.h> 26 - #include <asm/pgalloc.h> 27 26 #include <asm/cacheflush.h> /* flush_sig_insns */ 28 27 #include <asm/switch_to.h> 29 28
-1
arch/sparc/kernel/smp_32.c
··· 29 29 30 30 #include <asm/irq.h> 31 31 #include <asm/page.h> 32 - #include <asm/pgalloc.h> 33 32 #include <asm/oplib.h> 34 33 #include <asm/cacheflush.h> 35 34 #include <asm/tlbflush.h>
+1
arch/sparc/kernel/smp_64.c
··· 47 47 #include <linux/uaccess.h> 48 48 #include <asm/starfire.h> 49 49 #include <asm/tlb.h> 50 + #include <asm/pgalloc.h> 50 51 #include <asm/sections.h> 51 52 #include <asm/prom.h> 52 53 #include <asm/mdesc.h>
-1
arch/sparc/kernel/sun4m_irq.c
··· 16 16 17 17 #include <asm/timer.h> 18 18 #include <asm/traps.h> 19 - #include <asm/pgalloc.h> 20 19 #include <asm/irq.h> 21 20 #include <asm/io.h> 22 21 #include <asm/cacheflush.h>
-1
arch/sparc/mm/highmem.c
··· 29 29 30 30 #include <asm/cacheflush.h> 31 31 #include <asm/tlbflush.h> 32 - #include <asm/pgalloc.h> 33 32 #include <asm/vaddrs.h> 34 33 35 34 static pte_t *kmap_pte;
-1
arch/sparc/mm/init_64.c
··· 1610 1610 1611 1611 /* XXX cpu notifier XXX */ 1612 1612 1613 - sparse_memory_present_with_active_regions(MAX_NUMNODES); 1614 1613 sparse_init(); 1615 1614 1616 1615 return end_pfn;
-1
arch/sparc/mm/io-unit.c
··· 15 15 #include <linux/of.h> 16 16 #include <linux/of_device.h> 17 17 18 - #include <asm/pgalloc.h> 19 18 #include <asm/io.h> 20 19 #include <asm/io-unit.h> 21 20 #include <asm/mxcc.h>
-1
arch/sparc/mm/iommu.c
··· 16 16 #include <linux/of.h> 17 17 #include <linux/of_device.h> 18 18 19 - #include <asm/pgalloc.h> 20 19 #include <asm/io.h> 21 20 #include <asm/mxcc.h> 22 21 #include <asm/mbus.h>
-1
arch/sparc/mm/tlb.c
··· 10 10 #include <linux/swap.h> 11 11 #include <linux/preempt.h> 12 12 13 - #include <asm/pgalloc.h> 14 13 #include <asm/tlbflush.h> 15 14 #include <asm/cacheflush.h> 16 15 #include <asm/mmu_context.h>
+1 -8
arch/um/include/asm/pgalloc.h
··· 10 10 11 11 #include <linux/mm.h> 12 12 13 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 13 + #include <asm-generic/pgalloc.h> 14 14 15 15 #define pmd_populate_kernel(mm, pmd, pte) \ 16 16 set_pmd(pmd, __pmd(_PAGE_TABLE + (unsigned long) __pa(pte))) ··· 25 25 * Allocate and free page tables. 26 26 */ 27 27 extern pgd_t *pgd_alloc(struct mm_struct *); 28 - extern void pgd_free(struct mm_struct *mm, pgd_t *pgd); 29 28 30 29 #define __pte_free_tlb(tlb,pte, address) \ 31 30 do { \ ··· 33 34 } while (0) 34 35 35 36 #ifdef CONFIG_3_LEVEL_PGTABLES 36 - 37 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 38 - { 39 - free_page((unsigned long)pmd); 40 - } 41 - 42 37 #define __pmd_free_tlb(tlb,x, address) tlb_remove_page((tlb),virt_to_page(x)) 43 38 #endif 44 39
-3
arch/um/include/asm/pgtable-3level.h
··· 78 78 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval)) 79 79 #endif 80 80 81 - struct mm_struct; 82 - extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address); 83 - 84 81 static inline void pud_clear (pud_t *pud) 85 82 { 86 83 set_pud(pud, __pud(_PAGE_NEWPAGE));
-17
arch/um/kernel/mem.c
··· 196 196 return pgd; 197 197 } 198 198 199 - void pgd_free(struct mm_struct *mm, pgd_t *pgd) 200 - { 201 - free_page((unsigned long) pgd); 202 - } 203 - 204 - #ifdef CONFIG_3_LEVEL_PGTABLES 205 - pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) 206 - { 207 - pmd_t *pmd = (pmd_t *) __get_free_page(GFP_KERNEL); 208 - 209 - if (pmd) 210 - memset(pmd, 0, PAGE_SIZE); 211 - 212 - return pmd; 213 - } 214 - #endif 215 - 216 199 void *uml_kmalloc(int size, int flags) 217 200 { 218 201 return kmalloc(size, flags);
-1
arch/x86/ia32/ia32_aout.c
··· 30 30 #include <linux/sched/task_stack.h> 31 31 32 32 #include <linux/uaccess.h> 33 - #include <asm/pgalloc.h> 34 33 #include <asm/cacheflush.h> 35 34 #include <asm/user32.h> 36 35 #include <asm/ia32.h>
-1
arch/x86/include/asm/mmu_context.h
··· 9 9 10 10 #include <trace/events/tlb.h> 11 11 12 - #include <asm/pgalloc.h> 13 12 #include <asm/tlbflush.h> 14 13 #include <asm/paravirt.h> 15 14 #include <asm/debugreg.h>
+2 -40
arch/x86/include/asm/pgalloc.h
··· 7 7 #include <linux/pagemap.h> 8 8 9 9 #define __HAVE_ARCH_PTE_ALLOC_ONE 10 - #include <asm-generic/pgalloc.h> /* for pte_{alloc,free}_one */ 10 + #define __HAVE_ARCH_PGD_FREE 11 + #include <asm-generic/pgalloc.h> 11 12 12 13 static inline int __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; } 13 14 ··· 87 86 #define pmd_pgtable(pmd) pmd_page(pmd) 88 87 89 88 #if CONFIG_PGTABLE_LEVELS > 2 90 - static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 91 - { 92 - struct page *page; 93 - gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO; 94 - 95 - if (mm == &init_mm) 96 - gfp &= ~__GFP_ACCOUNT; 97 - page = alloc_pages(gfp, 0); 98 - if (!page) 99 - return NULL; 100 - if (!pgtable_pmd_page_ctor(page)) { 101 - __free_pages(page, 0); 102 - return NULL; 103 - } 104 - return (pmd_t *)page_address(page); 105 - } 106 - 107 - static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 108 - { 109 - BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); 110 - pgtable_pmd_page_dtor(virt_to_page(pmd)); 111 - free_page((unsigned long)pmd); 112 - } 113 - 114 89 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd); 115 90 116 91 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd, ··· 122 145 { 123 146 paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT); 124 147 set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud))); 125 - } 126 - 127 - static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) 128 - { 129 - gfp_t gfp = GFP_KERNEL_ACCOUNT; 130 - 131 - if (mm == &init_mm) 132 - gfp &= ~__GFP_ACCOUNT; 133 - return (pud_t *)get_zeroed_page(gfp); 134 - } 135 - 136 - static inline void pud_free(struct mm_struct *mm, pud_t *pud) 137 - { 138 - BUG_ON((unsigned long)pud & (PAGE_SIZE-1)); 139 - free_page((unsigned long)pud); 140 148 } 141 149 142 150 extern void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud);
+1
arch/x86/kernel/alternative.c
··· 7 7 #include <linux/mutex.h> 8 8 #include <linux/list.h> 9 9 #include <linux/stringify.h> 10 + #include <linux/highmem.h> 10 11 #include <linux/mm.h> 11 12 #include <linux/vmalloc.h> 12 13 #include <linux/memory.h>
-1
arch/x86/kernel/apic/apic.c
··· 40 40 #include <asm/irq_remapping.h> 41 41 #include <asm/perf_event.h> 42 42 #include <asm/x86_init.h> 43 - #include <asm/pgalloc.h> 44 43 #include <linux/atomic.h> 45 44 #include <asm/mpspec.h> 46 45 #include <asm/i8259.h>
-1
arch/x86/kernel/mpparse.c
··· 22 22 #include <asm/irqdomain.h> 23 23 #include <asm/mtrr.h> 24 24 #include <asm/mpspec.h> 25 - #include <asm/pgalloc.h> 26 25 #include <asm/io_apic.h> 27 26 #include <asm/proto.h> 28 27 #include <asm/bios_ebda.h>
-1
arch/x86/kernel/traps.c
··· 62 62 63 63 #ifdef CONFIG_X86_64 64 64 #include <asm/x86_init.h> 65 - #include <asm/pgalloc.h> 66 65 #include <asm/proto.h> 67 66 #else 68 67 #include <asm/processor-flags.h>
-1
arch/x86/mm/fault.c
··· 21 21 22 22 #include <asm/cpufeature.h> /* boot_cpu_has, ... */ 23 23 #include <asm/traps.h> /* dotraplinkage, ... */ 24 - #include <asm/pgalloc.h> /* pgd_*(), ... */ 25 24 #include <asm/fixmap.h> /* VSYSCALL_ADDR */ 26 25 #include <asm/vsyscall.h> /* emulate_vsyscall */ 27 26 #include <asm/vm86.h> /* struct vm86 */
-1
arch/x86/mm/hugetlbpage.c
··· 17 17 #include <asm/mman.h> 18 18 #include <asm/tlb.h> 19 19 #include <asm/tlbflush.h> 20 - #include <asm/pgalloc.h> 21 20 #include <asm/elf.h> 22 21 23 22 #if 0 /* This is just for testing */
-2
arch/x86/mm/init_32.c
··· 678 678 #endif 679 679 680 680 memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); 681 - sparse_memory_present_with_active_regions(0); 682 681 683 682 #ifdef CONFIG_FLATMEM 684 683 max_mapnr = IS_ENABLED(CONFIG_HIGHMEM) ? highend_pfn : max_low_pfn; ··· 717 718 * NOTE: at this point the bootmem allocator is fully available. 718 719 */ 719 720 olpc_dt_build_devicetree(); 720 - sparse_memory_present_with_active_regions(MAX_NUMNODES); 721 721 sparse_init(); 722 722 zone_sizes_init(); 723 723 }
+4 -8
arch/x86/mm/init_64.c
··· 817 817 818 818 void __init paging_init(void) 819 819 { 820 - sparse_memory_present_with_active_regions(MAX_NUMNODES); 821 820 sparse_init(); 822 821 823 822 /* ··· 1509 1510 if (pmd_none(*pmd)) { 1510 1511 void *p; 1511 1512 1512 - if (altmap) 1513 - p = altmap_alloc_block_buf(PMD_SIZE, altmap); 1514 - else 1515 - p = vmemmap_alloc_block_buf(PMD_SIZE, node); 1513 + p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); 1516 1514 if (p) { 1517 1515 pte_t entry; 1518 1516 ··· 1536 1540 vmemmap_verify((pte_t *)pmd, node, addr, next); 1537 1541 continue; 1538 1542 } 1539 - if (vmemmap_populate_basepages(addr, next, node)) 1543 + if (vmemmap_populate_basepages(addr, next, node, NULL)) 1540 1544 return -ENOMEM; 1541 1545 } 1542 1546 return 0; ··· 1548 1552 int err; 1549 1553 1550 1554 if (end - start < PAGES_PER_SECTION * sizeof(struct page)) 1551 - err = vmemmap_populate_basepages(start, end, node); 1555 + err = vmemmap_populate_basepages(start, end, node, NULL); 1552 1556 else if (boot_cpu_has(X86_FEATURE_PSE)) 1553 1557 err = vmemmap_populate_hugepages(start, end, node, altmap); 1554 1558 else if (altmap) { ··· 1556 1560 __func__); 1557 1561 err = -ENOMEM; 1558 1562 } else 1559 - err = vmemmap_populate_basepages(start, end, node); 1563 + err = vmemmap_populate_basepages(start, end, node, NULL); 1560 1564 if (!err) 1561 1565 sync_global_pgds(start, end - 1); 1562 1566 return err;
-1
arch/x86/mm/kaslr.c
··· 26 26 #include <linux/memblock.h> 27 27 #include <linux/pgtable.h> 28 28 29 - #include <asm/pgalloc.h> 30 29 #include <asm/setup.h> 31 30 #include <asm/kaslr.h> 32 31
-1
arch/x86/mm/pgtable_32.c
··· 11 11 #include <linux/spinlock.h> 12 12 13 13 #include <asm/cpu_entry_area.h> 14 - #include <asm/pgalloc.h> 15 14 #include <asm/fixmap.h> 16 15 #include <asm/e820/api.h> 17 16 #include <asm/tlb.h>
-1
arch/x86/mm/pti.c
··· 34 34 #include <asm/vsyscall.h> 35 35 #include <asm/cmdline.h> 36 36 #include <asm/pti.h> 37 - #include <asm/pgalloc.h> 38 37 #include <asm/tlbflush.h> 39 38 #include <asm/desc.h> 40 39 #include <asm/sections.h>
+1
arch/x86/platform/uv/bios_uv.c
··· 11 11 #include <linux/slab.h> 12 12 #include <asm/efi.h> 13 13 #include <linux/io.h> 14 + #include <asm/pgalloc.h> 14 15 #include <asm/uv/bios.h> 15 16 #include <asm/uv/uv_hub.h> 16 17
+1 -1
arch/x86/power/hibernate.c
··· 98 98 if (crypto_shash_digest(desc, (u8 *)table, size, buf)) 99 99 ret = -EINVAL; 100 100 101 - kzfree(desc); 101 + kfree_sensitive(desc); 102 102 103 103 free_tfm: 104 104 crypto_free_shash(tfm);
+16 -24
arch/xtensa/include/asm/pgalloc.h
··· 8 8 #ifndef _XTENSA_PGALLOC_H 9 9 #define _XTENSA_PGALLOC_H 10 10 11 + #ifdef CONFIG_MMU 11 12 #include <linux/highmem.h> 12 13 #include <linux/slab.h> 14 + 15 + #define __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL 16 + #define __HAVE_ARCH_PTE_ALLOC_ONE 17 + #include <asm-generic/pgalloc.h> 13 18 14 19 /* 15 20 * Allocating and freeing a pmd is trivial: the 1-entry pmd is ··· 33 28 return (pgd_t*) __get_free_pages(GFP_KERNEL | __GFP_ZERO, PGD_ORDER); 34 29 } 35 30 36 - static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 31 + static inline void ptes_clear(pte_t *ptep) 37 32 { 38 - free_page((unsigned long)pgd); 33 + int i; 34 + 35 + for (i = 0; i < PTRS_PER_PTE; i++) 36 + pte_clear(NULL, 0, ptep + i); 39 37 } 40 38 41 39 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) 42 40 { 43 41 pte_t *ptep; 44 - int i; 45 42 46 - ptep = (pte_t *)__get_free_page(GFP_KERNEL); 43 + ptep = (pte_t *)__pte_alloc_one_kernel(mm); 47 44 if (!ptep) 48 45 return NULL; 49 - for (i = 0; i < 1024; i++) 50 - pte_clear(NULL, 0, ptep + i); 46 + ptes_clear(ptep); 51 47 return ptep; 52 48 } 53 49 54 50 static inline pgtable_t pte_alloc_one(struct mm_struct *mm) 55 51 { 56 - pte_t *pte; 57 52 struct page *page; 58 53 59 - pte = pte_alloc_one_kernel(mm); 60 - if (!pte) 54 + page = __pte_alloc_one(mm, GFP_PGTABLE_USER); 55 + if (!page) 61 56 return NULL; 62 - page = virt_to_page(pte); 63 - if (!pgtable_pte_page_ctor(page)) { 64 - __free_page(page); 65 - return NULL; 66 - } 57 + ptes_clear(page_address(page)); 67 58 return page; 68 59 } 69 60 70 - static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) 71 - { 72 - free_page((unsigned long)pte); 73 - } 74 - 75 - static inline void pte_free(struct mm_struct *mm, pgtable_t pte) 76 - { 77 - pgtable_pte_page_dtor(pte); 78 - __free_page(pte); 79 - } 80 61 #define pmd_pgtable(pmd) pmd_page(pmd) 62 + #endif /* CONFIG_MMU */ 81 63 82 64 #endif /* _XTENSA_PGALLOC_H */
-1
arch/xtensa/kernel/xtensa_ksyms.c
··· 25 25 #include <asm/dma.h> 26 26 #include <asm/io.h> 27 27 #include <asm/page.h> 28 - #include <asm/pgalloc.h> 29 28 #include <asm/ftrace.h> 30 29 #ifdef CONFIG_BLK_DEV_FD 31 30 #include <asm/floppy.h>
-1
arch/xtensa/mm/cache.c
··· 31 31 #include <asm/tlb.h> 32 32 #include <asm/tlbflush.h> 33 33 #include <asm/page.h> 34 - #include <asm/pgalloc.h> 35 34 36 35 /* 37 36 * Note:
-1
arch/xtensa/mm/fault.c
··· 20 20 #include <asm/mmu_context.h> 21 21 #include <asm/cacheflush.h> 22 22 #include <asm/hardirq.h> 23 - #include <asm/pgalloc.h> 24 23 25 24 DEFINE_PER_CPU(unsigned long, asid_cache) = ASID_USER_FIRST; 26 25 void bad_page_fault(struct pt_regs*, unsigned long, int);
+1 -1
crypto/adiantum.c
··· 177 177 keyp += NHPOLY1305_KEY_SIZE; 178 178 WARN_ON(keyp != &data->derived_keys[ARRAY_SIZE(data->derived_keys)]); 179 179 out: 180 - kzfree(data); 180 + kfree_sensitive(data); 181 181 return err; 182 182 } 183 183
+2 -2
crypto/ahash.c
··· 183 183 alignbuffer = (u8 *)ALIGN((unsigned long)buffer, alignmask + 1); 184 184 memcpy(alignbuffer, key, keylen); 185 185 ret = tfm->setkey(tfm, alignbuffer, keylen); 186 - kzfree(buffer); 186 + kfree_sensitive(buffer); 187 187 return ret; 188 188 } 189 189 ··· 302 302 req->priv = NULL; 303 303 304 304 /* Free the req->priv.priv from the ADJUSTED request. */ 305 - kzfree(priv); 305 + kfree_sensitive(priv); 306 306 } 307 307 308 308 static void ahash_notify_einprogress(struct ahash_request *req)
+1 -1
crypto/api.c
··· 571 571 alg->cra_exit(tfm); 572 572 crypto_exit_ops(tfm); 573 573 crypto_mod_put(alg); 574 - kzfree(mem); 574 + kfree_sensitive(mem); 575 575 } 576 576 EXPORT_SYMBOL_GPL(crypto_destroy_tfm); 577 577
+2 -2
crypto/asymmetric_keys/verify_pefile.c
··· 376 376 } 377 377 378 378 error: 379 - kzfree(desc); 379 + kfree_sensitive(desc); 380 380 error_no_desc: 381 381 crypto_free_shash(tfm); 382 382 kleave(" = %d", ret); ··· 447 447 ret = pefile_digest_pe(pebuf, pelen, &ctx); 448 448 449 449 error: 450 - kzfree(ctx.digest); 450 + kfree_sensitive(ctx.digest); 451 451 return ret; 452 452 }
+1 -1
crypto/deflate.c
··· 163 163 static void deflate_free_ctx(struct crypto_scomp *tfm, void *ctx) 164 164 { 165 165 __deflate_exit(ctx); 166 - kzfree(ctx); 166 + kfree_sensitive(ctx); 167 167 } 168 168 169 169 static void deflate_exit(struct crypto_tfm *tfm)
+5 -5
crypto/drbg.c
··· 1218 1218 { 1219 1219 if (!drbg) 1220 1220 return; 1221 - kzfree(drbg->Vbuf); 1221 + kfree_sensitive(drbg->Vbuf); 1222 1222 drbg->Vbuf = NULL; 1223 1223 drbg->V = NULL; 1224 - kzfree(drbg->Cbuf); 1224 + kfree_sensitive(drbg->Cbuf); 1225 1225 drbg->Cbuf = NULL; 1226 1226 drbg->C = NULL; 1227 - kzfree(drbg->scratchpadbuf); 1227 + kfree_sensitive(drbg->scratchpadbuf); 1228 1228 drbg->scratchpadbuf = NULL; 1229 1229 drbg->reseed_ctr = 0; 1230 1230 drbg->d_ops = NULL; 1231 1231 drbg->core = NULL; 1232 1232 if (IS_ENABLED(CONFIG_CRYPTO_FIPS)) { 1233 - kzfree(drbg->prev); 1233 + kfree_sensitive(drbg->prev); 1234 1234 drbg->prev = NULL; 1235 1235 drbg->fips_primed = false; 1236 1236 } ··· 1701 1701 struct sdesc *sdesc = (struct sdesc *)drbg->priv_data; 1702 1702 if (sdesc) { 1703 1703 crypto_free_shash(sdesc->shash.tfm); 1704 - kzfree(sdesc); 1704 + kfree_sensitive(sdesc); 1705 1705 } 1706 1706 drbg->priv_data = NULL; 1707 1707 return 0;
+4 -4
crypto/ecc.c
··· 67 67 68 68 static void ecc_free_digits_space(u64 *space) 69 69 { 70 - kzfree(space); 70 + kfree_sensitive(space); 71 71 } 72 72 73 73 static struct ecc_point *ecc_alloc_point(unsigned int ndigits) ··· 101 101 if (!p) 102 102 return; 103 103 104 - kzfree(p->x); 105 - kzfree(p->y); 106 - kzfree(p); 104 + kfree_sensitive(p->x); 105 + kfree_sensitive(p->y); 106 + kfree_sensitive(p); 107 107 } 108 108 109 109 static void vli_clear(u64 *vli, unsigned int ndigits)
+1 -1
crypto/ecdh.c
··· 124 124 125 125 /* fall through */ 126 126 free_all: 127 - kzfree(shared_secret); 127 + kfree_sensitive(shared_secret); 128 128 free_pubkey: 129 129 kfree(public_key); 130 130 return ret;
+1 -1
crypto/gcm.c
··· 139 139 CRYPTO_TFM_REQ_MASK); 140 140 err = crypto_ahash_setkey(ghash, (u8 *)&data->hash, sizeof(be128)); 141 141 out: 142 - kzfree(data); 142 + kfree_sensitive(data); 143 143 return err; 144 144 } 145 145
+2 -2
crypto/gf128mul.c
··· 304 304 int i; 305 305 306 306 for (i = 0; i < 16; i++) 307 - kzfree(t->t[i]); 308 - kzfree(t); 307 + kfree_sensitive(t->t[i]); 308 + kfree_sensitive(t); 309 309 } 310 310 EXPORT_SYMBOL(gf128mul_free_64k); 311 311
+1 -1
crypto/jitterentropy-kcapi.c
··· 57 57 58 58 void jent_zfree(void *ptr) 59 59 { 60 - kzfree(ptr); 60 + kfree_sensitive(ptr); 61 61 } 62 62 63 63 int jent_fips_enabled(void)
+1 -1
crypto/rng.c
··· 53 53 err = crypto_rng_alg(tfm)->seed(tfm, seed, slen); 54 54 crypto_stats_rng_seed(alg, err); 55 55 out: 56 - kzfree(buf); 56 + kfree_sensitive(buf); 57 57 return err; 58 58 } 59 59 EXPORT_SYMBOL_GPL(crypto_rng_reset);
+3 -3
crypto/rsa-pkcs1pad.c
··· 199 199 sg_copy_from_buffer(req->dst, 200 200 sg_nents_for_len(req->dst, ctx->key_size), 201 201 out_buf, ctx->key_size); 202 - kzfree(out_buf); 202 + kfree_sensitive(out_buf); 203 203 204 204 out: 205 205 req->dst_len = ctx->key_size; ··· 322 322 out_buf + pos, req->dst_len); 323 323 324 324 done: 325 - kzfree(req_ctx->out_buf); 325 + kfree_sensitive(req_ctx->out_buf); 326 326 327 327 return err; 328 328 } ··· 500 500 req->dst_len) != 0) 501 501 err = -EKEYREJECTED; 502 502 done: 503 - kzfree(req_ctx->out_buf); 503 + kfree_sensitive(req_ctx->out_buf); 504 504 505 505 return err; 506 506 }
+1 -1
crypto/seqiv.c
··· 33 33 memcpy(req->iv, subreq->iv, crypto_aead_ivsize(geniv)); 34 34 35 35 out: 36 - kzfree(subreq->iv); 36 + kfree_sensitive(subreq->iv); 37 37 } 38 38 39 39 static void seqiv_aead_encrypt_complete(struct crypto_async_request *base,
+1 -1
crypto/shash.c
··· 44 44 alignbuffer = (u8 *)ALIGN((unsigned long)buffer, alignmask + 1); 45 45 memcpy(alignbuffer, key, keylen); 46 46 err = shash->setkey(tfm, alignbuffer, keylen); 47 - kzfree(buffer); 47 + kfree_sensitive(buffer); 48 48 return err; 49 49 } 50 50
+1 -1
crypto/skcipher.c
··· 592 592 alignbuffer = (u8 *)ALIGN((unsigned long)buffer, alignmask + 1); 593 593 memcpy(alignbuffer, key, keylen); 594 594 ret = cipher->setkey(tfm, alignbuffer, keylen); 595 - kzfree(buffer); 595 + kfree_sensitive(buffer); 596 596 return ret; 597 597 } 598 598
+3 -3
crypto/testmgr.c
··· 1744 1744 kfree(vec.plaintext); 1745 1745 kfree(vec.digest); 1746 1746 crypto_free_shash(generic_tfm); 1747 - kzfree(generic_desc); 1747 + kfree_sensitive(generic_desc); 1748 1748 return err; 1749 1749 } 1750 1750 #else /* !CONFIG_CRYPTO_MANAGER_EXTRA_TESTS */ ··· 3665 3665 if (IS_ERR(drng)) { 3666 3666 printk(KERN_ERR "alg: drbg: could not allocate DRNG handle for " 3667 3667 "%s\n", driver); 3668 - kzfree(buf); 3668 + kfree_sensitive(buf); 3669 3669 return -ENOMEM; 3670 3670 } 3671 3671 ··· 3712 3712 3713 3713 outbuf: 3714 3714 crypto_free_rng(drng); 3715 - kzfree(buf); 3715 + kfree_sensitive(buf); 3716 3716 return ret; 3717 3717 } 3718 3718
+1 -1
crypto/zstd.c
··· 137 137 static void zstd_free_ctx(struct crypto_scomp *tfm, void *ctx) 138 138 { 139 139 __zstd_exit(ctx); 140 - kzfree(ctx); 140 + kfree_sensitive(ctx); 141 141 } 142 142 143 143 static void zstd_exit(struct crypto_tfm *tfm)
+5 -5
drivers/base/node.c
··· 368 368 unsigned long sreclaimable, sunreclaimable; 369 369 370 370 si_meminfo_node(&i, nid); 371 - sreclaimable = node_page_state(pgdat, NR_SLAB_RECLAIMABLE); 372 - sunreclaimable = node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE); 371 + sreclaimable = node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B); 372 + sunreclaimable = node_page_state_pages(pgdat, NR_SLAB_UNRECLAIMABLE_B); 373 373 n = sprintf(buf, 374 374 "Node %d MemTotal: %8lu kB\n" 375 375 "Node %d MemFree: %8lu kB\n" ··· 440 440 nid, K(node_page_state(pgdat, NR_FILE_MAPPED)), 441 441 nid, K(node_page_state(pgdat, NR_ANON_MAPPED)), 442 442 nid, K(i.sharedram), 443 - nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB), 443 + nid, node_page_state(pgdat, NR_KERNEL_STACK_KB), 444 444 #ifdef CONFIG_SHADOW_CALL_STACK 445 - nid, sum_zone_node_page_state(nid, NR_KERNEL_SCS_KB), 445 + nid, node_page_state(pgdat, NR_KERNEL_SCS_KB), 446 446 #endif 447 447 nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), 448 448 nid, 0UL, ··· 513 513 514 514 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 515 515 n += sprintf(buf+n, "%s %lu\n", node_stat_name(i), 516 - node_page_state(pgdat, i)); 516 + node_page_state_pages(pgdat, i)); 517 517 518 518 return n; 519 519 }
-1
drivers/block/xen-blkback/common.h
··· 36 36 #include <linux/io.h> 37 37 #include <linux/rbtree.h> 38 38 #include <asm/setup.h> 39 - #include <asm/pgalloc.h> 40 39 #include <asm/hypervisor.h> 41 40 #include <xen/grant_table.h> 42 41 #include <xen/page.h>
+1 -1
drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
··· 254 254 offset = areq->cryptlen - ivsize; 255 255 if (rctx->op_dir & CE_DECRYPTION) { 256 256 memcpy(areq->iv, backup_iv, ivsize); 257 - kzfree(backup_iv); 257 + kfree_sensitive(backup_iv); 258 258 } else { 259 259 scatterwalk_map_and_copy(areq->iv, areq->dst, offset, 260 260 ivsize, 0);
+1 -1
drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
··· 249 249 if (rctx->op_dir & SS_DECRYPTION) { 250 250 memcpy(areq->iv, backup_iv, ivsize); 251 251 memzero_explicit(backup_iv, ivsize); 252 - kzfree(backup_iv); 252 + kfree_sensitive(backup_iv); 253 253 } else { 254 254 scatterwalk_map_and_copy(areq->iv, areq->dst, offset, 255 255 ivsize, 0);
+2 -2
drivers/crypto/amlogic/amlogic-gxl-cipher.c
··· 252 252 } 253 253 } 254 254 theend: 255 - kzfree(bkeyiv); 256 - kzfree(backup_iv); 255 + kfree_sensitive(bkeyiv); 256 + kfree_sensitive(backup_iv); 257 257 258 258 return err; 259 259 }
+1 -1
drivers/crypto/atmel-ecc.c
··· 69 69 70 70 /* fall through */ 71 71 free_work_data: 72 - kzfree(work_data); 72 + kfree_sensitive(work_data); 73 73 kpp_request_complete(req, status); 74 74 } 75 75
+14 -14
drivers/crypto/caam/caampkc.c
··· 854 854 855 855 static void caam_rsa_free_key(struct caam_rsa_key *key) 856 856 { 857 - kzfree(key->d); 858 - kzfree(key->p); 859 - kzfree(key->q); 860 - kzfree(key->dp); 861 - kzfree(key->dq); 862 - kzfree(key->qinv); 863 - kzfree(key->tmp1); 864 - kzfree(key->tmp2); 857 + kfree_sensitive(key->d); 858 + kfree_sensitive(key->p); 859 + kfree_sensitive(key->q); 860 + kfree_sensitive(key->dp); 861 + kfree_sensitive(key->dq); 862 + kfree_sensitive(key->qinv); 863 + kfree_sensitive(key->tmp1); 864 + kfree_sensitive(key->tmp2); 865 865 kfree(key->e); 866 866 kfree(key->n); 867 867 memset(key, 0, sizeof(*key)); ··· 1018 1018 return; 1019 1019 1020 1020 free_dq: 1021 - kzfree(rsa_key->dq); 1021 + kfree_sensitive(rsa_key->dq); 1022 1022 free_dp: 1023 - kzfree(rsa_key->dp); 1023 + kfree_sensitive(rsa_key->dp); 1024 1024 free_tmp2: 1025 - kzfree(rsa_key->tmp2); 1025 + kfree_sensitive(rsa_key->tmp2); 1026 1026 free_tmp1: 1027 - kzfree(rsa_key->tmp1); 1027 + kfree_sensitive(rsa_key->tmp1); 1028 1028 free_q: 1029 - kzfree(rsa_key->q); 1029 + kfree_sensitive(rsa_key->q); 1030 1030 free_p: 1031 - kzfree(rsa_key->p); 1031 + kfree_sensitive(rsa_key->p); 1032 1032 } 1033 1033 1034 1034 static int caam_rsa_set_priv_key(struct crypto_akcipher *tfm, const void *key,
+3 -3
drivers/crypto/cavium/cpt/cptvf_main.c
··· 74 74 for (i = 0; i < cptvf->nr_queues; i++) 75 75 tasklet_kill(&cwqe_info->vq_wqe[i].twork); 76 76 77 - kzfree(cwqe_info); 77 + kfree_sensitive(cwqe_info); 78 78 cptvf->wqe_info = NULL; 79 79 } 80 80 ··· 88 88 continue; 89 89 90 90 /* free single queue */ 91 - kzfree((queue->head)); 91 + kfree_sensitive((queue->head)); 92 92 93 93 queue->front = 0; 94 94 queue->rear = 0; ··· 189 189 chunk->head = NULL; 190 190 chunk->dma_addr = 0; 191 191 hlist_del(&chunk->nextchunk); 192 - kzfree(chunk); 192 + kfree_sensitive(chunk); 193 193 } 194 194 195 195 queue->nchunks = 0;
+6 -6
drivers/crypto/cavium/cpt/cptvf_reqmanager.c
··· 305 305 } 306 306 } 307 307 308 - kzfree(info->scatter_components); 309 - kzfree(info->gather_components); 310 - kzfree(info->out_buffer); 311 - kzfree(info->in_buffer); 312 - kzfree((void *)info->completion_addr); 313 - kzfree(info); 308 + kfree_sensitive(info->scatter_components); 309 + kfree_sensitive(info->gather_components); 310 + kfree_sensitive(info->out_buffer); 311 + kfree_sensitive(info->in_buffer); 312 + kfree_sensitive((void *)info->completion_addr); 313 + kfree_sensitive(info); 314 314 } 315 315 316 316 static void do_post_process(struct cpt_vf *cptvf, struct cpt_info_buffer *info)
+2 -2
drivers/crypto/cavium/nitrox/nitrox_lib.c
··· 90 90 91 91 for (i = 0; i < ndev->nr_queues; i++) { 92 92 nitrox_cmdq_cleanup(ndev->aqmq[i]); 93 - kzfree(ndev->aqmq[i]); 93 + kfree_sensitive(ndev->aqmq[i]); 94 94 ndev->aqmq[i] = NULL; 95 95 } 96 96 } ··· 122 122 123 123 err = nitrox_cmdq_init(cmdq, AQM_Q_ALIGN_BYTES); 124 124 if (err) { 125 - kzfree(cmdq); 125 + kfree_sensitive(cmdq); 126 126 goto aqmq_fail; 127 127 } 128 128 ndev->aqmq[i] = cmdq;
+3 -3
drivers/crypto/cavium/zip/zip_crypto.c
··· 260 260 ret = zip_ctx_init(zip_ctx, 0); 261 261 262 262 if (ret) { 263 - kzfree(zip_ctx); 263 + kfree_sensitive(zip_ctx); 264 264 return ERR_PTR(ret); 265 265 } 266 266 ··· 279 279 ret = zip_ctx_init(zip_ctx, 1); 280 280 281 281 if (ret) { 282 - kzfree(zip_ctx); 282 + kfree_sensitive(zip_ctx); 283 283 return ERR_PTR(ret); 284 284 } 285 285 ··· 291 291 struct zip_kernel_ctx *zip_ctx = ctx; 292 292 293 293 zip_ctx_exit(zip_ctx); 294 - kzfree(zip_ctx); 294 + kfree_sensitive(zip_ctx); 295 295 } 296 296 297 297 int zip_scomp_compress(struct crypto_scomp *tfm,
+3 -3
drivers/crypto/ccp/ccp-crypto-rsa.c
··· 112 112 static void ccp_rsa_free_key_bufs(struct ccp_ctx *ctx) 113 113 { 114 114 /* Clean up old key data */ 115 - kzfree(ctx->u.rsa.e_buf); 115 + kfree_sensitive(ctx->u.rsa.e_buf); 116 116 ctx->u.rsa.e_buf = NULL; 117 117 ctx->u.rsa.e_len = 0; 118 - kzfree(ctx->u.rsa.n_buf); 118 + kfree_sensitive(ctx->u.rsa.n_buf); 119 119 ctx->u.rsa.n_buf = NULL; 120 120 ctx->u.rsa.n_len = 0; 121 - kzfree(ctx->u.rsa.d_buf); 121 + kfree_sensitive(ctx->u.rsa.d_buf); 122 122 ctx->u.rsa.d_buf = NULL; 123 123 ctx->u.rsa.d_len = 0; 124 124 }
+2 -2
drivers/crypto/ccree/cc_aead.c
··· 448 448 if (dma_mapping_error(dev, key_dma_addr)) { 449 449 dev_err(dev, "Mapping key va=0x%p len=%u for DMA failed\n", 450 450 key, keylen); 451 - kzfree(key); 451 + kfree_sensitive(key); 452 452 return -ENOMEM; 453 453 } 454 454 if (keylen > blocksize) { ··· 533 533 if (key_dma_addr) 534 534 dma_unmap_single(dev, key_dma_addr, keylen, DMA_TO_DEVICE); 535 535 536 - kzfree(key); 536 + kfree_sensitive(key); 537 537 538 538 return rc; 539 539 }
+2 -2
drivers/crypto/ccree/cc_buffer_mgr.c
··· 488 488 if (areq_ctx->gen_ctx.iv_dma_addr) { 489 489 dma_unmap_single(dev, areq_ctx->gen_ctx.iv_dma_addr, 490 490 hw_iv_size, DMA_BIDIRECTIONAL); 491 - kzfree(areq_ctx->gen_ctx.iv); 491 + kfree_sensitive(areq_ctx->gen_ctx.iv); 492 492 } 493 493 494 494 /* Release pool */ ··· 559 559 if (dma_mapping_error(dev, areq_ctx->gen_ctx.iv_dma_addr)) { 560 560 dev_err(dev, "Mapping iv %u B at va=%pK for DMA failed\n", 561 561 hw_iv_size, req->iv); 562 - kzfree(areq_ctx->gen_ctx.iv); 562 + kfree_sensitive(areq_ctx->gen_ctx.iv); 563 563 areq_ctx->gen_ctx.iv = NULL; 564 564 rc = -ENOMEM; 565 565 goto chain_iv_exit;
+3 -3
drivers/crypto/ccree/cc_cipher.c
··· 257 257 &ctx_p->user.key_dma_addr); 258 258 259 259 /* Free key buffer in context */ 260 - kzfree(ctx_p->user.key); 260 + kfree_sensitive(ctx_p->user.key); 261 261 dev_dbg(dev, "Free key buffer in context. key=@%p\n", ctx_p->user.key); 262 262 } 263 263 ··· 881 881 /* Not a BACKLOG notification */ 882 882 cc_unmap_cipher_request(dev, req_ctx, ivsize, src, dst); 883 883 memcpy(req->iv, req_ctx->iv, ivsize); 884 - kzfree(req_ctx->iv); 884 + kfree_sensitive(req_ctx->iv); 885 885 } 886 886 887 887 skcipher_request_complete(req, err); ··· 994 994 995 995 exit_process: 996 996 if (rc != -EINPROGRESS && rc != -EBUSY) { 997 - kzfree(req_ctx->iv); 997 + kfree_sensitive(req_ctx->iv); 998 998 } 999 999 1000 1000 return rc;
+4 -4
drivers/crypto/ccree/cc_hash.c
··· 764 764 if (dma_mapping_error(dev, ctx->key_params.key_dma_addr)) { 765 765 dev_err(dev, "Mapping key va=0x%p len=%u for DMA failed\n", 766 766 ctx->key_params.key, keylen); 767 - kzfree(ctx->key_params.key); 767 + kfree_sensitive(ctx->key_params.key); 768 768 return -ENOMEM; 769 769 } 770 770 dev_dbg(dev, "mapping key-buffer: key_dma_addr=%pad keylen=%u\n", ··· 913 913 &ctx->key_params.key_dma_addr, ctx->key_params.keylen); 914 914 } 915 915 916 - kzfree(ctx->key_params.key); 916 + kfree_sensitive(ctx->key_params.key); 917 917 918 918 return rc; 919 919 } ··· 950 950 if (dma_mapping_error(dev, ctx->key_params.key_dma_addr)) { 951 951 dev_err(dev, "Mapping key va=0x%p len=%u for DMA failed\n", 952 952 key, keylen); 953 - kzfree(ctx->key_params.key); 953 + kfree_sensitive(ctx->key_params.key); 954 954 return -ENOMEM; 955 955 } 956 956 dev_dbg(dev, "mapping key-buffer: key_dma_addr=%pad keylen=%u\n", ··· 999 999 dev_dbg(dev, "Unmapped key-buffer: key_dma_addr=%pad keylen=%u\n", 1000 1000 &ctx->key_params.key_dma_addr, ctx->key_params.keylen); 1001 1001 1002 - kzfree(ctx->key_params.key); 1002 + kfree_sensitive(ctx->key_params.key); 1003 1003 1004 1004 return rc; 1005 1005 }
+1 -1
drivers/crypto/ccree/cc_request_mgr.c
··· 107 107 /* Kill tasklet */ 108 108 tasklet_kill(&req_mgr_h->comptask); 109 109 #endif 110 - kzfree(req_mgr_h); 110 + kfree_sensitive(req_mgr_h); 111 111 drvdata->request_mgr_handle = NULL; 112 112 } 113 113
+1 -1
drivers/crypto/marvell/cesa/hash.c
··· 1157 1157 } 1158 1158 1159 1159 /* Set the memory region to 0 to avoid any leak. */ 1160 - kzfree(keydup); 1160 + kfree_sensitive(keydup); 1161 1161 1162 1162 if (ret) 1163 1163 return ret;
+3 -3
drivers/crypto/marvell/octeontx/otx_cptvf_main.c
··· 68 68 for (i = 0; i < cptvf->num_queues; i++) 69 69 tasklet_kill(&cwqe_info->vq_wqe[i].twork); 70 70 71 - kzfree(cwqe_info); 71 + kfree_sensitive(cwqe_info); 72 72 cptvf->wqe_info = NULL; 73 73 } 74 74 ··· 82 82 continue; 83 83 84 84 /* free single queue */ 85 - kzfree((queue->head)); 85 + kfree_sensitive((queue->head)); 86 86 queue->front = 0; 87 87 queue->rear = 0; 88 88 queue->qlen = 0; ··· 176 176 chunk->head = NULL; 177 177 chunk->dma_addr = 0; 178 178 list_del(&chunk->nextchunk); 179 - kzfree(chunk); 179 + kfree_sensitive(chunk); 180 180 } 181 181 queue->num_chunks = 0; 182 182 queue->idx = 0;
+1 -1
drivers/crypto/marvell/octeontx/otx_cptvf_reqmgr.h
··· 215 215 DMA_BIDIRECTIONAL); 216 216 } 217 217 } 218 - kzfree(info); 218 + kfree_sensitive(info); 219 219 } 220 220 221 221 struct otx_cptvf_wqe;
+2 -2
drivers/crypto/nx/nx.c
··· 746 746 { 747 747 struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(tfm); 748 748 749 - kzfree(nx_ctx->kmem); 749 + kfree_sensitive(nx_ctx->kmem); 750 750 nx_ctx->csbcpb = NULL; 751 751 nx_ctx->csbcpb_aead = NULL; 752 752 nx_ctx->in_sg = NULL; ··· 762 762 { 763 763 struct nx_crypto_ctx *nx_ctx = crypto_aead_ctx(tfm); 764 764 765 - kzfree(nx_ctx->kmem); 765 + kfree_sensitive(nx_ctx->kmem); 766 766 } 767 767 768 768 static int nx_probe(struct vio_dev *viodev, const struct vio_device_id *id)
+6 -6
drivers/crypto/virtio/virtio_crypto_algs.c
··· 167 167 num_in, vcrypto, GFP_ATOMIC); 168 168 if (err < 0) { 169 169 spin_unlock(&vcrypto->ctrl_lock); 170 - kzfree(cipher_key); 170 + kfree_sensitive(cipher_key); 171 171 return err; 172 172 } 173 173 virtqueue_kick(vcrypto->ctrl_vq); ··· 184 184 spin_unlock(&vcrypto->ctrl_lock); 185 185 pr_err("virtio_crypto: Create session failed status: %u\n", 186 186 le32_to_cpu(vcrypto->input.status)); 187 - kzfree(cipher_key); 187 + kfree_sensitive(cipher_key); 188 188 return -EINVAL; 189 189 } 190 190 ··· 197 197 198 198 spin_unlock(&vcrypto->ctrl_lock); 199 199 200 - kzfree(cipher_key); 200 + kfree_sensitive(cipher_key); 201 201 return 0; 202 202 } 203 203 ··· 472 472 return 0; 473 473 474 474 free_iv: 475 - kzfree(iv); 475 + kfree_sensitive(iv); 476 476 free: 477 - kzfree(req_data); 477 + kfree_sensitive(req_data); 478 478 kfree(sgs); 479 479 return err; 480 480 } ··· 583 583 scatterwalk_map_and_copy(req->iv, req->dst, 584 584 req->cryptlen - AES_BLOCK_SIZE, 585 585 AES_BLOCK_SIZE, 0); 586 - kzfree(vc_sym_req->iv); 586 + kfree_sensitive(vc_sym_req->iv); 587 587 virtcrypto_clear_request(&vc_sym_req->base); 588 588 589 589 crypto_finalize_skcipher_request(vc_sym_req->base.dataq->engine,
+1 -1
drivers/crypto/virtio/virtio_crypto_core.c
··· 17 17 virtcrypto_clear_request(struct virtio_crypto_request *vc_req) 18 18 { 19 19 if (vc_req) { 20 - kzfree(vc_req->req_data); 20 + kfree_sensitive(vc_req->req_data); 21 21 kfree(vc_req->sgs); 22 22 } 23 23 }
-1
drivers/iommu/ipmmu-vmsa.c
··· 28 28 29 29 #if defined(CONFIG_ARM) && !defined(CONFIG_IOMMU_DMA) 30 30 #include <asm/dma-iommu.h> 31 - #include <asm/pgalloc.h> 32 31 #else 33 32 #define arm_iommu_create_mapping(...) NULL 34 33 #define arm_iommu_attach_device(...) -ENODEV
+16 -16
drivers/md/dm-crypt.c
··· 407 407 crypto_free_shash(lmk->hash_tfm); 408 408 lmk->hash_tfm = NULL; 409 409 410 - kzfree(lmk->seed); 410 + kfree_sensitive(lmk->seed); 411 411 lmk->seed = NULL; 412 412 } 413 413 ··· 558 558 { 559 559 struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw; 560 560 561 - kzfree(tcw->iv_seed); 561 + kfree_sensitive(tcw->iv_seed); 562 562 tcw->iv_seed = NULL; 563 - kzfree(tcw->whitening); 563 + kfree_sensitive(tcw->whitening); 564 564 tcw->whitening = NULL; 565 565 566 566 if (tcw->crc32_tfm && !IS_ERR(tcw->crc32_tfm)) ··· 994 994 995 995 kunmap_atomic(data); 996 996 out: 997 - kzfree(ks); 998 - kzfree(es); 997 + kfree_sensitive(ks); 998 + kfree_sensitive(es); 999 999 skcipher_request_free(req); 1000 1000 return r; 1001 1001 } ··· 2294 2294 2295 2295 key = request_key(type, key_desc + 1, NULL); 2296 2296 if (IS_ERR(key)) { 2297 - kzfree(new_key_string); 2297 + kfree_sensitive(new_key_string); 2298 2298 return PTR_ERR(key); 2299 2299 } 2300 2300 ··· 2304 2304 if (ret < 0) { 2305 2305 up_read(&key->sem); 2306 2306 key_put(key); 2307 - kzfree(new_key_string); 2307 + kfree_sensitive(new_key_string); 2308 2308 return ret; 2309 2309 } 2310 2310 ··· 2318 2318 2319 2319 if (!ret) { 2320 2320 set_bit(DM_CRYPT_KEY_VALID, &cc->flags); 2321 - kzfree(cc->key_string); 2321 + kfree_sensitive(cc->key_string); 2322 2322 cc->key_string = new_key_string; 2323 2323 } else 2324 - kzfree(new_key_string); 2324 + kfree_sensitive(new_key_string); 2325 2325 2326 2326 return ret; 2327 2327 } ··· 2382 2382 clear_bit(DM_CRYPT_KEY_VALID, &cc->flags); 2383 2383 2384 2384 /* wipe references to any kernel keyring key */ 2385 - kzfree(cc->key_string); 2385 + kfree_sensitive(cc->key_string); 2386 2386 cc->key_string = NULL; 2387 2387 2388 2388 /* Decode key from its hex representation. */ ··· 2414 2414 return r; 2415 2415 } 2416 2416 2417 - kzfree(cc->key_string); 2417 + kfree_sensitive(cc->key_string); 2418 2418 cc->key_string = NULL; 2419 2419 r = crypt_setkey(cc); 2420 2420 memset(&cc->key, 0, cc->key_size * sizeof(u8)); ··· 2493 2493 if (cc->dev) 2494 2494 dm_put_device(ti, cc->dev); 2495 2495 2496 - kzfree(cc->cipher_string); 2497 - kzfree(cc->key_string); 2498 - kzfree(cc->cipher_auth); 2499 - kzfree(cc->authenc_key); 2496 + kfree_sensitive(cc->cipher_string); 2497 + kfree_sensitive(cc->key_string); 2498 + kfree_sensitive(cc->cipher_auth); 2499 + kfree_sensitive(cc->authenc_key); 2500 2500 2501 2501 mutex_destroy(&cc->bio_alloc_lock); 2502 2502 2503 2503 /* Must zero key material before freeing */ 2504 - kzfree(cc); 2504 + kfree_sensitive(cc); 2505 2505 2506 2506 spin_lock(&dm_crypt_clients_lock); 2507 2507 WARN_ON(!dm_crypt_clients_n);
+3 -3
drivers/md/dm-integrity.c
··· 3405 3405 3406 3406 static void free_alg(struct alg_spec *a) 3407 3407 { 3408 - kzfree(a->alg_string); 3409 - kzfree(a->key); 3408 + kfree_sensitive(a->alg_string); 3409 + kfree_sensitive(a->key); 3410 3410 memset(a, 0, sizeof *a); 3411 3411 } 3412 3412 ··· 4337 4337 for (i = 0; i < ic->journal_sections; i++) { 4338 4338 struct skcipher_request *req = ic->sk_requests[i]; 4339 4339 if (req) { 4340 - kzfree(req->iv); 4340 + kfree_sensitive(req->iv); 4341 4341 skcipher_request_free(req); 4342 4342 } 4343 4343 }
+3 -3
drivers/misc/ibmvmc.c
··· 286 286 287 287 if (dma_mapping_error(&vdev->dev, *dma_handle)) { 288 288 *dma_handle = 0; 289 - kzfree(buffer); 289 + kfree_sensitive(buffer); 290 290 return NULL; 291 291 } 292 292 ··· 310 310 dma_unmap_single(&vdev->dev, dma_handle, size, DMA_BIDIRECTIONAL); 311 311 312 312 /* deallocate memory */ 313 - kzfree(vaddr); 313 + kfree_sensitive(vaddr); 314 314 } 315 315 316 316 /** ··· 883 883 spin_unlock_irqrestore(&hmc->lock, flags); 884 884 } 885 885 886 - kzfree(session); 886 + kfree_sensitive(session); 887 887 888 888 return rc; 889 889 }
+1 -1
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c
··· 137 137 138 138 while (chain) { 139 139 chain_tmp = chain->next; 140 - kzfree(chain); 140 + kfree_sensitive(chain); 141 141 chain = chain_tmp; 142 142 } 143 143 }
+3 -3
drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
··· 960 960 return 0; 961 961 962 962 err_aead: 963 - kzfree(xs->aead); 963 + kfree_sensitive(xs->aead); 964 964 err_xs: 965 - kzfree(xs); 965 + kfree_sensitive(xs); 966 966 err_out: 967 967 msgbuf[1] = err; 968 968 return err; ··· 1047 1047 ixgbe_ipsec_del_sa(xs); 1048 1048 1049 1049 /* remove the xs that was made-up in the add request */ 1050 - kzfree(xs); 1050 + kfree_sensitive(xs); 1051 1051 1052 1052 return 0; 1053 1053 }
+3 -3
drivers/net/ppp/ppp_mppe.c
··· 222 222 kfree(state->sha1_digest); 223 223 if (state->sha1) { 224 224 crypto_free_shash(state->sha1->tfm); 225 - kzfree(state->sha1); 225 + kfree_sensitive(state->sha1); 226 226 } 227 227 kfree(state); 228 228 out: ··· 238 238 if (state) { 239 239 kfree(state->sha1_digest); 240 240 crypto_free_shash(state->sha1->tfm); 241 - kzfree(state->sha1); 242 - kzfree(state); 241 + kfree_sensitive(state->sha1); 242 + kfree_sensitive(state); 243 243 } 244 244 } 245 245
+2 -2
drivers/net/wireguard/noise.c
··· 114 114 115 115 static void keypair_free_rcu(struct rcu_head *rcu) 116 116 { 117 - kzfree(container_of(rcu, struct noise_keypair, rcu)); 117 + kfree_sensitive(container_of(rcu, struct noise_keypair, rcu)); 118 118 } 119 119 120 120 static void keypair_free_kref(struct kref *kref) ··· 821 821 handshake->entry.peer->device->index_hashtable, 822 822 &handshake->entry, &new_keypair->entry); 823 823 } else { 824 - kzfree(new_keypair); 824 + kfree_sensitive(new_keypair); 825 825 } 826 826 rcu_read_unlock_bh(); 827 827
+1 -1
drivers/net/wireguard/peer.c
··· 203 203 /* The final zeroing takes care of clearing any remaining handshake key 204 204 * material and other potentially sensitive information. 205 205 */ 206 - kzfree(peer); 206 + kfree_sensitive(peer); 207 207 } 208 208 209 209 static void kref_release(struct kref *refcount)
+1 -1
drivers/net/wireless/intel/iwlwifi/pcie/rx.c
··· 1369 1369 &rxcb, rxq->id); 1370 1370 1371 1371 if (reclaim) { 1372 - kzfree(txq->entries[cmd_index].free_buf); 1372 + kfree_sensitive(txq->entries[cmd_index].free_buf); 1373 1373 txq->entries[cmd_index].free_buf = NULL; 1374 1374 } 1375 1375
+3 -3
drivers/net/wireless/intel/iwlwifi/pcie/tx-gen2.c
··· 1026 1026 BUILD_BUG_ON(IWL_TFH_NUM_TBS > sizeof(out_meta->tbs) * BITS_PER_BYTE); 1027 1027 out_meta->flags = cmd->flags; 1028 1028 if (WARN_ON_ONCE(txq->entries[idx].free_buf)) 1029 - kzfree(txq->entries[idx].free_buf); 1029 + kfree_sensitive(txq->entries[idx].free_buf); 1030 1030 txq->entries[idx].free_buf = dup_buf; 1031 1031 1032 1032 trace_iwlwifi_dev_hcmd(trans->dev, cmd, cmd_size, &out_cmd->hdr_wide); ··· 1257 1257 /* De-alloc array of command/tx buffers */ 1258 1258 if (txq_id == trans->txqs.cmd.q_id) 1259 1259 for (i = 0; i < txq->n_window; i++) { 1260 - kzfree(txq->entries[i].cmd); 1261 - kzfree(txq->entries[i].free_buf); 1260 + kfree_sensitive(txq->entries[i].cmd); 1261 + kfree_sensitive(txq->entries[i].free_buf); 1262 1262 } 1263 1263 del_timer_sync(&txq->stuck_timer); 1264 1264
+3 -3
drivers/net/wireless/intel/iwlwifi/pcie/tx.c
··· 721 721 /* De-alloc array of command/tx buffers */ 722 722 if (txq_id == trans->txqs.cmd.q_id) 723 723 for (i = 0; i < txq->n_window; i++) { 724 - kzfree(txq->entries[i].cmd); 725 - kzfree(txq->entries[i].free_buf); 724 + kfree_sensitive(txq->entries[i].cmd); 725 + kfree_sensitive(txq->entries[i].free_buf); 726 726 } 727 727 728 728 /* De-alloc circular buffer of TFDs */ ··· 1765 1765 BUILD_BUG_ON(IWL_TFH_NUM_TBS > sizeof(out_meta->tbs) * BITS_PER_BYTE); 1766 1766 out_meta->flags = cmd->flags; 1767 1767 if (WARN_ON_ONCE(txq->entries[idx].free_buf)) 1768 - kzfree(txq->entries[idx].free_buf); 1768 + kfree_sensitive(txq->entries[idx].free_buf); 1769 1769 txq->entries[idx].free_buf = dup_buf; 1770 1770 1771 1771 trace_iwlwifi_dev_hcmd(trans->dev, cmd, cmd_size, &out_cmd->hdr_wide);
+2 -2
drivers/net/wireless/intersil/orinoco/wext.c
··· 31 31 enum orinoco_alg alg, const u8 *key, int key_len, 32 32 const u8 *seq, int seq_len) 33 33 { 34 - kzfree(priv->keys[index].key); 35 - kzfree(priv->keys[index].seq); 34 + kfree_sensitive(priv->keys[index].key); 35 + kfree_sensitive(priv->keys[index].seq); 36 36 37 37 if (key_len) { 38 38 priv->keys[index].key = kzalloc(key_len, GFP_ATOMIC);
+2 -2
drivers/s390/crypto/ap_bus.h
··· 219 219 */ 220 220 static inline void ap_release_message(struct ap_message *ap_msg) 221 221 { 222 - kzfree(ap_msg->msg); 223 - kzfree(ap_msg->private); 222 + kfree_sensitive(ap_msg->msg); 223 + kfree_sensitive(ap_msg->private); 224 224 } 225 225 226 226 /*
+1 -1
drivers/staging/ks7010/ks_hostif.c
··· 245 245 ret = crypto_shash_finup(desc, data + 12, len - 12, result); 246 246 247 247 err_free_desc: 248 - kzfree(desc); 248 + kfree_sensitive(desc); 249 249 250 250 err_free_tfm: 251 251 crypto_free_shash(tfm);
+1 -1
drivers/staging/rtl8723bs/core/rtw_security.c
··· 2251 2251 2252 2252 static void aes_encrypt_deinit(void *ctx) 2253 2253 { 2254 - kzfree(ctx); 2254 + kfree_sensitive(ctx); 2255 2255 } 2256 2256 2257 2257
+1 -1
drivers/staging/wlan-ng/p80211netdev.c
··· 429 429 failed: 430 430 /* Free up the WEP buffer if it's not the same as the skb */ 431 431 if ((p80211_wep.data) && (p80211_wep.data != skb->data)) 432 - kzfree(p80211_wep.data); 432 + kfree_sensitive(p80211_wep.data); 433 433 434 434 /* we always free the skb here, never in a lower level. */ 435 435 if (!result)
+1 -1
drivers/target/iscsi/iscsi_target_auth.c
··· 484 484 pr_debug("[server] Sending CHAP_R=0x%s\n", response); 485 485 auth_ret = 0; 486 486 out: 487 - kzfree(desc); 487 + kfree_sensitive(desc); 488 488 if (tfm) 489 489 crypto_free_shash(tfm); 490 490 kfree(initiatorchg);
-1
drivers/xen/balloon.c
··· 58 58 #include <linux/sysctl.h> 59 59 60 60 #include <asm/page.h> 61 - #include <asm/pgalloc.h> 62 61 #include <asm/tlb.h> 63 62 64 63 #include <asm/xen/hypervisor.h>
-1
drivers/xen/privcmd.c
··· 25 25 #include <linux/miscdevice.h> 26 26 #include <linux/moduleparam.h> 27 27 28 - #include <asm/pgalloc.h> 29 28 #include <asm/xen/hypervisor.h> 30 29 #include <asm/xen/hypercall.h> 31 30
+21
fs/Kconfig
··· 201 201 202 202 If unsure, say N. 203 203 204 + config TMPFS_INODE64 205 + bool "Use 64-bit ino_t by default in tmpfs" 206 + depends on TMPFS && 64BIT 207 + default n 208 + help 209 + tmpfs has historically used only inode numbers as wide as an unsigned 210 + int. In some cases this can cause wraparound, potentially resulting 211 + in multiple files with the same inode number on a single device. This 212 + option makes tmpfs use the full width of ino_t by default, without 213 + needing to specify the inode64 option when mounting. 214 + 215 + But if a long-lived tmpfs is to be accessed by 32-bit applications so 216 + ancient that opening a file larger than 2GiB fails with EINVAL, then 217 + the INODE64 config option and inode64 mount option risk operations 218 + failing with EOVERFLOW once 33-bit inode numbers are reached. 219 + 220 + To override this configured default, use the inode32 or inode64 221 + option when mounting. 222 + 223 + If unsure, say N. 224 + 204 225 config HUGETLBFS 205 226 bool "HugeTLB file system support" 206 227 depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
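The wraparound the TMPFS_INODE64 help text describes is plain integer truncation: once more than 2^32 inodes have been handed out, a 33-bit inode number narrowed to an unsigned int collides with a small one, and two distinct files appear to share an inode. A tiny sketch of that collision (the function name is illustrative only):

```c
#include <stdint.h>

/* Narrowing a 64-bit inode number to 32 bits, as a tmpfs limited to
 * unsigned-int inode numbers effectively does once the counter wraps. */
uint32_t truncate_ino(uint64_t ino64)
{
	return (uint32_t)ino64;
}
```

With inode64 (or this config default), the full width of ino_t is used instead, at the cost of EOVERFLOW for very old 32-bit callers once 33-bit numbers are reached.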
+3 -3
fs/aio.c
··· 525 525 return -EINTR; 526 526 } 527 527 528 - ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size, 529 - PROT_READ | PROT_WRITE, 530 - MAP_SHARED, 0, &unused, NULL); 528 + ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size, 529 + PROT_READ | PROT_WRITE, 530 + MAP_SHARED, 0, &unused, NULL); 531 531 mmap_write_unlock(mm); 532 532 if (IS_ERR((void *)ctx->mmap_base)) { 533 533 ctx->mmap_size = 0;
-1
fs/binfmt_elf_fdpic.c
··· 38 38 39 39 #include <linux/uaccess.h> 40 40 #include <asm/param.h> 41 - #include <asm/pgalloc.h> 42 41 43 42 typedef char *elf_caddr_t; 44 43
+1 -1
fs/cifs/cifsencrypt.c
··· 797 797 ses->auth_key.len = CIFS_SESS_KEY_SIZE; 798 798 799 799 memzero_explicit(sec_key, CIFS_SESS_KEY_SIZE); 800 - kzfree(ctx_arc4); 800 + kfree_sensitive(ctx_arc4); 801 801 return 0; 802 802 } 803 803
+5 -5
fs/cifs/connect.c
··· 2183 2183 tmp_end++; 2184 2184 if (!(tmp_end < end && tmp_end[1] == delim)) { 2185 2185 /* No it is not. Set the password to NULL */ 2186 - kzfree(vol->password); 2186 + kfree_sensitive(vol->password); 2187 2187 vol->password = NULL; 2188 2188 break; 2189 2189 } ··· 2221 2221 options = end; 2222 2222 } 2223 2223 2224 - kzfree(vol->password); 2224 + kfree_sensitive(vol->password); 2225 2225 /* Now build new password string */ 2226 2226 temp_len = strlen(value); 2227 2227 vol->password = kzalloc(temp_len+1, GFP_KERNEL); ··· 3199 3199 rc = -ENOMEM; 3200 3200 kfree(vol->username); 3201 3201 vol->username = NULL; 3202 - kzfree(vol->password); 3202 + kfree_sensitive(vol->password); 3203 3203 vol->password = NULL; 3204 3204 goto out_key_put; 3205 3205 } ··· 4220 4220 cifs_cleanup_volume_info_contents(struct smb_vol *volume_info) 4221 4221 { 4222 4222 kfree(volume_info->username); 4223 - kzfree(volume_info->password); 4223 + kfree_sensitive(volume_info->password); 4224 4224 kfree(volume_info->UNC); 4225 4225 kfree(volume_info->domainname); 4226 4226 kfree(volume_info->iocharset); ··· 5339 5339 5340 5340 out: 5341 5341 kfree(vol_info->username); 5342 - kzfree(vol_info->password); 5342 + kfree_sensitive(vol_info->password); 5343 5343 kfree(vol_info); 5344 5344 5345 5345 return tcon;
+1 -1
fs/cifs/dfs_cache.c
··· 1191 1191 err_free_unc: 1192 1192 kfree(new->UNC); 1193 1193 err_free_password: 1194 - kzfree(new->password); 1194 + kfree_sensitive(new->password); 1195 1195 err_free_username: 1196 1196 kfree(new->username); 1197 1197 kfree(new);
+4 -4
fs/cifs/misc.c
··· 103 103 kfree(buf_to_free->serverOS); 104 104 kfree(buf_to_free->serverDomain); 105 105 kfree(buf_to_free->serverNOS); 106 - kzfree(buf_to_free->password); 106 + kfree_sensitive(buf_to_free->password); 107 107 kfree(buf_to_free->user_name); 108 108 kfree(buf_to_free->domainName); 109 - kzfree(buf_to_free->auth_key.response); 109 + kfree_sensitive(buf_to_free->auth_key.response); 110 110 kfree(buf_to_free->iface_list); 111 - kzfree(buf_to_free); 111 + kfree_sensitive(buf_to_free); 112 112 } 113 113 114 114 struct cifs_tcon * ··· 148 148 } 149 149 atomic_dec(&tconInfoAllocCount); 150 150 kfree(buf_to_free->nativeFileSystem); 151 - kzfree(buf_to_free->password); 151 + kfree_sensitive(buf_to_free->password); 152 152 kfree(buf_to_free->crfid.fid); 153 153 #ifdef CONFIG_CIFS_DFS_UPCALL 154 154 kfree(buf_to_free->dfs_path);
+3 -2
fs/crypto/inline_crypt.c
··· 16 16 #include <linux/blkdev.h> 17 17 #include <linux/buffer_head.h> 18 18 #include <linux/sched/mm.h> 19 + #include <linux/slab.h> 19 20 20 21 #include "fscrypt_private.h" 21 22 ··· 188 187 fail: 189 188 for (i = 0; i < queue_refs; i++) 190 189 blk_put_queue(blk_key->devs[i]); 191 - kzfree(blk_key); 190 + kfree_sensitive(blk_key); 192 191 return err; 193 192 } 194 193 ··· 202 201 blk_crypto_evict_key(blk_key->devs[i], &blk_key->base); 203 202 blk_put_queue(blk_key->devs[i]); 204 203 } 205 - kzfree(blk_key); 204 + kfree_sensitive(blk_key); 206 205 } 207 206 } 208 207
+3 -3
fs/crypto/keyring.c
··· 51 51 } 52 52 53 53 key_put(mk->mk_users); 54 - kzfree(mk); 54 + kfree_sensitive(mk); 55 55 } 56 56 57 57 static inline bool valid_key_spec(const struct fscrypt_key_specifier *spec) ··· 531 531 static void fscrypt_provisioning_key_free_preparse( 532 532 struct key_preparsed_payload *prep) 533 533 { 534 - kzfree(prep->payload.data[0]); 534 + kfree_sensitive(prep->payload.data[0]); 535 535 } 536 536 537 537 static void fscrypt_provisioning_key_describe(const struct key *key, ··· 548 548 549 549 static void fscrypt_provisioning_key_destroy(struct key *key) 550 550 { 551 - kzfree(key->payload.data[0]); 551 + kfree_sensitive(key->payload.data[0]); 552 552 } 553 553 554 554 static struct key_type key_type_fscrypt_provisioning = {
+2 -2
fs/crypto/keysetup_v1.c
··· 155 155 { 156 156 if (dk) { 157 157 fscrypt_destroy_prepared_key(&dk->dk_key); 158 - kzfree(dk); 158 + kfree_sensitive(dk); 159 159 } 160 160 } 161 161 ··· 283 283 284 284 err = fscrypt_set_per_file_enc_key(ci, derived_key); 285 285 out: 286 - kzfree(derived_key); 286 + kfree_sensitive(derived_key); 287 287 return err; 288 288 } 289 289
+2 -2
fs/ecryptfs/keystore.c
··· 838 838 out_release_free_unlock: 839 839 crypto_free_shash(s->hash_tfm); 840 840 out_free_unlock: 841 - kzfree(s->block_aligned_filename); 841 + kfree_sensitive(s->block_aligned_filename); 842 842 out_unlock: 843 843 mutex_unlock(s->tfm_mutex); 844 844 out: ··· 847 847 key_put(auth_tok_key); 848 848 } 849 849 skcipher_request_free(s->skcipher_req); 850 - kzfree(s->hash_desc); 850 + kfree_sensitive(s->hash_desc); 851 851 kfree(s); 852 852 return rc; 853 853 }
+1 -1
fs/ecryptfs/messaging.c
··· 175 175 } 176 176 hlist_del(&daemon->euid_chain); 177 177 mutex_unlock(&daemon->mux); 178 - kzfree(daemon); 178 + kfree_sensitive(daemon); 179 179 out: 180 180 return rc; 181 181 }
+1 -1
fs/hugetlbfs/inode.c
··· 140 140 * already been checked by prepare_hugepage_range. If you add 141 141 * any error returns here, do so after setting VM_HUGETLB, so 142 142 * is_vm_hugetlb_page tests below unmap_region go the right 143 - * way when do_mmap_pgoff unwinds (may be important on powerpc 143 + * way when do_mmap unwinds (may be important on powerpc 144 144 * and ia64). 145 145 */ 146 146 vma->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
+1 -1
fs/ntfs/dir.c
··· 1504 1504 na.type = AT_BITMAP; 1505 1505 na.name = I30; 1506 1506 na.name_len = 4; 1507 - bmp_vi = ilookup5(vi->i_sb, vi->i_ino, (test_t)ntfs_test_inode, &na); 1507 + bmp_vi = ilookup5(vi->i_sb, vi->i_ino, ntfs_test_inode, &na); 1508 1508 if (bmp_vi) { 1509 1509 write_inode_now(bmp_vi, !datasync); 1510 1510 iput(bmp_vi);
+14 -13
fs/ntfs/inode.c
··· 30 30 /** 31 31 * ntfs_test_inode - compare two (possibly fake) inodes for equality 32 32 * @vi: vfs inode which to test 33 - * @na: ntfs attribute which is being tested with 33 + * @data: data which is being tested with 34 34 * 35 35 * Compare the ntfs attribute embedded in the ntfs specific part of the vfs 36 - * inode @vi for equality with the ntfs attribute @na. 36 + * inode @vi for equality with the ntfs attribute @data. 37 37 * 38 38 * If searching for the normal file/directory inode, set @na->type to AT_UNUSED. 39 39 * @na->name and @na->name_len are then ignored. ··· 43 43 * NOTE: This function runs with the inode_hash_lock spin lock held so it is not 44 44 * allowed to sleep. 45 45 */ 46 - int ntfs_test_inode(struct inode *vi, ntfs_attr *na) 46 + int ntfs_test_inode(struct inode *vi, void *data) 47 47 { 48 + ntfs_attr *na = (ntfs_attr *)data; 48 49 ntfs_inode *ni; 49 50 50 51 if (vi->i_ino != na->mft_no) ··· 73 72 /** 74 73 * ntfs_init_locked_inode - initialize an inode 75 74 * @vi: vfs inode to initialize 76 - * @na: ntfs attribute which to initialize @vi to 75 + * @data: data which to initialize @vi to 77 76 * 78 - * Initialize the vfs inode @vi with the values from the ntfs attribute @na in 77 + * Initialize the vfs inode @vi with the values from the ntfs attribute @data in 79 78 * order to enable ntfs_test_inode() to do its work. 80 79 * 81 80 * If initializing the normal file/directory inode, set @na->type to AT_UNUSED. ··· 88 87 * NOTE: This function runs with the inode->i_lock spin lock held so it is not 89 88 * allowed to sleep. (Hence the GFP_ATOMIC allocation.) 90 89 */ 91 - static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na) 90 + static int ntfs_init_locked_inode(struct inode *vi, void *data) 92 91 { 92 + ntfs_attr *na = (ntfs_attr *)data; 93 93 ntfs_inode *ni = NTFS_I(vi); 94 94 95 95 vi->i_ino = na->mft_no; ··· 133 131 return 0; 134 132 } 135 133 136 - typedef int (*set_t)(struct inode *, void *); 137 134 static int ntfs_read_locked_inode(struct inode *vi); 138 135 static int ntfs_read_locked_attr_inode(struct inode *base_vi, struct inode *vi); 139 136 static int ntfs_read_locked_index_inode(struct inode *base_vi, ··· 165 164 na.name = NULL; 166 165 na.name_len = 0; 167 166 168 - vi = iget5_locked(sb, mft_no, (test_t)ntfs_test_inode, 169 - (set_t)ntfs_init_locked_inode, &na); 167 + vi = iget5_locked(sb, mft_no, ntfs_test_inode, 168 + ntfs_init_locked_inode, &na); 170 169 if (unlikely(!vi)) 171 170 return ERR_PTR(-ENOMEM); 172 171 ··· 226 225 na.name = name; 227 226 na.name_len = name_len; 228 227 229 - vi = iget5_locked(base_vi->i_sb, na.mft_no, (test_t)ntfs_test_inode, 230 - (set_t)ntfs_init_locked_inode, &na); 228 + vi = iget5_locked(base_vi->i_sb, na.mft_no, ntfs_test_inode, 229 + ntfs_init_locked_inode, &na); 231 230 if (unlikely(!vi)) 232 231 return ERR_PTR(-ENOMEM); 233 232 ··· 281 280 na.name = name; 282 281 na.name_len = name_len; 283 282 284 - vi = iget5_locked(base_vi->i_sb, na.mft_no, (test_t)ntfs_test_inode, 285 - (set_t)ntfs_init_locked_inode, &na); 283 + vi = iget5_locked(base_vi->i_sb, na.mft_no, ntfs_test_inode, 284 + ntfs_init_locked_inode, &na); 286 285 if (unlikely(!vi)) 287 286 return ERR_PTR(-ENOMEM); 288 287
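The ntfs hunks above drop the `(test_t)` and `(set_t)` function-pointer casts: calling a function through a pointer of an incompatible type is undefined behavior (and breaks Control Flow Integrity checking), so the callbacks now match the `void *` signature that iget5_locked() expects and cast the data argument inside the function instead. A standalone sketch of that pattern (the names here are illustrative, not kernel code):

```c
struct item {
	unsigned long ino;
};

/* The callback matches the generic signature exactly; the cast
 * happens on the data pointer inside the function, never on the
 * function pointer itself. */
typedef int (*test_fn)(struct item *, void *);

int test_ino(struct item *it, void *data)
{
	unsigned long *want = data;

	return it->ino == *want;
}

/* Generic search taking the correctly-typed callback, in the role
 * iget5_locked() plays for ntfs_test_inode(). */
int lookup(struct item *items, int n, test_fn test, void *data)
{
	for (int i = 0; i < n; i++)
		if (test(&items[i], data))
			return i;
	return -1;
}
```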
+1 -3
fs/ntfs/inode.h
··· 253 253 ATTR_TYPE type; 254 254 } ntfs_attr; 255 255 256 - typedef int (*test_t)(struct inode *, void *); 257 - 258 - extern int ntfs_test_inode(struct inode *vi, ntfs_attr *na); 256 + extern int ntfs_test_inode(struct inode *vi, void *data); 259 257 260 258 extern struct inode *ntfs_iget(struct super_block *sb, unsigned long mft_no); 261 259 extern struct inode *ntfs_attr_iget(struct inode *base_vi, ATTR_TYPE type,
+2 -2
fs/ntfs/mft.c
··· 958 958 * dirty code path of the inode dirty code path when writing 959 959 * $MFT occurs. 960 960 */ 961 - vi = ilookup5_nowait(sb, mft_no, (test_t)ntfs_test_inode, &na); 961 + vi = ilookup5_nowait(sb, mft_no, ntfs_test_inode, &na); 962 962 } 963 963 if (vi) { 964 964 ntfs_debug("Base inode 0x%lx is in icache.", mft_no); ··· 1019 1019 vi = igrab(mft_vi); 1020 1020 BUG_ON(vi != mft_vi); 1021 1021 } else 1022 - vi = ilookup5_nowait(sb, na.mft_no, (test_t)ntfs_test_inode, 1022 + vi = ilookup5_nowait(sb, na.mft_no, ntfs_test_inode, 1023 1023 &na); 1024 1024 if (!vi) { 1025 1025 /*
+3 -3
fs/ocfs2/Kconfig
··· 16 16 You'll want to install the ocfs2-tools package in order to at least 17 17 get "mount.ocfs2". 18 18 19 - Project web page: http://oss.oracle.com/projects/ocfs2 20 - Tools web page: http://oss.oracle.com/projects/ocfs2-tools 21 - OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ 19 + Project web page: https://oss.oracle.com/projects/ocfs2 20 + Tools web page: https://oss.oracle.com/projects/ocfs2-tools 21 + OCFS2 mailing lists: https://oss.oracle.com/projects/ocfs2/mailman/ 22 22 23 23 For more information on OCFS2, see the file 24 24 <file:Documentation/filesystems/ocfs2.rst>.
+2
fs/ocfs2/acl.c
··· 256 256 ret = ocfs2_xattr_set(inode, name_index, "", value, size, 0); 257 257 258 258 kfree(value); 259 + if (!ret) 260 + set_cached_acl(inode, type, acl); 259 261 260 262 return ret; 261 263 }
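The ocfs2 ACL hunk refreshes the cached ACL only after ocfs2_xattr_set() has succeeded, so the in-memory cache can never get ahead of the on-disk state. The same update-cache-on-success shape in a self-contained sketch (store() and the globals are stand-ins, not ocfs2 code):

```c
/* Cache is refreshed only when the backing store accepted the write,
 * mirroring the set_cached_acl()-after-ocfs2_xattr_set() ordering. */
int cached = -1;	/* -1 means nothing cached */

int set_value(int v, int (*store)(int))
{
	int ret = store(v);

	if (!ret)
		cached = v;	/* only cache what actually hit the store */
	return ret;
}

/* Stub backends for the sketch. */
int store_ok(int v)   { (void)v; return 0; }
int store_fail(int v) { (void)v; return -5; }
```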
+1 -1
fs/ocfs2/blockcheck.c
··· 124 124 * parity bits that are part of the bit number 125 125 * representation. Huh? 126 126 * 127 - * <wikipedia href="http://en.wikipedia.org/wiki/Hamming_code"> 127 + * <wikipedia href="https://en.wikipedia.org/wiki/Hamming_code"> 128 128 * In other words, the parity bit at position 2^k 129 129 * checks bits in positions having bit k set in 130 130 * their binary representation. Conversely, for
+7 -1
fs/ocfs2/dlmglue.c
··· 2871 2871 2872 2872 status = ocfs2_cluster_lock(osb, lockres, ex ? LKM_EXMODE : LKM_PRMODE, 2873 2873 0, 0); 2874 - if (status < 0) 2874 + if (status < 0) { 2875 2875 mlog(ML_ERROR, "lock on nfs sync lock failed %d\n", status); 2876 + 2877 + if (ex) 2878 + up_write(&osb->nfs_sync_rwlock); 2879 + else 2880 + up_read(&osb->nfs_sync_rwlock); 2881 + } 2876 2882 2877 2883 return status; 2878 2884 }
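The dlmglue hunk fixes a lock leak: ocfs2_nfs_sync_lock() takes the nfs_sync rwsem before attempting the cluster lock, so when ocfs2_cluster_lock() fails the rwsem must be dropped on the way out. The shape of that error-path fix, sketched with a plain flag standing in for the rwsem (illustrative only, not the ocfs2 code):

```c
/* 1 while the outer "lock" is held; stands in for the rwsem. */
int held;

int take_both(int (*try_inner)(void))
{
	int status;

	held = 1;		/* down_read()/down_write() */
	status = try_inner();
	if (status < 0)
		held = 0;	/* the fix: release the outer lock on error */

	return status;
}

/* Stub inner acquisitions for the sketch. */
int inner_ok(void)   { return 0; }
int inner_fail(void) { return -5; }
```

On success the caller still holds the outer lock, as before; only the failure path changes.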
+2 -2
fs/ocfs2/ocfs2.h
··· 327 327 spinlock_t osb_lock; 328 328 u32 s_next_generation; 329 329 unsigned long osb_flags; 330 - s16 s_inode_steal_slot; 331 - s16 s_meta_steal_slot; 330 + u16 s_inode_steal_slot; 331 + u16 s_meta_steal_slot; 332 332 atomic_t s_num_inodes_stolen; 333 333 atomic_t s_num_meta_stolen; 334 334
+2 -2
fs/ocfs2/suballoc.c
··· 879 879 { 880 880 spin_lock(&osb->osb_lock); 881 881 if (type == INODE_ALLOC_SYSTEM_INODE) 882 - osb->s_inode_steal_slot = slot; 882 + osb->s_inode_steal_slot = (u16)slot; 883 883 else if (type == EXTENT_ALLOC_SYSTEM_INODE) 884 - osb->s_meta_steal_slot = slot; 884 + osb->s_meta_steal_slot = (u16)slot; 885 885 spin_unlock(&osb->osb_lock); 886 886 } 887 887
+1 -1
fs/ocfs2/suballoc.h
··· 40 40 41 41 u64 ac_last_group; 42 42 u64 ac_max_block; /* Highest block number to allocate. 0 is 43 - is the same as ~0 - unlimited */ 43 + the same as ~0 - unlimited */ 44 44 45 45 int ac_find_loc_only; /* hack for reflink operation ordering */ 46 46 struct ocfs2_suballoc_result *ac_find_loc_priv; /* */
+2 -2
fs/ocfs2/super.c
··· 78 78 unsigned long commit_interval; 79 79 unsigned long mount_opt; 80 80 unsigned int atime_quantum; 81 - signed short slot; 81 + unsigned short slot; 82 82 int localalloc_opt; 83 83 unsigned int resv_level; 84 84 int dir_resv_level; ··· 1349 1349 goto bail; 1350 1350 } 1351 1351 if (option) 1352 - mopt->slot = (s16)option; 1352 + mopt->slot = (u16)option; 1353 1353 break; 1354 1354 case Opt_commit: 1355 1355 if (match_int(&args[0], &option)) {
+5 -5
fs/proc/meminfo.c
··· 41 41 42 42 si_meminfo(&i); 43 43 si_swapinfo(&i); 44 - committed = percpu_counter_read_positive(&vm_committed_as); 44 + committed = vm_memory_committed(); 45 45 46 46 cached = global_node_page_state(NR_FILE_PAGES) - 47 47 total_swapcache_pages() - i.bufferram; ··· 52 52 pages[lru] = global_node_page_state(NR_LRU_BASE + lru); 53 53 54 54 available = si_mem_available(); 55 - sreclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE); 56 - sunreclaim = global_node_page_state(NR_SLAB_UNRECLAIMABLE); 55 + sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B); 56 + sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B); 57 57 58 58 show_val_kb(m, "MemTotal: ", i.totalram); 59 59 show_val_kb(m, "MemFree: ", i.freeram); ··· 101 101 show_val_kb(m, "SReclaimable: ", sreclaimable); 102 102 show_val_kb(m, "SUnreclaim: ", sunreclaim); 103 103 seq_printf(m, "KernelStack: %8lu kB\n", 104 - global_zone_page_state(NR_KERNEL_STACK_KB)); 104 + global_node_page_state(NR_KERNEL_STACK_KB)); 105 105 #ifdef CONFIG_SHADOW_CALL_STACK 106 106 seq_printf(m, "ShadowCallStack:%8lu kB\n", 107 - global_zone_page_state(NR_KERNEL_SCS_KB)); 107 + global_node_page_state(NR_KERNEL_SCS_KB)); 108 108 #endif 109 109 show_val_kb(m, "PageTables: ", 110 110 global_zone_page_state(NR_PAGETABLE));
+80
include/asm-generic/pgalloc.h
··· 102 102 __free_page(pte_page); 103 103 } 104 104 105 + 106 + #if CONFIG_PGTABLE_LEVELS > 2 107 + 108 + #ifndef __HAVE_ARCH_PMD_ALLOC_ONE 109 + /** 110 + * pmd_alloc_one - allocate a page for PMD-level page table 111 + * @mm: the mm_struct of the current context 112 + * 113 + * Allocates a page and runs the pgtable_pmd_page_ctor(). 114 + * Allocations use %GFP_PGTABLE_USER in user context and 115 + * %GFP_PGTABLE_KERNEL in kernel context. 116 + * 117 + * Return: pointer to the allocated memory or %NULL on error 118 + */ 119 + static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 120 + { 121 + struct page *page; 122 + gfp_t gfp = GFP_PGTABLE_USER; 123 + 124 + if (mm == &init_mm) 125 + gfp = GFP_PGTABLE_KERNEL; 126 + page = alloc_pages(gfp, 0); 127 + if (!page) 128 + return NULL; 129 + if (!pgtable_pmd_page_ctor(page)) { 130 + __free_pages(page, 0); 131 + return NULL; 132 + } 133 + return (pmd_t *)page_address(page); 134 + } 135 + #endif 136 + 137 + #ifndef __HAVE_ARCH_PMD_FREE 138 + static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 139 + { 140 + BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); 141 + pgtable_pmd_page_dtor(virt_to_page(pmd)); 142 + free_page((unsigned long)pmd); 143 + } 144 + #endif 145 + 146 + #endif /* CONFIG_PGTABLE_LEVELS > 2 */ 147 + 148 + #if CONFIG_PGTABLE_LEVELS > 3 149 + 150 + #ifndef __HAVE_ARCH_PUD_ALLOC_ONE 151 + /** 152 + * pud_alloc_one - allocate a page for PUD-level page table 153 + * @mm: the mm_struct of the current context 154 + * 155 + * Allocates a page using %GFP_PGTABLE_USER for user context and 156 + * %GFP_PGTABLE_KERNEL for kernel context. 
157 + * 158 + * Return: pointer to the allocated memory or %NULL on error 159 + */ 160 + static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) 161 + { 162 + gfp_t gfp = GFP_PGTABLE_USER; 163 + 164 + if (mm == &init_mm) 165 + gfp = GFP_PGTABLE_KERNEL; 166 + return (pud_t *)get_zeroed_page(gfp); 167 + } 168 + #endif 169 + 170 + static inline void pud_free(struct mm_struct *mm, pud_t *pud) 171 + { 172 + BUG_ON((unsigned long)pud & (PAGE_SIZE-1)); 173 + free_page((unsigned long)pud); 174 + } 175 + 176 + #endif /* CONFIG_PGTABLE_LEVELS > 3 */ 177 + 178 + #ifndef __HAVE_ARCH_PGD_FREE 179 + static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) 180 + { 181 + free_page((unsigned long)pgd); 182 + } 183 + #endif 184 + 105 185 #endif /* CONFIG_MMU */ 106 186 107 187 #endif /* __ASM_GENERIC_PGALLOC_H */
-1
include/asm-generic/tlb.h
··· 14 14 #include <linux/mmu_notifier.h> 15 15 #include <linux/swap.h> 16 16 #include <linux/hugetlb_inline.h> 17 - #include <asm/pgalloc.h> 18 17 #include <asm/tlbflush.h> 19 18 #include <asm/cacheflush.h> 20 19
+1 -1
include/crypto/aead.h
··· 425 425 */ 426 426 static inline void aead_request_free(struct aead_request *req) 427 427 { 428 - kzfree(req); 428 + kfree_sensitive(req); 429 429 } 430 430 431 431 /**
+1 -1
include/crypto/akcipher.h
··· 207 207 */ 208 208 static inline void akcipher_request_free(struct akcipher_request *req) 209 209 { 210 - kzfree(req); 210 + kfree_sensitive(req); 211 211 } 212 212 213 213 /**
+1 -1
include/crypto/gf128mul.h
··· 230 230 void gf128mul_x8_ble(le128 *r, const le128 *x); 231 231 static inline void gf128mul_free_4k(struct gf128mul_4k *t) 232 232 { 233 - kzfree(t); 233 + kfree_sensitive(t); 234 234 } 235 235 236 236
+1 -1
include/crypto/hash.h
··· 606 606 */ 607 607 static inline void ahash_request_free(struct ahash_request *req) 608 608 { 609 - kzfree(req); 609 + kfree_sensitive(req); 610 610 } 611 611 612 612 static inline void ahash_request_zero(struct ahash_request *req)
+1 -1
include/crypto/internal/acompress.h
··· 46 46 47 47 static inline void __acomp_request_free(struct acomp_req *req) 48 48 { 49 - kzfree(req); 49 + kfree_sensitive(req); 50 50 } 51 51 52 52 /**
+1 -1
include/crypto/kpp.h
··· 187 187 */ 188 188 static inline void kpp_request_free(struct kpp_request *req) 189 189 { 190 - kzfree(req); 190 + kfree_sensitive(req); 191 191 } 192 192 193 193 /**
+1 -1
include/crypto/skcipher.h
··· 508 508 */ 509 509 static inline void skcipher_request_free(struct skcipher_request *req) 510 510 { 511 - kzfree(req); 511 + kfree_sensitive(req); 512 512 } 513 513 514 514 static inline void skcipher_request_zero(struct skcipher_request *req)
+4
include/linux/efi.h
··· 606 606 extern void efi_map_pal_code (void); 607 607 extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg); 608 608 extern void efi_gettimeofday (struct timespec64 *ts); 609 + #ifdef CONFIG_EFI 609 610 extern void efi_enter_virtual_mode (void); /* switch EFI to virtual mode, if possible */ 611 + #else 612 + static inline void efi_enter_virtual_mode (void) {} 613 + #endif 610 614 #ifdef CONFIG_X86 611 615 extern efi_status_t efi_query_variable_store(u32 attributes, 612 616 unsigned long size,
+16 -1
include/linux/fs.h
··· 528 528 529 529 /* 530 530 * Might pages of this file have been modified in userspace? 531 - * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff 531 + * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap 532 532 * marks vma as VM_SHARED if it is shared, and the file was opened for 533 533 * writing i.e. vma may be mprotected writable even if now readonly. 534 534 * ··· 2949 2949 extern void discard_new_inode(struct inode *); 2950 2950 extern unsigned int get_next_ino(void); 2951 2951 extern void evict_inodes(struct super_block *sb); 2952 + 2953 + /* 2954 + * Userspace may rely on the inode number being non-zero. For example, glibc 2955 + * simply ignores files with zero i_ino in unlink() and other places. 2956 + * 2957 + * As an additional complication, if userspace was compiled with 2958 + * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the 2959 + * lower 32 bits, so we need to check that those aren't zero explicitly. With 2960 + * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but 2961 + * better safe than sorry. 2962 + */ 2963 + static inline bool is_zero_ino(ino_t ino) 2964 + { 2965 + return (u32)ino == 0; 2966 + } 2952 2967 2953 2968 extern void __iget(struct inode * inode); 2954 2969 extern void iget_failed(struct inode *);
+1 -1
include/linux/huge_mm.h
··· 42 42 unsigned long addr, unsigned long end, 43 43 unsigned char *vec); 44 44 extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, 45 - unsigned long new_addr, unsigned long old_end, 45 + unsigned long new_addr, 46 46 pmd_t *old_pmd, pmd_t *new_pmd); 47 47 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, 48 48 unsigned long addr, pgprot_t newprot,
+2 -2
include/linux/kasan.h
··· 38 38 void kasan_unpoison_shadow(const void *address, size_t size); 39 39 40 40 void kasan_unpoison_task_stack(struct task_struct *task); 41 - void kasan_unpoison_stack_above_sp_to(const void *watermark); 42 41 43 42 void kasan_alloc_pages(struct page *page, unsigned int order); 44 43 void kasan_free_pages(struct page *page, unsigned int order); ··· 100 101 static inline void kasan_unpoison_shadow(const void *address, size_t size) {} 101 102 102 103 static inline void kasan_unpoison_task_stack(struct task_struct *task) {} 103 - static inline void kasan_unpoison_stack_above_sp_to(const void *watermark) {} 104 104 105 105 static inline void kasan_enable_current(void) {} 106 106 static inline void kasan_disable_current(void) {} ··· 172 174 173 175 void kasan_cache_shrink(struct kmem_cache *cache); 174 176 void kasan_cache_shutdown(struct kmem_cache *cache); 177 + void kasan_record_aux_stack(void *ptr); 175 178 176 179 #else /* CONFIG_KASAN_GENERIC */ 177 180 178 181 static inline void kasan_cache_shrink(struct kmem_cache *cache) {} 179 182 static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} 183 + static inline void kasan_record_aux_stack(void *ptr) {} 180 184 181 185 #endif /* CONFIG_KASAN_GENERIC */ 182 186
+182 -21
include/linux/memcontrol.h
··· 23 23 #include <linux/page-flags.h> 24 24 25 25 struct mem_cgroup; 26 + struct obj_cgroup; 26 27 struct page; 27 28 struct mm_struct; 28 29 struct kmem_cache; ··· 32 31 enum memcg_stat_item { 33 32 MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS, 34 33 MEMCG_SOCK, 35 - /* XXX: why are these zone and not node counters? */ 36 - MEMCG_KERNEL_STACK_KB, 37 34 MEMCG_NR_STAT, 38 35 }; 39 36 ··· 45 46 MEMCG_SWAP_MAX, 46 47 MEMCG_SWAP_FAIL, 47 48 MEMCG_NR_MEMORY_EVENTS, 48 - }; 49 - 50 - enum mem_cgroup_protection { 51 - MEMCG_PROT_NONE, 52 - MEMCG_PROT_LOW, 53 - MEMCG_PROT_MIN, 54 49 }; 55 50 56 51 struct mem_cgroup_reclaim_cookie { ··· 186 193 }; 187 194 188 195 /* 196 + * Bucket for arbitrarily byte-sized objects charged to a memory 197 + * cgroup. The bucket can be reparented in one piece when the cgroup 198 + * is destroyed, without having to round up the individual references 199 + * of all live memory objects in the wild. 200 + */ 201 + struct obj_cgroup { 202 + struct percpu_ref refcnt; 203 + struct mem_cgroup *memcg; 204 + atomic_t nr_charged_bytes; 205 + union { 206 + struct list_head list; 207 + struct rcu_head rcu; 208 + }; 209 + }; 210 + 211 + /* 189 212 * The memory controller data structure. The memory controller controls both 190 213 * page cache and RSS per cgroup. 
We would eventually like to provide 191 214 * statistics based on the statistics developed by Rik Van Riel for clock-pro, ··· 309 300 /* Index in the kmem_cache->memcg_params.memcg_caches array */ 310 301 int kmemcg_id; 311 302 enum memcg_kmem_state kmem_state; 312 - struct list_head kmem_caches; 303 + struct obj_cgroup __rcu *objcg; 304 + struct list_head objcg_list; /* list of inherited objcgs */ 313 305 #endif 314 306 315 307 #ifdef CONFIG_CGROUP_WRITEBACK ··· 349 339 return !cgroup_subsys_enabled(memory_cgrp_subsys); 350 340 } 351 341 352 - static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, 342 + static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root, 343 + struct mem_cgroup *memcg, 353 344 bool in_low_reclaim) 354 345 { 355 346 if (mem_cgroup_disabled()) 347 + return 0; 348 + 349 + /* 350 + * There is no reclaim protection applied to a targeted reclaim. 351 + * We are special casing this specific case here because 352 + * mem_cgroup_protected calculation is not robust enough to keep 353 + * the protection invariant for calculated effective values for 354 + * parallel reclaimers with different reclaim target. This is 355 + * especially a problem for tail memcgs (as they have pages on LRU) 356 + * which would want to have effective values 0 for targeted reclaim 357 + * but a different value for external reclaim. 
358 + * 359 + * Example 360 + * Let's have global and A's reclaim in parallel: 361 + * | 362 + * A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G) 363 + * |\ 364 + * | C (low = 1G, usage = 2.5G) 365 + * B (low = 1G, usage = 0.5G) 366 + * 367 + * For the global reclaim 368 + * A.elow = A.low 369 + * B.elow = min(B.usage, B.low) because children_low_usage <= A.elow 370 + * C.elow = min(C.usage, C.low) 371 + * 372 + * With the effective values resetting we have A reclaim 373 + * A.elow = 0 374 + * B.elow = B.low 375 + * C.elow = C.low 376 + * 377 + * If the global reclaim races with A's reclaim then 378 + * B.elow = C.elow = 0 because children_low_usage > A.elow) 379 + * is possible and reclaiming B would be violating the protection. 380 + * 381 + */ 382 + if (root == memcg) 356 383 return 0; 357 384 358 385 if (in_low_reclaim) ··· 399 352 READ_ONCE(memcg->memory.elow)); 400 353 } 401 354 402 - enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, 403 - struct mem_cgroup *memcg); 355 + void mem_cgroup_calculate_protection(struct mem_cgroup *root, 356 + struct mem_cgroup *memcg); 357 + 358 + static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg) 359 + { 360 + /* 361 + * The root memcg doesn't account charges, and doesn't support 362 + * protection. 
363 + */ 364 + return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg); 365 + 366 + } 367 + 368 + static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg) 369 + { 370 + if (!mem_cgroup_supports_protection(memcg)) 371 + return false; 372 + 373 + return READ_ONCE(memcg->memory.elow) >= 374 + page_counter_read(&memcg->memory); 375 + } 376 + 377 + static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) 378 + { 379 + if (!mem_cgroup_supports_protection(memcg)) 380 + return false; 381 + 382 + return READ_ONCE(memcg->memory.emin) >= 383 + page_counter_read(&memcg->memory); 384 + } 404 385 405 386 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); 406 387 ··· 489 414 static inline 490 415 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ 491 416 return css ? container_of(css, struct mem_cgroup, css) : NULL; 417 + } 418 + 419 + static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg) 420 + { 421 + return percpu_ref_tryget(&objcg->refcnt); 422 + } 423 + 424 + static inline void obj_cgroup_get(struct obj_cgroup *objcg) 425 + { 426 + percpu_ref_get(&objcg->refcnt); 427 + } 428 + 429 + static inline void obj_cgroup_put(struct obj_cgroup *objcg) 430 + { 431 + percpu_ref_put(&objcg->refcnt); 432 + } 433 + 434 + /* 435 + * After the initialization objcg->memcg is always pointing at 436 + * a valid memcg, but can be atomically swapped to the parent memcg. 437 + * 438 + * The caller must ensure that the returned memcg won't be released: 439 + * e.g. acquire the rcu_read_lock or css_set_lock. 
440 + */ 441 + static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg) 442 + { 443 + return READ_ONCE(objcg->memcg); 492 444 } 493 445 494 446 static inline void mem_cgroup_put(struct mem_cgroup *memcg) ··· 781 679 return x; 782 680 } 783 681 682 + void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, 683 + int val); 784 684 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, 785 685 int val); 786 686 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val); 687 + 787 688 void mod_memcg_obj_state(void *p, int idx, int val); 689 + 690 + static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx, 691 + int val) 692 + { 693 + unsigned long flags; 694 + 695 + local_irq_save(flags); 696 + __mod_lruvec_slab_state(p, idx, val); 697 + local_irq_restore(flags); 698 + } 699 + 700 + static inline void mod_memcg_lruvec_state(struct lruvec *lruvec, 701 + enum node_stat_item idx, int val) 702 + { 703 + unsigned long flags; 704 + 705 + local_irq_save(flags); 706 + __mod_memcg_lruvec_state(lruvec, idx, val); 707 + local_irq_restore(flags); 708 + } 788 709 789 710 static inline void mod_lruvec_state(struct lruvec *lruvec, 790 711 enum node_stat_item idx, int val) ··· 950 825 { 951 826 } 952 827 953 - static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, 828 + static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root, 829 + struct mem_cgroup *memcg, 954 830 bool in_low_reclaim) 955 831 { 956 832 return 0; 957 833 } 958 834 959 - static inline enum mem_cgroup_protection mem_cgroup_protected( 960 - struct mem_cgroup *root, struct mem_cgroup *memcg) 835 + static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root, 836 + struct mem_cgroup *memcg) 961 837 { 962 - return MEMCG_PROT_NONE; 838 + } 839 + 840 + static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg) 841 + { 842 + return false; 843 + } 844 + 845 + static inline bool 
mem_cgroup_below_min(struct mem_cgroup *memcg) 846 + { 847 + return false; 963 848 } 964 849 965 850 static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm, ··· 1192 1057 return node_page_state(lruvec_pgdat(lruvec), idx); 1193 1058 } 1194 1059 1060 + static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec, 1061 + enum node_stat_item idx, int val) 1062 + { 1063 + } 1064 + 1195 1065 static inline void __mod_lruvec_state(struct lruvec *lruvec, 1196 1066 enum node_stat_item idx, int val) 1197 1067 { ··· 1227 1087 struct page *page = virt_to_head_page(p); 1228 1088 1229 1089 __mod_node_page_state(page_pgdat(page), idx, val); 1090 + } 1091 + 1092 + static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx, 1093 + int val) 1094 + { 1095 + struct page *page = virt_to_head_page(p); 1096 + 1097 + mod_node_page_state(page_pgdat(page), idx, val); 1230 1098 } 1231 1099 1232 1100 static inline void mod_memcg_obj_state(void *p, int idx, int val) ··· 1489 1341 } 1490 1342 #endif 1491 1343 1492 - struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); 1493 - void memcg_kmem_put_cache(struct kmem_cache *cachep); 1494 - 1495 1344 #ifdef CONFIG_MEMCG_KMEM 1496 1345 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp, 1497 1346 unsigned int nr_pages); ··· 1496 1351 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order); 1497 1352 void __memcg_kmem_uncharge_page(struct page *page, int order); 1498 1353 1354 + struct obj_cgroup *get_obj_cgroup_from_current(void); 1355 + 1356 + int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size); 1357 + void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size); 1358 + 1499 1359 extern struct static_key_false memcg_kmem_enabled_key; 1500 - extern struct workqueue_struct *memcg_kmem_cache_wq; 1501 1360 1502 1361 extern int memcg_nr_cache_ids; 1503 1362 void memcg_get_cache_ids(void); ··· 1517 1368 1518 1369 static inline bool memcg_kmem_enabled(void) 
1519 1370 { 1520 - return static_branch_unlikely(&memcg_kmem_enabled_key); 1371 + return static_branch_likely(&memcg_kmem_enabled_key); 1372 + } 1373 + 1374 + static inline bool memcg_kmem_bypass(void) 1375 + { 1376 + if (in_interrupt()) 1377 + return true; 1378 + 1379 + /* Allow remote memcg charging in kthread contexts. */ 1380 + if ((!current->mm || (current->flags & PF_KTHREAD)) && 1381 + !current->active_memcg) 1382 + return true; 1383 + return false; 1521 1384 } 1522 1385 1523 1386 static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+20 -66
include/linux/mm.h
··· 206 206 loff_t *); 207 207 int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *, 208 208 loff_t *); 209 + int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *, 210 + loff_t *); 209 211 210 212 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) 211 213 ··· 779 777 extern void kvfree(const void *addr); 780 778 extern void kvfree_sensitive(const void *addr, size_t len); 781 779 780 + static inline int head_mapcount(struct page *head) 781 + { 782 + return atomic_read(compound_mapcount_ptr(head)) + 1; 783 + } 784 + 782 785 /* 783 786 * Mapcount of compound page as a whole, does not include mapped sub-pages. 784 787 * ··· 793 786 { 794 787 VM_BUG_ON_PAGE(!PageCompound(page), page); 795 788 page = compound_head(page); 796 - return atomic_read(compound_mapcount_ptr(page)) + 1; 789 + return head_mapcount(page); 797 790 } 798 791 799 792 /* ··· 906 899 return PageCompound(page) && compound_order(page) > 1; 907 900 } 908 901 902 + static inline int head_pincount(struct page *head) 903 + { 904 + return atomic_read(compound_pincount_ptr(head)); 905 + } 906 + 909 907 static inline int compound_pincount(struct page *page) 910 908 { 911 909 VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); 912 910 page = compound_head(page); 913 - return atomic_read(compound_pincount_ptr(page)); 911 + return head_pincount(page); 914 912 } 915 913 916 914 static inline void set_compound_order(struct page *page, unsigned int order) ··· 2103 2091 NULL : pud_offset(p4d, address); 2104 2092 } 2105 2093 2106 - static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd, 2107 - unsigned long address, 2108 - pgtbl_mod_mask *mod_mask) 2109 - 2110 - { 2111 - if (unlikely(pgd_none(*pgd))) { 2112 - if (__p4d_alloc(mm, pgd, address)) 2113 - return NULL; 2114 - *mod_mask |= PGTBL_PGD_MODIFIED; 2115 - } 2116 - 2117 - return p4d_offset(pgd, address); 2118 - } 2119 - 2120 - static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d, 
2121 - unsigned long address, 2122 - pgtbl_mod_mask *mod_mask) 2123 - { 2124 - if (unlikely(p4d_none(*p4d))) { 2125 - if (__pud_alloc(mm, p4d, address)) 2126 - return NULL; 2127 - *mod_mask |= PGTBL_P4D_MODIFIED; 2128 - } 2129 - 2130 - return pud_offset(p4d, address); 2131 - } 2132 - 2133 2094 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) 2134 2095 { 2135 2096 return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))? 2136 2097 NULL: pmd_offset(pud, address); 2137 - } 2138 - 2139 - static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud, 2140 - unsigned long address, 2141 - pgtbl_mod_mask *mod_mask) 2142 - { 2143 - if (unlikely(pud_none(*pud))) { 2144 - if (__pmd_alloc(mm, pud, address)) 2145 - return NULL; 2146 - *mod_mask |= PGTBL_PUD_MODIFIED; 2147 - } 2148 - 2149 - return pmd_offset(pud, address); 2150 2098 } 2151 2099 #endif /* CONFIG_MMU */ 2152 2100 ··· 2221 2249 2222 2250 #define pte_alloc_kernel(pmd, address) \ 2223 2251 ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \ 2224 - NULL: pte_offset_kernel(pmd, address)) 2225 - 2226 - #define pte_alloc_kernel_track(pmd, address, mask) \ 2227 - ((unlikely(pmd_none(*(pmd))) && \ 2228 - (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\ 2229 2252 NULL: pte_offset_kernel(pmd, address)) 2230 2253 2231 2254 #if USE_SPLIT_PMD_PTLOCKS ··· 2380 2413 * for_each_valid_physical_page_range() 2381 2414 * memblock_add_node(base, size, nid) 2382 2415 * free_area_init(max_zone_pfns); 2383 - * 2384 - * sparse_memory_present_with_active_regions() calls memory_present() for 2385 - * each range when SPARSEMEM is enabled. 
2386 2416 */ 2387 2417 void free_area_init(unsigned long *max_zone_pfn); 2388 2418 unsigned long node_map_pfn_alignment(void); ··· 2390 2426 extern void get_pfn_range_for_nid(unsigned int nid, 2391 2427 unsigned long *start_pfn, unsigned long *end_pfn); 2392 2428 extern unsigned long find_min_pfn_with_active_regions(void); 2393 - extern void sparse_memory_present_with_active_regions(int nid); 2394 2429 2395 2430 #ifndef CONFIG_NEED_MULTIPLE_NODES 2396 2431 static inline int early_pfn_to_nid(unsigned long pfn) ··· 2540 2577 struct list_head *uf); 2541 2578 extern unsigned long do_mmap(struct file *file, unsigned long addr, 2542 2579 unsigned long len, unsigned long prot, unsigned long flags, 2543 - vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate, 2544 - struct list_head *uf); 2580 + unsigned long pgoff, unsigned long *populate, struct list_head *uf); 2545 2581 extern int __do_munmap(struct mm_struct *, unsigned long, size_t, 2546 2582 struct list_head *uf, bool downgrade); 2547 2583 extern int do_munmap(struct mm_struct *, unsigned long, size_t, 2548 2584 struct list_head *uf); 2549 2585 extern int do_madvise(unsigned long start, size_t len_in, int behavior); 2550 - 2551 - static inline unsigned long 2552 - do_mmap_pgoff(struct file *file, unsigned long addr, 2553 - unsigned long len, unsigned long prot, unsigned long flags, 2554 - unsigned long pgoff, unsigned long *populate, 2555 - struct list_head *uf) 2556 - { 2557 - return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf); 2558 - } 2559 2586 2560 2587 #ifdef CONFIG_MMU 2561 2588 extern int __mm_populate(unsigned long addr, unsigned long len, ··· 2962 3009 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node); 2963 3010 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node); 2964 3011 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node); 2965 - pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node); 3012 + pte_t 
*vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, 3013 + struct vmem_altmap *altmap); 2966 3014 void *vmemmap_alloc_block(unsigned long size, int node); 2967 3015 struct vmem_altmap; 2968 - void *vmemmap_alloc_block_buf(unsigned long size, int node); 2969 - void *altmap_alloc_block_buf(unsigned long size, struct vmem_altmap *altmap); 3016 + void *vmemmap_alloc_block_buf(unsigned long size, int node, 3017 + struct vmem_altmap *altmap); 2970 3018 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long); 2971 3019 int vmemmap_populate_basepages(unsigned long start, unsigned long end, 2972 - int node); 3020 + int node, struct vmem_altmap *altmap); 2973 3021 int vmemmap_populate(unsigned long start, unsigned long end, int node, 2974 3022 struct vmem_altmap *altmap); 2975 3023 void vmemmap_populate_print_last(void);
+4 -1
include/linux/mm_types.h
··· 198 198 atomic_t _refcount; 199 199 200 200 #ifdef CONFIG_MEMCG 201 - struct mem_cgroup *mem_cgroup; 201 + union { 202 + struct mem_cgroup *mem_cgroup; 203 + struct obj_cgroup **obj_cgroups; 204 + }; 202 205 #endif 203 206 204 207 /*
+4
include/linux/mman.h
··· 57 57 58 58 #ifdef CONFIG_SMP 59 59 extern s32 vm_committed_as_batch; 60 + extern void mm_compute_batch(int overcommit_policy); 60 61 #else 61 62 #define vm_committed_as_batch 0 63 + static inline void mm_compute_batch(int overcommit_policy) 64 + { 65 + } 62 66 #endif 63 67 64 68 unsigned long vm_memory_committed(void);
+13
include/linux/mmu_notifier.h
··· 521 521 range->flags = flags; 522 522 } 523 523 524 + static inline void mmu_notifier_range_init_migrate( 525 + struct mmu_notifier_range *range, unsigned int flags, 526 + struct vm_area_struct *vma, struct mm_struct *mm, 527 + unsigned long start, unsigned long end, void *pgmap) 528 + { 529 + mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm, 530 + start, end); 531 + range->migrate_pgmap_owner = pgmap; 532 + } 533 + 524 534 #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ 525 535 ({ \ 526 536 int __young; \ ··· 654 644 } 655 645 656 646 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end) \ 647 + _mmu_notifier_range_init(range, start, end) 648 + #define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \ 649 + pgmap) \ 657 650 _mmu_notifier_range_init(range, start, end) 658 651 659 652 static inline bool
+28 -24
include/linux/mmzone.h
··· 88 88 89 89 extern int page_group_by_mobility_disabled; 90 90 91 - #define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1) 92 - #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1) 91 + #define MIGRATETYPE_MASK ((1UL << PB_migratetype_bits) - 1) 93 92 94 93 #define get_pageblock_migratetype(page) \ 95 - get_pfnblock_flags_mask(page, page_to_pfn(page), \ 96 - PB_migrate_end, MIGRATETYPE_MASK) 94 + get_pfnblock_flags_mask(page, page_to_pfn(page), MIGRATETYPE_MASK) 97 95 98 96 struct free_area { 99 97 struct list_head free_list[MIGRATE_TYPES]; ··· 153 155 NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ 154 156 NR_MLOCK, /* mlock()ed pages found and moved off LRU */ 155 157 NR_PAGETABLE, /* used for pagetables */ 156 - NR_KERNEL_STACK_KB, /* measured in KiB */ 157 - #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) 158 - NR_KERNEL_SCS_KB, /* measured in KiB */ 159 - #endif 160 158 /* Second 128 byte cacheline */ 161 159 NR_BOUNCE, 162 160 #if IS_ENABLED(CONFIG_ZSMALLOC) ··· 168 174 NR_INACTIVE_FILE, /* " " " " " */ 169 175 NR_ACTIVE_FILE, /* " " " " " */ 170 176 NR_UNEVICTABLE, /* " " " " " */ 171 - NR_SLAB_RECLAIMABLE, 172 - NR_SLAB_UNRECLAIMABLE, 177 + NR_SLAB_RECLAIMABLE_B, 178 + NR_SLAB_UNRECLAIMABLE_B, 173 179 NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ 174 180 NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ 175 181 WORKINGSET_NODES, ··· 197 203 NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */ 198 204 NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */ 199 205 NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */ 206 + NR_KERNEL_STACK_KB, /* measured in KiB */ 207 + #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) 208 + NR_KERNEL_SCS_KB, /* measured in KiB */ 209 + #endif 200 210 NR_VM_NODE_STAT_ITEMS 201 211 }; 212 + 213 + /* 214 + * Returns true if the value is measured in bytes (most vmstat values are 215 + * measured in pages). 
This defines the API part, the internal representation 216 + * might be different. 217 + */ 218 + static __always_inline bool vmstat_item_in_bytes(int idx) 219 + { 220 + /* 221 + * Global and per-node slab counters track slab pages. 222 + * It's expected that changes are multiples of PAGE_SIZE. 223 + * Internally values are stored in pages. 224 + * 225 + * Per-memcg and per-lruvec counters track memory, consumed 226 + * by individual slab objects. These counters are actually 227 + * byte-precise. 228 + */ 229 + return (idx == NR_SLAB_RECLAIMABLE_B || 230 + idx == NR_SLAB_UNRECLAIMABLE_B); 231 + } 202 232 203 233 /* 204 234 * We do arithmetic on the LRU lists in various places in the code, ··· 837 819 838 820 extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx); 839 821 840 - #ifdef CONFIG_HAVE_MEMORY_PRESENT 841 - void memory_present(int nid, unsigned long start, unsigned long end); 842 - #else 843 - static inline void memory_present(int nid, unsigned long start, unsigned long end) {} 844 - #endif 845 - 846 - #if defined(CONFIG_SPARSEMEM) 847 - void memblocks_present(void); 848 - #else 849 - static inline void memblocks_present(void) {} 850 - #endif 851 - 852 822 #ifdef CONFIG_HAVE_MEMORYLESS_NODES 853 823 int local_memory_node(int node_id); 854 824 #else ··· 1392 1386 #ifndef early_pfn_valid 1393 1387 #define early_pfn_valid(pfn) (1) 1394 1388 #endif 1395 - 1396 - void memory_present(int nid, unsigned long start, unsigned long end); 1397 1389 1398 1390 /* 1399 1391 * If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
+7 -17
include/linux/pageblock-flags.h
··· 56 56 57 57 unsigned long get_pfnblock_flags_mask(struct page *page, 58 58 unsigned long pfn, 59 - unsigned long end_bitidx, 60 59 unsigned long mask); 61 60 62 61 void set_pfnblock_flags_mask(struct page *page, 63 62 unsigned long flags, 64 63 unsigned long pfn, 65 - unsigned long end_bitidx, 66 64 unsigned long mask); 67 65 68 66 /* Declarations for getting and setting flags. See mm/page_alloc.c */ 69 - #define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \ 70 - get_pfnblock_flags_mask(page, page_to_pfn(page), \ 71 - end_bitidx, \ 72 - (1 << (end_bitidx - start_bitidx + 1)) - 1) 73 - #define set_pageblock_flags_group(page, flags, start_bitidx, end_bitidx) \ 74 - set_pfnblock_flags_mask(page, flags, page_to_pfn(page), \ 75 - end_bitidx, \ 76 - (1 << (end_bitidx - start_bitidx + 1)) - 1) 77 - 78 67 #ifdef CONFIG_COMPACTION 79 68 #define get_pageblock_skip(page) \ 80 - get_pageblock_flags_group(page, PB_migrate_skip, \ 81 - PB_migrate_skip) 69 + get_pfnblock_flags_mask(page, page_to_pfn(page), \ 70 + (1 << (PB_migrate_skip))) 82 71 #define clear_pageblock_skip(page) \ 83 - set_pageblock_flags_group(page, 0, PB_migrate_skip, \ 84 - PB_migrate_skip) 72 + set_pfnblock_flags_mask(page, 0, page_to_pfn(page), \ 73 + (1 << PB_migrate_skip)) 85 74 #define set_pageblock_skip(page) \ 86 - set_pageblock_flags_group(page, 1, PB_migrate_skip, \ 87 - PB_migrate_skip) 75 + set_pfnblock_flags_mask(page, (1 << PB_migrate_skip), \ 76 + page_to_pfn(page), \ 77 + (1 << PB_migrate_skip)) 88 78 #else 89 79 static inline bool get_pageblock_skip(struct page *page) 90 80 {
+4
include/linux/percpu_counter.h
··· 44 44 s32 batch); 45 45 s64 __percpu_counter_sum(struct percpu_counter *fbc); 46 46 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch); 47 + void percpu_counter_sync(struct percpu_counter *fbc); 47 48 48 49 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs) 49 50 { ··· 173 172 return true; 174 173 } 175 174 175 + static inline void percpu_counter_sync(struct percpu_counter *fbc) 176 + { 177 + } 176 178 #endif /* CONFIG_SMP */ 177 179 178 180 static inline void percpu_counter_inc(struct percpu_counter *fbc)
+1 -7
include/linux/sched/mm.h
··· 175 175 * Applies per-task gfp context to the given allocation flags. 176 176 * PF_MEMALLOC_NOIO implies GFP_NOIO 177 177 * PF_MEMALLOC_NOFS implies GFP_NOFS 178 - * PF_MEMALLOC_NOCMA implies no allocation from CMA region. 179 178 */ 180 179 static inline gfp_t current_gfp_context(gfp_t flags) 181 180 { 182 - if (unlikely(current->flags & 183 - (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) { 181 + if (unlikely(current->flags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) { 184 182 /* 185 183 * NOIO implies both NOIO and NOFS and it is a weaker context 186 184 * so always make sure it makes precedence ··· 187 189 flags &= ~(__GFP_IO | __GFP_FS); 188 190 else if (current->flags & PF_MEMALLOC_NOFS) 189 191 flags &= ~__GFP_FS; 190 - #ifdef CONFIG_CMA 191 - if (current->flags & PF_MEMALLOC_NOCMA) 192 - flags &= ~__GFP_MOVABLE; 193 - #endif 194 192 } 195 193 return flags; 196 194 }
+3
include/linux/shmem_fs.h
··· 36 36 unsigned char huge; /* Whether to try for hugepages */ 37 37 kuid_t uid; /* Mount uid for root directory */ 38 38 kgid_t gid; /* Mount gid for root directory */ 39 + bool full_inums; /* If i_ino should be uint or ino_t */ 40 + ino_t next_ino; /* The next per-sb inode number to use */ 41 + ino_t __percpu *ino_batch; /* The next per-cpu inode number to use */ 39 42 struct mempolicy *mpol; /* default memory policy for mappings */ 40 43 spinlock_t shrinklist_lock; /* Protects shrinklist */ 41 44 struct list_head shrinklist; /* List of shrinkable inodes */
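The new `next_ino`/`ino_batch` fields implement per-cpu batched inode number allocation: each CPU takes a batch of numbers from the shared counter at once, then hands them out locally without contention. A simplified single-threaded model of that scheme (names and batch size are illustrative; the kernel takes the batch under a lock and uses per-cpu variables):

```c
#include <assert.h>

#define INO_BATCH 8	/* hypothetical batch size for the sketch */

struct sb_state { unsigned long next_ino; };	/* shared, per-superblock */
struct cpu_cache { unsigned long ino; unsigned long left; };	/* per-cpu */

static unsigned long alloc_ino(struct sb_state *sb, struct cpu_cache *cc)
{
	if (cc->left == 0) {
		/* slow path: refill from the shared counter (locked in real code) */
		cc->ino = sb->next_ino;
		sb->next_ino += INO_BATCH;
		cc->left = INO_BATCH;
	}
	/* fast path: hand out a number from the local batch */
	cc->left--;
	return cc->ino++;
}
```

The trade-off is that inode numbers are no longer dense across CPUs, but the shared counter is touched only once per batch.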
+3 -6
include/linux/slab.h
··· 155 155 void kmem_cache_destroy(struct kmem_cache *); 156 156 int kmem_cache_shrink(struct kmem_cache *); 157 157 158 - void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *); 159 - void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *); 160 - 161 158 /* 162 159 * Please use this macro to create slab caches. Simply specify the 163 160 * name of the structure and maybe some flags that are listed above. ··· 183 186 */ 184 187 void * __must_check krealloc(const void *, size_t, gfp_t); 185 188 void kfree(const void *); 186 - void kzfree(const void *); 189 + void kfree_sensitive(const void *); 187 190 size_t __ksize(const void *); 188 191 size_t ksize(const void *); 192 + 193 + #define kzfree(x) kfree_sensitive(x) /* For backward compatibility */ 189 194 190 195 #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR 191 196 void __check_heap_object(const void *ptr, unsigned long n, struct page *page, ··· 576 577 #endif 577 578 return __kmalloc_node(size, flags, node); 578 579 } 579 - 580 - int memcg_update_all_caches(int num_memcgs); 581 580 582 581 /** 583 582 * kmalloc_array - allocate memory for an array.
+6 -3
include/linux/slab_def.h
··· 72 72 int obj_offset; 73 73 #endif /* CONFIG_DEBUG_SLAB */ 74 74 75 - #ifdef CONFIG_MEMCG 76 - struct memcg_cache_params memcg_params; 77 - #endif 78 75 #ifdef CONFIG_KASAN 79 76 struct kasan_cache kasan_info; 80 77 #endif ··· 109 112 { 110 113 u32 offset = (obj - page->s_mem); 111 114 return reciprocal_divide(offset, cache->reciprocal_buffer_size); 115 + } 116 + 117 + static inline int objs_per_slab_page(const struct kmem_cache *cache, 118 + const struct page *page) 119 + { 120 + return cache->num; 112 121 } 113 122 114 123 #endif /* _LINUX_SLAB_DEF_H */
+21 -10
include/linux/slub_def.h
··· 8 8 * (C) 2007 SGI, Christoph Lameter 9 9 */ 10 10 #include <linux/kobject.h> 11 + #include <linux/reciprocal_div.h> 11 12 12 13 enum stat_item { 13 14 ALLOC_FASTPATH, /* Allocation from cpu slab */ ··· 87 86 unsigned long min_partial; 88 87 unsigned int size; /* The size of an object including metadata */ 89 88 unsigned int object_size;/* The size of an object without metadata */ 89 + struct reciprocal_value reciprocal_size; 90 90 unsigned int offset; /* Free pointer offset */ 91 91 #ifdef CONFIG_SLUB_CPU_PARTIAL 92 92 /* Number of per cpu partial objects to keep around */ ··· 108 106 struct list_head list; /* List of slab caches */ 109 107 #ifdef CONFIG_SYSFS 110 108 struct kobject kobj; /* For sysfs */ 111 - struct work_struct kobj_remove_work; 112 109 #endif 113 - #ifdef CONFIG_MEMCG 114 - struct memcg_cache_params memcg_params; 115 - /* For propagation, maximum size of a stored attr */ 116 - unsigned int max_attr_size; 117 - #ifdef CONFIG_SYSFS 118 - struct kset *memcg_kset; 119 - #endif 120 - #endif 121 - 122 110 #ifdef CONFIG_SLAB_FREELIST_HARDENED 123 111 unsigned long random; 124 112 #endif ··· 174 182 return result; 175 183 } 176 184 185 + /* Determine object index from a given position */ 186 + static inline unsigned int __obj_to_index(const struct kmem_cache *cache, 187 + void *addr, void *obj) 188 + { 189 + return reciprocal_divide(kasan_reset_tag(obj) - addr, 190 + cache->reciprocal_size); 191 + } 192 + 193 + static inline unsigned int obj_to_index(const struct kmem_cache *cache, 194 + const struct page *page, void *obj) 195 + { 196 + return __obj_to_index(cache, page_address(page), obj); 197 + } 198 + 199 + static inline int objs_per_slab_page(const struct kmem_cache *cache, 200 + const struct page *page) 201 + { 202 + return page->objects; 203 + } 177 204 #endif /* _LINUX_SLUB_DEF_H */
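The new `__obj_to_index()` relies on `reciprocal_divide()`: since the object size is fixed at cache creation, a reciprocal can be precomputed once and each index lookup becomes a multiply plus a shift instead of a hardware division. A naive sketch of the idea, using a simple ceil(2^32/d) reciprocal that is exact for the small byte offsets occurring inside a slab page (the kernel's `reciprocal_value()` in `<linux/reciprocal_div.h>` uses a more careful construction valid over the full u32 range):

```c
#include <assert.h>
#include <stdint.h>

struct recip { uint32_t m; };

/* precompute once per divisor, e.g. at kmem_cache creation time */
static struct recip recip_value(uint32_t d)
{
	struct recip r = { (uint32_t)(((1ULL << 32) + d - 1) / d) };
	return r;
}

/* offset / d  becomes  (offset * m) >> 32 */
static uint32_t recip_divide(uint32_t offset, struct recip r)
{
	return (uint32_t)(((uint64_t)offset * r.m) >> 32);
}
```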
-2
include/linux/swap.h
··· 328 328 /* linux/mm/page_alloc.c */ 329 329 extern unsigned long totalreserve_pages; 330 330 extern unsigned long nr_free_buffer_pages(void); 331 - extern unsigned long nr_free_pagecache_pages(void); 332 331 333 332 /* Definition of global_zone_page_state not available yet */ 334 333 #define nr_free_pages() global_zone_page_state(NR_FREE_PAGES) ··· 371 372 extern unsigned long shrink_all_memory(unsigned long nr_pages); 372 373 extern int vm_swappiness; 373 374 extern int remove_mapping(struct address_space *mapping, struct page *page); 374 - extern unsigned long vm_total_pages; 375 375 376 376 extern unsigned long reclaim_pages(struct list_head *page_list); 377 377 #ifdef CONFIG_NUMA
+13 -1
include/linux/vmstat.h
··· 8 8 #include <linux/vm_event_item.h> 9 9 #include <linux/atomic.h> 10 10 #include <linux/static_key.h> 11 + #include <linux/mmdebug.h> 11 12 12 13 extern int sysctl_stat_interval; 13 14 ··· 193 192 return x; 194 193 } 195 194 196 - static inline unsigned long global_node_page_state(enum node_stat_item item) 195 + static inline 196 + unsigned long global_node_page_state_pages(enum node_stat_item item) 197 197 { 198 198 long x = atomic_long_read(&vm_node_stat[item]); 199 199 #ifdef CONFIG_SMP ··· 202 200 x = 0; 203 201 #endif 204 202 return x; 203 + } 204 + 205 + static inline unsigned long global_node_page_state(enum node_stat_item item) 206 + { 207 + VM_WARN_ON_ONCE(vmstat_item_in_bytes(item)); 208 + 209 + return global_node_page_state_pages(item); 205 210 } 206 211 207 212 static inline unsigned long zone_page_state(struct zone *zone, ··· 251 242 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item); 252 243 extern unsigned long node_page_state(struct pglist_data *pgdat, 253 244 enum node_stat_item item); 245 + extern unsigned long node_page_state_pages(struct pglist_data *pgdat, 246 + enum node_stat_item item); 254 247 #else 255 248 #define sum_zone_node_page_state(node, item) global_zone_page_state(item) 256 249 #define node_page_state(node, item) global_node_page_state(item) 250 + #define node_page_state_pages(node, item) global_node_page_state_pages(item) 257 251 #endif /* CONFIG_NUMA */ 258 252 259 253 #ifdef CONFIG_SMP
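The `_pages` variants exist because some vmstat items are now byte-precise: per the `vmstat_item_in_bytes()` comment earlier in the series, byte items take deltas in multiples of PAGE_SIZE through the update interface but are stored internally in pages. A miniature model of that convention (item names and storage are illustrative, not the kernel's arrays):

```c
#include <assert.h>

#define PAGE_SIZE 4096L

enum item { NR_FILE_PAGES, NR_SLAB_RECLAIMABLE_B, NR_ITEMS };

static long vm_stat[NR_ITEMS];

static int item_in_bytes(enum item i)
{
	return i == NR_SLAB_RECLAIMABLE_B;
}

static void mod_state(enum item i, long delta)
{
	if (item_in_bytes(i)) {
		/* byte items: changes are expected multiples of PAGE_SIZE */
		assert(delta % PAGE_SIZE == 0);
		delta /= PAGE_SIZE;	/* stored internally in pages */
	}
	vm_stat[i] += delta;
}

static unsigned long state_pages(enum item i)
{
	long x = vm_stat[i];

	return x < 0 ? 0 : (unsigned long)x;	/* clamp, as under SMP drift */
}
```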
+5 -4
init/Kconfig
··· 1913 1913 command line. 1914 1914 1915 1915 config SLAB_FREELIST_RANDOM 1916 - default n 1916 + bool "Randomize slab freelist" 1917 1917 depends on SLAB || SLUB 1918 - bool "SLAB freelist randomization" 1919 1918 help 1920 1919 Randomizes the freelist order used when creating new pages. This 1921 1920 security feature reduces the predictability of the kernel slab ··· 1922 1923 1923 1924 config SLAB_FREELIST_HARDENED 1924 1925 bool "Harden slab freelist metadata" 1925 - depends on SLUB 1926 + depends on SLAB || SLUB 1926 1927 help 1927 1928 Many kernel heap attacks try to target slab cache metadata and 1928 1929 other infrastructure. This option makes minor performance 1929 1930 sacrifices to harden the kernel slab allocator against common 1930 - freelist exploit methods. 1931 + freelist exploit methods. Some slab implementations have more 1932 + sanity-checking than others. This option is most effective with 1933 + CONFIG_SLUB. 1931 1934 1932 1935 config SHUFFLE_PAGE_ALLOCATOR 1933 1936 bool "Page allocator randomization"
+1 -1
init/main.c
··· 830 830 rest_init(); 831 831 } 832 832 833 - asmlinkage __visible void __init start_kernel(void) 833 + asmlinkage __visible void __init __no_sanitize_address start_kernel(void) 834 834 { 835 835 char *command_line; 836 836 char *after_dashes;
+1 -1
ipc/shm.c
··· 1558 1558 goto invalid; 1559 1559 } 1560 1560 1561 - addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL); 1561 + addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL); 1562 1562 *raddr = addr; 1563 1563 err = 0; 1564 1564 if (IS_ERR_VALUE(addr))
+16 -38
kernel/fork.c
··· 261 261 THREAD_SIZE_ORDER); 262 262 263 263 if (likely(page)) { 264 - tsk->stack = page_address(page); 264 + tsk->stack = kasan_reset_tag(page_address(page)); 265 265 return tsk->stack; 266 266 } 267 267 return NULL; ··· 276 276 if (vm) { 277 277 int i; 278 278 279 - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) { 280 - mod_memcg_page_state(vm->pages[i], 281 - MEMCG_KERNEL_STACK_KB, 282 - -(int)(PAGE_SIZE / 1024)); 283 - 279 + for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) 284 280 memcg_kmem_uncharge_page(vm->pages[i], 0); 285 - } 286 281 287 282 for (i = 0; i < NR_CACHED_STACKS; i++) { 288 283 if (this_cpu_cmpxchg(cached_stacks[i], ··· 302 307 { 303 308 unsigned long *stack; 304 309 stack = kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node); 310 + stack = kasan_reset_tag(stack); 305 311 tsk->stack = stack; 306 312 return stack; 307 313 } ··· 378 382 void *stack = task_stack_page(tsk); 379 383 struct vm_struct *vm = task_stack_vm_area(tsk); 380 384 381 - BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0); 382 385 383 - if (vm) { 384 - int i; 385 - 386 - BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE); 387 - 388 - for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) { 389 - mod_zone_page_state(page_zone(vm->pages[i]), 390 - NR_KERNEL_STACK_KB, 391 - PAGE_SIZE / 1024 * account); 392 - } 393 - } else { 394 - /* 395 - * All stack pages are in the same zone and belong to the 396 - * same memcg. 397 - */ 398 - struct page *first_page = virt_to_page(stack); 399 - 400 - mod_zone_page_state(page_zone(first_page), NR_KERNEL_STACK_KB, 401 - THREAD_SIZE / 1024 * account); 402 - 403 - mod_memcg_obj_state(stack, MEMCG_KERNEL_STACK_KB, 404 - account * (THREAD_SIZE / 1024)); 405 - } 386 + /* All stack pages are in the same node. 
*/ 387 + if (vm) 388 + mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB, 389 + account * (THREAD_SIZE / 1024)); 390 + else 391 + mod_lruvec_slab_state(stack, NR_KERNEL_STACK_KB, 392 + account * (THREAD_SIZE / 1024)); 406 393 } 407 394 408 395 static int memcg_charge_kernel_stack(struct task_struct *tsk) ··· 394 415 struct vm_struct *vm = task_stack_vm_area(tsk); 395 416 int ret; 396 417 418 + BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0); 419 + 397 420 if (vm) { 398 421 int i; 422 + 423 + BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE); 399 424 400 425 for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) { 401 426 /* 402 427 * If memcg_kmem_charge_page() fails, page->mem_cgroup 403 - * pointer is NULL, and both memcg_kmem_uncharge_page() 404 - * and mod_memcg_page_state() in free_thread_stack() 405 - * will ignore this page. So it's safe. 428 + * pointer is NULL, and memcg_kmem_uncharge_page() in 429 + * free_thread_stack() will ignore this page. 406 430 */ 407 431 ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, 408 432 0); 409 433 if (ret) 410 434 return ret; 411 - 412 - mod_memcg_page_state(vm->pages[i], 413 - MEMCG_KERNEL_STACK_KB, 414 - PAGE_SIZE / 1024); 415 435 } 416 436 } 417 437 #endif
+6 -2
kernel/kthread.c
··· 480 480 * to "name.*%u". Code fills in cpu number. 481 481 * 482 482 * Description: This helper function creates and names a kernel thread 483 - * The thread will be woken and put into park mode. 484 483 */ 485 484 struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data), 486 485 void *data, unsigned int cpu, ··· 1240 1241 WARN_ON_ONCE(tsk->mm); 1241 1242 1242 1243 task_lock(tsk); 1244 + /* Hold off tlb flush IPIs while switching mm's */ 1245 + local_irq_disable(); 1243 1246 active_mm = tsk->active_mm; 1244 1247 if (active_mm != mm) { 1245 1248 mmgrab(mm); 1246 1249 tsk->active_mm = mm; 1247 1250 } 1248 1251 tsk->mm = mm; 1249 - switch_mm(active_mm, mm, tsk); 1252 + switch_mm_irqs_off(active_mm, mm, tsk); 1253 + local_irq_enable(); 1250 1254 task_unlock(tsk); 1251 1255 #ifdef finish_arch_post_lock_switch 1252 1256 finish_arch_post_lock_switch(); ··· 1278 1276 1279 1277 task_lock(tsk); 1280 1278 sync_mm_rss(mm); 1279 + local_irq_disable(); 1281 1280 tsk->mm = NULL; 1282 1281 /* active_mm is still 'mm' */ 1283 1282 enter_lazy_tlb(mm, tsk); 1283 + local_irq_enable(); 1284 1284 task_unlock(tsk); 1285 1285 } 1286 1286 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
+1 -1
kernel/power/snapshot.c
··· 1663 1663 { 1664 1664 unsigned long size; 1665 1665 1666 - size = global_node_page_state(NR_SLAB_RECLAIMABLE) 1666 + size = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) 1667 1667 + global_node_page_state(NR_ACTIVE_ANON) 1668 1668 + global_node_page_state(NR_INACTIVE_ANON) 1669 1669 + global_node_page_state(NR_ACTIVE_FILE)
+2
kernel/rcu/tree.c
··· 59 59 #include <linux/sched/clock.h> 60 60 #include <linux/vmalloc.h> 61 61 #include <linux/mm.h> 62 + #include <linux/kasan.h> 62 63 #include "../time/tick-internal.h" 63 64 64 65 #include "tree.h" ··· 2891 2890 head->func = func; 2892 2891 head->next = NULL; 2893 2892 local_irq_save(flags); 2893 + kasan_record_aux_stack(head); 2894 2894 rdp = this_cpu_ptr(&rcu_data); 2895 2895 2896 2896 /* Add the callback to our list. */
+1 -1
kernel/scs.c
··· 17 17 { 18 18 struct page *scs_page = virt_to_page(s); 19 19 20 - mod_zone_page_state(page_zone(scs_page), NR_KERNEL_SCS_KB, 20 + mod_node_page_state(page_pgdat(scs_page), NR_KERNEL_SCS_KB, 21 21 account * (SCS_SIZE / SZ_1K)); 22 22 } 23 23
+1 -1
kernel/sysctl.c
··· 2671 2671 .data = &sysctl_overcommit_memory, 2672 2672 .maxlen = sizeof(sysctl_overcommit_memory), 2673 2673 .mode = 0644, 2674 - .proc_handler = proc_dointvec_minmax, 2674 + .proc_handler = overcommit_policy_handler, 2675 2675 .extra1 = SYSCTL_ZERO, 2676 2676 .extra2 = &two, 2677 2677 },
+23 -16
lib/Kconfig.kasan
··· 18 18 config CC_HAS_WORKING_NOSANITIZE_ADDRESS 19 19 def_bool !CC_IS_GCC || GCC_VERSION >= 80300 20 20 21 - config KASAN 21 + menuconfig KASAN 22 22 bool "KASAN: runtime memory debugger" 23 23 depends on (HAVE_ARCH_KASAN && CC_HAS_KASAN_GENERIC) || \ 24 24 (HAVE_ARCH_KASAN_SW_TAGS && CC_HAS_KASAN_SW_TAGS) ··· 29 29 designed to find out-of-bounds accesses and use-after-free bugs. 30 30 See Documentation/dev-tools/kasan.rst for details. 31 31 32 + if KASAN 33 + 32 34 choice 33 35 prompt "KASAN mode" 34 - depends on KASAN 35 36 default KASAN_GENERIC 36 37 help 37 38 KASAN has two modes: generic KASAN (similar to userspace ASan, ··· 40 39 software tag-based KASAN (a version based on software memory 41 40 tagging, arm64 only, similar to userspace HWASan, enabled with 42 41 CONFIG_KASAN_SW_TAGS). 42 + 43 43 Both generic and tag-based KASAN are strictly debugging features. 44 44 45 45 config KASAN_GENERIC ··· 52 50 select STACKDEPOT 53 51 help 54 52 Enables generic KASAN mode. 55 - Supported in both GCC and Clang. With GCC it requires version 4.9.2 56 - or later for basic support and version 5.0 or later for detection of 57 - out-of-bounds accesses for stack and global variables and for inline 58 - instrumentation mode (CONFIG_KASAN_INLINE). With Clang it requires 59 - version 3.7.0 or later and it doesn't support detection of 60 - out-of-bounds accesses for global variables yet. 53 + 54 + This mode is supported in both GCC and Clang. With GCC it requires 55 + version 8.3.0 or later. With Clang it requires version 7.0.0 or 56 + later, but detection of out-of-bounds accesses for global variables 57 + is supported only since Clang 11. 58 + 61 59 This mode consumes about 1/8th of available memory at kernel start 62 60 and introduces an overhead of ~x1.5 for the rest of the allocations. 63 61 The performance slowdown is ~x3. 62 + 64 63 For better error detection enable CONFIG_STACKTRACE. 
64 + 65 65 Currently CONFIG_KASAN_GENERIC doesn't work with CONFIG_DEBUG_SLAB 66 66 (the resulting kernel does not boot). 67 67 ··· 76 72 select STACKDEPOT 77 73 help 78 74 Enables software tag-based KASAN mode. 75 + 79 76 This mode requires Top Byte Ignore support by the CPU and therefore 80 - is only supported for arm64. 81 - This mode requires Clang version 7.0.0 or later. 77 + is only supported for arm64. This mode requires Clang version 7.0.0 78 + or later. 79 + 82 80 This mode consumes about 1/16th of available memory at kernel start 83 81 and introduces an overhead of ~20% for the rest of the allocations. 84 82 This mode may potentially introduce problems relating to pointer 85 83 casting and comparison, as it embeds tags into the top byte of each 86 84 pointer. 85 + 87 86 For better error detection enable CONFIG_STACKTRACE. 87 + 88 88 Currently CONFIG_KASAN_SW_TAGS doesn't work with CONFIG_DEBUG_SLAB 89 89 (the resulting kernel does not boot). 90 90 ··· 96 88 97 89 choice 98 90 prompt "Instrumentation type" 99 - depends on KASAN 100 91 default KASAN_OUTLINE 101 92 102 93 config KASAN_OUTLINE ··· 114 107 memory accesses. This is faster than outline (in some workloads 115 108 it gives about x2 boost over outline instrumentation), but 116 109 makes the kernel's .text size much bigger. 117 - For CONFIG_KASAN_GENERIC this requires GCC 5.0 or later. 118 110 119 111 endchoice 120 112 121 113 config KASAN_STACK_ENABLE 122 114 bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && !COMPILE_TEST 123 - depends on KASAN 124 115 help 125 116 The LLVM stack address sanitizer has a known problem that 126 117 causes excessive stack usage in a lot of functions, see ··· 139 134 140 135 config KASAN_S390_4_LEVEL_PAGING 141 136 bool "KASan: use 4-level paging" 142 - depends on KASAN && S390 137 + depends on S390 143 138 help 144 139 Compiling the kernel with KASan disables automatic 3-level vs 145 140 4-level paging selection.
3-level paging is used by default (up ··· 156 151 157 152 config KASAN_VMALLOC 158 153 bool "Back mappings in vmalloc space with real shadow memory" 159 - depends on KASAN && HAVE_ARCH_KASAN_VMALLOC 154 + depends on HAVE_ARCH_KASAN_VMALLOC 160 155 help 161 156 By default, the shadow region for vmalloc space is the read-only 162 157 zero page. This means that KASAN cannot detect errors involving ··· 169 164 170 165 config TEST_KASAN 171 166 tristate "Module for testing KASAN for bug detection" 172 - depends on m && KASAN 167 + depends on m 173 168 help 174 169 This is a test module doing various nasty things like 175 170 out of bounds accesses, use after free. It is useful for testing 176 171 kernel debugging features like KASAN. 172 + 173 + endif # KASAN
-1
lib/Makefile
··· 37 37 nmi_backtrace.o nodemask.o win_minmax.o memcat_p.o 38 38 39 39 lib-$(CONFIG_PRINTK) += dump_stack.o 40 - lib-$(CONFIG_MMU) += ioremap.o 41 40 lib-$(CONFIG_SMP) += cpumask.o 42 41 43 42 lib-y += kobject.o klist.o
+2
lib/ioremap.c mm/ioremap.c
··· 13 13 #include <linux/export.h> 14 14 #include <asm/cacheflush.h> 15 15 16 + #include "pgalloc-track.h" 17 + 16 18 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 17 19 static int __read_mostly ioremap_p4d_capable; 18 20 static int __read_mostly ioremap_pud_capable;
+3 -3
lib/mpi/mpiutil.c
··· 69 69 if (!a) 70 70 return; 71 71 72 - kzfree(a); 72 + kfree_sensitive(a); 73 73 } 74 74 75 75 void mpi_assign_limb_space(MPI a, mpi_ptr_t ap, unsigned nlimbs) ··· 95 95 if (!p) 96 96 return -ENOMEM; 97 97 memcpy(p, a->d, a->alloced * sizeof(mpi_limb_t)); 98 - kzfree(a->d); 98 + kfree_sensitive(a->d); 99 99 a->d = p; 100 100 } else { 101 101 a->d = kcalloc(nlimbs, sizeof(mpi_limb_t), GFP_KERNEL); ··· 112 112 return; 113 113 114 114 if (a->flags & 4) 115 - kzfree(a->d); 115 + kfree_sensitive(a->d); 116 116 else 117 117 mpi_free_limb_space(a->d); 118 118
+19
lib/percpu_counter.c
··· 99 99 EXPORT_SYMBOL(percpu_counter_add_batch); 100 100 101 101 /* 102 + * For a percpu_counter with a big batch, the deviation of its count could 103 + * be big, and there is a requirement to reduce the deviation, like when the 104 + * counter's batch could be runtime decreased to get a better accuracy, 105 + * which can be achieved by running this sync function on each CPU. 106 + */ 107 + void percpu_counter_sync(struct percpu_counter *fbc) 108 + { 109 + unsigned long flags; 110 + s64 count; 111 + 112 + raw_spin_lock_irqsave(&fbc->lock, flags); 113 + count = __this_cpu_read(*fbc->counters); 114 + fbc->count += count; 115 + __this_cpu_sub(*fbc->counters, count); 116 + raw_spin_unlock_irqrestore(&fbc->lock, flags); 117 + } 118 + EXPORT_SYMBOL(percpu_counter_sync); 119 + 120 + /* 102 121 * Add up all the per-cpu counts, return the result. This is a more accurate 103 122 * but much slower version of percpu_counter_read_positive() 104 123 */
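The structure of `percpu_counter_sync()` is easiest to see in a single-threaded model: fast-path updates only touch a per-cpu delta, and sync folds one CPU's delta into the shared count, as the new function does under `fbc->lock` with IRQs off. A sketch (illustrative types, no locking):

```c
#include <assert.h>

#define NR_CPUS 4

struct pc {
	long count;		/* shared, approximate until synced */
	long cpu[NR_CPUS];	/* per-cpu deltas */
};

/* fast path: no shared-state contention */
static void pc_add(struct pc *c, int cpu, long delta)
{
	c->cpu[cpu] += delta;
}

/* fold one CPU's delta into the shared count, like percpu_counter_sync() */
static void pc_sync(struct pc *c, int cpu)
{
	c->count += c->cpu[cpu];
	c->cpu[cpu] = 0;
}
```

In the real counter, `percpu_counter_add_batch()` also folds automatically once a delta exceeds the batch; the new sync call exists for when the batch is shrunk at runtime and the already-accumulated deltas must be flushed.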
+66 -21
lib/test_kasan.c
··· 23 23 24 24 #include <asm/page.h> 25 25 26 + #include "../mm/kasan/kasan.h" 27 + 28 + #define OOB_TAG_OFF (IS_ENABLED(CONFIG_KASAN_GENERIC) ? 0 : KASAN_SHADOW_SCALE_SIZE) 29 + 26 30 /* 27 31 * We assign some test results to these globals to make sure the tests 28 32 * are not eliminated as dead code. ··· 52 48 return; 53 49 } 54 50 55 - ptr[size] = 'x'; 51 + ptr[size + OOB_TAG_OFF] = 'x'; 52 + 56 53 kfree(ptr); 57 54 } 58 55 ··· 105 100 return; 106 101 } 107 102 108 - ptr[size] = 0; 103 + ptr[size + OOB_TAG_OFF] = 0; 104 + 109 105 kfree(ptr); 110 106 } 111 107 ··· 176 170 return; 177 171 } 178 172 179 - ptr2[size2] = 'x'; 173 + ptr2[size2 + OOB_TAG_OFF] = 'x'; 174 + 180 175 kfree(ptr2); 181 176 } 182 177 ··· 195 188 kfree(ptr1); 196 189 return; 197 190 } 198 - ptr2[size2] = 'x'; 191 + 192 + ptr2[size2 + OOB_TAG_OFF] = 'x'; 193 + 199 194 kfree(ptr2); 200 195 } 201 196 ··· 233 224 return; 234 225 } 235 226 236 - memset(ptr+7, 0, 2); 227 + memset(ptr + 7 + OOB_TAG_OFF, 0, 2); 228 + 237 229 kfree(ptr); 238 230 } 239 231 ··· 250 240 return; 251 241 } 252 242 253 - memset(ptr+5, 0, 4); 243 + memset(ptr + 5 + OOB_TAG_OFF, 0, 4); 244 + 254 245 kfree(ptr); 255 246 } 256 247 ··· 268 257 return; 269 258 } 270 259 271 - memset(ptr+1, 0, 8); 260 + memset(ptr + 1 + OOB_TAG_OFF, 0, 8); 261 + 272 262 kfree(ptr); 273 263 } 274 264 ··· 285 273 return; 286 274 } 287 275 288 - memset(ptr+1, 0, 16); 276 + memset(ptr + 1 + OOB_TAG_OFF, 0, 16); 277 + 289 278 kfree(ptr); 290 279 } 291 280 ··· 302 289 return; 303 290 } 304 291 305 - memset(ptr, 0, size+5); 292 + memset(ptr, 0, size + 5 + OOB_TAG_OFF); 293 + 306 294 kfree(ptr); 307 295 } 308 296 ··· 437 423 return; 438 424 } 439 425 440 - *p = p[size]; 426 + *p = p[size + OOB_TAG_OFF]; 427 + 441 428 kmem_cache_free(cache, p); 442 429 kmem_cache_destroy(cache); 443 430 } ··· 488 473 static noinline void __init kasan_stack_oob(void) 489 474 { 490 475 char stack_array[10]; 491 - volatile int i = 0; 476 + volatile int i = OOB_TAG_OFF; 492 
477 char *p = &stack_array[ARRAY_SIZE(stack_array) + i]; 493 478 494 479 pr_info("out-of-bounds on stack\n"); ··· 535 520 } 536 521 537 522 pr_info("out-of-bounds in copy_from_user()\n"); 538 - unused = copy_from_user(kmem, usermem, size + 1); 523 + unused = copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); 539 524 540 525 pr_info("out-of-bounds in copy_to_user()\n"); 541 - unused = copy_to_user(usermem, kmem, size + 1); 526 + unused = copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); 542 527 543 528 pr_info("out-of-bounds in __copy_from_user()\n"); 544 - unused = __copy_from_user(kmem, usermem, size + 1); 529 + unused = __copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); 545 530 546 531 pr_info("out-of-bounds in __copy_to_user()\n"); 547 - unused = __copy_to_user(usermem, kmem, size + 1); 532 + unused = __copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF); 548 533 549 534 pr_info("out-of-bounds in __copy_from_user_inatomic()\n"); 550 - unused = __copy_from_user_inatomic(kmem, usermem, size + 1); 535 + unused = __copy_from_user_inatomic(kmem, usermem, size + 1 + OOB_TAG_OFF); 551 536 552 537 pr_info("out-of-bounds in __copy_to_user_inatomic()\n"); 553 - unused = __copy_to_user_inatomic(usermem, kmem, size + 1); 538 + unused = __copy_to_user_inatomic(usermem, kmem, size + 1 + OOB_TAG_OFF); 554 539 555 540 pr_info("out-of-bounds in strncpy_from_user()\n"); 556 - unused = strncpy_from_user(kmem, usermem, size + 1); 541 + unused = strncpy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF); 557 542 558 543 vm_munmap((unsigned long)usermem, PAGE_SIZE); 559 544 kfree(kmem); ··· 781 766 char *ptr; 782 767 size_t size = 16; 783 768 784 - pr_info("double-free (kzfree)\n"); 769 + pr_info("double-free (kfree_sensitive)\n"); 785 770 ptr = kmalloc(size, GFP_KERNEL); 786 771 if (!ptr) { 787 772 pr_err("Allocation failed\n"); 788 773 return; 789 774 } 790 775 791 - kzfree(ptr); 792 - kzfree(ptr); 776 + kfree_sensitive(ptr); 777 + kfree_sensitive(ptr); 793 778 } 794 779 
795 780 #ifdef CONFIG_KASAN_VMALLOC ··· 815 800 #else 816 801 static void __init vmalloc_oob(void) {} 817 802 #endif 803 + 804 + static struct kasan_rcu_info { 805 + int i; 806 + struct rcu_head rcu; 807 + } *global_rcu_ptr; 808 + 809 + static noinline void __init kasan_rcu_reclaim(struct rcu_head *rp) 810 + { 811 + struct kasan_rcu_info *fp = container_of(rp, 812 + struct kasan_rcu_info, rcu); 813 + 814 + kfree(fp); 815 + fp->i = 1; 816 + } 817 + 818 + static noinline void __init kasan_rcu_uaf(void) 819 + { 820 + struct kasan_rcu_info *ptr; 821 + 822 + pr_info("use-after-free in kasan_rcu_reclaim\n"); 823 + ptr = kmalloc(sizeof(struct kasan_rcu_info), GFP_KERNEL); 824 + if (!ptr) { 825 + pr_err("Allocation failed\n"); 826 + return; 827 + } 828 + 829 + global_rcu_ptr = rcu_dereference_protected(ptr, NULL); 830 + call_rcu(&global_rcu_ptr->rcu, kasan_rcu_reclaim); 831 + } 818 832 819 833 static int __init kmalloc_tests_init(void) 820 834 { ··· 892 848 kasan_bitops(); 893 849 kmalloc_double_kzfree(); 894 850 vmalloc_oob(); 851 + kasan_rcu_uaf(); 895 852 896 853 kasan_restore_multi_shot(multishot); 897 854
+1 -5
mm/Kconfig
··· 88 88 def_bool y 89 89 depends on DISCONTIGMEM || NUMA 90 90 91 - config HAVE_MEMORY_PRESENT 92 - def_bool y 93 - depends on ARCH_HAVE_MEMORY_PRESENT || SPARSEMEM 94 - 95 91 # 96 92 # SPARSEMEM_EXTREME (which is the default) does some bootmem 97 - # allocations when memory_present() is called. If this cannot 93 + # allocations when sparse_init() is called. If this cannot 98 94 # be done on your architecture, select this option. However, 99 95 # statically allocating the mem_section[] array can potentially 100 96 # consume vast quantities of .bss, so be careful.
+1 -1
mm/Makefile
··· 38 38 mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ 39 39 mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ 40 40 msync.o page_vma_mapped.o pagewalk.o \ 41 - pgtable-generic.o rmap.o vmalloc.o 41 + pgtable-generic.o rmap.o vmalloc.o ioremap.o 42 42 43 43 44 44 ifdef CONFIG_CROSS_MEMORY_ATTACH
+42 -41
mm/debug.c
··· 69 69 } 70 70 71 71 if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) { 72 - /* Corrupt page, cannot call page_mapping */ 73 - mapping = page->mapping; 72 + /* 73 + * Corrupt page, so we cannot call page_mapping. Instead, do a 74 + * safe subset of the steps that page_mapping() does. Caution: 75 + * this will be misleading for tail pages, PageSwapCache pages, 76 + * and potentially other situations. (See the page_mapping() 77 + * implementation for what's missing here.) 78 + */ 79 + unsigned long tmp = (unsigned long)page->mapping; 80 + 81 + if (tmp & PAGE_MAPPING_ANON) 82 + mapping = NULL; 83 + else 84 + mapping = (void *)(tmp & ~PAGE_MAPPING_FLAGS); 74 85 head = page; 75 86 compound = false; 76 87 } else { ··· 95 84 */ 96 85 mapcount = PageSlab(head) ? 0 : page_mapcount(page); 97 86 98 - if (compound) 87 + pr_warn("page:%p refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n", 88 + page, page_ref_count(head), mapcount, mapping, 89 + page_to_pgoff(page), page_to_pfn(page)); 90 + if (compound) { 99 91 if (hpage_pincount_available(page)) { 100 - pr_warn("page:%px refcount:%d mapcount:%d mapping:%p " 101 - "index:%#lx head:%px order:%u " 102 - "compound_mapcount:%d compound_pincount:%d\n", 103 - page, page_ref_count(head), mapcount, 104 - mapping, page_to_pgoff(page), head, 105 - compound_order(head), compound_mapcount(page), 106 - compound_pincount(page)); 92 + pr_warn("head:%p order:%u compound_mapcount:%d compound_pincount:%d\n", 93 + head, compound_order(head), 94 + head_mapcount(head), 95 + head_pincount(head)); 107 96 } else { 108 - pr_warn("page:%px refcount:%d mapcount:%d mapping:%p " 109 - "index:%#lx head:%px order:%u " 110 - "compound_mapcount:%d\n", 111 - page, page_ref_count(head), mapcount, 112 - mapping, page_to_pgoff(page), head, 113 - compound_order(head), compound_mapcount(page)); 97 + pr_warn("head:%p order:%u compound_mapcount:%d\n", 98 + head, compound_order(head), 99 + head_mapcount(head)); 114 100 } 115 - else 116 - 
pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n", 117 - page, page_ref_count(page), mapcount, 118 - mapping, page_to_pgoff(page)); 101 + } 119 102 if (PageKsm(page)) 120 103 type = "ksm "; 121 104 else if (PageAnon(page)) 122 105 type = "anon "; 123 106 else if (mapping) { 124 - const struct inode *host; 107 + struct inode *host; 125 108 const struct address_space_operations *a_ops; 126 - const struct hlist_node *dentry_first; 127 - const struct dentry *dentry_ptr; 109 + struct hlist_node *dentry_first; 110 + struct dentry *dentry_ptr; 128 111 struct dentry dentry; 129 112 130 113 /* 131 114 * mapping can be invalid pointer and we don't want to crash 132 115 * accessing it, so probe everything depending on it carefully 133 116 */ 134 - if (copy_from_kernel_nofault(&host, &mapping->host, 135 - sizeof(struct inode *)) || 136 - copy_from_kernel_nofault(&a_ops, &mapping->a_ops, 137 - sizeof(struct address_space_operations *))) { 138 - pr_warn("failed to read mapping->host or a_ops, mapping not a valid kernel address?\n"); 117 + if (get_kernel_nofault(host, &mapping->host) || 118 + get_kernel_nofault(a_ops, &mapping->a_ops)) { 119 + pr_warn("failed to read mapping contents, not a valid kernel address?\n"); 139 120 goto out_mapping; 140 121 } 141 122 142 123 if (!host) { 143 - pr_warn("mapping->a_ops:%ps\n", a_ops); 124 + pr_warn("aops:%ps\n", a_ops); 144 125 goto out_mapping; 145 126 } 146 127 147 - if (copy_from_kernel_nofault(&dentry_first, 148 - &host->i_dentry.first, sizeof(struct hlist_node *))) { 149 - pr_warn("mapping->a_ops:%ps with invalid mapping->host inode address %px\n", 150 - a_ops, host); 128 + if (get_kernel_nofault(dentry_first, &host->i_dentry.first)) { 129 + pr_warn("aops:%ps with invalid host inode %px\n", 130 + a_ops, host); 151 131 goto out_mapping; 152 132 } 153 133 154 134 if (!dentry_first) { 155 - pr_warn("mapping->a_ops:%ps\n", a_ops); 135 + pr_warn("aops:%ps ino:%lx\n", a_ops, host->i_ino); 156 136 goto out_mapping; 157 137 
} 158 138 159 139 dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias); 160 - if (copy_from_kernel_nofault(&dentry, dentry_ptr, 161 - sizeof(struct dentry))) { 162 - pr_warn("mapping->aops:%ps with invalid mapping->host->i_dentry.first %px\n", 163 - a_ops, dentry_ptr); 140 + if (get_kernel_nofault(dentry, dentry_ptr)) { 141 + pr_warn("aops:%ps with invalid dentry %px\n", a_ops, 142 + dentry_ptr); 164 143 } else { 165 144 /* 166 145 * if dentry is corrupted, the %pd handler may still 167 146 * crash, but it's unlikely that we reach here with a 168 147 * corrupted struct page 169 148 */ 170 - pr_warn("mapping->aops:%ps dentry name:\"%pd\"\n", 171 - a_ops, &dentry); 149 + pr_warn("aops:%ps ino:%lx dentry name:\"%pd\"\n", 150 + a_ops, host->i_ino, &dentry); 172 151 } 173 152 } 174 153 out_mapping: 175 154 BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1); 176 155 177 - pr_warn("%sflags: %#lx(%pGp)%s\n", type, page->flags, &page->flags, 156 + pr_warn("%sflags: %#lx(%pGp)%s\n", type, head->flags, &head->flags, 178 157 page_cma ? " CMA" : ""); 179 158 180 159 hex_only:
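The hunk above replaces open-coded `copy_from_kernel_nofault()` calls with `get_kernel_nofault()`, which infers the copy size from the destination, and bails out at every pointer hop instead of chasing a possibly-garbage `mapping`. A minimal userspace sketch of that "probe, then follow" shape, with a stand-in `probe_nofault()` macro (the real kernel macro recovers from page faults via exception tables; here only NULL is rejected, and all struct names are simplified stand-ins):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for get_kernel_nofault(): reject NULL, else copy. Purely
 * illustrative; the real macro survives faulting addresses. */
#define probe_nofault(dst, src) \
	((src) == NULL ? -1 : (memcpy(&(dst), (src), sizeof(dst)), 0))

struct dentry { const char *name; };
struct inode { struct dentry *first_alias; unsigned long i_ino; };
struct address_space { struct inode *host; };

/* Mirrors the bail-out-at-each-hop structure of the new __dump_page():
 * every dereference is probed before the walk continues. */
static const char *dump_mapping(struct address_space *mapping)
{
	struct inode *host;
	struct dentry *alias;

	if (probe_nofault(host, &mapping->host))
		return "bad mapping";
	if (!host)
		return "no host";
	if (probe_nofault(alias, &host->first_alias))
		return "bad host";
	if (!alias)
		return "no alias";
	return alias->name;
}
```

Each early return corresponds to one of the `goto out_mapping` paths in the diff, so a corrupted pointer degrades the dump instead of crashing it.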
+664 -2
mm/debug_vm_pgtable.c
··· 8 8 * 9 9 * Author: Anshuman Khandual <anshuman.khandual@arm.com> 10 10 */ 11 - #define pr_fmt(fmt) "debug_vm_pgtable: %s: " fmt, __func__ 11 + #define pr_fmt(fmt) "debug_vm_pgtable: [%-25s]: " fmt, __func__ 12 12 13 13 #include <linux/gfp.h> 14 14 #include <linux/highmem.h> ··· 21 21 #include <linux/module.h> 22 22 #include <linux/pfn_t.h> 23 23 #include <linux/printk.h> 24 + #include <linux/pgtable.h> 24 25 #include <linux/random.h> 25 26 #include <linux/spinlock.h> 26 27 #include <linux/swap.h> ··· 29 28 #include <linux/start_kernel.h> 30 29 #include <linux/sched/mm.h> 31 30 #include <asm/pgalloc.h> 31 + #include <asm/tlbflush.h> 32 + 33 + /* 34 + * Please refer Documentation/vm/arch_pgtable_helpers.rst for the semantics 35 + * expectations that are being validated here. All future changes in here 36 + * or the documentation need to be in sync. 37 + */ 32 38 33 39 #define VMFLAGS (VM_READ|VM_WRITE|VM_EXEC) 34 40 ··· 54 46 { 55 47 pte_t pte = pfn_pte(pfn, prot); 56 48 49 + pr_debug("Validating PTE basic\n"); 57 50 WARN_ON(!pte_same(pte, pte)); 58 51 WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte)))); 59 52 WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte)))); ··· 64 55 WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte)))); 65 56 } 66 57 58 + static void __init pte_advanced_tests(struct mm_struct *mm, 59 + struct vm_area_struct *vma, pte_t *ptep, 60 + unsigned long pfn, unsigned long vaddr, 61 + pgprot_t prot) 62 + { 63 + pte_t pte = pfn_pte(pfn, prot); 64 + 65 + pr_debug("Validating PTE advanced\n"); 66 + pte = pfn_pte(pfn, prot); 67 + set_pte_at(mm, vaddr, ptep, pte); 68 + ptep_set_wrprotect(mm, vaddr, ptep); 69 + pte = ptep_get(ptep); 70 + WARN_ON(pte_write(pte)); 71 + 72 + pte = pfn_pte(pfn, prot); 73 + set_pte_at(mm, vaddr, ptep, pte); 74 + ptep_get_and_clear(mm, vaddr, ptep); 75 + pte = ptep_get(ptep); 76 + WARN_ON(!pte_none(pte)); 77 + 78 + pte = pfn_pte(pfn, prot); 79 + pte = pte_wrprotect(pte); 80 + pte = pte_mkclean(pte); 81 + set_pte_at(mm, vaddr, ptep, 
pte); 82 + pte = pte_mkwrite(pte); 83 + pte = pte_mkdirty(pte); 84 + ptep_set_access_flags(vma, vaddr, ptep, pte, 1); 85 + pte = ptep_get(ptep); 86 + WARN_ON(!(pte_write(pte) && pte_dirty(pte))); 87 + 88 + pte = pfn_pte(pfn, prot); 89 + set_pte_at(mm, vaddr, ptep, pte); 90 + ptep_get_and_clear_full(mm, vaddr, ptep, 1); 91 + pte = ptep_get(ptep); 92 + WARN_ON(!pte_none(pte)); 93 + 94 + pte = pte_mkyoung(pte); 95 + set_pte_at(mm, vaddr, ptep, pte); 96 + ptep_test_and_clear_young(vma, vaddr, ptep); 97 + pte = ptep_get(ptep); 98 + WARN_ON(pte_young(pte)); 99 + } 100 + 101 + static void __init pte_savedwrite_tests(unsigned long pfn, pgprot_t prot) 102 + { 103 + pte_t pte = pfn_pte(pfn, prot); 104 + 105 + pr_debug("Validating PTE saved write\n"); 106 + WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte)))); 107 + WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte)))); 108 + } 67 109 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 68 110 static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot) 69 111 { ··· 123 63 if (!has_transparent_hugepage()) 124 64 return; 125 65 66 + pr_debug("Validating PMD basic\n"); 126 67 WARN_ON(!pmd_same(pmd, pmd)); 127 68 WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd)))); 128 69 WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd)))); ··· 138 77 WARN_ON(!pmd_bad(pmd_mkhuge(pmd))); 139 78 } 140 79 80 + static void __init pmd_advanced_tests(struct mm_struct *mm, 81 + struct vm_area_struct *vma, pmd_t *pmdp, 82 + unsigned long pfn, unsigned long vaddr, 83 + pgprot_t prot) 84 + { 85 + pmd_t pmd = pfn_pmd(pfn, prot); 86 + 87 + if (!has_transparent_hugepage()) 88 + return; 89 + 90 + pr_debug("Validating PMD advanced\n"); 91 + /* Align the address wrt HPAGE_PMD_SIZE */ 92 + vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE; 93 + 94 + pmd = pfn_pmd(pfn, prot); 95 + set_pmd_at(mm, vaddr, pmdp, pmd); 96 + pmdp_set_wrprotect(mm, vaddr, pmdp); 97 + pmd = READ_ONCE(*pmdp); 98 + WARN_ON(pmd_write(pmd)); 99 + 100 + pmd = pfn_pmd(pfn, 
prot); 101 + set_pmd_at(mm, vaddr, pmdp, pmd); 102 + pmdp_huge_get_and_clear(mm, vaddr, pmdp); 103 + pmd = READ_ONCE(*pmdp); 104 + WARN_ON(!pmd_none(pmd)); 105 + 106 + pmd = pfn_pmd(pfn, prot); 107 + pmd = pmd_wrprotect(pmd); 108 + pmd = pmd_mkclean(pmd); 109 + set_pmd_at(mm, vaddr, pmdp, pmd); 110 + pmd = pmd_mkwrite(pmd); 111 + pmd = pmd_mkdirty(pmd); 112 + pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1); 113 + pmd = READ_ONCE(*pmdp); 114 + WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd))); 115 + 116 + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); 117 + set_pmd_at(mm, vaddr, pmdp, pmd); 118 + pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1); 119 + pmd = READ_ONCE(*pmdp); 120 + WARN_ON(!pmd_none(pmd)); 121 + 122 + pmd = pmd_mkyoung(pmd); 123 + set_pmd_at(mm, vaddr, pmdp, pmd); 124 + pmdp_test_and_clear_young(vma, vaddr, pmdp); 125 + pmd = READ_ONCE(*pmdp); 126 + WARN_ON(pmd_young(pmd)); 127 + } 128 + 129 + static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) 130 + { 131 + pmd_t pmd = pfn_pmd(pfn, prot); 132 + 133 + pr_debug("Validating PMD leaf\n"); 134 + /* 135 + * PMD based THP is a leaf entry. 136 + */ 137 + pmd = pmd_mkhuge(pmd); 138 + WARN_ON(!pmd_leaf(pmd)); 139 + } 140 + 141 + static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) 142 + { 143 + pmd_t pmd; 144 + 145 + if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) 146 + return; 147 + 148 + pr_debug("Validating PMD huge\n"); 149 + /* 150 + * X86 defined pmd_set_huge() verifies that the given 151 + * PMD is not a populated non-leaf entry. 
152 + */ 153 + WRITE_ONCE(*pmdp, __pmd(0)); 154 + WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); 155 + WARN_ON(!pmd_clear_huge(pmdp)); 156 + pmd = READ_ONCE(*pmdp); 157 + WARN_ON(!pmd_none(pmd)); 158 + } 159 + 160 + static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) 161 + { 162 + pmd_t pmd = pfn_pmd(pfn, prot); 163 + 164 + pr_debug("Validating PMD saved write\n"); 165 + WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd)))); 166 + WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd)))); 167 + } 168 + 141 169 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 142 170 static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) 143 171 { ··· 235 85 if (!has_transparent_hugepage()) 236 86 return; 237 87 88 + pr_debug("Validating PUD basic\n"); 238 89 WARN_ON(!pud_same(pud, pud)); 239 90 WARN_ON(!pud_young(pud_mkyoung(pud_mkold(pud)))); 240 91 WARN_ON(!pud_write(pud_mkwrite(pud_wrprotect(pud)))); ··· 251 100 */ 252 101 WARN_ON(!pud_bad(pud_mkhuge(pud))); 253 102 } 103 + 104 + static void __init pud_advanced_tests(struct mm_struct *mm, 105 + struct vm_area_struct *vma, pud_t *pudp, 106 + unsigned long pfn, unsigned long vaddr, 107 + pgprot_t prot) 108 + { 109 + pud_t pud = pfn_pud(pfn, prot); 110 + 111 + if (!has_transparent_hugepage()) 112 + return; 113 + 114 + pr_debug("Validating PUD advanced\n"); 115 + /* Align the address wrt HPAGE_PUD_SIZE */ 116 + vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE; 117 + 118 + set_pud_at(mm, vaddr, pudp, pud); 119 + pudp_set_wrprotect(mm, vaddr, pudp); 120 + pud = READ_ONCE(*pudp); 121 + WARN_ON(pud_write(pud)); 122 + 123 + #ifndef __PAGETABLE_PMD_FOLDED 124 + pud = pfn_pud(pfn, prot); 125 + set_pud_at(mm, vaddr, pudp, pud); 126 + pudp_huge_get_and_clear(mm, vaddr, pudp); 127 + pud = READ_ONCE(*pudp); 128 + WARN_ON(!pud_none(pud)); 129 + 130 + pud = pfn_pud(pfn, prot); 131 + set_pud_at(mm, vaddr, pudp, pud); 132 + pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); 
133 + pud = READ_ONCE(*pudp); 134 + WARN_ON(!pud_none(pud)); 135 + #endif /* __PAGETABLE_PMD_FOLDED */ 136 + pud = pfn_pud(pfn, prot); 137 + pud = pud_wrprotect(pud); 138 + pud = pud_mkclean(pud); 139 + set_pud_at(mm, vaddr, pudp, pud); 140 + pud = pud_mkwrite(pud); 141 + pud = pud_mkdirty(pud); 142 + pudp_set_access_flags(vma, vaddr, pudp, pud, 1); 143 + pud = READ_ONCE(*pudp); 144 + WARN_ON(!(pud_write(pud) && pud_dirty(pud))); 145 + 146 + pud = pud_mkyoung(pud); 147 + set_pud_at(mm, vaddr, pudp, pud); 148 + pudp_test_and_clear_young(vma, vaddr, pudp); 149 + pud = READ_ONCE(*pudp); 150 + WARN_ON(pud_young(pud)); 151 + } 152 + 153 + static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) 154 + { 155 + pud_t pud = pfn_pud(pfn, prot); 156 + 157 + pr_debug("Validating PUD leaf\n"); 158 + /* 159 + * PUD based THP is a leaf entry. 160 + */ 161 + pud = pud_mkhuge(pud); 162 + WARN_ON(!pud_leaf(pud)); 163 + } 164 + 165 + static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 166 + { 167 + pud_t pud; 168 + 169 + if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) 170 + return; 171 + 172 + pr_debug("Validating PUD huge\n"); 173 + /* 174 + * X86 defined pud_set_huge() verifies that the given 175 + * PUD is not a populated non-leaf entry. 
176 + */ 177 + WRITE_ONCE(*pudp, __pud(0)); 178 + WARN_ON(!pud_set_huge(pudp, __pfn_to_phys(pfn), prot)); 179 + WARN_ON(!pud_clear_huge(pudp)); 180 + pud = READ_ONCE(*pudp); 181 + WARN_ON(!pud_none(pud)); 182 + } 254 183 #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 255 184 static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { } 185 + static void __init pud_advanced_tests(struct mm_struct *mm, 186 + struct vm_area_struct *vma, pud_t *pudp, 187 + unsigned long pfn, unsigned long vaddr, 188 + pgprot_t prot) 189 + { 190 + } 191 + static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 192 + static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 193 + { 194 + } 256 195 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 257 196 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 258 197 static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot) { } 259 198 static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { } 199 + static void __init pmd_advanced_tests(struct mm_struct *mm, 200 + struct vm_area_struct *vma, pmd_t *pmdp, 201 + unsigned long pfn, unsigned long vaddr, 202 + pgprot_t prot) 203 + { 204 + } 205 + static void __init pud_advanced_tests(struct mm_struct *mm, 206 + struct vm_area_struct *vma, pud_t *pudp, 207 + unsigned long pfn, unsigned long vaddr, 208 + pgprot_t prot) 209 + { 210 + } 211 + static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { } 212 + static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } 213 + static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) 214 + { 215 + } 216 + static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) 217 + { 218 + } 219 + static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } 260 220 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 261 221 262 222 static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot) 263 223 { 264 224 
p4d_t p4d; 265 225 226 + pr_debug("Validating P4D basic\n"); 266 227 memset(&p4d, RANDOM_NZVALUE, sizeof(p4d_t)); 267 228 WARN_ON(!p4d_same(p4d, p4d)); 268 229 } ··· 383 120 { 384 121 pgd_t pgd; 385 122 123 + pr_debug("Validating PGD basic\n"); 386 124 memset(&pgd, RANDOM_NZVALUE, sizeof(pgd_t)); 387 125 WARN_ON(!pgd_same(pgd, pgd)); 388 126 } ··· 396 132 if (mm_pmd_folded(mm)) 397 133 return; 398 134 135 + pr_debug("Validating PUD clear\n"); 399 136 pud = __pud(pud_val(pud) | RANDOM_ORVALUE); 400 137 WRITE_ONCE(*pudp, pud); 401 138 pud_clear(pudp); ··· 411 146 412 147 if (mm_pmd_folded(mm)) 413 148 return; 149 + 150 + pr_debug("Validating PUD populate\n"); 414 151 /* 415 152 * This entry points to next level page table page. 416 153 * Hence this must not qualify as pud_bad(). ··· 439 172 if (mm_pud_folded(mm)) 440 173 return; 441 174 175 + pr_debug("Validating P4D clear\n"); 442 176 p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE); 443 177 WRITE_ONCE(*p4dp, p4d); 444 178 p4d_clear(p4dp); ··· 455 187 if (mm_pud_folded(mm)) 456 188 return; 457 189 190 + pr_debug("Validating P4D populate\n"); 458 191 /* 459 192 * This entry points to next level page table page. 460 193 * Hence this must not qualify as p4d_bad(). ··· 474 205 if (mm_p4d_folded(mm)) 475 206 return; 476 207 208 + pr_debug("Validating PGD clear\n"); 477 209 pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE); 478 210 WRITE_ONCE(*pgdp, pgd); 479 211 pgd_clear(pgdp); ··· 490 220 if (mm_p4d_folded(mm)) 491 221 return; 492 222 223 + pr_debug("Validating PGD populate\n"); 493 224 /* 494 225 * This entry points to next level page table page. 495 226 * Hence this must not qualify as pgd_bad(). 
··· 519 248 { 520 249 pte_t pte = ptep_get(ptep); 521 250 251 + pr_debug("Validating PTE clear\n"); 522 252 pte = __pte(pte_val(pte) | RANDOM_ORVALUE); 523 253 set_pte_at(mm, vaddr, ptep, pte); 524 254 barrier(); ··· 532 260 { 533 261 pmd_t pmd = READ_ONCE(*pmdp); 534 262 263 + pr_debug("Validating PMD clear\n"); 535 264 pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE); 536 265 WRITE_ONCE(*pmdp, pmd); 537 266 pmd_clear(pmdp); ··· 545 272 { 546 273 pmd_t pmd; 547 274 275 + pr_debug("Validating PMD populate\n"); 548 276 /* 549 277 * This entry points to next level page table page. 550 278 * Hence this must not qualify as pmd_bad(). ··· 555 281 pmd = READ_ONCE(*pmdp); 556 282 WARN_ON(pmd_bad(pmd)); 557 283 } 284 + 285 + static void __init pte_special_tests(unsigned long pfn, pgprot_t prot) 286 + { 287 + pte_t pte = pfn_pte(pfn, prot); 288 + 289 + if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) 290 + return; 291 + 292 + pr_debug("Validating PTE special\n"); 293 + WARN_ON(!pte_special(pte_mkspecial(pte))); 294 + } 295 + 296 + static void __init pte_protnone_tests(unsigned long pfn, pgprot_t prot) 297 + { 298 + pte_t pte = pfn_pte(pfn, prot); 299 + 300 + if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) 301 + return; 302 + 303 + pr_debug("Validating PTE protnone\n"); 304 + WARN_ON(!pte_protnone(pte)); 305 + WARN_ON(!pte_present(pte)); 306 + } 307 + 308 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 309 + static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot) 310 + { 311 + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); 312 + 313 + if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) 314 + return; 315 + 316 + pr_debug("Validating PMD protnone\n"); 317 + WARN_ON(!pmd_protnone(pmd)); 318 + WARN_ON(!pmd_present(pmd)); 319 + } 320 + #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 321 + static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot) { } 322 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 323 + 324 + #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP 325 + static void __init pte_devmap_tests(unsigned long 
pfn, pgprot_t prot) 326 + { 327 + pte_t pte = pfn_pte(pfn, prot); 328 + 329 + pr_debug("Validating PTE devmap\n"); 330 + WARN_ON(!pte_devmap(pte_mkdevmap(pte))); 331 + } 332 + 333 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 334 + static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) 335 + { 336 + pmd_t pmd = pfn_pmd(pfn, prot); 337 + 338 + pr_debug("Validating PMD devmap\n"); 339 + WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd))); 340 + } 341 + 342 + #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 343 + static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) 344 + { 345 + pud_t pud = pfn_pud(pfn, prot); 346 + 347 + pr_debug("Validating PUD devmap\n"); 348 + WARN_ON(!pud_devmap(pud_mkdevmap(pud))); 349 + } 350 + #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 351 + static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { } 352 + #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 353 + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ 354 + static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) { } 355 + static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { } 356 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 357 + #else 358 + static void __init pte_devmap_tests(unsigned long pfn, pgprot_t prot) { } 359 + static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) { } 360 + static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { } 361 + #endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ 362 + 363 + static void __init pte_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 364 + { 365 + pte_t pte = pfn_pte(pfn, prot); 366 + 367 + if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) 368 + return; 369 + 370 + pr_debug("Validating PTE soft dirty\n"); 371 + WARN_ON(!pte_soft_dirty(pte_mksoft_dirty(pte))); 372 + WARN_ON(pte_soft_dirty(pte_clear_soft_dirty(pte))); 373 + } 374 + 375 + static void __init pte_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 376 + { 377 + pte_t pte = pfn_pte(pfn, prot); 378 + 
379 + if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) 380 + return; 381 + 382 + pr_debug("Validating PTE swap soft dirty\n"); 383 + WARN_ON(!pte_swp_soft_dirty(pte_swp_mksoft_dirty(pte))); 384 + WARN_ON(pte_swp_soft_dirty(pte_swp_clear_soft_dirty(pte))); 385 + } 386 + 387 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 388 + static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 389 + { 390 + pmd_t pmd = pfn_pmd(pfn, prot); 391 + 392 + if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY)) 393 + return; 394 + 395 + pr_debug("Validating PMD soft dirty\n"); 396 + WARN_ON(!pmd_soft_dirty(pmd_mksoft_dirty(pmd))); 397 + WARN_ON(pmd_soft_dirty(pmd_clear_soft_dirty(pmd))); 398 + } 399 + 400 + static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 401 + { 402 + pmd_t pmd = pfn_pmd(pfn, prot); 403 + 404 + if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) || 405 + !IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION)) 406 + return; 407 + 408 + pr_debug("Validating PMD swap soft dirty\n"); 409 + WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd))); 410 + WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd))); 411 + } 412 + #else /* !CONFIG_ARCH_HAS_PTE_DEVMAP */ 413 + static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { } 414 + static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) 415 + { 416 + } 417 + #endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ 418 + 419 + static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot) 420 + { 421 + swp_entry_t swp; 422 + pte_t pte; 423 + 424 + pr_debug("Validating PTE swap\n"); 425 + pte = pfn_pte(pfn, prot); 426 + swp = __pte_to_swp_entry(pte); 427 + pte = __swp_entry_to_pte(swp); 428 + WARN_ON(pfn != pte_pfn(pte)); 429 + } 430 + 431 + #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION 432 + static void __init pmd_swap_tests(unsigned long pfn, pgprot_t prot) 433 + { 434 + swp_entry_t swp; 435 + pmd_t pmd; 436 + 437 + pr_debug("Validating PMD swap\n"); 438 + pmd = pfn_pmd(pfn, prot); 439 + swp = 
__pmd_to_swp_entry(pmd); 440 + pmd = __swp_entry_to_pmd(swp); 441 + WARN_ON(pfn != pmd_pfn(pmd)); 442 + } 443 + #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */ 444 + static void __init pmd_swap_tests(unsigned long pfn, pgprot_t prot) { } 445 + #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */ 446 + 447 + static void __init swap_migration_tests(void) 448 + { 449 + struct page *page; 450 + swp_entry_t swp; 451 + 452 + if (!IS_ENABLED(CONFIG_MIGRATION)) 453 + return; 454 + 455 + pr_debug("Validating swap migration\n"); 456 + /* 457 + * swap_migration_tests() requires a dedicated page as it needs to 458 + * be locked before creating a migration entry from it. Locking the 459 + * page that actually maps kernel text ('start_kernel') can be really 460 + * problematic. Let's allocate a dedicated page explicitly for this 461 + * purpose that will be freed subsequently. 462 + */ 463 + page = alloc_page(GFP_KERNEL); 464 + if (!page) { 465 + pr_err("page allocation failed\n"); 466 + return; 467 + } 468 + 469 + /* 470 + * make_migration_entry() expects the given page to be 471 + * locked, otherwise it stumbles upon a BUG_ON(). 472 + */ 473 + __SetPageLocked(page); 474 + swp = make_migration_entry(page, 1); 475 + WARN_ON(!is_migration_entry(swp)); 476 + WARN_ON(!is_write_migration_entry(swp)); 477 + 478 + make_migration_entry_read(&swp); 479 + WARN_ON(!is_migration_entry(swp)); 480 + WARN_ON(is_write_migration_entry(swp)); 481 + 482 + swp = make_migration_entry(page, 0); 483 + WARN_ON(!is_migration_entry(swp)); 484 + WARN_ON(is_write_migration_entry(swp)); 485 + __ClearPageLocked(page); 486 + __free_page(page); 487 + } 488 + 489 + #ifdef CONFIG_HUGETLB_PAGE 490 + static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) 491 + { 492 + struct page *page; 493 + pte_t pte; 494 + 495 + pr_debug("Validating HugeTLB basic\n"); 496 + /* 497 + * Accessing the page associated with the pfn is safe here, 498 + * as it was previously derived from a real kernel symbol. 
499 + */ 500 + page = pfn_to_page(pfn); 501 + pte = mk_huge_pte(page, prot); 502 + 503 + WARN_ON(!huge_pte_dirty(huge_pte_mkdirty(pte))); 504 + WARN_ON(!huge_pte_write(huge_pte_mkwrite(huge_pte_wrprotect(pte)))); 505 + WARN_ON(huge_pte_write(huge_pte_wrprotect(huge_pte_mkwrite(pte)))); 506 + 507 + #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB 508 + pte = pfn_pte(pfn, prot); 509 + 510 + WARN_ON(!pte_huge(pte_mkhuge(pte))); 511 + #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */ 512 + } 513 + 514 + static void __init hugetlb_advanced_tests(struct mm_struct *mm, 515 + struct vm_area_struct *vma, 516 + pte_t *ptep, unsigned long pfn, 517 + unsigned long vaddr, pgprot_t prot) 518 + { 519 + struct page *page = pfn_to_page(pfn); 520 + pte_t pte = ptep_get(ptep); 521 + unsigned long paddr = __pfn_to_phys(pfn) & PMD_MASK; 522 + 523 + pr_debug("Validating HugeTLB advanced\n"); 524 + pte = pte_mkhuge(mk_pte(pfn_to_page(PHYS_PFN(paddr)), prot)); 525 + set_huge_pte_at(mm, vaddr, ptep, pte); 526 + barrier(); 527 + WARN_ON(!pte_same(pte, huge_ptep_get(ptep))); 528 + huge_pte_clear(mm, vaddr, ptep, PMD_SIZE); 529 + pte = huge_ptep_get(ptep); 530 + WARN_ON(!huge_pte_none(pte)); 531 + 532 + pte = mk_huge_pte(page, prot); 533 + set_huge_pte_at(mm, vaddr, ptep, pte); 534 + barrier(); 535 + huge_ptep_set_wrprotect(mm, vaddr, ptep); 536 + pte = huge_ptep_get(ptep); 537 + WARN_ON(huge_pte_write(pte)); 538 + 539 + pte = mk_huge_pte(page, prot); 540 + set_huge_pte_at(mm, vaddr, ptep, pte); 541 + barrier(); 542 + huge_ptep_get_and_clear(mm, vaddr, ptep); 543 + pte = huge_ptep_get(ptep); 544 + WARN_ON(!huge_pte_none(pte)); 545 + 546 + pte = mk_huge_pte(page, prot); 547 + pte = huge_pte_wrprotect(pte); 548 + set_huge_pte_at(mm, vaddr, ptep, pte); 549 + barrier(); 550 + pte = huge_pte_mkwrite(pte); 551 + pte = huge_pte_mkdirty(pte); 552 + huge_ptep_set_access_flags(vma, vaddr, ptep, pte, 1); 553 + pte = huge_ptep_get(ptep); 554 + WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte))); 555 + } 556 + 
#else /* !CONFIG_HUGETLB_PAGE */ 557 + static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { } 558 + static void __init hugetlb_advanced_tests(struct mm_struct *mm, 559 + struct vm_area_struct *vma, 560 + pte_t *ptep, unsigned long pfn, 561 + unsigned long vaddr, pgprot_t prot) 562 + { 563 + } 564 + #endif /* CONFIG_HUGETLB_PAGE */ 565 + 566 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 567 + static void __init pmd_thp_tests(unsigned long pfn, pgprot_t prot) 568 + { 569 + pmd_t pmd; 570 + 571 + if (!has_transparent_hugepage()) 572 + return; 573 + 574 + pr_debug("Validating PMD based THP\n"); 575 + /* 576 + * pmd_trans_huge() and pmd_present() must return positive after 577 + * MMU invalidation with pmd_mkinvalid(). This behavior is an 578 + * optimization for transparent huge page. pmd_trans_huge() must 579 + * be true if pmd_page() returns a valid THP to avoid taking the 580 + * pmd_lock when others walk over non transhuge pmds (i.e. there 581 + * are no THP allocated). Especially when splitting a THP and 582 + * removing the present bit from the pmd, pmd_trans_huge() still 583 + * needs to return true. pmd_present() should be true whenever 584 + * pmd_trans_huge() returns true. 585 + */ 586 + pmd = pfn_pmd(pfn, prot); 587 + WARN_ON(!pmd_trans_huge(pmd_mkhuge(pmd))); 588 + 589 + #ifndef __HAVE_ARCH_PMDP_INVALIDATE 590 + WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd)))); 591 + WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd)))); 592 + #endif /* __HAVE_ARCH_PMDP_INVALIDATE */ 593 + } 594 + 595 + #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD 596 + static void __init pud_thp_tests(unsigned long pfn, pgprot_t prot) 597 + { 598 + pud_t pud; 599 + 600 + if (!has_transparent_hugepage()) 601 + return; 602 + 603 + pr_debug("Validating PUD based THP\n"); 604 + pud = pfn_pud(pfn, prot); 605 + WARN_ON(!pud_trans_huge(pud_mkhuge(pud))); 606 + 607 + /* 608 + * pud_mkinvalid() has been dropped for now. 
Enable back 609 + * these tests when it comes back with a modified pud_present(). 610 + * 611 + * WARN_ON(!pud_trans_huge(pud_mkinvalid(pud_mkhuge(pud)))); 612 + * WARN_ON(!pud_present(pud_mkinvalid(pud_mkhuge(pud)))); 613 + */ 614 + } 615 + #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 616 + static void __init pud_thp_tests(unsigned long pfn, pgprot_t prot) { } 617 + #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ 618 + #else /* !CONFIG_TRANSPARENT_HUGEPAGE */ 619 + static void __init pmd_thp_tests(unsigned long pfn, pgprot_t prot) { } 620 + static void __init pud_thp_tests(unsigned long pfn, pgprot_t prot) { } 621 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 558 622 559 623 static unsigned long __init get_random_vaddr(void) 560 624 { ··· 908 296 909 297 static int __init debug_vm_pgtable(void) 910 298 { 299 + struct vm_area_struct *vma; 911 300 struct mm_struct *mm; 912 301 pgd_t *pgdp; 913 302 p4d_t *p4dp, *saved_p4dp; ··· 916 303 pmd_t *pmdp, *saved_pmdp, pmd; 917 304 pte_t *ptep; 918 305 pgtable_t saved_ptep; 919 - pgprot_t prot; 306 + pgprot_t prot, protnone; 920 307 phys_addr_t paddr; 921 308 unsigned long vaddr, pte_aligned, pmd_aligned; 922 309 unsigned long pud_aligned, p4d_aligned, pgd_aligned; ··· 928 315 mm = mm_alloc(); 929 316 if (!mm) { 930 317 pr_err("mm_struct allocation failed\n"); 318 + return 1; 319 + } 320 + 321 + /* 322 + * __P000 (or even __S000) will help create page table entries with 323 + * PROT_NONE permission as required for pxx_protnone_tests(). 
324 + */ 325 + protnone = __P000; 326 + 327 + vma = vm_area_alloc(mm); 328 + if (!vma) { 329 + pr_err("vma allocation failed\n"); 931 330 return 1; 932 331 } 933 332 ··· 991 366 p4d_clear_tests(mm, p4dp); 992 367 pgd_clear_tests(mm, pgdp); 993 368 369 + pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); 370 + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); 371 + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); 372 + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); 373 + 374 + pmd_leaf_tests(pmd_aligned, prot); 375 + pud_leaf_tests(pud_aligned, prot); 376 + 377 + pmd_huge_tests(pmdp, pmd_aligned, prot); 378 + pud_huge_tests(pudp, pud_aligned, prot); 379 + 380 + pte_savedwrite_tests(pte_aligned, prot); 381 + pmd_savedwrite_tests(pmd_aligned, prot); 382 + 994 383 pte_unmap_unlock(ptep, ptl); 995 384 996 385 pmd_populate_tests(mm, pmdp, saved_ptep); ··· 1012 373 p4d_populate_tests(mm, p4dp, saved_pudp); 1013 374 pgd_populate_tests(mm, pgdp, saved_p4dp); 1014 375 376 + pte_special_tests(pte_aligned, prot); 377 + pte_protnone_tests(pte_aligned, protnone); 378 + pmd_protnone_tests(pmd_aligned, protnone); 379 + 380 + pte_devmap_tests(pte_aligned, prot); 381 + pmd_devmap_tests(pmd_aligned, prot); 382 + pud_devmap_tests(pud_aligned, prot); 383 + 384 + pte_soft_dirty_tests(pte_aligned, prot); 385 + pmd_soft_dirty_tests(pmd_aligned, prot); 386 + pte_swap_soft_dirty_tests(pte_aligned, prot); 387 + pmd_swap_soft_dirty_tests(pmd_aligned, prot); 388 + 389 + pte_swap_tests(pte_aligned, prot); 390 + pmd_swap_tests(pmd_aligned, prot); 391 + 392 + swap_migration_tests(); 393 + hugetlb_basic_tests(pte_aligned, prot); 394 + 395 + pmd_thp_tests(pmd_aligned, prot); 396 + pud_thp_tests(pud_aligned, prot); 397 + 1015 398 p4d_free(mm, saved_p4dp); 1016 399 pud_free(mm, saved_pudp); 1017 400 pmd_free(mm, saved_pmdp); 1018 401 pte_free(mm, saved_ptep); 1019 402 403 + vm_area_free(vma); 1020 404 mm_dec_nr_puds(mm); 1021 405 mm_dec_nr_pmds(mm); 
1022 406 mm_dec_nr_ptes(mm);
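Nearly every helper added to mm/debug_vm_pgtable.c above follows one pattern: build an entry from a pfn and a pgprot, apply a modifier and its inverse, then `WARN_ON()` if the expected property does not hold. A toy model of that round-trip check, using an invented software "pte" layout (the bit positions are assumptions for illustration; real layouts are per-architecture):

```c
#include <assert.h>
#include <stdint.h>

/* Toy software PTE: bit layout invented for illustration only. */
typedef uint64_t pte_t;
#define PTE_DIRTY (1ULL << 1)
#define PTE_YOUNG (1ULL << 2)

static pte_t pte_mkdirty(pte_t p) { return p | PTE_DIRTY; }
static pte_t pte_mkclean(pte_t p) { return p & ~PTE_DIRTY; }
static pte_t pte_mkyoung(pte_t p) { return p | PTE_YOUNG; }
static pte_t pte_mkold(pte_t p)   { return p & ~PTE_YOUNG; }
static int   pte_dirty(pte_t p)   { return !!(p & PTE_DIRTY); }
static int   pte_young(pte_t p)   { return !!(p & PTE_YOUNG); }

/* Mirrors pte_basic_tests(): each helper must win over its inverse,
 * whatever state the entry started in. */
static int pte_basic_check(pte_t pte)
{
	return pte_young(pte_mkyoung(pte_mkold(pte))) &&
	       pte_dirty(pte_mkdirty(pte_mkclean(pte))) &&
	       !pte_young(pte_mkold(pte_mkyoung(pte))) &&
	       !pte_dirty(pte_mkclean(pte_mkdirty(pte)));
}
```

The "advanced" variants in the diff do the same thing through the `set_pte_at()`/`ptep_get()` path so that arch-specific set/read helpers are exercised too, not just the pure bit manipulators.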
+9
mm/filemap.c
··· 41 41 #include <linux/delayacct.h> 42 42 #include <linux/psi.h> 43 43 #include <linux/ramfs.h> 44 + #include <linux/page_idle.h> 44 45 #include "internal.h" 45 46 46 47 #define CREATE_TRACE_POINTS ··· 1649 1648 * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the 1650 1649 * page is already in cache. If the page was allocated, unlock it before 1651 1650 * returning so the caller can do the same dance. 1651 + * * %FGP_WRITE - The page will be written 1652 + * * %FGP_NOFS - __GFP_FS will get cleared in gfp mask 1653 + * * %FGP_NOWAIT - Don't get blocked by page lock 1652 1654 * 1653 1655 * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even 1654 1656 * if the %GFP flags specified for %FGP_CREAT are atomic. ··· 1693 1689 1694 1690 if (fgp_flags & FGP_ACCESSED) 1695 1691 mark_page_accessed(page); 1692 + else if (fgp_flags & FGP_WRITE) { 1693 + /* Clear idle flag for buffer write */ 1694 + if (page_is_idle(page)) 1695 + clear_page_idle(page); 1696 + } 1696 1697 1697 1698 no_page: 1698 1699 if (!page && (fgp_flags & FGP_CREAT)) {
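The filemap.c hunk adds an `else if` arm: a lookup with `FGP_ACCESSED` still marks the page accessed, while a write-intent lookup (`FGP_WRITE`) only clears the page-idle hint. A small sketch of that dispatch, with invented flag values and a two-field stand-in for page state (the real constants live in the pagecache headers):

```c
#include <assert.h>

/* Flag values are invented for the sketch. */
#define FGP_ACCESSED 0x1
#define FGP_WRITE    0x2

struct page_state { int accessed; int idle; };

/* Mirrors the new branch in pagecache_get_page(): an access marks the
 * page accessed; a buffer write merely clears the idle hint. */
static void fgp_touch(struct page_state *p, unsigned int fgp_flags)
{
	if (fgp_flags & FGP_ACCESSED)
		p->accessed = 1;
	else if (fgp_flags & FGP_WRITE) {
		if (p->idle)
			p->idle = 0;
	}
}
```

Note the two flags are deliberately not symmetric: clearing the idle bit keeps idle-page tracking honest for buffered writes without promoting the page in the LRU the way `mark_page_accessed()` would.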
+2 -1
mm/gup.c
··· 1404 1404 * 1405 1405 * This takes care of mlocking the pages too if VM_LOCKED is set. 1406 1406 * 1407 - * return 0 on success, negative error code on error. 1407 + * Return either number of pages pinned in the vma, or a negative error 1408 + * code on error. 1408 1409 * 1409 1410 * vma->vm_mm->mmap_lock must be held. 1410 1411 *
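The gup.c comment fix documents the usual kernel return convention: a nonnegative count on success, a negative errno on failure, so callers must test the sign rather than compare against zero. A hedged sketch of that convention with a hypothetical populate-style helper (`fake_populate` is invented for illustration):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper following the convention the corrected comment
 * describes: number of pages pinned, or a negative error code. */
static long fake_populate(long nr_pages, int fail)
{
	if (fail)
		return -ENOMEM;	/* negative errno on error */
	return nr_pages;	/* count of pages pinned on success */
}

/* A caller must branch on the sign; zero pinned pages is not an error. */
static long pinned_or_zero(long nr_pages, int fail)
{
	long ret = fake_populate(nr_pages, fail);

	return ret < 0 ? 0 : ret;
}
```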
+3 -9
mm/huge_memory.c
··· 1722 1722 } 1723 1723 1724 1724 bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, 1725 - unsigned long new_addr, unsigned long old_end, 1726 - pmd_t *old_pmd, pmd_t *new_pmd) 1725 + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd) 1727 1726 { 1728 1727 spinlock_t *old_ptl, *new_ptl; 1729 1728 pmd_t pmd; 1730 1729 struct mm_struct *mm = vma->vm_mm; 1731 1730 bool force_flush = false; 1732 - 1733 - if ((old_addr & ~HPAGE_PMD_MASK) || 1734 - (new_addr & ~HPAGE_PMD_MASK) || 1735 - old_end - old_addr < HPAGE_PMD_SIZE) 1736 - return false; 1737 1731 1738 1732 /* 1739 1733 * The destination pmd shouldn't be established, free_pgtables() ··· 2063 2069 * free), userland could trigger a small page size TLB miss on the 2064 2070 * small sized TLB while the hugepage TLB entry is still established in 2065 2071 * the huge TLB. Some CPU doesn't like that. 2066 - * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 2067 - * 383 on page 93. Intel should be safe but it also warns that it's 2072 + * See http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum 2073 + * 383 on page 105. Intel should be safe but it also warns that it's 2068 2074 * only safe if the permission and cache attributes of the two entries 2069 2075 * loaded in the two TLB is identical (which should be the case here). 2070 2076 * But it is generally safer to never allow small and huge TLB entries
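The huge_memory.c hunk drops the alignment sanity check from `move_huge_pmd()` (presumably because the caller now guarantees it). For reference, the deleted predicate requires both addresses to be PMD-aligned and the old range to cover at least one huge page. A self-contained sketch of that arithmetic, assuming 2 MiB huge pages:

```c
#include <assert.h>

#define HPAGE_PMD_SHIFT 21		/* 2 MiB huge pages, an assumption */
#define HPAGE_PMD_SIZE  (1UL << HPAGE_PMD_SHIFT)
#define HPAGE_PMD_MASK  (~(HPAGE_PMD_SIZE - 1))

/* The check deleted from move_huge_pmd(): any low bits below the huge
 * page boundary, or a too-short range, disqualify the move. */
static int pmd_move_ok(unsigned long old_addr, unsigned long new_addr,
		       unsigned long old_end)
{
	if ((old_addr & ~HPAGE_PMD_MASK) ||
	    (new_addr & ~HPAGE_PMD_MASK) ||
	    old_end - old_addr < HPAGE_PMD_SIZE)
		return 0;
	return 1;
}
```

`addr & ~HPAGE_PMD_MASK` isolates the offset within the huge page, so any nonzero result means the address is not on a 2 MiB boundary.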
+11 -14
mm/hugetlb.c
··· 31 31 #include <linux/cma.h> 32 32 33 33 #include <asm/page.h> 34 + #include <asm/pgalloc.h> 34 35 #include <asm/tlb.h> 35 36 36 37 #include <linux/io.h> ··· 5314 5313 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, 5315 5314 unsigned long *start, unsigned long *end) 5316 5315 { 5317 - unsigned long check_addr; 5316 + unsigned long a_start, a_end; 5318 5317 5319 5318 if (!(vma->vm_flags & VM_MAYSHARE)) 5320 5319 return; 5321 5320 5322 - for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) { 5323 - unsigned long a_start = check_addr & PUD_MASK; 5324 - unsigned long a_end = a_start + PUD_SIZE; 5321 + /* Extend the range to be PUD aligned for a worst case scenario */ 5322 + a_start = ALIGN_DOWN(*start, PUD_SIZE); 5323 + a_end = ALIGN(*end, PUD_SIZE); 5325 5324 5326 - /* 5327 - * If sharing is possible, adjust start/end if necessary. 5328 - */ 5329 - if (range_in_vma(vma, a_start, a_end)) { 5330 - if (a_start < *start) 5331 - *start = a_start; 5332 - if (a_end > *end) 5333 - *end = a_end; 5334 - } 5335 - } 5325 + /* 5326 + * Intersect the range with the vma range, since pmd sharing won't be 5327 + * across vma after all 5328 + */ 5329 + *start = max(vma->vm_start, a_start); 5330 + *end = min(vma->vm_end, a_end); 5336 5331 } 5337 5332 5338 5333 /*
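The rewritten `adjust_range_if_pmd_sharing_possible()` replaces a per-PUD loop with straight interval arithmetic: widen `[start, end)` to PUD-aligned bounds for the worst case, then intersect with the VMA. A self-contained sketch of that calculation, assuming 1 GiB PUD regions (as on x86-64):

```c
#include <assert.h>

#define PUD_SHIFT 30			/* 1 GiB PUD regions, an assumption */
#define PUD_SIZE  (1UL << PUD_SHIFT)
#define ALIGN_DOWN_TO(x, a) ((x) & ~((a) - 1))
#define ALIGN_UP_TO(x, a)   (((x) + (a) - 1) & ~((a) - 1))

/* Mirrors the new adjust_range_if_pmd_sharing_possible(): widen to
 * PUD-aligned bounds, then clamp to the vma, since pmd sharing never
 * crosses a vma boundary. */
static void adjust_range(unsigned long vm_start, unsigned long vm_end,
			 unsigned long *start, unsigned long *end)
{
	unsigned long a_start = ALIGN_DOWN_TO(*start, PUD_SIZE);
	unsigned long a_end   = ALIGN_UP_TO(*end, PUD_SIZE);

	*start = vm_start > a_start ? vm_start : a_start;	/* max() */
	*end   = vm_end < a_end ? vm_end : a_end;		/* min() */
}
```

Unlike the old loop, the result no longer depends on whether each individual PUD region fits in the VMA, which is exactly the over-narrow behavior the fix in "mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible" addresses.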
+4 -37
mm/kasan/common.c
··· 40 40 #include "kasan.h" 41 41 #include "../slab.h" 42 42 43 - static inline depot_stack_handle_t save_stack(gfp_t flags) 43 + depot_stack_handle_t kasan_save_stack(gfp_t flags) 44 44 { 45 45 unsigned long entries[KASAN_STACK_DEPTH]; 46 46 unsigned int nr_entries; ··· 50 50 return stack_depot_save(entries, nr_entries, flags); 51 51 } 52 52 53 - static inline void set_track(struct kasan_track *track, gfp_t flags) 53 + void kasan_set_track(struct kasan_track *track, gfp_t flags) 54 54 { 55 55 track->pid = current->pid; 56 - track->stack = save_stack(flags); 56 + track->stack = kasan_save_stack(flags); 57 57 } 58 58 59 59 void kasan_enable_current(void) ··· 180 180 kasan_unpoison_shadow(base, watermark - base); 181 181 } 182 182 183 - /* 184 - * Clear all poison for the region between the current SP and a provided 185 - * watermark value, as is sometimes required prior to hand-crafted asm function 186 - * returns in the middle of functions. 187 - */ 188 - void kasan_unpoison_stack_above_sp_to(const void *watermark) 189 - { 190 - const void *sp = __builtin_frame_address(0); 191 - size_t size = watermark - sp; 192 - 193 - if (WARN_ON(sp > watermark)) 194 - return; 195 - kasan_unpoison_shadow(sp, size); 196 - } 197 - 198 183 void kasan_alloc_pages(struct page *page, unsigned int order) 199 184 { 200 185 u8 tag; ··· 281 296 { 282 297 BUILD_BUG_ON(sizeof(struct kasan_free_meta) > 32); 283 298 return (void *)object + cache->kasan_info.free_meta_offset; 284 - } 285 - 286 - 287 - static void kasan_set_free_info(struct kmem_cache *cache, 288 - void *object, u8 tag) 289 - { 290 - struct kasan_alloc_meta *alloc_meta; 291 - u8 idx = 0; 292 - 293 - alloc_meta = get_alloc_info(cache, object); 294 - 295 - #ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY 296 - idx = alloc_meta->free_track_idx; 297 - alloc_meta->free_pointer_tag[idx] = tag; 298 - alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS; 299 - #endif 300 - 301 - set_track(&alloc_meta->free_track[idx], GFP_NOWAIT); 302 299 
} 303 300 304 301 void kasan_poison_slab(struct page *page) ··· 458 491 KASAN_KMALLOC_REDZONE); 459 492 460 493 if (cache->flags & SLAB_KASAN) 461 - set_track(&get_alloc_info(cache, object)->alloc_track, flags); 494 + kasan_set_track(&get_alloc_info(cache, object)->alloc_track, flags); 462 495 463 496 return set_tag(object, tag); 464 497 }
+43
mm/kasan/generic.c
··· 324 324 DEFINE_ASAN_SET_SHADOW(f3); 325 325 DEFINE_ASAN_SET_SHADOW(f5); 326 326 DEFINE_ASAN_SET_SHADOW(f8); 327 + 328 + void kasan_record_aux_stack(void *addr) 329 + { 330 + struct page *page = kasan_addr_to_page(addr); 331 + struct kmem_cache *cache; 332 + struct kasan_alloc_meta *alloc_info; 333 + void *object; 334 + 335 + if (!(page && PageSlab(page))) 336 + return; 337 + 338 + cache = page->slab_cache; 339 + object = nearest_obj(cache, page, addr); 340 + alloc_info = get_alloc_info(cache, object); 341 + 342 + /* 343 + * record the last two call_rcu() call stacks. 344 + */ 345 + alloc_info->aux_stack[1] = alloc_info->aux_stack[0]; 346 + alloc_info->aux_stack[0] = kasan_save_stack(GFP_NOWAIT); 347 + } 348 + 349 + void kasan_set_free_info(struct kmem_cache *cache, 350 + void *object, u8 tag) 351 + { 352 + struct kasan_free_meta *free_meta; 353 + 354 + free_meta = get_free_info(cache, object); 355 + kasan_set_track(&free_meta->free_track, GFP_NOWAIT); 356 + 357 + /* 358 + * the object was freed and has free track set 359 + */ 360 + *(u8 *)kasan_mem_to_shadow(object) = KASAN_KMALLOC_FREETRACK; 361 + } 362 + 363 + struct kasan_track *kasan_get_free_track(struct kmem_cache *cache, 364 + void *object, u8 tag) 365 + { 366 + if (*(u8 *)kasan_mem_to_shadow(object) != KASAN_KMALLOC_FREETRACK) 367 + return NULL; 368 + return &get_free_info(cache, object)->free_track; 369 + }
+1
mm/kasan/generic_report.c
··· 80 80 break; 81 81 case KASAN_FREE_PAGE: 82 82 case KASAN_KMALLOC_FREE: 83 + case KASAN_KMALLOC_FREETRACK: 83 84 bug_type = "use-after-free"; 84 85 break; 85 86 case KASAN_ALLOCA_LEFT:
+21 -2
mm/kasan/kasan.h
··· 17 17 #define KASAN_PAGE_REDZONE 0xFE /* redzone for kmalloc_large allocations */ 18 18 #define KASAN_KMALLOC_REDZONE 0xFC /* redzone inside slub object */ 19 19 #define KASAN_KMALLOC_FREE 0xFB /* object was freed (kmem_cache_free/kfree) */ 20 + #define KASAN_KMALLOC_FREETRACK 0xFA /* object was freed and has free track set */ 20 21 #else 21 22 #define KASAN_FREE_PAGE KASAN_TAG_INVALID 22 23 #define KASAN_PAGE_REDZONE KASAN_TAG_INVALID 23 24 #define KASAN_KMALLOC_REDZONE KASAN_TAG_INVALID 24 25 #define KASAN_KMALLOC_FREE KASAN_TAG_INVALID 26 + #define KASAN_KMALLOC_FREETRACK KASAN_TAG_INVALID 25 27 #endif 26 28 27 - #define KASAN_GLOBAL_REDZONE 0xFA /* redzone for global variable */ 28 - #define KASAN_VMALLOC_INVALID 0xF9 /* unallocated space in vmapped page */ 29 + #define KASAN_GLOBAL_REDZONE 0xF9 /* redzone for global variable */ 30 + #define KASAN_VMALLOC_INVALID 0xF8 /* unallocated space in vmapped page */ 29 31 30 32 /* 31 33 * Stack redzone shadow values ··· 106 104 107 105 struct kasan_alloc_meta { 108 106 struct kasan_track alloc_track; 107 + #ifdef CONFIG_KASAN_GENERIC 108 + /* 109 + * call_rcu() call stack is stored into struct kasan_alloc_meta. 110 + * The free stack is stored into struct kasan_free_meta. 111 + */ 112 + depot_stack_handle_t aux_stack[2]; 113 + #else 109 114 struct kasan_track free_track[KASAN_NR_FREE_STACKS]; 115 + #endif 110 116 #ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY 111 117 u8 free_pointer_tag[KASAN_NR_FREE_STACKS]; 112 118 u8 free_track_idx; ··· 129 119 * Otherwise it might be used for the allocator freelist. 
130 120 */ 131 121 struct qlist_node quarantine_link; 122 + #ifdef CONFIG_KASAN_GENERIC 123 + struct kasan_track free_track; 124 + #endif 132 125 }; 133 126 134 127 struct kasan_alloc_meta *get_alloc_info(struct kmem_cache *cache, ··· 171 158 void kasan_report_invalid_free(void *object, unsigned long ip); 172 159 173 160 struct page *kasan_addr_to_page(const void *addr); 161 + 162 + depot_stack_handle_t kasan_save_stack(gfp_t flags); 163 + void kasan_set_track(struct kasan_track *track, gfp_t flags); 164 + void kasan_set_free_info(struct kmem_cache *cache, void *object, u8 tag); 165 + struct kasan_track *kasan_get_free_track(struct kmem_cache *cache, 166 + void *object, u8 tag); 174 167 175 168 #if defined(CONFIG_KASAN_GENERIC) && \ 176 169 (defined(CONFIG_SLAB) || defined(CONFIG_SLUB))
+1
mm/kasan/quarantine.c
··· 145 145 if (IS_ENABLED(CONFIG_SLAB)) 146 146 local_irq_save(flags); 147 147 148 + *(u8 *)kasan_mem_to_shadow(object) = KASAN_KMALLOC_FREE; 148 149 ___cache_free(cache, object, _THIS_IP_); 149 150 150 151 if (IS_ENABLED(CONFIG_SLAB))
+27 -27
mm/kasan/report.c
··· 106 106 kasan_enable_current(); 107 107 } 108 108 109 + static void print_stack(depot_stack_handle_t stack) 110 + { 111 + unsigned long *entries; 112 + unsigned int nr_entries; 113 + 114 + nr_entries = stack_depot_fetch(stack, &entries); 115 + stack_trace_print(entries, nr_entries, 0); 116 + } 117 + 109 118 static void print_track(struct kasan_track *track, const char *prefix) 110 119 { 111 120 pr_err("%s by task %u:\n", prefix, track->pid); 112 121 if (track->stack) { 113 - unsigned long *entries; 114 - unsigned int nr_entries; 115 - 116 - nr_entries = stack_depot_fetch(track->stack, &entries); 117 - stack_trace_print(entries, nr_entries, 0); 122 + print_stack(track->stack); 118 123 } else { 119 124 pr_err("(stack is not available)\n"); 120 125 } ··· 165 160 (void *)(object_addr + cache->object_size)); 166 161 } 167 162 168 - static struct kasan_track *kasan_get_free_track(struct kmem_cache *cache, 169 - void *object, u8 tag) 170 - { 171 - struct kasan_alloc_meta *alloc_meta; 172 - int i = 0; 173 - 174 - alloc_meta = get_alloc_info(cache, object); 175 - 176 - #ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY 177 - for (i = 0; i < KASAN_NR_FREE_STACKS; i++) { 178 - if (alloc_meta->free_pointer_tag[i] == tag) 179 - break; 180 - } 181 - if (i == KASAN_NR_FREE_STACKS) 182 - i = alloc_meta->free_track_idx; 183 - #endif 184 - 185 - return &alloc_meta->free_track[i]; 186 - } 187 - 188 163 static void describe_object(struct kmem_cache *cache, void *object, 189 164 const void *addr, u8 tag) 190 165 { ··· 176 191 print_track(&alloc_info->alloc_track, "Allocated"); 177 192 pr_err("\n"); 178 193 free_track = kasan_get_free_track(cache, object, tag); 179 - print_track(free_track, "Freed"); 180 - pr_err("\n"); 194 + if (free_track) { 195 + print_track(free_track, "Freed"); 196 + pr_err("\n"); 197 + } 198 + 199 + #ifdef CONFIG_KASAN_GENERIC 200 + if (alloc_info->aux_stack[0]) { 201 + pr_err("Last call_rcu():\n"); 202 + print_stack(alloc_info->aux_stack[0]); 203 + pr_err("\n"); 204 + } 
205 + if (alloc_info->aux_stack[1]) { 206 + pr_err("Second to last call_rcu():\n"); 207 + print_stack(alloc_info->aux_stack[1]); 208 + pr_err("\n"); 209 + } 210 + #endif 181 211 } 182 212 183 213 describe_object_addr(cache, object, addr);
+37
mm/kasan/tags.c
··· 161 161 kasan_poison_shadow((void *)addr, size, tag); 162 162 } 163 163 EXPORT_SYMBOL(__hwasan_tag_memory); 164 + 165 + void kasan_set_free_info(struct kmem_cache *cache, 166 + void *object, u8 tag) 167 + { 168 + struct kasan_alloc_meta *alloc_meta; 169 + u8 idx = 0; 170 + 171 + alloc_meta = get_alloc_info(cache, object); 172 + 173 + #ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY 174 + idx = alloc_meta->free_track_idx; 175 + alloc_meta->free_pointer_tag[idx] = tag; 176 + alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS; 177 + #endif 178 + 179 + kasan_set_track(&alloc_meta->free_track[idx], GFP_NOWAIT); 180 + } 181 + 182 + struct kasan_track *kasan_get_free_track(struct kmem_cache *cache, 183 + void *object, u8 tag) 184 + { 185 + struct kasan_alloc_meta *alloc_meta; 186 + int i = 0; 187 + 188 + alloc_meta = get_alloc_info(cache, object); 189 + 190 + #ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY 191 + for (i = 0; i < KASAN_NR_FREE_STACKS; i++) { 192 + if (alloc_meta->free_pointer_tag[i] == tag) 193 + break; 194 + } 195 + if (i == KASAN_NR_FREE_STACKS) 196 + i = alloc_meta->free_track_idx; 197 + #endif 198 + 199 + return &alloc_meta->free_track[i]; 200 + }
+35 -40
mm/khugepaged.c
··· 431 431 432 432 static inline int khugepaged_test_exit(struct mm_struct *mm) 433 433 { 434 - return atomic_read(&mm->mm_users) == 0; 434 + return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm); 435 435 } 436 436 437 437 static bool hugepage_vma_check(struct vm_area_struct *vma, ··· 1100 1100 * handled by the anon_vma lock + PG_lock. 1101 1101 */ 1102 1102 mmap_write_lock(mm); 1103 - result = SCAN_ANY_PROCESS; 1104 - if (!mmget_still_valid(mm)) 1105 - goto out; 1106 1103 result = hugepage_vma_revalidate(mm, address, &vma); 1107 1104 if (result) 1108 1105 goto out; ··· 1409 1412 { 1410 1413 unsigned long haddr = addr & HPAGE_PMD_MASK; 1411 1414 struct vm_area_struct *vma = find_vma(mm, haddr); 1412 - struct page *hpage = NULL; 1415 + struct page *hpage; 1413 1416 pte_t *start_pte, *pte; 1414 1417 pmd_t *pmd, _pmd; 1415 1418 spinlock_t *ptl; ··· 1429 1432 if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE)) 1430 1433 return; 1431 1434 1435 + hpage = find_lock_page(vma->vm_file->f_mapping, 1436 + linear_page_index(vma, haddr)); 1437 + if (!hpage) 1438 + return; 1439 + 1440 + if (!PageHead(hpage)) 1441 + goto drop_hpage; 1442 + 1432 1443 pmd = mm_find_pmd(mm, haddr); 1433 1444 if (!pmd) 1434 - return; 1445 + goto drop_hpage; 1435 1446 1436 1447 start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl); 1437 1448 ··· 1458 1453 1459 1454 page = vm_normal_page(vma, addr, *pte); 1460 1455 1461 - if (!page || !PageCompound(page)) 1462 - goto abort; 1463 - 1464 - if (!hpage) { 1465 - hpage = compound_head(page); 1466 - /* 1467 - * The mapping of the THP should not change. 1468 - * 1469 - * Note that uprobe, debugger, or MAP_PRIVATE may 1470 - * change the page table, but the new page will 1471 - * not pass PageCompound() check. 1472 - */ 1473 - if (WARN_ON(hpage->mapping != vma->vm_file->f_mapping)) 1474 - goto abort; 1475 - } 1476 - 1477 1456 /* 1478 - * Confirm the page maps to the correct subpage. 
1479 - * 1480 - * Note that uprobe, debugger, or MAP_PRIVATE may change 1481 - * the page table, but the new page will not pass 1482 - * PageCompound() check. 1457 + * Note that uprobe, debugger, or MAP_PRIVATE may change the 1458 + * page table, but the new page will not be a subpage of hpage. 1483 1459 */ 1484 - if (WARN_ON(hpage + i != page)) 1460 + if (hpage + i != page) 1485 1461 goto abort; 1486 1462 count++; 1487 1463 } ··· 1481 1495 pte_unmap_unlock(start_pte, ptl); 1482 1496 1483 1497 /* step 3: set proper refcount and mm_counters. */ 1484 - if (hpage) { 1498 + if (count) { 1485 1499 page_ref_sub(hpage, count); 1486 1500 add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count); 1487 1501 } 1488 1502 1489 1503 /* step 4: collapse pmd */ 1490 1504 ptl = pmd_lock(vma->vm_mm, pmd); 1491 - _pmd = pmdp_collapse_flush(vma, addr, pmd); 1505 + _pmd = pmdp_collapse_flush(vma, haddr, pmd); 1492 1506 spin_unlock(ptl); 1493 1507 mm_dec_nr_ptes(mm); 1494 1508 pte_free(mm, pmd_pgtable(_pmd)); 1509 + 1510 + drop_hpage: 1511 + unlock_page(hpage); 1512 + put_page(hpage); 1495 1513 return; 1496 1514 1497 1515 abort: 1498 1516 pte_unmap_unlock(start_pte, ptl); 1517 + goto drop_hpage; 1499 1518 } 1500 1519 1501 1520 static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot) ··· 1529 1538 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) 1530 1539 { 1531 1540 struct vm_area_struct *vma; 1541 + struct mm_struct *mm; 1532 1542 unsigned long addr; 1533 1543 pmd_t *pmd, _pmd; 1534 1544 ··· 1558 1566 continue; 1559 1567 if (vma->vm_end < addr + HPAGE_PMD_SIZE) 1560 1568 continue; 1561 - pmd = mm_find_pmd(vma->vm_mm, addr); 1569 + mm = vma->vm_mm; 1570 + pmd = mm_find_pmd(mm, addr); 1562 1571 if (!pmd) 1563 1572 continue; 1564 1573 /* ··· 1569 1576 * mmap_lock while holding page lock. Fault path does it in 1570 1577 * reverse order. Trylock is a way to avoid deadlock. 
1571 1578 */ 1572 - if (mmap_write_trylock(vma->vm_mm)) { 1573 - spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd); 1574 - /* assume page table is clear */ 1575 - _pmd = pmdp_collapse_flush(vma, addr, pmd); 1576 - spin_unlock(ptl); 1577 - mmap_write_unlock(vma->vm_mm); 1578 - mm_dec_nr_ptes(vma->vm_mm); 1579 - pte_free(vma->vm_mm, pmd_pgtable(_pmd)); 1579 + if (mmap_write_trylock(mm)) { 1580 + if (!khugepaged_test_exit(mm)) { 1581 + spinlock_t *ptl = pmd_lock(mm, pmd); 1582 + /* assume page table is clear */ 1583 + _pmd = pmdp_collapse_flush(vma, addr, pmd); 1584 + spin_unlock(ptl); 1585 + mm_dec_nr_ptes(mm); 1586 + pte_free(mm, pmd_pgtable(_pmd)); 1587 + } 1588 + mmap_write_unlock(mm); 1580 1589 } else { 1581 1590 /* Try again later */ 1582 - khugepaged_add_pte_mapped_thp(vma->vm_mm, addr); 1591 + khugepaged_add_pte_mapped_thp(mm, addr); 1583 1592 } 1584 1593 } 1585 1594 i_mmap_unlock_write(mapping);
+480 -286
mm/memcontrol.c
··· 73 73 74 74 struct mem_cgroup *root_mem_cgroup __read_mostly; 75 75 76 - #define MEM_CGROUP_RECLAIM_RETRIES 5 77 - 78 76 /* Socket memory accounting disabled? */ 79 77 static bool cgroup_memory_nosocket; 80 78 ··· 255 257 } 256 258 257 259 #ifdef CONFIG_MEMCG_KMEM 260 + extern spinlock_t css_set_lock; 261 + 262 + static void obj_cgroup_release(struct percpu_ref *ref) 263 + { 264 + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt); 265 + struct mem_cgroup *memcg; 266 + unsigned int nr_bytes; 267 + unsigned int nr_pages; 268 + unsigned long flags; 269 + 270 + /* 271 + * At this point all allocated objects are freed, and 272 + * objcg->nr_charged_bytes can't have an arbitrary byte value. 273 + * However, it can be PAGE_SIZE or (x * PAGE_SIZE). 274 + * 275 + * The following sequence can lead to it: 276 + * 1) CPU0: objcg == stock->cached_objcg 277 + * 2) CPU1: we do a small allocation (e.g. 92 bytes), 278 + * PAGE_SIZE bytes are charged 279 + * 3) CPU1: a process from another memcg is allocating something, 280 + * the stock if flushed, 281 + * objcg->nr_charged_bytes = PAGE_SIZE - 92 282 + * 5) CPU0: we do release this object, 283 + * 92 bytes are added to stock->nr_bytes 284 + * 6) CPU0: stock is flushed, 285 + * 92 bytes are added to objcg->nr_charged_bytes 286 + * 287 + * In the result, nr_charged_bytes == PAGE_SIZE. 288 + * This page will be uncharged in obj_cgroup_release(). 
289 + */ 290 + nr_bytes = atomic_read(&objcg->nr_charged_bytes); 291 + WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); 292 + nr_pages = nr_bytes >> PAGE_SHIFT; 293 + 294 + spin_lock_irqsave(&css_set_lock, flags); 295 + memcg = obj_cgroup_memcg(objcg); 296 + if (nr_pages) 297 + __memcg_kmem_uncharge(memcg, nr_pages); 298 + list_del(&objcg->list); 299 + mem_cgroup_put(memcg); 300 + spin_unlock_irqrestore(&css_set_lock, flags); 301 + 302 + percpu_ref_exit(ref); 303 + kfree_rcu(objcg, rcu); 304 + } 305 + 306 + static struct obj_cgroup *obj_cgroup_alloc(void) 307 + { 308 + struct obj_cgroup *objcg; 309 + int ret; 310 + 311 + objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL); 312 + if (!objcg) 313 + return NULL; 314 + 315 + ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0, 316 + GFP_KERNEL); 317 + if (ret) { 318 + kfree(objcg); 319 + return NULL; 320 + } 321 + INIT_LIST_HEAD(&objcg->list); 322 + return objcg; 323 + } 324 + 325 + static void memcg_reparent_objcgs(struct mem_cgroup *memcg, 326 + struct mem_cgroup *parent) 327 + { 328 + struct obj_cgroup *objcg, *iter; 329 + 330 + objcg = rcu_replace_pointer(memcg->objcg, NULL, true); 331 + 332 + spin_lock_irq(&css_set_lock); 333 + 334 + /* Move active objcg to the parent's list */ 335 + xchg(&objcg->memcg, parent); 336 + css_get(&parent->css); 337 + list_add(&objcg->list, &parent->objcg_list); 338 + 339 + /* Move already reparented objcgs to the parent's list */ 340 + list_for_each_entry(iter, &memcg->objcg_list, list) { 341 + css_get(&parent->css); 342 + xchg(&iter->memcg, parent); 343 + css_put(&memcg->css); 344 + } 345 + list_splice(&memcg->objcg_list, &parent->objcg_list); 346 + 347 + spin_unlock_irq(&css_set_lock); 348 + 349 + percpu_ref_kill(&objcg->refcnt); 350 + } 351 + 258 352 /* 259 - * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. 353 + * This will be used as a shrinker list's index. 
260 354 * The main reason for not using cgroup id for this: 261 355 * this works better in sparse environments, where we have a lot of memcgs, 262 356 * but only a few kmem-limited. Or also, if we have, for instance, 200 ··· 391 301 392 302 /* 393 303 * A lot of the calls to the cache allocation functions are expected to be 394 - * inlined by the compiler. Since the calls to memcg_kmem_get_cache are 304 + * inlined by the compiler. Since the calls to memcg_slab_pre_alloc_hook() are 395 305 * conditional to this static branch, we'll have to allow modules that does 396 306 * kmem_cache_alloc and the such to see this symbol as well 397 307 */ 398 308 DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key); 399 309 EXPORT_SYMBOL(memcg_kmem_enabled_key); 400 - 401 - struct workqueue_struct *memcg_kmem_cache_wq; 402 310 #endif 403 311 404 312 static int memcg_shrinker_map_size; ··· 565 477 unsigned long ino = 0; 566 478 567 479 rcu_read_lock(); 568 - if (PageSlab(page) && !PageTail(page)) 569 - memcg = memcg_from_slab_page(page); 570 - else 571 - memcg = READ_ONCE(page->mem_cgroup); 480 + memcg = page->mem_cgroup; 481 + 482 + /* 483 + * The lowest bit set means that memcg isn't a valid 484 + * memcg pointer, but a obj_cgroups pointer. 485 + * In this case the page is shared and doesn't belong 486 + * to any specific memory cgroup. 
487 + */ 488 + if ((unsigned long) memcg & 0x1UL) 489 + memcg = NULL; 490 + 572 491 while (memcg && !(memcg->css.flags & CSS_ONLINE)) 573 492 memcg = parent_mem_cgroup(memcg); 574 493 if (memcg) ··· 776 681 */ 777 682 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) 778 683 { 779 - long x; 684 + long x, threshold = MEMCG_CHARGE_BATCH; 780 685 781 686 if (mem_cgroup_disabled()) 782 687 return; 783 688 689 + if (vmstat_item_in_bytes(idx)) 690 + threshold <<= PAGE_SHIFT; 691 + 784 692 x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); 785 - if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { 693 + if (unlikely(abs(x) > threshold)) { 786 694 struct mem_cgroup *mi; 787 695 788 696 /* ··· 811 713 return mem_cgroup_nodeinfo(parent, nid); 812 714 } 813 715 716 + void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, 717 + int val) 718 + { 719 + struct mem_cgroup_per_node *pn; 720 + struct mem_cgroup *memcg; 721 + long x, threshold = MEMCG_CHARGE_BATCH; 722 + 723 + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 724 + memcg = pn->memcg; 725 + 726 + /* Update memcg */ 727 + __mod_memcg_state(memcg, idx, val); 728 + 729 + /* Update lruvec */ 730 + __this_cpu_add(pn->lruvec_stat_local->count[idx], val); 731 + 732 + if (vmstat_item_in_bytes(idx)) 733 + threshold <<= PAGE_SHIFT; 734 + 735 + x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); 736 + if (unlikely(abs(x) > threshold)) { 737 + pg_data_t *pgdat = lruvec_pgdat(lruvec); 738 + struct mem_cgroup_per_node *pi; 739 + 740 + for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) 741 + atomic_long_add(x, &pi->lruvec_stat[idx]); 742 + x = 0; 743 + } 744 + __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); 745 + } 746 + 814 747 /** 815 748 * __mod_lruvec_state - update lruvec memory statistics 816 749 * @lruvec: the lruvec ··· 855 726 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, 856 727 int val) 857 728 { 858 - pg_data_t *pgdat 
= lruvec_pgdat(lruvec); 859 - struct mem_cgroup_per_node *pn; 860 - struct mem_cgroup *memcg; 861 - long x; 862 - 863 729 /* Update node */ 864 - __mod_node_page_state(pgdat, idx, val); 730 + __mod_node_page_state(lruvec_pgdat(lruvec), idx, val); 865 731 866 - if (mem_cgroup_disabled()) 867 - return; 868 - 869 - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); 870 - memcg = pn->memcg; 871 - 872 - /* Update memcg */ 873 - __mod_memcg_state(memcg, idx, val); 874 - 875 - /* Update lruvec */ 876 - __this_cpu_add(pn->lruvec_stat_local->count[idx], val); 877 - 878 - x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]); 879 - if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) { 880 - struct mem_cgroup_per_node *pi; 881 - 882 - for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id)) 883 - atomic_long_add(x, &pi->lruvec_stat[idx]); 884 - x = 0; 885 - } 886 - __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x); 732 + /* Update memcg and lruvec */ 733 + if (!mem_cgroup_disabled()) 734 + __mod_memcg_lruvec_state(lruvec, idx, val); 887 735 } 888 736 889 737 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val) ··· 1483 1377 (u64)memcg_page_state(memcg, NR_FILE_PAGES) * 1484 1378 PAGE_SIZE); 1485 1379 seq_buf_printf(&s, "kernel_stack %llu\n", 1486 - (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) * 1380 + (u64)memcg_page_state(memcg, NR_KERNEL_STACK_KB) * 1487 1381 1024); 1488 1382 seq_buf_printf(&s, "slab %llu\n", 1489 - (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) + 1490 - memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) * 1491 - PAGE_SIZE); 1383 + (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) + 1384 + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B))); 1492 1385 seq_buf_printf(&s, "sock %llu\n", 1493 1386 (u64)memcg_page_state(memcg, MEMCG_SOCK) * 1494 1387 PAGE_SIZE); ··· 1517 1412 PAGE_SIZE); 1518 1413 1519 1414 seq_buf_printf(&s, "slab_reclaimable %llu\n", 1520 - (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) * 1521 - 
PAGE_SIZE); 1415 + (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B)); 1522 1416 seq_buf_printf(&s, "slab_unreclaimable %llu\n", 1523 - (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) * 1524 - PAGE_SIZE); 1417 + (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)); 1525 1418 1526 1419 /* Accumulated memory events */ 1527 1420 ··· 1663 1560 .gfp_mask = gfp_mask, 1664 1561 .order = order, 1665 1562 }; 1666 - bool ret; 1563 + bool ret = true; 1667 1564 1668 1565 if (mutex_lock_killable(&oom_lock)) 1669 1566 return true; 1567 + 1568 + if (mem_cgroup_margin(memcg) >= (1 << order)) 1569 + goto unlock; 1570 + 1670 1571 /* 1671 1572 * A few threads which were not waiting at mutex_lock_killable() can 1672 1573 * fail to bail out. Therefore, check again after holding oom_lock. 1673 1574 */ 1674 1575 ret = should_force_charge() || out_of_memory(&oc); 1576 + 1577 + unlock: 1675 1578 mutex_unlock(&oom_lock); 1676 1579 return ret; 1677 1580 } ··· 2148 2039 struct memcg_stock_pcp { 2149 2040 struct mem_cgroup *cached; /* this never be root cgroup */ 2150 2041 unsigned int nr_pages; 2042 + 2043 + #ifdef CONFIG_MEMCG_KMEM 2044 + struct obj_cgroup *cached_objcg; 2045 + unsigned int nr_bytes; 2046 + #endif 2047 + 2151 2048 struct work_struct work; 2152 2049 unsigned long flags; 2153 2050 #define FLUSHING_CACHED_CHARGE 0 2154 2051 }; 2155 2052 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock); 2156 2053 static DEFINE_MUTEX(percpu_charge_mutex); 2054 + 2055 + #ifdef CONFIG_MEMCG_KMEM 2056 + static void drain_obj_stock(struct memcg_stock_pcp *stock); 2057 + static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, 2058 + struct mem_cgroup *root_memcg); 2059 + 2060 + #else 2061 + static inline void drain_obj_stock(struct memcg_stock_pcp *stock) 2062 + { 2063 + } 2064 + static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, 2065 + struct mem_cgroup *root_memcg) 2066 + { 2067 + return false; 2068 + } 2069 + #endif 2157 2070 2158 2071 /** 2159 2072 * 
consume_stock: Try to consume stocked charge on this cpu. ··· 2217 2086 { 2218 2087 struct mem_cgroup *old = stock->cached; 2219 2088 2089 + if (!old) 2090 + return; 2091 + 2220 2092 if (stock->nr_pages) { 2221 2093 page_counter_uncharge(&old->memory, stock->nr_pages); 2222 2094 if (do_memsw_account()) 2223 2095 page_counter_uncharge(&old->memsw, stock->nr_pages); 2224 - css_put_many(&old->css, stock->nr_pages); 2225 2096 stock->nr_pages = 0; 2226 2097 } 2098 + 2099 + css_put(&old->css); 2227 2100 stock->cached = NULL; 2228 2101 } 2229 2102 ··· 2243 2108 local_irq_save(flags); 2244 2109 2245 2110 stock = this_cpu_ptr(&memcg_stock); 2111 + drain_obj_stock(stock); 2246 2112 drain_stock(stock); 2247 2113 clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); 2248 2114 ··· 2264 2128 stock = this_cpu_ptr(&memcg_stock); 2265 2129 if (stock->cached != memcg) { /* reset if necessary */ 2266 2130 drain_stock(stock); 2131 + css_get(&memcg->css); 2267 2132 stock->cached = memcg; 2268 2133 } 2269 2134 stock->nr_pages += nr_pages; ··· 2302 2165 memcg = stock->cached; 2303 2166 if (memcg && stock->nr_pages && 2304 2167 mem_cgroup_is_descendant(memcg, root_memcg)) 2168 + flush = true; 2169 + if (obj_stock_flush_required(stock, root_memcg)) 2305 2170 flush = true; 2306 2171 rcu_read_unlock(); 2307 2172 ··· 2367 2228 return 0; 2368 2229 } 2369 2230 2370 - static void reclaim_high(struct mem_cgroup *memcg, 2371 - unsigned int nr_pages, 2372 - gfp_t gfp_mask) 2231 + static unsigned long reclaim_high(struct mem_cgroup *memcg, 2232 + unsigned int nr_pages, 2233 + gfp_t gfp_mask) 2373 2234 { 2235 + unsigned long nr_reclaimed = 0; 2236 + 2374 2237 do { 2238 + unsigned long pflags; 2239 + 2375 2240 if (page_counter_read(&memcg->memory) <= 2376 2241 READ_ONCE(memcg->memory.high)) 2377 2242 continue; 2243 + 2378 2244 memcg_memory_event(memcg, MEMCG_HIGH); 2379 - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); 2245 + 2246 + psi_memstall_enter(&pflags); 2247 + nr_reclaimed += 
try_to_free_mem_cgroup_pages(memcg, nr_pages, 2248 + gfp_mask, true); 2249 + psi_memstall_leave(&pflags); 2380 2250 } while ((memcg = parent_mem_cgroup(memcg)) && 2381 2251 !mem_cgroup_is_root(memcg)); 2252 + 2253 + return nr_reclaimed; 2382 2254 } 2383 2255 2384 2256 static void high_work_func(struct work_struct *work) ··· 2545 2395 { 2546 2396 unsigned long penalty_jiffies; 2547 2397 unsigned long pflags; 2398 + unsigned long nr_reclaimed; 2548 2399 unsigned int nr_pages = current->memcg_nr_pages_over_high; 2400 + int nr_retries = MAX_RECLAIM_RETRIES; 2549 2401 struct mem_cgroup *memcg; 2402 + bool in_retry = false; 2550 2403 2551 2404 if (likely(!nr_pages)) 2552 2405 return; 2553 2406 2554 2407 memcg = get_mem_cgroup_from_mm(current->mm); 2555 - reclaim_high(memcg, nr_pages, GFP_KERNEL); 2556 2408 current->memcg_nr_pages_over_high = 0; 2409 + 2410 + retry_reclaim: 2411 + /* 2412 + * The allocating task should reclaim at least the batch size, but for 2413 + * subsequent retries we only want to do what's necessary to prevent oom 2414 + * or breaching resource isolation. 2415 + * 2416 + * This is distinct from memory.max or page allocator behaviour because 2417 + * memory.high is currently batched, whereas memory.max and the page 2418 + * allocator run every time an allocation is made. 2419 + */ 2420 + nr_reclaimed = reclaim_high(memcg, 2421 + in_retry ? SWAP_CLUSTER_MAX : nr_pages, 2422 + GFP_KERNEL); 2557 2423 2558 2424 /* 2559 2425 * memory.high is breached and reclaim is unable to keep up. Throttle ··· 2598 2432 goto out; 2599 2433 2600 2434 /* 2435 + * If reclaim is making forward progress but we're still over 2436 + * memory.high, we want to encourage that rather than doing allocator 2437 + * throttling. 2438 + */ 2439 + if (nr_reclaimed || nr_retries--) { 2440 + in_retry = true; 2441 + goto retry_reclaim; 2442 + } 2443 + 2444 + /* 2601 2445 * If we exit early, we're guaranteed to die (since 2602 2446 * schedule_timeout_killable sets TASK_KILLABLE). 
This means we don't 2603 2447 * need to account for any ill-begotten jiffies to pay them off later. ··· 2624 2448 unsigned int nr_pages) 2625 2449 { 2626 2450 unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages); 2627 - int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 2451 + int nr_retries = MAX_RECLAIM_RETRIES; 2628 2452 struct mem_cgroup *mem_over_limit; 2629 2453 struct page_counter *counter; 2454 + enum oom_status oom_status; 2630 2455 unsigned long nr_reclaimed; 2631 2456 bool may_swap = true; 2632 2457 bool drained = false; 2633 - enum oom_status oom_status; 2458 + unsigned long pflags; 2634 2459 2635 2460 if (mem_cgroup_is_root(memcg)) 2636 2461 return 0; ··· 2691 2514 2692 2515 memcg_memory_event(mem_over_limit, MEMCG_MAX); 2693 2516 2517 + psi_memstall_enter(&pflags); 2694 2518 nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, 2695 2519 gfp_mask, may_swap); 2520 + psi_memstall_leave(&pflags); 2696 2521 2697 2522 if (mem_cgroup_margin(mem_over_limit) >= nr_pages) 2698 2523 goto retry; ··· 2746 2567 get_order(nr_pages * PAGE_SIZE)); 2747 2568 switch (oom_status) { 2748 2569 case OOM_SUCCESS: 2749 - nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 2570 + nr_retries = MAX_RECLAIM_RETRIES; 2750 2571 goto retry; 2751 2572 case OOM_FAILED: 2752 2573 goto force; ··· 2765 2586 page_counter_charge(&memcg->memory, nr_pages); 2766 2587 if (do_memsw_account()) 2767 2588 page_counter_charge(&memcg->memsw, nr_pages); 2768 - css_get_many(&memcg->css, nr_pages); 2769 2589 2770 2590 return 0; 2771 2591 2772 2592 done_restock: 2773 - css_get_many(&memcg->css, batch); 2774 2593 if (batch > nr_pages) 2775 2594 refill_stock(memcg, batch - nr_pages); 2776 2595 ··· 2826 2649 page_counter_uncharge(&memcg->memory, nr_pages); 2827 2650 if (do_memsw_account()) 2828 2651 page_counter_uncharge(&memcg->memsw, nr_pages); 2829 - 2830 - css_put_many(&memcg->css, nr_pages); 2831 2652 } 2832 2653 #endif 2833 2654 ··· 2844 2669 } 2845 2670 2846 2671 #ifdef CONFIG_MEMCG_KMEM 2672 + 
int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, 2673 + gfp_t gfp) 2674 + { 2675 + unsigned int objects = objs_per_slab_page(s, page); 2676 + void *vec; 2677 + 2678 + vec = kcalloc_node(objects, sizeof(struct obj_cgroup *), gfp, 2679 + page_to_nid(page)); 2680 + if (!vec) 2681 + return -ENOMEM; 2682 + 2683 + if (cmpxchg(&page->obj_cgroups, NULL, 2684 + (struct obj_cgroup **) ((unsigned long)vec | 0x1UL))) 2685 + kfree(vec); 2686 + else 2687 + kmemleak_not_leak(vec); 2688 + 2689 + return 0; 2690 + } 2691 + 2847 2692 /* 2848 2693 * Returns a pointer to the memory cgroup to which the kernel object is charged. 2849 2694 * ··· 2880 2685 page = virt_to_head_page(p); 2881 2686 2882 2687 /* 2883 - * Slab pages don't have page->mem_cgroup set because corresponding 2884 - * kmem caches can be reparented during the lifetime. That's why 2885 - * memcg_from_slab_page() should be used instead. 2688 + * Slab objects are accounted individually, not per-page. 2689 + * Memcg membership data for each individual object is saved in 2690 + * the page->obj_cgroups. 
2886 2691 */ 2887 - if (PageSlab(page)) 2888 - return memcg_from_slab_page(page); 2692 + if (page_has_obj_cgroups(page)) { 2693 + struct obj_cgroup *objcg; 2694 + unsigned int off; 2695 + 2696 + off = obj_to_index(page->slab_cache, page, p); 2697 + objcg = page_obj_cgroups(page)[off]; 2698 + if (objcg) 2699 + return obj_cgroup_memcg(objcg); 2700 + 2701 + return NULL; 2702 + } 2889 2703 2890 2704 /* All other pages use page->mem_cgroup */ 2891 2705 return page->mem_cgroup; 2706 + } 2707 + 2708 + __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void) 2709 + { 2710 + struct obj_cgroup *objcg = NULL; 2711 + struct mem_cgroup *memcg; 2712 + 2713 + if (unlikely(!current->mm && !current->active_memcg)) 2714 + return NULL; 2715 + 2716 + rcu_read_lock(); 2717 + if (unlikely(current->active_memcg)) 2718 + memcg = rcu_dereference(current->active_memcg); 2719 + else 2720 + memcg = mem_cgroup_from_task(current); 2721 + 2722 + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { 2723 + objcg = rcu_dereference(memcg->objcg); 2724 + if (objcg && obj_cgroup_tryget(objcg)) 2725 + break; 2726 + } 2727 + rcu_read_unlock(); 2728 + 2729 + return objcg; 2892 2730 } 2893 2731 2894 2732 static int memcg_alloc_cache_id(void) ··· 2949 2721 else if (size > MEMCG_CACHES_MAX_SIZE) 2950 2722 size = MEMCG_CACHES_MAX_SIZE; 2951 2723 2952 - err = memcg_update_all_caches(size); 2953 - if (!err) 2954 - err = memcg_update_all_list_lrus(size); 2724 + err = memcg_update_all_list_lrus(size); 2955 2725 if (!err) 2956 2726 memcg_nr_cache_ids = size; 2957 2727 ··· 2965 2739 static void memcg_free_cache_id(int id) 2966 2740 { 2967 2741 ida_simple_remove(&memcg_cache_ida, id); 2968 - } 2969 - 2970 - struct memcg_kmem_cache_create_work { 2971 - struct mem_cgroup *memcg; 2972 - struct kmem_cache *cachep; 2973 - struct work_struct work; 2974 - }; 2975 - 2976 - static void memcg_kmem_cache_create_func(struct work_struct *w) 2977 - { 2978 - struct memcg_kmem_cache_create_work *cw = 
2979 - container_of(w, struct memcg_kmem_cache_create_work, work); 2980 - struct mem_cgroup *memcg = cw->memcg; 2981 - struct kmem_cache *cachep = cw->cachep; 2982 - 2983 - memcg_create_kmem_cache(memcg, cachep); 2984 - 2985 - css_put(&memcg->css); 2986 - kfree(cw); 2987 - } 2988 - 2989 - /* 2990 - * Enqueue the creation of a per-memcg kmem_cache. 2991 - */ 2992 - static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg, 2993 - struct kmem_cache *cachep) 2994 - { 2995 - struct memcg_kmem_cache_create_work *cw; 2996 - 2997 - if (!css_tryget_online(&memcg->css)) 2998 - return; 2999 - 3000 - cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN); 3001 - if (!cw) { 3002 - css_put(&memcg->css); 3003 - return; 3004 - } 3005 - 3006 - cw->memcg = memcg; 3007 - cw->cachep = cachep; 3008 - INIT_WORK(&cw->work, memcg_kmem_cache_create_func); 3009 - 3010 - queue_work(memcg_kmem_cache_wq, &cw->work); 3011 - } 3012 - 3013 - static inline bool memcg_kmem_bypass(void) 3014 - { 3015 - if (in_interrupt()) 3016 - return true; 3017 - 3018 - /* Allow remote memcg charging in kthread contexts. */ 3019 - if ((!current->mm || (current->flags & PF_KTHREAD)) && 3020 - !current->active_memcg) 3021 - return true; 3022 - return false; 3023 - } 3024 - 3025 - /** 3026 - * memcg_kmem_get_cache: select the correct per-memcg cache for allocation 3027 - * @cachep: the original global kmem cache 3028 - * 3029 - * Return the kmem_cache we're supposed to use for a slab allocation. 3030 - * We try to use the current memcg's version of the cache. 3031 - * 3032 - * If the cache does not exist yet, if we are the first user of it, we 3033 - * create it asynchronously in a workqueue and let the current allocation 3034 - * go through with the original cache. 3035 - * 3036 - * This function takes a reference to the cache it returns to assure it 3037 - * won't get destroyed while we are working with it. 
Once the caller is 3038 - * done with it, memcg_kmem_put_cache() must be called to release the 3039 - * reference. 3040 - */ 3041 - struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep) 3042 - { 3043 - struct mem_cgroup *memcg; 3044 - struct kmem_cache *memcg_cachep; 3045 - struct memcg_cache_array *arr; 3046 - int kmemcg_id; 3047 - 3048 - VM_BUG_ON(!is_root_cache(cachep)); 3049 - 3050 - if (memcg_kmem_bypass()) 3051 - return cachep; 3052 - 3053 - rcu_read_lock(); 3054 - 3055 - if (unlikely(current->active_memcg)) 3056 - memcg = current->active_memcg; 3057 - else 3058 - memcg = mem_cgroup_from_task(current); 3059 - 3060 - if (!memcg || memcg == root_mem_cgroup) 3061 - goto out_unlock; 3062 - 3063 - kmemcg_id = READ_ONCE(memcg->kmemcg_id); 3064 - if (kmemcg_id < 0) 3065 - goto out_unlock; 3066 - 3067 - arr = rcu_dereference(cachep->memcg_params.memcg_caches); 3068 - 3069 - /* 3070 - * Make sure we will access the up-to-date value. The code updating 3071 - * memcg_caches issues a write barrier to match the data dependency 3072 - * barrier inside READ_ONCE() (see memcg_create_kmem_cache()). 3073 - */ 3074 - memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]); 3075 - 3076 - /* 3077 - * If we are in a safe context (can wait, and not in interrupt 3078 - * context), we could be be predictable and return right away. 3079 - * This would guarantee that the allocation being performed 3080 - * already belongs in the new cache. 3081 - * 3082 - * However, there are some clashes that can arrive from locking. 3083 - * For instance, because we acquire the slab_mutex while doing 3084 - * memcg_create_kmem_cache, this means no further allocation 3085 - * could happen with the slab_mutex held. So it's better to 3086 - * defer everything. 3087 - * 3088 - * If the memcg is dying or memcg_cache is about to be released, 3089 - * don't bother creating new kmem_caches. 
Because memcg_cachep 3090 - * is ZEROed as the fist step of kmem offlining, we don't need 3091 - * percpu_ref_tryget_live() here. css_tryget_online() check in 3092 - * memcg_schedule_kmem_cache_create() will prevent us from 3093 - * creation of a new kmem_cache. 3094 - */ 3095 - if (unlikely(!memcg_cachep)) 3096 - memcg_schedule_kmem_cache_create(memcg, cachep); 3097 - else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) 3098 - cachep = memcg_cachep; 3099 - out_unlock: 3100 - rcu_read_unlock(); 3101 - return cachep; 3102 - } 3103 - 3104 - /** 3105 - * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache 3106 - * @cachep: the cache returned by memcg_kmem_get_cache 3107 - */ 3108 - void memcg_kmem_put_cache(struct kmem_cache *cachep) 3109 - { 3110 - if (!is_root_cache(cachep)) 3111 - percpu_ref_put(&cachep->memcg_params.refcnt); 3112 2742 } 3113 2743 3114 2744 /** ··· 3040 2958 if (!ret) { 3041 2959 page->mem_cgroup = memcg; 3042 2960 __SetPageKmemcg(page); 2961 + return 0; 3043 2962 } 3044 2963 } 3045 2964 css_put(&memcg->css); ··· 3063 2980 VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); 3064 2981 __memcg_kmem_uncharge(memcg, nr_pages); 3065 2982 page->mem_cgroup = NULL; 2983 + css_put(&memcg->css); 3066 2984 3067 2985 /* slab pages do not have PageKmemcg flag set */ 3068 2986 if (PageKmemcg(page)) 3069 2987 __ClearPageKmemcg(page); 3070 - 3071 - css_put_many(&memcg->css, nr_pages); 3072 2988 } 2989 + 2990 + static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) 2991 + { 2992 + struct memcg_stock_pcp *stock; 2993 + unsigned long flags; 2994 + bool ret = false; 2995 + 2996 + local_irq_save(flags); 2997 + 2998 + stock = this_cpu_ptr(&memcg_stock); 2999 + if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) { 3000 + stock->nr_bytes -= nr_bytes; 3001 + ret = true; 3002 + } 3003 + 3004 + local_irq_restore(flags); 3005 + 3006 + return ret; 3007 + } 3008 + 3009 + static void drain_obj_stock(struct 
memcg_stock_pcp *stock) 3010 + { 3011 + struct obj_cgroup *old = stock->cached_objcg; 3012 + 3013 + if (!old) 3014 + return; 3015 + 3016 + if (stock->nr_bytes) { 3017 + unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT; 3018 + unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1); 3019 + 3020 + if (nr_pages) { 3021 + rcu_read_lock(); 3022 + __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages); 3023 + rcu_read_unlock(); 3024 + } 3025 + 3026 + /* 3027 + * The leftover is flushed to the centralized per-memcg value. 3028 + * On the next attempt to refill obj stock it will be moved 3029 + * to a per-cpu stock (probably, on an other CPU), see 3030 + * refill_obj_stock(). 3031 + * 3032 + * How often it's flushed is a trade-off between the memory 3033 + * limit enforcement accuracy and potential CPU contention, 3034 + * so it might be changed in the future. 3035 + */ 3036 + atomic_add(nr_bytes, &old->nr_charged_bytes); 3037 + stock->nr_bytes = 0; 3038 + } 3039 + 3040 + obj_cgroup_put(old); 3041 + stock->cached_objcg = NULL; 3042 + } 3043 + 3044 + static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, 3045 + struct mem_cgroup *root_memcg) 3046 + { 3047 + struct mem_cgroup *memcg; 3048 + 3049 + if (stock->cached_objcg) { 3050 + memcg = obj_cgroup_memcg(stock->cached_objcg); 3051 + if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) 3052 + return true; 3053 + } 3054 + 3055 + return false; 3056 + } 3057 + 3058 + static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) 3059 + { 3060 + struct memcg_stock_pcp *stock; 3061 + unsigned long flags; 3062 + 3063 + local_irq_save(flags); 3064 + 3065 + stock = this_cpu_ptr(&memcg_stock); 3066 + if (stock->cached_objcg != objcg) { /* reset if necessary */ 3067 + drain_obj_stock(stock); 3068 + obj_cgroup_get(objcg); 3069 + stock->cached_objcg = objcg; 3070 + stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0); 3071 + } 3072 + stock->nr_bytes += nr_bytes; 3073 + 3074 + if 
(stock->nr_bytes > PAGE_SIZE) 3075 + drain_obj_stock(stock); 3076 + 3077 + local_irq_restore(flags); 3078 + } 3079 + 3080 + int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) 3081 + { 3082 + struct mem_cgroup *memcg; 3083 + unsigned int nr_pages, nr_bytes; 3084 + int ret; 3085 + 3086 + if (consume_obj_stock(objcg, size)) 3087 + return 0; 3088 + 3089 + /* 3090 + * In theory, memcg->nr_charged_bytes can have enough 3091 + * pre-charged bytes to satisfy the allocation. However, 3092 + * flushing memcg->nr_charged_bytes requires two atomic 3093 + * operations, and memcg->nr_charged_bytes can't be big, 3094 + * so it's better to ignore it and try grab some new pages. 3095 + * memcg->nr_charged_bytes will be flushed in 3096 + * refill_obj_stock(), called from this function or 3097 + * independently later. 3098 + */ 3099 + rcu_read_lock(); 3100 + memcg = obj_cgroup_memcg(objcg); 3101 + css_get(&memcg->css); 3102 + rcu_read_unlock(); 3103 + 3104 + nr_pages = size >> PAGE_SHIFT; 3105 + nr_bytes = size & (PAGE_SIZE - 1); 3106 + 3107 + if (nr_bytes) 3108 + nr_pages += 1; 3109 + 3110 + ret = __memcg_kmem_charge(memcg, gfp, nr_pages); 3111 + if (!ret && nr_bytes) 3112 + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes); 3113 + 3114 + css_put(&memcg->css); 3115 + return ret; 3116 + } 3117 + 3118 + void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) 3119 + { 3120 + refill_obj_stock(objcg, size); 3121 + } 3122 + 3073 3123 #endif /* CONFIG_MEMCG_KMEM */ 3074 3124 3075 3125 #ifdef CONFIG_TRANSPARENT_HUGEPAGE ··· 3213 2997 */ 3214 2998 void mem_cgroup_split_huge_fixup(struct page *head) 3215 2999 { 3000 + struct mem_cgroup *memcg = head->mem_cgroup; 3216 3001 int i; 3217 3002 3218 3003 if (mem_cgroup_disabled()) 3219 3004 return; 3220 3005 3221 - for (i = 1; i < HPAGE_PMD_NR; i++) 3222 - head[i].mem_cgroup = head->mem_cgroup; 3006 + for (i = 1; i < HPAGE_PMD_NR; i++) { 3007 + css_get(&memcg->css); 3008 + head[i].mem_cgroup = memcg; 3009 + } 3223 3010 } 
3224 3011 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 3225 3012 ··· 3426 3207 */ 3427 3208 static int mem_cgroup_force_empty(struct mem_cgroup *memcg) 3428 3209 { 3429 - int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 3210 + int nr_retries = MAX_RECLAIM_RETRIES; 3430 3211 3431 3212 /* we call try-to-free pages for make this cgroup empty */ 3432 3213 lru_add_drain_all(); ··· 3623 3404 #ifdef CONFIG_MEMCG_KMEM 3624 3405 static int memcg_online_kmem(struct mem_cgroup *memcg) 3625 3406 { 3407 + struct obj_cgroup *objcg; 3626 3408 int memcg_id; 3627 3409 3628 3410 if (cgroup_memory_nokmem) ··· 3636 3416 if (memcg_id < 0) 3637 3417 return memcg_id; 3638 3418 3639 - static_branch_inc(&memcg_kmem_enabled_key); 3419 + objcg = obj_cgroup_alloc(); 3420 + if (!objcg) { 3421 + memcg_free_cache_id(memcg_id); 3422 + return -ENOMEM; 3423 + } 3424 + objcg->memcg = memcg; 3425 + rcu_assign_pointer(memcg->objcg, objcg); 3426 + 3427 + static_branch_enable(&memcg_kmem_enabled_key); 3428 + 3640 3429 /* 3641 3430 * A memory cgroup is considered kmem-online as soon as it gets 3642 3431 * kmemcg_id. Setting the id after enabling static branching will ··· 3654 3425 */ 3655 3426 memcg->kmemcg_id = memcg_id; 3656 3427 memcg->kmem_state = KMEM_ONLINE; 3657 - INIT_LIST_HEAD(&memcg->kmem_caches); 3658 3428 3659 3429 return 0; 3660 3430 } ··· 3666 3438 3667 3439 if (memcg->kmem_state != KMEM_ONLINE) 3668 3440 return; 3669 - /* 3670 - * Clear the online state before clearing memcg_caches array 3671 - * entries. The slab_mutex in memcg_deactivate_kmem_caches() 3672 - * guarantees that no cache will be created for this cgroup 3673 - * after we are done (see memcg_create_kmem_cache()). 3674 - */ 3441 + 3675 3442 memcg->kmem_state = KMEM_ALLOCATED; 3676 3443 3677 3444 parent = parent_mem_cgroup(memcg); 3678 3445 if (!parent) 3679 3446 parent = root_mem_cgroup; 3680 3447 3681 - /* 3682 - * Deactivate and reparent kmem_caches. 
3683 - */ 3684 - memcg_deactivate_kmem_caches(memcg, parent); 3448 + memcg_reparent_objcgs(memcg, parent); 3685 3449 3686 3450 kmemcg_id = memcg->kmemcg_id; 3687 3451 BUG_ON(kmemcg_id < 0); ··· 3706 3486 /* css_alloc() failed, offlining didn't happen */ 3707 3487 if (unlikely(memcg->kmem_state == KMEM_ONLINE)) 3708 3488 memcg_offline_kmem(memcg); 3709 - 3710 - if (memcg->kmem_state == KMEM_ALLOCATED) { 3711 - WARN_ON(!list_empty(&memcg->kmem_caches)); 3712 - static_branch_dec(&memcg_kmem_enabled_key); 3713 - } 3714 3489 } 3715 3490 #else 3716 3491 static int memcg_online_kmem(struct mem_cgroup *memcg) ··· 5015 4800 (defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)) 5016 4801 { 5017 4802 .name = "kmem.slabinfo", 5018 - .seq_start = memcg_slab_start, 5019 - .seq_next = memcg_slab_next, 5020 - .seq_stop = memcg_slab_stop, 5021 4803 .seq_show = memcg_slab_show, 5022 4804 }, 5023 4805 #endif ··· 5234 5022 memcg->socket_pressure = jiffies; 5235 5023 #ifdef CONFIG_MEMCG_KMEM 5236 5024 memcg->kmemcg_id = -1; 5025 + INIT_LIST_HEAD(&memcg->objcg_list); 5237 5026 #endif 5238 5027 #ifdef CONFIG_CGROUP_WRITEBACK 5239 5028 INIT_LIST_HEAD(&memcg->cgwb_list); ··· 5297 5084 5298 5085 /* The following stuff does not apply to the root */ 5299 5086 if (!parent) { 5300 - #ifdef CONFIG_MEMCG_KMEM 5301 - INIT_LIST_HEAD(&memcg->kmem_caches); 5302 - #endif 5303 5087 root_mem_cgroup = memcg; 5304 5088 return &memcg->css; 5305 5089 } ··· 5658 5448 */ 5659 5449 smp_mb(); 5660 5450 5661 - page->mem_cgroup = to; /* caller should have done css_get */ 5451 + css_get(&to->css); 5452 + css_put(&from->css); 5453 + 5454 + page->mem_cgroup = to; 5662 5455 5663 5456 __unlock_page_memcg(from); 5664 5457 ··· 5881 5668 */ 5882 5669 if (!mem_cgroup_is_root(mc.to)) 5883 5670 page_counter_uncharge(&mc.to->memory, mc.moved_swap); 5884 - 5885 - css_put_many(&mc.to->css, mc.moved_swap); 5886 5671 5887 5672 mc.moved_swap = 0; 5888 5673 } ··· 6247 6036 char *buf, size_t nbytes, loff_t off) 6248 6037 { 6249 6038 
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 6250 - unsigned int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 6039 + unsigned int nr_retries = MAX_RECLAIM_RETRIES; 6251 6040 bool drained = false; 6252 6041 unsigned long high; 6253 6042 int err; ··· 6256 6045 err = page_counter_memparse(buf, "max", &high); 6257 6046 if (err) 6258 6047 return err; 6259 - 6260 - page_counter_set_high(&memcg->memory, high); 6261 6048 6262 6049 for (;;) { 6263 6050 unsigned long nr_pages = page_counter_read(&memcg->memory); ··· 6280 6071 break; 6281 6072 } 6282 6073 6074 + page_counter_set_high(&memcg->memory, high); 6075 + 6076 + memcg_wb_domain_size_changed(memcg); 6077 + 6283 6078 return nbytes; 6284 6079 } 6285 6080 ··· 6297 6084 char *buf, size_t nbytes, loff_t off) 6298 6085 { 6299 6086 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 6300 - unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES; 6087 + unsigned int nr_reclaims = MAX_RECLAIM_RETRIES; 6301 6088 bool drained = false; 6302 6089 unsigned long max; 6303 6090 int err; ··· 6604 6391 * 6605 6392 * WARNING: This function is not stateless! It can only be used as part 6606 6393 * of a top-down tree iteration, not for isolated queries. 6607 - * 6608 - * Returns one of the following: 6609 - * MEMCG_PROT_NONE: cgroup memory is not protected 6610 - * MEMCG_PROT_LOW: cgroup memory is protected as long there is 6611 - * an unprotected supply of reclaimable memory from other cgroups. 
6612 - * MEMCG_PROT_MIN: cgroup memory is protected 6613 6394 */ 6614 - enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, 6615 - struct mem_cgroup *memcg) 6395 + void mem_cgroup_calculate_protection(struct mem_cgroup *root, 6396 + struct mem_cgroup *memcg) 6616 6397 { 6617 6398 unsigned long usage, parent_usage; 6618 6399 struct mem_cgroup *parent; 6619 6400 6620 6401 if (mem_cgroup_disabled()) 6621 - return MEMCG_PROT_NONE; 6402 + return; 6622 6403 6623 6404 if (!root) 6624 6405 root = root_mem_cgroup; 6406 + 6407 + /* 6408 + * Effective values of the reclaim targets are ignored so they 6409 + * can be stale. Have a look at mem_cgroup_protection for more 6410 + * details. 6411 + * TODO: calculation should be more robust so that we do not need 6412 + * that special casing. 6413 + */ 6625 6414 if (memcg == root) 6626 - return MEMCG_PROT_NONE; 6415 + return; 6627 6416 6628 6417 usage = page_counter_read(&memcg->memory); 6629 6418 if (!usage) 6630 - return MEMCG_PROT_NONE; 6419 + return; 6631 6420 6632 6421 parent = parent_mem_cgroup(memcg); 6633 6422 /* No parent means a non-hierarchical mode on v1 memcg */ 6634 6423 if (!parent) 6635 - return MEMCG_PROT_NONE; 6424 + return; 6636 6425 6637 6426 if (parent == root) { 6638 6427 memcg->memory.emin = READ_ONCE(memcg->memory.min); 6639 6428 memcg->memory.elow = READ_ONCE(memcg->memory.low); 6640 - goto out; 6429 + return; 6641 6430 } 6642 6431 6643 6432 parent_usage = page_counter_read(&parent->memory); ··· 6653 6438 READ_ONCE(memcg->memory.low), 6654 6439 READ_ONCE(parent->memory.elow), 6655 6440 atomic_long_read(&parent->memory.children_low_usage))); 6656 - 6657 - out: 6658 - if (usage <= memcg->memory.emin) 6659 - return MEMCG_PROT_MIN; 6660 - else if (usage <= memcg->memory.elow) 6661 - return MEMCG_PROT_LOW; 6662 - else 6663 - return MEMCG_PROT_NONE; 6664 6441 } 6665 6442 6666 6443 /** ··· 6705 6498 if (ret) 6706 6499 goto out_put; 6707 6500 6501 + css_get(&memcg->css); 6708 6502 
commit_charge(page, memcg); 6709 6503 6710 6504 local_irq_disable(); ··· 6760 6552 __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages); 6761 6553 memcg_check_events(ug->memcg, ug->dummy_page); 6762 6554 local_irq_restore(flags); 6763 - 6764 - if (!mem_cgroup_is_root(ug->memcg)) 6765 - css_put_many(&ug->memcg->css, ug->nr_pages); 6766 6555 } 6767 6556 6768 6557 static void uncharge_page(struct page *page, struct uncharge_gather *ug) ··· 6797 6592 6798 6593 ug->dummy_page = page; 6799 6594 page->mem_cgroup = NULL; 6595 + css_put(&ug->memcg->css); 6800 6596 } 6801 6597 6802 6598 static void uncharge_list(struct list_head *page_list) ··· 6903 6697 page_counter_charge(&memcg->memory, nr_pages); 6904 6698 if (do_memsw_account()) 6905 6699 page_counter_charge(&memcg->memsw, nr_pages); 6906 - css_get_many(&memcg->css, nr_pages); 6907 6700 6701 + css_get(&memcg->css); 6908 6702 commit_charge(newpage, memcg); 6909 6703 6910 6704 local_irq_save(flags); ··· 7027 6821 { 7028 6822 int cpu, node; 7029 6823 7030 - #ifdef CONFIG_MEMCG_KMEM 7031 - /* 7032 - * Kmem cache creation is mostly done with the slab_mutex held, 7033 - * so use a workqueue with limited concurrency to avoid stalling 7034 - * all worker threads in case lots of cgroups are created and 7035 - * destroyed simultaneously. 7036 - */ 7037 - memcg_kmem_cache_wq = alloc_workqueue("memcg_kmem_cache", 0, 1); 7038 - BUG_ON(!memcg_kmem_cache_wq); 7039 - #endif 7040 - 7041 6824 cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL, 7042 6825 memcg_hotplug_cpu_dead); 7043 6826 ··· 7130 6935 mem_cgroup_charge_statistics(memcg, page, -nr_entries); 7131 6936 memcg_check_events(memcg, page); 7132 6937 7133 - if (!mem_cgroup_is_root(memcg)) 7134 - css_put_many(&memcg->css, nr_entries); 6938 + css_put(&memcg->css); 7135 6939 } 7136 6940 7137 6941 /**
+5 -2
mm/memory.c
··· 1098 1098 } 1099 1099 1100 1100 entry = pte_to_swp_entry(ptent); 1101 - if (non_swap_entry(entry) && is_device_private_entry(entry)) { 1101 + if (is_device_private_entry(entry)) { 1102 1102 struct page *page = device_private_entry_to_page(entry); 1103 1103 1104 1104 if (unlikely(details && details->check_mapping)) { ··· 2082 2082 /** 2083 2083 * remap_pfn_range - remap kernel memory to userspace 2084 2084 * @vma: user vma to map to 2085 - * @addr: target user address to start at 2085 + * @addr: target page aligned user address to start at 2086 2086 * @pfn: page frame number of kernel physical memory address 2087 2087 * @size: size of mapping area 2088 2088 * @prot: page protection flags for this mapping ··· 2100 2100 struct mm_struct *mm = vma->vm_mm; 2101 2101 unsigned long remap_pfn = pfn; 2102 2102 int err; 2103 + 2104 + if (WARN_ON_ONCE(!PAGE_ALIGNED(addr))) 2105 + return -EINVAL; 2103 2106 2104 2107 /* 2105 2108 * Physically remapped pages are special. Tell the
+8 -3
mm/memory_hotplug.c
··· 831 831 zone->zone_pgdat->node_present_pages += onlined_pages; 832 832 pgdat_resize_unlock(zone->zone_pgdat, &flags); 833 833 834 + /* 835 + * When exposing larger, physically contiguous memory areas to the 836 + * buddy, shuffling in the buddy (when freeing onlined pages, putting 837 + * them either to the head or the tail of the freelist) is only helpful 838 + * for maintaining the shuffle, but not for creating the initial 839 + * shuffle. Shuffle the whole zone to make sure the just onlined pages 840 + * are properly distributed across the whole freelist. 841 + */ 834 842 shuffle_zone(zone); 835 843 836 844 node_states_set_node(nid, &arg); ··· 851 843 852 844 kswapd_run(nid); 853 845 kcompactd_run(nid); 854 - 855 - vm_total_pages = nr_free_pagecache_pages(); 856 846 857 847 writeback_set_ratelimit(); 858 848 ··· 1601 1595 kcompactd_stop(node); 1602 1596 } 1603 1597 1604 - vm_total_pages = nr_free_pagecache_pages(); 1605 1598 writeback_set_ratelimit(); 1606 1599 1607 1600 memory_notify(MEM_OFFLINE, &arg);
+3 -3
mm/migrate.c
··· 2386 2386 * that the registered device driver can skip invalidating device 2387 2387 * private page mappings that won't be migrated. 2388 2388 */ 2389 - mmu_notifier_range_init(&range, MMU_NOTIFY_MIGRATE, 0, migrate->vma, 2390 - migrate->vma->vm_mm, migrate->start, migrate->end); 2391 - range.migrate_pgmap_owner = migrate->pgmap_owner; 2389 + mmu_notifier_range_init_migrate(&range, 0, migrate->vma, 2390 + migrate->vma->vm_mm, migrate->start, migrate->end, 2391 + migrate->pgmap_owner); 2392 2392 mmu_notifier_invalidate_range_start(&range); 2393 2393 2394 2394 walk_page_range(migrate->vma->vm_mm, migrate->start, migrate->end,
+15 -5
mm/mm_init.c
··· 13 13 #include <linux/memory.h> 14 14 #include <linux/notifier.h> 15 15 #include <linux/sched.h> 16 + #include <linux/mman.h> 16 17 #include "internal.h" 17 18 18 19 #ifdef CONFIG_DEBUG_MEMORY_INIT ··· 145 144 #ifdef CONFIG_SMP 146 145 s32 vm_committed_as_batch = 32; 147 146 148 - static void __meminit mm_compute_batch(void) 147 + void mm_compute_batch(int overcommit_policy) 149 148 { 150 149 u64 memsized_batch; 151 150 s32 nr = num_present_cpus(); 152 151 s32 batch = max_t(s32, nr*2, 32); 152 + unsigned long ram_pages = totalram_pages(); 153 153 154 - /* batch size set to 0.4% of (total memory/#cpus), or max int32 */ 155 - memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff); 154 + /* 155 + * For policy OVERCOMMIT_NEVER, set batch size to 0.4% of 156 + * (total memory/#cpus), and lift it to 25% for other policies 157 + * to easy the possible lock contention for percpu_counter 158 + * vm_committed_as, while the max limit is INT_MAX 159 + */ 160 + if (overcommit_policy == OVERCOMMIT_NEVER) 161 + memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX); 162 + else 163 + memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX); 156 164 157 165 vm_committed_as_batch = max_t(s32, memsized_batch, batch); 158 166 } ··· 172 162 switch (action) { 173 163 case MEM_ONLINE: 174 164 case MEM_OFFLINE: 175 - mm_compute_batch(); 165 + mm_compute_batch(sysctl_overcommit_memory); 176 166 default: 177 167 break; 178 168 } ··· 186 176 187 177 static int __init mm_compute_batch_init(void) 188 178 { 189 - mm_compute_batch(); 179 + mm_compute_batch(sysctl_overcommit_memory); 190 180 register_hotmemory_notifier(&compute_batch_nb); 191 181 192 182 return 0;
+33 -12
mm/mmap.c
··· 1030 1030 * anon_vmas, nor if same anon_vma is assigned but offsets incompatible. 1031 1031 * 1032 1032 * We don't check here for the merged mmap wrapping around the end of pagecache 1033 - * indices (16TB on ia32) because do_mmap_pgoff() does not permit mmap's which 1033 + * indices (16TB on ia32) because do_mmap() does not permit mmap's which 1034 1034 * wrap, nor mmaps which cover the final page at index -1UL. 1035 1035 */ 1036 1036 static int ··· 1365 1365 */ 1366 1366 unsigned long do_mmap(struct file *file, unsigned long addr, 1367 1367 unsigned long len, unsigned long prot, 1368 - unsigned long flags, vm_flags_t vm_flags, 1369 - unsigned long pgoff, unsigned long *populate, 1370 - struct list_head *uf) 1368 + unsigned long flags, unsigned long pgoff, 1369 + unsigned long *populate, struct list_head *uf) 1371 1370 { 1372 1371 struct mm_struct *mm = current->mm; 1372 + vm_flags_t vm_flags; 1373 1373 int pkey = 0; 1374 1374 1375 1375 *populate = 0; ··· 1431 1431 * to. we assume access permissions have been handled by the open 1432 1432 * of the memory object, so we don't do any here. 
1433 1433 */ 1434 - vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | 1434 + vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) | 1435 1435 mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; 1436 1436 1437 1437 if (flags & MAP_LOCKED) ··· 1562 1562 file = fget(fd); 1563 1563 if (!file) 1564 1564 return -EBADF; 1565 - if (is_file_hugepages(file)) 1565 + if (is_file_hugepages(file)) { 1566 1566 len = ALIGN(len, huge_page_size(hstate_file(file))); 1567 - retval = -EINVAL; 1568 - if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) 1567 + } else if (unlikely(flags & MAP_HUGETLB)) { 1568 + retval = -EINVAL; 1569 1569 goto out_fput; 1570 + } 1570 1571 } else if (flags & MAP_HUGETLB) { 1571 1572 struct user_struct *user = NULL; 1572 1573 struct hstate *hs; ··· 1690 1689 struct list_head *uf) 1691 1690 { 1692 1691 struct mm_struct *mm = current->mm; 1693 - struct vm_area_struct *vma, *prev; 1692 + struct vm_area_struct *vma, *prev, *merge; 1694 1693 int error; 1695 1694 struct rb_node **rb_link, *rb_parent; 1696 1695 unsigned long charged = 0; ··· 1774 1773 if (error) 1775 1774 goto unmap_and_free_vma; 1776 1775 1776 + /* If vm_flags changed after call_mmap(), we should try merge vma again 1777 + * as we may succeed this time. 1778 + */ 1779 + if (unlikely(vm_flags != vma->vm_flags && prev)) { 1780 + merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags, 1781 + NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX); 1782 + if (merge) { 1783 + fput(file); 1784 + vm_area_free(vma); 1785 + vma = merge; 1786 + /* Update vm_flags and possible addr to pick up the change. We don't 1787 + * warn here if addr changed as the vma is not linked by vma_link(). 1788 + */ 1789 + addr = vma->vm_start; 1790 + vm_flags = vma->vm_flags; 1791 + goto unmap_writable; 1792 + } 1793 + } 1794 + 1777 1795 /* Can addr have changed?? 
1778 1796 * 1779 1797 * Answer: Yes, several device drivers can do it in their ··· 1815 1795 vma_link(mm, vma, prev, rb_link, rb_parent); 1816 1796 /* Once vma denies write, undo our temporary denial count */ 1817 1797 if (file) { 1798 + unmap_writable: 1818 1799 if (vm_flags & VM_SHARED) 1819 1800 mapping_unmap_writable(file->f_mapping); 1820 1801 if (vm_flags & VM_DENYWRITE) ··· 2230 2209 /* 2231 2210 * mmap_region() will call shmem_zero_setup() to create a file, 2232 2211 * so use shmem's get_unmapped_area in case it can be huge. 2233 - * do_mmap_pgoff() will clear pgoff, so match alignment. 2212 + * do_mmap() will clear pgoff, so match alignment. 2234 2213 */ 2235 2214 pgoff = 0; 2236 2215 get_area = shmem_get_unmapped_area; ··· 3003 2982 } 3004 2983 3005 2984 file = get_file(vma->vm_file); 3006 - ret = do_mmap_pgoff(vma->vm_file, start, size, 2985 + ret = do_mmap(vma->vm_file, start, size, 3007 2986 prot, flags, pgoff, &populate, NULL); 3008 2987 fput(file); 3009 2988 out: ··· 3223 3202 * By setting it to reflect the virtual start address of the 3224 3203 * vma, merges and splits can happen in a seamless way, just 3225 3204 * using the existing file pgoff checks and manipulations. 3226 - * Similarly in do_mmap_pgoff and in do_brk. 3205 + * Similarly in do_mmap and in do_brk. 3227 3206 */ 3228 3207 if (vma_is_anonymous(vma)) { 3229 3208 BUG_ON(vma->anon_vma);
+6 -11
mm/mremap.c
··· 193 193 194 194 #ifdef CONFIG_HAVE_MOVE_PMD 195 195 static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, 196 - unsigned long new_addr, unsigned long old_end, 197 - pmd_t *old_pmd, pmd_t *new_pmd) 196 + unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd) 198 197 { 199 198 spinlock_t *old_ptl, *new_ptl; 200 199 struct mm_struct *mm = vma->vm_mm; 201 200 pmd_t pmd; 202 - 203 - if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK) 204 - || old_end - old_addr < PMD_SIZE) 205 - return false; 206 201 207 202 /* 208 203 * The destination pmd shouldn't be established, free_pgtables() ··· 274 279 extent = next - old_addr; 275 280 if (extent > old_end - old_addr) 276 281 extent = old_end - old_addr; 282 + next = (new_addr + PMD_SIZE) & PMD_MASK; 283 + if (extent > next - new_addr) 284 + extent = next - new_addr; 277 285 old_pmd = get_old_pmd(vma->vm_mm, old_addr); 278 286 if (!old_pmd) 279 287 continue; ··· 290 292 if (need_rmap_locks) 291 293 take_rmap_locks(vma); 292 294 moved = move_huge_pmd(vma, old_addr, new_addr, 293 - old_end, old_pmd, new_pmd); 295 + old_pmd, new_pmd); 294 296 if (need_rmap_locks) 295 297 drop_rmap_locks(vma); 296 298 if (moved) ··· 310 312 if (need_rmap_locks) 311 313 take_rmap_locks(vma); 312 314 moved = move_normal_pmd(vma, old_addr, new_addr, 313 - old_end, old_pmd, new_pmd); 315 + old_pmd, new_pmd); 314 316 if (need_rmap_locks) 315 317 drop_rmap_locks(vma); 316 318 if (moved) ··· 320 322 321 323 if (pte_alloc(new_vma->vm_mm, new_pmd)) 322 324 break; 323 - next = (new_addr + PMD_SIZE) & PMD_MASK; 324 - if (extent > next - new_addr) 325 - extent = next - new_addr; 326 325 move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, 327 326 new_pmd, new_addr, need_rmap_locks); 328 327 }
+3 -3
mm/nommu.c
··· 1078 1078 unsigned long len, 1079 1079 unsigned long prot, 1080 1080 unsigned long flags, 1081 - vm_flags_t vm_flags, 1082 1081 unsigned long pgoff, 1083 1082 unsigned long *populate, 1084 1083 struct list_head *uf) ··· 1085 1086 struct vm_area_struct *vma; 1086 1087 struct vm_region *region; 1087 1088 struct rb_node *rb; 1089 + vm_flags_t vm_flags; 1088 1090 unsigned long capabilities, result; 1089 1091 int ret; 1090 1092 ··· 1104 1104 1105 1105 /* we've determined that we can make the mapping, now translate what we 1106 1106 * now know into VMA flags */ 1107 - vm_flags |= determine_vm_flags(file, prot, flags, capabilities); 1107 + vm_flags = determine_vm_flags(file, prot, flags, capabilities); 1108 1108 1109 1109 /* we're going to need to record the mapping */ 1110 1110 region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL); ··· 1763 1763 * 1764 1764 * Check the shared mappings on an inode on behalf of a shrinking truncate to 1765 1765 * make sure that that any outstanding VMAs aren't broken and then shrink the 1766 - * vm_regions that extend that beyond so that do_mmap_pgoff() doesn't 1766 + * vm_regions that extend that beyond so that do_mmap() doesn't 1767 1767 * automatically grant mappings that are too large. 1768 1768 */ 1769 1769 int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
+1 -1
mm/oom_kill.c
··· 184 184 global_node_page_state(NR_ISOLATED_FILE) + 185 185 global_node_page_state(NR_UNEVICTABLE); 186 186 187 - return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru); 187 + return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru); 188 188 } 189 189 190 190 /**
+2 -4
mm/page-writeback.c
··· 2076 2076 * Called early on to tune the page writeback dirty limits. 2077 2077 * 2078 2078 * We used to scale dirty pages according to how total memory 2079 - * related to pages that could be allocated for buffers (by 2080 - * comparing nr_free_buffer_pages() to vm_total_pages. 2079 + * related to pages that could be allocated for buffers. 2081 2080 * 2082 2081 * However, that was when we used "dirty_ratio" to scale with 2083 2082 * all memory, and we don't do that any more. "dirty_ratio" 2084 - * is now applied to total non-HIGHPAGE memory (by subtracting 2085 - * totalhigh_pages from vm_total_pages), and as such we can't 2083 + * is now applied to total non-HIGHPAGE memory, and as such we can't 2086 2084 * get into the old insane situation any more where we had 2087 2085 * large amounts of dirty pages compared to a small amount of 2088 2086 * non-HIGHMEM memory.
+114 -108
mm/page_alloc.c
··· 459 459 { 460 460 #ifdef CONFIG_SPARSEMEM 461 461 pfn &= (PAGES_PER_SECTION-1); 462 - return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; 463 462 #else 464 463 pfn = pfn - round_down(page_zone(page)->zone_start_pfn, pageblock_nr_pages); 465 - return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; 466 464 #endif /* CONFIG_SPARSEMEM */ 465 + return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS; 467 466 } 468 467 469 468 /** 470 469 * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages 471 470 * @page: The page within the block of interest 472 471 * @pfn: The target page frame number 473 - * @end_bitidx: The last bit of interest to retrieve 474 472 * @mask: mask of bits that the caller is interested in 475 473 * 476 474 * Return: pageblock_bits flags 477 475 */ 478 - static __always_inline unsigned long __get_pfnblock_flags_mask(struct page *page, 476 + static __always_inline 477 + unsigned long __get_pfnblock_flags_mask(struct page *page, 479 478 unsigned long pfn, 480 - unsigned long end_bitidx, 481 479 unsigned long mask) 482 480 { 483 481 unsigned long *bitmap; ··· 488 490 bitidx &= (BITS_PER_LONG-1); 489 491 490 492 word = bitmap[word_bitidx]; 491 - bitidx += end_bitidx; 492 - return (word >> (BITS_PER_LONG - bitidx - 1)) & mask; 493 + return (word >> bitidx) & mask; 493 494 } 494 495 495 496 unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn, 496 - unsigned long end_bitidx, 497 497 unsigned long mask) 498 498 { 499 - return __get_pfnblock_flags_mask(page, pfn, end_bitidx, mask); 499 + return __get_pfnblock_flags_mask(page, pfn, mask); 500 500 } 501 501 502 502 static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn) 503 503 { 504 - return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK); 504 + return __get_pfnblock_flags_mask(page, pfn, MIGRATETYPE_MASK); 505 505 } 506 506 507 507 /** ··· 507 511 * @page: The page within the block 
of interest 508 512 * @flags: The flags to set 509 513 * @pfn: The target page frame number 510 - * @end_bitidx: The last bit of interest 511 514 * @mask: mask of bits that the caller is interested in 512 515 */ 513 516 void set_pfnblock_flags_mask(struct page *page, unsigned long flags, 514 517 unsigned long pfn, 515 - unsigned long end_bitidx, 516 518 unsigned long mask) 517 519 { 518 520 unsigned long *bitmap; ··· 527 533 528 534 VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); 529 535 530 - bitidx += end_bitidx; 531 - mask <<= (BITS_PER_LONG - bitidx - 1); 532 - flags <<= (BITS_PER_LONG - bitidx - 1); 536 + mask <<= bitidx; 537 + flags <<= bitidx; 533 538 534 539 word = READ_ONCE(bitmap[word_bitidx]); 535 540 for (;;) { ··· 545 552 migratetype < MIGRATE_PCPTYPES)) 546 553 migratetype = MIGRATE_UNMOVABLE; 547 554 548 - set_pageblock_flags_group(page, (unsigned long)migratetype, 549 - PB_migrate, PB_migrate_end); 555 + set_pfnblock_flags_mask(page, (unsigned long)migratetype, 556 + page_to_pfn(page), MIGRATETYPE_MASK); 550 557 } 551 558 552 559 #ifdef CONFIG_DEBUG_VM ··· 806 813 { 807 814 struct capture_control *capc = current->capture_control; 808 815 809 - return capc && 816 + return unlikely(capc) && 810 817 !(current->flags & PF_KTHREAD) && 811 818 !capc->page && 812 - capc->cc->zone == zone && 813 - capc->cc->direct_compaction ? capc : NULL; 819 + capc->cc->zone == zone ? capc : NULL; 814 820 } 815 821 816 822 static inline bool ··· 1156 1164 { 1157 1165 int i; 1158 1166 1167 + /* s390's use of memset() could override KASAN redzones. 
*/ 1168 + kasan_disable_current(); 1159 1169 for (i = 0; i < numpages; i++) 1160 1170 clear_highpage(page + i); 1171 + kasan_enable_current(); 1161 1172 } 1162 1173 1163 1174 static __always_inline bool free_pages_prepare(struct page *page, ··· 2268 2273 * This array describes the order lists are fallen back to when 2269 2274 * the free lists for the desirable migrate type are depleted 2270 2275 */ 2271 - static int fallbacks[MIGRATE_TYPES][4] = { 2276 + static int fallbacks[MIGRATE_TYPES][3] = { 2272 2277 [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_TYPES }, 2273 2278 [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES }, 2274 2279 [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_TYPES }, ··· 2785 2790 * allocating from CMA when over half of the zone's free memory 2786 2791 * is in the CMA area. 2787 2792 */ 2788 - if (migratetype == MIGRATE_MOVABLE && 2793 + if (alloc_flags & ALLOC_CMA && 2789 2794 zone_page_state(zone, NR_FREE_CMA_PAGES) > 2790 2795 zone_page_state(zone, NR_FREE_PAGES) / 2) { 2791 2796 page = __rmqueue_cma_fallback(zone, order); ··· 2796 2801 retry: 2797 2802 page = __rmqueue_smallest(zone, order, migratetype); 2798 2803 if (unlikely(!page)) { 2799 - if (migratetype == MIGRATE_MOVABLE) 2804 + if (alloc_flags & ALLOC_CMA) 2800 2805 page = __rmqueue_cma_fallback(zone, order); 2801 2806 2802 2807 if (!page && __rmqueue_fallback(zone, order, migratetype, ··· 3482 3487 } 3483 3488 ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); 3484 3489 3490 + static inline long __zone_watermark_unusable_free(struct zone *z, 3491 + unsigned int order, unsigned int alloc_flags) 3492 + { 3493 + const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); 3494 + long unusable_free = (1 << order) - 1; 3495 + 3496 + /* 3497 + * If the caller does not have rights to ALLOC_HARDER then subtract 3498 + * the high-atomic reserves. 
This will over-estimate the size of the 3499 + * atomic reserve but it avoids a search. 3500 + */ 3501 + if (likely(!alloc_harder)) 3502 + unusable_free += z->nr_reserved_highatomic; 3503 + 3504 + #ifdef CONFIG_CMA 3505 + /* If allocation can't use CMA areas don't use free CMA pages */ 3506 + if (!(alloc_flags & ALLOC_CMA)) 3507 + unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES); 3508 + #endif 3509 + 3510 + return unusable_free; 3511 + } 3512 + 3485 3513 /* 3486 3514 * Return true if free base pages are above 'mark'. For high-order checks it 3487 3515 * will return true of the order-0 watermark is reached and there is at least ··· 3520 3502 const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); 3521 3503 3522 3504 /* free_pages may go negative - that's OK */ 3523 - free_pages -= (1 << order) - 1; 3505 + free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); 3524 3506 3525 3507 if (alloc_flags & ALLOC_HIGH) 3526 3508 min -= min / 2; 3527 3509 3528 - /* 3529 - * If the caller does not have rights to ALLOC_HARDER then subtract 3530 - * the high-atomic reserves. This will over-estimate the size of the 3531 - * atomic reserve but it avoids a search. 3532 - */ 3533 - if (likely(!alloc_harder)) { 3534 - free_pages -= z->nr_reserved_highatomic; 3535 - } else { 3510 + if (unlikely(alloc_harder)) { 3536 3511 /* 3537 3512 * OOM victims can try even harder than normal ALLOC_HARDER 3538 3513 * users on the grounds that it's definitely going to be in ··· 3537 3526 else 3538 3527 min -= min / 4; 3539 3528 } 3540 - 3541 - 3542 - #ifdef CONFIG_CMA 3543 - /* If allocation can't use CMA areas don't use free CMA pages */ 3544 - if (!(alloc_flags & ALLOC_CMA)) 3545 - free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); 3546 - #endif 3547 3529 3548 3530 /* 3549 3531 * Check watermarks for an order-0 allocation request. 
If these ··· 3584 3580 3585 3581 static inline bool zone_watermark_fast(struct zone *z, unsigned int order, 3586 3582 unsigned long mark, int highest_zoneidx, 3587 - unsigned int alloc_flags) 3583 + unsigned int alloc_flags, gfp_t gfp_mask) 3588 3584 { 3589 - long free_pages = zone_page_state(z, NR_FREE_PAGES); 3590 - long cma_pages = 0; 3585 + long free_pages; 3591 3586 3592 - #ifdef CONFIG_CMA 3593 - /* If allocation can't use CMA areas don't use free CMA pages */ 3594 - if (!(alloc_flags & ALLOC_CMA)) 3595 - cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES); 3596 - #endif 3587 + free_pages = zone_page_state(z, NR_FREE_PAGES); 3597 3588 3598 3589 /* 3599 3590 * Fast check for order-0 only. If this fails then the reserves 3600 - * need to be calculated. There is a corner case where the check 3601 - * passes but only the high-order atomic reserve are free. If 3602 - * the caller is !atomic then it'll uselessly search the free 3603 - * list. That corner case is then slower but it is harmless. 3591 + * need to be calculated. 3604 3592 */ 3605 - if (!order && (free_pages - cma_pages) > 3606 - mark + z->lowmem_reserve[highest_zoneidx]) 3607 - return true; 3593 + if (!order) { 3594 + long fast_free; 3608 3595 3609 - return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, 3610 - free_pages); 3596 + fast_free = free_pages; 3597 + fast_free -= __zone_watermark_unusable_free(z, 0, alloc_flags); 3598 + if (fast_free > mark + z->lowmem_reserve[highest_zoneidx]) 3599 + return true; 3600 + } 3601 + 3602 + if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags, 3603 + free_pages)) 3604 + return true; 3605 + /* 3606 + * Ignore watermark boosting for GFP_ATOMIC order-0 allocations 3607 + * when checking the min watermark. The min watermark is the 3608 + * point where boosting is ignored so that kswapd is woken up 3609 + * when below the low watermark. 
3610 + */ 3611 + if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) && z->watermark_boost 3612 + && ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) { 3613 + mark = z->_watermark[WMARK_MIN]; 3614 + return __zone_watermark_ok(z, order, mark, highest_zoneidx, 3615 + alloc_flags, free_pages); 3616 + } 3617 + 3618 + return false; 3611 3619 } 3612 3620 3613 3621 bool zone_watermark_ok_safe(struct zone *z, unsigned int order, ··· 3684 3668 3685 3669 alloc_flags |= ALLOC_NOFRAGMENT; 3686 3670 #endif /* CONFIG_ZONE_DMA32 */ 3671 + return alloc_flags; 3672 + } 3673 + 3674 + static inline unsigned int current_alloc_flags(gfp_t gfp_mask, 3675 + unsigned int alloc_flags) 3676 + { 3677 + #ifdef CONFIG_CMA 3678 + unsigned int pflags = current->flags; 3679 + 3680 + if (!(pflags & PF_MEMALLOC_NOCMA) && 3681 + gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) 3682 + alloc_flags |= ALLOC_CMA; 3683 + 3684 + #endif 3687 3685 return alloc_flags; 3688 3686 } 3689 3687 ··· 3777 3747 3778 3748 mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); 3779 3749 if (!zone_watermark_fast(zone, order, mark, 3780 - ac->highest_zoneidx, alloc_flags)) { 3750 + ac->highest_zoneidx, alloc_flags, 3751 + gfp_mask)) { 3781 3752 int ret; 3782 3753 3783 3754 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT ··· 4347 4316 } else if (unlikely(rt_task(current)) && !in_interrupt()) 4348 4317 alloc_flags |= ALLOC_HARDER; 4349 4318 4350 - #ifdef CONFIG_CMA 4351 - if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE) 4352 - alloc_flags |= ALLOC_CMA; 4353 - #endif 4319 + alloc_flags = current_alloc_flags(gfp_mask, alloc_flags); 4320 + 4354 4321 return alloc_flags; 4355 4322 } 4356 4323 ··· 4649 4620 4650 4621 reserve_flags = __gfp_pfmemalloc_flags(gfp_mask); 4651 4622 if (reserve_flags) 4652 - alloc_flags = reserve_flags; 4623 + alloc_flags = current_alloc_flags(gfp_mask, reserve_flags); 4653 4624 4654 4625 /* 4655 4626 * Reset the nodemask and zonelist iterators if memory policies can be ··· 4726 4697 4727 4698 /* Avoid 
allocations with no watermarks from looping endlessly */ 4728 4699 if (tsk_is_oom_victim(current) && 4729 - (alloc_flags == ALLOC_OOM || 4700 + (alloc_flags & ALLOC_OOM || 4730 4701 (gfp_mask & __GFP_NOMEMALLOC))) 4731 4702 goto nopage; 4732 4703 ··· 4800 4771 4801 4772 if (cpusets_enabled()) { 4802 4773 *alloc_mask |= __GFP_HARDWALL; 4803 - if (!ac->nodemask) 4774 + /* 4775 + * When we are in the interrupt context, it is irrelevant 4776 + * to the current task context. It means that any node ok. 4777 + */ 4778 + if (!in_interrupt() && !ac->nodemask) 4804 4779 ac->nodemask = &cpuset_current_mems_allowed; 4805 4780 else 4806 4781 *alloc_flags |= ALLOC_CPUSET; ··· 4818 4785 if (should_fail_alloc_page(gfp_mask, order)) 4819 4786 return false; 4820 4787 4821 - if (IS_ENABLED(CONFIG_CMA) && ac->migratetype == MIGRATE_MOVABLE) 4822 - *alloc_flags |= ALLOC_CMA; 4788 + *alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags); 4823 4789 4824 4790 return true; 4825 4791 } ··· 5197 5165 } 5198 5166 EXPORT_SYMBOL_GPL(nr_free_buffer_pages); 5199 5167 5200 - /** 5201 - * nr_free_pagecache_pages - count number of pages beyond high watermark 5202 - * 5203 - * nr_free_pagecache_pages() counts the number of pages which are beyond the 5204 - * high watermark within all zones. 5205 - * 5206 - * Return: number of pages beyond high watermark within all zones. 5207 - */ 5208 - unsigned long nr_free_pagecache_pages(void) 5209 - { 5210 - return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); 5211 - } 5212 - 5213 5168 static inline void show_node(struct zone *zone) 5214 5169 { 5215 5170 if (IS_ENABLED(CONFIG_NUMA)) ··· 5239 5220 * items that are in use, and cannot be freed. Cap this estimate at the 5240 5221 * low watermark. 
5241 5222 */ 5242 - reclaimable = global_node_page_state(NR_SLAB_RECLAIMABLE) + 5243 - global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE); 5223 + reclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B) + 5224 + global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE); 5244 5225 available += reclaimable - min(reclaimable / 2, wmark_low); 5245 5226 5246 5227 if (available < 0) ··· 5383 5364 global_node_page_state(NR_UNEVICTABLE), 5384 5365 global_node_page_state(NR_FILE_DIRTY), 5385 5366 global_node_page_state(NR_WRITEBACK), 5386 - global_node_page_state(NR_SLAB_RECLAIMABLE), 5387 - global_node_page_state(NR_SLAB_UNRECLAIMABLE), 5367 + global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B), 5368 + global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B), 5388 5369 global_node_page_state(NR_FILE_MAPPED), 5389 5370 global_node_page_state(NR_SHMEM), 5390 5371 global_zone_page_state(NR_PAGETABLE), ··· 5415 5396 " anon_thp: %lukB" 5416 5397 #endif 5417 5398 " writeback_tmp:%lukB" 5399 + " kernel_stack:%lukB" 5400 + #ifdef CONFIG_SHADOW_CALL_STACK 5401 + " shadow_call_stack:%lukB" 5402 + #endif 5418 5403 " all_unreclaimable? %s" 5419 5404 "\n", 5420 5405 pgdat->node_id, ··· 5440 5417 K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), 5441 5418 #endif 5442 5419 K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), 5420 + node_page_state(pgdat, NR_KERNEL_STACK_KB), 5421 + #ifdef CONFIG_SHADOW_CALL_STACK 5422 + node_page_state(pgdat, NR_KERNEL_SCS_KB), 5423 + #endif 5443 5424 pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? 
5444 5425 "yes" : "no"); 5445 5426 } ··· 5475 5448 " present:%lukB" 5476 5449 " managed:%lukB" 5477 5450 " mlocked:%lukB" 5478 - " kernel_stack:%lukB" 5479 - #ifdef CONFIG_SHADOW_CALL_STACK 5480 - " shadow_call_stack:%lukB" 5481 - #endif 5482 5451 " pagetables:%lukB" 5483 5452 " bounce:%lukB" 5484 5453 " free_pcp:%lukB" ··· 5496 5473 K(zone->present_pages), 5497 5474 K(zone_managed_pages(zone)), 5498 5475 K(zone_page_state(zone, NR_MLOCK)), 5499 - zone_page_state(zone, NR_KERNEL_STACK_KB), 5500 - #ifdef CONFIG_SHADOW_CALL_STACK 5501 - zone_page_state(zone, NR_KERNEL_SCS_KB), 5502 - #endif 5503 5476 K(zone_page_state(zone, NR_PAGETABLE)), 5504 5477 K(zone_page_state(zone, NR_BOUNCE)), 5505 5478 K(free_pcp), ··· 5910 5891 */ 5911 5892 void __ref build_all_zonelists(pg_data_t *pgdat) 5912 5893 { 5894 + unsigned long vm_total_pages; 5895 + 5913 5896 if (system_state == SYSTEM_BOOTING) { 5914 5897 build_all_zonelists_init(); 5915 5898 } else { 5916 5899 __build_all_zonelists(pgdat); 5917 5900 /* cpuset refresh routine should be here */ 5918 5901 } 5919 - vm_total_pages = nr_free_pagecache_pages(); 5902 + /* Get the number of free pages beyond high watermark in all zones. */ 5903 + vm_total_pages = nr_free_zone_pages(gfp_zone(GFP_HIGHUSER_MOVABLE)); 5920 5904 /* 5921 5905 * Disable grouping by mobility if the number of pages in the 5922 5906 * system is too low to allow the mechanism to work. It would be ··· 6344 6322 6345 6323 zone_init_free_lists(zone); 6346 6324 zone->initialized = 1; 6347 - } 6348 - 6349 - /** 6350 - * sparse_memory_present_with_active_regions - Call memory_present for each active range 6351 - * @nid: The node to call memory_present for. If MAX_NUMNODES, all nodes will be used. 6352 - * 6353 - * If an architecture guarantees that all ranges registered contain no holes and may 6354 - * be freed, this function may be used instead of calling memory_present() manually. 
6355 - */ 6356 - void __init sparse_memory_present_with_active_regions(int nid) 6357 - { 6358 - unsigned long start_pfn, end_pfn; 6359 - int i, this_nid; 6360 - 6361 - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, &this_nid) 6362 - memory_present(this_nid, start_pfn, end_pfn); 6363 6325 } 6364 6326 6365 6327 /**
+3 -3
mm/page_counter.c
··· 72 72 long new; 73 73 74 74 new = atomic_long_add_return(nr_pages, &c->usage); 75 - propagate_protected_usage(counter, new); 75 + propagate_protected_usage(c, new); 76 76 /* 77 77 * This is indeed racy, but we can live with some 78 78 * inaccuracy in the watermark. ··· 116 116 new = atomic_long_add_return(nr_pages, &c->usage); 117 117 if (new > c->max) { 118 118 atomic_long_sub(nr_pages, &c->usage); 119 - propagate_protected_usage(counter, new); 119 + propagate_protected_usage(c, new); 120 120 /* 121 121 * This is racy, but we can live with some 122 122 * inaccuracy in the failcnt. ··· 125 125 *fail = c; 126 126 goto failed; 127 127 } 128 - propagate_protected_usage(counter, new); 128 + propagate_protected_usage(c, new); 129 129 /* 130 130 * Just like with failcnt, we can live with some 131 131 * inaccuracy in the watermark.
+1 -1
mm/page_io.c
··· 441 441 break; 442 442 443 443 if (!blk_poll(disk->queue, qc, true)) 444 - io_schedule(); 444 + blk_io_schedule(); 445 445 } 446 446 __set_current_state(TASK_RUNNING); 447 447 bio_put(bio);
+51
mm/pgalloc-track.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_PGALLLC_TRACK_H 3 + #define _LINUX_PGALLLC_TRACK_H 4 + 5 + #if defined(CONFIG_MMU) 6 + static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd, 7 + unsigned long address, 8 + pgtbl_mod_mask *mod_mask) 9 + { 10 + if (unlikely(pgd_none(*pgd))) { 11 + if (__p4d_alloc(mm, pgd, address)) 12 + return NULL; 13 + *mod_mask |= PGTBL_PGD_MODIFIED; 14 + } 15 + 16 + return p4d_offset(pgd, address); 17 + } 18 + 19 + static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d, 20 + unsigned long address, 21 + pgtbl_mod_mask *mod_mask) 22 + { 23 + if (unlikely(p4d_none(*p4d))) { 24 + if (__pud_alloc(mm, p4d, address)) 25 + return NULL; 26 + *mod_mask |= PGTBL_P4D_MODIFIED; 27 + } 28 + 29 + return pud_offset(p4d, address); 30 + } 31 + 32 + static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud, 33 + unsigned long address, 34 + pgtbl_mod_mask *mod_mask) 35 + { 36 + if (unlikely(pud_none(*pud))) { 37 + if (__pmd_alloc(mm, pud, address)) 38 + return NULL; 39 + *mod_mask |= PGTBL_PUD_MODIFIED; 40 + } 41 + 42 + return pmd_offset(pud, address); 43 + } 44 + #endif /* CONFIG_MMU */ 45 + 46 + #define pte_alloc_kernel_track(pmd, address, mask) \ 47 + ((unlikely(pmd_none(*(pmd))) && \ 48 + (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\ 49 + NULL: pte_offset_kernel(pmd, address)) 50 + 51 + #endif /* _LINUX_PGALLLC_TRACK_H */
+123 -6
mm/shmem.c
··· 114 114 kuid_t uid; 115 115 kgid_t gid; 116 116 umode_t mode; 117 + bool full_inums; 117 118 int huge; 118 119 int seen; 119 120 #define SHMEM_SEEN_BLOCKS 1 120 121 #define SHMEM_SEEN_INODES 2 121 122 #define SHMEM_SEEN_HUGE 4 123 + #define SHMEM_SEEN_INUMS 8 122 124 }; 123 125 124 126 #ifdef CONFIG_TMPFS ··· 262 260 static LIST_HEAD(shmem_swaplist); 263 261 static DEFINE_MUTEX(shmem_swaplist_mutex); 264 262 265 - static int shmem_reserve_inode(struct super_block *sb) 263 + /* 264 + * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and 265 + * produces a novel ino for the newly allocated inode. 266 + * 267 + * It may also be called when making a hard link to permit the space needed by 268 + * each dentry. However, in that case, no new inode number is needed since that 269 + * internally draws from another pool of inode numbers (currently global 270 + * get_next_ino()). This case is indicated by passing NULL as inop. 271 + */ 272 + #define SHMEM_INO_BATCH 1024 273 + static int shmem_reserve_inode(struct super_block *sb, ino_t *inop) 266 274 { 267 275 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 268 - if (sbinfo->max_inodes) { 276 + ino_t ino; 277 + 278 + if (!(sb->s_flags & SB_KERNMOUNT)) { 269 279 spin_lock(&sbinfo->stat_lock); 270 280 if (!sbinfo->free_inodes) { 271 281 spin_unlock(&sbinfo->stat_lock); 272 282 return -ENOSPC; 273 283 } 274 284 sbinfo->free_inodes--; 285 + if (inop) { 286 + ino = sbinfo->next_ino++; 287 + if (unlikely(is_zero_ino(ino))) 288 + ino = sbinfo->next_ino++; 289 + if (unlikely(!sbinfo->full_inums && 290 + ino > UINT_MAX)) { 291 + /* 292 + * Emulate get_next_ino uint wraparound for 293 + * compatibility 294 + */ 295 + if (IS_ENABLED(CONFIG_64BIT)) 296 + pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n", 297 + __func__, MINOR(sb->s_dev)); 298 + sbinfo->next_ino = 1; 299 + ino = sbinfo->next_ino++; 300 + } 301 + *inop = ino; 302 + } 275 303 spin_unlock(&sbinfo->stat_lock); 304 + 
} else if (inop) { 305 + /* 306 + * __shmem_file_setup, one of our callers, is lock-free: it 307 + * doesn't hold stat_lock in shmem_reserve_inode since 308 + * max_inodes is always 0, and is called from potentially 309 + * unknown contexts. As such, use a per-cpu batched allocator 310 + * which doesn't require the per-sb stat_lock unless we are at 311 + * the batch boundary. 312 + * 313 + * We don't need to worry about inode{32,64} since SB_KERNMOUNT 314 + * shmem mounts are not exposed to userspace, so we don't need 315 + * to worry about things like glibc compatibility. 316 + */ 317 + ino_t *next_ino; 318 + next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); 319 + ino = *next_ino; 320 + if (unlikely(ino % SHMEM_INO_BATCH == 0)) { 321 + spin_lock(&sbinfo->stat_lock); 322 + ino = sbinfo->next_ino; 323 + sbinfo->next_ino += SHMEM_INO_BATCH; 324 + spin_unlock(&sbinfo->stat_lock); 325 + if (unlikely(is_zero_ino(ino))) 326 + ino++; 327 + } 328 + *inop = ino; 329 + *next_ino = ++ino; 330 + put_cpu(); 276 331 } 332 + 277 333 return 0; 278 334 } 279 335 ··· 2282 2222 struct inode *inode; 2283 2223 struct shmem_inode_info *info; 2284 2224 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 2225 + ino_t ino; 2285 2226 2286 - if (shmem_reserve_inode(sb)) 2227 + if (shmem_reserve_inode(sb, &ino)) 2287 2228 return NULL; 2288 2229 2289 2230 inode = new_inode(sb); 2290 2231 if (inode) { 2291 - inode->i_ino = get_next_ino(); 2232 + inode->i_ino = ino; 2292 2233 inode_init_owner(inode, dir, mode); 2293 2234 inode->i_blocks = 0; 2294 2235 inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); ··· 2993 2932 * first link must skip that, to get the accounting right. 
2994 2933 */ 2995 2934 if (inode->i_nlink) { 2996 - ret = shmem_reserve_inode(inode->i_sb); 2935 + ret = shmem_reserve_inode(inode->i_sb, NULL); 2997 2936 if (ret) 2998 2937 goto out; 2999 2938 } ··· 3408 3347 Opt_nr_inodes, 3409 3348 Opt_size, 3410 3349 Opt_uid, 3350 + Opt_inode32, 3351 + Opt_inode64, 3411 3352 }; 3412 3353 3413 3354 static const struct constant_table shmem_param_enums_huge[] = { ··· 3429 3366 fsparam_string("nr_inodes", Opt_nr_inodes), 3430 3367 fsparam_string("size", Opt_size), 3431 3368 fsparam_u32 ("uid", Opt_uid), 3369 + fsparam_flag ("inode32", Opt_inode32), 3370 + fsparam_flag ("inode64", Opt_inode64), 3432 3371 {} 3433 3372 }; 3434 3373 ··· 3502 3437 break; 3503 3438 } 3504 3439 goto unsupported_parameter; 3440 + case Opt_inode32: 3441 + ctx->full_inums = false; 3442 + ctx->seen |= SHMEM_SEEN_INUMS; 3443 + break; 3444 + case Opt_inode64: 3445 + if (sizeof(ino_t) < 8) { 3446 + return invalfc(fc, 3447 + "Cannot use inode64 with <64bit inums in kernel\n"); 3448 + } 3449 + ctx->full_inums = true; 3450 + ctx->seen |= SHMEM_SEEN_INUMS; 3451 + break; 3505 3452 } 3506 3453 return 0; 3507 3454 ··· 3605 3528 } 3606 3529 } 3607 3530 3531 + if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums && 3532 + sbinfo->next_ino > UINT_MAX) { 3533 + err = "Current inum too high to switch to 32-bit inums"; 3534 + goto out; 3535 + } 3536 + 3608 3537 if (ctx->seen & SHMEM_SEEN_HUGE) 3609 3538 sbinfo->huge = ctx->huge; 3539 + if (ctx->seen & SHMEM_SEEN_INUMS) 3540 + sbinfo->full_inums = ctx->full_inums; 3610 3541 if (ctx->seen & SHMEM_SEEN_BLOCKS) 3611 3542 sbinfo->max_blocks = ctx->blocks; 3612 3543 if (ctx->seen & SHMEM_SEEN_INODES) { ··· 3654 3569 if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) 3655 3570 seq_printf(seq, ",gid=%u", 3656 3571 from_kgid_munged(&init_user_ns, sbinfo->gid)); 3572 + 3573 + /* 3574 + * Showing inode{64,32} might be useful even if it's the system default, 3575 + * since then people don't have to resort to checking both here and 3576 + * 
/proc/config.gz to confirm 64-bit inums were successfully applied 3577 + * (which may not even exist if IKCONFIG_PROC isn't enabled). 3578 + * 3579 + * We hide it when inode64 isn't the default and we are using 32-bit 3580 + * inodes, since that probably just means the feature isn't even under 3581 + * consideration. 3582 + * 3583 + * As such: 3584 + * 3585 + * +-----------------+-----------------+ 3586 + * | TMPFS_INODE64=y | TMPFS_INODE64=n | 3587 + * +------------------+-----------------+-----------------+ 3588 + * | full_inums=true | show | show | 3589 + * | full_inums=false | show | hide | 3590 + * +------------------+-----------------+-----------------+ 3591 + * 3592 + */ 3593 + if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums) 3594 + seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32)); 3657 3595 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 3658 3596 /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */ 3659 3597 if (sbinfo->huge) ··· 3692 3584 { 3693 3585 struct shmem_sb_info *sbinfo = SHMEM_SB(sb); 3694 3586 3587 + free_percpu(sbinfo->ino_batch); 3695 3588 percpu_counter_destroy(&sbinfo->used_blocks); 3696 3589 mpol_put(sbinfo->mpol); 3697 3590 kfree(sbinfo); ··· 3725 3616 ctx->blocks = shmem_default_max_blocks(); 3726 3617 if (!(ctx->seen & SHMEM_SEEN_INODES)) 3727 3618 ctx->inodes = shmem_default_max_inodes(); 3619 + if (!(ctx->seen & SHMEM_SEEN_INUMS)) 3620 + ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); 3728 3621 } else { 3729 3622 sb->s_flags |= SB_NOUSER; 3730 3623 } ··· 3737 3626 #endif 3738 3627 sbinfo->max_blocks = ctx->blocks; 3739 3628 sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes; 3629 + if (sb->s_flags & SB_KERNMOUNT) { 3630 + sbinfo->ino_batch = alloc_percpu(ino_t); 3631 + if (!sbinfo->ino_batch) 3632 + goto failed; 3633 + } 3740 3634 sbinfo->uid = ctx->uid; 3741 3635 sbinfo->gid = ctx->gid; 3636 + sbinfo->full_inums = ctx->full_inums; 3742 3637 sbinfo->mode = ctx->mode; 3743 3638 sbinfo->huge = 
ctx->huge; 3744 3639 sbinfo->mpol = ctx->mpol; ··· 4245 4128 4246 4129 /** 4247 4130 * shmem_zero_setup - setup a shared anonymous mapping 4248 - * @vma: the vma to be mmapped is prepared by do_mmap_pgoff 4131 + * @vma: the vma to be mmapped is prepared by do_mmap 4249 4132 */ 4250 4133 int shmem_zero_setup(struct vm_area_struct *vma) 4251 4134 {
+11 -35
mm/shuffle.c
··· 10 10 #include "shuffle.h" 11 11 12 12 DEFINE_STATIC_KEY_FALSE(page_alloc_shuffle_key); 13 - static unsigned long shuffle_state __ro_after_init; 14 - 15 - /* 16 - * Depending on the architecture, module parameter parsing may run 17 - * before, or after the cache detection. SHUFFLE_FORCE_DISABLE prevents, 18 - * or reverts the enabling of the shuffle implementation. SHUFFLE_ENABLE 19 - * attempts to turn on the implementation, but aborts if it finds 20 - * SHUFFLE_FORCE_DISABLE already set. 21 - */ 22 - __meminit void page_alloc_shuffle(enum mm_shuffle_ctl ctl) 23 - { 24 - if (ctl == SHUFFLE_FORCE_DISABLE) 25 - set_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state); 26 - 27 - if (test_bit(SHUFFLE_FORCE_DISABLE, &shuffle_state)) { 28 - if (test_and_clear_bit(SHUFFLE_ENABLE, &shuffle_state)) 29 - static_branch_disable(&page_alloc_shuffle_key); 30 - } else if (ctl == SHUFFLE_ENABLE 31 - && !test_and_set_bit(SHUFFLE_ENABLE, &shuffle_state)) 32 - static_branch_enable(&page_alloc_shuffle_key); 33 - } 34 13 35 14 static bool shuffle_param; 36 15 static int shuffle_show(char *buffer, const struct kernel_param *kp) 37 16 { 38 - return sprintf(buffer, "%c\n", test_bit(SHUFFLE_ENABLE, &shuffle_state) 39 - ? 'Y' : 'N'); 17 + return sprintf(buffer, "%c\n", shuffle_param ? 'Y' : 'N'); 40 18 } 41 19 42 20 static __meminit int shuffle_store(const char *val, ··· 25 47 if (rc < 0) 26 48 return rc; 27 49 if (shuffle_param) 28 - page_alloc_shuffle(SHUFFLE_ENABLE); 29 - else 30 - page_alloc_shuffle(SHUFFLE_FORCE_DISABLE); 50 + static_branch_enable(&page_alloc_shuffle_key); 31 51 return 0; 32 52 } 33 53 module_param_call(shuffle, shuffle_store, shuffle_show, &shuffle_param, 0400); ··· 34 58 * For two pages to be swapped in the shuffle, they must be free (on a 35 59 * 'free_area' lru), have the same order, and have the same migratetype. 
36 60 */ 37 - static struct page * __meminit shuffle_valid_page(unsigned long pfn, int order) 61 + static struct page * __meminit shuffle_valid_page(struct zone *zone, 62 + unsigned long pfn, int order) 38 63 { 39 - struct page *page; 64 + struct page *page = pfn_to_online_page(pfn); 40 65 41 66 /* 42 67 * Given we're dealing with randomly selected pfns in a zone we 43 68 * need to ask questions like... 44 69 */ 45 70 46 - /* ...is the pfn even in the memmap? */ 47 - if (!pfn_valid_within(pfn)) 71 + /* ... is the page managed by the buddy? */ 72 + if (!page) 48 73 return NULL; 49 74 50 - /* ...is the pfn in a present section or a hole? */ 51 - if (!pfn_in_present_section(pfn)) 75 + /* ... is the page assigned to the same zone? */ 76 + if (page_zone(page) != zone) 52 77 return NULL; 53 78 54 79 /* ...is the page free and currently on a free_area list? */ 55 - page = pfn_to_page(pfn); 56 80 if (!PageBuddy(page)) 57 81 return NULL; 58 82 ··· 99 123 * page_j randomly selected in the span @zone_start_pfn to 100 124 * @spanned_pages. 101 125 */ 102 - page_i = shuffle_valid_page(i, order); 126 + page_i = shuffle_valid_page(z, i, order); 103 127 if (!page_i) 104 128 continue; 105 129 ··· 113 137 j = z->zone_start_pfn + 114 138 ALIGN_DOWN(get_random_long() % z->spanned_pages, 115 139 order_pages); 116 - page_j = shuffle_valid_page(j, order); 140 + page_j = shuffle_valid_page(z, j, order); 117 141 if (page_j && page_j != page_i) 118 142 break; 119 143 }
-17
mm/shuffle.h
··· 4 4 #define _MM_SHUFFLE_H 5 5 #include <linux/jump_label.h> 6 6 7 - /* 8 - * SHUFFLE_ENABLE is called from the command line enabling path, or by 9 - * platform-firmware enabling that indicates the presence of a 10 - * direct-mapped memory-side-cache. SHUFFLE_FORCE_DISABLE is called from 11 - * the command line path and overrides any previous or future 12 - * SHUFFLE_ENABLE. 13 - */ 14 - enum mm_shuffle_ctl { 15 - SHUFFLE_ENABLE, 16 - SHUFFLE_FORCE_DISABLE, 17 - }; 18 - 19 7 #define SHUFFLE_ORDER (MAX_ORDER-1) 20 8 21 9 #ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR 22 10 DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key); 23 - extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl); 24 11 extern void __shuffle_free_memory(pg_data_t *pgdat); 25 12 extern bool shuffle_pick_tail(void); 26 13 static inline void shuffle_free_memory(pg_data_t *pgdat) ··· 42 55 } 43 56 44 57 static inline void shuffle_zone(struct zone *z) 45 - { 46 - } 47 - 48 - static inline void page_alloc_shuffle(enum mm_shuffle_ctl ctl) 49 58 { 50 59 } 51 60
+36 -67
mm/slab.c
···
 588  588  	return nr;
 589  589  }
 590  590
      591 + /* &alien->lock must be held by alien callers. */
      592 + static __always_inline void __free_one(struct array_cache *ac, void *objp)
      593 + {
      594 + 	/* Avoid trivial double-free. */
      595 + 	if (IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) &&
      596 + 	    WARN_ON_ONCE(ac->avail > 0 && ac->entry[ac->avail - 1] == objp))
      597 + 		return;
      598 + 	ac->entry[ac->avail++] = objp;
      599 + }
      600 +
 591  601  #ifndef CONFIG_NUMA
 592  602
 593  603  #define drain_alien_cache(cachep, alien) do { } while (0)
···
 777  767  		STATS_INC_ACOVERFLOW(cachep);
 778  768  		__drain_alien_cache(cachep, ac, page_node, &list);
 779  769  	}
 780      - 	ac->entry[ac->avail++] = objp;
      770 + 	__free_one(ac, objp);
 781  771  	spin_unlock(&alien->lock);
 782  772  	slabs_destroy(cachep, &list);
 783  773  	} else {
···
1060 1050  	 * offline.
1061 1051  	 *
1062 1052  	 * Even if all the cpus of a node are down, we don't free the
1063      - 	 * kmem_list3 of any cache. This to avoid a race between cpu_down, and
     1053 + 	 * kmem_cache_node of any cache. This to avoid a race between cpu_down, and
1064 1054  	 * a kmalloc allocation from another cpu for memory from the node of
1065 1055  	 * the cpu going down. The list3 structure is usually allocated from
1066 1056  	 * kmem_cache_create() and gets destroyed at kmem_cache_destroy().
··· 1249 1239 nr_node_ids * sizeof(struct kmem_cache_node *), 1250 1240 SLAB_HWCACHE_ALIGN, 0, 0); 1251 1241 list_add(&kmem_cache->list, &slab_caches); 1252 - memcg_link_cache(kmem_cache, NULL); 1253 1242 slab_state = PARTIAL; 1254 1243 1255 1244 /* ··· 1379 1370 return NULL; 1380 1371 } 1381 1372 1382 - if (charge_slab_page(page, flags, cachep->gfporder, cachep)) { 1383 - __free_pages(page, cachep->gfporder); 1384 - return NULL; 1385 - } 1386 - 1373 + account_slab_page(page, cachep->gfporder, cachep); 1387 1374 __SetPageSlab(page); 1388 1375 /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */ 1389 1376 if (sk_memalloc_socks() && page_is_pfmemalloc(page)) ··· 1403 1398 1404 1399 if (current->reclaim_state) 1405 1400 current->reclaim_state->reclaimed_slab += 1 << order; 1406 - uncharge_slab_page(page, order, cachep); 1401 + unaccount_slab_page(page, order, cachep); 1407 1402 __free_pages(page, order); 1408 1403 } 1409 1404 ··· 2248 2243 return (ret ? 1 : 0); 2249 2244 } 2250 2245 2251 - #ifdef CONFIG_MEMCG 2252 - void __kmemcg_cache_deactivate(struct kmem_cache *cachep) 2253 - { 2254 - __kmem_cache_shrink(cachep); 2255 - } 2256 - 2257 - void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s) 2258 - { 2259 - } 2260 - #endif 2261 - 2262 2246 int __kmem_cache_shutdown(struct kmem_cache *cachep) 2263 2247 { 2264 2248 return __kmem_cache_shrink(cachep); ··· 2573 2579 * Be lazy and only check for valid flags here, keeping it out of the 2574 2580 * critical path in kmem_cache_alloc(). 2575 2581 */ 2576 - if (unlikely(flags & GFP_SLAB_BUG_MASK)) { 2577 - gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK; 2578 - flags &= ~GFP_SLAB_BUG_MASK; 2579 - pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). 
Fix your code!\n", 2580 - invalid_mask, &invalid_mask, flags, &flags); 2581 - dump_stack(); 2582 - } 2582 + if (unlikely(flags & GFP_SLAB_BUG_MASK)) 2583 + flags = kmalloc_fix_flags(flags); 2584 + 2583 2585 WARN_ON_ONCE(cachep->ctor && (flags & __GFP_ZERO)); 2584 2586 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); 2585 2587 ··· 3212 3222 unsigned long save_flags; 3213 3223 void *ptr; 3214 3224 int slab_node = numa_mem_id(); 3225 + struct obj_cgroup *objcg = NULL; 3215 3226 3216 3227 flags &= gfp_allowed_mask; 3217 - cachep = slab_pre_alloc_hook(cachep, flags); 3228 + cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); 3218 3229 if (unlikely(!cachep)) 3219 3230 return NULL; 3220 3231 ··· 3251 3260 if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr) 3252 3261 memset(ptr, 0, cachep->object_size); 3253 3262 3254 - slab_post_alloc_hook(cachep, flags, 1, &ptr); 3263 + slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr); 3255 3264 return ptr; 3256 3265 } 3257 3266 ··· 3292 3301 { 3293 3302 unsigned long save_flags; 3294 3303 void *objp; 3304 + struct obj_cgroup *objcg = NULL; 3295 3305 3296 3306 flags &= gfp_allowed_mask; 3297 - cachep = slab_pre_alloc_hook(cachep, flags); 3307 + cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); 3298 3308 if (unlikely(!cachep)) 3299 3309 return NULL; 3300 3310 ··· 3309 3317 if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp) 3310 3318 memset(objp, 0, cachep->object_size); 3311 3319 3312 - slab_post_alloc_hook(cachep, flags, 1, &objp); 3320 + slab_post_alloc_hook(cachep, objcg, flags, 1, &objp); 3313 3321 return objp; 3314 3322 } 3315 3323 ··· 3418 3426 if (kasan_slab_free(cachep, objp, _RET_IP_)) 3419 3427 return; 3420 3428 3429 + /* Use KCSAN to help debug racy use-after-free. 
*/ 3430 + if (!(cachep->flags & SLAB_TYPESAFE_BY_RCU)) 3431 + __kcsan_check_access(objp, cachep->object_size, 3432 + KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT); 3433 + 3421 3434 ___cache_free(cachep, objp, caller); 3422 3435 } 3423 3436 ··· 3436 3439 memset(objp, 0, cachep->object_size); 3437 3440 kmemleak_free_recursive(objp, cachep->flags); 3438 3441 objp = cache_free_debugcheck(cachep, objp, caller); 3442 + memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp); 3439 3443 3440 3444 /* 3441 3445 * Skip calling cache_free_alien() when the platform is not numa. ··· 3464 3466 } 3465 3467 } 3466 3468 3467 - ac->entry[ac->avail++] = objp; 3469 + __free_one(ac, objp); 3468 3470 } 3469 3471 3470 3472 /** ··· 3502 3504 void **p) 3503 3505 { 3504 3506 size_t i; 3507 + struct obj_cgroup *objcg = NULL; 3505 3508 3506 - s = slab_pre_alloc_hook(s, flags); 3509 + s = slab_pre_alloc_hook(s, &objcg, size, flags); 3507 3510 if (!s) 3508 3511 return 0; 3509 3512 ··· 3527 3528 for (i = 0; i < size; i++) 3528 3529 memset(p[i], 0, s->object_size); 3529 3530 3530 - slab_post_alloc_hook(s, flags, size, p); 3531 + slab_post_alloc_hook(s, objcg, flags, size, p); 3531 3532 /* FIXME: Trace call missing. 
Christoph would like a bulk variant */ 3532 3533 return size; 3533 3534 error: 3534 3535 local_irq_enable(); 3535 3536 cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_); 3536 - slab_post_alloc_hook(s, flags, i, p); 3537 + slab_post_alloc_hook(s, objcg, flags, i, p); 3537 3538 __kmem_cache_free_bulk(s, i, p); 3538 3539 return 0; 3539 3540 } ··· 3795 3796 } 3796 3797 3797 3798 /* Always called with the slab_mutex held */ 3798 - static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, 3799 - int batchcount, int shared, gfp_t gfp) 3799 + static int do_tune_cpucache(struct kmem_cache *cachep, int limit, 3800 + int batchcount, int shared, gfp_t gfp) 3800 3801 { 3801 3802 struct array_cache __percpu *cpu_cache, *prev; 3802 3803 int cpu; ··· 3841 3842 return setup_kmem_cache_nodes(cachep, gfp); 3842 3843 } 3843 3844 3844 - static int do_tune_cpucache(struct kmem_cache *cachep, int limit, 3845 - int batchcount, int shared, gfp_t gfp) 3846 - { 3847 - int ret; 3848 - struct kmem_cache *c; 3849 - 3850 - ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp); 3851 - 3852 - if (slab_state < FULL) 3853 - return ret; 3854 - 3855 - if ((ret < 0) || !is_root_cache(cachep)) 3856 - return ret; 3857 - 3858 - lockdep_assert_held(&slab_mutex); 3859 - for_each_memcg_cache(c, cachep) { 3860 - /* return value determined by the root cache only */ 3861 - __do_tune_cpucache(c, limit, batchcount, shared, gfp); 3862 - } 3863 - 3864 - return ret; 3865 - } 3866 - 3867 3845 /* Called with slab_mutex held always */ 3868 3846 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp) 3869 3847 { ··· 3852 3876 err = cache_random_seq_create(cachep, cachep->num, gfp); 3853 3877 if (err) 3854 3878 goto end; 3855 - 3856 - if (!is_root_cache(cachep)) { 3857 - struct kmem_cache *root = memcg_root_cache(cachep); 3858 - limit = root->limit; 3859 - shared = root->shared; 3860 - batchcount = root->batchcount; 3861 - } 3862 3879 3863 3880 if (limit && shared && batchcount) 
3864 3881 goto skip_setup;
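The `__free_one()` hardening introduced in the mm/slab.c changes above is small enough to model in plain C: refuse to push an object onto the per-cpu array cache when it is already the most recently freed entry, which catches the trivial back-to-back double-free. In this sketch `struct array_cache` is a simplified stand-in for the kernel structure and `free_one` is an illustrative name; the kernel version additionally gates the check on CONFIG_SLAB_FREELIST_HARDENED and warns via WARN_ON_ONCE.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's per-cpu array_cache. */
struct array_cache {
	unsigned int avail;		/* number of cached free objects */
	void *entry[32];		/* LIFO stack of free objects */
};

/* Model of __free_one(): returns 1 if the object was queued,
 * 0 if rejected because it matches the top of the stack (a
 * trivial double-free). */
static int free_one(struct array_cache *ac, void *objp)
{
	if (ac->avail > 0 && ac->entry[ac->avail - 1] == objp)
		return 0;	/* same object freed twice in a row */
	ac->entry[ac->avail++] = objp;
	return 1;
}
```

The check is deliberately cheap: it only compares against the last entry, so it catches the common `kfree(p); kfree(p);` pattern without scanning the whole cache on every free.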
+174 -245
mm/slab.h
··· 30 30 struct list_head list; /* List of all slab caches on the system */ 31 31 }; 32 32 33 - #else /* !CONFIG_SLOB */ 34 - 35 - struct memcg_cache_array { 36 - struct rcu_head rcu; 37 - struct kmem_cache *entries[0]; 38 - }; 39 - 40 - /* 41 - * This is the main placeholder for memcg-related information in kmem caches. 42 - * Both the root cache and the child caches will have it. For the root cache, 43 - * this will hold a dynamically allocated array large enough to hold 44 - * information about the currently limited memcgs in the system. To allow the 45 - * array to be accessed without taking any locks, on relocation we free the old 46 - * version only after a grace period. 47 - * 48 - * Root and child caches hold different metadata. 49 - * 50 - * @root_cache: Common to root and child caches. NULL for root, pointer to 51 - * the root cache for children. 52 - * 53 - * The following fields are specific to root caches. 54 - * 55 - * @memcg_caches: kmemcg ID indexed table of child caches. This table is 56 - * used to index child cachces during allocation and cleared 57 - * early during shutdown. 58 - * 59 - * @root_caches_node: List node for slab_root_caches list. 60 - * 61 - * @children: List of all child caches. While the child caches are also 62 - * reachable through @memcg_caches, a child cache remains on 63 - * this list until it is actually destroyed. 64 - * 65 - * The following fields are specific to child caches. 66 - * 67 - * @memcg: Pointer to the memcg this cache belongs to. 68 - * 69 - * @children_node: List node for @root_cache->children list. 70 - * 71 - * @kmem_caches_node: List node for @memcg->kmem_caches list. 
72 - */ 73 - struct memcg_cache_params { 74 - struct kmem_cache *root_cache; 75 - union { 76 - struct { 77 - struct memcg_cache_array __rcu *memcg_caches; 78 - struct list_head __root_caches_node; 79 - struct list_head children; 80 - bool dying; 81 - }; 82 - struct { 83 - struct mem_cgroup *memcg; 84 - struct list_head children_node; 85 - struct list_head kmem_caches_node; 86 - struct percpu_ref refcnt; 87 - 88 - void (*work_fn)(struct kmem_cache *); 89 - union { 90 - struct rcu_head rcu_head; 91 - struct work_struct work; 92 - }; 93 - }; 94 - }; 95 - }; 96 33 #endif /* CONFIG_SLOB */ 97 34 98 35 #ifdef CONFIG_SLAB ··· 46 109 #include <linux/kmemleak.h> 47 110 #include <linux/random.h> 48 111 #include <linux/sched/mm.h> 112 + #include <linux/kmemleak.h> 49 113 50 114 /* 51 115 * State of the slab allocator. ··· 90 152 struct kmem_cache *kmalloc_slab(size_t, gfp_t); 91 153 #endif 92 154 155 + gfp_t kmalloc_fix_flags(gfp_t flags); 93 156 94 157 /* Functions provided by the slab allocators */ 95 158 int __kmem_cache_create(struct kmem_cache *, slab_flags_t flags); ··· 173 234 int __kmem_cache_shutdown(struct kmem_cache *); 174 235 void __kmem_cache_release(struct kmem_cache *); 175 236 int __kmem_cache_shrink(struct kmem_cache *); 176 - void __kmemcg_cache_deactivate(struct kmem_cache *s); 177 - void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s); 178 237 void slab_kmem_cache_release(struct kmem_cache *); 179 - void kmem_cache_shrink_all(struct kmem_cache *s); 180 238 181 239 struct seq_file; 182 240 struct file; ··· 208 272 static inline int cache_vmstat_idx(struct kmem_cache *s) 209 273 { 210 274 return (s->flags & SLAB_RECLAIM_ACCOUNT) ? 
211 - NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE; 275 + NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; 276 + } 277 + 278 + #ifdef CONFIG_SLUB_DEBUG 279 + #ifdef CONFIG_SLUB_DEBUG_ON 280 + DECLARE_STATIC_KEY_TRUE(slub_debug_enabled); 281 + #else 282 + DECLARE_STATIC_KEY_FALSE(slub_debug_enabled); 283 + #endif 284 + extern void print_tracking(struct kmem_cache *s, void *object); 285 + #else 286 + static inline void print_tracking(struct kmem_cache *s, void *object) 287 + { 288 + } 289 + #endif 290 + 291 + /* 292 + * Returns true if any of the specified slub_debug flags is enabled for the 293 + * cache. Use only for flags parsed by setup_slub_debug() as it also enables 294 + * the static key. 295 + */ 296 + static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t flags) 297 + { 298 + #ifdef CONFIG_SLUB_DEBUG 299 + VM_WARN_ON_ONCE(!(flags & SLAB_DEBUG_FLAGS)); 300 + if (static_branch_unlikely(&slub_debug_enabled)) 301 + return s->flags & flags; 302 + #endif 303 + return false; 212 304 } 213 305 214 306 #ifdef CONFIG_MEMCG_KMEM 215 - 216 - /* List of all root caches. */ 217 - extern struct list_head slab_root_caches; 218 - #define root_caches_node memcg_params.__root_caches_node 219 - 220 - /* 221 - * Iterate over all memcg caches of the given root cache. The caller must hold 222 - * slab_mutex. 223 - */ 224 - #define for_each_memcg_cache(iter, root) \ 225 - list_for_each_entry(iter, &(root)->memcg_params.children, \ 226 - memcg_params.children_node) 227 - 228 - static inline bool is_root_cache(struct kmem_cache *s) 307 + static inline struct obj_cgroup **page_obj_cgroups(struct page *page) 229 308 { 230 - return !s->memcg_params.root_cache; 309 + /* 310 + * page->mem_cgroup and page->obj_cgroups are sharing the same 311 + * space. To distinguish between them in case we don't know for sure 312 + * that the page is a slab page (e.g. page_cgroup_ino()), let's 313 + * always set the lowest bit of obj_cgroups. 
314 + */ 315 + return (struct obj_cgroup **) 316 + ((unsigned long)page->obj_cgroups & ~0x1UL); 231 317 } 232 318 233 - static inline bool slab_equal_or_root(struct kmem_cache *s, 234 - struct kmem_cache *p) 319 + static inline bool page_has_obj_cgroups(struct page *page) 235 320 { 236 - return p == s || p == s->memcg_params.root_cache; 321 + return ((unsigned long)page->obj_cgroups & 0x1UL); 237 322 } 238 323 239 - /* 240 - * We use suffixes to the name in memcg because we can't have caches 241 - * created in the system with the same name. But when we print them 242 - * locally, better refer to them with the base name 243 - */ 244 - static inline const char *cache_name(struct kmem_cache *s) 324 + int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, 325 + gfp_t gfp); 326 + 327 + static inline void memcg_free_page_obj_cgroups(struct page *page) 245 328 { 246 - if (!is_root_cache(s)) 247 - s = s->memcg_params.root_cache; 248 - return s->name; 329 + kfree(page_obj_cgroups(page)); 330 + page->obj_cgroups = NULL; 249 331 } 250 332 251 - static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) 333 + static inline size_t obj_full_size(struct kmem_cache *s) 252 334 { 253 - if (is_root_cache(s)) 254 - return s; 255 - return s->memcg_params.root_cache; 335 + /* 336 + * For each accounted object there is an extra space which is used 337 + * to store obj_cgroup membership. Charge it too. 338 + */ 339 + return s->size + sizeof(struct obj_cgroup *); 256 340 } 257 341 258 - /* 259 - * Expects a pointer to a slab page. Please note, that PageSlab() check 260 - * isn't sufficient, as it returns true also for tail compound slab pages, 261 - * which do not have slab_cache pointer set. 262 - * So this function assumes that the page can pass PageSlab() && !PageTail() 263 - * check. 264 - * 265 - * The kmem_cache can be reparented asynchronously. The caller must ensure 266 - * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex. 
267 - */ 268 - static inline struct mem_cgroup *memcg_from_slab_page(struct page *page) 342 + static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s, 343 + size_t objects, 344 + gfp_t flags) 269 345 { 270 - struct kmem_cache *s; 346 + struct obj_cgroup *objcg; 271 347 272 - s = READ_ONCE(page->slab_cache); 273 - if (s && !is_root_cache(s)) 274 - return READ_ONCE(s->memcg_params.memcg); 348 + if (memcg_kmem_bypass()) 349 + return NULL; 275 350 276 - return NULL; 277 - } 351 + objcg = get_obj_cgroup_from_current(); 352 + if (!objcg) 353 + return NULL; 278 354 279 - /* 280 - * Charge the slab page belonging to the non-root kmem_cache. 281 - * Can be called for non-root kmem_caches only. 282 - */ 283 - static __always_inline int memcg_charge_slab(struct page *page, 284 - gfp_t gfp, int order, 285 - struct kmem_cache *s) 286 - { 287 - int nr_pages = 1 << order; 288 - struct mem_cgroup *memcg; 289 - struct lruvec *lruvec; 290 - int ret; 291 - 292 - rcu_read_lock(); 293 - memcg = READ_ONCE(s->memcg_params.memcg); 294 - while (memcg && !css_tryget_online(&memcg->css)) 295 - memcg = parent_mem_cgroup(memcg); 296 - rcu_read_unlock(); 297 - 298 - if (unlikely(!memcg || mem_cgroup_is_root(memcg))) { 299 - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), 300 - nr_pages); 301 - percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages); 302 - return 0; 355 + if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) { 356 + obj_cgroup_put(objcg); 357 + return NULL; 303 358 } 304 359 305 - ret = memcg_kmem_charge(memcg, gfp, nr_pages); 306 - if (ret) 307 - goto out; 308 - 309 - lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); 310 - mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages); 311 - 312 - /* transer try_charge() page references to kmem_cache */ 313 - percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages); 314 - css_put_many(&memcg->css, nr_pages); 315 - out: 316 - css_put(&memcg->css); 317 - return ret; 360 + return objcg; 318 
361 } 319 362 320 - /* 321 - * Uncharge a slab page belonging to a non-root kmem_cache. 322 - * Can be called for non-root kmem_caches only. 323 - */ 324 - static __always_inline void memcg_uncharge_slab(struct page *page, int order, 325 - struct kmem_cache *s) 363 + static inline void mod_objcg_state(struct obj_cgroup *objcg, 364 + struct pglist_data *pgdat, 365 + int idx, int nr) 326 366 { 327 - int nr_pages = 1 << order; 328 367 struct mem_cgroup *memcg; 329 368 struct lruvec *lruvec; 330 369 331 370 rcu_read_lock(); 332 - memcg = READ_ONCE(s->memcg_params.memcg); 333 - if (likely(!mem_cgroup_is_root(memcg))) { 334 - lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); 335 - mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages); 336 - memcg_kmem_uncharge(memcg, nr_pages); 337 - } else { 338 - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), 339 - -nr_pages); 340 - } 371 + memcg = obj_cgroup_memcg(objcg); 372 + lruvec = mem_cgroup_lruvec(memcg, pgdat); 373 + mod_memcg_lruvec_state(lruvec, idx, nr); 341 374 rcu_read_unlock(); 342 - 343 - percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages); 344 375 } 345 376 346 - extern void slab_init_memcg_params(struct kmem_cache *); 347 - extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg); 377 + static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, 378 + struct obj_cgroup *objcg, 379 + gfp_t flags, size_t size, 380 + void **p) 381 + { 382 + struct page *page; 383 + unsigned long off; 384 + size_t i; 385 + 386 + if (!objcg) 387 + return; 388 + 389 + flags &= ~__GFP_ACCOUNT; 390 + for (i = 0; i < size; i++) { 391 + if (likely(p[i])) { 392 + page = virt_to_head_page(p[i]); 393 + 394 + if (!page_has_obj_cgroups(page) && 395 + memcg_alloc_page_obj_cgroups(page, s, flags)) { 396 + obj_cgroup_uncharge(objcg, obj_full_size(s)); 397 + continue; 398 + } 399 + 400 + off = obj_to_index(s, page, p[i]); 401 + obj_cgroup_get(objcg); 402 + page_obj_cgroups(page)[off] = objcg; 403 + 
mod_objcg_state(objcg, page_pgdat(page), 404 + cache_vmstat_idx(s), obj_full_size(s)); 405 + } else { 406 + obj_cgroup_uncharge(objcg, obj_full_size(s)); 407 + } 408 + } 409 + obj_cgroup_put(objcg); 410 + } 411 + 412 + static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page, 413 + void *p) 414 + { 415 + struct obj_cgroup *objcg; 416 + unsigned int off; 417 + 418 + if (!memcg_kmem_enabled()) 419 + return; 420 + 421 + if (!page_has_obj_cgroups(page)) 422 + return; 423 + 424 + off = obj_to_index(s, page, p); 425 + objcg = page_obj_cgroups(page)[off]; 426 + page_obj_cgroups(page)[off] = NULL; 427 + 428 + if (!objcg) 429 + return; 430 + 431 + obj_cgroup_uncharge(objcg, obj_full_size(s)); 432 + mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s), 433 + -obj_full_size(s)); 434 + 435 + obj_cgroup_put(objcg); 436 + } 348 437 349 438 #else /* CONFIG_MEMCG_KMEM */ 350 - 351 - /* If !memcg, all caches are root. */ 352 - #define slab_root_caches slab_caches 353 - #define root_caches_node list 354 - 355 - #define for_each_memcg_cache(iter, root) \ 356 - for ((void)(iter), (void)(root); 0; ) 357 - 358 - static inline bool is_root_cache(struct kmem_cache *s) 439 + static inline bool page_has_obj_cgroups(struct page *page) 359 440 { 360 - return true; 441 + return false; 361 442 } 362 443 363 - static inline bool slab_equal_or_root(struct kmem_cache *s, 364 - struct kmem_cache *p) 365 - { 366 - return s == p; 367 - } 368 - 369 - static inline const char *cache_name(struct kmem_cache *s) 370 - { 371 - return s->name; 372 - } 373 - 374 - static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s) 375 - { 376 - return s; 377 - } 378 - 379 - static inline struct mem_cgroup *memcg_from_slab_page(struct page *page) 444 + static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr) 380 445 { 381 446 return NULL; 382 447 } 383 448 384 - static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order, 385 - struct kmem_cache *s) 
449 + static inline int memcg_alloc_page_obj_cgroups(struct page *page, 450 + struct kmem_cache *s, gfp_t gfp) 386 451 { 387 452 return 0; 388 453 } 389 454 390 - static inline void memcg_uncharge_slab(struct page *page, int order, 391 - struct kmem_cache *s) 455 + static inline void memcg_free_page_obj_cgroups(struct page *page) 392 456 { 393 457 } 394 458 395 - static inline void slab_init_memcg_params(struct kmem_cache *s) 459 + static inline struct obj_cgroup *memcg_slab_pre_alloc_hook(struct kmem_cache *s, 460 + size_t objects, 461 + gfp_t flags) 462 + { 463 + return NULL; 464 + } 465 + 466 + static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, 467 + struct obj_cgroup *objcg, 468 + gfp_t flags, size_t size, 469 + void **p) 396 470 { 397 471 } 398 472 399 - static inline void memcg_link_cache(struct kmem_cache *s, 400 - struct mem_cgroup *memcg) 473 + static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page, 474 + void *p) 401 475 { 402 476 } 403 - 404 477 #endif /* CONFIG_MEMCG_KMEM */ 405 478 406 479 static inline struct kmem_cache *virt_to_cache(const void *obj) ··· 423 478 return page->slab_cache; 424 479 } 425 480 426 - static __always_inline int charge_slab_page(struct page *page, 427 - gfp_t gfp, int order, 428 - struct kmem_cache *s) 481 + static __always_inline void account_slab_page(struct page *page, int order, 482 + struct kmem_cache *s) 429 483 { 430 - if (is_root_cache(s)) { 431 - mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), 432 - 1 << order); 433 - return 0; 434 - } 435 - 436 - return memcg_charge_slab(page, gfp, order, s); 484 + mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), 485 + PAGE_SIZE << order); 437 486 } 438 487 439 - static __always_inline void uncharge_slab_page(struct page *page, int order, 440 - struct kmem_cache *s) 488 + static __always_inline void unaccount_slab_page(struct page *page, int order, 489 + struct kmem_cache *s) 441 490 { 442 - if (is_root_cache(s)) { 443 - 
mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), 444 - -(1 << order)); 445 - return; 446 - } 491 + if (memcg_kmem_enabled()) 492 + memcg_free_page_obj_cgroups(page); 447 493 448 - memcg_uncharge_slab(page, order, s); 494 + mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s), 495 + -(PAGE_SIZE << order)); 449 496 } 450 497 451 498 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x) 452 499 { 453 500 struct kmem_cache *cachep; 454 501 455 - /* 456 - * When kmemcg is not being used, both assignments should return the 457 - * same value. but we don't want to pay the assignment price in that 458 - * case. If it is not compiled in, the compiler should be smart enough 459 - * to not do even the assignment. In that case, slab_equal_or_root 460 - * will also be a constant. 461 - */ 462 - if (!memcg_kmem_enabled() && 463 - !IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) && 464 - !unlikely(s->flags & SLAB_CONSISTENCY_CHECKS)) 502 + if (!IS_ENABLED(CONFIG_SLAB_FREELIST_HARDENED) && 503 + !kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) 465 504 return s; 466 505 467 506 cachep = virt_to_cache(x); 468 - WARN_ONCE(cachep && !slab_equal_or_root(cachep, s), 507 + if (WARN(cachep && cachep != s, 469 508 "%s: Wrong slab cache. 
%s but object is from %s\n", 470 - __func__, s->name, cachep->name); 509 + __func__, s->name, cachep->name)) 510 + print_tracking(cachep, x); 471 511 return cachep; 472 512 } 473 513 ··· 487 557 } 488 558 489 559 static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, 490 - gfp_t flags) 560 + struct obj_cgroup **objcgp, 561 + size_t size, gfp_t flags) 491 562 { 492 563 flags &= gfp_allowed_mask; 493 564 ··· 502 571 503 572 if (memcg_kmem_enabled() && 504 573 ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT))) 505 - return memcg_kmem_get_cache(s); 574 + *objcgp = memcg_slab_pre_alloc_hook(s, size, flags); 506 575 507 576 return s; 508 577 } 509 578 510 - static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, 511 - size_t size, void **p) 579 + static inline void slab_post_alloc_hook(struct kmem_cache *s, 580 + struct obj_cgroup *objcg, 581 + gfp_t flags, size_t size, void **p) 512 582 { 513 583 size_t i; 514 584 ··· 522 590 } 523 591 524 592 if (memcg_kmem_enabled()) 525 - memcg_kmem_put_cache(s); 593 + memcg_slab_post_alloc_hook(s, objcg, flags, size, p); 526 594 } 527 595 528 596 #ifndef CONFIG_SLOB ··· 577 645 void *slab_start(struct seq_file *m, loff_t *pos); 578 646 void *slab_next(struct seq_file *m, void *p, loff_t *pos); 579 647 void slab_stop(struct seq_file *m, void *p); 580 - void *memcg_slab_start(struct seq_file *m, loff_t *pos); 581 - void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos); 582 - void memcg_slab_stop(struct seq_file *m, void *p); 583 648 int memcg_slab_show(struct seq_file *m, void *p); 584 649 585 650 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
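The mm/slab.h changes above make page->mem_cgroup and page->obj_cgroups share one word, distinguishing them by always setting the lowest bit of the obj_cgroups pointer (valid because kmalloc'ed vectors are at least word aligned). A minimal userspace sketch of that tagging scheme, with `word` standing in for the shared page field and illustrative helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Tag an obj_cgroups vector pointer before storing it in the word
 * shared with page->mem_cgroup. */
static uintptr_t tag_obj_cgroups(void *vec)
{
	return (uintptr_t)vec | 0x1UL;		/* mark as obj_cgroups */
}

/* Does the shared word currently hold an obj_cgroups vector? */
static int has_obj_cgroups(uintptr_t word)
{
	return (int)(word & 0x1UL);
}

/* Strip the tag to recover a dereferenceable pointer. */
static void *untag_obj_cgroups(uintptr_t word)
{
	return (void *)(word & ~0x1UL);
}
```

This mirrors the page_has_obj_cgroups()/page_obj_cgroups() pair in the diff: readers such as page_cgroup_ino() can test the low bit without first proving the page is a slab page.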
+43 -660
mm/slab_common.c
··· 26 26 #define CREATE_TRACE_POINTS 27 27 #include <trace/events/kmem.h> 28 28 29 + #include "internal.h" 30 + 29 31 #include "slab.h" 30 32 31 33 enum slab_state slab_state; ··· 130 128 return i; 131 129 } 132 130 133 - #ifdef CONFIG_MEMCG_KMEM 134 - 135 - LIST_HEAD(slab_root_caches); 136 - static DEFINE_SPINLOCK(memcg_kmem_wq_lock); 137 - 138 - static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref); 139 - 140 - void slab_init_memcg_params(struct kmem_cache *s) 141 - { 142 - s->memcg_params.root_cache = NULL; 143 - RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL); 144 - INIT_LIST_HEAD(&s->memcg_params.children); 145 - s->memcg_params.dying = false; 146 - } 147 - 148 - static int init_memcg_params(struct kmem_cache *s, 149 - struct kmem_cache *root_cache) 150 - { 151 - struct memcg_cache_array *arr; 152 - 153 - if (root_cache) { 154 - int ret = percpu_ref_init(&s->memcg_params.refcnt, 155 - kmemcg_cache_shutdown, 156 - 0, GFP_KERNEL); 157 - if (ret) 158 - return ret; 159 - 160 - s->memcg_params.root_cache = root_cache; 161 - INIT_LIST_HEAD(&s->memcg_params.children_node); 162 - INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node); 163 - return 0; 164 - } 165 - 166 - slab_init_memcg_params(s); 167 - 168 - if (!memcg_nr_cache_ids) 169 - return 0; 170 - 171 - arr = kvzalloc(sizeof(struct memcg_cache_array) + 172 - memcg_nr_cache_ids * sizeof(void *), 173 - GFP_KERNEL); 174 - if (!arr) 175 - return -ENOMEM; 176 - 177 - RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr); 178 - return 0; 179 - } 180 - 181 - static void destroy_memcg_params(struct kmem_cache *s) 182 - { 183 - if (is_root_cache(s)) { 184 - kvfree(rcu_access_pointer(s->memcg_params.memcg_caches)); 185 - } else { 186 - mem_cgroup_put(s->memcg_params.memcg); 187 - WRITE_ONCE(s->memcg_params.memcg, NULL); 188 - percpu_ref_exit(&s->memcg_params.refcnt); 189 - } 190 - } 191 - 192 - static void free_memcg_params(struct rcu_head *rcu) 193 - { 194 - struct memcg_cache_array *old; 195 - 196 - old = 
container_of(rcu, struct memcg_cache_array, rcu); 197 - kvfree(old); 198 - } 199 - 200 - static int update_memcg_params(struct kmem_cache *s, int new_array_size) 201 - { 202 - struct memcg_cache_array *old, *new; 203 - 204 - new = kvzalloc(sizeof(struct memcg_cache_array) + 205 - new_array_size * sizeof(void *), GFP_KERNEL); 206 - if (!new) 207 - return -ENOMEM; 208 - 209 - old = rcu_dereference_protected(s->memcg_params.memcg_caches, 210 - lockdep_is_held(&slab_mutex)); 211 - if (old) 212 - memcpy(new->entries, old->entries, 213 - memcg_nr_cache_ids * sizeof(void *)); 214 - 215 - rcu_assign_pointer(s->memcg_params.memcg_caches, new); 216 - if (old) 217 - call_rcu(&old->rcu, free_memcg_params); 218 - return 0; 219 - } 220 - 221 - int memcg_update_all_caches(int num_memcgs) 222 - { 223 - struct kmem_cache *s; 224 - int ret = 0; 225 - 226 - mutex_lock(&slab_mutex); 227 - list_for_each_entry(s, &slab_root_caches, root_caches_node) { 228 - ret = update_memcg_params(s, num_memcgs); 229 - /* 230 - * Instead of freeing the memory, we'll just leave the caches 231 - * up to this point in an updated state. 
232 - */ 233 - if (ret) 234 - break; 235 - } 236 - mutex_unlock(&slab_mutex); 237 - return ret; 238 - } 239 - 240 - void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg) 241 - { 242 - if (is_root_cache(s)) { 243 - list_add(&s->root_caches_node, &slab_root_caches); 244 - } else { 245 - css_get(&memcg->css); 246 - s->memcg_params.memcg = memcg; 247 - list_add(&s->memcg_params.children_node, 248 - &s->memcg_params.root_cache->memcg_params.children); 249 - list_add(&s->memcg_params.kmem_caches_node, 250 - &s->memcg_params.memcg->kmem_caches); 251 - } 252 - } 253 - 254 - static void memcg_unlink_cache(struct kmem_cache *s) 255 - { 256 - if (is_root_cache(s)) { 257 - list_del(&s->root_caches_node); 258 - } else { 259 - list_del(&s->memcg_params.children_node); 260 - list_del(&s->memcg_params.kmem_caches_node); 261 - } 262 - } 263 - #else 264 - static inline int init_memcg_params(struct kmem_cache *s, 265 - struct kmem_cache *root_cache) 266 - { 267 - return 0; 268 - } 269 - 270 - static inline void destroy_memcg_params(struct kmem_cache *s) 271 - { 272 - } 273 - 274 - static inline void memcg_unlink_cache(struct kmem_cache *s) 275 - { 276 - } 277 - #endif /* CONFIG_MEMCG_KMEM */ 278 - 279 131 /* 280 132 * Figure out what the alignment of the objects will be given a set of 281 133 * flags, a user specified alignment and the size of the objects. ··· 167 311 if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE)) 168 312 return 1; 169 313 170 - if (!is_root_cache(s)) 171 - return 1; 172 - 173 314 if (s->ctor) 174 315 return 1; 175 316 ··· 178 325 */ 179 326 if (s->refcount < 0) 180 327 return 1; 181 - 182 - #ifdef CONFIG_MEMCG_KMEM 183 - /* 184 - * Skip the dying kmem_cache. 
185 - */ 186 - if (s->memcg_params.dying) 187 - return 1; 188 - #endif 189 328 190 329 return 0; 191 330 } ··· 201 356 if (flags & SLAB_NEVER_MERGE) 202 357 return NULL; 203 358 204 - list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) { 359 + list_for_each_entry_reverse(s, &slab_caches, list) { 205 360 if (slab_unmergeable(s)) 206 361 continue; 207 362 ··· 233 388 unsigned int object_size, unsigned int align, 234 389 slab_flags_t flags, unsigned int useroffset, 235 390 unsigned int usersize, void (*ctor)(void *), 236 - struct mem_cgroup *memcg, struct kmem_cache *root_cache) 391 + struct kmem_cache *root_cache) 237 392 { 238 393 struct kmem_cache *s; 239 394 int err; ··· 253 408 s->useroffset = useroffset; 254 409 s->usersize = usersize; 255 410 256 - err = init_memcg_params(s, root_cache); 257 - if (err) 258 - goto out_free_cache; 259 - 260 411 err = __kmem_cache_create(s, flags); 261 412 if (err) 262 413 goto out_free_cache; 263 414 264 415 s->refcount = 1; 265 416 list_add(&s->list, &slab_caches); 266 - memcg_link_cache(s, memcg); 267 417 out: 268 418 if (err) 269 419 return ERR_PTR(err); 270 420 return s; 271 421 272 422 out_free_cache: 273 - destroy_memcg_params(s); 274 423 kmem_cache_free(kmem_cache, s); 275 424 goto out; 276 425 } ··· 310 471 311 472 get_online_cpus(); 312 473 get_online_mems(); 313 - memcg_get_cache_ids(); 314 474 315 475 mutex_lock(&slab_mutex); 316 476 ··· 350 512 351 513 s = create_cache(cache_name, size, 352 514 calculate_alignment(flags, align, size), 353 - flags, useroffset, usersize, ctor, NULL, NULL); 515 + flags, useroffset, usersize, ctor, NULL); 354 516 if (IS_ERR(s)) { 355 517 err = PTR_ERR(s); 356 518 kfree_const(cache_name); ··· 359 521 out_unlock: 360 522 mutex_unlock(&slab_mutex); 361 523 362 - memcg_put_cache_ids(); 363 524 put_online_mems(); 364 525 put_online_cpus(); 365 526 ··· 451 614 if (__kmem_cache_shutdown(s) != 0) 452 615 return -EBUSY; 453 616 454 - memcg_unlink_cache(s); 455 617 
list_del(&s->list); 456 618 457 619 if (s->flags & SLAB_TYPESAFE_BY_RCU) { ··· 471 635 return 0; 472 636 } 473 637 474 - #ifdef CONFIG_MEMCG_KMEM 475 - /* 476 - * memcg_create_kmem_cache - Create a cache for a memory cgroup. 477 - * @memcg: The memory cgroup the new cache is for. 478 - * @root_cache: The parent of the new cache. 479 - * 480 - * This function attempts to create a kmem cache that will serve allocation 481 - * requests going from @memcg to @root_cache. The new cache inherits properties 482 - * from its parent. 483 - */ 484 - void memcg_create_kmem_cache(struct mem_cgroup *memcg, 485 - struct kmem_cache *root_cache) 486 - { 487 - static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */ 488 - struct cgroup_subsys_state *css = &memcg->css; 489 - struct memcg_cache_array *arr; 490 - struct kmem_cache *s = NULL; 491 - char *cache_name; 492 - int idx; 493 - 494 - get_online_cpus(); 495 - get_online_mems(); 496 - 497 - mutex_lock(&slab_mutex); 498 - 499 - /* 500 - * The memory cgroup could have been offlined while the cache 501 - * creation work was pending. 502 - */ 503 - if (memcg->kmem_state != KMEM_ONLINE) 504 - goto out_unlock; 505 - 506 - idx = memcg_cache_id(memcg); 507 - arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches, 508 - lockdep_is_held(&slab_mutex)); 509 - 510 - /* 511 - * Since per-memcg caches are created asynchronously on first 512 - * allocation (see memcg_kmem_get_cache()), several threads can try to 513 - * create the same cache, but only one of them may succeed. 
514 - */ 515 - if (arr->entries[idx]) 516 - goto out_unlock; 517 - 518 - cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf)); 519 - cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name, 520 - css->serial_nr, memcg_name_buf); 521 - if (!cache_name) 522 - goto out_unlock; 523 - 524 - s = create_cache(cache_name, root_cache->object_size, 525 - root_cache->align, 526 - root_cache->flags & CACHE_CREATE_MASK, 527 - root_cache->useroffset, root_cache->usersize, 528 - root_cache->ctor, memcg, root_cache); 529 - /* 530 - * If we could not create a memcg cache, do not complain, because 531 - * that's not critical at all as we can always proceed with the root 532 - * cache. 533 - */ 534 - if (IS_ERR(s)) { 535 - kfree(cache_name); 536 - goto out_unlock; 537 - } 538 - 539 - /* 540 - * Since readers won't lock (see memcg_kmem_get_cache()), we need a 541 - * barrier here to ensure nobody will see the kmem_cache partially 542 - * initialized. 543 - */ 544 - smp_wmb(); 545 - arr->entries[idx] = s; 546 - 547 - out_unlock: 548 - mutex_unlock(&slab_mutex); 549 - 550 - put_online_mems(); 551 - put_online_cpus(); 552 - } 553 - 554 - static void kmemcg_workfn(struct work_struct *work) 555 - { 556 - struct kmem_cache *s = container_of(work, struct kmem_cache, 557 - memcg_params.work); 558 - 559 - get_online_cpus(); 560 - get_online_mems(); 561 - 562 - mutex_lock(&slab_mutex); 563 - s->memcg_params.work_fn(s); 564 - mutex_unlock(&slab_mutex); 565 - 566 - put_online_mems(); 567 - put_online_cpus(); 568 - } 569 - 570 - static void kmemcg_rcufn(struct rcu_head *head) 571 - { 572 - struct kmem_cache *s = container_of(head, struct kmem_cache, 573 - memcg_params.rcu_head); 574 - 575 - /* 576 - * We need to grab blocking locks. Bounce to ->work. The 577 - * work item shares the space with the RCU head and can't be 578 - * initialized earlier. 
579 - */ 580 - INIT_WORK(&s->memcg_params.work, kmemcg_workfn); 581 - queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); 582 - } 583 - 584 - static void kmemcg_cache_shutdown_fn(struct kmem_cache *s) 585 - { 586 - WARN_ON(shutdown_cache(s)); 587 - } 588 - 589 - static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref) 590 - { 591 - struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache, 592 - memcg_params.refcnt); 593 - unsigned long flags; 594 - 595 - spin_lock_irqsave(&memcg_kmem_wq_lock, flags); 596 - if (s->memcg_params.root_cache->memcg_params.dying) 597 - goto unlock; 598 - 599 - s->memcg_params.work_fn = kmemcg_cache_shutdown_fn; 600 - INIT_WORK(&s->memcg_params.work, kmemcg_workfn); 601 - queue_work(memcg_kmem_cache_wq, &s->memcg_params.work); 602 - 603 - unlock: 604 - spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags); 605 - } 606 - 607 - static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s) 608 - { 609 - __kmemcg_cache_deactivate_after_rcu(s); 610 - percpu_ref_kill(&s->memcg_params.refcnt); 611 - } 612 - 613 - static void kmemcg_cache_deactivate(struct kmem_cache *s) 614 - { 615 - if (WARN_ON_ONCE(is_root_cache(s))) 616 - return; 617 - 618 - __kmemcg_cache_deactivate(s); 619 - s->flags |= SLAB_DEACTIVATED; 620 - 621 - /* 622 - * memcg_kmem_wq_lock is used to synchronize memcg_params.dying 623 - * flag and make sure that no new kmem_cache deactivation tasks 624 - * are queued (see flush_memcg_workqueue() ). 
625 - */ 626 - spin_lock_irq(&memcg_kmem_wq_lock); 627 - if (s->memcg_params.root_cache->memcg_params.dying) 628 - goto unlock; 629 - 630 - s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu; 631 - call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn); 632 - unlock: 633 - spin_unlock_irq(&memcg_kmem_wq_lock); 634 - } 635 - 636 - void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg, 637 - struct mem_cgroup *parent) 638 - { 639 - int idx; 640 - struct memcg_cache_array *arr; 641 - struct kmem_cache *s, *c; 642 - unsigned int nr_reparented; 643 - 644 - idx = memcg_cache_id(memcg); 645 - 646 - get_online_cpus(); 647 - get_online_mems(); 648 - 649 - mutex_lock(&slab_mutex); 650 - list_for_each_entry(s, &slab_root_caches, root_caches_node) { 651 - arr = rcu_dereference_protected(s->memcg_params.memcg_caches, 652 - lockdep_is_held(&slab_mutex)); 653 - c = arr->entries[idx]; 654 - if (!c) 655 - continue; 656 - 657 - kmemcg_cache_deactivate(c); 658 - arr->entries[idx] = NULL; 659 - } 660 - nr_reparented = 0; 661 - list_for_each_entry(s, &memcg->kmem_caches, 662 - memcg_params.kmem_caches_node) { 663 - WRITE_ONCE(s->memcg_params.memcg, parent); 664 - css_put(&memcg->css); 665 - nr_reparented++; 666 - } 667 - if (nr_reparented) { 668 - list_splice_init(&memcg->kmem_caches, 669 - &parent->kmem_caches); 670 - css_get_many(&parent->css, nr_reparented); 671 - } 672 - mutex_unlock(&slab_mutex); 673 - 674 - put_online_mems(); 675 - put_online_cpus(); 676 - } 677 - 678 - static int shutdown_memcg_caches(struct kmem_cache *s) 679 - { 680 - struct memcg_cache_array *arr; 681 - struct kmem_cache *c, *c2; 682 - LIST_HEAD(busy); 683 - int i; 684 - 685 - BUG_ON(!is_root_cache(s)); 686 - 687 - /* 688 - * First, shutdown active caches, i.e. caches that belong to online 689 - * memory cgroups. 
690 - */ 691 - arr = rcu_dereference_protected(s->memcg_params.memcg_caches, 692 - lockdep_is_held(&slab_mutex)); 693 - for_each_memcg_cache_index(i) { 694 - c = arr->entries[i]; 695 - if (!c) 696 - continue; 697 - if (shutdown_cache(c)) 698 - /* 699 - * The cache still has objects. Move it to a temporary 700 - * list so as not to try to destroy it for a second 701 - * time while iterating over inactive caches below. 702 - */ 703 - list_move(&c->memcg_params.children_node, &busy); 704 - else 705 - /* 706 - * The cache is empty and will be destroyed soon. Clear 707 - * the pointer to it in the memcg_caches array so that 708 - * it will never be accessed even if the root cache 709 - * stays alive. 710 - */ 711 - arr->entries[i] = NULL; 712 - } 713 - 714 - /* 715 - * Second, shutdown all caches left from memory cgroups that are now 716 - * offline. 717 - */ 718 - list_for_each_entry_safe(c, c2, &s->memcg_params.children, 719 - memcg_params.children_node) 720 - shutdown_cache(c); 721 - 722 - list_splice(&busy, &s->memcg_params.children); 723 - 724 - /* 725 - * A cache being destroyed must be empty. In particular, this means 726 - * that all per memcg caches attached to it must be empty too. 727 - */ 728 - if (!list_empty(&s->memcg_params.children)) 729 - return -EBUSY; 730 - return 0; 731 - } 732 - 733 - static void memcg_set_kmem_cache_dying(struct kmem_cache *s) 734 - { 735 - spin_lock_irq(&memcg_kmem_wq_lock); 736 - s->memcg_params.dying = true; 737 - spin_unlock_irq(&memcg_kmem_wq_lock); 738 - } 739 - 740 - static void flush_memcg_workqueue(struct kmem_cache *s) 741 - { 742 - /* 743 - * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make 744 - * sure all registered rcu callbacks have been invoked. 745 - */ 746 - rcu_barrier(); 747 - 748 - /* 749 - * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB 750 - * deactivates the memcg kmem_caches through workqueue. Make sure all 751 - * previous workitems on workqueue are processed. 
752 - */ 753 - if (likely(memcg_kmem_cache_wq)) 754 - flush_workqueue(memcg_kmem_cache_wq); 755 - 756 - /* 757 - * If we're racing with children kmem_cache deactivation, it might 758 - * take another rcu grace period to complete their destruction. 759 - * At this moment the corresponding percpu_ref_kill() call should be 760 - * done, but it might take another rcu grace period to complete 761 - * switching to the atomic mode. 762 - * Please, note that we check without grabbing the slab_mutex. It's safe 763 - * because at this moment the children list can't grow. 764 - */ 765 - if (!list_empty(&s->memcg_params.children)) 766 - rcu_barrier(); 767 - } 768 - #else 769 - static inline int shutdown_memcg_caches(struct kmem_cache *s) 770 - { 771 - return 0; 772 - } 773 - #endif /* CONFIG_MEMCG_KMEM */ 774 - 775 638 void slab_kmem_cache_release(struct kmem_cache *s) 776 639 { 777 640 __kmem_cache_release(s); 778 - destroy_memcg_params(s); 779 641 kfree_const(s->name); 780 642 kmem_cache_free(kmem_cache, s); 781 643 } ··· 494 960 if (s->refcount) 495 961 goto out_unlock; 496 962 497 - #ifdef CONFIG_MEMCG_KMEM 498 - memcg_set_kmem_cache_dying(s); 499 - 500 - mutex_unlock(&slab_mutex); 501 - 502 - put_online_mems(); 503 - put_online_cpus(); 504 - 505 - flush_memcg_workqueue(s); 506 - 507 - get_online_cpus(); 508 - get_online_mems(); 509 - 510 - mutex_lock(&slab_mutex); 511 - #endif 512 - 513 - err = shutdown_memcg_caches(s); 514 - if (!err) 515 - err = shutdown_cache(s); 516 - 963 + err = shutdown_cache(s); 517 964 if (err) { 518 965 pr_err("kmem_cache_destroy %s: Slab cache still has objects\n", 519 966 s->name); ··· 531 1016 } 532 1017 EXPORT_SYMBOL(kmem_cache_shrink); 533 1018 534 - /** 535 - * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache 536 - * @s: The cache pointer 537 - */ 538 - void kmem_cache_shrink_all(struct kmem_cache *s) 539 - { 540 - struct kmem_cache *c; 541 - 542 - if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) { 543 - 
kmem_cache_shrink(s); 544 - return; 545 - } 546 - 547 - get_online_cpus(); 548 - get_online_mems(); 549 - kasan_cache_shrink(s); 550 - __kmem_cache_shrink(s); 551 - 552 - /* 553 - * We have to take the slab_mutex to protect from the memcg list 554 - * modification. 555 - */ 556 - mutex_lock(&slab_mutex); 557 - for_each_memcg_cache(c, s) { 558 - /* 559 - * Don't need to shrink deactivated memcg caches. 560 - */ 561 - if (s->flags & SLAB_DEACTIVATED) 562 - continue; 563 - kasan_cache_shrink(c); 564 - __kmem_cache_shrink(c); 565 - } 566 - mutex_unlock(&slab_mutex); 567 - put_online_mems(); 568 - put_online_cpus(); 569 - } 570 - 571 1019 bool slab_is_available(void) 572 1020 { 573 1021 return slab_state >= UP; ··· 559 1081 s->useroffset = useroffset; 560 1082 s->usersize = usersize; 561 1083 562 - slab_init_memcg_params(s); 563 - 564 1084 err = __kmem_cache_create(s, flags); 565 1085 566 1086 if (err) ··· 579 1103 580 1104 create_boot_cache(s, name, size, flags, useroffset, usersize); 581 1105 list_add(&s->list, &slab_caches); 582 - memcg_link_cache(s, NULL); 583 1106 s->refcount = 1; 584 1107 return s; 585 1108 } ··· 807 1332 } 808 1333 #endif /* !CONFIG_SLOB */ 809 1334 1335 + gfp_t kmalloc_fix_flags(gfp_t flags) 1336 + { 1337 + gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK; 1338 + 1339 + flags &= ~GFP_SLAB_BUG_MASK; 1340 + pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n", 1341 + invalid_mask, &invalid_mask, flags, &flags); 1342 + dump_stack(); 1343 + 1344 + return flags; 1345 + } 1346 + 810 1347 /* 811 1348 * To avoid unnecessary overhead, we pass through large allocation requests 812 1349 * directly to the page allocator. 
We use __GFP_COMP, because we will need to ··· 829 1342 void *ret = NULL; 830 1343 struct page *page; 831 1344 1345 + if (unlikely(flags & GFP_SLAB_BUG_MASK)) 1346 + flags = kmalloc_fix_flags(flags); 1347 + 832 1348 flags |= __GFP_COMP; 833 1349 page = alloc_pages(flags, order); 834 1350 if (likely(page)) { 835 1351 ret = page_address(page); 836 - mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE, 837 - 1 << order); 1352 + mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B, 1353 + PAGE_SIZE << order); 838 1354 } 839 1355 ret = kasan_kmalloc_large(ret, size, flags); 840 1356 /* As ret might get tagged, call kmemleak hook after KASAN. */ ··· 934 1444 void *slab_start(struct seq_file *m, loff_t *pos) 935 1445 { 936 1446 mutex_lock(&slab_mutex); 937 - return seq_list_start(&slab_root_caches, *pos); 1447 + return seq_list_start(&slab_caches, *pos); 938 1448 } 939 1449 940 1450 void *slab_next(struct seq_file *m, void *p, loff_t *pos) 941 1451 { 942 - return seq_list_next(p, &slab_root_caches, pos); 1452 + return seq_list_next(p, &slab_caches, pos); 943 1453 } 944 1454 945 1455 void slab_stop(struct seq_file *m, void *p) 946 1456 { 947 1457 mutex_unlock(&slab_mutex); 948 - } 949 - 950 - static void 951 - memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info) 952 - { 953 - struct kmem_cache *c; 954 - struct slabinfo sinfo; 955 - 956 - if (!is_root_cache(s)) 957 - return; 958 - 959 - for_each_memcg_cache(c, s) { 960 - memset(&sinfo, 0, sizeof(sinfo)); 961 - get_slabinfo(c, &sinfo); 962 - 963 - info->active_slabs += sinfo.active_slabs; 964 - info->num_slabs += sinfo.num_slabs; 965 - info->shared_avail += sinfo.shared_avail; 966 - info->active_objs += sinfo.active_objs; 967 - info->num_objs += sinfo.num_objs; 968 - } 969 1458 } 970 1459 971 1460 static void cache_show(struct kmem_cache *s, struct seq_file *m) ··· 954 1485 memset(&sinfo, 0, sizeof(sinfo)); 955 1486 get_slabinfo(s, &sinfo); 956 1487 957 - memcg_accumulate_slabinfo(s, 
&sinfo); 958 - 959 1488 seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", 960 - cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size, 1489 + s->name, sinfo.active_objs, sinfo.num_objs, s->size, 961 1490 sinfo.objects_per_slab, (1 << sinfo.cache_order)); 962 1491 963 1492 seq_printf(m, " : tunables %4u %4u %4u", ··· 968 1501 969 1502 static int slab_show(struct seq_file *m, void *p) 970 1503 { 971 - struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node); 1504 + struct kmem_cache *s = list_entry(p, struct kmem_cache, list); 972 1505 973 - if (p == slab_root_caches.next) 1506 + if (p == slab_caches.next) 974 1507 print_slabinfo_header(m); 975 1508 cache_show(s, m); 976 1509 return 0; ··· 997 1530 pr_info("Name Used Total\n"); 998 1531 999 1532 list_for_each_entry_safe(s, s2, &slab_caches, list) { 1000 - if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT)) 1533 + if (s->flags & SLAB_RECLAIM_ACCOUNT) 1001 1534 continue; 1002 1535 1003 1536 get_slabinfo(s, &sinfo); 1004 1537 1005 1538 if (sinfo.num_objs > 0) 1006 - pr_info("%-17s %10luKB %10luKB\n", cache_name(s), 1539 + pr_info("%-17s %10luKB %10luKB\n", s->name, 1007 1540 (sinfo.active_objs * s->size) / 1024, 1008 1541 (sinfo.num_objs * s->size) / 1024); 1009 1542 } ··· 1011 1544 } 1012 1545 1013 1546 #if defined(CONFIG_MEMCG_KMEM) 1014 - void *memcg_slab_start(struct seq_file *m, loff_t *pos) 1015 - { 1016 - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); 1017 - 1018 - mutex_lock(&slab_mutex); 1019 - return seq_list_start(&memcg->kmem_caches, *pos); 1020 - } 1021 - 1022 - void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos) 1023 - { 1024 - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); 1025 - 1026 - return seq_list_next(p, &memcg->kmem_caches, pos); 1027 - } 1028 - 1029 - void memcg_slab_stop(struct seq_file *m, void *p) 1030 - { 1031 - mutex_unlock(&slab_mutex); 1032 - } 1033 - 1034 1547 int memcg_slab_show(struct seq_file *m, void *p) 1035 1548 { 1036 - struct kmem_cache 
*s = list_entry(p, struct kmem_cache, 1037 - memcg_params.kmem_caches_node); 1038 - struct mem_cgroup *memcg = mem_cgroup_from_seq(m); 1039 - 1040 - if (p == memcg->kmem_caches.next) 1041 - print_slabinfo_header(m); 1042 - cache_show(s, m); 1549 + /* 1550 + * Deprecated. 1551 + * Please, take a look at tools/cgroup/slabinfo.py . 1552 + */ 1043 1553 return 0; 1044 1554 } 1045 1555 #endif ··· 1062 1618 } 1063 1619 module_init(slab_proc_init); 1064 1620 1065 - #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM) 1066 - /* 1067 - * Display information about kmem caches that have child memcg caches. 1068 - */ 1069 - static int memcg_slabinfo_show(struct seq_file *m, void *unused) 1070 - { 1071 - struct kmem_cache *s, *c; 1072 - struct slabinfo sinfo; 1073 - 1074 - mutex_lock(&slab_mutex); 1075 - seq_puts(m, "# <name> <css_id[:dead|deact]> <active_objs> <num_objs>"); 1076 - seq_puts(m, " <active_slabs> <num_slabs>\n"); 1077 - list_for_each_entry(s, &slab_root_caches, root_caches_node) { 1078 - /* 1079 - * Skip kmem caches that don't have any memcg children. 
1080 - */ 1081 - if (list_empty(&s->memcg_params.children)) 1082 - continue; 1083 - 1084 - memset(&sinfo, 0, sizeof(sinfo)); 1085 - get_slabinfo(s, &sinfo); 1086 - seq_printf(m, "%-17s root %6lu %6lu %6lu %6lu\n", 1087 - cache_name(s), sinfo.active_objs, sinfo.num_objs, 1088 - sinfo.active_slabs, sinfo.num_slabs); 1089 - 1090 - for_each_memcg_cache(c, s) { 1091 - struct cgroup_subsys_state *css; 1092 - char *status = ""; 1093 - 1094 - css = &c->memcg_params.memcg->css; 1095 - if (!(css->flags & CSS_ONLINE)) 1096 - status = ":dead"; 1097 - else if (c->flags & SLAB_DEACTIVATED) 1098 - status = ":deact"; 1099 - 1100 - memset(&sinfo, 0, sizeof(sinfo)); 1101 - get_slabinfo(c, &sinfo); 1102 - seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n", 1103 - cache_name(c), css->id, status, 1104 - sinfo.active_objs, sinfo.num_objs, 1105 - sinfo.active_slabs, sinfo.num_slabs); 1106 - } 1107 - } 1108 - mutex_unlock(&slab_mutex); 1109 - return 0; 1110 - } 1111 - DEFINE_SHOW_ATTRIBUTE(memcg_slabinfo); 1112 - 1113 - static int __init memcg_slabinfo_init(void) 1114 - { 1115 - debugfs_create_file("memcg_slabinfo", S_IFREG | S_IRUGO, 1116 - NULL, NULL, &memcg_slabinfo_fops); 1117 - return 0; 1118 - } 1119 - 1120 - late_initcall(memcg_slabinfo_init); 1121 - #endif /* CONFIG_DEBUG_FS && CONFIG_MEMCG_KMEM */ 1122 1621 #endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */ 1123 1622 1124 1623 static __always_inline void *__do_krealloc(const void *p, size_t new_size, 1125 1624 gfp_t flags) 1126 1625 { 1127 1626 void *ret; 1128 - size_t ks = 0; 1627 + size_t ks; 1129 1628 1130 - if (p) 1131 - ks = ksize(p); 1629 + ks = ksize(p); 1132 1630 1133 1631 if (ks >= new_size) { 1134 1632 p = kasan_krealloc((void *)p, new_size, flags); ··· 1115 1729 EXPORT_SYMBOL(krealloc); 1116 1730 1117 1731 /** 1118 - * kzfree - like kfree but zero memory 1732 + * kfree_sensitive - Clear sensitive information in memory before freeing 1119 1733 * @p: object to free memory of 1120 1734 * 1121 1735 * The memory of the object 
@p points to is zeroed before freed. 1122 - * If @p is %NULL, kzfree() does nothing. 1736 + * If @p is %NULL, kfree_sensitive() does nothing. 1123 1737 * 1124 1738 * Note: this function zeroes the whole allocated buffer which can be a good 1125 1739 * deal bigger than the requested buffer size passed to kmalloc(). So be 1126 1740 * careful when using this function in performance sensitive code. 1127 1741 */ 1128 - void kzfree(const void *p) 1742 + void kfree_sensitive(const void *p) 1129 1743 { 1130 1744 size_t ks; 1131 1745 void *mem = (void *)p; 1132 1746 1133 - if (unlikely(ZERO_OR_NULL_PTR(mem))) 1134 - return; 1135 1747 ks = ksize(mem); 1136 - memzero_explicit(mem, ks); 1748 + if (ks) 1749 + memzero_explicit(mem, ks); 1137 1750 kfree(mem); 1138 1751 } 1139 - EXPORT_SYMBOL(kzfree); 1752 + EXPORT_SYMBOL(kfree_sensitive); 1140 1753 1141 1754 /** 1142 1755 * ksize - get the actual amount of memory allocated for a given object ··· 1155 1770 { 1156 1771 size_t size; 1157 1772 1158 - if (WARN_ON_ONCE(!objp)) 1159 - return 0; 1160 1773 /* 1161 1774 * We need to check that the pointed to object is valid, and only then 1162 1775 * unpoison the shadow memory below. We use __kasan_check_read(), to ··· 1168 1785 * We want to perform the check before __ksize(), to avoid potentially 1169 1786 * crashing in __ksize() due to accessing invalid metadata. 1170 1787 */ 1171 - if (unlikely(objp == ZERO_SIZE_PTR) || !__kasan_check_read(objp, 1)) 1788 + if (unlikely(ZERO_OR_NULL_PTR(objp)) || !__kasan_check_read(objp, 1)) 1172 1789 return 0; 1173 1790 1174 1791 size = __ksize(objp);
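The kzfree() → kfree_sensitive() rename above pairs a whole-buffer memzero_explicit() with kfree(), relying on ksize(NULL) returning 0 so the zeroing is skipped for NULL pointers. A minimal userspace sketch of the same pattern (the names `memzero_explicit_user` and `free_sensitive` are illustrative stand-ins, and the caller passes the allocation size explicitly since userspace has no portable ksize()):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's memzero_explicit(): a memset followed by a
 * compiler barrier so the store cannot be optimized away just because
 * the buffer is about to be freed. */
static void memzero_explicit_user(void *p, size_t n)
{
	memset(p, 0, n);
	__asm__ __volatile__("" : : "r"(p) : "memory");
}

/* Userspace sketch of kfree_sensitive(): clear the whole allocation,
 * then free it. In the kernel, ksize() returns 0 for NULL, so the
 * zeroing naturally drops out; here we check the caller-supplied size. */
static void free_sensitive(void *p, size_t ks)
{
	if (p && ks)
		memzero_explicit_user(p, ks);
	free(p);
}
```

As the kernel comment notes, the whole allocated buffer is cleared, which can be considerably larger than the requested size, so this is not for hot paths.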
+6 -6
mm/slob.c
··· 202 202 if (!page) 203 203 return NULL; 204 204 205 - mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE, 206 - 1 << order); 205 + mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B, 206 + PAGE_SIZE << order); 207 207 return page_address(page); 208 208 } 209 209 ··· 214 214 if (current->reclaim_state) 215 215 current->reclaim_state->reclaimed_slab += 1 << order; 216 216 217 - mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE, 218 - -(1 << order)); 217 + mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B, 218 + -(PAGE_SIZE << order)); 219 219 __free_pages(sp, order); 220 220 } 221 221 ··· 552 552 slob_free(m, *m + align); 553 553 } else { 554 554 unsigned int order = compound_order(sp); 555 - mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE, 556 - -(1 << order)); 555 + mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B, 556 + -(PAGE_SIZE << order)); 557 557 __free_pages(sp, order); 558 558 559 559 }
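The slob.c hunks above are part of the series' switch of the slab vmstat counters from pages to bytes: NR_SLAB_UNRECLAIMABLE was adjusted by `1 << order` pages, while NR_SLAB_UNRECLAIMABLE_B is adjusted by `PAGE_SIZE << order` bytes for the same order-N allocation. A small sketch of the arithmetic (PAGE_SHIFT of 12, i.e. 4 KiB pages, is an assumption here; real kernels vary by architecture):

```c
#include <stddef.h>

#define PAGE_SHIFT 12			/* assumed: common 4 KiB page */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Old-style counter delta: number of pages in an order-N allocation. */
static unsigned long slab_pages(unsigned int order)
{
	return 1UL << order;
}

/* New *_B counter delta: the same allocation expressed in bytes. */
static unsigned long slab_bytes(unsigned int order)
{
	return PAGE_SIZE << order;
}
```

Counting in bytes lets later patches in the series charge slab memory at sub-page granularity instead of whole pages.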
+198 -426
mm/slub.c
··· 114 114 * the fast path and disables lockless freelists. 115 115 */ 116 116 117 - static inline int kmem_cache_debug(struct kmem_cache *s) 118 - { 119 117 #ifdef CONFIG_SLUB_DEBUG 120 - return unlikely(s->flags & SLAB_DEBUG_FLAGS); 118 + #ifdef CONFIG_SLUB_DEBUG_ON 119 + DEFINE_STATIC_KEY_TRUE(slub_debug_enabled); 121 120 #else 122 - return 0; 121 + DEFINE_STATIC_KEY_FALSE(slub_debug_enabled); 123 122 #endif 123 + #endif 124 + 125 + static inline bool kmem_cache_debug(struct kmem_cache *s) 126 + { 127 + return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS); 124 128 } 125 129 126 130 void *fixup_red_left(struct kmem_cache *s, void *p) 127 131 { 128 - if (kmem_cache_debug(s) && s->flags & SLAB_RED_ZONE) 132 + if (kmem_cache_debug_flags(s, SLAB_RED_ZONE)) 129 133 p += s->red_left_pad; 130 134 131 135 return p; ··· 218 214 #ifdef CONFIG_SYSFS 219 215 static int sysfs_slab_add(struct kmem_cache *); 220 216 static int sysfs_slab_alias(struct kmem_cache *, const char *); 221 - static void memcg_propagate_slab_attrs(struct kmem_cache *s); 222 - static void sysfs_slab_remove(struct kmem_cache *s); 223 217 #else 224 218 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; } 225 219 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p) 226 220 { return 0; } 227 - static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { } 228 - static inline void sysfs_slab_remove(struct kmem_cache *s) { } 229 221 #endif 230 222 231 223 static inline void stat(const struct kmem_cache *s, enum stat_item si) ··· 312 312 for (__p = fixup_red_left(__s, __addr); \ 313 313 __p < (__addr) + (__objects) * (__s)->size; \ 314 314 __p += (__s)->size) 315 - 316 - /* Determine object index from a given position */ 317 - static inline unsigned int slab_index(void *p, struct kmem_cache *s, void *addr) 318 - { 319 - return (kasan_reset_tag(p) - addr) / s->size; 320 - } 321 315 322 316 static inline unsigned int order_objects(unsigned int order, unsigned int size) 
323 317 { ··· 455 461 bitmap_zero(object_map, page->objects); 456 462 457 463 for (p = page->freelist; p; p = get_freepointer(s, p)) 458 - set_bit(slab_index(p, s, addr), object_map); 464 + set_bit(__obj_to_index(s, addr, p), object_map); 459 465 460 466 return object_map; 461 467 } ··· 463 469 static void put_map(unsigned long *map) __releases(&object_map_lock) 464 470 { 465 471 VM_BUG_ON(map != object_map); 466 - lockdep_assert_held(&object_map_lock); 467 - 468 472 spin_unlock(&object_map_lock); 469 473 } 470 474 ··· 491 499 static slab_flags_t slub_debug; 492 500 #endif 493 501 494 - static char *slub_debug_slabs; 502 + static char *slub_debug_string; 495 503 static int disable_higher_order_debug; 496 504 497 505 /* ··· 626 634 #endif 627 635 } 628 636 629 - static void print_tracking(struct kmem_cache *s, void *object) 637 + void print_tracking(struct kmem_cache *s, void *object) 630 638 { 631 639 unsigned long pr_time = jiffies; 632 640 if (!(s->flags & SLAB_STORE_USER)) ··· 1104 1112 static void setup_object_debug(struct kmem_cache *s, struct page *page, 1105 1113 void *object) 1106 1114 { 1107 - if (!(s->flags & (SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON))) 1115 + if (!kmem_cache_debug_flags(s, SLAB_STORE_USER|SLAB_RED_ZONE|__OBJECT_POISON)) 1108 1116 return; 1109 1117 1110 1118 init_object(s, object, SLUB_RED_INACTIVE); ··· 1114 1122 static 1115 1123 void setup_page_debug(struct kmem_cache *s, struct page *page, void *addr) 1116 1124 { 1117 - if (!(s->flags & SLAB_POISON)) 1125 + if (!kmem_cache_debug_flags(s, SLAB_POISON)) 1118 1126 return; 1119 1127 1120 1128 metadata_access_enable(); ··· 1254 1262 return ret; 1255 1263 } 1256 1264 1265 + /* 1266 + * Parse a block of slub_debug options. 
Blocks are delimited by ';' 1267 + * 1268 + * @str: start of block 1269 + * @flags: returns parsed flags, or DEBUG_DEFAULT_FLAGS if none specified 1270 + * @slabs: return start of list of slabs, or NULL when there's no list 1271 + * @init: assume this is initial parsing and not per-kmem-create parsing 1272 + * 1273 + * returns the start of next block if there's any, or NULL 1274 + */ 1275 + static char * 1276 + parse_slub_debug_flags(char *str, slab_flags_t *flags, char **slabs, bool init) 1277 + { 1278 + bool higher_order_disable = false; 1279 + 1280 + /* Skip any completely empty blocks */ 1281 + while (*str && *str == ';') 1282 + str++; 1283 + 1284 + if (*str == ',') { 1285 + /* 1286 + * No options but restriction on slabs. This means full 1287 + * debugging for slabs matching a pattern. 1288 + */ 1289 + *flags = DEBUG_DEFAULT_FLAGS; 1290 + goto check_slabs; 1291 + } 1292 + *flags = 0; 1293 + 1294 + /* Determine which debug features should be switched on */ 1295 + for (; *str && *str != ',' && *str != ';'; str++) { 1296 + switch (tolower(*str)) { 1297 + case '-': 1298 + *flags = 0; 1299 + break; 1300 + case 'f': 1301 + *flags |= SLAB_CONSISTENCY_CHECKS; 1302 + break; 1303 + case 'z': 1304 + *flags |= SLAB_RED_ZONE; 1305 + break; 1306 + case 'p': 1307 + *flags |= SLAB_POISON; 1308 + break; 1309 + case 'u': 1310 + *flags |= SLAB_STORE_USER; 1311 + break; 1312 + case 't': 1313 + *flags |= SLAB_TRACE; 1314 + break; 1315 + case 'a': 1316 + *flags |= SLAB_FAILSLAB; 1317 + break; 1318 + case 'o': 1319 + /* 1320 + * Avoid enabling debugging on caches if its minimum 1321 + * order would increase as a result. 1322 + */ 1323 + higher_order_disable = true; 1324 + break; 1325 + default: 1326 + if (init) 1327 + pr_err("slub_debug option '%c' unknown. 
skipped\n", *str); 1328 + } 1329 + } 1330 + check_slabs: 1331 + if (*str == ',') 1332 + *slabs = ++str; 1333 + else 1334 + *slabs = NULL; 1335 + 1336 + /* Skip over the slab list */ 1337 + while (*str && *str != ';') 1338 + str++; 1339 + 1340 + /* Skip any completely empty blocks */ 1341 + while (*str && *str == ';') 1342 + str++; 1343 + 1344 + if (init && higher_order_disable) 1345 + disable_higher_order_debug = 1; 1346 + 1347 + if (*str) 1348 + return str; 1349 + else 1350 + return NULL; 1351 + } 1352 + 1257 1353 static int __init setup_slub_debug(char *str) 1258 1354 { 1355 + slab_flags_t flags; 1356 + char *saved_str; 1357 + char *slab_list; 1358 + bool global_slub_debug_changed = false; 1359 + bool slab_list_specified = false; 1360 + 1259 1361 slub_debug = DEBUG_DEFAULT_FLAGS; 1260 1362 if (*str++ != '=' || !*str) 1261 1363 /* ··· 1357 1271 */ 1358 1272 goto out; 1359 1273 1360 - if (*str == ',') 1361 - /* 1362 - * No options but restriction on slabs. This means full 1363 - * debugging for slabs matching a pattern. 1364 - */ 1365 - goto check_slabs; 1274 + saved_str = str; 1275 + while (str) { 1276 + str = parse_slub_debug_flags(str, &flags, &slab_list, true); 1366 1277 1367 - slub_debug = 0; 1368 - if (*str == '-') 1369 - /* 1370 - * Switch off all debugging measures. 
1371 - */ 1372 - goto out; 1373 - 1374 - /* 1375 - * Determine which debug features should be switched on 1376 - */ 1377 - for (; *str && *str != ','; str++) { 1378 - switch (tolower(*str)) { 1379 - case 'f': 1380 - slub_debug |= SLAB_CONSISTENCY_CHECKS; 1381 - break; 1382 - case 'z': 1383 - slub_debug |= SLAB_RED_ZONE; 1384 - break; 1385 - case 'p': 1386 - slub_debug |= SLAB_POISON; 1387 - break; 1388 - case 'u': 1389 - slub_debug |= SLAB_STORE_USER; 1390 - break; 1391 - case 't': 1392 - slub_debug |= SLAB_TRACE; 1393 - break; 1394 - case 'a': 1395 - slub_debug |= SLAB_FAILSLAB; 1396 - break; 1397 - case 'o': 1398 - /* 1399 - * Avoid enabling debugging on caches if its minimum 1400 - * order would increase as a result. 1401 - */ 1402 - disable_higher_order_debug = 1; 1403 - break; 1404 - default: 1405 - pr_err("slub_debug option '%c' unknown. skipped\n", 1406 - *str); 1278 + if (!slab_list) { 1279 + slub_debug = flags; 1280 + global_slub_debug_changed = true; 1281 + } else { 1282 + slab_list_specified = true; 1407 1283 } 1408 1284 } 1409 1285 1410 - check_slabs: 1411 - if (*str == ',') 1412 - slub_debug_slabs = str + 1; 1286 + /* 1287 + * For backwards compatibility, a single list of flags with list of 1288 + * slabs means debugging is only enabled for those slabs, so the global 1289 + * slub_debug should be 0. We can extend that to multiple lists as 
1291 + */ 1292 + if (slab_list_specified) { 1293 + if (!global_slub_debug_changed) 1294 + slub_debug = 0; 1295 + slub_debug_string = saved_str; 1296 + } 1413 1297 out: 1298 + if (slub_debug != 0 || slub_debug_string) 1299 + static_branch_enable(&slub_debug_enabled); 1414 1300 if ((static_branch_unlikely(&init_on_alloc) || 1415 1301 static_branch_unlikely(&init_on_free)) && 1416 1302 (slub_debug & SLAB_POISON)) ··· 1410 1352 { 1411 1353 char *iter; 1412 1354 size_t len; 1355 + char *next_block; 1356 + slab_flags_t block_flags; 1413 1357 1414 1358 /* If slub_debug = 0, it folds into the if conditional. */ 1415 - if (!slub_debug_slabs) 1359 + if (!slub_debug_string) 1416 1360 return flags | slub_debug; 1417 1361 1418 1362 len = strlen(name); 1419 - iter = slub_debug_slabs; 1420 - while (*iter) { 1421 - char *end, *glob; 1422 - size_t cmplen; 1363 + next_block = slub_debug_string; 1364 + /* Go through all blocks of debug options, see if any matches our slab's name */ 1365 + while (next_block) { 1366 + next_block = parse_slub_debug_flags(next_block, &block_flags, &iter, false); 1367 + if (!iter) 1368 + continue; 1369 + /* Found a block that has a slab list, search it */ 1370 + while (*iter) { 1371 + char *end, *glob; 1372 + size_t cmplen; 1423 1373 1424 - end = strchrnul(iter, ','); 1374 + end = strchrnul(iter, ','); 1375 + if (next_block && next_block < end) 1376 + end = next_block - 1; 1425 1377 1426 - glob = strnchr(iter, end - iter, '*'); 1427 - if (glob) 1428 - cmplen = glob - iter; 1429 - else 1430 - cmplen = max_t(size_t, len, (end - iter)); 1378 + glob = strnchr(iter, end - iter, '*'); 1379 + if (glob) 1380 + cmplen = glob - iter; 1381 + else 1382 + cmplen = max_t(size_t, len, (end - iter)); 1431 1383 1432 - if (!strncmp(name, iter, cmplen)) { 1433 - flags |= slub_debug; 1434 - break; 1384 + if (!strncmp(name, iter, cmplen)) { 1385 + flags |= block_flags; 1386 + return flags; 1387 + } 1388 + 1389 + if (!*end || *end == ';') 1390 + break; 1391 + iter = end + 1; 
1435 1392 } 1436 - 1437 - if (!*end) 1438 - break; 1439 - iter = end + 1; 1440 1393 } 1441 1394 1442 - return flags; 1395 + return slub_debug; 1443 1396 } 1444 1397 #else /* !CONFIG_SLUB_DEBUG */ 1445 1398 static inline void setup_object_debug(struct kmem_cache *s, ··· 1539 1470 if (!(s->flags & SLAB_DEBUG_OBJECTS)) 1540 1471 debug_check_no_obj_freed(x, s->object_size); 1541 1472 1473 + /* Use KCSAN to help debug racy use-after-free. */ 1474 + if (!(s->flags & SLAB_TYPESAFE_BY_RCU)) 1475 + __kcsan_check_access(x, s->object_size, 1476 + KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT); 1477 + 1542 1478 /* KASAN might put x into memory quarantine, delaying its reuse */ 1543 1479 return kasan_slab_free(s, x, _RET_IP_); 1544 1480 } ··· 1620 1546 else 1621 1547 page = __alloc_pages_node(node, flags, order); 1622 1548 1623 - if (page && charge_slab_page(page, flags, order, s)) { 1624 - __free_pages(page, order); 1625 - page = NULL; 1626 - } 1549 + if (page) 1550 + account_slab_page(page, order, s); 1627 1551 1628 1552 return page; 1629 1553 } ··· 1817 1745 1818 1746 static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) 1819 1747 { 1820 - if (unlikely(flags & GFP_SLAB_BUG_MASK)) { 1821 - gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK; 1822 - flags &= ~GFP_SLAB_BUG_MASK; 1823 - pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). 
Fix your code!\n", 1824 - invalid_mask, &invalid_mask, flags, &flags); 1825 - dump_stack(); 1826 - } 1748 + if (unlikely(flags & GFP_SLAB_BUG_MASK)) 1749 + flags = kmalloc_fix_flags(flags); 1827 1750 1828 1751 return allocate_slab(s, 1829 1752 flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node); ··· 1829 1762 int order = compound_order(page); 1830 1763 int pages = 1 << order; 1831 1764 1832 - if (s->flags & SLAB_CONSISTENCY_CHECKS) { 1765 + if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) { 1833 1766 void *p; 1834 1767 1835 1768 slab_pad_check(s, page); ··· 1844 1777 page->mapping = NULL; 1845 1778 if (current->reclaim_state) 1846 1779 current->reclaim_state->reclaimed_slab += pages; 1847 - uncharge_slab_page(page, order, s); 1780 + unaccount_slab_page(page, order, s); 1848 1781 __free_pages(page, order); 1849 1782 } 1850 1783 ··· 2811 2744 struct kmem_cache_cpu *c; 2812 2745 struct page *page; 2813 2746 unsigned long tid; 2747 + struct obj_cgroup *objcg = NULL; 2814 2748 2815 - s = slab_pre_alloc_hook(s, gfpflags); 2749 + s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags); 2816 2750 if (!s) 2817 2751 return NULL; 2818 2752 redo: ··· 2889 2821 if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object) 2890 2822 memset(object, 0, s->object_size); 2891 2823 2892 - slab_post_alloc_hook(s, gfpflags, 1, &object); 2824 + slab_post_alloc_hook(s, objcg, gfpflags, 1, &object); 2893 2825 2894 2826 return object; 2895 2827 } ··· 3094 3026 void *tail_obj = tail ? : head; 3095 3027 struct kmem_cache_cpu *c; 3096 3028 unsigned long tid; 3029 + 3030 + memcg_slab_free_hook(s, page, head); 3097 3031 redo: 3098 3032 /* 3099 3033 * Determine the currently cpus per cpu slab. 
··· 3275 3205 { 3276 3206 struct kmem_cache_cpu *c; 3277 3207 int i; 3208 + struct obj_cgroup *objcg = NULL; 3278 3209 3279 3210 /* memcg and kmem_cache debug support */ 3280 - s = slab_pre_alloc_hook(s, flags); 3211 + s = slab_pre_alloc_hook(s, &objcg, size, flags); 3281 3212 if (unlikely(!s)) 3282 3213 return false; 3283 3214 /* ··· 3332 3261 } 3333 3262 3334 3263 /* memcg and kmem_cache debug support */ 3335 - slab_post_alloc_hook(s, flags, size, p); 3264 + slab_post_alloc_hook(s, objcg, flags, size, p); 3336 3265 return i; 3337 3266 error: 3338 3267 local_irq_enable(); 3339 - slab_post_alloc_hook(s, flags, i, p); 3268 + slab_post_alloc_hook(s, objcg, flags, i, p); 3340 3269 __kmem_cache_free_bulk(s, i, p); 3341 3270 return 0; 3342 3271 } ··· 3746 3675 */ 3747 3676 size = ALIGN(size, s->align); 3748 3677 s->size = size; 3678 + s->reciprocal_size = reciprocal_value(size); 3749 3679 if (forced_order >= 0) 3750 3680 order = forced_order; 3751 3681 else ··· 3851 3779 map = get_map(s, page); 3852 3780 for_each_object(p, s, addr, page->objects) { 3853 3781 3854 - if (!test_bit(slab_index(p, s, addr), map)) { 3782 + if (!test_bit(__obj_to_index(s, addr, p), map)) { 3855 3783 pr_err("INFO: Object 0x%p @offset=%tu\n", p, p - addr); 3856 3784 print_tracking(s, p); 3857 3785 } ··· 3914 3842 if (n->nr_partial || slabs_node(s, node)) 3915 3843 return 1; 3916 3844 } 3917 - sysfs_slab_remove(s); 3918 3845 return 0; 3919 3846 } 3920 3847 ··· 3983 3912 page = alloc_pages_node(node, flags, order); 3984 3913 if (page) { 3985 3914 ptr = page_address(page); 3986 - mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE, 3987 - 1 << order); 3915 + mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B, 3916 + PAGE_SIZE << order); 3988 3917 } 3989 3918 3990 3919 return kmalloc_large_node_hook(ptr, size, flags); ··· 4051 3980 offset = (ptr - page_address(page)) % s->size; 4052 3981 4053 3982 /* Adjust for redzone and reject if within the redzone. 
*/ 4054 - if (kmem_cache_debug(s) && s->flags & SLAB_RED_ZONE) { 3983 + if (kmem_cache_debug_flags(s, SLAB_RED_ZONE)) { 4055 3984 if (offset < s->red_left_pad) 4056 3985 usercopy_abort("SLUB object in left red zone", 4057 3986 s->name, to_user, offset, n); ··· 4115 4044 4116 4045 BUG_ON(!PageCompound(page)); 4117 4046 kfree_hook(object); 4118 - mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE, 4119 - -(1 << order)); 4047 + mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE_B, 4048 + -(PAGE_SIZE << order)); 4120 4049 __free_pages(page, order); 4121 4050 return; 4122 4051 } ··· 4196 4125 4197 4126 return ret; 4198 4127 } 4199 - 4200 - #ifdef CONFIG_MEMCG 4201 - void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s) 4202 - { 4203 - /* 4204 - * Called with all the locks held after a sched RCU grace period. 4205 - * Even if @s becomes empty after shrinking, we can't know that @s 4206 - * doesn't have allocations already in-flight and thus can't 4207 - * destroy @s until the associated memcg is released. 4208 - * 4209 - * However, let's remove the sysfs files for empty caches here. 4210 - * Each cache has a lot of interface files which aren't 4211 - * particularly useful for empty draining caches; otherwise, we can 4212 - * easily end up with millions of unnecessary sysfs files on 4213 - * systems which have a lot of memory and transient cgroups. 4214 - */ 4215 - if (!__kmem_cache_shrink(s)) 4216 - sysfs_slab_remove(s); 4217 - } 4218 - 4219 - void __kmemcg_cache_deactivate(struct kmem_cache *s) 4220 - { 4221 - /* 4222 - * Disable empty slabs caching. Used to avoid pinning offline 4223 - * memory cgroups by kmem pages that can be freed. 
4224 - */ 4225 - slub_set_cpu_partial(s, 0); 4226 - s->min_partial = 0; 4227 - } 4228 - #endif /* CONFIG_MEMCG */ 4229 4128 4230 4129 static int slab_mem_going_offline_callback(void *arg) 4231 4130 { ··· 4351 4310 p->slab_cache = s; 4352 4311 #endif 4353 4312 } 4354 - slab_init_memcg_params(s); 4355 4313 list_add(&s->list, &slab_caches); 4356 - memcg_link_cache(s, NULL); 4357 4314 return s; 4358 4315 } 4359 4316 ··· 4406 4367 __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, 4407 4368 slab_flags_t flags, void (*ctor)(void *)) 4408 4369 { 4409 - struct kmem_cache *s, *c; 4370 + struct kmem_cache *s; 4410 4371 4411 4372 s = find_mergeable(size, align, flags, name, ctor); 4412 4373 if (s) { ··· 4418 4379 */ 4419 4380 s->object_size = max(s->object_size, size); 4420 4381 s->inuse = max(s->inuse, ALIGN(size, sizeof(void *))); 4421 - 4422 - for_each_memcg_cache(c, s) { 4423 - c->object_size = s->object_size; 4424 - c->inuse = max(c->inuse, ALIGN(size, sizeof(void *))); 4425 - } 4426 4382 4427 4383 if (sysfs_slab_alias(s, name)) { 4428 4384 s->refcount--; ··· 4440 4406 if (slab_state <= UP) 4441 4407 return 0; 4442 4408 4443 - memcg_propagate_slab_attrs(s); 4444 4409 err = sysfs_slab_add(s); 4445 4410 if (err) 4446 4411 __kmem_cache_release(s); ··· 4528 4495 /* Now we know that a valid freelist exists */ 4529 4496 map = get_map(s, page); 4530 4497 for_each_object(p, s, addr, page->objects) { 4531 - u8 val = test_bit(slab_index(p, s, addr), map) ? 4498 + u8 val = test_bit(__obj_to_index(s, addr, p), map) ? 
4532 4499 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE; 4533 4500 4534 4501 if (!check_object(s, page, p, val)) ··· 4719 4686 4720 4687 map = get_map(s, page); 4721 4688 for_each_object(p, s, addr, page->objects) 4722 - if (!test_bit(slab_index(p, s, addr), map)) 4689 + if (!test_bit(__obj_to_index(s, addr, p), map)) 4723 4690 add_location(t, s, get_track(s, p, alloc)); 4724 4691 put_map(map); 4725 4692 } ··· 5003 4970 return x + sprintf(buf + x, "\n"); 5004 4971 } 5005 4972 5006 - #ifdef CONFIG_SLUB_DEBUG 5007 - static int any_slab_objects(struct kmem_cache *s) 5008 - { 5009 - int node; 5010 - struct kmem_cache_node *n; 5011 - 5012 - for_each_kmem_cache_node(s, node, n) 5013 - if (atomic_long_read(&n->total_objects)) 5014 - return 1; 5015 - 5016 - return 0; 5017 - } 5018 - #endif 5019 - 5020 4973 #define to_slab_attr(n) container_of(n, struct slab_attribute, attr) 5021 4974 #define to_slab(n) container_of(n, struct kmem_cache, kobj) 5022 4975 ··· 5044 5025 } 5045 5026 SLAB_ATTR_RO(objs_per_slab); 5046 5027 5047 - static ssize_t order_store(struct kmem_cache *s, 5048 - const char *buf, size_t length) 5049 - { 5050 - unsigned int order; 5051 - int err; 5052 - 5053 - err = kstrtouint(buf, 10, &order); 5054 - if (err) 5055 - return err; 5056 - 5057 - if (order > slub_max_order || order < slub_min_order) 5058 - return -EINVAL; 5059 - 5060 - calculate_sizes(s, order); 5061 - return length; 5062 - } 5063 - 5064 5028 static ssize_t order_show(struct kmem_cache *s, char *buf) 5065 5029 { 5066 5030 return sprintf(buf, "%u\n", oo_order(s->oo)); 5067 5031 } 5068 - SLAB_ATTR(order); 5032 + SLAB_ATTR_RO(order); 5069 5033 5070 5034 static ssize_t min_partial_show(struct kmem_cache *s, char *buf) 5071 5035 { ··· 5170 5168 { 5171 5169 return sprintf(buf, "%d\n", !!(s->flags & SLAB_RECLAIM_ACCOUNT)); 5172 5170 } 5173 - 5174 - static ssize_t reclaim_account_store(struct kmem_cache *s, 5175 - const char *buf, size_t length) 5176 - { 5177 - s->flags &= ~SLAB_RECLAIM_ACCOUNT; 5178 - if (buf[0] 
== '1') 5179 - s->flags |= SLAB_RECLAIM_ACCOUNT; 5180 - return length; 5181 - } 5182 - SLAB_ATTR(reclaim_account); 5171 + SLAB_ATTR_RO(reclaim_account); 5183 5172 5184 5173 static ssize_t hwcache_align_show(struct kmem_cache *s, char *buf) 5185 5174 { ··· 5215 5222 { 5216 5223 return sprintf(buf, "%d\n", !!(s->flags & SLAB_CONSISTENCY_CHECKS)); 5217 5224 } 5218 - 5219 - static ssize_t sanity_checks_store(struct kmem_cache *s, 5220 - const char *buf, size_t length) 5221 - { 5222 - s->flags &= ~SLAB_CONSISTENCY_CHECKS; 5223 - if (buf[0] == '1') { 5224 - s->flags &= ~__CMPXCHG_DOUBLE; 5225 - s->flags |= SLAB_CONSISTENCY_CHECKS; 5226 - } 5227 - return length; 5228 - } 5229 - SLAB_ATTR(sanity_checks); 5225 + SLAB_ATTR_RO(sanity_checks); 5230 5226 5231 5227 static ssize_t trace_show(struct kmem_cache *s, char *buf) 5232 5228 { 5233 5229 return sprintf(buf, "%d\n", !!(s->flags & SLAB_TRACE)); 5234 5230 } 5235 - 5236 - static ssize_t trace_store(struct kmem_cache *s, const char *buf, 5237 - size_t length) 5238 - { 5239 - /* 5240 - * Tracing a merged cache is going to give confusing results 5241 - * as well as cause other issues like converting a mergeable 5242 - * cache into an umergeable one. 
5243 - */ 5244 - if (s->refcount > 1) 5245 - return -EINVAL; 5246 - 5247 - s->flags &= ~SLAB_TRACE; 5248 - if (buf[0] == '1') { 5249 - s->flags &= ~__CMPXCHG_DOUBLE; 5250 - s->flags |= SLAB_TRACE; 5251 - } 5252 - return length; 5253 - } 5254 - SLAB_ATTR(trace); 5231 + SLAB_ATTR_RO(trace); 5255 5232 5256 5233 static ssize_t red_zone_show(struct kmem_cache *s, char *buf) 5257 5234 { 5258 5235 return sprintf(buf, "%d\n", !!(s->flags & SLAB_RED_ZONE)); 5259 5236 } 5260 5237 5261 - static ssize_t red_zone_store(struct kmem_cache *s, 5262 - const char *buf, size_t length) 5263 - { 5264 - if (any_slab_objects(s)) 5265 - return -EBUSY; 5266 - 5267 - s->flags &= ~SLAB_RED_ZONE; 5268 - if (buf[0] == '1') { 5269 - s->flags |= SLAB_RED_ZONE; 5270 - } 5271 - calculate_sizes(s, -1); 5272 - return length; 5273 - } 5274 - SLAB_ATTR(red_zone); 5238 + SLAB_ATTR_RO(red_zone); 5275 5239 5276 5240 static ssize_t poison_show(struct kmem_cache *s, char *buf) 5277 5241 { 5278 5242 return sprintf(buf, "%d\n", !!(s->flags & SLAB_POISON)); 5279 5243 } 5280 5244 5281 - static ssize_t poison_store(struct kmem_cache *s, 5282 - const char *buf, size_t length) 5283 - { 5284 - if (any_slab_objects(s)) 5285 - return -EBUSY; 5286 - 5287 - s->flags &= ~SLAB_POISON; 5288 - if (buf[0] == '1') { 5289 - s->flags |= SLAB_POISON; 5290 - } 5291 - calculate_sizes(s, -1); 5292 - return length; 5293 - } 5294 - SLAB_ATTR(poison); 5245 + SLAB_ATTR_RO(poison); 5295 5246 5296 5247 static ssize_t store_user_show(struct kmem_cache *s, char *buf) 5297 5248 { 5298 5249 return sprintf(buf, "%d\n", !!(s->flags & SLAB_STORE_USER)); 5299 5250 } 5300 5251 5301 - static ssize_t store_user_store(struct kmem_cache *s, 5302 - const char *buf, size_t length) 5303 - { 5304 - if (any_slab_objects(s)) 5305 - return -EBUSY; 5306 - 5307 - s->flags &= ~SLAB_STORE_USER; 5308 - if (buf[0] == '1') { 5309 - s->flags &= ~__CMPXCHG_DOUBLE; 5310 - s->flags |= SLAB_STORE_USER; 5311 - } 5312 - calculate_sizes(s, -1); 5313 - return length; 
5314 - } 5315 - SLAB_ATTR(store_user); 5252 + SLAB_ATTR_RO(store_user); 5316 5253 5317 5254 static ssize_t validate_show(struct kmem_cache *s, char *buf) 5318 5255 { ··· 5285 5362 { 5286 5363 return sprintf(buf, "%d\n", !!(s->flags & SLAB_FAILSLAB)); 5287 5364 } 5288 - 5289 - static ssize_t failslab_store(struct kmem_cache *s, const char *buf, 5290 - size_t length) 5291 - { 5292 - if (s->refcount > 1) 5293 - return -EINVAL; 5294 - 5295 - s->flags &= ~SLAB_FAILSLAB; 5296 - if (buf[0] == '1') 5297 - s->flags |= SLAB_FAILSLAB; 5298 - return length; 5299 - } 5300 - SLAB_ATTR(failslab); 5365 + SLAB_ATTR_RO(failslab); 5301 5366 #endif 5302 5367 5303 5368 static ssize_t shrink_show(struct kmem_cache *s, char *buf) ··· 5297 5386 const char *buf, size_t length) 5298 5387 { 5299 5388 if (buf[0] == '1') 5300 - kmem_cache_shrink_all(s); 5389 + kmem_cache_shrink(s); 5301 5390 else 5302 5391 return -EINVAL; 5303 5392 return length; ··· 5521 5610 return -EIO; 5522 5611 5523 5612 err = attribute->store(s, buf, len); 5524 - #ifdef CONFIG_MEMCG 5525 - if (slab_state >= FULL && err >= 0 && is_root_cache(s)) { 5526 - struct kmem_cache *c; 5527 - 5528 - mutex_lock(&slab_mutex); 5529 - if (s->max_attr_size < len) 5530 - s->max_attr_size = len; 5531 - 5532 - /* 5533 - * This is a best effort propagation, so this function's return 5534 - * value will be determined by the parent cache only. This is 5535 - * basically because not all attributes will have a well 5536 - * defined semantics for rollbacks - most of the actions will 5537 - * have permanent effects. 5538 - * 5539 - * Returning the error value of any of the children that fail 5540 - * is not 100 % defined, in the sense that users seeing the 5541 - * error code won't be able to know anything about the state of 5542 - * the cache. 5543 - * 5544 - * Only returning the error code for the parent cache at least 5545 - * has well defined semantics. 
The cache being written to 5546 - * directly either failed or succeeded, in which case we loop 5547 - * through the descendants with best-effort propagation. 5548 - */ 5549 - for_each_memcg_cache(c, s) 5550 - attribute->store(c, buf, len); 5551 - mutex_unlock(&slab_mutex); 5552 - } 5553 - #endif 5554 5613 return err; 5555 - } 5556 - 5557 - static void memcg_propagate_slab_attrs(struct kmem_cache *s) 5558 - { 5559 - #ifdef CONFIG_MEMCG 5560 - int i; 5561 - char *buffer = NULL; 5562 - struct kmem_cache *root_cache; 5563 - 5564 - if (is_root_cache(s)) 5565 - return; 5566 - 5567 - root_cache = s->memcg_params.root_cache; 5568 - 5569 - /* 5570 - * This mean this cache had no attribute written. Therefore, no point 5571 - * in copying default values around 5572 - */ 5573 - if (!root_cache->max_attr_size) 5574 - return; 5575 - 5576 - for (i = 0; i < ARRAY_SIZE(slab_attrs); i++) { 5577 - char mbuf[64]; 5578 - char *buf; 5579 - struct slab_attribute *attr = to_slab_attr(slab_attrs[i]); 5580 - ssize_t len; 5581 - 5582 - if (!attr || !attr->store || !attr->show) 5583 - continue; 5584 - 5585 - /* 5586 - * It is really bad that we have to allocate here, so we will 5587 - * do it only as a fallback. If we actually allocate, though, 5588 - * we can just use the allocated buffer until the end. 5589 - * 5590 - * Most of the slub attributes will tend to be very small in 5591 - * size, but sysfs allows buffers up to a page, so they can 5592 - * theoretically happen. 
5593 - */ 5594 - if (buffer) 5595 - buf = buffer; 5596 - else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) && 5597 - !IS_ENABLED(CONFIG_SLUB_STATS)) 5598 - buf = mbuf; 5599 - else { 5600 - buffer = (char *) get_zeroed_page(GFP_KERNEL); 5601 - if (WARN_ON(!buffer)) 5602 - continue; 5603 - buf = buffer; 5604 - } 5605 - 5606 - len = attr->show(root_cache, buf); 5607 - if (len > 0) 5608 - attr->store(s, buf, len); 5609 - } 5610 - 5611 - if (buffer) 5612 - free_page((unsigned long)buffer); 5613 - #endif /* CONFIG_MEMCG */ 5614 5614 } 5615 5615 5616 5616 static void kmem_cache_release(struct kobject *k) ··· 5543 5721 5544 5722 static inline struct kset *cache_kset(struct kmem_cache *s) 5545 5723 { 5546 - #ifdef CONFIG_MEMCG 5547 - if (!is_root_cache(s)) 5548 - return s->memcg_params.root_cache->memcg_kset; 5549 - #endif 5550 5724 return slab_kset; 5551 5725 } 5552 5726 ··· 5585 5767 return name; 5586 5768 } 5587 5769 5588 - static void sysfs_slab_remove_workfn(struct work_struct *work) 5589 - { 5590 - struct kmem_cache *s = 5591 - container_of(work, struct kmem_cache, kobj_remove_work); 5592 - 5593 - if (!s->kobj.state_in_sysfs) 5594 - /* 5595 - * For a memcg cache, this may be called during 5596 - * deactivation and again on shutdown. Remove only once. 5597 - * A cache is never shut down before deactivation is 5598 - * complete, so no need to worry about synchronization. 
5599 - */ 5600 - goto out; 5601 - 5602 - #ifdef CONFIG_MEMCG 5603 - kset_unregister(s->memcg_kset); 5604 - #endif 5605 - out: 5606 - kobject_put(&s->kobj); 5607 - } 5608 - 5609 5770 static int sysfs_slab_add(struct kmem_cache *s) 5610 5771 { 5611 5772 int err; 5612 5773 const char *name; 5613 5774 struct kset *kset = cache_kset(s); 5614 5775 int unmergeable = slab_unmergeable(s); 5615 - 5616 - INIT_WORK(&s->kobj_remove_work, sysfs_slab_remove_workfn); 5617 5776 5618 5777 if (!kset) { 5619 5778 kobject_init(&s->kobj, &slab_ktype); ··· 5628 5833 if (err) 5629 5834 goto out_del_kobj; 5630 5835 5631 - #ifdef CONFIG_MEMCG 5632 - if (is_root_cache(s) && memcg_sysfs_enabled) { 5633 - s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj); 5634 - if (!s->memcg_kset) { 5635 - err = -ENOMEM; 5636 - goto out_del_kobj; 5637 - } 5638 - } 5639 - #endif 5640 - 5641 5836 if (!unmergeable) { 5642 5837 /* Setup first alias */ 5643 5838 sysfs_slab_alias(s, s->name); ··· 5639 5854 out_del_kobj: 5640 5855 kobject_del(&s->kobj); 5641 5856 goto out; 5642 - } 5643 - 5644 - static void sysfs_slab_remove(struct kmem_cache *s) 5645 - { 5646 - if (slab_state < FULL) 5647 - /* 5648 - * Sysfs has not been setup yet so no need to remove the 5649 - * cache from sysfs. 5650 - */ 5651 - return; 5652 - 5653 - kobject_get(&s->kobj); 5654 - schedule_work(&s->kobj_remove_work); 5655 5857 } 5656 5858 5657 5859 void sysfs_slab_unlink(struct kmem_cache *s)
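The slub.c rework above replaces the single-list `slub_debug=options[,slabs]` parser with one that walks multiple semicolon-separated blocks, each carrying its own flags and an optional slab list. The sketch below is a simplified, standalone model of that block syntax; the flag bits and the `parse_debug_block()` helper are hypothetical stand-ins, not the kernel's `parse_slub_debug_flags()`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits standing in for the SLAB_* debug flags. */
#define DBG_CONSISTENCY 0x1u	/* 'f' */
#define DBG_RED_ZONE    0x2u	/* 'z' */
#define DBG_POISON      0x4u	/* 'p' */

/*
 * Parse one "options[,slabs]" block out of a semicolon-separated
 * slub_debug string.  Returns a pointer to the next block, or NULL
 * when this was the last one.  *slab_list is set to the slab-name
 * list inside the block, or NULL if the block has none.
 */
static const char *parse_debug_block(const char *str, unsigned int *flags,
				     const char **slab_list)
{
	*flags = 0;
	*slab_list = NULL;

	/* options run until ',' (start of slab list) or ';' (next block) */
	for (; *str && *str != ',' && *str != ';'; str++) {
		switch (*str) {
		case 'f': *flags |= DBG_CONSISTENCY; break;
		case 'z': *flags |= DBG_RED_ZONE; break;
		case 'p': *flags |= DBG_POISON; break;
		}
	}
	if (*str == ',')
		*slab_list = ++str;
	/* skip over the slab list to find the block terminator */
	while (*str && *str != ';')
		str++;
	return *str == ';' ? str + 1 : NULL;
}
```

For `fz;p,kmalloc-128` the first block carries consistency checks and red zoning with no slab list, while the second restricts poisoning to `kmalloc-128`, matching the `slub_debug[=options[,slabs][;[options[,slabs]]...]` format documented in kernel-parameters.txt.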
+27 -31
mm/sparse-vmemmap.c
··· 69 69 __pa(MAX_DMA_ADDRESS)); 70 70 } 71 71 72 - /* need to make sure size is all the same during early stage */ 73 - void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node) 74 - { 75 - void *ptr = sparse_buffer_alloc(size); 72 + static void * __meminit altmap_alloc_block_buf(unsigned long size, 73 + struct vmem_altmap *altmap); 76 74 75 + /* need to make sure size is all the same during early stage */ 76 + void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node, 77 + struct vmem_altmap *altmap) 78 + { 79 + void *ptr; 80 + 81 + if (altmap) 82 + return altmap_alloc_block_buf(size, altmap); 83 + 84 + ptr = sparse_buffer_alloc(size); 77 85 if (!ptr) 78 86 ptr = vmemmap_alloc_block(size, node); 79 87 return ptr; ··· 102 94 return 0; 103 95 } 104 96 105 - /** 106 - * altmap_alloc_block_buf - allocate pages from the device page map 107 - * @altmap: device page map 108 - * @size: size (in bytes) of the allocation 109 - * 110 - * Allocations are aligned to the size of the request. 
111 - */ 112 - void * __meminit altmap_alloc_block_buf(unsigned long size, 113 - struct vmem_altmap *altmap) 97 + static void * __meminit altmap_alloc_block_buf(unsigned long size, 98 + struct vmem_altmap *altmap) 114 99 { 115 100 unsigned long pfn, nr_pfns, nr_align; 116 101 ··· 140 139 start, end - 1); 141 140 } 142 141 143 - pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node) 142 + pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, 143 + struct vmem_altmap *altmap) 144 144 { 145 145 pte_t *pte = pte_offset_kernel(pmd, addr); 146 146 if (pte_none(*pte)) { 147 147 pte_t entry; 148 - void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node); 148 + void *p; 149 + 150 + p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap); 149 151 if (!p) 150 152 return NULL; 151 153 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL); ··· 216 212 return pgd; 217 213 } 218 214 219 - int __meminit vmemmap_populate_basepages(unsigned long start, 220 - unsigned long end, int node) 215 + int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end, 216 + int node, struct vmem_altmap *altmap) 221 217 { 222 218 unsigned long addr = start; 223 219 pgd_t *pgd; ··· 239 235 pmd = vmemmap_pmd_populate(pud, addr, node); 240 236 if (!pmd) 241 237 return -ENOMEM; 242 - pte = vmemmap_pte_populate(pmd, addr, node); 238 + pte = vmemmap_pte_populate(pmd, addr, node, altmap); 243 239 if (!pte) 244 240 return -ENOMEM; 245 241 vmemmap_verify(pte, node, addr, addr + PAGE_SIZE); ··· 251 247 struct page * __meminit __populate_section_memmap(unsigned long pfn, 252 248 unsigned long nr_pages, int nid, struct vmem_altmap *altmap) 253 249 { 254 - unsigned long start; 255 - unsigned long end; 250 + unsigned long start = (unsigned long) pfn_to_page(pfn); 251 + unsigned long end = start + nr_pages * sizeof(struct page); 256 252 257 - /* 258 - * The minimum granularity of memmap extensions is 259 - * PAGES_PER_SUBSECTION as allocations are 
tracked in the 260 - * 'subsection_map' bitmap of the section. 261 - */ 262 - end = ALIGN(pfn + nr_pages, PAGES_PER_SUBSECTION); 263 - pfn &= PAGE_SUBSECTION_MASK; 264 - nr_pages = end - pfn; 265 - 266 - start = (unsigned long) pfn_to_page(pfn); 267 - end = start + nr_pages * sizeof(struct page); 253 + if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) || 254 + !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION))) 255 + return NULL; 268 256 269 257 if (vmemmap_populate(start, end, nid, altmap)) 270 258 return NULL;
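With this change `vmemmap_alloc_block_buf()` takes the `altmap` directly and serves device-backed requests from the pfn range it describes, where each allocation is aligned to its own size. A toy bump allocator illustrating that alignment rule; the struct and helper names are made up for illustration and this is not `struct vmem_altmap`:

```c
#include <assert.h>

/*
 * Toy model of altmap_alloc_block_buf(): a bump allocator over a
 * reserved pfn range, where each allocation is aligned to its own
 * size.  The fields loosely mirror the free/alloc accounting of the
 * kernel structure but are purely illustrative.
 */
struct toy_altmap {
	unsigned long base_pfn;	/* first pfn of the reserved range */
	unsigned long free;	/* pfns still available */
	unsigned long alloc;	/* pfns already handed out */
};

/* Returns the first pfn of the allocation, or 0 when exhausted. */
static unsigned long toy_altmap_alloc(struct toy_altmap *m,
				      unsigned long nr_pfns)
{
	unsigned long pfn = m->base_pfn + m->alloc;
	/* natural alignment: lowest set bit of the request size */
	unsigned long align = nr_pfns & -nr_pfns;
	unsigned long pad = (align - (pfn & (align - 1))) & (align - 1);

	if (nr_pfns + pad > m->free)
		return 0;
	m->alloc += nr_pfns + pad;
	m->free -= nr_pfns + pad;
	return pfn + pad;
}
```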
+19 -12
mm/sparse.c
··· 16 16 17 17 #include "internal.h" 18 18 #include <asm/dma.h> 19 - #include <asm/pgalloc.h> 20 19 21 20 /* 22 21 * Permanent SPARSEMEM data: ··· 249 250 #endif 250 251 251 252 /* Record a memory area against a node. */ 252 - void __init memory_present(int nid, unsigned long start, unsigned long end) 253 + static void __init memory_present(int nid, unsigned long start, unsigned long end) 253 254 { 254 255 unsigned long pfn; 255 256 ··· 285 286 } 286 287 287 288 /* 288 - * Mark all memblocks as present using memory_present(). This is a 289 - * convenience function that is useful for a number of arches 290 - * to mark all of the systems memory as present during initialization. 289 + * Mark all memblocks as present using memory_present(). 290 + * This is a convenience function that is useful to mark all of the systems 291 + * memory as present during initialization. 291 292 */ 292 - void __init memblocks_present(void) 293 + static void __init memblocks_present(void) 293 294 { 294 295 struct memblock_region *reg; 295 296 ··· 574 575 */ 575 576 void __init sparse_init(void) 576 577 { 577 - unsigned long pnum_begin = first_present_section_nr(); 578 - int nid_begin = sparse_early_nid(__nr_to_section(pnum_begin)); 579 - unsigned long pnum_end, map_count = 1; 578 + unsigned long pnum_end, pnum_begin, map_count = 1; 579 + int nid_begin; 580 + 581 + memblocks_present(); 582 + 583 + pnum_begin = first_present_section_nr(); 584 + nid_begin = sparse_early_nid(__nr_to_section(pnum_begin)); 580 585 581 586 /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */ 582 587 set_pageblock_order(); ··· 828 825 ms->section_mem_map &= ~SECTION_HAS_MEM_MAP; 829 826 } 830 827 831 - if (section_is_early && memmap) 832 - free_map_bootmem(memmap); 833 - else 828 + /* 829 + * The memmap of early sections is always fully populated. See 830 + * section_activate() and pfn_valid() . 
831 + */ 832 + if (!section_is_early) 834 833 depopulate_section_memmap(pfn, nr_pages, altmap); 834 + else if (memmap) 835 + free_map_bootmem(memmap); 835 836 836 837 if (empty) 837 838 ms->section_mem_map = (unsigned long)NULL;
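The sparse.c comments above lean on how sections and subsections partition the pfn space. As a hedged illustration with x86_64-style constants (4 KiB pages, 128 MiB sections, 2 MiB subsections); the helper names here are ad hoc, while the kernel expresses the same relationships through `pfn_to_section_nr()` and the per-section `subsection_map` bitmap:

```c
#include <assert.h>

/* x86_64-style constants, for illustration only. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)	/* 15 */
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)		/* 32768 */
#define SUBSECTION_SHIFT	(21 - PAGE_SHIFT)			/* 9 */
#define PAGES_PER_SUBSECTION	(1UL << SUBSECTION_SHIFT)		/* 512 */

/* Which memory section a pfn belongs to. */
static unsigned long section_nr_of(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* Bit index into the owning section's subsection bitmap. */
static unsigned long subsection_idx_of(unsigned long pfn)
{
	return (pfn & (PAGES_PER_SECTION - 1)) >> SUBSECTION_SHIFT;
}
```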
+21 -24
mm/swap_slots.c
··· 46 46 static void deactivate_swap_slots_cache(void); 47 47 static void reactivate_swap_slots_cache(void); 48 48 49 - #define use_swap_slot_cache (swap_slot_cache_active && \ 50 - swap_slot_cache_enabled && swap_slot_cache_initialized) 49 + #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled) 51 50 #define SLOTS_CACHE 0x1 52 51 #define SLOTS_CACHE_RET 0x2 53 52 ··· 93 94 { 94 95 long pages; 95 96 96 - if (!swap_slot_cache_enabled || !swap_slot_cache_initialized) 97 + if (!swap_slot_cache_enabled) 97 98 return false; 98 99 99 100 pages = get_nr_swap_pages(); ··· 135 136 136 137 mutex_lock(&swap_slots_cache_mutex); 137 138 cache = &per_cpu(swp_slots, cpu); 138 - if (cache->slots || cache->slots_ret) 139 + if (cache->slots || cache->slots_ret) { 139 140 /* cache already allocated */ 140 - goto out; 141 + mutex_unlock(&swap_slots_cache_mutex); 142 + 143 + kvfree(slots); 144 + kvfree(slots_ret); 145 + 146 + return 0; 147 + } 148 + 141 149 if (!cache->lock_initialized) { 142 150 mutex_init(&cache->alloc_lock); 143 151 spin_lock_init(&cache->free_lock); ··· 161 155 */ 162 156 mb(); 163 157 cache->slots = slots; 164 - slots = NULL; 165 158 cache->slots_ret = slots_ret; 166 - slots_ret = NULL; 167 - out: 168 159 mutex_unlock(&swap_slots_cache_mutex); 169 - if (slots) 170 - kvfree(slots); 171 - if (slots_ret) 172 - kvfree(slots_ret); 173 160 return 0; 174 161 } 175 162 ··· 239 240 240 241 int enable_swap_slots_cache(void) 241 242 { 242 - int ret = 0; 243 - 244 243 mutex_lock(&swap_slots_cache_enable_mutex); 245 - if (swap_slot_cache_initialized) { 246 - __reenable_swap_slots_cache(); 247 - goto out_unlock; 244 + if (!swap_slot_cache_initialized) { 245 + int ret; 246 + 247 + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache", 248 + alloc_swap_slot_cache, free_slot_cache); 249 + if (WARN_ONCE(ret < 0, "Cache allocation failed (%s), operating " 250 + "without swap slots cache.\n", __func__)) 251 + goto out_unlock; 252 + 253 + 
swap_slot_cache_initialized = true; 248 254 } 249 255 250 - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "swap_slots_cache", 251 - alloc_swap_slot_cache, free_slot_cache); 252 - if (WARN_ONCE(ret < 0, "Cache allocation failed (%s), operating " 253 - "without swap slots cache.\n", __func__)) 254 - goto out_unlock; 255 - 256 - swap_slot_cache_initialized = true; 257 256 __reenable_swap_slots_cache(); 258 257 out_unlock: 259 258 mutex_unlock(&swap_slots_cache_enable_mutex);
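The `alloc_swap_slot_cache()` change above follows a common pattern: allocate buffers before taking `swap_slots_cache_mutex`, publish them under the lock, and if a racing caller already populated the cache, drop the lock first and free the unused buffers. A minimal single-threaded sketch of that pattern; locking is reduced to comments and the one-field structure is a stand-in, not the kernel's per-CPU `swap_slots_cache`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One-field stand-in for a slot cache. */
struct slot_cache {
	int *slots;
};

static int install_slots(struct slot_cache *cache, size_t n)
{
	int *slots = calloc(n, sizeof(*slots));	/* allocated outside the lock */

	if (!slots)
		return -1;

	/* mutex_lock(&cache_mutex); */
	if (cache->slots) {
		/* mutex_unlock(&cache_mutex); -- lost the race */
		free(slots);
		return 0;
	}
	cache->slots = slots;
	/* mutex_unlock(&cache_mutex); */
	return 0;
}
```

Allocating before the lock keeps a potentially sleeping allocation out of the critical section; the cost on the rare losing path is just one `free()`.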
+1 -1
mm/swap_state.c
··· 725 725 726 726 /** 727 727 * swap_vma_readahead - swap in pages in hope we need them soon 728 - * @entry: swap entry of this memory 728 + * @fentry: swap entry of this memory 729 729 * @gfp_mask: memory allocation flags 730 730 * @vmf: fault information 731 731 *
+49 -3
mm/util.c
··· 503 503 if (!ret) { 504 504 if (mmap_write_lock_killable(mm)) 505 505 return -EINTR; 506 - ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, 507 - &populate, &uf); 506 + ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate, 507 + &uf); 508 508 mmap_write_unlock(mm); 509 509 userfaultfd_unmap_complete(mm, &uf); 510 510 if (populate) ··· 746 746 return ret; 747 747 } 748 748 749 + static void sync_overcommit_as(struct work_struct *dummy) 750 + { 751 + percpu_counter_sync(&vm_committed_as); 752 + } 753 + 754 + int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer, 755 + size_t *lenp, loff_t *ppos) 756 + { 757 + struct ctl_table t; 758 + int new_policy; 759 + int ret; 760 + 761 + /* 762 + * The deviation of sync_overcommit_as could be big with loose policy 763 + * like OVERCOMMIT_ALWAYS/OVERCOMMIT_GUESS. When changing policy to 764 + * strict OVERCOMMIT_NEVER, we need to reduce the deviation to comply 765 + * with the strict "NEVER", and to avoid possible race condtion (even 766 + * though user usually won't too frequently do the switching to policy 767 + * OVERCOMMIT_NEVER), the switch is done in the following order: 768 + * 1. changing the batch 769 + * 2. sync percpu count on each CPU 770 + * 3. switch the policy 771 + */ 772 + if (write) { 773 + t = *table; 774 + t.data = &new_policy; 775 + ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); 776 + if (ret) 777 + return ret; 778 + 779 + mm_compute_batch(new_policy); 780 + if (new_policy == OVERCOMMIT_NEVER) 781 + schedule_on_each_cpu(sync_overcommit_as); 782 + sysctl_overcommit_memory = new_policy; 783 + } else { 784 + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); 785 + } 786 + 787 + return ret; 788 + } 789 + 749 790 int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer, 750 791 size_t *lenp, loff_t *ppos) 751 792 { ··· 828 787 * balancing memory across competing virtual machines that are hosted. 
829 788 * Several metrics drive this policy engine including the guest reported 830 789 * memory commitment. 790 + * 791 + * The time cost of this is very low for small platforms, and for big 792 + * platform like a 2S/36C/72T Skylake server, in worst case where 793 + * vm_committed_as's spinlock is under severe contention, the time cost 794 + * could be about 30~40 microseconds. 831 795 */ 832 796 unsigned long vm_memory_committed(void) 833 797 { 834 - return percpu_counter_read_positive(&vm_committed_as); 798 + return percpu_counter_sum_positive(&vm_committed_as); 835 799 } 836 800 EXPORT_SYMBOL_GPL(vm_memory_committed); 837 801
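Both util.c changes revolve around the `vm_committed_as` percpu counter: the new sysctl handler syncs per-CPU deltas before switching to `OVERCOMMIT_NEVER`, and `vm_memory_committed()` moves from the approximate `percpu_counter_read_positive()` to the precise `percpu_counter_sum_positive()`. A toy model showing why the two reads differ; this simplified counter is an illustration, not the kernel's `struct percpu_counter`:

```c
#include <assert.h>

#define NR_CPUS 4

/*
 * Per-CPU deltas fold into the shared count only once they reach
 * 'batch'; until then an approximate read can lag the true total.
 */
struct pcp_counter {
	long count;		/* shared, folded value */
	long cpu[NR_CPUS];	/* per-CPU deltas */
	long batch;
};

static void pcp_add(struct pcp_counter *c, int cpu, long amount)
{
	c->cpu[cpu] += amount;
	if (c->cpu[cpu] >= c->batch || c->cpu[cpu] <= -c->batch) {
		c->count += c->cpu[cpu];	/* done under a lock in real code */
		c->cpu[cpu] = 0;
	}
}

/* Fast read: shared count only (may be stale by up to batch per CPU). */
static long pcp_read_approx(const struct pcp_counter *c)
{
	return c->count;
}

/* Precise read: also sum the unfolded per-CPU deltas. */
static long pcp_read_sum(const struct pcp_counter *c)
{
	long sum = c->count;

	for (int i = 0; i < NR_CPUS; i++)
		sum += c->cpu[i];
	return sum;
}
```

The commit comment quantifies the trade-off: the precise sum is cheap on small machines but can cost tens of microseconds on a large server under lock contention.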
+73 -103
mm/vmalloc.c
··· 7 7 * SMP-safe vmalloc/vfree/ioremap, Tigran Aivazian <tigran@veritas.com>, May 2000 8 8 * Major rework to support vmap/vunmap, Christoph Hellwig, SGI, August 2002 9 9 * Numa awareness, Christoph Lameter, SGI, June 2005 10 + * Improving global KVA allocator, Uladzislau Rezki, Sony, May 2019 10 11 */ 11 12 12 13 #include <linux/vmalloc.h> ··· 26 25 #include <linux/list.h> 27 26 #include <linux/notifier.h> 28 27 #include <linux/rbtree.h> 29 - #include <linux/radix-tree.h> 28 + #include <linux/xarray.h> 30 29 #include <linux/rcupdate.h> 31 30 #include <linux/pfn.h> 32 31 #include <linux/kmemleak.h> ··· 42 41 #include <asm/shmparam.h> 43 42 44 43 #include "internal.h" 44 + #include "pgalloc-track.h" 45 45 46 46 bool is_vmalloc_addr(const void *x) 47 47 { ··· 175 173 pgtbl_mod_mask mask = 0; 176 174 177 175 BUG_ON(addr >= end); 178 - start = addr; 179 176 pgd = pgd_offset_k(addr); 180 177 do { 181 178 next = pgd_addr_end(addr, end); ··· 512 511 /* 513 512 * This function returns back addresses of parent node 514 513 * and its left or right link for further processing. 514 + * 515 + * Otherwise NULL is returned. In that case all further 516 + * steps regarding inserting of conflicting overlap range 517 + * have to be declined and actually considered as a bug. 
515 518 */ 516 519 static __always_inline struct rb_node ** 517 520 find_va_links(struct vmap_area *va, ··· 554 549 else if (va->va_end > tmp_va->va_start && 555 550 va->va_start >= tmp_va->va_end) 556 551 link = &(*link)->rb_right; 557 - else 558 - BUG(); 552 + else { 553 + WARN(1, "vmalloc bug: 0x%lx-0x%lx overlaps with 0x%lx-0x%lx\n", 554 + va->va_start, va->va_end, tmp_va->va_start, tmp_va->va_end); 555 + 556 + return NULL; 557 + } 559 558 } while (*link); 560 559 561 560 *parent = &tmp_va->rb_node; ··· 641 632 642 633 #if DEBUG_AUGMENT_PROPAGATE_CHECK 643 634 static void 644 - augment_tree_propagate_check(struct rb_node *n) 635 + augment_tree_propagate_check(void) 645 636 { 646 637 struct vmap_area *va; 647 - struct rb_node *node; 648 - unsigned long size; 649 - bool found = false; 638 + unsigned long computed_size; 650 639 651 - if (n == NULL) 652 - return; 653 - 654 - va = rb_entry(n, struct vmap_area, rb_node); 655 - size = va->subtree_max_size; 656 - node = n; 657 - 658 - while (node) { 659 - va = rb_entry(node, struct vmap_area, rb_node); 660 - 661 - if (get_subtree_max_size(node->rb_left) == size) { 662 - node = node->rb_left; 663 - } else { 664 - if (va_size(va) == size) { 665 - found = true; 666 - break; 667 - } 668 - 669 - node = node->rb_right; 670 - } 640 + list_for_each_entry(va, &free_vmap_area_list, list) { 641 + computed_size = compute_subtree_max_size(va); 642 + if (computed_size != va->subtree_max_size) 643 + pr_emerg("tree is corrupted: %lu, %lu\n", 644 + va_size(va), va->subtree_max_size); 671 645 } 672 - 673 - if (!found) { 674 - va = rb_entry(n, struct vmap_area, rb_node); 675 - pr_emerg("tree is corrupted: %lu, %lu\n", 676 - va_size(va), va->subtree_max_size); 677 - } 678 - 679 - augment_tree_propagate_check(n->rb_left); 680 - augment_tree_propagate_check(n->rb_right); 681 646 } 682 647 #endif 683 648 ··· 685 702 static __always_inline void 686 703 augment_tree_propagate_from(struct vmap_area *va) 687 704 { 688 - struct rb_node *node = 
&va->rb_node; 689 - unsigned long new_va_sub_max_size; 690 - 691 - while (node) { 692 - va = rb_entry(node, struct vmap_area, rb_node); 693 - new_va_sub_max_size = compute_subtree_max_size(va); 694 - 695 - /* 696 - * If the newly calculated maximum available size of the 697 - * subtree is equal to the current one, then it means that 698 - * the tree is propagated correctly. So we have to stop at 699 - * this point to save cycles. 700 - */ 701 - if (va->subtree_max_size == new_va_sub_max_size) 702 - break; 703 - 704 - va->subtree_max_size = new_va_sub_max_size; 705 - node = rb_parent(&va->rb_node); 706 - } 705 + /* 706 + * Populate the tree from bottom towards the root until 707 + * the calculated maximum available size of checked node 708 + * is equal to its current one. 709 + */ 710 + free_vmap_area_rb_augment_cb_propagate(&va->rb_node, NULL); 707 711 708 712 #if DEBUG_AUGMENT_PROPAGATE_CHECK 709 - augment_tree_propagate_check(free_vmap_area_root.rb_node); 713 + augment_tree_propagate_check(); 710 714 #endif 711 715 } 712 716 ··· 705 735 struct rb_node *parent; 706 736 707 737 link = find_va_links(va, root, NULL, &parent); 708 - link_va(va, root, parent, link, head); 738 + if (link) 739 + link_va(va, root, parent, link, head); 709 740 } 710 741 711 742 static void ··· 722 751 else 723 752 link = find_va_links(va, root, NULL, &parent); 724 753 725 - link_va(va, root, parent, link, head); 726 - augment_tree_propagate_from(va); 754 + if (link) { 755 + link_va(va, root, parent, link, head); 756 + augment_tree_propagate_from(va); 757 + } 727 758 } 728 759 729 760 /* ··· 733 760 * and next free blocks. If coalesce is not done a new 734 761 * free area is inserted. If VA has been merged, it is 735 762 * freed. 763 + * 764 + * Please note, it can return NULL in case of overlap 765 + * ranges, followed by WARN() report. Despite it is a 766 + * buggy behaviour, a system can be alive and keep 767 + * ongoing. 
736 768 */ 737 769 static __always_inline struct vmap_area * 738 770 merge_or_add_vmap_area(struct vmap_area *va, ··· 754 776 * inserted, unless it is merged with its sibling/siblings. 755 777 */ 756 778 link = find_va_links(va, root, NULL, &parent); 779 + if (!link) 780 + return NULL; 757 781 758 782 /* 759 783 * Get next node of VA to check if merging can be done. ··· 776 796 if (sibling->va_start == va->va_end) { 777 797 sibling->va_start = va->va_start; 778 798 779 - /* Check and update the tree if needed. */ 780 - augment_tree_propagate_from(sibling); 781 - 782 799 /* Free vmap_area object. */ 783 800 kmem_cache_free(vmap_area_cachep, va); 784 801 ··· 795 818 if (next->prev != head) { 796 819 sibling = list_entry(next->prev, struct vmap_area, list); 797 820 if (sibling->va_end == va->va_start) { 798 - sibling->va_end = va->va_end; 799 - 800 - /* Check and update the tree if needed. */ 801 - augment_tree_propagate_from(sibling); 802 - 821 + /* 822 + * If both neighbors are coalesced, it is important 823 + * to unlink the "next" node first, followed by merging 824 + * with "previous" one. Otherwise the tree might not be 825 + * fully populated if a sibling's augmented value is 826 + * "normalized" because of rotation operations. 827 + */ 803 828 if (merged) 804 829 unlink_va(va, root); 830 + 831 + sibling->va_end = va->va_end; 805 832 806 833 /* Free vmap_area object. */ 807 834 kmem_cache_free(vmap_area_cachep, va); ··· 817 836 } 818 837 819 838 insert: 820 - if (!merged) { 839 + if (!merged) 821 840 link_va(va, root, parent, link, head); 822 - augment_tree_propagate_from(va); 823 - } 824 841 842 + /* 843 + * Last step is to check and update the tree. 
844 + */ 845 + augment_tree_propagate_from(va); 825 846 return va; 826 847 } 827 848 ··· 1364 1381 va = merge_or_add_vmap_area(va, &free_vmap_area_root, 1365 1382 &free_vmap_area_list); 1366 1383 1384 + if (!va) 1385 + continue; 1386 + 1367 1387 if (is_vmalloc_or_module_addr((void *)orig_start)) 1368 1388 kasan_release_vmalloc(orig_start, orig_end, 1369 1389 va->va_start, va->va_end); ··· 1499 1513 static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue); 1500 1514 1501 1515 /* 1502 - * Radix tree of vmap blocks, indexed by address, to quickly find a vmap block 1516 + * XArray of vmap blocks, indexed by address, to quickly find a vmap block 1503 1517 * in the free path. Could get rid of this if we change the API to return a 1504 1518 * "cookie" from alloc, to be passed to free. But no big deal yet. 1505 1519 */ 1506 - static DEFINE_SPINLOCK(vmap_block_tree_lock); 1507 - static RADIX_TREE(vmap_block_tree, GFP_ATOMIC); 1520 + static DEFINE_XARRAY(vmap_blocks); 1508 1521 1509 1522 /* 1510 1523 * We should probably have a fallback mechanism to allocate virtual memory ··· 1560 1575 return ERR_CAST(va); 1561 1576 } 1562 1577 1563 - err = radix_tree_preload(gfp_mask); 1564 - if (unlikely(err)) { 1565 - kfree(vb); 1566 - free_vmap_area(va); 1567 - return ERR_PTR(err); 1568 - } 1569 - 1570 1578 vaddr = vmap_block_vaddr(va->va_start, 0); 1571 1579 spin_lock_init(&vb->lock); 1572 1580 vb->va = va; ··· 1572 1594 INIT_LIST_HEAD(&vb->free_list); 1573 1595 1574 1596 vb_idx = addr_to_vb_idx(va->va_start); 1575 - spin_lock(&vmap_block_tree_lock); 1576 - err = radix_tree_insert(&vmap_block_tree, vb_idx, vb); 1577 - spin_unlock(&vmap_block_tree_lock); 1578 - BUG_ON(err); 1579 - radix_tree_preload_end(); 1597 + err = xa_insert(&vmap_blocks, vb_idx, vb, gfp_mask); 1598 + if (err) { 1599 + kfree(vb); 1600 + free_vmap_area(va); 1601 + return ERR_PTR(err); 1602 + } 1580 1603 1581 1604 vbq = &get_cpu_var(vmap_block_queue); 1582 1605 spin_lock(&vbq->lock); ··· 1591 1612 static void 
free_vmap_block(struct vmap_block *vb) 1592 1613 { 1593 1614 struct vmap_block *tmp; 1594 - unsigned long vb_idx; 1595 1615 1596 - vb_idx = addr_to_vb_idx(vb->va->va_start); 1597 - spin_lock(&vmap_block_tree_lock); 1598 - tmp = radix_tree_delete(&vmap_block_tree, vb_idx); 1599 - spin_unlock(&vmap_block_tree_lock); 1616 + tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start)); 1600 1617 BUG_ON(tmp != vb); 1601 1618 1602 1619 free_vmap_area_noflush(vb->va); ··· 1698 1723 static void vb_free(unsigned long addr, unsigned long size) 1699 1724 { 1700 1725 unsigned long offset; 1701 - unsigned long vb_idx; 1702 1726 unsigned int order; 1703 1727 struct vmap_block *vb; 1704 1728 ··· 1707 1733 flush_cache_vunmap(addr, addr + size); 1708 1734 1709 1735 order = get_order(size); 1710 - 1711 1736 offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT; 1712 - 1713 - vb_idx = addr_to_vb_idx(addr); 1714 - rcu_read_lock(); 1715 - vb = radix_tree_lookup(&vmap_block_tree, vb_idx); 1716 - rcu_read_unlock(); 1717 - BUG_ON(!vb); 1737 + vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr)); 1718 1738 1719 1739 unmap_kernel_range_noflush(addr, size); 1720 1740 ··· 3351 3383 orig_end = vas[area]->va_end; 3352 3384 va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root, 3353 3385 &free_vmap_area_list); 3354 - kasan_release_vmalloc(orig_start, orig_end, 3355 - va->va_start, va->va_end); 3386 + if (va) 3387 + kasan_release_vmalloc(orig_start, orig_end, 3388 + va->va_start, va->va_end); 3356 3389 vas[area] = NULL; 3357 3390 } 3358 3391 ··· 3401 3432 orig_end = vas[area]->va_end; 3402 3433 va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root, 3403 3434 &free_vmap_area_list); 3404 - kasan_release_vmalloc(orig_start, orig_end, 3405 - va->va_start, va->va_end); 3435 + if (va) 3436 + kasan_release_vmalloc(orig_start, orig_end, 3437 + va->va_start, va->va_end); 3406 3438 vas[area] = NULL; 3407 3439 kfree(vms[area]); 3408 3440 }
+11 -28
mm/vmscan.c
··· 170 170 * From 0 .. 200. Higher means more swappy. 171 171 */ 172 172 int vm_swappiness = 60; 173 - /* 174 - * The total number of pages which are beyond the high watermark within all 175 - * zones. 176 - */ 177 - unsigned long vm_total_pages; 178 173 179 174 static void set_task_reclaim_state(struct task_struct *task, 180 175 struct reclaim_state *rs) ··· 910 915 * order to detect refaults, thus thrashing, later on. 911 916 * 912 917 * But don't store shadows in an address space that is 913 - * already exiting. This is not just an optizimation, 918 + * already exiting. This is not just an optimization, 914 919 * inode reclaim needs to empty out the radix tree or 915 920 * the nodes are lost. Don't plant shadows behind its 916 921 * back. ··· 2030 2035 2031 2036 __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); 2032 2037 2033 - __count_vm_events(PGREFILL, nr_scanned); 2038 + if (!cgroup_reclaim(sc)) 2039 + __count_vm_events(PGREFILL, nr_scanned); 2034 2040 __count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned); 2035 2041 2036 2042 spin_unlock_irq(&pgdat->lru_lock); ··· 2327 2331 unsigned long protection; 2328 2332 2329 2333 lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx); 2330 - protection = mem_cgroup_protection(memcg, 2334 + protection = mem_cgroup_protection(sc->target_mem_cgroup, 2335 + memcg, 2331 2336 sc->memcg_low_reclaim); 2332 2337 2333 2338 if (protection) { ··· 2616 2619 unsigned long reclaimed; 2617 2620 unsigned long scanned; 2618 2621 2619 - switch (mem_cgroup_protected(target_memcg, memcg)) { 2620 - case MEMCG_PROT_MIN: 2622 + mem_cgroup_calculate_protection(target_memcg, memcg); 2623 + 2624 + if (mem_cgroup_below_min(memcg)) { 2621 2625 /* 2622 2626 * Hard protection. 2623 2627 * If there is no reclaimable memory, OOM. 2624 2628 */ 2625 2629 continue; 2626 - case MEMCG_PROT_LOW: 2630 + } else if (mem_cgroup_below_low(memcg)) { 2627 2631 /* 2628 2632 * Soft protection. 
2629 2633 * Respect the protection only as long as ··· 2636 2638 continue; 2637 2639 } 2638 2640 memcg_memory_event(memcg, MEMCG_LOW); 2639 - break; 2640 - case MEMCG_PROT_NONE: 2641 - /* 2642 - * All protection thresholds breached. We may 2643 - * still choose to vary the scan pressure 2644 - * applied based on by how much the cgroup in 2645 - * question has exceeded its protection 2646 - * thresholds (see get_scan_count). 2647 - */ 2648 - break; 2649 2641 } 2650 2642 2651 2643 reclaimed = sc->nr_reclaimed; ··· 3306 3318 bool may_swap) 3307 3319 { 3308 3320 unsigned long nr_reclaimed; 3309 - unsigned long pflags; 3310 3321 unsigned int noreclaim_flag; 3311 3322 struct scan_control sc = { 3312 3323 .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), ··· 3326 3339 struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); 3327 3340 3328 3341 set_task_reclaim_state(current, &sc.reclaim_state); 3329 - 3330 3342 trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask); 3331 - 3332 - psi_memstall_enter(&pflags); 3333 3343 noreclaim_flag = memalloc_noreclaim_save(); 3334 3344 3335 3345 nr_reclaimed = do_try_to_free_pages(zonelist, &sc); 3336 3346 3337 3347 memalloc_noreclaim_restore(noreclaim_flag); 3338 - psi_memstall_leave(&pflags); 3339 - 3340 3348 trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); 3341 3349 set_task_reclaim_state(current, NULL); 3342 3350 ··· 4204 4222 * unmapped file backed pages. 4205 4223 */ 4206 4224 if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages && 4207 - node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages) 4225 + node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <= 4226 + pgdat->min_slab_pages) 4208 4227 return NODE_RECLAIM_FULL; 4209 4228 4210 4229 /*
+30 -8
mm/vmstat.c
··· 341 341 long x; 342 342 long t; 343 343 344 + if (vmstat_item_in_bytes(item)) { 345 + VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); 346 + delta >>= PAGE_SHIFT; 347 + } 348 + 344 349 x = delta + __this_cpu_read(*p); 345 350 346 351 t = __this_cpu_read(pcp->stat_threshold); ··· 403 398 s8 __percpu *p = pcp->vm_node_stat_diff + item; 404 399 s8 v, t; 405 400 401 + VM_WARN_ON_ONCE(vmstat_item_in_bytes(item)); 402 + 406 403 v = __this_cpu_inc_return(*p); 407 404 t = __this_cpu_read(pcp->stat_threshold); 408 405 if (unlikely(v > t)) { ··· 448 441 struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; 449 442 s8 __percpu *p = pcp->vm_node_stat_diff + item; 450 443 s8 v, t; 444 + 445 + VM_WARN_ON_ONCE(vmstat_item_in_bytes(item)); 451 446 452 447 v = __this_cpu_dec_return(*p); 453 448 t = __this_cpu_read(pcp->stat_threshold); ··· 549 540 struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats; 550 541 s8 __percpu *p = pcp->vm_node_stat_diff + item; 551 542 long o, n, t, z; 543 + 544 + if (vmstat_item_in_bytes(item)) { 545 + VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); 546 + delta >>= PAGE_SHIFT; 547 + } 552 548 553 549 do { 554 550 z = 0; /* overflow to node counters */ ··· 1003 989 /* 1004 990 * Determine the per node value of a stat item. 
1005 991 */ 1006 - unsigned long node_page_state(struct pglist_data *pgdat, 1007 - enum node_stat_item item) 992 + unsigned long node_page_state_pages(struct pglist_data *pgdat, 993 + enum node_stat_item item) 1008 994 { 1009 995 long x = atomic_long_read(&pgdat->vm_stat[item]); 1010 996 #ifdef CONFIG_SMP ··· 1012 998 x = 0; 1013 999 #endif 1014 1000 return x; 1001 + } 1002 + 1003 + unsigned long node_page_state(struct pglist_data *pgdat, 1004 + enum node_stat_item item) 1005 + { 1006 + VM_WARN_ON_ONCE(vmstat_item_in_bytes(item)); 1007 + 1008 + return node_page_state_pages(pgdat, item); 1015 1009 } 1016 1010 #endif 1017 1011 ··· 1140 1118 "nr_zone_write_pending", 1141 1119 "nr_mlock", 1142 1120 "nr_page_table_pages", 1143 - "nr_kernel_stack", 1144 - #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) 1145 - "nr_shadow_call_stack", 1146 - #endif 1147 1121 "nr_bounce", 1148 1122 #if IS_ENABLED(CONFIG_ZSMALLOC) 1149 1123 "nr_zspages", ··· 1190 1172 "nr_kernel_misc_reclaimable", 1191 1173 "nr_foll_pin_acquired", 1192 1174 "nr_foll_pin_released", 1175 + "nr_kernel_stack", 1176 + #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK) 1177 + "nr_shadow_call_stack", 1178 + #endif 1193 1179 1194 1180 /* enum writeback_stat_item counters */ 1195 1181 "nr_dirty_threshold", ··· 1599 1577 seq_printf(m, "\n per-node stats"); 1600 1578 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { 1601 1579 seq_printf(m, "\n %-12s %lu", node_stat_name(i), 1602 - node_page_state(pgdat, i)); 1580 + node_page_state_pages(pgdat, i)); 1603 1581 } 1604 1582 } 1605 1583 seq_printf(m, ··· 1720 1698 #endif 1721 1699 1722 1700 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 1723 - v[i] = global_node_page_state(i); 1701 + v[i] = global_node_page_state_pages(i); 1724 1702 v += NR_VM_NODE_STAT_ITEMS; 1725 1703 1726 1704 global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
+4 -2
mm/workingset.c
··· 486 486 for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) 487 487 pages += lruvec_page_state_local(lruvec, 488 488 NR_LRU_BASE + i); 489 - pages += lruvec_page_state_local(lruvec, NR_SLAB_RECLAIMABLE); 490 - pages += lruvec_page_state_local(lruvec, NR_SLAB_UNRECLAIMABLE); 489 + pages += lruvec_page_state_local( 490 + lruvec, NR_SLAB_RECLAIMABLE_B) >> PAGE_SHIFT; 491 + pages += lruvec_page_state_local( 492 + lruvec, NR_SLAB_UNRECLAIMABLE_B) >> PAGE_SHIFT; 491 493 } else 492 494 #endif 493 495 pages = node_present_pages(sc->nid);
+2 -2
net/atm/mpoa_caches.c
··· 180 180 static void in_cache_put(in_cache_entry *entry) 181 181 { 182 182 if (refcount_dec_and_test(&entry->use)) { 183 - kzfree(entry); 183 + kfree_sensitive(entry); 184 184 } 185 185 } 186 186 ··· 415 415 static void eg_cache_put(eg_cache_entry *entry) 416 416 { 417 417 if (refcount_dec_and_test(&entry->use)) { 418 - kzfree(entry); 418 + kfree_sensitive(entry); 419 419 } 420 420 } 421 421
+3 -3
net/bluetooth/ecdh_helper.c
··· 104 104 free_all: 105 105 kpp_request_free(req); 106 106 free_tmp: 107 - kzfree(tmp); 107 + kfree_sensitive(tmp); 108 108 return err; 109 109 } 110 110 ··· 151 151 err = crypto_kpp_set_secret(tfm, buf, buf_len); 152 152 /* fall through */ 153 153 free_all: 154 - kzfree(buf); 154 + kfree_sensitive(buf); 155 155 free_tmp: 156 - kzfree(tmp); 156 + kfree_sensitive(tmp); 157 157 return err; 158 158 } 159 159
+12 -12
net/bluetooth/smp.c
··· 753 753 complete = test_bit(SMP_FLAG_COMPLETE, &smp->flags); 754 754 mgmt_smp_complete(hcon, complete); 755 755 756 - kzfree(smp->csrk); 757 - kzfree(smp->slave_csrk); 758 - kzfree(smp->link_key); 756 + kfree_sensitive(smp->csrk); 757 + kfree_sensitive(smp->slave_csrk); 758 + kfree_sensitive(smp->link_key); 759 759 760 760 crypto_free_shash(smp->tfm_cmac); 761 761 crypto_free_kpp(smp->tfm_ecdh); ··· 789 789 } 790 790 791 791 chan->data = NULL; 792 - kzfree(smp); 792 + kfree_sensitive(smp); 793 793 hci_conn_drop(hcon); 794 794 } 795 795 ··· 1156 1156 const u8 salt[16] = { 0x31, 0x70, 0x6d, 0x74 }; 1157 1157 1158 1158 if (smp_h7(smp->tfm_cmac, smp->tk, salt, smp->link_key)) { 1159 - kzfree(smp->link_key); 1159 + kfree_sensitive(smp->link_key); 1160 1160 smp->link_key = NULL; 1161 1161 return; 1162 1162 } ··· 1165 1165 const u8 tmp1[4] = { 0x31, 0x70, 0x6d, 0x74 }; 1166 1166 1167 1167 if (smp_h6(smp->tfm_cmac, smp->tk, tmp1, smp->link_key)) { 1168 - kzfree(smp->link_key); 1168 + kfree_sensitive(smp->link_key); 1169 1169 smp->link_key = NULL; 1170 1170 return; 1171 1171 } 1172 1172 } 1173 1173 1174 1174 if (smp_h6(smp->tfm_cmac, smp->link_key, lebr, smp->link_key)) { 1175 - kzfree(smp->link_key); 1175 + kfree_sensitive(smp->link_key); 1176 1176 smp->link_key = NULL; 1177 1177 return; 1178 1178 } ··· 1407 1407 free_shash: 1408 1408 crypto_free_shash(smp->tfm_cmac); 1409 1409 zfree_smp: 1410 - kzfree(smp); 1410 + kfree_sensitive(smp); 1411 1411 return NULL; 1412 1412 } 1413 1413 ··· 3278 3278 tfm_cmac = crypto_alloc_shash("cmac(aes)", 0, 0); 3279 3279 if (IS_ERR(tfm_cmac)) { 3280 3280 BT_ERR("Unable to create CMAC crypto context"); 3281 - kzfree(smp); 3281 + kfree_sensitive(smp); 3282 3282 return ERR_CAST(tfm_cmac); 3283 3283 } 3284 3284 ··· 3286 3286 if (IS_ERR(tfm_ecdh)) { 3287 3287 BT_ERR("Unable to create ECDH crypto context"); 3288 3288 crypto_free_shash(tfm_cmac); 3289 - kzfree(smp); 3289 + kfree_sensitive(smp); 3290 3290 return ERR_CAST(tfm_ecdh); 3291 3291 } 
3292 3292 ··· 3300 3300 if (smp) { 3301 3301 crypto_free_shash(smp->tfm_cmac); 3302 3302 crypto_free_kpp(smp->tfm_ecdh); 3303 - kzfree(smp); 3303 + kfree_sensitive(smp); 3304 3304 } 3305 3305 return ERR_PTR(-ENOMEM); 3306 3306 } ··· 3347 3347 chan->data = NULL; 3348 3348 crypto_free_shash(smp->tfm_cmac); 3349 3349 crypto_free_kpp(smp->tfm_ecdh); 3350 - kzfree(smp); 3350 + kfree_sensitive(smp); 3351 3351 } 3352 3352 3353 3353 l2cap_chan_put(chan);
+1 -1
net/core/sock.c
··· 2265 2265 if (WARN_ON_ONCE(!mem)) 2266 2266 return; 2267 2267 if (nullify) 2268 - kzfree(mem); 2268 + kfree_sensitive(mem); 2269 2269 else 2270 2270 kfree(mem); 2271 2271 atomic_sub(size, &sk->sk_omem_alloc);
+1 -1
net/ipv4/tcp_fastopen.c
··· 38 38 struct tcp_fastopen_context *ctx = 39 39 container_of(head, struct tcp_fastopen_context, rcu); 40 40 41 - kzfree(ctx); 41 + kfree_sensitive(ctx); 42 42 } 43 43 44 44 void tcp_fastopen_destroy_cipher(struct sock *sk)
+2 -2
net/mac80211/aead_api.c
··· 41 41 aead_request_set_ad(aead_req, sg[0].length); 42 42 43 43 crypto_aead_encrypt(aead_req); 44 - kzfree(aead_req); 44 + kfree_sensitive(aead_req); 45 45 46 46 return 0; 47 47 } ··· 76 76 aead_request_set_ad(aead_req, sg[0].length); 77 77 78 78 err = crypto_aead_decrypt(aead_req); 79 - kzfree(aead_req); 79 + kfree_sensitive(aead_req); 80 80 81 81 return err; 82 82 }
+1 -1
net/mac80211/aes_gmac.c
··· 60 60 aead_request_set_ad(aead_req, GMAC_AAD_LEN + data_len); 61 61 62 62 crypto_aead_encrypt(aead_req); 63 - kzfree(aead_req); 63 + kfree_sensitive(aead_req); 64 64 65 65 return 0; 66 66 }
+1 -1
net/mac80211/key.c
··· 732 732 ieee80211_aes_gcm_key_free(key->u.gcmp.tfm); 733 733 break; 734 734 } 735 - kzfree(key); 735 + kfree_sensitive(key); 736 736 } 737 737 738 738 static void __ieee80211_key_destroy(struct ieee80211_key *key,
+10 -10
net/mac802154/llsec.c
··· 49 49 50 50 msl = container_of(sl, struct mac802154_llsec_seclevel, level); 51 51 list_del(&sl->list); 52 - kzfree(msl); 52 + kfree_sensitive(msl); 53 53 } 54 54 55 55 list_for_each_entry_safe(dev, dn, &sec->table.devices, list) { ··· 66 66 mkey = container_of(key->key, struct mac802154_llsec_key, key); 67 67 list_del(&key->list); 68 68 llsec_key_put(mkey); 69 - kzfree(key); 69 + kfree_sensitive(key); 70 70 } 71 71 } 72 72 ··· 155 155 if (key->tfm[i]) 156 156 crypto_free_aead(key->tfm[i]); 157 157 158 - kzfree(key); 158 + kfree_sensitive(key); 159 159 return NULL; 160 160 } 161 161 ··· 170 170 crypto_free_aead(key->tfm[i]); 171 171 172 172 crypto_free_sync_skcipher(key->tfm0); 173 - kzfree(key); 173 + kfree_sensitive(key); 174 174 } 175 175 176 176 static struct mac802154_llsec_key* ··· 261 261 return 0; 262 262 263 263 fail: 264 - kzfree(new); 264 + kfree_sensitive(new); 265 265 return -ENOMEM; 266 266 } 267 267 ··· 341 341 devkey); 342 342 343 343 list_del(&pos->list); 344 - kzfree(devkey); 344 + kfree_sensitive(devkey); 345 345 } 346 346 347 - kzfree(dev); 347 + kfree_sensitive(dev); 348 348 } 349 349 350 350 int mac802154_llsec_dev_add(struct mac802154_llsec *sec, ··· 682 682 683 683 rc = crypto_aead_encrypt(req); 684 684 685 - kzfree(req); 685 + kfree_sensitive(req); 686 686 687 687 return rc; 688 688 } ··· 886 886 887 887 rc = crypto_aead_decrypt(req); 888 888 889 - kzfree(req); 889 + kfree_sensitive(req); 890 890 skb_trim(skb, skb->len - authlen); 891 891 892 892 return rc; ··· 926 926 if (!devkey) 927 927 list_add_rcu(&next->devkey.list, &dev->dev.keys); 928 928 else 929 - kzfree(next); 929 + kfree_sensitive(next); 930 930 931 931 spin_unlock_bh(&dev->lock); 932 932 }
+1 -1
net/sctp/auth.c
··· 49 49 return; 50 50 51 51 if (refcount_dec_and_test(&key->refcnt)) { 52 - kzfree(key); 52 + kfree_sensitive(key); 53 53 SCTP_DBG_OBJCNT_DEC(keys); 54 54 } 55 55 }
+2 -2
net/sunrpc/auth_gss/gss_krb5_crypto.c
··· 1003 1003 err = 0; 1004 1004 1005 1005 out_err: 1006 - kzfree(desc); 1006 + kfree_sensitive(desc); 1007 1007 crypto_free_shash(hmac); 1008 1008 dprintk("%s: returning %d\n", __func__, err); 1009 1009 return err; ··· 1079 1079 err = 0; 1080 1080 1081 1081 out_err: 1082 - kzfree(desc); 1082 + kfree_sensitive(desc); 1083 1083 crypto_free_shash(hmac); 1084 1084 dprintk("%s: returning %d\n", __func__, err); 1085 1085 return err;
+3 -3
net/sunrpc/auth_gss/gss_krb5_keys.c
··· 228 228 ret = 0; 229 229 230 230 err_free_raw: 231 - kzfree(rawkey); 231 + kfree_sensitive(rawkey); 232 232 err_free_out: 233 - kzfree(outblockdata); 233 + kfree_sensitive(outblockdata); 234 234 err_free_in: 235 - kzfree(inblockdata); 235 + kfree_sensitive(inblockdata); 236 236 err_free_cipher: 237 237 crypto_free_sync_skcipher(cipher); 238 238 err_return:
+1 -1
net/sunrpc/auth_gss/gss_krb5_mech.c
··· 443 443 desc->tfm = hmac; 444 444 445 445 err = crypto_shash_digest(desc, sigkeyconstant, slen, ctx->cksum); 446 - kzfree(desc); 446 + kfree_sensitive(desc); 447 447 if (err) 448 448 goto out_err_free_hmac; 449 449 /*
+5 -5
net/tipc/crypto.c
··· 441 441 /* Allocate per-cpu TFM entry pointer */ 442 442 tmp->tfm_entry = alloc_percpu(struct tipc_tfm *); 443 443 if (!tmp->tfm_entry) { 444 - kzfree(tmp); 444 + kfree_sensitive(tmp); 445 445 return -ENOMEM; 446 446 } 447 447 ··· 491 491 /* Not any TFM is allocated? */ 492 492 if (!tfm_cnt) { 493 493 free_percpu(tmp->tfm_entry); 494 - kzfree(tmp); 494 + kfree_sensitive(tmp); 495 495 return err; 496 496 } 497 497 ··· 545 545 546 546 aead->tfm_entry = alloc_percpu_gfp(struct tipc_tfm *, GFP_ATOMIC); 547 547 if (unlikely(!aead->tfm_entry)) { 548 - kzfree(aead); 548 + kfree_sensitive(aead); 549 549 return -ENOMEM; 550 550 } 551 551 ··· 1352 1352 /* Allocate statistic structure */ 1353 1353 c->stats = alloc_percpu_gfp(struct tipc_crypto_stats, GFP_ATOMIC); 1354 1354 if (!c->stats) { 1355 - kzfree(c); 1355 + kfree_sensitive(c); 1356 1356 return -ENOMEM; 1357 1357 } 1358 1358 ··· 1408 1408 free_percpu(c->stats); 1409 1409 1410 1410 *crypto = NULL; 1411 - kzfree(c); 1411 + kfree_sensitive(c); 1412 1412 } 1413 1413 1414 1414 void tipc_crypto_timeout(struct tipc_crypto *rx)
+1 -1
net/wireless/core.c
··· 1125 1125 } 1126 1126 1127 1127 #ifdef CONFIG_CFG80211_WEXT 1128 - kzfree(wdev->wext.keys); 1128 + kfree_sensitive(wdev->wext.keys); 1129 1129 wdev->wext.keys = NULL; 1130 1130 #endif 1131 1131 /* only initialized if we have a netdev */
+2 -2
net/wireless/ibss.c
··· 127 127 return -EINVAL; 128 128 129 129 if (WARN_ON(wdev->connect_keys)) 130 - kzfree(wdev->connect_keys); 130 + kfree_sensitive(wdev->connect_keys); 131 131 wdev->connect_keys = connkeys; 132 132 133 133 wdev->ibss_fixed = params->channel_fixed; ··· 161 161 162 162 ASSERT_WDEV_LOCK(wdev); 163 163 164 - kzfree(wdev->connect_keys); 164 + kfree_sensitive(wdev->connect_keys); 165 165 wdev->connect_keys = NULL; 166 166 167 167 rdev_set_qos_map(rdev, dev, NULL);
+1 -1
net/wireless/lib80211_crypt_tkip.c
··· 131 131 crypto_free_shash(_priv->tx_tfm_michael); 132 132 crypto_free_shash(_priv->rx_tfm_michael); 133 133 } 134 - kzfree(priv); 134 + kfree_sensitive(priv); 135 135 } 136 136 137 137 static inline u16 RotR1(u16 val)
+1 -1
net/wireless/lib80211_crypt_wep.c
··· 56 56 57 57 static void lib80211_wep_deinit(void *priv) 58 58 { 59 - kzfree(priv); 59 + kfree_sensitive(priv); 60 60 } 61 61 62 62 /* Add WEP IV/key info to a frame that has at least 4 bytes of headroom */
+12 -12
net/wireless/nl80211.c
··· 9836 9836 9837 9837 if ((ibss.chandef.width != NL80211_CHAN_WIDTH_20_NOHT) && 9838 9838 no_ht) { 9839 - kzfree(connkeys); 9839 + kfree_sensitive(connkeys); 9840 9840 return -EINVAL; 9841 9841 } 9842 9842 } ··· 9848 9848 int r = validate_pae_over_nl80211(rdev, info); 9849 9849 9850 9850 if (r < 0) { 9851 - kzfree(connkeys); 9851 + kfree_sensitive(connkeys); 9852 9852 return r; 9853 9853 } 9854 9854 ··· 9861 9861 wdev_lock(dev->ieee80211_ptr); 9862 9862 err = __cfg80211_join_ibss(rdev, dev, &ibss, connkeys); 9863 9863 if (err) 9864 - kzfree(connkeys); 9864 + kfree_sensitive(connkeys); 9865 9865 else if (info->attrs[NL80211_ATTR_SOCKET_OWNER]) 9866 9866 dev->ieee80211_ptr->conn_owner_nlportid = info->snd_portid; 9867 9867 wdev_unlock(dev->ieee80211_ptr); ··· 10289 10289 10290 10290 if (info->attrs[NL80211_ATTR_HT_CAPABILITY]) { 10291 10291 if (!info->attrs[NL80211_ATTR_HT_CAPABILITY_MASK]) { 10292 - kzfree(connkeys); 10292 + kfree_sensitive(connkeys); 10293 10293 return -EINVAL; 10294 10294 } 10295 10295 memcpy(&connect.ht_capa, ··· 10307 10307 10308 10308 if (info->attrs[NL80211_ATTR_VHT_CAPABILITY]) { 10309 10309 if (!info->attrs[NL80211_ATTR_VHT_CAPABILITY_MASK]) { 10310 - kzfree(connkeys); 10310 + kfree_sensitive(connkeys); 10311 10311 return -EINVAL; 10312 10312 } 10313 10313 memcpy(&connect.vht_capa, ··· 10321 10321 (rdev->wiphy.features & NL80211_FEATURE_QUIET)) && 10322 10322 !wiphy_ext_feature_isset(&rdev->wiphy, 10323 10323 NL80211_EXT_FEATURE_RRM)) { 10324 - kzfree(connkeys); 10324 + kfree_sensitive(connkeys); 10325 10325 return -EINVAL; 10326 10326 } 10327 10327 connect.flags |= ASSOC_REQ_USE_RRM; ··· 10329 10329 10330 10330 connect.pbss = nla_get_flag(info->attrs[NL80211_ATTR_PBSS]); 10331 10331 if (connect.pbss && !rdev->wiphy.bands[NL80211_BAND_60GHZ]) { 10332 - kzfree(connkeys); 10332 + kfree_sensitive(connkeys); 10333 10333 return -EOPNOTSUPP; 10334 10334 } 10335 10335 10336 10336 if (info->attrs[NL80211_ATTR_BSS_SELECT]) { 10337 10337 /* bss 
selection makes no sense if bssid is set */ 10338 10338 if (connect.bssid) { 10339 - kzfree(connkeys); 10339 + kfree_sensitive(connkeys); 10340 10340 return -EINVAL; 10341 10341 } 10342 10342 10343 10343 err = parse_bss_select(info->attrs[NL80211_ATTR_BSS_SELECT], 10344 10344 wiphy, &connect.bss_select); 10345 10345 if (err) { 10346 - kzfree(connkeys); 10346 + kfree_sensitive(connkeys); 10347 10347 return err; 10348 10348 } 10349 10349 } ··· 10373 10373 info->attrs[NL80211_ATTR_FILS_ERP_REALM] || 10374 10374 info->attrs[NL80211_ATTR_FILS_ERP_NEXT_SEQ_NUM] || 10375 10375 info->attrs[NL80211_ATTR_FILS_ERP_RRK]) { 10376 - kzfree(connkeys); 10376 + kfree_sensitive(connkeys); 10377 10377 return -EINVAL; 10378 10378 } 10379 10379 10380 10380 if (nla_get_flag(info->attrs[NL80211_ATTR_EXTERNAL_AUTH_SUPPORT])) { 10381 10381 if (!info->attrs[NL80211_ATTR_SOCKET_OWNER]) { 10382 - kzfree(connkeys); 10382 + kfree_sensitive(connkeys); 10383 10383 GENL_SET_ERR_MSG(info, 10384 10384 "external auth requires connection ownership"); 10385 10385 return -EINVAL; ··· 10392 10392 err = cfg80211_connect(rdev, dev, &connect, connkeys, 10393 10393 connect.prev_bssid); 10394 10394 if (err) 10395 - kzfree(connkeys); 10395 + kfree_sensitive(connkeys); 10396 10396 10397 10397 if (!err && info->attrs[NL80211_ATTR_SOCKET_OWNER]) { 10398 10398 dev->ieee80211_ptr->conn_owner_nlportid = info->snd_portid;
+3 -3
net/wireless/sme.c
··· 742 742 } 743 743 744 744 if (cr->status != WLAN_STATUS_SUCCESS) { 745 - kzfree(wdev->connect_keys); 745 + kfree_sensitive(wdev->connect_keys); 746 746 wdev->connect_keys = NULL; 747 747 wdev->ssid_len = 0; 748 748 wdev->conn_owner_nlportid = 0; ··· 1098 1098 wdev->current_bss = NULL; 1099 1099 wdev->ssid_len = 0; 1100 1100 wdev->conn_owner_nlportid = 0; 1101 - kzfree(wdev->connect_keys); 1101 + kfree_sensitive(wdev->connect_keys); 1102 1102 wdev->connect_keys = NULL; 1103 1103 1104 1104 nl80211_send_disconnected(rdev, dev, reason, ie, ie_len, from_ap); ··· 1281 1281 1282 1282 ASSERT_WDEV_LOCK(wdev); 1283 1283 1284 - kzfree(wdev->connect_keys); 1284 + kfree_sensitive(wdev->connect_keys); 1285 1285 wdev->connect_keys = NULL; 1286 1286 1287 1287 wdev->conn_owner_nlportid = 0;
+1 -1
net/wireless/util.c
··· 871 871 } 872 872 } 873 873 874 - kzfree(wdev->connect_keys); 874 + kfree_sensitive(wdev->connect_keys); 875 875 wdev->connect_keys = NULL; 876 876 } 877 877
+1 -1
net/wireless/wext-sme.c
··· 57 57 err = cfg80211_connect(rdev, wdev->netdev, 58 58 &wdev->wext.connect, ck, prev_bssid); 59 59 if (err) 60 - kzfree(ck); 60 + kfree_sensitive(ck); 61 61 62 62 return err; 63 63 }
+2 -1
scripts/Makefile.kasan
··· 44 44 endif 45 45 46 46 CFLAGS_KASAN := -fsanitize=kernel-hwaddress \ 47 - -mllvm -hwasan-instrument-stack=0 \ 47 + -mllvm -hwasan-instrument-stack=$(CONFIG_KASAN_STACK) \ 48 + -mllvm -hwasan-use-short-granules=0 \ 48 49 $(instrumentation_flags) 49 50 50 51 endif # CONFIG_KASAN_SW_TAGS
+2
scripts/bloat-o-meter
··· 26 26 sym = {} 27 27 with os.popen("nm --size-sort " + file) as f: 28 28 for line in f: 29 + if line.startswith("\n") or ":" in line: 30 + continue 29 31 size, type, name = line.split() 30 32 if type in format: 31 33 # strip generated symbols
+2 -2
scripts/coccinelle/free/devm_free.cocci
··· 89 89 ( 90 90 kfree@p(x) 91 91 | 92 - kzfree@p(x) 92 + kfree_sensitive@p(x) 93 93 | 94 94 krealloc@p(x, ...) 95 95 | ··· 112 112 ( 113 113 * kfree@p(x) 114 114 | 115 - * kzfree@p(x) 115 + * kfree_sensitive@p(x) 116 116 | 117 117 * krealloc@p(x, ...) 118 118 |
+2 -2
scripts/coccinelle/free/ifnullfree.cocci
··· 21 21 ( 22 22 kfree(E); 23 23 | 24 - kzfree(E); 24 + kfree_sensitive(E); 25 25 | 26 26 debugfs_remove(E); 27 27 | ··· 42 42 @@ 43 43 44 44 * if (E != NULL) 45 - * \(kfree@p\|kzfree@p\|debugfs_remove@p\|debugfs_remove_recursive@p\| 45 + * \(kfree@p\|kfree_sensitive@p\|debugfs_remove@p\|debugfs_remove_recursive@p\| 46 46 * usb_free_urb@p\|kmem_cache_destroy@p\|mempool_destroy@p\| 47 47 * dma_pool_destroy@p\)(E); 48 48
+3 -3
scripts/coccinelle/free/kfree.cocci
··· 24 24 ( 25 25 * kfree@p1(E) 26 26 | 27 - * kzfree@p1(E) 27 + * kfree_sensitive@p1(E) 28 28 ) 29 29 30 30 @print expression@ ··· 68 68 ( 69 69 * kfree@ok(E) 70 70 | 71 - * kzfree@ok(E) 71 + * kfree_sensitive@ok(E) 72 72 ) 73 73 ... when != break; 74 74 when != goto l; ··· 86 86 ( 87 87 * kfree@p1(E,...) 88 88 | 89 - * kzfree@p1(E,...) 89 + * kfree_sensitive@p1(E,...) 90 90 ) 91 91 ... 92 92 (
+1 -1
scripts/coccinelle/free/kfreeaddr.cocci
··· 20 20 ( 21 21 * kfree@p(&e->f) 22 22 | 23 - * kzfree@p(&e->f) 23 + * kfree_sensitive@p(&e->f) 24 24 ) 25 25 26 26 @script:python depends on org@
+1
scripts/const_structs.checkpatch
··· 44 44 platform_suspend_ops 45 45 proto_ops 46 46 regmap_access_table 47 + regulator_ops 47 48 rpc_pipe_ops 48 49 rtc_class_ops 49 50 sd_desc
+70 -9
scripts/decode_stacktrace.sh
··· 3 3 # (c) 2014, Sasha Levin <sasha.levin@oracle.com> 4 4 #set -x 5 5 6 - if [[ $# < 2 ]]; then 6 + if [[ $# < 1 ]]; then 7 7 echo "Usage:" 8 - echo " $0 [vmlinux] [base path] [modules path]" 8 + echo " $0 -r <release> | <vmlinux> [base path] [modules path]" 9 9 exit 1 10 10 fi 11 11 12 - vmlinux=$1 13 - basepath=$2 14 - modpath=$3 12 + if [[ $1 == "-r" ]] ; then 13 + vmlinux="" 14 + basepath="auto" 15 + modpath="" 16 + release=$2 17 + 18 + for fn in {,/usr/lib/debug}/boot/vmlinux-$release{,.debug} /lib/modules/$release{,/build}/vmlinux ; do 19 + if [ -e "$fn" ] ; then 20 + vmlinux=$fn 21 + break 22 + fi 23 + done 24 + 25 + if [[ $vmlinux == "" ]] ; then 26 + echo "ERROR! vmlinux image for release $release is not found" >&2 27 + exit 2 28 + fi 29 + else 30 + vmlinux=$1 31 + basepath=${2-auto} 32 + modpath=$3 33 + release="" 34 + fi 35 + 15 36 declare -A cache 16 37 declare -A modcache 38 + 39 + find_module() { 40 + if [[ "$modpath" != "" ]] ; then 41 + for fn in $(find "$modpath" -name "${module//_/[-_]}.ko*") ; do 42 + if readelf -WS "$fn" | grep -qwF .debug_line ; then 43 + echo $fn 44 + return 45 + fi 46 + done 47 + return 1 48 + fi 49 + 50 + modpath=$(dirname "$vmlinux") 51 + find_module && return 52 + 53 + if [[ $release == "" ]] ; then 54 + release=$(gdb -ex 'print init_uts_ns.name.release' -ex 'quit' -quiet -batch "$vmlinux" | sed -n 's/\$1 = "\(.*\)".*/\1/p') 55 + fi 56 + 57 + for dn in {/usr/lib/debug,}/lib/modules/$release ; do 58 + if [ -e "$dn" ] ; then 59 + modpath="$dn" 60 + find_module && return 61 + fi 62 + done 63 + 64 + modpath="" 65 + return 1 66 + } 17 67 18 68 parse_symbol() { 19 69 # The structure of symbol at this point is: ··· 77 27 elif [[ "${modcache[$module]+isset}" == "isset" ]]; then 78 28 local objfile=${modcache[$module]} 79 29 else 80 - if [[ $modpath == "" ]]; then 30 + local objfile=$(find_module) 31 + if [[ $objfile == "" ]] ; then 81 32 echo "WARNING! Modules path isn't set, but is needed to parse this symbol" >&2 82 33 return 83 34 } 84 - local objfile=$(find "$modpath" -name "${module//_/[-_]}.ko*" -print -quit) 85 - [[ $objfile == "" ]] && return 86 35 modcache[$module]=$objfile 87 36 fi 88 37 ··· 105 56 if [[ "${cache[$module,$name]+isset}" == "isset" ]]; then 106 57 local base_addr=${cache[$module,$name]} 107 58 else 108 - local base_addr=$(nm "$objfile" | grep -i ' t ' | awk "/ $name\$/ {print \$1}" | head -n1) 59 + local base_addr=$(nm "$objfile" | awk '$3 == "'$name'" && ($2 == "t" || $2 == "T") {print $1; exit}') 60 + if [[ $base_addr == "" ]] ; then 61 + # address not found 62 + return 63 + fi 109 64 cache[$module,$name]="$base_addr" 110 65 fi 111 66 # Let's start doing the math to get the exact address into the ··· 200 147 # Add up the line number to the symbol 201 148 echo "${words[@]}" "$symbol $module" 202 149 } 150 + 151 + if [[ $basepath == "auto" ]] ; then 152 + module="" 153 + symbol="kernel_init+0x0/0x0" 154 + parse_symbol 155 + basepath=${symbol#kernel_init (} 156 + basepath=${basepath%/init/main.c:*)} 157 + fi 203 158 204 159 while read line; do 205 160 # Let's see if we have an address in the line
+19
scripts/spelling.txt
··· 149 149 architechture||architecture 150 150 arguement||argument 151 151 arguements||arguments 152 + arithmatic||arithmetic 152 153 aritmetic||arithmetic 153 154 arne't||aren't 154 155 arraival||arrival ··· 455 454 destroied||destroyed 456 455 detabase||database 457 456 deteced||detected 457 + detectt||detect 458 458 develope||develop 459 459 developement||development 460 460 developped||developed ··· 547 545 entites||entities 548 546 entrys||entries 549 547 enocded||encoded 548 + enought||enough 550 549 enterily||entirely 551 550 enviroiment||environment 552 551 enviroment||environment ··· 559 556 equivilant||equivalent 560 557 eror||error 561 558 errorr||error 559 + errror||error 562 560 estbalishment||establishment 563 561 etsablishment||establishment 564 562 etsbalishment||establishment 563 + evalution||evaluation 565 564 excecutable||executable 566 565 exceded||exceeded 566 + exceds||exceeds 567 567 exceeed||exceed 568 568 excellant||excellent 569 569 execeeded||exceeded ··· 589 583 expresion||expression 590 584 exprimental||experimental 591 585 extened||extended 586 + exteneded||extended 592 587 extensability||extensibility 593 588 extention||extension 594 589 extenstion||extension ··· 617 610 fetaure||feature 618 611 fetaures||features 619 612 fileystem||filesystem 613 + fimrware||firmware 620 614 fimware||firmware 621 615 firmare||firmware 622 616 firmaware||firmware 623 617 firware||firmware 618 + firwmare||firmware 624 619 finanize||finalize 625 620 findn||find 626 621 finilizes||finalizes ··· 670 661 grabing||grabbing 671 662 grahical||graphical 672 663 grahpical||graphical 664 + granularty||granularity 673 665 grapic||graphic 674 666 grranted||granted 675 667 guage||gauge ··· 916 906 mmnemonic||mnemonic 917 907 mnay||many 918 908 modfiy||modify 909 + modifer||modifier 919 910 modulues||modules 920 911 momery||memory 921 912 memomry||memory ··· 926 915 monocrome||monochrome 927 916 mopdule||module 928 917 mroe||more 918 + multipler||multiplier 929 919 mulitplied||multiplied 930 920 multidimensionnal||multidimensional 931 921 multipe||multiple ··· 964 952 occationally||occasionally 965 953 occurance||occurrence 966 954 occurances||occurrences 955 + occurd||occurred 967 956 occured||occurred 968 957 occurence||occurrence 969 958 occure||occurred ··· 1071 1058 preemptable||preemptible 1072 1059 prefered||preferred 1073 1060 prefferably||preferably 1061 + prefitler||prefilter 1074 1062 premption||preemption 1075 1063 prepaired||prepared 1076 1064 preperation||preparation ··· 1115 1101 propery||property 1116 1102 propigate||propagate 1117 1103 propigation||propagation 1104 + propogation||propagation 1118 1105 propogate||propagate 1119 1106 prosess||process 1120 1107 protable||portable ··· 1331 1316 subdirectoires||subdirectories 1332 1317 suble||subtle 1333 1318 substract||subtract 1319 + submited||submitted 1334 1320 submition||submission 1335 1321 suceed||succeed 1336 1322 succesfully||successfully ··· 1340 1324 successfull||successful 1341 1325 successfuly||successfully 1342 1326 sucessfully||successfully 1327 + sucessful||successful 1343 1328 sucess||success 1344 1329 superflous||superfluous 1345 1330 superseeded||superseded ··· 1426 1409 trasfer||transfer 1427 1410 trasmission||transmission 1428 1411 treshold||threshold 1412 + triggerd||triggered 1429 1413 trigerred||triggered 1430 1414 trigerring||triggering 1431 1415 trun||turn ··· 1439 1421 usccess||success 1440 1422 usupported||unsupported 1441 1423 uncommited||uncommitted 1424 + uncompatible||incompatible 1442 1425 unconditionaly||unconditionally 1443 1426 undeflow||underflow 1444 1427 underun||underrun
+4 -14
scripts/tags.sh
··· 91 91 92 92 all_compiled_sources() 93 93 { 94 - for i in $(all_sources); do 95 - case "$i" in 96 - *.[cS]) 97 - j=${i/\.[cS]/\.o} 98 - j="${j#$tree}" 99 - if [ -e $j ]; then 100 - echo $i 101 - fi 102 - ;; 103 - *) 104 - echo $i 105 - ;; 106 - esac 107 - done 94 + realpath -es $([ -z "$KBUILD_ABS_SRCTREE" ] && echo --relative-to=.) \ 95 + include/generated/autoconf.h $(find -name "*.cmd" -exec \ 96 + grep -Poh '(?(?=^source_.* \K).*|(?=^ \K\S).*(?= \\))' {} \+ | 97 + awk '!a[$0]++') | sort -u 108 98 } 109 99 110 100 all_target_sources()
+2 -2
security/apparmor/domain.c
··· 40 40 return; 41 41 42 42 for (i = 0; i < domain->size; i++) 43 - kzfree(domain->table[i]); 44 - kzfree(domain->table); 43 + kfree_sensitive(domain->table[i]); 44 + kfree_sensitive(domain->table); 45 45 domain->table = NULL; 46 46 } 47 47 }
+1 -1
security/apparmor/include/file.h
··· 72 72 { 73 73 if (ctx) { 74 74 aa_put_label(rcu_access_pointer(ctx->label)); 75 - kzfree(ctx); 75 + kfree_sensitive(ctx); 76 76 } 77 77 } 78 78
+12 -12
security/apparmor/policy.c
··· 187 187 { 188 188 struct aa_data *data = ptr; 189 189 190 - kzfree(data->data); 191 - kzfree(data->key); 192 - kzfree(data); 190 + kfree_sensitive(data->data); 191 + kfree_sensitive(data->key); 192 + kfree_sensitive(data); 193 193 } 194 194 195 195 /** ··· 217 217 aa_put_profile(rcu_access_pointer(profile->parent)); 218 218 219 219 aa_put_ns(profile->ns); 220 - kzfree(profile->rename); 220 + kfree_sensitive(profile->rename); 221 221 222 222 aa_free_file_rules(&profile->file); 223 223 aa_free_cap_rules(&profile->caps); 224 224 aa_free_rlimit_rules(&profile->rlimits); 225 225 226 226 for (i = 0; i < profile->xattr_count; i++) 227 - kzfree(profile->xattrs[i]); 228 - kzfree(profile->xattrs); 227 + kfree_sensitive(profile->xattrs[i]); 228 + kfree_sensitive(profile->xattrs); 229 229 for (i = 0; i < profile->secmark_count; i++) 230 - kzfree(profile->secmark[i].label); 231 - kzfree(profile->secmark); 232 - kzfree(profile->dirname); 230 + kfree_sensitive(profile->secmark[i].label); 231 + kfree_sensitive(profile->secmark); 232 + kfree_sensitive(profile->dirname); 233 233 aa_put_dfa(profile->xmatch); 234 234 aa_put_dfa(profile->policy.dfa); 235 235 ··· 237 237 rht = profile->data; 238 238 profile->data = NULL; 239 239 rhashtable_free_and_destroy(rht, aa_free_data, NULL); 240 - kzfree(rht); 240 + kfree_sensitive(rht); 241 241 } 242 242 243 - kzfree(profile->hash); 243 + kfree_sensitive(profile->hash); 244 244 aa_put_loaddata(profile->rawdata); 245 245 aa_label_destroy(&profile->label); 246 246 247 - kzfree(profile); 247 + kfree_sensitive(profile); 248 248 } 249 249 250 250 /**
+3 -3
security/apparmor/policy_ns.c
··· 121 121 return ns; 122 122 123 123 fail_unconfined: 124 - kzfree(ns->base.hname); 124 + kfree_sensitive(ns->base.hname); 125 125 fail_ns: 126 - kzfree(ns); 126 + kfree_sensitive(ns); 127 127 return NULL; 128 128 } 129 129 ··· 145 145 146 146 ns->unconfined->ns = NULL; 147 147 aa_free_profile(ns->unconfined); 148 - kzfree(ns); 148 + kfree_sensitive(ns); 149 149 } 150 150 151 151 /**
+7 -7
security/apparmor/policy_unpack.c
··· 163 163 aa_put_ns(ns); 164 164 } 165 165 166 - kzfree(d->hash); 167 - kzfree(d->name); 166 + kfree_sensitive(d->hash); 167 + kfree_sensitive(d->name); 168 168 kvfree(d->data); 169 - kzfree(d); 169 + kfree_sensitive(d); 170 170 } 171 171 172 172 void aa_loaddata_kref(struct kref *kref) ··· 894 894 while (unpack_strdup(e, &key, NULL)) { 895 895 data = kzalloc(sizeof(*data), GFP_KERNEL); 896 896 if (!data) { 897 - kzfree(key); 897 + kfree_sensitive(key); 898 898 goto fail; 899 899 } 900 900 ··· 902 902 data->size = unpack_blob(e, &data->data, NULL); 903 903 data->data = kvmemdup(data->data, data->size); 904 904 if (data->size && !data->data) { 905 - kzfree(data->key); 906 - kzfree(data); 905 + kfree_sensitive(data->key); 906 + kfree_sensitive(data); 907 907 goto fail; 908 908 } 909 909 ··· 1037 1037 aa_put_profile(ent->old); 1038 1038 aa_put_profile(ent->new); 1039 1039 kfree(ent->ns_name); 1040 - kzfree(ent); 1040 + kfree_sensitive(ent); 1041 1041 } 1042 1042 } 1043 1043
+3 -3
security/keys/big_key.c
··· 138 138 err_fput: 139 139 fput(file); 140 140 err_enckey: 141 - kzfree(enckey); 141 + kfree_sensitive(enckey); 142 142 error: 143 143 memzero_explicit(buf, enclen); 144 144 kvfree(buf); ··· 155 155 156 156 path_put(path); 157 157 } 158 - kzfree(prep->payload.data[big_key_data]); 158 + kfree_sensitive(prep->payload.data[big_key_data]); 159 159 } 160 160 161 161 /* ··· 187 187 path->mnt = NULL; 188 188 path->dentry = NULL; 189 189 } 190 - kzfree(key->payload.data[big_key_data]); 190 + kfree_sensitive(key->payload.data[big_key_data]); 191 191 key->payload.data[big_key_data] = NULL; 192 192 } 193 193
+7 -7
security/keys/dh.c
··· 58 58 59 59 static void dh_free_data(struct dh *dh) 60 60 { 61 - kzfree(dh->key); 62 - kzfree(dh->p); 63 - kzfree(dh->g); 61 + kfree_sensitive(dh->key); 62 + kfree_sensitive(dh->p); 63 + kfree_sensitive(dh->g); 64 64 } 65 65 66 66 struct dh_completion { ··· 126 126 if (sdesc->shash.tfm) 127 127 crypto_free_shash(sdesc->shash.tfm); 128 128 129 - kzfree(sdesc); 129 + kfree_sensitive(sdesc); 130 130 } 131 131 132 132 /* ··· 220 220 ret = -EFAULT; 221 221 222 222 err: 223 - kzfree(outbuf); 223 + kfree_sensitive(outbuf); 224 224 return ret; 225 225 } 226 226 ··· 395 395 out6: 396 396 kpp_request_free(req); 397 397 out5: 398 - kzfree(outbuf); 398 + kfree_sensitive(outbuf); 399 399 out4: 400 400 crypto_free_kpp(tfm); 401 401 out3: 402 - kzfree(secret); 402 + kfree_sensitive(secret); 403 403 out2: 404 404 dh_free_data(&dh_inputs); 405 405 out1:
+7 -7
security/keys/encrypted-keys/encrypted.c
··· 370 370 master_keylen); 371 371 ret = crypto_shash_tfm_digest(hash_tfm, derived_buf, derived_buf_len, 372 372 derived_key); 373 - kzfree(derived_buf); 373 + kfree_sensitive(derived_buf); 374 374 return ret; 375 375 } 376 376 ··· 812 812 ret = encrypted_init(epayload, key->description, format, master_desc, 813 813 decrypted_datalen, hex_encoded_iv); 814 814 if (ret < 0) { 815 - kzfree(epayload); 815 + kfree_sensitive(epayload); 816 816 goto out; 817 817 } 818 818 819 819 rcu_assign_keypointer(key, epayload); 820 820 out: 821 - kzfree(datablob); 821 + kfree_sensitive(datablob); 822 822 return ret; 823 823 } 824 824 ··· 827 827 struct encrypted_key_payload *epayload; 828 828 829 829 epayload = container_of(rcu, struct encrypted_key_payload, rcu); 830 - kzfree(epayload); 830 + kfree_sensitive(epayload); 831 831 } 832 832 833 833 /* ··· 885 885 rcu_assign_keypointer(key, new_epayload); 886 886 call_rcu(&epayload->rcu, encrypted_rcu_free); 887 887 out: 888 - kzfree(buf); 888 + kfree_sensitive(buf); 889 889 return ret; 890 890 } 891 891 ··· 946 946 memzero_explicit(derived_key, sizeof(derived_key)); 947 947 948 948 memcpy(buffer, ascii_buf, asciiblob_len); 949 - kzfree(ascii_buf); 949 + kfree_sensitive(ascii_buf); 950 950 951 951 return asciiblob_len; 952 952 out: ··· 961 961 */ 962 962 static void encrypted_destroy(struct key *key) 963 963 { 964 - kzfree(key->payload.data[0]); 964 + kfree_sensitive(key->payload.data[0]); 965 965 } 966 966 967 967 struct key_type key_type_encrypted = {
+17 -17
security/keys/trusted-keys/trusted_tpm1.c
··· 68 68 } 69 69 70 70 ret = crypto_shash_digest(&sdesc->shash, data, datalen, digest); 71 - kzfree(sdesc); 71 + kfree_sensitive(sdesc); 72 72 return ret; 73 73 } 74 74 ··· 112 112 if (!ret) 113 113 ret = crypto_shash_final(&sdesc->shash, digest); 114 114 out: 115 - kzfree(sdesc); 115 + kfree_sensitive(sdesc); 116 116 return ret; 117 117 } 118 118 ··· 166 166 paramdigest, TPM_NONCE_SIZE, h1, 167 167 TPM_NONCE_SIZE, h2, 1, &c, 0, 0); 168 168 out: 169 - kzfree(sdesc); 169 + kfree_sensitive(sdesc); 170 170 return ret; 171 171 } 172 172 EXPORT_SYMBOL_GPL(TSS_authhmac); ··· 251 251 if (memcmp(testhmac, authdata, SHA1_DIGEST_SIZE)) 252 252 ret = -EINVAL; 253 253 out: 254 - kzfree(sdesc); 254 + kfree_sensitive(sdesc); 255 255 return ret; 256 256 } 257 257 EXPORT_SYMBOL_GPL(TSS_checkhmac1); ··· 353 353 if (memcmp(testhmac2, authdata2, SHA1_DIGEST_SIZE)) 354 354 ret = -EINVAL; 355 355 out: 356 - kzfree(sdesc); 356 + kfree_sensitive(sdesc); 357 357 return ret; 358 358 } 359 359 ··· 563 563 *bloblen = storedsize; 564 564 } 565 565 out: 566 - kzfree(td); 566 + kfree_sensitive(td); 567 567 return ret; 568 568 } 569 569 ··· 1031 1031 if (!ret && options->pcrlock) 1032 1032 ret = pcrlock(options->pcrlock); 1033 1033 out: 1034 - kzfree(datablob); 1035 - kzfree(options); 1034 + kfree_sensitive(datablob); 1035 + kfree_sensitive(options); 1036 1036 if (!ret) 1037 1037 rcu_assign_keypointer(key, payload); 1038 1038 else 1039 - kzfree(payload); 1039 + kfree_sensitive(payload); 1040 1040 return ret; 1041 1041 } 1042 1042 ··· 1045 1045 struct trusted_key_payload *p; 1046 1046 1047 1047 p = container_of(rcu, struct trusted_key_payload, rcu); 1048 - kzfree(p); 1048 + kfree_sensitive(p); 1049 1049 } 1050 1050 1051 1051 /* ··· 1087 1087 ret = datablob_parse(datablob, new_p, new_o); 1088 1088 if (ret != Opt_update) { 1089 1089 ret = -EINVAL; 1090 - kzfree(new_p); 1090 + kfree_sensitive(new_p); 1091 1091 goto out; 1092 1092 } 1093 1093 1094 1094 if (!new_o->keyhandle) { 1095 1095 ret = -EINVAL; 1096 - kzfree(new_p); 1096 + kfree_sensitive(new_p); 1097 1097 goto out; 1098 1098 } 1099 1099 ··· 1107 1107 ret = key_seal(new_p, new_o); 1108 1108 if (ret < 0) { 1109 1109 pr_info("trusted_key: key_seal failed (%d)\n", ret); 1110 - kzfree(new_p); 1110 + kfree_sensitive(new_p); 1111 1111 goto out; 1112 1112 } 1113 1113 if (new_o->pcrlock) { 1114 1114 ret = pcrlock(new_o->pcrlock); 1115 1115 if (ret < 0) { 1116 1116 pr_info("trusted_key: pcrlock failed (%d)\n", ret); 1117 - kzfree(new_p); 1117 + kfree_sensitive(new_p); 1118 1118 goto out; 1119 1119 } 1120 1120 } 1121 1121 rcu_assign_keypointer(key, new_p); 1122 1122 call_rcu(&p->rcu, trusted_rcu_free); 1123 1123 out: 1124 - kzfree(datablob); 1125 - kzfree(new_o); 1124 + kfree_sensitive(datablob); 1125 + kfree_sensitive(new_o); 1126 1126 return ret; 1127 1127 } 1128 1128 ··· 1154 1154 */ 1155 1155 static void trusted_destroy(struct key *key) 1156 1156 { 1157 - kzfree(key->payload.data[0]); 1157 + kfree_sensitive(key->payload.data[0]); 1158 1158 } 1159 1159 1160 1160 struct key_type key_type_trusted = {
+3 -3
security/keys/user_defined.c
··· 82 82 */ 83 83 void user_free_preparse(struct key_preparsed_payload *prep) 84 84 { 85 - kzfree(prep->payload.data[0]); 85 + kfree_sensitive(prep->payload.data[0]); 86 86 } 87 87 EXPORT_SYMBOL_GPL(user_free_preparse); 88 88 ··· 91 91 struct user_key_payload *payload; 92 92 93 93 payload = container_of(head, struct user_key_payload, rcu); 94 - kzfree(payload); 94 + kfree_sensitive(payload); 95 95 } 96 96 97 97 /* ··· 147 147 { 148 148 struct user_key_payload *upayload = key->payload.data[0]; 149 149 150 - kzfree(upayload); 150 + kfree_sensitive(upayload); 151 151 } 152 152 153 153 EXPORT_SYMBOL_GPL(user_destroy);
+226
tools/cgroup/memcg_slabinfo.py
··· 1 + #!/usr/bin/env drgn 2 + # 3 + # Copyright (C) 2020 Roman Gushchin <guro@fb.com> 4 + # Copyright (C) 2020 Facebook 5 + 6 + from os import stat 7 + import argparse 8 + import sys 9 + 10 + from drgn.helpers.linux import list_for_each_entry, list_empty 11 + from drgn.helpers.linux import for_each_page 12 + from drgn.helpers.linux.cpumask import for_each_online_cpu 13 + from drgn.helpers.linux.percpu import per_cpu_ptr 14 + from drgn import container_of, FaultError, Object 15 + 16 + 17 + DESC = """ 18 + This is a drgn script to provide slab statistics for memory cgroups. 19 + It supports cgroup v2 and v1 and can emulate memory.kmem.slabinfo 20 + interface of cgroup v1. 21 + For drgn, visit https://github.com/osandov/drgn. 22 + """ 23 + 24 + 25 + MEMCGS = {} 26 + 27 + OO_SHIFT = 16 28 + OO_MASK = ((1 << OO_SHIFT) - 1) 29 + 30 + 31 + def err(s): 32 + print('slabinfo.py: error: %s' % s, file=sys.stderr, flush=True) 33 + sys.exit(1) 34 + 35 + 36 + def find_memcg_ids(css=prog['root_mem_cgroup'].css, prefix=''): 37 + if not list_empty(css.children.address_of_()): 38 + for css in list_for_each_entry('struct cgroup_subsys_state', 39 + css.children.address_of_(), 40 + 'sibling'): 41 + name = prefix + '/' + css.cgroup.kn.name.string_().decode('utf-8') 42 + memcg = container_of(css, 'struct mem_cgroup', 'css') 43 + MEMCGS[css.cgroup.kn.id.value_()] = memcg 44 + find_memcg_ids(css, name) 45 + 46 + 47 + def is_root_cache(s): 48 + try: 49 + return False if s.memcg_params.root_cache else True 50 + except AttributeError: 51 + return True 52 + 53 + 54 + def cache_name(s): 55 + if is_root_cache(s): 56 + return s.name.string_().decode('utf-8') 57 + else: 58 + return s.memcg_params.root_cache.name.string_().decode('utf-8') 59 + 60 + 61 + # SLUB 62 + 63 + def oo_order(s): 64 + return s.oo.x >> OO_SHIFT 65 + 66 + 67 + def oo_objects(s): 68 + return s.oo.x & OO_MASK 69 + 70 + 71 + def count_partial(n, fn): 72 + nr_pages = 0 73 + for page in list_for_each_entry('struct page', n.partial.address_of_(), 74 + 'lru'): 75 + nr_pages += fn(page) 76 + return nr_pages 77 + 78 + 79 + def count_free(page): 80 + return page.objects - page.inuse 81 + 82 + 83 + def slub_get_slabinfo(s, cfg): 84 + nr_slabs = 0 85 + nr_objs = 0 86 + nr_free = 0 87 + 88 + for node in range(cfg['nr_nodes']): 89 + n = s.node[node] 90 + nr_slabs += n.nr_slabs.counter.value_() 91 + nr_objs += n.total_objects.counter.value_() 92 + nr_free += count_partial(n, count_free) 93 + 94 + return {'active_objs': nr_objs - nr_free, 95 + 'num_objs': nr_objs, 96 + 'active_slabs': nr_slabs, 97 + 'num_slabs': nr_slabs, 98 + 'objects_per_slab': oo_objects(s), 99 + 'cache_order': oo_order(s), 100 + 'limit': 0, 101 + 'batchcount': 0, 102 + 'shared': 0, 103 + 'shared_avail': 0} 104 + 105 + 106 + def cache_show(s, cfg, objs): 107 + if cfg['allocator'] == 'SLUB': 108 + sinfo = slub_get_slabinfo(s, cfg) 109 + else: 110 + err('SLAB isn\'t supported yet') 111 + 112 + if cfg['shared_slab_pages']: 113 + sinfo['active_objs'] = objs 114 + sinfo['num_objs'] = objs 115 + 116 + print('%-17s %6lu %6lu %6u %4u %4d' 117 + ' : tunables %4u %4u %4u' 118 + ' : slabdata %6lu %6lu %6lu' % ( 119 + cache_name(s), sinfo['active_objs'], sinfo['num_objs'], 120 + s.size, sinfo['objects_per_slab'], 1 << sinfo['cache_order'], 121 + sinfo['limit'], sinfo['batchcount'], sinfo['shared'], 122 + sinfo['active_slabs'], sinfo['num_slabs'], 123 + sinfo['shared_avail'])) 124 + 125 + 126 + def detect_kernel_config(): 127 + cfg = {} 128 + 129 + cfg['nr_nodes'] = prog['nr_online_nodes'].value_() 130 + 131 + if prog.type('struct kmem_cache').members[1][1] == 'flags': 132 + cfg['allocator'] = 'SLUB' 133 + elif prog.type('struct kmem_cache').members[1][1] == 'batchcount': 134 + cfg['allocator'] = 'SLAB' 135 + else: 136 + err('Can\'t determine the slab allocator') 137 + 138 + cfg['shared_slab_pages'] = False 139 + try: 140 + if prog.type('struct obj_cgroup'): 141 + cfg['shared_slab_pages'] = True 142 + except: 143 + pass 144 + 145 + return cfg 146 + 147 + 148 + def for_each_slab_page(prog): 149 + PGSlab = 1 << prog.constant('PG_slab') 150 + PGHead = 1 << prog.constant('PG_head') 151 + 152 + for page in for_each_page(prog): 153 + try: 154 + if page.flags.value_() & PGSlab: 155 + yield page 156 + except FaultError: 157 + pass 158 + 159 + 160 + def main(): 161 + parser = argparse.ArgumentParser(description=DESC, 162 + formatter_class= 163 + argparse.RawTextHelpFormatter) 164 + parser.add_argument('cgroup', metavar='CGROUP', 165 + help='Target memory cgroup') 166 + args = parser.parse_args() 167 + 168 + try: 169 + cgroup_id = stat(args.cgroup).st_ino 170 + find_memcg_ids() 171 + memcg = MEMCGS[cgroup_id] 172 + except KeyError: 173 + err('Can\'t find the memory cgroup') 174 + 175 + cfg = detect_kernel_config() 176 + 177 + print('# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>' 178 + ' : tunables <limit> <batchcount> <sharedfactor>' 179 + ' : slabdata <active_slabs> <num_slabs> <sharedavail>') 180 + 181 + if cfg['shared_slab_pages']: 182 + obj_cgroups = set() 183 + stats = {} 184 + caches = {} 185 + 186 + # find memcg pointers belonging to the specified cgroup 187 + obj_cgroups.add(memcg.objcg.value_()) 188 + for ptr in list_for_each_entry('struct obj_cgroup', 189 + memcg.objcg_list.address_of_(), 190 + 'list'): 191 + obj_cgroups.add(ptr.value_()) 192 + 193 + # look over all slab pages, belonging to non-root memcgs 194 + # and look for objects belonging to the given memory cgroup 195 + for page in for_each_slab_page(prog): 196 + objcg_vec_raw = page.obj_cgroups.value_() 197 + if objcg_vec_raw == 0: 198 + continue 199 + cache = page.slab_cache 200 + if not cache: 201 + continue 202 + addr = cache.value_() 203 + caches[addr] = cache 204 + # clear the lowest bit to get the true obj_cgroups 205 + objcg_vec = Object(prog, page.obj_cgroups.type_, 206 + value=objcg_vec_raw & ~1) 207 + 208 + if addr not in stats: 209 + stats[addr] = 0 210 + 211 + for i in range(oo_objects(cache)): 212 + if objcg_vec[i].value_() in obj_cgroups: 213 + stats[addr] += 1 214 + 215 + for addr in caches: 216 + if stats[addr] > 0: 217 + cache_show(caches[addr], cfg, stats[addr]) 218 + 219 + else: 220 + for s in list_for_each_entry('struct kmem_cache', 221 + memcg.kmem_caches.address_of_(), 222 + 'memcg_params.kmem_caches_node'): 223 + cache_show(s, cfg, None) 224 + 225 + 226 + main()
+1 -1
tools/include/linux/jhash.h
··· 5 5 * 6 6 * Copyright (C) 2006. Bob Jenkins (bob_jenkins@burtleburtle.net) 7 7 * 8 - * http://burtleburtle.net/bob/hash/ 8 + * https://burtleburtle.net/bob/hash/ 9 9 * 10 10 * These are the credits from Bob's sources: 11 11 *
+1 -1
tools/lib/rbtree.c
··· 13 13 #include <linux/export.h> 14 14 15 15 /* 16 - * red-black trees properties: http://en.wikipedia.org/wiki/Rbtree 16 + * red-black trees properties: https://en.wikipedia.org/wiki/Rbtree 17 17 * 18 18 * 1) A node is either red or black 19 19 * 2) The root is black
+1 -1
tools/lib/traceevent/event-parse.h
··· 379 379 * errno since SUS requires the errno has distinct positive values. 380 380 * See 'Issue 6' in the link below. 381 381 * 382 - * http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/errno.h.html 382 + * https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/errno.h.html 383 383 */ 384 384 __TEP_ERRNO__START = -100000, 385 385
+1 -1
tools/testing/ktest/examples/README
··· 11 11 lots of different architectures. It only does build tests, but makes 12 12 it easy to compile test different archs. You can download the arch 13 13 cross compilers from: 14 - http://kernel.org/pub/tools/crosstool/files/bin/x86_64/ 14 + https://kernel.org/pub/tools/crosstool/files/bin/x86_64/ 15 15 16 16 test.conf - A generic example of a config. This is based on an actual config 17 17 used to perform real testing.
+1 -1
tools/testing/ktest/examples/crosstests.conf
··· 3 3 # 4 4 # In this config, it is expected that the tool chains from: 5 5 # 6 - # http://kernel.org/pub/tools/crosstool/files/bin/x86_64/ 6 + # https://kernel.org/pub/tools/crosstool/files/bin/x86_64/ 7 7 # 8 8 # running on a x86_64 system have been downloaded and installed into: 9 9 #
+1
tools/testing/selftests/Makefile
··· 32 32 TARGETS += membarrier 33 33 TARGETS += memfd 34 34 TARGETS += memory-hotplug 35 + TARGETS += mincore 35 36 TARGETS += mount 36 37 TARGETS += mqueue 37 38 TARGETS += net
+1
tools/testing/selftests/cgroup/.gitignore
··· 2 2 test_memcontrol 3 3 test_core 4 4 test_freezer 5 + test_kmem
+2
tools/testing/selftests/cgroup/Makefile
··· 6 6 TEST_FILES := with_stress.sh 7 7 TEST_PROGS := test_stress.sh 8 8 TEST_GEN_PROGS = test_memcontrol 9 + TEST_GEN_PROGS += test_kmem 9 10 TEST_GEN_PROGS += test_core 10 11 TEST_GEN_PROGS += test_freezer 11 12 12 13 include ../lib.mk 13 14 14 15 $(OUTPUT)/test_memcontrol: cgroup_util.c ../clone3/clone3_selftests.h 16 + $(OUTPUT)/test_kmem: cgroup_util.c ../clone3/clone3_selftests.h 15 17 $(OUTPUT)/test_core: cgroup_util.c ../clone3/clone3_selftests.h 16 18 $(OUTPUT)/test_freezer: cgroup_util.c ../clone3/clone3_selftests.h
+1 -1
tools/testing/selftests/cgroup/cgroup_util.c
··· 106 106 107 107 /* Handle the case of comparing against empty string */ 108 108 if (!expected) 109 - size = 32; 109 + return -1; 110 110 else 111 111 size = strlen(expected) + 1; 112 112
+382
tools/testing/selftests/cgroup/test_kmem.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #define _GNU_SOURCE 3 + 4 + #include <linux/limits.h> 5 + #include <fcntl.h> 6 + #include <stdio.h> 7 + #include <stdlib.h> 8 + #include <string.h> 9 + #include <sys/stat.h> 10 + #include <sys/types.h> 11 + #include <unistd.h> 12 + #include <sys/wait.h> 13 + #include <errno.h> 14 + #include <sys/sysinfo.h> 15 + #include <pthread.h> 16 + 17 + #include "../kselftest.h" 18 + #include "cgroup_util.h" 19 + 20 + 21 + static int alloc_dcache(const char *cgroup, void *arg) 22 + { 23 + unsigned long i; 24 + struct stat st; 25 + char buf[128]; 26 + 27 + for (i = 0; i < (unsigned long)arg; i++) { 28 + snprintf(buf, sizeof(buf), 29 + "/something-non-existent-with-a-long-name-%64lu-%d", 30 + i, getpid()); 31 + stat(buf, &st); 32 + } 33 + 34 + return 0; 35 + } 36 + 37 + /* 38 + * This test allocates 100000 of negative dentries with long names. 39 + * Then it checks that "slab" in memory.stat is larger than 1M. 40 + * Then it sets memory.high to 1M and checks that at least 1/2 41 + * of slab memory has been reclaimed. 42 + */ 43 + static int test_kmem_basic(const char *root) 44 + { 45 + int ret = KSFT_FAIL; 46 + char *cg = NULL; 47 + long slab0, slab1, current; 48 + 49 + cg = cg_name(root, "kmem_basic_test"); 50 + if (!cg) 51 + goto cleanup; 52 + 53 + if (cg_create(cg)) 54 + goto cleanup; 55 + 56 + if (cg_run(cg, alloc_dcache, (void *)100000)) 57 + goto cleanup; 58 + 59 + slab0 = cg_read_key_long(cg, "memory.stat", "slab "); 60 + if (slab0 < (1 << 20)) 61 + goto cleanup; 62 + 63 + cg_write(cg, "memory.high", "1M"); 64 + slab1 = cg_read_key_long(cg, "memory.stat", "slab "); 65 + if (slab1 <= 0) 66 + goto cleanup; 67 + 68 + current = cg_read_long(cg, "memory.current"); 69 + if (current <= 0) 70 + goto cleanup; 71 + 72 + if (slab1 < slab0 / 2 && current < slab0 / 2) 73 + ret = KSFT_PASS; 74 + cleanup: 75 + cg_destroy(cg); 76 + free(cg); 77 + 78 + return ret; 79 + } 80 + 81 + static void *alloc_kmem_fn(void *arg) 82 + { 83 + alloc_dcache(NULL, (void *)100); 84 + return NULL; 85 + } 86 + 87 + static int alloc_kmem_smp(const char *cgroup, void *arg) 88 + { 89 + int nr_threads = 2 * get_nprocs(); 90 + pthread_t *tinfo; 91 + unsigned long i; 92 + int ret = -1; 93 + 94 + tinfo = calloc(nr_threads, sizeof(pthread_t)); 95 + if (tinfo == NULL) 96 + return -1; 97 + 98 + for (i = 0; i < nr_threads; i++) { 99 + if (pthread_create(&tinfo[i], NULL, &alloc_kmem_fn, 100 + (void *)i)) { 101 + free(tinfo); 102 + return -1; 103 + } 104 + } 105 + 106 + for (i = 0; i < nr_threads; i++) { 107 + ret = pthread_join(tinfo[i], NULL); 108 + if (ret) 109 + break; 110 + } 111 + 112 + free(tinfo); 113 + return ret; 114 + } 115 + 116 + static int cg_run_in_subcgroups(const char *parent, 117 + int (*fn)(const char *cgroup, void *arg), 118 + void *arg, int times) 119 + { 120 + char *child; 121 + int i; 122 + 123 + for (i = 0; i < times; i++) { 124 + child = cg_name_indexed(parent, "child", i); 125 + if (!child) 126 + return -1; 127 + 128 + if (cg_create(child)) { 129 + cg_destroy(child); 130 + free(child); 131 + return -1; 132 + } 133 + 134 + if (cg_run(child, fn, NULL)) { 135 + cg_destroy(child); 136 + free(child); 137 + return -1; 138 + } 139 + 140 + cg_destroy(child); 141 + free(child); 142 + } 143 + 144 + return 0; 145 + } 146 + 147 + /* 148 + * The test creates and destroys a large number of cgroups. In each cgroup it 149 + * allocates some slab memory (mostly negative dentries) using 2 * NR_CPUS 150 + * threads. Then it checks the sanity of numbers on the parent level: 151 + * the total size of the cgroups should be roughly equal to 152 + * anon + file + slab + kernel_stack. 153 + */ 154 + static int test_kmem_memcg_deletion(const char *root) 155 + { 156 + long current, slab, anon, file, kernel_stack, sum; 157 + int ret = KSFT_FAIL; 158 + char *parent; 159 + 160 + parent = cg_name(root, "kmem_memcg_deletion_test"); 161 + if (!parent) 162 + goto cleanup; 163 + 164 + if (cg_create(parent)) 165 + goto cleanup; 166 + 167 + if (cg_write(parent, "cgroup.subtree_control", "+memory")) 168 + goto cleanup; 169 + 170 + if (cg_run_in_subcgroups(parent, alloc_kmem_smp, NULL, 100)) 171 + goto cleanup; 172 + 173 + current = cg_read_long(parent, "memory.current"); 174 + slab = cg_read_key_long(parent, "memory.stat", "slab "); 175 + anon = cg_read_key_long(parent, "memory.stat", "anon "); 176 + file = cg_read_key_long(parent, "memory.stat", "file "); 177 + kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack "); 178 + if (current < 0 || slab < 0 || anon < 0 || file < 0 || 179 + kernel_stack < 0) 180 + goto cleanup; 181 + 182 + sum = slab + anon + file + kernel_stack; 183 + if (abs(sum - current) < 4096 * 32 * 2 * get_nprocs()) { 184 + ret = KSFT_PASS; 185 + } else { 186 + printf("memory.current = %ld\n", current); 187 + printf("slab + anon + file + kernel_stack = %ld\n", sum); 188 + printf("slab = %ld\n", slab); 189 + printf("anon = %ld\n", anon); 190 + printf("file = %ld\n", file); 191 + printf("kernel_stack = %ld\n", kernel_stack); 192 + } 193 + 194 + cleanup: 195 +
cg_destroy(parent); 196 + free(parent); 197 + 198 + return ret; 199 + } 200 + 201 + /* 202 + * The test reads the entire /proc/kpagecgroup. If the read completes 203 + * successfully (and the kernel didn't panic), the test is treated as passed. 204 + */ 205 + static int test_kmem_proc_kpagecgroup(const char *root) 206 + { 207 + unsigned long buf[128]; 208 + int ret = KSFT_FAIL; 209 + ssize_t len; 210 + int fd; 211 + 212 + fd = open("/proc/kpagecgroup", O_RDONLY); 213 + if (fd < 0) 214 + return ret; 215 + 216 + do { 217 + len = read(fd, buf, sizeof(buf)); 218 + } while (len > 0); 219 + 220 + if (len == 0) 221 + ret = KSFT_PASS; 222 + 223 + close(fd); 224 + return ret; 225 + } 226 + 227 + static void *pthread_wait_fn(void *arg) 228 + { 229 + sleep(100); 230 + return NULL; 231 + } 232 + 233 + static int spawn_1000_threads(const char *cgroup, void *arg) 234 + { 235 + int nr_threads = 1000; 236 + pthread_t *tinfo; 237 + unsigned long i; 238 + long stack; 239 + int ret = -1; 240 + 241 + tinfo = calloc(nr_threads, sizeof(pthread_t)); 242 + if (tinfo == NULL) 243 + return -1; 244 + 245 + for (i = 0; i < nr_threads; i++) { 246 + if (pthread_create(&tinfo[i], NULL, &pthread_wait_fn, 247 + (void *)i)) { 248 + free(tinfo); 249 + return -1; 250 + } 251 + } 252 + 253 + stack = cg_read_key_long(cgroup, "memory.stat", "kernel_stack "); 254 + if (stack >= 4096 * 1000) 255 + ret = 0; 256 + 257 + free(tinfo); 258 + return ret; 259 + } 260 + 261 + /* 262 + * The test spawns a process, which spawns 1000 threads. Then it checks 263 + * that memory.stat's kernel_stack is at least 1000 pages large. 
264 + */ 265 + static int test_kmem_kernel_stacks(const char *root) 266 + { 267 + int ret = KSFT_FAIL; 268 + char *cg = NULL; 269 + 270 + cg = cg_name(root, "kmem_kernel_stacks_test"); 271 + if (!cg) 272 + goto cleanup; 273 + 274 + if (cg_create(cg)) 275 + goto cleanup; 276 + 277 + if (cg_run(cg, spawn_1000_threads, NULL)) 278 + goto cleanup; 279 + 280 + ret = KSFT_PASS; 281 + cleanup: 282 + cg_destroy(cg); 283 + free(cg); 284 + 285 + return ret; 286 + } 287 + 288 + /* 289 + * This test sequentially creates 30 child cgroups, allocates some 290 + * kernel memory in each of them, and deletes them. Then it checks 291 + * that the number of dying cgroups on the parent level is 0. 292 + */ 293 + static int test_kmem_dead_cgroups(const char *root) 294 + { 295 + int ret = KSFT_FAIL; 296 + char *parent; 297 + long dead; 298 + int i; 299 + 300 + parent = cg_name(root, "kmem_dead_cgroups_test"); 301 + if (!parent) 302 + goto cleanup; 303 + 304 + if (cg_create(parent)) 305 + goto cleanup; 306 + 307 + if (cg_write(parent, "cgroup.subtree_control", "+memory")) 308 + goto cleanup; 309 + 310 + if (cg_run_in_subcgroups(parent, alloc_dcache, (void *)100, 30)) 311 + goto cleanup; 312 + 313 + for (i = 0; i < 5; i++) { 314 + dead = cg_read_key_long(parent, "cgroup.stat", 315 + "nr_dying_descendants "); 316 + if (dead == 0) { 317 + ret = KSFT_PASS; 318 + break; 319 + } 320 + /* 321 + * Reclaiming cgroups might take some time, 322 + * let's wait a bit and repeat. 
323 + */ 324 + sleep(1); 325 + } 326 + 327 + cleanup: 328 + cg_destroy(parent); 329 + free(parent); 330 + 331 + return ret; 332 + } 333 + 334 + #define T(x) { x, #x } 335 + struct kmem_test { 336 + int (*fn)(const char *root); 337 + const char *name; 338 + } tests[] = { 339 + T(test_kmem_basic), 340 + T(test_kmem_memcg_deletion), 341 + T(test_kmem_proc_kpagecgroup), 342 + T(test_kmem_kernel_stacks), 343 + T(test_kmem_dead_cgroups), 344 + }; 345 + #undef T 346 + 347 + int main(int argc, char **argv) 348 + { 349 + char root[PATH_MAX]; 350 + int i, ret = EXIT_SUCCESS; 351 + 352 + if (cg_find_unified_root(root, sizeof(root))) 353 + ksft_exit_skip("cgroup v2 isn't mounted\n"); 354 + 355 + /* 356 + * Check that memory controller is available: 357 + * memory is listed in cgroup.controllers 358 + */ 359 + if (cg_read_strstr(root, "cgroup.controllers", "memory")) 360 + ksft_exit_skip("memory controller isn't available\n"); 361 + 362 + if (cg_read_strstr(root, "cgroup.subtree_control", "memory")) 363 + if (cg_write(root, "cgroup.subtree_control", "+memory")) 364 + ksft_exit_skip("Failed to set memory controller\n"); 365 + 366 + for (i = 0; i < ARRAY_SIZE(tests); i++) { 367 + switch (tests[i].fn(root)) { 368 + case KSFT_PASS: 369 + ksft_test_result_pass("%s\n", tests[i].name); 370 + break; 371 + case KSFT_SKIP: 372 + ksft_test_result_skip("%s\n", tests[i].name); 373 + break; 374 + default: 375 + ret = EXIT_FAILURE; 376 + ksft_test_result_fail("%s\n", tests[i].name); 377 + break; 378 + } 379 + } 380 + 381 + return ret; 382 + }
+2
tools/testing/selftests/mincore/.gitignore
··· 1 + # SPDX-License-Identifier: GPL-2.0+ 2 + mincore_selftest
+6
tools/testing/selftests/mincore/Makefile
··· 1 + # SPDX-License-Identifier: GPL-2.0+ 2 + 3 + CFLAGS += -Wall 4 + 5 + TEST_GEN_PROGS := mincore_selftest 6 + include ../lib.mk
+361
tools/testing/selftests/mincore/mincore_selftest.c
··· 1 + // SPDX-License-Identifier: GPL-2.0+ 2 + /* 3 + * kselftest suite for mincore(). 4 + * 5 + * Copyright (C) 2020 Collabora, Ltd. 6 + */ 7 + 8 + #define _GNU_SOURCE 9 + 10 + #include <stdio.h> 11 + #include <errno.h> 12 + #include <unistd.h> 13 + #include <stdlib.h> 14 + #include <sys/mman.h> 15 + #include <string.h> 16 + #include <fcntl.h> 17 + #include <string.h> 18 + 19 + #include "../kselftest.h" 20 + #include "../kselftest_harness.h" 21 + 22 + /* Default test file size: 4MB */ 23 + #define MB (1UL << 20) 24 + #define FILE_SIZE (4 * MB) 25 + 26 + 27 + /* 28 + * Tests the user interface. This test triggers most of the documented 29 + * error conditions in mincore(). 30 + */ 31 + TEST(basic_interface) 32 + { 33 + int retval; 34 + int page_size; 35 + unsigned char vec[1]; 36 + char *addr; 37 + 38 + page_size = sysconf(_SC_PAGESIZE); 39 + 40 + /* Query a 0 byte sized range */ 41 + retval = mincore(0, 0, vec); 42 + EXPECT_EQ(0, retval); 43 + 44 + /* Addresses in the specified range are invalid or unmapped */ 45 + errno = 0; 46 + retval = mincore(NULL, page_size, vec); 47 + EXPECT_EQ(-1, retval); 48 + EXPECT_EQ(ENOMEM, errno); 49 + 50 + errno = 0; 51 + addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, 52 + MAP_SHARED | MAP_ANONYMOUS, -1, 0); 53 + ASSERT_NE(MAP_FAILED, addr) { 54 + TH_LOG("mmap error: %s", strerror(errno)); 55 + } 56 + 57 + /* <addr> argument is not page-aligned */ 58 + errno = 0; 59 + retval = mincore(addr + 1, page_size, vec); 60 + EXPECT_EQ(-1, retval); 61 + EXPECT_EQ(EINVAL, errno); 62 + 63 + /* <length> argument is too large */ 64 + errno = 0; 65 + retval = mincore(addr, -1, vec); 66 + EXPECT_EQ(-1, retval); 67 + EXPECT_EQ(ENOMEM, errno); 68 + 69 + /* <vec> argument points to an illegal address */ 70 + errno = 0; 71 + retval = mincore(addr, page_size, NULL); 72 + EXPECT_EQ(-1, retval); 73 + EXPECT_EQ(EFAULT, errno); 74 + munmap(addr, page_size); 75 + } 76 + 77 + 78 + /* 79 + * Test mincore() behavior on a private anonymous page mapping. 
80 + * Check that the page is not loaded into memory right after the mapping 81 + * but after accessing it (on-demand allocation). 82 + * Then free the page and check that it's not memory-resident. 83 + */ 84 + TEST(check_anonymous_locked_pages) 85 + { 86 + unsigned char vec[1]; 87 + char *addr; 88 + int retval; 89 + int page_size; 90 + 91 + page_size = sysconf(_SC_PAGESIZE); 92 + 93 + /* Map one page and check it's not memory-resident */ 94 + errno = 0; 95 + addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, 96 + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); 97 + ASSERT_NE(MAP_FAILED, addr) { 98 + TH_LOG("mmap error: %s", strerror(errno)); 99 + } 100 + retval = mincore(addr, page_size, vec); 101 + ASSERT_EQ(0, retval); 102 + ASSERT_EQ(0, vec[0]) { 103 + TH_LOG("Page found in memory before use"); 104 + } 105 + 106 + /* Touch the page and check again. It should now be in memory */ 107 + addr[0] = 1; 108 + mlock(addr, page_size); 109 + retval = mincore(addr, page_size, vec); 110 + ASSERT_EQ(0, retval); 111 + ASSERT_EQ(1, vec[0]) { 112 + TH_LOG("Page not found in memory after use"); 113 + } 114 + 115 + /* 116 + * It shouldn't be memory-resident after unlocking it and 117 + * marking it as unneeded. 118 + */ 119 + munlock(addr, page_size); 120 + madvise(addr, page_size, MADV_DONTNEED); 121 + retval = mincore(addr, page_size, vec); 122 + ASSERT_EQ(0, retval); 123 + ASSERT_EQ(0, vec[0]) { 124 + TH_LOG("Page in memory after being zapped"); 125 + } 126 + munmap(addr, page_size); 127 + } 128 + 129 + 130 + /* 131 + * Check mincore() behavior on huge pages. 132 + * This test will be skipped if the mapping fails (i.e. if there are no 133 + * huge pages available). 134 + * 135 + * Make sure the system has at least one free huge page, check 136 + * "HugePages_Free" in /proc/meminfo. 137 + * Increment /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages if 138 + * needed. 
139 + */ 140 + TEST(check_huge_pages) 141 + { 142 + unsigned char vec[1]; 143 + char *addr; 144 + int retval; 145 + int page_size; 146 + 147 + page_size = sysconf(_SC_PAGESIZE); 148 + 149 + errno = 0; 150 + addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, 151 + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 152 + -1, 0); 153 + if (addr == MAP_FAILED) { 154 + if (errno == ENOMEM) 155 + SKIP(return, "No huge pages available."); 156 + else 157 + TH_LOG("mmap error: %s", strerror(errno)); 158 + } 159 + retval = mincore(addr, page_size, vec); 160 + ASSERT_EQ(0, retval); 161 + ASSERT_EQ(0, vec[0]) { 162 + TH_LOG("Page found in memory before use"); 163 + } 164 + 165 + addr[0] = 1; 166 + mlock(addr, page_size); 167 + retval = mincore(addr, page_size, vec); 168 + ASSERT_EQ(0, retval); 169 + ASSERT_EQ(1, vec[0]) { 170 + TH_LOG("Page not found in memory after use"); 171 + } 172 + 173 + munlock(addr, page_size); 174 + munmap(addr, page_size); 175 + } 176 + 177 + 178 + /* 179 + * Test mincore() behavior on a file-backed page. 180 + * No pages should be loaded into memory right after the mapping. Then, 181 + * accessing any address in the mapping range should load the page 182 + * containing the address and a number of subsequent pages (readahead). 183 + * 184 + * The actual readahead settings depend on the test environment, so we 185 + * can't make a lot of assumptions about that. This test covers the most 186 + * general cases. 
187 + */ 188 + TEST(check_file_mmap) 189 + { 190 + unsigned char *vec; 191 + int vec_size; 192 + char *addr; 193 + int retval; 194 + int page_size; 195 + int fd; 196 + int i; 197 + int ra_pages = 0; 198 + 199 + page_size = sysconf(_SC_PAGESIZE); 200 + vec_size = FILE_SIZE / page_size; 201 + if (FILE_SIZE % page_size) 202 + vec_size++; 203 + 204 + vec = calloc(vec_size, sizeof(unsigned char)); 205 + ASSERT_NE(NULL, vec) { 206 + TH_LOG("Can't allocate array"); 207 + } 208 + 209 + errno = 0; 210 + fd = open(".", O_TMPFILE | O_RDWR, 0600); 211 + ASSERT_NE(-1, fd) { 212 + TH_LOG("Can't create temporary file: %s", 213 + strerror(errno)); 214 + } 215 + errno = 0; 216 + retval = fallocate(fd, 0, 0, FILE_SIZE); 217 + ASSERT_EQ(0, retval) { 218 + TH_LOG("Error allocating space for the temporary file: %s", 219 + strerror(errno)); 220 + } 221 + 222 + /* 223 + * Map the whole file, the pages shouldn't be fetched yet. 224 + */ 225 + errno = 0; 226 + addr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, 227 + MAP_SHARED, fd, 0); 228 + ASSERT_NE(MAP_FAILED, addr) { 229 + TH_LOG("mmap error: %s", strerror(errno)); 230 + } 231 + retval = mincore(addr, FILE_SIZE, vec); 232 + ASSERT_EQ(0, retval); 233 + for (i = 0; i < vec_size; i++) { 234 + ASSERT_EQ(0, vec[i]) { 235 + TH_LOG("Unexpected page in memory"); 236 + } 237 + } 238 + 239 + /* 240 + * Touch a page in the middle of the mapping. We expect the next 241 + * few pages (the readahead window) to be populated too. 
242 + */ 243 + addr[FILE_SIZE / 2] = 1; 244 + retval = mincore(addr, FILE_SIZE, vec); 245 + ASSERT_EQ(0, retval); 246 + ASSERT_EQ(1, vec[FILE_SIZE / 2 / page_size]) { 247 + TH_LOG("Page not found in memory after use"); 248 + } 249 + 250 + i = FILE_SIZE / 2 / page_size + 1; 251 + while (i < vec_size && vec[i]) { 252 + ra_pages++; 253 + i++; 254 + } 255 + EXPECT_GT(ra_pages, 0) { 256 + TH_LOG("No read-ahead pages found in memory"); 257 + } 258 + 259 + EXPECT_LT(i, vec_size) { 260 + TH_LOG("Read-ahead pages reached the end of the file"); 261 + } 262 + /* 263 + * End of the readahead window. The rest of the pages shouldn't 264 + * be in memory. 265 + */ 266 + if (i < vec_size) { 267 + while (i < vec_size && !vec[i]) 268 + i++; 269 + EXPECT_EQ(vec_size, i) { 270 + TH_LOG("Unexpected page in memory beyond readahead window"); 271 + } 272 + } 273 + 274 + munmap(addr, FILE_SIZE); 275 + close(fd); 276 + free(vec); 277 + } 278 + 279 + 280 + /* 281 + * Test mincore() behavior on a page backed by a tmpfs file. This test 282 + * performs the same steps as the previous one. However, we don't expect 283 + * any readahead in this case. 
284 + */ 285 + TEST(check_tmpfs_mmap) 286 + { 287 + unsigned char *vec; 288 + int vec_size; 289 + char *addr; 290 + int retval; 291 + int page_size; 292 + int fd; 293 + int i; 294 + int ra_pages = 0; 295 + 296 + page_size = sysconf(_SC_PAGESIZE); 297 + vec_size = FILE_SIZE / page_size; 298 + if (FILE_SIZE % page_size) 299 + vec_size++; 300 + 301 + vec = calloc(vec_size, sizeof(unsigned char)); 302 + ASSERT_NE(NULL, vec) { 303 + TH_LOG("Can't allocate array"); 304 + } 305 + 306 + errno = 0; 307 + fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600); 308 + ASSERT_NE(-1, fd) { 309 + TH_LOG("Can't create temporary file: %s", 310 + strerror(errno)); 311 + } 312 + errno = 0; 313 + retval = fallocate(fd, 0, 0, FILE_SIZE); 314 + ASSERT_EQ(0, retval) { 315 + TH_LOG("Error allocating space for the temporary file: %s", 316 + strerror(errno)); 317 + } 318 + 319 + /* 320 + * Map the whole file, the pages shouldn't be fetched yet. 321 + */ 322 + errno = 0; 323 + addr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, 324 + MAP_SHARED, fd, 0); 325 + ASSERT_NE(MAP_FAILED, addr) { 326 + TH_LOG("mmap error: %s", strerror(errno)); 327 + } 328 + retval = mincore(addr, FILE_SIZE, vec); 329 + ASSERT_EQ(0, retval); 330 + for (i = 0; i < vec_size; i++) { 331 + ASSERT_EQ(0, vec[i]) { 332 + TH_LOG("Unexpected page in memory"); 333 + } 334 + } 335 + 336 + /* 337 + * Touch a page in the middle of the mapping. We expect only 338 + * that page to be fetched into memory. 
339 + */ 340 + addr[FILE_SIZE / 2] = 1; 341 + retval = mincore(addr, FILE_SIZE, vec); 342 + ASSERT_EQ(0, retval); 343 + ASSERT_EQ(1, vec[FILE_SIZE / 2 / page_size]) { 344 + TH_LOG("Page not found in memory after use"); 345 + } 346 + 347 + i = FILE_SIZE / 2 / page_size + 1; 348 + while (i < vec_size && vec[i]) { 349 + ra_pages++; 350 + i++; 351 + } 352 + ASSERT_EQ(ra_pages, 0) { 353 + TH_LOG("Read-ahead pages found in memory"); 354 + } 355 + 356 + munmap(addr, FILE_SIZE); 357 + close(fd); 358 + free(vec); 359 + } 360 + 361 + TEST_HARNESS_MAIN