Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:
"A few misc subsystems and some of MM.

175 patches.

Subsystems affected by this patch series: ia64, kbuild, scripts, sh,
ocfs2, kfifo, vfs, kernel/watchdog, and mm (slab-generic, slub,
kmemleak, debug, pagecache, msync, gup, memremap, memcg, pagemap,
mremap, dma, sparsemem, vmalloc, documentation, kasan, initialization,
pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (175 commits)
mm/memory-failure: unnecessary amount of unmapping
mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
net: page_pool: use alloc_pages_bulk in refill code path
net: page_pool: refactor dma_map into own function page_pool_dma_map
SUNRPC: refresh rq_pages using a bulk page allocator
SUNRPC: set rq_page_end differently
mm/page_alloc: inline __rmqueue_pcplist
mm/page_alloc: optimize code layout for __alloc_pages_bulk
mm/page_alloc: add an array-based interface to the bulk page allocator
mm/page_alloc: add a bulk page allocator
mm/page_alloc: rename alloced to allocated
mm/page_alloc: duplicate include linux/vmalloc.h
mm, page_alloc: avoid page_to_pfn() in move_freepages()
mm/Kconfig: remove default DISCONTIGMEM_MANUAL
mm: page_alloc: dump migrate-failed pages
mm/mempolicy: fix mpol_misplaced kernel-doc
mm/mempolicy: rewrite alloc_pages_vma documentation
mm/mempolicy: rewrite alloc_pages documentation
mm/mempolicy: rename alloc_pages_current to alloc_pages
...

+3182 -2650
+7
Documentation/admin-guide/kernel-parameters.txt
···
 	slram=		[HW,MTD]
 
+	slab_merge	[MM]
+			Enable merging of slabs with similar size when the
+			kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
+
 	slab_nomerge	[MM]
 			Disable merging of slabs with similar size. May be
 			necessary if there is some reason to distinguish
···
 			Determines the minimum page order for slabs. Must be
 			lower than slub_max_order.
 			For more information see Documentation/vm/slub.rst.
+
+	slub_merge	[MM, SLUB]
+			Same with slab_merge.
 
 	slub_nomerge	[MM, SLUB]
 			Same with slab_nomerge. This is supported for legacy.
+1 -1
Documentation/admin-guide/mm/transhuge.rst
···
 but failed.
 
 It is possible to establish how long the stalls were using the function
-tracer to record how long was spent in __alloc_pages_nodemask and
+tracer to record how long was spent in __alloc_pages() and
 using the mm_page_alloc tracepoint to identify which allocations were
 for huge pages.
+2 -2
Documentation/core-api/cachetlb.rst
···
 	there will be no entries in the cache for the kernel address
 	space for virtual addresses in the range 'start' to 'end-1'.
 
-	The first of these two routines is invoked after map_kernel_range()
+	The first of these two routines is invoked after vmap_range()
 	has installed the page table entries.  The second is invoked
-	before unmap_kernel_range() deletes the page table entries.
+	before vunmap_range() deletes the page table entries.
 
 	There exists another whole class of cpu cache issues which currently
 	require a whole different set of interfaces to handle properly.
+6
Documentation/core-api/mm-api.rst
···
    :export:
 
 .. kernel-doc:: mm/page_alloc.c
+.. kernel-doc:: mm/mempolicy.c
+.. kernel-doc:: include/linux/mm_types.h
+   :internal:
+.. kernel-doc:: include/linux/mm.h
+   :internal:
+.. kernel-doc:: include/linux/mmzone.h
+193 -152
Documentation/dev-tools/kasan.rst
···
 2. software tag-based KASAN (similar to userspace HWASan),
 3. hardware tag-based KASAN (based on hardware memory tagging).
 
-Software KASAN modes (1 and 2) use compile-time instrumentation to insert
-validity checks before every memory access, and therefore require a compiler
+Generic KASAN is mainly used for debugging due to a large memory overhead.
+Software tag-based KASAN can be used for dogfood testing as it has a lower
+memory overhead that allows using it with real workloads. Hardware tag-based
+KASAN comes with low memory and performance overheads and, therefore, can be
+used in production. Either as an in-field memory bug detector or as a security
+mitigation.
+
+Software KASAN modes (#1 and #2) use compile-time instrumentation to insert
+validity checks before every memory access and, therefore, require a compiler
 version that supports that.
 
-Generic KASAN is supported in both GCC and Clang. With GCC it requires version
+Generic KASAN is supported in GCC and Clang. With GCC, it requires version
 8.3.0 or later. Any supported Clang version is compatible, but detection of
 out-of-bounds accesses for global variables is only supported since Clang 11.
 
-Tag-based KASAN is only supported in Clang.
+Software tag-based KASAN mode is only supported in Clang.
 
-Currently generic KASAN is supported for the x86_64, arm, arm64, xtensa, s390
+The hardware KASAN mode (#3) relies on hardware to perform the checks but
+still requires a compiler version that supports memory tagging instructions.
+This mode is supported in GCC 10+ and Clang 11+.
+
+Both software KASAN modes work with SLUB and SLAB memory allocators,
+while the hardware tag-based KASAN currently only supports SLUB.
+
+Currently, generic KASAN is supported for the x86_64, arm, arm64, xtensa, s390,
 and riscv architectures, and tag-based KASAN modes are supported only for arm64.
 
 Usage
 -----
 
-To enable KASAN configure kernel with::
+To enable KASAN, configure the kernel with::
 
-	CONFIG_KASAN = y
+	CONFIG_KASAN=y
 
-and choose between CONFIG_KASAN_GENERIC (to enable generic KASAN),
-CONFIG_KASAN_SW_TAGS (to enable software tag-based KASAN), and
-CONFIG_KASAN_HW_TAGS (to enable hardware tag-based KASAN).
+and choose between ``CONFIG_KASAN_GENERIC`` (to enable generic KASAN),
+``CONFIG_KASAN_SW_TAGS`` (to enable software tag-based KASAN), and
+``CONFIG_KASAN_HW_TAGS`` (to enable hardware tag-based KASAN).
 
-For software modes, you also need to choose between CONFIG_KASAN_OUTLINE and
-CONFIG_KASAN_INLINE. Outline and inline are compiler instrumentation types.
-The former produces smaller binary while the latter is 1.1 - 2 times faster.
+For software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
+``CONFIG_KASAN_INLINE``. Outline and inline are compiler instrumentation types.
+The former produces a smaller binary while the latter is 1.1-2 times faster.
 
-Both software KASAN modes work with both SLUB and SLAB memory allocators,
-while the hardware tag-based KASAN currently only support SLUB.
-
-For better error reports that include stack traces, enable CONFIG_STACKTRACE.
-
-To augment reports with last allocation and freeing stack of the physical page,
-it is recommended to enable also CONFIG_PAGE_OWNER and boot with page_owner=on.
+To include alloc and free stack traces of affected slab objects into reports,
+enable ``CONFIG_STACKTRACE``. To include alloc and free stack traces of affected
+physical pages, enable ``CONFIG_PAGE_OWNER`` and boot with ``page_owner=on``.
 
 Error reports
 ~~~~~~~~~~~~~
 
-A typical out-of-bounds access generic KASAN report looks like this::
+A typical KASAN report looks like this::
 
     ==================================================================
     BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan]
···
     ffff8801f44ec400: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
     ==================================================================
 
-The header of the report provides a short summary of what kind of bug happened
-and what kind of access caused it. It's followed by a stack trace of the bad
-access, a stack trace of where the accessed memory was allocated (in case bad
-access happens on a slab object), and a stack trace of where the object was
-freed (in case of a use-after-free bug report). Next comes a description of
-the accessed slab object and information about the accessed memory page.
+The report header summarizes what kind of bug happened and what kind of access
+caused it. It is followed by a stack trace of the bad access, a stack trace of
+where the accessed memory was allocated (in case a slab object was accessed),
+and a stack trace of where the object was freed (in case of a use-after-free
+bug report). Next comes a description of the accessed slab object and the
+information about the accessed memory page.
 
-In the last section the report shows memory state around the accessed address.
-Internally KASAN tracks memory state separately for each memory granule, which
+In the end, the report shows the memory state around the accessed address.
+Internally, KASAN tracks memory state separately for each memory granule, which
 is either 8 or 16 aligned bytes depending on KASAN mode. Each number in the
 memory state section of the report shows the state of one of the memory
 granules that surround the accessed address.
 
-For generic KASAN the size of each memory granule is 8. The state of each
+For generic KASAN, the size of each memory granule is 8. The state of each
 granule is encoded in one shadow byte. Those 8 bytes can be accessible,
-partially accessible, freed or be a part of a redzone. KASAN uses the following
-encoding for each shadow byte: 0 means that all 8 bytes of the corresponding
+partially accessible, freed, or be a part of a redzone. KASAN uses the following
+encoding for each shadow byte: 00 means that all 8 bytes of the corresponding
 memory region are accessible; number N (1 <= N <= 7) means that the first N
 bytes are accessible, and other (8 - N) bytes are not; any negative value
 indicates that the entire 8-byte word is inaccessible. KASAN uses different
 negative values to distinguish between different kinds of inaccessible memory
 like redzones or freed memory (see mm/kasan/kasan.h).
 
-In the report above the arrows point to the shadow byte 03, which means that
-the accessed address is partially accessible. For tag-based KASAN modes this
-last report section shows the memory tags around the accessed address
-(see the `Implementation details`_ section).
+In the report above, the arrow points to the shadow byte ``03``, which means
+that the accessed address is partially accessible.
+
+For tag-based KASAN modes, this last report section shows the memory tags around
+the accessed address (see the `Implementation details`_ section).
+
+Note that KASAN bug titles (like ``slab-out-of-bounds`` or ``use-after-free``)
+are best-effort: KASAN prints the most probable bug type based on the limited
+information it has. The actual type of the bug might be different.
+
+Generic KASAN also reports up to two auxiliary call stack traces. These stack
+traces point to places in code that interacted with the object but that are not
+directly present in the bad access stack trace. Currently, this includes
+call_rcu() and workqueue queuing.
 
 Boot parameters
 ~~~~~~~~~~~~~~~
 
+KASAN is affected by the generic ``panic_on_warn`` command line parameter.
+When it is enabled, KASAN panics the kernel after printing a bug report.
+
+By default, KASAN prints a bug report only for the first invalid memory access.
+With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
+effectively disables ``panic_on_warn`` for KASAN reports.
+
 Hardware tag-based KASAN mode (see the section about various modes below) is
 intended for use in production as a security mitigation. Therefore, it supports
-boot parameters that allow to disable KASAN competely or otherwise control
-particular KASAN features.
+boot parameters that allow disabling KASAN or controlling its features.
 
 - ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
 
···
   traces collection (default: ``on``).
 
 - ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
-  report or also panic the kernel (default: ``report``). Note, that tag
-  checking gets disabled after the first reported bug.
-
-For developers
-~~~~~~~~~~~~~~
-
-Software KASAN modes use compiler instrumentation to insert validity checks.
-Such instrumentation might be incompatible with some part of the kernel, and
-therefore needs to be disabled. To disable instrumentation for specific files
-or directories, add a line similar to the following to the respective kernel
-Makefile:
-
-- For a single file (e.g. main.o)::
-
-    KASAN_SANITIZE_main.o := n
-
-- For all files in one directory::
-
-    KASAN_SANITIZE := n
-
+  report or also panic the kernel (default: ``report``). The panic happens even
+  if ``kasan_multi_shot`` is enabled.
 
 Implementation details
 ----------------------
···
 Generic KASAN
 ~~~~~~~~~~~~~
 
-From a high level perspective, KASAN's approach to memory error detection is
-similar to that of kmemcheck: use shadow memory to record whether each byte of
-memory is safe to access, and use compile-time instrumentation to insert checks
-of shadow memory on each memory access.
+Software KASAN modes use shadow memory to record whether each byte of memory is
+safe to access and use compile-time instrumentation to insert shadow memory
+checks before each memory access.
 
-Generic KASAN dedicates 1/8th of kernel memory to its shadow memory (e.g. 16TB
+Generic KASAN dedicates 1/8th of kernel memory to its shadow memory (16TB
 to cover 128TB on x86_64) and uses direct mapping with a scale and offset to
 translate a memory address to its corresponding shadow address.
···
 
     static inline void *kasan_mem_to_shadow(const void *addr)
     {
-	return ((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
+	return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
 		+ KASAN_SHADOW_OFFSET;
     }
 
 where ``KASAN_SHADOW_SCALE_SHIFT = 3``.
 
 Compile-time instrumentation is used to insert memory access checks. Compiler
-inserts function calls (__asan_load*(addr), __asan_store*(addr)) before each
-memory access of size 1, 2, 4, 8 or 16. These functions check whether memory
-access is valid or not by checking corresponding shadow memory.
+inserts function calls (``__asan_load*(addr)``, ``__asan_store*(addr)``) before
+each memory access of size 1, 2, 4, 8, or 16. These functions check whether
+memory accesses are valid or not by checking corresponding shadow memory.
 
-GCC 5.0 has possibility to perform inline instrumentation. Instead of making
-function calls GCC directly inserts the code to check the shadow memory.
-This option significantly enlarges kernel but it gives x1.1-x2 performance
-boost over outline instrumented kernel.
+With inline instrumentation, instead of making function calls, the compiler
+directly inserts the code to check shadow memory. This option significantly
+enlarges the kernel, but it gives an x1.1-x2 performance boost over the
+outline-instrumented kernel.
 
-Generic KASAN also reports the last 2 call stacks to creation of work that
-potentially has access to an object. Call stacks for the following are shown:
-call_rcu() and workqueue queuing.
-
-Generic KASAN is the only mode that delays the reuse of freed object via
+Generic KASAN is the only mode that delays the reuse of freed objects via
 quarantine (see mm/kasan/quarantine.c for implementation).
 
 Software tag-based KASAN
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-Software tag-based KASAN requires software memory tagging support in the form
-of HWASan-like compiler instrumentation (see HWASan documentation for details).
-
-Software tag-based KASAN is currently only implemented for arm64 architecture.
+Software tag-based KASAN uses a software memory tagging approach to checking
+access validity. It is currently only implemented for the arm64 architecture.
 
 Software tag-based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
-to store a pointer tag in the top byte of kernel pointers. Like generic KASAN
-it uses shadow memory to store memory tags associated with each 16-byte memory
-cell (therefore it dedicates 1/16th of the kernel memory for shadow memory).
+to store a pointer tag in the top byte of kernel pointers. It uses shadow memory
+to store memory tags associated with each 16-byte memory cell (therefore, it
+dedicates 1/16th of the kernel memory for shadow memory).
 
-On each memory allocation software tag-based KASAN generates a random tag, tags
-the allocated memory with this tag, and embeds this tag into the returned
+On each memory allocation, software tag-based KASAN generates a random tag, tags
+the allocated memory with this tag, and embeds the same tag into the returned
 pointer.
 
 Software tag-based KASAN uses compile-time instrumentation to insert checks
-before each memory access. These checks make sure that tag of the memory that
-is being accessed is equal to tag of the pointer that is used to access this
-memory. In case of a tag mismatch software tag-based KASAN prints a bug report.
+before each memory access. These checks make sure that the tag of the memory
+that is being accessed is equal to the tag of the pointer that is used to access
+this memory. In case of a tag mismatch, software tag-based KASAN prints a bug
+report.
 
-Software tag-based KASAN also has two instrumentation modes (outline, that
-emits callbacks to check memory accesses; and inline, that performs the shadow
+Software tag-based KASAN also has two instrumentation modes (outline, which
+emits callbacks to check memory accesses; and inline, which performs the shadow
 memory checks inline). With outline instrumentation mode, a bug report is
-simply printed from the function that performs the access check. With inline
-instrumentation a brk instruction is emitted by the compiler, and a dedicated
-brk handler is used to print bug reports.
+printed from the function that performs the access check. With inline
+instrumentation, a ``brk`` instruction is emitted by the compiler, and a
+dedicated ``brk`` handler is used to print bug reports.
 
 Software tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
-pointers with 0xFF pointer tag aren't checked). The value 0xFE is currently
+pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
 reserved to tag freed memory regions.
 
-Software tag-based KASAN currently only supports tagging of
-kmem_cache_alloc/kmalloc and page_alloc memory.
+Software tag-based KASAN currently only supports tagging of slab and page_alloc
+memory.
 
 Hardware tag-based KASAN
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-Hardware tag-based KASAN is similar to the software mode in concept, but uses
+Hardware tag-based KASAN is similar to the software mode in concept but uses
 hardware memory tagging support instead of compiler instrumentation and
 shadow memory.
 
 Hardware tag-based KASAN is currently only implemented for arm64 architecture
 and based on both arm64 Memory Tagging Extension (MTE) introduced in ARMv8.5
-Instruction Set Architecture, and Top Byte Ignore (TBI).
+Instruction Set Architecture and Top Byte Ignore (TBI).
 
 Special arm64 instructions are used to assign memory tags for each allocation.
 Same tags are assigned to pointers to those allocations. On every memory
-access, hardware makes sure that tag of the memory that is being accessed is
-equal to tag of the pointer that is used to access this memory. In case of a
-tag mismatch a fault is generated and a report is printed.
+access, hardware makes sure that the tag of the memory that is being accessed is
+equal to the tag of the pointer that is used to access this memory. In case of a
+tag mismatch, a fault is generated, and a report is printed.
 
 Hardware tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
-pointers with 0xFF pointer tag aren't checked). The value 0xFE is currently
+pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
 reserved to tag freed memory regions.
 
-Hardware tag-based KASAN currently only supports tagging of
-kmem_cache_alloc/kmalloc and page_alloc memory.
+Hardware tag-based KASAN currently only supports tagging of slab and page_alloc
+memory.
 
-If the hardware doesn't support MTE (pre ARMv8.5), hardware tag-based KASAN
-won't be enabled. In this case all boot parameters are ignored.
+If the hardware does not support MTE (pre ARMv8.5), hardware tag-based KASAN
+will not be enabled. In this case, all KASAN boot parameters are ignored.
 
-Note, that enabling CONFIG_KASAN_HW_TAGS always results in in-kernel TBI being
-enabled. Even when kasan.mode=off is provided, or when the hardware doesn't
+Note that enabling CONFIG_KASAN_HW_TAGS always results in in-kernel TBI being
+enabled. Even when ``kasan.mode=off`` is provided or when the hardware does not
 support MTE (but supports TBI).
 
-Hardware tag-based KASAN only reports the first found bug. After that MTE tag
+Hardware tag-based KASAN only reports the first found bug. After that, MTE tag
 checking gets disabled.
 
-What memory accesses are sanitised by KASAN?
---------------------------------------------
+Shadow memory
+-------------
 
-The kernel maps memory in a number of different parts of the address
-space. This poses something of a problem for KASAN, which requires
-that all addresses accessed by instrumented code have a valid shadow
-region.
+The kernel maps memory in several different parts of the address space.
+The range of kernel virtual addresses is large: there is not enough real
+memory to support a real shadow region for every address that could be
+accessed by the kernel. Therefore, KASAN only maps real shadow for certain
+parts of the address space.
 
-The range of kernel virtual addresses is large: there is not enough
-real memory to support a real shadow region for every address that
-could be accessed by the kernel.
-
-By default
-~~~~~~~~~~
+Default behaviour
+~~~~~~~~~~~~~~~~~
 
 By default, architectures only map real memory over the shadow region
 for the linear mapping (and potentially other small areas). For all
···
 declares all memory accesses as permitted.
 
 This presents a problem for modules: they do not live in the linear
-mapping, but in a dedicated module space. By hooking in to the module
-allocator, KASAN can temporarily map real shadow memory to cover
-them. This allows detection of invalid accesses to module globals, for
-example.
+mapping but in a dedicated module space. By hooking into the module
+allocator, KASAN temporarily maps real shadow memory to cover them.
+This allows detection of invalid accesses to module globals, for example.
 
 This also creates an incompatibility with ``VMAP_STACK``: if the stack
 lives in vmalloc space, it will be shadowed by the read-only page, and
···
 ~~~~~~~~~~~~~~~~~~~~
 
 With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
-cost of greater memory usage. Currently this is only supported on x86.
+cost of greater memory usage. Currently, this is supported on x86,
+riscv, s390, and powerpc.
 
-This works by hooking into vmalloc and vmap, and dynamically
+This works by hooking into vmalloc and vmap and dynamically
 allocating real shadow memory to back the mappings.
 
 Most mappings in vmalloc space are small, requiring less than a full
···
 
 To avoid the difficulties around swapping mappings around, KASAN expects
 that the part of the shadow region that covers the vmalloc space will
-not be covered by the early shadow page, but will be left
-unmapped. This will require changes in arch-specific code.
+not be covered by the early shadow page but will be left unmapped.
+This will require changes in arch-specific code.
 
-This allows ``VMAP_STACK`` support on x86, and can simplify support of
+This allows ``VMAP_STACK`` support on x86 and can simplify support of
 architectures that do not have a fixed module region.
 
-CONFIG_KASAN_KUNIT_TEST and CONFIG_KASAN_MODULE_TEST
-----------------------------------------------------
+For developers
+--------------
 
-KASAN tests consist of two parts:
+Ignoring accesses
+~~~~~~~~~~~~~~~~~
+
+Software KASAN modes use compiler instrumentation to insert validity checks.
+Such instrumentation might be incompatible with some parts of the kernel, and
+therefore needs to be disabled.
+
+Other parts of the kernel might access metadata for allocated objects.
+Normally, KASAN detects and reports such accesses, but in some cases (e.g.,
+in memory allocators), these accesses are valid.
+
+For software KASAN modes, to disable instrumentation for a specific file or
+directory, add a ``KASAN_SANITIZE`` annotation to the respective kernel
+Makefile:
+
+- For a single file (e.g., main.o)::
+
+    KASAN_SANITIZE_main.o := n
+
+- For all files in one directory::
+
+    KASAN_SANITIZE := n
+
+For software KASAN modes, to disable instrumentation on a per-function basis,
+use the KASAN-specific ``__no_sanitize_address`` function attribute or the
+generic ``noinstr`` one.
+
+Note that disabling compiler instrumentation (either on a per-file or a
+per-function basis) makes KASAN ignore the accesses that happen directly in
+that code for software KASAN modes. It does not help when the accesses happen
+indirectly (through calls to instrumented functions) or with the hardware
+tag-based mode that does not use compiler instrumentation.
+
+For software KASAN modes, to disable KASAN reports in a part of the kernel code
+for the current task, annotate this part of the code with a
+``kasan_disable_current()``/``kasan_enable_current()`` section. This also
+disables the reports for indirect accesses that happen through function calls.
+
+For tag-based KASAN modes (include the hardware one), to disable access
+checking, use ``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that
+temporarily disabling access checking via ``page_kasan_tag_reset()`` requires
+saving and restoring the per-page KASAN tag via
+``page_kasan_tag``/``page_kasan_tag_set``.
+
+Tests
+~~~~~
+
+There are KASAN tests that allow verifying that KASAN works and can detect
+certain types of memory corruptions. The tests consist of two parts:
 
 1. Tests that are integrated with the KUnit Test Framework. Enabled with
    ``CONFIG_KASAN_KUNIT_TEST``. These tests can be run and partially verified
-   automatically in a few different ways, see the instructions below.
+   automatically in a few different ways; see the instructions below.
 
 2. Tests that are currently incompatible with KUnit. Enabled with
    ``CONFIG_KASAN_MODULE_TEST`` and can only be run as a module. These tests can
-   only be verified manually, by loading the kernel module and inspecting the
+   only be verified manually by loading the kernel module and inspecting the
    kernel log for KASAN reports.
 
-Each KUnit-compatible KASAN test prints a KASAN report if an error is detected.
-Then the test prints its number and status.
+Each KUnit-compatible KASAN test prints one of multiple KASAN reports if an
+error is detected. Then the test prints its number and status.
 
 When a test passes::
 
···
 
     not ok 1 - kasan
 
-
 There are a few ways to run KUnit-compatible KASAN tests.
 
 1. Loadable module
-~~~~~~~~~~~~~~~~~~
 
-With ``CONFIG_KUNIT`` enabled, ``CONFIG_KASAN_KUNIT_TEST`` can be built as
-a loadable module and run on any architecture that supports KASAN by loading
-the module with insmod or modprobe. The module is called ``test_kasan``.
+With ``CONFIG_KUNIT`` enabled, KASAN-KUnit tests can be built as a loadable
+module and run by loading ``test_kasan.ko`` with ``insmod`` or ``modprobe``.
 
 2. Built-In
-~~~~~~~~~~~
 
-With ``CONFIG_KUNIT`` built-in, ``CONFIG_KASAN_KUNIT_TEST`` can be built-in
-on any architecure that supports KASAN. These and any other KUnit tests enabled
-will run and print the results at boot as a late-init call.
+With ``CONFIG_KUNIT`` built-in, KASAN-KUnit tests can be built-in as well.
+In this case, the tests will run at boot as a late-init call.
 
 3. Using kunit_tool
-~~~~~~~~~~~~~~~~~~~
 
-With ``CONFIG_KUNIT`` and ``CONFIG_KASAN_KUNIT_TEST`` built-in, it's also
-possible use ``kunit_tool`` to see the results of these and other KUnit tests
-in a more readable way. This will not print the KASAN reports of the tests that
-passed. Use `KUnit documentation <https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html>`_
-for more up-to-date information on ``kunit_tool``.
+With ``CONFIG_KUNIT`` and ``CONFIG_KASAN_KUNIT_TEST`` built-in, it is also
+possible to use ``kunit_tool`` to see the results of KUnit tests in a more
+readable way. This will not print the KASAN reports of the tests that passed.
+See `KUnit documentation <https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html>`_
+for more up-to-date information on ``kunit_tool``.
 
 .. _KUnit: https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html
+1 -1
Documentation/vm/page_owner.rst
···
 
    text    data     bss     dec     hex filename
   48800    2445     644   51889    cab1 mm/page_alloc.o
-   6574     108      29    6711    1a37 mm/page_owner.o
+   6662     108      29    6799    1a8f mm/page_owner.o
    1025       8       8    1041     411 mm/page_ext.o
 
 Although, roughly, 8 KB code is added in total, page_alloc.o increase by
-5
Documentation/vm/transhuge.rst
···
 of handling GUP on hugetlbfs will also work fine on transparent
 hugepage backed mappings.
 
-In case you can't handle compound pages if they're returned by
-follow_page, the FOLL_SPLIT bit can be specified as a parameter to
-follow_page, so that it will split the hugepages before returning
-them.
-
 Graceful fallback
 =================
+1
MAINTAINERS
··· 11770 11770 F: include/linux/memory_hotplug.h 11771 11771 F: include/linux/mm.h 11772 11772 F: include/linux/mmzone.h 11773 + F: include/linux/pagewalk.h 11773 11774 F: include/linux/vmalloc.h 11774 11775 F: mm/ 11775 11776
+11
arch/Kconfig
··· 829 829 config HAVE_ARCH_HUGE_VMAP 830 830 bool 831 831 832 + # 833 + # Archs that select this would be capable of PMD-sized vmaps (i.e., 834 + # arch_vmap_pmd_supported() returns true), and they must make no assumptions 835 + # that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag 836 + # can be used to prohibit arch-specific allocations from using hugepages to 837 + # help with this (e.g., modules may require it). 838 + # 839 + config HAVE_ARCH_HUGE_VMALLOC 840 + depends on HAVE_ARCH_HUGE_VMAP 841 + bool 842 + 832 843 config ARCH_WANT_HUGE_PMD_SHARE 833 844 bool 834 845
-1
arch/alpha/mm/init.c
··· 282 282 set_max_mapnr(max_low_pfn); 283 283 high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); 284 284 memblock_free_all(); 285 - mem_init_print_info(NULL); 286 285 }
-1
arch/arc/mm/init.c
··· 194 194 { 195 195 memblock_free_all(); 196 196 highmem_init(); 197 - mem_init_print_info(NULL); 198 197 } 199 198 200 199 #ifdef CONFIG_HIGHMEM
+1
arch/arm/Kconfig
··· 33 33 select ARCH_SUPPORTS_ATOMIC_RMW 34 34 select ARCH_USE_BUILTIN_BSWAP 35 35 select ARCH_USE_CMPXCHG_LOCKREF 36 + select ARCH_USE_MEMTEST 36 37 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU 37 38 select ARCH_WANT_IPC_PARSE_VERSION 38 39 select ARCH_WANT_LD_ORPHAN_WARN
-2
arch/arm/include/asm/pgtable-3level.h
··· 186 186 187 187 #define pmd_write(pmd) (pmd_isclear((pmd), L_PMD_SECT_RDONLY)) 188 188 #define pmd_dirty(pmd) (pmd_isset((pmd), L_PMD_SECT_DIRTY)) 189 - #define pud_page(pud) pmd_page(__pmd(pud_val(pud))) 190 - #define pud_write(pud) pmd_write(__pmd(pud_val(pud))) 191 189 192 190 #define pmd_hugewillfault(pmd) (!pmd_young(pmd) || !pmd_write(pmd)) 193 191 #define pmd_thp_or_huge(pmd) (pmd_huge(pmd) || pmd_trans_huge(pmd))
+3
arch/arm/include/asm/pgtable.h
··· 166 166 167 167 extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; 168 168 169 + #define pud_page(pud) pmd_page(__pmd(pud_val(pud))) 170 + #define pud_write(pud) pmd_write(__pmd(pud_val(pud))) 171 + 169 172 #define pmd_none(pmd) (!pmd_val(pmd)) 170 173 171 174 static inline pte_t *pmd_page_vaddr(pmd_t pmd)
+1
arch/arm/mm/copypage-v4mc.c
··· 13 13 #include <linux/init.h> 14 14 #include <linux/mm.h> 15 15 #include <linux/highmem.h> 16 + #include <linux/pagemap.h> 16 17 17 18 #include <asm/tlbflush.h> 18 19 #include <asm/cacheflush.h>
+1
arch/arm/mm/copypage-v6.c
··· 8 8 #include <linux/spinlock.h> 9 9 #include <linux/mm.h> 10 10 #include <linux/highmem.h> 11 + #include <linux/pagemap.h> 11 12 12 13 #include <asm/shmparam.h> 13 14 #include <asm/tlbflush.h>
+1
arch/arm/mm/copypage-xscale.c
··· 13 13 #include <linux/init.h> 14 14 #include <linux/mm.h> 15 15 #include <linux/highmem.h> 16 + #include <linux/pagemap.h> 16 17 17 18 #include <asm/tlbflush.h> 18 19 #include <asm/cacheflush.h>
-2
arch/arm/mm/init.c
··· 316 316 317 317 free_highpages(); 318 318 319 - mem_init_print_info(NULL); 320 - 321 319 /* 322 320 * Check boundaries twice: Some fundamental inconsistencies can 323 321 * be detected at build time already.
+1
arch/arm64/Kconfig
··· 67 67 select ARCH_KEEP_MEMBLOCK 68 68 select ARCH_USE_CMPXCHG_LOCKREF 69 69 select ARCH_USE_GNU_PROPERTY 70 + select ARCH_USE_MEMTEST 70 71 select ARCH_USE_QUEUED_RWLOCKS 71 72 select ARCH_USE_QUEUED_SPINLOCKS 72 73 select ARCH_USE_SYM_ANNOTATIONS
+2 -2
arch/arm64/include/asm/memory.h
··· 250 250 #define arch_init_tags(max_tag) mte_init_tags(max_tag) 251 251 #define arch_get_random_tag() mte_get_random_tag() 252 252 #define arch_get_mem_tag(addr) mte_get_mem_tag(addr) 253 - #define arch_set_mem_tag_range(addr, size, tag) \ 254 - mte_set_mem_tag_range((addr), (size), (tag)) 253 + #define arch_set_mem_tag_range(addr, size, tag, init) \ 254 + mte_set_mem_tag_range((addr), (size), (tag), (init)) 255 255 #endif /* CONFIG_KASAN_HW_TAGS */ 256 256 257 257 /*
+25 -14
arch/arm64/include/asm/mte-kasan.h
··· 53 53 * Note: The address must be non-NULL and MTE_GRANULE_SIZE aligned and 54 54 * size must be non-zero and MTE_GRANULE_SIZE aligned. 55 55 */ 56 - static inline void mte_set_mem_tag_range(void *addr, size_t size, u8 tag) 56 + static inline void mte_set_mem_tag_range(void *addr, size_t size, 57 + u8 tag, bool init) 57 58 { 58 59 u64 curr, end; 59 60 ··· 64 63 curr = (u64)__tag_set(addr, tag); 65 64 end = curr + size; 66 65 67 - do { 68 - /* 69 - * 'asm volatile' is required to prevent the compiler to move 70 - * the statement outside of the loop. 71 - */ 72 - asm volatile(__MTE_PREAMBLE "stg %0, [%0]" 73 - : 74 - : "r" (curr) 75 - : "memory"); 76 - 77 - curr += MTE_GRANULE_SIZE; 78 - } while (curr != end); 66 + /* 67 + * 'asm volatile' is required to prevent the compiler to move 68 + * the statement outside of the loop. 69 + */ 70 + if (init) { 71 + do { 72 + asm volatile(__MTE_PREAMBLE "stzg %0, [%0]" 73 + : 74 + : "r" (curr) 75 + : "memory"); 76 + curr += MTE_GRANULE_SIZE; 77 + } while (curr != end); 78 + } else { 79 + do { 80 + asm volatile(__MTE_PREAMBLE "stg %0, [%0]" 81 + : 82 + : "r" (curr) 83 + : "memory"); 84 + curr += MTE_GRANULE_SIZE; 85 + } while (curr != end); 86 + } 79 87 } 80 88 81 89 void mte_enable_kernel_sync(void); ··· 111 101 return 0xFF; 112 102 } 113 103 114 - static inline void mte_set_mem_tag_range(void *addr, size_t size, u8 tag) 104 + static inline void mte_set_mem_tag_range(void *addr, size_t size, 105 + u8 tag, bool init) 115 106 { 116 107 } 117 108
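The hunk above hoists the `init` check out of the tag-storing loop, so the branch is tested once per call instead of once per granule, and each specialized loop body uses a single store instruction (`stzg` tags and zeroes, `stg` only tags). A minimal userspace sketch of the same restructuring — the names (`set_tags_*`, `GRANULE`) and the `memset`-based "zeroing" are illustrative stand-ins, not the kernel's MTE code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GRANULE 16  /* stand-in for MTE_GRANULE_SIZE */

/* Old shape: the loop-invariant "init" branch is re-evaluated
 * on every iteration. */
static void set_tags_branchy(unsigned char *buf, size_t size,
                             unsigned char tag, int init)
{
	for (size_t off = 0; off < size; off += GRANULE) {
		if (init)
			memset(buf + off, 0, GRANULE); /* the "z" in stzg */
		buf[off] = tag;                        /* record the tag */
	}
}

/* New shape: one test, then two specialized loops, mirroring the
 * stzg/stg split in the patch. */
static void set_tags_hoisted(unsigned char *buf, size_t size,
                             unsigned char tag, int init)
{
	if (init) {
		for (size_t off = 0; off < size; off += GRANULE) {
			memset(buf + off, 0, GRANULE);
			buf[off] = tag;
		}
	} else {
		for (size_t off = 0; off < size; off += GRANULE)
			buf[off] = tag;
	}
}
```

Both shapes produce identical memory contents; only the per-iteration branch count changes.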
+24
arch/arm64/include/asm/vmalloc.h
··· 1 1 #ifndef _ASM_ARM64_VMALLOC_H 2 2 #define _ASM_ARM64_VMALLOC_H 3 3 4 + #include <asm/page.h> 5 + 6 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 7 + 8 + #define arch_vmap_pud_supported arch_vmap_pud_supported 9 + static inline bool arch_vmap_pud_supported(pgprot_t prot) 10 + { 11 + /* 12 + * Only 4k granule supports level 1 block mappings. 13 + * SW table walks can't handle removal of intermediate entries. 14 + */ 15 + return IS_ENABLED(CONFIG_ARM64_4K_PAGES) && 16 + !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); 17 + } 18 + 19 + #define arch_vmap_pmd_supported arch_vmap_pmd_supported 20 + static inline bool arch_vmap_pmd_supported(pgprot_t prot) 21 + { 22 + /* See arch_vmap_pud_supported() */ 23 + return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); 24 + } 25 + 26 + #endif 27 + 4 28 #endif /* _ASM_ARM64_VMALLOC_H */
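The new `arch_vmap_pud_supported()`/`arch_vmap_pmd_supported()` helpers replace the runtime `arch_ioremap_*_supported()` calls (removed from `arch/arm64/mm/mmu.c` below) with inline predicates built from `IS_ENABLED()`, which folds to a compile-time constant. A self-contained sketch of the preprocessor trick behind it — this is a simplified single-token version of the kernel's `include/linux/kconfig.h` machinery (the real `IS_ENABLED()` also checks the `_MODULE` variant), and `CONFIG_FOO`/`CONFIG_BAR` are hypothetical options:

```c
#include <assert.h>

/* Simplified IS_ENABLED(): expands to 1 if the option macro is
 * defined as 1, and to 0 if it is not defined at all. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_FOO 1
/* CONFIG_BAR deliberately left undefined */

/* Shaped like arch_vmap_pud_supported() above: a constant-foldable
 * predicate combining config options. */
static int vmap_pud_supported(void)
{
	return IS_ENABLED(CONFIG_FOO) && !IS_ENABLED(CONFIG_BAR);
}
```

Because the expression is a compile-time constant, the compiler can delete the disabled branch entirely, which is why the patch can drop the out-of-line `__init` functions.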
+1 -3
arch/arm64/mm/init.c
··· 491 491 /* this will put all unused low memory onto the freelists */ 492 492 memblock_free_all(); 493 493 494 - mem_init_print_info(NULL); 495 - 496 494 /* 497 495 * Check boundaries twice: Some fundamental inconsistencies can be 498 496 * detected at build time already. ··· 519 521 * prevents the region from being reused for kernel modules, which 520 522 * is not supported by kallsyms. 521 523 */ 522 - unmap_kernel_range((u64)__init_begin, (u64)(__init_end - __init_begin)); 524 + vunmap_range((u64)__init_begin, (u64)__init_end); 523 525 } 524 526 525 527 void dump_mem_limit(void)
-26
arch/arm64/mm/mmu.c
··· 1339 1339 return dt_virt; 1340 1340 } 1341 1341 1342 - int __init arch_ioremap_p4d_supported(void) 1343 - { 1344 - return 0; 1345 - } 1346 - 1347 - int __init arch_ioremap_pud_supported(void) 1348 - { 1349 - /* 1350 - * Only 4k granule supports level 1 block mappings. 1351 - * SW table walks can't handle removal of intermediate entries. 1352 - */ 1353 - return IS_ENABLED(CONFIG_ARM64_4K_PAGES) && 1354 - !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); 1355 - } 1356 - 1357 - int __init arch_ioremap_pmd_supported(void) 1358 - { 1359 - /* See arch_ioremap_pud_supported() */ 1360 - return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); 1361 - } 1362 - 1363 1342 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) 1364 1343 { 1365 1344 pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot)); ··· 1428 1449 __flush_tlb_kernel_pgtable(addr); 1429 1450 pmd_free(NULL, table); 1430 1451 return 1; 1431 - } 1432 - 1433 - int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) 1434 - { 1435 - return 0; /* Don't attempt a block mapping */ 1436 1452 } 1437 1453 1438 1454 #ifdef CONFIG_MEMORY_HOTPLUG
+1
arch/csky/abiv1/cacheflush.c
··· 4 4 #include <linux/kernel.h> 5 5 #include <linux/mm.h> 6 6 #include <linux/fs.h> 7 + #include <linux/pagemap.h> 7 8 #include <linux/syscalls.h> 8 9 #include <linux/spinlock.h> 9 10 #include <asm/page.h>
-1
arch/csky/mm/init.c
··· 107 107 free_highmem_page(page); 108 108 } 109 109 #endif 110 - mem_init_print_info(NULL); 111 110 } 112 111 113 112 void free_initmem(void)
-2
arch/h8300/mm/init.c
··· 98 98 99 99 /* this will put all low memory onto the freelists */ 100 100 memblock_free_all(); 101 - 102 - mem_init_print_info(NULL); 103 101 }
-1
arch/hexagon/mm/init.c
··· 55 55 { 56 56 /* No idea where this is actually declared. Seems to evade LXR. */ 57 57 memblock_free_all(); 58 - mem_init_print_info(NULL); 59 58 60 59 /* 61 60 * To-Do: someone somewhere should wipe out the bootmem map
-23
arch/ia64/Kconfig
··· 286 286 config ARCH_SELECT_MEMORY_MODEL 287 287 def_bool y 288 288 289 - config ARCH_DISCONTIGMEM_ENABLE 290 - def_bool y 291 - depends on BROKEN 292 - help 293 - Say Y to support efficient handling of discontiguous physical memory, 294 - for architectures which are either NUMA (Non-Uniform Memory Access) 295 - or have huge holes in the physical address space for other reasons. 296 - See <file:Documentation/vm/numa.rst> for more. 297 - 298 289 config ARCH_FLATMEM_ENABLE 299 290 def_bool y 300 291 ··· 316 325 MAX_NUMNODES will be 2^(This value). 317 326 If in doubt, use the default. 318 327 319 - # VIRTUAL_MEM_MAP and FLAT_NODE_MEM_MAP are functionally equivalent. 320 - # VIRTUAL_MEM_MAP has been retained for historical reasons. 321 - config VIRTUAL_MEM_MAP 322 - bool "Virtual mem map" 323 - depends on !SPARSEMEM && !FLATMEM 324 - default y 325 - help 326 - Say Y to compile the kernel with support for a virtual mem map. 327 - This code also only takes effect if a memory hole of greater than 328 - 1 Gb is found during boot. You must turn this option on if you 329 - require the DISCONTIGMEM option for your machine. If you are 330 - unsure, say Y. 331 - 332 328 config HOLES_IN_ZONE 333 329 bool 334 - default y if VIRTUAL_MEM_MAP 335 330 336 331 config HAVE_ARCH_NODEDATA_EXTENSION 337 332 def_bool y
-1
arch/ia64/configs/bigsur_defconfig
··· 9 9 CONFIG_SMP=y 10 10 CONFIG_NR_CPUS=2 11 11 CONFIG_PREEMPT=y 12 - # CONFIG_VIRTUAL_MEM_MAP is not set 13 12 CONFIG_IA64_PALINFO=y 14 13 CONFIG_EFI_VARS=y 15 14 CONFIG_BINFMT_MISC=m
-11
arch/ia64/include/asm/meminit.h
··· 58 58 59 59 extern int register_active_ranges(u64 start, u64 len, int nid); 60 60 61 - #ifdef CONFIG_VIRTUAL_MEM_MAP 62 - extern unsigned long VMALLOC_END; 63 - extern struct page *vmem_map; 64 - extern int create_mem_map_page_table(u64 start, u64 end, void *arg); 65 - extern int vmemmap_find_next_valid_pfn(int, int); 66 - #else 67 - static inline int vmemmap_find_next_valid_pfn(int node, int i) 68 - { 69 - return i + 1; 70 - } 71 - #endif 72 61 #endif /* meminit_h */
+5 -1
arch/ia64/include/asm/module.h
··· 14 14 struct elf64_shdr; /* forward declration */ 15 15 16 16 struct mod_arch_specific { 17 + /* Used only at module load time. */ 17 18 struct elf64_shdr *core_plt; /* core PLT section */ 18 19 struct elf64_shdr *init_plt; /* init PLT section */ 19 20 struct elf64_shdr *got; /* global offset table */ 20 21 struct elf64_shdr *opd; /* official procedure descriptors */ 21 22 struct elf64_shdr *unwind; /* unwind-table section */ 22 23 unsigned long gp; /* global-pointer for module */ 24 + unsigned int next_got_entry; /* index of next available got entry */ 23 25 26 + /* Used at module run and cleanup time. */ 24 27 void *core_unw_table; /* core unwind-table cookie returned by unwinder */ 25 28 void *init_unw_table; /* init unwind-table cookie returned by unwinder */ 26 - unsigned int next_got_entry; /* index of next available got entry */ 29 + void *opd_addr; /* symbolize uses .opd to get to actual function */ 30 + unsigned long opd_size; 27 31 }; 28 32 29 33 #define ARCH_SHF_SMALL SHF_IA_64_SHORT
+2 -23
arch/ia64/include/asm/page.h
··· 95 95 96 96 #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) 97 97 98 - #ifdef CONFIG_VIRTUAL_MEM_MAP 99 - extern int ia64_pfn_valid (unsigned long pfn); 100 - #else 101 - # define ia64_pfn_valid(pfn) 1 102 - #endif 103 - 104 - #ifdef CONFIG_VIRTUAL_MEM_MAP 105 - extern struct page *vmem_map; 106 - #ifdef CONFIG_DISCONTIGMEM 107 - # define page_to_pfn(page) ((unsigned long) (page - vmem_map)) 108 - # define pfn_to_page(pfn) (vmem_map + (pfn)) 109 - # define __pfn_to_phys(pfn) PFN_PHYS(pfn) 110 - #else 111 - # include <asm-generic/memory_model.h> 112 - #endif 113 - #else 114 - # include <asm-generic/memory_model.h> 115 - #endif 98 + #include <asm-generic/memory_model.h> 116 99 117 100 #ifdef CONFIG_FLATMEM 118 - # define pfn_valid(pfn) (((pfn) < max_mapnr) && ia64_pfn_valid(pfn)) 119 - #elif defined(CONFIG_DISCONTIGMEM) 120 - extern unsigned long min_low_pfn; 121 - extern unsigned long max_low_pfn; 122 - # define pfn_valid(pfn) (((pfn) >= min_low_pfn) && ((pfn) < max_low_pfn) && ia64_pfn_valid(pfn)) 101 + # define pfn_valid(pfn) ((pfn) < max_mapnr) 123 102 #endif 124 103 125 104 #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
+1 -6
arch/ia64/include/asm/pgtable.h
··· 223 223 224 224 225 225 #define VMALLOC_START (RGN_BASE(RGN_GATE) + 0x200000000UL) 226 - #ifdef CONFIG_VIRTUAL_MEM_MAP 227 - # define VMALLOC_END_INIT (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 9))) 228 - extern unsigned long VMALLOC_END; 229 - #else 230 226 #if defined(CONFIG_SPARSEMEM) && defined(CONFIG_SPARSEMEM_VMEMMAP) 231 227 /* SPARSEMEM_VMEMMAP uses half of vmalloc... */ 232 228 # define VMALLOC_END (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 10))) 233 229 # define vmemmap ((struct page *)VMALLOC_END) 234 230 #else 235 231 # define VMALLOC_END (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 9))) 236 - #endif 237 232 #endif 238 233 239 234 /* fs/proc/kcore.c */ ··· 323 328 static inline void set_pte(pte_t *ptep, pte_t pteval) 324 329 { 325 330 /* page is present && page is user && page is executable 326 - * && (page swapin or new page or page migraton 331 + * && (page swapin or new page or page migration 327 332 * || copy_on_write with page copying.) 328 333 */ 329 334 if (pte_present_exec_user(pteval) &&
+1 -1
arch/ia64/kernel/Makefile
··· 9 9 10 10 extra-y := head.o vmlinux.lds 11 11 12 - obj-y := entry.o efi.o efi_stub.o gate-data.o fsys.o ia64_ksyms.o irq.o irq_ia64.o \ 12 + obj-y := entry.o efi.o efi_stub.o gate-data.o fsys.o irq.o irq_ia64.o \ 13 13 irq_lsapic.o ivt.o pal.o patch.o process.o ptrace.o sal.o \ 14 14 salinfo.o setup.o signal.o sys_ia64.o time.o traps.o unaligned.o \ 15 15 unwind.o mca.o mca_asm.o topology.o dma-mapping.o iosapic.o acpi.o \
+5 -2
arch/ia64/kernel/acpi.c
··· 446 446 if (srat_num_cpus == 0) { 447 447 node_set_online(0); 448 448 node_cpuid[0].phys_id = hard_smp_processor_id(); 449 - return; 449 + slit_distance(0, 0) = LOCAL_DISTANCE; 450 + goto out; 450 451 } 451 452 452 453 /* ··· 490 489 for (j = 0; j < MAX_NUMNODES; j++) 491 490 slit_distance(i, j) = i == j ? 492 491 LOCAL_DISTANCE : REMOTE_DISTANCE; 493 - return; 492 + goto out; 494 493 } 495 494 496 495 memset(numa_slit, -1, sizeof(numa_slit)); ··· 515 514 printk("\n"); 516 515 } 517 516 #endif 517 + out: 518 + node_possible_map = node_online_map; 518 519 } 519 520 #endif /* CONFIG_ACPI_NUMA */ 520 521
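The `acpi.c` hunk converts two early `return`s into `goto out` so that the new `node_possible_map = node_online_map;` assignment runs on every exit path. A tiny sketch of that single-exit idiom, with hypothetical names (`fixup`, `finalized`) standing in for the NUMA fixup:

```c
#include <assert.h>

static int finalized;

/* Early paths jump to "out" instead of returning, so the common
 * finalization step runs no matter which path was taken. */
static int fixup(int ncpus)
{
	int ret;

	if (ncpus == 0) {
		ret = -1;	/* degenerate case, still needs finalization */
		goto out;
	}
	ret = 0;		/* normal path */
out:
	finalized = 1;		/* like node_possible_map = node_online_map */
	return ret;
}
```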
+6 -5
arch/ia64/kernel/efi.c
··· 415 415 mask = ~((1 << IA64_GRANULE_SHIFT) - 1); 416 416 417 417 printk(KERN_INFO "CPU %d: mapping PAL code " 418 - "[0x%lx-0x%lx) into [0x%lx-0x%lx)\n", 419 - smp_processor_id(), md->phys_addr, 420 - md->phys_addr + efi_md_size(md), 421 - vaddr & mask, (vaddr & mask) + IA64_GRANULE_SIZE); 418 + "[0x%llx-0x%llx) into [0x%llx-0x%llx)\n", 419 + smp_processor_id(), md->phys_addr, 420 + md->phys_addr + efi_md_size(md), 421 + vaddr & mask, (vaddr & mask) + IA64_GRANULE_SIZE); 422 422 #endif 423 423 return __va(md->phys_addr); 424 424 } ··· 560 560 { 561 561 efi_memory_desc_t *md; 562 562 void *p; 563 + unsigned int i; 563 564 564 565 for (i = 0, p = efi_map_start; p < efi_map_end; 565 566 ++i, p += efi_desc_size) ··· 587 586 } 588 587 589 588 printk("mem%02d: %s " 590 - "range=[0x%016lx-0x%016lx) (%4lu%s)\n", 589 + "range=[0x%016llx-0x%016llx) (%4lu%s)\n", 591 590 i, efi_md_typeattr_format(buf, sizeof(buf), md), 592 591 md->phys_addr, 593 592 md->phys_addr + efi_md_size(md), size, unit);
+2 -2
arch/ia64/kernel/fsys.S
··· 172 172 // r25 = itc_lastcycle value 173 173 // r26 = address clocksource cycle_last 174 174 // r27 = (not used) 175 - // r28 = sequence number at the beginning of critcal section 175 + // r28 = sequence number at the beginning of critical section 176 176 // r29 = address of itc_jitter 177 177 // r30 = time processing flags / memory address 178 178 // r31 = pointer to result ··· 432 432 * - r29: psr 433 433 * 434 434 * We used to clear some PSR bits here but that requires slow 435 - * serialization. Fortuntely, that isn't really necessary. 435 + * serialization. Fortunately, that isn't really necessary. 436 436 * The rationale is as follows: we used to clear bits 437 437 * ~PSR_PRESERVED_BITS in PSR.L. Since 438 438 * PSR_PRESERVED_BITS==PSR.{UP,MFL,MFH,PK,DT,PP,SP,RT,IC}, we
-6
arch/ia64/kernel/head.S
··· 33 33 #include <asm/mca_asm.h> 34 34 #include <linux/init.h> 35 35 #include <linux/linkage.h> 36 - #include <linux/pgtable.h> 37 36 #include <asm/export.h> 38 37 39 38 #ifdef CONFIG_HOTPLUG_CPU ··· 404 405 405 406 // This is executed by the bootstrap processor (bsp) only: 406 407 407 - #ifdef CONFIG_IA64_FW_EMU 408 - // initialize PAL & SAL emulator: 409 - br.call.sptk.many rp=sys_fw_init 410 - .ret1: 411 - #endif 412 408 br.call.sptk.many rp=start_kernel 413 409 .ret2: addl r3=@ltoff(halt_msg),gp 414 410 ;;
-12
arch/ia64/kernel/ia64_ksyms.c
··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - /* 3 - * Architecture-specific kernel symbols 4 - */ 5 - 6 - #if defined(CONFIG_VIRTUAL_MEM_MAP) || defined(CONFIG_DISCONTIGMEM) 7 - #include <linux/compiler.h> 8 - #include <linux/export.h> 9 - #include <linux/memblock.h> 10 - EXPORT_SYMBOL(min_low_pfn); /* defined by bootmem.c, but not exported by generic code */ 11 - EXPORT_SYMBOL(max_low_pfn); /* defined by bootmem.c, but not exported by generic code */ 12 - #endif
+1 -1
arch/ia64/kernel/machine_kexec.c
··· 143 143 144 144 void arch_crash_save_vmcoreinfo(void) 145 145 { 146 - #if defined(CONFIG_DISCONTIGMEM) || defined(CONFIG_SPARSEMEM) 146 + #if defined(CONFIG_SPARSEMEM) 147 147 VMCOREINFO_SYMBOL(pgdat_list); 148 148 VMCOREINFO_LENGTH(pgdat_list, MAX_NUMNODES); 149 149 #endif
+2 -2
arch/ia64/kernel/mca.c
··· 109 109 #include "irq.h" 110 110 111 111 #if defined(IA64_MCA_DEBUG_INFO) 112 - # define IA64_MCA_DEBUG(fmt...) printk(fmt) 112 + # define IA64_MCA_DEBUG(fmt...) printk(fmt) 113 113 #else 114 - # define IA64_MCA_DEBUG(fmt...) 114 + # define IA64_MCA_DEBUG(fmt...) do {} while (0) 115 115 #endif 116 116 117 117 #define NOTIFY_INIT(event, regs, arg, spin) \
+25 -4
arch/ia64/kernel/module.c
··· 905 905 int 906 906 module_finalize (const Elf_Ehdr *hdr, const Elf_Shdr *sechdrs, struct module *mod) 907 907 { 908 + struct mod_arch_specific *mas = &mod->arch; 909 + 908 910 DEBUGP("%s: init: entry=%p\n", __func__, mod->init); 909 - if (mod->arch.unwind) 911 + if (mas->unwind) 910 912 register_unwind_table(mod); 913 + 914 + /* 915 + * ".opd" was already relocated to the final destination. Store 916 + * it's address for use in symbolizer. 917 + */ 918 + mas->opd_addr = (void *)mas->opd->sh_addr; 919 + mas->opd_size = mas->opd->sh_size; 920 + 921 + /* 922 + * Module relocation was already done at this point. Section 923 + * headers are about to be deleted. Wipe out load-time context. 924 + */ 925 + mas->core_plt = NULL; 926 + mas->init_plt = NULL; 927 + mas->got = NULL; 928 + mas->opd = NULL; 929 + mas->unwind = NULL; 930 + mas->gp = 0; 931 + mas->next_got_entry = 0; 932 + 911 933 return 0; 912 934 } 913 935 ··· 948 926 949 927 void *dereference_module_function_descriptor(struct module *mod, void *ptr) 950 928 { 951 - Elf64_Shdr *opd = mod->arch.opd; 929 + struct mod_arch_specific *mas = &mod->arch; 952 930 953 - if (ptr < (void *)opd->sh_addr || 954 - ptr >= (void *)(opd->sh_addr + opd->sh_size)) 931 + if (ptr < mas->opd_addr || ptr >= mas->opd_addr + mas->opd_size) 955 932 return ptr; 956 933 957 934 return dereference_function_descriptor(ptr);
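The `module.c` change caches `.opd`'s address and size in `module_finalize()` because the section headers are freed after load; the descriptor lookup then reduces to a half-open range check on the cached values. A stand-alone sketch of that check (the `struct opd_range`/`in_opd` names are hypothetical; `char *` casts are used here for portable pointer arithmetic, where the kernel relies on the GNU `void *` extension):

```c
#include <assert.h>

/* The only .opd state that survives module load: base and length. */
struct opd_range {
	void *addr;
	unsigned long size;
};

/* Half-open range test: [addr, addr + size), as in
 * dereference_module_function_descriptor() above. */
static int in_opd(const struct opd_range *r, void *ptr)
{
	return (char *)ptr >= (char *)r->addr &&
	       (char *)ptr < (char *)r->addr + r->size;
}
```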
+3 -3
arch/ia64/kernel/pal.S
··· 86 86 mov ar.pfs = loc1 87 87 mov rp = loc0 88 88 ;; 89 - srlz.d // seralize restoration of psr.l 89 + srlz.d // serialize restoration of psr.l 90 90 br.ret.sptk.many b0 91 91 END(ia64_pal_call_static) 92 92 EXPORT_SYMBOL(ia64_pal_call_static) ··· 194 194 mov rp = loc0 195 195 ;; 196 196 mov ar.rsc=loc4 // restore RSE configuration 197 - srlz.d // seralize restoration of psr.l 197 + srlz.d // serialize restoration of psr.l 198 198 br.ret.sptk.many b0 199 199 END(ia64_pal_call_phys_static) 200 200 EXPORT_SYMBOL(ia64_pal_call_phys_static) ··· 252 252 mov rp = loc0 253 253 ;; 254 254 mov ar.rsc=loc4 // restore RSE configuration 255 - srlz.d // seralize restoration of psr.l 255 + srlz.d // serialize restoration of psr.l 256 256 br.ret.sptk.many b0 257 257 END(ia64_pal_call_phys_stacked) 258 258 EXPORT_SYMBOL(ia64_pal_call_phys_stacked)
-1
arch/ia64/mm/Makefile
··· 7 7 8 8 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o 9 9 obj-$(CONFIG_NUMA) += numa.o 10 - obj-$(CONFIG_DISCONTIGMEM) += discontig.o 11 10 obj-$(CONFIG_SPARSEMEM) += discontig.o 12 11 obj-$(CONFIG_FLATMEM) += contig.o
-4
arch/ia64/mm/contig.c
··· 153 153 efi_memmap_walk(find_max_min_low_pfn, NULL); 154 154 max_pfn = max_low_pfn; 155 155 156 - #ifdef CONFIG_VIRTUAL_MEM_MAP 157 - efi_memmap_walk(filter_memory, register_active_ranges); 158 - #else 159 156 memblock_add_node(0, PFN_PHYS(max_low_pfn), 0); 160 - #endif 161 157 162 158 find_initrd(); 163 159
-21
arch/ia64/mm/discontig.c
··· 585 585 } 586 586 } 587 587 588 - static void __init virtual_map_init(void) 589 - { 590 - #ifdef CONFIG_VIRTUAL_MEM_MAP 591 - int node; 592 - 593 - VMALLOC_END -= PAGE_ALIGN(ALIGN(max_low_pfn, MAX_ORDER_NR_PAGES) * 594 - sizeof(struct page)); 595 - vmem_map = (struct page *) VMALLOC_END; 596 - efi_memmap_walk(create_mem_map_page_table, NULL); 597 - printk("Virtual mem_map starts at 0x%p\n", vmem_map); 598 - 599 - for_each_online_node(node) { 600 - unsigned long pfn_offset = mem_data[node].min_pfn; 601 - 602 - NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset; 603 - } 604 - #endif 605 - } 606 - 607 588 /** 608 589 * paging_init - setup page tables 609 590 * ··· 599 618 max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT; 600 619 601 620 sparse_init(); 602 - 603 - virtual_map_init(); 604 621 605 622 memset(max_zone_pfns, 0, sizeof(max_zone_pfns)); 606 623 max_zone_pfns[ZONE_DMA32] = max_dma;
-15
arch/ia64/mm/fault.c
··· 84 84 if (faulthandler_disabled() || !mm) 85 85 goto no_context; 86 86 87 - #ifdef CONFIG_VIRTUAL_MEM_MAP 88 - /* 89 - * If fault is in region 5 and we are in the kernel, we may already 90 - * have the mmap_lock (pfn_valid macro is called during mmap). There 91 - * is no vma for region 5 addr's anyway, so skip getting the semaphore 92 - * and go directly to the exception handling code. 93 - */ 94 - 95 - if ((REGION_NUMBER(address) == 5) && !user_mode(regs)) 96 - goto bad_area_no_up; 97 - #endif 98 - 99 87 /* 100 88 * This is to handle the kprobes on user space access instructions 101 89 */ ··· 201 213 202 214 bad_area: 203 215 mmap_read_unlock(mm); 204 - #ifdef CONFIG_VIRTUAL_MEM_MAP 205 - bad_area_no_up: 206 - #endif 207 216 if ((isr & IA64_ISR_SP) 208 217 || ((isr & IA64_ISR_NA) && (isr & IA64_ISR_CODE_MASK) == IA64_ISR_CODE_LFETCH)) 209 218 {
+5 -216
arch/ia64/mm/init.c
··· 43 43 44 44 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x100000000UL; 45 45 46 - #ifdef CONFIG_VIRTUAL_MEM_MAP 47 - unsigned long VMALLOC_END = VMALLOC_END_INIT; 48 - EXPORT_SYMBOL(VMALLOC_END); 49 - struct page *vmem_map; 50 - EXPORT_SYMBOL(vmem_map); 51 - #endif 52 - 53 46 struct page *zero_page_memmap_ptr; /* map entry for zero page */ 54 47 EXPORT_SYMBOL(zero_page_memmap_ptr); 55 48 ··· 366 373 #endif 367 374 } 368 375 369 - #ifdef CONFIG_VIRTUAL_MEM_MAP 370 - int vmemmap_find_next_valid_pfn(int node, int i) 371 - { 372 - unsigned long end_address, hole_next_pfn; 373 - unsigned long stop_address; 374 - pg_data_t *pgdat = NODE_DATA(node); 375 - 376 - end_address = (unsigned long) &vmem_map[pgdat->node_start_pfn + i]; 377 - end_address = PAGE_ALIGN(end_address); 378 - stop_address = (unsigned long) &vmem_map[pgdat_end_pfn(pgdat)]; 379 - 380 - do { 381 - pgd_t *pgd; 382 - p4d_t *p4d; 383 - pud_t *pud; 384 - pmd_t *pmd; 385 - pte_t *pte; 386 - 387 - pgd = pgd_offset_k(end_address); 388 - if (pgd_none(*pgd)) { 389 - end_address += PGDIR_SIZE; 390 - continue; 391 - } 392 - 393 - p4d = p4d_offset(pgd, end_address); 394 - if (p4d_none(*p4d)) { 395 - end_address += P4D_SIZE; 396 - continue; 397 - } 398 - 399 - pud = pud_offset(p4d, end_address); 400 - if (pud_none(*pud)) { 401 - end_address += PUD_SIZE; 402 - continue; 403 - } 404 - 405 - pmd = pmd_offset(pud, end_address); 406 - if (pmd_none(*pmd)) { 407 - end_address += PMD_SIZE; 408 - continue; 409 - } 410 - 411 - pte = pte_offset_kernel(pmd, end_address); 412 - retry_pte: 413 - if (pte_none(*pte)) { 414 - end_address += PAGE_SIZE; 415 - pte++; 416 - if ((end_address < stop_address) && 417 - (end_address != ALIGN(end_address, 1UL << PMD_SHIFT))) 418 - goto retry_pte; 419 - continue; 420 - } 421 - /* Found next valid vmem_map page */ 422 - break; 423 - } while (end_address < stop_address); 424 - 425 - end_address = min(end_address, stop_address); 426 - end_address = end_address - (unsigned long) vmem_map + 
sizeof(struct page) - 1; 427 - hole_next_pfn = end_address / sizeof(struct page); 428 - return hole_next_pfn - pgdat->node_start_pfn; 429 - } 430 - 431 - int __init create_mem_map_page_table(u64 start, u64 end, void *arg) 432 - { 433 - unsigned long address, start_page, end_page; 434 - struct page *map_start, *map_end; 435 - int node; 436 - pgd_t *pgd; 437 - p4d_t *p4d; 438 - pud_t *pud; 439 - pmd_t *pmd; 440 - pte_t *pte; 441 - 442 - map_start = vmem_map + (__pa(start) >> PAGE_SHIFT); 443 - map_end = vmem_map + (__pa(end) >> PAGE_SHIFT); 444 - 445 - start_page = (unsigned long) map_start & PAGE_MASK; 446 - end_page = PAGE_ALIGN((unsigned long) map_end); 447 - node = paddr_to_nid(__pa(start)); 448 - 449 - for (address = start_page; address < end_page; address += PAGE_SIZE) { 450 - pgd = pgd_offset_k(address); 451 - if (pgd_none(*pgd)) { 452 - p4d = memblock_alloc_node(PAGE_SIZE, PAGE_SIZE, node); 453 - if (!p4d) 454 - goto err_alloc; 455 - pgd_populate(&init_mm, pgd, p4d); 456 - } 457 - p4d = p4d_offset(pgd, address); 458 - 459 - if (p4d_none(*p4d)) { 460 - pud = memblock_alloc_node(PAGE_SIZE, PAGE_SIZE, node); 461 - if (!pud) 462 - goto err_alloc; 463 - p4d_populate(&init_mm, p4d, pud); 464 - } 465 - pud = pud_offset(p4d, address); 466 - 467 - if (pud_none(*pud)) { 468 - pmd = memblock_alloc_node(PAGE_SIZE, PAGE_SIZE, node); 469 - if (!pmd) 470 - goto err_alloc; 471 - pud_populate(&init_mm, pud, pmd); 472 - } 473 - pmd = pmd_offset(pud, address); 474 - 475 - if (pmd_none(*pmd)) { 476 - pte = memblock_alloc_node(PAGE_SIZE, PAGE_SIZE, node); 477 - if (!pte) 478 - goto err_alloc; 479 - pmd_populate_kernel(&init_mm, pmd, pte); 480 - } 481 - pte = pte_offset_kernel(pmd, address); 482 - 483 - if (pte_none(*pte)) { 484 - void *page = memblock_alloc_node(PAGE_SIZE, PAGE_SIZE, 485 - node); 486 - if (!page) 487 - goto err_alloc; 488 - set_pte(pte, pfn_pte(__pa(page) >> PAGE_SHIFT, 489 - PAGE_KERNEL)); 490 - } 491 - } 492 - return 0; 493 - 494 - err_alloc: 495 - panic("%s: 
Failed to allocate %lu bytes align=0x%lx nid=%d\n", 496 - __func__, PAGE_SIZE, PAGE_SIZE, node); 497 - return -ENOMEM; 498 - } 499 - 500 - struct memmap_init_callback_data { 501 - struct page *start; 502 - struct page *end; 503 - int nid; 504 - unsigned long zone; 505 - }; 506 - 507 - static int __meminit 508 - virtual_memmap_init(u64 start, u64 end, void *arg) 509 - { 510 - struct memmap_init_callback_data *args; 511 - struct page *map_start, *map_end; 512 - 513 - args = (struct memmap_init_callback_data *) arg; 514 - map_start = vmem_map + (__pa(start) >> PAGE_SHIFT); 515 - map_end = vmem_map + (__pa(end) >> PAGE_SHIFT); 516 - 517 - if (map_start < args->start) 518 - map_start = args->start; 519 - if (map_end > args->end) 520 - map_end = args->end; 521 - 522 - /* 523 - * We have to initialize "out of bounds" struct page elements that fit completely 524 - * on the same pages that were allocated for the "in bounds" elements because they 525 - * may be referenced later (and found to be "reserved"). 
526 - */ 527 - map_start -= ((unsigned long) map_start & (PAGE_SIZE - 1)) / sizeof(struct page); 528 - map_end += ((PAGE_ALIGN((unsigned long) map_end) - (unsigned long) map_end) 529 - / sizeof(struct page)); 530 - 531 - if (map_start < map_end) 532 - memmap_init_range((unsigned long)(map_end - map_start), 533 - args->nid, args->zone, page_to_pfn(map_start), page_to_pfn(map_end), 534 - MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); 535 - return 0; 536 - } 537 - 538 - void __meminit memmap_init_zone(struct zone *zone) 539 - { 540 - int nid = zone_to_nid(zone), zone_id = zone_idx(zone); 541 - unsigned long start_pfn = zone->zone_start_pfn; 542 - unsigned long size = zone->spanned_pages; 543 - 544 - if (!vmem_map) { 545 - memmap_init_range(size, nid, zone_id, start_pfn, start_pfn + size, 546 - MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); 547 - } else { 548 - struct page *start; 549 - struct memmap_init_callback_data args; 550 - 551 - start = pfn_to_page(start_pfn); 552 - args.start = start; 553 - args.end = start + size; 554 - args.nid = nid; 555 - args.zone = zone_id; 556 - 557 - efi_memmap_walk(virtual_memmap_init, &args); 558 - } 559 - } 560 - 561 - int 562 - ia64_pfn_valid (unsigned long pfn) 563 - { 564 - char byte; 565 - struct page *pg = pfn_to_page(pfn); 566 - 567 - return (__get_user(byte, (char __user *) pg) == 0) 568 - && ((((u64)pg & PAGE_MASK) == (((u64)(pg + 1) - 1) & PAGE_MASK)) 569 - || (__get_user(byte, (char __user *) (pg + 1) - 1) == 0)); 570 - } 571 - EXPORT_SYMBOL(ia64_pfn_valid); 572 - 573 - #endif /* CONFIG_VIRTUAL_MEM_MAP */ 574 - 575 376 int __init register_active_ranges(u64 start, u64 len, int nid) 576 377 { 577 378 u64 end = start + len; ··· 431 644 * _before_ any drivers that may need the PCI DMA interface are 432 645 * initialized or bootmem has been freed. 
433 646 */ 647 + do { 434 648 #ifdef CONFIG_INTEL_IOMMU 435 - detect_intel_iommu(); 436 - if (!iommu_detected) 649 + detect_intel_iommu(); 650 + if (iommu_detected) 651 + break; 437 652 #endif 438 653 #ifdef CONFIG_SWIOTLB 439 654 swiotlb_init(1); 440 655 #endif 656 + } while (0); 441 657 442 658 #ifdef CONFIG_FLATMEM 443 659 BUG_ON(!mem_map); ··· 449 659 set_max_mapnr(max_low_pfn); 450 660 high_memory = __va(max_low_pfn * PAGE_SIZE); 451 661 memblock_free_all(); 452 - mem_init_print_info(NULL); 453 662 454 663 /* 455 664 * For fsyscall entrpoints with no light-weight handler, use the ordinary
-1
arch/m68k/mm/init.c
··· 153 153 /* this will put all memory onto the freelists */ 154 154 memblock_free_all(); 155 155 init_pointer_tables(); 156 - mem_init_print_info(NULL); 157 156 }
-1
arch/microblaze/mm/init.c
··· 131 131 highmem_setup(); 132 132 #endif 133 133 134 - mem_init_print_info(NULL); 135 134 mem_init_done = 1; 136 135 } 137 136
+1
arch/mips/Kconfig
··· 16 16 select ARCH_SUPPORTS_UPROBES 17 17 select ARCH_USE_BUILTIN_BSWAP 18 18 select ARCH_USE_CMPXCHG_LOCKREF if 64BIT 19 + select ARCH_USE_MEMTEST 19 20 select ARCH_USE_QUEUED_RWLOCKS 20 21 select ARCH_USE_QUEUED_SPINLOCKS 21 22 select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
-1
arch/mips/loongson64/numa.c
··· 178 178 high_memory = (void *) __va(get_num_physpages() << PAGE_SHIFT); 179 179 memblock_free_all(); 180 180 setup_zero_pages(); /* This comes from node 0 */ 181 - mem_init_print_info(NULL); 182 181 } 183 182 184 183 /* All PCI device belongs to logical Node-0 */
+1
arch/mips/mm/cache.c
··· 15 15 #include <linux/syscalls.h> 16 16 #include <linux/mm.h> 17 17 #include <linux/highmem.h> 18 + #include <linux/pagemap.h> 18 19 19 20 #include <asm/cacheflush.h> 20 21 #include <asm/processor.h>
-1
arch/mips/mm/init.c
··· 467 467 memblock_free_all(); 468 468 setup_zero_pages(); /* Setup zeroed pages. */ 469 469 mem_init_free_highmem(); 470 - mem_init_print_info(NULL); 471 470 472 471 #ifdef CONFIG_64BIT 473 472 if ((unsigned long) &_text > (unsigned long) CKSEG0)
-1
arch/mips/sgi-ip27/ip27-memory.c
··· 420 420 high_memory = (void *) __va(get_num_physpages() << PAGE_SHIFT); 421 421 memblock_free_all(); 422 422 setup_zero_pages(); /* This comes from node 0 */ 423 - mem_init_print_info(NULL); 424 423 }
-1
arch/nds32/mm/init.c
··· 191 191 192 192 /* this will put all low memory onto the freelists */ 193 193 memblock_free_all(); 194 - mem_init_print_info(NULL); 195 194 196 195 pr_info("virtual kernel memory layout:\n" 197 196 " fixmap : 0x%08lx - 0x%08lx (%4ld kB)\n"
+1
arch/nios2/mm/cacheflush.c
··· 11 11 #include <linux/sched.h> 12 12 #include <linux/mm.h> 13 13 #include <linux/fs.h> 14 + #include <linux/pagemap.h> 14 15 15 16 #include <asm/cacheflush.h> 16 17 #include <asm/cpuinfo.h>
-1
arch/nios2/mm/init.c
··· 71 71 72 72 /* this will put all memory onto the freelists */ 73 73 memblock_free_all(); 74 - mem_init_print_info(NULL); 75 74 } 76 75 77 76 void __init mmu_init(void)
-2
arch/openrisc/mm/init.c
··· 211 211 /* this will put all low memory onto the freelists */ 212 212 memblock_free_all(); 213 213 214 - mem_init_print_info(NULL); 215 - 216 214 printk("mem_init_done ...........................................\n"); 217 215 mem_init_done = 1; 218 216 return;
-2
arch/parisc/mm/init.c
··· 573 573 #endif 574 574 parisc_vmalloc_start = SET_MAP_OFFSET(MAP_START); 575 575 576 - mem_init_print_info(NULL); 577 - 578 576 #if 0 579 577 /* 580 578 * Do not expose the virtual kernel memory layout to userspace.
+1
arch/powerpc/Kconfig
··· 151 151 select ARCH_SUPPORTS_DEBUG_PAGEALLOC if PPC32 || PPC_BOOK3S_64 152 152 select ARCH_USE_BUILTIN_BSWAP 153 153 select ARCH_USE_CMPXCHG_LOCKREF if PPC64 154 + select ARCH_USE_MEMTEST 154 155 select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS 155 156 select ARCH_USE_QUEUED_SPINLOCKS if PPC_QUEUED_SPINLOCKS 156 157 select ARCH_WANT_IPC_PARSE_VERSION
+20
arch/powerpc/include/asm/vmalloc.h
··· 1 1 #ifndef _ASM_POWERPC_VMALLOC_H 2 2 #define _ASM_POWERPC_VMALLOC_H 3 3 4 + #include <asm/mmu.h> 5 + #include <asm/page.h> 6 + 7 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 8 + 9 + #define arch_vmap_pud_supported arch_vmap_pud_supported 10 + static inline bool arch_vmap_pud_supported(pgprot_t prot) 11 + { 12 + /* HPT does not cope with large pages in the vmalloc area */ 13 + return radix_enabled(); 14 + } 15 + 16 + #define arch_vmap_pmd_supported arch_vmap_pmd_supported 17 + static inline bool arch_vmap_pmd_supported(pgprot_t prot) 18 + { 19 + return radix_enabled(); 20 + } 21 + 22 + #endif 23 + 4 24 #endif /* _ASM_POWERPC_VMALLOC_H */
+2 -2
arch/powerpc/kernel/isa-bridge.c
··· 48 48 if (slab_is_available()) { 49 49 if (ioremap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa, 50 50 pgprot_noncached(PAGE_KERNEL))) 51 - unmap_kernel_range(ISA_IO_BASE, size); 51 + vunmap_range(ISA_IO_BASE, ISA_IO_BASE + size); 52 52 } else { 53 53 early_ioremap_range(ISA_IO_BASE, pa, size, 54 54 pgprot_noncached(PAGE_KERNEL)); ··· 311 311 isa_bridge_pcidev = NULL; 312 312 313 313 /* Unmap the ISA area */ 314 - unmap_kernel_range(ISA_IO_BASE, 0x10000); 314 + vunmap_range(ISA_IO_BASE, ISA_IO_BASE + 0x10000); 315 315 } 316 316 317 317 /**
+1 -1
arch/powerpc/kernel/pci_64.c
··· 140 140 addr = (unsigned long)area->addr; 141 141 if (ioremap_page_range(addr, addr + size, paddr, 142 142 pgprot_noncached(PAGE_KERNEL))) { 143 - unmap_kernel_range(addr, size); 143 + vunmap_range(addr, addr + size); 144 144 return NULL; 145 145 } 146 146
-21
arch/powerpc/mm/book3s64/radix_pgtable.c
··· 1082 1082 set_pte_at(mm, addr, ptep, pte); 1083 1083 } 1084 1084 1085 - int __init arch_ioremap_pud_supported(void) 1086 - { 1087 - /* HPT does not cope with large pages in the vmalloc area */ 1088 - return radix_enabled(); 1089 - } 1090 - 1091 - int __init arch_ioremap_pmd_supported(void) 1092 - { 1093 - return radix_enabled(); 1094 - } 1095 - 1096 - int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) 1097 - { 1098 - return 0; 1099 - } 1100 - 1101 1085 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) 1102 1086 { 1103 1087 pte_t *ptep = (pte_t *)pud; ··· 1164 1180 pte_free_kernel(&init_mm, pte); 1165 1181 1166 1182 return 1; 1167 - } 1168 - 1169 - int __init arch_ioremap_p4d_supported(void) 1170 - { 1171 - return 0; 1172 1183 }
+1 -1
arch/powerpc/mm/ioremap.c
··· 93 93 if (!ret) 94 94 return (void __iomem *)area->addr + offset; 95 95 96 - unmap_kernel_range(va, size); 96 + vunmap_range(va, va + size); 97 97 free_vm_area(area); 98 98 99 99 return NULL;
-1
arch/powerpc/mm/mem.c
··· 282 282 (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) - 1; 283 283 #endif 284 284 285 - mem_init_print_info(NULL); 286 285 #ifdef CONFIG_PPC32 287 286 pr_info("Kernel virtual memory layout:\n"); 288 287 #ifdef CONFIG_KASAN
-4
arch/powerpc/sysdev/xive/common.c
··· 990 990 void xive_cleanup_irq_data(struct xive_irq_data *xd) 991 991 { 992 992 if (xd->eoi_mmio) { 993 - unmap_kernel_range((unsigned long)xd->eoi_mmio, 994 - 1u << xd->esb_shift); 995 993 iounmap(xd->eoi_mmio); 996 994 if (xd->eoi_mmio == xd->trig_mmio) 997 995 xd->trig_mmio = NULL; 998 996 xd->eoi_mmio = NULL; 999 997 } 1000 998 if (xd->trig_mmio) { 1001 - unmap_kernel_range((unsigned long)xd->trig_mmio, 1002 - 1u << xd->esb_shift); 1003 999 iounmap(xd->trig_mmio); 1004 1000 xd->trig_mmio = NULL; 1005 1001 }
-1
arch/riscv/mm/init.c
··· 102 102 high_memory = (void *)(__va(PFN_PHYS(max_low_pfn))); 103 103 memblock_free_all(); 104 104 105 - mem_init_print_info(NULL); 106 105 print_vm_layout(); 107 106 } 108 107
-2
arch/s390/mm/init.c
··· 209 209 setup_zero_pages(); /* Setup zeroed pages. */ 210 210 211 211 cmma_init_nodat(); 212 - 213 - mem_init_print_info(NULL); 214 212 } 215 213 216 214 void free_initmem(void)
+2 -8
arch/sh/include/asm/tlb.h
··· 4 4 5 5 #ifndef __ASSEMBLY__ 6 6 #include <linux/pagemap.h> 7 + #include <asm-generic/tlb.h> 7 8 8 9 #ifdef CONFIG_MMU 9 10 #include <linux/swap.h> 10 - 11 - #include <asm-generic/tlb.h> 12 11 13 12 #if defined(CONFIG_CPU_SH4) 14 13 extern void tlb_wire_entry(struct vm_area_struct *, unsigned long, pte_t); ··· 23 24 { 24 25 BUG(); 25 26 } 26 - #endif 27 - 28 - #else /* CONFIG_MMU */ 29 - 30 - #include <asm-generic/tlb.h> 31 - 27 + #endif /* CONFIG_CPU_SH4 */ 32 28 #endif /* CONFIG_MMU */ 33 29 #endif /* __ASSEMBLY__ */ 34 30 #endif /* __ASM_SH_TLB_H */
+1
arch/sh/mm/cache-sh4.c
··· 16 16 #include <linux/mutex.h> 17 17 #include <linux/fs.h> 18 18 #include <linux/highmem.h> 19 + #include <linux/pagemap.h> 19 20 #include <asm/mmu_context.h> 20 21 #include <asm/cache_insns.h> 21 22 #include <asm/cacheflush.h>
+1
arch/sh/mm/cache-sh7705.c
··· 13 13 #include <linux/mman.h> 14 14 #include <linux/mm.h> 15 15 #include <linux/fs.h> 16 + #include <linux/pagemap.h> 16 17 #include <linux/threads.h> 17 18 #include <asm/addrspace.h> 18 19 #include <asm/page.h>
-1
arch/sh/mm/init.c
··· 359 359 360 360 vsyscall_init(); 361 361 362 - mem_init_print_info(NULL); 363 362 pr_info("virtual kernel memory layout:\n" 364 363 " fixmap : 0x%08lx - 0x%08lx (%4ld kB)\n" 365 364 " vmalloc : 0x%08lx - 0x%08lx (%4ld MB)\n"
+3
arch/sparc/include/asm/pgtable_32.h
··· 321 321 pgprot_val(newprot)); 322 322 } 323 323 324 + /* only used by the huge vmap code, should never be called */ 325 + #define pud_page(pud) NULL 326 + 324 327 struct seq_file; 325 328 void mmu_info(struct seq_file *m); 326 329
-2
arch/sparc/mm/init_32.c
··· 292 292 293 293 map_high_region(start_pfn, end_pfn); 294 294 } 295 - 296 - mem_init_print_info(NULL); 297 295 } 298 296 299 297 void sparc_flush_page_to_ram(struct page *page)
-1
arch/sparc/mm/init_64.c
··· 2520 2520 } 2521 2521 mark_page_reserved(mem_map_zero); 2522 2522 2523 - mem_init_print_info(NULL); 2524 2523 2525 2524 if (tlb_type == cheetah || tlb_type == cheetah_plus) 2526 2525 cheetah_ecache_flush_init();
+1
arch/sparc/mm/tlb.c
··· 9 9 #include <linux/mm.h> 10 10 #include <linux/swap.h> 11 11 #include <linux/preempt.h> 12 + #include <linux/pagemap.h> 12 13 13 14 #include <asm/tlbflush.h> 14 15 #include <asm/cacheflush.h>
-1
arch/um/kernel/mem.c
··· 54 54 memblock_free_all(); 55 55 max_low_pfn = totalram_pages(); 56 56 max_pfn = max_low_pfn; 57 - mem_init_print_info(NULL); 58 57 kmalloc_ok = 1; 59 58 } 60 59
+1
arch/x86/Kconfig
··· 100 100 select ARCH_SUPPORTS_LTO_CLANG if X86_64 101 101 select ARCH_SUPPORTS_LTO_CLANG_THIN if X86_64 102 102 select ARCH_USE_BUILTIN_BSWAP 103 + select ARCH_USE_MEMTEST 103 104 select ARCH_USE_QUEUED_RWLOCKS 104 105 select ARCH_USE_QUEUED_SPINLOCKS 105 106 select ARCH_USE_SYM_ANNOTATIONS
+20
arch/x86/include/asm/vmalloc.h
··· 1 1 #ifndef _ASM_X86_VMALLOC_H 2 2 #define _ASM_X86_VMALLOC_H 3 3 4 + #include <asm/cpufeature.h> 5 + #include <asm/page.h> 4 6 #include <asm/pgtable_areas.h> 7 + 8 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 9 + 10 + #ifdef CONFIG_X86_64 11 + #define arch_vmap_pud_supported arch_vmap_pud_supported 12 + static inline bool arch_vmap_pud_supported(pgprot_t prot) 13 + { 14 + return boot_cpu_has(X86_FEATURE_GBPAGES); 15 + } 16 + #endif 17 + 18 + #define arch_vmap_pmd_supported arch_vmap_pmd_supported 19 + static inline bool arch_vmap_pmd_supported(pgprot_t prot) 20 + { 21 + return boot_cpu_has(X86_FEATURE_PSE); 22 + } 23 + 24 + #endif 5 25 6 26 #endif /* _ASM_X86_VMALLOC_H */
+1 -1
arch/x86/kernel/cpu/resctrl/pseudo_lock.c
··· 1458 1458 return 0; 1459 1459 } 1460 1460 1461 - static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags) 1461 + static int pseudo_lock_dev_mremap(struct vm_area_struct *area) 1462 1462 { 1463 1463 /* Not supported */ 1464 1464 return -EINVAL;
-2
arch/x86/mm/init_32.c
··· 755 755 after_bootmem = 1; 756 756 x86_init.hyper.init_after_bootmem(); 757 757 758 - mem_init_print_info(NULL); 759 - 760 758 /* 761 759 * Check boundaries twice: Some fundamental inconsistencies can 762 760 * be detected at build time already.
+130 -78
arch/x86/mm/init_64.c
··· 826 826 zone_sizes_init(); 827 827 } 828 828 829 + #ifdef CONFIG_SPARSEMEM_VMEMMAP 830 + #define PAGE_UNUSED 0xFD 831 + 832 + /* 833 + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED), ranges 834 + * from unused_pmd_start to next PMD_SIZE boundary. 835 + */ 836 + static unsigned long unused_pmd_start __meminitdata; 837 + 838 + static void __meminit vmemmap_flush_unused_pmd(void) 839 + { 840 + if (!unused_pmd_start) 841 + return; 842 + /* 843 + * Clears (unused_pmd_start, PMD_END] 844 + */ 845 + memset((void *)unused_pmd_start, PAGE_UNUSED, 846 + ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start); 847 + unused_pmd_start = 0; 848 + } 849 + 850 + #ifdef CONFIG_MEMORY_HOTPLUG 851 + /* Returns true if the PMD is completely unused and thus it can be freed */ 852 + static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end) 853 + { 854 + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE); 855 + 856 + /* 857 + * Flush the unused range cache to ensure that memchr_inv() will work 858 + * for the whole range. 859 + */ 860 + vmemmap_flush_unused_pmd(); 861 + memset((void *)addr, PAGE_UNUSED, end - addr); 862 + 863 + return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE); 864 + } 865 + #endif 866 + 867 + static void __meminit __vmemmap_use_sub_pmd(unsigned long start) 868 + { 869 + /* 870 + * As we expect to add in the same granularity as we remove, it's 871 + * sufficient to mark only some piece used to block the memmap page from 872 + * getting removed when removing some other adjacent memmap (just in 873 + * case the first memmap never gets initialized e.g., because the memory 874 + * block never gets onlined). 
875 + */ 876 + memset((void *)start, 0, sizeof(struct page)); 877 + } 878 + 879 + static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long end) 880 + { 881 + /* 882 + * We only optimize if the new used range directly follows the 883 + * previously unused range (esp., when populating consecutive sections). 884 + */ 885 + if (unused_pmd_start == start) { 886 + if (likely(IS_ALIGNED(end, PMD_SIZE))) 887 + unused_pmd_start = 0; 888 + else 889 + unused_pmd_start = end; 890 + return; 891 + } 892 + 893 + /* 894 + * If the range does not contiguously follows previous one, make sure 895 + * to mark the unused range of the previous one so it can be removed. 896 + */ 897 + vmemmap_flush_unused_pmd(); 898 + __vmemmap_use_sub_pmd(start); 899 + } 900 + 901 + 902 + static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned long end) 903 + { 904 + vmemmap_flush_unused_pmd(); 905 + 906 + /* 907 + * Could be our memmap page is filled with PAGE_UNUSED already from a 908 + * previous remove. Make sure to reset it. 909 + */ 910 + __vmemmap_use_sub_pmd(start); 911 + 912 + /* 913 + * Mark with PAGE_UNUSED the unused parts of the new memmap range 914 + */ 915 + if (!IS_ALIGNED(start, PMD_SIZE)) 916 + memset((void *)start, PAGE_UNUSED, 917 + start - ALIGN_DOWN(start, PMD_SIZE)); 918 + 919 + /* 920 + * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of 921 + * consecutive sections. Remember for the last added PMD where the 922 + * unused range begins. 
923 + */ 924 + if (!IS_ALIGNED(end, PMD_SIZE)) 925 + unused_pmd_start = end; 926 + } 927 + #endif 928 + 829 929 /* 830 930 * Memory hotplug specific functions 831 931 */ ··· 970 870 971 871 return add_pages(nid, start_pfn, nr_pages, params); 972 872 } 973 - 974 - #define PAGE_INUSE 0xFD 975 873 976 874 static void __meminit free_pagetable(struct page *page, int order) 977 875 { ··· 1060 962 { 1061 963 unsigned long next, pages = 0; 1062 964 pte_t *pte; 1063 - void *page_addr; 1064 965 phys_addr_t phys_addr; 1065 966 1066 967 pte = pte_start + pte_index(addr); ··· 1080 983 if (phys_addr < (phys_addr_t)0x40000000) 1081 984 return; 1082 985 1083 - if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) { 1084 - /* 1085 - * Do not free direct mapping pages since they were 1086 - * freed when offlining, or simply not in use. 1087 - */ 1088 - if (!direct) 1089 - free_pagetable(pte_page(*pte), 0); 986 + if (!direct) 987 + free_pagetable(pte_page(*pte), 0); 1090 988 1091 - spin_lock(&init_mm.page_table_lock); 1092 - pte_clear(&init_mm, addr, pte); 1093 - spin_unlock(&init_mm.page_table_lock); 989 + spin_lock(&init_mm.page_table_lock); 990 + pte_clear(&init_mm, addr, pte); 991 + spin_unlock(&init_mm.page_table_lock); 1094 992 1095 - /* For non-direct mapping, pages means nothing. */ 1096 - pages++; 1097 - } else { 1098 - /* 1099 - * If we are here, we are freeing vmemmap pages since 1100 - * direct mapped memory ranges to be freed are aligned. 1101 - * 1102 - * If we are not removing the whole page, it means 1103 - * other page structs in this page are being used and 1104 - * we cannot remove them. So fill the unused page_structs 1105 - * with 0xFD, and remove the page when it is wholly 1106 - * filled with 0xFD. 
1107 - */ 1108 - memset((void *)addr, PAGE_INUSE, next - addr); 1109 - 1110 - page_addr = page_address(pte_page(*pte)); 1111 - if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) { 1112 - free_pagetable(pte_page(*pte), 0); 1113 - 1114 - spin_lock(&init_mm.page_table_lock); 1115 - pte_clear(&init_mm, addr, pte); 1116 - spin_unlock(&init_mm.page_table_lock); 1117 - } 1118 - } 993 + /* For non-direct mapping, pages means nothing. */ 994 + pages++; 1119 995 } 1120 996 1121 997 /* Call free_pte_table() in remove_pmd_table(). */ ··· 1104 1034 unsigned long next, pages = 0; 1105 1035 pte_t *pte_base; 1106 1036 pmd_t *pmd; 1107 - void *page_addr; 1108 1037 1109 1038 pmd = pmd_start + pmd_index(addr); 1110 1039 for (; addr < end; addr = next, pmd++) { ··· 1123 1054 pmd_clear(pmd); 1124 1055 spin_unlock(&init_mm.page_table_lock); 1125 1056 pages++; 1126 - } else { 1127 - /* If here, we are freeing vmemmap pages. */ 1128 - memset((void *)addr, PAGE_INUSE, next - addr); 1129 - 1130 - page_addr = page_address(pmd_page(*pmd)); 1131 - if (!memchr_inv(page_addr, PAGE_INUSE, 1132 - PMD_SIZE)) { 1057 + } 1058 + #ifdef CONFIG_SPARSEMEM_VMEMMAP 1059 + else if (vmemmap_pmd_is_unused(addr, next)) { 1133 1060 free_hugepage_table(pmd_page(*pmd), 1134 1061 altmap); 1135 - 1136 1062 spin_lock(&init_mm.page_table_lock); 1137 1063 pmd_clear(pmd); 1138 1064 spin_unlock(&init_mm.page_table_lock); 1139 - } 1140 1065 } 1141 - 1066 + #endif 1142 1067 continue; 1143 1068 } 1144 1069 ··· 1153 1090 unsigned long next, pages = 0; 1154 1091 pmd_t *pmd_base; 1155 1092 pud_t *pud; 1156 - void *page_addr; 1157 1093 1158 1094 pud = pud_start + pud_index(addr); 1159 1095 for (; addr < end; addr = next, pud++) { ··· 1161 1099 if (!pud_present(*pud)) 1162 1100 continue; 1163 1101 1164 - if (pud_large(*pud)) { 1165 - if (IS_ALIGNED(addr, PUD_SIZE) && 1166 - IS_ALIGNED(next, PUD_SIZE)) { 1167 - if (!direct) 1168 - free_pagetable(pud_page(*pud), 1169 - get_order(PUD_SIZE)); 1170 - 1171 - 
spin_lock(&init_mm.page_table_lock); 1172 - pud_clear(pud); 1173 - spin_unlock(&init_mm.page_table_lock); 1174 - pages++; 1175 - } else { 1176 - /* If here, we are freeing vmemmap pages. */ 1177 - memset((void *)addr, PAGE_INUSE, next - addr); 1178 - 1179 - page_addr = page_address(pud_page(*pud)); 1180 - if (!memchr_inv(page_addr, PAGE_INUSE, 1181 - PUD_SIZE)) { 1182 - free_pagetable(pud_page(*pud), 1183 - get_order(PUD_SIZE)); 1184 - 1185 - spin_lock(&init_mm.page_table_lock); 1186 - pud_clear(pud); 1187 - spin_unlock(&init_mm.page_table_lock); 1188 - } 1189 - } 1190 - 1102 + if (pud_large(*pud) && 1103 + IS_ALIGNED(addr, PUD_SIZE) && 1104 + IS_ALIGNED(next, PUD_SIZE)) { 1105 + spin_lock(&init_mm.page_table_lock); 1106 + pud_clear(pud); 1107 + spin_unlock(&init_mm.page_table_lock); 1108 + pages++; 1191 1109 continue; 1192 1110 } 1193 1111 ··· 1239 1197 void __ref vmemmap_free(unsigned long start, unsigned long end, 1240 1198 struct vmem_altmap *altmap) 1241 1199 { 1200 + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); 1201 + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); 1202 + 1242 1203 remove_pagetable(start, end, false, altmap); 1243 1204 } 1244 1205 ··· 1351 1306 kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER); 1352 1307 1353 1308 preallocate_vmalloc_pages(); 1354 - 1355 - mem_init_print_info(NULL); 1356 1309 } 1357 1310 1358 1311 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT ··· 1581 1538 1582 1539 addr_end = addr + PMD_SIZE; 1583 1540 p_end = p + PMD_SIZE; 1541 + 1542 + if (!IS_ALIGNED(addr, PMD_SIZE) || 1543 + !IS_ALIGNED(next, PMD_SIZE)) 1544 + vmemmap_use_new_sub_pmd(addr, next); 1545 + 1584 1546 continue; 1585 1547 } else if (altmap) 1586 1548 return -ENOMEM; /* no fallback */ 1587 1549 } else if (pmd_large(*pmd)) { 1588 1550 vmemmap_verify((pte_t *)pmd, node, addr, next); 1551 + vmemmap_use_sub_pmd(addr, next); 1589 1552 continue; 1590 1553 } 1591 1554 if (vmemmap_populate_basepages(addr, next, node, NULL)) ··· 1604 1555 struct vmem_altmap 
*altmap) 1605 1556 { 1606 1557 int err; 1558 + 1559 + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); 1560 + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); 1607 1561 1608 1562 if (end - start < PAGES_PER_SECTION * sizeof(struct page)) 1609 1563 err = vmemmap_populate_basepages(start, end, node, NULL);
-19
arch/x86/mm/ioremap.c
··· 481 481 } 482 482 EXPORT_SYMBOL(iounmap); 483 483 484 - int __init arch_ioremap_p4d_supported(void) 485 - { 486 - return 0; 487 - } 488 - 489 - int __init arch_ioremap_pud_supported(void) 490 - { 491 - #ifdef CONFIG_X86_64 492 - return boot_cpu_has(X86_FEATURE_GBPAGES); 493 - #else 494 - return 0; 495 - #endif 496 - } 497 - 498 - int __init arch_ioremap_pmd_supported(void) 499 - { 500 - return boot_cpu_has(X86_FEATURE_PSE); 501 - } 502 - 503 484 /* 504 485 * Convert a physical pointer to a virtual kernel pointer for /dev/mem 505 486 * access
-13
arch/x86/mm/pgtable.c
··· 780 780 return 0; 781 781 } 782 782 783 - /* 784 - * Until we support 512GB pages, skip them in the vmap area. 785 - */ 786 - int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) 787 - { 788 - return 0; 789 - } 790 - 791 783 #ifdef CONFIG_X86_64 792 784 /** 793 785 * pud_free_pmd_page - Clear pud entry and free pmd page. ··· 852 860 } 853 861 854 862 #else /* !CONFIG_X86_64 */ 855 - 856 - int pud_free_pmd_page(pud_t *pud, unsigned long addr) 857 - { 858 - return pud_none(*pud); 859 - } 860 863 861 864 /* 862 865 * Disable free page handling on x86-PAE. This assures that ioremap()
+1
arch/xtensa/Kconfig
··· 7 7 select ARCH_HAS_SYNC_DMA_FOR_CPU if MMU 8 8 select ARCH_HAS_SYNC_DMA_FOR_DEVICE if MMU 9 9 select ARCH_HAS_DMA_SET_UNCACHED if MMU 10 + select ARCH_USE_MEMTEST 10 11 select ARCH_USE_QUEUED_RWLOCKS 11 12 select ARCH_USE_QUEUED_SPINLOCKS 12 13 select ARCH_WANT_FRAME_POINTERS
-1
arch/xtensa/mm/init.c
··· 119 119 120 120 memblock_free_all(); 121 121 122 - mem_init_print_info(NULL); 123 122 pr_info("virtual kernel memory layout:\n" 124 123 #ifdef CONFIG_KASAN 125 124 " kasan : 0x%08lx - 0x%08lx (%5lu MB)\n"
+11 -6
block/blk-cgroup.c
··· 764 764 struct blkcg *blkcg = css_to_blkcg(css); 765 765 struct blkcg_gq *blkg; 766 766 767 + /* Root-level stats are sourced from system-wide IO stats */ 768 + if (!cgroup_parent(css->cgroup)) 769 + return; 770 + 767 771 rcu_read_lock(); 768 772 769 773 hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) { ··· 790 786 blkg_iostat_add(&bisc->last, &delta); 791 787 u64_stats_update_end(&blkg->iostat.sync); 792 788 793 - /* propagate global delta to parent */ 794 - if (parent) { 789 + /* propagate global delta to parent (unless that's root) */ 790 + if (parent && parent->parent) { 795 791 u64_stats_update_begin(&parent->iostat.sync); 796 792 blkg_iostat_set(&delta, &blkg->iostat.cur); 797 793 blkg_iostat_sub(&delta, &blkg->iostat.last); ··· 805 801 } 806 802 807 803 /* 808 - * The rstat algorithms intentionally don't handle the root cgroup to avoid 809 - * incurring overhead when no cgroups are defined. For that reason, 810 - * cgroup_rstat_flush in blkcg_print_stat does not actually fill out the 811 - * iostat in the root cgroup's blkcg_gq. 804 + * We source root cgroup stats from the system-wide stats to avoid 805 + * tracking the same information twice and incurring overhead when no 806 + * cgroups are defined. For that reason, cgroup_rstat_flush in 807 + * blkcg_print_stat does not actually fill out the iostat in the root 808 + * cgroup's blkcg_gq. 812 809 * 813 810 * However, we would like to re-use the printing code between the root and 814 811 * non-root cgroups to the extent possible. For that reason, we simulate
+1
drivers/gpu/drm/i915/Kconfig
··· 20 20 select INPUT if ACPI 21 21 select ACPI_VIDEO if ACPI 22 22 select ACPI_BUTTON if ACPI 23 + select IO_MAPPING 23 24 select SYNC_FILE 24 25 select IOSF_MBI 25 26 select CRC32
+4 -5
drivers/gpu/drm/i915/gem/i915_gem_mman.c
··· 367 367 goto err_unpin; 368 368 369 369 /* Finally, remap it using the new GTT offset */ 370 - ret = remap_io_mapping(area, 371 - area->vm_start + (vma->ggtt_view.partial.offset << PAGE_SHIFT), 372 - (ggtt->gmadr.start + vma->node.start) >> PAGE_SHIFT, 373 - min_t(u64, vma->size, area->vm_end - area->vm_start), 374 - &ggtt->iomap); 370 + ret = io_mapping_map_user(&ggtt->iomap, area, area->vm_start + 371 + (vma->ggtt_view.partial.offset << PAGE_SHIFT), 372 + (ggtt->gmadr.start + vma->node.start) >> PAGE_SHIFT, 373 + min_t(u64, vma->size, area->vm_end - area->vm_start)); 375 374 if (ret) 376 375 goto err_fence; 377 376
-3
drivers/gpu/drm/i915/i915_drv.h
··· 1905 1905 struct drm_file *file); 1906 1906 1907 1907 /* i915_mm.c */ 1908 - int remap_io_mapping(struct vm_area_struct *vma, 1909 - unsigned long addr, unsigned long pfn, unsigned long size, 1910 - struct io_mapping *iomap); 1911 1908 int remap_io_sg(struct vm_area_struct *vma, 1912 1909 unsigned long addr, unsigned long size, 1913 1910 struct scatterlist *sgl, resource_size_t iobase);
+22 -93
drivers/gpu/drm/i915/i915_mm.c
··· 28 28 29 29 #include "i915_drv.h" 30 30 31 - struct remap_pfn { 32 - struct mm_struct *mm; 33 - unsigned long pfn; 34 - pgprot_t prot; 35 - 36 - struct sgt_iter sgt; 37 - resource_size_t iobase; 38 - }; 39 - 40 - static int remap_pfn(pte_t *pte, unsigned long addr, void *data) 41 - { 42 - struct remap_pfn *r = data; 43 - 44 - /* Special PTE are not associated with any struct page */ 45 - set_pte_at(r->mm, addr, pte, pte_mkspecial(pfn_pte(r->pfn, r->prot))); 46 - r->pfn++; 47 - 48 - return 0; 49 - } 31 + #define EXPECTED_FLAGS (VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP) 50 32 51 33 #define use_dma(io) ((io) != -1) 52 - 53 - static inline unsigned long sgt_pfn(const struct remap_pfn *r) 54 - { 55 - if (use_dma(r->iobase)) 56 - return (r->sgt.dma + r->sgt.curr + r->iobase) >> PAGE_SHIFT; 57 - else 58 - return r->sgt.pfn + (r->sgt.curr >> PAGE_SHIFT); 59 - } 60 - 61 - static int remap_sg(pte_t *pte, unsigned long addr, void *data) 62 - { 63 - struct remap_pfn *r = data; 64 - 65 - if (GEM_WARN_ON(!r->sgt.sgp)) 66 - return -EINVAL; 67 - 68 - /* Special PTE are not associated with any struct page */ 69 - set_pte_at(r->mm, addr, pte, 70 - pte_mkspecial(pfn_pte(sgt_pfn(r), r->prot))); 71 - r->pfn++; /* track insertions in case we need to unwind later */ 72 - 73 - r->sgt.curr += PAGE_SIZE; 74 - if (r->sgt.curr >= r->sgt.max) 75 - r->sgt = __sgt_iter(__sg_next(r->sgt.sgp), use_dma(r->iobase)); 76 - 77 - return 0; 78 - } 79 - 80 - /** 81 - * remap_io_mapping - remap an IO mapping to userspace 82 - * @vma: user vma to map to 83 - * @addr: target user address to start at 84 - * @pfn: physical address of kernel memory 85 - * @size: size of map area 86 - * @iomap: the source io_mapping 87 - * 88 - * Note: this is only safe if the mm semaphore is held when called. 
89 - */ 90 - int remap_io_mapping(struct vm_area_struct *vma, 91 - unsigned long addr, unsigned long pfn, unsigned long size, 92 - struct io_mapping *iomap) 93 - { 94 - struct remap_pfn r; 95 - int err; 96 - 97 - #define EXPECTED_FLAGS (VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP) 98 - GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS); 99 - 100 - /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ 101 - r.mm = vma->vm_mm; 102 - r.pfn = pfn; 103 - r.prot = __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) | 104 - (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK)); 105 - 106 - err = apply_to_page_range(r.mm, addr, size, remap_pfn, &r); 107 - if (unlikely(err)) { 108 - zap_vma_ptes(vma, addr, (r.pfn - pfn) << PAGE_SHIFT); 109 - return err; 110 - } 111 - 112 - return 0; 113 - } 114 34 115 35 /** 116 36 * remap_io_sg - remap an IO mapping to userspace ··· 46 126 unsigned long addr, unsigned long size, 47 127 struct scatterlist *sgl, resource_size_t iobase) 48 128 { 49 - struct remap_pfn r = { 50 - .mm = vma->vm_mm, 51 - .prot = vma->vm_page_prot, 52 - .sgt = __sgt_iter(sgl, use_dma(iobase)), 53 - .iobase = iobase, 54 - }; 129 + unsigned long pfn, len, remapped = 0; 55 130 int err; 56 131 57 132 /* We rely on prevalidation of the io-mapping to skip track_pfn(). 
*/ ··· 55 140 if (!use_dma(iobase)) 56 141 flush_cache_range(vma, addr, size); 57 142 58 - err = apply_to_page_range(r.mm, addr, size, remap_sg, &r); 59 - if (unlikely(err)) { 60 - zap_vma_ptes(vma, addr, r.pfn << PAGE_SHIFT); 61 - return err; 62 - } 143 + do { 144 + if (use_dma(iobase)) { 145 + if (!sg_dma_len(sgl)) 146 + break; 147 + pfn = (sg_dma_address(sgl) + iobase) >> PAGE_SHIFT; 148 + len = sg_dma_len(sgl); 149 + } else { 150 + pfn = page_to_pfn(sg_page(sgl)); 151 + len = sgl->length; 152 + } 63 153 64 - return 0; 154 + err = remap_pfn_range(vma, addr + remapped, pfn, len, 155 + vma->vm_page_prot); 156 + if (err) 157 + break; 158 + remapped += len; 159 + } while ((sgl = __sg_next(sgl))); 160 + 161 + if (err) 162 + zap_vma_ptes(vma, addr, remapped); 163 + return err; 65 164 }
+6 -6
drivers/infiniband/core/umem.c
··· 47 47 48 48 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty) 49 49 { 50 - struct sg_page_iter sg_iter; 51 - struct page *page; 50 + bool make_dirty = umem->writable && dirty; 51 + struct scatterlist *sg; 52 + unsigned int i; 52 53 53 54 if (umem->nmap > 0) 54 55 ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents, 55 56 DMA_BIDIRECTIONAL); 56 57 57 - for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) { 58 - page = sg_page_iter_page(&sg_iter); 59 - unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty); 60 - } 58 + for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i) 59 + unpin_user_page_range_dirty_lock(sg_page(sg), 60 + DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty); 61 61 62 62 sg_free_table(&umem->sg_head); 63 63 }
+1 -1
drivers/pci/pci.c
··· 4102 4102 #if defined(PCI_IOBASE) && defined(CONFIG_MMU) 4103 4103 unsigned long vaddr = (unsigned long)PCI_IOBASE + res->start; 4104 4104 4105 - unmap_kernel_range(vaddr, resource_size(res)); 4105 + vunmap_range(vaddr, vaddr + resource_size(res)); 4106 4106 #endif 4107 4107 } 4108 4108 EXPORT_SYMBOL(pci_unmap_iospace);
+1 -4
fs/aio.c
··· 323 323 } 324 324 } 325 325 326 - static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags) 326 + static int aio_ring_mremap(struct vm_area_struct *vma) 327 327 { 328 328 struct file *file = vma->vm_file; 329 329 struct mm_struct *mm = vma->vm_mm; 330 330 struct kioctx_table *table; 331 331 int i, res = -EINVAL; 332 - 333 - if (flags & MREMAP_DONTUNMAP) 334 - return -EINVAL; 335 332 336 333 spin_lock(&mm->ioctx_lock); 337 334 rcu_read_lock();
+1 -1
fs/fs_parser.c
··· 310 310 #ifdef CONFIG_VALIDATE_FS_PARSER 311 311 /** 312 312 * validate_constant_table - Validate a constant table 313 - * @name: Name to use in reporting 314 313 * @tbl: The constant table to validate. 315 314 * @tbl_size: The size of the table. 316 315 * @low: The lowest permissible value. ··· 359 360 360 361 /** 361 362 * fs_validate_description - Validate a parameter description 363 + * @name: The parameter name to search for. 362 364 * @desc: The parameter description to validate. 363 365 */ 364 366 bool fs_validate_description(const char *name,
+16 -8
fs/iomap/direct-io.c
··· 487 487 if (pos >= dio->i_size) 488 488 goto out_free_dio; 489 489 490 + if (iocb->ki_flags & IOCB_NOWAIT) { 491 + if (filemap_range_needs_writeback(mapping, pos, end)) { 492 + ret = -EAGAIN; 493 + goto out_free_dio; 494 + } 495 + iomap_flags |= IOMAP_NOWAIT; 496 + } 497 + 490 498 if (iter_is_iovec(iter)) 491 499 dio->flags |= IOMAP_DIO_DIRTY; 492 500 } else { 493 501 iomap_flags |= IOMAP_WRITE; 494 502 dio->flags |= IOMAP_DIO_WRITE; 503 + 504 + if (iocb->ki_flags & IOCB_NOWAIT) { 505 + if (filemap_range_has_page(mapping, pos, end)) { 506 + ret = -EAGAIN; 507 + goto out_free_dio; 508 + } 509 + iomap_flags |= IOMAP_NOWAIT; 510 + } 495 511 496 512 /* for data sync or sync, we need sync completion processing */ 497 513 if (iocb->ki_flags & IOCB_DSYNC) ··· 521 505 */ 522 506 if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC) 523 507 dio->flags |= IOMAP_DIO_WRITE_FUA; 524 - } 525 - 526 - if (iocb->ki_flags & IOCB_NOWAIT) { 527 - if (filemap_range_has_page(mapping, pos, end)) { 528 - ret = -EAGAIN; 529 - goto out_free_dio; 530 - } 531 - iomap_flags |= IOMAP_NOWAIT; 532 508 } 533 509 534 510 if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
+1 -1
fs/ocfs2/blockcheck.c
··· 229 229 *val = *(u64 *)data; 230 230 return 0; 231 231 } 232 - DEFINE_SIMPLE_ATTRIBUTE(blockcheck_fops, blockcheck_u64_get, NULL, "%llu\n"); 232 + DEFINE_DEBUGFS_ATTRIBUTE(blockcheck_fops, blockcheck_u64_get, NULL, "%llu\n"); 233 233 234 234 static void ocfs2_blockcheck_debug_remove(struct ocfs2_blockcheck_stats *stats) 235 235 {
-7
fs/ocfs2/dlm/dlmrecovery.c
··· 126 126 dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM); 127 127 } 128 128 129 - static inline void dlm_reset_recovery(struct dlm_ctxt *dlm) 130 - { 131 - spin_lock(&dlm->spinlock); 132 - __dlm_reset_recovery(dlm); 133 - spin_unlock(&dlm->spinlock); 134 - } 135 - 136 129 /* Worker function used during recovery. */ 137 130 void dlm_dispatch_work(struct work_struct *work) 138 131 {
+18 -18
fs/ocfs2/stack_o2cb.c
··· 59 59 return mode; 60 60 } 61 61 62 - #define map_flag(_generic, _o2dlm) \ 63 - if (flags & (_generic)) { \ 64 - flags &= ~(_generic); \ 65 - o2dlm_flags |= (_o2dlm); \ 66 - } 67 62 static int flags_to_o2dlm(u32 flags) 68 63 { 69 64 int o2dlm_flags = 0; 70 65 71 - map_flag(DLM_LKF_NOQUEUE, LKM_NOQUEUE); 72 - map_flag(DLM_LKF_CANCEL, LKM_CANCEL); 73 - map_flag(DLM_LKF_CONVERT, LKM_CONVERT); 74 - map_flag(DLM_LKF_VALBLK, LKM_VALBLK); 75 - map_flag(DLM_LKF_IVVALBLK, LKM_INVVALBLK); 76 - map_flag(DLM_LKF_ORPHAN, LKM_ORPHAN); 77 - map_flag(DLM_LKF_FORCEUNLOCK, LKM_FORCE); 78 - map_flag(DLM_LKF_TIMEOUT, LKM_TIMEOUT); 79 - map_flag(DLM_LKF_LOCAL, LKM_LOCAL); 80 - 81 - /* map_flag() should have cleared every flag passed in */ 82 - BUG_ON(flags != 0); 66 + if (flags & DLM_LKF_NOQUEUE) 67 + o2dlm_flags |= LKM_NOQUEUE; 68 + if (flags & DLM_LKF_CANCEL) 69 + o2dlm_flags |= LKM_CANCEL; 70 + if (flags & DLM_LKF_CONVERT) 71 + o2dlm_flags |= LKM_CONVERT; 72 + if (flags & DLM_LKF_VALBLK) 73 + o2dlm_flags |= LKM_VALBLK; 74 + if (flags & DLM_LKF_IVVALBLK) 75 + o2dlm_flags |= LKM_INVVALBLK; 76 + if (flags & DLM_LKF_ORPHAN) 77 + o2dlm_flags |= LKM_ORPHAN; 78 + if (flags & DLM_LKF_FORCEUNLOCK) 79 + o2dlm_flags |= LKM_FORCE; 80 + if (flags & DLM_LKF_TIMEOUT) 81 + o2dlm_flags |= LKM_TIMEOUT; 82 + if (flags & DLM_LKF_LOCAL) 83 + o2dlm_flags |= LKM_LOCAL; 83 84 84 85 return o2dlm_flags; 85 86 } 86 - #undef map_flag 87 87 88 88 /* 89 89 * Map an o2dlm status to standard errno values.
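The removed map_flag() macro destructively cleared each handled bit from `flags` so the trailing BUG_ON() could catch any unmapped flag; the replacement is a plain if-chain with no hidden mutation. A compilable sketch of the new shape, using illustrative bit values (the real DLM_LKF_* and LKM_* constants differ):

```c
/* Illustrative values only; not the real ocfs2/o2dlm constants. */
#define DLM_LKF_NOQUEUE	0x01
#define DLM_LKF_CANCEL	0x02
#define DLM_LKF_CONVERT	0x04
#define LKM_NOQUEUE	0x10
#define LKM_CANCEL	0x20
#define LKM_CONVERT	0x40

/* Translate generic DLM flags to o2dlm flags, one explicit test per bit. */
static int flags_to_o2dlm(unsigned int flags)
{
	int o2dlm_flags = 0;

	if (flags & DLM_LKF_NOQUEUE)
		o2dlm_flags |= LKM_NOQUEUE;
	if (flags & DLM_LKF_CANCEL)
		o2dlm_flags |= LKM_CANCEL;
	if (flags & DLM_LKF_CONVERT)
		o2dlm_flags |= LKM_CONVERT;

	return o2dlm_flags;
}
```

The trade-off: the if-chain silently ignores unknown flags instead of crashing on them, which is why the BUG_ON() disappears along with the macro.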
+1 -1
fs/ocfs2/stackglue.c
··· 731 731 } 732 732 733 733 MODULE_AUTHOR("Oracle"); 734 - MODULE_DESCRIPTION("ocfs2 cluter stack glue layer"); 734 + MODULE_DESCRIPTION("ocfs2 cluster stack glue layer"); 735 735 MODULE_LICENSE("GPL"); 736 736 module_init(ocfs2_stack_glue_init); 737 737 module_exit(ocfs2_stack_glue_exit);
+2 -6
include/linux/compiler-gcc.h
··· 90 90 */ 91 91 #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0) 92 92 93 - /* 94 - * sparse (__CHECKER__) pretends to be gcc, but can't do constant 95 - * folding in __builtin_bswap*() (yet), so don't set these for it. 96 - */ 97 - #if defined(CONFIG_ARCH_USE_BUILTIN_BSWAP) && !defined(__CHECKER__) 93 + #if defined(CONFIG_ARCH_USE_BUILTIN_BSWAP) 98 94 #define __HAVE_BUILTIN_BSWAP32__ 99 95 #define __HAVE_BUILTIN_BSWAP64__ 100 96 #define __HAVE_BUILTIN_BSWAP16__ 101 - #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP && !__CHECKER__ */ 97 + #endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */ 102 98 103 99 #if GCC_VERSION >= 70000 104 100 #define KASAN_ABI_VERSION 5
+2
include/linux/fs.h
··· 2878 2878 2879 2879 extern bool filemap_range_has_page(struct address_space *, loff_t lstart, 2880 2880 loff_t lend); 2881 + extern bool filemap_range_needs_writeback(struct address_space *, 2882 + loff_t lstart, loff_t lend); 2881 2883 extern int filemap_write_and_wait_range(struct address_space *mapping, 2882 2884 loff_t lstart, loff_t lend); 2883 2885 extern int __filemap_fdatawrite_range(struct address_space *mapping,
+19 -14
include/linux/gfp.h
··· 515 515 } 516 516 #endif 517 517 518 - struct page * 519 - __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, 520 - nodemask_t *nodemask); 518 + struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, 519 + nodemask_t *nodemask); 521 520 522 - static inline struct page * 523 - __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid) 521 + unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, 522 + nodemask_t *nodemask, int nr_pages, 523 + struct list_head *page_list, 524 + struct page **page_array); 525 + 526 + /* Bulk allocate order-0 pages */ 527 + static inline unsigned long 528 + alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list) 524 529 { 525 - return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL); 530 + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL); 531 + } 532 + 533 + static inline unsigned long 534 + alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array) 535 + { 536 + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array); 526 537 } 527 538 528 539 /* ··· 546 535 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); 547 536 VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid)); 548 537 549 - return __alloc_pages(gfp_mask, order, nid); 538 + return __alloc_pages(gfp_mask, order, nid, NULL); 550 539 } 551 540 552 541 /* ··· 564 553 } 565 554 566 555 #ifdef CONFIG_NUMA 567 - extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order); 568 - 569 - static inline struct page * 570 - alloc_pages(gfp_t gfp_mask, unsigned int order) 571 - { 572 - return alloc_pages_current(gfp_mask, order); 573 - } 556 + struct page *alloc_pages(gfp_t gfp, unsigned int order); 574 557 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, 575 558 struct vm_area_struct *vma, unsigned long addr, 576 559 int node, bool hugepage);
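The array variant of the new bulk allocator is documented (in the mm/page_alloc.c patch) as skipping slots that are already populated and returning the number of pages in the array on return, which lets callers like SUNRPC refill a partially consumed array in one call. A userspace mock of that contract using malloc as a stand-in for order-0 pages (the kernel implementation is of course entirely different; this only models the interface semantics):

```c
#include <stdlib.h>

/*
 * Mock of the array-based bulk-allocation contract: skip slots that
 * are already populated, try to fill the rest, and return the total
 * number of non-NULL entries on exit.
 */
static unsigned long mock_alloc_bulk_array(unsigned long nr_pages, void **page_array)
{
	unsigned long i, nr_populated = 0;

	for (i = 0; i < nr_pages; i++) {
		if (!page_array[i])
			page_array[i] = malloc(4096);	/* "allocate a page" */
		if (page_array[i])
			nr_populated++;
	}
	return nr_populated;
}
```

A second call on the same array is a cheap no-op for the already-filled slots, which is the property the refill paths rely on.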
+3
include/linux/io-mapping.h
··· 220 220 } 221 221 222 222 #endif /* _LINUX_IO_MAPPING_H */ 223 + 224 + int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma, 225 + unsigned long addr, unsigned long pfn, unsigned long size);
-9
include/linux/io.h
··· 31 31 } 32 32 #endif 33 33 34 - #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 35 - void __init ioremap_huge_init(void); 36 - int arch_ioremap_p4d_supported(void); 37 - int arch_ioremap_pud_supported(void); 38 - int arch_ioremap_pmd_supported(void); 39 - #else 40 - static inline void ioremap_huge_init(void) { } 41 - #endif 42 - 43 34 /* 44 35 * Managed iomap interface 45 36 */
+34 -17
include/linux/kasan.h
··· 30 30 /* Software KASAN implementations use shadow memory. */ 31 31 32 32 #ifdef CONFIG_KASAN_SW_TAGS 33 - #define KASAN_SHADOW_INIT 0xFF 33 + /* This matches KASAN_TAG_INVALID. */ 34 + #define KASAN_SHADOW_INIT 0xFE 34 35 #else 35 36 #define KASAN_SHADOW_INIT 0 36 37 #endif ··· 96 95 return static_branch_likely(&kasan_flag_enabled); 97 96 } 98 97 98 + static inline bool kasan_has_integrated_init(void) 99 + { 100 + return kasan_enabled(); 101 + } 102 + 99 103 #else /* CONFIG_KASAN_HW_TAGS */ 100 104 101 105 static inline bool kasan_enabled(void) 102 106 { 103 107 return true; 108 + } 109 + 110 + static inline bool kasan_has_integrated_init(void) 111 + { 112 + return false; 104 113 } 105 114 106 115 #endif /* CONFIG_KASAN_HW_TAGS */ ··· 130 119 __kasan_unpoison_range(addr, size); 131 120 } 132 121 133 - void __kasan_alloc_pages(struct page *page, unsigned int order); 122 + void __kasan_alloc_pages(struct page *page, unsigned int order, bool init); 134 123 static __always_inline void kasan_alloc_pages(struct page *page, 135 - unsigned int order) 124 + unsigned int order, bool init) 136 125 { 137 126 if (kasan_enabled()) 138 - __kasan_alloc_pages(page, order); 127 + __kasan_alloc_pages(page, order, init); 139 128 } 140 129 141 - void __kasan_free_pages(struct page *page, unsigned int order); 130 + void __kasan_free_pages(struct page *page, unsigned int order, bool init); 142 131 static __always_inline void kasan_free_pages(struct page *page, 143 - unsigned int order) 132 + unsigned int order, bool init) 144 133 { 145 134 if (kasan_enabled()) 146 - __kasan_free_pages(page, order); 135 + __kasan_free_pages(page, order, init); 147 136 } 148 137 149 138 void __kasan_cache_create(struct kmem_cache *cache, unsigned int *size, ··· 203 192 return (void *)object; 204 193 } 205 194 206 - bool __kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip); 207 - static __always_inline bool kasan_slab_free(struct kmem_cache *s, void *object) 195 + bool 
__kasan_slab_free(struct kmem_cache *s, void *object, 196 + unsigned long ip, bool init); 197 + static __always_inline bool kasan_slab_free(struct kmem_cache *s, 198 + void *object, bool init) 208 199 { 209 200 if (kasan_enabled()) 210 - return __kasan_slab_free(s, object, _RET_IP_); 201 + return __kasan_slab_free(s, object, _RET_IP_, init); 211 202 return false; 212 203 } 213 204 ··· 228 215 } 229 216 230 217 void * __must_check __kasan_slab_alloc(struct kmem_cache *s, 231 - void *object, gfp_t flags); 218 + void *object, gfp_t flags, bool init); 232 219 static __always_inline void * __must_check kasan_slab_alloc( 233 - struct kmem_cache *s, void *object, gfp_t flags) 220 + struct kmem_cache *s, void *object, gfp_t flags, bool init) 234 221 { 235 222 if (kasan_enabled()) 236 - return __kasan_slab_alloc(s, object, flags); 223 + return __kasan_slab_alloc(s, object, flags, init); 237 224 return object; 238 225 } 239 226 ··· 289 276 { 290 277 return false; 291 278 } 279 + static inline bool kasan_has_integrated_init(void) 280 + { 281 + return false; 282 + } 292 283 static inline slab_flags_t kasan_never_merge(void) 293 284 { 294 285 return 0; 295 286 } 296 287 static inline void kasan_unpoison_range(const void *address, size_t size) {} 297 - static inline void kasan_alloc_pages(struct page *page, unsigned int order) {} 298 - static inline void kasan_free_pages(struct page *page, unsigned int order) {} 288 + static inline void kasan_alloc_pages(struct page *page, unsigned int order, bool init) {} 289 + static inline void kasan_free_pages(struct page *page, unsigned int order, bool init) {} 299 290 static inline void kasan_cache_create(struct kmem_cache *cache, 300 291 unsigned int *size, 301 292 slab_flags_t *flags) {} ··· 315 298 { 316 299 return (void *)object; 317 300 } 318 - static inline bool kasan_slab_free(struct kmem_cache *s, void *object) 301 + static inline bool kasan_slab_free(struct kmem_cache *s, void *object, bool init) 319 302 { 320 303 return false; 
321 304 } 322 305 static inline void kasan_kfree_large(void *ptr) {} 323 306 static inline void kasan_slab_free_mempool(void *ptr) {} 324 307 static inline void *kasan_slab_alloc(struct kmem_cache *s, void *object, 325 - gfp_t flags) 308 + gfp_t flags, bool init) 326 309 { 327 310 return object; 328 311 }
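The kasan.h hunks above all follow one pattern: an `__always_inline` wrapper checks kasan_enabled() (a static key under CONFIG_KASAN_HW_TAGS) before calling the out-of-line `__kasan_*` implementation, and the new `init` bool is threaded through so KASAN can take over memory initialization when kasan_has_integrated_init() is true. A minimal userspace sketch of that wrapper pattern, with a plain global standing in for the static key (all names here are illustrative):

```c
#include <stdbool.h>
#include <string.h>

/* Plain global standing in for the kernel's static_branch_likely() key. */
static bool kasan_enabled_flag = true;

static int slowpath_calls;	/* visible side effect for the sketch */

/* Out-of-line implementation: also zero-initializes when init is set. */
static void __kasan_alloc_sketch(void *object, size_t size, bool init)
{
	slowpath_calls++;
	if (init)
		memset(object, 0, size);
}

/* Inline wrapper: when disabled, the hot path pays only for one test. */
static inline void kasan_alloc_sketch(void *object, size_t size, bool init)
{
	if (kasan_enabled_flag)
		__kasan_alloc_sketch(object, size, init);
}
```

The point of passing `init` down rather than memsetting in the caller is that hardware tag-based KASAN can initialize and tag memory in a single pass.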
+160 -113
include/linux/memcontrol.h
··· 76 76 }; 77 77 78 78 struct memcg_vmstats_percpu { 79 - long stat[MEMCG_NR_STAT]; 80 - unsigned long events[NR_VM_EVENT_ITEMS]; 81 - unsigned long nr_page_events; 82 - unsigned long targets[MEM_CGROUP_NTARGETS]; 79 + /* Local (CPU and cgroup) page state & events */ 80 + long state[MEMCG_NR_STAT]; 81 + unsigned long events[NR_VM_EVENT_ITEMS]; 82 + 83 + /* Delta calculation for lockless upward propagation */ 84 + long state_prev[MEMCG_NR_STAT]; 85 + unsigned long events_prev[NR_VM_EVENT_ITEMS]; 86 + 87 + /* Cgroup1: threshold notifications & softlimit tree updates */ 88 + unsigned long nr_page_events; 89 + unsigned long targets[MEM_CGROUP_NTARGETS]; 90 + }; 91 + 92 + struct memcg_vmstats { 93 + /* Aggregated (CPU and subtree) page state & events */ 94 + long state[MEMCG_NR_STAT]; 95 + unsigned long events[NR_VM_EVENT_ITEMS]; 96 + 97 + /* Pending child counts during tree propagation */ 98 + long state_pending[MEMCG_NR_STAT]; 99 + unsigned long events_pending[NR_VM_EVENT_ITEMS]; 83 100 }; 84 101 85 102 struct mem_cgroup_reclaim_iter { ··· 304 287 305 288 MEMCG_PADDING(_pad1_); 306 289 307 - atomic_long_t vmstats[MEMCG_NR_STAT]; 308 - atomic_long_t vmevents[NR_VM_EVENT_ITEMS]; 290 + /* memory.stat */ 291 + struct memcg_vmstats vmstats; 309 292 310 293 /* memory.events */ 311 294 atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; ··· 332 315 atomic_t moving_account; 333 316 struct task_struct *move_lock_task; 334 317 335 - /* Legacy local VM stats and events */ 336 - struct memcg_vmstats_percpu __percpu *vmstats_local; 337 - 338 - /* Subtree VM stats and events (batched updates) */ 339 318 struct memcg_vmstats_percpu __percpu *vmstats_percpu; 340 319 341 320 #ifdef CONFIG_CGROUP_WRITEBACK ··· 371 358 372 359 #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1) 373 360 361 + static inline bool PageMemcgKmem(struct page *page); 362 + 363 + /* 364 + * After the initialization objcg->memcg is always pointing at 365 + * a valid memcg, but can be atomically swapped 
to the parent memcg. 366 + * 367 + * The caller must ensure that the returned memcg won't be released: 368 + * e.g. acquire the rcu_read_lock or css_set_lock. 369 + */ 370 + static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg) 371 + { 372 + return READ_ONCE(objcg->memcg); 373 + } 374 + 375 + /* 376 + * __page_memcg - get the memory cgroup associated with a non-kmem page 377 + * @page: a pointer to the page struct 378 + * 379 + * Returns a pointer to the memory cgroup associated with the page, 380 + * or NULL. This function assumes that the page is known to have a 381 + * proper memory cgroup pointer. It's not safe to call this function 382 + * against some type of pages, e.g. slab pages or ex-slab pages or 383 + * kmem pages. 384 + */ 385 + static inline struct mem_cgroup *__page_memcg(struct page *page) 386 + { 387 + unsigned long memcg_data = page->memcg_data; 388 + 389 + VM_BUG_ON_PAGE(PageSlab(page), page); 390 + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page); 391 + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page); 392 + 393 + return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 394 + } 395 + 396 + /* 397 + * __page_objcg - get the object cgroup associated with a kmem page 398 + * @page: a pointer to the page struct 399 + * 400 + * Returns a pointer to the object cgroup associated with the page, 401 + * or NULL. This function assumes that the page is known to have a 402 + * proper object cgroup pointer. It's not safe to call this function 403 + * against some type of pages, e.g. slab pages or ex-slab pages or 404 + * LRU pages. 
405 + */ 406 + static inline struct obj_cgroup *__page_objcg(struct page *page) 407 + { 408 + unsigned long memcg_data = page->memcg_data; 409 + 410 + VM_BUG_ON_PAGE(PageSlab(page), page); 411 + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page); 412 + VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page); 413 + 414 + return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 415 + } 416 + 374 417 /* 375 418 * page_memcg - get the memory cgroup associated with a page 376 419 * @page: a pointer to the page struct ··· 436 367 * proper memory cgroup pointer. It's not safe to call this function 437 368 * against some type of pages, e.g. slab pages or ex-slab pages. 438 369 * 439 - * Any of the following ensures page and memcg binding stability: 370 + * For a non-kmem page any of the following ensures page and memcg binding 371 + * stability: 372 + * 440 373 * - the page lock 441 374 * - LRU isolation 442 375 * - lock_page_memcg() 443 376 * - exclusive reference 377 + * 378 + * For a kmem page a caller should hold an rcu read lock to protect memcg 379 + * associated with a kmem page from being released. 
444 380 */ 445 381 static inline struct mem_cgroup *page_memcg(struct page *page) 446 382 { 447 - unsigned long memcg_data = page->memcg_data; 448 - 449 - VM_BUG_ON_PAGE(PageSlab(page), page); 450 - VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page); 451 - 452 - return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 383 + if (PageMemcgKmem(page)) 384 + return obj_cgroup_memcg(__page_objcg(page)); 385 + else 386 + return __page_memcg(page); 453 387 } 454 388 455 389 /* ··· 466 394 */ 467 395 static inline struct mem_cgroup *page_memcg_rcu(struct page *page) 468 396 { 397 + unsigned long memcg_data = READ_ONCE(page->memcg_data); 398 + 469 399 VM_BUG_ON_PAGE(PageSlab(page), page); 470 400 WARN_ON_ONCE(!rcu_read_lock_held()); 471 401 472 - return (struct mem_cgroup *)(READ_ONCE(page->memcg_data) & 473 - ~MEMCG_DATA_FLAGS_MASK); 402 + if (memcg_data & MEMCG_DATA_KMEM) { 403 + struct obj_cgroup *objcg; 404 + 405 + objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 406 + return obj_cgroup_memcg(objcg); 407 + } 408 + 409 + return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 474 410 } 475 411 476 412 /* ··· 486 406 * @page: a pointer to the page struct 487 407 * 488 408 * Returns a pointer to the memory cgroup associated with the page, 489 - * or NULL. This function unlike page_memcg() can take any page 409 + * or NULL. This function unlike page_memcg() can take any page 490 410 * as an argument. It has to be used in cases when it's not known if a page 491 - * has an associated memory cgroup pointer or an object cgroups vector. 411 + * has an associated memory cgroup pointer or an object cgroups vector or 412 + * an object cgroup. 
492 413 * 493 - * Any of the following ensures page and memcg binding stability: 414 + * For a non-kmem page any of the following ensures page and memcg binding 415 + * stability: 416 + * 494 417 * - the page lock 495 418 * - LRU isolation 496 419 * - lock_page_memcg() 497 420 * - exclusive reference 421 + * 422 + * For a kmem page a caller should hold an rcu read lock to protect memcg 423 + * associated with a kmem page from being released. 498 424 */ 499 425 static inline struct mem_cgroup *page_memcg_check(struct page *page) 500 426 { ··· 513 427 if (memcg_data & MEMCG_DATA_OBJCGS) 514 428 return NULL; 515 429 430 + if (memcg_data & MEMCG_DATA_KMEM) { 431 + struct obj_cgroup *objcg; 432 + 433 + objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 434 + return obj_cgroup_memcg(objcg); 435 + } 436 + 516 437 return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); 517 438 } 518 439 440 + #ifdef CONFIG_MEMCG_KMEM 519 441 /* 520 442 * PageMemcgKmem - check if the page has MemcgKmem flag set 521 443 * @page: a pointer to the page struct ··· 538 444 return page->memcg_data & MEMCG_DATA_KMEM; 539 445 } 540 446 541 - #ifdef CONFIG_MEMCG_KMEM 542 447 /* 543 448 * page_objcgs - get the object cgroups vector associated with a page 544 449 * @page: a pointer to the page struct ··· 579 486 } 580 487 581 488 #else 489 + static inline bool PageMemcgKmem(struct page *page) 490 + { 491 + return false; 492 + } 493 + 582 494 static inline struct obj_cgroup **page_objcgs(struct page *page) 583 495 { 584 496 return NULL; ··· 694 596 } 695 597 696 598 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); 599 + int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, 600 + gfp_t gfp, swp_entry_t entry); 601 + void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); 697 602 698 603 void mem_cgroup_uncharge(struct page *page); 699 604 void mem_cgroup_uncharge_list(struct list_head *page_list); 700 605 701 606 void 
mem_cgroup_migrate(struct page *oldpage, struct page *newpage); 702 - 703 - static struct mem_cgroup_per_node * 704 - mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid) 705 - { 706 - return memcg->nodeinfo[nid]; 707 - } 708 607 709 608 /** 710 609 * mem_cgroup_lruvec - get the lru list vector for a memcg & node ··· 726 631 if (!memcg) 727 632 memcg = root_mem_cgroup; 728 633 729 - mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); 634 + mz = memcg->nodeinfo[pgdat->node_id]; 730 635 lruvec = &mz->lruvec; 731 636 out: 732 637 /* ··· 803 708 percpu_ref_get(&objcg->refcnt); 804 709 } 805 710 711 + static inline void obj_cgroup_get_many(struct obj_cgroup *objcg, 712 + unsigned long nr) 713 + { 714 + percpu_ref_get_many(&objcg->refcnt, nr); 715 + } 716 + 806 717 static inline void obj_cgroup_put(struct obj_cgroup *objcg) 807 718 { 808 719 percpu_ref_put(&objcg->refcnt); 809 - } 810 - 811 - /* 812 - * After the initialization objcg->memcg is always pointing at 813 - * a valid memcg, but can be atomically swapped to the parent memcg. 814 - * 815 - * The caller must ensure that the returned memcg won't be released: 816 - * e.g. acquire the rcu_read_lock or css_set_lock. 817 - */ 818 - static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg) 819 - { 820 - return READ_ONCE(objcg->memcg); 821 720 } 822 721 823 722 static inline void mem_cgroup_put(struct mem_cgroup *memcg) ··· 956 867 extern bool cgroup_memory_noswap; 957 868 #endif 958 869 959 - struct mem_cgroup *lock_page_memcg(struct page *page); 960 - void __unlock_page_memcg(struct mem_cgroup *memcg); 870 + void lock_page_memcg(struct page *page); 961 871 void unlock_page_memcg(struct page *page); 962 - 963 - /* 964 - * idx can be of type enum memcg_stat_item or node_stat_item. 965 - * Keep in sync with memcg_exact_page_state(). 
966 - */ 967 - static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) 968 - { 969 - long x = atomic_long_read(&memcg->vmstats[idx]); 970 - #ifdef CONFIG_SMP 971 - if (x < 0) 972 - x = 0; 973 - #endif 974 - return x; 975 - } 976 - 977 - /* 978 - * idx can be of type enum memcg_stat_item or node_stat_item. 979 - * Keep in sync with memcg_exact_page_state(). 980 - */ 981 - static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, 982 - int idx) 983 - { 984 - long x = 0; 985 - int cpu; 986 - 987 - for_each_possible_cpu(cpu) 988 - x += per_cpu(memcg->vmstats_local->stat[idx], cpu); 989 - #ifdef CONFIG_SMP 990 - if (x < 0) 991 - x = 0; 992 - #endif 993 - return x; 994 - } 995 872 996 873 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val); 997 874 ··· 1033 978 __mod_memcg_lruvec_state(lruvec, idx, val); 1034 979 local_irq_restore(flags); 1035 980 } 1036 - 1037 - unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 1038 - gfp_t gfp_mask, 1039 - unsigned long *total_scanned); 1040 981 1041 982 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, 1042 983 unsigned long count); ··· 1114 1063 1115 1064 void split_page_memcg(struct page *head, unsigned int nr); 1116 1065 1066 + unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 1067 + gfp_t gfp_mask, 1068 + unsigned long *total_scanned); 1069 + 1117 1070 #else /* CONFIG_MEMCG */ 1118 1071 1119 1072 #define MEM_CGROUP_ID_SHIFT 0 1120 1073 #define MEM_CGROUP_ID_MAX 0 1121 - 1122 - struct mem_cgroup; 1123 1074 1124 1075 static inline struct mem_cgroup *page_memcg(struct page *page) 1125 1076 { ··· 1192 1139 return 0; 1193 1140 } 1194 1141 1142 + static inline int mem_cgroup_swapin_charge_page(struct page *page, 1143 + struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) 1144 + { 1145 + return 0; 1146 + } 1147 + 1148 + static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) 1149 + { 1150 + } 1151 
+ 1195 1152 static inline void mem_cgroup_uncharge(struct page *page) 1196 1153 { 1197 1154 } ··· 1232 1169 pg_data_t *pgdat = page_pgdat(page); 1233 1170 1234 1171 return lruvec == &pgdat->__lruvec; 1172 + } 1173 + 1174 + static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) 1175 + { 1235 1176 } 1236 1177 1237 1178 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) ··· 1356 1289 { 1357 1290 } 1358 1291 1359 - static inline struct mem_cgroup *lock_page_memcg(struct page *page) 1360 - { 1361 - return NULL; 1362 - } 1363 - 1364 - static inline void __unlock_page_memcg(struct mem_cgroup *memcg) 1292 + static inline void lock_page_memcg(struct page *page) 1365 1293 { 1366 1294 } 1367 1295 ··· 1394 1332 1395 1333 static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) 1396 1334 { 1397 - } 1398 - 1399 - static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) 1400 - { 1401 - return 0; 1402 - } 1403 - 1404 - static inline unsigned long memcg_page_state_local(struct mem_cgroup *memcg, 1405 - int idx) 1406 - { 1407 - return 0; 1408 1335 } 1409 1336 1410 1337 static inline void __mod_memcg_state(struct mem_cgroup *memcg, ··· 1441 1390 mod_node_page_state(page_pgdat(page), idx, val); 1442 1391 } 1443 1392 1444 - static inline 1445 - unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 1446 - gfp_t gfp_mask, 1447 - unsigned long *total_scanned) 1448 - { 1449 - return 0; 1450 - } 1451 - 1452 - static inline void split_page_memcg(struct page *head, unsigned int nr) 1453 - { 1454 - } 1455 - 1456 1393 static inline void count_memcg_events(struct mem_cgroup *memcg, 1457 1394 enum vm_event_item idx, 1458 1395 unsigned long count) ··· 1463 1424 { 1464 1425 } 1465 1426 1466 - static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) 1427 + static inline void split_page_memcg(struct page *head, unsigned int nr) 1467 1428 { 1429 + } 1430 + 1431 + static 
inline 1432 + unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 1433 + gfp_t gfp_mask, 1434 + unsigned long *total_scanned) 1435 + { 1436 + return 0; 1468 1437 } 1469 1438 #endif /* CONFIG_MEMCG */ 1470 1439
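The new __page_memcg()/__page_objcg() helpers above both recover a pointer from page->memcg_data by masking off the low MEMCG_DATA_* flag bits, relying on the pointed-to objects being aligned so those bits are free. A minimal demonstration of that tagged-pointer trick (the flag values are illustrative stand-ins, not the kernel's):

```c
#include <stdint.h>

#define DATA_KMEM	0x2UL	/* illustrative stand-in for MEMCG_DATA_KMEM */
#define DATA_FLAGS_MASK	0x3UL	/* low bits reserved for flags */

struct obj_cgroup { int refcnt; };	/* alignment >= 4 keeps low bits free */

/* Store a suitably aligned pointer together with low-bit flags. */
static unsigned long tag_ptr(struct obj_cgroup *p, unsigned long flags)
{
	return (unsigned long)(uintptr_t)p | flags;
}

/* Recover the pointer by masking the flag bits, as __page_objcg() does. */
static struct obj_cgroup *untag_ptr(unsigned long data)
{
	return (struct obj_cgroup *)(uintptr_t)(data & ~DATA_FLAGS_MASK);
}
```

This is why page_memcg() first tests PageMemcgKmem(): the same word holds either a mem_cgroup pointer or an obj_cgroup pointer, and the flag bits say which interpretation applies.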
+27 -23
include/linux/mm.h
··· 432 432 extern pgprot_t protection_map[16]; 433 433 434 434 /** 435 - * Fault flag definitions. 436 - * 435 + * enum fault_flag - Fault flag definitions. 437 436 * @FAULT_FLAG_WRITE: Fault was a write fault. 438 437 * @FAULT_FLAG_MKWRITE: Fault was mkwrite of existing PTE. 439 438 * @FAULT_FLAG_ALLOW_RETRY: Allow to retry the fault if blocked. ··· 463 464 * signals before a retry to make sure the continuous page faults can still be 464 465 * interrupted if necessary. 465 466 */ 466 - #define FAULT_FLAG_WRITE 0x01 467 - #define FAULT_FLAG_MKWRITE 0x02 468 - #define FAULT_FLAG_ALLOW_RETRY 0x04 469 - #define FAULT_FLAG_RETRY_NOWAIT 0x08 470 - #define FAULT_FLAG_KILLABLE 0x10 471 - #define FAULT_FLAG_TRIED 0x20 472 - #define FAULT_FLAG_USER 0x40 473 - #define FAULT_FLAG_REMOTE 0x80 474 - #define FAULT_FLAG_INSTRUCTION 0x100 475 - #define FAULT_FLAG_INTERRUPTIBLE 0x200 467 + enum fault_flag { 468 + FAULT_FLAG_WRITE = 1 << 0, 469 + FAULT_FLAG_MKWRITE = 1 << 1, 470 + FAULT_FLAG_ALLOW_RETRY = 1 << 2, 471 + FAULT_FLAG_RETRY_NOWAIT = 1 << 3, 472 + FAULT_FLAG_KILLABLE = 1 << 4, 473 + FAULT_FLAG_TRIED = 1 << 5, 474 + FAULT_FLAG_USER = 1 << 6, 475 + FAULT_FLAG_REMOTE = 1 << 7, 476 + FAULT_FLAG_INSTRUCTION = 1 << 8, 477 + FAULT_FLAG_INTERRUPTIBLE = 1 << 9, 478 + }; 476 479 477 480 /* 478 481 * The default fault flags that should be used by most of the ··· 486 485 487 486 /** 488 487 * fault_flag_allow_retry_first - check ALLOW_RETRY the first time 488 + * @flags: Fault flags. 489 489 * 490 490 * This is mostly used for places where we want to try to avoid taking 491 491 * the mmap_lock for too long a time when waiting for another condition ··· 497 495 * Return: true if the page fault allows retry and this is the first 498 496 * attempt of the fault handling; false otherwise. 
499 497 */ 500 - static inline bool fault_flag_allow_retry_first(unsigned int flags) 498 + static inline bool fault_flag_allow_retry_first(enum fault_flag flags) 501 499 { 502 500 return (flags & FAULT_FLAG_ALLOW_RETRY) && 503 501 (!(flags & FAULT_FLAG_TRIED)); ··· 532 530 pgoff_t pgoff; /* Logical page offset based on vma */ 533 531 unsigned long address; /* Faulting virtual address */ 534 532 }; 535 - unsigned int flags; /* FAULT_FLAG_xxx flags 533 + enum fault_flag flags; /* FAULT_FLAG_xxx flags 536 534 * XXX: should really be 'const' */ 537 535 pmd_t *pmd; /* Pointer to pmd entry matching 538 536 * the 'address' */ ··· 582 580 void (*close)(struct vm_area_struct * area); 583 581 /* Called any time before splitting to check if it's allowed */ 584 582 int (*may_split)(struct vm_area_struct *area, unsigned long addr); 585 - int (*mremap)(struct vm_area_struct *area, unsigned long flags); 583 + int (*mremap)(struct vm_area_struct *area); 586 584 /* 587 585 * Called by mprotect() to make driver-specific permission 588 586 * checks before mprotect() is finalised. The VMA must not ··· 1267 1265 void unpin_user_page(struct page *page); 1268 1266 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, 1269 1267 bool make_dirty); 1268 + void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, 1269 + bool make_dirty); 1270 1270 void unpin_user_pages(struct page **pages, unsigned long npages); 1271 1271 1272 1272 /** 1273 - * page_maybe_dma_pinned() - report if a page is pinned for DMA. 1273 + * page_maybe_dma_pinned - Report if a page is pinned for DMA. 1274 + * @page: The page. 1274 1275 * 1275 1276 * This function checks if a page has been pinned via a call to 1276 - * pin_user_pages*(). 1277 + * a function in the pin_user_pages() family. 
1277 1278 * 1278 1279 * For non-huge pages, the return value is partially fuzzy: false is not fuzzy, 1279 1280 * because it means "definitely not pinned for DMA", but true means "probably ··· 1294 1289 * 1295 1290 * For more information, please see Documentation/core-api/pin_user_pages.rst. 1296 1291 * 1297 - * @page: pointer to page to be queried. 1298 - * @Return: True, if it is likely that the page has been "dma-pinned". 1299 - * False, if the page is definitely not dma-pinned. 1292 + * Return: True, if it is likely that the page has been "dma-pinned". 1293 + * False, if the page is definitely not dma-pinned. 1300 1294 */ 1301 1295 static inline bool page_maybe_dma_pinned(struct page *page) 1302 1296 { ··· 1633 1629 1634 1630 bool page_mapped(struct page *page); 1635 1631 struct address_space *page_mapping(struct page *page); 1636 - struct address_space *page_mapping_file(struct page *page); 1637 1632 1638 1633 /* 1639 1634 * Return true only if the page has been allocated with ··· 2360 2357 int poison, const char *s); 2361 2358 2362 2359 extern void adjust_managed_page_count(struct page *page, long count); 2363 - extern void mem_init_print_info(const char *str); 2360 + extern void mem_init_print_info(void); 2364 2361 2365 2362 extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end); 2366 2363 ··· 2734 2731 struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); 2735 2732 int remap_pfn_range(struct vm_area_struct *, unsigned long addr, 2736 2733 unsigned long pfn, unsigned long size, pgprot_t); 2734 + int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, 2735 + unsigned long pfn, unsigned long size, pgprot_t prot); 2737 2736 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *); 2738 2737 int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr, 2739 2738 struct page **pages, unsigned long *num); ··· 2795 2790 #define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, 
start the IO 2796 2791 * and return without waiting upon it */ 2797 2792 #define FOLL_POPULATE 0x40 /* fault in page */ 2798 - #define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */ 2799 2793 #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ 2800 2794 #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ 2801 2795 #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
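Converting the FAULT_FLAG_* defines to `enum fault_flag` (and typing vm_fault::flags and fault_flag_allow_retry_first() with it) gives the compiler and kernel-doc a single named type for these bits without changing their values. A compilable copy of the helper's logic over a subset of the enum, matching the hunk above:

```c
#include <stdbool.h>

/* Subset of the enum introduced by the patch; values match the hunk. */
enum fault_flag {
	FAULT_FLAG_WRITE	= 1 << 0,
	FAULT_FLAG_ALLOW_RETRY	= 1 << 2,
	FAULT_FLAG_TRIED	= 1 << 5,
};

/* True only on the first attempt of a retryable fault. */
static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
{
	return (flags & FAULT_FLAG_ALLOW_RETRY) &&
	       !(flags & FAULT_FLAG_TRIED);
}
```

Since C enums freely mix with int, ORed combinations like `FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_TRIED` still pass cleanly; the gain is documentation and type intent, not strict checking.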
+24 -19
include/linux/mmzone.h
··· 993 993 * is_highmem - helper function to quickly check if a struct zone is a 994 994 * highmem zone or not. This is an attempt to keep references 995 995 * to ZONE_{DMA/NORMAL/HIGHMEM/etc} in general code to a minimum. 996 - * @zone - pointer to struct zone variable 996 + * @zone: pointer to struct zone variable 997 + * Return: 1 for a highmem zone, 0 otherwise 997 998 */ 998 999 static inline int is_highmem(struct zone *zone) 999 1000 { ··· 1045 1044 1046 1045 /** 1047 1046 * for_each_online_pgdat - helper macro to iterate over all online nodes 1048 - * @pgdat - pointer to a pg_data_t variable 1047 + * @pgdat: pointer to a pg_data_t variable 1049 1048 */ 1050 1049 #define for_each_online_pgdat(pgdat) \ 1051 1050 for (pgdat = first_online_pgdat(); \ ··· 1053 1052 pgdat = next_online_pgdat(pgdat)) 1054 1053 /** 1055 1054 * for_each_zone - helper macro to iterate over all memory zones 1056 - * @zone - pointer to struct zone variable 1055 + * @zone: pointer to struct zone variable 1057 1056 * 1058 1057 * The user only needs to declare the zone variable, for_each_zone 1059 1058 * fills it in. ··· 1092 1091 1093 1092 /** 1094 1093 * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point 1095 - * @z - The cursor used as a starting point for the search 1096 - * @highest_zoneidx - The zone index of the highest zone to return 1097 - * @nodes - An optional nodemask to filter the zonelist with 1094 + * @z: The cursor used as a starting point for the search 1095 + * @highest_zoneidx: The zone index of the highest zone to return 1096 + * @nodes: An optional nodemask to filter the zonelist with 1098 1097 * 1099 1098 * This function returns the next zone at or below a given zone index that is 1100 1099 * within the allowed nodemask using a cursor as the starting point for the 1101 1100 * search. 
The zoneref returned is a cursor that represents the current zone 1102 1101 * being examined. It should be advanced by one before calling 1103 1102 * next_zones_zonelist again. 1103 + * 1104 + * Return: the next zone at or below highest_zoneidx within the allowed 1105 + * nodemask using a cursor within a zonelist as a starting point 1104 1106 */ 1105 1107 static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z, 1106 1108 enum zone_type highest_zoneidx, ··· 1116 1112 1117 1113 /** 1118 1114 * first_zones_zonelist - Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist 1119 - * @zonelist - The zonelist to search for a suitable zone 1120 - * @highest_zoneidx - The zone index of the highest zone to return 1121 - * @nodes - An optional nodemask to filter the zonelist with 1122 - * @return - Zoneref pointer for the first suitable zone found (see below) 1115 + * @zonelist: The zonelist to search for a suitable zone 1116 + * @highest_zoneidx: The zone index of the highest zone to return 1117 + * @nodes: An optional nodemask to filter the zonelist with 1123 1118 * 1124 1119 * This function returns the first zone at or below a given zone index that is 1125 1120 * within the allowed nodemask. The zoneref returned is a cursor that can be ··· 1128 1125 * When no eligible zone is found, zoneref->zone is NULL (zoneref itself is 1129 1126 * never NULL). This may happen either genuinely, or due to concurrent nodemask 1130 1127 * update due to cpuset modification. 
1128 + * 1129 + * Return: Zoneref pointer for the first suitable zone found 1131 1130 */ 1132 1131 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist, 1133 1132 enum zone_type highest_zoneidx, ··· 1141 1136 1142 1137 /** 1143 1138 * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask 1144 - * @zone - The current zone in the iterator 1145 - * @z - The current pointer within zonelist->_zonerefs being iterated 1146 - * @zlist - The zonelist being iterated 1147 - * @highidx - The zone index of the highest zone to return 1148 - * @nodemask - Nodemask allowed by the allocator 1139 + * @zone: The current zone in the iterator 1140 + * @z: The current pointer within zonelist->_zonerefs being iterated 1141 + * @zlist: The zonelist being iterated 1142 + * @highidx: The zone index of the highest zone to return 1143 + * @nodemask: Nodemask allowed by the allocator 1149 1144 * 1150 1145 * This iterator iterates though all zones at or below a given zone index and 1151 1146 * within a given nodemask ··· 1165 1160 1166 1161 /** 1167 1162 * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index 1168 - * @zone - The current zone in the iterator 1169 - * @z - The current pointer within zonelist->zones being iterated 1170 - * @zlist - The zonelist being iterated 1171 - * @highidx - The zone index of the highest zone to return 1163 + * @zone: The current zone in the iterator 1164 + * @z: The current pointer within zonelist->zones being iterated 1165 + * @zlist: The zonelist being iterated 1166 + * @highidx: The zone index of the highest zone to return 1172 1167 * 1173 1168 * This iterator iterates though all zones at or below a given zone index. 1174 1169 */
+29 -33
include/linux/page-flags-layout.h
··· 21 21 #elif MAX_NR_ZONES <= 8 22 22 #define ZONES_SHIFT 3 23 23 #else 24 - #error ZONES_SHIFT -- too many zones configured adjust calculation 24 + #error ZONES_SHIFT "Too many zones configured" 25 25 #endif 26 + 27 + #define ZONES_WIDTH ZONES_SHIFT 26 28 27 29 #ifdef CONFIG_SPARSEMEM 28 30 #include <asm/sparsemem.h> 29 - 30 - /* SECTION_SHIFT #bits space required to store a section # */ 31 31 #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) 32 - 33 - #endif /* CONFIG_SPARSEMEM */ 32 + #else 33 + #define SECTIONS_SHIFT 0 34 + #endif 34 35 35 36 #ifndef BUILD_VDSO32_64 36 37 /* ··· 55 54 #define SECTIONS_WIDTH 0 56 55 #endif 57 56 58 - #define ZONES_WIDTH ZONES_SHIFT 59 - 60 - #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS 57 + #if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS 61 58 #define NODES_WIDTH NODES_SHIFT 62 - #else 63 - #ifdef CONFIG_SPARSEMEM_VMEMMAP 59 + #elif defined(CONFIG_SPARSEMEM_VMEMMAP) 64 60 #error "Vmemmap: No space for nodes field in page flags" 65 - #endif 61 + #else 66 62 #define NODES_WIDTH 0 63 + #endif 64 + 65 + /* 66 + * Note that this #define MUST have a value so that it can be tested with 67 + * the IS_ENABLED() macro. 
68 + */ 69 + #if NODES_SHIFT != 0 && NODES_WIDTH == 0 70 + #define NODE_NOT_IN_PAGE_FLAGS 1 71 + #endif 72 + 73 + #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) 74 + #define KASAN_TAG_WIDTH 8 75 + #else 76 + #define KASAN_TAG_WIDTH 0 67 77 #endif 68 78 69 79 #ifdef CONFIG_NUMA_BALANCING ··· 89 77 #define LAST_CPUPID_SHIFT 0 90 78 #endif 91 79 92 - #if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS) 93 - #define KASAN_TAG_WIDTH 8 94 - #else 95 - #define KASAN_TAG_WIDTH 0 96 - #endif 97 - 98 - #if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT+KASAN_TAG_WIDTH \ 80 + #if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \ 99 81 <= BITS_PER_LONG - NR_PAGEFLAGS 100 82 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT 101 83 #else 102 84 #define LAST_CPUPID_WIDTH 0 103 85 #endif 104 86 105 - #if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+LAST_CPUPID_WIDTH+KASAN_TAG_WIDTH \ 87 + #if LAST_CPUPID_SHIFT != 0 && LAST_CPUPID_WIDTH == 0 88 + #define LAST_CPUPID_NOT_IN_PAGE_FLAGS 89 + #endif 90 + 91 + #if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \ 106 92 > BITS_PER_LONG - NR_PAGEFLAGS 107 93 #error "Not enough bits in page flags" 108 - #endif 109 - 110 - /* 111 - * We are going to use the flags for the page to node mapping if its in 112 - * there. This includes the case where there is no node, so it is implicit. 113 - * Note that this #define MUST have a value so that it can be tested with 114 - * the IS_ENABLED() macro. 115 - */ 116 - #if !(NODES_WIDTH > 0 || NODES_SHIFT == 0) 117 - #define NODE_NOT_IN_PAGE_FLAGS 1 118 - #endif 119 - 120 - #if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0 121 - #define LAST_CPUPID_NOT_IN_PAGE_FLAGS 122 94 #endif 123 95 124 96 #endif
+10
include/linux/pagemap.h
··· 158 158 void release_pages(struct page **pages, int nr); 159 159 160 160 /* 161 + * For file cache pages, return the address_space, otherwise return NULL 162 + */ 163 + static inline struct address_space *page_mapping_file(struct page *page) 164 + { 165 + if (unlikely(PageSwapCache(page))) 166 + return NULL; 167 + return page_mapping(page); 168 + } 169 + 170 + /* 161 171 * speculatively take a reference to a page. 162 172 * If the page is free (_refcount == 0), then _refcount is untouched, and 0 163 173 * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned.
+2 -2
include/linux/pagewalk.h
··· 7 7 struct mm_walk; 8 8 9 9 /** 10 - * mm_walk_ops - callbacks for walk_page_range 10 + * struct mm_walk_ops - callbacks for walk_page_range 11 11 * @pgd_entry: if set, called for each non-empty PGD (top-level) entry 12 12 * @p4d_entry: if set, called for each non-empty P4D entry 13 13 * @pud_entry: if set, called for each non-empty PUD entry ··· 71 71 }; 72 72 73 73 /** 74 - * mm_walk - walk_page_range data 74 + * struct mm_walk - walk_page_range data 75 75 * @ops: operation to call during the walk 76 76 * @mm: mm_struct representing the target process of page table walk 77 77 * @pgd: pointer to PGD; only valid with no_vma (otherwise set to NULL)
+4
include/linux/sched.h
··· 841 841 /* Stalled due to lack of memory */ 842 842 unsigned in_memstall:1; 843 843 #endif 844 + #ifdef CONFIG_PAGE_OWNER 845 + /* Used by page_owner=on to detect recursion in page tracking. */ 846 + unsigned in_page_owner:1; 847 + #endif 844 848 845 849 unsigned long atomic_flags; /* Flags requiring atomic access. */ 846 850
+47 -18
include/linux/vmalloc.h
··· 26 26 #define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */ 27 27 #define VM_FLUSH_RESET_PERMS 0x00000100 /* reset direct map and flush TLB on unmap, can't be freed in atomic context */ 28 28 #define VM_MAP_PUT_PAGES 0x00000200 /* put pages and free array in vfree */ 29 + #define VM_NO_HUGE_VMAP 0x00000400 /* force PAGE_SIZE pte mapping */ 29 30 30 31 /* 31 32 * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC. ··· 55 54 unsigned long size; 56 55 unsigned long flags; 57 56 struct page **pages; 57 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC 58 + unsigned int page_order; 59 + #endif 58 60 unsigned int nr_pages; 59 61 phys_addr_t phys_addr; 60 62 const void *caller; ··· 81 77 struct vm_struct *vm; /* in "busy" tree */ 82 78 }; 83 79 }; 80 + 81 + /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */ 82 + #ifndef arch_vmap_p4d_supported 83 + static inline bool arch_vmap_p4d_supported(pgprot_t prot) 84 + { 85 + return false; 86 + } 87 + #endif 88 + 89 + #ifndef arch_vmap_pud_supported 90 + static inline bool arch_vmap_pud_supported(pgprot_t prot) 91 + { 92 + return false; 93 + } 94 + #endif 95 + 96 + #ifndef arch_vmap_pmd_supported 97 + static inline bool arch_vmap_pmd_supported(pgprot_t prot) 98 + { 99 + return false; 100 + } 101 + #endif 84 102 85 103 /* 86 104 * Highlevel APIs for driver use ··· 192 166 extern struct vm_struct *remove_vm_area(const void *addr); 193 167 extern struct vm_struct *find_vm_area(const void *addr); 194 168 169 + static inline bool is_vm_area_hugepages(const void *addr) 170 + { 171 + /* 172 + * This may not 100% tell if the area is mapped with > PAGE_SIZE 173 + * page table entries, if for some reason the architecture indicates 174 + * larger sizes are available but decides not to use them, nothing 175 + * prevents that. This only indicates the size of the physical page 176 + * allocated in the vmalloc layer. 
177 + */ 178 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC 179 + return find_vm_area(addr)->page_order > 0; 180 + #else 181 + return false; 182 + #endif 183 + } 184 + 195 185 #ifdef CONFIG_MMU 196 - extern int map_kernel_range_noflush(unsigned long start, unsigned long size, 197 - pgprot_t prot, struct page **pages); 198 - int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, 199 - struct page **pages); 200 - extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size); 201 - extern void unmap_kernel_range(unsigned long addr, unsigned long size); 186 + int vmap_range(unsigned long addr, unsigned long end, 187 + phys_addr_t phys_addr, pgprot_t prot, 188 + unsigned int max_page_shift); 189 + void vunmap_range(unsigned long addr, unsigned long end); 202 190 static inline void set_vm_flush_reset_perms(void *addr) 203 191 { 204 192 struct vm_struct *vm = find_vm_area(addr); ··· 220 180 if (vm) 221 181 vm->flags |= VM_FLUSH_RESET_PERMS; 222 182 } 183 + 223 184 #else 224 - static inline int 225 - map_kernel_range_noflush(unsigned long start, unsigned long size, 226 - pgprot_t prot, struct page **pages) 227 - { 228 - return size >> PAGE_SHIFT; 229 - } 230 - #define map_kernel_range map_kernel_range_noflush 231 - static inline void 232 - unmap_kernel_range_noflush(unsigned long addr, unsigned long size) 233 - { 234 - } 235 - #define unmap_kernel_range unmap_kernel_range_noflush 236 185 static inline void set_vm_flush_reset_perms(void *addr) 237 186 { 238 187 }
+3 -21
include/linux/vmstat.h
··· 512 512 513 513 #endif /* CONFIG_MEMCG */ 514 514 515 - static inline void __inc_lruvec_state(struct lruvec *lruvec, 516 - enum node_stat_item idx) 515 + static inline void inc_lruvec_state(struct lruvec *lruvec, 516 + enum node_stat_item idx) 517 517 { 518 - __mod_lruvec_state(lruvec, idx, 1); 519 - } 520 - 521 - static inline void __dec_lruvec_state(struct lruvec *lruvec, 522 - enum node_stat_item idx) 523 - { 524 - __mod_lruvec_state(lruvec, idx, -1); 518 + mod_lruvec_state(lruvec, idx, 1); 525 519 } 526 520 527 521 static inline void __inc_lruvec_page_state(struct page *page, ··· 528 534 enum node_stat_item idx) 529 535 { 530 536 __mod_lruvec_page_state(page, idx, -1); 531 - } 532 - 533 - static inline void inc_lruvec_state(struct lruvec *lruvec, 534 - enum node_stat_item idx) 535 - { 536 - mod_lruvec_state(lruvec, idx, 1); 537 - } 538 - 539 - static inline void dec_lruvec_state(struct lruvec *lruvec, 540 - enum node_stat_item idx) 541 - { 542 - mod_lruvec_state(lruvec, idx, -1); 543 537 } 544 538 545 539 static inline void inc_lruvec_page_state(struct page *page,
+1 -1
include/net/page_pool.h
··· 65 65 #define PP_ALLOC_CACHE_REFILL 64 66 66 struct pp_alloc_cache { 67 67 u32 count; 68 - void *cache[PP_ALLOC_CACHE_SIZE]; 68 + struct page *cache[PP_ALLOC_CACHE_SIZE]; 69 69 }; 70 70 71 71 struct page_pool_params {
+22 -2
include/trace/events/kmem.h
··· 343 343 #define __PTR_TO_HASHVAL 344 344 #endif 345 345 346 + #define TRACE_MM_PAGES \ 347 + EM(MM_FILEPAGES) \ 348 + EM(MM_ANONPAGES) \ 349 + EM(MM_SWAPENTS) \ 350 + EMe(MM_SHMEMPAGES) 351 + 352 + #undef EM 353 + #undef EMe 354 + 355 + #define EM(a) TRACE_DEFINE_ENUM(a); 356 + #define EMe(a) TRACE_DEFINE_ENUM(a); 357 + 358 + TRACE_MM_PAGES 359 + 360 + #undef EM 361 + #undef EMe 362 + 363 + #define EM(a) { a, #a }, 364 + #define EMe(a) { a, #a } 365 + 346 366 TRACE_EVENT(rss_stat, 347 367 348 368 TP_PROTO(struct mm_struct *mm, ··· 385 365 __entry->size = (count << PAGE_SHIFT); 386 366 ), 387 367 388 - TP_printk("mm_id=%u curr=%d member=%d size=%ldB", 368 + TP_printk("mm_id=%u curr=%d type=%s size=%ldB", 389 369 __entry->mm_id, 390 370 __entry->curr, 391 - __entry->member, 371 + __print_symbolic(__entry->member, TRACE_MM_PAGES), 392 372 __entry->size) 393 373 ); 394 374 #endif /* _TRACE_KMEM_H */
+1 -1
init/main.c
··· 830 830 report_meminit(); 831 831 stack_depot_init(); 832 832 mem_init(); 833 + mem_init_print_info(); 833 834 /* page_owner must be initialized after buddy is ready */ 834 835 page_ext_init_flatmem_late(); 835 836 kmem_cache_init(); ··· 838 837 pgtable_init(); 839 838 debug_objects_mem_init(); 840 839 vmalloc_init(); 841 - ioremap_huge_init(); 842 840 /* Should be run before the first non-init thread is created */ 843 841 init_espfix_bsp(); 844 842 /* Should be run after espfix64 is set up. */
+21 -13
kernel/cgroup/cgroup.c
··· 1339 1339 1340 1340 mutex_unlock(&cgroup_mutex); 1341 1341 1342 + cgroup_rstat_exit(cgrp); 1342 1343 kernfs_destroy_root(root->kf_root); 1343 1344 cgroup_free_root(root); 1344 1345 } ··· 1752 1751 &dcgrp->e_csets[ss->id]); 1753 1752 spin_unlock_irq(&css_set_lock); 1754 1753 1754 + if (ss->css_rstat_flush) { 1755 + list_del_rcu(&css->rstat_css_node); 1756 + list_add_rcu(&css->rstat_css_node, 1757 + &dcgrp->rstat_css_list); 1758 + } 1759 + 1755 1760 /* default hierarchy doesn't enable controllers by default */ 1756 1761 dst_root->subsys_mask |= 1 << ssid; 1757 1762 if (dst_root == &cgrp_dfl_root) { ··· 1978 1971 if (ret) 1979 1972 goto destroy_root; 1980 1973 1981 - ret = rebind_subsystems(root, ss_mask); 1974 + ret = cgroup_rstat_init(root_cgrp); 1982 1975 if (ret) 1983 1976 goto destroy_root; 1977 + 1978 + ret = rebind_subsystems(root, ss_mask); 1979 + if (ret) 1980 + goto exit_stats; 1984 1981 1985 1982 ret = cgroup_bpf_inherit(root_cgrp); 1986 1983 WARN_ON_ONCE(ret); ··· 2017 2006 ret = 0; 2018 2007 goto out; 2019 2008 2009 + exit_stats: 2010 + cgroup_rstat_exit(root_cgrp); 2020 2011 destroy_root: 2021 2012 kernfs_destroy_root(root->kf_root); 2022 2013 root->kf_root = NULL; ··· 4947 4934 cgroup_put(cgroup_parent(cgrp)); 4948 4935 kernfs_put(cgrp->kn); 4949 4936 psi_cgroup_free(cgrp); 4950 - if (cgroup_on_dfl(cgrp)) 4951 - cgroup_rstat_exit(cgrp); 4937 + cgroup_rstat_exit(cgrp); 4952 4938 kfree(cgrp); 4953 4939 } else { 4954 4940 /* ··· 4988 4976 /* cgroup release path */ 4989 4977 TRACE_CGROUP_PATH(release, cgrp); 4990 4978 4991 - if (cgroup_on_dfl(cgrp)) 4992 - cgroup_rstat_flush(cgrp); 4979 + cgroup_rstat_flush(cgrp); 4993 4980 4994 4981 spin_lock_irq(&css_set_lock); 4995 4982 for (tcgrp = cgroup_parent(cgrp); tcgrp; ··· 5045 5034 css_get(css->parent); 5046 5035 } 5047 5036 5048 - if (cgroup_on_dfl(cgrp) && ss->css_rstat_flush) 5037 + if (ss->css_rstat_flush) 5049 5038 list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list); 5050 5039 5051 5040 
BUG_ON(cgroup_css(cgrp, ss)); ··· 5170 5159 if (ret) 5171 5160 goto out_free_cgrp; 5172 5161 5173 - if (cgroup_on_dfl(parent)) { 5174 - ret = cgroup_rstat_init(cgrp); 5175 - if (ret) 5176 - goto out_cancel_ref; 5177 - } 5162 + ret = cgroup_rstat_init(cgrp); 5163 + if (ret) 5164 + goto out_cancel_ref; 5178 5165 5179 5166 /* create the directory */ 5180 5167 kn = kernfs_create_dir(parent->kn, name, mode, cgrp); ··· 5259 5250 out_kernfs_remove: 5260 5251 kernfs_remove(cgrp->kn); 5261 5252 out_stat_exit: 5262 - if (cgroup_on_dfl(parent)) 5263 - cgroup_rstat_exit(cgrp); 5253 + cgroup_rstat_exit(cgrp); 5264 5254 out_cancel_ref: 5265 5255 percpu_ref_exit(&cgrp->self.refcnt); 5266 5256 out_free_cgrp:
+34 -25
kernel/cgroup/rstat.c
··· 25 25 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) 26 26 { 27 27 raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu); 28 - struct cgroup *parent; 29 28 unsigned long flags; 30 - 31 - /* nothing to do for root */ 32 - if (!cgroup_parent(cgrp)) 33 - return; 34 29 35 30 /* 36 31 * Speculative already-on-list test. This may race leading to ··· 41 46 raw_spin_lock_irqsave(cpu_lock, flags); 42 47 43 48 /* put @cgrp and all ancestors on the corresponding updated lists */ 44 - for (parent = cgroup_parent(cgrp); parent; 45 - cgrp = parent, parent = cgroup_parent(cgrp)) { 49 + while (true) { 46 50 struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); 47 - struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); 51 + struct cgroup *parent = cgroup_parent(cgrp); 52 + struct cgroup_rstat_cpu *prstatc; 48 53 49 54 /* 50 55 * Both additions and removals are bottom-up. If a cgroup ··· 53 58 if (rstatc->updated_next) 54 59 break; 55 60 61 + /* Root has no parent to link it to, but mark it busy */ 62 + if (!parent) { 63 + rstatc->updated_next = cgrp; 64 + break; 65 + } 66 + 67 + prstatc = cgroup_rstat_cpu(parent, cpu); 56 68 rstatc->updated_next = prstatc->updated_children; 57 69 prstatc->updated_children = cgrp; 70 + 71 + cgrp = parent; 58 72 } 59 73 60 74 raw_spin_unlock_irqrestore(cpu_lock, flags); ··· 117 113 */ 118 114 if (rstatc->updated_next) { 119 115 struct cgroup *parent = cgroup_parent(pos); 120 - struct cgroup_rstat_cpu *prstatc = cgroup_rstat_cpu(parent, cpu); 121 - struct cgroup_rstat_cpu *nrstatc; 122 - struct cgroup **nextp; 123 116 124 - nextp = &prstatc->updated_children; 125 - while (true) { 126 - nrstatc = cgroup_rstat_cpu(*nextp, cpu); 127 - if (*nextp == pos) 128 - break; 117 + if (parent) { 118 + struct cgroup_rstat_cpu *prstatc; 119 + struct cgroup **nextp; 129 120 130 - WARN_ON_ONCE(*nextp == parent); 131 - nextp = &nrstatc->updated_next; 121 + prstatc = cgroup_rstat_cpu(parent, cpu); 122 + nextp = 
&prstatc->updated_children; 123 + while (true) { 124 + struct cgroup_rstat_cpu *nrstatc; 125 + 126 + nrstatc = cgroup_rstat_cpu(*nextp, cpu); 127 + if (*nextp == pos) 128 + break; 129 + WARN_ON_ONCE(*nextp == parent); 130 + nextp = &nrstatc->updated_next; 131 + } 132 + *nextp = rstatc->updated_next; 132 133 } 133 134 134 - *nextp = rstatc->updated_next; 135 135 rstatc->updated_next = NULL; 136 - 137 136 return pos; 138 137 } 139 138 ··· 292 285 293 286 for_each_possible_cpu(cpu) 294 287 raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu)); 295 - 296 - BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp)); 297 288 } 298 289 299 290 /* ··· 316 311 317 312 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu) 318 313 { 319 - struct cgroup *parent = cgroup_parent(cgrp); 320 314 struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu); 315 + struct cgroup *parent = cgroup_parent(cgrp); 321 316 struct cgroup_base_stat cur, delta; 322 317 unsigned seq; 318 + 319 + /* Root-level stats are sourced from system-wide CPU stats */ 320 + if (!parent) 321 + return; 323 322 324 323 /* fetch the current per-cpu values */ 325 324 do { ··· 337 328 cgroup_base_stat_add(&cgrp->bstat, &delta); 338 329 cgroup_base_stat_add(&rstatc->last_bstat, &delta); 339 330 340 - /* propagate global delta to parent */ 341 - if (parent) { 331 + /* propagate global delta to parent (unless that's root) */ 332 + if (cgroup_parent(parent)) { 342 333 delta = cgrp->bstat; 343 334 cgroup_base_stat_sub(&delta, &cgrp->last_bstat); 344 335 cgroup_base_stat_add(&parent->bstat, &delta);
-1
kernel/dma/remap.c
··· 66 66 return; 67 67 } 68 68 69 - unmap_kernel_range((unsigned long)cpu_addr, PAGE_ALIGN(size)); 70 69 vunmap(cpu_addr); 71 70 }
+8 -5
kernel/fork.c
··· 380 380 void *stack = task_stack_page(tsk); 381 381 struct vm_struct *vm = task_stack_vm_area(tsk); 382 382 383 + if (vm) { 384 + int i; 383 385 384 - /* All stack pages are in the same node. */ 385 - if (vm) 386 - mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB, 387 - account * (THREAD_SIZE / 1024)); 388 - else 386 + for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) 387 + mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB, 388 + account * (PAGE_SIZE / 1024)); 389 + } else { 390 + /* All stack pages are in the same node. */ 389 391 mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB, 390 392 account * (THREAD_SIZE / 1024)); 393 + } 391 394 } 392 395 393 396 static int memcg_charge_kernel_stack(struct task_struct *tsk)
+6 -1
kernel/irq_work.c
··· 19 19 #include <linux/notifier.h> 20 20 #include <linux/smp.h> 21 21 #include <asm/processor.h> 22 - 22 + #include <linux/kasan.h> 23 23 24 24 static DEFINE_PER_CPU(struct llist_head, raised_list); 25 25 static DEFINE_PER_CPU(struct llist_head, lazy_list); ··· 70 70 if (!irq_work_claim(work)) 71 71 return false; 72 72 73 + /* record irq_work call stack in order to print it in KASAN reports */ 74 + kasan_record_aux_stack(work); 75 + 73 76 /* Queue the entry and raise the IPI if needed. */ 74 77 preempt_disable(); 75 78 __irq_work_queue_local(work); ··· 100 97 /* Only queue if not already pending */ 101 98 if (!irq_work_claim(work)) 102 99 return false; 100 + 101 + kasan_record_aux_stack(work); 103 102 104 103 preempt_disable(); 105 104 if (cpu != smp_processor_id()) {
+3
kernel/task_work.c
··· 34 34 { 35 35 struct callback_head *head; 36 36 37 + /* record the work call stack in order to print it in KASAN reports */ 38 + kasan_record_aux_stack(work); 39 + 37 40 do { 38 41 head = READ_ONCE(task->task_works); 39 42 if (unlikely(head == &work_exited))
+44 -44
kernel/watchdog.c
··· 154 154 155 155 #ifdef CONFIG_SOFTLOCKUP_DETECTOR 156 156 157 - #define SOFTLOCKUP_RESET ULONG_MAX 157 + /* 158 + * Delay the softlockup report when running known slow code. 159 + * It does _not_ affect the timestamp of the last successful reschedule. 160 + */ 161 + #define SOFTLOCKUP_DELAY_REPORT ULONG_MAX 158 162 159 163 #ifdef CONFIG_SMP 160 164 int __read_mostly sysctl_softlockup_all_cpu_backtrace; ··· 173 169 static bool softlockup_initialized __read_mostly; 174 170 static u64 __read_mostly sample_period; 175 171 172 + /* Timestamp taken after the last successful reschedule. */ 176 173 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts); 174 + /* Timestamp of the last softlockup report. */ 175 + static DEFINE_PER_CPU(unsigned long, watchdog_report_ts); 177 176 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer); 178 177 static DEFINE_PER_CPU(bool, softlockup_touch_sync); 179 - static DEFINE_PER_CPU(bool, soft_watchdog_warn); 180 178 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); 181 179 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved); 182 180 static unsigned long soft_lockup_nmi_warn; ··· 241 235 watchdog_update_hrtimer_threshold(sample_period); 242 236 } 243 237 238 + static void update_report_ts(void) 239 + { 240 + __this_cpu_write(watchdog_report_ts, get_timestamp()); 241 + } 242 + 244 243 /* Commands for resetting the watchdog */ 245 - static void __touch_watchdog(void) 244 + static void update_touch_ts(void) 246 245 { 247 246 __this_cpu_write(watchdog_touch_ts, get_timestamp()); 247 + update_report_ts(); 248 248 } 249 249 250 250 /** ··· 264 252 notrace void touch_softlockup_watchdog_sched(void) 265 253 { 266 254 /* 267 - * Preemption can be enabled. It doesn't matter which CPU's timestamp 268 - * gets zeroed here, so use the raw_ operation. 255 + * Preemption can be enabled. It doesn't matter which CPU's watchdog 256 + * report period gets restarted here, so use the raw_ operation. 
269 257 */ 270 - raw_cpu_write(watchdog_touch_ts, SOFTLOCKUP_RESET); 258 + raw_cpu_write(watchdog_report_ts, SOFTLOCKUP_DELAY_REPORT); 271 259 } 272 260 273 261 notrace void touch_softlockup_watchdog(void) ··· 291 279 * the softlockup check. 292 280 */ 293 281 for_each_cpu(cpu, &watchdog_allowed_mask) { 294 - per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET; 282 + per_cpu(watchdog_report_ts, cpu) = SOFTLOCKUP_DELAY_REPORT; 295 283 wq_watchdog_touch(cpu); 296 284 } 297 285 } ··· 299 287 void touch_softlockup_watchdog_sync(void) 300 288 { 301 289 __this_cpu_write(softlockup_touch_sync, true); 302 - __this_cpu_write(watchdog_touch_ts, SOFTLOCKUP_RESET); 290 + __this_cpu_write(watchdog_report_ts, SOFTLOCKUP_DELAY_REPORT); 303 291 } 304 292 305 - static int is_softlockup(unsigned long touch_ts) 293 + static int is_softlockup(unsigned long touch_ts, unsigned long period_ts) 306 294 { 307 295 unsigned long now = get_timestamp(); 308 296 309 297 if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){ 310 298 /* Warn about unreasonable delays. */ 311 - if (time_after(now, touch_ts + get_softlockup_thresh())) 299 + if (time_after(now, period_ts + get_softlockup_thresh())) 312 300 return now - touch_ts; 313 301 } 314 302 return 0; ··· 344 332 */ 345 333 static int softlockup_fn(void *data) 346 334 { 347 - __touch_watchdog(); 335 + update_touch_ts(); 348 336 complete(this_cpu_ptr(&softlockup_completion)); 349 337 350 338 return 0; ··· 354 342 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) 355 343 { 356 344 unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts); 345 + unsigned long period_ts = __this_cpu_read(watchdog_report_ts); 357 346 struct pt_regs *regs = get_irq_regs(); 358 347 int duration; 359 348 int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace; ··· 376 363 /* .. 
and repeat */ 377 364 hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period)); 378 365 379 - if (touch_ts == SOFTLOCKUP_RESET) { 366 + /* 367 + * If a virtual machine is stopped by the host it can look to 368 + * the watchdog like a soft lockup. Check to see if the host 369 + * stopped the vm before we process the timestamps. 370 + */ 371 + kvm_check_and_clear_guest_paused(); 372 + 373 + /* Reset the interval when touched by known problematic code. */ 374 + if (period_ts == SOFTLOCKUP_DELAY_REPORT) { 380 375 if (unlikely(__this_cpu_read(softlockup_touch_sync))) { 381 376 /* 382 377 * If the time stamp was touched atomically ··· 394 373 sched_clock_tick(); 395 374 } 396 375 397 - /* Clear the guest paused flag on watchdog reset */ 398 - kvm_check_and_clear_guest_paused(); 399 - __touch_watchdog(); 376 + update_report_ts(); 400 377 return HRTIMER_RESTART; 401 378 } 402 379 ··· 404 385 * indicate it is getting cpu time. If it hasn't then 405 386 * this is a good indication some task is hogging the cpu 406 387 */ 407 - duration = is_softlockup(touch_ts); 388 + duration = is_softlockup(touch_ts, period_ts); 408 389 if (unlikely(duration)) { 409 390 /* 410 - * If a virtual machine is stopped by the host it can look to 411 - * the watchdog like a soft lockup, check to see if the host 412 - * stopped the vm before we issue the warning 391 + * Prevent multiple soft-lockup reports if one cpu is already 392 + * engaged in dumping all cpu back traces. 413 393 */ 414 - if (kvm_check_and_clear_guest_paused()) 415 - return HRTIMER_RESTART; 416 - 417 - /* only warn once */ 418 - if (__this_cpu_read(soft_watchdog_warn) == true) 419 - return HRTIMER_RESTART; 420 - 421 394 if (softlockup_all_cpu_backtrace) { 422 - /* Prevent multiple soft-lockup reports if one cpu is already 423 - * engaged in dumping cpu back traces 424 - */ 425 - if (test_and_set_bit(0, &soft_lockup_nmi_warn)) { 426 - /* Someone else will report us. 
Let's give up */ 427 - __this_cpu_write(soft_watchdog_warn, true); 395 + if (test_and_set_bit_lock(0, &soft_lockup_nmi_warn)) 428 396 return HRTIMER_RESTART; 429 - } 430 397 } 398 + 399 + /* Start period for the next softlockup warning. */ 400 + update_report_ts(); 431 401 432 402 pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n", 433 403 smp_processor_id(), duration, ··· 429 421 dump_stack(); 430 422 431 423 if (softlockup_all_cpu_backtrace) { 432 - /* Avoid generating two back traces for current 433 - * given that one is already made above 434 - */ 435 424 trigger_allbutself_cpu_backtrace(); 436 - 437 - clear_bit(0, &soft_lockup_nmi_warn); 438 - /* Barrier to sync with other cpus */ 439 - smp_mb__after_atomic(); 425 + clear_bit_unlock(0, &soft_lockup_nmi_warn); 440 426 } 441 427 442 428 add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK); 443 429 if (softlockup_panic) 444 430 panic("softlockup: hung tasks"); 445 - __this_cpu_write(soft_watchdog_warn, true); 446 - } else 447 - __this_cpu_write(soft_watchdog_warn, false); 431 + } 448 432 449 433 return HRTIMER_RESTART; 450 434 } ··· 461 461 HRTIMER_MODE_REL_PINNED_HARD); 462 462 463 463 /* Initialize timestamp */ 464 - __touch_watchdog(); 464 + update_touch_ts(); 465 465 /* Enable the perf event */ 466 466 if (watchdog_enabled & NMI_WATCHDOG_ENABLED) 467 467 watchdog_nmi_enable(cpu);
+8 -1
lib/Kconfig.debug
··· 2573 2573 2574 2574 endif # RUNTIME_TESTING_MENU 2575 2575 2576 + config ARCH_USE_MEMTEST 2577 + bool 2578 + help 2579 + An architecture should select this when it uses early_memtest() 2580 + during the boot process. 2581 + 2576 2582 config MEMTEST 2577 2583 bool "Memtest" 2584 + depends on ARCH_USE_MEMTEST 2578 2585 help 2579 2586 This option adds a kernel parameter 'memtest', which allows memtest 2580 - to be set. 2587 + to be set and executed. 2581 2588 memtest=0, mean disabled; -- default 2582 2589 memtest=1, mean do 1 test pattern; 2583 2590 ...
+31 -28
lib/test_kasan.c
··· 54 54 55 55 multishot = kasan_save_enable_multi_shot(); 56 56 kasan_set_tagging_report_once(false); 57 + fail_data.report_found = false; 58 + fail_data.report_expected = false; 59 + kunit_add_named_resource(test, NULL, NULL, &resource, 60 + "kasan_data", &fail_data); 57 61 return 0; 58 62 } 59 63 ··· 65 61 { 66 62 kasan_set_tagging_report_once(true); 67 63 kasan_restore_multi_shot(multishot); 64 + KUNIT_EXPECT_FALSE(test, fail_data.report_found); 68 65 } 69 66 70 67 /** ··· 83 78 * fields, it can reorder or optimize away the accesses to those fields. 84 79 * Use READ/WRITE_ONCE() for the accesses and compiler barriers around the 85 80 * expression to prevent that. 81 + * 82 + * In between KUNIT_EXPECT_KASAN_FAIL checks, fail_data.report_found is kept as 83 + * false. This allows detecting KASAN reports that happen outside of the checks 84 + * by asserting !fail_data.report_found at the start of KUNIT_EXPECT_KASAN_FAIL 85 + * and in kasan_test_exit. 86 86 */ 87 - #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do { \ 88 - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \ 89 - !kasan_async_mode_enabled()) \ 90 - migrate_disable(); \ 91 - WRITE_ONCE(fail_data.report_expected, true); \ 92 - WRITE_ONCE(fail_data.report_found, false); \ 93 - kunit_add_named_resource(test, \ 94 - NULL, \ 95 - NULL, \ 96 - &resource, \ 97 - "kasan_data", &fail_data); \ 98 - barrier(); \ 99 - expression; \ 100 - barrier(); \ 101 - if (kasan_async_mode_enabled()) \ 102 - kasan_force_async_fault(); \ 103 - barrier(); \ 104 - KUNIT_EXPECT_EQ(test, \ 105 - READ_ONCE(fail_data.report_expected), \ 106 - READ_ONCE(fail_data.report_found)); \ 107 - if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \ 108 - !kasan_async_mode_enabled()) { \ 109 - if (READ_ONCE(fail_data.report_found)) \ 110 - kasan_enable_tagging_sync(); \ 111 - migrate_enable(); \ 112 - } \ 87 + #define KUNIT_EXPECT_KASAN_FAIL(test, expression) do { \ 88 + if (IS_ENABLED(CONFIG_KASAN_HW_TAGS) && \ 89 + !kasan_async_mode_enabled()) \ 90 + 
migrate_disable(); \ 91 + KUNIT_EXPECT_FALSE(test, READ_ONCE(fail_data.report_found)); \ 92 + WRITE_ONCE(fail_data.report_expected, true); \ 93 + barrier(); \ 94 + expression; \ 95 + barrier(); \ 96 + KUNIT_EXPECT_EQ(test, \ 97 + READ_ONCE(fail_data.report_expected), \ 98 + READ_ONCE(fail_data.report_found)); \ 99 + if (IS_ENABLED(CONFIG_KASAN_HW_TAGS)) { \ 100 + if (READ_ONCE(fail_data.report_found)) \ 101 + kasan_enable_tagging_sync(); \ 102 + migrate_enable(); \ 103 + } \ 104 + WRITE_ONCE(fail_data.report_found, false); \ 105 + WRITE_ONCE(fail_data.report_expected, false); \ 113 106 } while (0) 114 107 115 108 #define KASAN_TEST_NEEDS_CONFIG_ON(test, config) do { \ ··· 1052 1049 continue; 1053 1050 1054 1051 /* Mark the first memory granule with the chosen memory tag. */ 1055 - kasan_poison(ptr, KASAN_GRANULE_SIZE, (u8)tag); 1052 + kasan_poison(ptr, KASAN_GRANULE_SIZE, (u8)tag, false); 1056 1053 1057 1054 /* This access must cause a KASAN report. */ 1058 1055 KUNIT_EXPECT_KASAN_FAIL(test, *ptr = 0); 1059 1056 } 1060 1057 1061 1058 /* Recover the memory tag and free. */ 1062 - kasan_poison(ptr, KASAN_GRANULE_SIZE, get_tag(ptr)); 1059 + kasan_poison(ptr, KASAN_GRANULE_SIZE, get_tag(ptr), false); 1063 1060 kfree(ptr); 1064 1061 } 1065 1062
+40 -88
lib/test_vmalloc.c
··· 23 23 module_param(name, type, 0444); \ 24 24 MODULE_PARM_DESC(name, msg) \ 25 25 26 - __param(bool, single_cpu_test, false, 27 - "Use single first online CPU to run tests"); 26 + __param(int, nr_threads, 0, 27 + "Number of workers to perform tests(min: 1 max: USHRT_MAX)"); 28 28 29 29 __param(bool, sequential_test_order, false, 30 30 "Use sequential stress tests order"); ··· 47 47 "\t\tid: 128, name: pcpu_alloc_test\n" 48 48 "\t\tid: 256, name: kvfree_rcu_1_arg_vmalloc_test\n" 49 49 "\t\tid: 512, name: kvfree_rcu_2_arg_vmalloc_test\n" 50 - "\t\tid: 1024, name: kvfree_rcu_1_arg_slab_test\n" 51 - "\t\tid: 2048, name: kvfree_rcu_2_arg_slab_test\n" 52 50 /* Add a new test case description here. */ 53 51 ); 54 - 55 - /* 56 - * Depends on single_cpu_test parameter. If it is true, then 57 - * use first online CPU to trigger a test on, otherwise go with 58 - * all online CPUs. 59 - */ 60 - static cpumask_t cpus_run_test_mask = CPU_MASK_NONE; 61 52 62 53 /* 63 54 * Read write semaphore for synchronization of setup ··· 354 363 return 0; 355 364 } 356 365 357 - static int 358 - kvfree_rcu_1_arg_slab_test(void) 359 - { 360 - struct test_kvfree_rcu *p; 361 - int i; 362 - 363 - for (i = 0; i < test_loop_count; i++) { 364 - p = kmalloc(sizeof(*p), GFP_KERNEL); 365 - if (!p) 366 - return -1; 367 - 368 - p->array[0] = 'a'; 369 - kvfree_rcu(p); 370 - } 371 - 372 - return 0; 373 - } 374 - 375 - static int 376 - kvfree_rcu_2_arg_slab_test(void) 377 - { 378 - struct test_kvfree_rcu *p; 379 - int i; 380 - 381 - for (i = 0; i < test_loop_count; i++) { 382 - p = kmalloc(sizeof(*p), GFP_KERNEL); 383 - if (!p) 384 - return -1; 385 - 386 - p->array[0] = 'a'; 387 - kvfree_rcu(p, rcu); 388 - } 389 - 390 - return 0; 391 - } 392 - 393 366 struct test_case_desc { 394 367 const char *test_name; 395 368 int (*test_func)(void); ··· 370 415 { "pcpu_alloc_test", pcpu_alloc_test }, 371 416 { "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test }, 372 417 { 
"kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test }, 373 - { "kvfree_rcu_1_arg_slab_test", kvfree_rcu_1_arg_slab_test }, 374 - { "kvfree_rcu_2_arg_slab_test", kvfree_rcu_2_arg_slab_test }, 375 418 /* Add a new test case here. */ 376 419 }; 377 420 ··· 379 426 u64 time; 380 427 }; 381 428 382 - /* Split it to get rid of: WARNING: line over 80 characters */ 383 - static struct test_case_data 384 - per_cpu_test_data[NR_CPUS][ARRAY_SIZE(test_case_array)]; 385 - 386 429 static struct test_driver { 387 430 struct task_struct *task; 431 + struct test_case_data data[ARRAY_SIZE(test_case_array)]; 432 + 388 433 unsigned long start; 389 434 unsigned long stop; 390 - int cpu; 391 - } per_cpu_test_driver[NR_CPUS]; 435 + } *tdriver; 392 436 393 437 static void shuffle_array(int *arr, int n) 394 438 { ··· 413 463 ktime_t kt; 414 464 u64 delta; 415 465 416 - if (set_cpus_allowed_ptr(current, cpumask_of(t->cpu)) < 0) 417 - pr_err("Failed to set affinity to %d CPU\n", t->cpu); 418 - 419 466 for (i = 0; i < ARRAY_SIZE(test_case_array); i++) 420 467 random_array[i] = i; 421 468 ··· 437 490 kt = ktime_get(); 438 491 for (j = 0; j < test_repeat_count; j++) { 439 492 if (!test_case_array[index].test_func()) 440 - per_cpu_test_data[t->cpu][index].test_passed++; 493 + t->data[index].test_passed++; 441 494 else 442 - per_cpu_test_data[t->cpu][index].test_failed++; 495 + t->data[index].test_failed++; 443 496 } 444 497 445 498 /* ··· 448 501 delta = (u64) ktime_us_delta(ktime_get(), kt); 449 502 do_div(delta, (u32) test_repeat_count); 450 503 451 - per_cpu_test_data[t->cpu][index].time = delta; 504 + t->data[index].time = delta; 452 505 } 453 506 t->stop = get_cycles(); 454 507 ··· 464 517 return 0; 465 518 } 466 519 467 - static void 520 + static int 468 521 init_test_configurtion(void) 469 522 { 470 523 /* 471 - * Reset all data of all CPUs. 524 + * A maximum number of workers is defined as hard-coded 525 + * value and set to USHRT_MAX. 
We add such gap just in 526 + * case and for potential heavy stressing. 472 527 */ 473 - memset(per_cpu_test_data, 0, sizeof(per_cpu_test_data)); 528 + nr_threads = clamp(nr_threads, 1, (int) USHRT_MAX); 474 529 475 - if (single_cpu_test) 476 - cpumask_set_cpu(cpumask_first(cpu_online_mask), 477 - &cpus_run_test_mask); 478 - else 479 - cpumask_and(&cpus_run_test_mask, cpu_online_mask, 480 - cpu_online_mask); 530 + /* Allocate the space for test instances. */ 531 + tdriver = kvcalloc(nr_threads, sizeof(*tdriver), GFP_KERNEL); 532 + if (tdriver == NULL) 533 + return -1; 481 534 482 535 if (test_repeat_count <= 0) 483 536 test_repeat_count = 1; 484 537 485 538 if (test_loop_count <= 0) 486 539 test_loop_count = 1; 540 + 541 + return 0; 487 542 } 488 543 489 544 static void do_concurrent_test(void) 490 545 { 491 - int cpu, ret; 546 + int i, ret; 492 547 493 548 /* 494 549 * Set some basic configurations plus sanity check. 495 550 */ 496 - init_test_configurtion(); 551 + ret = init_test_configurtion(); 552 + if (ret < 0) 553 + return; 497 554 498 555 /* 499 556 * Put on hold all workers. 500 557 */ 501 558 down_write(&prepare_for_test_rwsem); 502 559 503 - for_each_cpu(cpu, &cpus_run_test_mask) { 504 - struct test_driver *t = &per_cpu_test_driver[cpu]; 560 + for (i = 0; i < nr_threads; i++) { 561 + struct test_driver *t = &tdriver[i]; 505 562 506 - t->cpu = cpu; 507 - t->task = kthread_run(test_func, t, "vmalloc_test/%d", cpu); 563 + t->task = kthread_run(test_func, t, "vmalloc_test/%d", i); 508 564 509 565 if (!IS_ERR(t->task)) 510 566 /* Success. 
*/ 511 567 atomic_inc(&test_n_undone); 512 568 else 513 - pr_err("Failed to start kthread for %d CPU\n", cpu); 569 + pr_err("Failed to start %d kthread\n", i); 514 570 } 515 571 516 572 /* ··· 531 581 ret = wait_for_completion_timeout(&test_all_done_comp, HZ); 532 582 } while (!ret); 533 583 534 - for_each_cpu(cpu, &cpus_run_test_mask) { 535 - struct test_driver *t = &per_cpu_test_driver[cpu]; 536 - int i; 584 + for (i = 0; i < nr_threads; i++) { 585 + struct test_driver *t = &tdriver[i]; 586 + int j; 537 587 538 588 if (!IS_ERR(t->task)) 539 589 kthread_stop(t->task); 540 590 541 - for (i = 0; i < ARRAY_SIZE(test_case_array); i++) { 542 - if (!((run_test_mask & (1 << i)) >> i)) 591 + for (j = 0; j < ARRAY_SIZE(test_case_array); j++) { 592 + if (!((run_test_mask & (1 << j)) >> j)) 543 593 continue; 544 594 545 595 pr_info( 546 596 "Summary: %s passed: %d failed: %d repeat: %d loops: %d avg: %llu usec\n", 547 - test_case_array[i].test_name, 548 - per_cpu_test_data[cpu][i].test_passed, 549 - per_cpu_test_data[cpu][i].test_failed, 597 + test_case_array[j].test_name, 598 + t->data[j].test_passed, 599 + t->data[j].test_failed, 550 600 test_repeat_count, test_loop_count, 551 - per_cpu_test_data[cpu][i].time); 601 + t->data[j].time); 552 602 } 553 603 554 - pr_info("All test took CPU%d=%lu cycles\n", 555 - cpu, t->stop - t->start); 604 + pr_info("All test took worker%d=%lu cycles\n", 605 + i, t->stop - t->start); 556 606 } 607 + 608 + kvfree(tdriver); 557 609 } 558 610 559 611 static int vmalloc_test_init(void)
+3 -1
mm/Kconfig
··· 9 9 choice 10 10 prompt "Memory model" 11 11 depends on SELECT_MEMORY_MODEL 12 - default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT 13 12 default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT 14 13 default FLATMEM_MANUAL 15 14 help ··· 870 871 config KMAP_LOCAL 871 872 bool 872 873 874 + # struct io_mapping based helper. Selected by drivers that need them 875 + config IO_MAPPING 876 + bool 873 877 endmenu
+1
mm/Makefile
··· 120 120 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o 121 121 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o 122 122 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o 123 + obj-$(CONFIG_IO_MAPPING) += io-mapping.o
+2 -2
mm/debug_vm_pgtable.c
··· 247 247 { 248 248 pmd_t pmd; 249 249 250 - if (!arch_ioremap_pmd_supported()) 250 + if (!arch_vmap_pmd_supported(prot)) 251 251 return; 252 252 253 253 pr_debug("Validating PMD huge\n"); ··· 385 385 { 386 386 pud_t pud; 387 387 388 - if (!arch_ioremap_pud_supported()) 388 + if (!arch_vmap_pud_supported(prot)) 389 389 return; 390 390 391 391 pr_debug("Validating PUD huge\n");
+1 -1
mm/dmapool.c
··· 157 157 if (!retval) 158 158 return retval; 159 159 160 - strlcpy(retval->name, name, sizeof(retval->name)); 160 + strscpy(retval->name, name, sizeof(retval->name)); 161 161 162 162 retval->dev = dev; 163 163
+47 -14
mm/filemap.c
··· 636 636 } 637 637 638 638 /** 639 + * filemap_range_needs_writeback - check if range potentially needs writeback 640 + * @mapping: address space within which to check 641 + * @start_byte: offset in bytes where the range starts 642 + * @end_byte: offset in bytes where the range ends (inclusive) 643 + * 644 + * Find at least one page in the range supplied, usually used to check if 645 + * direct writing in this range will trigger a writeback. Used by O_DIRECT 646 + * read/write with IOCB_NOWAIT, to see if the caller needs to do 647 + * filemap_write_and_wait_range() before proceeding. 648 + * 649 + * Return: %true if the caller should do filemap_write_and_wait_range() before 650 + * doing O_DIRECT to a page in this range, %false otherwise. 651 + */ 652 + bool filemap_range_needs_writeback(struct address_space *mapping, 653 + loff_t start_byte, loff_t end_byte) 654 + { 655 + XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT); 656 + pgoff_t max = end_byte >> PAGE_SHIFT; 657 + struct page *page; 658 + 659 + if (!mapping_needs_writeback(mapping)) 660 + return false; 661 + if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) && 662 + !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) 663 + return false; 664 + if (end_byte < start_byte) 665 + return false; 666 + 667 + rcu_read_lock(); 668 + xas_for_each(&xas, page, max) { 669 + if (xas_retry(&xas, page)) 670 + continue; 671 + if (xa_is_value(page)) 672 + continue; 673 + if (PageDirty(page) || PageLocked(page) || PageWriteback(page)) 674 + break; 675 + } 676 + rcu_read_unlock(); 677 + return page != NULL; 678 + } 679 + EXPORT_SYMBOL_GPL(filemap_range_needs_writeback); 680 + 681 + /** 639 682 * filemap_write_and_wait_range - write out & wait on a file range 640 683 * @mapping: the address_space for the pages 641 684 * @lstart: offset in bytes where the range starts ··· 1767 1724 * @mapping: the address_space to search 1768 1725 * @index: The page cache index. 
1769 1726 * 1770 - * Looks up the page cache slot at @mapping & @offset. If there is a 1727 + * Looks up the page cache slot at @mapping & @index. If there is a 1771 1728 * page cache page, the head page is returned with an increased refcount. 1772 1729 * 1773 1730 * If the slot holds a shadow entry of a previously evicted page, or a ··· 2348 2305 return error; 2349 2306 if (PageUptodate(page)) 2350 2307 return 0; 2351 - if (!page->mapping) /* page truncated */ 2352 - return AOP_TRUNCATED_PAGE; 2353 2308 shrink_readahead_size_eio(&file->f_ra); 2354 2309 return -EIO; 2355 2310 } ··· 2679 2638 2680 2639 size = i_size_read(inode); 2681 2640 if (iocb->ki_flags & IOCB_NOWAIT) { 2682 - if (filemap_range_has_page(mapping, iocb->ki_pos, 2683 - iocb->ki_pos + count - 1)) 2641 + if (filemap_range_needs_writeback(mapping, iocb->ki_pos, 2642 + iocb->ki_pos + count - 1)) 2684 2643 return -EAGAIN; 2685 2644 } else { 2686 2645 retval = filemap_write_and_wait_range(mapping, ··· 2978 2937 struct file *file = vmf->vma->vm_file; 2979 2938 struct file *fpin = NULL; 2980 2939 struct address_space *mapping = file->f_mapping; 2981 - struct file_ra_state *ra = &file->f_ra; 2982 2940 struct inode *inode = mapping->host; 2983 2941 pgoff_t offset = vmf->pgoff; 2984 2942 pgoff_t max_off; ··· 3064 3024 * because there really aren't any performance issues here 3065 3025 * and we need to check for errors. 3066 3026 */ 3067 - ClearPageError(page); 3068 3027 fpin = maybe_unlock_mmap_for_io(vmf, fpin); 3069 - error = mapping->a_ops->readpage(file, page); 3070 - if (!error) { 3071 - wait_on_page_locked(page); 3072 - if (!PageUptodate(page)) 3073 - error = -EIO; 3074 - } 3028 + error = filemap_read_page(file, mapping, page); 3075 3029 if (fpin) 3076 3030 goto out_retry; 3077 3031 put_page(page); ··· 3073 3039 if (!error || error == AOP_TRUNCATED_PAGE) 3074 3040 goto retry_find; 3075 3041 3076 - shrink_readahead_size_eio(ra); 3077 3042 return VM_FAULT_SIGBUS; 3078 3043 3079 3044 out_retry:
+101 -44
mm/gup.c
··· 213 213 } 214 214 EXPORT_SYMBOL(unpin_user_page); 215 215 216 + static inline void compound_range_next(unsigned long i, unsigned long npages, 217 + struct page **list, struct page **head, 218 + unsigned int *ntails) 219 + { 220 + struct page *next, *page; 221 + unsigned int nr = 1; 222 + 223 + if (i >= npages) 224 + return; 225 + 226 + next = *list + i; 227 + page = compound_head(next); 228 + if (PageCompound(page) && compound_order(page) >= 1) 229 + nr = min_t(unsigned int, 230 + page + compound_nr(page) - next, npages - i); 231 + 232 + *head = page; 233 + *ntails = nr; 234 + } 235 + 236 + #define for_each_compound_range(__i, __list, __npages, __head, __ntails) \ 237 + for (__i = 0, \ 238 + compound_range_next(__i, __npages, __list, &(__head), &(__ntails)); \ 239 + __i < __npages; __i += __ntails, \ 240 + compound_range_next(__i, __npages, __list, &(__head), &(__ntails))) 241 + 242 + static inline void compound_next(unsigned long i, unsigned long npages, 243 + struct page **list, struct page **head, 244 + unsigned int *ntails) 245 + { 246 + struct page *page; 247 + unsigned int nr; 248 + 249 + if (i >= npages) 250 + return; 251 + 252 + page = compound_head(list[i]); 253 + for (nr = i + 1; nr < npages; nr++) { 254 + if (compound_head(list[nr]) != page) 255 + break; 256 + } 257 + 258 + *head = page; 259 + *ntails = nr - i; 260 + } 261 + 262 + #define for_each_compound_head(__i, __list, __npages, __head, __ntails) \ 263 + for (__i = 0, \ 264 + compound_next(__i, __npages, __list, &(__head), &(__ntails)); \ 265 + __i < __npages; __i += __ntails, \ 266 + compound_next(__i, __npages, __list, &(__head), &(__ntails))) 267 + 216 268 /** 217 269 * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages 218 270 * @pages: array of pages to be maybe marked dirty, and definitely released. 
··· 291 239 bool make_dirty) 292 240 { 293 241 unsigned long index; 294 - 295 - /* 296 - * TODO: this can be optimized for huge pages: if a series of pages is 297 - * physically contiguous and part of the same compound page, then a 298 - * single operation to the head page should suffice. 299 - */ 242 + struct page *head; 243 + unsigned int ntails; 300 244 301 245 if (!make_dirty) { 302 246 unpin_user_pages(pages, npages); 303 247 return; 304 248 } 305 249 306 - for (index = 0; index < npages; index++) { 307 - struct page *page = compound_head(pages[index]); 250 + for_each_compound_head(index, pages, npages, head, ntails) { 308 251 /* 309 252 * Checking PageDirty at this point may race with 310 253 * clear_page_dirty_for_io(), but that's OK. Two key ··· 320 273 * written back, so it gets written back again in the 321 274 * next writeback cycle. This is harmless. 322 275 */ 323 - if (!PageDirty(page)) 324 - set_page_dirty_lock(page); 325 - unpin_user_page(page); 276 + if (!PageDirty(head)) 277 + set_page_dirty_lock(head); 278 + put_compound_head(head, ntails, FOLL_PIN); 326 279 } 327 280 } 328 281 EXPORT_SYMBOL(unpin_user_pages_dirty_lock); 282 + 283 + /** 284 + * unpin_user_page_range_dirty_lock() - release and optionally dirty 285 + * gup-pinned page range 286 + * 287 + * @page: the starting page of a range maybe marked dirty, and definitely released. 288 + * @npages: number of consecutive pages to release. 289 + * @make_dirty: whether to mark the pages dirty 290 + * 291 + * "gup-pinned page range" refers to a range of pages that has had one of the 292 + * pin_user_pages() variants called on that page. 293 + * 294 + * For the page ranges defined by [page .. page+npages], make that range (or 295 + * its head pages, if a compound page) dirty, if @make_dirty is true, and if the 296 + * page range was previously listed as clean. 297 + * 298 + * set_page_dirty_lock() is used internally. 
If instead, set_page_dirty() is 299 + * required, then the caller should a) verify that this is really correct, 300 + * because _lock() is usually required, and b) hand code it: 301 + * set_page_dirty_lock(), unpin_user_page(). 302 + * 303 + */ 304 + void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, 305 + bool make_dirty) 306 + { 307 + unsigned long index; 308 + struct page *head; 309 + unsigned int ntails; 310 + 311 + for_each_compound_range(index, &page, npages, head, ntails) { 312 + if (make_dirty && !PageDirty(head)) 313 + set_page_dirty_lock(head); 314 + put_compound_head(head, ntails, FOLL_PIN); 315 + } 316 + } 317 + EXPORT_SYMBOL(unpin_user_page_range_dirty_lock); 329 318 330 319 /** 331 320 * unpin_user_pages() - release an array of gup-pinned pages. ··· 375 292 void unpin_user_pages(struct page **pages, unsigned long npages) 376 293 { 377 294 unsigned long index; 295 + struct page *head; 296 + unsigned int ntails; 378 297 379 298 /* 380 299 * If this WARN_ON() fires, then the system *might* be leaking pages (by ··· 385 300 */ 386 301 if (WARN_ON(IS_ERR_VALUE(npages))) 387 302 return; 388 - /* 389 - * TODO: this can be optimized for huge pages: if a series of pages is 390 - * physically contiguous and part of the same compound page, then a 391 - * single operation to the head page should suffice. 
392 - */ 393 - for (index = 0; index < npages; index++) 394 - unpin_user_page(pages[index]); 303 + 304 + for_each_compound_head(index, pages, npages, head, ntails) 305 + put_compound_head(head, ntails, FOLL_PIN); 395 306 } 396 307 EXPORT_SYMBOL(unpin_user_pages); 397 308 ··· 514 433 page = ERR_PTR(ret); 515 434 goto out; 516 435 } 517 - } 518 - 519 - if (flags & FOLL_SPLIT && PageTransCompound(page)) { 520 - get_page(page); 521 - pte_unmap_unlock(ptep, ptl); 522 - lock_page(page); 523 - ret = split_huge_page(page); 524 - unlock_page(page); 525 - put_page(page); 526 - if (ret) 527 - return ERR_PTR(ret); 528 - goto retry; 529 436 } 530 437 531 438 /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ ··· 660 591 spin_unlock(ptl); 661 592 return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); 662 593 } 663 - if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) { 594 + if (flags & FOLL_SPLIT_PMD) { 664 595 int ret; 665 596 page = pmd_page(*pmd); 666 597 if (is_huge_zero_page(page)) { ··· 669 600 split_huge_pmd(vma, pmd, address); 670 601 if (pmd_trans_unstable(pmd)) 671 602 ret = -EBUSY; 672 - } else if (flags & FOLL_SPLIT) { 673 - if (unlikely(!try_get_page(page))) { 674 - spin_unlock(ptl); 675 - return ERR_PTR(-ENOMEM); 676 - } 677 - spin_unlock(ptl); 678 - lock_page(page); 679 - ret = split_huge_page(page); 680 - unlock_page(page); 681 - put_page(page); 682 - if (pmd_none(*pmd)) 683 - return no_page_table(vma, flags); 684 - } else { /* flags & FOLL_SPLIT_PMD */ 603 + } else { 685 604 spin_unlock(ptl); 686 605 split_huge_pmd(vma, pmd, address); 687 606 ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
+1 -1
mm/hugetlb.c
··· 1616 1616 gfp_mask |= __GFP_RETRY_MAYFAIL; 1617 1617 if (nid == NUMA_NO_NODE) 1618 1618 nid = numa_mem_id(); 1619 - page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask); 1619 + page = __alloc_pages(gfp_mask, order, nid, nmask); 1620 1620 if (page) 1621 1621 __count_vm_event(HTLB_BUDDY_PGALLOC); 1622 1622 else
+22 -3
mm/internal.h
··· 145 145 * family of functions. 146 146 * 147 147 * nodemask, migratetype and highest_zoneidx are initialized only once in 148 - * __alloc_pages_nodemask() and then never change. 148 + * __alloc_pages() and then never change. 149 149 * 150 150 * zonelist, preferred_zone and highest_zoneidx are set first in 151 - * __alloc_pages_nodemask() for the fast path, and might be later changed 151 + * __alloc_pages() for the fast path, and might be later changed 152 152 * in __alloc_pages_slowpath(). All other functions pass the whole structure 153 153 * by a const pointer. 154 154 */ ··· 446 446 static inline void clear_page_mlock(struct page *page) { } 447 447 static inline void mlock_vma_page(struct page *page) { } 448 448 static inline void mlock_migrate_page(struct page *new, struct page *old) { } 449 - 449 + static inline void vunmap_range_noflush(unsigned long start, unsigned long end) 450 + { 451 + } 450 452 #endif /* !CONFIG_MMU */ 451 453 452 454 /* ··· 638 636 nodemask_t *nmask; 639 637 gfp_t gfp_mask; 640 638 }; 639 + 640 + /* 641 + * mm/vmalloc.c 642 + */ 643 + #ifdef CONFIG_MMU 644 + int vmap_pages_range_noflush(unsigned long addr, unsigned long end, 645 + pgprot_t prot, struct page **pages, unsigned int page_shift); 646 + #else 647 + static inline 648 + int vmap_pages_range_noflush(unsigned long addr, unsigned long end, 649 + pgprot_t prot, struct page **pages, unsigned int page_shift) 650 + { 651 + return -EINVAL; 652 + } 653 + #endif 654 + 655 + void vunmap_range_noflush(unsigned long start, unsigned long end); 641 656 642 657 #endif /* __MM_INTERNAL_H */
+1 -1
mm/interval_tree.c
··· 22 22 23 23 INTERVAL_TREE_DEFINE(struct vm_area_struct, shared.rb, 24 24 unsigned long, shared.rb_subtree_last, 25 - vma_start_pgoff, vma_last_pgoff,, vma_interval_tree) 25 + vma_start_pgoff, vma_last_pgoff, /* empty */, vma_interval_tree) 26 26 27 27 /* Insert node immediately after prev in the interval tree */ 28 28 void vma_interval_tree_insert_after(struct vm_area_struct *node,
+29
mm/io-mapping.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-only 2 + 3 + #include <linux/mm.h> 4 + #include <linux/io-mapping.h> 5 + 6 + /** 7 + * io_mapping_map_user - remap an I/O mapping to userspace 8 + * @iomap: the source io_mapping 9 + * @vma: user vma to map to 10 + * @addr: target user address to start at 11 + * @pfn: physical address of kernel memory 12 + * @size: size of map area 13 + * 14 + * Note: this is only safe if the mm semaphore is held when called. 15 + */ 16 + int io_mapping_map_user(struct io_mapping *iomap, struct vm_area_struct *vma, 17 + unsigned long addr, unsigned long pfn, unsigned long size) 18 + { 19 + vm_flags_t expected_flags = VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 20 + 21 + if (WARN_ON_ONCE((vma->vm_flags & expected_flags) != expected_flags)) 22 + return -EINVAL; 23 + 24 + /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ 25 + return remap_pfn_range_notrack(vma, addr, pfn, size, 26 + __pgprot((pgprot_val(iomap->prot) & _PAGE_CACHE_MASK) | 27 + (pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK))); 28 + } 29 + EXPORT_SYMBOL_GPL(io_mapping_map_user);
+5 -220
mm/ioremap.c
··· 16 16 #include "pgalloc-track.h" 17 17 18 18 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 19 - static int __read_mostly ioremap_p4d_capable; 20 - static int __read_mostly ioremap_pud_capable; 21 - static int __read_mostly ioremap_pmd_capable; 22 - static int __read_mostly ioremap_huge_disabled; 19 + static bool __ro_after_init iomap_max_page_shift = PAGE_SHIFT; 23 20 24 21 static int __init set_nohugeiomap(char *str) 25 22 { 26 - ioremap_huge_disabled = 1; 23 + iomap_max_page_shift = P4D_SHIFT; 27 24 return 0; 28 25 } 29 26 early_param("nohugeiomap", set_nohugeiomap); 30 - 31 - void __init ioremap_huge_init(void) 32 - { 33 - if (!ioremap_huge_disabled) { 34 - if (arch_ioremap_p4d_supported()) 35 - ioremap_p4d_capable = 1; 36 - if (arch_ioremap_pud_supported()) 37 - ioremap_pud_capable = 1; 38 - if (arch_ioremap_pmd_supported()) 39 - ioremap_pmd_capable = 1; 40 - } 41 - } 42 - 43 - static inline int ioremap_p4d_enabled(void) 44 - { 45 - return ioremap_p4d_capable; 46 - } 47 - 48 - static inline int ioremap_pud_enabled(void) 49 - { 50 - return ioremap_pud_capable; 51 - } 52 - 53 - static inline int ioremap_pmd_enabled(void) 54 - { 55 - return ioremap_pmd_capable; 56 - } 57 - 58 - #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ 59 - static inline int ioremap_p4d_enabled(void) { return 0; } 60 - static inline int ioremap_pud_enabled(void) { return 0; } 61 - static inline int ioremap_pmd_enabled(void) { return 0; } 27 + #else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 28 + static const bool iomap_max_page_shift = PAGE_SHIFT; 62 29 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 63 - 64 - static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, 65 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 66 - pgtbl_mod_mask *mask) 67 - { 68 - pte_t *pte; 69 - u64 pfn; 70 - 71 - pfn = phys_addr >> PAGE_SHIFT; 72 - pte = pte_alloc_kernel_track(pmd, addr, mask); 73 - if (!pte) 74 - return -ENOMEM; 75 - do { 76 - BUG_ON(!pte_none(*pte)); 77 - set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); 78 - 
pfn++; 79 - } while (pte++, addr += PAGE_SIZE, addr != end); 80 - *mask |= PGTBL_PTE_MODIFIED; 81 - return 0; 82 - } 83 - 84 - static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr, 85 - unsigned long end, phys_addr_t phys_addr, 86 - pgprot_t prot) 87 - { 88 - if (!ioremap_pmd_enabled()) 89 - return 0; 90 - 91 - if ((end - addr) != PMD_SIZE) 92 - return 0; 93 - 94 - if (!IS_ALIGNED(addr, PMD_SIZE)) 95 - return 0; 96 - 97 - if (!IS_ALIGNED(phys_addr, PMD_SIZE)) 98 - return 0; 99 - 100 - if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) 101 - return 0; 102 - 103 - return pmd_set_huge(pmd, phys_addr, prot); 104 - } 105 - 106 - static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr, 107 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 108 - pgtbl_mod_mask *mask) 109 - { 110 - pmd_t *pmd; 111 - unsigned long next; 112 - 113 - pmd = pmd_alloc_track(&init_mm, pud, addr, mask); 114 - if (!pmd) 115 - return -ENOMEM; 116 - do { 117 - next = pmd_addr_end(addr, end); 118 - 119 - if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) { 120 - *mask |= PGTBL_PMD_MODIFIED; 121 - continue; 122 - } 123 - 124 - if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask)) 125 - return -ENOMEM; 126 - } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); 127 - return 0; 128 - } 129 - 130 - static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr, 131 - unsigned long end, phys_addr_t phys_addr, 132 - pgprot_t prot) 133 - { 134 - if (!ioremap_pud_enabled()) 135 - return 0; 136 - 137 - if ((end - addr) != PUD_SIZE) 138 - return 0; 139 - 140 - if (!IS_ALIGNED(addr, PUD_SIZE)) 141 - return 0; 142 - 143 - if (!IS_ALIGNED(phys_addr, PUD_SIZE)) 144 - return 0; 145 - 146 - if (pud_present(*pud) && !pud_free_pmd_page(pud, addr)) 147 - return 0; 148 - 149 - return pud_set_huge(pud, phys_addr, prot); 150 - } 151 - 152 - static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, 153 - unsigned long end, phys_addr_t 
phys_addr, pgprot_t prot, 154 - pgtbl_mod_mask *mask) 155 - { 156 - pud_t *pud; 157 - unsigned long next; 158 - 159 - pud = pud_alloc_track(&init_mm, p4d, addr, mask); 160 - if (!pud) 161 - return -ENOMEM; 162 - do { 163 - next = pud_addr_end(addr, end); 164 - 165 - if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) { 166 - *mask |= PGTBL_PUD_MODIFIED; 167 - continue; 168 - } 169 - 170 - if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask)) 171 - return -ENOMEM; 172 - } while (pud++, phys_addr += (next - addr), addr = next, addr != end); 173 - return 0; 174 - } 175 - 176 - static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr, 177 - unsigned long end, phys_addr_t phys_addr, 178 - pgprot_t prot) 179 - { 180 - if (!ioremap_p4d_enabled()) 181 - return 0; 182 - 183 - if ((end - addr) != P4D_SIZE) 184 - return 0; 185 - 186 - if (!IS_ALIGNED(addr, P4D_SIZE)) 187 - return 0; 188 - 189 - if (!IS_ALIGNED(phys_addr, P4D_SIZE)) 190 - return 0; 191 - 192 - if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr)) 193 - return 0; 194 - 195 - return p4d_set_huge(p4d, phys_addr, prot); 196 - } 197 - 198 - static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr, 199 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 200 - pgtbl_mod_mask *mask) 201 - { 202 - p4d_t *p4d; 203 - unsigned long next; 204 - 205 - p4d = p4d_alloc_track(&init_mm, pgd, addr, mask); 206 - if (!p4d) 207 - return -ENOMEM; 208 - do { 209 - next = p4d_addr_end(addr, end); 210 - 211 - if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) { 212 - *mask |= PGTBL_P4D_MODIFIED; 213 - continue; 214 - } 215 - 216 - if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask)) 217 - return -ENOMEM; 218 - } while (p4d++, phys_addr += (next - addr), addr = next, addr != end); 219 - return 0; 220 - } 221 30 222 31 int ioremap_page_range(unsigned long addr, 223 32 unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 224 33 { 225 - pgd_t *pgd; 226 - unsigned long start; 
227 - unsigned long next; 228 - int err; 229 - pgtbl_mod_mask mask = 0; 230 - 231 - might_sleep(); 232 - BUG_ON(addr >= end); 233 - 234 - start = addr; 235 - pgd = pgd_offset_k(addr); 236 - do { 237 - next = pgd_addr_end(addr, end); 238 - err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot, 239 - &mask); 240 - if (err) 241 - break; 242 - } while (pgd++, phys_addr += (next - addr), addr = next, addr != end); 243 - 244 - flush_cache_vmap(start, end); 245 - 246 - if (mask & ARCH_PAGE_TABLE_SYNC_MASK) 247 - arch_sync_kernel_mappings(start, end); 248 - 249 - return err; 34 + return vmap_range(addr, end, phys_addr, prot, iomap_max_page_shift); 250 35 } 251 36 252 37 #ifdef CONFIG_GENERIC_IOREMAP
+23 -22
mm/kasan/common.c
··· 60 60 61 61 void __kasan_unpoison_range(const void *address, size_t size) 62 62 { 63 - kasan_unpoison(address, size); 63 + kasan_unpoison(address, size, false); 64 64 } 65 65 66 66 #ifdef CONFIG_KASAN_STACK ··· 69 69 { 70 70 void *base = task_stack_page(task); 71 71 72 - kasan_unpoison(base, THREAD_SIZE); 72 + kasan_unpoison(base, THREAD_SIZE, false); 73 73 } 74 74 75 75 /* Unpoison the stack for the current task beyond a watermark sp value. */ ··· 82 82 */ 83 83 void *base = (void *)((unsigned long)watermark & ~(THREAD_SIZE - 1)); 84 84 85 - kasan_unpoison(base, watermark - base); 85 + kasan_unpoison(base, watermark - base, false); 86 86 } 87 87 #endif /* CONFIG_KASAN_STACK */ 88 88 ··· 97 97 return 0; 98 98 } 99 99 100 - void __kasan_alloc_pages(struct page *page, unsigned int order) 100 + void __kasan_alloc_pages(struct page *page, unsigned int order, bool init) 101 101 { 102 102 u8 tag; 103 103 unsigned long i; ··· 108 108 tag = kasan_random_tag(); 109 109 for (i = 0; i < (1 << order); i++) 110 110 page_kasan_tag_set(page + i, tag); 111 - kasan_unpoison(page_address(page), PAGE_SIZE << order); 111 + kasan_unpoison(page_address(page), PAGE_SIZE << order, init); 112 112 } 113 113 114 - void __kasan_free_pages(struct page *page, unsigned int order) 114 + void __kasan_free_pages(struct page *page, unsigned int order, bool init) 115 115 { 116 116 if (likely(!PageHighMem(page))) 117 117 kasan_poison(page_address(page), PAGE_SIZE << order, 118 - KASAN_FREE_PAGE); 118 + KASAN_FREE_PAGE, init); 119 119 } 120 120 121 121 /* ··· 251 251 for (i = 0; i < compound_nr(page); i++) 252 252 page_kasan_tag_reset(page + i); 253 253 kasan_poison(page_address(page), page_size(page), 254 - KASAN_KMALLOC_REDZONE); 254 + KASAN_KMALLOC_REDZONE, false); 255 255 } 256 256 257 257 void __kasan_unpoison_object_data(struct kmem_cache *cache, void *object) 258 258 { 259 - kasan_unpoison(object, cache->object_size); 259 + kasan_unpoison(object, cache->object_size, false); 260 260 } 261 261 
262 262 void __kasan_poison_object_data(struct kmem_cache *cache, void *object) 263 263 { 264 264 kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE), 265 - KASAN_KMALLOC_REDZONE); 265 + KASAN_KMALLOC_REDZONE, false); 266 266 } 267 267 268 268 /* ··· 322 322 return (void *)object; 323 323 } 324 324 325 - static inline bool ____kasan_slab_free(struct kmem_cache *cache, 326 - void *object, unsigned long ip, bool quarantine) 325 + static inline bool ____kasan_slab_free(struct kmem_cache *cache, void *object, 326 + unsigned long ip, bool quarantine, bool init) 327 327 { 328 328 u8 tag; 329 329 void *tagged_object; ··· 351 351 } 352 352 353 353 kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE), 354 - KASAN_KMALLOC_FREE); 354 + KASAN_KMALLOC_FREE, init); 355 355 356 356 if ((IS_ENABLED(CONFIG_KASAN_GENERIC) && !quarantine)) 357 357 return false; ··· 362 362 return kasan_quarantine_put(cache, object); 363 363 } 364 364 365 - bool __kasan_slab_free(struct kmem_cache *cache, void *object, unsigned long ip) 365 + bool __kasan_slab_free(struct kmem_cache *cache, void *object, 366 + unsigned long ip, bool init) 366 367 { 367 - return ____kasan_slab_free(cache, object, ip, true); 368 + return ____kasan_slab_free(cache, object, ip, true, init); 368 369 } 369 370 370 371 static inline bool ____kasan_kfree_large(void *ptr, unsigned long ip) ··· 408 407 if (unlikely(!PageSlab(page))) { 409 408 if (____kasan_kfree_large(ptr, ip)) 410 409 return; 411 - kasan_poison(ptr, page_size(page), KASAN_FREE_PAGE); 410 + kasan_poison(ptr, page_size(page), KASAN_FREE_PAGE, false); 412 411 } else { 413 - ____kasan_slab_free(page->slab_cache, ptr, ip, false); 412 + ____kasan_slab_free(page->slab_cache, ptr, ip, false, false); 414 413 } 415 414 } 416 415 ··· 429 428 } 430 429 431 430 void * __must_check __kasan_slab_alloc(struct kmem_cache *cache, 432 - void *object, gfp_t flags) 431 + void *object, gfp_t flags, bool init) 433 432 { 434 433 u8 tag; 435 434 void 
*tagged_object; ··· 454 453 * Unpoison the whole object. 455 454 * For kmalloc() allocations, kasan_kmalloc() will do precise poisoning. 456 455 */ 457 - kasan_unpoison(tagged_object, cache->object_size); 456 + kasan_unpoison(tagged_object, cache->object_size, init); 458 457 459 458 /* Save alloc info (if possible) for non-kmalloc() allocations. */ 460 459 if (kasan_stack_collection_enabled()) ··· 497 496 redzone_end = round_up((unsigned long)(object + cache->object_size), 498 497 KASAN_GRANULE_SIZE); 499 498 kasan_poison((void *)redzone_start, redzone_end - redzone_start, 500 - KASAN_KMALLOC_REDZONE); 499 + KASAN_KMALLOC_REDZONE, false); 501 500 502 501 /* 503 502 * Save alloc info (if possible) for kmalloc() allocations. ··· 547 546 KASAN_GRANULE_SIZE); 548 547 redzone_end = (unsigned long)ptr + page_size(virt_to_page(ptr)); 549 548 kasan_poison((void *)redzone_start, redzone_end - redzone_start, 550 - KASAN_PAGE_REDZONE); 549 + KASAN_PAGE_REDZONE, false); 551 550 552 551 return (void *)ptr; 553 552 } ··· 564 563 * Part of it might already have been unpoisoned, but it's unknown 565 564 * how big that part is. 566 565 */ 567 - kasan_unpoison(object, size); 566 + kasan_unpoison(object, size, false); 568 567 569 568 page = virt_to_head_page(object); 570 569
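The change running through mm/kasan/common.c above threads a new `init` flag into kasan_poison()/kasan_unpoison() and the page/slab hooks, so hardware tag-based KASAN can zero memory in the same operation that retags it. A minimal user-space sketch of the flag's semantics; the granule size, buffer, and names are illustrative stand-ins, not the kernel's implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define GRANULE 16 /* illustrative granule size */

/* Toy shadow: one tag byte per granule of the tracked buffer. */
static unsigned char shadow[4];
static unsigned char mem[4 * GRANULE];

/* Mirrors the new signature shape: tag the range and, when init is
 * set, zero the memory in the same pass (MTE hardware can do both
 * with a single tagging store). */
static void set_tag_range(size_t off, size_t size, unsigned char tag,
                          bool init)
{
    for (size_t g = off / GRANULE; g < (off + size) / GRANULE; g++)
        shadow[g] = tag;
    if (init)
        memset(mem + off, 0, size);
}
```

The point of the extra parameter is that callers such as the allocation hooks can fold init_on_alloc/init_on_free-style zeroing into the tagging pass instead of doing a separate memset.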
+6 -6
mm/kasan/generic.c
··· 208 208 { 209 209 size_t aligned_size = round_up(global->size, KASAN_GRANULE_SIZE); 210 210 211 - kasan_unpoison(global->beg, global->size); 211 + kasan_unpoison(global->beg, global->size, false); 212 212 213 213 kasan_poison(global->beg + aligned_size, 214 214 global->size_with_redzone - aligned_size, 215 - KASAN_GLOBAL_REDZONE); 215 + KASAN_GLOBAL_REDZONE, false); 216 216 } 217 217 218 218 void __asan_register_globals(struct kasan_global *globals, size_t size) ··· 292 292 WARN_ON(!IS_ALIGNED(addr, KASAN_ALLOCA_REDZONE_SIZE)); 293 293 294 294 kasan_unpoison((const void *)(addr + rounded_down_size), 295 - size - rounded_down_size); 295 + size - rounded_down_size, false); 296 296 kasan_poison(left_redzone, KASAN_ALLOCA_REDZONE_SIZE, 297 - KASAN_ALLOCA_LEFT); 297 + KASAN_ALLOCA_LEFT, false); 298 298 kasan_poison(right_redzone, padding_size + KASAN_ALLOCA_REDZONE_SIZE, 299 - KASAN_ALLOCA_RIGHT); 299 + KASAN_ALLOCA_RIGHT, false); 300 300 } 301 301 EXPORT_SYMBOL(__asan_alloca_poison); 302 302 ··· 306 306 if (unlikely(!stack_top || stack_top > stack_bottom)) 307 307 return; 308 308 309 - kasan_unpoison(stack_top, stack_bottom - stack_top); 309 + kasan_unpoison(stack_top, stack_bottom - stack_top, false); 310 310 } 311 311 EXPORT_SYMBOL(__asan_allocas_unpoison); 312 312
+13 -11
mm/kasan/kasan.h
··· 163 163 struct kasan_track alloc_track; 164 164 #ifdef CONFIG_KASAN_GENERIC 165 165 /* 166 - * call_rcu() call stack is stored into struct kasan_alloc_meta. 166 + * The auxiliary stack is stored into struct kasan_alloc_meta. 167 167 * The free stack is stored into struct kasan_free_meta. 168 168 */ 169 169 depot_stack_handle_t aux_stack[2]; ··· 314 314 #define arch_get_mem_tag(addr) (0xFF) 315 315 #endif 316 316 #ifndef arch_set_mem_tag_range 317 - #define arch_set_mem_tag_range(addr, size, tag) ((void *)(addr)) 317 + #define arch_set_mem_tag_range(addr, size, tag, init) ((void *)(addr)) 318 318 #endif 319 319 320 320 #define hw_enable_tagging_sync() arch_enable_tagging_sync() ··· 324 324 #define hw_force_async_tag_fault() arch_force_async_tag_fault() 325 325 #define hw_get_random_tag() arch_get_random_tag() 326 326 #define hw_get_mem_tag(addr) arch_get_mem_tag(addr) 327 - #define hw_set_mem_tag_range(addr, size, tag) arch_set_mem_tag_range((addr), (size), (tag)) 327 + #define hw_set_mem_tag_range(addr, size, tag, init) \ 328 + arch_set_mem_tag_range((addr), (size), (tag), (init)) 328 329 329 330 #else /* CONFIG_KASAN_HW_TAGS */ 330 331 ··· 359 358 360 359 #ifdef CONFIG_KASAN_HW_TAGS 361 360 362 - static inline void kasan_poison(const void *addr, size_t size, u8 value) 361 + static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init) 363 362 { 364 363 addr = kasan_reset_tag(addr); 365 364 ··· 372 371 if (WARN_ON(size & KASAN_GRANULE_MASK)) 373 372 return; 374 373 375 - hw_set_mem_tag_range((void *)addr, size, value); 374 + hw_set_mem_tag_range((void *)addr, size, value, init); 376 375 } 377 376 378 - static inline void kasan_unpoison(const void *addr, size_t size) 377 + static inline void kasan_unpoison(const void *addr, size_t size, bool init) 379 378 { 380 379 u8 tag = get_tag(addr); 381 380 ··· 389 388 return; 390 389 size = round_up(size, KASAN_GRANULE_SIZE); 391 390 392 - hw_set_mem_tag_range((void *)addr, size, tag); 391 + 
hw_set_mem_tag_range((void *)addr, size, tag, init); 393 392 } 394 393 395 394 static inline bool kasan_byte_accessible(const void *addr) ··· 397 396 u8 ptr_tag = get_tag(addr); 398 397 u8 mem_tag = hw_get_mem_tag((void *)addr); 399 398 400 - return (mem_tag != KASAN_TAG_INVALID) && 401 - (ptr_tag == KASAN_TAG_KERNEL || ptr_tag == mem_tag); 399 + return ptr_tag == KASAN_TAG_KERNEL || ptr_tag == mem_tag; 402 400 } 403 401 404 402 #else /* CONFIG_KASAN_HW_TAGS */ ··· 407 407 * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE 408 408 * @size - range size, must be aligned to KASAN_GRANULE_SIZE 409 409 * @value - value that's written to metadata for the range 410 + * @init - whether to initialize the memory range (only for hardware tag-based) 410 411 * 411 412 * The size gets aligned to KASAN_GRANULE_SIZE before marking the range. 412 413 */ 413 - void kasan_poison(const void *addr, size_t size, u8 value); 414 + void kasan_poison(const void *addr, size_t size, u8 value, bool init); 414 415 415 416 /** 416 417 * kasan_unpoison - mark the memory range as accessible 417 418 * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE 418 419 * @size - range size, can be unaligned 420 + * @init - whether to initialize the memory range (only for hardware tag-based) 419 421 * 420 422 * For the tag-based modes, the @size gets aligned to KASAN_GRANULE_SIZE before 421 423 * marking the range. 422 424 * For the generic mode, the last granule of the memory range gets partially 423 425 * unpoisoned based on the @size. 424 426 */ 425 - void kasan_unpoison(const void *addr, size_t size); 427 + void kasan_unpoison(const void *addr, size_t size, bool init); 426 428 427 429 bool kasan_byte_accessible(const void *addr); 428 430
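The hardware tag-based `kasan_byte_accessible()` in the hunk above drops the separate `KASAN_TAG_INVALID` comparison and keeps only the wildcard-or-match test. A stand-alone model of the resulting predicate; the concrete tag values in the test are hypothetical except 0xFF, which is the kernel's match-all tag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KASAN_TAG_KERNEL 0xFF /* wildcard: matches any memory tag */

/* The simplified check: a kernel-tagged pointer may touch anything;
 * otherwise the pointer tag must equal the tag stored for the byte. */
static bool byte_accessible(uint8_t ptr_tag, uint8_t mem_tag)
{
    return ptr_tag == KASAN_TAG_KERNEL || ptr_tag == mem_tag;
}
```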
+1 -1
mm/kasan/report_generic.c
··· 148 148 } 149 149 150 150 /* Copy token (+ 1 byte for '\0'). */ 151 - strlcpy(token, *frame_descr, tok_len + 1); 151 + strscpy(token, *frame_descr, tok_len + 1); 152 152 } 153 153 154 154 /* Advance frame_descr past separator. */
+5 -5
mm/kasan/shadow.c
··· 69 69 return __memcpy(dest, src, len); 70 70 } 71 71 72 - void kasan_poison(const void *addr, size_t size, u8 value) 72 + void kasan_poison(const void *addr, size_t size, u8 value, bool init) 73 73 { 74 74 void *shadow_start, *shadow_end; 75 75 ··· 106 106 } 107 107 #endif 108 108 109 - void kasan_unpoison(const void *addr, size_t size) 109 + void kasan_unpoison(const void *addr, size_t size, bool init) 110 110 { 111 111 u8 tag = get_tag(addr); 112 112 ··· 129 129 return; 130 130 131 131 /* Unpoison all granules that cover the object. */ 132 - kasan_poison(addr, round_up(size, KASAN_GRANULE_SIZE), tag); 132 + kasan_poison(addr, round_up(size, KASAN_GRANULE_SIZE), tag, false); 133 133 134 134 /* Partially poison the last granule for the generic mode. */ 135 135 if (IS_ENABLED(CONFIG_KASAN_GENERIC)) ··· 344 344 return; 345 345 346 346 size = round_up(size, KASAN_GRANULE_SIZE); 347 - kasan_poison(start, size, KASAN_VMALLOC_INVALID); 347 + kasan_poison(start, size, KASAN_VMALLOC_INVALID, false); 348 348 } 349 349 350 350 void kasan_unpoison_vmalloc(const void *start, unsigned long size) ··· 352 352 if (!is_vmalloc_or_module_addr(start)) 353 353 return; 354 354 355 - kasan_unpoison(start, size); 355 + kasan_unpoison(start, size, false); 356 356 } 357 357 358 358 static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
+8 -4
mm/kasan/sw_tags.c
··· 121 121 bool kasan_byte_accessible(const void *addr) 122 122 { 123 123 u8 tag = get_tag(addr); 124 - u8 shadow_byte = READ_ONCE(*(u8 *)kasan_mem_to_shadow(kasan_reset_tag(addr))); 124 + void *untagged_addr = kasan_reset_tag(addr); 125 + u8 shadow_byte; 125 126 126 - return (shadow_byte != KASAN_TAG_INVALID) && 127 - (tag == KASAN_TAG_KERNEL || tag == shadow_byte); 127 + if (untagged_addr < kasan_shadow_to_mem((void *)KASAN_SHADOW_START)) 128 + return false; 129 + 130 + shadow_byte = READ_ONCE(*(u8 *)kasan_mem_to_shadow(untagged_addr)); 131 + return tag == KASAN_TAG_KERNEL || tag == shadow_byte; 128 132 } 129 133 130 134 #define DEFINE_HWASAN_LOAD_STORE(size) \ ··· 163 159 164 160 void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size) 165 161 { 166 - kasan_poison((void *)addr, size, tag); 162 + kasan_poison((void *)addr, size, tag, false); 167 163 } 168 164 EXPORT_SYMBOL(__hwasan_tag_memory); 169 165
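The sw_tags fix above adds a lower-bound check on the untagged address before dereferencing its shadow byte, so `kasan_byte_accessible()` cannot read shadow for addresses below the covered range. A simplified model with a flat array standing in for the shadow mapping; the base address, array, and names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_KERNEL   0xFF
#define SHADOW_SHIFT 4      /* one shadow byte per 16-byte granule */

static const uint8_t shadow_model[4] = { 0x11, 0x11, 0x22, 0x22 };
static const uintptr_t covered_start = 0x1000; /* hypothetical base */

static bool byte_accessible(uintptr_t addr, uint8_t ptr_tag)
{
    if (addr < covered_start)   /* the newly added guard */
        return false;
    size_t idx = (addr - covered_start) >> SHADOW_SHIFT;
    if (idx >= sizeof(shadow_model))
        return false;           /* keep the toy model in bounds */
    return ptr_tag == TAG_KERNEL || ptr_tag == shadow_model[idx];
}
```

Without the guard, an address below the covered region would index the shadow out of bounds; the kernel version compares against `kasan_shadow_to_mem((void *)KASAN_SHADOW_START)` for the same reason.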
+1 -1
mm/kmemleak.c
··· 1203 1203 } 1204 1204 1205 1205 /* 1206 - * Memory scanning is a long process and it needs to be interruptable. This 1206 + * Memory scanning is a long process and it needs to be interruptible. This 1207 1207 * function checks whether such interrupt condition occurred. 1208 1208 */ 1209 1209 static int scan_should_stop(void)
+340 -336
mm/memcontrol.c
··· 255 255 #ifdef CONFIG_MEMCG_KMEM 256 256 extern spinlock_t css_set_lock; 257 257 258 - static int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp, 259 - unsigned int nr_pages); 260 - static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, 261 - unsigned int nr_pages); 258 + static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg, 259 + unsigned int nr_pages); 262 260 263 261 static void obj_cgroup_release(struct percpu_ref *ref) 264 262 { ··· 293 295 spin_lock_irqsave(&css_set_lock, flags); 294 296 memcg = obj_cgroup_memcg(objcg); 295 297 if (nr_pages) 296 - __memcg_kmem_uncharge(memcg, nr_pages); 298 + obj_cgroup_uncharge_pages(objcg, nr_pages); 297 299 list_del(&objcg->list); 298 300 mem_cgroup_put(memcg); 299 301 spin_unlock_irqrestore(&css_set_lock, flags); ··· 412 414 int size, int old_size) 413 415 { 414 416 struct memcg_shrinker_map *new, *old; 417 + struct mem_cgroup_per_node *pn; 415 418 int nid; 416 419 417 420 lockdep_assert_held(&memcg_shrinker_map_mutex); 418 421 419 422 for_each_node(nid) { 420 - old = rcu_dereference_protected( 421 - mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true); 423 + pn = memcg->nodeinfo[nid]; 424 + old = rcu_dereference_protected(pn->shrinker_map, true); 422 425 /* Not yet online memcg */ 423 426 if (!old) 424 427 return 0; ··· 432 433 memset(new->map, (int)0xff, old_size); 433 434 memset((void *)new->map + old_size, 0, size - old_size); 434 435 435 - rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new); 436 + rcu_assign_pointer(pn->shrinker_map, new); 436 437 call_rcu(&old->rcu, memcg_free_shrinker_map_rcu); 437 438 } 438 439 ··· 449 450 return; 450 451 451 452 for_each_node(nid) { 452 - pn = mem_cgroup_nodeinfo(memcg, nid); 453 + pn = memcg->nodeinfo[nid]; 453 454 map = rcu_dereference_protected(pn->shrinker_map, true); 454 455 kvfree(map); 455 456 rcu_assign_pointer(pn->shrinker_map, NULL); ··· 712 713 int nid; 713 714 714 715 for_each_node(nid) { 715 - mz = mem_cgroup_nodeinfo(memcg, nid); 
716 + mz = memcg->nodeinfo[nid]; 716 717 mctz = soft_limit_tree_node(nid); 717 718 if (mctz) 718 719 mem_cgroup_remove_exceeded(mz, mctz); ··· 763 764 */ 764 765 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) 765 766 { 766 - long x, threshold = MEMCG_CHARGE_BATCH; 767 - 768 767 if (mem_cgroup_disabled()) 769 768 return; 770 769 771 - if (memcg_stat_item_in_bytes(idx)) 772 - threshold <<= PAGE_SHIFT; 770 + __this_cpu_add(memcg->vmstats_percpu->state[idx], val); 771 + cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); 772 + } 773 773 774 - x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); 775 - if (unlikely(abs(x) > threshold)) { 776 - struct mem_cgroup *mi; 777 - 778 - /* 779 - * Batch local counters to keep them in sync with 780 - * the hierarchical ones. 781 - */ 782 - __this_cpu_add(memcg->vmstats_local->stat[idx], x); 783 - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) 784 - atomic_long_add(x, &mi->vmstats[idx]); 774 + /* idx can be of type enum memcg_stat_item or node_stat_item. */ 775 + static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) 776 + { 777 + long x = READ_ONCE(memcg->vmstats.state[idx]); 778 + #ifdef CONFIG_SMP 779 + if (x < 0) 785 780 x = 0; 786 - } 787 - __this_cpu_write(memcg->vmstats_percpu->stat[idx], x); 781 + #endif 782 + return x; 783 + } 784 + 785 + /* idx can be of type enum memcg_stat_item or node_stat_item. 
*/ 786 + static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx) 787 + { 788 + long x = 0; 789 + int cpu; 790 + 791 + for_each_possible_cpu(cpu) 792 + x += per_cpu(memcg->vmstats_percpu->state[idx], cpu); 793 + #ifdef CONFIG_SMP 794 + if (x < 0) 795 + x = 0; 796 + #endif 797 + return x; 788 798 } 789 799 790 800 static struct mem_cgroup_per_node * ··· 804 796 parent = parent_mem_cgroup(pn->memcg); 805 797 if (!parent) 806 798 return NULL; 807 - return mem_cgroup_nodeinfo(parent, nid); 799 + return parent->nodeinfo[nid]; 808 800 } 809 801 810 802 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, ··· 863 855 int val) 864 856 { 865 857 struct page *head = compound_head(page); /* rmap on tail pages */ 866 - struct mem_cgroup *memcg = page_memcg(head); 858 + struct mem_cgroup *memcg; 867 859 pg_data_t *pgdat = page_pgdat(page); 868 860 struct lruvec *lruvec; 869 861 862 + rcu_read_lock(); 863 + memcg = page_memcg(head); 870 864 /* Untracked pages have no memcg, no lruvec. Update only the node */ 871 865 if (!memcg) { 866 + rcu_read_unlock(); 872 867 __mod_node_page_state(pgdat, idx, val); 873 868 return; 874 869 } 875 870 876 871 lruvec = mem_cgroup_lruvec(memcg, pgdat); 877 872 __mod_lruvec_state(lruvec, idx, val); 873 + rcu_read_unlock(); 878 874 } 879 875 EXPORT_SYMBOL(__mod_lruvec_page_state); 880 876 ··· 915 903 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, 916 904 unsigned long count) 917 905 { 918 - unsigned long x; 919 - 920 906 if (mem_cgroup_disabled()) 921 907 return; 922 908 923 - x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]); 924 - if (unlikely(x > MEMCG_CHARGE_BATCH)) { 925 - struct mem_cgroup *mi; 926 - 927 - /* 928 - * Batch local counters to keep them in sync with 929 - * the hierarchical ones. 
930 - */ 931 - __this_cpu_add(memcg->vmstats_local->events[idx], x); 932 - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) 933 - atomic_long_add(x, &mi->vmevents[idx]); 934 - x = 0; 935 - } 936 - __this_cpu_write(memcg->vmstats_percpu->events[idx], x); 909 + __this_cpu_add(memcg->vmstats_percpu->events[idx], count); 910 + cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); 937 911 } 938 912 939 913 static unsigned long memcg_events(struct mem_cgroup *memcg, int event) 940 914 { 941 - return atomic_long_read(&memcg->vmevents[event]); 915 + return READ_ONCE(memcg->vmstats.events[event]); 942 916 } 943 917 944 918 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) ··· 933 935 int cpu; 934 936 935 937 for_each_possible_cpu(cpu) 936 - x += per_cpu(memcg->vmstats_local->events[event], cpu); 938 + x += per_cpu(memcg->vmstats_percpu->events[event], cpu); 937 939 return x; 938 940 } 939 941 ··· 1053 1055 return current->active_memcg; 1054 1056 } 1055 1057 1056 - static __always_inline struct mem_cgroup *get_active_memcg(void) 1057 - { 1058 - struct mem_cgroup *memcg; 1059 - 1060 - rcu_read_lock(); 1061 - memcg = active_memcg(); 1062 - /* remote memcg must hold a ref. */ 1063 - if (memcg && WARN_ON_ONCE(!css_tryget(&memcg->css))) 1064 - memcg = root_mem_cgroup; 1065 - rcu_read_unlock(); 1066 - 1067 - return memcg; 1068 - } 1069 - 1070 1058 static __always_inline bool memcg_kmem_bypass(void) 1071 1059 { 1072 1060 /* Allow remote memcg charging from any context. */ ··· 1064 1080 return true; 1065 1081 1066 1082 return false; 1067 - } 1068 - 1069 - /** 1070 - * If active memcg is set, do not fallback to current->mm->memcg. 
1071 - */ 1072 - static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void) 1073 - { 1074 - if (memcg_kmem_bypass()) 1075 - return NULL; 1076 - 1077 - if (unlikely(active_memcg())) 1078 - return get_active_memcg(); 1079 - 1080 - return get_mem_cgroup_from_mm(current->mm); 1081 1083 } 1082 1084 1083 1085 /** ··· 1106 1136 if (reclaim) { 1107 1137 struct mem_cgroup_per_node *mz; 1108 1138 1109 - mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id); 1139 + mz = root->nodeinfo[reclaim->pgdat->node_id]; 1110 1140 iter = &mz->iter; 1111 1141 1112 1142 if (prev && reclaim->generation != iter->generation) ··· 1208 1238 int nid; 1209 1239 1210 1240 for_each_node(nid) { 1211 - mz = mem_cgroup_nodeinfo(from, nid); 1241 + mz = from->nodeinfo[nid]; 1212 1242 iter = &mz->iter; 1213 1243 cmpxchg(&iter->position, dead_memcg, NULL); 1214 1244 } ··· 1541 1571 * 1542 1572 * Current memory state: 1543 1573 */ 1574 + cgroup_rstat_flush(memcg->css.cgroup); 1544 1575 1545 1576 for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { 1546 1577 u64 size; ··· 2089 2118 * This function protects unlocked LRU pages from being moved to 2090 2119 * another cgroup. 2091 2120 * 2092 - * It ensures lifetime of the returned memcg. Caller is responsible 2093 - * for the lifetime of the page; __unlock_page_memcg() is available 2094 - * when @page might get freed inside the locked section. 2121 + * It ensures lifetime of the locked memcg. Caller is responsible 2122 + * for the lifetime of the page. 2095 2123 */ 2096 - struct mem_cgroup *lock_page_memcg(struct page *page) 2124 + void lock_page_memcg(struct page *page) 2097 2125 { 2098 2126 struct page *head = compound_head(page); /* rmap on tail pages */ 2099 2127 struct mem_cgroup *memcg; ··· 2102 2132 * The RCU lock is held throughout the transaction. The fast 2103 2133 * path can get away without acquiring the memcg->move_lock 2104 2134 * because page moving starts with an RCU grace period. 
2105 - * 2106 - * The RCU lock also protects the memcg from being freed when 2107 - * the page state that is going to change is the only thing 2108 - * preventing the page itself from being freed. E.g. writeback 2109 - * doesn't hold a page reference and relies on PG_writeback to 2110 - * keep off truncation, migration and so forth. 2111 2135 */ 2112 2136 rcu_read_lock(); 2113 2137 2114 2138 if (mem_cgroup_disabled()) 2115 - return NULL; 2139 + return; 2116 2140 again: 2117 2141 memcg = page_memcg(head); 2118 2142 if (unlikely(!memcg)) 2119 - return NULL; 2143 + return; 2120 2144 2121 2145 #ifdef CONFIG_PROVE_LOCKING 2122 2146 local_irq_save(flags); ··· 2119 2155 #endif 2120 2156 2121 2157 if (atomic_read(&memcg->moving_account) <= 0) 2122 - return memcg; 2158 + return; 2123 2159 2124 2160 spin_lock_irqsave(&memcg->move_lock, flags); 2125 2161 if (memcg != page_memcg(head)) { ··· 2128 2164 } 2129 2165 2130 2166 /* 2131 - * When charge migration first begins, we can have locked and 2132 - * unlocked page stat updates happening concurrently. Track 2133 - * the task who has the lock for unlock_page_memcg(). 2167 + * When charge migration first begins, we can have multiple 2168 + * critical sections holding the fast-path RCU lock and one 2169 + * holding the slowpath move_lock. Track the task who has the 2170 + * move_lock for unlock_page_memcg(). 2134 2171 */ 2135 2172 memcg->move_lock_task = current; 2136 2173 memcg->move_lock_flags = flags; 2137 - 2138 - return memcg; 2139 2174 } 2140 2175 EXPORT_SYMBOL(lock_page_memcg); 2141 2176 2142 - /** 2143 - * __unlock_page_memcg - unlock and unpin a memcg 2144 - * @memcg: the memcg 2145 - * 2146 - * Unlock and unpin a memcg returned by lock_page_memcg(). 
2147 - */ 2148 - void __unlock_page_memcg(struct mem_cgroup *memcg) 2177 + static void __unlock_page_memcg(struct mem_cgroup *memcg) 2149 2178 { 2150 2179 if (memcg && memcg->move_lock_task == current) { 2151 2180 unsigned long flags = memcg->move_lock_flags; ··· 2338 2381 mutex_unlock(&percpu_charge_mutex); 2339 2382 } 2340 2383 2384 + static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu) 2385 + { 2386 + int nid; 2387 + 2388 + for_each_node(nid) { 2389 + struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid]; 2390 + unsigned long stat[NR_VM_NODE_STAT_ITEMS]; 2391 + struct batched_lruvec_stat *lstatc; 2392 + int i; 2393 + 2394 + lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); 2395 + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) { 2396 + stat[i] = lstatc->count[i]; 2397 + lstatc->count[i] = 0; 2398 + } 2399 + 2400 + do { 2401 + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 2402 + atomic_long_add(stat[i], &pn->lruvec_stat[i]); 2403 + } while ((pn = parent_nodeinfo(pn, nid))); 2404 + } 2405 + } 2406 + 2341 2407 static int memcg_hotplug_cpu_dead(unsigned int cpu) 2342 2408 { 2343 2409 struct memcg_stock_pcp *stock; 2344 - struct mem_cgroup *memcg, *mi; 2410 + struct mem_cgroup *memcg; 2345 2411 2346 2412 stock = &per_cpu(memcg_stock, cpu); 2347 2413 drain_stock(stock); 2348 2414 2349 - for_each_mem_cgroup(memcg) { 2350 - int i; 2351 - 2352 - for (i = 0; i < MEMCG_NR_STAT; i++) { 2353 - int nid; 2354 - long x; 2355 - 2356 - x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0); 2357 - if (x) 2358 - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) 2359 - atomic_long_add(x, &memcg->vmstats[i]); 2360 - 2361 - if (i >= NR_VM_NODE_STAT_ITEMS) 2362 - continue; 2363 - 2364 - for_each_node(nid) { 2365 - struct mem_cgroup_per_node *pn; 2366 - 2367 - pn = mem_cgroup_nodeinfo(memcg, nid); 2368 - x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0); 2369 - if (x) 2370 - do { 2371 - atomic_long_add(x, &pn->lruvec_stat[i]); 2372 - } while ((pn = parent_nodeinfo(pn, 
nid))); 2373 - } 2374 - } 2375 - 2376 - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { 2377 - long x; 2378 - 2379 - x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0); 2380 - if (x) 2381 - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) 2382 - atomic_long_add(x, &memcg->vmevents[i]); 2383 - } 2384 - } 2415 + for_each_mem_cgroup(memcg) 2416 + memcg_flush_lruvec_page_state(memcg, cpu); 2385 2417 2386 2418 return 0; 2387 2419 } ··· 2739 2793 if (gfp_mask & __GFP_RETRY_MAYFAIL) 2740 2794 goto nomem; 2741 2795 2742 - if (gfp_mask & __GFP_NOFAIL) 2743 - goto force; 2744 - 2745 2796 if (fatal_signal_pending(current)) 2746 2797 goto force; 2747 2798 ··· 2846 2903 * - exclusive reference 2847 2904 */ 2848 2905 page->memcg_data = (unsigned long)memcg; 2906 + } 2907 + 2908 + static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg) 2909 + { 2910 + struct mem_cgroup *memcg; 2911 + 2912 + rcu_read_lock(); 2913 + retry: 2914 + memcg = obj_cgroup_memcg(objcg); 2915 + if (unlikely(!css_tryget(&memcg->css))) 2916 + goto retry; 2917 + rcu_read_unlock(); 2918 + 2919 + return memcg; 2849 2920 } 2850 2921 2851 2922 #ifdef CONFIG_MEMCG_KMEM ··· 3013 3056 ida_simple_remove(&memcg_cache_ida, id); 3014 3057 } 3015 3058 3016 - /** 3017 - * __memcg_kmem_charge: charge a number of kernel pages to a memcg 3018 - * @memcg: memory cgroup to charge 3059 + /* 3060 + * obj_cgroup_uncharge_pages: uncharge a number of kernel pages from a objcg 3061 + * @objcg: object cgroup to uncharge 3062 + * @nr_pages: number of pages to uncharge 3063 + */ 3064 + static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg, 3065 + unsigned int nr_pages) 3066 + { 3067 + struct mem_cgroup *memcg; 3068 + 3069 + memcg = get_mem_cgroup_from_objcg(objcg); 3070 + 3071 + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 3072 + page_counter_uncharge(&memcg->kmem, nr_pages); 3073 + refill_stock(memcg, nr_pages); 3074 + 3075 + css_put(&memcg->css); 3076 + } 3077 + 3078 + /* 3079 + * 
obj_cgroup_charge_pages: charge a number of kernel pages to a objcg 3080 + * @objcg: object cgroup to charge 3019 3081 * @gfp: reclaim mode 3020 3082 * @nr_pages: number of pages to charge 3021 3083 * 3022 3084 * Returns 0 on success, an error code on failure. 3023 3085 */ 3024 - static int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp, 3025 - unsigned int nr_pages) 3086 + static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp, 3087 + unsigned int nr_pages) 3026 3088 { 3027 3089 struct page_counter *counter; 3090 + struct mem_cgroup *memcg; 3028 3091 int ret; 3092 + 3093 + memcg = get_mem_cgroup_from_objcg(objcg); 3029 3094 3030 3095 ret = try_charge(memcg, gfp, nr_pages); 3031 3096 if (ret) 3032 - return ret; 3097 + goto out; 3033 3098 3034 3099 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && 3035 3100 !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) { ··· 3063 3084 */ 3064 3085 if (gfp & __GFP_NOFAIL) { 3065 3086 page_counter_charge(&memcg->kmem, nr_pages); 3066 - return 0; 3087 + goto out; 3067 3088 } 3068 3089 cancel_charge(memcg, nr_pages); 3069 - return -ENOMEM; 3090 + ret = -ENOMEM; 3070 3091 } 3071 - return 0; 3072 - } 3092 + out: 3093 + css_put(&memcg->css); 3073 3094 3074 - /** 3075 - * __memcg_kmem_uncharge: uncharge a number of kernel pages from a memcg 3076 - * @memcg: memcg to uncharge 3077 - * @nr_pages: number of pages to uncharge 3078 - */ 3079 - static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages) 3080 - { 3081 - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) 3082 - page_counter_uncharge(&memcg->kmem, nr_pages); 3083 - 3084 - refill_stock(memcg, nr_pages); 3095 + return ret; 3085 3096 } 3086 3097 3087 3098 /** ··· 3084 3115 */ 3085 3116 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order) 3086 3117 { 3087 - struct mem_cgroup *memcg; 3118 + struct obj_cgroup *objcg; 3088 3119 int ret = 0; 3089 3120 3090 - memcg = get_mem_cgroup_from_current(); 3091 - if (memcg && 
!mem_cgroup_is_root(memcg)) { 3092 - ret = __memcg_kmem_charge(memcg, gfp, 1 << order); 3121 + objcg = get_obj_cgroup_from_current(); 3122 + if (objcg) { 3123 + ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order); 3093 3124 if (!ret) { 3094 - page->memcg_data = (unsigned long)memcg | 3125 + page->memcg_data = (unsigned long)objcg | 3095 3126 MEMCG_DATA_KMEM; 3096 3127 return 0; 3097 3128 } 3098 - css_put(&memcg->css); 3129 + obj_cgroup_put(objcg); 3099 3130 } 3100 3131 return ret; 3101 3132 } ··· 3107 3138 */ 3108 3139 void __memcg_kmem_uncharge_page(struct page *page, int order) 3109 3140 { 3110 - struct mem_cgroup *memcg = page_memcg(page); 3141 + struct obj_cgroup *objcg; 3111 3142 unsigned int nr_pages = 1 << order; 3112 3143 3113 - if (!memcg) 3144 + if (!PageMemcgKmem(page)) 3114 3145 return; 3115 3146 3116 - VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); 3117 - __memcg_kmem_uncharge(memcg, nr_pages); 3147 + objcg = __page_objcg(page); 3148 + obj_cgroup_uncharge_pages(objcg, nr_pages); 3118 3149 page->memcg_data = 0; 3119 - css_put(&memcg->css); 3150 + obj_cgroup_put(objcg); 3120 3151 } 3121 3152 3122 3153 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) ··· 3149 3180 unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT; 3150 3181 unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1); 3151 3182 3152 - if (nr_pages) { 3153 - rcu_read_lock(); 3154 - __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages); 3155 - rcu_read_unlock(); 3156 - } 3183 + if (nr_pages) 3184 + obj_cgroup_uncharge_pages(old, nr_pages); 3157 3185 3158 3186 /* 3159 3187 * The leftover is flushed to the centralized per-memcg value. ··· 3208 3242 3209 3243 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) 3210 3244 { 3211 - struct mem_cgroup *memcg; 3212 3245 unsigned int nr_pages, nr_bytes; 3213 3246 int ret; 3214 3247 ··· 3224 3259 * refill_obj_stock(), called from this function or 3225 3260 * independently later. 
3226 3261 */ 3227 - rcu_read_lock(); 3228 - retry: 3229 - memcg = obj_cgroup_memcg(objcg); 3230 - if (unlikely(!css_tryget(&memcg->css))) 3231 - goto retry; 3232 - rcu_read_unlock(); 3233 - 3234 3262 nr_pages = size >> PAGE_SHIFT; 3235 3263 nr_bytes = size & (PAGE_SIZE - 1); 3236 3264 3237 3265 if (nr_bytes) 3238 3266 nr_pages += 1; 3239 3267 3240 - ret = __memcg_kmem_charge(memcg, gfp, nr_pages); 3268 + ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages); 3241 3269 if (!ret && nr_bytes) 3242 3270 refill_obj_stock(objcg, PAGE_SIZE - nr_bytes); 3243 3271 3244 - css_put(&memcg->css); 3245 3272 return ret; 3246 3273 } 3247 3274 ··· 3257 3300 3258 3301 for (i = 1; i < nr; i++) 3259 3302 head[i].memcg_data = head->memcg_data; 3260 - css_get_many(&memcg->css, nr - 1); 3303 + 3304 + if (PageMemcgKmem(head)) 3305 + obj_cgroup_get_many(__page_objcg(head), nr - 1); 3306 + else 3307 + css_get_many(&memcg->css, nr - 1); 3261 3308 } 3262 3309 3263 3310 #ifdef CONFIG_MEMCG_SWAP ··· 3510 3549 unsigned long val; 3511 3550 3512 3551 if (mem_cgroup_is_root(memcg)) { 3552 + cgroup_rstat_flush(memcg->css.cgroup); 3513 3553 val = memcg_page_state(memcg, NR_FILE_PAGES) + 3514 3554 memcg_page_state(memcg, NR_ANON_MAPPED); 3515 3555 if (swap) ··· 3573 3611 default: 3574 3612 BUG(); 3575 3613 } 3576 - } 3577 - 3578 - static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg) 3579 - { 3580 - unsigned long stat[MEMCG_NR_STAT] = {0}; 3581 - struct mem_cgroup *mi; 3582 - int node, cpu, i; 3583 - 3584 - for_each_online_cpu(cpu) 3585 - for (i = 0; i < MEMCG_NR_STAT; i++) 3586 - stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu); 3587 - 3588 - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) 3589 - for (i = 0; i < MEMCG_NR_STAT; i++) 3590 - atomic_long_add(stat[i], &mi->vmstats[i]); 3591 - 3592 - for_each_node(node) { 3593 - struct mem_cgroup_per_node *pn = memcg->nodeinfo[node]; 3594 - struct mem_cgroup_per_node *pi; 3595 - 3596 - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 3597 - 
stat[i] = 0; 3598 - 3599 - for_each_online_cpu(cpu) 3600 - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 3601 - stat[i] += per_cpu( 3602 - pn->lruvec_stat_cpu->count[i], cpu); 3603 - 3604 - for (pi = pn; pi; pi = parent_nodeinfo(pi, node)) 3605 - for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) 3606 - atomic_long_add(stat[i], &pi->lruvec_stat[i]); 3607 - } 3608 - } 3609 - 3610 - static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg) 3611 - { 3612 - unsigned long events[NR_VM_EVENT_ITEMS]; 3613 - struct mem_cgroup *mi; 3614 - int cpu, i; 3615 - 3616 - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) 3617 - events[i] = 0; 3618 - 3619 - for_each_online_cpu(cpu) 3620 - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) 3621 - events[i] += per_cpu(memcg->vmstats_percpu->events[i], 3622 - cpu); 3623 - 3624 - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) 3625 - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) 3626 - atomic_long_add(events[i], &mi->vmevents[i]); 3627 3614 } 3628 3615 3629 3616 #ifdef CONFIG_MEMCG_KMEM ··· 3891 3980 int nid; 3892 3981 struct mem_cgroup *memcg = mem_cgroup_from_seq(m); 3893 3982 3983 + cgroup_rstat_flush(memcg->css.cgroup); 3984 + 3894 3985 for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { 3895 3986 seq_printf(m, "%s=%lu", stat->name, 3896 3987 mem_cgroup_nr_lru_pages(memcg, stat->lru_mask, ··· 3963 4050 3964 4051 BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); 3965 4052 4053 + cgroup_rstat_flush(memcg->css.cgroup); 4054 + 3966 4055 for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { 3967 4056 unsigned long nr; 3968 4057 ··· 4023 4108 unsigned long file_cost = 0; 4024 4109 4025 4110 for_each_online_pgdat(pgdat) { 4026 - mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); 4111 + mz = memcg->nodeinfo[pgdat->node_id]; 4027 4112 4028 4113 anon_cost += mz->lruvec.anon_cost; 4029 4114 file_cost += mz->lruvec.file_cost; ··· 4052 4137 if (val > 100) 4053 4138 return -EINVAL; 4054 4139 4055 - if (css->parent) 4140 + if 
(!mem_cgroup_is_root(memcg)) 4056 4141 memcg->swappiness = val; 4057 4142 else 4058 4143 vm_swappiness = val; ··· 4402 4487 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4403 4488 4404 4489 /* cannot set to root cgroup and only 0 and 1 are allowed */ 4405 - if (!css->parent || !((val == 0) || (val == 1))) 4490 + if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1))) 4406 4491 return -EINVAL; 4407 4492 4408 4493 memcg->oom_kill_disable = val; ··· 4441 4526 return &memcg->cgwb_domain; 4442 4527 } 4443 4528 4444 - /* 4445 - * idx can be of type enum memcg_stat_item or node_stat_item. 4446 - * Keep in sync with memcg_exact_page(). 4447 - */ 4448 - static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx) 4449 - { 4450 - long x = atomic_long_read(&memcg->vmstats[idx]); 4451 - int cpu; 4452 - 4453 - for_each_online_cpu(cpu) 4454 - x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx]; 4455 - if (x < 0) 4456 - x = 0; 4457 - return x; 4458 - } 4459 - 4460 4529 /** 4461 4530 * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg 4462 4531 * @wb: bdi_writeback in question ··· 4466 4567 struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); 4467 4568 struct mem_cgroup *parent; 4468 4569 4469 - *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); 4570 + cgroup_rstat_flush_irqsafe(memcg->css.cgroup); 4470 4571 4471 - *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); 4472 - *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + 4473 - memcg_exact_page_state(memcg, NR_ACTIVE_FILE); 4572 + *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); 4573 + *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); 4574 + *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) + 4575 + memcg_page_state(memcg, NR_ACTIVE_FILE); 4576 + 4474 4577 *pheadroom = PAGE_COUNTER_MAX; 4475 - 4476 4578 while ((parent = parent_mem_cgroup(memcg))) { 4477 4579 unsigned long ceiling = min(READ_ONCE(memcg->memory.max), 4478 4580 
READ_ONCE(memcg->memory.high)); ··· 5105 5205 for_each_node(node) 5106 5206 free_mem_cgroup_per_node_info(memcg, node); 5107 5207 free_percpu(memcg->vmstats_percpu); 5108 - free_percpu(memcg->vmstats_local); 5109 5208 kfree(memcg); 5110 5209 } 5111 5210 5112 5211 static void mem_cgroup_free(struct mem_cgroup *memcg) 5113 5212 { 5213 + int cpu; 5214 + 5114 5215 memcg_wb_domain_exit(memcg); 5115 5216 /* 5116 - * Flush percpu vmstats and vmevents to guarantee the value correctness 5117 - * on parent's and all ancestor levels. 5217 + * Flush percpu lruvec stats to guarantee the value 5218 + * correctness on parent's and all ancestor levels. 5118 5219 */ 5119 - memcg_flush_percpu_vmstats(memcg); 5120 - memcg_flush_percpu_vmevents(memcg); 5220 + for_each_online_cpu(cpu) 5221 + memcg_flush_lruvec_page_state(memcg, cpu); 5121 5222 __mem_cgroup_free(memcg); 5122 5223 } 5123 5224 ··· 5144 5243 error = memcg->id.id; 5145 5244 goto fail; 5146 5245 } 5147 - 5148 - memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, 5149 - GFP_KERNEL_ACCOUNT); 5150 - if (!memcg->vmstats_local) 5151 - goto fail; 5152 5246 5153 5247 memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, 5154 5248 GFP_KERNEL_ACCOUNT); ··· 5342 5446 memcg->soft_limit = PAGE_COUNTER_MAX; 5343 5447 page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); 5344 5448 memcg_wb_domain_size_changed(memcg); 5449 + } 5450 + 5451 + static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) 5452 + { 5453 + struct mem_cgroup *memcg = mem_cgroup_from_css(css); 5454 + struct mem_cgroup *parent = parent_mem_cgroup(memcg); 5455 + struct memcg_vmstats_percpu *statc; 5456 + long delta, v; 5457 + int i; 5458 + 5459 + statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); 5460 + 5461 + for (i = 0; i < MEMCG_NR_STAT; i++) { 5462 + /* 5463 + * Collect the aggregated propagation counts of groups 5464 + * below us. 
We're in a per-cpu loop here and this is 5465 + * a global counter, so the first cycle will get them. 5466 + */ 5467 + delta = memcg->vmstats.state_pending[i]; 5468 + if (delta) 5469 + memcg->vmstats.state_pending[i] = 0; 5470 + 5471 + /* Add CPU changes on this level since the last flush */ 5472 + v = READ_ONCE(statc->state[i]); 5473 + if (v != statc->state_prev[i]) { 5474 + delta += v - statc->state_prev[i]; 5475 + statc->state_prev[i] = v; 5476 + } 5477 + 5478 + if (!delta) 5479 + continue; 5480 + 5481 + /* Aggregate counts on this level and propagate upwards */ 5482 + memcg->vmstats.state[i] += delta; 5483 + if (parent) 5484 + parent->vmstats.state_pending[i] += delta; 5485 + } 5486 + 5487 + for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { 5488 + delta = memcg->vmstats.events_pending[i]; 5489 + if (delta) 5490 + memcg->vmstats.events_pending[i] = 0; 5491 + 5492 + v = READ_ONCE(statc->events[i]); 5493 + if (v != statc->events_prev[i]) { 5494 + delta += v - statc->events_prev[i]; 5495 + statc->events_prev[i] = v; 5496 + } 5497 + 5498 + if (!delta) 5499 + continue; 5500 + 5501 + memcg->vmstats.events[i] += delta; 5502 + if (parent) 5503 + parent->vmstats.events_pending[i] += delta; 5504 + } 5345 5505 } 5346 5506 5347 5507 #ifdef CONFIG_MMU ··· 6453 6501 .css_released = mem_cgroup_css_released, 6454 6502 .css_free = mem_cgroup_css_free, 6455 6503 .css_reset = mem_cgroup_css_reset, 6504 + .css_rstat_flush = mem_cgroup_css_rstat_flush, 6456 6505 .can_attach = mem_cgroup_can_attach, 6457 6506 .cancel_attach = mem_cgroup_cancel_attach, 6458 6507 .post_attach = mem_cgroup_move_task, ··· 6636 6683 atomic_long_read(&parent->memory.children_low_usage))); 6637 6684 } 6638 6685 6686 + static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg, 6687 + gfp_t gfp) 6688 + { 6689 + unsigned int nr_pages = thp_nr_pages(page); 6690 + int ret; 6691 + 6692 + ret = try_charge(memcg, gfp, nr_pages); 6693 + if (ret) 6694 + goto out; 6695 + 6696 + css_get(&memcg->css); 6697 + 
commit_charge(page, memcg); 6698 + 6699 + local_irq_disable(); 6700 + mem_cgroup_charge_statistics(memcg, page, nr_pages); 6701 + memcg_check_events(memcg, page); 6702 + local_irq_enable(); 6703 + out: 6704 + return ret; 6705 + } 6706 + 6639 6707 /** 6640 6708 * mem_cgroup_charge - charge a newly allocated page to a cgroup 6641 6709 * @page: page to charge ··· 6666 6692 * Try to charge @page to the memcg that @mm belongs to, reclaiming 6667 6693 * pages according to @gfp_mask if necessary. 6668 6694 * 6695 + * Do not use this for pages allocated for swapin. 6696 + * 6669 6697 * Returns 0 on success. Otherwise, an error code is returned. 6670 6698 */ 6671 6699 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask) 6672 6700 { 6673 - unsigned int nr_pages = thp_nr_pages(page); 6674 - struct mem_cgroup *memcg = NULL; 6675 - int ret = 0; 6701 + struct mem_cgroup *memcg; 6702 + int ret; 6676 6703 6677 6704 if (mem_cgroup_disabled()) 6678 - goto out; 6705 + return 0; 6679 6706 6680 - if (PageSwapCache(page)) { 6681 - swp_entry_t ent = { .val = page_private(page), }; 6682 - unsigned short id; 6707 + memcg = get_mem_cgroup_from_mm(mm); 6708 + ret = __mem_cgroup_charge(page, memcg, gfp_mask); 6709 + css_put(&memcg->css); 6683 6710 6684 - /* 6685 - * Every swap fault against a single page tries to charge the 6686 - * page, bail as early as possible. shmem_unuse() encounters 6687 - * already charged pages, too. page and memcg binding is 6688 - * protected by the page lock, which serializes swap cache 6689 - * removal, which in turn serializes uncharging. 
6690 - */ 6691 - VM_BUG_ON_PAGE(!PageLocked(page), page); 6692 - if (page_memcg(compound_head(page))) 6693 - goto out; 6711 + return ret; 6712 + } 6694 6713 6695 - id = lookup_swap_cgroup_id(ent); 6696 - rcu_read_lock(); 6697 - memcg = mem_cgroup_from_id(id); 6698 - if (memcg && !css_tryget_online(&memcg->css)) 6699 - memcg = NULL; 6700 - rcu_read_unlock(); 6701 - } 6714 + /** 6715 + * mem_cgroup_swapin_charge_page - charge a newly allocated page for swapin 6716 + * @page: page to charge 6717 + * @mm: mm context of the victim 6718 + * @gfp: reclaim mode 6719 + * @entry: swap entry for which the page is allocated 6720 + * 6721 + * This function charges a page allocated for swapin. Please call this before 6722 + * adding the page to the swapcache. 6723 + * 6724 + * Returns 0 on success. Otherwise, an error code is returned. 6725 + */ 6726 + int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, 6727 + gfp_t gfp, swp_entry_t entry) 6728 + { 6729 + struct mem_cgroup *memcg; 6730 + unsigned short id; 6731 + int ret; 6702 6732 6703 - if (!memcg) 6733 + if (mem_cgroup_disabled()) 6734 + return 0; 6735 + 6736 + id = lookup_swap_cgroup_id(entry); 6737 + rcu_read_lock(); 6738 + memcg = mem_cgroup_from_id(id); 6739 + if (!memcg || !css_tryget_online(&memcg->css)) 6704 6740 memcg = get_mem_cgroup_from_mm(mm); 6741 + rcu_read_unlock(); 6705 6742 6706 - ret = try_charge(memcg, gfp_mask, nr_pages); 6707 - if (ret) 6708 - goto out_put; 6743 + ret = __mem_cgroup_charge(page, memcg, gfp); 6709 6744 6710 - css_get(&memcg->css); 6711 - commit_charge(page, memcg); 6745 + css_put(&memcg->css); 6746 + return ret; 6747 + } 6712 6748 6713 - local_irq_disable(); 6714 - mem_cgroup_charge_statistics(memcg, page, nr_pages); 6715 - memcg_check_events(memcg, page); 6716 - local_irq_enable(); 6717 - 6749 + /* 6750 + * mem_cgroup_swapin_uncharge_swap - uncharge swap slot 6751 + * @entry: swap entry for which the page is charged 6752 + * 6753 + * Call this function after 
successfully adding the charged page to swapcache. 6754 + * 6755 + * Note: This function assumes the page for which swap slot is being uncharged 6756 + * is order 0 page. 6757 + */ 6758 + void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) 6759 + { 6718 6760 /* 6719 6761 * Cgroup1's unified memory+swap counter has been charged with the 6720 6762 * new swapcache page, finish the transfer by uncharging the swap ··· 6743 6753 * correspond 1:1 to page and swap slot lifetimes: we charge the 6744 6754 * page to memory here, and uncharge swap when the slot is freed. 6745 6755 */ 6746 - if (do_memsw_account() && PageSwapCache(page)) { 6747 - swp_entry_t entry = { .val = page_private(page) }; 6756 + if (!mem_cgroup_disabled() && do_memsw_account()) { 6748 6757 /* 6749 6758 * The swap entry might not get freed for a long time, 6750 6759 * let's not wait for it. The page already received a 6751 6760 * memory+swap charge, drop the swap entry duplicate. 6752 6761 */ 6753 - mem_cgroup_uncharge_swap(entry, nr_pages); 6762 + mem_cgroup_uncharge_swap(entry, 1); 6754 6763 } 6755 - 6756 - out_put: 6757 - css_put(&memcg->css); 6758 - out: 6759 - return ret; 6760 6764 } 6761 6765 6762 6766 struct uncharge_gather { 6763 6767 struct mem_cgroup *memcg; 6764 - unsigned long nr_pages; 6768 + unsigned long nr_memory; 6765 6769 unsigned long pgpgout; 6766 6770 unsigned long nr_kmem; 6767 6771 struct page *dummy_page; ··· 6770 6786 { 6771 6787 unsigned long flags; 6772 6788 6773 - if (!mem_cgroup_is_root(ug->memcg)) { 6774 - page_counter_uncharge(&ug->memcg->memory, ug->nr_pages); 6789 + if (ug->nr_memory) { 6790 + page_counter_uncharge(&ug->memcg->memory, ug->nr_memory); 6775 6791 if (do_memsw_account()) 6776 - page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages); 6792 + page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory); 6777 6793 if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem) 6778 6794 page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem); 6779 6795 
memcg_oom_recover(ug->memcg); ··· 6781 6797 6782 6798 local_irq_save(flags); 6783 6799 __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); 6784 - __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages); 6800 + __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); 6785 6801 memcg_check_events(ug->memcg, ug->dummy_page); 6786 6802 local_irq_restore(flags); 6787 6803 ··· 6792 6808 static void uncharge_page(struct page *page, struct uncharge_gather *ug) 6793 6809 { 6794 6810 unsigned long nr_pages; 6811 + struct mem_cgroup *memcg; 6812 + struct obj_cgroup *objcg; 6795 6813 6796 6814 VM_BUG_ON_PAGE(PageLRU(page), page); 6797 6815 6798 - if (!page_memcg(page)) 6799 - return; 6800 - 6801 6816 /* 6802 6817 * Nobody should be changing or seriously looking at 6803 - * page_memcg(page) at this point, we have fully 6818 + * page memcg or objcg at this point, we have fully 6804 6819 * exclusive access to the page. 6805 6820 */ 6821 + if (PageMemcgKmem(page)) { 6822 + objcg = __page_objcg(page); 6823 + /* 6824 + * This get matches the put at the end of the function and 6825 + * kmem pages do not hold memcg references anymore. 
6826 + */ 6827 + memcg = get_mem_cgroup_from_objcg(objcg); 6828 + } else { 6829 + memcg = __page_memcg(page); 6830 + } 6806 6831 6807 - if (ug->memcg != page_memcg(page)) { 6832 + if (!memcg) 6833 + return; 6834 + 6835 + if (ug->memcg != memcg) { 6808 6836 if (ug->memcg) { 6809 6837 uncharge_batch(ug); 6810 6838 uncharge_gather_clear(ug); 6811 6839 } 6812 - ug->memcg = page_memcg(page); 6840 + ug->memcg = memcg; 6841 + ug->dummy_page = page; 6813 6842 6814 6843 /* pairs with css_put in uncharge_batch */ 6815 - css_get(&ug->memcg->css); 6844 + css_get(&memcg->css); 6816 6845 } 6817 6846 6818 6847 nr_pages = compound_nr(page); 6819 - ug->nr_pages += nr_pages; 6820 6848 6821 - if (PageMemcgKmem(page)) 6849 + if (PageMemcgKmem(page)) { 6850 + ug->nr_memory += nr_pages; 6822 6851 ug->nr_kmem += nr_pages; 6823 - else 6852 + 6853 + page->memcg_data = 0; 6854 + obj_cgroup_put(objcg); 6855 + } else { 6856 + /* LRU pages aren't accounted at the root level */ 6857 + if (!mem_cgroup_is_root(memcg)) 6858 + ug->nr_memory += nr_pages; 6824 6859 ug->pgpgout++; 6825 6860 6826 - ug->dummy_page = page; 6827 - page->memcg_data = 0; 6828 - css_put(&ug->memcg->css); 6861 + page->memcg_data = 0; 6862 + } 6863 + 6864 + css_put(&memcg->css); 6829 6865 } 6830 6866 6831 6867 /**
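The new `mem_cgroup_css_rstat_flush()` above replaces the old full per-cpu summation on every read: each flush takes the delta between the live per-cpu counter and a saved snapshot, adds it at this level, and queues it as a pending delta for the parent to pick up. A minimal user-space sketch of that pattern, with hypothetical names, one stat, and a two-level hierarchy (the real code iterates `MEMCG_NR_STAT` stats and relies on cgroup rstat to flush children before parents):

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

struct group {
	struct group *parent;
	long state;                 /* aggregated value at this level */
	long state_pending;         /* deltas queued by children */
	long percpu[NR_CPUS];       /* live per-cpu counters */
	long percpu_prev[NR_CPUS];  /* snapshot taken at the previous flush */
};

/* Flush one cpu's delta for @g and propagate it to the parent. */
static void flush_cpu(struct group *g, int cpu)
{
	/* Child deltas are global, so the first cpu iteration picks them up. */
	long delta = g->state_pending;

	g->state_pending = 0;

	/* Add this cpu's changes since the last flush. */
	long v = g->percpu[cpu];

	delta += v - g->percpu_prev[cpu];
	g->percpu_prev[cpu] = v;

	if (!delta)
		return;

	g->state += delta;
	if (g->parent)
		g->parent->state_pending += delta;
}

static void flush(struct group *g)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		flush_cpu(g, cpu);
}
```

Because only deltas move up the tree, a flush is O(levels) per cpu instead of re-walking every cpu counter on every ancestor, which is what the deleted `memcg_flush_percpu_vmstats()` had to do.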
+1 -1
mm/memory-failure.c
···
1368 1368		 * communicated in siginfo, see kill_proc()
1369 1369		 */
1370 1370		start = (page->index << PAGE_SHIFT) & ~(size - 1);
1371      -		unmap_mapping_range(page->mapping, start, start + size, 0);
     1371 +		unmap_mapping_range(page->mapping, start, size, 0);
1372 1372	}
1373 1373	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
1374 1374	rc = 0;
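The memory-failure fix turns on the signature of `unmap_mapping_range()`: its third argument is a hole *length*, not an end address, so passing `start + size` unmapped more than the intended window. The aligned-window arithmetic itself is just a mask; a small sketch (PAGE_SHIFT of 12 assumed for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Start of the naturally aligned, @size-byte window that contains the
 * page at @index; @size must be a power of two.
 */
static uint64_t hole_start(uint64_t index, uint64_t size)
{
	return (index << PAGE_SHIFT) & ~(size - 1);
}
```

With the fix, the unmapped hole is exactly `[hole_start, hole_start + size)`; the old call extended the hole by `start` extra bytes.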
+121 -72
mm/memory.c
··· 2260 2260 return 0; 2261 2261 } 2262 2262 2263 - /** 2264 - * remap_pfn_range - remap kernel memory to userspace 2265 - * @vma: user vma to map to 2266 - * @addr: target page aligned user address to start at 2267 - * @pfn: page frame number of kernel physical memory address 2268 - * @size: size of mapping area 2269 - * @prot: page protection flags for this mapping 2270 - * 2271 - * Note: this is only safe if the mm semaphore is held when called. 2272 - * 2273 - * Return: %0 on success, negative error code otherwise. 2263 + /* 2264 + * Variant of remap_pfn_range that does not call track_pfn_remap. The caller 2265 + * must have pre-validated the caching bits of the pgprot_t. 2274 2266 */ 2275 - int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, 2276 - unsigned long pfn, unsigned long size, pgprot_t prot) 2267 + int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr, 2268 + unsigned long pfn, unsigned long size, pgprot_t prot) 2277 2269 { 2278 2270 pgd_t *pgd; 2279 2271 unsigned long next; 2280 2272 unsigned long end = addr + PAGE_ALIGN(size); 2281 2273 struct mm_struct *mm = vma->vm_mm; 2282 - unsigned long remap_pfn = pfn; 2283 2274 int err; 2284 2275 2285 2276 if (WARN_ON_ONCE(!PAGE_ALIGNED(addr))) ··· 2300 2309 vma->vm_pgoff = pfn; 2301 2310 } 2302 2311 2303 - err = track_pfn_remap(vma, &prot, remap_pfn, addr, PAGE_ALIGN(size)); 2304 - if (err) 2305 - return -EINVAL; 2306 - 2307 2312 vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; 2308 2313 2309 2314 BUG_ON(addr >= end); ··· 2311 2324 err = remap_p4d_range(mm, pgd, addr, next, 2312 2325 pfn + (addr >> PAGE_SHIFT), prot); 2313 2326 if (err) 2314 - break; 2327 + return err; 2315 2328 } while (pgd++, addr = next, addr != end); 2316 2329 2317 - if (err) 2318 - untrack_pfn(vma, remap_pfn, PAGE_ALIGN(size)); 2330 + return 0; 2331 + } 2319 2332 2333 + /** 2334 + * remap_pfn_range - remap kernel memory to userspace 2335 + * @vma: user vma to map to 2336 + * @addr: 
target page aligned user address to start at 2337 + * @pfn: page frame number of kernel physical memory address 2338 + * @size: size of mapping area 2339 + * @prot: page protection flags for this mapping 2340 + * 2341 + * Note: this is only safe if the mm semaphore is held when called. 2342 + * 2343 + * Return: %0 on success, negative error code otherwise. 2344 + */ 2345 + int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr, 2346 + unsigned long pfn, unsigned long size, pgprot_t prot) 2347 + { 2348 + int err; 2349 + 2350 + err = track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size)); 2351 + if (err) 2352 + return -EINVAL; 2353 + 2354 + err = remap_pfn_range_notrack(vma, addr, pfn, size, prot); 2355 + if (err) 2356 + untrack_pfn(vma, pfn, PAGE_ALIGN(size)); 2320 2357 return err; 2321 2358 } 2322 2359 EXPORT_SYMBOL(remap_pfn_range); ··· 2457 2446 } 2458 2447 do { 2459 2448 next = pmd_addr_end(addr, end); 2460 - if (create || !pmd_none_or_clear_bad(pmd)) { 2461 - err = apply_to_pte_range(mm, pmd, addr, next, fn, data, 2462 - create, mask); 2463 - if (err) 2464 - break; 2449 + if (pmd_none(*pmd) && !create) 2450 + continue; 2451 + if (WARN_ON_ONCE(pmd_leaf(*pmd))) 2452 + return -EINVAL; 2453 + if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) { 2454 + if (!create) 2455 + continue; 2456 + pmd_clear_bad(pmd); 2465 2457 } 2458 + err = apply_to_pte_range(mm, pmd, addr, next, 2459 + fn, data, create, mask); 2460 + if (err) 2461 + break; 2466 2462 } while (pmd++, addr = next, addr != end); 2463 + 2467 2464 return err; 2468 2465 } 2469 2466 ··· 2493 2474 } 2494 2475 do { 2495 2476 next = pud_addr_end(addr, end); 2496 - if (create || !pud_none_or_clear_bad(pud)) { 2497 - err = apply_to_pmd_range(mm, pud, addr, next, fn, data, 2498 - create, mask); 2499 - if (err) 2500 - break; 2477 + if (pud_none(*pud) && !create) 2478 + continue; 2479 + if (WARN_ON_ONCE(pud_leaf(*pud))) 2480 + return -EINVAL; 2481 + if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) { 2482 
+ if (!create) 2483 + continue; 2484 + pud_clear_bad(pud); 2501 2485 } 2486 + err = apply_to_pmd_range(mm, pud, addr, next, 2487 + fn, data, create, mask); 2488 + if (err) 2489 + break; 2502 2490 } while (pud++, addr = next, addr != end); 2491 + 2503 2492 return err; 2504 2493 } 2505 2494 ··· 2529 2502 } 2530 2503 do { 2531 2504 next = p4d_addr_end(addr, end); 2532 - if (create || !p4d_none_or_clear_bad(p4d)) { 2533 - err = apply_to_pud_range(mm, p4d, addr, next, fn, data, 2534 - create, mask); 2535 - if (err) 2536 - break; 2505 + if (p4d_none(*p4d) && !create) 2506 + continue; 2507 + if (WARN_ON_ONCE(p4d_leaf(*p4d))) 2508 + return -EINVAL; 2509 + if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) { 2510 + if (!create) 2511 + continue; 2512 + p4d_clear_bad(p4d); 2537 2513 } 2514 + err = apply_to_pud_range(mm, p4d, addr, next, 2515 + fn, data, create, mask); 2516 + if (err) 2517 + break; 2538 2518 } while (p4d++, addr = next, addr != end); 2519 + 2539 2520 return err; 2540 2521 } 2541 2522 ··· 2563 2528 pgd = pgd_offset(mm, addr); 2564 2529 do { 2565 2530 next = pgd_addr_end(addr, end); 2566 - if (!create && pgd_none_or_clear_bad(pgd)) 2531 + if (pgd_none(*pgd) && !create) 2567 2532 continue; 2568 - err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create, &mask); 2533 + if (WARN_ON_ONCE(pgd_leaf(*pgd))) 2534 + return -EINVAL; 2535 + if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) { 2536 + if (!create) 2537 + continue; 2538 + pgd_clear_bad(pgd); 2539 + } 2540 + err = apply_to_p4d_range(mm, pgd, addr, next, 2541 + fn, data, create, &mask); 2569 2542 if (err) 2570 2543 break; 2571 2544 } while (pgd++, addr = next, addr != end); ··· 3352 3309 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, 3353 3310 vmf->address); 3354 3311 if (page) { 3355 - int err; 3356 - 3357 3312 __SetPageLocked(page); 3358 3313 __SetPageSwapBacked(page); 3359 - set_page_private(page, entry.val); 3360 3314 3361 - /* Tell memcg to use swap ownership records */ 3362 - 
SetPageSwapCache(page); 3363 - err = mem_cgroup_charge(page, vma->vm_mm, 3364 - GFP_KERNEL); 3365 - ClearPageSwapCache(page); 3366 - if (err) { 3315 + if (mem_cgroup_swapin_charge_page(page, 3316 + vma->vm_mm, GFP_KERNEL, entry)) { 3367 3317 ret = VM_FAULT_OOM; 3368 3318 goto out_page; 3369 3319 } 3320 + mem_cgroup_swapin_uncharge_swap(entry); 3370 3321 3371 3322 shadow = get_shadow_from_swap_cache(entry); 3372 3323 if (shadow) 3373 3324 workingset_refault(page, shadow); 3374 3325 3375 3326 lru_cache_add(page); 3327 + 3328 + /* To provide entry to swap_readpage() */ 3329 + set_page_private(page, entry.val); 3376 3330 swap_readpage(page, true); 3331 + set_page_private(page, 0); 3377 3332 } 3378 3333 } else { 3379 3334 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, ··· 4141 4100 int page_nid = NUMA_NO_NODE; 4142 4101 int last_cpupid; 4143 4102 int target_nid; 4144 - bool migrated = false; 4145 4103 pte_t pte, old_pte; 4146 4104 bool was_writable = pte_savedwrite(vmf->orig_pte); 4147 4105 int flags = 0; ··· 4157 4117 goto out; 4158 4118 } 4159 4119 4160 - /* 4161 - * Make it present again, Depending on how arch implementes non 4162 - * accessible ptes, some can allow access by kernel mode. 
4163 - */ 4164 - old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte); 4120 + /* Get the normal PTE */ 4121 + old_pte = ptep_get(vmf->pte); 4165 4122 pte = pte_modify(old_pte, vma->vm_page_prot); 4166 - pte = pte_mkyoung(pte); 4167 - if (was_writable) 4168 - pte = pte_mkwrite(pte); 4169 - ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); 4170 - update_mmu_cache(vma, vmf->address, vmf->pte); 4171 4123 4172 4124 page = vm_normal_page(vma, vmf->address, pte); 4173 - if (!page) { 4174 - pte_unmap_unlock(vmf->pte, vmf->ptl); 4175 - return 0; 4176 - } 4125 + if (!page) 4126 + goto out_map; 4177 4127 4178 4128 /* TODO: handle PTE-mapped THP */ 4179 - if (PageCompound(page)) { 4180 - pte_unmap_unlock(vmf->pte, vmf->ptl); 4181 - return 0; 4182 - } 4129 + if (PageCompound(page)) 4130 + goto out_map; 4183 4131 4184 4132 /* 4185 4133 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as ··· 4177 4149 * pte_dirty has unpredictable behaviour between PTE scan updates, 4178 4150 * background writeback, dirty balancing and application behaviour. 
4179 4151 */ 4180 - if (!pte_write(pte)) 4152 + if (!was_writable) 4181 4153 flags |= TNF_NO_GROUP; 4182 4154 4183 4155 /* ··· 4191 4163 page_nid = page_to_nid(page); 4192 4164 target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid, 4193 4165 &flags); 4194 - pte_unmap_unlock(vmf->pte, vmf->ptl); 4195 4166 if (target_nid == NUMA_NO_NODE) { 4196 4167 put_page(page); 4197 - goto out; 4168 + goto out_map; 4198 4169 } 4170 + pte_unmap_unlock(vmf->pte, vmf->ptl); 4199 4171 4200 4172 /* Migrate to the requested node */ 4201 - migrated = migrate_misplaced_page(page, vma, target_nid); 4202 - if (migrated) { 4173 + if (migrate_misplaced_page(page, vma, target_nid)) { 4203 4174 page_nid = target_nid; 4204 4175 flags |= TNF_MIGRATED; 4205 - } else 4176 + } else { 4206 4177 flags |= TNF_MIGRATE_FAIL; 4178 + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); 4179 + spin_lock(vmf->ptl); 4180 + if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) { 4181 + pte_unmap_unlock(vmf->pte, vmf->ptl); 4182 + goto out; 4183 + } 4184 + goto out_map; 4185 + } 4207 4186 4208 4187 out: 4209 4188 if (page_nid != NUMA_NO_NODE) 4210 4189 task_numa_fault(last_cpupid, page_nid, 1, flags); 4211 4190 return 0; 4191 + out_map: 4192 + /* 4193 + * Make it present again, depending on how arch implements 4194 + * non-accessible ptes, some can allow access by kernel mode. 4195 + */ 4196 + old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte); 4197 + pte = pte_modify(old_pte, vma->vm_page_prot); 4198 + pte = pte_mkyoung(pte); 4199 + if (was_writable) 4200 + pte = pte_mkwrite(pte); 4201 + ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte); 4202 + update_mmu_cache(vma, vmf->address, vmf->pte); 4203 + pte_unmap_unlock(vmf->pte, vmf->ptl); 4204 + goto out; 4212 4205 } 4213 4206 4214 4207 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
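The `apply_to_pmd_range()` family above now open-codes the "none" handling: a missing entry is skipped when `create` is false, a huge (leaf) entry fails with `-EINVAL`, and a bad entry is cleared only when creating. The control flow can be sketched against a toy one-level table (hypothetical types; no allocator, so creating into a hole just errors here):

```c
#include <assert.h>
#include <stddef.h>

#define EINVAL 22
#define ENTRIES 4

struct pte { int present; };
struct pmd {
	int leaf;               /* huge mapping: no pte level below */
	struct pte *table;      /* NULL together with !leaf means "none" */
};

typedef int (*pte_fn_t)(struct pte *pte, void *data);

/* Walk every pte slot under @pmds, mimicking the create/!create rules. */
static int apply_to_range(struct pmd *pmds, int n, int create,
			  pte_fn_t fn, void *data)
{
	for (int i = 0; i < n; i++) {
		if (!pmds[i].leaf && !pmds[i].table) {	/* pmd_none() */
			if (!create)
				continue;	/* skip the hole */
			return -EINVAL;	/* toy stand-in: no allocator here */
		}
		if (pmds[i].leaf)
			return -EINVAL;	/* WARN_ON_ONCE(pmd_leaf()) */
		for (int j = 0; j < ENTRIES; j++) {
			int err = fn(&pmds[i].table[j], data);

			if (err)
				return err;
		}
	}
	return 0;
}

static int count_present(struct pte *pte, void *data)
{
	if (pte->present)
		(*(int *)data)++;
	return 0;
}
```

The ordering matters: the "none" test runs first so holes are skipped cheaply, and the leaf test runs before any descent so a walker can never interpret a huge entry as a page-table pointer.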
+32 -44
mm/mempolicy.c
··· 2140 2140 { 2141 2141 struct page *page; 2142 2142 2143 - page = __alloc_pages(gfp, order, nid); 2143 + page = __alloc_pages(gfp, order, nid, NULL); 2144 2144 /* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */ 2145 2145 if (!static_branch_likely(&vm_numa_stat_key)) 2146 2146 return page; ··· 2153 2153 } 2154 2154 2155 2155 /** 2156 - * alloc_pages_vma - Allocate a page for a VMA. 2156 + * alloc_pages_vma - Allocate a page for a VMA. 2157 + * @gfp: GFP flags. 2158 + * @order: Order of the GFP allocation. 2159 + * @vma: Pointer to VMA or NULL if not available. 2160 + * @addr: Virtual address of the allocation. Must be inside @vma. 2161 + * @node: Which node to prefer for allocation (modulo policy). 2162 + * @hugepage: For hugepages try only the preferred node if possible. 2157 2163 * 2158 - * @gfp: 2159 - * %GFP_USER user allocation. 2160 - * %GFP_KERNEL kernel allocations, 2161 - * %GFP_HIGHMEM highmem/user allocations, 2162 - * %GFP_FS allocation should not call back into a file system. 2163 - * %GFP_ATOMIC don't sleep. 2164 + * Allocate a page for a specific address in @vma, using the appropriate 2165 + * NUMA policy. When @vma is not NULL the caller must hold the mmap_lock 2166 + * of the mm_struct of the VMA to prevent it from going away. Should be 2167 + * used for all allocations for pages that will be mapped into user space. 2164 2168 * 2165 - * @order:Order of the GFP allocation. 2166 - * @vma: Pointer to VMA or NULL if not available. 2167 - * @addr: Virtual Address of the allocation. Must be inside the VMA. 2168 - * @node: Which node to prefer for allocation (modulo policy). 2169 - * @hugepage: for hugepages try only the preferred node if possible 2170 - * 2171 - * This function allocates a page from the kernel page pool and applies 2172 - * a NUMA policy associated with the VMA or the current process. 
2173 - * When VMA is not NULL caller must read-lock the mmap_lock of the 2174 - * mm_struct of the VMA to prevent it from going away. Should be used for 2175 - * all allocations for pages that will be mapped into user space. Returns 2176 - * NULL when no page can be allocated. 2169 + * Return: The page on success or NULL if allocation fails. 2177 2170 */ 2178 - struct page * 2179 - alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, 2171 + struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, 2180 2172 unsigned long addr, int node, bool hugepage) 2181 2173 { 2182 2174 struct mempolicy *pol; ··· 2229 2237 2230 2238 nmask = policy_nodemask(gfp, pol); 2231 2239 preferred_nid = policy_node(gfp, pol, node); 2232 - page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask); 2240 + page = __alloc_pages(gfp, order, preferred_nid, nmask); 2233 2241 mpol_cond_put(pol); 2234 2242 out: 2235 2243 return page; ··· 2237 2245 EXPORT_SYMBOL(alloc_pages_vma); 2238 2246 2239 2247 /** 2240 - * alloc_pages_current - Allocate pages. 2248 + * alloc_pages - Allocate pages. 2249 + * @gfp: GFP flags. 2250 + * @order: Power of two of number of pages to allocate. 2241 2251 * 2242 - * @gfp: 2243 - * %GFP_USER user allocation, 2244 - * %GFP_KERNEL kernel allocation, 2245 - * %GFP_HIGHMEM highmem allocation, 2246 - * %GFP_FS don't call back into a file system. 2247 - * %GFP_ATOMIC don't sleep. 2248 - * @order: Power of two of allocation size in pages. 0 is a single page. 2252 + * Allocate 1 << @order contiguous pages. The physical address of the 2253 + * first page is naturally aligned (eg an order-3 allocation will be aligned 2254 + * to a multiple of 8 * PAGE_SIZE bytes). The NUMA policy of the current 2255 + * process is honoured when in process context. 2249 2256 * 2250 - * Allocate a page from the kernel page pool. When not in 2251 - * interrupt context and apply the current process NUMA policy. 2252 - * Returns NULL when no page can be allocated. 
2257 + * Context: Can be called from any context, providing the appropriate GFP 2258 + * flags are used. 2259 + * Return: The page on success or NULL if allocation fails. 2253 2260 */ 2254 - struct page *alloc_pages_current(gfp_t gfp, unsigned order) 2261 + struct page *alloc_pages(gfp_t gfp, unsigned order) 2255 2262 { 2256 2263 struct mempolicy *pol = &default_policy; 2257 2264 struct page *page; ··· 2265 2274 if (pol->mode == MPOL_INTERLEAVE) 2266 2275 page = alloc_page_interleave(gfp, order, interleave_nodes(pol)); 2267 2276 else 2268 - page = __alloc_pages_nodemask(gfp, order, 2277 + page = __alloc_pages(gfp, order, 2269 2278 policy_node(gfp, pol, numa_node_id()), 2270 2279 policy_nodemask(gfp, pol)); 2271 2280 2272 2281 return page; 2273 2282 } 2274 - EXPORT_SYMBOL(alloc_pages_current); 2283 + EXPORT_SYMBOL(alloc_pages); 2275 2284 2276 2285 int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) 2277 2286 { ··· 2448 2457 * @addr: virtual address where page mapped 2449 2458 * 2450 2459 * Lookup current policy node id for vma,addr and "compare to" page's 2451 - * node id. 2452 - * 2453 - * Returns: 2454 - * -1 - not misplaced, page is in the right node 2455 - * node - node id where the page should be 2456 - * 2457 - * Policy determination "mimics" alloc_page_vma(). 2460 + * node id. Policy determination "mimics" alloc_page_vma(). 2458 2461 * Called from fault path where we know the vma and faulting address. 2462 + * 2463 + * Return: -1 if the page is in a node that is valid for this policy, or a 2464 + * suitable node ID to allocate a replacement page from. 2459 2465 */ 2460 2466 int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr) 2461 2467 {
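The rewritten `alloc_pages()` kernel-doc states that an order-N allocation is naturally aligned, e.g. order 3 to a multiple of 8 * PAGE_SIZE. The arithmetic is just shifts and masks; a sketch (a 4096-byte PAGE_SIZE is assumed for illustration):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Size in bytes of an order-@order allocation: 1 << order pages. */
static unsigned long order_bytes(unsigned int order)
{
	return PAGE_SIZE << order;
}

/* A naturally aligned block starts at a multiple of its own size. */
static int naturally_aligned(unsigned long addr, unsigned int order)
{
	return (addr & (order_bytes(order) - 1)) == 0;
}
```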
+2 -2
mm/mempool.c
···
106 106		if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
107 107			kasan_slab_free_mempool(element);
108 108		else if (pool->alloc == mempool_alloc_pages)
109     -		kasan_free_pages(element, (unsigned long)pool->pool_data);
    109 +		kasan_free_pages(element, (unsigned long)pool->pool_data, false);
110 110	}
111 111
112 112	static void kasan_unpoison_element(mempool_t *pool, void *element)
···
114 114		if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
115 115			kasan_unpoison_range(element, __ksize(element));
116 116		else if (pool->alloc == mempool_alloc_pages)
117     -		kasan_alloc_pages(element, (unsigned long)pool->pool_data);
    117 +		kasan_alloc_pages(element, (unsigned long)pool->pool_data, false);
118 118	}
119 119
120 120	static __always_inline void add_element(mempool_t *pool, void *element)
+1 -1
mm/memremap.c
···
1    -	/* SPDX-License-Identifier: GPL-2.0 */
   1 +	// SPDX-License-Identifier: GPL-2.0
2 2	/* Copyright(c) 2015 Intel Corporation. All rights reserved. */
3 3	#include <linux/device.h>
4 4	#include <linux/io.h>
+1 -1
mm/migrate.c
···
1617 1617		if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
1618 1618			gfp_mask |= __GFP_HIGHMEM;
1619 1619
1620      -		new_page = __alloc_pages_nodemask(gfp_mask, order, nid, mtc->nmask);
     1620 +		new_page = __alloc_pages(gfp_mask, order, nid, mtc->nmask);
1621 1621
1622 1622		if (new_page && PageTransHuge(new_page))
1623 1623			prep_transhuge_page(new_page);
-4
mm/mm_init.c
···
19 19	#ifdef CONFIG_DEBUG_MEMORY_INIT
20 20	int __meminitdata mminit_loglevel;
21 21
22    -	#ifndef SECTIONS_SHIFT
23    -	#define SECTIONS_SHIFT 0
24    -	#endif
25    -
26 22	/* The zonelists are simply reported, validation is manual. */
27 23	void __init mminit_verify_zonelist(void)
28 24	{
+1 -5
mm/mmap.c
···
3409 3409		return ((struct vm_special_mapping *)vma->vm_private_data)->name;
3410 3410	}
3411 3411
3412      -	static int special_mapping_mremap(struct vm_area_struct *new_vma,
3413      -					  unsigned long flags)
     3412 +	static int special_mapping_mremap(struct vm_area_struct *new_vma)
3414 3413	{
3415 3414		struct vm_special_mapping *sm = new_vma->vm_private_data;
3416      -
3417      -	if (flags & MREMAP_DONTUNMAP)
3418      -		return -EINVAL;
3419 3415
3420 3416		if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
3421 3417			return -EFAULT;
+3 -3
mm/mremap.c
··· 545 545 if (moved_len < old_len) { 546 546 err = -ENOMEM; 547 547 } else if (vma->vm_ops && vma->vm_ops->mremap) { 548 - err = vma->vm_ops->mremap(new_vma, flags); 548 + err = vma->vm_ops->mremap(new_vma); 549 549 } 550 550 551 551 if (unlikely(err)) { ··· 653 653 return ERR_PTR(-EINVAL); 654 654 } 655 655 656 - if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) || 657 - vma->vm_flags & VM_SHARED)) 656 + if ((flags & MREMAP_DONTUNMAP) && 657 + (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))) 658 658 return ERR_PTR(-EINVAL); 659 659 660 660 if (is_vm_hugetlb_page(vma))
+5 -1
mm/msync.c
··· 55 55 goto out; 56 56 /* 57 57 * If the interval [start,end) covers some unmapped address ranges, 58 - * just ignore them, but return -ENOMEM at the end. 58 + * just ignore them, but return -ENOMEM at the end. Besides, if the 59 + * flag is MS_ASYNC (w/o MS_INVALIDATE) the result would be -ENOMEM 60 + * anyway and there is nothing left to do, so return immediately. 59 61 */ 60 62 mmap_read_lock(mm); 61 63 vma = find_vma(mm, start); ··· 71 69 goto out_unlock; 72 70 /* Here start < vma->vm_end. */ 73 71 if (start < vma->vm_start) { 72 + if (flags == MS_ASYNC) 73 + goto out_unlock; 74 74 start = vma->vm_start; 75 75 if (start >= end) 76 76 goto out_unlock;
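The msync change above adds a shortcut: when the flags are exactly `MS_ASYNC` (no `MS_INVALIDATE`) and the range starts in an unmapped hole, the result can only be `-ENOMEM` and there is no flushing left to do, so the syscall returns immediately. A minimal user-space model of that control flow (the `mapped[]` array and `work_done` counter are illustrative, not kernel structures):

```c
/* Model: each element of mapped[] says whether that slice of the range
 * is backed by a VMA; work_done counts flushes that would be issued. */
#include <assert.h>
#include <errno.h>

#define MS_ASYNC 1

static int work_done;

static int msync_model(int flags, const int *mapped, int nvmas)
{
	int err = 0, i;

	for (i = 0; i < nvmas; i++) {
		if (!mapped[i]) {		/* unmapped hole */
			err = -ENOMEM;
			if (flags == MS_ASYNC)	/* new shortcut: result is fixed */
				return err;
			continue;
		}
		work_done++;			/* flush this mapped region */
	}
	return err;
}
```

With the shortcut, a pure `MS_ASYNC` call over a hole does no per-VMA work at all; other flag combinations still walk the rest of the range and report `-ENOMEM` at the end, as before.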
+3 -6
mm/page-writeback.c
··· 2722 2722 int test_clear_page_writeback(struct page *page) 2723 2723 { 2724 2724 struct address_space *mapping = page_mapping(page); 2725 - struct mem_cgroup *memcg; 2726 - struct lruvec *lruvec; 2727 2725 int ret; 2728 2726 2729 - memcg = lock_page_memcg(page); 2730 - lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page)); 2727 + lock_page_memcg(page); 2731 2728 if (mapping && mapping_use_writeback_tags(mapping)) { 2732 2729 struct inode *inode = mapping->host; 2733 2730 struct backing_dev_info *bdi = inode_to_bdi(inode); ··· 2752 2755 ret = TestClearPageWriteback(page); 2753 2756 } 2754 2757 if (ret) { 2755 - dec_lruvec_state(lruvec, NR_WRITEBACK); 2758 + dec_lruvec_page_state(page, NR_WRITEBACK); 2756 2759 dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); 2757 2760 inc_node_page_state(page, NR_WRITTEN); 2758 2761 } 2759 - __unlock_page_memcg(memcg); 2762 + unlock_page_memcg(page); 2760 2763 return ret; 2761 2764 } 2762 2765
+294 -82
mm/page_alloc.c
··· 72 72 #include <linux/padata.h> 73 73 #include <linux/khugepaged.h> 74 74 #include <linux/buffer_head.h> 75 - 76 75 #include <asm/sections.h> 77 76 #include <asm/tlbflush.h> 78 77 #include <asm/div64.h> ··· 106 107 * reporting). 107 108 */ 108 109 #define FPI_TO_TAIL ((__force fpi_t)BIT(1)) 110 + 111 + /* 112 + * Don't poison memory with KASAN (only for the tag-based modes). 113 + * During boot, all non-reserved memblock memory is exposed to page_alloc. 114 + * Poisoning all that memory lengthens boot time, especially on systems with 115 + * large amount of RAM. This flag is used to skip that poisoning. 116 + * This is only done for the tag-based KASAN modes, as those are able to 117 + * detect memory corruptions with the memory tags assigned by default. 118 + * All memory allocated normally after boot gets poisoned as usual. 119 + */ 120 + #define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2)) 109 121 110 122 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 111 123 static DEFINE_MUTEX(pcp_batch_high_lock); ··· 394 384 * on-demand allocation and then freed again before the deferred pages 395 385 * initialization is done, but this is not likely to happen. 
396 386 */ 397 - static inline void kasan_free_nondeferred_pages(struct page *page, int order) 387 + static inline void kasan_free_nondeferred_pages(struct page *page, int order, 388 + bool init, fpi_t fpi_flags) 398 389 { 399 - if (!static_branch_unlikely(&deferred_pages)) 400 - kasan_free_pages(page, order); 390 + if (static_branch_unlikely(&deferred_pages)) 391 + return; 392 + if (!IS_ENABLED(CONFIG_KASAN_GENERIC) && 393 + (fpi_flags & FPI_SKIP_KASAN_POISON)) 394 + return; 395 + kasan_free_pages(page, order, init); 401 396 } 402 397 403 398 /* Returns true if the struct page for the pfn is uninitialised */ ··· 453 438 return false; 454 439 } 455 440 #else 456 - #define kasan_free_nondeferred_pages(p, o) kasan_free_pages(p, o) 441 + static inline void kasan_free_nondeferred_pages(struct page *page, int order, 442 + bool init, fpi_t fpi_flags) 443 + { 444 + if (!IS_ENABLED(CONFIG_KASAN_GENERIC) && 445 + (fpi_flags & FPI_SKIP_KASAN_POISON)) 446 + return; 447 + kasan_free_pages(page, order, init); 448 + } 457 449 458 450 static inline bool early_page_uninitialised(unsigned long pfn) 459 451 { ··· 786 764 */ 787 765 void init_mem_debugging_and_hardening(void) 788 766 { 789 - if (_init_on_alloc_enabled_early) { 790 - if (page_poisoning_enabled()) 791 - pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " 792 - "will take precedence over init_on_alloc\n"); 793 - else 794 - static_branch_enable(&init_on_alloc); 795 - } 796 - if (_init_on_free_enabled_early) { 797 - if (page_poisoning_enabled()) 798 - pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " 799 - "will take precedence over init_on_free\n"); 800 - else 801 - static_branch_enable(&init_on_free); 802 - } 767 + bool page_poisoning_requested = false; 803 768 804 769 #ifdef CONFIG_PAGE_POISONING 805 770 /* ··· 795 786 */ 796 787 if (page_poisoning_enabled() || 797 788 (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) && 798 - debug_pagealloc_enabled())) 789 + debug_pagealloc_enabled())) { 799 790 
static_branch_enable(&_page_poisoning_enabled); 791 + page_poisoning_requested = true; 792 + } 800 793 #endif 794 + 795 + if (_init_on_alloc_enabled_early) { 796 + if (page_poisoning_requested) 797 + pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " 798 + "will take precedence over init_on_alloc\n"); 799 + else 800 + static_branch_enable(&init_on_alloc); 801 + } 802 + if (_init_on_free_enabled_early) { 803 + if (page_poisoning_requested) 804 + pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, " 805 + "will take precedence over init_on_free\n"); 806 + else 807 + static_branch_enable(&init_on_free); 808 + } 801 809 802 810 #ifdef CONFIG_DEBUG_PAGEALLOC 803 811 if (!debug_pagealloc_enabled()) ··· 1129 1103 if (unlikely((unsigned long)page->mapping | 1130 1104 page_ref_count(page) | 1131 1105 #ifdef CONFIG_MEMCG 1132 - (unsigned long)page_memcg(page) | 1106 + page->memcg_data | 1133 1107 #endif 1134 1108 (page->flags & check_flags))) 1135 1109 return false; ··· 1154 1128 bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"; 1155 1129 } 1156 1130 #ifdef CONFIG_MEMCG 1157 - if (unlikely(page_memcg(page))) 1131 + if (unlikely(page->memcg_data)) 1158 1132 bad_reason = "page still charged to cgroup"; 1159 1133 #endif 1160 1134 return bad_reason; ··· 1242 1216 } 1243 1217 1244 1218 static __always_inline bool free_pages_prepare(struct page *page, 1245 - unsigned int order, bool check_free) 1219 + unsigned int order, bool check_free, fpi_t fpi_flags) 1246 1220 { 1247 1221 int bad = 0; 1222 + bool init; 1248 1223 1249 1224 VM_BUG_ON_PAGE(PageTail(page), page); 1250 1225 ··· 1303 1276 debug_check_no_obj_freed(page_address(page), 1304 1277 PAGE_SIZE << order); 1305 1278 } 1306 - if (want_init_on_free()) 1307 - kernel_init_free_pages(page, 1 << order); 1308 1279 1309 1280 kernel_poison_pages(page, 1 << order); 1310 1281 1311 1282 /* 1283 + * As memory initialization might be integrated into KASAN, 1284 + * kasan_free_pages and kernel_init_free_pages must be 1285 + * kept 
together to avoid discrepancies in behavior. 1286 + * 1312 1287 * With hardware tag-based KASAN, memory tags must be set before the 1313 1288 * page becomes unavailable via debug_pagealloc or arch_free_page. 1314 1289 */ 1315 - kasan_free_nondeferred_pages(page, order); 1290 + init = want_init_on_free(); 1291 + if (init && !kasan_has_integrated_init()) 1292 + kernel_init_free_pages(page, 1 << order); 1293 + kasan_free_nondeferred_pages(page, order, init, fpi_flags); 1316 1294 1317 1295 /* 1318 1296 * arch_free_page() can make the page's contents inaccessible. s390 ··· 1339 1307 */ 1340 1308 static bool free_pcp_prepare(struct page *page) 1341 1309 { 1342 - return free_pages_prepare(page, 0, true); 1310 + return free_pages_prepare(page, 0, true, FPI_NONE); 1343 1311 } 1344 1312 1345 1313 static bool bulkfree_pcp_prepare(struct page *page) ··· 1359 1327 static bool free_pcp_prepare(struct page *page) 1360 1328 { 1361 1329 if (debug_pagealloc_enabled_static()) 1362 - return free_pages_prepare(page, 0, true); 1330 + return free_pages_prepare(page, 0, true, FPI_NONE); 1363 1331 else 1364 - return free_pages_prepare(page, 0, false); 1332 + return free_pages_prepare(page, 0, false, FPI_NONE); 1365 1333 } 1366 1334 1367 1335 static bool bulkfree_pcp_prepare(struct page *page) ··· 1569 1537 int migratetype; 1570 1538 unsigned long pfn = page_to_pfn(page); 1571 1539 1572 - if (!free_pages_prepare(page, order, true)) 1540 + if (!free_pages_prepare(page, order, true, fpi_flags)) 1573 1541 return; 1574 1542 1575 1543 migratetype = get_pfnblock_migratetype(page, pfn); ··· 1606 1574 * Bypass PCP and place fresh pages right to the tail, primarily 1607 1575 * relevant for memory onlining. 
1608 1576 */ 1609 - __free_pages_ok(page, order, FPI_TO_TAIL); 1577 + __free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON); 1610 1578 } 1611 1579 1612 1580 #ifdef CONFIG_NEED_MULTIPLE_NODES ··· 2324 2292 inline void post_alloc_hook(struct page *page, unsigned int order, 2325 2293 gfp_t gfp_flags) 2326 2294 { 2295 + bool init; 2296 + 2327 2297 set_page_private(page, 0); 2328 2298 set_page_refcounted(page); 2329 2299 2330 2300 arch_alloc_page(page, order); 2331 2301 debug_pagealloc_map_pages(page, 1 << order); 2332 - kasan_alloc_pages(page, order); 2333 - kernel_unpoison_pages(page, 1 << order); 2334 - set_page_owner(page, order, gfp_flags); 2335 2302 2336 - if (!want_init_on_free() && want_init_on_alloc(gfp_flags)) 2303 + /* 2304 + * Page unpoisoning must happen before memory initialization. 2305 + * Otherwise, the poison pattern will be overwritten for __GFP_ZERO 2306 + * allocations and the page unpoisoning code will complain. 2307 + */ 2308 + kernel_unpoison_pages(page, 1 << order); 2309 + 2310 + /* 2311 + * As memory initialization might be integrated into KASAN, 2312 + * kasan_alloc_pages and kernel_init_free_pages must be 2313 + * kept together to avoid discrepancies in behavior. 2314 + */ 2315 + init = !want_init_on_free() && want_init_on_alloc(gfp_flags); 2316 + kasan_alloc_pages(page, order, init); 2317 + if (init && !kasan_has_integrated_init()) 2337 2318 kernel_init_free_pages(page, 1 << order); 2319 + 2320 + set_page_owner(page, order, gfp_flags); 2338 2321 } 2339 2322 2340 2323 static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, ··· 2433 2386 * boundary. 
If alignment is required, use move_freepages_block() 2434 2387 */ 2435 2388 static int move_freepages(struct zone *zone, 2436 - struct page *start_page, struct page *end_page, 2389 + unsigned long start_pfn, unsigned long end_pfn, 2437 2390 int migratetype, int *num_movable) 2438 2391 { 2439 2392 struct page *page; 2393 + unsigned long pfn; 2440 2394 unsigned int order; 2441 2395 int pages_moved = 0; 2442 2396 2443 - for (page = start_page; page <= end_page;) { 2444 - if (!pfn_valid_within(page_to_pfn(page))) { 2445 - page++; 2397 + for (pfn = start_pfn; pfn <= end_pfn;) { 2398 + if (!pfn_valid_within(pfn)) { 2399 + pfn++; 2446 2400 continue; 2447 2401 } 2448 2402 2403 + page = pfn_to_page(pfn); 2449 2404 if (!PageBuddy(page)) { 2450 2405 /* 2451 2406 * We assume that pages that could be isolated for ··· 2457 2408 if (num_movable && 2458 2409 (PageLRU(page) || __PageMovable(page))) 2459 2410 (*num_movable)++; 2460 - 2461 - page++; 2411 + pfn++; 2462 2412 continue; 2463 2413 } 2464 2414 ··· 2467 2419 2468 2420 order = buddy_order(page); 2469 2421 move_to_free_list(page, zone, order, migratetype); 2470 - page += 1 << order; 2422 + pfn += 1 << order; 2471 2423 pages_moved += 1 << order; 2472 2424 } 2473 2425 ··· 2477 2429 int move_freepages_block(struct zone *zone, struct page *page, 2478 2430 int migratetype, int *num_movable) 2479 2431 { 2480 - unsigned long start_pfn, end_pfn; 2481 - struct page *start_page, *end_page; 2432 + unsigned long start_pfn, end_pfn, pfn; 2482 2433 2483 2434 if (num_movable) 2484 2435 *num_movable = 0; 2485 2436 2486 - start_pfn = page_to_pfn(page); 2487 - start_pfn = start_pfn & ~(pageblock_nr_pages-1); 2488 - start_page = pfn_to_page(start_pfn); 2489 - end_page = start_page + pageblock_nr_pages - 1; 2437 + pfn = page_to_pfn(page); 2438 + start_pfn = pfn & ~(pageblock_nr_pages - 1); 2490 2439 end_pfn = start_pfn + pageblock_nr_pages - 1; 2491 2440 2492 2441 /* Do not cross zone boundaries */ 2493 2442 if (!zone_spans_pfn(zone, start_pfn)) 
2494 - start_page = page; 2443 + start_pfn = pfn; 2495 2444 if (!zone_spans_pfn(zone, end_pfn)) 2496 2445 return 0; 2497 2446 2498 - return move_freepages(zone, start_page, end_page, migratetype, 2447 + return move_freepages(zone, start_pfn, end_pfn, migratetype, 2499 2448 num_movable); 2500 2449 } 2501 2450 ··· 2953 2908 unsigned long count, struct list_head *list, 2954 2909 int migratetype, unsigned int alloc_flags) 2955 2910 { 2956 - int i, alloced = 0; 2911 + int i, allocated = 0; 2957 2912 2958 2913 spin_lock(&zone->lock); 2959 2914 for (i = 0; i < count; ++i) { ··· 2976 2931 * pages are ordered properly. 2977 2932 */ 2978 2933 list_add_tail(&page->lru, list); 2979 - alloced++; 2934 + allocated++; 2980 2935 if (is_migrate_cma(get_pcppage_migratetype(page))) 2981 2936 __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 2982 2937 -(1 << order)); ··· 2985 2940 /* 2986 2941 * i pages were removed from the buddy list even if some leak due 2987 2942 * to check_pcp_refill failing so adjust NR_FREE_PAGES based 2988 - * on i. Do not confuse with 'alloced' which is the number of 2943 + * on i. Do not confuse with 'allocated' which is the number of 2989 2944 * pages added to the pcp list. 
2990 2945 */ 2991 2946 __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order)); 2992 2947 spin_unlock(&zone->lock); 2993 - return alloced; 2948 + return allocated; 2994 2949 } 2995 2950 2996 2951 #ifdef CONFIG_NUMA ··· 3460 3415 } 3461 3416 3462 3417 /* Remove page from the per-cpu list, caller must protect the list */ 3463 - static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype, 3418 + static inline 3419 + struct page *__rmqueue_pcplist(struct zone *zone, int migratetype, 3464 3420 unsigned int alloc_flags, 3465 3421 struct per_cpu_pages *pcp, 3466 3422 struct list_head *list) ··· 4967 4921 4968 4922 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, 4969 4923 int preferred_nid, nodemask_t *nodemask, 4970 - struct alloc_context *ac, gfp_t *alloc_mask, 4924 + struct alloc_context *ac, gfp_t *alloc_gfp, 4971 4925 unsigned int *alloc_flags) 4972 4926 { 4973 4927 ac->highest_zoneidx = gfp_zone(gfp_mask); ··· 4976 4930 ac->migratetype = gfp_migratetype(gfp_mask); 4977 4931 4978 4932 if (cpusets_enabled()) { 4979 - *alloc_mask |= __GFP_HARDWALL; 4933 + *alloc_gfp |= __GFP_HARDWALL; 4980 4934 /* 4981 4935 * When we are in the interrupt context, it is irrelevant 4982 4936 * to the current task context. It means that any node ok. ··· 5012 4966 } 5013 4967 5014 4968 /* 4969 + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array 4970 + * @gfp: GFP flags for the allocation 4971 + * @preferred_nid: The preferred NUMA node ID to allocate from 4972 + * @nodemask: Set of nodes to allocate from, may be NULL 4973 + * @nr_pages: The number of pages desired on the list or array 4974 + * @page_list: Optional list to store the allocated pages 4975 + * @page_array: Optional array to store the pages 4976 + * 4977 + * This is a batched version of the page allocator that attempts to 4978 + * allocate nr_pages quickly. 
Pages are added to page_list if page_list 4979 + * is not NULL, otherwise it is assumed that the page_array is valid. 4980 + * 4981 + * For lists, nr_pages is the number of pages that should be allocated. 4982 + * 4983 + * For arrays, only NULL elements are populated with pages and nr_pages 4984 + * is the maximum number of pages that will be stored in the array. 4985 + * 4986 + * Returns the number of pages on the list or array. 4987 + */ 4988 + unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, 4989 + nodemask_t *nodemask, int nr_pages, 4990 + struct list_head *page_list, 4991 + struct page **page_array) 4992 + { 4993 + struct page *page; 4994 + unsigned long flags; 4995 + struct zone *zone; 4996 + struct zoneref *z; 4997 + struct per_cpu_pages *pcp; 4998 + struct list_head *pcp_list; 4999 + struct alloc_context ac; 5000 + gfp_t alloc_gfp; 5001 + unsigned int alloc_flags = ALLOC_WMARK_LOW; 5002 + int nr_populated = 0; 5003 + 5004 + if (unlikely(nr_pages <= 0)) 5005 + return 0; 5006 + 5007 + /* 5008 + * Skip populated array elements to determine if any pages need 5009 + * to be allocated before disabling IRQs. 5010 + */ 5011 + while (page_array && page_array[nr_populated] && nr_populated < nr_pages) 5012 + nr_populated++; 5013 + 5014 + /* Use the single page allocator for one page. */ 5015 + if (nr_pages - nr_populated == 1) 5016 + goto failed; 5017 + 5018 + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */ 5019 + gfp &= gfp_allowed_mask; 5020 + alloc_gfp = gfp; 5021 + if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac, &alloc_gfp, &alloc_flags)) 5022 + return 0; 5023 + gfp = alloc_gfp; 5024 + 5025 + /* Find an allowed local zone that meets the low watermark. 
*/ 5026 + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) { 5027 + unsigned long mark; 5028 + 5029 + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) && 5030 + !__cpuset_zone_allowed(zone, gfp)) { 5031 + continue; 5032 + } 5033 + 5034 + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone && 5035 + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) { 5036 + goto failed; 5037 + } 5038 + 5039 + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages; 5040 + if (zone_watermark_fast(zone, 0, mark, 5041 + zonelist_zone_idx(ac.preferred_zoneref), 5042 + alloc_flags, gfp)) { 5043 + break; 5044 + } 5045 + } 5046 + 5047 + /* 5048 + * If there are no allowed local zones that meets the watermarks then 5049 + * try to allocate a single page and reclaim if necessary. 5050 + */ 5051 + if (unlikely(!zone)) 5052 + goto failed; 5053 + 5054 + /* Attempt the batch allocation */ 5055 + local_irq_save(flags); 5056 + pcp = &this_cpu_ptr(zone->pageset)->pcp; 5057 + pcp_list = &pcp->lists[ac.migratetype]; 5058 + 5059 + while (nr_populated < nr_pages) { 5060 + 5061 + /* Skip existing pages */ 5062 + if (page_array && page_array[nr_populated]) { 5063 + nr_populated++; 5064 + continue; 5065 + } 5066 + 5067 + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags, 5068 + pcp, pcp_list); 5069 + if (unlikely(!page)) { 5070 + /* Try and get at least one page */ 5071 + if (!nr_populated) 5072 + goto failed_irq; 5073 + break; 5074 + } 5075 + 5076 + /* 5077 + * Ideally this would be batched but the best way to do 5078 + * that cheaply is to first convert zone_statistics to 5079 + * be inaccurate per-cpu counter like vm_events to avoid 5080 + * a RMW cycle then do the accounting with IRQs enabled. 
5081 + */ 5082 + __count_zid_vm_events(PGALLOC, zone_idx(zone), 1); 5083 + zone_statistics(ac.preferred_zoneref->zone, zone); 5084 + 5085 + prep_new_page(page, 0, gfp, 0); 5086 + if (page_list) 5087 + list_add(&page->lru, page_list); 5088 + else 5089 + page_array[nr_populated] = page; 5090 + nr_populated++; 5091 + } 5092 + 5093 + local_irq_restore(flags); 5094 + 5095 + return nr_populated; 5096 + 5097 + failed_irq: 5098 + local_irq_restore(flags); 5099 + 5100 + failed: 5101 + page = __alloc_pages(gfp, 0, preferred_nid, nodemask); 5102 + if (page) { 5103 + if (page_list) 5104 + list_add(&page->lru, page_list); 5105 + else 5106 + page_array[nr_populated] = page; 5107 + nr_populated++; 5108 + } 5109 + 5110 + return nr_populated; 5111 + } 5112 + EXPORT_SYMBOL_GPL(__alloc_pages_bulk); 5113 + 5114 + /* 5015 5115 * This is the 'heart' of the zoned buddy allocator. 5016 5116 */ 5017 - struct page * 5018 - __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, 5117 + struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, 5019 5118 nodemask_t *nodemask) 5020 5119 { 5021 5120 struct page *page; 5022 5121 unsigned int alloc_flags = ALLOC_WMARK_LOW; 5023 - gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */ 5122 + gfp_t alloc_gfp; /* The gfp_t that was actually used for allocation */ 5024 5123 struct alloc_context ac = { }; 5025 5124 5026 5125 /* ··· 5173 4982 * so bail out early if the request is out of bound. 
5174 4983 */ 5175 4984 if (unlikely(order >= MAX_ORDER)) { 5176 - WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN)); 4985 + WARN_ON_ONCE(!(gfp & __GFP_NOWARN)); 5177 4986 return NULL; 5178 4987 } 5179 4988 5180 - gfp_mask &= gfp_allowed_mask; 5181 - alloc_mask = gfp_mask; 5182 - if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) 4989 + gfp &= gfp_allowed_mask; 4990 + alloc_gfp = gfp; 4991 + if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac, 4992 + &alloc_gfp, &alloc_flags)) 5183 4993 return NULL; 5184 4994 5185 4995 /* 5186 4996 * Forbid the first pass from falling back to types that fragment 5187 4997 * memory until all local zones are considered. 5188 4998 */ 5189 - alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask); 4999 + alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp); 5190 5000 5191 5001 /* First allocation attempt */ 5192 - page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac); 5002 + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); 5193 5003 if (likely(page)) 5194 5004 goto out; 5195 5005 ··· 5200 5008 * from a particular context which has been marked by 5201 5009 * memalloc_no{fs,io}_{save,restore}. 
5202 5010 */ 5203 - alloc_mask = current_gfp_context(gfp_mask); 5011 + alloc_gfp = current_gfp_context(gfp); 5204 5012 ac.spread_dirty_pages = false; 5205 5013 5206 5014 /* ··· 5209 5017 */ 5210 5018 ac.nodemask = nodemask; 5211 5019 5212 - page = __alloc_pages_slowpath(alloc_mask, order, &ac); 5020 + page = __alloc_pages_slowpath(alloc_gfp, order, &ac); 5213 5021 5214 5022 out: 5215 - if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page && 5216 - unlikely(__memcg_kmem_charge_page(page, gfp_mask, order) != 0)) { 5023 + if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) && page && 5024 + unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) { 5217 5025 __free_pages(page, order); 5218 5026 page = NULL; 5219 5027 } 5220 5028 5221 - trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype); 5029 + trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype); 5222 5030 5223 5031 return page; 5224 5032 } 5225 - EXPORT_SYMBOL(__alloc_pages_nodemask); 5033 + EXPORT_SYMBOL(__alloc_pages); 5226 5034 5227 5035 /* 5228 5036 * Common helper functions. 
Never use with __GFP_HIGHMEM because the returned ··· 7881 7689 return pages; 7882 7690 } 7883 7691 7884 - void __init mem_init_print_info(const char *str) 7692 + void __init mem_init_print_info(void) 7885 7693 { 7886 7694 unsigned long physpages, codesize, datasize, rosize, bss_size; 7887 7695 unsigned long init_code_size, init_data_size; ··· 7920 7728 #ifdef CONFIG_HIGHMEM 7921 7729 ", %luK highmem" 7922 7730 #endif 7923 - "%s%s)\n", 7731 + ")\n", 7924 7732 nr_free_pages() << (PAGE_SHIFT - 10), 7925 7733 physpages << (PAGE_SHIFT - 10), 7926 7734 codesize >> 10, datasize >> 10, rosize >> 10, 7927 7735 (init_data_size + init_code_size) >> 10, bss_size >> 10, 7928 7736 (physpages - totalram_pages() - totalcma_pages) << (PAGE_SHIFT - 10), 7929 - totalcma_pages << (PAGE_SHIFT - 10), 7737 + totalcma_pages << (PAGE_SHIFT - 10) 7930 7738 #ifdef CONFIG_HIGHMEM 7931 - totalhigh_pages() << (PAGE_SHIFT - 10), 7739 + , totalhigh_pages() << (PAGE_SHIFT - 10) 7932 7740 #endif 7933 - str ? ", " : "", str ? str : ""); 7741 + ); 7934 7742 } 7935 7743 7936 7744 /** ··· 8414 8222 void *table = NULL; 8415 8223 gfp_t gfp_flags; 8416 8224 bool virt; 8225 + bool huge; 8417 8226 8418 8227 /* allow the kernel cmdline to have a say */ 8419 8228 if (!numentries) { ··· 8482 8289 } else if (get_order(size) >= MAX_ORDER || hashdist) { 8483 8290 table = __vmalloc(size, gfp_flags); 8484 8291 virt = true; 8292 + huge = is_vm_area_hugepages(table); 8485 8293 } else { 8486 8294 /* 8487 8295 * If bucketsize is not a power-of-two, we may free ··· 8499 8305 8500 8306 pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n", 8501 8307 tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size, 8502 - virt ? "vmalloc" : "linear"); 8308 + virt ? (huge ? 
"vmalloc hugepage" : "vmalloc") : "linear"); 8503 8309 8504 8310 if (_hash_shift) 8505 8311 *_hash_shift = log2qty; ··· 8644 8450 pageblock_nr_pages)); 8645 8451 } 8646 8452 8453 + #if defined(CONFIG_DYNAMIC_DEBUG) || \ 8454 + (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE)) 8455 + /* Usage: See admin-guide/dynamic-debug-howto.rst */ 8456 + static void alloc_contig_dump_pages(struct list_head *page_list) 8457 + { 8458 + DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, "migrate failure"); 8459 + 8460 + if (DYNAMIC_DEBUG_BRANCH(descriptor)) { 8461 + struct page *page; 8462 + 8463 + dump_stack(); 8464 + list_for_each_entry(page, page_list, lru) 8465 + dump_page(page, "migration failure"); 8466 + } 8467 + } 8468 + #else 8469 + static inline void alloc_contig_dump_pages(struct list_head *page_list) 8470 + { 8471 + } 8472 + #endif 8473 + 8647 8474 /* [start, end) must belong to a single zone. */ 8648 8475 static int __alloc_contig_migrate_range(struct compact_control *cc, 8649 8476 unsigned long start, unsigned long end) ··· 8708 8493 NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE); 8709 8494 } 8710 8495 if (ret < 0) { 8496 + alloc_contig_dump_pages(&cc->migratepages); 8711 8497 putback_movable_pages(&cc->migratepages); 8712 8498 return ret; 8713 8499 } ··· 8818 8602 * isolated thus they won't get removed from buddy. 8819 8603 */ 8820 8604 8821 - lru_add_drain_all(); 8822 - 8823 8605 order = 0; 8824 8606 outer_start = start; 8825 8607 while (!PageBuddy(pfn_to_page(outer_start))) { ··· 8843 8629 8844 8630 /* Make sure the range is really isolated. */ 8845 8631 if (test_pages_isolated(outer_start, end, 0)) { 8846 - pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n", 8847 - __func__, outer_start, end); 8848 8632 ret = -EBUSY; 8849 8633 goto done; 8850 8634 }
+6 -2
mm/page_counter.c
··· 52 52 long new; 53 53 54 54 new = atomic_long_sub_return(nr_pages, &counter->usage); 55 - propagate_protected_usage(counter, new); 56 55 /* More uncharges than charges? */ 57 - WARN_ON_ONCE(new < 0); 56 + if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n", 57 + new, nr_pages)) { 58 + new = 0; 59 + atomic_long_set(&counter->usage, new); 60 + } 61 + propagate_protected_usage(counter, new); 58 62 } 59 63 60 64 /**
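The page_counter.c change turns a silent underflow into a warn-and-clamp: if an uncharge drives usage negative, the counter is reset to zero before protected usage is propagated, so one bad uncharge cannot leave the counter permanently skewed. A plain-`long` user-space model of the pattern (the kernel uses `atomic_long_t` and `WARN_ONCE`, both omitted here):

```c
/* Sketch of the underflow handling: clamp to 0 on "more uncharges than
 * charges" instead of only warning and propagating a negative value. */
#include <assert.h>

static long usage;

static long cancel(long nr_pages)
{
	long new = usage - nr_pages;

	if (new < 0)	/* underflow: the kernel WARNs here, then clamps */
		new = 0;
	usage = new;
	return new;	/* propagate_protected_usage() sees the clamped value */
}
```

Note the ordering change in the diff as well: propagation now happens after the clamp, so downstream protection math never observes a negative usage.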
+27 -41
mm/page_owner.c
··· 27 27 depot_stack_handle_t handle; 28 28 depot_stack_handle_t free_handle; 29 29 u64 ts_nsec; 30 + u64 free_ts_nsec; 30 31 pid_t pid; 31 32 }; 32 33 ··· 42 41 43 42 static int __init early_page_owner_param(char *buf) 44 43 { 45 - if (!buf) 46 - return -EINVAL; 47 - 48 - if (strcmp(buf, "on") == 0) 49 - page_owner_enabled = true; 50 - 51 - return 0; 44 + return kstrtobool(buf, &page_owner_enabled); 52 45 } 53 46 early_param("page_owner", early_page_owner_param); 54 47 ··· 98 103 return (void *)page_ext + page_owner_ops.offset; 99 104 } 100 105 101 - static inline bool check_recursive_alloc(unsigned long *entries, 102 - unsigned int nr_entries, 103 - unsigned long ip) 104 - { 105 - unsigned int i; 106 - 107 - for (i = 0; i < nr_entries; i++) { 108 - if (entries[i] == ip) 109 - return true; 110 - } 111 - return false; 112 - } 113 - 114 106 static noinline depot_stack_handle_t save_stack(gfp_t flags) 115 107 { 116 108 unsigned long entries[PAGE_OWNER_STACK_DEPTH]; 117 109 depot_stack_handle_t handle; 118 110 unsigned int nr_entries; 119 111 120 - nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2); 121 - 122 112 /* 123 - * We need to check recursion here because our request to 124 - * stackdepot could trigger memory allocation to save new 125 - * entry. New memory allocation would reach here and call 126 - * stack_depot_save_entries() again if we don't catch it. There is 127 - * still not enough memory in stackdepot so it would try to 128 - * allocate memory again and loop forever. 113 + * Avoid recursion. 
114 + * 115 + * Sometimes page metadata allocation tracking requires more 116 + * memory to be allocated: 117 + * - when new stack trace is saved to stack depot 118 + * - when backtrace itself is calculated (ia64) 129 119 */ 130 - if (check_recursive_alloc(entries, nr_entries, _RET_IP_)) 120 + if (current->in_page_owner) 131 121 return dummy_handle; 122 + current->in_page_owner = 1; 132 123 124 + nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2); 133 125 handle = stack_depot_save(entries, nr_entries, flags); 134 126 if (!handle) 135 127 handle = failure_handle; 136 128 129 + current->in_page_owner = 0; 137 130 return handle; 138 131 } 139 132 ··· 129 146 { 130 147 int i; 131 148 struct page_ext *page_ext; 132 - depot_stack_handle_t handle = 0; 149 + depot_stack_handle_t handle; 133 150 struct page_owner *page_owner; 134 - 135 - handle = save_stack(GFP_NOWAIT | __GFP_NOWARN); 151 + u64 free_ts_nsec = local_clock(); 136 152 137 153 page_ext = lookup_page_ext(page); 138 154 if (unlikely(!page_ext)) 139 155 return; 156 + 157 + handle = save_stack(GFP_NOWAIT | __GFP_NOWARN); 140 158 for (i = 0; i < (1 << order); i++) { 141 159 __clear_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags); 142 160 page_owner = get_page_owner(page_ext); 143 161 page_owner->free_handle = handle; 162 + page_owner->free_ts_nsec = free_ts_nsec; 144 163 page_ext = page_ext_next(page_ext); 145 164 } 146 165 } 147 166 148 - static inline void __set_page_owner_handle(struct page *page, 149 - struct page_ext *page_ext, depot_stack_handle_t handle, 150 - unsigned int order, gfp_t gfp_mask) 167 + static inline void __set_page_owner_handle(struct page_ext *page_ext, 168 + depot_stack_handle_t handle, 169 + unsigned int order, gfp_t gfp_mask) 151 170 { 152 171 struct page_owner *page_owner; 153 172 int i; ··· 179 194 return; 180 195 181 196 handle = save_stack(gfp_mask); 182 - __set_page_owner_handle(page, page_ext, handle, order, gfp_mask); 197 + __set_page_owner_handle(page_ext, handle, 
order, gfp_mask); 183 198 } 184 199 185 200 void __set_page_owner_migrate_reason(struct page *page, int reason) ··· 228 243 new_page_owner->handle = old_page_owner->handle; 229 244 new_page_owner->pid = old_page_owner->pid; 230 245 new_page_owner->ts_nsec = old_page_owner->ts_nsec; 246 + new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; 231 247 232 248 /* 233 249 * We don't clear the bit on the oldpage as it's going to be freed ··· 342 356 return -ENOMEM; 343 357 344 358 ret = snprintf(kbuf, count, 345 - "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns\n", 359 + "Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n", 346 360 page_owner->order, page_owner->gfp_mask, 347 361 &page_owner->gfp_mask, page_owner->pid, 348 - page_owner->ts_nsec); 362 + page_owner->ts_nsec, page_owner->free_ts_nsec); 349 363 350 364 if (ret >= count) 351 365 goto err; ··· 421 435 else 422 436 pr_alert("page_owner tracks the page as freed\n"); 423 437 424 - pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu\n", 438 + pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n", 425 439 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask, 426 - page_owner->pid, page_owner->ts_nsec); 440 + page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec); 427 441 428 442 handle = READ_ONCE(page_owner->handle); 429 443 if (!handle) { ··· 598 612 continue; 599 613 600 614 /* Found early allocated page */ 601 - __set_page_owner_handle(page, page_ext, early_handle, 615 + __set_page_owner_handle(page_ext, early_handle, 602 616 0, 0); 603 617 count++; 604 618 }
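The `save_stack()` rework above replaces scanning the captured backtrace for `_RET_IP_` with a per-task flag (`current->in_page_owner`): if saving a stack trace itself allocates memory, the nested call sees the flag set and bails out with the dummy handle instead of recursing. A minimal user-space sketch of that reentrancy guard, with a plain `int` standing in for the task_struct bit:

```c
/* Model of the guard: allocate() represents an allocation made while
 * tracking (e.g. stack depot growth), which calls back into the tracker. */
#include <assert.h>

static int in_page_owner;
static int depth_seen;

static int save_stack_model(int trigger_alloc);

static void allocate(int trigger_alloc)
{
	/* tracking this allocation re-enters the tracker */
	save_stack_model(trigger_alloc - 1);
}

static int save_stack_model(int trigger_alloc)
{
	if (in_page_owner)
		return 0;		/* dummy handle: recursion broken here */
	in_page_owner = 1;
	depth_seen++;
	if (trigger_alloc > 0)
		allocate(trigger_alloc);	/* would loop forever unguarded */
	in_page_owner = 0;
	return 1;
}
```

Compared with the old `check_recursive_alloc()` scan, the flag is O(1), works regardless of how deep the nested allocation path is, and (per the new comment) also covers architectures where computing the backtrace itself allocates.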
+4 -2
mm/page_poison.c
··· 2 2 #include <linux/kernel.h> 3 3 #include <linux/string.h> 4 4 #include <linux/mm.h> 5 + #include <linux/mmdebug.h> 5 6 #include <linux/highmem.h> 6 7 #include <linux/page_ext.h> 7 8 #include <linux/poison.h> ··· 46 45 return error && !(error & (error - 1)); 47 46 } 48 47 49 - static void check_poison_mem(unsigned char *mem, size_t bytes) 48 + static void check_poison_mem(struct page *page, unsigned char *mem, size_t bytes) 50 49 { 51 50 static DEFINE_RATELIMIT_STATE(ratelimit, 5 * HZ, 10); 52 51 unsigned char *start; ··· 71 70 print_hex_dump(KERN_ERR, "", DUMP_PREFIX_ADDRESS, 16, 1, start, 72 71 end - start + 1, 1); 73 72 dump_stack(); 73 + dump_page(page, "pagealloc: corrupted page details"); 74 74 } 75 75 76 76 static void unpoison_page(struct page *page) ··· 85 83 * that is freed to buddy. Thus no extra check is done to 86 84 * see if a page was poisoned. 87 85 */ 88 - check_poison_mem(kasan_reset_tag(addr), PAGE_SIZE); 86 + check_poison_mem(page, kasan_reset_tag(addr), PAGE_SIZE); 89 87 kasan_enable_current(); 90 88 kunmap_atomic(addr); 91 89 }
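The `check_poison_mem()` path shown here classifies corruption using the single-bit test visible in the surrounding context, `error && !(error & (error - 1))`. Pulled out on its own, assuming the same XOR-accumulated `error` byte:

```c
#include <assert.h>

/* nonzero power of two => exactly one bit differs from the poison pattern */
static int single_bit_flip(unsigned char error)
{
    return error && !(error & (error - 1));
}
```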
+4 -3
mm/percpu-vm.c
··· 8 8 * Chunks are mapped into vmalloc areas and populated page by page. 9 9 * This is the default chunk allocator. 10 10 */ 11 + #include "internal.h" 11 12 12 13 static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk, 13 14 unsigned int cpu, int page_idx) ··· 134 133 135 134 static void __pcpu_unmap_pages(unsigned long addr, int nr_pages) 136 135 { 137 - unmap_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT); 136 + vunmap_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT)); 138 137 } 139 138 140 139 /** ··· 193 192 static int __pcpu_map_pages(unsigned long addr, struct page **pages, 194 193 int nr_pages) 195 194 { 196 - return map_kernel_range_noflush(addr, nr_pages << PAGE_SHIFT, 197 - PAGE_KERNEL, pages); 195 + return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT), 196 + PAGE_KERNEL, pages, PAGE_SHIFT); 198 197 } 199 198 200 199 /**
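The percpu change is mechanical but easy to get wrong: the old `unmap_kernel_range_noflush()` took `(addr, size)` while `vunmap_range_noflush()` takes `(start, end)`, so each call site converts a page count into an exclusive end address. A sketch of that conversion, assuming 4 KiB pages:

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* (addr, nr_pages) -> exclusive end, as in __pcpu_unmap_pages() */
static unsigned long pcpu_range_end(unsigned long addr, int nr_pages)
{
    return addr + ((unsigned long)nr_pages << SKETCH_PAGE_SHIFT);
}
```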
+24 -19
mm/slab.c
··· 3216 3216 void *ptr; 3217 3217 int slab_node = numa_mem_id(); 3218 3218 struct obj_cgroup *objcg = NULL; 3219 + bool init = false; 3219 3220 3220 3221 flags &= gfp_allowed_mask; 3221 3222 cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); ··· 3255 3254 out: 3256 3255 local_irq_restore(save_flags); 3257 3256 ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller); 3258 - 3259 - if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr) 3260 - memset(ptr, 0, cachep->object_size); 3257 + init = slab_want_init_on_alloc(flags, cachep); 3261 3258 3262 3259 out_hooks: 3263 - slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr); 3260 + slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr, init); 3264 3261 return ptr; 3265 3262 } 3266 3263 ··· 3300 3301 unsigned long save_flags; 3301 3302 void *objp; 3302 3303 struct obj_cgroup *objcg = NULL; 3304 + bool init = false; 3303 3305 3304 3306 flags &= gfp_allowed_mask; 3305 3307 cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); ··· 3317 3317 local_irq_restore(save_flags); 3318 3318 objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); 3319 3319 prefetchw(objp); 3320 - 3321 - if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp) 3322 - memset(objp, 0, cachep->object_size); 3320 + init = slab_want_init_on_alloc(flags, cachep); 3323 3321 3324 3322 out: 3325 - slab_post_alloc_hook(cachep, objcg, flags, 1, &objp); 3323 + slab_post_alloc_hook(cachep, objcg, flags, 1, &objp, init); 3326 3324 return objp; 3327 3325 } 3328 3326 ··· 3425 3427 static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp, 3426 3428 unsigned long caller) 3427 3429 { 3430 + bool init; 3431 + 3428 3432 if (is_kfence_address(objp)) { 3429 3433 kmemleak_free_recursive(objp, cachep->flags); 3430 3434 __kfence_free(objp); 3431 3435 return; 3432 3436 } 3433 3437 3434 - if (unlikely(slab_want_init_on_free(cachep))) 3438 + /* 3439 + * As memory initialization might be integrated into KASAN, 3440 + * 
kasan_slab_free and initialization memset must be 3441 + * kept together to avoid discrepancies in behavior. 3442 + */ 3443 + init = slab_want_init_on_free(cachep); 3444 + if (init && !kasan_has_integrated_init()) 3435 3445 memset(objp, 0, cachep->object_size); 3436 - 3437 - /* Put the object into the quarantine, don't touch it for now. */ 3438 - if (kasan_slab_free(cachep, objp)) 3446 + /* KASAN might put objp into memory quarantine, delaying its reuse. */ 3447 + if (kasan_slab_free(cachep, objp, init)) 3439 3448 return; 3440 3449 3441 3450 /* Use KCSAN to help debug racy use-after-free. */ ··· 3547 3542 3548 3543 cache_alloc_debugcheck_after_bulk(s, flags, size, p, _RET_IP_); 3549 3544 3550 - /* Clear memory outside IRQ disabled section */ 3551 - if (unlikely(slab_want_init_on_alloc(flags, s))) 3552 - for (i = 0; i < size; i++) 3553 - memset(p[i], 0, s->object_size); 3554 - 3555 - slab_post_alloc_hook(s, objcg, flags, size, p); 3545 + /* 3546 + * memcg and kmem_cache debug support and memory initialization. 3547 + * Done outside of the IRQ disabled section. 3548 + */ 3549 + slab_post_alloc_hook(s, objcg, flags, size, p, 3550 + slab_want_init_on_alloc(flags, s)); 3556 3551 /* FIXME: Trace call missing. Christoph would like a bulk variant */ 3557 3552 return size; 3558 3553 error: 3559 3554 local_irq_enable(); 3560 3555 cache_alloc_debugcheck_after_bulk(s, flags, i, p, _RET_IP_); 3561 - slab_post_alloc_hook(s, objcg, flags, i, p); 3556 + slab_post_alloc_hook(s, objcg, flags, i, p, false); 3562 3557 __kmem_cache_free_bulk(s, i, p); 3563 3558 return 0; 3564 3559 }
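The recurring condition in these hunks, `init && !kasan_has_integrated_init()`, decides whether the allocator performs the `init_on_alloc`/`init_on_free` memset itself or leaves it to KASAN, which in hardware tag-based mode can initialize memory while tagging it. The decision table, sketched with stand-in flags:

```c
#include <assert.h>

/* who zeroes the object: 1 = allocator memsets, 0 = KASAN does (or nobody) */
static int allocator_memsets(int want_init, int kasan_integrated_init)
{
    return want_init && !kasan_integrated_init;
}
```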
+13 -4
mm/slab.h
··· 506 506 } 507 507 508 508 static inline void slab_post_alloc_hook(struct kmem_cache *s, 509 - struct obj_cgroup *objcg, 510 - gfp_t flags, size_t size, void **p) 509 + struct obj_cgroup *objcg, gfp_t flags, 510 + size_t size, void **p, bool init) 511 511 { 512 512 size_t i; 513 513 514 514 flags &= gfp_allowed_mask; 515 + 516 + /* 517 + * As memory initialization might be integrated into KASAN, 518 + * kasan_slab_alloc and initialization memset must be 519 + * kept together to avoid discrepancies in behavior. 520 + * 521 + * As p[i] might get tagged, memset and kmemleak hook come after KASAN. 522 + */ 515 523 for (i = 0; i < size; i++) { 516 - p[i] = kasan_slab_alloc(s, p[i], flags); 517 - /* As p[i] might get tagged, call kmemleak hook after KASAN. */ 524 + p[i] = kasan_slab_alloc(s, p[i], flags, init); 525 + if (p[i] && init && !kasan_has_integrated_init()) 526 + memset(p[i], 0, s->object_size); 518 527 kmemleak_alloc_recursive(p[i], s->object_size, 1, 519 528 s->flags, flags); 520 529 }
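The comment added to `slab_post_alloc_hook()` pins down an ordering requirement: KASAN may return a re-tagged pointer, so the init memset and the kmemleak hook must use the pointer KASAN handed back. A toy per-object sequence with stub hooks (all stubs are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* records the order the three per-object steps run in */
static const char *calls[3];
static int ncalls;

static void *kasan_alloc_stub(void *p)
{
    calls[ncalls++] = "kasan";
    return p;                     /* real KASAN may return a re-tagged pointer */
}

static void post_alloc_one(void *p, size_t size, int init)
{
    p = kasan_alloc_stub(p);      /* 1: tag first */
    if (p && init) {
        calls[ncalls++] = "memset";
        memset(p, 0, size);       /* 2: init via the (possibly re-tagged) pointer */
    }
    calls[ncalls++] = "kmemleak"; /* 3: leak tracking sees the final pointer */
}
```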
+8
mm/slab_common.c
··· 71 71 return 1; 72 72 } 73 73 74 + static int __init setup_slab_merge(char *str) 75 + { 76 + slab_nomerge = false; 77 + return 1; 78 + } 79 + 74 80 #ifdef CONFIG_SLUB 75 81 __setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0); 82 + __setup_param("slub_merge", slub_merge, setup_slab_merge, 0); 76 83 #endif 77 84 78 85 __setup("slab_nomerge", setup_slab_nomerge); 86 + __setup("slab_merge", setup_slab_merge); 79 87 80 88 /* 81 89 * Determine the size of a slab object
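`setup_slab_merge()` is the mirror image of `setup_slab_nomerge()`: each handler just writes one boolean, and the `slub_*` spellings reuse the same handlers via `__setup_param`. A userspace sketch of the resulting switch behavior (the string dispatch is illustrative; the kernel wires this up through `__setup`):

```c
#include <assert.h>
#include <string.h>

static int slab_nomerge;  /* default mirrors !CONFIG_SLAB_MERGE_DEFAULT */

static int handle_slab_param(const char *param)
{
    /* slub_* spellings alias the slab_* handlers, as with __setup_param */
    if (!strcmp(param, "slab_nomerge") || !strcmp(param, "slub_nomerge"))
        slab_nomerge = 1;
    else if (!strcmp(param, "slab_merge") || !strcmp(param, "slub_merge"))
        slab_nomerge = 0;
    else
        return 0;         /* not one of ours */
    return 1;
}
```

Whichever switch appears last on the command line wins, since both simply overwrite the same flag.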
+48 -39
mm/slub.c
··· 3 3 * SLUB: A slab allocator that limits cache line use instead of queuing 4 4 * objects in per cpu and per node lists. 5 5 * 6 - * The allocator synchronizes using per slab locks or atomic operatios 6 + * The allocator synchronizes using per slab locks or atomic operations 7 7 * and only uses a centralized lock to manage a pool of partial slabs. 8 8 * 9 9 * (C) 2007 SGI, Christoph Lameter ··· 160 160 #undef SLUB_DEBUG_CMPXCHG 161 161 162 162 /* 163 - * Mininum number of partial slabs. These will be left on the partial 163 + * Minimum number of partial slabs. These will be left on the partial 164 164 * lists even if they are empty. kmem_cache_shrink may reclaim them. 165 165 */ 166 166 #define MIN_PARTIAL 5 ··· 833 833 * 834 834 * A. Free pointer (if we cannot overwrite object on free) 835 835 * B. Tracking data for SLAB_STORE_USER 836 - * C. Padding to reach required alignment boundary or at mininum 836 + * C. Padding to reach required alignment boundary or at minimum 837 837 * one word if debugging is on to be able to detect writes 838 838 * before the word boundary. 839 839 * ··· 1533 1533 kasan_kfree_large(x); 1534 1534 } 1535 1535 1536 - static __always_inline bool slab_free_hook(struct kmem_cache *s, void *x) 1536 + static __always_inline bool slab_free_hook(struct kmem_cache *s, 1537 + void *x, bool init) 1537 1538 { 1538 1539 kmemleak_free_recursive(x, s->flags); 1539 1540 ··· 1560 1559 __kcsan_check_access(x, s->object_size, 1561 1560 KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT); 1562 1561 1563 - /* KASAN might put x into memory quarantine, delaying its reuse */ 1564 - return kasan_slab_free(s, x); 1562 + /* 1563 + * As memory initialization might be integrated into KASAN, 1564 + * kasan_slab_free and initialization memset's must be 1565 + * kept together to avoid discrepancies in behavior. 1566 + * 1567 + * The initialization memset's clear the object and the metadata, 1568 + * but don't touch the SLAB redzone. 
1569 + */ 1570 + if (init) { 1571 + int rsize; 1572 + 1573 + if (!kasan_has_integrated_init()) 1574 + memset(kasan_reset_tag(x), 0, s->object_size); 1575 + rsize = (s->flags & SLAB_RED_ZONE) ? s->red_left_pad : 0; 1576 + memset((char *)kasan_reset_tag(x) + s->inuse, 0, 1577 + s->size - s->inuse - rsize); 1578 + } 1579 + /* KASAN might put x into memory quarantine, delaying its reuse. */ 1580 + return kasan_slab_free(s, x, init); 1565 1581 } 1566 1582 1567 1583 static inline bool slab_free_freelist_hook(struct kmem_cache *s, ··· 1588 1570 void *object; 1589 1571 void *next = *head; 1590 1572 void *old_tail = *tail ? *tail : *head; 1591 - int rsize; 1592 1573 1593 1574 if (is_kfence_address(next)) { 1594 - slab_free_hook(s, next); 1575 + slab_free_hook(s, next, false); 1595 1576 return true; 1596 1577 } 1597 1578 ··· 1602 1585 object = next; 1603 1586 next = get_freepointer(s, object); 1604 1587 1605 - if (slab_want_init_on_free(s)) { 1606 - /* 1607 - * Clear the object and the metadata, but don't touch 1608 - * the redzone. 1609 - */ 1610 - memset(kasan_reset_tag(object), 0, s->object_size); 1611 - rsize = (s->flags & SLAB_RED_ZONE) ? 
s->red_left_pad 1612 - : 0; 1613 - memset((char *)kasan_reset_tag(object) + s->inuse, 0, 1614 - s->size - s->inuse - rsize); 1615 - 1616 - } 1617 1588 /* If object's reuse doesn't have to be delayed */ 1618 - if (!slab_free_hook(s, object)) { 1589 + if (!slab_free_hook(s, object, slab_want_init_on_free(s))) { 1619 1590 /* Move object to the new freelist */ 1620 1591 set_freepointer(s, object, *head); 1621 1592 *head = object; ··· 2828 2823 struct page *page; 2829 2824 unsigned long tid; 2830 2825 struct obj_cgroup *objcg = NULL; 2826 + bool init = false; 2831 2827 2832 2828 s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags); 2833 2829 if (!s) ··· 2906 2900 } 2907 2901 2908 2902 maybe_wipe_obj_freeptr(s, object); 2909 - 2910 - if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object) 2911 - memset(kasan_reset_tag(object), 0, s->object_size); 2903 + init = slab_want_init_on_alloc(gfpflags, s); 2912 2904 2913 2905 out: 2914 - slab_post_alloc_hook(s, objcg, gfpflags, 1, &object); 2906 + slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init); 2915 2907 2916 2908 return object; 2917 2909 } ··· 3241 3237 } 3242 3238 3243 3239 if (is_kfence_address(object)) { 3244 - slab_free_hook(df->s, object); 3240 + slab_free_hook(df->s, object, false); 3245 3241 __kfence_free(object); 3246 3242 p[size] = NULL; /* mark object processed */ 3247 3243 return size; ··· 3361 3357 c->tid = next_tid(c->tid); 3362 3358 local_irq_enable(); 3363 3359 3364 - /* Clear memory outside IRQ disabled fastpath loop */ 3365 - if (unlikely(slab_want_init_on_alloc(flags, s))) { 3366 - int j; 3367 - 3368 - for (j = 0; j < i; j++) 3369 - memset(kasan_reset_tag(p[j]), 0, s->object_size); 3370 - } 3371 - 3372 - /* memcg and kmem_cache debug support */ 3373 - slab_post_alloc_hook(s, objcg, flags, size, p); 3360 + /* 3361 + * memcg and kmem_cache debug support and memory initialization. 3362 + * Done outside of the IRQ disabled fastpath loop. 
3363 + */ 3364 + slab_post_alloc_hook(s, objcg, flags, size, p, 3365 + slab_want_init_on_alloc(flags, s)); 3374 3366 return i; 3375 3367 error: 3376 3368 local_irq_enable(); 3377 - slab_post_alloc_hook(s, objcg, flags, i, p); 3369 + slab_post_alloc_hook(s, objcg, flags, i, p, false); 3378 3370 __kmem_cache_free_bulk(s, i, p); 3379 3371 return 0; 3380 3372 } ··· 3422 3422 * 3423 3423 * Higher order allocations also allow the placement of more objects in a 3424 3424 * slab and thereby reduce object handling overhead. If the user has 3425 - * requested a higher mininum order then we start with that one instead of 3425 + * requested a higher minimum order then we start with that one instead of 3426 3426 * the smallest order which will fit the object. 3427 3427 */ 3428 3428 static inline unsigned int slab_order(unsigned int size, ··· 3580 3580 init_object(kmem_cache_node, n, SLUB_RED_ACTIVE); 3581 3581 init_tracking(kmem_cache_node, n); 3582 3582 #endif 3583 - n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL); 3583 + n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false); 3584 3584 page->freelist = get_freepointer(kmem_cache_node, n); 3585 3585 page->inuse = 1; 3586 3586 page->frozen = 0; ··· 3828 3828 3829 3829 static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) 3830 3830 { 3831 + #ifdef CONFIG_SLUB_DEBUG 3832 + /* 3833 + * If no slub_debug was enabled globally, the static key is not yet 3834 + * enabled by setup_slub_debug(). Enable it if the cache is being 3835 + * created with any of the debugging flags passed explicitly. 3836 + */ 3837 + if (flags & SLAB_DEBUG_FLAGS) 3838 + static_branch_enable(&slub_debug_enabled); 3839 + #endif 3831 3840 s->flags = kmem_cache_flags(s->size, flags, s->name); 3832 3841 #ifdef CONFIG_SLAB_FREELIST_HARDENED 3833 3842 s->random = get_random_long();
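The `init_on_free` wipe moved into `slab_free_hook()` clears two regions: the object payload, and the metadata area between `s->inuse` and the trailing redzone (whose width the code takes from `s->red_left_pad`). The length arithmetic for the second memset, with illustrative layout numbers:

```c
#include <assert.h>
#include <stddef.h>

struct layout {
    size_t object_size;   /* payload bytes cleared by the first memset */
    size_t inuse;         /* where free pointer/tracking metadata begins */
    size_t size;          /* full per-object footprint */
    size_t red_left_pad;  /* stands in for the trailing redzone width */
    int red_zone;         /* SLAB_RED_ZONE set? */
};

/* length of the second memset: metadata, excluding the redzone */
static size_t metadata_wipe_len(const struct layout *s)
{
    size_t rsize = s->red_zone ? s->red_left_pad : 0;
    return s->size - s->inuse - rsize;
}
```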
+1
mm/sparse.c
··· 547 547 pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.", 548 548 __func__, nid); 549 549 pnum_begin = pnum; 550 + sparse_buffer_fini(); 550 551 goto failed; 551 552 } 552 553 check_usemap_section_nr(nid, usage);
+6 -7
mm/swap_state.c
··· 497 497 __SetPageLocked(page); 498 498 __SetPageSwapBacked(page); 499 499 500 - /* May fail (-ENOMEM) if XArray node allocation failed. */ 501 - if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) { 502 - put_swap_page(page, entry); 500 + if (mem_cgroup_swapin_charge_page(page, NULL, gfp_mask, entry)) 503 501 goto fail_unlock; 504 - } 505 502 506 - if (mem_cgroup_charge(page, NULL, gfp_mask)) { 507 - delete_from_swap_cache(page); 503 + /* May fail (-ENOMEM) if XArray node allocation failed. */ 504 + if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) 508 505 goto fail_unlock; 509 - } 506 + 507 + mem_cgroup_swapin_uncharge_swap(entry); 510 508 511 509 if (shadow) 512 510 workingset_refault(page, shadow); ··· 515 517 return page; 516 518 517 519 fail_unlock: 520 + put_swap_page(page, entry); 518 521 unlock_page(page); 519 522 put_page(page); 520 523 return NULL;
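Charging the memcg before `add_to_swap_cache()` lets both failure points fall through to a single unwind path (`put_swap_page()`, unlock, put) instead of the old `delete_from_swap_cache()` special case. The control-flow shape, with the two steps stubbed as fault-injection flags:

```c
#include <assert.h>

static int fail_charge, fail_cache;  /* fault injection for the two steps */
static int unwinds;                  /* times the shared fail path ran */

static int swapin_alloc(void)
{
    if (fail_charge)                 /* mem_cgroup_swapin_charge_page() fails */
        goto fail_unlock;
    if (fail_cache)                  /* add_to_swap_cache() fails */
        goto fail_unlock;
    return 0;
fail_unlock:
    unwinds++;                       /* put_swap_page(); unlock; put_page */
    return -1;
}
```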
-10
mm/util.c
··· 711 711 } 712 712 EXPORT_SYMBOL(page_mapping); 713 713 714 - /* 715 - * For file cache pages, return the address_space, otherwise return NULL 716 - */ 717 - struct address_space *page_mapping_file(struct page *page) 718 - { 719 - if (unlikely(PageSwapCache(page))) 720 - return NULL; 721 - return page_mapping(page); 722 - } 723 - 724 714 /* Slow path of page_mapcount() for compound pages */ 725 715 int __page_mapcount(struct page *page) 726 716 {
+503 -157
mm/vmalloc.c
··· 34 34 #include <linux/bitops.h> 35 35 #include <linux/rbtree_augmented.h> 36 36 #include <linux/overflow.h> 37 - 37 + #include <linux/pgtable.h> 38 38 #include <linux/uaccess.h> 39 39 #include <asm/tlbflush.h> 40 40 #include <asm/shmparam.h> 41 41 42 42 #include "internal.h" 43 43 #include "pgalloc-track.h" 44 + 45 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC 46 + static bool __ro_after_init vmap_allow_huge = true; 47 + 48 + static int __init set_nohugevmalloc(char *str) 49 + { 50 + vmap_allow_huge = false; 51 + return 0; 52 + } 53 + early_param("nohugevmalloc", set_nohugevmalloc); 54 + #else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ 55 + static const bool vmap_allow_huge = false; 56 + #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ 44 57 45 58 bool is_vmalloc_addr(const void *x) 46 59 { ··· 81 68 } 82 69 83 70 /*** Page table manipulation functions ***/ 71 + static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, 72 + phys_addr_t phys_addr, pgprot_t prot, 73 + pgtbl_mod_mask *mask) 74 + { 75 + pte_t *pte; 76 + u64 pfn; 77 + 78 + pfn = phys_addr >> PAGE_SHIFT; 79 + pte = pte_alloc_kernel_track(pmd, addr, mask); 80 + if (!pte) 81 + return -ENOMEM; 82 + do { 83 + BUG_ON(!pte_none(*pte)); 84 + set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); 85 + pfn++; 86 + } while (pte++, addr += PAGE_SIZE, addr != end); 87 + *mask |= PGTBL_PTE_MODIFIED; 88 + return 0; 89 + } 90 + 91 + static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, 92 + phys_addr_t phys_addr, pgprot_t prot, 93 + unsigned int max_page_shift) 94 + { 95 + if (max_page_shift < PMD_SHIFT) 96 + return 0; 97 + 98 + if (!arch_vmap_pmd_supported(prot)) 99 + return 0; 100 + 101 + if ((end - addr) != PMD_SIZE) 102 + return 0; 103 + 104 + if (!IS_ALIGNED(addr, PMD_SIZE)) 105 + return 0; 106 + 107 + if (!IS_ALIGNED(phys_addr, PMD_SIZE)) 108 + return 0; 109 + 110 + if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) 111 + return 0; 112 + 113 + return pmd_set_huge(pmd, phys_addr, 
prot); 114 + } 115 + 116 + static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, 117 + phys_addr_t phys_addr, pgprot_t prot, 118 + unsigned int max_page_shift, pgtbl_mod_mask *mask) 119 + { 120 + pmd_t *pmd; 121 + unsigned long next; 122 + 123 + pmd = pmd_alloc_track(&init_mm, pud, addr, mask); 124 + if (!pmd) 125 + return -ENOMEM; 126 + do { 127 + next = pmd_addr_end(addr, end); 128 + 129 + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, 130 + max_page_shift)) { 131 + *mask |= PGTBL_PMD_MODIFIED; 132 + continue; 133 + } 134 + 135 + if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) 136 + return -ENOMEM; 137 + } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); 138 + return 0; 139 + } 140 + 141 + static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end, 142 + phys_addr_t phys_addr, pgprot_t prot, 143 + unsigned int max_page_shift) 144 + { 145 + if (max_page_shift < PUD_SHIFT) 146 + return 0; 147 + 148 + if (!arch_vmap_pud_supported(prot)) 149 + return 0; 150 + 151 + if ((end - addr) != PUD_SIZE) 152 + return 0; 153 + 154 + if (!IS_ALIGNED(addr, PUD_SIZE)) 155 + return 0; 156 + 157 + if (!IS_ALIGNED(phys_addr, PUD_SIZE)) 158 + return 0; 159 + 160 + if (pud_present(*pud) && !pud_free_pmd_page(pud, addr)) 161 + return 0; 162 + 163 + return pud_set_huge(pud, phys_addr, prot); 164 + } 165 + 166 + static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, 167 + phys_addr_t phys_addr, pgprot_t prot, 168 + unsigned int max_page_shift, pgtbl_mod_mask *mask) 169 + { 170 + pud_t *pud; 171 + unsigned long next; 172 + 173 + pud = pud_alloc_track(&init_mm, p4d, addr, mask); 174 + if (!pud) 175 + return -ENOMEM; 176 + do { 177 + next = pud_addr_end(addr, end); 178 + 179 + if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, 180 + max_page_shift)) { 181 + *mask |= PGTBL_PUD_MODIFIED; 182 + continue; 183 + } 184 + 185 + if (vmap_pmd_range(pud, addr, next, phys_addr, prot, 186 + 
max_page_shift, mask)) 187 + return -ENOMEM; 188 + } while (pud++, phys_addr += (next - addr), addr = next, addr != end); 189 + return 0; 190 + } 191 + 192 + static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end, 193 + phys_addr_t phys_addr, pgprot_t prot, 194 + unsigned int max_page_shift) 195 + { 196 + if (max_page_shift < P4D_SHIFT) 197 + return 0; 198 + 199 + if (!arch_vmap_p4d_supported(prot)) 200 + return 0; 201 + 202 + if ((end - addr) != P4D_SIZE) 203 + return 0; 204 + 205 + if (!IS_ALIGNED(addr, P4D_SIZE)) 206 + return 0; 207 + 208 + if (!IS_ALIGNED(phys_addr, P4D_SIZE)) 209 + return 0; 210 + 211 + if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr)) 212 + return 0; 213 + 214 + return p4d_set_huge(p4d, phys_addr, prot); 215 + } 216 + 217 + static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, 218 + phys_addr_t phys_addr, pgprot_t prot, 219 + unsigned int max_page_shift, pgtbl_mod_mask *mask) 220 + { 221 + p4d_t *p4d; 222 + unsigned long next; 223 + 224 + p4d = p4d_alloc_track(&init_mm, pgd, addr, mask); 225 + if (!p4d) 226 + return -ENOMEM; 227 + do { 228 + next = p4d_addr_end(addr, end); 229 + 230 + if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, 231 + max_page_shift)) { 232 + *mask |= PGTBL_P4D_MODIFIED; 233 + continue; 234 + } 235 + 236 + if (vmap_pud_range(p4d, addr, next, phys_addr, prot, 237 + max_page_shift, mask)) 238 + return -ENOMEM; 239 + } while (p4d++, phys_addr += (next - addr), addr = next, addr != end); 240 + return 0; 241 + } 242 + 243 + static int vmap_range_noflush(unsigned long addr, unsigned long end, 244 + phys_addr_t phys_addr, pgprot_t prot, 245 + unsigned int max_page_shift) 246 + { 247 + pgd_t *pgd; 248 + unsigned long start; 249 + unsigned long next; 250 + int err; 251 + pgtbl_mod_mask mask = 0; 252 + 253 + might_sleep(); 254 + BUG_ON(addr >= end); 255 + 256 + start = addr; 257 + pgd = pgd_offset_k(addr); 258 + do { 259 + next = pgd_addr_end(addr, end); 260 + err = 
vmap_p4d_range(pgd, addr, next, phys_addr, prot, 261 + max_page_shift, &mask); 262 + if (err) 263 + break; 264 + } while (pgd++, phys_addr += (next - addr), addr = next, addr != end); 265 + 266 + if (mask & ARCH_PAGE_TABLE_SYNC_MASK) 267 + arch_sync_kernel_mappings(start, end); 268 + 269 + return err; 270 + } 271 + 272 + int vmap_range(unsigned long addr, unsigned long end, 273 + phys_addr_t phys_addr, pgprot_t prot, 274 + unsigned int max_page_shift) 275 + { 276 + int err; 277 + 278 + err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift); 279 + flush_cache_vmap(addr, end); 280 + 281 + return err; 282 + } 84 283 85 284 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, 86 285 pgtbl_mod_mask *mask) ··· 378 153 } while (p4d++, addr = next, addr != end); 379 154 } 380 155 381 - /** 382 - * unmap_kernel_range_noflush - unmap kernel VM area 383 - * @start: start of the VM area to unmap 384 - * @size: size of the VM area to unmap 156 + /* 157 + * vunmap_range_noflush is similar to vunmap_range, but does not 158 + * flush caches or TLBs. 385 159 * 386 - * Unmap PFN_UP(@size) pages at @addr. The VM area @addr and @size specify 387 - * should have been allocated using get_vm_area() and its friends. 160 + * The caller is responsible for calling flush_cache_vmap() before calling 161 + * this function, and flush_tlb_kernel_range after it has returned 162 + * successfully (and before the addresses are expected to cause a page fault 163 + * or be re-mapped for something else, if TLB flushes are being delayed or 164 + * coalesced). 388 165 * 389 - * NOTE: 390 - * This function does NOT do any cache flushing. The caller is responsible 391 - * for calling flush_cache_vunmap() on to-be-mapped areas before calling this 392 - * function and flush_tlb_kernel_range() after. 166 + * This is an internal function only. Do not use outside mm/. 
393 167 */ 394 - void unmap_kernel_range_noflush(unsigned long start, unsigned long size) 168 + void vunmap_range_noflush(unsigned long start, unsigned long end) 395 169 { 396 - unsigned long end = start + size; 397 170 unsigned long next; 398 171 pgd_t *pgd; 399 172 unsigned long addr = start; ··· 412 189 arch_sync_kernel_mappings(start, end); 413 190 } 414 191 415 - static int vmap_pte_range(pmd_t *pmd, unsigned long addr, 192 + /** 193 + * vunmap_range - unmap kernel virtual addresses 194 + * @addr: start of the VM area to unmap 195 + * @end: end of the VM area to unmap (non-inclusive) 196 + * 197 + * Clears any present PTEs in the virtual address range, flushes TLBs and 198 + * caches. Any subsequent access to the address before it has been re-mapped 199 + * is a kernel bug. 200 + */ 201 + void vunmap_range(unsigned long addr, unsigned long end) 202 + { 203 + flush_cache_vunmap(addr, end); 204 + vunmap_range_noflush(addr, end); 205 + flush_tlb_kernel_range(addr, end); 206 + } 207 + 208 + static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, 416 209 unsigned long end, pgprot_t prot, struct page **pages, int *nr, 417 210 pgtbl_mod_mask *mask) 418 211 { ··· 456 217 return 0; 457 218 } 458 219 459 - static int vmap_pmd_range(pud_t *pud, unsigned long addr, 220 + static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, 460 221 unsigned long end, pgprot_t prot, struct page **pages, int *nr, 461 222 pgtbl_mod_mask *mask) 462 223 { ··· 468 229 return -ENOMEM; 469 230 do { 470 231 next = pmd_addr_end(addr, end); 471 - if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask)) 232 + if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask)) 472 233 return -ENOMEM; 473 234 } while (pmd++, addr = next, addr != end); 474 235 return 0; 475 236 } 476 237 477 - static int vmap_pud_range(p4d_t *p4d, unsigned long addr, 238 + static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, 478 239 unsigned long end, pgprot_t prot, struct page **pages, int 
*nr, 479 240 pgtbl_mod_mask *mask) 480 241 { ··· 486 247 return -ENOMEM; 487 248 do { 488 249 next = pud_addr_end(addr, end); 489 - if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask)) 250 + if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask)) 490 251 return -ENOMEM; 491 252 } while (pud++, addr = next, addr != end); 492 253 return 0; 493 254 } 494 255 495 - static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, 256 + static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, 496 257 unsigned long end, pgprot_t prot, struct page **pages, int *nr, 497 258 pgtbl_mod_mask *mask) 498 259 { ··· 504 265 return -ENOMEM; 505 266 do { 506 267 next = p4d_addr_end(addr, end); 507 - if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask)) 268 + if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask)) 508 269 return -ENOMEM; 509 270 } while (p4d++, addr = next, addr != end); 510 271 return 0; 511 272 } 512 273 513 - /** 514 - * map_kernel_range_noflush - map kernel VM area with the specified pages 515 - * @addr: start of the VM area to map 516 - * @size: size of the VM area to map 517 - * @prot: page protection flags to use 518 - * @pages: pages to map 519 - * 520 - * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should 521 - * have been allocated using get_vm_area() and its friends. 522 - * 523 - * NOTE: 524 - * This function does NOT do any cache flushing. The caller is responsible for 525 - * calling flush_cache_vmap() on to-be-mapped areas before calling this 526 - * function. 527 - * 528 - * RETURNS: 529 - * 0 on success, -errno on failure. 
530 - */ 531 - int map_kernel_range_noflush(unsigned long addr, unsigned long size, 532 - pgprot_t prot, struct page **pages) 274 + static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, 275 + pgprot_t prot, struct page **pages) 533 276 { 534 277 unsigned long start = addr; 535 - unsigned long end = addr + size; 536 - unsigned long next; 537 278 pgd_t *pgd; 279 + unsigned long next; 538 280 int err = 0; 539 281 int nr = 0; 540 282 pgtbl_mod_mask mask = 0; ··· 526 306 next = pgd_addr_end(addr, end); 527 307 if (pgd_bad(*pgd)) 528 308 mask |= PGTBL_PGD_MODIFIED; 529 - err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); 309 + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); 530 310 if (err) 531 311 return err; 532 312 } while (pgd++, addr = next, addr != end); ··· 537 317 return 0; 538 318 } 539 319 540 - int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, 541 - struct page **pages) 320 + /* 321 + * vmap_pages_range_noflush is similar to vmap_pages_range, but does not 322 + * flush caches. 323 + * 324 + * The caller is responsible for calling flush_cache_vmap() after this 325 + * function returns successfully and before the addresses are accessed. 326 + * 327 + * This is an internal function only. Do not use outside mm/. 
328 + */ 329 + int vmap_pages_range_noflush(unsigned long addr, unsigned long end, 330 + pgprot_t prot, struct page **pages, unsigned int page_shift) 542 331 { 543 - int ret; 332 + unsigned int i, nr = (end - addr) >> PAGE_SHIFT; 544 333 545 - ret = map_kernel_range_noflush(start, size, prot, pages); 546 - flush_cache_vmap(start, start + size); 547 - return ret; 334 + WARN_ON(page_shift < PAGE_SHIFT); 335 + 336 + if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || 337 + page_shift == PAGE_SHIFT) 338 + return vmap_small_pages_range_noflush(addr, end, prot, pages); 339 + 340 + for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { 341 + int err; 342 + 343 + err = vmap_range_noflush(addr, addr + (1UL << page_shift), 344 + __pa(page_address(pages[i])), prot, 345 + page_shift); 346 + if (err) 347 + return err; 348 + 349 + addr += 1UL << page_shift; 350 + } 351 + 352 + return 0; 353 + } 354 + 355 + /** 356 + * vmap_pages_range - map pages to a kernel virtual address 357 + * @addr: start of the VM area to map 358 + * @end: end of the VM area to map (non-inclusive) 359 + * @prot: page protection flags to use 360 + * @pages: pages to map (always PAGE_SIZE pages) 361 + * @page_shift: maximum shift that the pages may be mapped with, @pages must 362 + * be aligned and contiguous up to at least this shift. 363 + * 364 + * RETURNS: 365 + * 0 on success, -errno on failure. 366 + */ 367 + static int vmap_pages_range(unsigned long addr, unsigned long end, 368 + pgprot_t prot, struct page **pages, unsigned int page_shift) 369 + { 370 + int err; 371 + 372 + err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift); 373 + flush_cache_vmap(addr, end); 374 + return err; 548 375 } 549 376 550 377 int is_vmalloc_or_module_addr(const void *x) ··· 610 343 } 611 344 612 345 /* 613 - * Walk a vmap address to the struct page it maps. 346 + * Walk a vmap address to the struct page it maps. 
Huge vmap mappings will 347 + * return the tail page that corresponds to the base page address, which 348 + * matches small vmap mappings. 614 349 */ 615 350 struct page *vmalloc_to_page(const void *vmalloc_addr) 616 351 { ··· 632 363 633 364 if (pgd_none(*pgd)) 634 365 return NULL; 366 + if (WARN_ON_ONCE(pgd_leaf(*pgd))) 367 + return NULL; /* XXX: no allowance for huge pgd */ 368 + if (WARN_ON_ONCE(pgd_bad(*pgd))) 369 + return NULL; 370 + 635 371 p4d = p4d_offset(pgd, addr); 636 372 if (p4d_none(*p4d)) 637 373 return NULL; 638 - pud = pud_offset(p4d, addr); 639 - 640 - /* 641 - * Don't dereference bad PUD or PMD (below) entries. This will also 642 - * identify huge mappings, which we may encounter on architectures 643 - * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be 644 - * identified as vmalloc addresses by is_vmalloc_addr(), but are 645 - * not [unambiguously] associated with a struct page, so there is 646 - * no correct value to return for them. 647 - */ 648 - WARN_ON_ONCE(pud_bad(*pud)); 649 - if (pud_none(*pud) || pud_bad(*pud)) 374 + if (p4d_leaf(*p4d)) 375 + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT); 376 + if (WARN_ON_ONCE(p4d_bad(*p4d))) 650 377 return NULL; 378 + 379 + pud = pud_offset(p4d, addr); 380 + if (pud_none(*pud)) 381 + return NULL; 382 + if (pud_leaf(*pud)) 383 + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); 384 + if (WARN_ON_ONCE(pud_bad(*pud))) 385 + return NULL; 386 + 651 387 pmd = pmd_offset(pud, addr); 652 - WARN_ON_ONCE(pmd_bad(*pmd)); 653 - if (pmd_none(*pmd) || pmd_bad(*pmd)) 388 + if (pmd_none(*pmd)) 389 + return NULL; 390 + if (pmd_leaf(*pmd)) 391 + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); 392 + if (WARN_ON_ONCE(pmd_bad(*pmd))) 654 393 return NULL; 655 394 656 395 ptep = pte_offset_map(pmd, addr); ··· 666 389 if (pte_present(pte)) 667 390 page = pte_page(pte); 668 391 pte_unmap(ptep); 392 + 669 393 return page; 670 394 } 671 395 EXPORT_SYMBOL(vmalloc_to_page); ··· 1430 
1152 spin_unlock(&free_vmap_area_lock); 1431 1153 } 1432 1154 1155 + static inline void 1156 + preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node) 1157 + { 1158 + struct vmap_area *va = NULL; 1159 + 1160 + /* 1161 + * Preload this CPU with one extra vmap_area object. It is used 1162 + * when fit type of free area is NE_FIT_TYPE. It guarantees that 1163 + * a CPU that does an allocation is preloaded. 1164 + * 1165 + * We do it in non-atomic context, thus it allows us to use more 1166 + * permissive allocation masks to be more stable under low memory 1167 + * condition and high memory pressure. 1168 + */ 1169 + if (!this_cpu_read(ne_fit_preload_node)) 1170 + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); 1171 + 1172 + spin_lock(lock); 1173 + 1174 + if (va && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, va)) 1175 + kmem_cache_free(vmap_area_cachep, va); 1176 + } 1177 + 1433 1178 /* 1434 1179 * Allocate a region of KVA of the specified size and alignment, within the 1435 1180 * vstart and vend. ··· 1462 1161 unsigned long vstart, unsigned long vend, 1463 1162 int node, gfp_t gfp_mask) 1464 1163 { 1465 - struct vmap_area *va, *pva; 1164 + struct vmap_area *va; 1466 1165 unsigned long addr; 1467 1166 int purged = 0; 1468 1167 int ret; ··· 1488 1187 kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); 1489 1188 1490 1189 retry: 1491 - /* 1492 - * Preload this CPU with one extra vmap_area object. It is used 1493 - * when fit type of free area is NE_FIT_TYPE. Please note, it 1494 - * does not guarantee that an allocation occurs on a CPU that 1495 - * is preloaded, instead we minimize the case when it is not. 1496 - * It can happen because of cpu migration, because there is a 1497 - * race until the below spinlock is taken. 1498 - * 1499 - * The preload is done in non-atomic context, thus it allows us 1500 - * to use more permissive allocation masks to be more stable under 1501 - * low memory condition and high memory pressure. 
In rare case, 1502 - * if not preloaded, GFP_NOWAIT is used. 1503 - * 1504 - * Set "pva" to NULL here, because of "retry" path. 1505 - */ 1506 - pva = NULL; 1507 - 1508 - if (!this_cpu_read(ne_fit_preload_node)) 1509 - /* 1510 - * Even if it fails we do not really care about that. 1511 - * Just proceed as it is. If needed "overflow" path 1512 - * will refill the cache we allocate from. 1513 - */ 1514 - pva = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); 1515 - 1516 - spin_lock(&free_vmap_area_lock); 1517 - 1518 - if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva)) 1519 - kmem_cache_free(vmap_area_cachep, pva); 1190 + preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); 1191 + addr = __alloc_vmap_area(size, align, vstart, vend); 1192 + spin_unlock(&free_vmap_area_lock); 1520 1193 1521 1194 /* 1522 1195 * If an allocation fails, the "vend" address is 1523 1196 * returned. Therefore trigger the overflow path. 1524 1197 */ 1525 - addr = __alloc_vmap_area(size, align, vstart, vend); 1526 - spin_unlock(&free_vmap_area_lock); 1527 - 1528 1198 if (unlikely(addr == vend)) 1529 1199 goto overflow; 1530 1200 1531 1201 va->va_start = addr; 1532 1202 va->va_end = addr + size; 1533 1203 va->vm = NULL; 1534 - 1535 1204 1536 1205 spin_lock(&vmap_area_lock); 1537 1206 insert_vmap_area(va, &vmap_area_root, &vmap_area_list); ··· 1719 1448 static void free_unmap_vmap_area(struct vmap_area *va) 1720 1449 { 1721 1450 flush_cache_vunmap(va->va_start, va->va_end); 1722 - unmap_kernel_range_noflush(va->va_start, va->va_end - va->va_start); 1451 + vunmap_range_noflush(va->va_start, va->va_end); 1723 1452 if (debug_pagealloc_enabled_static()) 1724 1453 flush_tlb_kernel_range(va->va_start, va->va_end); 1725 1454 ··· 1997 1726 offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT; 1998 1727 vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr)); 1999 1728 2000 - unmap_kernel_range_noflush(addr, size); 1729 + vunmap_range_noflush(addr, addr + size); 2001 1730 2002 1731 
if (debug_pagealloc_enabled_static()) 2003 1732 flush_tlb_kernel_range(addr, addr + size); ··· 2033 1762 rcu_read_lock(); 2034 1763 list_for_each_entry_rcu(vb, &vbq->free, free_list) { 2035 1764 spin_lock(&vb->lock); 2036 - if (vb->dirty) { 1765 + if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) { 2037 1766 unsigned long va_start = vb->va->va_start; 2038 1767 unsigned long s, e; 2039 1768 ··· 2150 1879 2151 1880 kasan_unpoison_vmalloc(mem, size); 2152 1881 2153 - if (map_kernel_range(addr, size, PAGE_KERNEL, pages) < 0) { 1882 + if (vmap_pages_range(addr, addr + size, PAGE_KERNEL, 1883 + pages, PAGE_SHIFT) < 0) { 2154 1884 vm_unmap_ram(mem, count); 2155 1885 return NULL; 2156 1886 } 1887 + 2157 1888 return mem; 2158 1889 } 2159 1890 EXPORT_SYMBOL(vm_map_ram); 2160 1891 2161 1892 static struct vm_struct *vmlist __initdata; 1893 + 1894 + static inline unsigned int vm_area_page_order(struct vm_struct *vm) 1895 + { 1896 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC 1897 + return vm->page_order; 1898 + #else 1899 + return 0; 1900 + #endif 1901 + } 1902 + 1903 + static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order) 1904 + { 1905 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC 1906 + vm->page_order = order; 1907 + #else 1908 + BUG_ON(order != 0); 1909 + #endif 1910 + } 2162 1911 2163 1912 /** 2164 1913 * vm_area_add_early - add vmap area early during boot ··· 2312 2021 */ 2313 2022 vmap_init_free_space(); 2314 2023 vmap_initialized = true; 2315 - } 2316 - 2317 - /** 2318 - * unmap_kernel_range - unmap kernel VM area and flush cache and TLB 2319 - * @addr: start of the VM area to unmap 2320 - * @size: size of the VM area to unmap 2321 - * 2322 - * Similar to unmap_kernel_range_noflush() but flushes vcache before 2323 - * the unmapping and tlb after. 
2324 - */ 2325 - void unmap_kernel_range(unsigned long addr, unsigned long size) 2326 - { 2327 - unsigned long end = addr + size; 2328 - 2329 - flush_cache_vunmap(addr, end); 2330 - unmap_kernel_range_noflush(addr, size); 2331 - flush_tlb_kernel_range(addr, end); 2332 2024 } 2333 2025 2334 2026 static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, ··· 2473 2199 { 2474 2200 int i; 2475 2201 2202 + /* HUGE_VMALLOC passes small pages to set_direct_map */ 2476 2203 for (i = 0; i < area->nr_pages; i++) 2477 2204 if (page_address(area->pages[i])) 2478 2205 set_direct_map(area->pages[i]); ··· 2483 2208 static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages) 2484 2209 { 2485 2210 unsigned long start = ULONG_MAX, end = 0; 2211 + unsigned int page_order = vm_area_page_order(area); 2486 2212 int flush_reset = area->flags & VM_FLUSH_RESET_PERMS; 2487 2213 int flush_dmap = 0; 2488 2214 int i; ··· 2508 2232 * map. Find the start and end range of the direct mappings to make sure 2509 2233 * the vm_unmap_aliases() flush includes the direct map. 
2510 2234 */ 2511 - for (i = 0; i < area->nr_pages; i++) { 2235 + for (i = 0; i < area->nr_pages; i += 1U << page_order) { 2512 2236 unsigned long addr = (unsigned long)page_address(area->pages[i]); 2513 2237 if (addr) { 2238 + unsigned long page_size; 2239 + 2240 + page_size = PAGE_SIZE << page_order; 2514 2241 start = min(addr, start); 2515 - end = max(addr + PAGE_SIZE, end); 2242 + end = max(addr + page_size, end); 2516 2243 flush_dmap = 1; 2517 2244 } 2518 2245 } ··· 2556 2277 vm_remove_mappings(area, deallocate_pages); 2557 2278 2558 2279 if (deallocate_pages) { 2280 + unsigned int page_order = vm_area_page_order(area); 2559 2281 int i; 2560 2282 2561 - for (i = 0; i < area->nr_pages; i++) { 2283 + for (i = 0; i < area->nr_pages; i += 1U << page_order) { 2562 2284 struct page *page = area->pages[i]; 2563 2285 2564 2286 BUG_ON(!page); 2565 - __free_pages(page, 0); 2287 + __free_pages(page, page_order); 2566 2288 } 2567 2289 atomic_long_sub(area->nr_pages, &nr_vmalloc_pages); 2568 2290 ··· 2682 2402 unsigned long flags, pgprot_t prot) 2683 2403 { 2684 2404 struct vm_struct *area; 2405 + unsigned long addr; 2685 2406 unsigned long size; /* In bytes */ 2686 2407 2687 2408 might_sleep(); ··· 2695 2414 if (!area) 2696 2415 return NULL; 2697 2416 2698 - if (map_kernel_range((unsigned long)area->addr, size, pgprot_nx(prot), 2699 - pages) < 0) { 2417 + addr = (unsigned long)area->addr; 2418 + if (vmap_pages_range(addr, addr + size, pgprot_nx(prot), 2419 + pages, PAGE_SHIFT) < 0) { 2700 2420 vunmap(area->addr); 2701 2421 return NULL; 2702 2422 } ··· 2756 2474 #endif /* CONFIG_VMAP_PFN */ 2757 2475 2758 2476 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, 2759 - pgprot_t prot, int node) 2477 + pgprot_t prot, unsigned int page_shift, 2478 + int node) 2760 2479 { 2761 2480 const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; 2762 - unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; 2481 + unsigned long addr = (unsigned 
long)area->addr; 2482 + unsigned long size = get_vm_area_size(area); 2763 2483 unsigned long array_size; 2764 - unsigned int i; 2484 + unsigned int nr_small_pages = size >> PAGE_SHIFT; 2485 + unsigned int page_order; 2765 2486 struct page **pages; 2487 + unsigned int i; 2766 2488 2767 - array_size = (unsigned long)nr_pages * sizeof(struct page *); 2489 + array_size = (unsigned long)nr_small_pages * sizeof(struct page *); 2768 2490 gfp_mask |= __GFP_NOWARN; 2769 2491 if (!(gfp_mask & (GFP_DMA | GFP_DMA32))) 2770 2492 gfp_mask |= __GFP_HIGHMEM; ··· 2783 2497 2784 2498 if (!pages) { 2785 2499 free_vm_area(area); 2500 + warn_alloc(gfp_mask, NULL, 2501 + "vmalloc size %lu allocation failure: " 2502 + "page array size %lu allocation failed", 2503 + nr_small_pages * PAGE_SIZE, array_size); 2786 2504 return NULL; 2787 2505 } 2788 2506 2789 2507 area->pages = pages; 2790 - area->nr_pages = nr_pages; 2508 + area->nr_pages = nr_small_pages; 2509 + set_vm_area_page_order(area, page_shift - PAGE_SHIFT); 2791 2510 2792 - for (i = 0; i < area->nr_pages; i++) { 2511 + page_order = vm_area_page_order(area); 2512 + 2513 + /* 2514 + * Careful, we allocate and map page_order pages, but tracking is done 2515 + * per PAGE_SIZE page so as to keep the vm_struct APIs independent of 2516 + * the physical/mapped size. 
2517 + */ 2518 + for (i = 0; i < area->nr_pages; i += 1U << page_order) { 2793 2519 struct page *page; 2520 + int p; 2794 2521 2795 - if (node == NUMA_NO_NODE) 2796 - page = alloc_page(gfp_mask); 2797 - else 2798 - page = alloc_pages_node(node, gfp_mask, 0); 2799 - 2522 + /* Compound pages required for remap_vmalloc_page */ 2523 + page = alloc_pages_node(node, gfp_mask | __GFP_COMP, page_order); 2800 2524 if (unlikely(!page)) { 2801 2525 /* Successfully allocated i pages, free them in __vfree() */ 2802 2526 area->nr_pages = i; 2803 2527 atomic_long_add(area->nr_pages, &nr_vmalloc_pages); 2528 + warn_alloc(gfp_mask, NULL, 2529 + "vmalloc size %lu allocation failure: " 2530 + "page order %u allocation failed", 2531 + area->nr_pages * PAGE_SIZE, page_order); 2804 2532 goto fail; 2805 2533 } 2806 - area->pages[i] = page; 2534 + 2535 + for (p = 0; p < (1U << page_order); p++) 2536 + area->pages[i + p] = page + p; 2537 + 2807 2538 if (gfpflags_allow_blocking(gfp_mask)) 2808 2539 cond_resched(); 2809 2540 } 2810 2541 atomic_long_add(area->nr_pages, &nr_vmalloc_pages); 2811 2542 2812 - if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area), 2813 - prot, pages) < 0) 2543 + if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0) { 2544 + warn_alloc(gfp_mask, NULL, 2545 + "vmalloc size %lu allocation failure: " 2546 + "failed to map pages", 2547 + area->nr_pages * PAGE_SIZE); 2814 2548 goto fail; 2549 + } 2815 2550 2816 2551 return area->addr; 2817 2552 2818 2553 fail: 2819 - warn_alloc(gfp_mask, NULL, 2820 - "vmalloc: allocation failure, allocated %ld of %ld bytes", 2821 - (area->nr_pages*PAGE_SIZE), area->size); 2822 2554 __vfree(area->addr); 2823 2555 return NULL; 2824 2556 } ··· 2867 2563 struct vm_struct *area; 2868 2564 void *addr; 2869 2565 unsigned long real_size = size; 2566 + unsigned long real_align = align; 2567 + unsigned int shift = PAGE_SHIFT; 2870 2568 2871 - size = PAGE_ALIGN(size); 2872 - if (!size || (size >> PAGE_SHIFT) > 
totalram_pages()) 2873 - goto fail; 2874 - 2875 - area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED | 2876 - vm_flags, start, end, node, gfp_mask, caller); 2877 - if (!area) 2878 - goto fail; 2879 - 2880 - addr = __vmalloc_area_node(area, gfp_mask, prot, node); 2881 - if (!addr) 2569 + if (WARN_ON_ONCE(!size)) 2882 2570 return NULL; 2571 + 2572 + if ((size >> PAGE_SHIFT) > totalram_pages()) { 2573 + warn_alloc(gfp_mask, NULL, 2574 + "vmalloc size %lu allocation failure: " 2575 + "exceeds total pages", real_size); 2576 + return NULL; 2577 + } 2578 + 2579 + if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) && 2580 + arch_vmap_pmd_supported(prot)) { 2581 + unsigned long size_per_node; 2582 + 2583 + /* 2584 + * Try huge pages. Only try for PAGE_KERNEL allocations, 2585 + * others like modules don't yet expect huge pages in 2586 + * their allocations due to apply_to_page_range not 2587 + * supporting them. 2588 + */ 2589 + 2590 + size_per_node = size; 2591 + if (node == NUMA_NO_NODE) 2592 + size_per_node /= num_online_nodes(); 2593 + if (size_per_node >= PMD_SIZE) { 2594 + shift = PMD_SHIFT; 2595 + align = max(real_align, 1UL << shift); 2596 + size = ALIGN(real_size, 1UL << shift); 2597 + } 2598 + } 2599 + 2600 + again: 2601 + size = PAGE_ALIGN(size); 2602 + area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED | 2603 + vm_flags, start, end, node, gfp_mask, caller); 2604 + if (!area) { 2605 + warn_alloc(gfp_mask, NULL, 2606 + "vmalloc size %lu allocation failure: " 2607 + "vm_struct allocation failed", real_size); 2608 + goto fail; 2609 + } 2610 + 2611 + addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); 2612 + if (!addr) 2613 + goto fail; 2883 2614 2884 2615 /* 2885 2616 * In this function, newly allocated vm_struct has VM_UNINITIALIZED ··· 2928 2589 return addr; 2929 2590 2930 2591 fail: 2931 - warn_alloc(gfp_mask, NULL, 2932 - "vmalloc: allocation failure: %lu bytes", real_size); 2592 + if (shift > PAGE_SHIFT) { 2593 
+ shift = PAGE_SHIFT; 2594 + align = real_align; 2595 + size = real_size; 2596 + goto again; 2597 + } 2598 + 2933 2599 return NULL; 2934 2600 } 2935 2601 ··· 3238 2894 count = -(unsigned long) addr; 3239 2895 3240 2896 spin_lock(&vmap_area_lock); 3241 - list_for_each_entry(va, &vmap_area_list, list) { 2897 + va = __find_vmap_area((unsigned long)addr); 2898 + if (!va) 2899 + goto finished; 2900 + list_for_each_entry_from(va, &vmap_area_list, list) { 3242 2901 if (!count) 3243 2902 break; 3244 2903 ··· 3419 3072 3420 3073 return 0; 3421 3074 } 3422 - EXPORT_SYMBOL(remap_vmalloc_range_partial); 3423 3075 3424 3076 /** 3425 3077 * remap_vmalloc_range - map vmalloc pages to userspace
+74 -37
net/core/page_pool.c
··· 180 180 pool->p.dma_dir); 181 181 } 182 182 183 - /* slow path */ 184 - noinline 185 - static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, 186 - gfp_t _gfp) 183 + static bool page_pool_dma_map(struct page_pool *pool, struct page *page) 187 184 { 188 - struct page *page; 189 - gfp_t gfp = _gfp; 190 185 dma_addr_t dma; 191 - 192 - /* We could always set __GFP_COMP, and avoid this branch, as 193 - * prep_new_page() can handle order-0 with __GFP_COMP. 194 - */ 195 - if (pool->p.order) 196 - gfp |= __GFP_COMP; 197 - 198 - /* FUTURE development: 199 - * 200 - * Current slow-path essentially falls back to single page 201 - * allocations, which doesn't improve performance. This code 202 - * need bulk allocation support from the page allocator code. 203 - */ 204 - 205 - /* Cache was empty, do real allocation */ 206 - #ifdef CONFIG_NUMA 207 - page = alloc_pages_node(pool->p.nid, gfp, pool->p.order); 208 - #else 209 - page = alloc_pages(gfp, pool->p.order); 210 - #endif 211 - if (!page) 212 - return NULL; 213 - 214 - if (!(pool->p.flags & PP_FLAG_DMA_MAP)) 215 - goto skip_dma_map; 216 186 217 187 /* Setup DMA mapping: use 'struct page' area for storing DMA-addr 218 188 * since dma_addr_t can be either 32 or 64 bits and does not always fit ··· 192 222 dma = dma_map_page_attrs(pool->p.dev, page, 0, 193 223 (PAGE_SIZE << pool->p.order), 194 224 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC); 195 - if (dma_mapping_error(pool->p.dev, dma)) { 196 - put_page(page); 197 - return NULL; 198 - } 225 + if (dma_mapping_error(pool->p.dev, dma)) 226 + return false; 227 + 199 228 page->dma_addr = dma; 200 229 201 230 if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) 202 231 page_pool_dma_sync_for_device(pool, page, pool->p.max_len); 203 232 204 - skip_dma_map: 233 + return true; 234 + } 235 + 236 + static struct page *__page_pool_alloc_page_order(struct page_pool *pool, 237 + gfp_t gfp) 238 + { 239 + struct page *page; 240 + 241 + gfp |= __GFP_COMP; 242 + page = 
alloc_pages_node(pool->p.nid, gfp, pool->p.order); 243 + if (unlikely(!page)) 244 + return NULL; 245 + 246 + if ((pool->p.flags & PP_FLAG_DMA_MAP) && 247 + unlikely(!page_pool_dma_map(pool, page))) { 248 + put_page(page); 249 + return NULL; 250 + } 251 + 205 252 /* Track how many pages are held 'in-flight' */ 206 253 pool->pages_state_hold_cnt++; 207 - 208 254 trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt); 255 + return page; 256 + } 257 + 258 + /* slow path */ 259 + noinline 260 + static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, 261 + gfp_t gfp) 262 + { 263 + const int bulk = PP_ALLOC_CACHE_REFILL; 264 + unsigned int pp_flags = pool->p.flags; 265 + unsigned int pp_order = pool->p.order; 266 + struct page *page; 267 + int i, nr_pages; 268 + 269 + /* Don't support bulk alloc for high-order pages */ 270 + if (unlikely(pp_order)) 271 + return __page_pool_alloc_page_order(pool, gfp); 272 + 273 + /* Unnecessary as alloc cache is empty, but guarantees zero count */ 274 + if (unlikely(pool->alloc.count > 0)) 275 + return pool->alloc.cache[--pool->alloc.count]; 276 + 277 + /* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */ 278 + memset(&pool->alloc.cache, 0, sizeof(void *) * bulk); 279 + 280 + nr_pages = alloc_pages_bulk_array(gfp, bulk, pool->alloc.cache); 281 + if (unlikely(!nr_pages)) 282 + return NULL; 283 + 284 + /* Pages have been filled into alloc.cache array, but count is zero and 285 + * page element have not been (possibly) DMA mapped. 
286 + */ 287 + for (i = 0; i < nr_pages; i++) { 288 + page = pool->alloc.cache[i]; 289 + if ((pp_flags & PP_FLAG_DMA_MAP) && 290 + unlikely(!page_pool_dma_map(pool, page))) { 291 + put_page(page); 292 + continue; 293 + } 294 + pool->alloc.cache[pool->alloc.count++] = page; 295 + /* Track how many pages are held 'in-flight' */ 296 + pool->pages_state_hold_cnt++; 297 + trace_page_pool_state_hold(pool, page, 298 + pool->pages_state_hold_cnt); 299 + } 300 + 301 + /* Return last page */ 302 + if (likely(pool->alloc.count > 0)) 303 + page = pool->alloc.cache[--pool->alloc.count]; 304 + else 305 + page = NULL; 209 306 210 307 /* When page just alloc'ed is should/must have refcnt 1. */ 211 308 return page;
+18 -20
net/sunrpc/svc_xprt.c
··· 661 661 static int svc_alloc_arg(struct svc_rqst *rqstp) 662 662 { 663 663 struct svc_serv *serv = rqstp->rq_server; 664 - struct xdr_buf *arg; 665 - int pages; 666 - int i; 664 + struct xdr_buf *arg = &rqstp->rq_arg; 665 + unsigned long pages, filled; 667 666 668 - /* now allocate needed pages. If we get a failure, sleep briefly */ 669 667 pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT; 670 668 if (pages > RPCSVC_MAXPAGES) { 671 - pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n", 669 + pr_warn_once("svc: warning: pages=%lu > RPCSVC_MAXPAGES=%lu\n", 672 670 pages, RPCSVC_MAXPAGES); 673 671 /* use as many pages as possible */ 674 672 pages = RPCSVC_MAXPAGES; 675 673 } 676 - for (i = 0; i < pages ; i++) 677 - while (rqstp->rq_pages[i] == NULL) { 678 - struct page *p = alloc_page(GFP_KERNEL); 679 - if (!p) { 680 - set_current_state(TASK_INTERRUPTIBLE); 681 - if (signalled() || kthread_should_stop()) { 682 - set_current_state(TASK_RUNNING); 683 - return -EINTR; 684 - } 685 - schedule_timeout(msecs_to_jiffies(500)); 686 - } 687 - rqstp->rq_pages[i] = p; 674 + 675 + for (;;) { 676 + filled = alloc_pages_bulk_array(GFP_KERNEL, pages, 677 + rqstp->rq_pages); 678 + if (filled == pages) 679 + break; 680 + 681 + set_current_state(TASK_INTERRUPTIBLE); 682 + if (signalled() || kthread_should_stop()) { 683 + set_current_state(TASK_RUNNING); 684 + return -EINTR; 688 685 } 689 - rqstp->rq_page_end = &rqstp->rq_pages[i]; 690 - rqstp->rq_pages[i++] = NULL; /* this might be seen in nfs_read_actor */ 686 + schedule_timeout(msecs_to_jiffies(500)); 687 + } 688 + rqstp->rq_page_end = &rqstp->rq_pages[pages]; 689 + rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */ 691 690 692 691 /* Make arg->head point to first page and arg->pages point to rest */ 693 - arg = &rqstp->rq_arg; 694 692 arg->head[0].iov_base = page_address(rqstp->rq_pages[0]); 695 693 arg->head[0].iov_len = PAGE_SIZE; 696 694 arg->pages = rqstp->rq_pages + 1;
+6 -2
samples/kfifo/bytestream-example.c
··· 122 122 ret = kfifo_from_user(&test, buf, count, &copied); 123 123 124 124 mutex_unlock(&write_lock); 125 + if (ret) 126 + return ret; 125 127 126 - return ret ? ret : copied; 128 + return copied; 127 129 } 128 130 129 131 static ssize_t fifo_read(struct file *file, char __user *buf, ··· 140 138 ret = kfifo_to_user(&test, buf, count, &copied); 141 139 142 140 mutex_unlock(&read_lock); 141 + if (ret) 142 + return ret; 143 143 144 - return ret ? ret : copied; 144 + return copied; 145 145 } 146 146 147 147 static const struct proc_ops fifo_proc_ops = {
+6 -2
samples/kfifo/inttype-example.c
··· 115 115 ret = kfifo_from_user(&test, buf, count, &copied); 116 116 117 117 mutex_unlock(&write_lock); 118 + if (ret) 119 + return ret; 118 120 119 - return ret ? ret : copied; 121 + return copied; 120 122 } 121 123 122 124 static ssize_t fifo_read(struct file *file, char __user *buf, ··· 133 131 ret = kfifo_to_user(&test, buf, count, &copied); 134 132 135 133 mutex_unlock(&read_lock); 134 + if (ret) 135 + return ret; 136 136 137 - return ret ? ret : copied; 137 + return copied; 138 138 } 139 139 140 140 static const struct proc_ops fifo_proc_ops = {
+6 -2
samples/kfifo/record-example.c
··· 129 129 ret = kfifo_from_user(&test, buf, count, &copied); 130 130 131 131 mutex_unlock(&write_lock); 132 + if (ret) 133 + return ret; 132 134 133 - return ret ? ret : copied; 135 + return copied; 134 136 } 135 137 136 138 static ssize_t fifo_read(struct file *file, char __user *buf, ··· 147 145 ret = kfifo_to_user(&test, buf, count, &copied); 148 146 149 147 mutex_unlock(&read_lock); 148 + if (ret) 149 + return ret; 150 150 151 - return ret ? ret : copied; 151 + return copied; 152 152 } 153 153 154 154 static const struct proc_ops fifo_proc_ops = {
+1 -3
samples/vfio-mdev/mdpy.c
··· 406 406 if ((vma->vm_flags & VM_SHARED) == 0) 407 407 return -EINVAL; 408 408 409 - return remap_vmalloc_range_partial(vma, vma->vm_start, 410 - mdev_state->memblk, 0, 411 - vma->vm_end - vma->vm_start); 409 + return remap_vmalloc_range(vma, mdev_state->memblk, 0); 412 410 } 413 411 414 412 static int mdpy_get_region_info(struct mdev_device *mdev,
+53
scripts/checkdeclares.pl
··· 1 + #!/usr/bin/env perl 2 + # SPDX-License-Identifier: GPL-2.0 3 + # 4 + # checkdeclares: find struct declared more than once 5 + # 6 + # Copyright 2021 Wan Jiabing<wanjiabing@vivo.com> 7 + # Inspired by checkincludes.pl 8 + # 9 + # This script checks for duplicate struct declares. 10 + # Note that this will not take into consideration macros so 11 + # you should run this only if you know you do have real dups 12 + # and do not have them under #ifdef's. 13 + # You could also just review the results. 14 + 15 + use strict; 16 + 17 + sub usage { 18 + print "Usage: checkdeclares.pl file1.h ...\n"; 19 + print "Warns of struct declaration duplicates\n"; 20 + exit 1; 21 + } 22 + 23 + if ($#ARGV < 0) { 24 + usage(); 25 + } 26 + 27 + my $dup_counter = 0; 28 + 29 + foreach my $file (@ARGV) { 30 + open(my $f, '<', $file) 31 + or die "Cannot open $file: $!.\n"; 32 + 33 + my %declaredstructs = (); 34 + 35 + while (<$f>) { 36 + if (m/^\s*struct\s*(\w*);$/o) { 37 + ++$declaredstructs{$1}; 38 + } 39 + } 40 + 41 + close($f); 42 + 43 + foreach my $structname (keys %declaredstructs) { 44 + if ($declaredstructs{$structname} > 1) { 45 + print "$file: struct $structname is declared more than once.\n"; 46 + ++$dup_counter; 47 + } 48 + } 49 + } 50 + 51 + if ($dup_counter == 0) { 52 + print "No duplicate struct declares found.\n"; 53 + }
+25 -1
scripts/spelling.txt
··· 84 84 agaist||against 85 85 aggreataon||aggregation 86 86 aggreation||aggregation 87 + ajust||adjust 87 88 albumns||albums 88 89 alegorical||allegorical 89 90 algined||aligned ··· 162 161 asser||assert 163 162 assertation||assertion 164 163 assertting||asserting 164 + assgined||assigned 165 165 assiged||assigned 166 166 assigment||assignment 167 167 assigments||assignments 168 168 assistent||assistant 169 + assocaited||associated 170 + assocating||associating 169 171 assocation||association 170 172 associcated||associated 171 173 assotiated||associated ··· 181 177 asynchromous||asynchronous 182 178 asymetric||asymmetric 183 179 asymmeric||asymmetric 180 + atleast||at least 184 181 atomatically||automatically 185 182 atomicly||atomically 186 183 atempt||attempt 184 + atrributes||attributes 187 185 attachement||attachment 188 186 attatch||attach 189 187 attched||attached ··· 321 315 commited||committed 322 316 commiting||committing 323 317 committ||commit 318 + commnunication||communication 324 319 commoditiy||commodity 325 320 comsume||consume 326 321 comsumer||consumer ··· 356 349 conected||connected 357 350 conector||connector 358 351 configration||configuration 352 + configred||configured 359 353 configuartion||configuration 360 354 configuation||configuration 361 355 configued||configured ··· 410 402 curently||currently 411 403 cylic||cyclic 412 404 dafault||default 405 + deactive||deactivate 413 406 deafult||default 414 407 deamon||daemon 415 408 debouce||debounce ··· 426 417 defferred||deferred 427 418 definate||definite 428 419 definately||definitely 420 + definiation||definition 429 421 defintion||definition 430 422 defintions||definitions 431 423 defualt||default ··· 581 571 estbalishment||establishment 582 572 etsablishment||establishment 583 573 etsbalishment||establishment 574 + evalute||evaluate 575 + evalutes||evaluates 584 576 evalution||evaluation 585 - exeeds||exceeds 586 577 excecutable||executable 587 578 exceded||exceeded 588 579 
exceds||exceeds ··· 707 696 harware||hardware 708 697 havind||having 709 698 heirarchically||hierarchically 699 + heirarchy||hierarchy 710 700 helpfull||helpful 711 701 heterogenous||heterogeneous 712 702 hexdecimal||hexadecimal ··· 808 796 interchangable||interchangeable 809 797 interferring||interfering 810 798 interger||integer 799 + intergrated||integrated 811 800 intermittant||intermittent 812 801 internel||internal 813 802 interoprability||interoperability ··· 821 808 interrups||interrupts 822 809 interruptted||interrupted 823 810 interupted||interrupted 811 + intiailized||initialized 824 812 intial||initial 825 813 intialisation||initialisation 826 814 intialised||initialised ··· 1105 1091 prefered||preferred 1106 1092 prefferably||preferably 1107 1093 prefitler||prefilter 1094 + preform||perform 1108 1095 premption||preemption 1109 1096 prepaired||prepared 1110 1097 preperation||preparation 1111 1098 preprare||prepare 1112 1099 pressre||pressure 1100 + presuambly||presumably 1101 + previosuly||previously 1113 1102 primative||primitive 1114 1103 princliple||principle 1115 1104 priorty||priority ··· 1282 1265 schdule||schedule 1283 1266 seach||search 1284 1267 searchs||searches 1268 + secion||section 1285 1269 secquence||sequence 1286 1270 secund||second 1287 1271 segement||segment ··· 1330 1312 sleeped||slept 1331 1313 sliped||slipped 1332 1314 softwares||software 1315 + soley||solely 1316 + souce||source 1333 1317 speach||speech 1334 1318 specfic||specific 1335 1319 specfield||specified ··· 1340 1320 specifed||specified 1341 1321 specificatin||specification 1342 1322 specificaton||specification 1323 + specificed||specified 1343 1324 specifing||specifying 1325 + specifiy||specify 1344 1326 specifiying||specifying 1345 1327 speficied||specified 1346 1328 speicify||specify ··· 1458 1436 tmis||this 1459 1437 toogle||toggle 1460 1438 torerable||tolerable 1439 + traget||target 1461 1440 traking||tracking 1462 1441 tramsmitted||transmitted 1463 1442 
tramsmit||transmit ··· 1581 1558 wirte||write 1582 1559 withing||within 1583 1560 wnat||want 1561 + wont||won't 1584 1562 workarould||workaround 1585 1563 writeing||writing 1586 1564 writting||writing
+14 -8
tools/testing/selftests/cgroup/test_kmem.c
··· 19 19 20 20 21 21 /* 22 - * Memory cgroup charging and vmstat data aggregation is performed using 23 - * percpu batches 32 pages big (look at MEMCG_CHARGE_BATCH). So the maximum 24 - * discrepancy between charge and vmstat entries is number of cpus multiplied 25 - * by 32 pages multiplied by 2. 22 + * Memory cgroup charging is performed using percpu batches 32 pages 23 + * big (look at MEMCG_CHARGE_BATCH), whereas memory.stat is exact. So 24 + * the maximum discrepancy between charge and vmstat entries is number 25 + * of cpus multiplied by 32 pages. 26 26 */ 27 - #define MAX_VMSTAT_ERROR (4096 * 32 * 2 * get_nprocs()) 27 + #define MAX_VMSTAT_ERROR (4096 * 32 * get_nprocs()) 28 28 29 29 30 30 static int alloc_dcache(const char *cgroup, void *arg) ··· 162 162 */ 163 163 static int test_kmem_memcg_deletion(const char *root) 164 164 { 165 - long current, slab, anon, file, kernel_stack, sum; 165 + long current, slab, anon, file, kernel_stack, pagetables, percpu, sock, sum; 166 166 int ret = KSFT_FAIL; 167 167 char *parent; 168 168 ··· 184 184 anon = cg_read_key_long(parent, "memory.stat", "anon "); 185 185 file = cg_read_key_long(parent, "memory.stat", "file "); 186 186 kernel_stack = cg_read_key_long(parent, "memory.stat", "kernel_stack "); 187 + pagetables = cg_read_key_long(parent, "memory.stat", "pagetables "); 188 + percpu = cg_read_key_long(parent, "memory.stat", "percpu "); 189 + sock = cg_read_key_long(parent, "memory.stat", "sock "); 187 190 if (current < 0 || slab < 0 || anon < 0 || file < 0 || 188 - kernel_stack < 0) 191 + kernel_stack < 0 || pagetables < 0 || percpu < 0 || sock < 0) 189 192 goto cleanup; 190 193 191 - sum = slab + anon + file + kernel_stack; 194 + sum = slab + anon + file + kernel_stack + pagetables + percpu + sock; 192 195 if (abs(sum - current) < MAX_VMSTAT_ERROR) { 193 196 ret = KSFT_PASS; 194 197 } else { ··· 201 198 printf("anon = %ld\n", anon); 202 199 printf("file = %ld\n", file); 203 200 printf("kernel_stack = %ld\n", 
kernel_stack); 201 + printf("pagetables = %ld\n", pagetables); 202 + printf("percpu = %ld\n", percpu); 203 + printf("sock = %ld\n", sock); 204 204 } 205 205 206 206 cleanup:
+52
tools/testing/selftests/vm/mremap_dontunmap.c
···
127 127 "unable to unmap source mapping");
128 128 }
129 129
130 + // This test validates that MREMAP_DONTUNMAP on a shared mapping works as expected.
131 + static void mremap_dontunmap_simple_shmem()
132 + {
133 + unsigned long num_pages = 5;
134 +
135 + int mem_fd = memfd_create("memfd", MFD_CLOEXEC);
136 + BUG_ON(mem_fd < 0, "memfd_create");
137 +
138 + BUG_ON(ftruncate(mem_fd, num_pages * page_size) < 0,
139 + "ftruncate");
140 +
141 + void *source_mapping =
142 + mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
143 + MAP_FILE | MAP_SHARED, mem_fd, 0);
144 + BUG_ON(source_mapping == MAP_FAILED, "mmap");
145 +
146 + BUG_ON(close(mem_fd) < 0, "close");
147 +
148 + memset(source_mapping, 'a', num_pages * page_size);
149 +
150 + // Try to just move the whole mapping anywhere (not fixed).
151 + void *dest_mapping =
152 + mremap(source_mapping, num_pages * page_size, num_pages * page_size,
153 + MREMAP_DONTUNMAP | MREMAP_MAYMOVE, NULL);
154 + if (dest_mapping == MAP_FAILED && errno == EINVAL) {
155 + // Old kernel which doesn't support MREMAP_DONTUNMAP on shmem.
156 + BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
157 + "unable to unmap source mapping");
158 + return;
159 + }
160 +
161 + BUG_ON(dest_mapping == MAP_FAILED, "mremap");
162 +
163 + // Validate that the pages have been moved, we know they were moved if
164 + // the dest_mapping contains a's.
165 + BUG_ON(check_region_contains_byte
166 + (dest_mapping, num_pages * page_size, 'a') != 0,
167 + "pages did not migrate");
168 +
169 + // Because the region is backed by shmem, we will actually see the same
170 + // memory at the source location still.
171 + BUG_ON(check_region_contains_byte
172 + (source_mapping, num_pages * page_size, 'a') != 0,
173 + "source should have no ptes");
174 +
175 + BUG_ON(munmap(dest_mapping, num_pages * page_size) == -1,
176 + "unable to unmap destination mapping");
177 + BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
178 + "unable to unmap source mapping");
179 + }
180 +
130 181 // This test validates MREMAP_DONTUNMAP will move page tables to a specific
131 182 // destination using MREMAP_FIXED, also while validating that the source
132 183 // remains intact.
···
351 300 BUG_ON(page_buffer == MAP_FAILED, "unable to mmap a page.");
352 301
353 302 mremap_dontunmap_simple();
303 + mremap_dontunmap_simple_shmem();
354 304 mremap_dontunmap_simple_fixed();
355 305 mremap_dontunmap_partial_mapping();
356 306 mremap_dontunmap_partial_mapping_overwrite();
+11 -10
tools/testing/selftests/vm/test_vmalloc.sh
···
11 11
12 12 TEST_NAME="vmalloc"
13 13 DRIVER="test_${TEST_NAME}"
14 + NUM_CPUS=`grep -c ^processor /proc/cpuinfo`
14 15
15 16 # 1 if fails
16 17 exitcode=1
···
23 22 # Static templates for performance, stressing and smoke tests.
24 23 # Also it is possible to pass any supported parameters manualy.
25 24 #
26 - PERF_PARAM="single_cpu_test=1 sequential_test_order=1 test_repeat_count=3"
27 - SMOKE_PARAM="single_cpu_test=1 test_loop_count=10000 test_repeat_count=10"
28 - STRESS_PARAM="test_repeat_count=20"
25 + PERF_PARAM="sequential_test_order=1 test_repeat_count=3"
26 + SMOKE_PARAM="test_loop_count=10000 test_repeat_count=10"
27 + STRESS_PARAM="nr_threads=$NUM_CPUS test_repeat_count=20"
29 28
30 29 check_test_requirements()
31 30 {
···
59 58
60 59 run_stability_check()
61 60 {
62 - echo "Run stability tests. In order to stress vmalloc subsystem we run"
63 - echo "all available test cases on all available CPUs simultaneously."
61 + echo "Run stability tests. In order to stress vmalloc subsystem all"
62 + echo "available test cases are run by NUM_CPUS workers simultaneously."
64 63 echo "It will take time, so be patient."
65 64
66 65 modprobe $DRIVER $STRESS_PARAM > /dev/null 2>&1
···
93 92 echo "# Shows help message"
94 93 echo "./${DRIVER}.sh"
95 94 echo
96 - echo "# Runs 1 test(id_1), repeats it 5 times on all online CPUs"
97 - echo "./${DRIVER}.sh run_test_mask=1 test_repeat_count=5"
95 + echo "# Runs 1 test(id_1), repeats it 5 times by NUM_CPUS workers"
96 + echo "./${DRIVER}.sh nr_threads=$NUM_CPUS run_test_mask=1 test_repeat_count=5"
98 97 echo
99 98 echo -n "# Runs 4 tests(id_1|id_2|id_4|id_16) on one CPU with "
100 99 echo "sequential order"
101 - echo -n "./${DRIVER}.sh single_cpu_test=1 sequential_test_order=1 "
100 + echo -n "./${DRIVER}.sh sequential_test_order=1 "
102 101 echo "run_test_mask=23"
103 102 echo
104 - echo -n "# Runs all tests on all online CPUs, shuffled order, repeats "
103 + echo -n "# Runs all tests by NUM_CPUS workers, shuffled order, repeats "
105 104 echo "20 times"
106 - echo "./${DRIVER}.sh test_repeat_count=20"
105 + echo "./${DRIVER}.sh nr_threads=$NUM_CPUS test_repeat_count=20"
107 106 echo
108 107 echo "# Performance analysis"
109 108 echo "./${DRIVER}.sh performance"