Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"118 patches:

- The rest of MM.

Includes kfence - another runtime memory validator. Not as thorough
as KASAN, but it has unmeasurable overhead and is intended to be
usable in production builds.

- Everything else

Subsystems affected by this patch series: alpha, procfs, sysctl,
misc, core-kernel, MAINTAINERS, lib, bitops, checkpatch, init,
coredump, seq_file, gdb, ubsan, initramfs, and mm (thp, cma,
vmstat, memory-hotplug, mlock, rmap, zswap, zsmalloc, cleanups,
kfence, kasan2, and pagemap2)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
MIPS: make userspace mapping young by default
initramfs: panic with memory information
ubsan: remove overflow checks
kgdb: fix to kill breakpoints on initmem after boot
scripts/gdb: fix list_for_each
x86: fix seq_file iteration for pat/memtype.c
seq_file: document how per-entry resources are managed.
fs/coredump: use kmap_local_page()
init/Kconfig: fix a typo in CC_VERSION_TEXT help text
init: clean up early_param_on_off() macro
init/version.c: remove Version_<LINUX_VERSION_CODE> symbol
checkpatch: do not apply "initialise globals to 0" check to BPF progs
checkpatch: don't warn about colon termination in linker scripts
checkpatch: add kmalloc_array_node to unnecessary OOM message check
checkpatch: add warning for avoiding .L prefix symbols in assembly files
checkpatch: improve TYPECAST_INT_CONSTANT test message
checkpatch: prefer ftrace over function entry/exit printks
checkpatch: trivial style fixes
checkpatch: ignore warning designated initializers using NR_CPUS
checkpatch: improve blank line after declaration test
...

+4870 -1507
+1
.mailmap
··· 237 237 Mayuresh Janorkar <mayur@ti.com> 238 238 Michael Buesch <m@bues.ch> 239 239 Michel Dänzer <michel@tungstengraphics.com> 240 + Miguel Ojeda <ojeda@kernel.org> <miguel.ojeda.sandonis@gmail.com> 240 241 Mike Rapoport <rppt@kernel.org> <mike@compulab.co.il> 241 242 Mike Rapoport <rppt@kernel.org> <mike.rapoport@gmail.com> 242 243 Mike Rapoport <rppt@kernel.org> <rppt@linux.ibm.com>
+3 -6
CREDITS
··· 2841 2841 S: Perth, Western Australia 2842 2842 S: Australia 2843 2843 2844 - N: Miguel Ojeda Sandonis 2845 - E: miguel.ojeda.sandonis@gmail.com 2846 - W: http://miguelojeda.es 2847 - W: http://jair.lab.fi.uva.es/~migojed/ 2844 + N: Miguel Ojeda 2845 + E: ojeda@kernel.org 2846 + W: https://ojeda.dev 2848 2847 D: Author of the ks0108, cfag12864b and cfag12864bfb auxiliary display drivers. 2849 2848 D: Maintainer of the auxiliary display drivers tree (drivers/auxdisplay/*) 2850 - S: C/ Mieses 20, 9-B 2851 - S: Valladolid 47009 2852 2849 S: Spain 2853 2850 2854 2851 N: Peter Oruba
+35 -21
Documentation/ABI/testing/sysfs-devices-memory
··· 13 13 Date: June 2008 14 14 Contact: Badari Pulavarty <pbadari@us.ibm.com> 15 15 Description: 16 - The file /sys/devices/system/memory/memoryX/removable 17 - indicates whether this memory block is removable or not. 18 - This is useful for a user-level agent to determine 19 - identify removable sections of the memory before attempting 20 - potentially expensive hot-remove memory operation 16 + The file /sys/devices/system/memory/memoryX/removable is a 17 + legacy interface used to indicated whether a memory block is 18 + likely to be offlineable or not. Newer kernel versions return 19 + "1" if and only if the kernel supports memory offlining. 21 20 Users: hotplug memory remove tools 22 21 http://www.ibm.com/developerworks/wikis/display/LinuxP/powerpc-utils 22 + lsmem/chmem part of util-linux 23 23 24 24 What: /sys/devices/system/memory/memoryX/phys_device 25 25 Date: September 2008 26 26 Contact: Badari Pulavarty <pbadari@us.ibm.com> 27 27 Description: 28 28 The file /sys/devices/system/memory/memoryX/phys_device 29 - is read-only and is designed to show the name of physical 30 - memory device. Implementation is currently incomplete. 29 + is read-only; it is a legacy interface only ever used on s390x 30 + to expose the covered storage increment. 31 + Users: Legacy s390-tools lsmem/chmem 31 32 32 33 What: /sys/devices/system/memory/memoryX/phys_index 33 34 Date: September 2008 ··· 44 43 Contact: Badari Pulavarty <pbadari@us.ibm.com> 45 44 Description: 46 45 The file /sys/devices/system/memory/memoryX/state 47 - is read-write. When read, its contents show the 48 - online/offline state of the memory section. When written, 49 - root can toggle the the online/offline state of a removable 50 - memory section (see removable file description above) 51 - using the following commands:: 46 + is read-write. When read, it returns the online/offline 47 + state of the memory block. 
When written, root can toggle 48 + the online/offline state of a memory block using the following 49 + commands:: 52 50 53 51 # echo online > /sys/devices/system/memory/memoryX/state 54 52 # echo offline > /sys/devices/system/memory/memoryX/state 55 53 56 - For example, if /sys/devices/system/memory/memory22/removable 57 - contains a value of 1 and 58 - /sys/devices/system/memory/memory22/state contains the 59 - string "online" the following command can be executed by 60 - by root to offline that section:: 54 + On newer kernel versions, advanced states can be specified 55 + when onlining to select a target zone: "online_movable" 56 + selects the movable zone. "online_kernel" selects the 57 + applicable kernel zone (DMA, DMA32, or Normal). However, 58 + after successfully setting one of the advanced states, 59 + reading the file will return "online"; the zone information 60 + can be obtained via "valid_zones" instead. 61 61 62 - # echo offline > /sys/devices/system/memory/memory22/state 63 - 62 + While onlining is unlikely to fail, there are no guarantees 63 + that offlining will succeed. Offlining is more likely to 64 + succeed if "valid_zones" indicates "Movable". 64 65 Users: hotplug memory remove tools 65 66 http://www.ibm.com/developerworks/wikis/display/LinuxP/powerpc-utils 66 67 ··· 72 69 Contact: Zhang Zhen <zhenzhang.zhang@huawei.com> 73 70 Description: 74 71 The file /sys/devices/system/memory/memoryX/valid_zones is 75 - read-only and is designed to show which zone this memory 76 - block can be onlined to. 72 + read-only. 73 + 74 + For online memory blocks, it returns in which zone memory 75 + provided by a memory block is managed. If multiple zones 76 + apply (not applicable for hotplugged memory), "None" is returned 77 + and the memory block cannot be offlined. 78 + 79 + For offline memory blocks, it returns by which zone memory 80 + provided by a memory block can be managed when onlining. 
81 + The first returned zone ("default") will be used when setting 82 + the state of an offline memory block to "online". Only one of 83 + the kernel zones (DMA, DMA32, Normal) is applicable for a single 84 + memory block. 77 85 78 86 What: /sys/devices/system/memoryX/nodeY 79 87 Date: October 2009
+1 -1
Documentation/admin-guide/auxdisplay/cfag12864b.rst
··· 3 3 =================================== 4 4 5 5 :License: GPLv2 6 - :Author & Maintainer: Miguel Ojeda Sandonis 6 + :Author & Maintainer: Miguel Ojeda <ojeda@kernel.org> 7 7 :Date: 2006-10-27 8 8 9 9
+1 -1
Documentation/admin-guide/auxdisplay/ks0108.rst
··· 3 3 ========================================== 4 4 5 5 :License: GPLv2 6 - :Author & Maintainer: Miguel Ojeda Sandonis 6 + :Author & Maintainer: Miguel Ojeda <ojeda@kernel.org> 7 7 :Date: 2006-10-27 8 8 9 9
+6
Documentation/admin-guide/kernel-parameters.txt
··· 5182 5182 growing up) the main stack are reserved for no other 5183 5183 mapping. Default value is 256 pages. 5184 5184 5185 + stack_depot_disable= [KNL] 5186 + Setting this to true through kernel command line will 5187 + disable the stack depot thereby saving the static memory 5188 + consumed by the stack hash table. By default this is set 5189 + to false. 5190 + 5185 5191 stacktrace [FTRACE] 5186 5192 Enabled the stack tracer on boot up. 5187 5193
+10 -10
Documentation/admin-guide/mm/memory-hotplug.rst
··· 160 160 161 161 "online_movable", "online", "offline" command 162 162 which will be performed on all sections in the block. 163 - ``phys_device`` read-only: designed to show the name of physical memory 164 - device. This is not well implemented now. 165 - ``removable`` read-only: contains an integer value indicating 166 - whether the memory block is removable or not 167 - removable. A value of 1 indicates that the memory 168 - block is removable and a value of 0 indicates that 169 - it is not removable. A memory block is removable only if 170 - every section in the block is removable. 171 - ``valid_zones`` read-only: designed to show which zones this memory block 172 - can be onlined to. 163 + ``phys_device`` read-only: legacy interface only ever used on s390x to 164 + expose the covered storage increment. 165 + ``removable`` read-only: legacy interface that indicated whether a memory 166 + block was likely to be offlineable or not. Newer kernel 167 + versions return "1" if and only if the kernel supports 168 + memory offlining. 169 + ``valid_zones`` read-only: designed to show by which zone memory provided by 170 + a memory block is managed, and to show by which zone memory 171 + provided by an offline memory block could be managed when 172 + onlining. 173 173 174 174 The first column shows it`s default zone. 175 175
+1
Documentation/dev-tools/index.rst
··· 22 22 ubsan 23 23 kmemleak 24 24 kcsan 25 + kfence 25 26 gdb-kernel-debugging 26 27 kgdb 27 28 kselftest
+6 -2
Documentation/dev-tools/kasan.rst
··· 155 155 ~~~~~~~~~~~~~~~ 156 156 157 157 Hardware tag-based KASAN mode (see the section about various modes below) is 158 - intended for use in production as a security mitigation. Therefore it supports 158 + intended for use in production as a security mitigation. Therefore, it supports 159 159 boot parameters that allow to disable KASAN competely or otherwise control 160 160 particular KASAN features. 161 161 ··· 165 165 traces collection (default: ``on``). 166 166 167 167 - ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN 168 - report or also panic the kernel (default: ``report``). 168 + report or also panic the kernel (default: ``report``). Note, that tag 169 + checking gets disabled after the first reported bug. 169 170 170 171 For developers 171 172 ~~~~~~~~~~~~~~ ··· 295 294 Note, that enabling CONFIG_KASAN_HW_TAGS always results in in-kernel TBI being 296 295 enabled. Even when kasan.mode=off is provided, or when the hardware doesn't 297 296 support MTE (but supports TBI). 297 + 298 + Hardware tag-based KASAN only reports the first found bug. After that MTE tag 299 + checking gets disabled. 298 300 299 301 What memory accesses are sanitised by KASAN? 300 302 --------------------------------------------
+298
Documentation/dev-tools/kfence.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + .. Copyright (C) 2020, Google LLC. 3 + 4 + Kernel Electric-Fence (KFENCE) 5 + ============================== 6 + 7 + Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety 8 + error detector. KFENCE detects heap out-of-bounds access, use-after-free, and 9 + invalid-free errors. 10 + 11 + KFENCE is designed to be enabled in production kernels, and has near zero 12 + performance overhead. Compared to KASAN, KFENCE trades performance for 13 + precision. The main motivation behind KFENCE's design, is that with enough 14 + total uptime KFENCE will detect bugs in code paths not typically exercised by 15 + non-production test workloads. One way to quickly achieve a large enough total 16 + uptime is when the tool is deployed across a large fleet of machines. 17 + 18 + Usage 19 + ----- 20 + 21 + To enable KFENCE, configure the kernel with:: 22 + 23 + CONFIG_KFENCE=y 24 + 25 + To build a kernel with KFENCE support, but disabled by default (to enable, set 26 + ``kfence.sample_interval`` to non-zero value), configure the kernel with:: 27 + 28 + CONFIG_KFENCE=y 29 + CONFIG_KFENCE_SAMPLE_INTERVAL=0 30 + 31 + KFENCE provides several other configuration options to customize behaviour (see 32 + the respective help text in ``lib/Kconfig.kfence`` for more info). 33 + 34 + Tuning performance 35 + ~~~~~~~~~~~~~~~~~~ 36 + 37 + The most important parameter is KFENCE's sample interval, which can be set via 38 + the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The 39 + sample interval determines the frequency with which heap allocations will be 40 + guarded by KFENCE. The default is configurable via the Kconfig option 41 + ``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0`` 42 + disables KFENCE. 43 + 44 + The KFENCE memory pool is of fixed size, and if the pool is exhausted, no 45 + further KFENCE allocations occur. 
With ``CONFIG_KFENCE_NUM_OBJECTS`` (default 46 + 255), the number of available guarded objects can be controlled. Each object 47 + requires 2 pages, one for the object itself and the other one used as a guard 48 + page; object pages are interleaved with guard pages, and every object page is 49 + therefore surrounded by two guard pages. 50 + 51 + The total memory dedicated to the KFENCE memory pool can be computed as:: 52 + 53 + ( #objects + 1 ) * 2 * PAGE_SIZE 54 + 55 + Using the default config, and assuming a page size of 4 KiB, results in 56 + dedicating 2 MiB to the KFENCE memory pool. 57 + 58 + Note: On architectures that support huge pages, KFENCE will ensure that the 59 + pool is using pages of size ``PAGE_SIZE``. This will result in additional page 60 + tables being allocated. 61 + 62 + Error reports 63 + ~~~~~~~~~~~~~ 64 + 65 + A typical out-of-bounds access looks like this:: 66 + 67 + ================================================================== 68 + BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa3/0x22b 69 + 70 + Out-of-bounds read at 0xffffffffb672efff (1B left of kfence-#17): 71 + test_out_of_bounds_read+0xa3/0x22b 72 + kunit_try_run_case+0x51/0x85 73 + kunit_generic_run_threadfn_adapter+0x16/0x30 74 + kthread+0x137/0x160 75 + ret_from_fork+0x22/0x30 76 + 77 + kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated by task 507: 78 + test_alloc+0xf3/0x25b 79 + test_out_of_bounds_read+0x98/0x22b 80 + kunit_try_run_case+0x51/0x85 81 + kunit_generic_run_threadfn_adapter+0x16/0x30 82 + kthread+0x137/0x160 83 + ret_from_fork+0x22/0x30 84 + 85 + CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7 86 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 87 + ================================================================== 88 + 89 + The header of the report provides a short summary of the function involved in 90 + the access. 
It is followed by more detailed information about the access and 91 + its origin. Note that, real kernel addresses are only shown when using the 92 + kernel command line option ``no_hash_pointers``. 93 + 94 + Use-after-free accesses are reported as:: 95 + 96 + ================================================================== 97 + BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143 98 + 99 + Use-after-free read at 0xffffffffb673dfe0 (in kfence-#24): 100 + test_use_after_free_read+0xb3/0x143 101 + kunit_try_run_case+0x51/0x85 102 + kunit_generic_run_threadfn_adapter+0x16/0x30 103 + kthread+0x137/0x160 104 + ret_from_fork+0x22/0x30 105 + 106 + kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated by task 507: 107 + test_alloc+0xf3/0x25b 108 + test_use_after_free_read+0x76/0x143 109 + kunit_try_run_case+0x51/0x85 110 + kunit_generic_run_threadfn_adapter+0x16/0x30 111 + kthread+0x137/0x160 112 + ret_from_fork+0x22/0x30 113 + 114 + freed by task 507: 115 + test_use_after_free_read+0xa8/0x143 116 + kunit_try_run_case+0x51/0x85 117 + kunit_generic_run_threadfn_adapter+0x16/0x30 118 + kthread+0x137/0x160 119 + ret_from_fork+0x22/0x30 120 + 121 + CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 122 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 123 + ================================================================== 124 + 125 + KFENCE also reports on invalid frees, such as double-frees:: 126 + 127 + ================================================================== 128 + BUG: KFENCE: invalid free in test_double_free+0xdc/0x171 129 + 130 + Invalid free of 0xffffffffb6741000: 131 + test_double_free+0xdc/0x171 132 + kunit_try_run_case+0x51/0x85 133 + kunit_generic_run_threadfn_adapter+0x16/0x30 134 + kthread+0x137/0x160 135 + ret_from_fork+0x22/0x30 136 + 137 + kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated by task 507: 138 + 
test_alloc+0xf3/0x25b 139 + test_double_free+0x76/0x171 140 + kunit_try_run_case+0x51/0x85 141 + kunit_generic_run_threadfn_adapter+0x16/0x30 142 + kthread+0x137/0x160 143 + ret_from_fork+0x22/0x30 144 + 145 + freed by task 507: 146 + test_double_free+0xa8/0x171 147 + kunit_try_run_case+0x51/0x85 148 + kunit_generic_run_threadfn_adapter+0x16/0x30 149 + kthread+0x137/0x160 150 + ret_from_fork+0x22/0x30 151 + 152 + CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 153 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 154 + ================================================================== 155 + 156 + KFENCE also uses pattern-based redzones on the other side of an object's guard 157 + page, to detect out-of-bounds writes on the unprotected side of the object. 158 + These are reported on frees:: 159 + 160 + ================================================================== 161 + BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184 162 + 163 + Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . 
] (in kfence-#69): 164 + test_kmalloc_aligned_oob_write+0xef/0x184 165 + kunit_try_run_case+0x51/0x85 166 + kunit_generic_run_threadfn_adapter+0x16/0x30 167 + kthread+0x137/0x160 168 + ret_from_fork+0x22/0x30 169 + 170 + kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated by task 507: 171 + test_alloc+0xf3/0x25b 172 + test_kmalloc_aligned_oob_write+0x57/0x184 173 + kunit_try_run_case+0x51/0x85 174 + kunit_generic_run_threadfn_adapter+0x16/0x30 175 + kthread+0x137/0x160 176 + ret_from_fork+0x22/0x30 177 + 178 + CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 179 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 180 + ================================================================== 181 + 182 + For such errors, the address where the corruption occurred as well as the 183 + invalidly written bytes (offset from the address) are shown; in this 184 + representation, '.' denote untouched bytes. In the example above ``0xac`` is 185 + the value written to the invalid address at offset 0, and the remaining '.' 186 + denote that no following bytes have been touched. Note that, real values are 187 + only shown if the kernel was booted with ``no_hash_pointers``; to avoid 188 + information disclosure otherwise, '!' is used instead to denote invalidly 189 + written bytes. 190 + 191 + And finally, KFENCE may also report on invalid accesses to any protected page 192 + where it was not possible to determine an associated object, e.g. 
if adjacent 193 + object pages had not yet been allocated:: 194 + 195 + ================================================================== 196 + BUG: KFENCE: invalid read in test_invalid_access+0x26/0xe0 197 + 198 + Invalid read at 0xffffffffb670b00a: 199 + test_invalid_access+0x26/0xe0 200 + kunit_try_run_case+0x51/0x85 201 + kunit_generic_run_threadfn_adapter+0x16/0x30 202 + kthread+0x137/0x160 203 + ret_from_fork+0x22/0x30 204 + 205 + CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 206 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 207 + ================================================================== 208 + 209 + DebugFS interface 210 + ~~~~~~~~~~~~~~~~~ 211 + 212 + Some debugging information is exposed via debugfs: 213 + 214 + * The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics. 215 + 216 + * The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects 217 + allocated via KFENCE, including those already freed but protected. 218 + 219 + Implementation Details 220 + ---------------------- 221 + 222 + Guarded allocations are set up based on the sample interval. After expiration 223 + of the sample interval, the next allocation through the main allocator (SLAB or 224 + SLUB) returns a guarded allocation from the KFENCE object pool (allocation 225 + sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and 226 + the next allocation is set up after the expiration of the interval. To "gate" a 227 + KFENCE allocation through the main allocator's fast-path without overhead, 228 + KFENCE relies on static branches via the static keys infrastructure. The static 229 + branch is toggled to redirect the allocation to KFENCE. 230 + 231 + KFENCE objects each reside on a dedicated page, at either the left or right 232 + page boundaries selected at random. 
The pages to the left and right of the 233 + object page are "guard pages", whose attributes are changed to a protected 234 + state, and cause page faults on any attempted access. Such page faults are then 235 + intercepted by KFENCE, which handles the fault gracefully by reporting an 236 + out-of-bounds access, and marking the page as accessible so that the faulting 237 + code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead). 238 + 239 + To detect out-of-bounds writes to memory within the object's page itself, 240 + KFENCE also uses pattern-based redzones. For each object page, a redzone is set 241 + up for all non-object memory. For typical alignments, the redzone is only 242 + required on the unguarded side of an object. Because KFENCE must honor the 243 + cache's requested alignment, special alignments may result in unprotected gaps 244 + on either side of an object, all of which are redzoned. 245 + 246 + The following figure illustrates the page layout:: 247 + 248 + ---+-----------+-----------+-----------+-----------+-----------+--- 249 + | xxxxxxxxx | O : | xxxxxxxxx | : O | xxxxxxxxx | 250 + | xxxxxxxxx | B : | xxxxxxxxx | : B | xxxxxxxxx | 251 + | x GUARD x | J : RED- | x GUARD x | RED- : J | x GUARD x | 252 + | xxxxxxxxx | E : ZONE | xxxxxxxxx | ZONE : E | xxxxxxxxx | 253 + | xxxxxxxxx | C : | xxxxxxxxx | : C | xxxxxxxxx | 254 + | xxxxxxxxx | T : | xxxxxxxxx | : T | xxxxxxxxx | 255 + ---+-----------+-----------+-----------+-----------+-----------+--- 256 + 257 + Upon deallocation of a KFENCE object, the object's page is again protected and 258 + the object is marked as freed. Any further access to the object causes a fault 259 + and KFENCE reports a use-after-free access. Freed objects are inserted at the 260 + tail of KFENCE's freelist, so that the least recently freed objects are reused 261 + first, and the chances of detecting use-after-frees of recently freed objects 262 + is increased. 
263 + 264 + Interface 265 + --------- 266 + 267 + The following describes the functions which are used by allocators as well as 268 + page handling code to set up and deal with KFENCE allocations. 269 + 270 + .. kernel-doc:: include/linux/kfence.h 271 + :functions: is_kfence_address 272 + kfence_shutdown_cache 273 + kfence_alloc kfence_free __kfence_free 274 + kfence_ksize kfence_object_start 275 + kfence_handle_page_fault 276 + 277 + Related Tools 278 + ------------- 279 + 280 + In userspace, a similar approach is taken by `GWP-ASan 281 + <http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and 282 + a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is 283 + directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another 284 + similar but non-sampling approach, that also inspired the name "KFENCE", can be 285 + found in the userspace `Electric Fence Malloc Debugger 286 + <https://linux.die.net/man/3/efence>`_. 287 + 288 + In the kernel, several tools exist to debug memory access errors, and in 289 + particular KASAN can detect all bug classes that KFENCE can detect. While KASAN 290 + is more precise, relying on compiler instrumentation, this comes at a 291 + performance cost. 292 + 293 + It is worth highlighting that KASAN and KFENCE are complementary, with 294 + different target environments. For instance, KASAN is the better debugging-aid, 295 + where test cases or reproducers exists: due to the lower chance to detect the 296 + error, it would require more effort using KFENCE to debug. Deployments at scale 297 + that cannot afford to enable KASAN, however, would benefit from using KFENCE to 298 + discover bugs due to code paths not exercised by test cases or fuzzers.
+6
Documentation/filesystems/seq_file.rst
··· 217 217 is a reasonable thing to do. The seq_file code will also avoid taking any 218 218 other locks while the iterator is active. 219 219 220 + The iterator value returned by start() or next() is guaranteed to be 221 + passed to a subsequent next() or stop() call. This allows resources 222 + such as locks that were taken to be reliably released. There is *no* 223 + guarantee that the iterator will be passed to show(), though in practice 224 + it often will be. 225 + 220 226 221 227 Formatted output 222 228 ================
+20 -6
MAINTAINERS
··· 261 261 L: linux-api@vger.kernel.org 262 262 F: include/linux/syscalls.h 263 263 F: kernel/sys_ni.c 264 + F: include/uapi/ 265 + F: arch/*/include/uapi/ 264 266 265 267 ABIT UGURU 1,2 HARDWARE MONITOR DRIVER 266 268 M: Hans de Goede <hdegoede@redhat.com> ··· 2984 2982 F: kernel/audit* 2985 2983 2986 2984 AUXILIARY DISPLAY DRIVERS 2987 - M: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com> 2985 + M: Miguel Ojeda <ojeda@kernel.org> 2988 2986 S: Maintained 2989 2987 F: drivers/auxdisplay/ 2990 2988 F: include/linux/cfag12864b.h ··· 4130 4128 F: scripts/sign-file.c 4131 4129 4132 4130 CFAG12864B LCD DRIVER 4133 - M: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com> 4131 + M: Miguel Ojeda <ojeda@kernel.org> 4134 4132 S: Maintained 4135 4133 F: drivers/auxdisplay/cfag12864b.c 4136 4134 F: include/linux/cfag12864b.h 4137 4135 4138 4136 CFAG12864BFB LCD FRAMEBUFFER DRIVER 4139 - M: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com> 4137 + M: Miguel Ojeda <ojeda@kernel.org> 4140 4138 S: Maintained 4141 4139 F: drivers/auxdisplay/cfag12864bfb.c 4142 4140 F: include/linux/cfag12864b.h ··· 4306 4304 F: drivers/infiniband/hw/usnic/ 4307 4305 4308 4306 CLANG-FORMAT FILE 4309 - M: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> 4307 + M: Miguel Ojeda <ojeda@kernel.org> 4310 4308 S: Maintained 4311 4309 F: .clang-format 4312 4310 ··· 4446 4444 F: drivers/platform/x86/compal-laptop.c 4447 4445 4448 4446 COMPILER ATTRIBUTES 4449 - M: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> 4447 + M: Miguel Ojeda <ojeda@kernel.org> 4450 4448 S: Maintained 4451 4449 F: include/linux/compiler_attributes.h 4452 4450 ··· 9869 9867 F: include/uapi/linux/keyctl.h 9870 9868 F: security/keys/ 9871 9869 9870 + KFENCE 9871 + M: Alexander Potapenko <glider@google.com> 9872 + M: Marco Elver <elver@google.com> 9873 + R: Dmitry Vyukov <dvyukov@google.com> 9874 + L: kasan-dev@googlegroups.com 9875 + S: Maintained 9876 + F: Documentation/dev-tools/kfence.rst 9877 + F: 
arch/*/include/asm/kfence.h 9878 + F: include/linux/kfence.h 9879 + F: lib/Kconfig.kfence 9880 + F: mm/kfence/ 9881 + 9872 9882 KFIFO 9873 9883 M: Stefani Seibold <stefani@seibold.net> 9874 9884 S: Maintained ··· 9941 9927 F: kernel/kprobes.c 9942 9928 9943 9929 KS0108 LCD CONTROLLER DRIVER 9944 - M: Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com> 9930 + M: Miguel Ojeda <ojeda@kernel.org> 9945 9931 S: Maintained 9946 9932 F: Documentation/admin-guide/auxdisplay/ks0108.rst 9947 9933 F: drivers/auxdisplay/ks0108.c
-1
arch/alpha/configs/defconfig
··· 1 - CONFIG_EXPERIMENTAL=y 2 1 CONFIG_SYSVIPC=y 3 2 CONFIG_POSIX_MQUEUE=y 4 3 CONFIG_LOG_BUF_SHIFT=14
+1
arch/arm64/Kconfig
··· 140 140 select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48) 141 141 select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN 142 142 select HAVE_ARCH_KASAN_HW_TAGS if (HAVE_ARCH_KASAN && ARM64_MTE) 143 + select HAVE_ARCH_KFENCE 143 144 select HAVE_ARCH_KGDB 144 145 select HAVE_ARCH_MMAP_RND_BITS 145 146 select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
-1
arch/arm64/include/asm/cache.h
··· 6 6 #define __ASM_CACHE_H 7 7 8 8 #include <asm/cputype.h> 9 - #include <asm/mte-kasan.h> 10 9 11 10 #define CTR_L1IP_SHIFT 14 12 11 #define CTR_L1IP_MASK 3
+1
arch/arm64/include/asm/kasan.h
··· 6 6 7 7 #include <linux/linkage.h> 8 8 #include <asm/memory.h> 9 + #include <asm/mte-kasan.h> 9 10 #include <asm/pgtable-types.h> 10 11 11 12 #define arch_kasan_set_tag(addr, tag) __tag_set(addr, tag)
+22
arch/arm64/include/asm/kfence.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * arm64 KFENCE support. 4 + * 5 + * Copyright (C) 2020, Google LLC. 6 + */ 7 + 8 + #ifndef __ASM_KFENCE_H 9 + #define __ASM_KFENCE_H 10 + 11 + #include <asm/cacheflush.h> 12 + 13 + static inline bool arch_kfence_init_pool(void) { return true; } 14 + 15 + static inline bool kfence_protect_page(unsigned long addr, bool protect) 16 + { 17 + set_memory_valid(addr, 1, !protect); 18 + 19 + return true; 20 + } 21 + 22 + #endif /* __ASM_KFENCE_H */
+2
arch/arm64/include/asm/mte-def.h
··· 11 11 #define MTE_TAG_SIZE 4 12 12 #define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT) 13 13 14 + #define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n" 15 + 14 16 #endif /* __ASM_MTE_DEF_H */
+58 -9
arch/arm64/include/asm/mte-kasan.h
··· 11 11 12 12 #include <linux/types.h> 13 13 14 - /* 15 - * The functions below are meant to be used only for the 16 - * KASAN_HW_TAGS interface defined in asm/memory.h. 17 - */ 18 14 #ifdef CONFIG_ARM64_MTE 15 + 16 + /* 17 + * These functions are meant to be only used from KASAN runtime through 18 + * the arch_*() interface defined in asm/memory.h. 19 + * These functions don't include system_supports_mte() checks, 20 + * as KASAN only calls them when MTE is supported and enabled. 21 + */ 19 22 20 23 static inline u8 mte_get_ptr_tag(void *ptr) 21 24 { ··· 28 25 return tag; 29 26 } 30 27 31 - u8 mte_get_mem_tag(void *addr); 32 - u8 mte_get_random_tag(void); 33 - void *mte_set_mem_tag_range(void *addr, size_t size, u8 tag); 28 + /* Get allocation tag for the address. */ 29 + static inline u8 mte_get_mem_tag(void *addr) 30 + { 31 + asm(__MTE_PREAMBLE "ldg %0, [%0]" 32 + : "+r" (addr)); 33 + 34 + return mte_get_ptr_tag(addr); 35 + } 36 + 37 + /* Generate a random tag. */ 38 + static inline u8 mte_get_random_tag(void) 39 + { 40 + void *addr; 41 + 42 + asm(__MTE_PREAMBLE "irg %0, %0" 43 + : "=r" (addr)); 44 + 45 + return mte_get_ptr_tag(addr); 46 + } 47 + 48 + /* 49 + * Assign allocation tags for a region of memory based on the pointer tag. 50 + * Note: The address must be non-NULL and MTE_GRANULE_SIZE aligned and 51 + * size must be non-zero and MTE_GRANULE_SIZE aligned. 52 + */ 53 + static inline void mte_set_mem_tag_range(void *addr, size_t size, u8 tag) 54 + { 55 + u64 curr, end; 56 + 57 + if (!size) 58 + return; 59 + 60 + curr = (u64)__tag_set(addr, tag); 61 + end = curr + size; 62 + 63 + do { 64 + /* 65 + * 'asm volatile' is required to prevent the compiler to move 66 + * the statement outside of the loop. 
67 + */ 68 + asm volatile(__MTE_PREAMBLE "stg %0, [%0]" 69 + : 70 + : "r" (curr) 71 + : "memory"); 72 + 73 + curr += MTE_GRANULE_SIZE; 74 + } while (curr != end); 75 + } 34 76 35 77 void mte_enable_kernel(void); 36 78 void mte_init_tags(u64 max_tag); ··· 94 46 { 95 47 return 0xFF; 96 48 } 49 + 97 50 static inline u8 mte_get_random_tag(void) 98 51 { 99 52 return 0xFF; 100 53 } 101 - static inline void *mte_set_mem_tag_range(void *addr, size_t size, u8 tag) 54 + 55 + static inline void mte_set_mem_tag_range(void *addr, size_t size, u8 tag) 102 56 { 103 - return addr; 104 57 } 105 58 106 59 static inline void mte_enable_kernel(void)
-2
arch/arm64/include/asm/mte.h
··· 8 8 #include <asm/compiler.h> 9 9 #include <asm/mte-def.h> 10 10 11 - #define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n" 12 - 13 11 #ifndef __ASSEMBLY__ 14 12 15 13 #include <linux/bitfield.h>
-46
arch/arm64/kernel/mte.c
··· 19 19 #include <asm/barrier.h> 20 20 #include <asm/cpufeature.h> 21 21 #include <asm/mte.h> 22 - #include <asm/mte-kasan.h> 23 22 #include <asm/ptrace.h> 24 23 #include <asm/sysreg.h> 25 24 ··· 85 86 return addr1 != addr2; 86 87 87 88 return ret; 88 - } 89 - 90 - u8 mte_get_mem_tag(void *addr) 91 - { 92 - if (!system_supports_mte()) 93 - return 0xFF; 94 - 95 - asm(__MTE_PREAMBLE "ldg %0, [%0]" 96 - : "+r" (addr)); 97 - 98 - return mte_get_ptr_tag(addr); 99 - } 100 - 101 - u8 mte_get_random_tag(void) 102 - { 103 - void *addr; 104 - 105 - if (!system_supports_mte()) 106 - return 0xFF; 107 - 108 - asm(__MTE_PREAMBLE "irg %0, %0" 109 - : "+r" (addr)); 110 - 111 - return mte_get_ptr_tag(addr); 112 - } 113 - 114 - void *mte_set_mem_tag_range(void *addr, size_t size, u8 tag) 115 - { 116 - void *ptr = addr; 117 - 118 - if ((!system_supports_mte()) || (size == 0)) 119 - return addr; 120 - 121 - /* Make sure that size is MTE granule aligned. */ 122 - WARN_ON(size & (MTE_GRANULE_SIZE - 1)); 123 - 124 - /* Make sure that the address is MTE granule aligned. */ 125 - WARN_ON((u64)addr & (MTE_GRANULE_SIZE - 1)); 126 - 127 - tag = 0xF0 | tag; 128 - ptr = (void *)__tag_set(ptr, tag); 129 - 130 - mte_assign_mem_tag_range(ptr, size); 131 - 132 - return ptr; 133 89 } 134 90 135 91 void mte_init_tags(u64 max_tag)
-16
arch/arm64/lib/mte.S
··· 149 149 150 150 ret 151 151 SYM_FUNC_END(mte_restore_page_tags) 152 - 153 - /* 154 - * Assign allocation tags for a region of memory based on the pointer tag 155 - * x0 - source pointer 156 - * x1 - size 157 - * 158 - * Note: The address must be non-NULL and MTE_GRANULE_SIZE aligned and 159 - * size must be non-zero and MTE_GRANULE_SIZE aligned. 160 - */ 161 - SYM_FUNC_START(mte_assign_mem_tag_range) 162 - 1: stg x0, [x0] 163 - add x0, x0, #MTE_GRANULE_SIZE 164 - subs x1, x1, #MTE_GRANULE_SIZE 165 - b.gt 1b 166 - ret 167 - SYM_FUNC_END(mte_assign_mem_tag_range)
+4
arch/arm64/mm/fault.c
··· 10 10 #include <linux/acpi.h> 11 11 #include <linux/bitfield.h> 12 12 #include <linux/extable.h> 13 + #include <linux/kfence.h> 13 14 #include <linux/signal.h> 14 15 #include <linux/mm.h> 15 16 #include <linux/hardirq.h> ··· 390 389 } else if (addr < PAGE_SIZE) { 391 390 msg = "NULL pointer dereference"; 392 391 } else { 392 + if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs)) 393 + return; 394 + 393 395 msg = "paging request"; 394 396 } 395 397
+13 -8
arch/arm64/mm/mmu.c
··· 1444 1444 free_empty_tables(start, end, PAGE_OFFSET, PAGE_END); 1445 1445 } 1446 1446 1447 - static bool inside_linear_region(u64 start, u64 size) 1447 + struct range arch_get_mappable_range(void) 1448 1448 { 1449 + struct range mhp_range; 1450 + 1449 1451 /* 1450 1452 * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)] 1451 1453 * accommodating both its ends but excluding PAGE_END. Max physical 1452 1454 * range which can be mapped inside this linear mapping range, must 1453 1455 * also be derived from its end points. 1454 1456 */ 1455 - return start >= __pa(_PAGE_OFFSET(vabits_actual)) && 1456 - (start + size - 1) <= __pa(PAGE_END - 1); 1457 + mhp_range.start = __pa(_PAGE_OFFSET(vabits_actual)); 1458 + mhp_range.end = __pa(PAGE_END - 1); 1459 + return mhp_range; 1457 1460 } 1458 1461 1459 1462 int arch_add_memory(int nid, u64 start, u64 size, ··· 1464 1461 { 1465 1462 int ret, flags = 0; 1466 1463 1467 - if (!inside_linear_region(start, size)) { 1468 - pr_err("[%llx %llx] is outside linear mapping region\n", start, start + size); 1469 - return -EINVAL; 1470 - } 1464 + VM_BUG_ON(!mhp_range_allowed(start, size, true)); 1471 1465 1472 - if (rodata_full || debug_pagealloc_enabled()) 1466 + /* 1467 + * KFENCE requires linear map to be mapped at page granularity, so that 1468 + * it is possible to protect/unprotect single pages in the KFENCE pool. 1469 + */ 1470 + if (rodata_full || debug_pagealloc_enabled() || 1471 + IS_ENABLED(CONFIG_KFENCE)) 1473 1472 flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; 1474 1473 1475 1474 __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
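arch_get_mappable_range() reduces the old inside_linear_region() test to a generic inclusive-end range that mhp_range_allowed() then checks hotplugged memory against. The containment check itself comes down to this (struct layout per the kernel's struct range; the helper name is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end;	/* inclusive, as in the kernel's struct range */
};

/* Containment test in the spirit of mhp_range_allowed(). */
static bool range_allowed(struct range r, uint64_t start, uint64_t size)
{
	return size && start >= r.start && start + size - 1 <= r.end;
}
```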
+16 -14
arch/mips/mm/cache.c
··· 157 157 EXPORT_SYMBOL(_page_cachable_default); 158 158 159 159 #define PM(p) __pgprot(_page_cachable_default | (p)) 160 + #define PVA(p) PM(_PAGE_VALID | _PAGE_ACCESSED | (p)) 160 161 161 162 static inline void setup_protection_map(void) 162 163 { 163 164 protection_map[0] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_NO_READ); 164 - protection_map[1] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC); 165 - protection_map[2] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_NO_READ); 166 - protection_map[3] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC); 167 - protection_map[4] = PM(_PAGE_PRESENT); 168 - protection_map[5] = PM(_PAGE_PRESENT); 169 - protection_map[6] = PM(_PAGE_PRESENT); 170 - protection_map[7] = PM(_PAGE_PRESENT); 165 + protection_map[1] = PVA(_PAGE_PRESENT | _PAGE_NO_EXEC); 166 + protection_map[2] = PVA(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_NO_READ); 167 + protection_map[3] = PVA(_PAGE_PRESENT | _PAGE_NO_EXEC); 168 + protection_map[4] = PVA(_PAGE_PRESENT); 169 + protection_map[5] = PVA(_PAGE_PRESENT); 170 + protection_map[6] = PVA(_PAGE_PRESENT); 171 + protection_map[7] = PVA(_PAGE_PRESENT); 171 172 172 173 protection_map[8] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_NO_READ); 173 - protection_map[9] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC); 174 - protection_map[10] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_WRITE | 175 176 _PAGE_NO_READ); 176 - protection_map[11] = PM(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_WRITE); 177 - protection_map[12] = PM(_PAGE_PRESENT); 178 - protection_map[13] = PM(_PAGE_PRESENT); 179 - protection_map[14] = PM(_PAGE_PRESENT | _PAGE_WRITE); 180 - protection_map[15] = PM(_PAGE_PRESENT | _PAGE_WRITE); 174 + protection_map[9] = PVA(_PAGE_PRESENT | _PAGE_NO_EXEC); 175 + protection_map[10] = PVA(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_WRITE | 175 176 _PAGE_NO_READ); 177 + protection_map[11] = PVA(_PAGE_PRESENT | _PAGE_NO_EXEC | _PAGE_WRITE); 178 + protection_map[12] = PVA(_PAGE_PRESENT); 179 + protection_map[13] = PVA(_PAGE_PRESENT); 180 + protection_map[14] = PVA(_PAGE_PRESENT | _PAGE_WRITE); 181 + 
protection_map[15] = PVA(_PAGE_PRESENT | _PAGE_WRITE); 181 182 } 182 183 184 + #undef PVA 183 185 #undef PM 184 186 185 187 void cpu_cache_init(void)
+1
arch/s390/mm/init.c
··· 297 297 if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot)) 298 298 return -EINVAL; 299 299 300 + VM_BUG_ON(!mhp_range_allowed(start, size, true)); 300 301 rc = vmem_add_mapping(start, size); 301 302 if (rc) 302 303 return rc;
+13 -1
arch/s390/mm/vmem.c
··· 4 4 * Author(s): Heiko Carstens <heiko.carstens@de.ibm.com> 5 5 */ 6 6 7 + #include <linux/memory_hotplug.h> 7 8 #include <linux/memblock.h> 8 9 #include <linux/pfn.h> 9 10 #include <linux/mm.h> ··· 533 532 mutex_unlock(&vmem_mutex); 534 533 } 535 534 535 + struct range arch_get_mappable_range(void) 536 + { 537 + struct range mhp_range; 538 + 539 + mhp_range.start = 0; 540 + mhp_range.end = VMEM_MAX_PHYS - 1; 541 + return mhp_range; 542 + } 543 + 536 544 int vmem_add_mapping(unsigned long start, unsigned long size) 537 545 { 546 + struct range range = arch_get_mappable_range(); 538 547 int ret; 539 548 540 - if (start + size > VMEM_MAX_PHYS || 549 + if (start < range.start || 550 + start + size > range.end + 1 || 541 551 start + size < start) 542 552 return -ERANGE; 543 553
+1
arch/x86/Kconfig
··· 151 151 select HAVE_ARCH_JUMP_LABEL_RELATIVE 152 152 select HAVE_ARCH_KASAN if X86_64 153 153 select HAVE_ARCH_KASAN_VMALLOC if X86_64 154 + select HAVE_ARCH_KFENCE 154 155 select HAVE_ARCH_KGDB 155 156 select HAVE_ARCH_MMAP_RND_BITS if MMU 156 157 select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
+64
arch/x86/include/asm/kfence.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * x86 KFENCE support. 4 + * 5 + * Copyright (C) 2020, Google LLC. 6 + */ 7 + 8 + #ifndef _ASM_X86_KFENCE_H 9 + #define _ASM_X86_KFENCE_H 10 + 11 + #include <linux/bug.h> 12 + #include <linux/kfence.h> 13 + 14 + #include <asm/pgalloc.h> 15 + #include <asm/pgtable.h> 16 + #include <asm/set_memory.h> 17 + #include <asm/tlbflush.h> 18 + 19 + /* Force 4K pages for __kfence_pool. */ 20 + static inline bool arch_kfence_init_pool(void) 21 + { 22 + unsigned long addr; 23 + 24 + for (addr = (unsigned long)__kfence_pool; is_kfence_address((void *)addr); 25 + addr += PAGE_SIZE) { 26 + unsigned int level; 27 + 28 + if (!lookup_address(addr, &level)) 29 + return false; 30 + 31 + if (level != PG_LEVEL_4K) 32 + set_memory_4k(addr, 1); 33 + } 34 + 35 + return true; 36 + } 37 + 38 + /* Protect the given page and flush TLB. */ 39 + static inline bool kfence_protect_page(unsigned long addr, bool protect) 40 + { 41 + unsigned int level; 42 + pte_t *pte = lookup_address(addr, &level); 43 + 44 + if (WARN_ON(!pte || level != PG_LEVEL_4K)) 45 + return false; 46 + 47 + /* 48 + * We need to avoid IPIs, as we may get KFENCE allocations or faults 49 + * with interrupts disabled. Therefore, the below is best-effort, and 50 + * does not flush TLBs on all CPUs. We can tolerate some inaccuracy; 51 + * lazy fault handling takes care of faults after the page is PRESENT. 52 + */ 53 + 54 + if (protect) 55 + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT)); 56 + else 57 + set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT)); 58 + 59 + /* Flush this CPU's TLB. */ 60 + flush_tlb_one_kernel(addr); 61 + return true; 62 + } 63 + 64 + #endif /* _ASM_X86_KFENCE_H */
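kfence_protect_page() above only toggles the PRESENT bit of an already-looked-up 4K PTE and flushes the local TLB. Stripped of the paging machinery, the bit manipulation reduces to (bit position matches x86's _PAGE_PRESENT at bit 0; the helper is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x1ull	/* x86 _PAGE_PRESENT lives in bit 0 */

/* Mirrors the two set_pte() arms in kfence_protect_page(). */
static uint64_t toggle_present(uint64_t pte, int protect)
{
	return protect ? (pte & ~PTE_PRESENT) : (pte | PTE_PRESENT);
}
```

Clearing PRESENT makes the next access to the page fault, which is how KFENCE catches out-of-bounds and use-after-free accesses into its guard pages.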
+6
arch/x86/mm/fault.c
··· 9 9 #include <linux/kdebug.h> /* oops_begin/end, ... */ 10 10 #include <linux/extable.h> /* search_exception_tables */ 11 11 #include <linux/memblock.h> /* max_low_pfn */ 12 + #include <linux/kfence.h> /* kfence_handle_page_fault */ 12 13 #include <linux/kprobes.h> /* NOKPROBE_SYMBOL, ... */ 13 14 #include <linux/mmiotrace.h> /* kmmio_handler, ... */ 14 15 #include <linux/perf_event.h> /* perf_sw_event */ ··· 680 679 */ 681 680 if (IS_ENABLED(CONFIG_EFI)) 682 681 efi_crash_gracefully_on_page_fault(address); 682 + 683 + /* Only not-present faults should be handled by KFENCE. */ 684 + if (!(error_code & X86_PF_PROT) && 685 + kfence_handle_page_fault(address, error_code & X86_PF_WRITE, regs)) 686 + return; 683 687 684 688 oops: 685 689 /*
+2 -2
arch/x86/mm/pat/memtype.c
··· 1164 1164 1165 1165 static void *memtype_seq_next(struct seq_file *seq, void *v, loff_t *pos) 1166 1166 { 1167 + kfree(v); 1167 1168 ++*pos; 1168 1169 return memtype_get_idx(*pos); 1169 1170 } 1170 1171 1171 1172 static void memtype_seq_stop(struct seq_file *seq, void *v) 1172 1173 { 1174 + kfree(v); 1173 1175 } 1174 1176 1175 1177 static int memtype_seq_show(struct seq_file *seq, void *v) ··· 1182 1180 entry_print->start, 1183 1181 entry_print->end, 1184 1182 cattr_name(entry_print->type)); 1185 - 1186 - kfree(entry_print); 1187 1183 1188 1184 return 0; 1189 1185 }
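The memtype.c fix moves the kfree() of the current entry out of ->show() and into ->next() and ->stop(), matching the seq_file rule that whoever advances or stops the iterator owns the element. A stand-alone sketch of that ownership pattern (the names and the three-element sequence are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

static int live_elems;	/* allocations not yet freed */

static int *get_elem(int pos)
{
	int *v;

	if (pos >= 3)
		return NULL;	/* end of the sequence */
	v = malloc(sizeof(*v));
	*v = pos;
	live_elems++;
	return v;
}

/* ->next() frees the previous element, as memtype_seq_next() now does. */
static int *seq_next(int *v, int *pos)
{
	free(v);
	live_elems--;
	return get_elem(++*pos);
}

/* ->stop() frees whatever element is still in flight. */
static void seq_stop(int *v)
{
	if (v) {
		free(v);
		live_elems--;
	}
}

/* Walk the whole sequence; returns the number of leaked elements. */
static int iterate_all(void)
{
	int pos = 0;
	int *v = get_elem(pos);

	while (v)
		v = seq_next(v, &pos);
	seq_stop(v);
	return live_elems;
}
```

However iteration ends, ->next() reaching the end or ->stop() on an early abort, each element is freed exactly once and never from ->show().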
+2 -2
drivers/auxdisplay/cfag12864b.c
··· 5 5 * Description: cfag12864b LCD driver 6 6 * Depends: ks0108 7 7 * 8 - * Author: Copyright (C) Miguel Ojeda Sandonis 8 + * Author: Copyright (C) Miguel Ojeda <ojeda@kernel.org> 9 9 * Date: 2006-10-31 10 10 */ 11 11 ··· 376 376 module_exit(cfag12864b_exit); 377 377 378 378 MODULE_LICENSE("GPL v2"); 379 - MODULE_AUTHOR("Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com>"); 379 + MODULE_AUTHOR("Miguel Ojeda <ojeda@kernel.org>"); 380 380 MODULE_DESCRIPTION("cfag12864b LCD driver");
+2 -2
drivers/auxdisplay/cfag12864bfb.c
··· 5 5 * Description: cfag12864b LCD framebuffer driver 6 6 * Depends: cfag12864b 7 7 * 8 - * Author: Copyright (C) Miguel Ojeda Sandonis 8 + * Author: Copyright (C) Miguel Ojeda <ojeda@kernel.org> 9 9 * Date: 2006-10-31 10 10 */ 11 11 ··· 171 171 module_exit(cfag12864bfb_exit); 172 172 173 173 MODULE_LICENSE("GPL v2"); 174 - MODULE_AUTHOR("Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com>"); 174 + MODULE_AUTHOR("Miguel Ojeda <ojeda@kernel.org>"); 175 175 MODULE_DESCRIPTION("cfag12864b LCD framebuffer driver");
+2 -2
drivers/auxdisplay/ks0108.c
··· 5 5 * Description: ks0108 LCD Controller driver 6 6 * Depends: parport 7 7 * 8 - * Author: Copyright (C) Miguel Ojeda Sandonis 8 + * Author: Copyright (C) Miguel Ojeda <ojeda@kernel.org> 9 9 * Date: 2006-10-31 10 10 */ 11 11 ··· 182 182 module_exit(ks0108_exit); 183 183 184 184 MODULE_LICENSE("GPL v2"); 185 - MODULE_AUTHOR("Miguel Ojeda Sandonis <miguel.ojeda.sandonis@gmail.com>"); 185 + MODULE_AUTHOR("Miguel Ojeda <ojeda@kernel.org>"); 186 186 MODULE_DESCRIPTION("ks0108 LCD Controller driver"); 187 187
+14 -21
drivers/base/memory.c
··· 35 35 [MMOP_ONLINE_MOVABLE] = "online_movable", 36 36 }; 37 37 38 - int memhp_online_type_from_str(const char *str) 38 + int mhp_online_type_from_str(const char *str) 39 39 { 40 40 int i; 41 41 ··· 253 253 static ssize_t state_store(struct device *dev, struct device_attribute *attr, 254 254 const char *buf, size_t count) 255 255 { 256 - const int online_type = memhp_online_type_from_str(buf); 256 + const int online_type = mhp_online_type_from_str(buf); 257 257 struct memory_block *mem = to_memory_block(dev); 258 258 int ret; 259 259 ··· 290 290 } 291 291 292 292 /* 293 - * phys_device is a bad name for this. What I really want 294 - * is a way to differentiate between memory ranges that 295 - * are part of physical devices that constitute 296 - * a complete removable unit or fru. 297 - * i.e. do these ranges belong to the same physical device, 298 - * s.t. if I offline all of these sections I can then 299 - * remove the physical device? 293 + * Legacy interface that we cannot remove: s390x exposes the storage increment 294 + * covered by a memory block, allowing for identifying which memory blocks 295 + * comprise a storage increment. Since a memory block spans complete 296 + * storage increments nowadays, this interface is basically unused. Other 297 + * archs never exposed != 0. 
300 298 */ 301 299 static ssize_t phys_device_show(struct device *dev, 302 300 struct device_attribute *attr, char *buf) 303 301 { 304 302 struct memory_block *mem = to_memory_block(dev); 303 + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); 305 304 306 - return sysfs_emit(buf, "%d\n", mem->phys_device); 305 + return sysfs_emit(buf, "%d\n", 306 + arch_get_memory_phys_device(start_pfn)); 307 307 } 308 308 309 309 #ifdef CONFIG_MEMORY_HOTREMOVE ··· 387 387 struct device_attribute *attr, char *buf) 388 388 { 389 389 return sysfs_emit(buf, "%s\n", 390 - online_type_to_str[memhp_default_online_type]); 390 + online_type_to_str[mhp_default_online_type]); 391 391 } 392 392 393 393 static ssize_t auto_online_blocks_store(struct device *dev, 394 394 struct device_attribute *attr, 395 395 const char *buf, size_t count) 396 396 { 397 - const int online_type = memhp_online_type_from_str(buf); 397 + const int online_type = mhp_online_type_from_str(buf); 398 398 399 399 if (online_type < 0) 400 400 return -EINVAL; 401 401 402 - memhp_default_online_type = online_type; 402 + mhp_default_online_type = online_type; 403 403 return count; 404 404 } 405 405 ··· 488 488 static DEVICE_ATTR_WO(hard_offline_page); 489 489 #endif 490 490 491 - /* 492 - * Note that phys_device is optional. It is here to allow for 493 - * differentiation between which *physical* devices each 494 - * section belongs to... 495 - */ 491 + /* See phys_device_show(). 
*/ 496 492 int __weak arch_get_memory_phys_device(unsigned long start_pfn) 497 493 { 498 494 return 0; ··· 570 574 static int init_memory_block(unsigned long block_id, unsigned long state) 571 575 { 572 576 struct memory_block *mem; 573 - unsigned long start_pfn; 574 577 int ret = 0; 575 578 576 579 mem = find_memory_block_by_id(block_id); ··· 583 588 584 589 mem->start_section_nr = block_id * sections_per_block; 585 590 mem->state = state; 586 - start_pfn = section_nr_to_pfn(mem->start_section_nr); 587 - mem->phys_device = arch_get_memory_phys_device(start_pfn); 588 591 mem->nid = NUMA_NO_NODE; 589 592 590 593 ret = register_memory(mem);
+1 -1
drivers/block/zram/zram_drv.c
··· 1081 1081 zram->limit_pages << PAGE_SHIFT, 1082 1082 max_used << PAGE_SHIFT, 1083 1083 (u64)atomic64_read(&zram->stats.same_pages), 1084 - pool_stats.pages_compacted, 1084 + atomic_long_read(&pool_stats.pages_compacted), 1085 1085 (u64)atomic64_read(&zram->stats.huge_pages), 1086 1086 (u64)atomic64_read(&zram->stats.huge_pages_since)); 1087 1087 up_read(&zram->init_lock);
+1 -1
drivers/hv/hv_balloon.c
··· 726 726 727 727 nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); 728 728 ret = add_memory(nid, PFN_PHYS((start_pfn)), 729 - (HA_CHUNK << PAGE_SHIFT), MEMHP_MERGE_RESOURCE); 729 + (HA_CHUNK << PAGE_SHIFT), MHP_MERGE_RESOURCE); 730 730 731 731 if (ret) { 732 732 pr_err("hot_add memory failed error is %d\n", ret);
+28 -15
drivers/virtio/virtio_mem.c
··· 623 623 /* Memory might get onlined immediately. */ 624 624 atomic64_add(size, &vm->offline_size); 625 625 rc = add_memory_driver_managed(vm->nid, addr, size, vm->resource_name, 626 - MEMHP_MERGE_RESOURCE); 626 + MHP_MERGE_RESOURCE); 627 627 if (rc) { 628 628 atomic64_sub(size, &vm->offline_size); 629 629 dev_warn(&vm->vdev->dev, "adding memory failed: %d\n", rc); ··· 2222 2222 */ 2223 2223 static void virtio_mem_refresh_config(struct virtio_mem *vm) 2224 2224 { 2225 - const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS; 2225 + const struct range pluggable_range = mhp_get_pluggable_range(true); 2226 2226 uint64_t new_plugged_size, usable_region_size, end_addr; 2227 2227 2228 2228 /* the plugged_size is just a reflection of what _we_ did previously */ ··· 2234 2234 /* calculate the last usable memory block id */ 2235 2235 virtio_cread_le(vm->vdev, struct virtio_mem_config, 2236 2236 usable_region_size, &usable_region_size); 2237 - end_addr = vm->addr + usable_region_size; 2238 - end_addr = min(end_addr, phys_limit); 2237 + end_addr = min(vm->addr + usable_region_size - 1, 2238 + pluggable_range.end); 2239 2239 2240 - if (vm->in_sbm) 2241 - vm->sbm.last_usable_mb_id = 2242 - virtio_mem_phys_to_mb_id(end_addr) - 1; 2243 - else 2244 - vm->bbm.last_usable_bb_id = 2245 - virtio_mem_phys_to_bb_id(vm, end_addr) - 1; 2240 + if (vm->in_sbm) { 2241 + vm->sbm.last_usable_mb_id = virtio_mem_phys_to_mb_id(end_addr); 2242 + if (!IS_ALIGNED(end_addr + 1, memory_block_size_bytes())) 2243 + vm->sbm.last_usable_mb_id--; 2244 + } else { 2245 + vm->bbm.last_usable_bb_id = virtio_mem_phys_to_bb_id(vm, 2246 + end_addr); 2247 + if (!IS_ALIGNED(end_addr + 1, vm->bbm.bb_size)) 2248 + vm->bbm.last_usable_bb_id--; 2249 + } 2250 + /* 2251 + * If we cannot plug any of our device memory (e.g., nothing in the 2252 + * usable region is addressable), the last usable memory block id will 2253 + * be smaller than the first usable memory block id. 
We'll stop 2254 + * attempting to add memory with -ENOSPC from our main loop. 2255 + */ 2246 2256 2247 2257 /* see if there is a request to change the size */ 2248 2258 virtio_cread_le(vm->vdev, struct virtio_mem_config, requested_size, ··· 2374 2364 2375 2365 static int virtio_mem_init(struct virtio_mem *vm) 2376 2366 { 2377 - const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS; 2367 + const struct range pluggable_range = mhp_get_pluggable_range(true); 2378 2368 uint64_t sb_size, addr; 2379 2369 uint16_t node_id; 2380 2370 ··· 2415 2405 if (!IS_ALIGNED(vm->addr + vm->region_size, memory_block_size_bytes())) 2416 2406 dev_warn(&vm->vdev->dev, 2417 2407 "The alignment of the physical end address can make some memory unusable.\n"); 2418 - if (vm->addr + vm->region_size > phys_limit) 2408 + if (vm->addr < pluggable_range.start || 2409 + vm->addr + vm->region_size - 1 > pluggable_range.end) 2419 2410 dev_warn(&vm->vdev->dev, 2420 - "Some memory is not addressable. This can make some memory unusable.\n"); 2411 + "Some device memory is not addressable/pluggable. This can make some memory unusable.\n"); 2421 2412 2422 2413 /* 2423 2414 * We want subblocks to span at least MAX_ORDER_NR_PAGES and ··· 2440 2429 vm->sbm.sb_size; 2441 2430 2442 2431 /* Round up to the next full memory block */ 2443 - addr = vm->addr + memory_block_size_bytes() - 1; 2432 + addr = max_t(uint64_t, vm->addr, pluggable_range.start) + 2433 + memory_block_size_bytes() - 1; 2444 2434 vm->sbm.first_mb_id = virtio_mem_phys_to_mb_id(addr); 2445 2435 vm->sbm.next_mb_id = vm->sbm.first_mb_id; 2446 2436 } else { ··· 2462 2450 } 2463 2451 2464 2452 /* Round up to the next aligned big block */ 2465 - addr = vm->addr + vm->bbm.bb_size - 1; 2453 + addr = max_t(uint64_t, vm->addr, pluggable_range.start) + 2454 + vm->bbm.bb_size - 1; 2466 2455 vm->bbm.first_bb_id = virtio_mem_phys_to_bb_id(vm, addr); 2467 2456 vm->bbm.next_bb_id = vm->bbm.first_bb_id; 2468 2457 }
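The last-usable-id computation in virtio_mem_refresh_config() takes the block containing end_addr and steps back one block when the usable region does not end on a block boundary. The arithmetic in isolation (block size illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Last block id fully contained in [0, end_addr], per the diff's logic. */
static uint64_t last_usable_id(uint64_t end_addr, uint64_t block_size)
{
	uint64_t id = end_addr / block_size;

	if ((end_addr + 1) % block_size)	/* region ends mid-block */
		id--;
	return id;
}
```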
+1 -1
drivers/xen/balloon.c
··· 331 331 mutex_unlock(&balloon_mutex); 332 332 /* add_memory_resource() requires the device_hotplug lock */ 333 333 lock_device_hotplug(); 334 - rc = add_memory_resource(nid, resource, MEMHP_MERGE_RESOURCE); 334 + rc = add_memory_resource(nid, resource, MHP_MERGE_RESOURCE); 335 335 unlock_device_hotplug(); 336 336 mutex_lock(&balloon_mutex); 337 337
+2 -2
fs/coredump.c
··· 897 897 */ 898 898 page = get_dump_page(addr); 899 899 if (page) { 900 - void *kaddr = kmap(page); 900 + void *kaddr = kmap_local_page(page); 901 901 902 902 stop = !dump_emit(cprm, kaddr, PAGE_SIZE); 903 - kunmap(page); 903 + kunmap_local(kaddr); 904 904 put_page(page); 905 905 } else { 906 906 stop = !dump_skip(cprm, PAGE_SIZE);
+11 -114
fs/iomap/seek.c
··· 10 10 #include <linux/pagemap.h> 11 11 #include <linux/pagevec.h> 12 12 13 - /* 14 - * Seek for SEEK_DATA / SEEK_HOLE within @page, starting at @lastoff. 15 - * Returns true if found and updates @lastoff to the offset in file. 16 - */ 17 - static bool 18 - page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff, 19 - int whence) 20 - { 21 - const struct address_space_operations *ops = inode->i_mapping->a_ops; 22 - unsigned int bsize = i_blocksize(inode), off; 23 - bool seek_data = whence == SEEK_DATA; 24 - loff_t poff = page_offset(page); 25 - 26 - if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE)) 27 - return false; 28 - 29 - if (*lastoff < poff) { 30 - /* 31 - * Last offset smaller than the start of the page means we found 32 - * a hole: 33 - */ 34 - if (whence == SEEK_HOLE) 35 - return true; 36 - *lastoff = poff; 37 - } 38 - 39 - /* 40 - * Just check the page unless we can and should check block ranges: 41 - */ 42 - if (bsize == PAGE_SIZE || !ops->is_partially_uptodate) 43 - return PageUptodate(page) == seek_data; 44 - 45 - lock_page(page); 46 - if (unlikely(page->mapping != inode->i_mapping)) 47 - goto out_unlock_not_found; 48 - 49 - for (off = 0; off < PAGE_SIZE; off += bsize) { 50 - if (offset_in_page(*lastoff) >= off + bsize) 51 - continue; 52 - if (ops->is_partially_uptodate(page, off, bsize) == seek_data) { 53 - unlock_page(page); 54 - return true; 55 - } 56 - *lastoff = poff + off + bsize; 57 - } 58 - 59 - out_unlock_not_found: 60 - unlock_page(page); 61 - return false; 62 - } 63 - 64 - /* 65 - * Seek for SEEK_DATA / SEEK_HOLE in the page cache. 66 - * 67 - * Within unwritten extents, the page cache determines which parts are holes 68 - * and which are data: uptodate buffer heads count as data; everything else 69 - * counts as a hole. 70 - * 71 - * Returns the resulting offset on successs, and -ENOENT otherwise. 
72 - */ 73 13 static loff_t 74 - page_cache_seek_hole_data(struct inode *inode, loff_t offset, loff_t length, 75 - int whence) 76 - { 77 - pgoff_t index = offset >> PAGE_SHIFT; 78 - pgoff_t end = DIV_ROUND_UP(offset + length, PAGE_SIZE); 79 - loff_t lastoff = offset; 80 - struct pagevec pvec; 81 - 82 - if (length <= 0) 83 - return -ENOENT; 84 - 85 - pagevec_init(&pvec); 86 - 87 - do { 88 - unsigned nr_pages, i; 89 - 90 - nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index, 91 - end - 1); 92 - if (nr_pages == 0) 93 - break; 94 - 95 - for (i = 0; i < nr_pages; i++) { 96 - struct page *page = pvec.pages[i]; 97 - 98 - if (page_seek_hole_data(inode, page, &lastoff, whence)) 99 - goto check_range; 100 - lastoff = page_offset(page) + PAGE_SIZE; 101 - } 102 - pagevec_release(&pvec); 103 - } while (index < end); 104 - 105 - /* When no page at lastoff and we are not done, we found a hole. */ 106 - if (whence != SEEK_HOLE) 107 - goto not_found; 108 - 109 - check_range: 110 - if (lastoff < offset + length) 111 - goto out; 112 - not_found: 113 - lastoff = -ENOENT; 114 - out: 115 - pagevec_release(&pvec); 116 - return lastoff; 117 - } 118 - 119 - 120 - static loff_t 121 - iomap_seek_hole_actor(struct inode *inode, loff_t offset, loff_t length, 14 + iomap_seek_hole_actor(struct inode *inode, loff_t start, loff_t length, 122 15 void *data, struct iomap *iomap, struct iomap *srcmap) 123 16 { 17 + loff_t offset = start; 18 + 124 19 switch (iomap->type) { 125 20 case IOMAP_UNWRITTEN: 126 - offset = page_cache_seek_hole_data(inode, offset, length, 127 - SEEK_HOLE); 128 - if (offset < 0) 21 + offset = mapping_seek_hole_data(inode->i_mapping, start, 22 + start + length, SEEK_HOLE); 23 + if (offset == start + length) 129 24 return length; 130 25 fallthrough; 131 26 case IOMAP_HOLE: ··· 59 164 EXPORT_SYMBOL_GPL(iomap_seek_hole); 60 165 61 166 static loff_t 62 - iomap_seek_data_actor(struct inode *inode, loff_t offset, loff_t length, 167 + iomap_seek_data_actor(struct inode 
*inode, loff_t start, loff_t length, 63 168 void *data, struct iomap *iomap, struct iomap *srcmap) 64 169 { 170 + loff_t offset = start; 171 + 65 172 switch (iomap->type) { 66 173 case IOMAP_HOLE: 67 174 return length; 68 175 case IOMAP_UNWRITTEN: 69 - offset = page_cache_seek_hole_data(inode, offset, length, 70 - SEEK_DATA); 176 + offset = mapping_seek_hole_data(inode->i_mapping, start, 177 + start + length, SEEK_DATA); 71 178 if (offset < 0) 72 179 return length; 73 180 fallthrough;
+8 -11
fs/proc/base.c
··· 67 67 #include <linux/mm.h> 68 68 #include <linux/swap.h> 69 69 #include <linux/rcupdate.h> 70 - #include <linux/kallsyms.h> 71 70 #include <linux/stacktrace.h> 72 71 #include <linux/resource.h> 73 72 #include <linux/module.h> ··· 385 386 struct pid *pid, struct task_struct *task) 386 387 { 387 388 unsigned long wchan; 388 - char symname[KSYM_NAME_LEN]; 389 389 390 - if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) 391 - goto print0; 390 + if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) 391 + wchan = get_wchan(task); 392 + else 393 + wchan = 0; 392 394 393 - wchan = get_wchan(task); 394 - if (wchan && !lookup_symbol_name(wchan, symname)) { 395 - seq_puts(m, symname); 396 - return 0; 397 - } 395 + if (wchan) 396 + seq_printf(m, "%ps", (void *) wchan); 397 + else 398 + seq_putc(m, '0'); 398 399 399 - print0: 400 - seq_putc(m, '0'); 401 400 return 0; 402 401 } 403 402 #endif /* CONFIG_KALLSYMS */
+2 -2
fs/proc/proc_sysctl.c
··· 571 571 error = -ENOMEM; 572 572 if (count >= KMALLOC_MAX_SIZE) 573 573 goto out; 574 - kbuf = kzalloc(count + 1, GFP_KERNEL); 574 + kbuf = kvzalloc(count + 1, GFP_KERNEL); 575 575 if (!kbuf) 576 576 goto out; 577 577 ··· 600 600 601 601 error = count; 602 602 out_free_buf: 603 - kfree(kbuf); 603 + kvfree(kbuf); 604 604 out: 605 605 sysctl_head_finish(head); 606 606
+1 -1
include/linux/bitops.h
··· 214 214 * __ffs64 - find first set bit in a 64 bit word 215 215 * @word: The 64 bit word 216 216 * 217 - * On 64 bit arches this is a synomyn for __ffs 217 + * On 64 bit arches this is a synonym for __ffs 218 218 * The result is not defined if no bits are set, so check that @word 219 219 * is non-zero before calling this. 220 220 */
+1 -1
include/linux/cfag12864b.h
··· 4 4 * Version: 0.1.0 5 5 * Description: cfag12864b LCD driver header 6 6 * 7 - * Author: Copyright (C) Miguel Ojeda Sandonis 7 + * Author: Copyright (C) Miguel Ojeda <ojeda@kernel.org> 8 8 * Date: 2006-10-12 9 9 */ 10 10
+1 -1
include/linux/cred.h
··· 25 25 struct group_info { 26 26 atomic_t usage; 27 27 int ngroups; 28 - kgid_t gid[0]; 28 + kgid_t gid[]; 29 29 } __randomize_layout; 30 30 31 31 /**
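The cred.h change swaps the old GCC zero-length array `gid[0]` for a C99 flexible array member; the allocation pattern is unchanged. A minimal sketch (the allocator helper is illustrative; the kernel allocates with the same header-plus-trailing-array size formula):

```c
#include <assert.h>
#include <stdlib.h>

struct group_info {
	int usage;
	int ngroups;
	unsigned int gid[];	/* C99 flexible array member, as in the diff */
};

/* Allocate space for the header plus ngroups trailing gids. */
static struct group_info *alloc_group_info(int ngroups)
{
	struct group_info *gi;

	gi = calloc(1, sizeof(*gi) + ngroups * sizeof(gi->gid[0]));
	if (gi)
		gi->ngroups = ngroups;
	return gi;
}
```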
+302
include/linux/fortify-string.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _LINUX_FORTIFY_STRING_H_ 3 + #define _LINUX_FORTIFY_STRING_H_ 4 + 5 + 6 + #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) 7 + extern void *__underlying_memchr(const void *p, int c, __kernel_size_t size) __RENAME(memchr); 8 + extern int __underlying_memcmp(const void *p, const void *q, __kernel_size_t size) __RENAME(memcmp); 9 + extern void *__underlying_memcpy(void *p, const void *q, __kernel_size_t size) __RENAME(memcpy); 10 + extern void *__underlying_memmove(void *p, const void *q, __kernel_size_t size) __RENAME(memmove); 11 + extern void *__underlying_memset(void *p, int c, __kernel_size_t size) __RENAME(memset); 12 + extern char *__underlying_strcat(char *p, const char *q) __RENAME(strcat); 13 + extern char *__underlying_strcpy(char *p, const char *q) __RENAME(strcpy); 14 + extern __kernel_size_t __underlying_strlen(const char *p) __RENAME(strlen); 15 + extern char *__underlying_strncat(char *p, const char *q, __kernel_size_t count) __RENAME(strncat); 16 + extern char *__underlying_strncpy(char *p, const char *q, __kernel_size_t size) __RENAME(strncpy); 17 + #else 18 + #define __underlying_memchr __builtin_memchr 19 + #define __underlying_memcmp __builtin_memcmp 20 + #define __underlying_memcpy __builtin_memcpy 21 + #define __underlying_memmove __builtin_memmove 22 + #define __underlying_memset __builtin_memset 23 + #define __underlying_strcat __builtin_strcat 24 + #define __underlying_strcpy __builtin_strcpy 25 + #define __underlying_strlen __builtin_strlen 26 + #define __underlying_strncat __builtin_strncat 27 + #define __underlying_strncpy __builtin_strncpy 28 + #endif 29 + 30 + __FORTIFY_INLINE char *strncpy(char *p, const char *q, __kernel_size_t size) 31 + { 32 + size_t p_size = __builtin_object_size(p, 1); 33 + 34 + if (__builtin_constant_p(size) && p_size < size) 35 + __write_overflow(); 36 + if (p_size < size) 37 + fortify_panic(__func__); 38 + return __underlying_strncpy(p, 
q, size); 39 + } 40 + 41 + __FORTIFY_INLINE char *strcat(char *p, const char *q) 42 + { 43 + size_t p_size = __builtin_object_size(p, 1); 44 + 45 + if (p_size == (size_t)-1) 46 + return __underlying_strcat(p, q); 47 + if (strlcat(p, q, p_size) >= p_size) 48 + fortify_panic(__func__); 49 + return p; 50 + } 51 + 52 + __FORTIFY_INLINE __kernel_size_t strlen(const char *p) 53 + { 54 + __kernel_size_t ret; 55 + size_t p_size = __builtin_object_size(p, 1); 56 + 57 + /* Work around gcc excess stack consumption issue */ 58 + if (p_size == (size_t)-1 || 59 + (__builtin_constant_p(p[p_size - 1]) && p[p_size - 1] == '\0')) 60 + return __underlying_strlen(p); 61 + ret = strnlen(p, p_size); 62 + if (p_size <= ret) 63 + fortify_panic(__func__); 64 + return ret; 65 + } 66 + 67 + extern __kernel_size_t __real_strnlen(const char *, __kernel_size_t) __RENAME(strnlen); 68 + __FORTIFY_INLINE __kernel_size_t strnlen(const char *p, __kernel_size_t maxlen) 69 + { 70 + size_t p_size = __builtin_object_size(p, 1); 71 + __kernel_size_t ret = __real_strnlen(p, maxlen < p_size ? maxlen : p_size); 72 + 73 + if (p_size <= ret && maxlen != ret) 74 + fortify_panic(__func__); 75 + return ret; 76 + } 77 + 78 + /* defined after fortified strlen to reuse it */ 79 + extern size_t __real_strlcpy(char *, const char *, size_t) __RENAME(strlcpy); 80 + __FORTIFY_INLINE size_t strlcpy(char *p, const char *q, size_t size) 81 + { 82 + size_t ret; 83 + size_t p_size = __builtin_object_size(p, 1); 84 + size_t q_size = __builtin_object_size(q, 1); 85 + 86 + if (p_size == (size_t)-1 && q_size == (size_t)-1) 87 + return __real_strlcpy(p, q, size); 88 + ret = strlen(q); 89 + if (size) { 90 + size_t len = (ret >= size) ? 
size - 1 : ret; 91 + 92 + if (__builtin_constant_p(len) && len >= p_size) 93 + __write_overflow(); 94 + if (len >= p_size) 95 + fortify_panic(__func__); 96 + __underlying_memcpy(p, q, len); 97 + p[len] = '\0'; 98 + } 99 + return ret; 100 + } 101 + 102 + /* defined after fortified strnlen to reuse it */ 103 + extern ssize_t __real_strscpy(char *, const char *, size_t) __RENAME(strscpy); 104 + __FORTIFY_INLINE ssize_t strscpy(char *p, const char *q, size_t size) 105 + { 106 + size_t len; 107 + /* Use string size rather than possible enclosing struct size. */ 108 + size_t p_size = __builtin_object_size(p, 1); 109 + size_t q_size = __builtin_object_size(q, 1); 110 + 111 + /* If we cannot get size of p and q default to call strscpy. */ 112 + if (p_size == (size_t) -1 && q_size == (size_t) -1) 113 + return __real_strscpy(p, q, size); 114 + 115 + /* 116 + * If size can be known at compile time and is greater than 117 + * p_size, generate a compile time write overflow error. 118 + */ 119 + if (__builtin_constant_p(size) && size > p_size) 120 + __write_overflow(); 121 + 122 + /* 123 + * This call protects from read overflow, because len will default to q 124 + * length if it is smaller than size. 125 + */ 126 + len = strnlen(q, size); 127 + /* 128 + * If len equals size, we will copy only size bytes which leads to 129 + * -E2BIG being returned. 130 + * Otherwise we will copy len + 1 because of the final '\0'. 131 + */ 132 + len = len == size ? size : len + 1; 133 + 134 + /* 135 + * Generate a runtime write overflow error if len is greater than 136 + * p_size. 137 + */ 138 + if (len > p_size) 139 + fortify_panic(__func__); 140 + 141 + /* 142 + * We can now safely call vanilla strscpy because we are protected from: 143 + * 1. Read overflow thanks to call to strnlen(). 144 + * 2. Write overflow thanks to above ifs. 
145 + */ 146 + return __real_strscpy(p, q, len); 147 + } 148 + 149 + /* defined after fortified strlen and strnlen to reuse them */ 150 + __FORTIFY_INLINE char *strncat(char *p, const char *q, __kernel_size_t count) 151 + { 152 + size_t p_len, copy_len; 153 + size_t p_size = __builtin_object_size(p, 1); 154 + size_t q_size = __builtin_object_size(q, 1); 155 + 156 + if (p_size == (size_t)-1 && q_size == (size_t)-1) 157 + return __underlying_strncat(p, q, count); 158 + p_len = strlen(p); 159 + copy_len = strnlen(q, count); 160 + if (p_size < p_len + copy_len + 1) 161 + fortify_panic(__func__); 162 + __underlying_memcpy(p + p_len, q, copy_len); 163 + p[p_len + copy_len] = '\0'; 164 + return p; 165 + } 166 + 167 + __FORTIFY_INLINE void *memset(void *p, int c, __kernel_size_t size) 168 + { 169 + size_t p_size = __builtin_object_size(p, 0); 170 + 171 + if (__builtin_constant_p(size) && p_size < size) 172 + __write_overflow(); 173 + if (p_size < size) 174 + fortify_panic(__func__); 175 + return __underlying_memset(p, c, size); 176 + } 177 + 178 + __FORTIFY_INLINE void *memcpy(void *p, const void *q, __kernel_size_t size) 179 + { 180 + size_t p_size = __builtin_object_size(p, 0); 181 + size_t q_size = __builtin_object_size(q, 0); 182 + 183 + if (__builtin_constant_p(size)) { 184 + if (p_size < size) 185 + __write_overflow(); 186 + if (q_size < size) 187 + __read_overflow2(); 188 + } 189 + if (p_size < size || q_size < size) 190 + fortify_panic(__func__); 191 + return __underlying_memcpy(p, q, size); 192 + } 193 + 194 + __FORTIFY_INLINE void *memmove(void *p, const void *q, __kernel_size_t size) 195 + { 196 + size_t p_size = __builtin_object_size(p, 0); 197 + size_t q_size = __builtin_object_size(q, 0); 198 + 199 + if (__builtin_constant_p(size)) { 200 + if (p_size < size) 201 + __write_overflow(); 202 + if (q_size < size) 203 + __read_overflow2(); 204 + } 205 + if (p_size < size || q_size < size) 206 + fortify_panic(__func__); 207 + return __underlying_memmove(p, q, size); 
208 + } 209 + 210 + extern void *__real_memscan(void *, int, __kernel_size_t) __RENAME(memscan); 211 + __FORTIFY_INLINE void *memscan(void *p, int c, __kernel_size_t size) 212 + { 213 + size_t p_size = __builtin_object_size(p, 0); 214 + 215 + if (__builtin_constant_p(size) && p_size < size) 216 + __read_overflow(); 217 + if (p_size < size) 218 + fortify_panic(__func__); 219 + return __real_memscan(p, c, size); 220 + } 221 + 222 + __FORTIFY_INLINE int memcmp(const void *p, const void *q, __kernel_size_t size) 223 + { 224 + size_t p_size = __builtin_object_size(p, 0); 225 + size_t q_size = __builtin_object_size(q, 0); 226 + 227 + if (__builtin_constant_p(size)) { 228 + if (p_size < size) 229 + __read_overflow(); 230 + if (q_size < size) 231 + __read_overflow2(); 232 + } 233 + if (p_size < size || q_size < size) 234 + fortify_panic(__func__); 235 + return __underlying_memcmp(p, q, size); 236 + } 237 + 238 + __FORTIFY_INLINE void *memchr(const void *p, int c, __kernel_size_t size) 239 + { 240 + size_t p_size = __builtin_object_size(p, 0); 241 + 242 + if (__builtin_constant_p(size) && p_size < size) 243 + __read_overflow(); 244 + if (p_size < size) 245 + fortify_panic(__func__); 246 + return __underlying_memchr(p, c, size); 247 + } 248 + 249 + void *__real_memchr_inv(const void *s, int c, size_t n) __RENAME(memchr_inv); 250 + __FORTIFY_INLINE void *memchr_inv(const void *p, int c, size_t size) 251 + { 252 + size_t p_size = __builtin_object_size(p, 0); 253 + 254 + if (__builtin_constant_p(size) && p_size < size) 255 + __read_overflow(); 256 + if (p_size < size) 257 + fortify_panic(__func__); 258 + return __real_memchr_inv(p, c, size); 259 + } 260 + 261 + extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kmemdup); 262 + __FORTIFY_INLINE void *kmemdup(const void *p, size_t size, gfp_t gfp) 263 + { 264 + size_t p_size = __builtin_object_size(p, 0); 265 + 266 + if (__builtin_constant_p(size) && p_size < size) 267 + __read_overflow(); 268 + if 
(p_size < size) 269 + fortify_panic(__func__); 270 + return __real_kmemdup(p, size, gfp); 271 + } 272 + 273 + /* defined after fortified strlen and memcpy to reuse them */ 274 + __FORTIFY_INLINE char *strcpy(char *p, const char *q) 275 + { 276 + size_t p_size = __builtin_object_size(p, 1); 277 + size_t q_size = __builtin_object_size(q, 1); 278 + size_t size; 279 + 280 + if (p_size == (size_t)-1 && q_size == (size_t)-1) 281 + return __underlying_strcpy(p, q); 282 + size = strlen(q) + 1; 283 + /* test here to use the more stringent object size */ 284 + if (p_size < size) 285 + fortify_panic(__func__); 286 + memcpy(p, q, size); 287 + return p; 288 + } 289 + 290 + /* Don't use these outside the FORTIFY_SOURCE implementation */ 291 + #undef __underlying_memchr 292 + #undef __underlying_memcmp 293 + #undef __underlying_memcpy 294 + #undef __underlying_memmove 295 + #undef __underlying_memset 296 + #undef __underlying_strcat 297 + #undef __underlying_strcpy 298 + #undef __underlying_strlen 299 + #undef __underlying_strncat 300 + #undef __underlying_strncpy 301 + 302 + #endif /* _LINUX_FORTIFY_STRING_H_ */
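The fortified routines in the new fortify-string.h all follow one pattern: compare the compiler-known destination size against the requested length, raise a compile-time error when both are constants, and otherwise fall back to a runtime fortify_panic(). A minimal userspace sketch of the runtime half of that check (the name fortified_memcpy is made up; the kernel derives p_size from __builtin_object_size() at compile time, and fortify_panic() never returns, whereas this sketch takes the size explicitly and reports failure instead):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative stand-in for the fortified memcpy() above. Returns
 * false instead of panicking when the destination is too small.
 */
bool fortified_memcpy(void *p, size_t p_size, const void *q, size_t size)
{
	if (p_size < size)
		return false;	/* kernel: fortify_panic(__func__) */
	memcpy(p, q, size);
	return true;
}
```

The compile-time branch (`__builtin_constant_p(size) && p_size < size`) additionally turns provable overflows into build errors via `__write_overflow()`.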
+2
include/linux/gfp.h
··· 634 634 extern void pm_restrict_gfp_mask(void); 635 635 extern void pm_restore_gfp_mask(void); 636 636 637 + extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma); 638 + 637 639 #ifdef CONFIG_PM_SLEEP 638 640 extern bool pm_suspended_storage(void); 639 641 #else
+2 -2
include/linux/init.h
··· 338 338 var = 1; \ 339 339 return 0; \ 340 340 } \ 341 - __setup_param(str_on, parse_##var##_on, parse_##var##_on, 1); \ 341 + early_param(str_on, parse_##var##_on); \ 342 342 \ 343 343 static int __init parse_##var##_off(char *arg) \ 344 344 { \ 345 345 var = 0; \ 346 346 return 0; \ 347 347 } \ 348 - __setup_param(str_off, parse_##var##_off, parse_##var##_off, 1) 348 + early_param(str_off, parse_##var##_off) 349 349 350 350 /* Relies on boot_command_line being set */ 351 351 void __init parse_early_param(void);
+17 -8
include/linux/kasan.h
··· 83 83 struct kasan_cache { 84 84 int alloc_meta_offset; 85 85 int free_meta_offset; 86 + bool is_kmalloc; 86 87 }; 87 88 88 89 #ifdef CONFIG_KASAN_HW_TAGS ··· 144 143 __kasan_cache_create(cache, size, flags); 145 144 } 146 145 146 + void __kasan_cache_create_kmalloc(struct kmem_cache *cache); 147 + static __always_inline void kasan_cache_create_kmalloc(struct kmem_cache *cache) 148 + { 149 + if (kasan_enabled()) 150 + __kasan_cache_create_kmalloc(cache); 151 + } 152 + 147 153 size_t __kasan_metadata_size(struct kmem_cache *cache); 148 154 static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache) 149 155 { ··· 200 192 return false; 201 193 } 202 194 195 + void __kasan_kfree_large(void *ptr, unsigned long ip); 196 + static __always_inline void kasan_kfree_large(void *ptr) 197 + { 198 + if (kasan_enabled()) 199 + __kasan_kfree_large(ptr, _RET_IP_); 200 + } 201 + 203 202 void __kasan_slab_free_mempool(void *ptr, unsigned long ip); 204 203 static __always_inline void kasan_slab_free_mempool(void *ptr) 205 204 { ··· 254 239 return (void *)object; 255 240 } 256 241 257 - void __kasan_kfree_large(void *ptr, unsigned long ip); 258 - static __always_inline void kasan_kfree_large(void *ptr) 259 - { 260 - if (kasan_enabled()) 261 - __kasan_kfree_large(ptr, _RET_IP_); 262 - } 263 - 264 242 /* 265 243 * Unlike kasan_check_read/write(), kasan_check_byte() is performed even for 266 244 * the hardware tag-based mode that doesn't rely on compiler instrumentation. 
··· 286 278 static inline void kasan_cache_create(struct kmem_cache *cache, 287 279 unsigned int *size, 288 280 slab_flags_t *flags) {} 281 + static inline void kasan_cache_create_kmalloc(struct kmem_cache *cache) {} 289 282 static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } 290 283 static inline void kasan_poison_slab(struct page *page) {} 291 284 static inline void kasan_unpoison_object_data(struct kmem_cache *cache, ··· 302 293 { 303 294 return false; 304 295 } 296 + static inline void kasan_kfree_large(void *ptr) {} 305 297 static inline void kasan_slab_free_mempool(void *ptr) {} 306 298 static inline void *kasan_slab_alloc(struct kmem_cache *s, void *object, 307 299 gfp_t flags) ··· 323 313 { 324 314 return (void *)object; 325 315 } 326 - static inline void kasan_kfree_large(void *ptr) {} 327 316 static inline bool kasan_check_byte(const void *address) 328 317 { 329 318 return true;
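The kasan.h hunks above keep every hook as an __always_inline wrapper that tests kasan_enabled() before branching to the out-of-line __kasan_* worker, so disabled builds pay only a cheap, predictable branch. A toy model of that pattern (a plain bool stands in for the kernel's static key, and all demo_* names are made up):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for kasan_enabled(); the kernel uses a static key. */
bool demo_kasan_enabled;

/* Counts how often the out-of-line slow path actually runs. */
int demo_slow_path_calls;

/* Stand-in for the out-of-line __kasan_kfree_large(). */
void __demo_kfree_large(void *ptr, unsigned long ip)
{
	(void)ptr;
	(void)ip;
	demo_slow_path_calls++;	/* expensive checking would live here */
}

/* Inline wrapper mirroring kasan_kfree_large() above. */
static inline void demo_kfree_large(void *ptr)
{
	if (demo_kasan_enabled)
		__demo_kfree_large(ptr, 0 /* kernel passes _RET_IP_ */);
}
```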
+222
include/linux/kfence.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Kernel Electric-Fence (KFENCE). Public interface for allocator and fault 4 + * handler integration. For more info see Documentation/dev-tools/kfence.rst. 5 + * 6 + * Copyright (C) 2020, Google LLC. 7 + */ 8 + 9 + #ifndef _LINUX_KFENCE_H 10 + #define _LINUX_KFENCE_H 11 + 12 + #include <linux/mm.h> 13 + #include <linux/types.h> 14 + 15 + #ifdef CONFIG_KFENCE 16 + 17 + /* 18 + * We allocate an even number of pages, as it simplifies calculations to map 19 + * address to metadata indices; effectively, the very first page serves as an 20 + * extended guard page, but otherwise has no special purpose. 21 + */ 22 + #define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE) 23 + extern char *__kfence_pool; 24 + 25 + #ifdef CONFIG_KFENCE_STATIC_KEYS 26 + #include <linux/static_key.h> 27 + DECLARE_STATIC_KEY_FALSE(kfence_allocation_key); 28 + #else 29 + #include <linux/atomic.h> 30 + extern atomic_t kfence_allocation_gate; 31 + #endif 32 + 33 + /** 34 + * is_kfence_address() - check if an address belongs to KFENCE pool 35 + * @addr: address to check 36 + * 37 + * Return: true or false depending on whether the address is within the KFENCE 38 + * object range. 39 + * 40 + * KFENCE objects live in a separate page range and are not to be intermixed 41 + * with regular heap objects (e.g. KFENCE objects must never be added to the 42 + * allocator freelists). Failing to do so may and will result in heap 43 + * corruptions, therefore is_kfence_address() must be used to check whether 44 + * an object requires specific handling. 45 + * 46 + * Note: This function may be used in fast-paths, and is performance critical. 47 + * Future changes should take this into account; for instance, we want to avoid 48 + * introducing another load and therefore need to keep KFENCE_POOL_SIZE a 49 + * constant (until immediate patching support is added to the kernel). 
50 + */ 51 + static __always_inline bool is_kfence_address(const void *addr) 52 + { 53 + /* 54 + * The non-NULL check is required in case the __kfence_pool pointer was 55 + * never initialized; keep it in the slow-path after the range-check. 56 + */ 57 + return unlikely((unsigned long)((char *)addr - __kfence_pool) < KFENCE_POOL_SIZE && addr); 58 + } 59 + 60 + /** 61 + * kfence_alloc_pool() - allocate the KFENCE pool via memblock 62 + */ 63 + void __init kfence_alloc_pool(void); 64 + 65 + /** 66 + * kfence_init() - perform KFENCE initialization at boot time 67 + * 68 + * Requires that kfence_alloc_pool() was called before. This sets up the 69 + * allocation gate timer, and requires that workqueues are available. 70 + */ 71 + void __init kfence_init(void); 72 + 73 + /** 74 + * kfence_shutdown_cache() - handle shutdown_cache() for KFENCE objects 75 + * @s: cache being shut down 76 + * 77 + * Before shutting down a cache, one must ensure there are no remaining objects 78 + * allocated from it. Because KFENCE objects are not referenced from the cache 79 + * directly, we need to check them here. 80 + * 81 + * Note that shutdown_cache() is internal to SL*B, and kmem_cache_destroy() does 82 + * not return if allocated objects still exist: it prints an error message and 83 + * simply aborts destruction of a cache, leaking memory. 84 + * 85 + * If the only such objects are KFENCE objects, we will not leak the entire 86 + * cache, but instead try to provide more useful debug info by making allocated 87 + * objects "zombie allocations". Objects may then still be used or freed (which 88 + * is handled gracefully), but usage will result in showing KFENCE error reports 89 + * which include stack traces to the user of the object, the original allocation 90 + * site, and caller to shutdown_cache(). 91 + */ 92 + void kfence_shutdown_cache(struct kmem_cache *s); 93 + 94 + /* 95 + * Allocate a KFENCE object. 
Allocators must not call this function directly, 96 + * use kfence_alloc() instead. 97 + */ 98 + void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags); 99 + 100 + /** 101 + * kfence_alloc() - allocate a KFENCE object with a low probability 102 + * @s: struct kmem_cache with object requirements 103 + * @size: exact size of the object to allocate (can be less than @s->size 104 + * e.g. for kmalloc caches) 105 + * @flags: GFP flags 106 + * 107 + * Return: 108 + * * NULL - must proceed with allocating as usual, 109 + * * non-NULL - pointer to a KFENCE object. 110 + * 111 + * kfence_alloc() should be inserted into the heap allocation fast path, 112 + * allowing it to transparently return KFENCE-allocated objects with a low 113 + * probability using a static branch (the probability is controlled by the 114 + * kfence.sample_interval boot parameter). 115 + */ 116 + static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) 117 + { 118 + #ifdef CONFIG_KFENCE_STATIC_KEYS 119 + if (static_branch_unlikely(&kfence_allocation_key)) 120 + #else 121 + if (unlikely(!atomic_read(&kfence_allocation_gate))) 122 + #endif 123 + return __kfence_alloc(s, size, flags); 124 + return NULL; 125 + } 126 + 127 + /** 128 + * kfence_ksize() - get actual amount of memory allocated for a KFENCE object 129 + * @addr: pointer to a heap object 130 + * 131 + * Return: 132 + * * 0 - not a KFENCE object, must call __ksize() instead, 133 + * * non-0 - this many bytes can be accessed without causing a memory error. 134 + * 135 + * kfence_ksize() returns the number of bytes requested for a KFENCE object at 136 + * allocation time. This number may be less than the object size of the 137 + * corresponding struct kmem_cache. 
138 + */ 139 + size_t kfence_ksize(const void *addr); 140 + 141 + /** 142 + * kfence_object_start() - find the beginning of a KFENCE object 143 + * @addr: address within a KFENCE-allocated object 144 + * 145 + * Return: address of the beginning of the object. 146 + * 147 + * SL[AU]B-allocated objects are laid out within a page one by one, so it is 148 + * easy to calculate the beginning of an object given a pointer inside it and 149 + * the object size. The same is not true for KFENCE, which places a single 150 + * object at either end of the page. This helper function is used to find the 151 + * beginning of a KFENCE-allocated object. 152 + */ 153 + void *kfence_object_start(const void *addr); 154 + 155 + /** 156 + * __kfence_free() - release a KFENCE heap object to KFENCE pool 157 + * @addr: object to be freed 158 + * 159 + * Requires: is_kfence_address(addr) 160 + * 161 + * Release a KFENCE object and mark it as freed. 162 + */ 163 + void __kfence_free(void *addr); 164 + 165 + /** 166 + * kfence_free() - try to release an arbitrary heap object to KFENCE pool 167 + * @addr: object to be freed 168 + * 169 + * Return: 170 + * * false - object doesn't belong to KFENCE pool and was ignored, 171 + * * true - object was released to KFENCE pool. 172 + * 173 + * Release a KFENCE object and mark it as freed. May be called on any object, 174 + * even non-KFENCE objects, to simplify integration of the hooks into the 175 + * allocator's free codepath. The allocator must check the return value to 176 + * determine if it was a KFENCE object or not. 
177 + */ 178 + static __always_inline __must_check bool kfence_free(void *addr) 179 + { 180 + if (!is_kfence_address(addr)) 181 + return false; 182 + __kfence_free(addr); 183 + return true; 184 + } 185 + 186 + /** 187 + * kfence_handle_page_fault() - perform page fault handling for KFENCE pages 188 + * @addr: faulting address 189 + * @is_write: is access a write 190 + * @regs: current struct pt_regs (can be NULL, but shows full stack trace) 191 + * 192 + * Return: 193 + * * false - address outside KFENCE pool, 194 + * * true - page fault handled by KFENCE, no additional handling required. 195 + * 196 + * A page fault inside KFENCE pool indicates a memory error, such as an 197 + * out-of-bounds access, a use-after-free or an invalid memory access. In these 198 + * cases KFENCE prints an error message and marks the offending page as 199 + * present, so that the kernel can proceed. 200 + */ 201 + bool __must_check kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs); 202 + 203 + #else /* CONFIG_KFENCE */ 204 + 205 + static inline bool is_kfence_address(const void *addr) { return false; } 206 + static inline void kfence_alloc_pool(void) { } 207 + static inline void kfence_init(void) { } 208 + static inline void kfence_shutdown_cache(struct kmem_cache *s) { } 209 + static inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) { return NULL; } 210 + static inline size_t kfence_ksize(const void *addr) { return 0; } 211 + static inline void *kfence_object_start(const void *addr) { return NULL; } 212 + static inline void __kfence_free(void *addr) { } 213 + static inline bool __must_check kfence_free(void *addr) { return false; } 214 + static inline bool __must_check kfence_handle_page_fault(unsigned long addr, bool is_write, 215 + struct pt_regs *regs) 216 + { 217 + return false; 218 + } 219 + 220 + #endif 221 + 222 + #endif /* _LINUX_KFENCE_H */
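is_kfence_address() above compresses the two-sided test `__kfence_pool <= addr < __kfence_pool + KFENCE_POOL_SIZE` into a single unsigned comparison: addresses below the pool base wrap around to huge unsigned offsets. A userspace sketch of the same trick (demo pool and names are invented; pointer subtraction across unrelated objects is technically undefined in ISO C, but it is exactly what the kernel check relies on in practice):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_POOL_SIZE 4096

/* Stand-in for __kfence_pool; always initialized in this sketch. */
char demo_pool[DEMO_POOL_SIZE];

static inline bool demo_in_pool(const void *addr)
{
	/*
	 * One comparison covers both bounds: for addr below the pool,
	 * the unsigned difference wraps to a value >= DEMO_POOL_SIZE.
	 * The trailing non-NULL check matters in the kernel when the
	 * pool pointer itself has never been initialized.
	 */
	return (unsigned long)((const char *)addr - demo_pool) < DEMO_POOL_SIZE && addr;
}
```

Keeping KFENCE_POOL_SIZE a compile-time constant is what lets this stay a single load plus compare on the allocator fast path, as the header's comment notes.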
+2
include/linux/kgdb.h
··· 359 359 extern bool dbg_is_early; 360 360 extern void __init dbg_late_init(void); 361 361 extern void kgdb_panic(const char *msg); 362 + extern void kgdb_free_init_mem(void); 362 363 #else /* ! CONFIG_KGDB */ 363 364 #define in_dbg_master() (0) 364 365 #define dbg_late_init() 365 366 static inline void kgdb_panic(const char *msg) {} 367 + static inline void kgdb_free_init_mem(void) { } 366 368 #endif /* ! CONFIG_KGDB */ 367 369 #endif /* _KGDB_H_ */
+2
include/linux/khugepaged.h
··· 3 3 #define _LINUX_KHUGEPAGED_H 4 4 5 5 #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */ 6 + #include <linux/shmem_fs.h> 6 7 7 8 8 9 #ifdef CONFIG_TRANSPARENT_HUGEPAGE ··· 58 57 { 59 58 if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags)) 60 59 if ((khugepaged_always() || 60 + (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) || 61 61 (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) && 62 62 !(vm_flags & VM_NOHUGEPAGE) && 63 63 !test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+1 -1
include/linux/ks0108.h
··· 4 4 * Version: 0.1.0 5 5 * Description: ks0108 LCD Controller driver header 6 6 * 7 - * Author: Copyright (C) Miguel Ojeda Sandonis 7 + * Author: Copyright (C) Miguel Ojeda <ojeda@kernel.org> 8 8 * Date: 2006-10-31 9 9 */ 10 10
+1 -1
include/linux/mdev.h
··· 42 42 * @mdev: mdev_device structure on of mediated device 43 43 * that is being created 44 44 * Returns integer: success (0) or error (< 0) 45 - * @remove: Called to free resources in parent device's driver for a 45 + * @remove: Called to free resources in parent device's driver for 46 46 * a mediated device. It is mandatory to provide 'remove' 47 47 * ops. 48 48 * @mdev: mdev_device device structure which is being
+1 -2
include/linux/memory.h
··· 27 27 unsigned long start_section_nr; 28 28 unsigned long state; /* serialized by the dev->lock */ 29 29 int online_type; /* for passing data to online routine */ 30 - int phys_device; /* to which fru does this belong? */ 31 - struct device dev; 32 30 int nid; /* NID for this memory block */ 31 + struct device dev; 33 32 }; 34 33 35 34 int arch_get_memory_phys_device(unsigned long start_pfn);
+14 -19
include/linux/memory_hotplug.h
··· 16 16 struct vmem_altmap; 17 17 18 18 #ifdef CONFIG_MEMORY_HOTPLUG 19 - /* 20 - * Return page for the valid pfn only if the page is online. All pfn 21 - * walkers which rely on the fully initialized page->flags and others 22 - * should use this rather than pfn_valid && pfn_to_page 23 - */ 24 - #define pfn_to_online_page(pfn) \ 25 - ({ \ 26 - struct page *___page = NULL; \ 27 - unsigned long ___pfn = pfn; \ 28 - unsigned long ___nr = pfn_to_section_nr(___pfn); \ 29 - \ 30 - if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \ 31 - pfn_valid_within(___pfn)) \ 32 - ___page = pfn_to_page(___pfn); \ 33 - ___page; \ 34 - }) 19 + struct page *pfn_to_online_page(unsigned long pfn); 35 20 36 21 /* 37 22 * Types for free bootmem stored in page->lru.next. These have to be in ··· 53 68 * with this flag set, the resource pointer must no longer be used as it 54 69 * might be stale, or the resource might have changed. 55 70 */ 56 - #define MEMHP_MERGE_RESOURCE ((__force mhp_t)BIT(0)) 71 + #define MHP_MERGE_RESOURCE ((__force mhp_t)BIT(0)) 57 72 58 73 /* 59 74 * Extended parameters for memory hotplug: ··· 65 80 struct vmem_altmap *altmap; 66 81 pgprot_t pgprot; 67 82 }; 83 + 84 + bool mhp_range_allowed(u64 start, u64 size, bool need_mapping); 85 + struct range mhp_get_pluggable_range(bool need_mapping); 68 86 69 87 /* 70 88 * Zone resizing functions ··· 119 131 struct mhp_params *params); 120 132 extern u64 max_mem_size; 121 133 122 - extern int memhp_online_type_from_str(const char *str); 134 + extern int mhp_online_type_from_str(const char *str); 123 135 124 136 /* Default online_type (MMOP_*) when new memory blocks are added. */ 125 - extern int memhp_default_online_type; 137 + extern int mhp_default_online_type; 126 138 /* If movable_node boot option specified */ 127 139 extern bool movable_node_enabled; 128 140 static inline bool movable_node_is_enabled(void) ··· 268 280 return false; 269 281 } 270 282 #endif /* ! 
CONFIG_MEMORY_HOTPLUG */ 283 + 284 + /* 285 + * Keep this declaration outside CONFIG_MEMORY_HOTPLUG as some 286 + * platforms might override and use arch_get_mappable_range() 287 + * for internal non memory hotplug purposes. 288 + */ 289 + struct range arch_get_mappable_range(void); 271 290 272 291 #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT) 273 292 /*
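The memory_hotplug.h hunk above moves pfn_to_online_page() out of line, but its contract is unchanged: a pfn is only converted to a struct page when its memory section is online, so pfn walkers never touch pages with uninitialized page->flags. A toy model with made-up section geometry and demo_* names (the real helper uses pfn_to_section_nr() and per-section state bits):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PFN_SECTION_SHIFT	4	/* toy: 16 pages per section */
#define DEMO_NR_SECTIONS	8
#define DEMO_NR_PAGES		(DEMO_NR_SECTIONS << DEMO_PFN_SECTION_SHIFT)

struct demo_page { unsigned long flags; };

struct demo_page demo_mem_map[DEMO_NR_PAGES];

/* Only section 1 is online in this sketch. */
bool demo_section_online[DEMO_NR_SECTIONS] = { [1] = true };

struct demo_page *demo_pfn_to_online_page(unsigned long pfn)
{
	unsigned long nr = pfn >> DEMO_PFN_SECTION_SHIFT;

	if (nr >= DEMO_NR_SECTIONS || !demo_section_online[nr])
		return NULL;
	return &demo_mem_map[pfn];
}
```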
+6
include/linux/memremap.h
··· 137 137 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap); 138 138 struct dev_pagemap *get_dev_pagemap(unsigned long pfn, 139 139 struct dev_pagemap *pgmap); 140 + bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn); 140 141 141 142 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); 142 143 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); ··· 164 163 struct dev_pagemap *pgmap) 165 164 { 166 165 return NULL; 166 + } 167 + 168 + static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn) 169 + { 170 + return false; 167 171 } 168 172 169 173 static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
+42 -7
include/linux/mmzone.h
··· 503 503 * bootmem allocator): 504 504 * managed_pages = present_pages - reserved_pages; 505 505 * 506 + * cma pages is present pages that are assigned for CMA use 507 + * (MIGRATE_CMA). 508 + * 506 509 * So present_pages may be used by memory hotplug or memory power 507 510 * management logic to figure out unmanaged pages by checking 508 511 * (present_pages - managed_pages). And managed_pages should be used ··· 530 527 atomic_long_t managed_pages; 531 528 unsigned long spanned_pages; 532 529 unsigned long present_pages; 530 + #ifdef CONFIG_CMA 531 + unsigned long cma_pages; 532 + #endif 533 533 534 534 const char *name; 535 535 ··· 628 622 static inline unsigned long zone_managed_pages(struct zone *zone) 629 623 { 630 624 return (unsigned long)atomic_long_read(&zone->managed_pages); 625 + } 626 + 627 + static inline unsigned long zone_cma_pages(struct zone *zone) 628 + { 629 + #ifdef CONFIG_CMA 630 + return zone->cma_pages; 631 + #else 632 + return 0; 633 + #endif 631 634 } 632 635 633 636 static inline unsigned long zone_end_pfn(const struct zone *zone) ··· 917 902 * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. 918 903 */ 919 904 #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) 905 + 906 + #ifdef CONFIG_ZONE_DEVICE 907 + static inline bool zone_is_zone_device(struct zone *zone) 908 + { 909 + return zone_idx(zone) == ZONE_DEVICE; 910 + } 911 + #else 912 + static inline bool zone_is_zone_device(struct zone *zone) 913 + { 914 + return false; 915 + } 916 + #endif 920 917 921 918 /* 922 919 * Returns true if a zone has pages managed by the buddy allocator. ··· 1318 1291 * which results in PFN_SECTION_SHIFT equal 6. 1319 1292 * To sum it up, at least 6 bits are available. 
1320 1293 */ 1321 - #define SECTION_MARKED_PRESENT (1UL<<0) 1322 - #define SECTION_HAS_MEM_MAP (1UL<<1) 1323 - #define SECTION_IS_ONLINE (1UL<<2) 1324 - #define SECTION_IS_EARLY (1UL<<3) 1325 - #define SECTION_MAP_LAST_BIT (1UL<<4) 1326 - #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) 1327 - #define SECTION_NID_SHIFT 3 1294 + #define SECTION_MARKED_PRESENT (1UL<<0) 1295 + #define SECTION_HAS_MEM_MAP (1UL<<1) 1296 + #define SECTION_IS_ONLINE (1UL<<2) 1297 + #define SECTION_IS_EARLY (1UL<<3) 1298 + #define SECTION_TAINT_ZONE_DEVICE (1UL<<4) 1299 + #define SECTION_MAP_LAST_BIT (1UL<<5) 1300 + #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) 1301 + #define SECTION_NID_SHIFT 3 1328 1302 1329 1303 static inline struct page *__section_mem_map_addr(struct mem_section *section) 1330 1304 { ··· 1362 1334 static inline int online_section(struct mem_section *section) 1363 1335 { 1364 1336 return (section && (section->section_mem_map & SECTION_IS_ONLINE)); 1337 + } 1338 + 1339 + static inline int online_device_section(struct mem_section *section) 1340 + { 1341 + unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE; 1342 + 1343 + return section && ((section->section_mem_map & flags) == flags); 1365 1344 } 1366 1345 1367 1346 static inline int online_section_nr(unsigned long nr)
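The new SECTION_TAINT_ZONE_DEVICE bit above lets online_device_section() demand two flags at once by masking section_mem_map and comparing against the combined mask. A self-contained copy of that test with the same bit values (struct and helper re-declared locally with demo_* names for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SECTION_MARKED_PRESENT		(1UL << 0)
#define SECTION_HAS_MEM_MAP		(1UL << 1)
#define SECTION_IS_ONLINE		(1UL << 2)
#define SECTION_IS_EARLY		(1UL << 3)
#define SECTION_TAINT_ZONE_DEVICE	(1UL << 4)

struct demo_mem_section { unsigned long section_mem_map; };

bool demo_online_device_section(const struct demo_mem_section *section)
{
	unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;

	/* Both bits must be set; either one alone is not enough. */
	return section && (section->section_mem_map & flags) == flags;
}
```

Note that adding the bit also pushed SECTION_MAP_LAST_BIT from 1UL<<4 to 1UL<<5, widening SECTION_MAP_MASK accordingly.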
+2 -2
include/linux/page-flags.h
··· 810 810 811 811 /* 812 812 * Flags checked when a page is freed. Pages being freed should not have 813 - * these flags set. It they are, there is a problem. 813 + * these flags set. If they are, there is a problem. 814 814 */ 815 815 #define PAGE_FLAGS_CHECK_AT_FREE \ 816 816 (1UL << PG_lru | 1UL << PG_locked | \ ··· 821 821 822 822 /* 823 823 * Flags checked when a page is prepped for return by the page allocator. 824 - * Pages being prepped should not have these flags set. It they are set, 824 + * Pages being prepped should not have these flags set. If they are set, 825 825 * there has been a kernel bug or struct page corruption. 826 826 * 827 827 * __PG_HWPOISON is exceptional because it needs to be kept beyond page's
+4 -2
include/linux/pagemap.h
··· 315 315 #define FGP_NOWAIT 0x00000020 316 316 #define FGP_FOR_MMAP 0x00000040 317 317 #define FGP_HEAD 0x00000080 318 + #define FGP_ENTRY 0x00000100 318 319 319 320 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, 320 321 int fgp_flags, gfp_t cache_gfp_mask); ··· 451 450 } 452 451 453 452 unsigned find_get_entries(struct address_space *mapping, pgoff_t start, 454 - unsigned int nr_entries, struct page **entries, 455 - pgoff_t *indices); 453 + pgoff_t end, struct pagevec *pvec, pgoff_t *indices); 456 454 unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start, 457 455 pgoff_t end, unsigned int nr_pages, 458 456 struct page **pages); ··· 759 759 void replace_page_cache_page(struct page *old, struct page *new); 760 760 void delete_from_page_cache_batch(struct address_space *mapping, 761 761 struct pagevec *pvec); 762 + loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end, 763 + int whence); 762 764 763 765 /* 764 766 * Like add_to_page_cache_locked, but used to add newly allocated pages:
-4
include/linux/pagevec.h
··· 25 25 26 26 void __pagevec_release(struct pagevec *pvec); 27 27 void __pagevec_lru_add(struct pagevec *pvec); 28 - unsigned pagevec_lookup_entries(struct pagevec *pvec, 29 - struct address_space *mapping, 30 - pgoff_t start, unsigned nr_entries, 31 - pgoff_t *indices); 32 28 void pagevec_remove_exceptionals(struct pagevec *pvec); 33 29 unsigned pagevec_lookup_range(struct pagevec *pvec, 34 30 struct address_space *mapping,
-8
include/linux/pgtable.h
··· 432 432 * To be differentiate with macro pte_mkyoung, this macro is used on platforms 433 433 * where software maintains page access bit. 434 434 */ 435 - #ifndef pte_sw_mkyoung 436 - static inline pte_t pte_sw_mkyoung(pte_t pte) 437 - { 438 - return pte; 439 - } 440 - #define pte_sw_mkyoung pte_sw_mkyoung 441 - #endif 442 - 443 435 #ifndef pte_savedwrite 444 436 #define pte_savedwrite pte_write 445 437 #endif
+1 -1
include/linux/ptrace.h
··· 171 171 * 172 172 * Check whether @event is enabled and, if so, report @event and @pid 173 173 * to the ptrace parent. @pid is reported as the pid_t seen from the 174 - * the ptrace parent's pid namespace. 174 + * ptrace parent's pid namespace. 175 175 * 176 176 * Called without locks. 177 177 */
+2 -1
include/linux/rmap.h
··· 213 213 214 214 static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) 215 215 { 216 - if (pvmw->pte) 216 + /* HugeTLB pte is set to the relevant page table entry without pte_mapped. */ 217 + if (pvmw->pte && !PageHuge(pvmw->page)) 217 218 pte_unmap(pvmw->pte); 218 219 if (pvmw->ptl) 219 220 spin_unlock(pvmw->ptl);
+3
include/linux/slab_def.h
··· 2 2 #ifndef _LINUX_SLAB_DEF_H 3 3 #define _LINUX_SLAB_DEF_H 4 4 5 + #include <linux/kfence.h> 5 6 #include <linux/reciprocal_div.h> 6 7 7 8 /* ··· 115 114 static inline int objs_per_slab_page(const struct kmem_cache *cache, 116 115 const struct page *page) 117 116 { 117 + if (is_kfence_address(page_address(page))) 118 + return 1; 118 119 return cache->num; 119 120 } 120 121
+3
include/linux/slub_def.h
··· 7 7 * 8 8 * (C) 2007 SGI, Christoph Lameter 9 9 */ 10 + #include <linux/kfence.h> 10 11 #include <linux/kobject.h> 11 12 #include <linux/reciprocal_div.h> 12 13 ··· 186 185 static inline unsigned int obj_to_index(const struct kmem_cache *cache, 187 186 const struct page *page, void *obj) 188 187 { 188 + if (is_kfence_address(obj)) 189 + return 0; 189 190 return __obj_to_index(cache, page_address(page), obj); 190 191 } 191 192
+9
include/linux/stackdepot.h
··· 21 21 22 22 unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries); 23 23 24 + #ifdef CONFIG_STACKDEPOT 25 + int stack_depot_init(void); 26 + #else 27 + static inline int stack_depot_init(void) 28 + { 29 + return 0; 30 + } 31 + #endif /* CONFIG_STACKDEPOT */ 32 + 24 33 #endif
+1 -281
include/linux/string.h
··· 266 266 void __write_overflow(void) __compiletime_error("detected write beyond size of object passed as 1st parameter"); 267 267 268 268 #if !defined(__NO_FORTIFY) && defined(__OPTIMIZE__) && defined(CONFIG_FORTIFY_SOURCE) 269 - 270 - #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) 271 - extern void *__underlying_memchr(const void *p, int c, __kernel_size_t size) __RENAME(memchr); 272 - extern int __underlying_memcmp(const void *p, const void *q, __kernel_size_t size) __RENAME(memcmp); 273 - extern void *__underlying_memcpy(void *p, const void *q, __kernel_size_t size) __RENAME(memcpy); 274 - extern void *__underlying_memmove(void *p, const void *q, __kernel_size_t size) __RENAME(memmove); 275 - extern void *__underlying_memset(void *p, int c, __kernel_size_t size) __RENAME(memset); 276 - extern char *__underlying_strcat(char *p, const char *q) __RENAME(strcat); 277 - extern char *__underlying_strcpy(char *p, const char *q) __RENAME(strcpy); 278 - extern __kernel_size_t __underlying_strlen(const char *p) __RENAME(strlen); 279 - extern char *__underlying_strncat(char *p, const char *q, __kernel_size_t count) __RENAME(strncat); 280 - extern char *__underlying_strncpy(char *p, const char *q, __kernel_size_t size) __RENAME(strncpy); 281 - #else 282 - #define __underlying_memchr __builtin_memchr 283 - #define __underlying_memcmp __builtin_memcmp 284 - #define __underlying_memcpy __builtin_memcpy 285 - #define __underlying_memmove __builtin_memmove 286 - #define __underlying_memset __builtin_memset 287 - #define __underlying_strcat __builtin_strcat 288 - #define __underlying_strcpy __builtin_strcpy 289 - #define __underlying_strlen __builtin_strlen 290 - #define __underlying_strncat __builtin_strncat 291 - #define __underlying_strncpy __builtin_strncpy 292 - #endif 293 - 294 - __FORTIFY_INLINE char *strncpy(char *p, const char *q, __kernel_size_t size) 295 - { 296 - size_t p_size = __builtin_object_size(p, 1); 297 - if (__builtin_constant_p(size) 
&& p_size < size) 298 - __write_overflow(); 299 - if (p_size < size) 300 - fortify_panic(__func__); 301 - return __underlying_strncpy(p, q, size); 302 - } 303 - 304 - __FORTIFY_INLINE char *strcat(char *p, const char *q) 305 - { 306 - size_t p_size = __builtin_object_size(p, 1); 307 - if (p_size == (size_t)-1) 308 - return __underlying_strcat(p, q); 309 - if (strlcat(p, q, p_size) >= p_size) 310 - fortify_panic(__func__); 311 - return p; 312 - } 313 - 314 - __FORTIFY_INLINE __kernel_size_t strlen(const char *p) 315 - { 316 - __kernel_size_t ret; 317 - size_t p_size = __builtin_object_size(p, 1); 318 - 319 - /* Work around gcc excess stack consumption issue */ 320 - if (p_size == (size_t)-1 || 321 - (__builtin_constant_p(p[p_size - 1]) && p[p_size - 1] == '\0')) 322 - return __underlying_strlen(p); 323 - ret = strnlen(p, p_size); 324 - if (p_size <= ret) 325 - fortify_panic(__func__); 326 - return ret; 327 - } 328 - 329 - extern __kernel_size_t __real_strnlen(const char *, __kernel_size_t) __RENAME(strnlen); 330 - __FORTIFY_INLINE __kernel_size_t strnlen(const char *p, __kernel_size_t maxlen) 331 - { 332 - size_t p_size = __builtin_object_size(p, 1); 333 - __kernel_size_t ret = __real_strnlen(p, maxlen < p_size ? maxlen : p_size); 334 - if (p_size <= ret && maxlen != ret) 335 - fortify_panic(__func__); 336 - return ret; 337 - } 338 - 339 - /* defined after fortified strlen to reuse it */ 340 - extern size_t __real_strlcpy(char *, const char *, size_t) __RENAME(strlcpy); 341 - __FORTIFY_INLINE size_t strlcpy(char *p, const char *q, size_t size) 342 - { 343 - size_t ret; 344 - size_t p_size = __builtin_object_size(p, 1); 345 - size_t q_size = __builtin_object_size(q, 1); 346 - if (p_size == (size_t)-1 && q_size == (size_t)-1) 347 - return __real_strlcpy(p, q, size); 348 - ret = strlen(q); 349 - if (size) { 350 - size_t len = (ret >= size) ? 
size - 1 : ret; 351 - if (__builtin_constant_p(len) && len >= p_size) 352 - __write_overflow(); 353 - if (len >= p_size) 354 - fortify_panic(__func__); 355 - __underlying_memcpy(p, q, len); 356 - p[len] = '\0'; 357 - } 358 - return ret; 359 - } 360 - 361 - /* defined after fortified strnlen to reuse it */ 362 - extern ssize_t __real_strscpy(char *, const char *, size_t) __RENAME(strscpy); 363 - __FORTIFY_INLINE ssize_t strscpy(char *p, const char *q, size_t size) 364 - { 365 - size_t len; 366 - /* Use string size rather than possible enclosing struct size. */ 367 - size_t p_size = __builtin_object_size(p, 1); 368 - size_t q_size = __builtin_object_size(q, 1); 369 - 370 - /* If we cannot get size of p and q default to call strscpy. */ 371 - if (p_size == (size_t) -1 && q_size == (size_t) -1) 372 - return __real_strscpy(p, q, size); 373 - 374 - /* 375 - * If size can be known at compile time and is greater than 376 - * p_size, generate a compile time write overflow error. 377 - */ 378 - if (__builtin_constant_p(size) && size > p_size) 379 - __write_overflow(); 380 - 381 - /* 382 - * This call protects from read overflow, because len will default to q 383 - * length if it smaller than size. 384 - */ 385 - len = strnlen(q, size); 386 - /* 387 - * If len equals size, we will copy only size bytes which leads to 388 - * -E2BIG being returned. 389 - * Otherwise we will copy len + 1 because of the final '\O'. 390 - */ 391 - len = len == size ? size : len + 1; 392 - 393 - /* 394 - * Generate a runtime write overflow error if len is greater than 395 - * p_size. 396 - */ 397 - if (len > p_size) 398 - fortify_panic(__func__); 399 - 400 - /* 401 - * We can now safely call vanilla strscpy because we are protected from: 402 - * 1. Read overflow thanks to call to strnlen(). 403 - * 2. Write overflow thanks to above ifs. 
404 - */ 405 - return __real_strscpy(p, q, len); 406 - } 407 - 408 - /* defined after fortified strlen and strnlen to reuse them */ 409 - __FORTIFY_INLINE char *strncat(char *p, const char *q, __kernel_size_t count) 410 - { 411 - size_t p_len, copy_len; 412 - size_t p_size = __builtin_object_size(p, 1); 413 - size_t q_size = __builtin_object_size(q, 1); 414 - if (p_size == (size_t)-1 && q_size == (size_t)-1) 415 - return __underlying_strncat(p, q, count); 416 - p_len = strlen(p); 417 - copy_len = strnlen(q, count); 418 - if (p_size < p_len + copy_len + 1) 419 - fortify_panic(__func__); 420 - __underlying_memcpy(p + p_len, q, copy_len); 421 - p[p_len + copy_len] = '\0'; 422 - return p; 423 - } 424 - 425 - __FORTIFY_INLINE void *memset(void *p, int c, __kernel_size_t size) 426 - { 427 - size_t p_size = __builtin_object_size(p, 0); 428 - if (__builtin_constant_p(size) && p_size < size) 429 - __write_overflow(); 430 - if (p_size < size) 431 - fortify_panic(__func__); 432 - return __underlying_memset(p, c, size); 433 - } 434 - 435 - __FORTIFY_INLINE void *memcpy(void *p, const void *q, __kernel_size_t size) 436 - { 437 - size_t p_size = __builtin_object_size(p, 0); 438 - size_t q_size = __builtin_object_size(q, 0); 439 - if (__builtin_constant_p(size)) { 440 - if (p_size < size) 441 - __write_overflow(); 442 - if (q_size < size) 443 - __read_overflow2(); 444 - } 445 - if (p_size < size || q_size < size) 446 - fortify_panic(__func__); 447 - return __underlying_memcpy(p, q, size); 448 - } 449 - 450 - __FORTIFY_INLINE void *memmove(void *p, const void *q, __kernel_size_t size) 451 - { 452 - size_t p_size = __builtin_object_size(p, 0); 453 - size_t q_size = __builtin_object_size(q, 0); 454 - if (__builtin_constant_p(size)) { 455 - if (p_size < size) 456 - __write_overflow(); 457 - if (q_size < size) 458 - __read_overflow2(); 459 - } 460 - if (p_size < size || q_size < size) 461 - fortify_panic(__func__); 462 - return __underlying_memmove(p, q, size); 463 - } 464 - 465 - 
extern void *__real_memscan(void *, int, __kernel_size_t) __RENAME(memscan); 466 - __FORTIFY_INLINE void *memscan(void *p, int c, __kernel_size_t size) 467 - { 468 - size_t p_size = __builtin_object_size(p, 0); 469 - if (__builtin_constant_p(size) && p_size < size) 470 - __read_overflow(); 471 - if (p_size < size) 472 - fortify_panic(__func__); 473 - return __real_memscan(p, c, size); 474 - } 475 - 476 - __FORTIFY_INLINE int memcmp(const void *p, const void *q, __kernel_size_t size) 477 - { 478 - size_t p_size = __builtin_object_size(p, 0); 479 - size_t q_size = __builtin_object_size(q, 0); 480 - if (__builtin_constant_p(size)) { 481 - if (p_size < size) 482 - __read_overflow(); 483 - if (q_size < size) 484 - __read_overflow2(); 485 - } 486 - if (p_size < size || q_size < size) 487 - fortify_panic(__func__); 488 - return __underlying_memcmp(p, q, size); 489 - } 490 - 491 - __FORTIFY_INLINE void *memchr(const void *p, int c, __kernel_size_t size) 492 - { 493 - size_t p_size = __builtin_object_size(p, 0); 494 - if (__builtin_constant_p(size) && p_size < size) 495 - __read_overflow(); 496 - if (p_size < size) 497 - fortify_panic(__func__); 498 - return __underlying_memchr(p, c, size); 499 - } 500 - 501 - void *__real_memchr_inv(const void *s, int c, size_t n) __RENAME(memchr_inv); 502 - __FORTIFY_INLINE void *memchr_inv(const void *p, int c, size_t size) 503 - { 504 - size_t p_size = __builtin_object_size(p, 0); 505 - if (__builtin_constant_p(size) && p_size < size) 506 - __read_overflow(); 507 - if (p_size < size) 508 - fortify_panic(__func__); 509 - return __real_memchr_inv(p, c, size); 510 - } 511 - 512 - extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kmemdup); 513 - __FORTIFY_INLINE void *kmemdup(const void *p, size_t size, gfp_t gfp) 514 - { 515 - size_t p_size = __builtin_object_size(p, 0); 516 - if (__builtin_constant_p(size) && p_size < size) 517 - __read_overflow(); 518 - if (p_size < size) 519 - fortify_panic(__func__); 520 - 
return __real_kmemdup(p, size, gfp); 521 - } 522 - 523 - /* defined after fortified strlen and memcpy to reuse them */ 524 - __FORTIFY_INLINE char *strcpy(char *p, const char *q) 525 - { 526 - size_t p_size = __builtin_object_size(p, 1); 527 - size_t q_size = __builtin_object_size(q, 1); 528 - size_t size; 529 - if (p_size == (size_t)-1 && q_size == (size_t)-1) 530 - return __underlying_strcpy(p, q); 531 - size = strlen(q) + 1; 532 - /* test here to use the more stringent object size */ 533 - if (p_size < size) 534 - fortify_panic(__func__); 535 - memcpy(p, q, size); 536 - return p; 537 - } 538 - 539 - /* Don't use these outside the FORITFY_SOURCE implementation */ 540 - #undef __underlying_memchr 541 - #undef __underlying_memcmp 542 - #undef __underlying_memcpy 543 - #undef __underlying_memmove 544 - #undef __underlying_memset 545 - #undef __underlying_strcat 546 - #undef __underlying_strcpy 547 - #undef __underlying_strlen 548 - #undef __underlying_strncat 549 - #undef __underlying_strncpy 269 + #include <linux/fortify-string.h> 550 270 #endif 551 271 552 272 /**
+6
include/linux/vmstat.h
··· 313 313 enum node_stat_item item, int delta) 314 314 { 315 315 if (vmstat_item_in_bytes(item)) { 316 + /* 317 + * Only cgroups use subpage accounting right now; at 318 + * the global level, these items still change in 319 + * multiples of whole pages. Store them as pages 320 + * internally to keep the per-cpu counters compact. 321 + */ 316 322 VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); 317 323 delta >>= PAGE_SHIFT; 318 324 }
+3
include/linux/zpool.h
··· 73 73 * @malloc: allocate mem from a pool. 74 74 * @free: free mem from a pool. 75 75 * @shrink: shrink the pool. 76 + * @sleep_mapped: whether zpool driver can sleep during map. 76 77 * @map: map a handle. 77 78 * @unmap: unmap a handle. 78 79 * @total_size: get total size of a pool. ··· 101 100 int (*shrink)(void *pool, unsigned int pages, 102 101 unsigned int *reclaimed); 103 102 103 + bool sleep_mapped; 104 104 void *(*map)(void *pool, unsigned long handle, 105 105 enum zpool_mapmode mm); 106 106 void (*unmap)(void *pool, unsigned long handle); ··· 114 112 int zpool_unregister_driver(struct zpool_driver *driver); 115 113 116 114 bool zpool_evictable(struct zpool *pool); 115 + bool zpool_can_sleep_mapped(struct zpool *pool); 117 116 118 117 #endif
+1 -1
include/linux/zsmalloc.h
··· 35 35 36 36 struct zs_pool_stats { 37 37 /* How many pages were migrated (freed) */ 38 - unsigned long pages_compacted; 38 + atomic_long_t pages_compacted; 39 39 }; 40 40 41 41 struct zs_pool;
+74
include/trace/events/error_report.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0-only */ 2 + /* 3 + * Declarations for error reporting tracepoints. 4 + * 5 + * Copyright (C) 2021, Google LLC. 6 + */ 7 + #undef TRACE_SYSTEM 8 + #define TRACE_SYSTEM error_report 9 + 10 + #if !defined(_TRACE_ERROR_REPORT_H) || defined(TRACE_HEADER_MULTI_READ) 11 + #define _TRACE_ERROR_REPORT_H 12 + 13 + #include <linux/tracepoint.h> 14 + 15 + #ifndef __ERROR_REPORT_DECLARE_TRACE_ENUMS_ONCE_ONLY 16 + #define __ERROR_REPORT_DECLARE_TRACE_ENUMS_ONCE_ONLY 17 + 18 + enum error_detector { 19 + ERROR_DETECTOR_KFENCE, 20 + ERROR_DETECTOR_KASAN 21 + }; 22 + 23 + #endif /* __ERROR_REPORT_DECLARE_TRACE_ENUMS_ONCE_ONLY */ 24 + 25 + #define error_detector_list \ 26 + EM(ERROR_DETECTOR_KFENCE, "kfence") \ 27 + EMe(ERROR_DETECTOR_KASAN, "kasan") 28 + /* Always end the list with an EMe. */ 29 + 30 + #undef EM 31 + #undef EMe 32 + 33 + #define EM(a, b) TRACE_DEFINE_ENUM(a); 34 + #define EMe(a, b) TRACE_DEFINE_ENUM(a); 35 + 36 + error_detector_list 37 + 38 + #undef EM 39 + #undef EMe 40 + 41 + #define EM(a, b) { a, b }, 42 + #define EMe(a, b) { a, b } 43 + 44 + #define show_error_detector_list(val) \ 45 + __print_symbolic(val, error_detector_list) 46 + 47 + DECLARE_EVENT_CLASS(error_report_template, 48 + TP_PROTO(enum error_detector error_detector, unsigned long id), 49 + TP_ARGS(error_detector, id), 50 + TP_STRUCT__entry(__field(enum error_detector, error_detector) 51 + __field(unsigned long, id)), 52 + TP_fast_assign(__entry->error_detector = error_detector; 53 + __entry->id = id;), 54 + TP_printk("[%s] %lx", 55 + show_error_detector_list(__entry->error_detector), 56 + __entry->id)); 57 + 58 + /** 59 + * error_report_end - called after printing the error report 60 + * @error_detector: short string describing the error detection tool 61 + * @id: pseudo-unique descriptor identifying the report 62 + * (e.g. the memory access address) 63 + * 64 + * This event occurs right after a debugging tool finishes printing the error 65 + * report. 
66 + */ 67 + DEFINE_EVENT(error_report_template, error_report_end, 68 + TP_PROTO(enum error_detector error_detector, unsigned long id), 69 + TP_ARGS(error_detector, id)); 70 + 71 + #endif /* _TRACE_ERROR_REPORT_H */ 72 + 73 + /* This part must be outside protection */ 74 + #include <trace/define_trace.h>
+1 -1
include/uapi/linux/firewire-cdev.h
··· 844 844 * struct fw_cdev_start_iso - Start an isochronous transmission or reception 845 845 * @cycle: Cycle in which to start I/O. If @cycle is greater than or 846 846 * equal to 0, the I/O will start on that cycle. 847 - * @sync: Determines the value to wait for for receive packets that have 847 + * @sync: Determines the value to wait for receive packets that have 848 848 * the %FW_CDEV_ISO_SYNC bit set 849 849 * @tags: Tag filter bit mask. Only valid for isochronous reception. 850 850 * Determines the tag values for which packets will be accepted.
+1 -1
include/uapi/linux/input.h
··· 84 84 * in units per radian. 85 85 * When INPUT_PROP_ACCELEROMETER is set the resolution changes. 86 86 * The main axes (ABS_X, ABS_Y, ABS_Z) are then reported in 87 - * in units per g (units/g) and in units per degree per second 87 + * units per g (units/g) and in units per degree per second 88 88 * (units/deg/s) for rotational axes (ABS_RX, ABS_RY, ABS_RZ). 89 89 */ 90 90 struct input_absinfo {
+1 -1
init/Kconfig
··· 19 19 CC_VERSION_TEXT so it is recorded in include/config/auto.conf.cmd. 20 20 When the compiler is updated, Kconfig will be invoked. 21 21 22 - - Ensure full rebuild when the compier is updated 22 + - Ensure full rebuild when the compiler is updated 23 23 include/linux/kconfig.h contains this option in the comment line so 24 24 fixdep adds include/config/cc/version/text.h into the auto-generated 25 25 dependency. When the compiler is updated, syncconfig will touch it
+15 -4
init/initramfs.c
··· 11 11 #include <linux/utime.h> 12 12 #include <linux/file.h> 13 13 #include <linux/memblock.h> 14 + #include <linux/mm.h> 14 15 #include <linux/namei.h> 15 16 #include <linux/init_syscalls.h> 16 17 ··· 44 43 { 45 44 if (!message) 46 45 message = x; 46 + } 47 + 48 + static void panic_show_mem(const char *fmt, ...) 49 + { 50 + va_list args; 51 + 52 + show_mem(0, NULL); 53 + va_start(args, fmt); 54 + panic(fmt, args); 55 + va_end(args); 47 56 } 48 57 49 58 /* link hash */ ··· 91 80 } 92 81 q = kmalloc(sizeof(struct hash), GFP_KERNEL); 93 82 if (!q) 94 - panic("can't allocate link hash entry"); 83 + panic_show_mem("can't allocate link hash entry"); 95 84 q->major = major; 96 85 q->minor = minor; 97 86 q->ino = ino; ··· 136 125 { 137 126 struct dir_entry *de = kmalloc(sizeof(struct dir_entry), GFP_KERNEL); 138 127 if (!de) 139 - panic("can't allocate dir_entry buffer"); 128 + panic_show_mem("can't allocate dir_entry buffer"); 140 129 INIT_LIST_HEAD(&de->list); 141 130 de->name = kstrdup(name, GFP_KERNEL); 142 131 de->mtime = mtime; ··· 471 460 name_buf = kmalloc(N_ALIGN(PATH_MAX), GFP_KERNEL); 472 461 473 462 if (!header_buf || !symlink_buf || !name_buf) 474 - panic("can't allocate buffers"); 463 + panic_show_mem("can't allocate buffers"); 475 464 476 465 state = Start; 477 466 this_header = 0; ··· 618 607 /* Load the built in initramfs */ 619 608 char *err = unpack_to_rootfs(__initramfs_start, __initramfs_size); 620 609 if (err) 621 - panic("%s", err); /* Failed to decompress INTERNAL initramfs */ 610 + panic_show_mem("%s", err); /* Failed to decompress INTERNAL initramfs */ 622 611 623 612 if (!initrd_start || IS_ENABLED(CONFIG_INITRAMFS_FORCE)) 624 613 goto done;
+6
init/main.c
··· 40 40 #include <linux/security.h> 41 41 #include <linux/smp.h> 42 42 #include <linux/profile.h> 43 + #include <linux/kfence.h> 43 44 #include <linux/rcupdate.h> 44 45 #include <linux/moduleparam.h> 45 46 #include <linux/kallsyms.h> ··· 97 96 #include <linux/mem_encrypt.h> 98 97 #include <linux/kcsan.h> 99 98 #include <linux/init_syscalls.h> 99 + #include <linux/stackdepot.h> 100 100 101 101 #include <asm/io.h> 102 102 #include <asm/bugs.h> ··· 826 824 */ 827 825 page_ext_init_flatmem(); 828 826 init_mem_debugging_and_hardening(); 827 + kfence_alloc_pool(); 829 828 report_meminit(); 829 + stack_depot_init(); 830 830 mem_init(); 831 831 /* page_owner must be initialized after buddy is ready */ 832 832 page_ext_init_flatmem_late(); ··· 959 955 hrtimers_init(); 960 956 softirq_init(); 961 957 timekeeping_init(); 958 + kfence_init(); 962 959 963 960 /* 964 961 * For best initial stack canary entropy, prepare it after: ··· 1426 1421 async_synchronize_full(); 1427 1422 kprobe_free_init_mem(); 1428 1423 ftrace_free_init_mem(); 1424 + kgdb_free_init_mem(); 1429 1425 free_initmem(); 1430 1426 mark_readonly(); 1431 1427
-8
init/version.c
··· 16 16 #include <linux/version.h> 17 17 #include <linux/proc_ns.h> 18 18 19 - #ifndef CONFIG_KALLSYMS 20 - #define version(a) Version_ ## a 21 - #define version_string(a) version(a) 22 - 23 - extern int version_string(LINUX_VERSION_CODE); 24 - int version_string(LINUX_VERSION_CODE); 25 - #endif 26 - 27 19 struct uts_namespace init_uts_ns = { 28 20 .ns.count = REFCOUNT_INIT(2), 29 21 .name = {
+11
kernel/debug/debug_core.c
··· 455 455 return 0; 456 456 } 457 457 458 + void kgdb_free_init_mem(void) 459 + { 460 + int i; 461 + 462 + /* Clear init memory breakpoints. */ 463 + for (i = 0; i < KGDB_MAX_BREAKPOINTS; i++) { 464 + if (init_section_contains((void *)kgdb_break[i].bpt_addr, 0)) 465 + kgdb_break[i].state = BP_UNDEFINED; 466 + } 467 + } 468 + 458 469 #ifdef CONFIG_KGDB_KDB 459 470 void kdb_dump_stack_on_cpu(int cpu) 460 471 {
+4 -4
kernel/events/core.c
··· 269 269 if (!event->parent) { 270 270 /* 271 271 * If this is a !child event, we must hold ctx::mutex to 272 - * stabilize the the event->ctx relation. See 272 + * stabilize the event->ctx relation. See 273 273 * perf_event_ctx_lock(). 274 274 */ 275 275 lockdep_assert_held(&ctx->mutex); ··· 1303 1303 * life-time rules separate them. That is an exiting task cannot fork, and a 1304 1304 * spawning task cannot (yet) exit. 1305 1305 * 1306 - * But remember that that these are parent<->child context relations, and 1306 + * But remember that these are parent<->child context relations, and 1307 1307 * migration does not affect children, therefore these two orderings should not 1308 1308 * interact. 1309 1309 * ··· 1442 1442 /* 1443 1443 * Get the perf_event_context for a task and lock it. 1444 1444 * 1445 - * This has to cope with with the fact that until it is locked, 1445 + * This has to cope with the fact that until it is locked, 1446 1446 * the context could get moved to another task. 1447 1447 */ 1448 1448 static struct perf_event_context * ··· 2486 2486 * But this is a bit hairy. 2487 2487 * 2488 2488 * So instead, we have an explicit cgroup call to remain 2489 - * within the time time source all along. We believe it 2489 + * within the time source all along. We believe it 2490 2490 * is cleaner and simpler to understand. 2491 2491 */ 2492 2492 if (is_cgroup_event(event))
+1 -1
kernel/events/uprobes.c
··· 1733 1733 } 1734 1734 1735 1735 /* 1736 - * Allocate a uprobe_task object for the task if if necessary. 1736 + * Allocate a uprobe_task object for the task if necessary. 1737 1737 * Called when the thread hits a breakpoint. 1738 1738 * 1739 1739 * Returns:
+1 -6
kernel/groups.c
··· 15 15 struct group_info *groups_alloc(int gidsetsize) 16 16 { 17 17 struct group_info *gi; 18 - unsigned int len; 19 - 20 - len = sizeof(struct group_info) + sizeof(kgid_t) * gidsetsize; 21 - gi = kmalloc(len, GFP_KERNEL_ACCOUNT|__GFP_NOWARN|__GFP_NORETRY); 22 - if (!gi) 23 - gi = __vmalloc(len, GFP_KERNEL_ACCOUNT); 18 + gi = kvmalloc(struct_size(gi, gid, gidsetsize), GFP_KERNEL_ACCOUNT); 24 19 if (!gi) 25 20 return NULL; 26 21
+2 -2
kernel/locking/rtmutex.c
··· 1420 1420 } 1421 1421 1422 1422 /* 1423 - * Performs the wakeup of the the top-waiter and re-enables preemption. 1423 + * Performs the wakeup of the top-waiter and re-enables preemption. 1424 1424 */ 1425 1425 void rt_mutex_postunlock(struct wake_q_head *wake_q) 1426 1426 { ··· 1819 1819 * been started. 1820 1820 * @waiter: the pre-initialized rt_mutex_waiter 1821 1821 * 1822 - * Wait for the the lock acquisition started on our behalf by 1822 + * Wait for the lock acquisition started on our behalf by 1823 1823 * rt_mutex_start_proxy_lock(). Upon failure, the caller must call 1824 1824 * rt_mutex_cleanup_proxy_lock(). 1825 1825 *
+1 -1
kernel/locking/rwsem.c
··· 1048 1048 1049 1049 /* 1050 1050 * If there were already threads queued before us and: 1051 - * 1) there are no no active locks, wake the front 1051 + * 1) there are no active locks, wake the front 1052 1052 * queued process(es) as the handoff bit might be set. 1053 1053 * 2) there are no active writers and some readers, the lock 1054 1054 * must be read owned; so we try to wake any read lock
+1 -1
kernel/locking/semaphore.c
··· 119 119 * @sem: the semaphore to be acquired 120 120 * 121 121 * Try to acquire the semaphore atomically. Returns 0 if the semaphore has 122 - * been acquired successfully or 1 if it it cannot be acquired. 122 + * been acquired successfully or 1 if it cannot be acquired. 123 123 * 124 124 * NOTE: This return value is inverted from both spin_trylock and 125 125 * mutex_trylock! Be careful about this when converting code.
+1 -1
kernel/sched/fair.c
··· 5126 5126 /* 5127 5127 * When a group wakes up we want to make sure that its quota is not already 5128 5128 * expired/exceeded, otherwise it may be allowed to steal additional ticks of 5129 - * runtime as update_curr() throttling can not not trigger until it's on-rq. 5129 + * runtime as update_curr() throttling can not trigger until it's on-rq. 5130 5130 */ 5131 5131 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) 5132 5132 {
+1 -1
kernel/sched/membarrier.c
··· 454 454 455 455 /* 456 456 * For each cpu runqueue, if the task's mm match @mm, ensure that all 457 - * @mm's membarrier state set bits are also set in in the runqueue's 457 + * @mm's membarrier state set bits are also set in the runqueue's 458 458 * membarrier state. This ensures that a runqueue scheduling 459 459 * between threads which are users of @mm has its membarrier state 460 460 * updated.
+4 -4
kernel/sysctl.c
··· 2962 2962 .data = &block_dump, 2963 2963 .maxlen = sizeof(block_dump), 2964 2964 .mode = 0644, 2965 - .proc_handler = proc_dointvec, 2965 + .proc_handler = proc_dointvec_minmax, 2966 2966 .extra1 = SYSCTL_ZERO, 2967 2967 }, 2968 2968 { ··· 2970 2970 .data = &sysctl_vfs_cache_pressure, 2971 2971 .maxlen = sizeof(sysctl_vfs_cache_pressure), 2972 2972 .mode = 0644, 2973 - .proc_handler = proc_dointvec, 2973 + .proc_handler = proc_dointvec_minmax, 2974 2974 .extra1 = SYSCTL_ZERO, 2975 2975 }, 2976 2976 #if defined(HAVE_ARCH_PICK_MMAP_LAYOUT) || \ ··· 2980 2980 .data = &sysctl_legacy_va_layout, 2981 2981 .maxlen = sizeof(sysctl_legacy_va_layout), 2982 2982 .mode = 0644, 2983 - .proc_handler = proc_dointvec, 2983 + .proc_handler = proc_dointvec_minmax, 2984 2984 .extra1 = SYSCTL_ZERO, 2985 2985 }, 2986 2986 #endif ··· 2990 2990 .data = &node_reclaim_mode, 2991 2991 .maxlen = sizeof(node_reclaim_mode), 2992 2992 .mode = 0644, 2993 - .proc_handler = proc_dointvec, 2993 + .proc_handler = proc_dointvec_minmax, 2994 2994 .extra1 = SYSCTL_ZERO, 2995 2995 }, 2996 2996 {
+1
kernel/trace/Makefile
··· 81 81 obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o 82 82 obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o 83 83 obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o 84 + obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o 84 85 obj-$(CONFIG_TRACEPOINTS) += power-traces.o 85 86 ifeq ($(CONFIG_PM),y) 86 87 obj-$(CONFIG_TRACEPOINTS) += rpm-traces.o
+11
kernel/trace/error_report-traces.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Error reporting trace points. 4 + * 5 + * Copyright (C) 2021, Google LLC. 6 + */ 7 + 8 + #define CREATE_TRACE_POINTS 9 + #include <trace/events/error_report.h> 10 + 11 + EXPORT_TRACEPOINT_SYMBOL_GPL(error_report_end);
+9
lib/Kconfig
··· 651 651 bool 652 652 select STACKTRACE 653 653 654 + config STACK_HASH_ORDER 655 + int "stack depot hash size (12 => 4KB, 20 => 1024KB)" 656 + range 12 20 657 + default 20 658 + depends on STACKDEPOT 659 + help 660 + Select the hash size as a power of 2 for the stackdepot hash table. 661 + Choose a lower value to reduce the memory impact. 662 + 654 663 config SBITMAP 655 664 bool 656 665
+1
lib/Kconfig.debug
··· 938 938 If in doubt, say "N". 939 939 940 940 source "lib/Kconfig.kasan" 941 + source "lib/Kconfig.kfence" 941 942 942 943 endmenu # "Memory Debugging" 943 944
+82
lib/Kconfig.kfence
··· 1 + # SPDX-License-Identifier: GPL-2.0-only 2 + 3 + config HAVE_ARCH_KFENCE 4 + bool 5 + 6 + menuconfig KFENCE 7 + bool "KFENCE: low-overhead sampling-based memory safety error detector" 8 + depends on HAVE_ARCH_KFENCE && (SLAB || SLUB) 9 + select STACKTRACE 10 + help 11 + KFENCE is a low-overhead sampling-based detector of heap out-of-bounds 12 + access, use-after-free, and invalid-free errors. KFENCE is designed 13 + to have negligible cost to permit enabling it in production 14 + environments. 15 + 16 + See <file:Documentation/dev-tools/kfence.rst> for more details. 17 + 18 + Note that, KFENCE is not a substitute for explicit testing with tools 19 + such as KASAN. KFENCE can detect a subset of bugs that KASAN can 20 + detect, albeit at very different performance profiles. If you can 21 + afford to use KASAN, continue using KASAN, for example in test 22 + environments. If your kernel targets production use, and cannot 23 + enable KASAN due to its cost, consider using KFENCE. 24 + 25 + if KFENCE 26 + 27 + config KFENCE_STATIC_KEYS 28 + bool "Use static keys to set up allocations" 29 + default y 30 + depends on JUMP_LABEL # To ensure performance, require jump labels 31 + help 32 + Use static keys (static branches) to set up KFENCE allocations. Using 33 + static keys is normally recommended, because it avoids a dynamic 34 + branch in the allocator's fast path. However, with very low sample 35 + intervals, or on systems that do not support jump labels, a dynamic 36 + branch may still be an acceptable performance trade-off. 37 + 38 + config KFENCE_SAMPLE_INTERVAL 39 + int "Default sample interval in milliseconds" 40 + default 100 41 + help 42 + The KFENCE sample interval determines the frequency with which heap 43 + allocations will be guarded by KFENCE. May be overridden via boot 44 + parameter "kfence.sample_interval". 
45 + 46 + Set this to 0 to disable KFENCE by default, in which case only 47 + setting "kfence.sample_interval" to a non-zero value enables KFENCE. 48 + 49 + config KFENCE_NUM_OBJECTS 50 + int "Number of guarded objects available" 51 + range 1 65535 52 + default 255 53 + help 54 + The number of guarded objects available. For each KFENCE object, 2 55 + pages are required; with one containing the object and two adjacent 56 + ones used as guard pages. 57 + 58 + config KFENCE_STRESS_TEST_FAULTS 59 + int "Stress testing of fault handling and error reporting" if EXPERT 60 + default 0 61 + help 62 + The inverse probability with which to randomly protect KFENCE object 63 + pages, resulting in spurious use-after-frees. The main purpose of 64 + this option is to stress test KFENCE with concurrent error reports 65 + and allocations/frees. A value of 0 disables stress testing logic. 66 + 67 + Only for KFENCE testing; set to 0 if you are not a KFENCE developer. 68 + 69 + config KFENCE_KUNIT_TEST 70 + tristate "KFENCE integration test suite" if !KUNIT_ALL_TESTS 71 + default KUNIT_ALL_TESTS 72 + depends on TRACEPOINTS && KUNIT 73 + help 74 + Test suite for KFENCE, testing various error detection scenarios with 75 + various allocation types, and checking that reports are correctly 76 + output to console. 77 + 78 + Say Y here if you want the test to be built into the kernel and run 79 + during boot; say M if you want the test to build as a module; say N 80 + if you are unsure. 81 + 82 + endif # KFENCE
-17
lib/Kconfig.ubsan
··· 112 112 This option enables -fsanitize=unreachable which checks for control 113 113 flow reaching an expected-to-be-unreachable position. 114 114 115 - config UBSAN_SIGNED_OVERFLOW 116 - bool "Perform checking for signed arithmetic overflow" 117 - default UBSAN 118 - depends on $(cc-option,-fsanitize=signed-integer-overflow) 119 - help 120 - This option enables -fsanitize=signed-integer-overflow which checks 121 - for overflow of any arithmetic operations with signed integers. 122 - 123 - config UBSAN_UNSIGNED_OVERFLOW 124 - bool "Perform checking for unsigned arithmetic overflow" 125 - depends on $(cc-option,-fsanitize=unsigned-integer-overflow) 126 - depends on !X86_32 # avoid excessive stack usage on x86-32/clang 127 - help 128 - This option enables -fsanitize=unsigned-integer-overflow which checks 129 - for overflow of any arithmetic operations with unsigned integers. This 130 - currently causes x86 to fail to boot. 131 - 132 115 config UBSAN_OBJECT_SIZE 133 116 bool "Perform checking for accesses beyond the end of objects" 134 117 default UBSAN
+3 -4
lib/cmdline.c
··· 228 228 { 229 229 unsigned int i, equals = 0; 230 230 int in_quote = 0, quoted = 0; 231 - char *next; 232 231 233 232 if (*args == '"') { 234 233 args++; ··· 265 266 266 267 if (args[i]) { 267 268 args[i] = '\0'; 268 - next = args + i + 1; 269 + args += i + 1; 269 270 } else 270 - next = args + i; 271 + args += i; 271 272 272 273 /* Chew up trailing spaces. */ 273 - return skip_spaces(next); 274 + return skip_spaces(args); 274 275 }
+2 -1
lib/genalloc.c
··· 81 81 * users set the same bit, one user will return remain bits, otherwise 82 82 * return 0. 83 83 */ 84 - static int bitmap_set_ll(unsigned long *map, unsigned long start, unsigned long nr) 84 + static unsigned long 85 + bitmap_set_ll(unsigned long *map, unsigned long start, unsigned long nr) 85 86 { 86 87 unsigned long *p = map + BIT_WORD(start); 87 88 const unsigned long size = start + nr;
+31 -6
lib/stackdepot.c
··· 31 31 #include <linux/stackdepot.h> 32 32 #include <linux/string.h> 33 33 #include <linux/types.h> 34 + #include <linux/memblock.h> 34 35 35 36 #define DEPOT_STACK_BITS (sizeof(depot_stack_handle_t) * 8) 36 37 ··· 142 141 return stack; 143 142 } 144 143 145 - #define STACK_HASH_ORDER 20 146 - #define STACK_HASH_SIZE (1L << STACK_HASH_ORDER) 144 + #define STACK_HASH_SIZE (1L << CONFIG_STACK_HASH_ORDER) 147 145 #define STACK_HASH_MASK (STACK_HASH_SIZE - 1) 148 146 #define STACK_HASH_SEED 0x9747b28c 149 147 150 - static struct stack_record *stack_table[STACK_HASH_SIZE] = { 151 - [0 ... STACK_HASH_SIZE - 1] = NULL 152 - }; 148 + static bool stack_depot_disable; 149 + static struct stack_record **stack_table; 150 + 151 + static int __init is_stack_depot_disabled(char *str) 152 + { 153 + int ret; 154 + 155 + ret = kstrtobool(str, &stack_depot_disable); 156 + if (!ret && stack_depot_disable) { 157 + pr_info("Stack Depot is disabled\n"); 158 + stack_table = NULL; 159 + } 160 + return 0; 161 + } 162 + early_param("stack_depot_disable", is_stack_depot_disabled); 163 + 164 + int __init stack_depot_init(void) 165 + { 166 + if (!stack_depot_disable) { 167 + size_t size = (STACK_HASH_SIZE * sizeof(struct stack_record *)); 168 + int i; 169 + 170 + stack_table = memblock_alloc(size, size); 171 + for (i = 0; i < STACK_HASH_SIZE; i++) 172 + stack_table[i] = NULL; 173 + } 174 + return 0; 175 + } 153 176 154 177 /* Calculate hash for a stack */ 155 178 static inline u32 hash_stack(unsigned long *entries, unsigned int size) ··· 267 242 unsigned long flags; 268 243 u32 hash; 269 244 270 - if (unlikely(nr_entries == 0)) 245 + if (unlikely(nr_entries == 0) || stack_depot_disable) 271 246 goto fast_exit; 272 247 273 248 hash = hash_stack(entries, nr_entries);
+101 -10
lib/test_kasan.c
··· 252 252 kfree(ptr); 253 253 } 254 254 255 - static void kmalloc_oob_krealloc_more(struct kunit *test) 255 + static void krealloc_more_oob_helper(struct kunit *test, 256 + size_t size1, size_t size2) 256 257 { 257 258 char *ptr1, *ptr2; 258 - size_t size1 = 17; 259 - size_t size2 = 19; 259 + size_t middle; 260 + 261 + KUNIT_ASSERT_LT(test, size1, size2); 262 + middle = size1 + (size2 - size1) / 2; 260 263 261 264 ptr1 = kmalloc(size1, GFP_KERNEL); 262 265 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); ··· 267 264 ptr2 = krealloc(ptr1, size2, GFP_KERNEL); 268 265 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2); 269 266 270 - KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x'); 267 + /* All offsets up to size2 must be accessible. */ 268 + ptr2[size1 - 1] = 'x'; 269 + ptr2[size1] = 'x'; 270 + ptr2[middle] = 'x'; 271 + ptr2[size2 - 1] = 'x'; 272 + 273 + /* Generic mode is precise, so unaligned size2 must be inaccessible. */ 274 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 275 + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2] = 'x'); 276 + 277 + /* For all modes first aligned offset after size2 must be inaccessible. */ 278 + KUNIT_EXPECT_KASAN_FAIL(test, 279 + ptr2[round_up(size2, KASAN_GRANULE_SIZE)] = 'x'); 280 + 271 281 kfree(ptr2); 272 282 } 273 283 274 - static void kmalloc_oob_krealloc_less(struct kunit *test) 284 + static void krealloc_less_oob_helper(struct kunit *test, 285 + size_t size1, size_t size2) 275 286 { 276 287 char *ptr1, *ptr2; 277 - size_t size1 = 17; 278 - size_t size2 = 15; 288 + size_t middle; 289 + 290 + KUNIT_ASSERT_LT(test, size2, size1); 291 + middle = size2 + (size1 - size2) / 2; 279 292 280 293 ptr1 = kmalloc(size1, GFP_KERNEL); 281 294 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); ··· 299 280 ptr2 = krealloc(ptr1, size2, GFP_KERNEL); 300 281 KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2); 301 282 302 - KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x'); 283 + /* Must be accessible for all modes. 
*/ 284 + ptr2[size2 - 1] = 'x'; 285 + 286 + /* Generic mode is precise, so unaligned size2 must be inaccessible. */ 287 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 288 + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2] = 'x'); 289 + 290 + /* For all modes first aligned offset after size2 must be inaccessible. */ 291 + KUNIT_EXPECT_KASAN_FAIL(test, 292 + ptr2[round_up(size2, KASAN_GRANULE_SIZE)] = 'x'); 293 + 294 + /* 295 + * For all modes all size2, middle, and size1 should land in separate 296 + * granules and thus the latter two offsets should be inaccessible. 297 + */ 298 + KUNIT_EXPECT_LE(test, round_up(size2, KASAN_GRANULE_SIZE), 299 + round_down(middle, KASAN_GRANULE_SIZE)); 300 + KUNIT_EXPECT_LE(test, round_up(middle, KASAN_GRANULE_SIZE), 301 + round_down(size1, KASAN_GRANULE_SIZE)); 302 + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[middle] = 'x'); 303 + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size1 - 1] = 'x'); 304 + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size1] = 'x'); 305 + 303 306 kfree(ptr2); 307 + } 308 + 309 + static void krealloc_more_oob(struct kunit *test) 310 + { 311 + krealloc_more_oob_helper(test, 201, 235); 312 + } 313 + 314 + static void krealloc_less_oob(struct kunit *test) 315 + { 316 + krealloc_less_oob_helper(test, 235, 201); 317 + } 318 + 319 + static void krealloc_pagealloc_more_oob(struct kunit *test) 320 + { 321 + /* page_alloc fallback in only implemented for SLUB. */ 322 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_SLUB); 323 + 324 + krealloc_more_oob_helper(test, KMALLOC_MAX_CACHE_SIZE + 201, 325 + KMALLOC_MAX_CACHE_SIZE + 235); 326 + } 327 + 328 + static void krealloc_pagealloc_less_oob(struct kunit *test) 329 + { 330 + /* page_alloc fallback in only implemented for SLUB. 
*/ 331 + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_SLUB); 332 + 333 + krealloc_less_oob_helper(test, KMALLOC_MAX_CACHE_SIZE + 235, 334 + KMALLOC_MAX_CACHE_SIZE + 201); 335 + } 336 + 337 + /* 338 + * Check that krealloc() detects a use-after-free, returns NULL, 339 + * and doesn't unpoison the freed object. 340 + */ 341 + static void krealloc_uaf(struct kunit *test) 342 + { 343 + char *ptr1, *ptr2; 344 + int size1 = 201; 345 + int size2 = 235; 346 + 347 + ptr1 = kmalloc(size1, GFP_KERNEL); 348 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1); 349 + kfree(ptr1); 350 + 351 + KUNIT_EXPECT_KASAN_FAIL(test, ptr2 = krealloc(ptr1, size2, GFP_KERNEL)); 352 + KUNIT_ASSERT_PTR_EQ(test, (void *)ptr2, NULL); 353 + KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)ptr1); 304 354 } 305 355 306 356 static void kmalloc_oob_16(struct kunit *test) ··· 1065 977 KUNIT_CASE(pagealloc_oob_right), 1066 978 KUNIT_CASE(pagealloc_uaf), 1067 979 KUNIT_CASE(kmalloc_large_oob_right), 1068 - KUNIT_CASE(kmalloc_oob_krealloc_more), 1069 - KUNIT_CASE(kmalloc_oob_krealloc_less), 980 + KUNIT_CASE(krealloc_more_oob), 981 + KUNIT_CASE(krealloc_less_oob), 982 + KUNIT_CASE(krealloc_pagealloc_more_oob), 983 + KUNIT_CASE(krealloc_pagealloc_less_oob), 984 + KUNIT_CASE(krealloc_uaf), 1070 985 KUNIT_CASE(kmalloc_oob_16), 1071 986 KUNIT_CASE(kmalloc_uaf_16), 1072 987 KUNIT_CASE(kmalloc_oob_in_memset),
-49
lib/test_ubsan.c
··· 11 11 #config, IS_ENABLED(config) ? "y" : "n"); \ 12 12 } while (0) 13 13 14 - static void test_ubsan_add_overflow(void) 15 - { 16 - volatile int val = INT_MAX; 17 - volatile unsigned int uval = UINT_MAX; 18 - 19 - UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW); 20 - val += 2; 21 - 22 - UBSAN_TEST(CONFIG_UBSAN_UNSIGNED_OVERFLOW); 23 - uval += 2; 24 - } 25 - 26 - static void test_ubsan_sub_overflow(void) 27 - { 28 - volatile int val = INT_MIN; 29 - volatile unsigned int uval = 0; 30 - volatile int val2 = 2; 31 - 32 - UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW); 33 - val -= val2; 34 - 35 - UBSAN_TEST(CONFIG_UBSAN_UNSIGNED_OVERFLOW); 36 - uval -= val2; 37 - } 38 - 39 - static void test_ubsan_mul_overflow(void) 40 - { 41 - volatile int val = INT_MAX / 2; 42 - volatile unsigned int uval = UINT_MAX / 2; 43 - 44 - UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW); 45 - val *= 3; 46 - 47 - UBSAN_TEST(CONFIG_UBSAN_UNSIGNED_OVERFLOW); 48 - uval *= 3; 49 - } 50 - 51 - static void test_ubsan_negate_overflow(void) 52 - { 53 - volatile int val = INT_MIN; 54 - 55 - UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW); 56 - val = -val; 57 - } 58 - 59 14 static void test_ubsan_divrem_overflow(void) 60 15 { 61 16 volatile int val = 16; ··· 110 155 } 111 156 112 157 static const test_ubsan_fp test_ubsan_array[] = { 113 - test_ubsan_add_overflow, 114 - test_ubsan_sub_overflow, 115 - test_ubsan_mul_overflow, 116 - test_ubsan_negate_overflow, 117 158 test_ubsan_shift_out_of_bounds, 118 159 test_ubsan_out_of_bounds, 119 160 test_ubsan_load_invalid_value,
-68
lib/ubsan.c
··· 163 163 } 164 164 } 165 165 166 - static void handle_overflow(struct overflow_data *data, void *lhs, 167 - void *rhs, char op) 168 - { 169 - 170 - struct type_descriptor *type = data->type; 171 - char lhs_val_str[VALUE_LENGTH]; 172 - char rhs_val_str[VALUE_LENGTH]; 173 - 174 - if (suppress_report(&data->location)) 175 - return; 176 - 177 - ubsan_prologue(&data->location, type_is_signed(type) ? 178 - "signed-integer-overflow" : 179 - "unsigned-integer-overflow"); 180 - 181 - val_to_string(lhs_val_str, sizeof(lhs_val_str), type, lhs); 182 - val_to_string(rhs_val_str, sizeof(rhs_val_str), type, rhs); 183 - pr_err("%s %c %s cannot be represented in type %s\n", 184 - lhs_val_str, 185 - op, 186 - rhs_val_str, 187 - type->type_name); 188 - 189 - ubsan_epilogue(); 190 - } 191 - 192 - void __ubsan_handle_add_overflow(void *data, 193 - void *lhs, void *rhs) 194 - { 195 - 196 - handle_overflow(data, lhs, rhs, '+'); 197 - } 198 - EXPORT_SYMBOL(__ubsan_handle_add_overflow); 199 - 200 - void __ubsan_handle_sub_overflow(void *data, 201 - void *lhs, void *rhs) 202 - { 203 - handle_overflow(data, lhs, rhs, '-'); 204 - } 205 - EXPORT_SYMBOL(__ubsan_handle_sub_overflow); 206 - 207 - void __ubsan_handle_mul_overflow(void *data, 208 - void *lhs, void *rhs) 209 - { 210 - handle_overflow(data, lhs, rhs, '*'); 211 - } 212 - EXPORT_SYMBOL(__ubsan_handle_mul_overflow); 213 - 214 - void __ubsan_handle_negate_overflow(void *_data, void *old_val) 215 - { 216 - struct overflow_data *data = _data; 217 - char old_val_str[VALUE_LENGTH]; 218 - 219 - if (suppress_report(&data->location)) 220 - return; 221 - 222 - ubsan_prologue(&data->location, "negation-overflow"); 223 - 224 - val_to_string(old_val_str, sizeof(old_val_str), data->type, old_val); 225 - 226 - pr_err("negation of %s cannot be represented in type %s:\n", 227 - old_val_str, data->type->type_name); 228 - 229 - ubsan_epilogue(); 230 - } 231 - EXPORT_SYMBOL(__ubsan_handle_negate_overflow); 232 - 233 - 234 166 void 
__ubsan_handle_divrem_overflow(void *_data, void *lhs, void *rhs) 235 167 { 236 168 struct overflow_data *data = _data;
+1
mm/Makefile
··· 81 81 obj-$(CONFIG_SLAB) += slab.o 82 82 obj-$(CONFIG_SLUB) += slub.o 83 83 obj-$(CONFIG_KASAN) += kasan/ 84 + obj-$(CONFIG_KFENCE) += kfence/ 84 85 obj-$(CONFIG_FAILSLAB) += failslab.o 85 86 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o 86 87 obj-$(CONFIG_MEMTEST) += memtest.o
+2 -1
mm/backing-dev.c
··· 8 8 #include <linux/fs.h> 9 9 #include <linux/pagemap.h> 10 10 #include <linux/mm.h> 11 + #include <linux/sched/mm.h> 11 12 #include <linux/sched.h> 12 13 #include <linux/module.h> 13 14 #include <linux/writeback.h> ··· 579 578 { 580 579 struct bdi_writeback *wb; 581 580 582 - might_sleep_if(gfpflags_allow_blocking(gfp)); 581 + might_alloc(gfp); 583 582 584 583 if (!memcg_css->parent) 585 584 return &bdi->wb;
+39 -23
mm/cma.c
··· 94 94 95 95 static void __init cma_activate_area(struct cma *cma) 96 96 { 97 - unsigned long base_pfn = cma->base_pfn, pfn = base_pfn; 98 - unsigned i = cma->count >> pageblock_order; 97 + unsigned long base_pfn = cma->base_pfn, pfn; 99 98 struct zone *zone; 100 99 101 100 cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL); 102 101 if (!cma->bitmap) 103 102 goto out_error; 104 103 105 - WARN_ON_ONCE(!pfn_valid(pfn)); 106 - zone = page_zone(pfn_to_page(pfn)); 104 + /* 105 + * alloc_contig_range() requires the pfn range specified to be in the 106 + * same zone. Simplify by forcing the entire CMA resv range to be in the 107 + * same zone. 108 + */ 109 + WARN_ON_ONCE(!pfn_valid(base_pfn)); 110 + zone = page_zone(pfn_to_page(base_pfn)); 111 + for (pfn = base_pfn + 1; pfn < base_pfn + cma->count; pfn++) { 112 + WARN_ON_ONCE(!pfn_valid(pfn)); 113 + if (page_zone(pfn_to_page(pfn)) != zone) 114 + goto not_in_zone; 115 + } 107 116 108 - do { 109 - unsigned j; 110 - 111 - base_pfn = pfn; 112 - for (j = pageblock_nr_pages; j; --j, pfn++) { 113 - WARN_ON_ONCE(!pfn_valid(pfn)); 114 - /* 115 - * alloc_contig_range requires the pfn range 116 - * specified to be in the same zone. Make this 117 - * simple by forcing the entire CMA resv range 118 - * to be in the same zone. 119 - */ 120 - if (page_zone(pfn_to_page(pfn)) != zone) 121 - goto not_in_zone; 122 - } 123 - init_cma_reserved_pageblock(pfn_to_page(base_pfn)); 124 - } while (--i); 117 + for (pfn = base_pfn; pfn < base_pfn + cma->count; 118 + pfn += pageblock_nr_pages) 119 + init_cma_reserved_pageblock(pfn_to_page(pfn)); 125 120 126 121 mutex_init(&cma->lock); 127 122 ··· 130 135 not_in_zone: 131 136 bitmap_free(cma->bitmap); 132 137 out_error: 138 + /* Expose all pages to the buddy, they are useless for CMA. 
*/ 139 + for (pfn = base_pfn; pfn < base_pfn + cma->count; pfn++) 140 + free_reserved_page(pfn_to_page(pfn)); 141 + totalcma_pages -= cma->count; 133 142 cma->count = 0; 134 143 pr_err("CMA area %s could not be activated\n", cma->name); 135 144 return; ··· 335 336 limit = highmem_start; 336 337 } 337 338 339 + /* 340 + * If there is enough memory, try a bottom-up allocation first. 341 + * It will place the new cma area close to the start of the node 342 + * and guarantee that the compaction is moving pages out of the 343 + * cma area and not into it. 344 + * Avoid using first 4GB to not interfere with constrained zones 345 + * like DMA/DMA32. 346 + */ 347 + #ifdef CONFIG_PHYS_ADDR_T_64BIT 348 + if (!memblock_bottom_up() && memblock_end >= SZ_4G + size) { 349 + memblock_set_bottom_up(true); 350 + addr = memblock_alloc_range_nid(size, alignment, SZ_4G, 351 + limit, nid, true); 352 + memblock_set_bottom_up(false); 353 + } 354 + #endif 355 + 338 356 if (!addr) { 339 357 addr = memblock_alloc_range_nid(size, alignment, base, 340 358 limit, nid, true); ··· 500 484 } 501 485 502 486 if (ret && !no_warn) { 503 - pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n", 504 - __func__, count, ret); 487 + pr_err("%s: %s: alloc failed, req-size: %zu pages, ret: %d\n", 488 + __func__, cma->name, count, ret); 505 489 cma_debug_show_areas(cma); 506 490 } 507 491
+2 -1
mm/dmapool.c
··· 28 28 #include <linux/mutex.h> 29 29 #include <linux/poison.h> 30 30 #include <linux/sched.h> 31 + #include <linux/sched/mm.h> 31 32 #include <linux/slab.h> 32 33 #include <linux/stat.h> 33 34 #include <linux/spinlock.h> ··· 320 319 size_t offset; 321 320 void *retval; 322 321 323 - might_sleep_if(gfpflags_allow_blocking(mem_flags)); 322 + might_alloc(mem_flags); 324 323 325 324 spin_lock_irqsave(&pool->lock, flags); 326 325 list_for_each_entry(page, &pool->page_list, page_list) {
+6 -6
mm/early_ioremap.c
··· 181 181 } 182 182 } 183 183 184 - if (WARN(slot < 0, "early_iounmap(%p, %08lx) not found slot\n", 185 - addr, size)) 184 + if (WARN(slot < 0, "%s(%p, %08lx) not found slot\n", 185 + __func__, addr, size)) 186 186 return; 187 187 188 188 if (WARN(prev_size[slot] != size, 189 - "early_iounmap(%p, %08lx) [%d] size not consistent %08lx\n", 190 - addr, size, slot, prev_size[slot])) 189 + "%s(%p, %08lx) [%d] size not consistent %08lx\n", 190 + __func__, addr, size, slot, prev_size[slot])) 191 191 return; 192 192 193 - WARN(early_ioremap_debug, "early_iounmap(%p, %08lx) [%d]\n", 194 - addr, size, slot); 193 + WARN(early_ioremap_debug, "%s(%p, %08lx) [%d]\n", 194 + __func__, addr, size, slot); 195 195 196 196 virt_addr = (unsigned long)addr; 197 197 if (WARN_ON(virt_addr < fix_to_virt(FIX_BTMAP_BEGIN)))
+232 -111
mm/filemap.c
··· 1658 1658 } 1659 1659 EXPORT_SYMBOL(page_cache_prev_miss); 1660 1660 1661 - /** 1662 - * find_get_entry - find and get a page cache entry 1661 + /* 1662 + * mapping_get_entry - Get a page cache entry. 1663 1663 * @mapping: the address_space to search 1664 1664 * @index: The page cache index. 1665 1665 * ··· 1671 1671 * 1672 1672 * Return: The head page or shadow entry, %NULL if nothing is found. 1673 1673 */ 1674 - struct page *find_get_entry(struct address_space *mapping, pgoff_t index) 1674 + static struct page *mapping_get_entry(struct address_space *mapping, 1675 + pgoff_t index) 1675 1676 { 1676 1677 XA_STATE(xas, &mapping->i_pages, index); 1677 1678 struct page *page; ··· 1709 1708 } 1710 1709 1711 1710 /** 1712 - * find_lock_entry - Locate and lock a page cache entry. 1713 - * @mapping: The address_space to search. 1714 - * @index: The page cache index. 1715 - * 1716 - * Looks up the page at @mapping & @index. If there is a page in the 1717 - * cache, the head page is returned locked and with an increased refcount. 1718 - * 1719 - * If the slot holds a shadow entry of a previously evicted page, or a 1720 - * swap entry from shmem/tmpfs, it is returned. 1721 - * 1722 - * Context: May sleep. 1723 - * Return: The head page or shadow entry, %NULL if nothing is found. 1724 - */ 1725 - struct page *find_lock_entry(struct address_space *mapping, pgoff_t index) 1726 - { 1727 - struct page *page; 1728 - 1729 - repeat: 1730 - page = find_get_entry(mapping, index); 1731 - if (page && !xa_is_value(page)) { 1732 - lock_page(page); 1733 - /* Has the page been truncated? */ 1734 - if (unlikely(page->mapping != mapping)) { 1735 - unlock_page(page); 1736 - put_page(page); 1737 - goto repeat; 1738 - } 1739 - VM_BUG_ON_PAGE(!thp_contains(page, index), page); 1740 - } 1741 - return page; 1742 - } 1743 - 1744 - /** 1745 1711 * pagecache_get_page - Find and get a reference to a page. 1746 1712 * @mapping: The address_space to search. 1747 1713 * @index: The page index. 
··· 1723 1755 * * %FGP_LOCK - The page is returned locked. 1724 1756 * * %FGP_HEAD - If the page is present and a THP, return the head page 1725 1757 * rather than the exact page specified by the index. 1758 + * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it 1759 + * instead of allocating a new page to replace it. 1726 1760 * * %FGP_CREAT - If no page is present then a new page is allocated using 1727 1761 * @gfp_mask and added to the page cache and the VM's LRU list. 1728 1762 * The page is returned locked and with an increased refcount. ··· 1748 1778 struct page *page; 1749 1779 1750 1780 repeat: 1751 - page = find_get_entry(mapping, index); 1752 - if (xa_is_value(page)) 1781 + page = mapping_get_entry(mapping, index); 1782 + if (xa_is_value(page)) { 1783 + if (fgp_flags & FGP_ENTRY) 1784 + return page; 1753 1785 page = NULL; 1786 + } 1754 1787 if (!page) 1755 1788 goto no_page; 1756 1789 ··· 1825 1852 } 1826 1853 EXPORT_SYMBOL(pagecache_get_page); 1827 1854 1855 + static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max, 1856 + xa_mark_t mark) 1857 + { 1858 + struct page *page; 1859 + 1860 + retry: 1861 + if (mark == XA_PRESENT) 1862 + page = xas_find(xas, max); 1863 + else 1864 + page = xas_find_marked(xas, max, mark); 1865 + 1866 + if (xas_retry(xas, page)) 1867 + goto retry; 1868 + /* 1869 + * A shadow entry of a recently evicted page, a swap 1870 + * entry from shmem/tmpfs or a DAX entry. Return it 1871 + * without attempting to raise page count. 1872 + */ 1873 + if (!page || xa_is_value(page)) 1874 + return page; 1875 + 1876 + if (!page_cache_get_speculative(page)) 1877 + goto reset; 1878 + 1879 + /* Has the page moved or been split? 
*/ 1880 + if (unlikely(page != xas_reload(xas))) { 1881 + put_page(page); 1882 + goto reset; 1883 + } 1884 + 1885 + return page; 1886 + reset: 1887 + xas_reset(xas); 1888 + goto retry; 1889 + } 1890 + 1828 1891 /** 1829 1892 * find_get_entries - gang pagecache lookup 1830 1893 * @mapping: The address_space to search 1831 1894 * @start: The starting page cache index 1832 - * @nr_entries: The maximum number of entries 1833 - * @entries: Where the resulting entries are placed 1895 + * @end: The final page index (inclusive). 1896 + * @pvec: Where the resulting entries are placed. 1834 1897 * @indices: The cache indices corresponding to the entries in @entries 1835 1898 * 1836 - * find_get_entries() will search for and return a group of up to 1837 - * @nr_entries entries in the mapping. The entries are placed at 1838 - * @entries. find_get_entries() takes a reference against any actual 1839 - * pages it returns. 1899 + * find_get_entries() will search for and return a batch of entries in 1900 + * the mapping. The entries are placed in @pvec. find_get_entries() 1901 + * takes a reference on any actual pages it returns. 1840 1902 * 1841 1903 * The search returns a group of mapping-contiguous page cache entries 1842 1904 * with ascending indexes. There may be holes in the indices due to ··· 1887 1879 * 1888 1880 * Return: the number of pages and shadow entries which were found. 
1889 1881 */ 1890 - unsigned find_get_entries(struct address_space *mapping, 1891 - pgoff_t start, unsigned int nr_entries, 1892 - struct page **entries, pgoff_t *indices) 1882 + unsigned find_get_entries(struct address_space *mapping, pgoff_t start, 1883 + pgoff_t end, struct pagevec *pvec, pgoff_t *indices) 1893 1884 { 1894 1885 XA_STATE(xas, &mapping->i_pages, start); 1895 1886 struct page *page; 1896 1887 unsigned int ret = 0; 1897 - 1898 - if (!nr_entries) 1899 - return 0; 1888 + unsigned nr_entries = PAGEVEC_SIZE; 1900 1889 1901 1890 rcu_read_lock(); 1902 - xas_for_each(&xas, page, ULONG_MAX) { 1903 - if (xas_retry(&xas, page)) 1904 - continue; 1905 - /* 1906 - * A shadow entry of a recently evicted page, a swap 1907 - * entry from shmem/tmpfs or a DAX entry. Return it 1908 - * without attempting to raise page count. 1909 - */ 1910 - if (xa_is_value(page)) 1911 - goto export; 1912 - 1913 - if (!page_cache_get_speculative(page)) 1914 - goto retry; 1915 - 1916 - /* Has the page moved or been split? */ 1917 - if (unlikely(page != xas_reload(&xas))) 1918 - goto put_page; 1919 - 1891 + while ((page = find_get_entry(&xas, end, XA_PRESENT))) { 1920 1892 /* 1921 1893 * Terminate early on finding a THP, to allow the caller to 1922 1894 * handle it all at once; but continue if this is hugetlbfs. 
1923 1895 */ 1924 - if (PageTransHuge(page) && !PageHuge(page)) { 1896 + if (!xa_is_value(page) && PageTransHuge(page) && 1897 + !PageHuge(page)) { 1925 1898 page = find_subpage(page, xas.xa_index); 1926 1899 nr_entries = ret + 1; 1927 1900 } 1928 - export: 1901 + 1929 1902 indices[ret] = xas.xa_index; 1930 - entries[ret] = page; 1903 + pvec->pages[ret] = page; 1931 1904 if (++ret == nr_entries) 1932 1905 break; 1933 - continue; 1934 - put_page: 1935 - put_page(page); 1936 - retry: 1937 - xas_reset(&xas); 1938 1906 } 1939 1907 rcu_read_unlock(); 1908 + 1909 + pvec->nr = ret; 1940 1910 return ret; 1911 + } 1912 + 1913 + /** 1914 + * find_lock_entries - Find a batch of pagecache entries. 1915 + * @mapping: The address_space to search. 1916 + * @start: The starting page cache index. 1917 + * @end: The final page index (inclusive). 1918 + * @pvec: Where the resulting entries are placed. 1919 + * @indices: The cache indices of the entries in @pvec. 1920 + * 1921 + * find_lock_entries() will return a batch of entries from @mapping. 1922 + * Swap, shadow and DAX entries are included. Pages are returned 1923 + * locked and with an incremented refcount. Pages which are locked by 1924 + * somebody else or under writeback are skipped. Only the head page of 1925 + * a THP is returned. Pages which are partially outside the range are 1926 + * not returned. 1927 + * 1928 + * The entries have ascending indexes. The indices may not be consecutive 1929 + * due to not-present entries, THP pages, pages which could not be locked 1930 + * or pages under writeback. 1931 + * 1932 + * Return: The number of entries which were found. 
1933 + */ 1934 + unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, 1935 + pgoff_t end, struct pagevec *pvec, pgoff_t *indices) 1936 + { 1937 + XA_STATE(xas, &mapping->i_pages, start); 1938 + struct page *page; 1939 + 1940 + rcu_read_lock(); 1941 + while ((page = find_get_entry(&xas, end, XA_PRESENT))) { 1942 + if (!xa_is_value(page)) { 1943 + if (page->index < start) 1944 + goto put; 1945 + VM_BUG_ON_PAGE(page->index != xas.xa_index, page); 1946 + if (page->index + thp_nr_pages(page) - 1 > end) 1947 + goto put; 1948 + if (!trylock_page(page)) 1949 + goto put; 1950 + if (page->mapping != mapping || PageWriteback(page)) 1951 + goto unlock; 1952 + VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index), 1953 + page); 1954 + } 1955 + indices[pvec->nr] = xas.xa_index; 1956 + if (!pagevec_add(pvec, page)) 1957 + break; 1958 + goto next; 1959 + unlock: 1960 + unlock_page(page); 1961 + put: 1962 + put_page(page); 1963 + next: 1964 + if (!xa_is_value(page) && PageTransHuge(page)) 1965 + xas_set(&xas, page->index + thp_nr_pages(page)); 1966 + } 1967 + rcu_read_unlock(); 1968 + 1969 + return pagevec_count(pvec); 1941 1970 } 1942 1971 1943 1972 /** ··· 2010 1965 return 0; 2011 1966 2012 1967 rcu_read_lock(); 2013 - xas_for_each(&xas, page, end) { 2014 - if (xas_retry(&xas, page)) 2015 - continue; 1968 + while ((page = find_get_entry(&xas, end, XA_PRESENT))) { 2016 1969 /* Skip over shadow, swap and DAX entries */ 2017 1970 if (xa_is_value(page)) 2018 1971 continue; 2019 - 2020 - if (!page_cache_get_speculative(page)) 2021 - goto retry; 2022 - 2023 - /* Has the page moved or been split? 
*/ 2024 - if (unlikely(page != xas_reload(&xas))) 2025 - goto put_page; 2026 1972 2027 1973 pages[ret] = find_subpage(page, xas.xa_index); 2028 1974 if (++ret == nr_pages) { 2029 1975 *start = xas.xa_index + 1; 2030 1976 goto out; 2031 1977 } 2032 - continue; 2033 - put_page: 2034 - put_page(page); 2035 - retry: 2036 - xas_reset(&xas); 2037 1978 } 2038 1979 2039 1980 /* ··· 2093 2062 EXPORT_SYMBOL(find_get_pages_contig); 2094 2063 2095 2064 /** 2096 - * find_get_pages_range_tag - find and return pages in given range matching @tag 2065 + * find_get_pages_range_tag - Find and return head pages matching @tag. 2097 2066 * @mapping: the address_space to search 2098 2067 * @index: the starting page index 2099 2068 * @end: The final page index (inclusive) ··· 2101 2070 * @nr_pages: the maximum number of pages 2102 2071 * @pages: where the resulting pages are placed 2103 2072 * 2104 - * Like find_get_pages, except we only return pages which are tagged with 2105 - * @tag. We update @index to index the next page for the traversal. 2073 + * Like find_get_pages(), except we only return head pages which are tagged 2074 + * with @tag. @index is updated to the index immediately after the last 2075 + * page we return, ready for the next iteration. 2106 2076 * 2107 2077 * Return: the number of pages which were found. 2108 2078 */ ··· 2119 2087 return 0; 2120 2088 2121 2089 rcu_read_lock(); 2122 - xas_for_each_marked(&xas, page, end, tag) { 2123 - if (xas_retry(&xas, page)) 2124 - continue; 2090 + while ((page = find_get_entry(&xas, end, tag))) { 2125 2091 /* 2126 2092 * Shadow entries should never be tagged, but this iteration 2127 2093 * is lockless so there is a window for page reclaim to evict ··· 2128 2098 if (xa_is_value(page)) 2129 2099 continue; 2130 2100 2131 - if (!page_cache_get_speculative(page)) 2132 - goto retry; 2133 - 2134 - /* Has the page moved or been split? 
*/ 2135 - if (unlikely(page != xas_reload(&xas))) 2136 - goto put_page; 2137 - 2138 - pages[ret] = find_subpage(page, xas.xa_index); 2101 + pages[ret] = page; 2139 2102 if (++ret == nr_pages) { 2140 - *index = xas.xa_index + 1; 2103 + *index = page->index + thp_nr_pages(page); 2141 2104 goto out; 2142 2105 } 2143 - continue; 2144 - put_page: 2145 - put_page(page); 2146 - retry: 2147 - xas_reset(&xas); 2148 2106 } 2149 2107 2150 2108 /* ··· 2609 2591 return filemap_read(iocb, iter, retval); 2610 2592 } 2611 2593 EXPORT_SYMBOL(generic_file_read_iter); 2594 + 2595 + static inline loff_t page_seek_hole_data(struct xa_state *xas, 2596 + struct address_space *mapping, struct page *page, 2597 + loff_t start, loff_t end, bool seek_data) 2598 + { 2599 + const struct address_space_operations *ops = mapping->a_ops; 2600 + size_t offset, bsz = i_blocksize(mapping->host); 2601 + 2602 + if (xa_is_value(page) || PageUptodate(page)) 2603 + return seek_data ? start : end; 2604 + if (!ops->is_partially_uptodate) 2605 + return seek_data ? end : start; 2606 + 2607 + xas_pause(xas); 2608 + rcu_read_unlock(); 2609 + lock_page(page); 2610 + if (unlikely(page->mapping != mapping)) 2611 + goto unlock; 2612 + 2613 + offset = offset_in_thp(page, start) & ~(bsz - 1); 2614 + 2615 + do { 2616 + if (ops->is_partially_uptodate(page, offset, bsz) == seek_data) 2617 + break; 2618 + start = (start + bsz) & ~(bsz - 1); 2619 + offset += bsz; 2620 + } while (offset < thp_size(page)); 2621 + unlock: 2622 + unlock_page(page); 2623 + rcu_read_lock(); 2624 + return start; 2625 + } 2626 + 2627 + static inline 2628 + unsigned int seek_page_size(struct xa_state *xas, struct page *page) 2629 + { 2630 + if (xa_is_value(page)) 2631 + return PAGE_SIZE << xa_get_order(xas->xa, xas->xa_index); 2632 + return thp_size(page); 2633 + } 2634 + 2635 + /** 2636 + * mapping_seek_hole_data - Seek for SEEK_DATA / SEEK_HOLE in the page cache. 2637 + * @mapping: Address space to search. 2638 + * @start: First byte to consider. 
2639 + * @end: Limit of search (exclusive). 2640 + * @whence: Either SEEK_HOLE or SEEK_DATA. 2641 + * 2642 + * If the page cache knows which blocks contain holes and which blocks 2643 + * contain data, your filesystem can use this function to implement 2644 + * SEEK_HOLE and SEEK_DATA. This is useful for filesystems which are 2645 + * entirely memory-based such as tmpfs, and filesystems which support 2646 + * unwritten extents. 2647 + * 2648 + * Return: The requested offset on successs, or -ENXIO if @whence specifies 2649 + * SEEK_DATA and there is no data after @start. There is an implicit hole 2650 + * after @end - 1, so SEEK_HOLE returns @end if all the bytes between @start 2651 + * and @end contain data. 2652 + */ 2653 + loff_t mapping_seek_hole_data(struct address_space *mapping, loff_t start, 2654 + loff_t end, int whence) 2655 + { 2656 + XA_STATE(xas, &mapping->i_pages, start >> PAGE_SHIFT); 2657 + pgoff_t max = (end - 1) / PAGE_SIZE; 2658 + bool seek_data = (whence == SEEK_DATA); 2659 + struct page *page; 2660 + 2661 + if (end <= start) 2662 + return -ENXIO; 2663 + 2664 + rcu_read_lock(); 2665 + while ((page = find_get_entry(&xas, max, XA_PRESENT))) { 2666 + loff_t pos = xas.xa_index * PAGE_SIZE; 2667 + 2668 + if (start < pos) { 2669 + if (!seek_data) 2670 + goto unlock; 2671 + start = pos; 2672 + } 2673 + 2674 + pos += seek_page_size(&xas, page); 2675 + start = page_seek_hole_data(&xas, mapping, page, start, pos, 2676 + seek_data); 2677 + if (start < pos) 2678 + goto unlock; 2679 + if (!xa_is_value(page)) 2680 + put_page(page); 2681 + } 2682 + rcu_read_unlock(); 2683 + 2684 + if (seek_data) 2685 + return -ENXIO; 2686 + goto out; 2687 + 2688 + unlock: 2689 + rcu_read_unlock(); 2690 + if (!xa_is_value(page)) 2691 + put_page(page); 2692 + out: 2693 + if (start > end) 2694 + return end; 2695 + return start; 2696 + } 2612 2697 2613 2698 #ifdef CONFIG_MMU 2614 2699 #define MMAP_LOTSAMISS (100)
+3 -3
mm/huge_memory.c
··· 668 668 * available 669 669 * never: never stall for any thp allocation 670 670 */ 671 - static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) 671 + gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma) 672 672 { 673 - const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); 673 + const bool vma_madvised = vma && (vma->vm_flags & VM_HUGEPAGE); 674 674 675 675 /* Always do synchronous compaction */ 676 676 if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) ··· 762 762 } 763 763 return ret; 764 764 } 765 - gfp = alloc_hugepage_direct_gfpmask(vma); 765 + gfp = vma_thp_gfp_mask(vma); 766 766 page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER); 767 767 if (unlikely(!page)) { 768 768 count_vm_event(THP_FAULT_FALLBACK);
+2 -2
mm/internal.h
··· 60 60 force_page_cache_ra(&ractl, &file->f_ra, nr_to_read); 61 61 } 62 62 63 - struct page *find_get_entry(struct address_space *mapping, pgoff_t index); 64 - struct page *find_lock_entry(struct address_space *mapping, pgoff_t index); 63 + unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, 64 + pgoff_t end, struct pagevec *pvec, pgoff_t *indices); 65 65 66 66 /** 67 67 * page_evictable - test whether a page is evictable
+139 -56
mm/kasan/common.c
··· 210 210 *size = optimal_size; 211 211 } 212 212 213 + void __kasan_cache_create_kmalloc(struct kmem_cache *cache) 214 + { 215 + cache->kasan_info.is_kmalloc = true; 216 + } 217 + 213 218 size_t __kasan_metadata_size(struct kmem_cache *cache) 214 219 { 215 220 if (!kasan_stack_collection_enabled()) ··· 261 256 262 257 void __kasan_poison_object_data(struct kmem_cache *cache, void *object) 263 258 { 264 - kasan_poison(object, cache->object_size, KASAN_KMALLOC_REDZONE); 259 + kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE), 260 + KASAN_KMALLOC_REDZONE); 265 261 } 266 262 267 263 /* ··· 279 273 * based on objects indexes, so that objects that are next to each other 280 274 * get different tags. 281 275 */ 282 - static u8 assign_tag(struct kmem_cache *cache, const void *object, 283 - bool init, bool keep_tag) 276 + static inline u8 assign_tag(struct kmem_cache *cache, 277 + const void *object, bool init) 284 278 { 285 279 if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 286 280 return 0xff; 287 - 288 - /* 289 - * 1. When an object is kmalloc()'ed, two hooks are called: 290 - * kasan_slab_alloc() and kasan_kmalloc(). We assign the 291 - * tag only in the first one. 292 - * 2. We reuse the same tag for krealloc'ed objects. 
293 - */ 294 - if (keep_tag) 295 - return get_tag(object); 296 281 297 282 /* 298 283 * If the cache neither has a constructor nor has SLAB_TYPESAFE_BY_RCU ··· 317 320 } 318 321 319 322 /* Tag is ignored in set_tag() without CONFIG_KASAN_SW/HW_TAGS */ 320 - object = set_tag(object, assign_tag(cache, object, true, false)); 323 + object = set_tag(object, assign_tag(cache, object, true)); 321 324 322 325 return (void *)object; 323 326 } 324 327 325 - static bool ____kasan_slab_free(struct kmem_cache *cache, void *object, 326 - unsigned long ip, bool quarantine) 328 + static inline bool ____kasan_slab_free(struct kmem_cache *cache, 329 + void *object, unsigned long ip, bool quarantine) 327 330 { 328 331 u8 tag; 329 332 void *tagged_object; ··· 331 334 tag = get_tag(object); 332 335 tagged_object = object; 333 336 object = kasan_reset_tag(object); 337 + 338 + if (is_kfence_address(object)) 339 + return false; 334 340 335 341 if (unlikely(nearest_obj(cache, virt_to_head_page(object), object) != 336 342 object)) { ··· 350 350 return true; 351 351 } 352 352 353 - kasan_poison(object, cache->object_size, KASAN_KMALLOC_FREE); 354 - 355 - if (!kasan_stack_collection_enabled()) 356 - return false; 353 + kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE), 354 + KASAN_KMALLOC_FREE); 357 355 358 356 if ((IS_ENABLED(CONFIG_KASAN_GENERIC) && !quarantine)) 359 357 return false; 360 358 361 - kasan_set_free_info(cache, object, tag); 359 + if (kasan_stack_collection_enabled()) 360 + kasan_set_free_info(cache, object, tag); 362 361 363 362 return kasan_quarantine_put(cache, object); 364 363 } ··· 365 366 bool __kasan_slab_free(struct kmem_cache *cache, void *object, unsigned long ip) 366 367 { 367 368 return ____kasan_slab_free(cache, object, ip, true); 369 + } 370 + 371 + static inline bool ____kasan_kfree_large(void *ptr, unsigned long ip) 372 + { 373 + if (ptr != page_address(virt_to_head_page(ptr))) { 374 + kasan_report_invalid_free(ptr, ip); 375 + return true; 
376 + } 377 + 378 + if (!kasan_byte_accessible(ptr)) { 379 + kasan_report_invalid_free(ptr, ip); 380 + return true; 381 + } 382 + 383 + /* 384 + * The object will be poisoned by kasan_free_pages() or 385 + * kasan_slab_free_mempool(). 386 + */ 387 + 388 + return false; 389 + } 390 + 391 + void __kasan_kfree_large(void *ptr, unsigned long ip) 392 + { 393 + ____kasan_kfree_large(ptr, ip); 368 394 } 369 395 370 396 void __kasan_slab_free_mempool(void *ptr, unsigned long ip) ··· 405 381 * KMALLOC_MAX_SIZE, and kmalloc falls back onto page_alloc. 406 382 */ 407 383 if (unlikely(!PageSlab(page))) { 408 - if (ptr != page_address(page)) { 409 - kasan_report_invalid_free(ptr, ip); 384 + if (____kasan_kfree_large(ptr, ip)) 410 385 return; 411 - } 412 386 kasan_poison(ptr, page_size(page), KASAN_FREE_PAGE); 413 387 } else { 414 388 ____kasan_slab_free(page->slab_cache, ptr, ip, false); 415 389 } 416 390 } 417 391 418 - static void set_alloc_info(struct kmem_cache *cache, void *object, gfp_t flags) 392 + static void set_alloc_info(struct kmem_cache *cache, void *object, 393 + gfp_t flags, bool is_kmalloc) 419 394 { 420 395 struct kasan_alloc_meta *alloc_meta; 396 + 397 + /* Don't save alloc info for kmalloc caches in kasan_slab_alloc(). 
*/ 398 + if (cache->kasan_info.is_kmalloc && !is_kmalloc) 399 + return; 421 400 422 401 alloc_meta = kasan_get_alloc_meta(cache, object); 423 402 if (alloc_meta) 424 403 kasan_set_track(&alloc_meta->alloc_track, flags); 425 404 } 426 405 427 - static void *____kasan_kmalloc(struct kmem_cache *cache, const void *object, 428 - size_t size, gfp_t flags, bool keep_tag) 406 + void * __must_check __kasan_slab_alloc(struct kmem_cache *cache, 407 + void *object, gfp_t flags) 429 408 { 430 - unsigned long redzone_start; 431 - unsigned long redzone_end; 432 409 u8 tag; 410 + void *tagged_object; 433 411 434 412 if (gfpflags_allow_blocking(flags)) 435 413 kasan_quarantine_reduce(); ··· 439 413 if (unlikely(object == NULL)) 440 414 return NULL; 441 415 416 + if (is_kfence_address(object)) 417 + return (void *)object; 418 + 419 + /* 420 + * Generate and assign random tag for tag-based modes. 421 + * Tag is ignored in set_tag() for the generic mode. 422 + */ 423 + tag = assign_tag(cache, object, false); 424 + tagged_object = set_tag(object, tag); 425 + 426 + /* 427 + * Unpoison the whole object. 428 + * For kmalloc() allocations, kasan_kmalloc() will do precise poisoning. 429 + */ 430 + kasan_unpoison(tagged_object, cache->object_size); 431 + 432 + /* Save alloc info (if possible) for non-kmalloc() allocations. 
*/ 433 + if (kasan_stack_collection_enabled()) 434 + set_alloc_info(cache, (void *)object, flags, false); 435 + 436 + return tagged_object; 437 + } 438 + 439 + static inline void *____kasan_kmalloc(struct kmem_cache *cache, 440 + const void *object, size_t size, gfp_t flags) 441 + { 442 + unsigned long redzone_start; 443 + unsigned long redzone_end; 444 + 445 + if (gfpflags_allow_blocking(flags)) 446 + kasan_quarantine_reduce(); 447 + 448 + if (unlikely(object == NULL)) 449 + return NULL; 450 + 451 + if (is_kfence_address(kasan_reset_tag(object))) 452 + return (void *)object; 453 + 454 + /* 455 + * The object has already been unpoisoned by kasan_slab_alloc() for 456 + * kmalloc() or by kasan_krealloc() for krealloc(). 457 + */ 458 + 459 + /* 460 + * The redzone has byte-level precision for the generic mode. 461 + * Partially poison the last object granule to cover the unaligned 462 + * part of the redzone. 463 + */ 464 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 465 + kasan_poison_last_granule((void *)object, size); 466 + 467 + /* Poison the aligned part of the redzone. */ 442 468 redzone_start = round_up((unsigned long)(object + size), 443 469 KASAN_GRANULE_SIZE); 444 - redzone_end = round_up((unsigned long)object + cache->object_size, 470 + redzone_end = round_up((unsigned long)(object + cache->object_size), 445 471 KASAN_GRANULE_SIZE); 446 - tag = assign_tag(cache, object, false, keep_tag); 447 - 448 - /* Tag is ignored in set_tag without CONFIG_KASAN_SW/HW_TAGS */ 449 - kasan_unpoison(set_tag(object, tag), size); 450 472 kasan_poison((void *)redzone_start, redzone_end - redzone_start, 451 473 KASAN_KMALLOC_REDZONE); 452 474 475 + /* 476 + * Save alloc info (if possible) for kmalloc() allocations. 477 + * This also rewrites the alloc info when called from kasan_krealloc(). 
478 + */ 453 479 if (kasan_stack_collection_enabled()) 454 - set_alloc_info(cache, (void *)object, flags); 480 + set_alloc_info(cache, (void *)object, flags, true); 455 481 456 - return set_tag(object, tag); 457 - } 458 - 459 - void * __must_check __kasan_slab_alloc(struct kmem_cache *cache, 460 - void *object, gfp_t flags) 461 - { 462 - return ____kasan_kmalloc(cache, object, cache->object_size, flags, false); 482 + /* Keep the tag that was set by kasan_slab_alloc(). */ 483 + return (void *)object; 463 484 } 464 485 465 486 void * __must_check __kasan_kmalloc(struct kmem_cache *cache, const void *object, 466 487 size_t size, gfp_t flags) 467 488 { 468 - return ____kasan_kmalloc(cache, object, size, flags, true); 489 + return ____kasan_kmalloc(cache, object, size, flags); 469 490 } 470 491 EXPORT_SYMBOL(__kasan_kmalloc); 471 492 472 493 void * __must_check __kasan_kmalloc_large(const void *ptr, size_t size, 473 494 gfp_t flags) 474 495 { 475 - struct page *page; 476 496 unsigned long redzone_start; 477 497 unsigned long redzone_end; 478 498 ··· 528 456 if (unlikely(ptr == NULL)) 529 457 return NULL; 530 458 531 - page = virt_to_page(ptr); 459 + /* 460 + * The object has already been unpoisoned by kasan_alloc_pages() for 461 + * alloc_pages() or by kasan_krealloc() for krealloc(). 462 + */ 463 + 464 + /* 465 + * The redzone has byte-level precision for the generic mode. 466 + * Partially poison the last object granule to cover the unaligned 467 + * part of the redzone. 468 + */ 469 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 470 + kasan_poison_last_granule(ptr, size); 471 + 472 + /* Poison the aligned part of the redzone. 
*/ 532 473 redzone_start = round_up((unsigned long)(ptr + size), 533 474 KASAN_GRANULE_SIZE); 534 - redzone_end = (unsigned long)ptr + page_size(page); 535 - 536 - kasan_unpoison(ptr, size); 475 + redzone_end = (unsigned long)ptr + page_size(virt_to_page(ptr)); 537 476 kasan_poison((void *)redzone_start, redzone_end - redzone_start, 538 477 KASAN_PAGE_REDZONE); 539 478 ··· 558 475 if (unlikely(object == ZERO_SIZE_PTR)) 559 476 return (void *)object; 560 477 478 + /* 479 + * Unpoison the object's data. 480 + * Part of it might already have been unpoisoned, but it's unknown 481 + * how big that part is. 482 + */ 483 + kasan_unpoison(object, size); 484 + 561 485 page = virt_to_head_page(object); 562 486 487 + /* Piggy-back on kmalloc() instrumentation to poison the redzone. */ 563 488 if (unlikely(!PageSlab(page))) 564 489 return __kasan_kmalloc_large(object, size, flags); 565 490 else 566 - return ____kasan_kmalloc(page->slab_cache, object, size, 567 - flags, true); 568 - } 569 - 570 - void __kasan_kfree_large(void *ptr, unsigned long ip) 571 - { 572 - if (ptr != page_address(virt_to_head_page(ptr))) 573 - kasan_report_invalid_free(ptr, ip); 574 - /* The object will be poisoned by kasan_free_pages(). */ 491 + return ____kasan_kmalloc(page->slab_cache, object, size, flags); 575 492 } 576 493 577 494 bool __kasan_check_byte(const void *address, unsigned long ip)
+2 -1
mm/kasan/generic.c
··· 14 14 #include <linux/init.h> 15 15 #include <linux/kasan.h> 16 16 #include <linux/kernel.h> 17 + #include <linux/kfence.h> 17 18 #include <linux/kmemleak.h> 18 19 #include <linux/linkage.h> 19 20 #include <linux/memblock.h> ··· 332 331 struct kasan_alloc_meta *alloc_meta; 333 332 void *object; 334 333 335 - if (!(page && PageSlab(page))) 334 + if (is_kfence_address(addr) || !(page && PageSlab(page))) 336 335 return; 337 336 338 337 cache = page->slab_cache;
+1 -1
mm/kasan/hw_tags.c
··· 48 48 /* Whether to collect alloc/free stack traces. */ 49 49 DEFINE_STATIC_KEY_FALSE(kasan_flag_stacktrace); 50 50 51 - /* Whether panic or disable tag checking on fault. */ 51 + /* Whether to panic or print a report and disable tag checking on fault. */ 52 52 bool kasan_flag_panic __ro_after_init; 53 53 54 54 /* kasan=off/on */
+69 -8
mm/kasan/kasan.h
··· 3 3 #define __MM_KASAN_KASAN_H 4 4 5 5 #include <linux/kasan.h> 6 + #include <linux/kfence.h> 6 7 #include <linux/stackdepot.h> 7 8 8 9 #ifdef CONFIG_KASAN_HW_TAGS ··· 330 329 331 330 #ifdef CONFIG_KASAN_HW_TAGS 332 331 333 - static inline void kasan_poison(const void *address, size_t size, u8 value) 332 + static inline void kasan_poison(const void *addr, size_t size, u8 value) 334 333 { 335 - hw_set_mem_tag_range(kasan_reset_tag(address), 336 - round_up(size, KASAN_GRANULE_SIZE), value); 334 + addr = kasan_reset_tag(addr); 335 + 336 + /* Skip KFENCE memory if called explicitly outside of sl*b. */ 337 + if (is_kfence_address(addr)) 338 + return; 339 + 340 + if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK)) 341 + return; 342 + if (WARN_ON(size & KASAN_GRANULE_MASK)) 343 + return; 344 + 345 + hw_set_mem_tag_range((void *)addr, size, value); 337 346 } 338 347 339 - static inline void kasan_unpoison(const void *address, size_t size) 348 + static inline void kasan_unpoison(const void *addr, size_t size) 340 349 { 341 - hw_set_mem_tag_range(kasan_reset_tag(address), 342 - round_up(size, KASAN_GRANULE_SIZE), get_tag(address)); 350 + u8 tag = get_tag(addr); 351 + 352 + addr = kasan_reset_tag(addr); 353 + 354 + /* Skip KFENCE memory if called explicitly outside of sl*b. 
*/ 355 + if (is_kfence_address(addr)) 356 + return; 357 + 358 + if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK)) 359 + return; 360 + size = round_up(size, KASAN_GRANULE_SIZE); 361 + 362 + hw_set_mem_tag_range((void *)addr, size, tag); 343 363 } 344 364 345 365 static inline bool kasan_byte_accessible(const void *addr) ··· 374 352 375 353 #else /* CONFIG_KASAN_HW_TAGS */ 376 354 377 - void kasan_poison(const void *address, size_t size, u8 value); 378 - void kasan_unpoison(const void *address, size_t size); 355 + /** 356 + * kasan_poison - mark the memory range as unaccessible 357 + * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE 358 + * @size - range size, must be aligned to KASAN_GRANULE_SIZE 359 + * @value - value that's written to metadata for the range 360 + * 361 + * The size gets aligned to KASAN_GRANULE_SIZE before marking the range. 362 + */ 363 + void kasan_poison(const void *addr, size_t size, u8 value); 364 + 365 + /** 366 + * kasan_unpoison - mark the memory range as accessible 367 + * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE 368 + * @size - range size, can be unaligned 369 + * 370 + * For the tag-based modes, the @size gets aligned to KASAN_GRANULE_SIZE before 371 + * marking the range. 372 + * For the generic mode, the last granule of the memory range gets partially 373 + * unpoisoned based on the @size. 374 + */ 375 + void kasan_unpoison(const void *addr, size_t size); 376 + 379 377 bool kasan_byte_accessible(const void *addr); 380 378 381 379 #endif /* CONFIG_KASAN_HW_TAGS */ 380 + 381 + #ifdef CONFIG_KASAN_GENERIC 382 + 383 + /** 384 + * kasan_poison_last_granule - mark the last granule of the memory range as 385 + * unaccessible 386 + * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE 387 + * @size - range size 388 + * 389 + * This function is only available for the generic mode, as it's the only mode 390 + * that has partially poisoned memory granules. 
391 + */ 392 + void kasan_poison_last_granule(const void *address, size_t size); 393 + 394 + #else /* CONFIG_KASAN_GENERIC */ 395 + 396 + static inline void kasan_poison_last_granule(const void *address, size_t size) { } 397 + 398 + #endif /* CONFIG_KASAN_GENERIC */ 382 399 383 400 /* 384 401 * Exported functions for interfaces called from assembly or from generated
+5 -3
mm/kasan/report.c
··· 25 25 #include <linux/module.h> 26 26 #include <linux/sched/task_stack.h> 27 27 #include <linux/uaccess.h> 28 + #include <trace/events/error_report.h> 28 29 29 30 #include <asm/sections.h> 30 31 ··· 85 84 pr_err("==================================================================\n"); 86 85 } 87 86 88 - static void end_report(unsigned long *flags) 87 + static void end_report(unsigned long *flags, unsigned long addr) 89 88 { 89 + trace_error_report_end(ERROR_DETECTOR_KASAN, addr); 90 90 pr_err("==================================================================\n"); 91 91 add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 92 92 spin_unlock_irqrestore(&report_lock, *flags); ··· 357 355 print_address_description(object, tag); 358 356 pr_err("\n"); 359 357 print_memory_metadata(object); 360 - end_report(&flags); 358 + end_report(&flags, (unsigned long)object); 361 359 } 362 360 363 361 static void __kasan_report(unsigned long addr, size_t size, bool is_write, ··· 403 401 dump_stack(); 404 402 } 405 403 406 - end_report(&flags); 404 + end_report(&flags, addr); 407 405 } 408 406 409 407 bool kasan_report(unsigned long addr, size_t size, bool is_write,
+42 -20
mm/kasan/shadow.c
··· 13 13 #include <linux/init.h> 14 14 #include <linux/kasan.h> 15 15 #include <linux/kernel.h> 16 + #include <linux/kfence.h> 16 17 #include <linux/kmemleak.h> 17 18 #include <linux/memory.h> 18 19 #include <linux/mm.h> ··· 69 68 return __memcpy(dest, src, len); 70 69 } 71 70 72 - /* 73 - * Poisons the shadow memory for 'size' bytes starting from 'addr'. 74 - * Memory addresses should be aligned to KASAN_GRANULE_SIZE. 75 - */ 76 - void kasan_poison(const void *address, size_t size, u8 value) 71 + void kasan_poison(const void *addr, size_t size, u8 value) 77 72 { 78 73 void *shadow_start, *shadow_end; 79 74 ··· 78 81 * some of the callers (e.g. kasan_poison_object_data) pass tagged 79 82 * addresses to this function. 80 83 */ 81 - address = kasan_reset_tag(address); 82 - size = round_up(size, KASAN_GRANULE_SIZE); 84 + addr = kasan_reset_tag(addr); 83 85 84 - shadow_start = kasan_mem_to_shadow(address); 85 - shadow_end = kasan_mem_to_shadow(address + size); 86 + /* Skip KFENCE memory if called explicitly outside of sl*b. 
*/ 87 + if (is_kfence_address(addr)) 88 + return; 89 + 90 + if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK)) 91 + return; 92 + if (WARN_ON(size & KASAN_GRANULE_MASK)) 93 + return; 94 + 95 + shadow_start = kasan_mem_to_shadow(addr); 96 + shadow_end = kasan_mem_to_shadow(addr + size); 86 97 87 98 __memset(shadow_start, value, shadow_end - shadow_start); 88 99 } 89 100 EXPORT_SYMBOL(kasan_poison); 90 101 91 - void kasan_unpoison(const void *address, size_t size) 102 + #ifdef CONFIG_KASAN_GENERIC 103 + void kasan_poison_last_granule(const void *addr, size_t size) 92 104 { 93 - u8 tag = get_tag(address); 105 + if (size & KASAN_GRANULE_MASK) { 106 + u8 *shadow = (u8 *)kasan_mem_to_shadow(addr + size); 107 + *shadow = size & KASAN_GRANULE_MASK; 108 + } 109 + } 110 + #endif 111 + 112 + void kasan_unpoison(const void *addr, size_t size) 113 + { 114 + u8 tag = get_tag(addr); 94 115 95 116 /* 96 117 * Perform shadow offset calculation based on untagged address, as 97 118 * some of the callers (e.g. kasan_unpoison_object_data) pass tagged 98 119 * addresses to this function. 99 120 */ 100 - address = kasan_reset_tag(address); 121 + addr = kasan_reset_tag(addr); 101 122 102 - kasan_poison(address, size, tag); 123 + /* 124 + * Skip KFENCE memory if called explicitly outside of sl*b. Also note 125 + * that calls to ksize(), where size is not a multiple of machine-word 126 + * size, would otherwise poison the invalid portion of the word. 127 + */ 128 + if (is_kfence_address(addr)) 129 + return; 103 130 104 - if (size & KASAN_GRANULE_MASK) { 105 - u8 *shadow = (u8 *)kasan_mem_to_shadow(address + size); 131 + if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK)) 132 + return; 106 133 107 - if (IS_ENABLED(CONFIG_KASAN_SW_TAGS)) 108 - *shadow = tag; 109 - else /* CONFIG_KASAN_GENERIC */ 110 - *shadow = size & KASAN_GRANULE_MASK; 111 - } 134 + /* Unpoison all granules that cover the object. 
*/ 135 + kasan_poison(addr, round_up(size, KASAN_GRANULE_SIZE), tag); 136 + 137 + /* Partially poison the last granule for the generic mode. */ 138 + if (IS_ENABLED(CONFIG_KASAN_GENERIC)) 139 + kasan_poison_last_granule(addr, size); 112 140 } 113 141 114 142 #ifdef CONFIG_MEMORY_HOTPLUG
+6
mm/kfence/Makefile
··· 1 + # SPDX-License-Identifier: GPL-2.0 2 + 3 + obj-$(CONFIG_KFENCE) := core.o report.o 4 + 5 + CFLAGS_kfence_test.o := -g -fno-omit-frame-pointer -fno-optimize-sibling-calls 6 + obj-$(CONFIG_KFENCE_KUNIT_TEST) += kfence_test.o
+841
mm/kfence/core.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * KFENCE guarded object allocator and fault handling. 4 + * 5 + * Copyright (C) 2020, Google LLC. 6 + */ 7 + 8 + #define pr_fmt(fmt) "kfence: " fmt 9 + 10 + #include <linux/atomic.h> 11 + #include <linux/bug.h> 12 + #include <linux/debugfs.h> 13 + #include <linux/kcsan-checks.h> 14 + #include <linux/kfence.h> 15 + #include <linux/list.h> 16 + #include <linux/lockdep.h> 17 + #include <linux/memblock.h> 18 + #include <linux/moduleparam.h> 19 + #include <linux/random.h> 20 + #include <linux/rcupdate.h> 21 + #include <linux/seq_file.h> 22 + #include <linux/slab.h> 23 + #include <linux/spinlock.h> 24 + #include <linux/string.h> 25 + 26 + #include <asm/kfence.h> 27 + 28 + #include "kfence.h" 29 + 30 + /* Disables KFENCE on the first warning assuming an irrecoverable error. */ 31 + #define KFENCE_WARN_ON(cond) \ 32 + ({ \ 33 + const bool __cond = WARN_ON(cond); \ 34 + if (unlikely(__cond)) \ 35 + WRITE_ONCE(kfence_enabled, false); \ 36 + __cond; \ 37 + }) 38 + 39 + /* === Data ================================================================= */ 40 + 41 + static bool kfence_enabled __read_mostly; 42 + 43 + static unsigned long kfence_sample_interval __read_mostly = CONFIG_KFENCE_SAMPLE_INTERVAL; 44 + 45 + #ifdef MODULE_PARAM_PREFIX 46 + #undef MODULE_PARAM_PREFIX 47 + #endif 48 + #define MODULE_PARAM_PREFIX "kfence." 49 + 50 + static int param_set_sample_interval(const char *val, const struct kernel_param *kp) 51 + { 52 + unsigned long num; 53 + int ret = kstrtoul(val, 0, &num); 54 + 55 + if (ret < 0) 56 + return ret; 57 + 58 + if (!num) /* Using 0 to indicate KFENCE is disabled. */ 59 + WRITE_ONCE(kfence_enabled, false); 60 + else if (!READ_ONCE(kfence_enabled) && system_state != SYSTEM_BOOTING) 61 + return -EINVAL; /* Cannot (re-)enable KFENCE on-the-fly. 
*/ 62 + 63 + *((unsigned long *)kp->arg) = num; 64 + return 0; 65 + } 66 + 67 + static int param_get_sample_interval(char *buffer, const struct kernel_param *kp) 68 + { 69 + if (!READ_ONCE(kfence_enabled)) 70 + return sprintf(buffer, "0\n"); 71 + 72 + return param_get_ulong(buffer, kp); 73 + } 74 + 75 + static const struct kernel_param_ops sample_interval_param_ops = { 76 + .set = param_set_sample_interval, 77 + .get = param_get_sample_interval, 78 + }; 79 + module_param_cb(sample_interval, &sample_interval_param_ops, &kfence_sample_interval, 0600); 80 + 81 + /* The pool of pages used for guard pages and objects. */ 82 + char *__kfence_pool __ro_after_init; 83 + EXPORT_SYMBOL(__kfence_pool); /* Export for test modules. */ 84 + 85 + /* 86 + * Per-object metadata, with one-to-one mapping of object metadata to 87 + * backing pages (in __kfence_pool). 88 + */ 89 + static_assert(CONFIG_KFENCE_NUM_OBJECTS > 0); 90 + struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS]; 91 + 92 + /* Freelist with available objects. */ 93 + static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist); 94 + static DEFINE_RAW_SPINLOCK(kfence_freelist_lock); /* Lock protecting freelist. */ 95 + 96 + #ifdef CONFIG_KFENCE_STATIC_KEYS 97 + /* The static key to set up a KFENCE allocation. */ 98 + DEFINE_STATIC_KEY_FALSE(kfence_allocation_key); 99 + #endif 100 + 101 + /* Gates the allocation, ensuring only one succeeds in a given period. */ 102 + atomic_t kfence_allocation_gate = ATOMIC_INIT(1); 103 + 104 + /* Statistics counters for debugfs. 
*/ 105 + enum kfence_counter_id { 106 + KFENCE_COUNTER_ALLOCATED, 107 + KFENCE_COUNTER_ALLOCS, 108 + KFENCE_COUNTER_FREES, 109 + KFENCE_COUNTER_ZOMBIES, 110 + KFENCE_COUNTER_BUGS, 111 + KFENCE_COUNTER_COUNT, 112 + }; 113 + static atomic_long_t counters[KFENCE_COUNTER_COUNT]; 114 + static const char *const counter_names[] = { 115 + [KFENCE_COUNTER_ALLOCATED] = "currently allocated", 116 + [KFENCE_COUNTER_ALLOCS] = "total allocations", 117 + [KFENCE_COUNTER_FREES] = "total frees", 118 + [KFENCE_COUNTER_ZOMBIES] = "zombie allocations", 119 + [KFENCE_COUNTER_BUGS] = "total bugs", 120 + }; 121 + static_assert(ARRAY_SIZE(counter_names) == KFENCE_COUNTER_COUNT); 122 + 123 + /* === Internals ============================================================ */ 124 + 125 + static bool kfence_protect(unsigned long addr) 126 + { 127 + return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), true)); 128 + } 129 + 130 + static bool kfence_unprotect(unsigned long addr) 131 + { 132 + return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), false)); 133 + } 134 + 135 + static inline struct kfence_metadata *addr_to_metadata(unsigned long addr) 136 + { 137 + long index; 138 + 139 + /* The checks do not affect performance; only called from slow-paths. */ 140 + 141 + if (!is_kfence_address((void *)addr)) 142 + return NULL; 143 + 144 + /* 145 + * May be an invalid index if called with an address at the edge of 146 + * __kfence_pool, in which case we would report an "invalid access" 147 + * error. 
148 + */ 149 + index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1; 150 + if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS) 151 + return NULL; 152 + 153 + return &kfence_metadata[index]; 154 + } 155 + 156 + static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta) 157 + { 158 + unsigned long offset = (meta - kfence_metadata + 1) * PAGE_SIZE * 2; 159 + unsigned long pageaddr = (unsigned long)&__kfence_pool[offset]; 160 + 161 + /* The checks do not affect performance; only called from slow-paths. */ 162 + 163 + /* Only call with a pointer into kfence_metadata. */ 164 + if (KFENCE_WARN_ON(meta < kfence_metadata || 165 + meta >= kfence_metadata + CONFIG_KFENCE_NUM_OBJECTS)) 166 + return 0; 167 + 168 + /* 169 + * This metadata object only ever maps to 1 page; verify that the stored 170 + * address is in the expected range. 171 + */ 172 + if (KFENCE_WARN_ON(ALIGN_DOWN(meta->addr, PAGE_SIZE) != pageaddr)) 173 + return 0; 174 + 175 + return pageaddr; 176 + } 177 + 178 + /* 179 + * Update the object's metadata state, including updating the alloc/free stacks 180 + * depending on the state transition. 181 + */ 182 + static noinline void metadata_update_state(struct kfence_metadata *meta, 183 + enum kfence_object_state next) 184 + { 185 + struct kfence_track *track = 186 + next == KFENCE_OBJECT_FREED ? &meta->free_track : &meta->alloc_track; 187 + 188 + lockdep_assert_held(&meta->lock); 189 + 190 + /* 191 + * Skip over 1 (this) functions; noinline ensures we do not accidentally 192 + * skip over the caller by never inlining. 193 + */ 194 + track->num_stack_entries = stack_trace_save(track->stack_entries, KFENCE_STACK_DEPTH, 1); 195 + track->pid = task_pid_nr(current); 196 + 197 + /* 198 + * Pairs with READ_ONCE() in 199 + * kfence_shutdown_cache(), 200 + * kfence_handle_page_fault(). 201 + */ 202 + WRITE_ONCE(meta->state, next); 203 + } 204 + 205 + /* Write canary byte to @addr. 
*/ 206 + static inline bool set_canary_byte(u8 *addr) 207 + { 208 + *addr = KFENCE_CANARY_PATTERN(addr); 209 + return true; 210 + } 211 + 212 + /* Check canary byte at @addr. */ 213 + static inline bool check_canary_byte(u8 *addr) 214 + { 215 + if (likely(*addr == KFENCE_CANARY_PATTERN(addr))) 216 + return true; 217 + 218 + atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]); 219 + kfence_report_error((unsigned long)addr, false, NULL, addr_to_metadata((unsigned long)addr), 220 + KFENCE_ERROR_CORRUPTION); 221 + return false; 222 + } 223 + 224 + /* __always_inline this to ensure we won't do an indirect call to fn. */ 225 + static __always_inline void for_each_canary(const struct kfence_metadata *meta, bool (*fn)(u8 *)) 226 + { 227 + const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE); 228 + unsigned long addr; 229 + 230 + lockdep_assert_held(&meta->lock); 231 + 232 + /* 233 + * We'll iterate over each canary byte per-side until fn() returns 234 + * false. However, we'll still iterate over the canary bytes to the 235 + * right of the object even if there was an error in the canary bytes to 236 + * the left of the object. Specifically, if check_canary_byte() 237 + * generates an error, showing both sides might give more clues as to 238 + * what the error is about when displaying which bytes were corrupted. 239 + */ 240 + 241 + /* Apply to left of object. */ 242 + for (addr = pageaddr; addr < meta->addr; addr++) { 243 + if (!fn((u8 *)addr)) 244 + break; 245 + } 246 + 247 + /* Apply to right of object. */ 248 + for (addr = meta->addr + meta->size; addr < pageaddr + PAGE_SIZE; addr++) { 249 + if (!fn((u8 *)addr)) 250 + break; 251 + } 252 + } 253 + 254 + static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp) 255 + { 256 + struct kfence_metadata *meta = NULL; 257 + unsigned long flags; 258 + struct page *page; 259 + void *addr; 260 + 261 + /* Try to obtain a free object. 
*/ 262 + raw_spin_lock_irqsave(&kfence_freelist_lock, flags); 263 + if (!list_empty(&kfence_freelist)) { 264 + meta = list_entry(kfence_freelist.next, struct kfence_metadata, list); 265 + list_del_init(&meta->list); 266 + } 267 + raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags); 268 + if (!meta) 269 + return NULL; 270 + 271 + if (unlikely(!raw_spin_trylock_irqsave(&meta->lock, flags))) { 272 + /* 273 + * This is extremely unlikely -- we are reporting on a 274 + * use-after-free, which locked meta->lock, and the reporting 275 + * code via printk calls kmalloc() which ends up in 276 + * kfence_alloc() and tries to grab the same object that we're 277 + * reporting on. While it has never been observed, lockdep does 278 + * report that there is a possibility of deadlock. Fix it by 279 + * using trylock and bailing out gracefully. 280 + */ 281 + raw_spin_lock_irqsave(&kfence_freelist_lock, flags); 282 + /* Put the object back on the freelist. */ 283 + list_add_tail(&meta->list, &kfence_freelist); 284 + raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags); 285 + 286 + return NULL; 287 + } 288 + 289 + meta->addr = metadata_to_pageaddr(meta); 290 + /* Unprotect if we're reusing this page. */ 291 + if (meta->state == KFENCE_OBJECT_FREED) 292 + kfence_unprotect(meta->addr); 293 + 294 + /* 295 + * Note: for allocations made before RNG initialization, will always 296 + * return zero. We still benefit from enabling KFENCE as early as 297 + * possible, even when the RNG is not yet available, as this will allow 298 + * KFENCE to detect bugs due to earlier allocations. The only downside 299 + * is that the out-of-bounds accesses detected are deterministic for 300 + * such allocations. 301 + */ 302 + if (prandom_u32_max(2)) { 303 + /* Allocate on the "right" side, re-calculate address. */ 304 + meta->addr += PAGE_SIZE - size; 305 + meta->addr = ALIGN_DOWN(meta->addr, cache->align); 306 + } 307 + 308 + addr = (void *)meta->addr; 309 + 310 + /* Update remaining metadata. 
*/ 311 + metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED); 312 + /* Pairs with READ_ONCE() in kfence_shutdown_cache(). */ 313 + WRITE_ONCE(meta->cache, cache); 314 + meta->size = size; 315 + for_each_canary(meta, set_canary_byte); 316 + 317 + /* Set required struct page fields. */ 318 + page = virt_to_page(meta->addr); 319 + page->slab_cache = cache; 320 + if (IS_ENABLED(CONFIG_SLUB)) 321 + page->objects = 1; 322 + if (IS_ENABLED(CONFIG_SLAB)) 323 + page->s_mem = addr; 324 + 325 + raw_spin_unlock_irqrestore(&meta->lock, flags); 326 + 327 + /* Memory initialization. */ 328 + 329 + /* 330 + * We check slab_want_init_on_alloc() ourselves, rather than letting 331 + * SL*B do the initialization, as otherwise we might overwrite KFENCE's 332 + * redzone. 333 + */ 334 + if (unlikely(slab_want_init_on_alloc(gfp, cache))) 335 + memzero_explicit(addr, size); 336 + if (cache->ctor) 337 + cache->ctor(addr); 338 + 339 + if (CONFIG_KFENCE_STRESS_TEST_FAULTS && !prandom_u32_max(CONFIG_KFENCE_STRESS_TEST_FAULTS)) 340 + kfence_protect(meta->addr); /* Random "faults" by protecting the object. */ 341 + 342 + atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCATED]); 343 + atomic_long_inc(&counters[KFENCE_COUNTER_ALLOCS]); 344 + 345 + return addr; 346 + } 347 + 348 + static void kfence_guarded_free(void *addr, struct kfence_metadata *meta, bool zombie) 349 + { 350 + struct kcsan_scoped_access assert_page_exclusive; 351 + unsigned long flags; 352 + 353 + raw_spin_lock_irqsave(&meta->lock, flags); 354 + 355 + if (meta->state != KFENCE_OBJECT_ALLOCATED || meta->addr != (unsigned long)addr) { 356 + /* Invalid or double-free, bail out. */ 357 + atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]); 358 + kfence_report_error((unsigned long)addr, false, NULL, meta, 359 + KFENCE_ERROR_INVALID_FREE); 360 + raw_spin_unlock_irqrestore(&meta->lock, flags); 361 + return; 362 + } 363 + 364 + /* Detect racy use-after-free, or incorrect reallocation of this page by KFENCE. 
*/ 365 + kcsan_begin_scoped_access((void *)ALIGN_DOWN((unsigned long)addr, PAGE_SIZE), PAGE_SIZE, 366 + KCSAN_ACCESS_SCOPED | KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT, 367 + &assert_page_exclusive); 368 + 369 + if (CONFIG_KFENCE_STRESS_TEST_FAULTS) 370 + kfence_unprotect((unsigned long)addr); /* To check canary bytes. */ 371 + 372 + /* Restore page protection if there was an OOB access. */ 373 + if (meta->unprotected_page) { 374 + kfence_protect(meta->unprotected_page); 375 + meta->unprotected_page = 0; 376 + } 377 + 378 + /* Check canary bytes for memory corruption. */ 379 + for_each_canary(meta, check_canary_byte); 380 + 381 + /* 382 + * Clear memory if init-on-free is set. While we protect the page, the 383 + * data is still there, and after a use-after-free is detected, we 384 + * unprotect the page, so the data is still accessible. 385 + */ 386 + if (!zombie && unlikely(slab_want_init_on_free(meta->cache))) 387 + memzero_explicit(addr, meta->size); 388 + 389 + /* Mark the object as freed. */ 390 + metadata_update_state(meta, KFENCE_OBJECT_FREED); 391 + 392 + raw_spin_unlock_irqrestore(&meta->lock, flags); 393 + 394 + /* Protect to detect use-after-frees. */ 395 + kfence_protect((unsigned long)addr); 396 + 397 + kcsan_end_scoped_access(&assert_page_exclusive); 398 + if (!zombie) { 399 + /* Add it to the tail of the freelist for reuse. */ 400 + raw_spin_lock_irqsave(&kfence_freelist_lock, flags); 401 + KFENCE_WARN_ON(!list_empty(&meta->list)); 402 + list_add_tail(&meta->list, &kfence_freelist); 403 + raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags); 404 + 405 + atomic_long_dec(&counters[KFENCE_COUNTER_ALLOCATED]); 406 + atomic_long_inc(&counters[KFENCE_COUNTER_FREES]); 407 + } else { 408 + /* See kfence_shutdown_cache(). 
*/ 409 + atomic_long_inc(&counters[KFENCE_COUNTER_ZOMBIES]); 410 + } 411 + } 412 + 413 + static void rcu_guarded_free(struct rcu_head *h) 414 + { 415 + struct kfence_metadata *meta = container_of(h, struct kfence_metadata, rcu_head); 416 + 417 + kfence_guarded_free((void *)meta->addr, meta, false); 418 + } 419 + 420 + static bool __init kfence_init_pool(void) 421 + { 422 + unsigned long addr = (unsigned long)__kfence_pool; 423 + struct page *pages; 424 + int i; 425 + 426 + if (!__kfence_pool) 427 + return false; 428 + 429 + if (!arch_kfence_init_pool()) 430 + goto err; 431 + 432 + pages = virt_to_page(addr); 433 + 434 + /* 435 + * Set up object pages: they must have PG_slab set, to avoid freeing 436 + * these as real pages. 437 + * 438 + * We also want to avoid inserting kfence_free() in the kfree() 439 + * fast-path in SLUB, and therefore need to ensure kfree() correctly 440 + * enters __slab_free() slow-path. 441 + */ 442 + for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { 443 + if (!i || (i % 2)) 444 + continue; 445 + 446 + /* Verify we do not have a compound head page. */ 447 + if (WARN_ON(compound_head(&pages[i]) != &pages[i])) 448 + goto err; 449 + 450 + __SetPageSlab(&pages[i]); 451 + } 452 + 453 + /* 454 + * Protect the first 2 pages. The first page is mostly unnecessary, and 455 + * merely serves as an extended guard page. However, adding one 456 + * additional page in the beginning gives us an even number of pages, 457 + * which simplifies the mapping of address to metadata index. 458 + */ 459 + for (i = 0; i < 2; i++) { 460 + if (unlikely(!kfence_protect(addr))) 461 + goto err; 462 + 463 + addr += PAGE_SIZE; 464 + } 465 + 466 + for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) { 467 + struct kfence_metadata *meta = &kfence_metadata[i]; 468 + 469 + /* Initialize metadata. 
*/ 470 + INIT_LIST_HEAD(&meta->list); 471 + raw_spin_lock_init(&meta->lock); 472 + meta->state = KFENCE_OBJECT_UNUSED; 473 + meta->addr = addr; /* Initialize for validation in metadata_to_pageaddr(). */ 474 + list_add_tail(&meta->list, &kfence_freelist); 475 + 476 + /* Protect the right redzone. */ 477 + if (unlikely(!kfence_protect(addr + PAGE_SIZE))) 478 + goto err; 479 + 480 + addr += 2 * PAGE_SIZE; 481 + } 482 + 483 + return true; 484 + 485 + err: 486 + /* 487 + * Only release unprotected pages, and do not try to go back and change 488 + * page attributes due to risk of failing to do so as well. If changing 489 + * page attributes for some pages fails, it is very likely that it also 490 + * fails for the first page, and therefore expect addr==__kfence_pool in 491 + * most failure cases. 492 + */ 493 + memblock_free_late(__pa(addr), KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool)); 494 + __kfence_pool = NULL; 495 + return false; 496 + } 497 + 498 + /* === DebugFS Interface ==================================================== */ 499 + 500 + static int stats_show(struct seq_file *seq, void *v) 501 + { 502 + int i; 503 + 504 + seq_printf(seq, "enabled: %i\n", READ_ONCE(kfence_enabled)); 505 + for (i = 0; i < KFENCE_COUNTER_COUNT; i++) 506 + seq_printf(seq, "%s: %ld\n", counter_names[i], atomic_long_read(&counters[i])); 507 + 508 + return 0; 509 + } 510 + DEFINE_SHOW_ATTRIBUTE(stats); 511 + 512 + /* 513 + * debugfs seq_file operations for /sys/kernel/debug/kfence/objects. 514 + * start_object() and next_object() return the object index + 1, because NULL is used 515 + * to stop iteration. 
516 + */ 517 + static void *start_object(struct seq_file *seq, loff_t *pos) 518 + { 519 + if (*pos < CONFIG_KFENCE_NUM_OBJECTS) 520 + return (void *)((long)*pos + 1); 521 + return NULL; 522 + } 523 + 524 + static void stop_object(struct seq_file *seq, void *v) 525 + { 526 + } 527 + 528 + static void *next_object(struct seq_file *seq, void *v, loff_t *pos) 529 + { 530 + ++*pos; 531 + if (*pos < CONFIG_KFENCE_NUM_OBJECTS) 532 + return (void *)((long)*pos + 1); 533 + return NULL; 534 + } 535 + 536 + static int show_object(struct seq_file *seq, void *v) 537 + { 538 + struct kfence_metadata *meta = &kfence_metadata[(long)v - 1]; 539 + unsigned long flags; 540 + 541 + raw_spin_lock_irqsave(&meta->lock, flags); 542 + kfence_print_object(seq, meta); 543 + raw_spin_unlock_irqrestore(&meta->lock, flags); 544 + seq_puts(seq, "---------------------------------\n"); 545 + 546 + return 0; 547 + } 548 + 549 + static const struct seq_operations object_seqops = { 550 + .start = start_object, 551 + .next = next_object, 552 + .stop = stop_object, 553 + .show = show_object, 554 + }; 555 + 556 + static int open_objects(struct inode *inode, struct file *file) 557 + { 558 + return seq_open(file, &object_seqops); 559 + } 560 + 561 + static const struct file_operations objects_fops = { 562 + .open = open_objects, 563 + .read = seq_read, 564 + .llseek = seq_lseek, 565 + }; 566 + 567 + static int __init kfence_debugfs_init(void) 568 + { 569 + struct dentry *kfence_dir = debugfs_create_dir("kfence", NULL); 570 + 571 + debugfs_create_file("stats", 0444, kfence_dir, NULL, &stats_fops); 572 + debugfs_create_file("objects", 0400, kfence_dir, NULL, &objects_fops); 573 + return 0; 574 + } 575 + 576 + late_initcall(kfence_debugfs_init); 577 + 578 + /* === Allocation Gate Timer ================================================ */ 579 + 580 + /* 581 + * Set up delayed work, which will enable and disable the static key. 
We need to 582 + * use a work queue (rather than a simple timer), since enabling and disabling a 583 + * static key cannot be done from an interrupt. 584 + * 585 + * Note: Toggling a static branch currently causes IPIs, and here we'll end up 586 + * with a total of 2 IPIs to all CPUs. If this ends up a problem in future (with 587 + * more aggressive sampling intervals), we could get away with a variant that 588 + * avoids IPIs, at the cost of not immediately capturing allocations if the 589 + * instructions remain cached. 590 + */ 591 + static struct delayed_work kfence_timer; 592 + static void toggle_allocation_gate(struct work_struct *work) 593 + { 594 + if (!READ_ONCE(kfence_enabled)) 595 + return; 596 + 597 + /* Enable static key, and await allocation to happen. */ 598 + atomic_set(&kfence_allocation_gate, 0); 599 + #ifdef CONFIG_KFENCE_STATIC_KEYS 600 + static_branch_enable(&kfence_allocation_key); 601 + /* 602 + * Await an allocation. Timeout after 1 second, in case the kernel stops 603 + * doing allocations, to avoid stalling this worker task for too long. 604 + */ 605 + { 606 + unsigned long end_wait = jiffies + HZ; 607 + 608 + do { 609 + set_current_state(TASK_UNINTERRUPTIBLE); 610 + if (atomic_read(&kfence_allocation_gate) != 0) 611 + break; 612 + schedule_timeout(1); 613 + } while (time_before(jiffies, end_wait)); 614 + __set_current_state(TASK_RUNNING); 615 + } 616 + /* Disable static key and reset timer. 
*/ 617 + static_branch_disable(&kfence_allocation_key); 618 + #endif 619 + schedule_delayed_work(&kfence_timer, msecs_to_jiffies(kfence_sample_interval)); 620 + } 621 + static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate); 622 + 623 + /* === Public interface ===================================================== */ 624 + 625 + void __init kfence_alloc_pool(void) 626 + { 627 + if (!kfence_sample_interval) 628 + return; 629 + 630 + __kfence_pool = memblock_alloc(KFENCE_POOL_SIZE, PAGE_SIZE); 631 + 632 + if (!__kfence_pool) 633 + pr_err("failed to allocate pool\n"); 634 + } 635 + 636 + void __init kfence_init(void) 637 + { 638 + /* Setting kfence_sample_interval to 0 on boot disables KFENCE. */ 639 + if (!kfence_sample_interval) 640 + return; 641 + 642 + if (!kfence_init_pool()) { 643 + pr_err("%s failed\n", __func__); 644 + return; 645 + } 646 + 647 + WRITE_ONCE(kfence_enabled, true); 648 + schedule_delayed_work(&kfence_timer, 0); 649 + pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE, 650 + CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool, 651 + (void *)(__kfence_pool + KFENCE_POOL_SIZE)); 652 + } 653 + 654 + void kfence_shutdown_cache(struct kmem_cache *s) 655 + { 656 + unsigned long flags; 657 + struct kfence_metadata *meta; 658 + int i; 659 + 660 + for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) { 661 + bool in_use; 662 + 663 + meta = &kfence_metadata[i]; 664 + 665 + /* 666 + * If we observe some inconsistent cache and state pair where we 667 + * should have returned false here, cache destruction is racing 668 + * with either kmem_cache_alloc() or kmem_cache_free(). Taking 669 + * the lock will not help, as different critical section 670 + * serialization will have the same outcome. 
671 + */ 672 + if (READ_ONCE(meta->cache) != s || 673 + READ_ONCE(meta->state) != KFENCE_OBJECT_ALLOCATED) 674 + continue; 675 + 676 + raw_spin_lock_irqsave(&meta->lock, flags); 677 + in_use = meta->cache == s && meta->state == KFENCE_OBJECT_ALLOCATED; 678 + raw_spin_unlock_irqrestore(&meta->lock, flags); 679 + 680 + if (in_use) { 681 + /* 682 + * This cache still has allocations, and we should not 683 + * release them back into the freelist so they can still 684 + * safely be used and retain the kernel's default 685 + * behaviour of keeping the allocations alive (leak the 686 + * cache); however, they effectively become "zombie 687 + * allocations" as the KFENCE objects are the only ones 688 + * still in use and the owning cache is being destroyed. 689 + * 690 + * We mark them freed, so that any subsequent use shows 691 + * more useful error messages that will include stack 692 + * traces of the user of the object, the original 693 + * allocation, and caller to shutdown_cache(). 694 + */ 695 + kfence_guarded_free((void *)meta->addr, meta, /*zombie=*/true); 696 + } 697 + } 698 + 699 + for (i = 0; i < CONFIG_KFENCE_NUM_OBJECTS; i++) { 700 + meta = &kfence_metadata[i]; 701 + 702 + /* See above. */ 703 + if (READ_ONCE(meta->cache) != s || READ_ONCE(meta->state) != KFENCE_OBJECT_FREED) 704 + continue; 705 + 706 + raw_spin_lock_irqsave(&meta->lock, flags); 707 + if (meta->cache == s && meta->state == KFENCE_OBJECT_FREED) 708 + meta->cache = NULL; 709 + raw_spin_unlock_irqrestore(&meta->lock, flags); 710 + } 711 + } 712 + 713 + void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags) 714 + { 715 + /* 716 + * allocation_gate only needs to become non-zero, so it doesn't make 717 + * sense to continue writing to it and pay the associated contention 718 + * cost, in case we have a large number of concurrent allocations. 
719 + */ 720 + if (atomic_read(&kfence_allocation_gate) || atomic_inc_return(&kfence_allocation_gate) > 1) 721 + return NULL; 722 + 723 + if (!READ_ONCE(kfence_enabled)) 724 + return NULL; 725 + 726 + if (size > PAGE_SIZE) 727 + return NULL; 728 + 729 + return kfence_guarded_alloc(s, size, flags); 730 + } 731 + 732 + size_t kfence_ksize(const void *addr) 733 + { 734 + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr); 735 + 736 + /* 737 + * Read locklessly -- if there is a race with __kfence_alloc(), this is 738 + * either a use-after-free or invalid access. 739 + */ 740 + return meta ? meta->size : 0; 741 + } 742 + 743 + void *kfence_object_start(const void *addr) 744 + { 745 + const struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr); 746 + 747 + /* 748 + * Read locklessly -- if there is a race with __kfence_alloc(), this is 749 + * either a use-after-free or invalid access. 750 + */ 751 + return meta ? (void *)meta->addr : NULL; 752 + } 753 + 754 + void __kfence_free(void *addr) 755 + { 756 + struct kfence_metadata *meta = addr_to_metadata((unsigned long)addr); 757 + 758 + /* 759 + * If the objects of the cache are SLAB_TYPESAFE_BY_RCU, defer freeing 760 + * the object, as the object page may be recycled for other-typed 761 + * objects once it has been freed. meta->cache may be NULL if the cache 762 + * was destroyed. 
763 + */ 764 + if (unlikely(meta->cache && (meta->cache->flags & SLAB_TYPESAFE_BY_RCU))) 765 + call_rcu(&meta->rcu_head, rcu_guarded_free); 766 + else 767 + kfence_guarded_free(addr, meta, false); 768 + } 769 + 770 + bool kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs) 771 + { 772 + const int page_index = (addr - (unsigned long)__kfence_pool) / PAGE_SIZE; 773 + struct kfence_metadata *to_report = NULL; 774 + enum kfence_error_type error_type; 775 + unsigned long flags; 776 + 777 + if (!is_kfence_address((void *)addr)) 778 + return false; 779 + 780 + if (!READ_ONCE(kfence_enabled)) /* If disabled at runtime ... */ 781 + return kfence_unprotect(addr); /* ... unprotect and proceed. */ 782 + 783 + atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]); 784 + 785 + if (page_index % 2) { 786 + /* This is a redzone, report a buffer overflow. */ 787 + struct kfence_metadata *meta; 788 + int distance = 0; 789 + 790 + meta = addr_to_metadata(addr - PAGE_SIZE); 791 + if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) { 792 + to_report = meta; 793 + /* Data race ok; distance calculation approximate. */ 794 + distance = addr - data_race(meta->addr + meta->size); 795 + } 796 + 797 + meta = addr_to_metadata(addr + PAGE_SIZE); 798 + if (meta && READ_ONCE(meta->state) == KFENCE_OBJECT_ALLOCATED) { 799 + /* Data race ok; distance calculation approximate. */ 800 + if (!to_report || distance > data_race(meta->addr) - addr) 801 + to_report = meta; 802 + } 803 + 804 + if (!to_report) 805 + goto out; 806 + 807 + raw_spin_lock_irqsave(&to_report->lock, flags); 808 + to_report->unprotected_page = addr; 809 + error_type = KFENCE_ERROR_OOB; 810 + 811 + /* 812 + * If the object was freed before we took the look we can still 813 + * report this as an OOB -- the report will simply show the 814 + * stacktrace of the free as well. 
815 + */ 816 + } else { 817 + to_report = addr_to_metadata(addr); 818 + if (!to_report) 819 + goto out; 820 + 821 + raw_spin_lock_irqsave(&to_report->lock, flags); 822 + error_type = KFENCE_ERROR_UAF; 823 + /* 824 + * We may race with __kfence_alloc(), and it is possible that a 825 + * freed object may be reallocated. We simply report this as a 826 + * use-after-free, with the stack trace showing the place where 827 + * the object was re-allocated. 828 + */ 829 + } 830 + 831 + out: 832 + if (to_report) { 833 + kfence_report_error(addr, is_write, regs, to_report, error_type); 834 + raw_spin_unlock_irqrestore(&to_report->lock, flags); 835 + } else { 836 + /* This may be a UAF or OOB access, but we can't be sure. */ 837 + kfence_report_error(addr, is_write, regs, NULL, KFENCE_ERROR_INVALID); 838 + } 839 + 840 + return kfence_unprotect(addr); /* Unprotect and let access proceed. */ 841 + }
+106
mm/kfence/kfence.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Kernel Electric-Fence (KFENCE). For more info please see 4 + * Documentation/dev-tools/kfence.rst. 5 + * 6 + * Copyright (C) 2020, Google LLC. 7 + */ 8 + 9 + #ifndef MM_KFENCE_KFENCE_H 10 + #define MM_KFENCE_KFENCE_H 11 + 12 + #include <linux/mm.h> 13 + #include <linux/slab.h> 14 + #include <linux/spinlock.h> 15 + #include <linux/types.h> 16 + 17 + #include "../slab.h" /* for struct kmem_cache */ 18 + 19 + /* 20 + * Get the canary byte pattern for @addr. Use a pattern that varies based on the 21 + * lower 3 bits of the address, to detect memory corruptions with higher 22 + * probability, where similar constants are used. 23 + */ 24 + #define KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7)) 25 + 26 + /* Maximum stack depth for reports. */ 27 + #define KFENCE_STACK_DEPTH 64 28 + 29 + /* KFENCE object states. */ 30 + enum kfence_object_state { 31 + KFENCE_OBJECT_UNUSED, /* Object is unused. */ 32 + KFENCE_OBJECT_ALLOCATED, /* Object is currently allocated. */ 33 + KFENCE_OBJECT_FREED, /* Object was allocated, and then freed. */ 34 + }; 35 + 36 + /* Alloc/free tracking information. */ 37 + struct kfence_track { 38 + pid_t pid; 39 + int num_stack_entries; 40 + unsigned long stack_entries[KFENCE_STACK_DEPTH]; 41 + }; 42 + 43 + /* KFENCE metadata per guarded allocation. */ 44 + struct kfence_metadata { 45 + struct list_head list; /* Freelist node; access under kfence_freelist_lock. */ 46 + struct rcu_head rcu_head; /* For delayed freeing. */ 47 + 48 + /* 49 + * Lock protecting below data; to ensure consistency of the below data, 50 + * since the following may execute concurrently: __kfence_alloc(), 51 + * __kfence_free(), kfence_handle_page_fault(). However, note that we 52 + * cannot grab the same metadata off the freelist twice, and multiple 53 + * __kfence_alloc() cannot run concurrently on the same metadata. 
54 + */ 55 + raw_spinlock_t lock; 56 + 57 + /* The current state of the object; see above. */ 58 + enum kfence_object_state state; 59 + 60 + /* 61 + * Allocated object address; cannot be calculated from size, because of 62 + * alignment requirements. 63 + * 64 + * Invariant: ALIGN_DOWN(addr, PAGE_SIZE) is constant. 65 + */ 66 + unsigned long addr; 67 + 68 + /* 69 + * The size of the original allocation. 70 + */ 71 + size_t size; 72 + 73 + /* 74 + * The kmem_cache cache of the last allocation; NULL if never allocated 75 + * or the cache has already been destroyed. 76 + */ 77 + struct kmem_cache *cache; 78 + 79 + /* 80 + * In case of an invalid access, the page that was unprotected; we 81 + * optimistically only store one address. 82 + */ 83 + unsigned long unprotected_page; 84 + 85 + /* Allocation and free stack information. */ 86 + struct kfence_track alloc_track; 87 + struct kfence_track free_track; 88 + }; 89 + 90 + extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS]; 91 + 92 + /* KFENCE error types for report generation. */ 93 + enum kfence_error_type { 94 + KFENCE_ERROR_OOB, /* Detected an out-of-bounds access. */ 95 + KFENCE_ERROR_UAF, /* Detected a use-after-free access. */ 96 + KFENCE_ERROR_CORRUPTION, /* Detected a memory corruption on free. */ 97 + KFENCE_ERROR_INVALID, /* Invalid access of unknown type. */ 98 + KFENCE_ERROR_INVALID_FREE, /* Invalid free. */ 99 + }; 100 + 101 + void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs, 102 + const struct kfence_metadata *meta, enum kfence_error_type type); 103 + 104 + void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta); 105 + 106 + #endif /* MM_KFENCE_KFENCE_H */
+858
mm/kfence/kfence_test.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Test cases for KFENCE memory safety error detector. Since the interface with 4 + * which KFENCE's reports are obtained is via the console, this is the output we 5 + * should verify. Each test case checks the presence (or absence) of 6 + * generated reports. Relies on 'console' tracepoint to capture reports as they 7 + * appear in the kernel log. 8 + * 9 + * Copyright (C) 2020, Google LLC. 10 + * Author: Alexander Potapenko <glider@google.com> 11 + * Marco Elver <elver@google.com> 12 + */ 13 + 14 + #include <kunit/test.h> 15 + #include <linux/jiffies.h> 16 + #include <linux/kernel.h> 17 + #include <linux/kfence.h> 18 + #include <linux/mm.h> 19 + #include <linux/random.h> 20 + #include <linux/slab.h> 21 + #include <linux/spinlock.h> 22 + #include <linux/string.h> 23 + #include <linux/tracepoint.h> 24 + #include <trace/events/printk.h> 25 + 26 + #include "kfence.h" 27 + 28 + /* Report as observed from console. */ 29 + static struct { 30 + spinlock_t lock; 31 + int nlines; 32 + char lines[2][256]; 33 + } observed = { 34 + .lock = __SPIN_LOCK_UNLOCKED(observed.lock), 35 + }; 36 + 37 + /* Probe for console output: obtains observed lines of interest. */ 38 + static void probe_console(void *ignore, const char *buf, size_t len) 39 + { 40 + unsigned long flags; 41 + int nlines; 42 + 43 + spin_lock_irqsave(&observed.lock, flags); 44 + nlines = observed.nlines; 45 + 46 + if (strnstr(buf, "BUG: KFENCE: ", len) && strnstr(buf, "test_", len)) { 47 + /* 48 + * A KFENCE report related to the test. 49 + * 50 + * The provided @buf is not NUL-terminated; copy no more than 51 + * @len bytes and let strscpy() add the missing NUL-terminator.
52 + */ 53 + strscpy(observed.lines[0], buf, min(len + 1, sizeof(observed.lines[0]))); 54 + nlines = 1; 55 + } else if (nlines == 1 && (strnstr(buf, "at 0x", len) || strnstr(buf, "of 0x", len))) { 56 + strscpy(observed.lines[nlines++], buf, min(len + 1, sizeof(observed.lines[0]))); 57 + } 58 + 59 + WRITE_ONCE(observed.nlines, nlines); /* Publish new nlines. */ 60 + spin_unlock_irqrestore(&observed.lock, flags); 61 + } 62 + 63 + /* Check if a report related to the test exists. */ 64 + static bool report_available(void) 65 + { 66 + return READ_ONCE(observed.nlines) == ARRAY_SIZE(observed.lines); 67 + } 68 + 69 + /* Information we expect in a report. */ 70 + struct expect_report { 71 + enum kfence_error_type type; /* The type of error. */ 72 + void *fn; /* Function pointer to expected function where access occurred. */ 73 + char *addr; /* Address at which the bad access occurred. */ 74 + bool is_write; /* Is access a write. */ 75 + }; 76 + 77 + static const char *get_access_type(const struct expect_report *r) 78 + { 79 + return r->is_write ? "write" : "read"; 80 + } 81 + 82 + /* Check observed report matches information in @r. */ 83 + static bool report_matches(const struct expect_report *r) 84 + { 85 + bool ret = false; 86 + unsigned long flags; 87 + typeof(observed.lines) expect; 88 + const char *end; 89 + char *cur; 90 + 91 + /* Double-checked locking. */ 92 + if (!report_available()) 93 + return false; 94 + 95 + /* Generate expected report contents.
*/ 96 + 97 + /* Title */ 98 + cur = expect[0]; 99 + end = &expect[0][sizeof(expect[0]) - 1]; 100 + switch (r->type) { 101 + case KFENCE_ERROR_OOB: 102 + cur += scnprintf(cur, end - cur, "BUG: KFENCE: out-of-bounds %s", 103 + get_access_type(r)); 104 + break; 105 + case KFENCE_ERROR_UAF: 106 + cur += scnprintf(cur, end - cur, "BUG: KFENCE: use-after-free %s", 107 + get_access_type(r)); 108 + break; 109 + case KFENCE_ERROR_CORRUPTION: 110 + cur += scnprintf(cur, end - cur, "BUG: KFENCE: memory corruption"); 111 + break; 112 + case KFENCE_ERROR_INVALID: 113 + cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid %s", 114 + get_access_type(r)); 115 + break; 116 + case KFENCE_ERROR_INVALID_FREE: 117 + cur += scnprintf(cur, end - cur, "BUG: KFENCE: invalid free"); 118 + break; 119 + } 120 + 121 + scnprintf(cur, end - cur, " in %pS", r->fn); 122 + /* The exact offset won't match, remove it; also strip module name. */ 123 + cur = strchr(expect[0], '+'); 124 + if (cur) 125 + *cur = '\0'; 126 + 127 + /* Access information */ 128 + cur = expect[1]; 129 + end = &expect[1][sizeof(expect[1]) - 1]; 130 + 131 + switch (r->type) { 132 + case KFENCE_ERROR_OOB: 133 + cur += scnprintf(cur, end - cur, "Out-of-bounds %s at", get_access_type(r)); 134 + break; 135 + case KFENCE_ERROR_UAF: 136 + cur += scnprintf(cur, end - cur, "Use-after-free %s at", get_access_type(r)); 137 + break; 138 + case KFENCE_ERROR_CORRUPTION: 139 + cur += scnprintf(cur, end - cur, "Corrupted memory at"); 140 + break; 141 + case KFENCE_ERROR_INVALID: 142 + cur += scnprintf(cur, end - cur, "Invalid %s at", get_access_type(r)); 143 + break; 144 + case KFENCE_ERROR_INVALID_FREE: 145 + cur += scnprintf(cur, end - cur, "Invalid free of"); 146 + break; 147 + } 148 + 149 + cur += scnprintf(cur, end - cur, " 0x%p", (void *)r->addr); 150 + 151 + spin_lock_irqsave(&observed.lock, flags); 152 + if (!report_available()) 153 + goto out; /* A new report is being captured. 
*/ 154 + 155 + /* Finally match expected output to what we actually observed. */ 156 + ret = strstr(observed.lines[0], expect[0]) && strstr(observed.lines[1], expect[1]); 157 + out: 158 + spin_unlock_irqrestore(&observed.lock, flags); 159 + return ret; 160 + } 161 + 162 + /* ===== Test cases ===== */ 163 + 164 + #define TEST_PRIV_WANT_MEMCACHE ((void *)1) 165 + 166 + /* Cache used by tests; if NULL, allocate from kmalloc instead. */ 167 + static struct kmem_cache *test_cache; 168 + 169 + static size_t setup_test_cache(struct kunit *test, size_t size, slab_flags_t flags, 170 + void (*ctor)(void *)) 171 + { 172 + if (test->priv != TEST_PRIV_WANT_MEMCACHE) 173 + return size; 174 + 175 + kunit_info(test, "%s: size=%zu, ctor=%ps\n", __func__, size, ctor); 176 + 177 + /* 178 + * Use SLAB_NOLEAKTRACE to prevent merging with existing caches. Any 179 + * other flag in SLAB_NEVER_MERGE also works. Use SLAB_ACCOUNT to 180 + * allocate via memcg, if enabled. 181 + */ 182 + flags |= SLAB_NOLEAKTRACE | SLAB_ACCOUNT; 183 + test_cache = kmem_cache_create("test", size, 1, flags, ctor); 184 + KUNIT_ASSERT_TRUE_MSG(test, test_cache, "could not create cache"); 185 + 186 + return size; 187 + } 188 + 189 + static void test_cache_destroy(void) 190 + { 191 + if (!test_cache) 192 + return; 193 + 194 + kmem_cache_destroy(test_cache); 195 + test_cache = NULL; 196 + } 197 + 198 + static inline size_t kmalloc_cache_alignment(size_t size) 199 + { 200 + return kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)]->align; 201 + } 202 + 203 + /* Must always inline to match stack trace against caller. */ 204 + static __always_inline void test_free(void *ptr) 205 + { 206 + if (test_cache) 207 + kmem_cache_free(test_cache, ptr); 208 + else 209 + kfree(ptr); 210 + } 211 + 212 + /* 213 + * If this should be a KFENCE allocation, and on which side the allocation and 214 + * the closest guard page should be. 215 + */ 216 + enum allocation_policy { 217 + ALLOCATE_ANY, /* KFENCE, any side. 
*/ 218 + ALLOCATE_LEFT, /* KFENCE, left side of page. */ 219 + ALLOCATE_RIGHT, /* KFENCE, right side of page. */ 220 + ALLOCATE_NONE, /* No KFENCE allocation. */ 221 + }; 222 + 223 + /* 224 + * Try to get a guarded allocation from KFENCE. Uses either kmalloc() or the 225 + * current test_cache if set up. 226 + */ 227 + static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocation_policy policy) 228 + { 229 + void *alloc; 230 + unsigned long timeout, resched_after; 231 + const char *policy_name; 232 + 233 + switch (policy) { 234 + case ALLOCATE_ANY: 235 + policy_name = "any"; 236 + break; 237 + case ALLOCATE_LEFT: 238 + policy_name = "left"; 239 + break; 240 + case ALLOCATE_RIGHT: 241 + policy_name = "right"; 242 + break; 243 + case ALLOCATE_NONE: 244 + policy_name = "none"; 245 + break; 246 + } 247 + 248 + kunit_info(test, "%s: size=%zu, gfp=%x, policy=%s, cache=%i\n", __func__, size, gfp, 249 + policy_name, !!test_cache); 250 + 251 + /* 252 + * 100x the sample interval should be more than enough to ensure we get 253 + * a KFENCE allocation eventually. 254 + */ 255 + timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL); 256 + /* 257 + * Especially for non-preemption kernels, ensure the allocation-gate 258 + * timer can catch up: after @resched_after, every failed allocation 259 + * attempt yields, to ensure the allocation-gate timer is scheduled. 
260 + */ 261 + resched_after = jiffies + msecs_to_jiffies(CONFIG_KFENCE_SAMPLE_INTERVAL); 262 + do { 263 + if (test_cache) 264 + alloc = kmem_cache_alloc(test_cache, gfp); 265 + else 266 + alloc = kmalloc(size, gfp); 267 + 268 + if (is_kfence_address(alloc)) { 269 + struct page *page = virt_to_head_page(alloc); 270 + struct kmem_cache *s = test_cache ?: kmalloc_caches[kmalloc_type(GFP_KERNEL)][kmalloc_index(size)]; 271 + 272 + /* 273 + * Verify that various helpers return the right values 274 + * even for KFENCE objects; these are required so that 275 + * memcg accounting works correctly. 276 + */ 277 + KUNIT_EXPECT_EQ(test, obj_to_index(s, page, alloc), 0U); 278 + KUNIT_EXPECT_EQ(test, objs_per_slab_page(s, page), 1); 279 + 280 + if (policy == ALLOCATE_ANY) 281 + return alloc; 282 + if (policy == ALLOCATE_LEFT && IS_ALIGNED((unsigned long)alloc, PAGE_SIZE)) 283 + return alloc; 284 + if (policy == ALLOCATE_RIGHT && 285 + !IS_ALIGNED((unsigned long)alloc, PAGE_SIZE)) 286 + return alloc; 287 + } else if (policy == ALLOCATE_NONE) 288 + return alloc; 289 + 290 + test_free(alloc); 291 + 292 + if (time_after(jiffies, resched_after)) 293 + cond_resched(); 294 + } while (time_before(jiffies, timeout)); 295 + 296 + KUNIT_ASSERT_TRUE_MSG(test, false, "failed to allocate from KFENCE"); 297 + return NULL; /* Unreachable. */ 298 + } 299 + 300 + static void test_out_of_bounds_read(struct kunit *test) 301 + { 302 + size_t size = 32; 303 + struct expect_report expect = { 304 + .type = KFENCE_ERROR_OOB, 305 + .fn = test_out_of_bounds_read, 306 + .is_write = false, 307 + }; 308 + char *buf; 309 + 310 + setup_test_cache(test, size, 0, NULL); 311 + 312 + /* 313 + * If we don't have our own cache, adjust based on alignment, so that we 314 + * actually access guard pages on either side. 315 + */ 316 + if (!test_cache) 317 + size = kmalloc_cache_alignment(size); 318 + 319 + /* Test both sides. 
*/ 320 + 321 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT); 322 + expect.addr = buf - 1; 323 + READ_ONCE(*expect.addr); 324 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 325 + test_free(buf); 326 + 327 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT); 328 + expect.addr = buf + size; 329 + READ_ONCE(*expect.addr); 330 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 331 + test_free(buf); 332 + } 333 + 334 + static void test_out_of_bounds_write(struct kunit *test) 335 + { 336 + size_t size = 32; 337 + struct expect_report expect = { 338 + .type = KFENCE_ERROR_OOB, 339 + .fn = test_out_of_bounds_write, 340 + .is_write = true, 341 + }; 342 + char *buf; 343 + 344 + setup_test_cache(test, size, 0, NULL); 345 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT); 346 + expect.addr = buf - 1; 347 + WRITE_ONCE(*expect.addr, 42); 348 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 349 + test_free(buf); 350 + } 351 + 352 + static void test_use_after_free_read(struct kunit *test) 353 + { 354 + const size_t size = 32; 355 + struct expect_report expect = { 356 + .type = KFENCE_ERROR_UAF, 357 + .fn = test_use_after_free_read, 358 + .is_write = false, 359 + }; 360 + 361 + setup_test_cache(test, size, 0, NULL); 362 + expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 363 + test_free(expect.addr); 364 + READ_ONCE(*expect.addr); 365 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 366 + } 367 + 368 + static void test_double_free(struct kunit *test) 369 + { 370 + const size_t size = 32; 371 + struct expect_report expect = { 372 + .type = KFENCE_ERROR_INVALID_FREE, 373 + .fn = test_double_free, 374 + }; 375 + 376 + setup_test_cache(test, size, 0, NULL); 377 + expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 378 + test_free(expect.addr); 379 + test_free(expect.addr); /* Double-free. 
*/ 380 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 381 + } 382 + 383 + static void test_invalid_addr_free(struct kunit *test) 384 + { 385 + const size_t size = 32; 386 + struct expect_report expect = { 387 + .type = KFENCE_ERROR_INVALID_FREE, 388 + .fn = test_invalid_addr_free, 389 + }; 390 + char *buf; 391 + 392 + setup_test_cache(test, size, 0, NULL); 393 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 394 + expect.addr = buf + 1; /* Free on invalid address. */ 395 + test_free(expect.addr); /* Invalid address free. */ 396 + test_free(buf); /* No error. */ 397 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 398 + } 399 + 400 + static void test_corruption(struct kunit *test) 401 + { 402 + size_t size = 32; 403 + struct expect_report expect = { 404 + .type = KFENCE_ERROR_CORRUPTION, 405 + .fn = test_corruption, 406 + }; 407 + char *buf; 408 + 409 + setup_test_cache(test, size, 0, NULL); 410 + 411 + /* Test both sides. */ 412 + 413 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT); 414 + expect.addr = buf + size; 415 + WRITE_ONCE(*expect.addr, 42); 416 + test_free(buf); 417 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 418 + 419 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT); 420 + expect.addr = buf - 1; 421 + WRITE_ONCE(*expect.addr, 42); 422 + test_free(buf); 423 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 424 + } 425 + 426 + /* 427 + * KFENCE is unable to detect an OOB if the allocation's alignment requirements 428 + * leave a gap between the object and the guard page. Specifically, an 429 + * allocation of e.g. 73 bytes is aligned on 8 and 128 bytes for SLUB or SLAB 430 + * respectively. Therefore it is impossible for the allocated object to 431 + * contiguously line up with the right guard page. 432 + * 433 + * However, we test that an access to memory beyond the gap results in KFENCE 434 + * detecting an OOB access. 
435 + */ 436 + static void test_kmalloc_aligned_oob_read(struct kunit *test) 437 + { 438 + const size_t size = 73; 439 + const size_t align = kmalloc_cache_alignment(size); 440 + struct expect_report expect = { 441 + .type = KFENCE_ERROR_OOB, 442 + .fn = test_kmalloc_aligned_oob_read, 443 + .is_write = false, 444 + }; 445 + char *buf; 446 + 447 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT); 448 + 449 + /* 450 + * The object is offset to the right, so there won't be an OOB to the 451 + * left of it. 452 + */ 453 + READ_ONCE(*(buf - 1)); 454 + KUNIT_EXPECT_FALSE(test, report_available()); 455 + 456 + /* 457 + * @buf must be aligned on @align, therefore buf + size belongs to the 458 + * same page -> no OOB. 459 + */ 460 + READ_ONCE(*(buf + size)); 461 + KUNIT_EXPECT_FALSE(test, report_available()); 462 + 463 + /* Overflowing by @align bytes will result in an OOB. */ 464 + expect.addr = buf + size + align; 465 + READ_ONCE(*expect.addr); 466 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 467 + 468 + test_free(buf); 469 + } 470 + 471 + static void test_kmalloc_aligned_oob_write(struct kunit *test) 472 + { 473 + const size_t size = 73; 474 + struct expect_report expect = { 475 + .type = KFENCE_ERROR_CORRUPTION, 476 + .fn = test_kmalloc_aligned_oob_write, 477 + }; 478 + char *buf; 479 + 480 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT); 481 + /* 482 + * The object is offset to the right, so we won't get a page 483 + * fault immediately after it. 484 + */ 485 + expect.addr = buf + size; 486 + WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1); 487 + KUNIT_EXPECT_FALSE(test, report_available()); 488 + test_free(buf); 489 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 490 + } 491 + 492 + /* Test cache shrinking and destroying with KFENCE. 
*/ 493 + static void test_shrink_memcache(struct kunit *test) 494 + { 495 + const size_t size = 32; 496 + void *buf; 497 + 498 + setup_test_cache(test, size, 0, NULL); 499 + KUNIT_EXPECT_TRUE(test, test_cache); 500 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 501 + kmem_cache_shrink(test_cache); 502 + test_free(buf); 503 + 504 + KUNIT_EXPECT_FALSE(test, report_available()); 505 + } 506 + 507 + static void ctor_set_x(void *obj) 508 + { 509 + /* Every object has at least 8 bytes. */ 510 + memset(obj, 'x', 8); 511 + } 512 + 513 + /* Ensure that SL*B does not modify KFENCE objects on bulk free. */ 514 + static void test_free_bulk(struct kunit *test) 515 + { 516 + int iter; 517 + 518 + for (iter = 0; iter < 5; iter++) { 519 + const size_t size = setup_test_cache(test, 8 + prandom_u32_max(300), 0, 520 + (iter & 1) ? ctor_set_x : NULL); 521 + void *objects[] = { 522 + test_alloc(test, size, GFP_KERNEL, ALLOCATE_RIGHT), 523 + test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE), 524 + test_alloc(test, size, GFP_KERNEL, ALLOCATE_LEFT), 525 + test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE), 526 + test_alloc(test, size, GFP_KERNEL, ALLOCATE_NONE), 527 + }; 528 + 529 + kmem_cache_free_bulk(test_cache, ARRAY_SIZE(objects), objects); 530 + KUNIT_ASSERT_FALSE(test, report_available()); 531 + test_cache_destroy(); 532 + } 533 + } 534 + 535 + /* Test init-on-free works. */ 536 + static void test_init_on_free(struct kunit *test) 537 + { 538 + const size_t size = 32; 539 + struct expect_report expect = { 540 + .type = KFENCE_ERROR_UAF, 541 + .fn = test_init_on_free, 542 + .is_write = false, 543 + }; 544 + int i; 545 + 546 + if (!IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON)) 547 + return; 548 + /* Assume it hasn't been disabled on command line. 
*/ 549 + 550 + setup_test_cache(test, size, 0, NULL); 551 + expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 552 + for (i = 0; i < size; i++) 553 + expect.addr[i] = i + 1; 554 + test_free(expect.addr); 555 + 556 + for (i = 0; i < size; i++) { 557 + /* 558 + * This may fail if the page was recycled by KFENCE and then 559 + * written to again -- this however, is near impossible with a 560 + * default config. 561 + */ 562 + KUNIT_EXPECT_EQ(test, expect.addr[i], (char)0); 563 + 564 + if (!i) /* Only check first access to not fail test if page is ever re-protected. */ 565 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 566 + } 567 + } 568 + 569 + /* Ensure that constructors work properly. */ 570 + static void test_memcache_ctor(struct kunit *test) 571 + { 572 + const size_t size = 32; 573 + char *buf; 574 + int i; 575 + 576 + setup_test_cache(test, size, 0, ctor_set_x); 577 + buf = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 578 + 579 + for (i = 0; i < 8; i++) 580 + KUNIT_EXPECT_EQ(test, buf[i], (char)'x'); 581 + 582 + test_free(buf); 583 + 584 + KUNIT_EXPECT_FALSE(test, report_available()); 585 + } 586 + 587 + /* Test that memory is zeroed if requested. */ 588 + static void test_gfpzero(struct kunit *test) 589 + { 590 + const size_t size = PAGE_SIZE; /* PAGE_SIZE so we can use ALLOCATE_ANY. */ 591 + char *buf1, *buf2; 592 + int i; 593 + 594 + if (CONFIG_KFENCE_SAMPLE_INTERVAL > 100) { 595 + kunit_warn(test, "skipping ... would take too long\n"); 596 + return; 597 + } 598 + 599 + setup_test_cache(test, size, 0, NULL); 600 + buf1 = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 601 + for (i = 0; i < size; i++) 602 + buf1[i] = i + 1; 603 + test_free(buf1); 604 + 605 + /* Try to get same address again -- this can take a while. 
*/ 606 + for (i = 0;; i++) { 607 + buf2 = test_alloc(test, size, GFP_KERNEL | __GFP_ZERO, ALLOCATE_ANY); 608 + if (buf1 == buf2) 609 + break; 610 + test_free(buf2); 611 + 612 + if (i == CONFIG_KFENCE_NUM_OBJECTS) { 613 + kunit_warn(test, "giving up ... cannot get same object back\n"); 614 + return; 615 + } 616 + } 617 + 618 + for (i = 0; i < size; i++) 619 + KUNIT_EXPECT_EQ(test, buf2[i], (char)0); 620 + 621 + test_free(buf2); 622 + 623 + KUNIT_EXPECT_FALSE(test, report_available()); 624 + } 625 + 626 + static void test_invalid_access(struct kunit *test) 627 + { 628 + const struct expect_report expect = { 629 + .type = KFENCE_ERROR_INVALID, 630 + .fn = test_invalid_access, 631 + .addr = &__kfence_pool[10], 632 + .is_write = false, 633 + }; 634 + 635 + READ_ONCE(__kfence_pool[10]); 636 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 637 + } 638 + 639 + /* Test SLAB_TYPESAFE_BY_RCU works. */ 640 + static void test_memcache_typesafe_by_rcu(struct kunit *test) 641 + { 642 + const size_t size = 32; 643 + struct expect_report expect = { 644 + .type = KFENCE_ERROR_UAF, 645 + .fn = test_memcache_typesafe_by_rcu, 646 + .is_write = false, 647 + }; 648 + 649 + setup_test_cache(test, size, SLAB_TYPESAFE_BY_RCU, NULL); 650 + KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */ 651 + 652 + expect.addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY); 653 + *expect.addr = 42; 654 + 655 + rcu_read_lock(); 656 + test_free(expect.addr); 657 + KUNIT_EXPECT_EQ(test, *expect.addr, (char)42); 658 + /* 659 + * Up to this point, memory should not have been freed yet, and 660 + * therefore there should be no KFENCE report from the above access. 661 + */ 662 + rcu_read_unlock(); 663 + 664 + /* Above access to @expect.addr should not have generated a report! */ 665 + KUNIT_EXPECT_FALSE(test, report_available()); 666 + 667 + /* Only after rcu_barrier() is the memory guaranteed to be freed. */ 668 + rcu_barrier(); 669 + 670 + /* Expect use-after-free. 
*/ 671 + KUNIT_EXPECT_EQ(test, *expect.addr, (char)42); 672 + KUNIT_EXPECT_TRUE(test, report_matches(&expect)); 673 + } 674 + 675 + /* Test krealloc(). */ 676 + static void test_krealloc(struct kunit *test) 677 + { 678 + const size_t size = 32; 679 + const struct expect_report expect = { 680 + .type = KFENCE_ERROR_UAF, 681 + .fn = test_krealloc, 682 + .addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY), 683 + .is_write = false, 684 + }; 685 + char *buf = expect.addr; 686 + int i; 687 + 688 + KUNIT_EXPECT_FALSE(test, test_cache); 689 + KUNIT_EXPECT_EQ(test, ksize(buf), size); /* Precise size match after KFENCE alloc. */ 690 + for (i = 0; i < size; i++) 691 + buf[i] = i + 1; 692 + 693 + /* Check that we successfully change the size. */ 694 + buf = krealloc(buf, size * 3, GFP_KERNEL); /* Grow. */ 695 + /* Note: Might no longer be a KFENCE alloc. */ 696 + KUNIT_EXPECT_GE(test, ksize(buf), size * 3); 697 + for (i = 0; i < size; i++) 698 + KUNIT_EXPECT_EQ(test, buf[i], (char)(i + 1)); 699 + for (; i < size * 3; i++) /* Fill to extra bytes. */ 700 + buf[i] = i + 1; 701 + 702 + buf = krealloc(buf, size * 2, GFP_KERNEL); /* Shrink. */ 703 + KUNIT_EXPECT_GE(test, ksize(buf), size * 2); 704 + for (i = 0; i < size * 2; i++) 705 + KUNIT_EXPECT_EQ(test, buf[i], (char)(i + 1)); 706 + 707 + buf = krealloc(buf, 0, GFP_KERNEL); /* Free. */ 708 + KUNIT_EXPECT_EQ(test, (unsigned long)buf, (unsigned long)ZERO_SIZE_PTR); 709 + KUNIT_ASSERT_FALSE(test, report_available()); /* No reports yet! */ 710 + 711 + READ_ONCE(*expect.addr); /* Ensure krealloc() actually freed earlier KFENCE object. */ 712 + KUNIT_ASSERT_TRUE(test, report_matches(&expect)); 713 + } 714 + 715 + /* Test that some objects from a bulk allocation belong to KFENCE pool. 
*/ 716 + static void test_memcache_alloc_bulk(struct kunit *test) 717 + { 718 + const size_t size = 32; 719 + bool pass = false; 720 + unsigned long timeout; 721 + 722 + setup_test_cache(test, size, 0, NULL); 723 + KUNIT_EXPECT_TRUE(test, test_cache); /* Want memcache. */ 724 + /* 725 + * 100x the sample interval should be more than enough to ensure we get 726 + * a KFENCE allocation eventually. 727 + */ 728 + timeout = jiffies + msecs_to_jiffies(100 * CONFIG_KFENCE_SAMPLE_INTERVAL); 729 + do { 730 + void *objects[100]; 731 + int i, num = kmem_cache_alloc_bulk(test_cache, GFP_ATOMIC, ARRAY_SIZE(objects), 732 + objects); 733 + if (!num) 734 + continue; 735 + for (i = 0; i < ARRAY_SIZE(objects); i++) { 736 + if (is_kfence_address(objects[i])) { 737 + pass = true; 738 + break; 739 + } 740 + } 741 + kmem_cache_free_bulk(test_cache, num, objects); 742 + /* 743 + * kmem_cache_alloc_bulk() disables interrupts, and calling it 744 + * in a tight loop may not give KFENCE a chance to switch the 745 + * static branch. Call cond_resched() to let KFENCE chime in. 746 + */ 747 + cond_resched(); 748 + } while (!pass && time_before(jiffies, timeout)); 749 + 750 + KUNIT_EXPECT_TRUE(test, pass); 751 + KUNIT_EXPECT_FALSE(test, report_available()); 752 + } 753 + 754 + /* 755 + * KUnit does not provide a way to provide arguments to tests, and we encode 756 + * additional info in the name. Set up 2 tests per test case, one using the 757 + * default allocator, and another using a custom memcache (suffix '-memcache'). 
758 + */ 759 + #define KFENCE_KUNIT_CASE(test_name) \ 760 + { .run_case = test_name, .name = #test_name }, \ 761 + { .run_case = test_name, .name = #test_name "-memcache" } 762 + 763 + static struct kunit_case kfence_test_cases[] = { 764 + KFENCE_KUNIT_CASE(test_out_of_bounds_read), 765 + KFENCE_KUNIT_CASE(test_out_of_bounds_write), 766 + KFENCE_KUNIT_CASE(test_use_after_free_read), 767 + KFENCE_KUNIT_CASE(test_double_free), 768 + KFENCE_KUNIT_CASE(test_invalid_addr_free), 769 + KFENCE_KUNIT_CASE(test_corruption), 770 + KFENCE_KUNIT_CASE(test_free_bulk), 771 + KFENCE_KUNIT_CASE(test_init_on_free), 772 + KUNIT_CASE(test_kmalloc_aligned_oob_read), 773 + KUNIT_CASE(test_kmalloc_aligned_oob_write), 774 + KUNIT_CASE(test_shrink_memcache), 775 + KUNIT_CASE(test_memcache_ctor), 776 + KUNIT_CASE(test_invalid_access), 777 + KUNIT_CASE(test_gfpzero), 778 + KUNIT_CASE(test_memcache_typesafe_by_rcu), 779 + KUNIT_CASE(test_krealloc), 780 + KUNIT_CASE(test_memcache_alloc_bulk), 781 + {}, 782 + }; 783 + 784 + /* ===== End test cases ===== */ 785 + 786 + static int test_init(struct kunit *test) 787 + { 788 + unsigned long flags; 789 + int i; 790 + 791 + spin_lock_irqsave(&observed.lock, flags); 792 + for (i = 0; i < ARRAY_SIZE(observed.lines); i++) 793 + observed.lines[i][0] = '\0'; 794 + observed.nlines = 0; 795 + spin_unlock_irqrestore(&observed.lock, flags); 796 + 797 + /* Any test with 'memcache' in its name will want a memcache. 
*/ 798 + if (strstr(test->name, "memcache")) 799 + test->priv = TEST_PRIV_WANT_MEMCACHE; 800 + else 801 + test->priv = NULL; 802 + 803 + return 0; 804 + } 805 + 806 + static void test_exit(struct kunit *test) 807 + { 808 + test_cache_destroy(); 809 + } 810 + 811 + static struct kunit_suite kfence_test_suite = { 812 + .name = "kfence", 813 + .test_cases = kfence_test_cases, 814 + .init = test_init, 815 + .exit = test_exit, 816 + }; 817 + static struct kunit_suite *kfence_test_suites[] = { &kfence_test_suite, NULL }; 818 + 819 + static void register_tracepoints(struct tracepoint *tp, void *ignore) 820 + { 821 + check_trace_callback_type_console(probe_console); 822 + if (!strcmp(tp->name, "console")) 823 + WARN_ON(tracepoint_probe_register(tp, probe_console, NULL)); 824 + } 825 + 826 + static void unregister_tracepoints(struct tracepoint *tp, void *ignore) 827 + { 828 + if (!strcmp(tp->name, "console")) 829 + tracepoint_probe_unregister(tp, probe_console, NULL); 830 + } 831 + 832 + /* 833 + * We only want to do tracepoints setup and teardown once, therefore we have to 834 + * customize the init and exit functions and cannot rely on kunit_test_suite(). 835 + */ 836 + static int __init kfence_test_init(void) 837 + { 838 + /* 839 + * Because we want to be able to build the test as a module, we need to 840 + * iterate through all known tracepoints, since the static registration 841 + * won't work here. 842 + */ 843 + for_each_kernel_tracepoint(register_tracepoints, NULL); 844 + return __kunit_test_suites_init(kfence_test_suites); 845 + } 846 + 847 + static void kfence_test_exit(void) 848 + { 849 + __kunit_test_suites_exit(kfence_test_suites); 850 + for_each_kernel_tracepoint(unregister_tracepoints, NULL); 851 + tracepoint_synchronize_unregister(); 852 + } 853 + 854 + late_initcall(kfence_test_init); 855 + module_exit(kfence_test_exit); 856 + 857 + MODULE_LICENSE("GPL v2"); 858 + MODULE_AUTHOR("Alexander Potapenko <glider@google.com>, Marco Elver <elver@google.com>");
+262
mm/kfence/report.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * KFENCE reporting. 4 + * 5 + * Copyright (C) 2020, Google LLC. 6 + */ 7 + 8 + #include <stdarg.h> 9 + 10 + #include <linux/kernel.h> 11 + #include <linux/lockdep.h> 12 + #include <linux/printk.h> 13 + #include <linux/sched/debug.h> 14 + #include <linux/seq_file.h> 15 + #include <linux/stacktrace.h> 16 + #include <linux/string.h> 17 + #include <trace/events/error_report.h> 18 + 19 + #include <asm/kfence.h> 20 + 21 + #include "kfence.h" 22 + 23 + extern bool no_hash_pointers; 24 + 25 + /* Helper function to either print to a seq_file or to console. */ 26 + __printf(2, 3) 27 + static void seq_con_printf(struct seq_file *seq, const char *fmt, ...) 28 + { 29 + va_list args; 30 + 31 + va_start(args, fmt); 32 + if (seq) 33 + seq_vprintf(seq, fmt, args); 34 + else 35 + vprintk(fmt, args); 36 + va_end(args); 37 + } 38 + 39 + /* 40 + * Get the number of stack entries to skip to get out of MM internals. @type is 41 + * optional, and if set to NULL, assumes an allocation or free stack. 42 + */ 43 + static int get_stack_skipnr(const unsigned long stack_entries[], int num_entries, 44 + const enum kfence_error_type *type) 45 + { 46 + char buf[64]; 47 + int skipnr, fallback = 0; 48 + 49 + if (type) { 50 + /* Depending on error type, find different stack entries. */ 51 + switch (*type) { 52 + case KFENCE_ERROR_UAF: 53 + case KFENCE_ERROR_OOB: 54 + case KFENCE_ERROR_INVALID: 55 + /* 56 + * kfence_handle_page_fault() may be called with pt_regs 57 + * set to NULL; in that case we'll simply show the full 58 + * stack trace. 
59 + */ 60 + return 0; 61 + case KFENCE_ERROR_CORRUPTION: 62 + case KFENCE_ERROR_INVALID_FREE: 63 + break; 64 + } 65 + } 66 + 67 + for (skipnr = 0; skipnr < num_entries; skipnr++) { 68 + int len = scnprintf(buf, sizeof(buf), "%ps", (void *)stack_entries[skipnr]); 69 + 70 + if (str_has_prefix(buf, "kfence_") || str_has_prefix(buf, "__kfence_") || 71 + !strncmp(buf, "__slab_free", len)) { 72 + /* 73 + * In case of tail calls from any of the below 74 + * to any of the above. 75 + */ 76 + fallback = skipnr + 1; 77 + } 78 + 79 + /* Also the *_bulk() variants by only checking prefixes. */ 80 + if (str_has_prefix(buf, "kfree") || 81 + str_has_prefix(buf, "kmem_cache_free") || 82 + str_has_prefix(buf, "__kmalloc") || 83 + str_has_prefix(buf, "kmem_cache_alloc")) 84 + goto found; 85 + } 86 + if (fallback < num_entries) 87 + return fallback; 88 + found: 89 + skipnr++; 90 + return skipnr < num_entries ? skipnr : 0; 91 + } 92 + 93 + static void kfence_print_stack(struct seq_file *seq, const struct kfence_metadata *meta, 94 + bool show_alloc) 95 + { 96 + const struct kfence_track *track = show_alloc ? &meta->alloc_track : &meta->free_track; 97 + 98 + if (track->num_stack_entries) { 99 + /* Skip allocation/free internals stack. */ 100 + int i = get_stack_skipnr(track->stack_entries, track->num_stack_entries, NULL); 101 + 102 + /* stack_trace_seq_print() does not exist; open code our own. */ 103 + for (; i < track->num_stack_entries; i++) 104 + seq_con_printf(seq, " %pS\n", (void *)track->stack_entries[i]); 105 + } else { 106 + seq_con_printf(seq, " no %s stack\n", show_alloc ? 
"allocation" : "deallocation"); 107 + } 108 + } 109 + 110 + void kfence_print_object(struct seq_file *seq, const struct kfence_metadata *meta) 111 + { 112 + const int size = abs(meta->size); 113 + const unsigned long start = meta->addr; 114 + const struct kmem_cache *const cache = meta->cache; 115 + 116 + lockdep_assert_held(&meta->lock); 117 + 118 + if (meta->state == KFENCE_OBJECT_UNUSED) { 119 + seq_con_printf(seq, "kfence-#%zd unused\n", meta - kfence_metadata); 120 + return; 121 + } 122 + 123 + seq_con_printf(seq, 124 + "kfence-#%zd [0x%p-0x%p" 125 + ", size=%d, cache=%s] allocated by task %d:\n", 126 + meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size, 127 + (cache && cache->name) ? cache->name : "<destroyed>", meta->alloc_track.pid); 128 + kfence_print_stack(seq, meta, true); 129 + 130 + if (meta->state == KFENCE_OBJECT_FREED) { 131 + seq_con_printf(seq, "\nfreed by task %d:\n", meta->free_track.pid); 132 + kfence_print_stack(seq, meta, false); 133 + } 134 + } 135 + 136 + /* 137 + * Show bytes at @addr that are different from the expected canary values, up to 138 + * @max_bytes. 139 + */ 140 + static void print_diff_canary(unsigned long address, size_t bytes_to_show, 141 + const struct kfence_metadata *meta) 142 + { 143 + const unsigned long show_until_addr = address + bytes_to_show; 144 + const u8 *cur, *end; 145 + 146 + /* Do not show contents of object nor read into following guard page. */ 147 + end = (const u8 *)(address < meta->addr ? min(show_until_addr, meta->addr) 148 + : min(show_until_addr, PAGE_ALIGN(address))); 149 + 150 + pr_cont("["); 151 + for (cur = (const u8 *)address; cur < end; cur++) { 152 + if (*cur == KFENCE_CANARY_PATTERN(cur)) 153 + pr_cont(" ."); 154 + else if (no_hash_pointers) 155 + pr_cont(" 0x%02x", *cur); 156 + else /* Do not leak kernel memory in non-debug builds. 
*/ 157 + pr_cont(" !"); 158 + } 159 + pr_cont(" ]"); 160 + } 161 + 162 + static const char *get_access_type(bool is_write) 163 + { 164 + return is_write ? "write" : "read"; 165 + } 166 + 167 + void kfence_report_error(unsigned long address, bool is_write, struct pt_regs *regs, 168 + const struct kfence_metadata *meta, enum kfence_error_type type) 169 + { 170 + unsigned long stack_entries[KFENCE_STACK_DEPTH] = { 0 }; 171 + const ptrdiff_t object_index = meta ? meta - kfence_metadata : -1; 172 + int num_stack_entries; 173 + int skipnr = 0; 174 + 175 + if (regs) { 176 + num_stack_entries = stack_trace_save_regs(regs, stack_entries, KFENCE_STACK_DEPTH, 0); 177 + } else { 178 + num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 1); 179 + skipnr = get_stack_skipnr(stack_entries, num_stack_entries, &type); 180 + } 181 + 182 + /* Require non-NULL meta, except if KFENCE_ERROR_INVALID. */ 183 + if (WARN_ON(type != KFENCE_ERROR_INVALID && !meta)) 184 + return; 185 + 186 + if (meta) 187 + lockdep_assert_held(&meta->lock); 188 + /* 189 + * Because we may generate reports in printk-unfriendly parts of the 190 + * kernel, such as scheduler code, the use of printk() could deadlock. 191 + * Until such time that all printing code here is safe in all parts of 192 + * the kernel, accept the risk, and just get our message out (given the 193 + * system might already behave unpredictably due to the memory error). 194 + * As such, also disable lockdep to hide warnings, and avoid disabling 195 + * lockdep for the rest of the kernel. 196 + */ 197 + lockdep_off(); 198 + 199 + pr_err("==================================================================\n"); 200 + /* Print report header. 
*/ 201 + switch (type) { 202 + case KFENCE_ERROR_OOB: { 203 + const bool left_of_object = address < meta->addr; 204 + 205 + pr_err("BUG: KFENCE: out-of-bounds %s in %pS\n\n", get_access_type(is_write), 206 + (void *)stack_entries[skipnr]); 207 + pr_err("Out-of-bounds %s at 0x%p (%luB %s of kfence-#%zd):\n", 208 + get_access_type(is_write), (void *)address, 209 + left_of_object ? meta->addr - address : address - meta->addr, 210 + left_of_object ? "left" : "right", object_index); 211 + break; 212 + } 213 + case KFENCE_ERROR_UAF: 214 + pr_err("BUG: KFENCE: use-after-free %s in %pS\n\n", get_access_type(is_write), 215 + (void *)stack_entries[skipnr]); 216 + pr_err("Use-after-free %s at 0x%p (in kfence-#%zd):\n", 217 + get_access_type(is_write), (void *)address, object_index); 218 + break; 219 + case KFENCE_ERROR_CORRUPTION: 220 + pr_err("BUG: KFENCE: memory corruption in %pS\n\n", (void *)stack_entries[skipnr]); 221 + pr_err("Corrupted memory at 0x%p ", (void *)address); 222 + print_diff_canary(address, 16, meta); 223 + pr_cont(" (in kfence-#%zd):\n", object_index); 224 + break; 225 + case KFENCE_ERROR_INVALID: 226 + pr_err("BUG: KFENCE: invalid %s in %pS\n\n", get_access_type(is_write), 227 + (void *)stack_entries[skipnr]); 228 + pr_err("Invalid %s at 0x%p:\n", get_access_type(is_write), 229 + (void *)address); 230 + break; 231 + case KFENCE_ERROR_INVALID_FREE: 232 + pr_err("BUG: KFENCE: invalid free in %pS\n\n", (void *)stack_entries[skipnr]); 233 + pr_err("Invalid free of 0x%p (in kfence-#%zd):\n", (void *)address, 234 + object_index); 235 + break; 236 + } 237 + 238 + /* Print stack trace and object info. */ 239 + stack_trace_print(stack_entries + skipnr, num_stack_entries - skipnr, 0); 240 + 241 + if (meta) { 242 + pr_err("\n"); 243 + kfence_print_object(NULL, meta); 244 + } 245 + 246 + /* Print report footer. 
*/ 247 + pr_err("\n"); 248 + if (no_hash_pointers && regs) 249 + show_regs(regs); 250 + else 251 + dump_stack_print_info(KERN_ERR); 252 + trace_error_report_end(ERROR_DETECTOR_KFENCE, address); 253 + pr_err("==================================================================\n"); 254 + 255 + lockdep_on(); 256 + 257 + if (panic_on_warn) 258 + panic("panic_on_warn set ...\n"); 259 + 260 + /* We encountered a memory unsafety error, taint the kernel! */ 261 + add_taint(TAINT_BAD_PAGE, LOCKDEP_STILL_OK); 262 + }
+16 -6
mm/khugepaged.c
··· 442 442 static bool hugepage_vma_check(struct vm_area_struct *vma, 443 443 unsigned long vm_flags) 444 444 { 445 - if ((!(vm_flags & VM_HUGEPAGE) && !khugepaged_always()) || 446 - (vm_flags & VM_NOHUGEPAGE) || 445 + /* Explicitly disabled through madvise. */ 446 + if ((vm_flags & VM_NOHUGEPAGE) || 447 447 test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 448 448 return false; 449 449 450 - if (shmem_file(vma->vm_file) || 451 - (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && 452 - vma->vm_file && 453 - (vm_flags & VM_DENYWRITE))) { 450 + /* Enabled via shmem mount options or sysfs settings. */ 451 + if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) { 454 452 return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, 455 453 HPAGE_PMD_NR); 456 454 } 455 + 456 + /* THP settings require madvise. */ 457 + if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always()) 458 + return false; 459 + 460 + /* Read-only file mappings need to be aligned for THP to work. */ 461 + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && 462 + (vm_flags & VM_DENYWRITE)) { 463 + return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, 464 + HPAGE_PMD_NR); 465 + } 466 + 457 467 if (!vma->anon_vma || vma->vm_ops) 458 468 return false; 459 469 if (vma_is_temporary_stack(vma))
+6
mm/memory-failure.c
··· 1312 1312 */ 1313 1313 put_page(page); 1314 1314 1315 + /* device metadata space is not recoverable */ 1316 + if (!pgmap_pfn_valid(pgmap, pfn)) { 1317 + rc = -ENXIO; 1318 + goto out; 1319 + } 1320 + 1315 1321 /* 1316 1322 * Prevent the inode from being freed while we are interrogating 1317 1323 * the address_space, typically this would be handled by
-4
mm/memory.c
··· 2902 2902 } 2903 2903 flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); 2904 2904 entry = mk_pte(new_page, vma->vm_page_prot); 2905 - entry = pte_sw_mkyoung(entry); 2906 2905 entry = maybe_mkwrite(pte_mkdirty(entry), vma); 2907 2906 2908 2907 /* ··· 3559 3560 __SetPageUptodate(page); 3560 3561 3561 3562 entry = mk_pte(page, vma->vm_page_prot); 3562 - entry = pte_sw_mkyoung(entry); 3563 3563 if (vma->vm_flags & VM_WRITE) 3564 3564 entry = pte_mkwrite(pte_mkdirty(entry)); 3565 3565 ··· 3743 3745 3744 3746 if (prefault && arch_wants_old_prefaulted_pte()) 3745 3747 entry = pte_mkold(entry); 3746 - else 3747 - entry = pte_sw_mkyoung(entry); 3748 3748 3749 3749 if (write) 3750 3750 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+134 -26
mm/memory_hotplug.c
··· 67 67 bool movable_node_enabled = false; 68 68 69 69 #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE 70 - int memhp_default_online_type = MMOP_OFFLINE; 70 + int mhp_default_online_type = MMOP_OFFLINE; 71 71 #else 72 - int memhp_default_online_type = MMOP_ONLINE; 72 + int mhp_default_online_type = MMOP_ONLINE; 73 73 #endif 74 74 75 75 static int __init setup_memhp_default_state(char *str) 76 76 { 77 - const int online_type = memhp_online_type_from_str(str); 77 + const int online_type = mhp_online_type_from_str(str); 78 78 79 79 if (online_type >= 0) 80 - memhp_default_online_type = online_type; 80 + mhp_default_online_type = online_type; 81 81 82 82 return 1; 83 83 } ··· 106 106 107 107 if (strcmp(resource_name, "System RAM")) 108 108 flags |= IORESOURCE_SYSRAM_DRIVER_MANAGED; 109 + 110 + if (!mhp_range_allowed(start, size, true)) 111 + return ERR_PTR(-E2BIG); 109 112 110 113 /* 111 114 * Make sure value parsed from 'mem=' only restricts memory adding ··· 287 284 return 0; 288 285 } 289 286 290 - static int check_hotplug_memory_addressable(unsigned long pfn, 291 - unsigned long nr_pages) 287 + /* 288 + * Return page for the valid pfn only if the page is online. 
All pfn 289 + * walkers which rely on the fully initialized page->flags and others 290 + * should use this rather than pfn_valid && pfn_to_page 291 + */ 292 + struct page *pfn_to_online_page(unsigned long pfn) 292 293 { 293 - const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1; 294 + unsigned long nr = pfn_to_section_nr(pfn); 295 + struct dev_pagemap *pgmap; 296 + struct mem_section *ms; 294 297 295 - if (max_addr >> MAX_PHYSMEM_BITS) { 296 - const u64 max_allowed = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1; 297 - WARN(1, 298 - "Hotplugged memory exceeds maximum addressable address, range=%#llx-%#llx, maximum=%#llx\n", 299 - (u64)PFN_PHYS(pfn), max_addr, max_allowed); 300 - return -E2BIG; 301 - } 298 + if (nr >= NR_MEM_SECTIONS) 299 + return NULL; 302 300 303 - return 0; 301 + ms = __nr_to_section(nr); 302 + if (!online_section(ms)) 303 + return NULL; 304 + 305 + /* 306 + * Save some code text when online_section() + 307 + * pfn_section_valid() are sufficient. 308 + */ 309 + if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn)) 310 + return NULL; 311 + 312 + if (!pfn_section_valid(ms, pfn)) 313 + return NULL; 314 + 315 + if (!online_device_section(ms)) 316 + return pfn_to_page(pfn); 317 + 318 + /* 319 + * Slowpath: when ZONE_DEVICE collides with 320 + * ZONE_{NORMAL,MOVABLE} within the same section some pfns in 321 + * the section may be 'offline' but 'valid'. Only 322 + * get_dev_pagemap() can determine sub-section online status. 323 + */ 324 + pgmap = get_dev_pagemap(pfn, NULL); 325 + put_dev_pagemap(pgmap); 326 + 327 + /* The presence of a pgmap indicates ZONE_DEVICE offline pfn */ 328 + if (pgmap) 329 + return NULL; 330 + 331 + return pfn_to_page(pfn); 304 332 } 333 + EXPORT_SYMBOL_GPL(pfn_to_online_page); 305 334 306 335 /* 307 336 * Reasonably generic function for adding memory. 
It is ··· 352 317 if (WARN_ON_ONCE(!params->pgprot.pgprot)) 353 318 return -EINVAL; 354 319 355 - err = check_hotplug_memory_addressable(pfn, nr_pages); 356 - if (err) 357 - return err; 320 + VM_BUG_ON(!mhp_range_allowed(PFN_PHYS(pfn), nr_pages * PAGE_SIZE, false)); 358 321 359 322 if (altmap) { 360 323 /* ··· 478 445 479 446 for (zone = pgdat->node_zones; 480 447 zone < pgdat->node_zones + MAX_NR_ZONES; zone++) { 481 - unsigned long zone_end_pfn = zone->zone_start_pfn + 482 - zone->spanned_pages; 448 + unsigned long end_pfn = zone_end_pfn(zone); 483 449 484 450 /* No need to lock the zones, they can't change. */ 485 451 if (!zone->spanned_pages) 486 452 continue; 487 453 if (!node_end_pfn) { 488 454 node_start_pfn = zone->zone_start_pfn; 489 - node_end_pfn = zone_end_pfn; 455 + node_end_pfn = end_pfn; 490 456 continue; 491 457 } 492 458 493 - if (zone_end_pfn > node_end_pfn) 494 - node_end_pfn = zone_end_pfn; 459 + if (end_pfn > node_end_pfn) 460 + node_end_pfn = end_pfn; 495 461 if (zone->zone_start_pfn < node_start_pfn) 496 462 node_start_pfn = zone->zone_start_pfn; 497 463 } ··· 710 678 pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn; 711 679 712 680 } 681 + 682 + static void section_taint_zone_device(unsigned long pfn) 683 + { 684 + struct mem_section *ms = __pfn_to_section(pfn); 685 + 686 + ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE; 687 + } 688 + 713 689 /* 714 690 * Associate the pfn range with the given zone, initializing the memmaps 715 691 * and resizing the pgdat/zone data to span the added pages. After this ··· 746 706 zone_span_writeunlock(zone); 747 707 resize_pgdat_range(pgdat, start_pfn, nr_pages); 748 708 pgdat_resize_unlock(pgdat, &flags); 709 + 710 + /* 711 + * Subsection population requires care in pfn_to_online_page(). 712 + * Set the taint to enable the slow path detection of 713 + * ZONE_DEVICE pages in an otherwise ZONE_{NORMAL,MOVABLE} 714 + * section. 
715 + */ 716 + if (zone_is_zone_device(zone)) { 717 + if (!IS_ALIGNED(start_pfn, PAGES_PER_SECTION)) 718 + section_taint_zone_device(start_pfn); 719 + if (!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)) 720 + section_taint_zone_device(start_pfn + nr_pages); 721 + } 749 722 750 723 /* 751 724 * TODO now we have a visible range of pages which are not associated ··· 1060 1007 1061 1008 static int online_memory_block(struct memory_block *mem, void *arg) 1062 1009 { 1063 - mem->online_type = memhp_default_online_type; 1010 + mem->online_type = mhp_default_online_type; 1064 1011 return device_online(&mem->dev); 1065 1012 } 1066 1013 ··· 1137 1084 * In case we're allowed to merge the resource, flag it and trigger 1138 1085 * merging now that adding succeeded. 1139 1086 */ 1140 - if (mhp_flags & MEMHP_MERGE_RESOURCE) 1087 + if (mhp_flags & MHP_MERGE_RESOURCE) 1141 1088 merge_system_ram_resource(res); 1142 1089 1143 1090 /* online pages if requested */ 1144 - if (memhp_default_online_type != MMOP_OFFLINE) 1091 + if (mhp_default_online_type != MMOP_OFFLINE) 1145 1092 walk_memory_blocks(start, size, NULL, online_memory_block); 1146 1093 1147 1094 return ret; ··· 1232 1179 return rc; 1233 1180 } 1234 1181 EXPORT_SYMBOL_GPL(add_memory_driver_managed); 1182 + 1183 + /* 1184 + * Platforms should define arch_get_mappable_range() that provides 1185 + * maximum possible addressable physical memory range for which the 1186 + * linear mapping could be created. The platform returned address 1187 + * range must adhere to these following semantics. 1188 + * 1189 + * - range.start <= range.end 1190 + * - Range includes both end points [range.start..range.end] 1191 + * 1192 + * There is also a fallback definition provided here, allowing the 1193 + * entire possible physical address range in case any platform does 1194 + * not define arch_get_mappable_range(). 
1195 + */ 1196 + struct range __weak arch_get_mappable_range(void) 1197 + { 1198 + struct range mhp_range = { 1199 + .start = 0UL, 1200 + .end = -1ULL, 1201 + }; 1202 + return mhp_range; 1203 + } 1204 + 1205 + struct range mhp_get_pluggable_range(bool need_mapping) 1206 + { 1207 + const u64 max_phys = (1ULL << MAX_PHYSMEM_BITS) - 1; 1208 + struct range mhp_range; 1209 + 1210 + if (need_mapping) { 1211 + mhp_range = arch_get_mappable_range(); 1212 + if (mhp_range.start > max_phys) { 1213 + mhp_range.start = 0; 1214 + mhp_range.end = 0; 1215 + } 1216 + mhp_range.end = min_t(u64, mhp_range.end, max_phys); 1217 + } else { 1218 + mhp_range.start = 0; 1219 + mhp_range.end = max_phys; 1220 + } 1221 + return mhp_range; 1222 + } 1223 + EXPORT_SYMBOL_GPL(mhp_get_pluggable_range); 1224 + 1225 + bool mhp_range_allowed(u64 start, u64 size, bool need_mapping) 1226 + { 1227 + struct range mhp_range = mhp_get_pluggable_range(need_mapping); 1228 + u64 end = start + size; 1229 + 1230 + if (start < end && start >= mhp_range.start && (end - 1) <= mhp_range.end) 1231 + return true; 1232 + 1233 + pr_warn("Hotplug memory [%#llx-%#llx] exceeds maximum addressable range [%#llx-%#llx]\n", 1234 + start, end, mhp_range.start, mhp_range.end); 1235 + return false; 1236 + } 1235 1237 1236 1238 #ifdef CONFIG_MEMORY_HOTREMOVE 1237 1239 /*
+22 -1
mm/memremap.c
··· 80 80 return pfn + vmem_altmap_offset(pgmap_altmap(pgmap)); 81 81 } 82 82 83 + bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn) 84 + { 85 + int i; 86 + 87 + for (i = 0; i < pgmap->nr_range; i++) { 88 + struct range *range = &pgmap->ranges[i]; 89 + 90 + if (pfn >= PHYS_PFN(range->start) && 91 + pfn <= PHYS_PFN(range->end)) 92 + return pfn >= pfn_first(pgmap, i); 93 + } 94 + 95 + return false; 96 + } 97 + 83 98 static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id) 84 99 { 85 100 const struct range *range = &pgmap->ranges[range_id]; ··· 200 185 static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, 201 186 int range_id, int nid) 202 187 { 188 + const bool is_private = pgmap->type == MEMORY_DEVICE_PRIVATE; 203 189 struct range *range = &pgmap->ranges[range_id]; 204 190 struct dev_pagemap *conflict_pgmap; 205 191 int error, is_ram; ··· 246 230 if (error) 247 231 goto err_pfn_remap; 248 232 233 + if (!mhp_range_allowed(range->start, range_len(range), !is_private)) { 234 + error = -EINVAL; 235 + goto err_pfn_remap; 236 + } 237 + 249 238 mem_hotplug_begin(); 250 239 251 240 /* ··· 264 243 * the CPU, we do want the linear mapping and thus use 265 244 * arch_add_memory(). 266 245 */ 267 - if (pgmap->type == MEMORY_DEVICE_PRIVATE) { 246 + if (is_private) { 268 247 error = add_pages(nid, PHYS_PFN(range->start), 269 248 PHYS_PFN(range_len(range)), params); 270 249 } else {
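The new `pgmap_pfn_valid()` is a range-membership test with one subtlety: pfns below `pfn_first()` back the altmap itself and are rejected. A toy model of that check follows; the `altmap_reserve` field is a hypothetical stand-in for `vmem_altmap_offset(pgmap_altmap(pgmap))`, applied only to the first range as the kernel's `pfn_first()` does, and `demo_valid()` exists only so the behaviour can be exercised.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

struct range {
	uint64_t start;
	uint64_t end;	/* inclusive physical addresses */
};

struct dev_pagemap {
	int nr_range;
	struct range ranges[4];
	unsigned long altmap_reserve;	/* stand-in for vmem_altmap_offset() */
};

/* First usable pfn of a range: skips pages reserved for the altmap,
 * which only ever sits at the start of the first range. */
static unsigned long pfn_first(struct dev_pagemap *pgmap, int range_id)
{
	unsigned long pfn = PHYS_PFN(pgmap->ranges[range_id].start);

	if (range_id)
		return pfn;
	return pfn + pgmap->altmap_reserve;
}

static bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
{
	int i;

	for (i = 0; i < pgmap->nr_range; i++) {
		struct range *range = &pgmap->ranges[i];

		if (pfn >= PHYS_PFN(range->start) &&
		    pfn <= PHYS_PFN(range->end))
			return pfn >= pfn_first(pgmap, i);
	}
	return false;
}

/* Helper for exercising the check: one range of pfns 0x100-0x1ff,
 * with the first 4 pages reserved for the altmap. */
static bool demo_valid(unsigned long pfn)
{
	struct dev_pagemap pg = {
		.nr_range = 1,
		.ranges = { { .start = 0x100000, .end = 0x1fffff } },
		.altmap_reserve = 4,
	};
	return pgmap_pfn_valid(&pg, pfn);
}
```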
+1 -1
mm/mlock.c
··· 622 622 623 623 vma = find_vma(mm, start); 624 624 if (vma == NULL) 625 - vma = mm->mmap; 625 + return 0; 626 626 627 627 for (; vma ; vma = vma->vm_next) { 628 628 if (start >= vma->vm_end)
+1
mm/page_alloc.c
··· 2168 2168 } 2169 2169 2170 2170 adjust_managed_page_count(page, pageblock_nr_pages); 2171 + page_zone(page)->cma_pages += pageblock_nr_pages; 2171 2172 } 2172 2173 #endif 2173 2174
+8 -14
mm/rmap.c
··· 168 168 * 169 169 * Anon-vma allocations are very subtle, because we may have 170 170 * optimistically looked up an anon_vma in page_lock_anon_vma_read() 171 - * and that may actually touch the spinlock even in the newly 171 + * and that may actually touch the rwsem even in the newly 172 172 * allocated vma (it depends on RCU to make sure that the 173 173 * anon_vma isn't actually destroyed). 174 174 * ··· 359 359 goto out_error_free_anon_vma; 360 360 361 361 /* 362 - * The root anon_vma's spinlock is the lock actually used when we 362 + * The root anon_vma's rwsem is the lock actually used when we 363 363 * lock any of the anon_vmas in this anon_vma tree. 364 364 */ 365 365 anon_vma->root = pvma->anon_vma->root; ··· 462 462 * Getting a lock on a stable anon_vma from a page off the LRU is tricky! 463 463 * 464 464 * Since there is no serialization what so ever against page_remove_rmap() 465 - * the best this function can do is return a locked anon_vma that might 466 - * have been relevant to this page. 465 + * the best this function can do is return a refcount increased anon_vma 466 + * that might have been relevant to this page. 467 467 * 468 468 * The page might have been remapped to a different anon_vma or the anon_vma 469 469 * returned may already be freed (and even reused). ··· 1086 1086 * be set up correctly at this point. 1087 1087 * 1088 1088 * We have exclusion against page_add_anon_rmap because the caller 1089 - * always holds the page locked, except if called from page_dup_rmap, 1090 - * in which case the page is already known to be setup. 1089 + * always holds the page locked. 
1091 1090 * 1092 1091 * We have exclusion against page_add_new_anon_rmap because those pages 1093 1092 * are initially only visible via the pagetables, and the pte is locked ··· 1736 1737 return vma_is_temporary_stack(vma); 1737 1738 } 1738 1739 1739 - static int page_mapcount_is_zero(struct page *page) 1740 + static int page_not_mapped(struct page *page) 1740 1741 { 1741 - return !total_mapcount(page); 1742 + return !page_mapped(page); 1742 1743 } 1743 1744 1744 1745 /** ··· 1756 1757 struct rmap_walk_control rwc = { 1757 1758 .rmap_one = try_to_unmap_one, 1758 1759 .arg = (void *)flags, 1759 - .done = page_mapcount_is_zero, 1760 + .done = page_not_mapped, 1760 1761 .anon_lock = page_lock_anon_vma_read, 1761 1762 }; 1762 1763 ··· 1779 1780 1780 1781 return !page_mapcount(page) ? true : false; 1781 1782 } 1782 - 1783 - static int page_not_mapped(struct page *page) 1784 - { 1785 - return !page_mapped(page); 1786 - }; 1787 1783 1788 1784 /** 1789 1785 * try_to_munlock - try to munlock a page
+44 -110
mm/shmem.c
··· 842 842 void shmem_unlock_mapping(struct address_space *mapping) 843 843 { 844 844 struct pagevec pvec; 845 - pgoff_t indices[PAGEVEC_SIZE]; 846 845 pgoff_t index = 0; 847 846 848 847 pagevec_init(&pvec); ··· 849 850 * Minor point, but we might as well stop if someone else SHM_LOCKs it. 850 851 */ 851 852 while (!mapping_unevictable(mapping)) { 852 - /* 853 - * Avoid pagevec_lookup(): find_get_pages() returns 0 as if it 854 - * has finished, if it hits a row of PAGEVEC_SIZE swap entries. 855 - */ 856 - pvec.nr = find_get_entries(mapping, index, 857 - PAGEVEC_SIZE, pvec.pages, indices); 858 - if (!pvec.nr) 853 + if (!pagevec_lookup(&pvec, mapping, &index)) 859 854 break; 860 - index = indices[pvec.nr - 1] + 1; 861 - pagevec_remove_exceptionals(&pvec); 862 855 check_move_unevictable_pages(&pvec); 863 856 pagevec_release(&pvec); 864 857 cond_resched(); ··· 907 916 908 917 pagevec_init(&pvec); 909 918 index = start; 910 - while (index < end) { 911 - pvec.nr = find_get_entries(mapping, index, 912 - min(end - index, (pgoff_t)PAGEVEC_SIZE), 913 - pvec.pages, indices); 914 - if (!pvec.nr) 915 - break; 919 + while (index < end && find_lock_entries(mapping, index, end - 1, 920 + &pvec, indices)) { 916 921 for (i = 0; i < pagevec_count(&pvec); i++) { 917 922 struct page *page = pvec.pages[i]; 918 923 919 924 index = indices[i]; 920 - if (index >= end) 921 - break; 922 925 923 926 if (xa_is_value(page)) { 924 927 if (unfalloc) ··· 921 936 index, page); 922 937 continue; 923 938 } 939 + index += thp_nr_pages(page) - 1; 924 940 925 - VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page); 926 - 927 - if (!trylock_page(page)) 928 - continue; 929 - 930 - if ((!unfalloc || !PageUptodate(page)) && 931 - page_mapping(page) == mapping) { 932 - VM_BUG_ON_PAGE(PageWriteback(page), page); 933 - if (shmem_punch_compound(page, start, end)) 934 - truncate_inode_page(mapping, page); 935 - } 941 + if (!unfalloc || !PageUptodate(page)) 942 + truncate_inode_page(mapping, page); 936 943 
unlock_page(page); 937 944 } 938 945 pagevec_remove_exceptionals(&pvec); ··· 965 988 while (index < end) { 966 989 cond_resched(); 967 990 968 - pvec.nr = find_get_entries(mapping, index, 969 - min(end - index, (pgoff_t)PAGEVEC_SIZE), 970 - pvec.pages, indices); 971 - if (!pvec.nr) { 991 + if (!find_get_entries(mapping, index, end - 1, &pvec, 992 + indices)) { 972 993 /* If all gone or hole-punch or unfalloc, we're done */ 973 994 if (index == start || end != -1) 974 995 break; ··· 978 1003 struct page *page = pvec.pages[i]; 979 1004 980 1005 index = indices[i]; 981 - if (index >= end) 982 - break; 983 - 984 1006 if (xa_is_value(page)) { 985 1007 if (unfalloc) 986 1008 continue; ··· 1505 1533 return page; 1506 1534 } 1507 1535 1536 + /* 1537 + * Make sure huge_gfp is always more limited than limit_gfp. 1538 + * Some of the flags set permissions, while others set limitations. 1539 + */ 1540 + static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp) 1541 + { 1542 + gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM; 1543 + gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY; 1544 + gfp_t zoneflags = limit_gfp & GFP_ZONEMASK; 1545 + gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK); 1546 + 1547 + /* Allow allocations only from the originally specified zones. */ 1548 + result |= zoneflags; 1549 + 1550 + /* 1551 + * Minimize the result gfp by taking the union with the deny flags, 1552 + * and the intersection of the allow flags. 
1553 + */ 1554 + result |= (limit_gfp & denyflags); 1555 + result |= (huge_gfp & limit_gfp) & allowflags; 1556 + 1557 + return result; 1558 + } 1559 + 1508 1560 static struct page *shmem_alloc_hugepage(gfp_t gfp, 1509 1561 struct shmem_inode_info *info, pgoff_t index) 1510 1562 { ··· 1543 1547 return NULL; 1544 1548 1545 1549 shmem_pseudo_vma_init(&pvma, info, hindex); 1546 - page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN, 1547 - HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true); 1550 + page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), 1551 + true); 1548 1552 shmem_pseudo_vma_destroy(&pvma); 1549 1553 if (page) 1550 1554 prep_transhuge_page(page); ··· 1800 1804 struct page *page; 1801 1805 enum sgp_type sgp_huge = sgp; 1802 1806 pgoff_t hindex = index; 1807 + gfp_t huge_gfp; 1803 1808 int error; 1804 1809 int once = 0; 1805 1810 int alloced = 0; ··· 1818 1821 sbinfo = SHMEM_SB(inode->i_sb); 1819 1822 charge_mm = vma ? vma->vm_mm : current->mm; 1820 1823 1821 - page = find_lock_entry(mapping, index); 1824 + page = pagecache_get_page(mapping, index, 1825 + FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); 1822 1826 if (xa_is_value(page)) { 1823 1827 error = shmem_swapin_page(inode, index, &page, 1824 1828 sgp, gfp, vma, fault_type); ··· 1887 1889 } 1888 1890 1889 1891 alloc_huge: 1890 - page = shmem_alloc_and_acct_page(gfp, inode, index, true); 1892 + huge_gfp = vma_thp_gfp_mask(vma); 1893 + huge_gfp = limit_gfp_mask(huge_gfp, gfp); 1894 + page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true); 1891 1895 if (IS_ERR(page)) { 1892 1896 alloc_nohuge: 1893 1897 page = shmem_alloc_and_acct_page(gfp, inode, ··· 2676 2676 return retval ? retval : error; 2677 2677 } 2678 2678 2679 - /* 2680 - * llseek SEEK_DATA or SEEK_HOLE through the page cache. 
2681 - */ 2682 - static pgoff_t shmem_seek_hole_data(struct address_space *mapping, 2683 - pgoff_t index, pgoff_t end, int whence) 2684 - { 2685 - struct page *page; 2686 - struct pagevec pvec; 2687 - pgoff_t indices[PAGEVEC_SIZE]; 2688 - bool done = false; 2689 - int i; 2690 - 2691 - pagevec_init(&pvec); 2692 - pvec.nr = 1; /* start small: we may be there already */ 2693 - while (!done) { 2694 - pvec.nr = find_get_entries(mapping, index, 2695 - pvec.nr, pvec.pages, indices); 2696 - if (!pvec.nr) { 2697 - if (whence == SEEK_DATA) 2698 - index = end; 2699 - break; 2700 - } 2701 - for (i = 0; i < pvec.nr; i++, index++) { 2702 - if (index < indices[i]) { 2703 - if (whence == SEEK_HOLE) { 2704 - done = true; 2705 - break; 2706 - } 2707 - index = indices[i]; 2708 - } 2709 - page = pvec.pages[i]; 2710 - if (page && !xa_is_value(page)) { 2711 - if (!PageUptodate(page)) 2712 - page = NULL; 2713 - } 2714 - if (index >= end || 2715 - (page && whence == SEEK_DATA) || 2716 - (!page && whence == SEEK_HOLE)) { 2717 - done = true; 2718 - break; 2719 - } 2720 - } 2721 - pagevec_remove_exceptionals(&pvec); 2722 - pagevec_release(&pvec); 2723 - pvec.nr = PAGEVEC_SIZE; 2724 - cond_resched(); 2725 - } 2726 - return index; 2727 - } 2728 - 2729 2679 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) 2730 2680 { 2731 2681 struct address_space *mapping = file->f_mapping; 2732 2682 struct inode *inode = mapping->host; 2733 - pgoff_t start, end; 2734 - loff_t new_offset; 2735 2683 2736 2684 if (whence != SEEK_DATA && whence != SEEK_HOLE) 2737 2685 return generic_file_llseek_size(file, offset, whence, 2738 2686 MAX_LFS_FILESIZE, i_size_read(inode)); 2687 + if (offset < 0) 2688 + return -ENXIO; 2689 + 2739 2690 inode_lock(inode); 2740 2691 /* We're holding i_mutex so we can access i_size directly */ 2741 - 2742 - if (offset < 0 || offset >= inode->i_size) 2743 - offset = -ENXIO; 2744 - else { 2745 - start = offset >> PAGE_SHIFT; 2746 - end = (inode->i_size + 
PAGE_SIZE - 1) >> PAGE_SHIFT; 2747 - new_offset = shmem_seek_hole_data(mapping, start, end, whence); 2748 - new_offset <<= PAGE_SHIFT; 2749 - if (new_offset > offset) { 2750 - if (new_offset < inode->i_size) 2751 - offset = new_offset; 2752 - else if (whence == SEEK_DATA) 2753 - offset = -ENXIO; 2754 - else 2755 - offset = inode->i_size; 2756 - } 2757 - } 2758 - 2692 + offset = mapping_seek_hole_data(mapping, offset, inode->i_size, whence); 2759 2693 if (offset >= 0) 2760 2694 offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); 2761 2695 inode_unlock(inode);
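The `limit_gfp_mask()` helper added to shmem.c combines the two masks class by class: zone bits come only from `limit_gfp`, deny flags are unioned, allow flags are intersected. A userspace mirror with made-up bit values (the real ones live in `<linux/gfp.h>`) demonstrates the invariant that the result is never more permissive than `limit_gfp`:

```c
#include <stdint.h>

typedef uint32_t gfp_t;

/* Stand-in bit values; only their disjointness matters here. */
#define __GFP_IO	0x01u
#define __GFP_FS	0x02u
#define __GFP_RECLAIM	0x04u
#define __GFP_NOWARN	0x08u
#define __GFP_NORETRY	0x10u
#define GFP_ZONEMASK	0xe0u	/* pretend the zone bits are bits 5..7 */

static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
{
	gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
	gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
	gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
	gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);

	/* Allocations may only come from the originally specified zones. */
	result |= zoneflags;

	/* Union of the deny flags, intersection of the allow flags. */
	result |= (limit_gfp & denyflags);
	result |= (huge_gfp & limit_gfp) & allowflags;

	return result;
}
```

With these stand-in bits: an allow flag survives only if both masks carry it, a deny flag in either mask ends up set, and the zone bits are taken verbatim from the limiting mask.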
+29 -9
mm/slab.c
··· 100 100 #include <linux/seq_file.h> 101 101 #include <linux/notifier.h> 102 102 #include <linux/kallsyms.h> 103 + #include <linux/kfence.h> 103 104 #include <linux/cpu.h> 104 105 #include <linux/sysctl.h> 105 106 #include <linux/module.h> ··· 3209 3208 } 3210 3209 3211 3210 static __always_inline void * 3212 - slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, 3211 + slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size, 3213 3212 unsigned long caller) 3214 3213 { 3215 3214 unsigned long save_flags; ··· 3221 3220 cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); 3222 3221 if (unlikely(!cachep)) 3223 3222 return NULL; 3223 + 3224 + ptr = kfence_alloc(cachep, orig_size, flags); 3225 + if (unlikely(ptr)) 3226 + goto out_hooks; 3224 3227 3225 3228 cache_alloc_debugcheck_before(cachep, flags); 3226 3229 local_irq_save(save_flags); ··· 3258 3253 if (unlikely(slab_want_init_on_alloc(flags, cachep)) && ptr) 3259 3254 memset(ptr, 0, cachep->object_size); 3260 3255 3256 + out_hooks: 3261 3257 slab_post_alloc_hook(cachep, objcg, flags, 1, &ptr); 3262 3258 return ptr; 3263 3259 } ··· 3296 3290 #endif /* CONFIG_NUMA */ 3297 3291 3298 3292 static __always_inline void * 3299 - slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) 3293 + slab_alloc(struct kmem_cache *cachep, gfp_t flags, size_t orig_size, unsigned long caller) 3300 3294 { 3301 3295 unsigned long save_flags; 3302 3296 void *objp; ··· 3306 3300 cachep = slab_pre_alloc_hook(cachep, &objcg, 1, flags); 3307 3301 if (unlikely(!cachep)) 3308 3302 return NULL; 3303 + 3304 + objp = kfence_alloc(cachep, orig_size, flags); 3305 + if (unlikely(objp)) 3306 + goto out; 3309 3307 3310 3308 cache_alloc_debugcheck_before(cachep, flags); 3311 3309 local_irq_save(save_flags); ··· 3321 3311 if (unlikely(slab_want_init_on_alloc(flags, cachep)) && objp) 3322 3312 memset(objp, 0, cachep->object_size); 3323 3313 3314 + out: 3324 3315 slab_post_alloc_hook(cachep, 
objcg, flags, 1, &objp); 3325 3316 return objp; 3326 3317 } ··· 3427 3416 static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp, 3428 3417 unsigned long caller) 3429 3418 { 3419 + if (is_kfence_address(objp)) { 3420 + kmemleak_free_recursive(objp, cachep->flags); 3421 + __kfence_free(objp); 3422 + return; 3423 + } 3424 + 3430 3425 if (unlikely(slab_want_init_on_free(cachep))) 3431 3426 memset(objp, 0, cachep->object_size); 3432 3427 ··· 3499 3482 */ 3500 3483 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) 3501 3484 { 3502 - void *ret = slab_alloc(cachep, flags, _RET_IP_); 3485 + void *ret = slab_alloc(cachep, flags, cachep->object_size, _RET_IP_); 3503 3486 3504 3487 trace_kmem_cache_alloc(_RET_IP_, ret, 3505 3488 cachep->object_size, cachep->size, flags); ··· 3532 3515 3533 3516 local_irq_disable(); 3534 3517 for (i = 0; i < size; i++) { 3535 - void *objp = __do_cache_alloc(s, flags); 3518 + void *objp = kfence_alloc(s, s->object_size, flags) ?: __do_cache_alloc(s, flags); 3536 3519 3537 3520 if (unlikely(!objp)) 3538 3521 goto error; ··· 3565 3548 { 3566 3549 void *ret; 3567 3550 3568 - ret = slab_alloc(cachep, flags, _RET_IP_); 3551 + ret = slab_alloc(cachep, flags, size, _RET_IP_); 3569 3552 3570 3553 ret = kasan_kmalloc(cachep, ret, size, flags); 3571 3554 trace_kmalloc(_RET_IP_, ret, ··· 3591 3574 */ 3592 3575 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid) 3593 3576 { 3594 - void *ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_); 3577 + void *ret = slab_alloc_node(cachep, flags, nodeid, cachep->object_size, _RET_IP_); 3595 3578 3596 3579 trace_kmem_cache_alloc_node(_RET_IP_, ret, 3597 3580 cachep->object_size, cachep->size, ··· 3609 3592 { 3610 3593 void *ret; 3611 3594 3612 - ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_); 3595 + ret = slab_alloc_node(cachep, flags, nodeid, size, _RET_IP_); 3613 3596 3614 3597 ret = kasan_kmalloc(cachep, ret, size, flags); 3615 3598 
trace_kmalloc_node(_RET_IP_, ret, ··· 3690 3673 cachep = kmalloc_slab(size, flags); 3691 3674 if (unlikely(ZERO_OR_NULL_PTR(cachep))) 3692 3675 return cachep; 3693 - ret = slab_alloc(cachep, flags, caller); 3676 + ret = slab_alloc(cachep, flags, size, caller); 3694 3677 3695 3678 ret = kasan_kmalloc(cachep, ret, size, flags); 3696 3679 trace_kmalloc(caller, ret, ··· 4189 4172 BUG_ON(objnr >= cachep->num); 4190 4173 4191 4174 /* Find offset within object. */ 4192 - offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep); 4175 + if (is_kfence_address(ptr)) 4176 + offset = ptr - kfence_object_start(ptr); 4177 + else 4178 + offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep); 4193 4179 4194 4180 /* Allow address range falling entirely within usercopy region. */ 4195 4181 if (offset >= cachep->useroffset &&
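Every KFENCE hook added to the slab.c allocation paths has the same shape: try the sampled allocator first, and fall through to the regular fast path when it declines (the overwhelmingly common case). The sketch below uses a toy every-Nth sampler purely for determinism; the real KFENCE gates allocations on a timer-based sample interval, not an allocation count.

```c
#include <stddef.h>
#include <stdlib.h>

#define SAMPLE_EVERY 100	/* toy policy, not KFENCE's timer gate */

static unsigned long alloc_count;
static unsigned long sampled_count;

/* Stand-in for kfence_alloc(): services 1 in SAMPLE_EVERY requests
 * from a "guarded" pool, returns NULL for all the rest. */
static void *toy_kfence_alloc(size_t size)
{
	if (++alloc_count % SAMPLE_EVERY)
		return NULL;		/* common case: not sampled */
	sampled_count++;
	return malloc(size);		/* stand-in for the KFENCE pool */
}

/* The shape of the patched slab_alloc(): sampled allocator first,
 * normal fast path as the fallback. */
static void *toy_slab_alloc(size_t size)
{
	void *p = toy_kfence_alloc(size);

	if (p)				/* rare: KFENCE serviced it */
		return p;
	return malloc(size);		/* regular slab fast path */
}
```

Because the sampler almost always returns NULL immediately, the fast path stays fast; that is what lets KFENCE run in production builds.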
+19 -4
mm/slab_common.c
··· 12 12 #include <linux/memory.h> 13 13 #include <linux/cache.h> 14 14 #include <linux/compiler.h> 15 + #include <linux/kfence.h> 15 16 #include <linux/module.h> 16 17 #include <linux/cpu.h> 17 18 #include <linux/uaccess.h> ··· 431 430 rcu_barrier(); 432 431 433 432 list_for_each_entry_safe(s, s2, &to_destroy, list) { 433 + kfence_shutdown_cache(s); 434 434 #ifdef SLAB_SUPPORTS_SYSFS 435 435 sysfs_slab_release(s); 436 436 #else ··· 457 455 list_add_tail(&s->list, &slab_caches_to_rcu_destroy); 458 456 schedule_work(&slab_caches_to_rcu_destroy_work); 459 457 } else { 458 + kfence_shutdown_cache(s); 460 459 #ifdef SLAB_SUPPORTS_SYSFS 461 460 sysfs_slab_unlink(s); 462 461 sysfs_slab_release(s); ··· 643 640 panic("Out of memory when creating slab %s\n", name); 644 641 645 642 create_boot_cache(s, name, size, flags, useroffset, usersize); 643 + kasan_cache_create_kmalloc(s); 646 644 list_add(&s->list, &slab_caches); 647 645 s->refcount = 1; 648 646 return s; ··· 1136 1132 void *ret; 1137 1133 size_t ks; 1138 1134 1139 - ks = ksize(p); 1135 + /* Don't use instrumented ksize to allow precise KASAN poisoning. */ 1136 + if (likely(!ZERO_OR_NULL_PTR(p))) { 1137 + if (!kasan_check_byte(p)) 1138 + return NULL; 1139 + ks = kfence_ksize(p) ?: __ksize(p); 1140 + } else 1141 + ks = 0; 1140 1142 1143 + /* If the object still fits, repoison it precisely. */ 1141 1144 if (ks >= new_size) { 1142 1145 p = kasan_krealloc((void *)p, new_size, flags); 1143 1146 return (void *)p; 1144 1147 } 1145 1148 1146 1149 ret = kmalloc_track_caller(new_size, flags); 1147 - if (ret && p) 1148 - memcpy(ret, p, ks); 1150 + if (ret && p) { 1151 + /* Disable KASAN checks as the object's redzone is accessed. 
*/ 1152 + kasan_disable_current(); 1153 + memcpy(ret, kasan_reset_tag(p), ks); 1154 + kasan_enable_current(); 1155 + } 1149 1156 1150 1157 return ret; 1151 1158 } ··· 1250 1235 if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp)) 1251 1236 return 0; 1252 1237 1253 - size = __ksize(objp); 1238 + size = kfence_ksize(objp) ?: __ksize(objp); 1254 1239 /* 1255 1240 * We assume that ksize callers could use whole allocated area, 1256 1241 * so we need to unpoison this area.
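The reworked `krealloc()` path above keeps the object in place whenever the allocation's usable size still covers the new size, and only otherwise allocates and copies (with KASAN checks suspended around the `memcpy()`, since the copy may read the redzone). A toy version of that decision, with a hypothetical 16-byte size class standing in for `ksize()` and a `moved` out-parameter added purely for observability:

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Stand-in for ksize(): pretend every allocation is rounded up to a
 * 16-byte size class, so that is its usable size. */
static size_t toy_usable_size(size_t requested)
{
	return (requested + 15) & ~(size_t)15;
}

/* Mirror of the krealloc() shape: reuse in place when the new size
 * still fits in the usable size, otherwise move the object. */
static void *toy_krealloc(void *p, size_t old_size, size_t new_size,
			  int *moved)
{
	size_t ks = p ? toy_usable_size(old_size) : 0;
	void *ret;

	*moved = 0;
	if (ks >= new_size)
		return p;	/* kernel would re-poison precisely here */

	ret = malloc(new_size);
	if (ret && p) {
		memcpy(ret, p, ks);	/* KASAN disabled around this copy */
		free(p);
	}
	*moved = 1;
	return ret;
}
```

Growing within the same size class is free; only crossing a class boundary costs an allocation and a copy.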
+47 -16
mm/slub.c
··· 27 27 #include <linux/ctype.h> 28 28 #include <linux/debugobjects.h> 29 29 #include <linux/kallsyms.h> 30 + #include <linux/kfence.h> 30 31 #include <linux/memory.h> 31 32 #include <linux/math64.h> 32 33 #include <linux/fault-inject.h> ··· 1571 1570 void *old_tail = *tail ? *tail : *head; 1572 1571 int rsize; 1573 1572 1573 + if (is_kfence_address(next)) { 1574 + slab_free_hook(s, next); 1575 + return true; 1576 + } 1577 + 1574 1578 /* Head and tail of the reconstructed freelist */ 1575 1579 *head = NULL; 1576 1580 *tail = NULL; ··· 2815 2809 * Otherwise we can simply pick the next object from the lockless free list. 2816 2810 */ 2817 2811 static __always_inline void *slab_alloc_node(struct kmem_cache *s, 2818 - gfp_t gfpflags, int node, unsigned long addr) 2812 + gfp_t gfpflags, int node, unsigned long addr, size_t orig_size) 2819 2813 { 2820 2814 void *object; 2821 2815 struct kmem_cache_cpu *c; ··· 2826 2820 s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags); 2827 2821 if (!s) 2828 2822 return NULL; 2823 + 2824 + object = kfence_alloc(s, orig_size, gfpflags); 2825 + if (unlikely(object)) 2826 + goto out; 2827 + 2829 2828 redo: 2830 2829 /* 2831 2830 * Must read kmem_cache cpu data via this cpu ptr. 
Preemption is ··· 2903 2892 if (unlikely(slab_want_init_on_alloc(gfpflags, s)) && object) 2904 2893 memset(kasan_reset_tag(object), 0, s->object_size); 2905 2894 2895 + out: 2906 2896 slab_post_alloc_hook(s, objcg, gfpflags, 1, &object); 2907 2897 2908 2898 return object; 2909 2899 } 2910 2900 2911 2901 static __always_inline void *slab_alloc(struct kmem_cache *s, 2912 - gfp_t gfpflags, unsigned long addr) 2902 + gfp_t gfpflags, unsigned long addr, size_t orig_size) 2913 2903 { 2914 - return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr); 2904 + return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr, orig_size); 2915 2905 } 2916 2906 2917 2907 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) 2918 2908 { 2919 - void *ret = slab_alloc(s, gfpflags, _RET_IP_); 2909 + void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size); 2920 2910 2921 2911 trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size, 2922 2912 s->size, gfpflags); ··· 2929 2917 #ifdef CONFIG_TRACING 2930 2918 void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size) 2931 2919 { 2932 - void *ret = slab_alloc(s, gfpflags, _RET_IP_); 2920 + void *ret = slab_alloc(s, gfpflags, _RET_IP_, size); 2933 2921 trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags); 2934 2922 ret = kasan_kmalloc(s, ret, size, gfpflags); 2935 2923 return ret; ··· 2940 2928 #ifdef CONFIG_NUMA 2941 2929 void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node) 2942 2930 { 2943 - void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_); 2931 + void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, s->object_size); 2944 2932 2945 2933 trace_kmem_cache_alloc_node(_RET_IP_, ret, 2946 2934 s->object_size, s->size, gfpflags, node); ··· 2954 2942 gfp_t gfpflags, 2955 2943 int node, size_t size) 2956 2944 { 2957 - void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_); 2945 + void *ret = slab_alloc_node(s, gfpflags, node, _RET_IP_, size); 2958 2946 2959 2947 
trace_kmalloc_node(_RET_IP_, ret, 2960 2948 size, s->size, gfpflags, node); ··· 2987 2975 unsigned long flags; 2988 2976 2989 2977 stat(s, FREE_SLOWPATH); 2978 + 2979 + if (kfence_free(head)) 2980 + return; 2990 2981 2991 2982 if (kmem_cache_debug(s) && 2992 2983 !free_debug_processing(s, page, head, tail, cnt, addr)) ··· 3235 3220 df->s = cache_from_obj(s, object); /* Support for memcg */ 3236 3221 } 3237 3222 3223 + if (is_kfence_address(object)) { 3224 + slab_free_hook(df->s, object); 3225 + __kfence_free(object); 3226 + p[size] = NULL; /* mark object processed */ 3227 + return size; 3228 + } 3229 + 3238 3230 /* Start new detached freelist */ 3239 3231 df->page = page; 3240 3232 set_freepointer(df->s, object, NULL); ··· 3317 3295 c = this_cpu_ptr(s->cpu_slab); 3318 3296 3319 3297 for (i = 0; i < size; i++) { 3320 - void *object = c->freelist; 3298 + void *object = kfence_alloc(s, s->object_size, flags); 3321 3299 3300 + if (unlikely(object)) { 3301 + p[i] = object; 3302 + continue; 3303 + } 3304 + 3305 + object = c->freelist; 3322 3306 if (unlikely(!object)) { 3323 3307 /* 3324 3308 * We may have removed an object from c->freelist using ··· 3579 3551 init_object(kmem_cache_node, n, SLUB_RED_ACTIVE); 3580 3552 init_tracking(kmem_cache_node, n); 3581 3553 #endif 3582 - n = kasan_kmalloc(kmem_cache_node, n, sizeof(struct kmem_cache_node), 3583 - GFP_KERNEL); 3554 + n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL); 3584 3555 page->freelist = get_freepointer(kmem_cache_node, n); 3585 3556 page->inuse = 1; 3586 3557 page->frozen = 0; ··· 4048 4021 if (unlikely(ZERO_OR_NULL_PTR(s))) 4049 4022 return s; 4050 4023 4051 - ret = slab_alloc(s, flags, _RET_IP_); 4024 + ret = slab_alloc(s, flags, _RET_IP_, size); 4052 4025 4053 4026 trace_kmalloc(_RET_IP_, ret, size, s->size, flags); 4054 4027 ··· 4096 4069 if (unlikely(ZERO_OR_NULL_PTR(s))) 4097 4070 return s; 4098 4071 4099 - ret = slab_alloc_node(s, flags, node, _RET_IP_); 4072 + ret = slab_alloc_node(s, flags, node, 
_RET_IP_, size); 4100 4073 4101 4074 trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node); 4102 4075 ··· 4122 4095 struct kmem_cache *s; 4123 4096 unsigned int offset; 4124 4097 size_t object_size; 4098 + bool is_kfence = is_kfence_address(ptr); 4125 4099 4126 4100 ptr = kasan_reset_tag(ptr); 4127 4101 ··· 4135 4107 to_user, 0, n); 4136 4108 4137 4109 /* Find offset within object. */ 4138 - offset = (ptr - page_address(page)) % s->size; 4110 + if (is_kfence) 4111 + offset = ptr - kfence_object_start(ptr); 4112 + else 4113 + offset = (ptr - page_address(page)) % s->size; 4139 4114 4140 4115 /* Adjust for redzone and reject if within the redzone. */ 4141 - if (kmem_cache_debug_flags(s, SLAB_RED_ZONE)) { 4116 + if (!is_kfence && kmem_cache_debug_flags(s, SLAB_RED_ZONE)) { 4142 4117 if (offset < s->red_left_pad) 4143 4118 usercopy_abort("SLUB object in left red zone", 4144 4119 s->name, to_user, offset, n); ··· 4558 4527 if (unlikely(ZERO_OR_NULL_PTR(s))) 4559 4528 return s; 4560 4529 4561 - ret = slab_alloc(s, gfpflags, caller); 4530 + ret = slab_alloc(s, gfpflags, caller, size); 4562 4531 4563 4532 /* Honor the call site pointer we received. */ 4564 4533 trace_kmalloc(caller, ret, size, s->size, gfpflags); ··· 4589 4558 if (unlikely(ZERO_OR_NULL_PTR(s))) 4590 4559 return s; 4591 4560 4592 - ret = slab_alloc_node(s, gfpflags, node, caller); 4561 + ret = slab_alloc_node(s, gfpflags, node, caller, size); 4593 4562 4594 4563 /* Honor the call site pointer we received. */ 4595 4564 trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);
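The hardened-usercopy change in slub.c computes the intra-object offset two ways: relative to `kfence_object_start()` for KFENCE objects, and `(ptr - page_address(page)) % s->size` for regular slab pages. The non-KFENCE arithmetic can be modelled directly with a toy cache layout; `toy_cache`, `base`, and `usercopy_ok` are illustrative names, not kernel API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of a SLUB page: objects of 'size' bytes packed back to
 * back from 'base', each starting with 'red_left_pad' redzone bytes. */
struct toy_cache {
	size_t size;		/* stride of one object, incl. metadata */
	size_t object_size;	/* usable payload bytes */
	size_t red_left_pad;	/* left red zone (debug builds) */
};

/* Returns true when [ptr, ptr + n) stays inside one object's payload. */
static bool usercopy_ok(const struct toy_cache *s, const char *base,
			const char *ptr, size_t n)
{
	size_t offset = (size_t)(ptr - base) % s->size;

	/* Reject copies starting in the left red zone. */
	if (offset < s->red_left_pad)
		return false;
	offset -= s->red_left_pad;

	return offset + n <= s->object_size;
}
```

A KFENCE object sits alone in its own page, so the modulo trick would compute garbage there; that is why the patch branches on `is_kfence_address()` before doing this arithmetic.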
+2 -36
mm/swap.c
··· 1018 1018 } 1019 1019 1020 1020 /** 1021 - * pagevec_lookup_entries - gang pagecache lookup 1022 - * @pvec: Where the resulting entries are placed 1023 - * @mapping: The address_space to search 1024 - * @start: The starting entry index 1025 - * @nr_entries: The maximum number of pages 1026 - * @indices: The cache indices corresponding to the entries in @pvec 1027 - * 1028 - * pagevec_lookup_entries() will search for and return a group of up 1029 - * to @nr_pages pages and shadow entries in the mapping. All 1030 - * entries are placed in @pvec. pagevec_lookup_entries() takes a 1031 - * reference against actual pages in @pvec. 1032 - * 1033 - * The search returns a group of mapping-contiguous entries with 1034 - * ascending indexes. There may be holes in the indices due to 1035 - * not-present entries. 1036 - * 1037 - * Only one subpage of a Transparent Huge Page is returned in one call: 1038 - * allowing truncate_inode_pages_range() to evict the whole THP without 1039 - * cycling through a pagevec of extra references. 1040 - * 1041 - * pagevec_lookup_entries() returns the number of entries which were 1042 - * found. 1043 - */ 1044 - unsigned pagevec_lookup_entries(struct pagevec *pvec, 1045 - struct address_space *mapping, 1046 - pgoff_t start, unsigned nr_entries, 1047 - pgoff_t *indices) 1048 - { 1049 - pvec->nr = find_get_entries(mapping, start, nr_entries, 1050 - pvec->pages, indices); 1051 - return pagevec_count(pvec); 1052 - } 1053 - 1054 - /** 1055 1021 * pagevec_remove_exceptionals - pagevec exceptionals pruning 1056 1022 * @pvec: The pagevec to prune 1057 1023 * 1058 - * pagevec_lookup_entries() fills both pages and exceptional radix 1059 - * tree entries into the pagevec. This function prunes all 1024 + * find_get_entries() fills both pages and XArray value entries (aka 1025 + * exceptional entries) into the pagevec. 
This function prunes all 1060 1026 * exceptionals from @pvec without leaving holes, so that it can be 1061 1027 * passed on to page-only pagevec operations. 1062 1028 */
+3 -4
mm/swap_state.c
··· 87 87 pgoff_t idx = swp_offset(entry); 88 88 struct page *page; 89 89 90 - page = find_get_entry(address_space, idx); 90 + page = xa_load(&address_space->i_pages, idx); 91 91 if (xa_is_value(page)) 92 92 return page; 93 - if (page) 94 - put_page(page); 95 93 return NULL; 96 94 } 97 95 ··· 403 405 { 404 406 swp_entry_t swp; 405 407 struct swap_info_struct *si; 406 - struct page *page = find_get_entry(mapping, index); 408 + struct page *page = pagecache_get_page(mapping, index, 409 + FGP_ENTRY | FGP_HEAD, 0); 407 410 408 411 if (!page) 409 412 return page;
+20 -111
mm/truncate.c
··· 57 57 * exceptional entries similar to what pagevec_remove_exceptionals does. 58 58 */ 59 59 static void truncate_exceptional_pvec_entries(struct address_space *mapping, 60 - struct pagevec *pvec, pgoff_t *indices, 61 - pgoff_t end) 60 + struct pagevec *pvec, pgoff_t *indices) 62 61 { 63 62 int i, j; 64 - bool dax, lock; 63 + bool dax; 65 64 66 65 /* Handled by shmem itself */ 67 66 if (shmem_mapping(mapping)) ··· 74 75 return; 75 76 76 77 dax = dax_mapping(mapping); 77 - lock = !dax && indices[j] < end; 78 - if (lock) 78 + if (!dax) 79 79 xa_lock_irq(&mapping->i_pages); 80 80 81 81 for (i = j; i < pagevec_count(pvec); i++) { ··· 86 88 continue; 87 89 } 88 90 89 - if (index >= end) 90 - continue; 91 - 92 91 if (unlikely(dax)) { 93 92 dax_delete_mapping_entry(mapping, index); 94 93 continue; ··· 94 99 __clear_shadow_entry(mapping, index, page); 95 100 } 96 101 97 - if (lock) 102 + if (!dax) 98 103 xa_unlock_irq(&mapping->i_pages); 99 104 pvec->nr = j; 100 105 } ··· 321 326 322 327 pagevec_init(&pvec); 323 328 index = start; 324 - while (index < end && pagevec_lookup_entries(&pvec, mapping, index, 325 - min(end - index, (pgoff_t)PAGEVEC_SIZE), 326 - indices)) { 327 - /* 328 - * Pagevec array has exceptional entries and we may also fail 329 - * to lock some pages. So we store pages that can be deleted 330 - * in a new pagevec. 
331 - */ 332 - struct pagevec locked_pvec; 333 - 334 - pagevec_init(&locked_pvec); 335 - for (i = 0; i < pagevec_count(&pvec); i++) { 336 - struct page *page = pvec.pages[i]; 337 - 338 - /* We rely upon deletion not changing page->index */ 339 - index = indices[i]; 340 - if (index >= end) 341 - break; 342 - 343 - if (xa_is_value(page)) 344 - continue; 345 - 346 - if (!trylock_page(page)) 347 - continue; 348 - WARN_ON(page_to_index(page) != index); 349 - if (PageWriteback(page)) { 350 - unlock_page(page); 351 - continue; 352 - } 353 - if (page->mapping != mapping) { 354 - unlock_page(page); 355 - continue; 356 - } 357 - pagevec_add(&locked_pvec, page); 358 - } 359 - for (i = 0; i < pagevec_count(&locked_pvec); i++) 360 - truncate_cleanup_page(mapping, locked_pvec.pages[i]); 361 - delete_from_page_cache_batch(mapping, &locked_pvec); 362 - for (i = 0; i < pagevec_count(&locked_pvec); i++) 363 - unlock_page(locked_pvec.pages[i]); 364 - truncate_exceptional_pvec_entries(mapping, &pvec, indices, end); 329 + while (index < end && find_lock_entries(mapping, index, end - 1, 330 + &pvec, indices)) { 331 + index = indices[pagevec_count(&pvec) - 1] + 1; 332 + truncate_exceptional_pvec_entries(mapping, &pvec, indices); 333 + for (i = 0; i < pagevec_count(&pvec); i++) 334 + truncate_cleanup_page(mapping, pvec.pages[i]); 335 + delete_from_page_cache_batch(mapping, &pvec); 336 + for (i = 0; i < pagevec_count(&pvec); i++) 337 + unlock_page(pvec.pages[i]); 365 338 pagevec_release(&pvec); 366 339 cond_resched(); 367 - index++; 368 340 } 341 + 369 342 if (partial_start) { 370 343 struct page *page = find_lock_page(mapping, start - 1); 371 344 if (page) { ··· 376 413 index = start; 377 414 for ( ; ; ) { 378 415 cond_resched(); 379 - if (!pagevec_lookup_entries(&pvec, mapping, index, 380 - min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) { 416 + if (!find_get_entries(mapping, index, end - 1, &pvec, 417 + indices)) { 381 418 /* If all gone from start onwards, we're done */ 382 419 if 
(index == start) 383 420 break; ··· 385 422 index = start; 386 423 continue; 387 424 } 388 - if (index == start && indices[0] >= end) { 389 - /* All gone out of hole to be punched, we're done */ 390 - pagevec_remove_exceptionals(&pvec); 391 - pagevec_release(&pvec); 392 - break; 393 - } 394 425 395 426 for (i = 0; i < pagevec_count(&pvec); i++) { 396 427 struct page *page = pvec.pages[i]; 397 428 398 429 /* We rely upon deletion not changing page->index */ 399 430 index = indices[i]; 400 - if (index >= end) { 401 - /* Restart punch to make sure all gone */ 402 - index = start - 1; 403 - break; 404 - } 405 431 406 432 if (xa_is_value(page)) 407 433 continue; ··· 401 449 truncate_inode_page(mapping, page); 402 450 unlock_page(page); 403 451 } 404 - truncate_exceptional_pvec_entries(mapping, &pvec, indices, end); 452 + truncate_exceptional_pvec_entries(mapping, &pvec, indices); 405 453 pagevec_release(&pvec); 406 454 index++; 407 455 } ··· 491 539 int i; 492 540 493 541 pagevec_init(&pvec); 494 - while (index <= end && pagevec_lookup_entries(&pvec, mapping, index, 495 - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1, 496 - indices)) { 542 + while (find_lock_entries(mapping, index, end, &pvec, indices)) { 497 543 for (i = 0; i < pagevec_count(&pvec); i++) { 498 544 struct page *page = pvec.pages[i]; 499 545 500 546 /* We rely upon deletion not changing page->index */ 501 547 index = indices[i]; 502 - if (index > end) 503 - break; 504 548 505 549 if (xa_is_value(page)) { 506 550 invalidate_exceptional_entry(mapping, index, 507 551 page); 508 552 continue; 509 553 } 510 - 511 - if (!trylock_page(page)) 512 - continue; 513 - 514 - WARN_ON(page_to_index(page) != index); 515 - 516 - /* Middle of THP: skip */ 517 - if (PageTransTail(page)) { 518 - unlock_page(page); 519 - continue; 520 - } else if (PageTransHuge(page)) { 521 - index += HPAGE_PMD_NR - 1; 522 - i += HPAGE_PMD_NR - 1; 523 - /* 524 - * 'end' is in the middle of THP. 
Don't 525 - * invalidate the page as the part outside of 526 - * 'end' could be still useful. 527 - */ 528 - if (index > end) { 529 - unlock_page(page); 530 - continue; 531 - } 532 - 533 - /* Take a pin outside pagevec */ 534 - get_page(page); 535 - 536 - /* 537 - * Drop extra pins before trying to invalidate 538 - * the huge page. 539 - */ 540 - pagevec_remove_exceptionals(&pvec); 541 - pagevec_release(&pvec); 542 - } 554 + index += thp_nr_pages(page) - 1; 543 555 544 556 ret = invalidate_inode_page(page); 545 557 unlock_page(page); ··· 517 601 if (nr_pagevec) 518 602 (*nr_pagevec)++; 519 603 } 520 - 521 - if (PageTransHuge(page)) 522 - put_page(page); 523 604 count += ret; 524 605 } 525 606 pagevec_remove_exceptionals(&pvec); ··· 638 725 639 726 pagevec_init(&pvec); 640 727 index = start; 641 - while (index <= end && pagevec_lookup_entries(&pvec, mapping, index, 642 - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1, 643 - indices)) { 728 + while (find_get_entries(mapping, index, end, &pvec, indices)) { 644 729 for (i = 0; i < pagevec_count(&pvec); i++) { 645 730 struct page *page = pvec.pages[i]; 646 731 647 732 /* We rely upon deletion not changing page->index */ 648 733 index = indices[i]; 649 - if (index > end) 650 - break; 651 734 652 735 if (xa_is_value(page)) { 653 736 if (!invalidate_exceptional_entry2(mapping,
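The truncate.c hunks above replace the open-coded "trylock each page, filter failures into a second pagevec" loop with find_lock_entries(), which returns only entries it already locked within [start, end], so the caller can clean up and batch-delete directly. A minimal user-space sketch of that contract (all `*_model` names are hypothetical stand-ins, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEVEC_SIZE 15

struct page_model {
    unsigned long index;
    bool locked;        /* simulates the page lock */
};

/*
 * Model of the find_lock_entries() contract: scan the range, trylock
 * each page, and hand back only the ones we managed to lock. Pages
 * someone else holds locked are skipped, so the caller never needs a
 * second "locked_pvec" to filter into.
 */
static int find_lock_entries_model(struct page_model *cache, size_t nr,
                                   unsigned long start, unsigned long end,
                                   struct page_model **pvec,
                                   unsigned long *indices)
{
    int count = 0;

    for (size_t i = 0; i < nr && count < PAGEVEC_SIZE; i++) {
        if (cache[i].index < start || cache[i].index > end)
            continue;
        if (cache[i].locked)        /* trylock failed: skip */
            continue;
        cache[i].locked = true;     /* trylock succeeded */
        indices[count] = cache[i].index;
        pvec[count++] = &cache[i];
    }
    return count;
}
```

Because the helper takes `end` itself, the `if (index >= end)` checks sprinkled through the old loop (and the `end` parameter of truncate_exceptional_pvec_entries()) become unnecessary, which is exactly what the hunks delete.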
+27 -8
mm/vmstat.c
··· 342 342 long t; 343 343 344 344 if (vmstat_item_in_bytes(item)) { 345 + /* 346 + * Only cgroups use subpage accounting right now; at 347 + * the global level, these items still change in 348 + * multiples of whole pages. Store them as pages 349 + * internally to keep the per-cpu counters compact. 350 + */ 345 351 VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); 346 352 delta >>= PAGE_SHIFT; 347 353 } ··· 557 551 long o, n, t, z; 558 552 559 553 if (vmstat_item_in_bytes(item)) { 554 + /* 555 + * Only cgroups use subpage accounting right now; at 556 + * the global level, these items still change in 557 + * multiples of whole pages. Store them as pages 558 + * internally to keep the per-cpu counters compact. 559 + */ 560 560 VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1)); 561 561 delta >>= PAGE_SHIFT; 562 562 } ··· 1649 1637 "\n high %lu" 1650 1638 "\n spanned %lu" 1651 1639 "\n present %lu" 1652 - "\n managed %lu", 1640 + "\n managed %lu" 1641 + "\n cma %lu", 1653 1642 zone_page_state(zone, NR_FREE_PAGES), 1654 1643 min_wmark_pages(zone), 1655 1644 low_wmark_pages(zone), 1656 1645 high_wmark_pages(zone), 1657 1646 zone->spanned_pages, 1658 1647 zone->present_pages, 1659 - zone_managed_pages(zone)); 1648 + zone_managed_pages(zone), 1649 + zone_cma_pages(zone)); 1660 1650 1661 1651 seq_printf(m, 1662 1652 "\n protection: (%ld", ··· 1906 1892 */ 1907 1893 static bool need_update(int cpu) 1908 1894 { 1895 + pg_data_t *last_pgdat = NULL; 1909 1896 struct zone *zone; 1910 1897 1911 1898 for_each_populated_zone(zone) { 1912 1899 struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); 1913 - 1914 - BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); 1915 - #ifdef CONFIG_NUMA 1916 - BUILD_BUG_ON(sizeof(p->vm_numa_stat_diff[0]) != 2); 1917 - #endif 1918 - 1900 + struct per_cpu_nodestat *n; 1919 1901 /* 1920 1902 * The fast way of checking if there are any vmstat diffs. 
1921 1903 */ ··· 1923 1913 sizeof(p->vm_numa_stat_diff[0]))) 1924 1914 return true; 1925 1915 #endif 1916 + if (last_pgdat == zone->zone_pgdat) 1917 + continue; 1918 + last_pgdat = zone->zone_pgdat; 1919 + n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu); 1920 + if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS * 1921 + sizeof(n->vm_node_stat_diff[0]))) 1922 + return true; 1926 1923 } 1927 1924 return false; 1928 1925 } ··· 1980 1963 1981 1964 if (!delayed_work_pending(dw) && need_update(cpu)) 1982 1965 queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0); 1966 + 1967 + cond_resched(); 1983 1968 } 1984 1969 put_online_cpus(); 1985 1970
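The need_update() change above also scans the per-node stat diffs (deduplicating zones that share a pgdat) and keeps using memchr_inv() as the fast "is this diff array all zero?" test. memchr_inv() is a kernel helper with no direct libc equivalent; a user-space stand-in with the same semantics, as a sketch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the kernel's memchr_inv(): return a pointer to the
 * first byte that does NOT equal 'c', or NULL if every byte matches.
 * Scanning bytes is much cheaper than looping over the counters and
 * comparing each one individually.
 */
static const void *memchr_inv_model(const void *start, int c, size_t bytes)
{
    const unsigned char *p = start;

    for (size_t i = 0; i < bytes; i++)
        if (p[i] != (unsigned char)c)
            return p + i;
    return NULL;
}

/* An update is needed iff any per-cpu diff is nonzero. */
static int need_update_model(const signed char *vm_stat_diff, size_t nitems)
{
    return memchr_inv_model(vm_stat_diff, 0,
                            nitems * sizeof(*vm_stat_diff)) != NULL;
}
```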
+1
mm/z3fold.c
··· 1771 1771 1772 1772 static struct zpool_driver z3fold_zpool_driver = { 1773 1773 .type = "z3fold", 1774 + .sleep_mapped = true, 1774 1775 .owner = THIS_MODULE, 1775 1776 .create = z3fold_zpool_create, 1776 1777 .destroy = z3fold_zpool_destroy,
+1
mm/zbud.c
··· 203 203 204 204 static struct zpool_driver zbud_zpool_driver = { 205 205 .type = "zbud", 206 + .sleep_mapped = true, 206 207 .owner = THIS_MODULE, 207 208 .create = zbud_zpool_create, 208 209 .destroy = zbud_zpool_destroy,
+13
mm/zpool.c
··· 23 23 void *pool; 24 24 const struct zpool_ops *ops; 25 25 bool evictable; 26 + bool can_sleep_mapped; 26 27 27 28 struct list_head list; 28 29 }; ··· 184 183 zpool->pool = driver->create(name, gfp, ops, zpool); 185 184 zpool->ops = ops; 186 185 zpool->evictable = driver->shrink && ops && ops->evict; 186 + zpool->can_sleep_mapped = driver->sleep_mapped; 187 187 188 188 if (!zpool->pool) { 189 189 pr_err("couldn't create %s pool\n", type); ··· 393 391 bool zpool_evictable(struct zpool *zpool) 394 392 { 395 393 return zpool->evictable; 394 + } 395 + 396 + /** 397 + * zpool_can_sleep_mapped - Test if zpool can sleep when do mapped. 398 + * @zpool: The zpool to test 399 + * 400 + * Returns: true if zpool can sleep; false otherwise. 401 + */ 402 + bool zpool_can_sleep_mapped(struct zpool *zpool) 403 + { 404 + return zpool->can_sleep_mapped; 396 405 } 397 406 398 407 MODULE_LICENSE("GPL");
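The zpool.c hunk follows the same pattern as the existing `evictable` flag: each backend driver declares a capability, the generic layer caches it at pool-creation time, and callers query it through a trivial accessor. A condensed sketch of that pattern (`*_model` names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Each backend declares whether its mappings may be held across
 * sleeping operations (zbud and z3fold can; zsmalloc cannot). */
struct zpool_driver_model {
    const char *type;
    bool sleep_mapped;
};

struct zpool_model {
    const struct zpool_driver_model *driver;
    bool can_sleep_mapped;      /* cached from the driver at creation */
};

static void zpool_create_model(struct zpool_model *zpool,
                               const struct zpool_driver_model *driver)
{
    zpool->driver = driver;
    zpool->can_sleep_mapped = driver->sleep_mapped;
}

static bool zpool_can_sleep_mapped_model(const struct zpool_model *zpool)
{
    return zpool->can_sleep_mapped;
}
```

This is why the series also touches mm/z3fold.c and mm/zbud.c: those drivers simply set `.sleep_mapped = true` in their `zpool_driver` definitions.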
+13 -9
mm/zsmalloc.c
··· 357 357 358 358 static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags) 359 359 { 360 - return kmem_cache_alloc(pool->zspage_cachep, 360 + return kmem_cache_zalloc(pool->zspage_cachep, 361 361 flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); 362 362 } 363 363 ··· 816 816 817 817 static struct zspage *get_zspage(struct page *page) 818 818 { 819 - struct zspage *zspage = (struct zspage *)page->private; 819 + struct zspage *zspage = (struct zspage *)page_private(page); 820 820 821 821 BUG_ON(zspage->magic != ZSPAGE_MAGIC); 822 822 return zspage; ··· 1064 1064 if (!zspage) 1065 1065 return NULL; 1066 1066 1067 - memset(zspage, 0, sizeof(struct zspage)); 1068 1067 zspage->magic = ZSPAGE_MAGIC; 1069 1068 migrate_lock_init(zspage); 1070 1069 ··· 2212 2213 return obj_wasted * class->pages_per_zspage; 2213 2214 } 2214 2215 2215 - static void __zs_compact(struct zs_pool *pool, struct size_class *class) 2216 + static unsigned long __zs_compact(struct zs_pool *pool, 2217 + struct size_class *class) 2216 2218 { 2217 2219 struct zs_compact_control cc; 2218 2220 struct zspage *src_zspage; 2219 2221 struct zspage *dst_zspage = NULL; 2222 + unsigned long pages_freed = 0; 2220 2223 2221 2224 spin_lock(&class->lock); 2222 2225 while ((src_zspage = isolate_zspage(class, true))) { ··· 2248 2247 putback_zspage(class, dst_zspage); 2249 2248 if (putback_zspage(class, src_zspage) == ZS_EMPTY) { 2250 2249 free_zspage(pool, class, src_zspage); 2251 - pool->stats.pages_compacted += class->pages_per_zspage; 2250 + pages_freed += class->pages_per_zspage; 2252 2251 } 2253 2252 spin_unlock(&class->lock); 2254 2253 cond_resched(); ··· 2259 2258 putback_zspage(class, src_zspage); 2260 2259 2261 2260 spin_unlock(&class->lock); 2261 + 2262 + return pages_freed; 2262 2263 } 2263 2264 2264 2265 unsigned long zs_compact(struct zs_pool *pool) 2265 2266 { 2266 2267 int i; 2267 2268 struct size_class *class; 2269 + unsigned long pages_freed = 0; 2268 2270 2269 2271 for (i = ZS_SIZE_CLASSES 
- 1; i >= 0; i--) { 2270 2272 class = pool->size_class[i]; ··· 2275 2271 continue; 2276 2272 if (class->index != i) 2277 2273 continue; 2278 - __zs_compact(pool, class); 2274 + pages_freed += __zs_compact(pool, class); 2279 2275 } 2276 + atomic_long_add(pages_freed, &pool->stats.pages_compacted); 2280 2277 2281 - return pool->stats.pages_compacted; 2278 + return pages_freed; 2282 2279 } 2283 2280 EXPORT_SYMBOL_GPL(zs_compact); 2284 2281 ··· 2296 2291 struct zs_pool *pool = container_of(shrinker, struct zs_pool, 2297 2292 shrinker); 2298 2293 2299 - pages_freed = pool->stats.pages_compacted; 2300 2294 /* 2301 2295 * Compact classes and calculate compaction delta. 2302 2296 * Can run concurrently with a manually triggered 2303 2297 * (by user) compaction. 2304 2298 */ 2305 - pages_freed = zs_compact(pool) - pages_freed; 2299 + pages_freed = zs_compact(pool); 2306 2300 2307 2301 return pages_freed ? pages_freed : SHRINK_STOP; 2308 2302 }
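The zsmalloc hunks change zs_compact() to count the pages it freed itself, add that delta to the shared `pages_compacted` counter atomically, and return only the delta. The old shrinker scheme ("read counter, compact, subtract") could misattribute work when a manual compaction ran concurrently. A sketch of the new accounting with C11 atomics (user-space model, not the kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared cumulative counter; each caller adds only its own work. */
static atomic_long pages_compacted = 0;

/*
 * Model of the new zs_compact() accounting: accumulate this call's
 * freed pages into the shared counter atomically, but return the
 * per-call delta so the shrinker reports its own progress even when
 * another compaction runs concurrently.
 */
static long zs_compact_model(long freed_this_call)
{
    atomic_fetch_add(&pages_compacted, freed_this_call);
    return freed_this_call;
}
```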
+49 -8
mm/zswap.c
··· 935 935 struct scatterlist input, output; 936 936 struct crypto_acomp_ctx *acomp_ctx; 937 937 938 - u8 *src; 938 + u8 *src, *tmp = NULL; 939 939 unsigned int dlen; 940 940 int ret; 941 941 struct writeback_control wbc = { 942 942 .sync_mode = WB_SYNC_NONE, 943 943 }; 944 + 945 + if (!zpool_can_sleep_mapped(pool)) { 946 + tmp = kmalloc(PAGE_SIZE, GFP_ATOMIC); 947 + if (!tmp) 948 + return -ENOMEM; 949 + } 944 950 945 951 /* extract swpentry from data */ 946 952 zhdr = zpool_map_handle(pool, handle, ZPOOL_MM_RO); ··· 961 955 /* entry was invalidated */ 962 956 spin_unlock(&tree->lock); 963 957 zpool_unmap_handle(pool, handle); 958 + kfree(tmp); 964 959 return 0; 965 960 } 966 961 spin_unlock(&tree->lock); ··· 985 978 986 979 dlen = PAGE_SIZE; 987 980 src = (u8 *)zhdr + sizeof(struct zswap_header); 981 + 982 + if (!zpool_can_sleep_mapped(pool)) { 983 + 984 + memcpy(tmp, src, entry->length); 985 + src = tmp; 986 + 987 + zpool_unmap_handle(pool, handle); 988 + } 988 989 989 990 mutex_lock(acomp_ctx->mutex); 990 991 sg_init_one(&input, src, entry->length); ··· 1037 1022 1038 1023 /* 1039 1024 * if we get here due to ZSWAP_SWAPCACHE_EXIST 1040 - * a load may happening concurrently 1041 - * it is safe and okay to not free the entry 1025 + * a load may be happening concurrently. 1026 + * it is safe and okay to not free the entry. 
1042 1027 * if we free the entry in the following put 1043 - * it it either okay to return !0 1028 + * it is also okay to return !0 1044 1029 */ 1045 1030 fail: 1046 1031 spin_lock(&tree->lock); ··· 1048 1033 spin_unlock(&tree->lock); 1049 1034 1050 1035 end: 1051 - zpool_unmap_handle(pool, handle); 1036 + if (zpool_can_sleep_mapped(pool)) 1037 + zpool_unmap_handle(pool, handle); 1038 + else 1039 + kfree(tmp); 1040 + 1052 1041 return ret; 1053 1042 } 1054 1043 ··· 1254 1235 struct zswap_entry *entry; 1255 1236 struct scatterlist input, output; 1256 1237 struct crypto_acomp_ctx *acomp_ctx; 1257 - u8 *src, *dst; 1238 + u8 *src, *dst, *tmp; 1258 1239 unsigned int dlen; 1259 1240 int ret; 1260 1241 ··· 1272 1253 dst = kmap_atomic(page); 1273 1254 zswap_fill_page(dst, entry->value); 1274 1255 kunmap_atomic(dst); 1256 + ret = 0; 1275 1257 goto freeentry; 1258 + } 1259 + 1260 + if (!zpool_can_sleep_mapped(entry->pool->zpool)) { 1261 + 1262 + tmp = kmalloc(entry->length, GFP_ATOMIC); 1263 + if (!tmp) { 1264 + ret = -ENOMEM; 1265 + goto freeentry; 1266 + } 1276 1267 } 1277 1268 1278 1269 /* decompress */ ··· 1290 1261 src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO); 1291 1262 if (zpool_evictable(entry->pool->zpool)) 1292 1263 src += sizeof(struct zswap_header); 1264 + 1265 + if (!zpool_can_sleep_mapped(entry->pool->zpool)) { 1266 + 1267 + memcpy(tmp, src, entry->length); 1268 + src = tmp; 1269 + 1270 + zpool_unmap_handle(entry->pool->zpool, entry->handle); 1271 + } 1293 1272 1294 1273 acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); 1295 1274 mutex_lock(acomp_ctx->mutex); ··· 1308 1271 ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); 1309 1272 mutex_unlock(acomp_ctx->mutex); 1310 1273 1311 - zpool_unmap_handle(entry->pool->zpool, entry->handle); 1274 + if (zpool_can_sleep_mapped(entry->pool->zpool)) 1275 + zpool_unmap_handle(entry->pool->zpool, entry->handle); 1276 + else 1277 + kfree(tmp); 1278 + 1312 1279 
BUG_ON(ret); 1313 1280 1314 1281 freeentry: ··· 1320 1279 zswap_entry_put(tree, entry); 1321 1280 spin_unlock(&tree->lock); 1322 1281 1323 - return 0; 1282 + return ret; 1324 1283 } 1325 1284 1326 1285 /* frees an entry in zswap */
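The zswap hunks implement a copy-out pattern: when the backend cannot keep a handle mapped across a sleeping decompression (zpool_can_sleep_mapped() is false), the compressed data is copied into a temporary buffer, the handle is unmapped immediately, and the decompressor works on the copy. A minimal user-space sketch of that branch (hypothetical `*_model` name; error codes simplified):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * If the mapping may not be held across a sleep, duplicate the data
 * and point 'src' at the copy; otherwise use the mapping directly.
 * The caller frees *tmp_out (if non-NULL) when done.
 */
static int load_entry_model(const unsigned char *mapped, size_t len,
                            bool can_sleep_mapped,
                            unsigned char **src_out, unsigned char **tmp_out)
{
    unsigned char *tmp = NULL;

    if (!can_sleep_mapped) {
        tmp = malloc(len);
        if (!tmp)
            return -1;          /* -ENOMEM in the kernel */
        memcpy(tmp, mapped, len);
        /* the real code calls zpool_unmap_handle() here */
    }
    *src_out = tmp ? tmp : (unsigned char *)mapped;
    *tmp_out = tmp;
    return 0;
}
```

The same reasoning explains the teardown branches in the diff: the handle is unmapped at the end only when it was kept mapped, and the temporary buffer is freed otherwise.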
+1 -1
samples/auxdisplay/cfag12864b-example.c
··· 4 4 * Version: 0.1.0 5 5 * Description: cfag12864b LCD userspace example program 6 6 * 7 - * Author: Copyright (C) Miguel Ojeda Sandonis 7 + * Author: Copyright (C) Miguel Ojeda <ojeda@kernel.org> 8 8 * Date: 2006-10-31 9 9 */ 10 10
-2
scripts/Makefile.ubsan
··· 8 8 ubsan-cflags-$(CONFIG_UBSAN_SHIFT) += -fsanitize=shift 9 9 ubsan-cflags-$(CONFIG_UBSAN_DIV_ZERO) += -fsanitize=integer-divide-by-zero 10 10 ubsan-cflags-$(CONFIG_UBSAN_UNREACHABLE) += -fsanitize=unreachable 11 - ubsan-cflags-$(CONFIG_UBSAN_SIGNED_OVERFLOW) += -fsanitize=signed-integer-overflow 12 - ubsan-cflags-$(CONFIG_UBSAN_UNSIGNED_OVERFLOW) += -fsanitize=unsigned-integer-overflow 13 11 ubsan-cflags-$(CONFIG_UBSAN_OBJECT_SIZE) += -fsanitize=object-size 14 12 ubsan-cflags-$(CONFIG_UBSAN_BOOL) += -fsanitize=bool 15 13 ubsan-cflags-$(CONFIG_UBSAN_ENUM) += -fsanitize=enum
+106 -46
scripts/checkpatch.pl
··· 382 382 # We need \b after 'init' otherwise 'initconst' will cause a false positive in a check 383 383 our $Attribute = qr{ 384 384 const| 385 + volatile| 385 386 __percpu| 386 387 __nocast| 387 388 __safe| ··· 487 486 488 487 our $allocFunctions = qr{(?x: 489 488 (?:(?:devm_)? 490 - (?:kv|k|v)[czm]alloc(?:_node|_array)? | 489 + (?:kv|k|v)[czm]alloc(?:_array)?(?:_node)? | 491 490 kstrdup(?:_const)? | 492 491 kmemdup(?:_nul)?) | 493 492 (?:\w+)?alloc_skb(?:_ip_align)? | ··· 505 504 Suggested-by:| 506 505 To:| 507 506 Cc: 507 + )}; 508 + 509 + our $tracing_logging_tags = qr{(?xi: 510 + [=-]*> | 511 + <[=-]* | 512 + \[ | 513 + \] | 514 + start | 515 + called | 516 + entered | 517 + entry | 518 + enter | 519 + in | 520 + inside | 521 + here | 522 + begin | 523 + exit | 524 + end | 525 + done | 526 + leave | 527 + completed | 528 + out | 529 + return | 530 + [\.\!:\s]* 508 531 )}; 509 532 510 533 sub edit_distance_min { ··· 2453 2428 return $comment; 2454 2429 } 2455 2430 2431 + sub exclude_global_initialisers { 2432 + my ($realfile) = @_; 2433 + 2434 + # Do not check for BPF programs (tools/testing/selftests/bpf/progs/*.c, samples/bpf/*_kern.c, *.bpf.c). 2435 + return $realfile =~ m@^tools/testing/selftests/bpf/progs/.*\.c$@ || 2436 + $realfile =~ m@^samples/bpf/.*_kern\.c$@ || 2437 + $realfile =~ m@/bpf/.*\.bpf\.c$@; 2438 + } 2439 + 2456 2440 sub process { 2457 2441 my $filename = shift; 2458 2442 ··· 3007 2973 } 3008 2974 if (!defined $lines[$linenr]) { 3009 2975 WARN("BAD_SIGN_OFF", 3010 - "Co-developed-by: must be immediately followed by Signed-off-by:\n" . "$here\n" . $rawline); 2976 + "Co-developed-by: must be immediately followed by Signed-off-by:\n" . "$here\n" . $rawline); 3011 2977 } elsif ($rawlines[$linenr] !~ /^\s*signed-off-by:\s*(.*)/i) { 3012 2978 WARN("BAD_SIGN_OFF", 3013 2979 "Co-developed-by: must be immediately followed by Signed-off-by:\n" . "$here\n" . $rawline . 
"\n" .$rawlines[$linenr]); ··· 3030 2996 if (ERROR("GERRIT_CHANGE_ID", 3031 2997 "Remove Gerrit Change-Id's before submitting upstream\n" . $herecurr) && 3032 2998 $fix) { 3033 - fix_delete_line($fixlinenr, $rawline); 3034 - } 2999 + fix_delete_line($fixlinenr, $rawline); 3000 + } 3035 3001 } 3036 3002 3037 3003 # Check if the commit log is in a possible stack dump ··· 3273 3239 next if ($start_char =~ /^\S$/); 3274 3240 next if (index(" \t.,;?!", $end_char) == -1); 3275 3241 3276 - # avoid repeating hex occurrences like 'ff ff fe 09 ...' 3277 - if ($first =~ /\b[0-9a-f]{2,}\b/i) { 3278 - next if (!exists($allow_repeated_words{lc($first)})); 3279 - } 3242 + # avoid repeating hex occurrences like 'ff ff fe 09 ...' 3243 + if ($first =~ /\b[0-9a-f]{2,}\b/i) { 3244 + next if (!exists($allow_repeated_words{lc($first)})); 3245 + } 3280 3246 3281 3247 if (WARN("REPEATED_WORD", 3282 3248 "Possible repeated word: '$first'\n" . $herecurr) && ··· 3608 3574 } 3609 3575 } 3610 3576 3577 + # check for .L prefix local symbols in .S files 3578 + if ($realfile =~ /\.S$/ && 3579 + $line =~ /^\+\s*(?:[A-Z]+_)?SYM_[A-Z]+_(?:START|END)(?:_[A-Z_]+)?\s*\(\s*\.L/) { 3580 + WARN("AVOID_L_PREFIX", 3581 + "Avoid using '.L' prefixed local symbol names for denoting a range of code via 'SYM_*_START/END' annotations; see Documentation/asm-annotations.rst\n" . 
$herecurr); 3582 + } 3583 + 3611 3584 # check we are in a valid source file C or perl if not then ignore this hunk 3612 3585 next if ($realfile !~ /\.(h|c|pl|dtsi|dts)$/); 3613 3586 ··· 3817 3776 } 3818 3777 3819 3778 # check for missing blank lines after declarations 3820 - if ($sline =~ /^\+\s+\S/ && #Not at char 1 3821 - # actual declarations 3822 - ($prevline =~ /^\+\s+$Declare\s*$Ident\s*[=,;:\[]/ || 3779 + # (declarations must have the same indentation and not be at the start of line) 3780 + if (($prevline =~ /\+(\s+)\S/) && $sline =~ /^\+$1\S/) { 3781 + # use temporaries 3782 + my $sl = $sline; 3783 + my $pl = $prevline; 3784 + # remove $Attribute/$Sparse uses to simplify comparisons 3785 + $sl =~ s/\b(?:$Attribute|$Sparse)\b//g; 3786 + $pl =~ s/\b(?:$Attribute|$Sparse)\b//g; 3787 + if (($pl =~ /^\+\s+$Declare\s*$Ident\s*[=,;:\[]/ || 3823 3788 # function pointer declarations 3824 - $prevline =~ /^\+\s+$Declare\s*\(\s*\*\s*$Ident\s*\)\s*[=,;:\[\(]/ || 3789 + $pl =~ /^\+\s+$Declare\s*\(\s*\*\s*$Ident\s*\)\s*[=,;:\[\(]/ || 3825 3790 # foo bar; where foo is some local typedef or #define 3826 - $prevline =~ /^\+\s+$Ident(?:\s+|\s*\*\s*)$Ident\s*[=,;\[]/ || 3791 + $pl =~ /^\+\s+$Ident(?:\s+|\s*\*\s*)$Ident\s*[=,;\[]/ || 3827 3792 # known declaration macros 3828 - $prevline =~ /^\+\s+$declaration_macros/) && 3793 + $pl =~ /^\+\s+$declaration_macros/) && 3829 3794 # for "else if" which can look like "$Ident $Ident" 3830 - !($prevline =~ /^\+\s+$c90_Keywords\b/ || 3795 + !($pl =~ /^\+\s+$c90_Keywords\b/ || 3831 3796 # other possible extensions of declaration lines 3832 - $prevline =~ /(?:$Compare|$Assignment|$Operators)\s*$/ || 3797 + $pl =~ /(?:$Compare|$Assignment|$Operators)\s*$/ || 3833 3798 # not starting a section or a macro "\" extended line 3834 - $prevline =~ /(?:\{\s*|\\)$/) && 3799 + $pl =~ /(?:\{\s*|\\)$/) && 3835 3800 # looks like a declaration 3836 - !($sline =~ /^\+\s+$Declare\s*$Ident\s*[=,;:\[]/ || 3801 + !($sl =~ /^\+\s+$Declare\s*$Ident\s*[=,;:\[]/ 
|| 3837 3802 # function pointer declarations 3838 - $sline =~ /^\+\s+$Declare\s*\(\s*\*\s*$Ident\s*\)\s*[=,;:\[\(]/ || 3803 + $sl =~ /^\+\s+$Declare\s*\(\s*\*\s*$Ident\s*\)\s*[=,;:\[\(]/ || 3839 3804 # foo bar; where foo is some local typedef or #define 3840 - $sline =~ /^\+\s+$Ident(?:\s+|\s*\*\s*)$Ident\s*[=,;\[]/ || 3805 + $sl =~ /^\+\s+$Ident(?:\s+|\s*\*\s*)$Ident\s*[=,;\[]/ || 3841 3806 # known declaration macros 3842 - $sline =~ /^\+\s+$declaration_macros/ || 3807 + $sl =~ /^\+\s+$declaration_macros/ || 3843 3808 # start of struct or union or enum 3844 - $sline =~ /^\+\s+(?:static\s+)?(?:const\s+)?(?:union|struct|enum|typedef)\b/ || 3809 + $sl =~ /^\+\s+(?:static\s+)?(?:const\s+)?(?:union|struct|enum|typedef)\b/ || 3845 3810 # start or end of block or continuation of declaration 3846 - $sline =~ /^\+\s+(?:$|[\{\}\.\#\"\?\:\(\[])/ || 3811 + $sl =~ /^\+\s+(?:$|[\{\}\.\#\"\?\:\(\[])/ || 3847 3812 # bitfield continuation 3848 - $sline =~ /^\+\s+$Ident\s*:\s*\d+\s*[,;]/ || 3813 + $sl =~ /^\+\s+$Ident\s*:\s*\d+\s*[,;]/ || 3849 3814 # other possible extensions of declaration lines 3850 - $sline =~ /^\+\s+\(?\s*(?:$Compare|$Assignment|$Operators)/) && 3851 - # indentation of previous and current line are the same 3852 - (($prevline =~ /\+(\s+)\S/) && $sline =~ /^\+$1\S/)) { 3853 - if (WARN("LINE_SPACING", 3854 - "Missing a blank line after declarations\n" . $hereprev) && 3855 - $fix) { 3856 - fix_insert_line($fixlinenr, "\+"); 3815 + $sl =~ /^\+\s+\(?\s*(?:$Compare|$Assignment|$Operators)/)) { 3816 + if (WARN("LINE_SPACING", 3817 + "Missing a blank line after declarations\n" . $hereprev) && 3818 + $fix) { 3819 + fix_insert_line($fixlinenr, "\+"); 3820 + } 3857 3821 } 3858 3822 } 3859 3823 ··· 4367 4321 } 4368 4322 4369 4323 # check for global initialisers. 
4370 - if ($line =~ /^\+$Type\s*$Ident(?:\s+$Modifier)*\s*=\s*($zero_initializer)\s*;/) { 4324 + if ($line =~ /^\+$Type\s*$Ident(?:\s+$Modifier)*\s*=\s*($zero_initializer)\s*;/ && 4325 + !exclude_global_initialisers($realfile)) { 4371 4326 if (ERROR("GLOBAL_INITIALISERS", 4372 4327 "do not initialise globals to $1\n" . $herecurr) && 4373 4328 $fix) { ··· 4464 4417 WARN("STATIC_CONST_CHAR_ARRAY", 4465 4418 "char * array declaration might be better as static const\n" . 4466 4419 $herecurr); 4467 - } 4420 + } 4468 4421 4469 4422 # check for sizeof(foo)/sizeof(foo[0]) that could be ARRAY_SIZE(foo) 4470 4423 if ($line =~ m@\bsizeof\s*\(\s*($Lval)\s*\)@) { ··· 5054 5007 # A colon needs no spaces before when it is 5055 5008 # terminating a case value or a label. 5056 5009 } elsif ($opv eq ':C' || $opv eq ':L') { 5057 - if ($ctx =~ /Wx./) { 5010 + if ($ctx =~ /Wx./ and $realfile !~ m@.*\.lds\.h$@) { 5058 5011 if (ERROR("SPACING", 5059 5012 "space prohibited before that '$op' $at\n" . $hereptr)) { 5060 5013 $good = rtrim($fix_elements[$n]) . trim($fix_elements[$n + 1]); ··· 5317 5270 $lines[$linenr - 3] !~ /^[ +]\s*$Ident\s*:/) { 5318 5271 WARN("RETURN_VOID", 5319 5272 "void function return statements are not generally useful\n" . $hereprev); 5320 - } 5273 + } 5321 5274 5322 5275 # if statements using unnecessary parentheses - ie: if ((foo == bar)) 5323 5276 if ($perl_version_ok && ··· 6013 5966 "Prefer using '\"%s...\", __func__' to using '$context_function', this function's name, in a string\n" . 
$herecurr); 6014 5967 } 6015 5968 5969 + # check for unnecessary function tracing like uses 5970 + # This does not use $logFunctions because there are many instances like 5971 + # 'dprintk(FOO, "%s()\n", __func__);' which do not match $logFunctions 5972 + if ($rawline =~ /^\+.*\([^"]*"$tracing_logging_tags{0,3}%s(?:\s*\(\s*\)\s*)?$tracing_logging_tags{0,3}(?:\\n)?"\s*,\s*__func__\s*\)\s*;/) { 5973 + if (WARN("TRACING_LOGGING", 5974 + "Unnecessary ftrace-like logging - prefer using ftrace\n" . $herecurr) && 5975 + $fix) { 5976 + fix_delete_line($fixlinenr, $rawline); 5977 + } 5978 + } 5979 + 6016 5980 # check for spaces before a quoted newline 6017 5981 if ($rawline =~ /^.*\".*\s\\n/) { 6018 5982 if (WARN("QUOTED_WHITESPACE_BEFORE_NEWLINE", ··· 6535 6477 if ($line =~ /(\(\s*$C90_int_types\s*\)\s*)($Constant)\b/) { 6536 6478 my $cast = $1; 6537 6479 my $const = $2; 6480 + my $suffix = ""; 6481 + my $newconst = $const; 6482 + $newconst =~ s/${Int_type}$//; 6483 + $suffix .= 'U' if ($cast =~ /\bunsigned\b/); 6484 + if ($cast =~ /\blong\s+long\b/) { 6485 + $suffix .= 'LL'; 6486 + } elsif ($cast =~ /\blong\b/) { 6487 + $suffix .= 'L'; 6488 + } 6538 6489 if (WARN("TYPECAST_INT_CONSTANT", 6539 - "Unnecessary typecast of c90 int constant\n" . $herecurr) && 6490 + "Unnecessary typecast of c90 int constant - '$cast$const' could be '$const$suffix'\n" . 
$herecurr) && 6540 6491 $fix) { 6541 - my $suffix = ""; 6542 - my $newconst = $const; 6543 - $newconst =~ s/${Int_type}$//; 6544 - $suffix .= 'U' if ($cast =~ /\bunsigned\b/); 6545 - if ($cast =~ /\blong\s+long\b/) { 6546 - $suffix .= 'LL'; 6547 - } elsif ($cast =~ /\blong\b/) { 6548 - $suffix .= 'L'; 6549 - } 6550 6492 $fixed[$fixlinenr] =~ s/\Q$cast\E$const\b/$newconst$suffix/; 6551 6493 } 6552 6494 } ··· 7077 7019 7078 7020 # use of NR_CPUS is usually wrong 7079 7021 # ignore definitions of NR_CPUS and usage to define arrays as likely right 7022 + # ignore designated initializers using NR_CPUS 7080 7023 if ($line =~ /\bNR_CPUS\b/ && 7081 7024 $line !~ /^.\s*\s*#\s*if\b.*\bNR_CPUS\b/ && 7082 7025 $line !~ /^.\s*\s*#\s*define\b.*\bNR_CPUS\b/ && 7083 7026 $line !~ /^.\s*$Declare\s.*\[[^\]]*NR_CPUS[^\]]*\]/ && 7084 7027 $line !~ /\[[^\]]*\.\.\.[^\]]*NR_CPUS[^\]]*\]/ && 7085 - $line !~ /\[[^\]]*NR_CPUS[^\]]*\.\.\.[^\]]*\]/) 7028 + $line !~ /\[[^\]]*NR_CPUS[^\]]*\.\.\.[^\]]*\]/ && 7029 + $line !~ /^.\s*\.\w+\s*=\s*.*\bNR_CPUS\b/) 7086 7030 { 7087 7031 WARN("NR_CPUS", 7088 7032 "usage of NR_CPUS is often wrong - consider using cpu_possible(), num_possible_cpus(), for_each_possible_cpu(), etc\n" . $herecurr);
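The improved TYPECAST_INT_CONSTANT message now spells out the suggested rewrite, e.g. that `(unsigned long)123` could be `123UL`. The suggestion is sound because the cast and the suffix produce the same value and type; a small check of that equivalence (illustrative helper, not kernel code):

```c
#include <assert.h>

/*
 * A cast of a C90 integer constant is equivalent to the constant
 * written with the matching suffix: same value, same width.
 */
static int cast_and_suffix_agree(void)
{
    return (unsigned long)123 == 123UL
        && sizeof((unsigned long)123) == sizeof(123UL)
        && (long long)456 == 456LL
        && sizeof((long long)456) == sizeof(456LL);
}
```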
+5
scripts/gdb/linux/lists.py
··· 27 27 raise TypeError("Must be struct list_head not {}" 28 28 .format(head.type)) 29 29 30 + if head['next'] == 0: 31 + gdb.write("list_for_each: Uninitialized list '{}' treated as empty\n" 32 + .format(head.address)) 33 + return 34 + 30 35 node = head['next'].dereference() 31 36 while node.address != head.address: 32 37 yield node.address
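The gdb helper fix guards against dereferencing a `list_head` whose `next` pointer is still NULL, i.e. a list that was never initialized, which is a distinct state from a properly initialized empty list where `next` points back at the head. A C model of the two states the script now distinguishes (hypothetical `*_model` names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head_model {
    struct list_head_model *next, *prev;
};

/* Never passed through INIT_LIST_HEAD(): next is still NULL. */
static bool list_uninitialized(const struct list_head_model *head)
{
    return head->next == NULL;
}

/* Initialized and empty: next points back at the head itself. */
static bool list_empty_model(const struct list_head_model *head)
{
    return head->next == head;
}
```

Treating the uninitialized case as empty (with a warning) lets `list_for_each` degrade gracefully instead of chasing a NULL pointer inside gdb.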