Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...

+5357 -946
+15
Documentation/admin-guide/mm/damon/index.rst
+ .. SPDX-License-Identifier: GPL-2.0
+
+ ========================
+ Monitoring Data Accesses
+ ========================
+
+ :doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
+ Using DAMON, users can analyze the memory access patterns of their systems and
+ optimize those.
+
+ .. toctree::
+    :maxdepth: 2
+
+    start
+    usage
+114
Documentation/admin-guide/mm/damon/start.rst
+ .. SPDX-License-Identifier: GPL-2.0
+
+ ===============
+ Getting Started
+ ===============
+
+ This document briefly describes how you can use DAMON by demonstrating its
+ default user space tool. Please note that this document describes only a part
+ of its features for brevity. Please refer to :doc:`usage` for more details.
+
+
+ TL; DR
+ ======
+
+ Follow the commands below to monitor and visualize the memory access pattern of
+ your workload. ::
+
+     # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
+     # mount -t debugfs none /sys/kernel/debug/
+     # git clone https://github.com/awslabs/damo
+     # ./damo/damo record $(pidof <your workload>)
+     # ./damo/damo report heat --plot_ascii
+
+ The final command draws the access heatmap of ``<your workload>``. The heatmap
+ shows which memory region (x-axis) is accessed when (y-axis) and how frequently
+ (number; the higher the more accesses have been observed). ::
+
+     111111111111111111111111111111111111111111111111111111110000
+     111121111111111111111111111111211111111111111111111111110000
+     000000000000000000000000000000000000000000000000001555552000
+     000000000000000000000000000000000000000000000222223555552000
+     000000000000000000000000000000000000000011111677775000000000
+     000000000000000000000000000000000000000488888000000000000000
+     000000000000000000000000000000000177888400000000000000000000
+     000000000000000000000000000046666522222100000000000000000000
+     000000000000000000000014444344444300000000000000000000000000
+     000000000000000002222245555510000000000000000000000000000000
+     # access_frequency: 0 1 2 3 4 5 6 7 8 9
+     # x-axis: space (140286319947776-140286426374096: 101.496 MiB)
+     # y-axis: time (605442256436361-605479951866441: 37.695430s)
+     # resolution: 60x10 (1.692 MiB and 3.770s for each character)
+
+
+ Prerequisites
+ =============
+
+ Kernel
+ ------
+
+ You should first ensure your system is running on a kernel built with
+ ``CONFIG_DAMON_*=y``.
+
+
+ User Space Tool
+ ---------------
+
+ For the demonstration, we will use the default user space tool for DAMON,
+ called DAMON Operator (DAMO). It is available at
+ https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
+ your ``$PATH``. It's not mandatory, though.
+
+ Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
+ detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
+ below::
+
+     # mount -t debugfs none /sys/kernel/debug/
+
+ or append the following line to your ``/etc/fstab`` file so that your system
+ can automatically mount debugfs upon booting::
+
+     debugfs /sys/kernel/debug debugfs defaults 0 0
+
+
+ Recording Data Access Patterns
+ ==============================
+
+ The commands below record the memory access patterns of a program and save the
+ monitoring results to a file. ::
+
+     $ git clone https://github.com/sjp38/masim
+     $ cd masim; make; ./masim ./configs/zigzag.cfg &
+     $ sudo damo record -o damon.data $(pidof masim)
+
+ The first two lines of the commands download an artificial memory access
+ generator program and run it in the background. The generator will repeatedly
+ access two 100 MiB sized memory regions one by one. You can substitute this
+ with your real workload. The last line asks ``damo`` to record the access
+ pattern in the ``damon.data`` file.
+
+
+ Visualizing Recorded Patterns
+ =============================
+
+ The following three commands visualize the recorded access patterns and save
+ the results as separate image files. ::
+
+     $ damo report heats --heatmap access_pattern_heatmap.png
+     $ damo report wss --range 0 101 1 --plot wss_dist.png
+     $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
+
+ - ``access_pattern_heatmap.png`` will visualize the data access pattern in a
+   heatmap, showing which memory region (y-axis) got accessed when (x-axis)
+   and how frequently (color).
+ - ``wss_dist.png`` will show the distribution of the working set size.
+ - ``wss_chron_change.png`` will show how the working set size has
+   chronologically changed.
+
+ You can view the visualizations of this example workload at [1]_.
+ Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
+
+ .. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
+ .. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
+ .. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
+ .. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
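The record-and-report sequence above can be collected into one small wrapper. This is an illustrative sketch, not part of the patch: the `damo` subcommands and flags are copied verbatim from the document, while the `run` and `monitor_workload` helpers and the `DRY_RUN` preview knob are hypothetical additions so the sequence can be previewed on a machine without a DAMON-enabled kernel.

```shell
#!/bin/sh
# Sketch of the workflow above: record a workload's access pattern with damo
# and render the three reports. Assumes damo is on $PATH and, for a real run,
# a kernel built with CONFIG_DAMON_*=y. DRY_RUN is a hypothetical knob added
# here so the command sequence can be previewed without touching the system.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"          # preview mode: print the command instead of running it
    else
        "$@"
    fi
}

monitor_workload() {
    pid=$1
    # debugfs must be mounted for damo's debugfs interface
    run mount -t debugfs none /sys/kernel/debug/
    # record the access pattern of the target process into damon.data
    run damo record -o damon.data "$pid"
    # render the reports described above
    run damo report heats --heatmap access_pattern_heatmap.png
    run damo report wss --range 0 101 1 --plot wss_dist.png
    run damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
}
```

For example, `DRY_RUN=1 monitor_workload $(pidof masim)` prints the full command sequence without executing anything.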
+112
Documentation/admin-guide/mm/damon/usage.rst
+ .. SPDX-License-Identifier: GPL-2.0
+
+ ===============
+ Detailed Usages
+ ===============
+
+ DAMON provides the three interfaces below for different users.
+
+ - *DAMON user space tool.*
+   This is for privileged people such as system administrators who want a
+   just-working human-friendly interface. Using this, users can use DAMON's
+   major features in a human-friendly way. It may not be highly tuned for
+   special cases, though. It supports only virtual address spaces monitoring.
+ - *debugfs interface.*
+   This is for privileged user space programmers who want more optimized use of
+   DAMON. Using this, users can use DAMON's major features by reading from and
+   writing to special debugfs files. Therefore, you can write and use your
+   personalized DAMON debugfs wrapper programs that read/write the debugfs
+   files for you. The DAMON user space tool is also a reference implementation
+   of such programs. It supports only virtual address spaces monitoring.
+ - *Kernel Space Programming Interface.*
+   This is for kernel space programmers. Using this, users can utilize every
+   feature of DAMON most flexibly and efficiently by writing kernel space
+   DAMON application programs. You can even extend DAMON for various address
+   spaces.
+
+ Nevertheless, you could write your own user space tool using the debugfs
+ interface. A reference implementation is available at
+ https://github.com/awslabs/damo. If you are a kernel programmer, you could
+ refer to :doc:`/vm/damon/api` for the kernel space programming interface. For
+ that reason, this document describes only the debugfs interface.
+
+ debugfs Interface
+ =================
+
+ DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under
+ its debugfs directory, ``<debugfs>/damon/``.
+
+
+ Attributes
+ ----------
+
+ Users can get and set the ``sampling interval``, ``aggregation interval``,
+ ``regions update interval``, and min/max number of monitoring target regions
+ by reading from and writing to the ``attrs`` file. To know about the
+ monitoring attributes in detail, please refer to the :doc:`/vm/damon/design`.
+ For example, the commands below set those values to 5 ms, 100 ms, 1,000 ms,
+ 10, and 1,000, and then check them again::
+
+     # cd <debugfs>/damon
+     # echo 5000 100000 1000000 10 1000 > attrs
+     # cat attrs
+     5000 100000 1000000 10 1000
+
+
+ Target IDs
+ ----------
+
+ Some types of address spaces support multiple monitoring targets. For example,
+ virtual memory address spaces monitoring can have multiple processes as the
+ monitoring targets. Users can set the targets by writing relevant id values of
+ the targets to, and get the ids of the current targets by reading from, the
+ ``target_ids`` file. In case of virtual address spaces monitoring, the values
+ should be pids of the monitoring target processes. For example, the commands
+ below set processes having pids 42 and 4242 as the monitoring targets and
+ check it again::
+
+     # cd <debugfs>/damon
+     # echo 42 4242 > target_ids
+     # cat target_ids
+     42 4242
+
+ Note that setting the target ids doesn't start the monitoring.
+
+
+ Turning On/Off
+ --------------
+
+ Setting the files as described above doesn't take effect unless you explicitly
+ start the monitoring. You can start, stop, and check the current status of the
+ monitoring by writing to and reading from the ``monitor_on`` file. Writing
+ ``on`` to the file starts the monitoring of the targets with the attributes.
+ Writing ``off`` to the file stops the monitoring. DAMON also stops if every
+ target process is terminated. The example commands below turn on, turn off,
+ and check the status of DAMON::
+
+     # cd <debugfs>/damon
+     # echo on > monitor_on
+     # echo off > monitor_on
+     # cat monitor_on
+     off
+
+ Please note that you cannot write to the above-mentioned debugfs files while
+ the monitoring is turned on. If you write to the files while DAMON is running,
+ an error code such as ``-EBUSY`` will be returned.
+
+
+ Tracepoint for Monitoring Results
+ =================================
+
+ DAMON provides the monitoring results via a tracepoint,
+ ``damon:damon_aggregated``. While the monitoring is turned on, you could
+ record the tracepoint events and show results using tracepoint supporting
+ tools like ``perf``. For example::
+
+     # echo on > monitor_on
+     # perf record -e damon:damon_aggregated &
+     # sleep 5
+     # kill -9 $(pidof perf)
+     # echo off > monitor_on
+     # perf script
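The attrs/target_ids/monitor_on sequence documented above can be replayed against a scratch directory standing in for ``<debugfs>/damon``. This is purely an illustration of the call order: the directory and plain files below are stand-ins created with ``mktemp``; on a real system the files are created by the kernel under debugfs, writes have side effects, and root privileges plus a DAMON-enabled kernel are required.

```shell
#!/bin/sh
# Sketch of the debugfs sequence above, replayed against a scratch directory
# standing in for <debugfs>/damon. Plain files only echo back what was
# written, which is enough to demonstrate the call order.
set -eu

damon_dir=$(mktemp -d)          # stand-in for <debugfs>/damon
cd "$damon_dir"
touch attrs target_ids monitor_on

# set sampling/aggregation/update intervals and min/max region counts
echo 5000 100000 1000000 10 1000 > attrs
cat attrs

# pick monitoring targets (pids) -- this does not start monitoring
echo 42 4242 > target_ids
cat target_ids

# explicitly start, then stop, the monitoring
echo on > monitor_on
echo off > monitor_on
cat monitor_on
```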
+1
Documentation/admin-guide/mm/index.rst

     concepts
     cma_debugfs
+    damon/index
     hugetlbpage
     idle_page_tracking
     ksm
+465 -355
Documentation/admin-guide/mm/memory-hotplug.rst
  .. _admin_guide_memory_hotplug:

- ==============
- Memory Hotplug
- ==============
+ ==================
+ Memory Hot(Un)Plug
+ ==================

- :Created: Jul 28 2007
- :Updated: Add some details about locking internals: Aug 20 2018
-
- This document is about memory hotplug including how-to-use and current status.
- Because Memory Hotplug is still under development, contents of this text will
- be changed often.
+ This document describes generic Linux support for memory hot(un)plug with
+ a focus on System RAM, including ZONE_MOVABLE support.

  .. contents:: :local:
-
- .. note::
-
-     (1) x86_64's has special implementation for memory hotplug.
-         This text does not describe it.
-     (2) This text assumes that sysfs is mounted at ``/sys``.
-

  Introduction
  ============

- Purpose of memory hotplug
- -------------------------
+ Memory hot(un)plug allows for increasing and decreasing the size of physical
+ memory available to a machine at runtime. In the simplest case, it consists of
+ physically plugging or unplugging a DIMM at runtime, coordinated with the
+ operating system.

- Memory Hotplug allows users to increase/decrease the amount of memory.
- Generally, there are two purposes.
+ Memory hot(un)plug is used for various purposes:

- (A) For changing the amount of memory.
-     This is to allow a feature like capacity on demand.
- (B) For installing/removing DIMMs or NUMA-nodes physically.
-     This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+ - The physical memory available to a machine can be adjusted at runtime, up- or
+   downgrading the memory capacity. This dynamic memory resizing, sometimes
+   referred to as "capacity on demand", is frequently used with virtual machines
+   and logical partitions.

- (A) is required by highly virtualized environments and (B) is required by
- hardware which supports memory power management.
+ - Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
+   example is replacing failing memory modules.

- Linux memory hotplug is designed for both purpose.
+ - Reducing energy consumption either by physically unplugging memory modules or
+   by logically unplugging (parts of) memory modules from Linux.

- Phases of memory hotplug
- ------------------------
+ Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
+ used to expose persistent memory, other performance-differentiated memory and
+ reserved memory regions as ordinary system RAM to Linux.

- There are 2 phases in Memory Hotplug:
+ Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
+ x86_64, arm64, ppc64, s390x and ia64.

- 1) Physical Memory Hotplug phase
- 2) Logical Memory Hotplug phase.
+ Memory Hot(Un)Plug Granularity
+ ------------------------------

- The First phase is to communicate hardware/firmware and make/erase
- environment for hotplugged memory. Basically, this phase is necessary
- for the purpose (B), but this is good phase for communication between
- highly virtualized environments too.
-
- When memory is hotplugged, the kernel recognizes new memory, makes new memory
- management tables, and makes sysfs files for new memory's operation.
-
- If firmware supports notification of connection of new memory to OS,
- this phase is triggered automatically. ACPI can notify this event. If not,
- "probe" operation by system administration is used instead.
- (see :ref:`memory_hotplug_physical_mem`).
-
- Logical Memory Hotplug phase is to change memory state into
- available/unavailable for users. Amount of memory from user's view is
- changed by this phase. The kernel makes all memory in it as free pages
- when a memory range is available.
-
- In this document, this phase is described as online/offline.
-
- Logical Memory Hotplug phase is triggered by write of sysfs file by system
- administrator. For the hot-add case, it must be executed after Physical Hotplug
- phase by hand.
- (However, if you writes udev's hotplug scripts for memory hotplug, these
- phases can be execute in seamless way.)
-
- Unit of Memory online/offline operation
- ---------------------------------------
-
- Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
- into chunks of the same size. These chunks are called "sections". The size of
- a memory section is architecture dependent. For example, power uses 16MiB, ia64
- uses 1GiB.
+ Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
+ physical memory address space into chunks of the same size: memory sections. The
+ size of a memory section is architecture dependent. For example, x86_64 uses
+ 128 MiB and ppc64 uses 16 MiB.

  Memory sections are combined into chunks referred to as "memory blocks". The
- size of a memory block is architecture dependent and represents the logical
- unit upon which memory online/offline operations are to be performed. The
- default size of a memory block is the same as memory section size unless an
- architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
+ size of a memory block is architecture dependent and corresponds to the smallest
+ granularity that can be hot(un)plugged. The default size of a memory block is
+ the same as memory section size, unless an architecture specifies otherwise.

- To determine the size (in bytes) of a memory block please read this file::
+ All memory blocks have the same size.

- /sys/devices/system/memory/block_size_bytes
+ Phases of Memory Hotplug
+ ------------------------

- Kernel Configuration
- ====================
+ Memory hotplug consists of two phases:

- To use memory hotplug feature, kernel must be compiled with following
- config options.
+ (1) Adding the memory to Linux
+ (2) Onlining memory blocks

- - For all memory hotplug:
-     - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
-     - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
+ In the first phase, metadata, such as the memory map ("memmap") and page tables
+ for the direct mapping, is allocated and initialized, and memory blocks are
+ created; the latter also creates sysfs files for managing newly created memory
+ blocks.

- - To enable memory removal, the following are also necessary:
-     - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
-     - Page Migration (``CONFIG_MIGRATION``)
+ In the second phase, added memory is exposed to the page allocator. After this
+ phase, the memory is visible in memory statistics, such as free and total
+ memory, of the system.

- - For ACPI memory hotplug, the following are also necessary:
-     - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
-     - This option can be kernel module.
+ Phases of Memory Hotunplug
+ --------------------------

- - As a related configuration, if your box has a feature of NUMA-node hotplug
-   via ACPI, then this option is necessary too.
+ Memory hotunplug consists of two phases:

-     - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
-       (``CONFIG_ACPI_CONTAINER``).
+ (1) Offlining memory blocks
+ (2) Removing the memory from Linux

-     This option can be kernel module too.
+ In the first phase, memory is "hidden" from the page allocator again, for
+ example, by migrating busy memory to other memory locations and removing all
+ relevant free pages from the page allocator. After this phase, the memory is no
+ longer visible in memory statistics of the system.

+ In the second phase, the memory blocks are removed and metadata is freed.

- .. _memory_hotplug_sysfs_files:
+ Memory Hotplug Notifications
+ ============================

- sysfs files for memory hotplug
+ There are various ways how Linux is notified about memory hotplug events such
+ that it can start adding hotplugged memory. This description is limited to
+ systems that support ACPI; mechanisms specific to other firmware interfaces or
+ virtual machines are not described.
+
+ ACPI Notifications
+ ------------------
+
+ Platforms that support ACPI, such as x86_64, can support memory hotplug
+ notifications via ACPI.
+
+ In general, a firmware supporting memory hotplug defines a memory class object
+ HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
+ driver will hotplug the memory to Linux.
+
+ If the firmware supports hotplug of NUMA nodes, it defines an object _HID
+ "ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
+ assigned memory devices are added to Linux by the ACPI driver.
+
+ Similarly, Linux can be notified about requests to hotunplug a memory device or
+ a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
+ blocks, and, if successful, hotunplug the memory from Linux.
+
+ Manual Probing
+ --------------
+
+ On some architectures, the firmware may not be able to notify the operating
+ system about a memory hotplug event. Instead, the memory has to be manually
+ probed from user space.
+
+ The probe interface is located at::
+
+     /sys/devices/system/memory/probe
+
+ Only complete memory blocks can be probed. Individual memory blocks are probed
+ by providing the physical start address of the memory block::
+
+     % echo addr > /sys/devices/system/memory/probe
+
+ Which results in a memory block for the range [addr, addr + memory_block_size)
+ being created.
+
+ .. note::
+
+   Using the probe interface is discouraged as it is easy to crash the kernel,
+   because Linux cannot validate user input; this interface might be removed in
+   the future.
+
+ Onlining and Offlining Memory Blocks
+ ====================================
+
+ After a memory block has been created, Linux has to be instructed to actually
+ make use of that memory: the memory block has to be "online".
+
+ Before a memory block can be removed, Linux has to stop using any memory part of
+ the memory block: the memory block has to be "offlined".
+
+ The Linux kernel can be configured to automatically online added memory blocks,
+ and drivers may trigger offlining of memory blocks when attempting hotunplug of
+ memory. Memory blocks can only be removed once offlining succeeded.
+
+ Onlining Memory Blocks Manually
+ -------------------------------
+
+ If auto-onlining of memory blocks isn't enabled, user-space has to manually
+ trigger onlining of memory blocks. Often, udev rules are used to automate this
+ task in user space.

+ Onlining of a memory block can be triggered via::
+
+     % echo online > /sys/devices/system/memory/memoryXXX/state
+
+ Or alternatively::
+
+     % echo 1 > /sys/devices/system/memory/memoryXXX/online
+
+ The kernel will select the target zone automatically, usually defaulting to
+ ``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
+ command line or if the memory block would intersect the ZONE_MOVABLE already.
+
+ One can explicitly request to associate an offline memory block with
+ ZONE_MOVABLE by::
+
+     % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+
+ Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+
+     % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+
+ In any case, if onlining succeeds, the state of the memory block is changed to
+ be "online". If it fails, the state of the memory block will remain unchanged
+ and the above commands will fail.
+
+ Onlining Memory Blocks Automatically
+ ------------------------------------
+
+ The kernel can be configured to try auto-onlining of newly added memory blocks.
+ If this feature is disabled, the memory blocks will stay offline until
+ explicitly onlined from user space.
+
+ The configured auto-online behavior can be observed via::
+
+     % cat /sys/devices/system/memory/auto_online_blocks
+
+ Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
+ ``online_movable`` to that file, like::
+
+     % echo online > /sys/devices/system/memory/auto_online_blocks
+
+ Modifying the auto-online behavior will only affect subsequently added memory
+ blocks.
+
+ .. note::
+
+   In corner cases, auto-onlining can fail. The kernel won't retry. Note that
+   auto-onlining is not expected to fail in default configurations.
+
+ .. note::
+
+   DLPAR on ppc64 ignores the ``offline`` setting and will still online added
+   memory blocks; if onlining fails, memory blocks are removed again.
+
+ Offlining Memory Blocks
+ -----------------------
+
+ In the current implementation, Linux's memory offlining will try migrating all
+ movable pages off the affected memory block. As most kernel allocations, such as
+ page tables, are unmovable, page migration can fail and, therefore, inhibit
+ memory offlining from succeeding.
+
+ Having the memory provided by a memory block managed by ZONE_MOVABLE
+ significantly increases memory offlining reliability; still, memory offlining
+ can fail in some corner cases.
+
+ Further, memory offlining might retry for a long time (or even forever), until
+ aborted by the user.
+
+ Offlining of a memory block can be triggered via::
+
+     % echo offline > /sys/devices/system/memory/memoryXXX/state
+
+ Or alternatively::
+
+     % echo 0 > /sys/devices/system/memory/memoryXXX/online
+
+ If offlining succeeds, the state of the memory block is changed to be "offline".
238 + If it fails, the state of the memory block will remain unchanged and the above 239 + commands will fail, for example, via:: 240 + 241 + bash: echo: write error: Device or resource busy 242 + 243 + or via:: 244 + 245 + bash: echo: write error: Invalid argument 246 + 247 + Observing the State of Memory Blocks 248 + ------------------------------------ 249 + 250 + The state (online/offline/going-offline) of a memory block can be observed 251 + either via:: 252 + 253 + % cat /sys/device/system/memory/memoryXXX/state 254 + 255 + Or alternatively (1/0) via:: 256 + 257 + % cat /sys/device/system/memory/memoryXXX/online 258 + 259 + For an online memory block, the managing zone can be observed via:: 260 + 261 + % cat /sys/device/system/memory/memoryXXX/valid_zones 262 + 263 + Configuring Memory Hot(Un)Plug 124 264 ============================== 125 265 126 - All memory blocks have their device information in sysfs. Each memory block 127 - is described under ``/sys/devices/system/memory`` as:: 266 + There are various ways how system administrators can configure memory 267 + hot(un)plug and interact with memory blocks, especially, to online them. 268 + 269 + Memory Hot(Un)Plug Configuration via Sysfs 270 + ------------------------------------------ 271 + 272 + Some memory hot(un)plug properties can be configured or inspected via sysfs in:: 273 + 274 + /sys/devices/system/memory/ 275 + 276 + The following files are currently defined: 277 + 278 + ====================== ========================================================= 279 + ``auto_online_blocks`` read-write: set or get the default state of new memory 280 + blocks; configure auto-onlining. 281 + 282 + The default value depends on the 283 + CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration 284 + option. 285 + 286 + See the ``state`` property of memory blocks for details. 287 + ``block_size_bytes`` read-only: the size in bytes of a memory block. 
288 + ``probe`` write-only: add (probe) selected memory blocks manually 289 + from user space by supplying the physical start address. 290 + 291 + Availability depends on the CONFIG_ARCH_MEMORY_PROBE 292 + kernel configuration option. 293 + ``uevent`` read-write: generic udev file for device subsystems. 294 + ====================== ========================================================= 295 + 296 + .. note:: 297 + 298 + When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two 299 + additional files ``hard_offline_page`` and ``soft_offline_page`` are available 300 + to trigger hwpoisoning of pages, for example, for testing purposes. Note that 301 + this functionality is not really related to memory hot(un)plug or actual 302 + offlining of memory blocks. 303 + 304 + Memory Block Configuration via Sysfs 305 + ------------------------------------ 306 + 307 + Each memory block is represented as a memory block device that can be 308 + onlined or offlined. All memory blocks have their device information located in 309 + sysfs. Each present memory block is listed under 310 + ``/sys/devices/system/memory`` as:: 128 311 129 312 /sys/devices/system/memory/memoryXXX 130 313 131 - where XXX is the memory block id. 314 + where XXX is the memory block id; the number of digits is variable. 132 315 133 - For the memory block covered by the sysfs directory. It is expected that all 134 - memory sections in this range are present and no memory holes exist in the 135 - range. Currently there is no way to determine if there is a memory hole, but 136 - the existence of one should not affect the hotplug capabilities of the memory 137 - block. 316 + A present memory block indicates that some memory in the range is present; 317 + however, a memory block might span memory holes. A memory block spanning memory 318 + holes cannot be offlined. 138 319 139 - For example, assume 1GiB memory block size. 
A device for a memory starting at 320 + For example, assume 1 GiB memory block size. A device for a memory starting at 140 321 0x100000000 is ``/sys/device/system/memory/memory4``:: 141 322 142 323 (0x100000000 / 1Gib = 4) 143 324 144 325 This device covers address range [0x100000000 ... 0x140000000) 145 326 146 - Under each memory block, you can see 5 files: 147 - 148 - - ``/sys/devices/system/memory/memoryXXX/phys_index`` 149 - - ``/sys/devices/system/memory/memoryXXX/phys_device`` 150 - - ``/sys/devices/system/memory/memoryXXX/state`` 151 - - ``/sys/devices/system/memory/memoryXXX/removable`` 152 - - ``/sys/devices/system/memory/memoryXXX/valid_zones`` 327 + The following files are currently defined: 153 328 154 329 =================== ============================================================ 155 - ``phys_index`` read-only and contains memory block id, same as XXX. 156 - ``state`` read-write 157 - 158 - - at read: contains online/offline state of memory. 159 - - at write: user can specify "online_kernel", 160 - 161 - "online_movable", "online", "offline" command 162 - which will be performed on all sections in the block. 330 + ``online`` read-write: simplified interface to trigger onlining / 331 + offlining and to observe the state of a memory block. 332 + When onlining, the zone is selected automatically. 163 333 ``phys_device`` read-only: legacy interface only ever used on s390x to 164 334 expose the covered storage increment. 335 + ``phys_index`` read-only: the memory block id (XXX). 165 336 ``removable`` read-only: legacy interface that indicated whether a memory 166 - block was likely to be offlineable or not. Newer kernel 167 - versions return "1" if and only if the kernel supports 168 - memory offlining. 169 - ``valid_zones`` read-only: designed to show by which zone memory provided by 170 - a memory block is managed, and to show by which zone memory 171 - provided by an offline memory block could be managed when 172 - onlining. 
337 + block was likely to be offlineable or not. Nowadays, the 338 + kernel returns ``1`` if and only if it supports memory 339 + offlining. 340 + ``state`` read-write: advanced interface to trigger onlining / 341 + offlining and to observe the state of a memory block. 173 342 174 - The first column shows it`s default zone. 343 + When writing, ``online``, ``offline``, ``online_kernel`` and 344 + ``online_movable`` are supported. 175 345 176 - "memory6/valid_zones: Normal Movable" shows this memoryblock 177 - can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE 178 - by online_movable. 346 + ``online_movable`` specifies onlining to ZONE_MOVABLE. 347 + ``online_kernel`` specifies onlining to the default kernel 348 + zone for the memory block, such as ZONE_NORMAL. 349 + ``online`` lets the kernel select the zone automatically. 179 350 180 - "memory7/valid_zones: Movable Normal" shows this memoryblock 181 - can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL 182 - by online_kernel. 351 + When reading, ``online``, ``offline`` and ``going-offline`` 352 + may be returned. 353 + ``uevent`` read-write: generic uevent file for devices. 354 + ``valid_zones`` read-only: when a block is online, shows the zone it 355 + belongs to; when a block is offline, shows what zone will 356 + manage it when the block is onlined. 357 + 358 + For online memory blocks, ``DMA``, ``DMA32``, ``Normal``, 359 + ``Movable`` and ``none`` may be returned. ``none`` indicates 360 + that memory provided by a memory block is managed by 361 + multiple zones or spans multiple nodes; such memory blocks 362 + cannot be offlined. ``Movable`` indicates ZONE_MOVABLE. 363 + Other values indicate a kernel zone. 364 + 365 + For offline memory blocks, the first column shows the 366 + zone the kernel would select when onlining the memory block 367 + right now without further specifying a zone. 368 + 369 + Availability depends on the CONFIG_MEMORY_HOTREMOVE 370 + kernel configuration option.
183 371 =================== ============================================================ 184 372 185 373 .. note:: 186 374 187 - These directories/files appear after physical memory hotplug phase. 375 + If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/ 376 + directories can also be accessed via symbolic links located in the 377 + ``/sys/devices/system/node/node*`` directories. 188 378 189 - If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed 190 - via symbolic links located in the ``/sys/devices/system/node/node*`` directories. 191 - 192 - For example:: 379 + For example:: 193 380 194 381 /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 195 382 196 - A backlink will also be created:: 383 + A backlink will also be created:: 197 384 198 385 /sys/devices/system/memory/memory9/node0 -> ../../node/node0 199 386 200 - .. _memory_hotplug_physical_mem: 387 + Command Line Parameters 388 + ----------------------- 201 389 202 - Physical memory hot-add phase 203 - ============================= 390 + Some command line parameters affect memory hot(un)plug handling. The following 391 + command line parameters are relevant: 204 392 205 - Hardware(Firmware) Support 206 - -------------------------- 393 + ======================== ======================================================= 394 + ``memhp_default_state`` configure auto-onlining by essentially setting 395 + ``/sys/devices/system/memory/auto_online_blocks``. 396 + ``movablecore`` configure automatic zone selection of the kernel. When 397 + set, the kernel will default to ZONE_MOVABLE, unless 398 + other zones can be kept contiguous. 399 + ======================== ======================================================= 207 400 208 - On x86_64/ia64 platform, memory hotplug by ACPI is supported. 401 + Module Parameters 402 + ------------------ 209 403 210 - In general, the firmware (ACPI) which supports memory hotplug defines 211 - memory class object of _HID "PNP0C80". 
When a notify is asserted to PNP0C80, 212 - Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev 213 - script. This will be done automatically. 404 + Instead of additional command line parameters or sysfs files, the 405 + ``memory_hotplug`` subsystem now provides a dedicated namespace for module 406 + parameters. Module parameters can be set via the command line by prefixing 407 + them with ``memory_hotplug.`` such as:: 214 408 215 - But scripts for memory hotplug are not contained in generic udev package(now). 216 - You may have to write it by yourself or online/offline memory by hand. 217 - Please see :ref:`memory_hotplug_how_to_online_memory` and 218 - :ref:`memory_hotplug_how_to_offline_memory`. 409 + memory_hotplug.memmap_on_memory=1 219 410 220 - If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004", 221 - "PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler 222 - calls hotplug code for all of objects which are defined in it. 223 - If memory device is found, memory hotplug code will be called. 411 + and they can be observed (and some even modified at runtime) via:: 224 412 225 - Notify memory hot-add event by hand 226 - ----------------------------------- 413 + /sys/module/memory_hotplug/parameters/ 227 414 228 - On some architectures, the firmware may not notify the kernel of a memory 229 - hotplug event. Therefore, the memory "probe" interface is supported to 230 - explicitly notify the kernel. This interface depends on 231 - CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86 232 - if hotplug is supported, although for x86 this should be handled by ACPI 233 - notification.
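As an illustration of the module parameter interface described above, the sketch below reads a ``memory_hotplug`` parameter from sysfs. The ``PARAM_DIR`` override and the ``read_hotplug_param`` helper name are assumptions added here purely for illustration, so the helper can also be exercised on systems that do not expose the parameter:

```shell
# Read a memory_hotplug module parameter from sysfs; print "unavailable"
# when the kernel does not expose it (e.g. the option is not compiled in).
PARAM_DIR=${PARAM_DIR:-/sys/module/memory_hotplug/parameters}

read_hotplug_param() {
    if [ -r "$PARAM_DIR/$1" ]; then
        cat "$PARAM_DIR/$1"
    else
        echo unavailable
    fi
}
```

On a kernel that exposes the parameter, ``read_hotplug_param memmap_on_memory`` prints the boolean value (``Y`` or ``N``).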
415 + The following module parameters are currently defined: 234 416 235 - Probe interface is located at:: 417 + ======================== ======================================================= 418 + ``memmap_on_memory`` read-write: Allocate memory for the memmap from the 419 + added memory block itself. Even if enabled, actual 420 + support depends on various other system properties and 421 + should only be regarded as a hint whether the behavior 422 + would be desired. 236 423 237 - /sys/devices/system/memory/probe 424 + While allocating the memmap from the memory block 425 + itself makes memory hotplug less likely to fail and 426 + keeps the memmap on the same NUMA node in any case, it 427 + can fragment physical memory in a way that huge pages 428 + in bigger granularity cannot be formed on hotplugged 429 + memory. 430 + ======================== ======================================================= 238 431 239 - You can tell the physical address of new memory to the kernel by:: 432 + ZONE_MOVABLE 433 + ============ 240 434 241 - % echo start_address_of_new_memory > /sys/devices/system/memory/probe 435 + ZONE_MOVABLE is an important mechanism for more reliable memory offlining. 436 + Further, having system RAM managed by ZONE_MOVABLE instead of one of the 437 + kernel zones can increase the number of possible transparent huge pages and 438 + dynamically allocated huge pages. 242 439 243 - Then, [start_address_of_new_memory, start_address_of_new_memory + 244 - memory_block_size] memory range is hot-added. In this case, hotplug script is 245 - not called (in current implementation). You'll have to online memory by 246 - yourself. Please see :ref:`memory_hotplug_how_to_online_memory`. 440 + Most kernel allocations are unmovable. Important examples include the memory 441 + map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations 442 + can only be served from the kernel zones.
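The "1/64th of memory" figure for the memory map follows from one ``struct page`` of metadata per base page; a back-of-the-envelope sketch, assuming the common values of 64 bytes of metadata per 4096-byte page (both are configuration-dependent, not guarantees):

```shell
# Memmap overhead estimate: 64 bytes of metadata per 4096-byte page,
# i.e. 64/4096 = 1/64 of RAM. Example for 64 GiB of RAM:
ram_mib=$((64 * 1024))                # 64 GiB expressed in MiB
memmap_mib=$((ram_mib * 64 / 4096))   # metadata needed for that much RAM
echo "$memmap_mib MiB of memmap for $ram_mib MiB of RAM"
# prints: 1024 MiB of memmap for 65536 MiB of RAM
```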
247 443 248 - Logical Memory hot-add phase 249 - ============================ 444 + Most user space pages, such as anonymous memory, and page cache pages are 445 + movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones. 250 446 251 - State of memory 447 + Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable 448 + allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is 449 + absolutely no guarantee whether a memory block can be offlined successfully. 450 + 451 + Zone Imbalances 252 452 --------------- 253 453 254 - To see (online/offline) state of a memory block, read 'state' file:: 454 + Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance, 455 + which can harm the system or degrade performance. As one example, the kernel 456 + might crash because it runs out of free memory for unmovable allocations, 457 + although there is still plenty of free memory left in ZONE_MOVABLE. 255 458 256 - % cat /sys/device/system/memory/memoryXXX/state 459 + Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1 460 + are definitely impossible due to the overhead for the memory map. 257 461 258 - 259 - - If the memory block is online, you'll read "online". 260 - - If the memory block is offline, you'll read "offline". 261 - 262 - 263 - .. _memory_hotplug_how_to_online_memory: 264 - 265 - How to online memory 266 - -------------------- 267 - 268 - When the memory is hot-added, the kernel decides whether or not to "online" 269 - it according to the policy which can be read from "auto_online_blocks" file:: 270 - 271 - % cat /sys/devices/system/memory/auto_online_blocks 272 - 273 - The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config 274 - option. If it is disabled the default is "offline" which means the newly added 275 - memory is not in a ready-to-use state and you have to "online" the newly added 276 - memory blocks manually. 
Automatic onlining can be requested by writing "online" 277 - to "auto_online_blocks" file:: 278 - 279 - % echo online > /sys/devices/system/memory/auto_online_blocks 280 - 281 - This sets a global policy and impacts all memory blocks that will subsequently 282 - be hotplugged. Currently offline blocks keep their state. It is possible, under 283 - certain circumstances, that some memory blocks will be added but will fail to 284 - online. User space tools can check their "state" files 285 - (``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually. 286 - 287 - If the automatic onlining wasn't requested, failed, or some memory block was 288 - offlined it is possible to change the individual block's state by writing to the 289 - "state" file:: 290 - 291 - % echo online > /sys/devices/system/memory/memoryXXX/state 292 - 293 - This onlining will not change the ZONE type of the target memory block, 294 - If the memory block doesn't belong to any zone an appropriate kernel zone 295 - (usually ZONE_NORMAL) will be used unless movable_node kernel command line 296 - option is specified when ZONE_MOVABLE will be used. 297 - 298 - You can explicitly request to associate it with ZONE_MOVABLE by:: 299 - 300 - % echo online_movable > /sys/devices/system/memory/memoryXXX/state 301 - 302 - .. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE 303 - 304 - Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by:: 305 - 306 - % echo online_kernel > /sys/devices/system/memory/memoryXXX/state 307 - 308 - .. note:: current limit: this memory block must be adjacent to ZONE_NORMAL 309 - 310 - An explicit zone onlining can fail (e.g. when the range is already within 311 - and existing and incompatible zone already). 312 - 313 - After this, memory block XXX's state will be 'online' and the amount of 314 - available memory will be increased. 315 - 316 - This may be changed in future. 
317 - 318 - Logical memory remove 319 - ===================== 320 - 321 - Memory offline and ZONE_MOVABLE 322 - ------------------------------- 323 - 324 - Memory offlining is more complicated than memory online. Because memory offline 325 - has to make the whole memory block be unused, memory offline can fail if 326 - the memory block includes memory which cannot be freed. 327 - 328 - In general, memory offline can use 2 techniques. 329 - 330 - (1) reclaim and free all memory in the memory block. 331 - (2) migrate all pages in the memory block. 332 - 333 - In the current implementation, Linux's memory offline uses method (2), freeing 334 - all pages in the memory block by page migration. But not all pages are 335 - migratable. Under current Linux, migratable pages are anonymous pages and 336 - page caches. For offlining a memory block by migration, the kernel has to 337 - guarantee that the memory block contains only migratable pages. 338 - 339 - Now, a boot option for making a memory block which consists of migratable pages 340 - is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can 341 - create ZONE_MOVABLE...a zone which is just used for movable pages. 342 - (See also Documentation/admin-guide/kernel-parameters.rst) 343 - 344 - Assume the system has "TOTAL" amount of memory at boot time, this boot option 345 - creates ZONE_MOVABLE as following. 346 - 347 - 1) When kernelcore=YYYY boot option is used, 348 - Size of memory not for movable pages (not for offline) is YYYY. 349 - Size of memory for movable pages (for offline) is TOTAL-YYYY. 350 - 351 - 2) When movablecore=ZZZZ boot option is used, 352 - Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ. 353 - Size of memory for movable pages (for offline) is ZZZZ. 462 + Actual safe zone ratios depend on the workload. Extreme cases, like excessive 463 + long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all. 354 464 355 465 .. 
note:: 356 466 357 - Unfortunately, there is no information to show which memory block belongs 358 - to ZONE_MOVABLE. This is TBD. 467 + CMA memory part of a kernel zone essentially behaves like memory in 468 + ZONE_MOVABLE and similar considerations apply, especially when combining 469 + CMA with ZONE_MOVABLE. 359 470 360 - Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE 361 - and the feature of freeing unused vmemmap pages associated with each hugetlb 362 - page is enabled. 471 + ZONE_MOVABLE Sizing Considerations 472 + ---------------------------------- 363 473 364 - This can happen when we have plenty of ZONE_MOVABLE memory, but not enough 365 - kernel memory to allocate vmemmmap pages. We may even be able to migrate 366 - huge page contents, but will not be able to dissolve the source huge page. 367 - This will prevent an offline operation and is unfortunate as memory offlining 368 - is expected to succeed on movable zones. Users that depend on memory hotplug 369 - to succeed for movable zones should carefully consider whether the memory 370 - savings gained from this feature are worth the risk of possibly not being 371 - able to offline memory in certain situations. 474 + We usually expect that a large portion of available system RAM will actually 475 + be consumed by user space, either directly or indirectly via the page cache. In 476 + the normal case, ZONE_MOVABLE can be used when allocating such pages just fine. 372 477 373 - .. note:: 374 - Techniques that rely on long-term pinnings of memory (especially, RDMA and 375 - vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory 376 - hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that 377 - memory can still get hot removed - be aware that pinning can fail even if 378 - there is plenty of free memory in ZONE_MOVABLE. 
In addition, using 379 - ZONE_MOVABLE might make page pinning more expensive, because pages have to be 380 - migrated off that zone first. 478 + With that in mind, it makes sense that we can have a big portion of system RAM 479 + managed by ZONE_MOVABLE. However, there are some things to consider when using 480 + ZONE_MOVABLE, especially when fine-tuning zone ratios: 381 481 382 - .. _memory_hotplug_how_to_offline_memory: 482 + - Having a lot of offline memory blocks. Even offline memory blocks consume 483 + memory for metadata and page tables in the direct map; having a lot of offline 484 + memory blocks is not a typical case, though. 383 485 384 - How to offline memory 385 - --------------------- 486 + - Memory ballooning without balloon compaction is incompatible with 487 + ZONE_MOVABLE. Only some implementations, such as virtio-balloon and 488 + pseries CMM, fully support balloon compaction. 386 489 387 - You can offline a memory block by using the same sysfs interface that was used 388 - in memory onlining:: 490 + Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be 491 + disabled. In that case, balloon inflation will only perform unmovable 492 + allocations and silently create a zone imbalance, usually triggered by 493 + inflation requests from the hypervisor. 389 494 390 - % echo offline > /sys/devices/system/memory/memoryXXX/state 495 + - Gigantic pages are unmovable, resulting in user space consuming a 496 + lot of unmovable memory. 391 497 392 - If offline succeeds, the state of the memory block is changed to be "offline". 393 - If it fails, some error core (like -EBUSY) will be returned by the kernel. 394 - Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline 395 - it. If it doesn't contain 'unmovable' memory, you'll get success. 498 + - Huge pages are unmovable when an architecture does not support huge 499 + page migration, resulting in a similar issue as with gigantic pages.
396 500 397 - A memory block under ZONE_MOVABLE is considered to be able to be offlined 398 - easily. But under some busy state, it may return -EBUSY. Even if a memory 399 - block cannot be offlined due to -EBUSY, you can retry offlining it and may be 400 - able to offline it (or not). (For example, a page is referred to by some kernel 401 - internal call and released soon.) 501 + - Page tables are unmovable. Excessive swapping, mapping extremely large 502 + files or ZONE_DEVICE memory can be problematic, although only really relevant 503 + in corner cases. When we manage a lot of user space memory that has been 504 + swapped out or is served from a file/persistent memory/... we still need a lot 505 + of page tables to manage that memory once user space accessed that memory. 402 506 403 - Consideration: 404 - Memory hotplug's design direction is to make the possibility of memory 405 - offlining higher and to guarantee unplugging memory under any situation. But 406 - it needs more work. Returning -EBUSY under some situation may be good because 407 - the user can decide to retry more or not by himself. Currently, memory 408 - offlining code does some amount of retry with 120 seconds timeout. 507 + - In certain DAX configurations the memory map for the device memory will be 508 + allocated from the kernel zones. 409 509 410 - Physical memory remove 411 - ====================== 510 + - KASAN can have a significant memory overhead, for example, consuming 1/8th of 511 + the total system memory size as (unmovable) tracking metadata. 412 512 413 - Need more implementation yet.... 414 - - Notification completion of remove works by OS to firmware. 415 - - Guard from remove if not yet. 513 + - Long-term pinning of pages. Techniques that rely on long-term pinnings 514 + (especially, RDMA and vfio/mdev) are fundamentally problematic with 515 + ZONE_MOVABLE, and therefore, memory offlining. 
Pinned pages cannot reside 516 + on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they 517 + have to be migrated off that zone while pinning. Pinning a page can fail 518 + even if there is plenty of free memory in ZONE_MOVABLE. 416 519 520 + In addition, using ZONE_MOVABLE might make page pinning more expensive, 521 + because of the page migration overhead. 417 522 418 - Locking Internals 419 - ================= 523 + By default, all the memory configured at boot time is managed by the kernel 524 + zones and ZONE_MOVABLE is not used. 420 525 421 - When adding/removing memory that uses memory block devices (i.e. ordinary RAM), 422 - the device_hotplug_lock should be held to: 526 + To enable ZONE_MOVABLE to include the memory present at boot and to control the 527 + ratio between movable and kernel zones there are two command line options: 528 + ``kernelcore=`` and ``movablecore=``. See 529 + Documentation/admin-guide/kernel-parameters.rst for their description. 423 530 424 - - synchronize against online/offline requests (e.g. via sysfs). This way, memory 425 - block devices can only be accessed (.online/.state attributes) by user 426 - space once memory has been fully added. And when removing memory, we 427 - know nobody is in critical sections. 428 - - synchronize against CPU hotplug and similar (e.g. 
relevant for ACPI and PPC) 531 + Memory Offlining and ZONE_MOVABLE 532 + --------------------------------- 429 533 430 - Especially, there is a possible lock inversion that is avoided using 431 - device_hotplug_lock when adding memory and user space tries to online that 432 - memory faster than expected: 534 + Even with ZONE_MOVABLE, there are some corner cases where offlining a memory 535 + block might fail: 433 536 434 - - device_online() will first take the device_lock(), followed by 435 - mem_hotplug_lock 436 - - add_memory_resource() will first take the mem_hotplug_lock, followed by 437 - the device_lock() (while creating the devices, during bus_add_device()). 537 + - Memory blocks with memory holes; this applies to memory blocks present during 538 + boot and can apply to memory blocks hotplugged via the XEN balloon and the 539 + Hyper-V balloon. 438 540 439 - As the device is visible to user space before taking the device_lock(), this 440 - can result in a lock inversion. 541 + - Mixed NUMA nodes and mixed zones within a single memory block prevent memory 542 + offlining; this applies to memory blocks present during boot only. 441 543 442 - onlining/offlining of memory should be done via device_online()/ 443 - device_offline() - to make sure it is properly synchronized to actions 444 - via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type) 544 + - Special memory blocks prevented by the system from getting offlined. Examples 545 + include any memory available during boot on arm64 or memory blocks spanning 546 + the crashkernel area on s390x; this usually applies to memory blocks present 547 + during boot only. 445 548 446 - When adding/removing/onlining/offlining memory or adding/removing 447 - heterogeneous/device memory, we should always hold the mem_hotplug_lock in 448 - write mode to serialise memory hotplug (e.g. access to global/zone 449 - variables). 
549 + - Memory blocks overlapping with CMA areas cannot be offlined, this applies to 550 + memory blocks present during boot only. 450 551 451 - In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read 452 - mode allows for a quite efficient get_online_mems/put_online_mems 453 - implementation, so code accessing memory can protect from that memory 454 - vanishing. 552 + - Concurrent activity that operates on the same physical memory area, such as 553 + allocating gigantic pages, can result in temporary offlining failures. 455 554 555 + - Out of memory when dissolving huge pages, especially when freeing unused 556 + vmemmap pages associated with each hugetlb page is enabled. 456 557 457 - Future Work 458 - =========== 558 + Offlining code may be able to migrate huge page contents, but may not be able 559 + to dissolve the source huge page because it fails allocating (unmovable) pages 560 + for the vmemmap, because the system might not have free memory in the kernel 561 + zones left. 459 562 460 - - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like 461 - sysctl or new control file. 462 - - showing memory block and physical device relationship. 463 - - test and make it better memory offlining. 464 - - support HugeTLB page migration and offlining. 465 - - memmap removing at memory offline. 466 - - physical remove memory. 563 + Users that depend on memory offlining to succeed for movable zones should 564 + carefully consider whether the memory savings gained from this feature are 565 + worth the risk of possibly not being able to offline memory in certain 566 + situations. 567 + 568 + Further, when running into out of memory situations while migrating pages, or 569 + when still encountering permanently unmovable pages within ZONE_MOVABLE 570 + (-> BUG), memory offlining will keep retrying until it eventually succeeds. 
571 + 572 + When offlining is triggered from user space, the offlining context can be 573 + terminated by sending a fatal signal. A timeout based offlining can easily be 574 + implemented via:: 575 + 576 + % timeout $TIMEOUT offline_block | failure_handling
+53 -45
Documentation/dev-tools/kfence.rst
··· 65 65 A typical out-of-bounds access looks like this:: 66 66 67 67 ================================================================== 68 - BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa3/0x22b 68 + BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234 69 69 70 - Out-of-bounds read at 0xffffffffb672efff (1B left of kfence-#17): 71 - test_out_of_bounds_read+0xa3/0x22b 72 - kunit_try_run_case+0x51/0x85 70 + Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72): 71 + test_out_of_bounds_read+0xa6/0x234 72 + kunit_try_run_case+0x61/0xa0 73 73 kunit_generic_run_threadfn_adapter+0x16/0x30 74 - kthread+0x137/0x160 74 + kthread+0x176/0x1b0 75 75 ret_from_fork+0x22/0x30 76 76 77 - kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated by task 507: 78 - test_alloc+0xf3/0x25b 79 - test_out_of_bounds_read+0x98/0x22b 80 - kunit_try_run_case+0x51/0x85 77 + kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32 78 + 79 + allocated by task 484 on cpu 0 at 32.919330s: 80 + test_alloc+0xfe/0x738 81 + test_out_of_bounds_read+0x9b/0x234 82 + kunit_try_run_case+0x61/0xa0 81 83 kunit_generic_run_threadfn_adapter+0x16/0x30 82 - kthread+0x137/0x160 84 + kthread+0x176/0x1b0 83 85 ret_from_fork+0x22/0x30 84 86 85 - CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7 86 - Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 87 + CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7 88 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 87 89 ================================================================== 88 90 89 91 The header of the report provides a short summary of the function involved in ··· 98 96 ================================================================== 99 97 BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143 100 98 101 - Use-after-free read at 0xffffffffb673dfe0 (in 
kfence-#24): 99 + Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79): 102 100 test_use_after_free_read+0xb3/0x143 103 - kunit_try_run_case+0x51/0x85 101 + kunit_try_run_case+0x61/0xa0 104 102 kunit_generic_run_threadfn_adapter+0x16/0x30 105 - kthread+0x137/0x160 103 + kthread+0x176/0x1b0 106 104 ret_from_fork+0x22/0x30 107 105 108 - kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated by task 507: 109 - test_alloc+0xf3/0x25b 106 + kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32 107 + 108 + allocated by task 488 on cpu 2 at 33.871326s: 109 + test_alloc+0xfe/0x738 110 110 test_use_after_free_read+0x76/0x143 111 - kunit_try_run_case+0x51/0x85 111 + kunit_try_run_case+0x61/0xa0 112 112 kunit_generic_run_threadfn_adapter+0x16/0x30 113 - kthread+0x137/0x160 113 + kthread+0x176/0x1b0 114 114 ret_from_fork+0x22/0x30 115 115 116 - freed by task 507: 116 + freed by task 488 on cpu 2 at 33.871358s: 117 117 test_use_after_free_read+0xa8/0x143 118 - kunit_try_run_case+0x51/0x85 118 + kunit_try_run_case+0x61/0xa0 119 119 kunit_generic_run_threadfn_adapter+0x16/0x30 120 - kthread+0x137/0x160 120 + kthread+0x176/0x1b0 121 121 ret_from_fork+0x22/0x30 122 122 123 - CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 124 - Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 123 + CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7 124 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 125 125 ================================================================== 126 126 127 127 KFENCE also reports on invalid frees, such as double-frees:: ··· 131 127 ================================================================== 132 128 BUG: KFENCE: invalid free in test_double_free+0xdc/0x171 133 129 134 - Invalid free of 0xffffffffb6741000: 130 + Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81): 135 131 test_double_free+0xdc/0x171 136 - 
kunit_try_run_case+0x51/0x85 132 + kunit_try_run_case+0x61/0xa0 137 133 kunit_generic_run_threadfn_adapter+0x16/0x30 138 - kthread+0x137/0x160 134 + kthread+0x176/0x1b0 139 135 ret_from_fork+0x22/0x30 140 136 141 - kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated by task 507: 142 - test_alloc+0xf3/0x25b 137 + kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32 138 + 139 + allocated by task 490 on cpu 1 at 34.175321s: 140 + test_alloc+0xfe/0x738 143 141 test_double_free+0x76/0x171 144 - kunit_try_run_case+0x51/0x85 142 + kunit_try_run_case+0x61/0xa0 145 143 kunit_generic_run_threadfn_adapter+0x16/0x30 146 - kthread+0x137/0x160 144 + kthread+0x176/0x1b0 147 145 ret_from_fork+0x22/0x30 148 146 149 - freed by task 507: 147 + freed by task 490 on cpu 1 at 34.175348s: 150 148 test_double_free+0xa8/0x171 151 - kunit_try_run_case+0x51/0x85 149 + kunit_try_run_case+0x61/0xa0 152 150 kunit_generic_run_threadfn_adapter+0x16/0x30 153 - kthread+0x137/0x160 151 + kthread+0x176/0x1b0 154 152 ret_from_fork+0x22/0x30 155 153 156 - CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 157 - Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 154 + CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7 155 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 158 156 ================================================================== 159 157 160 158 KFENCE also uses pattern-based redzones on the other side of an object's guard ··· 166 160 ================================================================== 167 161 BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184 168 162 169 - Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ] (in kfence-#69): 163 + Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . 
] (in kfence-#156): 170 164 test_kmalloc_aligned_oob_write+0xef/0x184 171 - kunit_try_run_case+0x51/0x85 165 + kunit_try_run_case+0x61/0xa0 172 166 kunit_generic_run_threadfn_adapter+0x16/0x30 173 - kthread+0x137/0x160 167 + kthread+0x176/0x1b0 174 168 ret_from_fork+0x22/0x30 175 169 176 - kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated by task 507: 177 - test_alloc+0xf3/0x25b 170 + kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96 171 + 172 + allocated by task 502 on cpu 7 at 42.159302s: 173 + test_alloc+0xfe/0x738 178 174 test_kmalloc_aligned_oob_write+0x57/0x184 179 - kunit_try_run_case+0x51/0x85 175 + kunit_try_run_case+0x61/0xa0 180 176 kunit_generic_run_threadfn_adapter+0x16/0x30 181 - kthread+0x137/0x160 177 + kthread+0x176/0x1b0 182 178 ret_from_fork+0x22/0x30 183 179 184 - CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 185 - Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 180 + CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7 181 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 186 182 ================================================================== 187 183 188 184 For such errors, the address where the corruption occurred as well as the
+3 -2
Documentation/kbuild/llvm.rst
··· 130 130 ------------ 131 131 132 132 - `Website <https://clangbuiltlinux.github.io/>`_ 133 - - `Mailing List <https://groups.google.com/forum/#!forum/clang-built-linux>`_: <clang-built-linux@googlegroups.com> 133 + - `Mailing List <https://lore.kernel.org/llvm/>`_: <llvm@lists.linux.dev> 134 + - `Old Mailing List Archives <https://groups.google.com/g/clang-built-linux>`_ 134 135 - `Issue Tracker <https://github.com/ClangBuiltLinux/linux/issues>`_ 135 - - IRC: #clangbuiltlinux on chat.freenode.net 136 + - IRC: #clangbuiltlinux on irc.libera.chat 136 137 - `Telegram <https://t.me/ClangBuiltLinux>`_: @ClangBuiltLinux 137 138 - `Wiki <https://github.com/ClangBuiltLinux/linux/wiki>`_ 138 139 - `Beginner Bugs <https://github.com/ClangBuiltLinux/linux/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22>`_
+20
Documentation/vm/damon/api.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ============= 4 + API Reference 5 + ============= 6 + 7 + Kernel space programs can use every feature of DAMON using the below APIs. All 8 + you need to do is include ``damon.h``, which is located in ``include/linux/`` of 9 + the source tree. 10 + 11 + Structures 12 + ========== 13 + 14 + .. kernel-doc:: include/linux/damon.h 15 + 16 + 17 + Functions 18 + ========= 19 + 20 + .. kernel-doc:: mm/damon/core.c
+166
Documentation/vm/damon/design.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ====== 4 + Design 5 + ====== 6 + 7 + Configurable Layers 8 + =================== 9 + 10 + DAMON provides data access monitoring functionality while making the accuracy 11 + and the overhead controllable. The fundamental access monitoring requires 12 + primitives that depend on, and are optimized for, the target address space. On 13 + the other hand, the accuracy and overhead tradeoff mechanism, which is the core 14 + of DAMON, is in the pure logic space. DAMON separates the two parts in 15 + different layers and defines its interface to allow various low level 16 + primitive implementations to be configured with the core logic. 17 + 18 + Due to this separated design and the configurable interface, users can extend 19 + DAMON for any address space by configuring the core logic with appropriate low 20 + level primitive implementations. If an appropriate one is not provided, users 21 + can implement the primitives on their own. 22 + 23 + For example, physical memory, virtual memory, swap space, those for specific 24 + processes, NUMA nodes, files, and backing memory devices would be supportable. 25 + Also, if some architectures or devices support special optimized access check 26 + primitives, those will be easily configurable. 27 + 28 + 29 + Reference Implementations of Address Space Specific Primitives 30 + ============================================================== 31 + 32 + The low level primitives for the fundamental access monitoring are defined in 33 + two parts: 34 + 35 + 1. Identification of the monitoring target address range for the address space. 36 + 2. Access check of a specific address range in the target space. 37 + 38 + DAMON currently provides the implementation of the primitives for only the 39 + virtual address space. The below two subsections describe how they work. 
40 + 41 + 42 + VMA-based Target Address Range Construction 43 + ------------------------------------------- 44 + 45 + Only small parts of the super-huge virtual address space of a process are 46 + mapped to physical memory and accessed. Thus, tracking the unmapped 47 + address regions is just wasteful. However, because DAMON can deal with some 48 + level of noise using the adaptive regions adjustment mechanism, tracking every 49 + mapping is not strictly required, and could even incur a high overhead in some 50 + cases. That said, excessively large unmapped areas inside the monitoring target 51 + should be removed so that the adaptive mechanism does not waste time on them. 52 + 53 + For this reason, this implementation converts the complex mappings to three 54 + distinct regions that cover every mapped area of the address space. The two 55 + gaps between the three regions are the two biggest unmapped areas in the given 56 + address space. In most cases, the two biggest unmapped areas are the gap between 57 + the heap and the uppermost mmap()-ed region, and the gap between the lowermost 58 + mmap()-ed region and the stack. Because these gaps are 59 + exceptionally large in usual address spaces, excluding them is sufficient 60 + to make a reasonable trade-off. The below diagram shows this in detail:: 61 + 62 + <heap> 63 + <BIG UNMAPPED REGION 1> 64 + <uppermost mmap()-ed region> 65 + (small mmap()-ed regions and munmap()-ed regions) 66 + <lowermost mmap()-ed region> 67 + <BIG UNMAPPED REGION 2> 68 + <stack> 69 + 70 + 71 + PTE Accessed-bit Based Access Check 72 + ----------------------------------- 73 + 74 + The implementation for the virtual address space uses the PTE Accessed bit for 75 + basic access checks. It finds the relevant PTE Accessed bit for an address 76 + by walking the page table of the target task. 
In this way, the 77 + implementation finds and clears the bit for the next sampling target address and 78 + checks whether the bit is set again after one sampling period. This could disturb 79 + other kernel subsystems using the Accessed bits, namely Idle page tracking and 80 + the reclaim logic. To avoid such disturbances, DAMON makes it mutually 81 + exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page 82 + flags to solve the conflict with the reclaim logic, as Idle page tracking does. 83 + 84 + 85 + Address Space Independent Core Mechanisms 86 + ========================================= 87 + 88 + The below four sections describe each of the DAMON core mechanisms and the five 89 + monitoring attributes, ``sampling interval``, ``aggregation interval``, 90 + ``regions update interval``, ``minimum number of regions``, and ``maximum 91 + number of regions``. 92 + 93 + 94 + Access Frequency Monitoring 95 + --------------------------- 96 + 97 + The output of DAMON tells which pages are accessed how frequently for a given 98 + duration. The resolution of the access frequency is controlled by setting 99 + ``sampling interval`` and ``aggregation interval``. In detail, DAMON checks 100 + access to each page per ``sampling interval`` and aggregates the results. In 101 + other words, it counts the number of accesses to each page. After each 102 + ``aggregation interval`` passes, DAMON calls callback functions that were 103 + previously registered by users so that users can read the aggregated results, 104 + and then clears the results. 
This can be described by the below simple pseudo-code:: 105 + 106 + while monitoring_on: 107 + for page in monitoring_target: 108 + if accessed(page): 109 + nr_accesses[page] += 1 110 + if time() % aggregation_interval == 0: 111 + for callback in user_registered_callbacks: 112 + callback(monitoring_target, nr_accesses) 113 + for page in monitoring_target: 114 + nr_accesses[page] = 0 115 + sleep(sampling_interval) 116 + 117 + The monitoring overhead of this mechanism grows without bound as the 118 + size of the target workload grows. 119 + 120 + 121 + Region Based Sampling 122 + --------------------- 123 + 124 + To avoid the unbounded increase of the overhead, DAMON groups adjacent pages 125 + that are assumed to have the same access frequencies into a region. As long as 126 + the assumption (pages in a region have the same access frequencies) holds, only 127 + one page in the region needs to be checked. Thus, for each ``sampling 128 + interval``, DAMON randomly picks one page in each region, waits for one 129 + ``sampling interval``, checks whether the page was accessed in the meantime, and 130 + increases the access frequency of the region if so. Therefore, the monitoring 131 + overhead is controllable by setting the number of regions. DAMON allows users 132 + to set the minimum and the maximum number of regions for the trade-off. 133 + 134 + This scheme, however, cannot preserve the quality of the output if the 135 + assumption does not hold. 136 + 137 + 138 + Adaptive Regions Adjustment 139 + --------------------------- 140 + 141 + Even if the initial monitoring target regions are well constructed to 142 + fulfill the assumption (pages in the same region have similar access frequencies), 143 + the data access pattern can change dynamically. This will result in low 144 + monitoring quality. To keep the assumption valid as much as possible, DAMON 145 + adaptively merges and splits each region based on its access frequency. 
146 + 147 + For each ``aggregation interval``, it compares the access frequencies of 148 + adjacent regions and merges them if the frequency difference is small. Then, 149 + after it reports and clears the aggregated access frequency of each region, it 150 + splits each region into two or three regions if the total number of regions 151 + will not exceed the user-specified maximum number of regions after the split. 152 + 153 + In this way, DAMON provides its best-effort quality and minimal overhead while 154 + keeping the bounds users set for their trade-off. 155 + 156 + 157 + Dynamic Target Space Updates Handling 158 + ------------------------------------- 159 + 160 + The monitoring target address range could be dynamically changed. For example, 161 + virtual memory could be dynamically mapped and unmapped. Physical memory could 162 + be hot-plugged. 163 + 164 + As the changes could be quite frequent in some cases, DAMON checks the dynamic 165 + memory mapping changes and applies them to the abstracted target area only once 166 + per user-specified time interval (``regions update interval``).
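The adaptive merge/split cycle described in the design text above can be sketched in the same Python-like style as the doc's own sampling pseudo-code. This is only an illustrative sketch of the design, not the kernel's implementation; ``Region``, ``merge_regions``, and ``split_regions`` are hypothetical names, and for brevity regions are split only in two rather than "two or three".

```python
class Region:
    """A contiguous address range with an aggregated access frequency."""
    def __init__(self, start, end, nr_accesses=0):
        self.start = start          # inclusive start address
        self.end = end              # exclusive end address
        self.nr_accesses = nr_accesses

def merge_regions(regions, threshold):
    """Merge adjacent regions whose access frequencies differ by at most
    `threshold`, as the design does once per aggregation interval."""
    if not regions:
        return []
    merged = [regions[0]]
    for region in regions[1:]:
        last = merged[-1]
        if abs(last.nr_accesses - region.nr_accesses) <= threshold:
            # Size-weighted average keeps the merged frequency estimate fair.
            total = (last.end - last.start) + (region.end - region.start)
            last.nr_accesses = (
                last.nr_accesses * (last.end - last.start) +
                region.nr_accesses * (region.end - region.start)) // total
            last.end = region.end
        else:
            merged.append(region)
    return merged

def split_regions(regions, max_nr_regions):
    """Split each region in two, unless that would exceed the user-specified
    maximum number of regions."""
    if len(regions) * 2 > max_nr_regions:
        return regions
    out = []
    for region in regions:
        mid = (region.start + region.end) // 2
        if region.start < mid < region.end:
            out.append(Region(region.start, mid, region.nr_accesses))
            out.append(Region(mid, region.end, region.nr_accesses))
        else:
            out.append(region)
    return out
```

Merging bounds the region count from below by frequency similarity, while the ``max_nr_regions`` check bounds it from above, mirroring the minimum/maximum trade-off knobs described in the text.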
+51
Documentation/vm/damon/faq.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========================== 4 + Frequently Asked Questions 5 + ========================== 6 + 7 + Why a new subsystem, instead of extending perf or other user space tools? 8 + ========================================================================= 9 + 10 + First, because it needs to be as lightweight as possible so that it can be 11 + used online, any unnecessary overhead such as the kernel/user space context 12 + switching cost should be avoided. Second, DAMON aims to be used by other 13 + programs including the kernel. Therefore, having a dependency on specific 14 + tools like perf is not desirable. These are the two biggest reasons why DAMON 15 + is implemented in the kernel space. 16 + 17 + 18 + Can 'idle pages tracking' or 'perf mem' substitute DAMON? 19 + ========================================================= 20 + 21 + Idle page tracking is a low level primitive for access checks of the physical 22 + address space. 'perf mem' is similar, though it can use sampling to minimize 23 + the overhead. On the other hand, DAMON is a higher-level framework for the 24 + monitoring of various address spaces. It is focused on memory management 25 + optimization and provides sophisticated accuracy/overhead handling mechanisms. 26 + Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of 27 + DAMON's output, but cannot substitute DAMON. 28 + 29 + 30 + Does DAMON support virtual memory only? 31 + ======================================= 32 + 33 + No. The core of DAMON is address space independent. The address space 34 + specific low level primitive parts, including monitoring target region 35 + construction and actual access checks, can be implemented and configured on the 36 + DAMON core by the users. In this way, DAMON users can monitor any address 37 + space with any access check technique. 
38 + 39 + Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based 40 + implementations of the address space dependent functions for virtual memory 41 + by default, as a reference and for convenient use. In the near future, we will 42 + provide those for the physical memory address space. 43 + 44 + 45 + Can I simply monitor page granularity? 46 + ====================================== 47 + 48 + Yes. You can do so by setting the ``min_nr_regions`` attribute higher than the 49 + working set size divided by the page size. Because the monitoring target 50 + region size is forced to be ``>=page size``, the region split will have no 51 + effect.
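The arithmetic in that last answer can be made concrete with a tiny helper. This is a hypothetical illustration (not part of DAMON's interface), assuming 4 KiB pages:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

def page_granularity_min_nr_regions(working_set_bytes, page_size=PAGE_SIZE):
    # One region per page of the working set: since a region can never be
    # smaller than a page, asking for at least this many regions means the
    # split mechanism has no effect and monitoring happens per page.
    return (working_set_bytes + page_size - 1) // page_size  # ceiling division
```

For example, a 1 GiB working set with 4 KiB pages needs ``min_nr_regions`` of at least 262144.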
+30
Documentation/vm/damon/index.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ========================== 4 + DAMON: Data Access MONitor 5 + ========================== 6 + 7 + DAMON is a data access monitoring framework subsystem for the Linux kernel. 8 + The core mechanisms of DAMON (refer to :doc:`design` for details) make it 9 + 10 + - *accurate* (the monitoring output is useful enough for DRAM level memory 11 + management; it might not be appropriate for CPU cache levels, though), 12 + - *light-weight* (the monitoring overhead is low enough to be applied online), 13 + and 14 + - *scalable* (the upper bound of the overhead remains constant regardless 15 + of the size of target workloads). 16 + 17 + Using this framework, therefore, the kernel's memory management mechanisms can 18 + make advanced decisions. Experimental memory management optimization works 19 + that incurred high data access monitoring overhead can be implemented again. 20 + In user space, meanwhile, users who have some special workloads can write 21 + personalized applications for a better understanding and optimization of their 22 + workloads and systems. 23 + 24 + .. toctree:: 25 + :maxdepth: 2 26 + 27 + faq 28 + design 29 + api 30 + plans
+1
Documentation/vm/index.rst
··· 32 32 arch_pgtable_helpers 33 33 balance 34 34 cleancache 35 + damon/index 35 36 free_page_reporting 36 37 frontswap 37 38 highmem
+13 -2
MAINTAINERS
··· 4526 4526 CLANG/LLVM BUILD SUPPORT 4527 4527 M: Nathan Chancellor <nathan@kernel.org> 4528 4528 M: Nick Desaulniers <ndesaulniers@google.com> 4529 - L: clang-built-linux@googlegroups.com 4529 + L: llvm@lists.linux.dev 4530 4530 S: Supported 4531 4531 W: https://clangbuiltlinux.github.io/ 4532 4532 B: https://github.com/ClangBuiltLinux/linux/issues ··· 4542 4542 M: Kees Cook <keescook@chromium.org> 4543 4543 R: Nathan Chancellor <nathan@kernel.org> 4544 4544 R: Nick Desaulniers <ndesaulniers@google.com> 4545 - L: clang-built-linux@googlegroups.com 4545 + L: llvm@lists.linux.dev 4546 4546 S: Supported 4547 4547 B: https://github.com/ClangBuiltLinux/linux/issues 4548 4548 T: git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/features ··· 5148 5148 F: net/ax25/ax25_out.c 5149 5149 F: net/ax25/ax25_timer.c 5150 5150 F: net/ax25/sysctl_net_ax25.c 5151 + 5152 + DATA ACCESS MONITOR 5153 + M: SeongJae Park <sjpark@amazon.de> 5154 + L: linux-mm@kvack.org 5155 + S: Maintained 5156 + F: Documentation/admin-guide/mm/damon/ 5157 + F: Documentation/vm/damon/ 5158 + F: include/linux/damon.h 5159 + F: include/trace/events/damon.h 5160 + F: mm/damon/ 5161 + F: tools/testing/selftests/damon/ 5151 5162 5152 5163 DAVICOM FAST ETHERNET (DMFE) NETWORK DRIVER 5153 5164 L: netdev@vger.kernel.org
+1 -1
arch/Kconfig
··· 889 889 bool 890 890 help 891 891 Architecture provides a function to run __do_softirq() on a 892 - seperate stack. 892 + separate stack. 893 893 894 894 config PGTABLE_LEVELS 895 895 int
+2 -2
arch/alpha/include/asm/agp.h
··· 6 6 7 7 /* dummy for now */ 8 8 9 - #define map_page_into_agp(page) 10 - #define unmap_page_from_agp(page) 9 + #define map_page_into_agp(page) do { } while (0) 10 + #define unmap_page_from_agp(page) do { } while (0) 11 11 #define flush_agp_cache() mb() 12 12 13 13 /* GATT allocation. Returns/accepts GATT kernel virtual address. */
+8 -4
arch/alpha/kernel/pci-sysfs.c
··· 60 60 * @sparse: address space type 61 61 * 62 62 * Use the bus mapping routines to map a PCI resource into userspace. 63 + * 64 + * Return: %0 on success, negative error code otherwise 63 65 */ 64 66 static int pci_mmap_resource(struct kobject *kobj, 65 67 struct bin_attribute *attr, ··· 108 106 109 107 /** 110 108 * pci_remove_resource_files - cleanup resource files 111 - * @dev: dev to cleanup 109 + * @pdev: pci_dev to cleanup 112 110 * 113 111 * If we created resource files for @dev, remove them from sysfs and 114 112 * free their resources. ··· 223 221 } 224 222 225 223 /** 226 - * pci_create_resource_files - create resource files in sysfs for @dev 227 - * @dev: dev in question 224 + * pci_create_resource_files - create resource files in sysfs for @pdev 225 + * @pdev: pci_dev in question 228 226 * 229 227 * Walk the resources in @dev creating files for each resource available. 228 + * 229 + * Return: %0 on success, or negative error code 230 230 */ 231 231 int pci_create_resource_files(struct pci_dev *pdev) 232 232 { ··· 300 296 301 297 /** 302 298 * pci_adjust_legacy_attr - adjustment of legacy file attributes 303 - * @b: bus to create files under 299 + * @bus: bus to create files under 304 300 * @mmap_type: I/O port or memory 305 301 * 306 302 * Adjust file name and size for sparse mappings.
-5
arch/arc/kernel/traps.c
··· 20 20 #include <asm/unaligned.h> 21 21 #include <asm/kprobes.h> 22 22 23 - void __init trap_init(void) 24 - { 25 - return; 26 - } 27 - 28 23 void die(const char *str, struct pt_regs *regs, unsigned long address) 29 24 { 30 25 show_kernel_fault_diag(str, regs, address);
-1
arch/arm/configs/dove_defconfig
··· 56 56 CONFIG_SATA_MV=y 57 57 CONFIG_NETDEVICES=y 58 58 CONFIG_MV643XX_ETH=y 59 - CONFIG_INPUT_POLLDEV=y 60 59 # CONFIG_INPUT_MOUSEDEV is not set 61 60 CONFIG_INPUT_EVDEV=y 62 61 # CONFIG_KEYBOARD_ATKBD is not set
-1
arch/arm/configs/pxa_defconfig
··· 284 284 CONFIG_MWIFIEX=m 285 285 CONFIG_MWIFIEX_SDIO=m 286 286 CONFIG_INPUT_FF_MEMLESS=m 287 - CONFIG_INPUT_POLLDEV=y 288 287 CONFIG_INPUT_MATRIXKMAP=y 289 288 CONFIG_INPUT_MOUSEDEV=m 290 289 CONFIG_INPUT_MOUSEDEV_SCREEN_X=640
-5
arch/arm/kernel/traps.c
··· 781 781 panic("Oops failed to kill thread"); 782 782 } 783 783 784 - void __init trap_init(void) 785 - { 786 - return; 787 - } 788 - 789 784 #ifdef CONFIG_KUSER_HELPERS 790 785 static void __init kuser_init(void *vectors) 791 786 {
+1 -2
arch/arm64/mm/mmu.c
··· 1502 1502 return ret; 1503 1503 } 1504 1504 1505 - void arch_remove_memory(int nid, u64 start, u64 size, 1506 - struct vmem_altmap *altmap) 1505 + void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 1507 1506 { 1508 1507 unsigned long start_pfn = start >> PAGE_SHIFT; 1509 1508 unsigned long nr_pages = size >> PAGE_SHIFT;
-4
arch/h8300/kernel/traps.c
··· 39 39 { 40 40 } 41 41 42 - void __init trap_init(void) 43 - { 44 - } 45 - 46 42 asmlinkage void set_esp0(unsigned long ssp) 47 43 { 48 44 current->thread.esp0 = ssp;
-4
arch/hexagon/kernel/traps.c
··· 28 28 #define TRAP_SYSCALL 1 29 29 #define TRAP_DEBUG 0xdb 30 30 31 - void __init trap_init(void) 32 - { 33 - } 34 - 35 31 #ifdef CONFIG_GENERIC_BUG 36 32 /* Maybe should resemble arch/sh/kernel/traps.c ?? */ 37 33 int is_valid_bugaddr(unsigned long addr)
+1 -2
arch/ia64/mm/init.c
··· 484 484 return ret; 485 485 } 486 486 487 - void arch_remove_memory(int nid, u64 start, u64 size, 488 - struct vmem_altmap *altmap) 487 + void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 489 488 { 490 489 unsigned long start_pfn = start >> PAGE_SHIFT; 491 490 unsigned long nr_pages = size >> PAGE_SHIFT;
-1
arch/mips/configs/lemote2f_defconfig
··· 116 116 CONFIG_R8169=y 117 117 CONFIG_USB_USBNET=m 118 118 CONFIG_USB_NET_CDC_EEM=m 119 - CONFIG_INPUT_POLLDEV=m 120 119 CONFIG_INPUT_EVDEV=y 121 120 # CONFIG_MOUSE_PS2_ALPS is not set 122 121 # CONFIG_MOUSE_PS2_LOGIPS2PP is not set
-1
arch/mips/configs/pic32mzda_defconfig
··· 34 34 CONFIG_SCSI_SCAN_ASYNC=y 35 35 # CONFIG_SCSI_LOWLEVEL is not set 36 36 CONFIG_INPUT_LEDS=m 37 - CONFIG_INPUT_POLLDEV=y 38 37 CONFIG_INPUT_MOUSEDEV=m 39 38 CONFIG_INPUT_EVDEV=y 40 39 CONFIG_INPUT_EVBUG=m
-1
arch/mips/configs/rt305x_defconfig
··· 90 90 CONFIG_PPP_ASYNC=m 91 91 CONFIG_ISDN=y 92 92 CONFIG_INPUT=m 93 - CONFIG_INPUT_POLLDEV=m 94 93 # CONFIG_KEYBOARD_ATKBD is not set 95 94 # CONFIG_INPUT_MOUSE is not set 96 95 CONFIG_INPUT_MISC=y
-1
arch/mips/configs/xway_defconfig
··· 96 96 CONFIG_PPP_ASYNC=m 97 97 CONFIG_ISDN=y 98 98 CONFIG_INPUT=m 99 - CONFIG_INPUT_POLLDEV=m 100 99 # CONFIG_KEYBOARD_ATKBD is not set 101 100 # CONFIG_INPUT_MOUSE is not set 102 101 CONFIG_INPUT_MISC=y
-5
arch/nds32/kernel/traps.c
··· 183 183 } 184 184 185 185 extern char *exception_vector, *exception_vector_end; 186 - void __init trap_init(void) 187 - { 188 - return; 189 - } 190 - 191 186 void __init early_trap_init(void) 192 187 { 193 188 unsigned long ivb = 0;
-5
arch/nios2/kernel/traps.c
··· 105 105 printk("%s\n", loglvl); 106 106 } 107 107 108 - void __init trap_init(void) 109 - { 110 - /* Nothing to do here */ 111 - } 112 - 113 108 /* Breakpoint handler */ 114 109 asmlinkage void breakpoint_c(struct pt_regs *fp) 115 110 {
-5
arch/openrisc/kernel/traps.c
··· 231 231 die("Oops", regs, 9); 232 232 } 233 233 234 - void __init trap_init(void) 235 - { 236 - /* Nothing needs to be done */ 237 - } 238 - 239 234 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address) 240 235 { 241 236 force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
-1
arch/parisc/configs/generic-32bit_defconfig
··· 111 111 CONFIG_PPP_DEFLATE=m 112 112 CONFIG_PPPOE=m 113 113 # CONFIG_WLAN is not set 114 - CONFIG_INPUT_POLLDEV=y 115 114 CONFIG_KEYBOARD_HIL_OLD=m 116 115 CONFIG_KEYBOARD_HIL=m 117 116 CONFIG_MOUSE_SERIAL=y
-4
arch/parisc/kernel/traps.c
··· 859 859 860 860 initialize_ivt(&fault_vector_20); 861 861 } 862 - 863 - void __init trap_init(void) 864 - { 865 - }
-5
arch/powerpc/kernel/traps.c
··· 2219 2219 die("Bad kernel stack pointer", regs, SIGABRT); 2220 2220 } 2221 2221 2222 - void __init trap_init(void) 2223 - { 2224 - } 2225 - 2226 - 2227 2222 #ifdef CONFIG_PPC_EMULATED_STATS 2228 2223 2229 2224 #define WARN_EMULATED_SETUP(type) .type = { .name = #type }
+1 -2
arch/powerpc/mm/mem.c
··· 119 119 return rc; 120 120 } 121 121 122 - void __ref arch_remove_memory(int nid, u64 start, u64 size, 123 - struct vmem_altmap *altmap) 122 + void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 124 123 { 125 124 unsigned long start_pfn = start >> PAGE_SHIFT; 126 125 unsigned long nr_pages = size >> PAGE_SHIFT;
+4 -5
arch/powerpc/platforms/pseries/hotplug-memory.c
··· 286 286 { 287 287 unsigned long block_sz, start_pfn; 288 288 int sections_per_block; 289 - int i, nid; 289 + int i; 290 290 291 291 start_pfn = base >> PAGE_SHIFT; 292 292 ··· 297 297 298 298 block_sz = pseries_memory_block_size(); 299 299 sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE; 300 - nid = memory_add_physaddr_to_nid(base); 301 300 302 301 for (i = 0; i < sections_per_block; i++) { 303 - __remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE); 302 + __remove_memory(base, MIN_MEMORY_BLOCK_SIZE); 304 303 base += MIN_MEMORY_BLOCK_SIZE; 305 304 } 306 305 ··· 386 387 387 388 block_sz = pseries_memory_block_size(); 388 389 389 - __remove_memory(mem_block->nid, lmb->base_addr, block_sz); 390 + __remove_memory(lmb->base_addr, block_sz); 390 391 put_device(&mem_block->dev); 391 392 392 393 /* Update memory regions for memory remove */ ··· 659 660 660 661 rc = dlpar_online_lmb(lmb); 661 662 if (rc) { 662 - __remove_memory(nid, lmb->base_addr, block_sz); 663 + __remove_memory(lmb->base_addr, block_sz); 663 664 invalidate_lmb_associativity_index(lmb); 664 665 } else { 665 666 lmb->flags |= DRCONF_MEM_ASSIGNED;
+1 -1
arch/riscv/Kconfig
··· 51 51 select GENERIC_EARLY_IOREMAP 52 52 select GENERIC_GETTIMEOFDAY if HAVE_GENERIC_VDSO 53 53 select GENERIC_IDLE_POLL_SETUP 54 - select GENERIC_IOREMAP 54 + select GENERIC_IOREMAP if MMU 55 55 select GENERIC_IRQ_MULTI_HANDLER 56 56 select GENERIC_IRQ_SHOW 57 57 select GENERIC_IRQ_SHOW_LEVEL
-5
arch/riscv/kernel/traps.c
··· 199 199 } 200 200 #endif /* CONFIG_GENERIC_BUG */ 201 201 202 - /* stvec & scratch is already set from head.S */ 203 - void __init trap_init(void) 204 - { 205 - } 206 - 207 202 #ifdef CONFIG_VMAP_STACK 208 203 static DEFINE_PER_CPU(unsigned long [OVERFLOW_STACK_SIZE/sizeof(long)], 209 204 overflow_stack)__aligned(16);
+1 -2
arch/s390/mm/init.c
··· 307 307 return rc; 308 308 } 309 309 310 - void arch_remove_memory(int nid, u64 start, u64 size, 311 - struct vmem_altmap *altmap) 310 + void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 312 311 { 313 312 unsigned long start_pfn = start >> PAGE_SHIFT; 314 313 unsigned long nr_pages = size >> PAGE_SHIFT;
+1 -2
arch/sh/mm/init.c
··· 414 414 return ret; 415 415 } 416 416 417 - void arch_remove_memory(int nid, u64 start, u64 size, 418 - struct vmem_altmap *altmap) 417 + void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 419 418 { 420 419 unsigned long start_pfn = PFN_DOWN(start); 421 420 unsigned long nr_pages = size >> PAGE_SHIFT;
-4
arch/um/kernel/trap.c
··· 311 311 { 312 312 do_IRQ(WINCH_IRQ, regs); 313 313 } 314 - 315 - void trap_init(void) 316 - { 317 - }
-1
arch/x86/configs/i386_defconfig
··· 156 156 CONFIG_8139TOO=y 157 157 # CONFIG_8139TOO_PIO is not set 158 158 CONFIG_R8169=y 159 - CONFIG_INPUT_POLLDEV=y 160 159 CONFIG_INPUT_EVDEV=y 161 160 CONFIG_INPUT_JOYSTICK=y 162 161 CONFIG_INPUT_TABLET=y
-1
arch/x86/configs/x86_64_defconfig
··· 148 148 CONFIG_FORCEDETH=y 149 149 CONFIG_8139TOO=y 150 150 CONFIG_R8169=y 151 - CONFIG_INPUT_POLLDEV=y 152 151 CONFIG_INPUT_EVDEV=y 153 152 CONFIG_INPUT_JOYSTICK=y 154 153 CONFIG_INPUT_TABLET=y
+1 -2
arch/x86/mm/init_32.c
··· 801 801 return __add_pages(nid, start_pfn, nr_pages, params); 802 802 } 803 803 804 - void arch_remove_memory(int nid, u64 start, u64 size, 805 - struct vmem_altmap *altmap) 804 + void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 806 805 { 807 806 unsigned long start_pfn = start >> PAGE_SHIFT; 808 807 unsigned long nr_pages = size >> PAGE_SHIFT;
+1 -2
arch/x86/mm/init_64.c
··· 1255 1255 remove_pagetable(start, end, true, NULL); 1256 1256 } 1257 1257 1258 - void __ref arch_remove_memory(int nid, u64 start, u64 size, 1259 - struct vmem_altmap *altmap) 1258 + void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) 1260 1259 { 1261 1260 unsigned long start_pfn = start >> PAGE_SHIFT; 1262 1261 unsigned long nr_pages = size >> PAGE_SHIFT;
+31 -15
drivers/acpi/acpi_memhotplug.c
··· 54 54 struct acpi_memory_device { 55 55 struct acpi_device *device; 56 56 struct list_head res_list; 57 + int mgid; 57 58 }; 58 59 59 60 static acpi_status ··· 170 169 static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) 171 170 { 172 171 acpi_handle handle = mem_device->device->handle; 172 + mhp_t mhp_flags = MHP_NID_IS_MGID; 173 173 int result, num_enabled = 0; 174 174 struct acpi_memory_info *info; 175 - mhp_t mhp_flags = MHP_NONE; 176 - int node; 175 + u64 total_length = 0; 176 + int node, mgid; 177 177 178 178 node = acpi_get_node(handle); 179 + 180 + list_for_each_entry(info, &mem_device->res_list, list) { 181 + if (!info->length) 182 + continue; 183 + /* We want a single node for the whole memory group */ 184 + if (node < 0) 185 + node = memory_add_physaddr_to_nid(info->start_addr); 186 + total_length += info->length; 187 + } 188 + 189 + if (!total_length) { 190 + dev_err(&mem_device->device->dev, "device is empty\n"); 191 + return -EINVAL; 192 + } 193 + 194 + mgid = memory_group_register_static(node, PFN_UP(total_length)); 195 + if (mgid < 0) 196 + return mgid; 197 + mem_device->mgid = mgid; 198 + 179 199 /* 180 200 * Tell the VM there is more memory here... 181 201 * Note: Assume that this function returns zero on success ··· 204 182 * (i.e. memory-hot-remove function) 205 183 */ 206 184 list_for_each_entry(info, &mem_device->res_list, list) { 207 - if (info->enabled) { /* just sanity check...*/ 208 - num_enabled++; 209 - continue; 210 - } 211 185 /* 212 186 * If the memory block size is zero, please ignore it. 213 187 * Don't try to do the following memory hotplug flowchart. 
214 188 */ 215 189 if (!info->length) 216 190 continue; 217 - if (node < 0) 218 - node = memory_add_physaddr_to_nid(info->start_addr); 219 191 220 192 if (mhp_supports_memmap_on_memory(info->length)) 221 193 mhp_flags |= MHP_MEMMAP_ON_MEMORY; 222 - result = __add_memory(node, info->start_addr, info->length, 194 + result = __add_memory(mgid, info->start_addr, info->length, 223 195 mhp_flags); 224 196 225 197 /* ··· 255 239 256 240 static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device) 257 241 { 258 - acpi_handle handle = mem_device->device->handle; 259 242 struct acpi_memory_info *info, *n; 260 - int nid = acpi_get_node(handle); 261 243 262 244 list_for_each_entry_safe(info, n, &mem_device->res_list, list) { 263 245 if (!info->enabled) 264 246 continue; 265 247 266 - if (nid == NUMA_NO_NODE) 267 - nid = memory_add_physaddr_to_nid(info->start_addr); 268 - 269 248 acpi_unbind_memory_blocks(info); 270 - __remove_memory(nid, info->start_addr, info->length); 249 + __remove_memory(info->start_addr, info->length); 271 250 list_del(&info->list); 272 251 kfree(info); 273 252 } ··· 272 261 { 273 262 if (!mem_device) 274 263 return; 264 + 265 + /* In case we succeeded adding *some* memory, unregistering fails. */ 266 + if (mem_device->mgid >= 0) 267 + memory_group_unregister(mem_device->mgid); 275 268 276 269 acpi_memory_free_device_resources(mem_device); 277 270 mem_device->device->driver_data = NULL; ··· 297 282 298 283 INIT_LIST_HEAD(&mem_device->res_list); 299 284 mem_device->device = device; 285 + mem_device->mgid = -1; 300 286 sprintf(acpi_device_name(device), "%s", ACPI_MEMORY_DEVICE_NAME); 301 287 sprintf(acpi_device_class(device), "%s", ACPI_MEMORY_DEVICE_CLASS); 302 288 device->driver_data = mem_device;
+204 -21
drivers/base/memory.c
··· 82 82 */ 83 83 static DEFINE_XARRAY(memory_blocks); 84 84 85 + /* 86 + * Memory groups, indexed by memory group id (mgid). 87 + */ 88 + static DEFINE_XARRAY_FLAGS(memory_groups, XA_FLAGS_ALLOC); 89 + #define MEMORY_GROUP_MARK_DYNAMIC XA_MARK_1 90 + 85 91 static BLOCKING_NOTIFIER_HEAD(memory_chain); 86 92 87 93 int register_memory_notifier(struct notifier_block *nb) ··· 183 177 struct zone *zone; 184 178 int ret; 185 179 186 - zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages); 180 + zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group, 181 + start_pfn, nr_pages); 187 182 188 183 /* 189 184 * Although vmemmap pages have a different lifecycle than the pages ··· 200 193 } 201 194 202 195 ret = online_pages(start_pfn + nr_vmemmap_pages, 203 - nr_pages - nr_vmemmap_pages, zone); 196 + nr_pages - nr_vmemmap_pages, zone, mem->group); 204 197 if (ret) { 205 198 if (nr_vmemmap_pages) 206 199 mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); ··· 212 205 * now already properly populated. 213 206 */ 214 207 if (nr_vmemmap_pages) 215 - adjust_present_page_count(zone, nr_vmemmap_pages); 208 + adjust_present_page_count(pfn_to_page(start_pfn), mem->group, 209 + nr_vmemmap_pages); 216 210 217 211 return ret; 218 212 } ··· 223 215 unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); 224 216 unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 225 217 unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; 226 - struct zone *zone; 227 218 int ret; 228 219 229 220 /* 230 221 * Unaccount before offlining, such that unpopulated zone and kthreads 231 222 * can properly be torn down in offline_pages(). 
232 223 */ 233 - if (nr_vmemmap_pages) { 234 - zone = page_zone(pfn_to_page(start_pfn)); 235 - adjust_present_page_count(zone, -nr_vmemmap_pages); 236 - } 224 + if (nr_vmemmap_pages) 225 + adjust_present_page_count(pfn_to_page(start_pfn), mem->group, 226 + -nr_vmemmap_pages); 237 227 238 228 ret = offline_pages(start_pfn + nr_vmemmap_pages, 239 - nr_pages - nr_vmemmap_pages); 229 + nr_pages - nr_vmemmap_pages, mem->group); 240 230 if (ret) { 241 231 /* offline_pages() failed. Account back. */ 242 232 if (nr_vmemmap_pages) 243 - adjust_present_page_count(zone, nr_vmemmap_pages); 233 + adjust_present_page_count(pfn_to_page(start_pfn), 234 + mem->group, nr_vmemmap_pages); 244 235 return ret; 245 236 } 246 237 ··· 381 374 382 375 #ifdef CONFIG_MEMORY_HOTREMOVE 383 376 static int print_allowed_zone(char *buf, int len, int nid, 377 + struct memory_group *group, 384 378 unsigned long start_pfn, unsigned long nr_pages, 385 379 int online_type, struct zone *default_zone) 386 380 { 387 381 struct zone *zone; 388 382 389 - zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages); 383 + zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages); 390 384 if (zone == default_zone) 391 385 return 0; 392 386 ··· 400 392 struct memory_block *mem = to_memory_block(dev); 401 393 unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); 402 394 unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 395 + struct memory_group *group = mem->group; 403 396 struct zone *default_zone; 397 + int nid = mem->nid; 404 398 int len = 0; 405 - int nid; 406 399 407 400 /* 408 401 * Check the existing zone. 
Make sure that we do that only on the ··· 422 413 goto out; 423 414 } 424 415 425 - nid = mem->nid; 426 - default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, start_pfn, 427 - nr_pages); 416 + default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, group, 417 + start_pfn, nr_pages); 428 418 429 419 len += sysfs_emit_at(buf, len, "%s", default_zone->name); 430 - len += print_allowed_zone(buf, len, nid, start_pfn, nr_pages, 420 + len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages, 431 421 MMOP_ONLINE_KERNEL, default_zone); 432 - len += print_allowed_zone(buf, len, nid, start_pfn, nr_pages, 422 + len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages, 433 423 MMOP_ONLINE_MOVABLE, default_zone); 434 424 out: 435 425 len += sysfs_emit_at(buf, len, "\n"); ··· 642 634 } 643 635 644 636 static int init_memory_block(unsigned long block_id, unsigned long state, 645 - unsigned long nr_vmemmap_pages) 637 + unsigned long nr_vmemmap_pages, 638 + struct memory_group *group) 646 639 { 647 640 struct memory_block *mem; 648 641 int ret = 0; ··· 661 652 mem->state = state; 662 653 mem->nid = NUMA_NO_NODE; 663 654 mem->nr_vmemmap_pages = nr_vmemmap_pages; 655 + INIT_LIST_HEAD(&mem->group_next); 656 + 657 + if (group) { 658 + mem->group = group; 659 + list_add(&mem->group_next, &group->memory_blocks); 660 + } 664 661 665 662 ret = register_memory(mem); 666 663 ··· 686 671 if (section_count == 0) 687 672 return 0; 688 673 return init_memory_block(memory_block_id(base_section_nr), 689 - MEM_ONLINE, 0); 674 + MEM_ONLINE, 0, NULL); 690 675 } 691 676 692 677 static void unregister_memory(struct memory_block *memory) ··· 695 680 return; 696 681 697 682 WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL); 683 + 684 + if (memory->group) { 685 + list_del(&memory->group_next); 686 + memory->group = NULL; 687 + } 698 688 699 689 /* drop the ref. we got via find_memory_block() */ 700 690 put_device(&memory->dev); ··· 714 694 * Called under device_hotplug_lock. 
715 695 */ 716 696 int create_memory_block_devices(unsigned long start, unsigned long size, 717 - unsigned long vmemmap_pages) 697 + unsigned long vmemmap_pages, 698 + struct memory_group *group) 718 699 { 719 700 const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start)); 720 701 unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size)); ··· 728 707 return -EINVAL; 729 708 730 709 for (block_id = start_block_id; block_id != end_block_id; block_id++) { 731 - ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages); 710 + ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages, 711 + group); 732 712 if (ret) 733 713 break; 734 714 } ··· 912 890 913 891 return bus_for_each_dev(&memory_subsys, NULL, &cb_data, 914 892 for_each_memory_block_cb); 893 + } 894 + 895 + /* 896 + * This is an internal helper to unify allocation and initialization of 897 + * memory groups. Note that the passed memory group will be copied to a 898 + * dynamically allocated memory group. After this call, the passed 899 + * memory group should no longer be used. 900 + */ 901 + static int memory_group_register(struct memory_group group) 902 + { 903 + struct memory_group *new_group; 904 + uint32_t mgid; 905 + int ret; 906 + 907 + if (!node_possible(group.nid)) 908 + return -EINVAL; 909 + 910 + new_group = kzalloc(sizeof(group), GFP_KERNEL); 911 + if (!new_group) 912 + return -ENOMEM; 913 + *new_group = group; 914 + INIT_LIST_HEAD(&new_group->memory_blocks); 915 + 916 + ret = xa_alloc(&memory_groups, &mgid, new_group, xa_limit_31b, 917 + GFP_KERNEL); 918 + if (ret) { 919 + kfree(new_group); 920 + return ret; 921 + } else if (group.is_dynamic) { 922 + xa_set_mark(&memory_groups, mgid, MEMORY_GROUP_MARK_DYNAMIC); 923 + } 924 + return mgid; 925 + } 926 + 927 + /** 928 + * memory_group_register_static() - Register a static memory group. 929 + * @nid: The node id. 930 + * @max_pages: The maximum number of pages we'll have in this static memory 931 + * group. 
932 + * 933 + * Register a new static memory group and return the memory group id. 934 + * All memory in the group belongs to a single unit, such as a DIMM. All 935 + * memory belonging to a static memory group is added in one go to be removed 936 + * in one go -- it's static. 937 + * 938 + * Returns an error if out of memory, if the node id is invalid, if no new 939 + * memory groups can be registered, or if max_pages is invalid (0). Otherwise, 940 + * returns the new memory group id. 941 + */ 942 + int memory_group_register_static(int nid, unsigned long max_pages) 943 + { 944 + struct memory_group group = { 945 + .nid = nid, 946 + .s = { 947 + .max_pages = max_pages, 948 + }, 949 + }; 950 + 951 + if (!max_pages) 952 + return -EINVAL; 953 + return memory_group_register(group); 954 + } 955 + EXPORT_SYMBOL_GPL(memory_group_register_static); 956 + 957 + /** 958 + * memory_group_register_dynamic() - Register a dynamic memory group. 959 + * @nid: The node id. 960 + * @unit_pages: Unit in pages in which is memory added/removed in this dynamic 961 + * memory group. 962 + * 963 + * Register a new dynamic memory group and return the memory group id. 964 + * Memory within a dynamic memory group is added/removed dynamically 965 + * in unit_pages. 966 + * 967 + * Returns an error if out of memory, if the node id is invalid, if no new 968 + * memory groups can be registered, or if unit_pages is invalid (0, not a 969 + * power of two, smaller than a single memory block). Otherwise, returns the 970 + * new memory group id. 
971 + */ 972 + int memory_group_register_dynamic(int nid, unsigned long unit_pages) 973 + { 974 + struct memory_group group = { 975 + .nid = nid, 976 + .is_dynamic = true, 977 + .d = { 978 + .unit_pages = unit_pages, 979 + }, 980 + }; 981 + 982 + if (!unit_pages || !is_power_of_2(unit_pages) || 983 + unit_pages < PHYS_PFN(memory_block_size_bytes())) 984 + return -EINVAL; 985 + return memory_group_register(group); 986 + } 987 + EXPORT_SYMBOL_GPL(memory_group_register_dynamic); 988 + 989 + /** 990 + * memory_group_unregister() - Unregister a memory group. 991 + * @mgid: the memory group id 992 + * 993 + * Unregister a memory group. If any memory block still belongs to this 994 + * memory group, unregistering will fail. 995 + * 996 + * Returns -EINVAL if the memory group id is invalid, returns -EBUSY if some 997 + * memory blocks still belong to this memory group and returns 0 if 998 + * unregistering succeeded. 999 + */ 1000 + int memory_group_unregister(int mgid) 1001 + { 1002 + struct memory_group *group; 1003 + 1004 + if (mgid < 0) 1005 + return -EINVAL; 1006 + 1007 + group = xa_load(&memory_groups, mgid); 1008 + if (!group) 1009 + return -EINVAL; 1010 + if (!list_empty(&group->memory_blocks)) 1011 + return -EBUSY; 1012 + xa_erase(&memory_groups, mgid); 1013 + kfree(group); 1014 + return 0; 1015 + } 1016 + EXPORT_SYMBOL_GPL(memory_group_unregister); 1017 + 1018 + /* 1019 + * This is an internal helper only to be used in core memory hotplug code to 1020 + * lookup a memory group. We don't care about locking, as we don't expect a 1021 + * memory group to get unregistered while adding memory to it -- because 1022 + * the group and the memory is managed by the same driver. 
1023 + */ 1024 + struct memory_group *memory_group_find_by_id(int mgid) 1025 + { 1026 + return xa_load(&memory_groups, mgid); 1027 + } 1028 + 1029 + /* 1030 + * This is an internal helper only to be used in core memory hotplug code to 1031 + * walk all dynamic memory groups excluding a given memory group, either 1032 + * belonging to a specific node, or belonging to any node. 1033 + */ 1034 + int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func, 1035 + struct memory_group *excluded, void *arg) 1036 + { 1037 + struct memory_group *group; 1038 + unsigned long index; 1039 + int ret = 0; 1040 + 1041 + xa_for_each_marked(&memory_groups, index, group, 1042 + MEMORY_GROUP_MARK_DYNAMIC) { 1043 + if (group == excluded) 1044 + continue; 1045 + #ifdef CONFIG_NUMA 1046 + if (nid != NUMA_NO_NODE && group->nid != nid) 1047 + continue; 1048 + #endif /* CONFIG_NUMA */ 1049 + ret = func(group, arg); 1050 + if (ret) 1051 + break; 1052 + } 1053 + return ret; 915 1054 }
-2
drivers/base/node.c
··· 785 785 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE 786 786 static int __ref get_nid_for_pfn(unsigned long pfn) 787 787 { 788 - if (!pfn_valid_within(pfn)) 789 - return -1; 790 788 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT 791 789 if (system_state < SYSTEM_RUNNING) 792 790 return early_pfn_to_nid(pfn);
+38 -15
drivers/dax/kmem.c
··· 37 37 38 38 struct dax_kmem_data { 39 39 const char *res_name; 40 + int mgid; 40 41 struct resource *res[]; 41 42 }; 42 43 43 44 static int dev_dax_kmem_probe(struct dev_dax *dev_dax) 44 45 { 45 46 struct device *dev = &dev_dax->dev; 47 + unsigned long total_len = 0; 46 48 struct dax_kmem_data *data; 47 - int rc = -ENOMEM; 48 - int i, mapped = 0; 49 + int i, rc, mapped = 0; 49 50 int numa_node; 50 51 51 52 /* ··· 62 61 return -EINVAL; 63 62 } 64 63 65 - data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL); 66 - if (!data) 67 - return -ENOMEM; 68 - 69 - data->res_name = kstrdup(dev_name(dev), GFP_KERNEL); 70 - if (!data->res_name) 71 - goto err_res_name; 72 - 73 64 for (i = 0; i < dev_dax->nr_range; i++) { 74 - struct resource *res; 75 65 struct range range; 76 66 77 67 rc = dax_kmem_range(dev_dax, i, &range); ··· 71 79 i, range.start, range.end); 72 80 continue; 73 81 } 82 + total_len += range_len(&range); 83 + } 84 + 85 + if (!total_len) { 86 + dev_warn(dev, "rejecting DAX region without any memory after alignment\n"); 87 + return -EINVAL; 88 + } 89 + 90 + data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL); 91 + if (!data) 92 + return -ENOMEM; 93 + 94 + rc = -ENOMEM; 95 + data->res_name = kstrdup(dev_name(dev), GFP_KERNEL); 96 + if (!data->res_name) 97 + goto err_res_name; 98 + 99 + rc = memory_group_register_static(numa_node, total_len); 100 + if (rc < 0) 101 + goto err_reg_mgid; 102 + data->mgid = rc; 103 + 104 + for (i = 0; i < dev_dax->nr_range; i++) { 105 + struct resource *res; 106 + struct range range; 107 + 108 + rc = dax_kmem_range(dev_dax, i, &range); 109 + if (rc) 110 + continue; 74 111 75 112 /* Region is permanently reserved if hotremove fails. */ 76 113 res = request_mem_region(range.start, range_len(&range), data->res_name); ··· 129 108 * Ensure that future kexec'd kernels will not treat 130 109 * this as RAM automatically. 
131 110 */ 132 - rc = add_memory_driver_managed(numa_node, range.start, 133 - range_len(&range), kmem_name, MHP_NONE); 111 + rc = add_memory_driver_managed(data->mgid, range.start, 112 + range_len(&range), kmem_name, MHP_NID_IS_MGID); 134 113 135 114 if (rc) { 136 115 dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", ··· 150 129 return 0; 151 130 152 131 err_request_mem: 132 + memory_group_unregister(data->mgid); 133 + err_reg_mgid: 153 134 kfree(data->res_name); 154 135 err_res_name: 155 136 kfree(data); ··· 179 156 if (rc) 180 157 continue; 181 158 182 - rc = remove_memory(dev_dax->target_node, range.start, 183 - range_len(&range)); 159 + rc = remove_memory(range.start, range_len(&range)); 184 160 if (rc == 0) { 185 161 release_resource(data->res[i]); 186 162 kfree(data->res[i]); ··· 194 172 } 195 173 196 174 if (success >= dev_dax->nr_range) { 175 + memory_group_unregister(data->mgid); 197 176 kfree(data->res_name); 198 177 kfree(data); 199 178 dev_set_drvdata(dev, NULL);
+1 -1
drivers/devfreq/devfreq.c
··· 27 27 #include <linux/hrtimer.h> 28 28 #include <linux/of.h> 29 29 #include <linux/pm_qos.h> 30 + #include <linux/units.h> 30 31 #include "governor.h" 31 32 32 33 #define CREATE_TRACE_POINTS ··· 35 34 36 35 #define IS_SUPPORTED_FLAG(f, name) ((f & DEVFREQ_GOV_FLAG_##name) ? true : false) 37 36 #define IS_SUPPORTED_ATTR(f, name) ((f & DEVFREQ_GOV_ATTR_##name) ? true : false) 38 - #define HZ_PER_KHZ 1000 39 37 40 38 static struct class *devfreq_class; 41 39 static struct dentry *devfreq_debugfs;
+1 -1
drivers/hwmon/mr75203.c
··· 17 17 #include <linux/property.h> 18 18 #include <linux/regmap.h> 19 19 #include <linux/reset.h> 20 + #include <linux/units.h> 20 21 21 22 /* PVT Common register */ 22 23 #define PVT_IP_CONFIG 0x04 ··· 38 37 #define CLK_SYNTH_EN BIT(24) 39 38 #define CLK_SYS_CYCLES_MAX 514 40 39 #define CLK_SYS_CYCLES_MIN 2 41 - #define HZ_PER_MHZ 1000000L 42 40 43 41 #define SDIF_DISABLE 0x04 44 42
+1 -2
drivers/iio/common/hid-sensors/hid-sensor-attributes.c
··· 6 6 #include <linux/module.h> 7 7 #include <linux/kernel.h> 8 8 #include <linux/time.h> 9 + #include <linux/units.h> 9 10 10 11 #include <linux/hid-sensor-hub.h> 11 12 #include <linux/iio/iio.h> 12 - 13 - #define HZ_PER_MHZ 1000000L 14 13 15 14 static struct { 16 15 u32 usage_id;
+1 -2
drivers/iio/light/as73211.c
··· 24 24 #include <linux/module.h> 25 25 #include <linux/mutex.h> 26 26 #include <linux/pm.h> 27 - 28 - #define HZ_PER_KHZ 1000 27 + #include <linux/units.h> 29 28 30 29 #define AS73211_DRV_NAME "as73211" 31 30
+1 -1
drivers/media/i2c/ov02a10.c
··· 9 9 #include <linux/module.h> 10 10 #include <linux/pm_runtime.h> 11 11 #include <linux/regulator/consumer.h> 12 + #include <linux/units.h> 12 13 #include <media/media-entity.h> 13 14 #include <media/v4l2-async.h> 14 15 #include <media/v4l2-ctrls.h> ··· 65 64 /* Test pattern control */ 66 65 #define OV02A10_REG_TEST_PATTERN 0xb6 67 66 68 - #define HZ_PER_MHZ 1000000L 69 67 #define OV02A10_LINK_FREQ_390MHZ (390 * HZ_PER_MHZ) 70 68 #define OV02A10_ECLK_FREQ (24 * HZ_PER_MHZ) 71 69
+1 -1
drivers/mtd/nand/raw/intel-nand-controller.c
··· 20 20 #include <linux/sched.h> 21 21 #include <linux/slab.h> 22 22 #include <linux/types.h> 23 + #include <linux/units.h> 23 24 #include <asm/unaligned.h> 24 25 25 26 #define EBU_CLC 0x000 ··· 103 102 104 103 #define MAX_CS 2 105 104 106 - #define HZ_PER_MHZ 1000000L 107 105 #define USEC_PER_SEC 1000000L 108 106 109 107 struct ebu_nand_cs {
+1 -1
drivers/phy/st/phy-stm32-usbphyc.c
··· 15 15 #include <linux/of_platform.h> 16 16 #include <linux/phy/phy.h> 17 17 #include <linux/reset.h> 18 + #include <linux/units.h> 18 19 19 20 #define STM32_USBPHYC_PLL 0x0 20 21 #define STM32_USBPHYC_MISC 0x8 ··· 48 47 #define PLL_FVCO_MHZ 2880 49 48 #define PLL_INFF_MIN_RATE_HZ 19200000 50 49 #define PLL_INFF_MAX_RATE_HZ 38400000 51 - #define HZ_PER_MHZ 1000000L 52 50 53 51 struct pll_params { 54 52 u8 ndiv;
+1 -1
drivers/thermal/devfreq_cooling.c
··· 18 18 #include <linux/pm_opp.h> 19 19 #include <linux/pm_qos.h> 20 20 #include <linux/thermal.h> 21 + #include <linux/units.h> 21 22 22 23 #include <trace/events/thermal.h> 23 24 24 - #define HZ_PER_KHZ 1000 25 25 #define SCALE_ERROR_MITIGATION 100 26 26 27 27 /**
+21 -5
drivers/virtio/virtio_mem.c
··· 143 143 * add_memory_driver_managed(). 144 144 */ 145 145 const char *resource_name; 146 + /* Memory group identification. */ 147 + int mgid; 146 148 147 149 /* 148 150 * We don't want to add too much memory if it's not getting onlined, ··· 628 626 addr + size - 1); 629 627 /* Memory might get onlined immediately. */ 630 628 atomic64_add(size, &vm->offline_size); 631 - rc = add_memory_driver_managed(vm->nid, addr, size, vm->resource_name, 632 - MHP_MERGE_RESOURCE); 629 + rc = add_memory_driver_managed(vm->mgid, addr, size, vm->resource_name, 630 + MHP_MERGE_RESOURCE | MHP_NID_IS_MGID); 633 631 if (rc) { 634 632 atomic64_sub(size, &vm->offline_size); 635 633 dev_warn(&vm->vdev->dev, "adding memory failed: %d\n", rc); ··· 679 677 680 678 dev_dbg(&vm->vdev->dev, "removing memory: 0x%llx - 0x%llx\n", addr, 681 679 addr + size - 1); 682 - rc = remove_memory(vm->nid, addr, size); 680 + rc = remove_memory(addr, size); 683 681 if (!rc) { 684 682 atomic64_sub(size, &vm->offline_size); 685 683 /* ··· 722 720 "offlining and removing memory: 0x%llx - 0x%llx\n", addr, 723 721 addr + size - 1); 724 722 725 - rc = offline_and_remove_memory(vm->nid, addr, size); 723 + rc = offline_and_remove_memory(addr, size); 726 724 if (!rc) { 727 725 atomic64_sub(size, &vm->offline_size); 728 726 /* ··· 2571 2569 static int virtio_mem_probe(struct virtio_device *vdev) 2572 2570 { 2573 2571 struct virtio_mem *vm; 2572 + uint64_t unit_pages; 2574 2573 int rc; 2575 2574 2576 2575 BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24); ··· 2606 2603 if (rc) 2607 2604 goto out_del_vq; 2608 2605 2606 + /* use a single dynamic memory group to cover the whole memory device */ 2607 + if (vm->in_sbm) 2608 + unit_pages = PHYS_PFN(memory_block_size_bytes()); 2609 + else 2610 + unit_pages = PHYS_PFN(vm->bbm.bb_size); 2611 + rc = memory_group_register_dynamic(vm->nid, unit_pages); 2612 + if (rc < 0) 2613 + goto out_del_resource; 2614 + vm->mgid = rc; 2615 + 2609 2616 /* 2610 2617 * If we still have memory 
plugged, we have to unplug all memory first. 2611 2618 * Registering our parent resource makes sure that this memory isn't ··· 2630 2617 vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb; 2631 2618 rc = register_memory_notifier(&vm->memory_notifier); 2632 2619 if (rc) 2633 - goto out_del_resource; 2620 + goto out_unreg_group; 2634 2621 rc = register_virtio_mem_device(vm); 2635 2622 if (rc) 2636 2623 goto out_unreg_mem; ··· 2644 2631 return 0; 2645 2632 out_unreg_mem: 2646 2633 unregister_memory_notifier(&vm->memory_notifier); 2634 + out_unreg_group: 2635 + memory_group_unregister(vm->mgid); 2647 2636 out_del_resource: 2648 2637 virtio_mem_delete_resource(vm); 2649 2638 out_del_vq: ··· 2710 2695 } else { 2711 2696 virtio_mem_delete_resource(vm); 2712 2697 kfree_const(vm->resource_name); 2698 + memory_group_unregister(vm->mgid); 2713 2699 } 2714 2700 2715 2701 /* remove all tracking data - no locking needed */
+12 -3
fs/coredump.c
··· 782 782 * filesystem. 783 783 */ 784 784 mnt_userns = file_mnt_user_ns(cprm.file); 785 - if (!uid_eq(i_uid_into_mnt(mnt_userns, inode), current_fsuid())) 785 + if (!uid_eq(i_uid_into_mnt(mnt_userns, inode), 786 + current_fsuid())) { 787 + pr_info_ratelimited("Core dump to %s aborted: cannot preserve file owner\n", 788 + cn.corename); 786 789 goto close_fail; 787 - if ((inode->i_mode & 0677) != 0600) 790 + } 791 + if ((inode->i_mode & 0677) != 0600) { 792 + pr_info_ratelimited("Core dump to %s aborted: cannot preserve file permissions\n", 793 + cn.corename); 788 794 goto close_fail; 795 + } 789 796 if (!(cprm.file->f_mode & FMODE_CAN_WRITE)) 790 797 goto close_fail; 791 798 if (do_truncate(mnt_userns, cprm.file->f_path.dentry, ··· 1134 1127 1135 1128 mmap_write_unlock(mm); 1136 1129 1137 - if (WARN_ON(i != *vma_count)) 1130 + if (WARN_ON(i != *vma_count)) { 1131 + kvfree(*vma_meta); 1138 1132 return -EFAULT; 1133 + } 1139 1134 1140 1135 *vma_data_size_ptr = vma_data_size; 1141 1136 return 0;
+10 -8
fs/eventpoll.c
··· 723 723 */ 724 724 call_rcu(&epi->rcu, epi_rcu_free); 725 725 726 - atomic_long_dec(&ep->user->epoll_watches); 726 + percpu_counter_dec(&ep->user->epoll_watches); 727 727 728 728 return 0; 729 729 } ··· 1439 1439 { 1440 1440 int error, pwake = 0; 1441 1441 __poll_t revents; 1442 - long user_watches; 1443 1442 struct epitem *epi; 1444 1443 struct ep_pqueue epq; 1445 1444 struct eventpoll *tep = NULL; ··· 1448 1449 1449 1450 lockdep_assert_irqs_enabled(); 1450 1451 1451 - user_watches = atomic_long_read(&ep->user->epoll_watches); 1452 - if (unlikely(user_watches >= max_user_watches)) 1452 + if (unlikely(percpu_counter_compare(&ep->user->epoll_watches, 1453 + max_user_watches) >= 0)) 1453 1454 return -ENOSPC; 1454 - if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) 1455 + percpu_counter_inc(&ep->user->epoll_watches); 1456 + 1457 + if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) { 1458 + percpu_counter_dec(&ep->user->epoll_watches); 1455 1459 return -ENOMEM; 1460 + } 1456 1461 1457 1462 /* Item initialization follow here ... */ 1458 1463 INIT_LIST_HEAD(&epi->rdllink); ··· 1469 1466 mutex_lock_nested(&tep->mtx, 1); 1470 1467 /* Add the current item to the list of active epoll hook for this file */ 1471 1468 if (unlikely(attach_epitem(tfile, epi) < 0)) { 1472 - kmem_cache_free(epi_cache, epi); 1473 1469 if (tep) 1474 1470 mutex_unlock(&tep->mtx); 1471 + kmem_cache_free(epi_cache, epi); 1472 + percpu_counter_dec(&ep->user->epoll_watches); 1475 1473 return -ENOMEM; 1476 1474 } 1477 1475 1478 1476 if (full_check && !tep) 1479 1477 list_file(tfile); 1480 - 1481 - atomic_long_inc(&ep->user->epoll_watches); 1482 1478 1483 1479 /* 1484 1480 * Add the current item to the RB tree. All RB tree operations are
+11 -15
fs/nilfs2/sysfs.c
··· 51 51 #define NILFS_DEV_INT_GROUP_TYPE(name, parent_name) \ 52 52 static void nilfs_##name##_attr_release(struct kobject *kobj) \ 53 53 { \ 54 - struct nilfs_sysfs_##parent_name##_subgroups *subgroups; \ 55 - struct the_nilfs *nilfs = container_of(kobj->parent, \ 56 - struct the_nilfs, \ 57 - ns_##parent_name##_kobj); \ 58 - subgroups = nilfs->ns_##parent_name##_subgroups; \ 54 + struct nilfs_sysfs_##parent_name##_subgroups *subgroups = container_of(kobj, \ 55 + struct nilfs_sysfs_##parent_name##_subgroups, \ 56 + sg_##name##_kobj); \ 59 57 complete(&subgroups->sg_##name##_kobj_unregister); \ 60 58 } \ 61 59 static struct kobj_type nilfs_##name##_ktype = { \ ··· 79 81 err = kobject_init_and_add(kobj, &nilfs_##name##_ktype, parent, \ 80 82 #name); \ 81 83 if (err) \ 82 - return err; \ 83 - return 0; \ 84 + kobject_put(kobj); \ 85 + return err; \ 84 86 } \ 85 87 static void nilfs_sysfs_delete_##name##_group(struct the_nilfs *nilfs) \ 86 88 { \ 87 - kobject_del(&nilfs->ns_##parent_name##_subgroups->sg_##name##_kobj); \ 89 + kobject_put(&nilfs->ns_##parent_name##_subgroups->sg_##name##_kobj); \ 88 90 } 89 91 90 92 /************************************************************************ ··· 195 197 } 196 198 197 199 if (err) 198 - return err; 200 + kobject_put(&root->snapshot_kobj); 199 201 200 - return 0; 202 + return err; 201 203 } 202 204 203 205 void nilfs_sysfs_delete_snapshot_group(struct nilfs_root *root) 204 206 { 205 - kobject_del(&root->snapshot_kobj); 207 + kobject_put(&root->snapshot_kobj); 206 208 } 207 209 208 210 /************************************************************************ ··· 984 986 err = kobject_init_and_add(&nilfs->ns_dev_kobj, &nilfs_dev_ktype, NULL, 985 987 "%s", sb->s_id); 986 988 if (err) 987 - goto free_dev_subgroups; 989 + goto cleanup_dev_kobject; 988 990 989 991 err = nilfs_sysfs_create_mounted_snapshots_group(nilfs); 990 992 if (err) ··· 1021 1023 nilfs_sysfs_delete_mounted_snapshots_group(nilfs); 1022 1024 1023 1025 
cleanup_dev_kobject: 1024 - kobject_del(&nilfs->ns_dev_kobj); 1025 - 1026 - free_dev_subgroups: 1026 + kobject_put(&nilfs->ns_dev_kobj); 1027 1027 kfree(nilfs->ns_dev_subgroups); 1028 1028 1029 1029 failed_create_device_group:
+4 -5
fs/nilfs2/the_nilfs.c
··· 792 792 793 793 void nilfs_put_root(struct nilfs_root *root) 794 794 { 795 - if (refcount_dec_and_test(&root->count)) { 796 - struct the_nilfs *nilfs = root->nilfs; 795 + struct the_nilfs *nilfs = root->nilfs; 797 796 798 - nilfs_sysfs_delete_snapshot_group(root); 799 - 800 - spin_lock(&nilfs->ns_cptree_lock); 797 + if (refcount_dec_and_lock(&root->count, &nilfs->ns_cptree_lock)) { 801 798 rb_erase(&root->rb_node, &nilfs->ns_cptree); 802 799 spin_unlock(&nilfs->ns_cptree_lock); 800 + 801 + nilfs_sysfs_delete_snapshot_group(root); 803 802 iput(root->ifile); 804 803 805 804 kfree(root);
+4 -14
fs/proc/array.c
··· 98 98 99 99 void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape) 100 100 { 101 - char *buf; 102 - size_t size; 103 101 char tcomm[64]; 104 - int ret; 105 102 106 103 if (p->flags & PF_WQ_WORKER) 107 104 wq_worker_comm(tcomm, sizeof(tcomm), p); 108 105 else 109 106 __get_task_comm(tcomm, sizeof(tcomm), p); 110 107 111 - size = seq_get_buf(m, &buf); 112 - if (escape) { 113 - ret = string_escape_str(tcomm, buf, size, 114 - ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\"); 115 - if (ret >= size) 116 - ret = -1; 117 - } else { 118 - ret = strscpy(buf, tcomm, size); 119 - } 120 - 121 - seq_commit(m, ret); 108 + if (escape) 109 + seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\"); 110 + else 111 + seq_printf(m, "%.64s", tcomm); 122 112 } 123 113 124 114 /*
+4 -1
fs/proc/base.c
··· 95 95 #include <linux/posix-timers.h> 96 96 #include <linux/time_namespace.h> 97 97 #include <linux/resctrl.h> 98 + #include <linux/cn_proc.h> 98 99 #include <trace/events/oom.h> 99 100 #include "internal.h" 100 101 #include "fd.h" ··· 1675 1674 if (!p) 1676 1675 return -ESRCH; 1677 1676 1678 - if (same_thread_group(current, p)) 1677 + if (same_thread_group(current, p)) { 1679 1678 set_task_comm(p, buffer); 1679 + proc_comm_connector(p); 1680 + } 1680 1681 else 1681 1682 count = -EINVAL; 1682 1683
-6
include/asm-generic/early_ioremap.h
··· 19 19 extern void early_iounmap(void __iomem *addr, unsigned long size); 20 20 extern void early_memunmap(void *addr, unsigned long size); 21 21 22 - /* 23 - * Weak function called by early_ioremap_reset(). It does nothing, but 24 - * architectures may provide their own version to do any needed cleanups. 25 - */ 26 - extern void early_ioremap_shutdown(void); 27 - 28 22 #if defined(CONFIG_GENERIC_EARLY_IOREMAP) && defined(CONFIG_MMU) 29 23 /* Arch-specific initialization */ 30 24 extern void early_ioremap_init(void);
+268
include/linux/damon.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * DAMON api 4 + * 5 + * Author: SeongJae Park <sjpark@amazon.de> 6 + */ 7 + 8 + #ifndef _DAMON_H_ 9 + #define _DAMON_H_ 10 + 11 + #include <linux/mutex.h> 12 + #include <linux/time64.h> 13 + #include <linux/types.h> 14 + 15 + /* Minimal region size. Every damon_region is aligned by this. */ 16 + #define DAMON_MIN_REGION PAGE_SIZE 17 + 18 + /** 19 + * struct damon_addr_range - Represents an address region of [@start, @end). 20 + * @start: Start address of the region (inclusive). 21 + * @end: End address of the region (exclusive). 22 + */ 23 + struct damon_addr_range { 24 + unsigned long start; 25 + unsigned long end; 26 + }; 27 + 28 + /** 29 + * struct damon_region - Represents a monitoring target region. 30 + * @ar: The address range of the region. 31 + * @sampling_addr: Address of the sample for the next access check. 32 + * @nr_accesses: Access frequency of this region. 33 + * @list: List head for siblings. 34 + */ 35 + struct damon_region { 36 + struct damon_addr_range ar; 37 + unsigned long sampling_addr; 38 + unsigned int nr_accesses; 39 + struct list_head list; 40 + }; 41 + 42 + /** 43 + * struct damon_target - Represents a monitoring target. 44 + * @id: Unique identifier for this target. 45 + * @nr_regions: Number of monitoring target regions of this target. 46 + * @regions_list: Head of the monitoring target regions of this target. 47 + * @list: List head for siblings. 48 + * 49 + * Each monitoring context could have multiple targets. For example, a context 50 + * for virtual memory address spaces could have multiple target processes. The 51 + * @id of each target should be unique among the targets of the context. For 52 + * example, in the virtual address monitoring context, it could be a pidfd or 53 + * an address of an mm_struct. 
54 + */ 55 + struct damon_target { 56 + unsigned long id; 57 + unsigned int nr_regions; 58 + struct list_head regions_list; 59 + struct list_head list; 60 + }; 61 + 62 + struct damon_ctx; 63 + 64 + /** 65 + * struct damon_primitive Monitoring primitives for given use cases. 66 + * 67 + * @init: Initialize primitive-internal data structures. 68 + * @update: Update primitive-internal data structures. 69 + * @prepare_access_checks: Prepare next access check of target regions. 70 + * @check_accesses: Check the accesses to target regions. 71 + * @reset_aggregated: Reset aggregated accesses monitoring results. 72 + * @target_valid: Determine if the target is valid. 73 + * @cleanup: Clean up the context. 74 + * 75 + * DAMON can be extended for various address spaces and usages. For this, 76 + * users should register the low level primitives for their target address 77 + * space and usecase via the &damon_ctx.primitive. Then, the monitoring thread 78 + * (&damon_ctx.kdamond) calls @init and @prepare_access_checks before starting 79 + * the monitoring, @update after each &damon_ctx.primitive_update_interval, and 80 + * @check_accesses, @target_valid and @prepare_access_checks after each 81 + * &damon_ctx.sample_interval. Finally, @reset_aggregated is called after each 82 + * &damon_ctx.aggr_interval. 83 + * 84 + * @init should initialize primitive-internal data structures. For example, 85 + * this could be used to construct proper monitoring target regions and link 86 + * those to @damon_ctx.adaptive_targets. 87 + * @update should update the primitive-internal data structures. For example, 88 + * this could be used to update monitoring target regions for current status. 89 + * @prepare_access_checks should manipulate the monitoring regions to be 90 + * prepared for the next access check. 91 + * @check_accesses should check the accesses to each region that made after the 92 + * last preparation and update the number of observed accesses of each region. 
93 + * It should also return max number of observed accesses that made as a result 94 + * of its update. The value will be used for regions adjustment threshold. 95 + * @reset_aggregated should reset the access monitoring results that aggregated 96 + * by @check_accesses. 97 + * @target_valid should check whether the target is still valid for the 98 + * monitoring. 99 + * @cleanup is called from @kdamond just before its termination. 100 + */ 101 + struct damon_primitive { 102 + void (*init)(struct damon_ctx *context); 103 + void (*update)(struct damon_ctx *context); 104 + void (*prepare_access_checks)(struct damon_ctx *context); 105 + unsigned int (*check_accesses)(struct damon_ctx *context); 106 + void (*reset_aggregated)(struct damon_ctx *context); 107 + bool (*target_valid)(void *target); 108 + void (*cleanup)(struct damon_ctx *context); 109 + }; 110 + 111 + /* 112 + * struct damon_callback Monitoring events notification callbacks. 113 + * 114 + * @before_start: Called before starting the monitoring. 115 + * @after_sampling: Called after each sampling. 116 + * @after_aggregation: Called after each aggregation. 117 + * @before_terminate: Called before terminating the monitoring. 118 + * @private: User private data. 119 + * 120 + * The monitoring thread (&damon_ctx.kdamond) calls @before_start and 121 + * @before_terminate just before starting and finishing the monitoring, 122 + * respectively. Therefore, those are good places for installing and cleaning 123 + * @private. 124 + * 125 + * The monitoring thread calls @after_sampling and @after_aggregation for each 126 + * of the sampling intervals and aggregation intervals, respectively. 127 + * Therefore, users can safely access the monitoring results without additional 128 + * protection. For the reason, users are recommended to use these callback for 129 + * the accesses to the results. 130 + * 131 + * If any callback returns non-zero, monitoring stops. 
132 + */ 133 + struct damon_callback { 134 + void *private; 135 + 136 + int (*before_start)(struct damon_ctx *context); 137 + int (*after_sampling)(struct damon_ctx *context); 138 + int (*after_aggregation)(struct damon_ctx *context); 139 + int (*before_terminate)(struct damon_ctx *context); 140 + }; 141 + 142 + /** 143 + * struct damon_ctx - Represents a context for each monitoring. This is the 144 + * main interface that allows users to set the attributes and get the results 145 + * of the monitoring. 146 + * 147 + * @sample_interval: The time between access samplings. 148 + * @aggr_interval: The time between monitor results aggregations. 149 + * @primitive_update_interval: The time between monitoring primitive updates. 150 + * 151 + * For each @sample_interval, DAMON checks whether each region is accessed or 152 + * not. It aggregates and keeps the access information (number of accesses to 153 + * each region) for @aggr_interval time. DAMON also checks whether the target 154 + * memory regions need update (e.g., by ``mmap()`` calls from the application, 155 + * in case of virtual memory monitoring) and applies the changes for each 156 + * @primitive_update_interval. All time intervals are in micro-seconds. 157 + * Please refer to &struct damon_primitive and &struct damon_callback for more 158 + * detail. 159 + * 160 + * @kdamond: Kernel thread who does the monitoring. 161 + * @kdamond_stop: Notifies whether kdamond should stop. 162 + * @kdamond_lock: Mutex for the synchronizations with @kdamond. 163 + * 164 + * For each monitoring context, one kernel thread for the monitoring is 165 + * created. The pointer to the thread is stored in @kdamond. 166 + * 167 + * Once started, the monitoring thread runs until explicitly required to be 168 + * terminated or every monitoring target is invalid. The validity of the 169 + * targets is checked via the &damon_primitive.target_valid of @primitive. 
The 170 + * termination can also be explicitly requested by writing non-zero to 171 + * @kdamond_stop. The thread sets @kdamond to NULL when it terminates. 172 + * Therefore, users can know whether the monitoring is ongoing or terminated by 173 + * reading @kdamond. Reads and writes to @kdamond and @kdamond_stop from 174 + * outside of the monitoring thread must be protected by @kdamond_lock. 175 + * 176 + * Note that the monitoring thread protects only @kdamond and @kdamond_stop via 177 + * @kdamond_lock. Accesses to other fields must be protected by themselves. 178 + * 179 + * @primitive: Set of monitoring primitives for given use cases. 180 + * @callback: Set of callbacks for monitoring events notifications. 181 + * 182 + * @min_nr_regions: The minimum number of adaptive monitoring regions. 183 + * @max_nr_regions: The maximum number of adaptive monitoring regions. 184 + * @adaptive_targets: Head of monitoring targets (&damon_target) list. 185 + */ 186 + struct damon_ctx { 187 + unsigned long sample_interval; 188 + unsigned long aggr_interval; 189 + unsigned long primitive_update_interval; 190 + 191 + /* private: internal use only */ 192 + struct timespec64 last_aggregation; 193 + struct timespec64 last_primitive_update; 194 + 195 + /* public: */ 196 + struct task_struct *kdamond; 197 + bool kdamond_stop; 198 + struct mutex kdamond_lock; 199 + 200 + struct damon_primitive primitive; 201 + struct damon_callback callback; 202 + 203 + unsigned long min_nr_regions; 204 + unsigned long max_nr_regions; 205 + struct list_head adaptive_targets; 206 + }; 207 + 208 + #define damon_next_region(r) \ 209 + (container_of(r->list.next, struct damon_region, list)) 210 + 211 + #define damon_prev_region(r) \ 212 + (container_of(r->list.prev, struct damon_region, list)) 213 + 214 + #define damon_for_each_region(r, t) \ 215 + list_for_each_entry(r, &t->regions_list, list) 216 + 217 + #define damon_for_each_region_safe(r, next, t) \ 218 + list_for_each_entry_safe(r, next, 
&t->regions_list, list) 219 + 220 + #define damon_for_each_target(t, ctx) \ 221 + list_for_each_entry(t, &(ctx)->adaptive_targets, list) 222 + 223 + #define damon_for_each_target_safe(t, next, ctx) \ 224 + list_for_each_entry_safe(t, next, &(ctx)->adaptive_targets, list) 225 + 226 + #ifdef CONFIG_DAMON 227 + 228 + struct damon_region *damon_new_region(unsigned long start, unsigned long end); 229 + inline void damon_insert_region(struct damon_region *r, 230 + struct damon_region *prev, struct damon_region *next, 231 + struct damon_target *t); 232 + void damon_add_region(struct damon_region *r, struct damon_target *t); 233 + void damon_destroy_region(struct damon_region *r, struct damon_target *t); 234 + 235 + struct damon_target *damon_new_target(unsigned long id); 236 + void damon_add_target(struct damon_ctx *ctx, struct damon_target *t); 237 + void damon_free_target(struct damon_target *t); 238 + void damon_destroy_target(struct damon_target *t); 239 + unsigned int damon_nr_regions(struct damon_target *t); 240 + 241 + struct damon_ctx *damon_new_ctx(void); 242 + void damon_destroy_ctx(struct damon_ctx *ctx); 243 + int damon_set_targets(struct damon_ctx *ctx, 244 + unsigned long *ids, ssize_t nr_ids); 245 + int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, 246 + unsigned long aggr_int, unsigned long primitive_upd_int, 247 + unsigned long min_nr_reg, unsigned long max_nr_reg); 248 + int damon_nr_running_ctxs(void); 249 + 250 + int damon_start(struct damon_ctx **ctxs, int nr_ctxs); 251 + int damon_stop(struct damon_ctx **ctxs, int nr_ctxs); 252 + 253 + #endif /* CONFIG_DAMON */ 254 + 255 + #ifdef CONFIG_DAMON_VADDR 256 + 257 + /* Monitoring primitives for virtual memory address spaces */ 258 + void damon_va_init(struct damon_ctx *ctx); 259 + void damon_va_update(struct damon_ctx *ctx); 260 + void damon_va_prepare_access_checks(struct damon_ctx *ctx); 261 + unsigned int damon_va_check_accesses(struct damon_ctx *ctx); 262 + bool 
damon_va_target_valid(void *t); 263 + void damon_va_cleanup(struct damon_ctx *ctx); 264 + void damon_va_set_primitives(struct damon_ctx *ctx); 265 + 266 + #endif /* CONFIG_DAMON_VADDR */ 267 + 268 + #endif /* _DAMON_H */
+22 -5
include/linux/highmem-internal.h
··· 90 90 91 91 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot) 92 92 { 93 - preempt_disable(); 93 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 94 + migrate_disable(); 95 + else 96 + preempt_disable(); 97 + 94 98 pagefault_disable(); 95 99 return __kmap_local_page_prot(page, prot); 96 100 } ··· 106 102 107 103 static inline void *kmap_atomic_pfn(unsigned long pfn) 108 104 { 109 - preempt_disable(); 105 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 106 + migrate_disable(); 107 + else 108 + preempt_disable(); 109 + 110 110 pagefault_disable(); 111 111 return __kmap_local_pfn_prot(pfn, kmap_prot); 112 112 } ··· 119 111 { 120 112 kunmap_local_indexed(addr); 121 113 pagefault_enable(); 122 - preempt_enable(); 114 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 115 + migrate_enable(); 116 + else 117 + preempt_enable(); 123 118 } 124 119 125 120 unsigned int __nr_free_highpages(void); ··· 190 179 191 180 static inline void *kmap_atomic(struct page *page) 192 181 { 193 - preempt_disable(); 182 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 183 + migrate_disable(); 184 + else 185 + preempt_disable(); 194 186 pagefault_disable(); 195 187 return page_address(page); 196 188 } ··· 214 200 kunmap_flush_on_unmap(addr); 215 201 #endif 216 202 pagefault_enable(); 217 - preempt_enable(); 203 + if (IS_ENABLED(CONFIG_PREEMPT_RT)) 204 + migrate_enable(); 205 + else 206 + preempt_enable(); 218 207 } 219 208 220 209 static inline unsigned int nr_free_highpages(void) { return 0; }
+54 -1
include/linux/memory.h
··· 23 23 24 24 #define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS) 25 25 26 + /** 27 + * struct memory_group - a logical group of memory blocks 28 + * @nid: The node id for all memory blocks inside the memory group. 29 + * @blocks: List of all memory blocks belonging to this memory group. 30 + * @present_kernel_pages: Present (online) memory outside ZONE_MOVABLE of this 31 + * memory group. 32 + * @present_movable_pages: Present (online) memory in ZONE_MOVABLE of this 33 + * memory group. 34 + * @is_dynamic: The memory group type: static vs. dynamic 35 + * @s.max_pages: Valid with &memory_group.is_dynamic == false. The maximum 36 + * number of pages we'll have in this static memory group. 37 + * @d.unit_pages: Valid with &memory_group.is_dynamic == true. Unit in pages 38 + * in which memory is added/removed in this dynamic memory group. 39 + * This granularity defines the alignment of a unit in physical 40 + * address space; it has to be at least as big as a single 41 + * memory block. 42 + * 43 + * A memory group logically groups memory blocks; each memory block 44 + * belongs to at most one memory group. A memory group corresponds to 45 + * a memory device, such as a DIMM or a NUMA node, which spans multiple 46 + * memory blocks and might even span multiple non-contiguous physical memory 47 + * ranges. 48 + * 49 + * Modification of members after registration is serialized by memory 50 + * hot(un)plug code. 51 + */ 52 + struct memory_group { 53 + int nid; 54 + struct list_head memory_blocks; 55 + unsigned long present_kernel_pages; 56 + unsigned long present_movable_pages; 57 + bool is_dynamic; 58 + union { 59 + struct { 60 + unsigned long max_pages; 61 + } s; 62 + struct { 63 + unsigned long unit_pages; 64 + } d; 65 + }; 66 + }; 67 + 26 68 struct memory_block { 27 69 unsigned long start_section_nr; 28 70 unsigned long state; /* serialized by the dev->lock */ ··· 76 34 * lay at the beginning of the memory block. 
77 35 */ 78 36 unsigned long nr_vmemmap_pages; 37 + struct memory_group *group; /* group (if any) for this block */ 38 + struct list_head group_next; /* next block inside memory group */ 79 39 }; 80 40 81 41 int arch_get_memory_phys_device(unsigned long start_pfn); ··· 130 86 extern int register_memory_notifier(struct notifier_block *nb); 131 87 extern void unregister_memory_notifier(struct notifier_block *nb); 132 88 int create_memory_block_devices(unsigned long start, unsigned long size, 133 - unsigned long vmemmap_pages); 89 + unsigned long vmemmap_pages, 90 + struct memory_group *group); 134 91 void remove_memory_block_devices(unsigned long start, unsigned long size); 135 92 extern void memory_dev_init(void); 136 93 extern int memory_notify(unsigned long val, void *v); ··· 141 96 void *arg, walk_memory_blocks_func_t func); 142 97 extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func); 143 98 #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT) 99 + 100 + extern int memory_group_register_static(int nid, unsigned long max_pages); 101 + extern int memory_group_register_dynamic(int nid, unsigned long unit_pages); 102 + extern int memory_group_unregister(int mgid); 103 + struct memory_group *memory_group_find_by_id(int mgid); 104 + typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *); 105 + int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func, 106 + struct memory_group *excluded, void *arg); 144 107 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ 145 108 146 109 #ifdef CONFIG_MEMORY_HOTPLUG
+22 -12
include/linux/memory_hotplug.h
··· 12 12 struct pglist_data; 13 13 struct mem_section; 14 14 struct memory_block; 15 + struct memory_group; 15 16 struct resource; 16 17 struct vmem_altmap; 17 18 ··· 51 50 * Only selected architectures support it with SPARSE_VMEMMAP. 52 51 */ 53 52 #define MHP_MEMMAP_ON_MEMORY ((__force mhp_t)BIT(1)) 53 + /* 54 + * The nid field specifies a memory group id (mgid) instead. The memory group 55 + * implies the node id (nid). 56 + */ 57 + #define MHP_NID_IS_MGID ((__force mhp_t)BIT(2)) 54 58 55 59 /* 56 60 * Extended parameters for memory hotplug: ··· 101 95 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages); 102 96 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages); 103 97 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro); 104 - extern void adjust_present_page_count(struct zone *zone, long nr_pages); 98 + extern void adjust_present_page_count(struct page *page, 99 + struct memory_group *group, 100 + long nr_pages); 105 101 /* VM interface that may be used by firmware interface */ 106 102 extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, 107 103 struct zone *zone); 108 104 extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages); 109 105 extern int online_pages(unsigned long pfn, unsigned long nr_pages, 110 - struct zone *zone); 106 + struct zone *zone, struct memory_group *group); 111 107 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn, 112 108 unsigned long end_pfn); 113 109 extern void __offline_isolated_pages(unsigned long start_pfn, ··· 138 130 return movable_node_enabled; 139 131 } 140 132 141 - extern void arch_remove_memory(int nid, u64 start, u64 size, 142 - struct vmem_altmap *altmap); 133 + extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap); 143 134 extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages, 144 135 struct vmem_altmap *altmap); 145 136 ··· 299 292 
#ifdef CONFIG_MEMORY_HOTREMOVE 300 293 301 294 extern void try_offline_node(int nid); 302 - extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); 303 - extern int remove_memory(int nid, u64 start, u64 size); 304 - extern void __remove_memory(int nid, u64 start, u64 size); 305 - extern int offline_and_remove_memory(int nid, u64 start, u64 size); 295 + extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages, 296 + struct memory_group *group); 297 + extern int remove_memory(u64 start, u64 size); 298 + extern void __remove_memory(u64 start, u64 size); 299 + extern int offline_and_remove_memory(u64 start, u64 size); 306 300 307 301 #else 308 302 static inline void try_offline_node(int nid) {} 309 303 310 - static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages) 304 + static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages, 305 + struct memory_group *group) 311 306 { 312 307 return -EINVAL; 313 308 } 314 309 315 - static inline int remove_memory(int nid, u64 start, u64 size) 310 + static inline int remove_memory(u64 start, u64 size) 316 311 { 317 312 return -EBUSY; 318 313 } 319 314 320 - static inline void __remove_memory(int nid, u64 start, u64 size) {} 315 + static inline void __remove_memory(u64 start, u64 size) {} 321 316 #endif /* CONFIG_MEMORY_HOTREMOVE */ 322 317 323 318 extern void set_zone_contiguous(struct zone *zone); ··· 348 339 unsigned long map_offset, struct vmem_altmap *altmap); 349 340 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, 350 341 unsigned long pnum); 351 - extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, 342 + extern struct zone *zone_for_pfn_range(int online_type, int nid, 343 + struct memory_group *group, unsigned long start_pfn, 352 344 unsigned long nr_pages); 353 345 extern int arch_create_linear_mapping(int nid, u64 start, u64 size, 354 346 struct mhp_params *params);
+7 -12
include/linux/mmzone.h
··· 540 540 * is calculated as: 541 541 * present_pages = spanned_pages - absent_pages(pages in holes); 542 542 * 543 + * present_early_pages is present pages existing within the zone 544 + * located on memory available since early boot, excluding hotplugged 545 + * memory. 546 + * 543 547 * managed_pages is present pages managed by the buddy system, which 544 548 * is calculated as (reserved_pages includes pages allocated by the 545 549 * bootmem allocator): ··· 576 572 atomic_long_t managed_pages; 577 573 unsigned long spanned_pages; 578 574 unsigned long present_pages; 575 + #if defined(CONFIG_MEMORY_HOTPLUG) 576 + unsigned long present_early_pages; 577 + #endif 579 578 #ifdef CONFIG_CMA 580 579 unsigned long cma_pages; 581 580 #endif ··· 1531 1524 #define pfn_in_present_section pfn_valid 1532 1525 #define subsection_map_init(_pfn, _nr_pages) do {} while (0) 1533 1526 #endif /* CONFIG_SPARSEMEM */ 1534 - 1535 - /* 1536 - * If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we 1537 - * need to check pfn validity within that MAX_ORDER_NR_PAGES block. 1538 - * pfn_valid_within() should be used in this case; we optimise this away 1539 - * when we have no holes within a MAX_ORDER_NR_PAGES block. 1540 - */ 1541 - #ifdef CONFIG_HOLES_IN_ZONE 1542 - #define pfn_valid_within(pfn) pfn_valid(pfn) 1543 - #else 1544 - #define pfn_valid_within(pfn) (1) 1545 - #endif 1546 1527 1547 1528 #endif /* !__GENERATING_BOUNDS.H */ 1548 1529 #endif /* !__ASSEMBLY__ */
+1 -1
include/linux/once.h
··· 16 16 * out the condition into a nop. DO_ONCE() guarantees type safety of 17 17 * arguments! 18 18 * 19 - * Not that the following is not equivalent ... 19 + * Note that the following is not equivalent ... 20 20 * 21 21 * DO_ONCE(func, arg); 22 22 * DO_ONCE(func, arg);
+5 -3
include/linux/page-flags.h
··· 131 131 #ifdef CONFIG_MEMORY_FAILURE 132 132 PG_hwpoison, /* hardware poisoned page. Don't touch */ 133 133 #endif 134 - #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) 134 + #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) 135 135 PG_young, 136 136 PG_idle, 137 137 #endif ··· 177 177 /* Only valid for buddy pages. Used to track pages that are reported */ 178 178 PG_reported = PG_uptodate, 179 179 }; 180 + 181 + #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1) 180 182 181 183 #ifndef __GENERATING_BOUNDS_H 182 184 ··· 441 439 #define __PG_HWPOISON 0 442 440 #endif 443 441 444 - #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) 442 + #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) 445 443 TESTPAGEFLAG(Young, young, PF_ANY) 446 444 SETPAGEFLAG(Young, young, PF_ANY) 447 445 TESTCLEARFLAG(Young, young, PF_ANY) ··· 833 831 * alloc-free cycle to prevent from reusing the page. 834 832 */ 835 833 #define PAGE_FLAGS_CHECK_AT_PREP \ 836 - (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON) 834 + (PAGEFLAGS_MASK & ~__PG_HWPOISON) 837 835 838 836 #define PAGE_FLAGS_PRIVATE \ 839 837 (1UL << PG_private | 1UL << PG_private_2)
+1 -1
include/linux/page_ext.h
··· 19 19 enum page_ext_flags { 20 20 PAGE_EXT_OWNER, 21 21 PAGE_EXT_OWNER_ALLOCATED, 22 - #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT) 22 + #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT) 23 23 PAGE_EXT_YOUNG, 24 24 PAGE_EXT_IDLE, 25 25 #endif
+3 -3
include/linux/page_idle.h
··· 6 6 #include <linux/page-flags.h> 7 7 #include <linux/page_ext.h> 8 8 9 - #ifdef CONFIG_IDLE_PAGE_TRACKING 9 + #ifdef CONFIG_PAGE_IDLE_FLAG 10 10 11 11 #ifdef CONFIG_64BIT 12 12 static inline bool page_is_young(struct page *page) ··· 106 106 } 107 107 #endif /* CONFIG_64BIT */ 108 108 109 - #else /* !CONFIG_IDLE_PAGE_TRACKING */ 109 + #else /* !CONFIG_PAGE_IDLE_FLAG */ 110 110 111 111 static inline bool page_is_young(struct page *page) 112 112 { ··· 135 135 { 136 136 } 137 137 138 - #endif /* CONFIG_IDLE_PAGE_TRACKING */ 138 + #endif /* CONFIG_PAGE_IDLE_FLAG */ 139 139 140 140 #endif /* _LINUX_MM_PAGE_IDLE_H */
+3 -4
include/linux/pagemap.h
··· 521 521 */ 522 522 static inline pgoff_t page_to_index(struct page *page) 523 523 { 524 - pgoff_t pgoff; 524 + struct page *head; 525 525 526 526 if (likely(!PageTransTail(page))) 527 527 return page->index; 528 528 529 + head = compound_head(page); 529 530 /* 530 531 * We don't initialize ->index for tail pages: calculate based on 531 532 * head page 532 533 */ 533 - pgoff = compound_head(page)->index; 534 - pgoff += page - compound_head(page); 535 - return pgoff; 534 + return head->index + page - head; 536 535 } 537 536 538 537 extern pgoff_t hugetlb_basepage_index(struct page *page);
+2 -1
include/linux/sched/user.h
··· 4 4 5 5 #include <linux/uidgid.h> 6 6 #include <linux/atomic.h> 7 + #include <linux/percpu_counter.h> 7 8 #include <linux/refcount.h> 8 9 #include <linux/ratelimit.h> 9 10 ··· 14 13 struct user_struct { 15 14 refcount_t __count; /* reference count */ 16 15 #ifdef CONFIG_EPOLL 17 - atomic_long_t epoll_watches; /* The number of file descriptors currently watched */ 16 + struct percpu_counter epoll_watches; /* The number of file descriptors currently watched */ 18 17 #endif 19 18 unsigned long unix_inflight; /* How many files in flight in unix sockets */ 20 19 atomic_long_t pipe_bufs; /* how many pages are allocated in pipe buffers */
+1 -1
include/linux/threads.h
··· 38 38 * Define a minimum number of pids per cpu. Heuristically based 39 39 * on original pid max of 32k for 32 cpus. Also, increase the 40 40 * minimum settable value for pid_max on the running system based 41 - * on similar defaults. See kernel/pid.c:pidmap_init() for details. 41 + * on similar defaults. See kernel/pid.c:pid_idr_init() for details. 42 42 */ 43 43 #define PIDS_PER_CPU_DEFAULT 1024 44 44 #define PIDS_PER_CPU_MIN 8
+7 -3
include/linux/units.h
··· 20 20 #define PICO 1000000000000ULL 21 21 #define FEMTO 1000000000000000ULL 22 22 23 - #define MILLIWATT_PER_WATT 1000L 24 - #define MICROWATT_PER_MILLIWATT 1000L 25 - #define MICROWATT_PER_WATT 1000000L 23 + #define HZ_PER_KHZ 1000UL 24 + #define KHZ_PER_MHZ 1000UL 25 + #define HZ_PER_MHZ 1000000UL 26 + 27 + #define MILLIWATT_PER_WATT 1000UL 28 + #define MICROWATT_PER_MILLIWATT 1000UL 29 + #define MICROWATT_PER_WATT 1000000UL 26 30 27 31 #define ABSOLUTE_ZERO_MILLICELSIUS -273150 28 32
-3
include/linux/vmalloc.h
··· 225 225 } 226 226 227 227 #ifdef CONFIG_MMU 228 - int vmap_range(unsigned long addr, unsigned long end, 229 - phys_addr_t phys_addr, pgprot_t prot, 230 - unsigned int max_page_shift); 231 228 void vunmap_range(unsigned long addr, unsigned long end); 232 229 static inline void set_vm_flush_reset_perms(void *addr) 233 230 {
+43
include/trace/events/damon.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #undef TRACE_SYSTEM 3 + #define TRACE_SYSTEM damon 4 + 5 + #if !defined(_TRACE_DAMON_H) || defined(TRACE_HEADER_MULTI_READ) 6 + #define _TRACE_DAMON_H 7 + 8 + #include <linux/damon.h> 9 + #include <linux/types.h> 10 + #include <linux/tracepoint.h> 11 + 12 + TRACE_EVENT(damon_aggregated, 13 + 14 + TP_PROTO(struct damon_target *t, struct damon_region *r, 15 + unsigned int nr_regions), 16 + 17 + TP_ARGS(t, r, nr_regions), 18 + 19 + TP_STRUCT__entry( 20 + __field(unsigned long, target_id) 21 + __field(unsigned int, nr_regions) 22 + __field(unsigned long, start) 23 + __field(unsigned long, end) 24 + __field(unsigned int, nr_accesses) 25 + ), 26 + 27 + TP_fast_assign( 28 + __entry->target_id = t->id; 29 + __entry->nr_regions = nr_regions; 30 + __entry->start = r->ar.start; 31 + __entry->end = r->ar.end; 32 + __entry->nr_accesses = r->nr_accesses; 33 + ), 34 + 35 + TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u", 36 + __entry->target_id, __entry->nr_regions, 37 + __entry->start, __entry->end, __entry->nr_accesses) 38 + ); 39 + 40 + #endif /* _TRACE_DAMON_H */ 41 + 42 + /* This part must be outside protection */ 43 + #include <trace/define_trace.h>
+1 -1
include/trace/events/mmflags.h
··· 75 75 #define IF_HAVE_PG_HWPOISON(flag,string) 76 76 #endif 77 77 78 - #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) 78 + #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) 79 79 #define IF_HAVE_PG_IDLE(flag,string) ,{1UL << flag, string} 80 80 #else 81 81 #define IF_HAVE_PG_IDLE(flag,string)
+2 -2
include/trace/events/page_ref.h
··· 38 38 39 39 TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d val=%d", 40 40 __entry->pfn, 41 - show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)), 41 + show_page_flags(__entry->flags & PAGEFLAGS_MASK), 42 42 __entry->count, 43 43 __entry->mapcount, __entry->mapping, __entry->mt, 44 44 __entry->val) ··· 88 88 89 89 TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d val=%d ret=%d", 90 90 __entry->pfn, 91 - show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)), 91 + show_page_flags(__entry->flags & PAGEFLAGS_MASK), 92 92 __entry->count, 93 93 __entry->mapcount, __entry->mapping, __entry->mt, 94 94 __entry->val, __entry->ret)
+2
init/initramfs.c
··· 15 15 #include <linux/mm.h> 16 16 #include <linux/namei.h> 17 17 #include <linux/init_syscalls.h> 18 + #include <linux/umh.h> 18 19 19 20 static ssize_t __init xwrite(struct file *file, const char *p, size_t count, 20 21 loff_t *pos) ··· 728 727 { 729 728 initramfs_cookie = async_schedule_domain(do_populate_rootfs, NULL, 730 729 &initramfs_domain); 730 + usermodehelper_enable(); 731 731 if (!initramfs_async) 732 732 wait_for_initramfs(); 733 733 return 0;
+2 -1
init/main.c
··· 777 777 778 778 void __init __weak pgtable_cache_init(void) { } 779 779 780 + void __init __weak trap_init(void) { } 781 + 780 782 bool initcall_debug; 781 783 core_param(initcall_debug, initcall_debug, bool, 0644); 782 784 ··· 1394 1392 driver_init(); 1395 1393 init_irq_proc(); 1396 1394 do_ctors(); 1397 - usermodehelper_enable(); 1398 1395 do_initcalls(); 1399 1396 } 1400 1397
+2
init/noinitramfs.c
··· 10 10 #include <linux/kdev_t.h> 11 11 #include <linux/syscalls.h> 12 12 #include <linux/init_syscalls.h> 13 + #include <linux/umh.h> 13 14 14 15 /* 15 16 * Create a simple rootfs that is similar to the default initramfs ··· 19 18 { 20 19 int err; 21 20 21 + usermodehelper_enable(); 22 22 err = init_mkdir("/dev", 0755); 23 23 if (err < 0) 24 24 goto out;
+4 -12
ipc/util.c
··· 788 788 static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos, 789 789 loff_t *new_pos) 790 790 { 791 - struct kern_ipc_perm *ipc; 792 - int total, id; 791 + struct kern_ipc_perm *ipc = NULL; 792 + int max_idx = ipc_get_maxidx(ids); 793 793 794 - total = 0; 795 - for (id = 0; id < pos && total < ids->in_use; id++) { 796 - ipc = idr_find(&ids->ipcs_idr, id); 797 - if (ipc != NULL) 798 - total++; 799 - } 800 - 801 - ipc = NULL; 802 - if (total >= ids->in_use) 794 + if (max_idx == -1 || pos > max_idx) 803 795 goto out; 804 796 805 - for (; pos < ipc_mni; pos++) { 797 + for (; pos <= max_idx; pos++) { 806 798 ipc = idr_find(&ids->ipcs_idr, pos); 807 799 if (ipc != NULL) { 808 800 rcu_read_lock();
+1 -1
kernel/acct.c
··· 478 478 /* 479 479 * Accounting records are not subject to resource limits. 480 480 */ 481 - flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; 481 + flim = rlimit(RLIMIT_FSIZE); 482 482 current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; 483 483 /* Perform file operations on behalf of whoever enabled accounting */ 484 484 orig_cred = override_creds(file->f_cred);
-2
kernel/fork.c
··· 1262 1262 rcu_read_unlock(); 1263 1263 return exe_file; 1264 1264 } 1265 - EXPORT_SYMBOL(get_mm_exe_file); 1266 1265 1267 1266 /** 1268 1267 * get_task_exe_file - acquire a reference to the task's executable file ··· 1284 1285 task_unlock(task); 1285 1286 return exe_file; 1286 1287 } 1287 - EXPORT_SYMBOL(get_task_exe_file); 1288 1288 1289 1289 /** 1290 1290 * get_task_mm - acquire a reference to the task's mm
+11 -10
kernel/profile.c
··· 41 41 #define NR_PROFILE_GRP (NR_PROFILE_HIT/PROFILE_GRPSZ) 42 42 43 43 static atomic_t *prof_buffer; 44 - static unsigned long prof_len, prof_shift; 44 + static unsigned long prof_len; 45 + static unsigned short int prof_shift; 45 46 46 47 int prof_on __read_mostly; 47 48 EXPORT_SYMBOL_GPL(prof_on); ··· 68 67 if (str[strlen(sleepstr)] == ',') 69 68 str += strlen(sleepstr) + 1; 70 69 if (get_option(&str, &par)) 71 - prof_shift = par; 72 - pr_info("kernel sleep profiling enabled (shift: %ld)\n", 70 + prof_shift = clamp(par, 0, BITS_PER_LONG - 1); 71 + pr_info("kernel sleep profiling enabled (shift: %u)\n", 73 72 prof_shift); 74 73 #else 75 74 pr_warn("kernel sleep profiling requires CONFIG_SCHEDSTATS\n"); ··· 79 78 if (str[strlen(schedstr)] == ',') 80 79 str += strlen(schedstr) + 1; 81 80 if (get_option(&str, &par)) 82 - prof_shift = par; 83 - pr_info("kernel schedule profiling enabled (shift: %ld)\n", 81 + prof_shift = clamp(par, 0, BITS_PER_LONG - 1); 82 + pr_info("kernel schedule profiling enabled (shift: %u)\n", 84 83 prof_shift); 85 84 } else if (!strncmp(str, kvmstr, strlen(kvmstr))) { 86 85 prof_on = KVM_PROFILING; 87 86 if (str[strlen(kvmstr)] == ',') 88 87 str += strlen(kvmstr) + 1; 89 88 if (get_option(&str, &par)) 90 - prof_shift = par; 91 - pr_info("kernel KVM profiling enabled (shift: %ld)\n", 89 + prof_shift = clamp(par, 0, BITS_PER_LONG - 1); 90 + pr_info("kernel KVM profiling enabled (shift: %u)\n", 92 91 prof_shift); 93 92 } else if (get_option(&str, &par)) { 94 - prof_shift = par; 93 + prof_shift = clamp(par, 0, BITS_PER_LONG - 1); 95 94 prof_on = CPU_PROFILING; 96 - pr_info("kernel profiling enabled (shift: %ld)\n", 95 + pr_info("kernel profiling enabled (shift: %u)\n", 97 96 prof_shift); 98 97 } 99 98 return 1; ··· 469 468 unsigned long p = *ppos; 470 469 ssize_t read; 471 470 char *pnt; 472 - unsigned int sample_step = 1 << prof_shift; 471 + unsigned long sample_step = 1UL << prof_shift; 473 472 474 473 profile_flip_buffers(); 475 474 if (p 
>= (prof_len+1)*sizeof(unsigned int))
-7
kernel/sys.c
··· 1930 1930 error = -EINVAL; 1931 1931 1932 1932 /* 1933 - * @brk should be after @end_data in traditional maps. 1934 - */ 1935 - if (prctl_map->start_brk <= prctl_map->end_data || 1936 - prctl_map->brk <= prctl_map->end_data) 1937 - goto out; 1938 - 1939 - /* 1940 1933 * Neither we should allow to override limits if they set. 1941 1934 */ 1942 1935 if (check_data_rlimit(rlimit(RLIMIT_DATA), prctl_map->brk,
+25
kernel/user.c
··· 129 129 return NULL; 130 130 } 131 131 132 + static int user_epoll_alloc(struct user_struct *up) 133 + { 134 + #ifdef CONFIG_EPOLL 135 + return percpu_counter_init(&up->epoll_watches, 0, GFP_KERNEL); 136 + #else 137 + return 0; 138 + #endif 139 + } 140 + 141 + static void user_epoll_free(struct user_struct *up) 142 + { 143 + #ifdef CONFIG_EPOLL 144 + percpu_counter_destroy(&up->epoll_watches); 145 + #endif 146 + } 147 + 132 148 /* IRQs are disabled and uidhash_lock is held upon function entry. 133 149 * IRQ state (as stored in flags) is restored and uidhash_lock released 134 150 * upon function exit. ··· 154 138 { 155 139 uid_hash_remove(up); 156 140 spin_unlock_irqrestore(&uidhash_lock, flags); 141 + user_epoll_free(up); 157 142 kmem_cache_free(uid_cachep, up); 158 143 } 159 144 ··· 202 185 203 186 new->uid = uid; 204 187 refcount_set(&new->__count, 1); 188 + if (user_epoll_alloc(new)) { 189 + kmem_cache_free(uid_cachep, new); 190 + return NULL; 191 + } 205 192 ratelimit_state_init(&new->ratelimit, HZ, 100); 206 193 ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE); 207 194 ··· 216 195 spin_lock_irq(&uidhash_lock); 217 196 up = uid_hash_find(uid, hashent); 218 197 if (up) { 198 + user_epoll_free(new); 219 199 kmem_cache_free(uid_cachep, new); 220 200 } else { 221 201 uid_hash_insert(new, hashent); ··· 237 215 238 216 for(n = 0; n < UIDHASH_SZ; ++n) 239 217 INIT_HLIST_HEAD(uidhash_table + n); 218 + 219 + if (user_epoll_alloc(&root_user)) 220 + panic("root_user epoll percpu counter alloc failed"); 240 221 241 222 /* Insert the root user immediately (init already runs as root) */ 242 223 spin_lock_irq(&uidhash_lock);
+4 -5
lib/Kconfig.debug
··· 1064 1064 depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH 1065 1065 select LOCKUP_DETECTOR 1066 1066 select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF 1067 - select HARDLOCKUP_DETECTOR_ARCH if HAVE_HARDLOCKUP_DETECTOR_ARCH 1068 1067 help 1069 1068 Say Y here to enable the kernel to act as a watchdog to detect 1070 1069 hard lockups. ··· 2060 2061 If unsure, say N. 2061 2062 2062 2063 config TEST_SORT 2063 - tristate "Array-based sort test" 2064 - depends on DEBUG_KERNEL || m 2064 + tristate "Array-based sort test" if !KUNIT_ALL_TESTS 2065 + depends on KUNIT 2066 + default KUNIT_ALL_TESTS 2065 2067 help 2066 2068 This option enables the self-test function of 'sort()' at boot, 2067 2069 or at module load time. ··· 2443 2443 2444 2444 config RATIONAL_KUNIT_TEST 2445 2445 tristate "KUnit test for rational.c" if !KUNIT_ALL_TESTS 2446 - depends on KUNIT 2447 - select RATIONAL 2446 + depends on KUNIT && RATIONAL 2448 2447 default KUNIT_ALL_TESTS 2449 2448 help 2450 2449 This builds the rational math unit test.
+2 -1
lib/dump_stack.c
··· 89 89 } 90 90 91 91 /** 92 - * dump_stack - dump the current task information and its stack trace 92 + * dump_stack_lvl - dump the current task information and its stack trace 93 + * @log_lvl: log level 93 94 * 94 95 * Architectures can override this implementation by implementing its own. 95 96 */
+6 -2
lib/iov_iter.c
··· 672 672 * _copy_mc_to_iter - copy to iter with source memory error exception handling 673 673 * @addr: source kernel address 674 674 * @bytes: total transfer length 675 - * @iter: destination iterator 675 + * @i: destination iterator 676 676 * 677 677 * The pmem driver deploys this for the dax operation 678 678 * (dax_copy_to_iter()) for dax reads (bypass page-cache and the ··· 690 690 * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies. 691 691 * Compare to copy_to_iter() where only ITER_IOVEC attempts might return 692 692 * a short copy. 693 + * 694 + * Return: number of bytes copied (may be %0) 693 695 */ 694 696 size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i) 695 697 { ··· 746 744 * _copy_from_iter_flushcache - write destination through cpu cache 747 745 * @addr: destination kernel address 748 746 * @bytes: total transfer length 749 - * @iter: source iterator 747 + * @i: source iterator 750 748 * 751 749 * The pmem driver arranges for filesystem-dax to use this facility via 752 750 * dax_copy_from_iter() for ensuring that writes to persistent memory ··· 755 753 * all iterator types. The _copy_from_iter_nocache() only attempts to 756 754 * bypass the cache for the ITER_IOVEC case, and on some archs may use 757 755 * instructions that strand dirty-data in the cache. 756 + * 757 + * Return: number of bytes copied (may be %0) 758 758 */ 759 759 size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i) 760 760 {
+1 -1
lib/math/Kconfig
··· 14 14 If unsure, say N. 15 15 16 16 config RATIONAL 17 - bool 17 + tristate
+3
lib/math/rational.c
··· 13 13 #include <linux/export.h> 14 14 #include <linux/minmax.h> 15 15 #include <linux/limits.h> 16 + #include <linux/module.h> 16 17 17 18 /* 18 19 * calculate best rational approximation for a given fraction ··· 107 106 } 108 107 109 108 EXPORT_SYMBOL(rational_best_approximation); 109 + 110 + MODULE_LICENSE("GPL v2");
+1 -1
lib/test_printf.c
··· 614 614 bool append = false; 615 615 int i; 616 616 617 - flags &= BIT(NR_PAGEFLAGS) - 1; 617 + flags &= PAGEFLAGS_MASK; 618 618 if (flags) { 619 619 page_flags |= flags; 620 620 snprintf(cmp_buf + size, BUF_SIZE - size, "%s", name);
+19 -21
lib/test_sort.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0-only 2 + 3 + #include <kunit/test.h> 4 + 2 5 #include <linux/sort.h> 3 6 #include <linux/slab.h> 4 7 #include <linux/module.h> ··· 10 7 11 8 #define TEST_LEN 1000 12 9 13 - static int __init cmpint(const void *a, const void *b) 10 + static int cmpint(const void *a, const void *b) 14 11 { 15 12 return *(int *)a - *(int *)b; 16 13 } 17 14 18 - static int __init test_sort_init(void) 15 + static void test_sort(struct kunit *test) 19 16 { 20 - int *a, i, r = 1, err = -ENOMEM; 17 + int *a, i, r = 1; 21 18 22 - a = kmalloc_array(TEST_LEN, sizeof(*a), GFP_KERNEL); 23 - if (!a) 24 - return err; 19 + a = kunit_kmalloc_array(test, TEST_LEN, sizeof(*a), GFP_KERNEL); 20 + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, a); 25 21 26 22 for (i = 0; i < TEST_LEN; i++) { 27 23 r = (r * 725861) % 6599; ··· 29 27 30 28 sort(a, TEST_LEN, sizeof(*a), cmpint, NULL); 31 29 32 - err = -EINVAL; 33 30 for (i = 0; i < TEST_LEN-1; i++) 34 - if (a[i] > a[i+1]) { 35 - pr_err("test has failed\n"); 36 - goto exit; 37 - } 38 - err = 0; 39 - pr_info("test passed\n"); 40 - exit: 41 - kfree(a); 42 - return err; 31 + KUNIT_ASSERT_LE(test, a[i], a[i + 1]); 43 32 } 44 33 45 - static void __exit test_sort_exit(void) 46 - { 47 - } 34 + static struct kunit_case sort_test_cases[] = { 35 + KUNIT_CASE(test_sort), 36 + {} 37 + }; 48 38 49 - module_init(test_sort_init); 50 - module_exit(test_sort_exit); 39 + static struct kunit_suite sort_test_suite = { 40 + .name = "lib_sort", 41 + .test_cases = sort_test_cases, 42 + }; 43 + 44 + kunit_test_suites(&sort_test_suite); 51 45 52 46 MODULE_LICENSE("GPL");
+1 -1
lib/vsprintf.c
··· 2019 2019 static 2020 2020 char *format_page_flags(char *buf, char *end, unsigned long flags) 2021 2021 { 2022 - unsigned long main_flags = flags & (BIT(NR_PAGEFLAGS) - 1); 2022 + unsigned long main_flags = flags & PAGEFLAGS_MASK; 2023 2023 bool append = false; 2024 2024 int i; 2025 2025
+11 -4
mm/Kconfig
··· 96 96 depends on MMU 97 97 bool 98 98 99 - config HOLES_IN_ZONE 100 - bool 101 - 102 99 # Don't discard allocated memory used to track "memory" and "reserved" memblocks 103 100 # after early boot, so it can still be used to test for validity of memory. 104 101 # Also, memblocks are updated with memory hot(un)plug. ··· 739 742 lifetime of the system until these kthreads finish the 740 743 initialisation. 741 744 745 + config PAGE_IDLE_FLAG 746 + bool 747 + select PAGE_EXTENSION if !64BIT 748 + help 749 + This adds PG_idle and PG_young flags to 'struct page'. PTE Accessed 750 + bit writers can set the state of the bit in the flags so that PTE 751 + Accessed bit readers may avoid disturbance. 752 + 742 753 config IDLE_PAGE_TRACKING 743 754 bool "Enable idle page tracking" 744 755 depends on SYSFS && MMU 745 - select PAGE_EXTENSION if !64BIT 756 + select PAGE_IDLE_FLAG 746 757 help 747 758 This feature allows to estimate the amount of user pages that have 748 759 not been touched during a given period of time. This information can ··· 893 888 894 889 config SECRETMEM 895 890 def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED 891 + 892 + source "mm/damon/Kconfig" 896 893 897 894 endmenu
+3 -1
mm/Makefile
··· 38 38 mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ 39 39 mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ 40 40 msync.o page_vma_mapped.o pagewalk.o \ 41 - pgtable-generic.o rmap.o vmalloc.o ioremap.o 41 + pgtable-generic.o rmap.o vmalloc.o 42 42 43 43 44 44 ifdef CONFIG_CROSS_MEMORY_ATTACH ··· 118 118 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o 119 119 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o 120 120 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o 121 + obj-$(CONFIG_DAMON) += damon/ 121 122 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o 122 123 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o 123 124 obj-$(CONFIG_ZONE_DEVICE) += memremap.o ··· 129 128 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o 130 129 obj-$(CONFIG_IO_MAPPING) += io-mapping.o 131 130 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o 131 + obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
+7 -13
mm/compaction.c
··· 306 306 * is necessary for the block to be a migration source/target. 307 307 */ 308 308 do { 309 - if (pfn_valid_within(pfn)) { 310 - if (check_source && PageLRU(page)) { 311 - clear_pageblock_skip(page); 312 - return true; 313 - } 309 + if (check_source && PageLRU(page)) { 310 + clear_pageblock_skip(page); 311 + return true; 312 + } 314 313 315 - if (check_target && PageBuddy(page)) { 316 - clear_pageblock_skip(page); 317 - return true; 318 - } 314 + if (check_target && PageBuddy(page)) { 315 + clear_pageblock_skip(page); 316 + return true; 319 317 } 320 318 321 319 page += (1 << PAGE_ALLOC_COSTLY_ORDER); ··· 583 585 break; 584 586 585 587 nr_scanned++; 586 - if (!pfn_valid_within(blockpfn)) 587 - goto isolate_fail; 588 588 589 589 /* 590 590 * For compound pages such as THP and hugetlbfs, we can save ··· 881 885 cond_resched(); 882 886 } 883 887 884 - if (!pfn_valid_within(low_pfn)) 885 - goto isolate_fail; 886 888 nr_scanned++; 887 889 888 890 page = pfn_to_page(low_pfn);
+68
mm/damon/Kconfig
··· 1 + # SPDX-License-Identifier: GPL-2.0-only 2 + 3 + menu "Data Access Monitoring" 4 + 5 + config DAMON 6 + bool "DAMON: Data Access Monitoring Framework" 7 + help 8 + This builds a framework that allows kernel subsystems to monitor 9 + access frequency of each memory region. The information can be useful 10 + for performance-centric DRAM level memory management. 11 + 12 + See https://damonitor.github.io/doc/html/latest-damon/index.html for 13 + more information. 14 + 15 + config DAMON_KUNIT_TEST 16 + bool "Test for damon" if !KUNIT_ALL_TESTS 17 + depends on DAMON && KUNIT=y 18 + default KUNIT_ALL_TESTS 19 + help 20 + This builds the DAMON Kunit test suite. 21 + 22 + For more information on KUnit and unit tests in general, please refer 23 + to the KUnit documentation. 24 + 25 + If unsure, say N. 26 + 27 + config DAMON_VADDR 28 + bool "Data access monitoring primitives for virtual address spaces" 29 + depends on DAMON && MMU 30 + select PAGE_IDLE_FLAG 31 + help 32 + This builds the default data access monitoring primitives for DAMON 33 + that works for virtual address spaces. 34 + 35 + config DAMON_VADDR_KUNIT_TEST 36 + bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS 37 + depends on DAMON_VADDR && KUNIT=y 38 + default KUNIT_ALL_TESTS 39 + help 40 + This builds the DAMON virtual addresses primitives Kunit test suite. 41 + 42 + For more information on KUnit and unit tests in general, please refer 43 + to the KUnit documentation. 44 + 45 + If unsure, say N. 46 + 47 + config DAMON_DBGFS 48 + bool "DAMON debugfs interface" 49 + depends on DAMON_VADDR && DEBUG_FS 50 + help 51 + This builds the debugfs interface for DAMON. The user space admins 52 + can use the interface for arbitrary data access monitoring. 53 + 54 + If unsure, say N. 
55 + 56 + config DAMON_DBGFS_KUNIT_TEST 57 + bool "Test for damon debugfs interface" if !KUNIT_ALL_TESTS 58 + depends on DAMON_DBGFS && KUNIT=y 59 + default KUNIT_ALL_TESTS 60 + help 61 + This builds the DAMON debugfs interface Kunit test suite. 62 + 63 + For more information on KUnit and unit tests in general, please refer 64 + to the KUnit documentation. 65 + 66 + If unsure, say N. 67 + 68 + endmenu
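Taken together, the new options form a small dependency chain: DAMON_DBGFS needs DAMON_VADDR, which needs DAMON and MMU. A .config fragment enabling the whole stack might look like this (illustrative; the test options additionally require CONFIG_KUNIT=y, and DAMON_DBGFS requires DEBUG_FS):

```
CONFIG_DAMON=y
CONFIG_DAMON_VADDR=y
CONFIG_DAMON_DBGFS=y
# optional, only meaningful with CONFIG_KUNIT=y
CONFIG_DAMON_KUNIT_TEST=y
```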
+5
mm/damon/Makefile
··· 1 + # SPDX-License-Identifier: GPL-2.0 2 + 3 + obj-$(CONFIG_DAMON) := core.o 4 + obj-$(CONFIG_DAMON_VADDR) += vaddr.o 5 + obj-$(CONFIG_DAMON_DBGFS) += dbgfs.o
+253
mm/damon/core-test.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Data Access Monitor Unit Tests 4 + * 5 + * Copyright 2019 Amazon.com, Inc. or its affiliates. All rights reserved. 6 + * 7 + * Author: SeongJae Park <sjpark@amazon.de> 8 + */ 9 + 10 + #ifdef CONFIG_DAMON_KUNIT_TEST 11 + 12 + #ifndef _DAMON_CORE_TEST_H 13 + #define _DAMON_CORE_TEST_H 14 + 15 + #include <kunit/test.h> 16 + 17 + static void damon_test_regions(struct kunit *test) 18 + { 19 + struct damon_region *r; 20 + struct damon_target *t; 21 + 22 + r = damon_new_region(1, 2); 23 + KUNIT_EXPECT_EQ(test, 1ul, r->ar.start); 24 + KUNIT_EXPECT_EQ(test, 2ul, r->ar.end); 25 + KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses); 26 + 27 + t = damon_new_target(42); 28 + KUNIT_EXPECT_EQ(test, 0u, damon_nr_regions(t)); 29 + 30 + damon_add_region(r, t); 31 + KUNIT_EXPECT_EQ(test, 1u, damon_nr_regions(t)); 32 + 33 + damon_del_region(r, t); 34 + KUNIT_EXPECT_EQ(test, 0u, damon_nr_regions(t)); 35 + 36 + damon_free_target(t); 37 + } 38 + 39 + static unsigned int nr_damon_targets(struct damon_ctx *ctx) 40 + { 41 + struct damon_target *t; 42 + unsigned int nr_targets = 0; 43 + 44 + damon_for_each_target(t, ctx) 45 + nr_targets++; 46 + 47 + return nr_targets; 48 + } 49 + 50 + static void damon_test_target(struct kunit *test) 51 + { 52 + struct damon_ctx *c = damon_new_ctx(); 53 + struct damon_target *t; 54 + 55 + t = damon_new_target(42); 56 + KUNIT_EXPECT_EQ(test, 42ul, t->id); 57 + KUNIT_EXPECT_EQ(test, 0u, nr_damon_targets(c)); 58 + 59 + damon_add_target(c, t); 60 + KUNIT_EXPECT_EQ(test, 1u, nr_damon_targets(c)); 61 + 62 + damon_destroy_target(t); 63 + KUNIT_EXPECT_EQ(test, 0u, nr_damon_targets(c)); 64 + 65 + damon_destroy_ctx(c); 66 + } 67 + 68 + /* 69 + * Test kdamond_reset_aggregated() 70 + * 71 + * DAMON checks access to each region and aggregates this information as the 72 + * access frequency of each region. In detail, it increases '->nr_accesses' of 73 + * regions that an access has confirmed. 
'kdamond_reset_aggregated()' flushes 74 + * the aggregated information ('->nr_accesses' of each regions) to the result 75 + * buffer. As a result of the flushing, the '->nr_accesses' of regions are 76 + * initialized to zero. 77 + */ 78 + static void damon_test_aggregate(struct kunit *test) 79 + { 80 + struct damon_ctx *ctx = damon_new_ctx(); 81 + unsigned long target_ids[] = {1, 2, 3}; 82 + unsigned long saddr[][3] = {{10, 20, 30}, {5, 42, 49}, {13, 33, 55} }; 83 + unsigned long eaddr[][3] = {{15, 27, 40}, {31, 45, 55}, {23, 44, 66} }; 84 + unsigned long accesses[][3] = {{42, 95, 84}, {10, 20, 30}, {0, 1, 2} }; 85 + struct damon_target *t; 86 + struct damon_region *r; 87 + int it, ir; 88 + 89 + damon_set_targets(ctx, target_ids, 3); 90 + 91 + it = 0; 92 + damon_for_each_target(t, ctx) { 93 + for (ir = 0; ir < 3; ir++) { 94 + r = damon_new_region(saddr[it][ir], eaddr[it][ir]); 95 + r->nr_accesses = accesses[it][ir]; 96 + damon_add_region(r, t); 97 + } 98 + it++; 99 + } 100 + kdamond_reset_aggregated(ctx); 101 + it = 0; 102 + damon_for_each_target(t, ctx) { 103 + ir = 0; 104 + /* '->nr_accesses' should be zeroed */ 105 + damon_for_each_region(r, t) { 106 + KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses); 107 + ir++; 108 + } 109 + /* regions should be preserved */ 110 + KUNIT_EXPECT_EQ(test, 3, ir); 111 + it++; 112 + } 113 + /* targets also should be preserved */ 114 + KUNIT_EXPECT_EQ(test, 3, it); 115 + 116 + damon_destroy_ctx(ctx); 117 + } 118 + 119 + static void damon_test_split_at(struct kunit *test) 120 + { 121 + struct damon_ctx *c = damon_new_ctx(); 122 + struct damon_target *t; 123 + struct damon_region *r; 124 + 125 + t = damon_new_target(42); 126 + r = damon_new_region(0, 100); 127 + damon_add_region(r, t); 128 + damon_split_region_at(c, t, r, 25); 129 + KUNIT_EXPECT_EQ(test, r->ar.start, 0ul); 130 + KUNIT_EXPECT_EQ(test, r->ar.end, 25ul); 131 + 132 + r = damon_next_region(r); 133 + KUNIT_EXPECT_EQ(test, r->ar.start, 25ul); 134 + KUNIT_EXPECT_EQ(test, r->ar.end, 
100ul); 135 + 136 + damon_free_target(t); 137 + damon_destroy_ctx(c); 138 + } 139 + 140 + static void damon_test_merge_two(struct kunit *test) 141 + { 142 + struct damon_target *t; 143 + struct damon_region *r, *r2, *r3; 144 + int i; 145 + 146 + t = damon_new_target(42); 147 + r = damon_new_region(0, 100); 148 + r->nr_accesses = 10; 149 + damon_add_region(r, t); 150 + r2 = damon_new_region(100, 300); 151 + r2->nr_accesses = 20; 152 + damon_add_region(r2, t); 153 + 154 + damon_merge_two_regions(t, r, r2); 155 + KUNIT_EXPECT_EQ(test, r->ar.start, 0ul); 156 + KUNIT_EXPECT_EQ(test, r->ar.end, 300ul); 157 + KUNIT_EXPECT_EQ(test, r->nr_accesses, 16u); 158 + 159 + i = 0; 160 + damon_for_each_region(r3, t) { 161 + KUNIT_EXPECT_PTR_EQ(test, r, r3); 162 + i++; 163 + } 164 + KUNIT_EXPECT_EQ(test, i, 1); 165 + 166 + damon_free_target(t); 167 + } 168 + 169 + static struct damon_region *__nth_region_of(struct damon_target *t, int idx) 170 + { 171 + struct damon_region *r; 172 + unsigned int i = 0; 173 + 174 + damon_for_each_region(r, t) { 175 + if (i++ == idx) 176 + return r; 177 + } 178 + 179 + return NULL; 180 + } 181 + 182 + static void damon_test_merge_regions_of(struct kunit *test) 183 + { 184 + struct damon_target *t; 185 + struct damon_region *r; 186 + unsigned long sa[] = {0, 100, 114, 122, 130, 156, 170, 184}; 187 + unsigned long ea[] = {100, 112, 122, 130, 156, 170, 184, 230}; 188 + unsigned int nrs[] = {0, 0, 10, 10, 20, 30, 1, 2}; 189 + 190 + unsigned long saddrs[] = {0, 114, 130, 156, 170}; 191 + unsigned long eaddrs[] = {112, 130, 156, 170, 230}; 192 + int i; 193 + 194 + t = damon_new_target(42); 195 + for (i = 0; i < ARRAY_SIZE(sa); i++) { 196 + r = damon_new_region(sa[i], ea[i]); 197 + r->nr_accesses = nrs[i]; 198 + damon_add_region(r, t); 199 + } 200 + 201 + damon_merge_regions_of(t, 9, 9999); 202 + /* 0-112, 114-130, 130-156, 156-170 */ 203 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 5u); 204 + for (i = 0; i < 5; i++) { 205 + r = __nth_region_of(t, i); 206 + 
KUNIT_EXPECT_EQ(test, r->ar.start, saddrs[i]); 207 + KUNIT_EXPECT_EQ(test, r->ar.end, eaddrs[i]); 208 + } 209 + damon_free_target(t); 210 + } 211 + 212 + static void damon_test_split_regions_of(struct kunit *test) 213 + { 214 + struct damon_ctx *c = damon_new_ctx(); 215 + struct damon_target *t; 216 + struct damon_region *r; 217 + 218 + t = damon_new_target(42); 219 + r = damon_new_region(0, 22); 220 + damon_add_region(r, t); 221 + damon_split_regions_of(c, t, 2); 222 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 2u); 223 + damon_free_target(t); 224 + 225 + t = damon_new_target(42); 226 + r = damon_new_region(0, 220); 227 + damon_add_region(r, t); 228 + damon_split_regions_of(c, t, 4); 229 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 4u); 230 + damon_free_target(t); 231 + damon_destroy_ctx(c); 232 + } 233 + 234 + static struct kunit_case damon_test_cases[] = { 235 + KUNIT_CASE(damon_test_target), 236 + KUNIT_CASE(damon_test_regions), 237 + KUNIT_CASE(damon_test_aggregate), 238 + KUNIT_CASE(damon_test_split_at), 239 + KUNIT_CASE(damon_test_merge_two), 240 + KUNIT_CASE(damon_test_merge_regions_of), 241 + KUNIT_CASE(damon_test_split_regions_of), 242 + {}, 243 + }; 244 + 245 + static struct kunit_suite damon_test_suite = { 246 + .name = "damon", 247 + .test_cases = damon_test_cases, 248 + }; 249 + kunit_test_suite(damon_test_suite); 250 + 251 + #endif /* _DAMON_CORE_TEST_H */ 252 + 253 + #endif /* CONFIG_DAMON_KUNIT_TEST */
+720
mm/damon/core.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Data Access Monitor 4 + * 5 + * Author: SeongJae Park <sjpark@amazon.de> 6 + */ 7 + 8 + #define pr_fmt(fmt) "damon: " fmt 9 + 10 + #include <linux/damon.h> 11 + #include <linux/delay.h> 12 + #include <linux/kthread.h> 13 + #include <linux/random.h> 14 + #include <linux/slab.h> 15 + 16 + #define CREATE_TRACE_POINTS 17 + #include <trace/events/damon.h> 18 + 19 + #ifdef CONFIG_DAMON_KUNIT_TEST 20 + #undef DAMON_MIN_REGION 21 + #define DAMON_MIN_REGION 1 22 + #endif 23 + 24 + /* Get a random number in [l, r) */ 25 + #define damon_rand(l, r) (l + prandom_u32_max(r - l)) 26 + 27 + static DEFINE_MUTEX(damon_lock); 28 + static int nr_running_ctxs; 29 + 30 + /* 31 + * Construct a damon_region struct 32 + * 33 + * Returns the pointer to the new struct if success, or NULL otherwise 34 + */ 35 + struct damon_region *damon_new_region(unsigned long start, unsigned long end) 36 + { 37 + struct damon_region *region; 38 + 39 + region = kmalloc(sizeof(*region), GFP_KERNEL); 40 + if (!region) 41 + return NULL; 42 + 43 + region->ar.start = start; 44 + region->ar.end = end; 45 + region->nr_accesses = 0; 46 + INIT_LIST_HEAD(&region->list); 47 + 48 + return region; 49 + } 50 + 51 + /* 52 + * Add a region between two other regions 53 + */ 54 + inline void damon_insert_region(struct damon_region *r, 55 + struct damon_region *prev, struct damon_region *next, 56 + struct damon_target *t) 57 + { 58 + __list_add(&r->list, &prev->list, &next->list); 59 + t->nr_regions++; 60 + } 61 + 62 + void damon_add_region(struct damon_region *r, struct damon_target *t) 63 + { 64 + list_add_tail(&r->list, &t->regions_list); 65 + t->nr_regions++; 66 + } 67 + 68 + static void damon_del_region(struct damon_region *r, struct damon_target *t) 69 + { 70 + list_del(&r->list); 71 + t->nr_regions--; 72 + } 73 + 74 + static void damon_free_region(struct damon_region *r) 75 + { 76 + kfree(r); 77 + } 78 + 79 + void damon_destroy_region(struct damon_region *r, 
struct damon_target *t) 80 + { 81 + damon_del_region(r, t); 82 + damon_free_region(r); 83 + } 84 + 85 + /* 86 + * Construct a damon_target struct 87 + * 88 + * Returns the pointer to the new struct if success, or NULL otherwise 89 + */ 90 + struct damon_target *damon_new_target(unsigned long id) 91 + { 92 + struct damon_target *t; 93 + 94 + t = kmalloc(sizeof(*t), GFP_KERNEL); 95 + if (!t) 96 + return NULL; 97 + 98 + t->id = id; 99 + t->nr_regions = 0; 100 + INIT_LIST_HEAD(&t->regions_list); 101 + 102 + return t; 103 + } 104 + 105 + void damon_add_target(struct damon_ctx *ctx, struct damon_target *t) 106 + { 107 + list_add_tail(&t->list, &ctx->adaptive_targets); 108 + } 109 + 110 + static void damon_del_target(struct damon_target *t) 111 + { 112 + list_del(&t->list); 113 + } 114 + 115 + void damon_free_target(struct damon_target *t) 116 + { 117 + struct damon_region *r, *next; 118 + 119 + damon_for_each_region_safe(r, next, t) 120 + damon_free_region(r); 121 + kfree(t); 122 + } 123 + 124 + void damon_destroy_target(struct damon_target *t) 125 + { 126 + damon_del_target(t); 127 + damon_free_target(t); 128 + } 129 + 130 + unsigned int damon_nr_regions(struct damon_target *t) 131 + { 132 + return t->nr_regions; 133 + } 134 + 135 + struct damon_ctx *damon_new_ctx(void) 136 + { 137 + struct damon_ctx *ctx; 138 + 139 + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); 140 + if (!ctx) 141 + return NULL; 142 + 143 + ctx->sample_interval = 5 * 1000; 144 + ctx->aggr_interval = 100 * 1000; 145 + ctx->primitive_update_interval = 60 * 1000 * 1000; 146 + 147 + ktime_get_coarse_ts64(&ctx->last_aggregation); 148 + ctx->last_primitive_update = ctx->last_aggregation; 149 + 150 + mutex_init(&ctx->kdamond_lock); 151 + 152 + ctx->min_nr_regions = 10; 153 + ctx->max_nr_regions = 1000; 154 + 155 + INIT_LIST_HEAD(&ctx->adaptive_targets); 156 + 157 + return ctx; 158 + } 159 + 160 + static void damon_destroy_targets(struct damon_ctx *ctx) 161 + { 162 + struct damon_target *t, *next_t; 163 + 164 + if 
(ctx->primitive.cleanup) { 165 + ctx->primitive.cleanup(ctx); 166 + return; 167 + } 168 + 169 + damon_for_each_target_safe(t, next_t, ctx) 170 + damon_destroy_target(t); 171 + } 172 + 173 + void damon_destroy_ctx(struct damon_ctx *ctx) 174 + { 175 + damon_destroy_targets(ctx); 176 + kfree(ctx); 177 + } 178 + 179 + /** 180 + * damon_set_targets() - Set monitoring targets. 181 + * @ctx: monitoring context 182 + * @ids: array of target ids 183 + * @nr_ids: number of entries in @ids 184 + * 185 + * This function should not be called while the kdamond is running. 186 + * 187 + * Return: 0 on success, negative error code otherwise. 188 + */ 189 + int damon_set_targets(struct damon_ctx *ctx, 190 + unsigned long *ids, ssize_t nr_ids) 191 + { 192 + ssize_t i; 193 + struct damon_target *t, *next; 194 + 195 + damon_destroy_targets(ctx); 196 + 197 + for (i = 0; i < nr_ids; i++) { 198 + t = damon_new_target(ids[i]); 199 + if (!t) { 200 + pr_err("Failed to alloc damon_target\n"); 201 + /* The caller should do cleanup of the ids itself */ 202 + damon_for_each_target_safe(t, next, ctx) 203 + damon_destroy_target(t); 204 + return -ENOMEM; 205 + } 206 + damon_add_target(ctx, t); 207 + } 208 + 209 + return 0; 210 + } 211 + 212 + /** 213 + * damon_set_attrs() - Set attributes for the monitoring. 214 + * @ctx: monitoring context 215 + * @sample_int: time interval between samplings 216 + * @aggr_int: time interval between aggregations 217 + * @primitive_upd_int: time interval between monitoring primitive updates 218 + * @min_nr_reg: minimal number of regions 219 + * @max_nr_reg: maximum number of regions 220 + * 221 + * This function should not be called while the kdamond is running. 222 + * Every time interval is in micro-seconds. 223 + * 224 + * Return: 0 on success, negative error code otherwise. 
225 + */ 226 + int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, 227 + unsigned long aggr_int, unsigned long primitive_upd_int, 228 + unsigned long min_nr_reg, unsigned long max_nr_reg) 229 + { 230 + if (min_nr_reg < 3) { 231 + pr_err("min_nr_regions (%lu) must be at least 3\n", 232 + min_nr_reg); 233 + return -EINVAL; 234 + } 235 + if (min_nr_reg > max_nr_reg) { 236 + pr_err("invalid nr_regions. min (%lu) > max (%lu)\n", 237 + min_nr_reg, max_nr_reg); 238 + return -EINVAL; 239 + } 240 + 241 + ctx->sample_interval = sample_int; 242 + ctx->aggr_interval = aggr_int; 243 + ctx->primitive_update_interval = primitive_upd_int; 244 + ctx->min_nr_regions = min_nr_reg; 245 + ctx->max_nr_regions = max_nr_reg; 246 + 247 + return 0; 248 + } 249 + 250 + /** 251 + * damon_nr_running_ctxs() - Return number of currently running contexts. 252 + */ 253 + int damon_nr_running_ctxs(void) 254 + { 255 + int nr_ctxs; 256 + 257 + mutex_lock(&damon_lock); 258 + nr_ctxs = nr_running_ctxs; 259 + mutex_unlock(&damon_lock); 260 + 261 + return nr_ctxs; 262 + } 263 + 264 + /* Returns the size upper limit for each monitoring region */ 265 + static unsigned long damon_region_sz_limit(struct damon_ctx *ctx) 266 + { 267 + struct damon_target *t; 268 + struct damon_region *r; 269 + unsigned long sz = 0; 270 + 271 + damon_for_each_target(t, ctx) { 272 + damon_for_each_region(r, t) 273 + sz += r->ar.end - r->ar.start; 274 + } 275 + 276 + if (ctx->min_nr_regions) 277 + sz /= ctx->min_nr_regions; 278 + if (sz < DAMON_MIN_REGION) 279 + sz = DAMON_MIN_REGION; 280 + 281 + return sz; 282 + } 283 + 284 + static bool damon_kdamond_running(struct damon_ctx *ctx) 285 + { 286 + bool running; 287 + 288 + mutex_lock(&ctx->kdamond_lock); 289 + running = ctx->kdamond != NULL; 290 + mutex_unlock(&ctx->kdamond_lock); 291 + 292 + return running; 293 + } 294 + 295 + static int kdamond_fn(void *data); 296 + 297 + /* 298 + * __damon_start() - Starts monitoring with given context. 
299 + * @ctx: monitoring context 300 + * 301 + * This function should be called while damon_lock is hold. 302 + * 303 + * Return: 0 on success, negative error code otherwise. 304 + */ 305 + static int __damon_start(struct damon_ctx *ctx) 306 + { 307 + int err = -EBUSY; 308 + 309 + mutex_lock(&ctx->kdamond_lock); 310 + if (!ctx->kdamond) { 311 + err = 0; 312 + ctx->kdamond_stop = false; 313 + ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d", 314 + nr_running_ctxs); 315 + if (IS_ERR(ctx->kdamond)) { 316 + err = PTR_ERR(ctx->kdamond); 317 + ctx->kdamond = 0; 318 + } 319 + } 320 + mutex_unlock(&ctx->kdamond_lock); 321 + 322 + return err; 323 + } 324 + 325 + /** 326 + * damon_start() - Starts the monitorings for a given group of contexts. 327 + * @ctxs: an array of the pointers for contexts to start monitoring 328 + * @nr_ctxs: size of @ctxs 329 + * 330 + * This function starts a group of monitoring threads for a group of monitoring 331 + * contexts. One thread per each context is created and run in parallel. The 332 + * caller should handle synchronization between the threads by itself. If a 333 + * group of threads that created by other 'damon_start()' call is currently 334 + * running, this function does nothing but returns -EBUSY. 335 + * 336 + * Return: 0 on success, negative error code otherwise. 337 + */ 338 + int damon_start(struct damon_ctx **ctxs, int nr_ctxs) 339 + { 340 + int i; 341 + int err = 0; 342 + 343 + mutex_lock(&damon_lock); 344 + if (nr_running_ctxs) { 345 + mutex_unlock(&damon_lock); 346 + return -EBUSY; 347 + } 348 + 349 + for (i = 0; i < nr_ctxs; i++) { 350 + err = __damon_start(ctxs[i]); 351 + if (err) 352 + break; 353 + nr_running_ctxs++; 354 + } 355 + mutex_unlock(&damon_lock); 356 + 357 + return err; 358 + } 359 + 360 + /* 361 + * __damon_stop() - Stops monitoring of given context. 362 + * @ctx: monitoring context 363 + * 364 + * Return: 0 on success, negative error code otherwise. 
365 + */ 366 + static int __damon_stop(struct damon_ctx *ctx) 367 + { 368 + mutex_lock(&ctx->kdamond_lock); 369 + if (ctx->kdamond) { 370 + ctx->kdamond_stop = true; 371 + mutex_unlock(&ctx->kdamond_lock); 372 + while (damon_kdamond_running(ctx)) 373 + usleep_range(ctx->sample_interval, 374 + ctx->sample_interval * 2); 375 + return 0; 376 + } 377 + mutex_unlock(&ctx->kdamond_lock); 378 + 379 + return -EPERM; 380 + } 381 + 382 + /** 383 + * damon_stop() - Stops the monitorings for a given group of contexts. 384 + * @ctxs: an array of the pointers for contexts to stop monitoring 385 + * @nr_ctxs: size of @ctxs 386 + * 387 + * Return: 0 on success, negative error code otherwise. 388 + */ 389 + int damon_stop(struct damon_ctx **ctxs, int nr_ctxs) 390 + { 391 + int i, err = 0; 392 + 393 + for (i = 0; i < nr_ctxs; i++) { 394 + /* nr_running_ctxs is decremented in kdamond_fn */ 395 + err = __damon_stop(ctxs[i]); 396 + if (err) 397 + return err; 398 + } 399 + 400 + return err; 401 + } 402 + 403 + /* 404 + * damon_check_reset_time_interval() - Check if a time interval is elapsed. 405 + * @baseline: the time to check whether the interval has elapsed since 406 + * @interval: the time interval (microseconds) 407 + * 408 + * See whether the given time interval has passed since the given baseline 409 + * time. If so, it also updates the baseline to current time for next check. 410 + * 411 + * Return: true if the time interval has passed, or false otherwise. 
412 + */ 413 + static bool damon_check_reset_time_interval(struct timespec64 *baseline, 414 + unsigned long interval) 415 + { 416 + struct timespec64 now; 417 + 418 + ktime_get_coarse_ts64(&now); 419 + if ((timespec64_to_ns(&now) - timespec64_to_ns(baseline)) < 420 + interval * 1000) 421 + return false; 422 + *baseline = now; 423 + return true; 424 + } 425 + 426 + /* 427 + * Check whether it is time to flush the aggregated information 428 + */ 429 + static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx) 430 + { 431 + return damon_check_reset_time_interval(&ctx->last_aggregation, 432 + ctx->aggr_interval); 433 + } 434 + 435 + /* 436 + * Reset the aggregated monitoring results ('nr_accesses' of each region). 437 + */ 438 + static void kdamond_reset_aggregated(struct damon_ctx *c) 439 + { 440 + struct damon_target *t; 441 + 442 + damon_for_each_target(t, c) { 443 + struct damon_region *r; 444 + 445 + damon_for_each_region(r, t) { 446 + trace_damon_aggregated(t, r, damon_nr_regions(t)); 447 + r->nr_accesses = 0; 448 + } 449 + } 450 + } 451 + 452 + #define sz_damon_region(r) (r->ar.end - r->ar.start) 453 + 454 + /* 455 + * Merge two adjacent regions into one region 456 + */ 457 + static void damon_merge_two_regions(struct damon_target *t, 458 + struct damon_region *l, struct damon_region *r) 459 + { 460 + unsigned long sz_l = sz_damon_region(l), sz_r = sz_damon_region(r); 461 + 462 + l->nr_accesses = (l->nr_accesses * sz_l + r->nr_accesses * sz_r) / 463 + (sz_l + sz_r); 464 + l->ar.end = r->ar.end; 465 + damon_destroy_region(r, t); 466 + } 467 + 468 + #define diff_of(a, b) (a > b ? 
a - b : b - a) 469 + 470 + /* 471 + * Merge adjacent regions having similar access frequencies 472 + * 473 + * t target affected by this merge operation 474 + * thres '->nr_accesses' diff threshold for the merge 475 + * sz_limit size upper limit of each region 476 + */ 477 + static void damon_merge_regions_of(struct damon_target *t, unsigned int thres, 478 + unsigned long sz_limit) 479 + { 480 + struct damon_region *r, *prev = NULL, *next; 481 + 482 + damon_for_each_region_safe(r, next, t) { 483 + if (prev && prev->ar.end == r->ar.start && 484 + diff_of(prev->nr_accesses, r->nr_accesses) <= thres && 485 + sz_damon_region(prev) + sz_damon_region(r) <= sz_limit) 486 + damon_merge_two_regions(t, prev, r); 487 + else 488 + prev = r; 489 + } 490 + } 491 + 492 + /* 493 + * Merge adjacent regions having similar access frequencies 494 + * 495 + * threshold '->nr_accesses' diff threshold for the merge 496 + * sz_limit size upper limit of each region 497 + * 498 + * This function merges monitoring target regions which are adjacent and their 499 + * access frequencies are similar. This is for minimizing the monitoring 500 + * overhead under the dynamically changeable access pattern. If a merge was 501 + * unnecessarily made, later 'kdamond_split_regions()' will revert it. 
502 + */ 503 + static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold, 504 + unsigned long sz_limit) 505 + { 506 + struct damon_target *t; 507 + 508 + damon_for_each_target(t, c) 509 + damon_merge_regions_of(t, threshold, sz_limit); 510 + } 511 + 512 + /* 513 + * Split a region in two 514 + * 515 + * r the region to be split 516 + * sz_r size of the first sub-region that will be made 517 + */ 518 + static void damon_split_region_at(struct damon_ctx *ctx, 519 + struct damon_target *t, struct damon_region *r, 520 + unsigned long sz_r) 521 + { 522 + struct damon_region *new; 523 + 524 + new = damon_new_region(r->ar.start + sz_r, r->ar.end); 525 + if (!new) 526 + return; 527 + 528 + r->ar.end = new->ar.start; 529 + 530 + damon_insert_region(new, r, damon_next_region(r), t); 531 + } 532 + 533 + /* Split every region in the given target into 'nr_subs' regions */ 534 + static void damon_split_regions_of(struct damon_ctx *ctx, 535 + struct damon_target *t, int nr_subs) 536 + { 537 + struct damon_region *r, *next; 538 + unsigned long sz_region, sz_sub = 0; 539 + int i; 540 + 541 + damon_for_each_region_safe(r, next, t) { 542 + sz_region = r->ar.end - r->ar.start; 543 + 544 + for (i = 0; i < nr_subs - 1 && 545 + sz_region > 2 * DAMON_MIN_REGION; i++) { 546 + /* 547 + * Randomly select size of left sub-region to be at 548 + * least 10 percent and at most 90% of original region 549 + */ 550 + sz_sub = ALIGN_DOWN(damon_rand(1, 10) * 551 + sz_region / 10, DAMON_MIN_REGION); 552 + /* Do not allow blank region */ 553 + if (sz_sub == 0 || sz_sub >= sz_region) 554 + continue; 555 + 556 + damon_split_region_at(ctx, t, r, sz_sub); 557 + sz_region = sz_sub; 558 + } 559 + } 560 + } 561 + 562 + /* 563 + * Split every target region into randomly-sized small regions 564 + * 565 + * This function splits every target region into random-sized small regions if 566 + * current total number of the regions is equal or smaller than half of the 567 + * user-specified maximum 
number of regions. This is for maximizing the 568 + * monitoring accuracy under the dynamically changeable access patterns. If a 569 + * split was unnecessarily made, later 'kdamond_merge_regions()' will revert 570 + * it. 571 + */ 572 + static void kdamond_split_regions(struct damon_ctx *ctx) 573 + { 574 + struct damon_target *t; 575 + unsigned int nr_regions = 0; 576 + static unsigned int last_nr_regions; 577 + int nr_subregions = 2; 578 + 579 + damon_for_each_target(t, ctx) 580 + nr_regions += damon_nr_regions(t); 581 + 582 + if (nr_regions > ctx->max_nr_regions / 2) 583 + return; 584 + 585 + /* Maybe the middle of the region has different access frequency */ 586 + if (last_nr_regions == nr_regions && 587 + nr_regions < ctx->max_nr_regions / 3) 588 + nr_subregions = 3; 589 + 590 + damon_for_each_target(t, ctx) 591 + damon_split_regions_of(ctx, t, nr_subregions); 592 + 593 + last_nr_regions = nr_regions; 594 + } 595 + 596 + /* 597 + * Check whether it is time to check and apply the target monitoring regions 598 + * 599 + * Returns true if it is. 600 + */ 601 + static bool kdamond_need_update_primitive(struct damon_ctx *ctx) 602 + { 603 + return damon_check_reset_time_interval(&ctx->last_primitive_update, 604 + ctx->primitive_update_interval); 605 + } 606 + 607 + /* 608 + * Check whether current monitoring should be stopped 609 + * 610 + * The monitoring is stopped when either the user requested to stop, or all 611 + * monitoring targets are invalid. 612 + * 613 + * Returns true if need to stop current monitoring. 
614 + */ 615 + static bool kdamond_need_stop(struct damon_ctx *ctx) 616 + { 617 + struct damon_target *t; 618 + bool stop; 619 + 620 + mutex_lock(&ctx->kdamond_lock); 621 + stop = ctx->kdamond_stop; 622 + mutex_unlock(&ctx->kdamond_lock); 623 + if (stop) 624 + return true; 625 + 626 + if (!ctx->primitive.target_valid) 627 + return false; 628 + 629 + damon_for_each_target(t, ctx) { 630 + if (ctx->primitive.target_valid(t)) 631 + return false; 632 + } 633 + 634 + return true; 635 + } 636 + 637 + static void set_kdamond_stop(struct damon_ctx *ctx) 638 + { 639 + mutex_lock(&ctx->kdamond_lock); 640 + ctx->kdamond_stop = true; 641 + mutex_unlock(&ctx->kdamond_lock); 642 + } 643 + 644 + /* 645 + * The monitoring daemon that runs as a kernel thread 646 + */ 647 + static int kdamond_fn(void *data) 648 + { 649 + struct damon_ctx *ctx = (struct damon_ctx *)data; 650 + struct damon_target *t; 651 + struct damon_region *r, *next; 652 + unsigned int max_nr_accesses = 0; 653 + unsigned long sz_limit = 0; 654 + 655 + mutex_lock(&ctx->kdamond_lock); 656 + pr_info("kdamond (%d) starts\n", ctx->kdamond->pid); 657 + mutex_unlock(&ctx->kdamond_lock); 658 + 659 + if (ctx->primitive.init) 660 + ctx->primitive.init(ctx); 661 + if (ctx->callback.before_start && ctx->callback.before_start(ctx)) 662 + set_kdamond_stop(ctx); 663 + 664 + sz_limit = damon_region_sz_limit(ctx); 665 + 666 + while (!kdamond_need_stop(ctx)) { 667 + if (ctx->primitive.prepare_access_checks) 668 + ctx->primitive.prepare_access_checks(ctx); 669 + if (ctx->callback.after_sampling && 670 + ctx->callback.after_sampling(ctx)) 671 + set_kdamond_stop(ctx); 672 + 673 + usleep_range(ctx->sample_interval, ctx->sample_interval + 1); 674 + 675 + if (ctx->primitive.check_accesses) 676 + max_nr_accesses = ctx->primitive.check_accesses(ctx); 677 + 678 + if (kdamond_aggregate_interval_passed(ctx)) { 679 + kdamond_merge_regions(ctx, 680 + max_nr_accesses / 10, 681 + sz_limit); 682 + if (ctx->callback.after_aggregation && 683 + 
ctx->callback.after_aggregation(ctx)) 684 + set_kdamond_stop(ctx); 685 + kdamond_reset_aggregated(ctx); 686 + kdamond_split_regions(ctx); 687 + if (ctx->primitive.reset_aggregated) 688 + ctx->primitive.reset_aggregated(ctx); 689 + } 690 + 691 + if (kdamond_need_update_primitive(ctx)) { 692 + if (ctx->primitive.update) 693 + ctx->primitive.update(ctx); 694 + sz_limit = damon_region_sz_limit(ctx); 695 + } 696 + } 697 + damon_for_each_target(t, ctx) { 698 + damon_for_each_region_safe(r, next, t) 699 + damon_destroy_region(r, t); 700 + } 701 + 702 + if (ctx->callback.before_terminate && 703 + ctx->callback.before_terminate(ctx)) 704 + set_kdamond_stop(ctx); 705 + if (ctx->primitive.cleanup) 706 + ctx->primitive.cleanup(ctx); 707 + 708 + pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid); 709 + mutex_lock(&ctx->kdamond_lock); 710 + ctx->kdamond = NULL; 711 + mutex_unlock(&ctx->kdamond_lock); 712 + 713 + mutex_lock(&damon_lock); 714 + nr_running_ctxs--; 715 + mutex_unlock(&damon_lock); 716 + 717 + do_exit(0); 718 + } 719 + 720 + #include "core-test.h"
+126
mm/damon/dbgfs-test.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * DAMON Debugfs Interface Unit Tests 4 + * 5 + * Author: SeongJae Park <sjpark@amazon.de> 6 + */ 7 + 8 + #ifdef CONFIG_DAMON_DBGFS_KUNIT_TEST 9 + 10 + #ifndef _DAMON_DBGFS_TEST_H 11 + #define _DAMON_DBGFS_TEST_H 12 + 13 + #include <kunit/test.h> 14 + 15 + static void damon_dbgfs_test_str_to_target_ids(struct kunit *test) 16 + { 17 + char *question; 18 + unsigned long *answers; 19 + unsigned long expected[] = {12, 35, 46}; 20 + ssize_t nr_integers = 0, i; 21 + 22 + question = "123"; 23 + answers = str_to_target_ids(question, strnlen(question, 128), 24 + &nr_integers); 25 + KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers); 26 + KUNIT_EXPECT_EQ(test, 123ul, answers[0]); 27 + kfree(answers); 28 + 29 + question = "123abc"; 30 + answers = str_to_target_ids(question, strnlen(question, 128), 31 + &nr_integers); 32 + KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers); 33 + KUNIT_EXPECT_EQ(test, 123ul, answers[0]); 34 + kfree(answers); 35 + 36 + question = "a123"; 37 + answers = str_to_target_ids(question, strnlen(question, 128), 38 + &nr_integers); 39 + KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers); 40 + kfree(answers); 41 + 42 + question = "12 35"; 43 + answers = str_to_target_ids(question, strnlen(question, 128), 44 + &nr_integers); 45 + KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers); 46 + for (i = 0; i < nr_integers; i++) 47 + KUNIT_EXPECT_EQ(test, expected[i], answers[i]); 48 + kfree(answers); 49 + 50 + question = "12 35 46"; 51 + answers = str_to_target_ids(question, strnlen(question, 128), 52 + &nr_integers); 53 + KUNIT_EXPECT_EQ(test, (ssize_t)3, nr_integers); 54 + for (i = 0; i < nr_integers; i++) 55 + KUNIT_EXPECT_EQ(test, expected[i], answers[i]); 56 + kfree(answers); 57 + 58 + question = "12 35 abc 46"; 59 + answers = str_to_target_ids(question, strnlen(question, 128), 60 + &nr_integers); 61 + KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers); 62 + for (i = 0; i < 2; i++) 63 + KUNIT_EXPECT_EQ(test, expected[i], 
answers[i]); 64 + kfree(answers); 65 + 66 + question = ""; 67 + answers = str_to_target_ids(question, strnlen(question, 128), 68 + &nr_integers); 69 + KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers); 70 + kfree(answers); 71 + 72 + question = "\n"; 73 + answers = str_to_target_ids(question, strnlen(question, 128), 74 + &nr_integers); 75 + KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers); 76 + kfree(answers); 77 + } 78 + 79 + static void damon_dbgfs_test_set_targets(struct kunit *test) 80 + { 81 + struct damon_ctx *ctx = dbgfs_new_ctx(); 82 + unsigned long ids[] = {1, 2, 3}; 83 + char buf[64]; 84 + 85 + /* Make DAMON consider target id as plain number */ 86 + ctx->primitive.target_valid = NULL; 87 + ctx->primitive.cleanup = NULL; 88 + 89 + damon_set_targets(ctx, ids, 3); 90 + sprint_target_ids(ctx, buf, 64); 91 + KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2 3\n"); 92 + 93 + damon_set_targets(ctx, NULL, 0); 94 + sprint_target_ids(ctx, buf, 64); 95 + KUNIT_EXPECT_STREQ(test, (char *)buf, "\n"); 96 + 97 + damon_set_targets(ctx, (unsigned long []){1, 2}, 2); 98 + sprint_target_ids(ctx, buf, 64); 99 + KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2\n"); 100 + 101 + damon_set_targets(ctx, (unsigned long []){2}, 1); 102 + sprint_target_ids(ctx, buf, 64); 103 + KUNIT_EXPECT_STREQ(test, (char *)buf, "2\n"); 104 + 105 + damon_set_targets(ctx, NULL, 0); 106 + sprint_target_ids(ctx, buf, 64); 107 + KUNIT_EXPECT_STREQ(test, (char *)buf, "\n"); 108 + 109 + dbgfs_destroy_ctx(ctx); 110 + } 111 + 112 + static struct kunit_case damon_test_cases[] = { 113 + KUNIT_CASE(damon_dbgfs_test_str_to_target_ids), 114 + KUNIT_CASE(damon_dbgfs_test_set_targets), 115 + {}, 116 + }; 117 + 118 + static struct kunit_suite damon_test_suite = { 119 + .name = "damon-dbgfs", 120 + .test_cases = damon_test_cases, 121 + }; 122 + kunit_test_suite(damon_test_suite); 123 + 124 + #endif /* _DAMON_DBGFS_TEST_H */ 125 + 126 + #endif /* CONFIG_DAMON_DBGFS_KUNIT_TEST */
+623
mm/damon/dbgfs.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * DAMON Debugfs Interface 4 + * 5 + * Author: SeongJae Park <sjpark@amazon.de> 6 + */ 7 + 8 + #define pr_fmt(fmt) "damon-dbgfs: " fmt 9 + 10 + #include <linux/damon.h> 11 + #include <linux/debugfs.h> 12 + #include <linux/file.h> 13 + #include <linux/mm.h> 14 + #include <linux/module.h> 15 + #include <linux/page_idle.h> 16 + #include <linux/slab.h> 17 + 18 + static struct damon_ctx **dbgfs_ctxs; 19 + static int dbgfs_nr_ctxs; 20 + static struct dentry **dbgfs_dirs; 21 + static DEFINE_MUTEX(damon_dbgfs_lock); 22 + 23 + /* 24 + * Returns non-empty string on success, negative error code otherwise. 25 + */ 26 + static char *user_input_str(const char __user *buf, size_t count, loff_t *ppos) 27 + { 28 + char *kbuf; 29 + ssize_t ret; 30 + 31 + /* We do not accept continuous write */ 32 + if (*ppos) 33 + return ERR_PTR(-EINVAL); 34 + 35 + kbuf = kmalloc(count + 1, GFP_KERNEL); 36 + if (!kbuf) 37 + return ERR_PTR(-ENOMEM); 38 + 39 + ret = simple_write_to_buffer(kbuf, count + 1, ppos, buf, count); 40 + if (ret != count) { 41 + kfree(kbuf); 42 + return ERR_PTR(-EIO); 43 + } 44 + kbuf[ret] = '\0'; 45 + 46 + return kbuf; 47 + } 48 + 49 + static ssize_t dbgfs_attrs_read(struct file *file, 50 + char __user *buf, size_t count, loff_t *ppos) 51 + { 52 + struct damon_ctx *ctx = file->private_data; 53 + char kbuf[128]; 54 + int ret; 55 + 56 + mutex_lock(&ctx->kdamond_lock); 57 + ret = scnprintf(kbuf, ARRAY_SIZE(kbuf), "%lu %lu %lu %lu %lu\n", 58 + ctx->sample_interval, ctx->aggr_interval, 59 + ctx->primitive_update_interval, ctx->min_nr_regions, 60 + ctx->max_nr_regions); 61 + mutex_unlock(&ctx->kdamond_lock); 62 + 63 + return simple_read_from_buffer(buf, count, ppos, kbuf, ret); 64 + } 65 + 66 + static ssize_t dbgfs_attrs_write(struct file *file, 67 + const char __user *buf, size_t count, loff_t *ppos) 68 + { 69 + struct damon_ctx *ctx = file->private_data; 70 + unsigned long s, a, r, minr, maxr; 71 + char *kbuf; 72 + ssize_t ret 
= count; 73 + int err; 74 + 75 + kbuf = user_input_str(buf, count, ppos); 76 + if (IS_ERR(kbuf)) 77 + return PTR_ERR(kbuf); 78 + 79 + if (sscanf(kbuf, "%lu %lu %lu %lu %lu", 80 + &s, &a, &r, &minr, &maxr) != 5) { 81 + ret = -EINVAL; 82 + goto out; 83 + } 84 + 85 + mutex_lock(&ctx->kdamond_lock); 86 + if (ctx->kdamond) { 87 + ret = -EBUSY; 88 + goto unlock_out; 89 + } 90 + 91 + err = damon_set_attrs(ctx, s, a, r, minr, maxr); 92 + if (err) 93 + ret = err; 94 + unlock_out: 95 + mutex_unlock(&ctx->kdamond_lock); 96 + out: 97 + kfree(kbuf); 98 + return ret; 99 + } 100 + 101 + static inline bool targetid_is_pid(const struct damon_ctx *ctx) 102 + { 103 + return ctx->primitive.target_valid == damon_va_target_valid; 104 + } 105 + 106 + static ssize_t sprint_target_ids(struct damon_ctx *ctx, char *buf, ssize_t len) 107 + { 108 + struct damon_target *t; 109 + unsigned long id; 110 + int written = 0; 111 + int rc; 112 + 113 + damon_for_each_target(t, ctx) { 114 + id = t->id; 115 + if (targetid_is_pid(ctx)) 116 + /* Show pid numbers to debugfs users */ 117 + id = (unsigned long)pid_vnr((struct pid *)id); 118 + 119 + rc = scnprintf(&buf[written], len - written, "%lu ", id); 120 + if (!rc) 121 + return -ENOMEM; 122 + written += rc; 123 + } 124 + if (written) 125 + written -= 1; 126 + written += scnprintf(&buf[written], len - written, "\n"); 127 + return written; 128 + } 129 + 130 + static ssize_t dbgfs_target_ids_read(struct file *file, 131 + char __user *buf, size_t count, loff_t *ppos) 132 + { 133 + struct damon_ctx *ctx = file->private_data; 134 + ssize_t len; 135 + char ids_buf[320]; 136 + 137 + mutex_lock(&ctx->kdamond_lock); 138 + len = sprint_target_ids(ctx, ids_buf, 320); 139 + mutex_unlock(&ctx->kdamond_lock); 140 + if (len < 0) 141 + return len; 142 + 143 + return simple_read_from_buffer(buf, count, ppos, ids_buf, len); 144 + } 145 + 146 + /* 147 + * Converts a string into an array of unsigned long integers 148 + * 149 + * Returns an array of unsigned long integers if 
the conversion success, or 150 + * NULL otherwise. 151 + */ 152 + static unsigned long *str_to_target_ids(const char *str, ssize_t len, 153 + ssize_t *nr_ids) 154 + { 155 + unsigned long *ids; 156 + const int max_nr_ids = 32; 157 + unsigned long id; 158 + int pos = 0, parsed, ret; 159 + 160 + *nr_ids = 0; 161 + ids = kmalloc_array(max_nr_ids, sizeof(id), GFP_KERNEL); 162 + if (!ids) 163 + return NULL; 164 + while (*nr_ids < max_nr_ids && pos < len) { 165 + ret = sscanf(&str[pos], "%lu%n", &id, &parsed); 166 + pos += parsed; 167 + if (ret != 1) 168 + break; 169 + ids[*nr_ids] = id; 170 + *nr_ids += 1; 171 + } 172 + 173 + return ids; 174 + } 175 + 176 + static void dbgfs_put_pids(unsigned long *ids, int nr_ids) 177 + { 178 + int i; 179 + 180 + for (i = 0; i < nr_ids; i++) 181 + put_pid((struct pid *)ids[i]); 182 + } 183 + 184 + static ssize_t dbgfs_target_ids_write(struct file *file, 185 + const char __user *buf, size_t count, loff_t *ppos) 186 + { 187 + struct damon_ctx *ctx = file->private_data; 188 + char *kbuf, *nrs; 189 + unsigned long *targets; 190 + ssize_t nr_targets; 191 + ssize_t ret = count; 192 + int i; 193 + int err; 194 + 195 + kbuf = user_input_str(buf, count, ppos); 196 + if (IS_ERR(kbuf)) 197 + return PTR_ERR(kbuf); 198 + 199 + nrs = kbuf; 200 + 201 + targets = str_to_target_ids(nrs, ret, &nr_targets); 202 + if (!targets) { 203 + ret = -ENOMEM; 204 + goto out; 205 + } 206 + 207 + if (targetid_is_pid(ctx)) { 208 + for (i = 0; i < nr_targets; i++) { 209 + targets[i] = (unsigned long)find_get_pid( 210 + (int)targets[i]); 211 + if (!targets[i]) { 212 + dbgfs_put_pids(targets, i); 213 + ret = -EINVAL; 214 + goto free_targets_out; 215 + } 216 + } 217 + } 218 + 219 + mutex_lock(&ctx->kdamond_lock); 220 + if (ctx->kdamond) { 221 + if (targetid_is_pid(ctx)) 222 + dbgfs_put_pids(targets, nr_targets); 223 + ret = -EBUSY; 224 + goto unlock_out; 225 + } 226 + 227 + err = damon_set_targets(ctx, targets, nr_targets); 228 + if (err) { 229 + if (targetid_is_pid(ctx)) 
230 + dbgfs_put_pids(targets, nr_targets); 231 + ret = err; 232 + } 233 + 234 + unlock_out: 235 + mutex_unlock(&ctx->kdamond_lock); 236 + free_targets_out: 237 + kfree(targets); 238 + out: 239 + kfree(kbuf); 240 + return ret; 241 + } 242 + 243 + static ssize_t dbgfs_kdamond_pid_read(struct file *file, 244 + char __user *buf, size_t count, loff_t *ppos) 245 + { 246 + struct damon_ctx *ctx = file->private_data; 247 + char *kbuf; 248 + ssize_t len; 249 + 250 + kbuf = kmalloc(count, GFP_KERNEL); 251 + if (!kbuf) 252 + return -ENOMEM; 253 + 254 + mutex_lock(&ctx->kdamond_lock); 255 + if (ctx->kdamond) 256 + len = scnprintf(kbuf, count, "%d\n", ctx->kdamond->pid); 257 + else 258 + len = scnprintf(kbuf, count, "none\n"); 259 + mutex_unlock(&ctx->kdamond_lock); 260 + if (!len) 261 + goto out; 262 + len = simple_read_from_buffer(buf, count, ppos, kbuf, len); 263 + 264 + out: 265 + kfree(kbuf); 266 + return len; 267 + } 268 + 269 + static int damon_dbgfs_open(struct inode *inode, struct file *file) 270 + { 271 + file->private_data = inode->i_private; 272 + 273 + return nonseekable_open(inode, file); 274 + } 275 + 276 + static const struct file_operations attrs_fops = { 277 + .open = damon_dbgfs_open, 278 + .read = dbgfs_attrs_read, 279 + .write = dbgfs_attrs_write, 280 + }; 281 + 282 + static const struct file_operations target_ids_fops = { 283 + .open = damon_dbgfs_open, 284 + .read = dbgfs_target_ids_read, 285 + .write = dbgfs_target_ids_write, 286 + }; 287 + 288 + static const struct file_operations kdamond_pid_fops = { 289 + .open = damon_dbgfs_open, 290 + .read = dbgfs_kdamond_pid_read, 291 + }; 292 + 293 + static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx) 294 + { 295 + const char * const file_names[] = {"attrs", "target_ids", 296 + "kdamond_pid"}; 297 + const struct file_operations *fops[] = {&attrs_fops, &target_ids_fops, 298 + &kdamond_pid_fops}; 299 + int i; 300 + 301 + for (i = 0; i < ARRAY_SIZE(file_names); i++) 302 + 
debugfs_create_file(file_names[i], 0600, dir, ctx, fops[i]); 303 + } 304 + 305 + static int dbgfs_before_terminate(struct damon_ctx *ctx) 306 + { 307 + struct damon_target *t, *next; 308 + 309 + if (!targetid_is_pid(ctx)) 310 + return 0; 311 + 312 + damon_for_each_target_safe(t, next, ctx) { 313 + put_pid((struct pid *)t->id); 314 + damon_destroy_target(t); 315 + } 316 + return 0; 317 + } 318 + 319 + static struct damon_ctx *dbgfs_new_ctx(void) 320 + { 321 + struct damon_ctx *ctx; 322 + 323 + ctx = damon_new_ctx(); 324 + if (!ctx) 325 + return NULL; 326 + 327 + damon_va_set_primitives(ctx); 328 + ctx->callback.before_terminate = dbgfs_before_terminate; 329 + return ctx; 330 + } 331 + 332 + static void dbgfs_destroy_ctx(struct damon_ctx *ctx) 333 + { 334 + damon_destroy_ctx(ctx); 335 + } 336 + 337 + /* 338 + * Make a context of @name and create a debugfs directory for it. 339 + * 340 + * This function should be called while holding damon_dbgfs_lock. 341 + * 342 + * Returns 0 on success, negative error code otherwise. 
343 + */ 344 + static int dbgfs_mk_context(char *name) 345 + { 346 + struct dentry *root, **new_dirs, *new_dir; 347 + struct damon_ctx **new_ctxs, *new_ctx; 348 + 349 + if (damon_nr_running_ctxs()) 350 + return -EBUSY; 351 + 352 + new_ctxs = krealloc(dbgfs_ctxs, sizeof(*dbgfs_ctxs) * 353 + (dbgfs_nr_ctxs + 1), GFP_KERNEL); 354 + if (!new_ctxs) 355 + return -ENOMEM; 356 + dbgfs_ctxs = new_ctxs; 357 + 358 + new_dirs = krealloc(dbgfs_dirs, sizeof(*dbgfs_dirs) * 359 + (dbgfs_nr_ctxs + 1), GFP_KERNEL); 360 + if (!new_dirs) 361 + return -ENOMEM; 362 + dbgfs_dirs = new_dirs; 363 + 364 + root = dbgfs_dirs[0]; 365 + if (!root) 366 + return -ENOENT; 367 + 368 + new_dir = debugfs_create_dir(name, root); 369 + dbgfs_dirs[dbgfs_nr_ctxs] = new_dir; 370 + 371 + new_ctx = dbgfs_new_ctx(); 372 + if (!new_ctx) { 373 + debugfs_remove(new_dir); 374 + dbgfs_dirs[dbgfs_nr_ctxs] = NULL; 375 + return -ENOMEM; 376 + } 377 + 378 + dbgfs_ctxs[dbgfs_nr_ctxs] = new_ctx; 379 + dbgfs_fill_ctx_dir(dbgfs_dirs[dbgfs_nr_ctxs], 380 + dbgfs_ctxs[dbgfs_nr_ctxs]); 381 + dbgfs_nr_ctxs++; 382 + 383 + return 0; 384 + } 385 + 386 + static ssize_t dbgfs_mk_context_write(struct file *file, 387 + const char __user *buf, size_t count, loff_t *ppos) 388 + { 389 + char *kbuf; 390 + char *ctx_name; 391 + ssize_t ret = count; 392 + int err; 393 + 394 + kbuf = user_input_str(buf, count, ppos); 395 + if (IS_ERR(kbuf)) 396 + return PTR_ERR(kbuf); 397 + ctx_name = kmalloc(count + 1, GFP_KERNEL); 398 + if (!ctx_name) { 399 + kfree(kbuf); 400 + return -ENOMEM; 401 + } 402 + 403 + /* Trim white space */ 404 + if (sscanf(kbuf, "%s", ctx_name) != 1) { 405 + ret = -EINVAL; 406 + goto out; 407 + } 408 + 409 + mutex_lock(&damon_dbgfs_lock); 410 + err = dbgfs_mk_context(ctx_name); 411 + if (err) 412 + ret = err; 413 + mutex_unlock(&damon_dbgfs_lock); 414 + 415 + out: 416 + kfree(kbuf); 417 + kfree(ctx_name); 418 + return ret; 419 + } 420 + 421 + /* 422 + * Remove a context of @name and its debugfs directory. 
423 + * 424 + * This function should be called while holding damon_dbgfs_lock. 425 + * 426 + * Return 0 on success, negative error code otherwise. 427 + */ 428 + static int dbgfs_rm_context(char *name) 429 + { 430 + struct dentry *root, *dir, **new_dirs; 431 + struct damon_ctx **new_ctxs; 432 + int i, j; 433 + 434 + if (damon_nr_running_ctxs()) 435 + return -EBUSY; 436 + 437 + root = dbgfs_dirs[0]; 438 + if (!root) 439 + return -ENOENT; 440 + 441 + dir = debugfs_lookup(name, root); 442 + if (!dir) 443 + return -ENOENT; 444 + 445 + new_dirs = kmalloc_array(dbgfs_nr_ctxs - 1, sizeof(*dbgfs_dirs), 446 + GFP_KERNEL); 447 + if (!new_dirs) 448 + return -ENOMEM; 449 + 450 + new_ctxs = kmalloc_array(dbgfs_nr_ctxs - 1, sizeof(*dbgfs_ctxs), 451 + GFP_KERNEL); 452 + if (!new_ctxs) { 453 + kfree(new_dirs); 454 + return -ENOMEM; 455 + } 456 + 457 + for (i = 0, j = 0; i < dbgfs_nr_ctxs; i++) { 458 + if (dbgfs_dirs[i] == dir) { 459 + debugfs_remove(dbgfs_dirs[i]); 460 + dbgfs_destroy_ctx(dbgfs_ctxs[i]); 461 + continue; 462 + } 463 + new_dirs[j] = dbgfs_dirs[i]; 464 + new_ctxs[j++] = dbgfs_ctxs[i]; 465 + } 466 + 467 + kfree(dbgfs_dirs); 468 + kfree(dbgfs_ctxs); 469 + 470 + dbgfs_dirs = new_dirs; 471 + dbgfs_ctxs = new_ctxs; 472 + dbgfs_nr_ctxs--; 473 + 474 + return 0; 475 + } 476 + 477 + static ssize_t dbgfs_rm_context_write(struct file *file, 478 + const char __user *buf, size_t count, loff_t *ppos) 479 + { 480 + char *kbuf; 481 + ssize_t ret = count; 482 + int err; 483 + char *ctx_name; 484 + 485 + kbuf = user_input_str(buf, count, ppos); 486 + if (IS_ERR(kbuf)) 487 + return PTR_ERR(kbuf); 488 + ctx_name = kmalloc(count + 1, GFP_KERNEL); 489 + if (!ctx_name) { 490 + kfree(kbuf); 491 + return -ENOMEM; 492 + } 493 + 494 + /* Trim white space */ 495 + if (sscanf(kbuf, "%s", ctx_name) != 1) { 496 + ret = -EINVAL; 497 + goto out; 498 + } 499 + 500 + mutex_lock(&damon_dbgfs_lock); 501 + err = dbgfs_rm_context(ctx_name); 502 + if (err) 503 + ret = err; 504 + 
mutex_unlock(&damon_dbgfs_lock); 505 + 506 + out: 507 + kfree(kbuf); 508 + kfree(ctx_name); 509 + return ret; 510 + } 511 + 512 + static ssize_t dbgfs_monitor_on_read(struct file *file, 513 + char __user *buf, size_t count, loff_t *ppos) 514 + { 515 + char monitor_on_buf[5]; 516 + bool monitor_on = damon_nr_running_ctxs() != 0; 517 + int len; 518 + 519 + len = scnprintf(monitor_on_buf, 5, monitor_on ? "on\n" : "off\n"); 520 + 521 + return simple_read_from_buffer(buf, count, ppos, monitor_on_buf, len); 522 + } 523 + 524 + static ssize_t dbgfs_monitor_on_write(struct file *file, 525 + const char __user *buf, size_t count, loff_t *ppos) 526 + { 527 + ssize_t ret = count; 528 + char *kbuf; 529 + int err; 530 + 531 + kbuf = user_input_str(buf, count, ppos); 532 + if (IS_ERR(kbuf)) 533 + return PTR_ERR(kbuf); 534 + 535 + /* Remove white space */ 536 + if (sscanf(kbuf, "%s", kbuf) != 1) { 537 + kfree(kbuf); 538 + return -EINVAL; 539 + } 540 + 541 + if (!strncmp(kbuf, "on", count)) 542 + err = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs); 543 + else if (!strncmp(kbuf, "off", count)) 544 + err = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs); 545 + else 546 + err = -EINVAL; 547 + 548 + if (err) 549 + ret = err; 550 + kfree(kbuf); 551 + return ret; 552 + } 553 + 554 + static const struct file_operations mk_contexts_fops = { 555 + .write = dbgfs_mk_context_write, 556 + }; 557 + 558 + static const struct file_operations rm_contexts_fops = { 559 + .write = dbgfs_rm_context_write, 560 + }; 561 + 562 + static const struct file_operations monitor_on_fops = { 563 + .read = dbgfs_monitor_on_read, 564 + .write = dbgfs_monitor_on_write, 565 + }; 566 + 567 + static int __init __damon_dbgfs_init(void) 568 + { 569 + struct dentry *dbgfs_root; 570 + const char * const file_names[] = {"mk_contexts", "rm_contexts", 571 + "monitor_on"}; 572 + const struct file_operations *fops[] = {&mk_contexts_fops, 573 + &rm_contexts_fops, &monitor_on_fops}; 574 + int i; 575 + 576 + dbgfs_root = 
debugfs_create_dir("damon", NULL); 577 + 578 + for (i = 0; i < ARRAY_SIZE(file_names); i++) 579 + debugfs_create_file(file_names[i], 0600, dbgfs_root, NULL, 580 + fops[i]); 581 + dbgfs_fill_ctx_dir(dbgfs_root, dbgfs_ctxs[0]); 582 + 583 + dbgfs_dirs = kmalloc_array(1, sizeof(dbgfs_root), GFP_KERNEL); 584 + if (!dbgfs_dirs) { 585 + debugfs_remove(dbgfs_root); 586 + return -ENOMEM; 587 + } 588 + dbgfs_dirs[0] = dbgfs_root; 589 + 590 + return 0; 591 + } 592 + 593 + /* 594 + * Functions for the initialization 595 + */ 596 + 597 + static int __init damon_dbgfs_init(void) 598 + { 599 + int rc; 600 + 601 + dbgfs_ctxs = kmalloc(sizeof(*dbgfs_ctxs), GFP_KERNEL); 602 + if (!dbgfs_ctxs) 603 + return -ENOMEM; 604 + dbgfs_ctxs[0] = dbgfs_new_ctx(); 605 + if (!dbgfs_ctxs[0]) { 606 + kfree(dbgfs_ctxs); 607 + return -ENOMEM; 608 + } 609 + dbgfs_nr_ctxs = 1; 610 + 611 + rc = __damon_dbgfs_init(); 612 + if (rc) { 613 + kfree(dbgfs_ctxs[0]); 614 + kfree(dbgfs_ctxs); 615 + pr_err("%s: dbgfs init failed\n", __func__); 616 + } 617 + 618 + return rc; 619 + } 620 + 621 + module_init(damon_dbgfs_init); 622 + 623 + #include "dbgfs-test.h"
+329
mm/damon/vaddr-test.h
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Data Access Monitor Unit Tests 4 + * 5 + * Copyright 2019 Amazon.com, Inc. or its affiliates. All rights reserved. 6 + * 7 + * Author: SeongJae Park <sjpark@amazon.de> 8 + */ 9 + 10 + #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST 11 + 12 + #ifndef _DAMON_VADDR_TEST_H 13 + #define _DAMON_VADDR_TEST_H 14 + 15 + #include <kunit/test.h> 16 + 17 + static void __link_vmas(struct vm_area_struct *vmas, ssize_t nr_vmas) 18 + { 19 + int i, j; 20 + unsigned long largest_gap, gap; 21 + 22 + if (!nr_vmas) 23 + return; 24 + 25 + for (i = 0; i < nr_vmas - 1; i++) { 26 + vmas[i].vm_next = &vmas[i + 1]; 27 + 28 + vmas[i].vm_rb.rb_left = NULL; 29 + vmas[i].vm_rb.rb_right = &vmas[i + 1].vm_rb; 30 + 31 + largest_gap = 0; 32 + for (j = i; j < nr_vmas; j++) { 33 + if (j == 0) 34 + continue; 35 + gap = vmas[j].vm_start - vmas[j - 1].vm_end; 36 + if (gap > largest_gap) 37 + largest_gap = gap; 38 + } 39 + vmas[i].rb_subtree_gap = largest_gap; 40 + } 41 + vmas[i].vm_next = NULL; 42 + vmas[i].vm_rb.rb_right = NULL; 43 + vmas[i].rb_subtree_gap = 0; 44 + } 45 + 46 + /* 47 + * Test __damon_va_three_regions() function 48 + * 49 + * In the case of virtual address space monitoring, DAMON converts the 50 + * complex and dynamic memory mappings of each target task to three 51 + * discontiguous regions which cover every mapped area. However, the three 52 + * regions should not include the two biggest unmapped areas in the original 53 + * mapping, because the two biggest areas are normally the areas between 1) 54 + * the heap and the mmap()-ed regions, and 2) the mmap()-ed regions and the stack. 55 + * Because these two unmapped areas are huge but obviously never accessed, 56 + * covering them is just a waste. 57 + * 58 + * '__damon_va_three_regions()' receives an address space of a process. It 59 + * first identifies the start of the mappings, the end of the mappings, and the two biggest 60 + * unmapped areas. 
After that, based on this information, it constructs the 61 + * three regions and returns. For more detail, refer to the comment on the 62 + * 'damon_init_regions_of()' function definition in the 'mm/damon.c' file. 63 + * 64 + * For example, suppose virtual address ranges of 10-20, 20-25, 200-210, 65 + * 210-220, 300-305, and 307-330 (other comments write these mappings in 66 + * the shorter form: 10-20-25, 200-210-220, 300-305, 307-330) of a process are 67 + * mapped. To cover every mapping, the three regions should start with 10, 68 + * and end with 330. The process also has three unmapped areas, 25-200, 69 + * 220-300, and 305-307. Among those, 25-200 and 220-300 are the two biggest 70 + * unmapped areas, and thus the mapping should be converted to three regions of 10-25, 71 + * 200-220, and 300-330. 72 + */ 73 + static void damon_test_three_regions_in_vmas(struct kunit *test) 74 + { 75 + struct damon_addr_range regions[3] = {0,}; 76 + /* 10-20-25, 200-210-220, 300-305, 307-330 */ 77 + struct vm_area_struct vmas[] = { 78 + (struct vm_area_struct) {.vm_start = 10, .vm_end = 20}, 79 + (struct vm_area_struct) {.vm_start = 20, .vm_end = 25}, 80 + (struct vm_area_struct) {.vm_start = 200, .vm_end = 210}, 81 + (struct vm_area_struct) {.vm_start = 210, .vm_end = 220}, 82 + (struct vm_area_struct) {.vm_start = 300, .vm_end = 305}, 83 + (struct vm_area_struct) {.vm_start = 307, .vm_end = 330}, 84 + }; 85 + 86 + __link_vmas(vmas, 6); 87 + 88 + __damon_va_three_regions(&vmas[0], regions); 89 + 90 + KUNIT_EXPECT_EQ(test, 10ul, regions[0].start); 91 + KUNIT_EXPECT_EQ(test, 25ul, regions[0].end); 92 + KUNIT_EXPECT_EQ(test, 200ul, regions[1].start); 93 + KUNIT_EXPECT_EQ(test, 220ul, regions[1].end); 94 + KUNIT_EXPECT_EQ(test, 300ul, regions[2].start); 95 + KUNIT_EXPECT_EQ(test, 330ul, regions[2].end); 96 + } 97 + 98 + static struct damon_region *__nth_region_of(struct damon_target *t, int idx) 99 + { 100 + struct damon_region *r; 101 + unsigned int i = 0; 102 + 103 + damon_for_each_region(r, 
t) { 104 + if (i++ == idx) 105 + return r; 106 + } 107 + 108 + return NULL; 109 + } 110 + 111 + /* 112 + * Test 'damon_va_apply_three_regions()' 113 + * 114 + * test kunit object 115 + * regions an array containing start/end addresses of current 116 + * monitoring target regions 117 + * nr_regions the number of the addresses in 'regions' 118 + * three_regions The three regions that need to be applied now 119 + * expected start/end addresses of monitoring target regions that 120 + * 'three_regions' are applied 121 + * nr_expected the number of addresses in 'expected' 122 + * 123 + * The memory mapping of the target processes changes dynamically. To follow 124 + * the change, DAMON periodically reads the mappings, simplifies it to the 125 + * three regions, and updates the monitoring target regions to fit in the three 126 + * regions. The update of current target regions is the role of 127 + * 'damon_va_apply_three_regions()'. 128 + * 129 + * This test passes the given target regions and the new three regions that 130 + * need to be applied to the function and check whether it updates the regions 131 + * as expected. 
132 + */ 133 + static void damon_do_test_apply_three_regions(struct kunit *test, 134 + unsigned long *regions, int nr_regions, 135 + struct damon_addr_range *three_regions, 136 + unsigned long *expected, int nr_expected) 137 + { 138 + struct damon_ctx *ctx = damon_new_ctx(); 139 + struct damon_target *t; 140 + struct damon_region *r; 141 + int i; 142 + 143 + t = damon_new_target(42); 144 + for (i = 0; i < nr_regions / 2; i++) { 145 + r = damon_new_region(regions[i * 2], regions[i * 2 + 1]); 146 + damon_add_region(r, t); 147 + } 148 + damon_add_target(ctx, t); 149 + 150 + damon_va_apply_three_regions(t, three_regions); 151 + 152 + for (i = 0; i < nr_expected / 2; i++) { 153 + r = __nth_region_of(t, i); 154 + KUNIT_EXPECT_EQ(test, r->ar.start, expected[i * 2]); 155 + KUNIT_EXPECT_EQ(test, r->ar.end, expected[i * 2 + 1]); 156 + } 157 + 158 + damon_destroy_ctx(ctx); 159 + } 160 + 161 + /* 162 + * This function tests the most common case, where the three big regions are only 163 + * slightly changed. Target regions should adjust their boundaries (10-20-30, 164 + * 50-55, 70-80, 90-100) to fit the new big regions, or remove target 165 + * regions (57-79) that are now out of the three regions. 
166 + */ 167 + static void damon_test_apply_three_regions1(struct kunit *test) 168 + { 169 + /* 10-20-30, 50-55-57-59, 70-80-90-100 */ 170 + unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59, 171 + 70, 80, 80, 90, 90, 100}; 172 + /* 5-27, 45-55, 73-104 */ 173 + struct damon_addr_range new_three_regions[3] = { 174 + (struct damon_addr_range){.start = 5, .end = 27}, 175 + (struct damon_addr_range){.start = 45, .end = 55}, 176 + (struct damon_addr_range){.start = 73, .end = 104} }; 177 + /* 5-20-27, 45-55, 73-80-90-104 */ 178 + unsigned long expected[] = {5, 20, 20, 27, 45, 55, 179 + 73, 80, 80, 90, 90, 104}; 180 + 181 + damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions), 182 + new_three_regions, expected, ARRAY_SIZE(expected)); 183 + } 184 + 185 + /* 186 + * Test a slightly bigger change. Similar to above, but the second big region 187 + * now requires two target regions (50-55, 57-59) to be removed. 188 + */ 189 + static void damon_test_apply_three_regions2(struct kunit *test) 190 + { 191 + /* 10-20-30, 50-55-57-59, 70-80-90-100 */ 192 + unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59, 193 + 70, 80, 80, 90, 90, 100}; 194 + /* 5-27, 56-57, 65-104 */ 195 + struct damon_addr_range new_three_regions[3] = { 196 + (struct damon_addr_range){.start = 5, .end = 27}, 197 + (struct damon_addr_range){.start = 56, .end = 57}, 198 + (struct damon_addr_range){.start = 65, .end = 104} }; 199 + /* 5-20-27, 56-57, 65-80-90-104 */ 200 + unsigned long expected[] = {5, 20, 20, 27, 56, 57, 201 + 65, 80, 80, 90, 90, 104}; 202 + 203 + damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions), 204 + new_three_regions, expected, ARRAY_SIZE(expected)); 205 + } 206 + 207 + /* 208 + * Test a big change. The second big region has been totally freed and mapped to 209 + * a different area (50-59 -> 61-63). 
The target regions which were in the old 210 + * second big region (50-55-57-59) should be removed, and a new target region 211 + * covering the new second big region (61-63) should be created. 212 + */ 213 + static void damon_test_apply_three_regions3(struct kunit *test) 214 + { 215 + /* 10-20-30, 50-55-57-59, 70-80-90-100 */ 216 + unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59, 217 + 70, 80, 80, 90, 90, 100}; 218 + /* 5-27, 61-63, 65-104 */ 219 + struct damon_addr_range new_three_regions[3] = { 220 + (struct damon_addr_range){.start = 5, .end = 27}, 221 + (struct damon_addr_range){.start = 61, .end = 63}, 222 + (struct damon_addr_range){.start = 65, .end = 104} }; 223 + /* 5-20-27, 61-63, 65-80-90-104 */ 224 + unsigned long expected[] = {5, 20, 20, 27, 61, 63, 225 + 65, 80, 80, 90, 90, 104}; 226 + 227 + damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions), 228 + new_three_regions, expected, ARRAY_SIZE(expected)); 229 + } 230 + 231 + /* 232 + * Test another big change. Both the second and third big regions (50-59 233 + * and 70-100) have been totally freed and mapped to different areas (30-32 and 234 + * 65-68). The target regions which were in the old second and third big 235 + * regions should now be removed and new target regions covering the new second 236 + * and third big regions should be created. 
237 + */ 238 + static void damon_test_apply_three_regions4(struct kunit *test) 239 + { 240 + /* 10-20-30, 50-55-57-59, 70-80-90-100 */ 241 + unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59, 242 + 70, 80, 80, 90, 90, 100}; 243 + /* 5-7, 30-32, 65-68 */ 244 + struct damon_addr_range new_three_regions[3] = { 245 + (struct damon_addr_range){.start = 5, .end = 7}, 246 + (struct damon_addr_range){.start = 30, .end = 32}, 247 + (struct damon_addr_range){.start = 65, .end = 68} }; 248 + /* expect 5-7, 30-32, 65-68 */ 249 + unsigned long expected[] = {5, 7, 30, 32, 65, 68}; 250 + 251 + damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions), 252 + new_three_regions, expected, ARRAY_SIZE(expected)); 253 + } 254 + 255 + static void damon_test_split_evenly(struct kunit *test) 256 + { 257 + struct damon_ctx *c = damon_new_ctx(); 258 + struct damon_target *t; 259 + struct damon_region *r; 260 + unsigned long i; 261 + 262 + KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(NULL, NULL, 5), 263 + -EINVAL); 264 + 265 + t = damon_new_target(42); 266 + r = damon_new_region(0, 100); 267 + KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 0), -EINVAL); 268 + 269 + damon_add_region(r, t); 270 + KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 10), 0); 271 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 10u); 272 + 273 + i = 0; 274 + damon_for_each_region(r, t) { 275 + KUNIT_EXPECT_EQ(test, r->ar.start, i++ * 10); 276 + KUNIT_EXPECT_EQ(test, r->ar.end, i * 10); 277 + } 278 + damon_free_target(t); 279 + 280 + t = damon_new_target(42); 281 + r = damon_new_region(5, 59); 282 + damon_add_region(r, t); 283 + KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 5), 0); 284 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 5u); 285 + 286 + i = 0; 287 + damon_for_each_region(r, t) { 288 + if (i == 4) 289 + break; 290 + KUNIT_EXPECT_EQ(test, r->ar.start, 5 + 10 * i++); 291 + KUNIT_EXPECT_EQ(test, r->ar.end, 5 + 10 * i); 292 + } 293 + KUNIT_EXPECT_EQ(test, 
r->ar.start, 5 + 10 * i); 294 + KUNIT_EXPECT_EQ(test, r->ar.end, 59ul); 295 + damon_free_target(t); 296 + 297 + t = damon_new_target(42); 298 + r = damon_new_region(5, 6); 299 + damon_add_region(r, t); 300 + KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 2), -EINVAL); 301 + KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 1u); 302 + 303 + damon_for_each_region(r, t) { 304 + KUNIT_EXPECT_EQ(test, r->ar.start, 5ul); 305 + KUNIT_EXPECT_EQ(test, r->ar.end, 6ul); 306 + } 307 + damon_free_target(t); 308 + damon_destroy_ctx(c); 309 + } 310 + 311 + static struct kunit_case damon_test_cases[] = { 312 + KUNIT_CASE(damon_test_three_regions_in_vmas), 313 + KUNIT_CASE(damon_test_apply_three_regions1), 314 + KUNIT_CASE(damon_test_apply_three_regions2), 315 + KUNIT_CASE(damon_test_apply_three_regions3), 316 + KUNIT_CASE(damon_test_apply_three_regions4), 317 + KUNIT_CASE(damon_test_split_evenly), 318 + {}, 319 + }; 320 + 321 + static struct kunit_suite damon_test_suite = { 322 + .name = "damon-primitives", 323 + .test_cases = damon_test_cases, 324 + }; 325 + kunit_test_suite(damon_test_suite); 326 + 327 + #endif /* _DAMON_VADDR_TEST_H */ 328 + 329 + #endif /* CONFIG_DAMON_VADDR_KUNIT_TEST */
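The kunit cases above encode target regions as flattened (start, end) pairs and feed them to damon_do_test_apply_three_regions(). The transformation they verify — drop regions that intersect none of the three new big ranges, then clamp the first and last surviving region of each range to the range boundaries — can be sketched in plain user-space C. This is an illustrative re-implementation, not the kernel code: DAMON_MIN_REGION is taken as 1 (as in the test build), and the kernel's extra step of inserting a fresh region when a big range has no intersecting survivor is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct range { unsigned long start, end; };

/* Half-open [start, end) intersection test, as damon_intersect() does. */
static bool intersects(struct range r, struct range b)
{
	return !(r.end <= b.start || b.end <= r.start);
}

/* Apply the three big ranges to an array of regions in place; returns
 * the number of surviving regions. Mirrors the core of
 * damon_va_apply_three_regions() with DAMON_MIN_REGION == 1. */
static size_t apply_three(struct range *rs, size_t n, const struct range big[3])
{
	size_t out = 0;

	/* 1) Drop regions that intersect none of the big ranges. */
	for (size_t i = 0; i < n; i++) {
		bool keep = false;

		for (int j = 0; j < 3; j++)
			keep |= intersects(rs[i], big[j]);
		if (keep)
			rs[out++] = rs[i];
	}
	n = out;

	/* 2) Clamp the first/last intersecting region to each big range. */
	for (int j = 0; j < 3; j++) {
		size_t first = n, last = n;

		for (size_t i = 0; i < n; i++) {
			if (intersects(rs[i], big[j])) {
				if (first == n)
					first = i;
				last = i;
			}
		}
		if (first < n) {
			rs[first].start = big[j].start;
			rs[last].end = big[j].end;
		}
	}
	return n;
}
```

Running it on the data of damon_test_apply_three_regions1() (regions 10-20-30, 50-55-57-59, 70-80-90-100; new big ranges 5-27, 45-55, 73-104) reproduces the expected array 5-20-27, 45-55, 73-80-90-104: the 55-57 and 57-59 regions fall outside every big range and are dropped, while the boundary regions are stretched to 5, 27, 45, 73, and 104.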
+672
mm/damon/vaddr.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * DAMON Primitives for Virtual Address Spaces 4 + * 5 + * Author: SeongJae Park <sjpark@amazon.de> 6 + */ 7 + 8 + #define pr_fmt(fmt) "damon-va: " fmt 9 + 10 + #include <linux/damon.h> 11 + #include <linux/hugetlb.h> 12 + #include <linux/mm.h> 13 + #include <linux/mmu_notifier.h> 14 + #include <linux/highmem.h> 15 + #include <linux/page_idle.h> 16 + #include <linux/pagewalk.h> 17 + #include <linux/random.h> 18 + #include <linux/sched/mm.h> 19 + #include <linux/slab.h> 20 + 21 + #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST 22 + #undef DAMON_MIN_REGION 23 + #define DAMON_MIN_REGION 1 24 + #endif 25 + 26 + /* Get a random number in [l, r) */ 27 + #define damon_rand(l, r) (l + prandom_u32_max(r - l)) 28 + 29 + /* 30 + * 't->id' should be the pointer to the relevant 'struct pid' having reference 31 + * count. Caller must put the returned task, unless it is NULL. 32 + */ 33 + #define damon_get_task_struct(t) \ 34 + (get_pid_task((struct pid *)t->id, PIDTYPE_PID)) 35 + 36 + /* 37 + * Get the mm_struct of the given target 38 + * 39 + * Caller _must_ put the mm_struct after use, unless it is NULL. 40 + * 41 + * Returns the mm_struct of the target on success, NULL on failure 42 + */ 43 + static struct mm_struct *damon_get_mm(struct damon_target *t) 44 + { 45 + struct task_struct *task; 46 + struct mm_struct *mm; 47 + 48 + task = damon_get_task_struct(t); 49 + if (!task) 50 + return NULL; 51 + 52 + mm = get_task_mm(task); 53 + put_task_struct(task); 54 + return mm; 55 + } 56 + 57 + /* 58 + * Functions for the initial monitoring target regions construction 59 + */ 60 + 61 + /* 62 + * Size-evenly split a region into 'nr_pieces' small regions 63 + * 64 + * Returns 0 on success, or negative error code otherwise. 
65 + */ 66 + static int damon_va_evenly_split_region(struct damon_target *t, 67 + struct damon_region *r, unsigned int nr_pieces) 68 + { 69 + unsigned long sz_orig, sz_piece, orig_end; 70 + struct damon_region *n = NULL, *next; 71 + unsigned long start; 72 + 73 + if (!r || !nr_pieces) 74 + return -EINVAL; 75 + 76 + orig_end = r->ar.end; 77 + sz_orig = r->ar.end - r->ar.start; 78 + sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION); 79 + 80 + if (!sz_piece) 81 + return -EINVAL; 82 + 83 + r->ar.end = r->ar.start + sz_piece; 84 + next = damon_next_region(r); 85 + for (start = r->ar.end; start + sz_piece <= orig_end; 86 + start += sz_piece) { 87 + n = damon_new_region(start, start + sz_piece); 88 + if (!n) 89 + return -ENOMEM; 90 + damon_insert_region(n, r, next, t); 91 + r = n; 92 + } 93 + /* complement last region for possible rounding error */ 94 + if (n) 95 + n->ar.end = orig_end; 96 + 97 + return 0; 98 + } 99 + 100 + static unsigned long sz_range(struct damon_addr_range *r) 101 + { 102 + return r->end - r->start; 103 + } 104 + 105 + static void swap_ranges(struct damon_addr_range *r1, 106 + struct damon_addr_range *r2) 107 + { 108 + struct damon_addr_range tmp; 109 + 110 + tmp = *r1; 111 + *r1 = *r2; 112 + *r2 = tmp; 113 + } 114 + 115 + /* 116 + * Find three regions separated by two biggest unmapped regions 117 + * 118 + * vma the head vma of the target address space 119 + * regions an array of three address ranges that results will be saved 120 + * 121 + * This function receives an address space and finds three regions in it which 122 + * separated by the two biggest unmapped regions in the space. Please refer to 123 + * below comments of '__damon_va_init_regions()' function to know why this is 124 + * necessary. 125 + * 126 + * Returns 0 if success, or negative error code otherwise. 
127 + */ 128 + static int __damon_va_three_regions(struct vm_area_struct *vma, 129 + struct damon_addr_range regions[3]) 130 + { 131 + struct damon_addr_range gap = {0}, first_gap = {0}, second_gap = {0}; 132 + struct vm_area_struct *last_vma = NULL; 133 + unsigned long start = 0; 134 + struct rb_root rbroot; 135 + 136 + /* Find two biggest gaps so that first_gap > second_gap > others */ 137 + for (; vma; vma = vma->vm_next) { 138 + if (!last_vma) { 139 + start = vma->vm_start; 140 + goto next; 141 + } 142 + 143 + if (vma->rb_subtree_gap <= sz_range(&second_gap)) { 144 + rbroot.rb_node = &vma->vm_rb; 145 + vma = rb_entry(rb_last(&rbroot), 146 + struct vm_area_struct, vm_rb); 147 + goto next; 148 + } 149 + 150 + gap.start = last_vma->vm_end; 151 + gap.end = vma->vm_start; 152 + if (sz_range(&gap) > sz_range(&second_gap)) { 153 + swap_ranges(&gap, &second_gap); 154 + if (sz_range(&second_gap) > sz_range(&first_gap)) 155 + swap_ranges(&second_gap, &first_gap); 156 + } 157 + next: 158 + last_vma = vma; 159 + } 160 + 161 + if (!sz_range(&second_gap) || !sz_range(&first_gap)) 162 + return -EINVAL; 163 + 164 + /* Sort the two biggest gaps by address */ 165 + if (first_gap.start > second_gap.start) 166 + swap_ranges(&first_gap, &second_gap); 167 + 168 + /* Store the result */ 169 + regions[0].start = ALIGN(start, DAMON_MIN_REGION); 170 + regions[0].end = ALIGN(first_gap.start, DAMON_MIN_REGION); 171 + regions[1].start = ALIGN(first_gap.end, DAMON_MIN_REGION); 172 + regions[1].end = ALIGN(second_gap.start, DAMON_MIN_REGION); 173 + regions[2].start = ALIGN(second_gap.end, DAMON_MIN_REGION); 174 + regions[2].end = ALIGN(last_vma->vm_end, DAMON_MIN_REGION); 175 + 176 + return 0; 177 + } 178 + 179 + /* 180 + * Get the three regions in the given target (task) 181 + * 182 + * Returns 0 on success, negative error code otherwise. 
183 + */ 184 + static int damon_va_three_regions(struct damon_target *t, 185 + struct damon_addr_range regions[3]) 186 + { 187 + struct mm_struct *mm; 188 + int rc; 189 + 190 + mm = damon_get_mm(t); 191 + if (!mm) 192 + return -EINVAL; 193 + 194 + mmap_read_lock(mm); 195 + rc = __damon_va_three_regions(mm->mmap, regions); 196 + mmap_read_unlock(mm); 197 + 198 + mmput(mm); 199 + return rc; 200 + } 201 + 202 + /* 203 + * Initialize the monitoring target regions for the given target (task) 204 + * 205 + * t the given target 206 + * 207 + * Because only a number of small portions of the entire address space 208 + * are actually mapped to the memory and accessed, monitoring the unmapped 209 + * regions is wasteful. That said, because we can deal with small noises, 210 + * tracking every mapping is not strictly required but could even incur a high 211 + * overhead if the mapping frequently changes or the number of mappings is 212 + * high. The adaptive regions adjustment mechanism will further help to deal 213 + * with the noise by simply identifying the unmapped areas as a region that 214 + * has no access. Moreover, applying the real mappings that would have many 215 + * unmapped areas inside will make the adaptive mechanism quite complex. That 216 + * said, too huge unmapped areas inside the monitoring target should be removed 217 + * to not take the time for the adaptive mechanism. 218 + * 219 + * For this reason, we convert the complex mappings to three distinct regions 220 + * that cover every mapped area of the address space. Also the two gaps 221 + * between the three regions are the two biggest unmapped areas in the given 222 + * address space. In detail, this function first identifies the start and the 223 + * end of the mappings and the two biggest unmapped areas of the address space.
224 + * Then, it constructs the three regions as below: 225 + * 226 + * [mappings[0]->start, big_two_unmapped_areas[0]->start) 227 + * [big_two_unmapped_areas[0]->end, big_two_unmapped_areas[1]->start) 228 + * [big_two_unmapped_areas[1]->end, mappings[nr_mappings - 1]->end) 229 + * 230 + * As the usual memory map of processes is as below, the gap between the heap and 231 + * the uppermost mmap()-ed region, and the gap between the lowermost mmap()-ed 232 + * region and the stack will be the two biggest unmapped regions. Because these 233 + * gaps are exceptionally huge areas in a usual address space, excluding these 234 + * two biggest unmapped regions will be sufficient to make a trade-off. 235 + * 236 + * <heap> 237 + * <BIG UNMAPPED REGION 1> 238 + * <uppermost mmap()-ed region> 239 + * (other mmap()-ed regions and small unmapped regions) 240 + * <lowermost mmap()-ed region> 241 + * <BIG UNMAPPED REGION 2> 242 + * <stack> 243 + */ 244 + static void __damon_va_init_regions(struct damon_ctx *ctx, 245 + struct damon_target *t) 246 + { 247 + struct damon_region *r; 248 + struct damon_addr_range regions[3]; 249 + unsigned long sz = 0, nr_pieces; 250 + int i; 251 + 252 + if (damon_va_three_regions(t, regions)) { 253 + pr_err("Failed to get three regions of target %lu\n", t->id); 254 + return; 255 + } 256 + 257 + for (i = 0; i < 3; i++) 258 + sz += regions[i].end - regions[i].start; 259 + if (ctx->min_nr_regions) 260 + sz /= ctx->min_nr_regions; 261 + if (sz < DAMON_MIN_REGION) 262 + sz = DAMON_MIN_REGION; 263 + 264 + /* Set the initial three regions of the target */ 265 + for (i = 0; i < 3; i++) { 266 + r = damon_new_region(regions[i].start, regions[i].end); 267 + if (!r) { 268 + pr_err("%d'th init region creation failed\n", i); 269 + return; 270 + } 271 + damon_add_region(r, t); 272 + 273 + nr_pieces = (regions[i].end - regions[i].start) / sz; 274 + damon_va_evenly_split_region(t, r, nr_pieces); 275 + } 276 + } 277 + 278 + /* Initialize '->regions_list' of every target (task) 
*/ 279 + void damon_va_init(struct damon_ctx *ctx) 280 + { 281 + struct damon_target *t; 282 + 283 + damon_for_each_target(t, ctx) { 284 + /* the user may set the target regions as they want */ 285 + if (!damon_nr_regions(t)) 286 + __damon_va_init_regions(ctx, t); 287 + } 288 + } 289 + 290 + /* 291 + * Functions for the dynamic monitoring target regions update 292 + */ 293 + 294 + /* 295 + * Check whether a region is intersecting an address range 296 + * 297 + * Returns true if it is. 298 + */ 299 + static bool damon_intersect(struct damon_region *r, struct damon_addr_range *re) 300 + { 301 + return !(r->ar.end <= re->start || re->end <= r->ar.start); 302 + } 303 + 304 + /* 305 + * Update damon regions for the three big regions of the given target 306 + * 307 + * t the given target 308 + * bregions the three big regions of the target 309 + */ 310 + static void damon_va_apply_three_regions(struct damon_target *t, 311 + struct damon_addr_range bregions[3]) 312 + { 313 + struct damon_region *r, *next; 314 + unsigned int i = 0; 315 + 316 + /* Remove regions which are not in the three big regions now */ 317 + damon_for_each_region_safe(r, next, t) { 318 + for (i = 0; i < 3; i++) { 319 + if (damon_intersect(r, &bregions[i])) 320 + break; 321 + } 322 + if (i == 3) 323 + damon_destroy_region(r, t); 324 + } 325 + 326 + /* Adjust intersecting regions to fit with the three big regions */ 327 + for (i = 0; i < 3; i++) { 328 + struct damon_region *first = NULL, *last; 329 + struct damon_region *newr; 330 + struct damon_addr_range *br; 331 + 332 + br = &bregions[i]; 333 + /* Get the first and last regions which intersect with br */ 334 + damon_for_each_region(r, t) { 335 + if (damon_intersect(r, br)) { 336 + if (!first) 337 + first = r; 338 + last = r; 339 + } 340 + if (r->ar.start >= br->end) 341 + break; 342 + } 343 + if (!first) { 344 + /* no damon_region intersects with this big region */ 345 + newr = damon_new_region( 346 + ALIGN_DOWN(br->start, 347 + DAMON_MIN_REGION), 
348 + ALIGN(br->end, DAMON_MIN_REGION)); 349 + if (!newr) 350 + continue; 351 + damon_insert_region(newr, damon_prev_region(r), r, t); 352 + } else { 353 + first->ar.start = ALIGN_DOWN(br->start, 354 + DAMON_MIN_REGION); 355 + last->ar.end = ALIGN(br->end, DAMON_MIN_REGION); 356 + } 357 + } 358 + } 359 + 360 + /* 361 + * Update regions for current memory mappings 362 + */ 363 + void damon_va_update(struct damon_ctx *ctx) 364 + { 365 + struct damon_addr_range three_regions[3]; 366 + struct damon_target *t; 367 + 368 + damon_for_each_target(t, ctx) { 369 + if (damon_va_three_regions(t, three_regions)) 370 + continue; 371 + damon_va_apply_three_regions(t, three_regions); 372 + } 373 + } 374 + 375 + /* 376 + * Get an online page for a pfn if it's in the LRU list. Otherwise, returns 377 + * NULL. 378 + * 379 + * The body of this function is stolen from the 'page_idle_get_page()'. We 380 + * steal rather than reuse it because the code is quite simple. 381 + */ 382 + static struct page *damon_get_page(unsigned long pfn) 383 + { 384 + struct page *page = pfn_to_online_page(pfn); 385 + 386 + if (!page || !PageLRU(page) || !get_page_unless_zero(page)) 387 + return NULL; 388 + 389 + if (unlikely(!PageLRU(page))) { 390 + put_page(page); 391 + page = NULL; 392 + } 393 + return page; 394 + } 395 + 396 + static void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, 397 + unsigned long addr) 398 + { 399 + bool referenced = false; 400 + struct page *page = damon_get_page(pte_pfn(*pte)); 401 + 402 + if (!page) 403 + return; 404 + 405 + if (pte_young(*pte)) { 406 + referenced = true; 407 + *pte = pte_mkold(*pte); 408 + } 409 + 410 + #ifdef CONFIG_MMU_NOTIFIER 411 + if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE)) 412 + referenced = true; 413 + #endif /* CONFIG_MMU_NOTIFIER */ 414 + 415 + if (referenced) 416 + set_page_young(page); 417 + 418 + set_page_idle(page); 419 + put_page(page); 420 + } 421 + 422 + static void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, 423 + 
unsigned long addr) 424 + { 425 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 426 + bool referenced = false; 427 + struct page *page = damon_get_page(pmd_pfn(*pmd)); 428 + 429 + if (!page) 430 + return; 431 + 432 + if (pmd_young(*pmd)) { 433 + referenced = true; 434 + *pmd = pmd_mkold(*pmd); 435 + } 436 + 437 + #ifdef CONFIG_MMU_NOTIFIER 438 + if (mmu_notifier_clear_young(mm, addr, 439 + addr + ((1UL) << HPAGE_PMD_SHIFT))) 440 + referenced = true; 441 + #endif /* CONFIG_MMU_NOTIFIER */ 442 + 443 + if (referenced) 444 + set_page_young(page); 445 + 446 + set_page_idle(page); 447 + put_page(page); 448 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 449 + } 450 + 451 + static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr, 452 + unsigned long next, struct mm_walk *walk) 453 + { 454 + pte_t *pte; 455 + spinlock_t *ptl; 456 + 457 + if (pmd_huge(*pmd)) { 458 + ptl = pmd_lock(walk->mm, pmd); 459 + if (pmd_huge(*pmd)) { 460 + damon_pmdp_mkold(pmd, walk->mm, addr); 461 + spin_unlock(ptl); 462 + return 0; 463 + } 464 + spin_unlock(ptl); 465 + } 466 + 467 + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) 468 + return 0; 469 + pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); 470 + if (!pte_present(*pte)) 471 + goto out; 472 + damon_ptep_mkold(pte, walk->mm, addr); 473 + out: 474 + pte_unmap_unlock(pte, ptl); 475 + return 0; 476 + } 477 + 478 + static struct mm_walk_ops damon_mkold_ops = { 479 + .pmd_entry = damon_mkold_pmd_entry, 480 + }; 481 + 482 + static void damon_va_mkold(struct mm_struct *mm, unsigned long addr) 483 + { 484 + mmap_read_lock(mm); 485 + walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL); 486 + mmap_read_unlock(mm); 487 + } 488 + 489 + /* 490 + * Functions for the access checking of the regions 491 + */ 492 + 493 + static void damon_va_prepare_access_check(struct damon_ctx *ctx, 494 + struct mm_struct *mm, struct damon_region *r) 495 + { 496 + r->sampling_addr = damon_rand(r->ar.start, r->ar.end); 497 + 498 + damon_va_mkold(mm, r->sampling_addr); 
499 + } 500 + 501 + void damon_va_prepare_access_checks(struct damon_ctx *ctx) 502 + { 503 + struct damon_target *t; 504 + struct mm_struct *mm; 505 + struct damon_region *r; 506 + 507 + damon_for_each_target(t, ctx) { 508 + mm = damon_get_mm(t); 509 + if (!mm) 510 + continue; 511 + damon_for_each_region(r, t) 512 + damon_va_prepare_access_check(ctx, mm, r); 513 + mmput(mm); 514 + } 515 + } 516 + 517 + struct damon_young_walk_private { 518 + unsigned long *page_sz; 519 + bool young; 520 + }; 521 + 522 + static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr, 523 + unsigned long next, struct mm_walk *walk) 524 + { 525 + pte_t *pte; 526 + spinlock_t *ptl; 527 + struct page *page; 528 + struct damon_young_walk_private *priv = walk->private; 529 + 530 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 531 + if (pmd_huge(*pmd)) { 532 + ptl = pmd_lock(walk->mm, pmd); 533 + if (!pmd_huge(*pmd)) { 534 + spin_unlock(ptl); 535 + goto regular_page; 536 + } 537 + page = damon_get_page(pmd_pfn(*pmd)); 538 + if (!page) 539 + goto huge_out; 540 + if (pmd_young(*pmd) || !page_is_idle(page) || 541 + mmu_notifier_test_young(walk->mm, 542 + addr)) { 543 + *priv->page_sz = ((1UL) << HPAGE_PMD_SHIFT); 544 + priv->young = true; 545 + } 546 + put_page(page); 547 + huge_out: 548 + spin_unlock(ptl); 549 + return 0; 550 + } 551 + 552 + regular_page: 553 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 554 + 555 + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) 556 + return -EINVAL; 557 + pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); 558 + if (!pte_present(*pte)) 559 + goto out; 560 + page = damon_get_page(pte_pfn(*pte)); 561 + if (!page) 562 + goto out; 563 + if (pte_young(*pte) || !page_is_idle(page) || 564 + mmu_notifier_test_young(walk->mm, addr)) { 565 + *priv->page_sz = PAGE_SIZE; 566 + priv->young = true; 567 + } 568 + put_page(page); 569 + out: 570 + pte_unmap_unlock(pte, ptl); 571 + return 0; 572 + } 573 + 574 + static struct mm_walk_ops damon_young_ops = { 575 + .pmd_entry = 
damon_young_pmd_entry, 576 + }; 577 + 578 + static bool damon_va_young(struct mm_struct *mm, unsigned long addr, 579 + unsigned long *page_sz) 580 + { 581 + struct damon_young_walk_private arg = { 582 + .page_sz = page_sz, 583 + .young = false, 584 + }; 585 + 586 + mmap_read_lock(mm); 587 + walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg); 588 + mmap_read_unlock(mm); 589 + return arg.young; 590 + } 591 + 592 + /* 593 + * Check whether the region was accessed after the last preparation 594 + * 595 + * mm 'mm_struct' for the given virtual address space 596 + * r the region to be checked 597 + */ 598 + static void damon_va_check_access(struct damon_ctx *ctx, 599 + struct mm_struct *mm, struct damon_region *r) 600 + { 601 + static struct mm_struct *last_mm; 602 + static unsigned long last_addr; 603 + static unsigned long last_page_sz = PAGE_SIZE; 604 + static bool last_accessed; 605 + 606 + /* If the region is in the last checked page, reuse the result */ 607 + if (mm == last_mm && (ALIGN_DOWN(last_addr, last_page_sz) == 608 + ALIGN_DOWN(r->sampling_addr, last_page_sz))) { 609 + if (last_accessed) 610 + r->nr_accesses++; 611 + return; 612 + } 613 + 614 + last_accessed = damon_va_young(mm, r->sampling_addr, &last_page_sz); 615 + if (last_accessed) 616 + r->nr_accesses++; 617 + 618 + last_mm = mm; 619 + last_addr = r->sampling_addr; 620 + } 621 + 622 + unsigned int damon_va_check_accesses(struct damon_ctx *ctx) 623 + { 624 + struct damon_target *t; 625 + struct mm_struct *mm; 626 + struct damon_region *r; 627 + unsigned int max_nr_accesses = 0; 628 + 629 + damon_for_each_target(t, ctx) { 630 + mm = damon_get_mm(t); 631 + if (!mm) 632 + continue; 633 + damon_for_each_region(r, t) { 634 + damon_va_check_access(ctx, mm, r); 635 + max_nr_accesses = max(r->nr_accesses, max_nr_accesses); 636 + } 637 + mmput(mm); 638 + } 639 + 640 + return max_nr_accesses; 641 + } 642 + 643 + /* 644 + * Functions for the target validity check and cleanup 645 + */ 646 + 647 + bool 
damon_va_target_valid(void *target) 648 + { 649 + struct damon_target *t = target; 650 + struct task_struct *task; 651 + 652 + task = damon_get_task_struct(t); 653 + if (task) { 654 + put_task_struct(task); 655 + return true; 656 + } 657 + 658 + return false; 659 + } 660 + 661 + void damon_va_set_primitives(struct damon_ctx *ctx) 662 + { 663 + ctx->primitive.init = damon_va_init; 664 + ctx->primitive.update = damon_va_update; 665 + ctx->primitive.prepare_access_checks = damon_va_prepare_access_checks; 666 + ctx->primitive.check_accesses = damon_va_check_accesses; 667 + ctx->primitive.reset_aggregated = NULL; 668 + ctx->primitive.target_valid = damon_va_target_valid; 669 + ctx->primitive.cleanup = NULL; 670 + } 671 + 672 + #include "vaddr-test.h"
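damon_va_evenly_split_region() above divides a region into pieces of ALIGN_DOWN(size / nr_pieces, DAMON_MIN_REGION) bytes each and lets the last piece absorb the rounding remainder. The same arithmetic in a self-contained user-space sketch (function and parameter names here are illustrative, not kernel API; align_down stands in for the kernel's ALIGN_DOWN):

```c
#include <assert.h>
#include <stddef.h>

/* Round x down to a multiple of a (the kernel's ALIGN_DOWN requires a
 * power of two; any positive a works for this sketch). */
static unsigned long align_down(unsigned long x, unsigned long a)
{
	return x - (x % a);
}

/* Fill out[] with [start, end) pairs covering [start, end), mirroring
 * damon_va_evenly_split_region(). Returns 0, or -1 when the region is
 * too small to split (the kernel returns -EINVAL). */
static int split_evenly(unsigned long start, unsigned long end,
			unsigned int nr_pieces, unsigned long min_region,
			unsigned long (*out)[2])
{
	unsigned long sz_piece, pos;
	unsigned int i = 0;

	if (!nr_pieces || end <= start)
		return -1;
	sz_piece = align_down((end - start) / nr_pieces, min_region);
	if (!sz_piece)
		return -1;

	for (pos = start; pos + sz_piece <= end; pos += sz_piece) {
		out[i][0] = pos;
		out[i][1] = pos + sz_piece;
		i++;
	}
	/* complement the last piece for possible rounding error */
	out[i - 1][1] = end;
	return 0;
}
```

With the data from damon_test_split_evenly() in vaddr-test.h — region [5, 59) split into 5 pieces — this yields 10-byte pieces 5-15, 15-25, 25-35, 35-45, with the last piece stretched to 45-59; splitting [5, 6) into 2 fails because the piece size rounds down to zero.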
-5
mm/early_ioremap.c
··· 38 38 return prot; 39 39 } 40 40 41 - void __init __weak early_ioremap_shutdown(void) 42 - { 43 - } 44 - 45 41 void __init early_ioremap_reset(void) 46 42 { 47 - early_ioremap_shutdown(); 48 43 after_paging_init = 1; 49 44 } 50 45
+1 -1
mm/highmem.c
··· 436 436 437 437 static inline int kmap_local_idx_push(void) 438 438 { 439 - WARN_ON_ONCE(in_irq() && !irqs_disabled()); 439 + WARN_ON_ONCE(in_hardirq() && !irqs_disabled()); 440 440 current->kmap_ctrl.idx += KM_INCR; 441 441 BUG_ON(current->kmap_ctrl.idx >= KM_MAX_IDX); 442 442 return current->kmap_ctrl.idx - 1;
-25
mm/ioremap.c
··· 8 8 */ 9 9 #include <linux/vmalloc.h> 10 10 #include <linux/mm.h> 11 - #include <linux/sched.h> 12 11 #include <linux/io.h> 13 12 #include <linux/export.h> 14 - #include <asm/cacheflush.h> 15 13 16 - #include "pgalloc-track.h" 17 - 18 - #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 19 - static unsigned int __ro_after_init iomap_max_page_shift = BITS_PER_LONG - 1; 20 - 21 - static int __init set_nohugeiomap(char *str) 22 - { 23 - iomap_max_page_shift = PAGE_SHIFT; 24 - return 0; 25 - } 26 - early_param("nohugeiomap", set_nohugeiomap); 27 - #else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 28 - static const unsigned int iomap_max_page_shift = PAGE_SHIFT; 29 - #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 30 - 31 - int ioremap_page_range(unsigned long addr, 32 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 33 - { 34 - return vmap_range(addr, end, phys_addr, prot, iomap_max_page_shift); 35 - } 36 - 37 - #ifdef CONFIG_GENERIC_IOREMAP 38 14 void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot) 39 15 { 40 16 unsigned long offset, vaddr; ··· 47 71 vunmap((void *)((unsigned long)addr & PAGE_MASK)); 48 72 } 49 73 EXPORT_SYMBOL(iounmap); 50 - #endif /* CONFIG_GENERIC_IOREMAP */
+3
mm/kfence/core.c
··· 20 20 #include <linux/moduleparam.h> 21 21 #include <linux/random.h> 22 22 #include <linux/rcupdate.h> 23 + #include <linux/sched/clock.h> 23 24 #include <linux/sched/sysctl.h> 24 25 #include <linux/seq_file.h> 25 26 #include <linux/slab.h> ··· 197 196 */ 198 197 track->num_stack_entries = stack_trace_save(track->stack_entries, KFENCE_STACK_DEPTH, 1); 199 198 track->pid = task_pid_nr(current); 199 + track->cpu = raw_smp_processor_id(); 200 + track->ts_nsec = local_clock(); /* Same source as printk timestamps. */ 200 201 201 202 /* 202 203 * Pairs with READ_ONCE() in
+2
mm/kfence/kfence.h
··· 36 36 /* Alloc/free tracking information. */ 37 37 struct kfence_track { 38 38 pid_t pid; 39 + int cpu; 40 + u64 ts_nsec; 39 41 int num_stack_entries; 40 42 unsigned long stack_entries[KFENCE_STACK_DEPTH]; 41 43 };
+3
mm/kfence/kfence_test.c
··· 800 800 unsigned long flags; 801 801 int i; 802 802 803 + if (!__kfence_pool) 804 + return -EINVAL; 805 + 803 806 spin_lock_irqsave(&observed.lock, flags); 804 807 for (i = 0; i < ARRAY_SIZE(observed.lines); i++) 805 808 observed.lines[i][0] = '\0';
+13 -6
mm/kfence/report.c
··· 9 9 10 10 #include <linux/kernel.h> 11 11 #include <linux/lockdep.h> 12 + #include <linux/math.h> 12 13 #include <linux/printk.h> 13 14 #include <linux/sched/debug.h> 14 15 #include <linux/seq_file.h> ··· 101 100 bool show_alloc) 102 101 { 103 102 const struct kfence_track *track = show_alloc ? &meta->alloc_track : &meta->free_track; 103 + u64 ts_sec = track->ts_nsec; 104 + unsigned long rem_nsec = do_div(ts_sec, NSEC_PER_SEC); 105 + 106 + /* Timestamp matches printk timestamp format. */ 107 + seq_con_printf(seq, "%s by task %d on cpu %d at %lu.%06lus:\n", 108 + show_alloc ? "allocated" : "freed", track->pid, 109 + track->cpu, (unsigned long)ts_sec, rem_nsec / 1000); 104 110 105 111 if (track->num_stack_entries) { 106 112 /* Skip allocation/free internals stack. */ ··· 134 126 return; 135 127 } 136 128 137 - seq_con_printf(seq, 138 - "kfence-#%td [0x%p-0x%p" 139 - ", size=%d, cache=%s] allocated by task %d:\n", 140 - meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size, 141 - (cache && cache->name) ? cache->name : "<destroyed>", meta->alloc_track.pid); 129 + seq_con_printf(seq, "kfence-#%td: 0x%p-0x%p, size=%d, cache=%s\n\n", 130 + meta - kfence_metadata, (void *)start, (void *)(start + size - 1), 131 + size, (cache && cache->name) ? cache->name : "<destroyed>"); 132 + 142 133 kfence_print_stack(seq, meta, true); 143 134 144 135 if (meta->state == KFENCE_OBJECT_FREED) { 145 - seq_con_printf(seq, "\nfreed by task %d:\n", meta->free_track.pid); 136 + seq_con_printf(seq, "\n"); 146 137 kfence_print_stack(seq, meta, false); 147 138 } 148 139 }
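The new kfence report lines print the allocation/free timestamp in printk's seconds.microseconds format; in the kernel this is the do_div() on track->ts_nsec in the hunk above. The same split in plain user-space C (illustrative, using ordinary 64-bit division instead of do_div):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Split a local_clock()-style nanosecond timestamp into printk-style
 * seconds and microseconds, as kfence_print_stack() now does. */
static void ts_split(uint64_t ts_nsec, unsigned long *sec, unsigned long *usec)
{
	*sec = (unsigned long)(ts_nsec / NSEC_PER_SEC);
	*usec = (unsigned long)(ts_nsec % NSEC_PER_SEC) / 1000;
}
```

For example, a timestamp of 1234567890123 ns splits into 1234 s and 567890 us, which the report would render as "1234.567890s".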
+1 -1
mm/kmemleak.c
··· 598 598 object->checksum = 0; 599 599 600 600 /* task information */ 601 - if (in_irq()) { 601 + if (in_hardirq()) { 602 602 object->pid = 0; 603 603 strncpy(object->comm, "hardirq", sizeof(object->comm)); 604 604 } else if (in_serving_softirq()) {
+341 -33
mm/memory_hotplug.c
··· 52 52 MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); 53 53 #endif 54 54 55 + enum { 56 + ONLINE_POLICY_CONTIG_ZONES = 0, 57 + ONLINE_POLICY_AUTO_MOVABLE, 58 + }; 59 + 60 + const char *online_policy_to_str[] = { 61 + [ONLINE_POLICY_CONTIG_ZONES] = "contig-zones", 62 + [ONLINE_POLICY_AUTO_MOVABLE] = "auto-movable", 63 + }; 64 + 65 + static int set_online_policy(const char *val, const struct kernel_param *kp) 66 + { 67 + int ret = sysfs_match_string(online_policy_to_str, val); 68 + 69 + if (ret < 0) 70 + return ret; 71 + *((int *)kp->arg) = ret; 72 + return 0; 73 + } 74 + 75 + static int get_online_policy(char *buffer, const struct kernel_param *kp) 76 + { 77 + return sprintf(buffer, "%s\n", online_policy_to_str[*((int *)kp->arg)]); 78 + } 79 + 80 + /* 81 + * memory_hotplug.online_policy: configure online behavior when onlining without 82 + * specifying a zone (MMOP_ONLINE) 83 + * 84 + * "contig-zones": keep zone contiguous 85 + * "auto-movable": online memory to ZONE_MOVABLE if the configuration 86 + * (auto_movable_ratio, auto_movable_numa_aware) allows for it 87 + */ 88 + static int online_policy __read_mostly = ONLINE_POLICY_CONTIG_ZONES; 89 + static const struct kernel_param_ops online_policy_ops = { 90 + .set = set_online_policy, 91 + .get = get_online_policy, 92 + }; 93 + module_param_cb(online_policy, &online_policy_ops, &online_policy, 0644); 94 + MODULE_PARM_DESC(online_policy, 95 + "Set the online policy (\"contig-zones\", \"auto-movable\") " 96 + "Default: \"contig-zones\""); 97 + 98 + /* 99 + * memory_hotplug.auto_movable_ratio: specify maximum MOVABLE:KERNEL ratio 100 + * 101 + * The ratio represents an upper limit and the kernel might decide to not 102 + * online some memory to ZONE_MOVABLE -- e.g., because hotplugged KERNEL memory 103 + * doesn't allow for more MOVABLE memory. 
104 + */ 105 + static unsigned int auto_movable_ratio __read_mostly = 301; 106 + module_param(auto_movable_ratio, uint, 0644); 107 + MODULE_PARM_DESC(auto_movable_ratio, 108 + "Set the maximum ratio of MOVABLE:KERNEL memory in the system " 109 + "in percent for \"auto-movable\" online policy. Default: 301"); 110 + 111 + /* 112 + * memory_hotplug.auto_movable_numa_aware: consider numa node stats 113 + */ 114 + #ifdef CONFIG_NUMA 115 + static bool auto_movable_numa_aware __read_mostly = true; 116 + module_param(auto_movable_numa_aware, bool, 0644); 117 + MODULE_PARM_DESC(auto_movable_numa_aware, 118 + "Consider numa node stats in addition to global stats in " 119 + "\"auto-movable\" online policy. Default: true"); 120 + #endif /* CONFIG_NUMA */ 121 + 55 122 /* 56 123 * online_page_callback contains pointer to current page onlining function. 57 124 * Initially it is generic_online_page(). If it is required it could be ··· 477 410 sizeof(struct page) * cur_nr_pages); 478 411 } 479 412 480 - #ifdef CONFIG_ZONE_DEVICE 481 413 /* 482 414 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So 483 415 * we will not try to shrink the zones - which is okay as 484 416 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way. 
485 417 */ 486 - if (zone_idx(zone) == ZONE_DEVICE) 418 + if (zone_is_zone_device(zone)) 487 419 return; 488 - #endif 489 420 490 421 clear_zone_contiguous(zone); 491 422 ··· 728 663 set_zone_contiguous(zone); 729 664 } 730 665 666 + struct auto_movable_stats { 667 + unsigned long kernel_early_pages; 668 + unsigned long movable_pages; 669 + }; 670 + 671 + static void auto_movable_stats_account_zone(struct auto_movable_stats *stats, 672 + struct zone *zone) 673 + { 674 + if (zone_idx(zone) == ZONE_MOVABLE) { 675 + stats->movable_pages += zone->present_pages; 676 + } else { 677 + stats->kernel_early_pages += zone->present_early_pages; 678 + #ifdef CONFIG_CMA 679 + /* 680 + * CMA pages (never on hotplugged memory) behave like 681 + * ZONE_MOVABLE. 682 + */ 683 + stats->movable_pages += zone->cma_pages; 684 + stats->kernel_early_pages -= zone->cma_pages; 685 + #endif /* CONFIG_CMA */ 686 + } 687 + } 688 + struct auto_movable_group_stats { 689 + unsigned long movable_pages; 690 + unsigned long req_kernel_early_pages; 691 + }; 692 + 693 + static int auto_movable_stats_account_group(struct memory_group *group, 694 + void *arg) 695 + { 696 + const int ratio = READ_ONCE(auto_movable_ratio); 697 + struct auto_movable_group_stats *stats = arg; 698 + long pages; 699 + 700 + /* 701 + * We don't support modifying the config while the auto-movable online 702 + * policy is already enabled. Just avoid the division by zero below. 703 + */ 704 + if (!ratio) 705 + return 0; 706 + 707 + /* 708 + * Calculate how many early kernel pages this group requires to 709 + * satisfy the configured zone ratio. 
710 + */ 711 + pages = group->present_movable_pages * 100 / ratio; 712 + pages -= group->present_kernel_pages; 713 + 714 + if (pages > 0) 715 + stats->req_kernel_early_pages += pages; 716 + stats->movable_pages += group->present_movable_pages; 717 + return 0; 718 + } 719 + 720 + static bool auto_movable_can_online_movable(int nid, struct memory_group *group, 721 + unsigned long nr_pages) 722 + { 723 + unsigned long kernel_early_pages, movable_pages; 724 + struct auto_movable_group_stats group_stats = {}; 725 + struct auto_movable_stats stats = {}; 726 + pg_data_t *pgdat = NODE_DATA(nid); 727 + struct zone *zone; 728 + int i; 729 + 730 + /* Walk all relevant zones and collect MOVABLE vs. KERNEL stats. */ 731 + if (nid == NUMA_NO_NODE) { 732 + /* TODO: cache values */ 733 + for_each_populated_zone(zone) 734 + auto_movable_stats_account_zone(&stats, zone); 735 + } else { 736 + for (i = 0; i < MAX_NR_ZONES; i++) { 737 + zone = pgdat->node_zones + i; 738 + if (populated_zone(zone)) 739 + auto_movable_stats_account_zone(&stats, zone); 740 + } 741 + } 742 + 743 + kernel_early_pages = stats.kernel_early_pages; 744 + movable_pages = stats.movable_pages; 745 + 746 + /* 747 + * Kernel memory inside dynamic memory group allows for more MOVABLE 748 + * memory within the same group. Remove the effect of all but the 749 + * current group from the stats. 750 + */ 751 + walk_dynamic_memory_groups(nid, auto_movable_stats_account_group, 752 + group, &group_stats); 753 + if (kernel_early_pages <= group_stats.req_kernel_early_pages) 754 + return false; 755 + kernel_early_pages -= group_stats.req_kernel_early_pages; 756 + movable_pages -= group_stats.movable_pages; 757 + 758 + if (group && group->is_dynamic) 759 + kernel_early_pages += group->present_kernel_pages; 760 + 761 + /* 762 + * Test if we could online the given number of pages to ZONE_MOVABLE 763 + * and still stay in the configured ratio. 
764 + */ 765 + movable_pages += nr_pages; 766 + return movable_pages <= (auto_movable_ratio * kernel_early_pages) / 100; 767 + } 768 + 731 769 /* 732 770 * Returns a default kernel memory zone for the given pfn range. 733 771 * If no kernel zone covers this pfn range it will automatically go ··· 850 682 } 851 683 852 684 return &pgdat->node_zones[ZONE_NORMAL]; 685 + } 686 + 687 + /* 688 + * Determine to which zone to online memory dynamically based on user 689 + * configuration and system stats. We care about the following ratio: 690 + * 691 + * MOVABLE : KERNEL 692 + * 693 + * Whereby MOVABLE is memory in ZONE_MOVABLE and KERNEL is memory in 694 + * one of the kernel zones. CMA pages inside one of the kernel zones really 695 + * behave like ZONE_MOVABLE, so we treat them accordingly. 696 + * 697 + * We don't allow for hotplugged memory in a KERNEL zone to increase the 698 + * amount of MOVABLE memory we can have, so we end up with: 699 + * 700 + * MOVABLE : KERNEL_EARLY 701 + * 702 + * Whereby KERNEL_EARLY is memory in one of the kernel zones, available since 703 + * boot. We base our calculation on KERNEL_EARLY internally, because: 704 + * 705 + * a) Hotplugged memory in one of the kernel zones can sometimes still get 706 + * hotunplugged, especially when hot(un)plugging individual memory blocks. 707 + * There is no coordination across memory devices, therefore "automatic" 708 + * hotunplugging, as implemented in hypervisors, could result in zone 709 + * imbalances. 710 + * b) Early/boot memory in one of the kernel zones can usually not get 711 + * hotunplugged again (e.g., no firmware interface to unplug, fragmented 712 + * with unmovable allocations). While there are corner cases where it might 713 + * still work, it is barely relevant in practice.
714 + * 715 + * Exceptions are dynamic memory groups, which allow for more MOVABLE 716 + * memory within the same memory group -- because in that case, there is 717 + * coordination within the single memory device managed by a single driver. 718 + * 719 + * We rely on "present pages" instead of "managed pages", as the latter is 720 + * highly unreliable and dynamic in virtualized environments, and does not 721 + * consider boot time allocations. For example, memory ballooning adjusts the 722 + * managed pages when inflating/deflating the balloon, and balloon compaction 723 + * can even migrate inflated pages between zones. 724 + * 725 + * Using "present pages" is better but some things to keep in mind are: 726 + * 727 + * a) Some memblock allocations, such as for the crashkernel area, are 728 + * effectively unused by the kernel, yet they account to "present pages". 729 + * Fortunately, these allocations are comparatively small in relevant setups 730 + * (e.g., fraction of system memory). 731 + * b) Some hotplugged memory blocks in virtualized environments, especially 732 + * hotplugged by virtio-mem, look like they are completely present; however, 733 + * only parts of the memory block are actually currently usable. 734 + * "present pages" is an upper limit that can get reached at runtime. As 735 + * we base our calculations on KERNEL_EARLY, this is not an issue. 736 + */ 737 + static struct zone *auto_movable_zone_for_pfn(int nid, 738 + struct memory_group *group, 739 + unsigned long pfn, 740 + unsigned long nr_pages) 741 + { 742 + unsigned long online_pages = 0, max_pages, end_pfn; 743 + struct page *page; 744 + 745 + if (!auto_movable_ratio) 746 + goto kernel_zone; 747 + 748 + if (group && !group->is_dynamic) { 749 + max_pages = group->s.max_pages; 750 + online_pages = group->present_movable_pages; 751 + 752 + /* If anything is !MOVABLE online the rest !MOVABLE. 
*/ 753 + if (group->present_kernel_pages) 754 + goto kernel_zone; 755 + } else if (!group || group->d.unit_pages == nr_pages) { 756 + max_pages = nr_pages; 757 + } else { 758 + max_pages = group->d.unit_pages; 759 + /* 760 + * Take a look at all online sections in the current unit. 761 + * We can safely assume that all pages within a section belong 762 + * to the same zone, because dynamic memory groups only deal 763 + * with hotplugged memory. 764 + */ 765 + pfn = ALIGN_DOWN(pfn, group->d.unit_pages); 766 + end_pfn = pfn + group->d.unit_pages; 767 + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { 768 + page = pfn_to_online_page(pfn); 769 + if (!page) 770 + continue; 771 + /* If anything is !MOVABLE online the rest !MOVABLE. */ 772 + if (page_zonenum(page) != ZONE_MOVABLE) 773 + goto kernel_zone; 774 + online_pages += PAGES_PER_SECTION; 775 + } 776 + } 777 + 778 + /* 779 + * Online MOVABLE if we could *currently* online all remaining parts 780 + * MOVABLE. We expect to (add+) online them immediately next, so if 781 + * nobody interferes, all will be MOVABLE if possible. 782 + */ 783 + nr_pages = max_pages - online_pages; 784 + if (!auto_movable_can_online_movable(NUMA_NO_NODE, group, nr_pages)) 785 + goto kernel_zone; 786 + 787 + #ifdef CONFIG_NUMA 788 + if (auto_movable_numa_aware && 789 + !auto_movable_can_online_movable(nid, group, nr_pages)) 790 + goto kernel_zone; 791 + #endif /* CONFIG_NUMA */ 792 + 793 + return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; 794 + kernel_zone: 795 + return default_kernel_zone_for_pfn(nid, pfn, nr_pages); 853 796 } 854 797 855 798 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn, ··· 987 708 return movable_node_enabled ? 
movable_zone : kernel_zone; 988 709 } 989 710 990 - struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, 711 + struct zone *zone_for_pfn_range(int online_type, int nid, 712 + struct memory_group *group, unsigned long start_pfn, 991 713 unsigned long nr_pages) 992 714 { 993 715 if (online_type == MMOP_ONLINE_KERNEL) ··· 997 717 if (online_type == MMOP_ONLINE_MOVABLE) 998 718 return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; 999 719 720 + if (online_policy == ONLINE_POLICY_AUTO_MOVABLE) 721 + return auto_movable_zone_for_pfn(nid, group, start_pfn, nr_pages); 722 + 1000 723 return default_zone_for_pfn(nid, start_pfn, nr_pages); 1001 724 } 1002 725 ··· 1007 724 * This function should only be called by memory_block_{online,offline}, 1008 725 * and {online,offline}_pages. 1009 726 */ 1010 - void adjust_present_page_count(struct zone *zone, long nr_pages) 727 + void adjust_present_page_count(struct page *page, struct memory_group *group, 728 + long nr_pages) 1011 729 { 730 + struct zone *zone = page_zone(page); 731 + const bool movable = zone_idx(zone) == ZONE_MOVABLE; 732 + 733 + /* 734 + * We only support onlining/offlining/adding/removing of complete 735 + * memory blocks; therefore, all of it is either early or hotplugged. 
736 + */ 737 + if (early_section(__pfn_to_section(page_to_pfn(page)))) 738 + zone->present_early_pages += nr_pages; 1012 739 zone->present_pages += nr_pages; 1013 740 zone->zone_pgdat->node_present_pages += nr_pages; 741 + 742 + if (group && movable) 743 + group->present_movable_pages += nr_pages; 744 + else if (group && !movable) 745 + group->present_kernel_pages += nr_pages; 1014 746 } 1015 747 1016 748 int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, ··· 1071 773 kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages)); 1072 774 } 1073 775 1074 - int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone) 776 + int __ref online_pages(unsigned long pfn, unsigned long nr_pages, 777 + struct zone *zone, struct memory_group *group) 1075 778 { 1076 779 unsigned long flags; 1077 780 int need_zonelists_rebuild = 0; ··· 1125 826 } 1126 827 1127 828 online_pages_range(pfn, nr_pages); 1128 - adjust_present_page_count(zone, nr_pages); 829 + adjust_present_page_count(pfn_to_page(pfn), group, nr_pages); 1129 830 1130 831 node_states_set_node(nid, &arg); 1131 832 if (need_zonelists_rebuild) ··· 1358 1059 { 1359 1060 struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; 1360 1061 struct vmem_altmap mhp_altmap = {}; 1062 + struct memory_group *group = NULL; 1361 1063 u64 start, size; 1362 1064 bool new_node = false; 1363 1065 int ret; ··· 1369 1069 ret = check_hotplug_memory_range(start, size); 1370 1070 if (ret) 1371 1071 return ret; 1072 + 1073 + if (mhp_flags & MHP_NID_IS_MGID) { 1074 + group = memory_group_find_by_id(nid); 1075 + if (!group) 1076 + return -EINVAL; 1077 + nid = group->nid; 1078 + } 1372 1079 1373 1080 if (!node_possible(nid)) { 1374 1081 WARN(1, "node %d was absent from the node_possible_map\n", nid); ··· 1411 1104 goto error; 1412 1105 1413 1106 /* create memory block devices after memory was added */ 1414 - ret = create_memory_block_devices(start, size, mhp_altmap.alloc); 1107 + ret = 
create_memory_block_devices(start, size, mhp_altmap.alloc, 1108 + group); 1415 1109 if (ret) { 1416 - arch_remove_memory(nid, start, size, NULL); 1110 + arch_remove_memory(start, size, NULL); 1417 1111 goto error; 1418 1112 } 1419 1113 ··· 1606 1298 unsigned long pfn, sec_end_pfn; 1607 1299 struct zone *zone = NULL; 1608 1300 struct page *page; 1609 - int i; 1301 + 1610 1302 for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1); 1611 1303 pfn < end_pfn; 1612 1304 pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) { ··· 1615 1307 continue; 1616 1308 for (; pfn < sec_end_pfn && pfn < end_pfn; 1617 1309 pfn += MAX_ORDER_NR_PAGES) { 1618 - i = 0; 1619 - /* This is just a CONFIG_HOLES_IN_ZONE check.*/ 1620 - while ((i < MAX_ORDER_NR_PAGES) && 1621 - !pfn_valid_within(pfn + i)) 1622 - i++; 1623 - if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn) 1624 - continue; 1625 1310 /* Check if we got outside of the zone */ 1626 - if (zone && !zone_spans_pfn(zone, pfn + i)) 1311 + if (zone && !zone_spans_pfn(zone, pfn)) 1627 1312 return NULL; 1628 - page = pfn_to_page(pfn + i); 1313 + page = pfn_to_page(pfn); 1629 1314 if (zone && page_zone(page) != zone) 1630 1315 return NULL; 1631 1316 zone = page_zone(page); ··· 1869 1568 return 0; 1870 1569 } 1871 1570 1872 - int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) 1571 + int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, 1572 + struct memory_group *group) 1873 1573 { 1874 1574 const unsigned long end_pfn = start_pfn + nr_pages; 1875 1575 unsigned long pfn, system_ram_pages = 0; ··· 2006 1704 2007 1705 /* removal success */ 2008 1706 adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); 2009 - adjust_present_page_count(zone, -nr_pages); 1707 + adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages); 2010 1708 2011 1709 /* reinitialise watermarks and update pcp limits */ 2012 1710 init_per_zone_wmark_min(); ··· 2048 1746 static int 
check_memblock_offlined_cb(struct memory_block *mem, void *arg) 2049 1747 { 2050 1748 int ret = !is_memblock_offlined(mem); 1749 + int *nid = arg; 2051 1750 1751 + *nid = mem->nid; 2052 1752 if (unlikely(ret)) { 2053 1753 phys_addr_t beginpa, endpa; 2054 1754 ··· 2143 1839 } 2144 1840 EXPORT_SYMBOL(try_offline_node); 2145 1841 2146 - static int __ref try_remove_memory(int nid, u64 start, u64 size) 1842 + static int __ref try_remove_memory(u64 start, u64 size) 2147 1843 { 2148 - int rc = 0; 2149 1844 struct vmem_altmap mhp_altmap = {}; 2150 1845 struct vmem_altmap *altmap = NULL; 2151 1846 unsigned long nr_vmemmap_pages; 1847 + int rc = 0, nid = NUMA_NO_NODE; 2152 1848 2153 1849 BUG_ON(check_hotplug_memory_range(start, size)); 2154 1850 ··· 2156 1852 * All memory blocks must be offlined before removing memory. Check 2157 1853 * whether all memory blocks in question are offline and return error 2158 1854 * if this is not the case. 1855 + * 1856 + * While at it, determine the nid. Note that if we'd have mixed nodes, 1857 + * we'd only try to offline the last determined one -- which is good 1858 + * enough for the cases we care about. 
2159 1859 */ 2160 - rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb); 1860 + rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb); 2161 1861 if (rc) 2162 1862 return rc; 2163 1863 ··· 2201 1893 2202 1894 mem_hotplug_begin(); 2203 1895 2204 - arch_remove_memory(nid, start, size, altmap); 1896 + arch_remove_memory(start, size, altmap); 2205 1897 2206 1898 if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { 2207 1899 memblock_free(start, size); ··· 2210 1902 2211 1903 release_mem_region_adjustable(start, size); 2212 1904 2213 - try_offline_node(nid); 1905 + if (nid != NUMA_NO_NODE) 1906 + try_offline_node(nid); 2214 1907 2215 1908 mem_hotplug_done(); 2216 1909 return 0; ··· 2219 1910 2220 1911 /** 2221 1912 * __remove_memory - Remove memory if every memory block is offline 2222 - * @nid: the node ID 2223 1913 * @start: physical address of the region to remove 2224 1914 * @size: size of the region to remove 2225 1915 * ··· 2226 1918 * and online/offline operations before this call, as required by 2227 1919 * try_offline_node(). 
2228 1920 */ 2229 - void __remove_memory(int nid, u64 start, u64 size) 1921 + void __remove_memory(u64 start, u64 size) 2230 1922 { 2231 1923 2232 1924 /* 2233 1925 * trigger BUG() if some memory is not offlined prior to calling this 2234 1926 * function 2235 1927 */ 2236 - if (try_remove_memory(nid, start, size)) 1928 + if (try_remove_memory(start, size)) 2237 1929 BUG(); 2238 1930 } 2239 1931 ··· 2241 1933 * Remove memory if every memory block is offline, otherwise return -EBUSY is 2242 1934 * some memory is not offline 2243 1935 */ 2244 - int remove_memory(int nid, u64 start, u64 size) 1936 + int remove_memory(u64 start, u64 size) 2245 1937 { 2246 1938 int rc; 2247 1939 2248 1940 lock_device_hotplug(); 2249 - rc = try_remove_memory(nid, start, size); 1941 + rc = try_remove_memory(start, size); 2250 1942 unlock_device_hotplug(); 2251 1943 2252 1944 return rc; ··· 2306 1998 * unplugged all memory (so it's no longer in use) and want to offline + remove 2307 1999 * that memory. 2308 2000 */ 2309 - int offline_and_remove_memory(int nid, u64 start, u64 size) 2001 + int offline_and_remove_memory(u64 start, u64 size) 2310 2002 { 2311 2003 const unsigned long mb_count = size / memory_block_size_bytes(); 2312 2004 uint8_t *online_types, *tmp; ··· 2342 2034 * This cannot fail as it cannot get onlined in the meantime. 2343 2035 */ 2344 2036 if (!rc) { 2345 - rc = try_remove_memory(nid, start, size); 2037 + rc = try_remove_memory(start, size); 2346 2038 if (rc) 2347 2039 pr_err("%s: Failed to remove memory: %d", __func__, rc); 2348 2040 }
+1 -4
mm/memremap.c
··· 140 140 { 141 141 struct range *range = &pgmap->ranges[range_id]; 142 142 struct page *first_page; 143 - int nid; 144 143 145 144 /* make sure to access a memmap that was actually initialized */ 146 145 first_page = pfn_to_page(pfn_first(pgmap, range_id)); 147 146 148 147 /* pages are dead and unused, undo the arch mapping */ 149 - nid = page_to_nid(first_page); 150 - 151 148 mem_hotplug_begin(); 152 149 remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(range->start), 153 150 PHYS_PFN(range_len(range))); ··· 152 155 __remove_pages(PHYS_PFN(range->start), 153 156 PHYS_PFN(range_len(range)), NULL); 154 157 } else { 155 - arch_remove_memory(nid, range->start, range_len(range), 158 + arch_remove_memory(range->start, range_len(range), 156 159 pgmap_altmap(pgmap)); 157 160 kasan_remove_zero_shadow(__va(range->start), range_len(range)); 158 161 }
+5 -22
mm/page_alloc.c
··· 594 594 595 595 static int page_is_consistent(struct zone *zone, struct page *page) 596 596 { 597 - if (!pfn_valid_within(page_to_pfn(page))) 598 - return 0; 599 597 if (zone != page_zone(page)) 600 598 return 0; 601 599 ··· 1023 1025 if (order >= MAX_ORDER - 2) 1024 1026 return false; 1025 1027 1026 - if (!pfn_valid_within(buddy_pfn)) 1027 - return false; 1028 - 1029 1028 combined_pfn = buddy_pfn & pfn; 1030 1029 higher_page = page + (combined_pfn - pfn); 1031 1030 buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1); 1032 1031 higher_buddy = higher_page + (buddy_pfn - combined_pfn); 1033 1032 1034 - return pfn_valid_within(buddy_pfn) && 1035 - page_is_buddy(higher_page, higher_buddy, order + 1); 1033 + return page_is_buddy(higher_page, higher_buddy, order + 1); 1036 1034 } 1037 1035 1038 1036 /* ··· 1089 1095 buddy_pfn = __find_buddy_pfn(pfn, order); 1090 1096 buddy = page + (buddy_pfn - pfn); 1091 1097 1092 - if (!pfn_valid_within(buddy_pfn)) 1093 - goto done_merging; 1094 1098 if (!page_is_buddy(page, buddy, order)) 1095 1099 goto done_merging; 1096 1100 /* ··· 1746 1754 /* 1747 1755 * Check that the whole (or subset of) a pageblock given by the interval of 1748 1756 * [start_pfn, end_pfn) is valid and within the same zone, before scanning it 1749 - * with the migration of free compaction scanner. The scanners then need to 1750 - * use only pfn_valid_within() check for arches that allow holes within 1751 - * pageblocks. 1757 + * with the migration of free compaction scanner. 1752 1758 * 1753 1759 * Return struct page pointer of start_pfn, or NULL if checks were not passed. 
1754 1760 * ··· 1862 1872 */ 1863 1873 static inline bool __init deferred_pfn_valid(unsigned long pfn) 1864 1874 { 1865 - if (!pfn_valid_within(pfn)) 1866 - return false; 1867 1875 if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn)) 1868 1876 return false; 1869 1877 return true; ··· 2508 2520 int pages_moved = 0; 2509 2521 2510 2522 for (pfn = start_pfn; pfn <= end_pfn;) { 2511 - if (!pfn_valid_within(pfn)) { 2512 - pfn++; 2513 - continue; 2514 - } 2515 - 2516 2523 page = pfn_to_page(pfn); 2517 2524 if (!PageBuddy(page)) { 2518 2525 /* ··· 7254 7271 zone->zone_start_pfn = 0; 7255 7272 zone->spanned_pages = size; 7256 7273 zone->present_pages = real_size; 7274 + #if defined(CONFIG_MEMORY_HOTPLUG) 7275 + zone->present_early_pages = real_size; 7276 + #endif 7257 7277 7258 7278 totalpages += size; 7259 7279 realtotalpages += real_size; ··· 8814 8828 } 8815 8829 8816 8830 for (; iter < pageblock_nr_pages - offset; iter++) { 8817 - if (!pfn_valid_within(pfn + iter)) 8818 - continue; 8819 - 8820 8831 page = pfn_to_page(pfn + iter); 8821 8832 8822 8833 /*
+11 -1
mm/page_ext.c
··· 58 58 * can utilize this callback to initialize the state of it correctly. 59 59 */ 60 60 61 + #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT) 62 + static bool need_page_idle(void) 63 + { 64 + return true; 65 + } 66 + struct page_ext_operations page_idle_ops = { 67 + .need = need_page_idle, 68 + }; 69 + #endif 70 + 61 71 static struct page_ext_operations *page_ext_ops[] = { 62 72 #ifdef CONFIG_PAGE_OWNER 63 73 &page_owner_ops, 64 74 #endif 65 - #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT) 75 + #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT) 66 76 &page_idle_ops, 67 77 #endif 68 78 };
-10
mm/page_idle.c
··· 207 207 .name = "page_idle", 208 208 }; 209 209 210 - #ifndef CONFIG_64BIT 211 - static bool need_page_idle(void) 212 - { 213 - return true; 214 - } 215 - struct page_ext_operations page_idle_ops = { 216 - .need = need_page_idle, 217 - }; 218 - #endif 219 - 220 210 static int __init page_idle_init(void) 221 211 { 222 212 int err;
+1 -6
mm/page_isolation.c
··· 93 93 buddy_pfn = __find_buddy_pfn(pfn, order); 94 94 buddy = page + (buddy_pfn - pfn); 95 95 96 - if (pfn_valid_within(buddy_pfn) && 97 - !is_migrate_isolate_page(buddy)) { 96 + if (!is_migrate_isolate_page(buddy)) { 98 97 __isolate_free_page(page, order); 99 98 isolated_page = true; 100 99 } ··· 249 250 struct page *page; 250 251 251 252 while (pfn < end_pfn) { 252 - if (!pfn_valid_within(pfn)) { 253 - pfn++; 254 - continue; 255 - } 256 253 page = pfn_to_page(pfn); 257 254 if (PageBuddy(page)) 258 255 /*
+1 -13
mm/page_owner.c
··· 276 276 pageblock_mt = get_pageblock_migratetype(page); 277 277 278 278 for (; pfn < block_end_pfn; pfn++) { 279 - if (!pfn_valid_within(pfn)) 280 - continue; 281 - 282 279 /* The pageblock is online, no need to recheck. */ 283 280 page = pfn_to_page(pfn); 284 281 ··· 476 479 continue; 477 480 } 478 481 479 - /* Check for holes within a MAX_ORDER area */ 480 - if (!pfn_valid_within(pfn)) 481 - continue; 482 - 483 482 page = pfn_to_page(pfn); 484 483 if (PageBuddy(page)) { 485 484 unsigned long freepage_order = buddy_order_unsafe(page); ··· 553 560 block_end_pfn = min(block_end_pfn, end_pfn); 554 561 555 562 for (; pfn < block_end_pfn; pfn++) { 556 - struct page *page; 563 + struct page *page = pfn_to_page(pfn); 557 564 struct page_ext *page_ext; 558 - 559 - if (!pfn_valid_within(pfn)) 560 - continue; 561 - 562 - page = pfn_to_page(pfn); 563 565 564 566 if (page_zone(page) != zone) 565 567 continue;
-1
mm/percpu.c
··· 146 146 147 147 /* the address of the first chunk which starts with the kernel static area */ 148 148 void *pcpu_base_addr __ro_after_init; 149 - EXPORT_SYMBOL_GPL(pcpu_base_addr); 150 149 151 150 static const int *pcpu_unit_map __ro_after_init; /* cpu -> unit */ 152 151 const unsigned long *pcpu_unit_offsets __ro_after_init; /* cpu -> unit offset */
+4 -2
mm/rmap.c
··· 1231 1231 nr_pages); 1232 1232 } else { 1233 1233 if (PageTransCompound(page) && page_mapping(page)) { 1234 + struct page *head = compound_head(page); 1235 + 1234 1236 VM_WARN_ON_ONCE(!PageLocked(page)); 1235 1237 1236 - SetPageDoubleMap(compound_head(page)); 1238 + SetPageDoubleMap(head); 1237 1239 if (PageMlocked(page)) 1238 - clear_page_mlock(compound_head(page)); 1240 + clear_page_mlock(head); 1239 1241 } 1240 1242 if (!atomic_inc_and_test(&page->_mapcount)) 1241 1243 goto out;
+5 -4
mm/secretmem.c
··· 18 18 #include <linux/secretmem.h> 19 19 #include <linux/set_memory.h> 20 20 #include <linux/sched/signal.h> 21 + #include <linux/refcount.h> 21 22 22 23 #include <uapi/linux/magic.h> 23 24 ··· 41 40 MODULE_PARM_DESC(secretmem_enable, 42 41 "Enable secretmem and memfd_secret(2) system call"); 43 42 44 - static atomic_t secretmem_users; 43 + static refcount_t secretmem_users; 45 44 46 45 bool secretmem_active(void) 47 46 { 48 - return !!atomic_read(&secretmem_users); 47 + return !!refcount_read(&secretmem_users); 49 48 } 50 49 51 50 static vm_fault_t secretmem_fault(struct vm_fault *vmf) ··· 104 103 105 104 static int secretmem_release(struct inode *inode, struct file *file) 106 105 { 107 - atomic_dec(&secretmem_users); 106 + refcount_dec(&secretmem_users); 108 107 return 0; 109 108 } 110 109 ··· 218 217 file->f_flags |= O_LARGEFILE; 219 218 220 219 fd_install(fd, file); 221 - atomic_inc(&secretmem_users); 220 + refcount_inc(&secretmem_users); 222 221 return fd; 223 222 224 223 err_put_fd:
+17 -5
mm/vmalloc.c
··· 44 44 #include "internal.h" 45 45 #include "pgalloc-track.h" 46 46 47 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 48 + static unsigned int __ro_after_init ioremap_max_page_shift = BITS_PER_LONG - 1; 49 + 50 + static int __init set_nohugeiomap(char *str) 51 + { 52 + ioremap_max_page_shift = PAGE_SHIFT; 53 + return 0; 54 + } 55 + early_param("nohugeiomap", set_nohugeiomap); 56 + #else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 57 + static const unsigned int ioremap_max_page_shift = PAGE_SHIFT; 58 + #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 59 + 47 60 #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC 48 61 static bool __ro_after_init vmap_allow_huge = true; 49 62 ··· 311 298 return err; 312 299 } 313 300 314 - int vmap_range(unsigned long addr, unsigned long end, 315 - phys_addr_t phys_addr, pgprot_t prot, 316 - unsigned int max_page_shift) 301 + int ioremap_page_range(unsigned long addr, unsigned long end, 302 + phys_addr_t phys_addr, pgprot_t prot) 317 303 { 318 304 int err; 319 305 320 - err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift); 306 + err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot), 307 + ioremap_max_page_shift); 321 308 flush_cache_vmap(addr, end); 322 - 323 309 return err; 324 310 } 325 311
+1 -1
mm/workingset.c
··· 249 249 * @target_memcg: the cgroup that is causing the reclaim 250 250 * @page: the page being evicted 251 251 * 252 - * Returns a shadow entry to be stored in @page->mapping->i_pages in place 252 + * Return: a shadow entry to be stored in @page->mapping->i_pages in place 253 253 * of the evicted @page so that a later refault can be detected. 254 254 */ 255 255 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
+1 -1
scripts/check_extable.sh
··· 4 4 5 5 obj=$1 6 6 7 - file ${obj} | grep -q ELF || (echo "${obj} is not and ELF file." 1>&2 ; exit 0) 7 + file ${obj} | grep -q ELF || (echo "${obj} is not an ELF file." 1>&2 ; exit 0) 8 8 9 9 # Bail out early if there isn't an __ex_table section in this object file. 10 10 objdump -hj __ex_table ${obj} 2> /dev/null > /dev/null
+58 -37
scripts/checkpatch.pl
··· 501 501 our $Hex = qr{(?i)0x[0-9a-f]+$Int_type?}; 502 502 our $Int = qr{[0-9]+$Int_type?}; 503 503 our $Octal = qr{0[0-7]+$Int_type?}; 504 - our $String = qr{"[X\t]*"}; 504 + our $String = qr{(?:\b[Lu])?"[X\t]*"}; 505 505 our $Float_hex = qr{(?i)0x[0-9a-f]+p-?[0-9]+[fl]?}; 506 506 our $Float_dec = qr{(?i)(?:[0-9]+\.[0-9]*|[0-9]*\.[0-9]+)(?:e-?[0-9]+)?[fl]?}; 507 507 our $Float_int = qr{(?i)[0-9]+e-?[0-9]+[fl]?}; ··· 1181 1181 # git log --format='%H %s' -1 $line | 1182 1182 # echo "commit $(cut -c 1-12,41-)" 1183 1183 # done 1184 - } elsif ($lines[0] =~ /^fatal: ambiguous argument '$commit': unknown revision or path not in the working tree\./) { 1184 + } elsif ($lines[0] =~ /^fatal: ambiguous argument '$commit': unknown revision or path not in the working tree\./ || 1185 + $lines[0] =~ /^fatal: bad object $commit/) { 1185 1186 $id = undef; 1186 1187 } else { 1187 1188 $id = substr($lines[0], 0, 12); ··· 2588 2587 my $reported_maintainer_file = 0; 2589 2588 my $non_utf8_charset = 0; 2590 2589 2590 + my $last_git_commit_id_linenr = -1; 2591 + 2591 2592 my $last_blank_line = 0; 2592 2593 my $last_coalesced_string_linenr = -1; 2593 2594 ··· 2912 2909 my ($email_name, $email_comment, $email_address, $comment1) = parse_email($ctx); 2913 2910 my ($author_name, $author_comment, $author_address, $comment2) = parse_email($author); 2914 2911 2915 - if ($email_address eq $author_address && $email_name eq $author_name) { 2912 + if (lc $email_address eq lc $author_address && $email_name eq $author_name) { 2916 2913 $author_sob = $ctx; 2917 2914 $authorsignoff = 2; 2918 - } elsif ($email_address eq $author_address) { 2915 + } elsif (lc $email_address eq lc $author_address) { 2919 2916 $author_sob = $ctx; 2920 2917 $authorsignoff = 3; 2921 2918 } elsif ($email_name eq $author_name) { ··· 3173 3170 } 3174 3171 3175 3172 # Check for git id commit length and improperly formed commit descriptions 3176 - if ($in_commit_log && !$commit_log_possible_stack_dump && 3173 + # A correctly 
formed commit description is: 3174 + # commit <SHA-1 hash length 12+ chars> ("Complete commit subject") 3175 + # with the commit subject '("' prefix and '")' suffix 3176 + # This is a fairly complicated block as it tests for what appears to be 3177 + # bare SHA-1 hash with minimum length of 5. It also avoids several types of 3178 + # possible SHA-1 matches. 3179 + # A commit match can span multiple lines so this block attempts to find a 3180 + # complete typical commit on a maximum of 3 lines 3181 + if ($perl_version_ok && 3182 + $in_commit_log && !$commit_log_possible_stack_dump && 3177 3183 $line !~ /^\s*(?:Link|Patchwork|http|https|BugLink|base-commit):/i && 3178 3184 $line !~ /^This reverts commit [0-9a-f]{7,40}/ && 3179 - ($line =~ /\bcommit\s+[0-9a-f]{5,}\b/i || 3185 + (($line =~ /\bcommit\s+[0-9a-f]{5,}\b/i || 3186 + ($line =~ /\bcommit\s*$/i && defined($rawlines[$linenr]) && $rawlines[$linenr] =~ /^\s*[0-9a-f]{5,}\b/i)) || 3180 3187 ($line =~ /(?:\s|^)[0-9a-f]{12,40}(?:[\s"'\(\[]|$)/i && 3181 3188 $line !~ /[\<\[][0-9a-f]{12,40}[\>\]]/i && 3182 3189 $line !~ /\bfixes:\s*[0-9a-f]{12,40}/i))) { ··· 3196 3183 my $long = 0; 3197 3184 my $case = 1; 3198 3185 my $space = 1; 3199 - my $hasdesc = 0; 3200 - my $hasparens = 0; 3201 3186 my $id = '0123456789ab'; 3202 3187 my $orig_desc = "commit description"; 3203 3188 my $description = ""; 3189 + my $herectx = $herecurr; 3190 + my $has_parens = 0; 3191 + my $has_quotes = 0; 3204 3192 3205 - if ($line =~ /\b(c)ommit\s+([0-9a-f]{5,})\b/i) { 3206 - $init_char = $1; 3207 - $orig_commit = lc($2); 3208 - } elsif ($line =~ /\b([0-9a-f]{12,40})\b/i) { 3209 - $orig_commit = lc($1); 3193 + my $input = $line; 3194 + if ($line =~ /(?:\bcommit\s+[0-9a-f]{5,}|\bcommit\s*$)/i) { 3195 + for (my $n = 0; $n < 2; $n++) { 3196 + if ($input =~ /\bcommit\s+[0-9a-f]{5,}\s*($balanced_parens)/i) { 3197 + $orig_desc = $1; 3198 + $has_parens = 1; 3199 + # Always strip leading/trailing parens then double quotes if existing 3200 + $orig_desc = 
substr($orig_desc, 1, -1); 3201 + if ($orig_desc =~ /^".*"$/) { 3202 + $orig_desc = substr($orig_desc, 1, -1); 3203 + $has_quotes = 1; 3204 + } 3205 + last; 3206 + } 3207 + last if ($#lines < $linenr + $n); 3208 + $input .= " " . trim($rawlines[$linenr + $n]); 3209 + $herectx .= "$rawlines[$linenr + $n]\n"; 3210 + } 3211 + $herectx = $herecurr if (!$has_parens); 3210 3212 } 3211 3213 3212 - $short = 0 if ($line =~ /\bcommit\s+[0-9a-f]{12,40}/i); 3213 - $long = 1 if ($line =~ /\bcommit\s+[0-9a-f]{41,}/i); 3214 - $space = 0 if ($line =~ /\bcommit [0-9a-f]/i); 3215 - $case = 0 if ($line =~ /\b[Cc]ommit\s+[0-9a-f]{5,40}[^A-F]/); 3216 - if ($line =~ /\bcommit\s+[0-9a-f]{5,}\s+\("([^"]+)"\)/i) { 3217 - $orig_desc = $1; 3218 - $hasparens = 1; 3219 - } elsif ($line =~ /\bcommit\s+[0-9a-f]{5,}\s*$/i && 3220 - defined $rawlines[$linenr] && 3221 - $rawlines[$linenr] =~ /^\s*\("([^"]+)"\)/) { 3222 - $orig_desc = $1; 3223 - $hasparens = 1; 3224 - } elsif ($line =~ /\bcommit\s+[0-9a-f]{5,}\s+\("[^"]+$/i && 3225 - defined $rawlines[$linenr] && 3226 - $rawlines[$linenr] =~ /^\s*[^"]+"\)/) { 3227 - $line =~ /\bcommit\s+[0-9a-f]{5,}\s+\("([^"]+)$/i; 3228 - $orig_desc = $1; 3229 - $rawlines[$linenr] =~ /^\s*([^"]+)"\)/; 3230 - $orig_desc .= " " . $1; 3231 - $hasparens = 1;
3214 + if ($input =~ /\b(c)ommit\s+([0-9a-f]{5,})\b/i) { 3215 + $init_char = $1; 3216 + $orig_commit = lc($2); 3217 + $short = 0 if ($input =~ /\bcommit\s+[0-9a-f]{12,40}/i); 3218 + $long = 1 if ($input =~ /\bcommit\s+[0-9a-f]{41,}/i); 3219 + $space = 0 if ($input =~ /\bcommit [0-9a-f]/i); 3220 + $case = 0 if ($input =~ /\b[Cc]ommit\s+[0-9a-f]{5,40}[^A-F]/); 3221 + } elsif ($input =~ /\b([0-9a-f]{12,40})\b/i) { 3222 + $orig_commit = lc($1); 3232 3223 } 3233 3224 3234 3225 ($id, $description) = git_commit_info($orig_commit, 3235 3226 $id, $orig_desc); 3236 3227 3237 3228 if (defined($id) && 3238 - ($short || $long || $space || $case || ($orig_desc ne $description) || !$hasparens)) { 3229 + ($short || $long || $space || $case || ($orig_desc ne $description) || !$has_quotes) && 3230 + $last_git_commit_id_linenr != $linenr - 1) { 3239 3231 ERROR("GIT_COMMIT_ID", 3240 - "Please use git commit description style 'commit <12+ chars of sha1> (\"<title line>\")' - ie: '${init_char}ommit $id (\"$description\")'\n" . $herecurr); 3232 + "Please use git commit description style 'commit <12+ chars of sha1> (\"<title line>\")' - ie: '${init_char}ommit $id (\"$description\")'\n" . $herectx); 3241 3233 } 3234 + #don't report the next line if this line ends in commit and the sha1 hash is the next line 3235 + $last_git_commit_id_linenr = $linenr if ($line =~ /\bcommit\s*$/i); 3242 3236 } 3243 3237 3244 3238 # Check for added, moved or deleted files ··· 6152 6132 } 6153 6133 6154 6134 # concatenated string without spaces between elements 6155 - if ($line =~ /$String[A-Za-z0-9_]/ || $line =~ /[A-Za-z0-9_]$String/) { 6135 + if ($line =~ /$String[A-Z_]/ || 6136 + ($line =~ /([A-Za-z0-9_]+)$String/ && $1 !~ /^[Lu]$/)) { 6156 6137 if (CHK("CONCATENATED_STRING", 6157 6138 "Concatenated strings should use spaces between elements\n" . $herecurr) &&
6158 6139 $fix) { ··· 6166 6145 } 6167 6146 6168 6147 # uncoalesced string fragments 6169 - if ($line =~ /$String\s*"/) { 6148 + if ($line =~ /$String\s*[Lu]?"/) { 6170 6149 if (WARN("STRING_FRAGMENTS", 6171 6150 "Consecutive strings are generally better as a single string\n" . $herecurr) && 6172 6151 $fix) {
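The reworked GIT_COMMIT_ID check first gathers a commit reference that wraps across several raw lines into one `$input` string, then validates it against the `commit <12+ chars of sha1> ("<title line>")` style in a single match. As a rough illustration of the style rule itself (this is not checkpatch's code; the function name and problem strings below are invented for the sketch):

```python
import re

# A commit reference: the word "commit", a hex SHA-1 of at least 5 chars,
# then the quoted title line in parentheses.
COMMIT_RE = re.compile(r'\bcommit\s+([0-9a-f]{5,})\s+\("([^"]+)"\)', re.IGNORECASE)

def check_commit_ref(text):
    """Return a list of style problems, loosely mirroring checkpatch's
    $short/$long/$case/$has_quotes flags. Hypothetical helper for illustration."""
    m = COMMIT_RE.search(text)
    if not m:
        return ['no parenthesised ("title line") after the SHA-1']
    sha, title = m.groups()
    problems = []
    if len(sha) < 12:
        problems.append('SHA-1 shorter than 12 characters')
    if len(sha) > 40:
        problems.append('SHA-1 longer than 40 characters')
    if sha.lower() != sha:
        problems.append('SHA-1 should be lower case')
    return problems
```

The second part of the patch then suppresses a duplicate report when the word "commit" ends one line and the hash starts the next, since both lines would otherwise trip the same check.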
+2 -2
tools/include/linux/bitmap.h
··· 111 111 } 112 112 113 113 /** 114 - * bitmap_alloc - Allocate bitmap 114 + * bitmap_zalloc - Allocate bitmap 115 115 * @nbits: Number of bits 116 116 */ 117 - static inline unsigned long *bitmap_alloc(int nbits) 117 + static inline unsigned long *bitmap_zalloc(int nbits) 118 118 { 119 119 return calloc(1, BITS_TO_LONGS(nbits) * sizeof(unsigned long)); 120 120 }
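The tools/ helper is renamed from bitmap_alloc() to bitmap_zalloc() because its calloc(1, ...) call already returns zeroed memory, so the name now matches the kernel-side convention (and the rest of this series converts every caller). A rough Python model of the sizing and zero-fill semantics, assuming 64-bit longs:

```python
BITS_PER_LONG = 64  # assumed word size; matches typical 64-bit builds

def bits_to_longs(nbits):
    # Mirror of the BITS_TO_LONGS() round-up: whole words needed for nbits bits.
    return (nbits + BITS_PER_LONG - 1) // BITS_PER_LONG

def bitmap_zalloc(nbits):
    # calloc(1, n) semantics: every word starts out zero,
    # which is what the 'z' in the new name advertises.
    return [0] * bits_to_longs(nbits)
```

So a caller asking for 100 bits gets two zeroed 64-bit words and needs no separate memset before testing or setting bits.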
+1 -1
tools/perf/bench/find-bit-bench.c
··· 54 54 55 55 static int do_for_each_set_bit(unsigned int num_bits) 56 56 { 57 - unsigned long *to_test = bitmap_alloc(num_bits); 57 + unsigned long *to_test = bitmap_zalloc(num_bits); 58 58 struct timeval start, end, diff; 59 59 u64 runtime_us; 60 60 struct stats fb_time_stats, tb_time_stats;
+3 -3
tools/perf/builtin-c2c.c
··· 139 139 if (!c2c_he) 140 140 return NULL; 141 141 142 - c2c_he->cpuset = bitmap_alloc(c2c.cpus_cnt); 142 + c2c_he->cpuset = bitmap_zalloc(c2c.cpus_cnt); 143 143 if (!c2c_he->cpuset) 144 144 return NULL; 145 145 146 - c2c_he->nodeset = bitmap_alloc(c2c.nodes_cnt); 146 + c2c_he->nodeset = bitmap_zalloc(c2c.nodes_cnt); 147 147 if (!c2c_he->nodeset) 148 148 return NULL; 149 149 ··· 2047 2047 struct perf_cpu_map *map = n[node].map; 2048 2048 unsigned long *set; 2049 2049 2050 - set = bitmap_alloc(c2c.cpus_cnt); 2050 + set = bitmap_zalloc(c2c.cpus_cnt); 2051 2051 if (!set) 2052 2052 return -ENOMEM; 2053 2053
+1 -1
tools/perf/builtin-record.c
··· 2757 2757 2758 2758 if (rec->opts.affinity != PERF_AFFINITY_SYS) { 2759 2759 rec->affinity_mask.nbits = cpu__max_cpu(); 2760 - rec->affinity_mask.bits = bitmap_alloc(rec->affinity_mask.nbits); 2760 + rec->affinity_mask.bits = bitmap_zalloc(rec->affinity_mask.nbits); 2761 2761 if (!rec->affinity_mask.bits) { 2762 2762 pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits); 2763 2763 err = -ENOMEM;
+1 -1
tools/perf/tests/bitmap.c
··· 14 14 unsigned long *bm = NULL; 15 15 int i; 16 16 17 - bm = bitmap_alloc(nbits); 17 + bm = bitmap_zalloc(nbits); 18 18 19 19 if (map && bm) { 20 20 for (i = 0; i < map->nr; i++)
+1 -1
tools/perf/tests/mem2node.c
··· 27 27 unsigned long *bm = NULL; 28 28 int i; 29 29 30 - bm = bitmap_alloc(nbits); 30 + bm = bitmap_zalloc(nbits); 31 31 32 32 if (map && bm) { 33 33 for (i = 0; i < map->nr; i++) {
+2 -2
tools/perf/util/affinity.c
··· 25 25 { 26 26 int cpu_set_size = get_cpu_set_size(); 27 27 28 - a->orig_cpus = bitmap_alloc(cpu_set_size * 8); 28 + a->orig_cpus = bitmap_zalloc(cpu_set_size * 8); 29 29 if (!a->orig_cpus) 30 30 return -1; 31 31 sched_getaffinity(0, cpu_set_size, (cpu_set_t *)a->orig_cpus); 32 - a->sched_cpus = bitmap_alloc(cpu_set_size * 8); 32 + a->sched_cpus = bitmap_zalloc(cpu_set_size * 8); 33 33 if (!a->sched_cpus) { 34 34 zfree(&a->orig_cpus); 35 35 return -1;
+2 -2
tools/perf/util/header.c
··· 278 278 if (ret) 279 279 return ret; 280 280 281 - set = bitmap_alloc(size); 281 + set = bitmap_zalloc(size); 282 282 if (!set) 283 283 return -ENOMEM; 284 284 ··· 1294 1294 1295 1295 size++; 1296 1296 1297 - n->set = bitmap_alloc(size); 1297 + n->set = bitmap_zalloc(size); 1298 1298 if (!n->set) { 1299 1299 closedir(dir); 1300 1300 return -ENOMEM;
+1 -1
tools/perf/util/metricgroup.c
··· 313 313 struct evsel *evsel, *tmp; 314 314 unsigned long *evlist_used; 315 315 316 - evlist_used = bitmap_alloc(perf_evlist->core.nr_entries); 316 + evlist_used = bitmap_zalloc(perf_evlist->core.nr_entries); 317 317 if (!evlist_used) 318 318 return -ENOMEM; 319 319
+2 -2
tools/perf/util/mmap.c
··· 106 106 data = map->aio.data[idx]; 107 107 mmap_len = mmap__mmap_len(map); 108 108 node_index = cpu__get_node(cpu); 109 - node_mask = bitmap_alloc(node_index + 1); 109 + node_mask = bitmap_zalloc(node_index + 1); 110 110 if (!node_mask) { 111 111 pr_err("Failed to allocate node mask for mbind: error %m\n"); 112 112 return -1; ··· 258 258 static int perf_mmap__setup_affinity_mask(struct mmap *map, struct mmap_params *mp) 259 259 { 260 260 map->affinity_mask.nbits = cpu__max_cpu(); 261 - map->affinity_mask.bits = bitmap_alloc(map->affinity_mask.nbits); 261 + map->affinity_mask.bits = bitmap_zalloc(map->affinity_mask.nbits); 262 262 if (!map->affinity_mask.bits) 263 263 return -1; 264 264
+7
tools/testing/selftests/damon/Makefile
··· 1 + # SPDX-License-Identifier: GPL-2.0 2 + # Makefile for damon selftests 3 + 4 + TEST_FILES = _chk_dependency.sh 5 + TEST_PROGS = debugfs_attrs.sh 6 + 7 + include ../lib.mk
+28
tools/testing/selftests/damon/_chk_dependency.sh
··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + # Kselftest framework requirement - SKIP code is 4. 5 + ksft_skip=4 6 + 7 + DBGFS=/sys/kernel/debug/damon 8 + 9 + if [ $EUID -ne 0 ]; 10 + then 11 + echo "Run as root" 12 + exit $ksft_skip 13 + fi 14 + 15 + if [ ! -d "$DBGFS" ] 16 + then 17 + echo "$DBGFS not found" 18 + exit $ksft_skip 19 + fi 20 + 21 + for f in attrs target_ids monitor_on 22 + do 23 + if [ ! -f "$DBGFS/$f" ] 24 + then 25 + echo "$f not found" 26 + exit 1 27 + fi 28 + done
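The dependency script leans on the kselftest convention that exit code 4 means "skipped": the test is skipped (rather than failed) when it is not running as root or when DAMON's debugfs directory is absent, and fails only if DAMON is present but a required file is missing. A hedged Python sketch of the same gating logic (the function name is invented for illustration):

```python
import os

KSFT_SKIP = 4  # kselftest framework convention: exit code 4 means "skipped"

def check_dependencies(debugfs="/sys/kernel/debug/damon"):
    """Return an exit code in the kselftest style.

    0 = dependencies met; 4 = skip (not root, or DAMON debugfs absent);
    1 = DAMON present but a required debugfs file is missing.
    """
    if os.geteuid() != 0:
        return KSFT_SKIP
    if not os.path.isdir(debugfs):
        return KSFT_SKIP
    for f in ("attrs", "target_ids", "monitor_on"):
        if not os.path.isfile(os.path.join(debugfs, f)):
            return 1
    return 0
```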
+75
tools/testing/selftests/damon/debugfs_attrs.sh
··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + test_write_result() { 5 + file=$1 6 + content=$2 7 + orig_content=$3 8 + expect_reason=$4 9 + expected=$5 10 + 11 + echo "$content" > "$file" 12 + if [ $? -ne "$expected" ] 13 + then 14 + echo "writing $content to $file doesn't return $expected" 15 + echo "expected because: $expect_reason" 16 + echo "$orig_content" > "$file" 17 + exit 1 18 + fi 19 + } 20 + 21 + test_write_succ() { 22 + test_write_result "$1" "$2" "$3" "$4" 0 23 + } 24 + 25 + test_write_fail() { 26 + test_write_result "$1" "$2" "$3" "$4" 1 27 + } 28 + 29 + test_content() { 30 + file=$1 31 + orig_content=$2 32 + expected=$3 33 + expect_reason=$4 34 + 35 + content=$(cat "$file") 36 + if [ "$content" != "$expected" ] 37 + then 38 + echo "reading $file expected $expected but $content" 39 + echo "expected because: $expect_reason" 40 + echo "$orig_content" > "$file" 41 + exit 1 42 + fi 43 + } 44 + 45 + source ./_chk_dependency.sh 46 + 47 + # Test attrs file 48 + # =============== 49 + 50 + file="$DBGFS/attrs" 51 + orig_content=$(cat "$file") 52 + 53 + test_write_succ "$file" "1 2 3 4 5" "$orig_content" "valid input" 54 + test_write_fail "$file" "1 2 3 4" "$orig_content" "no enough fields" 55 + test_write_fail "$file" "1 2 3 5 4" "$orig_content" \ 56 + "min_nr_regions > max_nr_regions" 57 + test_content "$file" "$orig_content" "1 2 3 4 5" "successfully written" 58 + echo "$orig_content" > "$file" 59 + 60 + # Test target_ids file 61 + # ==================== 62 + 63 + file="$DBGFS/target_ids" 64 + orig_content=$(cat "$file") 65 + 66 + test_write_succ "$file" "1 2 3 4" "$orig_content" "valid input" 67 + test_write_succ "$file" "1 2 abc 4" "$orig_content" "still valid input" 68 + test_content "$file" "$orig_content" "1 2" "non-integer was there" 69 + test_write_succ "$file" "abc 2 3" "$orig_content" "the file allows wrong input" 70 + test_content "$file" "$orig_content" "" "wrong input written" 71 + test_write_succ "$file" "" "$orig_content" "empty input"
72 + test_content "$file" "$orig_content" "" "empty input written" 73 + echo "$orig_content" > "$file" 74 + 75 + echo "PASS"
+1 -1
tools/testing/selftests/kvm/dirty_log_perf_test.c
··· 171 171 guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm); 172 172 guest_num_pages = vm_adjust_num_guest_pages(mode, guest_num_pages); 173 173 host_num_pages = vm_num_host_pages(mode, guest_num_pages); 174 - bmap = bitmap_alloc(host_num_pages); 174 + bmap = bitmap_zalloc(host_num_pages); 175 175 176 176 if (dirty_log_manual_caps) { 177 177 cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
+2 -2
tools/testing/selftests/kvm/dirty_log_test.c
··· 749 749 750 750 pr_info("guest physical test memory offset: 0x%lx\n", guest_test_phys_mem); 751 751 752 - bmap = bitmap_alloc(host_num_pages); 753 - host_bmap_track = bitmap_alloc(host_num_pages); 752 + bmap = bitmap_zalloc(host_num_pages); 753 + host_bmap_track = bitmap_zalloc(host_num_pages); 754 754 755 755 /* Add an extra memory slot for testing dirty logging */ 756 756 vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+1 -1
tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c
··· 111 111 nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096); 112 112 nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096); 113 113 114 - bmap = bitmap_alloc(TEST_MEM_PAGES); 114 + bmap = bitmap_zalloc(TEST_MEM_PAGES); 115 115 host_test_mem = addr_gpa2hva(vm, GUEST_TEST_MEM); 116 116 117 117 while (!done) {
+1 -1
tools/testing/selftests/memfd/memfd_test.c
··· 56 56 57 57 static int mfd_assert_reopen_fd(int fd_in) 58 58 { 59 - int r, fd; 59 + int fd; 60 60 char path[100]; 61 61 62 62 sprintf(path, "/proc/self/fd/%d", fd_in);