Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cxl: docs/allocation/page-allocator

Document some interesting interactions that occur when exposing CXL
memory capacity to the page allocator.

Signed-off-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20250512162134.3596150-15-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Authored by Gregory Price; committed by Dave Jiang (419dc40b, 78ab6751)

Documentation/driver-api/cxl/allocation/page-allocator.rst
.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
placed in.

This section mostly focuses on how these configurations affect the page
allocator (as of Linux v6.15) rather than the overall page allocator behavior.

NUMA nodes and mempolicy
========================
Unless a task explicitly registers a mempolicy, the default memory policy
of the Linux kernel is to allocate memory from the `local NUMA node` first,
and fall back to other nodes only if the local node is pressured.

Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local. Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.


Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator attempts to allocate from the highest
available and compatible ZONE for an allocation from the local node first.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
will fall back to allocating from :code:`ZONE_NORMAL`.
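The zone preference and fallback described above can be sketched as a toy
model in Python. This is illustrative only, not kernel code; the zone names
match the kernel's, but the free-page counts and the :code:`allocate` helper
are hypothetical:

```python
# Toy model (not kernel code) of the zone selection described above.
# Movable-capable allocations prefer ZONE_MOVABLE and fall back to
# ZONE_NORMAL; GFP_KERNEL allocations may only use ZONE_NORMAL.

def allowed_zones(gfp_kernel):
    """Zone preference order for an allocation, highest first."""
    if gfp_kernel:
        return ["ZONE_NORMAL"]              # kernel pages are not migratable
    return ["ZONE_MOVABLE", "ZONE_NORMAL"]  # prefer MOVABLE, then fall back

def allocate(free_pages, gfp_kernel=False):
    """free_pages: dict mapping zone name -> free page count."""
    for zone in allowed_zones(gfp_kernel):
        if free_pages.get(zone, 0) > 0:
            free_pages[zone] -= 1
            return zone
    return None  # no eligible zone: reclaim (or OOM) territory

node = {"ZONE_NORMAL": 2, "ZONE_MOVABLE": 1}
print(allocate(node))                   # ZONE_MOVABLE is preferred first
print(allocate(node))                   # MOVABLE depleted -> ZONE_NORMAL
print(allocate(node, gfp_kernel=True))  # GFP_KERNEL never touches MOVABLE
```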


Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration - all onlined in
:code:`ZONE_MOVABLE`.

Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::

  for (each zone in local_node):

    for (each node in fallback_order):

      attempt_allocation(gfp_flags);

Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation. As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.

This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
capacity, then when that capacity is depleted, the page allocator will actually
prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

We may wish to invert this priority in future Linux versions.

If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
when the DRAM nodes are depleted. See the reclaim section for more details.


CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), :code:`cpusets.mems_allowed` may be used by
containers to limit the accessibility of certain NUMA nodes for tasks in that
container. Users may wish to utilize this in multi-tenant systems where some
tasks prefer not to use slower memory.
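The effect of a cpuset restriction on node selection can be sketched as a toy
model. This is not kernel code; the node numbering and the
:code:`candidate_nodes` helper are hypothetical, assuming node 0 is local DRAM
and node 1 is CXL memory:

```python
# Toy model (not kernel code): a cpuset's allowed-memory mask restricts
# which NUMA nodes the allocator may consider for tasks in a container.
# Node numbering is hypothetical: node 0 = local DRAM, node 1 = CXL.

def candidate_nodes(fallback_order, mems_allowed):
    """Nodes the allocator may use for this task, in fallback order."""
    return [node for node in fallback_order if node in mems_allowed]

fallback_order = [0, 1]  # local DRAM first, then CXL

# A latency-sensitive tenant excludes the CXL node entirely:
print(candidate_nodes(fallback_order, {0}))     # [0]

# An unconstrained tenant may fall back to CXL under pressure:
print(candidate_nodes(fallback_order, {0, 1}))  # [0, 1]
```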

In the reclaim section we'll discuss some limitations of this interface when
it comes to preventing demotions of shared data to CXL memory (if demotions
are enabled).
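The zonelist behavior from the `Zone and Node Quirks` section above can also
be sketched as a toy model. This is illustrative only, not kernel code; the
node/zone layout and the :code:`build_zonelist` helper are hypothetical
assumptions:

```python
# Toy model (not kernel code) of the zonelist iteration from the quirks
# section: the allocator walks the zone types present on the local node
# and, for each, the nodes in fallback order. A zone type that exists
# only on a remote node is therefore never considered directly.

def build_zonelist(local_zones, fallback_order, node_zones):
    """Return (node, zone) pairs in the order the allocator tries them."""
    zonelist = []
    for zone in local_zones:         # zones on the local node, highest first
        for node in fallback_order:  # nodes in fallback order
            if zone in node_zones[node]:
                zonelist.append((node, zone))
    return zonelist

node_zones = {
    0: ["ZONE_NORMAL"],   # local DRAM: everything in ZONE_NORMAL
    1: ["ZONE_MOVABLE"],  # CXL: everything in ZONE_MOVABLE
}

# Local node 0 has no ZONE_MOVABLE, so the CXL node never appears and
# its capacity is only reachable via demotion in the reclaim path:
print(build_zonelist(node_zones[0], [0, 1], node_zones))
# [(0, 'ZONE_NORMAL')]

# If the DRAM node also had ZONE_MOVABLE capacity, CXL ZONE_MOVABLE
# would be preferred over DRAM ZONE_NORMAL once DRAM MOVABLE depletes:
node_zones[0] = ["ZONE_MOVABLE", "ZONE_NORMAL"]
print(build_zonelist(node_zones[0], [0, 1], node_zones))
# [(0, 'ZONE_MOVABLE'), (1, 'ZONE_MOVABLE'), (0, 'ZONE_NORMAL')]
```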
Documentation/driver-api/cxl/index.rst
 :caption: Memory Allocation

 allocation/dax
+allocation/page-allocator

 .. only:: subproject and html