
Document Linux Memory Policy

I couldn't find any memory policy documentation in the Documentation
directory, so here is my attempt to document it.

There's lots more that could be written about the internal design--including
data structures, functions, etc. However, if you agree that this is better
than the nothing that exists now, perhaps it could be merged. This will
provide a baseline for updates to document the many policy patches that are
currently being worked on.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Acked-by: Rob Landley <rob@landley.net>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Lee Schermerhorn, committed by Linus Torvalds
42b88e6a 88ae704c

Documentation/vm/numa_memory_policy.txt | +332
What is Linux Memory Policy?

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004.  This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets (Documentation/cpusets.txt),
an administrative mechanism for restricting the nodes from which memory may be
allocated by a set of processes.  Memory policies are a programming interface
that a NUMA-aware application can take advantage of.  When both cpusets and
policies are applied to a task, the restrictions of the cpuset take priority.
See "MEMORY POLICIES AND CPUSETS" below for more details.

MEMORY POLICY CONCEPTS

Scope of Memory Policies

The Linux kernel supports _scopes_ of memory policy, described here from
most general to most specific:

    System Default Policy:  this policy is "hard coded" into the kernel.  It
    is the policy that governs all page allocations that aren't controlled
    by one of the more specific policy scopes discussed below.  When the
    system is "up and running", the system default policy will use "local
    allocation" described below.  However, during boot up, the system
    default policy will be set to interleave allocations across all nodes
    with "sufficient" memory, so as not to overload the initial boot node
    with boot-time allocations.

    Task/Process Policy:  this is an optional, per-task policy.  When defined
    for a specific task, this policy controls all page allocations made by or
    on behalf of the task that aren't controlled by a more specific scope.
    If a task does not define a task policy, then all page allocations that
    would have been controlled by the task policy "fall back" to the System
    Default Policy.

        The task policy applies to the entire address space of a task.  Thus,
        it is inheritable, and indeed is inherited, across both fork()
        [clone() w/o the CLONE_VM flag] and exec*().  This allows a parent
        task to establish the task policy for a child task exec()'d from an
        executable image that has no awareness of memory policy, as sketched
        below.  See the MEMORY POLICY APIs section, below, for an overview of
        the system call that a task may use to set/change its task/process
        policy.

        In a multi-threaded task, task policies apply only to the thread
        [Linux kernel task] that installs the policy and any threads
        subsequently created by that thread.  Any sibling threads existing
        at the time a new task policy is installed retain their current
        policy.

        A task policy applies only to pages allocated after the policy is
        installed.  Any pages already faulted in by the task when the task
        changes its task policy remain where they were allocated based on
        the policy at the time they were allocated.
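        For example, the following minimal sketch shows a launcher that
        installs an interleave task policy and then exec()s a policy-unaware
        program, which inherits the policy.  It assumes the set_mempolicy()
        wrapper and MPOL_* constants from the <numaif.h> header shipped in
        the separate numactl/libnuma package (see the MEMORY POLICY APIs
        section, below), and it assumes that nodes 0 and 1 exist:

            /* launcher.c -- illustrative only; build against the
             * numactl/libnuma development package, link with -lnuma. */
            #include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE */
            #include <unistd.h>
            #include <stdio.h>

            int main(int argc, char *argv[])
            {
                unsigned long nodemask = 0x3;   /* bit mask: nodes 0 and 1 */

                if (argc < 2) {
                    fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
                    return 1;
                }

                /* Install the task policy; it survives the exec*() below,
                 * so the exec()'d program need not be NUMA-aware. */
                if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                                  8 * sizeof(nodemask)) != 0) {
                    perror("set_mempolicy");
                    return 1;
                }

                execvp(argv[1], &argv[1]);
                perror("execvp");   /* only reached if exec fails */
                return 1;
            }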
    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a
    task's virtual address space.  A task may define a specific policy for
    a range of its virtual address space.  See the MEMORY POLICY APIs
    section, below, for an overview of the mbind() system call used to set
    a VMA policy.

    A VMA policy will govern the allocation of pages that back this region of
    the address space.  Any regions of the task's address space that don't
    have an explicit VMA policy will fall back to the task policy, which may
    itself fall back to the System Default Policy.

    VMA policies have a few complicating details:

        VMA policy applies ONLY to anonymous pages.  These include pages
        allocated for anonymous segments, such as the task stack and heap, and
        any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
        If a VMA policy is applied to a file mapping, it will be ignored if
        the mapping used the MAP_SHARED flag.  If the file mapping used the
        MAP_PRIVATE flag, the VMA policy will only be applied when an
        anonymous page is allocated on an attempt to write to the mapping--
        i.e., at Copy-On-Write.

        VMA policies are shared between all tasks that share a virtual address
        space--a.k.a. threads--independent of when the policy is installed; and
        they are inherited across fork().  However, because VMA policies refer
        to a specific region of a task's address space, and because the address
        space is discarded and recreated on exec*(), VMA policies are NOT
        inheritable across exec().  Thus, only NUMA-aware applications may
        use VMA policies.

        A task may install a new VMA policy on a sub-range of a previously
        mmap()ed region.  When this happens, Linux splits the existing virtual
        memory area into 2 or 3 VMAs, each with its own policy.

        By default, a VMA policy applies only to pages allocated after the
        policy is installed.  Any pages already faulted into the VMA range
        remain where they were allocated based on the policy at the time they
        were allocated.  However, since 2.6.16, Linux supports page migration
        via the mbind() system call, so that page contents can be moved to
        match a newly installed policy.

    Shared Policy:  Conceptually, shared policies apply to "memory objects"
    mapped shared into one or more tasks' distinct address spaces.  An
    application installs a shared policy the same way as a VMA policy--using
    the mbind() system call specifying a range of virtual addresses that map
    the shared object.  However, unlike VMA policies, which can be considered
    to be an attribute of a range of a task's address space, shared policies
    apply directly to the shared object.  Thus, all tasks that attach to the
    object share the policy, and all pages allocated for the shared object,
    by any task, will obey the shared policy.

        As of 2.6.22, only shared memory segments, created by shmget() or
        mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
        policy support was added to Linux, the associated data structures were
        added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
        support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
        shmem segments were never "hooked up" to the shared policy support.
        Although hugetlbfs segments now support lazy allocation, their support
        for shared policy has not been completed.

        As mentioned above [re: VMA policies], allocations of page cache
        pages for regular files mmap()ed with MAP_SHARED ignore any VMA
        policy installed on the virtual address range backed by the shared
        file mapping.  Rather, shared page cache pages, including pages
        backing private mappings that have not yet been written by the task,
        follow task policy, if any, else System Default Policy.

        The shared policy infrastructure supports different policies on subset
        ranges of the shared object.  However, Linux still splits the VMA of
        the task that installs the policy for each range of distinct policy.
        Thus, different tasks that attach to a shared memory segment can have
        different VMA configurations mapping that one shared object.  This
        can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
        a shared memory region, when one task has installed shared policy on
        one or more ranges of the region.
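        As an illustration, the hedged sketch below creates a shared memory
        segment and installs an interleaved shared policy on it via mbind(),
        so that pages faulted in by any attached task obey the policy.  It
        assumes <numaif.h> from the numactl/libnuma package and that nodes 0
        and 1 exist:

            /* shared_policy.c -- illustrative only; link with -lnuma. */
            #include <numaif.h>     /* mbind(), MPOL_INTERLEAVE */
            #include <sys/ipc.h>
            #include <sys/shm.h>
            #include <stdio.h>

            #define SEG_SIZE    (16UL * 1024 * 1024)

            int main(void)
            {
                unsigned long nodemask = 0x3;   /* nodes 0 and 1 */
                int shmid;
                void *seg;

                shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
                if (shmid < 0) {
                    perror("shmget");
                    return 1;
                }
                seg = shmat(shmid, NULL, 0);
                if (seg == (void *)-1) {
                    perror("shmat");
                    return 1;
                }

                /* Unlike a VMA policy, this attaches to the segment itself:
                 * pages allocated for the segment by *any* attached task
                 * will be interleaved across nodes 0 and 1. */
                if (mbind(seg, SEG_SIZE, MPOL_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask), 0) != 0) {
                    perror("mbind");
                    return 1;
                }
                return 0;
            }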
Components of Memory Policies

A Linux memory policy is a tuple consisting of a "mode" and an optional set
of nodes.  The mode determines the behavior of the policy, while the
optional set of nodes can be viewed as the arguments to the behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy.  Details of this structure will be discussed
in context, below, as required to explain the behavior.

Note:  in some functions AND in the struct mempolicy itself, the mode
is called "policy".  However, to avoid confusion with the policy tuple,
this document will continue to use the term "mode".

Linux memory policy supports the following 4 behavioral modes:

    Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
    context or scope dependent.

        As mentioned in the Policy Scope section above, during normal
        system operation, the System Default Policy is hard coded to
        contain the Default mode.

        In this context, default mode means "local" allocation--that is,
        attempt to allocate the page from the node associated with the cpu
        where the fault occurs.  If the "local" node has no memory, or the
        node's memory is exhausted [no free pages available], local
        allocation will "fall back to"--attempt to allocate pages from--
        "nearby" nodes, in order of increasing "distance".

            Implementation detail -- subject to change:  "Fallback" uses
            a per node list of sibling nodes--called zonelists--built at
            boot time, or when nodes or memory are added or removed from
            the system [memory hotplug].  These per node zonelists are
            constructed with nodes in order of increasing distance based
            on information provided by the platform firmware.

        When a task/process policy or a shared policy contains the Default
        mode, this also means "local allocation", as described above.

        In the context of a VMA, Default mode means "fall back to task
        policy"--which may or may not specify Default mode.  Thus, Default
        mode cannot be counted on to mean local allocation when used
        on a non-shared region of the address space.  However, see
        MPOL_PREFERRED below.

        The Default mode does not use the optional set of nodes.

    MPOL_BIND:  This mode specifies that memory must come from the
    set of nodes specified by the policy.

        The memory policy APIs do not specify an order in which the nodes
        will be searched.  However, unlike "local allocation", the Bind
        policy does not consider the distance between the nodes.  Rather,
        allocations will fall back to the nodes specified by the policy in
        order of numeric node id.  Like everything in Linux, this is subject
        to change.

    MPOL_PREFERRED:  This mode specifies that the allocation should be
    attempted from the single node specified in the policy.  If that
    allocation fails, the kernel will search other nodes, exactly as
    it would for a local allocation that started at the preferred node,
    in increasing distance from the preferred node.  "Local" allocation
    policy can be viewed as a Preferred policy that starts at the node
    containing the cpu where the allocation takes place.

        Internally, the Preferred policy uses a single node--the
        preferred_node member of struct mempolicy.  A "distinguished
        value" of this preferred_node, currently '-1', is interpreted
        as "the node containing the cpu where the allocation takes
        place"--local allocation.  This is the way to specify
        local allocation for a specific range of addresses--i.e. for
        VMA policies.

    MPOL_INTERLEAVED:  This mode specifies that page allocations be
    interleaved, on a page granularity, across the nodes specified in
    the policy.  This mode also behaves slightly differently, based on
    the context where it is used; both cases are illustrated in the
    sketch after this list:

        For allocation of anonymous pages and shared memory pages,
        Interleave mode indexes the set of nodes specified by the policy
        using the page offset of the faulting address into the segment
        [VMA] containing the address, modulo the number of nodes specified
        by the policy.  It then attempts to allocate a page, starting at
        the selected node, as if the node had been specified by a Preferred
        policy or had been selected by a local allocation.  That is,
        allocation will follow the per node zonelist.

        For allocation of page cache pages, Interleave mode indexes the set
        of nodes specified by the policy using a node counter maintained
        per task.  This counter wraps around to the lowest specified node
        after it reaches the highest specified node.  This will tend to
        spread the pages out over the nodes specified by the policy based
        on the order in which they are allocated, rather than based on any
        page offset into an address range or file.  During system boot up,
        the temporary interleaved system default policy works in this
        mode.
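    The following user-space sketch (not kernel code) models the two
    node-selection rules just described.  The helper names and the
    ascending nodes[] array are illustrative assumptions, not kernel
    interfaces:

        /* Illustrative model of Interleave mode node selection.  'nodes'
         * holds the policy's node ids in ascending order. */

        /* Anonymous and shared memory pages:  the page offset of the
         * faulting address within the VMA selects the node. */
        static int interleave_node_anon(const int *nodes, int nr_nodes,
                                        unsigned long vma_start,
                                        unsigned long fault_addr,
                                        unsigned long page_size)
        {
            unsigned long pgoff = (fault_addr - vma_start) / page_size;

            return nodes[pgoff % nr_nodes];
        }

        /* Page cache pages:  a per-task counter selects the node,
         * wrapping back to the lowest node after the highest. */
        static int interleave_node_pagecache(const int *nodes, int nr_nodes,
                                             unsigned long *task_counter)
        {
            return nodes[(*task_counter)++ % nr_nodes];
        }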
MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

        Note:  the headers that define these APIs and the parameter data types
        for user space applications reside in a package that is not part of
        the Linux kernel.  The kernel system call interfaces, with the 'sys_'
        prefix, are defined in <linux/syscalls.h>; the mode and flag
        definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy:

        long set_mempolicy(int mode, const unsigned long *nmask,
                           unsigned long maxnode);

        Sets the calling task's "task/process memory policy" to the mode
        specified by the 'mode' argument and the set of nodes defined by
        'nmask'.  'nmask' points to a bit mask of node ids containing at
        least 'maxnode' ids.

        See the set_mempolicy(2) man page for more details.
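        For instance, this hedged snippet (using <numaif.h> from the
        separate package noted above; node 2 is an assumed-valid node id)
        sets a Preferred policy, then shows the empty node set form, which
        requests the "local allocation" behavior described under
        MPOL_PREFERRED above:

            #include <numaif.h>     /* set_mempolicy(), MPOL_* modes */
            #include <stdio.h>

            int main(void)
            {
                unsigned long nodemask = 1UL << 2;  /* node 2, assumed valid */

                /* Future allocations by this thread try node 2 first. */
                if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                                  8 * sizeof(nodemask)))
                    perror("set_mempolicy(MPOL_PREFERRED, node 2)");

                /* An empty node set selects the distinguished "local
                 * allocation" value described under MPOL_PREFERRED. */
                if (set_mempolicy(MPOL_PREFERRED, NULL, 0))
                    perror("set_mempolicy(MPOL_PREFERRED, local)");

                return 0;
            }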
Get [Task] Memory Policy or Related Information

        long get_mempolicy(int *mode,
                           unsigned long *nmask, unsigned long maxnode,
                           void *addr, int flags);

        Queries the "task/process memory policy" of the calling task, or
        the policy or location of a specified virtual address, depending
        on the 'flags' argument.  Note that 'nmask' is an output parameter:
        the kernel fills it in with the node mask of the queried policy.

        See the get_mempolicy(2) man page for more details.

Install VMA/Shared Policy for a Range of Task's Address Space

        long mbind(void *start, unsigned long len, int mode,
                   const unsigned long *nmask, unsigned long maxnode,
                   unsigned flags);

        mbind() installs the policy specified by (mode, nmask, maxnode) as
        a VMA policy for the range of the calling task's address space
        specified by the 'start' and 'len' arguments.  Additional actions
        may be requested via the 'flags' argument.

        See the mbind(2) man page for more details.
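        To tie the two calls together, here is a hedged sketch that installs
        a Bind VMA policy on an anonymous mmap() region and reads it back
        via get_mempolicy() with the MPOL_F_ADDR flag.  It assumes
        <numaif.h> from the separate package and that nodes 0 and 1 exist:

            #include <numaif.h>     /* mbind(), get_mempolicy(), MPOL_* */
            #include <sys/mman.h>
            #include <stdio.h>

            #define LEN     (8UL * 1024 * 1024)

            int main(void)
            {
                unsigned long nodemask = 0x3;   /* nodes 0 and 1 */
                int mode;
                void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
                }

                /* Install a VMA policy on the new anonymous region. */
                if (mbind(p, LEN, MPOL_BIND, &nodemask,
                          8 * sizeof(nodemask), 0)) {
                    perror("mbind");
                    return 1;
                }

                /* MPOL_F_ADDR asks for the policy governing 'p' rather
                 * than the calling task's policy. */
                if (get_mempolicy(&mode, NULL, 0, p, MPOL_F_ADDR))
                    perror("get_mempolicy");
                else
                    printf("mode governing %p: %d (MPOL_BIND is %d)\n",
                           p, mode, MPOL_BIND);
                return 0;
            }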
MEMORY POLICY COMMAND LINE INTERFACE

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.


MEMORY POLICIES AND CPUSETS

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the
intersection of the set of nodes specified for the policy and the set of nodes
allowed by the cpuset is the empty set, the policy is considered invalid and
cannot be installed.  A sketch of this intersection test follows the list
below.

The interaction of memory policies and cpusets can be problematic for a
couple of reasons:

1) the memory policy APIs take physical node ids as arguments.  However, the
   memory policy APIs do not provide a way to determine what nodes are valid
   in the context where the application is running.  An application MAY consult
   the cpuset file system [directly or via an out of tree, and not generally
   available, libcpuset API] to obtain this information, but then the
   application must be aware that it is running in a cpuset and use what are
   intended primarily as administrative APIs.

   However, as long as the policy specifies at least one node that is valid
   in the controlling cpuset, the policy can be used.

2) when tasks in two cpusets share access to a memory region, such as shared
   memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
   MAP_SHARED flags, and any of the tasks install shared policy on the region,
   only nodes whose memories are allowed in both cpusets may be used in the
   policies.  Again, obtaining this information requires "stepping outside"
   the memory policy APIs, as well as knowing in what cpusets other tasks
   might be attaching to the shared region, to use the cpuset information.
   Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
   allocation is the only valid policy.
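As promised above, here is a minimal sketch of the intersection rule as it
applies before installing a policy.  The function name and mask
representation are illustrative (the same bit-mask-of-node-ids layout the
system calls use), not an actual kernel or library interface:

        #include <stdbool.h>

        #define BITS_PER_ULONG  (8 * sizeof(unsigned long))

        /* Returns true if at least one node is in both the policy's node
         * set and the cpuset's allowed node set; an empty intersection
         * means the policy would be invalid in that cpuset. */
        static bool policy_intersects_cpuset(const unsigned long *policy_nodes,
                                             const unsigned long *cpuset_nodes,
                                             unsigned long maxnode)
        {
            unsigned long i;
            unsigned long nwords =
                (maxnode + BITS_PER_ULONG - 1) / BITS_PER_ULONG;

            for (i = 0; i < nwords; i++)
                if (policy_nodes[i] & cpuset_nodes[i])
                    return true;
            return false;
        }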