Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/mempolicy: advertise new MPOL_PREFERRED_MANY

Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.

MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch. Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode. Those shall contain the canonical reference.

NUMA systems continue to become more prevalent. New technologies like
PMEM make finer-grained control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of nodes
that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
will work either per VMA, or per thread.
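
As a rough illustration of the per-thread usage described above, the sketch below builds a nodemask and passes it to the set_mempolicy(2) system call with the new mode. The helper names `nodemask_from_ids` and `prefer_nodes` are hypothetical and not part of the patch. The fallback definition of MPOL_PREFERRED_MANY as 5 is taken from the uapi mode numbering and is only needed when the installed headers predate this mode. The sketch assumes node ids small enough to fit in a single `unsigned long`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback for headers that predate the mode (value from the uapi enum). */
#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5
#endif

/*
 * Hypothetical helper: build a one-word nodemask from a list of node ids.
 * Ids that are negative or do not fit in one word are skipped.
 */
unsigned long nodemask_from_ids(const int *ids, size_t n)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (ids[i] >= 0 && (size_t)ids[i] < 8 * sizeof(mask))
			mask |= 1UL << ids[i];
	}
	return mask;
}

/*
 * Hypothetical helper: ask the kernel to try the nodes in 'mask' first for
 * allocations made by the calling thread. Returns 0 on success, or the
 * errno value on failure (for example EINVAL on a kernel without this mode).
 */
int prefer_nodes(unsigned long mask)
{
	/* Upper bound on the number of mask bits the kernel may read. */
	unsigned long maxnode = 8 * sizeof(mask);

	if (syscall(SYS_set_mempolicy, MPOL_PREFERRED_MANY, &mask, maxnode) == -1)
		return errno;
	return 0;
}
```

A caller might prefer nodes 0 and 2 with `int ids[] = {0, 2}; prefer_nodes(nodemask_from_ids(ids, 2));` and treat a nonzero return as "mode not available", falling back to the default policy. The per-VMA variant would go through mbind(2) instead and is not shown here.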

[Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY]

Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.com
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Ben Widawsky and committed by Linus Torvalds
a38a59fd cfcaa66f

+12 -10
+11 -4
Documentation/admin-guide/mm/numa_memory_policy.rst
···
 	address range or file. During system boot up, the temporary
 	interleaved system default policy works in this mode.
 
+MPOL_PREFERRED_MANY
+	This mode specifices that the allocation should be preferrably
+	satisfied from the nodemask specified in the policy. If there is
+	a memory pressure on all nodes in the nodemask, the allocation
+	can fall back to all existing numa nodes. This is effectively
+	MPOL_PREFERRED allowed for a mask rather than a single node.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
···
 	nodes changes after the memory policy has been defined.
 
 	Without this flag, any time a mempolicy is rebound because of a
-	change in the set of allowed nodes, the node (Preferred) or
-	nodemask (Bind, Interleave) is remapped to the new set of
-	allowed nodes. This may result in nodes being used that were
-	previously undesired.
+	change in the set of allowed nodes, the preferred nodemask (Preferred
+	Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+	remapped to the new set of allowed nodes. This may result in nodes
+	being used that were previously undesired.
 
 	With this flag, if the user-specified nodes overlap with the
 	nodes allowed by the task's cpuset, then the memory policy is
+1 -6
mm/mempolicy.c
···
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
 
-	/*
-	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
-	 * is not fully implemented, don't permit it to be used for now,
-	 * and the logic will be restored in following patch
-	 */
-	if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
+	if ((unsigned int)(*mode) >= MPOL_MAX)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
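
The effect of the relaxed check can be modelled in userspace. The sketch below is a standalone approximation, not the kernel function: `sketch_sanitize_mode` is a hypothetical name, the enum mirrors the mode numbering assumed to be in effect as of this commit, and the flag values are copied from the uapi header as an assumption. Any other checks in the surrounding kernel code are left out.

```c
#include <errno.h>

/* Mode numbering assumed for this commit; MPOL_MAX must stay last. */
enum {
	MPOL_DEFAULT,
	MPOL_PREFERRED,
	MPOL_BIND,
	MPOL_INTERLEAVE,
	MPOL_LOCAL,
	MPOL_PREFERRED_MANY,
	MPOL_MAX,
};

/* Optional mode flags share the mode word with the mode itself. */
#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_F_NUMA_BALANCING	(1 << 13)
#define MPOL_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

/*
 * Hypothetical userspace model of the check in the hunk above: split the
 * flags from the mode, then reject unknown modes and the contradictory
 * STATIC + RELATIVE flag combination.
 */
int sketch_sanitize_mode(int *mode, unsigned short *flags)
{
	*flags = *mode & MPOL_MODE_FLAGS;
	*mode &= ~MPOL_MODE_FLAGS;

	/* With this patch MPOL_PREFERRED_MANY is below the limit and accepted. */
	if ((unsigned int)(*mode) >= MPOL_MAX)
		return -EINVAL;
	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
		return -EINVAL;
	return 0;
}
```

Under the old comparison against MPOL_PREFERRED_MANY, a request for that mode would have taken the -EINVAL branch. With the comparison against MPOL_MAX it passes through, optionally carrying a flag such as MPOL_F_STATIC_NODES.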