
mm: page_alloc: defrag_mode

The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which may
well produce suitable pages. As a result, fragmentation of physical
memory is a common ongoing process in many load scenarios.

Fragmentation degrades compaction's ability to produce huge pages.
Depending on the lifetime of the fragmenting allocations, those effects
can be long-lasting or even permanent, requiring drastic measures like
forcible idle states or even reboots as the only reliable ways to recover
the address space for THP production.

In a kernel build test with supplemental THP pressure, the THP allocation
rate steadily declines over 15 runs:

thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215

This is a hurdle to adopting THP in any environment where hosts are shared
between multiple overlapping workloads (e.g. cloud environments) and rarely
experience true idle periods. To make THP a reliable and predictable
optimization, there needs to be a stronger guarantee to avoid such
fragmentation.

Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its
full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT is
enforced on the allocator fastpath and the reclaiming slowpath.

For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make it
ready for all possible allocation contexts.
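With this patch applied, the knob is exposed as vm.defrag_mode (per the sysctl
table entry below). Since fragmentation, once it occurs, can be long-lasting,
a persistent setting applied at boot is the intended usage; a sketch (the
drop-in file name is arbitrary):

```
# /etc/sysctl.d/90-defrag.conf -- file name is arbitrary
vm.defrag_mode = 1
```

At runtime the same effect can be had with `sysctl -w vm.defrag_mode=1`, or by
writing 1 to /proc/sys/vm/defrag_mode.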

The following test results are from a kernel build with periodic bursts of
THP allocations, over 15 runs:

                                           vanilla  defrag_mode=1
@claimer[unmovable]:                           189            103
@claimer[movable]:                              92            103
@claimer[reclaimable]:                         207             61
@pollute[unmovable from movable]:               25              0
@pollute[unmovable from reclaimable]:           28              0
@pollute[movable from unmovable]:            38835              0
@pollute[movable from reclaimable]:         147136              0
@pollute[reclaimable from unmovable]:          178              0
@pollute[reclaimable from movable]:             33              0
@steal[unmovable from movable]:                 11              0
@steal[unmovable from reclaimable]:              5              0
@steal[reclaimable from unmovable]:            107              0
@steal[reclaimable from movable]:               90              0
@steal[movable from reclaimable]:              354              0
@steal[movable from unmovable]:                130              0

Both types of polluting fallbacks are eliminated in this workload.

Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with fallbacks;
this allows the native type to group up instead of spreading out to new
blocks. The assumption in the allocator has been that pollution from
movable allocations is less harmful than from other types, since they can
be reclaimed or migrated out should the space be needed. However, since
fallbacks occur *before* reclaim/compaction is invoked, movable pollution
will still cause non-movable allocations to spread out and claim more
blocks.

Without fragmentation, THP rates hold steady with defrag_mode=1:

thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850

While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla kernel's to
begin with. This is due to deficiencies in how reclaim and compaction are
currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller
allocations are competing with THPs for pageblocks, while making no effort
themselves to reclaim or compact beyond their own request size. This
effect already exists with the current usage of ALLOC_NOFRAGMENT, but is
amplified by defrag_mode insisting on whole block stealing much more
strongly.

Subsequent patches will address defrag_mode reclaim strategy to raise the
THP success baseline above the vanilla kernel.

Link: https://lkml.kernel.org/r/20250313210647.1314586-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Johannes Weiner, committed by Andrew Morton
e3aa7df3 f46012c0
+34 -2
Documentation/admin-guide/sysctl/vm.rst (+9):

--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -28,6 +28,7 @@
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
+- defrag_mode
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -145,6 +146,14 @@
 to compaction, which would block the task from becoming active until the fault
 is resolved.
 
+defrag_mode
+===========
+
+When set to 1, the page allocator tries harder to avoid fragmentation
+and maintain the ability to produce huge pages / higher-order pages.
+
+It is recommended to enable this right after boot, as fragmentation,
+once it occurred, can be long-lasting or even permanent.
 
 dirty_background_bytes
 ======================
mm/page_alloc.c (+25 -2):

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,7 @@
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
+static int defrag_mode;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -3390,6 +3389,11 @@
 	 */
	alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
 
+	if (defrag_mode) {
+		alloc_flags |= ALLOC_NOFRAGMENT;
+		return alloc_flags;
+	}
+
 #ifdef CONFIG_ZONE_DMA32
	if (!zone)
		return alloc_flags;
@@ -3486,7 +3480,7 @@
			continue;
		}
 
-		if (no_fallback && nr_online_nodes > 1 &&
+		if (no_fallback && !defrag_mode && nr_online_nodes > 1 &&
		    zone != zonelist_zone(ac->preferred_zoneref)) {
			int local_nid;
@@ -3597,7 +3591,7 @@
	 * It's possible on a UMA machine to get through all zones that are
	 * fragmented. If avoiding fragmentation, reset and try again.
	 */
-	if (no_fallback) {
+	if (no_fallback && !defrag_mode) {
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}
@@ -4134,6 +4128,9 @@
 
	alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags);
 
+	if (defrag_mode)
+		alloc_flags |= ALLOC_NOFRAGMENT;
+
	return alloc_flags;
 }
@@ -4519,6 +4510,11 @@
				 &compaction_retries))
		goto retry;
 
+	/* Reclaim/compaction failed to prevent the fallback */
+	if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
 
	/*
	 * Deal with possible cpuset update races or zonelist updates to avoid
@@ -6299,6 +6285,15 @@
		.proc_handler = watermark_scale_factor_sysctl_handler,
		.extra1 = SYSCTL_ONE,
		.extra2 = SYSCTL_THREE_THOUSAND,
+	},
+	{
+		.procname = "defrag_mode",
+		.data = &defrag_mode,
+		.maxlen = sizeof(defrag_mode),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE,
	},
	{
		.procname = "percpu_pagelist_high_fraction",