
mm: compaction: push watermark into compaction_suitable() callers

Patch series "mm: reliable huge page allocator".

This series makes changes to the allocator and reclaim/compaction code to
try harder to avoid fragmentation. As a result, this makes huge page
allocations cheaper, more reliable and more sustainable.

It's a subset of the huge page allocator RFC initially proposed here:

https://lore.kernel.org/lkml/20230418191313.268131-1-hannes@cmpxchg.org/

The following results are from a kernel build test, with additional
concurrent bursts of THP allocations on a memory-constrained system.
Comparing before and after the changes over 15 runs:

before after
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)

THP latencies are cut in half, and failure rates are cut by 75%. These
metrics also hold up over time, while the vanilla kernel sees a steady
downward trend in success rates with each subsequent run, owing to the
cumulative effects of fragmentation.

A more detailed discussion of results is in the patch changelogs.

The patches first introduce a vm.defrag_mode sysctl, which enforces the
existing ALLOC_NOFRAGMENT alloc flag until after reclaim and compaction
have run. They then change kswapd and kcompactd to target pageblocks,
which boosts success in the ALLOC_NOFRAGMENT hotpaths.

Patches #1 and #2 are somewhat unrelated cleanups, but touch the same code
and so are included here to avoid conflicts from re-ordering.


This patch (of 5):

compaction_suitable() hardcodes the min watermark, with a boost to the low
watermark for costly orders. However, compaction_ready() requires order-0
at the high watermark, and as a result currently has to check the
watermarks twice.
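The hardcoded selection being replaced can be sketched as a small userspace
model. PAGE_ALLOC_COSTLY_ORDER and compact_gap() use their real kernel
definitions; the watermark arguments are illustrative placeholders, not
values from any real zone:

```c
#include <assert.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* as in the kernel */

/* compact_gap(): extra room compaction needs beyond the watermark */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/*
 * Old behavior: __compaction_suitable() picks the watermark itself,
 * min for cheap orders, low for costly ones, then adds the gap.
 */
static unsigned long old_wmark_target(unsigned int order,
				      unsigned long min_wmark,
				      unsigned long low_wmark)
{
	unsigned long watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				  low_wmark : min_wmark;

	return watermark + compact_gap(order);
}
```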

Make the watermark a parameter to compaction_suitable() and have the
callers pass in what they require:

- compaction_zonelist_suitable() is used by the direct reclaim path,
so use the min watermark.

- compaction_suit_allocation_order() has a watermark in context derived
from cc->alloc_flags.

The only quirk is that kcompactd doesn't initialize cc->alloc_flags
explicitly. There is a direct check in kcompactd_do_work() that
passes ALLOC_WMARK_MIN, but there is another check further down the
stack in compact_zone() that ends up passing the unset alloc_flags.
Since they default to 0, and that coincides with ALLOC_WMARK_MIN, it
is correct. But it's subtle. Set cc->alloc_flags explicitly.

- should_continue_reclaim() is direct reclaim as well; use the min watermark.

- Finally, consolidate the two checks in compaction_ready() to a
single compaction_suitable() call passing the high watermark.

There is a tiny change in behavior in compaction_ready(): before,
compaction_suitable() would check order-0 against the min or low
watermark, depending on whether the order was costly. Then there'd
be a separate high watermark check.

Now, the high watermark is passed to compaction_suitable(), and the
costly order-boost (low - min) is added on top. This means
compaction_ready() sets a marginally higher target for free pages.
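To make the marginal difference concrete, here is a small model of the
effective free-page cutoff in compaction_ready() before and after. It
models only the binding high-watermark check on the old side; compact_gap()
and PAGE_ALLOC_COSTLY_ORDER use their kernel definitions, and the watermark
numbers in the assertions are illustrative:

```c
#include <assert.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* as in the kernel */

static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Before: the high watermark was checked separately, plus the gap */
static unsigned long old_ready_target(unsigned int order,
				      unsigned long high_wmark)
{
	return high_wmark + compact_gap(order);
}

/* After: high watermark passed in, costly boost (low - min) on top */
static unsigned long new_ready_target(unsigned int order,
				      unsigned long min_wmark,
				      unsigned long low_wmark,
				      unsigned long high_wmark)
{
	unsigned long watermark = high_wmark + compact_gap(order);

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		watermark += low_wmark - min_wmark;
	return watermark;
}
```

For a costly order, the new target exceeds the old one by exactly the
low-min delta; non-costly orders are unchanged.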

In a kernel build + THP pressure test, though, this didn't show any
measurable negative effects on memory pressure or reclaim rates. As
the comment above the check says, reclaim is usually stopped short
by should_continue_reclaim(), and this just defines the worst-case
reclaim cutoff in case compaction is not making any headway.

[hughd@google.com: stop oops on out-of-range highest_zoneidx]
Link: https://lkml.kernel.org/r/005ace8b-07fa-01d4-b54b-394a3e029c07@google.com
Link: https://lkml.kernel.org/r/20250313210647.1314586-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20250313210647.1314586-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


 include/linux/compaction.h |  5 +++--
 mm/compaction.c            | 52 ++++++++++++++++++++++----------------------
 mm/vmscan.c                | 26 ++++++++++++++------------
 3 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -95,7 +95,7 @@
 			      struct page **page);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern bool compaction_suitable(struct zone *zone, int order,
-				int highest_zoneidx);
+				unsigned long watermark, int highest_zoneidx);
 
 extern void compaction_defer_reset(struct zone *zone, int order,
 				   bool alloc_success);
@@ -113,7 +113,8 @@
 }
 
 static inline bool compaction_suitable(struct zone *zone, int order,
-				       int highest_zoneidx)
+				       unsigned long watermark,
+				       int highest_zoneidx)
 {
 	return false;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2381,40 +2381,42 @@
 }
 
 static bool __compaction_suitable(struct zone *zone, int order,
-				  int highest_zoneidx,
-				  unsigned long wmark_target)
+				  unsigned long watermark, int highest_zoneidx,
+				  unsigned long free_pages)
 {
-	unsigned long watermark;
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
 	 * isolate free pages for migration targets. This means that the
-	 * watermark and alloc_flags have to match, or be more pessimistic than
-	 * the check in __isolate_free_page(). We don't use the direct
-	 * compactor's alloc_flags, as they are not relevant for freepage
-	 * isolation. We however do use the direct compactor's highest_zoneidx
-	 * to skip over zones where lowmem reserves would prevent allocation
-	 * even if compaction succeeds.
-	 * For costly orders, we require low watermark instead of min for
-	 * compaction to proceed to increase its chances.
+	 * watermark have to match, or be more pessimistic than the check in
+	 * __isolate_free_page().
+	 *
+	 * For costly orders, we require a higher watermark for compaction to
+	 * proceed to increase its chances.
+	 *
+	 * We use the direct compactor's highest_zoneidx to skip over zones
+	 * where lowmem reserves would prevent allocation even if compaction
+	 * succeeds.
+	 *
 	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
-	 * suitable migration targets
+	 * suitable migration targets.
 	 */
-	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
-				low_wmark_pages(zone) : min_wmark_pages(zone);
 	watermark += compact_gap(order);
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		watermark += low_wmark_pages(zone) - min_wmark_pages(zone);
 	return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-				   ALLOC_CMA, wmark_target);
+				   ALLOC_CMA, free_pages);
 }
 
 /*
  * compaction_suitable: Is this suitable to run compaction on this zone now?
  */
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, unsigned long watermark,
+			 int highest_zoneidx)
 {
 	enum compact_result compact_result;
 	bool suitable;
 
-	suitable = __compaction_suitable(zone, order, highest_zoneidx,
+	suitable = __compaction_suitable(zone, order, watermark, highest_zoneidx,
 					 zone_page_state(zone, NR_FREE_PAGES));
 	/*
 	 * fragmentation index determines if allocation failures are due to
@@ -2454,6 +2452,7 @@
 	return suitable;
 }
 
+/* Used by direct reclaimers */
 bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 				  int alloc_flags)
 {
@@ -2477,8 +2474,8 @@
 		 */
 		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
-		if (__compaction_suitable(zone, order, ac->highest_zoneidx,
-					  available))
+		if (__compaction_suitable(zone, order, min_wmark_pages(zone),
+					  ac->highest_zoneidx, available))
 			return true;
 	}
 
@@ -2515,13 +2512,13 @@
 	 */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && async &&
 	    !(alloc_flags & ALLOC_CMA)) {
-		watermark = low_wmark_pages(zone) + compact_gap(order);
-		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-					 0, zone_page_state(zone, NR_FREE_PAGES)))
+		if (!__zone_watermark_ok(zone, 0, watermark + compact_gap(order),
+					 highest_zoneidx, 0,
+					 zone_page_state(zone, NR_FREE_PAGES)))
 			return COMPACT_SKIPPED;
 	}
 
-	if (!compaction_suitable(zone, order, highest_zoneidx))
+	if (!compaction_suitable(zone, order, watermark, highest_zoneidx))
 		return COMPACT_SKIPPED;
 
 	return COMPACT_CONTINUE;
@@ -3084,6 +3081,7 @@
 		.mode = MIGRATE_SYNC_LIGHT,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
+		.alloc_flags = ALLOC_WMARK_MIN,
 	};
 	enum compact_result ret;
 
@@ -3103,7 +3099,7 @@
 			continue;
 
 		ret = compaction_suit_allocation_order(zone,
-				cc.order, zoneid, ALLOC_WMARK_MIN,
+				cc.order, zoneid, cc.alloc_flags,
 				false);
 		if (ret != COMPACT_CONTINUE)
 			continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5890,12 +5890,15 @@
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for_each_managed_zone_pgdat(zone, pgdat, z, sc->reclaim_idx) {
+		unsigned long watermark = min_wmark_pages(zone);
+
 		/* Allocation can already succeed, nothing to do */
-		if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
+		if (zone_watermark_ok(zone, sc->order, watermark,
 				      sc->reclaim_idx, 0))
 			return false;
 
-		if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+		if (compaction_suitable(zone, sc->order, watermark,
+					sc->reclaim_idx))
 			return false;
 	}
 
@@ -6125,22 +6122,21 @@
 			      sc->reclaim_idx, 0))
 		return true;
 
-	/* Compaction cannot yet proceed. Do reclaim. */
-	if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
-		return false;
-
 	/*
-	 * Compaction is already possible, but it takes time to run and there
-	 * are potentially other callers using the pages just freed. So proceed
-	 * with reclaim to make a buffer of free pages available to give
-	 * compaction a reasonable chance of completing and allocating the page.
+	 * Direct reclaim usually targets the min watermark, but compaction
+	 * takes time to run and there are potentially other callers using the
+	 * pages just freed. So target a higher buffer to give compaction a
+	 * reasonable chance of completing and allocating the pages.
+	 *
 	 * Note that we won't actually reclaim the whole buffer in one attempt
 	 * as the target watermark in should_continue_reclaim() is lower. But if
 	 * we are already above the high+gap watermark, don't reclaim at all.
 	 */
-	watermark = high_wmark_pages(zone) + compact_gap(sc->order);
+	watermark = high_wmark_pages(zone);
+	if (compaction_suitable(zone, sc->order, watermark, sc->reclaim_idx))
+		return true;
 
-	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
+	return false;
 }
 
 static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)