Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: page_alloc: speed up fallbacks in rmqueue_bulk()

The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
single pages from biggest buddy") as the root cause of a 56.4% regression
in vm-scalability::lru-file-mmap-read.

Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
freelist movement during block conversion"), as the root cause for a
regression in worst-case zone->lock+irqoff hold times.

Both of these patches modify the page allocator's fallback path to be less
greedy in an effort to stave off fragmentation. The flip side of this is
that fallbacks are also less productive each time around, which means the
fallback search can run much more frequently.

Carlos' traces point to rmqueue_bulk() specifically, which tries to refill
the percpu cache by allocating a large batch of pages in a loop. They
highlight how, once the native freelists are exhausted, the fallback code
first scans orders top-down for whole blocks to claim, then falls back to
a bottom-up search for the smallest buddy to steal. For the next page in
the batch, it goes through the same search all over again.
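
To make the cost concrete, here is a minimal userspace C model of the
pre-patch shape of that search. The mode names (SEARCH_NATIVE,
SEARCH_CLAIM, SEARCH_STEAL) and the freelist counts are made up stand-ins,
not kernel code; the point is only the control flow, where every page in
the batch restarts the scan from the top:

#include <stdio.h>

/* Toy stand-ins for the allocator's search stages and freelists. */
enum search_mode { SEARCH_NATIVE, SEARCH_CLAIM, SEARCH_STEAL, NR_SEARCH_MODES };

static int freelist[NR_SEARCH_MODES] = { 2, 0, 6 };	/* made-up contents */
static int scan_steps;

/* Pre-patch shape: every call walks the modes from the top again. */
static int alloc_one_restarting(void)
{
	for (int m = SEARCH_NATIVE; m < NR_SEARCH_MODES; m++) {
		scan_steps++;
		if (freelist[m] > 0) {
			freelist[m]--;
			return 1;
		}
	}
	return 0;
}

int main(void)
{
	int got = 0;

	/* One "batch" of 8 pages, like rmqueue_bulk() refilling the pcp. */
	for (int i = 0; i < 8; i++)
		got += alloc_one_restarting();

	printf("got %d pages in %d scan steps\n", got, scan_steps);
	return 0;
}

Once the first freelist is exhausted, every remaining page in the batch
pays for re-probing the modes that are already known to be empty.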

This can be made more efficient. Since rmqueue_bulk() holds the
zone->lock over the entire batch, the freelists are not subject to outside
changes; when the search for a block to claim has already failed, there is
no point in trying again for the next page.

Modify __rmqueue() to remember the last successful fallback mode, and
restart directly from there on the next rmqueue_bulk() iteration.
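
Continuing the toy model above, a sketch of that cached-mode idea: the
caller keeps an enum across the batch and the search resumes from the
last mode that produced a page. The names remain made up, and the real
patch additionally resets to normal mode after claiming a whole block
(since that replenishes the preferred freelist), which this sketch omits:

/* Patched shape: resume from the last mode that produced a page. */
static int alloc_one_resuming(enum search_mode *mode)
{
	for (int m = *mode; m < NR_SEARCH_MODES; m++) {
		scan_steps++;
		if (freelist[m] > 0) {
			freelist[m]--;
			*mode = m;	/* remember where we found a page */
			return 1;
		}
	}
	return 0;
}

/* Caller keeps the mode alive for the whole batch: */
	enum search_mode mode = SEARCH_NATIVE;

	for (int i = 0; i < 8; i++)
		got += alloc_one_resuming(&mode);

With the same toy freelists, the batch of 8 now takes 10 probes instead
of 20; in the real rmqueue_bulk() the savings come from skipping the
top-down block scan and bottom-up buddy search once they are known to be
futile under the held zone->lock.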

Oliver confirms that this not only recovers the regression the test robot
reported against c2f6ea38fc1b, it improves throughput beyond the original
baseline:

commit:
f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch

f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
  25525364 ±  3%     -56.4%   11135467           -57.8%   10779336           +31.6%   33581409        vm-scalability.throughput

Carlos confirms that worst-case times are almost fully recovered
compared to before the earlier culprit patch:

2dd482ba627d (before freelist hygiene):     1ms
c0cd6f557b90 (after freelist hygiene):     90ms
next-20250319 (steal smallest buddy):     280ms
this patch:                                 8ms

[jackmanb@google.com: comment updates]
Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com
[hannes@cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin]
Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org
Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org
Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Carlos Song <carlos.song@nxp.com>
Tested-by: Carlos Song <carlos.song@nxp.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [6.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Johannes Weiner, committed by Andrew Morton
90abee6d 61c4e6ca

+79 -34
mm/page_alloc.c
···
 }
 
 /*
- * Try finding a free buddy page on the fallback list.
- *
- * This will attempt to claim a whole pageblock for the requested type
- * to ensure grouping of such requests in the future.
- *
- * If a whole block cannot be claimed, steal an individual page, regressing to
- * __rmqueue_smallest() logic to at least break up as little contiguity as
- * possible.
+ * Try to allocate from some fallback migratetype by claiming the entire block,
+ * i.e. converting it to the allocation's start migratetype.
  *
  * The use of signed ints for order and current_order is a deliberate
  * deviation from the rest of this file, to make the for loop
  * condition simpler.
- *
- * Return the stolen page, or NULL if none can be found.
  */
 static __always_inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
 						unsigned int alloc_flags)
 {
 	struct free_area *area;
···
 		page = try_to_claim_block(zone, page, current_order, order,
 					  start_migratetype, fallback_mt,
 					  alloc_flags);
-		if (page)
-			goto got_one;
+		if (page) {
+			trace_mm_page_alloc_extfrag(page, order, current_order,
+						start_migratetype, fallback_mt);
+			return page;
+		}
 	}
 
-	if (alloc_flags & ALLOC_NOFRAGMENT)
-		return NULL;
+	return NULL;
+}
 
-	/* No luck claiming pageblock. Find the smallest fallback page */
+/*
+ * Try to steal a single page from some fallback migratetype. Leave the rest of
+ * the block as its current migratetype, potentially causing fragmentation.
+ */
+static __always_inline struct page *
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+{
+	struct free_area *area;
+	int current_order;
+	struct page *page;
+	int fallback_mt;
+	bool claim_block;
+
 	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
···
 		page = get_page_from_free_area(area, fallback_mt);
 		page_del_and_expand(zone, page, order, current_order, fallback_mt);
-		goto got_one;
+		trace_mm_page_alloc_extfrag(page, order, current_order,
+					start_migratetype, fallback_mt);
+		return page;
 	}
 
 	return NULL;
-
-got_one:
-	trace_mm_page_alloc_extfrag(page, order, current_order,
-				    start_migratetype, fallback_mt);
-
-	return page;
 }
+
+enum rmqueue_mode {
+	RMQUEUE_NORMAL,
+	RMQUEUE_CMA,
+	RMQUEUE_CLAIM,
+	RMQUEUE_STEAL,
+};
 
 /*
  * Do the hard work of removing an element from the buddy allocator.
···
  */
 static __always_inline struct page *
 __rmqueue(struct zone *zone, unsigned int order, int migratetype,
-						unsigned int alloc_flags)
+	  unsigned int alloc_flags, enum rmqueue_mode *mode)
 {
 	struct page *page;
 
···
 		}
 	}
 
-	page = __rmqueue_smallest(zone, order, migratetype);
-	if (unlikely(!page)) {
-		if (alloc_flags & ALLOC_CMA)
+	/*
+	 * First try the freelists of the requested migratetype, then try
+	 * fallbacks modes with increasing levels of fragmentation risk.
+	 *
+	 * The fallback logic is expensive and rmqueue_bulk() calls in
+	 * a loop with the zone->lock held, meaning the freelists are
+	 * not subject to any outside changes. Remember in *mode where
+	 * we found pay dirt, to save us the search on the next call.
+	 */
+	switch (*mode) {
+	case RMQUEUE_NORMAL:
+		page = __rmqueue_smallest(zone, order, migratetype);
+		if (page)
+			return page;
+		fallthrough;
+	case RMQUEUE_CMA:
+		if (alloc_flags & ALLOC_CMA) {
 			page = __rmqueue_cma_fallback(zone, order);
-
-		if (!page)
-			page = __rmqueue_fallback(zone, order, migratetype,
-						  alloc_flags);
+			if (page) {
+				*mode = RMQUEUE_CMA;
+				return page;
+			}
+		}
+		fallthrough;
+	case RMQUEUE_CLAIM:
+		page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
+		if (page) {
+			/* Replenished preferred freelist, back to normal mode. */
+			*mode = RMQUEUE_NORMAL;
+			return page;
+		}
+		fallthrough;
+	case RMQUEUE_STEAL:
+		if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
+			page = __rmqueue_steal(zone, order, migratetype);
+			if (page) {
+				*mode = RMQUEUE_STEAL;
+				return page;
+			}
+		}
 	}
-	return page;
+	return NULL;
 }
 
 /*
···
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
+	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
 	unsigned long flags;
 	int i;
 
···
 	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
-								alloc_flags);
+							alloc_flags, &rmqm);
 		if (unlikely(page == NULL))
 			break;
 
···
 	if (alloc_flags & ALLOC_HIGHATOMIC)
 		page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 	if (!page) {
-		page = __rmqueue(zone, order, migratetype, alloc_flags);
+		enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+
+		page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
 
 		/*
 		 * If the allocation fails, allow OOM handling and