Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: vmscan: apply proportional reclaim pressure for memcg when MGLRU is enabled

The scan implementation for MGLRU was missing proportional reclaim
pressure for memcg, which contradicts the description in
Documentation/admin-guide/cgroup-v2.rst (memory.{low,min} section).

This issue can be observed in the cgroup kselftest test_memcontrol
(specifically test_memcg_min and test_memcg_low). The following tables
show the actual values observed in my local test env (on xfs) and the
error "e", which is the symmetric absolute percentage error from the
ideal values of 29M for c[0] and 21M for c[1].

test_memcg_min

     |  MGLRU enabled  |  MGLRU enabled  | MGLRU disabled
     |  Without patch  |   With patch    |
-----|-----------------|-----------------|---------------
c[0] | 25964544 (e=8%) | 28770304 (e=3%) | 27820032 (e=4%)
c[1] | 26214400 (e=9%) | 23998464 (e=4%) | 24776704 (e=6%)

test_memcg_low

     |  MGLRU enabled  |  MGLRU enabled  | MGLRU disabled
     |  Without patch  |   With patch    |
-----|-----------------|-----------------|---------------
c[0] | 26214400 (e=7%) | 27930624 (e=4%) | 27688960 (e=5%)
c[1] | 26214400 (e=9%) | 24764416 (e=6%) | 24920064 (e=6%)

Factor out the proportioning logic into a new function and have MGLRU
reuse it. While at it, update the eviction behavior of the debugfs
'lru_gen' interface (the '-' command with an explicit 'nr_to_reclaim'
parameter) to ensure eviction is limited to the specified number of
pages.

Link: https://lkml.kernel.org/r/20250530162353.541882-1-den@valinux.co.jp
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Koichiro Den, committed by Andrew Morton
af827e09 80d1a813
+79 -70
mm/vmscan.c

@@
 	*denominator = ap + fp;
 }

+static unsigned long apply_proportional_protection(struct mem_cgroup *memcg,
+		struct scan_control *sc, unsigned long scan)
+{
+	unsigned long min, low;
+
+	mem_cgroup_protection(sc->target_mem_cgroup, memcg, &min, &low);
+
+	if (min || low) {
+		/*
+		 * Scale a cgroup's reclaim pressure by proportioning
+		 * its current usage to its memory.low or memory.min
+		 * setting.
+		 *
+		 * This is important, as otherwise scanning aggression
+		 * becomes extremely binary -- from nothing as we
+		 * approach the memory protection threshold, to totally
+		 * nominal as we exceed it. This results in requiring
+		 * setting extremely liberal protection thresholds. It
+		 * also means we simply get no protection at all if we
+		 * set it too low, which is not ideal.
+		 *
+		 * If there is any protection in place, we reduce scan
+		 * pressure by how much of the total memory used is
+		 * within protection thresholds.
+		 *
+		 * There is one special case: in the first reclaim pass,
+		 * we skip over all groups that are within their low
+		 * protection. If that fails to reclaim enough pages to
+		 * satisfy the reclaim goal, we come back and override
+		 * the best-effort low protection. However, we still
+		 * ideally want to honor how well-behaved groups are in
+		 * that case instead of simply punishing them all
+		 * equally. As such, we reclaim them based on how much
+		 * memory they are using, reducing the scan pressure
+		 * again by how much of the total memory used is under
+		 * hard protection.
+		 */
+		unsigned long cgroup_size = mem_cgroup_size(memcg);
+		unsigned long protection;
+
+		/* memory.low scaling, make sure we retry before OOM */
+		if (!sc->memcg_low_reclaim && low > min) {
+			protection = low;
+			sc->memcg_low_skipped = 1;
+		} else {
+			protection = min;
+		}
+
+		/* Avoid TOCTOU with earlier protection check */
+		cgroup_size = max(cgroup_size, protection);
+
+		scan -= scan * protection / (cgroup_size + 1);
+
+		/*
+		 * Minimally target SWAP_CLUSTER_MAX pages to keep
+		 * reclaim moving forwards, avoiding decrementing
+		 * sc->priority further than desirable.
+		 */
+		scan = max(scan, SWAP_CLUSTER_MAX);
+	}
+	return scan;
+}
+
 /*
  * Determine how aggressively the anon and file LRU lists should be
  * scanned.
@@
 	for_each_evictable_lru(lru) {
 		bool file = is_file_lru(lru);
 		unsigned long lruvec_size;
-		unsigned long low, min;
 		unsigned long scan;

 		lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
-				      &min, &low);
-
-		if (min || low) {
-			/*
-			 * Scale a cgroup's reclaim pressure by proportioning
-			 * its current usage to its memory.low or memory.min
-			 * setting.
-			 *
-			 * This is important, as otherwise scanning aggression
-			 * becomes extremely binary -- from nothing as we
-			 * approach the memory protection threshold, to totally
-			 * nominal as we exceed it. This results in requiring
-			 * setting extremely liberal protection thresholds. It
-			 * also means we simply get no protection at all if we
-			 * set it too low, which is not ideal.
-			 *
-			 * If there is any protection in place, we reduce scan
-			 * pressure by how much of the total memory used is
-			 * within protection thresholds.
-			 *
-			 * There is one special case: in the first reclaim pass,
-			 * we skip over all groups that are within their low
-			 * protection. If that fails to reclaim enough pages to
-			 * satisfy the reclaim goal, we come back and override
-			 * the best-effort low protection. However, we still
-			 * ideally want to honor how well-behaved groups are in
-			 * that case instead of simply punishing them all
-			 * equally. As such, we reclaim them based on how much
-			 * memory they are using, reducing the scan pressure
-			 * again by how much of the total memory used is under
-			 * hard protection.
-			 */
-			unsigned long cgroup_size = mem_cgroup_size(memcg);
-			unsigned long protection;
-
-			/* memory.low scaling, make sure we retry before OOM */
-			if (!sc->memcg_low_reclaim && low > min) {
-				protection = low;
-				sc->memcg_low_skipped = 1;
-			} else {
-				protection = min;
-			}
-
-			/* Avoid TOCTOU with earlier protection check */
-			cgroup_size = max(cgroup_size, protection);
-
-			scan = lruvec_size - lruvec_size * protection /
-				(cgroup_size + 1);
-
-			/*
-			 * Minimally target SWAP_CLUSTER_MAX pages to keep
-			 * reclaim moving forwards, avoiding decrementing
-			 * sc->priority further than desirable.
-			 */
-			scan = max(scan, SWAP_CLUSTER_MAX);
-		} else {
-			scan = lruvec_size;
-		}
-
+		scan = apply_proportional_protection(memcg, sc, lruvec_size);
 		scan >>= sc->priority;
@@
 	return true;
 }

-static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
-		       int type, int tier, struct list_head *list)
+static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
+		       struct scan_control *sc, int type, int tier,
+		       struct list_head *list)
 {
 	int i;
 	int gen;
@@
 	int scanned = 0;
 	int isolated = 0;
 	int skipped = 0;
-	int remaining = MAX_LRU_BATCH;
+	int remaining = min(nr_to_scan, MAX_LRU_BATCH);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@
 	return positive_ctrl_err(&sp, &pv);
 }

-static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
+			  struct scan_control *sc, int swappiness,
 			  int *type_scanned, struct list_head *list)
 {
 	int i;
@@
 	*type_scanned = type;

-	scanned = scan_folios(lruvec, sc, type, tier, list);
+	scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
 	if (scanned)
 		return scanned;
@@
 	return 0;
 }

-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
+			struct scan_control *sc, int swappiness)
 {
 	int type;
 	int scanned;
@@
 	spin_lock_irq(&lruvec->lru_lock);

-	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);

 	scanned += try_to_inc_min_seq(lruvec, swappiness);
@@
 	if (nr_to_scan && !mem_cgroup_online(memcg))
 		return nr_to_scan;

+	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
+
 	/* try to get away with not aging at the default priority */
 	if (!success || sc->priority == DEF_PRIORITY)
 		return nr_to_scan >> sc->priority;
@@
 		if (nr_to_scan <= 0)
 			break;

-		delta = evict_folios(lruvec, sc, swappiness);
+		delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
 		if (!delta)
 			break;
@@
 		if (sc->nr_reclaimed >= nr_to_reclaim)
 			return 0;

-		if (!evict_folios(lruvec, sc, swappiness))
+		if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
+				  swappiness))
 			return 0;

 		cond_resched();