Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

hugetlb: memcg: account hugetlb-backed memory in memory controller

Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.

For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic pages, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab, etc., fair.
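In cgroup v2, the per-cgroup hugetlb limit mentioned above is set through the hugetlb controller's interface files. A sketch of that part of the setup (the cgroup path and the 1GB hugepage size are illustrative; note this only enforces hugetlb fairness and does nothing for the memory.max problem):

```shell
# Enable the hugetlb controller for child cgroups (cgroup v2).
echo "+hugetlb" > /sys/fs/cgroup/cgroup.subtree_control

# Cap the container at 3 gigantic (1GB) pages, as in the example above.
mkdir -p /sys/fs/cgroup/container1
echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/container1/hugetlb.1GB.max
```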

What we have had to resort to is constantly polling hugetlb usage and
readjusting memory.max. A similar procedure is applied to other memory
limits (memory.low, for example). However, this is rather cumbersome and
error-prone. Furthermore, when there is a delay in correcting the memory
limits (e.g. when hugetlb usage changes between consecutive runs of the
userspace agent), the system could be left in an over/underprotected state.
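The polling workaround amounts to a userspace agent along these lines (a hypothetical sketch, not our actual agent; the cgroup path, hugepage size, 32G budget, and 5-second period are all assumptions, and the window between iterations is exactly the delay described above):

```shell
CG=/sys/fs/cgroup/container1
BUDGET=$((32 * 1024 * 1024 * 1024))   # total per-container memory budget

# Periodically subtract current hugetlb usage from the budget and
# rewrite memory.max, since memory.max itself does not see hugetlb.
while sleep 5; do
        hugetlb=$(cat "$CG/hugetlb.1GB.current")
        echo $((BUDGET - hugetlb)) > "$CG/memory.max"
done
```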

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.

Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when they are used. Host overcommit management
has to take this into account when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. Tuning of the low and min limits must take
hugetlb memory into account.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
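Given these caveats, the accounting is gated behind a new cgroup v2 mount option. Enabling it might look like this (the mount point is the conventional one; per the last caveat, the remount must happen before the hugetlb pages are faulted in):

```shell
# Fresh mount with hugetlb accounting:
mount -t cgroup2 -o memory_hugetlb_accounting none /sys/fs/cgroup

# Or remount an existing hierarchy:
mount -o remount,memory_hugetlb_accounting /sys/fs/cgroup

# The kernel advertises supported mount options here:
cat /sys/kernel/cgroup/features
```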

Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Nhat Pham, committed by Andrew Morton
8cba9576 85ce2c51

+127 -11
Documentation/admin-guide/cgroup-v2.rst (+29)

@@ -210,6 +210,35 @@
 	relying on the original semantics (e.g. specifying bogusly
 	high 'bypass' protection values at higher tree levels).
 
+  memory_hugetlb_accounting
+	Count HugeTLB memory usage towards the cgroup's overall
+	memory usage for the memory controller (for the purpose of
+	statistics reporting and memory protection). This is a new
+	behavior that could regress existing setups, so it must be
+	explicitly opted in with this mount option.
+
+	A few caveats to keep in mind:
+
+	* There is no HugeTLB pool management involved in the memory
+	  controller. The pre-allocated pool does not belong to anyone.
+	  Specifically, when a new HugeTLB folio is allocated to
+	  the pool, it is not accounted for from the perspective of the
+	  memory controller. It is only charged to a cgroup when it is
+	  actually used (e.g. at page fault time). Host memory
+	  overcommit management has to consider this when configuring
+	  hard limits. In general, HugeTLB pool management should be
+	  done via other mechanisms (such as the HugeTLB controller).
+	* Failure to charge a HugeTLB folio to the memory controller
+	  results in SIGBUS. This could happen even if the HugeTLB pool
+	  still has pages available (but the cgroup limit is hit and
+	  reclaim attempt fails).
+	* Charging HugeTLB memory towards the memory controller affects
+	  memory protection and reclaim dynamics. Any userspace tuning
+	  (of the low, min limits, for example) needs to take this
+	  into account.
+	* HugeTLB pages utilized while this option is not selected
+	  will not be tracked by the memory controller (even if cgroup
+	  v2 is remounted later on).
 
 Organizing Processes and Threads
 --------------------------------
include/linux/cgroup-defs.h (+5)

@@ -115,6 +115,11 @@
 	 * Enable recursive subtree protection
 	 */
 	CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 18),
+
+	/*
+	 * Enable hugetlb accounting for the memory controller.
+	 */
+	CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
 };
 
 /* cftype->flags */
include/linux/memcontrol.h (+9)

@@ -678,6 +678,9 @@
 	return __mem_cgroup_charge(folio, mm, gfp);
 }
 
+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
+		long nr_pages);
+
 int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 		gfp_t gfp, swp_entry_t entry);
 void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);

@@ -1254,6 +1257,12 @@
 
 static inline int mem_cgroup_charge(struct folio *folio,
 		struct mm_struct *mm, gfp_t gfp)
+{
+	return 0;
+}
+
+static inline int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg,
+		gfp_t gfp, long nr_pages)
 {
 	return 0;
 }
kernel/cgroup/cgroup.c (+14 -1)

@@ -1902,6 +1902,7 @@
 	Opt_favordynmods,
 	Opt_memory_localevents,
 	Opt_memory_recursiveprot,
+	Opt_memory_hugetlb_accounting,
 	nr__cgroup2_params
 };
 
@@ -1910,6 +1911,7 @@
 	fsparam_flag("favordynmods",		Opt_favordynmods),
 	fsparam_flag("memory_localevents",	Opt_memory_localevents),
 	fsparam_flag("memory_recursiveprot",	Opt_memory_recursiveprot),
+	fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
 	{}
 };
 
@@ -1936,6 +1938,9 @@
 	case Opt_memory_recursiveprot:
 		ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		return 0;
+	case Opt_memory_hugetlb_accounting:
+		ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+		return 0;
 	}
 	return -EINVAL;
 }

@@ -1960,5 +1965,10 @@
 			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
+
+		if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+			cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
 	}
 }

@@ -1973,6 +1983,8 @@
 		seq_puts(seq, ",memory_localevents");
 	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
 		seq_puts(seq, ",memory_recursiveprot");
+	if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
+		seq_puts(seq, ",memory_hugetlb_accounting");
 	return 0;
 }
 
@@ -7050,7 +7062,8 @@
 			"nsdelegate\n"
 			"favordynmods\n"
 			"memory_localevents\n"
-			"memory_recursiveprot\n");
+			"memory_recursiveprot\n"
+			"memory_hugetlb_accounting\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
mm/hugetlb.c (+28 -7)

@@ -1927,6 +1927,7 @@
 			pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 			pages_per_huge_page(h), folio);
+	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
 

@@ -3026,11 +3027,20 @@
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long map_chg, map_commit;
+	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
 	long gbl_chg;
-	int ret, idx;
+	int memcg_charge_ret, ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
+	struct mem_cgroup *memcg;
 	bool deferred_reserve;
+	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+
+	memcg = get_mem_cgroup_from_current();
+	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
+	if (memcg_charge_ret == -ENOMEM) {
+		mem_cgroup_put(memcg);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	idx = hstate_index(h);
 	/*

@@ -3039,8 +3049,12 @@
 	 * code of zero indicates a reservation exists (no change).
 	 */
 	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
-	if (map_chg < 0)
+	if (map_chg < 0) {
+		if (!memcg_charge_ret)
+			mem_cgroup_cancel_charge(memcg, nr_pages);
+		mem_cgroup_put(memcg);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	/*
 	 * Processes that did not create the mapping will have no

@@ -3051,10 +3065,8 @@
 	 */
 	if (map_chg || avoid_reserve) {
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0) {
-			vma_end_reservation(h, vma, addr);
-			return ERR_PTR(-ENOSPC);
-		}
+		if (gbl_chg < 0)
+			goto out_end_reservation;
 
 		/*
 		 * Even though there was no reservation in the region/reserve

@@ -3136,6 +3148,11 @@
 		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 				pages_per_huge_page(h), folio);
 	}
+
+	if (!memcg_charge_ret)
+		mem_cgroup_commit_charge(folio, memcg);
+	mem_cgroup_put(memcg);
+
 	return folio;
 
 out_uncharge_cgroup:

@@ -3147,7 +3164,11 @@
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
+out_end_reservation:
 	vma_end_reservation(h, vma, addr);
+	if (!memcg_charge_ret)
+		mem_cgroup_cancel_charge(memcg, nr_pages);
+	mem_cgroup_put(memcg);
 	return ERR_PTR(-ENOSPC);
 }
 
mm/memcontrol.c (+41 -1)

@@ -7097,6 +7097,41 @@
 }
 
 /**
+ * mem_cgroup_hugetlb_try_charge - try to charge the memcg for a hugetlb folio
+ * @memcg: memcg to charge.
+ * @gfp: reclaim mode.
+ * @nr_pages: number of pages to charge.
+ *
+ * This function is called when allocating a huge page folio to determine if
+ * the memcg has the capacity for it. It does not commit the charge yet,
+ * as the hugetlb folio itself has not been obtained from the hugetlb pool.
+ *
+ * Once we have obtained the hugetlb folio, we can call
+ * mem_cgroup_commit_charge() to commit the charge. If we fail to obtain the
+ * folio, we should instead call mem_cgroup_cancel_charge() to undo the effect
+ * of try_charge().
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ */
+int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
+			long nr_pages)
+{
+	/*
+	 * If hugetlb memcg charging is not enabled, do not fail hugetlb
+	 * allocation, but do not attempt to commit the charge later (or
+	 * cancel it on error) either.
+	 */
+	if (mem_cgroup_disabled() || !memcg ||
+	    !cgroup_subsys_on_dfl(memory_cgrp_subsys) ||
+	    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+		return -EOPNOTSUPP;
+
+	if (try_charge(memcg, gfp, nr_pages))
+		return -ENOMEM;
+
+	return 0;
+}
+
+/**
  * mem_cgroup_swapin_charge_folio - Charge a newly allocated folio for swapin.
  * @folio: folio to charge.
  * @mm: mm context of the victim

@@ -7365,7 +7400,12 @@
 		return;
 
 	memcg = folio_memcg(old);
-	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
+	/*
+	 * Note that it is normal to see !memcg for a hugetlb folio.
+	 * E.g. it could have been allocated when memory_hugetlb_accounting
+	 * was not selected.
+	 */
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
 	if (!memcg)
 		return;
 
mm/migrate.c (+1 -2)

@@ -633,8 +633,7 @@
 
 	folio_copy_owner(newfolio, folio);
 
-	if (!folio_test_hugetlb(folio))
-		mem_cgroup_migrate(folio, newfolio);
+	mem_cgroup_migrate(folio, newfolio);
 }
 EXPORT_SYMBOL(folio_migrate_flags);
 