Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage

Setting the original memory.limit_in_bytes hardlimit is subject to a
race condition when the desired value is below the current usage. The
code tries a few times to first reclaim and then see if the usage has
dropped to where we would like it to be, but there is no locking, and
the workload is free to continue making new charges up to the old limit.
Thus, attempting to shrink a workload relies on pure luck and hope that
the workload happens to cooperate.

To fix this in the cgroup2 memory.max knob, do it the other way round:
set the limit first, then try enforcement. And if reclaim is not able
to succeed, trigger OOM kills in the group. Keep going until the new
limit is met, we run out of OOM victims and there's only unreclaimable
memory left, or the task writing to memory.max is killed. This allows
users to shrink groups reliably, and the behavior is consistent with
what happens when new charges are attempted in excess of memory.max.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Johannes Weiner, committed by Linus Torvalds
b6e6edcf 588083bb

+40 -4

Documentation/cgroup-v2.txt +6
···
1387 1387   limit this type of spillover and ultimately contain buggy or even
1388 1388   malicious applications.
1389 1389   
     1390 + Setting the original memory.limit_in_bytes below the current usage was
     1391 + subject to a race condition, where concurrent charges could cause the
     1392 + limit setting to fail. memory.max on the other hand will first set the
     1393 + limit to prevent new charges, and then reclaim and OOM kill until the
     1394 + new limit is met - or the task writing to memory.max is killed.
     1395 + 
1390 1396   The combined memory+swap accounting and limiting is replaced by real
1391 1397   control over swap space.
1392 1398   
mm/memcontrol.c +34 -4
···
1236 1236   	return limit;
1237 1237   }
1238 1238   
1239      - static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
     1239 + static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
1240 1240   				     int order)
1241 1241   {
1242 1242   	struct oom_control oc = {
···
1314 1314   	}
1315 1315   unlock:
1316 1316   	mutex_unlock(&oom_lock);
     1317 + 	return chosen;
1317 1318   }
1318 1319   
1319 1320   #if MAX_NUMNODES > 1
···
5030 5029   			 char *buf, size_t nbytes, loff_t off)
5031 5030   {
5032 5031   	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
     5032 + 	unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
     5033 + 	bool drained = false;
5033 5034   	unsigned long max;
5034 5035   	int err;
5035 5036   
···
5040 5037   	if (err)
5041 5038   		return err;
5042 5039   
5043      - 	err = mem_cgroup_resize_limit(memcg, max);
5044      - 	if (err)
5045      - 		return err;
     5040 + 	xchg(&memcg->memory.limit, max);
     5041 + 
     5042 + 	for (;;) {
     5043 + 		unsigned long nr_pages = page_counter_read(&memcg->memory);
     5044 + 
     5045 + 		if (nr_pages <= max)
     5046 + 			break;
     5047 + 
     5048 + 		if (signal_pending(current)) {
     5049 + 			err = -EINTR;
     5050 + 			break;
     5051 + 		}
     5052 + 
     5053 + 		if (!drained) {
     5054 + 			drain_all_stock(memcg);
     5055 + 			drained = true;
     5056 + 			continue;
     5057 + 		}
     5058 + 
     5059 + 		if (nr_reclaims) {
     5060 + 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
     5061 + 							  GFP_KERNEL, true))
     5062 + 				nr_reclaims--;
     5063 + 			continue;
     5064 + 		}
     5065 + 
     5066 + 		mem_cgroup_events(memcg, MEMCG_OOM, 1);
     5067 + 		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
     5068 + 			break;
     5069 + 	}
5046 5070   
5047 5071   	memcg_wb_domain_size_changed(memcg);
5048 5072   	return nbytes;