Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

fs: buffer: move allocation failure loop into the allocator

Buffer allocation has a very crude indefinite loop around waking the
flusher threads and performing global NOFS direct reclaim because it
cannot handle allocation failures.

The most immediate problem with this is that the allocation may fail due
to a memory cgroup limit, where flushers + direct reclaim might not make
any progress towards resolving the situation at all: unlike the global
case, a memory cgroup may have no cache at all, only anonymous pages
and no swap. This situation leads to a reclaim livelock with excessive
IO from waking the flushers and thrashing unrelated filesystem cache in
a tight loop.

Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
any looping happens in the page allocator, which knows how to
orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
allows memory cgroups to detect allocations that can't handle failure
and will allow them to ultimately bypass the limit if reclaim cannot
make progress.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Johannes Weiner and committed by Linus Torvalds
84235de3 49426420

+14 -2
+12 -2
fs/buffer.c
@@ -1005,9 +1005,19 @@
 	struct buffer_head *bh;
 	sector_t end_block;
 	int ret = 0;		/* Will call free_more_memory() */
+	gfp_t gfp_mask;
 
-	page = find_or_create_page(inode->i_mapping, index,
-		(mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS)|__GFP_MOVABLE);
+	gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
+	gfp_mask |= __GFP_MOVABLE;
+	/*
+	 * XXX: __getblk_slow() can not really deal with failure and
+	 * will endlessly loop on improvised global reclaim.  Prefer
+	 * looping in the allocator rather than here, at least that
+	 * code knows what it's doing.
+	 */
+	gfp_mask |= __GFP_NOFAIL;
+
+	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
 	if (!page)
 		return ret;
 
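The mask arithmetic in the fs/buffer.c hunk can be tried outside the kernel. The following is a minimal user-space sketch, not kernel code: the `DEMO_GFP_*` bit values and the helper `demo_buffer_page_gfp()` are hypothetical names invented for illustration, and the real `__GFP_*` values differ between kernel versions.

```c
#include <stdio.h>

typedef unsigned int gfp_t;

/* Placeholder bit values, assumed for illustration only. */
#define DEMO_GFP_FS      0x01u
#define DEMO_GFP_IO      0x02u
#define DEMO_GFP_MOVABLE 0x04u
#define DEMO_GFP_NOFAIL  0x08u

/*
 * Mirrors the three steps of the patched code: drop the FS bit so
 * reclaim does not re-enter the filesystem, mark the page movable,
 * then request that the allocator itself loops instead of failing.
 */
static gfp_t demo_buffer_page_gfp(gfp_t mapping_mask)
{
	gfp_t gfp_mask = mapping_mask & ~DEMO_GFP_FS;

	gfp_mask |= DEMO_GFP_MOVABLE;
	gfp_mask |= DEMO_GFP_NOFAIL;
	return gfp_mask;
}

/* Small helper to print a mask when experimenting with the sketch. */
static void demo_print_mask(gfp_t mask)
{
	printf("mask = 0x%02x\n", mask);
}
```

Calling `demo_buffer_page_gfp(DEMO_GFP_FS | DEMO_GFP_IO)` yields a mask with the IO, MOVABLE and NOFAIL bits set and the FS bit cleared, whatever the mapping's original mask contained.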
+2
mm/memcontrol.c
@@ -2766,6 +2766,8 @@
 	return 0;
 nomem:
 	*ptr = NULL;
+	if (gfp_mask & __GFP_NOFAIL)
+		return 0;
 	return -ENOMEM;
 bypass:
 	*ptr = root_mem_cgroup;
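The effect of the mm/memcontrol.c hunk on the out-of-memory exit path can be modelled in user space. This is a simplified sketch under stated assumptions: `struct demo_cgroup`, `demo_try_charge()` and `DEMO_GFP_NOFAIL` are hypothetical stand-ins, and the reclaim and retry logic that precedes the `nomem` label in the real function is omitted entirely.

```c
#include <errno.h>
#include <stddef.h>

typedef unsigned int gfp_t;

/* Placeholder bit value, assumed for illustration only. */
#define DEMO_GFP_NOFAIL 0x08u

/* Hypothetical, heavily reduced stand-in for a memory cgroup. */
struct demo_cgroup {
	long usage;	/* pages currently charged */
	long limit;	/* maximum pages allowed */
};

/*
 * Try to charge nr_pages to cg. When the limit would be exceeded,
 * follow the patched nomem path: clear *ptr, then report success
 * for NOFAIL requests and -ENOMEM for everything else.
 */
static int demo_try_charge(struct demo_cgroup *cg, long nr_pages,
			   gfp_t gfp_mask, struct demo_cgroup **ptr)
{
	if (cg->usage + nr_pages <= cg->limit) {
		cg->usage += nr_pages;
		*ptr = cg;
		return 0;
	}

	/* nomem: the limit is hit and (omitted) reclaim made no progress */
	*ptr = NULL;
	if (gfp_mask & DEMO_GFP_NOFAIL)
		return 0;
	return -ENOMEM;
}
```

In this sketch a NOFAIL request against a full group returns 0 with `*ptr` set to NULL and the group's usage unchanged, so the caller proceeds instead of retrying; the same request without the flag gets `-ENOMEM`.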