Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: introduce memalloc_nofs_{save,restore} API

The GFP_NOFS context is currently used for the following five reasons:

- to prevent deadlocks when a lock held by the allocation context
would be needed during memory reclaim

- to prevent stack overflows during reclaim because the allocation
is already performed from a deep context

- to prevent lockups when the allocation context depends on other
reclaimers to make forward progress indirectly

- just in case, because this would be safe from the FS point of view

- to silence lockdep false positives

Unfortunately, overuse of this allocation context causes problems for
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used,
so it might be used unnecessarily. We would like to get rid of those
unnecessary uses as much as possible. One way to do that is to use the
flag in scopes rather than in isolated cases. Such a scope is declared
only when really necessary and is tracked per task, and all allocation
requests from within the scope simply inherit the GFP_NOFS semantics.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM, etc.) and there is no easy way to
convey the allocation context between the layers.
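The scope idea can be sketched outside the kernel. The following is a userspace model only, not kernel code: the model_* names, the gfp bit values, and the "crypto" helper are hypothetical stand-ins, and a plain global stands in for current->flags. It shows a helper in another layer asking for an unrestricted allocation and still losing the FS bit because its caller opened a NOFS scope.

```c
/* Userspace model only; the gfp bit values below are illustrative. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_MEMALLOC_NOFS	0x00040000u

static unsigned int model_task_flags;	/* stands in for current->flags */

/* Open a NOFS scope; returns the previous state of the NOFS bit. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

	model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Close the scope by putting the NOFS bit back to its saved state. */
static void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* The mask an allocator would effectively act on for a request made now. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Hypothetical helper in another layer; it knows nothing about the FS. */
static unsigned int model_crypto_alloc(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* Hypothetical FS path holding a lock that reclaim might need. */
static unsigned int model_fs_transaction(void)
{
	unsigned int nofs_flags, seen;

	nofs_flags = model_nofs_save();	/* reclaim must not recurse into the FS */
	seen = model_crypto_alloc();	/* callee passes no special flags */
	model_nofs_restore(nofs_flags);
	return seen;
}
```

In this model the helper never mentions NOFS, which is the point of the scope: the constraint is declared once, next to the reason for it, and does not have to be threaded through every callee.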

Introduce the memalloc_nofs_{save,restore} API to control the scope of
the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which was XFS-specific until
recently. There are no PF_FSTRANS users anymore, so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and implicitly drops
__GFP_FS, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. XFS
code paths preserve their semantics. kmem_flags_convert() doesn't need
to evaluate the flag anymore.
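As a rough illustration of the masking described above, here is a small userspace model. It is a sketch under assumptions: model_gfp_context is a hypothetical stand-in for current_gfp_context that takes the task flags as a parameter, the two task-flag values mirror the sched.h hunk of this commit, and the gfp bit values are made up for the example.

```c
/* Userspace model only; gfp bit values are illustrative, not the kernel's. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_RECLAIM	0x4u	/* an unrelated bit that must survive */
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS | MODEL_GFP_RECLAIM)

#define MODEL_PF_MEMALLOC_NOFS	0x00040000u
#define MODEL_PF_MEMALLOC_NOIO	0x00080000u

/*
 * Apply the per-task scope to a requested mask. NOIO is checked first
 * because it strips a superset of what NOFS strips.
 */
static unsigned int model_gfp_context(unsigned int task_flags, unsigned int gfp)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}
```

With no scope active the mask passes through unchanged; a NOFS scope clears only the FS bit; a NOIO scope clears both the IO and FS bits whether or not a NOFS scope is also active.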

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

[akpm@linux-foundation.org: fix comment typo, reflow comment]
Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Michal Hocko and committed by Linus Torvalds
7dea19f9 9070733b

+47 -19
+1 -1
fs/xfs/kmem.h
···
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
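Why dropping the task-flag check from the XFS wrapper does not change behaviour can be checked with a small userspace model. This is only a sketch: the model_* functions are hypothetical stand-ins for kmem_flags_convert() before and after the patch and for the allocator-side scope check, the KM_NOSLEEP branch is left out, and the bit values are illustrative.

```c
/* Userspace model only; bit values are illustrative. */
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	0x3u
#define MODEL_KM_NOFS		0x1u
#define MODEL_PF_MEMALLOC_NOFS	0x00040000u

/* Before the patch: the wrapper looked at the task flag itself. */
static unsigned int model_convert_old(unsigned int task_flags,
				      unsigned int km_flags)
{
	unsigned int lflags = MODEL_GFP_KERNEL;

	if ((task_flags & MODEL_PF_MEMALLOC_NOFS) || (km_flags & MODEL_KM_NOFS))
		lflags &= ~MODEL_GFP_FS;
	return lflags;
}

/* After the patch: the wrapper only honours the explicit KM_NOFS request. */
static unsigned int model_convert_new(unsigned int km_flags)
{
	unsigned int lflags = MODEL_GFP_KERNEL;

	if (km_flags & MODEL_KM_NOFS)
		lflags &= ~MODEL_GFP_FS;
	return lflags;
}

/* After the patch: the allocator applies the task scope centrally. */
static unsigned int model_alloc_context(unsigned int task_flags,
					unsigned int gfp)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Returns 1 when the old and new paths agree for every input combination. */
static int model_paths_agree(void)
{
	unsigned int t, k;

	for (t = 0; t < 2; t++) {
		for (k = 0; k < 2; k++) {
			unsigned int task = t ? MODEL_PF_MEMALLOC_NOFS : 0;
			unsigned int km = k ? MODEL_KM_NOFS : 0;

			if (model_convert_old(task, km) !=
			    model_alloc_context(task, model_convert_new(km)))
				return 0;
		}
	}
	return 1;
}
```

In the model the check has simply moved from the caller-side wrapper to the allocator, so the effective mask is the same for all four combinations of task scope and KM_NOFS.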
+8
include/linux/gfp.h
···
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
+3 -5
include/linux/sched.h
···
 #define PF_USED_ASYNC	0x00004000	/* Used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* This thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* Frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* Inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* Randomize virtual address space */
···
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000	/* This thread called freeze_processes() and should not be frozen */
-
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */

 /*
  * Only the _current_ task can read/write to tsk->flags, but other
+23 -3
include/linux/sched/mm.h
···
 	return ret;
 }

-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }

···
 static inline void memalloc_noio_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
+}
+
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
 }

 #endif /* _LINUX_SCHED_MM_H */
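One property of the save/restore pair in this hunk is that scopes nest: save returns the previous state of the bit, so an inner restore does not end an outer scope. The userspace model below sketches this under assumptions; the model_* names are hypothetical and a plain global stands in for current->flags.

```c
/* Userspace model only; a global stands in for current->flags. */
#define MODEL_PF_MEMALLOC_NOFS	0x00040000u

static unsigned int model_task_flags;

/* Mirrors the save helper: remember the old bit, then set it. */
static unsigned int model_nofs_save(void)
{
	unsigned int flags = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

	model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return flags;
}

/* Mirrors the restore helper: clear the bit, then OR the saved value in. */
static void model_nofs_restore(unsigned int flags)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | flags;
}

static int model_in_nofs_scope(void)
{
	return (model_task_flags & MODEL_PF_MEMALLOC_NOFS) != 0;
}

/* Returns 1 if the outer scope is still active after the inner one ends. */
static int model_nested_scopes(void)
{
	unsigned int outer, inner;
	int still_nofs;

	outer = model_nofs_save();	/* bit was clear, so outer == 0 */
	inner = model_nofs_save();	/* bit already set, so inner carries it */
	model_nofs_restore(inner);	/* puts the set bit back */
	still_nofs = model_in_nofs_scope();
	model_nofs_restore(outer);	/* outermost restore clears the bit */
	return still_nofs;
}
```

This is why the saved value has to be passed back to restore rather than the bit being cleared unconditionally: a callee that opens its own scope cannot accidentally lift a constraint its caller still depends on.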
+3 -3
kernel/locking/lockdep.c
···
 	if (unlikely(!debug_locks))
 		return;

-	gfp_mask = memalloc_noio_flags(gfp_mask);
+	gfp_mask = current_gfp_context(gfp_mask);

 	/* no reclaim without waiting on it */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
···
 		return;

 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;

 	/*
···

 void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
 {
-	current->lockdep_reclaim_gfp = memalloc_noio_flags(gfp_mask);
+	current->lockdep_reclaim_gfp = current_gfp_context(gfp_mask);
 }

 void lockdep_clear_current_reclaim_state(void)
+6 -4
mm/page_alloc.c
···
 		goto out;

 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about GFP_NOFS
+	 * resp. GFP_NOIO which has to be inherited for all allocation requests
+	 * from a particular context which has been marked by
+	 * memalloc_no{fs,io}_{save,restore}.
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;

 	/*
···
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
-		.gfp_mask = memalloc_noio_flags(gfp_mask),
+		.gfp_mask = current_gfp_context(gfp_mask),
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
+3 -3
mm/vmscan.c
···
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
···
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
···
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),