Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mlock: mlocked pages are unevictable

Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.

Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>

splitlru: introduce __get_user_pages():

The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
because the current get_user_pages() can't grab PROT_NONE pages, and
therefore PROT_NONE pages can't be munlocked.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Nick Piggin, committed by Linus Torvalds
b291f000 89e004ea
+816 -90 (total)

include/linux/mm.h (+5)
@@ -132 +132 @@
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)
 
 /*
+ * special vmas that are non-mergable, non-mlock()able
+ */
+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+
+/*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
  */
include/linux/page-flags.h (+16 -3)

@@ -96 +96 @@
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 #ifdef CONFIG_UNEVICTABLE_LRU
 	PG_unevictable,		/* Page is "unevictable" */
+	PG_mlocked,		/* Page is vma mlocked */
 #endif
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
@@ -233 +232 @@
 #ifdef CONFIG_UNEVICTABLE_LRU
 PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
 	TESTCLEARFLAG(Unevictable, unevictable)
+
+#define MLOCK_PAGES 1
+PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
+	TESTSCFLAG(Mlocked, mlocked)
+
 #else
+
+#define MLOCK_PAGES 0
+PAGEFLAG_FALSE(Mlocked)
+	SETPAGEFLAG_NOOP(Mlocked) TESTCLEARFLAG_FALSE(Mlocked)
+
 PAGEFLAG_FALSE(Unevictable) TESTCLEARFLAG_FALSE(Unevictable)
 	SETPAGEFLAG_NOOP(Unevictable) CLEARPAGEFLAG_NOOP(Unevictable)
 	__CLEARPAGEFLAG_NOOP(Unevictable)
@@ -365 +354 @@
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_UNEVICTABLE_LRU
-#define __PG_UNEVICTABLE	(1 << PG_unevictable)
+#define __PG_UNEVICTABLE	(1 << PG_unevictable)
+#define __PG_MLOCKED		(1 << PG_mlocked)
 #else
-#define __PG_UNEVICTABLE	0
+#define __PG_UNEVICTABLE	0
+#define __PG_MLOCKED		0
 #endif
 
 #define PAGE_FLAGS	(1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
			 1 << PG_buddy | 1 << PG_writeback | \
			 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
-			 __PG_UNEVICTABLE)
+			 __PG_UNEVICTABLE | __PG_MLOCKED)
 
 /*
  * Flags checked in bad_page(). Pages on the free list should not have
include/linux/rmap.h (+14)

@@ -117 +117 @@
  */
 int page_mkclean(struct page *);
 
+#ifdef CONFIG_UNEVICTABLE_LRU
+/*
+ * called in munlock()/munmap() path to check for other vmas holding
+ * the page mlocked.
+ */
+int try_to_munlock(struct page *);
+#else
+static inline int try_to_munlock(struct page *page)
+{
+	return 0;	/* a.k.a. SWAP_SUCCESS */
+}
+#endif
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
@@ -140 +153 @@
 #define SWAP_SUCCESS	0
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
+#define SWAP_MLOCK	3
 
 #endif	/* _LINUX_RMAP_H */
mm/internal.h (+71)

@@ -61 +61 @@
 	return page_private(page);
 }
 
+extern int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
+extern void munlock_vma_pages_all(struct vm_area_struct *vma);
+
 #ifdef CONFIG_UNEVICTABLE_LRU
 /*
  * unevictable_migrate_page() called only from migrate_page_copy() to
@@ -79 +83 @@
 }
 #endif
 
+#ifdef CONFIG_UNEVICTABLE_LRU
+/*
+ * Called only in fault path via page_evictable() for a new page
+ * to determine if it's being mapped into a LOCKED vma.
+ * If so, mark page as mlocked.
+ */
+static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
+{
+	VM_BUG_ON(PageLRU(page));
+
+	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+		return 0;
+
+	SetPageMlocked(page);
+	return 1;
+}
+
+/*
+ * must be called with vma's mmap_sem held for read, and page locked.
+ */
+extern void mlock_vma_page(struct page *page);
+
+/*
+ * Clear the page's PageMlocked().  This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache -- e.g.,
+ * on truncation or freeing.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
+ */
+extern void __clear_page_mlock(struct page *page);
+static inline void clear_page_mlock(struct page *page)
+{
+	if (unlikely(TestClearPageMlocked(page)))
+		__clear_page_mlock(page);
+}
+
+/*
+ * mlock_migrate_page - called only from migrate_page_copy() to
+ * migrate the Mlocked page flag
+ */
+static inline void mlock_migrate_page(struct page *newpage, struct page *page)
+{
+	if (TestClearPageMlocked(page))
+		SetPageMlocked(newpage);
+}
+
+
+#else /* CONFIG_UNEVICTABLE_LRU */
+static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
+{
+	return 0;
+}
+static inline void clear_page_mlock(struct page *page) { }
+static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_migrate_page(struct page *new, struct page *old) { }
+
+#endif /* CONFIG_UNEVICTABLE_LRU */
 
 /*
  * FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
@@ -147 +210 @@
 {
 }
 #endif /* CONFIG_SPARSEMEM */
+
+#define GUP_FLAGS_WRITE			0x1
+#define GUP_FLAGS_FORCE			0x2
+#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS	0x4
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas);
 
 #endif
mm/memory.c (+49 -7)

@@ -64 +64 @@
 
 #include "internal.h"
 
+#include "internal.h"
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
@@ -1129 +1131 @@
 	return !vma->vm_ops || !vma->vm_ops->fault;
 }
 
-int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, int len, int write, int force,
+
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int len, int flags,
 		struct page **pages, struct vm_area_struct **vmas)
 {
 	int i;
-	unsigned int vm_flags;
+	unsigned int vm_flags = 0;
+	int write = !!(flags & GUP_FLAGS_WRITE);
+	int force = !!(flags & GUP_FLAGS_FORCE);
+	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 
 	if (len <= 0)
 		return 0;
@@ -1158 +1165 @@
 			pud_t *pud;
 			pmd_t *pmd;
 			pte_t *pte;
-			if (write) /* user gate pages are read-only */
+
+			/* user gate pages are read-only */
+			if (!ignore && write)
 				return i ? : -EFAULT;
 			if (pg > TASK_SIZE)
 				pgd = pgd_offset_k(pg);
@@ -1190 +1199 @@
 			continue;
 		}
 
-		if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
-				|| !(vm_flags & vma->vm_flags))
+		if (!vma ||
+		    (vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
+		    (!ignore && !(vm_flags & vma->vm_flags)))
 			return i ? : -EFAULT;
 
 		if (is_vm_hugetlb_page(vma)) {
@@ -1266 +1276 @@
 	} while (len);
 	return i;
 }
+
+int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, int len, int write, int force,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = 0;
+
+	if (write)
+		flags |= GUP_FLAGS_WRITE;
+	if (force)
+		flags |= GUP_FLAGS_FORCE;
+
+	return __get_user_pages(tsk, mm,
+				start, len, flags,
+				pages, vmas);
+}
+
 EXPORT_SYMBOL(get_user_pages);
 
 pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
@@ -1858 +1885 @@
 	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
 	if (!new_page)
 		goto oom;
+	/*
+	 * Don't let another task, with possibly unlocked vma,
+	 * keep the mlocked page.
+	 */
+	if (vma->vm_flags & VM_LOCKED) {
+		lock_page(old_page);	/* for LRU manipulation */
+		clear_page_mlock(old_page);
+		unlock_page(old_page);
+	}
 	cow_user_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
@@ -2325 +2361 @@
 	page_add_anon_rmap(page, vma, address);
 
 	swap_free(entry);
-	if (vm_swap_full())
+	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		remove_exclusive_swap_page(page);
 	unlock_page(page);
 
@@ -2465 +2501 @@
 				ret = VM_FAULT_OOM;
 				goto out;
 			}
+			/*
+			 * Don't let another task, with possibly unlocked vma,
+			 * keep the mlocked page.
+			 */
+			if (vma->vm_flags & VM_LOCKED)
+				clear_page_mlock(vmf.page);
 			copy_user_highpage(page, vmf.page, address, vma);
 			__SetPageUptodate(page);
 		} else {
mm/migrate.c (+2)

@@ -371 +371 @@
 		__set_page_dirty_nobuffers(newpage);
 	}
 
+	mlock_migrate_page(newpage, page);
+
 #ifdef CONFIG_SWAP
 	ClearPageSwapCache(page);
 #endif
mm/mlock.c (+374 -18)

@@ -8 +8 @@
 #include <linux/capability.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <linux/rmap.h>
+#include <linux/mmzone.h>
+#include <linux/hugetlb.h>
+
+#include "internal.h"
 
 int can_do_mlock(void)
 {
@@ -23 +31 @@
 }
 EXPORT_SYMBOL(can_do_mlock);
 
+#ifdef CONFIG_UNEVICTABLE_LRU
+/*
+ * Mlocked pages are marked with PageMlocked() flag for efficient testing
+ * in vmscan and, possibly, the fault path; and to support semi-accurate
+ * statistics.
+ *
+ * An mlocked page [PageMlocked(page)] is unevictable.  As such, it will
+ * be placed on the LRU "unevictable" list, rather than the [in]active lists.
+ * The unevictable list is an LRU sibling list to the [in]active lists.
+ * PageUnevictable is set to indicate the unevictable state.
+ *
+ * When lazy mlocking via vmscan, it is important to ensure that the
+ * vma's VM_LOCKED status is not concurrently being modified, otherwise we
+ * may have mlocked a page that is being munlocked. So lazy mlock must take
+ * the mmap_sem for read, and verify that the vma really is locked
+ * (see mm/rmap.c).
+ */
+
+/*
+ * LRU accounting for clear_page_mlock()
+ */
+void __clear_page_mlock(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+
+	if (!page->mapping) {	/* truncated ? */
+		return;
+	}
+
+	if (!isolate_lru_page(page)) {
+		putback_lru_page(page);
+	} else {
+		/*
+		 * Page not on the LRU yet.  Flush all pagevecs and retry.
+		 */
+		lru_add_drain_all();
+		if (!isolate_lru_page(page))
+			putback_lru_page(page);
+	}
+}
+
+/*
+ * Mark page as mlocked if not already.
+ * If page on LRU, isolate and putback to move to unevictable list.
+ */
+void mlock_vma_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (!TestSetPageMlocked(page) && !isolate_lru_page(page))
+		putback_lru_page(page);
+}
+
+/*
+ * called from munlock()/munmap() path with page supposedly on the LRU.
+ *
+ * Note:  unlike mlock_vma_page(), we can't just clear the PageMlocked
+ * [in try_to_munlock()] and then attempt to isolate the page.  We must
+ * isolate the page to keep others from messing with its unevictable
+ * and mlocked state while trying to munlock.  However, we pre-clear the
+ * mlocked state anyway as we might lose the isolation race and we might
+ * not get another chance to clear PageMlocked.  If we successfully
+ * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
+ * mapping the page, it will restore the PageMlocked state, unless the page
+ * is mapped in a non-linear vma.  So, we go ahead and SetPageMlocked(),
+ * perhaps redundantly.
+ * If we lose the isolation race, and the page is mapped by other VM_LOCKED
+ * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
+ * either of which will restore the PageMlocked state by calling
+ * mlock_vma_page() above, if it can grab the vma's mmap sem.
+ */
+static void munlock_vma_page(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+
+	if (TestClearPageMlocked(page) && !isolate_lru_page(page)) {
+		try_to_munlock(page);
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * mlock a range of pages in the vma.
+ *
+ * This takes care of making the pages present too.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+static int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start;
+	struct page *pages[16]; /* 16 gives a reasonable batch */
+	int write = !!(vma->vm_flags & VM_WRITE);
+	int nr_pages = (end - start) / PAGE_SIZE;
+	int ret;
+
+	VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+	VM_BUG_ON(start < vma->vm_start || end > vma->vm_end);
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+
+	lru_add_drain_all();	/* push cached pages to LRU */
+
+	while (nr_pages > 0) {
+		int i;
+
+		cond_resched();
+
+		/*
+		 * get_user_pages makes pages present if we are
+		 * setting mlock. and this extra reference count will
+		 * disable migration of this page.  However, page may
+		 * still be truncated out from under us.
+		 */
+		ret = get_user_pages(current, mm, addr,
+				min_t(int, nr_pages, ARRAY_SIZE(pages)),
+				write, 0, pages, NULL);
+		/*
+		 * This can happen for, e.g., VM_NONLINEAR regions before
+		 * a page has been allocated and mapped at a given offset,
+		 * or for addresses that map beyond end of a file.
+		 * We'll mlock the pages if/when they get faulted in.
+		 */
+		if (ret < 0)
+			break;
+		if (ret == 0) {
+			/*
+			 * We know the vma is there, so the only time
+			 * we cannot get a single page should be an
+			 * error (ret < 0) case.
+			 */
+			WARN_ON(1);
+			break;
+		}
+
+		lru_add_drain();	/* push cached pages to LRU */
+
+		for (i = 0; i < ret; i++) {
+			struct page *page = pages[i];
+
+			lock_page(page);
+			/*
+			 * Because we lock page here and migration is blocked
+			 * by the elevated reference, we need only check for
+			 * page truncation (file-cache only).
+			 */
+			if (page->mapping)
+				mlock_vma_page(page);
+			unlock_page(page);
+			put_page(page);		/* ref from get_user_pages() */
+
+			/*
+			 * here we assume that get_user_pages() has given us
+			 * a list of virtually contiguous pages.
+			 */
+			addr += PAGE_SIZE;	/* for next get_user_pages() */
+			nr_pages--;
+		}
+	}
+
+	lru_add_drain_all();	/* to update stats */
+
+	return 0;	/* count entire vma as locked_vm */
+}
+
+/*
+ * private structure for munlock page table walk
+ */
+struct munlock_page_walk {
+	struct vm_area_struct *vma;
+	pmd_t *pmd; /* for migration_entry_wait() */
+};
+
+/*
+ * munlock normal pages for present ptes
+ */
+static int __munlock_pte_handler(pte_t *ptep, unsigned long addr,
+				   unsigned long end, struct mm_walk *walk)
+{
+	struct munlock_page_walk *mpw = walk->private;
+	swp_entry_t entry;
+	struct page *page;
+	pte_t pte;
+
+retry:
+	pte = *ptep;
+	/*
+	 * If it's a swap pte, we might be racing with page migration.
+	 */
+	if (unlikely(!pte_present(pte))) {
+		if (!is_swap_pte(pte))
+			goto out;
+		entry = pte_to_swp_entry(pte);
+		if (is_migration_entry(entry)) {
+			migration_entry_wait(mpw->vma->vm_mm, mpw->pmd, addr);
+			goto retry;
+		}
+		goto out;
+	}
+
+	page = vm_normal_page(mpw->vma, addr, pte);
+	if (!page)
+		goto out;
+
+	lock_page(page);
+	if (!page->mapping) {
+		unlock_page(page);
+		goto retry;
+	}
+	munlock_vma_page(page);
+	unlock_page(page);
+
+out:
+	return 0;
+}
+
+/*
+ * Save pmd for pte handler for waiting on migration entries
+ */
+static int __munlock_pmd_handler(pmd_t *pmd, unsigned long addr,
+				 unsigned long end, struct mm_walk *walk)
+{
+	struct munlock_page_walk *mpw = walk->private;
+
+	mpw->pmd = pmd;
+	return 0;
+}
+
+
+/*
+ * munlock a range of pages in the vma using standard page table walk.
+ *
+ * vma->vm_mm->mmap_sem must be held for write.
+ */
+static void __munlock_vma_pages_range(struct vm_area_struct *vma,
+			      unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct munlock_page_walk mpw = {
+		.vma = vma,
+	};
+	struct mm_walk munlock_page_walk = {
+		.pmd_entry = __munlock_pmd_handler,
+		.pte_entry = __munlock_pte_handler,
+		.private = &mpw,
+		.mm = mm,
+	};
+
+	VM_BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK);
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+	VM_BUG_ON(start < vma->vm_start);
+	VM_BUG_ON(end > vma->vm_end);
+
+	lru_add_drain_all();	/* push cached pages to LRU */
+	walk_page_range(start, end, &munlock_page_walk);
+	lru_add_drain_all();	/* to update stats */
+}
+
+#else /* CONFIG_UNEVICTABLE_LRU */
+
+/*
+ * Just make pages present if VM_LOCKED.  No-op if unlocking.
+ */
+static int __mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	if (vma->vm_flags & VM_LOCKED)
+		make_pages_present(start, end);
+	return 0;
+}
+
+/*
+ * munlock a range of pages in the vma -- no-op.
+ */
+static void __munlock_vma_pages_range(struct vm_area_struct *vma,
+			      unsigned long start, unsigned long end)
+{
+}
+#endif /* CONFIG_UNEVICTABLE_LRU */
+
+/*
+ * mlock all pages in this vma range.  For mmap()/mremap()/...
+ */
+int mlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+	int nr_pages = (end - start) / PAGE_SIZE;
+	BUG_ON(!(vma->vm_flags & VM_LOCKED));
+
+	/*
+	 * filter unlockable vmas
+	 */
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		goto no_mlock;
+
+	if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+			is_vm_hugetlb_page(vma) ||
+			vma == get_gate_vma(current)))
+		return __mlock_vma_pages_range(vma, start, end);
+
+	/*
+	 * User mapped kernel pages or huge pages:
+	 * make these pages present to populate the ptes, but
+	 * fall thru' to reset VM_LOCKED--no need to unlock, and
+	 * return nr_pages so these don't get counted against task's
+	 * locked limit.  huge pages are already counted against
+	 * locked vm limit.
+	 */
+	make_pages_present(start, end);
+
+no_mlock:
+	vma->vm_flags &= ~VM_LOCKED;	/* and don't come back! */
+	return nr_pages;		/* pages NOT mlocked */
+}
+
+
+/*
+ * munlock all pages in vma.   For munmap() and exit().
+ */
+void munlock_vma_pages_all(struct vm_area_struct *vma)
+{
+	vma->vm_flags &= ~VM_LOCKED;
+	__munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+}
+
+/*
+ * mlock_fixup  - handle mlock[all]/munlock[all] requests.
+ *
+ * Filters out "special" vmas -- VM_LOCKED never gets set for these, and
+ * munlock is a no-op.  However, for some special vmas, we go ahead and
+ * populate the ptes via make_pages_present().
+ *
+ * For vmas that pass the filters, merge/split as appropriate.
+ */
 static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	unsigned long start, unsigned long end, unsigned int newflags)
 {
-	struct mm_struct * mm = vma->vm_mm;
+	struct mm_struct *mm = vma->vm_mm;
 	pgoff_t pgoff;
-	int pages;
+	int nr_pages;
 	int ret = 0;
+	int lock = newflags & VM_LOCKED;
 
-	if (newflags == vma->vm_flags) {
-		*prev = vma;
-		goto out;
+	if (newflags == vma->vm_flags ||
+			(vma->vm_flags & (VM_IO | VM_PFNMAP)))
+		goto out;	/* don't set VM_LOCKED,  don't count */
+
+	if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+			is_vm_hugetlb_page(vma) ||
+			vma == get_gate_vma(current)) {
+		if (lock)
+			make_pages_present(start, end);
+		goto out;	/* don't set VM_LOCKED,  don't count */
 	}
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -43 +394 @@
 		vma = *prev;
 		goto success;
 	}
-
-	*prev = vma;
 
 	if (start != vma->vm_start) {
 		ret = split_vma(mm, vma, start, 1);
@@ -60 +409 @@
 
 success:
 	/*
+	 * Keep track of amount of locked VM.
+	 */
+	nr_pages = (end - start) >> PAGE_SHIFT;
+	if (!lock)
+		nr_pages = -nr_pages;
+	mm->locked_vm += nr_pages;
+
+	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
-	 * set VM_LOCKED, make_pages_present below will bring it back.
+	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 	vma->vm_flags = newflags;
 
-	/*
-	 * Keep track of amount of locked VM.
-	 */
-	pages = (end - start) >> PAGE_SHIFT;
-	if (newflags & VM_LOCKED) {
-		pages = -pages;
-		if (!(newflags & VM_IO))
-			ret = make_pages_present(start, end);
-	}
+	if (lock) {
+		ret = __mlock_vma_pages_range(vma, start, end);
+		if (ret > 0) {
+			mm->locked_vm -= ret;
+			ret = 0;
+		}
+	} else
+		__munlock_vma_pages_range(vma, start, end);
 
-	mm->locked_vm -= pages;
 out:
+	*prev = vma;
 	return ret;
 }
mm/mmap.c (-2)

@@ -662 +662 @@
  * If the vma has a ->close operation then the driver probably needs to release
  * per-vma resources, so we don't attempt to merge those.
  */
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
-
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 			struct file *file, unsigned long vm_flags)
 {
mm/nommu.c (+33 -11)

@@ -34 +34 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 void *high_memory;
 struct page *mem_map;
 unsigned long max_mapnr;
@@ -128 +130 @@
 	return PAGE_SIZE << compound_order(page);
 }
 
-/*
- * get a list of pages in an address range belonging to the specified process
- * and indicate the VMA that covers each page
- * - this is potentially dodgy as we may end incrementing the page count of a
- *   slab page or a secondary page from a compound page
- * - don't permit access to VMAs that don't support it, such as I/O mappings
- */
-int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-	unsigned long start, int len, int write, int force,
-	struct page **pages, struct vm_area_struct **vmas)
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas)
 {
 	struct vm_area_struct *vma;
 	unsigned long vm_flags;
 	int i;
+	int write = !!(flags & GUP_FLAGS_WRITE);
+	int force = !!(flags & GUP_FLAGS_FORCE);
+	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 
 	/* calculate required read or write permissions.
 	 * - if 'force' is set, we only require the "MAY" flags.
@@ -156 +154 @@
 
 		/* protect what we can, including chardevs */
 		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
-		    !(vm_flags & vma->vm_flags))
+		    (!ignore && !(vm_flags & vma->vm_flags)))
 			goto finish_or_fault;
 
 		if (pages) {
@@ -173 +171 @@
 
 finish_or_fault:
 	return i ? : -EFAULT;
+}
+
+
+/*
+ * get a list of pages in an address range belonging to the specified process
+ * and indicate the VMA that covers each page
+ * - this is potentially dodgy as we may end incrementing the page count of a
+ *   slab page or a secondary page from a compound page
+ * - don't permit access to VMAs that don't support it, such as I/O mappings
+ */
+int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+	unsigned long start, int len, int write, int force,
+	struct page **pages, struct vm_area_struct **vmas)
+{
+	int flags = 0;
+
+	if (write)
+		flags |= GUP_FLAGS_WRITE;
+	if (force)
+		flags |= GUP_FLAGS_FORCE;
+
+	return __get_user_pages(tsk, mm,
+				start, len, flags,
+				pages, vmas);
 }
 EXPORT_SYMBOL(get_user_pages);
mm/page_alloc.c (+5 -1)

@@ -616 +616 @@
 
 	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim |
 			1 << PG_referenced | 1 << PG_arch_1 |
-			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
+			1 << PG_owner_priv_1 | 1 << PG_mappedtodisk
+#ifdef CONFIG_UNEVICTABLE_LRU
+			| 1 << PG_mlocked
+#endif
+			);
 	set_page_private(page, 0);
 	set_page_refcounted(page);
mm/rmap.c (+220 -37)

@@ -53 +53 @@
 
 #include <asm/tlbflush.h>
 
+#include "internal.h"
+
 struct kmem_cache *anon_vma_cachep;
 
 /**
@@ -290 +292 @@
 	return NULL;
 }
 
+/**
+ * page_mapped_in_vma - check whether a page is really mapped in a VMA
+ * @page: the page to test
+ * @vma: the VMA to test
+ *
+ * Returns 1 if the page is mapped into the page tables of the VMA, 0
+ * if the page is not mapped into the page tables of this VMA.  Only
+ * valid for normal file or anonymous VMAs.
+ */
+static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+{
+	unsigned long address;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)		/* out of vma range */
+		return 0;
+	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1);
+	if (!pte)			/* the page is not in this mm */
+		return 0;
+	pte_unmap_unlock(pte, ptl);
+
+	return 1;
+}
+
 /*
  * Subfunctions of page_referenced: page_referenced_one called
  * repeatedly from either page_referenced_anon or page_referenced_file.
@@ -311 +339 @@
 	if (!pte)
 		goto out;
 
+	/*
+	 * Don't want to elevate referenced for mlocked page that gets this far,
+	 * in order that it progresses to try_to_unmap and is moved to the
+	 * unevictable list.
+	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young_notify(vma, address, pte))
+		goto out_unmap;
+	}
+
+	if (ptep_clear_flush_young_notify(vma, address, pte))
 		referenced++;
 
 	/* Pretend the page is referenced if the task has the
@@ -323 +358 @@
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
+out_unmap:
 	(*mapcount)--;
 	pte_unmap_unlock(pte, ptl);
 out:
@@ -412 +448 @@
 		 */
 		if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
 			continue;
-		if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
-				  == (VM_LOCKED|VM_MAYSHARE)) {
-			referenced++;
-			break;
-		}
 		referenced += page_referenced_one(page, vma, &mapcount);
 		if (!mapcount)
 			break;
@@ -739 +770 @@
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young_notify(vma, address, pte)))) {
-		ret = SWAP_FAIL;
-		goto out_unmap;
-	}
+	if (!migration) {
+		if (vma->vm_flags & VM_LOCKED) {
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+	}
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
@@ -824 +860 @@
  * For very sparsely populated VMAs this is a little inefficient - chances are
  * there there won't be many ptes located within the scan cluster.  In this case
  * maybe we could scan further - to the end of the pte page, perhaps.
+ *
+ * Mlocked pages:  check VM_LOCKED under mmap_sem held for read, if we can
+ * acquire it without blocking.  If vma locked, mlock the pages in the cluster,
+ * rather than unmapping them.  If we encounter the "check_page" that vmscan is
+ * trying to unmap, return SWAP_MLOCK, else default SWAP_AGAIN.
  */
 #define CLUSTER_SIZE	min(32*PAGE_SIZE, PMD_SIZE)
 #define CLUSTER_MASK	(~(CLUSTER_SIZE - 1))
 
-static void try_to_unmap_cluster(unsigned long cursor,
-				 unsigned int *mapcount, struct vm_area_struct *vma)
+static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
+		struct vm_area_struct *vma, struct page *check_page)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -841 +882 @@
 	struct page *page;
 	unsigned long address;
 	unsigned long end;
+	int ret = SWAP_AGAIN;
+	int locked_vma = 0;
 
 	address = (vma->vm_start + cursor) & CLUSTER_MASK;
 	end = address + CLUSTER_SIZE;
@@ -851 +894 @@
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
-		return;
+		return ret;
 
 	pud = pud_offset(pgd, address);
 	if (!pud_present(*pud))
-		return;
+		return ret;
 
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
-		return;
+		return ret;
+
+	/*
+	 * MLOCK_PAGES => feature is configured.
+	 * if we can acquire the mmap_sem for read, and vma is VM_LOCKED,
+	 * keep the sem while scanning the cluster for mlocking pages.
+	 */
+	if (MLOCK_PAGES && down_read_trylock(&vma->vm_mm->mmap_sem)) {
+		locked_vma = (vma->vm_flags & VM_LOCKED);
+		if (!locked_vma)
+			up_read(&vma->vm_mm->mmap_sem); /* don't need it */
+	}
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 
@@ -871 +925 @@
 			continue;
 		page = vm_normal_page(vma, address, *pte);
 		BUG_ON(!page || PageAnon(page));
+
+		if (locked_vma) {
+			mlock_vma_page(page);	/* no-op if already mlocked */
+			if (page == check_page)
+				ret = SWAP_MLOCK;
+			continue;	/* don't unmap */
+		}
 
 		if (ptep_clear_flush_young_notify(vma, address, pte))
 			continue;
@@ -893 +954 @@
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	if (locked_vma)
+		up_read(&vma->vm_mm->mmap_sem);
+	return ret;
 }
 
-static int try_to_unmap_anon(struct page *page, int migration)
+/*
+ * common handling for pages mapped in VM_LOCKED vmas
+ */
+static int try_to_mlock_page(struct page *page, struct vm_area_struct *vma)
+{
+	int mlocked = 0;
+
+	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+		if (vma->vm_flags & VM_LOCKED) {
+			mlock_vma_page(page);
+			mlocked++;	/* really mlocked the page */
+		}
+		up_read(&vma->vm_mm->mmap_sem);
+	}
+	return mlocked;
+}
+
+/**
+ * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
+ * rmap method
+ * @page: the page to unmap/unlock
+ * @unlock:  request for unlock rather than unmap [unlikely]
+ * @migration:  unmapping for migration - ignored if @unlock
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the anon_vma struct it points to.
+ *
+ * This function is only called from try_to_unmap/try_to_munlock for
+ * anonymous pages.
930 + * When called from try_to_munlock(), the mmap_sem of the mm containing the vma 931 + * where the page was found will be held for write. So, we won't recheck 932 + * vm_flags for that VMA. That should be OK, because that vma shouldn't be 933 + * 'LOCKED. 934 + */ 935 + static int try_to_unmap_anon(struct page *page, int unlock, int migration) 960 936 { 961 937 struct anon_vma *anon_vma; 962 938 struct vm_area_struct *vma; 939 + unsigned int mlocked = 0; 963 940 int ret = SWAP_AGAIN; 941 + 942 + if (MLOCK_PAGES && unlikely(unlock)) 943 + ret = SWAP_SUCCESS; /* default for try_to_munlock() */ 964 944 965 945 anon_vma = page_lock_anon_vma(page); 966 946 if (!anon_vma) 967 947 return ret; 968 948 969 949 list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { 970 - ret = try_to_unmap_one(page, vma, migration); 971 - if (ret == SWAP_FAIL || !page_mapped(page)) 972 - break; 950 + if (MLOCK_PAGES && unlikely(unlock)) { 951 + if (!((vma->vm_flags & VM_LOCKED) && 952 + page_mapped_in_vma(page, vma))) 953 + continue; /* must visit all unlocked vmas */ 954 + ret = SWAP_MLOCK; /* saw at least one mlocked vma */ 955 + } else { 956 + ret = try_to_unmap_one(page, vma, migration); 957 + if (ret == SWAP_FAIL || !page_mapped(page)) 958 + break; 959 + } 960 + if (ret == SWAP_MLOCK) { 961 + mlocked = try_to_mlock_page(page, vma); 962 + if (mlocked) 963 + break; /* stop if actually mlocked page */ 964 + } 973 965 } 974 966 975 967 page_unlock_anon_vma(anon_vma); 968 + 969 + if (mlocked) 970 + ret = SWAP_MLOCK; /* actually mlocked the page */ 971 + else if (ret == SWAP_MLOCK) 972 + ret = SWAP_AGAIN; /* saw VM_LOCKED vma */ 973 + 976 974 return ret; 977 975 } 978 976 979 977 /** 980 - * try_to_unmap_file - unmap file page using the object-based rmap method 981 - * @page: the page to unmap 982 - * @migration: migration flag 978 + * try_to_unmap_file - unmap/unlock file page using the object-based rmap method 979 + * @page: the page to unmap/unlock 980 + * @unlock: request for 
unlock rather than unmap [unlikely] 981 + * @migration: unmapping for migration - ignored if @unlock 983 982 * 984 983 * Find all the mappings of a page using the mapping pointer and the vma chains 985 984 * contained in the address_space struct it points to. 986 985 * 987 - * This function is only called from try_to_unmap for object-based pages. 986 + * This function is only called from try_to_unmap/try_to_munlock for 987 + * object-based pages. 988 + * When called from try_to_munlock(), the mmap_sem of the mm containing the vma 989 + * where the page was found will be held for write. So, we won't recheck 990 + * vm_flags for that VMA. That should be OK, because that vma shouldn't be 991 + * 'LOCKED. 988 992 */ 989 - static int try_to_unmap_file(struct page *page, int migration) 993 + static int try_to_unmap_file(struct page *page, int unlock, int migration) 990 994 { 991 995 struct address_space *mapping = page->mapping; 992 996 pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); ··· 1062 936 unsigned long max_nl_cursor = 0; 1063 937 unsigned long max_nl_size = 0; 1064 938 unsigned int mapcount; 939 + unsigned int mlocked = 0; 940 + 941 + if (MLOCK_PAGES && unlikely(unlock)) 942 + ret = SWAP_SUCCESS; /* default for try_to_munlock() */ 1065 943 1066 944 spin_lock(&mapping->i_mmap_lock); 1067 945 vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { 1068 - ret = try_to_unmap_one(page, vma, migration); 1069 - if (ret == SWAP_FAIL || !page_mapped(page)) 1070 - goto out; 946 + if (MLOCK_PAGES && unlikely(unlock)) { 947 + if (!(vma->vm_flags & VM_LOCKED)) 948 + continue; /* must visit all vmas */ 949 + ret = SWAP_MLOCK; 950 + } else { 951 + ret = try_to_unmap_one(page, vma, migration); 952 + if (ret == SWAP_FAIL || !page_mapped(page)) 953 + goto out; 954 + } 955 + if (ret == SWAP_MLOCK) { 956 + mlocked = try_to_mlock_page(page, vma); 957 + if (mlocked) 958 + break; /* stop if actually mlocked page */ 959 + } 1071 960 } 961 + 962 + if (mlocked) 
963 + goto out; 1072 964 1073 965 if (list_empty(&mapping->i_mmap_nonlinear)) 1074 966 goto out; 1075 967 1076 968 list_for_each_entry(vma, &mapping->i_mmap_nonlinear, 1077 969 shared.vm_set.list) { 1078 - if ((vma->vm_flags & VM_LOCKED) && !migration) 970 + if (MLOCK_PAGES && unlikely(unlock)) { 971 + if (!(vma->vm_flags & VM_LOCKED)) 972 + continue; /* must visit all vmas */ 973 + ret = SWAP_MLOCK; /* leave mlocked == 0 */ 974 + goto out; /* no need to look further */ 975 + } 976 + if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED)) 1079 977 continue; 1080 978 cursor = (unsigned long) vma->vm_private_data; 1081 979 if (cursor > max_nl_cursor) ··· 1109 959 max_nl_size = cursor; 1110 960 } 1111 961 1112 - if (max_nl_size == 0) { /* any nonlinears locked or reserved */ 962 + if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */ 1113 963 ret = SWAP_FAIL; 1114 964 goto out; 1115 965 } ··· 1133 983 do { 1134 984 list_for_each_entry(vma, &mapping->i_mmap_nonlinear, 1135 985 shared.vm_set.list) { 1136 - if ((vma->vm_flags & VM_LOCKED) && !migration) 986 + if (!MLOCK_PAGES && !migration && 987 + (vma->vm_flags & VM_LOCKED)) 1137 988 continue; 1138 989 cursor = (unsigned long) vma->vm_private_data; 1139 990 while ( cursor < max_nl_cursor && 1140 991 cursor < vma->vm_end - vma->vm_start) { 1141 - try_to_unmap_cluster(cursor, &mapcount, vma); 992 + ret = try_to_unmap_cluster(cursor, &mapcount, 993 + vma, page); 994 + if (ret == SWAP_MLOCK) 995 + mlocked = 2; /* to return below */ 1142 996 cursor += CLUSTER_SIZE; 1143 997 vma->vm_private_data = (void *) cursor; 1144 998 if ((int)mapcount <= 0) ··· 1163 1009 vma->vm_private_data = NULL; 1164 1010 out: 1165 1011 spin_unlock(&mapping->i_mmap_lock); 1012 + if (mlocked) 1013 + ret = SWAP_MLOCK; /* actually mlocked the page */ 1014 + else if (ret == SWAP_MLOCK) 1015 + ret = SWAP_AGAIN; /* saw VM_LOCKED vma */ 1166 1016 return ret; 1167 1017 } 1168 1018 ··· 1182 1024 * SWAP_SUCCESS - we succeeded in removing 
all mappings 1183 1025 * SWAP_AGAIN - we missed a mapping, try again later 1184 1026 * SWAP_FAIL - the page is unswappable 1027 + * SWAP_MLOCK - page is mlocked. 1185 1028 */ 1186 1029 int try_to_unmap(struct page *page, int migration) 1187 1030 { ··· 1191 1032 BUG_ON(!PageLocked(page)); 1192 1033 1193 1034 if (PageAnon(page)) 1194 - ret = try_to_unmap_anon(page, migration); 1035 + ret = try_to_unmap_anon(page, 0, migration); 1195 1036 else 1196 - ret = try_to_unmap_file(page, migration); 1197 - 1198 - if (!page_mapped(page)) 1037 + ret = try_to_unmap_file(page, 0, migration); 1038 + if (ret != SWAP_MLOCK && !page_mapped(page)) 1199 1039 ret = SWAP_SUCCESS; 1200 1040 return ret; 1201 1041 } 1202 1042 1043 + #ifdef CONFIG_UNEVICTABLE_LRU 1044 + /** 1045 + * try_to_munlock - try to munlock a page 1046 + * @page: the page to be munlocked 1047 + * 1048 + * Called from munlock code. Checks all of the VMAs mapping the page 1049 + * to make sure nobody else has this page mlocked. The page will be 1050 + * returned with PG_mlocked cleared if no other vmas have it mlocked. 1051 + * 1052 + * Return values are: 1053 + * 1054 + * SWAP_SUCCESS - no vma's holding page mlocked. 1055 + * SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem 1056 + * SWAP_MLOCK - page is now mlocked. 1057 + */ 1058 + int try_to_munlock(struct page *page) 1059 + { 1060 + VM_BUG_ON(!PageLocked(page) || PageLRU(page)); 1061 + 1062 + if (PageAnon(page)) 1063 + return try_to_unmap_anon(page, 1, 0); 1064 + else 1065 + return try_to_unmap_file(page, 1, 0); 1066 + } 1067 + #endif
mm/swap.c (+1 -1)
···
         put_cpu();
 }
 
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
 static void lru_add_drain_per_cpu(struct work_struct *dummy)
 {
         lru_add_drain();
mm/vmscan.c (+26 -10)
···
 
                 sc->nr_scanned++;
 
-                if (unlikely(!page_evictable(page, NULL))) {
-                        unlock_page(page);
-                        putback_lru_page(page);
-                        continue;
-                }
+                if (unlikely(!page_evictable(page, NULL)))
+                        goto cull_mlocked;
 
                 if (!sc->may_swap && page_mapped(page))
                         goto keep_locked;
···
                  * Anonymous process memory has backing store?
                  * Try to allocate it some swap space here.
                  */
-                if (PageAnon(page) && !PageSwapCache(page))
+                if (PageAnon(page) && !PageSwapCache(page)) {
+                        switch (try_to_munlock(page)) {
+                        case SWAP_FAIL:         /* shouldn't happen */
+                        case SWAP_AGAIN:
+                                goto keep_locked;
+                        case SWAP_MLOCK:
+                                goto cull_mlocked;
+                        case SWAP_SUCCESS:
+                                ; /* fall thru'; add to swap cache */
+                        }
                         if (!add_to_swap(page, GFP_ATOMIC))
                                 goto activate_locked;
+                }
 #endif /* CONFIG_SWAP */
 
                 mapping = page_mapping(page);
···
                                 goto activate_locked;
                         case SWAP_AGAIN:
                                 goto keep_locked;
+                        case SWAP_MLOCK:
+                                goto cull_mlocked;
                         case SWAP_SUCCESS:
                                 ; /* try to free the page below */
                         }
···
                 }
                 continue;
 
+cull_mlocked:
+                unlock_page(page);
+                putback_lru_page(page);
+                continue;
+
 activate_locked:
                 /* Not a candidate for swapping, so reclaim swap space. */
                 if (PageSwapCache(page) && vm_swap_full())
···
                 unlock_page(page);
 keep:
                 list_add(&page->lru, &ret_pages);
-                VM_BUG_ON(PageLRU(page));
+                VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
         }
         list_splice(&ret_pages, page_list);
         if (pagevec_count(&freed_pvec))
···
  * @vma: the VMA in which the page is or will be mapped, may be NULL
  *
  * Test whether page is evictable--i.e., should be placed on active/inactive
- * lists vs unevictable list.
+ * lists vs unevictable list.  The vma argument is !NULL when called from the
+ * fault path to determine how to instantiate a new page.
  *
  * Reasons page might not be evictable:
  * (1) page's mapping marked unevictable
+ * (2) page is part of an mlocked VMA
  *
- * TODO - later patches
  */
 int page_evictable(struct page *page, struct vm_area_struct *vma)
 {
···
         if (mapping_unevictable(page_mapping(page)))
                 return 0;
 
-        /* TODO:  test page [!]evictable conditions */
+        if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+                return 0;
 
         return 1;
 }