mmu-notifiers: core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In the GRU case there's no
actual secondary pte and there's only a secondary tlb, because the GRU
secondary MMU has no knowledge of sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case for KVM, where a secondary tlb miss will
walk sptes in hardware and refill the secondary tlb transparently to
software if the corresponding spte is present). In the same way that
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.

Currently we take a page_count pin on every page mapped by sptes, but that
means no page mapped by any spte (i.e. no page in the guest working set)
can ever be swapped. Furthermore a spte unmap event can immediately lead
to the page being freed when the pin is released (so it requires the same
complex and relatively slow tlb_gather SMP-safe logic we have in
zap_page_range, which can be avoided completely if the spte unmap event
doesn't require an unpin of the page previously mapped in the secondary
MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and to know
when the VM is swapping or freeing or doing anything else to the primary
MMU, so that the secondary MMU code can drop sptes before the pages are
freed, avoiding all page pinning and allowing 100% reliable swapping of
guest physical address space. Furthermore it saves the code that tears
down the secondary MMU mappings from having to implement tlb_gather-like
logic as in zap_page_range, which would require many IPIs to flush the
other cpus' tlbs for each fixed number of sptes unmapped.
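
To make the model concrete, here is a rough sketch (not part of this
patch) of how a secondary MMU driver could hook into these events. The
mmu_notifier_ops callbacks and their signatures are the ones introduced
by this patch; the ex_* structure, helpers and locking are invented for
the example:

static void ex_zap_spte(struct ex_gmap *gmap, unsigned long address);

struct ex_gmap {
	struct mmu_notifier mn;		/* embedded, registered on the mm */
	spinlock_t lock;		/* protects the driver's spte table */
	/* ... the driver's own spte/secondary-tlb state ... */
};

static void ex_invalidate_page(struct mmu_notifier *mn,
			       struct mm_struct *mm, unsigned long address)
{
	struct ex_gmap *gmap = container_of(mn, struct ex_gmap, mn);

	/*
	 * Drop the spte and flush the secondary tlb for this address
	 * before the VM goes on to free or reuse the page.
	 */
	spin_lock(&gmap->lock);
	ex_zap_spte(gmap, address);	/* driver-specific, assumed */
	spin_unlock(&gmap->lock);
}

static const struct mmu_notifier_ops ex_mmu_notifier_ops = {
	.invalidate_page	= ex_invalidate_page,
	/* .invalidate_range_start/end: see the sketch under "Dependencies" */
	/* .clear_flush_young and .release as needed */
};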

To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings will
be invalidated, and the next secondary-mmu page fault will call
get_user_pages. If get_user_pages is called with write=1 that triggers a
do_wp_page, and an updated spte or secondary-tlb mapping is re-established
on the copied page; if it's a guest read and get_user_pages is called with
write=0, a readonly spte or readonly tlb mapping is set up instead. This
is just an example.
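
A minimal sketch of such a secondary-mmu fault handler, reusing the
hypothetical ex_* driver from above (get_user_pages and its arguments are
the existing kernel API; ex_map_spte is made up):

static int ex_secondary_fault(struct ex_gmap *gmap, struct mm_struct *mm,
			      unsigned long address, int write)
{
	struct page *page;
	int ret;

	down_read(&mm->mmap_sem);
	/*
	 * write=1 forces a do_wp_page on a wrprotected pte and hands back
	 * the copied page; write=0 is enough for a guest read.
	 */
	ret = get_user_pages(current, mm, address, 1, write, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	/*
	 * Install the writable or readonly spte/secondary-tlb entry
	 * (ex_map_spte stands in for the driver's own code).  A real
	 * driver must also check it didn't race with an
	 * invalidate_range_begin/end critical section, see below.
	 */
	ex_map_spte(gmap, address, page, write);

	put_page(page);			/* no long-term page pin anymore */
	return 0;
}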

This allows any page pointed to by any pte (and in turn visible in the
primary CPU MMU) to be mapped into a secondary MMU, be it a pure tlb like
the GRU, a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm, or a software remote DMA engine like XPMEM
(hence the need to schedule in the XPMEM code to send the invalidate to
the remote node, while there's no need to schedule in kvm/GRU as it's an
immediate event, just like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end (see the sketch after this list). No secondary
MMU page fault is allowed to map any spte or secondary tlb reference
while the VM is in the middle of range_begin/end, as any page returned
by get_user_pages in that critical section could later be freed
immediately without any further ->invalidate_page notification
(invalidate_range_begin/end works on ranges and ->invalidate_page isn't
called immediately before freeing the page). To stop all page freeing
and pagetable overwrites, the mmap_sem must be taken in write mode and
all other anon_vma/i_mmap locks must be taken too.

2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows an external KVM module to be
compiled against a kernel with mmu notifiers enabled, and from the next
pull from kvm.git we'll start using them. GRU/XPMEM will also be able to
continue their development by enabling KVM=m in their config, until they
submit all the GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also select MMU_NOTIFIER in the same way KVM does (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM, GRU and XPMEM
are all =n.
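
Here is the sketch referred to in point 1, reusing the hypothetical ex_*
driver from the earlier examples; the invalidate_count/invalidate_seq
fields and helpers are invented for the example, a real driver would keep
something equivalent in its own structures:

/*
 * Same hypothetical ex_gmap as before, with two extra fields for the
 * range_begin/end tracking described in point 1.
 */
struct ex_gmap {
	struct mmu_notifier mn;
	spinlock_t lock;
	int invalidate_count;		/* > 0 while inside begin/end */
	unsigned long invalidate_seq;	/* bumped at every range_end */
};

static void ex_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct ex_gmap *gmap = container_of(mn, struct ex_gmap, mn);

	spin_lock(&gmap->lock);
	gmap->invalidate_count++;		/* faults must back off now */
	ex_zap_spte_range(gmap, start, end);	/* driver-specific, assumed */
	spin_unlock(&gmap->lock);
}

static void ex_invalidate_range_end(struct mmu_notifier *mn,
				    struct mm_struct *mm,
				    unsigned long start, unsigned long end)
{
	struct ex_gmap *gmap = container_of(mn, struct ex_gmap, mn);

	spin_lock(&gmap->lock);
	gmap->invalidate_count--;
	gmap->invalidate_seq++;			/* in-flight faults must retry */
	spin_unlock(&gmap->lock);
}

/*
 * Called by the secondary-mmu fault path under gmap->lock, after
 * get_user_pages returned and before the spte is installed; "seq" is
 * gmap->invalidate_seq sampled (under the lock) before get_user_pages.
 */
static int ex_may_install_spte(struct ex_gmap *gmap, unsigned long seq)
{
	if (gmap->invalidate_count)
		return 0;	/* inside begin/end: the page could be freed
				   with no further notification, retry */
	if (gmap->invalidate_seq != seq)
		return 0;	/* an invalidate ran meanwhile, retry */
	return 1;
}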

The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used at driver startup, a failure can be gracefully handled. Here is
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up, other allocations are required anyway
and -ENOMEM failure paths exist already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.
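
For completeness, a sketch of the matching teardown in the hypothetical
ex_* driver used in the earlier examples (ex_zap_all_sptes is made up;
the mm_count pin taken at register time is dropped inside
mmu_notifier_unregister itself):

static void ex_gmap_destroy(struct ex_gmap *gmap, struct mm_struct *mm)
{
	/*
	 * Drop every spte the driver still holds (driver-specific), then
	 * unregister: this cannot fail and it also releases the mm_count
	 * pin that mmu_notifier_register took.
	 */
	ex_zap_all_sptes(gmap);			/* hypothetical */
	mmu_notifier_unregister(&gmap->mn, mm);
}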

The patch also adds a few needed but missing includes that would otherwise
prevent the kernel from compiling after these changes on non-x86 archs
(x86 didn't need them by luck).

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Andrea Arcangeli, committed by Linus Torvalds
(cddb8a5c 7906d00c), +623 -13.
+1
arch/x86/kvm/Kconfig
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
+4
include/linux/mm_types.h
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/cpumask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
···
 	/* store ref to file /proc/<pid>/exe symlink points to */
 	struct file *exe_file;
 	unsigned long num_exe_file_vmas;
+#endif
+#ifdef CONFIG_MMU_NOTIFIER
+	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
 };
+279
include/linux/mmu_notifier.h
New file: the mmu notifier API.

- struct mmu_notifier_mm: the per-mm container (an hlist of registered
notifiers plus a spinlock serializing list modifications and
hlist_unhashed). It is allocated and installed in mm->mmu_notifier_mm
inside the mm_take_all_locks() protected critical section and released
only when mm_count reaches zero in mmdrop().

- struct mmu_notifier_ops, the callbacks a secondary MMU implements:
->release is called by mmu_notifier_unregister or when the mm is being
destroyed by exit_mmap, always before all pages are freed; it can run
concurrently with the other notifier methods and must tear down all
secondary mmu mappings and freeze the secondary mmu. If it isn't
implemented, nothing may write to the pages through the secondary mmu by
the time the last thread with tsk->mm == mm exits, and the driver must
also make sure speculative hardware accesses (e.g. through a gart alias
with a different cache model) can't leave unsnoopable dirty cachelines
behind, or memory corruption follows.
->clear_flush_young is called after the VM test-and-clears the
young/accessed bit in the pte, so accesses through the secondary MMU also
contribute to page aging.
->invalidate_page: until this is invoked the secondary MMU may still
read/write the page, which isn't freed before it returns; set_page_dirty
has to be called internally if required.
->invalidate_range_start/->invalidate_range_end are always paired and are
called with the mmap_sem and/or the reverse-map locks held. No additional
references may be taken on pages in the range, and no sptes may be
established in it, for the whole duration of the critical section.
range_start runs while all pages in the range are still mapped with at
least one refcount; range_end runs after they have been unmapped and
freed by the VM. A driver that elevates the refcount when it maps the
pages may drop it in either callback; if it only drops it in range_end it
must flush its secondary tlb before the final free.

- struct mmu_notifier: an hlist node plus the ops pointer, meant to be
embedded in the driver's own structure. The notifier chain can only be
traversed with mmap_sem held, with one of the reverse map locks held
(i_mmap_lock or anon_vma->lock), or from ->release when no other thread
can access the list.

- The registration API: mmu_notifier_register(), __mmu_notifier_register()
(for callers already holding mmap_sem in write mode),
mmu_notifier_unregister(), mm_has_notifiers(), mmu_notifier_mm_init()/
mmu_notifier_mm_destroy(), and the inline hooks (mmu_notifier_release,
_clear_flush_young, _invalidate_page, _invalidate_range_start/end) that
only call into mm/mmu_notifier.c when a notifier is registered.

- ptep_clear_flush_notify()/ptep_clear_flush_young_notify(): macros
combining ptep_clear_flush()/ptep_clear_flush_young() with the
corresponding notifier call (macros for now only because ptep_clear_flush
itself is a macro). With CONFIG_MMU_NOTIFIER=n they map back to the plain
pte helpers and all the hooks become empty inlines.
+3
kernel/fork.c
 #include <linux/key.h>
 #include <linux/binfmts.h>
 #include <linux/mman.h>
+#include <linux/mmu_notifier.h>
 #include <linux/fs.h>
 #include <linux/nsproxy.h>
···
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
+		mmu_notifier_mm_init(mm);
 		return mm;
 	}
···
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	mmu_notifier_mm_destroy(mm);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
+3
mm/Kconfig
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config MMU_NOTIFIER
+	bool
+1
mm/Makefile
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o
 obj-$(CONFIG_SLOB) += slob.o
+obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+2 -1
mm/filemap_xip.c
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/sched.h>
 #include <asm/tlbflush.h>
 #include <asm/io.h>
···
 	if (pte) {
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush(vma, address, pte);
+		pteval = ptep_clear_flush_notify(vma, address, pte);
 		page_remove_rmap(page, vma);
 		dec_mm_counter(mm, file_rss);
 		BUG_ON(pte_dirty(pteval));
+3
mm/fremap.c
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>

 #include <asm/mmu_context.h>
 #include <asm/cacheflush.h>
···
 		spin_unlock(&mapping->i_mmap_lock);
 	}

+	mmu_notifier_invalidate_range_start(mm, start, start + size);
 	err = populate_range(mm, vma, start, size, pgoff);
+	mmu_notifier_invalidate_range_end(mm, start, start + size);
 	if (!err && !(flags & MAP_NONBLOCK)) {
 		if (unlikely(has_write_lock)) {
 			downgrade_write(&mm->mmap_sem);
+3
mm/hugetlb.c
 #include <linux/mm.h>
 #include <linux/sysctl.h>
 #include <linux/highmem.h>
+#include <linux/mmu_notifier.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
···
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));

+	mmu_notifier_invalidate_range_start(mm, start, end);
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
 		ptep = huge_pte_offset(mm, address);
···
 	}
 	spin_unlock(&mm->page_table_lock);
 	flush_tlb_range(vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	list_for_each_entry_safe(page, tmp, &page_list, lru) {
 		list_del(&page->lru);
 		put_page(page);
+29 -6
mm/memory.c
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>

 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
··· copy_page_range():
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	int ret;
···
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

+	/*
+	 * We need to invalidate the secondary MMU mappings only when
+	 * there could be a permission downgrade on the ptes of the
+	 * parent mm. And a permission downgrade will only happen if
+	 * is_cow_mapping() returns true.
+	 */
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_start(src_mm, addr, end);
+
+	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
 	src_pgd = pgd_offset(src_mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
-		if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-				   vma, addr, next))
-			return -ENOMEM;
+		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+					    vma, addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
-	return 0;
+
+	if (is_cow_mapping(vma->vm_flags))
+		mmu_notifier_invalidate_range_end(src_mm,
+						  vma->vm_start, end);
+	return ret;
 }
··· unmap_vmas():
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
+	struct mm_struct *mm = vma->vm_mm;

+	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long end;
···
 out:
+	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
 }
··· apply_to_page_range():
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long end = addr + size;
+	unsigned long start = addr, end = addr + size;
 	int err;

 	BUG_ON(addr >= end);
+	mmu_notifier_invalidate_range_start(mm, start, end);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
···
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	return err;
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
··· do_wp_page():
 	 * seen in the presence of one thread doing SMC and another
 	 * thread doing COW.
 	 */
-	ptep_clear_flush(vma, address, page_table);
+	ptep_clear_flush_notify(vma, address, page_table);
 	set_pte_at(mm, address, page_table, entry);
 	update_mmu_cache(vma, address, entry);
 	lru_cache_add_active(new_page);
+2
mm/mmap.c
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>

 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
···
 	/* mm's last user has gone, and its about to be pulled down */
 	arch_exit_mmap(mm);
+	mmu_notifier_release(mm);

 	lru_add_drain();
 	flush_cache_mm(mm);
+277
mm/mmu_notifier.c
New file: the mmu notifier core.

- __mmu_notifier_release(), called from exit_mmap via
mmu_notifier_release(): it can't run concurrently with
mmu_notifier_register (mm_users is already zero there), but it serializes
against mmu_notifier_unregister with mmu_notifier_mm->lock plus RCU, and
against the other notifier methods with RCU; struct mmu_notifier_mm can't
go away because exit_mmap holds an mm_count pin. Under the spinlock it
unhashes each registered notifier and, inside rcu_read_lock, invokes
->release so the driver can flush all existing sptes and stop
establishing new ones before the pages in the mm are freed; a final
synchronize_rcu() keeps exit_mmap from freeing the pages until a ->release
invoked by mmu_notifier_unregister has returned.

- __mmu_notifier_clear_flush_young(), __mmu_notifier_invalidate_page(),
__mmu_notifier_invalidate_range_start() and
__mmu_notifier_invalidate_range_end(): walk the registered notifiers with
hlist_for_each_entry_rcu under rcu_read_lock and invoke the corresponding
method when the driver implements it. clear_flush_young ORs the results
together; if no young bitflag is supported by the hardware, the driver's
->clear_flush_young can unmap the address and return whether a mapping
previously existed.

- do_mmu_notifier_register(): allocates the mmu_notifier_mm container,
optionally takes mmap_sem for write, calls mm_take_all_locks() (which can
fail with -EINTR), installs the container on first use, takes an mm_count
pin and adds the notifier to the list under the spinlock, then drops all
the locks. mmu_notifier_register() and __mmu_notifier_register() (for
callers already holding mmap_sem in write mode) are thin wrappers. The
caller must not hold mmap_sem or any other VM lock, must keep mm_users
pinned across the call (current->mm or get_task_mm()), and must always
call mmu_notifier_unregister later; the mm_count pin makes that safe
before or after exit_mmap, and ->release is always called before
exit_mmap frees the pages.

- mmu_notifier_unregister(): under the spinlock, if the notifier is still
hashed it is removed and ->release is invoked under rcu_read_lock
(exit_mmap blocks in mmu_notifier_release until it returns); a
synchronize_rcu() then waits for any running notifier method to finish
before mmdrop() releases the mm_count pin. All sptes must be dropped
before calling it, and only after it returns is it guaranteed that no
notifier method can run anymore.

- __mmu_notifier_mm_destroy(), called after the last unregister: frees
mm->mmu_notifier_mm and poisons the pointer for debugging.
+3
mm/mprotect.c
 #include <linux/syscalls.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
···
 		dirty_accountable = 1;
 	}

+	mmu_notifier_invalidate_range_start(mm, start, end);
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
 	else
 		change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
 	vm_stat_account(mm, newflags, vma->vm_file, nrpages);
 	return 0;
+6
mm/mremap.c
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/mmu_notifier.h>

 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
···
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
+	unsigned long old_start;

+	old_start = old_addr;
+	mmu_notifier_invalidate_range_start(vma->vm_mm,
+					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
···
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }

 #define LATENCY_LIMIT (64 * PAGE_SIZE)
+7 -6
mm/rmap.c
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_notifier.h>

 #include <asm/tlbflush.h>
···
 	if (vma->vm_flags & VM_LOCKED) {
 		referenced++;
 		*mapcount = 1;	/* break early from loop */
-	} else if (ptep_clear_flush_young(vma, address, pte))
+	} else if (ptep_clear_flush_young_notify(vma, address, pte))
 		referenced++;

 	/* Pretend the page is referenced if the task has the
···
 		pte_t entry;

 		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush(vma, address, pte);
+		entry = ptep_clear_flush_notify(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
···
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!migration && ((vma->vm_flags & VM_LOCKED) ||
-			(ptep_clear_flush_young(vma, address, pte)))) {
+			(ptep_clear_flush_young_notify(vma, address, pte)))) {
 		ret = SWAP_FAIL;
 		goto out_unmap;
 	}

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	pteval = ptep_clear_flush_notify(vma, address, pte);

 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
···
 	page = vm_normal_page(vma, address, *pte);
 	BUG_ON(!page || PageAnon(page));

-	if (ptep_clear_flush_young(vma, address, pte))
+	if (ptep_clear_flush_young_notify(vma, address, pte))
 		continue;

 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, pte_pfn(*pte));
-	pteval = ptep_clear_flush(vma, address, pte);
+	pteval = ptep_clear_flush_notify(vma, address, pte);

 	/* If nonlinear, store the file page offset in the pte. */
 	if (page->index != linear_page_index(vma, address))