
mremap: avoid sending one IPI per page

This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.
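
[For context, a minimal userland sketch of the syscall path this patch touches: mremap(2) forced to actually move a mapping, which drives the per-page loop in move_ptes(). This is an editor's illustration, not part of the patch; the 64 MiB size and the PROT_NONE destination reservation are arbitrary choices for demonstration, assuming Linux with 4 KiB pages.]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB: 16384 PTEs at 4 KiB */

	char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Reserve a destination so MREMAP_FIXED has somewhere to land. */
	char *dst = mmap(NULL, len, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (old == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	memset(old, 0xaa, len);		/* populate the page tables */

	/* Moving (rather than growing in place) reaches move_page_tables()
	 * -> move_ptes(), which visits every PTE of the region: one
	 * ptep_clear_flush() IPI per page before this patch, a single
	 * flush_tlb_range() after it. */
	if (mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst)
	    == MAP_FAILED)
		return 1;

	printf("moved %zu MiB to %p\n", len >> 20, (void *)dst);
	return 0;
}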

The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly, but this is not going to fundamentally change things. It
was more by accident that the region under mremap was for the most part
still available to secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range, which sounds impractical and
flaky). If users want secondary MMUs not to lose access to a large
region under mremap, they should reduce the mremap size accordingly in
userland and run multiple calls. Overall this will run faster, so it's
actually going to reduce the time the region is under mremap for the
primary MMU, which should provide a net benefit to apps.
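
[A hedged sketch of that userland splitting, added for illustration: the helper name mremap_chunked and the 4 MiB CHUNK size are the editor's inventions, not from the patch, which recommends no particular chunk size. It assumes a destination range reserved up front so MREMAP_FIXED can target it.]

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK	(4UL << 20)	/* illustrative chunk size: 4 MiB */

/* Move [old, old + len) onto the reserved range [new, new + len) one
 * chunk at a time, so secondary MMUs lose access to at most one chunk
 * per mmu_notifier invalidate window rather than the whole region. */
static int mremap_chunked(char *old, char *new, size_t len)
{
	for (size_t off = 0; off < len; off += CHUNK) {
		size_t n = len - off < CHUNK ? len - off : CHUNK;

		if (mremap(old + off, n, n,
			   MREMAP_MAYMOVE | MREMAP_FIXED,
			   new + off) == MAP_FAILED)
			return -1;	/* earlier chunks stay moved */
	}
	return 0;
}

int main(void)
{
	size_t len = 64UL << 20;
	char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *new = mmap(NULL, len, PROT_NONE,	/* reserve destination */
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (old == MAP_FAILED || new == MAP_FAILED)
		return 1;
	return mremap_chunked(old, new, len) ? 1 : 0;
}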

For KVM this is a noop because the guest physical memory is never
mremapped; there's just no point in ever moving it while the guest
runs. One target of this optimization is JVM GC (so unrelated to the
mmu notifier logic).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Andrea Arcangeli, committed by Linus Torvalds
7b6efc2b ebed4846

mm/mremap.c (+9 -6)
···
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
-	unsigned long old_start;
 
-	old_start = old_addr;
-	mmu_notifier_invalidate_range_start(vma->vm_mm,
-					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
···
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
···
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
···
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
+	bool need_flush = false;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
+
+	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
···
 			extent = LATENCY_LIMIT;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			  new_vma, new_pmd, new_addr);
+		need_flush = true;
 	}
+	if (likely(need_flush))
+		flush_tlb_range(vma, old_end-len, old_addr);
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);
 
 	return len + old_addr - old_end;	/* how much done */
 }