Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'expand-stack'

This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.

It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.

And it worked fine. We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.

That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken. Oops.

It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful. We have basically three
different cases of stack expansion, and they all work just a bit
differently:

- the common and obvious case is the page fault handling. It's actually
fairly simple and straightforward, except for the fact that we have
something like 24 different versions of it, and you end up in a maze
of twisty little passages, all alike.

- the simplest case is the execve() code that creates a new stack.
There are no real locking concerns because it's all in a private new
VM that hasn't been exposed to anybody, but lockdep still can end up
unhappy if you get it wrong.

- and finally, we have GUP and page pinning, which shouldn't really be
expanding the stack in the first place, but in addition to execve()
we also use it for ptrace(). And debuggers do want to possibly access
memory under the stack pointer and thus need to be able to expand the
stack as a special case.

None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times. And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.

So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.

Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.
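To make the calling convention concrete, here is a small userspace model of what the helper does; the struct definitions and locking flag are stand-ins invented for illustration, not the kernel's types, and the real helper does considerably more (killable locking, trylock/exception-table games, the actual maple tree lookup):

```c
#include <assert.h>
#include <stddef.h>

#define VM_GROWSDOWN	0x0100UL
#define PAGE_MASK	(~0xfffUL)

/* Userspace stand-ins for the kernel structures (illustration only). */
struct vm_area_struct { unsigned long vm_start, vm_end, vm_flags; };
struct mm_struct { struct vm_area_struct *stack_vma; int read_locked; };
struct pt_regs { int unused; };

/*
 * Toy model of lock_mm_and_find_vma(): take the mmap lock for reading,
 * find the vma covering 'address', and handle the grow-down stack case.
 * The key calling convention is that on failure the lock has already
 * been dropped, which is why the converted fault handlers jump to a
 * 'bad_area_nosemaphore' label instead of unlocking themselves.
 */
static struct vm_area_struct *
lock_mm_and_find_vma(struct mm_struct *mm, unsigned long address,
		     struct pt_regs *regs)
{
	struct vm_area_struct *vma;

	(void)regs;			/* the real helper uses this */
	mm->read_locked = 1;		/* models mmap_read_lock(mm) */

	vma = mm->stack_vma;		/* models find_vma(mm, address) */
	if (vma && address >= vma->vm_start && address < vma->vm_end)
		return vma;		/* address already mapped */

	if (vma && (vma->vm_flags & VM_GROWSDOWN) && address < vma->vm_start) {
		/* stand-in for the stack expansion path */
		vma->vm_start = address & PAGE_MASK;
		return vma;
	}

	mm->read_locked = 0;		/* failure: lock is dropped for us */
	return NULL;
}
```

With that shape, the per-architecture fault handler collapses to `vma = lock_mm_and_find_vma(mm, address, regs); if (!vma) goto bad_area_nosemaphore;`, which is exactly the pattern repeated in the diffs below.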

And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.

That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern. Along with the couple of special cases in execve() and
GUP.

So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's an obvious
path forward in the conversion. The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stacks are
special, because at execve time even they grow down".

The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead of having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.
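The idea can be sketched in userspace as follows; this is a toy model of doing the expansion manually under the write lock, with invented names and a made-up limit check, not the actual execve() code:

```c
#include <assert.h>

#define PAGE_SIZE 0x1000UL

/* Toy model of a freshly created execve() stack (illustration only). */
struct new_stack {
	unsigned long vm_start;	/* lowest mapped address */
	unsigned long vm_end;	/* one past the highest mapped address */
	int write_locked;	/* models mmap_lock held for writing */
};

/*
 * At execve() time the new mm is private, so we can simply take the
 * mmap lock for writing and move vm_start down directly before the
 * argument pages are copied in: no fault handling, no GUP, and no
 * read-lock/upgrade dance.  Growing the area directly like this treats
 * all stack configurations the same way, which is roughly why the
 * #ifdef could go away.
 */
static int setup_stack_pages(struct new_stack *stack, unsigned long needed)
{
	unsigned long target = stack->vm_end - needed;

	stack->write_locked = 1;		/* mmap_write_lock(mm); */
	if (target < PAGE_SIZE) {		/* toy limit check */
		stack->write_locked = 0;
		return -1;
	}
	target &= ~(PAGE_SIZE - 1);		/* page-align downwards */
	if (target < stack->vm_start)
		stack->vm_start = target;	/* expand in place */
	stack->write_locked = 0;		/* mmap_write_unlock(mm); */
	return 0;
}
```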

And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).

In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().

Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else. Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.

Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.

Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch. That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.

Reported-by: Ruihan Li <lrh2000@pku.edu.cn>

* branch 'expand-stack':
gup: add warning if some caller would seem to want stack expansion
mm: always expand the stack with the mmap write lock held
execve: expand new process stack manually ahead of time
mm: make find_extend_vma() fail if write lock not held
powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
arm/mm: Convert to using lock_mm_and_find_vma()
riscv/mm: Convert to using lock_mm_and_find_vma()
mips/mm: Convert to using lock_mm_and_find_vma()
powerpc/mm: Convert to using lock_mm_and_find_vma()
arm64/mm: Convert to using lock_mm_and_find_vma()
mm: make the page fault mmap locking killable
mm: introduce new 'lock_mm_and_find_vma()' page fault helper

+439 -468
+1
arch/alpha/Kconfig
···
 	select HAS_IOPORT
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_MOD_ARCH_SPECIFIC
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select ODD_RT_SIGACTION
 	select OLD_SIGSUSPEND
+3 -10
arch/alpha/mm/fault.c
···
 	flags |= FAULT_FLAG_USER;
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-	if (expand_stack(vma, address))
-		goto bad_area;
+		goto bad_area_nosemaphore;
 
 	/* Ok, we have a good vm_area for this memory access, so
 	   we can handle it.  */
- good_area:
 	si_code = SEGV_ACCERR;
 	if (cause < 0) {
 		if (!(vma->vm_flags & VM_EXEC))
···
  bad_area:
 	mmap_read_unlock(mm);
 
+ bad_area_nosemaphore:
 	if (user_mode(regs))
 		goto do_sigsegv;
 
+1
arch/arc/Kconfig
···
 	select HAVE_PERF_EVENTS
 	select HAVE_SYSCALL_TRACEPOINTS
 	select IRQ_DOMAIN
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select OF
 	select OF_EARLY_FLATTREE
+3 -8
arch/arc/mm/fault.c
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (unlikely(address < vma->vm_start)) {
-		if (!(vma->vm_flags & VM_GROWSDOWN) || expand_stack(vma, address))
-			goto bad_area;
-	}
+		goto bad_area_nosemaphore;
 
 	/*
 	 * vm_area is good, now check permissions for this memory access
···
 bad_area:
 	mmap_read_unlock(mm);
 
+bad_area_nosemaphore:
 	/*
 	 * Major/minor page fault accounting
 	 * (in case of retry we only land here once)
+1
arch/arm/Kconfig
···
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN
 	select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
 	select IRQ_FORCED_THREADING
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_REL
 	select NEED_DMA_MAP_STATE
 	select OF_EARLY_FLATTREE if OF
+14 -49
arch/arm/mm/fault.c
···
 	return false;
 }
 
-static vm_fault_t __kprobes
-__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int flags,
-		unsigned long vma_flags, struct pt_regs *regs)
-{
-	struct vm_area_struct *vma = find_vma(mm, addr);
-	if (unlikely(!vma))
-		return VM_FAULT_BADMAP;
-
-	if (unlikely(vma->vm_start > addr)) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			return VM_FAULT_BADMAP;
-		if (addr < FIRST_USER_ADDRESS)
-			return VM_FAULT_BADMAP;
-		if (expand_stack(vma, addr))
-			return VM_FAULT_BADMAP;
-	}
-
-	/*
-	 * ok, we have a good vm_area for this memory access, check the
-	 * permissions on the VMA allow for the fault which occurred.
-	 */
-	if (!(vma->vm_flags & vma_flags))
-		return VM_FAULT_BADACCESS;
-
-	return handle_mm_fault(vma, addr & PAGE_MASK, flags, regs);
-}
-
 static int __kprobes
 do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 {
 	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
 	int sig, code;
 	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
-	/*
-	 * As per x86, we may deadlock here.  However, since the kernel only
-	 * validly references user space from well defined areas of the code,
-	 * we can bug out early if this is from code which shouldn't.
-	 */
-	if (!mmap_read_trylock(mm)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
-			goto no_context;
 retry:
-		mmap_read_lock(mm);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case, we'll have missed the might_sleep() from
-		 * down_read()
-		 */
-		might_sleep();
-#ifdef CONFIG_DEBUG_VM
-		if (!user_mode(regs) &&
-		    !search_exception_tables(regs->ARM_pc))
-			goto no_context;
-#endif
+	vma = lock_mm_and_find_vma(mm, addr, regs);
+	if (unlikely(!vma)) {
+		fault = VM_FAULT_BADMAP;
+		goto bad_area;
 	}
 
-	fault = __do_page_fault(mm, addr, flags, vm_flags, regs);
+	/*
+	 * ok, we have a good vm_area for this memory access, check the
+	 * permissions on the VMA allow for the fault which occurred.
+	 */
+	if (!(vma->vm_flags & vm_flags))
+		fault = VM_FAULT_BADACCESS;
+	else
+		fault = handle_mm_fault(vma, addr & PAGE_MASK, flags, regs);
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first.  We do not need to release the mmap_lock because
···
 	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
 		return 0;
 
+bad_area:
 	/*
 	 * If we are in kernel mode at this point, we
 	 * have no context to handle this fault with.
+1
arch/arm64/Kconfig
···
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select KASAN_VMALLOC if KASAN
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
+8 -39
arch/arm64/mm/fault.c
···
 #define VM_FAULT_BADMAP		((__force vm_fault_t)0x010000)
 #define VM_FAULT_BADACCESS	((__force vm_fault_t)0x020000)
 
-static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
+static vm_fault_t __do_page_fault(struct mm_struct *mm,
+				  struct vm_area_struct *vma, unsigned long addr,
 				  unsigned int mm_flags, unsigned long vm_flags,
 				  struct pt_regs *regs)
 {
-	struct vm_area_struct *vma = find_vma(mm, addr);
-
-	if (unlikely(!vma))
-		return VM_FAULT_BADMAP;
-
 	/*
 	 * Ok, we have a good vm_area for this memory access, so we can handle
 	 * it.
-	 */
-	if (unlikely(vma->vm_start > addr)) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			return VM_FAULT_BADMAP;
-		if (expand_stack(vma, addr))
-			return VM_FAULT_BADMAP;
-	}
-
-	/*
 	 * Check that the permissions on the VMA allow for the fault which
 	 * occurred.
···
 	}
 lock_mmap:
 #endif /* CONFIG_PER_VMA_LOCK */
-	/*
-	 * As per x86, we may deadlock here.  However, since the kernel only
-	 * validly references user space from well defined areas of the code,
-	 * we can bug out early if this is from code which shouldn't.
-	 */
-	if (!mmap_read_trylock(mm)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->pc))
-			goto no_context;
+
 retry:
-		mmap_read_lock(mm);
-	} else {
-		/*
-		 * The above mmap_read_trylock() might have succeeded in which
-		 * case, we'll have missed the might_sleep() from down_read().
-		 */
-		might_sleep();
-#ifdef CONFIG_DEBUG_VM
-		if (!user_mode(regs) && !search_exception_tables(regs->pc)) {
-			mmap_read_unlock(mm);
-			goto no_context;
-		}
-#endif
+	vma = lock_mm_and_find_vma(mm, addr, regs);
+	if (unlikely(!vma)) {
+		fault = VM_FAULT_BADMAP;
+		goto done;
 	}
 
-	fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs);
+	fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs);
 
 	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
···
 	}
 	mmap_read_unlock(mm);
 
-#ifdef CONFIG_PER_VMA_LOCK
 done:
-#endif
 	/*
 	 * Handle the "normal" (no error) case first.
 	 */
+1
arch/csky/Kconfig
···
 	select HAVE_STACKPROTECTOR
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
+	select LOCK_MM_AND_FIND_VMA
 	select MAY_HAVE_SPARSE_IRQ
 	select MODULES_USE_ELF_RELA if MODULES
 	select OF
+5 -17
arch/csky/mm/fault.c
···
 		BUG();
 }
 
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr)
+static inline void bad_area_nosemaphore(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr)
 {
 	/*
 	 * Something tried to access memory that isn't in our memory map.
 	 * Fix it, but check if it's kernel or user first.
 	 */
-	mmap_read_unlock(mm);
 	/* User mode accesses just cause a SIGSEGV */
 	if (user_mode(regs)) {
 		do_trap(regs, SIGSEGV, code, addr);
···
 	if (is_write(regs))
 		flags |= FAULT_FLAG_WRITE;
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, addr);
+	vma = lock_mm_and_find_vma(mm, addr, regs);
 	if (unlikely(!vma)) {
-		bad_area(regs, mm, code, addr);
-		return;
-	}
-	if (likely(vma->vm_start <= addr))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, mm, code, addr);
-		return;
-	}
-	if (unlikely(expand_stack(vma, addr))) {
-		bad_area(regs, mm, code, addr);
+		bad_area_nosemaphore(regs, mm, code, addr);
 		return;
 	}
···
 	 * Ok, we have a good vm_area for this memory access, so
 	 * we can handle it.
 	 */
-good_area:
 	code = SEGV_ACCERR;
 
 	if (unlikely(access_error(regs, vma))) {
-		bad_area(regs, mm, code, addr);
+		mmap_read_unlock(mm);
+		bad_area_nosemaphore(regs, mm, code, addr);
 		return;
 	}
+1
arch/hexagon/Kconfig
···
 	select GENERIC_SMP_IDLE_THREAD
 	select STACKTRACE_SUPPORT
 	select GENERIC_CLOCKEVENTS_BROADCAST
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select GENERIC_CPU_DEVICES
 	select ARCH_WANT_LD_ORPHAN_WARN
+4 -14
arch/hexagon/mm/vm_fault.c
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, address);
-	if (!vma)
-		goto bad_area;
+	vma = lock_mm_and_find_vma(mm, address, regs);
+	if (unlikely(!vma))
+		goto bad_area_nosemaphore;
 
-	if (vma->vm_start <= address)
-		goto good_area;
-
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-
-	if (expand_stack(vma, address))
-		goto bad_area;
-
-good_area:
 	/* Address space is OK.  Now check access rights. */
 	si_code = SEGV_ACCERR;
 
···
 bad_area:
 	mmap_read_unlock(mm);
 
+bad_area_nosemaphore:
 	if (user_mode(regs)) {
 		force_sig_fault(SIGSEGV, si_code, (void __user *)address);
 		return;
+6 -30
arch/ia64/mm/fault.c
···
 	 * register backing store that needs to expand upwards, in
 	 * this case vma will be null, but prev_vma will ne non-null
 	 */
-	if (( !vma && prev_vma ) || (address < vma->vm_start) )
-		goto check_expansion;
+	if (( !vma && prev_vma ) || (address < vma->vm_start) ) {
+		vma = expand_stack(mm, address);
+		if (!vma)
+			goto bad_area_nosemaphore;
+	}
 
-  good_area:
 	code = SEGV_ACCERR;
 
 	/* OK, we've got a good vm_area for this memory area.  Check the access permissions: */
···
 	mmap_read_unlock(mm);
 	return;
 
-  check_expansion:
-	if (!(prev_vma && (prev_vma->vm_flags & VM_GROWSUP) && (address == prev_vma->vm_end))) {
-		if (!vma)
-			goto bad_area;
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			goto bad_area;
-		if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start)
-		    || REGION_OFFSET(address) >= RGN_MAP_LIMIT)
-			goto bad_area;
-		if (expand_stack(vma, address))
-			goto bad_area;
-	} else {
-		vma = prev_vma;
-		if (REGION_NUMBER(address) != REGION_NUMBER(vma->vm_start)
-		    || REGION_OFFSET(address) >= RGN_MAP_LIMIT)
-			goto bad_area;
-		/*
-		 * Since the register backing store is accessed sequentially,
-		 * we disallow growing it by more than a page at a time.
-		 */
-		if (address > vma->vm_end + PAGE_SIZE - sizeof(long))
-			goto bad_area;
-		if (expand_upwards(vma, address))
-			goto bad_area;
-	}
-	goto good_area;
-
   bad_area:
 	mmap_read_unlock(mm);
+  bad_area_nosemaphore:
 	if ((isr & IA64_ISR_SP)
 	    || ((isr & IA64_ISR_NA) && (isr & IA64_ISR_CODE_MASK) == IA64_ISR_CODE_LFETCH))
 	{
+1
arch/loongarch/Kconfig
···
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN if !SMP
 	select IRQ_FORCED_THREADING
 	select IRQ_LOONGARCH_CPU
+	select LOCK_MM_AND_FIND_VMA
 	select MMU_GATHER_MERGE_VMAS if MMU
 	select MODULES_USE_ELF_RELA if MODULES
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK
+6 -10
arch/loongarch/mm/fault.c
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, address);
-	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-	if (!expand_stack(vma, address))
-		goto good_area;
+	vma = lock_mm_and_find_vma(mm, address, regs);
+	if (unlikely(!vma))
+		goto bad_area_nosemaphore;
+	goto good_area;
+
 /*
  * Something tried to access memory that isn't in our memory map..
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
 	mmap_read_unlock(mm);
+bad_area_nosemaphore:
 	do_sigsegv(regs, write, address, si_code);
 	return;
 
+6 -3
arch/m68k/mm/fault.c
···
 		if (address + 256 < rdusp())
 			goto map_err;
 	}
-	if (expand_stack(vma, address))
-		goto map_err;
+	vma = expand_stack(mm, address);
+	if (!vma)
+		goto map_err_nosemaphore;
 
 	/*
 	 * Ok, we have a good vm_area for this memory access, so
···
 		goto send_sig;
 
 map_err:
+	mmap_read_unlock(mm);
+map_err_nosemaphore:
 	current->thread.signo = SIGSEGV;
 	current->thread.code = SEGV_MAPERR;
 	current->thread.faddr = address;
-	goto send_sig;
+	return send_fault_sig(regs);
 
 acc_err:
 	current->thread.signo = SIGSEGV;
+3 -2
arch/microblaze/mm/fault.c
···
 		    && (kernel_mode(regs) || !store_updates_sp(regs)))
 			goto bad_area;
 	}
-	if (expand_stack(vma, address))
-		goto bad_area;
+	vma = expand_stack(mm, address);
+	if (!vma)
+		goto bad_area_nosemaphore;
 
 good_area:
 	code = SEGV_ACCERR;
+1
arch/mips/Kconfig
···
 	select HAVE_VIRT_CPU_ACCOUNTING_GEN if 64BIT || !SMP
 	select IRQ_FORCED_THREADING
 	select ISA if EISA
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_REL if MODULES
 	select MODULES_USE_ELF_RELA if MODULES && 64BIT
 	select PERF_USE_VMALLOC
+2 -10
arch/mips/mm/fault.c
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-	if (expand_stack(vma, address))
-		goto bad_area;
+		goto bad_area_nosemaphore;
 /*
  * Ok, we have a good vm_area for this memory access, so
  * we can handle it..
  */
-good_area:
 	si_code = SEGV_ACCERR;
 
 	if (write) {
+1
arch/nios2/Kconfig
···
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_KGDB
 	select IRQ_DOMAIN
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select OF
 	select OF_EARLY_FLATTREE
+2 -15
arch/nios2/mm/fault.c
···
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-	if (!mmap_read_trylock(mm)) {
-		if (!user_mode(regs) && !search_exception_tables(regs->ea))
-			goto bad_area_nosemaphore;
 retry:
-		mmap_read_lock(mm);
-	}
-
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (!vma)
-		goto bad_area;
-	if (vma->vm_start <= address)
-		goto good_area;
-	if (!(vma->vm_flags & VM_GROWSDOWN))
-		goto bad_area;
-	if (expand_stack(vma, address))
-		goto bad_area;
+		goto bad_area_nosemaphore;
 /*
  * Ok, we have a good vm_area for this memory access, so
  * we can handle it..
  */
-good_area:
 	code = SEGV_ACCERR;
 
 	switch (cause) {
+3 -2
arch/openrisc/mm/fault.c
···
 		if (address + PAGE_SIZE < regs->sp)
 			goto bad_area;
 	}
-	if (expand_stack(vma, address))
-		goto bad_area;
+	vma = expand_stack(mm, address);
+	if (!vma)
+		goto bad_area_nosemaphore;
 
 	/*
 	 * Ok, we have a good vm_area for this memory access, so
+11 -12
arch/parisc/mm/fault.c
···
 retry:
 	mmap_read_lock(mm);
 	vma = find_vma_prev(mm, address, &prev_vma);
-	if (!vma || address < vma->vm_start)
-		goto check_expansion;
+	if (!vma || address < vma->vm_start) {
+		if (!prev_vma || !(prev_vma->vm_flags & VM_GROWSUP))
+			goto bad_area;
+		vma = expand_stack(mm, address);
+		if (!vma)
+			goto bad_area_nosemaphore;
+	}
+
 	/*
 	 * Ok, we have a good vm_area for this memory access. We still need to
 	 * check the access permissions.
 	 */
-
-good_area:
 
 	if ((vma->vm_flags & acc_type) != acc_type)
 		goto bad_area;
···
 	mmap_read_unlock(mm);
 	return;
 
-check_expansion:
-	vma = prev_vma;
-	if (vma && (expand_stack(vma, address) == 0))
-		goto good_area;
-
 	/*
 	 * Something tried to access memory that isn't in our memory map..
 	 */
 bad_area:
 	mmap_read_unlock(mm);
 
+bad_area_nosemaphore:
 	if (user_mode(regs)) {
 		int signo, si_code;
 
···
 {
 	unsigned long insn = regs->iir;
 	int breg, treg, xreg, val = 0;
-	struct vm_area_struct *vma, *prev_vma;
+	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	unsigned long address;
···
 	/* Search for VMA */
 	address = regs->ior;
 	mmap_read_lock(mm);
-	vma = find_vma_prev(mm, address, &prev_vma);
+	vma = vma_lookup(mm, address);
 	mmap_read_unlock(mm);
 
 	/*
···
 	 */
 	acc_type = (insn & 0x40) ? VM_WRITE : VM_READ;
 	if (vma
-	    && address >= vma->vm_start
 	    && (vma->vm_flags & acc_type) == acc_type)
 		val = 1;
 }
+1
arch/powerpc/Kconfig
···
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select KASAN_VMALLOC if KASAN && MODULES
+	select LOCK_MM_AND_FIND_VMA
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
+3 -11
arch/powerpc/mm/copro_fault.c
···
 	if (mm->pgd == NULL)
 		return -EFAULT;
 
-	mmap_read_lock(mm);
-	ret = -EFAULT;
-	vma = find_vma(mm, ea);
+	vma = lock_mm_and_find_vma(mm, ea, NULL);
 	if (!vma)
-		goto out_unlock;
+		return -EFAULT;
 
-	if (ea < vma->vm_start) {
-		if (!(vma->vm_flags & VM_GROWSDOWN))
-			goto out_unlock;
-		if (expand_stack(vma, ea))
-			goto out_unlock;
-	}
-
+	ret = -EFAULT;
 	is_write = dsisr & DSISR_ISSTORE;
 	if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
+3 -36
arch/powerpc/mm/fault.c
···
 	return __bad_area_nosemaphore(regs, address, si_code);
 }
 
-static noinline int bad_area(struct pt_regs *regs, unsigned long address)
-{
-	return __bad_area(regs, address, SEGV_MAPERR);
-}
-
 static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address,
 				    struct vm_area_struct *vma)
 {
···
 	 * we will deadlock attempting to validate the fault against the
 	 * address space.  Luckily the kernel only validly references user
 	 * space from well defined areas of code, which are listed in the
-	 * exceptions table.
-	 *
-	 * As the vast majority of faults will be valid we will only perform
-	 * the source reference check when there is a possibility of a deadlock.
-	 * Attempt to lock the address space, if we cannot we then validate the
-	 * source.  If this is invalid we can skip the address space check,
-	 * thus avoiding the deadlock.
+	 * exceptions table. lock_mm_and_find_vma() handles that logic.
 	 */
-	if (unlikely(!mmap_read_trylock(mm))) {
-		if (!is_user && !search_exception_tables(regs->nip))
-			return bad_area_nosemaphore(regs, address);
-
 retry:
-		mmap_read_lock(mm);
-	} else {
-		/*
-		 * The above down_read_trylock() might have succeeded in
-		 * which case we'll have missed the might_sleep() from
-		 * down_read():
-		 */
-		might_sleep();
-	}
-
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (unlikely(!vma))
-		return bad_area(regs, address);
-
-	if (unlikely(vma->vm_start > address)) {
-		if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
-			return bad_area(regs, address);
-
-		if (unlikely(expand_stack(vma, address)))
-			return bad_area(regs, address);
-	}
+		return bad_area_nosemaphore(regs, address);
 
 	if (unlikely(access_pkey_error(is_write, is_exec,
 				       (error_code & DSISR_KEYFAULT), vma)))
+1
arch/riscv/Kconfig
···
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
 	select KASAN_VMALLOC if KASAN
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA if MODULES
 	select MODULE_SECTIONS if MODULES
 	select OF
+13 -18
arch/riscv/mm/fault.c
···
 		BUG();
 }
 
-static inline void bad_area(struct pt_regs *regs, struct mm_struct *mm, int code, unsigned long addr)
+static inline void
+bad_area_nosemaphore(struct pt_regs *regs, int code, unsigned long addr)
 {
 	/*
 	 * Something tried to access memory that isn't in our memory map.
 	 * Fix it, but check if it's kernel or user first.
 	 */
-	mmap_read_unlock(mm);
 	/* User mode accesses just cause a SIGSEGV */
 	if (user_mode(regs)) {
 		do_trap(regs, SIGSEGV, code, addr);
···
 	}
 
 	no_context(regs, addr);
+}
+
+static inline void
+bad_area(struct pt_regs *regs, struct mm_struct *mm, int code,
+	 unsigned long addr)
+{
+	mmap_read_unlock(mm);
+
+	bad_area_nosemaphore(regs, code, addr);
 }
 
 static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr)
···
 	else if (cause == EXC_INST_PAGE_FAULT)
 		flags |= FAULT_FLAG_INSTRUCTION;
 retry:
-	mmap_read_lock(mm);
-	vma = find_vma(mm, addr);
+	vma = lock_mm_and_find_vma(mm, addr, regs);
 	if (unlikely(!vma)) {
 		tsk->thread.bad_cause = cause;
-		bad_area(regs, mm, code, addr);
-		return;
-	}
-	if (likely(vma->vm_start <= addr))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		tsk->thread.bad_cause = cause;
-		bad_area(regs, mm, code, addr);
-		return;
-	}
-	if (unlikely(expand_stack(vma, addr))) {
-		tsk->thread.bad_cause = cause;
-		bad_area(regs, mm, code, addr);
+		bad_area_nosemaphore(regs, code, addr);
 		return;
 	}
···
 	 * Ok, we have a good vm_area for this memory access, so
 	 * we can handle it.
 	 */
-good_area:
 	code = SEGV_ACCERR;
 
 	if (unlikely(access_error(cause, vma))) {
+3 -2
arch/s390/mm/fault.c
···
 	if (unlikely(vma->vm_start > address)) {
 		if (!(vma->vm_flags & VM_GROWSDOWN))
 			goto out_up;
-		if (expand_stack(vma, address))
-			goto out_up;
+		vma = expand_stack(mm, address);
+		if (!vma)
+			goto out;
 	}
 
 	/*
+1
arch/sh/Kconfig
···
 	select HAVE_STACKPROTECTOR
 	select HAVE_SYSCALL_TRACEPOINTS
 	select IRQ_FORCED_THREADING
+	select LOCK_MM_AND_FIND_VMA
 	select MODULES_USE_ELF_RELA
 	select NEED_SG_DMA_LENGTH
 	select NO_DMA if !MMU && !DMA_COHERENT
+2 -15
arch/sh/mm/fault.c
···
 	}
 
 retry:
-	mmap_read_lock(mm);
-
-	vma = find_vma(mm, address);
+	vma = lock_mm_and_find_vma(mm, address, regs);
 	if (unlikely(!vma)) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (likely(vma->vm_start <= address))
-		goto good_area;
-	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, error_code, address);
-		return;
-	}
-	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, error_code, address);
+		bad_area_nosemaphore(regs, error_code, address);
 		return;
 	}
···
 	 * Ok, we have a good vm_area for this memory access, so
 	 * we can handle it..
 	 */
-good_area:
 	if (unlikely(access_error(error_code, vma))) {
 		bad_area_access_error(regs, error_code, address);
 		return;
+1
arch/sparc/Kconfig
···
 	select DMA_DIRECT_REMAP
 	select GENERIC_ATOMIC64
 	select HAVE_UID16
+	select LOCK_MM_AND_FIND_VMA
 	select OLD_SIGACTION
 	select ZONE_DMA
 
+8 -24
arch/sparc/mm/fault_32.c
··· 143 143 if (pagefault_disabled() || !mm) 144 144 goto no_context; 145 145 146 + if (!from_user && address >= PAGE_OFFSET) 147 + goto no_context; 148 + 146 149 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); 147 150 148 151 retry: 149 - mmap_read_lock(mm); 150 - 151 - if (!from_user && address >= PAGE_OFFSET) 152 - goto bad_area; 153 - 154 - vma = find_vma(mm, address); 152 + vma = lock_mm_and_find_vma(mm, address, regs); 155 153 if (!vma) 156 - goto bad_area; 157 - if (vma->vm_start <= address) 158 - goto good_area; 159 - if (!(vma->vm_flags & VM_GROWSDOWN)) 160 - goto bad_area; 161 - if (expand_stack(vma, address)) 162 - goto bad_area; 154 + goto bad_area_nosemaphore; 163 155 /* 164 156 * Ok, we have a good vm_area for this memory access, so 165 157 * we can handle it.. 166 158 */ 167 - good_area: 168 159 code = SEGV_ACCERR; 169 160 if (write) { 170 161 if (!(vma->vm_flags & VM_WRITE)) ··· 312 321 313 322 code = SEGV_MAPERR; 314 323 315 - mmap_read_lock(mm); 316 - vma = find_vma(mm, address); 324 + vma = lock_mm_and_find_vma(mm, address, regs); 317 325 if (!vma) 318 - goto bad_area; 319 - if (vma->vm_start <= address) 320 - goto good_area; 321 - if (!(vma->vm_flags & VM_GROWSDOWN)) 322 - goto bad_area; 323 - if (expand_stack(vma, address)) 324 - goto bad_area; 325 - good_area: 326 + goto bad_area_nosemaphore; 326 327 code = SEGV_ACCERR; 327 328 if (write) { 328 329 if (!(vma->vm_flags & VM_WRITE)) ··· 333 350 return; 334 351 bad_area: 335 352 mmap_read_unlock(mm); 353 + bad_area_nosemaphore: 336 354 __do_fault_siginfo(code, SIGSEGV, tsk->thread.kregs, address); 337 355 return; 338 356
+5 -3
arch/sparc/mm/fault_64.c
··· 386 386 goto bad_area; 387 387 } 388 388 } 389 - if (expand_stack(vma, address)) 390 - goto bad_area; 389 + vma = expand_stack(mm, address); 390 + if (!vma) 391 + goto bad_area_nosemaphore; 391 392 /* 392 393 * Ok, we have a good vm_area for this memory access, so 393 394 * we can handle it.. ··· 491 490 * Fix it, but check if it's kernel or user first.. 492 491 */ 493 492 bad_area: 494 - insn = get_fault_insn(regs, insn); 495 493 mmap_read_unlock(mm); 494 + bad_area_nosemaphore: 495 + insn = get_fault_insn(regs, insn); 496 496 497 497 handle_kernel_fault: 498 498 do_kernel_fault(regs, si_code, fault_code, insn, address);
+6 -5
arch/um/kernel/trap.c
··· 47 47 vma = find_vma(mm, address); 48 48 if (!vma) 49 49 goto out; 50 - else if (vma->vm_start <= address) 50 + if (vma->vm_start <= address) 51 51 goto good_area; 52 - else if (!(vma->vm_flags & VM_GROWSDOWN)) 52 + if (!(vma->vm_flags & VM_GROWSDOWN)) 53 53 goto out; 54 - else if (is_user && !ARCH_IS_STACKGROW(address)) 54 + if (is_user && !ARCH_IS_STACKGROW(address)) 55 55 goto out; 56 - else if (expand_stack(vma, address)) 57 - goto out; 56 + vma = expand_stack(mm, address); 57 + if (!vma) 58 + goto out_nosemaphore; 58 59 59 60 good_area: 60 61 *code_out = SEGV_ACCERR;
+1
arch/x86/Kconfig
··· 279 279 select HOTPLUG_SMT if SMP 280 280 select HOTPLUG_SPLIT_STARTUP if SMP && X86_32 281 281 select IRQ_FORCED_THREADING 282 + select LOCK_MM_AND_FIND_VMA 282 283 select NEED_PER_CPU_EMBED_FIRST_CHUNK 283 284 select NEED_PER_CPU_PAGE_FIRST_CHUNK 284 285 select NEED_SG_DMA_LENGTH
+2 -50
arch/x86/mm/fault.c
··· 880 880 __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); 881 881 } 882 882 883 - static noinline void 884 - bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address) 885 - { 886 - __bad_area(regs, error_code, address, 0, SEGV_MAPERR); 887 - } 888 - 889 883 static inline bool bad_area_access_from_pkeys(unsigned long error_code, 890 884 struct vm_area_struct *vma) 891 885 { ··· 1360 1366 lock_mmap: 1361 1367 #endif /* CONFIG_PER_VMA_LOCK */ 1362 1368 1363 - /* 1364 - * Kernel-mode access to the user address space should only occur 1365 - * on well-defined single instructions listed in the exception 1366 - * tables. But, an erroneous kernel fault occurring outside one of 1367 - * those areas which also holds mmap_lock might deadlock attempting 1368 - * to validate the fault against the address space. 1369 - * 1370 - * Only do the expensive exception table search when we might be at 1371 - * risk of a deadlock. This happens if we 1372 - * 1. Failed to acquire mmap_lock, and 1373 - * 2. The access did not originate in userspace. 1374 - */ 1375 - if (unlikely(!mmap_read_trylock(mm))) { 1376 - if (!user_mode(regs) && !search_exception_tables(regs->ip)) { 1377 - /* 1378 - * Fault from code in kernel from 1379 - * which we do not expect faults. 
1380 - */ 1381 - bad_area_nosemaphore(regs, error_code, address); 1382 - return; 1383 - } 1384 1369 retry: 1385 - mmap_read_lock(mm); 1386 - } else { 1387 - /* 1388 - * The above down_read_trylock() might have succeeded in 1389 - * which case we'll have missed the might_sleep() from 1390 - * down_read(): 1391 - */ 1392 - might_sleep(); 1393 - } 1394 - 1395 - vma = find_vma(mm, address); 1370 + vma = lock_mm_and_find_vma(mm, address, regs); 1396 1371 if (unlikely(!vma)) { 1397 - bad_area(regs, error_code, address); 1398 - return; 1399 - } 1400 - if (likely(vma->vm_start <= address)) 1401 - goto good_area; 1402 - if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { 1403 - bad_area(regs, error_code, address); 1404 - return; 1405 - } 1406 - if (unlikely(expand_stack(vma, address))) { 1407 - bad_area(regs, error_code, address); 1372 + bad_area_nosemaphore(regs, error_code, address); 1408 1373 return; 1409 1374 } 1410 1375 ··· 1371 1418 * Ok, we have a good vm_area for this memory access, so 1372 1419 * we can handle it.. 1373 1420 */ 1374 - good_area: 1375 1421 if (unlikely(access_error(error_code, vma))) { 1376 1422 bad_area_access_error(regs, error_code, address, vma); 1377 1423 return;
+1
arch/xtensa/Kconfig
··· 49 49 select HAVE_SYSCALL_TRACEPOINTS 50 50 select HAVE_VIRT_CPU_ACCOUNTING_GEN 51 51 select IRQ_DOMAIN 52 + select LOCK_MM_AND_FIND_VMA 52 53 select MODULES_USE_ELF_RELA 53 54 select PERF_USE_VMALLOC 54 55 select TRACE_IRQFLAGS_SUPPORT
+3 -11
arch/xtensa/mm/fault.c
··· 130 130 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); 131 131 132 132 retry: 133 - mmap_read_lock(mm); 134 - vma = find_vma(mm, address); 135 - 133 + vma = lock_mm_and_find_vma(mm, address, regs); 136 134 if (!vma) 137 - goto bad_area; 138 - if (vma->vm_start <= address) 139 - goto good_area; 140 - if (!(vma->vm_flags & VM_GROWSDOWN)) 141 - goto bad_area; 142 - if (expand_stack(vma, address)) 143 - goto bad_area; 135 + goto bad_area_nosemaphore; 144 136 145 137 /* Ok, we have a good vm_area for this memory access, so 146 138 * we can handle it.. 147 139 */ 148 140 149 - good_area: 150 141 code = SEGV_ACCERR; 151 142 152 143 if (is_write) { ··· 196 205 */ 197 206 bad_area: 198 207 mmap_read_unlock(mm); 208 + bad_area_nosemaphore: 199 209 if (user_mode(regs)) { 200 210 force_sig_fault(SIGSEGV, code, (void *) address); 201 211 return;
+2 -2
drivers/iommu/amd/iommu_v2.c
··· 485 485 flags |= FAULT_FLAG_REMOTE; 486 486 487 487 mmap_read_lock(mm); 488 - vma = find_extend_vma(mm, address); 489 - if (!vma || address < vma->vm_start) 488 + vma = vma_lookup(mm, address); 489 + if (!vma) 490 490 /* failed to get a vma in the right range */ 491 491 goto out; 492 492
+1 -1
drivers/iommu/iommu-sva.c
··· 175 175 176 176 mmap_read_lock(mm); 177 177 178 - vma = find_extend_vma(mm, prm->addr); 178 + vma = vma_lookup(mm, prm->addr); 179 179 if (!vma) 180 180 /* Unmapped area */ 181 181 goto out_put_mm;
+3 -3
fs/binfmt_elf.c
··· 320 320 * Grow the stack manually; some architectures have a limit on how 321 321 * far ahead a user-space access may be in order to grow the stack. 322 322 */ 323 - if (mmap_read_lock_killable(mm)) 323 + if (mmap_write_lock_killable(mm)) 324 324 return -EINTR; 325 - vma = find_extend_vma(mm, bprm->p); 326 - mmap_read_unlock(mm); 325 + vma = find_extend_vma_locked(mm, bprm->p); 326 + mmap_write_unlock(mm); 327 327 if (!vma) 328 328 return -EFAULT; 329 329
+22 -16
fs/exec.c
··· 200 200 int write) 201 201 { 202 202 struct page *page; 203 + struct vm_area_struct *vma = bprm->vma; 204 + struct mm_struct *mm = bprm->mm; 203 205 int ret; 204 - unsigned int gup_flags = 0; 205 206 206 - #ifdef CONFIG_STACK_GROWSUP 207 - if (write) { 208 - ret = expand_downwards(bprm->vma, pos); 209 - if (ret < 0) 207 + /* 208 + * Avoid relying on expanding the stack down in GUP (which 209 + * does not work for STACK_GROWSUP anyway), and just do it 210 + * by hand ahead of time. 211 + */ 212 + if (write && pos < vma->vm_start) { 213 + mmap_write_lock(mm); 214 + ret = expand_downwards(vma, pos); 215 + if (unlikely(ret < 0)) { 216 + mmap_write_unlock(mm); 210 217 return NULL; 211 - } 212 - #endif 213 - 214 - if (write) 215 - gup_flags |= FOLL_WRITE; 218 + } 219 + mmap_write_downgrade(mm); 220 + } else 221 + mmap_read_lock(mm); 216 222 217 223 /* 218 224 * We are doing an exec(). 'current' is the process 219 - * doing the exec and bprm->mm is the new process's mm. 225 + * doing the exec and 'mm' is the new process's mm. 220 226 */ 221 - mmap_read_lock(bprm->mm); 222 - ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags, 227 + ret = get_user_pages_remote(mm, pos, 1, 228 + write ? FOLL_WRITE : 0, 223 229 &page, NULL); 224 - mmap_read_unlock(bprm->mm); 230 + mmap_read_unlock(mm); 225 231 if (ret <= 0) 226 232 return NULL; 227 233 228 234 if (write) 229 - acct_arg_size(bprm, vma_pages(bprm->vma)); 235 + acct_arg_size(bprm, vma_pages(vma)); 230 236 231 237 return page; 232 238 } ··· 859 853 stack_base = vma->vm_end - stack_expand; 860 854 #endif 861 855 current->mm->start_stack = bprm->p; 862 - ret = expand_stack(vma, stack_base); 856 + ret = expand_stack_locked(vma, stack_base); 863 857 if (ret) 864 858 ret = -EFAULT; 865 859
+7 -9
include/linux/mm.h
··· 2334 2334 pgoff_t start, pgoff_t nr, bool even_cows); 2335 2335 void unmap_mapping_range(struct address_space *mapping, 2336 2336 loff_t const holebegin, loff_t const holelen, int even_cows); 2337 + struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, 2338 + unsigned long address, struct pt_regs *regs); 2337 2339 #else 2338 2340 static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 2339 2341 unsigned long address, unsigned int flags, ··· 3230 3228 3231 3229 extern unsigned long stack_guard_gap; 3232 3230 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ 3233 - extern int expand_stack(struct vm_area_struct *vma, unsigned long address); 3231 + int expand_stack_locked(struct vm_area_struct *vma, unsigned long address); 3232 + struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr); 3234 3233 3235 3234 /* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ 3236 - extern int expand_downwards(struct vm_area_struct *vma, 3237 - unsigned long address); 3238 - #if VM_GROWSUP 3239 - extern int expand_upwards(struct vm_area_struct *vma, unsigned long address); 3240 - #else 3241 - #define expand_upwards(vma, address) (0) 3242 - #endif 3235 + int expand_downwards(struct vm_area_struct *vma, unsigned long address); 3243 3236 3244 3237 /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ 3245 3238 extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr); ··· 3329 3332 unsigned long start, unsigned long end); 3330 3333 #endif 3331 3334 3332 - struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); 3335 + struct vm_area_struct *find_extend_vma_locked(struct mm_struct *, 3336 + unsigned long addr); 3333 3337 int remap_pfn_range(struct vm_area_struct *, unsigned long addr, 3334 3338 unsigned long pfn, unsigned long size, pgprot_t); 3335 3339 int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
+4
mm/Kconfig
··· 1222 1222 This feature allows locking each virtual memory area separately when 1223 1223 handling page faults instead of taking mmap_lock. 1224 1224 1225 + config LOCK_MM_AND_FIND_VMA 1226 + bool 1227 + depends on !STACK_GROWSUP 1228 + 1225 1229 source "mm/damon/Kconfig" 1226 1230 1227 1231 endmenu
+11 -3
mm/gup.c
··· 1168 1168 1169 1169 /* first iteration or cross vma bound */ 1170 1170 if (!vma || start >= vma->vm_end) { 1171 - vma = find_extend_vma(mm, start); 1171 + vma = find_vma(mm, start); 1172 + if (vma && (start < vma->vm_start)) { 1173 + WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN); 1174 + vma = NULL; 1175 + } 1172 1176 if (!vma && in_gate_area(mm, start)) { 1173 1177 ret = get_gate_page(mm, start & PAGE_MASK, 1174 1178 gup_flags, &vma, ··· 1337 1333 fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; 1338 1334 1339 1335 retry: 1340 - vma = find_extend_vma(mm, address); 1341 - if (!vma || address < vma->vm_start) 1336 + vma = find_vma(mm, address); 1337 + if (!vma) 1342 1338 return -EFAULT; 1339 + if (address < vma->vm_start) { 1340 + WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN); 1341 + return -EFAULT; 1342 + } 1343 1343 1344 1344 if (!vma_permits_fault(vma, fault_flags)) 1345 1345 return -EFAULT;
+138 -12
mm/memory.c
··· 5245 5245 } 5246 5246 EXPORT_SYMBOL_GPL(handle_mm_fault); 5247 5247 5248 + #ifdef CONFIG_LOCK_MM_AND_FIND_VMA 5249 + #include <linux/extable.h> 5250 + 5251 + static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) 5252 + { 5253 + /* Even if this succeeds, make it clear we *might* have slept */ 5254 + if (likely(mmap_read_trylock(mm))) { 5255 + might_sleep(); 5256 + return true; 5257 + } 5258 + 5259 + if (regs && !user_mode(regs)) { 5260 + unsigned long ip = instruction_pointer(regs); 5261 + if (!search_exception_tables(ip)) 5262 + return false; 5263 + } 5264 + 5265 + return !mmap_read_lock_killable(mm); 5266 + } 5267 + 5268 + static inline bool mmap_upgrade_trylock(struct mm_struct *mm) 5269 + { 5270 + /* 5271 + * We don't have this operation yet. 5272 + * 5273 + * It should be easy enough to do: it's basically an 5274 + * atomic_long_try_cmpxchg_acquire() 5275 + * from RWSEM_READER_BIAS -> RWSEM_WRITER_LOCKED, but 5276 + * it also needs the proper lockdep magic etc. 5277 + */ 5278 + return false; 5279 + } 5280 + 5281 + static inline bool upgrade_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs) 5282 + { 5283 + mmap_read_unlock(mm); 5284 + if (regs && !user_mode(regs)) { 5285 + unsigned long ip = instruction_pointer(regs); 5286 + if (!search_exception_tables(ip)) 5287 + return false; 5288 + } 5289 + return !mmap_write_lock_killable(mm); 5290 + } 5291 + 5292 + /* 5293 + * Helper for page fault handling. 5294 + * 5295 + * This is kind of equivalent to "mmap_read_lock()" followed 5296 + * by "find_extend_vma()", except it's a lot more careful about 5297 + * the locking (and will drop the lock on failure). 5298 + * 5299 + * For example, if we have a kernel bug that causes a page 5300 + * fault, we don't want to just use mmap_read_lock() to get 5301 + * the mm lock, because that would deadlock if the bug were 5302 + * to happen while we're holding the mm lock for writing. 
5303 + * 5304 + * So this checks the exception tables on kernel faults in 5305 + * order to only do this all for instructions that are actually 5306 + * expected to fault. 5307 + * 5308 + * We can also actually take the mm lock for writing if we 5309 + * need to extend the vma, which helps the VM layer a lot. 5310 + */ 5311 + struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm, 5312 + unsigned long addr, struct pt_regs *regs) 5313 + { 5314 + struct vm_area_struct *vma; 5315 + 5316 + if (!get_mmap_lock_carefully(mm, regs)) 5317 + return NULL; 5318 + 5319 + vma = find_vma(mm, addr); 5320 + if (likely(vma && (vma->vm_start <= addr))) 5321 + return vma; 5322 + 5323 + /* 5324 + * Well, dang. We might still be successful, but only 5325 + * if we can extend a vma to do so. 5326 + */ 5327 + if (!vma || !(vma->vm_flags & VM_GROWSDOWN)) { 5328 + mmap_read_unlock(mm); 5329 + return NULL; 5330 + } 5331 + 5332 + /* 5333 + * We can try to upgrade the mmap lock atomically, 5334 + * in which case we can continue to use the vma 5335 + * we already looked up. 5336 + * 5337 + * Otherwise we'll have to drop the mmap lock and 5338 + * re-take it, and also look up the vma again, 5339 + * re-checking it. 5340 + */ 5341 + if (!mmap_upgrade_trylock(mm)) { 5342 + if (!upgrade_mmap_lock_carefully(mm, regs)) 5343 + return NULL; 5344 + 5345 + vma = find_vma(mm, addr); 5346 + if (!vma) 5347 + goto fail; 5348 + if (vma->vm_start <= addr) 5349 + goto success; 5350 + if (!(vma->vm_flags & VM_GROWSDOWN)) 5351 + goto fail; 5352 + } 5353 + 5354 + if (expand_stack_locked(vma, addr)) 5355 + goto fail; 5356 + 5357 + success: 5358 + mmap_write_downgrade(mm); 5359 + return vma; 5360 + 5361 + fail: 5362 + mmap_write_unlock(mm); 5363 + return NULL; 5364 + } 5365 + #endif 5366 + 5248 5367 #ifdef CONFIG_PER_VMA_LOCK 5249 5368 /* 5250 5369 * Lookup and lock a VMA under RCU protection. 
Returned VMA is guaranteed to be ··· 5703 5584 gup_flags, &vma); 5704 5585 5705 5586 if (IS_ERR_OR_NULL(page)) { 5706 - #ifndef CONFIG_HAVE_IOREMAP_PROT 5707 - break; 5708 - #else 5709 - int res = 0; 5587 + /* We might need to expand the stack to access it */ 5588 + vma = vma_lookup(mm, addr); 5589 + if (!vma) { 5590 + vma = expand_stack(mm, addr); 5591 + 5592 + /* mmap_lock was dropped on failure */ 5593 + if (!vma) 5594 + return buf - old_buf; 5595 + 5596 + /* Try again if stack expansion worked */ 5597 + continue; 5598 + } 5599 + 5710 5600 5711 5601 /* 5712 5602 * Check if this is a VM_IO | VM_PFNMAP VMA, which 5713 5603 * we can access using slightly different code. 5714 5604 */ 5715 - vma = vma_lookup(mm, addr); 5716 - if (!vma) 5717 - break; 5605 + bytes = 0; 5606 + #ifdef CONFIG_HAVE_IOREMAP_PROT 5718 5607 if (vma->vm_ops && vma->vm_ops->access) 5719 - res = vma->vm_ops->access(vma, addr, buf, 5720 - len, write); 5721 - if (res <= 0) 5722 - break; 5723 - bytes = res; 5608 + bytes = vma->vm_ops->access(vma, addr, buf, 5609 + len, write); 5724 5610 #endif 5611 + if (bytes <= 0) 5612 + break; 5725 5613 } else { 5726 5614 bytes = len; 5727 5615 offset = addr & (PAGE_SIZE-1);
+105 -16
mm/mmap.c
··· 1948 1948 * PA-RISC uses this for its stack; IA64 for its Register Backing Store. 1949 1949 * vma is the last one with address > vma->vm_end. Have to extend vma. 1950 1950 */ 1951 - int expand_upwards(struct vm_area_struct *vma, unsigned long address) 1951 + static int expand_upwards(struct vm_area_struct *vma, unsigned long address) 1952 1952 { 1953 1953 struct mm_struct *mm = vma->vm_mm; 1954 1954 struct vm_area_struct *next; ··· 2040 2040 2041 2041 /* 2042 2042 * vma is the first one with address < vma->vm_start. Have to extend vma. 2043 + * mmap_lock held for writing. 2043 2044 */ 2044 2045 int expand_downwards(struct vm_area_struct *vma, unsigned long address) 2045 2046 { ··· 2049 2048 struct vm_area_struct *prev; 2050 2049 int error = 0; 2051 2050 2051 + if (!(vma->vm_flags & VM_GROWSDOWN)) 2052 + return -EFAULT; 2053 + 2052 2054 address &= PAGE_MASK; 2053 - if (address < mmap_min_addr) 2055 + if (address < mmap_min_addr || address < FIRST_USER_ADDRESS) 2054 2056 return -EPERM; 2055 2057 2056 2058 /* Enforce stack_guard_gap */ 2057 2059 prev = mas_prev(&mas, 0); 2058 2060 /* Check that both stack segments have the same anon_vma? 
*/ 2059 - if (prev && !(prev->vm_flags & VM_GROWSDOWN) && 2060 - vma_is_accessible(prev)) { 2061 - if (address - prev->vm_end < stack_guard_gap) 2061 + if (prev) { 2062 + if (!(prev->vm_flags & VM_GROWSDOWN) && 2063 + vma_is_accessible(prev) && 2064 + (address - prev->vm_end < stack_guard_gap)) 2062 2065 return -ENOMEM; 2063 2066 } 2064 2067 ··· 2142 2137 __setup("stack_guard_gap=", cmdline_parse_stack_guard_gap); 2143 2138 2144 2139 #ifdef CONFIG_STACK_GROWSUP 2145 - int expand_stack(struct vm_area_struct *vma, unsigned long address) 2140 + int expand_stack_locked(struct vm_area_struct *vma, unsigned long address) 2146 2141 { 2147 2142 return expand_upwards(vma, address); 2148 2143 } 2149 2144 2150 - struct vm_area_struct * 2151 - find_extend_vma(struct mm_struct *mm, unsigned long addr) 2145 + struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, unsigned long addr) 2152 2146 { 2153 2147 struct vm_area_struct *vma, *prev; 2154 2148 ··· 2155 2151 vma = find_vma_prev(mm, addr, &prev); 2156 2152 if (vma && (vma->vm_start <= addr)) 2157 2153 return vma; 2158 - if (!prev || expand_stack(prev, addr)) 2154 + if (!prev) 2155 + return NULL; 2156 + if (expand_stack_locked(prev, addr)) 2159 2157 return NULL; 2160 2158 if (prev->vm_flags & VM_LOCKED) 2161 2159 populate_vma_page_range(prev, addr, prev->vm_end, NULL); 2162 2160 return prev; 2163 2161 } 2164 2162 #else 2165 - int expand_stack(struct vm_area_struct *vma, unsigned long address) 2163 + int expand_stack_locked(struct vm_area_struct *vma, unsigned long address) 2166 2164 { 2165 + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) 2166 + return -EINVAL; 2167 2167 return expand_downwards(vma, address); 2168 2168 } 2169 2169 2170 - struct vm_area_struct * 2171 - find_extend_vma(struct mm_struct *mm, unsigned long addr) 2170 + struct vm_area_struct *find_extend_vma_locked(struct mm_struct *mm, unsigned long addr) 2172 2171 { 2173 2172 struct vm_area_struct *vma; 2174 2173 unsigned long start; ··· 2182 2175 
return NULL; 2183 2176 if (vma->vm_start <= addr) 2184 2177 return vma; 2185 - if (!(vma->vm_flags & VM_GROWSDOWN)) 2186 - return NULL; 2187 2178 start = vma->vm_start; 2188 - if (expand_stack(vma, addr)) 2179 + if (expand_stack_locked(vma, addr)) 2189 2180 return NULL; 2190 2181 if (vma->vm_flags & VM_LOCKED) 2191 2182 populate_vma_page_range(vma, addr, start, NULL); ··· 2191 2186 } 2192 2187 #endif 2193 2188 2194 - EXPORT_SYMBOL_GPL(find_extend_vma); 2189 + /* 2190 + * IA64 has some horrid mapping rules: it can expand both up and down, 2191 + * but with various special rules. 2192 + * 2193 + * We'll get rid of this architecture eventually, so the ugliness is 2194 + * temporary. 2195 + */ 2196 + #ifdef CONFIG_IA64 2197 + static inline bool vma_expand_ok(struct vm_area_struct *vma, unsigned long addr) 2198 + { 2199 + return REGION_NUMBER(addr) == REGION_NUMBER(vma->vm_start) && 2200 + REGION_OFFSET(addr) < RGN_MAP_LIMIT; 2201 + } 2202 + 2203 + /* 2204 + * IA64 stacks grow down, but there's a special register backing store 2205 + * that can grow up. Only sequentially, though, so the new address must 2206 + * match vm_end. 
2207 + */ 2208 + static inline int vma_expand_up(struct vm_area_struct *vma, unsigned long addr) 2209 + { 2210 + if (!vma_expand_ok(vma, addr)) 2211 + return -EFAULT; 2212 + if (vma->vm_end != (addr & PAGE_MASK)) 2213 + return -EFAULT; 2214 + return expand_upwards(vma, addr); 2215 + } 2216 + 2217 + static inline bool vma_expand_down(struct vm_area_struct *vma, unsigned long addr) 2218 + { 2219 + if (!vma_expand_ok(vma, addr)) 2220 + return -EFAULT; 2221 + return expand_downwards(vma, addr); 2222 + } 2223 + 2224 + #elif defined(CONFIG_STACK_GROWSUP) 2225 + 2226 + #define vma_expand_up(vma,addr) expand_upwards(vma, addr) 2227 + #define vma_expand_down(vma, addr) (-EFAULT) 2228 + 2229 + #else 2230 + 2231 + #define vma_expand_up(vma,addr) (-EFAULT) 2232 + #define vma_expand_down(vma, addr) expand_downwards(vma, addr) 2233 + 2234 + #endif 2235 + 2236 + /* 2237 + * expand_stack(): legacy interface for page faulting. Don't use unless 2238 + * you have to. 2239 + * 2240 + * This is called with the mm locked for reading, drops the lock, takes 2241 + * the lock for writing, tries to look up a vma again, expands it if 2242 + * necessary, and downgrades the lock to reading again. 2243 + * 2244 + * If no vma is found or it can't be expanded, it returns NULL and has 2245 + * dropped the lock. 
2246 + */ 2247 + struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr) 2248 + { 2249 + struct vm_area_struct *vma, *prev; 2250 + 2251 + mmap_read_unlock(mm); 2252 + if (mmap_write_lock_killable(mm)) 2253 + return NULL; 2254 + 2255 + vma = find_vma_prev(mm, addr, &prev); 2256 + if (vma && vma->vm_start <= addr) 2257 + goto success; 2258 + 2259 + if (prev && !vma_expand_up(prev, addr)) { 2260 + vma = prev; 2261 + goto success; 2262 + } 2263 + 2264 + if (vma && !vma_expand_down(vma, addr)) 2265 + goto success; 2266 + 2267 + mmap_write_unlock(mm); 2268 + return NULL; 2269 + 2270 + success: 2271 + mmap_write_downgrade(mm); 2272 + return vma; 2273 + } 2195 2274 2196 2275 /* 2197 2276 * Ok - we have the memory areas we should free on a maple tree so release them,
+7 -10
mm/nommu.c
··· 631 631 EXPORT_SYMBOL(find_vma); 632 632 633 633 /* 634 - * find a VMA 635 - * - we don't extend stack VMAs under NOMMU conditions 636 - */ 637 - struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr) 638 - { 639 - return find_vma(mm, addr); 640 - } 641 - 642 - /* 643 634 * expand a stack to a given address 644 635 * - not supported under NOMMU conditions 645 636 */ 646 - int expand_stack(struct vm_area_struct *vma, unsigned long address) 637 + int expand_stack_locked(struct vm_area_struct *vma, unsigned long addr) 647 638 { 648 639 return -ENOMEM; 640 + } 641 + 642 + struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr) 643 + { 644 + mmap_read_unlock(mm); 645 + return NULL; 649 646 } 650 647 651 648 /*