Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED

userfaultfd_remove() has to be executed before zapping the pagetables, or
UFFDIO_COPY could keep filling pages after zap_page_range() returned,
which would result in non-zero data after a MADV_DONTNEED.
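
For context, the user-visible guarantee at stake can be checked from userspace. The following is only a minimal sketch, assuming Linux and a private anonymous mapping; the helper name dontneed_reads_zero() is hypothetical and not part of this patch.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a private anonymous page reads back
 * as all zeroes after MADV_DONTNEED, 0 if any byte is non-zero, and -1
 * if one of the system calls failed.
 */
int dontneed_reads_zero(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t len, i;
	int ok = 1;

	if (page <= 0)
		return -1;
	len = (size_t)page;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);		/* dirty the page */

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	for (i = 0; i < len; i++)	/* the next access must fault in zeroes */
		if (p[i] != 0)
			ok = 0;

	munmap(p, len);
	return ok;
}
```

A page filled by UFFDIO_COPY after the zap would make this kind of check fail, which is why the event has to be delivered first.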

However, userfaultfd_remove() may have to release the mmap_sem. This was
handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

The fix is to revalidate the vma whenever userfaultfd_remove() had to
release the mmap_sem.
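
The revalidation rules can be modelled in plain userspace C. This is an illustrative sketch only: struct toy_vma, toy_find_vma() and toy_revalidate() are hypothetical stand-ins for the kernel's vma, find_vma() and the checks added to mm/madvise.c, and the model assumes the regions are sorted by address and that the lock has already been retaken.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a vma: [vm_start, vm_end) plus a policy flag. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	int can_dontneed;	/* models can_madv_dontneed_vma() */
};

/* Like find_vma(): the first region whose vm_end is above addr, or NULL. */
const struct toy_vma *toy_find_vma(const struct toy_vma *map, size_t n,
				   unsigned long addr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (addr < map[i].vm_end)
			return &map[i];
	return NULL;
}

/*
 * Revalidate [start, *end) against the current map. On success returns 0,
 * stores the region in *out and clamps *end to the region's vm_end.
 */
int toy_revalidate(const struct toy_vma *map, size_t n, unsigned long start,
		   unsigned long *end, const struct toy_vma **out)
{
	const struct toy_vma *vma = toy_find_vma(map, n, start);

	if (!vma)
		return -ENOMEM;		/* nothing mapped at or above start */
	if (start < vma->vm_start)
		return -ENOMEM;		/* a hole appeared at start */
	if (!vma->can_dontneed)
		return -EINVAL;		/* region no longer eligible */
	if (*end > vma->vm_end)
		*end = vma->vm_end;	/* region was split: clamp, don't fail */
	*out = vma;
	return 0;
}
```

Clamping rather than failing mirrors the reasoning in the patch: the caller walks the next region afterwards, and a repeated event on the remaining range is harmless.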

This also optimizes away an unnecessary down_read/up_read in the
MADV_REMOVE case if UFFD_EVENT_REMOVE had to be delivered.

All of this remains zero runtime cost when CONFIG_USERFAULTFD=n, because
userfaultfd_remove() is then a static inline stub that returns "true",
which is known at build time.
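
Why a constant "true" costs nothing can be illustrated with a small sketch. The names toy_uffd_remove() and toy_dontneed() are hypothetical; the point is only that an optimizing compiler can see the stub's return value and is free to drop the revalidation branch.

```c
#include <stdbool.h>

/*
 * Hypothetical stub, modelled on the CONFIG_USERFAULTFD=n variant:
 * it never drops the lock, so it always reports "true".
 */
static inline bool toy_uffd_remove(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return true;
}

/* Returns 1 if the slow revalidation path ran, 0 otherwise. */
int toy_dontneed(unsigned long start, unsigned long end)
{
	int revalidated = 0;

	if (!toy_uffd_remove(start, end)) {
		/* Unreachable with the stub above; the compiler may discard it. */
		revalidated = 1;
	}
	return revalidated;
}
```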

Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Andrea Arcangeli and committed by Linus Torvalds
70ccb92f 7eb76d45

+47 -13
fs/userfaultfd.c (+3 -6)

···
 	userfaultfd_event_wait_completion(ctx, &ewq);
 }
 
-void userfaultfd_remove(struct vm_area_struct *vma,
-			struct vm_area_struct **prev,
+bool userfaultfd_remove(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end)
 {
 	struct mm_struct *mm = vma->vm_mm;
···
 
 	ctx = vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_REMOVE))
-		return;
+		return true;
 
 	userfaultfd_ctx_get(ctx);
 	up_read(&mm->mmap_sem);
-
-	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
 
 	msg_init(&ewq.msg);
 
···
 
 	userfaultfd_event_wait_completion(ctx, &ewq);
 
-	down_read(&mm->mmap_sem);
+	return false;
 }
 
 static bool has_unmap_ctx(struct userfaultfd_ctx *ctx, struct list_head *unmaps,
include/linux/userfaultfd_k.h (+3 -4)

···
 					unsigned long from, unsigned long to,
 					unsigned long len);
 
-extern void userfaultfd_remove(struct vm_area_struct *vma,
-			       struct vm_area_struct **prev,
+extern bool userfaultfd_remove(struct vm_area_struct *vma,
 			       unsigned long start,
 			       unsigned long end);
 
···
 {
 }
 
-static inline void userfaultfd_remove(struct vm_area_struct *vma,
-				      struct vm_area_struct **prev,
+static inline bool userfaultfd_remove(struct vm_area_struct *vma,
 				      unsigned long start,
 				      unsigned long end)
 {
+	return true;
 }
 
 static inline int userfaultfd_unmap_prep(struct vm_area_struct *vma,
mm/madvise.c (+41 -3)

···
 	if (!can_madv_dontneed_vma(vma))
 		return -EINVAL;
 
-	userfaultfd_remove(vma, prev, start, end);
+	if (!userfaultfd_remove(vma, start, end)) {
+		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
+
+		down_read(&current->mm->mmap_sem);
+		vma = find_vma(current->mm, start);
+		if (!vma)
+			return -ENOMEM;
+		if (start < vma->vm_start) {
+			/*
+			 * This "vma" under revalidation is the one
+			 * with the lowest vma->vm_start where start
+			 * is also < vma->vm_end. If start <
+			 * vma->vm_start it means an hole materialized
+			 * in the user address space within the
+			 * virtual range passed to MADV_DONTNEED.
+			 */
+			return -ENOMEM;
+		}
+		if (!can_madv_dontneed_vma(vma))
+			return -EINVAL;
+		if (end > vma->vm_end) {
+			/*
+			 * Don't fail if end > vma->vm_end. If the old
+			 * vma was splitted while the mmap_sem was
+			 * released the effect of the concurrent
+			 * operation may not cause MADV_DONTNEED to
+			 * have an undefined result. There may be an
+			 * adjacent next vma that we'll walk
+			 * next. userfaultfd_remove() will generate an
+			 * UFFD_EVENT_REMOVE repetition on the
+			 * end-vma->vm_end range, but the manager can
+			 * handle a repetition fine.
+			 */
+			end = vma->vm_end;
+		}
+		VM_WARN_ON(start >= end);
+	}
 	zap_page_range(vma, start, end - start);
 	return 0;
 }
···
 	 * mmap_sem.
 	 */
 	get_file(f);
-	userfaultfd_remove(vma, prev, start, end);
-	up_read(&current->mm->mmap_sem);
+	if (userfaultfd_remove(vma, start, end)) {
+		/* mmap_sem was not released by userfaultfd_remove() */
+		up_read(&current->mm->mmap_sem);
+	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);