Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

gup: document and work around "COW can break either way" issue

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking. Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path. So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
could arguably be seen as a user-space issue.

You only get private COW mappings that could break either way in
situations where user space is doing cooperative things (ie fork()
before an execve() etc), but it _is_ surprising and very subtle, and
fork() is supposed to give you independent address spaces.

So let's treat this as a kernel issue and make the semantics of
get_user_pages() easier to understand. Note that obviously a true
shared mapping will still get a page that can change under us, so this
does _not_ mean that get_user_pages() somehow returns any "stable"
page. ]

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  8 ++++++
 mm/gup.c                                    | 44 ++++++++++++++++-----
 mm/huge_memory.c                            |  7 ++---
 3 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -598,6 +598,14 @@
 				      GFP_KERNEL |
 				      __GFP_NORETRY |
 				      __GFP_NOWARN);
+		/*
+		 * Using __get_user_pages_fast() with a read-only
+		 * access is questionable. A read-only page may be
+		 * COW-broken, and then this might end up giving
+		 * the wrong side of the COW..
+		 *
+		 * We may or may not care.
+		 */
 		if (pvec) /* defer to worker if malloc fails */
 			pinned = __get_user_pages_fast(obj->userptr.ptr,
 						       num_pages,

diff --git a/mm/gup.c b/mm/gup.c
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -382,13 +382,22 @@
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET | FOLL_PIN));
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -1075,9 +1066,11 @@
 			goto out;
 		}
 		if (is_vm_hugetlb_page(vma)) {
+			if (should_force_cow_break(vma, foll_flags))
+				foll_flags |= FOLL_WRITE;
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
 					&start, &nr_pages, i,
-					gup_flags, locked);
+					foll_flags, locked);
 			if (locked && *locked == 0) {
 				/*
 				 * We've got a VM_FAULT_RETRY
@@ -1093,6 +1082,10 @@
 			continue;
 		}
 	}
+
+	if (should_force_cow_break(vma, foll_flags))
+		foll_flags |= FOLL_WRITE;
+
 retry:
 	/*
 	 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -2689,6 +2674,10 @@
  *
  * If the architecture does not support this function, simply return with no
  * pages pinned.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -2728,6 +2709,12 @@
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
@@ -2791,10 +2766,17 @@
 	if (unlikely(!access_ok((void __user *)start, len)))
 		return -EFAULT;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
+		gup_pgd_range(addr, end, gup_flags | FOLL_WRITE, pages, &nr_pinned);
 		local_irq_enable();
 		ret = nr_pinned;
 	}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1515,13 +1515,12 @@
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,