Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: get_user_pages(write,force) refuse to COW in shared areas

get_user_pages(write=1, force=1) has always had odd behaviour on write-
protected shared mappings: although it demands FMODE_WRITE-access to the
underlying object (do_mmap_pgoff sets neither VM_SHARED nor VM_MAYWRITE
without that), it ends up with do_wp_page substituting private anonymous
Copied-On-Write pages for the shared file pages in the area.

That was long ago intentional, as a safety measure to prevent ptrace
setting a breakpoint (or POKETEXT or POKEDATA) from inadvertently
corrupting the underlying executable. Yet exec and dynamic loaders open
the file read-only, and use MAP_PRIVATE rather than MAP_SHARED.

The traditional odd behaviour still causes surprises and bugs in mm, and
is probably not what any caller wants - even the comment on the flag
says "You do not want this" (although it's undoubtedly necessary for
overriding userspace protections in some contexts, and good when !write).

Let's stop doing that. But it would be dangerous to remove the long-
standing safety at this stage, so just make get_user_pages(write,force)
fail with EFAULT when applied to a write-protected shared area.
Infiniband may in future want to force write through to underlying
object: we can add another FOLL_flag later to enable that if required.
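
The access rule described above can be modelled in userspace as a pure function. This is only a sketch: the `SK_` flag values and function names are made up for illustration and are not the kernel's definitions, the VM_IO/VM_PFNMAP rejection is left out, and `sk_is_cow_mapping()` reflects our understanding of the kernel's `is_cow_mapping()` (not shared, but may become writable).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative bit values only; NOT the kernel's real VM_ or FOLL_ values. */
enum {
	SK_VM_READ     = 1 << 0,
	SK_VM_WRITE    = 1 << 1,
	SK_VM_SHARED   = 1 << 2,
	SK_VM_MAYREAD  = 1 << 3,
	SK_VM_MAYWRITE = 1 << 4,
};
enum { SK_FOLL_WRITE = 1 << 0, SK_FOLL_FORCE = 1 << 1 };

/* Assumed model of is_cow_mapping(): private, yet may be made writable. */
bool sk_is_cow_mapping(unsigned vm_flags)
{
	return (vm_flags & (SK_VM_SHARED | SK_VM_MAYWRITE)) == SK_VM_MAYWRITE;
}

/* Returns 0 when the access may proceed, -EFAULT when it is refused. */
int sk_gup_check(unsigned vm_flags, unsigned gup_flags)
{
	if (gup_flags & SK_FOLL_WRITE) {
		if (vm_flags & SK_VM_WRITE)
			return 0;
		if (!(gup_flags & SK_FOLL_FORCE))
			return -EFAULT;
		/* write,force is honoured only where COW makes sense */
		return sk_is_cow_mapping(vm_flags) ? 0 : -EFAULT;
	}
	if (vm_flags & SK_VM_READ)
		return 0;
	if (!(gup_flags & SK_FOLL_FORCE))
		return -EFAULT;
	return (vm_flags & SK_VM_MAYREAD) ? 0 : -EFAULT;
}

/* Usage: the case this change refuses, next to one that is still allowed. */
void sk_gup_check_demo(void)
{
	unsigned force_write = SK_FOLL_WRITE | SK_FOLL_FORCE;
	unsigned ro_private = SK_VM_READ | SK_VM_MAYREAD | SK_VM_MAYWRITE;
	unsigned ro_shared = ro_private | SK_VM_SHARED;

	assert(sk_gup_check(ro_shared, force_write) == -EFAULT);
	assert(sk_gup_check(ro_private, force_write) == 0);
}
```

The write-protected private mapping is the ptrace breakpoint case, which keeps working through COW; only the shared variant is turned away.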

Odd though the old behaviour was, there is no doubt that we may turn out
to break userspace with this change, and have to revert it quickly.
Issue a WARN_ON_ONCE to help debug the changed case (easily triggered by
userspace, so only once to prevent spamming the logs); and delay a few
associated cleanups until this change is proved.

get_user_pages callers who might see trouble from this change:
  ptrace poking, or writing to /proc/<pid>/mem
  drivers/infiniband/
  drivers/media/v4l2-core/
  drivers/gpu/drm/exynos/exynos_drm_gem.c
  drivers/staging/tidspbridge/core/tiomap3430.c
if they ever apply get_user_pages to write-protected shared mappings
of an object which was opened for writing.
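
For the /proc/<pid>/mem case in the list above, a userspace sketch of the pattern might look as follows. The function name, result codes and temporary file path are hypothetical. It assumes Linux with /proc mounted and a writable /tmp. The outcome depends on the running kernel, so the code only reports what happened. On a kernel carrying exactly this change, the refused write would also be expected to fire the WARN_ON_ONCE once.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

enum poke_result { POKE_SETUP_FAILED = -1, POKE_REFUSED = 0, POKE_WRITTEN = 1 };

/*
 * Map a file that was opened read-write as a read-only MAP_SHARED area,
 * then try to write one byte into it through /proc/self/mem.
 */
enum poke_result poke_shared_readonly(void)
{
	char path[] = "/tmp/gup-sketch-XXXXXX";
	enum poke_result res = POKE_SETUP_FAILED;
	long psz = sysconf(_SC_PAGESIZE);
	void *map;
	int fd, mem;

	/* the mapping address is passed to pwrite() as a file offset */
	assert(sizeof(off_t) >= sizeof(void *));

	fd = mkstemp(path);		/* mkstemp opens the file read-write */
	if (fd < 0)
		return res;
	unlink(path);
	if (psz <= 0 || ftruncate(fd, psz) != 0) {
		close(fd);
		return res;
	}

	map = mmap(NULL, (size_t)psz, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return res;
	}

	mem = open("/proc/self/mem", O_RDWR);
	if (mem >= 0) {
		ssize_t n = pwrite(mem, "x", 1, (off_t)(uintptr_t)map);

		res = (n == 1) ? POKE_WRITTEN : POKE_REFUSED;
		close(mem);
	}

	munmap(map, (size_t)psz);
	close(fd);
	return res;
}
```

With this change applied we would expect POKE_REFUSED; POKE_WRITTEN would suggest the older behaviour, where the byte lands in a private COW page and never reaches the file.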

I went to apply the same change to mm/nommu.c, but retreated. NOMMU has
no place for COW, and its VM_flags conventions are not the same: I'd be
more likely to screw up NOMMU than make an improvement there.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Hugh Dickins and committed by Linus Torvalds
cda540ac d15e0310

+45 -21
mm/memory.c
···
 
 	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
 
-	/*
-	 * Require read or write permissions.
-	 * If FOLL_FORCE is set, we only require the "MAY" flags.
-	 */
-	vm_flags  = (gup_flags & FOLL_WRITE) ?
-			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
-	vm_flags &= (gup_flags & FOLL_FORCE) ?
-			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
-
 	/*
 	 * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
 	 * would be called on PROT_NONE ranges. We must never invoke
···
 
 			/* user gate pages are read-only */
 			if (gup_flags & FOLL_WRITE)
-				return i ? : -EFAULT;
+				goto efault;
 			if (pg > TASK_SIZE)
 				pgd = pgd_offset_k(pg);
 			else
···
 			BUG_ON(pud_none(*pud));
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
-				return i ? : -EFAULT;
+				goto efault;
 			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
-				return i ? : -EFAULT;
+				goto efault;
 			}
 			vma = get_gate_vma(mm);
 			if (pages) {
···
 						page = pte_page(*pte);
 					else {
 						pte_unmap(pte);
-						return i ? : -EFAULT;
+						goto efault;
 					}
 				}
 				pages[i] = page;
···
 			goto next_page;
 		}
 
-		if (!vma ||
-		    (vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
-		    !(vm_flags & vma->vm_flags))
-			return i ? : -EFAULT;
+		if (!vma)
+			goto efault;
+		vm_flags = vma->vm_flags;
+		if (vm_flags & (VM_IO | VM_PFNMAP))
+			goto efault;
+
+		if (gup_flags & FOLL_WRITE) {
+			if (!(vm_flags & VM_WRITE)) {
+				if (!(gup_flags & FOLL_FORCE))
+					goto efault;
+				/*
+				 * We used to let the write,force case do COW
+				 * in a VM_MAYWRITE VM_SHARED !VM_WRITE vma, so
+				 * ptrace could set a breakpoint in a read-only
+				 * mapping of an executable, without corrupting
+				 * the file (yet only when that file had been
+				 * opened for writing!). Anon pages in shared
+				 * mappings are surprising: now just reject it.
+				 */
+				if (!is_cow_mapping(vm_flags)) {
+					WARN_ON_ONCE(vm_flags & VM_MAYWRITE);
+					goto efault;
+				}
+			}
+		} else {
+			if (!(vm_flags & VM_READ)) {
+				if (!(gup_flags & FOLL_FORCE))
+					goto efault;
+				/*
+				 * Is there actually any vma we can reach here
+				 * which does not have VM_MAYREAD set?
+				 */
+				if (!(vm_flags & VM_MAYREAD))
+					goto efault;
+			}
+		}
 
 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
···
 							return -EFAULT;
 					}
 					if (ret & VM_FAULT_SIGBUS)
-						return i ? i : -EFAULT;
+						goto efault;
 					BUG();
 				}
 
···
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;
+efault:
+	return i ? : -EFAULT;
 }
 EXPORT_SYMBOL(__get_user_pages);
 
···
  * @start:	starting user address
  * @nr_pages:	number of pages from start to pin
  * @write:	whether pages will be written to by the caller
- * @force:	whether to force write access even if user mapping is
- *		readonly. This will result in the page being COWed even
- *		in MAP_SHARED mappings. You do not want this.
+ * @force:	whether to force access even when user mapping is currently
+ *		protected (but never forces write access to shared mapping).
  * @pages:	array that receives pointers to the pages pinned.
  *		Should be at least nr_pages long. Or NULL, if caller
  *		only intends to ensure the pages are faulted in.
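
The diff funnels every failure path through a single `efault:` label that returns the number of pages handled so far, or -EFAULT if there were none. A standalone sketch of that convention follows. The function and its `first_bad` parameter are hypothetical stand-ins for the real page walk, and the `a ? : b` form is a GNU C extension, so this assumes gcc or clang.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for the page walk: pages at index first_bad and
 * beyond are treated as impossible to pin.
 */
long pin_pages_sketch(long nr_pages, long first_bad)
{
	long i = 0;

	while (i < nr_pages) {
		if (i >= first_bad)
			goto efault;
		i++;
	}
	return i;
efault:
	/* GNU C: "i ? : x" yields i when i is nonzero, otherwise x */
	return i ? : -EFAULT;
}

/* Usage: partial progress is reported as a count, not as an error. */
void pin_pages_sketch_demo(void)
{
	assert(pin_pages_sketch(4, 2) == 2);
	assert(pin_pages_sketch(4, 0) == -EFAULT);
}
```

A caller therefore sees -EFAULT only when nothing at all was pinned, and a short count tells it where the walk stopped.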