Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.

Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.

Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.

To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.

Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Mike Kravetz and committed by Linus Torvalds (34ae204f 15568299)

4 files changed, 23 insertions(+), 12 deletions(-)
include/linux/fs.h (+10)
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -518,6 +518,16 @@
 	up_read(&mapping->i_mmap_rwsem);
 }
 
+static inline void i_mmap_assert_locked(struct address_space *mapping)
+{
+	lockdep_assert_held(&mapping->i_mmap_rwsem);
+}
+
+static inline void i_mmap_assert_write_locked(struct address_space *mapping)
+{
+	lockdep_assert_held_write(&mapping->i_mmap_rwsem);
+}
+
 /*
  * Might pages of this file be mapped into userspace?
  */
include/linux/hugetlb.h (+5 -3)
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -164,7 +164,8 @@
 			unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long *addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -204,8 +203,9 @@
 	return NULL;
 }
 
-static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
-				pte_t *ptep)
+static inline int huge_pmd_unshare(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
mm/hugetlb.c (+7 -8)
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3967,7 +3967,7 @@
 			continue;
 
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, &address, ptep)) {
+		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			spin_unlock(ptl);
 			/*
 			 * We just unmapped a page of PMDs by clearing a PUD.
@@ -4554,10 +4554,6 @@
 	} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 		return VM_FAULT_HWPOISON_LARGE |
 			VM_FAULT_SET_HINDEX(hstate_index(h));
-	} else {
-		ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
-		if (!ptep)
-			return VM_FAULT_OOM;
 	}
 
 	/*
@@ -5030,7 +5034,7 @@
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, &address, ptep)) {
+		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			pages++;
 			spin_unlock(ptl);
 			shared_pmd = true;
@@ -5411,12 +5415,14 @@
 * returns: 1 successfully unmapped a shared pte page
 * 0 the underlying pte page is not shared, or it is the last user
 */
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+					unsigned long *addr, pte_t *ptep)
 {
 	pgd_t *pgd = pgd_offset(mm, *addr);
 	p4d_t *p4d = p4d_offset(pgd, *addr);
 	pud_t *pud = pud_offset(p4d, *addr);
 
+	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	BUG_ON(page_count(virt_to_page(ptep)) == 0);
 	if (page_count(virt_to_page(ptep)) == 1)
 		return 0;
@@ -5436,7 +5438,8 @@
 	return NULL;
 }
 
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+				unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
mm/rmap.c (+1 -1)
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1469,7 +1469,7 @@
 	 * do this outside rmap routines.
 	 */
 	VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-	if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+	if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
 		/*
 		 * huge_pmd_unshare unmapped an entire PMD
 		 * page. There is no way of knowing exactly