Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/madvise: allow guard page install/remove under VMA lock

We only need to keep the page table stable so we can perform this
operation under the VMA lock. PTE installation is stabilised via the PTE
lock.

One caveat is that, if we prepare vma->anon_vma, we must hold the mmap read
lock. We account for this by adapting the VMA locking logic to explicitly
check for this case and to prevent a VMA lock from being acquired when it
applies.

This check is safe: while we might be raced on anon_vma installation, that
would simply make the check conservative. There is no way for us to see an
anon_vma and then for it to be cleared, as doing so requires the mmap/VMA
write lock.

We abstract the VMA lock validity logic to is_vma_lock_sufficient() for
this purpose, and add prepares_anon_vma() to abstract the anon_vma logic.

In order to do this we need to have a way of installing page tables
explicitly for an identified VMA, so we export walk_page_range_vma() in an
unsafe variant - walk_page_range_vma_unsafe() - and use this should the VMA
read lock be taken.

We additionally update the comments in madvise_guard_install() to more
accurately reflect the cases in which the logic may be reattempted,
specifically THP huge pages being present.

Link: https://lkml.kernel.org/r/cca1edbd99cd1386ad20556d08ebdb356c45ef91.1762795245.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Lorenzo Stoakes; committed by Andrew Morton
2ab7f1bb f4af67ff

+94 -36
+3
mm/internal.h
···
 int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
+int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private);
 int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		pgd_t *pgd, void *private);
+79 -31
mm/madvise.c
···
 	return 0;
 }
 
-static const struct mm_walk_ops guard_install_walk_ops = {
-	.pud_entry = guard_install_pud_entry,
-	.pmd_entry = guard_install_pmd_entry,
-	.pte_entry = guard_install_pte_entry,
-	.install_pte = guard_install_set_pte,
-	.walk_lock = PGWALK_RDLOCK,
-};
-
 static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 {
 	struct vm_area_struct *vma = madv_behavior->vma;
 	struct madvise_behavior_range *range = &madv_behavior->range;
+	struct mm_walk_ops walk_ops = {
+		.pud_entry = guard_install_pud_entry,
+		.pmd_entry = guard_install_pmd_entry,
+		.pte_entry = guard_install_pte_entry,
+		.install_pte = guard_install_set_pte,
+		.walk_lock = get_walk_lock(madv_behavior->lock_mode),
+	};
 	long err;
 	int i;
···
 	/*
 	 * If anonymous and we are establishing page tables the VMA ought to
 	 * have an anon_vma associated with it.
+	 *
+	 * We will hold an mmap read lock if this is necessary, this is checked
+	 * as part of the VMA lock logic.
 	 */
 	if (vma_is_anonymous(vma)) {
+		VM_WARN_ON_ONCE(!vma->anon_vma &&
+				madv_behavior->lock_mode != MADVISE_MMAP_READ_LOCK);
+
 		err = anon_vma_prepare(vma);
 		if (err)
 			return err;
···
 	/*
 	 * Optimistically try to install the guard marker pages first. If any
-	 * non-guard pages are encountered, give up and zap the range before
-	 * trying again.
+	 * non-guard pages or THP huge pages are encountered, give up and zap
+	 * the range before trying again.
 	 *
 	 * We try a few times before giving up and releasing back to userland to
-	 * loop around, releasing locks in the process to avoid contention. This
-	 * would only happen if there was a great many racing page faults.
+	 * loop around, releasing locks in the process to avoid contention.
+	 *
+	 * This would only happen due to races with e.g. page faults or
+	 * khugepaged.
 	 *
 	 * In most cases we should simply install the guard markers immediately
 	 * with no zap or looping.
···
 		unsigned long nr_pages = 0;
 
 		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
-		err = walk_page_range_mm_unsafe(vma->vm_mm, range->start,
-			range->end, &guard_install_walk_ops, &nr_pages);
+		if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK)
+			err = walk_page_range_vma_unsafe(madv_behavior->vma,
+					range->start, range->end, &walk_ops,
+					&nr_pages);
+		else
+			err = walk_page_range_mm_unsafe(vma->vm_mm, range->start,
+					range->end, &walk_ops, &nr_pages);
 		if (err < 0)
 			return err;
···
 	}
 
 	/*
-	 * We were unable to install the guard pages due to being raced by page
-	 * faults. This should not happen ordinarily. We return to userspace and
+	 * We were unable to install the guard pages, return to userspace and
 	 * immediately retry, relieving lock contention.
 	 */
 	return restart_syscall();
···
 	return 0;
 }
 
-static const struct mm_walk_ops guard_remove_walk_ops = {
-	.pud_entry = guard_remove_pud_entry,
-	.pmd_entry = guard_remove_pmd_entry,
-	.pte_entry = guard_remove_pte_entry,
-	.walk_lock = PGWALK_RDLOCK,
-};
-
 static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
 {
 	struct vm_area_struct *vma = madv_behavior->vma;
 	struct madvise_behavior_range *range = &madv_behavior->range;
+	struct mm_walk_ops wallk_ops = {
+		.pud_entry = guard_remove_pud_entry,
+		.pmd_entry = guard_remove_pmd_entry,
+		.pte_entry = guard_remove_pte_entry,
+		.walk_lock = get_walk_lock(madv_behavior->lock_mode),
+	};
 
 	/*
 	 * We're ok with removing guards in mlock()'d ranges, as this is a
···
 		return -EINVAL;
 
 	return walk_page_range_vma(vma, range->start, range->end,
-			&guard_remove_walk_ops, NULL);
+			&wallk_ops, NULL);
 }
 
 #ifdef CONFIG_64BIT
···
 	}
 }
 
+/* Does this operation invoke anon_vma_prepare()? */
+static bool prepares_anon_vma(int behavior)
+{
+	switch (behavior) {
+	case MADV_GUARD_INSTALL:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * We have acquired a VMA read lock, is the VMA valid to be madvise'd under VMA
+ * read lock only now we have a VMA to examine?
+ */
+static bool is_vma_lock_sufficient(struct vm_area_struct *vma,
+		struct madvise_behavior *madv_behavior)
+{
+	/* Must span only a single VMA. */
+	if (madv_behavior->range.end > vma->vm_end)
+		return false;
+	/* Remote processes unsupported. */
+	if (current->mm != vma->vm_mm)
+		return false;
+	/* Userfaultfd unsupported. */
+	if (userfaultfd_armed(vma))
+		return false;
+	/*
+	 * anon_vma_prepare() explicitly requires an mmap lock for
+	 * serialisation, so we cannot use a VMA lock in this case.
+	 *
+	 * Note we might race with anon_vma being set, however this makes this
+	 * check overly paranoid which is safe.
+	 */
+	if (vma_is_anonymous(vma) &&
+	    prepares_anon_vma(madv_behavior->behavior) && !vma->anon_vma)
+		return false;
+
+	return true;
+}
+
 /*
  * Try to acquire a VMA read lock if possible.
  *
···
 	vma = lock_vma_under_rcu(mm, madv_behavior->range.start);
 	if (!vma)
 		goto take_mmap_read_lock;
-	/*
-	 * Must span only a single VMA; uffd and remote processes are
-	 * unsupported.
-	 */
-	if (madv_behavior->range.end > vma->vm_end || current->mm != mm ||
-	    userfaultfd_armed(vma)) {
+
+	if (!is_vma_lock_sufficient(vma, madv_behavior)) {
 		vma_end_read(vma);
 		goto take_mmap_read_lock;
 	}
+
 	madv_behavior->vma = vma;
 	return true;
···
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+		return MADVISE_MMAP_READ_LOCK;
 	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
-		return MADVISE_MMAP_READ_LOCK;
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:
 	case MADV_FREE:
+12 -5
mm/pagewalk.c
···
 	return walk_pgd_range(start, end, &walk);
 }
 
-int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end, const struct mm_walk_ops *ops,
-			void *private)
+int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops, void *private)
 {
 	struct mm_walk walk = {
 		.ops = ops,
···
 		return -EINVAL;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
-	if (!check_ops_safe(ops))
-		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
 	return __walk_page_range(start, end, &walk);
+}
+
+int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, const struct mm_walk_ops *ops,
+			void *private)
+{
+	if (!check_ops_safe(ops))
+		return -EINVAL;
+
+	return walk_page_range_vma_unsafe(vma, start, end, ops, private);
 }
 
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,