
mm: make PTE_MARKER_SWAPIN_ERROR more general

Patch series "add UFFDIO_POISON to simulate memory poisoning with UFFD",
v4.

This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.


This patch (of 8):

Future patches will reuse PTE_MARKER_SWAPIN_ERROR to implement
UFFDIO_POISON, so make various preparations for that:

First, rename it to just PTE_MARKER_POISONED. The "SWAPIN" is confusing,
since we're going to reuse the marker for something not related to swap.
This is particularly confusing for things like hugetlbfs, which doesn't
support swap at all. Also rename the various helper functions to match.

Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on
seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
But, since we're going to reuse this marker, we want hugetlbfs to copy it
just as non-hugetlbfs memory does today. Since the code to do this is now
more complicated, pull it out into a helper which can be reused in both
places. While we're at it, also make the helper slightly more explicit in
its handling of e.g. uffd-wp markers.

For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for an
error entry, return VM_FAULT_HWPOISON. For most cases this change doesn't
matter, e.g. a userspace program would receive a SIGBUS either way. But
for UFFDIO_POISON, this change will let KVM guests get an MCE out of the
box, instead of giving a SIGBUS to the hypervisor and requiring it to
somehow inject an MCE.

Finally, for hugetlbfs faults, handle PTE_MARKER_POISONED, and return
VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today
because the lack of swap support means we'll never end up with such a PTE
anyway, but this behavior will be needed once such entries *can* show up
via UFFDIO_POISON.

Link: https://lkml.kernel.org/r/20230707215540.2324998-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230707215540.2324998-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Axel Rasmussen, committed by Andrew Morton
commit af19487f, parent 60b1e24c

8 files changed, 65 insertions(+), 28 deletions(-)
include/linux/mm_inline.h (+19):
···
 }

 /*
+ * Computes the pte marker to copy from the given source entry into dst_vma.
+ * If no marker should be copied, returns 0.
+ * The caller should insert a new pte created with make_pte_marker().
+ */
+static inline pte_marker copy_pte_marker(
+        swp_entry_t entry, struct vm_area_struct *dst_vma)
+{
+    pte_marker srcm = pte_marker_get(entry);
+    /* Always copy error entries. */
+    pte_marker dstm = srcm & PTE_MARKER_POISONED;
+
+    /* Only copy PTE markers if UFFD register matches. */
+    if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma))
+        dstm |= PTE_MARKER_UFFD_WP;
+
+    return dstm;
+}
+
+/*
  * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
  * replace a none pte. NOTE! This should only be called when *pte is already
  * cleared so we will never accidentally replace something valuable. Meanwhile
include/linux/swapops.h (+10 -5):
···
 typedef unsigned long pte_marker;

 #define PTE_MARKER_UFFD_WP      BIT(0)
-#define PTE_MARKER_SWAPIN_ERROR BIT(1)
+/*
+ * "Poisoned" here is meant in the very general sense of "future accesses are
+ * invalid", instead of referring very specifically to hardware memory errors.
+ * This marker is meant to represent any of various different causes of this.
+ */
+#define PTE_MARKER_POISONED     BIT(1)
 #define PTE_MARKER_MASK         (BIT(2) - 1)

 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
···
     return swp_entry_to_pte(make_pte_marker_entry(marker));
 }

-static inline swp_entry_t make_swapin_error_entry(void)
+static inline swp_entry_t make_poisoned_swp_entry(void)
 {
-    return make_pte_marker_entry(PTE_MARKER_SWAPIN_ERROR);
+    return make_pte_marker_entry(PTE_MARKER_POISONED);
 }

-static inline int is_swapin_error_entry(swp_entry_t entry)
+static inline int is_poisoned_swp_entry(swp_entry_t entry)
 {
     return is_pte_marker_entry(entry) &&
-        (pte_marker_get(entry) & PTE_MARKER_SWAPIN_ERROR);
+        (pte_marker_get(entry) & PTE_MARKER_POISONED);
 }

 /*
mm/hugetlb.c (+21 -11):
···
 #include <linux/nospec.h>
 #include <linux/delayacct.h>
 #include <linux/memory.h>
+#include <linux/mm_inline.h>

 #include <asm/page.h>
 #include <asm/pgalloc.h>
···
             entry = huge_pte_clear_uffd_wp(entry);
             set_huge_pte_at(dst, addr, dst_pte, entry);
         } else if (unlikely(is_pte_marker(entry))) {
-            /* No swap on hugetlb */
-            WARN_ON_ONCE(
-                is_swapin_error_entry(pte_to_swp_entry(entry)));
-            /*
-             * We copy the pte marker only if the dst vma has
-             * uffd-wp enabled.
-             */
-            if (userfaultfd_wp(dst_vma))
-                set_huge_pte_at(dst, addr, dst_pte, entry);
+            pte_marker marker = copy_pte_marker(
+                pte_to_swp_entry(entry), dst_vma);
+
+            if (marker)
+                set_huge_pte_at(dst, addr, dst_pte,
+                        make_pte_marker(marker));
         } else {
             entry = huge_ptep_get(src_pte);
             pte_folio = page_folio(pte_page(entry));
···
     }

     entry = huge_ptep_get(ptep);
-    /* PTE markers should be handled the same way as none pte */
-    if (huge_pte_none_mostly(entry))
+    if (huge_pte_none_mostly(entry)) {
+        if (is_pte_marker(entry)) {
+            pte_marker marker =
+                pte_marker_get(pte_to_swp_entry(entry));
+
+            if (marker & PTE_MARKER_POISONED) {
+                ret = VM_FAULT_HWPOISON_LARGE;
+                goto out_mutex;
+            }
+        }
+
         /*
+         * Other PTE markers should be handled the same way as none PTE.
+         *
          * hugetlb_no_page will drop vma lock and hugetlb fault
          * mutex internally, which make us return immediately.
          */
         return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
                     entry, flags);
+    }

     ret = 0;
mm/madvise.c (+1 -1):
···
             free_swap_and_cache(entry);
             pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
         } else if (is_hwpoison_entry(entry) ||
-               is_swapin_error_entry(entry)) {
+               is_poisoned_swp_entry(entry)) {
             pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
         }
         continue;
mm/memory.c (+9 -6):
···
             return -EBUSY;
         return -ENOENT;
     } else if (is_pte_marker_entry(entry)) {
-        if (is_swapin_error_entry(entry) || userfaultfd_wp(dst_vma))
-            set_pte_at(dst_mm, addr, dst_pte, pte);
+        pte_marker marker = copy_pte_marker(entry, dst_vma);
+
+        if (marker)
+            set_pte_at(dst_mm, addr, dst_pte,
+                   make_pte_marker(marker));
         return 0;
     }
     if (!userfaultfd_wp(dst_vma))
···
                 !zap_drop_file_uffd_wp(details))
                 continue;
         } else if (is_hwpoison_entry(entry) ||
-               is_swapin_error_entry(entry)) {
+               is_poisoned_swp_entry(entry)) {
             if (!should_zap_cows(details))
                 continue;
         } else {
···
      * none pte. Otherwise it means the pte could have changed, so retry.
      *
      * This should also cover the case where e.g. the pte changed
-     * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_SWAPIN_ERROR.
+     * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_POISONED.
      * So is_pte_marker() check is not enough to safely drop the pte.
      */
     if (pte_same(vmf->orig_pte, ptep_get(vmf->pte)))
···
         return VM_FAULT_SIGBUS;

     /* Higher priority than uffd-wp when data corrupted */
-    if (marker & PTE_MARKER_SWAPIN_ERROR)
-        return VM_FAULT_SIGBUS;
+    if (marker & PTE_MARKER_POISONED)
+        return VM_FAULT_HWPOISON;

     if (pte_marker_entry_uffd_wp(entry))
         return pte_marker_handle_uffd_wp(vmf);
mm/mprotect.c (+2 -2):
···
             newpte = pte_swp_mkuffd_wp(newpte);
         } else if (is_pte_marker_entry(entry)) {
             /*
-             * Ignore swapin errors unconditionally,
+             * Ignore error swap entries unconditionally,
              * because any access should sigbus anyway.
              */
-            if (is_swapin_error_entry(entry))
+            if (is_poisoned_swp_entry(entry))
                 continue;
             /*
              * If this is uffd-wp pte marker and we'd like
mm/shmem.c (+2 -2):
···
     swp_entry_t swapin_error;
     void *old;

-    swapin_error = make_swapin_error_entry();
+    swapin_error = make_poisoned_swp_entry();
     old = xa_cmpxchg_irq(&mapping->i_pages, index,
                  swp_to_radix_entry(swap),
                  swp_to_radix_entry(swapin_error), 0);
···
     swap = radix_to_swp_entry(*foliop);
     *foliop = NULL;

-    if (is_swapin_error_entry(swap))
+    if (is_poisoned_swp_entry(swap))
         return -EIO;

     si = get_swap_device(swap);
mm/swapfile.c (+1 -1):
···
         swp_entry = make_hwpoison_entry(swapcache);
         page = swapcache;
     } else {
-        swp_entry = make_swapin_error_entry();
+        swp_entry = make_poisoned_swp_entry();
     }
     new_pte = swp_entry_to_pte(swp_entry);
     ret = 0;