Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/fork: accept huge pfnmap entries

Teach the fork code to properly copy pfnmap entries at the pmd/pud levels. The
pud case is much simpler, but the write bit still needs to be preserved for
writable, shared pud mappings such as PFNMAP ones; otherwise a follow-up write
in either the parent or the child process will trigger a write fault.

Do the same at the pmd level.

Link: https://lkml.kernel.org/r/20240826204353.2228736-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Peter Xu, committed by Andrew Morton
bc02afbd 10d83d77

+26 -3
mm/huge_memory.c

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1583,6 +1583,24 @@
 	pgtable_t pgtable = NULL;
 	int ret = -ENOMEM;
 
+	pmd = pmdp_get_lockless(src_pmd);
+	if (unlikely(pmd_special(pmd))) {
+		dst_ptl = pmd_lock(dst_mm, dst_pmd);
+		src_ptl = pmd_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+		/*
+		 * No need to recheck the pmd, it can't change with write
+		 * mmap lock held here.
+		 *
+		 * Meanwhile, making sure it's not a CoW VMA with writable
+		 * mapping, otherwise it means either the anon page wrongly
+		 * applied special bit, or we made the PRIVATE mapping be
+		 * able to wrongly write to the backend MMIO.
+		 */
+		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
+		goto set_pmd;
+	}
+
 	/* Skip if can be re-fill on fault */
 	if (!vma_is_anonymous(dst_vma))
 		return 0;
@@ -1664,7 +1682,9 @@
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
 		pmd = pmd_clear_uffd_wp(pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	pmd = pmd_wrprotect(pmd);
+set_pmd:
+	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1710,8 +1730,11 @@
 	 * TODO: once we support anonymous pages, use
 	 * folio_try_dup_anon_rmap_*() and split if duplicating fails.
 	 */
-	pudp_set_wrprotect(src_mm, addr, src_pud);
-	pud = pud_mkold(pud_wrprotect(pud));
+	if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
+		pudp_set_wrprotect(src_mm, addr, src_pud);
+		pud = pud_wrprotect(pud);
+	}
+	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;