Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: allow ->huge_fault() to be called without the mmap_lock held

Remove the checks for the VMA lock being held, allowing the page fault
path to call into the filesystem instead of retrying with the mmap_lock
held. This will improve scalability for DAX page faults. Also update the
documentation to match (and bring it in line with other recent changes).
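For an implementation that has not been audited for operation under only the per-VMA lock, the escape hatch is to return VM_FAULT_RETRY so the fault is repeated with the mmap_lock held. A minimal userspace model of that pattern (the VM_FAULT_* values, the flag, and the handler here are simplified stand-ins, not the kernel's definitions):

```c
#include <assert.h>

/* Simplified stand-ins for kernel fault-return bits and fault flags. */
typedef unsigned int vm_fault_t;
#define VM_FAULT_RETRY      0x1u
#define VM_FAULT_HANDLED    0x2u
#define FAULT_FLAG_VMA_LOCK 0x10u

/*
 * Model of a ->huge_fault() that still depends on the mmap_lock: when
 * only the per-VMA lock is held, ask to be called again with the
 * mmap_lock held instead of proceeding.
 */
static vm_fault_t legacy_huge_fault(unsigned int fault_flags)
{
	if (fault_flags & FAULT_FLAG_VMA_LOCK)
		return VM_FAULT_RETRY;
	return VM_FAULT_HANDLED;
}
```

An audited implementation simply drops the flag check and handles the fault under either lock, which is what makes the caller-side simplification below safe.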

Link: https://lkml.kernel.org/r/20230818202335.2739663-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

authored by Matthew Wilcox (Oracle), committed by Andrew Morton
40d49a3c 051ddcfe

+36 -33 across 3 files

Documentation/filesystems/locking.rst (+23 -13)
···
 
 prototypes::
 
-	void (*open)(struct vm_area_struct*);
-	void (*close)(struct vm_area_struct*);
-	vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
+	void (*open)(struct vm_area_struct *);
+	void (*close)(struct vm_area_struct *);
+	vm_fault_t (*fault)(struct vm_fault *);
+	vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
+	vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
 	vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
 
-=============	=========	===========================
+=============	==========	===========================
 ops		mmap_lock	PageLocked(page)
-=============	=========	===========================
-open:		yes
-close:		yes
-fault:		yes		can return with page locked
-map_pages:	read
-page_mkwrite:	yes		can return with page locked
-pfn_mkwrite:	yes
-access:		yes
-=============	=========	===========================
+=============	==========	===========================
+open:		write
+close:		read/write
+fault:		read		can return with page locked
+huge_fault:	maybe-read
+map_pages:	maybe-read
+page_mkwrite:	read		can return with page locked
+pfn_mkwrite:	read
+access:		read
+=============	==========	===========================
 
 ->fault() is called when a previously not present pte is about to be faulted
 in. The filesystem must find and return the page associated with the passed in
···
 then ensure the page is not already truncated (invalidate_lock will block
 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
+
+->huge_fault() is called when there is no PUD or PMD entry present. This
+gives the filesystem the opportunity to install a PUD or PMD sized page.
+Filesystems can also use the ->fault method to return a PMD sized page,
+so implementing this function may not be necessary. In particular,
+filesystems should not call filemap_fault() from ->huge_fault().
+The mmap_lock may not be held when this method is called.
 
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
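The new locking.rst text notes that implementing ->huge_fault() is optional because a filesystem can decline and let the MM fall back to ->fault(). A rough userspace model of that fallback contract (all names and values are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int vm_fault_t;
#define VM_FAULT_FALLBACK 0x1u
#define VM_FAULT_DONE     0x2u

/* Model ->huge_fault(): decline the PMD/PUD mapping when it can't help. */
static vm_fault_t model_huge_fault(bool can_map_huge)
{
	return can_map_huge ? VM_FAULT_DONE : VM_FAULT_FALLBACK;
}

/* Model ->fault(): always able to install a PTE-sized mapping. */
static vm_fault_t model_fault(void)
{
	return VM_FAULT_DONE;
}

/* Model of the MM core: try the huge handler, fall back to the base one. */
static vm_fault_t model_handle_fault(bool can_map_huge)
{
	vm_fault_t ret = model_huge_fault(can_map_huge);

	if (ret & VM_FAULT_FALLBACK)
		return model_fault();
	return ret;
}
```

Either way the fault is satisfied; declining the huge mapping only costs the PMD/PUD-sized entry, not correctness.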
Documentation/filesystems/porting.rst (+11)
···
 changed to simplify callers. The passed file is in a non-open state and on
 success must be opened before returning (e.g. by calling
 finish_open_simple()).
+
+---
+
+**mandatory**
+
+Calling convention for ->huge_fault has changed. It now takes a page
+order instead of an enum page_entry_size, and it may be called without the
+mmap_lock held. All in-tree users have been audited and do not seem to
+depend on the mmap_lock being held, but out of tree users should verify
+for themselves. If they do need it, they can return VM_FAULT_RETRY to
+be called with the mmap_lock held.
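The porting note's switch from enum page_entry_size to a page order means the callee now receives log2 of the number of base pages in the mapping. A small arithmetic sketch; the shift values below are the common x86-64 ones with 4 KiB pages, not universal:

```c
#include <assert.h>

/* Typical x86-64 values with 4 KiB base pages; other configs differ. */
#define PAGE_SHIFT 12	/* 4 KiB base page */
#define PMD_SHIFT  21	/* 2 MiB PMD mapping */
#define PUD_SHIFT  30	/* 1 GiB PUD mapping */

/* Page order = log2(mapping size / base page size). */
#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)	/* 9 on this config */
#define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)	/* 18 on this config */

/* Number of base pages covered by an order-N entry. */
static unsigned long pages_in_order(unsigned int order)
{
	return 1UL << order;
}
```

So a PMD-order fault on this configuration covers 512 base pages and a PUD-order fault 262144, which is the information the old enum encoded indirectly.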
mm/memory.c (+2 -20)
···
 	struct vm_area_struct *vma = vmf->vma;
 	if (vma_is_anonymous(vma))
 		return do_huge_pmd_anonymous_page(vmf);
-	if (vma->vm_ops->huge_fault) {
-		if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-			vma_end_read(vma);
-			return VM_FAULT_RETRY;
-		}
+	if (vma->vm_ops->huge_fault)
 		return vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
-	}
 	return VM_FAULT_FALLBACK;
 }
···
 
 	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
 		if (vma->vm_ops->huge_fault) {
-			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-				vma_end_read(vma);
-				return VM_FAULT_RETRY;
-			}
 			ret = vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
 			if (!(ret & VM_FAULT_FALLBACK))
 				return ret;
···
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vma))
 		return VM_FAULT_FALLBACK;
-	if (vma->vm_ops->huge_fault) {
-		if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-			vma_end_read(vma);
-			return VM_FAULT_RETRY;
-		}
+	if (vma->vm_ops->huge_fault)
 		return vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
-	}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 	return VM_FAULT_FALLBACK;
 }
···
 		goto split;
 	if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
 		if (vma->vm_ops->huge_fault) {
-			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-				vma_end_read(vma);
-				return VM_FAULT_RETRY;
-			}
 			ret = vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 			if (!(ret & VM_FAULT_FALLBACK))
 				return ret;
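All four mm/memory.c hunks delete the same early bail-out. A compact userspace model of the caller-side change, with the types, flags, and return bits as simplified stand-ins for the kernel's:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int vm_fault_t;
#define VM_FAULT_RETRY      0x1u
#define VM_FAULT_FALLBACK   0x2u
#define VM_FAULT_DONE       0x4u
#define FAULT_FLAG_VMA_LOCK 0x10u

typedef vm_fault_t (*huge_fault_fn)(void);

static vm_fault_t fs_huge_fault(void)
{
	return VM_FAULT_DONE;
}

/* Old dispatch: refuse to call the filesystem under the per-VMA lock. */
static vm_fault_t create_huge_old(huge_fault_fn huge_fault, unsigned int flags)
{
	if (huge_fault) {
		if (flags & FAULT_FLAG_VMA_LOCK)
			return VM_FAULT_RETRY;	/* redo with mmap_lock held */
		return huge_fault();
	}
	return VM_FAULT_FALLBACK;
}

/* New dispatch: call into the filesystem regardless of which lock is held. */
static vm_fault_t create_huge_new(huge_fault_fn huge_fault, unsigned int flags)
{
	(void)flags;
	if (huge_fault)
		return huge_fault();
	return VM_FAULT_FALLBACK;
}
```

Under the old scheme every VMA-locked huge fault paid a retry round trip through the mmap_lock; under the new one only handlers that actually need the mmap_lock return VM_FAULT_RETRY themselves.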