Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

dax: use common 4k zero page for dax mmap reads

When servicing mmap() reads from file holes, the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.

This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory. This is easily visible by looking at the overall
memory consumption of the system or by looking at /proc/[pid]/smaps:

7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 1048576 kB
Pss: 1048576 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 1048576 kB
Private_Dirty: 0 kB
Referenced: 1048576 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB

2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it. Here
are the average latencies of dax_load_hole() as measured by ftrace on
a random test box:

Old method, using zeroed page cache pages: 3.4 us
New method, using the common 4k zero page: 0.8 us

This was the average latency over 1 GiB of sequential reads done by
this simple fio script:

[global]
size=1G
filename=/root/dax/data
fallocate=none
[io]
rw=read
ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.

Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead. As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.

Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page. If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.

This solution also removes the extra memory consumption. Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:

7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable. The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:

"To be able to use the common 4k zero page in DAX we need to have our
PTE fault path look more like our PMD fault path where a PTE entry
can be marked as dirty and writeable as it is first inserted rather
than waiting for a follow-up dax_pfn_mkwrite() =>
finish_mkwrite_fault() call.

Right now we can rely on having a dax_pfn_mkwrite() call because we
can distinguish between these two cases in do_wp_page():

case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage

This distinction is made via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does
for DAX ptes. Instead of special casing the DAX + 4k zero page case
we will simplify our DAX PTE page fault sequence so that it matches
our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
We will instead use dax_iomap_fault() to handle write-protection
faults.

This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
'mkwrite' is set insert_pfn() will do the work that was previously
done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Ross Zwisler, committed by Linus Torvalds
commit 91d25ba8 (parent e30331ff)

86 insertions(+), 235 deletions(-), 7 files changed
Documentation/filesystems/dax.txt | +2 -3

···
 - implementing an mmap file operation for DAX files which sets the
   VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
   include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
-  handlers should probably call dax_iomap_fault() (for fault and page_mkwrite
-  handlers), dax_iomap_pmd_fault(), dax_pfn_mkwrite() passing the appropriate
-  iomap operations.
+  handlers should probably call dax_iomap_fault() passing the appropriate
+  fault size and iomap operations.
 - calling iomap_zero_range() passing appropriate iomap operations instead of
   block_truncate_page() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
fs/dax.c | +76 -167

···
 
 static int dax_is_zero_entry(void *entry)
 {
-	return (unsigned long)entry & RADIX_DAX_HZP;
+	return (unsigned long)entry & RADIX_DAX_ZERO_PAGE;
 }
 
 static int dax_is_empty_entry(void *entry)
···
 	for (;;) {
 		entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
 					  &slot);
-		if (!entry || !radix_tree_exceptional_entry(entry) ||
+		if (!entry ||
+		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)) ||
 		    !slot_locked(mapping, slot)) {
 			if (slotp)
 				*slotp = slot;
···
 }
 
 static void put_locked_mapping_entry(struct address_space *mapping,
-		pgoff_t index, void *entry)
+		pgoff_t index)
 {
-	if (!radix_tree_exceptional_entry(entry)) {
-		unlock_page(entry);
-		put_page(entry);
-	} else {
-		dax_unlock_mapping_entry(mapping, index);
-	}
+	dax_unlock_mapping_entry(mapping, index);
 }
 
 /*
···
 static void put_unlocked_mapping_entry(struct address_space *mapping,
 				       pgoff_t index, void *entry)
 {
-	if (!radix_tree_exceptional_entry(entry))
+	if (!entry)
 		return;
 
 	/* We have to wake up next waiter for the radix tree entry lock */
···
 }
 
 /*
- * Find radix tree entry at given index. If it points to a page, return with
- * the page locked. If it points to the exceptional entry, return with the
- * radix tree entry locked. If the radix tree doesn't contain given index,
- * create empty exceptional entry for the index and return with it locked.
+ * Find radix tree entry at given index. If it points to an exceptional entry,
+ * return it with the radix tree entry locked. If the radix tree doesn't
+ * contain given index, create an empty exceptional entry for the index and
+ * return with it locked.
  *
  * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
  * either return that locked entry or will return an error. This error will
- * happen if there are any 4k entries (either zero pages or DAX entries)
- * within the 2MiB range that we are requesting.
+ * happen if there are any 4k entries within the 2MiB range that we are
+ * requesting.
  *
  * We always favor 4k entries over 2MiB entries. There isn't a flow where we
  * evict 4k entries in order to 'upgrade' them to a 2MiB entry.  A 2MiB
···
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 
+	if (WARN_ON_ONCE(entry && !radix_tree_exceptional_entry(entry))) {
+		entry = ERR_PTR(-EIO);
+		goto out_unlock;
+	}
+
 	if (entry) {
 		if (size_flag & RADIX_DAX_PMD) {
-			if (!radix_tree_exceptional_entry(entry) ||
-					dax_is_pte_entry(entry)) {
+			if (dax_is_pte_entry(entry)) {
 				put_unlocked_mapping_entry(mapping, index,
 						entry);
 				entry = ERR_PTR(-EEXIST);
 				goto out_unlock;
 			}
 		} else { /* trying to grab a PTE entry */
-			if (radix_tree_exceptional_entry(entry) &&
-					dax_is_pmd_entry(entry) &&
+			if (dax_is_pmd_entry(entry) &&
 			    (dax_is_zero_entry(entry) ||
 			     dax_is_empty_entry(entry))) {
 				pmd_downgrade = true;
···
 				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
 		if (err) {
 			if (pmd_downgrade)
-				put_locked_mapping_entry(mapping, index, entry);
+				put_locked_mapping_entry(mapping, index);
 			return ERR_PTR(err);
 		}
 		spin_lock_irq(&mapping->tree_lock);
···
 		spin_unlock_irq(&mapping->tree_lock);
 		return entry;
 	}
-	/* Normal page in radix tree? */
-	if (!radix_tree_exceptional_entry(entry)) {
-		struct page *page = entry;
-
-		get_page(page);
-		spin_unlock_irq(&mapping->tree_lock);
-		lock_page(page);
-		/* Page got truncated? Retry... */
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			put_page(page);
-			goto restart;
-		}
-		return page;
-	}
 	entry = lock_slot(mapping, slot);
 out_unlock:
 	spin_unlock_irq(&mapping->tree_lock);
···
 
 	spin_lock_irq(&mapping->tree_lock);
 	entry = get_unlocked_mapping_entry(mapping, index, NULL);
-	if (!entry || !radix_tree_exceptional_entry(entry))
+	if (!entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)))
 		goto out;
 	if (!trunc &&
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
···
 				      unsigned long flags)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int error = 0;
-	bool hole_fill = false;
 	void *new_entry;
 	pgoff_t index = vmf->pgoff;
 
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	/* Replacing hole page with block mapping? */
-	if (!radix_tree_exceptional_entry(entry)) {
-		hole_fill = true;
-		/*
-		 * Unmap the page now before we remove it from page cache below.
-		 * The page is locked so it cannot be faulted in again.
-		 */
-		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
-				    PAGE_SIZE, 0);
-		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
-		if (error)
-			return ERR_PTR(error);
-	} else if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_HZP)) {
-		/* replacing huge zero page with PMD block mapping */
-		unmap_mapping_range(mapping,
-			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+	if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_ZERO_PAGE)) {
+		/* we are replacing a zero page with block mapping */
+		if (dax_is_pmd_entry(entry))
+			unmap_mapping_range(mapping,
+					(vmf->pgoff << PAGE_SHIFT) & PMD_MASK,
+					PMD_SIZE, 0);
+		else /* pte entry */
+			unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+					PAGE_SIZE, 0);
 	}
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(sector, flags);
 
-	if (hole_fill) {
-		__delete_from_page_cache(entry, NULL);
-		/* Drop pagecache reference */
-		put_page(entry);
-		error = __radix_tree_insert(page_tree, index,
-				dax_radix_order(new_entry), new_entry);
-		if (error) {
-			new_entry = ERR_PTR(error);
-			goto unlock;
-		}
-		mapping->nrexceptional++;
-	} else if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
 		 * Only swap our new entry into the radix tree if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
···
 		WARN_ON_ONCE(ret != entry);
 		__radix_tree_replace(page_tree, node, slot,
 				     new_entry, NULL, NULL);
+		entry = new_entry;
 	}
+
 	if (vmf->flags & FAULT_FLAG_WRITE)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
-unlock:
+
 	spin_unlock_irq(&mapping->tree_lock);
-	if (hole_fill) {
-		radix_tree_preload_end();
-		/*
-		 * We don't need hole page anymore, it has been replaced with
-		 * locked radix tree entry now.
-		 */
-		if (mapping->a_ops->freepage)
-			mapping->a_ops->freepage(entry);
-		unlock_page(entry);
-		put_page(entry);
-	}
-	return new_entry;
+	return entry;
 }
 
 static inline unsigned long
···
 	spin_lock_irq(&mapping->tree_lock);
 	entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
 	/* Entry got punched out / reallocated? */
-	if (!entry2 || !radix_tree_exceptional_entry(entry2))
+	if (!entry2 || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry2)))
 		goto put_unlocked;
 	/*
 	 * Entry got reallocated elsewhere? No need to writeback.  We have to
···
 	trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
 dax_unlock:
 	dax_read_unlock(id);
-	put_locked_mapping_entry(mapping, index, entry);
+	put_locked_mapping_entry(mapping, index);
 	return ret;
 
 put_unlocked:
···
 
 static int dax_insert_mapping(struct address_space *mapping,
 		struct block_device *bdev, struct dax_device *dax_dev,
-		sector_t sector, size_t size, void **entryp,
+		sector_t sector, size_t size, void *entry,
 		struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = vmf->address;
-	void *entry = *entryp;
 	void *ret, *kaddr;
 	pgoff_t pgoff;
 	int id, rc;
···
 	ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
 	if (IS_ERR(ret))
 		return PTR_ERR(ret);
-	*entryp = ret;
 
 	trace_dax_insert_mapping(mapping->host, vmf, ret);
-	return vm_insert_mixed(vma, vaddr, pfn);
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
+	else
+		return vm_insert_mixed(vma, vaddr, pfn);
 }
-
-/**
- * dax_pfn_mkwrite - handle first write to DAX page
- * @vmf: The description of the fault
- */
-int dax_pfn_mkwrite(struct vm_fault *vmf)
-{
-	struct file *file = vmf->vma->vm_file;
-	struct address_space *mapping = file->f_mapping;
-	struct inode *inode = mapping->host;
-	void *entry, **slot;
-	pgoff_t index = vmf->pgoff;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry)) {
-		if (entry)
-			put_unlocked_mapping_entry(mapping, index, entry);
-		spin_unlock_irq(&mapping->tree_lock);
-		trace_dax_pfn_mkwrite_no_entry(inode, vmf, VM_FAULT_NOPAGE);
-		return VM_FAULT_NOPAGE;
-	}
-	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
-	entry = lock_slot(mapping, slot);
-	spin_unlock_irq(&mapping->tree_lock);
-	/*
-	 * If we race with somebody updating the PTE and finish_mkwrite_fault()
-	 * fails, we don't care. We need to return VM_FAULT_NOPAGE and retry
-	 * the fault in either case.
-	 */
-	finish_mkwrite_fault(vmf);
-	put_locked_mapping_entry(mapping, index, entry);
-	trace_dax_pfn_mkwrite(inode, vmf, VM_FAULT_NOPAGE);
-	return VM_FAULT_NOPAGE;
-}
-EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
 
 /*
- * The user has performed a load from a hole in the file.  Allocating
- * a new page in the file would cause excessive storage usage for
- * workloads with sparse files.  We allocate a page cache page instead.
- * We'll kick it out of the page cache if it's ever written to,
- * otherwise it will simply fall out of the page cache under memory
- * pressure without ever having been dirtied.
+ * The user has performed a load from a hole in the file.  Allocating a new
+ * page in the file would cause excessive storage usage for workloads with
+ * sparse files.  Instead we insert a read-only mapping of the 4k zero page.
+ * If this page is ever written to we will re-fault and change the mapping to
+ * point to real DAX storage instead.
  */
-static int dax_load_hole(struct address_space *mapping, void **entry,
+static int dax_load_hole(struct address_space *mapping, void *entry,
 			 struct vm_fault *vmf)
 {
 	struct inode *inode = mapping->host;
-	struct page *page;
-	int ret;
+	unsigned long vaddr = vmf->address;
+	int ret = VM_FAULT_NOPAGE;
+	struct page *zero_page;
+	void *entry2;
 
-	/* Hole page already exists? Return it... */
-	if (!radix_tree_exceptional_entry(*entry)) {
-		page = *entry;
-		goto finish_fault;
-	}
-
-	/* This will replace locked radix tree entry with a hole page */
-	page = find_or_create_page(mapping, vmf->pgoff,
-				   vmf->gfp_mask | __GFP_ZERO);
-	if (!page) {
+	zero_page = ZERO_PAGE(0);
+	if (unlikely(!zero_page)) {
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
 
-finish_fault:
-	vmf->page = page;
-	ret = finish_fault(vmf);
-	vmf->page = NULL;
-	*entry = page;
-	if (!ret) {
-		/* Grab reference for PTE that is now referencing the page */
-		get_page(page);
-		ret = VM_FAULT_NOPAGE;
+	entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+			RADIX_DAX_ZERO_PAGE);
+	if (IS_ERR(entry2)) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
 	}
+
+	vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page));
 out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
···
 			major = VM_FAULT_MAJOR;
 		}
 		error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
-				sector, PAGE_SIZE, &entry, vmf->vma, vmf);
+				sector, PAGE_SIZE, entry, vmf->vma, vmf);
 		/* -EBUSY is fine, somebody else faulted on the same PTE */
 		if (error == -EBUSY)
 			error = 0;
···
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
-			vmf_ret = dax_load_hole(mapping, &entry, vmf);
+			vmf_ret = dax_load_hole(mapping, entry, vmf);
 			goto finish_iomap;
 		}
 		/*FALLTHRU*/
···
 		ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
 	}
 unlock_entry:
-	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+	put_locked_mapping_entry(mapping, vmf->pgoff);
 out:
 	trace_dax_pte_fault_done(inode, vmf, vmf_ret);
 	return vmf_ret;
···
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 
 static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
-		loff_t pos, void **entryp)
+		loff_t pos, void *entry)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	const sector_t sector = dax_iomap_sector(iomap, pos);
···
 		goto unlock_fallback;
 	dax_read_unlock(id);
 
-	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, sector,
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, sector,
 			RADIX_DAX_PMD);
 	if (IS_ERR(ret))
 		goto fallback;
-	*entryp = ret;
 
 	trace_dax_pmd_insert_mapping(inode, vmf, length, pfn, ret);
 	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
···
 }
 
 static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
-		void **entryp)
+		void *entry)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
···
 	if (unlikely(!zero_page))
 		goto fallback;
 
-	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
-			RADIX_DAX_PMD | RADIX_DAX_HZP);
+	ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
+			RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE);
 	if (IS_ERR(ret))
 		goto fallback;
-	*entryp = ret;
 
 	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (!pmd_none(*(vmf->pmd))) {
···
 		goto fallback;
 
 	/*
-	 * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
-	 * PMD or a HZP entry.  If it can't (because a 4k page is already in
-	 * the tree, for instance), it will return -EEXIST and we just fall
-	 * back to 4k entries.
+	 * grab_mapping_entry() will make sure we get a 2MiB empty entry, a
+	 * 2MiB zero page entry or a DAX PMD.  If it can't (because a 4k page
+	 * is already in the tree, for instance), it will return -EEXIST and
+	 * we just fall back to 4k entries.
 	 */
 	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
 	if (IS_ERR(entry))
···
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
-		result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry);
+		result = dax_pmd_insert_mapping(vmf, &iomap, pos, entry);
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
 			break;
-		result = dax_pmd_load_hole(vmf, &iomap, &entry);
+		result = dax_pmd_load_hole(vmf, &iomap, entry);
 		break;
 	default:
 		WARN_ON_ONCE(1);
···
 				&iomap);
 	}
 unlock_entry:
-	put_locked_mapping_entry(mapping, pgoff, entry);
+	put_locked_mapping_entry(mapping, pgoff);
 fallback:
 	if (result == VM_FAULT_FALLBACK) {
 		split_huge_pmd(vma, vmf->pmd, vmf->address);
fs/ext2/file.c | +1 -24

···
 	return ret;
 }
 
-static int ext2_dax_pfn_mkwrite(struct vm_fault *vmf)
-{
-	struct inode *inode = file_inode(vmf->vma->vm_file);
-	struct ext2_inode_info *ei = EXT2_I(inode);
-	loff_t size;
-	int ret;
-
-	sb_start_pagefault(inode->i_sb);
-	file_update_time(vmf->vma->vm_file);
-	down_read(&ei->dax_sem);
-
-	/* check that the faulting page hasn't raced with truncate */
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size)
-		ret = VM_FAULT_SIGBUS;
-	else
-		ret = dax_pfn_mkwrite(vmf);
-
-	up_read(&ei->dax_sem);
-	sb_end_pagefault(inode->i_sb);
-	return ret;
-}
-
 static const struct vm_operations_struct ext2_dax_vm_ops = {
 	.fault		= ext2_dax_fault,
 	/*
···
 	 * will always fail and fail back to regular faults.
 	 */
 	.page_mkwrite	= ext2_dax_fault,
-	.pfn_mkwrite	= ext2_dax_pfn_mkwrite,
+	.pfn_mkwrite	= ext2_dax_fault,
 };
 
 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
fs/ext4/file.c | +1 -31

···
 	return ext4_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
 
-/*
- * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_fault()
- * handler we check for races agaist truncate. Note that since we cycle through
- * i_mmap_sem, we are sure that also any hole punching that began before we
- * were called is finished by now and so if it included part of the file we
- * are working on, our pte will get unmapped and the check for pte_same() in
- * wp_pfn_shared() fails. Thus fault gets retried and things work out as
- * desired.
- */
-static int ext4_dax_pfn_mkwrite(struct vm_fault *vmf)
-{
-	struct inode *inode = file_inode(vmf->vma->vm_file);
-	struct super_block *sb = inode->i_sb;
-	loff_t size;
-	int ret;
-
-	sb_start_pagefault(sb);
-	file_update_time(vmf->vma->vm_file);
-	down_read(&EXT4_I(inode)->i_mmap_sem);
-	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
-	if (vmf->pgoff >= size)
-		ret = VM_FAULT_SIGBUS;
-	else
-		ret = dax_pfn_mkwrite(vmf);
-	up_read(&EXT4_I(inode)->i_mmap_sem);
-	sb_end_pagefault(sb);
-
-	return ret;
-}
-
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.huge_fault	= ext4_dax_huge_fault,
 	.page_mkwrite	= ext4_dax_fault,
-	.pfn_mkwrite	= ext4_dax_pfn_mkwrite,
+	.pfn_mkwrite	= ext4_dax_fault,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
fs/xfs/xfs_file.c | +1 -1

···
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
 	else if (IS_DAX(inode))
-		ret = dax_pfn_mkwrite(vmf);
+		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
include/linux/dax.h | +5 -7

···
 
 /*
  * We use lowest available bit in exceptional entry for locking, one bit for
- * the entry size (PMD) and two more to tell us if the entry is a huge zero
- * page (HZP) or an empty entry that is just used for locking.  In total four
- * special bits.
+ * the entry size (PMD) and two more to tell us if the entry is a zero page or
+ * an empty entry that is just used for locking.  In total four special bits.
  *
- * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and
- * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
+ * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
  * block allocation.
  */
 #define RADIX_DAX_SHIFT	(RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
 #define RADIX_DAX_ENTRY_LOCK	(1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
 #define RADIX_DAX_PMD		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_HZP		(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_ZERO_PAGE	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY	(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
 
 static inline unsigned long dax_radix_sector(void *entry)
···
 	return 0;
 }
 #endif
-int dax_pfn_mkwrite(struct vm_fault *vmf);
 
 static inline bool dax_mapping(struct address_space *mapping)
 {
include/trace/events/fs_dax.h | +0 -2

···
 
 DEFINE_PTE_FAULT_EVENT(dax_pte_fault);
 DEFINE_PTE_FAULT_EVENT(dax_pte_fault_done);
-DEFINE_PTE_FAULT_EVENT(dax_pfn_mkwrite_no_entry);
-DEFINE_PTE_FAULT_EVENT(dax_pfn_mkwrite);
 DEFINE_PTE_FAULT_EVENT(dax_load_hole);
 
 TRACE_EVENT(dax_insert_mapping,