Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:
"A few little subsystems and a start of a lot of MM patches.

Subsystems affected by this patch series: squashfs, ocfs2, parisc,
vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
kasan: move kasan_report() into report.c
mm/mm_init.c: report kasan-tag information stored in page->flags
ubsan: entirely disable alignment checks under UBSAN_TRAP
kasan: fix clang compilation warning due to stack protector
x86/mm: remove vmalloc faulting
mm: remove vmalloc_sync_(un)mappings()
x86/mm/32: implement arch_sync_kernel_mappings()
x86/mm/64: implement arch_sync_kernel_mappings()
mm/ioremap: track which page-table levels were modified
mm/vmalloc: track which page-table levels were modified
mm: add functions to track page directory modifications
s390: use __vmalloc_node in stack_alloc
powerpc: use __vmalloc_node in alloc_vm_stack
arm64: use __vmalloc_node in arch_alloc_vmap_stack
mm: remove vmalloc_user_node_flags
mm: switch the test_vmalloc module to use __vmalloc_node
mm: remove __vmalloc_node_flags_caller
mm: remove both instances of __vmalloc_node_flags
mm: remove the prot argument to __vmalloc_node
mm: remove the pgprot argument to __vmalloc
...

+2199 -2198
Documentation/admin-guide/cgroup-v2.rst
··· 1329 1329 workingset_activate 1330 1330 Number of refaulted pages that were immediately activated 1331 1331 1332 + workingset_restore 1333 + Number of restored pages which have been detected as an active 1334 + workingset before they got reclaimed. 1335 + 1332 1336 workingset_nodereclaim 1333 1337 Number of times a shadow node has been reclaimed 1334 1338 ··· 1374 1370 The total amount of swap currently being used by the cgroup 1375 1371 and its descendants. 1376 1372 1373 + memory.swap.high 1374 + A read-write single value file which exists on non-root 1375 + cgroups. The default is "max". 1376 + 1377 + Swap usage throttle limit. If a cgroup's swap usage exceeds 1378 + this limit, all its further allocations will be throttled to 1379 + allow userspace to implement custom out-of-memory procedures. 1380 + 1381 + This limit marks a point of no return for the cgroup. It is NOT 1382 + designed to manage the amount of swapping a workload does 1383 + during regular operation. Compare to memory.swap.max, which 1384 + prohibits swapping past a set amount, but lets the cgroup 1385 + continue unimpeded as long as other memory can be reclaimed. 1386 + 1387 + Healthy workloads are not expected to reach this limit. 1388 + 1377 1389 memory.swap.max 1378 1390 A read-write single value file which exists on non-root 1379 1391 cgroups. The default is "max". ··· 1402 1382 The following entries are defined. Unless specified 1403 1383 otherwise, a value change in this file generates a file 1404 1384 modified event. 1385 + 1386 + high 1387 + The number of times the cgroup's swap usage was over 1388 + the high threshold. 1405 1389 1406 1390 max 1407 1391 The number of times the cgroup's swap usage was about
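The distinction the new documentation draws between `memory.swap.high` (a throttle threshold, not a hard stop) and `memory.swap.max` (a hard cap on swap usage) can be captured in a toy model. This is purely illustrative userspace C under assumed names (`swap_try_charge` and the verdict enum are inventions, not kernel code); it only encodes the documented semantics:

```c
#include <assert.h>

/* Toy model of the memory.swap.high vs. memory.swap.max semantics
 * described in the cgroup-v2 documentation above: crossing "high"
 * throttles the cgroup (allocations still succeed, slowly), while
 * "max" prevents further swap-out entirely. */
enum swap_verdict { SWAP_OK, SWAP_THROTTLED, SWAP_DENIED };

static enum swap_verdict swap_try_charge(unsigned long usage,
                                         unsigned long high,
                                         unsigned long max)
{
	if (usage >= max)
		return SWAP_DENIED;	/* past max: no more swap for this group */
	if (usage >= high)
		return SWAP_THROTTLED;	/* past high: succeed, but slow down */
	return SWAP_OK;
}
```

Each threshold crossing also bumps the corresponding `high`/`max` counter in the cgroup's swap events file, which is what the documentation hunk above adds.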
+1 -1
Documentation/core-api/cachetlb.rst
··· 213 213 there will be no entries in the cache for the kernel address 214 214 space for virtual addresses in the range 'start' to 'end-1'. 215 215 216 - The first of these two routines is invoked after map_vm_area() 216 + The first of these two routines is invoked after map_kernel_range() 217 217 has installed the page table entries. The second is invoked 218 218 before unmap_kernel_range() deletes the page table entries. 219 219
+5 -1
Documentation/filesystems/locking.rst
··· 239 239 int (*readpage)(struct file *, struct page *); 240 240 int (*writepages)(struct address_space *, struct writeback_control *); 241 241 int (*set_page_dirty)(struct page *page); 242 + void (*readahead)(struct readahead_control *); 242 243 int (*readpages)(struct file *filp, struct address_space *mapping, 243 244 struct list_head *pages, unsigned nr_pages); 244 245 int (*write_begin)(struct file *, struct address_space *mapping, ··· 272 271 readpage: yes, unlocks 273 272 writepages: 274 273 set_page_dirty no 275 - readpages: 274 + readahead: yes, unlocks 275 + readpages: no 276 276 write_begin: locks the page exclusive 277 277 write_end: yes, unlocks exclusive 278 278 bmap: ··· 296 294 297 295 ->readpage() unlocks the page, either synchronously or via I/O 298 296 completion. 297 + 298 + ->readahead() unlocks the pages that I/O is attempted on like ->readpage(). 299 299 300 300 ->readpages() populates the pagecache with the passed pages and starts 301 301 I/O against them. They come unlocked upon I/O completion.
+2 -2
Documentation/filesystems/proc.rst
··· 1043 1043 amount of memory dedicated to the lowest level of page 1044 1044 tables. 1045 1045 NFS_Unstable 1046 - NFS pages sent to the server, but not yet committed to stable 1047 - storage 1046 + Always zero. Previous counted pages which had been written to 1047 + the server, but has not been committed to stable storage. 1048 1048 Bounce 1049 1049 Memory used for block device "bounce buffers" 1050 1050 WritebackTmp
+15
Documentation/filesystems/vfs.rst
··· 706 706 int (*readpage)(struct file *, struct page *); 707 707 int (*writepages)(struct address_space *, struct writeback_control *); 708 708 int (*set_page_dirty)(struct page *page); 709 + void (*readahead)(struct readahead_control *); 709 710 int (*readpages)(struct file *filp, struct address_space *mapping, 710 711 struct list_head *pages, unsigned nr_pages); 711 712 int (*write_begin)(struct file *, struct address_space *mapping, ··· 782 781 If defined, it should set the PageDirty flag, and the 783 782 PAGECACHE_TAG_DIRTY tag in the radix tree. 784 783 784 + ``readahead`` 785 + Called by the VM to read pages associated with the address_space 786 + object. The pages are consecutive in the page cache and are 787 + locked. The implementation should decrement the page refcount 788 + after starting I/O on each page. Usually the page will be 789 + unlocked by the I/O completion handler. If the filesystem decides 790 + to stop attempting I/O before reaching the end of the readahead 791 + window, it can simply return. The caller will decrement the page 792 + refcount and unlock the remaining pages for you. Set PageUptodate 793 + if the I/O completes successfully. Setting PageError on any page 794 + will be ignored; simply unlock the page if an I/O error occurs. 795 + 785 796 ``readpages`` 786 797 called by the VM to read pages associated with the address_space 787 798 object. This is essentially just a vector version of readpage. 788 799 Instead of just one page, several pages are requested. 789 800 readpages is only used for read-ahead, so read errors are 790 801 ignored. If anything goes wrong, feel free to give up. 802 + This interface is deprecated and will be removed by the end of 803 + 2020; implement readahead instead. 791 804 792 805 ``write_begin`` 793 806 Called by the generic buffered write code to ask the filesystem
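The contract the new `->readahead` documentation describes (the filesystem may stop submitting I/O at any point; the caller unlocks and releases whatever pages were never attempted) can be sketched as a small userspace model. Everything here is illustrative, not kernel API; `fs_readahead` and `finish_readahead` are stand-ins for the filesystem and VFS sides:

```c
#include <assert.h>

/* Minimal model of a "page" for the purposes of the contract. */
struct page { int locked; int refcount; int uptodate; };

/* "Filesystem" side: submits I/O for the first `submit` pages, then
 * simply returns, as the documentation permits. In this synchronous
 * model, I/O completion unlocks the page and marks it up to date; the
 * implementation drops its reference after starting I/O on each page. */
static void fs_readahead(struct page *pages, int nr, int submit)
{
	for (int i = 0; i < submit && i < nr; i++) {
		pages[i].uptodate = 1;
		pages[i].locked = 0;
		pages[i].refcount--;
	}
}

/* "Caller" side: unlocks and releases the pages the filesystem never
 * touched, so an early return never leaks locked pages. */
static void finish_readahead(struct page *pages, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (pages[i].locked) {
			pages[i].locked = 0;
			pages[i].refcount--;
		}
	}
}
```

After both halves run, every page ends unlocked with its reference dropped regardless of where the filesystem gave up, which is why the documentation can say "if anything goes wrong, feel free to give up".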
+1 -1
Documentation/vm/slub.rst
··· 49 49 P Poisoning (object and padding) 50 50 U User tracking (free and alloc) 51 51 T Trace (please only use on single slabs) 52 - A Toggle failslab filter mark for the cache 52 + A Enable failslab filter mark for the cache 53 53 O Switch debugging off for caches that would have 54 54 caused higher minimum slab orders 55 55 - Switch all debugging off (useful if the kernel is
+1 -1
arch/arm/configs/omap2plus_defconfig
··· 81 81 CONFIG_BINFMT_MISC=y 82 82 CONFIG_CMA=y 83 83 CONFIG_ZSMALLOC=m 84 - CONFIG_PGTABLE_MAPPING=y 84 + CONFIG_ZSMALLOC_PGTABLE_MAPPING=y 85 85 CONFIG_NET=y 86 86 CONFIG_PACKET=y 87 87 CONFIG_UNIX=y
+3
arch/arm64/include/asm/pgtable.h
··· 407 407 #define __pgprot_modify(prot,mask,bits) \ 408 408 __pgprot((pgprot_val(prot) & ~(mask)) | (bits)) 409 409 410 + #define pgprot_nx(prot) \ 411 + __pgprot_modify(prot, 0, PTE_PXN) 412 + 410 413 /* 411 414 * Mark the prot value as uncacheable and unbufferable. 412 415 */
+2 -4
arch/arm64/include/asm/vmap_stack.h
··· 19 19 { 20 20 BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK)); 21 21 22 - return __vmalloc_node_range(stack_size, THREAD_ALIGN, 23 - VMALLOC_START, VMALLOC_END, 24 - THREADINFO_GFP, PAGE_KERNEL, 0, node, 25 - __builtin_return_address(0)); 22 + return __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node, 23 + __builtin_return_address(0)); 26 24 } 27 25 28 26 #endif /* __ASM_VMAP_STACK_H */
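The arm64, powerpc, and s390 hunks in this series all make the same simplification: callers that were spelling out `VMALLOC_START`/`VMALLOC_END` and `PAGE_KERNEL` on every `__vmalloc_node_range()` call switch to `__vmalloc_node()`, which applies those defaults in one place. A userspace sketch of the shape of that refactor, with `aligned_alloc` standing in for the real allocator and all names being illustrative analogs rather than the kernel functions:

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for the defaults every caller was repeating. */
#define VMALLOC_START 0
#define VMALLOC_END   SIZE_MAX

/* The "full" interface: range must be spelled out at every call site. */
static void *vmalloc_node_range(size_t size, size_t align,
				size_t start, size_t end, int node)
{
	(void)start; (void)end; (void)node;
	return aligned_alloc(align, size);
}

/* The simplified interface: defaults applied once, in the wrapper. */
static void *vmalloc_node(size_t size, size_t align, int node)
{
	return vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
				  node);
}
```

The point of the refactor is not behavior but API surface: once no caller needs a non-default range or protection, the wider interface (and its `prot` argument, removed later in the series) can go away.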
+1 -1
arch/arm64/mm/dump.c
··· 252 252 } 253 253 254 254 static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, 255 - unsigned long val) 255 + u64 val) 256 256 { 257 257 struct pg_state *st = container_of(pt_st, struct pg_state, ptdump); 258 258 static const char units[] = "KMGTPE";
-2
arch/parisc/include/asm/pgtable.h
··· 93 93 94 94 #define set_pte_at(mm, addr, ptep, pteval) \ 95 95 do { \ 96 - pte_t old_pte; \ 97 96 unsigned long flags; \ 98 97 spin_lock_irqsave(pgd_spinlock((mm)->pgd), flags);\ 99 - old_pte = *ptep; \ 100 98 set_pte(ptep, pteval); \ 101 99 purge_tlb_entries(mm, addr); \ 102 100 spin_unlock_irqrestore(pgd_spinlock((mm)->pgd), flags);\
+2 -8
arch/powerpc/include/asm/io.h
··· 699 699 * 700 700 * * iounmap undoes such a mapping and can be hooked 701 701 * 702 - * * __ioremap_at (and the pending __iounmap_at) are low level functions to 703 - * create hand-made mappings for use only by the PCI code and cannot 704 - * currently be hooked. Must be page aligned. 705 - * 706 702 * * __ioremap_caller is the same as above but takes an explicit caller 707 703 * reference rather than using __builtin_return_address(0) 708 704 * ··· 715 719 716 720 extern void iounmap(volatile void __iomem *addr); 717 721 722 + void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size); 723 + 718 724 int early_ioremap_range(unsigned long ea, phys_addr_t pa, 719 725 unsigned long size, pgprot_t prot); 720 726 void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size, ··· 724 726 725 727 extern void __iomem *__ioremap_caller(phys_addr_t, unsigned long size, 726 728 pgprot_t prot, void *caller); 727 - 728 - extern void __iomem * __ioremap_at(phys_addr_t pa, void *ea, 729 - unsigned long size, pgprot_t prot); 730 - extern void __iounmap_at(void *ea, unsigned long size); 731 729 732 730 /* 733 731 * When CONFIG_PPC_INDIRECT_PIO is set, we use the generic iomap implementation
+1 -1
arch/powerpc/include/asm/pci-bridge.h
··· 66 66 67 67 void __iomem *io_base_virt; 68 68 #ifdef CONFIG_PPC64 69 - void *io_base_alloc; 69 + void __iomem *io_base_alloc; 70 70 #endif 71 71 resource_size_t io_base_phys; 72 72 resource_size_t pci_io_size;
+2 -3
arch/powerpc/kernel/irq.c
··· 748 748 749 749 static void *__init alloc_vm_stack(void) 750 750 { 751 - return __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, VMALLOC_START, 752 - VMALLOC_END, THREADINFO_GFP, PAGE_KERNEL, 753 - 0, NUMA_NO_NODE, (void*)_RET_IP_); 751 + return __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP, 752 + NUMA_NO_NODE, (void *)_RET_IP_); 754 753 } 755 754 756 755 static void __init vmap_irqstack_init(void)
+21 -7
arch/powerpc/kernel/isa-bridge.c
··· 18 18 #include <linux/init.h> 19 19 #include <linux/mm.h> 20 20 #include <linux/notifier.h> 21 + #include <linux/vmalloc.h> 21 22 22 23 #include <asm/processor.h> 23 24 #include <asm/io.h> ··· 38 37 39 38 #define ISA_SPACE_MASK 0x1 40 39 #define ISA_SPACE_IO 0x1 40 + 41 + static void remap_isa_base(phys_addr_t pa, unsigned long size) 42 + { 43 + WARN_ON_ONCE(ISA_IO_BASE & ~PAGE_MASK); 44 + WARN_ON_ONCE(pa & ~PAGE_MASK); 45 + WARN_ON_ONCE(size & ~PAGE_MASK); 46 + 47 + if (slab_is_available()) { 48 + if (ioremap_page_range(ISA_IO_BASE, ISA_IO_BASE + size, pa, 49 + pgprot_noncached(PAGE_KERNEL))) 50 + unmap_kernel_range(ISA_IO_BASE, size); 51 + } else { 52 + early_ioremap_range(ISA_IO_BASE, pa, size, 53 + pgprot_noncached(PAGE_KERNEL)); 54 + } 55 + } 41 56 42 57 static void pci_process_ISA_OF_ranges(struct device_node *isa_node, 43 58 unsigned long phb_io_base_phys) ··· 122 105 if (size > 0x10000) 123 106 size = 0x10000; 124 107 125 - __ioremap_at(phb_io_base_phys, (void *)ISA_IO_BASE, 126 - size, pgprot_noncached(PAGE_KERNEL)); 108 + remap_isa_base(phb_io_base_phys, size); 127 109 return; 128 110 129 111 inval_range: 130 112 printk(KERN_ERR "no ISA IO ranges or unexpected isa range, " 131 113 "mapping 64k\n"); 132 - __ioremap_at(phb_io_base_phys, (void *)ISA_IO_BASE, 133 - 0x10000, pgprot_noncached(PAGE_KERNEL)); 114 + remap_isa_base(phb_io_base_phys, 0x10000); 134 115 } 135 116 136 117 ··· 263 248 * and map it 264 249 */ 265 250 isa_io_base = ISA_IO_BASE; 266 - __ioremap_at(pbase, (void *)ISA_IO_BASE, 267 - size, pgprot_noncached(PAGE_KERNEL)); 251 + remap_isa_base(pbase, size); 268 252 269 253 pr_debug("ISA: Non-PCI bridge is %pOF\n", np); 270 254 } ··· 311 297 isa_bridge_pcidev = NULL; 312 298 313 299 /* Unmap the ISA area */ 314 - __iounmap_at((void *)ISA_IO_BASE, 0x10000); 300 + unmap_kernel_range(ISA_IO_BASE, 0x10000); 315 301 } 316 302 317 303 /**
+36 -18
arch/powerpc/kernel/pci_64.c
··· 109 109 /* Get the host bridge */ 110 110 hose = pci_bus_to_host(bus); 111 111 112 - /* Check if we have IOs allocated */ 113 - if (hose->io_base_alloc == NULL) 114 - return 0; 115 - 116 112 pr_debug("IO unmapping for PHB %pOF\n", hose->dn); 117 113 pr_debug(" alloc=0x%p\n", hose->io_base_alloc); 118 114 119 - /* This is a PHB, we fully unmap the IO area */ 120 - vunmap(hose->io_base_alloc); 121 - 115 + iounmap(hose->io_base_alloc); 122 116 return 0; 123 117 } 124 118 EXPORT_SYMBOL_GPL(pcibios_unmap_io_space); 125 119 126 - static int pcibios_map_phb_io_space(struct pci_controller *hose) 120 + void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size) 127 121 { 128 122 struct vm_struct *area; 123 + unsigned long addr; 124 + 125 + WARN_ON_ONCE(paddr & ~PAGE_MASK); 126 + WARN_ON_ONCE(size & ~PAGE_MASK); 127 + 128 + /* 129 + * Let's allocate some IO space for that guy. We don't pass VM_IOREMAP 130 + * because we don't care about alignment tricks that the core does in 131 + * that case. Maybe we should due to stupid card with incomplete 132 + * address decoding but I'd rather not deal with those outside of the 133 + * reserved 64K legacy region. 134 + */ 135 + area = __get_vm_area_caller(size, 0, PHB_IO_BASE, PHB_IO_END, 136 + __builtin_return_address(0)); 137 + if (!area) 138 + return NULL; 139 + 140 + addr = (unsigned long)area->addr; 141 + if (ioremap_page_range(addr, addr + size, paddr, 142 + pgprot_noncached(PAGE_KERNEL))) { 143 + unmap_kernel_range(addr, size); 144 + return NULL; 145 + } 146 + 147 + return (void __iomem *)addr; 148 + } 149 + EXPORT_SYMBOL_GPL(ioremap_phb); 150 + 151 + static int pcibios_map_phb_io_space(struct pci_controller *hose) 152 + { 129 153 unsigned long phys_page; 130 154 unsigned long size_page; 131 155 unsigned long io_virt_offset; ··· 170 146 * with incomplete address decoding but I'd rather not deal with 171 147 * those outside of the reserved 64K legacy region. 
172 148 */ 173 - area = __get_vm_area(size_page, 0, PHB_IO_BASE, PHB_IO_END); 174 - if (area == NULL) 149 + hose->io_base_alloc = ioremap_phb(phys_page, size_page); 150 + if (!hose->io_base_alloc) 175 151 return -ENOMEM; 176 - hose->io_base_alloc = area->addr; 177 - hose->io_base_virt = (void __iomem *)(area->addr + 178 - hose->io_base_phys - phys_page); 152 + hose->io_base_virt = hose->io_base_alloc + 153 + hose->io_base_phys - phys_page; 179 154 180 155 pr_debug("IO mapping for PHB %pOF\n", hose->dn); 181 156 pr_debug(" phys=0x%016llx, virt=0x%p (alloc=0x%p)\n", 182 157 hose->io_base_phys, hose->io_base_virt, hose->io_base_alloc); 183 158 pr_debug(" size=0x%016llx (alloc=0x%016lx)\n", 184 159 hose->pci_io_size, size_page); 185 - 186 - /* Establish the mapping */ 187 - if (__ioremap_at(phys_page, area->addr, size_page, 188 - pgprot_noncached(PAGE_KERNEL)) == NULL) 189 - return -ENOMEM; 190 160 191 161 /* Fixup hose IO resource */ 192 162 io_virt_offset = pcibios_io_space_offset(hose);
-50
arch/powerpc/mm/ioremap_64.c
··· 4 4 #include <linux/slab.h> 5 5 #include <linux/vmalloc.h> 6 6 7 - /** 8 - * Low level function to establish the page tables for an IO mapping 9 - */ 10 - void __iomem *__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_t prot) 11 - { 12 - int ret; 13 - unsigned long va = (unsigned long)ea; 14 - 15 - /* We don't support the 4K PFN hack with ioremap */ 16 - if (pgprot_val(prot) & H_PAGE_4K_PFN) 17 - return NULL; 18 - 19 - if ((ea + size) >= (void *)IOREMAP_END) { 20 - pr_warn("Outside the supported range\n"); 21 - return NULL; 22 - } 23 - 24 - WARN_ON(pa & ~PAGE_MASK); 25 - WARN_ON(((unsigned long)ea) & ~PAGE_MASK); 26 - WARN_ON(size & ~PAGE_MASK); 27 - 28 - if (slab_is_available()) { 29 - ret = ioremap_page_range(va, va + size, pa, prot); 30 - if (ret) 31 - unmap_kernel_range(va, size); 32 - } else { 33 - ret = early_ioremap_range(va, pa, size, prot); 34 - } 35 - 36 - if (ret) 37 - return NULL; 38 - 39 - return (void __iomem *)ea; 40 - } 41 - EXPORT_SYMBOL(__ioremap_at); 42 - 43 - /** 44 - * Low level function to tear down the page tables for an IO mapping. This is 45 - * used for mappings that are manipulated manually, like partial unmapping of 46 - * PCI IOs or ISA space. 47 - */ 48 - void __iounmap_at(void *ea, unsigned long size) 49 - { 50 - WARN_ON(((unsigned long)ea) & ~PAGE_MASK); 51 - WARN_ON(size & ~PAGE_MASK); 52 - 53 - unmap_kernel_range((unsigned long)ea, size); 54 - } 55 - EXPORT_SYMBOL(__iounmap_at); 56 - 57 7 void __iomem *__ioremap_caller(phys_addr_t addr, unsigned long size, 58 8 pgprot_t prot, void *caller) 59 9 {
+2 -2
arch/riscv/include/asm/pgtable.h
··· 473 473 #define PAGE_SHARED __pgprot(0) 474 474 #define PAGE_KERNEL __pgprot(0) 475 475 #define swapper_pg_dir NULL 476 + #define TASK_SIZE 0xffffffffUL 476 477 #define VMALLOC_START 0 477 - 478 - #define TASK_SIZE 0xffffffffUL 478 + #define VMALLOC_END TASK_SIZE 479 479 480 480 static inline void __kernel_map_pages(struct page *page, int numpages, int enable) {} 481 481
+1 -1
arch/riscv/mm/ptdump.c
··· 204 204 } 205 205 206 206 static void note_page(struct ptdump_state *pt_st, unsigned long addr, 207 - int level, unsigned long val) 207 + int level, u64 val) 208 208 { 209 209 struct pg_state *st = container_of(pt_st, struct pg_state, ptdump); 210 210 u64 pa = PFN_PHYS(pte_pfn(__pte(val)));
+3 -6
arch/s390/kernel/setup.c
··· 305 305 unsigned long stack_alloc(void) 306 306 { 307 307 #ifdef CONFIG_VMAP_STACK 308 - return (unsigned long) 309 - __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, 310 - VMALLOC_START, VMALLOC_END, 311 - THREADINFO_GFP, 312 - PAGE_KERNEL, 0, NUMA_NO_NODE, 313 - __builtin_return_address(0)); 308 + return (unsigned long)__vmalloc_node(THREAD_SIZE, THREAD_SIZE, 309 + THREADINFO_GFP, NUMA_NO_NODE, 310 + __builtin_return_address(0)); 314 311 #else 315 312 return __get_free_pages(GFP_KERNEL, THREAD_SIZE_ORDER); 316 313 #endif
+2 -1
arch/sh/kernel/cpu/sh4/sq.c
··· 103 103 #if defined(CONFIG_MMU) 104 104 struct vm_struct *vma; 105 105 106 - vma = __get_vm_area(map->size, VM_ALLOC, map->sq_addr, SQ_ADDRMAX); 106 + vma = __get_vm_area_caller(map->size, VM_ALLOC, map->sq_addr, 107 + SQ_ADDRMAX, __builtin_return_address(0)); 107 108 if (!vma) 108 109 return -ENOMEM; 109 110
+2 -3
arch/x86/hyperv/hv_init.c
··· 97 97 * not be stopped in the case of CPU offlining and the VM will hang. 98 98 */ 99 99 if (!*hvp) { 100 - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO, 101 - PAGE_KERNEL); 100 + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO); 102 101 } 103 102 104 103 if (*hvp) { ··· 378 379 guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0); 379 380 wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id); 380 381 381 - hv_hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX); 382 + hv_hypercall_pg = vmalloc_exec(PAGE_SIZE); 382 383 if (hv_hypercall_pg == NULL) { 383 384 wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0); 384 385 goto remove_cpuhp_state;
+1 -2
arch/x86/include/asm/kvm_host.h
··· 1279 1279 #define __KVM_HAVE_ARCH_VM_ALLOC 1280 1280 static inline struct kvm *kvm_arch_alloc_vm(void) 1281 1281 { 1282 - return __vmalloc(kvm_x86_ops.vm_size, 1283 - GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL); 1282 + return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO); 1284 1283 } 1285 1284 void kvm_arch_free_vm(struct kvm *kvm); 1286 1285
+2
arch/x86/include/asm/pgtable-2level_types.h
··· 20 20 21 21 #define SHARED_KERNEL_PMD 0 22 22 23 + #define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PMD_MODIFIED 24 + 23 25 /* 24 26 * traditional i386 two-level paging structure: 25 27 */
+2
arch/x86/include/asm/pgtable-3level_types.h
··· 27 27 #define SHARED_KERNEL_PMD (!static_cpu_has(X86_FEATURE_PTI)) 28 28 #endif 29 29 30 + #define ARCH_PAGE_TABLE_SYNC_MASK (SHARED_KERNEL_PMD ? 0 : PGTBL_PMD_MODIFIED) 31 + 30 32 /* 31 33 * PGDIR_SHIFT determines what a top-level page table entry can map 32 34 */
+2
arch/x86/include/asm/pgtable_64_types.h
··· 159 159 160 160 #define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t)) 161 161 162 + #define ARCH_PAGE_TABLE_SYNC_MASK (pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED) 163 + 162 164 #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
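The three `ARCH_PAGE_TABLE_SYNC_MASK` definitions above are the x86 side of the series' new tracking scheme: vmalloc/ioremap record which page-table levels they modified in a bitmask, and the architecture hook runs only when a level the architecture cares about was touched. A minimal userspace model of that gating (the `PGTBL_*_MODIFIED` flag names mirror the kernel's; the rest, including `maybe_sync`, is illustrative):

```c
/* Per-level modification flags, as introduced by this series. */
#define PGTBL_PTE_MODIFIED	(1U << 0)
#define PGTBL_PMD_MODIFIED	(1U << 1)
#define PGTBL_PUD_MODIFIED	(1U << 2)
#define PGTBL_P4D_MODIFIED	(1U << 3)
#define PGTBL_PGD_MODIFIED	(1U << 4)

/* e.g. 32-bit x86 without shared kernel PMDs cares about PMD changes. */
#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PMD_MODIFIED

static int sync_calls;

/* Architecture hook: propagate new kernel mappings to other page-table
 * roots (a counter stands in for the real work here). */
static void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
{
	(void)start; (void)end;
	sync_calls++;
}

/* Core-mm side: at the end of a mapping operation, sync only if a
 * level named in the arch mask was modified. */
static void maybe_sync(unsigned int mask, unsigned long start,
		       unsigned long end)
{
	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
		arch_sync_kernel_mappings(start, end);
}
```

This explicit, eager sync is what lets the series delete the lazy `vmalloc_fault()` path further down: mappings are propagated when they are created, not when the first fault hits them.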
+6 -2
arch/x86/include/asm/pgtable_types.h
··· 194 194 #define _PAGE_TABLE_NOENC (__PP|__RW|_USR|___A| 0|___D| 0| 0) 195 195 #define _PAGE_TABLE (__PP|__RW|_USR|___A| 0|___D| 0| 0| _ENC) 196 196 #define __PAGE_KERNEL_RO (__PP| 0| 0|___A|__NX|___D| 0|___G) 197 - #define __PAGE_KERNEL_RX (__PP| 0| 0|___A| 0|___D| 0|___G) 198 197 #define __PAGE_KERNEL_NOCACHE (__PP|__RW| 0|___A|__NX|___D| 0|___G| __NC) 199 198 #define __PAGE_KERNEL_VVAR (__PP| 0|_USR|___A|__NX|___D| 0|___G) 200 199 #define __PAGE_KERNEL_LARGE (__PP|__RW| 0|___A|__NX|___D|_PSE|___G) ··· 219 220 #define PAGE_KERNEL_RO __pgprot_mask(__PAGE_KERNEL_RO | _ENC) 220 221 #define PAGE_KERNEL_EXEC __pgprot_mask(__PAGE_KERNEL_EXEC | _ENC) 221 222 #define PAGE_KERNEL_EXEC_NOENC __pgprot_mask(__PAGE_KERNEL_EXEC | 0) 222 - #define PAGE_KERNEL_RX __pgprot_mask(__PAGE_KERNEL_RX | _ENC) 223 223 #define PAGE_KERNEL_NOCACHE __pgprot_mask(__PAGE_KERNEL_NOCACHE | _ENC) 224 224 #define PAGE_KERNEL_LARGE __pgprot_mask(__PAGE_KERNEL_LARGE | _ENC) 225 225 #define PAGE_KERNEL_LARGE_EXEC __pgprot_mask(__PAGE_KERNEL_LARGE_EXEC | _ENC) ··· 281 283 typedef struct pgprot { pgprotval_t pgprot; } pgprot_t; 282 284 283 285 typedef struct { pgdval_t pgd; } pgd_t; 286 + 287 + static inline pgprot_t pgprot_nx(pgprot_t prot) 288 + { 289 + return __pgprot(pgprot_val(prot) | _PAGE_NX); 290 + } 291 + #define pgprot_nx pgprot_nx 284 292 285 293 #ifdef CONFIG_X86_PAE 286 294
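The new `pgprot_nx()` helper above gives callers a portable way to force a protection value non-executable, which is part of how the series removes the explicit `pgprot` argument from `__vmalloc()` (the old `PAGE_KERNEL_RX`/`__PAGE_KERNEL_RX` constants it replaces are deleted in the same hunk). A compilable userspace model mirroring the x86 definition, with the types and bit value reduced to stand-ins:

```c
/* Userspace stand-ins for the kernel's pgprot machinery. */
typedef unsigned long long pgprotval_t;
typedef struct { pgprotval_t pgprot; } pgprot_t;

#define _PAGE_NX	(1ULL << 63)
#define __pgprot(x)	((pgprot_t){ (x) })
#define pgprot_val(x)	((x).pgprot)

/* Mirror of the x86 pgprot_nx() added above: set the no-execute bit,
 * leaving all other protection bits untouched. */
static inline pgprot_t pgprot_nx(pgprot_t prot)
{
	return __pgprot(pgprot_val(prot) | _PAGE_NX);
}
```

Because it only ORs in one bit, the helper is idempotent and safe to apply unconditionally on the vmap path.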
-23
arch/x86/include/asm/switch_to.h
··· 12 12 __visible struct task_struct *__switch_to(struct task_struct *prev, 13 13 struct task_struct *next); 14 14 15 - /* This runs runs on the previous thread's stack. */ 16 - static inline void prepare_switch_to(struct task_struct *next) 17 - { 18 - #ifdef CONFIG_VMAP_STACK 19 - /* 20 - * If we switch to a stack that has a top-level paging entry 21 - * that is not present in the current mm, the resulting #PF will 22 - * will be promoted to a double-fault and we'll panic. Probe 23 - * the new stack now so that vmalloc_fault can fix up the page 24 - * tables if needed. This can only happen if we use a stack 25 - * in vmap space. 26 - * 27 - * We assume that the stack is aligned so that it never spans 28 - * more than one top-level paging entry. 29 - * 30 - * To minimize cache pollution, just follow the stack pointer. 31 - */ 32 - READ_ONCE(*(unsigned char *)next->thread.sp); 33 - #endif 34 - } 35 - 36 15 asmlinkage void ret_from_fork(void); 37 16 38 17 /* ··· 46 67 47 68 #define switch_to(prev, next, last) \ 48 69 do { \ 49 - prepare_switch_to(next); \ 50 - \ 51 70 ((last) = __switch_to_asm((prev), (next))); \ 52 71 } while (0) 53 72
+1 -1
arch/x86/kernel/irq_64.c
··· 43 43 pages[i] = pfn_to_page(pa >> PAGE_SHIFT); 44 44 } 45 45 46 - va = vmap(pages, IRQ_STACK_SIZE / PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL); 46 + va = vmap(pages, IRQ_STACK_SIZE / PAGE_SIZE, VM_MAP, PAGE_KERNEL); 47 47 if (!va) 48 48 return -ENOMEM; 49 49
+3 -3
arch/x86/kernel/setup_percpu.c
··· 287 287 /* 288 288 * Sync back kernel address range again. We already did this in 289 289 * setup_arch(), but percpu data also needs to be available in 290 - * the smpboot asm. We can't reliably pick up percpu mappings 291 - * using vmalloc_fault(), because exception dispatch needs 292 - * percpu data. 290 + * the smpboot asm and arch_sync_kernel_mappings() doesn't sync to 291 + * swapper_pg_dir on 32-bit. The per-cpu mappings need to be available 292 + * there too. 293 293 * 294 294 * FIXME: Can the later sync in setup_cpu_entry_areas() replace 295 295 * this call?
+1 -2
arch/x86/kvm/svm/sev.c
··· 336 336 /* Avoid using vmalloc for smaller buffers. */ 337 337 size = npages * sizeof(struct page *); 338 338 if (size > PAGE_SIZE) 339 - pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 340 - PAGE_KERNEL); 339 + pages = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO); 341 340 else 342 341 pages = kmalloc(size, GFP_KERNEL_ACCOUNT); 343 342
+21 -14
arch/x86/mm/dump_pagetables.c
··· 249 249 (void *)st->start_address); 250 250 } 251 251 252 - static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2) 252 + static void effective_prot(struct ptdump_state *pt_st, int level, u64 val) 253 253 { 254 - return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) | 255 - ((prot1 | prot2) & _PAGE_NX); 254 + struct pg_state *st = container_of(pt_st, struct pg_state, ptdump); 255 + pgprotval_t prot = val & PTE_FLAGS_MASK; 256 + pgprotval_t effective; 257 + 258 + if (level > 0) { 259 + pgprotval_t higher_prot = st->prot_levels[level - 1]; 260 + 261 + effective = (higher_prot & prot & (_PAGE_USER | _PAGE_RW)) | 262 + ((higher_prot | prot) & _PAGE_NX); 263 + } else { 264 + effective = prot; 265 + } 266 + 267 + st->prot_levels[level] = effective; 256 268 } 257 269 258 270 /* ··· 273 261 * print what we collected so far. 274 262 */ 275 263 static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level, 276 - unsigned long val) 264 + u64 val) 277 265 { 278 266 struct pg_state *st = container_of(pt_st, struct pg_state, ptdump); 279 267 pgprotval_t new_prot, new_eff; ··· 282 270 struct seq_file *m = st->seq; 283 271 284 272 new_prot = val & PTE_FLAGS_MASK; 285 - 286 - if (level > 0) { 287 - new_eff = effective_prot(st->prot_levels[level - 1], 288 - new_prot); 289 - } else { 290 - new_eff = new_prot; 291 - } 292 - 293 - if (level >= 0) 294 - st->prot_levels[level] = new_eff; 273 + if (!val) 274 + new_eff = 0; 275 + else 276 + new_eff = st->prot_levels[level]; 295 277 296 278 /* 297 279 * If we have a "break" in the series, we need to flush the state that ··· 380 374 struct pg_state st = { 381 375 .ptdump = { 382 376 .note_page = note_page, 377 + .effective_prot = effective_prot, 383 378 .range = ptdump_ranges 384 379 }, 385 380 .level = -1,
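The dump_pagetables.c hunk above moves the effective-permission calculation into a ptdump `effective_prot` callback, but the calculation itself is unchanged: USER and RW are only effective if granted at every paging level, while NX at any level makes the range non-executable. A userspace model of just that calculation (bit positions are stand-ins):

```c
#define _PAGE_RW	(1UL << 1)
#define _PAGE_USER	(1UL << 2)
#define _PAGE_NX	(1UL << 63)

/* Effective permissions of a leaf given the effective permissions of
 * its parent level: AND for grants (USER, RW), OR for denials (NX). */
static unsigned long effective_prot(unsigned long higher, unsigned long prot)
{
	return (higher & prot & (_PAGE_USER | _PAGE_RW)) |
	       ((higher | prot) & _PAGE_NX);
}
```

Folding this per-level, parent-to-child, is what lets the dumper report what the hardware will actually enforce for a mapping rather than what any single entry says.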
+6 -170
arch/x86/mm/fault.c
··· 190 190 return pmd_k; 191 191 } 192 192 193 - static void vmalloc_sync(void) 193 + void arch_sync_kernel_mappings(unsigned long start, unsigned long end) 194 194 { 195 - unsigned long address; 195 + unsigned long addr; 196 196 197 - if (SHARED_KERNEL_PMD) 198 - return; 199 - 200 - for (address = VMALLOC_START & PMD_MASK; 201 - address >= TASK_SIZE_MAX && address < VMALLOC_END; 202 - address += PMD_SIZE) { 197 + for (addr = start & PMD_MASK; 198 + addr >= TASK_SIZE_MAX && addr < VMALLOC_END; 199 + addr += PMD_SIZE) { 203 200 struct page *page; 204 201 205 202 spin_lock(&pgd_lock); ··· 207 210 pgt_lock = &pgd_page_get_mm(page)->page_table_lock; 208 211 209 212 spin_lock(pgt_lock); 210 - vmalloc_sync_one(page_address(page), address); 213 + vmalloc_sync_one(page_address(page), addr); 211 214 spin_unlock(pgt_lock); 212 215 } 213 216 spin_unlock(&pgd_lock); 214 217 } 215 218 } 216 - 217 - void vmalloc_sync_mappings(void) 218 - { 219 - vmalloc_sync(); 220 - } 221 - 222 - void vmalloc_sync_unmappings(void) 223 - { 224 - vmalloc_sync(); 225 - } 226 - 227 - /* 228 - * 32-bit: 229 - * 230 - * Handle a fault on the vmalloc or module mapping area 231 - */ 232 - static noinline int vmalloc_fault(unsigned long address) 233 - { 234 - unsigned long pgd_paddr; 235 - pmd_t *pmd_k; 236 - pte_t *pte_k; 237 - 238 - /* Make sure we are in vmalloc area: */ 239 - if (!(address >= VMALLOC_START && address < VMALLOC_END)) 240 - return -1; 241 - 242 - /* 243 - * Synchronize this task's top level page-table 244 - * with the 'reference' page table. 245 - * 246 - * Do _not_ use "current" here. We might be inside 247 - * an interrupt in the middle of a task switch.. 
248 - */ 249 - pgd_paddr = read_cr3_pa(); 250 - pmd_k = vmalloc_sync_one(__va(pgd_paddr), address); 251 - if (!pmd_k) 252 - return -1; 253 - 254 - if (pmd_large(*pmd_k)) 255 - return 0; 256 - 257 - pte_k = pte_offset_kernel(pmd_k, address); 258 - if (!pte_present(*pte_k)) 259 - return -1; 260 - 261 - return 0; 262 - } 263 - NOKPROBE_SYMBOL(vmalloc_fault); 264 219 265 220 /* 266 221 * Did it hit the DOS screen memory VA from vm86 mode? ··· 277 328 } 278 329 279 330 #else /* CONFIG_X86_64: */ 280 - 281 - void vmalloc_sync_mappings(void) 282 - { 283 - /* 284 - * 64-bit mappings might allocate new p4d/pud pages 285 - * that need to be propagated to all tasks' PGDs. 286 - */ 287 - sync_global_pgds(VMALLOC_START & PGDIR_MASK, VMALLOC_END); 288 - } 289 - 290 - void vmalloc_sync_unmappings(void) 291 - { 292 - /* 293 - * Unmappings never allocate or free p4d/pud pages. 294 - * No work is required here. 295 - */ 296 - } 297 - 298 - /* 299 - * 64-bit: 300 - * 301 - * Handle a fault on the vmalloc area 302 - */ 303 - static noinline int vmalloc_fault(unsigned long address) 304 - { 305 - pgd_t *pgd, *pgd_k; 306 - p4d_t *p4d, *p4d_k; 307 - pud_t *pud; 308 - pmd_t *pmd; 309 - pte_t *pte; 310 - 311 - /* Make sure we are in vmalloc area: */ 312 - if (!(address >= VMALLOC_START && address < VMALLOC_END)) 313 - return -1; 314 - 315 - /* 316 - * Copy kernel mappings over when needed. This can also 317 - * happen within a race in page table update. In the later 318 - * case just flush: 319 - */ 320 - pgd = (pgd_t *)__va(read_cr3_pa()) + pgd_index(address); 321 - pgd_k = pgd_offset_k(address); 322 - if (pgd_none(*pgd_k)) 323 - return -1; 324 - 325 - if (pgtable_l5_enabled()) { 326 - if (pgd_none(*pgd)) { 327 - set_pgd(pgd, *pgd_k); 328 - arch_flush_lazy_mmu_mode(); 329 - } else { 330 - BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_k)); 331 - } 332 - } 333 - 334 - /* With 4-level paging, copying happens on the p4d level. */ 335 - p4d = p4d_offset(pgd, address); 336 - p4d_k = p4d_offset(pgd_k, address); 337 - if (p4d_none(*p4d_k)) 338 - return -1; 339 - 340 - if (p4d_none(*p4d) && !pgtable_l5_enabled()) { 341 - set_p4d(p4d, *p4d_k); 342 - arch_flush_lazy_mmu_mode(); 343 - } else { 344 - BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_k)); 345 - } 346 - 347 - BUILD_BUG_ON(CONFIG_PGTABLE_LEVELS < 4); 348 - 349 - pud = pud_offset(p4d, address); 350 - if (pud_none(*pud)) 351 - return -1; 352 - 353 - if (pud_large(*pud)) 354 - return 0; 355 - 356 - pmd = pmd_offset(pud, address); 357 - if (pmd_none(*pmd)) 358 - return -1; 359 - 360 - if (pmd_large(*pmd)) 361 - return 0; 362 - 363 - pte = pte_offset_kernel(pmd, address); 364 - if (!pte_present(*pte)) 365 - return -1; 366 - 367 - return 0; 368 - } 369 - NOKPROBE_SYMBOL(vmalloc_fault); 370 331 371 332 #ifdef CONFIG_CPU_SUP_AMD 372 333 static const char errata93_warning[] = ··· 1115 1256 * space, so do not expect them here. 1116 1257 */ 1117 1258 WARN_ON_ONCE(hw_error_code & X86_PF_PK); 1118 - 1119 - /* 1120 - * We can fault-in kernel-space virtual memory on-demand. The 1121 - * 'reference' page table is init_mm.pgd. 1122 - * 1123 - * NOTE! We MUST NOT take any locks for this case. We may 1124 - * be in an interrupt or a critical region, and should 1125 - * only copy the information from the master page table, 1126 - * nothing more. 1127 - * 1128 - * Before doing this on-demand faulting, ensure that the 1129 - * fault is not any of the following: 1130 - * 1. A fault on a PTE with a reserved bit set. 1131 - * 2. A fault caused by a user-mode access. (Do not demand- 1132 - * fault kernel memory due to user-mode accesses). 1133 - * 3. A fault caused by a page-level protection violation. 1134 - * (A demand fault would be on a non-present page which 1135 - * would have X86_PF_PROT==0). 1136 - */ 1137 - if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) { 1138 - if (vmalloc_fault(address) >= 0) 1139 - return; 1140 - } 1141 1259 1142 1260 /* Was the fault spurious, caused by lazy TLB invalidation? */ 1143 1261 if (spurious_kernel_fault(hw_error_code, address))
+5
arch/x86/mm/init_64.c
··· 218 218 sync_global_pgds_l4(start, end); 219 219 } 220 220 221 + void arch_sync_kernel_mappings(unsigned long start, unsigned long end) 222 + { 223 + sync_global_pgds(start, end); 224 + } 225 + 221 226 /* 222 227 * NOTE: This function is marked __ref because it calls __init function 223 228 * (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
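The hunks above replace lazy vmalloc faulting with explicit synchronization: vmalloc and ioremap now record which page-table levels they modified, and `arch_sync_kernel_mappings()` runs only when those levels intersect the arch's sync mask. A minimal userspace sketch of that mask check follows; the flag values and the `_model` suffixes are illustrative stand-ins, not the kernel definitions (which live in the pgtable headers):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the PGTBL_*_MODIFIED flags this series adds. */
enum {
	PGTBL_PTE_MODIFIED = 1 << 0,
	PGTBL_PMD_MODIFIED = 1 << 1,
	PGTBL_PUD_MODIFIED = 1 << 2,
	PGTBL_P4D_MODIFIED = 1 << 3,
	PGTBL_PGD_MODIFIED = 1 << 4,
};

/* Pretend the arch only needs to propagate top-level changes, as
 * x86-64 does for the vmalloc area. */
#define ARCH_PAGE_TABLE_SYNC_MASK (PGTBL_PGD_MODIFIED | PGTBL_P4D_MODIFIED)

static int sync_calls;	/* counts invocations of the mock arch hook */

static void arch_sync_kernel_mappings_model(unsigned long start,
					    unsigned long end)
{
	(void)start;
	(void)end;
	sync_calls++;
}

/* Mirrors the check callers perform after populating page tables:
 * sync only when a level the arch cares about was actually modified.
 * Returns 1 if the hook ran, 0 otherwise. */
static int maybe_sync(int mask, unsigned long start, unsigned long end)
{
	if (mask & ARCH_PAGE_TABLE_SYNC_MASK) {
		arch_sync_kernel_mappings_model(start, end);
		return 1;
	}
	return 0;
}
```

The point of the mask is visible in the sketch: leaf-level (PTE/PMD) changes never trigger a cross-page-table sync, so the common vmalloc path pays nothing.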
+1 -7
arch/x86/mm/pti.c
··· 448 448 * the sp1 and sp2 slots. 449 449 * 450 450 * This is done for all possible CPUs during boot to ensure 451 - * that it's propagated to all mms. If we were to add one of 452 - * these mappings during CPU hotplug, we would need to take 453 - * some measure to make sure that every mm that subsequently 454 - * ran on that CPU would have the relevant PGD entry in its 455 - * pagetables. The usual vmalloc_fault() mechanism would not 456 - * work for page faults taken in entry_SYSCALL_64 before RSP 457 - * is set up. 451 + * that it's propagated to all mms. 458 452 */ 459 453 460 454 unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
-37
arch/x86/mm/tlb.c
··· 161 161 local_irq_restore(flags); 162 162 } 163 163 164 - static void sync_current_stack_to_mm(struct mm_struct *mm) 165 - { 166 - unsigned long sp = current_stack_pointer; 167 - pgd_t *pgd = pgd_offset(mm, sp); 168 - 169 - if (pgtable_l5_enabled()) { 170 - if (unlikely(pgd_none(*pgd))) { 171 - pgd_t *pgd_ref = pgd_offset_k(sp); 172 - 173 - set_pgd(pgd, *pgd_ref); 174 - } 175 - } else { 176 - /* 177 - * "pgd" is faked. The top level entries are "p4d"s, so sync 178 - * the p4d. This compiles to approximately the same code as 179 - * the 5-level case. 180 - */ 181 - p4d_t *p4d = p4d_offset(pgd, sp); 182 - 183 - if (unlikely(p4d_none(*p4d))) { 184 - pgd_t *pgd_ref = pgd_offset_k(sp); 185 - p4d_t *p4d_ref = p4d_offset(pgd_ref, sp); 186 - 187 - set_p4d(p4d, *p4d_ref); 188 - } 189 - } 190 - } 191 - 192 164 static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next) 193 165 { 194 166 unsigned long next_tif = task_thread_info(next)->flags; ··· 348 376 * one process from doing Spectre-v2 attacks on another. 349 377 */ 350 378 cond_ibpb(tsk); 351 - 352 - if (IS_ENABLED(CONFIG_VMAP_STACK)) { 353 - /* 354 - * If our current stack is in vmalloc space and isn't 355 - * mapped in the new pgd, we'll double-fault. Forcibly 356 - * map it. 357 - */ 358 - sync_current_stack_to_mm(next); 359 - } 360 379 361 380 /* 362 381 * Stop remote flushes for the previous mm.
+1
block/blk-core.c
··· 20 20 #include <linux/blk-mq.h> 21 21 #include <linux/highmem.h> 22 22 #include <linux/mm.h> 23 + #include <linux/pagemap.h> 23 24 #include <linux/kernel_stat.h> 24 25 #include <linux/string.h> 25 26 #include <linux/init.h>
-6
drivers/acpi/apei/ghes.c
··· 167 167 if (!addr) 168 168 goto err_pool_alloc; 169 169 170 - /* 171 - * New allocation must be visible in all pgd before it can be found by 172 - * an NMI allocating from the pool. 173 - */ 174 - vmalloc_sync_mappings(); 175 - 176 170 rc = gen_pool_add(ghes_estatus_pool, addr, PAGE_ALIGN(len), -1); 177 171 if (rc) 178 172 goto err_pool_add;
+1 -1
drivers/base/node.c
··· 445 445 nid, sum_zone_node_page_state(nid, NR_KERNEL_SCS_KB), 446 446 #endif 447 447 nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)), 448 - nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), 448 + nid, 0UL, 449 449 nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), 450 450 nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), 451 451 nid, K(sreclaimable +
+1 -3
drivers/block/drbd/drbd_bitmap.c
··· 396 396 bytes = sizeof(struct page *)*want; 397 397 new_pages = kzalloc(bytes, GFP_NOIO | __GFP_NOWARN); 398 398 if (!new_pages) { 399 - new_pages = __vmalloc(bytes, 400 - GFP_NOIO | __GFP_ZERO, 401 - PAGE_KERNEL); 399 + new_pages = __vmalloc(bytes, GFP_NOIO | __GFP_ZERO); 402 400 if (!new_pages) 403 401 return NULL; 404 402 }
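The `__vmalloc()` change above drops the `pgprot_t` argument (remaining callers all passed `PAGE_KERNEL`), leaving only a size and gfp flags. The surrounding drbd logic is the common try-kmalloc-then-fall-back-to-vmalloc pattern; here is a hedged userspace sketch of the same shape, with `calloc()`/`malloc()` standing in for `kzalloc()` and the new two-argument `__vmalloc()` since gfp flags have no userspace equivalent:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace sketch of the fallback pattern in the hunk above: try the
 * slab-style allocator first, fall back to the vmalloc-style one, and
 * zero the result either way. Function name and allocators are
 * illustrative, not kernel code. */
static void **alloc_page_array(size_t want)
{
	size_t bytes = sizeof(void *) * want;
	void **pages = calloc(want, sizeof(void *)); /* kzalloc(bytes, GFP_NOIO | __GFP_NOWARN) */

	if (!pages) {
		pages = malloc(bytes);	/* __vmalloc(bytes, GFP_NOIO | __GFP_ZERO) */
		if (!pages)
			return NULL;
		memset(pages, 0, bytes);
	}
	return pages;
}
```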
+1 -1
drivers/block/loop.c
··· 919 919 920 920 static int loop_kthread_worker_fn(void *worker_ptr) 921 921 { 922 - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO; 922 + current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO; 923 923 return kthread_worker_fn(worker_ptr); 924 924 } 925 925
+1
drivers/dax/device.c
··· 377 377 inode->i_mapping->a_ops = &dev_dax_aops; 378 378 filp->f_mapping = inode->i_mapping; 379 379 filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping); 380 + filp->f_sb_err = file_sample_sb_err(filp); 380 381 filp->private_data = dev_dax; 381 382 inode->i_flags = S_DAX; 382 383
+1 -10
drivers/gpu/drm/drm_scatter.c
··· 43 43 44 44 #define DEBUG_SCATTER 0 45 45 46 - static inline void *drm_vmalloc_dma(unsigned long size) 47 - { 48 - #if defined(__powerpc__) && defined(CONFIG_NOT_COHERENT_CACHE) 49 - return __vmalloc(size, GFP_KERNEL, pgprot_noncached_wc(PAGE_KERNEL)); 50 - #else 51 - return vmalloc_32(size); 52 - #endif 53 - } 54 - 55 46 static void drm_sg_cleanup(struct drm_sg_mem * entry) 56 47 { 57 48 struct page *page; ··· 117 126 return -ENOMEM; 118 127 } 119 128 120 - entry->virtual = drm_vmalloc_dma(pages << PAGE_SHIFT); 129 + entry->virtual = vmalloc_32(pages << PAGE_SHIFT); 121 130 if (!entry->virtual) { 122 131 kfree(entry->busaddr); 123 132 kfree(entry->pagelist);
+2 -2
drivers/gpu/drm/etnaviv/etnaviv_dump.c
··· 154 154 file_size += sizeof(*iter.hdr) * n_obj; 155 155 156 156 /* Allocate the file in vmalloc memory, it's likely to be big */ 157 - iter.start = __vmalloc(file_size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, 158 - PAGE_KERNEL); 157 + iter.start = __vmalloc(file_size, GFP_KERNEL | __GFP_NOWARN | 158 + __GFP_NORETRY); 159 159 if (!iter.start) { 160 160 mutex_unlock(&gpu->mmu_context->lock); 161 161 dev_warn(gpu->dev, "failed to allocate devcoredump file\n");
+1 -1
drivers/gpu/drm/i915/gem/selftests/mock_dmabuf.c
··· 66 66 { 67 67 struct mock_dmabuf *mock = to_mock(dma_buf); 68 68 69 - return vm_map_ram(mock->pages, mock->npages, 0, PAGE_KERNEL); 69 + return vm_map_ram(mock->pages, mock->npages, 0); 70 70 } 71 71 72 72 static void mock_dmabuf_vunmap(struct dma_buf *dma_buf, void *vaddr)
+2 -3
drivers/lightnvm/pblk-init.c
··· 145 145 int ret = 0; 146 146 147 147 map_size = pblk_trans_map_size(pblk); 148 - pblk->trans_map = __vmalloc(map_size, GFP_KERNEL | __GFP_NOWARN 149 - | __GFP_RETRY_MAYFAIL | __GFP_HIGHMEM, 150 - PAGE_KERNEL); 148 + pblk->trans_map = __vmalloc(map_size, GFP_KERNEL | __GFP_NOWARN | 149 + __GFP_RETRY_MAYFAIL | __GFP_HIGHMEM); 151 150 if (!pblk->trans_map) { 152 151 pblk_err(pblk, "failed to allocate L2P (need %zu of memory)\n", 153 152 map_size);
+2 -2
drivers/md/dm-bufio.c
··· 400 400 */ 401 401 if (gfp_mask & __GFP_NORETRY) { 402 402 unsigned noio_flag = memalloc_noio_save(); 403 - void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL); 403 + void *ptr = __vmalloc(c->block_size, gfp_mask); 404 404 405 405 memalloc_noio_restore(noio_flag); 406 406 return ptr; 407 407 } 408 408 409 - return __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL); 409 + return __vmalloc(c->block_size, gfp_mask); 410 410 } 411 411 412 412 /*
+2 -10
drivers/md/md-bitmap.c
··· 324 324 wake_up(&bitmap->write_wait); 325 325 } 326 326 327 - /* copied from buffer.c */ 328 - static void 329 - __clear_page_buffers(struct page *page) 330 - { 331 - ClearPagePrivate(page); 332 - set_page_private(page, 0); 333 - put_page(page); 334 - } 335 327 static void free_buffers(struct page *page) 336 328 { 337 329 struct buffer_head *bh; ··· 337 345 free_buffer_head(bh); 338 346 bh = next; 339 347 } 340 - __clear_page_buffers(page); 348 + detach_page_private(page); 341 349 put_page(page); 342 350 } 343 351 ··· 366 374 ret = -ENOMEM; 367 375 goto out; 368 376 } 369 - attach_page_buffers(page, bh); 377 + attach_page_private(page, bh); 370 378 blk_cur = index << (PAGE_SHIFT - inode->i_blkbits); 371 379 while (bh) { 372 380 block = blk_cur;
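Several hunks in this series replace the open-coded `SetPagePrivate()`/`get_page()`/`set_page_private()` triplet (and its inverse) with the new `attach_page_private()`/`detach_page_private()` helpers. A small userspace model of their bookkeeping, assuming the documented semantics (attach takes a page reference and sets the private flag; detach reverses both and hands the data back); the `struct page_model` and `_model` names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of struct page's private-data state. */
struct page_model {
	int has_private;	/* models the PG_private flag */
	unsigned long private;	/* models page->private */
	int refcount;		/* models the page refcount */
};

/* attach_page_private(): take a page ref, store data, set PG_private. */
static void attach_page_private_model(struct page_model *page, void *data)
{
	page->refcount++;
	page->private = (unsigned long)data;
	page->has_private = 1;
}

/* detach_page_private(): clear PG_private, drop the ref, return the data
 * (NULL when nothing was attached). */
static void *detach_page_private_model(struct page_model *page)
{
	void *data;

	if (!page->has_private)
		return NULL;
	data = (void *)page->private;
	page->has_private = 0;
	page->private = 0;
	page->refcount--;
	return data;
}
```

Folding the three steps into one helper is what lets call sites like md-bitmap, buffer.c, and btrfs shrink to a single line in the diffs above.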
+1 -2
drivers/media/common/videobuf2/videobuf2-dma-sg.c
··· 309 309 if (buf->db_attach) 310 310 buf->vaddr = dma_buf_vmap(buf->db_attach->dmabuf); 311 311 else 312 - buf->vaddr = vm_map_ram(buf->pages, 313 - buf->num_pages, -1, PAGE_KERNEL); 312 + buf->vaddr = vm_map_ram(buf->pages, buf->num_pages, -1); 314 313 } 315 314 316 315 /* add offset in case userptr is not page-aligned */
+1 -2
drivers/media/common/videobuf2/videobuf2-vmalloc.c
··· 107 107 buf->vaddr = (__force void *) 108 108 ioremap(__pfn_to_phys(nums[0]), size + offset); 109 109 } else { 110 - buf->vaddr = vm_map_ram(frame_vector_pages(vec), n_pages, -1, 111 - PAGE_KERNEL); 110 + buf->vaddr = vm_map_ram(frame_vector_pages(vec), n_pages, -1); 112 111 } 113 112 114 113 if (!buf->vaddr)
+6 -13
drivers/media/pci/ivtv/ivtv-udma.c
··· 92 92 { 93 93 struct ivtv_dma_page_info user_dma; 94 94 struct ivtv_user_dma *dma = &itv->udma; 95 - int i, err; 95 + int err; 96 96 97 97 IVTV_DEBUG_DMA("ivtv_udma_setup, dst: 0x%08x\n", (unsigned int)ivtv_dest_addr); 98 98 ··· 111 111 return -EINVAL; 112 112 } 113 113 114 - /* Get user pages for DMA Xfer */ 115 - err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, 114 + /* Pin user pages for DMA Xfer */ 115 + err = pin_user_pages_unlocked(user_dma.uaddr, user_dma.page_count, 116 116 dma->map, FOLL_FORCE); 117 117 118 118 if (user_dma.page_count != err) { 119 119 IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n", 120 120 err, user_dma.page_count); 121 121 if (err >= 0) { 122 - for (i = 0; i < err; i++) 123 - put_page(dma->map[i]); 122 + unpin_user_pages(dma->map, err); 124 123 return -EINVAL; 125 124 } 126 125 return err; ··· 129 130 130 131 /* Fill SG List with new values */ 131 132 if (ivtv_udma_fill_sg_list(dma, &user_dma, 0) < 0) { 132 - for (i = 0; i < dma->page_count; i++) { 133 - put_page(dma->map[i]); 134 - } 133 + unpin_user_pages(dma->map, dma->page_count); 135 134 dma->page_count = 0; 136 135 return -ENOMEM; 137 136 } ··· 150 153 void ivtv_udma_unmap(struct ivtv *itv) 151 154 { 152 155 struct ivtv_user_dma *dma = &itv->udma; 153 - int i; 154 156 155 157 IVTV_DEBUG_INFO("ivtv_unmap_user_dma\n"); 156 158 ··· 165 169 /* sync DMA */ 166 170 ivtv_udma_sync_for_cpu(itv); 167 171 168 - /* Release User Pages */ 169 - for (i = 0; i < dma->page_count; i++) { 170 - put_page(dma->map[i]); 171 - } 172 + unpin_user_pages(dma->map, dma->page_count); 172 173 dma->page_count = 0; 173 174 } 174 175
+6 -11
drivers/media/pci/ivtv/ivtv-yuv.c
··· 30 30 struct yuv_playback_info *yi = &itv->yuv_info; 31 31 u8 frame = yi->draw_frame; 32 32 struct yuv_frame_info *f = &yi->new_frame_info[frame]; 33 - int i; 34 33 int y_pages, uv_pages; 35 34 unsigned long y_buffer_offset, uv_buffer_offset; 36 35 int y_decode_height, uv_decode_height, y_size; ··· 61 62 ivtv_udma_get_page_info (&y_dma, (unsigned long)args->y_source, 720 * y_decode_height); 62 63 ivtv_udma_get_page_info (&uv_dma, (unsigned long)args->uv_source, 360 * uv_decode_height); 63 64 64 - /* Get user pages for DMA Xfer */ 65 - y_pages = get_user_pages_unlocked(y_dma.uaddr, 65 + /* Pin user pages for DMA Xfer */ 66 + y_pages = pin_user_pages_unlocked(y_dma.uaddr, 66 67 y_dma.page_count, &dma->map[0], FOLL_FORCE); 67 68 uv_pages = 0; /* silence gcc. value is set and consumed only if: */ 68 69 if (y_pages == y_dma.page_count) { 69 - uv_pages = get_user_pages_unlocked(uv_dma.uaddr, 70 + uv_pages = pin_user_pages_unlocked(uv_dma.uaddr, 70 71 uv_dma.page_count, &dma->map[y_pages], 71 72 FOLL_FORCE); 72 73 } ··· 80 81 uv_pages, uv_dma.page_count); 81 82 82 83 if (uv_pages >= 0) { 83 - for (i = 0; i < uv_pages; i++) 84 - put_page(dma->map[y_pages + i]); 84 + unpin_user_pages(&dma->map[y_pages], uv_pages); 85 85 rc = -EFAULT; 86 86 } else { 87 87 rc = uv_pages; ··· 91 93 y_pages, y_dma.page_count); 92 94 } 93 95 if (y_pages >= 0) { 94 - for (i = 0; i < y_pages; i++) 95 - put_page(dma->map[i]); 96 + unpin_user_pages(dma->map, y_pages); 96 97 /* 97 98 * Inherit the -EFAULT from rc's 98 99 * initialization, but allow it to be ··· 109 112 /* Fill & map SG List */ 110 113 if (ivtv_udma_fill_sg_list (dma, &uv_dma, ivtv_udma_fill_sg_list (dma, &y_dma, 0)) < 0) { 111 114 IVTV_DEBUG_WARN("could not allocate bounce buffers for highmem userspace buffers\n"); 112 - for (i = 0; i < dma->page_count; i++) { 113 - put_page(dma->map[i]); 114 - } 115 + unpin_user_pages(dma->map, dma->page_count); 115 116 dma->page_count = 0; 116 117 return -ENOMEM; 117 118 }
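The ivtv conversions above swap `get_user_pages_unlocked()` plus per-page `put_page()` loops for `pin_user_pages_unlocked()` and the batched `unpin_user_pages()`, which releases a whole array in one call. A toy userspace model of that pin/unpin refcounting (each "page" is just a counter; names are illustrative):

```c
#include <assert.h>

/* Toy page: only the reference count matters for this sketch. */
struct page_ref {
	int count;
};

/* Models pin_user_pages_unlocked(): elevate each page's count and fill
 * the caller's map array; returns how many pages were pinned. */
static int pin_pages_model(struct page_ref **map, struct page_ref *pages,
			   int nr)
{
	for (int i = 0; i < nr; i++) {
		pages[i].count++;
		map[i] = &pages[i];
	}
	return nr;
}

/* Models unpin_user_pages(): one release per pinned page, replacing the
 * removed for-loop of put_page() calls. */
static void unpin_pages_model(struct page_ref **map, int nr)
{
	for (int i = 0; i < nr; i++)
		map[i]->count--;
}
```

Besides the shorter call sites, the kernel helpers also mark the pages as DMA-pinned (FOLL_PIN), which the toy model does not attempt to show.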
+2 -2
drivers/media/pci/ivtv/ivtvfb.c
··· 281 281 /* Map User DMA */ 282 282 if (ivtv_udma_setup(itv, ivtv_dest_addr, userbuf, size_in_bytes) <= 0) { 283 283 mutex_unlock(&itv->udma.lock); 284 - IVTVFB_WARN("ivtvfb_prep_dec_dma_to_device, Error with get_user_pages: %d bytes, %d pages returned\n", 284 + IVTVFB_WARN("ivtvfb_prep_dec_dma_to_device, Error with pin_user_pages: %d bytes, %d pages returned\n", 285 285 size_in_bytes, itv->udma.page_count); 286 286 287 - /* get_user_pages must have failed completely */ 287 + /* pin_user_pages must have failed completely */ 288 288 return -EIO; 289 289 } 290 290
+2 -2
drivers/mtd/ubi/io.c
··· 1297 1297 if (!ubi_dbg_chk_io(ubi)) 1298 1298 return 0; 1299 1299 1300 - buf1 = __vmalloc(len, GFP_NOFS, PAGE_KERNEL); 1300 + buf1 = __vmalloc(len, GFP_NOFS); 1301 1301 if (!buf1) { 1302 1302 ubi_err(ubi, "cannot allocate memory to check writes"); 1303 1303 return 0; ··· 1361 1361 if (!ubi_dbg_chk_io(ubi)) 1362 1362 return 0; 1363 1363 1364 - buf = __vmalloc(len, GFP_NOFS, PAGE_KERNEL); 1364 + buf = __vmalloc(len, GFP_NOFS); 1365 1365 if (!buf) { 1366 1366 ubi_err(ubi, "cannot allocate memory to check for 0xFFs"); 1367 1367 return 0;
+16 -29
drivers/pcmcia/electra_cf.c
··· 178 178 struct device_node *np = ofdev->dev.of_node; 179 179 struct electra_cf_socket *cf; 180 180 struct resource mem, io; 181 - int status; 181 + int status = -ENOMEM; 182 182 const unsigned int *prop; 183 183 int err; 184 - struct vm_struct *area; 185 184 186 185 err = of_address_to_resource(np, 0, &mem); 187 186 if (err) ··· 201 202 cf->mem_phys = mem.start; 202 203 cf->mem_size = PAGE_ALIGN(resource_size(&mem)); 203 204 cf->mem_base = ioremap(cf->mem_phys, cf->mem_size); 205 + if (!cf->mem_base) 206 + goto out_free_cf; 204 207 cf->io_size = PAGE_ALIGN(resource_size(&io)); 205 - 206 - area = __get_vm_area(cf->io_size, 0, PHB_IO_BASE, PHB_IO_END); 207 - if (area == NULL) { 208 - status = -ENOMEM; 209 - goto fail1; 210 - } 211 - 212 - cf->io_virt = (void __iomem *)(area->addr); 208 + cf->io_virt = ioremap_phb(io.start, cf->io_size); 209 + if (!cf->io_virt) 210 + goto out_unmap_mem; 213 211 214 212 cf->gpio_base = ioremap(0xfc103000, 0x1000); 213 + if (!cf->gpio_base) 214 + goto out_unmap_virt; 215 215 dev_set_drvdata(device, cf); 216 216 217 - if (!cf->mem_base || !cf->io_virt || !cf->gpio_base || 218 - (__ioremap_at(io.start, cf->io_virt, cf->io_size, 219 - pgprot_noncached(PAGE_KERNEL)) == NULL)) { 220 - dev_err(device, "can't ioremap ranges\n"); 221 - status = -ENOMEM; 222 - goto fail1; 223 - } 224 - 225 - 226 217 cf->io_base = (unsigned long)cf->io_virt - VMALLOC_END; 227 - 228 218 cf->iomem.start = (unsigned long)cf->mem_base; 229 219 cf->iomem.end = (unsigned long)cf->mem_base + (mem.end - mem.start); 230 220 cf->iomem.flags = IORESOURCE_MEM; ··· 293 305 if (cf->irq) 294 306 free_irq(cf->irq, cf); 295 307 296 - if (cf->io_virt) 297 - __iounmap_at(cf->io_virt, cf->io_size); 298 - if (cf->mem_base) 299 - iounmap(cf->mem_base); 300 - if (cf->gpio_base) 301 - iounmap(cf->gpio_base); 302 - if (area) 303 - device_init_wakeup(&ofdev->dev, 0); 308 - iounmap(cf->gpio_base); 309 + out_unmap_virt: 310 + device_init_wakeup(&ofdev->dev, 0); 311 + iounmap(cf->io_virt); 312 + out_unmap_mem: 313 + iounmap(cf->mem_base); 314 + out_free_cf: 304 315 kfree(cf); 305 316 return status; 306 317 ··· 317 330 free_irq(cf->irq, cf); 318 331 del_timer_sync(&cf->timer); 319 332 320 - __iounmap_at(cf->io_virt, cf->io_size); 333 + iounmap(cf->io_virt); 321 334 iounmap(cf->mem_base); 322 335 iounmap(cf->gpio_base); 323 336 release_mem_region(cf->mem_phys, cf->mem_size);
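The electra_cf probe rework above replaces the old "check everything at the end" error handling with the kernel's staged goto-unwind idiom: each mapping is checked as it is made, and a failure jumps to a label that releases only what was already acquired. A userspace sketch of that control flow, with `malloc()`/`free()` standing in for `ioremap()`/`iounmap()` and a `fail_at` parameter (illustrative) to simulate failure at each stage:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the staged goto-unwind idiom adopted by the probe function
 * above. fail_at selects which acquisition step to simulate failing;
 * pass a negative value for the success path. */
static int probe_model(int fail_at)
{
	void *mem, *io, *gpio;
	int status = -12;	/* -ENOMEM */

	mem = (fail_at == 0) ? NULL : malloc(16);
	if (!mem)
		goto out;
	io = (fail_at == 1) ? NULL : malloc(16);
	if (!io)
		goto out_unmap_mem;
	gpio = (fail_at == 2) ? NULL : malloc(16);
	if (!gpio)
		goto out_unmap_io;

	/* success path: the device would be registered here */
	free(gpio);
	free(io);
	free(mem);
	return 0;

out_unmap_io:
	free(io);		/* unwind in reverse acquisition order */
out_unmap_mem:
	free(mem);
out:
	return status;
}
```

Because each label falls through to the next, every failure point releases exactly the resources acquired before it, with no `if (ptr)` guards needed.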
+1 -2
drivers/scsi/sd_zbc.c
··· 136 136 137 137 while (bufsize >= SECTOR_SIZE) { 138 138 buf = __vmalloc(bufsize, 139 - GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY, 140 - PAGE_KERNEL); 139 + GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY); 141 140 if (buf) { 142 141 *buflen = bufsize; 143 142 return buf;
+2 -2
drivers/staging/android/ion/ion_heap.c
··· 99 99 100 100 static int ion_heap_clear_pages(struct page **pages, int num, pgprot_t pgprot) 101 101 { 102 - void *addr = vm_map_ram(pages, num, -1, pgprot); 102 + void *addr = vmap(pages, num, VM_MAP, pgprot); 103 103 104 104 if (!addr) 105 105 return -ENOMEM; 106 106 memset(addr, 0, PAGE_SIZE * num); 107 - vm_unmap_ram(addr, num); 107 + vunmap(addr); 108 108 109 109 return 0; 110 110 }
+1 -3
drivers/staging/media/ipu3/ipu3-css-pool.h
··· 15 15 * @size: size of the buffer in bytes. 16 16 * @vaddr: kernel virtual address. 17 17 * @daddr: iova dma address to access IPU3. 18 - * @vma: private, a pointer to &struct vm_struct, 19 - * used for imgu_dmamap_free. 20 18 */ 21 19 struct imgu_css_map { 22 20 size_t size; 23 21 void *vaddr; 24 22 dma_addr_t daddr; 25 - struct vm_struct *vma; 23 + struct page **pages; 26 24 }; 27 25 28 26 /**
+8 -22
drivers/staging/media/ipu3/ipu3-dmamap.c
··· 96 96 unsigned long shift = iova_shift(&imgu->iova_domain); 97 97 struct device *dev = &imgu->pci_dev->dev; 98 98 size_t size = PAGE_ALIGN(len); 99 + int count = size >> PAGE_SHIFT; 99 100 struct page **pages; 100 101 dma_addr_t iovaddr; 101 102 struct iova *iova; ··· 115 114 116 115 /* Call IOMMU driver to setup pgt */ 117 116 iovaddr = iova_dma_addr(&imgu->iova_domain, iova); 118 - for (i = 0; i < size / PAGE_SIZE; ++i) { 117 + for (i = 0; i < count; ++i) { 119 118 rval = imgu_mmu_map(imgu->mmu, iovaddr, 120 119 page_to_phys(pages[i]), PAGE_SIZE); 121 120 if (rval) ··· 124 123 iovaddr += PAGE_SIZE; 125 124 } 126 125 127 - /* Now grab a virtual region */ 128 - map->vma = __get_vm_area(size, VM_USERMAP, VMALLOC_START, VMALLOC_END); 129 - if (!map->vma) 126 + map->vaddr = vmap(pages, count, VM_USERMAP, PAGE_KERNEL); 127 + if (!map->vaddr) 130 128 goto out_unmap; 131 129 132 - map->vma->pages = pages; 133 - /* And map it in KVA */ 134 - if (map_vm_area(map->vma, PAGE_KERNEL, pages)) 135 - goto out_vunmap; 136 - 130 + map->pages = pages; 137 131 map->size = size; 138 132 map->daddr = iova_dma_addr(&imgu->iova_domain, iova); 139 - map->vaddr = map->vma->addr; 140 133 141 134 dev_dbg(dev, "%s: allocated %zu @ IOVA %pad @ VA %p\n", __func__, 142 - size, &map->daddr, map->vma->addr); 135 + size, &map->daddr, map->vaddr); 143 136 144 - return map->vma->addr; 145 - 146 - out_vunmap: 147 - vunmap(map->vma->addr); 137 + return map->vaddr; 148 138 149 139 out_unmap: 150 140 imgu_dmamap_free_buffer(pages, size); 151 141 imgu_mmu_unmap(imgu->mmu, iova_dma_addr(&imgu->iova_domain, iova), 152 142 i * PAGE_SIZE); 153 - map->vma = NULL; 154 143 155 144 out_free_iova: 156 145 __free_iova(&imgu->iova_domain, iova); ··· 168 177 */ 169 178 void imgu_dmamap_free(struct imgu_device *imgu, struct imgu_css_map *map) 170 179 { 171 - struct vm_struct *area = map->vma; 172 - 173 180 dev_dbg(&imgu->pci_dev->dev, "%s: freeing %zu @ IOVA %pad @ VA %p\n", 174 181 __func__, map->size, &map->daddr, map->vaddr); 175 182 ··· 176 187 177 188 imgu_dmamap_unmap(imgu, map); 178 189 179 - if (WARN_ON(!area) || WARN_ON(!area->pages)) 180 - return; 181 - 182 - imgu_dmamap_free_buffer(area->pages, map->size); 183 190 vunmap(map->vaddr); 191 + imgu_dmamap_free_buffer(map->pages, map->size); 184 192 map->vaddr = NULL; 185 193 } 186 194
+3 -4
fs/block_dev.c
··· 614 614 return block_read_full_page(page, blkdev_get_block); 615 615 } 616 616 617 - static int blkdev_readpages(struct file *file, struct address_space *mapping, 618 - struct list_head *pages, unsigned nr_pages) 617 + static void blkdev_readahead(struct readahead_control *rac) 619 618 { 620 - return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block); 619 + mpage_readahead(rac, blkdev_get_block); 621 620 } 622 621 623 622 static int blkdev_write_begin(struct file *file, struct address_space *mapping, ··· 2084 2085 2085 2086 static const struct address_space_operations def_blk_aops = { 2086 2087 .readpage = blkdev_readpage, 2087 - .readpages = blkdev_readpages, 2088 + .readahead = blkdev_readahead, 2088 2089 .writepage = blkdev_writepage, 2089 2090 .write_begin = blkdev_write_begin, 2090 2091 .write_end = blkdev_write_end,
+1 -3
fs/btrfs/disk-io.c
··· 980 980 btrfs_warn(BTRFS_I(page->mapping->host)->root->fs_info, 981 981 "page private not zero on page %llu", 982 982 (unsigned long long)page_offset(page)); 983 - ClearPagePrivate(page); 984 - set_page_private(page, 0); 985 - put_page(page); 983 + detach_page_private(page); 986 984 } 987 985 } 988 986
+18 -46
fs/btrfs/extent_io.c
··· 3076 3076 static void attach_extent_buffer_page(struct extent_buffer *eb, 3077 3077 struct page *page) 3078 3078 { 3079 - if (!PagePrivate(page)) { 3080 - SetPagePrivate(page); 3081 - get_page(page); 3082 - set_page_private(page, (unsigned long)eb); 3083 - } else { 3079 + if (!PagePrivate(page)) 3080 + attach_page_private(page, eb); 3081 + else 3084 3082 WARN_ON(page->private != (unsigned long)eb); 3085 - } 3086 3083 } 3087 3084 3088 3085 void set_page_extent_mapped(struct page *page) 3089 3086 { 3090 - if (!PagePrivate(page)) { 3091 - SetPagePrivate(page); 3092 - get_page(page); 3093 - set_page_private(page, EXTENT_PAGE_PRIVATE); 3094 - } 3087 + if (!PagePrivate(page)) 3088 + attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); 3095 3089 } 3096 3090 3097 3091 static struct extent_map * ··· 4361 4367 return ret; 4362 4368 } 4363 4369 4364 - int extent_readpages(struct address_space *mapping, struct list_head *pages, 4365 - unsigned nr_pages) 4370 + void extent_readahead(struct readahead_control *rac) 4366 4371 { 4367 4372 struct bio *bio = NULL; 4368 4373 unsigned long bio_flags = 0; 4369 4374 struct page *pagepool[16]; 4370 4375 struct extent_map *em_cached = NULL; 4371 - int nr = 0; 4372 4376 u64 prev_em_start = (u64)-1; 4377 + int nr; 4373 4378 4374 - while (!list_empty(pages)) { 4375 - u64 contig_end = 0; 4379 + while ((nr = readahead_page_batch(rac, pagepool))) { 4380 + u64 contig_start = page_offset(pagepool[0]); 4381 + u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1; 4376 4382 4377 - for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) { 4378 - struct page *page = lru_to_page(pages); 4383 + ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); 4379 4384 4380 - prefetchw(&page->flags); 4381 - list_del(&page->lru); 4382 - if (add_to_page_cache_lru(page, mapping, page->index, 4383 - readahead_gfp_mask(mapping))) { 4384 - put_page(page); 4385 - break; 4386 - } 4387 - 4388 - pagepool[nr++] = page; 4389 - contig_end = page_offset(page) + PAGE_SIZE - 1; 4390 - } 4391 - 4392 - if (nr) { 4393 - u64 contig_start = page_offset(pagepool[0]); 4394 - 4395 - ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); 4396 - 4397 - contiguous_readpages(pagepool, nr, contig_start, 4398 - contig_end, &em_cached, &bio, &bio_flags, 4399 - &prev_em_start); 4400 - } 4385 + contiguous_readpages(pagepool, nr, contig_start, contig_end, 4386 + &em_cached, &bio, &bio_flags, &prev_em_start); 4401 4387 } 4402 4388 4403 4389 if (em_cached) 4404 4390 free_extent_map(em_cached); 4405 4391 4406 - if (bio) 4407 - return submit_one_bio(bio, 0, bio_flags); 4408 - return 0; 4392 + if (bio) { 4393 + if (submit_one_bio(bio, 0, bio_flags)) 4394 + return; 4395 + } 4409 4396 } 4410 4397 4411 4398 /* ··· 4904 4929 * We need to make sure we haven't be attached 4905 4930 * to a new eb. 4906 4931 */ 4907 - ClearPagePrivate(page); 4908 - set_page_private(page, 0); 4909 - /* One for the page private */ 4910 - put_page(page); 4932 + detach_page_private(page); 4911 4933 } 4912 4934 4913 4935 if (mapped)
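The btrfs conversion above pulls pages in batches of up to 16 via `readahead_page_batch()`, instead of popping them one at a time off a list. A sketch of that batch cursor in userspace; a real `readahead_control` hands out locked page-cache pages, whereas this model only tracks indices (all names with `_model` are illustrative):

```c
#include <assert.h>

/* Cursor over a readahead window: next index to hand out, and how many
 * pages remain. */
struct rac_model {
	unsigned long index;
	unsigned long nr_pages;
};

#define POOL_SIZE 16

/* Models readahead_page_batch(): fill the pool with up to POOL_SIZE
 * consecutive entries and return how many were produced; 0 means the
 * window is exhausted, ending the caller's while loop. */
static int readahead_batch_model(struct rac_model *rac, unsigned long *pool)
{
	int nr = 0;

	while (nr < POOL_SIZE && rac->nr_pages) {
		pool[nr++] = rac->index++;
		rac->nr_pages--;
	}
	return nr;
}
```

Because the batch is always index-contiguous, the caller can compute `contig_start`/`contig_end` from the first and last pool entries, which is exactly what the new `extent_readahead()` does.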
+1 -2
fs/btrfs/extent_io.h
··· 198 198 struct writeback_control *wbc); 199 199 int btree_write_cache_pages(struct address_space *mapping, 200 200 struct writeback_control *wbc); 201 - int extent_readpages(struct address_space *mapping, struct list_head *pages, 202 - unsigned nr_pages); 201 + void extent_readahead(struct readahead_control *rac); 203 202 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, 204 203 __u64 start, __u64 len); 205 204 void set_page_extent_mapped(struct page *page);
+12 -27
fs/btrfs/inode.c
··· 4856 4856 4857 4857 /* 4858 4858 * Keep looping until we have no more ranges in the io tree. 4859 - * We can have ongoing bios started by readpages (called from readahead) 4860 - * that have their endio callback (extent_io.c:end_bio_extent_readpage) 4859 + * We can have ongoing bios started by readahead that have 4860 + * their endio callback (extent_io.c:end_bio_extent_readpage) 4861 4861 * still in progress (unlocked the pages in the bio but did not yet 4862 4862 * unlocked the ranges in the io tree). Therefore this means some 4863 4863 * ranges can still be locked and eviction started because before ··· 7050 7050 * for it to complete) and then invalidate the pages for 7051 7051 * this range (through invalidate_inode_pages2_range()), 7052 7052 * but that can lead us to a deadlock with a concurrent 7053 - * call to readpages() (a buffered read or a defrag call 7053 + * call to readahead (a buffered read or a defrag call 7054 7054 * triggered a readahead) on a page lock due to an 7055 7055 * ordered dio extent we created before but did not have 7056 7056 * yet a corresponding bio submitted (whence it can not 7057 - * complete), which makes readpages() wait for that 7057 + * complete), which makes readahead wait for that 7058 7058 * ordered extent to complete while holding a lock on 7059 7059 * that page. 7060 7060 */ ··· 8293 8293 return extent_writepages(mapping, wbc); 8294 8294 } 8295 8295 8296 - static int 8297 - btrfs_readpages(struct file *file, struct address_space *mapping, 8298 - struct list_head *pages, unsigned nr_pages) 8296 + static void btrfs_readahead(struct readahead_control *rac) 8299 8297 { 8300 - return extent_readpages(mapping, pages, nr_pages); 8298 + extent_readahead(rac); 8301 8299 } 8302 8300 8303 8301 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags) 8304 8302 { 8305 8303 int ret = try_release_extent_mapping(page, gfp_flags); 8306 - if (ret == 1) { 8307 - ClearPagePrivate(page); 8308 - set_page_private(page, 0); 8309 - put_page(page); 8310 - } 8304 + if (ret == 1) 8305 + detach_page_private(page); 8311 8306 return ret; 8312 8307 } 8313 8308 ··· 8324 8329 if (ret != MIGRATEPAGE_SUCCESS) 8325 8330 return ret; 8326 8331 8327 - if (page_has_private(page)) { 8328 - ClearPagePrivate(page); 8329 - get_page(newpage); 8330 - set_page_private(newpage, page_private(page)); 8331 - set_page_private(page, 0); 8332 - put_page(page); 8333 - SetPagePrivate(newpage); 8334 - } 8332 + if (page_has_private(page)) 8333 + attach_page_private(newpage, detach_page_private(page)); 8335 8334 8336 8335 if (PagePrivate2(page)) { 8337 8336 ClearPagePrivate2(page); ··· 8447 8458 } 8448 8459 8449 8460 ClearPageChecked(page); 8450 - if (PagePrivate(page)) { 8451 - ClearPagePrivate(page); 8452 - set_page_private(page, 0); 8453 - put_page(page); 8454 - } 8461 + detach_page_private(page); 8455 8462 } 8456 8463 8457 8464 /* ··· 10538 10553 .readpage = btrfs_readpage, 10539 10554 .writepage = btrfs_writepage, 10540 10555 .writepages = btrfs_writepages, 10541 - .readpages = btrfs_readpages, 10556 + .readahead = btrfs_readahead, 10542 10557 .direct_IO = btrfs_direct_IO, 10543 10558 .invalidatepage = btrfs_invalidatepage, 10544 10559 .releasepage = btrfs_releasepage,
+11 -12
fs/buffer.c
··· 123 123 } 124 124 EXPORT_SYMBOL(__wait_on_buffer); 125 125 126 - static void 127 - __clear_page_buffers(struct page *page) 128 - { 129 - ClearPagePrivate(page); 130 - set_page_private(page, 0); 131 - put_page(page); 132 - } 133 - 134 126 static void buffer_io_error(struct buffer_head *bh, char *msg) 135 127 { 136 128 if (!test_bit(BH_Quiet, &bh->b_state)) ··· 898 906 bh = bh->b_this_page; 899 907 } while (bh); 900 908 tail->b_this_page = head; 901 - attach_page_buffers(page, head); 909 + attach_page_private(page, head); 902 910 } 903 911 904 912 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size) ··· 1146 1154 1147 1155 void mark_buffer_write_io_error(struct buffer_head *bh) 1148 1156 { 1157 + struct super_block *sb; 1158 + 1149 1159 set_buffer_write_io_error(bh); 1150 1160 /* FIXME: do we need to set this in both places? */ 1151 1161 if (bh->b_page && bh->b_page->mapping) 1152 1162 mapping_set_error(bh->b_page->mapping, -EIO); 1153 1163 if (bh->b_assoc_map) 1154 1164 mapping_set_error(bh->b_assoc_map, -EIO); 1165 + rcu_read_lock(); 1166 + sb = READ_ONCE(bh->b_bdev->bd_super); 1167 + if (sb) 1168 + errseq_set(&sb->s_wb_err, -EIO); 1169 + rcu_read_unlock(); 1155 1170 } 1156 1171 EXPORT_SYMBOL(mark_buffer_write_io_error); 1157 1172 ··· 1579 1580 bh = bh->b_this_page; 1580 1581 } while (bh != head); 1581 1582 } 1582 - attach_page_buffers(page, head); 1583 + attach_page_private(page, head); 1583 1584 spin_unlock(&page->mapping->private_lock); 1584 1585 } 1585 1586 EXPORT_SYMBOL(create_empty_buffers); ··· 2566 2567 bh->b_this_page = head; 2567 2568 bh = bh->b_this_page; 2568 2569 } while (bh != head); 2569 - attach_page_buffers(page, head); 2570 + attach_page_private(page, head); 2570 2571 spin_unlock(&page->mapping->private_lock); 2571 2572 } 2572 2573 ··· 3226 3227 bh = next; 3227 3228 } while (bh != head); 3228 3229 *buffers_to_free = head; 3229 - __clear_page_buffers(page); 3230 + detach_page_private(page); 3230 3231 return 1; 3231 3232 failed: 3232 3233 return 0;
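The buffer.c hunk above starts recording write errors in the superblock's `s_wb_err` via `errseq_set()`, pairing with the `f_sb_err` sample taken at open time elsewhere in this series. A rough userspace model of that sample-and-check pattern; a real `errseq_t` packs an errno and a sequence counter into one word with "seen" tracking, while this sketch (all `_model` names illustrative) uses a plain counter:

```c
#include <assert.h>

/* Simplified errseq: a sequence number that bumps on every new error,
 * plus the most recent error code. */
struct errseq_model {
	int seq;
	int err;
};

/* Models errseq_set(): record a new error event. */
static void errseq_set_model(struct errseq_model *e, int err)
{
	e->err = err;
	e->seq++;
}

/* Models the sample taken at open time (file->f_sb_err). */
static int errseq_sample_model(const struct errseq_model *e)
{
	return e->seq;
}

/* Models the later check (e.g. at syncfs): report the error only if it
 * happened after the caller's sample. */
static int errseq_check_model(const struct errseq_model *e, int since)
{
	return e->seq > since ? e->err : 0;
}
```

The point of the scheme is that each opener sees only errors newer than its own sample, so one writeback failure is reported once per file description rather than once globally.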
+14 -25
fs/erofs/data.c
··· 280 280 return 0; 281 281 } 282 282 283 - static int erofs_raw_access_readpages(struct file *filp, 284 - struct address_space *mapping, 285 - struct list_head *pages, 286 - unsigned int nr_pages) 283 + static void erofs_raw_access_readahead(struct readahead_control *rac) 287 284 { 288 285 erofs_off_t last_block; 289 286 struct bio *bio = NULL; 290 - gfp_t gfp = readahead_gfp_mask(mapping); 291 - struct page *page = list_last_entry(pages, struct page, lru); 287 + struct page *page; 292 288 293 - trace_erofs_readpages(mapping->host, page, nr_pages, true); 289 + trace_erofs_readpages(rac->mapping->host, readahead_index(rac), 290 + readahead_count(rac), true); 294 291 295 - for (; nr_pages; --nr_pages) { 296 - page = list_entry(pages->prev, struct page, lru); 297 - 292 + while ((page = readahead_page(rac))) { 298 293 prefetchw(&page->flags); 299 - list_del(&page->lru); 300 294 301 - if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) { 302 - bio = erofs_read_raw_page(bio, mapping, page, 303 - &last_block, nr_pages, true); 295 + bio = erofs_read_raw_page(bio, rac->mapping, page, &last_block, 296 + readahead_count(rac), true); 304 297 305 - /* all the page errors are ignored when readahead */ 306 - if (IS_ERR(bio)) { 307 - pr_err("%s, readahead error at page %lu of nid %llu\n", 308 - __func__, page->index, 309 - EROFS_I(mapping->host)->nid); 298 + /* all the page errors are ignored when readahead */ 299 + if (IS_ERR(bio)) { 300 + pr_err("%s, readahead error at page %lu of nid %llu\n", 301 + __func__, page->index, 302 + EROFS_I(rac->mapping->host)->nid); 310 303 311 - bio = NULL; 312 - } 304 + bio = NULL; 313 305 } 314 306 315 - /* pages could still be locked */ 316 307 put_page(page); 317 308 } 318 - DBG_BUGON(!list_empty(pages)); 319 309 320 310 /* the rare case (end in gaps) */ 321 311 if (bio) 322 312 submit_bio(bio); 323 - return 0; 324 313 } 325 314 326 315 static int erofs_get_block(struct inode *inode, sector_t iblock, ··· 347 358 /* for uncompressed (aligned) files and raw access for other files */ 348 359 const struct address_space_operations erofs_raw_access_aops = { 349 360 .readpage = erofs_raw_access_readpage, 350 - .readpages = erofs_raw_access_readpages, 361 + .readahead = erofs_raw_access_readahead, 351 362 .bmap = erofs_bmap, 352 363 };
+1 -1
fs/erofs/decompressor.c
··· 274 274 275 275 i = 0; 276 276 while (1) { 277 - dst = vm_map_ram(rq->out, nrpages_out, -1, PAGE_KERNEL); 277 + dst = vm_map_ram(rq->out, nrpages_out, -1); 278 278 279 279 /* retry two more times (totally 3 times) */ 280 280 if (dst || ++i >= 3)
+9 -20
fs/erofs/zdata.c
··· 1305 1305 return nr <= sbi->max_sync_decompress_pages; 1306 1306 } 1307 1307 1308 - static int z_erofs_readpages(struct file *filp, struct address_space *mapping, 1309 - struct list_head *pages, unsigned int nr_pages) 1308 + static void z_erofs_readahead(struct readahead_control *rac) 1310 1309 { 1311 - struct inode *const inode = mapping->host; 1310 + struct inode *const inode = rac->mapping->host; 1312 1311 struct erofs_sb_info *const sbi = EROFS_I_SB(inode); 1313 1312 1314 - bool sync = should_decompress_synchronously(sbi, nr_pages); 1313 + bool sync = should_decompress_synchronously(sbi, readahead_count(rac)); 1315 1314 struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode); 1316 - gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL); 1317 - struct page *head = NULL; 1315 + struct page *page, *head = NULL; 1318 1316 LIST_HEAD(pagepool); 1319 1317 1320 - trace_erofs_readpages(mapping->host, lru_to_page(pages), 1321 - nr_pages, false); 1318 + trace_erofs_readpages(inode, readahead_index(rac), 1319 + readahead_count(rac), false); 1322 1320 1323 - f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT; 1321 + f.headoffset = readahead_pos(rac); 1324 1322 1325 - for (; nr_pages; --nr_pages) { 1326 - struct page *page = lru_to_page(pages); 1327 - 1323 + while ((page = readahead_page(rac))) { 1328 1324 prefetchw(&page->flags); 1329 - list_del(&page->lru); 1330 1325 1331 1326 /* 1332 1327 * A pure asynchronous readahead is indicated if ··· 1329 1334 * Let's also do asynchronous decompression for this case. 
1330 1335 */ 1331 1336 sync &= !(PageReadahead(page) && !head); 1332 - 1333 - if (add_to_page_cache_lru(page, mapping, page->index, gfp)) { 1334 - list_add(&page->lru, &pagepool); 1335 - continue; 1336 - } 1337 1337 1338 1338 set_page_private(page, (unsigned long)head); 1339 1339 head = page; ··· 1358 1368 1359 1369 /* clean up the remaining free pages */ 1360 1370 put_pages_list(&pagepool); 1361 - return 0; 1362 1371 } 1363 1372 1364 1373 const struct address_space_operations z_erofs_aops = { 1365 1374 .readpage = z_erofs_readpage, 1366 - .readpages = z_erofs_readpages, 1375 + .readahead = z_erofs_readahead, 1367 1376 }; 1368 1377
+3 -4
fs/exfat/inode.c
··· 372 372 return mpage_readpage(page, exfat_get_block); 373 373 } 374 374 375 - static int exfat_readpages(struct file *file, struct address_space *mapping, 376 - struct list_head *pages, unsigned int nr_pages) 375 + static void exfat_readahead(struct readahead_control *rac) 377 376 { 378 - return mpage_readpages(mapping, pages, nr_pages, exfat_get_block); 377 + mpage_readahead(rac, exfat_get_block); 379 378 } 380 379 381 380 static int exfat_writepage(struct page *page, struct writeback_control *wbc) ··· 501 502 502 503 static const struct address_space_operations exfat_aops = { 503 504 .readpage = exfat_readpage, 504 - .readpages = exfat_readpages, 505 + .readahead = exfat_readahead, 505 506 .writepage = exfat_writepage, 506 507 .writepages = exfat_writepages, 507 508 .write_begin = exfat_write_begin,
+4 -6
fs/ext2/inode.c
··· 877 877 return mpage_readpage(page, ext2_get_block); 878 878 } 879 879 880 - static int 881 - ext2_readpages(struct file *file, struct address_space *mapping, 882 - struct list_head *pages, unsigned nr_pages) 880 + static void ext2_readahead(struct readahead_control *rac) 883 881 { 884 - return mpage_readpages(mapping, pages, nr_pages, ext2_get_block); 882 + mpage_readahead(rac, ext2_get_block); 885 883 } 886 884 887 885 static int ··· 965 967 966 968 const struct address_space_operations ext2_aops = { 967 969 .readpage = ext2_readpage, 968 - .readpages = ext2_readpages, 970 + .readahead = ext2_readahead, 969 971 .writepage = ext2_writepage, 970 972 .write_begin = ext2_write_begin, 971 973 .write_end = ext2_write_end, ··· 979 981 980 982 const struct address_space_operations ext2_nobh_aops = { 981 983 .readpage = ext2_readpage, 982 - .readpages = ext2_readpages, 984 + .readahead = ext2_readahead, 983 985 .writepage = ext2_nobh_writepage, 984 986 .write_begin = ext2_nobh_write_begin, 985 987 .write_end = nobh_write_end,
+2 -3
fs/ext4/ext4.h
··· 3317 3317 } 3318 3318 3319 3319 /* readpages.c */ 3320 - extern int ext4_mpage_readpages(struct address_space *mapping, 3321 - struct list_head *pages, struct page *page, 3322 - unsigned nr_pages, bool is_readahead); 3320 + extern int ext4_mpage_readpages(struct inode *inode, 3321 + struct readahead_control *rac, struct page *page); 3323 3322 extern int __init ext4_init_post_read_processing(void); 3324 3323 extern void ext4_exit_post_read_processing(void); 3325 3324
+9 -12
fs/ext4/inode.c
··· 3224 3224 ret = ext4_readpage_inline(inode, page); 3225 3225 3226 3226 if (ret == -EAGAIN) 3227 - return ext4_mpage_readpages(page->mapping, NULL, page, 1, 3228 - false); 3227 + return ext4_mpage_readpages(inode, NULL, page); 3229 3228 3230 3229 return ret; 3231 3230 } 3232 3231 3233 - static int 3234 - ext4_readpages(struct file *file, struct address_space *mapping, 3235 - struct list_head *pages, unsigned nr_pages) 3232 + static void ext4_readahead(struct readahead_control *rac) 3236 3233 { 3237 - struct inode *inode = mapping->host; 3234 + struct inode *inode = rac->mapping->host; 3238 3235 3239 - /* If the file has inline data, no need to do readpages. */ 3236 + /* If the file has inline data, no need to do readahead. */ 3240 3237 if (ext4_has_inline_data(inode)) 3241 - return 0; 3238 + return; 3242 3239 3243 - return ext4_mpage_readpages(mapping, pages, NULL, nr_pages, true); 3240 + ext4_mpage_readpages(inode, rac, NULL); 3244 3241 } 3245 3242 3246 3243 static void ext4_invalidatepage(struct page *page, unsigned int offset, ··· 3602 3605 3603 3606 static const struct address_space_operations ext4_aops = { 3604 3607 .readpage = ext4_readpage, 3605 - .readpages = ext4_readpages, 3608 + .readahead = ext4_readahead, 3606 3609 .writepage = ext4_writepage, 3607 3610 .writepages = ext4_writepages, 3608 3611 .write_begin = ext4_write_begin, ··· 3619 3622 3620 3623 static const struct address_space_operations ext4_journalled_aops = { 3621 3624 .readpage = ext4_readpage, 3622 - .readpages = ext4_readpages, 3625 + .readahead = ext4_readahead, 3623 3626 .writepage = ext4_writepage, 3624 3627 .writepages = ext4_writepages, 3625 3628 .write_begin = ext4_write_begin, ··· 3635 3638 3636 3639 static const struct address_space_operations ext4_da_aops = { 3637 3640 .readpage = ext4_readpage, 3638 - .readpages = ext4_readpages, 3641 + .readahead = ext4_readahead, 3639 3642 .writepage = ext4_writepage, 3640 3643 .writepages = ext4_writepages, 3641 3644 .write_begin = 
ext4_da_write_begin,
+9 -16
fs/ext4/readpage.c
··· 7 7 * 8 8 * This was originally taken from fs/mpage.c 9 9 * 10 - * The intent is the ext4_mpage_readpages() function here is intended 11 - * to replace mpage_readpages() in the general case, not just for 10 + * The ext4_mpage_readpages() function here is intended to 11 + * replace mpage_readahead() in the general case, not just for 12 12 * encrypted files. It has some limitations (see below), where it 13 13 * will fall back to read_block_full_page(), but these limitations 14 14 * should only be hit when page_size != block_size. ··· 221 221 return i_size_read(inode); 222 222 } 223 223 224 - int ext4_mpage_readpages(struct address_space *mapping, 225 - struct list_head *pages, struct page *page, 226 - unsigned nr_pages, bool is_readahead) 224 + int ext4_mpage_readpages(struct inode *inode, 225 + struct readahead_control *rac, struct page *page) 227 226 { 228 227 struct bio *bio = NULL; 229 228 sector_t last_block_in_bio = 0; 230 229 231 - struct inode *inode = mapping->host; 232 230 const unsigned blkbits = inode->i_blkbits; 233 231 const unsigned blocks_per_page = PAGE_SIZE >> blkbits; 234 232 const unsigned blocksize = 1 << blkbits; ··· 239 241 int length; 240 242 unsigned relative_block = 0; 241 243 struct ext4_map_blocks map; 244 + unsigned int nr_pages = rac ? readahead_count(rac) : 1; 242 245 243 246 map.m_pblk = 0; 244 247 map.m_lblk = 0; ··· 250 251 int fully_mapped = 1; 251 252 unsigned first_hole = blocks_per_page; 252 253 253 - if (pages) { 254 - page = lru_to_page(pages); 255 - 254 + if (rac) { 255 + page = readahead_page(rac); 256 256 prefetchw(&page->flags); 257 - list_del(&page->lru); 258 - if (add_to_page_cache_lru(page, mapping, page->index, 259 - readahead_gfp_mask(mapping))) 260 - goto next_page; 261 257 } 262 258 263 259 if (page_has_buffers(page)) ··· 375 381 bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9); 376 382 bio->bi_end_io = mpage_end_io; 377 383 bio_set_op_attrs(bio, REQ_OP_READ, 378 - is_readahead ? REQ_RAHEAD : 0); 384 + rac ? 
REQ_RAHEAD : 0); 379 385 } 380 386 381 387 length = first_hole << blkbits; ··· 400 406 else 401 407 unlock_page(page); 402 408 next_page: 403 - if (pages) 409 + if (rac) 404 410 put_page(page); 405 411 } 406 - BUG_ON(pages && !list_empty(pages)); 407 412 if (bio) 408 413 submit_bio(bio); 409 414 return 0;
+2 -33
fs/ext4/verity.c
··· 342 342 return desc_size; 343 343 } 344 344 345 - /* 346 - * Prefetch some pages from the file's Merkle tree. 347 - * 348 - * This is basically a stripped-down version of __do_page_cache_readahead() 349 - * which works on pages past i_size. 350 - */ 351 - static void ext4_merkle_tree_readahead(struct address_space *mapping, 352 - pgoff_t start_index, unsigned long count) 353 - { 354 - LIST_HEAD(pages); 355 - unsigned int nr_pages = 0; 356 - struct page *page; 357 - pgoff_t index; 358 - struct blk_plug plug; 359 - 360 - for (index = start_index; index < start_index + count; index++) { 361 - page = xa_load(&mapping->i_pages, index); 362 - if (!page || xa_is_value(page)) { 363 - page = __page_cache_alloc(readahead_gfp_mask(mapping)); 364 - if (!page) 365 - break; 366 - page->index = index; 367 - list_add(&page->lru, &pages); 368 - nr_pages++; 369 - } 370 - } 371 - blk_start_plug(&plug); 372 - ext4_mpage_readpages(mapping, &pages, NULL, nr_pages, true); 373 - blk_finish_plug(&plug); 374 - } 375 - 376 345 static struct page *ext4_read_merkle_tree_page(struct inode *inode, 377 346 pgoff_t index, 378 347 unsigned long num_ra_pages) ··· 355 386 if (page) 356 387 put_page(page); 357 388 else if (num_ra_pages > 1) 358 - ext4_merkle_tree_readahead(inode->i_mapping, index, 359 - num_ra_pages); 389 + page_cache_readahead_unbounded(inode->i_mapping, NULL, 390 + index, num_ra_pages, 0); 360 391 page = read_mapping_page(inode->i_mapping, index, NULL); 361 392 } 362 393 return page;
+20 -30
fs/f2fs/data.c
··· 2177 2177 * use ->readpage() or do the necessary surgery to decouple ->readpages() 2178 2178 * from read-ahead. 2179 2179 */ 2180 - int f2fs_mpage_readpages(struct address_space *mapping, 2181 - struct list_head *pages, struct page *page, 2182 - unsigned nr_pages, bool is_readahead) 2180 + static int f2fs_mpage_readpages(struct inode *inode, 2181 + struct readahead_control *rac, struct page *page) 2183 2182 { 2184 2183 struct bio *bio = NULL; 2185 2184 sector_t last_block_in_bio = 0; 2186 - struct inode *inode = mapping->host; 2187 2185 struct f2fs_map_blocks map; 2188 2186 #ifdef CONFIG_F2FS_FS_COMPRESSION 2189 2187 struct compress_ctx cc = { ··· 2195 2197 .nr_cpages = 0, 2196 2198 }; 2197 2199 #endif 2200 + unsigned nr_pages = rac ? readahead_count(rac) : 1; 2198 2201 unsigned max_nr_pages = nr_pages; 2199 2202 int ret = 0; 2200 2203 ··· 2209 2210 map.m_may_create = false; 2210 2211 2211 2212 for (; nr_pages; nr_pages--) { 2212 - if (pages) { 2213 - page = list_last_entry(pages, struct page, lru); 2214 - 2213 + if (rac) { 2214 + page = readahead_page(rac); 2215 2215 prefetchw(&page->flags); 2216 - list_del(&page->lru); 2217 - if (add_to_page_cache_lru(page, mapping, 2218 - page_index(page), 2219 - readahead_gfp_mask(mapping))) 2220 - goto next_page; 2221 2216 } 2222 2217 2223 2218 #ifdef CONFIG_F2FS_FS_COMPRESSION ··· 2221 2228 ret = f2fs_read_multi_pages(&cc, &bio, 2222 2229 max_nr_pages, 2223 2230 &last_block_in_bio, 2224 - is_readahead, false); 2231 + rac != NULL, false); 2225 2232 f2fs_destroy_compress_ctx(&cc); 2226 2233 if (ret) 2227 2234 goto set_error_page; ··· 2244 2251 #endif 2245 2252 2246 2253 ret = f2fs_read_single_page(inode, page, max_nr_pages, &map, 2247 - &bio, &last_block_in_bio, is_readahead); 2254 + &bio, &last_block_in_bio, rac); 2248 2255 if (ret) { 2249 2256 #ifdef CONFIG_F2FS_FS_COMPRESSION 2250 2257 set_error_page: ··· 2253 2260 zero_user_segment(page, 0, PAGE_SIZE); 2254 2261 unlock_page(page); 2255 2262 } 2263 + #ifdef 
CONFIG_F2FS_FS_COMPRESSION 2256 2264 next_page: 2257 - if (pages) 2265 + #endif 2266 + if (rac) 2258 2267 put_page(page); 2259 2268 2260 2269 #ifdef CONFIG_F2FS_FS_COMPRESSION ··· 2266 2271 ret = f2fs_read_multi_pages(&cc, &bio, 2267 2272 max_nr_pages, 2268 2273 &last_block_in_bio, 2269 - is_readahead, false); 2274 + rac != NULL, false); 2270 2275 f2fs_destroy_compress_ctx(&cc); 2271 2276 } 2272 2277 } 2273 2278 #endif 2274 2279 } 2275 - BUG_ON(pages && !list_empty(pages)); 2276 2280 if (bio) 2277 2281 __submit_bio(F2FS_I_SB(inode), bio, DATA); 2278 - return pages ? 0 : ret; 2282 + return ret; 2279 2283 } 2280 2284 2281 2285 static int f2fs_read_data_page(struct file *file, struct page *page) ··· 2293 2299 if (f2fs_has_inline_data(inode)) 2294 2300 ret = f2fs_read_inline_data(inode, page); 2295 2301 if (ret == -EAGAIN) 2296 - ret = f2fs_mpage_readpages(page_file_mapping(page), 2297 - NULL, page, 1, false); 2302 + ret = f2fs_mpage_readpages(inode, NULL, page); 2298 2303 return ret; 2299 2304 } 2300 2305 2301 - static int f2fs_read_data_pages(struct file *file, 2302 - struct address_space *mapping, 2303 - struct list_head *pages, unsigned nr_pages) 2306 + static void f2fs_readahead(struct readahead_control *rac) 2304 2307 { 2305 - struct inode *inode = mapping->host; 2306 - struct page *page = list_last_entry(pages, struct page, lru); 2308 + struct inode *inode = rac->mapping->host; 2307 2309 2308 - trace_f2fs_readpages(inode, page, nr_pages); 2310 + trace_f2fs_readpages(inode, readahead_index(rac), readahead_count(rac)); 2309 2311 2310 2312 if (!f2fs_is_compress_backend_ready(inode)) 2311 - return 0; 2313 + return; 2312 2314 2313 2315 /* If the file has inline data, skip readpages */ 2314 2316 if (f2fs_has_inline_data(inode)) 2315 - return 0; 2317 + return; 2316 2318 2317 - return f2fs_mpage_readpages(mapping, pages, NULL, nr_pages, true); 2319 + f2fs_mpage_readpages(inode, rac, NULL); 2318 2320 } 2319 2321 2320 2322 int f2fs_encrypt_one_page(struct f2fs_io_info 
*fio) ··· 3795 3805 3796 3806 const struct address_space_operations f2fs_dblock_aops = { 3797 3807 .readpage = f2fs_read_data_page, 3798 - .readpages = f2fs_read_data_pages, 3808 + .readahead = f2fs_readahead, 3799 3809 .writepage = f2fs_write_data_page, 3800 3810 .writepages = f2fs_write_data_pages, 3801 3811 .write_begin = f2fs_write_begin,
+2 -12
fs/f2fs/f2fs.h
··· 3051 3051 if (PagePrivate(page)) 3052 3052 return; 3053 3053 3054 - get_page(page); 3055 - SetPagePrivate(page); 3056 - set_page_private(page, data); 3054 + attach_page_private(page, (void *)data); 3057 3055 } 3058 3056 3059 3057 static inline void f2fs_clear_page_private(struct page *page) 3060 3058 { 3061 - if (!PagePrivate(page)) 3062 - return; 3063 - 3064 - set_page_private(page, 0); 3065 - ClearPagePrivate(page); 3066 - f2fs_put_page(page, 0); 3059 + detach_page_private(page); 3067 3060 } 3068 3061 3069 3062 /* ··· 3366 3373 int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index); 3367 3374 int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from); 3368 3375 int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index); 3369 - int f2fs_mpage_readpages(struct address_space *mapping, 3370 - struct list_head *pages, struct page *page, 3371 - unsigned nr_pages, bool is_readahead); 3372 3376 struct page *f2fs_get_read_data_page(struct inode *inode, pgoff_t index, 3373 3377 int op_flags, bool for_write); 3374 3378 struct page *f2fs_find_data_page(struct inode *inode, pgoff_t index);
+2 -33
fs/f2fs/verity.c
··· 222 222 return size; 223 223 } 224 224 225 - /* 226 - * Prefetch some pages from the file's Merkle tree. 227 - * 228 - * This is basically a stripped-down version of __do_page_cache_readahead() 229 - * which works on pages past i_size. 230 - */ 231 - static void f2fs_merkle_tree_readahead(struct address_space *mapping, 232 - pgoff_t start_index, unsigned long count) 233 - { 234 - LIST_HEAD(pages); 235 - unsigned int nr_pages = 0; 236 - struct page *page; 237 - pgoff_t index; 238 - struct blk_plug plug; 239 - 240 - for (index = start_index; index < start_index + count; index++) { 241 - page = xa_load(&mapping->i_pages, index); 242 - if (!page || xa_is_value(page)) { 243 - page = __page_cache_alloc(readahead_gfp_mask(mapping)); 244 - if (!page) 245 - break; 246 - page->index = index; 247 - list_add(&page->lru, &pages); 248 - nr_pages++; 249 - } 250 - } 251 - blk_start_plug(&plug); 252 - f2fs_mpage_readpages(mapping, &pages, NULL, nr_pages, true); 253 - blk_finish_plug(&plug); 254 - } 255 - 256 225 static struct page *f2fs_read_merkle_tree_page(struct inode *inode, 257 226 pgoff_t index, 258 227 unsigned long num_ra_pages) ··· 235 266 if (page) 236 267 put_page(page); 237 268 else if (num_ra_pages > 1) 238 - f2fs_merkle_tree_readahead(inode->i_mapping, index, 239 - num_ra_pages); 269 + page_cache_readahead_unbounded(inode->i_mapping, NULL, 270 + index, num_ra_pages, 0); 240 271 page = read_mapping_page(inode->i_mapping, index, NULL); 241 272 } 242 273 return page;
+3 -4
fs/fat/inode.c
··· 210 210 return mpage_readpage(page, fat_get_block); 211 211 } 212 212 213 - static int fat_readpages(struct file *file, struct address_space *mapping, 214 - struct list_head *pages, unsigned nr_pages) 213 + static void fat_readahead(struct readahead_control *rac) 215 214 { 216 - return mpage_readpages(mapping, pages, nr_pages, fat_get_block); 215 + mpage_readahead(rac, fat_get_block); 217 216 } 218 217 219 218 static void fat_write_failed(struct address_space *mapping, loff_t to) ··· 343 344 344 345 static const struct address_space_operations fat_aops = { 345 346 .readpage = fat_readpage, 346 - .readpages = fat_readpages, 347 + .readahead = fat_readahead, 347 348 .writepage = fat_writepage, 348 349 .writepages = fat_writepages, 349 350 .write_begin = fat_write_begin,
+1
fs/file_table.c
··· 198 198 file->f_inode = path->dentry->d_inode; 199 199 file->f_mapping = path->dentry->d_inode->i_mapping; 200 200 file->f_wb_err = filemap_sample_wb_err(file->f_mapping); 201 + file->f_sb_err = file_sample_sb_err(file); 201 202 if ((file->f_mode & FMODE_READ) && 202 203 likely(fop->read || fop->read_iter)) 203 204 file->f_mode |= FMODE_CAN_READ;
-1
fs/fs-writeback.c
··· 1070 1070 static unsigned long get_nr_dirty_pages(void) 1071 1071 { 1072 1072 return global_node_page_state(NR_FILE_DIRTY) + 1073 - global_node_page_state(NR_UNSTABLE_NFS) + 1074 1073 get_nr_dirty_inodes(); 1075 1074 } 1076 1075
+28 -72
fs/fuse/file.c
··· 915 915 fuse_readpages_end(fc, &ap->args, err); 916 916 } 917 917 918 - struct fuse_fill_data { 919 - struct fuse_io_args *ia; 920 - struct file *file; 921 - struct inode *inode; 922 - unsigned int nr_pages; 923 - unsigned int max_pages; 924 - }; 925 - 926 - static int fuse_readpages_fill(void *_data, struct page *page) 918 + static void fuse_readahead(struct readahead_control *rac) 927 919 { 928 - struct fuse_fill_data *data = _data; 929 - struct fuse_io_args *ia = data->ia; 930 - struct fuse_args_pages *ap = &ia->ap; 931 - struct inode *inode = data->inode; 920 + struct inode *inode = rac->mapping->host; 932 921 struct fuse_conn *fc = get_fuse_conn(inode); 922 + unsigned int i, max_pages, nr_pages = 0; 933 923 934 - fuse_wait_on_page_writeback(inode, page->index); 935 - 936 - if (ap->num_pages && 937 - (ap->num_pages == fc->max_pages || 938 - (ap->num_pages + 1) * PAGE_SIZE > fc->max_read || 939 - ap->pages[ap->num_pages - 1]->index + 1 != page->index)) { 940 - data->max_pages = min_t(unsigned int, data->nr_pages, 941 - fc->max_pages); 942 - fuse_send_readpages(ia, data->file); 943 - data->ia = ia = fuse_io_alloc(NULL, data->max_pages); 944 - if (!ia) { 945 - unlock_page(page); 946 - return -ENOMEM; 947 - } 948 - ap = &ia->ap; 949 - } 950 - 951 - if (WARN_ON(ap->num_pages >= data->max_pages)) { 952 - unlock_page(page); 953 - fuse_io_free(ia); 954 - return -EIO; 955 - } 956 - 957 - get_page(page); 958 - ap->pages[ap->num_pages] = page; 959 - ap->descs[ap->num_pages].length = PAGE_SIZE; 960 - ap->num_pages++; 961 - data->nr_pages--; 962 - return 0; 963 - } 964 - 965 - static int fuse_readpages(struct file *file, struct address_space *mapping, 966 - struct list_head *pages, unsigned nr_pages) 967 - { 968 - struct inode *inode = mapping->host; 969 - struct fuse_conn *fc = get_fuse_conn(inode); 970 - struct fuse_fill_data data; 971 - int err; 972 - 973 - err = -EIO; 974 924 if (is_bad_inode(inode)) 975 - goto out; 925 + return; 976 926 977 - data.file = file; 978 
- data.inode = inode; 979 - data.nr_pages = nr_pages; 980 - data.max_pages = min_t(unsigned int, nr_pages, fc->max_pages); 981 - ; 982 - data.ia = fuse_io_alloc(NULL, data.max_pages); 983 - err = -ENOMEM; 984 - if (!data.ia) 985 - goto out; 927 + max_pages = min_t(unsigned int, fc->max_pages, 928 + fc->max_read / PAGE_SIZE); 986 929 987 - err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data); 988 - if (!err) { 989 - if (data.ia->ap.num_pages) 990 - fuse_send_readpages(data.ia, file); 991 - else 992 - fuse_io_free(data.ia); 930 + for (;;) { 931 + struct fuse_io_args *ia; 932 + struct fuse_args_pages *ap; 933 + 934 + nr_pages = readahead_count(rac) - nr_pages; 935 + if (nr_pages > max_pages) 936 + nr_pages = max_pages; 937 + if (nr_pages == 0) 938 + break; 939 + ia = fuse_io_alloc(NULL, nr_pages); 940 + if (!ia) 941 + return; 942 + ap = &ia->ap; 943 + nr_pages = __readahead_batch(rac, ap->pages, nr_pages); 944 + for (i = 0; i < nr_pages; i++) { 945 + fuse_wait_on_page_writeback(inode, 946 + readahead_index(rac) + i); 947 + ap->descs[i].length = PAGE_SIZE; 948 + } 949 + ap->num_pages = nr_pages; 950 + fuse_send_readpages(ia, rac->file); 993 951 } 994 - out: 995 - return err; 996 952 } 997 953 998 954 static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to) ··· 3329 3373 3330 3374 static const struct address_space_operations fuse_file_aops = { 3331 3375 .readpage = fuse_readpage, 3376 + .readahead = fuse_readahead, 3332 3377 .writepage = fuse_writepage, 3333 3378 .writepages = fuse_writepages, 3334 3379 .launder_page = fuse_launder_page, 3335 - .readpages = fuse_readpages, 3336 3380 .set_page_dirty = __set_page_dirty_nobuffers, 3337 3381 .bmap = fuse_bmap, 3338 3382 .direct_IO = fuse_direct_IO,
+8 -15
fs/gfs2/aops.c
··· 577 577 } 578 578 579 579 /** 580 - * gfs2_readpages - Read a bunch of pages at once 580 + * gfs2_readahead - Read a bunch of pages at once 581 581 * @file: The file to read from 582 582 * @mapping: Address space info 583 583 * @pages: List of pages to read ··· 590 590 * obviously not something we'd want to do on too regular a basis. 591 591 * Any I/O we ignore at this time will be done via readpage later. 592 592 * 2. We don't handle stuffed files here we let readpage do the honours. 593 - * 3. mpage_readpages() does most of the heavy lifting in the common case. 593 + * 3. mpage_readahead() does most of the heavy lifting in the common case. 594 594 * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places. 595 595 */ 596 596 597 - static int gfs2_readpages(struct file *file, struct address_space *mapping, 598 - struct list_head *pages, unsigned nr_pages) 597 + static void gfs2_readahead(struct readahead_control *rac) 599 598 { 600 - struct inode *inode = mapping->host; 599 + struct inode *inode = rac->mapping->host; 601 600 struct gfs2_inode *ip = GFS2_I(inode); 602 - struct gfs2_sbd *sdp = GFS2_SB(inode); 603 601 struct gfs2_holder gh; 604 - int ret; 605 602 606 603 gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); 607 - ret = gfs2_glock_nq(&gh); 608 - if (unlikely(ret)) 604 + if (gfs2_glock_nq(&gh)) 609 605 goto out_uninit; 610 606 if (!gfs2_is_stuffed(ip)) 611 - ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map); 607 + mpage_readahead(rac, gfs2_block_map); 612 608 gfs2_glock_dq(&gh); 613 609 out_uninit: 614 610 gfs2_holder_uninit(&gh); 615 - if (unlikely(gfs2_withdrawn(sdp))) 616 - ret = -EIO; 617 - return ret; 618 611 } 619 612 620 613 /** ··· 826 833 .writepage = gfs2_writepage, 827 834 .writepages = gfs2_writepages, 828 835 .readpage = gfs2_readpage, 829 - .readpages = gfs2_readpages, 836 + .readahead = gfs2_readahead, 830 837 .bmap = gfs2_bmap, 831 838 .invalidatepage = gfs2_invalidatepage, 832 839 .releasepage = 
gfs2_releasepage, ··· 840 847 .writepage = gfs2_jdata_writepage, 841 848 .writepages = gfs2_jdata_writepages, 842 849 .readpage = gfs2_readpage, 843 - .readpages = gfs2_readpages, 850 + .readahead = gfs2_readahead, 844 851 .set_page_dirty = jdata_set_page_dirty, 845 852 .bmap = gfs2_bmap, 846 853 .invalidatepage = gfs2_invalidatepage,
+4 -5
fs/gfs2/dir.c
··· 354 354 355 355 hc = kmalloc(hsize, GFP_NOFS | __GFP_NOWARN); 356 356 if (hc == NULL) 357 - hc = __vmalloc(hsize, GFP_NOFS, PAGE_KERNEL); 357 + hc = __vmalloc(hsize, GFP_NOFS); 358 358 359 359 if (hc == NULL) 360 360 return ERR_PTR(-ENOMEM); ··· 1166 1166 1167 1167 hc2 = kmalloc_array(hsize_bytes, 2, GFP_NOFS | __GFP_NOWARN); 1168 1168 if (hc2 == NULL) 1169 - hc2 = __vmalloc(hsize_bytes * 2, GFP_NOFS, PAGE_KERNEL); 1169 + hc2 = __vmalloc(hsize_bytes * 2, GFP_NOFS); 1170 1170 1171 1171 if (!hc2) 1172 1172 return -ENOMEM; ··· 1327 1327 if (size < KMALLOC_MAX_SIZE) 1328 1328 ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN); 1329 1329 if (!ptr) 1330 - ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL); 1330 + ptr = __vmalloc(size, GFP_NOFS); 1331 1331 return ptr; 1332 1332 } 1333 1333 ··· 1987 1987 1988 1988 ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN); 1989 1989 if (ht == NULL) 1990 - ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO, 1991 - PAGE_KERNEL); 1990 + ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO); 1992 1991 if (!ht) 1993 1992 return -ENOMEM; 1994 1993
+1 -1
fs/gfs2/quota.c
··· 1365 1365 sdp->sd_quota_bitmap = kzalloc(bm_size, GFP_NOFS | __GFP_NOWARN); 1366 1366 if (sdp->sd_quota_bitmap == NULL) 1367 1367 sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS | 1368 - __GFP_ZERO, PAGE_KERNEL); 1368 + __GFP_ZERO); 1369 1369 if (!sdp->sd_quota_bitmap) 1370 1370 return error; 1371 1371
+3 -4
fs/hpfs/file.c
··· 125 125 return block_write_full_page(page, hpfs_get_block, wbc); 126 126 } 127 127 128 - static int hpfs_readpages(struct file *file, struct address_space *mapping, 129 - struct list_head *pages, unsigned nr_pages) 128 + static void hpfs_readahead(struct readahead_control *rac) 130 129 { 131 - return mpage_readpages(mapping, pages, nr_pages, hpfs_get_block); 130 + mpage_readahead(rac, hpfs_get_block); 132 131 } 133 132 134 133 static int hpfs_writepages(struct address_space *mapping, ··· 197 198 const struct address_space_operations hpfs_aops = { 198 199 .readpage = hpfs_readpage, 199 200 .writepage = hpfs_writepage, 200 - .readpages = hpfs_readpages, 201 + .readahead = hpfs_readahead, 201 202 .writepages = hpfs_writepages, 202 203 .write_begin = hpfs_write_begin, 203 204 .write_end = hpfs_write_end,
+36 -75
fs/iomap/buffered-io.c
··· 59 59 * migrate_page_move_mapping() assumes that pages with private data have 60 60 * their count elevated by 1. 61 61 */ 62 - get_page(page); 63 - set_page_private(page, (unsigned long)iop); 64 - SetPagePrivate(page); 62 + attach_page_private(page, iop); 65 63 return iop; 66 64 } 67 65 68 66 static void 69 67 iomap_page_release(struct page *page) 70 68 { 71 - struct iomap_page *iop = to_iomap_page(page); 69 + struct iomap_page *iop = detach_page_private(page); 72 70 73 71 if (!iop) 74 72 return; 75 73 WARN_ON_ONCE(atomic_read(&iop->read_count)); 76 74 WARN_ON_ONCE(atomic_read(&iop->write_count)); 77 - ClearPagePrivate(page); 78 - set_page_private(page, 0); 79 - put_page(page); 80 75 kfree(iop); 81 76 } 82 77 ··· 209 214 struct iomap_readpage_ctx { 210 215 struct page *cur_page; 211 216 bool cur_page_in_bio; 212 - bool is_readahead; 213 217 struct bio *bio; 214 - struct list_head *pages; 218 + struct readahead_control *rac; 215 219 }; 216 220 217 221 static void ··· 302 308 if (ctx->bio) 303 309 submit_bio(ctx->bio); 304 310 305 - if (ctx->is_readahead) /* same as readahead_gfp_mask */ 311 + if (ctx->rac) /* same as readahead_gfp_mask */ 306 312 gfp |= __GFP_NORETRY | __GFP_NOWARN; 307 313 ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs)); 308 314 /* ··· 313 319 if (!ctx->bio) 314 320 ctx->bio = bio_alloc(orig_gfp, 1); 315 321 ctx->bio->bi_opf = REQ_OP_READ; 316 - if (ctx->is_readahead) 322 + if (ctx->rac) 317 323 ctx->bio->bi_opf |= REQ_RAHEAD; 318 324 ctx->bio->bi_iter.bi_sector = sector; 319 325 bio_set_dev(ctx->bio, iomap->bdev); ··· 361 367 } 362 368 363 369 /* 364 - * Just like mpage_readpages and block_read_full_page we always 370 + * Just like mpage_readahead and block_read_full_page we always 365 371 * return 0 and just mark the page as PageError on errors. This 366 372 * should be cleaned up all through the stack eventually. 
367 373 */ ··· 369 375 } 370 376 EXPORT_SYMBOL_GPL(iomap_readpage); 371 377 372 - static struct page * 373 - iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos, 374 - loff_t length, loff_t *done) 375 - { 376 - while (!list_empty(pages)) { 377 - struct page *page = lru_to_page(pages); 378 - 379 - if (page_offset(page) >= (u64)pos + length) 380 - break; 381 - 382 - list_del(&page->lru); 383 - if (!add_to_page_cache_lru(page, inode->i_mapping, page->index, 384 - GFP_NOFS)) 385 - return page; 386 - 387 - /* 388 - * If we already have a page in the page cache at index we are 389 - * done. Upper layers don't care if it is uptodate after the 390 - * readpages call itself as every page gets checked again once 391 - * actually needed. 392 - */ 393 - *done += PAGE_SIZE; 394 - put_page(page); 395 - } 396 - 397 - return NULL; 398 - } 399 - 400 378 static loff_t 401 - iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length, 379 + iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length, 402 380 void *data, struct iomap *iomap, struct iomap *srcmap) 403 381 { 404 382 struct iomap_readpage_ctx *ctx = data; ··· 384 418 ctx->cur_page = NULL; 385 419 } 386 420 if (!ctx->cur_page) { 387 - ctx->cur_page = iomap_next_page(inode, ctx->pages, 388 - pos, length, &done); 389 - if (!ctx->cur_page) 390 - break; 421 + ctx->cur_page = readahead_page(ctx->rac); 391 422 ctx->cur_page_in_bio = false; 392 423 } 393 424 ret = iomap_readpage_actor(inode, pos + done, length - done, ··· 394 431 return done; 395 432 } 396 433 397 - int 398 - iomap_readpages(struct address_space *mapping, struct list_head *pages, 399 - unsigned nr_pages, const struct iomap_ops *ops) 434 + /** 435 + * iomap_readahead - Attempt to read pages from a file. 436 + * @rac: Describes the pages to be read. 437 + * @ops: The operations vector for the filesystem. 438 + * 439 + * This function is for filesystems to call to implement their readahead 440 + * address_space operation. 
441 + * 442 + * Context: The @ops callbacks may submit I/O (eg to read the addresses of 443 + * blocks from disc), and may wait for it. The caller may be trying to 444 + * access a different page, and so sleeping excessively should be avoided. 445 + * It may allocate memory, but should avoid costly allocations. This 446 + * function is called with memalloc_nofs set, so allocations will not cause 447 + * the filesystem to be reentered. 448 + */ 449 + void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops) 400 450 { 451 + struct inode *inode = rac->mapping->host; 452 + loff_t pos = readahead_pos(rac); 453 + loff_t length = readahead_length(rac); 401 454 struct iomap_readpage_ctx ctx = { 402 - .pages = pages, 403 - .is_readahead = true, 455 + .rac = rac, 404 456 }; 405 - loff_t pos = page_offset(list_entry(pages->prev, struct page, lru)); 406 - loff_t last = page_offset(list_entry(pages->next, struct page, lru)); 407 - loff_t length = last - pos + PAGE_SIZE, ret = 0; 408 457 409 - trace_iomap_readpages(mapping->host, nr_pages); 458 + trace_iomap_readahead(inode, readahead_count(rac)); 410 459 411 460 while (length > 0) { 412 - ret = iomap_apply(mapping->host, pos, length, 0, ops, 413 - &ctx, iomap_readpages_actor); 461 + loff_t ret = iomap_apply(inode, pos, length, 0, ops, 462 + &ctx, iomap_readahead_actor); 414 463 if (ret <= 0) { 415 464 WARN_ON_ONCE(ret == 0); 416 - goto done; 465 + break; 417 466 } 418 467 pos += ret; 419 468 length -= ret; 420 469 } 421 - ret = 0; 422 - done: 470 + 423 471 if (ctx.bio) 424 472 submit_bio(ctx.bio); 425 473 if (ctx.cur_page) { ··· 438 464 unlock_page(ctx.cur_page); 439 465 put_page(ctx.cur_page); 440 466 } 441 - 442 - /* 443 - * Check that we didn't lose a page due to the arcance calling 444 - * conventions.. 
445 - */ 446 - WARN_ON_ONCE(!ret && !list_empty(ctx.pages)); 447 - return ret; 448 467 } 449 - EXPORT_SYMBOL_GPL(iomap_readpages); 468 + EXPORT_SYMBOL_GPL(iomap_readahead); 450 469 451 470 /* 452 471 * iomap_is_partially_uptodate checks whether blocks within a page are ··· 521 554 if (ret != MIGRATEPAGE_SUCCESS) 522 555 return ret; 523 556 524 - if (page_has_private(page)) { 525 - ClearPagePrivate(page); 526 - get_page(newpage); 527 - set_page_private(newpage, page_private(page)); 528 - set_page_private(page, 0); 529 - put_page(page); 530 - SetPagePrivate(newpage); 531 - } 557 + if (page_has_private(page)) 558 + attach_page_private(newpage, detach_page_private(page)); 532 559 533 560 if (mode != MIGRATE_SYNC_NO_COPY) 534 561 migrate_page_copy(newpage, page);
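The iomap hunks above delete the open-coded iomap_next_page() insertion path: with the new ->readahead operation, the core has already added the pages to the page cache, and the filesystem simply drains them in order with readahead_page(). A compilable userspace sketch of that consumer shape (the struct and helpers below are simplified stand-ins, not the kernel definitions):

```c
/* Simplified stand-in for struct readahead_control: the VFS core has
 * already queued the pages; the filesystem only drains them in order. */
struct ra_ctl {
	const int *pages;	/* "page indices" queued by the core */
	int nr;			/* total queued */
	int next;		/* next one to hand out */
};

/* Model of readahead_page(): next queued page index, or -1 when done. */
int ra_next_page(struct ra_ctl *rac)
{
	return rac->next < rac->nr ? rac->pages[rac->next++] : -1;
}

/* Model of a ->readahead implementation: consume every queued page,
 * returning how many were processed (a real fs would build bios here). */
int drain_readahead(struct ra_ctl *rac)
{
	int n = 0;

	while (ra_next_page(rac) >= 0)
		n++;
	return n;
}

/* Exercise the loop on a three-page batch. */
int demo_drain3(void)
{
	const int pages[] = {7, 8, 9};
	struct ra_ctl rac = { pages, 3, 0 };

	return drain_readahead(&rac);
}
```

This is why the old "did we lose a page?" WARN_ON_ONCE and the add_to_page_cache_lru() collision handling disappear: the consumer loop cannot skip or leak entries it never owned.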
+1 -1
fs/iomap/trace.h
··· 39 39 TP_PROTO(struct inode *inode, int nr_pages), \ 40 40 TP_ARGS(inode, nr_pages)) 41 41 DEFINE_READPAGE_EVENT(iomap_readpage); 42 - DEFINE_READPAGE_EVENT(iomap_readpages); 42 + DEFINE_READPAGE_EVENT(iomap_readahead); 43 43 44 44 DECLARE_EVENT_CLASS(iomap_range_class, 45 45 TP_PROTO(struct inode *inode, unsigned long off, unsigned int len),
+3 -4
fs/isofs/inode.c
··· 1185 1185 return mpage_readpage(page, isofs_get_block); 1186 1186 } 1187 1187 1188 - static int isofs_readpages(struct file *file, struct address_space *mapping, 1189 - struct list_head *pages, unsigned nr_pages) 1188 + static void isofs_readahead(struct readahead_control *rac) 1190 1189 { 1191 - return mpage_readpages(mapping, pages, nr_pages, isofs_get_block); 1190 + mpage_readahead(rac, isofs_get_block); 1192 1191 } 1193 1192 1194 1193 static sector_t _isofs_bmap(struct address_space *mapping, sector_t block) ··· 1197 1198 1198 1199 static const struct address_space_operations isofs_aops = { 1199 1200 .readpage = isofs_readpage, 1200 - .readpages = isofs_readpages, 1201 + .readahead = isofs_readahead, 1201 1202 .bmap = _isofs_bmap 1202 1203 }; 1203 1204
+3 -4
fs/jfs/inode.c
··· 296 296 return mpage_readpage(page, jfs_get_block); 297 297 } 298 298 299 - static int jfs_readpages(struct file *file, struct address_space *mapping, 300 - struct list_head *pages, unsigned nr_pages) 299 + static void jfs_readahead(struct readahead_control *rac) 301 300 { 302 - return mpage_readpages(mapping, pages, nr_pages, jfs_get_block); 301 + mpage_readahead(rac, jfs_get_block); 303 302 } 304 303 305 304 static void jfs_write_failed(struct address_space *mapping, loff_t to) ··· 357 358 358 359 const struct address_space_operations jfs_aops = { 359 360 .readpage = jfs_readpage, 360 - .readpages = jfs_readpages, 361 + .readahead = jfs_readahead, 361 362 .writepage = jfs_writepage, 362 363 .writepages = jfs_writepages, 363 364 .write_begin = jfs_write_begin,
+11 -27
fs/mpage.c
··· 91 91 } 92 92 93 93 /* 94 - * support function for mpage_readpages. The fs supplied get_block might 94 + * support function for mpage_readahead. The fs supplied get_block might 95 95 * return an up to date buffer. This is used to map that buffer into 96 96 * the page, which allows readpage to avoid triggering a duplicate call 97 97 * to get_block. ··· 338 338 } 339 339 340 340 /** 341 - * mpage_readpages - populate an address space with some pages & start reads against them 342 - * @mapping: the address_space 343 - * @pages: The address of a list_head which contains the target pages. These 344 - * pages have their ->index populated and are otherwise uninitialised. 345 - * The page at @pages->prev has the lowest file offset, and reads should be 346 - * issued in @pages->prev to @pages->next order. 347 - * @nr_pages: The number of pages at *@pages 341 + * mpage_readahead - start reads against pages 342 + * @rac: Describes which pages to read. 348 343 * @get_block: The filesystem's block mapper function. 349 344 * 350 345 * This function walks the pages and the blocks within each page, building and ··· 376 381 * 377 382 * This all causes the disk requests to be issued in the correct order. 
378 383 */ 379 - int 380 - mpage_readpages(struct address_space *mapping, struct list_head *pages, 381 - unsigned nr_pages, get_block_t get_block) 384 + void mpage_readahead(struct readahead_control *rac, get_block_t get_block) 382 385 { 386 + struct page *page; 383 387 struct mpage_readpage_args args = { 384 388 .get_block = get_block, 385 389 .is_readahead = true, 386 390 }; 387 - unsigned page_idx; 388 391 389 - for (page_idx = 0; page_idx < nr_pages; page_idx++) { 390 - struct page *page = lru_to_page(pages); 391 - 392 + while ((page = readahead_page(rac))) { 392 393 prefetchw(&page->flags); 393 - list_del(&page->lru); 394 - if (!add_to_page_cache_lru(page, mapping, 395 - page->index, 396 - readahead_gfp_mask(mapping))) { 397 - args.page = page; 398 - args.nr_pages = nr_pages - page_idx; 399 - args.bio = do_mpage_readpage(&args); 400 - } 394 + args.page = page; 395 + args.nr_pages = readahead_count(rac); 396 + args.bio = do_mpage_readpage(&args); 401 397 put_page(page); 402 398 } 403 - BUG_ON(!list_empty(pages)); 404 399 if (args.bio) 405 400 mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio); 406 - return 0; 407 401 } 408 - EXPORT_SYMBOL(mpage_readpages); 402 + EXPORT_SYMBOL(mpage_readahead); 409 403 410 404 /* 411 405 * This isn't called much at all ··· 547 563 * Page has buffers, but they are all unmapped. The page was 548 564 * created by pagein or read over a hole which was handled by 549 565 * block_read_full_page(). If this address_space is also 550 - * using mpage_readpages then this can rarely happen. 566 + * using mpage_readahead then this can rarely happen. 551 567 */ 552 568 goto confused; 553 569 }
+1 -1
fs/nfs/blocklayout/extent_tree.c
··· 582 582 if (!arg->layoutupdate_pages) 583 583 return -ENOMEM; 584 584 585 - start_p = __vmalloc(buffer_size, GFP_NOFS, PAGE_KERNEL); 585 + start_p = __vmalloc(buffer_size, GFP_NOFS); 586 586 if (!start_p) { 587 587 kfree(arg->layoutupdate_pages); 588 588 return -ENOMEM;
+7 -3
fs/nfs/internal.h
··· 668 668 } 669 669 670 670 /* 671 - * Record the page as unstable and mark its inode as dirty. 671 + * Record the page as unstable (an extra writeback period) and mark its 672 + * inode as dirty. 672 673 */ 673 674 static inline 674 675 void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo) ··· 677 676 if (!cinfo->dreq) { 678 677 struct inode *inode = page_file_mapping(page)->host; 679 678 680 - inc_node_page_state(page, NR_UNSTABLE_NFS); 681 - inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); 679 + /* This page is really still in write-back - just that the 680 + * writeback is happening on the server now. 681 + */ 682 + inc_node_page_state(page, NR_WRITEBACK); 683 + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); 682 684 __mark_inode_dirty(inode, I_DIRTY_DATASYNC); 683 685 } 684 686 }
+2 -2
fs/nfs/write.c
··· 946 946 static void 947 947 nfs_clear_page_commit(struct page *page) 948 948 { 949 - dec_node_page_state(page, NR_UNSTABLE_NFS); 949 + dec_node_page_state(page, NR_WRITEBACK); 950 950 dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, 951 - WB_RECLAIMABLE); 951 + WB_WRITEBACK); 952 952 } 953 953 954 954 /* Called holding the request lock on @req */
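Together with the internal.h hunk above, this folds "unstable" NFS pages into the ordinary writeback counters: nfs_mark_page_unstable() now increments NR_WRITEBACK/WB_WRITEBACK and nfs_clear_page_commit() decrements the same pair, so the two sites must stay balanced. A self-contained sketch of that paired accounting (the struct is an illustrative stand-in for the node/bdi stat pair):

```c
/* Illustrative stand-in for the node and writeback-domain stat pair. */
struct wb_stats {
	long nr_writeback;	/* models NR_WRITEBACK */
	long wb_writeback;	/* models WB_WRITEBACK */
};

/* Models nfs_mark_page_unstable(): the page is still "in writeback",
 * it is just the server doing the writing now. */
void mark_page_unstable(struct wb_stats *s)
{
	s->nr_writeback++;
	s->wb_writeback++;
}

/* Models nfs_clear_page_commit(): the commit finished, undo both. */
void clear_page_commit(struct wb_stats *s)
{
	s->nr_writeback--;
	s->wb_writeback--;
}

/* N marks followed by N clears must leave both counters at zero. */
int stats_balanced_after(int n)
{
	struct wb_stats s = {0, 0};
	int i;

	for (i = 0; i < n; i++)
		mark_page_unstable(&s);
	for (i = 0; i < n; i++)
		clear_page_commit(&s);
	return s.nr_writeback == 0 && s.wb_writeback == 0;
}
```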
+5 -4
fs/nfsd/vfs.c
··· 979 979 980 980 if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) 981 981 /* 982 - * We want less throttling in balance_dirty_pages() 983 - * and shrink_inactive_list() so that nfs to 982 + * We want throttling in balance_dirty_pages() 983 + * and shrink_inactive_list() to only consider 984 + * the backingdev we are writing to, so that nfs to 984 985 * localhost doesn't cause nfsd to lock up due to all 985 986 * the client's dirty pages or its congested queue. 986 987 */ 987 - current->flags |= PF_LESS_THROTTLE; 988 + current->flags |= PF_LOCAL_THROTTLE; 988 989 989 990 exp = fhp->fh_export; 990 991 use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp); ··· 1038 1037 nfserr = nfserrno(host_err); 1039 1038 } 1040 1039 if (test_bit(RQ_LOCAL, &rqstp->rq_flags)) 1041 - current_restore_flags(pflags, PF_LESS_THROTTLE); 1040 + current_restore_flags(pflags, PF_LOCAL_THROTTLE); 1042 1041 return nfserr; 1043 1042 } 1044 1043
+3 -12
fs/nilfs2/inode.c
··· 145 145 return mpage_readpage(page, nilfs_get_block); 146 146 } 147 147 148 - /** 149 - * nilfs_readpages() - implement readpages() method of nilfs_aops {} 150 - * address_space_operations. 151 - * @file - file struct of the file to be read 152 - * @mapping - address_space struct used for reading multiple pages 153 - * @pages - the pages to be read 154 - * @nr_pages - number of pages to be read 155 - */ 156 - static int nilfs_readpages(struct file *file, struct address_space *mapping, 157 - struct list_head *pages, unsigned int nr_pages) 148 + static void nilfs_readahead(struct readahead_control *rac) 158 149 { 159 - return mpage_readpages(mapping, pages, nr_pages, nilfs_get_block); 150 + mpage_readahead(rac, nilfs_get_block); 160 151 } 161 152 162 153 static int nilfs_writepages(struct address_space *mapping, ··· 299 308 .readpage = nilfs_readpage, 300 309 .writepages = nilfs_writepages, 301 310 .set_page_dirty = nilfs_set_page_dirty, 302 - .readpages = nilfs_readpages, 311 + .readahead = nilfs_readahead, 303 312 .write_begin = nilfs_write_begin, 304 313 .write_end = nilfs_write_end, 305 314 /* .releasepage = nilfs_releasepage, */
+1 -1
fs/ntfs/aops.c
··· 1732 1732 bh = bh->b_this_page; 1733 1733 } while (bh); 1734 1734 tail->b_this_page = head; 1735 - attach_page_buffers(page, head); 1735 + attach_page_private(page, head); 1736 1736 } else 1737 1737 buffers_to_free = bh; 1738 1738 }
+1 -1
fs/ntfs/malloc.h
··· 34 34 /* return (void *)__get_free_page(gfp_mask); */ 35 35 } 36 36 if (likely((size >> PAGE_SHIFT) < totalram_pages())) 37 - return __vmalloc(size, gfp_mask, PAGE_KERNEL); 37 + return __vmalloc(size, gfp_mask); 38 38 return NULL; 39 39 } 40 40
+1 -1
fs/ntfs/mft.c
··· 504 504 bh = bh->b_this_page; 505 505 } while (bh); 506 506 tail->b_this_page = head; 507 - attach_page_buffers(page, head); 507 + attach_page_private(page, head); 508 508 } 509 509 bh = head = page_buffers(page); 510 510 BUG_ON(!bh);
+13 -21
fs/ocfs2/aops.c
··· 350 350 * grow out to a tree. If need be, detecting boundary extents could 351 351 * trivially be added in a future version of ocfs2_get_block(). 352 352 */ 353 - static int ocfs2_readpages(struct file *filp, struct address_space *mapping, 354 - struct list_head *pages, unsigned nr_pages) 353 + static void ocfs2_readahead(struct readahead_control *rac) 355 354 { 356 - int ret, err = -EIO; 357 - struct inode *inode = mapping->host; 355 + int ret; 356 + struct inode *inode = rac->mapping->host; 358 357 struct ocfs2_inode_info *oi = OCFS2_I(inode); 359 - loff_t start; 360 - struct page *last; 361 358 362 359 /* 363 360 * Use the nonblocking flag for the dlm code to avoid page ··· 362 365 */ 363 366 ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK); 364 367 if (ret) 365 - return err; 368 + return; 366 369 367 - if (down_read_trylock(&oi->ip_alloc_sem) == 0) { 368 - ocfs2_inode_unlock(inode, 0); 369 - return err; 370 - } 370 + if (down_read_trylock(&oi->ip_alloc_sem) == 0) 371 + goto out_unlock; 371 372 372 373 /* 373 374 * Don't bother with inline-data. There isn't anything 374 375 * to read-ahead in that case anyway... 375 376 */ 376 377 if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) 377 - goto out_unlock; 378 + goto out_up; 378 379 379 380 /* 380 381 * Check whether a remote node truncated this file - we just 381 382 * drop out in that case as it's not worth handling here. 
382 383 */ 383 - last = lru_to_page(pages); 384 - start = (loff_t)last->index << PAGE_SHIFT; 385 - if (start >= i_size_read(inode)) 386 - goto out_unlock; 384 + if (readahead_pos(rac) >= i_size_read(inode)) 385 + goto out_up; 387 386 388 - err = mpage_readpages(mapping, pages, nr_pages, ocfs2_get_block); 387 + mpage_readahead(rac, ocfs2_get_block); 389 388 390 - out_unlock: 389 + out_up: 391 390 up_read(&oi->ip_alloc_sem); 391 + out_unlock: 392 392 ocfs2_inode_unlock(inode, 0); 393 - 394 - return err; 395 393 } 396 394 397 395 /* Note: Because we don't support holes, our allocation has ··· 2466 2474 2467 2475 const struct address_space_operations ocfs2_aops = { 2468 2476 .readpage = ocfs2_readpage, 2469 - .readpages = ocfs2_readpages, 2477 + .readahead = ocfs2_readahead, 2470 2478 .writepage = ocfs2_writepage, 2471 2479 .write_begin = ocfs2_write_begin, 2472 2480 .write_end = ocfs2_write_end,
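The rewritten ocfs2_readahead() also straightens the error unwind: the inode lock is taken first, ip_alloc_sem second, and the new out_up/out_unlock labels release them strictly in reverse order, including when the trylock on the semaphore fails. A compilable sketch of that acquire/release discipline, with counters standing in for the two locks:

```c
/* Counters standing in for the two ocfs2 locks; nonzero means "held". */
struct two_locks {
	int inode_lock;	/* models ocfs2_inode_lock_full() */
	int alloc_sem;	/* models down_read_trylock(&oi->ip_alloc_sem) */
};

/* Models the ocfs2_readahead() unwind: take A, try B, do the work,
 * and release in reverse order on every path. try_b_ok simulates
 * whether the trylock on the second lock succeeds. */
int readahead_locked(struct two_locks *l, int try_b_ok)
{
	int did_work = 0;

	l->inode_lock++;		/* lock A */
	if (try_b_ok) {
		l->alloc_sem++;		/* trylock B succeeded */
		did_work = 1;		/* mpage_readahead() would run here */
		l->alloc_sem--;		/* out_up: release B first */
	}
	l->inode_lock--;		/* out_unlock: release A last */
	return did_work;
}

/* Run one attempt and report whether both locks ended up free. */
int unwind_clean(int try_b_ok)
{
	struct two_locks l = {0, 0};

	readahead_locked(&l, try_b_ok);
	return l.inode_lock == 0 && l.alloc_sem == 0;
}
```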
+1
fs/ocfs2/dlm/dlmmaster.c
··· 2760 2760 * Returns: 1 if dlm->spinlock was dropped/retaken, 0 if never dropped 2761 2761 */ 2762 2762 int dlm_empty_lockres(struct dlm_ctxt *dlm, struct dlm_lock_resource *res) 2763 + __must_hold(&dlm->spinlock) 2763 2764 { 2764 2765 int ret; 2765 2766 int lock_dropped = 0;
+3 -1
fs/ocfs2/ocfs2.h
··· 279 279 OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15, /* Journal Async Commit */ 280 280 OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */ 281 281 OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */ 282 + OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */ 282 283 }; 283 284 284 285 #define OCFS2_OSB_SOFT_RO 0x0001 ··· 674 673 675 674 static inline int ocfs2_mount_local(struct ocfs2_super *osb) 676 675 { 677 - return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT); 676 + return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT) 677 + || (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER)); 678 678 } 679 679 680 680 static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
+27 -19
fs/ocfs2/slot_map.c
··· 254 254 int i, ret = -ENOSPC; 255 255 256 256 if ((preferred >= 0) && (preferred < si->si_num_slots)) { 257 - if (!si->si_slots[preferred].sl_valid) { 257 + if (!si->si_slots[preferred].sl_valid || 258 + !si->si_slots[preferred].sl_node_num) { 258 259 ret = preferred; 259 260 goto out; 260 261 } 261 262 } 262 263 263 264 for(i = 0; i < si->si_num_slots; i++) { 264 - if (!si->si_slots[i].sl_valid) { 265 + if (!si->si_slots[i].sl_valid || 266 + !si->si_slots[i].sl_node_num) { 265 267 ret = i; 266 268 break; 267 269 } ··· 458 456 spin_lock(&osb->osb_lock); 459 457 ocfs2_update_slot_info(si); 460 458 461 - /* search for ourselves first and take the slot if it already 462 - * exists. Perhaps we need to mark this in a variable for our 463 - * own journal recovery? Possibly not, though we certainly 464 - * need to warn to the user */ 465 - slot = __ocfs2_node_num_to_slot(si, osb->node_num); 466 - if (slot < 0) { 467 - /* if no slot yet, then just take 1st available 468 - * one. */ 469 - slot = __ocfs2_find_empty_slot(si, osb->preferred_slot); 459 + if (ocfs2_mount_local(osb)) 460 + /* use slot 0 directly in local mode */ 461 + slot = 0; 462 + else { 463 + /* search for ourselves first and take the slot if it already 464 + * exists. Perhaps we need to mark this in a variable for our 465 + * own journal recovery? Possibly not, though we certainly 466 + * need to warn to the user */ 467 + slot = __ocfs2_node_num_to_slot(si, osb->node_num); 470 468 if (slot < 0) { 471 - spin_unlock(&osb->osb_lock); 472 - mlog(ML_ERROR, "no free slots available!\n"); 473 - status = -EINVAL; 474 - goto bail; 475 - } 476 - } else 477 - printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already " 478 - "allocated to this node!\n", slot, osb->dev_str); 469 + /* if no slot yet, then just take 1st available 470 + * one. 
*/ 471 + slot = __ocfs2_find_empty_slot(si, osb->preferred_slot); 472 + if (slot < 0) { 473 + spin_unlock(&osb->osb_lock); 474 + mlog(ML_ERROR, "no free slots available!\n"); 475 + status = -EINVAL; 476 + goto bail; 477 + } 478 + } else 479 + printk(KERN_INFO "ocfs2: Slot %d on device (%s) was " 480 + "already allocated to this node!\n", 481 + slot, osb->dev_str); 482 + } 479 483 480 484 ocfs2_set_slot(si, slot, osb->node_num); 481 485 osb->slot_num = slot;
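The __ocfs2_find_empty_slot() change lets a nocluster mount reuse a slot whose recorded node number is zero, not only a slot that was never valid: the preferred slot is tried first, then the first reusable one. A compilable sketch of that selection rule (plain arrays stand in for the slot map):

```c
/* A slot can be handed out if it was never valid, or if it is valid
 * but records node number 0 (the new nocluster-reuse case). */
int slot_usable(const int *valid, const int *node, int i)
{
	return !valid[i] || !node[i];
}

/* Models __ocfs2_find_empty_slot(): try the preferred slot first,
 * then scan for the first usable one; -1 mirrors -ENOSPC. */
int find_empty_slot(const int *valid, const int *node, int n, int preferred)
{
	int i;

	if (preferred >= 0 && preferred < n &&
	    slot_usable(valid, node, preferred))
		return preferred;
	for (i = 0; i < n; i++)
		if (slot_usable(valid, node, i))
			return i;
	return -1;
}
```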
+21
fs/ocfs2/super.c
··· 175 175 Opt_dir_resv_level, 176 176 Opt_journal_async_commit, 177 177 Opt_err_cont, 178 + Opt_nocluster, 178 179 Opt_err, 179 180 }; 180 181 ··· 209 208 {Opt_dir_resv_level, "dir_resv_level=%u"}, 210 209 {Opt_journal_async_commit, "journal_async_commit"}, 211 210 {Opt_err_cont, "errors=continue"}, 211 + {Opt_nocluster, "nocluster"}, 212 212 {Opt_err, NULL} 213 213 }; 214 214 ··· 621 619 goto out; 622 620 } 623 621 622 + tmp = OCFS2_MOUNT_NOCLUSTER; 623 + if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) { 624 + ret = -EINVAL; 625 + mlog(ML_ERROR, "Cannot change nocluster option on remount\n"); 626 + goto out; 627 + } 628 + 624 629 tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL | 625 630 OCFS2_MOUNT_HB_NONE; 626 631 if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) { ··· 868 859 } 869 860 870 861 if (ocfs2_userspace_stack(osb) && 862 + !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) && 871 863 strncmp(osb->osb_cluster_stack, mopt->cluster_stack, 872 864 OCFS2_STACK_LABEL_LEN)) { 873 865 mlog(ML_ERROR, ··· 1148 1138 osb->dev_str, nodestr, osb->slot_num, 1149 1139 osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? 
"writeback" : 1150 1140 "ordered"); 1141 + 1142 + if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) && 1143 + !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)) 1144 + printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted " 1145 + "without cluster aware mode.\n", osb->dev_str); 1151 1146 1152 1147 atomic_set(&osb->vol_state, VOLUME_MOUNTED); 1153 1148 wake_up(&osb->osb_mount_event); ··· 1460 1445 case Opt_journal_async_commit: 1461 1446 mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT; 1462 1447 break; 1448 + case Opt_nocluster: 1449 + mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER; 1450 + break; 1463 1451 default: 1464 1452 mlog(ML_ERROR, 1465 1453 "Unrecognized mount option \"%s\" " ··· 1573 1555 1574 1556 if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT) 1575 1557 seq_printf(s, ",journal_async_commit"); 1558 + 1559 + if (opts & OCFS2_MOUNT_NOCLUSTER) 1560 + seq_printf(s, ",nocluster"); 1576 1561 1577 1562 return 0; 1578 1563 }
+3 -4
fs/omfs/file.c
··· 289 289 return block_read_full_page(page, omfs_get_block); 290 290 } 291 291 292 - static int omfs_readpages(struct file *file, struct address_space *mapping, 293 - struct list_head *pages, unsigned nr_pages) 292 + static void omfs_readahead(struct readahead_control *rac) 294 293 { 295 - return mpage_readpages(mapping, pages, nr_pages, omfs_get_block); 294 + mpage_readahead(rac, omfs_get_block); 296 295 } 297 296 298 297 static int omfs_writepage(struct page *page, struct writeback_control *wbc) ··· 372 373 373 374 const struct address_space_operations omfs_aops = { 374 375 .readpage = omfs_readpage, 375 - .readpages = omfs_readpages, 376 + .readahead = omfs_readahead, 376 377 .writepage = omfs_writepage, 377 378 .writepages = omfs_writepages, 378 379 .write_begin = omfs_write_begin,
+1 -2
fs/open.c
··· 775 775 path_get(&f->f_path); 776 776 f->f_inode = inode; 777 777 f->f_mapping = inode->i_mapping; 778 - 779 - /* Ensure that we skip any errors that predate opening of the file */ 780 778 f->f_wb_err = filemap_sample_wb_err(f->f_mapping); 779 + f->f_sb_err = file_sample_sb_err(f); 781 780 782 781 if (unlikely(f->f_flags & O_PATH)) { 783 782 f->f_mode = FMODE_PATH | FMODE_OPENED;
+6 -26
fs/orangefs/inode.c
··· 62 62 } else { 63 63 ret = 0; 64 64 } 65 - if (wr) { 66 - kfree(wr); 67 - set_page_private(page, 0); 68 - ClearPagePrivate(page); 69 - put_page(page); 70 - } 65 + kfree(detach_page_private(page)); 71 66 return ret; 72 67 } 73 68 ··· 404 409 wr->len = len; 405 410 wr->uid = current_fsuid(); 406 411 wr->gid = current_fsgid(); 407 - SetPagePrivate(page); 408 - set_page_private(page, (unsigned long)wr); 409 - get_page(page); 412 + attach_page_private(page, wr); 410 413 okay: 411 414 return 0; 412 415 } ··· 452 459 wr = (struct orangefs_write_range *)page_private(page); 453 460 454 461 if (offset == 0 && length == PAGE_SIZE) { 455 - kfree((struct orangefs_write_range *)page_private(page)); 456 - set_page_private(page, 0); 457 - ClearPagePrivate(page); 458 - put_page(page); 462 + kfree(detach_page_private(page)); 459 463 return; 460 464 /* write range entirely within invalidate range (or equal) */ 461 465 } else if (page_offset(page) + offset <= wr->pos && 462 466 wr->pos + wr->len <= page_offset(page) + offset + length) { 463 - kfree((struct orangefs_write_range *)page_private(page)); 464 - set_page_private(page, 0); 465 - ClearPagePrivate(page); 466 - put_page(page); 467 + kfree(detach_page_private(page)); 467 468 /* XXX is this right? 
only caller in fs */ 468 469 cancel_dirty_page(page); 469 470 return; ··· 522 535 523 536 static void orangefs_freepage(struct page *page) 524 537 { 525 - if (PagePrivate(page)) { 526 - kfree((struct orangefs_write_range *)page_private(page)); 527 - set_page_private(page, 0); 528 - ClearPagePrivate(page); 529 - put_page(page); 530 - } 538 + kfree(detach_page_private(page)); 531 539 } 532 540 533 541 static int orangefs_launder_page(struct page *page) ··· 722 740 wr->len = PAGE_SIZE; 723 741 wr->uid = current_fsuid(); 724 742 wr->gid = current_fsgid(); 725 - SetPagePrivate(page); 726 - set_page_private(page, (unsigned long)wr); 727 - get_page(page); 743 + attach_page_private(page, wr); 728 744 okay: 729 745 730 746 file_update_time(vmf->vma->vm_file);
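These orangefs hunks, like the ntfs and iomap ones earlier in the series, collapse the open-coded SetPagePrivate/set_page_private/get_page sequence (and its inverse) into the attach_page_private()/detach_page_private() helper pair, which keeps the extra page reference and the private flag in lockstep. A userspace model of that pairing (struct toy_page is an illustrative stand-in, not the kernel's struct page):

```c
#include <stddef.h>

/* Toy stand-in for struct page: just the fields the helpers touch. */
struct toy_page {
	int refcount;
	int has_private;	/* models the PG_private flag */
	void *private_data;	/* models page->private */
};

/* Models attach_page_private(): take a ref, stash the data, set flag. */
void attach_private(struct toy_page *page, void *data)
{
	page->refcount++;
	page->private_data = data;
	page->has_private = 1;
}

/* Models detach_page_private(): clear flag and data, drop the ref,
 * and hand the old pointer back so the caller can free it. */
void *detach_private(struct toy_page *page)
{
	void *data = page->private_data;

	if (!page->has_private)
		return NULL;
	page->has_private = 0;
	page->private_data = NULL;
	page->refcount--;
	return data;
}

/* Attach then detach; confirm refcount and flag return to the start
 * and that a second detach is a harmless no-op. */
int private_roundtrip_ok(void)
{
	struct toy_page p = {1, 0, NULL};
	int payload = 42;

	attach_private(&p, &payload);
	if (p.refcount != 2 || !p.has_private)
		return 0;
	if (detach_private(&p) != &payload)
		return 0;
	return p.refcount == 1 && !p.has_private &&
	       detach_private(&p) == NULL;
}
```

The NULL-on-no-private behaviour is what lets callers like orangefs_freepage() write `kfree(detach_page_private(page))` unconditionally.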
+1 -2
fs/proc/meminfo.c
··· 110 110 show_val_kb(m, "PageTables: ", 111 111 global_zone_page_state(NR_PAGETABLE)); 112 112 113 - show_val_kb(m, "NFS_Unstable: ", 114 - global_node_page_state(NR_UNSTABLE_NFS)); 113 + show_val_kb(m, "NFS_Unstable: ", 0); 115 114 show_val_kb(m, "Bounce: ", 116 115 global_zone_page_state(NR_BOUNCE)); 117 116 show_val_kb(m, "WritebackTmp: ",
+11 -5
fs/proc/task_mmu.c
··· 546 546 struct mem_size_stats *mss = walk->private; 547 547 struct vm_area_struct *vma = walk->vma; 548 548 bool locked = !!(vma->vm_flags & VM_LOCKED); 549 - struct page *page; 549 + struct page *page = NULL; 550 550 551 - /* FOLL_DUMP will return -EFAULT on huge zero page */ 552 - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP); 551 + if (pmd_present(*pmd)) { 552 + /* FOLL_DUMP will return -EFAULT on huge zero page */ 553 + page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP); 554 + } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) { 555 + swp_entry_t entry = pmd_to_swp_entry(*pmd); 556 + 557 + if (is_migration_entry(entry)) 558 + page = migration_entry_to_page(entry); 559 + } 553 560 if (IS_ERR_OR_NULL(page)) 554 561 return; 555 562 if (PageAnon(page)) ··· 585 578 586 579 ptl = pmd_trans_huge_lock(pmd, vma); 587 580 if (ptl) { 588 - if (pmd_present(*pmd)) 589 - smaps_pmd_entry(pmd, addr, walk); 581 + smaps_pmd_entry(pmd, addr, walk); 590 582 spin_unlock(ptl); 591 583 goto out; 592 584 }
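smaps_pmd_entry() now distinguishes three cases instead of assuming a present entry: a present huge PMD resolves through follow_trans_huge_pmd(), a migration swap entry resolves through migration_entry_to_page(), and any other swap entry yields no page to account. A small dispatch sketch of that logic (the enum is an illustrative stand-in for the pmd_present()/is_swap_pmd()/is_migration_entry() tests on a real entry):

```c
/* Illustrative stand-in for the three outcomes of inspecting a pmd. */
enum pmd_kind { PMD_PRESENT, PMD_MIGRATION_ENTRY, PMD_OTHER_SWAP };

/* Models the new smaps_pmd_entry() dispatch: return the page id to
 * account, or -1 when the entry does not map back to a page. */
int smaps_resolve_page(enum pmd_kind kind, int page_id)
{
	switch (kind) {
	case PMD_PRESENT:
		return page_id;	/* follow_trans_huge_pmd() path */
	case PMD_MIGRATION_ENTRY:
		return page_id;	/* migration_entry_to_page() path */
	default:
		return -1;	/* other swap entries are skipped */
	}
}
```

Note the matching change in the caller: the pmd_present() test moves out of smaps_pmd_walk's lock section and into smaps_pmd_entry() itself, so migration entries are no longer silently dropped.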
+3 -4
fs/qnx6/inode.c
··· 99 99 return mpage_readpage(page, qnx6_get_block); 100 100 } 101 101 102 - static int qnx6_readpages(struct file *file, struct address_space *mapping, 103 - struct list_head *pages, unsigned nr_pages) 102 + static void qnx6_readahead(struct readahead_control *rac) 104 103 { 105 - return mpage_readpages(mapping, pages, nr_pages, qnx6_get_block); 104 + mpage_readahead(rac, qnx6_get_block); 106 105 } 107 106 108 107 /* ··· 498 499 } 499 500 static const struct address_space_operations qnx6_aops = { 500 501 .readpage = qnx6_readpage, 501 - .readpages = qnx6_readpages, 502 + .readahead = qnx6_readahead, 502 503 .bmap = qnx6_bmap 503 504 }; 504 505
+3 -5
fs/reiserfs/inode.c
··· 1160 1160 return retval; 1161 1161 } 1162 1162 1163 - static int 1164 - reiserfs_readpages(struct file *file, struct address_space *mapping, 1165 - struct list_head *pages, unsigned nr_pages) 1163 + static void reiserfs_readahead(struct readahead_control *rac) 1166 1164 { 1167 - return mpage_readpages(mapping, pages, nr_pages, reiserfs_get_block); 1165 + mpage_readahead(rac, reiserfs_get_block); 1168 1166 } 1169 1167 1170 1168 /* ··· 3432 3434 const struct address_space_operations reiserfs_address_space_operations = { 3433 3435 .writepage = reiserfs_writepage, 3434 3436 .readpage = reiserfs_readpage, 3435 - .readpages = reiserfs_readpages, 3437 + .readahead = reiserfs_readahead, 3436 3438 .releasepage = reiserfs_releasepage, 3437 3439 .invalidatepage = reiserfs_invalidatepage, 3438 3440 .write_begin = reiserfs_write_begin,
+144 -125
fs/squashfs/block.c
··· 13 13 * datablocks and metadata blocks. 14 14 */ 15 15 16 + #include <linux/blkdev.h> 16 17 #include <linux/fs.h> 17 18 #include <linux/vfs.h> 18 19 #include <linux/slab.h> ··· 28 27 #include "page_actor.h" 29 28 30 29 /* 31 - * Read the metadata block length, this is stored in the first two 32 - * bytes of the metadata block. 30 + * Returns the amount of bytes copied to the page actor. 33 31 */ 34 - static struct buffer_head *get_block_length(struct super_block *sb, 35 - u64 *cur_index, int *offset, int *length) 32 + static int copy_bio_to_actor(struct bio *bio, 33 + struct squashfs_page_actor *actor, 34 + int offset, int req_length) 36 35 { 37 - struct squashfs_sb_info *msblk = sb->s_fs_info; 38 - struct buffer_head *bh; 36 + void *actor_addr = squashfs_first_page(actor); 37 + struct bvec_iter_all iter_all = {}; 38 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 39 + int copied_bytes = 0; 40 + int actor_offset = 0; 39 41 40 - bh = sb_bread(sb, *cur_index); 41 - if (bh == NULL) 42 - return NULL; 42 + if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) 43 + return 0; 43 44 44 - if (msblk->devblksize - *offset == 1) { 45 - *length = (unsigned char) bh->b_data[*offset]; 46 - put_bh(bh); 47 - bh = sb_bread(sb, ++(*cur_index)); 48 - if (bh == NULL) 49 - return NULL; 50 - *length |= (unsigned char) bh->b_data[0] << 8; 51 - *offset = 1; 52 - } else { 53 - *length = (unsigned char) bh->b_data[*offset] | 54 - (unsigned char) bh->b_data[*offset + 1] << 8; 55 - *offset += 2; 45 + while (copied_bytes < req_length) { 46 + int bytes_to_copy = min_t(int, bvec->bv_len - offset, 47 + PAGE_SIZE - actor_offset); 56 48 57 - if (*offset == msblk->devblksize) { 58 - put_bh(bh); 59 - bh = sb_bread(sb, ++(*cur_index)); 60 - if (bh == NULL) 61 - return NULL; 62 - *offset = 0; 49 + bytes_to_copy = min_t(int, bytes_to_copy, 50 + req_length - copied_bytes); 51 + memcpy(actor_addr + actor_offset, 52 + page_address(bvec->bv_page) + bvec->bv_offset + offset, 53 + bytes_to_copy); 54 
+ 55 + actor_offset += bytes_to_copy; 56 + copied_bytes += bytes_to_copy; 57 + offset += bytes_to_copy; 58 + 59 + if (actor_offset >= PAGE_SIZE) { 60 + actor_addr = squashfs_next_page(actor); 61 + if (!actor_addr) 62 + break; 63 + actor_offset = 0; 64 + } 65 + if (offset >= bvec->bv_len) { 66 + if (!bio_next_segment(bio, &iter_all)) 67 + break; 68 + offset = 0; 63 69 } 64 70 } 65 - 66 - return bh; 71 + squashfs_finish_page(actor); 72 + return copied_bytes; 67 73 } 68 74 75 + static int squashfs_bio_read(struct super_block *sb, u64 index, int length, 76 + struct bio **biop, int *block_offset) 77 + { 78 + struct squashfs_sb_info *msblk = sb->s_fs_info; 79 + const u64 read_start = round_down(index, msblk->devblksize); 80 + const sector_t block = read_start >> msblk->devblksize_log2; 81 + const u64 read_end = round_up(index + length, msblk->devblksize); 82 + const sector_t block_end = read_end >> msblk->devblksize_log2; 83 + int offset = read_start - round_down(index, PAGE_SIZE); 84 + int total_len = (block_end - block) << msblk->devblksize_log2; 85 + const int page_count = DIV_ROUND_UP(total_len + offset, PAGE_SIZE); 86 + int error, i; 87 + struct bio *bio; 88 + 89 + bio = bio_alloc(GFP_NOIO, page_count); 90 + if (!bio) 91 + return -ENOMEM; 92 + 93 + bio_set_dev(bio, sb->s_bdev); 94 + bio->bi_opf = READ; 95 + bio->bi_iter.bi_sector = block * (msblk->devblksize >> SECTOR_SHIFT); 96 + 97 + for (i = 0; i < page_count; ++i) { 98 + unsigned int len = 99 + min_t(unsigned int, PAGE_SIZE - offset, total_len); 100 + struct page *page = alloc_page(GFP_NOIO); 101 + 102 + if (!page) { 103 + error = -ENOMEM; 104 + goto out_free_bio; 105 + } 106 + if (!bio_add_page(bio, page, len, offset)) { 107 + error = -EIO; 108 + goto out_free_bio; 109 + } 110 + offset = 0; 111 + total_len -= len; 112 + } 113 + 114 + error = submit_bio_wait(bio); 115 + if (error) 116 + goto out_free_bio; 117 + 118 + *biop = bio; 119 + *block_offset = index & ((1 << msblk->devblksize_log2) - 1); 120 + return 0; 
121 + 122 + out_free_bio: 123 + bio_free_pages(bio); 124 + bio_put(bio); 125 + return error; 126 + } 69 127 70 128 /* 71 129 * Read and decompress a metadata block or datablock. Length is non-zero ··· 136 76 * algorithms). 137 77 */ 138 78 int squashfs_read_data(struct super_block *sb, u64 index, int length, 139 - u64 *next_index, struct squashfs_page_actor *output) 79 + u64 *next_index, struct squashfs_page_actor *output) 140 80 { 141 81 struct squashfs_sb_info *msblk = sb->s_fs_info; 142 - struct buffer_head **bh; 143 - int offset = index & ((1 << msblk->devblksize_log2) - 1); 144 - u64 cur_index = index >> msblk->devblksize_log2; 145 - int bytes, compressed, b = 0, k = 0, avail, i; 146 - 147 - bh = kcalloc(((output->length + msblk->devblksize - 1) 148 - >> msblk->devblksize_log2) + 1, sizeof(*bh), GFP_KERNEL); 149 - if (bh == NULL) 150 - return -ENOMEM; 82 + struct bio *bio = NULL; 83 + int compressed; 84 + int res; 85 + int offset; 151 86 152 87 if (length) { 153 88 /* 154 89 * Datablock. 155 90 */ 156 - bytes = -offset; 157 91 compressed = SQUASHFS_COMPRESSED_BLOCK(length); 158 92 length = SQUASHFS_COMPRESSED_SIZE_BLOCK(length); 159 - if (next_index) 160 - *next_index = index + length; 161 - 162 93 TRACE("Block @ 0x%llx, %scompressed size %d, src size %d\n", 163 94 index, compressed ? "" : "un", length, output->length); 164 - 165 - if (length < 0 || length > output->length || 166 - (index + length) > msblk->bytes_used) 167 - goto read_failure; 168 - 169 - for (b = 0; bytes < length; b++, cur_index++) { 170 - bh[b] = sb_getblk(sb, cur_index); 171 - if (bh[b] == NULL) 172 - goto block_release; 173 - bytes += msblk->devblksize; 174 - } 175 - ll_rw_block(REQ_OP_READ, 0, b, bh); 176 95 } else { 177 96 /* 178 97 * Metadata block. 
179 98 */ 180 - if ((index + 2) > msblk->bytes_used) 181 - goto read_failure; 99 + const u8 *data; 100 + struct bvec_iter_all iter_all = {}; 101 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 182 102 183 - bh[0] = get_block_length(sb, &cur_index, &offset, &length); 184 - if (bh[0] == NULL) 185 - goto read_failure; 186 - b = 1; 103 + if (index + 2 > msblk->bytes_used) { 104 + res = -EIO; 105 + goto out; 106 + } 107 + res = squashfs_bio_read(sb, index, 2, &bio, &offset); 108 + if (res) 109 + goto out; 187 110 188 - bytes = msblk->devblksize - offset; 111 + if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) { 112 + res = -EIO; 113 + goto out_free_bio; 114 + } 115 + /* Extract the length of the metadata block */ 116 + data = page_address(bvec->bv_page) + bvec->bv_offset; 117 + length = data[offset]; 118 + if (offset <= bvec->bv_len - 1) { 119 + length |= data[offset + 1] << 8; 120 + } else { 121 + if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) { 122 + res = -EIO; 123 + goto out_free_bio; 124 + } 125 + data = page_address(bvec->bv_page) + bvec->bv_offset; 126 + length |= data[0] << 8; 127 + } 128 + bio_free_pages(bio); 129 + bio_put(bio); 130 + 189 131 compressed = SQUASHFS_COMPRESSED(length); 190 132 length = SQUASHFS_COMPRESSED_SIZE(length); 191 - if (next_index) 192 - *next_index = index + length + 2; 133 + index += 2; 193 134 194 135 TRACE("Block @ 0x%llx, %scompressed size %d\n", index, 195 - compressed ? "" : "un", length); 196 - 197 - if (length < 0 || length > output->length || 198 - (index + length) > msblk->bytes_used) 199 - goto block_release; 200 - 201 - for (; bytes < length; b++) { 202 - bh[b] = sb_getblk(sb, ++cur_index); 203 - if (bh[b] == NULL) 204 - goto block_release; 205 - bytes += msblk->devblksize; 206 - } 207 - ll_rw_block(REQ_OP_READ, 0, b - 1, bh + 1); 136 + compressed ? 
"" : "un", length); 208 137 } 138 + if (next_index) 139 + *next_index = index + length; 209 140 210 - for (i = 0; i < b; i++) { 211 - wait_on_buffer(bh[i]); 212 - if (!buffer_uptodate(bh[i])) 213 - goto block_release; 214 - } 141 + res = squashfs_bio_read(sb, index, length, &bio, &offset); 142 + if (res) 143 + goto out; 215 144 216 145 if (compressed) { 217 - if (!msblk->stream) 218 - goto read_failure; 219 - length = squashfs_decompress(msblk, bh, b, offset, length, 220 - output); 221 - if (length < 0) 222 - goto read_failure; 223 - } else { 224 - /* 225 - * Block is uncompressed. 226 - */ 227 - int in, pg_offset = 0; 228 - void *data = squashfs_first_page(output); 229 - 230 - for (bytes = length; k < b; k++) { 231 - in = min(bytes, msblk->devblksize - offset); 232 - bytes -= in; 233 - while (in) { 234 - if (pg_offset == PAGE_SIZE) { 235 - data = squashfs_next_page(output); 236 - pg_offset = 0; 237 - } 238 - avail = min_t(int, in, PAGE_SIZE - 239 - pg_offset); 240 - memcpy(data + pg_offset, bh[k]->b_data + offset, 241 - avail); 242 - in -= avail; 243 - pg_offset += avail; 244 - offset += avail; 245 - } 246 - offset = 0; 247 - put_bh(bh[k]); 146 + if (!msblk->stream) { 147 + res = -EIO; 148 + goto out_free_bio; 248 149 } 249 - squashfs_finish_page(output); 150 + res = squashfs_decompress(msblk, bio, offset, length, output); 151 + } else { 152 + res = copy_bio_to_actor(bio, output, offset, length); 250 153 } 251 154 252 - kfree(bh); 253 - return length; 155 + out_free_bio: 156 + bio_free_pages(bio); 157 + bio_put(bio); 158 + out: 159 + if (res < 0) 160 + ERROR("Failed to read block 0x%llx: %d\n", index, res); 254 161 255 - block_release: 256 - for (; k < b; k++) 257 - put_bh(bh[k]); 258 - 259 - read_failure: 260 - ERROR("squashfs_read_data failed to read block 0x%llx\n", 261 - (unsigned long long) index); 262 - kfree(bh); 263 - return -EIO; 162 + return res; 264 163 }
+3 -2
fs/squashfs/decompressor.h
··· 10 10 * decompressor.h 11 11 */ 12 12 13 + #include <linux/bio.h> 14 + 13 15 struct squashfs_decompressor { 14 16 void *(*init)(struct squashfs_sb_info *, void *); 15 17 void *(*comp_opts)(struct squashfs_sb_info *, void *, int); 16 18 void (*free)(void *); 17 19 int (*decompress)(struct squashfs_sb_info *, void *, 18 - struct buffer_head **, int, int, int, 19 - struct squashfs_page_actor *); 20 + struct bio *, int, int, struct squashfs_page_actor *); 20 21 int id; 21 22 char *name; 22 23 int supported;
+5 -4
fs/squashfs/decompressor_multi.c
··· 6 6 #include <linux/types.h> 7 7 #include <linux/mutex.h> 8 8 #include <linux/slab.h> 9 - #include <linux/buffer_head.h> 9 + #include <linux/bio.h> 10 10 #include <linux/sched.h> 11 11 #include <linux/wait.h> 12 12 #include <linux/cpumask.h> ··· 180 180 } 181 181 182 182 183 - int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh, 184 - int b, int offset, int length, struct squashfs_page_actor *output) 183 + int squashfs_decompress(struct squashfs_sb_info *msblk, struct bio *bio, 184 + int offset, int length, 185 + struct squashfs_page_actor *output) 185 186 { 186 187 int res; 187 188 struct squashfs_stream *stream = msblk->stream; 188 189 struct decomp_stream *decomp_stream = get_decomp_stream(msblk, stream); 189 190 res = msblk->decompressor->decompress(msblk, decomp_stream->stream, 190 - bh, b, offset, length, output); 191 + bio, offset, length, output); 191 192 put_decomp_stream(decomp_stream, stream); 192 193 if (res < 0) 193 194 ERROR("%s decompression failed, data probably corrupt\n",
+4 -4
fs/squashfs/decompressor_multi_percpu.c
··· 75 75 } 76 76 } 77 77 78 - int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh, 79 - int b, int offset, int length, struct squashfs_page_actor *output) 78 + int squashfs_decompress(struct squashfs_sb_info *msblk, struct bio *bio, 79 + int offset, int length, struct squashfs_page_actor *output) 80 80 { 81 81 struct squashfs_stream *stream; 82 82 int res; ··· 84 84 local_lock(&msblk->stream->lock); 85 85 stream = this_cpu_ptr(msblk->stream); 86 86 87 - res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, 88 - offset, length, output); 87 + res = msblk->decompressor->decompress(msblk, stream->stream, bio, 88 + offset, length, output); 89 89 90 90 local_unlock(&msblk->stream->lock); 91 91
+5 -4
fs/squashfs/decompressor_single.c
··· 7 7 #include <linux/types.h> 8 8 #include <linux/mutex.h> 9 9 #include <linux/slab.h> 10 - #include <linux/buffer_head.h> 10 + #include <linux/bio.h> 11 11 12 12 #include "squashfs_fs.h" 13 13 #include "squashfs_fs_sb.h" ··· 59 59 } 60 60 } 61 61 62 - int squashfs_decompress(struct squashfs_sb_info *msblk, struct buffer_head **bh, 63 - int b, int offset, int length, struct squashfs_page_actor *output) 62 + int squashfs_decompress(struct squashfs_sb_info *msblk, struct bio *bio, 63 + int offset, int length, 64 + struct squashfs_page_actor *output) 64 65 { 65 66 int res; 66 67 struct squashfs_stream *stream = msblk->stream; 67 68 68 69 mutex_lock(&stream->mutex); 69 - res = msblk->decompressor->decompress(msblk, stream->stream, bh, b, 70 + res = msblk->decompressor->decompress(msblk, stream->stream, bio, 70 71 offset, length, output); 71 72 mutex_unlock(&stream->mutex); 72 73
+10 -7
fs/squashfs/lz4_wrapper.c
··· 4 4 * Phillip Lougher <phillip@squashfs.org.uk> 5 5 */ 6 6 7 - #include <linux/buffer_head.h> 7 + #include <linux/bio.h> 8 8 #include <linux/mutex.h> 9 9 #include <linux/slab.h> 10 10 #include <linux/vmalloc.h> ··· 89 89 90 90 91 91 static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm, 92 - struct buffer_head **bh, int b, int offset, int length, 92 + struct bio *bio, int offset, int length, 93 93 struct squashfs_page_actor *output) 94 94 { 95 + struct bvec_iter_all iter_all = {}; 96 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 95 97 struct squashfs_lz4 *stream = strm; 96 98 void *buff = stream->input, *data; 97 - int avail, i, bytes = length, res; 99 + int bytes = length, res; 98 100 99 - for (i = 0; i < b; i++) { 100 - avail = min(bytes, msblk->devblksize - offset); 101 - memcpy(buff, bh[i]->b_data + offset, avail); 101 + while (bio_next_segment(bio, &iter_all)) { 102 + int avail = min(bytes, ((int)bvec->bv_len) - offset); 103 + 104 + data = page_address(bvec->bv_page) + bvec->bv_offset; 105 + memcpy(buff, data + offset, avail); 102 106 buff += avail; 103 107 bytes -= avail; 104 108 offset = 0; 105 - put_bh(bh[i]); 106 109 } 107 110 108 111 res = LZ4_decompress_safe(stream->input, stream->output,
+10 -7
fs/squashfs/lzo_wrapper.c
··· 9 9 */ 10 10 11 11 #include <linux/mutex.h> 12 - #include <linux/buffer_head.h> 12 + #include <linux/bio.h> 13 13 #include <linux/slab.h> 14 14 #include <linux/vmalloc.h> 15 15 #include <linux/lzo.h> ··· 63 63 64 64 65 65 static int lzo_uncompress(struct squashfs_sb_info *msblk, void *strm, 66 - struct buffer_head **bh, int b, int offset, int length, 66 + struct bio *bio, int offset, int length, 67 67 struct squashfs_page_actor *output) 68 68 { 69 + struct bvec_iter_all iter_all = {}; 70 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 69 71 struct squashfs_lzo *stream = strm; 70 72 void *buff = stream->input, *data; 71 - int avail, i, bytes = length, res; 73 + int bytes = length, res; 72 74 size_t out_len = output->length; 73 75 74 - for (i = 0; i < b; i++) { 75 - avail = min(bytes, msblk->devblksize - offset); 76 - memcpy(buff, bh[i]->b_data + offset, avail); 76 + while (bio_next_segment(bio, &iter_all)) { 77 + int avail = min(bytes, ((int)bvec->bv_len) - offset); 78 + 79 + data = page_address(bvec->bv_page) + bvec->bv_offset; 80 + memcpy(buff, data + offset, avail); 77 81 buff += avail; 78 82 bytes -= avail; 79 83 offset = 0; 80 - put_bh(bh[i]); 81 84 } 82 85 83 86 res = lzo1x_decompress_safe(stream->input, (size_t)length,
+2 -2
fs/squashfs/squashfs.h
··· 40 40 /* decompressor_xxx.c */ 41 41 extern void *squashfs_decompressor_create(struct squashfs_sb_info *, void *); 42 42 extern void squashfs_decompressor_destroy(struct squashfs_sb_info *); 43 - extern int squashfs_decompress(struct squashfs_sb_info *, struct buffer_head **, 44 - int, int, int, struct squashfs_page_actor *); 43 + extern int squashfs_decompress(struct squashfs_sb_info *, struct bio *, 44 + int, int, struct squashfs_page_actor *); 45 45 extern int squashfs_max_decompressors(void); 46 46 47 47 /* export.c */
+29 -22
fs/squashfs/xz_wrapper.c
··· 10 10 11 11 12 12 #include <linux/mutex.h> 13 - #include <linux/buffer_head.h> 13 + #include <linux/bio.h> 14 14 #include <linux/slab.h> 15 15 #include <linux/xz.h> 16 16 #include <linux/bitops.h> ··· 117 117 118 118 119 119 static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm, 120 - struct buffer_head **bh, int b, int offset, int length, 120 + struct bio *bio, int offset, int length, 121 121 struct squashfs_page_actor *output) 122 122 { 123 - enum xz_ret xz_err; 124 - int avail, total = 0, k = 0; 123 + struct bvec_iter_all iter_all = {}; 124 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 125 + int total = 0, error = 0; 125 126 struct squashfs_xz *stream = strm; 126 127 127 128 xz_dec_reset(stream->state); ··· 132 131 stream->buf.out_size = PAGE_SIZE; 133 132 stream->buf.out = squashfs_first_page(output); 134 133 135 - do { 136 - if (stream->buf.in_pos == stream->buf.in_size && k < b) { 137 - avail = min(length, msblk->devblksize - offset); 134 + for (;;) { 135 + enum xz_ret xz_err; 136 + 137 + if (stream->buf.in_pos == stream->buf.in_size) { 138 + const void *data; 139 + int avail; 140 + 141 + if (!bio_next_segment(bio, &iter_all)) { 142 + /* XZ_STREAM_END must be reached. 
*/ 143 + error = -EIO; 144 + break; 145 + } 146 + 147 + avail = min(length, ((int)bvec->bv_len) - offset); 148 + data = page_address(bvec->bv_page) + bvec->bv_offset; 138 149 length -= avail; 139 - stream->buf.in = bh[k]->b_data + offset; 150 + stream->buf.in = data + offset; 140 151 stream->buf.in_size = avail; 141 152 stream->buf.in_pos = 0; 142 153 offset = 0; ··· 163 150 } 164 151 165 152 xz_err = xz_dec_run(stream->state, &stream->buf); 166 - 167 - if (stream->buf.in_pos == stream->buf.in_size && k < b) 168 - put_bh(bh[k++]); 169 - } while (xz_err == XZ_OK); 153 + if (xz_err == XZ_STREAM_END) 154 + break; 155 + if (xz_err != XZ_OK) { 156 + error = -EIO; 157 + break; 158 + } 159 + } 170 160 171 161 squashfs_finish_page(output); 172 162 173 - if (xz_err != XZ_STREAM_END || k < b) 174 - goto out; 175 - 176 - return total + stream->buf.out_pos; 177 - 178 - out: 179 - for (; k < b; k++) 180 - put_bh(bh[k]); 181 - 182 - return -EIO; 163 + return error ? error : total + stream->buf.out_pos; 183 164 } 184 165 185 166 const struct squashfs_decompressor squashfs_xz_comp_ops = {
+34 -29
fs/squashfs/zlib_wrapper.c
··· 10 10 11 11 12 12 #include <linux/mutex.h> 13 - #include <linux/buffer_head.h> 13 + #include <linux/bio.h> 14 14 #include <linux/slab.h> 15 15 #include <linux/zlib.h> 16 16 #include <linux/vmalloc.h> ··· 50 50 51 51 52 52 static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm, 53 - struct buffer_head **bh, int b, int offset, int length, 53 + struct bio *bio, int offset, int length, 54 54 struct squashfs_page_actor *output) 55 55 { 56 - int zlib_err, zlib_init = 0, k = 0; 56 + struct bvec_iter_all iter_all = {}; 57 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 58 + int zlib_init = 0, error = 0; 57 59 z_stream *stream = strm; 58 60 59 61 stream->avail_out = PAGE_SIZE; 60 62 stream->next_out = squashfs_first_page(output); 61 63 stream->avail_in = 0; 62 64 63 - do { 64 - if (stream->avail_in == 0 && k < b) { 65 - int avail = min(length, msblk->devblksize - offset); 65 + for (;;) { 66 + int zlib_err; 67 + 68 + if (stream->avail_in == 0) { 69 + const void *data; 70 + int avail; 71 + 72 + if (!bio_next_segment(bio, &iter_all)) { 73 + /* Z_STREAM_END must be reached. 
*/ 74 + error = -EIO; 75 + break; 76 + } 77 + 78 + avail = min(length, ((int)bvec->bv_len) - offset); 79 + data = page_address(bvec->bv_page) + bvec->bv_offset; 66 80 length -= avail; 67 - stream->next_in = bh[k]->b_data + offset; 81 + stream->next_in = data + offset; 68 82 stream->avail_in = avail; 69 83 offset = 0; 70 84 } ··· 92 78 if (!zlib_init) { 93 79 zlib_err = zlib_inflateInit(stream); 94 80 if (zlib_err != Z_OK) { 95 - squashfs_finish_page(output); 96 - goto out; 81 + error = -EIO; 82 + break; 97 83 } 98 84 zlib_init = 1; 99 85 } 100 86 101 87 zlib_err = zlib_inflate(stream, Z_SYNC_FLUSH); 102 - 103 - if (stream->avail_in == 0 && k < b) 104 - put_bh(bh[k++]); 105 - } while (zlib_err == Z_OK); 88 + if (zlib_err == Z_STREAM_END) 89 + break; 90 + if (zlib_err != Z_OK) { 91 + error = -EIO; 92 + break; 93 + } 94 + } 106 95 107 96 squashfs_finish_page(output); 108 97 109 - if (zlib_err != Z_STREAM_END) 110 - goto out; 98 + if (!error) 99 + if (zlib_inflateEnd(stream) != Z_OK) 100 + error = -EIO; 111 101 112 - zlib_err = zlib_inflateEnd(stream); 113 - if (zlib_err != Z_OK) 114 - goto out; 115 - 116 - if (k < b) 117 - goto out; 118 - 119 - return stream->total_out; 120 - 121 - out: 122 - for (; k < b; k++) 123 - put_bh(bh[k]); 124 - 125 - return -EIO; 102 + return error ? error : stream->total_out; 126 103 } 127 104 128 105 const struct squashfs_decompressor squashfs_zlib_comp_ops = {
+32 -30
fs/squashfs/zstd_wrapper.c
··· 9 9 */ 10 10 11 11 #include <linux/mutex.h> 12 - #include <linux/buffer_head.h> 12 + #include <linux/bio.h> 13 13 #include <linux/slab.h> 14 14 #include <linux/zstd.h> 15 15 #include <linux/vmalloc.h> ··· 59 59 60 60 61 61 static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm, 62 - struct buffer_head **bh, int b, int offset, int length, 62 + struct bio *bio, int offset, int length, 63 63 struct squashfs_page_actor *output) 64 64 { 65 65 struct workspace *wksp = strm; 66 66 ZSTD_DStream *stream; 67 67 size_t total_out = 0; 68 - size_t zstd_err; 69 - int k = 0; 68 + int error = 0; 70 69 ZSTD_inBuffer in_buf = { NULL, 0, 0 }; 71 70 ZSTD_outBuffer out_buf = { NULL, 0, 0 }; 71 + struct bvec_iter_all iter_all = {}; 72 + struct bio_vec *bvec = bvec_init_iter_all(&iter_all); 72 73 73 74 stream = ZSTD_initDStream(wksp->window_size, wksp->mem, wksp->mem_size); 74 75 75 76 if (!stream) { 76 77 ERROR("Failed to initialize zstd decompressor\n"); 77 - goto out; 78 + return -EIO; 78 79 } 79 80 80 81 out_buf.size = PAGE_SIZE; 81 82 out_buf.dst = squashfs_first_page(output); 82 83 83 - do { 84 - if (in_buf.pos == in_buf.size && k < b) { 85 - int avail = min(length, msblk->devblksize - offset); 84 + for (;;) { 85 + size_t zstd_err; 86 86 87 + if (in_buf.pos == in_buf.size) { 88 + const void *data; 89 + int avail; 90 + 91 + if (!bio_next_segment(bio, &iter_all)) { 92 + error = -EIO; 93 + break; 94 + } 95 + 96 + avail = min(length, ((int)bvec->bv_len) - offset); 97 + data = page_address(bvec->bv_page) + bvec->bv_offset; 87 98 length -= avail; 88 - in_buf.src = bh[k]->b_data + offset; 99 + in_buf.src = data + offset; 89 100 in_buf.size = avail; 90 101 in_buf.pos = 0; 91 102 offset = 0; ··· 108 97 /* Shouldn't run out of pages 109 98 * before stream is done. 
110 99 */ 111 - squashfs_finish_page(output); 112 - goto out; 100 + error = -EIO; 101 + break; 113 102 } 114 103 out_buf.pos = 0; 115 104 out_buf.size = PAGE_SIZE; ··· 118 107 total_out -= out_buf.pos; 119 108 zstd_err = ZSTD_decompressStream(stream, &out_buf, &in_buf); 120 109 total_out += out_buf.pos; /* add the additional data produced */ 110 + if (zstd_err == 0) 111 + break; 121 112 122 - if (in_buf.pos == in_buf.size && k < b) 123 - put_bh(bh[k++]); 124 - } while (zstd_err != 0 && !ZSTD_isError(zstd_err)); 113 + if (ZSTD_isError(zstd_err)) { 114 + ERROR("zstd decompression error: %d\n", 115 + (int)ZSTD_getErrorCode(zstd_err)); 116 + error = -EIO; 117 + break; 118 + } 119 + } 125 120 126 121 squashfs_finish_page(output); 127 122 128 - if (ZSTD_isError(zstd_err)) { 129 - ERROR("zstd decompression error: %d\n", 130 - (int)ZSTD_getErrorCode(zstd_err)); 131 - goto out; 132 - } 133 - 134 - if (k < b) 135 - goto out; 136 - 137 - return (int)total_out; 138 - 139 - out: 140 - for (; k < b; k++) 141 - put_bh(bh[k]); 142 - 143 - return -EIO; 123 + return error ? error : total_out; 144 124 } 145 125 146 126 const struct squashfs_decompressor squashfs_zstd_comp_ops = {
+4 -2
fs/sync.c
··· 161 161 { 162 162 struct fd f = fdget(fd); 163 163 struct super_block *sb; 164 - int ret; 164 + int ret, ret2; 165 165 166 166 if (!f.file) 167 167 return -EBADF; ··· 171 171 ret = sync_filesystem(sb); 172 172 up_read(&sb->s_umount); 173 173 174 + ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err); 175 + 174 176 fdput(f); 175 - return ret; 177 + return ret ? ret : ret2; 176 178 } 177 179 178 180 /**
+1 -1
fs/ubifs/debug.c
··· 815 815 816 816 pr_err("(pid %d) start dumping LEB %d\n", current->pid, lnum); 817 817 818 - buf = __vmalloc(c->leb_size, GFP_NOFS, PAGE_KERNEL); 818 + buf = __vmalloc(c->leb_size, GFP_NOFS); 819 819 if (!buf) { 820 820 ubifs_err(c, "cannot allocate memory for dumping LEB %d", lnum); 821 821 return;
+1 -1
fs/ubifs/lprops.c
··· 1095 1095 return LPT_SCAN_CONTINUE; 1096 1096 } 1097 1097 1098 - buf = __vmalloc(c->leb_size, GFP_NOFS, PAGE_KERNEL); 1098 + buf = __vmalloc(c->leb_size, GFP_NOFS); 1099 1099 if (!buf) 1100 1100 return -ENOMEM; 1101 1101
+2 -2
fs/ubifs/lpt_commit.c
··· 1596 1596 if (!dbg_is_chk_lprops(c)) 1597 1597 return 0; 1598 1598 1599 - buf = p = __vmalloc(c->leb_size, GFP_NOFS, PAGE_KERNEL); 1599 + buf = p = __vmalloc(c->leb_size, GFP_NOFS); 1600 1600 if (!buf) { 1601 1601 ubifs_err(c, "cannot allocate memory for ltab checking"); 1602 1602 return 0; ··· 1845 1845 void *buf, *p; 1846 1846 1847 1847 pr_err("(pid %d) start dumping LEB %d\n", current->pid, lnum); 1848 - buf = p = __vmalloc(c->leb_size, GFP_NOFS, PAGE_KERNEL); 1848 + buf = p = __vmalloc(c->leb_size, GFP_NOFS); 1849 1849 if (!buf) { 1850 1850 ubifs_err(c, "cannot allocate memory to dump LPT"); 1851 1851 return;
+1 -1
fs/ubifs/orphan.c
··· 977 977 if (c->no_orphs) 978 978 return 0; 979 979 980 - buf = __vmalloc(c->leb_size, GFP_NOFS, PAGE_KERNEL); 980 + buf = __vmalloc(c->leb_size, GFP_NOFS); 981 981 if (!buf) { 982 982 ubifs_err(c, "cannot allocate memory to check orphans"); 983 983 return 0;
+3 -4
fs/udf/inode.c
··· 195 195 return mpage_readpage(page, udf_get_block); 196 196 } 197 197 198 - static int udf_readpages(struct file *file, struct address_space *mapping, 199 - struct list_head *pages, unsigned nr_pages) 198 + static void udf_readahead(struct readahead_control *rac) 200 199 { 201 - return mpage_readpages(mapping, pages, nr_pages, udf_get_block); 200 + mpage_readahead(rac, udf_get_block); 202 201 } 203 202 204 203 static int udf_write_begin(struct file *file, struct address_space *mapping, ··· 233 234 234 235 const struct address_space_operations udf_aops = { 235 236 .readpage = udf_readpage, 236 - .readpages = udf_readpages, 237 + .readahead = udf_readahead, 237 238 .writepage = udf_writepage, 238 239 .writepages = udf_writepages, 239 240 .write_begin = udf_write_begin,
+1 -1
fs/xfs/kmem.c
··· 48 48 if (flags & KM_NOFS) 49 49 nofs_flag = memalloc_nofs_save(); 50 50 51 - ptr = __vmalloc(size, lflags, PAGE_KERNEL); 51 + ptr = __vmalloc(size, lflags); 52 52 53 53 if (flags & KM_NOFS) 54 54 memalloc_nofs_restore(nofs_flag);
+5 -8
fs/xfs/xfs_aops.c
··· 621 621 return iomap_readpage(page, &xfs_read_iomap_ops); 622 622 } 623 623 624 - STATIC int 625 - xfs_vm_readpages( 626 - struct file *unused, 627 - struct address_space *mapping, 628 - struct list_head *pages, 629 - unsigned nr_pages) 624 + STATIC void 625 + xfs_vm_readahead( 626 + struct readahead_control *rac) 630 627 { 631 - return iomap_readpages(mapping, pages, nr_pages, &xfs_read_iomap_ops); 628 + iomap_readahead(rac, &xfs_read_iomap_ops); 632 629 } 633 630 634 631 static int ··· 641 644 642 645 const struct address_space_operations xfs_address_space_operations = { 643 646 .readpage = xfs_vm_readpage, 644 - .readpages = xfs_vm_readpages, 647 + .readahead = xfs_vm_readahead, 645 648 .writepage = xfs_vm_writepage, 646 649 .writepages = xfs_vm_writepages, 647 650 .set_page_dirty = iomap_set_page_dirty,
+1 -1
fs/xfs/xfs_buf.c
··· 477 477 nofs_flag = memalloc_nofs_save(); 478 478 do { 479 479 bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, 480 - -1, PAGE_KERNEL); 480 + -1); 481 481 if (bp->b_addr) 482 482 break; 483 483 vm_unmap_aliases();
+3 -4
fs/zonefs/super.c
··· 78 78 return iomap_readpage(page, &zonefs_iomap_ops); 79 79 } 80 80 81 - static int zonefs_readpages(struct file *unused, struct address_space *mapping, 82 - struct list_head *pages, unsigned int nr_pages) 81 + static void zonefs_readahead(struct readahead_control *rac) 83 82 { 84 - return iomap_readpages(mapping, pages, nr_pages, &zonefs_iomap_ops); 83 + iomap_readahead(rac, &zonefs_iomap_ops); 85 84 } 86 85 87 86 /* ··· 127 128 128 129 static const struct address_space_operations zonefs_file_aops = { 129 130 .readpage = zonefs_readpage, 130 - .readpages = zonefs_readpages, 131 + .readahead = zonefs_readahead, 131 132 .writepage = zonefs_writepage, 132 133 .writepages = zonefs_writepages, 133 134 .set_page_dirty = iomap_set_page_dirty,
+3 -2
include/asm-generic/5level-fixup.h
··· 17 17 ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \ 18 18 NULL : pud_offset(p4d, address)) 19 19 20 - #define p4d_alloc(mm, pgd, address) (pgd) 21 - #define p4d_offset(pgd, start) (pgd) 20 + #define p4d_alloc(mm, pgd, address) (pgd) 21 + #define p4d_alloc_track(mm, pgd, address, mask) (pgd) 22 + #define p4d_offset(pgd, start) (pgd) 22 23 23 24 #ifndef __ASSEMBLY__ 24 25 static inline int p4d_none(p4d_t p4d)
+27
include/asm-generic/pgtable.h
··· 491 491 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address) 492 492 #endif 493 493 494 + #ifndef pgprot_nx 495 + #define pgprot_nx(prot) (prot) 496 + #endif 497 + 494 498 #ifndef pgprot_noncached 495 499 #define pgprot_noncached(prot) (prot) 496 500 #endif ··· 1212 1208 #ifndef PAGE_KERNEL_EXEC 1213 1209 # define PAGE_KERNEL_EXEC PAGE_KERNEL 1214 1210 #endif 1211 + 1212 + /* 1213 + * Page Table Modification bits for pgtbl_mod_mask. 1214 + * 1215 + * These are used by the p?d_alloc_track*() set of functions and in the generic 1216 + * vmalloc/ioremap code to track at which page-table levels entries have been 1217 + * modified. Based on that, the code can better decide when vmalloc and ioremap 1218 + * mapping changes need to be synchronized to other page-tables in the system. 1219 + */ 1220 + #define __PGTBL_PGD_MODIFIED 0 1221 + #define __PGTBL_P4D_MODIFIED 1 1222 + #define __PGTBL_PUD_MODIFIED 2 1223 + #define __PGTBL_PMD_MODIFIED 3 1224 + #define __PGTBL_PTE_MODIFIED 4 1225 + 1226 + #define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED) 1227 + #define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED) 1228 + #define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED) 1229 + #define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED) 1230 + #define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED) 1231 + 1232 + /* Page-Table Modification Mask */ 1233 + typedef unsigned int pgtbl_mod_mask; 1215 1234 1216 1235 #endif /* !__ASSEMBLY__ */ 1217 1236
-8
include/linux/buffer_head.h
··· 272 272 * inline definitions 273 273 */ 274 274 275 - static inline void attach_page_buffers(struct page *page, 276 - struct buffer_head *head) 277 - { 278 - get_page(page); 279 - SetPagePrivate(page); 280 - set_page_private(page, (unsigned long)head); 281 - } 282 - 283 275 static inline void get_bh(struct buffer_head *bh) 284 276 { 285 277 atomic_inc(&bh->b_count);
+18
include/linux/fs.h
··· 292 292 struct page; 293 293 struct address_space; 294 294 struct writeback_control; 295 + struct readahead_control; 295 296 296 297 /* 297 298 * Write life time hint values. ··· 376 375 */ 377 376 int (*readpages)(struct file *filp, struct address_space *mapping, 378 377 struct list_head *pages, unsigned nr_pages); 378 + void (*readahead)(struct readahead_control *); 379 379 380 380 int (*write_begin)(struct file *, struct address_space *mapping, 381 381 loff_t pos, unsigned len, unsigned flags, ··· 978 976 #endif /* #ifdef CONFIG_EPOLL */ 979 977 struct address_space *f_mapping; 980 978 errseq_t f_wb_err; 979 + errseq_t f_sb_err; /* for syncfs */ 981 980 } __randomize_layout 982 981 __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */ 983 982 ··· 1522 1519 1523 1520 /* Being remounted read-only */ 1524 1521 int s_readonly_remount; 1522 + 1523 + /* per-sb errseq_t for reporting writeback errors via syncfs */ 1524 + errseq_t s_wb_err; 1525 1525 1526 1526 /* AIO completions deferred from interrupt context */ 1527 1527 struct workqueue_struct *s_dio_done_wq; ··· 2835 2829 static inline errseq_t filemap_sample_wb_err(struct address_space *mapping) 2836 2830 { 2837 2831 return errseq_sample(&mapping->wb_err); 2832 + } 2833 + 2834 + /** 2835 + * file_sample_sb_err - sample the current errseq_t to test for later errors 2836 + * @file: file pointer to be sampled 2837 + * 2838 + * Grab the most current superblock-level errseq_t value for the given 2839 + * struct file. 2840 + */ 2841 + static inline errseq_t file_sample_sb_err(struct file *file) 2842 + { 2843 + return errseq_sample(&file->f_path.dentry->d_sb->s_wb_err); 2838 2844 } 2839 2845 2840 2846 static inline int filemap_nr_thps(struct address_space *mapping)
+1 -2
include/linux/iomap.h
··· 155 155 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, 156 156 const struct iomap_ops *ops); 157 157 int iomap_readpage(struct page *page, const struct iomap_ops *ops); 158 - int iomap_readpages(struct address_space *mapping, struct list_head *pages, 159 - unsigned nr_pages, const struct iomap_ops *ops); 158 + void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops); 160 159 int iomap_set_page_dirty(struct page *page); 161 160 int iomap_is_partially_uptodate(struct page *page, unsigned long from, 162 161 unsigned long count);
+1 -3
include/linux/memcontrol.h
··· 45 45 MEMCG_MAX, 46 46 MEMCG_OOM, 47 47 MEMCG_OOM_KILL, 48 + MEMCG_SWAP_HIGH, 48 49 MEMCG_SWAP_MAX, 49 50 MEMCG_SWAP_FAIL, 50 51 MEMCG_NR_MEMORY_EVENTS, ··· 215 214 struct page_counter memsw; 216 215 struct page_counter kmem; 217 216 struct page_counter tcpmem; 218 - 219 - /* Upper bound of normal memory consumption range */ 220 - unsigned long high; 221 217 222 218 /* Range enforcement for interrupt charges */ 223 219 struct work_struct high_work;
+48 -19
include/linux/mm.h
··· 1709 1709 unsigned int gup_flags, struct page **pages, int *locked); 1710 1710 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, 1711 1711 struct page **pages, unsigned int gup_flags); 1712 + long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, 1713 + struct page **pages, unsigned int gup_flags); 1712 1714 1713 1715 int get_user_pages_fast(unsigned long start, int nr_pages, 1714 1716 unsigned int gup_flags, struct page **pages); ··· 2087 2085 return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ? 2088 2086 NULL : pud_offset(p4d, address); 2089 2087 } 2088 + 2089 + static inline p4d_t *p4d_alloc_track(struct mm_struct *mm, pgd_t *pgd, 2090 + unsigned long address, 2091 + pgtbl_mod_mask *mod_mask) 2092 + 2093 + { 2094 + if (unlikely(pgd_none(*pgd))) { 2095 + if (__p4d_alloc(mm, pgd, address)) 2096 + return NULL; 2097 + *mod_mask |= PGTBL_PGD_MODIFIED; 2098 + } 2099 + 2100 + return p4d_offset(pgd, address); 2101 + } 2102 + 2090 2103 #endif /* !__ARCH_HAS_5LEVEL_HACK */ 2104 + 2105 + static inline pud_t *pud_alloc_track(struct mm_struct *mm, p4d_t *p4d, 2106 + unsigned long address, 2107 + pgtbl_mod_mask *mod_mask) 2108 + { 2109 + if (unlikely(p4d_none(*p4d))) { 2110 + if (__pud_alloc(mm, p4d, address)) 2111 + return NULL; 2112 + *mod_mask |= PGTBL_P4D_MODIFIED; 2113 + } 2114 + 2115 + return pud_offset(p4d, address); 2116 + } 2091 2117 2092 2118 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) 2093 2119 { 2094 2120 return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))? 
2095 2121 NULL: pmd_offset(pud, address); 2122 + } 2123 + 2124 + static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud, 2125 + unsigned long address, 2126 + pgtbl_mod_mask *mod_mask) 2127 + { 2128 + if (unlikely(pud_none(*pud))) { 2129 + if (__pmd_alloc(mm, pud, address)) 2130 + return NULL; 2131 + *mod_mask |= PGTBL_PUD_MODIFIED; 2132 + } 2133 + 2134 + return pmd_offset(pud, address); 2096 2135 } 2097 2136 #endif /* CONFIG_MMU */ 2098 2137 ··· 2248 2205 2249 2206 #define pte_alloc_kernel(pmd, address) \ 2250 2207 ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \ 2208 + NULL: pte_offset_kernel(pmd, address)) 2209 + 2210 + #define pte_alloc_kernel_track(pmd, address, mask) \ 2211 + ((unlikely(pmd_none(*(pmd))) && \ 2212 + (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\ 2251 2213 NULL: pte_offset_kernel(pmd, address)) 2252 2214 2253 2215 #if USE_SPLIT_PMD_PTLOCKS ··· 2655 2607 /* mm/page-writeback.c */ 2656 2608 int __must_check write_one_page(struct page *page); 2657 2609 void task_dirty_inc(struct task_struct *tsk); 2658 - 2659 - /* readahead.c */ 2660 - #define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE) 2661 - 2662 - int force_page_cache_readahead(struct address_space *mapping, struct file *filp, 2663 - pgoff_t offset, unsigned long nr_to_read); 2664 - 2665 - void page_cache_sync_readahead(struct address_space *mapping, 2666 - struct file_ra_state *ra, 2667 - struct file *filp, 2668 - pgoff_t offset, 2669 - unsigned long size); 2670 - 2671 - void page_cache_async_readahead(struct address_space *mapping, 2672 - struct file_ra_state *ra, 2673 - struct file *filp, 2674 - struct page *pg, 2675 - pgoff_t offset, 2676 - unsigned long size); 2677 2610 2678 2611 extern unsigned long stack_guard_gap; 2679 2612 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
+5 -1
include/linux/mm_types.h
··· 240 240 #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) 241 241 242 242 #define page_private(page) ((page)->private) 243 - #define set_page_private(page, v) ((page)->private = (v)) 243 + 244 + static inline void set_page_private(struct page *page, unsigned long private) 245 + { 246 + page->private = private; 247 + } 244 248 245 249 struct page_frag_cache { 246 250 void * va;
-1
include/linux/mmzone.h
··· 196 196 NR_FILE_THPS, 197 197 NR_FILE_PMDMAPPED, 198 198 NR_ANON_THPS, 199 - NR_UNSTABLE_NFS, /* NFS unstable pages */ 200 199 NR_VMSCAN_WRITE, 201 200 NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ 202 201 NR_DIRTIED, /* page dirtyings since bootup */
+2 -2
include/linux/mpage.h
··· 13 13 #ifdef CONFIG_BLOCK 14 14 15 15 struct writeback_control; 16 + struct readahead_control; 16 17 17 - int mpage_readpages(struct address_space *mapping, struct list_head *pages, 18 - unsigned nr_pages, get_block_t get_block); 18 + void mpage_readahead(struct readahead_control *, get_block_t get_block); 19 19 int mpage_readpage(struct page *page, get_block_t get_block); 20 20 int mpage_writepages(struct address_space *mapping, 21 21 struct writeback_control *wbc, get_block_t get_block);
+8
include/linux/page_counter.h
··· 10 10 atomic_long_t usage; 11 11 unsigned long min; 12 12 unsigned long low; 13 + unsigned long high; 13 14 unsigned long max; 14 15 struct page_counter *parent; 15 16 ··· 56 55 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); 57 56 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages); 58 57 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages); 58 + 59 + static inline void page_counter_set_high(struct page_counter *counter, 60 + unsigned long nr_pages) 61 + { 62 + WRITE_ONCE(counter->high, nr_pages); 63 + } 64 + 59 65 int page_counter_set_max(struct page_counter *counter, unsigned long nr_pages); 60 66 int page_counter_memparse(const char *buf, const char *max, 61 67 unsigned long *nr_pages);
+192 -1
include/linux/pagemap.h
··· 51 51 return; 52 52 53 53 /* Record in wb_err for checkers using errseq_t based tracking */ 54 - filemap_set_wb_err(mapping, error); 54 + __filemap_set_wb_err(mapping, error); 55 + 56 + /* Record it in superblock */ 57 + errseq_set(&mapping->host->i_sb->s_wb_err, error); 55 58 56 59 /* Record it in flags for now, for legacy callers */ 57 60 if (error == -ENOSPC) ··· 206 203 static inline int page_cache_add_speculative(struct page *page, int count) 207 204 { 208 205 return __page_cache_add_speculative(page, count); 206 + } 207 + 208 + /** 209 + * attach_page_private - Attach private data to a page. 210 + * @page: Page to attach data to. 211 + * @data: Data to attach to page. 212 + * 213 + * Attaching private data to a page increments the page's reference count. 214 + * The data must be detached before the page will be freed. 215 + */ 216 + static inline void attach_page_private(struct page *page, void *data) 217 + { 218 + get_page(page); 219 + set_page_private(page, (unsigned long)data); 220 + SetPagePrivate(page); 221 + } 222 + 223 + /** 224 + * detach_page_private - Detach private data from a page. 225 + * @page: Page to detach data from. 226 + * 227 + * Removes the data that was previously attached to the page and decrements 228 + * the refcount on the page. 229 + * 230 + * Return: Data that was attached to the page. 
231 + */ 232 + static inline void *detach_page_private(struct page *page) 233 + { 234 + void *data = (void *)page_private(page); 235 + 236 + if (!PagePrivate(page)) 237 + return NULL; 238 + ClearPagePrivate(page); 239 + set_page_private(page, 0); 240 + put_page(page); 241 + 242 + return data; 209 243 } 210 244 211 245 #ifdef CONFIG_NUMA ··· 655 615 void delete_from_page_cache_batch(struct address_space *mapping, 656 616 struct pagevec *pvec); 657 617 618 + #define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE) 619 + 620 + void page_cache_sync_readahead(struct address_space *, struct file_ra_state *, 621 + struct file *, pgoff_t index, unsigned long req_count); 622 + void page_cache_async_readahead(struct address_space *, struct file_ra_state *, 623 + struct file *, struct page *, pgoff_t index, 624 + unsigned long req_count); 625 + void page_cache_readahead_unbounded(struct address_space *, struct file *, 626 + pgoff_t index, unsigned long nr_to_read, 627 + unsigned long lookahead_count); 628 + 658 629 /* 659 630 * Like add_to_page_cache_locked, but used to add newly allocated pages: 660 631 * the page is new, so we can just run __SetPageLocked() against it. ··· 680 629 if (unlikely(error)) 681 630 __ClearPageLocked(page); 682 631 return error; 632 + } 633 + 634 + /** 635 + * struct readahead_control - Describes a readahead request. 636 + * 637 + * A readahead request is for consecutive pages. Filesystems which 638 + * implement the ->readahead method should call readahead_page() or 639 + * readahead_page_batch() in a loop and attempt to start I/O against 640 + * each page in the request. 641 + * 642 + * Most of the fields in this struct are private and should be accessed 643 + * by the functions below. 644 + * 645 + * @file: The file, used primarily by network filesystems for authentication. 646 + * May be NULL if invoked internally by the filesystem. 647 + * @mapping: Readahead this filesystem object. 
648 + */ 649 + struct readahead_control { 650 + struct file *file; 651 + struct address_space *mapping; 652 + /* private: use the readahead_* accessors instead */ 653 + pgoff_t _index; 654 + unsigned int _nr_pages; 655 + unsigned int _batch_count; 656 + }; 657 + 658 + /** 659 + * readahead_page - Get the next page to read. 660 + * @rac: The current readahead request. 661 + * 662 + * Context: The page is locked and has an elevated refcount. The caller 663 + * should decreases the refcount once the page has been submitted for I/O 664 + * and unlock the page once all I/O to that page has completed. 665 + * Return: A pointer to the next page, or %NULL if we are done. 666 + */ 667 + static inline struct page *readahead_page(struct readahead_control *rac) 668 + { 669 + struct page *page; 670 + 671 + BUG_ON(rac->_batch_count > rac->_nr_pages); 672 + rac->_nr_pages -= rac->_batch_count; 673 + rac->_index += rac->_batch_count; 674 + 675 + if (!rac->_nr_pages) { 676 + rac->_batch_count = 0; 677 + return NULL; 678 + } 679 + 680 + page = xa_load(&rac->mapping->i_pages, rac->_index); 681 + VM_BUG_ON_PAGE(!PageLocked(page), page); 682 + rac->_batch_count = hpage_nr_pages(page); 683 + 684 + return page; 685 + } 686 + 687 + static inline unsigned int __readahead_batch(struct readahead_control *rac, 688 + struct page **array, unsigned int array_sz) 689 + { 690 + unsigned int i = 0; 691 + XA_STATE(xas, &rac->mapping->i_pages, 0); 692 + struct page *page; 693 + 694 + BUG_ON(rac->_batch_count > rac->_nr_pages); 695 + rac->_nr_pages -= rac->_batch_count; 696 + rac->_index += rac->_batch_count; 697 + rac->_batch_count = 0; 698 + 699 + xas_set(&xas, rac->_index); 700 + rcu_read_lock(); 701 + xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) { 702 + VM_BUG_ON_PAGE(!PageLocked(page), page); 703 + VM_BUG_ON_PAGE(PageTail(page), page); 704 + array[i++] = page; 705 + rac->_batch_count += hpage_nr_pages(page); 706 + 707 + /* 708 + * The page cache isn't using multi-index entries yet, 
709 + * so the xas cursor needs to be manually moved to the 710 + * next index. This can be removed once the page cache 711 + * is converted. 712 + */ 713 + if (PageHead(page)) 714 + xas_set(&xas, rac->_index + rac->_batch_count); 715 + 716 + if (i == array_sz) 717 + break; 718 + } 719 + rcu_read_unlock(); 720 + 721 + return i; 722 + } 723 + 724 + /** 725 + * readahead_page_batch - Get a batch of pages to read. 726 + * @rac: The current readahead request. 727 + * @array: An array of pointers to struct page. 728 + * 729 + * Context: The pages are locked and have an elevated refcount. The caller 730 + * should decreases the refcount once the page has been submitted for I/O 731 + * and unlock the page once all I/O to that page has completed. 732 + * Return: The number of pages placed in the array. 0 indicates the request 733 + * is complete. 734 + */ 735 + #define readahead_page_batch(rac, array) \ 736 + __readahead_batch(rac, array, ARRAY_SIZE(array)) 737 + 738 + /** 739 + * readahead_pos - The byte offset into the file of this readahead request. 740 + * @rac: The readahead request. 741 + */ 742 + static inline loff_t readahead_pos(struct readahead_control *rac) 743 + { 744 + return (loff_t)rac->_index * PAGE_SIZE; 745 + } 746 + 747 + /** 748 + * readahead_length - The number of bytes in this readahead request. 749 + * @rac: The readahead request. 750 + */ 751 + static inline loff_t readahead_length(struct readahead_control *rac) 752 + { 753 + return (loff_t)rac->_nr_pages * PAGE_SIZE; 754 + } 755 + 756 + /** 757 + * readahead_index - The index of the first page in this readahead request. 758 + * @rac: The readahead request. 759 + */ 760 + static inline pgoff_t readahead_index(struct readahead_control *rac) 761 + { 762 + return rac->_index; 763 + } 764 + 765 + /** 766 + * readahead_count - The number of pages in this readahead request. 767 + * @rac: The readahead request. 
768 + */ 769 + static inline unsigned int readahead_count(struct readahead_control *rac) 770 + { 771 + return rac->_nr_pages; 683 772 } 684 773 685 774 static inline unsigned long dir_pages(struct inode *inode)
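The readahead_page() accessor added above drives the loop that ->readahead implementations are expected to run: each call retires the previous batch, then hands out the next locked page until the request is drained. A rough userspace model of just that iteration accounting (no xarray, no page locking, every "page" treated as order-0; all names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of struct readahead_control: the request covers
 * _nr_pages starting at _index; _batch_count remembers how many pages
 * the previous call handed out. */
struct ra_model {
	size_t _index;
	size_t _nr_pages;
	size_t _batch_count;
};

/* Model of readahead_page(): consume the previous batch, then hand out
 * the next index. Returns 1 and stores the index, or 0 when done.
 * (In the kernel, hpage_nr_pages() sets the batch; here it is 1.) */
static int ra_next(struct ra_model *rac, size_t *index)
{
	rac->_nr_pages -= rac->_batch_count;
	rac->_index += rac->_batch_count;

	if (!rac->_nr_pages) {
		rac->_batch_count = 0;
		return 0;		/* request complete */
	}
	rac->_batch_count = 1;		/* order-0 pages only */
	*index = rac->_index;
	return 1;
}
```

Per the doc comment above, a real ->readahead follows the same shape — `while ((page = readahead_page(rac))) { /* start I/O */ }` — dropping the refcount after submitting I/O and unlocking each page when its I/O completes.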
+2 -1
include/linux/ptdump.h
··· 13 13 struct ptdump_state { 14 14 /* level is 0:PGD to 4:PTE, or -1 if unknown */ 15 15 void (*note_page)(struct ptdump_state *st, unsigned long addr, 16 - int level, unsigned long val); 16 + int level, u64 val); 17 + void (*effective_prot)(struct ptdump_state *st, int level, u64 val); 17 18 const struct ptdump_range *range; 18 19 }; 19 20
+2 -1
include/linux/sched.h
··· 1495 1495 #define PF_KSWAPD 0x00020000 /* I am kswapd */ 1496 1496 #define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */ 1497 1497 #define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */ 1498 - #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */ 1498 + #define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only against the bdi I write to, 1499 + * I am cleaning dirty pages from some other bdi. */ 1499 1500 #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ 1500 1501 #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ 1501 1502 #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+11 -6
include/linux/swap.h
··· 183 183 #define SWAP_CLUSTER_MAX 32UL 184 184 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX 185 185 186 - #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ 187 - #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */ 186 + /* Bit flag in swap_map */ 188 187 #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ 189 - #define SWAP_CONT_MAX 0x7f /* Max count, in each swap_map continuation */ 190 - #define COUNT_CONTINUED 0x80 /* See swap_map continuation for full count */ 191 - #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs, in first swap_map */ 188 + #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */ 189 + 190 + /* Special value in first swap_map */ 191 + #define SWAP_MAP_MAX 0x3e /* Max count */ 192 + #define SWAP_MAP_BAD 0x3f /* Note page is bad */ 193 + #define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */ 194 + 195 + /* Special value in each swap_map continuation */ 196 + #define SWAP_CONT_MAX 0x7f /* Max count */ 192 197 193 198 /* 194 199 * We use this to track usage of a cluster. 
A cluster is a block of swap disk ··· 252 247 unsigned int inuse_pages; /* number of those currently in use */ 253 248 unsigned int cluster_next; /* likely index for next allocation */ 254 249 unsigned int cluster_nr; /* countdown to next cluster search */ 250 + unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */ 255 251 struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */ 256 252 struct rb_root swap_extent_root;/* root of the swap extent rbtree */ 257 253 struct block_device *bdev; /* swap device or bdev of swap file */ ··· 415 409 extern void show_swap_cache_info(void); 416 410 extern int add_to_swap(struct page *page); 417 411 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); 418 - extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); 419 412 extern void __delete_from_swap_cache(struct page *, swp_entry_t entry); 420 413 extern void delete_from_swap_cache(struct page *); 421 414 extern void free_page_and_swap_cache(struct page *);
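The regrouped swap.h constants above encode three things in a single swap_map byte: a duplication count in the low bits (capped at SWAP_MAP_MAX), SWAP_HAS_CACHE in bit 6, and COUNT_CONTINUED in bit 7. A hedged userspace sketch of that decoding — note the real code treats SWAP_MAP_BAD and SWAP_MAP_SHMEM as special whole-byte values, checked before interpreting the low bits as a count:

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the include/linux/swap.h hunk above. */
#define SWAP_HAS_CACHE	0x40	/* bit flag: page is in swap cache */
#define COUNT_CONTINUED	0x80	/* bit flag: count continues elsewhere */
#define SWAP_MAP_MAX	0x3e	/* max count in first swap_map */
#define SWAP_MAP_BAD	0x3f	/* special value: bad page */

struct swap_count {
	unsigned char count;	/* low 6 bits: duplication count */
	bool has_cache;
	bool continued;
};

/* Split one first-level swap_map byte into its components. */
static struct swap_count decode_swap_map(unsigned char ent)
{
	struct swap_count sc = {
		.count     = ent & ~(SWAP_HAS_CACHE | COUNT_CONTINUED),
		.has_cache = !!(ent & SWAP_HAS_CACHE),
		.continued = !!(ent & COUNT_CONTINUED),
	};
	return sc;
}
```

This also shows why SWAP_MAP_SHMEM is 0xbf: it is COUNT_CONTINUED | SWAP_MAP_BAD, a combination that cannot arise from a legitimate continued count, so it is unambiguous as a sentinel.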
+24 -25
include/linux/vmalloc.h
··· 88 88 * Highlevel APIs for driver use 89 89 */ 90 90 extern void vm_unmap_ram(const void *mem, unsigned int count); 91 - extern void *vm_map_ram(struct page **pages, unsigned int count, 92 - int node, pgprot_t prot); 91 + extern void *vm_map_ram(struct page **pages, unsigned int count, int node); 93 92 extern void vm_unmap_aliases(void); 94 93 95 94 #ifdef CONFIG_MMU ··· 106 107 extern void *vmalloc_user(unsigned long size); 107 108 extern void *vmalloc_node(unsigned long size, int node); 108 109 extern void *vzalloc_node(unsigned long size, int node); 109 - extern void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags); 110 110 extern void *vmalloc_exec(unsigned long size); 111 111 extern void *vmalloc_32(unsigned long size); 112 112 extern void *vmalloc_32_user(unsigned long size); 113 - extern void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot); 113 + extern void *__vmalloc(unsigned long size, gfp_t gfp_mask); 114 114 extern void *__vmalloc_node_range(unsigned long size, unsigned long align, 115 115 unsigned long start, unsigned long end, gfp_t gfp_mask, 116 116 pgprot_t prot, unsigned long vm_flags, int node, 117 117 const void *caller); 118 - #ifndef CONFIG_MMU 119 - extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags); 120 - static inline void *__vmalloc_node_flags_caller(unsigned long size, int node, 121 - gfp_t flags, void *caller) 122 - { 123 - return __vmalloc_node_flags(size, node, flags); 124 - } 125 - #else 126 - extern void *__vmalloc_node_flags_caller(unsigned long size, 127 - int node, gfp_t flags, void *caller); 128 - #endif 118 + void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, 119 + int node, const void *caller); 129 120 130 121 extern void vfree(const void *addr); 131 122 extern void vfree_atomic(const void *addr); ··· 130 141 131 142 extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr, 132 143 unsigned long pgoff);
133 - void vmalloc_sync_mappings(void); 134 - void vmalloc_sync_unmappings(void); 144 + 145 + /* 146 + * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values 147 + * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() 148 + * needs to be called. 149 + */ 150 + #ifndef ARCH_PAGE_TABLE_SYNC_MASK 151 + #define ARCH_PAGE_TABLE_SYNC_MASK 0 152 + #endif 153 + 154 + /* 155 + * There is no default implementation for arch_sync_kernel_mappings(). It is 156 + * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK 157 + * is 0. 158 + */ 159 + void arch_sync_kernel_mappings(unsigned long start, unsigned long end); 135 160 136 161 /* 137 162 * Lowlevel-APIs (not for driver use!) ··· 164 161 extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags); 165 162 extern struct vm_struct *get_vm_area_caller(unsigned long size, 166 163 unsigned long flags, const void *caller); 167 - extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags, 168 - unsigned long start, unsigned long end); 169 164 extern struct vm_struct *__get_vm_area_caller(unsigned long size, 170 165 unsigned long flags, 171 166 unsigned long start, unsigned long end, ··· 171 170 extern struct vm_struct *remove_vm_area(const void *addr); 172 171 extern struct vm_struct *find_vm_area(const void *addr); 173 172 174 - extern int map_vm_area(struct vm_struct *area, pgprot_t prot, 175 - struct page **pages); 176 173 #ifdef CONFIG_MMU 177 174 extern int map_kernel_range_noflush(unsigned long start, unsigned long size, 178 175 pgprot_t prot, struct page **pages); 176 + int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, 177 + struct page **pages); 179 178 extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size); 180 179 extern void unmap_kernel_range(unsigned long addr, unsigned long size); 181 180 static inline void set_vm_flush_reset_perms(void *addr) ··· 192 191 { 193 192
return size >> PAGE_SHIFT; 194 193 } 194 + #define map_kernel_range map_kernel_range_noflush 195 195 static inline void 196 196 unmap_kernel_range_noflush(unsigned long addr, unsigned long size) 197 197 { 198 198 } 199 - static inline void 200 - unmap_kernel_range(unsigned long addr, unsigned long size) 201 - { 202 - } 199 + #define unmap_kernel_range unmap_kernel_range_noflush 203 200 static inline void set_vm_flush_reset_perms(void *addr) 204 201 { 205 202 }
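The new ARCH_PAGE_TABLE_SYNC_MASK contract above is what replaces vmalloc_sync_mappings(): generic vmalloc/ioremap code accumulates a pgtbl_mod_mask while building mappings, and calls arch_sync_kernel_mappings() only if the architecture's opt-in mask overlaps the modifications. A minimal model of that gating decision (bit values are illustrative, not the kernel's):

```c
#include <assert.h>

/* Illustrative PGTBL_*_MODIFIED bits; the kernel defines its own. */
#define PGTBL_PTE_MODIFIED 0x01
#define PGTBL_PMD_MODIFIED 0x02
#define PGTBL_PUD_MODIFIED 0x04
#define PGTBL_P4D_MODIFIED 0x08

/* Returns nonzero when the accumulated modification mask intersects
 * the arch's ARCH_PAGE_TABLE_SYNC_MASK, i.e. when the hypothetical
 * arch_sync_kernel_mappings(start, end) call must be made. With an
 * arch mask of 0 (the default), the compiler can drop the call. */
static int needs_arch_sync(unsigned pgtbl_mod_mask, unsigned arch_sync_mask)
{
	return (pgtbl_mod_mask & arch_sync_mask) != 0;
}
```

The design point is that architectures which never need syncing (the common case) pay nothing: their mask stays 0, so the conditional call is dead code, unlike the old unconditional vmalloc_sync_mappings() hook.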
+1 -1
include/linux/zsmalloc.h
··· 20 20 * zsmalloc mapping modes 21 21 * 22 22 * NOTE: These only make a difference when a mapped object spans pages. 23 - * They also have no effect when PGTABLE_MAPPING is selected. 23 + * They also have no effect when ZSMALLOC_PGTABLE_MAPPING is selected. 24 24 */ 25 25 enum zs_mapmode { 26 26 ZS_MM_RW, /* normal read-write mapping */
+3 -3
include/trace/events/erofs.h
··· 113 113 114 114 TRACE_EVENT(erofs_readpages, 115 115 116 - TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage, 116 + TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage, 117 117 bool raw), 118 118 119 - TP_ARGS(inode, page, nrpage, raw), 119 + TP_ARGS(inode, start, nrpage, raw), 120 120 121 121 TP_STRUCT__entry( 122 122 __field(dev_t, dev ) ··· 129 129 TP_fast_assign( 130 130 __entry->dev = inode->i_sb->s_dev; 131 131 __entry->nid = EROFS_I(inode)->nid; 132 - __entry->start = page->index; 132 + __entry->start = start; 133 133 __entry->nrpage = nrpage; 134 134 __entry->raw = raw; 135 135 ),
+3 -3
include/trace/events/f2fs.h
··· 1376 1376 1377 1377 TRACE_EVENT(f2fs_readpages, 1378 1378 1379 - TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage), 1379 + TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage), 1380 1380 1381 - TP_ARGS(inode, page, nrpage), 1381 + TP_ARGS(inode, start, nrpage), 1382 1382 1383 1383 TP_STRUCT__entry( 1384 1384 __field(dev_t, dev) ··· 1390 1390 TP_fast_assign( 1391 1391 __entry->dev = inode->i_sb->s_dev; 1392 1392 __entry->ino = inode->i_ino; 1393 - __entry->start = page->index; 1393 + __entry->start = start; 1394 1394 __entry->nrpage = nrpage; 1395 1395 ), 1396 1396
+1 -4
include/trace/events/writeback.h
··· 541 541 TP_STRUCT__entry( 542 542 __field(unsigned long, nr_dirty) 543 543 __field(unsigned long, nr_writeback) 544 - __field(unsigned long, nr_unstable) 545 544 __field(unsigned long, background_thresh) 546 545 __field(unsigned long, dirty_thresh) 547 546 __field(unsigned long, dirty_limit) ··· 551 552 TP_fast_assign( 552 553 __entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY); 553 554 __entry->nr_writeback = global_node_page_state(NR_WRITEBACK); 554 - __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS); 555 555 __entry->nr_dirtied = global_node_page_state(NR_DIRTIED); 556 556 __entry->nr_written = global_node_page_state(NR_WRITTEN); 557 557 __entry->background_thresh = background_thresh; ··· 558 560 __entry->dirty_limit = global_wb_domain.dirty_limit; 559 561 ), 560 562 561 - TP_printk("dirty=%lu writeback=%lu unstable=%lu " 563 + TP_printk("dirty=%lu writeback=%lu " 562 564 "bg_thresh=%lu thresh=%lu limit=%lu " 563 565 "dirtied=%lu written=%lu", 564 566 __entry->nr_dirty, 565 567 __entry->nr_writeback, 566 - __entry->nr_unstable, 567 568 __entry->background_thresh, 568 569 __entry->dirty_thresh, 569 570 __entry->dirty_limit,
+3 -3
kernel/bpf/core.c
··· 82 82 struct bpf_prog *fp; 83 83 84 84 size = round_up(size, PAGE_SIZE); 85 - fp = __vmalloc(size, gfp_flags, PAGE_KERNEL); 85 + fp = __vmalloc(size, gfp_flags); 86 86 if (fp == NULL) 87 87 return NULL; 88 88 ··· 232 232 if (ret) 233 233 return NULL; 234 234 235 - fp = __vmalloc(size, gfp_flags, PAGE_KERNEL); 235 + fp = __vmalloc(size, gfp_flags); 236 236 if (fp == NULL) { 237 237 __bpf_prog_uncharge(fp_old->aux->user, delta); 238 238 } else { ··· 1089 1089 gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; 1090 1090 struct bpf_prog *fp; 1091 1091 1092 - fp = __vmalloc(fp_other->pages * PAGE_SIZE, gfp_flags, PAGE_KERNEL); 1092 + fp = __vmalloc(fp_other->pages * PAGE_SIZE, gfp_flags); 1093 1093 if (fp != NULL) { 1094 1094 /* aux->prog still points to the fp_other one, so 1095 1095 * when promoting the clone to the real program,
+14 -11
kernel/bpf/syscall.c
··· 25 25 #include <linux/nospec.h> 26 26 #include <linux/audit.h> 27 27 #include <uapi/linux/btf.h> 28 + #include <asm/pgtable.h> 28 29 #include <linux/bpf_lsm.h> 29 30 30 31 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ ··· 282 281 * __GFP_RETRY_MAYFAIL to avoid such situations. 283 282 */ 284 283 285 - const gfp_t flags = __GFP_NOWARN | __GFP_ZERO; 284 + const gfp_t gfp = __GFP_NOWARN | __GFP_ZERO; 285 + unsigned int flags = 0; 286 + unsigned long align = 1; 286 287 void *area; 287 288 288 289 if (size >= SIZE_MAX) 289 290 return NULL; 290 291 291 292 /* kmalloc()'ed memory can't be mmap()'ed */ 292 - if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { 293 - area = kmalloc_node(size, GFP_USER | __GFP_NORETRY | flags, 293 + if (mmapable) { 294 + BUG_ON(!PAGE_ALIGNED(size)); 295 + align = SHMLBA; 296 + flags = VM_USERMAP; 297 + } else if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { 298 + area = kmalloc_node(size, gfp | GFP_USER | __GFP_NORETRY, 294 299 numa_node); 295 300 if (area != NULL) 296 301 return area; 297 302 } 298 - if (mmapable) { 299 - BUG_ON(!PAGE_ALIGNED(size)); 300 - return vmalloc_user_node_flags(size, numa_node, GFP_KERNEL | 301 - __GFP_RETRY_MAYFAIL | flags); 302 - } 303 - return __vmalloc_node_flags_caller(size, numa_node, 304 - GFP_KERNEL | __GFP_RETRY_MAYFAIL | 305 - flags, __builtin_return_address(0)); 303 + 304 + return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END, 305 + gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL, PAGE_KERNEL, 306 + flags, numa_node, __builtin_return_address(0)); 306 307 } 307 308 308 309 void *bpf_map_area_alloc(u64 size, int numa_node)
+12 -36
kernel/dma/remap.c
··· 20 20 return area->pages; 21 21 } 22 22 23 - static struct vm_struct *__dma_common_pages_remap(struct page **pages, 24 - size_t size, pgprot_t prot, const void *caller) 25 - { 26 - struct vm_struct *area; 27 - 28 - area = get_vm_area_caller(size, VM_DMA_COHERENT, caller); 29 - if (!area) 30 - return NULL; 31 - 32 - if (map_vm_area(area, prot, pages)) { 33 - vunmap(area->addr); 34 - return NULL; 35 - } 36 - 37 - return area; 38 - } 39 - 40 23 /* 41 24 * Remaps an array of PAGE_SIZE pages into another vm_area. 42 25 * Cannot be used in non-sleeping contexts ··· 27 44 void *dma_common_pages_remap(struct page **pages, size_t size, 28 45 pgprot_t prot, const void *caller) 29 46 { 30 - struct vm_struct *area; 47 + void *vaddr; 31 48 32 - area = __dma_common_pages_remap(pages, size, prot, caller); 33 - if (!area) 34 - return NULL; 35 - 36 - area->pages = pages; 37 - 38 - return area->addr; 49 + vaddr = vmap(pages, size >> PAGE_SHIFT, VM_DMA_COHERENT, prot); 50 + if (vaddr) 51 + find_vm_area(vaddr)->pages = pages; 52 + return vaddr; 39 53 } 40 54 41 55 /* ··· 42 62 void *dma_common_contiguous_remap(struct page *page, size_t size, 43 63 pgprot_t prot, const void *caller) 44 64 { 45 - int i; 65 + int count = size >> PAGE_SHIFT; 46 66 struct page **pages; 47 - struct vm_struct *area; 67 + void *vaddr; 68 + int i; 48 69 49 - pages = kmalloc(sizeof(struct page *) << get_order(size), GFP_KERNEL); 70 + pages = kmalloc_array(count, sizeof(struct page *), GFP_KERNEL); 50 71 if (!pages) 51 72 return NULL; 52 - 53 - for (i = 0; i < (size >> PAGE_SHIFT); i++) 73 + for (i = 0; i < count; i++) 54 74 pages[i] = nth_page(page, i); 55 - 56 - area = __dma_common_pages_remap(pages, size, prot, caller); 57 - 75 + vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); 58 76 kfree(pages); 59 77 60 - if (!area) 61 - return NULL; 62 - return area->addr; 78 + return vaddr; 63 79 } 64 80 65 81 /*
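Besides switching to vmap(), the dma/remap.c hunk replaces an open-coded `kmalloc(sizeof(struct page *) << get_order(size))` with kmalloc_array(), whose point is an overflow-checked `n * size` multiplication that fails cleanly instead of wrapping. A userspace analogue of that check, with malloc() standing in for kmalloc():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the kmalloc_array() idea: refuse the allocation when
 * n * size would overflow size_t, rather than silently allocating a
 * wrapped-around (too small) buffer. */
static void *alloc_array(size_t n, size_t size)
{
	if (size && n > SIZE_MAX / size)
		return NULL;		/* multiplication would overflow */
	return malloc(n * size);
}
```

The old shift-by-get_order form also over-allocated to the next power-of-two number of pages; computing `count * sizeof(struct page *)` exactly, with the overflow check, is both tighter and safer.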
+1 -1
kernel/groups.c
··· 20 20 len = sizeof(struct group_info) + sizeof(kgid_t) * gidsetsize; 21 21 gi = kmalloc(len, GFP_KERNEL_ACCOUNT|__GFP_NOWARN|__GFP_NORETRY); 22 22 if (!gi) 23 - gi = __vmalloc(len, GFP_KERNEL_ACCOUNT, PAGE_KERNEL); 23 + gi = __vmalloc(len, GFP_KERNEL_ACCOUNT); 24 24 if (!gi) 25 25 return NULL; 26 26
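groups_alloc() above keeps the pattern this series touches in many places: try kmalloc with `__GFP_NOWARN|__GFP_NORETRY` first, and fall back to vmalloc only when the cheap path fails. A userspace sketch of the shape of that fallback — both allocators are modeled with malloc() here, and the size threshold is arbitrary, standing in for __GFP_NORETRY making large contiguous kmalloc likely to fail:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FAST_LIMIT 4096		/* arbitrary stand-in threshold */

/* Models kmalloc(__GFP_NORETRY): give up on large requests. */
static void *fast_alloc(size_t len)
{
	if (len > FAST_LIMIT)
		return NULL;
	return malloc(len);
}

/* Models the kmalloc-then-__vmalloc fallback in groups_alloc(). */
static void *alloc_with_fallback(size_t len)
{
	void *p = fast_alloc(len);

	if (!p)
		p = malloc(len);	/* models the __vmalloc fallback */
	return p;
}
```

With the pgprot argument gone from __vmalloc(), the fallback call site shrinks to `__vmalloc(len, GFP_KERNEL_ACCOUNT)`, which is what makes this cleanup mostly mechanical across callers.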
+1 -2
kernel/module.c
··· 2951 2951 return err; 2952 2952 2953 2953 /* Suck in entire file: we'll want most of it. */ 2954 - info->hdr = __vmalloc(info->len, 2955 - GFP_KERNEL | __GFP_NOWARN, PAGE_KERNEL); 2954 + info->hdr = __vmalloc(info->len, GFP_KERNEL | __GFP_NOWARN); 2956 2955 if (!info->hdr) 2957 2956 return -ENOMEM; 2958 2957
-1
kernel/notifier.c
··· 519 519 520 520 int register_die_notifier(struct notifier_block *nb) 521 521 { 522 - vmalloc_sync_mappings(); 523 522 return atomic_notifier_chain_register(&die_chain, nb); 524 523 } 525 524 EXPORT_SYMBOL_GPL(register_die_notifier);
+1 -1
kernel/sys.c
··· 2262 2262 return -EINVAL; 2263 2263 } 2264 2264 2265 - #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE) 2265 + #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE) 2266 2266 2267 2267 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, 2268 2268 unsigned long, arg4, unsigned long, arg5)
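The PF_LESS_THROTTLE → PF_LOCAL_THROTTLE rename is value-preserving, so the PR_IO_FLUSHER mask that userspace observes through prctl() does not change. Using the flag values from the sched.h hunk earlier in this series:

```c
#include <assert.h>

/* Flag values copied from the include/linux/sched.h hunk; the rename
 * keeps 0x00100000, so the composite PR_IO_FLUSHER value is stable. */
#define PF_MEMALLOC_NOIO	0x00080000
#define PF_LOCAL_THROTTLE	0x00100000	/* was PF_LESS_THROTTLE */

#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
```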
-12
kernel/trace/trace.c
··· 8527 8527 allocate_snapshot = false; 8528 8528 #endif 8529 8529 8530 - /* 8531 - * Because of some magic with the way alloc_percpu() works on 8532 - * x86_64, we need to synchronize the pgd of all the tables, 8533 - * otherwise the trace events that happen in x86_64 page fault 8534 - * handlers can't cope with accessing the chance that a 8535 - * alloc_percpu()'d memory might be touched in the page fault trace 8536 - * event. Oh, and we need to audit all other alloc_percpu() and vmalloc() 8537 - * calls in tracing, because something might get triggered within a 8538 - * page fault trace event! 8539 - */ 8540 - vmalloc_sync_mappings(); 8541 - 8542 8530 return 0; 8543 8531 } 8544 8532
+1 -1
lib/Kconfig.ubsan
··· 63 63 config UBSAN_ALIGNMENT 64 64 bool "Enable checks for pointers alignment" 65 65 default !HAVE_EFFICIENT_UNALIGNED_ACCESS 66 - depends on !X86 || !COMPILE_TEST 66 + depends on !UBSAN_TRAP 67 67 help 68 68 This option enables the check of unaligned memory accesses. 69 69 Enabling this option on architectures that support unaligned
+31 -15
lib/ioremap.c
··· 61 61 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 62 62 63 63 static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, 64 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 64 + unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 65 + pgtbl_mod_mask *mask) 65 66 { 66 67 pte_t *pte; 67 68 u64 pfn; 68 69 69 70 pfn = phys_addr >> PAGE_SHIFT; 70 - pte = pte_alloc_kernel(pmd, addr); 71 + pte = pte_alloc_kernel_track(pmd, addr, mask); 71 72 if (!pte) 72 73 return -ENOMEM; 73 74 do { ··· 76 75 set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); 77 76 pfn++; 78 77 } while (pte++, addr += PAGE_SIZE, addr != end); 78 + *mask |= PGTBL_PTE_MODIFIED; 79 79 return 0; 80 80 } 81 81 ··· 103 101 } 104 102 105 103 static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr, 106 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 104 + unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 105 + pgtbl_mod_mask *mask) 107 106 { 108 107 pmd_t *pmd; 109 108 unsigned long next; 110 109 111 - pmd = pmd_alloc(&init_mm, pud, addr); 110 + pmd = pmd_alloc_track(&init_mm, pud, addr, mask); 112 111 if (!pmd) 113 112 return -ENOMEM; 114 113 do { 115 114 next = pmd_addr_end(addr, end); 116 115 117 - if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) 116 + if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) { 117 + *mask |= PGTBL_PMD_MODIFIED; 118 118 continue; 119 + } 119 120 120 - if (ioremap_pte_range(pmd, addr, next, phys_addr, prot)) 121 + if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask)) 121 122 return -ENOMEM; 122 123 } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); 123 124 return 0; ··· 149 144 } 150 145 151 146 static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, 152 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 147 + unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 148 + pgtbl_mod_mask *mask) 153 149 { 154 150 pud_t *pud; 155 151 unsigned long next; 156 152
157 - pud = pud_alloc(&init_mm, p4d, addr); 153 + pud = pud_alloc_track(&init_mm, p4d, addr, mask); 158 154 if (!pud) 159 155 return -ENOMEM; 160 156 do { 161 157 next = pud_addr_end(addr, end); 162 158 163 - if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) 159 + if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) { 160 + *mask |= PGTBL_PUD_MODIFIED; 164 161 continue; 162 + } 165 163 166 - if (ioremap_pmd_range(pud, addr, next, phys_addr, prot)) 164 + if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask)) 167 165 return -ENOMEM; 168 166 } while (pud++, phys_addr += (next - addr), addr = next, addr != end); 169 167 return 0; ··· 195 187 } 196 188 197 189 static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr, 198 - unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 190 + unsigned long end, phys_addr_t phys_addr, pgprot_t prot, 191 + pgtbl_mod_mask *mask) 199 192 { 200 193 p4d_t *p4d; 201 194 unsigned long next; 202 195 203 - p4d = p4d_alloc(&init_mm, pgd, addr); 196 + p4d = p4d_alloc_track(&init_mm, pgd, addr, mask); 204 197 if (!p4d) 205 198 return -ENOMEM; 206 199 do { 207 200 next = p4d_addr_end(addr, end); 208 201 209 - if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) 202 + if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) { 203 + *mask |= PGTBL_P4D_MODIFIED; 210 204 continue; 205 + } 211 206 212 - if (ioremap_pud_range(p4d, addr, next, phys_addr, prot)) 207 + if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask)) 213 208 return -ENOMEM; 214 209 } while (p4d++, phys_addr += (next - addr), addr = next, addr != end); 215 210 return 0; ··· 225 214 unsigned long start; 226 215 unsigned long next; 227 216 int err; 217 + pgtbl_mod_mask mask = 0; 228 218 229 219 might_sleep(); 230 220 BUG_ON(addr >= end); ··· 234 222 pgd = pgd_offset_k(addr); 235 223 do { 236 224 next = pgd_addr_end(addr, end); 237 - err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot); 225 + err = ioremap_p4d_range(pgd, addr, next,
phys_addr, prot, 226 + &mask); 238 227 if (err) 239 228 break; 240 229 } while (pgd++, phys_addr += (next - addr), addr = next, addr != end); 241 230 242 231 flush_cache_vmap(start, end); 232 + 233 + if (mask & ARCH_PAGE_TABLE_SYNC_MASK) 234 + arch_sync_kernel_mappings(start, end); 243 235 244 236 return err; 245 237 }
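The pattern in this ioremap hunk — every level helper takes a `pgtbl_mod_mask *` and ORs in its PGTBL_*_MODIFIED bit whenever it installs or allocates an entry, with the top level inspecting the accumulated mask once at the end — can be modeled in a few lines of plain C. Bit values and function names below are illustrative, covering just two of the levels:

```c
#include <assert.h>

/* Illustrative level bits; the kernel defines its own PGTBL_*_MODIFIED. */
#define PGTBL_PTE_MODIFIED 0x01
#define PGTBL_PMD_MODIFIED 0x02

/* Leaf level: always installs PTEs, so it always marks the mask. */
static void pte_level(unsigned *mask)
{
	*mask |= PGTBL_PTE_MODIFIED;
}

/* Mid level: either a huge mapping is installed here (mark this level
 * and stop), or we descend to the leaf level, mirroring the
 * ioremap_try_huge_pmd() / ioremap_pte_range() split above. */
static void pmd_level(int used_huge_pmd, unsigned *mask)
{
	if (used_huge_pmd) {
		*mask |= PGTBL_PMD_MODIFIED;
		return;
	}
	pte_level(mask);
}

/* Top level: run the walk and return the accumulated mask, which the
 * caller would compare against ARCH_PAGE_TABLE_SYNC_MASK. */
static unsigned walk(int used_huge_pmd)
{
	unsigned mask = 0;

	pmd_level(used_huge_pmd, &mask);
	return mask;
}
```

Threading the mask by pointer instead of return value is what lets the existing `-ENOMEM` error returns stay untouched throughout the hunk.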
+7 -19
lib/test_vmalloc.c
··· 91 91 */ 92 92 size = ((rnd % 10) + 1) * PAGE_SIZE; 93 93 94 - ptr = __vmalloc_node_range(size, align, 95 - VMALLOC_START, VMALLOC_END, 96 - GFP_KERNEL | __GFP_ZERO, 97 - PAGE_KERNEL, 98 - 0, 0, __builtin_return_address(0)); 99 - 94 + ptr = __vmalloc_node(size, align, GFP_KERNEL | __GFP_ZERO, 0, 95 + __builtin_return_address(0)); 100 96 if (!ptr) 101 97 return -1; 102 98 ··· 114 118 for (i = 0; i < BITS_PER_LONG; i++) { 115 119 align = ((unsigned long) 1) << i; 116 120 117 - ptr = __vmalloc_node_range(PAGE_SIZE, align, 118 - VMALLOC_START, VMALLOC_END, 119 - GFP_KERNEL | __GFP_ZERO, 120 - PAGE_KERNEL, 121 - 0, 0, __builtin_return_address(0)); 122 - 121 + ptr = __vmalloc_node(PAGE_SIZE, align, GFP_KERNEL|__GFP_ZERO, 0, 122 + __builtin_return_address(0)); 123 123 if (!ptr) 124 124 return -1; 125 125 ··· 131 139 int i; 132 140 133 141 for (i = 0; i < test_loop_count; i++) { 134 - ptr = __vmalloc_node_range(5 * PAGE_SIZE, 135 - THREAD_ALIGN << 1, 136 - VMALLOC_START, VMALLOC_END, 137 - GFP_KERNEL | __GFP_ZERO, 138 - PAGE_KERNEL, 139 - 0, 0, __builtin_return_address(0)); 140 - 142 + ptr = __vmalloc_node(5 * PAGE_SIZE, THREAD_ALIGN << 1, 143 + GFP_KERNEL | __GFP_ZERO, 0, 144 + __builtin_return_address(0)); 141 145 if (!ptr) 142 146 return -1; 143 147
+2 -2
mm/Kconfig
··· 705 705 returned by an alloc(). This handle must be mapped in order to 706 706 access the allocated space. 707 707 708 - config PGTABLE_MAPPING 708 + config ZSMALLOC_PGTABLE_MAPPING 709 709 bool "Use page table mapping to access object in zsmalloc" 710 - depends on ZSMALLOC 710 + depends on ZSMALLOC=y 711 711 help 712 712 By default, zsmalloc uses a copy-based object mapping method to 713 713 access allocations that span two pages. However, if a particular
+50 -6
mm/debug.c
··· 110 110 else if (PageAnon(page)) 111 111 type = "anon "; 112 112 else if (mapping) { 113 - if (mapping->host && mapping->host->i_dentry.first) { 114 - struct dentry *dentry; 115 - dentry = container_of(mapping->host->i_dentry.first, struct dentry, d_u.d_alias); 116 - pr_warn("%ps name:\"%pd\"\n", mapping->a_ops, dentry); 117 - } else 118 - pr_warn("%ps\n", mapping->a_ops); 113 + const struct inode *host; 114 + const struct address_space_operations *a_ops; 115 + const struct hlist_node *dentry_first; 116 + const struct dentry *dentry_ptr; 117 + struct dentry dentry; 118 + 119 + /* 120 + * mapping can be invalid pointer and we don't want to crash 121 + * accessing it, so probe everything depending on it carefully 122 + */ 123 + if (probe_kernel_read_strict(&host, &mapping->host, 124 + sizeof(struct inode *)) || 125 + probe_kernel_read_strict(&a_ops, &mapping->a_ops, 126 + sizeof(struct address_space_operations *))) { 127 + pr_warn("failed to read mapping->host or a_ops, mapping not a valid kernel address?\n"); 128 + goto out_mapping; 129 + } 130 + 131 + if (!host) { 132 + pr_warn("mapping->a_ops:%ps\n", a_ops); 133 + goto out_mapping; 134 + } 135 + 136 + if (probe_kernel_read_strict(&dentry_first, 137 + &host->i_dentry.first, sizeof(struct hlist_node *))) { 138 + pr_warn("mapping->a_ops:%ps with invalid mapping->host inode address %px\n", 139 + a_ops, host); 140 + goto out_mapping; 141 + } 142 + 143 + if (!dentry_first) { 144 + pr_warn("mapping->a_ops:%ps\n", a_ops); 145 + goto out_mapping; 146 + } 147 + 148 + dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias); 149 + if (probe_kernel_read_strict(&dentry, dentry_ptr, 150 + sizeof(struct dentry))) { 151 + pr_warn("mapping->aops:%ps with invalid mapping->host->i_dentry.first %px\n", 152 + a_ops, dentry_ptr); 153 + } else { 154 + /* 155 + * if dentry is corrupted, the %pd handler may still 156 + * crash, but it's unlikely that we reach here with a 157 + * corrupted struct page 158 + */ 159 + 
pr_warn("mapping->aops:%ps dentry name:\"%pd\"\n", 160 + a_ops, &dentry); 161 + } 119 162 } 163 + out_mapping: 120 164 BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1); 121 165 122 166 pr_warn("%sflags: %#lx(%pGp)%s\n", type, page->flags, &page->flags,
+2 -4
mm/fadvise.c
··· 22 22 23 23 #include <asm/unistd.h> 24 24 25 + #include "internal.h" 26 + 25 27 /* 26 28 * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could 27 29 * deactivate the pages and clear PG_Referenced. ··· 104 102 if (!nrpages) 105 103 nrpages = ~0UL; 106 104 107 - /* 108 - * Ignore return value because fadvise() shall return 109 - * success even if filesystem can't retrieve a hint, 110 - */ 111 105 force_page_cache_readahead(mapping, file, start_index, nrpages); 112 106 break; 113 107 case POSIX_FADV_NOREUSE:
-1
mm/filemap.c
··· 2566 2566 if (!error || error == AOP_TRUNCATED_PAGE) 2567 2567 goto retry_find; 2568 2568 2569 - /* Things didn't work out. Return zero to tell the mm layer so. */ 2570 2569 shrink_readahead_size_eio(ra); 2571 2570 return VM_FAULT_SIGBUS; 2572 2571
+58 -19
mm/gup.c
··· 1183 1183 return true; 1184 1184 } 1185 1185 1186 - /* 1186 + /** 1187 1187 * fixup_user_fault() - manually resolve a user page fault 1188 1188 * @tsk: the task_struct to use for page fault accounting, or 1189 1189 * NULL if faults are not to be recorded. ··· 1191 1191 * @address: user address 1192 1192 * @fault_flags:flags to pass down to handle_mm_fault() 1193 1193 * @unlocked: did we unlock the mmap_sem while retrying, maybe NULL if caller 1194 - * does not allow retry 1194 + * does not allow retry. If NULL, the caller must guarantee 1195 + * that fault_flags does not contain FAULT_FLAG_ALLOW_RETRY. 1195 1196 * 1196 1197 * This is meant to be called in the specific scenario where for locking reasons 1197 1198 * we try to access user memory in atomic context (within a pagefault_disable() ··· 1855 1854 gup_flags | FOLL_TOUCH | FOLL_REMOTE); 1856 1855 } 1857 1856 1858 - /* 1857 + /** 1859 1858 * get_user_pages_remote() - pin user pages in memory 1860 1859 * @tsk: the task_struct to use for page fault accounting, or 1861 1860 * NULL if faults are not to be recorded. ··· 1886 1885 * 1887 1886 * Must be called with mmap_sem held for read or write. 1888 1887 * 1889 - * get_user_pages walks a process's page tables and takes a reference to 1890 - * each struct page that each user address corresponds to at a given 1888 + * get_user_pages_remote walks a process's page tables and takes a reference 1889 + * to each struct page that each user address corresponds to at a given 1891 1890 * instant. That is, it takes the page that would be accessed if a user 1892 1891 * thread accesses the given user virtual address at that instant. 1893 1892 * 1894 1893 * This does not guarantee that the page exists in the user mappings when 1895 - * get_user_pages returns, and there may even be a completely different 1894 + * get_user_pages_remote returns, and there may even be a completely different 1896 1895 * page there in some cases (eg. 
if mmapped pagecache has been invalidated 1897 1896 * and subsequently re faulted). However it does guarantee that the page 1898 1897 * won't be freed completely. And mostly callers simply care that the page ··· 1904 1903 * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must 1905 1904 * be called after the page is finished with, and before put_page is called. 1906 1905 * 1907 - * get_user_pages is typically used for fewer-copy IO operations, to get a 1908 - * handle on the memory by some means other than accesses via the user virtual 1909 - * addresses. The pages may be submitted for DMA to devices or accessed via 1910 - * their kernel linear mapping (via the kmap APIs). Care should be taken to 1911 - * use the correct cache flushing APIs. 1906 + * get_user_pages_remote is typically used for fewer-copy IO operations, 1907 + * to get a handle on the memory by some means other than accesses 1908 + * via the user virtual addresses. The pages may be submitted for 1909 + * DMA to devices or accessed via their kernel linear mapping (via the 1910 + * kmap APIs). Care should be taken to use the correct cache flushing APIs. 1912 1911 * 1913 1912 * See also get_user_pages_fast, for performance critical applications. 1914 1913 * 1915 - * get_user_pages should be phased out in favor of 1914 + * get_user_pages_remote should be phased out in favor of 1916 1915 * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing 1917 - * should use get_user_pages because it cannot pass 1916 + * should use get_user_pages_remote because it cannot pass 1918 1917 * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. 
1919 1918 */ 1920 1919 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, ··· 1953 1952 } 1954 1953 #endif /* !CONFIG_MMU */ 1955 1954 1956 - /* 1955 + /** 1956 + * get_user_pages() - pin user pages in memory 1957 + * @start: starting user address 1958 + * @nr_pages: number of pages from start to pin 1959 + * @gup_flags: flags modifying lookup behaviour 1960 + * @pages: array that receives pointers to the pages pinned. 1961 + * Should be at least nr_pages long. Or NULL, if caller 1962 + * only intends to ensure the pages are faulted in. 1963 + * @vmas: array of pointers to vmas corresponding to each page. 1964 + * Or NULL if the caller does not require them. 1965 + * 1957 1966 * This is the same as get_user_pages_remote(), just with a 1958 1967 * less-flexible calling convention where we assume that the task 1959 1968 * and mm being operated on are the current task's and don't allow ··· 1986 1975 } 1987 1976 EXPORT_SYMBOL(get_user_pages); 1988 1977 1989 - /* 1990 - * We can leverage the VM_FAULT_RETRY functionality in the page fault 1991 - * paths better by using either get_user_pages_locked() or 1992 - * get_user_pages_unlocked(). 1993 - * 1978 + /** 1994 1979 * get_user_pages_locked() is suitable to replace the form: 1995 1980 * 1996 1981 * down_read(&mm->mmap_sem); ··· 2002 1995 * get_user_pages_locked(tsk, mm, ..., pages, &locked); 2003 1996 * if (locked) 2004 1997 * up_read(&mm->mmap_sem); 1998 + * 1999 + * @start: starting user address 2000 + * @nr_pages: number of pages from start to pin 2001 + * @gup_flags: flags modifying lookup behaviour 2002 + * @pages: array that receives pointers to the pages pinned. 2003 + * Should be at least nr_pages long. Or NULL, if caller 2004 + * only intends to ensure the pages are faulted in. 2005 + * @locked: pointer to lock flag indicating whether lock is held and 2006 + * subsequently whether VM_FAULT_RETRY functionality can be 2007 + * utilised. Lock must initially be held. 
2008 + * 2009 + * We can leverage the VM_FAULT_RETRY functionality in the page fault 2010 + * paths better by using either get_user_pages_locked() or 2011 + * get_user_pages_unlocked(). 2012 + * 2005 2013 */ 2006 2014 long get_user_pages_locked(unsigned long start, unsigned long nr_pages, 2007 2015 unsigned int gup_flags, struct page **pages, ··· 2993 2971 pages, vmas, gup_flags); 2994 2972 } 2995 2973 EXPORT_SYMBOL(pin_user_pages); 2974 + 2975 + /* 2976 + * pin_user_pages_unlocked() is the FOLL_PIN variant of 2977 + * get_user_pages_unlocked(). Behavior is the same, except that this one sets 2978 + * FOLL_PIN and rejects FOLL_GET. 2979 + */ 2980 + long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages, 2981 + struct page **pages, unsigned int gup_flags) 2982 + { 2983 + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ 2984 + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) 2985 + return -EINVAL; 2986 + 2987 + gup_flags |= FOLL_PIN; 2988 + return get_user_pages_unlocked(start, nr_pages, pages, gup_flags); 2989 + } 2990 + EXPORT_SYMBOL(pin_user_pages_unlocked);
+7 -5
mm/internal.h
··· 49 49 unsigned long addr, unsigned long end, 50 50 struct zap_details *details); 51 51 52 - extern unsigned int __do_page_cache_readahead(struct address_space *mapping, 53 - struct file *filp, pgoff_t offset, unsigned long nr_to_read, 52 + void force_page_cache_readahead(struct address_space *, struct file *, 53 + pgoff_t index, unsigned long nr_to_read); 54 + void __do_page_cache_readahead(struct address_space *, struct file *, 55 + pgoff_t index, unsigned long nr_to_read, 54 56 unsigned long lookahead_size); 55 57 56 58 /* 57 59 * Submit IO for the read-ahead request in file_ra_state. 58 60 */ 59 - static inline unsigned long ra_submit(struct file_ra_state *ra, 61 + static inline void ra_submit(struct file_ra_state *ra, 60 62 struct address_space *mapping, struct file *filp) 61 63 { 62 - return __do_page_cache_readahead(mapping, filp, 63 - ra->start, ra->size, ra->async_size); 64 + __do_page_cache_readahead(mapping, filp, 65 + ra->start, ra->size, ra->async_size); 64 66 } 65 67 66 68 /**
+13 -8
mm/kasan/Makefile
··· 15 15 16 16 # Function splitter causes unnecessary splits in __asan_load1/__asan_store1 17 17 # see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63533 18 - CFLAGS_common.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 19 - CFLAGS_generic.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 20 - CFLAGS_generic_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 21 - CFLAGS_init.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 22 - CFLAGS_quarantine.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 23 - CFLAGS_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 24 - CFLAGS_tags.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 25 - CFLAGS_tags_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING 18 + CC_FLAGS_KASAN_RUNTIME := $(call cc-option, -fno-conserve-stack) 19 + CC_FLAGS_KASAN_RUNTIME += $(call cc-option, -fno-stack-protector) 20 + # Disable branch tracing to avoid recursion. 21 + CC_FLAGS_KASAN_RUNTIME += -DDISABLE_BRANCH_PROFILING 22 + 23 + CFLAGS_common.o := $(CC_FLAGS_KASAN_RUNTIME) 24 + CFLAGS_generic.o := $(CC_FLAGS_KASAN_RUNTIME) 25 + CFLAGS_generic_report.o := $(CC_FLAGS_KASAN_RUNTIME) 26 + CFLAGS_init.o := $(CC_FLAGS_KASAN_RUNTIME) 27 + CFLAGS_quarantine.o := $(CC_FLAGS_KASAN_RUNTIME) 28 + CFLAGS_report.o := $(CC_FLAGS_KASAN_RUNTIME) 29 + CFLAGS_tags.o := $(CC_FLAGS_KASAN_RUNTIME) 30 + CFLAGS_tags_report.o := $(CC_FLAGS_KASAN_RUNTIME) 26 31 27 32 obj-$(CONFIG_KASAN) := common.o init.o report.o 28 33 obj-$(CONFIG_KASAN_GENERIC) += generic.o generic_report.o quarantine.o
-19
mm/kasan/common.c
··· 33 33 #include <linux/types.h> 34 34 #include <linux/vmalloc.h> 35 35 #include <linux/bug.h> 36 - #include <linux/uaccess.h> 37 36 38 37 #include <asm/cacheflush.h> 39 38 #include <asm/tlbflush.h> ··· 611 612 vfree(kasan_mem_to_shadow(vm->addr)); 612 613 } 613 614 #endif 614 - 615 - extern void __kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip); 616 - extern bool report_enabled(void); 617 - 618 - bool kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip) 619 - { 620 - unsigned long flags = user_access_save(); 621 - bool ret = false; 622 - 623 - if (likely(report_enabled())) { 624 - __kasan_report(addr, size, is_write, ip); 625 - ret = true; 626 - } 627 - 628 - user_access_restore(flags); 629 - 630 - return ret; 631 - } 632 615 633 616 #ifdef CONFIG_MEMORY_HOTPLUG 634 617 static bool shadow_mapped(unsigned long addr)
+20 -2
mm/kasan/report.c
··· 29 29 #include <linux/kasan.h> 30 30 #include <linux/module.h> 31 31 #include <linux/sched/task_stack.h> 32 + #include <linux/uaccess.h> 32 33 33 34 #include <asm/sections.h> 34 35 ··· 455 454 } 456 455 } 457 456 458 - bool report_enabled(void) 457 + static bool report_enabled(void) 459 458 { 460 459 if (current->kasan_depth) 461 460 return false; ··· 480 479 end_report(&flags); 481 480 } 482 481 483 - void __kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip) 482 + static void __kasan_report(unsigned long addr, size_t size, bool is_write, 483 + unsigned long ip) 484 484 { 485 485 struct kasan_access_info info; 486 486 void *tagged_addr; ··· 518 516 } 519 517 520 518 end_report(&flags); 519 + } 520 + 521 + bool kasan_report(unsigned long addr, size_t size, bool is_write, 522 + unsigned long ip) 523 + { 524 + unsigned long flags = user_access_save(); 525 + bool ret = false; 526 + 527 + if (likely(report_enabled())) { 528 + __kasan_report(addr, size, is_write, ip); 529 + ret = true; 530 + } 531 + 532 + user_access_restore(flags); 533 + 534 + return ret; 521 535 } 522 536 523 537 #ifdef CONFIG_KASAN_INLINE
+142 -52
mm/memcontrol.c
··· 1314 1314 if (do_memsw_account()) { 1315 1315 count = page_counter_read(&memcg->memsw); 1316 1316 limit = READ_ONCE(memcg->memsw.max); 1317 - if (count <= limit) 1317 + if (count < limit) 1318 1318 margin = min(margin, limit - count); 1319 1319 else 1320 1320 margin = 0; ··· 1451 1451 memcg_page_state(memcg, WORKINGSET_REFAULT)); 1452 1452 seq_buf_printf(&s, "workingset_activate %lu\n", 1453 1453 memcg_page_state(memcg, WORKINGSET_ACTIVATE)); 1454 + seq_buf_printf(&s, "workingset_restore %lu\n", 1455 + memcg_page_state(memcg, WORKINGSET_RESTORE)); 1454 1456 seq_buf_printf(&s, "workingset_nodereclaim %lu\n", 1455 1457 memcg_page_state(memcg, WORKINGSET_NODERECLAIM)); 1456 1458 ··· 2252 2250 gfp_t gfp_mask) 2253 2251 { 2254 2252 do { 2255 - if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high)) 2253 + if (page_counter_read(&memcg->memory) <= 2254 + READ_ONCE(memcg->memory.high)) 2256 2255 continue; 2257 2256 memcg_memory_event(memcg, MEMCG_HIGH); 2258 2257 try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); ··· 2322 2319 #define MEMCG_DELAY_PRECISION_SHIFT 20 2323 2320 #define MEMCG_DELAY_SCALING_SHIFT 14 2324 2321 2322 + static u64 calculate_overage(unsigned long usage, unsigned long high) 2323 + { 2324 + u64 overage; 2325 + 2326 + if (usage <= high) 2327 + return 0; 2328 + 2329 + /* 2330 + * Prevent division by 0 in overage calculation by acting as if 2331 + * it was a threshold of 1 page 2332 + */ 2333 + high = max(high, 1UL); 2334 + 2335 + overage = usage - high; 2336 + overage <<= MEMCG_DELAY_PRECISION_SHIFT; 2337 + return div64_u64(overage, high); 2338 + } 2339 + 2340 + static u64 mem_find_max_overage(struct mem_cgroup *memcg) 2341 + { 2342 + u64 overage, max_overage = 0; 2343 + 2344 + do { 2345 + overage = calculate_overage(page_counter_read(&memcg->memory), 2346 + READ_ONCE(memcg->memory.high)); 2347 + max_overage = max(overage, max_overage); 2348 + } while ((memcg = parent_mem_cgroup(memcg)) && 2349 + !mem_cgroup_is_root(memcg)); 
2350 + 2351 + return max_overage; 2352 + } 2353 + 2354 + static u64 swap_find_max_overage(struct mem_cgroup *memcg) 2355 + { 2356 + u64 overage, max_overage = 0; 2357 + 2358 + do { 2359 + overage = calculate_overage(page_counter_read(&memcg->swap), 2360 + READ_ONCE(memcg->swap.high)); 2361 + if (overage) 2362 + memcg_memory_event(memcg, MEMCG_SWAP_HIGH); 2363 + max_overage = max(overage, max_overage); 2364 + } while ((memcg = parent_mem_cgroup(memcg)) && 2365 + !mem_cgroup_is_root(memcg)); 2366 + 2367 + return max_overage; 2368 + } 2369 + 2325 2370 /* 2326 2371 * Get the number of jiffies that we should penalise a mischievous cgroup which 2327 2372 * is exceeding its memory.high by checking both it and its ancestors. 2328 2373 */ 2329 2374 static unsigned long calculate_high_delay(struct mem_cgroup *memcg, 2330 - unsigned int nr_pages) 2375 + unsigned int nr_pages, 2376 + u64 max_overage) 2331 2377 { 2332 2378 unsigned long penalty_jiffies; 2333 - u64 max_overage = 0; 2334 - 2335 - do { 2336 - unsigned long usage, high; 2337 - u64 overage; 2338 - 2339 - usage = page_counter_read(&memcg->memory); 2340 - high = READ_ONCE(memcg->high); 2341 - 2342 - if (usage <= high) 2343 - continue; 2344 - 2345 - /* 2346 - * Prevent division by 0 in overage calculation by acting as if 2347 - * it was a threshold of 1 page 2348 - */ 2349 - high = max(high, 1UL); 2350 - 2351 - overage = usage - high; 2352 - overage <<= MEMCG_DELAY_PRECISION_SHIFT; 2353 - overage = div64_u64(overage, high); 2354 - 2355 - if (overage > max_overage) 2356 - max_overage = overage; 2357 - } while ((memcg = parent_mem_cgroup(memcg)) && 2358 - !mem_cgroup_is_root(memcg)); 2359 2379 2360 2380 if (!max_overage) 2361 2381 return 0; ··· 2403 2377 * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or 2404 2378 * larger the current charge patch is than that. 
2405 2379 */ 2406 - penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH; 2407 - 2408 - /* 2409 - * Clamp the max delay per usermode return so as to still keep the 2410 - * application moving forwards and also permit diagnostics, albeit 2411 - * extremely slowly. 2412 - */ 2413 - return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES); 2380 + return penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH; 2414 2381 } 2415 2382 2416 2383 /* ··· 2428 2409 * memory.high is breached and reclaim is unable to keep up. Throttle 2429 2410 * allocators proactively to slow down excessive growth. 2430 2411 */ 2431 - penalty_jiffies = calculate_high_delay(memcg, nr_pages); 2412 + penalty_jiffies = calculate_high_delay(memcg, nr_pages, 2413 + mem_find_max_overage(memcg)); 2414 + 2415 + penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2416 + swap_find_max_overage(memcg)); 2417 + 2418 + /* 2419 + * Clamp the max delay per usermode return so as to still keep the 2420 + * application moving forwards and also permit diagnostics, albeit 2421 + * extremely slowly. 2422 + */ 2423 + penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES); 2432 2424 2433 2425 /* 2434 2426 * Don't sleep if the amount of jiffies this memcg owes us is so low ··· 2624 2594 * reclaim, the cost of mismatch is negligible. 
2625 2595 */ 2626 2596 do { 2627 - if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) { 2628 - /* Don't bother a random interrupted task */ 2629 - if (in_interrupt()) { 2597 + bool mem_high, swap_high; 2598 + 2599 + mem_high = page_counter_read(&memcg->memory) > 2600 + READ_ONCE(memcg->memory.high); 2601 + swap_high = page_counter_read(&memcg->swap) > 2602 + READ_ONCE(memcg->swap.high); 2603 + 2604 + /* Don't bother a random interrupted task */ 2605 + if (in_interrupt()) { 2606 + if (mem_high) { 2630 2607 schedule_work(&memcg->high_work); 2631 2608 break; 2632 2609 } 2610 + continue; 2611 + } 2612 + 2613 + if (mem_high || swap_high) { 2614 + /* 2615 + * The allocating tasks in this cgroup will need to do 2616 + * reclaim or be throttled to prevent further growth 2617 + * of the memory or swap footprints. 2618 + * 2619 + * Target some best-effort fairness between the tasks, 2620 + * and distribute reclaim work and delay penalties 2621 + * based on how much each task is actually allocating. 2622 + */ 2633 2623 current->memcg_nr_pages_over_high += batch; 2634 2624 set_notify_resume(current); 2635 2625 break; ··· 2852 2802 2853 2803 static inline bool memcg_kmem_bypass(void) 2854 2804 { 2855 - if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD)) 2805 + if (in_interrupt()) 2806 + return true; 2807 + 2808 + /* Allow remote memcg charging in kthread contexts. 
*/ 2809 + if ((!current->mm || (current->flags & PF_KTHREAD)) && 2810 + !current->active_memcg) 2856 2811 return true; 2857 2812 return false; 2858 2813 } ··· 4385 4330 4386 4331 *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); 4387 4332 4388 - /* this should eventually include NR_UNSTABLE_NFS */ 4389 4333 *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); 4390 4334 *pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) + 4391 4335 memcg_exact_page_state(memcg, NR_ACTIVE_FILE); ··· 4392 4338 4393 4339 while ((parent = parent_mem_cgroup(memcg))) { 4394 4340 unsigned long ceiling = min(READ_ONCE(memcg->memory.max), 4395 - READ_ONCE(memcg->high)); 4341 + READ_ONCE(memcg->memory.high)); 4396 4342 unsigned long used = page_counter_read(&memcg->memory); 4397 4343 4398 4344 *pheadroom = min(*pheadroom, ceiling - min(ceiling, used)); ··· 5117 5063 if (IS_ERR(memcg)) 5118 5064 return ERR_CAST(memcg); 5119 5065 5120 - WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); 5066 + page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX); 5121 5067 memcg->soft_limit = PAGE_COUNTER_MAX; 5068 + page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); 5122 5069 if (parent) { 5123 5070 memcg->swappiness = mem_cgroup_swappiness(parent); 5124 5071 memcg->oom_kill_disable = parent->oom_kill_disable; ··· 5271 5216 page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX); 5272 5217 page_counter_set_min(&memcg->memory, 0); 5273 5218 page_counter_set_low(&memcg->memory, 0); 5274 - WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); 5219 + page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX); 5275 5220 memcg->soft_limit = PAGE_COUNTER_MAX; 5221 + page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); 5276 5222 memcg_wb_domain_size_changed(memcg); 5277 5223 } 5278 5224 ··· 6071 6015 6072 6016 static int memory_high_show(struct seq_file *m, void *v) 6073 6017 { 6074 - return seq_puts_memcg_tunable(m, READ_ONCE(mem_cgroup_from_seq(m)->high)); 6018 + return seq_puts_memcg_tunable(m, 6019 + 
READ_ONCE(mem_cgroup_from_seq(m)->memory.high)); 6075 6020 } 6076 6021 6077 6022 static ssize_t memory_high_write(struct kernfs_open_file *of, ··· 6089 6032 if (err) 6090 6033 return err; 6091 6034 6092 - WRITE_ONCE(memcg->high, high); 6035 + page_counter_set_high(&memcg->memory, high); 6093 6036 6094 6037 for (;;) { 6095 6038 unsigned long nr_pages = page_counter_read(&memcg->memory); ··· 6284 6227 }, 6285 6228 { 6286 6229 .name = "stat", 6287 - .flags = CFTYPE_NOT_ON_ROOT, 6288 6230 .seq_show = memory_stat_show, 6289 6231 }, 6290 6232 { ··· 7187 7131 if (!memcg) 7188 7132 return false; 7189 7133 7190 - for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) 7191 - if (page_counter_read(&memcg->swap) * 2 >= 7192 - READ_ONCE(memcg->swap.max)) 7134 + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { 7135 + unsigned long usage = page_counter_read(&memcg->swap); 7136 + 7137 + if (usage * 2 >= READ_ONCE(memcg->swap.high) || 7138 + usage * 2 >= READ_ONCE(memcg->swap.max)) 7193 7139 return true; 7140 + } 7194 7141 7195 7142 return false; 7196 7143 } ··· 7223 7164 return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; 7224 7165 } 7225 7166 7167 + static int swap_high_show(struct seq_file *m, void *v) 7168 + { 7169 + return seq_puts_memcg_tunable(m, 7170 + READ_ONCE(mem_cgroup_from_seq(m)->swap.high)); 7171 + } 7172 + 7173 + static ssize_t swap_high_write(struct kernfs_open_file *of, 7174 + char *buf, size_t nbytes, loff_t off) 7175 + { 7176 + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 7177 + unsigned long high; 7178 + int err; 7179 + 7180 + buf = strstrip(buf); 7181 + err = page_counter_memparse(buf, "max", &high); 7182 + if (err) 7183 + return err; 7184 + 7185 + page_counter_set_high(&memcg->swap, high); 7186 + 7187 + return nbytes; 7188 + } 7189 + 7226 7190 static int swap_max_show(struct seq_file *m, void *v) 7227 7191 { 7228 7192 return seq_puts_memcg_tunable(m, ··· 7273 7191 { 7274 7192 struct mem_cgroup *memcg = 
mem_cgroup_from_seq(m); 7275 7193 7194 + seq_printf(m, "high %lu\n", 7195 + atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH])); 7276 7196 seq_printf(m, "max %lu\n", 7277 7197 atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX])); 7278 7198 seq_printf(m, "fail %lu\n", ··· 7288 7204 .name = "swap.current", 7289 7205 .flags = CFTYPE_NOT_ON_ROOT, 7290 7206 .read_u64 = swap_current_read, 7207 + }, 7208 + { 7209 + .name = "swap.high", 7210 + .flags = CFTYPE_NOT_ON_ROOT, 7211 + .seq_show = swap_high_show, 7212 + .write = swap_high_write, 7291 7213 }, 7292 7214 { 7293 7215 .name = "swap.max",
+9 -6
mm/memory-failure.c
··· 210 210 { 211 211 struct task_struct *t = tk->tsk; 212 212 short addr_lsb = tk->size_shift; 213 - int ret; 213 + int ret = 0; 214 214 215 - pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", 216 - pfn, t->comm, t->pid); 215 + if ((t->mm == current->mm) || !(flags & MF_ACTION_REQUIRED)) 216 + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", 217 + pfn, t->comm, t->pid); 217 218 218 - if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { 219 - ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr, 220 - addr_lsb); 219 + if (flags & MF_ACTION_REQUIRED) { 220 + if (t->mm == current->mm) 221 + ret = force_sig_mceerr(BUS_MCEERR_AR, 222 + (void __user *)tk->addr, addr_lsb); 223 + /* send no signal to non-current processes */ 221 224 } else { 222 225 /* 223 226 * Don't use force here, it's convenient if the signal
-2
mm/memory.c
··· 802 802 get_page(page); 803 803 page_dup_rmap(page, false); 804 804 rss[mm_counter(page)]++; 805 - } else if (pte_devmap(pte)) { 806 - page = pte_page(pte); 807 805 } 808 806 809 807 out_set_pte:
+2 -7
mm/migrate.c
··· 797 797 if (rc != MIGRATEPAGE_SUCCESS) 798 798 goto unlock_buffers; 799 799 800 - ClearPagePrivate(page); 801 - set_page_private(newpage, page_private(page)); 802 - set_page_private(page, 0); 803 - put_page(page); 800 + attach_page_private(newpage, detach_page_private(page)); 804 801 get_page(newpage); 805 802 806 803 bh = head; ··· 806 809 bh = bh->b_this_page; 807 810 808 811 } while (bh != head); 809 - 810 - SetPagePrivate(newpage); 811 812 812 813 if (mode != MIGRATE_SYNC_NO_COPY) 813 814 migrate_page_copy(newpage, page); ··· 1027 1032 * to the LRU. Later, when the IO completes the pages are 1028 1033 * marked uptodate and unlocked. However, the queueing 1029 1034 * could be merging multiple pages for one bio (e.g. 1030 - * mpage_readpages). If an allocation happens for the 1035 + * mpage_readahead). If an allocation happens for the 1031 1036 * second or third page, the process can end up locking 1032 1037 * the same page twice and deadlocking. Rather than 1033 1038 * trying to be clever about what pages can be locked,
+10 -6
mm/mm_init.c
··· 67 67 unsigned long or_mask, add_mask; 68 68 69 69 shift = 8 * sizeof(unsigned long); 70 - width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT; 70 + width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH 71 + - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH; 71 72 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths", 72 - "Section %d Node %d Zone %d Lastcpupid %d Flags %d\n", 73 + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n", 73 74 SECTIONS_WIDTH, 74 75 NODES_WIDTH, 75 76 ZONES_WIDTH, 76 77 LAST_CPUPID_WIDTH, 78 + KASAN_TAG_WIDTH, 77 79 NR_PAGEFLAGS); 78 80 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts", 79 - "Section %d Node %d Zone %d Lastcpupid %d\n", 81 + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n", 80 82 SECTIONS_SHIFT, 81 83 NODES_SHIFT, 82 84 ZONES_SHIFT, 83 - LAST_CPUPID_SHIFT); 85 + LAST_CPUPID_SHIFT, 86 + KASAN_TAG_WIDTH); 84 87 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts", 85 - "Section %lu Node %lu Zone %lu Lastcpupid %lu\n", 88 + "Section %lu Node %lu Zone %lu Lastcpupid %lu Kasantag %lu\n", 86 89 (unsigned long)SECTIONS_PGSHIFT, 87 90 (unsigned long)NODES_PGSHIFT, 88 91 (unsigned long)ZONES_PGSHIFT, 89 - (unsigned long)LAST_CPUPID_PGSHIFT); 92 + (unsigned long)LAST_CPUPID_PGSHIFT, 93 + (unsigned long)KASAN_TAG_PGSHIFT); 90 94 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid", 91 95 "Node/Zone ID: %lu -> %lu\n", 92 96 (unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
+18 -28
mm/nommu.c
··· 140 140 } 141 141 EXPORT_SYMBOL(vfree); 142 142 143 - void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot) 143 + void *__vmalloc(unsigned long size, gfp_t gfp_mask) 144 144 { 145 145 /* 146 146 * You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc() ··· 150 150 } 151 151 EXPORT_SYMBOL(__vmalloc); 152 152 153 - void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags) 153 + void *__vmalloc_node_range(unsigned long size, unsigned long align, 154 + unsigned long start, unsigned long end, gfp_t gfp_mask, 155 + pgprot_t prot, unsigned long vm_flags, int node, 156 + const void *caller) 154 157 { 155 - return __vmalloc(size, flags, PAGE_KERNEL); 158 + return __vmalloc(size, gfp_mask); 159 + } 160 + 161 + void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask, 162 + int node, const void *caller) 163 + { 164 + return __vmalloc(size, gfp_mask); 156 165 } 157 166 158 167 static void *__vmalloc_user_flags(unsigned long size, gfp_t flags) 159 168 { 160 169 void *ret; 161 170 162 - ret = __vmalloc(size, flags, PAGE_KERNEL); 171 + ret = __vmalloc(size, flags); 163 172 if (ret) { 164 173 struct vm_area_struct *vma; 165 174 ··· 187 178 return __vmalloc_user_flags(size, GFP_KERNEL | __GFP_ZERO); 188 179 } 189 180 EXPORT_SYMBOL(vmalloc_user); 190 - 191 - void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags) 192 - { 193 - return __vmalloc_user_flags(size, flags | __GFP_ZERO); 194 - } 195 - EXPORT_SYMBOL(vmalloc_user_node_flags); 196 181 197 182 struct page *vmalloc_to_page(const void *addr) 198 183 { ··· 233 230 */ 234 231 void *vmalloc(unsigned long size) 235 232 { 236 - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL); 233 + return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM); 237 234 } 238 235 EXPORT_SYMBOL(vmalloc); 239 236 ··· 251 248 */ 252 249 void *vzalloc(unsigned long size) 253 250 { 254 - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, 255 - PAGE_KERNEL); 251 + 
return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO); 256 252 } 257 253 EXPORT_SYMBOL(vzalloc); 258 254 ··· 304 302 305 303 void *vmalloc_exec(unsigned long size) 306 304 { 307 - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC); 305 + return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM); 308 306 } 309 307 310 308 /** ··· 316 314 */ 317 315 void *vmalloc_32(unsigned long size) 318 316 { 319 - return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL); 317 + return __vmalloc(size, GFP_KERNEL); 320 318 } 321 319 EXPORT_SYMBOL(vmalloc_32); 322 320 ··· 353 351 } 354 352 EXPORT_SYMBOL(vunmap); 355 353 356 - void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot) 354 + void *vm_map_ram(struct page **pages, unsigned int count, int node) 357 355 { 358 356 BUG(); 359 357 return NULL; ··· 370 368 { 371 369 } 372 370 EXPORT_SYMBOL_GPL(vm_unmap_aliases); 373 - 374 - /* 375 - * Implement a stub for vmalloc_sync_[un]mapping() if the architecture 376 - * chose not to have one. 377 - */ 378 - void __weak vmalloc_sync_mappings(void) 379 - { 380 - } 381 - 382 - void __weak vmalloc_sync_unmappings(void) 383 - { 384 - } 385 371 386 372 struct vm_struct *alloc_vm_area(size_t size, pte_t **ptes) 387 373 {
+38 -24
mm/page-writeback.c
··· 387 387 * Calculate @dtc->thresh and ->bg_thresh considering 388 388 * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller 389 389 * must ensure that @dtc->avail is set before calling this function. The 390 - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and 391 - * real-time tasks. 390 + * dirty limits will be lifted by 1/4 for real-time tasks. 392 391 */ 393 392 static void domain_dirty_limits(struct dirty_throttle_control *dtc) 394 393 { ··· 435 436 if (bg_thresh >= thresh) 436 437 bg_thresh = thresh / 2; 437 438 tsk = current; 438 - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { 439 + if (rt_task(tsk)) { 439 440 bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; 440 441 thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; 441 442 } ··· 485 486 else 486 487 dirty = vm_dirty_ratio * node_memory / 100; 487 488 488 - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) 489 + if (rt_task(tsk)) 489 490 dirty += dirty / 4; 490 491 491 492 return dirty; ··· 504 505 unsigned long nr_pages = 0; 505 506 506 507 nr_pages += node_page_state(pgdat, NR_FILE_DIRTY); 507 - nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS); 508 508 nr_pages += node_page_state(pgdat, NR_WRITEBACK); 509 509 510 510 return nr_pages <= limit; ··· 757 759 * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. 758 760 * 759 761 * Return: @wb's dirty limit in pages. The term "dirty" in the context of 760 - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. 762 + * dirty balancing includes all PG_dirty and PG_writeback pages. 761 763 */ 762 764 static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) 763 765 { ··· 1565 1567 struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? 
1566 1568 &mdtc_stor : NULL; 1567 1569 struct dirty_throttle_control *sdtc; 1568 - unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ 1570 + unsigned long nr_reclaimable; /* = file_dirty */ 1569 1571 long period; 1570 1572 long pause; 1571 1573 long max_pause; ··· 1585 1587 unsigned long m_thresh = 0; 1586 1588 unsigned long m_bg_thresh = 0; 1587 1589 1588 - /* 1589 - * Unstable writes are a feature of certain networked 1590 - * filesystems (i.e. NFS) in which data may have been 1591 - * written to the server's write cache, but has not yet 1592 - * been flushed to permanent storage. 1593 - */ 1594 - nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) + 1595 - global_node_page_state(NR_UNSTABLE_NFS); 1590 + nr_reclaimable = global_node_page_state(NR_FILE_DIRTY); 1596 1591 gdtc->avail = global_dirtyable_memory(); 1597 1592 gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK); 1598 1593 ··· 1644 1653 if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && 1645 1654 (!mdtc || 1646 1655 m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { 1647 - unsigned long intv = dirty_poll_interval(dirty, thresh); 1648 - unsigned long m_intv = ULONG_MAX; 1656 + unsigned long intv; 1657 + unsigned long m_intv; 1658 + 1659 + free_running: 1660 + intv = dirty_poll_interval(dirty, thresh); 1661 + m_intv = ULONG_MAX; 1649 1662 1650 1663 current->dirty_paused_when = now; 1651 1664 current->nr_dirtied = 0; ··· 1668 1673 * Calculate global domain's pos_ratio and select the 1669 1674 * global dtc by default. 1670 1675 */ 1671 - if (!strictlimit) 1676 + if (!strictlimit) { 1672 1677 wb_dirty_limits(gdtc); 1678 + 1679 + if ((current->flags & PF_LOCAL_THROTTLE) && 1680 + gdtc->wb_dirty < 1681 + dirty_freerun_ceiling(gdtc->wb_thresh, 1682 + gdtc->wb_bg_thresh)) 1683 + /* 1684 + * LOCAL_THROTTLE tasks must not be throttled 1685 + * when below the per-wb freerun ceiling. 
1686 + */ 1687 + goto free_running; 1688 + } 1673 1689 1674 1690 dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && 1675 1691 ((gdtc->dirty > gdtc->thresh) || strictlimit); ··· 1695 1689 * both global and memcg domains. Choose the one 1696 1690 * w/ lower pos_ratio. 1697 1691 */ 1698 - if (!strictlimit) 1692 + if (!strictlimit) { 1699 1693 wb_dirty_limits(mdtc); 1700 1694 1695 + if ((current->flags & PF_LOCAL_THROTTLE) && 1696 + mdtc->wb_dirty < 1697 + dirty_freerun_ceiling(mdtc->wb_thresh, 1698 + mdtc->wb_bg_thresh)) 1699 + /* 1700 + * LOCAL_THROTTLE tasks must not be 1701 + * throttled when below the per-wb 1702 + * freerun ceiling. 1703 + */ 1704 + goto free_running; 1705 + } 1701 1706 dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && 1702 1707 ((mdtc->dirty > mdtc->thresh) || strictlimit); 1703 1708 ··· 1955 1938 * as we're trying to decide whether to put more under writeback. 1956 1939 */ 1957 1940 gdtc->avail = global_dirtyable_memory(); 1958 - gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) + 1959 - global_node_page_state(NR_UNSTABLE_NFS); 1941 + gdtc->dirty = global_node_page_state(NR_FILE_DIRTY); 1960 1942 domain_dirty_limits(gdtc); 1961 1943 1962 1944 if (gdtc->dirty > gdtc->bg_thresh) ··· 2180 2164 int error; 2181 2165 struct pagevec pvec; 2182 2166 int nr_pages; 2183 - pgoff_t uninitialized_var(writeback_index); 2184 2167 pgoff_t index; 2185 2168 pgoff_t end; /* Inclusive */ 2186 2169 pgoff_t done_index; ··· 2188 2173 2189 2174 pagevec_init(&pvec); 2190 2175 if (wbc->range_cyclic) { 2191 - writeback_index = mapping->writeback_index; /* prev offset */ 2192 - index = writeback_index; 2176 + index = mapping->writeback_index; /* prev offset */ 2193 2177 end = -1; 2194 2178 } else { 2195 2179 index = wbc->range_start >> PAGE_SHIFT;
+2 -5
mm/page_alloc.c
··· 5319 5319 5320 5320 printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" 5321 5321 " active_file:%lu inactive_file:%lu isolated_file:%lu\n" 5322 - " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" 5322 + " unevictable:%lu dirty:%lu writeback:%lu\n" 5323 5323 " slab_reclaimable:%lu slab_unreclaimable:%lu\n" 5324 5324 " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" 5325 5325 " free:%lu free_pcp:%lu free_cma:%lu\n", ··· 5332 5332 global_node_page_state(NR_UNEVICTABLE), 5333 5333 global_node_page_state(NR_FILE_DIRTY), 5334 5334 global_node_page_state(NR_WRITEBACK), 5335 - global_node_page_state(NR_UNSTABLE_NFS), 5336 5335 global_node_page_state(NR_SLAB_RECLAIMABLE), 5337 5336 global_node_page_state(NR_SLAB_UNRECLAIMABLE), 5338 5337 global_node_page_state(NR_FILE_MAPPED), ··· 5364 5365 " anon_thp: %lukB" 5365 5366 #endif 5366 5367 " writeback_tmp:%lukB" 5367 - " unstable:%lukB" 5368 5368 " all_unreclaimable? %s" 5369 5369 "\n", 5370 5370 pgdat->node_id, ··· 5385 5387 K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), 5386 5388 #endif 5387 5389 K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), 5388 - K(node_page_state(pgdat, NR_UNSTABLE_NFS)), 5389 5390 pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? 5390 5391 "yes" : "no"); 5391 5392 } ··· 8250 8253 table = memblock_alloc_raw(size, 8251 8254 SMP_CACHE_BYTES); 8252 8255 } else if (get_order(size) >= MAX_ORDER || hashdist) { 8253 - table = __vmalloc(size, gfp_flags, PAGE_KERNEL); 8256 + table = __vmalloc(size, gfp_flags); 8254 8257 virt = true; 8255 8258 } else { 8256 8259 /*
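The `__vmalloc(size, gfp_flags, PAGE_KERNEL)` to `__vmalloc(size, gfp_flags)` change in this hunk is the visible end of the series-wide removal of the pgprot argument: every caller passed PAGE_KERNEL, so the parameter was dropped. The surrounding logic (fall back to vmalloc when the table is too large for the buddy allocator, or when `hashdist` forces it) is unchanged. A minimal userspace sketch of that policy, with illustrative PAGE_SHIFT/MAX_ORDER stand-ins rather than the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch (not kernel code) of the hash-table allocation policy
 * the hunk above preserves. PAGE_SHIFT and MAX_ORDER are illustrative. */
#define PAGE_SHIFT 12
#define MAX_ORDER  11

/* Smallest order n such that (1 << PAGE_SHIFT) << n covers size. */
static int get_order(size_t size)
{
	int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* Nonzero when the table must come from vmalloc:
 * now __vmalloc(size, gfp) with no pgprot argument. */
static int use_vmalloc(size_t size, int hashdist)
{
	return get_order(size) >= MAX_ORDER || hashdist;
}
```

The sketch shows only the decision, not the allocation itself; `memblock_alloc_raw()` for early boot is omitted.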
+1 -1
mm/percpu.c
··· 482 482 if (size <= PAGE_SIZE) 483 483 return kzalloc(size, gfp); 484 484 else 485 - return __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL); 485 + return __vmalloc(size, gfp | __GFP_ZERO); 486 486 } 487 487 488 488 /**
+16 -1
mm/ptdump.c
··· 36 36 return note_kasan_page_table(walk, addr); 37 37 #endif 38 38 39 + if (st->effective_prot) 40 + st->effective_prot(st, 0, pgd_val(val)); 41 + 39 42 if (pgd_leaf(val)) 40 43 st->note_page(st, addr, 0, pgd_val(val)); 41 44 ··· 55 52 if (p4d_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pud))) 56 53 return note_kasan_page_table(walk, addr); 57 54 #endif 55 + 56 + if (st->effective_prot) 57 + st->effective_prot(st, 1, p4d_val(val)); 58 58 59 59 if (p4d_leaf(val)) 60 60 st->note_page(st, addr, 1, p4d_val(val)); ··· 76 70 return note_kasan_page_table(walk, addr); 77 71 #endif 78 72 73 + if (st->effective_prot) 74 + st->effective_prot(st, 2, pud_val(val)); 75 + 79 76 if (pud_leaf(val)) 80 77 st->note_page(st, addr, 2, pud_val(val)); 81 78 ··· 96 87 return note_kasan_page_table(walk, addr); 97 88 #endif 98 89 90 + if (st->effective_prot) 91 + st->effective_prot(st, 3, pmd_val(val)); 99 92 if (pmd_leaf(val)) 100 93 st->note_page(st, addr, 3, pmd_val(val)); 101 94 ··· 108 97 unsigned long next, struct mm_walk *walk) 109 98 { 110 99 struct ptdump_state *st = walk->private; 100 + pte_t val = READ_ONCE(*pte); 111 101 112 - st->note_page(st, addr, 4, pte_val(READ_ONCE(*pte))); 102 + if (st->effective_prot) 103 + st->effective_prot(st, 4, pte_val(val)); 104 + 105 + st->note_page(st, addr, 4, pte_val(val)); 113 106 114 107 return 0; 115 108 }
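The mm/ptdump.c hunks all follow one pattern: at every page-table level (pgd through pte, levels 0 to 4), an optional `effective_prot` hook fires before the mandatory `note_page` hook. A hypothetical userspace miniature of that dispatch, with counters standing in for the real callbacks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of ptdump_state: note_page is mandatory,
 * effective_prot is optional and must be NULL-checked before each call,
 * exactly as the hunk does at every level. */
struct ptdump_state {
	void (*note_page)(struct ptdump_state *st, int level, uint64_t val);
	void (*effective_prot)(struct ptdump_state *st, int level, uint64_t val);
	int prot_calls;
	int note_calls;
};

static void visit_entry(struct ptdump_state *st, int level, uint64_t val)
{
	if (st->effective_prot)		/* hook is optional */
		st->effective_prot(st, level, val);
	st->note_page(st, level, val);
}

static void count_note(struct ptdump_state *st, int level, uint64_t val)
{
	(void)level; (void)val;
	st->note_calls++;
}

static void count_prot(struct ptdump_state *st, int level, uint64_t val)
{
	(void)level; (void)val;
	st->prot_calls++;
}
```

Walking all five levels with the hook installed fires it five times; clearing it leaves `note_page` as the sole consumer, which is why existing ptdump users need no change.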
+166 -109
mm/readahead.c
··· 22 22 #include <linux/mm_inline.h> 23 23 #include <linux/blk-cgroup.h> 24 24 #include <linux/fadvise.h> 25 + #include <linux/sched/mm.h> 25 26 26 27 #include "internal.h" 27 28 ··· 114 113 115 114 EXPORT_SYMBOL(read_cache_pages); 116 115 117 - static int read_pages(struct address_space *mapping, struct file *filp, 118 - struct list_head *pages, unsigned int nr_pages, gfp_t gfp) 116 + static void read_pages(struct readahead_control *rac, struct list_head *pages, 117 + bool skip_page) 119 118 { 119 + const struct address_space_operations *aops = rac->mapping->a_ops; 120 + struct page *page; 120 121 struct blk_plug plug; 121 - unsigned page_idx; 122 - int ret; 122 + 123 + if (!readahead_count(rac)) 124 + goto out; 123 125 124 126 blk_start_plug(&plug); 125 127 126 - if (mapping->a_ops->readpages) { 127 - ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages); 128 + if (aops->readahead) { 129 + aops->readahead(rac); 130 + /* Clean up the remaining pages */ 131 + while ((page = readahead_page(rac))) { 132 + unlock_page(page); 133 + put_page(page); 134 + } 135 + } else if (aops->readpages) { 136 + aops->readpages(rac->file, rac->mapping, pages, 137 + readahead_count(rac)); 128 138 /* Clean up the remaining pages */ 129 139 put_pages_list(pages); 130 - goto out; 140 + rac->_index += rac->_nr_pages; 141 + rac->_nr_pages = 0; 142 + } else { 143 + while ((page = readahead_page(rac))) { 144 + aops->readpage(rac->file, page); 145 + put_page(page); 146 + } 131 147 } 132 148 133 - for (page_idx = 0; page_idx < nr_pages; page_idx++) { 134 - struct page *page = lru_to_page(pages); 135 - list_del(&page->lru); 136 - if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) 137 - mapping->a_ops->readpage(filp, page); 138 - put_page(page); 139 - } 140 - ret = 0; 141 - 142 - out: 143 149 blk_finish_plug(&plug); 144 150 145 - return ret; 151 + BUG_ON(!list_empty(pages)); 152 + BUG_ON(readahead_count(rac)); 153 + 154 + out: 155 + if (skip_page) 156 + rac->_index++; 146 
157 } 147 158 148 - /* 149 - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates 150 - * the pages first, then submits them for I/O. This avoids the very bad 151 - * behaviour which would occur if page allocations are causing VM writeback. 152 - * We really don't want to intermingle reads and writes like that. 159 + /** 160 + * page_cache_readahead_unbounded - Start unchecked readahead. 161 + * @mapping: File address space. 162 + * @file: This instance of the open file; used for authentication. 163 + * @index: First page index to read. 164 + * @nr_to_read: The number of pages to read. 165 + * @lookahead_size: Where to start the next readahead. 153 166 * 154 - * Returns the number of pages requested, or the maximum amount of I/O allowed. 167 + * This function is for filesystems to call when they want to start 168 + * readahead beyond a file's stated i_size. This is almost certainly 169 + * not the function you want to call. Use page_cache_async_readahead() 170 + * or page_cache_sync_readahead() instead. 171 + * 172 + * Context: File is referenced by caller. Mutexes may be held by caller. 173 + * May sleep, but will not reenter filesystem to reclaim memory. 
155 174 */ 156 - unsigned int __do_page_cache_readahead(struct address_space *mapping, 157 - struct file *filp, pgoff_t offset, unsigned long nr_to_read, 175 + void page_cache_readahead_unbounded(struct address_space *mapping, 176 + struct file *file, pgoff_t index, unsigned long nr_to_read, 158 177 unsigned long lookahead_size) 159 178 { 160 - struct inode *inode = mapping->host; 161 - struct page *page; 162 - unsigned long end_index; /* The last page we want to read */ 163 179 LIST_HEAD(page_pool); 164 - int page_idx; 165 - unsigned int nr_pages = 0; 166 - loff_t isize = i_size_read(inode); 167 180 gfp_t gfp_mask = readahead_gfp_mask(mapping); 181 + struct readahead_control rac = { 182 + .mapping = mapping, 183 + .file = file, 184 + ._index = index, 185 + }; 186 + unsigned long i; 168 187 169 - if (isize == 0) 170 - goto out; 171 - 172 - end_index = ((isize - 1) >> PAGE_SHIFT); 188 + /* 189 + * Partway through the readahead operation, we will have added 190 + * locked pages to the page cache, but will not yet have submitted 191 + * them for I/O. Adding another page may need to allocate memory, 192 + * which can trigger memory reclaim. Telling the VM we're in 193 + * the middle of a filesystem operation will cause it to not 194 + * touch file-backed pages, preventing a deadlock. Most (all?) 195 + * filesystems already specify __GFP_NOFS in their mapping's 196 + * gfp_mask, but let's be explicit here. 197 + */ 198 + unsigned int nofs = memalloc_nofs_save(); 173 199 174 200 /* 175 201 * Preallocate as many pages as we will need. 
176 202 */ 177 - for (page_idx = 0; page_idx < nr_to_read; page_idx++) { 178 - pgoff_t page_offset = offset + page_idx; 203 + for (i = 0; i < nr_to_read; i++) { 204 + struct page *page = xa_load(&mapping->i_pages, index + i); 179 205 180 - if (page_offset > end_index) 181 - break; 206 + BUG_ON(index + i != rac._index + rac._nr_pages); 182 207 183 - page = xa_load(&mapping->i_pages, page_offset); 184 208 if (page && !xa_is_value(page)) { 185 209 /* 186 - * Page already present? Kick off the current batch of 187 - * contiguous pages before continuing with the next 188 - * batch. 210 + * Page already present? Kick off the current batch 211 + * of contiguous pages before continuing with the 212 + * next batch. This page may be the one we would 213 + * have intended to mark as Readahead, but we don't 214 + * have a stable reference to this page, and it's 215 + * not worth getting one just for that. 189 216 */ 190 - if (nr_pages) 191 - read_pages(mapping, filp, &page_pool, nr_pages, 192 - gfp_mask); 193 - nr_pages = 0; 217 + read_pages(&rac, &page_pool, true); 194 218 continue; 195 219 } 196 220 197 221 page = __page_cache_alloc(gfp_mask); 198 222 if (!page) 199 223 break; 200 - page->index = page_offset; 201 - list_add(&page->lru, &page_pool); 202 - if (page_idx == nr_to_read - lookahead_size) 224 + if (mapping->a_ops->readpages) { 225 + page->index = index + i; 226 + list_add(&page->lru, &page_pool); 227 + } else if (add_to_page_cache_lru(page, mapping, index + i, 228 + gfp_mask) < 0) { 229 + put_page(page); 230 + read_pages(&rac, &page_pool, true); 231 + continue; 232 + } 233 + if (i == nr_to_read - lookahead_size) 203 234 SetPageReadahead(page); 204 - nr_pages++; 235 + rac._nr_pages++; 205 236 } 206 237 207 238 /* ··· 241 208 * uptodate then the caller will launch readpage again, and 242 209 * will then handle the error. 
243 210 */ 244 - if (nr_pages) 245 - read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask); 246 - BUG_ON(!list_empty(&page_pool)); 247 - out: 248 - return nr_pages; 211 + read_pages(&rac, &page_pool, false); 212 + memalloc_nofs_restore(nofs); 213 + } 214 + EXPORT_SYMBOL_GPL(page_cache_readahead_unbounded); 215 + 216 + /* 217 + * __do_page_cache_readahead() actually reads a chunk of disk. It allocates 218 + * the pages first, then submits them for I/O. This avoids the very bad 219 + * behaviour which would occur if page allocations are causing VM writeback. 220 + * We really don't want to intermingle reads and writes like that. 221 + */ 222 + void __do_page_cache_readahead(struct address_space *mapping, 223 + struct file *file, pgoff_t index, unsigned long nr_to_read, 224 + unsigned long lookahead_size) 225 + { 226 + struct inode *inode = mapping->host; 227 + loff_t isize = i_size_read(inode); 228 + pgoff_t end_index; /* The last page we want to read */ 229 + 230 + if (isize == 0) 231 + return; 232 + 233 + end_index = (isize - 1) >> PAGE_SHIFT; 234 + if (index > end_index) 235 + return; 236 + /* Don't read past the page containing the last byte of the file */ 237 + if (nr_to_read > end_index - index) 238 + nr_to_read = end_index - index + 1; 239 + 240 + page_cache_readahead_unbounded(mapping, file, index, nr_to_read, 241 + lookahead_size); 249 242 } 250 243 251 244 /* 252 245 * Chunk the readahead into 2 megabyte units, so that we don't pin too much 253 246 * memory at once. 
254 247 */ 255 - int force_page_cache_readahead(struct address_space *mapping, struct file *filp, 256 - pgoff_t offset, unsigned long nr_to_read) 248 + void force_page_cache_readahead(struct address_space *mapping, 249 + struct file *filp, pgoff_t index, unsigned long nr_to_read) 257 250 { 258 251 struct backing_dev_info *bdi = inode_to_bdi(mapping->host); 259 252 struct file_ra_state *ra = &filp->f_ra; 260 253 unsigned long max_pages; 261 254 262 - if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages)) 263 - return -EINVAL; 255 + if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages && 256 + !mapping->a_ops->readahead)) 257 + return; 264 258 265 259 /* 266 260 * If the request exceeds the readahead window, allow the read to ··· 300 240 301 241 if (this_chunk > nr_to_read) 302 242 this_chunk = nr_to_read; 303 - __do_page_cache_readahead(mapping, filp, offset, this_chunk, 0); 243 + __do_page_cache_readahead(mapping, filp, index, this_chunk, 0); 304 244 305 - offset += this_chunk; 245 + index += this_chunk; 306 246 nr_to_read -= this_chunk; 307 247 } 308 - return 0; 309 248 } 310 249 311 250 /* ··· 383 324 */ 384 325 385 326 /* 386 - * Count contiguously cached pages from @offset-1 to @offset-@max, 327 + * Count contiguously cached pages from @index-1 to @index-@max, 387 328 * this count is a conservative estimation of 388 329 * - length of the sequential read sequence, or 389 330 * - thrashing threshold in memory tight systems 390 331 */ 391 332 static pgoff_t count_history_pages(struct address_space *mapping, 392 - pgoff_t offset, unsigned long max) 333 + pgoff_t index, unsigned long max) 393 334 { 394 335 pgoff_t head; 395 336 396 337 rcu_read_lock(); 397 - head = page_cache_prev_miss(mapping, offset - 1, max); 338 + head = page_cache_prev_miss(mapping, index - 1, max); 398 339 rcu_read_unlock(); 399 340 400 - return offset - 1 - head; 341 + return index - 1 - head; 401 342 } 402 343 403 344 /* ··· 405 346 */ 406 347 static int 
try_context_readahead(struct address_space *mapping, 407 348 struct file_ra_state *ra, 408 - pgoff_t offset, 349 + pgoff_t index, 409 350 unsigned long req_size, 410 351 unsigned long max) 411 352 { 412 353 pgoff_t size; 413 354 414 - size = count_history_pages(mapping, offset, max); 355 + size = count_history_pages(mapping, index, max); 415 356 416 357 /* 417 358 * not enough history pages: ··· 424 365 * starts from beginning of file: 425 366 * it is a strong indication of long-run stream (or whole-file-read) 426 367 */ 427 - if (size >= offset) 368 + if (size >= index) 428 369 size *= 2; 429 370 430 - ra->start = offset; 371 + ra->start = index; 431 372 ra->size = min(size + req_size, max); 432 373 ra->async_size = 1; 433 374 ··· 437 378 /* 438 379 * A minimal readahead algorithm for trivial sequential/random reads. 439 380 */ 440 - static unsigned long 441 - ondemand_readahead(struct address_space *mapping, 442 - struct file_ra_state *ra, struct file *filp, 443 - bool hit_readahead_marker, pgoff_t offset, 444 - unsigned long req_size) 381 + static void ondemand_readahead(struct address_space *mapping, 382 + struct file_ra_state *ra, struct file *filp, 383 + bool hit_readahead_marker, pgoff_t index, 384 + unsigned long req_size) 445 385 { 446 386 struct backing_dev_info *bdi = inode_to_bdi(mapping->host); 447 387 unsigned long max_pages = ra->ra_pages; 448 388 unsigned long add_pages; 449 - pgoff_t prev_offset; 389 + pgoff_t prev_index; 450 390 451 391 /* 452 392 * If the request exceeds the readahead window, allow the read to ··· 457 399 /* 458 400 * start of file 459 401 */ 460 - if (!offset) 402 + if (!index) 461 403 goto initial_readahead; 462 404 463 405 /* 464 - * It's the expected callback offset, assume sequential access. 406 + * It's the expected callback index, assume sequential access. 465 407 * Ramp up sizes, and push forward the readahead window. 
466 408 */ 467 - if ((offset == (ra->start + ra->size - ra->async_size) || 468 - offset == (ra->start + ra->size))) { 409 + if ((index == (ra->start + ra->size - ra->async_size) || 410 + index == (ra->start + ra->size))) { 469 411 ra->start += ra->size; 470 412 ra->size = get_next_ra_size(ra, max_pages); 471 413 ra->async_size = ra->size; ··· 482 424 pgoff_t start; 483 425 484 426 rcu_read_lock(); 485 - start = page_cache_next_miss(mapping, offset + 1, max_pages); 427 + start = page_cache_next_miss(mapping, index + 1, max_pages); 486 428 rcu_read_unlock(); 487 429 488 - if (!start || start - offset > max_pages) 489 - return 0; 430 + if (!start || start - index > max_pages) 431 + return; 490 432 491 433 ra->start = start; 492 - ra->size = start - offset; /* old async_size */ 434 + ra->size = start - index; /* old async_size */ 493 435 ra->size += req_size; 494 436 ra->size = get_next_ra_size(ra, max_pages); 495 437 ra->async_size = ra->size; ··· 504 446 505 447 /* 506 448 * sequential cache miss 507 - * trivial case: (offset - prev_offset) == 1 508 - * unaligned reads: (offset - prev_offset) == 0 449 + * trivial case: (index - prev_index) == 1 450 + * unaligned reads: (index - prev_index) == 0 509 451 */ 510 - prev_offset = (unsigned long long)ra->prev_pos >> PAGE_SHIFT; 511 - if (offset - prev_offset <= 1UL) 452 + prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT; 453 + if (index - prev_index <= 1UL) 512 454 goto initial_readahead; 513 455 514 456 /* 515 457 * Query the page cache and look for the traces(cached history pages) 516 458 * that a sequential stream would leave behind. 517 459 */ 518 - if (try_context_readahead(mapping, ra, offset, req_size, max_pages)) 460 + if (try_context_readahead(mapping, ra, index, req_size, max_pages)) 519 461 goto readit; 520 462 521 463 /* 522 464 * standalone, small random read 523 465 * Read as is, and do not pollute the readahead state. 
524 466 */ 525 - return __do_page_cache_readahead(mapping, filp, offset, req_size, 0); 467 + __do_page_cache_readahead(mapping, filp, index, req_size, 0); 468 + return; 526 469 527 470 initial_readahead: 528 - ra->start = offset; 471 + ra->start = index; 529 472 ra->size = get_init_ra_size(req_size, max_pages); 530 473 ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size; 531 474 ··· 537 478 * the resulted next readahead window into the current one. 538 479 * Take care of maximum IO pages as above. 539 480 */ 540 - if (offset == ra->start && ra->size == ra->async_size) { 481 + if (index == ra->start && ra->size == ra->async_size) { 541 482 add_pages = get_next_ra_size(ra, max_pages); 542 483 if (ra->size + add_pages <= max_pages) { 543 484 ra->async_size = add_pages; ··· 548 489 } 549 490 } 550 491 551 - return ra_submit(ra, mapping, filp); 492 + ra_submit(ra, mapping, filp); 552 493 } 553 494 554 495 /** ··· 556 497 * @mapping: address_space which holds the pagecache and I/O vectors 557 498 * @ra: file_ra_state which holds the readahead state 558 499 * @filp: passed on to ->readpage() and ->readpages() 559 - * @offset: start offset into @mapping, in pagecache page-sized units 560 - * @req_size: hint: total size of the read which the caller is performing in 561 - * pagecache pages 500 + * @index: Index of first page to be read. 501 + * @req_count: Total number of pages being read by the caller. 562 502 * 563 503 * page_cache_sync_readahead() should be called when a cache miss happened: 564 504 * it will submit the read. 
The readahead logic may decide to piggyback more ··· 566 508 */ 567 509 void page_cache_sync_readahead(struct address_space *mapping, 568 510 struct file_ra_state *ra, struct file *filp, 569 - pgoff_t offset, unsigned long req_size) 511 + pgoff_t index, unsigned long req_count) 570 512 { 571 513 /* no read-ahead */ 572 514 if (!ra->ra_pages) ··· 577 519 578 520 /* be dumb */ 579 521 if (filp && (filp->f_mode & FMODE_RANDOM)) { 580 - force_page_cache_readahead(mapping, filp, offset, req_size); 522 + force_page_cache_readahead(mapping, filp, index, req_count); 581 523 return; 582 524 } 583 525 584 526 /* do read-ahead */ 585 - ondemand_readahead(mapping, ra, filp, false, offset, req_size); 527 + ondemand_readahead(mapping, ra, filp, false, index, req_count); 586 528 } 587 529 EXPORT_SYMBOL_GPL(page_cache_sync_readahead); 588 530 ··· 591 533 * @mapping: address_space which holds the pagecache and I/O vectors 592 534 * @ra: file_ra_state which holds the readahead state 593 535 * @filp: passed on to ->readpage() and ->readpages() 594 - * @page: the page at @offset which has the PG_readahead flag set 595 - * @offset: start offset into @mapping, in pagecache page-sized units 596 - * @req_size: hint: total size of the read which the caller is performing in 597 - * pagecache pages 536 + * @page: The page at @index which triggered the readahead call. 537 + * @index: Index of first page to be read. 538 + * @req_count: Total number of pages being read by the caller. 598 539 * 599 540 * page_cache_async_readahead() should be called when a page is used which 600 - * has the PG_readahead flag; this is a marker to suggest that the application 541 + * is marked as PageReadahead; this is a marker to suggest that the application 601 542 * has used up enough of the readahead window that we should start pulling in 602 543 * more pages. 
603 544 */ 604 545 void 605 546 page_cache_async_readahead(struct address_space *mapping, 606 547 struct file_ra_state *ra, struct file *filp, 607 - struct page *page, pgoff_t offset, 608 - unsigned long req_size) 548 + struct page *page, pgoff_t index, 549 + unsigned long req_count) 609 550 { 610 551 /* no read-ahead */ 611 552 if (!ra->ra_pages) ··· 628 571 return; 629 572 630 573 /* do read-ahead */ 631 - ondemand_readahead(mapping, ra, filp, true, offset, req_size); 574 + ondemand_readahead(mapping, ra, filp, true, index, req_count); 632 575 } 633 576 EXPORT_SYMBOL_GPL(page_cache_async_readahead); 634 577
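The core loop of the new `page_cache_readahead_unbounded()` batches contiguous uncached indices, and an index that is already present in the page cache submits the batch built so far and is skipped (the `read_pages(&rac, &page_pool, true)` calls), with one final unconditional submit at the end. A userspace sketch of just that batching, with plain arrays standing in for the page cache and `readahead_control`:

```c
#include <assert.h>

/* Illustrative userspace sketch, not the kernel structs: contiguous
 * uncached indices grow one batch; an already-cached index submits the
 * batch built so far and is skipped, mirroring read_pages(..., true). */
struct batch {
	unsigned long index;
	unsigned long nr;
};

static int plan_readahead(const int *cached, unsigned long index,
			  unsigned long nr_to_read, struct batch *out)
{
	int nbatch = 0;
	unsigned long start = index, nr = 0, i;

	for (i = 0; i < nr_to_read; i++) {
		if (cached[index + i]) {
			if (nr)		/* flush the current batch */
				out[nbatch++] = (struct batch){ start, nr };
			start = index + i + 1;	/* skip the cached page */
			nr = 0;
			continue;
		}
		nr++;
	}
	if (nr)		/* final read_pages(rac, &pool, false) */
		out[nbatch++] = (struct batch){ start, nr };
	return nbatch;
}
```

The kernel version additionally wraps the whole loop in `memalloc_nofs_save()`/`memalloc_nofs_restore()` so that page-cache allocations made with locked pages outstanding cannot recurse into filesystem reclaim.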
+2 -1
mm/slab_common.c
··· 1303 1303 kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache( 1304 1304 kmalloc_info[i].name[KMALLOC_DMA], 1305 1305 kmalloc_info[i].size, 1306 - SLAB_CACHE_DMA | flags, 0, 0); 1306 + SLAB_CACHE_DMA | flags, 0, 1307 + kmalloc_info[i].size); 1307 1308 } 1308 1309 } 1309 1310 #endif
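The slab_common.c hunk changes the last argument of `create_kmalloc_cache()` for the DMA caches from `0` to `kmalloc_info[i].size`: that argument is the usercopy window size, and a zero window rejects every copy to or from userspace under hardened usercopy. A simplified stand-in for the window check (not the kernel's `__check_heap_object()`), assuming a window of `[useroffset, useroffset + usersize)`:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the hardened-usercopy window test: a copy of 'len' bytes at
 * 'offset' into the object is allowed only if it stays inside the cache's
 * declared window. The hunk widens the DMA kmalloc caches' window from 0
 * to the full object size, matching the regular kmalloc caches. */
static int usercopy_allowed(size_t useroffset, size_t usersize,
			    size_t offset, size_t len)
{
	if (offset < useroffset)
		return 0;
	return len <= usersize && offset - useroffset <= usersize - len;
}
```

With the old `usersize = 0`, even a full-object copy out of a DMA kmalloc allocation would be rejected; the new value restores parity with the non-DMA caches.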
+45 -22
mm/slub.c
··· 679 679 va_end(args); 680 680 } 681 681 682 + static bool freelist_corrupted(struct kmem_cache *s, struct page *page, 683 + void *freelist, void *nextfree) 684 + { 685 + if ((s->flags & SLAB_CONSISTENCY_CHECKS) && 686 + !check_valid_pointer(s, page, nextfree)) { 687 + object_err(s, page, freelist, "Freechain corrupt"); 688 + freelist = NULL; 689 + slab_fix(s, "Isolate corrupted freechain"); 690 + return true; 691 + } 692 + 693 + return false; 694 + } 695 + 682 696 static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p) 683 697 { 684 698 unsigned int off; /* Offset of last byte */ ··· 1424 1410 static inline void dec_slabs_node(struct kmem_cache *s, int node, 1425 1411 int objects) {} 1426 1412 1413 + static bool freelist_corrupted(struct kmem_cache *s, struct page *page, 1414 + void *freelist, void *nextfree) 1415 + { 1416 + return false; 1417 + } 1427 1418 #endif /* CONFIG_SLUB_DEBUG */ 1428 1419 1429 1420 /* ··· 2111 2092 while (freelist && (nextfree = get_freepointer(s, freelist))) { 2112 2093 void *prior; 2113 2094 unsigned long counters; 2095 + 2096 + /* 2097 + * If 'nextfree' is invalid, it is possible that the object at 2098 + * 'freelist' is already corrupted. So isolate all objects 2099 + * starting at 'freelist'. 
2100 + */ 2101 + if (freelist_corrupted(s, page, freelist, nextfree)) 2102 + break; 2114 2103 2115 2104 do { 2116 2105 prior = page->freelist; ··· 3766 3739 } 3767 3740 3768 3741 static void list_slab_objects(struct kmem_cache *s, struct page *page, 3769 - const char *text) 3742 + const char *text, unsigned long *map) 3770 3743 { 3771 3744 #ifdef CONFIG_SLUB_DEBUG 3772 3745 void *addr = page_address(page); 3773 3746 void *p; 3774 - unsigned long *map; 3747 + 3748 + if (!map) 3749 + return; 3775 3750 3776 3751 slab_err(s, page, text, s->name); 3777 3752 slab_lock(page); ··· 3786 3757 print_tracking(s, p); 3787 3758 } 3788 3759 } 3789 - put_map(map); 3790 - 3791 3760 slab_unlock(page); 3792 3761 #endif 3793 3762 } ··· 3799 3772 { 3800 3773 LIST_HEAD(discard); 3801 3774 struct page *page, *h; 3775 + unsigned long *map = NULL; 3776 + 3777 + #ifdef CONFIG_SLUB_DEBUG 3778 + map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL); 3779 + #endif 3802 3780 3803 3781 BUG_ON(irqs_disabled()); 3804 3782 spin_lock_irq(&n->list_lock); ··· 3813 3781 list_add(&page->slab_list, &discard); 3814 3782 } else { 3815 3783 list_slab_objects(s, page, 3816 - "Objects remaining in %s on __kmem_cache_shutdown()"); 3784 + "Objects remaining in %s on __kmem_cache_shutdown()", 3785 + map); 3817 3786 } 3818 3787 } 3819 3788 spin_unlock_irq(&n->list_lock); 3789 + 3790 + #ifdef CONFIG_SLUB_DEBUG 3791 + bitmap_free(map); 3792 + #endif 3820 3793 3821 3794 list_for_each_entry_safe(page, h, &discard, slab_list) 3822 3795 discard_slab(s, page); ··· 5691 5654 */ 5692 5655 if (buffer) 5693 5656 buf = buffer; 5694 - else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf)) 5657 + else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf) && 5658 + !IS_ENABLED(CONFIG_SLUB_STATS)) 5695 5659 buf = mbuf; 5696 5660 else { 5697 5661 buffer = (char *) get_zeroed_page(GFP_KERNEL); ··· 5724 5686 static struct kobj_type slab_ktype = { 5725 5687 .sysfs_ops = &slab_sysfs_ops, 5726 5688 .release = kmem_cache_release, 5727 - }; 
5728 - 5729 - static int uevent_filter(struct kset *kset, struct kobject *kobj) 5730 - { 5731 - struct kobj_type *ktype = get_ktype(kobj); 5732 - 5733 - if (ktype == &slab_ktype) 5734 - return 1; 5735 - return 0; 5736 - } 5737 - 5738 - static const struct kset_uevent_ops slab_uevent_ops = { 5739 - .filter = uevent_filter, 5740 5689 }; 5741 5690 5742 5691 static struct kset *slab_kset; ··· 5793 5768 #ifdef CONFIG_MEMCG 5794 5769 kset_unregister(s->memcg_kset); 5795 5770 #endif 5796 - kobject_uevent(&s->kobj, KOBJ_REMOVE); 5797 5771 out: 5798 5772 kobject_put(&s->kobj); 5799 5773 } ··· 5850 5826 } 5851 5827 #endif 5852 5828 5853 - kobject_uevent(&s->kobj, KOBJ_ADD); 5854 5829 if (!unmergeable) { 5855 5830 /* Setup first alias */ 5856 5831 sysfs_slab_alias(s, s->name); ··· 5930 5907 5931 5908 mutex_lock(&slab_mutex); 5932 5909 5933 - slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj); 5910 + slab_kset = kset_create_and_add("slab", NULL, kernel_kobj); 5934 5911 if (!slab_kset) { 5935 5912 mutex_unlock(&slab_mutex); 5936 5913 pr_err("Cannot register slab subsystem.\n");
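The new `freelist_corrupted()` helper in the slub.c hunks validates the *next* pointer while walking a freelist: if it points outside the slab, the object currently being examined may itself be corrupt, so everything from that object onward is isolated rather than reused. A userspace sketch with an index-based freelist and a bounds check standing in for `check_valid_pointer()`:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: next[i] is the index of the object after i in the
 * freelist, or FREELIST_END. An out-of-range next pointer stops the walk
 * at the current object, isolating it and everything behind it. */
#define SLAB_OBJECTS 8
#define FREELIST_END (-1)

static int count_safe_objects(const int *next, int head)
{
	int n = 0;

	while (head != FREELIST_END) {
		int nextfree = next[head];

		/* invalid next pointer: 'head' itself may be corrupt */
		if (nextfree != FREELIST_END &&
		    (nextfree < 0 || nextfree >= SLAB_OBJECTS))
			break;		/* isolate from 'head' on */
		n++;
		head = nextfree;
	}
	return n;
}
```

This mirrors the comment added in the hunk: once `nextfree` is invalid, the object at `freelist` cannot be trusted either, so the loop breaks instead of continuing to chase pointers into corrupted memory.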
+3 -2
mm/swap_state.c
··· 509 509 return 1; 510 510 511 511 hits = atomic_xchg(&swapin_readahead_hits, 0); 512 - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, 512 + pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits, 513 + max_pages, 513 514 atomic_read(&last_readahead_pages)); 514 515 if (!hits) 515 - prev_offset = offset; 516 + WRITE_ONCE(prev_offset, offset); 516 517 atomic_set(&last_readahead_pages, pages); 517 518 518 519 return pages;
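The swap_state.c hunk wraps the shared `prev_offset` in READ_ONCE()/WRITE_ONCE() because it is read and written by concurrent swap-ins without a lock, and each access must be a single, non-torn memory operation that the compiler cannot split or fuse. The closest userspace analogue is a C11 relaxed atomic; the window-sizing policy below is a toy, not `__swapin_nr_pages()`:

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace analogue of the READ_ONCE/WRITE_ONCE pair: relaxed atomics
 * give the same no-tearing, no-refetch guarantee for an unlocked shared
 * statistic. The 8-vs-1 policy is illustrative only. */
static _Atomic unsigned long prev_offset;

static unsigned long swapin_nr_pages_sketch(unsigned long offset, int hits)
{
	unsigned long prev = atomic_load_explicit(&prev_offset,
						  memory_order_relaxed);
	unsigned long pages = (offset == prev + 1) ? 8 : 1;

	if (!hits)	/* only a readahead miss updates the anchor */
		atomic_store_explicit(&prev_offset, offset,
				      memory_order_relaxed);
	return pages;
}
```

Relaxed ordering suffices here, as in the kernel: the value is a heuristic, so a stale read only costs readahead accuracy, never correctness.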
+119 -65
mm/swapfile.c
··· 601 601 { 602 602 struct percpu_cluster *cluster; 603 603 struct swap_cluster_info *ci; 604 - bool found_free; 605 604 unsigned long tmp, max; 606 605 607 606 new_cluster: ··· 613 614 } else if (!cluster_list_empty(&si->discard_clusters)) { 614 615 /* 615 616 * we don't have free cluster but have some clusters in 616 - * discarding, do discard now and reclaim them 617 + * discarding, do discard now and reclaim them, then 618 + * reread cluster_next_cpu since we dropped si->lock 617 619 */ 618 620 swap_do_scheduled_discard(si); 619 - *scan_base = *offset = si->cluster_next; 621 + *scan_base = this_cpu_read(*si->cluster_next_cpu); 622 + *offset = *scan_base; 620 623 goto new_cluster; 621 624 } else 622 625 return false; 623 626 } 624 - 625 - found_free = false; 626 627 627 628 /* 628 629 * Other CPUs can use our cluster if they can't find a free cluster, ··· 631 632 tmp = cluster->next; 632 633 max = min_t(unsigned long, si->max, 633 634 (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER); 634 - if (tmp >= max) { 635 - cluster_set_null(&cluster->index); 636 - goto new_cluster; 637 - } 638 - ci = lock_cluster(si, tmp); 639 - while (tmp < max) { 640 - if (!si->swap_map[tmp]) { 641 - found_free = true; 642 - break; 635 + if (tmp < max) { 636 + ci = lock_cluster(si, tmp); 637 + while (tmp < max) { 638 + if (!si->swap_map[tmp]) 639 + break; 640 + tmp++; 643 641 } 644 - tmp++; 642 + unlock_cluster(ci); 645 643 } 646 - unlock_cluster(ci); 647 - if (!found_free) { 644 + if (tmp >= max) { 648 645 cluster_set_null(&cluster->index); 649 646 goto new_cluster; 650 647 } 651 648 cluster->next = tmp + 1; 652 649 *offset = tmp; 653 650 *scan_base = tmp; 654 - return found_free; 651 + return true; 655 652 } 656 653 657 654 static void __del_from_avail_list(struct swap_info_struct *p) ··· 724 729 } 725 730 } 726 731 732 + static void set_cluster_next(struct swap_info_struct *si, unsigned long next) 733 + { 734 + unsigned long prev; 735 + 736 + if (!(si->flags & 
SWP_SOLIDSTATE)) { 737 + si->cluster_next = next; 738 + return; 739 + } 740 + 741 + prev = this_cpu_read(*si->cluster_next_cpu); 742 + /* 743 + * Cross the swap address space size aligned trunk, choose 744 + * another trunk randomly to avoid lock contention on swap 745 + * address space if possible. 746 + */ 747 + if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) != 748 + (next >> SWAP_ADDRESS_SPACE_SHIFT)) { 749 + /* No free swap slots available */ 750 + if (si->highest_bit <= si->lowest_bit) 751 + return; 752 + next = si->lowest_bit + 753 + prandom_u32_max(si->highest_bit - si->lowest_bit + 1); 754 + next = ALIGN_DOWN(next, SWAP_ADDRESS_SPACE_PAGES); 755 + next = max_t(unsigned int, next, si->lowest_bit); 756 + } 757 + this_cpu_write(*si->cluster_next_cpu, next); 758 + } 759 + 727 760 static int scan_swap_map_slots(struct swap_info_struct *si, 728 761 unsigned char usage, int nr, 729 762 swp_entry_t slots[]) ··· 762 739 unsigned long last_in_cluster = 0; 763 740 int latency_ration = LATENCY_LIMIT; 764 741 int n_ret = 0; 765 - 766 - if (nr > SWAP_BATCH) 767 - nr = SWAP_BATCH; 742 + bool scanned_many = false; 768 743 769 744 /* 770 745 * We try to cluster swap pages by allocating them sequentially ··· 776 755 */ 777 756 778 757 si->flags += SWP_SCANNING; 779 - scan_base = offset = si->cluster_next; 758 + /* 759 + * Use percpu scan base for SSD to reduce lock contention on 760 + * cluster and swap cache. For HDD, sequential access is more 761 + * important. 
762 + */ 763 + if (si->flags & SWP_SOLIDSTATE) 764 + scan_base = this_cpu_read(*si->cluster_next_cpu); 765 + else 766 + scan_base = si->cluster_next; 767 + offset = scan_base; 780 768 781 769 /* SSD algorithm */ 782 770 if (si->cluster_info) { 783 - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) 784 - goto checks; 785 - else 771 + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) 786 772 goto scan; 787 - } 788 - 789 - if (unlikely(!si->cluster_nr--)) { 773 + } else if (unlikely(!si->cluster_nr--)) { 790 774 if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { 791 775 si->cluster_nr = SWAPFILE_CLUSTER - 1; 792 776 goto checks; ··· 874 848 unlock_cluster(ci); 875 849 876 850 swap_range_alloc(si, offset, 1); 877 - si->cluster_next = offset + 1; 878 851 slots[n_ret++] = swp_entry(si->type, offset); 879 852 880 853 /* got enough slots or reach max slots? */ ··· 896 871 if (si->cluster_info) { 897 872 if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) 898 873 goto checks; 899 - else 900 - goto done; 901 - } 902 - /* non-ssd case */ 903 - ++offset; 904 - 905 - /* non-ssd case, still more slots in cluster? */ 906 - if (si->cluster_nr && !si->swap_map[offset]) { 874 + } else if (si->cluster_nr && !si->swap_map[++offset]) { 875 + /* non-ssd case, still more slots in cluster? */ 907 876 --si->cluster_nr; 908 877 goto checks; 909 878 } 910 879 880 + /* 881 + * Even if there's no free clusters available (fragmented), 882 + * try to scan a little more quickly with lock held unless we 883 + * have scanned too many slots already. 
884 + */ 885 + if (!scanned_many) { 886 + unsigned long scan_limit; 887 + 888 + if (offset < scan_base) 889 + scan_limit = scan_base; 890 + else 891 + scan_limit = si->highest_bit; 892 + for (; offset <= scan_limit && --latency_ration > 0; 893 + offset++) { 894 + if (!si->swap_map[offset]) 895 + goto checks; 896 + } 897 + } 898 + 911 899 done: 900 + set_cluster_next(si, offset + 1); 912 901 si->flags -= SWP_SCANNING; 913 902 return n_ret; 914 903 ··· 940 901 if (unlikely(--latency_ration < 0)) { 941 902 cond_resched(); 942 903 latency_ration = LATENCY_LIMIT; 904 + scanned_many = true; 943 905 } 944 906 } 945 907 offset = si->lowest_bit; ··· 956 916 if (unlikely(--latency_ration < 0)) { 957 917 cond_resched(); 958 918 latency_ration = LATENCY_LIMIT; 919 + scanned_many = true; 959 920 } 960 921 offset++; 961 922 } ··· 1045 1004 if (avail_pgs <= 0) 1046 1005 goto noswap; 1047 1006 1048 - if (n_goal > SWAP_BATCH) 1049 - n_goal = SWAP_BATCH; 1050 - 1051 - if (n_goal > avail_pgs) 1052 - n_goal = avail_pgs; 1007 + n_goal = min3((long)n_goal, (long)SWAP_BATCH, avail_pgs); 1053 1008 1054 1009 atomic_long_sub(n_goal * size, &nr_swap_pages); 1055 1010 ··· 1312 1275 } 1313 1276 1314 1277 static unsigned char __swap_entry_free(struct swap_info_struct *p, 1315 - swp_entry_t entry, unsigned char usage) 1278 + swp_entry_t entry) 1316 1279 { 1317 1280 struct swap_cluster_info *ci; 1318 1281 unsigned long offset = swp_offset(entry); 1282 + unsigned char usage; 1319 1283 1320 1284 ci = lock_cluster_or_swap_info(p, offset); 1321 - usage = __swap_entry_free_locked(p, offset, usage); 1285 + usage = __swap_entry_free_locked(p, offset, 1); 1322 1286 unlock_cluster_or_swap_info(p, ci); 1323 1287 if (!usage) 1324 1288 free_swap_slot(entry); ··· 1354 1316 1355 1317 p = _swap_info_get(entry); 1356 1318 if (p) 1357 - __swap_entry_free(p, entry, 1); 1319 + __swap_entry_free(p, entry); 1358 1320 } 1359 1321 1360 1322 /* ··· 1777 1739 1778 1740 p = _swap_info_get(entry); 1779 1741 if (p) { 1780 - 
count = __swap_entry_free(p, entry, 1); 1742 + count = __swap_entry_free(p, entry); 1781 1743 if (count == SWAP_HAS_CACHE && 1782 1744 !swap_page_trans_huge_swapped(p, entry)) 1783 1745 __try_to_reclaim_swap(p, swp_offset(entry), ··· 1975 1937 1976 1938 pte_unmap(pte); 1977 1939 swap_map = &si->swap_map[offset]; 1978 - vmf.vma = vma; 1979 - vmf.address = addr; 1980 - vmf.pmd = pmd; 1981 - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, &vmf); 1940 + page = lookup_swap_cache(entry, vma, addr); 1941 + if (!page) { 1942 + vmf.vma = vma; 1943 + vmf.address = addr; 1944 + vmf.pmd = pmd; 1945 + page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, 1946 + &vmf); 1947 + } 1982 1948 if (!page) { 1983 1949 if (*swap_map == 0 || *swap_map == SWAP_MAP_BAD) 1984 1950 goto try_next; ··· 2692 2650 mutex_unlock(&swapon_mutex); 2693 2651 free_percpu(p->percpu_cluster); 2694 2652 p->percpu_cluster = NULL; 2653 + free_percpu(p->cluster_next_cpu); 2654 + p->cluster_next_cpu = NULL; 2695 2655 vfree(swap_map); 2696 2656 kvfree(cluster_info); 2697 2657 kvfree(frontswap_map); ··· 2801 2757 struct swap_info_struct *si = v; 2802 2758 struct file *file; 2803 2759 int len; 2760 + unsigned int bytes, inuse; 2804 2761 2805 2762 if (si == SEQ_START_TOKEN) { 2806 - seq_puts(swap,"Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"); 2763 + seq_puts(swap,"Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"); 2807 2764 return 0; 2808 2765 } 2809 2766 2767 + bytes = si->pages << (PAGE_SHIFT - 10); 2768 + inuse = si->inuse_pages << (PAGE_SHIFT - 10); 2769 + 2810 2770 file = si->swap_file; 2811 2771 len = seq_file_path(swap, file, " \t\n\\"); 2812 - seq_printf(swap, "%*s%s\t%u\t%u\t%d\n", 2772 + seq_printf(swap, "%*s%s\t%u\t%s%u\t%s%d\n", 2813 2773 len < 40 ? 40 - len : 1, " ", 2814 2774 S_ISBLK(file_inode(file)->i_mode) ? 2815 2775 "partition" : "file\t", 2816 - si->pages << (PAGE_SHIFT - 10), 2817 - si->inuse_pages << (PAGE_SHIFT - 10), 2776 + bytes, bytes < 10000000 ? 
"\t" : "", 2777 + inuse, inuse < 10000000 ? "\t" : "", 2818 2778 si->prio); 2819 2779 return 0; 2820 2780 } ··· 3250 3202 unsigned long ci, nr_cluster; 3251 3203 3252 3204 p->flags |= SWP_SOLIDSTATE; 3205 + p->cluster_next_cpu = alloc_percpu(unsigned int); 3206 + if (!p->cluster_next_cpu) { 3207 + error = -ENOMEM; 3208 + goto bad_swap_unlock_inode; 3209 + } 3253 3210 /* 3254 3211 * select a random position to start with to help wear leveling 3255 3212 * SSD 3256 3213 */ 3257 - p->cluster_next = 1 + (prandom_u32() % p->highest_bit); 3214 + for_each_possible_cpu(cpu) { 3215 + per_cpu(*p->cluster_next_cpu, cpu) = 3216 + 1 + prandom_u32_max(p->highest_bit); 3217 + } 3258 3218 nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER); 3259 3219 3260 3220 cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info), ··· 3378 3322 bad_swap: 3379 3323 free_percpu(p->percpu_cluster); 3380 3324 p->percpu_cluster = NULL; 3325 + free_percpu(p->cluster_next_cpu); 3326 + p->cluster_next_cpu = NULL; 3381 3327 if (inode && S_ISBLK(inode->i_mode) && p->bdev) { 3382 3328 set_blocksize(p->bdev, p->old_block_size); 3383 3329 blkdev_put(p->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL); ··· 3712 3654 3713 3655 spin_lock(&si->cont_lock); 3714 3656 offset &= ~PAGE_MASK; 3715 - page = list_entry(head->lru.next, struct page, lru); 3657 + page = list_next_entry(head, lru); 3716 3658 map = kmap_atomic(page) + offset; 3717 3659 3718 3660 if (count == SWAP_MAP_MAX) /* initial increment from swap_map */ ··· 3724 3666 */ 3725 3667 while (*map == (SWAP_CONT_MAX | COUNT_CONTINUED)) { 3726 3668 kunmap_atomic(map); 3727 - page = list_entry(page->lru.next, struct page, lru); 3669 + page = list_next_entry(page, lru); 3728 3670 BUG_ON(page == head); 3729 3671 map = kmap_atomic(page) + offset; 3730 3672 } 3731 3673 if (*map == SWAP_CONT_MAX) { 3732 3674 kunmap_atomic(map); 3733 - page = list_entry(page->lru.next, struct page, lru); 3675 + page = list_next_entry(page, lru); 3734 3676 if (page == head) { 3735 3677 
ret = false; /* add count continuation */ 3736 3678 goto out; ··· 3740 3682 } 3741 3683 *map += 1; 3742 3684 kunmap_atomic(map); 3743 - page = list_entry(page->lru.prev, struct page, lru); 3744 - while (page != head) { 3685 + while ((page = list_prev_entry(page, lru)) != head) { 3745 3686 map = kmap_atomic(page) + offset; 3746 3687 *map = COUNT_CONTINUED; 3747 3688 kunmap_atomic(map); 3748 - page = list_entry(page->lru.prev, struct page, lru); 3749 3689 } 3750 3690 ret = true; /* incremented */ 3751 3691 ··· 3754 3698 BUG_ON(count != COUNT_CONTINUED); 3755 3699 while (*map == COUNT_CONTINUED) { 3756 3700 kunmap_atomic(map); 3757 - page = list_entry(page->lru.next, struct page, lru); 3701 + page = list_next_entry(page, lru); 3758 3702 BUG_ON(page == head); 3759 3703 map = kmap_atomic(page) + offset; 3760 3704 } ··· 3763 3707 if (*map == 0) 3764 3708 count = 0; 3765 3709 kunmap_atomic(map); 3766 - page = list_entry(page->lru.prev, struct page, lru); 3767 - while (page != head) { 3710 + while ((page = list_prev_entry(page, lru)) != head) { 3768 3711 map = kmap_atomic(page) + offset; 3769 3712 *map = SWAP_CONT_MAX | count; 3770 3713 count = COUNT_CONTINUED; 3771 3714 kunmap_atomic(map); 3772 - page = list_entry(page->lru.prev, struct page, lru); 3773 3715 } 3774 3716 ret = count == COUNT_CONTINUED; 3775 3717 }
+1 -1
mm/util.c
··· 580 580 if (ret || size <= PAGE_SIZE) 581 581 return ret; 582 582 583 - return __vmalloc_node_flags_caller(size, node, flags, 583 + return __vmalloc_node(size, 1, flags, node, 584 584 __builtin_return_address(0)); 585 585 } 586 586 EXPORT_SYMBOL(kvmalloc_node);
+149 -218
mm/vmalloc.c
··· 69 69 70 70 /*** Page table manipulation functions ***/ 71 71 72 - static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end) 72 + static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, 73 + pgtbl_mod_mask *mask) 73 74 { 74 75 pte_t *pte; 75 76 ··· 79 78 pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); 80 79 WARN_ON(!pte_none(ptent) && !pte_present(ptent)); 81 80 } while (pte++, addr += PAGE_SIZE, addr != end); 81 + *mask |= PGTBL_PTE_MODIFIED; 82 82 } 83 83 84 - static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end) 84 + static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, 85 + pgtbl_mod_mask *mask) 85 86 { 86 87 pmd_t *pmd; 87 88 unsigned long next; 89 + int cleared; 88 90 89 91 pmd = pmd_offset(pud, addr); 90 92 do { 91 93 next = pmd_addr_end(addr, end); 92 - if (pmd_clear_huge(pmd)) 94 + 95 + cleared = pmd_clear_huge(pmd); 96 + if (cleared || pmd_bad(*pmd)) 97 + *mask |= PGTBL_PMD_MODIFIED; 98 + 99 + if (cleared) 93 100 continue; 94 101 if (pmd_none_or_clear_bad(pmd)) 95 102 continue; 96 - vunmap_pte_range(pmd, addr, next); 103 + vunmap_pte_range(pmd, addr, next, mask); 97 104 } while (pmd++, addr = next, addr != end); 98 105 } 99 106 100 - static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end) 107 + static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, 108 + pgtbl_mod_mask *mask) 101 109 { 102 110 pud_t *pud; 103 111 unsigned long next; 112 + int cleared; 104 113 105 114 pud = pud_offset(p4d, addr); 106 115 do { 107 116 next = pud_addr_end(addr, end); 108 - if (pud_clear_huge(pud)) 117 + 118 + cleared = pud_clear_huge(pud); 119 + if (cleared || pud_bad(*pud)) 120 + *mask |= PGTBL_PUD_MODIFIED; 121 + 122 + if (cleared) 109 123 continue; 110 124 if (pud_none_or_clear_bad(pud)) 111 125 continue; 112 - vunmap_pmd_range(pud, addr, next); 126 + vunmap_pmd_range(pud, addr, next, mask); 113 127 } while (pud++, 
addr = next, addr != end); 114 128 } 115 129 116 - static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end) 130 + static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, 131 + pgtbl_mod_mask *mask) 117 132 { 118 133 p4d_t *p4d; 119 134 unsigned long next; 135 + int cleared; 120 136 121 137 p4d = p4d_offset(pgd, addr); 122 138 do { 123 139 next = p4d_addr_end(addr, end); 124 - if (p4d_clear_huge(p4d)) 140 + 141 + cleared = p4d_clear_huge(p4d); 142 + if (cleared || p4d_bad(*p4d)) 143 + *mask |= PGTBL_P4D_MODIFIED; 144 + 145 + if (cleared) 125 146 continue; 126 147 if (p4d_none_or_clear_bad(p4d)) 127 148 continue; 128 - vunmap_pud_range(p4d, addr, next); 149 + vunmap_pud_range(p4d, addr, next, mask); 129 150 } while (p4d++, addr = next, addr != end); 130 151 } 131 152 132 - static void vunmap_page_range(unsigned long addr, unsigned long end) 153 + /** 154 + * unmap_kernel_range_noflush - unmap kernel VM area 155 + * @start: start of the VM area to unmap 156 + * @size: size of the VM area to unmap 157 + * 158 + * Unmap PFN_UP(@size) pages at @start. The VM area that @start and @size 159 + * specify should have been allocated using get_vm_area() and its friends. 160 + * 161 + * NOTE: 162 + * This function does NOT do any cache flushing. The caller is responsible 163 + * for calling flush_cache_vunmap() on to-be-unmapped areas before calling this 164 + * function and flush_tlb_kernel_range() after.
165 + */ 166 + void unmap_kernel_range_noflush(unsigned long start, unsigned long size) 133 167 { 134 - pgd_t *pgd; 168 + unsigned long end = start + size; 135 169 unsigned long next; 170 + pgd_t *pgd; 171 + unsigned long addr = start; 172 + pgtbl_mod_mask mask = 0; 136 173 137 174 BUG_ON(addr >= end); 175 + start = addr; 138 176 pgd = pgd_offset_k(addr); 139 177 do { 140 178 next = pgd_addr_end(addr, end); 179 + if (pgd_bad(*pgd)) 180 + mask |= PGTBL_PGD_MODIFIED; 141 181 if (pgd_none_or_clear_bad(pgd)) 142 182 continue; 143 - vunmap_p4d_range(pgd, addr, next); 183 + vunmap_p4d_range(pgd, addr, next, &mask); 144 184 } while (pgd++, addr = next, addr != end); 185 + 186 + if (mask & ARCH_PAGE_TABLE_SYNC_MASK) 187 + arch_sync_kernel_mappings(start, end); 145 188 } 146 189 147 190 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, 148 - unsigned long end, pgprot_t prot, struct page **pages, int *nr) 191 + unsigned long end, pgprot_t prot, struct page **pages, int *nr, 192 + pgtbl_mod_mask *mask) 149 193 { 150 194 pte_t *pte; 151 195 ··· 199 153 * callers keep track of where we're up to. 
200 154 */ 201 155 202 - pte = pte_alloc_kernel(pmd, addr); 156 + pte = pte_alloc_kernel_track(pmd, addr, mask); 203 157 if (!pte) 204 158 return -ENOMEM; 205 159 do { ··· 212 166 set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); 213 167 (*nr)++; 214 168 } while (pte++, addr += PAGE_SIZE, addr != end); 169 + *mask |= PGTBL_PTE_MODIFIED; 215 170 return 0; 216 171 } 217 172 218 173 static int vmap_pmd_range(pud_t *pud, unsigned long addr, 219 - unsigned long end, pgprot_t prot, struct page **pages, int *nr) 174 + unsigned long end, pgprot_t prot, struct page **pages, int *nr, 175 + pgtbl_mod_mask *mask) 220 176 { 221 177 pmd_t *pmd; 222 178 unsigned long next; 223 179 224 - pmd = pmd_alloc(&init_mm, pud, addr); 180 + pmd = pmd_alloc_track(&init_mm, pud, addr, mask); 225 181 if (!pmd) 226 182 return -ENOMEM; 227 183 do { 228 184 next = pmd_addr_end(addr, end); 229 - if (vmap_pte_range(pmd, addr, next, prot, pages, nr)) 185 + if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask)) 230 186 return -ENOMEM; 231 187 } while (pmd++, addr = next, addr != end); 232 188 return 0; 233 189 } 234 190 235 191 static int vmap_pud_range(p4d_t *p4d, unsigned long addr, 236 - unsigned long end, pgprot_t prot, struct page **pages, int *nr) 192 + unsigned long end, pgprot_t prot, struct page **pages, int *nr, 193 + pgtbl_mod_mask *mask) 237 194 { 238 195 pud_t *pud; 239 196 unsigned long next; 240 197 241 - pud = pud_alloc(&init_mm, p4d, addr); 198 + pud = pud_alloc_track(&init_mm, p4d, addr, mask); 242 199 if (!pud) 243 200 return -ENOMEM; 244 201 do { 245 202 next = pud_addr_end(addr, end); 246 - if (vmap_pmd_range(pud, addr, next, prot, pages, nr)) 203 + if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask)) 247 204 return -ENOMEM; 248 205 } while (pud++, addr = next, addr != end); 249 206 return 0; 250 207 } 251 208 252 209 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, 253 - unsigned long end, pgprot_t prot, struct page **pages, int *nr) 210 + unsigned long 
end, pgprot_t prot, struct page **pages, int *nr, 211 + pgtbl_mod_mask *mask) 254 212 { 255 213 p4d_t *p4d; 256 214 unsigned long next; 257 215 258 - p4d = p4d_alloc(&init_mm, pgd, addr); 216 + p4d = p4d_alloc_track(&init_mm, pgd, addr, mask); 259 217 if (!p4d) 260 218 return -ENOMEM; 261 219 do { 262 220 next = p4d_addr_end(addr, end); 263 - if (vmap_pud_range(p4d, addr, next, prot, pages, nr)) 221 + if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask)) 264 222 return -ENOMEM; 265 223 } while (p4d++, addr = next, addr != end); 266 224 return 0; 267 225 } 268 226 269 - /* 270 - * Set up page tables in kva (addr, end). The ptes shall have prot "prot", and 271 - * will have pfns corresponding to the "pages" array. 227 + /** 228 + * map_kernel_range_noflush - map kernel VM area with the specified pages 229 + * @addr: start of the VM area to map 230 + * @size: size of the VM area to map 231 + * @prot: page protection flags to use 232 + * @pages: pages to map 272 233 * 273 - * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N] 234 + * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size specify should 235 + * have been allocated using get_vm_area() and its friends. 236 + * 237 + * NOTE: 238 + * This function does NOT do any cache flushing. The caller is responsible for 239 + * calling flush_cache_vmap() on to-be-mapped areas before calling this 240 + * function. 241 + * 242 + * RETURNS: 243 + * 0 on success, -errno on failure. 
274 244 */ 275 - static int vmap_page_range_noflush(unsigned long start, unsigned long end, 276 - pgprot_t prot, struct page **pages) 245 + int map_kernel_range_noflush(unsigned long addr, unsigned long size, 246 + pgprot_t prot, struct page **pages) 277 247 { 278 - pgd_t *pgd; 248 + unsigned long start = addr; 249 + unsigned long end = addr + size; 279 250 unsigned long next; 280 - unsigned long addr = start; 251 + pgd_t *pgd; 281 252 int err = 0; 282 253 int nr = 0; 254 + pgtbl_mod_mask mask = 0; 283 255 284 256 BUG_ON(addr >= end); 285 257 pgd = pgd_offset_k(addr); 286 258 do { 287 259 next = pgd_addr_end(addr, end); 288 - err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr); 260 + if (pgd_bad(*pgd)) 261 + mask |= PGTBL_PGD_MODIFIED; 262 + err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); 289 263 if (err) 290 264 return err; 291 265 } while (pgd++, addr = next, addr != end); 292 266 293 - return nr; 267 + if (mask & ARCH_PAGE_TABLE_SYNC_MASK) 268 + arch_sync_kernel_mappings(start, end); 269 + 270 + return 0; 294 271 } 295 272 296 - static int vmap_page_range(unsigned long start, unsigned long end, 297 - pgprot_t prot, struct page **pages) 273 + int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, 274 + struct page **pages) 298 275 { 299 276 int ret; 300 277 301 - ret = vmap_page_range_noflush(start, end, prot, pages); 302 - flush_cache_vmap(start, end); 278 + ret = map_kernel_range_noflush(start, size, prot, pages); 279 + flush_cache_vmap(start, start + size); 303 280 return ret; 304 281 } 305 282 ··· 1292 1223 EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier); 1293 1224 1294 1225 /* 1295 - * Clear the pagetable entries of a given vmap_area 1296 - */ 1297 - static void unmap_vmap_area(struct vmap_area *va) 1298 - { 1299 - vunmap_page_range(va->va_start, va->va_end); 1300 - } 1301 - 1302 - /* 1303 1226 * lazy_max_pages is the maximum amount of virtual address space we gather up 1304 1227 * before attempting to purge with a 
TLB flush. 1305 1228 * ··· 1352 1291 valist = llist_del_all(&vmap_purge_list); 1353 1292 if (unlikely(valist == NULL)) 1354 1293 return false; 1355 - 1356 - /* 1357 - * First make sure the mappings are removed from all page-tables 1358 - * before they are freed. 1359 - */ 1360 - vmalloc_sync_unmappings(); 1361 1294 1362 1295 /* 1363 1296 * TODO: to calculate a flush range without looping. ··· 1446 1391 static void free_unmap_vmap_area(struct vmap_area *va) 1447 1392 { 1448 1393 flush_cache_vunmap(va->va_start, va->va_end); 1449 - unmap_vmap_area(va); 1394 + unmap_kernel_range_noflush(va->va_start, va->va_end - va->va_start); 1450 1395 if (debug_pagealloc_enabled_static()) 1451 1396 flush_tlb_kernel_range(va->va_start, va->va_end); 1452 1397 ··· 1720 1665 return vaddr; 1721 1666 } 1722 1667 1723 - static void vb_free(const void *addr, unsigned long size) 1668 + static void vb_free(unsigned long addr, unsigned long size) 1724 1669 { 1725 1670 unsigned long offset; 1726 1671 unsigned long vb_idx; ··· 1730 1675 BUG_ON(offset_in_page(size)); 1731 1676 BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC); 1732 1677 1733 - flush_cache_vunmap((unsigned long)addr, (unsigned long)addr + size); 1678 + flush_cache_vunmap(addr, addr + size); 1734 1679 1735 1680 order = get_order(size); 1736 1681 1737 - offset = (unsigned long)addr & (VMAP_BLOCK_SIZE - 1); 1738 - offset >>= PAGE_SHIFT; 1682 + offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT; 1739 1683 1740 - vb_idx = addr_to_vb_idx((unsigned long)addr); 1684 + vb_idx = addr_to_vb_idx(addr); 1741 1685 rcu_read_lock(); 1742 1686 vb = radix_tree_lookup(&vmap_block_tree, vb_idx); 1743 1687 rcu_read_unlock(); 1744 1688 BUG_ON(!vb); 1745 1689 1746 - vunmap_page_range((unsigned long)addr, (unsigned long)addr + size); 1690 + unmap_kernel_range_noflush(addr, size); 1747 1691 1748 1692 if (debug_pagealloc_enabled_static()) 1749 - flush_tlb_kernel_range((unsigned long)addr, 1750 - (unsigned long)addr + size); 1693 + flush_tlb_kernel_range(addr, 
addr + size); 1751 1694 1752 1695 spin_lock(&vb->lock); 1753 1696 ··· 1845 1792 1846 1793 if (likely(count <= VMAP_MAX_ALLOC)) { 1847 1794 debug_check_no_locks_freed(mem, size); 1848 - vb_free(mem, size); 1795 + vb_free(addr, size); 1849 1796 return; 1850 1797 } 1851 1798 ··· 1872 1819 * 1873 1820 * Returns: a pointer to the address that has been mapped, or %NULL on failure 1874 1821 */ 1875 - void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t prot) 1822 + void *vm_map_ram(struct page **pages, unsigned int count, int node) 1876 1823 { 1877 1824 unsigned long size = (unsigned long)count << PAGE_SHIFT; 1878 1825 unsigned long addr; ··· 1896 1843 1897 1844 kasan_unpoison_vmalloc(mem, size); 1898 1845 1899 - if (vmap_page_range(addr, addr + size, prot, pages) < 0) { 1846 + if (map_kernel_range(addr, size, PAGE_KERNEL, pages) < 0) { 1900 1847 vm_unmap_ram(mem, count); 1901 1848 return NULL; 1902 1849 } ··· 2041 1988 } 2042 1989 2043 1990 /** 2044 - * map_kernel_range_noflush - map kernel VM area with the specified pages 2045 - * @addr: start of the VM area to map 2046 - * @size: size of the VM area to map 2047 - * @prot: page protection flags to use 2048 - * @pages: pages to map 2049 - * 2050 - * Map PFN_UP(@size) pages at @addr. The VM area @addr and @size 2051 - * specify should have been allocated using get_vm_area() and its 2052 - * friends. 2053 - * 2054 - * NOTE: 2055 - * This function does NOT do any cache flushing. The caller is 2056 - * responsible for calling flush_cache_vmap() on to-be-mapped areas 2057 - * before calling this function. 2058 - * 2059 - * RETURNS: 2060 - * The number of pages mapped on success, -errno on failure. 
2061 - */ 2062 - int map_kernel_range_noflush(unsigned long addr, unsigned long size, 2063 - pgprot_t prot, struct page **pages) 2064 - { 2065 - return vmap_page_range_noflush(addr, addr + size, prot, pages); 2066 - } 2067 - 2068 - /** 2069 - * unmap_kernel_range_noflush - unmap kernel VM area 2070 - * @addr: start of the VM area to unmap 2071 - * @size: size of the VM area to unmap 2072 - * 2073 - * Unmap PFN_UP(@size) pages at @addr. The VM area @addr and @size 2074 - * specify should have been allocated using get_vm_area() and its 2075 - * friends. 2076 - * 2077 - * NOTE: 2078 - * This function does NOT do any cache flushing. The caller is 2079 - * responsible for calling flush_cache_vunmap() on to-be-mapped areas 2080 - * before calling this function and flush_tlb_kernel_range() after. 2081 - */ 2082 - void unmap_kernel_range_noflush(unsigned long addr, unsigned long size) 2083 - { 2084 - vunmap_page_range(addr, addr + size); 2085 - } 2086 - EXPORT_SYMBOL_GPL(unmap_kernel_range_noflush); 2087 - 2088 - /** 2089 1991 * unmap_kernel_range - unmap kernel VM area and flush cache and TLB 2090 1992 * @addr: start of the VM area to unmap 2091 1993 * @size: size of the VM area to unmap ··· 2053 2045 unsigned long end = addr + size; 2054 2046 2055 2047 flush_cache_vunmap(addr, end); 2056 - vunmap_page_range(addr, end); 2048 + unmap_kernel_range_noflush(addr, size); 2057 2049 flush_tlb_kernel_range(addr, end); 2058 2050 } 2059 - EXPORT_SYMBOL_GPL(unmap_kernel_range); 2060 - 2061 - int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages) 2062 - { 2063 - unsigned long addr = (unsigned long)area->addr; 2064 - unsigned long end = addr + get_vm_area_size(area); 2065 - int err; 2066 - 2067 - err = vmap_page_range(addr, end, prot, pages); 2068 - 2069 - return err > 0 ? 
0 : err; 2070 - } 2071 - EXPORT_SYMBOL_GPL(map_vm_area); 2072 2051 2073 2052 static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, 2074 2053 struct vmap_area *va, unsigned long flags, const void *caller) ··· 2122 2127 2123 2128 return area; 2124 2129 } 2125 - 2126 - struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags, 2127 - unsigned long start, unsigned long end) 2128 - { 2129 - return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE, 2130 - GFP_KERNEL, __builtin_return_address(0)); 2131 - } 2132 - EXPORT_SYMBOL_GPL(__get_vm_area); 2133 2130 2134 2131 struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags, 2135 2132 unsigned long start, unsigned long end, ··· 2428 2441 if (!area) 2429 2442 return NULL; 2430 2443 2431 - if (map_vm_area(area, prot, pages)) { 2444 + if (map_kernel_range((unsigned long)area->addr, size, pgprot_nx(prot), 2445 + pages) < 0) { 2432 2446 vunmap(area->addr); 2433 2447 return NULL; 2434 2448 } ··· 2438 2450 } 2439 2451 EXPORT_SYMBOL(vmap); 2440 2452 2441 - static void *__vmalloc_node(unsigned long size, unsigned long align, 2442 - gfp_t gfp_mask, pgprot_t prot, 2443 - int node, const void *caller); 2444 2453 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, 2445 2454 pgprot_t prot, int node) 2446 2455 { ··· 2455 2470 /* Please note that the recursion is strictly bounded. 
*/ 2456 2471 if (array_size > PAGE_SIZE) { 2457 2472 pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask, 2458 - PAGE_KERNEL, node, area->caller); 2473 + node, area->caller); 2459 2474 } else { 2460 2475 pages = kmalloc_node(array_size, nested_gfp, node); 2461 2476 } ··· 2489 2504 } 2490 2505 atomic_long_add(area->nr_pages, &nr_vmalloc_pages); 2491 2506 2492 - if (map_vm_area(area, prot, pages)) 2507 + if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area), 2508 + prot, pages) < 0) 2493 2509 goto fail; 2510 + 2494 2511 return area->addr; 2495 2512 2496 2513 fail: ··· 2560 2573 return NULL; 2561 2574 } 2562 2575 2563 - /* 2564 - * This is only for performance analysis of vmalloc and stress purpose. 2565 - * It is required by vmalloc test module, therefore do not use it other 2566 - * than that. 2567 - */ 2568 - #ifdef CONFIG_TEST_VMALLOC_MODULE 2569 - EXPORT_SYMBOL_GPL(__vmalloc_node_range); 2570 - #endif 2571 - 2572 2576 /** 2573 2577 * __vmalloc_node - allocate virtually contiguous memory 2574 2578 * @size: allocation size 2575 2579 * @align: desired alignment 2576 2580 * @gfp_mask: flags for the page level allocator 2577 - * @prot: protection mask for the allocated pages 2578 2581 * @node: node to use for allocation or NUMA_NO_NODE 2579 2582 * @caller: caller's return address 2580 2583 * 2581 - * Allocate enough pages to cover @size from the page level 2582 - * allocator with @gfp_mask flags. Map them into contiguous 2583 - * kernel virtual space, using a pagetable protection of @prot. 2584 + * Allocate enough pages to cover @size from the page level allocator with 2585 + * @gfp_mask flags. Map them into contiguous kernel virtual space. 
2584 2586 * 2585 2587 * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL 2586 2588 * and __GFP_NOFAIL are not supported ··· 2579 2603 * 2580 2604 * Return: pointer to the allocated memory or %NULL on error 2581 2605 */ 2582 - static void *__vmalloc_node(unsigned long size, unsigned long align, 2583 - gfp_t gfp_mask, pgprot_t prot, 2584 - int node, const void *caller) 2606 + void *__vmalloc_node(unsigned long size, unsigned long align, 2607 + gfp_t gfp_mask, int node, const void *caller) 2585 2608 { 2586 2609 return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END, 2587 - gfp_mask, prot, 0, node, caller); 2610 + gfp_mask, PAGE_KERNEL, 0, node, caller); 2588 2611 } 2612 + /* 2613 + * This is only for performance analysis of vmalloc and stress purpose. 2614 + * It is required by vmalloc test module, therefore do not use it other 2615 + * than that. 2616 + */ 2617 + #ifdef CONFIG_TEST_VMALLOC_MODULE 2618 + EXPORT_SYMBOL_GPL(__vmalloc_node); 2619 + #endif 2589 2620 2590 - void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot) 2621 + void *__vmalloc(unsigned long size, gfp_t gfp_mask) 2591 2622 { 2592 - return __vmalloc_node(size, 1, gfp_mask, prot, NUMA_NO_NODE, 2623 + return __vmalloc_node(size, 1, gfp_mask, NUMA_NO_NODE, 2593 2624 __builtin_return_address(0)); 2594 2625 } 2595 2626 EXPORT_SYMBOL(__vmalloc); 2596 - 2597 - static inline void *__vmalloc_node_flags(unsigned long size, 2598 - int node, gfp_t flags) 2599 - { 2600 - return __vmalloc_node(size, 1, flags, PAGE_KERNEL, 2601 - node, __builtin_return_address(0)); 2602 - } 2603 - 2604 - 2605 - void *__vmalloc_node_flags_caller(unsigned long size, int node, gfp_t flags, 2606 - void *caller) 2607 - { 2608 - return __vmalloc_node(size, 1, flags, PAGE_KERNEL, node, caller); 2609 - } 2610 2627 2611 2628 /** 2612 2629 * vmalloc - allocate virtually contiguous memory ··· 2615 2646 */ 2616 2647 void *vmalloc(unsigned long size) 2617 2648 { 2618 - return __vmalloc_node_flags(size, 
NUMA_NO_NODE, 2619 - GFP_KERNEL); 2649 + return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE, 2650 + __builtin_return_address(0)); 2620 2651 } 2621 2652 EXPORT_SYMBOL(vmalloc); 2622 2653 ··· 2635 2666 */ 2636 2667 void *vzalloc(unsigned long size) 2637 2668 { 2638 - return __vmalloc_node_flags(size, NUMA_NO_NODE, 2639 - GFP_KERNEL | __GFP_ZERO); 2669 + return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 2670 + __builtin_return_address(0)); 2640 2671 } 2641 2672 EXPORT_SYMBOL(vzalloc); 2642 2673 ··· 2673 2704 */ 2674 2705 void *vmalloc_node(unsigned long size, int node) 2675 2706 { 2676 - return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL, 2677 - node, __builtin_return_address(0)); 2707 + return __vmalloc_node(size, 1, GFP_KERNEL, node, 2708 + __builtin_return_address(0)); 2678 2709 } 2679 2710 EXPORT_SYMBOL(vmalloc_node); 2680 2711 ··· 2687 2718 * allocator and map them into contiguous kernel virtual space. 2688 2719 * The memory allocated is set to zero. 2689 2720 * 2690 - * For tight control over page level allocator and protection flags 2691 - * use __vmalloc_node() instead. 2692 - * 2693 2721 * Return: pointer to the allocated memory or %NULL on error 2694 2722 */ 2695 2723 void *vzalloc_node(unsigned long size, int node) 2696 2724 { 2697 - return __vmalloc_node_flags(size, node, 2698 - GFP_KERNEL | __GFP_ZERO); 2725 + return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, node, 2726 + __builtin_return_address(0)); 2699 2727 } 2700 2728 EXPORT_SYMBOL(vzalloc_node); 2701 - 2702 - /** 2703 - * vmalloc_user_node_flags - allocate memory for userspace on a specific node 2704 - * @size: allocation size 2705 - * @node: numa node 2706 - * @flags: flags for the page level allocator 2707 - * 2708 - * The resulting memory area is zeroed so it can be mapped to userspace 2709 - * without leaking data. 
2710 - * 2711 - * Return: pointer to the allocated memory or %NULL on error 2712 - */ 2713 - void *vmalloc_user_node_flags(unsigned long size, int node, gfp_t flags) 2714 - { 2715 - return __vmalloc_node_range(size, SHMLBA, VMALLOC_START, VMALLOC_END, 2716 - flags | __GFP_ZERO, PAGE_KERNEL, 2717 - VM_USERMAP, node, 2718 - __builtin_return_address(0)); 2719 - } 2720 - EXPORT_SYMBOL(vmalloc_user_node_flags); 2721 2729 2722 2730 /** 2723 2731 * vmalloc_exec - allocate virtually contiguous, executable memory ··· 2739 2793 */ 2740 2794 void *vmalloc_32(unsigned long size) 2741 2795 { 2742 - return __vmalloc_node(size, 1, GFP_VMALLOC32, PAGE_KERNEL, 2743 - NUMA_NO_NODE, __builtin_return_address(0)); 2796 + return __vmalloc_node(size, 1, GFP_VMALLOC32, NUMA_NO_NODE, 2797 + __builtin_return_address(0)); 2744 2798 } 2745 2799 EXPORT_SYMBOL(vmalloc_32); 2746 2800 ··· 3082 3136 vma->vm_end - vma->vm_start); 3083 3137 } 3084 3138 EXPORT_SYMBOL(remap_vmalloc_range); 3085 - 3086 - /* 3087 - * Implement stubs for vmalloc_sync_[un]mappings () if the architecture chose 3088 - * not to have one. 3089 - * 3090 - * The purpose of this function is to make sure the vmalloc area 3091 - * mappings are identical in all page-tables in the system. 3092 - */ 3093 - void __weak vmalloc_sync_mappings(void) 3094 - { 3095 - } 3096 - 3097 - void __weak vmalloc_sync_unmappings(void) 3098 - { 3099 - } 3100 3139 3101 3140 static int f(pte_t *pte, unsigned long addr, void *data) 3102 3141 {
+2 -2
mm/vmscan.c
··· 1878 1878 1879 1879 /* 1880 1880 * If a kernel thread (such as nfsd for loop-back mounts) services 1881 - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE. 1881 + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE. 1882 1882 * In that case we should only throttle if the backing device it is 1883 1883 * writing to is congested. In other cases it is safe to throttle. 1884 1884 */ 1885 1885 static int current_may_throttle(void) 1886 1886 { 1887 - return !(current->flags & PF_LESS_THROTTLE) || 1887 + return !(current->flags & PF_LOCAL_THROTTLE) || 1888 1888 current->backing_dev_info == NULL || 1889 1889 bdi_write_congested(current->backing_dev_info); 1890 1890 }
+9 -2
mm/vmstat.c
··· 1108 1108 TEXT_FOR_HIGHMEM(xx) xx "_movable", 1109 1109 1110 1110 const char * const vmstat_text[] = { 1111 - /* enum zone_stat_item countes */ 1111 + /* enum zone_stat_item counters */ 1112 1112 "nr_free_pages", 1113 1113 "nr_zone_inactive_anon", 1114 1114 "nr_zone_active_anon", ··· 1165 1165 "nr_file_hugepages", 1166 1166 "nr_file_pmdmapped", 1167 1167 "nr_anon_transparent_hugepages", 1168 - "nr_unstable", 1169 1168 "nr_vmscan_write", 1170 1169 "nr_vmscan_immediate_reclaim", 1171 1170 "nr_dirtied", ··· 1725 1726 seq_puts(m, vmstat_text[off]); 1726 1727 seq_put_decimal_ull(m, " ", *l); 1727 1728 seq_putc(m, '\n'); 1729 + 1730 + if (off == NR_VMSTAT_ITEMS - 1) { 1731 + /* 1732 + * We've come to the end - add any deprecated counters to avoid 1733 + * breaking userspace which might depend on them being present. 1734 + */ 1735 + seq_puts(m, "nr_unstable 0\n"); 1736 + } 1728 1737 return 0; 1729 1738 } 1730 1739
+7 -5
mm/zsmalloc.c
··· 293 293 }; 294 294 295 295 struct mapping_area { 296 - #ifdef CONFIG_PGTABLE_MAPPING 296 + #ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING 297 297 struct vm_struct *vm; /* vm area for mapping object that span pages */ 298 298 #else 299 299 char *vm_buf; /* copy buffer for objects that span pages */ ··· 1113 1113 return zspage; 1114 1114 } 1115 1115 1116 - #ifdef CONFIG_PGTABLE_MAPPING 1116 + #ifdef CONFIG_ZSMALLOC_PGTABLE_MAPPING 1117 1117 static inline int __zs_cpu_up(struct mapping_area *area) 1118 1118 { 1119 1119 /* ··· 1138 1138 static inline void *__zs_map_object(struct mapping_area *area, 1139 1139 struct page *pages[2], int off, int size) 1140 1140 { 1141 - BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, pages)); 1141 + unsigned long addr = (unsigned long)area->vm->addr; 1142 + 1143 + BUG_ON(map_kernel_range(addr, PAGE_SIZE * 2, PAGE_KERNEL, pages) < 0); 1142 1144 area->vm_addr = area->vm->addr; 1143 1145 return area->vm_addr + off; 1144 1146 } ··· 1153 1151 unmap_kernel_range(addr, PAGE_SIZE * 2); 1154 1152 } 1155 1153 1156 - #else /* CONFIG_PGTABLE_MAPPING */ 1154 + #else /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */ 1157 1155 1158 1156 static inline int __zs_cpu_up(struct mapping_area *area) 1159 1157 { ··· 1235 1233 pagefault_enable(); 1236 1234 } 1237 1235 1238 - #endif /* CONFIG_PGTABLE_MAPPING */ 1236 + #endif /* CONFIG_ZSMALLOC_PGTABLE_MAPPING */ 1239 1237 1240 1238 static int zs_cpu_prepare(unsigned int cpu) 1241 1239 {
+2 -4
net/bridge/netfilter/ebtables.c
··· 1095 1095 tmp.name[sizeof(tmp.name) - 1] = 0; 1096 1096 1097 1097 countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids; 1098 - newinfo = __vmalloc(sizeof(*newinfo) + countersize, GFP_KERNEL_ACCOUNT, 1099 - PAGE_KERNEL); 1098 + newinfo = __vmalloc(sizeof(*newinfo) + countersize, GFP_KERNEL_ACCOUNT); 1100 1099 if (!newinfo) 1101 1100 return -ENOMEM; 1102 1101 1103 1102 if (countersize) 1104 1103 memset(newinfo->counters, 0, countersize); 1105 1104 1106 - newinfo->entries = __vmalloc(tmp.entries_size, GFP_KERNEL_ACCOUNT, 1107 - PAGE_KERNEL); 1105 + newinfo->entries = __vmalloc(tmp.entries_size, GFP_KERNEL_ACCOUNT); 1108 1106 if (!newinfo->entries) { 1109 1107 ret = -ENOMEM; 1110 1108 goto free_newinfo;
+1 -2
net/ceph/ceph_common.c
··· 190 190 * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are 191 191 * compatible with (a superset of) GFP_KERNEL. This is because while the 192 192 * actual pages are allocated with the specified flags, the page table pages 193 - * are always allocated with GFP_KERNEL. map_vm_area() doesn't even take 194 - * flags because GFP_KERNEL is hard-coded in {p4d,pud,pmd,pte}_alloc(). 193 + * are always allocated with GFP_KERNEL. 195 194 * 196 195 * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO. 197 196 */
+1 -1
sound/core/memalloc.c
··· 143 143 break; 144 144 case SNDRV_DMA_TYPE_VMALLOC: 145 145 gfp = snd_mem_get_gfp_flags(device, GFP_KERNEL | __GFP_HIGHMEM); 146 - dmab->area = __vmalloc(size, gfp, PAGE_KERNEL); 146 + dmab->area = __vmalloc(size, gfp); 147 147 dmab->addr = 0; 148 148 break; 149 149 #ifdef CONFIG_HAS_DMA
+1 -1
sound/core/pcm_memory.c
··· 460 460 return 0; /* already large enough */ 461 461 vfree(runtime->dma_area); 462 462 } 463 - runtime->dma_area = __vmalloc(size, gfp_flags, PAGE_KERNEL); 463 + runtime->dma_area = __vmalloc(size, gfp_flags); 464 464 if (!runtime->dma_area) 465 465 return -ENOMEM; 466 466 runtime->dma_bytes = size;