Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:

- various misc bits

- DAX updates

- OCFS2

- most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
mm,fork: introduce MADV_WIPEONFORK
x86,mpx: make mpx depend on x86-64 to free up VMA flag
mm: add /proc/pid/smaps_rollup
mm: hugetlb: clear target sub-page last when clearing huge page
mm: oom: let oom_reap_task and exit_mmap run concurrently
swap: choose swap device according to numa node
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
z3fold: use per-cpu unbuddied lists
mm, swap: don't use VMA based swap readahead if HDD is used as swap
mm, swap: add sysfs interface for VMA based swap readahead
mm, swap: VMA based swap readahead
mm, swap: fix swap readahead marking
mm, swap: add swap readahead hit statistics
mm/vmalloc.c: don't reinvent the wheel but use existing llist API
mm/vmstat.c: fix wrong comment
selftests/memfd: add memfd_create hugetlbfs selftest
mm/shmem: add hugetlbfs support to memfd_create()
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
...

+3963 -2071
+31
Documentation/ABI/testing/procfs-smaps_rollup
···
+ What:		/proc/pid/smaps_rollup
+ Date:		August 2017
+ Contact:	Daniel Colascione <dancol@google.com>
+ Description:
+		This file provides pre-summed memory information for a
+		process.  The format is identical to /proc/pid/smaps,
+		except instead of an entry for each VMA in a process,
+		smaps_rollup has a single entry (tagged "[rollup]")
+		for which each field is the sum of the corresponding
+		fields from all the maps in /proc/pid/smaps.
+		For more details, see the procfs man page.
+
+		Typical output looks like this:
+
+		00100000-ff709000 ---p 00000000 00:00 0    [rollup]
+		Rss:             884 kB
+		Pss:             385 kB
+		Shared_Clean:    696 kB
+		Shared_Dirty:      0 kB
+		Private_Clean:   120 kB
+		Private_Dirty:    68 kB
+		Referenced:      884 kB
+		Anonymous:        68 kB
+		LazyFree:          0 kB
+		AnonHugePages:     0 kB
+		ShmemPmdMapped:    0 kB
+		Shared_Hugetlb:    0 kB
+		Private_Hugetlb:   0 kB
+		Swap:              0 kB
+		SwapPss:           0 kB
+		Locked:          385 kB
+8
Documentation/ABI/testing/sysfs-block-zram
···
  		device's debugging info useful for kernel developers. Its
  		format is not documented intentionally and may change
  		anytime without any notice.
+
+ What:		/sys/block/zram<id>/backing_dev
+ Date:		June 2017
+ Contact:	Minchan Kim <minchan@kernel.org>
+ Description:
+		The backing_dev file is read-write and sets up a backing
+		device for zram to write out incompressible pages to.
+		To use it, the user should enable CONFIG_ZRAM_WRITEBACK.
+26
Documentation/ABI/testing/sysfs-kernel-mm-swap
···
+ What:		/sys/kernel/mm/swap/
+ Date:		August 2017
+ Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+ Description:	Interface for swapping
+
+ What:		/sys/kernel/mm/swap/vma_ra_enabled
+ Date:		August 2017
+ Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+ Description:	Enable/disable VMA based swap readahead.
+
+		If set to true, the VMA based swap readahead algorithm
+		will be used for swappable anonymous pages mapped in a
+		VMA, while the global swap readahead algorithm will
+		still be used for tmpfs and other users.  If set to
+		false, the global swap readahead algorithm will be
+		used for all swappable pages.
+
+ What:		/sys/kernel/mm/swap/vma_ra_max_order
+ Date:		August 2017
+ Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+ Description:	The max readahead size, as an order, for VMA based swap
+		readahead
+
+		The VMA based swap readahead algorithm will read ahead
+		at most 1 << max_order pages for each readahead.  The
+		real readahead size for each readahead will be scaled
+		according to the estimation algorithm.
+1 -1
Documentation/admin-guide/kernel-parameters.txt
···
  			Allowed values are enable and disable
  
  	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
- 			one of ['zone', 'node', 'default'] can be specified
+ 			'node', 'default' can be specified
  			This can be set from sysctl after boot.
  			See Documentation/sysctl/vm.txt for details.
  
+11
Documentation/blockdev/zram.txt
···
  comp_algorithm    RW    show and change the compression algorithm
  compact           WO    trigger memory compaction
  debug_stat        RO    this file is used for zram debugging purposes
+ backing_dev       RW    set up backend storage for zram to write out
  
  
  User space is advised to use the following files to read the device statistics.
···
  	This frees all the memory allocated for the given device and
  	resets the disksize to zero. You must set the disksize again
  	before reusing the device.
+
+ * Optional Feature
+
+ = writeback
+
+ With incompressible pages, there is no memory saving with zram.
+ Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible pages
+ to backing storage rather than keeping them in memory.
+ The user should set up the backing device via /sys/block/zramX/backing_dev
+ before setting the disksize.
  
  Nitin Gupta
  ngupta@vflare.org
-2
Documentation/filesystems/caching/netfs-api.txt
···
  	void (*mark_pages_cached)(void *cookie_netfs_data,
  				  struct address_space *mapping,
  				  struct pagevec *cached_pvec);
-
- 	void (*now_uncached)(void *cookie_netfs_data);
  };
  
  This has the following fields:
+2 -3
Documentation/filesystems/dax.txt
···
  - implementing an mmap file operation for DAX files which sets the
    VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
    include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
-   handlers should probably call dax_iomap_fault() (for fault and page_mkwrite
-   handlers), dax_iomap_pmd_fault(), dax_pfn_mkwrite() passing the appropriate
-   iomap operations.
+   handlers should probably call dax_iomap_fault() passing the appropriate
+   fault size and iomap operations.
  - calling iomap_zero_range() passing appropriate iomap operations instead of
    block_truncate_page() for DAX files
  - ensuring that there is sufficient locking between reads, writes,
+3 -1
Documentation/sysctl/vm.txt
···
  
  numa_zonelist_order
  
- This sysctl is only for NUMA.
+ This sysctl is only for NUMA and it is deprecated. Anything but
+ Node order will fail!
+
  'where the memory is allocated from' is controlled by zonelists.
  (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
   you may be able to read ZONE_DMA as ZONE_DMA32...)
+2 -5
Documentation/vm/numa
···
  fall back to the same zone type on a different node, or to a different zone
  type on the same node.  This is an important consideration because some zones,
  such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
- a default zonelist order based on the sizes of the various zone types relative
- to the total memory of the node and the total memory of the system.  The
- default zonelist order may be overridden using the numa_zonelist_order kernel
- boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
- Documentation/sysctl/vm.txt]
+ a default Node ordered zonelist. This means it tries to fallback to other zones
+ from the same node before using remote nodes which are ordered by NUMA distance.
  
  By default, Linux will attempt to satisfy memory allocation requests from the
  node to which the CPU that executes the request is assigned.  Specifically,
+69
Documentation/vm/swap_numa.txt
···
+ Automatically bind swap device to numa node
+ -------------------------------------------
+
+ If the system has more than one swap device and the swap devices carry node
+ information, we can make use of this information to decide which swap
+ device to use in get_swap_pages() to get better performance.
+
+
+ How to use this feature
+ -----------------------
+
+ Swap devices have a priority that decides the order in which they are used.
+ To make use of automatic binding, there is no need to manipulate priority
+ settings for swap devices. e.g. on a 2 node machine, assume 2 swap devices
+ swapA and swapB, with swapA attached to node 0 and swapB attached to node 1,
+ are going to be swapped on. Simply swap them on by doing:
+ # swapon /dev/swapA
+ # swapon /dev/swapB
+
+ Then node 0 will use the two swap devices in the order of swapA then swapB and
+ node 1 will use the two swap devices in the order of swapB then swapA. Note
+ that the order in which they are swapped on doesn't matter.
+
+ A more complex example on a 4 node machine. Assume 6 swap devices are going to
+ be swapped on: swapA and swapB are attached to node 0, swapC is attached to
+ node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
+ The way to swap them on is the same as above:
+ # swapon /dev/swapA
+ # swapon /dev/swapB
+ # swapon /dev/swapC
+ # swapon /dev/swapD
+ # swapon /dev/swapE
+ # swapon /dev/swapF
+
+ Then node 0 will use them in the order of:
+ swapA/swapB -> swapC -> swapD -> swapE -> swapF
+ swapA and swapB will be used in a round robin mode before any other swap device.
+
+ node 1 will use them in the order of:
+ swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+
+ node 2 will use them in the order of:
+ swapD/swapE -> swapA -> swapB -> swapC -> swapF
+ Similarly, swapD and swapE will be used in a round robin mode before any
+ other swap devices.
+
+ node 3 will use them in the order of:
+ swapF -> swapA -> swapB -> swapC -> swapD -> swapE
+
+
+ Implementation details
+ ----------------------
+
+ The current code uses a priority based list, swap_avail_list, to decide
+ which swap device to use; if multiple swap devices share the same
+ priority, they are used round robin. This change replaces the single
+ global swap_avail_list with a per-numa-node list, i.e. each numa node
+ sees its own priority based list of available swap devices. A swap
+ device's priority can be promoted on its matching node's swap_avail_list.
+
+ A swap device's priority is set as follows: the user can set a value >= 0,
+ or the system will pick one starting from -1 then going downwards. The
+ priority value in the swap_avail_list is the negated value of the swap
+ device's priority, due to plists being sorted from low to high. The new
+ policy doesn't change the semantics for the priority >= 0 cases; the
+ previous "starting from -1 then downwards" now becomes "starting from -2
+ then downwards", and -1 is reserved as the promoted value. So if multiple
+ swap devices are attached to the same node, they will all be promoted to
+ priority -1 on that node's plist and will be used round robin before any
+ other swap devices.
+3 -11
arch/alpha/include/uapi/asm/mman.h
···
  						   overrides the coredump filter bits */
  #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
  
+ #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+ #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
  /* compatibility flags */
  #define MAP_FILE	0
-
- /*
-  * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
-  * This gives us 6 bits, which is enough until someone invents 128 bit address
-  * spaces.
-  *
-  * Assume these are all power of twos.
-  * When 0 use the default page size.
-  */
- #define MAP_HUGE_SHIFT	26
- #define MAP_HUGE_MASK	0x3f
  
  #define PKEY_DISABLE_ACCESS	0x1
  #define PKEY_DISABLE_WRITE	0x2
-1
arch/metag/include/asm/topology.h
···
  #ifdef CONFIG_NUMA
  
  #define cpu_to_node(cpu)	((void)(cpu), 0)
- #define parent_node(node)	((void)(node), 0)
  
  #define cpumask_of_node(node)	((void)node, cpu_online_mask)
  
+3 -11
arch/mips/include/uapi/asm/mman.h
···
  						   overrides the coredump filter bits */
  #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
  
+ #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+ #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
  /* compatibility flags */
  #define MAP_FILE	0
-
- /*
-  * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
-  * This gives us 6 bits, which is enough until someone invents 128 bit address
-  * spaces.
-  *
-  * Assume these are all power of twos.
-  * When 0 use the default page size.
-  */
- #define MAP_HUGE_SHIFT	26
- #define MAP_HUGE_MASK	0x3f
  
  #define PKEY_DISABLE_ACCESS	0x1
  #define PKEY_DISABLE_WRITE	0x2
+3 -11
arch/parisc/include/uapi/asm/mman.h
···
  						   overrides the coredump filter bits */
  #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
  
+ #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
+ #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
+
  #define MADV_HWPOISON	100		/* poison a page for testing */
  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
  
  /* compatibility flags */
  #define MAP_FILE	0
  #define MAP_VARIABLE	0
-
- /*
-  * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
-  * This gives us 6 bits, which is enough until someone invents 128 bit address
-  * spaces.
-  *
-  * Assume these are all power of twos.
-  * When 0 use the default page size.
-  */
- #define MAP_HUGE_SHIFT	26
- #define MAP_HUGE_MASK	0x3f
  
  #define PKEY_DISABLE_ACCESS	0x1
  #define PKEY_DISABLE_WRITE	0x2
-16
arch/powerpc/include/uapi/asm/mman.h
···
  #define MAP_STACK	0x20000		/* give out an address that is best suited for process/thread stacks */
  #define MAP_HUGETLB	0x40000		/* create a huge page mapping */
  
- /*
-  * When MAP_HUGETLB is set, bits [26:31] of the flags argument to mmap(2),
-  * encode the log2 of the huge page size. A value of zero indicates that the
-  * default huge page size should be used. To use a non-default huge page size,
-  * one of these defines can be used, or the size can be encoded by hand. Note
-  * that on most systems only a subset, or possibly none, of these sizes will be
-  * available.
-  */
- #define MAP_HUGE_512KB	(19 << MAP_HUGE_SHIFT)	/* 512KB HugeTLB Page */
- #define MAP_HUGE_1MB	(20 << MAP_HUGE_SHIFT)	/* 1MB HugeTLB Page */
- #define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)	/* 2MB HugeTLB Page */
- #define MAP_HUGE_8MB	(23 << MAP_HUGE_SHIFT)	/* 8MB HugeTLB Page */
- #define MAP_HUGE_16MB	(24 << MAP_HUGE_SHIFT)	/* 16MB HugeTLB Page */
- #define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)	/* 1GB HugeTLB Page */
- #define MAP_HUGE_16GB	(34 << MAP_HUGE_SHIFT)	/* 16GB HugeTLB Page */
-
  #endif /* _UAPI_ASM_POWERPC_MMAN_H */
+3 -1
arch/x86/Kconfig
···
  config X86_INTEL_MPX
  	prompt "Intel MPX (Memory Protection Extensions)"
  	def_bool n
- 	depends on CPU_SUP_INTEL
+ 	# Note: only available in 64-bit mode due to VMA flags shortage
+ 	depends on CPU_SUP_INTEL && X86_64
+ 	select ARCH_USES_HIGH_VMA_FLAGS
  	---help---
  	  MPX provides hardware features that can be used in
  	  conjunction with compiler-instrumented code to check
-3
arch/x86/include/uapi/asm/mman.h
···
  
  #define MAP_32BIT	0x40		/* only give out 32bit addresses */
  
- #define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)
- #define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)
-
  #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
  /*
   * Take the 4 protection key bits out of the vma->vm_flags
+3 -11
arch/xtensa/include/uapi/asm/mman.h
···
  						   overrides the coredump filter bits */
  #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
  
+ #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
+ #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
+
  /* compatibility flags */
  #define MAP_FILE	0
-
- /*
-  * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size.
-  * This gives us 6 bits, which is enough until someone invents 128 bit address
-  * spaces.
-  *
-  * Assume these are all power of twos.
-  * When 0 use the default page size.
-  */
- #define MAP_HUGE_SHIFT	26
- #define MAP_HUGE_MASK	0x3f
  
  #define PKEY_DISABLE_ACCESS	0x1
  #define PKEY_DISABLE_WRITE	0x2
+20 -10
drivers/base/memory.c
···
  }
  
  #ifdef CONFIG_MEMORY_HOTREMOVE
+ static void print_allowed_zone(char *buf, int nid, unsigned long start_pfn,
+ 			       unsigned long nr_pages, int online_type,
+ 			       struct zone *default_zone)
+ {
+ 	struct zone *zone;
+
+ 	zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
+ 	if (zone != default_zone) {
+ 		strcat(buf, " ");
+ 		strcat(buf, zone->name);
+ 	}
+ }
+
  static ssize_t show_valid_zones(struct device *dev,
  				struct device_attribute *attr, char *buf)
  {
···
  	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
  	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
  	unsigned long valid_start_pfn, valid_end_pfn;
- 	bool append = false;
+ 	struct zone *default_zone;
  	int nid;
  
  	/*
···
  	}
  
  	nid = pfn_to_nid(start_pfn);
- 	if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) {
- 		strcat(buf, default_zone_for_pfn(nid, start_pfn, nr_pages)->name);
- 		append = true;
- 	}
+ 	default_zone = zone_for_pfn_range(MMOP_ONLINE_KEEP, nid, start_pfn, nr_pages);
+ 	strcat(buf, default_zone->name);
  
- 	if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE)) {
- 		if (append)
- 			strcat(buf, " ");
- 		strcat(buf, NODE_DATA(nid)->node_zones[ZONE_MOVABLE].name);
- 	}
+ 	print_allowed_zone(buf, nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL,
+ 			   default_zone);
+ 	print_allowed_zone(buf, nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE,
+ 			   default_zone);
  out:
  	strcat(buf, "\n");
  
+5 -1
drivers/block/brd.c
···
  		       struct page *page, bool is_write)
  {
  	struct brd_device *brd = bdev->bd_disk->private_data;
- 	int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
+ 	int err;
+
+ 	if (PageTransHuge(page))
+ 		return -ENOTSUPP;
+ 	err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
  	page_endio(page, is_write, err);
  	return err;
  }
+12
drivers/block/zram/Kconfig
···
  	  disks and maybe many more.
  
  	  See zram.txt for more information.
+
+ config ZRAM_WRITEBACK
+        bool "Write back incompressible page to backing device"
+        depends on ZRAM
+        default n
+        help
+ 	 With an incompressible page, there is no memory saving in keeping
+ 	 it in memory. Instead, write it out to a backing device.
+ 	 For this feature, the admin should set up the backing device via
+ 	 /sys/block/zramX/backing_dev.
+
+ 	 See zram.txt for more information.
+459 -81
drivers/block/zram/zram_drv.c
··· 270 270 return len; 271 271 } 272 272 273 + #ifdef CONFIG_ZRAM_WRITEBACK 274 + static bool zram_wb_enabled(struct zram *zram) 275 + { 276 + return zram->backing_dev; 277 + } 278 + 279 + static void reset_bdev(struct zram *zram) 280 + { 281 + struct block_device *bdev; 282 + 283 + if (!zram_wb_enabled(zram)) 284 + return; 285 + 286 + bdev = zram->bdev; 287 + if (zram->old_block_size) 288 + set_blocksize(bdev, zram->old_block_size); 289 + blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL); 290 + /* hope filp_close flush all of IO */ 291 + filp_close(zram->backing_dev, NULL); 292 + zram->backing_dev = NULL; 293 + zram->old_block_size = 0; 294 + zram->bdev = NULL; 295 + 296 + kvfree(zram->bitmap); 297 + zram->bitmap = NULL; 298 + } 299 + 300 + static ssize_t backing_dev_show(struct device *dev, 301 + struct device_attribute *attr, char *buf) 302 + { 303 + struct zram *zram = dev_to_zram(dev); 304 + struct file *file = zram->backing_dev; 305 + char *p; 306 + ssize_t ret; 307 + 308 + down_read(&zram->init_lock); 309 + if (!zram_wb_enabled(zram)) { 310 + memcpy(buf, "none\n", 5); 311 + up_read(&zram->init_lock); 312 + return 5; 313 + } 314 + 315 + p = file_path(file, buf, PAGE_SIZE - 1); 316 + if (IS_ERR(p)) { 317 + ret = PTR_ERR(p); 318 + goto out; 319 + } 320 + 321 + ret = strlen(p); 322 + memmove(buf, p, ret); 323 + buf[ret++] = '\n'; 324 + out: 325 + up_read(&zram->init_lock); 326 + return ret; 327 + } 328 + 329 + static ssize_t backing_dev_store(struct device *dev, 330 + struct device_attribute *attr, const char *buf, size_t len) 331 + { 332 + char *file_name; 333 + struct file *backing_dev = NULL; 334 + struct inode *inode; 335 + struct address_space *mapping; 336 + unsigned int bitmap_sz, old_block_size = 0; 337 + unsigned long nr_pages, *bitmap = NULL; 338 + struct block_device *bdev = NULL; 339 + int err; 340 + struct zram *zram = dev_to_zram(dev); 341 + 342 + file_name = kmalloc(PATH_MAX, GFP_KERNEL); 343 + if (!file_name) 344 + return -ENOMEM; 345 + 346 + 
down_write(&zram->init_lock); 347 + if (init_done(zram)) { 348 + pr_info("Can't setup backing device for initialized device\n"); 349 + err = -EBUSY; 350 + goto out; 351 + } 352 + 353 + strlcpy(file_name, buf, len); 354 + 355 + backing_dev = filp_open(file_name, O_RDWR|O_LARGEFILE, 0); 356 + if (IS_ERR(backing_dev)) { 357 + err = PTR_ERR(backing_dev); 358 + backing_dev = NULL; 359 + goto out; 360 + } 361 + 362 + mapping = backing_dev->f_mapping; 363 + inode = mapping->host; 364 + 365 + /* Support only block device in this moment */ 366 + if (!S_ISBLK(inode->i_mode)) { 367 + err = -ENOTBLK; 368 + goto out; 369 + } 370 + 371 + bdev = bdgrab(I_BDEV(inode)); 372 + err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram); 373 + if (err < 0) 374 + goto out; 375 + 376 + nr_pages = i_size_read(inode) >> PAGE_SHIFT; 377 + bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long); 378 + bitmap = kvzalloc(bitmap_sz, GFP_KERNEL); 379 + if (!bitmap) { 380 + err = -ENOMEM; 381 + goto out; 382 + } 383 + 384 + old_block_size = block_size(bdev); 385 + err = set_blocksize(bdev, PAGE_SIZE); 386 + if (err) 387 + goto out; 388 + 389 + reset_bdev(zram); 390 + spin_lock_init(&zram->bitmap_lock); 391 + 392 + zram->old_block_size = old_block_size; 393 + zram->bdev = bdev; 394 + zram->backing_dev = backing_dev; 395 + zram->bitmap = bitmap; 396 + zram->nr_pages = nr_pages; 397 + up_write(&zram->init_lock); 398 + 399 + pr_info("setup backing device %s\n", file_name); 400 + kfree(file_name); 401 + 402 + return len; 403 + out: 404 + if (bitmap) 405 + kvfree(bitmap); 406 + 407 + if (bdev) 408 + blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL); 409 + 410 + if (backing_dev) 411 + filp_close(backing_dev, NULL); 412 + 413 + up_write(&zram->init_lock); 414 + 415 + kfree(file_name); 416 + 417 + return err; 418 + } 419 + 420 + static unsigned long get_entry_bdev(struct zram *zram) 421 + { 422 + unsigned long entry; 423 + 424 + spin_lock(&zram->bitmap_lock); 425 + /* skip 0 bit to confuse 
zram.handle = 0 */ 426 + entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1); 427 + if (entry == zram->nr_pages) { 428 + spin_unlock(&zram->bitmap_lock); 429 + return 0; 430 + } 431 + 432 + set_bit(entry, zram->bitmap); 433 + spin_unlock(&zram->bitmap_lock); 434 + 435 + return entry; 436 + } 437 + 438 + static void put_entry_bdev(struct zram *zram, unsigned long entry) 439 + { 440 + int was_set; 441 + 442 + spin_lock(&zram->bitmap_lock); 443 + was_set = test_and_clear_bit(entry, zram->bitmap); 444 + spin_unlock(&zram->bitmap_lock); 445 + WARN_ON_ONCE(!was_set); 446 + } 447 + 448 + void zram_page_end_io(struct bio *bio) 449 + { 450 + struct page *page = bio->bi_io_vec[0].bv_page; 451 + 452 + page_endio(page, op_is_write(bio_op(bio)), 453 + blk_status_to_errno(bio->bi_status)); 454 + bio_put(bio); 455 + } 456 + 457 + /* 458 + * Returns 1 if the submission is successful. 459 + */ 460 + static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec, 461 + unsigned long entry, struct bio *parent) 462 + { 463 + struct bio *bio; 464 + 465 + bio = bio_alloc(GFP_ATOMIC, 1); 466 + if (!bio) 467 + return -ENOMEM; 468 + 469 + bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9); 470 + bio->bi_bdev = zram->bdev; 471 + if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, bvec->bv_offset)) { 472 + bio_put(bio); 473 + return -EIO; 474 + } 475 + 476 + if (!parent) { 477 + bio->bi_opf = REQ_OP_READ; 478 + bio->bi_end_io = zram_page_end_io; 479 + } else { 480 + bio->bi_opf = parent->bi_opf; 481 + bio_chain(bio, parent); 482 + } 483 + 484 + submit_bio(bio); 485 + return 1; 486 + } 487 + 488 + struct zram_work { 489 + struct work_struct work; 490 + struct zram *zram; 491 + unsigned long entry; 492 + struct bio *bio; 493 + }; 494 + 495 + #if PAGE_SIZE != 4096 496 + static void zram_sync_read(struct work_struct *work) 497 + { 498 + struct bio_vec bvec; 499 + struct zram_work *zw = container_of(work, struct zram_work, work); 500 + struct zram *zram = zw->zram; 501 + unsigned 
long entry = zw->entry; 502 + struct bio *bio = zw->bio; 503 + 504 + read_from_bdev_async(zram, &bvec, entry, bio); 505 + } 506 + 507 + /* 508 + * Block layer want one ->make_request_fn to be active at a time 509 + * so if we use chained IO with parent IO in same context, 510 + * it's a deadlock. To avoid, it, it uses worker thread context. 511 + */ 512 + static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec, 513 + unsigned long entry, struct bio *bio) 514 + { 515 + struct zram_work work; 516 + 517 + work.zram = zram; 518 + work.entry = entry; 519 + work.bio = bio; 520 + 521 + INIT_WORK_ONSTACK(&work.work, zram_sync_read); 522 + queue_work(system_unbound_wq, &work.work); 523 + flush_work(&work.work); 524 + destroy_work_on_stack(&work.work); 525 + 526 + return 1; 527 + } 528 + #else 529 + static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec, 530 + unsigned long entry, struct bio *bio) 531 + { 532 + WARN_ON(1); 533 + return -EIO; 534 + } 535 + #endif 536 + 537 + static int read_from_bdev(struct zram *zram, struct bio_vec *bvec, 538 + unsigned long entry, struct bio *parent, bool sync) 539 + { 540 + if (sync) 541 + return read_from_bdev_sync(zram, bvec, entry, parent); 542 + else 543 + return read_from_bdev_async(zram, bvec, entry, parent); 544 + } 545 + 546 + static int write_to_bdev(struct zram *zram, struct bio_vec *bvec, 547 + u32 index, struct bio *parent, 548 + unsigned long *pentry) 549 + { 550 + struct bio *bio; 551 + unsigned long entry; 552 + 553 + bio = bio_alloc(GFP_ATOMIC, 1); 554 + if (!bio) 555 + return -ENOMEM; 556 + 557 + entry = get_entry_bdev(zram); 558 + if (!entry) { 559 + bio_put(bio); 560 + return -ENOSPC; 561 + } 562 + 563 + bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9); 564 + bio->bi_bdev = zram->bdev; 565 + if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, 566 + bvec->bv_offset)) { 567 + bio_put(bio); 568 + put_entry_bdev(zram, entry); 569 + return -EIO; 570 + } 571 + 572 + if (!parent) { 573 + 
bio->bi_opf = REQ_OP_WRITE | REQ_SYNC; 574 + bio->bi_end_io = zram_page_end_io; 575 + } else { 576 + bio->bi_opf = parent->bi_opf; 577 + bio_chain(bio, parent); 578 + } 579 + 580 + submit_bio(bio); 581 + *pentry = entry; 582 + 583 + return 0; 584 + } 585 + 586 + static void zram_wb_clear(struct zram *zram, u32 index) 587 + { 588 + unsigned long entry; 589 + 590 + zram_clear_flag(zram, index, ZRAM_WB); 591 + entry = zram_get_element(zram, index); 592 + zram_set_element(zram, index, 0); 593 + put_entry_bdev(zram, entry); 594 + } 595 + 596 + #else 597 + static bool zram_wb_enabled(struct zram *zram) { return false; } 598 + static inline void reset_bdev(struct zram *zram) {}; 599 + static int write_to_bdev(struct zram *zram, struct bio_vec *bvec, 600 + u32 index, struct bio *parent, 601 + unsigned long *pentry) 602 + 603 + { 604 + return -EIO; 605 + } 606 + 607 + static int read_from_bdev(struct zram *zram, struct bio_vec *bvec, 608 + unsigned long entry, struct bio *parent, bool sync) 609 + { 610 + return -EIO; 611 + } 612 + static void zram_wb_clear(struct zram *zram, u32 index) {} 613 + #endif 614 + 615 + 273 616 /* 274 617 * We switched to per-cpu streams and this attr is not needed anymore. 275 618 * However, we will keep it around for some time, because: ··· 796 453 return false; 797 454 } 798 455 799 - static bool zram_same_page_write(struct zram *zram, u32 index, 800 - struct page *page) 801 - { 802 - unsigned long element; 803 - void *mem = kmap_atomic(page); 804 - 805 - if (page_same_filled(mem, &element)) { 806 - kunmap_atomic(mem); 807 - /* Free memory associated with this sector now. 
*/ 808 - zram_slot_lock(zram, index); 809 - zram_free_page(zram, index); 810 - zram_set_flag(zram, index, ZRAM_SAME); 811 - zram_set_element(zram, index, element); 812 - zram_slot_unlock(zram, index); 813 - 814 - atomic64_inc(&zram->stats.same_pages); 815 - atomic64_inc(&zram->stats.pages_stored); 816 - return true; 817 - } 818 - kunmap_atomic(mem); 819 - 820 - return false; 821 - } 822 - 823 456 static void zram_meta_free(struct zram *zram, u64 disksize) 824 457 { 825 458 size_t num_pages = disksize >> PAGE_SHIFT; ··· 834 515 */ 835 516 static void zram_free_page(struct zram *zram, size_t index) 836 517 { 837 - unsigned long handle = zram_get_handle(zram, index); 518 + unsigned long handle; 519 + 520 + if (zram_wb_enabled(zram) && zram_test_flag(zram, index, ZRAM_WB)) { 521 + zram_wb_clear(zram, index); 522 + atomic64_dec(&zram->stats.pages_stored); 523 + return; 524 + } 838 525 839 526 /* 840 527 * No memory is allocated for same element filled pages. ··· 854 529 return; 855 530 } 856 531 532 + handle = zram_get_handle(zram, index); 857 533 if (!handle) 858 534 return; 859 535 ··· 868 542 zram_set_obj_size(zram, index, 0); 869 543 } 870 544 871 - static int zram_decompress_page(struct zram *zram, struct page *page, u32 index) 545 + static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index, 546 + struct bio *bio, bool partial_io) 872 547 { 873 548 int ret; 874 549 unsigned long handle; 875 550 unsigned int size; 876 551 void *src, *dst; 552 + 553 + if (zram_wb_enabled(zram)) { 554 + zram_slot_lock(zram, index); 555 + if (zram_test_flag(zram, index, ZRAM_WB)) { 556 + struct bio_vec bvec; 557 + 558 + zram_slot_unlock(zram, index); 559 + 560 + bvec.bv_page = page; 561 + bvec.bv_len = PAGE_SIZE; 562 + bvec.bv_offset = 0; 563 + return read_from_bdev(zram, &bvec, 564 + zram_get_element(zram, index), 565 + bio, partial_io); 566 + } 567 + zram_slot_unlock(zram, index); 568 + } 877 569 878 570 if (zram_same_page_read(zram, index, page, 0, PAGE_SIZE)) 879 
571 return 0; ··· 925 581 } 926 582 927 583 static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec, 928 - u32 index, int offset) 584 + u32 index, int offset, struct bio *bio) 929 585 { 930 586 int ret; 931 587 struct page *page; ··· 938 594 return -ENOMEM; 939 595 } 940 596 941 - ret = zram_decompress_page(zram, page, index); 597 + ret = __zram_bvec_read(zram, page, index, bio, is_partial_io(bvec)); 942 598 if (unlikely(ret)) 943 599 goto out; 944 600 ··· 957 613 return ret; 958 614 } 959 615 960 - static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm, 961 - struct page *page, 962 - unsigned long *out_handle, unsigned int *out_comp_len) 616 + static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, 617 + u32 index, struct bio *bio) 963 618 { 964 - int ret; 965 - unsigned int comp_len; 966 - void *src; 619 + int ret = 0; 967 620 unsigned long alloced_pages; 968 621 unsigned long handle = 0; 622 + unsigned int comp_len = 0; 623 + void *src, *dst, *mem; 624 + struct zcomp_strm *zstrm; 625 + struct page *page = bvec->bv_page; 626 + unsigned long element = 0; 627 + enum zram_pageflags flags = 0; 628 + bool allow_wb = true; 629 + 630 + mem = kmap_atomic(page); 631 + if (page_same_filled(mem, &element)) { 632 + kunmap_atomic(mem); 633 + /* Free memory associated with this sector now. */ 634 + flags = ZRAM_SAME; 635 + atomic64_inc(&zram->stats.same_pages); 636 + goto out; 637 + } 638 + kunmap_atomic(mem); 969 639 970 640 compress_again: 641 + zstrm = zcomp_stream_get(zram->comp); 971 642 src = kmap_atomic(page); 972 - ret = zcomp_compress(*zstrm, src, &comp_len); 643 + ret = zcomp_compress(zstrm, src, &comp_len); 973 644 kunmap_atomic(src); 974 645 975 646 if (unlikely(ret)) { 647 + zcomp_stream_put(zram->comp); 976 648 pr_err("Compression failed! 
err=%d\n", ret); 977 - if (handle) 978 - zs_free(zram->mem_pool, handle); 649 + zs_free(zram->mem_pool, handle); 979 650 return ret; 980 651 } 981 652 982 - if (unlikely(comp_len > max_zpage_size)) 653 + if (unlikely(comp_len > max_zpage_size)) { 654 + if (zram_wb_enabled(zram) && allow_wb) { 655 + zcomp_stream_put(zram->comp); 656 + ret = write_to_bdev(zram, bvec, index, bio, &element); 657 + if (!ret) { 658 + flags = ZRAM_WB; 659 + ret = 1; 660 + goto out; 661 + } 662 + allow_wb = false; 663 + goto compress_again; 664 + } 983 665 comp_len = PAGE_SIZE; 666 + } 984 667 985 668 /* 986 669 * handle allocation has 2 paths: ··· 1034 663 handle = zs_malloc(zram->mem_pool, comp_len, 1035 664 GFP_NOIO | __GFP_HIGHMEM | 1036 665 __GFP_MOVABLE); 1037 - *zstrm = zcomp_stream_get(zram->comp); 1038 666 if (handle) 1039 667 goto compress_again; 1040 668 return -ENOMEM; ··· 1043 673 update_used_max(zram, alloced_pages); 1044 674 1045 675 if (zram->limit_pages && alloced_pages > zram->limit_pages) { 676 + zcomp_stream_put(zram->comp); 1046 677 zs_free(zram->mem_pool, handle); 1047 678 return -ENOMEM; 1048 - } 1049 - 1050 - *out_handle = handle; 1051 - *out_comp_len = comp_len; 1052 - return 0; 1053 - } 1054 - 1055 - static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index) 1056 - { 1057 - int ret; 1058 - unsigned long handle; 1059 - unsigned int comp_len; 1060 - void *src, *dst; 1061 - struct zcomp_strm *zstrm; 1062 - struct page *page = bvec->bv_page; 1063 - 1064 - if (zram_same_page_write(zram, index, page)) 1065 - return 0; 1066 - 1067 - zstrm = zcomp_stream_get(zram->comp); 1068 - ret = zram_compress(zram, &zstrm, page, &handle, &comp_len); 1069 - if (ret) { 1070 - zcomp_stream_put(zram->comp); 1071 - return ret; 1072 679 } 1073 680 1074 681 dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO); ··· 1059 712 1060 713 zcomp_stream_put(zram->comp); 1061 714 zs_unmap_object(zram->mem_pool, handle); 1062 - 715 + atomic64_add(comp_len, 
&zram->stats.compr_data_size); 716 + out: 1063 717 /* 1064 718 * Free memory associated with this sector 1065 719 * before overwriting unused sectors. 1066 720 */ 1067 721 zram_slot_lock(zram, index); 1068 722 zram_free_page(zram, index); 1069 - zram_set_handle(zram, index, handle); 1070 - zram_set_obj_size(zram, index, comp_len); 723 + 724 + if (flags) { 725 + zram_set_flag(zram, index, flags); 726 + zram_set_element(zram, index, element); 727 + } else { 728 + zram_set_handle(zram, index, handle); 729 + zram_set_obj_size(zram, index, comp_len); 730 + } 1071 731 zram_slot_unlock(zram, index); 1072 732 1073 733 /* Update stats */ 1074 - atomic64_add(comp_len, &zram->stats.compr_data_size); 1075 734 atomic64_inc(&zram->stats.pages_stored); 1076 - return 0; 735 + return ret; 1077 736 } 1078 737 1079 738 static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, 1080 - u32 index, int offset) 739 + u32 index, int offset, struct bio *bio) 1081 740 { 1082 741 int ret; 1083 742 struct page *page = NULL; ··· 1101 748 if (!page) 1102 749 return -ENOMEM; 1103 750 1104 - ret = zram_decompress_page(zram, page, index); 751 + ret = __zram_bvec_read(zram, page, index, bio, true); 1105 752 if (ret) 1106 753 goto out; 1107 754 ··· 1116 763 vec.bv_offset = 0; 1117 764 } 1118 765 1119 - ret = __zram_bvec_write(zram, &vec, index); 766 + ret = __zram_bvec_write(zram, &vec, index, bio); 1120 767 out: 1121 768 if (is_partial_io(bvec)) 1122 769 __free_page(page); ··· 1161 808 } 1162 809 } 1163 810 811 + /* 812 + * Returns errno if it has some problem. Otherwise return 0 or 1. 813 + * Returns 0 if IO request was done synchronously 814 + * Returns 1 if IO request was successfully submitted. 815 + */ 1164 816 static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, 1165 - int offset, bool is_write) 817 + int offset, bool is_write, struct bio *bio) 1166 818 { 1167 819 unsigned long start_time = jiffies; 1168 820 int rw_acct = is_write ? 
REQ_OP_WRITE : REQ_OP_READ; ··· 1178 820 1179 821 if (!is_write) { 1180 822 atomic64_inc(&zram->stats.num_reads); 1181 - ret = zram_bvec_read(zram, bvec, index, offset); 823 + ret = zram_bvec_read(zram, bvec, index, offset, bio); 1182 824 flush_dcache_page(bvec->bv_page); 1183 825 } else { 1184 826 atomic64_inc(&zram->stats.num_writes); 1185 - ret = zram_bvec_write(zram, bvec, index, offset); 827 + ret = zram_bvec_write(zram, bvec, index, offset, bio); 1186 828 } 1187 829 1188 830 generic_end_io_acct(rw_acct, &zram->disk->part0, start_time); 1189 831 1190 - if (unlikely(ret)) { 832 + if (unlikely(ret < 0)) { 1191 833 if (!is_write) 1192 834 atomic64_inc(&zram->stats.failed_reads); 1193 835 else ··· 1226 868 bv.bv_len = min_t(unsigned int, PAGE_SIZE - offset, 1227 869 unwritten); 1228 870 if (zram_bvec_rw(zram, &bv, index, offset, 1229 - op_is_write(bio_op(bio))) < 0) 871 + op_is_write(bio_op(bio)), bio) < 0) 1230 872 goto out; 1231 873 1232 874 bv.bv_offset += bv.bv_len; ··· 1280 922 static int zram_rw_page(struct block_device *bdev, sector_t sector, 1281 923 struct page *page, bool is_write) 1282 924 { 1283 - int offset, err = -EIO; 925 + int offset, ret; 1284 926 u32 index; 1285 927 struct zram *zram; 1286 928 struct bio_vec bv; 1287 929 930 + if (PageTransHuge(page)) 931 + return -ENOTSUPP; 1288 932 zram = bdev->bd_disk->private_data; 1289 933 1290 934 if (!valid_io_request(zram, sector, PAGE_SIZE)) { 1291 935 atomic64_inc(&zram->stats.invalid_io); 1292 - err = -EINVAL; 936 + ret = -EINVAL; 1293 937 goto out; 1294 938 } 1295 939 ··· 1302 942 bv.bv_len = PAGE_SIZE; 1303 943 bv.bv_offset = 0; 1304 944 1305 - err = zram_bvec_rw(zram, &bv, index, offset, is_write); 945 + ret = zram_bvec_rw(zram, &bv, index, offset, is_write, NULL); 1306 946 out: 1307 947 /* 1308 948 * If I/O fails, just return error(ie, non-zero) without ··· 1312 952 * bio->bi_end_io does things to handle the error 1313 953 * (e.g., SetPageError, set_page_dirty and extra works). 
1314 954 */ 1315 - if (err == 0) 955 + if (unlikely(ret < 0)) 956 + return ret; 957 + 958 + switch (ret) { 959 + case 0: 1316 960 page_endio(page, is_write, 0); 1317 - return err; 961 + break; 962 + case 1: 963 + ret = 0; 964 + break; 965 + default: 966 + WARN_ON(1); 967 + } 968 + return ret; 1318 969 } 1319 970 1320 971 static void zram_reset_device(struct zram *zram) ··· 1354 983 zram_meta_free(zram, disksize); 1355 984 memset(&zram->stats, 0, sizeof(zram->stats)); 1356 985 zcomp_destroy(comp); 986 + reset_bdev(zram); 1357 987 } 1358 988 1359 989 static ssize_t disksize_store(struct device *dev, ··· 1480 1108 static DEVICE_ATTR_WO(mem_used_max); 1481 1109 static DEVICE_ATTR_RW(max_comp_streams); 1482 1110 static DEVICE_ATTR_RW(comp_algorithm); 1111 + #ifdef CONFIG_ZRAM_WRITEBACK 1112 + static DEVICE_ATTR_RW(backing_dev); 1113 + #endif 1483 1114 1484 1115 static struct attribute *zram_disk_attrs[] = { 1485 1116 &dev_attr_disksize.attr, ··· 1493 1118 &dev_attr_mem_used_max.attr, 1494 1119 &dev_attr_max_comp_streams.attr, 1495 1120 &dev_attr_comp_algorithm.attr, 1121 + #ifdef CONFIG_ZRAM_WRITEBACK 1122 + &dev_attr_backing_dev.attr, 1123 + #endif 1496 1124 &dev_attr_io_stat.attr, 1497 1125 &dev_attr_mm_stat.attr, 1498 1126 &dev_attr_debug_stat.attr,
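The write path above short-circuits compression for same-filled pages: __zram_bvec_write() calls page_same_filled() and, on a hit, records only the repeated element under the ZRAM_SAME flag instead of allocating a zsmalloc handle. A minimal userspace sketch of that check (PAGE_SIZE and the function body here are illustrative, not the kernel's exact code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Userspace sketch of zram's page_same_filled() check: a page is
 * "same-filled" when every machine word equals the first word, in which
 * case only that one word (the "element") needs to be stored instead of
 * a compressed buffer. */
static bool page_same_filled(const void *mem, unsigned long *element)
{
    const unsigned long *word = mem;
    size_t i, nwords = PAGE_SIZE / sizeof(unsigned long);

    for (i = 1; i < nwords; i++)
        if (word[i] != word[0])
            return false;
    *element = word[0];
    return true;
}
```

A page of zeros is the common case, but any repeated word qualifies, which is why the flag is documented in terms of a "same element" rather than zeros.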
+10 -1
drivers/block/zram/zram_drv.h
··· 60 60 61 61 /* Flags for zram pages (table[page_no].value) */ 62 62 enum zram_pageflags { 63 - /* Page consists entirely of zeros */ 63 + /* Page consists the same element */ 64 64 ZRAM_SAME = ZRAM_FLAG_SHIFT, 65 65 ZRAM_ACCESS, /* page is now accessed */ 66 + ZRAM_WB, /* page is stored on backing_device */ 66 67 67 68 __NR_ZRAM_PAGEFLAGS, 68 69 }; ··· 116 115 * zram is claimed so open request will be failed 117 116 */ 118 117 bool claim; /* Protected by bdev->bd_mutex */ 118 + #ifdef CONFIG_ZRAM_WRITEBACK 119 + struct file *backing_dev; 120 + struct block_device *bdev; 121 + unsigned int old_block_size; 122 + unsigned long *bitmap; 123 + unsigned long nr_pages; 124 + spinlock_t bitmap_lock; 125 + #endif 119 126 }; 120 127 #endif
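The new writeback fields give zram a bitmap with one bit per page of the backing device, guarded by bitmap_lock. A hedged userspace model of the slot allocation those fields imply (the function names and the linear scan are invented for illustration; the kernel walks the bitmap with find_next_zero_bit() under the spinlock):

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * (int)sizeof(unsigned long))

/* Hypothetical model of backing-device slot allocation: 'bitmap' has
 * one bit per page of the backing device (nr_pages total), and a slot
 * is claimed by setting its bit. */
static long alloc_block_slot(unsigned long *bitmap, unsigned long nr_pages)
{
    for (unsigned long i = 0; i < nr_pages; i++) {
        unsigned long word = i / BITS_PER_LONG;
        unsigned long bit = 1UL << (i % BITS_PER_LONG);

        if (!(bitmap[word] & bit)) {
            bitmap[word] |= bit;
            return (long)i;
        }
    }
    return -1; /* backing device full */
}

static void free_block_slot(unsigned long *bitmap, unsigned long slot)
{
    bitmap[slot / BITS_PER_LONG] &= ~(1UL << (slot % BITS_PER_LONG));
}
```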
+2 -2
drivers/gpu/drm/i915/i915_debugfs.c
··· 4308 4308 4309 4309 fs_reclaim_acquire(GFP_KERNEL); 4310 4310 if (val & DROP_BOUND) 4311 - i915_gem_shrink(dev_priv, LONG_MAX, I915_SHRINK_BOUND); 4311 + i915_gem_shrink(dev_priv, LONG_MAX, NULL, I915_SHRINK_BOUND); 4312 4312 4313 4313 if (val & DROP_UNBOUND) 4314 - i915_gem_shrink(dev_priv, LONG_MAX, I915_SHRINK_UNBOUND); 4314 + i915_gem_shrink(dev_priv, LONG_MAX, NULL, I915_SHRINK_UNBOUND); 4315 4315 4316 4316 if (val & DROP_SHRINK_ALL) 4317 4317 i915_gem_shrink_all(dev_priv);
+1
drivers/gpu/drm/i915/i915_drv.h
··· 3742 3742 /* i915_gem_shrinker.c */ 3743 3743 unsigned long i915_gem_shrink(struct drm_i915_private *dev_priv, 3744 3744 unsigned long target, 3745 + unsigned long *nr_scanned, 3745 3746 unsigned flags); 3746 3747 #define I915_SHRINK_PURGEABLE 0x1 3747 3748 #define I915_SHRINK_UNBOUND 0x2
+2 -2
drivers/gpu/drm/i915/i915_gem.c
··· 2354 2354 goto err_sg; 2355 2355 } 2356 2356 2357 - i915_gem_shrink(dev_priv, 2 * page_count, *s++); 2357 + i915_gem_shrink(dev_priv, 2 * page_count, NULL, *s++); 2358 2358 cond_resched(); 2359 2359 2360 2360 /* We've tried hard to allocate the memory by reaping ··· 5015 5015 * the objects as well, see i915_gem_freeze() 5016 5016 */ 5017 5017 5018 - i915_gem_shrink(dev_priv, -1UL, I915_SHRINK_UNBOUND); 5018 + i915_gem_shrink(dev_priv, -1UL, NULL, I915_SHRINK_UNBOUND); 5019 5019 i915_gem_drain_freed_objects(dev_priv); 5020 5020 5021 5021 mutex_lock(&dev_priv->drm.struct_mutex);
+1 -1
drivers/gpu/drm/i915/i915_gem_gtt.c
··· 2062 2062 */ 2063 2063 GEM_BUG_ON(obj->mm.pages == pages); 2064 2064 } while (i915_gem_shrink(to_i915(obj->base.dev), 2065 - obj->base.size >> PAGE_SHIFT, 2065 + obj->base.size >> PAGE_SHIFT, NULL, 2066 2066 I915_SHRINK_BOUND | 2067 2067 I915_SHRINK_UNBOUND | 2068 2068 I915_SHRINK_ACTIVE));
+18 -6
drivers/gpu/drm/i915/i915_gem_shrinker.c
··· 136 136 * i915_gem_shrink - Shrink buffer object caches 137 137 * @dev_priv: i915 device 138 138 * @target: amount of memory to make available, in pages 139 + * @nr_scanned: optional output for number of pages scanned (incremental) 139 140 * @flags: control flags for selecting cache types 140 141 * 141 142 * This function is the main interface to the shrinker. It will try to release ··· 159 158 */ 160 159 unsigned long 161 160 i915_gem_shrink(struct drm_i915_private *dev_priv, 162 - unsigned long target, unsigned flags) 161 + unsigned long target, 162 + unsigned long *nr_scanned, 163 + unsigned flags) 163 164 { 164 165 const struct { 165 166 struct list_head *list; ··· 172 169 { NULL, 0 }, 173 170 }, *phase; 174 171 unsigned long count = 0; 172 + unsigned long scanned = 0; 175 173 bool unlock; 176 174 177 175 if (!shrinker_lock(dev_priv, &unlock)) ··· 253 249 count += obj->base.size >> PAGE_SHIFT; 254 250 } 255 251 mutex_unlock(&obj->mm.lock); 252 + scanned += obj->base.size >> PAGE_SHIFT; 256 253 } 257 254 } 258 255 list_splice_tail(&still_in_list, phase->list); ··· 266 261 267 262 shrinker_unlock(dev_priv, unlock); 268 263 264 + if (nr_scanned) 265 + *nr_scanned += scanned; 269 266 return count; 270 267 } 271 268 ··· 290 283 unsigned long freed; 291 284 292 285 intel_runtime_pm_get(dev_priv); 293 - freed = i915_gem_shrink(dev_priv, -1UL, 286 + freed = i915_gem_shrink(dev_priv, -1UL, NULL, 294 287 I915_SHRINK_BOUND | 295 288 I915_SHRINK_UNBOUND | 296 289 I915_SHRINK_ACTIVE); ··· 336 329 unsigned long freed; 337 330 bool unlock; 338 331 332 + sc->nr_scanned = 0; 333 + 339 334 if (!shrinker_lock(dev_priv, &unlock)) 340 335 return SHRINK_STOP; 341 336 342 337 freed = i915_gem_shrink(dev_priv, 343 338 sc->nr_to_scan, 339 + &sc->nr_scanned, 344 340 I915_SHRINK_BOUND | 345 341 I915_SHRINK_UNBOUND | 346 342 I915_SHRINK_PURGEABLE); 347 343 if (freed < sc->nr_to_scan) 348 344 freed += i915_gem_shrink(dev_priv, 349 - sc->nr_to_scan - freed, 345 + sc->nr_to_scan - 
sc->nr_scanned, 346 + &sc->nr_scanned, 350 347 I915_SHRINK_BOUND | 351 348 I915_SHRINK_UNBOUND); 352 349 if (freed < sc->nr_to_scan && current_is_kswapd()) { 353 350 intel_runtime_pm_get(dev_priv); 354 351 freed += i915_gem_shrink(dev_priv, 355 - sc->nr_to_scan - freed, 352 + sc->nr_to_scan - sc->nr_scanned, 353 + &sc->nr_scanned, 356 354 I915_SHRINK_ACTIVE | 357 355 I915_SHRINK_BOUND | 358 356 I915_SHRINK_UNBOUND); ··· 366 354 367 355 shrinker_unlock(dev_priv, unlock); 368 356 369 - return freed; 357 + return sc->nr_scanned ? freed : SHRINK_STOP; 370 358 } 371 359 372 360 static bool ··· 465 453 goto out; 466 454 467 455 intel_runtime_pm_get(dev_priv); 468 - freed_pages += i915_gem_shrink(dev_priv, -1UL, 456 + freed_pages += i915_gem_shrink(dev_priv, -1UL, NULL, 469 457 I915_SHRINK_BOUND | 470 458 I915_SHRINK_UNBOUND | 471 459 I915_SHRINK_ACTIVE |
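The i915 shrinker hunks separate pages scanned from pages freed: i915_gem_shrink() now accumulates into an optional nr_scanned output, and the scan callback returns SHRINK_STOP when nothing could be scanned, so the core shrinker stops retrying a shrinker that makes no progress. A toy model of that contract (the struct and function names here are invented for illustration):

```c
#include <assert.h>

#define SHRINK_STOP (~0UL)

struct toy_obj {
    unsigned long pages;
    int purgeable;
};

/* Toy model of the accounting change: report pages *visited* through
 * nr_scanned, separately from pages freed, and return SHRINK_STOP when
 * nothing was scanned at all. */
static unsigned long toy_scan(struct toy_obj *objs, int n,
                              unsigned long nr_to_scan,
                              unsigned long *nr_scanned)
{
    unsigned long freed = 0, scanned = 0;

    for (int i = 0; i < n && scanned < nr_to_scan; i++) {
        scanned += objs[i].pages;
        if (objs[i].purgeable) {
            freed += objs[i].pages;
            objs[i].pages = 0;
        }
    }
    if (nr_scanned)
        *nr_scanned += scanned;
    return scanned ? freed : SHRINK_STOP;
}
```

This also explains the follow-up calls above using `sc->nr_to_scan - sc->nr_scanned` rather than `- freed`: the remaining budget is measured in pages looked at, not pages reclaimed.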
+3 -1
drivers/nvdimm/btt.c
··· 1241 1241 { 1242 1242 struct btt *btt = bdev->bd_disk->private_data; 1243 1243 int rc; 1244 + unsigned int len; 1244 1245 1245 - rc = btt_do_bvec(btt, NULL, page, PAGE_SIZE, 0, is_write, sector); 1246 + len = hpage_nr_pages(page) * PAGE_SIZE; 1247 + rc = btt_do_bvec(btt, NULL, page, len, 0, is_write, sector); 1246 1248 if (rc == 0) 1247 1249 page_endio(page, is_write, 0); 1248 1250
+28 -9
drivers/nvdimm/pmem.c
··· 80 80 static void write_pmem(void *pmem_addr, struct page *page, 81 81 unsigned int off, unsigned int len) 82 82 { 83 - void *mem = kmap_atomic(page); 83 + unsigned int chunk; 84 + void *mem; 84 85 85 - memcpy_flushcache(pmem_addr, mem + off, len); 86 - kunmap_atomic(mem); 86 + while (len) { 87 + mem = kmap_atomic(page); 88 + chunk = min_t(unsigned int, len, PAGE_SIZE); 89 + memcpy_flushcache(pmem_addr, mem + off, chunk); 90 + kunmap_atomic(mem); 91 + len -= chunk; 92 + off = 0; 93 + page++; 94 + pmem_addr += PAGE_SIZE; 95 + } 87 96 } 88 97 89 98 static blk_status_t read_pmem(struct page *page, unsigned int off, 90 99 void *pmem_addr, unsigned int len) 91 100 { 101 + unsigned int chunk; 92 102 int rc; 93 - void *mem = kmap_atomic(page); 103 + void *mem; 94 104 95 - rc = memcpy_mcsafe(mem + off, pmem_addr, len); 96 - kunmap_atomic(mem); 97 - if (rc) 98 - return BLK_STS_IOERR; 105 + while (len) { 106 + mem = kmap_atomic(page); 107 + chunk = min_t(unsigned int, len, PAGE_SIZE); 108 + rc = memcpy_mcsafe(mem + off, pmem_addr, chunk); 109 + kunmap_atomic(mem); 110 + if (rc) 111 + return BLK_STS_IOERR; 112 + len -= chunk; 113 + off = 0; 114 + page++; 115 + pmem_addr += PAGE_SIZE; 116 + } 99 117 return BLK_STS_OK; 100 118 } 101 119 ··· 206 188 struct pmem_device *pmem = bdev->bd_queue->queuedata; 207 189 blk_status_t rc; 208 190 209 - rc = pmem_do_bvec(pmem, page, PAGE_SIZE, 0, is_write, sector); 191 + rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE, 192 + 0, is_write, sector); 210 193 211 194 /* 212 195 * The ->rw_page interface is subtle and tricky. The core
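read_pmem() and write_pmem() now loop because kmap_atomic() maps only one PAGE_SIZE unit at a time, while rw_page can hand in a transparent huge page spanning many of them (hence the `hpage_nr_pages(page) * PAGE_SIZE` length). A simplified userspace sketch of the same chunked walk, where plain buffers stand in for mapped pages and, as an explicit assumption, the starting offset applies only to the first page:

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Sketch of the chunked-copy loop shape: copy up to one page's worth at
 * a time, consume the initial offset on the first page only, then step
 * to the next page buffer. */
static void copy_from_pages(char *dst, char **pages,
                            unsigned int off, unsigned int len)
{
    unsigned int chunk;

    while (len) {
        chunk = len < TOY_PAGE_SIZE - off ? len : TOY_PAGE_SIZE - off;
        memcpy(dst, pages[0] + off, chunk);
        dst += chunk;
        len -= chunk;
        off = 0; /* only the first page has a starting offset */
        pages++;
    }
}
```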
-29
fs/9p/cache.c
··· 151 151 return FSCACHE_CHECKAUX_OKAY; 152 152 } 153 153 154 - static void v9fs_cache_inode_now_uncached(void *cookie_netfs_data) 155 - { 156 - struct v9fs_inode *v9inode = cookie_netfs_data; 157 - struct pagevec pvec; 158 - pgoff_t first; 159 - int loop, nr_pages; 160 - 161 - pagevec_init(&pvec, 0); 162 - first = 0; 163 - 164 - for (;;) { 165 - nr_pages = pagevec_lookup(&pvec, v9inode->vfs_inode.i_mapping, 166 - first, 167 - PAGEVEC_SIZE - pagevec_count(&pvec)); 168 - if (!nr_pages) 169 - break; 170 - 171 - for (loop = 0; loop < nr_pages; loop++) 172 - ClearPageFsCache(pvec.pages[loop]); 173 - 174 - first = pvec.pages[nr_pages - 1]->index + 1; 175 - 176 - pvec.nr = nr_pages; 177 - pagevec_release(&pvec); 178 - cond_resched(); 179 - } 180 - } 181 - 182 154 const struct fscache_cookie_def v9fs_cache_inode_index_def = { 183 155 .name = "9p.inode", 184 156 .type = FSCACHE_COOKIE_TYPE_DATAFILE, ··· 158 186 .get_attr = v9fs_cache_inode_get_attr, 159 187 .get_aux = v9fs_cache_inode_get_aux, 160 188 .check_aux = v9fs_cache_inode_check_aux, 161 - .now_uncached = v9fs_cache_inode_now_uncached, 162 189 }; 163 190 164 191 void v9fs_cache_inode_get_cookie(struct inode *inode)
-43
fs/afs/cache.c
··· 39 39 static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data, 40 40 const void *buffer, 41 41 uint16_t buflen); 42 - static void afs_vnode_cache_now_uncached(void *cookie_netfs_data); 43 42 44 43 struct fscache_netfs afs_cache_netfs = { 45 44 .name = "afs", ··· 74 75 .get_attr = afs_vnode_cache_get_attr, 75 76 .get_aux = afs_vnode_cache_get_aux, 76 77 .check_aux = afs_vnode_cache_check_aux, 77 - .now_uncached = afs_vnode_cache_now_uncached, 78 78 }; 79 79 80 80 /* ··· 356 358 357 359 _leave(" = SUCCESS"); 358 360 return FSCACHE_CHECKAUX_OKAY; 359 - } 360 - 361 - /* 362 - * indication the cookie is no longer uncached 363 - * - this function is called when the backing store currently caching a cookie 364 - * is removed 365 - * - the netfs should use this to clean up any markers indicating cached pages 366 - * - this is mandatory for any object that may have data 367 - */ 368 - static void afs_vnode_cache_now_uncached(void *cookie_netfs_data) 369 - { 370 - struct afs_vnode *vnode = cookie_netfs_data; 371 - struct pagevec pvec; 372 - pgoff_t first; 373 - int loop, nr_pages; 374 - 375 - _enter("{%x,%x,%Lx}", 376 - vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version); 377 - 378 - pagevec_init(&pvec, 0); 379 - first = 0; 380 - 381 - for (;;) { 382 - /* grab a bunch of pages to clean */ 383 - nr_pages = pagevec_lookup(&pvec, vnode->vfs_inode.i_mapping, 384 - first, 385 - PAGEVEC_SIZE - pagevec_count(&pvec)); 386 - if (!nr_pages) 387 - break; 388 - 389 - for (loop = 0; loop < nr_pages; loop++) 390 - ClearPageFsCache(pvec.pages[loop]); 391 - 392 - first = pvec.pages[nr_pages - 1]->index + 1; 393 - 394 - pvec.nr = nr_pages; 395 - pagevec_release(&pvec); 396 - cond_resched(); 397 - } 398 - 399 - _leave(""); 400 361 }
+10 -21
fs/buffer.c
··· 1627 1627 struct pagevec pvec; 1628 1628 pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits); 1629 1629 pgoff_t end; 1630 - int i; 1630 + int i, count; 1631 1631 struct buffer_head *bh; 1632 1632 struct buffer_head *head; 1633 1633 1634 1634 end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits); 1635 1635 pagevec_init(&pvec, 0); 1636 - while (index <= end && pagevec_lookup(&pvec, bd_mapping, index, 1637 - min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) { 1638 - for (i = 0; i < pagevec_count(&pvec); i++) { 1636 + while (pagevec_lookup_range(&pvec, bd_mapping, &index, end)) { 1637 + count = pagevec_count(&pvec); 1638 + for (i = 0; i < count; i++) { 1639 1639 struct page *page = pvec.pages[i]; 1640 1640 1641 - index = page->index; 1642 - if (index > end) 1643 - break; 1644 1641 if (!page_has_buffers(page)) 1645 1642 continue; 1646 1643 /* ··· 1667 1670 } 1668 1671 pagevec_release(&pvec); 1669 1672 cond_resched(); 1670 - index++; 1673 + /* End of range already reached? */ 1674 + if (index > end || !index) 1675 + break; 1671 1676 } 1672 1677 } 1673 1678 EXPORT_SYMBOL(clean_bdev_aliases); ··· 3548 3549 pagevec_init(&pvec, 0); 3549 3550 3550 3551 do { 3551 - unsigned want, nr_pages, i; 3552 + unsigned nr_pages, i; 3552 3553 3553 - want = min_t(unsigned, end - index, PAGEVEC_SIZE); 3554 - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index, want); 3554 + nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, &index, 3555 + end - 1); 3555 3556 if (nr_pages == 0) 3556 3557 break; 3557 3558 ··· 3572 3573 lastoff < page_offset(page)) 3573 3574 goto check_range; 3574 3575 3575 - /* Searching done if the page index is out of range. 
*/ 3576 - if (page->index >= end) 3577 - goto not_found; 3578 - 3579 3576 lock_page(page); 3580 3577 if (likely(page->mapping == inode->i_mapping) && 3581 3578 page_has_buffers(page)) { ··· 3584 3589 unlock_page(page); 3585 3590 lastoff = page_offset(page) + PAGE_SIZE; 3586 3591 } 3587 - 3588 - /* Searching done if fewer pages returned than wanted. */ 3589 - if (nr_pages < want) 3590 - break; 3591 - 3592 - index = pvec.pages[i - 1]->index + 1; 3593 3592 pagevec_release(&pvec); 3594 3593 } while (index < end); 3595 3594
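The fs/buffer.c conversion leans on pagevec_lookup_range() doing the range clamping itself: it returns only pages with index in [*index, end] and advances *index past the last page found, which is why the manual `min(end - index, ...)` batch sizing and the `page->index > end` checks disappear. A toy model of that cursor convention (flags stand in for pages; names and the batch size are illustrative):

```c
#include <assert.h>

#define TOY_BATCH 15

/* Toy model of the lookup-range cursor: fill up to TOY_BATCH indices in
 * [*start, end] that are "present", and advance *start past the last
 * one returned so the caller just loops until zero is returned. */
static unsigned int lookup_range(const unsigned char *present,
                                 unsigned int n, unsigned int *start,
                                 unsigned int end, unsigned int *out)
{
    unsigned int found = 0;

    for (unsigned int i = *start;
         i <= end && i < n && found < TOY_BATCH; i++)
        if (present[i])
            out[found++] = i;
    if (found)
        *start = out[found - 1] + 1;
    return found;
}
```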
-31
fs/ceph/cache.c
··· 194 194 return FSCACHE_CHECKAUX_OKAY; 195 195 } 196 196 197 - static void ceph_fscache_inode_now_uncached(void* cookie_netfs_data) 198 - { 199 - struct ceph_inode_info* ci = cookie_netfs_data; 200 - struct pagevec pvec; 201 - pgoff_t first; 202 - int loop, nr_pages; 203 - 204 - pagevec_init(&pvec, 0); 205 - first = 0; 206 - 207 - dout("ceph inode 0x%p now uncached", ci); 208 - 209 - while (1) { 210 - nr_pages = pagevec_lookup(&pvec, ci->vfs_inode.i_mapping, first, 211 - PAGEVEC_SIZE - pagevec_count(&pvec)); 212 - 213 - if (!nr_pages) 214 - break; 215 - 216 - for (loop = 0; loop < nr_pages; loop++) 217 - ClearPageFsCache(pvec.pages[loop]); 218 - 219 - first = pvec.pages[nr_pages - 1]->index + 1; 220 - 221 - pvec.nr = nr_pages; 222 - pagevec_release(&pvec); 223 - cond_resched(); 224 - } 225 - } 226 - 227 197 static const struct fscache_cookie_def ceph_fscache_inode_object_def = { 228 198 .name = "CEPH.inode", 229 199 .type = FSCACHE_COOKIE_TYPE_DATAFILE, ··· 201 231 .get_attr = ceph_fscache_inode_get_attr, 202 232 .get_aux = ceph_fscache_inode_get_aux, 203 233 .check_aux = ceph_fscache_inode_check_aux, 204 - .now_uncached = ceph_fscache_inode_now_uncached, 205 234 }; 206 235 207 236 void ceph_fscache_register_inode_cookie(struct inode *inode)
-31
fs/cifs/cache.c
··· 292 292 return FSCACHE_CHECKAUX_OKAY; 293 293 } 294 294 295 - static void cifs_fscache_inode_now_uncached(void *cookie_netfs_data) 296 - { 297 - struct cifsInodeInfo *cifsi = cookie_netfs_data; 298 - struct pagevec pvec; 299 - pgoff_t first; 300 - int loop, nr_pages; 301 - 302 - pagevec_init(&pvec, 0); 303 - first = 0; 304 - 305 - cifs_dbg(FYI, "%s: cifs inode 0x%p now uncached\n", __func__, cifsi); 306 - 307 - for (;;) { 308 - nr_pages = pagevec_lookup(&pvec, 309 - cifsi->vfs_inode.i_mapping, first, 310 - PAGEVEC_SIZE - pagevec_count(&pvec)); 311 - if (!nr_pages) 312 - break; 313 - 314 - for (loop = 0; loop < nr_pages; loop++) 315 - ClearPageFsCache(pvec.pages[loop]); 316 - 317 - first = pvec.pages[nr_pages - 1]->index + 1; 318 - 319 - pvec.nr = nr_pages; 320 - pagevec_release(&pvec); 321 - cond_resched(); 322 - } 323 - } 324 - 325 295 const struct fscache_cookie_def cifs_fscache_inode_object_def = { 326 296 .name = "CIFS.uniqueid", 327 297 .type = FSCACHE_COOKIE_TYPE_DATAFILE, ··· 299 329 .get_attr = cifs_fscache_inode_get_attr, 300 330 .get_aux = cifs_fscache_inode_get_aux, 301 331 .check_aux = cifs_fscache_inode_check_aux, 302 - .now_uncached = cifs_fscache_inode_now_uncached, 303 332 };
+154 -209
fs/dax.c
··· 42 42 #define DAX_WAIT_TABLE_BITS 12 43 43 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS) 44 44 45 + /* The 'colour' (ie low bits) within a PMD of a page offset. */ 46 + #define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) 47 + 45 48 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; 46 49 47 50 static int __init init_dax_wait_table(void) ··· 56 53 return 0; 57 54 } 58 55 fs_initcall(init_dax_wait_table); 56 + 57 + /* 58 + * We use lowest available bit in exceptional entry for locking, one bit for 59 + * the entry size (PMD) and two more to tell us if the entry is a zero page or 60 + * an empty entry that is just used for locking. In total four special bits. 61 + * 62 + * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE 63 + * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem 64 + * block allocation. 65 + */ 66 + #define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 4) 67 + #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT) 68 + #define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1)) 69 + #define RADIX_DAX_ZERO_PAGE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2)) 70 + #define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)) 71 + 72 + static unsigned long dax_radix_sector(void *entry) 73 + { 74 + return (unsigned long)entry >> RADIX_DAX_SHIFT; 75 + } 76 + 77 + static void *dax_radix_locked_entry(sector_t sector, unsigned long flags) 78 + { 79 + return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags | 80 + ((unsigned long)sector << RADIX_DAX_SHIFT) | 81 + RADIX_DAX_ENTRY_LOCK); 82 + } 83 + 84 + static unsigned int dax_radix_order(void *entry) 85 + { 86 + if ((unsigned long)entry & RADIX_DAX_PMD) 87 + return PMD_SHIFT - PAGE_SHIFT; 88 + return 0; 89 + } 59 90 60 91 static int dax_is_pmd_entry(void *entry) 61 92 { ··· 103 66 104 67 static int dax_is_zero_entry(void *entry) 105 68 { 106 - return (unsigned long)entry & RADIX_DAX_HZP; 69 + return (unsigned long)entry & 
RADIX_DAX_ZERO_PAGE; 107 70 } 108 71 109 72 static int dax_is_empty_entry(void *entry) ··· 135 98 * the range covered by the PMD map to the same bit lock. 136 99 */ 137 100 if (dax_is_pmd_entry(entry)) 138 - index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1); 101 + index &= ~PG_PMD_COLOUR; 139 102 140 103 key->mapping = mapping; 141 104 key->entry_start = index; ··· 155 118 key->entry_start != ewait->key.entry_start) 156 119 return 0; 157 120 return autoremove_wake_function(wait, mode, sync, NULL); 121 + } 122 + 123 + /* 124 + * We do not necessarily hold the mapping->tree_lock when we call this 125 + * function so it is possible that 'entry' is no longer a valid item in the 126 + * radix tree. This is okay because all we really need to do is to find the 127 + * correct waitqueue where tasks might be waiting for that old 'entry' and 128 + * wake them. 129 + */ 130 + static void dax_wake_mapping_entry_waiter(struct address_space *mapping, 131 + pgoff_t index, void *entry, bool wake_all) 132 + { 133 + struct exceptional_entry_key key; 134 + wait_queue_head_t *wq; 135 + 136 + wq = dax_entry_waitqueue(mapping, index, entry, &key); 137 + 138 + /* 139 + * Checking for locked entry and prepare_to_wait_exclusive() happens 140 + * under mapping->tree_lock, ditto for entry handling in our callers. 141 + * So at this point all tasks that could have seen our entry locked 142 + * must be in the waitqueue and the following check will see them. 143 + */ 144 + if (waitqueue_active(wq)) 145 + __wake_up(wq, TASK_NORMAL, wake_all ? 
0 : 1, &key); 158 146 } 159 147 160 148 /* ··· 243 181 for (;;) { 244 182 entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, 245 183 &slot); 246 - if (!entry || !radix_tree_exceptional_entry(entry) || 184 + if (!entry || 185 + WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)) || 247 186 !slot_locked(mapping, slot)) { 248 187 if (slotp) 249 188 *slotp = slot; ··· 279 216 } 280 217 281 218 static void put_locked_mapping_entry(struct address_space *mapping, 282 - pgoff_t index, void *entry) 219 + pgoff_t index) 283 220 { 284 - if (!radix_tree_exceptional_entry(entry)) { 285 - unlock_page(entry); 286 - put_page(entry); 287 - } else { 288 - dax_unlock_mapping_entry(mapping, index); 289 - } 221 + dax_unlock_mapping_entry(mapping, index); 290 222 } 291 223 292 224 /* ··· 291 233 static void put_unlocked_mapping_entry(struct address_space *mapping, 292 234 pgoff_t index, void *entry) 293 235 { 294 - if (!radix_tree_exceptional_entry(entry)) 236 + if (!entry) 295 237 return; 296 238 297 239 /* We have to wake up next waiter for the radix tree entry lock */ ··· 299 241 } 300 242 301 243 /* 302 - * Find radix tree entry at given index. If it points to a page, return with 303 - * the page locked. If it points to the exceptional entry, return with the 304 - * radix tree entry locked. If the radix tree doesn't contain given index, 305 - * create empty exceptional entry for the index and return with it locked. 244 + * Find radix tree entry at given index. If it points to an exceptional entry, 245 + * return it with the radix tree entry locked. If the radix tree doesn't 246 + * contain given index, create an empty exceptional entry for the index and 247 + * return with it locked. 306 248 * 307 249 * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will 308 250 * either return that locked entry or will return an error. 
This error will 309 - * happen if there are any 4k entries (either zero pages or DAX entries) 310 - * within the 2MiB range that we are requesting. 251 + * happen if there are any 4k entries within the 2MiB range that we are 252 + * requesting. 311 253 * 312 254 * We always favor 4k entries over 2MiB entries. There isn't a flow where we 313 255 * evict 4k entries in order to 'upgrade' them to a 2MiB entry. A 2MiB ··· 334 276 spin_lock_irq(&mapping->tree_lock); 335 277 entry = get_unlocked_mapping_entry(mapping, index, &slot); 336 278 279 + if (WARN_ON_ONCE(entry && !radix_tree_exceptional_entry(entry))) { 280 + entry = ERR_PTR(-EIO); 281 + goto out_unlock; 282 + } 283 + 337 284 if (entry) { 338 285 if (size_flag & RADIX_DAX_PMD) { 339 - if (!radix_tree_exceptional_entry(entry) || 340 - dax_is_pte_entry(entry)) { 286 + if (dax_is_pte_entry(entry)) { 341 287 put_unlocked_mapping_entry(mapping, index, 342 288 entry); 343 289 entry = ERR_PTR(-EEXIST); 344 290 goto out_unlock; 345 291 } 346 292 } else { /* trying to grab a PTE entry */ 347 - if (radix_tree_exceptional_entry(entry) && 348 - dax_is_pmd_entry(entry) && 293 + if (dax_is_pmd_entry(entry) && 349 294 (dax_is_zero_entry(entry) || 350 295 dax_is_empty_entry(entry))) { 351 296 pmd_downgrade = true; ··· 382 321 mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM); 383 322 if (err) { 384 323 if (pmd_downgrade) 385 - put_locked_mapping_entry(mapping, index, entry); 324 + put_locked_mapping_entry(mapping, index); 386 325 return ERR_PTR(err); 387 326 } 388 327 spin_lock_irq(&mapping->tree_lock); ··· 432 371 spin_unlock_irq(&mapping->tree_lock); 433 372 return entry; 434 373 } 435 - /* Normal page in radix tree? */ 436 - if (!radix_tree_exceptional_entry(entry)) { 437 - struct page *page = entry; 438 - 439 - get_page(page); 440 - spin_unlock_irq(&mapping->tree_lock); 441 - lock_page(page); 442 - /* Page got truncated? Retry... 
*/ 443 - if (unlikely(page->mapping != mapping)) { 444 - unlock_page(page); 445 - put_page(page); 446 - goto restart; 447 - } 448 - return page; 449 - } 450 374 entry = lock_slot(mapping, slot); 451 375 out_unlock: 452 376 spin_unlock_irq(&mapping->tree_lock); 453 377 return entry; 454 - } 455 - 456 - /* 457 - * We do not necessarily hold the mapping->tree_lock when we call this 458 - * function so it is possible that 'entry' is no longer a valid item in the 459 - * radix tree. This is okay because all we really need to do is to find the 460 - * correct waitqueue where tasks might be waiting for that old 'entry' and 461 - * wake them. 462 - */ 463 - void dax_wake_mapping_entry_waiter(struct address_space *mapping, 464 - pgoff_t index, void *entry, bool wake_all) 465 - { 466 - struct exceptional_entry_key key; 467 - wait_queue_head_t *wq; 468 - 469 - wq = dax_entry_waitqueue(mapping, index, entry, &key); 470 - 471 - /* 472 - * Checking for locked entry and prepare_to_wait_exclusive() happens 473 - * under mapping->tree_lock, ditto for entry handling in our callers. 474 - * So at this point all tasks that could have seen our entry locked 475 - * must be in the waitqueue and the following check will see them. 476 - */ 477 - if (waitqueue_active(wq)) 478 - __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key); 479 378 } 480 379 481 380 static int __dax_invalidate_mapping_entry(struct address_space *mapping, ··· 447 426 448 427 spin_lock_irq(&mapping->tree_lock); 449 428 entry = get_unlocked_mapping_entry(mapping, index, NULL); 450 - if (!entry || !radix_tree_exceptional_entry(entry)) 429 + if (!entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) 451 430 goto out; 452 431 if (!trunc && 453 432 (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) || ··· 487 466 pgoff_t index) 488 467 { 489 468 return __dax_invalidate_mapping_entry(mapping, index, false); 490 - } 491 - 492 - /* 493 - * The user has performed a load from a hole in the file. 
Allocating 494 - * a new page in the file would cause excessive storage usage for 495 - * workloads with sparse files. We allocate a page cache page instead. 496 - * We'll kick it out of the page cache if it's ever written to, 497 - * otherwise it will simply fall out of the page cache under memory 498 - * pressure without ever having been dirtied. 499 - */ 500 - static int dax_load_hole(struct address_space *mapping, void **entry, 501 - struct vm_fault *vmf) 502 - { 503 - struct inode *inode = mapping->host; 504 - struct page *page; 505 - int ret; 506 - 507 - /* Hole page already exists? Return it... */ 508 - if (!radix_tree_exceptional_entry(*entry)) { 509 - page = *entry; 510 - goto finish_fault; 511 - } 512 - 513 - /* This will replace locked radix tree entry with a hole page */ 514 - page = find_or_create_page(mapping, vmf->pgoff, 515 - vmf->gfp_mask | __GFP_ZERO); 516 - if (!page) { 517 - ret = VM_FAULT_OOM; 518 - goto out; 519 - } 520 - 521 - finish_fault: 522 - vmf->page = page; 523 - ret = finish_fault(vmf); 524 - vmf->page = NULL; 525 - *entry = page; 526 - if (!ret) { 527 - /* Grab reference for PTE that is now referencing the page */ 528 - get_page(page); 529 - ret = VM_FAULT_NOPAGE; 530 - } 531 - out: 532 - trace_dax_load_hole(inode, vmf, ret); 533 - return ret; 534 469 } 535 470 536 471 static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev, ··· 529 552 unsigned long flags) 530 553 { 531 554 struct radix_tree_root *page_tree = &mapping->page_tree; 532 - int error = 0; 533 - bool hole_fill = false; 534 555 void *new_entry; 535 556 pgoff_t index = vmf->pgoff; 536 557 537 558 if (vmf->flags & FAULT_FLAG_WRITE) 538 559 __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 539 560 540 - /* Replacing hole page with block mapping? */ 541 - if (!radix_tree_exceptional_entry(entry)) { 542 - hole_fill = true; 543 - /* 544 - * Unmap the page now before we remove it from page cache below. 
545 - * The page is locked so it cannot be faulted in again. 546 - */ 547 - unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, 548 - PAGE_SIZE, 0); 549 - error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM); 550 - if (error) 551 - return ERR_PTR(error); 552 - } else if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_HZP)) { 553 - /* replacing huge zero page with PMD block mapping */ 554 - unmap_mapping_range(mapping, 555 - (vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0); 561 + if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_ZERO_PAGE)) { 562 + /* we are replacing a zero page with block mapping */ 563 + if (dax_is_pmd_entry(entry)) 564 + unmap_mapping_range(mapping, 565 + (vmf->pgoff << PAGE_SHIFT) & PMD_MASK, 566 + PMD_SIZE, 0); 567 + else /* pte entry */ 568 + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT, 569 + PAGE_SIZE, 0); 556 570 } 557 571 558 572 spin_lock_irq(&mapping->tree_lock); 559 573 new_entry = dax_radix_locked_entry(sector, flags); 560 574 561 - if (hole_fill) { 562 - __delete_from_page_cache(entry, NULL); 563 - /* Drop pagecache reference */ 564 - put_page(entry); 565 - error = __radix_tree_insert(page_tree, index, 566 - dax_radix_order(new_entry), new_entry); 567 - if (error) { 568 - new_entry = ERR_PTR(error); 569 - goto unlock; 570 - } 571 - mapping->nrexceptional++; 572 - } else if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { 575 + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { 573 576 /* 574 577 * Only swap our new entry into the radix tree if the current 575 578 * entry is a zero page or an empty entry. 
If a normal PTE or ··· 566 609 WARN_ON_ONCE(ret != entry); 567 610 __radix_tree_replace(page_tree, node, slot, 568 611 new_entry, NULL, NULL); 612 + entry = new_entry; 569 613 } 614 + 570 615 if (vmf->flags & FAULT_FLAG_WRITE) 571 616 radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY); 572 - unlock: 617 + 573 618 spin_unlock_irq(&mapping->tree_lock); 574 - if (hole_fill) { 575 - radix_tree_preload_end(); 576 - /* 577 - * We don't need hole page anymore, it has been replaced with 578 - * locked radix tree entry now. 579 - */ 580 - if (mapping->a_ops->freepage) 581 - mapping->a_ops->freepage(entry); 582 - unlock_page(entry); 583 - put_page(entry); 584 - } 585 - return new_entry; 619 + return entry; 586 620 } 587 621 588 622 static inline unsigned long ··· 675 727 spin_lock_irq(&mapping->tree_lock); 676 728 entry2 = get_unlocked_mapping_entry(mapping, index, &slot); 677 729 /* Entry got punched out / reallocated? */ 678 - if (!entry2 || !radix_tree_exceptional_entry(entry2)) 730 + if (!entry2 || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry2))) 679 731 goto put_unlocked; 680 732 /* 681 733 * Entry got reallocated elsewhere? No need to writeback. 
We have to ··· 747 799 trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT); 748 800 dax_unlock: 749 801 dax_read_unlock(id); 750 - put_locked_mapping_entry(mapping, index, entry); 802 + put_locked_mapping_entry(mapping, index); 751 803 return ret; 752 804 753 805 put_unlocked: ··· 822 874 823 875 static int dax_insert_mapping(struct address_space *mapping, 824 876 struct block_device *bdev, struct dax_device *dax_dev, 825 - sector_t sector, size_t size, void **entryp, 877 + sector_t sector, size_t size, void *entry, 826 878 struct vm_area_struct *vma, struct vm_fault *vmf) 827 879 { 828 880 unsigned long vaddr = vmf->address; 829 - void *entry = *entryp; 830 881 void *ret, *kaddr; 831 882 pgoff_t pgoff; 832 883 int id, rc; ··· 846 899 ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0); 847 900 if (IS_ERR(ret)) 848 901 return PTR_ERR(ret); 849 - *entryp = ret; 850 902 851 903 trace_dax_insert_mapping(mapping->host, vmf, ret); 852 - return vm_insert_mixed(vma, vaddr, pfn); 904 + if (vmf->flags & FAULT_FLAG_WRITE) 905 + return vm_insert_mixed_mkwrite(vma, vaddr, pfn); 906 + else 907 + return vm_insert_mixed(vma, vaddr, pfn); 853 908 } 854 909 855 - /** 856 - * dax_pfn_mkwrite - handle first write to DAX page 857 - * @vmf: The description of the fault 910 + /* 911 + * The user has performed a load from a hole in the file. Allocating a new 912 + * page in the file would cause excessive storage usage for workloads with 913 + * sparse files. Instead we insert a read-only mapping of the 4k zero page. 914 + * If this page is ever written to we will re-fault and change the mapping to 915 + * point to real DAX storage instead. 
858 916 */ 859 - int dax_pfn_mkwrite(struct vm_fault *vmf) 917 + static int dax_load_hole(struct address_space *mapping, void *entry, 918 + struct vm_fault *vmf) 860 919 { 861 - struct file *file = vmf->vma->vm_file; 862 - struct address_space *mapping = file->f_mapping; 863 920 struct inode *inode = mapping->host; 864 - void *entry, **slot; 865 - pgoff_t index = vmf->pgoff; 921 + unsigned long vaddr = vmf->address; 922 + int ret = VM_FAULT_NOPAGE; 923 + struct page *zero_page; 924 + void *entry2; 866 925 867 - spin_lock_irq(&mapping->tree_lock); 868 - entry = get_unlocked_mapping_entry(mapping, index, &slot); 869 - if (!entry || !radix_tree_exceptional_entry(entry)) { 870 - if (entry) 871 - put_unlocked_mapping_entry(mapping, index, entry); 872 - spin_unlock_irq(&mapping->tree_lock); 873 - trace_dax_pfn_mkwrite_no_entry(inode, vmf, VM_FAULT_NOPAGE); 874 - return VM_FAULT_NOPAGE; 926 + zero_page = ZERO_PAGE(0); 927 + if (unlikely(!zero_page)) { 928 + ret = VM_FAULT_OOM; 929 + goto out; 875 930 } 876 - radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY); 877 - entry = lock_slot(mapping, slot); 878 - spin_unlock_irq(&mapping->tree_lock); 879 - /* 880 - * If we race with somebody updating the PTE and finish_mkwrite_fault() 881 - * fails, we don't care. We need to return VM_FAULT_NOPAGE and retry 882 - * the fault in either case. 
883 - */ 884 - finish_mkwrite_fault(vmf); 885 - put_locked_mapping_entry(mapping, index, entry); 886 - trace_dax_pfn_mkwrite(inode, vmf, VM_FAULT_NOPAGE); 887 - return VM_FAULT_NOPAGE; 931 + 932 + entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0, 933 + RADIX_DAX_ZERO_PAGE); 934 + if (IS_ERR(entry2)) { 935 + ret = VM_FAULT_SIGBUS; 936 + goto out; 937 + } 938 + 939 + vm_insert_mixed(vmf->vma, vaddr, page_to_pfn_t(zero_page)); 940 + out: 941 + trace_dax_load_hole(inode, vmf, ret); 942 + return ret; 888 943 } 889 - EXPORT_SYMBOL_GPL(dax_pfn_mkwrite); 890 944 891 945 static bool dax_range_is_aligned(struct block_device *bdev, 892 946 unsigned int offset, unsigned int length) ··· 1007 1059 if (map_len > end - pos) 1008 1060 map_len = end - pos; 1009 1061 1062 + /* 1063 + * The userspace address for the memory copy has already been 1064 + * validated via access_ok() in either vfs_read() or 1065 + * vfs_write(), depending on which operation we are doing. 1066 + */ 1010 1067 if (iov_iter_rw(iter) == WRITE) 1011 1068 map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr, 1012 1069 map_len, iter); ··· 1176 1223 major = VM_FAULT_MAJOR; 1177 1224 } 1178 1225 error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev, 1179 - sector, PAGE_SIZE, &entry, vmf->vma, vmf); 1226 + sector, PAGE_SIZE, entry, vmf->vma, vmf); 1180 1227 /* -EBUSY is fine, somebody else faulted on the same PTE */ 1181 1228 if (error == -EBUSY) 1182 1229 error = 0; ··· 1184 1231 case IOMAP_UNWRITTEN: 1185 1232 case IOMAP_HOLE: 1186 1233 if (!(vmf->flags & FAULT_FLAG_WRITE)) { 1187 - vmf_ret = dax_load_hole(mapping, &entry, vmf); 1234 + vmf_ret = dax_load_hole(mapping, entry, vmf); 1188 1235 goto finish_iomap; 1189 1236 } 1190 1237 /*FALLTHRU*/ ··· 1211 1258 ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap); 1212 1259 } 1213 1260 unlock_entry: 1214 - put_locked_mapping_entry(mapping, vmf->pgoff, entry); 1261 + put_locked_mapping_entry(mapping, vmf->pgoff); 1215 1262 out: 1216 1263 
trace_dax_pte_fault_done(inode, vmf, vmf_ret); 1217 1264 return vmf_ret; 1218 1265 } 1219 1266 1220 1267 #ifdef CONFIG_FS_DAX_PMD 1221 - /* 1222 - * The 'colour' (ie low bits) within a PMD of a page offset. This comes up 1223 - * more often than one might expect in the below functions. 1224 - */ 1225 - #define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) 1226 - 1227 1268 static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap, 1228 - loff_t pos, void **entryp) 1269 + loff_t pos, void *entry) 1229 1270 { 1230 1271 struct address_space *mapping = vmf->vma->vm_file->f_mapping; 1231 1272 const sector_t sector = dax_iomap_sector(iomap, pos); ··· 1230 1283 void *ret = NULL, *kaddr; 1231 1284 long length = 0; 1232 1285 pgoff_t pgoff; 1233 - pfn_t pfn; 1286 + pfn_t pfn = {}; 1234 1287 int id; 1235 1288 1236 1289 if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0) ··· 1250 1303 goto unlock_fallback; 1251 1304 dax_read_unlock(id); 1252 1305 1253 - ret = dax_insert_mapping_entry(mapping, vmf, *entryp, sector, 1306 + ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 1254 1307 RADIX_DAX_PMD); 1255 1308 if (IS_ERR(ret)) 1256 1309 goto fallback; 1257 - *entryp = ret; 1258 1310 1259 1311 trace_dax_pmd_insert_mapping(inode, vmf, length, pfn, ret); 1260 1312 return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, ··· 1267 1321 } 1268 1322 1269 1323 static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap, 1270 - void **entryp) 1324 + void *entry) 1271 1325 { 1272 1326 struct address_space *mapping = vmf->vma->vm_file->f_mapping; 1273 1327 unsigned long pmd_addr = vmf->address & PMD_MASK; ··· 1282 1336 if (unlikely(!zero_page)) 1283 1337 goto fallback; 1284 1338 1285 - ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0, 1286 - RADIX_DAX_PMD | RADIX_DAX_HZP); 1339 + ret = dax_insert_mapping_entry(mapping, vmf, entry, 0, 1340 + RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE); 1287 1341 if (IS_ERR(ret)) 1288 1342 goto fallback; 1289 - *entryp = 
ret; 1290 1343 1291 1344 ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); 1292 1345 if (!pmd_none(*(vmf->pmd))) { ··· 1361 1416 goto fallback; 1362 1417 1363 1418 /* 1364 - * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX 1365 - * PMD or a HZP entry. If it can't (because a 4k page is already in 1366 - * the tree, for instance), it will return -EEXIST and we just fall 1367 - * back to 4k entries. 1419 + * grab_mapping_entry() will make sure we get a 2MiB empty entry, a 1420 + * 2MiB zero page entry or a DAX PMD. If it can't (because a 4k page 1421 + * is already in the tree, for instance), it will return -EEXIST and 1422 + * we just fall back to 4k entries. 1368 1423 */ 1369 1424 entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD); 1370 1425 if (IS_ERR(entry)) ··· 1397 1452 1398 1453 switch (iomap.type) { 1399 1454 case IOMAP_MAPPED: 1400 - result = dax_pmd_insert_mapping(vmf, &iomap, pos, &entry); 1455 + result = dax_pmd_insert_mapping(vmf, &iomap, pos, entry); 1401 1456 break; 1402 1457 case IOMAP_UNWRITTEN: 1403 1458 case IOMAP_HOLE: 1404 1459 if (WARN_ON_ONCE(write)) 1405 1460 break; 1406 - result = dax_pmd_load_hole(vmf, &iomap, &entry); 1461 + result = dax_pmd_load_hole(vmf, &iomap, entry); 1407 1462 break; 1408 1463 default: 1409 1464 WARN_ON_ONCE(1); ··· 1426 1481 &iomap); 1427 1482 } 1428 1483 unlock_entry: 1429 - put_locked_mapping_entry(mapping, pgoff, entry); 1484 + put_locked_mapping_entry(mapping, pgoff); 1430 1485 fallback: 1431 1486 if (result == VM_FAULT_FALLBACK) { 1432 1487 split_huge_pmd(vma, vmf->pmd, vmf->address);
+1 -24
fs/ext2/file.c
··· 107 107 return ret; 108 108 } 109 109 110 - static int ext2_dax_pfn_mkwrite(struct vm_fault *vmf) 111 - { 112 - struct inode *inode = file_inode(vmf->vma->vm_file); 113 - struct ext2_inode_info *ei = EXT2_I(inode); 114 - loff_t size; 115 - int ret; 116 - 117 - sb_start_pagefault(inode->i_sb); 118 - file_update_time(vmf->vma->vm_file); 119 - down_read(&ei->dax_sem); 120 - 121 - /* check that the faulting page hasn't raced with truncate */ 122 - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; 123 - if (vmf->pgoff >= size) 124 - ret = VM_FAULT_SIGBUS; 125 - else 126 - ret = dax_pfn_mkwrite(vmf); 127 - 128 - up_read(&ei->dax_sem); 129 - sb_end_pagefault(inode->i_sb); 130 - return ret; 131 - } 132 - 133 110 static const struct vm_operations_struct ext2_dax_vm_ops = { 134 111 .fault = ext2_dax_fault, 135 112 /* ··· 115 138 * will always fail and fail back to regular faults. 116 139 */ 117 140 .page_mkwrite = ext2_dax_fault, 118 - .pfn_mkwrite = ext2_dax_pfn_mkwrite, 141 + .pfn_mkwrite = ext2_dax_fault, 119 142 }; 120 143 121 144 static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+5 -43
fs/ext4/file.c
··· 324 324 return ext4_dax_huge_fault(vmf, PE_SIZE_PTE); 325 325 } 326 326 327 - /* 328 - * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_fault() 329 - * handler we check for races agaist truncate. Note that since we cycle through 330 - * i_mmap_sem, we are sure that also any hole punching that began before we 331 - * were called is finished by now and so if it included part of the file we 332 - * are working on, our pte will get unmapped and the check for pte_same() in 333 - * wp_pfn_shared() fails. Thus fault gets retried and things work out as 334 - * desired. 335 - */ 336 - static int ext4_dax_pfn_mkwrite(struct vm_fault *vmf) 337 - { 338 - struct inode *inode = file_inode(vmf->vma->vm_file); 339 - struct super_block *sb = inode->i_sb; 340 - loff_t size; 341 - int ret; 342 - 343 - sb_start_pagefault(sb); 344 - file_update_time(vmf->vma->vm_file); 345 - down_read(&EXT4_I(inode)->i_mmap_sem); 346 - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; 347 - if (vmf->pgoff >= size) 348 - ret = VM_FAULT_SIGBUS; 349 - else 350 - ret = dax_pfn_mkwrite(vmf); 351 - up_read(&EXT4_I(inode)->i_mmap_sem); 352 - sb_end_pagefault(sb); 353 - 354 - return ret; 355 - } 356 - 357 327 static const struct vm_operations_struct ext4_dax_vm_ops = { 358 328 .fault = ext4_dax_fault, 359 329 .huge_fault = ext4_dax_huge_fault, 360 330 .page_mkwrite = ext4_dax_fault, 361 - .pfn_mkwrite = ext4_dax_pfn_mkwrite, 331 + .pfn_mkwrite = ext4_dax_fault, 362 332 }; 363 333 #else 364 334 #define ext4_dax_vm_ops ext4_file_vm_ops ··· 477 507 478 508 pagevec_init(&pvec, 0); 479 509 do { 480 - int i, num; 510 + int i; 481 511 unsigned long nr_pages; 482 512 483 - num = min_t(pgoff_t, end - index, PAGEVEC_SIZE - 1) + 1; 484 - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index, 485 - (pgoff_t)num); 513 + nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, 514 + &index, end); 486 515 if (nr_pages == 0) 487 516 break; 488 517 ··· 499 530 *offset = lastoff; 500 531 goto 
out; 501 532 } 502 - 503 - if (page->index > end) 504 - goto out; 505 533 506 534 lock_page(page); 507 535 ··· 542 576 unlock_page(page); 543 577 } 544 578 545 - /* The no. of pages is less than our desired, we are done. */ 546 - if (nr_pages < num) 547 - break; 548 - 549 - index = pvec.pages[i - 1]->index + 1; 550 579 pagevec_release(&pvec); 551 580 } while (index <= end); 552 581 582 + /* There are no pages upto endoff - that would be a hole in there. */ 553 583 if (whence == SEEK_HOLE && lastoff < endoff) { 554 584 found = 1; 555 585 *offset = lastoff;
+4 -11
fs/ext4/inode.c
··· 1720 1720 1721 1721 pagevec_init(&pvec, 0); 1722 1722 while (index <= end) { 1723 - nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE); 1723 + nr_pages = pagevec_lookup_range(&pvec, mapping, &index, end); 1724 1724 if (nr_pages == 0) 1725 1725 break; 1726 1726 for (i = 0; i < nr_pages; i++) { 1727 1727 struct page *page = pvec.pages[i]; 1728 - if (page->index > end) 1729 - break; 1728 + 1730 1729 BUG_ON(!PageLocked(page)); 1731 1730 BUG_ON(PageWriteback(page)); 1732 1731 if (invalidate) { ··· 1736 1737 } 1737 1738 unlock_page(page); 1738 1739 } 1739 - index = pvec.pages[nr_pages - 1]->index + 1; 1740 1740 pagevec_release(&pvec); 1741 1741 } 1742 1742 } ··· 2346 2348 2347 2349 pagevec_init(&pvec, 0); 2348 2350 while (start <= end) { 2349 - nr_pages = pagevec_lookup(&pvec, inode->i_mapping, start, 2350 - PAGEVEC_SIZE); 2351 + nr_pages = pagevec_lookup_range(&pvec, inode->i_mapping, 2352 + &start, end); 2351 2353 if (nr_pages == 0) 2352 2354 break; 2353 2355 for (i = 0; i < nr_pages; i++) { 2354 2356 struct page *page = pvec.pages[i]; 2355 2357 2356 - if (page->index > end) 2357 - break; 2358 - /* Up to 'end' pages must be contiguous */ 2359 - BUG_ON(page->index != start); 2360 2358 bh = head = page_buffers(page); 2361 2359 do { 2362 2360 if (lblk < mpd->map.m_lblk) ··· 2397 2403 pagevec_release(&pvec); 2398 2404 return err; 2399 2405 } 2400 - start++; 2401 2406 } 2402 2407 pagevec_release(&pvec); 2403 2408 }
+2 -3
fs/fscache/page.c
··· 1178 1178 pagevec_init(&pvec, 0); 1179 1179 next = 0; 1180 1180 do { 1181 - if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) 1181 + if (!pagevec_lookup(&pvec, mapping, &next)) 1182 1182 break; 1183 1183 for (i = 0; i < pagevec_count(&pvec); i++) { 1184 1184 struct page *page = pvec.pages[i]; 1185 - next = page->index; 1186 1185 if (PageFsCache(page)) { 1187 1186 __fscache_wait_on_page_write(cookie, page); 1188 1187 __fscache_uncache_page(cookie, page); ··· 1189 1190 } 1190 1191 pagevec_release(&pvec); 1191 1192 cond_resched(); 1192 - } while (++next); 1193 + } while (next); 1193 1194 1194 1195 _leave(""); 1195 1196 }
+7 -23
fs/hugetlbfs/inode.c
··· 401 401 const pgoff_t end = lend >> huge_page_shift(h); 402 402 struct vm_area_struct pseudo_vma; 403 403 struct pagevec pvec; 404 - pgoff_t next; 404 + pgoff_t next, index; 405 405 int i, freed = 0; 406 - long lookup_nr = PAGEVEC_SIZE; 407 406 bool truncate_op = (lend == LLONG_MAX); 408 407 409 408 memset(&pseudo_vma, 0, sizeof(struct vm_area_struct)); ··· 411 412 next = start; 412 413 while (next < end) { 413 414 /* 414 - * Don't grab more pages than the number left in the range. 415 - */ 416 - if (end - next < lookup_nr) 417 - lookup_nr = end - next; 418 - 419 - /* 420 415 * When no more pages are found, we are done. 421 416 */ 422 - if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) 417 + if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1)) 423 418 break; 424 419 425 420 for (i = 0; i < pagevec_count(&pvec); ++i) { 426 421 struct page *page = pvec.pages[i]; 427 422 u32 hash; 428 423 429 - /* 430 - * The page (index) could be beyond end. This is 431 - * only possible in the punch hole case as end is 432 - * max page offset in the truncate case. 
433 - */ 434 - next = page->index; 435 - if (next >= end) 436 - break; 437 - 424 + index = page->index; 438 425 hash = hugetlb_fault_mutex_hash(h, current->mm, 439 426 &pseudo_vma, 440 - mapping, next, 0); 427 + mapping, index, 0); 441 428 mutex_lock(&hugetlb_fault_mutex_table[hash]); 442 429 443 430 /* ··· 440 455 441 456 i_mmap_lock_write(mapping); 442 457 hugetlb_vmdelete_list(&mapping->i_mmap, 443 - next * pages_per_huge_page(h), 444 - (next + 1) * pages_per_huge_page(h)); 458 + index * pages_per_huge_page(h), 459 + (index + 1) * pages_per_huge_page(h)); 445 460 i_mmap_unlock_write(mapping); 446 461 } 447 462 ··· 460 475 freed++; 461 476 if (!truncate_op) { 462 477 if (unlikely(hugetlb_unreserve_pages(inode, 463 - next, next + 1, 1))) 478 + index, index + 1, 1))) 464 479 hugetlb_fix_reserve_counts(inode); 465 480 } 466 481 467 482 unlock_page(page); 468 483 mutex_unlock(&hugetlb_fault_mutex_table[hash]); 469 484 } 470 - ++next; 471 485 huge_pagevec_release(&pvec); 472 486 cond_resched(); 473 487 }
-40
fs/nfs/fscache-index.c
··· 252 252 } 253 253 254 254 /* 255 - * Indication from FS-Cache that the cookie is no longer cached 256 - * - This function is called when the backing store currently caching a cookie 257 - * is removed 258 - * - The netfs should use this to clean up any markers indicating cached pages 259 - * - This is mandatory for any object that may have data 260 - */ 261 - static void nfs_fscache_inode_now_uncached(void *cookie_netfs_data) 262 - { 263 - struct nfs_inode *nfsi = cookie_netfs_data; 264 - struct pagevec pvec; 265 - pgoff_t first; 266 - int loop, nr_pages; 267 - 268 - pagevec_init(&pvec, 0); 269 - first = 0; 270 - 271 - dprintk("NFS: nfs_inode_now_uncached: nfs_inode 0x%p\n", nfsi); 272 - 273 - for (;;) { 274 - /* grab a bunch of pages to unmark */ 275 - nr_pages = pagevec_lookup(&pvec, 276 - nfsi->vfs_inode.i_mapping, 277 - first, 278 - PAGEVEC_SIZE - pagevec_count(&pvec)); 279 - if (!nr_pages) 280 - break; 281 - 282 - for (loop = 0; loop < nr_pages; loop++) 283 - ClearPageFsCache(pvec.pages[loop]); 284 - 285 - first = pvec.pages[nr_pages - 1]->index + 1; 286 - 287 - pvec.nr = nr_pages; 288 - pagevec_release(&pvec); 289 - cond_resched(); 290 - } 291 - } 292 - 293 - /* 294 255 * Get an extra reference on a read context. 295 256 * - This function can be absent if the completion function doesn't require a 296 257 * context. ··· 291 330 .get_attr = nfs_fscache_inode_get_attr, 292 331 .get_aux = nfs_fscache_inode_get_aux, 293 332 .check_aux = nfs_fscache_inode_check_aux, 294 - .now_uncached = nfs_fscache_inode_now_uncached, 295 333 .get_context = nfs_fh_get_context, 296 334 .put_context = nfs_fh_put_context, 297 335 };
+1 -2
fs/nilfs2/page.c
··· 312 312 313 313 pagevec_init(&pvec, 0); 314 314 repeat: 315 - n = pagevec_lookup(&pvec, smap, index, PAGEVEC_SIZE); 315 + n = pagevec_lookup(&pvec, smap, &index); 316 316 if (!n) 317 317 return; 318 - index = pvec.pages[n - 1]->index + 1; 319 318 320 319 for (i = 0; i < pagevec_count(&pvec); i++) { 321 320 struct page *page = pvec.pages[i], *dpage;
+1 -1
fs/ocfs2/acl.c
··· 221 221 /* 222 222 * Set the access or default ACL of an inode. 223 223 */ 224 - int ocfs2_set_acl(handle_t *handle, 224 + static int ocfs2_set_acl(handle_t *handle, 225 225 struct inode *inode, 226 226 struct buffer_head *di_bh, 227 227 int type,
-7
fs/ocfs2/acl.h
··· 28 28 29 29 struct posix_acl *ocfs2_iop_get_acl(struct inode *inode, int type); 30 30 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, int type); 31 - int ocfs2_set_acl(handle_t *handle, 32 - struct inode *inode, 33 - struct buffer_head *di_bh, 34 - int type, 35 - struct posix_acl *acl, 36 - struct ocfs2_alloc_context *meta_ac, 37 - struct ocfs2_alloc_context *data_ac); 38 31 extern int ocfs2_acl_chmod(struct inode *, struct buffer_head *); 39 32 extern int ocfs2_init_acl(handle_t *, struct inode *, struct inode *, 40 33 struct buffer_head *, struct buffer_head *,
+8 -14
fs/ocfs2/alloc.c
··· 955 955 /* 956 956 * How many free extents have we got before we need more meta data? 957 957 */ 958 - int ocfs2_num_free_extents(struct ocfs2_super *osb, 959 - struct ocfs2_extent_tree *et) 958 + int ocfs2_num_free_extents(struct ocfs2_extent_tree *et) 960 959 { 961 960 int retval; 962 961 struct ocfs2_extent_list *el = NULL; ··· 1932 1933 * the new changes. 1933 1934 * 1934 1935 * left_rec: the record on the left. 1935 - * left_child_el: is the child list pointed to by left_rec 1936 1936 * right_rec: the record to the right of left_rec 1937 1937 * right_child_el: is the child list pointed to by right_rec 1938 1938 * 1939 1939 * By definition, this only works on interior nodes. 1940 1940 */ 1941 1941 static void ocfs2_adjust_adjacent_records(struct ocfs2_extent_rec *left_rec, 1942 - struct ocfs2_extent_list *left_child_el, 1943 1942 struct ocfs2_extent_rec *right_rec, 1944 1943 struct ocfs2_extent_list *right_child_el) 1945 1944 { ··· 2000 2003 */ 2001 2004 BUG_ON(i >= (le16_to_cpu(root_el->l_next_free_rec) - 1)); 2002 2005 2003 - ocfs2_adjust_adjacent_records(&root_el->l_recs[i], left_el, 2006 + ocfs2_adjust_adjacent_records(&root_el->l_recs[i], 2004 2007 &root_el->l_recs[i + 1], right_el); 2005 2008 } 2006 2009 ··· 2057 2060 el = right_path->p_node[i].el; 2058 2061 right_rec = &el->l_recs[0]; 2059 2062 2060 - ocfs2_adjust_adjacent_records(left_rec, left_el, right_rec, 2061 - right_el); 2063 + ocfs2_adjust_adjacent_records(left_rec, right_rec, right_el); 2062 2064 2063 2065 ocfs2_journal_dirty(handle, left_path->p_node[i].bh); 2064 2066 ocfs2_journal_dirty(handle, right_path->p_node[i].bh); ··· 2505 2509 2506 2510 static int ocfs2_update_edge_lengths(handle_t *handle, 2507 2511 struct ocfs2_extent_tree *et, 2508 - int subtree_index, struct ocfs2_path *path) 2512 + struct ocfs2_path *path) 2509 2513 { 2510 2514 int i, idx, ret; 2511 2515 struct ocfs2_extent_rec *rec; ··· 2751 2755 if (del_right_subtree) { 2752 2756 ocfs2_unlink_subtree(handle, et, left_path, 
right_path, 2753 2757 subtree_index, dealloc); 2754 - ret = ocfs2_update_edge_lengths(handle, et, subtree_index, 2755 - left_path); 2758 + ret = ocfs2_update_edge_lengths(handle, et, left_path); 2756 2759 if (ret) { 2757 2760 mlog_errno(ret); 2758 2761 goto out; ··· 3055 3060 3056 3061 ocfs2_unlink_subtree(handle, et, left_path, path, 3057 3062 subtree_index, dealloc); 3058 - ret = ocfs2_update_edge_lengths(handle, et, subtree_index, 3059 - left_path); 3063 + ret = ocfs2_update_edge_lengths(handle, et, left_path); 3060 3064 if (ret) { 3061 3065 mlog_errno(ret); 3062 3066 goto out; ··· 4784 4790 if (mark_unwritten) 4785 4791 flags = OCFS2_EXT_UNWRITTEN; 4786 4792 4787 - free_extents = ocfs2_num_free_extents(osb, et); 4793 + free_extents = ocfs2_num_free_extents(et); 4788 4794 if (free_extents < 0) { 4789 4795 status = free_extents; 4790 4796 mlog_errno(status); ··· 5662 5668 5663 5669 *ac = NULL; 5664 5670 5665 - num_free_extents = ocfs2_num_free_extents(osb, et); 5671 + num_free_extents = ocfs2_num_free_extents(et); 5666 5672 if (num_free_extents < 0) { 5667 5673 ret = num_free_extents; 5668 5674 mlog_errno(ret);
+1 -2
fs/ocfs2/alloc.h
··· 144 144 struct ocfs2_cached_dealloc_ctxt *dealloc, 145 145 u64 refcount_loc, bool refcount_tree_locked); 146 146 147 - int ocfs2_num_free_extents(struct ocfs2_super *osb, 148 - struct ocfs2_extent_tree *et); 147 + int ocfs2_num_free_extents(struct ocfs2_extent_tree *et); 149 148 150 149 /* 151 150 * how many new metadata chunks would an allocation need at maximum?
+4 -38
fs/ocfs2/cluster/heartbeat.c
··· 505 505 } 506 506 } 507 507 508 - static void o2hb_wait_on_io(struct o2hb_region *reg, 509 - struct o2hb_bio_wait_ctxt *wc) 508 + static void o2hb_wait_on_io(struct o2hb_bio_wait_ctxt *wc) 510 509 { 511 510 o2hb_bio_wait_dec(wc, 1); 512 511 wait_for_completion(&wc->wc_io_complete); ··· 607 608 status = 0; 608 609 609 610 bail_and_wait: 610 - o2hb_wait_on_io(reg, &wc); 611 + o2hb_wait_on_io(&wc); 611 612 if (wc.wc_error && !status) 612 613 status = wc.wc_error; 613 614 ··· 1161 1162 * before we can go to steady state. This ensures that 1162 1163 * people we find in our steady state have seen us. 1163 1164 */ 1164 - o2hb_wait_on_io(reg, &write_wc); 1165 + o2hb_wait_on_io(&write_wc); 1165 1166 if (write_wc.wc_error) { 1166 1167 /* Do not re-arm the write timeout on I/O error - we 1167 1168 * can't be sure that the new block ever made it to ··· 1274 1275 o2hb_prepare_block(reg, 0); 1275 1276 ret = o2hb_issue_node_write(reg, &write_wc); 1276 1277 if (ret == 0) 1277 - o2hb_wait_on_io(reg, &write_wc); 1278 + o2hb_wait_on_io(&write_wc); 1278 1279 else 1279 1280 mlog_errno(ret); 1280 1281 } ··· 2575 2576 } 2576 2577 EXPORT_SYMBOL_GPL(o2hb_unregister_callback); 2577 2578 2578 - int o2hb_check_node_heartbeating(u8 node_num) 2579 - { 2580 - unsigned long testing_map[BITS_TO_LONGS(O2NM_MAX_NODES)]; 2581 - 2582 - o2hb_fill_node_map(testing_map, sizeof(testing_map)); 2583 - if (!test_bit(node_num, testing_map)) { 2584 - mlog(ML_HEARTBEAT, 2585 - "node (%u) does not have heartbeating enabled.\n", 2586 - node_num); 2587 - return 0; 2588 - } 2589 - 2590 - return 1; 2591 - } 2592 - EXPORT_SYMBOL_GPL(o2hb_check_node_heartbeating); 2593 - 2594 2579 int o2hb_check_node_heartbeating_no_sem(u8 node_num) 2595 2580 { 2596 2581 unsigned long testing_map[BITS_TO_LONGS(O2NM_MAX_NODES)]; ··· 2608 2625 return 1; 2609 2626 } 2610 2627 EXPORT_SYMBOL_GPL(o2hb_check_node_heartbeating_from_callback); 2611 - 2612 - /* Makes sure our local node is configured with a node number, and is 2613 - * 
heartbeating. */ 2614 - int o2hb_check_local_node_heartbeating(void) 2615 - { 2616 - u8 node_num; 2617 - 2618 - /* if this node was set then we have networking */ 2619 - node_num = o2nm_this_node(); 2620 - if (node_num == O2NM_MAX_NODES) { 2621 - mlog(ML_HEARTBEAT, "this node has not been configured.\n"); 2622 - return 0; 2623 - } 2624 - 2625 - return o2hb_check_node_heartbeating(node_num); 2626 - } 2627 - EXPORT_SYMBOL_GPL(o2hb_check_local_node_heartbeating); 2628 2628 2629 2629 /* 2630 2630 * this is just a hack until we get the plumbing which flips file systems
+1 -1
fs/ocfs2/dir.c
··· 3249 3249 spin_unlock(&OCFS2_I(dir)->ip_lock); 3250 3250 ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(dir), 3251 3251 parent_fe_bh); 3252 - num_free_extents = ocfs2_num_free_extents(osb, &et); 3252 + num_free_extents = ocfs2_num_free_extents(&et); 3253 3253 if (num_free_extents < 0) { 3254 3254 status = num_free_extents; 3255 3255 mlog_errno(status);
-7
fs/ocfs2/file.c
··· 713 713 return status; 714 714 } 715 715 716 - int ocfs2_extend_allocation(struct inode *inode, u32 logical_start, 717 - u32 clusters_to_add, int mark_unwritten) 718 - { 719 - return __ocfs2_extend_allocation(inode, logical_start, 720 - clusters_to_add, mark_unwritten); 721 - } 722 - 723 716 /* 724 717 * While a write will already be ordering the data, a truncate will not. 725 718 * Thus, we need to explicitly order the zeroed pages.
-1
fs/ocfs2/journal.c
··· 1348 1348 ocfs2_schedule_truncate_log_flush(osb, 0); 1349 1349 1350 1350 osb->local_alloc_copy = NULL; 1351 - osb->dirty = 0; 1352 1351 1353 1352 /* queue to recover orphan slots for all offline slots */ 1354 1353 ocfs2_replay_map_set_state(osb, REPLAY_NEEDED);
+1 -1
fs/ocfs2/move_extents.c
··· 175 175 unsigned int max_recs_needed = 2 * extents_to_split + clusters_to_move; 176 176 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 177 177 178 - num_free_extents = ocfs2_num_free_extents(osb, et); 178 + num_free_extents = ocfs2_num_free_extents(et); 179 179 if (num_free_extents < 0) { 180 180 ret = num_free_extents; 181 181 mlog_errno(ret);
+1 -3
fs/ocfs2/ocfs2.h
··· 320 320 u64 system_dir_blkno; 321 321 u64 bitmap_blkno; 322 322 u32 bitmap_cpg; 323 - u8 *uuid; 324 323 char *uuid_str; 325 324 u32 uuid_hash; 326 325 u8 *vol_label; ··· 387 388 unsigned int osb_resv_level; 388 389 unsigned int osb_dir_resv_level; 389 390 390 - /* Next three fields are for local node slot recovery during 391 + /* Next two fields are for local node slot recovery during 391 392 * mount. */ 392 - int dirty; 393 393 struct ocfs2_dinode *local_alloc_copy; 394 394 struct ocfs2_quota_recovery *quota_rec; 395 395
+1 -1
fs/ocfs2/refcounttree.c
··· 2851 2851 int *credits) 2852 2852 { 2853 2853 int ret = 0, meta_add = 0; 2854 - int num_free_extents = ocfs2_num_free_extents(OCFS2_SB(sb), et); 2854 + int num_free_extents = ocfs2_num_free_extents(et); 2855 2855 2856 2856 if (num_free_extents < 0) { 2857 2857 ret = num_free_extents;
+1 -1
fs/ocfs2/suballoc.c
··· 2700 2700 2701 2701 BUG_ON(clusters_to_add != 0 && data_ac == NULL); 2702 2702 2703 - num_free_extents = ocfs2_num_free_extents(osb, et); 2703 + num_free_extents = ocfs2_num_free_extents(et); 2704 2704 if (num_free_extents < 0) { 2705 2705 ret = num_free_extents; 2706 2706 mlog_errno(ret);
-1
fs/ocfs2/super.c
··· 2486 2486 if (dirty) { 2487 2487 /* Recovery will be completed after we've mounted the 2488 2488 * rest of the volume. */ 2489 - osb->dirty = 1; 2490 2489 osb->local_alloc_copy = local_alloc; 2491 2490 local_alloc = NULL; 2492 2491 }
+1 -1
fs/ocfs2/xattr.c
··· 6800 6800 *credits += 1; 6801 6801 6802 6802 /* count in the xattr tree change. */ 6803 - num_free_extents = ocfs2_num_free_extents(osb, xt_et); 6803 + num_free_extents = ocfs2_num_free_extents(xt_et); 6804 6804 if (num_free_extents < 0) { 6805 6805 ret = num_free_extents; 6806 6806 mlog_errno(ret);
+2
fs/proc/base.c
··· 2931 2931 #ifdef CONFIG_PROC_PAGE_MONITOR 2932 2932 REG("clear_refs", S_IWUSR, proc_clear_refs_operations), 2933 2933 REG("smaps", S_IRUGO, proc_pid_smaps_operations), 2934 + REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), 2934 2935 REG("pagemap", S_IRUSR, proc_pagemap_operations), 2935 2936 #endif 2936 2937 #ifdef CONFIG_SECURITY ··· 3325 3324 #ifdef CONFIG_PROC_PAGE_MONITOR 3326 3325 REG("clear_refs", S_IWUSR, proc_clear_refs_operations), 3327 3326 REG("smaps", S_IRUGO, proc_tid_smaps_operations), 3327 + REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations), 3328 3328 REG("pagemap", S_IRUSR, proc_pagemap_operations), 3329 3329 #endif 3330 3330 #ifdef CONFIG_SECURITY
+3
fs/proc/internal.h
··· 269 269 /* 270 270 * task_[no]mmu.c 271 271 */ 272 + struct mem_size_stats; 272 273 struct proc_maps_private { 273 274 struct inode *inode; 274 275 struct task_struct *task; 275 276 struct mm_struct *mm; 277 + struct mem_size_stats *rollup; 276 278 #ifdef CONFIG_MMU 277 279 struct vm_area_struct *tail_vma; 278 280 #endif ··· 290 288 extern const struct file_operations proc_pid_numa_maps_operations; 291 289 extern const struct file_operations proc_tid_numa_maps_operations; 292 290 extern const struct file_operations proc_pid_smaps_operations; 291 + extern const struct file_operations proc_pid_smaps_rollup_operations; 293 292 extern const struct file_operations proc_tid_smaps_operations; 294 293 extern const struct file_operations proc_clear_refs_operations; 295 294 extern const struct file_operations proc_pagemap_operations;
+5 -5
fs/proc/meminfo.c
··· 80 80 show_val_kb(m, "Active(file): ", pages[LRU_ACTIVE_FILE]); 81 81 show_val_kb(m, "Inactive(file): ", pages[LRU_INACTIVE_FILE]); 82 82 show_val_kb(m, "Unevictable: ", pages[LRU_UNEVICTABLE]); 83 - show_val_kb(m, "Mlocked: ", global_page_state(NR_MLOCK)); 83 + show_val_kb(m, "Mlocked: ", global_zone_page_state(NR_MLOCK)); 84 84 85 85 #ifdef CONFIG_HIGHMEM 86 86 show_val_kb(m, "HighTotal: ", i.totalhigh); ··· 114 114 show_val_kb(m, "SUnreclaim: ", 115 115 global_node_page_state(NR_SLAB_UNRECLAIMABLE)); 116 116 seq_printf(m, "KernelStack: %8lu kB\n", 117 - global_page_state(NR_KERNEL_STACK_KB)); 117 + global_zone_page_state(NR_KERNEL_STACK_KB)); 118 118 show_val_kb(m, "PageTables: ", 119 - global_page_state(NR_PAGETABLE)); 119 + global_zone_page_state(NR_PAGETABLE)); 120 120 #ifdef CONFIG_QUICKLIST 121 121 show_val_kb(m, "Quicklists: ", quicklist_total_size()); 122 122 #endif ··· 124 124 show_val_kb(m, "NFS_Unstable: ", 125 125 global_node_page_state(NR_UNSTABLE_NFS)); 126 126 show_val_kb(m, "Bounce: ", 127 - global_page_state(NR_BOUNCE)); 127 + global_zone_page_state(NR_BOUNCE)); 128 128 show_val_kb(m, "WritebackTmp: ", 129 129 global_node_page_state(NR_WRITEBACK_TEMP)); 130 130 show_val_kb(m, "CommitLimit: ", vm_commit_limit()); ··· 151 151 #ifdef CONFIG_CMA 152 152 show_val_kb(m, "CmaTotal: ", totalcma_pages); 153 153 show_val_kb(m, "CmaFree: ", 154 - global_page_state(NR_FREE_CMA_PAGES)); 154 + global_zone_page_state(NR_FREE_CMA_PAGES)); 155 155 #endif 156 156 157 157 hugetlb_report_meminfo(m);
+133 -60
fs/proc/task_mmu.c
··· 253 253 if (priv->mm) 254 254 mmdrop(priv->mm); 255 255 256 + kfree(priv->rollup); 256 257 return seq_release_private(inode, file); 257 258 } 258 259 ··· 280 279 vma->vm_end >= vma->vm_mm->start_stack; 281 280 } 282 281 282 + static void show_vma_header_prefix(struct seq_file *m, 283 + unsigned long start, unsigned long end, 284 + vm_flags_t flags, unsigned long long pgoff, 285 + dev_t dev, unsigned long ino) 286 + { 287 + seq_setwidth(m, 25 + sizeof(void *) * 6 - 1); 288 + seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ", 289 + start, 290 + end, 291 + flags & VM_READ ? 'r' : '-', 292 + flags & VM_WRITE ? 'w' : '-', 293 + flags & VM_EXEC ? 'x' : '-', 294 + flags & VM_MAYSHARE ? 's' : 'p', 295 + pgoff, 296 + MAJOR(dev), MINOR(dev), ino); 297 + } 298 + 283 299 static void 284 300 show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid) 285 301 { ··· 319 301 320 302 start = vma->vm_start; 321 303 end = vma->vm_end; 322 - 323 - seq_setwidth(m, 25 + sizeof(void *) * 6 - 1); 324 - seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ", 325 - start, 326 - end, 327 - flags & VM_READ ? 'r' : '-', 328 - flags & VM_WRITE ? 'w' : '-', 329 - flags & VM_EXEC ? 'x' : '-', 330 - flags & VM_MAYSHARE ? 's' : 'p',
331 - pgoff, 332 - MAJOR(dev), MINOR(dev), ino); 304 + show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino); 333 305 334 306 /* 335 307 * Print the dentry name for named mappings, and a ··· 438 430 439 431 #ifdef CONFIG_PROC_PAGE_MONITOR 440 432 struct mem_size_stats { 433 + bool first; 441 434 unsigned long resident; 442 435 unsigned long shared_clean; 443 436 unsigned long shared_dirty; ··· 452 443 unsigned long swap; 453 444 unsigned long shared_hugetlb; 454 445 unsigned long private_hugetlb; 446 + unsigned long first_vma_start; 455 447 u64 pss; 448 + u64 pss_locked; 456 449 u64 swap_pss; 457 450 bool check_shmem_swap; 458 451 }; ··· 663 652 [ilog2(VM_NORESERVE)] = "nr", 664 653 [ilog2(VM_HUGETLB)] = "ht", 665 654 [ilog2(VM_ARCH_1)] = "ar", 655 + [ilog2(VM_WIPEONFORK)] = "wf", 666 656 [ilog2(VM_DONTDUMP)] = "dd", 667 657 #ifdef CONFIG_MEM_SOFT_DIRTY 668 658 [ilog2(VM_SOFTDIRTY)] = "sd", ··· 731 719 732 720 static int show_smap(struct seq_file *m, void *v, int is_pid) 733 721 { 722 + struct proc_maps_private *priv = m->private; 734 723 struct vm_area_struct *vma = v; 735 - struct mem_size_stats mss; 724 + struct mem_size_stats mss_stack; 725 + struct mem_size_stats *mss; 736 726 struct mm_walk smaps_walk = { 737 727 .pmd_entry = smaps_pte_range, 738 728 #ifdef CONFIG_HUGETLB_PAGE 739 729 .hugetlb_entry = smaps_hugetlb_range, 740 730 #endif 741 731 .mm = vma->vm_mm, 742 - .private = &mss, 743 732 }; 733 + int ret = 0; 734 + bool rollup_mode; 735 + bool last_vma; 744 736 745 - memset(&mss, 0, sizeof mss); 737 + if (priv->rollup) { 738 + rollup_mode = true; 739 + mss = priv->rollup; 740 + if (mss->first) { 741 + mss->first_vma_start = vma->vm_start; 742 + mss->first = false; 743 + } 744 + last_vma = !m_next_vma(priv, vma); 745 + } else { 746 + rollup_mode = false; 747 + memset(&mss_stack, 0, sizeof(mss_stack)); 748 + mss = &mss_stack; 749 + } 750 + 751 + smaps_walk.private = mss; 746 752 747 753 #ifdef CONFIG_SHMEM
748 754 if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) { ··· 778 748 779 749 if (!shmem_swapped || (vma->vm_flags & VM_SHARED) || 780 750 !(vma->vm_flags & VM_WRITE)) { 781 - mss.swap = shmem_swapped; 751 + mss->swap = shmem_swapped; 782 752 } else { 783 - mss.check_shmem_swap = true; 753 + mss->check_shmem_swap = true; 784 754 smaps_walk.pte_hole = smaps_pte_hole; 785 755 } 786 756 } ··· 788 758 789 759 /* mmap_sem is held in m_start */ 790 760 walk_page_vma(vma, &smaps_walk); 761 + if (vma->vm_flags & VM_LOCKED) 762 + mss->pss_locked += mss->pss; 791 763 792 - show_map_vma(m, vma, is_pid); 764 + if (!rollup_mode) { 765 + show_map_vma(m, vma, is_pid); 766 + } else if (last_vma) { 767 + show_vma_header_prefix( 768 + m, mss->first_vma_start, vma->vm_end, 0, 0, 0, 0); 769 + seq_pad(m, ' '); 770 + seq_puts(m, "[rollup]\n"); 771 + } else { 772 + ret = SEQ_SKIP; 773 + } 793 774 794 - seq_printf(m, 795 - "Size: %8lu kB\n" 796 - "Rss: %8lu kB\n" 797 - "Pss: %8lu kB\n" 798 - "Shared_Clean: %8lu kB\n" 799 - "Shared_Dirty: %8lu kB\n" 800 - "Private_Clean: %8lu kB\n" 801 - "Private_Dirty: %8lu kB\n" 802 - "Referenced: %8lu kB\n" 803 - "Anonymous: %8lu kB\n" 804 - "LazyFree: %8lu kB\n" 805 - "AnonHugePages: %8lu kB\n" 806 - "ShmemPmdMapped: %8lu kB\n" 807 - "Shared_Hugetlb: %8lu kB\n" 808 - "Private_Hugetlb: %7lu kB\n" 809 - "Swap: %8lu kB\n" 810 - "SwapPss: %8lu kB\n" 811 - "KernelPageSize: %8lu kB\n" 812 - "MMUPageSize: %8lu kB\n" 813 - "Locked: %8lu kB\n", 814 - (vma->vm_end - vma->vm_start) >> 10, 815 - mss.resident >> 10, 816 - (unsigned long)(mss.pss >> (10 + PSS_SHIFT)), 817 - mss.shared_clean >> 10, 818 - mss.shared_dirty >> 10, 819 - mss.private_clean >> 10, 820 - mss.private_dirty >> 10, 821 - mss.referenced >> 10, 822 - mss.anonymous >> 10, 823 - mss.lazyfree >> 10, 824 - mss.anonymous_thp >> 10, 825 - mss.shmem_thp >> 10, 826 - mss.shared_hugetlb >> 10, 827 - mss.private_hugetlb >> 10, 828 - mss.swap >> 10, 829 - (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
830 - vma_kernel_pagesize(vma) >> 10, 831 - vma_mmu_pagesize(vma) >> 10, 832 - (vma->vm_flags & VM_LOCKED) ? 833 - (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0); 775 + if (!rollup_mode) 776 + seq_printf(m, 777 + "Size: %8lu kB\n" 778 + "KernelPageSize: %8lu kB\n" 779 + "MMUPageSize: %8lu kB\n", 780 + (vma->vm_end - vma->vm_start) >> 10, 781 + vma_kernel_pagesize(vma) >> 10, 782 + vma_mmu_pagesize(vma) >> 10); 834 783 835 - arch_show_smap(m, vma); 836 - show_smap_vma_flags(m, vma); 784 + 785 + if (!rollup_mode || last_vma) 786 + seq_printf(m, 787 + "Rss: %8lu kB\n" 788 + "Pss: %8lu kB\n" 789 + "Shared_Clean: %8lu kB\n" 790 + "Shared_Dirty: %8lu kB\n" 791 + "Private_Clean: %8lu kB\n" 792 + "Private_Dirty: %8lu kB\n" 793 + "Referenced: %8lu kB\n" 794 + "Anonymous: %8lu kB\n" 795 + "LazyFree: %8lu kB\n" 796 + "AnonHugePages: %8lu kB\n" 797 + "ShmemPmdMapped: %8lu kB\n" 798 + "Shared_Hugetlb: %8lu kB\n" 799 + "Private_Hugetlb: %7lu kB\n" 800 + "Swap: %8lu kB\n" 801 + "SwapPss: %8lu kB\n" 802 + "Locked: %8lu kB\n", 803 + mss->resident >> 10, 804 + (unsigned long)(mss->pss >> (10 + PSS_SHIFT)), 805 + mss->shared_clean >> 10, 806 + mss->shared_dirty >> 10, 807 + mss->private_clean >> 10, 808 + mss->private_dirty >> 10, 809 + mss->referenced >> 10, 810 + mss->anonymous >> 10, 811 + mss->lazyfree >> 10, 812 + mss->anonymous_thp >> 10, 813 + mss->shmem_thp >> 10, 814 + mss->shared_hugetlb >> 10, 815 + mss->private_hugetlb >> 10, 816 + mss->swap >> 10, 817 + (unsigned long)(mss->swap_pss >> (10 + PSS_SHIFT)), 818 + (unsigned long)(mss->pss >> (10 + PSS_SHIFT))); 819 + 820 + if (!rollup_mode) { 821 + arch_show_smap(m, vma); 822 + show_smap_vma_flags(m, vma); 823 + } 837 824 m_cache_vma(m, vma); 838 - return 0; 825 + return ret; 839 826 } 840 827 841 828 static int show_pid_smap(struct seq_file *m, void *v) ··· 884 837 return do_maps_open(inode, file, &proc_pid_smaps_op); 885 838 } 886 839 840 + static int pid_smaps_rollup_open(struct inode *inode, struct file *file) 841 + {
842 + struct seq_file *seq; 843 + struct proc_maps_private *priv; 844 + int ret = do_maps_open(inode, file, &proc_pid_smaps_op); 845 + 846 + if (ret < 0) 847 + return ret; 848 + seq = file->private_data; 849 + priv = seq->private; 850 + priv->rollup = kzalloc(sizeof(*priv->rollup), GFP_KERNEL); 851 + if (!priv->rollup) { 852 + proc_map_release(inode, file); 853 + return -ENOMEM; 854 + } 855 + priv->rollup->first = true; 856 + return 0; 857 + } 858 + 887 859 static int tid_smaps_open(struct inode *inode, struct file *file) 888 860 { 889 861 return do_maps_open(inode, file, &proc_tid_smaps_op); ··· 910 844 911 845 const struct file_operations proc_pid_smaps_operations = { 912 846 .open = pid_smaps_open, 847 + .read = seq_read, 848 + .llseek = seq_lseek, 849 + .release = proc_map_release, 850 + }; 851 + 852 + const struct file_operations proc_pid_smaps_rollup_operations = { 853 + .open = pid_smaps_rollup_open, 913 854 .read = seq_read, 914 855 .llseek = seq_lseek, 915 856 .release = proc_map_release,
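The smaps_rollup path above allocates a single mem_size_stats in pid_smaps_rollup_open(), keeps summing into it on every show_smap() call instead of zeroing it per VMA, and prints it once at the last VMA. A minimal userspace sketch of that accumulation pattern — the struct and field names here are simplified stand-ins, not the kernel's full mem_size_stats:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's mem_size_stats; only a few of
 * the real fields are modeled. */
struct vma_stats {
	unsigned long resident;
	unsigned long private_dirty;
	unsigned long swap;
};

/* Add one VMA's counters into the rollup, the way show_smap() keeps
 * summing into priv->rollup instead of resetting the struct per VMA. */
static void rollup_add(struct vma_stats *rollup, const struct vma_stats *vma)
{
	rollup->resident += vma->resident;
	rollup->private_dirty += vma->private_dirty;
	rollup->swap += vma->swap;
}

/* Sum an array of per-VMA stats into a single "[rollup]" entry. */
static struct vma_stats rollup_all(const struct vma_stats *vmas, size_t n)
{
	struct vma_stats total = {0, 0, 0};
	size_t i;

	for (i = 0; i < n; i++)
		rollup_add(&total, &vmas[i]);
	return total;
}
```

The payoff of doing the sum in the kernel is one read() and one page-table walk per process instead of parsing every smaps entry in userspace.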
+1 -1
fs/ramfs/file-nommu.c
··· 228 228 if (!pages) 229 229 goto out_free; 230 230 231 - nr = find_get_pages(inode->i_mapping, pgoff, lpages, pages); 231 + nr = find_get_pages(inode->i_mapping, &pgoff, lpages, pages); 232 232 if (nr != lpages) 233 233 goto out_free_pages; /* leave if some pages were missing */ 234 234
-5
fs/sync.c
··· 335 335 goto out_put; 336 336 337 337 mapping = f.file->f_mapping; 338 - if (!mapping) { 339 - ret = -EINVAL; 340 - goto out_put; 341 - } 342 - 343 338 ret = 0; 344 339 if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) { 345 340 ret = file_fdatawait_range(f.file, offset, endbyte);
+14 -7
fs/userfaultfd.c
··· 178 178 179 179 static inline struct uffd_msg userfault_msg(unsigned long address, 180 180 unsigned int flags, 181 - unsigned long reason) 181 + unsigned long reason, 182 + unsigned int features) 182 183 { 183 184 struct uffd_msg msg; 184 185 msg_init(&msg); ··· 203 202 * write protect fault. 204 203 */ 205 204 msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP; 205 + if (features & UFFD_FEATURE_THREAD_ID) 206 + msg.arg.pagefault.feat.ptid = task_pid_vnr(current); 206 207 return msg; 207 208 } 208 209 ··· 373 370 VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP)); 374 371 VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP)); 375 372 373 + if (ctx->features & UFFD_FEATURE_SIGBUS) 374 + goto out; 375 + 376 376 /* 377 377 * If it's already released don't get it. This avoids to loop 378 378 * in __get_user_pages if userfaultfd_release waits on the ··· 425 419 426 420 init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function); 427 421 uwq.wq.private = current; 428 - uwq.msg = userfault_msg(vmf->address, vmf->flags, reason); 422 + uwq.msg = userfault_msg(vmf->address, vmf->flags, reason, 423 + ctx->features); 429 424 uwq.ctx = ctx; 430 425 uwq.waken = false; 431 426 ··· 1201 1194 struct uffdio_register __user *user_uffdio_register; 1202 1195 unsigned long vm_flags, new_flags; 1203 1196 bool found; 1204 - bool non_anon_pages; 1197 + bool basic_ioctls; 1205 1198 unsigned long start, end, vma_end; 1206 1199 1207 1200 user_uffdio_register = (struct uffdio_register __user *) arg; ··· 1267 1260 * Search for not compatible vmas. 
1268 1261 */ 1269 1262 found = false; 1270 - non_anon_pages = false; 1263 + basic_ioctls = false; 1271 1264 for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) { 1272 1265 cond_resched(); 1273 1266 ··· 1306 1299 /* 1307 1300 * Note vmas containing huge pages 1308 1301 */ 1309 - if (is_vm_hugetlb_page(cur) || vma_is_shmem(cur)) 1310 - non_anon_pages = true; 1302 + if (is_vm_hugetlb_page(cur)) 1303 + basic_ioctls = true; 1311 1304 1312 1305 found = true; 1313 1306 } ··· 1378 1371 * userland which ioctls methods are guaranteed to 1379 1372 * succeed on this range. 1380 1373 */ 1381 - if (put_user(non_anon_pages ? UFFD_API_RANGE_IOCTLS_BASIC : 1374 + if (put_user(basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC : 1382 1375 UFFD_API_RANGE_IOCTLS, 1383 1376 &user_uffdio_register->ioctls)) 1384 1377 ret = -EFAULT;
+1 -1
fs/xfs/xfs_file.c
··· 1101 1101 if (vmf->pgoff >= size) 1102 1102 ret = VM_FAULT_SIGBUS; 1103 1103 else if (IS_DAX(inode)) 1104 - ret = dax_pfn_mkwrite(vmf); 1104 + ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops); 1105 1105 xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); 1106 1106 sb_end_pagefault(inode->i_sb); 1107 1107 return ret;
+8
include/linux/bio.h
··· 38 38 #define BIO_BUG_ON 39 39 #endif 40 40 41 + #ifdef CONFIG_THP_SWAP 42 + #if HPAGE_PMD_NR > 256 43 + #define BIO_MAX_PAGES HPAGE_PMD_NR 44 + #else 41 45 #define BIO_MAX_PAGES 256 46 + #endif 47 + #else 48 + #define BIO_MAX_PAGES 256 49 + #endif 42 50 43 51 #define bio_prio(bio) (bio)->bi_ioprio 44 52 #define bio_set_prio(bio, prio) ((bio)->bi_ioprio = prio)
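With CONFIG_THP_SWAP, the bio.h change above widens BIO_MAX_PAGES to HPAGE_PMD_NR whenever that exceeds the historical limit of 256, so a whole transparent huge page can be written out in a single bio on swapout. The preprocessor logic reduces to max(HPAGE_PMD_NR, 256); a plain-function sketch of the same selection (the numeric values below are illustrative):

```c
#include <assert.h>

/* BIO_MAX_PAGES selection as a runtime function: with THP swap the
 * limit becomes max(hpage_pmd_nr, 256) so one bio can span a whole
 * huge page; otherwise the historical 256 is kept.  In the kernel
 * this is decided at preprocessor time. */
static unsigned long bio_max_pages(int thp_swap, unsigned long hpage_pmd_nr)
{
	const unsigned long base = 256;	/* default BIO_MAX_PAGES */

	if (thp_swap && hpage_pmd_nr > base)
		return hpage_pmd_nr;
	return base;
}
```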
-45
include/linux/dax.h
··· 89 89 void dax_write_cache(struct dax_device *dax_dev, bool wc); 90 90 bool dax_write_cache_enabled(struct dax_device *dax_dev); 91 91 92 - /* 93 - * We use lowest available bit in exceptional entry for locking, one bit for 94 - * the entry size (PMD) and two more to tell us if the entry is a huge zero 95 - * page (HZP) or an empty entry that is just used for locking. In total four 96 - * special bits. 97 - * 98 - * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the HZP and 99 - * EMPTY bits aren't set the entry is a normal DAX entry with a filesystem 100 - * block allocation. 101 - */ 102 - #define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 4) 103 - #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT) 104 - #define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1)) 105 - #define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2)) 106 - #define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)) 107 - 108 - static inline unsigned long dax_radix_sector(void *entry) 109 - { 110 - return (unsigned long)entry >> RADIX_DAX_SHIFT; 111 - } 112 - 113 - static inline void *dax_radix_locked_entry(sector_t sector, unsigned long flags) 114 - { 115 - return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags | 116 - ((unsigned long)sector << RADIX_DAX_SHIFT) | 117 - RADIX_DAX_ENTRY_LOCK); 118 - } 119 - 120 92 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter, 121 93 const struct iomap_ops *ops); 122 94 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, ··· 96 124 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index); 97 125 int dax_invalidate_mapping_entry_sync(struct address_space *mapping, 98 126 pgoff_t index); 99 - void dax_wake_mapping_entry_waiter(struct address_space *mapping, 100 - pgoff_t index, void *entry, bool wake_all); 101 127 102 128 #ifdef CONFIG_FS_DAX 103 129 int __dax_zero_page_range(struct block_device *bdev, ··· 109 139 return -ENXIO; 110 140 } 111 141 #endif 
112 - 113 - #ifdef CONFIG_FS_DAX_PMD 114 - static inline unsigned int dax_radix_order(void *entry) 115 - { 116 - if ((unsigned long)entry & RADIX_DAX_PMD) 117 - return PMD_SHIFT - PAGE_SHIFT; 118 - return 0; 119 - } 120 - #else 121 - static inline unsigned int dax_radix_order(void *entry) 122 - { 123 - return 0; 124 - } 125 - #endif 126 - int dax_pfn_mkwrite(struct vm_fault *vmf); 127 142 128 143 static inline bool dax_mapping(struct address_space *mapping) 129 144 {
-2
include/linux/fs.h
··· 1269 1269 extern pid_t f_getown(struct file *filp); 1270 1270 extern int send_sigurg(struct fown_struct *fown); 1271 1271 1272 - struct mm_struct; 1273 - 1274 1272 /* 1275 1273 * Umount options 1276 1274 */
-9
include/linux/fscache.h
··· 143 143 void (*mark_page_cached)(void *cookie_netfs_data, 144 144 struct address_space *mapping, 145 145 struct page *page); 146 - 147 - /* indicate the cookie is no longer cached 148 - * - this function is called when the backing store currently caching 149 - * a cookie is removed 150 - * - the netfs should use this to clean up any markers indicating 151 - * cached pages 152 - * - this is mandatory for any object that may have data 153 - */ 154 - void (*now_uncached)(void *cookie_netfs_data); 155 146 }; 156 147 157 148 /*
+32 -20
include/linux/memcontrol.h
··· 488 488 void __unlock_page_memcg(struct mem_cgroup *memcg); 489 489 void unlock_page_memcg(struct page *page); 490 490 491 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 491 492 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, 492 - enum memcg_stat_item idx) 493 + int idx) 493 494 { 494 495 long val = 0; 495 496 int cpu; ··· 504 503 return val; 505 504 } 506 505 506 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 507 507 static inline void __mod_memcg_state(struct mem_cgroup *memcg, 508 - enum memcg_stat_item idx, int val) 508 + int idx, int val) 509 509 { 510 510 if (!mem_cgroup_disabled()) 511 511 __this_cpu_add(memcg->stat->count[idx], val); 512 512 } 513 513 514 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 514 515 static inline void mod_memcg_state(struct mem_cgroup *memcg, 515 - enum memcg_stat_item idx, int val) 516 + int idx, int val) 516 517 { 517 518 if (!mem_cgroup_disabled()) 518 519 this_cpu_add(memcg->stat->count[idx], val); ··· 538 535 * Kernel pages are an exception to this, since they'll never move. 
539 536 */ 540 537 static inline void __mod_memcg_page_state(struct page *page, 541 - enum memcg_stat_item idx, int val) 538 + int idx, int val) 542 539 { 543 540 if (page->mem_cgroup) 544 541 __mod_memcg_state(page->mem_cgroup, idx, val); 545 542 } 546 543 547 544 static inline void mod_memcg_page_state(struct page *page, 548 - enum memcg_stat_item idx, int val) 545 + int idx, int val) 549 546 { 550 547 if (page->mem_cgroup) 551 548 mod_memcg_state(page->mem_cgroup, idx, val); ··· 635 632 this_cpu_add(memcg->stat->events[idx], count); 636 633 } 637 634 635 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 638 636 static inline void count_memcg_page_event(struct page *page, 639 - enum memcg_stat_item idx) 637 + int idx) 640 638 { 641 639 if (page->mem_cgroup) 642 640 count_memcg_events(page->mem_cgroup, idx, 1); ··· 850 846 } 851 847 852 848 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, 853 - enum memcg_stat_item idx) 849 + int idx) 854 850 { 855 851 return 0; 856 852 } 857 853 858 854 static inline void __mod_memcg_state(struct mem_cgroup *memcg, 859 - enum memcg_stat_item idx, 855 + int idx, 860 856 int nr) 861 857 { 862 858 } 863 859 864 860 static inline void mod_memcg_state(struct mem_cgroup *memcg, 865 - enum memcg_stat_item idx, 861 + int idx, 866 862 int nr) 867 863 { 868 864 } 869 865 870 866 static inline void __mod_memcg_page_state(struct page *page, 871 - enum memcg_stat_item idx, 867 + int idx, 872 868 int nr) 873 869 { 874 870 } 875 871 876 872 static inline void mod_memcg_page_state(struct page *page, 877 - enum memcg_stat_item idx, 873 + int idx, 878 874 int nr) 879 875 { 880 876 } ··· 928 924 } 929 925 930 926 static inline void count_memcg_page_event(struct page *page, 931 - enum memcg_stat_item idx) 927 + int idx) 932 928 { 933 929 } 934 930 ··· 938 934 } 939 935 #endif /* CONFIG_MEMCG */ 940 936 937 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 941 938 static inline void __inc_memcg_state(struct mem_cgroup *memcg,
942 - enum memcg_stat_item idx) 939 + int idx) 943 940 { 944 941 __mod_memcg_state(memcg, idx, 1); 945 942 } 946 943 944 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 947 945 static inline void __dec_memcg_state(struct mem_cgroup *memcg, 948 - enum memcg_stat_item idx) 946 + int idx) 949 947 { 950 948 __mod_memcg_state(memcg, idx, -1); 951 949 } 952 950 951 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 953 952 static inline void __inc_memcg_page_state(struct page *page, 954 - enum memcg_stat_item idx) 953 + int idx) 955 954 { 956 955 __mod_memcg_page_state(page, idx, 1); 957 956 } 958 957 958 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 959 959 static inline void __dec_memcg_page_state(struct page *page, 960 - enum memcg_stat_item idx) 960 + int idx) 961 961 { 962 962 __mod_memcg_page_state(page, idx, -1); 963 963 } ··· 990 982 __mod_lruvec_page_state(page, idx, -1); 991 983 } 992 984 985 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 993 986 static inline void inc_memcg_state(struct mem_cgroup *memcg, 994 - enum memcg_stat_item idx) 987 + int idx) 995 988 { 996 989 mod_memcg_state(memcg, idx, 1); 997 990 } 998 991 992 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 999 993 static inline void dec_memcg_state(struct mem_cgroup *memcg, 1000 - enum memcg_stat_item idx) 994 + int idx) 1001 995 { 1002 996 mod_memcg_state(memcg, idx, -1); 1003 997 } 1004 998 999 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 1005 1000 static inline void inc_memcg_page_state(struct page *page, 1006 - enum memcg_stat_item idx) 1001 + int idx) 1007 1002 { 1008 1003 mod_memcg_page_state(page, idx, 1); 1009 1004 } 1010 1005 1006 + /* idx can be of type enum memcg_stat_item or node_stat_item */ 1011 1007 static inline void dec_memcg_page_state(struct page *page, 1012 - enum memcg_stat_item idx) 1008 + int idx) 1013 1009 { 1014 1010 mod_memcg_page_state(page, idx, -1);
1015 1011 }
+1 -1
include/linux/memory_hotplug.h
··· 319 319 unsigned long pnum); 320 320 extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages, 321 321 int online_type); 322 - extern struct zone *default_zone_for_pfn(int nid, unsigned long pfn, 322 + extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, 323 323 unsigned long nr_pages); 324 324 #endif /* __LINUX_MEMORY_HOTPLUG_H */
+10 -4
include/linux/mm.h
··· 189 189 #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ 190 190 #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ 191 191 #define VM_ARCH_1 0x01000000 /* Architecture-specific flag */ 192 - #define VM_ARCH_2 0x02000000 192 + #define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */ 193 193 #define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */ 194 194 195 195 #ifdef CONFIG_MEM_SOFT_DIRTY ··· 208 208 #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */ 209 209 #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ 210 210 #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ 211 + #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ 211 212 #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) 212 213 #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) 213 214 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) 214 215 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) 216 + #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) 215 217 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ 216 218 217 219 #if defined(CONFIG_X86) ··· 237 235 # define VM_MAPPED_COPY VM_ARCH_1 /* T if mapped copy of data (nommu mmap) */ 238 236 #endif 239 237 240 - #if defined(CONFIG_X86) 238 + #if defined(CONFIG_X86_INTEL_MPX) 241 239 /* MPX specific bounds table or bounds directory */ 242 - # define VM_MPX VM_ARCH_2 240 + # define VM_MPX VM_HIGH_ARCH_BIT_4 241 + #else 242 + # define VM_MPX VM_NONE 243 243 #endif 244 244 245 245 #ifndef VM_GROWSUP ··· 2298 2294 unsigned long pfn, pgprot_t pgprot); 2299 2295 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, 2300 2296 pfn_t pfn); 2297 + int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr, 2298 + pfn_t pfn); 2301 2299 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); 2302 2300 2303 2301 ··· 2512 2506 2513 2507 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
2514 2508 extern void clear_huge_page(struct page *page, 2515 - unsigned long addr, 2509 + unsigned long addr_hint, 2516 2510 unsigned int pages_per_huge_page); 2517 2511 extern void copy_user_huge_page(struct page *dst, struct page *src, 2518 2512 unsigned long addr, struct vm_area_struct *vma,
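The addr/addr_hint rename in clear_huge_page() supports the "clear target sub-page last" patch in this series: the sub-pages of a huge page are cleared starting with those farthest from the faulting address and ending with the target sub-page, so the data the task touches right after the fault is still hot in cache. A userspace sketch of one such ordering — the kernel's actual iteration differs in detail, and the function name here is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Produce a clearing order for n sub-pages such that `target` (the
 * sub-page containing the faulting address) is cleared last.  This
 * walks inward from whichever end is currently farther from the
 * target, mimicking the far-to-near idea behind the patch. */
static void clear_order(size_t n, size_t target, size_t *order)
{
	size_t l = 0, r = n - 1, k = 0;

	while (l < target || r > target) {
		if (target - l >= r - target)	/* left end is farther */
			order[k++] = l++;
		else
			order[k++] = r--;
	}
	order[k] = target;			/* cleared last: stays hot */
}
```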
+1
include/linux/mm_types.h
··· 335 335 struct file * vm_file; /* File we map to (can be NULL). */ 336 336 void * vm_private_data; /* was vm_pte (shared mem) */ 337 337 338 + atomic_long_t swap_readahead_info; 338 339 #ifndef CONFIG_MMU 339 340 struct vm_region *vm_region; /* NOMMU mapping region */ 340 341 #endif
+2 -3
include/linux/mmzone.h
··· 770 770 771 771 #include <linux/memory_hotplug.h> 772 772 773 - extern struct mutex zonelists_mutex; 774 - void build_all_zonelists(pg_data_t *pgdat, struct zone *zone); 773 + void build_all_zonelists(pg_data_t *pgdat); 775 774 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx); 776 775 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, 777 776 int classzone_idx, unsigned int alloc_flags, ··· 895 896 extern int numa_zonelist_order_handler(struct ctl_table *, int, 896 897 void __user *, size_t *, loff_t *); 897 898 extern char numa_zonelist_order[]; 898 - #define NUMA_ZONELIST_ORDER_LEN 16 /* string buffer size */ 899 + #define NUMA_ZONELIST_ORDER_LEN 16 899 900 900 901 #ifndef CONFIG_NEED_MULTIPLE_NODES 901 902
+2 -2
include/linux/page-flags.h
··· 303 303 * Only test-and-set exist for PG_writeback. The unconditional operators are 304 304 * risky: they bypass page accounting. 305 305 */ 306 - TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND) 307 - TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND) 306 + TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL) 307 + TESTSCFLAG(Writeback, writeback, PF_NO_TAIL) 308 308 PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL) 309 309 310 310 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
+10 -2
include/linux/pagemap.h
··· 353 353 unsigned find_get_entries(struct address_space *mapping, pgoff_t start, 354 354 unsigned int nr_entries, struct page **entries, 355 355 pgoff_t *indices); 356 - unsigned find_get_pages(struct address_space *mapping, pgoff_t start, 357 - unsigned int nr_pages, struct page **pages); 356 + unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start, 357 + pgoff_t end, unsigned int nr_pages, 358 + struct page **pages); 359 + static inline unsigned find_get_pages(struct address_space *mapping, 360 + pgoff_t *start, unsigned int nr_pages, 361 + struct page **pages) 362 + { 363 + return find_get_pages_range(mapping, start, (pgoff_t)-1, nr_pages, 364 + pages); 365 + } 358 366 unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, 359 367 unsigned int nr_pages, struct page **pages); 360 368 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
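find_get_pages() is now a thin wrapper around the new find_get_pages_range() with end = (pgoff_t)-1, and the callee advances *start past the last page it returned, so callers can walk an index range in fixed-size batches without tracking position themselves. A simplified userspace model of that contract, using a sorted array of populated page indices in place of the page-cache radix tree (names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of find_get_pages_range(): "mapping" is a sorted
 * array of populated page indices.  Fill at most nr_pages entries with
 * indices in [*start, end] and advance *start past the last one
 * returned, so the next call resumes where this one stopped. */
static size_t find_pages_range(const unsigned long *idx, size_t n,
			       unsigned long *start, unsigned long end,
			       size_t nr_pages, unsigned long *out)
{
	size_t found = 0;
	size_t i;

	for (i = 0; i < n && found < nr_pages; i++) {
		if (idx[i] >= *start && idx[i] <= end)
			out[found++] = idx[i];
	}
	if (found)
		*start = out[found - 1] + 1;	/* resume point */
	return found;
}
```

A caller loops until the function returns 0, exactly as pagevec users loop on pagevec_lookup_range() in the companion pagevec.h change.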
+10 -2
include/linux/pagevec.h
··· 27 27 pgoff_t start, unsigned nr_entries, 28 28 pgoff_t *indices); 29 29 void pagevec_remove_exceptionals(struct pagevec *pvec); 30 - unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, 31 - pgoff_t start, unsigned nr_pages); 30 + unsigned pagevec_lookup_range(struct pagevec *pvec, 31 + struct address_space *mapping, 32 + pgoff_t *start, pgoff_t end); 33 + static inline unsigned pagevec_lookup(struct pagevec *pvec, 34 + struct address_space *mapping, 35 + pgoff_t *start) 36 + { 37 + return pagevec_lookup_range(pvec, mapping, start, (pgoff_t)-1); 38 + } 39 + 32 40 unsigned pagevec_lookup_tag(struct pagevec *pvec, 33 41 struct address_space *mapping, pgoff_t *index, int tag, 34 42 unsigned nr_pages);
-6
include/linux/sched/mm.h
··· 84 84 85 85 /* mmput gets rid of the mappings and all user-space */ 86 86 extern void mmput(struct mm_struct *); 87 - #ifdef CONFIG_MMU 88 - /* same as above but performs the slow path from the async context. Can 89 - * be called from the atomic context as well 90 - */ 91 - extern void mmput_async(struct mm_struct *); 92 - #endif 93 87 94 88 /* Grab a reference to a task's mm, if it is not already going away */ 95 89 extern struct mm_struct *get_task_mm(struct task_struct *task);
-17
include/linux/shm.h
··· 27 27 /* shm_mode upper byte flags */ 28 28 #define SHM_DEST 01000 /* segment will be destroyed on last detach */ 29 29 #define SHM_LOCKED 02000 /* segment will not be swapped */ 30 - #define SHM_HUGETLB 04000 /* segment will use huge TLB pages */ 31 - #define SHM_NORESERVE 010000 /* don't check for reservations */ 32 - 33 - /* Bits [26:31] are reserved */ 34 - 35 - /* 36 - * When SHM_HUGETLB is set bits [26:31] encode the log2 of the huge page size. 37 - * This gives us 6 bits, which is enough until someone invents 128 bit address 38 - * spaces. 39 - * 40 - * Assume these are all power of twos. 41 - * When 0 use the default page size. 42 - */ 43 - #define SHM_HUGE_SHIFT 26 44 - #define SHM_HUGE_MASK 0x3f 45 - #define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT) 46 - #define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT) 47 30 48 31 #ifdef CONFIG_SYSVIPC 49 32 struct sysv_shm {
+6
include/linux/shmem_fs.h
··· 137 137 unsigned long dst_addr, 138 138 unsigned long src_addr, 139 139 struct page **pagep); 140 + extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, 141 + pmd_t *dst_pmd, 142 + struct vm_area_struct *dst_vma, 143 + unsigned long dst_addr); 140 144 #else 141 145 #define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \ 142 146 src_addr, pagep) ({ BUG(); 0; }) 147 + #define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \ 148 + dst_addr) ({ BUG(); 0; }) 143 149 #endif 144 150 145 151 #endif
+7
include/linux/shrinker.h
··· 18 18 */ 19 19 unsigned long nr_to_scan; 20 20 21 + /* 22 + * How many objects did scan_objects process? 23 + * This defaults to nr_to_scan before every call, but the callee 24 + * should track its actual progress. 25 + */ 26 + unsigned long nr_scanned; 27 + 21 28 /* current node being shrunk (for NUMA aware shrinkers) */ 22 29 int nid; 23 30
+4
include/linux/slub_def.h
··· 115 115 #endif 116 116 #endif 117 117 118 + #ifdef CONFIG_SLAB_FREELIST_HARDENED 119 + unsigned long random; 120 + #endif 121 + 118 122 #ifdef CONFIG_NUMA 119 123 /* 120 124 * Defragmentation by allocating from a remote node.
+71 -7
include/linux/swap.h
··· 188 188 }; 189 189 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ 190 190 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ 191 + #define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */ 191 192 192 193 /* 193 194 * We assign a cluster to each CPU, so each CPU can allocate swap entry from ··· 212 211 unsigned long flags; /* SWP_USED etc: see above */ 213 212 signed short prio; /* swap priority of this type */ 214 213 struct plist_node list; /* entry in swap_active_head */ 215 - struct plist_node avail_list; /* entry in swap_avail_head */ 214 + struct plist_node avail_lists[MAX_NUMNODES];/* entry in swap_avail_heads */ 216 215 signed char type; /* strange name for an index */ 217 216 unsigned int max; /* extent of the swap_map */ 218 217 unsigned char *swap_map; /* vmalloc'ed array of usage counts */ ··· 251 250 struct swap_cluster_list discard_clusters; /* discard clusters list */ 252 251 }; 253 252 253 + #ifdef CONFIG_64BIT 254 + #define SWAP_RA_ORDER_CEILING 5 255 + #else 256 + /* Avoid stack overflow, because we need to save part of page table */ 257 + #define SWAP_RA_ORDER_CEILING 3 258 + #define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING) 259 + #endif 260 + 261 + struct vma_swap_readahead { 262 + unsigned short win; 263 + unsigned short offset; 264 + unsigned short nr_pte; 265 + #ifdef CONFIG_64BIT 266 + pte_t *ptes; 267 + #else 268 + pte_t ptes[SWAP_RA_PTE_CACHE_SIZE]; 269 + #endif 270 + }; 271 + 254 272 /* linux/mm/workingset.c */ 255 273 void *workingset_eviction(struct address_space *mapping, struct page *page); 256 274 bool workingset_refault(void *shadow); ··· 282 262 extern unsigned long nr_free_buffer_pages(void); 283 263 extern unsigned long nr_free_pagecache_pages(void); 284 264 285 - /* Definition of global_page_state not available yet */ 286 - #define nr_free_pages() global_page_state(NR_FREE_PAGES) 265 + /* Definition of global_zone_page_state not available yet */ 266 + #define 
nr_free_pages() global_zone_page_state(NR_FREE_PAGES) 287 267 288 268 289 269 /* linux/mm/swap.c */ ··· 369 349 #define SWAP_ADDRESS_SPACE_SHIFT 14 370 350 #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) 371 351 extern struct address_space *swapper_spaces[]; 352 + extern bool swap_vma_readahead; 372 353 #define swap_address_space(entry) \ 373 354 (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ 374 355 >> SWAP_ADDRESS_SPACE_SHIFT]) ··· 382 361 extern void delete_from_swap_cache(struct page *); 383 362 extern void free_page_and_swap_cache(struct page *); 384 363 extern void free_pages_and_swap_cache(struct page **, int); 385 - extern struct page *lookup_swap_cache(swp_entry_t); 364 + extern struct page *lookup_swap_cache(swp_entry_t entry, 365 + struct vm_area_struct *vma, 366 + unsigned long addr); 386 367 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, 387 368 struct vm_area_struct *vma, unsigned long addr, 388 369 bool do_poll); ··· 394 371 extern struct page *swapin_readahead(swp_entry_t, gfp_t, 395 372 struct vm_area_struct *vma, unsigned long addr); 396 373 374 + extern struct page *swap_readahead_detect(struct vm_fault *vmf, 375 + struct vma_swap_readahead *swap_ra); 376 + extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask, 377 + struct vm_fault *vmf, 378 + struct vma_swap_readahead *swap_ra); 379 + 397 380 /* linux/mm/swapfile.c */ 398 381 extern atomic_long_t nr_swap_pages; 399 382 extern long total_swap_pages; 383 + extern atomic_t nr_rotate_swap; 400 384 extern bool has_usable_swap(void); 385 + 386 + static inline bool swap_use_vma_readahead(void) 387 + { 388 + return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap); 389 + } 401 390 402 391 /* Swap 50% full? Release swapcache more aggressively.. 
*/ 403 392 static inline bool vm_swap_full(void) ··· 500 465 return NULL; 501 466 } 502 467 468 + static inline bool swap_use_vma_readahead(void) 469 + { 470 + return false; 471 + } 472 + 473 + static inline struct page *swap_readahead_detect( 474 + struct vm_fault *vmf, struct vma_swap_readahead *swap_ra) 475 + { 476 + return NULL; 477 + } 478 + 479 + static inline struct page *do_swap_page_readahead( 480 + swp_entry_t fentry, gfp_t gfp_mask, 481 + struct vm_fault *vmf, struct vma_swap_readahead *swap_ra) 482 + { 483 + return NULL; 484 + } 485 + 503 486 static inline int swap_writepage(struct page *p, struct writeback_control *wbc) 504 487 { 505 488 return 0; 506 489 } 507 490 508 - static inline struct page *lookup_swap_cache(swp_entry_t swp) 491 + static inline struct page *lookup_swap_cache(swp_entry_t swp, 492 + struct vm_area_struct *vma, 493 + unsigned long addr) 509 494 { 510 495 return NULL; 511 496 } ··· 564 509 return 0; 565 510 } 566 511 567 - #define reuse_swap_page(page, total_mapcount) \ 568 - (page_trans_huge_mapcount(page, total_mapcount) == 1) 512 + #define reuse_swap_page(page, total_map_swapcount) \ 513 + (page_trans_huge_mapcount(page, total_map_swapcount) == 1) 569 514 570 515 static inline int try_to_free_swap(struct page *page) 571 516 { ··· 580 525 } 581 526 582 527 #endif /* CONFIG_SWAP */ 528 + 529 + #ifdef CONFIG_THP_SWAP 530 + extern int split_swap_cluster(swp_entry_t entry); 531 + #else 532 + static inline int split_swap_cluster(swp_entry_t entry) 533 + { 534 + return 0; 535 + } 536 + #endif 583 537 584 538 #ifdef CONFIG_MEMCG 585 539 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
+6
include/linux/vm_event_item.h
··· 85 85 #endif 86 86 THP_ZERO_PAGE_ALLOC, 87 87 THP_ZERO_PAGE_ALLOC_FAILED, 88 + THP_SWPOUT, 89 + THP_SWPOUT_FALLBACK, 88 90 #endif 89 91 #ifdef CONFIG_MEMORY_BALLOON 90 92 BALLOON_INFLATE, ··· 105 103 VMACACHE_FIND_CALLS, 106 104 VMACACHE_FIND_HITS, 107 105 VMACACHE_FULL_FLUSHES, 106 + #endif 107 + #ifdef CONFIG_SWAP 108 + SWAP_RA, 109 + SWAP_RA_HIT, 108 110 #endif 109 111 NR_VM_EVENT_ITEMS 110 112 };
+2 -2
include/linux/vmstat.h
··· 123 123 atomic_long_add(x, &vm_node_stat[item]); 124 124 } 125 125 126 - static inline unsigned long global_page_state(enum zone_stat_item item) 126 + static inline unsigned long global_zone_page_state(enum zone_stat_item item) 127 127 { 128 128 long x = atomic_long_read(&vm_zone_stat[item]); 129 129 #ifdef CONFIG_SMP ··· 199 199 extern unsigned long node_page_state(struct pglist_data *pgdat, 200 200 enum node_stat_item item); 201 201 #else 202 - #define sum_zone_node_page_state(node, item) global_page_state(item) 202 + #define sum_zone_node_page_state(node, item) global_zone_page_state(item) 203 203 #define node_page_state(node, item) global_node_page_state(item) 204 204 #endif /* CONFIG_NUMA */ 205 205
-2
include/trace/events/fs_dax.h
··· 190 190 191 191 DEFINE_PTE_FAULT_EVENT(dax_pte_fault); 192 192 DEFINE_PTE_FAULT_EVENT(dax_pte_fault_done); 193 - DEFINE_PTE_FAULT_EVENT(dax_pfn_mkwrite_no_entry); 194 - DEFINE_PTE_FAULT_EVENT(dax_pfn_mkwrite); 195 193 DEFINE_PTE_FAULT_EVENT(dax_load_hole); 196 194 197 195 TRACE_EVENT(dax_insert_mapping,
+1 -7
include/trace/events/mmflags.h
··· 125 125 #define __VM_ARCH_SPECIFIC_1 {VM_ARCH_1, "arch_1" } 126 126 #endif 127 127 128 - #if defined(CONFIG_X86) 129 - #define __VM_ARCH_SPECIFIC_2 {VM_MPX, "mpx" } 130 - #else 131 - #define __VM_ARCH_SPECIFIC_2 {VM_ARCH_2, "arch_2" } 132 - #endif 133 - 134 128 #ifdef CONFIG_MEM_SOFT_DIRTY 135 129 #define IF_HAVE_VM_SOFTDIRTY(flag,name) {flag, name }, 136 130 #else ··· 156 162 {VM_NORESERVE, "noreserve" }, \ 157 163 {VM_HUGETLB, "hugetlb" }, \ 158 164 __VM_ARCH_SPECIFIC_1 , \ 159 - __VM_ARCH_SPECIFIC_2 , \ 165 + {VM_WIPEONFORK, "wipeonfork" }, \ 160 166 {VM_DONTDUMP, "dontdump" }, \ 161 167 IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ 162 168 {VM_MIXEDMAP, "mixedmap" }, \
+34
include/uapi/asm-generic/hugetlb_encode.h
··· 1 + #ifndef _ASM_GENERIC_HUGETLB_ENCODE_H_ 2 + #define _ASM_GENERIC_HUGETLB_ENCODE_H_ 3 + 4 + /* 5 + * Several system calls take a flag to request "hugetlb" huge pages. 6 + * Without further specification, these system calls will use the 7 + * system's default huge page size. If a system supports multiple 8 + * huge page sizes, the desired huge page size can be specified in 9 + * bits [26:31] of the flag arguments. The value in these 6 bits 10 + * will encode the log2 of the huge page size. 11 + * 12 + * The following definitions are associated with this huge page size 13 + * encoding in flag arguments. System call specific header files 14 + * that use this encoding should include this file. They can then 15 + * provide definitions based on these with their own specific prefix, 16 + * for example: 17 + * #define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT 18 + */ 19 + 20 + #define HUGETLB_FLAG_ENCODE_SHIFT 26 21 + #define HUGETLB_FLAG_ENCODE_MASK 0x3f 22 + 23 + #define HUGETLB_FLAG_ENCODE_64KB (16 << HUGETLB_FLAG_ENCODE_SHIFT) 24 + #define HUGETLB_FLAG_ENCODE_512KB (19 << HUGETLB_FLAG_ENCODE_SHIFT) 25 + #define HUGETLB_FLAG_ENCODE_1MB (20 << HUGETLB_FLAG_ENCODE_SHIFT) 26 + #define HUGETLB_FLAG_ENCODE_2MB (21 << HUGETLB_FLAG_ENCODE_SHIFT) 27 + #define HUGETLB_FLAG_ENCODE_8MB (23 << HUGETLB_FLAG_ENCODE_SHIFT) 28 + #define HUGETLB_FLAG_ENCODE_16MB (24 << HUGETLB_FLAG_ENCODE_SHIFT) 29 + #define HUGETLB_FLAG_ENCODE_256MB (28 << HUGETLB_FLAG_ENCODE_SHIFT) 30 + #define HUGETLB_FLAG_ENCODE_1GB (30 << HUGETLB_FLAG_ENCODE_SHIFT) 31 + #define HUGETLB_FLAG_ENCODE_2GB (31 << HUGETLB_FLAG_ENCODE_SHIFT) 32 + #define HUGETLB_FLAG_ENCODE_16GB (34 << HUGETLB_FLAG_ENCODE_SHIFT) 33 + 34 + #endif /* _ASM_GENERIC_HUGETLB_ENCODE_H_ */
+3 -11
include/uapi/asm-generic/mman-common.h
··· 58 58 overrides the coredump filter bits */ 59 59 #define MADV_DODUMP 17 /* Clear the MADV_DONTDUMP flag */ 60 60 61 + #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ 62 + #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ 63 + 61 64 /* compatibility flags */ 62 65 #define MAP_FILE 0 63 - 64 - /* 65 - * When MAP_HUGETLB is set bits [26:31] encode the log2 of the huge page size. 66 - * This gives us 6 bits, which is enough until someone invents 128 bit address 67 - * spaces. 68 - * 69 - * Assume these are all power of twos. 70 - * When 0 use the default page size. 71 - */ 72 - #define MAP_HUGE_SHIFT 26 73 - #define MAP_HUGE_MASK 0x3f 74 66 75 67 #define PKEY_DISABLE_ACCESS 0x1 76 68 #define PKEY_DISABLE_WRITE 0x2
+24
include/uapi/linux/memfd.h
··· 1 1 #ifndef _UAPI_LINUX_MEMFD_H 2 2 #define _UAPI_LINUX_MEMFD_H 3 3 4 + #include <asm-generic/hugetlb_encode.h> 5 + 4 6 /* flags for memfd_create(2) (unsigned int) */ 5 7 #define MFD_CLOEXEC 0x0001U 6 8 #define MFD_ALLOW_SEALING 0x0002U 9 + #define MFD_HUGETLB 0x0004U 10 + 11 + /* 12 + * Huge page size encoding when MFD_HUGETLB is specified, and a huge page 13 + * size other than the default is desired. See hugetlb_encode.h. 14 + * All known huge page size encodings are provided here. It is the 15 + * responsibility of the application to know which sizes are supported on 16 + * the running system. See mmap(2) man page for details. 17 + */ 18 + #define MFD_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT 19 + #define MFD_HUGE_MASK HUGETLB_FLAG_ENCODE_MASK 20 + 21 + #define MFD_HUGE_64KB HUGETLB_FLAG_ENCODE_64KB 22 + #define MFD_HUGE_512KB HUGETLB_FLAG_ENCODE_512KB 23 + #define MFD_HUGE_1MB HUGETLB_FLAG_ENCODE_1MB 24 + #define MFD_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB 25 + #define MFD_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB 26 + #define MFD_HUGE_16MB HUGETLB_FLAG_ENCODE_16MB 27 + #define MFD_HUGE_256MB HUGETLB_FLAG_ENCODE_256MB 28 + #define MFD_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB 29 + #define MFD_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB 30 + #define MFD_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB 7 31 8 32 #endif /* _UAPI_LINUX_MEMFD_H */
+22
include/uapi/linux/mman.h
··· 2 2 #define _UAPI_LINUX_MMAN_H 3 3 4 4 #include <asm/mman.h> 5 + #include <asm-generic/hugetlb_encode.h> 5 6 6 7 #define MREMAP_MAYMOVE 1 7 8 #define MREMAP_FIXED 2 ··· 10 9 #define OVERCOMMIT_GUESS 0 11 10 #define OVERCOMMIT_ALWAYS 1 12 11 #define OVERCOMMIT_NEVER 2 12 + 13 + /* 14 + * Huge page size encoding when MAP_HUGETLB is specified, and a huge page 15 + * size other than the default is desired. See hugetlb_encode.h. 16 + * All known huge page size encodings are provided here. It is the 17 + * responsibility of the application to know which sizes are supported on 18 + * the running system. See mmap(2) man page for details. 19 + */ 20 + #define MAP_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT 21 + #define MAP_HUGE_MASK HUGETLB_FLAG_ENCODE_MASK 22 + 23 + #define MAP_HUGE_64KB HUGETLB_FLAG_ENCODE_64KB 24 + #define MAP_HUGE_512KB HUGETLB_FLAG_ENCODE_512KB 25 + #define MAP_HUGE_1MB HUGETLB_FLAG_ENCODE_1MB 26 + #define MAP_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB 27 + #define MAP_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB 28 + #define MAP_HUGE_16MB HUGETLB_FLAG_ENCODE_16MB 29 + #define MAP_HUGE_256MB HUGETLB_FLAG_ENCODE_256MB 30 + #define MAP_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB 31 + #define MAP_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB 32 + #define MAP_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB 13 33 14 34 #endif /* _UAPI_LINUX_MMAN_H */
+29 -2
include/uapi/linux/shm.h
··· 3 3 4 4 #include <linux/ipc.h> 5 5 #include <linux/errno.h> 6 + #include <asm-generic/hugetlb_encode.h> 6 7 #ifndef __KERNEL__ 7 8 #include <unistd.h> 8 9 #endif ··· 41 40 /* Include the definition of shmid64_ds and shminfo64 */ 42 41 #include <asm/shmbuf.h> 43 42 44 - /* permission flag for shmget */ 43 + /* 44 + * shmget() shmflg values. 45 + */ 46 + /* The bottom nine bits are the same as open(2) mode flags */ 45 47 #define SHM_R 0400 /* or S_IRUGO from <linux/stat.h> */ 46 48 #define SHM_W 0200 /* or S_IWUGO from <linux/stat.h> */ 49 + /* Bits 9 & 10 are IPC_CREAT and IPC_EXCL */ 50 + #define SHM_HUGETLB 04000 /* segment will use huge TLB pages */ 51 + #define SHM_NORESERVE 010000 /* don't check for reservations */ 47 52 48 - /* mode for attach */ 53 + /* 54 + * Huge page size encoding when SHM_HUGETLB is specified, and a huge page 55 + * size other than the default is desired. See hugetlb_encode.h 56 + */ 57 + #define SHM_HUGE_SHIFT HUGETLB_FLAG_ENCODE_SHIFT 58 + #define SHM_HUGE_MASK HUGETLB_FLAG_ENCODE_MASK 59 + 60 + #define SHM_HUGE_64KB HUGETLB_FLAG_ENCODE_64KB 61 + #define SHM_HUGE_512KB HUGETLB_FLAG_ENCODE_512KB 62 + #define SHM_HUGE_1MB HUGETLB_FLAG_ENCODE_1MB 63 + #define SHM_HUGE_2MB HUGETLB_FLAG_ENCODE_2MB 64 + #define SHM_HUGE_8MB HUGETLB_FLAG_ENCODE_8MB 65 + #define SHM_HUGE_16MB HUGETLB_FLAG_ENCODE_16MB 66 + #define SHM_HUGE_256MB HUGETLB_FLAG_ENCODE_256MB 67 + #define SHM_HUGE_1GB HUGETLB_FLAG_ENCODE_1GB 68 + #define SHM_HUGE_2GB HUGETLB_FLAG_ENCODE_2GB 69 + #define SHM_HUGE_16GB HUGETLB_FLAG_ENCODE_16GB 70 + 71 + /* 72 + * shmat() shmflg values 73 + */ 49 74 #define SHM_RDONLY 010000 /* read-only access */ 50 75 #define SHM_RND 020000 /* round attach address to SHMLBA boundary */ 51 76 #define SHM_REMAP 040000 /* take-over region on attach */
+15 -1
include/uapi/linux/userfaultfd.h
··· 23 23 UFFD_FEATURE_EVENT_REMOVE | \ 24 24 UFFD_FEATURE_EVENT_UNMAP | \ 25 25 UFFD_FEATURE_MISSING_HUGETLBFS | \ 26 - UFFD_FEATURE_MISSING_SHMEM) 26 + UFFD_FEATURE_MISSING_SHMEM | \ 27 + UFFD_FEATURE_SIGBUS | \ 28 + UFFD_FEATURE_THREAD_ID) 27 29 #define UFFD_API_IOCTLS \ 28 30 ((__u64)1 << _UFFDIO_REGISTER | \ 29 31 (__u64)1 << _UFFDIO_UNREGISTER | \ ··· 80 78 struct { 81 79 __u64 flags; 82 80 __u64 address; 81 + union { 82 + __u32 ptid; 83 + } feat; 83 84 } pagefault; 84 85 85 86 struct { ··· 158 153 * UFFD_FEATURE_MISSING_SHMEM works the same as 159 154 * UFFD_FEATURE_MISSING_HUGETLBFS, but it applies to shmem 160 155 * (i.e. tmpfs and other shmem based APIs). 156 + * 157 + * UFFD_FEATURE_SIGBUS feature means no page-fault 158 + * (UFFD_EVENT_PAGEFAULT) event will be delivered, instead 159 + * a SIGBUS signal will be sent to the faulting process. 160 + * 161 + * UFFD_FEATURE_THREAD_ID pid of the page faulted task_struct will 162 + * be returned, if feature is not requested 0 will be returned. 161 163 */ 162 164 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) 163 165 #define UFFD_FEATURE_EVENT_FORK (1<<1) ··· 173 161 #define UFFD_FEATURE_MISSING_HUGETLBFS (1<<4) 174 162 #define UFFD_FEATURE_MISSING_SHMEM (1<<5) 175 163 #define UFFD_FEATURE_EVENT_UNMAP (1<<6) 164 + #define UFFD_FEATURE_SIGBUS (1<<7) 165 + #define UFFD_FEATURE_THREAD_ID (1<<8) 176 166 __u64 features; 177 167 178 168 __u64 ioctls;
+9
init/Kconfig
··· 1576 1576 security feature reduces the predictability of the kernel slab 1577 1577 allocator against heap overflows. 1578 1578 1579 + config SLAB_FREELIST_HARDENED 1580 + bool "Harden slab freelist metadata" 1581 + depends on SLUB 1582 + help 1583 + Many kernel heap attacks try to target slab cache metadata and 1584 + other infrastructure. This option makes minor performance 1585 + sacrifices to harden the kernel slab allocator against common 1586 + freelist exploit methods. 1587 + 1579 1588 config SLUB_CPU_PARTIAL 1580 1589 default y 1581 1590 depends on SLUB && SMP
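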
+1 -1
init/main.c
··· 542 542 boot_cpu_state_init(); 543 543 smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ 544 544 545 - build_all_zonelists(NULL, NULL); 545 + build_all_zonelists(NULL); 546 546 page_alloc_init(); 547 547 548 548 pr_notice("Kernel command line: %s\n", boot_command_line);
-3
kernel/cgroup/cgroup.c
··· 4100 4100 if (!(css->flags & CSS_ONLINE)) 4101 4101 return; 4102 4102 4103 - if (ss->css_reset) 4104 - ss->css_reset(css); 4105 - 4106 4103 if (ss->css_offline) 4107 4104 ss->css_offline(css); 4108 4105
+5 -4
kernel/cgroup/cpuset.c
··· 56 56 #include <linux/time64.h> 57 57 #include <linux/backing-dev.h> 58 58 #include <linux/sort.h> 59 + #include <linux/oom.h> 59 60 60 61 #include <linux/uaccess.h> 61 62 #include <linux/atomic.h> ··· 2501 2500 * If we're in interrupt, yes, we can always allocate. If @node is set in 2502 2501 * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this 2503 2502 * node is set in the nearest hardwalled cpuset ancestor to current's cpuset, 2504 - * yes. If current has access to memory reserves due to TIF_MEMDIE, yes. 2503 + * yes. If current has access to memory reserves as an oom victim, yes. 2505 2504 * Otherwise, no. 2506 2505 * 2507 2506 * GFP_USER allocations are marked with the __GFP_HARDWALL bit, 2508 2507 * and do not allow allocations outside the current tasks cpuset 2509 - * unless the task has been OOM killed as is marked TIF_MEMDIE. 2508 + * unless the task has been OOM killed. 2510 2509 * GFP_KERNEL allocations are not so marked, so can escape to the 2511 2510 * nearest enclosing hardwalled ancestor cpuset. 2512 2511 * ··· 2529 2528 * affect that: 2530 2529 * in_interrupt - any node ok (current task context irrelevant) 2531 2530 * GFP_ATOMIC - any node ok 2532 - * TIF_MEMDIE - any node ok 2531 + * tsk_is_oom_victim - any node ok 2533 2532 * GFP_KERNEL - any node in enclosing hardwalled cpuset ok 2534 2533 * GFP_USER - only nodes in current tasks mems allowed ok. 2535 2534 */ ··· 2547 2546 * Allow tasks that have access to memory reserves because they have 2548 2547 * been OOM killed to get memory anywhere. 2549 2548 */ 2550 - if (unlikely(test_thread_flag(TIF_MEMDIE))) 2549 + if (unlikely(tsk_is_oom_victim(current))) 2551 2550 return true; 2552 2551 if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */ 2553 2552 return false;
+8 -19
kernel/fork.c
··· 657 657 retval = dup_userfaultfd(tmp, &uf); 658 658 if (retval) 659 659 goto fail_nomem_anon_vma_fork; 660 - if (anon_vma_fork(tmp, mpnt)) 660 + if (tmp->vm_flags & VM_WIPEONFORK) { 661 + /* VM_WIPEONFORK gets a clean slate in the child. */ 662 + tmp->anon_vma = NULL; 663 + if (anon_vma_prepare(tmp)) 664 + goto fail_nomem_anon_vma_fork; 665 + } else if (anon_vma_fork(tmp, mpnt)) 661 666 goto fail_nomem_anon_vma_fork; 662 667 tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT); 663 668 tmp->vm_next = tmp->vm_prev = NULL; ··· 706 701 rb_parent = &tmp->vm_rb; 707 702 708 703 mm->map_count++; 709 - retval = copy_page_range(mm, oldmm, mpnt); 704 + if (!(tmp->vm_flags & VM_WIPEONFORK)) 705 + retval = copy_page_range(mm, oldmm, mpnt); 710 706 711 707 if (tmp->vm_ops && tmp->vm_ops->open) 712 708 tmp->vm_ops->open(tmp); ··· 928 922 } 929 923 if (mm->binfmt) 930 924 module_put(mm->binfmt->module); 931 - set_bit(MMF_OOM_SKIP, &mm->flags); 932 925 mmdrop(mm); 933 926 } 934 927 ··· 942 937 __mmput(mm); 943 938 } 944 939 EXPORT_SYMBOL_GPL(mmput); 945 - 946 - #ifdef CONFIG_MMU 947 - static void mmput_async_fn(struct work_struct *work) 948 - { 949 - struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work); 950 - __mmput(mm); 951 - } 952 - 953 - void mmput_async(struct mm_struct *mm) 954 - { 955 - if (atomic_dec_and_test(&mm->mm_users)) { 956 - INIT_WORK(&mm->async_put_work, mmput_async_fn); 957 - schedule_work(&mm->async_put_work); 958 - } 959 - } 960 - #endif 961 940 962 941 /** 963 942 * set_mm_exe_file - change a reference to the mm's executable file
+38 -14
kernel/memremap.c
··· 194 194 struct vmem_altmap altmap; 195 195 }; 196 196 197 + static unsigned long order_at(struct resource *res, unsigned long pgoff) 198 + { 199 + unsigned long phys_pgoff = PHYS_PFN(res->start) + pgoff; 200 + unsigned long nr_pages, mask; 201 + 202 + nr_pages = PHYS_PFN(resource_size(res)); 203 + if (nr_pages == pgoff) 204 + return ULONG_MAX; 205 + 206 + /* 207 + * What is the largest aligned power-of-2 range available from 208 + * this resource pgoff to the end of the resource range, 209 + * considering the alignment of the current pgoff? 210 + */ 211 + mask = phys_pgoff | rounddown_pow_of_two(nr_pages - pgoff); 212 + if (!mask) 213 + return ULONG_MAX; 214 + 215 + return find_first_bit(&mask, BITS_PER_LONG); 216 + } 217 + 218 + #define foreach_order_pgoff(res, order, pgoff) \ 219 + for (pgoff = 0, order = order_at((res), pgoff); order < ULONG_MAX; \ 220 + pgoff += 1UL << order, order = order_at((res), pgoff)) 221 + 197 222 static void pgmap_radix_release(struct resource *res) 198 223 { 199 - resource_size_t key, align_start, align_size, align_end; 200 - 201 - align_start = res->start & ~(SECTION_SIZE - 1); 202 - align_size = ALIGN(resource_size(res), SECTION_SIZE); 203 - align_end = align_start + align_size - 1; 224 + unsigned long pgoff, order; 204 225 205 226 mutex_lock(&pgmap_lock); 206 - for (key = res->start; key <= res->end; key += SECTION_SIZE) 207 - radix_tree_delete(&pgmap_radix, key >> PA_SECTION_SHIFT); 227 + foreach_order_pgoff(res, order, pgoff) 228 + radix_tree_delete(&pgmap_radix, PHYS_PFN(res->start) + pgoff); 208 229 mutex_unlock(&pgmap_lock); 230 + 231 + synchronize_rcu(); 209 232 } 210 233 211 234 static unsigned long pfn_first(struct page_map *page_map) ··· 291 268 292 269 WARN_ON_ONCE(!rcu_read_lock_held()); 293 270 294 - page_map = radix_tree_lookup(&pgmap_radix, phys >> PA_SECTION_SHIFT); 271 + page_map = radix_tree_lookup(&pgmap_radix, PHYS_PFN(phys)); 295 272 return page_map ? 
&page_map->pgmap : NULL; 296 273 } 297 274 ··· 316 293 void *devm_memremap_pages(struct device *dev, struct resource *res, 317 294 struct percpu_ref *ref, struct vmem_altmap *altmap) 318 295 { 319 - resource_size_t key, align_start, align_size, align_end; 296 + resource_size_t align_start, align_size, align_end; 297 + unsigned long pfn, pgoff, order; 320 298 pgprot_t pgprot = PAGE_KERNEL; 321 299 struct dev_pagemap *pgmap; 322 300 struct page_map *page_map; 323 301 int error, nid, is_ram; 324 - unsigned long pfn; 325 302 326 303 align_start = res->start & ~(SECTION_SIZE - 1); 327 304 align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) ··· 360 337 mutex_lock(&pgmap_lock); 361 338 error = 0; 362 339 align_end = align_start + align_size - 1; 363 - for (key = align_start; key <= align_end; key += SECTION_SIZE) { 340 + 341 + foreach_order_pgoff(res, order, pgoff) { 364 342 struct dev_pagemap *dup; 365 343 366 344 rcu_read_lock(); 367 - dup = find_dev_pagemap(key); 345 + dup = find_dev_pagemap(res->start + PFN_PHYS(pgoff)); 368 346 rcu_read_unlock(); 369 347 if (dup) { 370 348 dev_err(dev, "%s: %pr collides with mapping for %s\n", ··· 373 349 error = -EBUSY; 374 350 break; 375 351 } 376 - error = radix_tree_insert(&pgmap_radix, key >> PA_SECTION_SHIFT, 377 - page_map); 352 + error = __radix_tree_insert(&pgmap_radix, 353 + PHYS_PFN(res->start) + pgoff, order, page_map); 378 354 if (error) { 379 355 dev_err(dev, "%s: failed: %d\n", __func__, error); 380 356 break;
+1
mm/Kconfig
··· 678 678 depends on MEMORY_HOTREMOVE 679 679 depends on SPARSEMEM_VMEMMAP 680 680 depends on ARCH_HAS_ZONE_DEVICE 681 + select RADIX_TREE_MULTIORDER 681 682 682 683 help 683 684 Device memory hotplug support allows for establishing pmem,
+39 -28
mm/filemap.c
··· 130 130 return -EEXIST; 131 131 132 132 mapping->nrexceptional--; 133 - if (!dax_mapping(mapping)) { 134 - if (shadowp) 135 - *shadowp = p; 136 - } else { 137 - /* DAX can replace empty locked entry with a hole */ 138 - WARN_ON_ONCE(p != 139 - dax_radix_locked_entry(0, RADIX_DAX_EMPTY)); 140 - /* Wakeup waiters for exceptional entry lock */ 141 - dax_wake_mapping_entry_waiter(mapping, page->index, p, 142 - true); 143 - } 133 + if (shadowp) 134 + *shadowp = p; 144 135 } 145 136 __radix_tree_replace(&mapping->page_tree, node, slot, page, 146 137 workingset_update_node, mapping); ··· 393 402 { 394 403 pgoff_t index = start_byte >> PAGE_SHIFT; 395 404 pgoff_t end = end_byte >> PAGE_SHIFT; 396 - struct pagevec pvec; 397 - bool ret; 405 + struct page *page; 398 406 399 407 if (end_byte < start_byte) 400 408 return false; ··· 401 411 if (mapping->nrpages == 0) 402 412 return false; 403 413 404 - pagevec_init(&pvec, 0); 405 - if (!pagevec_lookup(&pvec, mapping, index, 1)) 414 + if (!find_get_pages_range(mapping, &index, end, 1, &page)) 406 415 return false; 407 - ret = (pvec.pages[0]->index <= end); 408 - pagevec_release(&pvec); 409 - return ret; 416 + put_page(page); 417 + return true; 410 418 } 411 419 EXPORT_SYMBOL(filemap_range_has_page); 412 420 ··· 1552 1564 } 1553 1565 1554 1566 /** 1555 - * find_get_pages - gang pagecache lookup 1567 + * find_get_pages_range - gang pagecache lookup 1556 1568 * @mapping: The address_space to search 1557 1569 * @start: The starting page index 1570 + * @end: The final page index (inclusive) 1558 1571 * @nr_pages: The maximum number of pages 1559 1572 * @pages: Where the resulting pages are placed 1560 1573 * 1561 - * find_get_pages() will search for and return a group of up to 1562 - * @nr_pages pages in the mapping. The pages are placed at @pages. 1563 - * find_get_pages() takes a reference against the returned pages. 
1574 + * find_get_pages_range() will search for and return a group of up to @nr_pages 1575 + * pages in the mapping starting at index @start and up to index @end 1576 + * (inclusive). The pages are placed at @pages. find_get_pages_range() takes 1577 + * a reference against the returned pages. 1564 1578 * 1565 1579 * The search returns a group of mapping-contiguous pages with ascending 1566 1580 * indexes. There may be holes in the indices due to not-present pages. 1581 + * We also update @start to index the next page for the traversal. 1567 1582 * 1568 - * find_get_pages() returns the number of pages which were found. 1583 + * find_get_pages_range() returns the number of pages which were found. If this 1584 + * number is smaller than @nr_pages, the end of specified range has been 1585 + * reached. 1569 1586 */ 1570 - unsigned find_get_pages(struct address_space *mapping, pgoff_t start, 1571 - unsigned int nr_pages, struct page **pages) 1587 + unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start, 1588 + pgoff_t end, unsigned int nr_pages, 1589 + struct page **pages) 1572 1590 { 1573 1591 struct radix_tree_iter iter; 1574 1592 void **slot; ··· 1584 1590 return 0; 1585 1591 1586 1592 rcu_read_lock(); 1587 - radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) { 1593 + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, *start) { 1588 1594 struct page *head, *page; 1595 + 1596 + if (iter.index > end) 1597 + break; 1589 1598 repeat: 1590 1599 page = radix_tree_deref_slot(slot); 1591 1600 if (unlikely(!page)) ··· 1624 1627 } 1625 1628 1626 1629 pages[ret] = page; 1627 - if (++ret == nr_pages) 1628 - break; 1630 + if (++ret == nr_pages) { 1631 + *start = pages[ret - 1]->index + 1; 1632 + goto out; 1633 + } 1629 1634 } 1630 1635 1636 + /* 1637 + * We come here when there is no page beyond @end. We take care to not 1638 + * overflow the index @start as it confuses some of the callers. 
This 1639 + * breaks the iteration when there is page at index -1 but that is 1640 + * already broken anyway. 1641 + */ 1642 + if (end == (pgoff_t)-1) 1643 + *start = (pgoff_t)-1; 1644 + else 1645 + *start = end + 1; 1646 + out: 1631 1647 rcu_read_unlock(); 1648 + 1632 1649 return ret; 1633 1650 } 1634 1651
+1 -1
mm/gup.c
··· 1352 1352 } 1353 1353 #endif /* __HAVE_ARCH_PTE_SPECIAL */ 1354 1354 1355 - #ifdef __HAVE_ARCH_PTE_DEVMAP 1355 + #if defined(__HAVE_ARCH_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE) 1356 1356 static int __gup_device_huge(unsigned long pfn, unsigned long addr, 1357 1357 unsigned long end, struct page **pages, int *nr) 1358 1358 {
+27 -5
mm/huge_memory.c
··· 328 328 NULL, 329 329 }; 330 330 331 - static struct attribute_group hugepage_attr_group = { 331 + static const struct attribute_group hugepage_attr_group = { 332 332 .attrs = hugepage_attr, 333 333 }; 334 334 ··· 567 567 goto release; 568 568 } 569 569 570 - clear_huge_page(page, haddr, HPAGE_PMD_NR); 570 + clear_huge_page(page, vmf->address, HPAGE_PMD_NR); 571 571 /* 572 572 * The memory barrier inside __SetPageUptodate makes sure that 573 573 * clear_huge_page writes become visible before the set_pmd_at() ··· 1240 1240 * We can only reuse the page if nobody else maps the huge page or it's 1241 1241 * part. 1242 1242 */ 1243 - if (page_trans_huge_mapcount(page, NULL) == 1) { 1243 + if (!trylock_page(page)) { 1244 + get_page(page); 1245 + spin_unlock(vmf->ptl); 1246 + lock_page(page); 1247 + spin_lock(vmf->ptl); 1248 + if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) { 1249 + unlock_page(page); 1250 + put_page(page); 1251 + goto out_unlock; 1252 + } 1253 + put_page(page); 1254 + } 1255 + if (reuse_swap_page(page, NULL)) { 1244 1256 pmd_t entry; 1245 1257 entry = pmd_mkyoung(orig_pmd); 1246 1258 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); 1247 1259 if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) 1248 1260 update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); 1249 1261 ret |= VM_FAULT_WRITE; 1262 + unlock_page(page); 1250 1263 goto out_unlock; 1251 1264 } 1265 + unlock_page(page); 1252 1266 get_page(page); 1253 1267 spin_unlock(vmf->ptl); 1254 1268 alloc: ··· 1305 1291 count_vm_event(THP_FAULT_ALLOC); 1306 1292 1307 1293 if (!page) 1308 - clear_huge_page(new_page, haddr, HPAGE_PMD_NR); 1294 + clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR); 1309 1295 else 1310 1296 copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); 1311 1297 __SetPageUptodate(new_page); ··· 2481 2467 VM_BUG_ON_PAGE(!PageLocked(page), page); 2482 2468 VM_BUG_ON_PAGE(!PageCompound(page), page); 2483 2469 2470 + if (PageWriteback(page)) 2471 + return -EBUSY; 2472 + 
2484 2473 if (PageAnon(head)) { 2485 2474 /* 2486 2475 * The caller does not necessarily hold an mmap_sem that would ··· 2561 2544 __dec_node_page_state(page, NR_SHMEM_THPS); 2562 2545 spin_unlock(&pgdata->split_queue_lock); 2563 2546 __split_huge_page(page, list, flags); 2564 - ret = 0; 2547 + if (PageSwapCache(head)) { 2548 + swp_entry_t entry = { .val = page_private(head) }; 2549 + 2550 + ret = split_swap_cluster(entry); 2551 + } else 2552 + ret = 0; 2565 2553 } else { 2566 2554 if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) { 2567 2555 pr_alert("total_mapcount: %u, page_count(): %u\n",
+44 -21
mm/hugetlb.c
··· 1066 1066 } 1067 1067 1068 1068 static int __alloc_gigantic_page(unsigned long start_pfn, 1069 - unsigned long nr_pages) 1069 + unsigned long nr_pages, gfp_t gfp_mask) 1070 1070 { 1071 1071 unsigned long end_pfn = start_pfn + nr_pages; 1072 1072 return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE, 1073 - GFP_KERNEL); 1073 + gfp_mask); 1074 1074 } 1075 1075 1076 1076 static bool pfn_range_valid_gigantic(struct zone *z, ··· 1108 1108 return zone_spans_pfn(zone, last_pfn); 1109 1109 } 1110 1110 1111 - static struct page *alloc_gigantic_page(int nid, unsigned int order) 1111 + static struct page *alloc_gigantic_page(int nid, struct hstate *h) 1112 1112 { 1113 + unsigned int order = huge_page_order(h); 1113 1114 unsigned long nr_pages = 1 << order; 1114 1115 unsigned long ret, pfn, flags; 1115 - struct zone *z; 1116 + struct zonelist *zonelist; 1117 + struct zone *zone; 1118 + struct zoneref *z; 1119 + gfp_t gfp_mask; 1116 1120 1117 - z = NODE_DATA(nid)->node_zones; 1118 - for (; z - NODE_DATA(nid)->node_zones < MAX_NR_ZONES; z++) { 1119 - spin_lock_irqsave(&z->lock, flags); 1121 + gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; 1122 + zonelist = node_zonelist(nid, gfp_mask); 1123 + for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), NULL) { 1124 + spin_lock_irqsave(&zone->lock, flags); 1120 1125 1121 - pfn = ALIGN(z->zone_start_pfn, nr_pages); 1122 - while (zone_spans_last_pfn(z, pfn, nr_pages)) { 1123 - if (pfn_range_valid_gigantic(z, pfn, nr_pages)) { 1126 + pfn = ALIGN(zone->zone_start_pfn, nr_pages); 1127 + while (zone_spans_last_pfn(zone, pfn, nr_pages)) { 1128 + if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) { 1124 1129 /* 1125 1130 * We release the zone lock here because 1126 1131 * alloc_contig_range() will also lock the zone ··· 1133 1128 * spinning on this lock, it may win the race 1134 1129 * and cause alloc_contig_range() to fail... 
1135 1130 */ 1136 - spin_unlock_irqrestore(&z->lock, flags); 1137 - ret = __alloc_gigantic_page(pfn, nr_pages); 1131 + spin_unlock_irqrestore(&zone->lock, flags); 1132 + ret = __alloc_gigantic_page(pfn, nr_pages, gfp_mask); 1138 1133 if (!ret) 1139 1134 return pfn_to_page(pfn); 1140 - spin_lock_irqsave(&z->lock, flags); 1135 + spin_lock_irqsave(&zone->lock, flags); 1141 1136 } 1142 1137 pfn += nr_pages; 1143 1138 } 1144 1139 1145 - spin_unlock_irqrestore(&z->lock, flags); 1140 + spin_unlock_irqrestore(&zone->lock, flags); 1146 1141 } 1147 1142 1148 1143 return NULL; ··· 1155 1150 { 1156 1151 struct page *page; 1157 1152 1158 - page = alloc_gigantic_page(nid, huge_page_order(h)); 1153 + page = alloc_gigantic_page(nid, h); 1159 1154 if (page) { 1160 1155 prep_compound_gigantic_page(page, huge_page_order(h)); 1161 1156 prep_new_huge_page(h, page, nid); ··· 2574 2569 NULL, 2575 2570 }; 2576 2571 2577 - static struct attribute_group hstate_attr_group = { 2572 + static const struct attribute_group hstate_attr_group = { 2578 2573 .attrs = hstate_attrs, 2579 2574 }; 2580 2575 2581 2576 static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent, 2582 2577 struct kobject **hstate_kobjs, 2583 - struct attribute_group *hstate_attr_group) 2578 + const struct attribute_group *hstate_attr_group) 2584 2579 { 2585 2580 int retval; 2586 2581 int hi = hstate_index(h); ··· 2638 2633 NULL, 2639 2634 }; 2640 2635 2641 - static struct attribute_group per_node_hstate_attr_group = { 2636 + static const struct attribute_group per_node_hstate_attr_group = { 2642 2637 .attrs = per_node_hstate_attrs, 2643 2638 }; 2644 2639 ··· 4605 4600 return pte; 4606 4601 } 4607 4602 4603 + /* 4604 + * huge_pte_offset() - Walk the page table to resolve the hugepage 4605 + * entry at address @addr 4606 + * 4607 + * Return: Pointer to page table or swap entry (PUD or PMD) for 4608 + * address @addr, or NULL if a p*d_none() entry is encountered and the 4609 + * size @sz doesn't match the 
hugepage size at this level of the page 4610 + * table. 4611 + */ 4608 4612 pte_t *huge_pte_offset(struct mm_struct *mm, 4609 4613 unsigned long addr, unsigned long sz) 4610 4614 { ··· 4628 4614 p4d = p4d_offset(pgd, addr); 4629 4615 if (!p4d_present(*p4d)) 4630 4616 return NULL; 4617 + 4631 4618 pud = pud_offset(p4d, addr); 4632 - if (!pud_present(*pud)) 4619 + if (sz != PUD_SIZE && pud_none(*pud)) 4633 4620 return NULL; 4634 - if (pud_huge(*pud)) 4621 + /* hugepage or swap? */ 4622 + if (pud_huge(*pud) || !pud_present(*pud)) 4635 4623 return (pte_t *)pud; 4624 + 4636 4625 pmd = pmd_offset(pud, addr); 4637 - return (pte_t *) pmd; 4626 + if (sz != PMD_SIZE && pmd_none(*pmd)) 4627 + return NULL; 4628 + /* hugepage or swap? */ 4629 + if (pmd_huge(*pmd) || !pmd_present(*pmd)) 4630 + return (pte_t *)pmd; 4631 + 4632 + return NULL; 4638 4633 } 4639 4634 4640 4635 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
+12
mm/internal.h
··· 480 480 /* Mask to get the watermark bits */ 481 481 #define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1) 482 482 483 + /* 484 + * Only MMU archs have async oom victim reclaim - aka oom_reaper so we 485 + * cannot assume a reduced access to memory reserves is sufficient for 486 + * !MMU 487 + */ 488 + #ifdef CONFIG_MMU 489 + #define ALLOC_OOM 0x08 490 + #else 491 + #define ALLOC_OOM ALLOC_NO_WATERMARKS 492 + #endif 493 + 483 494 #define ALLOC_HARDER 0x10 /* try to alloc harder */ 484 495 #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ 485 496 #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ ··· 536 525 return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC; 537 526 } 538 527 528 + void setup_zone_pageset(struct zone *zone); 539 529 #endif /* __MM_INTERNAL_H */
+1 -1
mm/ksm.c
··· 3043 3043 NULL, 3044 3044 }; 3045 3045 3046 - static struct attribute_group ksm_attr_group = { 3046 + static const struct attribute_group ksm_attr_group = { 3047 3047 .attrs = ksm_attrs, 3048 3048 .name = "ksm", 3049 3049 };
+13
mm/madvise.c
··· 80 80 } 81 81 new_flags &= ~VM_DONTCOPY; 82 82 break; 83 + case MADV_WIPEONFORK: 84 + /* MADV_WIPEONFORK is only supported on anonymous memory. */ 85 + if (vma->vm_file || vma->vm_flags & VM_SHARED) { 86 + error = -EINVAL; 87 + goto out; 88 + } 89 + new_flags |= VM_WIPEONFORK; 90 + break; 91 + case MADV_KEEPONFORK: 92 + new_flags &= ~VM_WIPEONFORK; 93 + break; 83 94 case MADV_DONTDUMP: 84 95 new_flags |= VM_DONTDUMP; 85 96 break; ··· 707 696 #endif 708 697 case MADV_DONTDUMP: 709 698 case MADV_DODUMP: 699 + case MADV_WIPEONFORK: 700 + case MADV_KEEPONFORK: 710 701 #ifdef CONFIG_MEMORY_FAILURE 711 702 case MADV_SOFT_OFFLINE: 712 703 case MADV_HWPOISON:
+27 -13
mm/memcontrol.c
··· 550 550 * value, and reading all cpu value can be performance bottleneck in some 551 551 * common workload, threshold and synchronization as vmstat[] should be 552 552 * implemented. 553 + * 554 + * The parameter idx can be of type enum memcg_event_item or vm_event_item. 553 555 */ 554 556 555 557 static unsigned long memcg_sum_events(struct mem_cgroup *memcg, 556 - enum memcg_event_item event) 558 + int event) 557 559 { 558 560 unsigned long val = 0; 559 561 int cpu; ··· 1917 1915 * bypass the last charges so that they can exit quickly and 1918 1916 * free their memory. 1919 1917 */ 1920 - if (unlikely(test_thread_flag(TIF_MEMDIE) || 1918 + if (unlikely(tsk_is_oom_victim(current) || 1921 1919 fatal_signal_pending(current) || 1922 1920 current->flags & PF_EXITING)) 1923 1921 goto force; ··· 4321 4319 } 4322 4320 spin_unlock(&memcg->event_list_lock); 4323 4321 4322 + memcg->low = 0; 4323 + 4324 4324 memcg_offline_kmem(memcg); 4325 4325 wb_memcg_offline(memcg); 4326 4326 ··· 4639 4635 if (!ret || !target) 4640 4636 put_page(page); 4641 4637 } 4642 - /* There is a swap entry and a page doesn't exist or isn't charged */ 4643 - if (ent.val && !ret && 4638 + /* 4639 + * There is a swap entry and a page doesn't exist or isn't charged. 4640 + * But we cannot move a tail-page in a THP. 4641 + */ 4642 + if (ent.val && !ret && (!page || !PageTransCompound(page)) && 4644 4643 mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) { 4645 4644 ret = MC_TARGET_SWAP; 4646 4645 if (target) ··· 4654 4647 4655 4648 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 4656 4649 /* 4657 - * We don't consider swapping or file mapped pages because THP does not 4658 - * support them for now. 4650 + * We don't consider PMD mapped swapping or file mapped pages because THP does 4651 + * not support them for now. 4659 4652 * Caller should make sure that pmd_trans_huge(pmd) is true. 
4660 4653 */ 4661 4654 static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma, ··· 5430 5423 * in turn serializes uncharging. 5431 5424 */ 5432 5425 VM_BUG_ON_PAGE(!PageLocked(page), page); 5433 - if (page->mem_cgroup) 5426 + if (compound_head(page)->mem_cgroup) 5434 5427 goto out; 5435 5428 5436 5429 if (do_swap_account) { ··· 5913 5906 void mem_cgroup_swapout(struct page *page, swp_entry_t entry) 5914 5907 { 5915 5908 struct mem_cgroup *memcg, *swap_memcg; 5909 + unsigned int nr_entries; 5916 5910 unsigned short oldid; 5917 5911 5918 5912 VM_BUG_ON_PAGE(PageLRU(page), page); ··· 5934 5926 * ancestor for the swap instead and transfer the memory+swap charge. 5935 5927 */ 5936 5928 swap_memcg = mem_cgroup_id_get_online(memcg); 5937 - oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 1); 5929 + nr_entries = hpage_nr_pages(page); 5930 + /* Get references for the tail pages, too */ 5931 + if (nr_entries > 1) 5932 + mem_cgroup_id_get_many(swap_memcg, nr_entries - 1); 5933 + oldid = swap_cgroup_record(entry, mem_cgroup_id(swap_memcg), 5934 + nr_entries); 5938 5935 VM_BUG_ON_PAGE(oldid, page); 5939 - mem_cgroup_swap_statistics(swap_memcg, 1); 5936 + mem_cgroup_swap_statistics(swap_memcg, nr_entries); 5940 5937 5941 5938 page->mem_cgroup = NULL; 5942 5939 5943 5940 if (!mem_cgroup_is_root(memcg)) 5944 - page_counter_uncharge(&memcg->memory, 1); 5941 + page_counter_uncharge(&memcg->memory, nr_entries); 5945 5942 5946 5943 if (memcg != swap_memcg) { 5947 5944 if (!mem_cgroup_is_root(swap_memcg)) 5948 - page_counter_charge(&swap_memcg->memsw, 1); 5949 - page_counter_uncharge(&memcg->memsw, 1); 5945 + page_counter_charge(&swap_memcg->memsw, nr_entries); 5946 + page_counter_uncharge(&memcg->memsw, nr_entries); 5950 5947 } 5951 5948 5952 5949 /* ··· 5961 5948 * only synchronisation we have for udpating the per-CPU variables. 
5962 5949 */ 5963 5950 VM_BUG_ON(!irqs_disabled()); 5964 - mem_cgroup_charge_statistics(memcg, page, false, -1); 5951 + mem_cgroup_charge_statistics(memcg, page, PageTransHuge(page), 5952 + -nr_entries); 5965 5953 memcg_check_events(memcg, page); 5966 5954 5967 5955 if (!mem_cgroup_is_root(memcg))
+115 -20
mm/memory.c
··· 1513 1513 tlb_gather_mmu(&tlb, mm, start, end); 1514 1514 update_hiwater_rss(mm); 1515 1515 mmu_notifier_invalidate_range_start(mm, start, end); 1516 - for ( ; vma && vma->vm_start < end; vma = vma->vm_next) 1516 + for ( ; vma && vma->vm_start < end; vma = vma->vm_next) { 1517 1517 unmap_single_vma(&tlb, vma, start, end, NULL); 1518 + 1519 + /* 1520 + * zap_page_range does not specify whether mmap_sem should be 1521 + * held for read or write. That allows parallel zap_page_range 1522 + * operations to unmap a PTE and defer a flush meaning that 1523 + * this call observes pte_none and fails to flush the TLB. 1524 + * Rather than adding a complex API, ensure that no stale 1525 + * TLB entries exist when this call returns. 1526 + */ 1527 + flush_tlb_range(vma, start, end); 1528 + } 1529 + 1518 1530 mmu_notifier_invalidate_range_end(mm, start, end); 1519 1531 tlb_finish_mmu(&tlb, start, end); 1520 1532 } ··· 1688 1676 EXPORT_SYMBOL(vm_insert_page); 1689 1677 1690 1678 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, 1691 - pfn_t pfn, pgprot_t prot) 1679 + pfn_t pfn, pgprot_t prot, bool mkwrite) 1692 1680 { 1693 1681 struct mm_struct *mm = vma->vm_mm; 1694 1682 int retval; ··· 1700 1688 if (!pte) 1701 1689 goto out; 1702 1690 retval = -EBUSY; 1703 - if (!pte_none(*pte)) 1704 - goto out_unlock; 1691 + if (!pte_none(*pte)) { 1692 + if (mkwrite) { 1693 + /* 1694 + * For read faults on private mappings the PFN passed 1695 + * in may not match the PFN we have mapped if the 1696 + * mapped PFN is a writeable COW page. In the mkwrite 1697 + * case we are creating a writable PTE for a shared 1698 + * mapping and we expect the PFNs to match. 1699 + */ 1700 + if (WARN_ON_ONCE(pte_pfn(*pte) != pfn_t_to_pfn(pfn))) 1701 + goto out_unlock; 1702 + entry = *pte; 1703 + goto out_mkwrite; 1704 + } else 1705 + goto out_unlock; 1706 + } 1705 1707 1706 1708 /* Ok, finally just insert the thing.. 
*/ 1707 1709 if (pfn_t_devmap(pfn)) 1708 1710 entry = pte_mkdevmap(pfn_t_pte(pfn, prot)); 1709 1711 else 1710 1712 entry = pte_mkspecial(pfn_t_pte(pfn, prot)); 1713 + 1714 + out_mkwrite: 1715 + if (mkwrite) { 1716 + entry = pte_mkyoung(entry); 1717 + entry = maybe_mkwrite(pte_mkdirty(entry), vma); 1718 + } 1719 + 1711 1720 set_pte_at(mm, addr, pte, entry); 1712 1721 update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */ 1713 1722 ··· 1799 1766 1800 1767 track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV)); 1801 1768 1802 - ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot); 1769 + ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot, 1770 + false); 1803 1771 1804 1772 return ret; 1805 1773 } 1806 1774 EXPORT_SYMBOL(vm_insert_pfn_prot); 1807 1775 1808 - int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, 1809 - pfn_t pfn) 1776 + static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, 1777 + pfn_t pfn, bool mkwrite) 1810 1778 { 1811 1779 pgprot_t pgprot = vma->vm_page_prot; 1812 1780 ··· 1836 1802 page = pfn_to_page(pfn_t_to_pfn(pfn)); 1837 1803 return insert_page(vma, addr, page, pgprot); 1838 1804 } 1839 - return insert_pfn(vma, addr, pfn, pgprot); 1805 + return insert_pfn(vma, addr, pfn, pgprot, mkwrite); 1806 + } 1807 + 1808 + int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, 1809 + pfn_t pfn) 1810 + { 1811 + return __vm_insert_mixed(vma, addr, pfn, false); 1812 + 1840 1813 } 1841 1814 EXPORT_SYMBOL(vm_insert_mixed); 1815 + 1816 + int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr, 1817 + pfn_t pfn) 1818 + { 1819 + return __vm_insert_mixed(vma, addr, pfn, true); 1820 + } 1821 + EXPORT_SYMBOL(vm_insert_mixed_mkwrite); 1842 1822 1843 1823 /* 1844 1824 * maps a range of physical memory into the requested pages. the old ··· 2619 2571 * not dirty accountable. 
2620 2572 */ 2621 2573 if (PageAnon(vmf->page) && !PageKsm(vmf->page)) { 2622 - int total_mapcount; 2574 + int total_map_swapcount; 2623 2575 if (!trylock_page(vmf->page)) { 2624 2576 get_page(vmf->page); 2625 2577 pte_unmap_unlock(vmf->pte, vmf->ptl); ··· 2634 2586 } 2635 2587 put_page(vmf->page); 2636 2588 } 2637 - if (reuse_swap_page(vmf->page, &total_mapcount)) { 2638 - if (total_mapcount == 1) { 2589 + if (reuse_swap_page(vmf->page, &total_map_swapcount)) { 2590 + if (total_map_swapcount == 1) { 2639 2591 /* 2640 2592 * The page is all ours. Move it to 2641 2593 * our anon_vma so the rmap code will ··· 2752 2704 int do_swap_page(struct vm_fault *vmf) 2753 2705 { 2754 2706 struct vm_area_struct *vma = vmf->vma; 2755 - struct page *page, *swapcache; 2707 + struct page *page = NULL, *swapcache; 2756 2708 struct mem_cgroup *memcg; 2709 + struct vma_swap_readahead swap_ra; 2757 2710 swp_entry_t entry; 2758 2711 pte_t pte; 2759 2712 int locked; 2760 2713 int exclusive = 0; 2761 2714 int ret = 0; 2715 + bool vma_readahead = swap_use_vma_readahead(); 2762 2716 2763 - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) 2717 + if (vma_readahead) 2718 + page = swap_readahead_detect(vmf, &swap_ra); 2719 + if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) { 2720 + if (page) 2721 + put_page(page); 2764 2722 goto out; 2723 + } 2765 2724 2766 2725 entry = pte_to_swp_entry(vmf->orig_pte); 2767 2726 if (unlikely(non_swap_entry(entry))) { ··· 2784 2729 goto out; 2785 2730 } 2786 2731 delayacct_set_flag(DELAYACCT_PF_SWAPIN); 2787 - page = lookup_swap_cache(entry); 2732 + if (!page) 2733 + page = lookup_swap_cache(entry, vma_readahead ? 
vma : NULL, 2734 + vmf->address); 2788 2735 if (!page) { 2789 - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma, 2790 - vmf->address); 2736 + if (vma_readahead) 2737 + page = do_swap_page_readahead(entry, 2738 + GFP_HIGHUSER_MOVABLE, vmf, &swap_ra); 2739 + else 2740 + page = swapin_readahead(entry, 2741 + GFP_HIGHUSER_MOVABLE, vma, vmf->address); 2791 2742 if (!page) { 2792 2743 /* 2793 2744 * Back out if somebody else faulted in this pte ··· 4417 4356 } 4418 4357 } 4419 4358 void clear_huge_page(struct page *page, 4420 - unsigned long addr, unsigned int pages_per_huge_page) 4359 + unsigned long addr_hint, unsigned int pages_per_huge_page) 4421 4360 { 4422 - int i; 4361 + int i, n, base, l; 4362 + unsigned long addr = addr_hint & 4363 + ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); 4423 4364 4424 4365 if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { 4425 4366 clear_gigantic_page(page, addr, pages_per_huge_page); 4426 4367 return; 4427 4368 } 4428 4369 4370 + /* Clear sub-page to access last to keep its cache lines hot */ 4429 4371 might_sleep(); 4430 - for (i = 0; i < pages_per_huge_page; i++) { 4372 + n = (addr_hint - addr) / PAGE_SIZE; 4373 + if (2 * n <= pages_per_huge_page) { 4374 + /* If sub-page to access in first half of huge page */ 4375 + base = 0; 4376 + l = n; 4377 + /* Clear sub-pages at the end of huge page */ 4378 + for (i = pages_per_huge_page - 1; i >= 2 * n; i--) { 4379 + cond_resched(); 4380 + clear_user_highpage(page + i, addr + i * PAGE_SIZE); 4381 + } 4382 + } else { 4383 + /* If sub-page to access in second half of huge page */ 4384 + base = pages_per_huge_page - 2 * (pages_per_huge_page - n); 4385 + l = pages_per_huge_page - n; 4386 + /* Clear sub-pages at the begin of huge page */ 4387 + for (i = 0; i < base; i++) { 4388 + cond_resched(); 4389 + clear_user_highpage(page + i, addr + i * PAGE_SIZE); 4390 + } 4391 + } 4392 + /* 4393 + * Clear remaining sub-pages in left-right-left-right pattern 4394 + * towards 
the sub-page to access 4395 + */ 4396 + for (i = 0; i < l; i++) { 4397 + int left_idx = base + i; 4398 + int right_idx = base + 2 * l - 1 - i; 4399 + 4431 4400 cond_resched(); 4432 - clear_user_highpage(page + i, addr + i * PAGE_SIZE); 4401 + clear_user_highpage(page + left_idx, 4402 + addr + left_idx * PAGE_SIZE); 4403 + cond_resched(); 4404 + clear_user_highpage(page + right_idx, 4405 + addr + right_idx * PAGE_SIZE); 4433 4406 } 4434 4407 } 4435 4408
+40 -74
mm/memory_hotplug.c
··· 773 773 node_set_state(node, N_MEMORY); 774 774 } 775 775 776 - bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_pages, int online_type) 777 - { 778 - struct pglist_data *pgdat = NODE_DATA(nid); 779 - struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE]; 780 - struct zone *default_zone = default_zone_for_pfn(nid, pfn, nr_pages); 781 - 782 - /* 783 - * TODO there shouldn't be any inherent reason to have ZONE_NORMAL 784 - * physically before ZONE_MOVABLE. All we need is they do not 785 - * overlap. Historically we didn't allow ZONE_NORMAL after ZONE_MOVABLE 786 - * though so let's stick with it for simplicity for now. 787 - * TODO make sure we do not overlap with ZONE_DEVICE 788 - */ 789 - if (online_type == MMOP_ONLINE_KERNEL) { 790 - if (zone_is_empty(movable_zone)) 791 - return true; 792 - return movable_zone->zone_start_pfn >= pfn + nr_pages; 793 - } else if (online_type == MMOP_ONLINE_MOVABLE) { 794 - return zone_end_pfn(default_zone) <= pfn; 795 - } 796 - 797 - /* MMOP_ONLINE_KEEP will always succeed and inherits the current zone */ 798 - return online_type == MMOP_ONLINE_KEEP; 799 - } 800 - 801 776 static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn, 802 777 unsigned long nr_pages) 803 778 { ··· 831 856 * If no kernel zone covers this pfn range it will automatically go 832 857 * to the ZONE_NORMAL. 
833 858 */ 834 - struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn, 859 + static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn, 835 860 unsigned long nr_pages) 836 861 { 837 862 struct pglist_data *pgdat = NODE_DATA(nid); ··· 847 872 return &pgdat->node_zones[ZONE_NORMAL]; 848 873 } 849 874 850 - static inline bool movable_pfn_range(int nid, struct zone *default_zone, 851 - unsigned long start_pfn, unsigned long nr_pages) 875 + static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn, 876 + unsigned long nr_pages) 852 877 { 853 - if (!allow_online_pfn_range(nid, start_pfn, nr_pages, 854 - MMOP_ONLINE_KERNEL)) 855 - return true; 878 + struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn, 879 + nr_pages); 880 + struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; 881 + bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages); 882 + bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages); 856 883 857 - if (!movable_node_is_enabled()) 858 - return false; 884 + /* 885 + * We inherit the existing zone in a simple case where zones do not 886 + * overlap in the given range 887 + */ 888 + if (in_kernel ^ in_movable) 889 + return (in_kernel) ? kernel_zone : movable_zone; 859 890 860 - return !zone_intersects(default_zone, start_pfn, nr_pages); 891 + /* 892 + * If the range doesn't belong to any zone or two zones overlap in the 893 + * given range then we use movable zone only if movable_node is 894 + * enabled because we always online to a kernel zone by default. 895 + */ 896 + return movable_node_enabled ? 
movable_zone : kernel_zone; 897 + } 898 + 899 + struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, 900 + unsigned long nr_pages) 901 + { 902 + if (online_type == MMOP_ONLINE_KERNEL) 903 + return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages); 904 + 905 + if (online_type == MMOP_ONLINE_MOVABLE) 906 + return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE]; 907 + 908 + return default_zone_for_pfn(nid, start_pfn, nr_pages); 861 909 } 862 910 863 911 /* ··· 890 892 static struct zone * __meminit move_pfn_range(int online_type, int nid, 891 893 unsigned long start_pfn, unsigned long nr_pages) 892 894 { 893 - struct pglist_data *pgdat = NODE_DATA(nid); 894 - struct zone *zone = default_zone_for_pfn(nid, start_pfn, nr_pages); 895 + struct zone *zone; 895 896 896 - if (online_type == MMOP_ONLINE_KEEP) { 897 - struct zone *movable_zone = &pgdat->node_zones[ZONE_MOVABLE]; 898 - /* 899 - * MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use 900 - * movable zone if that is not possible (e.g. we are within 901 - * or past the existing movable zone). 
movable_node overrides 902 - * this default and defaults to movable zone 903 - */ 904 - if (movable_pfn_range(nid, zone, start_pfn, nr_pages)) 905 - zone = movable_zone; 906 - } else if (online_type == MMOP_ONLINE_MOVABLE) { 907 - zone = &pgdat->node_zones[ZONE_MOVABLE]; 908 - } 909 - 897 + zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages); 910 898 move_pfn_range_to_zone(zone, start_pfn, nr_pages); 911 899 return zone; 912 900 } 913 901 914 - /* Must be protected by mem_hotplug_begin() */ 902 + /* Must be protected by mem_hotplug_begin() or a device_lock */ 915 903 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type) 916 904 { 917 905 unsigned long flags; ··· 909 925 struct memory_notify arg; 910 926 911 927 nid = pfn_to_nid(pfn); 912 - if (!allow_online_pfn_range(nid, pfn, nr_pages, online_type)) 913 - return -EINVAL; 914 - 915 928 /* associate pfn range with the zone */ 916 929 zone = move_pfn_range(online_type, nid, pfn, nr_pages); 917 930 ··· 926 945 * This means the page allocator ignores this zone. 927 946 * So, zonelist must be updated after online. 928 947 */ 929 - mutex_lock(&zonelists_mutex); 930 948 if (!populated_zone(zone)) { 931 949 need_zonelists_rebuild = 1; 932 - build_all_zonelists(NULL, zone); 950 + setup_zone_pageset(zone); 933 951 } 934 952 935 953 ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages, ··· 936 956 if (ret) { 937 957 if (need_zonelists_rebuild) 938 958 zone_pcp_reset(zone); 939 - mutex_unlock(&zonelists_mutex); 940 959 goto failed_addition; 941 960 } 942 961 ··· 948 969 if (onlined_pages) { 949 970 node_states_set_node(nid, &arg); 950 971 if (need_zonelists_rebuild) 951 - build_all_zonelists(NULL, NULL); 972 + build_all_zonelists(NULL); 952 973 else 953 974 zone_pcp_update(zone); 954 975 } 955 - 956 - mutex_unlock(&zonelists_mutex); 957 976 958 977 init_per_zone_wmark_min(); 959 978 ··· 1023 1046 * The node we allocated has no zone fallback lists. 
For avoiding 1024 1047 * to access not-initialized zonelist, build here. 1025 1048 */ 1026 - mutex_lock(&zonelists_mutex); 1027 - build_all_zonelists(pgdat, NULL); 1028 - mutex_unlock(&zonelists_mutex); 1049 + build_all_zonelists(pgdat); 1029 1050 1030 1051 /* 1031 1052 * zone->managed_pages is set to an approximate value in ··· 1075 1100 node_set_online(nid); 1076 1101 ret = register_one_node(nid); 1077 1102 BUG_ON(ret); 1078 - 1079 - if (pgdat->node_zonelists->_zonerefs->zone == NULL) { 1080 - mutex_lock(&zonelists_mutex); 1081 - build_all_zonelists(NULL, NULL); 1082 - mutex_unlock(&zonelists_mutex); 1083 - } 1084 - 1085 1103 out: 1086 1104 mem_hotplug_done(); 1087 1105 return ret; ··· 1690 1722 1691 1723 if (!populated_zone(zone)) { 1692 1724 zone_pcp_reset(zone); 1693 - mutex_lock(&zonelists_mutex); 1694 - build_all_zonelists(NULL, NULL); 1695 - mutex_unlock(&zonelists_mutex); 1725 + build_all_zonelists(NULL); 1696 1726 } else 1697 1727 zone_pcp_update(zone); 1698 1728 ··· 1716 1750 return ret; 1717 1751 } 1718 1752 1719 - /* Must be protected by mem_hotplug_begin() */ 1753 + /* Must be protected by mem_hotplug_begin() or a device_lock */ 1720 1754 int offline_pages(unsigned long start_pfn, unsigned long nr_pages) 1721 1755 { 1722 1756 return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
+36 -10
mm/mmap.c
··· 44 44 #include <linux/userfaultfd_k.h> 45 45 #include <linux/moduleparam.h> 46 46 #include <linux/pkeys.h> 47 + #include <linux/oom.h> 47 48 48 49 #include <linux/uaccess.h> 49 50 #include <asm/cacheflush.h> ··· 2640 2639 if (vma->vm_start >= end) 2641 2640 return 0; 2642 2641 2643 - if (uf) { 2644 - int error = userfaultfd_unmap_prep(vma, start, end, uf); 2645 - 2646 - if (error) 2647 - return error; 2648 - } 2649 - 2650 2642 /* 2651 2643 * If we need to split any vma, do it now to save pain later. 2652 2644 * ··· 2672 2678 return error; 2673 2679 } 2674 2680 vma = prev ? prev->vm_next : mm->mmap; 2681 + 2682 + if (unlikely(uf)) { 2683 + /* 2684 + * If userfaultfd_unmap_prep returns an error the vmas 2685 + * will remain splitted, but userland will get a 2686 + * highly unexpected error anyway. This is no 2687 + * different than the case where the first of the two 2688 + * __split_vma fails, but we don't undo the first 2689 + * split, despite we could. This is unlikely enough 2690 + * failure that it's not worth optimizing it for. 2691 + */ 2692 + int error = userfaultfd_unmap_prep(vma, start, end, uf); 2693 + if (error) 2694 + return error; 2695 + } 2675 2696 2676 2697 /* 2677 2698 * unlock any mlock()ed ranges before detaching vmas ··· 3002 2993 /* Use -1 here to ensure all VMAs in the mm are unmapped */ 3003 2994 unmap_vmas(&tlb, vma, 0, -1); 3004 2995 2996 + set_bit(MMF_OOM_SKIP, &mm->flags); 2997 + if (unlikely(tsk_is_oom_victim(current))) { 2998 + /* 2999 + * Wait for oom_reap_task() to stop working on this 3000 + * mm. Because MMF_OOM_SKIP is already set before 3001 + * calling down_read(), oom_reap_task() will not run 3002 + * on this "mm" post up_write(). 3003 + * 3004 + * tsk_is_oom_victim() cannot be set from under us 3005 + * either because current->mm is already set to NULL 3006 + * under task_lock before calling mmput and oom_mm is 3007 + * set not NULL by the OOM killer only if current->mm 3008 + * is found not NULL while holding the task_lock. 
3009 + */ 3010 + down_write(&mm->mmap_sem); 3011 + up_write(&mm->mmap_sem); 3012 + } 3005 3013 free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); 3006 3014 tlb_finish_mmu(&tlb, 0, -1); 3007 3015 ··· 3540 3514 { 3541 3515 unsigned long free_kbytes; 3542 3516 3543 - free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3517 + free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3544 3518 3545 3519 sysctl_user_reserve_kbytes = min(free_kbytes / 32, 1UL << 17); 3546 3520 return 0; ··· 3561 3535 { 3562 3536 unsigned long free_kbytes; 3563 3537 3564 - free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3538 + free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3565 3539 3566 3540 sysctl_admin_reserve_kbytes = min(free_kbytes / 32, 1UL << 13); 3567 3541 return 0; ··· 3605 3579 3606 3580 break; 3607 3581 case MEM_OFFLINE: 3608 - free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3582 + free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 3609 3583 3610 3584 if (sysctl_user_reserve_kbytes > free_kbytes) { 3611 3585 init_user_reserve();
+13
mm/mremap.c
··· 384 384 if (!vma || vma->vm_start > addr) 385 385 return ERR_PTR(-EFAULT); 386 386 387 + /* 388 + * !old_len is a special case where an attempt is made to 'duplicate' 389 + * a mapping. This makes no sense for private mappings as it will 390 + * instead create a fresh/new mapping unrelated to the original. This 391 + * is contrary to the basic idea of mremap which creates new mappings 392 + * based on the original. There are no known use cases for this 393 + * behavior. As a result, fail such attempts. 394 + */ 395 + if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) { 396 + pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n", current->comm, current->pid); 397 + return ERR_PTR(-EINVAL); 398 + } 399 + 387 400 if (is_vm_hugetlb_page(vma)) 388 401 return ERR_PTR(-EINVAL); 389 402
+2 -2
mm/nommu.c
··· 1962 1962 { 1963 1963 unsigned long free_kbytes; 1964 1964 1965 - free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 1965 + free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 1966 1966 1967 1967 sysctl_user_reserve_kbytes = min(free_kbytes / 32, 1UL << 17); 1968 1968 return 0; ··· 1983 1983 { 1984 1984 unsigned long free_kbytes; 1985 1985 1986 - free_kbytes = global_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 1986 + free_kbytes = global_zone_page_state(NR_FREE_PAGES) << (PAGE_SHIFT - 10); 1987 1987 1988 1988 sysctl_admin_reserve_kbytes = min(free_kbytes / 32, 1UL << 13); 1989 1989 return 0;
+10 -14
mm/oom_kill.c
··· 495 495 } 496 496 497 497 /* 498 - * increase mm_users only after we know we will reap something so 499 - * that the mmput_async is called only when we have reaped something 500 - * and delayed __mmput doesn't matter that much 498 + * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't 499 + * work on the mm anymore. The check for MMF_OOM_SKIP must run 500 + * under mmap_sem for reading because it serializes against the 501 + * down_write();up_write() cycle in exit_mmap(). 501 502 */ 502 - if (!mmget_not_zero(mm)) { 503 + if (test_bit(MMF_OOM_SKIP, &mm->flags)) { 503 504 up_read(&mm->mmap_sem); 504 505 trace_skip_task_reaping(tsk->pid); 505 506 goto unlock_oom; ··· 543 542 K(get_mm_counter(mm, MM_SHMEMPAGES))); 544 543 up_read(&mm->mmap_sem); 545 544 546 - /* 547 - * Drop our reference but make sure the mmput slow path is called from a 548 - * different context because we shouldn't risk we get stuck there and 549 - * put the oom_reaper out of the way. 550 - */ 551 - mmput_async(mm); 552 545 trace_finish_task_reaping(tsk->pid); 553 546 unlock_oom: 554 547 mutex_unlock(&oom_lock); ··· 819 824 820 825 /* 821 826 * If the task is already exiting, don't alarm the sysadmin or kill 822 - * its children or threads, just set TIF_MEMDIE so it can die quickly 827 + * its children or threads, just give it access to memory reserves 828 + * so it can die quickly 823 829 */ 824 830 task_lock(p); 825 831 if (task_will_free_mem(p)) { ··· 885 889 count_memcg_event_mm(mm, OOM_KILL); 886 890 887 891 /* 888 - * We should send SIGKILL before setting TIF_MEMDIE in order to prevent 889 - * the OOM victim from depleting the memory reserves from the user 890 - * space under its control. 892 + * We should send SIGKILL before granting access to memory reserves 893 + * in order to prevent the OOM victim from depleting the memory 894 + * reserves from the user space under its control. 
891 895 */ 892 896 do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); 893 897 mark_oom_victim(victim);
+2 -2
mm/page-writeback.c
··· 363 363 { 364 364 unsigned long x; 365 365 366 - x = global_page_state(NR_FREE_PAGES); 366 + x = global_zone_page_state(NR_FREE_PAGES); 367 367 /* 368 368 * Pages reserved for the kernel should not be considered 369 369 * dirtyable, to prevent a situation where reclaim has to ··· 1405 1405 * will look to see if it needs to start dirty throttling. 1406 1406 * 1407 1407 * If dirty_poll_interval is too low, big NUMA machines will call the expensive 1408 - * global_page_state() too often. So scale it near-sqrt to the safety margin 1408 + * global_zone_page_state() too often. So scale it near-sqrt to the safety margin 1409 1409 * (the number of pages we may dirty without exceeding the dirty limits). 1410 1410 */ 1411 1411 static unsigned long dirty_poll_interval(unsigned long dirty,
+171 -275
mm/page_alloc.c
··· 2951 2951 { 2952 2952 long min = mark; 2953 2953 int o; 2954 - const bool alloc_harder = (alloc_flags & ALLOC_HARDER); 2954 + const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM)); 2955 2955 2956 2956 /* free_pages may go negative - that's OK */ 2957 2957 free_pages -= (1 << order) - 1; ··· 2964 2964 * the high-atomic reserves. This will over-estimate the size of the 2965 2965 * atomic reserve but it avoids a search. 2966 2966 */ 2967 - if (likely(!alloc_harder)) 2967 + if (likely(!alloc_harder)) { 2968 2968 free_pages -= z->nr_reserved_highatomic; 2969 - else 2970 - min -= min / 4; 2969 + } else { 2970 + /* 2971 + * OOM victims can try even harder than normal ALLOC_HARDER 2972 + * users on the grounds that it's definitely going to be in 2973 + * the exit path shortly and free memory. Any allocation it 2974 + * makes during the free path will be small and short-lived. 2975 + */ 2976 + if (alloc_flags & ALLOC_OOM) 2977 + min -= min / 2; 2978 + else 2979 + min -= min / 4; 2980 + } 2981 + 2971 2982 2972 2983 #ifdef CONFIG_CMA 2973 2984 /* If allocation can't use CMA areas don't use free CMA pages */ ··· 3216 3205 * of allowed nodes. 
3217 3206 */ 3218 3207 if (!(gfp_mask & __GFP_NOMEMALLOC)) 3219 - if (test_thread_flag(TIF_MEMDIE) || 3208 + if (tsk_is_oom_victim(current) || 3220 3209 (current->flags & (PF_MEMALLOC | PF_EXITING))) 3221 3210 filter &= ~SHOW_MEM_FILTER_NODES; 3222 3211 if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM)) ··· 3679 3668 return alloc_flags; 3680 3669 } 3681 3670 3682 - bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) 3671 + static bool oom_reserves_allowed(struct task_struct *tsk) 3683 3672 { 3684 - if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) 3673 + if (!tsk_is_oom_victim(tsk)) 3685 3674 return false; 3686 3675 3687 - if (gfp_mask & __GFP_MEMALLOC) 3688 - return true; 3689 - if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) 3690 - return true; 3691 - if (!in_interrupt() && 3692 - ((current->flags & PF_MEMALLOC) || 3693 - unlikely(test_thread_flag(TIF_MEMDIE)))) 3694 - return true; 3676 + /* 3677 + * !MMU doesn't have oom reaper so give access to memory reserves 3678 + * only to the thread with TIF_MEMDIE set 3679 + */ 3680 + if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE)) 3681 + return false; 3695 3682 3696 - return false; 3683 + return true; 3684 + } 3685 + 3686 + /* 3687 + * Distinguish requests which really need access to full memory 3688 + * reserves from oom victims which can live with a portion of it 3689 + */ 3690 + static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask) 3691 + { 3692 + if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) 3693 + return 0; 3694 + if (gfp_mask & __GFP_MEMALLOC) 3695 + return ALLOC_NO_WATERMARKS; 3696 + if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) 3697 + return ALLOC_NO_WATERMARKS; 3698 + if (!in_interrupt()) { 3699 + if (current->flags & PF_MEMALLOC) 3700 + return ALLOC_NO_WATERMARKS; 3701 + else if (oom_reserves_allowed(current)) 3702 + return ALLOC_OOM; 3703 + } 3704 + 3705 + return 0; 3706 + } 3707 + 3708 + bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) 3709 + { 3710 + return 
!!__gfp_pfmemalloc_flags(gfp_mask); 3697 3711 } 3698 3712 3699 3713 /* ··· 3871 3835 unsigned long alloc_start = jiffies; 3872 3836 unsigned int stall_timeout = 10 * HZ; 3873 3837 unsigned int cpuset_mems_cookie; 3838 + int reserve_flags; 3874 3839 3875 3840 /* 3876 3841 * In the slowpath, we sanity check order to avoid ever trying to ··· 3977 3940 if (gfp_mask & __GFP_KSWAPD_RECLAIM) 3978 3941 wake_all_kswapds(order, ac); 3979 3942 3980 - if (gfp_pfmemalloc_allowed(gfp_mask)) 3981 - alloc_flags = ALLOC_NO_WATERMARKS; 3943 + reserve_flags = __gfp_pfmemalloc_flags(gfp_mask); 3944 + if (reserve_flags) 3945 + alloc_flags = reserve_flags; 3982 3946 3983 3947 /* 3984 3948 * Reset the zonelist iterators if memory policies can be ignored. 3985 3949 * These allocations are high priority and system rather than user 3986 3950 * orientated. 3987 3951 */ 3988 - if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) { 3952 + if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) { 3989 3953 ac->zonelist = node_zonelist(numa_node_id(), gfp_mask); 3990 3954 ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, 3991 3955 ac->high_zoneidx, ac->nodemask); ··· 4063 4025 goto got_pg; 4064 4026 4065 4027 /* Avoid allocations with no watermarks from looping endlessly */ 4066 - if (test_thread_flag(TIF_MEMDIE) && 4067 - (alloc_flags == ALLOC_NO_WATERMARKS || 4028 + if (tsk_is_oom_victim(current) && 4029 + (alloc_flags == ALLOC_OOM || 4068 4030 (gfp_mask & __GFP_NOMEMALLOC))) 4069 4031 goto nopage; 4070 4032 ··· 4547 4509 * Estimate the amount of memory available for userspace allocations, 4548 4510 * without causing swapping. 
4549 4511 */ 4550 - available = global_page_state(NR_FREE_PAGES) - totalreserve_pages; 4512 + available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages; 4551 4513 4552 4514 /* 4553 4515 * Not all the page cache can be freed, otherwise the system will ··· 4576 4538 { 4577 4539 val->totalram = totalram_pages; 4578 4540 val->sharedram = global_node_page_state(NR_SHMEM); 4579 - val->freeram = global_page_state(NR_FREE_PAGES); 4541 + val->freeram = global_zone_page_state(NR_FREE_PAGES); 4580 4542 val->bufferram = nr_blockdev_pages(); 4581 4543 val->totalhigh = totalhigh_pages; 4582 4544 val->freehigh = nr_free_highpages(); ··· 4711 4673 global_node_page_state(NR_SLAB_UNRECLAIMABLE), 4712 4674 global_node_page_state(NR_FILE_MAPPED), 4713 4675 global_node_page_state(NR_SHMEM), 4714 - global_page_state(NR_PAGETABLE), 4715 - global_page_state(NR_BOUNCE), 4716 - global_page_state(NR_FREE_PAGES), 4676 + global_zone_page_state(NR_PAGETABLE), 4677 + global_zone_page_state(NR_BOUNCE), 4678 + global_zone_page_state(NR_FREE_PAGES), 4717 4679 free_pcp, 4718 - global_page_state(NR_FREE_CMA_PAGES)); 4680 + global_zone_page_state(NR_FREE_CMA_PAGES)); 4719 4681 4720 4682 for_each_online_pgdat(pgdat) { 4721 4683 if (show_mem_node_skip(filter, pgdat->node_id, nodemask)) ··· 4877 4839 * 4878 4840 * Add all populated zones of a node to the zonelist. 
4879 4841 */ 4880 - static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, 4881 - int nr_zones) 4842 + static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs) 4882 4843 { 4883 4844 struct zone *zone; 4884 4845 enum zone_type zone_type = MAX_NR_ZONES; 4846 + int nr_zones = 0; 4885 4847 4886 4848 do { 4887 4849 zone_type--; 4888 4850 zone = pgdat->node_zones + zone_type; 4889 4851 if (managed_zone(zone)) { 4890 - zoneref_set_zone(zone, 4891 - &zonelist->_zonerefs[nr_zones++]); 4852 + zoneref_set_zone(zone, &zonerefs[nr_zones++]); 4892 4853 check_highest_zone(zone_type); 4893 4854 } 4894 4855 } while (zone_type); ··· 4895 4858 return nr_zones; 4896 4859 } 4897 4860 4898 - 4899 - /* 4900 - * zonelist_order: 4901 - * 0 = automatic detection of better ordering. 4902 - * 1 = order by ([node] distance, -zonetype) 4903 - * 2 = order by (-zonetype, [node] distance) 4904 - * 4905 - * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create 4906 - * the same zonelist. So only NUMA can configure this param. 4907 - */ 4908 - #define ZONELIST_ORDER_DEFAULT 0 4909 - #define ZONELIST_ORDER_NODE 1 4910 - #define ZONELIST_ORDER_ZONE 2 4911 - 4912 - /* zonelist order in the kernel. 4913 - * set_zonelist_order() will set this to NODE or ZONE. 4914 - */ 4915 - static int current_zonelist_order = ZONELIST_ORDER_DEFAULT; 4916 - static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"}; 4917 - 4918 - 4919 4861 #ifdef CONFIG_NUMA 4920 - /* The value user specified ....changed by config */ 4921 - static int user_zonelist_order = ZONELIST_ORDER_DEFAULT; 4922 - /* string for sysctl */ 4923 - #define NUMA_ZONELIST_ORDER_LEN 16 4924 - char numa_zonelist_order[16] = "default"; 4925 - 4926 - /* 4927 - * interface for configure zonelist ordering. 4928 - * command line option "numa_zonelist_order" 4929 - * = "[dD]efault - default, automatic configuration. 
4930 - * = "[nN]ode - order by node locality, then by zone within node 4931 - * = "[zZ]one - order by zone, then by locality within zone 4932 - */ 4933 4862 4934 4863 static int __parse_numa_zonelist_order(char *s) 4935 4864 { 4936 - if (*s == 'd' || *s == 'D') { 4937 - user_zonelist_order = ZONELIST_ORDER_DEFAULT; 4938 - } else if (*s == 'n' || *s == 'N') { 4939 - user_zonelist_order = ZONELIST_ORDER_NODE; 4940 - } else if (*s == 'z' || *s == 'Z') { 4941 - user_zonelist_order = ZONELIST_ORDER_ZONE; 4942 - } else { 4943 - pr_warn("Ignoring invalid numa_zonelist_order value: %s\n", s); 4865 + /* 4866 + * We used to support different zonelist modes but they turned 4867 + * out to be just not useful. Let's keep the warning in place 4868 + * if somebody still uses the cmd line parameter so that we do 4869 + * not fail it silently 4870 + */ 4871 + if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) { 4872 + pr_warn("Ignoring unsupported numa_zonelist_order value: %s\n", s); 4944 4873 return -EINVAL; 4945 4874 } 4946 4875 return 0; ··· 4914 4911 4915 4912 static __init int setup_numa_zonelist_order(char *s) 4916 4913 { 4917 - int ret; 4918 - 4919 4914 if (!s) 4920 4915 return 0; 4921 4916 4922 - ret = __parse_numa_zonelist_order(s); 4923 - if (ret == 0) 4924 - strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN); 4925 - 4926 - return ret; 4917 + return __parse_numa_zonelist_order(s); 4927 4918 } 4928 4919 early_param("numa_zonelist_order", setup_numa_zonelist_order); 4920 + 4921 + char numa_zonelist_order[] = "Node"; 4929 4922 4930 4923 /* 4931 4924 * sysctl handler for numa_zonelist_order ··· 4930 4931 void __user *buffer, size_t *length, 4931 4932 loff_t *ppos) 4932 4933 { 4933 - char saved_string[NUMA_ZONELIST_ORDER_LEN]; 4934 + char *str; 4934 4935 int ret; 4935 - static DEFINE_MUTEX(zl_order_mutex); 4936 4936 4937 - mutex_lock(&zl_order_mutex); 4938 - if (write) { 4939 - if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) { 4940 - ret = -EINVAL; 4941
- goto out; 4942 - } 4943 - strcpy(saved_string, (char *)table->data); 4944 - } 4945 - ret = proc_dostring(table, write, buffer, length, ppos); 4946 - if (ret) 4947 - goto out; 4948 - if (write) { 4949 - int oldval = user_zonelist_order; 4937 + if (!write) 4938 + return proc_dostring(table, write, buffer, length, ppos); 4939 + str = memdup_user_nul(buffer, 16); 4940 + if (IS_ERR(str)) 4941 + return PTR_ERR(str); 4950 4942 4951 - ret = __parse_numa_zonelist_order((char *)table->data); 4952 - if (ret) { 4953 - /* 4954 - * bogus value. restore saved string 4955 - */ 4956 - strncpy((char *)table->data, saved_string, 4957 - NUMA_ZONELIST_ORDER_LEN); 4958 - user_zonelist_order = oldval; 4959 - } else if (oldval != user_zonelist_order) { 4960 - mem_hotplug_begin(); 4961 - mutex_lock(&zonelists_mutex); 4962 - build_all_zonelists(NULL, NULL); 4963 - mutex_unlock(&zonelists_mutex); 4964 - mem_hotplug_done(); 4965 - } 4966 - } 4967 - out: 4968 - mutex_unlock(&zl_order_mutex); 4943 + ret = __parse_numa_zonelist_order(str); 4944 + kfree(str); 4969 4945 return ret; 4970 4946 } 4971 4947 ··· 5014 5040 * This results in maximum locality--normal zone overflows into local 5015 5041 * DMA zone, if any--but risks exhausting DMA zone. 
5016 5042 */ 5017 - static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) 5043 + static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, 5044 + unsigned nr_nodes) 5018 5045 { 5019 - int j; 5020 - struct zonelist *zonelist; 5046 + struct zoneref *zonerefs; 5047 + int i; 5021 5048 5022 - zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK]; 5023 - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) 5024 - ; 5025 - j = build_zonelists_node(NODE_DATA(node), zonelist, j); 5026 - zonelist->_zonerefs[j].zone = NULL; 5027 - zonelist->_zonerefs[j].zone_idx = 0; 5049 + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; 5050 + 5051 + for (i = 0; i < nr_nodes; i++) { 5052 + int nr_zones; 5053 + 5054 + pg_data_t *node = NODE_DATA(node_order[i]); 5055 + 5056 + nr_zones = build_zonerefs_node(node, zonerefs); 5057 + zonerefs += nr_zones; 5058 + } 5059 + zonerefs->zone = NULL; 5060 + zonerefs->zone_idx = 0; 5028 5061 } 5029 5062 5030 5063 /* ··· 5039 5058 */ 5040 5059 static void build_thisnode_zonelists(pg_data_t *pgdat) 5041 5060 { 5042 - int j; 5043 - struct zonelist *zonelist; 5061 + struct zoneref *zonerefs; 5062 + int nr_zones; 5044 5063 5045 - zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK]; 5046 - j = build_zonelists_node(pgdat, zonelist, 0); 5047 - zonelist->_zonerefs[j].zone = NULL; 5048 - zonelist->_zonerefs[j].zone_idx = 0; 5064 + zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs; 5065 + nr_zones = build_zonerefs_node(pgdat, zonerefs); 5066 + zonerefs += nr_zones; 5067 + zonerefs->zone = NULL; 5068 + zonerefs->zone_idx = 0; 5049 5069 } 5050 5070 5051 5071 /* ··· 5055 5073 * exhausted, but results in overflowing to remote node while memory 5056 5074 * may still exist in local DMA zone. 
5057 5075 */ 5058 - static int node_order[MAX_NUMNODES]; 5059 - 5060 - static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes) 5061 - { 5062 - int pos, j, node; 5063 - int zone_type; /* needs to be signed */ 5064 - struct zone *z; 5065 - struct zonelist *zonelist; 5066 - 5067 - zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK]; 5068 - pos = 0; 5069 - for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) { 5070 - for (j = 0; j < nr_nodes; j++) { 5071 - node = node_order[j]; 5072 - z = &NODE_DATA(node)->node_zones[zone_type]; 5073 - if (managed_zone(z)) { 5074 - zoneref_set_zone(z, 5075 - &zonelist->_zonerefs[pos++]); 5076 - check_highest_zone(zone_type); 5077 - } 5078 - } 5079 - } 5080 - zonelist->_zonerefs[pos].zone = NULL; 5081 - zonelist->_zonerefs[pos].zone_idx = 0; 5082 - } 5083 - 5084 - #if defined(CONFIG_64BIT) 5085 - /* 5086 - * Devices that require DMA32/DMA are relatively rare and do not justify a 5087 - * penalty to every machine in case the specialised case applies. Default 5088 - * to Node-ordering on 64-bit NUMA machines 5089 - */ 5090 - static int default_zonelist_order(void) 5091 - { 5092 - return ZONELIST_ORDER_NODE; 5093 - } 5094 - #else 5095 - /* 5096 - * On 32-bit, the Normal zone needs to be preserved for allocations accessible 5097 - * by the kernel. If processes running on node 0 deplete the low memory zone 5098 - * then reclaim will occur more frequency increasing stalls and potentially 5099 - * be easier to OOM if a large percentage of the zone is under writeback or 5100 - * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set. 5101 - * Hence, default to zone ordering on 32-bit. 
5102 - */ 5103 - static int default_zonelist_order(void) 5104 - { 5105 - return ZONELIST_ORDER_ZONE; 5106 - } 5107 - #endif /* CONFIG_64BIT */ 5108 - 5109 - static void set_zonelist_order(void) 5110 - { 5111 - if (user_zonelist_order == ZONELIST_ORDER_DEFAULT) 5112 - current_zonelist_order = default_zonelist_order(); 5113 - else 5114 - current_zonelist_order = user_zonelist_order; 5115 - } 5116 5076 5117 5077 static void build_zonelists(pg_data_t *pgdat) 5118 5078 { 5119 - int i, node, load; 5079 + static int node_order[MAX_NUMNODES]; 5080 + int node, load, nr_nodes = 0; 5120 5081 nodemask_t used_mask; 5121 5082 int local_node, prev_node; 5122 - struct zonelist *zonelist; 5123 - unsigned int order = current_zonelist_order; 5124 - 5125 - /* initialize zonelists */ 5126 - for (i = 0; i < MAX_ZONELISTS; i++) { 5127 - zonelist = pgdat->node_zonelists + i; 5128 - zonelist->_zonerefs[0].zone = NULL; 5129 - zonelist->_zonerefs[0].zone_idx = 0; 5130 - } 5131 5083 5132 5084 /* NUMA-aware ordering of nodes */ 5133 5085 local_node = pgdat->node_id; ··· 5070 5154 nodes_clear(used_mask); 5071 5155 5072 5156 memset(node_order, 0, sizeof(node_order)); 5073 - i = 0; 5074 - 5075 5157 while ((node = find_next_best_node(local_node, &used_mask)) >= 0) { 5076 5158 /* 5077 5159 * We don't want to pressure a particular node. ··· 5080 5166 node_distance(local_node, prev_node)) 5081 5167 node_load[node] = load; 5082 5168 5169 + node_order[nr_nodes++] = node; 5083 5170 prev_node = node; 5084 5171 load--; 5085 - if (order == ZONELIST_ORDER_NODE) 5086 - build_zonelists_in_node_order(pgdat, node); 5087 - else 5088 - node_order[i++] = node; /* remember order */ 5089 5172 } 5090 5173 5091 - if (order == ZONELIST_ORDER_ZONE) { 5092 - /* calculate node order -- i.e., DMA last! 
*/ 5093 - build_zonelists_in_zone_order(pgdat, i); 5094 - } 5095 - 5174 + build_zonelists_in_node_order(pgdat, node_order, nr_nodes); 5096 5175 build_thisnode_zonelists(pgdat); 5097 5176 } 5098 5177 ··· 5111 5204 static void setup_min_slab_ratio(void); 5112 5205 #else /* CONFIG_NUMA */ 5113 5206 5114 - static void set_zonelist_order(void) 5115 - { 5116 - current_zonelist_order = ZONELIST_ORDER_ZONE; 5117 - } 5118 - 5119 5207 static void build_zonelists(pg_data_t *pgdat) 5120 5208 { 5121 5209 int node, local_node; 5122 - enum zone_type j; 5123 - struct zonelist *zonelist; 5210 + struct zoneref *zonerefs; 5211 + int nr_zones; 5124 5212 5125 5213 local_node = pgdat->node_id; 5126 5214 5127 - zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK]; 5128 - j = build_zonelists_node(pgdat, zonelist, 0); 5215 + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; 5216 + nr_zones = build_zonerefs_node(pgdat, zonerefs); 5217 + zonerefs += nr_zones; 5129 5218 5130 5219 /* 5131 5220 * Now we build the zonelist so that it contains the zones ··· 5134 5231 for (node = local_node + 1; node < MAX_NUMNODES; node++) { 5135 5232 if (!node_online(node)) 5136 5233 continue; 5137 - j = build_zonelists_node(NODE_DATA(node), zonelist, j); 5234 + nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs); 5235 + zonerefs += nr_zones; 5138 5236 } 5139 5237 for (node = 0; node < local_node; node++) { 5140 5238 if (!node_online(node)) 5141 5239 continue; 5142 - j = build_zonelists_node(NODE_DATA(node), zonelist, j); 5240 + nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs); 5241 + zonerefs += nr_zones; 5143 5242 } 5144 5243 5145 - zonelist->_zonerefs[j].zone = NULL; 5146 - zonelist->_zonerefs[j].zone_idx = 0; 5244 + zonerefs->zone = NULL; 5245 + zonerefs->zone_idx = 0; 5147 5246 } 5148 5247 5149 5248 #endif /* CONFIG_NUMA */ ··· 5168 5263 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch); 5169 5264 static DEFINE_PER_CPU(struct per_cpu_pageset, 
boot_pageset); 5170 5265 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); 5171 - static void setup_zone_pageset(struct zone *zone); 5172 5266 5173 - /* 5174 - * Global mutex to protect against size modification of zonelists 5175 - * as well as to serialize pageset setup for the new populated zone. 5176 - */ 5177 - DEFINE_MUTEX(zonelists_mutex); 5178 - 5179 - /* return values int ....just for stop_machine() */ 5180 - static int __build_all_zonelists(void *data) 5267 + static void __build_all_zonelists(void *data) 5181 5268 { 5182 5269 int nid; 5183 - int cpu; 5270 + int __maybe_unused cpu; 5184 5271 pg_data_t *self = data; 5272 + static DEFINE_SPINLOCK(lock); 5273 + 5274 + spin_lock(&lock); 5185 5275 5186 5276 #ifdef CONFIG_NUMA 5187 5277 memset(node_load, 0, sizeof(node_load)); 5188 5278 #endif 5189 5279 5280 + /* 5281 + * This node is hotadded and no memory is yet present. So just 5282 + * building zonelists is fine - no need to touch other nodes. 5283 + */ 5190 5284 if (self && !node_online(self->node_id)) { 5191 5285 build_zonelists(self); 5286 + } else { 5287 + for_each_online_node(nid) { 5288 + pg_data_t *pgdat = NODE_DATA(nid); 5289 + 5290 + build_zonelists(pgdat); 5291 + } 5292 + 5293 + #ifdef CONFIG_HAVE_MEMORYLESS_NODES 5294 + /* 5295 + * We now know the "local memory node" for each node-- 5296 + * i.e., the node of the first zone in the generic zonelist. 5297 + * Set up numa_mem percpu variable for on-line cpus. During 5298 + * boot, only the boot cpu should be on-line; we'll init the 5299 + * secondary cpus' numa_mem as they come on-line. During 5300 + * node/memory hotplug, we'll fixup all on-line cpus. 
5301 + */ 5302 + for_each_online_cpu(cpu) 5303 + set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu))); 5304 + #endif 5192 5305 } 5193 5306 5194 - for_each_online_node(nid) { 5195 - pg_data_t *pgdat = NODE_DATA(nid); 5307 + spin_unlock(&lock); 5308 + } 5196 5309 5197 - build_zonelists(pgdat); 5198 - } 5310 + static noinline void __init 5311 + build_all_zonelists_init(void) 5312 + { 5313 + int cpu; 5314 + 5315 + __build_all_zonelists(NULL); 5199 5316 5200 5317 /* 5201 5318 * Initialize the boot_pagesets that are going to be used ··· 5232 5305 * needs the percpu allocator in order to allocate its pagesets 5233 5306 * (a chicken-egg dilemma). 5234 5307 */ 5235 - for_each_possible_cpu(cpu) { 5308 + for_each_possible_cpu(cpu) 5236 5309 setup_pageset(&per_cpu(boot_pageset, cpu), 0); 5237 5310 5238 - #ifdef CONFIG_HAVE_MEMORYLESS_NODES 5239 - /* 5240 - * We now know the "local memory node" for each node-- 5241 - * i.e., the node of the first zone in the generic zonelist. 5242 - * Set up numa_mem percpu variable for on-line cpus. During 5243 - * boot, only the boot cpu should be on-line; we'll init the 5244 - * secondary cpus' numa_mem as they come on-line. During 5245 - * node/memory hotplug, we'll fixup all on-line cpus. 5246 - */ 5247 - if (cpu_online(cpu)) 5248 - set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu))); 5249 - #endif 5250 - } 5251 - 5252 - return 0; 5253 - } 5254 - 5255 - static noinline void __init 5256 - build_all_zonelists_init(void) 5257 - { 5258 - __build_all_zonelists(NULL); 5259 5311 mminit_verify_zonelist(); 5260 5312 cpuset_init_current_mems_allowed(); 5261 5313 } 5262 5314 5263 5315 /* 5264 - * Called with zonelists_mutex held always 5265 5316 * unless system_state == SYSTEM_BOOTING. 
5266 5317 * 5267 - * __ref due to (1) call of __meminit annotated setup_zone_pageset 5268 - * [we're only called with non-NULL zone through __meminit paths] and 5269 - * (2) call of __init annotated helper build_all_zonelists_init 5318 + * __ref due to call of __init annotated helper build_all_zonelists_init 5270 5319 * [protected by SYSTEM_BOOTING]. 5271 5320 */ 5272 - void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) 5321 + void __ref build_all_zonelists(pg_data_t *pgdat) 5273 5322 { 5274 - set_zonelist_order(); 5275 - 5276 5323 if (system_state == SYSTEM_BOOTING) { 5277 5324 build_all_zonelists_init(); 5278 5325 } else { 5279 - #ifdef CONFIG_MEMORY_HOTPLUG 5280 - if (zone) 5281 - setup_zone_pageset(zone); 5282 - #endif 5283 - /* we have to stop all cpus to guarantee there is no user 5284 - of zonelist */ 5285 - stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL); 5326 + __build_all_zonelists(pgdat); 5286 5327 /* cpuset refresh routine should be here */ 5287 5328 } 5288 5329 vm_total_pages = nr_free_pagecache_pages(); ··· 5266 5371 else 5267 5372 page_group_by_mobility_disabled = 0; 5268 5373 5269 - pr_info("Built %i zonelists in %s order, mobility grouping %s. Total pages: %ld\n", 5374 + pr_info("Built %i zonelists, mobility grouping %s. Total pages: %ld\n", 5270 5375 nr_online_nodes, 5271 - zonelist_order_name[current_zonelist_order], 5272 5376 page_group_by_mobility_disabled ? 
"off" : "on", 5273 5377 vm_total_pages); 5274 5378 #ifdef CONFIG_NUMA ··· 5521 5627 pageset_set_high_and_batch(zone, pcp); 5522 5628 } 5523 5629 5524 - static void __meminit setup_zone_pageset(struct zone *zone) 5630 + void __meminit setup_zone_pageset(struct zone *zone) 5525 5631 { 5526 5632 int cpu; 5527 5633 zone->pageset = alloc_percpu(struct per_cpu_pageset); ··· 6975 7081 */ 6976 7082 void setup_per_zone_wmarks(void) 6977 7083 { 6978 - mutex_lock(&zonelists_mutex); 7084 + static DEFINE_SPINLOCK(lock); 7085 + 7086 + spin_lock(&lock); 6979 7087 __setup_per_zone_wmarks(); 6980 - mutex_unlock(&zonelists_mutex); 7088 + spin_unlock(&lock); 6981 7089 } 6982 7090 6983 7091 /*
+2 -4
mm/page_ext.c
··· 222 222 return addr; 223 223 } 224 224 225 - if (node_state(nid, N_HIGH_MEMORY)) 226 - addr = vzalloc_node(size, nid); 227 - else 228 - addr = vzalloc(size); 225 + addr = vzalloc_node(size, nid); 229 226 230 227 return addr; 231 228 } ··· 406 409 continue; 407 410 if (init_section_page_ext(pfn, nid)) 408 411 goto oom; 412 + cond_resched(); 409 413 } 410 414 } 411 415 hotplug_memory_notifier(page_ext_callback, 0);
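The `cond_resched()` added to the page_ext init loop above keeps a long boot-time walk over memory sections from monopolizing the CPU. A minimal userspace analogue of that pattern (the section count, the yield interval, and `sched_yield()` standing in for `cond_resched()` are all illustrative assumptions, not the kernel API):

```c
#include <sched.h>

/* Userspace sketch of the cond_resched() pattern: a long walk over
 * many sections yields the CPU periodically instead of running the
 * whole loop in one uninterrupted burst. */
static unsigned long init_sections(unsigned long nr_sections)
{
	unsigned long done = 0;

	for (unsigned long i = 0; i < nr_sections; i++) {
		done++;			/* stand-in for init_section_page_ext() */
		if ((i & 1023) == 0)
			sched_yield();	/* stand-in for cond_resched() */
	}
	return done;
}
```

The yield costs almost nothing per iteration but bounds the scheduling latency the loop can inflict on other tasks.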
+1 -1
mm/page_idle.c
··· 204 204 NULL, 205 205 }; 206 206 207 - static struct attribute_group page_idle_attr_group = { 207 + static const struct attribute_group page_idle_attr_group = { 208 208 .bin_attrs = page_idle_bin_attrs, 209 209 .name = "page_idle", 210 210 };
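The one-line page_idle change above const-qualifies a registration table that is never written after setup, letting the compiler place it in read-only data and reject accidental stores. A hedged sketch of the same pattern with invented names (`attr_group_sketch` is not the kernel's `struct attribute_group`, just an illustration):

```c
/* Sketch: a registration table handed to an API but never modified
 * afterwards can be declared const; consumers only need read access. */
struct attr_group_sketch {
	const char *name;
	int (*show)(void);
};

static int show_page_idle(void)
{
	return 1;	/* illustrative attribute value */
}

static const struct attr_group_sketch page_idle_group = {
	.name = "page_idle",
	.show = show_page_idle,
};

static const char *group_name(const struct attr_group_sketch *g)
{
	return g->name;	/* read-only access is all a registrant needs */
}
```

Any attempt to assign through `page_idle_group` now fails at compile time instead of silently corrupting a shared table at run time.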
+16 -5
mm/page_io.c
··· 28 28 static struct bio *get_swap_bio(gfp_t gfp_flags, 29 29 struct page *page, bio_end_io_t end_io) 30 30 { 31 + int i, nr = hpage_nr_pages(page); 31 32 struct bio *bio; 32 33 33 - bio = bio_alloc(gfp_flags, 1); 34 + bio = bio_alloc(gfp_flags, nr); 34 35 if (bio) { 35 36 bio->bi_iter.bi_sector = map_swap_page(page, &bio->bi_bdev); 36 37 bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9; 37 38 bio->bi_end_io = end_io; 38 39 39 - bio_add_page(bio, page, PAGE_SIZE, 0); 40 - BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE); 40 + for (i = 0; i < nr; i++) 41 + bio_add_page(bio, page + i, PAGE_SIZE, 0); 42 + VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr); 41 43 } 42 44 return bio; 43 45 } ··· 264 262 return (sector_t)__page_file_index(page) << (PAGE_SHIFT - 9); 265 263 } 266 264 265 + static inline void count_swpout_vm_event(struct page *page) 266 + { 267 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 268 + if (unlikely(PageTransHuge(page))) 269 + count_vm_event(THP_SWPOUT); 270 + #endif 271 + count_vm_events(PSWPOUT, hpage_nr_pages(page)); 272 + } 273 + 267 274 int __swap_writepage(struct page *page, struct writeback_control *wbc, 268 275 bio_end_io_t end_write_func) 269 276 { ··· 324 313 325 314 ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc); 326 315 if (!ret) { 327 - count_vm_event(PSWPOUT); 316 + count_swpout_vm_event(page); 328 317 return 0; 329 318 } 330 319 ··· 337 326 goto out; 338 327 } 339 328 bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc); 340 - count_vm_event(PSWPOUT); 329 + count_swpout_vm_event(page); 341 330 set_page_writeback(page); 342 331 unlock_page(page); 343 332 submit_bio(bio);
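The page_io.c hunk above sizes the swap bio by `hpage_nr_pages(page)` and adds one segment per base page, so a transparent huge page is written as a single multi-segment request. A self-contained sketch of that sizing logic (the struct and helpers are illustrative stand-ins, not the kernel bio API; 4096-byte pages and a 512-page huge page are assumed x86-64 values):

```c
#define PAGE_SIZE_SKETCH 4096u
#define HPAGE_NR 512u	/* 2 MiB huge page = 512 base pages on x86-64 */

/* Toy I/O descriptor: counts segments and total bytes, mirroring the
 * invariant VM_BUG_ON checks in the patch (size == nr * PAGE_SIZE). */
struct bio_sketch {
	unsigned int nr_segs;
	unsigned long total_bytes;
};

static void bio_add_page_sketch(struct bio_sketch *bio, unsigned int len)
{
	bio->nr_segs++;
	bio->total_bytes += len;
}

static struct bio_sketch build_swap_bio(int is_huge)
{
	struct bio_sketch bio = { 0, 0 };
	unsigned int nr = is_huge ? HPAGE_NR : 1;

	/* One segment per base page of the (possibly huge) page. */
	for (unsigned int i = 0; i < nr; i++)
		bio_add_page_sketch(&bio, PAGE_SIZE_SKETCH);
	return bio;
}
```

The point of the real change is the same invariant the sketch maintains: after adding every subpage, the bio's size must equal `nr * PAGE_SIZE`, which is what the patched `VM_BUG_ON` asserts.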
+42 -26
mm/page_owner.c
··· 30 30 31 31 static depot_stack_handle_t dummy_handle; 32 32 static depot_stack_handle_t failure_handle; 33 + static depot_stack_handle_t early_handle; 33 34 34 35 static void init_early_allocated_pages(void); 35 36 ··· 54 53 return true; 55 54 } 56 55 57 - static noinline void register_dummy_stack(void) 56 + static __always_inline depot_stack_handle_t create_dummy_stack(void) 58 57 { 59 58 unsigned long entries[4]; 60 59 struct stack_trace dummy; ··· 65 64 dummy.skip = 0; 66 65 67 66 save_stack_trace(&dummy); 68 - dummy_handle = depot_save_stack(&dummy, GFP_KERNEL); 67 + return depot_save_stack(&dummy, GFP_KERNEL); 68 + } 69 + 70 + static noinline void register_dummy_stack(void) 71 + { 72 + dummy_handle = create_dummy_stack(); 69 73 } 70 74 71 75 static noinline void register_failure_stack(void) 72 76 { 73 - unsigned long entries[4]; 74 - struct stack_trace failure; 77 + failure_handle = create_dummy_stack(); 78 + } 75 79 76 - failure.nr_entries = 0; 77 - failure.max_entries = ARRAY_SIZE(entries); 78 - failure.entries = &entries[0]; 79 - failure.skip = 0; 80 - 81 - save_stack_trace(&failure); 82 - failure_handle = depot_save_stack(&failure, GFP_KERNEL); 80 + static noinline void register_early_stack(void) 81 + { 82 + early_handle = create_dummy_stack(); 83 83 } 84 84 85 85 static void init_page_owner(void) ··· 90 88 91 89 register_dummy_stack(); 92 90 register_failure_stack(); 91 + register_early_stack(); 93 92 static_branch_enable(&page_owner_inited); 94 93 init_early_allocated_pages(); 95 94 } ··· 168 165 return handle; 169 166 } 170 167 171 - noinline void __set_page_owner(struct page *page, unsigned int order, 172 - gfp_t gfp_mask) 168 + static inline void __set_page_owner_handle(struct page_ext *page_ext, 169 + depot_stack_handle_t handle, unsigned int order, gfp_t gfp_mask) 173 170 { 174 - struct page_ext *page_ext = lookup_page_ext(page); 175 171 struct page_owner *page_owner; 176 172 177 - if (unlikely(!page_ext)) 178 - return; 179 - 180 173 page_owner 
= get_page_owner(page_ext); 181 - page_owner->handle = save_stack(gfp_mask); 174 + page_owner->handle = handle; 182 175 page_owner->order = order; 183 176 page_owner->gfp_mask = gfp_mask; 184 177 page_owner->last_migrate_reason = -1; 185 178 186 179 __set_bit(PAGE_EXT_OWNER, &page_ext->flags); 180 + } 181 + 182 + noinline void __set_page_owner(struct page *page, unsigned int order, 183 + gfp_t gfp_mask) 184 + { 185 + struct page_ext *page_ext = lookup_page_ext(page); 186 + depot_stack_handle_t handle; 187 + 188 + if (unlikely(!page_ext)) 189 + return; 190 + 191 + handle = save_stack(gfp_mask); 192 + __set_page_owner_handle(page_ext, handle, order, gfp_mask); 187 193 } 188 194 189 195 void __set_page_owner_migrate_reason(struct page *page, int reason) ··· 562 550 continue; 563 551 564 552 /* 565 - * We are safe to check buddy flag and order, because 566 - * this is init stage and only single thread runs. 553 + * To avoid having to grab zone->lock, be a little 554 + * careful when reading buddy page order. The only 555 + * danger is that we skip too much and potentially miss 556 + * some early allocated pages, which is better than 557 + * heavy lock contention. 
567 558 */ 568 559 if (PageBuddy(page)) { 569 - pfn += (1UL << page_order(page)) - 1; 560 + unsigned long order = page_order_unsafe(page); 561 + 562 + if (order > 0 && order < MAX_ORDER) 563 + pfn += (1UL << order) - 1; 570 564 continue; 571 565 } 572 566 ··· 583 565 if (unlikely(!page_ext)) 584 566 continue; 585 567 586 - /* Maybe overraping zone */ 568 + /* Maybe overlapping zone */ 587 569 if (test_bit(PAGE_EXT_OWNER, &page_ext->flags)) 588 570 continue; 589 571 590 572 /* Found early allocated page */ 591 - set_page_owner(page, 0, 0); 573 + __set_page_owner_handle(page_ext, early_handle, 0, 0); 592 574 count++; 593 575 } 576 + cond_resched(); 594 577 } 595 578 596 579 pr_info("Node %d, zone %8s: page owner found early allocated %lu pages\n", ··· 602 583 { 603 584 struct zone *zone; 604 585 struct zone *node_zones = pgdat->node_zones; 605 - unsigned long flags; 606 586 607 587 for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { 608 588 if (!populated_zone(zone)) 609 589 continue; 610 590 611 - spin_lock_irqsave(&zone->lock, flags); 612 591 init_pages_in_zone(pgdat, zone); 613 - spin_unlock_irqrestore(&zone->lock, flags); 614 592 } 615 593 } 616 594
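The page_owner hunk above drops zone->lock during the init walk and instead reads the buddy order with `page_order_unsafe()`, trusting it only when it falls in a valid range. A small sketch of that validation step (`MAX_ORDER_SKETCH` and the helper name are assumptions for illustration; the real walk adds `(1UL << order) - 1` and lets the loop increment supply the final step):

```c
#define MAX_ORDER_SKETCH 11	/* common kernel default for MAX_ORDER */

/* Lock-free skip over a free (buddy) block: the order was read without
 * zone->lock, so it may be transiently garbage. Only a value in the
 * valid range is trusted; otherwise advance a single page. Skipping
 * too little is safe, it just re-examines pages. */
static unsigned long next_pfn_after_buddy(unsigned long pfn,
					  unsigned long unsafe_order)
{
	if (unsafe_order > 0 && unsafe_order < MAX_ORDER_SKETCH)
		return pfn + (1UL << unsafe_order);	/* skip whole block */
	return pfn + 1;					/* distrust bogus order */
}
```

This is the trade the patch comment describes: a bogus order can at worst make the walk skip or re-scan a few early pages, which is cheaper than taking zone->lock for the entire zone.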
+122 -80
mm/shmem.c
··· 34 34 #include <linux/swap.h> 35 35 #include <linux/uio.h> 36 36 #include <linux/khugepaged.h> 37 + #include <linux/hugetlb.h> 37 38 38 39 #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */ 39 40 ··· 189 188 vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE)); 190 189 } 191 190 191 + static inline bool shmem_inode_acct_block(struct inode *inode, long pages) 192 + { 193 + struct shmem_inode_info *info = SHMEM_I(inode); 194 + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 195 + 196 + if (shmem_acct_block(info->flags, pages)) 197 + return false; 198 + 199 + if (sbinfo->max_blocks) { 200 + if (percpu_counter_compare(&sbinfo->used_blocks, 201 + sbinfo->max_blocks - pages) > 0) 202 + goto unacct; 203 + percpu_counter_add(&sbinfo->used_blocks, pages); 204 + } 205 + 206 + return true; 207 + 208 + unacct: 209 + shmem_unacct_blocks(info->flags, pages); 210 + return false; 211 + } 212 + 213 + static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages) 214 + { 215 + struct shmem_inode_info *info = SHMEM_I(inode); 216 + struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 217 + 218 + if (sbinfo->max_blocks) 219 + percpu_counter_sub(&sbinfo->used_blocks, pages); 220 + shmem_unacct_blocks(info->flags, pages); 221 + } 222 + 192 223 static const struct super_operations shmem_ops; 193 224 static const struct address_space_operations shmem_aops; 194 225 static const struct file_operations shmem_file_operations; ··· 282 249 283 250 freed = info->alloced - info->swapped - inode->i_mapping->nrpages; 284 251 if (freed > 0) { 285 - struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 286 - if (sbinfo->max_blocks) 287 - percpu_counter_add(&sbinfo->used_blocks, -freed); 288 252 info->alloced -= freed; 289 253 inode->i_blocks -= freed * BLOCKS_PER_PAGE; 290 - shmem_unacct_blocks(info->flags, freed); 254 + shmem_inode_unacct_blocks(inode, freed); 291 255 } 292 256 } 293 257 294 258 bool shmem_charge(struct inode *inode, long pages) 295 259 { 296 260 
struct shmem_inode_info *info = SHMEM_I(inode); 297 - struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 298 261 unsigned long flags; 299 262 300 - if (shmem_acct_block(info->flags, pages)) 263 + if (!shmem_inode_acct_block(inode, pages)) 301 264 return false; 265 + 302 266 spin_lock_irqsave(&info->lock, flags); 303 267 info->alloced += pages; 304 268 inode->i_blocks += pages * BLOCKS_PER_PAGE; ··· 303 273 spin_unlock_irqrestore(&info->lock, flags); 304 274 inode->i_mapping->nrpages += pages; 305 275 306 - if (!sbinfo->max_blocks) 307 - return true; 308 - if (percpu_counter_compare(&sbinfo->used_blocks, 309 - sbinfo->max_blocks - pages) > 0) { 310 - inode->i_mapping->nrpages -= pages; 311 - spin_lock_irqsave(&info->lock, flags); 312 - info->alloced -= pages; 313 - shmem_recalc_inode(inode); 314 - spin_unlock_irqrestore(&info->lock, flags); 315 - shmem_unacct_blocks(info->flags, pages); 316 - return false; 317 - } 318 - percpu_counter_add(&sbinfo->used_blocks, pages); 319 276 return true; 320 277 } 321 278 322 279 void shmem_uncharge(struct inode *inode, long pages) 323 280 { 324 281 struct shmem_inode_info *info = SHMEM_I(inode); 325 - struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 326 282 unsigned long flags; 327 283 328 284 spin_lock_irqsave(&info->lock, flags); ··· 317 301 shmem_recalc_inode(inode); 318 302 spin_unlock_irqrestore(&info->lock, flags); 319 303 320 - if (sbinfo->max_blocks) 321 - percpu_counter_sub(&sbinfo->used_blocks, pages); 322 - shmem_unacct_blocks(info->flags, pages); 304 + shmem_inode_unacct_blocks(inode, pages); 323 305 } 324 306 325 307 /* ··· 1466 1452 } 1467 1453 1468 1454 static struct page *shmem_alloc_and_acct_page(gfp_t gfp, 1469 - struct shmem_inode_info *info, struct shmem_sb_info *sbinfo, 1455 + struct inode *inode, 1470 1456 pgoff_t index, bool huge) 1471 1457 { 1458 + struct shmem_inode_info *info = SHMEM_I(inode); 1472 1459 struct page *page; 1473 1460 int nr; 1474 1461 int err = -ENOSPC; ··· 1478 1463 huge = false; 
1479 1464 nr = huge ? HPAGE_PMD_NR : 1; 1480 1465 1481 - if (shmem_acct_block(info->flags, nr)) 1466 + if (!shmem_inode_acct_block(inode, nr)) 1482 1467 goto failed; 1483 - if (sbinfo->max_blocks) { 1484 - if (percpu_counter_compare(&sbinfo->used_blocks, 1485 - sbinfo->max_blocks - nr) > 0) 1486 - goto unacct; 1487 - percpu_counter_add(&sbinfo->used_blocks, nr); 1488 - } 1489 1468 1490 1469 if (huge) 1491 1470 page = shmem_alloc_hugepage(gfp, info, index); ··· 1492 1483 } 1493 1484 1494 1485 err = -ENOMEM; 1495 - if (sbinfo->max_blocks) 1496 - percpu_counter_add(&sbinfo->used_blocks, -nr); 1497 - unacct: 1498 - shmem_unacct_blocks(info->flags, nr); 1486 + shmem_inode_unacct_blocks(inode, nr); 1499 1487 failed: 1500 1488 return ERR_PTR(err); 1501 1489 } ··· 1650 1644 1651 1645 if (swap.val) { 1652 1646 /* Look it up and read it in.. */ 1653 - page = lookup_swap_cache(swap); 1647 + page = lookup_swap_cache(swap, NULL, 0); 1654 1648 if (!page) { 1655 1649 /* Or update major stats only when swapin succeeds?? */ 1656 1650 if (fault_type) { ··· 1757 1751 } 1758 1752 1759 1753 alloc_huge: 1760 - page = shmem_alloc_and_acct_page(gfp, info, sbinfo, 1761 - index, true); 1754 + page = shmem_alloc_and_acct_page(gfp, inode, index, true); 1762 1755 if (IS_ERR(page)) { 1763 - alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, info, sbinfo, 1756 + alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, inode, 1764 1757 index, false); 1765 1758 } 1766 1759 if (IS_ERR(page)) { ··· 1881 1876 * Error recovery. 
1882 1877 */ 1883 1878 unacct: 1884 - if (sbinfo->max_blocks) 1885 - percpu_counter_sub(&sbinfo->used_blocks, 1886 - 1 << compound_order(page)); 1887 - shmem_unacct_blocks(info->flags, 1 << compound_order(page)); 1879 + shmem_inode_unacct_blocks(inode, 1 << compound_order(page)); 1888 1880 1889 1881 if (PageTransHuge(page)) { 1890 1882 unlock_page(page); ··· 2208 2206 return mapping->a_ops == &shmem_aops; 2209 2207 } 2210 2208 2211 - int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, 2212 - pmd_t *dst_pmd, 2213 - struct vm_area_struct *dst_vma, 2214 - unsigned long dst_addr, 2215 - unsigned long src_addr, 2216 - struct page **pagep) 2209 + static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, 2210 + pmd_t *dst_pmd, 2211 + struct vm_area_struct *dst_vma, 2212 + unsigned long dst_addr, 2213 + unsigned long src_addr, 2214 + bool zeropage, 2215 + struct page **pagep) 2217 2216 { 2218 2217 struct inode *inode = file_inode(dst_vma->vm_file); 2219 2218 struct shmem_inode_info *info = SHMEM_I(inode); 2220 - struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); 2221 2219 struct address_space *mapping = inode->i_mapping; 2222 2220 gfp_t gfp = mapping_gfp_mask(mapping); 2223 2221 pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); ··· 2229 2227 int ret; 2230 2228 2231 2229 ret = -ENOMEM; 2232 - if (shmem_acct_block(info->flags, 1)) 2230 + if (!shmem_inode_acct_block(inode, 1)) 2233 2231 goto out; 2234 - if (sbinfo->max_blocks) { 2235 - if (percpu_counter_compare(&sbinfo->used_blocks, 2236 - sbinfo->max_blocks) >= 0) 2237 - goto out_unacct_blocks; 2238 - percpu_counter_inc(&sbinfo->used_blocks); 2239 - } 2240 2232 2241 2233 if (!*pagep) { 2242 2234 page = shmem_alloc_page(gfp, info, pgoff); 2243 2235 if (!page) 2244 - goto out_dec_used_blocks; 2236 + goto out_unacct_blocks; 2245 2237 2246 - page_kaddr = kmap_atomic(page); 2247 - ret = copy_from_user(page_kaddr, (const void __user *)src_addr, 2248 - PAGE_SIZE); 2249 - kunmap_atomic(page_kaddr); 2238 + if (!zeropage) { 
/* mcopy_atomic */ 2239 + page_kaddr = kmap_atomic(page); 2240 + ret = copy_from_user(page_kaddr, 2241 + (const void __user *)src_addr, 2242 + PAGE_SIZE); 2243 + kunmap_atomic(page_kaddr); 2250 2244 2251 - /* fallback to copy_from_user outside mmap_sem */ 2252 - if (unlikely(ret)) { 2253 - *pagep = page; 2254 - if (sbinfo->max_blocks) 2255 - percpu_counter_add(&sbinfo->used_blocks, -1); 2256 - shmem_unacct_blocks(info->flags, 1); 2257 - /* don't free the page */ 2258 - return -EFAULT; 2245 + /* fallback to copy_from_user outside mmap_sem */ 2246 + if (unlikely(ret)) { 2247 + *pagep = page; 2248 + shmem_inode_unacct_blocks(inode, 1); 2249 + /* don't free the page */ 2250 + return -EFAULT; 2251 + } 2252 + } else { /* mfill_zeropage_atomic */ 2253 + clear_highpage(page); 2259 2254 } 2260 2255 } else { 2261 2256 page = *pagep; ··· 2313 2314 out_release: 2314 2315 unlock_page(page); 2315 2316 put_page(page); 2316 - out_dec_used_blocks: 2317 - if (sbinfo->max_blocks) 2318 - percpu_counter_add(&sbinfo->used_blocks, -1); 2319 2317 out_unacct_blocks: 2320 - shmem_unacct_blocks(info->flags, 1); 2318 + shmem_inode_unacct_blocks(inode, 1); 2321 2319 goto out; 2320 + } 2321 + 2322 + int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, 2323 + pmd_t *dst_pmd, 2324 + struct vm_area_struct *dst_vma, 2325 + unsigned long dst_addr, 2326 + unsigned long src_addr, 2327 + struct page **pagep) 2328 + { 2329 + return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, 2330 + dst_addr, src_addr, false, pagep); 2331 + } 2332 + 2333 + int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, 2334 + pmd_t *dst_pmd, 2335 + struct vm_area_struct *dst_vma, 2336 + unsigned long dst_addr) 2337 + { 2338 + struct page *page = NULL; 2339 + 2340 + return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, 2341 + dst_addr, 0, true, &page); 2322 2342 } 2323 2343 2324 2344 #ifdef CONFIG_TMPFS ··· 3653 3635 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) 3654 3636 #define MFD_NAME_MAX_LEN (NAME_MAX - 
MFD_NAME_PREFIX_LEN) 3655 3637 3656 - #define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING) 3638 + #define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) 3657 3639 3658 3640 SYSCALL_DEFINE2(memfd_create, 3659 3641 const char __user *, uname, ··· 3665 3647 char *name; 3666 3648 long len; 3667 3649 3668 - if (flags & ~(unsigned int)MFD_ALL_FLAGS) 3669 - return -EINVAL; 3650 + if (!(flags & MFD_HUGETLB)) { 3651 + if (flags & ~(unsigned int)MFD_ALL_FLAGS) 3652 + return -EINVAL; 3653 + } else { 3654 + /* Sealing not supported in hugetlbfs (MFD_HUGETLB) */ 3655 + if (flags & MFD_ALLOW_SEALING) 3656 + return -EINVAL; 3657 + /* Allow huge page size encoding in flags. */ 3658 + if (flags & ~(unsigned int)(MFD_ALL_FLAGS | 3659 + (MFD_HUGE_MASK << MFD_HUGE_SHIFT))) 3660 + return -EINVAL; 3661 + } 3670 3662 3671 3663 /* length includes terminating zero */ 3672 3664 len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); ··· 3707 3679 goto err_name; 3708 3680 } 3709 3681 3710 - file = shmem_file_setup(name, 0, VM_NORESERVE); 3682 + if (flags & MFD_HUGETLB) { 3683 + struct user_struct *user = NULL; 3684 + 3685 + file = hugetlb_file_setup(name, 0, VM_NORESERVE, &user, 3686 + HUGETLB_ANONHUGE_INODE, 3687 + (flags >> MFD_HUGE_SHIFT) & 3688 + MFD_HUGE_MASK); 3689 + } else 3690 + file = shmem_file_setup(name, 0, VM_NORESERVE); 3711 3691 if (IS_ERR(file)) { 3712 3692 error = PTR_ERR(file); 3713 3693 goto err_fd; 3714 3694 } 3715 - info = SHMEM_I(file_inode(file)); 3716 3695 file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; 3717 3696 file->f_flags |= O_RDWR | O_LARGEFILE; 3718 - if (flags & MFD_ALLOW_SEALING) 3697 + 3698 + if (flags & MFD_ALLOW_SEALING) { 3699 + /* 3700 + * flags check at beginning of function ensures 3701 + * this is not a hugetlbfs (MFD_HUGETLB) file. 3702 + */ 3703 + info = SHMEM_I(file_inode(file)); 3719 3704 info->seals &= ~F_SEAL_SEAL; 3705 + } 3720 3706 3721 3707 fd_install(fd, file); 3722 3708 kfree(name);
+44 -8
mm/slub.c
··· 34 34 #include <linux/stacktrace.h> 35 35 #include <linux/prefetch.h> 36 36 #include <linux/memcontrol.h> 37 + #include <linux/random.h> 37 38 38 39 #include <trace/events/kmem.h> 39 40 ··· 239 238 * Core slab cache functions 240 239 *******************************************************************/ 241 240 241 + /* 242 + * Returns freelist pointer (ptr). With hardening, this is obfuscated 243 + * with an XOR of the address where the pointer is held and a per-cache 244 + * random number. 245 + */ 246 + static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr, 247 + unsigned long ptr_addr) 248 + { 249 + #ifdef CONFIG_SLAB_FREELIST_HARDENED 250 + return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr); 251 + #else 252 + return ptr; 253 + #endif 254 + } 255 + 256 + /* Returns the freelist pointer recorded at location ptr_addr. */ 257 + static inline void *freelist_dereference(const struct kmem_cache *s, 258 + void *ptr_addr) 259 + { 260 + return freelist_ptr(s, (void *)*(unsigned long *)(ptr_addr), 261 + (unsigned long)ptr_addr); 262 + } 263 + 242 264 static inline void *get_freepointer(struct kmem_cache *s, void *object) 243 265 { 244 - return *(void **)(object + s->offset); 266 + return freelist_dereference(s, object + s->offset); 245 267 } 246 268 247 269 static void prefetch_freepointer(const struct kmem_cache *s, void *object) 248 270 { 249 - prefetch(object + s->offset); 271 + if (object) 272 + prefetch(freelist_dereference(s, object + s->offset)); 250 273 } 251 274 252 275 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) 253 276 { 277 + unsigned long freepointer_addr; 254 278 void *p; 255 279 256 280 if (!debug_pagealloc_enabled()) 257 281 return get_freepointer(s, object); 258 282 259 - probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p)); 260 - return p; 283 + freepointer_addr = (unsigned long)object + s->offset; 284 + probe_kernel_read(&p, (void **)freepointer_addr, sizeof(p)); 285 + return 
freelist_ptr(s, p, freepointer_addr); 261 286 } 262 287 263 288 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) 264 289 { 265 - *(void **)(object + s->offset) = fp; 290 + unsigned long freeptr_addr = (unsigned long)object + s->offset; 291 + 292 + #ifdef CONFIG_SLAB_FREELIST_HARDENED 293 + BUG_ON(object == fp); /* naive detection of double free or corruption */ 294 + #endif 295 + 296 + *(void **)freeptr_addr = freelist_ptr(s, fp, freeptr_addr); 266 297 } 267 298 268 299 /* Loop over all objects in a slab */ ··· 3391 3358 struct kmem_cache_node *n; 3392 3359 3393 3360 for_each_kmem_cache_node(s, node, n) { 3394 - kmem_cache_free(kmem_cache_node, n); 3395 3361 s->node[node] = NULL; 3362 + kmem_cache_free(kmem_cache_node, n); 3396 3363 } 3397 3364 } 3398 3365 ··· 3422 3389 return 0; 3423 3390 } 3424 3391 3425 - s->node[node] = n; 3426 3392 init_kmem_cache_node(n); 3393 + s->node[node] = n; 3427 3394 } 3428 3395 return 1; 3429 3396 } ··· 3596 3563 { 3597 3564 s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor); 3598 3565 s->reserved = 0; 3566 + #ifdef CONFIG_SLAB_FREELIST_HARDENED 3567 + s->random = get_random_long(); 3568 + #endif 3599 3569 3600 3570 if (need_reserve_slab_rcu && (s->flags & SLAB_TYPESAFE_BY_RCU)) 3601 3571 s->reserved = sizeof(struct rcu_head); ··· 5459 5423 NULL 5460 5424 }; 5461 5425 5462 - static struct attribute_group slab_attr_group = { 5426 + static const struct attribute_group slab_attr_group = { 5463 5427 .attrs = slab_attrs, 5464 5428 }; 5465 5429
+3 -8
mm/sparse-vmemmap.c
··· 54 54 if (slab_is_available()) { 55 55 struct page *page; 56 56 57 - if (node_state(node, N_HIGH_MEMORY)) 58 - page = alloc_pages_node( 59 - node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL, 60 - get_order(size)); 61 - else 62 - page = alloc_pages( 63 - GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL, 64 - get_order(size)); 57 + page = alloc_pages_node(node, 58 + GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL, 59 + get_order(size)); 65 60 if (page) 66 61 return page_address(page); 67 62 return NULL;
+3 -7
mm/sparse.c
··· 65 65 unsigned long array_size = SECTIONS_PER_ROOT * 66 66 sizeof(struct mem_section); 67 67 68 - if (slab_is_available()) { 69 - if (node_state(nid, N_HIGH_MEMORY)) 70 - section = kzalloc_node(array_size, GFP_KERNEL, nid); 71 - else 72 - section = kzalloc(array_size, GFP_KERNEL); 73 - } else { 68 + if (slab_is_available()) 69 + section = kzalloc_node(array_size, GFP_KERNEL, nid); 70 + else 74 71 section = memblock_virt_alloc_node(array_size, nid); 75 - } 76 72 77 73 return section; 78 74 }
+15 -9
mm/swap.c
··· 946 946 } 947 947 948 948 /** 949 - * pagevec_lookup - gang pagecache lookup 949 + * pagevec_lookup_range - gang pagecache lookup 950 950 * @pvec: Where the resulting pages are placed 951 951 * @mapping: The address_space to search 952 952 * @start: The starting page index 953 + * @end: The final page index 953 954 * @nr_pages: The maximum number of pages 954 955 * 955 - * pagevec_lookup() will search for and return a group of up to @nr_pages pages 956 - * in the mapping. The pages are placed in @pvec. pagevec_lookup() takes a 956 + * pagevec_lookup_range() will search for and return a group of up to @nr_pages 957 + * pages in the mapping starting from index @start and upto index @end 958 + * (inclusive). The pages are placed in @pvec. pagevec_lookup() takes a 957 959 * reference against the pages in @pvec. 958 960 * 959 961 * The search returns a group of mapping-contiguous pages with ascending 960 - * indexes. There may be holes in the indices due to not-present pages. 962 + * indexes. There may be holes in the indices due to not-present pages. We 963 + * also update @start to index the next page for the traversal. 961 964 * 962 - * pagevec_lookup() returns the number of pages which were found. 965 + * pagevec_lookup_range() returns the number of pages which were found. If this 966 + * number is smaller than @nr_pages, the end of specified range has been 967 + * reached. 
963 968 */ 964 - unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping, 965 - pgoff_t start, unsigned nr_pages) 969 + unsigned pagevec_lookup_range(struct pagevec *pvec, 970 + struct address_space *mapping, pgoff_t *start, pgoff_t end) 966 971 { 967 - pvec->nr = find_get_pages(mapping, start, nr_pages, pvec->pages); 972 + pvec->nr = find_get_pages_range(mapping, start, end, PAGEVEC_SIZE, 973 + pvec->pages); 968 974 return pagevec_count(pvec); 969 975 } 970 - EXPORT_SYMBOL(pagevec_lookup); 976 + EXPORT_SYMBOL(pagevec_lookup_range); 971 977 972 978 unsigned pagevec_lookup_tag(struct pagevec *pvec, struct address_space *mapping, 973 979 pgoff_t *index, int tag, unsigned nr_pages)
+293 -23
mm/swap_state.c
··· 37 37 38 38 struct address_space *swapper_spaces[MAX_SWAPFILES]; 39 39 static unsigned int nr_swapper_spaces[MAX_SWAPFILES]; 40 + bool swap_vma_readahead = true; 41 + 42 + #define SWAP_RA_MAX_ORDER_DEFAULT 3 43 + 44 + static int swap_ra_max_order = SWAP_RA_MAX_ORDER_DEFAULT; 45 + 46 + #define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) 47 + #define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) 48 + #define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK 49 + #define SWAP_RA_WIN_MASK (~PAGE_MASK & ~SWAP_RA_HITS_MASK) 50 + 51 + #define SWAP_RA_HITS(v) ((v) & SWAP_RA_HITS_MASK) 52 + #define SWAP_RA_WIN(v) (((v) & SWAP_RA_WIN_MASK) >> SWAP_RA_WIN_SHIFT) 53 + #define SWAP_RA_ADDR(v) ((v) & PAGE_MASK) 54 + 55 + #define SWAP_RA_VAL(addr, win, hits) \ 56 + (((addr) & PAGE_MASK) | \ 57 + (((win) << SWAP_RA_WIN_SHIFT) & SWAP_RA_WIN_MASK) | \ 58 + ((hits) & SWAP_RA_HITS_MASK)) 59 + 60 + /* Initial readahead hits is 4 to start up with a small window */ 61 + #define GET_SWAP_RA_VAL(vma) \ 62 + (atomic_long_read(&(vma)->swap_readahead_info) ? : 4) 40 63 41 64 #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) 42 65 #define ADD_CACHE_INFO(x, nr) do { swap_cache_info.x += (nr); } while (0) ··· 320 297 * lock getting page table operations atomic even if we drop the page 321 298 * lock before returning. 
322 299 */ 323 - struct page * lookup_swap_cache(swp_entry_t entry) 300 + struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma, 301 + unsigned long addr) 324 302 { 325 303 struct page *page; 304 + unsigned long ra_info; 305 + int win, hits, readahead; 326 306 327 307 page = find_get_page(swap_address_space(entry), swp_offset(entry)); 328 308 329 - if (page && likely(!PageTransCompound(page))) { 330 - INC_CACHE_INFO(find_success); 331 - if (TestClearPageReadahead(page)) 332 - atomic_inc(&swapin_readahead_hits); 333 - } 334 - 335 309 INC_CACHE_INFO(find_total); 310 + if (page) { 311 + INC_CACHE_INFO(find_success); 312 + if (unlikely(PageTransCompound(page))) 313 + return page; 314 + readahead = TestClearPageReadahead(page); 315 + if (vma) { 316 + ra_info = GET_SWAP_RA_VAL(vma); 317 + win = SWAP_RA_WIN(ra_info); 318 + hits = SWAP_RA_HITS(ra_info); 319 + if (readahead) 320 + hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX); 321 + atomic_long_set(&vma->swap_readahead_info, 322 + SWAP_RA_VAL(addr, win, hits)); 323 + } 324 + if (readahead) { 325 + count_vm_event(SWAP_RA_HIT); 326 + if (!vma) 327 + atomic_inc(&swapin_readahead_hits); 328 + } 329 + } 336 330 return page; 337 331 } 338 332 ··· 464 424 return retpage; 465 425 } 466 426 467 - static unsigned long swapin_nr_pages(unsigned long offset) 427 + static unsigned int __swapin_nr_pages(unsigned long prev_offset, 428 + unsigned long offset, 429 + int hits, 430 + int max_pages, 431 + int prev_win) 468 432 { 469 - static unsigned long prev_offset; 470 - unsigned int pages, max_pages, last_ra; 471 - static atomic_t last_readahead_pages; 472 - 473 - max_pages = 1 << READ_ONCE(page_cluster); 474 - if (max_pages <= 1) 475 - return 1; 433 + unsigned int pages, last_ra; 476 434 477 435 /* 478 436 * This heuristic has been found to work well on both sequential and 479 437 * random loads, swapping to hard disk or to SSD: please don't ask 480 438 * what the "+ 2" means, it just happens to work well, that's 
all. 481 439 */ 482 - pages = atomic_xchg(&swapin_readahead_hits, 0) + 2; 440 + pages = hits + 2; 483 441 if (pages == 2) { 484 442 /* 485 443 * We can have no readahead hits to judge by: but must not get ··· 486 448 */ 487 449 if (offset != prev_offset + 1 && offset != prev_offset - 1) 488 450 pages = 1; 489 - prev_offset = offset; 490 451 } else { 491 452 unsigned int roundup = 4; 492 453 while (roundup < pages) ··· 497 460 pages = max_pages; 498 461 499 462 /* Don't shrink readahead too fast */ 500 - last_ra = atomic_read(&last_readahead_pages) / 2; 463 + last_ra = prev_win / 2; 501 464 if (pages < last_ra) 502 465 pages = last_ra; 466 + 467 + return pages; 468 + } 469 + 470 + static unsigned long swapin_nr_pages(unsigned long offset) 471 + { 472 + static unsigned long prev_offset; 473 + unsigned int hits, pages, max_pages; 474 + static atomic_t last_readahead_pages; 475 + 476 + max_pages = 1 << READ_ONCE(page_cluster); 477 + if (max_pages <= 1) 478 + return 1; 479 + 480 + hits = atomic_xchg(&swapin_readahead_hits, 0); 481 + pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, 482 + atomic_read(&last_readahead_pages)); 483 + if (!hits) 484 + prev_offset = offset; 503 485 atomic_set(&last_readahead_pages, pages); 504 486 505 487 return pages; ··· 552 496 unsigned long start_offset, end_offset; 553 497 unsigned long mask; 554 498 struct blk_plug plug; 555 - bool do_poll = true; 499 + bool do_poll = true, page_allocated; 556 500 557 501 mask = swapin_nr_pages(offset) - 1; 558 502 if (!mask) ··· 568 512 blk_start_plug(&plug); 569 513 for (offset = start_offset; offset <= end_offset ; offset++) { 570 514 /* Ok, do the async read-ahead now */ 571 - page = read_swap_cache_async(swp_entry(swp_type(entry), offset), 572 - gfp_mask, vma, addr, false); 515 + page = __read_swap_cache_async( 516 + swp_entry(swp_type(entry), offset), 517 + gfp_mask, vma, addr, &page_allocated); 573 518 if (!page) 574 519 continue; 575 - if (offset != entry_offset && 
likely(!PageTransCompound(page))) 576 - SetPageReadahead(page); 520 + if (page_allocated) { 521 + swap_readpage(page, false); 522 + if (offset != entry_offset && 523 + likely(!PageTransCompound(page))) { 524 + SetPageReadahead(page); 525 + count_vm_event(SWAP_RA); 526 + } 527 + } 577 528 put_page(page); 578 529 } 579 530 blk_finish_plug(&plug); ··· 624 561 synchronize_rcu(); 625 562 kvfree(spaces); 626 563 } 564 + 565 + static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma, 566 + unsigned long faddr, 567 + unsigned long lpfn, 568 + unsigned long rpfn, 569 + unsigned long *start, 570 + unsigned long *end) 571 + { 572 + *start = max3(lpfn, PFN_DOWN(vma->vm_start), 573 + PFN_DOWN(faddr & PMD_MASK)); 574 + *end = min3(rpfn, PFN_DOWN(vma->vm_end), 575 + PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); 576 + } 577 + 578 + struct page *swap_readahead_detect(struct vm_fault *vmf, 579 + struct vma_swap_readahead *swap_ra) 580 + { 581 + struct vm_area_struct *vma = vmf->vma; 582 + unsigned long swap_ra_info; 583 + struct page *page; 584 + swp_entry_t entry; 585 + unsigned long faddr, pfn, fpfn; 586 + unsigned long start, end; 587 + pte_t *pte; 588 + unsigned int max_win, hits, prev_win, win, left; 589 + #ifndef CONFIG_64BIT 590 + pte_t *tpte; 591 + #endif 592 + 593 + faddr = vmf->address; 594 + entry = pte_to_swp_entry(vmf->orig_pte); 595 + if ((unlikely(non_swap_entry(entry)))) 596 + return NULL; 597 + page = lookup_swap_cache(entry, vma, faddr); 598 + if (page) 599 + return page; 600 + 601 + max_win = 1 << READ_ONCE(swap_ra_max_order); 602 + if (max_win == 1) { 603 + swap_ra->win = 1; 604 + return NULL; 605 + } 606 + 607 + fpfn = PFN_DOWN(faddr); 608 + swap_ra_info = GET_SWAP_RA_VAL(vma); 609 + pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); 610 + prev_win = SWAP_RA_WIN(swap_ra_info); 611 + hits = SWAP_RA_HITS(swap_ra_info); 612 + swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, 613 + max_win, prev_win); 614 + atomic_long_set(&vma->swap_readahead_info, 615 + 
SWAP_RA_VAL(faddr, win, 0)); 616 + 617 + if (win == 1) 618 + return NULL; 619 + 620 + /* Copy the PTEs because the page table may be unmapped */ 621 + if (fpfn == pfn + 1) 622 + swap_ra_clamp_pfn(vma, faddr, fpfn, fpfn + win, &start, &end); 623 + else if (pfn == fpfn + 1) 624 + swap_ra_clamp_pfn(vma, faddr, fpfn - win + 1, fpfn + 1, 625 + &start, &end); 626 + else { 627 + left = (win - 1) / 2; 628 + swap_ra_clamp_pfn(vma, faddr, fpfn - left, fpfn + win - left, 629 + &start, &end); 630 + } 631 + swap_ra->nr_pte = end - start; 632 + swap_ra->offset = fpfn - start; 633 + pte = vmf->pte - swap_ra->offset; 634 + #ifdef CONFIG_64BIT 635 + swap_ra->ptes = pte; 636 + #else 637 + tpte = swap_ra->ptes; 638 + for (pfn = start; pfn != end; pfn++) 639 + *tpte++ = *pte++; 640 + #endif 641 + 642 + return NULL; 643 + } 644 + 645 + struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask, 646 + struct vm_fault *vmf, 647 + struct vma_swap_readahead *swap_ra) 648 + { 649 + struct blk_plug plug; 650 + struct vm_area_struct *vma = vmf->vma; 651 + struct page *page; 652 + pte_t *pte, pentry; 653 + swp_entry_t entry; 654 + unsigned int i; 655 + bool page_allocated; 656 + 657 + if (swap_ra->win == 1) 658 + goto skip; 659 + 660 + blk_start_plug(&plug); 661 + for (i = 0, pte = swap_ra->ptes; i < swap_ra->nr_pte; 662 + i++, pte++) { 663 + pentry = *pte; 664 + if (pte_none(pentry)) 665 + continue; 666 + if (pte_present(pentry)) 667 + continue; 668 + entry = pte_to_swp_entry(pentry); 669 + if (unlikely(non_swap_entry(entry))) 670 + continue; 671 + page = __read_swap_cache_async(entry, gfp_mask, vma, 672 + vmf->address, &page_allocated); 673 + if (!page) 674 + continue; 675 + if (page_allocated) { 676 + swap_readpage(page, false); 677 + if (i != swap_ra->offset && 678 + likely(!PageTransCompound(page))) { 679 + SetPageReadahead(page); 680 + count_vm_event(SWAP_RA); 681 + } 682 + } 683 + put_page(page); 684 + } 685 + blk_finish_plug(&plug); 686 + lru_add_drain(); 687 + skip: 688 + 
return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address, 689 + swap_ra->win == 1); 690 + } 691 + 692 + #ifdef CONFIG_SYSFS 693 + static ssize_t vma_ra_enabled_show(struct kobject *kobj, 694 + struct kobj_attribute *attr, char *buf) 695 + { 696 + return sprintf(buf, "%s\n", swap_vma_readahead ? "true" : "false"); 697 + } 698 + static ssize_t vma_ra_enabled_store(struct kobject *kobj, 699 + struct kobj_attribute *attr, 700 + const char *buf, size_t count) 701 + { 702 + if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1)) 703 + swap_vma_readahead = true; 704 + else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1)) 705 + swap_vma_readahead = false; 706 + else 707 + return -EINVAL; 708 + 709 + return count; 710 + } 711 + static struct kobj_attribute vma_ra_enabled_attr = 712 + __ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show, 713 + vma_ra_enabled_store); 714 + 715 + static ssize_t vma_ra_max_order_show(struct kobject *kobj, 716 + struct kobj_attribute *attr, char *buf) 717 + { 718 + return sprintf(buf, "%d\n", swap_ra_max_order); 719 + } 720 + static ssize_t vma_ra_max_order_store(struct kobject *kobj, 721 + struct kobj_attribute *attr, 722 + const char *buf, size_t count) 723 + { 724 + int err, v; 725 + 726 + err = kstrtoint(buf, 10, &v); 727 + if (err || v > SWAP_RA_ORDER_CEILING || v <= 0) 728 + return -EINVAL; 729 + 730 + swap_ra_max_order = v; 731 + 732 + return count; 733 + } 734 + static struct kobj_attribute vma_ra_max_order_attr = 735 + __ATTR(vma_ra_max_order, 0644, vma_ra_max_order_show, 736 + vma_ra_max_order_store); 737 + 738 + static struct attribute *swap_attrs[] = { 739 + &vma_ra_enabled_attr.attr, 740 + &vma_ra_max_order_attr.attr, 741 + NULL, 742 + }; 743 + 744 + static struct attribute_group swap_attr_group = { 745 + .attrs = swap_attrs, 746 + }; 747 + 748 + static int __init swap_init_sysfs(void) 749 + { 750 + int err; 751 + struct kobject *swap_kobj; 752 + 753 + swap_kobj = kobject_create_and_add("swap", mm_kobj); 754 + if 
(!swap_kobj) { 755 + pr_err("failed to create swap kobject\n"); 756 + return -ENOMEM; 757 + } 758 + err = sysfs_create_group(swap_kobj, &swap_attr_group); 759 + if (err) { 760 + pr_err("failed to register swap group\n"); 761 + goto delete_obj; 762 + } 763 + return 0; 764 + 765 + delete_obj: 766 + kobject_put(swap_kobj); 767 + return err; 768 + } 769 + subsys_initcall(swap_init_sysfs); 770 + #endif
+309 -53
mm/swapfile.c
··· 60 60 EXPORT_SYMBOL_GPL(nr_swap_pages); 61 61 /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */ 62 62 long total_swap_pages; 63 - static int least_priority; 63 + static int least_priority = -1; 64 64 65 65 static const char Bad_file[] = "Bad swap file entry "; 66 66 static const char Unused_file[] = "Unused swap file entry "; ··· 85 85 * is held and the locking order requires swap_lock to be taken 86 86 * before any swap_info_struct->lock. 87 87 */ 88 - static PLIST_HEAD(swap_avail_head); 88 + struct plist_head *swap_avail_heads; 89 89 static DEFINE_SPINLOCK(swap_avail_lock); 90 90 91 91 struct swap_info_struct *swap_info[MAX_SWAPFILES]; ··· 95 95 static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait); 96 96 /* Activity counter to indicate that a swapon or swapoff has occurred */ 97 97 static atomic_t proc_poll_event = ATOMIC_INIT(0); 98 + 99 + atomic_t nr_rotate_swap = ATOMIC_INIT(0); 98 100 99 101 static inline unsigned char swap_count(unsigned char ent) 100 102 { ··· 265 263 { 266 264 info->flags = CLUSTER_FLAG_NEXT_NULL; 267 265 info->data = 0; 266 + } 267 + 268 + static inline bool cluster_is_huge(struct swap_cluster_info *info) 269 + { 270 + return info->flags & CLUSTER_FLAG_HUGE; 271 + } 272 + 273 + static inline void cluster_clear_huge(struct swap_cluster_info *info) 274 + { 275 + info->flags &= ~CLUSTER_FLAG_HUGE; 268 276 } 269 277 270 278 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, ··· 592 580 return found_free; 593 581 } 594 582 583 + static void __del_from_avail_list(struct swap_info_struct *p) 584 + { 585 + int nid; 586 + 587 + for_each_node(nid) 588 + plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); 589 + } 590 + 591 + static void del_from_avail_list(struct swap_info_struct *p) 592 + { 593 + spin_lock(&swap_avail_lock); 594 + __del_from_avail_list(p); 595 + spin_unlock(&swap_avail_lock); 596 + } 597 + 595 598 static void swap_range_alloc(struct swap_info_struct *si, unsigned long 
offset, 596 599 unsigned int nr_entries) 597 600 { ··· 620 593 if (si->inuse_pages == si->pages) { 621 594 si->lowest_bit = si->max; 622 595 si->highest_bit = 0; 623 - spin_lock(&swap_avail_lock); 624 - plist_del(&si->avail_list, &swap_avail_head); 625 - spin_unlock(&swap_avail_lock); 596 + del_from_avail_list(si); 626 597 } 598 + } 599 + 600 + static void add_to_avail_list(struct swap_info_struct *p) 601 + { 602 + int nid; 603 + 604 + spin_lock(&swap_avail_lock); 605 + for_each_node(nid) { 606 + WARN_ON(!plist_node_empty(&p->avail_lists[nid])); 607 + plist_add(&p->avail_lists[nid], &swap_avail_heads[nid]); 608 + } 609 + spin_unlock(&swap_avail_lock); 627 610 } 628 611 629 612 static void swap_range_free(struct swap_info_struct *si, unsigned long offset, ··· 648 611 bool was_full = !si->highest_bit; 649 612 650 613 si->highest_bit = end; 651 - if (was_full && (si->flags & SWP_WRITEOK)) { 652 - spin_lock(&swap_avail_lock); 653 - WARN_ON(!plist_node_empty(&si->avail_list)); 654 - if (plist_node_empty(&si->avail_list)) 655 - plist_add(&si->avail_list, &swap_avail_head); 656 - spin_unlock(&swap_avail_lock); 657 - } 614 + if (was_full && (si->flags & SWP_WRITEOK)) 615 + add_to_avail_list(si); 658 616 } 659 617 atomic_long_add(nr_entries, &nr_swap_pages); 660 618 si->inuse_pages -= nr_entries; ··· 878 846 offset = idx * SWAPFILE_CLUSTER; 879 847 ci = lock_cluster(si, offset); 880 848 alloc_cluster(si, idx); 881 - cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0); 849 + cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE); 882 850 883 851 map = si->swap_map + offset; 884 852 for (i = 0; i < SWAPFILE_CLUSTER; i++) ··· 930 898 struct swap_info_struct *si, *next; 931 899 long avail_pgs; 932 900 int n_ret = 0; 901 + int node; 933 902 934 903 /* Only single cluster request supported */ 935 904 WARN_ON_ONCE(n_goal > 1 && cluster); ··· 950 917 spin_lock(&swap_avail_lock); 951 918 952 919 start_over: 953 - plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) 
{ 920 + node = numa_node_id(); 921 + plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) { 954 922 /* requeue si to after same-priority siblings */ 955 - plist_requeue(&si->avail_list, &swap_avail_head); 923 + plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); 956 924 spin_unlock(&swap_avail_lock); 957 925 spin_lock(&si->lock); 958 926 if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) { 959 927 spin_lock(&swap_avail_lock); 960 - if (plist_node_empty(&si->avail_list)) { 928 + if (plist_node_empty(&si->avail_lists[node])) { 961 929 spin_unlock(&si->lock); 962 930 goto nextsi; 963 931 } ··· 968 934 WARN(!(si->flags & SWP_WRITEOK), 969 935 "swap_info %d in list but !SWP_WRITEOK\n", 970 936 si->type); 971 - plist_del(&si->avail_list, &swap_avail_head); 937 + __del_from_avail_list(si); 972 938 spin_unlock(&si->lock); 973 939 goto nextsi; 974 940 } 975 - if (cluster) 976 - n_ret = swap_alloc_cluster(si, swp_entries); 977 - else 941 + if (cluster) { 942 + if (!(si->flags & SWP_FILE)) 943 + n_ret = swap_alloc_cluster(si, swp_entries); 944 + } else 978 945 n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, 979 946 n_goal, swp_entries); 980 947 spin_unlock(&si->lock); ··· 997 962 * swap_avail_head list then try it, otherwise start over 998 963 * if we have not gotten any slots. 
999 964 */ 1000 - if (plist_node_empty(&next->avail_list)) 965 + if (plist_node_empty(&next->avail_lists[node])) 1001 966 goto start_over; 1002 967 } 1003 968 ··· 1203 1168 struct swap_cluster_info *ci; 1204 1169 struct swap_info_struct *si; 1205 1170 unsigned char *map; 1206 - unsigned int i; 1171 + unsigned int i, free_entries = 0; 1172 + unsigned char val; 1207 1173 1208 - si = swap_info_get(entry); 1174 + si = _swap_info_get(entry); 1209 1175 if (!si) 1210 1176 return; 1211 1177 1212 1178 ci = lock_cluster(si, offset); 1179 + VM_BUG_ON(!cluster_is_huge(ci)); 1213 1180 map = si->swap_map + offset; 1214 1181 for (i = 0; i < SWAPFILE_CLUSTER; i++) { 1215 - VM_BUG_ON(map[i] != SWAP_HAS_CACHE); 1216 - map[i] = 0; 1182 + val = map[i]; 1183 + VM_BUG_ON(!(val & SWAP_HAS_CACHE)); 1184 + if (val == SWAP_HAS_CACHE) 1185 + free_entries++; 1217 1186 } 1187 + if (!free_entries) { 1188 + for (i = 0; i < SWAPFILE_CLUSTER; i++) 1189 + map[i] &= ~SWAP_HAS_CACHE; 1190 + } 1191 + cluster_clear_huge(ci); 1218 1192 unlock_cluster(ci); 1219 - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER); 1220 - swap_free_cluster(si, idx); 1221 - spin_unlock(&si->lock); 1193 + if (free_entries == SWAPFILE_CLUSTER) { 1194 + spin_lock(&si->lock); 1195 + ci = lock_cluster(si, offset); 1196 + memset(map, 0, SWAPFILE_CLUSTER); 1197 + unlock_cluster(ci); 1198 + mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER); 1199 + swap_free_cluster(si, idx); 1200 + spin_unlock(&si->lock); 1201 + } else if (free_entries) { 1202 + for (i = 0; i < SWAPFILE_CLUSTER; i++, entry.val++) { 1203 + if (!__swap_entry_free(si, entry, SWAP_HAS_CACHE)) 1204 + free_swap_slot(entry); 1205 + } 1206 + } 1207 + } 1208 + 1209 + int split_swap_cluster(swp_entry_t entry) 1210 + { 1211 + struct swap_info_struct *si; 1212 + struct swap_cluster_info *ci; 1213 + unsigned long offset = swp_offset(entry); 1214 + 1215 + si = _swap_info_get(entry); 1216 + if (!si) 1217 + return -EBUSY; 1218 + ci = lock_cluster(si, offset); 1219 + 
cluster_clear_huge(ci); 1220 + unlock_cluster(ci); 1221 + return 0; 1222 1222 } 1223 1223 #else 1224 1224 static inline void swapcache_free_cluster(swp_entry_t entry) ··· 1402 1332 return count; 1403 1333 } 1404 1334 1335 + #ifdef CONFIG_THP_SWAP 1336 + static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, 1337 + swp_entry_t entry) 1338 + { 1339 + struct swap_cluster_info *ci; 1340 + unsigned char *map = si->swap_map; 1341 + unsigned long roffset = swp_offset(entry); 1342 + unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER); 1343 + int i; 1344 + bool ret = false; 1345 + 1346 + ci = lock_cluster_or_swap_info(si, offset); 1347 + if (!ci || !cluster_is_huge(ci)) { 1348 + if (map[roffset] != SWAP_HAS_CACHE) 1349 + ret = true; 1350 + goto unlock_out; 1351 + } 1352 + for (i = 0; i < SWAPFILE_CLUSTER; i++) { 1353 + if (map[offset + i] != SWAP_HAS_CACHE) { 1354 + ret = true; 1355 + break; 1356 + } 1357 + } 1358 + unlock_out: 1359 + unlock_cluster_or_swap_info(si, ci); 1360 + return ret; 1361 + } 1362 + 1363 + static bool page_swapped(struct page *page) 1364 + { 1365 + swp_entry_t entry; 1366 + struct swap_info_struct *si; 1367 + 1368 + if (likely(!PageTransCompound(page))) 1369 + return page_swapcount(page) != 0; 1370 + 1371 + page = compound_head(page); 1372 + entry.val = page_private(page); 1373 + si = _swap_info_get(entry); 1374 + if (si) 1375 + return swap_page_trans_huge_swapped(si, entry); 1376 + return false; 1377 + } 1378 + 1379 + static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, 1380 + int *total_swapcount) 1381 + { 1382 + int i, map_swapcount, _total_mapcount, _total_swapcount; 1383 + unsigned long offset = 0; 1384 + struct swap_info_struct *si; 1385 + struct swap_cluster_info *ci = NULL; 1386 + unsigned char *map = NULL; 1387 + int mapcount, swapcount = 0; 1388 + 1389 + /* hugetlbfs shouldn't call it */ 1390 + VM_BUG_ON_PAGE(PageHuge(page), page); 1391 + 1392 + if (likely(!PageTransCompound(page))) { 1393 
+ mapcount = atomic_read(&page->_mapcount) + 1; 1394 + if (total_mapcount) 1395 + *total_mapcount = mapcount; 1396 + if (PageSwapCache(page)) 1397 + swapcount = page_swapcount(page); 1398 + if (total_swapcount) 1399 + *total_swapcount = swapcount; 1400 + return mapcount + swapcount; 1401 + } 1402 + 1403 + page = compound_head(page); 1404 + 1405 + _total_mapcount = _total_swapcount = map_swapcount = 0; 1406 + if (PageSwapCache(page)) { 1407 + swp_entry_t entry; 1408 + 1409 + entry.val = page_private(page); 1410 + si = _swap_info_get(entry); 1411 + if (si) { 1412 + map = si->swap_map; 1413 + offset = swp_offset(entry); 1414 + } 1415 + } 1416 + if (map) 1417 + ci = lock_cluster(si, offset); 1418 + for (i = 0; i < HPAGE_PMD_NR; i++) { 1419 + mapcount = atomic_read(&page[i]._mapcount) + 1; 1420 + _total_mapcount += mapcount; 1421 + if (map) { 1422 + swapcount = swap_count(map[offset + i]); 1423 + _total_swapcount += swapcount; 1424 + } 1425 + map_swapcount = max(map_swapcount, mapcount + swapcount); 1426 + } 1427 + unlock_cluster(ci); 1428 + if (PageDoubleMap(page)) { 1429 + map_swapcount -= 1; 1430 + _total_mapcount -= HPAGE_PMD_NR; 1431 + } 1432 + mapcount = compound_mapcount(page); 1433 + map_swapcount += mapcount; 1434 + _total_mapcount += mapcount; 1435 + if (total_mapcount) 1436 + *total_mapcount = _total_mapcount; 1437 + if (total_swapcount) 1438 + *total_swapcount = _total_swapcount; 1439 + 1440 + return map_swapcount; 1441 + } 1442 + #else 1443 + #define swap_page_trans_huge_swapped(si, entry) swap_swapcount(si, entry) 1444 + #define page_swapped(page) (page_swapcount(page) != 0) 1445 + 1446 + static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, 1447 + int *total_swapcount) 1448 + { 1449 + int mapcount, swapcount = 0; 1450 + 1451 + /* hugetlbfs shouldn't call it */ 1452 + VM_BUG_ON_PAGE(PageHuge(page), page); 1453 + 1454 + mapcount = page_trans_huge_mapcount(page, total_mapcount); 1455 + if (PageSwapCache(page)) 1456 + swapcount = 
page_swapcount(page); 1457 + if (total_swapcount) 1458 + *total_swapcount = swapcount; 1459 + return mapcount + swapcount; 1460 + } 1461 + #endif 1462 + 1405 1463 /* 1406 1464 * We can write to an anon page without COW if there are no other references 1407 1465 * to it. And as a side-effect, free up its swap: because the old content 1408 1466 * on disk will never be read, and seeking back there to write new content 1409 1467 * later would only waste time away from clustering. 1410 1468 * 1411 - * NOTE: total_mapcount should not be relied upon by the caller if 1469 + * NOTE: total_map_swapcount should not be relied upon by the caller if 1412 1470 * reuse_swap_page() returns false, but it may be always overwritten 1413 1471 * (see the other implementation for CONFIG_SWAP=n). 1414 1472 */ 1415 - bool reuse_swap_page(struct page *page, int *total_mapcount) 1473 + bool reuse_swap_page(struct page *page, int *total_map_swapcount) 1416 1474 { 1417 - int count; 1475 + int count, total_mapcount, total_swapcount; 1418 1476 1419 1477 VM_BUG_ON_PAGE(!PageLocked(page), page); 1420 1478 if (unlikely(PageKsm(page))) 1421 1479 return false; 1422 - count = page_trans_huge_mapcount(page, total_mapcount); 1423 - if (count <= 1 && PageSwapCache(page)) { 1424 - count += page_swapcount(page); 1425 - if (count != 1) 1426 - goto out; 1480 + count = page_trans_huge_map_swapcount(page, &total_mapcount, 1481 + &total_swapcount); 1482 + if (total_map_swapcount) 1483 + *total_map_swapcount = total_mapcount + total_swapcount; 1484 + if (count == 1 && PageSwapCache(page) && 1485 + (likely(!PageTransCompound(page)) || 1486 + /* The remaining swap count will be freed soon */ 1487 + total_swapcount == page_swapcount(page))) { 1427 1488 if (!PageWriteback(page)) { 1489 + page = compound_head(page); 1428 1490 delete_from_swap_cache(page); 1429 1491 SetPageDirty(page); 1430 1492 } else { ··· 1572 1370 spin_unlock(&p->lock); 1573 1371 } 1574 1372 } 1575 - out: 1373 + 1576 1374 return count <= 1; 1577 
1375 } 1578 1376 ··· 1588 1386 return 0; 1589 1387 if (PageWriteback(page)) 1590 1388 return 0; 1591 - if (page_swapcount(page)) 1389 + if (page_swapped(page)) 1592 1390 return 0; 1593 1391 1594 1392 /* ··· 1609 1407 if (pm_suspended_storage()) 1610 1408 return 0; 1611 1409 1410 + page = compound_head(page); 1612 1411 delete_from_swap_cache(page); 1613 1412 SetPageDirty(page); 1614 1413 return 1; ··· 1631 1428 p = _swap_info_get(entry); 1632 1429 if (p) { 1633 1430 count = __swap_entry_free(p, entry, 1); 1634 - if (count == SWAP_HAS_CACHE) { 1431 + if (count == SWAP_HAS_CACHE && 1432 + !swap_page_trans_huge_swapped(p, entry)) { 1635 1433 page = find_get_page(swap_address_space(entry), 1636 1434 swp_offset(entry)); 1637 1435 if (page && !trylock_page(page)) { ··· 1649 1445 */ 1650 1446 if (PageSwapCache(page) && !PageWriteback(page) && 1651 1447 (!page_mapped(page) || mem_cgroup_swap_full(page)) && 1652 - !swap_swapcount(p, entry)) { 1448 + !swap_page_trans_huge_swapped(p, entry)) { 1449 + page = compound_head(page); 1653 1450 delete_from_swap_cache(page); 1654 1451 SetPageDirty(page); 1655 1452 } ··· 2204 1999 .sync_mode = WB_SYNC_NONE, 2205 2000 }; 2206 2001 2207 - swap_writepage(page, &wbc); 2002 + swap_writepage(compound_head(page), &wbc); 2208 2003 lock_page(page); 2209 2004 wait_on_page_writeback(page); 2210 2005 } ··· 2217 2012 * delete, since it may not have been written out to swap yet. 
2218 2013 */ 2219 2014 if (PageSwapCache(page) && 2220 - likely(page_private(page) == entry.val)) 2221 - delete_from_swap_cache(page); 2015 + likely(page_private(page) == entry.val) && 2016 + !page_swapped(page)) 2017 + delete_from_swap_cache(compound_head(page)); 2222 2018 2223 2019 /* 2224 2020 * So we could skip searching mms once swap count went ··· 2432 2226 return generic_swapfile_activate(sis, swap_file, span); 2433 2227 } 2434 2228 2229 + static int swap_node(struct swap_info_struct *p) 2230 + { 2231 + struct block_device *bdev; 2232 + 2233 + if (p->bdev) 2234 + bdev = p->bdev; 2235 + else 2236 + bdev = p->swap_file->f_inode->i_sb->s_bdev; 2237 + 2238 + return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE; 2239 + } 2240 + 2435 2241 static void _enable_swap_info(struct swap_info_struct *p, int prio, 2436 2242 unsigned char *swap_map, 2437 2243 struct swap_cluster_info *cluster_info) 2438 2244 { 2245 + int i; 2246 + 2439 2247 if (prio >= 0) 2440 2248 p->prio = prio; 2441 2249 else ··· 2459 2239 * low-to-high, while swap ordering is high-to-low 2460 2240 */ 2461 2241 p->list.prio = -p->prio; 2462 - p->avail_list.prio = -p->prio; 2242 + for_each_node(i) { 2243 + if (p->prio >= 0) 2244 + p->avail_lists[i].prio = -p->prio; 2245 + else { 2246 + if (swap_node(p) == i) 2247 + p->avail_lists[i].prio = 1; 2248 + else 2249 + p->avail_lists[i].prio = -p->prio; 2250 + } 2251 + } 2463 2252 p->swap_map = swap_map; 2464 2253 p->cluster_info = cluster_info; 2465 2254 p->flags |= SWP_WRITEOK; ··· 2487 2258 * swap_info_struct. 
2488 2259 */ 2489 2260 plist_add(&p->list, &swap_active_head); 2490 - spin_lock(&swap_avail_lock); 2491 - plist_add(&p->avail_list, &swap_avail_head); 2492 - spin_unlock(&swap_avail_lock); 2261 + add_to_avail_list(p); 2493 2262 } 2494 2263 2495 2264 static void enable_swap_info(struct swap_info_struct *p, int prio, ··· 2572 2345 spin_unlock(&swap_lock); 2573 2346 goto out_dput; 2574 2347 } 2575 - spin_lock(&swap_avail_lock); 2576 - plist_del(&p->avail_list, &swap_avail_head); 2577 - spin_unlock(&swap_avail_lock); 2348 + del_from_avail_list(p); 2578 2349 spin_lock(&p->lock); 2579 2350 if (p->prio < 0) { 2580 2351 struct swap_info_struct *si = p; 2352 + int nid; 2581 2353 2582 2354 plist_for_each_entry_continue(si, &swap_active_head, list) { 2583 2355 si->prio++; 2584 2356 si->list.prio--; 2585 - si->avail_list.prio--; 2357 + for_each_node(nid) { 2358 + if (si->avail_lists[nid].prio != 1) 2359 + si->avail_lists[nid].prio--; 2360 + } 2586 2361 } 2587 2362 least_priority++; 2588 2363 } ··· 2615 2386 destroy_swap_extents(p); 2616 2387 if (p->flags & SWP_CONTINUED) 2617 2388 free_swap_count_continuations(p); 2389 + 2390 + if (!p->bdev || !blk_queue_nonrot(bdev_get_queue(p->bdev))) 2391 + atomic_dec(&nr_rotate_swap); 2618 2392 2619 2393 mutex_lock(&swapon_mutex); 2620 2394 spin_lock(&swap_lock); ··· 2828 2596 { 2829 2597 struct swap_info_struct *p; 2830 2598 unsigned int type; 2599 + int i; 2831 2600 2832 2601 p = kzalloc(sizeof(*p), GFP_KERNEL); 2833 2602 if (!p) ··· 2864 2631 } 2865 2632 INIT_LIST_HEAD(&p->first_swap_extent.list); 2866 2633 plist_node_init(&p->list, 0); 2867 - plist_node_init(&p->avail_list, 0); 2634 + for_each_node(i) 2635 + plist_node_init(&p->avail_lists[i], 0); 2868 2636 p->flags = SWP_USED; 2869 2637 spin_unlock(&swap_lock); 2870 2638 spin_lock_init(&p->lock); ··· 3107 2873 if (!capable(CAP_SYS_ADMIN)) 3108 2874 return -EPERM; 3109 2875 2876 + if (!swap_avail_heads) 2877 + return -ENOMEM; 2878 + 3110 2879 p = alloc_swap_info(); 3111 2880 if 
(IS_ERR(p)) 3112 2881 return PTR_ERR(p); ··· 3200 2963 cluster = per_cpu_ptr(p->percpu_cluster, cpu); 3201 2964 cluster_set_null(&cluster->index); 3202 2965 } 3203 - } 2966 + } else 2967 + atomic_inc(&nr_rotate_swap); 3204 2968 3205 2969 error = swap_cgroup_swapon(p->type, maxpages); 3206 2970 if (error) ··· 3695 3457 } 3696 3458 } 3697 3459 } 3460 + 3461 + static int __init swapfile_init(void) 3462 + { 3463 + int nid; 3464 + 3465 + swap_avail_heads = kmalloc_array(nr_node_ids, sizeof(struct plist_head), 3466 + GFP_KERNEL); 3467 + if (!swap_avail_heads) { 3468 + pr_emerg("Not enough memory for swap heads, swap is disabled\n"); 3469 + return -ENOMEM; 3470 + } 3471 + 3472 + for_each_node(nid) 3473 + plist_head_init(&swap_avail_heads[nid]); 3474 + 3475 + return 0; 3476 + } 3477 + subsys_initcall(swapfile_init);
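The per-node available lists above keep plist's ascending order while swap device selection is highest-priority-first, so priorities are stored negated (`p->list.prio = -p->prio`); auto-assigned (negative-priority) devices additionally get `prio = 1` on their own node so local storage wins. A minimal userspace sketch of just the negation trick, with hypothetical names (not kernel API):

```c
#include <stdlib.h>

/* Hypothetical mini swap device: higher 'prio' should be used first.
 * A plist walks in ascending key order, so we store the negated
 * priority, mirroring "p->list.prio = -p->prio" in the patch. */
struct dev { int prio; int key; };

static int by_key(const void *a, const void *b)
{
	return ((const struct dev *)a)->key - ((const struct dev *)b)->key;
}

/* Sort devices into the order a plist walk would visit them. */
void order_like_plist(struct dev *v, int n)
{
	for (int i = 0; i < n; i++)
		v[i].key = -v[i].prio;	/* negate: high prio -> small key */
	qsort(v, n, sizeof(*v), by_key);
}
```

Walking the sorted array front to back then yields the highest-priority device first, which is exactly what the kernel's plist traversal achieves.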
+32 -16
mm/userfaultfd.c
··· 371 371 bool zeropage); 372 372 #endif /* CONFIG_HUGETLB_PAGE */ 373 373 374 + static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, 375 + pmd_t *dst_pmd, 376 + struct vm_area_struct *dst_vma, 377 + unsigned long dst_addr, 378 + unsigned long src_addr, 379 + struct page **page, 380 + bool zeropage) 381 + { 382 + ssize_t err; 383 + 384 + if (vma_is_anonymous(dst_vma)) { 385 + if (!zeropage) 386 + err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, 387 + dst_addr, src_addr, page); 388 + else 389 + err = mfill_zeropage_pte(dst_mm, dst_pmd, 390 + dst_vma, dst_addr); 391 + } else { 392 + if (!zeropage) 393 + err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, 394 + dst_vma, dst_addr, 395 + src_addr, page); 396 + else 397 + err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd, 398 + dst_vma, dst_addr); 399 + } 400 + 401 + return err; 402 + } 403 + 374 404 static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, 375 405 unsigned long dst_start, 376 406 unsigned long src_start, ··· 517 487 BUG_ON(pmd_none(*dst_pmd)); 518 488 BUG_ON(pmd_trans_huge(*dst_pmd)); 519 489 520 - if (vma_is_anonymous(dst_vma)) { 521 - if (!zeropage) 522 - err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, 523 - dst_addr, src_addr, 524 - &page); 525 - else 526 - err = mfill_zeropage_pte(dst_mm, dst_pmd, 527 - dst_vma, dst_addr); 528 - } else { 529 - err = -EINVAL; /* if zeropage is true return -EINVAL */ 530 - if (likely(!zeropage)) 531 - err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, 532 - dst_vma, dst_addr, 533 - src_addr, &page); 534 - } 535 - 490 + err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, 491 + src_addr, &page, zeropage); 536 492 cond_resched(); 537 493 538 494 if (unlikely(err == -EFAULT)) {
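The extracted `mfill_atomic_pte()` above turns the inline branching into a single dispatch over two booleans (anonymous VMA vs shmem, copy vs zeropage), and notably routes the shmem+zeropage case to `shmem_mfill_zeropage_pte()` where the old code returned `-EINVAL`. The shape of that dispatch, sketched with stand-in handlers (the real functions take mm/pmd/vma arguments):

```c
/* Stand-in handlers returning distinct tags so the routing is visible;
 * purely illustrative, not the kernel signatures. */
typedef int (*mfill_fn)(void);

static int anon_copy(void)  { return 1; }
static int anon_zero(void)  { return 2; }
static int shmem_copy(void) { return 3; }
static int shmem_zero(void) { return 4; } /* was -EINVAL before the patch */

/* Dispatch mirroring mfill_atomic_pte(): rows = anonymous?, cols = zeropage? */
static int mfill_dispatch(int anonymous, int zeropage)
{
	static const mfill_fn table[2][2] = {
		{ shmem_copy, shmem_zero },	/* !anonymous */
		{ anon_copy,  anon_zero  },	/* anonymous  */
	};
	return table[!!anonymous][!!zeropage]();
}
```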
+1 -1
mm/util.c
··· 614 614 return 0; 615 615 616 616 if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) { 617 - free = global_page_state(NR_FREE_PAGES); 617 + free = global_zone_page_state(NR_FREE_PAGES); 618 618 free += global_node_page_state(NR_FILE_PAGES); 619 619 620 620 /*
+7 -13
mm/vmalloc.c
··· 49 49 static void free_work(struct work_struct *w) 50 50 { 51 51 struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq); 52 - struct llist_node *llnode = llist_del_all(&p->list); 53 - while (llnode) { 54 - void *p = llnode; 55 - llnode = llist_next(llnode); 56 - __vunmap(p, 1); 57 - } 52 + struct llist_node *t, *llnode; 53 + 54 + llist_for_each_safe(llnode, t, llist_del_all(&p->list)) 55 + __vunmap((void *)llnode, 1); 58 56 } 59 57 60 58 /*** Page table manipulation functions ***/ ··· 2480 2482 * matching slot. While scanning, if any of the areas overlaps with 2481 2483 * existing vmap_area, the base address is pulled down to fit the 2482 2484 * area. Scanning is repeated till all the areas fit and then all 2483 - * necessary data structres are inserted and the result is returned. 2485 + * necessary data structures are inserted and the result is returned. 2484 2486 */ 2485 2487 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, 2486 2488 const size_t *sizes, int nr_vms, ··· 2508 2510 if (start > offsets[last_area]) 2509 2511 last_area = area; 2510 2512 2511 - for (area2 = 0; area2 < nr_vms; area2++) { 2513 + for (area2 = area + 1; area2 < nr_vms; area2++) { 2512 2514 unsigned long start2 = offsets[area2]; 2513 2515 unsigned long end2 = start2 + sizes[area2]; 2514 2516 2515 - if (area2 == area) 2516 - continue; 2517 - 2518 - BUG_ON(start2 >= start && start2 < end); 2519 - BUG_ON(end2 <= end && end2 > start); 2517 + BUG_ON(start2 < end && start < end2); 2520 2518 } 2521 2519 } 2522 2520 last_end = offsets[last_area] + sizes[last_area];
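The rewritten sanity check in `pcpu_get_vm_areas()` above collapses the two old `BUG_ON`s into the standard half-open interval overlap test, and starts the inner loop at `area + 1` so each pair is checked exactly once (hence the changelog's "halve the number of comparisons"). The predicate itself, as a standalone helper:

```c
#include <stdbool.h>

/* Two half-open ranges [s1, e1) and [s2, e2) overlap iff each one
 * starts before the other ends -- the single condition the patch
 * uses in place of the two separate BUG_ON checks. */
static bool ranges_overlap(unsigned long s1, unsigned long e1,
			   unsigned long s2, unsigned long e2)
{
	return s1 < e2 && s2 < e1;
}
```

Adjacent ranges that merely touch (e.g. [0,10) and [10,20)) do not overlap under this test, matching the kernel's intent.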
+66 -47
mm/vmscan.c
··· 393 393 unsigned long nr_to_scan = min(batch_size, total_scan); 394 394 395 395 shrinkctl->nr_to_scan = nr_to_scan; 396 + shrinkctl->nr_scanned = nr_to_scan; 396 397 ret = shrinker->scan_objects(shrinker, shrinkctl); 397 398 if (ret == SHRINK_STOP) 398 399 break; 399 400 freed += ret; 400 401 401 - count_vm_events(SLABS_SCANNED, nr_to_scan); 402 - total_scan -= nr_to_scan; 403 - scanned += nr_to_scan; 402 + count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned); 403 + total_scan -= shrinkctl->nr_scanned; 404 + scanned += shrinkctl->nr_scanned; 404 405 405 406 cond_resched(); 406 407 } ··· 536 535 * that isolated the page, the page cache radix tree and 537 536 * optional buffer heads at page->private. 538 537 */ 539 - return page_count(page) - page_has_private(page) == 2; 538 + int radix_pins = PageTransHuge(page) && PageSwapCache(page) ? 539 + HPAGE_PMD_NR : 1; 540 + return page_count(page) - page_has_private(page) == 1 + radix_pins; 540 541 } 541 542 542 543 static int may_write_to_inode(struct inode *inode, struct scan_control *sc) ··· 668 665 bool reclaimed) 669 666 { 670 667 unsigned long flags; 668 + int refcount; 671 669 672 670 BUG_ON(!PageLocked(page)); 673 671 BUG_ON(mapping != page_mapping(page)); ··· 699 695 * Note that if SetPageDirty is always performed via set_page_dirty, 700 696 * and thus under tree_lock, then this ordering is not required. 701 697 */ 702 - if (!page_ref_freeze(page, 2)) 698 + if (unlikely(PageTransHuge(page)) && PageSwapCache(page)) 699 + refcount = 1 + HPAGE_PMD_NR; 700 + else 701 + refcount = 2; 702 + if (!page_ref_freeze(page, refcount)) 703 703 goto cannot_free; 704 704 /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */ 705 705 if (unlikely(PageDirty(page))) { 706 - page_ref_unfreeze(page, 2); 706 + page_ref_unfreeze(page, refcount); 707 707 goto cannot_free; 708 708 } 709 709 ··· 1129 1121 * Try to allocate it some swap space here. 
1130 1122 * Lazyfree page could be freed directly 1131 1123 */ 1132 - if (PageAnon(page) && PageSwapBacked(page) && 1133 - !PageSwapCache(page)) { 1134 - if (!(sc->gfp_mask & __GFP_IO)) 1135 - goto keep_locked; 1136 - if (PageTransHuge(page)) { 1137 - /* cannot split THP, skip it */ 1138 - if (!can_split_huge_page(page, NULL)) 1139 - goto activate_locked; 1140 - /* 1141 - * Split pages without a PMD map right 1142 - * away. Chances are some or all of the 1143 - * tail pages can be freed without IO. 1144 - */ 1145 - if (!compound_mapcount(page) && 1146 - split_huge_page_to_list(page, page_list)) 1147 - goto activate_locked; 1148 - } 1149 - if (!add_to_swap(page)) { 1150 - if (!PageTransHuge(page)) 1151 - goto activate_locked; 1152 - /* Split THP and swap individual base pages */ 1153 - if (split_huge_page_to_list(page, page_list)) 1154 - goto activate_locked; 1155 - if (!add_to_swap(page)) 1156 - goto activate_locked; 1157 - } 1124 + if (PageAnon(page) && PageSwapBacked(page)) { 1125 + if (!PageSwapCache(page)) { 1126 + if (!(sc->gfp_mask & __GFP_IO)) 1127 + goto keep_locked; 1128 + if (PageTransHuge(page)) { 1129 + /* cannot split THP, skip it */ 1130 + if (!can_split_huge_page(page, NULL)) 1131 + goto activate_locked; 1132 + /* 1133 + * Split pages without a PMD map right 1134 + * away. Chances are some or all of the 1135 + * tail pages can be freed without IO. 
1136 + */ 1137 + if (!compound_mapcount(page) && 1138 + split_huge_page_to_list(page, 1139 + page_list)) 1140 + goto activate_locked; 1141 + } 1142 + if (!add_to_swap(page)) { 1143 + if (!PageTransHuge(page)) 1144 + goto activate_locked; 1145 + /* Fallback to swap normal pages */ 1146 + if (split_huge_page_to_list(page, 1147 + page_list)) 1148 + goto activate_locked; 1149 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1150 + count_vm_event(THP_SWPOUT_FALLBACK); 1151 + #endif 1152 + if (!add_to_swap(page)) 1153 + goto activate_locked; 1154 + } 1158 1155 1159 - /* XXX: We don't support THP writes */ 1160 - if (PageTransHuge(page) && 1161 - split_huge_page_to_list(page, page_list)) { 1162 - delete_from_swap_cache(page); 1163 - goto activate_locked; 1156 + may_enter_fs = 1; 1157 + 1158 + /* Adding to swap updated mapping */ 1159 + mapping = page_mapping(page); 1164 1160 } 1165 - 1166 - may_enter_fs = 1; 1167 - 1168 - /* Adding to swap updated mapping */ 1169 - mapping = page_mapping(page); 1170 1161 } else if (unlikely(PageTransHuge(page))) { 1171 1162 /* Split file THP */ 1172 1163 if (split_huge_page_to_list(page, page_list)) 1173 1164 goto keep_locked; 1174 1165 } 1175 1166 1176 - VM_BUG_ON_PAGE(PageTransHuge(page), page); 1177 - 1178 1167 /* 1179 1168 * The page is mapped into the page tables of one or more 1180 1169 * processes. Try to unmap it here. 1181 1170 */ 1182 1171 if (page_mapped(page)) { 1183 - if (!try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH)) { 1172 + enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH; 1173 + 1174 + if (unlikely(PageTransHuge(page))) 1175 + flags |= TTU_SPLIT_HUGE_PMD; 1176 + if (!try_to_unmap(page, flags)) { 1184 1177 nr_unmap_fail++; 1185 1178 goto activate_locked; 1186 1179 } ··· 1321 1312 * Is there need to periodically free_page_list? 
It would 1322 1313 * appear not as the counts should be low 1323 1314 */ 1324 - list_add(&page->lru, &free_pages); 1315 + if (unlikely(PageTransHuge(page))) { 1316 + mem_cgroup_uncharge(page); 1317 + (*get_compound_page_dtor(page))(page); 1318 + } else 1319 + list_add(&page->lru, &free_pages); 1325 1320 continue; 1326 1321 1327 1322 activate_locked: ··· 1755 1742 int file = is_file_lru(lru); 1756 1743 struct pglist_data *pgdat = lruvec_pgdat(lruvec); 1757 1744 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; 1745 + bool stalled = false; 1758 1746 1759 1747 while (unlikely(too_many_isolated(pgdat, file, sc))) { 1760 - congestion_wait(BLK_RW_ASYNC, HZ/10); 1748 + if (stalled) 1749 + return 0; 1750 + 1751 + /* wait a bit for the reclaimer. */ 1752 + msleep(100); 1753 + stalled = true; 1761 1754 1762 1755 /* We are about to die and free our memory. Return now. */ 1763 1756 if (fatal_signal_pending(current))
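The `is_page_cache_freeable()` change above accounts for a THP in swap cache holding `HPAGE_PMD_NR` radix-tree pins, one per subpage slot, instead of a single pin. A sketch of just that arithmetic (constant value assumed for x86-64 with 4 KB pages; the real code derives it from the page):

```c
#include <stdbool.h>

#define HPAGE_PMD_NR 512	/* assumed: 2MB huge page / 4KB base page */

/* Mirror of the patched is_page_cache_freeable() arithmetic: a page is
 * freeable when the only references left are the isolating caller, the
 * radix-tree pin(s), and any private buffer heads. A THP in swap cache
 * holds HPAGE_PMD_NR radix pins rather than one. */
static bool cache_freeable(int page_count, int has_private,
			   bool thp_in_swapcache)
{
	int radix_pins = thp_in_swapcache ? HPAGE_PMD_NR : 1;

	return page_count - has_private == 1 + radix_pins;
}
```

The same `1 + HPAGE_PMD_NR` count reappears in `__remove_mapping()`'s `page_ref_freeze()` call, for the same reason.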
+12 -3
mm/vmstat.c
··· 870 870 { 871 871 unsigned long requested = 1UL << order; 872 872 873 + if (WARN_ON_ONCE(order >= MAX_ORDER)) 874 + return 0; 875 + 873 876 if (!info->free_blocks_total) 874 877 return 0; 875 878 ··· 1074 1071 #endif 1075 1072 "thp_zero_page_alloc", 1076 1073 "thp_zero_page_alloc_failed", 1074 + "thp_swpout", 1075 + "thp_swpout_fallback", 1077 1076 #endif 1078 1077 #ifdef CONFIG_MEMORY_BALLOON 1079 1078 "balloon_inflate", ··· 1097 1092 "vmacache_find_calls", 1098 1093 "vmacache_find_hits", 1099 1094 "vmacache_full_flushes", 1095 + #endif 1096 + #ifdef CONFIG_SWAP 1097 + "swap_ra", 1098 + "swap_ra_hit", 1100 1099 #endif 1101 1100 #endif /* CONFIG_VM_EVENTS_COUNTERS */ 1102 1101 }; ··· 1259 1250 seq_putc(m, '\n'); 1260 1251 } 1261 1252 1262 - /* Print out the free pages at each order for each migratetype */ 1253 + /* Print out the number of pageblocks for each migratetype */ 1263 1254 static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg) 1264 1255 { 1265 1256 int mtype; ··· 1509 1500 if (!v) 1510 1501 return ERR_PTR(-ENOMEM); 1511 1502 for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) 1512 - v[i] = global_page_state(i); 1503 + v[i] = global_zone_page_state(i); 1513 1504 v += NR_VM_ZONE_STAT_ITEMS; 1514 1505 1515 1506 for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) ··· 1598 1589 * which can equally be echo'ed to or cat'ted from (by root), 1599 1590 * can be used to update the stats just before reading them. 1600 1591 * 1601 - * Oh, and since global_page_state() etc. are so careful to hide 1592 + * Oh, and since global_zone_page_state() etc. are so careful to hide 1602 1593 * transiently negative values, report an error here if any of 1603 1594 * the stats is negative, so we know to go looking for imbalance. 1604 1595 */
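The new guard in `fill_contig_page_info()` above rejects an out-of-range order before `1UL << order` or any `free_area[]` indexing happens. A sketch of the guard's effect (MAX_ORDER value assumed; it is arch-configurable):

```c
#define MAX_ORDER 11	/* assumed default; configurable per arch */

/* Mirror of the new bounds check: an order >= MAX_ORDER would index
 * past the zone's free_area[] array, so report zero blocks instead
 * of reading garbage (the kernel additionally WARNs once). */
static unsigned long blocks_of_order(const unsigned long *free_blocks,
				     int order)
{
	if (order >= MAX_ORDER)
		return 0;
	return free_blocks[order];
}
```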
+347 -138
mm/z3fold.c
··· 23 23 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 24 24 25 25 #include <linux/atomic.h> 26 + #include <linux/sched.h> 26 27 #include <linux/list.h> 27 28 #include <linux/mm.h> 28 29 #include <linux/module.h> 30 + #include <linux/percpu.h> 29 31 #include <linux/preempt.h> 32 + #include <linux/workqueue.h> 30 33 #include <linux/slab.h> 31 34 #include <linux/spinlock.h> 32 35 #include <linux/zpool.h> ··· 51 48 }; 52 49 53 50 /* 54 - * struct z3fold_header - z3fold page metadata occupying the first chunk of each 51 + * struct z3fold_header - z3fold page metadata occupying first chunks of each 55 52 * z3fold page, except for HEADLESS pages 56 - * @buddy: links the z3fold page into the relevant list in the pool 53 + * @buddy: links the z3fold page into the relevant list in the 54 + * pool 57 55 * @page_lock: per-page lock 58 - * @refcount: reference cound for the z3fold page 56 + * @refcount: reference count for the z3fold page 57 + * @work: work_struct for page layout optimization 58 + * @pool: pointer to the pool which this page belongs to 59 + * @cpu: CPU which this page "belongs" to 59 60 * @first_chunks: the size of the first buddy in chunks, 0 if free 60 61 * @middle_chunks: the size of the middle buddy in chunks, 0 if free 61 62 * @last_chunks: the size of the last buddy in chunks, 0 if free ··· 69 62 struct list_head buddy; 70 63 spinlock_t page_lock; 71 64 struct kref refcount; 65 + struct work_struct work; 66 + struct z3fold_pool *pool; 67 + short cpu; 72 68 unsigned short first_chunks; 73 69 unsigned short middle_chunks; 74 70 unsigned short last_chunks; ··· 102 92 103 93 /** 104 94 * struct z3fold_pool - stores metadata for each z3fold pool 105 - * @lock: protects all pool fields and first|last_chunk fields of any 106 - * z3fold page in the pool 107 - * @unbuddied: array of lists tracking z3fold pages that contain 2- buddies; 108 - * the lists each z3fold page is added to depends on the size of 109 - * its free region. 
95 + * @name: pool name 96 + * @lock: protects pool unbuddied/lru lists 97 + * @stale_lock: protects pool stale page list 98 + * @unbuddied: per-cpu array of lists tracking z3fold pages that contain 2- 99 + * buddies; the list each z3fold page is added to depends on 100 + * the size of its free region. 110 101 * @lru: list tracking the z3fold pages in LRU order by most recently 111 102 * added buddy. 103 + * @stale: list of pages marked for freeing 112 104 * @pages_nr: number of z3fold pages in the pool. 113 105 * @ops: pointer to a structure of user defined operations specified at 114 106 * pool creation time. 107 + * @compact_wq: workqueue for page layout background optimization 108 + * @release_wq: workqueue for safe page release 109 + * @work: work_struct for safe page release 115 110 * 116 111 * This structure is allocated at pool creation time and maintains metadata 117 112 * pertaining to a particular z3fold pool. 118 113 */ 119 114 struct z3fold_pool { 115 + const char *name; 120 116 spinlock_t lock; 121 - struct list_head unbuddied[NCHUNKS]; 117 + spinlock_t stale_lock; 118 + struct list_head *unbuddied; 122 119 struct list_head lru; 120 + struct list_head stale; 123 121 atomic64_t pages_nr; 124 122 const struct z3fold_ops *ops; 125 123 struct zpool *zpool; 126 124 const struct zpool_ops *zpool_ops; 125 + struct workqueue_struct *compact_wq; 126 + struct workqueue_struct *release_wq; 127 + struct work_struct work; 127 128 }; 128 129 129 130 /* ··· 143 122 enum z3fold_page_flags { 144 123 PAGE_HEADLESS = 0, 145 124 MIDDLE_CHUNK_MAPPED, 125 + NEEDS_COMPACTING, 126 + PAGE_STALE 146 127 }; 147 - 148 128 149 129 /***************** 150 130 * Helpers ··· 160 138 #define for_each_unbuddied_list(_iter, _begin) \ 161 139 for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) 162 140 141 + static void compact_page_work(struct work_struct *w); 142 + 163 143 /* Initializes the z3fold header of a newly allocated z3fold page */ 164 - static struct z3fold_header 
*init_z3fold_page(struct page *page) 144 + static struct z3fold_header *init_z3fold_page(struct page *page, 145 + struct z3fold_pool *pool) 165 146 { 166 147 struct z3fold_header *zhdr = page_address(page); 167 148 168 149 INIT_LIST_HEAD(&page->lru); 169 150 clear_bit(PAGE_HEADLESS, &page->private); 170 151 clear_bit(MIDDLE_CHUNK_MAPPED, &page->private); 152 + clear_bit(NEEDS_COMPACTING, &page->private); 153 + clear_bit(PAGE_STALE, &page->private); 171 154 172 155 spin_lock_init(&zhdr->page_lock); 173 156 kref_init(&zhdr->refcount); ··· 181 154 zhdr->last_chunks = 0; 182 155 zhdr->first_num = 0; 183 156 zhdr->start_middle = 0; 157 + zhdr->cpu = -1; 158 + zhdr->pool = pool; 184 159 INIT_LIST_HEAD(&zhdr->buddy); 160 + INIT_WORK(&zhdr->work, compact_page_work); 185 161 return zhdr; 186 162 } 187 163 ··· 192 162 static void free_z3fold_page(struct page *page) 193 163 { 194 164 __free_page(page); 195 - } 196 - 197 - static void release_z3fold_page(struct kref *ref) 198 - { 199 - struct z3fold_header *zhdr; 200 - struct page *page; 201 - 202 - zhdr = container_of(ref, struct z3fold_header, refcount); 203 - page = virt_to_page(zhdr); 204 - 205 - if (!list_empty(&zhdr->buddy)) 206 - list_del(&zhdr->buddy); 207 - if (!list_empty(&page->lru)) 208 - list_del(&page->lru); 209 - free_z3fold_page(page); 210 165 } 211 166 212 167 /* Lock a z3fold page */ ··· 243 228 return (handle - zhdr->first_num) & BUDDY_MASK; 244 229 } 245 230 231 + static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked) 232 + { 233 + struct page *page = virt_to_page(zhdr); 234 + struct z3fold_pool *pool = zhdr->pool; 235 + 236 + WARN_ON(!list_empty(&zhdr->buddy)); 237 + set_bit(PAGE_STALE, &page->private); 238 + spin_lock(&pool->lock); 239 + if (!list_empty(&page->lru)) 240 + list_del(&page->lru); 241 + spin_unlock(&pool->lock); 242 + if (locked) 243 + z3fold_page_unlock(zhdr); 244 + spin_lock(&pool->stale_lock); 245 + list_add(&zhdr->buddy, &pool->stale); 246 + 
queue_work(pool->release_wq, &pool->work); 247 + spin_unlock(&pool->stale_lock); 248 + } 249 + 250 + static void __attribute__((__unused__)) 251 + release_z3fold_page(struct kref *ref) 252 + { 253 + struct z3fold_header *zhdr = container_of(ref, struct z3fold_header, 254 + refcount); 255 + __release_z3fold_page(zhdr, false); 256 + } 257 + 258 + static void release_z3fold_page_locked(struct kref *ref) 259 + { 260 + struct z3fold_header *zhdr = container_of(ref, struct z3fold_header, 261 + refcount); 262 + WARN_ON(z3fold_page_trylock(zhdr)); 263 + __release_z3fold_page(zhdr, true); 264 + } 265 + 266 + static void release_z3fold_page_locked_list(struct kref *ref) 267 + { 268 + struct z3fold_header *zhdr = container_of(ref, struct z3fold_header, 269 + refcount); 270 + spin_lock(&zhdr->pool->lock); 271 + list_del_init(&zhdr->buddy); 272 + spin_unlock(&zhdr->pool->lock); 273 + 274 + WARN_ON(z3fold_page_trylock(zhdr)); 275 + __release_z3fold_page(zhdr, true); 276 + } 277 + 278 + static void free_pages_work(struct work_struct *w) 279 + { 280 + struct z3fold_pool *pool = container_of(w, struct z3fold_pool, work); 281 + 282 + spin_lock(&pool->stale_lock); 283 + while (!list_empty(&pool->stale)) { 284 + struct z3fold_header *zhdr = list_first_entry(&pool->stale, 285 + struct z3fold_header, buddy); 286 + struct page *page = virt_to_page(zhdr); 287 + 288 + list_del(&zhdr->buddy); 289 + if (WARN_ON(!test_bit(PAGE_STALE, &page->private))) 290 + continue; 291 + clear_bit(NEEDS_COMPACTING, &page->private); 292 + spin_unlock(&pool->stale_lock); 293 + cancel_work_sync(&zhdr->work); 294 + free_z3fold_page(page); 295 + cond_resched(); 296 + spin_lock(&pool->stale_lock); 297 + } 298 + spin_unlock(&pool->stale_lock); 299 + } 300 + 246 301 /* 247 302 * Returns the number of free chunks in a z3fold page. 248 303 * NB: can't be used with HEADLESS pages. 
··· 335 250 } else 336 251 nfree = NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; 337 252 return nfree; 338 - } 339 - 340 - /***************** 341 - * API Functions 342 - *****************/ 343 - /** 344 - * z3fold_create_pool() - create a new z3fold pool 345 - * @gfp: gfp flags when allocating the z3fold pool structure 346 - * @ops: user-defined operations for the z3fold pool 347 - * 348 - * Return: pointer to the new z3fold pool or NULL if the metadata allocation 349 - * failed. 350 - */ 351 - static struct z3fold_pool *z3fold_create_pool(gfp_t gfp, 352 - const struct z3fold_ops *ops) 353 - { 354 - struct z3fold_pool *pool; 355 - int i; 356 - 357 - pool = kzalloc(sizeof(struct z3fold_pool), gfp); 358 - if (!pool) 359 - return NULL; 360 - spin_lock_init(&pool->lock); 361 - for_each_unbuddied_list(i, 0) 362 - INIT_LIST_HEAD(&pool->unbuddied[i]); 363 - INIT_LIST_HEAD(&pool->lru); 364 - atomic64_set(&pool->pages_nr, 0); 365 - pool->ops = ops; 366 - return pool; 367 - } 368 - 369 - /** 370 - * z3fold_destroy_pool() - destroys an existing z3fold pool 371 - * @pool: the z3fold pool to be destroyed 372 - * 373 - * The pool should be emptied before this function is called. 
374 - */ 375 - static void z3fold_destroy_pool(struct z3fold_pool *pool) 376 - { 377 - kfree(pool); 378 253 } 379 254 380 255 static inline void *mchunk_memmove(struct z3fold_header *zhdr, ··· 392 347 return 0; 393 348 } 394 349 350 + static void do_compact_page(struct z3fold_header *zhdr, bool locked) 351 + { 352 + struct z3fold_pool *pool = zhdr->pool; 353 + struct page *page; 354 + struct list_head *unbuddied; 355 + int fchunks; 356 + 357 + page = virt_to_page(zhdr); 358 + if (locked) 359 + WARN_ON(z3fold_page_trylock(zhdr)); 360 + else 361 + z3fold_page_lock(zhdr); 362 + if (test_bit(PAGE_STALE, &page->private) || 363 + !test_and_clear_bit(NEEDS_COMPACTING, &page->private)) { 364 + z3fold_page_unlock(zhdr); 365 + return; 366 + } 367 + spin_lock(&pool->lock); 368 + list_del_init(&zhdr->buddy); 369 + spin_unlock(&pool->lock); 370 + 371 + z3fold_compact_page(zhdr); 372 + unbuddied = get_cpu_ptr(pool->unbuddied); 373 + fchunks = num_free_chunks(zhdr); 374 + if (fchunks < NCHUNKS && 375 + (!zhdr->first_chunks || !zhdr->middle_chunks || 376 + !zhdr->last_chunks)) { 377 + /* the page's not completely free and it's unbuddied */ 378 + spin_lock(&pool->lock); 379 + list_add(&zhdr->buddy, &unbuddied[fchunks]); 380 + spin_unlock(&pool->lock); 381 + zhdr->cpu = smp_processor_id(); 382 + } 383 + put_cpu_ptr(pool->unbuddied); 384 + z3fold_page_unlock(zhdr); 385 + } 386 + 387 + static void compact_page_work(struct work_struct *w) 388 + { 389 + struct z3fold_header *zhdr = container_of(w, struct z3fold_header, 390 + work); 391 + 392 + do_compact_page(zhdr, false); 393 + } 394 + 395 + 396 + /* 397 + * API Functions 398 + */ 399 + 400 + /** 401 + * z3fold_create_pool() - create a new z3fold pool 402 + * @name: pool name 403 + * @gfp: gfp flags when allocating the z3fold pool structure 404 + * @ops: user-defined operations for the z3fold pool 405 + * 406 + * Return: pointer to the new z3fold pool or NULL if the metadata allocation 407 + * failed. 
408 + */ 409 + static struct z3fold_pool *z3fold_create_pool(const char *name, gfp_t gfp, 410 + const struct z3fold_ops *ops) 411 + { 412 + struct z3fold_pool *pool = NULL; 413 + int i, cpu; 414 + 415 + pool = kzalloc(sizeof(struct z3fold_pool), gfp); 416 + if (!pool) 417 + goto out; 418 + spin_lock_init(&pool->lock); 419 + spin_lock_init(&pool->stale_lock); 420 + pool->unbuddied = __alloc_percpu(sizeof(struct list_head)*NCHUNKS, 2); 421 + for_each_possible_cpu(cpu) { 422 + struct list_head *unbuddied = 423 + per_cpu_ptr(pool->unbuddied, cpu); 424 + for_each_unbuddied_list(i, 0) 425 + INIT_LIST_HEAD(&unbuddied[i]); 426 + } 427 + INIT_LIST_HEAD(&pool->lru); 428 + INIT_LIST_HEAD(&pool->stale); 429 + atomic64_set(&pool->pages_nr, 0); 430 + pool->name = name; 431 + pool->compact_wq = create_singlethread_workqueue(pool->name); 432 + if (!pool->compact_wq) 433 + goto out; 434 + pool->release_wq = create_singlethread_workqueue(pool->name); 435 + if (!pool->release_wq) 436 + goto out_wq; 437 + INIT_WORK(&pool->work, free_pages_work); 438 + pool->ops = ops; 439 + return pool; 440 + 441 + out_wq: 442 + destroy_workqueue(pool->compact_wq); 443 + out: 444 + kfree(pool); 445 + return NULL; 446 + } 447 + 448 + /** 449 + * z3fold_destroy_pool() - destroys an existing z3fold pool 450 + * @pool: the z3fold pool to be destroyed 451 + * 452 + * The pool should be emptied before this function is called. 
453 + */ 454 + static void z3fold_destroy_pool(struct z3fold_pool *pool) 455 + { 456 + destroy_workqueue(pool->release_wq); 457 + destroy_workqueue(pool->compact_wq); 458 + kfree(pool); 459 + } 460 + 395 461 /** 396 462 * z3fold_alloc() - allocates a region of a given size 397 463 * @pool: z3fold pool from which to allocate ··· 527 371 { 528 372 int chunks = 0, i, freechunks; 529 373 struct z3fold_header *zhdr = NULL; 374 + struct page *page = NULL; 530 375 enum buddy bud; 531 - struct page *page; 376 + bool can_sleep = (gfp & __GFP_RECLAIM) == __GFP_RECLAIM; 532 377 533 378 if (!size || (gfp & __GFP_HIGHMEM)) 534 379 return -EINVAL; ··· 540 383 if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE) 541 384 bud = HEADLESS; 542 385 else { 386 + struct list_head *unbuddied; 543 387 chunks = size_to_chunks(size); 544 388 389 + lookup: 545 390 /* First, try to find an unbuddied z3fold page. */ 546 - zhdr = NULL; 391 + unbuddied = get_cpu_ptr(pool->unbuddied); 547 392 for_each_unbuddied_list(i, chunks) { 548 - spin_lock(&pool->lock); 549 - zhdr = list_first_entry_or_null(&pool->unbuddied[i], 393 + struct list_head *l = &unbuddied[i]; 394 + 395 + zhdr = list_first_entry_or_null(READ_ONCE(l), 550 396 struct z3fold_header, buddy); 551 - if (!zhdr || !z3fold_page_trylock(zhdr)) { 552 - spin_unlock(&pool->lock); 397 + 398 + if (!zhdr) 553 399 continue; 400 + 401 + /* Re-check under lock. 
*/ 402 + spin_lock(&pool->lock); 403 + l = &unbuddied[i]; 404 + if (unlikely(zhdr != list_first_entry(READ_ONCE(l), 405 + struct z3fold_header, buddy)) || 406 + !z3fold_page_trylock(zhdr)) { 407 + spin_unlock(&pool->lock); 408 + put_cpu_ptr(pool->unbuddied); 409 + goto lookup; 554 410 } 555 - kref_get(&zhdr->refcount); 556 411 list_del_init(&zhdr->buddy); 412 + zhdr->cpu = -1; 557 413 spin_unlock(&pool->lock); 558 414 559 415 page = virt_to_page(zhdr); 416 + if (test_bit(NEEDS_COMPACTING, &page->private)) { 417 + z3fold_page_unlock(zhdr); 418 + zhdr = NULL; 419 + put_cpu_ptr(pool->unbuddied); 420 + if (can_sleep) 421 + cond_resched(); 422 + goto lookup; 423 + } 424 + 425 + /* 426 + * this page could not be removed from its unbuddied 427 + * list while pool lock was held, and then we've taken 428 + * page lock so kref_put could not be called before 429 + * we got here, so it's safe to just call kref_get() 430 + */ 431 + kref_get(&zhdr->refcount); 432 + break; 433 + } 434 + put_cpu_ptr(pool->unbuddied); 435 + 436 + if (zhdr) { 560 437 if (zhdr->first_chunks == 0) { 561 438 if (zhdr->middle_chunks != 0 && 562 439 chunks >= zhdr->start_middle) ··· 602 411 else if (zhdr->middle_chunks == 0) 603 412 bud = MIDDLE; 604 413 else { 605 - z3fold_page_unlock(zhdr); 606 - spin_lock(&pool->lock); 607 414 if (kref_put(&zhdr->refcount, 608 - release_z3fold_page)) 415 + release_z3fold_page_locked)) 609 416 atomic64_dec(&pool->pages_nr); 610 - spin_unlock(&pool->lock); 417 + else 418 + z3fold_page_unlock(zhdr); 611 419 pr_err("No free chunks in unbuddied\n"); 612 420 WARN_ON(1); 613 - continue; 421 + goto lookup; 614 422 } 615 423 goto found; 616 424 } 617 425 bud = FIRST; 618 426 } 619 427 620 - /* Couldn't find unbuddied z3fold page, create new one */ 621 - page = alloc_page(gfp); 428 + spin_lock(&pool->stale_lock); 429 + zhdr = list_first_entry_or_null(&pool->stale, 430 + struct z3fold_header, buddy); 431 + /* 432 + * Before allocating a page, let's see if we can take one from 
the 433 + * stale pages list. cancel_work_sync() can sleep so we must make 434 + * sure it won't be called in case we're in atomic context. 435 + */ 436 + if (zhdr && (can_sleep || !work_pending(&zhdr->work) || 437 + !unlikely(work_busy(&zhdr->work)))) { 438 + list_del(&zhdr->buddy); 439 + clear_bit(NEEDS_COMPACTING, &page->private); 440 + spin_unlock(&pool->stale_lock); 441 + if (can_sleep) 442 + cancel_work_sync(&zhdr->work); 443 + page = virt_to_page(zhdr); 444 + } else { 445 + spin_unlock(&pool->stale_lock); 446 + page = alloc_page(gfp); 447 + } 448 + 622 449 if (!page) 623 450 return -ENOMEM; 624 451 625 452 atomic64_inc(&pool->pages_nr); 626 - zhdr = init_z3fold_page(page); 453 + zhdr = init_z3fold_page(page, pool); 627 454 628 455 if (bud == HEADLESS) { 629 456 set_bit(PAGE_HEADLESS, &page->private); 630 - spin_lock(&pool->lock); 631 457 goto headless; 632 458 } 633 459 z3fold_page_lock(zhdr); ··· 659 451 zhdr->start_middle = zhdr->first_chunks + ZHDR_CHUNKS; 660 452 } 661 453 662 - spin_lock(&pool->lock); 663 454 if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || 664 455 zhdr->middle_chunks == 0) { 456 + struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); 457 + 665 458 /* Add to unbuddied list */ 666 459 freechunks = num_free_chunks(zhdr); 667 - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 460 + spin_lock(&pool->lock); 461 + list_add(&zhdr->buddy, &unbuddied[freechunks]); 462 + spin_unlock(&pool->lock); 463 + zhdr->cpu = smp_processor_id(); 464 + put_cpu_ptr(pool->unbuddied); 668 465 } 669 466 670 467 headless: 468 + spin_lock(&pool->lock); 671 469 /* Add/move z3fold page to beginning of LRU */ 672 470 if (!list_empty(&page->lru)) 673 471 list_del(&page->lru); ··· 701 487 static void z3fold_free(struct z3fold_pool *pool, unsigned long handle) 702 488 { 703 489 struct z3fold_header *zhdr; 704 - int freechunks; 705 490 struct page *page; 706 491 enum buddy bud; 707 492 ··· 739 526 spin_unlock(&pool->lock); 740 527 
free_z3fold_page(page); 741 528 atomic64_dec(&pool->pages_nr); 742 - } else { 743 - if (zhdr->first_chunks != 0 || zhdr->middle_chunks != 0 || 744 - zhdr->last_chunks != 0) { 745 - z3fold_compact_page(zhdr); 746 - /* Add to the unbuddied list */ 747 - spin_lock(&pool->lock); 748 - if (!list_empty(&zhdr->buddy)) 749 - list_del(&zhdr->buddy); 750 - freechunks = num_free_chunks(zhdr); 751 - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); 752 - spin_unlock(&pool->lock); 753 - } 754 - z3fold_page_unlock(zhdr); 755 - spin_lock(&pool->lock); 756 - if (kref_put(&zhdr->refcount, release_z3fold_page)) 757 - atomic64_dec(&pool->pages_nr); 758 - spin_unlock(&pool->lock); 529 + return; 759 530 } 760 531 532 + if (kref_put(&zhdr->refcount, release_z3fold_page_locked_list)) { 533 + atomic64_dec(&pool->pages_nr); 534 + return; 535 + } 536 + if (test_and_set_bit(NEEDS_COMPACTING, &page->private)) { 537 + z3fold_page_unlock(zhdr); 538 + return; 539 + } 540 + if (zhdr->cpu < 0 || !cpu_online(zhdr->cpu)) { 541 + spin_lock(&pool->lock); 542 + list_del_init(&zhdr->buddy); 543 + spin_unlock(&pool->lock); 544 + zhdr->cpu = -1; 545 + do_compact_page(zhdr, true); 546 + return; 547 + } 548 + queue_work_on(zhdr->cpu, pool->compact_wq, &zhdr->work); 549 + z3fold_page_unlock(zhdr); 761 550 } 762 551 763 552 /** ··· 800 585 */ 801 586 static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) 802 587 { 803 - int i, ret = 0, freechunks; 804 - struct z3fold_header *zhdr; 805 - struct page *page; 588 + int i, ret = 0; 589 + struct z3fold_header *zhdr = NULL; 590 + struct page *page = NULL; 591 + struct list_head *pos; 806 592 unsigned long first_handle = 0, middle_handle = 0, last_handle = 0; 807 593 808 594 spin_lock(&pool->lock); ··· 816 600 spin_unlock(&pool->lock); 817 601 return -EINVAL; 818 602 } 819 - page = list_last_entry(&pool->lru, struct page, lru); 820 - list_del_init(&page->lru); 603 + list_for_each_prev(pos, &pool->lru) { 604 + page = list_entry(pos, struct 
page, lru); 605 + if (test_bit(PAGE_HEADLESS, &page->private)) 606 + /* candidate found */ 607 + break; 821 608 822 - zhdr = page_address(page); 823 - if (!test_bit(PAGE_HEADLESS, &page->private)) { 824 - if (!list_empty(&zhdr->buddy)) 825 - list_del_init(&zhdr->buddy); 609 + zhdr = page_address(page); 610 + if (!z3fold_page_trylock(zhdr)) 611 + continue; /* can't evict at this point */ 826 612 kref_get(&zhdr->refcount); 827 - spin_unlock(&pool->lock); 828 - z3fold_page_lock(zhdr); 613 + list_del_init(&zhdr->buddy); 614 + zhdr->cpu = -1; 615 + } 616 + 617 + list_del_init(&page->lru); 618 + spin_unlock(&pool->lock); 619 + 620 + if (!test_bit(PAGE_HEADLESS, &page->private)) { 829 621 /* 830 622 * We need to encode the handles before unlocking, since 831 623 * we can race with free that will set ··· 848 624 middle_handle = encode_handle(zhdr, MIDDLE); 849 625 if (zhdr->last_chunks) 850 626 last_handle = encode_handle(zhdr, LAST); 627 + /* 628 + * it's safe to unlock here because we hold a 629 + * reference to this page 630 + */ 851 631 z3fold_page_unlock(zhdr); 852 632 } else { 853 633 first_handle = encode_handle(zhdr, HEADLESS); 854 634 last_handle = middle_handle = 0; 855 - spin_unlock(&pool->lock); 856 635 } 857 636 858 637 /* Issue the eviction callback(s) */ ··· 879 652 if (ret == 0) { 880 653 free_z3fold_page(page); 881 654 return 0; 882 - } else { 883 - spin_lock(&pool->lock); 884 655 } 885 - } else { 886 - z3fold_page_lock(zhdr); 887 - if ((zhdr->first_chunks || zhdr->last_chunks || 888 - zhdr->middle_chunks) && 889 - !(zhdr->first_chunks && zhdr->last_chunks && 890 - zhdr->middle_chunks)) { 891 - z3fold_compact_page(zhdr); 892 - /* add to unbuddied list */ 893 - spin_lock(&pool->lock); 894 - freechunks = num_free_chunks(zhdr); 895 - list_add(&zhdr->buddy, 896 - &pool->unbuddied[freechunks]); 897 - spin_unlock(&pool->lock); 898 - } 899 - z3fold_page_unlock(zhdr); 900 - spin_lock(&pool->lock); 901 - if (kref_put(&zhdr->refcount, release_z3fold_page)) { 902 - 
spin_unlock(&pool->lock); 903 - atomic64_dec(&pool->pages_nr); 904 - return 0; 905 - } 656 + } else if (kref_put(&zhdr->refcount, release_z3fold_page)) { 657 + atomic64_dec(&pool->pages_nr); 658 + return 0; 906 659 } 660 + spin_lock(&pool->lock); 907 661 908 662 /* 909 663 * Add to the beginning of LRU. ··· 1003 795 { 1004 796 struct z3fold_pool *pool; 1005 797 1006 - pool = z3fold_create_pool(gfp, zpool_ops ? &z3fold_zpool_ops : NULL); 798 + pool = z3fold_create_pool(name, gfp, 799 + zpool_ops ? &z3fold_zpool_ops : NULL); 1007 800 if (pool) { 1008 801 pool->zpool = zpool; 1009 802 pool->zpool_ops = zpool_ops;
+5 -3
mm/zsmalloc.c
··· 1983 1983 1984 1984 spin_lock(&class->lock); 1985 1985 if (!get_zspage_inuse(zspage)) { 1986 - ret = -EBUSY; 1987 - goto unlock_class; 1986 + /* 1987 + * Set "offset" to end of the page so that every loop 1988 + * skips unnecessary object scanning. 1989 + */ 1990 + offset = PAGE_SIZE; 1988 1991 } 1989 1992 1990 1993 pos = offset; ··· 2055 2052 } 2056 2053 } 2057 2054 kunmap_atomic(s_addr); 2058 - unlock_class: 2059 2055 spin_unlock(&class->lock); 2060 2056 migrate_write_unlock(zspage); 2061 2057
+11 -16
scripts/mod/modpost.c
··· 261 261 return export_unknown; 262 262 } 263 263 264 - static const char *sec_name(struct elf_info *elf, int secindex); 264 + static const char *sech_name(struct elf_info *elf, Elf_Shdr *sechdr) 265 + { 266 + return (void *)elf->hdr + 267 + elf->sechdrs[elf->secindex_strings].sh_offset + 268 + sechdr->sh_name; 269 + } 270 + 271 + static const char *sec_name(struct elf_info *elf, int secindex) 272 + { 273 + return sech_name(elf, &elf->sechdrs[secindex]); 274 + } 265 275 266 276 #define strstarts(str, prefix) (strncmp(str, prefix, strlen(prefix)) == 0) 267 277 ··· 783 773 return elf->strtab + sym->st_name; 784 774 else 785 775 return "(unknown)"; 786 - } 787 - 788 - static const char *sec_name(struct elf_info *elf, int secindex) 789 - { 790 - Elf_Shdr *sechdrs = elf->sechdrs; 791 - return (void *)elf->hdr + 792 - elf->sechdrs[elf->secindex_strings].sh_offset + 793 - sechdrs[secindex].sh_name; 794 - } 795 - 796 - static const char *sech_name(struct elf_info *elf, Elf_Shdr *sechdr) 797 - { 798 - return (void *)elf->hdr + 799 - elf->sechdrs[elf->secindex_strings].sh_offset + 800 - sechdr->sh_name; 801 776 } 802 777 803 778 /* The pattern is an array of simple patterns.
+1 -1
tools/testing/selftests/memfd/Makefile
··· 3 3 CFLAGS += -I../../../../include/ 4 4 CFLAGS += -I../../../../usr/include/ 5 5 6 - TEST_PROGS := run_fuse_test.sh 6 + TEST_PROGS := run_tests.sh 7 7 TEST_GEN_FILES := memfd_test fuse_mnt fuse_test 8 8 9 9 fuse_mnt.o: CFLAGS += $(shell pkg-config fuse --cflags)
+288 -86
tools/testing/selftests/memfd/memfd_test.c
··· 18 18 #include <sys/wait.h> 19 19 #include <unistd.h> 20 20 21 + #define MEMFD_STR "memfd:" 22 + #define SHARED_FT_STR "(shared file-table)" 23 + 21 24 #define MFD_DEF_SIZE 8192 22 25 #define STACK_SIZE 65536 26 + 27 + /* 28 + * Default is not to test hugetlbfs 29 + */ 30 + static int hugetlbfs_test; 31 + static size_t mfd_def_size = MFD_DEF_SIZE; 32 + 33 + /* 34 + * Copied from mlock2-tests.c 35 + */ 36 + static unsigned long default_huge_page_size(void) 37 + { 38 + unsigned long hps = 0; 39 + char *line = NULL; 40 + size_t linelen = 0; 41 + FILE *f = fopen("/proc/meminfo", "r"); 42 + 43 + if (!f) 44 + return 0; 45 + while (getline(&line, &linelen, f) > 0) { 46 + if (sscanf(line, "Hugepagesize: %lu kB", &hps) == 1) { 47 + hps <<= 10; 48 + break; 49 + } 50 + } 51 + 52 + free(line); 53 + fclose(f); 54 + return hps; 55 + } 23 56 24 57 static int sys_memfd_create(const char *name, 25 58 unsigned int flags) 26 59 { 60 + if (hugetlbfs_test) 61 + flags |= MFD_HUGETLB; 62 + 27 63 return syscall(__NR_memfd_create, name, flags); 28 64 } 29 65 ··· 186 150 void *p; 187 151 188 152 p = mmap(NULL, 189 - MFD_DEF_SIZE, 153 + mfd_def_size, 190 154 PROT_READ | PROT_WRITE, 191 155 MAP_SHARED, 192 156 fd, ··· 204 168 void *p; 205 169 206 170 p = mmap(NULL, 207 - MFD_DEF_SIZE, 171 + mfd_def_size, 208 172 PROT_READ, 209 173 MAP_PRIVATE, 210 174 fd, ··· 259 223 260 224 /* verify PROT_READ *is* allowed */ 261 225 p = mmap(NULL, 262 - MFD_DEF_SIZE, 226 + mfd_def_size, 263 227 PROT_READ, 264 228 MAP_PRIVATE, 265 229 fd, ··· 268 232 printf("mmap() failed: %m\n"); 269 233 abort(); 270 234 } 271 - munmap(p, MFD_DEF_SIZE); 235 + munmap(p, mfd_def_size); 272 236 273 237 /* verify MAP_PRIVATE is *always* allowed (even writable) */ 274 238 p = mmap(NULL, 275 - MFD_DEF_SIZE, 239 + mfd_def_size, 276 240 PROT_READ | PROT_WRITE, 277 241 MAP_PRIVATE, 278 242 fd, ··· 281 245 printf("mmap() failed: %m\n"); 282 246 abort(); 283 247 } 284 - munmap(p, MFD_DEF_SIZE); 248 + munmap(p, mfd_def_size); 285 
249 } 286 250 287 251 static void mfd_assert_write(int fd) ··· 290 254 void *p; 291 255 int r; 292 256 293 - /* verify write() succeeds */ 294 - l = write(fd, "\0\0\0\0", 4); 295 - if (l != 4) { 296 - printf("write() failed: %m\n"); 297 - abort(); 257 + /* 258 + * hugetlbfs does not support write, but we want to 259 + * verify everything else here. 260 + */ 261 + if (!hugetlbfs_test) { 262 + /* verify write() succeeds */ 263 + l = write(fd, "\0\0\0\0", 4); 264 + if (l != 4) { 265 + printf("write() failed: %m\n"); 266 + abort(); 267 + } 298 268 } 299 269 300 270 /* verify PROT_READ | PROT_WRITE is allowed */ 301 271 p = mmap(NULL, 302 272 mfd_def_size, 303 273 PROT_READ | PROT_WRITE, 304 274 MAP_SHARED, 305 275 fd, ··· 315 273 abort(); 316 274 } 317 275 *(char *)p = 0; 318 276 munmap(p, mfd_def_size); 319 277 320 278 /* verify PROT_WRITE is allowed */ 321 279 p = mmap(NULL, 322 280 mfd_def_size, 323 281 PROT_WRITE, 324 282 MAP_SHARED, 325 283 fd, ··· 329 287 abort(); 330 288 } 331 289 *(char *)p = 0; 332 290 munmap(p, mfd_def_size); 333 291 334 292 /* verify PROT_READ with MAP_SHARED is allowed and a following 335 293 * mprotect(PROT_WRITE) allows writing */ 336 294 p = mmap(NULL, 337 295 mfd_def_size, 338 296 PROT_READ, 339 297 MAP_SHARED, 340 298 fd, ··· 344 302 abort(); 345 303 } 346 304 347 - r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE); 305 + r = mprotect(p, mfd_def_size, PROT_READ | PROT_WRITE); 348 306 if (r < 0) { 349 307 printf("mprotect() failed: %m\n"); 350 308 abort(); 351 309 } 352 310 353 311 *(char *)p = 0; 354 312 munmap(p, mfd_def_size); 355 313 356 314 /* verify PUNCH_HOLE works */ 357 315 r = fallocate(fd, 358 316 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 359 317 0, 360 318 mfd_def_size); 361 319 if (r < 0) { 362 320 printf("fallocate(PUNCH_HOLE) failed: %m\n"); 363 321 abort(); 
··· 379 337 380 338 /* verify PROT_READ | PROT_WRITE is not allowed */ 381 339 p = mmap(NULL, 382 - MFD_DEF_SIZE, 340 + mfd_def_size, 383 341 PROT_READ | PROT_WRITE, 384 342 MAP_SHARED, 385 343 fd, ··· 391 349 392 350 /* verify PROT_WRITE is not allowed */ 393 351 p = mmap(NULL, 394 - MFD_DEF_SIZE, 352 + mfd_def_size, 395 353 PROT_WRITE, 396 354 MAP_SHARED, 397 355 fd, ··· 404 362 /* Verify PROT_READ with MAP_SHARED with a following mprotect is not 405 363 * allowed. Note that for r/w the kernel already prevents the mmap. */ 406 364 p = mmap(NULL, 407 - MFD_DEF_SIZE, 365 + mfd_def_size, 408 366 PROT_READ, 409 367 MAP_SHARED, 410 368 fd, 411 369 0); 412 370 if (p != MAP_FAILED) { 413 - r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE); 371 + r = mprotect(p, mfd_def_size, PROT_READ | PROT_WRITE); 414 372 if (r >= 0) { 415 373 printf("mmap()+mprotect() didn't fail as expected\n"); 416 374 abort(); ··· 421 379 r = fallocate(fd, 422 380 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 423 381 0, 424 - MFD_DEF_SIZE); 382 + mfd_def_size); 425 383 if (r >= 0) { 426 384 printf("fallocate(PUNCH_HOLE) didn't fail as expected\n"); 427 385 abort(); ··· 432 390 { 433 391 int r, fd2; 434 392 435 - r = ftruncate(fd, MFD_DEF_SIZE / 2); 393 + r = ftruncate(fd, mfd_def_size / 2); 436 394 if (r < 0) { 437 395 printf("ftruncate(SHRINK) failed: %m\n"); 438 396 abort(); 439 397 } 440 398 441 - mfd_assert_size(fd, MFD_DEF_SIZE / 2); 399 + mfd_assert_size(fd, mfd_def_size / 2); 442 400 443 401 fd2 = mfd_assert_open(fd, 444 402 O_RDWR | O_CREAT | O_TRUNC, ··· 452 410 { 453 411 int r; 454 412 455 - r = ftruncate(fd, MFD_DEF_SIZE / 2); 413 + r = ftruncate(fd, mfd_def_size / 2); 456 414 if (r >= 0) { 457 415 printf("ftruncate(SHRINK) didn't fail as expected\n"); 458 416 abort(); ··· 467 425 { 468 426 int r; 469 427 470 - r = ftruncate(fd, MFD_DEF_SIZE * 2); 428 + r = ftruncate(fd, mfd_def_size * 2); 471 429 if (r < 0) { 472 430 printf("ftruncate(GROW) failed: %m\n"); 473 431 abort(); 474 432 } 
475 433 476 - mfd_assert_size(fd, MFD_DEF_SIZE * 2); 434 + mfd_assert_size(fd, mfd_def_size * 2); 477 435 478 436 r = fallocate(fd, 479 437 0, 480 438 0, 481 - MFD_DEF_SIZE * 4); 439 + mfd_def_size * 4); 482 440 if (r < 0) { 483 441 printf("fallocate(ALLOC) failed: %m\n"); 484 442 abort(); 485 443 } 486 444 487 - mfd_assert_size(fd, MFD_DEF_SIZE * 4); 445 + mfd_assert_size(fd, mfd_def_size * 4); 488 446 } 489 447 490 448 static void mfd_fail_grow(int fd) 491 449 { 492 450 int r; 493 451 494 - r = ftruncate(fd, MFD_DEF_SIZE * 2); 452 + r = ftruncate(fd, mfd_def_size * 2); 495 453 if (r >= 0) { 496 454 printf("ftruncate(GROW) didn't fail as expected\n"); 497 455 abort(); ··· 500 458 r = fallocate(fd, 501 459 0, 502 460 0, 503 - MFD_DEF_SIZE * 4); 461 + mfd_def_size * 4); 504 462 if (r >= 0) { 505 463 printf("fallocate(ALLOC) didn't fail as expected\n"); 506 464 abort(); ··· 509 467 510 468 static void mfd_assert_grow_write(int fd) 511 469 { 512 - static char buf[MFD_DEF_SIZE * 8]; 470 + static char *buf; 513 471 ssize_t l; 514 472 515 - l = pwrite(fd, buf, sizeof(buf), 0); 516 - if (l != sizeof(buf)) { 473 + buf = malloc(mfd_def_size * 8); 474 + if (!buf) { 475 + printf("malloc(%d) failed: %m\n", mfd_def_size * 8); 476 + abort(); 477 + } 478 + 479 + l = pwrite(fd, buf, mfd_def_size * 8, 0); 480 + if (l != (mfd_def_size * 8)) { 517 481 printf("pwrite() failed: %m\n"); 518 482 abort(); 519 483 } 520 484 521 - mfd_assert_size(fd, MFD_DEF_SIZE * 8); 485 + mfd_assert_size(fd, mfd_def_size * 8); 522 486 } 523 487 524 488 static void mfd_fail_grow_write(int fd) 525 489 { 526 - static char buf[MFD_DEF_SIZE * 8]; 490 + static char *buf; 527 491 ssize_t l; 528 492 529 - l = pwrite(fd, buf, sizeof(buf), 0); 530 - if (l == sizeof(buf)) { 493 + buf = malloc(mfd_def_size * 8); 494 + if (!buf) { 495 + printf("malloc(%d) failed: %m\n", mfd_def_size * 8); 496 + abort(); 497 + } 498 + 499 + l = pwrite(fd, buf, mfd_def_size * 8, 0); 500 + if (l == (mfd_def_size * 8)) { 531 501 
printf("pwrite() didn't fail as expected\n"); 532 502 abort(); 533 503 } ··· 597 543 char buf[2048]; 598 544 int fd; 599 545 546 + printf("%s CREATE\n", MEMFD_STR); 547 + 600 548 /* test NULL name */ 601 549 mfd_fail_new(NULL, 0); 602 550 ··· 626 570 fd = mfd_assert_new("", 0, MFD_CLOEXEC); 627 571 close(fd); 628 572 629 - /* verify MFD_ALLOW_SEALING is allowed */ 630 - fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING); 631 - close(fd); 573 + if (!hugetlbfs_test) { 574 + /* verify MFD_ALLOW_SEALING is allowed */ 575 + fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING); 576 + close(fd); 632 577 633 - /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */ 634 - fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC); 635 - close(fd); 578 + /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */ 579 + fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC); 580 + close(fd); 581 + } else { 582 + /* sealing is not supported on hugetlbfs */ 583 + mfd_fail_new("", MFD_ALLOW_SEALING); 584 + } 636 585 } 637 586 638 587 /* ··· 648 587 { 649 588 int fd; 650 589 590 + /* hugetlbfs does not contain sealing support */ 591 + if (hugetlbfs_test) 592 + return; 593 + 594 + printf("%s BASIC\n", MEMFD_STR); 595 + 651 596 fd = mfd_assert_new("kern_memfd_basic", 652 - MFD_DEF_SIZE, 597 + mfd_def_size, 653 598 MFD_CLOEXEC | MFD_ALLOW_SEALING); 654 599 655 600 /* add basic seals */ ··· 686 619 687 620 /* verify sealing does not work without MFD_ALLOW_SEALING */ 688 621 fd = mfd_assert_new("kern_memfd_basic", 689 - MFD_DEF_SIZE, 622 + mfd_def_size, 690 623 MFD_CLOEXEC); 691 624 mfd_assert_has_seals(fd, F_SEAL_SEAL); 692 625 mfd_fail_add_seals(fd, F_SEAL_SHRINK | 693 626 F_SEAL_GROW | 694 627 F_SEAL_WRITE); 695 628 mfd_assert_has_seals(fd, F_SEAL_SEAL); 629 + close(fd); 630 + } 631 + 632 + /* 633 + * hugetlbfs doesn't support seals or write, so just verify grow and shrink 634 + * on a hugetlbfs file created via memfd_create. 
635 + */ 636 + static void test_hugetlbfs_grow_shrink(void) 637 + { 638 + int fd; 639 + 640 + printf("%s HUGETLBFS-GROW-SHRINK\n", MEMFD_STR); 641 + 642 + fd = mfd_assert_new("kern_memfd_seal_write", 643 + mfd_def_size, 644 + MFD_CLOEXEC); 645 + 646 + mfd_assert_read(fd); 647 + mfd_assert_write(fd); 648 + mfd_assert_shrink(fd); 649 + mfd_assert_grow(fd); 650 + 696 651 close(fd); 697 652 } 698 653 ··· 726 637 { 727 638 int fd; 728 639 640 + /* 641 + * hugetlbfs does not contain sealing or write support. Just test 642 + * basic grow and shrink via test_hugetlbfs_grow_shrink. 643 + */ 644 + if (hugetlbfs_test) 645 + return test_hugetlbfs_grow_shrink(); 646 + 647 + printf("%s SEAL-WRITE\n", MEMFD_STR); 648 + 729 649 fd = mfd_assert_new("kern_memfd_seal_write", 730 - MFD_DEF_SIZE, 650 + mfd_def_size, 731 651 MFD_CLOEXEC | MFD_ALLOW_SEALING); 732 652 mfd_assert_has_seals(fd, 0); 733 653 mfd_assert_add_seals(fd, F_SEAL_WRITE); ··· 759 661 { 760 662 int fd; 761 663 664 + /* hugetlbfs does not contain sealing support */ 665 + if (hugetlbfs_test) 666 + return; 667 + 668 + printf("%s SEAL-SHRINK\n", MEMFD_STR); 669 + 762 670 fd = mfd_assert_new("kern_memfd_seal_shrink", 763 - MFD_DEF_SIZE, 671 + mfd_def_size, 764 672 MFD_CLOEXEC | MFD_ALLOW_SEALING); 765 673 mfd_assert_has_seals(fd, 0); 766 674 mfd_assert_add_seals(fd, F_SEAL_SHRINK); ··· 789 685 { 790 686 int fd; 791 687 688 + /* hugetlbfs does not contain sealing support */ 689 + if (hugetlbfs_test) 690 + return; 691 + 692 + printf("%s SEAL-GROW\n", MEMFD_STR); 693 + 792 694 fd = mfd_assert_new("kern_memfd_seal_grow", 793 - MFD_DEF_SIZE, 695 + mfd_def_size, 794 696 MFD_CLOEXEC | MFD_ALLOW_SEALING); 795 697 mfd_assert_has_seals(fd, 0); 796 698 mfd_assert_add_seals(fd, F_SEAL_GROW); ··· 819 709 { 820 710 int fd; 821 711 712 + /* hugetlbfs does not contain sealing support */ 713 + if (hugetlbfs_test) 714 + return; 715 + 716 + printf("%s SEAL-RESIZE\n", MEMFD_STR); 717 + 822 718 fd = mfd_assert_new("kern_memfd_seal_resize", 823 
- MFD_DEF_SIZE, 719 + mfd_def_size, 824 720 MFD_CLOEXEC | MFD_ALLOW_SEALING); 825 721 mfd_assert_has_seals(fd, 0); 826 722 mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW); ··· 842 726 } 843 727 844 728 /* 845 - * Test sharing via dup() 846 - * Test that seals are shared between dupped FDs and they're all equal. 729 + * hugetlbfs does not support seals. Basic test to dup the memfd created 730 + * fd and perform some basic operations on it. 847 731 */ 848 - static void test_share_dup(void) 732 + static void hugetlbfs_dup(char *b_suffix) 849 733 { 850 734 int fd, fd2; 851 735 736 + printf("%s HUGETLBFS-DUP %s\n", MEMFD_STR, b_suffix); 737 + 852 738 fd = mfd_assert_new("kern_memfd_share_dup", 853 - MFD_DEF_SIZE, 739 + mfd_def_size, 740 + MFD_CLOEXEC); 741 + 742 + fd2 = mfd_assert_dup(fd); 743 + 744 + mfd_assert_read(fd); 745 + mfd_assert_write(fd); 746 + 747 + mfd_assert_shrink(fd2); 748 + mfd_assert_grow(fd2); 749 + 750 + close(fd2); 751 + close(fd); 752 + } 753 + 754 + /* 755 + * Test sharing via dup() 756 + * Test that seals are shared between dupped FDs and they're all equal. 757 + */ 758 + static void test_share_dup(char *banner, char *b_suffix) 759 + { 760 + int fd, fd2; 761 + 762 + /* 763 + * hugetlbfs does not contain sealing support. Perform some 764 + * basic testing on dup'ed fd instead via hugetlbfs_dup. 765 + */ 766 + if (hugetlbfs_test) { 767 + hugetlbfs_dup(b_suffix); 768 + return; 769 + } 770 + 771 + printf("%s %s %s\n", MEMFD_STR, banner, b_suffix); 772 + 773 + fd = mfd_assert_new("kern_memfd_share_dup", 774 + mfd_def_size, 854 775 MFD_CLOEXEC | MFD_ALLOW_SEALING); 855 776 mfd_assert_has_seals(fd, 0); 856 777 ··· 921 768 * Test sealing with active mmap()s 922 769 * Modifying seals is only allowed if no other mmap() refs exist. 
923 770 */ 924 - static void test_share_mmap(void) 771 + static void test_share_mmap(char *banner, char *b_suffix) 925 772 { 926 773 int fd; 927 774 void *p; 928 775 776 + /* hugetlbfs does not contain sealing support */ 777 + if (hugetlbfs_test) 778 + return; 779 + 780 + printf("%s %s %s\n", MEMFD_STR, banner, b_suffix); 781 + 929 782 fd = mfd_assert_new("kern_memfd_share_mmap", 930 - MFD_DEF_SIZE, 783 + mfd_def_size, 931 784 MFD_CLOEXEC | MFD_ALLOW_SEALING); 932 785 mfd_assert_has_seals(fd, 0); 933 786 ··· 943 784 mfd_assert_has_seals(fd, 0); 944 785 mfd_assert_add_seals(fd, F_SEAL_SHRINK); 945 786 mfd_assert_has_seals(fd, F_SEAL_SHRINK); 946 - munmap(p, MFD_DEF_SIZE); 787 + munmap(p, mfd_def_size); 947 788 948 789 /* readable ref allows sealing */ 949 790 p = mfd_assert_mmap_private(fd); 950 791 mfd_assert_add_seals(fd, F_SEAL_WRITE); 951 792 mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK); 952 - munmap(p, MFD_DEF_SIZE); 793 + munmap(p, mfd_def_size); 953 794 795 + close(fd); 796 + } 797 + 798 + /* 799 + * Basic test to make sure we can open the hugetlbfs fd via /proc and 800 + * perform some simple operations on it. 801 + */ 802 + static void hugetlbfs_proc_open(char *b_suffix) 803 + { 804 + int fd, fd2; 805 + 806 + printf("%s HUGETLBFS-PROC-OPEN %s\n", MEMFD_STR, b_suffix); 807 + 808 + fd = mfd_assert_new("kern_memfd_share_open", 809 + mfd_def_size, 810 + MFD_CLOEXEC); 811 + 812 + fd2 = mfd_assert_open(fd, O_RDWR, 0); 813 + 814 + mfd_assert_read(fd); 815 + mfd_assert_write(fd); 816 + 817 + mfd_assert_shrink(fd2); 818 + mfd_assert_grow(fd2); 819 + 820 + close(fd2); 954 821 close(fd); 955 822 } 956 823 ··· 986 801 * This is *not* like dup(), but like a real separate open(). Make sure the 987 802 * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR. 
988 803 */ 989 - static void test_share_open(void) 804 + static void test_share_open(char *banner, char *b_suffix) 990 805 { 991 806 int fd, fd2; 992 807 808 + /* 809 + * hugetlbfs does not contain sealing support. So test basic 810 + * functionality of using /proc fd via hugetlbfs_proc_open 811 + */ 812 + if (hugetlbfs_test) { 813 + hugetlbfs_proc_open(b_suffix); 814 + return; 815 + } 816 + 817 + printf("%s %s %s\n", MEMFD_STR, banner, b_suffix); 818 + 993 819 fd = mfd_assert_new("kern_memfd_share_open", 994 - MFD_DEF_SIZE, 820 + mfd_def_size, 995 821 MFD_CLOEXEC | MFD_ALLOW_SEALING); 996 822 mfd_assert_has_seals(fd, 0); ··· 1037 841 * Test sharing via fork() 1038 842 * Test whether seal-modifications work as expected with forked children. 1039 843 */ 1040 - static void test_share_fork(void) 844 + static void test_share_fork(char *banner, char *b_suffix) 1041 845 { 1042 846 int fd; 1043 847 pid_t pid; 1044 848 849 + /* hugetlbfs does not contain sealing support */ 850 + if (hugetlbfs_test) 851 + return; 852 + 853 + printf("%s %s %s\n", MEMFD_STR, banner, b_suffix); 854 + 1045 855 fd = mfd_assert_new("kern_memfd_share_fork", 1046 - MFD_DEF_SIZE, 856 + mfd_def_size, 1047 857 MFD_CLOEXEC | MFD_ALLOW_SEALING); 1048 858 mfd_assert_has_seals(fd, 0); ··· 1072 870 { 1073 871 pid_t pid; 1074 872 1075 - printf("memfd: CREATE\n"); 873 + if (argc == 2) { 874 + if (!strcmp(argv[1], "hugetlbfs")) { 875 + unsigned long hpage_size = default_huge_page_size(); 876 + 877 + if (!hpage_size) { 878 + printf("Unable to determine huge page size\n"); 879 + abort(); 880 + } 881 + 882 + hugetlbfs_test = 1; 883 + mfd_def_size = hpage_size * 2; 884 + } 885 + } 886 + 1076 887 test_create(); 1077 - printf("memfd: BASIC\n"); 1078 888 test_basic(); 1079 889 1080 - printf("memfd: SEAL-WRITE\n"); 1081 890 test_seal_write(); 1082 - printf("memfd: SEAL-SHRINK\n"); 1083 891 test_seal_shrink(); 1084 - printf("memfd: SEAL-GROW\n"); 1085 892 test_seal_grow(); 1086 - printf("memfd: 
SEAL-RESIZE\n"); 1087 893 test_seal_resize(); 1088 894 1089 - printf("memfd: SHARE-DUP\n"); 1090 - test_share_dup(); 1091 - printf("memfd: SHARE-MMAP\n"); 1092 - test_share_mmap(); 1093 - printf("memfd: SHARE-OPEN\n"); 1094 - test_share_open(); 1095 - printf("memfd: SHARE-FORK\n"); 1096 - test_share_fork(); 895 + test_share_dup("SHARE-DUP", ""); 896 + test_share_mmap("SHARE-MMAP", ""); 897 + test_share_open("SHARE-OPEN", ""); 898 + test_share_fork("SHARE-FORK", ""); 1097 899 1098 900 /* Run test-suite in a multi-threaded environment with a shared 1099 901 * file-table. */ 1100 902 pid = spawn_idle_thread(CLONE_FILES | CLONE_FS | CLONE_VM); 1101 - printf("memfd: SHARE-DUP (shared file-table)\n"); 1102 - test_share_dup(); 1103 - printf("memfd: SHARE-MMAP (shared file-table)\n"); 1104 - test_share_mmap(); 1105 - printf("memfd: SHARE-OPEN (shared file-table)\n"); 1106 - test_share_open(); 1107 - printf("memfd: SHARE-FORK (shared file-table)\n"); 1108 - test_share_fork(); 903 + test_share_dup("SHARE-DUP", SHARED_FT_STR); 904 + test_share_mmap("SHARE-MMAP", SHARED_FT_STR); 905 + test_share_open("SHARE-OPEN", SHARED_FT_STR); 906 + test_share_fork("SHARE-FORK", SHARED_FT_STR); 1109 907 join_idle_thread(pid); 1110 908 1111 909 printf("memfd: DONE\n");
+69
tools/testing/selftests/memfd/run_tests.sh
···
+#!/bin/bash
+# please run as root
+
+#
+# Normal tests requiring no special resources
+#
+./run_fuse_test.sh
+./memfd_test
+
+#
+# To test memfd_create with hugetlbfs, there needs to be hpages_test
+# huge pages free.  Attempt to allocate enough pages to test.
+#
+hpages_test=8
+
+#
+# Get count of free huge pages from /proc/meminfo
+#
+while read name size unit; do
+	if [ "$name" = "HugePages_Free:" ]; then
+		freepgs=$size
+	fi
+done < /proc/meminfo
+
+#
+# If not enough free huge pages for test, attempt to increase
+#
+if [ -n "$freepgs" ] && [ $freepgs -lt $hpages_test ]; then
+	nr_hugepgs=`cat /proc/sys/vm/nr_hugepages`
+	hpages_needed=`expr $hpages_test - $freepgs`
+
+	echo 3 > /proc/sys/vm/drop_caches
+	echo $(( $hpages_needed + $nr_hugepgs )) > /proc/sys/vm/nr_hugepages
+	if [ $? -ne 0 ]; then
+		echo "Please run this test as root"
+		exit 1
+	fi
+	while read name size unit; do
+		if [ "$name" = "HugePages_Free:" ]; then
+			freepgs=$size
+		fi
+	done < /proc/meminfo
+fi
+
+#
+# If still not enough huge pages available, exit.  But, give back any huge
+# pages potentially allocated above.
+#
+if [ $freepgs -lt $hpages_test ]; then
+	# nr_hugepgs non-zero only if we attempted to increase
+	if [ -n "$nr_hugepgs" ]; then
+		echo $nr_hugepgs > /proc/sys/vm/nr_hugepages
+	fi
+	printf "Not enough huge pages available (%d < %d)\n" \
+		$freepgs $hpages_test
+	exit 1
+fi
+
+#
+# Run the hugetlbfs test
+#
+./memfd_test hugetlbfs
+
+#
+# Give back any huge pages allocated for the test
+#
+if [ -n "$nr_hugepgs" ]; then
+	echo $nr_hugepgs > /proc/sys/vm/nr_hugepages
+fi
+267 -12
tools/testing/selftests/vm/userfaultfd.c
···
 #include <sys/wait.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <setjmp.h>
+#include <stdbool.h>
 
 #ifdef __NR_userfaultfd
 
···
 #define TEST_SHMEM 3
 static int test_type;
 
+/* exercise the test_uffdio_*_eexist every ALARM_INTERVAL_SECS */
+#define ALARM_INTERVAL_SECS 10
+static volatile bool test_uffdio_copy_eexist = true;
+static volatile bool test_uffdio_zeropage_eexist = true;
+
+static bool map_shared;
 static int huge_fd;
 static char *huge_fd_off0;
 static unsigned long long *count_verify;
 static int uffd, uffd_flags, finished, *pipefd;
-static char *area_src, *area_dst;
+static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
 
···
 	}
 }
 
+static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
+{
+}
 
 /* HugeTLB memory */
 static int hugetlb_release_pages(char *rel_area)
···
 
 static void hugetlb_allocate_area(void **alloc_area)
 {
+	void *area_alias = NULL;
+	char **alloc_area_alias;
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			   MAP_PRIVATE | MAP_HUGETLB, huge_fd,
-			   *alloc_area == area_src ? 0 :
-			   nr_pages * page_size);
+			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
+			   MAP_HUGETLB,
+			   huge_fd, *alloc_area == area_src ? 0 :
+			   nr_pages * page_size);
 	if (*alloc_area == MAP_FAILED) {
 		fprintf(stderr, "mmap of hugetlbfs file failed\n");
 		*alloc_area = NULL;
 	}
 
-	if (*alloc_area == area_src)
+	if (map_shared) {
+		area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
+				  MAP_SHARED | MAP_HUGETLB,
+				  huge_fd, *alloc_area == area_src ? 0 :
+				  nr_pages * page_size);
+		if (area_alias == MAP_FAILED) {
+			if (munmap(*alloc_area, nr_pages * page_size) < 0)
+				perror("hugetlb munmap"), exit(1);
+			*alloc_area = NULL;
+			return;
+		}
+	}
+	if (*alloc_area == area_src) {
 		huge_fd_off0 = *alloc_area;
+		alloc_area_alias = &area_src_alias;
+	} else {
+		alloc_area_alias = &area_dst_alias;
+	}
+	if (area_alias)
+		*alloc_area_alias = area_alias;
+}
+
+static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset)
+{
+	if (!map_shared)
+		return;
+	/*
+	 * We can't zap just the pagetable with hugetlbfs because
+	 * MADV_DONTNEED won't work.  So exercise -EEXIST on an alias
+	 * mapping where the pagetables are not established initially,
+	 * this way we'll exercise the -EEXIST at the fs level.
+	 */
+	*start = (unsigned long) area_dst_alias + offset;
 }
 
 /* Shared memory */
···
 	unsigned long expected_ioctls;
 	void (*allocate_area)(void **alloc_area);
 	int (*release_pages)(char *rel_area);
+	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
 #define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
···
 	.expected_ioctls = ANON_EXPECTED_IOCTLS,
 	.allocate_area = anon_allocate_area,
 	.release_pages = anon_release_pages,
+	.alias_mapping = noop_alias_mapping,
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
-	.expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC,
+	.expected_ioctls = ANON_EXPECTED_IOCTLS,
 	.allocate_area = shmem_allocate_area,
 	.release_pages = shmem_release_pages,
+	.alias_mapping = noop_alias_mapping,
 };
 
 static struct uffd_test_ops hugetlb_uffd_test_ops = {
 	.expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC,
 	.allocate_area = hugetlb_allocate_area,
 	.release_pages = hugetlb_release_pages,
+	.alias_mapping = hugetlb_alias_mapping,
 };
 
 static struct uffd_test_ops *uffd_test_ops;
···
 	return NULL;
 }
 
+static void retry_copy_page(int ufd, struct uffdio_copy *uffdio_copy,
+			    unsigned long offset)
+{
+	uffd_test_ops->alias_mapping(&uffdio_copy->dst,
+				     uffdio_copy->len,
+				     offset);
+	if (ioctl(ufd, UFFDIO_COPY, uffdio_copy)) {
+		/* real retval in uffdio_copy.copy */
+		if (uffdio_copy->copy != -EEXIST)
+			fprintf(stderr, "UFFDIO_COPY retry error %Ld\n",
+				uffdio_copy->copy), exit(1);
+	} else {
+		fprintf(stderr, "UFFDIO_COPY retry unexpected %Ld\n",
+			uffdio_copy->copy), exit(1);
+	}
+}
+
 static int copy_page(int ufd, unsigned long offset)
 {
 	struct uffdio_copy uffdio_copy;
···
 	} else if (uffdio_copy.copy != page_size) {
 		fprintf(stderr, "UFFDIO_COPY unexpected copy %Ld\n",
 			uffdio_copy.copy), exit(1);
-	} else
+	} else {
+		if (test_uffdio_copy_eexist) {
+			test_uffdio_copy_eexist = false;
+			retry_copy_page(ufd, &uffdio_copy, offset);
+		}
 		return 1;
+	}
 	return 0;
 }
 
···
 		userfaults++;
 		break;
 	case UFFD_EVENT_FORK:
+		close(uffd);
 		uffd = msg.arg.fork.ufd;
 		pollfd[0].fd = uffd;
 		break;
···
 	return 0;
 }
 
+sigjmp_buf jbuf, *sigbuf;
+
+static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
+{
+	if (sig == SIGBUS) {
+		if (sigbuf)
+			siglongjmp(*sigbuf, 1);
+		abort();
+	}
+}
+
 /*
  * For non-cooperative userfaultfd test we fork() a process that will
  * generate pagefaults, will mremap the area monitored by the
···
  * The release of the pages currently generates event for shmem and
  * anonymous memory (UFFD_EVENT_REMOVE), hence it is not checked
  * for hugetlb.
+ * For signal test (UFFD_FEATURE_SIGBUS), signal_test = 1, we register
+ * monitored area, generate pagefaults and test that signal is delivered.
+ * Use UFFDIO_COPY to allocate missing page and retry.  For signal_test = 2
+ * test robustness use case - we release monitored area, fork a process
+ * that will generate pagefaults and verify signal is generated.
+ * This also tests UFFD_FEATURE_EVENT_FORK event along with the signal
+ * feature.  Using monitor thread, verify no userfault events are generated.
 */
-static int faulting_process(void)
+static int faulting_process(int signal_test)
 {
 	unsigned long nr;
 	unsigned long long count;
 	unsigned long split_nr_pages;
+	unsigned long lastnr;
+	struct sigaction act;
+	unsigned long signalled = 0;
 
 	if (test_type != TEST_HUGETLB)
 		split_nr_pages = (nr_pages + 1) / 2;
 	else
 		split_nr_pages = nr_pages;
 
+	if (signal_test) {
+		sigbuf = &jbuf;
+		memset(&act, 0, sizeof(act));
+		act.sa_sigaction = sighndl;
+		act.sa_flags = SA_SIGINFO;
+		if (sigaction(SIGBUS, &act, 0)) {
+			perror("sigaction");
+			return 1;
+		}
+		lastnr = (unsigned long)-1;
+	}
+
 	for (nr = 0; nr < split_nr_pages; nr++) {
+		if (signal_test) {
+			if (sigsetjmp(*sigbuf, 1) != 0) {
+				if (nr == lastnr) {
+					fprintf(stderr, "Signal repeated\n");
+					return 1;
+				}
+
+				lastnr = nr;
+				if (signal_test == 1) {
+					if (copy_page(uffd, nr * page_size))
+						signalled++;
+				} else {
+					signalled++;
+					continue;
+				}
+			}
+		}
+
 		count = *area_count(area_dst, nr);
 		if (count != count_verify[nr]) {
 			fprintf(stderr,
···
 				count_verify[nr]), exit(1);
 		}
 	}
+
+	if (signal_test)
+		return signalled != split_nr_pages;
 
 	if (test_type == TEST_HUGETLB)
 		return 0;
···
 	}
 
 	return 0;
+}
+
+static void retry_uffdio_zeropage(int ufd,
+				  struct uffdio_zeropage *uffdio_zeropage,
+				  unsigned long offset)
+{
+	uffd_test_ops->alias_mapping(&uffdio_zeropage->range.start,
+				     uffdio_zeropage->range.len,
+				     offset);
+	if (ioctl(ufd, UFFDIO_ZEROPAGE, uffdio_zeropage)) {
+		if (uffdio_zeropage->zeropage != -EEXIST)
+			fprintf(stderr, "UFFDIO_ZEROPAGE retry error %Ld\n",
+				uffdio_zeropage->zeropage), exit(1);
+	} else {
+		fprintf(stderr, "UFFDIO_ZEROPAGE retry unexpected %Ld\n",
+			uffdio_zeropage->zeropage), exit(1);
+	}
 }
 
 static int uffdio_zeropage(int ufd, unsigned long offset)
···
 	if (uffdio_zeropage.zeropage != page_size) {
 		fprintf(stderr, "UFFDIO_ZEROPAGE unexpected %Ld\n",
 			uffdio_zeropage.zeropage), exit(1);
-	} else
+	} else {
+		if (test_uffdio_zeropage_eexist) {
+			test_uffdio_zeropage_eexist = false;
+			retry_uffdio_zeropage(ufd, &uffdio_zeropage,
+					      offset);
+		}
 		return 1;
+	}
 	} else {
 		fprintf(stderr,
 			"UFFDIO_ZEROPAGE succeeded %Ld\n",
···
 		perror("fork"), exit(1);
 
 	if (!pid)
-		return faulting_process();
+		return faulting_process(0);
 
 	waitpid(pid, &err, 0);
 	if (err)
···
 	return userfaults != nr_pages;
 }
 
+static int userfaultfd_sig_test(void)
+{
+	struct uffdio_register uffdio_register;
+	unsigned long expected_ioctls;
+	unsigned long userfaults;
+	pthread_t uffd_mon;
+	int err, features;
+	pid_t pid;
+	char c;
+
+	printf("testing signal delivery: ");
+	fflush(stdout);
+
+	if (uffd_test_ops->release_pages(area_dst))
+		return 1;
+
+	features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
+	if (userfaultfd_open(features) < 0)
+		return 1;
+	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
+
+	uffdio_register.range.start = (unsigned long) area_dst;
+	uffdio_register.range.len = nr_pages * page_size;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		fprintf(stderr, "register failure\n"), exit(1);
+
+	expected_ioctls = uffd_test_ops->expected_ioctls;
+	if ((uffdio_register.ioctls & expected_ioctls) !=
+	    expected_ioctls)
+		fprintf(stderr,
+			"unexpected missing ioctl for anon memory\n"),
+			exit(1);
+
+	if (faulting_process(1))
+		fprintf(stderr, "faulting process failed\n"), exit(1);
+
+	if (uffd_test_ops->release_pages(area_dst))
+		return 1;
+
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+		perror("uffd_poll_thread create"), exit(1);
+
+	pid = fork();
+	if (pid < 0)
+		perror("fork"), exit(1);
+
+	if (!pid)
+		exit(faulting_process(2));
+
+	waitpid(pid, &err, 0);
+	if (err)
+		fprintf(stderr, "faulting process failed\n"), exit(1);
+
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
+		perror("pipe write"), exit(1);
+	if (pthread_join(uffd_mon, (void **)&userfaults))
+		return 1;
+
+	printf("done.\n");
+	if (userfaults)
+		fprintf(stderr, "Signal test failed, userfaults: %ld\n",
+			userfaults);
+	close(uffd);
+	return userfaults != 0;
+}
 static int userfaultfd_stress(void)
 {
 	void *area;
···
 			return 1;
 		}
 
+		if (area_dst_alias) {
+			uffdio_register.range.start = (unsigned long)
+				area_dst_alias;
+			if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
+				fprintf(stderr, "register failure alias\n");
+				return 1;
+			}
+		}
+
 		/*
 		 * The madvise done previously isn't enough: some
 		 * uffd_thread could have read userfaults (one of
···
 
 		/* unregister */
 		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
-			fprintf(stderr, "register failure\n");
+			fprintf(stderr, "unregister failure\n");
 			return 1;
+		}
+		if (area_dst_alias) {
+			uffdio_register.range.start = (unsigned long) area_dst;
+			if (ioctl(uffd, UFFDIO_UNREGISTER,
+				  &uffdio_register.range)) {
+				fprintf(stderr, "unregister failure alias\n");
+				return 1;
+			}
 		}
 
 		/* verification */
···
 		area_src = area_dst;
 		area_dst = tmp_area;
 
+		tmp_area = area_src_alias;
+		area_src_alias = area_dst_alias;
+		area_dst_alias = tmp_area;
+
 		printf("userfaults:");
 		for (cpu = 0; cpu < nr_cpus; cpu++)
 			printf(" %lu", userfaults[cpu]);
···
 		return err;
 
 	close(uffd);
-	return userfaultfd_zeropage_test() || userfaultfd_events_test();
+	return userfaultfd_zeropage_test() || userfaultfd_sig_test()
+		|| userfaultfd_events_test();
 }
 
 /*
···
 	} else if (!strcmp(type, "hugetlb")) {
 		test_type = TEST_HUGETLB;
 		uffd_test_ops = &hugetlb_uffd_test_ops;
+	} else if (!strcmp(type, "hugetlb_shared")) {
+		map_shared = true;
+		test_type = TEST_HUGETLB;
+		uffd_test_ops = &hugetlb_uffd_test_ops;
 	} else if (!strcmp(type, "shmem")) {
+		map_shared = true;
 		test_type = TEST_SHMEM;
 		uffd_test_ops = &shmem_uffd_test_ops;
 	} else {
···
 	fprintf(stderr, "Impossible to run this test\n"), exit(2);
 }
 
+static void sigalrm(int sig)
+{
+	if (sig != SIGALRM)
+		abort();
+	test_uffdio_copy_eexist = true;
+	test_uffdio_zeropage_eexist = true;
+	alarm(ALARM_INTERVAL_SECS);
+}
+
 int main(int argc, char **argv)
 {
 	if (argc < 4)
 		fprintf(stderr, "Usage: <test type> <MiB> <bounces> [hugetlbfs_file]\n"),
 			exit(1);
+
+	if (signal(SIGALRM, sigalrm) == SIG_ERR)
+		fprintf(stderr, "failed to arm SIGALRM\n"), exit(1);
+	alarm(ALARM_INTERVAL_SECS);
 
 	set_test_type(argv[1]);