Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:
"Incoming:

- a small number of updates to scripts/, ocfs2 and fs/buffer.c

- most of MM

I still have quite a lot of material (mostly not MM) staged after
linux-next due to -next dependencies. I'll send those across next week
as the prerequisites get merged up"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
mm/page_io.c: annotate refault stalls from swap_readpage
mm/Kconfig: fix trivial help text punctuation
mm/Kconfig: fix indentation
mm/memory_hotplug.c: remove __online_page_set_limits()
mm: fix typos in comments when calling __SetPageUptodate()
mm: fix struct member name in function comments
mm/shmem.c: cast the type of unmap_start to u64
mm: shmem: use proper gfp flags for shmem_writepage()
mm/shmem.c: make array 'values' static const, makes object smaller
userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
userfaultfd: wrap the common dst_vma check into an inlined function
userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
userfaultfd: use vma_pagesize for all huge page size calculation
mm/madvise.c: use PAGE_ALIGN[ED] for range checking
mm/madvise.c: replace with page_size() in madvise_inject_error()
mm/mmap.c: make vma_merge() comment more easy to understand
mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
autonuma: reduce cache footprint when scanning page tables
autonuma: fix watermark checking in migrate_balanced_pgdat()
...

+2679 -1524
+6 -1
Documentation/admin-guide/cgroup-v2.rst
··· 1288 1288 inactive_anon, active_anon, inactive_file, active_file, unevictable 1289 1289 Amount of memory, swap-backed and filesystem-backed, 1290 1290 on the internal memory management lists used by the 1291 - page reclaim algorithm 1291 + page reclaim algorithm. 1292 + 1293 + As these represent internal list state (eg. shmem pages are on anon 1294 + memory management lists), inactive_foo + active_foo may not be equal to 1295 + the value for the foo counter, since the foo counter is type-based, not 1296 + list-based. 1292 1297 1293 1298 slab_reclaimable 1294 1299 Part of "slab" that might be reclaimed, such as
+63
Documentation/dev-tools/kasan.rst
··· 218 218 A potential expansion of this mode is a hardware tag-based mode, which would 219 219 use hardware memory tagging support instead of compiler instrumentation and 220 220 manual shadow memory manipulation. 221 + 222 + What memory accesses are sanitised by KASAN? 223 + -------------------------------------------- 224 + 225 + The kernel maps memory in a number of different parts of the address 226 + space. This poses something of a problem for KASAN, which requires 227 + that all addresses accessed by instrumented code have a valid shadow 228 + region. 229 + 230 + The range of kernel virtual addresses is large: there is not enough 231 + real memory to support a real shadow region for every address that 232 + could be accessed by the kernel. 233 + 234 + By default 235 + ~~~~~~~~~~ 236 + 237 + By default, architectures only map real memory over the shadow region 238 + for the linear mapping (and potentially other small areas). For all 239 + other areas - such as vmalloc and vmemmap space - a single read-only 240 + page is mapped over the shadow area. This read-only shadow page 241 + declares all memory accesses as permitted. 242 + 243 + This presents a problem for modules: they do not live in the linear 244 + mapping, but in a dedicated module space. By hooking in to the module 245 + allocator, KASAN can temporarily map real shadow memory to cover 246 + them. This allows detection of invalid accesses to module globals, for 247 + example. 248 + 249 + This also creates an incompatibility with ``VMAP_STACK``: if the stack 250 + lives in vmalloc space, it will be shadowed by the read-only page, and 251 + the kernel will fault when trying to set up the shadow data for stack 252 + variables. 253 + 254 + CONFIG_KASAN_VMALLOC 255 + ~~~~~~~~~~~~~~~~~~~~ 256 + 257 + With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the 258 + cost of greater memory usage. Currently this is only supported on x86. 
259 + 260 + This works by hooking into vmalloc and vmap, and dynamically 261 + allocating real shadow memory to back the mappings. 262 + 263 + Most mappings in vmalloc space are small, requiring less than a full 264 + page of shadow space. Allocating a full shadow page per mapping would 265 + therefore be wasteful. Furthermore, to ensure that different mappings 266 + use different shadow pages, mappings would have to be aligned to 267 + ``KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE``. 268 + 269 + Instead, we share backing space across multiple mappings. We allocate 270 + a backing page when a mapping in vmalloc space uses a particular page 271 + of the shadow region. This page can be shared by other vmalloc 272 + mappings later on. 273 + 274 + We hook in to the vmap infrastructure to lazily clean up unused shadow 275 + memory. 276 + 277 + To avoid the difficulties around swapping mappings around, we expect 278 + that the part of the shadow region that covers the vmalloc space will 279 + not be covered by the early shadow page, but will be left 280 + unmapped. This will require changes in arch-specific code. 281 + 282 + This allows ``VMAP_STACK`` support on x86, and can simplify support of 283 + architectures that do not have a fixed module region.
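Aside: the shadow bookkeeping this documentation describes rests on generic KASAN's address arithmetic, in which each 8-byte granule of kernel address space is mirrored by one shadow byte. A minimal user-space sketch of that translation follows; the scale shift matches generic KASAN, but the offset is a made-up placeholder, not any architecture's real `KASAN_SHADOW_OFFSET`:

```c
#include <assert.h>

/* Illustrative sketch of generic KASAN's address->shadow translation.
 * One shadow byte covers an 8-byte granule (scale shift of 3); the
 * offset below is a hypothetical value for demonstration only. */
#define SHADOW_SCALE_SHIFT 3
#define SHADOW_OFFSET      0x100000000UL

static unsigned long mem_to_shadow(unsigned long addr)
{
	/* Divide the address by the granule size, then relocate the
	 * result into the dedicated shadow region. */
	return (addr >> SHADOW_SCALE_SHIFT) + SHADOW_OFFSET;
}
```

Because eight bytes share one shadow byte, mappings that must have distinct shadow pages need `KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE` alignment, which is exactly the cost the text above says `CONFIG_KASAN_VMALLOC` avoids by sharing backing pages between neighbouring mappings.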
+5 -4
arch/Kconfig
··· 836 836 config VMAP_STACK 837 837 default y 838 838 bool "Use a virtually-mapped stack" 839 - depends on HAVE_ARCH_VMAP_STACK && !KASAN 839 + depends on HAVE_ARCH_VMAP_STACK 840 + depends on !KASAN || KASAN_VMALLOC 840 841 ---help--- 841 842 Enable this if you want the use virtually-mapped kernel stacks 842 843 with guard pages. This causes kernel stack overflows to be 843 844 caught immediately rather than causing difficult-to-diagnose 844 845 corruption. 845 846 846 - This is presently incompatible with KASAN because KASAN expects 847 - the stack to map directly to the KASAN shadow map using a formula 848 - that is incorrect if the stack is in vmalloc space. 847 + To use this with KASAN, the architecture must support backing 848 + virtual mappings with real shadow memory, and KASAN_VMALLOC must 849 + be enabled. 849 850 850 851 config ARCH_OPTIONAL_KERNEL_RWX 851 852 def_bool n
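Concretely, the relaxed dependency means a configuration that previously had to choose between the two features can now enable both, provided the vmalloc shadow backing is on. An illustrative fragment (option names taken from this series; whether they are selectable depends on the architecture's `HAVE_ARCH_*` support):

```
CONFIG_KASAN=y
CONFIG_KASAN_VMALLOC=y
CONFIG_VMAP_STACK=y
```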
-1
arch/arc/include/asm/pgtable.h
··· 33 33 #define _ASM_ARC_PGTABLE_H 34 34 35 35 #include <linux/bits.h> 36 - #define __ARCH_USE_5LEVEL_HACK 37 36 #include <asm-generic/pgtable-nopmd.h> 38 37 #include <asm/page.h> 39 38 #include <asm/mmu.h> /* to propagate CONFIG_ARC_MMU_VER <n> */
+8 -2
arch/arc/mm/fault.c
··· 30 30 * with the 'reference' page table. 31 31 */ 32 32 pgd_t *pgd, *pgd_k; 33 + p4d_t *p4d, *p4d_k; 33 34 pud_t *pud, *pud_k; 34 35 pmd_t *pmd, *pmd_k; 35 36 ··· 40 39 if (!pgd_present(*pgd_k)) 41 40 goto bad_area; 42 41 43 - pud = pud_offset(pgd, address); 44 - pud_k = pud_offset(pgd_k, address); 42 + p4d = p4d_offset(pgd, address); 43 + p4d_k = p4d_offset(pgd_k, address); 44 + if (!p4d_present(*p4d_k)) 45 + goto bad_area; 46 + 47 + pud = pud_offset(p4d, address); 48 + pud_k = pud_offset(p4d_k, address); 45 49 if (!pud_present(*pud_k)) 46 50 goto bad_area; 47 51
+3 -1
arch/arc/mm/highmem.c
··· 111 111 static noinline pte_t * __init alloc_kmap_pgtable(unsigned long kvaddr) 112 112 { 113 113 pgd_t *pgd_k; 114 + p4d_t *p4d_k; 114 115 pud_t *pud_k; 115 116 pmd_t *pmd_k; 116 117 pte_t *pte_k; 117 118 118 119 pgd_k = pgd_offset_k(kvaddr); 119 - pud_k = pud_offset(pgd_k, kvaddr); 120 + p4d_k = p4d_offset(pgd_k, kvaddr); 121 + pud_k = pud_offset(p4d_k, kvaddr); 120 122 pmd_k = pmd_offset(pud_k, kvaddr); 121 123 122 124 pte_k = (pte_t *)memblock_alloc_low(PAGE_SIZE, PAGE_SIZE);
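The two arc hunks above insert a p4d step between the pgd and pud lookups. On configurations where the p4d level is folded away (as with `asm-generic/pgtable-nop4d.h`), that step is just a type-changing cast, which is why adding it cannot change behaviour on 3- or 4-level setups. A user-space sketch of the folding, with made-up types standing in for the kernel's:

```c
#include <assert.h>

/* Made-up stand-ins for the kernel's page-table entry types. With the
 * p4d level folded, a p4d entry is the pgd entry viewed through another
 * type, so p4d_offset() returns its argument unchanged. */
typedef struct { unsigned long val; } pgd_t;
typedef struct { pgd_t pgd; } p4d_t;

static p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
	(void)address;        /* irrelevant when the level is folded */
	return (p4d_t *)pgd;  /* no extra table lookup happens */
}
```

Inserting `p4d = p4d_offset(pgd, addr)` into an existing pgd-to-pud walk is therefore a no-op at runtime on folded configurations, while doing the right thing on genuine 5-level page tables.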
-3
arch/powerpc/include/asm/book3s/64/pgtable-4k.h
··· 70 70 /* should not reach */ 71 71 } 72 72 73 - #else /* !CONFIG_HUGETLB_PAGE */ 74 - static inline int pmd_huge(pmd_t pmd) { return 0; } 75 - static inline int pud_huge(pud_t pud) { return 0; } 76 73 #endif /* CONFIG_HUGETLB_PAGE */ 77 74 78 75 #endif /* __ASSEMBLY__ */
-3
arch/powerpc/include/asm/book3s/64/pgtable-64k.h
··· 59 59 BUG(); 60 60 } 61 61 62 - #else /* !CONFIG_HUGETLB_PAGE */ 63 - static inline int pmd_huge(pmd_t pmd) { return 0; } 64 - static inline int pud_huge(pud_t pud) { return 0; } 65 62 #endif /* CONFIG_HUGETLB_PAGE */ 66 63 67 64 static inline int remap_4k_pfn(struct vm_area_struct *vma, unsigned long addr,
+1
arch/powerpc/mm/book3s64/radix_pgtable.c
··· 13 13 #include <linux/memblock.h> 14 14 #include <linux/of_fdt.h> 15 15 #include <linux/mm.h> 16 + #include <linux/hugetlb.h> 16 17 #include <linux/string_helpers.h> 17 18 #include <linux/stop_machine.h> 18 19
+1
arch/x86/Kconfig
··· 134 134 select HAVE_ARCH_JUMP_LABEL 135 135 select HAVE_ARCH_JUMP_LABEL_RELATIVE 136 136 select HAVE_ARCH_KASAN if X86_64 137 + select HAVE_ARCH_KASAN_VMALLOC if X86_64 137 138 select HAVE_ARCH_KGDB 138 139 select HAVE_ARCH_MMAP_RND_BITS if MMU 139 140 select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT
+61
arch/x86/mm/kasan_init_64.c
··· 245 245 } while (pgd++, addr = next, addr != end); 246 246 } 247 247 248 + static void __init kasan_shallow_populate_p4ds(pgd_t *pgd, 249 + unsigned long addr, 250 + unsigned long end) 251 + { 252 + p4d_t *p4d; 253 + unsigned long next; 254 + void *p; 255 + 256 + p4d = p4d_offset(pgd, addr); 257 + do { 258 + next = p4d_addr_end(addr, end); 259 + 260 + if (p4d_none(*p4d)) { 261 + p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true); 262 + p4d_populate(&init_mm, p4d, p); 263 + } 264 + } while (p4d++, addr = next, addr != end); 265 + } 266 + 267 + static void __init kasan_shallow_populate_pgds(void *start, void *end) 268 + { 269 + unsigned long addr, next; 270 + pgd_t *pgd; 271 + void *p; 272 + 273 + addr = (unsigned long)start; 274 + pgd = pgd_offset_k(addr); 275 + do { 276 + next = pgd_addr_end(addr, (unsigned long)end); 277 + 278 + if (pgd_none(*pgd)) { 279 + p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true); 280 + pgd_populate(&init_mm, pgd, p); 281 + } 282 + 283 + /* 284 + * we need to populate p4ds to be synced when running in 285 + * four level mode - see sync_global_pgds_l4() 286 + */ 287 + kasan_shallow_populate_p4ds(pgd, addr, next); 288 + } while (pgd++, addr = next, addr != (unsigned long)end); 289 + } 290 + 248 291 #ifdef CONFIG_KASAN_INLINE 249 292 static int kasan_die_handler(struct notifier_block *self, 250 293 unsigned long val, ··· 397 354 398 355 kasan_populate_early_shadow( 399 356 kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), 357 + kasan_mem_to_shadow((void *)VMALLOC_START)); 358 + 359 + /* 360 + * If we're in full vmalloc mode, don't back vmalloc space with early 361 + * shadow pages. Instead, prepopulate pgds/p4ds so they are synced to 362 + * the global table and we can populate the lower levels on demand. 
363 + */ 364 + if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) 365 + kasan_shallow_populate_pgds( 366 + kasan_mem_to_shadow((void *)VMALLOC_START), 367 + kasan_mem_to_shadow((void *)VMALLOC_END)); 368 + else 369 + kasan_populate_early_shadow( 370 + kasan_mem_to_shadow((void *)VMALLOC_START), 371 + kasan_mem_to_shadow((void *)VMALLOC_END)); 372 + 373 + kasan_populate_early_shadow( 374 + kasan_mem_to_shadow((void *)VMALLOC_END + 1), 400 375 shadow_cpu_entry_begin); 401 376 402 377 kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
+15 -25
drivers/base/memory.c
··· 19 19 #include <linux/memory.h> 20 20 #include <linux/memory_hotplug.h> 21 21 #include <linux/mm.h> 22 - #include <linux/mutex.h> 23 22 #include <linux/stat.h> 24 23 #include <linux/slab.h> 25 24 26 25 #include <linux/atomic.h> 27 26 #include <linux/uaccess.h> 28 - 29 - static DEFINE_MUTEX(mem_sysfs_mutex); 30 27 31 28 #define MEMORY_CLASS_NAME "memory" 32 29 ··· 535 538 if (kstrtoull(buf, 0, &pfn) < 0) 536 539 return -EINVAL; 537 540 pfn >>= PAGE_SHIFT; 538 - if (!pfn_valid(pfn)) 539 - return -ENXIO; 540 - /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */ 541 - if (!pfn_to_online_page(pfn)) 542 - return -EIO; 543 - ret = soft_offline_page(pfn_to_page(pfn), 0); 541 + ret = soft_offline_page(pfn, 0); 544 542 return ret == 0 ? count : ret; 545 543 } 546 544 ··· 697 705 * Create memory block devices for the given memory area. Start and size 698 706 * have to be aligned to memory block granularity. Memory block devices 699 707 * will be initialized as offline. 708 + * 709 + * Called under device_hotplug_lock. 700 710 */ 701 711 int create_memory_block_devices(unsigned long start, unsigned long size) 702 712 { ··· 712 718 !IS_ALIGNED(size, memory_block_size_bytes()))) 713 719 return -EINVAL; 714 720 715 - mutex_lock(&mem_sysfs_mutex); 716 721 for (block_id = start_block_id; block_id != end_block_id; block_id++) { 717 722 ret = init_memory_block(&mem, block_id, MEM_OFFLINE); 718 723 if (ret) ··· 723 730 for (block_id = start_block_id; block_id != end_block_id; 724 731 block_id++) { 725 732 mem = find_memory_block_by_id(block_id); 733 + if (WARN_ON_ONCE(!mem)) 734 + continue; 726 735 mem->section_count = 0; 727 736 unregister_memory(mem); 728 737 } 729 738 } 730 - mutex_unlock(&mem_sysfs_mutex); 731 739 return ret; 732 740 } 733 741 ··· 736 742 * Remove memory block devices for the given memory area. Start and size 737 743 * have to be aligned to memory block granularity. Memory block devices 738 744 * have to be offline. 
745 + * 746 + * Called under device_hotplug_lock. 739 747 */ 740 748 void remove_memory_block_devices(unsigned long start, unsigned long size) 741 749 { ··· 750 754 !IS_ALIGNED(size, memory_block_size_bytes()))) 751 755 return; 752 756 753 - mutex_lock(&mem_sysfs_mutex); 754 757 for (block_id = start_block_id; block_id != end_block_id; block_id++) { 755 758 mem = find_memory_block_by_id(block_id); 756 759 if (WARN_ON_ONCE(!mem)) ··· 758 763 unregister_memory_block_under_nodes(mem); 759 764 unregister_memory(mem); 760 765 } 761 - mutex_unlock(&mem_sysfs_mutex); 762 766 } 763 767 764 768 /* return true if the memory block is offlined, otherwise, return false */ ··· 791 797 }; 792 798 793 799 /* 794 - * Initialize the sysfs support for memory devices... 800 + * Initialize the sysfs support for memory devices. At the time this function 801 + * is called, we cannot have concurrent creation/deletion of memory block 802 + * devices, the device_hotplug_lock is not needed. 795 803 */ 796 804 void __init memory_dev_init(void) 797 805 { 798 806 int ret; 799 - int err; 800 807 unsigned long block_sz, nr; 801 808 802 809 /* Validate the configured memory block size */ ··· 808 813 809 814 ret = subsys_system_register(&memory_subsys, memory_root_attr_groups); 810 815 if (ret) 811 - goto out; 816 + panic("%s() failed to register subsystem: %d\n", __func__, ret); 812 817 813 818 /* 814 819 * Create entries for memory sections that were found 815 820 * during boot and have been initialized 816 821 */ 817 - mutex_lock(&mem_sysfs_mutex); 818 822 for (nr = 0; nr <= __highest_present_section_nr; 819 823 nr += sections_per_block) { 820 - err = add_memory_block(nr); 821 - if (!ret) 822 - ret = err; 824 + ret = add_memory_block(nr); 825 + if (ret) 826 + panic("%s() failed to add memory block: %d\n", __func__, 827 + ret); 823 828 } 824 - mutex_unlock(&mem_sysfs_mutex); 825 - 826 - out: 827 - if (ret) 828 - panic("%s() failed: %d\n", __func__, ret); 829 829 } 830 830 831 831 /**
+1 -3
drivers/hv/hv_balloon.c
··· 682 682 __ClearPageOffline(pg); 683 683 684 684 /* This frame is currently backed; online the page. */ 685 - __online_page_set_limits(pg); 686 - __online_page_increment_counters(pg); 687 - __online_page_free(pg); 685 + generic_online_page(pg, 0); 688 686 689 687 lockdep_assert_held(&dm_device.ha_lock); 690 688 dm_device.num_pages_onlined++;
-1
drivers/xen/balloon.c
··· 374 374 mutex_lock(&balloon_mutex); 375 375 for (i = 0; i < size; i++) { 376 376 p = pfn_to_page(start_pfn + i); 377 - __online_page_set_limits(p); 378 377 balloon_append(p); 379 378 } 380 379 mutex_unlock(&balloon_mutex);
+4 -2
fs/buffer.c
··· 49 49 #include <trace/events/block.h> 50 50 #include <linux/fscrypt.h> 51 51 52 + #include "internal.h" 53 + 52 54 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); 53 55 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, 54 56 enum rw_hint hint, struct writeback_control *wbc); ··· 1425 1423 1426 1424 for (i = 0; i < BH_LRU_SIZE; i++) { 1427 1425 if (b->bhs[i]) 1428 - return 1; 1426 + return true; 1429 1427 } 1430 1428 1431 - return 0; 1429 + return false; 1432 1430 } 1433 1431 1434 1432 void invalidate_bh_lrus(void)
-21
fs/direct-io.c
··· 221 221 } 222 222 223 223 /* 224 - * Warn about a page cache invalidation failure during a direct io write. 225 - */ 226 - void dio_warn_stale_pagecache(struct file *filp) 227 - { 228 - static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST); 229 - char pathname[128]; 230 - struct inode *inode = file_inode(filp); 231 - char *path; 232 - 233 - errseq_set(&inode->i_mapping->wb_err, -EIO); 234 - if (__ratelimit(&_rs)) { 235 - path = file_path(filp, pathname, sizeof(pathname)); 236 - if (IS_ERR(path)) 237 - path = "(unknown)"; 238 - pr_crit("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n"); 239 - pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid, 240 - current->comm); 241 - } 242 - } 243 - 244 - /* 245 224 * dio_complete() - called when all DIO BIO I/O has been completed 246 225 * 247 226 * This drops i_dio_count, lets interested parties know that a DIO operation
+48 -15
fs/hugetlbfs/inode.c
··· 440 440 u32 hash; 441 441 442 442 index = page->index; 443 - hash = hugetlb_fault_mutex_hash(h, mapping, index, 0); 443 + hash = hugetlb_fault_mutex_hash(mapping, index); 444 444 mutex_lock(&hugetlb_fault_mutex_table[hash]); 445 445 446 446 /* ··· 644 644 addr = index * hpage_size; 645 645 646 646 /* mutex taken here, fault path and hole punch */ 647 - hash = hugetlb_fault_mutex_hash(h, mapping, index, addr); 647 + hash = hugetlb_fault_mutex_hash(mapping, index); 648 648 mutex_lock(&hugetlb_fault_mutex_table[hash]); 649 649 650 650 /* See if already present in mapping to avoid alloc/free */ ··· 815 815 /* 816 816 * File creation. Allocate an inode, and we're done.. 817 817 */ 818 - static int hugetlbfs_mknod(struct inode *dir, 819 - struct dentry *dentry, umode_t mode, dev_t dev) 818 + static int do_hugetlbfs_mknod(struct inode *dir, 819 + struct dentry *dentry, 820 + umode_t mode, 821 + dev_t dev, 822 + bool tmpfile) 820 823 { 821 824 struct inode *inode; 822 825 int error = -ENOSPC; ··· 827 824 inode = hugetlbfs_get_inode(dir->i_sb, dir, mode, dev); 828 825 if (inode) { 829 826 dir->i_ctime = dir->i_mtime = current_time(dir); 830 - d_instantiate(dentry, inode); 831 - dget(dentry); /* Extra count - pin the dentry in core */ 827 + if (tmpfile) { 828 + d_tmpfile(dentry, inode); 829 + } else { 830 + d_instantiate(dentry, inode); 831 + dget(dentry);/* Extra count - pin the dentry in core */ 832 + } 832 833 error = 0; 833 834 } 834 835 return error; 836 + } 837 + 838 + static int hugetlbfs_mknod(struct inode *dir, 839 + struct dentry *dentry, umode_t mode, dev_t dev) 840 + { 841 + return do_hugetlbfs_mknod(dir, dentry, mode, dev, false); 835 842 } 836 843 837 844 static int hugetlbfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) ··· 855 842 static int hugetlbfs_create(struct inode *dir, struct dentry *dentry, umode_t mode, bool excl) 856 843 { 857 844 return hugetlbfs_mknod(dir, dentry, mode | S_IFREG, 0); 845 + } 846 + 847 + static int 
hugetlbfs_tmpfile(struct inode *dir, 848 + struct dentry *dentry, umode_t mode) 849 + { 850 + return do_hugetlbfs_mknod(dir, dentry, mode | S_IFREG, 0, true); 858 851 } 859 852 860 853 static int hugetlbfs_symlink(struct inode *dir, ··· 1121 1102 .mknod = hugetlbfs_mknod, 1122 1103 .rename = simple_rename, 1123 1104 .setattr = hugetlbfs_setattr, 1105 + .tmpfile = hugetlbfs_tmpfile, 1124 1106 }; 1125 1107 1126 1108 static const struct inode_operations hugetlbfs_inode_operations = { ··· 1481 1461 sizeof(struct hugetlbfs_inode_info), 1482 1462 0, SLAB_ACCOUNT, init_once); 1483 1463 if (hugetlbfs_inode_cachep == NULL) 1484 - goto out2; 1464 + goto out; 1485 1465 1486 1466 error = register_filesystem(&hugetlbfs_fs_type); 1487 1467 if (error) 1488 - goto out; 1468 + goto out_free; 1489 1469 1470 + /* default hstate mount is required */ 1471 + mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]); 1472 + if (IS_ERR(mnt)) { 1473 + error = PTR_ERR(mnt); 1474 + goto out_unreg; 1475 + } 1476 + hugetlbfs_vfsmount[default_hstate_idx] = mnt; 1477 + 1478 + /* other hstates are optional */ 1490 1479 i = 0; 1491 1480 for_each_hstate(h) { 1481 + if (i == default_hstate_idx) 1482 + continue; 1483 + 1492 1484 mnt = mount_one_hugetlbfs(h); 1493 - if (IS_ERR(mnt) && i == 0) { 1494 - error = PTR_ERR(mnt); 1495 - goto out; 1496 - } 1497 - hugetlbfs_vfsmount[i] = mnt; 1485 + if (IS_ERR(mnt)) 1486 + hugetlbfs_vfsmount[i] = NULL; 1487 + else 1488 + hugetlbfs_vfsmount[i] = mnt; 1498 1489 i++; 1499 1490 } 1500 1491 1501 1492 return 0; 1502 1493 1503 - out: 1494 + out_unreg: 1495 + (void)unregister_filesystem(&hugetlbfs_fs_type); 1496 + out_free: 1504 1497 kmem_cache_destroy(hugetlbfs_inode_cachep); 1505 - out2: 1498 + out: 1506 1499 return error; 1507 1500 } 1508 1501 fs_initcall(init_hugetlbfs_fs)
+2 -2
fs/ocfs2/acl.c
··· 327 327 down_read(&OCFS2_I(inode)->ip_xattr_sem); 328 328 acl = ocfs2_get_acl_nolock(inode, ACL_TYPE_ACCESS, bh); 329 329 up_read(&OCFS2_I(inode)->ip_xattr_sem); 330 - if (IS_ERR(acl) || !acl) 331 - return PTR_ERR(acl); 330 + if (IS_ERR_OR_NULL(acl)) 331 + return PTR_ERR_OR_ZERO(acl); 332 332 ret = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode); 333 333 if (ret) 334 334 return ret;
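The ocfs2 change above collapses the two-part `IS_ERR(acl) || !acl` check into the `IS_ERR_OR_NULL()`/`PTR_ERR_OR_ZERO()` helpers. The underlying convention is that the kernel encodes small negative errno values directly in the pointer. A simplified, self-contained user-space sketch of those helpers (without the kernel's annotations and const-casting details):

```c
#include <assert.h>

#define MAX_ERRNO 4095
#define EINVAL    22

/* Simplified versions of the kernel's pointer-errno helpers: a pointer
 * value in the top MAX_ERRNO bytes of the address space is treated as
 * an encoded negative errno. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

static inline long PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? PTR_ERR(ptr) : 0;
}
```

With these helpers, the original code's accidental correctness (returning `PTR_ERR(acl)` on a NULL pointer happened to yield 0) becomes explicit: NULL maps to 0, an encoded error maps to its errno.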
+13 -8
fs/userfaultfd.c
··· 1460 1460 start = vma->vm_start; 1461 1461 vma_end = min(end, vma->vm_end); 1462 1462 1463 - new_flags = (vma->vm_flags & ~vm_flags) | vm_flags; 1463 + new_flags = (vma->vm_flags & 1464 + ~(VM_UFFD_MISSING|VM_UFFD_WP)) | vm_flags; 1464 1465 prev = vma_merge(mm, prev, start, vma_end, new_flags, 1465 1466 vma->anon_vma, vma->vm_file, vma->vm_pgoff, 1466 1467 vma_policy(vma), ··· 1835 1834 if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api))) 1836 1835 goto out; 1837 1836 features = uffdio_api.features; 1838 - if (uffdio_api.api != UFFD_API || (features & ~UFFD_API_FEATURES)) { 1839 - memset(&uffdio_api, 0, sizeof(uffdio_api)); 1840 - if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api))) 1841 - goto out; 1842 - ret = -EINVAL; 1843 - goto out; 1844 - } 1837 + ret = -EINVAL; 1838 + if (uffdio_api.api != UFFD_API || (features & ~UFFD_API_FEATURES)) 1839 + goto err_out; 1840 + ret = -EPERM; 1841 + if ((features & UFFD_FEATURE_EVENT_FORK) && !capable(CAP_SYS_PTRACE)) 1842 + goto err_out; 1845 1843 /* report all available features and ioctls to userland */ 1846 1844 uffdio_api.features = UFFD_API_FEATURES; 1847 1845 uffdio_api.ioctls = UFFD_API_IOCTLS; ··· 1853 1853 ret = 0; 1854 1854 out: 1855 1855 return ret; 1856 + err_out: 1857 + memset(&uffdio_api, 0, sizeof(uffdio_api)); 1858 + if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api))) 1859 + ret = -EFAULT; 1860 + goto out; 1856 1861 } 1857 1862 1858 1863 static long userfaultfd_ioctl(struct file *file, unsigned cmd,
-1
include/asm-generic/4level-fixup.h
··· 30 30 #undef pud_free_tlb 31 31 #define pud_free_tlb(tlb, x, addr) do { } while (0) 32 32 #define pud_free(mm, x) do { } while (0) 33 - #define __pud_free_tlb(tlb, x, addr) do { } while (0) 34 33 35 34 #undef pud_addr_end 36 35 #define pud_addr_end(addr, end) (end)
-1
include/asm-generic/5level-fixup.h
··· 51 51 #undef p4d_free_tlb 52 52 #define p4d_free_tlb(tlb, x, addr) do { } while (0) 53 53 #define p4d_free(mm, x) do { } while (0) 54 - #define __p4d_free_tlb(tlb, x, addr) do { } while (0) 55 54 56 55 #undef p4d_addr_end 57 56 #define p4d_addr_end(addr, end) (end)
+1 -1
include/asm-generic/pgtable-nop4d.h
··· 50 50 */ 51 51 #define p4d_alloc_one(mm, address) NULL 52 52 #define p4d_free(mm, x) do { } while (0) 53 - #define __p4d_free_tlb(tlb, x, a) do { } while (0) 53 + #define p4d_free_tlb(tlb, x, a) do { } while (0) 54 54 55 55 #undef p4d_addr_end 56 56 #define p4d_addr_end(addr, end) (end)
+1 -1
include/asm-generic/pgtable-nopmd.h
··· 60 60 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) 61 61 { 62 62 } 63 - #define __pmd_free_tlb(tlb, x, a) do { } while (0) 63 + #define pmd_free_tlb(tlb, x, a) do { } while (0) 64 64 65 65 #undef pmd_addr_end 66 66 #define pmd_addr_end(addr, end) (end)
+1 -1
include/asm-generic/pgtable-nopud.h
··· 59 59 */ 60 60 #define pud_alloc_one(mm, address) NULL 61 61 #define pud_free(mm, x) do { } while (0) 62 - #define __pud_free_tlb(tlb, x, a) do { } while (0) 62 + #define pud_free_tlb(tlb, x, a) do { } while (0) 63 63 64 64 #undef pud_addr_end 65 65 #define pud_addr_end(addr, end) (end)
+51
include/asm-generic/pgtable.h
··· 558 558 * Do the tests inline, but report and clear the bad entry in mm/memory.c. 559 559 */ 560 560 void pgd_clear_bad(pgd_t *); 561 + 562 + #ifndef __PAGETABLE_P4D_FOLDED 561 563 void p4d_clear_bad(p4d_t *); 564 + #else 565 + #define p4d_clear_bad(p4d) do { } while (0) 566 + #endif 567 + 568 + #ifndef __PAGETABLE_PUD_FOLDED 562 569 void pud_clear_bad(pud_t *); 570 + #else 571 + #define pud_clear_bad(p4d) do { } while (0) 572 + #endif 573 + 563 574 void pmd_clear_bad(pmd_t *); 564 575 565 576 static inline int pgd_none_or_clear_bad(pgd_t *pgd) ··· 914 903 } 915 904 #endif /* pud_write */ 916 905 906 + #if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) 907 + static inline int pmd_devmap(pmd_t pmd) 908 + { 909 + return 0; 910 + } 911 + static inline int pud_devmap(pud_t pud) 912 + { 913 + return 0; 914 + } 915 + static inline int pgd_devmap(pgd_t pgd) 916 + { 917 + return 0; 918 + } 919 + #endif 920 + 917 921 #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \ 918 922 (defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ 919 923 !defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) ··· 937 911 return 0; 938 912 } 939 913 #endif 914 + 915 + /* See pmd_none_or_trans_huge_or_clear_bad for discussion. */ 916 + static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud) 917 + { 918 + pud_t pudval = READ_ONCE(*pud); 919 + 920 + if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval)) 921 + return 1; 922 + if (unlikely(pud_bad(pudval))) { 923 + pud_clear_bad(pud); 924 + return 1; 925 + } 926 + return 0; 927 + } 928 + 929 + /* See pmd_trans_unstable for discussion. 
*/ 930 + static inline int pud_trans_unstable(pud_t *pud) 931 + { 932 + #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ 933 + defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) 934 + return pud_none_or_trans_huge_or_dev_or_clear_bad(pud); 935 + #else 936 + return 0; 937 + #endif 938 + } 940 939 941 940 #ifndef pmd_read_atomic 942 941 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
-4
include/asm-generic/tlb.h
··· 584 584 } while (0) 585 585 #endif 586 586 587 - #ifndef __ARCH_HAS_4LEVEL_HACK 588 587 #ifndef pud_free_tlb 589 588 #define pud_free_tlb(tlb, pudp, address) \ 590 589 do { \ ··· 593 594 __pud_free_tlb(tlb, pudp, address); \ 594 595 } while (0) 595 596 #endif 596 - #endif 597 597 598 - #ifndef __ARCH_HAS_5LEVEL_HACK 599 598 #ifndef p4d_free_tlb 600 599 #define p4d_free_tlb(tlb, pudp, address) \ 601 600 do { \ ··· 601 604 tlb->freed_tables = 1; \ 602 605 __p4d_free_tlb(tlb, pudp, address); \ 603 606 } while (0) 604 - #endif 605 607 #endif 606 608 607 609 #endif /* CONFIG_MMU */
+5 -1
include/linux/fs.h
··· 3156 3156 }; 3157 3157 3158 3158 void dio_end_io(struct bio *bio); 3159 - void dio_warn_stale_pagecache(struct file *filp); 3160 3159 3161 3160 ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, 3162 3161 struct block_device *bdev, struct iov_iter *iter, ··· 3199 3200 if (atomic_dec_and_test(&inode->i_dio_count)) 3200 3201 wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); 3201 3202 } 3203 + 3204 + /* 3205 + * Warn about a page cache invalidation failure during a direct I/O write. 3206 + */ 3207 + void dio_warn_stale_pagecache(struct file *filp); 3202 3208 3203 3209 extern void inode_set_flags(struct inode *inode, unsigned int flags, 3204 3210 unsigned int mask);
+2
include/linux/gfp.h
··· 612 612 /* The below functions must be run on a range from a single zone. */ 613 613 extern int alloc_contig_range(unsigned long start, unsigned long end, 614 614 unsigned migratetype, gfp_t gfp_mask); 615 + extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, 616 + int nid, nodemask_t *nodemask); 615 617 #endif 616 618 void free_contig_range(unsigned long pfn, unsigned int nr_pages); 617 619
+116 -24
include/linux/hugetlb.h
··· 105 105 void free_huge_page(struct page *page); 106 106 void hugetlb_fix_reserve_counts(struct inode *inode); 107 107 extern struct mutex *hugetlb_fault_mutex_table; 108 - u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping, 109 - pgoff_t idx, unsigned long address); 108 + u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx); 110 109 111 110 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud); 112 111 ··· 163 164 { 164 165 } 165 166 166 - #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n) ({ BUG(); 0; }) 167 - #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) 168 - #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) 167 + static inline long follow_hugetlb_page(struct mm_struct *mm, 168 + struct vm_area_struct *vma, struct page **pages, 169 + struct vm_area_struct **vmas, unsigned long *position, 170 + unsigned long *nr_pages, long i, unsigned int flags, 171 + int *nonblocking) 172 + { 173 + BUG(); 174 + return 0; 175 + } 176 + 177 + static inline struct page *follow_huge_addr(struct mm_struct *mm, 178 + unsigned long address, int write) 179 + { 180 + return ERR_PTR(-EINVAL); 181 + } 182 + 183 + static inline int copy_hugetlb_page_range(struct mm_struct *dst, 184 + struct mm_struct *src, struct vm_area_struct *vma) 185 + { 186 + BUG(); 187 + return 0; 188 + } 189 + 169 190 static inline void hugetlb_report_meminfo(struct seq_file *m) 170 191 { 171 192 } 172 - #define hugetlb_report_node_meminfo(n, buf) 0 193 + 194 + static inline int hugetlb_report_node_meminfo(int nid, char *buf) 195 + { 196 + return 0; 197 + } 198 + 173 199 static inline void hugetlb_show_meminfo(void) 174 200 { 175 201 } 176 - #define follow_huge_pd(vma, addr, hpd, flags, pdshift) NULL 177 - #define follow_huge_pmd(mm, addr, pmd, flags) NULL 178 - #define follow_huge_pud(mm, addr, pud, flags) NULL 179 - #define follow_huge_pgd(mm, addr, pgd, flags) NULL 180 - #define prepare_hugepage_range(file, addr, len) 
(-EINVAL) 181 - #define pmd_huge(x) 0 182 - #define pud_huge(x) 0 183 - #define is_hugepage_only_range(mm, addr, len) 0 184 - #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; }) 185 - #define hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \ 186 - src_addr, pagep) ({ BUG(); 0; }) 187 - #define huge_pte_offset(mm, address, sz) 0 202 + 203 + static inline struct page *follow_huge_pd(struct vm_area_struct *vma, 204 + unsigned long address, hugepd_t hpd, int flags, 205 + int pdshift) 206 + { 207 + return NULL; 208 + } 209 + 210 + static inline struct page *follow_huge_pmd(struct mm_struct *mm, 211 + unsigned long address, pmd_t *pmd, int flags) 212 + { 213 + return NULL; 214 + } 215 + 216 + static inline struct page *follow_huge_pud(struct mm_struct *mm, 217 + unsigned long address, pud_t *pud, int flags) 218 + { 219 + return NULL; 220 + } 221 + 222 + static inline struct page *follow_huge_pgd(struct mm_struct *mm, 223 + unsigned long address, pgd_t *pgd, int flags) 224 + { 225 + return NULL; 226 + } 227 + 228 + static inline int prepare_hugepage_range(struct file *file, 229 + unsigned long addr, unsigned long len) 230 + { 231 + return -EINVAL; 232 + } 233 + 234 + static inline int pmd_huge(pmd_t pmd) 235 + { 236 + return 0; 237 + } 238 + 239 + static inline int pud_huge(pud_t pud) 240 + { 241 + return 0; 242 + } 243 + 244 + static inline int is_hugepage_only_range(struct mm_struct *mm, 245 + unsigned long addr, unsigned long len) 246 + { 247 + return 0; 248 + } 249 + 250 + static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb, 251 + unsigned long addr, unsigned long end, 252 + unsigned long floor, unsigned long ceiling) 253 + { 254 + BUG(); 255 + } 256 + 257 + static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, 258 + pte_t *dst_pte, 259 + struct vm_area_struct *dst_vma, 260 + unsigned long dst_addr, 261 + unsigned long src_addr, 262 + struct page **pagep) 263 + { 264 + BUG(); 265 + return 0; 266 + } 
267 + 268 + static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, 269 + unsigned long sz) 270 + { 271 + return NULL; 272 + } 188 273 189 274 static inline bool isolate_huge_page(struct page *page, struct list_head *list) 190 275 { 191 276 return false; 192 277 } 193 - #define putback_active_hugepage(p) do {} while (0) 194 - #define move_hugetlb_state(old, new, reason) do {} while (0) 195 278 196 - static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma, 197 - unsigned long address, unsigned long end, pgprot_t newprot) 279 + static inline void putback_active_hugepage(struct page *page) 280 + { 281 + } 282 + 283 + static inline void move_hugetlb_state(struct page *oldpage, 284 + struct page *newpage, int reason) 285 + { 286 + } 287 + 288 + static inline unsigned long hugetlb_change_protection( 289 + struct vm_area_struct *vma, unsigned long address, 290 + unsigned long end, pgprot_t newprot) 198 291 { 199 292 return 0; 200 293 } ··· 304 213 { 305 214 BUG(); 306 215 } 216 + 307 217 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm, 308 - struct vm_area_struct *vma, unsigned long address, 309 - unsigned int flags) 218 + struct vm_area_struct *vma, unsigned long address, 219 + unsigned int flags) 310 220 { 311 221 BUG(); 312 222 return 0;
+31
include/linux/kasan.h
··· 70 70 int free_meta_offset; 71 71 }; 72 72 73 + /* 74 + * These functions provide a special case to support backing module 75 + * allocations with real shadow memory. With KASAN vmalloc, the special 76 + * case is unnecessary, as the work is handled in the generic case. 77 + */ 78 + #ifndef CONFIG_KASAN_VMALLOC 73 79 int kasan_module_alloc(void *addr, size_t size); 74 80 void kasan_free_shadow(const struct vm_struct *vm); 81 + #else 82 + static inline int kasan_module_alloc(void *addr, size_t size) { return 0; } 83 + static inline void kasan_free_shadow(const struct vm_struct *vm) {} 84 + #endif 75 85 76 86 int kasan_add_zero_shadow(void *start, unsigned long size); 77 87 void kasan_remove_zero_shadow(void *start, unsigned long size); ··· 203 193 } 204 194 205 195 #endif /* CONFIG_KASAN_SW_TAGS */ 196 + 197 + #ifdef CONFIG_KASAN_VMALLOC 198 + int kasan_populate_vmalloc(unsigned long requested_size, 199 + struct vm_struct *area); 200 + void kasan_poison_vmalloc(void *start, unsigned long size); 201 + void kasan_release_vmalloc(unsigned long start, unsigned long end, 202 + unsigned long free_region_start, 203 + unsigned long free_region_end); 204 + #else 205 + static inline int kasan_populate_vmalloc(unsigned long requested_size, 206 + struct vm_struct *area) 207 + { 208 + return 0; 209 + } 210 + 211 + static inline void kasan_poison_vmalloc(void *start, unsigned long size) {} 212 + static inline void kasan_release_vmalloc(unsigned long start, 213 + unsigned long end, 214 + unsigned long free_region_start, 215 + unsigned long free_region_end) {} 216 + #endif 206 217 207 218 #endif /* LINUX_KASAN_H */
+3
include/linux/memblock.h
··· 358 358 MEMBLOCK_ALLOC_ACCESSIBLE); 359 359 } 360 360 361 + void *memblock_alloc_exact_nid_raw(phys_addr_t size, phys_addr_t align, 362 + phys_addr_t min_addr, phys_addr_t max_addr, 363 + int nid); 361 364 void *memblock_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align, 362 365 phys_addr_t min_addr, phys_addr_t max_addr, 363 366 int nid);
+22 -27
include/linux/memcontrol.h
··· 58 58 59 59 struct mem_cgroup_reclaim_cookie { 60 60 pg_data_t *pgdat; 61 - int priority; 62 61 unsigned int generation; 63 62 }; 64 63 ··· 80 81 enum mem_cgroup_events_target { 81 82 MEM_CGROUP_TARGET_THRESH, 82 83 MEM_CGROUP_TARGET_SOFTLIMIT, 83 - MEM_CGROUP_TARGET_NUMAINFO, 84 84 MEM_CGROUP_NTARGETS, 85 85 }; 86 86 ··· 110 112 }; 111 113 112 114 /* 113 - * per-zone information in memory controller. 115 + * per-node information in memory controller. 114 116 */ 115 117 struct mem_cgroup_per_node { 116 118 struct lruvec lruvec; ··· 124 126 125 127 unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS]; 126 128 127 - struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1]; 129 + struct mem_cgroup_reclaim_iter iter; 128 130 129 131 struct memcg_shrinker_map __rcu *shrinker_map; 130 132 ··· 132 134 unsigned long usage_in_excess;/* Set to the value by which */ 133 135 /* the soft limit is exceeded*/ 134 136 bool on_tree; 135 - bool congested; /* memcg has many dirty pages */ 136 - /* backed by a congested BDI */ 137 - 138 137 struct mem_cgroup *memcg; /* Back pointer, we cannot */ 139 138 /* use container_of */ 140 139 }; ··· 308 313 struct list_head kmem_caches; 309 314 #endif 310 315 311 - int last_scanned_node; 312 - #if MAX_NUMNODES > 1 313 - nodemask_t scan_nodes; 314 - atomic_t numainfo_events; 315 - atomic_t numainfo_updating; 316 - #endif 317 - 318 316 #ifdef CONFIG_CGROUP_WRITEBACK 319 317 struct list_head cgwb_list; 320 318 struct wb_domain cgwb_domain; ··· 382 394 } 383 395 384 396 /** 385 - * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone 386 - * @node: node of the wanted lruvec 397 + * mem_cgroup_lruvec - get the lru list vector for a memcg & node 387 398 * @memcg: memcg of the wanted lruvec 388 399 * 389 - * Returns the lru list vector holding pages for a given @node or a given 390 - * @memcg and @zone. This can be the node lruvec, if the memory controller 391 - * is disabled. 
400 + * Returns the lru list vector holding pages for a given @memcg & 401 + * @node combination. This can be the node lruvec, if the memory 402 + * controller is disabled. 392 403 */ 393 - static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, 394 - struct mem_cgroup *memcg) 404 + static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, 405 + struct pglist_data *pgdat) 395 406 { 396 407 struct mem_cgroup_per_node *mz; 397 408 struct lruvec *lruvec; 398 409 399 410 if (mem_cgroup_disabled()) { 400 - lruvec = node_lruvec(pgdat); 411 + lruvec = &pgdat->__lruvec; 401 412 goto out; 402 413 } 414 + 415 + if (!memcg) 416 + memcg = root_mem_cgroup; 403 417 404 418 mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); 405 419 lruvec = &mz->lruvec; ··· 718 728 return; 719 729 } 720 730 721 - lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup); 731 + lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat); 722 732 __mod_lruvec_state(lruvec, idx, val); 723 733 } 724 734 ··· 889 899 { 890 900 } 891 901 892 - static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, 893 - struct mem_cgroup *memcg) 902 + static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, 903 + struct pglist_data *pgdat) 894 904 { 895 - return node_lruvec(pgdat); 905 + return &pgdat->__lruvec; 896 906 } 897 907 898 908 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, 899 909 struct pglist_data *pgdat) 900 910 { 901 - return &pgdat->lruvec; 911 + return &pgdat->__lruvec; 912 + } 913 + 914 + static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) 915 + { 916 + return NULL; 902 917 } 903 918 904 919 static inline bool mm_match_cgroup(struct mm_struct *mm,
+4 -7
include/linux/memory_hotplug.h
··· 102 102 103 103 typedef void (*online_page_callback_t)(struct page *page, unsigned int order); 104 104 105 + extern void generic_online_page(struct page *page, unsigned int order); 105 106 extern int set_online_page_callback(online_page_callback_t callback); 106 107 extern int restore_online_page_callback(online_page_callback_t callback); 107 - 108 - extern void __online_page_set_limits(struct page *page); 109 - extern void __online_page_increment_counters(struct page *page); 110 - extern void __online_page_free(struct page *page); 111 108 112 109 extern int try_online_node(int nid); 113 110 ··· 226 229 void mem_hotplug_begin(void); 227 230 void mem_hotplug_done(void); 228 231 229 - extern void set_zone_contiguous(struct zone *zone); 230 - extern void clear_zone_contiguous(struct zone *zone); 231 - 232 232 #else /* ! CONFIG_MEMORY_HOTPLUG */ 233 233 #define pfn_to_online_page(pfn) \ 234 234 ({ \ ··· 332 338 333 339 static inline void __remove_memory(int nid, u64 start, u64 size) {} 334 340 #endif /* CONFIG_MEMORY_HOTREMOVE */ 341 + 342 + extern void set_zone_contiguous(struct zone *zone); 343 + extern void clear_zone_contiguous(struct zone *zone); 335 344 336 345 extern void __ref free_area_init_core_hotplug(int nid); 337 346 extern int __add_memory(int nid, u64 start, u64 size);
+12 -22
include/linux/mm.h
··· 564 564 struct mmu_gather; 565 565 struct inode; 566 566 567 - #if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE) 568 - static inline int pmd_devmap(pmd_t pmd) 569 - { 570 - return 0; 571 - } 572 - static inline int pud_devmap(pud_t pud) 573 - { 574 - return 0; 575 - } 576 - static inline int pgd_devmap(pgd_t pgd) 577 - { 578 - return 0; 579 - } 580 - #endif 581 - 582 567 /* 583 568 * FIXME: take this include out, include page-flags.h in 584 569 * files which need it (119 of them) ··· 1628 1643 return (unsigned long)val; 1629 1644 } 1630 1645 1646 + void mm_trace_rss_stat(struct mm_struct *mm, int member, long count); 1647 + 1631 1648 static inline void add_mm_counter(struct mm_struct *mm, int member, long value) 1632 1649 { 1633 - atomic_long_add(value, &mm->rss_stat.count[member]); 1650 + long count = atomic_long_add_return(value, &mm->rss_stat.count[member]); 1651 + 1652 + mm_trace_rss_stat(mm, member, count); 1634 1653 } 1635 1654 1636 1655 static inline void inc_mm_counter(struct mm_struct *mm, int member) 1637 1656 { 1638 - atomic_long_inc(&mm->rss_stat.count[member]); 1657 + long count = atomic_long_inc_return(&mm->rss_stat.count[member]); 1658 + 1659 + mm_trace_rss_stat(mm, member, count); 1639 1660 } 1640 1661 1641 1662 static inline void dec_mm_counter(struct mm_struct *mm, int member) 1642 1663 { 1643 - atomic_long_dec(&mm->rss_stat.count[member]); 1664 + long count = atomic_long_dec_return(&mm->rss_stat.count[member]); 1665 + 1666 + mm_trace_rss_stat(mm, member, count); 1644 1667 } 1645 1668 1646 1669 /* Optimized variant when page is already known not to be PageAnon */ ··· 2206 2213 void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...); 2207 2214 2208 2215 extern void setup_per_cpu_pageset(void); 2209 - 2210 - extern void zone_pcp_update(struct zone *zone); 2211 - extern void zone_pcp_reset(struct zone *zone); 2212 2216 2213 2217 /* page_alloc.c */ 2214 2218 extern int min_free_kbytes; ··· 2770 
2780 extern int sysctl_memory_failure_recovery; 2771 2781 extern void shake_page(struct page *p, int access); 2772 2782 extern atomic_long_t num_poisoned_pages __read_mostly; 2773 - extern int soft_offline_page(struct page *page, int flags); 2783 + extern int soft_offline_page(unsigned long pfn, int flags); 2774 2784 2775 2785 2776 2786 /*
+20 -14
include/linux/mmzone.h
··· 273 273 274 274 #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++) 275 275 276 - static inline int is_file_lru(enum lru_list lru) 276 + static inline bool is_file_lru(enum lru_list lru) 277 277 { 278 278 return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE); 279 279 } 280 280 281 - static inline int is_active_lru(enum lru_list lru) 281 + static inline bool is_active_lru(enum lru_list lru) 282 282 { 283 283 return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE); 284 284 } ··· 296 296 unsigned long recent_scanned[2]; 297 297 }; 298 298 299 + enum lruvec_flags { 300 + LRUVEC_CONGESTED, /* lruvec has many dirty pages 301 + * backed by a congested BDI 302 + */ 303 + }; 304 + 299 305 struct lruvec { 300 306 struct list_head lists[NR_LRU_LISTS]; 301 307 struct zone_reclaim_stat reclaim_stat; ··· 309 303 atomic_long_t inactive_age; 310 304 /* Refaults at the time of last reclaim cycle */ 311 305 unsigned long refaults; 306 + /* Various lruvec state flags (enum lruvec_flags) */ 307 + unsigned long flags; 312 308 #ifdef CONFIG_MEMCG 313 309 struct pglist_data *pgdat; 314 310 #endif 315 311 }; 316 312 317 - /* Isolate unmapped file */ 313 + /* Isolate unmapped pages */ 318 314 #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x2) 319 315 /* Isolate for asynchronous migration */ 320 316 #define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x4) ··· 580 572 } ____cacheline_internodealigned_in_smp; 581 573 582 574 enum pgdat_flags { 583 - PGDAT_CONGESTED, /* pgdat has many dirty pages backed by 584 - * a congested BDI 585 - */ 586 575 PGDAT_DIRTY, /* reclaim scanning has recently found 587 576 * many dirty file pages at the tail 588 577 * of the LRU. ··· 782 777 #endif 783 778 784 779 /* Fields commonly accessed by the page reclaim scanner */ 785 - struct lruvec lruvec; 780 + 781 + /* 782 + * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED. 783 + * 784 + * Use mem_cgroup_lruvec() to look up lruvecs. 
785 + */ 786 + struct lruvec __lruvec; 786 787 787 788 unsigned long flags; 788 789 ··· 810 799 811 800 #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) 812 801 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid)) 813 - 814 - static inline struct lruvec *node_lruvec(struct pglist_data *pgdat) 815 - { 816 - return &pgdat->lruvec; 817 - } 818 802 819 803 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) 820 804 { ··· 848 842 #ifdef CONFIG_MEMCG 849 843 return lruvec->pgdat; 850 844 #else 851 - return container_of(lruvec, struct pglist_data, lruvec); 845 + return container_of(lruvec, struct pglist_data, __lruvec); 852 846 #endif 853 847 } 854 848 ··· 1085 1079 /** 1086 1080 * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask 1087 1081 * @zone - The current zone in the iterator 1088 - * @z - The current pointer within zonelist->zones being iterated 1082 + * @z - The current pointer within zonelist->_zonerefs being iterated 1089 1083 * @zlist - The zonelist being iterated 1090 1084 * @highidx - The zone index of the highest zone to return 1091 1085 * @nodemask - Nodemask allowed by the allocator
+1 -1
include/linux/moduleloader.h
··· 91 91 /* Any cleanup before freeing mod->module_init */ 92 92 void module_arch_freeing_init(struct module *mod); 93 93 94 - #ifdef CONFIG_KASAN 94 + #if defined(CONFIG_KASAN) && !defined(CONFIG_KASAN_VMALLOC) 95 95 #include <linux/kasan.h> 96 96 #define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) 97 97 #else
+2 -2
include/linux/page-isolation.h
··· 30 30 } 31 31 #endif 32 32 33 - #define SKIP_HWPOISON 0x1 33 + #define MEMORY_OFFLINE 0x1 34 34 #define REPORT_FAILURE 0x2 35 35 36 36 bool has_unmovable_pages(struct zone *zone, struct page *page, int count, ··· 58 58 * Test all pages in [start_pfn, end_pfn) are isolated or not. 59 59 */ 60 60 int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, 61 - bool skip_hwpoisoned_pages); 61 + int isol_flags); 62 62 63 63 struct page *alloc_migrate_target(struct page *page, unsigned long private); 64 64
-20
include/linux/slab.h
··· 561 561 return __kmalloc(size, flags); 562 562 } 563 563 564 - /* 565 - * Determine size used for the nth kmalloc cache. 566 - * return size or 0 if a kmalloc cache for that 567 - * size does not exist 568 - */ 569 - static __always_inline unsigned int kmalloc_size(unsigned int n) 570 - { 571 - #ifndef CONFIG_SLOB 572 - if (n > 2) 573 - return 1U << n; 574 - 575 - if (n == 1 && KMALLOC_MIN_SIZE <= 32) 576 - return 96; 577 - 578 - if (n == 2 && KMALLOC_MIN_SIZE <= 64) 579 - return 192; 580 - #endif 581 - return 0; 582 - } 583 - 584 564 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) 585 565 { 586 566 #ifndef CONFIG_SLOB
+2
include/linux/string.h
··· 216 216 extern ssize_t memory_read_from_buffer(void *to, size_t count, loff_t *ppos, 217 217 const void *from, size_t available); 218 218 219 + int ptr_to_hashval(const void *ptr, unsigned long *hashval_out); 220 + 219 221 /** 220 222 * strstarts - does @str start with @prefix? 221 223 * @str: string to examine
+1 -1
include/linux/swap.h
··· 307 307 }; 308 308 309 309 /* linux/mm/workingset.c */ 310 - void *workingset_eviction(struct page *page); 310 + void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); 311 311 void workingset_refault(struct page *page, void *shadow); 312 312 void workingset_activation(struct page *page); 313 313
+12
include/linux/vmalloc.h
··· 22 22 #define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */ 23 23 #define VM_NO_GUARD 0x00000040 /* don't add guard page */ 24 24 #define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */ 25 + 26 + /* 27 + * VM_KASAN is used slightly differently depending on CONFIG_KASAN_VMALLOC. 28 + * 29 + * If IS_ENABLED(CONFIG_KASAN_VMALLOC), VM_KASAN is set on a vm_struct after 30 + * shadow memory has been mapped. It's used to handle allocation errors so that 31 + * we don't try to poison shadow on free if it was never allocated. 32 + * 33 + * Otherwise, VM_KASAN is set for kasan_module_alloc() allocations and used to 34 + * determine which allocations need the module shadow freed. 35 + */ 36 + 25 37 /* 26 38 * Memory with VM_FLUSH_RESET_PERMS cannot be freed in an interrupt or with 27 39 * vfree_atomic().
+47
include/trace/events/kmem.h
··· 316 316 __entry->change_ownership) 317 317 ); 318 318 319 + /* 320 + * Required for uniquely and securely identifying mm in rss_stat tracepoint. 321 + */ 322 + #ifndef __PTR_TO_HASHVAL 323 + static unsigned int __maybe_unused mm_ptr_to_hash(const void *ptr) 324 + { 325 + int ret; 326 + unsigned long hashval; 327 + 328 + ret = ptr_to_hashval(ptr, &hashval); 329 + if (ret) 330 + return 0; 331 + 332 + /* The hashed value is only 32-bit */ 333 + return (unsigned int)hashval; 334 + } 335 + #define __PTR_TO_HASHVAL 336 + #endif 337 + 338 + TRACE_EVENT(rss_stat, 339 + 340 + TP_PROTO(struct mm_struct *mm, 341 + int member, 342 + long count), 343 + 344 + TP_ARGS(mm, member, count), 345 + 346 + TP_STRUCT__entry( 347 + __field(unsigned int, mm_id) 348 + __field(unsigned int, curr) 349 + __field(int, member) 350 + __field(long, size) 351 + ), 352 + 353 + TP_fast_assign( 354 + __entry->mm_id = mm_ptr_to_hash(mm); 355 + __entry->curr = !!(current->mm == mm); 356 + __entry->member = member; 357 + __entry->size = (count << PAGE_SHIFT); 358 + ), 359 + 360 + TP_printk("mm_id=%u curr=%d member=%d size=%ldB", 361 + __entry->mm_id, 362 + __entry->curr, 363 + __entry->member, 364 + __entry->size) 365 + ); 319 366 #endif /* _TRACE_KMEM_H */ 320 367 321 368 /* This part must be outside protection */
+1 -1
kernel/events/uprobes.c
··· 1457 1457 /* Try to map as high as possible, this is only a hint. */ 1458 1458 area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, 1459 1459 PAGE_SIZE, 0, 0); 1460 - if (area->vaddr & ~PAGE_MASK) { 1460 + if (IS_ERR_VALUE(area->vaddr)) { 1461 1461 ret = area->vaddr; 1462 1462 goto fail; 1463 1463 }
+4
kernel/fork.c
··· 93 93 #include <linux/livepatch.h> 94 94 #include <linux/thread_info.h> 95 95 #include <linux/stackleak.h> 96 + #include <linux/kasan.h> 96 97 97 98 #include <asm/pgtable.h> 98 99 #include <asm/pgalloc.h> ··· 223 222 224 223 if (!s) 225 224 continue; 225 + 226 + /* Clear the KASAN shadow of the stack. */ 227 + kasan_unpoison_shadow(s->addr, THREAD_SIZE); 226 228 227 229 /* Clear stale pointers from reused stack. */ 228 230 memset(s->addr, 0, THREAD_SIZE);
+1 -1
kernel/sysctl.c
··· 1466 1466 .procname = "drop_caches", 1467 1467 .data = &sysctl_drop_caches, 1468 1468 .maxlen = sizeof(int), 1469 - .mode = 0644, 1469 + .mode = 0200, 1470 1470 .proc_handler = drop_caches_sysctl_handler, 1471 1471 .extra1 = SYSCTL_ONE, 1472 1472 .extra2 = &four,
+16
lib/Kconfig.kasan
··· 6 6 config HAVE_ARCH_KASAN_SW_TAGS 7 7 bool 8 8 9 + config HAVE_ARCH_KASAN_VMALLOC 10 + bool 11 + 9 12 config CC_HAS_KASAN_GENERIC 10 13 def_bool $(cc-option, -fsanitize=kernel-address) 11 14 ··· 144 141 This option enables best-effort identification of bug type 145 142 (use-after-free or out-of-bounds) at the cost of increased 146 143 memory consumption. 144 + 145 + config KASAN_VMALLOC 146 + bool "Back mappings in vmalloc space with real shadow memory" 147 + depends on KASAN && HAVE_ARCH_KASAN_VMALLOC 148 + help 149 + By default, the shadow region for vmalloc space is the read-only 150 + zero page. This means that KASAN cannot detect errors involving 151 + vmalloc space. 152 + 153 + Enabling this option will hook in to vmap/vmalloc and back those 154 + mappings with real shadow memory allocated on demand. This allows 155 + for KASAN to detect more sorts of errors (and to support vmapped 156 + stacks), but at the cost of higher memory usage. 147 157 148 158 config TEST_KASAN 149 159 tristate "Module for testing KASAN for bug detection"
+26
lib/test_kasan.c
··· 19 19 #include <linux/string.h> 20 20 #include <linux/uaccess.h> 21 21 #include <linux/io.h> 22 + #include <linux/vmalloc.h> 22 23 23 24 #include <asm/page.h> 24 25 ··· 749 748 kzfree(ptr); 750 749 } 751 750 751 + #ifdef CONFIG_KASAN_VMALLOC 752 + static noinline void __init vmalloc_oob(void) 753 + { 754 + void *area; 755 + 756 + pr_info("vmalloc out-of-bounds\n"); 757 + 758 + /* 759 + * We have to be careful not to hit the guard page. 760 + * The MMU will catch that and crash us. 761 + */ 762 + area = vmalloc(3000); 763 + if (!area) { 764 + pr_err("Allocation failed\n"); 765 + return; 766 + } 767 + 768 + ((volatile char *)area)[3100]; 769 + vfree(area); 770 + } 771 + #else 772 + static void __init vmalloc_oob(void) {} 773 + #endif 774 + 752 775 static int __init kmalloc_tests_init(void) 753 776 { 754 777 /* ··· 818 793 kasan_strings(); 819 794 kasan_bitops(); 820 795 kmalloc_double_kzfree(); 796 + vmalloc_oob(); 821 797 822 798 kasan_restore_multi_shot(multishot); 823 799
+32 -14
lib/vsprintf.c
··· 761 761 early_initcall(initialize_ptr_random); 762 762 763 763 /* Maps a pointer to a 32 bit unique identifier. */ 764 - static char *ptr_to_id(char *buf, char *end, const void *ptr, 765 - struct printf_spec spec) 764 + static inline int __ptr_to_hashval(const void *ptr, unsigned long *hashval_out) 766 765 { 767 - const char *str = sizeof(ptr) == 8 ? "(____ptrval____)" : "(ptrval)"; 768 766 unsigned long hashval; 769 767 770 - /* When debugging early boot use non-cryptographically secure hash. */ 771 - if (unlikely(debug_boot_weak_hash)) { 772 - hashval = hash_long((unsigned long)ptr, 32); 773 - return pointer_string(buf, end, (const void *)hashval, spec); 774 - } 775 - 776 - if (static_branch_unlikely(&not_filled_random_ptr_key)) { 777 - spec.field_width = 2 * sizeof(ptr); 778 - /* string length must be less than default_width */ 779 - return error_string(buf, end, str, spec); 780 - } 768 + if (static_branch_unlikely(&not_filled_random_ptr_key)) 769 + return -EAGAIN; 781 770 782 771 #ifdef CONFIG_64BIT 783 772 hashval = (unsigned long)siphash_1u64((u64)ptr, &ptr_key); ··· 778 789 #else 779 790 hashval = (unsigned long)siphash_1u32((u32)ptr, &ptr_key); 780 791 #endif 792 + *hashval_out = hashval; 793 + return 0; 794 + } 795 + 796 + int ptr_to_hashval(const void *ptr, unsigned long *hashval_out) 797 + { 798 + return __ptr_to_hashval(ptr, hashval_out); 799 + } 800 + 801 + static char *ptr_to_id(char *buf, char *end, const void *ptr, 802 + struct printf_spec spec) 803 + { 804 + const char *str = sizeof(ptr) == 8 ? "(____ptrval____)" : "(ptrval)"; 805 + unsigned long hashval; 806 + int ret; 807 + 808 + /* When debugging early boot use non-cryptographically secure hash. 
*/ 809 + if (unlikely(debug_boot_weak_hash)) { 810 + hashval = hash_long((unsigned long)ptr, 32); 811 + return pointer_string(buf, end, (const void *)hashval, spec); 812 + } 813 + 814 + ret = __ptr_to_hashval(ptr, &hashval); 815 + if (ret) { 816 + spec.field_width = 2 * sizeof(ptr); 817 + /* string length must be less than default_width */ 818 + return error_string(buf, end, str, spec); 819 + } 820 + 781 821 return pointer_string(buf, end, (const void *)hashval, spec); 782 822 } 783 823
+20 -20
mm/Kconfig
··· 29 29 30 30 For systems that have holes in their physical address 31 31 spaces and for features like NUMA and memory hotplug, 32 - choose "Sparse Memory" 32 + choose "Sparse Memory". 33 33 34 34 If unsure, choose this option (Flat Memory) over any other. 35 35 ··· 122 122 depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE 123 123 default y 124 124 help 125 - SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise 126 - pfn_to_page and page_to_pfn operations. This is the most 127 - efficient option when sufficient kernel resources are available. 125 + SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise 126 + pfn_to_page and page_to_pfn operations. This is the most 127 + efficient option when sufficient kernel resources are available. 128 128 129 129 config HAVE_MEMBLOCK_NODE_MAP 130 130 bool ··· 160 160 depends on SPARSEMEM && MEMORY_HOTPLUG 161 161 162 162 config MEMORY_HOTPLUG_DEFAULT_ONLINE 163 - bool "Online the newly added memory blocks by default" 164 - depends on MEMORY_HOTPLUG 165 - help 163 + bool "Online the newly added memory blocks by default" 164 + depends on MEMORY_HOTPLUG 165 + help 166 166 This option sets the default policy setting for memory hotplug 167 167 onlining policy (/sys/devices/system/memory/auto_online_blocks) which 168 168 determines what happens to newly added memory regions. Policy setting ··· 227 227 select MIGRATION 228 228 depends on MMU 229 229 help 230 - Compaction is the only memory management component to form 231 - high order (larger physically contiguous) memory blocks 232 - reliably. The page allocator relies on compaction heavily and 233 - the lack of the feature can lead to unexpected OOM killer 234 - invocations for high order memory requests. You shouldn't 235 - disable this option unless there really is a strong reason for 236 - it and then we would be really interested to hear about that at 237 - linux-mm@kvack.org. 
230 + Compaction is the only memory management component to form 231 + high order (larger physically contiguous) memory blocks 232 + reliably. The page allocator relies on compaction heavily and 233 + the lack of the feature can lead to unexpected OOM killer 234 + invocations for high order memory requests. You shouldn't 235 + disable this option unless there really is a strong reason for 236 + it and then we would be really interested to hear about that at 237 + linux-mm@kvack.org. 238 238 239 239 # 240 240 # support for page migration ··· 258 258 bool 259 259 260 260 config CONTIG_ALLOC 261 - def_bool (MEMORY_ISOLATION && COMPACTION) || CMA 261 + def_bool (MEMORY_ISOLATION && COMPACTION) || CMA 262 262 263 263 config PHYS_ADDR_T_64BIT 264 264 def_bool 64BIT ··· 302 302 root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set). 303 303 304 304 config DEFAULT_MMAP_MIN_ADDR 305 - int "Low address space to protect from user allocation" 305 + int "Low address space to protect from user allocation" 306 306 depends on MMU 307 - default 4096 308 - help 307 + default 4096 308 + help 309 309 This is the portion of low virtual memory which should be protected 310 310 from userspace allocation. Keeping a user from writing to low pages 311 311 can help reduce the impact of kernel NULL pointer bugs. ··· 408 408 endchoice 409 409 410 410 config ARCH_WANTS_THP_SWAP 411 - def_bool n 411 + def_bool n 412 412 413 413 config THP_SWAP 414 414 def_bool y
+2 -4
mm/cma.c
··· 95 95 96 96 static int __init cma_activate_area(struct cma *cma) 97 97 { 98 - int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long); 99 98 unsigned long base_pfn = cma->base_pfn, pfn = base_pfn; 100 99 unsigned i = cma->count >> pageblock_order; 101 100 struct zone *zone; 102 101 103 - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL); 104 - 102 + cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL); 105 103 if (!cma->bitmap) { 106 104 cma->count = 0; 107 105 return -ENOMEM; ··· 137 139 138 140 not_in_zone: 139 141 pr_err("CMA area %s could not be activated\n", cma->name); 140 - kfree(cma->bitmap); 142 + bitmap_free(cma->bitmap); 141 143 cma->count = 0; 142 144 return -EINVAL; 143 145 }
+5 -5
mm/cma_debug.c
··· 29 29 30 30 return 0; 31 31 } 32 - DEFINE_SIMPLE_ATTRIBUTE(cma_debugfs_fops, cma_debugfs_get, NULL, "%llu\n"); 32 + DEFINE_DEBUGFS_ATTRIBUTE(cma_debugfs_fops, cma_debugfs_get, NULL, "%llu\n"); 33 33 34 34 static int cma_used_get(void *data, u64 *val) 35 35 { ··· 44 44 45 45 return 0; 46 46 } 47 - DEFINE_SIMPLE_ATTRIBUTE(cma_used_fops, cma_used_get, NULL, "%llu\n"); 47 + DEFINE_DEBUGFS_ATTRIBUTE(cma_used_fops, cma_used_get, NULL, "%llu\n"); 48 48 49 49 static int cma_maxchunk_get(void *data, u64 *val) 50 50 { ··· 66 66 67 67 return 0; 68 68 } 69 - DEFINE_SIMPLE_ATTRIBUTE(cma_maxchunk_fops, cma_maxchunk_get, NULL, "%llu\n"); 69 + DEFINE_DEBUGFS_ATTRIBUTE(cma_maxchunk_fops, cma_maxchunk_get, NULL, "%llu\n"); 70 70 71 71 static void cma_add_to_cma_mem_list(struct cma *cma, struct cma_mem *mem) 72 72 { ··· 126 126 127 127 return cma_free_mem(cma, pages); 128 128 } 129 - DEFINE_SIMPLE_ATTRIBUTE(cma_free_fops, NULL, cma_free_write, "%llu\n"); 129 + DEFINE_DEBUGFS_ATTRIBUTE(cma_free_fops, NULL, cma_free_write, "%llu\n"); 130 130 131 131 static int cma_alloc_mem(struct cma *cma, int count) 132 132 { ··· 158 158 159 159 return cma_alloc_mem(cma, pages); 160 160 } 161 - DEFINE_SIMPLE_ATTRIBUTE(cma_alloc_fops, NULL, cma_alloc_write, "%llu\n"); 161 + DEFINE_DEBUGFS_ATTRIBUTE(cma_alloc_fops, NULL, cma_alloc_write, "%llu\n"); 162 162 163 163 static void cma_debugfs_add_one(struct cma *cma, struct dentry *root_dentry) 164 164 {
+29 -25
mm/filemap.c
··· 2329 2329 2330 2330 #ifdef CONFIG_MMU 2331 2331 #define MMAP_LOTSAMISS (100) 2332 - static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, 2333 - struct file *fpin) 2334 - { 2335 - int flags = vmf->flags; 2336 - 2337 - if (fpin) 2338 - return fpin; 2339 - 2340 - /* 2341 - * FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or 2342 - * anything, so we only pin the file and drop the mmap_sem if only 2343 - * FAULT_FLAG_ALLOW_RETRY is set. 2344 - */ 2345 - if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) == 2346 - FAULT_FLAG_ALLOW_RETRY) { 2347 - fpin = get_file(vmf->vma->vm_file); 2348 - up_read(&vmf->vma->vm_mm->mmap_sem); 2349 - } 2350 - return fpin; 2351 - } 2352 - 2353 2332 /* 2354 2333 * lock_page_maybe_drop_mmap - lock the page, possibly dropping the mmap_sem 2355 2334 * @vmf - the vm_fault for this fault. ··· 3140 3161 } 3141 3162 EXPORT_SYMBOL(pagecache_write_end); 3142 3163 3164 + /* 3165 + * Warn about a page cache invalidation failure during a direct I/O write. 3166 + */ 3167 + void dio_warn_stale_pagecache(struct file *filp) 3168 + { 3169 + static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST); 3170 + char pathname[128]; 3171 + struct inode *inode = file_inode(filp); 3172 + char *path; 3173 + 3174 + errseq_set(&inode->i_mapping->wb_err, -EIO); 3175 + if (__ratelimit(&_rs)) { 3176 + path = file_path(filp, pathname, sizeof(pathname)); 3177 + if (IS_ERR(path)) 3178 + path = "(unknown)"; 3179 + pr_crit("Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!\n"); 3180 + pr_crit("File: %s PID: %d Comm: %.20s\n", path, current->pid, 3181 + current->comm); 3182 + } 3183 + } 3184 + 3143 3185 ssize_t 3144 3186 generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) 3145 3187 { ··· 3218 3218 * Most of the time we do not need this since dio_complete() will do 3219 3219 * the invalidation for us. 
However there are some file systems that 3220 3220 * do not end up with dio_complete() being called, so let's not break 3221 - * them by removing it completely 3221 + * them by removing it completely. 3222 + * 3223 + * Noticeable example is a blkdev_direct_IO(). 3224 + * 3225 + * Skip invalidation for async writes or if mapping has no pages. 3222 3226 */ 3223 - if (mapping->nrpages) 3224 - invalidate_inode_pages2_range(mapping, 3225 - pos >> PAGE_SHIFT, end); 3227 + if (written > 0 && mapping->nrpages && 3228 + invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end)) 3229 + dio_warn_stale_pagecache(file); 3226 3230 3227 3231 if (written > 0) { 3228 3232 pos += written;
+27 -13
mm/gup.c
··· 734 734 * Or NULL if the caller does not require them. 735 735 * @nonblocking: whether waiting for disk IO or mmap_sem contention 736 736 * 737 - * Returns number of pages pinned. This may be fewer than the number 738 - * requested. If nr_pages is 0 or negative, returns 0. If no pages 739 - * were pinned, returns -errno. Each page returned must be released 740 - * with a put_page() call when it is finished with. vmas will only 741 - * remain valid while mmap_sem is held. 737 + * Returns either number of pages pinned (which may be less than the 738 + * number requested), or an error. Details about the return value: 739 + * 740 + * -- If nr_pages is 0, returns 0. 741 + * -- If nr_pages is >0, but no pages were pinned, returns -errno. 742 + * -- If nr_pages is >0, and some pages were pinned, returns the number of 743 + * pages pinned. Again, this may be less than nr_pages. 744 + * 745 + * The caller is responsible for releasing returned @pages, via put_page(). 746 + * 747 + * @vmas are valid only as long as mmap_sem is held. 742 748 * 743 749 * Must be called with mmap_sem held. It may be released. See below. 744 750 * ··· 1113 1107 * subsequently whether VM_FAULT_RETRY functionality can be 1114 1108 * utilised. Lock must initially be held. 1115 1109 * 1116 - * Returns number of pages pinned. This may be fewer than the number 1117 - * requested. If nr_pages is 0 or negative, returns 0. If no pages 1118 - * were pinned, returns -errno. Each page returned must be released 1119 - * with a put_page() call when it is finished with. vmas will only 1120 - * remain valid while mmap_sem is held. 1110 + * Returns either number of pages pinned (which may be less than the 1111 + * number requested), or an error. Details about the return value: 1112 + * 1113 + * -- If nr_pages is 0, returns 0. 1114 + * -- If nr_pages is >0, but no pages were pinned, returns -errno. 1115 + * -- If nr_pages is >0, and some pages were pinned, returns the number of 1116 + * pages pinned. 
Again, this may be less than nr_pages. 1117 + * 1118 + * The caller is responsible for releasing returned @pages, via put_page(). 1119 + * 1120 + * @vmas are valid only as long as mmap_sem is held. 1121 1121 * 1122 1122 * Must be called with mmap_sem held for read or write. 1123 1123 * ··· 1455 1443 bool drain_allow = true; 1456 1444 bool migrate_allow = true; 1457 1445 LIST_HEAD(cma_page_list); 1446 + long ret = nr_pages; 1458 1447 1459 1448 check_again: 1460 1449 for (i = 0; i < nr_pages;) { ··· 1517 1504 * again migrating any new CMA pages which we failed to isolate 1518 1505 * earlier. 1519 1506 */ 1520 - nr_pages = __get_user_pages_locked(tsk, mm, start, nr_pages, 1507 + ret = __get_user_pages_locked(tsk, mm, start, nr_pages, 1521 1508 pages, vmas, NULL, 1522 1509 gup_flags); 1523 1510 1524 - if ((nr_pages > 0) && migrate_allow) { 1511 + if ((ret > 0) && migrate_allow) { 1512 + nr_pages = ret; 1525 1513 drain_allow = true; 1526 1514 goto check_again; 1527 1515 } 1528 1516 } 1529 1517 1530 - return nr_pages; 1518 + return ret; 1531 1519 } 1532 1520 #else 1533 1521 static long check_and_migrate_cma_pages(struct task_struct *tsk,
+1 -1
mm/huge_memory.c
··· 3003 3003 3004 3004 return 0; 3005 3005 } 3006 - DEFINE_SIMPLE_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set, 3006 + DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set, 3007 3007 "%llu\n"); 3008 3008 3009 3009 static int __init split_huge_pages_debugfs(void)
+91 -197
mm/hugetlb.c
··· 244 244 long to; 245 245 }; 246 246 247 + /* Must be called with resv->lock held. Calling this with count_only == true 248 + * will count the number of pages to be added but will not modify the linked 249 + * list. 250 + */ 251 + static long add_reservation_in_range(struct resv_map *resv, long f, long t, 252 + bool count_only) 253 + { 254 + long chg = 0; 255 + struct list_head *head = &resv->regions; 256 + struct file_region *rg = NULL, *trg = NULL, *nrg = NULL; 257 + 258 + /* Locate the region we are before or in. */ 259 + list_for_each_entry(rg, head, link) 260 + if (f <= rg->to) 261 + break; 262 + 263 + /* Round our left edge to the current segment if it encloses us. */ 264 + if (f > rg->from) 265 + f = rg->from; 266 + 267 + chg = t - f; 268 + 269 + /* Check for and consume any regions we now overlap with. */ 270 + nrg = rg; 271 + list_for_each_entry_safe(rg, trg, rg->link.prev, link) { 272 + if (&rg->link == head) 273 + break; 274 + if (rg->from > t) 275 + break; 276 + 277 + /* We overlap with this area, if it extends further than 278 + * us then we must extend ourselves. Account for its 279 + * existing reservation. 280 + */ 281 + if (rg->to > t) { 282 + chg += rg->to - t; 283 + t = rg->to; 284 + } 285 + chg -= rg->to - rg->from; 286 + 287 + if (!count_only && rg != nrg) { 288 + list_del(&rg->link); 289 + kfree(rg); 290 + } 291 + } 292 + 293 + if (!count_only) { 294 + nrg->from = f; 295 + nrg->to = t; 296 + } 297 + 298 + return chg; 299 + } 300 + 247 301 /* 248 302 * Add the huge page range represented by [f, t) to the reserve 249 - * map. In the normal case, existing regions will be expanded 250 - * to accommodate the specified range. Sufficient regions should 251 - * exist for expansion due to the previous call to region_chg 252 - * with the same range. However, it is possible that region_del 253 - * could have been called after region_chg and modifed the map 254 - * in such a way that no region exists to be expanded. 
In this 255 - * case, pull a region descriptor from the cache associated with 256 - * the map and use that for the new range. 303 + * map. Existing regions will be expanded to accommodate the specified 304 + * range, or a region will be taken from the cache. Sufficient regions 305 + * must exist in the cache due to the previous call to region_chg with 306 + * the same range. 257 307 * 258 308 * Return the number of new huge pages added to the map. This 259 309 * number is greater than or equal to zero. ··· 311 261 static long region_add(struct resv_map *resv, long f, long t) 312 262 { 313 263 struct list_head *head = &resv->regions; 314 - struct file_region *rg, *nrg, *trg; 264 + struct file_region *rg, *nrg; 315 265 long add = 0; 316 266 317 267 spin_lock(&resv->lock); ··· 322 272 323 273 /* 324 274 * If no region exists which can be expanded to include the 325 - * specified range, the list must have been modified by an 326 - * interleving call to region_del(). Pull a region descriptor 327 - * from the cache and use it for this range. 275 + * specified range, pull a region descriptor from the cache 276 + * and use it for this range. 328 277 */ 329 278 if (&rg->link == head || t < rg->from) { 330 279 VM_BUG_ON(resv->region_cache_count <= 0); ··· 341 292 goto out_locked; 342 293 } 343 294 344 - /* Round our left edge to the current segment if it encloses us. */ 345 - if (f > rg->from) 346 - f = rg->from; 347 - 348 - /* Check for and consume any regions we now overlap with. */ 349 - nrg = rg; 350 - list_for_each_entry_safe(rg, trg, rg->link.prev, link) { 351 - if (&rg->link == head) 352 - break; 353 - if (rg->from > t) 354 - break; 355 - 356 - /* If this area reaches higher then extend our area to 357 - * include it completely. If this is not the first area 358 - * which we intend to reuse, free it. */ 359 - if (rg->to > t) 360 - t = rg->to; 361 - if (rg != nrg) { 362 - /* Decrement return value by the deleted range. 
363 - * Another range will span this area so that by 364 - * end of routine add will be >= zero 365 - */ 366 - add -= (rg->to - rg->from); 367 - list_del(&rg->link); 368 - kfree(rg); 369 - } 370 - } 371 - 372 - add += (nrg->from - f); /* Added to beginning of region */ 373 - nrg->from = f; 374 - add += t - nrg->to; /* Added to end of region */ 375 - nrg->to = t; 295 + add = add_reservation_in_range(resv, f, t, false); 376 296 377 297 out_locked: 378 298 resv->adds_in_progress--; ··· 357 339 * call to region_add that will actually modify the reserve 358 340 * map to add the specified range [f, t). region_chg does 359 341 * not change the number of huge pages represented by the 360 - * map. However, if the existing regions in the map can not 361 - * be expanded to represent the new range, a new file_region 362 - * structure is added to the map as a placeholder. This is 363 - * so that the subsequent region_add call will have all the 364 - * regions it needs and will not fail. 365 - * 366 - * Upon entry, region_chg will also examine the cache of region descriptors 367 - * associated with the map. If there are not enough descriptors cached, one 368 - * will be allocated for the in progress add operation. 342 + * map. A new file_region structure is added to the cache 343 + * as a placeholder, so that the subsequent region_add 344 + * call will have all the regions it needs and will not fail. 369 345 * 370 346 * Returns the number of huge pages that need to be added to the existing 371 347 * reservation map for the range [f, t). 
This number is greater or equal to ··· 368 356 */ 369 357 static long region_chg(struct resv_map *resv, long f, long t) 370 358 { 371 - struct list_head *head = &resv->regions; 372 - struct file_region *rg, *nrg = NULL; 373 359 long chg = 0; 374 360 375 - retry: 376 361 spin_lock(&resv->lock); 377 362 retry_locked: 378 363 resv->adds_in_progress++; ··· 387 378 spin_unlock(&resv->lock); 388 379 389 380 trg = kmalloc(sizeof(*trg), GFP_KERNEL); 390 - if (!trg) { 391 - kfree(nrg); 381 + if (!trg) 392 382 return -ENOMEM; 393 - } 394 383 395 384 spin_lock(&resv->lock); 396 385 list_add(&trg->link, &resv->region_cache); ··· 396 389 goto retry_locked; 397 390 } 398 391 399 - /* Locate the region we are before or in. */ 400 - list_for_each_entry(rg, head, link) 401 - if (f <= rg->to) 402 - break; 392 + chg = add_reservation_in_range(resv, f, t, true); 403 393 404 - /* If we are below the current region then a new region is required. 405 - * Subtle, allocate a new region at the position but make it zero 406 - * size such that we can guarantee to record the reservation. */ 407 - if (&rg->link == head || t < rg->from) { 408 - if (!nrg) { 409 - resv->adds_in_progress--; 410 - spin_unlock(&resv->lock); 411 - nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); 412 - if (!nrg) 413 - return -ENOMEM; 414 - 415 - nrg->from = f; 416 - nrg->to = f; 417 - INIT_LIST_HEAD(&nrg->link); 418 - goto retry; 419 - } 420 - 421 - list_add(&nrg->link, rg->link.prev); 422 - chg = t - f; 423 - goto out_nrg; 424 - } 425 - 426 - /* Round our left edge to the current segment if it encloses us. */ 427 - if (f > rg->from) 428 - f = rg->from; 429 - chg = t - f; 430 - 431 - /* Check for and consume any regions we now overlap with. */ 432 - list_for_each_entry(rg, rg->link.prev, link) { 433 - if (&rg->link == head) 434 - break; 435 - if (rg->from > t) 436 - goto out; 437 - 438 - /* We overlap with this area, if it extends further than 439 - * us then we must extend ourselves. 
Account for its 440 - * existing reservation. */ 441 - if (rg->to > t) { 442 - chg += rg->to - t; 443 - t = rg->to; 444 - } 445 - chg -= rg->to - rg->from; 446 - } 447 - 448 - out: 449 - spin_unlock(&resv->lock); 450 - /* We already know we raced and no longer need the new region */ 451 - kfree(nrg); 452 - return chg; 453 - out_nrg: 454 394 spin_unlock(&resv->lock); 455 395 return chg; 456 396 } ··· 1023 1069 } 1024 1070 1025 1071 #ifdef CONFIG_CONTIG_ALLOC 1026 - static int __alloc_gigantic_page(unsigned long start_pfn, 1027 - unsigned long nr_pages, gfp_t gfp_mask) 1028 - { 1029 - unsigned long end_pfn = start_pfn + nr_pages; 1030 - return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE, 1031 - gfp_mask); 1032 - } 1033 - 1034 - static bool pfn_range_valid_gigantic(struct zone *z, 1035 - unsigned long start_pfn, unsigned long nr_pages) 1036 - { 1037 - unsigned long i, end_pfn = start_pfn + nr_pages; 1038 - struct page *page; 1039 - 1040 - for (i = start_pfn; i < end_pfn; i++) { 1041 - page = pfn_to_online_page(i); 1042 - if (!page) 1043 - return false; 1044 - 1045 - if (page_zone(page) != z) 1046 - return false; 1047 - 1048 - if (PageReserved(page)) 1049 - return false; 1050 - 1051 - if (page_count(page) > 0) 1052 - return false; 1053 - 1054 - if (PageHuge(page)) 1055 - return false; 1056 - } 1057 - 1058 - return true; 1059 - } 1060 - 1061 - static bool zone_spans_last_pfn(const struct zone *zone, 1062 - unsigned long start_pfn, unsigned long nr_pages) 1063 - { 1064 - unsigned long last_pfn = start_pfn + nr_pages - 1; 1065 - return zone_spans_pfn(zone, last_pfn); 1066 - } 1067 - 1068 1072 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask, 1069 1073 int nid, nodemask_t *nodemask) 1070 1074 { 1071 - unsigned int order = huge_page_order(h); 1072 - unsigned long nr_pages = 1 << order; 1073 - unsigned long ret, pfn, flags; 1074 - struct zonelist *zonelist; 1075 - struct zone *zone; 1076 - struct zoneref *z; 1075 + unsigned long nr_pages = 
1UL << huge_page_order(h); 1077 1076 1078 - zonelist = node_zonelist(nid, gfp_mask); 1079 - for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nodemask) { 1080 - spin_lock_irqsave(&zone->lock, flags); 1081 - 1082 - pfn = ALIGN(zone->zone_start_pfn, nr_pages); 1083 - while (zone_spans_last_pfn(zone, pfn, nr_pages)) { 1084 - if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) { 1085 - /* 1086 - * We release the zone lock here because 1087 - * alloc_contig_range() will also lock the zone 1088 - * at some point. If there's an allocation 1089 - * spinning on this lock, it may win the race 1090 - * and cause alloc_contig_range() to fail... 1091 - */ 1092 - spin_unlock_irqrestore(&zone->lock, flags); 1093 - ret = __alloc_gigantic_page(pfn, nr_pages, gfp_mask); 1094 - if (!ret) 1095 - return pfn_to_page(pfn); 1096 - spin_lock_irqsave(&zone->lock, flags); 1097 - } 1098 - pfn += nr_pages; 1099 - } 1100 - 1101 - spin_unlock_irqrestore(&zone->lock, flags); 1102 - } 1103 - 1104 - return NULL; 1077 + return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask); 1105 1078 } 1106 1079 1107 1080 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid); ··· 3796 3915 * handling userfault. Reacquire after handling 3797 3916 * fault to make calling code simpler. 
3798 3917 */ 3799 - hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr); 3918 + hash = hugetlb_fault_mutex_hash(mapping, idx); 3800 3919 mutex_unlock(&hugetlb_fault_mutex_table[hash]); 3801 3920 ret = handle_userfault(&vmf, VM_UFFD_MISSING); 3802 3921 mutex_lock(&hugetlb_fault_mutex_table[hash]); ··· 3923 4042 } 3924 4043 3925 4044 #ifdef CONFIG_SMP 3926 - u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping, 3927 - pgoff_t idx, unsigned long address) 4045 + u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx) 3928 4046 { 3929 4047 unsigned long key[2]; 3930 4048 u32 hash; ··· 3931 4051 key[0] = (unsigned long) mapping; 3932 4052 key[1] = idx; 3933 4053 3934 - hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0); 4054 + hash = jhash2((u32 *)&key, sizeof(key)/(sizeof(u32)), 0); 3935 4055 3936 4056 return hash & (num_fault_mutexes - 1); 3937 4057 } ··· 3940 4060 * For uniprocesor systems we always use a single mutex, so just 3941 4061 * return 0 and avoid the hashing overhead. 3942 4062 */ 3943 - u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping, 3944 - pgoff_t idx, unsigned long address) 4063 + u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx) 3945 4064 { 3946 4065 return 0; 3947 4066 } ··· 3984 4105 * get spurious allocation failures if two CPUs race to instantiate 3985 4106 * the same page in the page cache. 3986 4107 */ 3987 - hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr); 4108 + hash = hugetlb_fault_mutex_hash(mapping, idx); 3988 4109 mutex_lock(&hugetlb_fault_mutex_table[hash]); 3989 4110 3990 4111 entry = huge_ptep_get(ptep); ··· 4338 4459 break; 4339 4460 } 4340 4461 } 4462 + 4463 + /* 4464 + * If subpage information not requested, update counters 4465 + * and skip the same_page loop below. 
4466 + */ 4467 + if (!pages && !vmas && !pfn_offset && 4468 + (vaddr + huge_page_size(h) < vma->vm_end) && 4469 + (remainder >= pages_per_huge_page(h))) { 4470 + vaddr += huge_page_size(h); 4471 + remainder -= pages_per_huge_page(h); 4472 + i += pages_per_huge_page(h); 4473 + spin_unlock(ptl); 4474 + continue; 4475 + } 4476 + 4341 4477 same_page: 4342 4478 if (pages) { 4343 4479 pages[i] = mem_map_offset(page, pfn_offset); ··· 4736 4842 if (!vma_shareable(vma, addr)) 4737 4843 return (pte_t *)pmd_alloc(mm, pud, addr); 4738 4844 4739 - i_mmap_lock_write(mapping); 4845 + i_mmap_lock_read(mapping); 4740 4846 vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { 4741 4847 if (svma == vma) 4742 4848 continue; ··· 4766 4872 spin_unlock(ptl); 4767 4873 out: 4768 4874 pte = (pte_t *)pmd_alloc(mm, pud, addr); 4769 - i_mmap_unlock_write(mapping); 4875 + i_mmap_unlock_read(mapping); 4770 4876 return pte; 4771 4877 } 4772 4878
+2 -2
mm/hwpoison-inject.c
··· 67 67 return unpoison_memory(val); 68 68 } 69 69 70 - DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n"); 71 - DEFINE_SIMPLE_ATTRIBUTE(unpoison_fops, NULL, hwpoison_unpoison, "%lli\n"); 70 + DEFINE_DEBUGFS_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n"); 71 + DEFINE_DEBUGFS_ATTRIBUTE(unpoison_fops, NULL, hwpoison_unpoison, "%lli\n"); 72 72 73 73 static void pfn_inject_exit(void) 74 74 {
+26 -1
mm/internal.h
··· 165 165 gfp_t gfp_flags); 166 166 extern int user_min_free_kbytes; 167 167 168 + extern void zone_pcp_update(struct zone *zone); 169 + extern void zone_pcp_reset(struct zone *zone); 170 + 168 171 #if defined CONFIG_COMPACTION || defined CONFIG_CMA 169 172 170 173 /* ··· 293 290 294 291 /* mm/util.c */ 295 292 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, 296 - struct vm_area_struct *prev, struct rb_node *rb_parent); 293 + struct vm_area_struct *prev); 294 + void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma); 297 295 298 296 #ifdef CONFIG_MMU 299 297 extern long populate_vma_page_range(struct vm_area_struct *vma, ··· 364 360 VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma); 365 361 366 362 return max(start, vma->vm_start); 363 + } 364 + 365 + static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, 366 + struct file *fpin) 367 + { 368 + int flags = vmf->flags; 369 + 370 + if (fpin) 371 + return fpin; 372 + 373 + /* 374 + * FAULT_FLAG_RETRY_NOWAIT means we don't want to wait on page locks or 375 + * anything, so we only pin the file and drop the mmap_sem if only 376 + * FAULT_FLAG_ALLOW_RETRY is set. 377 + */ 378 + if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) == 379 + FAULT_FLAG_ALLOW_RETRY) { 380 + fpin = get_file(vmf->vma->vm_file); 381 + up_read(&vmf->vma->vm_mm->mmap_sem); 382 + } 383 + return fpin; 367 384 } 368 385 369 386 #else /* !CONFIG_MMU */
+233
mm/kasan/common.c
··· 36 36 #include <linux/bug.h> 37 37 #include <linux/uaccess.h> 38 38 39 + #include <asm/tlbflush.h> 40 + 39 41 #include "kasan.h" 40 42 #include "../slab.h" 41 43 ··· 592 590 /* The object will be poisoned by page_alloc. */ 593 591 } 594 592 593 + #ifndef CONFIG_KASAN_VMALLOC 595 594 int kasan_module_alloc(void *addr, size_t size) 596 595 { 597 596 void *ret; ··· 628 625 if (vm->flags & VM_KASAN) 629 626 vfree(kasan_mem_to_shadow(vm->addr)); 630 627 } 628 + #endif 631 629 632 630 extern void __kasan_report(unsigned long addr, size_t size, bool is_write, unsigned long ip); 633 631 ··· 747 743 } 748 744 749 745 core_initcall(kasan_memhotplug_init); 746 + #endif 747 + 748 + #ifdef CONFIG_KASAN_VMALLOC 749 + static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr, 750 + void *unused) 751 + { 752 + unsigned long page; 753 + pte_t pte; 754 + 755 + if (likely(!pte_none(*ptep))) 756 + return 0; 757 + 758 + page = __get_free_page(GFP_KERNEL); 759 + if (!page) 760 + return -ENOMEM; 761 + 762 + memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE); 763 + pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL); 764 + 765 + spin_lock(&init_mm.page_table_lock); 766 + if (likely(pte_none(*ptep))) { 767 + set_pte_at(&init_mm, addr, ptep, pte); 768 + page = 0; 769 + } 770 + spin_unlock(&init_mm.page_table_lock); 771 + if (page) 772 + free_page(page); 773 + return 0; 774 + } 775 + 776 + int kasan_populate_vmalloc(unsigned long requested_size, struct vm_struct *area) 777 + { 778 + unsigned long shadow_start, shadow_end; 779 + int ret; 780 + 781 + shadow_start = (unsigned long)kasan_mem_to_shadow(area->addr); 782 + shadow_start = ALIGN_DOWN(shadow_start, PAGE_SIZE); 783 + shadow_end = (unsigned long)kasan_mem_to_shadow(area->addr + 784 + area->size); 785 + shadow_end = ALIGN(shadow_end, PAGE_SIZE); 786 + 787 + ret = apply_to_page_range(&init_mm, shadow_start, 788 + shadow_end - shadow_start, 789 + kasan_populate_vmalloc_pte, NULL); 790 + if (ret) 791 + return ret; 792 + 793 
+ flush_cache_vmap(shadow_start, shadow_end); 794 + 795 + kasan_unpoison_shadow(area->addr, requested_size); 796 + 797 + area->flags |= VM_KASAN; 798 + 799 + /* 800 + * We need to be careful about inter-cpu effects here. Consider: 801 + * 802 + * CPU#0 CPU#1 803 + * WRITE_ONCE(p, vmalloc(100)); while (x = READ_ONCE(p)) ; 804 + * p[99] = 1; 805 + * 806 + * With compiler instrumentation, that ends up looking like this: 807 + * 808 + * CPU#0 CPU#1 809 + * // vmalloc() allocates memory 810 + * // let a = area->addr 811 + * // we reach kasan_populate_vmalloc 812 + * // and call kasan_unpoison_shadow: 813 + * STORE shadow(a), unpoison_val 814 + * ... 815 + * STORE shadow(a+99), unpoison_val x = LOAD p 816 + * // rest of vmalloc process <data dependency> 817 + * STORE p, a LOAD shadow(x+99) 818 + * 819 + * If there is no barrier between the end of unpoisioning the shadow 820 + * and the store of the result to p, the stores could be committed 821 + * in a different order by CPU#0, and CPU#1 could erroneously observe 822 + * poison in the shadow. 823 + * 824 + * We need some sort of barrier between the stores. 825 + * 826 + * In the vmalloc() case, this is provided by a smp_wmb() in 827 + * clear_vm_uninitialized_flag(). In the per-cpu allocator and in 828 + * get_vm_area() and friends, the caller gets shadow allocated but 829 + * doesn't have any pages mapped into the virtual address space that 830 + * has been reserved. Mapping those pages in will involve taking and 831 + * releasing a page-table lock, which will provide the barrier. 832 + */ 833 + 834 + return 0; 835 + } 836 + 837 + /* 838 + * Poison the shadow for a vmalloc region. Called as part of the 839 + * freeing process at the time the region is freed. 
840 + */ 841 + void kasan_poison_vmalloc(void *start, unsigned long size) 842 + { 843 + size = round_up(size, KASAN_SHADOW_SCALE_SIZE); 844 + kasan_poison_shadow(start, size, KASAN_VMALLOC_INVALID); 845 + } 846 + 847 + static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr, 848 + void *unused) 849 + { 850 + unsigned long page; 851 + 852 + page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT); 853 + 854 + spin_lock(&init_mm.page_table_lock); 855 + 856 + if (likely(!pte_none(*ptep))) { 857 + pte_clear(&init_mm, addr, ptep); 858 + free_page(page); 859 + } 860 + spin_unlock(&init_mm.page_table_lock); 861 + 862 + return 0; 863 + } 864 + 865 + /* 866 + * Release the backing for the vmalloc region [start, end), which 867 + * lies within the free region [free_region_start, free_region_end). 868 + * 869 + * This can be run lazily, long after the region was freed. It runs 870 + * under vmap_area_lock, so it's not safe to interact with the vmalloc/vmap 871 + * infrastructure. 872 + * 873 + * How does this work? 874 + * ------------------- 875 + * 876 + * We have a region that is page aligned, labelled as A. 877 + * That might not map onto the shadow in a way that is page-aligned: 878 + * 879 + * start end 880 + * v v 881 + * |????????|????????|AAAAAAAA|AA....AA|AAAAAAAA|????????| < vmalloc 882 + * -------- -------- -------- -------- -------- 883 + * | | | | | 884 + * | | | /-------/ | 885 + * \-------\|/------/ |/---------------/ 886 + * ||| || 887 + * |??AAAAAA|AAAAAAAA|AA??????| < shadow 888 + * (1) (2) (3) 889 + * 890 + * First we align the start upwards and the end downwards, so that the 891 + * shadow of the region aligns with shadow page boundaries. In the 892 + * example, this gives us the shadow page (2). This is the shadow entirely 893 + * covered by this allocation. 894 + * 895 + * Then we have the tricky bits. We want to know if we can free the 896 + * partially covered shadow pages - (1) and (3) in the example. 
For this, 897 + * we are given the start and end of the free region that contains this 898 + * allocation. Extending our previous example, we could have: 899 + * 900 + * free_region_start free_region_end 901 + * | start end | 902 + * v v v v 903 + * |FFFFFFFF|FFFFFFFF|AAAAAAAA|AA....AA|AAAAAAAA|FFFFFFFF| < vmalloc 904 + * -------- -------- -------- -------- -------- 905 + * | | | | | 906 + * | | | /-------/ | 907 + * \-------\|/------/ |/---------------/ 908 + * ||| || 909 + * |FFAAAAAA|AAAAAAAA|AAF?????| < shadow 910 + * (1) (2) (3) 911 + * 912 + * Once again, we align the start of the free region up, and the end of 913 + * the free region down so that the shadow is page aligned. So we can free 914 + * page (1) - we know no allocation currently uses anything in that page, 915 + * because all of it is in the vmalloc free region. But we cannot free 916 + * page (3), because we can't be sure that the rest of it is unused. 917 + * 918 + * We only consider pages that contain part of the original region for 919 + * freeing: we don't try to free other pages from the free region or we'd 920 + * end up trying to free huge chunks of virtual address space. 921 + * 922 + * Concurrency 923 + * ----------- 924 + * 925 + * How do we know that we're not freeing a page that is simultaneously 926 + * being used for a fresh allocation in kasan_populate_vmalloc(_pte)? 927 + * 928 + * We _can_ have kasan_release_vmalloc and kasan_populate_vmalloc running 929 + * at the same time. While we run under free_vmap_area_lock, the population 930 + * code does not. 931 + * 932 + * free_vmap_area_lock instead operates to ensure that the larger range 933 + * [free_region_start, free_region_end) is safe: because __alloc_vmap_area and 934 + * the per-cpu region-finding algorithm both run under free_vmap_area_lock, 935 + * no space identified as free will become used while we are running. 
This 936 + * means that so long as we are careful with alignment and only free shadow 937 + * pages entirely covered by the free region, we will not run in to any 938 + * trouble - any simultaneous allocations will be for disjoint regions. 939 + */ 940 + void kasan_release_vmalloc(unsigned long start, unsigned long end, 941 + unsigned long free_region_start, 942 + unsigned long free_region_end) 943 + { 944 + void *shadow_start, *shadow_end; 945 + unsigned long region_start, region_end; 946 + 947 + region_start = ALIGN(start, PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE); 948 + region_end = ALIGN_DOWN(end, PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE); 949 + 950 + free_region_start = ALIGN(free_region_start, 951 + PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE); 952 + 953 + if (start != region_start && 954 + free_region_start < region_start) 955 + region_start -= PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE; 956 + 957 + free_region_end = ALIGN_DOWN(free_region_end, 958 + PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE); 959 + 960 + if (end != region_end && 961 + free_region_end > region_end) 962 + region_end += PAGE_SIZE * KASAN_SHADOW_SCALE_SIZE; 963 + 964 + shadow_start = kasan_mem_to_shadow((void *)region_start); 965 + shadow_end = kasan_mem_to_shadow((void *)region_end); 966 + 967 + if (shadow_end > shadow_start) { 968 + apply_to_page_range(&init_mm, (unsigned long)shadow_start, 969 + (unsigned long)(shadow_end - shadow_start), 970 + kasan_depopulate_vmalloc_pte, NULL); 971 + flush_tlb_kernel_range((unsigned long)shadow_start, 972 + (unsigned long)shadow_end); 973 + } 974 + } 750 975 #endif
+3
mm/kasan/generic_report.c
··· 86 86 case KASAN_ALLOCA_RIGHT: 87 87 bug_type = "alloca-out-of-bounds"; 88 88 break; 89 + case KASAN_VMALLOC_INVALID: 90 + bug_type = "vmalloc-out-of-bounds"; 91 + break; 89 92 } 90 93 91 94 return bug_type;
+1
mm/kasan/kasan.h
··· 25 25 #endif 26 26 27 27 #define KASAN_GLOBAL_REDZONE 0xFA /* redzone for global variable */ 28 + #define KASAN_VMALLOC_INVALID 0xF9 /* unallocated space in vmapped page */ 28 29 29 30 /* 30 31 * Stack redzone shadow values
+18
mm/khugepaged.c
··· 1602 1602 result = SCAN_FAIL; 1603 1603 goto xa_unlocked; 1604 1604 } 1605 + } else if (PageDirty(page)) { 1606 + /* 1607 + * khugepaged only works on read-only fd, 1608 + * so this page is dirty because it hasn't 1609 + * been flushed since first write. There 1610 + * won't be new dirty pages. 1611 + * 1612 + * Trigger async flush here and hope the 1613 + * writeback is done when khugepaged 1614 + * revisits this page. 1615 + * 1616 + * This is a one-off situation. We are not 1617 + * forcing writeback in loop. 1618 + */ 1619 + xas_unlock_irq(&xas); 1620 + filemap_flush(mapping); 1621 + result = SCAN_FAIL; 1622 + goto xa_unlocked; 1605 1623 } else if (trylock_page(page)) { 1606 1624 get_page(page); 1607 1625 xas_unlock_irq(&xas);
+7 -7
mm/madvise.c
··· 864 864 { 865 865 struct page *page; 866 866 struct zone *zone; 867 - unsigned int order; 867 + unsigned long size; 868 868 869 869 if (!capable(CAP_SYS_ADMIN)) 870 870 return -EPERM; 871 871 872 872 873 - for (; start < end; start += PAGE_SIZE << order) { 873 + for (; start < end; start += size) { 874 874 unsigned long pfn; 875 875 int ret; 876 876 ··· 882 882 /* 883 883 * When soft offlining hugepages, after migrating the page 884 884 * we dissolve it, therefore in the second loop "page" will 885 - * no longer be a compound page, and order will be 0. 885 + * no longer be a compound page. 886 886 */ 887 - order = compound_order(compound_head(page)); 887 + size = page_size(compound_head(page)); 888 888 889 889 if (PageHWPoison(page)) { 890 890 put_page(page); ··· 895 895 pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n", 896 896 pfn, start); 897 897 898 - ret = soft_offline_page(page, MF_COUNT_INCREASED); 898 + ret = soft_offline_page(pfn, MF_COUNT_INCREASED); 899 899 if (ret) 900 900 return ret; 901 901 continue; ··· 1059 1059 if (!madvise_behavior_valid(behavior)) 1060 1060 return error; 1061 1061 1062 - if (start & ~PAGE_MASK) 1062 + if (!PAGE_ALIGNED(start)) 1063 1063 return error; 1064 - len = (len_in + ~PAGE_MASK) & PAGE_MASK; 1064 + len = PAGE_ALIGN(len_in); 1065 1065 1066 1066 /* Check to see whether len was rounded up from small -ve to zero */ 1067 1067 if (len_in && !len)
+75 -36
mm/memblock.c
··· 57 57 * at build time. The region arrays for the "memory" and "reserved" 58 58 * types are initially sized to %INIT_MEMBLOCK_REGIONS and for the 59 59 * "physmap" type to %INIT_PHYSMEM_REGIONS. 60 - * The :c:func:`memblock_allow_resize` enables automatic resizing of 61 - * the region arrays during addition of new regions. This feature 62 - * should be used with care so that memory allocated for the region 63 - * array will not overlap with areas that should be reserved, for 64 - * example initrd. 60 + * The memblock_allow_resize() enables automatic resizing of the region 61 + * arrays during addition of new regions. This feature should be used 62 + * with care so that memory allocated for the region array will not 63 + * overlap with areas that should be reserved, for example initrd. 65 64 * 66 65 * The early architecture setup should tell memblock what the physical 67 - * memory layout is by using :c:func:`memblock_add` or 68 - * :c:func:`memblock_add_node` functions. The first function does not 69 - * assign the region to a NUMA node and it is appropriate for UMA 70 - * systems. Yet, it is possible to use it on NUMA systems as well and 71 - * assign the region to a NUMA node later in the setup process using 72 - * :c:func:`memblock_set_node`. The :c:func:`memblock_add_node` 73 - * performs such an assignment directly. 66 + * memory layout is by using memblock_add() or memblock_add_node() 67 + * functions. The first function does not assign the region to a NUMA 68 + * node and it is appropriate for UMA systems. Yet, it is possible to 69 + * use it on NUMA systems as well and assign the region to a NUMA node 70 + * later in the setup process using memblock_set_node(). The 71 + * memblock_add_node() performs such an assignment directly. 
74 72 * 75 73 * Once memblock is setup the memory can be allocated using one of the 76 74 * API variants: 77 75 * 78 - * * :c:func:`memblock_phys_alloc*` - these functions return the 79 - * **physical** address of the allocated memory 80 - * * :c:func:`memblock_alloc*` - these functions return the **virtual** 81 - * address of the allocated memory. 76 + * * memblock_phys_alloc*() - these functions return the **physical** 77 + * address of the allocated memory 78 + * * memblock_alloc*() - these functions return the **virtual** address 79 + * of the allocated memory. 82 80 * 83 81 * Note, that both API variants use implict assumptions about allowed 84 82 * memory ranges and the fallback methods. Consult the documentation 85 - * of :c:func:`memblock_alloc_internal` and 86 - * :c:func:`memblock_alloc_range_nid` functions for more elaboarte 87 - * description. 83 + * of memblock_alloc_internal() and memblock_alloc_range_nid() 84 + * functions for more elaborate description. 88 85 * 89 - * As the system boot progresses, the architecture specific 90 - * :c:func:`mem_init` function frees all the memory to the buddy page 91 - * allocator. 86 + * As the system boot progresses, the architecture specific mem_init() 87 + * function frees all the memory to the buddy page allocator. 92 88 * 93 - * Unless an architecure enables %CONFIG_ARCH_KEEP_MEMBLOCK, the 89 + * Unless an architecture enables %CONFIG_ARCH_KEEP_MEMBLOCK, the 94 90 * memblock data structures will be discarded after the system 95 - * initialization compltes. 91 + * initialization completes. 
96 92 */ 97 93 98 94 #ifndef CONFIG_NEED_MULTIPLE_NODES ··· 1319 1323 * @start: the lower bound of the memory region to allocate (phys address) 1320 1324 * @end: the upper bound of the memory region to allocate (phys address) 1321 1325 * @nid: nid of the free area to find, %NUMA_NO_NODE for any node 1326 + * @exact_nid: control the allocation fall back to other nodes 1322 1327 * 1323 1328 * The allocation is performed from memory region limited by 1324 - * memblock.current_limit if @max_addr == %MEMBLOCK_ALLOC_ACCESSIBLE. 1329 + * memblock.current_limit if @end == %MEMBLOCK_ALLOC_ACCESSIBLE. 1325 1330 * 1326 - * If the specified node can not hold the requested memory the 1327 - * allocation falls back to any node in the system 1331 + * If the specified node can not hold the requested memory and @exact_nid 1332 + * is false, the allocation falls back to any node in the system. 1328 1333 * 1329 1334 * For systems with memory mirroring, the allocation is attempted first 1330 1335 * from the regions with mirroring enabled and then retried from any ··· 1339 1342 */ 1340 1343 static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, 1341 1344 phys_addr_t align, phys_addr_t start, 1342 - phys_addr_t end, int nid) 1345 + phys_addr_t end, int nid, 1346 + bool exact_nid) 1343 1347 { 1344 1348 enum memblock_flags flags = choose_memblock_flags(); 1345 1349 phys_addr_t found; ··· 1360 1362 if (found && !memblock_reserve(found, size)) 1361 1363 goto done; 1362 1364 1363 - if (nid != NUMA_NO_NODE) { 1365 + if (nid != NUMA_NO_NODE && !exact_nid) { 1364 1366 found = memblock_find_in_range_node(size, align, start, 1365 1367 end, NUMA_NO_NODE, 1366 1368 flags); ··· 1408 1410 phys_addr_t start, 1409 1411 phys_addr_t end) 1410 1412 { 1411 - return memblock_alloc_range_nid(size, align, start, end, NUMA_NO_NODE); 1413 + return memblock_alloc_range_nid(size, align, start, end, NUMA_NO_NODE, 1414 + false); 1412 1415 } 1413 1416 1414 1417 /** ··· 1428 1429 phys_addr_t __init 
memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t align, int nid) 1429 1430 { 1430 1431 return memblock_alloc_range_nid(size, align, 0, 1431 - MEMBLOCK_ALLOC_ACCESSIBLE, nid); 1432 + MEMBLOCK_ALLOC_ACCESSIBLE, nid, false); 1432 1433 } 1433 1434 1434 1435 /** ··· 1438 1439 * @min_addr: the lower bound of the memory region to allocate (phys address) 1439 1440 * @max_addr: the upper bound of the memory region to allocate (phys address) 1440 1441 * @nid: nid of the free area to find, %NUMA_NO_NODE for any node 1442 + * @exact_nid: control the allocation fall back to other nodes 1441 1443 * 1442 1444 * Allocates memory block using memblock_alloc_range_nid() and 1443 1445 * converts the returned physical address to virtual. ··· 1454 1454 static void * __init memblock_alloc_internal( 1455 1455 phys_addr_t size, phys_addr_t align, 1456 1456 phys_addr_t min_addr, phys_addr_t max_addr, 1457 - int nid) 1457 + int nid, bool exact_nid) 1458 1458 { 1459 1459 phys_addr_t alloc; 1460 1460 ··· 1469 1469 if (max_addr > memblock.current_limit) 1470 1470 max_addr = memblock.current_limit; 1471 1471 1472 - alloc = memblock_alloc_range_nid(size, align, min_addr, max_addr, nid); 1472 + alloc = memblock_alloc_range_nid(size, align, min_addr, max_addr, nid, 1473 + exact_nid); 1473 1474 1474 1475 /* retry allocation without lower limit */ 1475 1476 if (!alloc && min_addr) 1476 - alloc = memblock_alloc_range_nid(size, align, 0, max_addr, nid); 1477 + alloc = memblock_alloc_range_nid(size, align, 0, max_addr, nid, 1478 + exact_nid); 1477 1479 1478 1480 if (!alloc) 1479 1481 return NULL; 1480 1482 1481 1483 return phys_to_virt(alloc); 1484 + } 1485 + 1486 + /** 1487 + * memblock_alloc_exact_nid_raw - allocate boot memory block on the exact node 1488 + * without zeroing memory 1489 + * @size: size of memory block to be allocated in bytes 1490 + * @align: alignment of the region and block's size 1491 + * @min_addr: the lower bound of the memory region from where the allocation 1492 + * 
is preferred (phys address) 1493 + * @max_addr: the upper bound of the memory region from where the allocation 1494 + * is preferred (phys address), or %MEMBLOCK_ALLOC_ACCESSIBLE to 1495 + * allocate only from memory limited by memblock.current_limit value 1496 + * @nid: nid of the free area to find, %NUMA_NO_NODE for any node 1497 + * 1498 + * Public function, provides additional debug information (including caller 1499 + * info), if enabled. Does not zero allocated memory. 1500 + * 1501 + * Return: 1502 + * Virtual address of allocated memory block on success, NULL on failure. 1503 + */ 1504 + void * __init memblock_alloc_exact_nid_raw( 1505 + phys_addr_t size, phys_addr_t align, 1506 + phys_addr_t min_addr, phys_addr_t max_addr, 1507 + int nid) 1508 + { 1509 + void *ptr; 1510 + 1511 + memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", 1512 + __func__, (u64)size, (u64)align, nid, &min_addr, 1513 + &max_addr, (void *)_RET_IP_); 1514 + 1515 + ptr = memblock_alloc_internal(size, align, 1516 + min_addr, max_addr, nid, true); 1517 + if (ptr && size > 0) 1518 + page_init_poison(ptr, size); 1519 + 1520 + return ptr; 1482 1521 } 1483 1522 1484 1523 /** ··· 1551 1512 &max_addr, (void *)_RET_IP_); 1552 1513 1553 1514 ptr = memblock_alloc_internal(size, align, 1554 - min_addr, max_addr, nid); 1515 + min_addr, max_addr, nid, false); 1555 1516 if (ptr && size > 0) 1556 1517 page_init_poison(ptr, size); 1557 1518 ··· 1586 1547 __func__, (u64)size, (u64)align, nid, &min_addr, 1587 1548 &max_addr, (void *)_RET_IP_); 1588 1549 ptr = memblock_alloc_internal(size, align, 1589 - min_addr, max_addr, nid); 1550 + min_addr, max_addr, nid, false); 1590 1551 if (ptr) 1591 1552 memset(ptr, 0, size); 1592 1553
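The memblock changes above thread a new `exact_nid` flag through `memblock_alloc_range_nid()` so that `memblock_alloc_exact_nid_raw()` can suppress the any-node fallback. A minimal sketch of that gating (with `find_in_node()` as a made-up stand-in that pretends only node 0 has free memory):

```c
#include <assert.h>

/* Minimal model of the new exact_nid parameter to memblock_alloc_range_nid():
 * the requested node is tried first, and the NUMA_NO_NODE fallback is taken
 * only when exact_nid is false. find_in_node() is an illustrative stand-in
 * that pretends only node 0 can satisfy the request. */
#define NUMA_NO_NODE (-1)

static unsigned long find_in_node(int nid)
{
	if (nid == 0 || nid == NUMA_NO_NODE)
		return 0x100000;	/* pretend node 0 has a free region */
	return 0;
}

static unsigned long alloc_range_nid(int nid, int exact_nid)
{
	unsigned long found = find_in_node(nid);

	if (found)
		return found;

	/* requested node is full: fall back only if the caller allows it */
	if (nid != NUMA_NO_NODE && !exact_nid)
		return find_in_node(NUMA_NO_NODE);

	return 0;
}
```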
+33 -134
mm/memcontrol.c
··· 108 108 109 109 #define THRESHOLDS_EVENTS_TARGET 128 110 110 #define SOFTLIMIT_EVENTS_TARGET 1024 111 - #define NUMAINFO_EVENTS_TARGET 1024 112 111 113 112 /* 114 113 * Cgroups above their limits are maintained in a RB-Tree, independent of ··· 777 778 if (!memcg || memcg == root_mem_cgroup) { 778 779 __mod_node_page_state(pgdat, idx, val); 779 780 } else { 780 - lruvec = mem_cgroup_lruvec(pgdat, memcg); 781 + lruvec = mem_cgroup_lruvec(memcg, pgdat); 781 782 __mod_lruvec_state(lruvec, idx, val); 782 783 } 783 784 rcu_read_unlock(); ··· 876 877 case MEM_CGROUP_TARGET_SOFTLIMIT: 877 878 next = val + SOFTLIMIT_EVENTS_TARGET; 878 879 break; 879 - case MEM_CGROUP_TARGET_NUMAINFO: 880 - next = val + NUMAINFO_EVENTS_TARGET; 881 - break; 882 880 default: 883 881 break; 884 882 } ··· 895 899 if (unlikely(mem_cgroup_event_ratelimit(memcg, 896 900 MEM_CGROUP_TARGET_THRESH))) { 897 901 bool do_softlimit; 898 - bool do_numainfo __maybe_unused; 899 902 900 903 do_softlimit = mem_cgroup_event_ratelimit(memcg, 901 904 MEM_CGROUP_TARGET_SOFTLIMIT); 902 - #if MAX_NUMNODES > 1 903 - do_numainfo = mem_cgroup_event_ratelimit(memcg, 904 - MEM_CGROUP_TARGET_NUMAINFO); 905 - #endif 906 905 mem_cgroup_threshold(memcg); 907 906 if (unlikely(do_softlimit)) 908 907 mem_cgroup_update_tree(memcg, page); 909 - #if MAX_NUMNODES > 1 910 - if (unlikely(do_numainfo)) 911 - atomic_inc(&memcg->numainfo_events); 912 - #endif 913 908 } 914 909 } 915 910 ··· 1039 1052 struct mem_cgroup_per_node *mz; 1040 1053 1041 1054 mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id); 1042 - iter = &mz->iter[reclaim->priority]; 1055 + iter = &mz->iter; 1043 1056 1044 1057 if (prev && reclaim->generation != iter->generation) 1045 1058 goto out_unlock; ··· 1139 1152 struct mem_cgroup_reclaim_iter *iter; 1140 1153 struct mem_cgroup_per_node *mz; 1141 1154 int nid; 1142 - int i; 1143 1155 1144 1156 for_each_node(nid) { 1145 1157 mz = mem_cgroup_nodeinfo(from, nid); 1146 - for (i = 0; i <= DEF_PRIORITY; i++) { 1147 
- iter = &mz->iter[i]; 1148 - cmpxchg(&iter->position, 1149 - dead_memcg, NULL); 1150 - } 1158 + iter = &mz->iter; 1159 + cmpxchg(&iter->position, dead_memcg, NULL); 1151 1160 } 1152 1161 } 1153 1162 ··· 1221 1238 struct lruvec *lruvec; 1222 1239 1223 1240 if (mem_cgroup_disabled()) { 1224 - lruvec = &pgdat->lruvec; 1241 + lruvec = &pgdat->__lruvec; 1225 1242 goto out; 1226 1243 } 1227 1244 ··· 1578 1595 return ret; 1579 1596 } 1580 1597 1581 - #if MAX_NUMNODES > 1 1582 - 1583 - /** 1584 - * test_mem_cgroup_node_reclaimable 1585 - * @memcg: the target memcg 1586 - * @nid: the node ID to be checked. 1587 - * @noswap : specify true here if the user wants flle only information. 1588 - * 1589 - * This function returns whether the specified memcg contains any 1590 - * reclaimable pages on a node. Returns true if there are any reclaimable 1591 - * pages in the node. 1592 - */ 1593 - static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg, 1594 - int nid, bool noswap) 1595 - { 1596 - struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg); 1597 - 1598 - if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) || 1599 - lruvec_page_state(lruvec, NR_ACTIVE_FILE)) 1600 - return true; 1601 - if (noswap || !total_swap_pages) 1602 - return false; 1603 - if (lruvec_page_state(lruvec, NR_INACTIVE_ANON) || 1604 - lruvec_page_state(lruvec, NR_ACTIVE_ANON)) 1605 - return true; 1606 - return false; 1607 - 1608 - } 1609 - 1610 - /* 1611 - * Always updating the nodemask is not very good - even if we have an empty 1612 - * list or the wrong list here, we can start from some node and traverse all 1613 - * nodes based on the zonelist. So update the list loosely once per 10 secs. 1614 - * 1615 - */ 1616 - static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg) 1617 - { 1618 - int nid; 1619 - /* 1620 - * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET 1621 - * pagein/pageout changes since the last update. 
1622 - */ 1623 - if (!atomic_read(&memcg->numainfo_events)) 1624 - return; 1625 - if (atomic_inc_return(&memcg->numainfo_updating) > 1) 1626 - return; 1627 - 1628 - /* make a nodemask where this memcg uses memory from */ 1629 - memcg->scan_nodes = node_states[N_MEMORY]; 1630 - 1631 - for_each_node_mask(nid, node_states[N_MEMORY]) { 1632 - 1633 - if (!test_mem_cgroup_node_reclaimable(memcg, nid, false)) 1634 - node_clear(nid, memcg->scan_nodes); 1635 - } 1636 - 1637 - atomic_set(&memcg->numainfo_events, 0); 1638 - atomic_set(&memcg->numainfo_updating, 0); 1639 - } 1640 - 1641 - /* 1642 - * Selecting a node where we start reclaim from. Because what we need is just 1643 - * reducing usage counter, start from anywhere is O,K. Considering 1644 - * memory reclaim from current node, there are pros. and cons. 1645 - * 1646 - * Freeing memory from current node means freeing memory from a node which 1647 - * we'll use or we've used. So, it may make LRU bad. And if several threads 1648 - * hit limits, it will see a contention on a node. But freeing from remote 1649 - * node means more costs for memory reclaim because of memory latency. 1650 - * 1651 - * Now, we use round-robin. Better algorithm is welcomed. 1652 - */ 1653 - int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) 1654 - { 1655 - int node; 1656 - 1657 - mem_cgroup_may_update_nodemask(memcg); 1658 - node = memcg->last_scanned_node; 1659 - 1660 - node = next_node_in(node, memcg->scan_nodes); 1661 - /* 1662 - * mem_cgroup_may_update_nodemask might have seen no reclaimmable pages 1663 - * last time it really checked all the LRUs due to rate limiting. 1664 - * Fallback to the current node in that case for simplicity. 
1665 - */ 1666 - if (unlikely(node == MAX_NUMNODES)) 1667 - node = numa_node_id(); 1668 - 1669 - memcg->last_scanned_node = node; 1670 - return node; 1671 - } 1672 - #else 1673 - int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) 1674 - { 1675 - return 0; 1676 - } 1677 - #endif 1678 - 1679 1598 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, 1680 1599 pg_data_t *pgdat, 1681 1600 gfp_t gfp_mask, ··· 1590 1705 unsigned long nr_scanned; 1591 1706 struct mem_cgroup_reclaim_cookie reclaim = { 1592 1707 .pgdat = pgdat, 1593 - .priority = 0, 1594 1708 }; 1595 1709 1596 1710 excess = soft_limit_excess(root_memcg); ··· 3634 3750 static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, 3635 3751 int nid, unsigned int lru_mask) 3636 3752 { 3637 - struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg); 3753 + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid)); 3638 3754 unsigned long nr = 0; 3639 3755 enum lru_list lru; 3640 3756 ··· 4962 5078 goto fail; 4963 5079 4964 5080 INIT_WORK(&memcg->high_work, high_work_func); 4965 - memcg->last_scanned_node = MAX_NUMNODES; 4966 5081 INIT_LIST_HEAD(&memcg->oom_notify); 4967 5082 mutex_init(&memcg->thresholds_lock); 4968 5083 spin_lock_init(&memcg->move_lock); ··· 5338 5455 anon = PageAnon(page); 5339 5456 5340 5457 pgdat = page_pgdat(page); 5341 - from_vec = mem_cgroup_lruvec(pgdat, from); 5342 - to_vec = mem_cgroup_lruvec(pgdat, to); 5458 + from_vec = mem_cgroup_lruvec(from, pgdat); 5459 + to_vec = mem_cgroup_lruvec(to, pgdat); 5343 5460 5344 5461 spin_lock_irqsave(&from->move_lock, flags); 5345 5462 ··· 5979 6096 char *buf, size_t nbytes, loff_t off) 5980 6097 { 5981 6098 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 5982 - unsigned long nr_pages; 6099 + unsigned int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 6100 + bool drained = false; 5983 6101 unsigned long high; 5984 6102 int err; 5985 6103 ··· 5991 6107 5992 6108 memcg->high = high; 5993 6109 
5994 - nr_pages = page_counter_read(&memcg->memory); 5995 - if (nr_pages > high) 5996 - try_to_free_mem_cgroup_pages(memcg, nr_pages - high, 5997 - GFP_KERNEL, true); 6110 + for (;;) { 6111 + unsigned long nr_pages = page_counter_read(&memcg->memory); 6112 + unsigned long reclaimed; 5998 6113 5999 - memcg_wb_domain_size_changed(memcg); 6114 + if (nr_pages <= high) 6115 + break; 6116 + 6117 + if (signal_pending(current)) 6118 + break; 6119 + 6120 + if (!drained) { 6121 + drain_all_stock(memcg); 6122 + drained = true; 6123 + continue; 6124 + } 6125 + 6126 + reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, 6127 + GFP_KERNEL, true); 6128 + 6129 + if (!reclaimed && !nr_retries--) 6130 + break; 6131 + } 6132 + 6000 6133 return nbytes; 6001 6134 } 6002 6135 ··· 6045 6144 if (nr_pages <= max) 6046 6145 break; 6047 6146 6048 - if (signal_pending(current)) { 6049 - err = -EINTR; 6147 + if (signal_pending(current)) 6050 6148 break; 6051 - } 6052 6149 6053 6150 if (!drained) { 6054 6151 drain_all_stock(memcg);
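The memory.high write path above changes from a single reclaim attempt to a loop: drain the per-cpu charge caches once, then keep reclaiming until usage drops below the new limit, a signal arrives, or a retry budget runs out. A toy user-space model of that control flow (the names and the "8 pages per pass" reclaim behavior are illustrative, not kernel API):

```c
#include <assert.h>

/* Toy model of the reworked memory_high_write() loop. Reclaim here is
 * simulated: it frees up to 8 pages per pass and fails outright for
 * smaller requests, so the bounded-retry exit can be exercised. */
#define MEMCG_RECLAIM_RETRIES 5

static unsigned long enforce_high(unsigned long usage, unsigned long stocked,
				  unsigned long high)
{
	int nr_retries = MEMCG_RECLAIM_RETRIES;
	int drained = 0;

	for (;;) {
		unsigned long want, reclaimed;

		if (usage <= high)
			break;

		if (!drained) {
			usage -= stocked;	/* drain_all_stock() */
			drained = 1;
			continue;
		}

		/* try_to_free_mem_cgroup_pages() stand-in */
		want = usage - high;
		reclaimed = want >= 8 ? 8 : 0;
		usage -= reclaimed;

		if (!reclaimed && !nr_retries--)
			break;
	}
	return usage;
}
```

The point of the loop is that a one-shot reclaim could leave usage above the freshly lowered limit; retrying while progress is made, with a cap when it is not, matches the behavior of the max write handler just below it.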
+23 -38
mm/memory-failure.c
··· 303 303 /* 304 304 * Schedule a process for later kill. 305 305 * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM. 306 - * TBD would GFP_NOIO be enough? 307 306 */ 308 307 static void add_to_kill(struct task_struct *tsk, struct page *p, 309 308 struct vm_area_struct *vma, 310 - struct list_head *to_kill, 311 - struct to_kill **tkc) 309 + struct list_head *to_kill) 312 310 { 313 311 struct to_kill *tk; 314 312 315 - if (*tkc) { 316 - tk = *tkc; 317 - *tkc = NULL; 318 - } else { 319 - tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC); 320 - if (!tk) { 321 - pr_err("Memory failure: Out of memory while machine check handling\n"); 322 - return; 323 - } 313 + tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC); 314 + if (!tk) { 315 + pr_err("Memory failure: Out of memory while machine check handling\n"); 316 + return; 324 317 } 318 + 325 319 tk->addr = page_address_in_vma(p, vma); 326 320 if (is_zone_device_page(p)) 327 321 tk->size_shift = dev_pagemap_mapping_shift(p, vma); 328 322 else 329 - tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT; 323 + tk->size_shift = page_shift(compound_head(p)); 330 324 331 325 /* 332 326 * Send SIGKILL if "tk->addr == -EFAULT". Also, as ··· 339 345 kfree(tk); 340 346 return; 341 347 } 348 + 342 349 get_task_struct(tsk); 343 350 tk->tsk = tsk; 344 351 list_add_tail(&tk->nd, to_kill); ··· 431 436 * Collect processes when the error hit an anonymous page. 432 437 */ 433 438 static void collect_procs_anon(struct page *page, struct list_head *to_kill, 434 - struct to_kill **tkc, int force_early) 439 + int force_early) 435 440 { 436 441 struct vm_area_struct *vma; 437 442 struct task_struct *tsk; ··· 456 461 if (!page_mapped_in_vma(page, vma)) 457 462 continue; 458 463 if (vma->vm_mm == t->mm) 459 - add_to_kill(t, page, vma, to_kill, tkc); 464 + add_to_kill(t, page, vma, to_kill); 460 465 } 461 466 } 462 467 read_unlock(&tasklist_lock); ··· 467 472 * Collect processes when the error hit a file mapped page. 
468 473 */ 469 474 static void collect_procs_file(struct page *page, struct list_head *to_kill, 470 - struct to_kill **tkc, int force_early) 475 + int force_early) 471 476 { 472 477 struct vm_area_struct *vma; 473 478 struct task_struct *tsk; ··· 491 496 * to be informed of all such data corruptions. 492 497 */ 493 498 if (vma->vm_mm == t->mm) 494 - add_to_kill(t, page, vma, to_kill, tkc); 499 + add_to_kill(t, page, vma, to_kill); 495 500 } 496 501 } 497 502 read_unlock(&tasklist_lock); ··· 500 505 501 506 /* 502 507 * Collect the processes who have the corrupted page mapped to kill. 503 - * This is done in two steps for locking reasons. 504 - * First preallocate one tokill structure outside the spin locks, 505 - * so that we can kill at least one process reasonably reliable. 506 508 */ 507 509 static void collect_procs(struct page *page, struct list_head *tokill, 508 510 int force_early) 509 511 { 510 - struct to_kill *tk; 511 - 512 512 if (!page->mapping) 513 513 return; 514 514 515 - tk = kmalloc(sizeof(struct to_kill), GFP_NOIO); 516 - if (!tk) 517 - return; 518 515 if (PageAnon(page)) 519 - collect_procs_anon(page, tokill, &tk, force_early); 516 + collect_procs_anon(page, tokill, force_early); 520 517 else 521 - collect_procs_file(page, tokill, &tk, force_early); 522 - kfree(tk); 518 + collect_procs_file(page, tokill, force_early); 523 519 } 524 520 525 521 static const char *action_name[] = { ··· 1476 1490 if (!gotten) 1477 1491 break; 1478 1492 if (entry.flags & MF_SOFT_OFFLINE) 1479 - soft_offline_page(pfn_to_page(entry.pfn), entry.flags); 1493 + soft_offline_page(entry.pfn, entry.flags); 1480 1494 else 1481 1495 memory_failure(entry.pfn, entry.flags); 1482 1496 } ··· 1857 1871 1858 1872 /** 1859 1873 * soft_offline_page - Soft offline a page. 1860 - * @page: page to offline 1874 + * @pfn: pfn to soft-offline 1861 1875 * @flags: flags. Same as memory_failure(). 1862 1876 * 1863 1877 * Returns 0 on success, otherwise negated errno. 
··· 1877 1891 * This is not a 100% solution for all memory, but tries to be 1878 1892 * ``good enough'' for the majority of memory. 1879 1893 */ 1880 - int soft_offline_page(struct page *page, int flags) 1894 + int soft_offline_page(unsigned long pfn, int flags) 1881 1895 { 1882 1896 int ret; 1883 - unsigned long pfn = page_to_pfn(page); 1897 + struct page *page; 1884 1898 1885 - if (is_zone_device_page(page)) { 1886 - pr_debug_ratelimited("soft_offline: %#lx page is device page\n", 1887 - pfn); 1888 - if (flags & MF_COUNT_INCREASED) 1889 - put_page(page); 1899 + if (!pfn_valid(pfn)) 1900 + return -ENXIO; 1901 + /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */ 1902 + page = pfn_to_online_page(pfn); 1903 + if (!page) 1890 1904 return -EIO; 1891 - } 1892 1905 1893 1906 if (PageHWPoison(page)) { 1894 1907 pr_info("soft offline: %#lx page already poisoned\n", pfn);
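The soft_offline_page() rework above moves the interface from `struct page *` to a pfn and rejects anything that is not an online page. A toy sketch of the new entry checks (the `page_online[]` map is made up; in the kernel the second check is `pfn_to_online_page()` returning NULL, which also excludes ZONE_DEVICE pages):

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the reworked soft_offline_page() entry checks: invalid pfns
 * fail with -ENXIO, pages that are not online fail with -EIO, and only then
 * does the real soft-offline work proceed. */
#define NR_PFNS 8

static const int page_online[NR_PFNS] = { 1, 1, 0, 1, 0, 0, 1, 1 };

static int pfn_valid(unsigned long pfn)
{
	return pfn < NR_PFNS;
}

static int soft_offline_checks(unsigned long pfn)
{
	if (!pfn_valid(pfn))
		return -ENXIO;
	if (!page_online[pfn])	/* pfn_to_online_page() returned NULL */
		return -EIO;
	return 0;		/* proceed with the actual soft-offline */
}
```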
+42 -14
mm/memory.c
··· 72 72 #include <linux/oom.h> 73 73 #include <linux/numa.h> 74 74 75 + #include <trace/events/kmem.h> 76 + 75 77 #include <asm/io.h> 76 78 #include <asm/mmu_context.h> 77 79 #include <asm/pgalloc.h> ··· 154 152 } 155 153 core_initcall(init_zero_pfn); 156 154 155 + void mm_trace_rss_stat(struct mm_struct *mm, int member, long count) 156 + { 157 + trace_rss_stat(mm, member, count); 158 + } 157 159 158 160 #if defined(SPLIT_RSS_COUNTING) 159 161 ··· 2295 2289 * 2296 2290 * The function expects the page to be locked and unlocks it. 2297 2291 */ 2298 - static void fault_dirty_shared_page(struct vm_area_struct *vma, 2299 - struct page *page) 2292 + static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) 2300 2293 { 2294 + struct vm_area_struct *vma = vmf->vma; 2301 2295 struct address_space *mapping; 2296 + struct page *page = vmf->page; 2302 2297 bool dirtied; 2303 2298 bool page_mkwrite = vma->vm_ops && vma->vm_ops->page_mkwrite; 2304 2299 ··· 2314 2307 mapping = page_rmapping(page); 2315 2308 unlock_page(page); 2316 2309 2317 - if ((dirtied || page_mkwrite) && mapping) { 2318 - /* 2319 - * Some device drivers do not set page.mapping 2320 - * but still dirty their pages 2321 - */ 2322 - balance_dirty_pages_ratelimited(mapping); 2323 - } 2324 - 2325 2310 if (!page_mkwrite) 2326 2311 file_update_time(vma->vm_file); 2312 + 2313 + /* 2314 + * Throttle page dirtying rate down to writeback speed. 2315 + * 2316 + * mapping may be NULL here because some device drivers do not 2317 + * set page.mapping but still dirty their pages 2318 + * 2319 + * Drop the mmap_sem before waiting on IO, if we can. The file 2320 + * is pinning the mapping, as per above. 
2321 + */ 2322 + if ((dirtied || page_mkwrite) && mapping) { 2323 + struct file *fpin; 2324 + 2325 + fpin = maybe_unlock_mmap_for_io(vmf, NULL); 2326 + balance_dirty_pages_ratelimited(mapping); 2327 + if (fpin) { 2328 + fput(fpin); 2329 + return VM_FAULT_RETRY; 2330 + } 2331 + } 2332 + 2333 + return 0; 2327 2334 } 2328 2335 2329 2336 /* ··· 2592 2571 __releases(vmf->ptl) 2593 2572 { 2594 2573 struct vm_area_struct *vma = vmf->vma; 2574 + vm_fault_t ret = VM_FAULT_WRITE; 2595 2575 2596 2576 get_page(vmf->page); 2597 2577 ··· 2616 2594 wp_page_reuse(vmf); 2617 2595 lock_page(vmf->page); 2618 2596 } 2619 - fault_dirty_shared_page(vma, vmf->page); 2597 + ret |= fault_dirty_shared_page(vmf); 2620 2598 put_page(vmf->page); 2621 2599 2622 - return VM_FAULT_WRITE; 2600 + return ret; 2623 2601 } 2624 2602 2625 2603 /* ··· 3105 3083 3106 3084 /* 3107 3085 * The memory barrier inside __SetPageUptodate makes sure that 3108 - * preceeding stores to the page contents become visible before 3086 + * preceding stores to the page contents become visible before 3109 3087 * the set_pte_at() write. 3110 3088 */ 3111 3089 __SetPageUptodate(page); ··· 3663 3641 return ret; 3664 3642 } 3665 3643 3666 - fault_dirty_shared_page(vma, vmf->page); 3644 + ret |= fault_dirty_shared_page(vmf); 3667 3645 return ret; 3668 3646 } 3669 3647 ··· 4010 3988 vmf.pud = pud_alloc(mm, p4d, address); 4011 3989 if (!vmf.pud) 4012 3990 return VM_FAULT_OOM; 3991 + retry_pud: 4013 3992 if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) { 4014 3993 ret = create_huge_pud(&vmf); 4015 3994 if (!(ret & VM_FAULT_FALLBACK)) ··· 4037 4014 vmf.pmd = pmd_alloc(mm, vmf.pud, address); 4038 4015 if (!vmf.pmd) 4039 4016 return VM_FAULT_OOM; 4017 + 4018 + /* Huge pud page fault raced with pmd_alloc? 
*/ 4019 + if (pud_trans_unstable(vmf.pud)) 4020 + goto retry_pud; 4021 + 4040 4022 if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) { 4041 4023 ret = create_huge_pmd(&vmf); 4042 4024 if (!(ret & VM_FAULT_FALLBACK))
+56 -30
mm/memory_hotplug.c
··· 49 49 * and restore_online_page_callback() for generic callback restore. 50 50 */ 51 51 52 - static void generic_online_page(struct page *page, unsigned int order); 53 - 54 52 static online_page_callback_t online_page_callback = generic_online_page; 55 53 static DEFINE_MUTEX(online_page_callback_lock); 56 54 ··· 276 278 return 0; 277 279 } 278 280 281 + static int check_hotplug_memory_addressable(unsigned long pfn, 282 + unsigned long nr_pages) 283 + { 284 + const u64 max_addr = PFN_PHYS(pfn + nr_pages) - 1; 285 + 286 + if (max_addr >> MAX_PHYSMEM_BITS) { 287 + const u64 max_allowed = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1; 288 + WARN(1, 289 + "Hotplugged memory exceeds maximum addressable address, range=%#llx-%#llx, maximum=%#llx\n", 290 + (u64)PFN_PHYS(pfn), max_addr, max_allowed); 291 + return -E2BIG; 292 + } 293 + 294 + return 0; 295 + } 296 + 279 297 /* 280 298 * Reasonably generic function for adding memory. It is 281 299 * expected that archs that support memory hotplug will ··· 304 290 int err; 305 291 unsigned long nr, start_sec, end_sec; 306 292 struct vmem_altmap *altmap = restrictions->altmap; 293 + 294 + err = check_hotplug_memory_addressable(pfn, nr_pages); 295 + if (err) 296 + return err; 307 297 308 298 if (altmap) { 309 299 /* ··· 598 580 } 599 581 EXPORT_SYMBOL_GPL(restore_online_page_callback); 600 582 601 - void __online_page_set_limits(struct page *page) 602 - { 603 - } 604 - EXPORT_SYMBOL_GPL(__online_page_set_limits); 605 - 606 - void __online_page_increment_counters(struct page *page) 607 - { 608 - adjust_managed_page_count(page, 1); 609 - } 610 - EXPORT_SYMBOL_GPL(__online_page_increment_counters); 611 - 612 - void __online_page_free(struct page *page) 613 - { 614 - __free_reserved_page(page); 615 - } 616 - EXPORT_SYMBOL_GPL(__online_page_free); 617 - 618 - static void generic_online_page(struct page *page, unsigned int order) 583 + void generic_online_page(struct page *page, unsigned int order) 619 584 { 620 585 kernel_map_pages(page, 1 
<< order, 1); 621 586 __free_pages_core(page, order); ··· 608 607 totalhigh_pages_add(1UL << order); 609 608 #endif 610 609 } 610 + EXPORT_SYMBOL_GPL(generic_online_page); 611 611 612 612 static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages, 613 613 void *arg) ··· 1182 1180 if (!zone_spans_pfn(zone, pfn)) 1183 1181 return false; 1184 1182 1185 - return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, SKIP_HWPOISON); 1183 + return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, 1184 + MEMORY_OFFLINE); 1186 1185 } 1187 1186 1188 1187 /* Checks if this range of memory is likely to be hot-removable. */ ··· 1380 1377 return ret; 1381 1378 } 1382 1379 1383 - /* 1384 - * remove from free_area[] and mark all as Reserved. 1385 - */ 1380 + /* Mark all sections offline and remove all free pages from the buddy. */ 1386 1381 static int 1387 1382 offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages, 1388 1383 void *data) ··· 1398 1397 check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages, 1399 1398 void *data) 1400 1399 { 1401 - return test_pages_isolated(start_pfn, start_pfn + nr_pages, true); 1400 + return test_pages_isolated(start_pfn, start_pfn + nr_pages, 1401 + MEMORY_OFFLINE); 1402 1402 } 1403 1403 1404 1404 static int __init cmdline_parse_movable_node(char *p) ··· 1480 1478 node_clear_state(node, N_MEMORY); 1481 1479 } 1482 1480 1481 + static int count_system_ram_pages_cb(unsigned long start_pfn, 1482 + unsigned long nr_pages, void *data) 1483 + { 1484 + unsigned long *nr_system_ram_pages = data; 1485 + 1486 + *nr_system_ram_pages += nr_pages; 1487 + return 0; 1488 + } 1489 + 1483 1490 static int __ref __offline_pages(unsigned long start_pfn, 1484 1491 unsigned long end_pfn) 1485 1492 { 1486 - unsigned long pfn, nr_pages; 1493 + unsigned long pfn, nr_pages = 0; 1487 1494 unsigned long offlined_pages = 0; 1488 1495 int ret, node, nr_isolate_pageblock; 1489 1496 unsigned long flags; ··· 1502 1491 char 
*reason; 1503 1492 1504 1493 mem_hotplug_begin(); 1494 + 1495 + /* 1496 + * Don't allow to offline memory blocks that contain holes. 1497 + * Consequently, memory blocks with holes can never get onlined 1498 + * via the hotplug path - online_pages() - as hotplugged memory has 1499 + * no holes. This way, we e.g., don't have to worry about marking 1500 + * memory holes PG_reserved, don't need pfn_valid() checks, and can 1501 + * avoid using walk_system_ram_range() later. 1502 + */ 1503 + walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages, 1504 + count_system_ram_pages_cb); 1505 + if (nr_pages != end_pfn - start_pfn) { 1506 + ret = -EINVAL; 1507 + reason = "memory holes"; 1508 + goto failed_removal; 1509 + } 1505 1510 1506 1511 /* This makes hotplug much easier...and readable. 1507 1512 we assume this for now. .*/ ··· 1530 1503 1531 1504 zone = page_zone(pfn_to_page(valid_start)); 1532 1505 node = zone_to_nid(zone); 1533 - nr_pages = end_pfn - start_pfn; 1534 1506 1535 1507 /* set above range as isolated */ 1536 1508 ret = start_isolate_page_range(start_pfn, end_pfn, 1537 1509 MIGRATE_MOVABLE, 1538 - SKIP_HWPOISON | REPORT_FAILURE); 1510 + MEMORY_OFFLINE | REPORT_FAILURE); 1539 1511 if (ret < 0) { 1540 1512 reason = "failure to isolate range"; 1541 1513 goto failed_removal; ··· 1776 1750 1777 1751 /* remove memmap entry */ 1778 1752 firmware_map_remove(start, start + size, "System RAM"); 1779 - memblock_free(start, size); 1780 - memblock_remove(start, size); 1781 1753 1782 1754 /* remove memory block devices before removing memory */ 1783 1755 remove_memory_block_devices(start, size); 1784 1756 1785 1757 arch_remove_memory(nid, start, size, NULL); 1758 + memblock_free(start, size); 1759 + memblock_remove(start, size); 1786 1760 __release_memory_resource(start, size); 1787 1761 1788 1762 try_offline_node(nid);
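The new `check_hotplug_memory_addressable()` above rejects a hotplug range whose last byte lies beyond the maximum addressable physical address. A user-space model of the arithmetic (the `PAGE_SHIFT`/`MAX_PHYSMEM_BITS` values are illustrative, x86-64-like, not taken from any particular config):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* User-space model of check_hotplug_memory_addressable(): compute the last
 * byte of the range and reject it with -E2BIG if any bit at or above
 * MAX_PHYSMEM_BITS is set. */
#define PAGE_SHIFT 12
#define MAX_PHYSMEM_BITS 46

static int hotplug_addressable(uint64_t pfn, uint64_t nr_pages)
{
	uint64_t max_addr = ((pfn + nr_pages) << PAGE_SHIFT) - 1;

	return (max_addr >> MAX_PHYSMEM_BITS) ? -E2BIG : 0;
}
```

With these values the highest acceptable single-page range ends at pfn `(1 << (MAX_PHYSMEM_BITS - PAGE_SHIFT)) - 1`; one pfn higher trips the check.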
+31 -16
mm/mempolicy.c
··· 410 410 struct list_head *pagelist; 411 411 unsigned long flags; 412 412 nodemask_t *nmask; 413 - struct vm_area_struct *prev; 413 + unsigned long start; 414 + unsigned long end; 415 + struct vm_area_struct *first; 414 416 }; 415 417 416 418 /* ··· 620 618 unsigned long endvma = vma->vm_end; 621 619 unsigned long flags = qp->flags; 622 620 621 + /* range check first */ 622 + VM_BUG_ON((vma->vm_start > start) || (vma->vm_end < end)); 623 + 624 + if (!qp->first) { 625 + qp->first = vma; 626 + if (!(flags & MPOL_MF_DISCONTIG_OK) && 627 + (qp->start < vma->vm_start)) 628 + /* hole at head side of range */ 629 + return -EFAULT; 630 + } 631 + if (!(flags & MPOL_MF_DISCONTIG_OK) && 632 + ((vma->vm_end < qp->end) && 633 + (!vma->vm_next || vma->vm_end < vma->vm_next->vm_start))) 634 + /* hole at middle or tail of range */ 635 + return -EFAULT; 636 + 623 637 /* 624 638 * Need check MPOL_MF_STRICT to return -EIO if possible 625 639 * regardless of vma_migratable ··· 646 628 647 629 if (endvma > end) 648 630 endvma = end; 649 - if (vma->vm_start > start) 650 - start = vma->vm_start; 651 - 652 - if (!(flags & MPOL_MF_DISCONTIG_OK)) { 653 - if (!vma->vm_next && vma->vm_end < end) 654 - return -EFAULT; 655 - if (qp->prev && qp->prev->vm_end < vma->vm_start) 656 - return -EFAULT; 657 - } 658 - 659 - qp->prev = vma; 660 631 661 632 if (flags & MPOL_MF_LAZY) { 662 633 /* Similar to task_numa_work, skip inaccessible VMAs */ ··· 688 681 nodemask_t *nodes, unsigned long flags, 689 682 struct list_head *pagelist) 690 683 { 684 + int err; 691 685 struct queue_pages qp = { 692 686 .pagelist = pagelist, 693 687 .flags = flags, 694 688 .nmask = nodes, 695 - .prev = NULL, 689 + .start = start, 690 + .end = end, 691 + .first = NULL, 696 692 }; 697 693 698 - return walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp); 694 + err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp); 695 + 696 + if (!qp.first) 697 + /* whole range in hole */ 698 + err = -EFAULT; 699 + 700 + 
return err; 699 701 } 700 702 701 703 /* ··· 756 740 unsigned long vmend; 757 741 758 742 vma = find_vma(mm, start); 759 - if (!vma || vma->vm_start > start) 760 - return -EFAULT; 743 + VM_BUG_ON(!vma); 761 744 762 745 prev = vma->vm_prev; 763 746 if (start > vma->vm_start)
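The reworked queue_pages_test_walk() above detects holes at the head, middle, or tail of the queried range by comparing each VMA against qp->start/qp->end and its successor, and queue_pages_range() reports EFAULT when no VMA was seen at all. A minimal userspace sketch of the same coverage check, using a hypothetical sorted array of [start, end) ranges in place of the VMA list:

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long start, end; }; /* [start, end), sorted, non-overlapping */

/*
 * Return 0 if [qstart, qend) is fully covered by the ranges, -1 otherwise.
 * Mirrors the hole checks in queue_pages_test_walk(): a hole before the
 * first overlapping range, between two ranges, or after the last one all
 * fail the walk, and so does a query that overlaps nothing.
 */
int range_covered(const struct range *vmas, size_t n,
                  unsigned long qstart, unsigned long qend)
{
    size_t i;
    int first = 1;

    for (i = 0; i < n; i++) {
        if (vmas[i].end <= qstart || vmas[i].start >= qend)
            continue;               /* no overlap with the query */
        if (first) {
            if (qstart < vmas[i].start)
                return -1;          /* hole at head side of range */
            first = 0;
        }
        if (vmas[i].end < qend &&
            (i + 1 == n || vmas[i].end < vmas[i + 1].start))
            return -1;              /* hole at middle or tail of range */
    }
    return first ? -1 : 0;          /* whole range in hole */
}
```

The `first` flag plays the role of qp->first: only the first overlapping VMA is checked against the query start.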
+6 -10
mm/migrate.c
··· 1168 1168 enum migrate_reason reason) 1169 1169 { 1170 1170 int rc = MIGRATEPAGE_SUCCESS; 1171 - struct page *newpage; 1171 + struct page *newpage = NULL; 1172 1172 1173 1173 if (!thp_migration_supported() && PageTransHuge(page)) 1174 - return -ENOMEM; 1175 - 1176 - newpage = get_new_page(page, private); 1177 - if (!newpage) 1178 1174 return -ENOMEM; 1179 1175 1180 1176 if (page_count(page) == 1) { ··· 1183 1187 __ClearPageIsolated(page); 1184 1188 unlock_page(page); 1185 1189 } 1186 - if (put_new_page) 1187 - put_new_page(newpage, private); 1188 - else 1189 - put_page(newpage); 1190 1190 goto out; 1191 1191 } 1192 + 1193 + newpage = get_new_page(page, private); 1194 + if (!newpage) 1195 + return -ENOMEM; 1192 1196 1193 1197 rc = __unmap_and_move(page, newpage, force, mode); 1194 1198 if (rc == MIGRATEPAGE_SUCCESS) ··· 1859 1863 if (!zone_watermark_ok(zone, 0, 1860 1864 high_wmark_pages(zone) + 1861 1865 nr_migrate_pages, 1862 - 0, 0)) 1866 + ZONE_MOVABLE, 0)) 1863 1867 continue; 1864 1868 return true; 1865 1869 }
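The migrate.c hunk above moves the get_new_page() call after the "page_count(page) == 1" early exit, so a destination page is never allocated just to be handed straight back. The order-of-operations point in isolation, sketched with a hypothetical allocation counter:

```c
#include <assert.h>
#include <stdlib.h>

static int allocs; /* counts hypothetical destination-page allocations */

static void *get_dst(void) { allocs++; return malloc(1); }

/*
 * Sketch of unmap_and_move()'s new structure: a page whose refcount has
 * already dropped to 1 takes the early exit before any destination is
 * allocated, so only pages that really migrate pay for get_dst().
 */
int migrate_one(int refcount)
{
    void *dst;

    if (refcount == 1)
        return 0;        /* page was freed under us: nothing to do */

    dst = get_dst();     /* allocate only once migration will proceed */
    if (!dst)
        return -1;       /* -ENOMEM in the kernel version */
    free(dst);
    return 0;
}
```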
+21 -42
mm/mmap.c
··· 641 641 struct vm_area_struct *prev, struct rb_node **rb_link, 642 642 struct rb_node *rb_parent) 643 643 { 644 - __vma_link_list(mm, vma, prev, rb_parent); 644 + __vma_link_list(mm, vma, prev); 645 645 __vma_link_rb(mm, vma, rb_link, rb_parent); 646 646 } 647 647 ··· 684 684 685 685 static __always_inline void __vma_unlink_common(struct mm_struct *mm, 686 686 struct vm_area_struct *vma, 687 - struct vm_area_struct *prev, 688 - bool has_prev, 689 687 struct vm_area_struct *ignore) 690 688 { 691 - struct vm_area_struct *next; 692 - 693 689 vma_rb_erase_ignore(vma, &mm->mm_rb, ignore); 694 - next = vma->vm_next; 695 - if (has_prev) 696 - prev->vm_next = next; 697 - else { 698 - prev = vma->vm_prev; 699 - if (prev) 700 - prev->vm_next = next; 701 - else 702 - mm->mmap = next; 703 - } 704 - if (next) 705 - next->vm_prev = prev; 706 - 690 + __vma_unlink_list(mm, vma); 707 691 /* Kill the cache */ 708 692 vmacache_invalidate(mm); 709 - } 710 - 711 - static inline void __vma_unlink_prev(struct mm_struct *mm, 712 - struct vm_area_struct *vma, 713 - struct vm_area_struct *prev) 714 - { 715 - __vma_unlink_common(mm, vma, prev, true, vma); 716 693 } 717 694 718 695 /* ··· 746 769 remove_next = 1 + (end > next->vm_end); 747 770 VM_WARN_ON(remove_next == 2 && 748 771 end != next->vm_next->vm_end); 749 - VM_WARN_ON(remove_next == 1 && 750 - end != next->vm_end); 751 772 /* trim end to next, for case 6 first pass */ 752 773 end = next->vm_end; 753 774 } ··· 864 889 * us to remove next before dropping the locks. 865 890 */ 866 891 if (remove_next != 3) 867 - __vma_unlink_prev(mm, next, vma); 892 + __vma_unlink_common(mm, next, next); 868 893 else 869 894 /* 870 895 * vma is not before next if they've been ··· 875 900 * "next" (which is stored in post-swap() 876 901 * "vma"). 
877 902 */ 878 - __vma_unlink_common(mm, next, NULL, false, vma); 903 + __vma_unlink_common(mm, next, vma); 879 904 if (file) 880 905 __remove_shared_vm_struct(next, file, mapping); 881 906 } else if (insert) { ··· 1091 1116 * the area passed down from mprotect_fixup, never extending beyond one 1092 1117 * vma, PPPPPP is the prev vma specified, and NNNNNN the next vma after: 1093 1118 * 1094 - * AAAA AAAA AAAA AAAA 1095 - * PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPNNNNNN PPPPNNNNXXXX 1096 - * cannot merge might become might become might become 1097 - * PPNNNNNNNNNN PPPPPPPPPPNN PPPPPPPPPPPP 6 or 1098 - * mmap, brk or case 4 below case 5 below PPPPPPPPXXXX 7 or 1099 - * mremap move: PPPPXXXXXXXX 8 1100 - * AAAA 1101 - * PPPP NNNN PPPPPPPPPPPP PPPPPPPPNNNN PPPPNNNNNNNN 1102 - * might become case 1 below case 2 below case 3 below 1119 + * AAAA AAAA AAAA 1120 + * PPPPPPNNNNNN PPPPPPNNNNNN PPPPPPNNNNNN 1121 + * cannot merge might become might become 1122 + * PPNNNNNNNNNN PPPPPPPPPPNN 1123 + * mmap, brk or case 4 below case 5 below 1124 + * mremap move: 1125 + * AAAA AAAA 1126 + * PPPP NNNN PPPPNNNNXXXX 1127 + * might become might become 1128 + * PPPPPPPPPPPP 1 or PPPPPPPPPPPP 6 or 1129 + * PPPPPPPPNNNN 2 or PPPPPPPPXXXX 7 or 1130 + * PPPPNNNNNNNN 3 PPPPXXXXXXXX 8 1103 1131 * 1104 1132 * It is important for case 8 that the vma NNNN overlapping the 1105 1133 * region AAAA is never going to extended over XXXX. Instead XXXX must ··· 1420 1442 * that it represents a valid section of the address space. 1421 1443 */ 1422 1444 addr = get_unmapped_area(file, addr, len, pgoff, flags); 1423 - if (offset_in_page(addr)) 1445 + if (IS_ERR_VALUE(addr)) 1424 1446 return addr; 1425 1447 1426 1448 if (flags & MAP_FIXED_NOREPLACE) { ··· 2984 3006 struct rb_node **rb_link, *rb_parent; 2985 3007 pgoff_t pgoff = addr >> PAGE_SHIFT; 2986 3008 int error; 3009 + unsigned long mapped_addr; 2987 3010 2988 3011 /* Until we need other flags, refuse anything except VM_EXEC. 
*/ 2989 3012 if ((flags & (~VM_EXEC)) != 0) 2990 3013 return -EINVAL; 2991 3014 flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags; 2992 3015 2993 - error = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED); 2994 - if (offset_in_page(error)) 2995 - return error; 3016 + mapped_addr = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED); 3017 + if (IS_ERR_VALUE(mapped_addr)) 3018 + return mapped_addr; 2996 3019 2997 3020 error = mlock_future_check(mm, mm->def_flags, len); 2998 3021 if (error)
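Several hunks here (mm/mmap.c above, and mm/mremap.c below) replace `offset_in_page(addr)` with `IS_ERR_VALUE(addr)` when checking a get_unmapped_area() result. Both predicates happen to reject -errno values, which live unaligned in the top page of the address space, but IS_ERR_VALUE states the intent directly. A userspace sketch with simplified copies of the two macros:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define MAX_ERRNO 4095

/* Simplified from include/linux/err.h: error codes occupy the top page. */
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Simplified from include/linux/mm.h. */
#define offset_in_page(p) ((unsigned long)(p) & (PAGE_SIZE - 1))

/* get_unmapped_area() returns a page-aligned address on success or
 * -errno cast to unsigned long on failure; this is the check the
 * hunks above switch to. */
static int mapping_failed(unsigned long addr)
{
    return IS_ERR_VALUE(addr);
}
```

Note that -ENOMEM (== -12) fails both tests, which is why the old offset_in_page() check worked at all; IS_ERR_VALUE simply says what is being tested.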
+4 -4
mm/mprotect.c
··· 80 80 if (prot_numa) { 81 81 struct page *page; 82 82 83 + /* Avoid TLB flush if possible */ 84 + if (pte_protnone(oldpte)) 85 + continue; 86 + 83 87 page = vm_normal_page(vma, addr, oldpte); 84 88 if (!page || PageKsm(page)) 85 89 continue; ··· 99 95 * context. 100 96 */ 101 97 if (page_is_file_cache(page) && PageDirty(page)) 102 - continue; 103 - 104 - /* Avoid TLB flush if possible */ 105 - if (pte_protnone(oldpte)) 106 98 continue; 107 99 108 100 /*
+2 -2
mm/mremap.c
··· 558 558 ret = get_unmapped_area(vma->vm_file, new_addr, new_len, vma->vm_pgoff + 559 559 ((addr - vma->vm_start) >> PAGE_SHIFT), 560 560 map_flags); 561 - if (offset_in_page(ret)) 561 + if (IS_ERR_VALUE(ret)) 562 562 goto out1; 563 563 564 564 ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf, ··· 706 706 vma->vm_pgoff + 707 707 ((addr - vma->vm_start) >> PAGE_SHIFT), 708 708 map_flags); 709 - if (offset_in_page(new_addr)) { 709 + if (IS_ERR_VALUE(new_addr)) { 710 710 ret = new_addr; 711 711 goto out; 712 712 }
+2 -8
mm/nommu.c
··· 648 648 if (rb_prev) 649 649 prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb); 650 650 651 - __vma_link_list(mm, vma, prev, parent); 651 + __vma_link_list(mm, vma, prev); 652 652 } 653 653 654 654 /* ··· 684 684 /* remove from the MM's tree and list */ 685 685 rb_erase(&vma->vm_rb, &mm->mm_rb); 686 686 687 - if (vma->vm_prev) 688 - vma->vm_prev->vm_next = vma->vm_next; 689 - else 690 - mm->mmap = vma->vm_next; 691 - 692 - if (vma->vm_next) 693 - vma->vm_next->vm_prev = vma->vm_prev; 687 + __vma_unlink_list(mm, vma); 694 688 } 695 689 696 690 /*
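Both mm/mmap.c and mm/nommu.c above now delegate their open-coded next/prev pointer surgery to a shared __vma_unlink_list() helper. A generic sketch of that helper's shape, over hypothetical minimal versions of the vma and mm structures:

```c
#include <assert.h>
#include <stddef.h>

struct vma {
    struct vma *vm_next, *vm_prev;
};

struct mm {
    struct vma *mmap;    /* first vma in the list, as in struct mm_struct */
};

/* Shape of __vma_unlink_list(): fix up prev/next and the list head. */
void vma_unlink_list(struct mm *mm, struct vma *vma)
{
    struct vma *prev = vma->vm_prev, *next = vma->vm_next;

    if (prev)
        prev->vm_next = next;
    else
        mm->mmap = next;     /* removing the first vma */
    if (next)
        next->vm_prev = prev;
}

/* Build a->b->c, unlink the middle node, and report whether the
 * remaining links are consistent. */
int unlink_middle_ok(void)
{
    struct vma a = {0}, b = {0}, c = {0};
    struct mm mm = { &a };

    a.vm_next = &b; b.vm_prev = &a;
    b.vm_next = &c; c.vm_prev = &b;

    vma_unlink_list(&mm, &b);
    return mm.mmap == &a && a.vm_next == &c && c.vm_prev == &a;
}
```

Centralizing this removes the duplicated (and previously divergent) unlink logic from both call sites.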
+119 -18
mm/page_alloc.c
··· 5354 5354 " min:%lukB" 5355 5355 " low:%lukB" 5356 5356 " high:%lukB" 5357 + " reserved_highatomic:%luKB" 5357 5358 " active_anon:%lukB" 5358 5359 " inactive_anon:%lukB" 5359 5360 " active_file:%lukB" ··· 5376 5375 K(min_wmark_pages(zone)), 5377 5376 K(low_wmark_pages(zone)), 5378 5377 K(high_wmark_pages(zone)), 5378 + K(zone->nr_reserved_highatomic), 5379 5379 K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)), 5380 5380 K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)), 5381 5381 K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)), ··· 6713 6711 6714 6712 pgdat_page_ext_init(pgdat); 6715 6713 spin_lock_init(&pgdat->lru_lock); 6716 - lruvec_init(node_lruvec(pgdat)); 6714 + lruvec_init(&pgdat->__lruvec); 6717 6715 } 6718 6716 6719 6717 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid, ··· 7990 7988 return 0; 7991 7989 } 7992 7990 7991 + static void __zone_pcp_update(struct zone *zone) 7992 + { 7993 + unsigned int cpu; 7994 + 7995 + for_each_possible_cpu(cpu) 7996 + pageset_set_high_and_batch(zone, 7997 + per_cpu_ptr(zone->pageset, cpu)); 7998 + } 7999 + 7993 8000 /* 7994 8001 * percpu_pagelist_fraction - changes the pcp->high for each zone on each 7995 8002 * cpu. It is the fraction of total pages in each zone that a hot per cpu ··· 8030 8019 if (percpu_pagelist_fraction == old_percpu_pagelist_fraction) 8031 8020 goto out; 8032 8021 8033 - for_each_populated_zone(zone) { 8034 - unsigned int cpu; 8035 - 8036 - for_each_possible_cpu(cpu) 8037 - pageset_set_high_and_batch(zone, 8038 - per_cpu_ptr(zone->pageset, cpu)); 8039 - } 8022 + for_each_populated_zone(zone) 8023 + __zone_pcp_update(zone); 8040 8024 out: 8041 8025 mutex_unlock(&pcp_batch_high_lock); 8042 8026 return ret; ··· 8267 8261 * The HWPoisoned page may be not in buddy system, and 8268 8262 * page_count() is not 0. 
8269 8263 */ 8270 - if ((flags & SKIP_HWPOISON) && PageHWPoison(page)) 8264 + if ((flags & MEMORY_OFFLINE) && PageHWPoison(page)) 8271 8265 continue; 8272 8266 8273 8267 if (__PageMovable(page)) ··· 8483 8477 } 8484 8478 8485 8479 /* Make sure the range is really isolated. */ 8486 - if (test_pages_isolated(outer_start, end, false)) { 8480 + if (test_pages_isolated(outer_start, end, 0)) { 8487 8481 pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n", 8488 8482 __func__, outer_start, end); 8489 8483 ret = -EBUSY; ··· 8508 8502 pfn_max_align_up(end), migratetype); 8509 8503 return ret; 8510 8504 } 8505 + 8506 + static int __alloc_contig_pages(unsigned long start_pfn, 8507 + unsigned long nr_pages, gfp_t gfp_mask) 8508 + { 8509 + unsigned long end_pfn = start_pfn + nr_pages; 8510 + 8511 + return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE, 8512 + gfp_mask); 8513 + } 8514 + 8515 + static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, 8516 + unsigned long nr_pages) 8517 + { 8518 + unsigned long i, end_pfn = start_pfn + nr_pages; 8519 + struct page *page; 8520 + 8521 + for (i = start_pfn; i < end_pfn; i++) { 8522 + page = pfn_to_online_page(i); 8523 + if (!page) 8524 + return false; 8525 + 8526 + if (page_zone(page) != z) 8527 + return false; 8528 + 8529 + if (PageReserved(page)) 8530 + return false; 8531 + 8532 + if (page_count(page) > 0) 8533 + return false; 8534 + 8535 + if (PageHuge(page)) 8536 + return false; 8537 + } 8538 + return true; 8539 + } 8540 + 8541 + static bool zone_spans_last_pfn(const struct zone *zone, 8542 + unsigned long start_pfn, unsigned long nr_pages) 8543 + { 8544 + unsigned long last_pfn = start_pfn + nr_pages - 1; 8545 + 8546 + return zone_spans_pfn(zone, last_pfn); 8547 + } 8548 + 8549 + /** 8550 + * alloc_contig_pages() -- tries to find and allocate contiguous range of pages 8551 + * @nr_pages: Number of contiguous pages to allocate 8552 + * @gfp_mask: GFP mask to limit search and used during compaction 8553 + * 
@nid: Target node 8554 + * @nodemask: Mask for other possible nodes 8555 + * 8556 + * This routine is a wrapper around alloc_contig_range(). It scans over zones 8557 + * on an applicable zonelist to find a contiguous pfn range which can then be 8558 + * tried for allocation with alloc_contig_range(). This routine is intended 8559 + * for allocation requests which can not be fulfilled with the buddy allocator. 8560 + * 8561 + * The allocated memory is always aligned to a page boundary. If nr_pages is a 8562 + * power of two then the alignment is guaranteed to be to the given nr_pages 8563 + * (e.g. 1GB request would be aligned to 1GB). 8564 + * 8565 + * Allocated pages can be freed with free_contig_range() or by manually calling 8566 + * __free_page() on each allocated page. 8567 + * 8568 + * Return: pointer to contiguous pages on success, or NULL if not successful. 8569 + */ 8570 + struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask, 8571 + int nid, nodemask_t *nodemask) 8572 + { 8573 + unsigned long ret, pfn, flags; 8574 + struct zonelist *zonelist; 8575 + struct zone *zone; 8576 + struct zoneref *z; 8577 + 8578 + zonelist = node_zonelist(nid, gfp_mask); 8579 + for_each_zone_zonelist_nodemask(zone, z, zonelist, 8580 + gfp_zone(gfp_mask), nodemask) { 8581 + spin_lock_irqsave(&zone->lock, flags); 8582 + 8583 + pfn = ALIGN(zone->zone_start_pfn, nr_pages); 8584 + while (zone_spans_last_pfn(zone, pfn, nr_pages)) { 8585 + if (pfn_range_valid_contig(zone, pfn, nr_pages)) { 8586 + /* 8587 + * We release the zone lock here because 8588 + * alloc_contig_range() will also lock the zone 8589 + * at some point. If there's an allocation 8590 + * spinning on this lock, it may win the race 8591 + * and cause alloc_contig_range() to fail... 
8592 + */ 8593 + spin_unlock_irqrestore(&zone->lock, flags); 8594 + ret = __alloc_contig_pages(pfn, nr_pages, 8595 + gfp_mask); 8596 + if (!ret) 8597 + return pfn_to_page(pfn); 8598 + spin_lock_irqsave(&zone->lock, flags); 8599 + } 8600 + pfn += nr_pages; 8601 + } 8602 + spin_unlock_irqrestore(&zone->lock, flags); 8603 + } 8604 + return NULL; 8605 + } 8511 8606 #endif /* CONFIG_CONTIG_ALLOC */ 8512 8607 8513 8608 void free_contig_range(unsigned long pfn, unsigned int nr_pages) ··· 8630 8523 */ 8631 8524 void __meminit zone_pcp_update(struct zone *zone) 8632 8525 { 8633 - unsigned cpu; 8634 8526 mutex_lock(&pcp_batch_high_lock); 8635 - for_each_possible_cpu(cpu) 8636 - pageset_set_high_and_batch(zone, 8637 - per_cpu_ptr(zone->pageset, cpu)); 8527 + __zone_pcp_update(zone); 8638 8528 mutex_unlock(&pcp_batch_high_lock); 8639 8529 } 8640 8530 ··· 8664 8560 { 8665 8561 struct page *page; 8666 8562 struct zone *zone; 8667 - unsigned int order, i; 8563 + unsigned int order; 8668 8564 unsigned long pfn; 8669 8565 unsigned long flags; 8670 8566 unsigned long offlined_pages = 0; ··· 8692 8588 */ 8693 8589 if (unlikely(!PageBuddy(page) && PageHWPoison(page))) { 8694 8590 pfn++; 8695 - SetPageReserved(page); 8696 8591 offlined_pages++; 8697 8592 continue; 8698 8593 } ··· 8705 8602 pfn, 1 << order, end_pfn); 8706 8603 #endif 8707 8604 del_page_from_free_area(page, &zone->free_area[order]); 8708 - for (i = 0; i < (1 << order); i++) 8709 - SetPageReserved((page+i)); 8710 8605 pfn += (1 << order); 8711 8606 } 8712 8607 spin_unlock_irqrestore(&zone->lock, flags);
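The new alloc_contig_pages() above scans each zone in nr_pages-aligned windows, testing the cheap pfn_range_valid_contig() on each window before trying the expensive alloc_contig_range(). The scan itself is a small self-contained pattern; a sketch with a hypothetical validity predicate standing in for the page checks:

```c
#include <assert.h>

/* Round pfn up to the next multiple of align (general round-up;
 * the kernel's ALIGN() is the mask-based power-of-two form). */
static unsigned long round_up_to(unsigned long x, unsigned long align)
{
    return ((x + align - 1) / align) * align;
}

/*
 * Scan [zone_start, zone_end) in nr-sized aligned windows and return the
 * first window whose pfns all pass valid(), or (unsigned long)-1 if none
 * does. valid() stands in for pfn_range_valid_contig().
 */
unsigned long find_contig_window(unsigned long zone_start,
                                 unsigned long zone_end,
                                 unsigned long nr,
                                 int (*valid)(unsigned long pfn))
{
    unsigned long pfn, i;

    for (pfn = round_up_to(zone_start, nr); pfn + nr <= zone_end; pfn += nr) {
        for (i = 0; i < nr; i++)
            if (!valid(pfn + i))
                break;
        if (i == nr)
            return pfn;   /* all nr pfns usable: try allocation here */
    }
    return (unsigned long)-1;
}

/* Example predicate: only pfns 32..95 are free of busy/reserved pages. */
static int demo_valid(unsigned long pfn)
{
    return pfn >= 32 && pfn < 96;
}
```

The aligned stepping is also what gives the documented guarantee that a power-of-two nr_pages request comes back naturally aligned.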
+13 -2
mm/page_io.c
··· 22 22 #include <linux/writeback.h> 23 23 #include <linux/frontswap.h> 24 24 #include <linux/blkdev.h> 25 + #include <linux/psi.h> 25 26 #include <linux/uio.h> 26 27 #include <linux/sched/task.h> 27 28 #include <asm/pgtable.h> ··· 355 354 struct swap_info_struct *sis = page_swap_info(page); 356 355 blk_qc_t qc; 357 356 struct gendisk *disk; 357 + unsigned long pflags; 358 358 359 359 VM_BUG_ON_PAGE(!PageSwapCache(page) && !synchronous, page); 360 360 VM_BUG_ON_PAGE(!PageLocked(page), page); 361 361 VM_BUG_ON_PAGE(PageUptodate(page), page); 362 + 363 + /* 364 + * Count submission time as memory stall. When the device is congested, 365 + * or the submitting cgroup IO-throttled, submission can be a 366 + * significant part of overall IO time. 367 + */ 368 + psi_memstall_enter(&pflags); 369 + 362 370 if (frontswap_load(page) == 0) { 363 371 SetPageUptodate(page); 364 372 unlock_page(page); ··· 381 371 ret = mapping->a_ops->readpage(swap_file, page); 382 372 if (!ret) 383 373 count_vm_event(PSWPIN); 384 - return ret; 374 + goto out; 385 375 } 386 376 387 377 ret = bdev_read_page(sis->bdev, swap_page_sector(page), page); ··· 392 382 } 393 383 394 384 count_vm_event(PSWPIN); 395 - return 0; 385 + goto out; 396 386 } 397 387 398 388 ret = 0; ··· 428 418 bio_put(bio); 429 419 430 420 out: 421 + psi_memstall_leave(&pflags); 431 422 return ret; 432 423 } 433 424
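The page_io.c hunk brackets swap_readpage() between psi_memstall_enter() and psi_memstall_leave(), which is why its early `return ret` statements become `goto out`: once a function takes paired enter/leave calls, every exit path has to funnel through the leave. The pattern in isolation, with a hypothetical depth counter standing in for the psi calls:

```c
#include <assert.h>

static int stall_depth; /* would be per-task psi state in the kernel */

static void memstall_enter(void) { stall_depth++; }
static void memstall_leave(void) { stall_depth--; }

/* Every path, including both early-exit cases, goes through "out:" so
 * enter/leave stay balanced - the reason the hunk rewrites the early
 * returns as gotos. */
int do_read(int frontswap_hit, int bdev_ok)
{
    int ret = 0;

    memstall_enter();

    if (frontswap_hit)
        goto out;           /* was "return ret" before the conversion */

    if (bdev_ok)
        goto out;           /* was "return 0" before the conversion */

    ret = -5;               /* -EIO: falls through to the common exit */
out:
    memstall_leave();
    return ret;
}
```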
+6 -6
mm/page_isolation.c
··· 168 168 * @migratetype: Migrate type to set in error recovery. 169 169 * @flags: The following flags are allowed (they can be combined in 170 170 * a bit mask) 171 - * SKIP_HWPOISON - ignore hwpoison pages 171 + * MEMORY_OFFLINE - isolate to offline (!allocate) memory 172 + * e.g., skip over PageHWPoison() pages 172 173 * REPORT_FAILURE - report details about the failure to 173 174 * isolate the range 174 175 * ··· 258 257 */ 259 258 static unsigned long 260 259 __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn, 261 - bool skip_hwpoisoned_pages) 260 + int flags) 262 261 { 263 262 struct page *page; 264 263 ··· 275 274 * simple way to verify that as VM_BUG_ON(), though. 276 275 */ 277 276 pfn += 1 << page_order(page); 278 - else if (skip_hwpoisoned_pages && PageHWPoison(page)) 277 + else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page)) 279 278 /* A HWPoisoned page cannot be also PageBuddy */ 280 279 pfn++; 281 280 else ··· 287 286 288 287 /* Caller should ensure that requested range is in a single zone */ 289 288 int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, 290 - bool skip_hwpoisoned_pages) 289 + int isol_flags) 291 290 { 292 291 unsigned long pfn, flags; 293 292 struct page *page; ··· 309 308 /* Check all pages are free or marked as ISOLATED */ 310 309 zone = page_zone(page); 311 310 spin_lock_irqsave(&zone->lock, flags); 312 - pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn, 313 - skip_hwpoisoned_pages); 311 + pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn, isol_flags); 314 312 spin_unlock_irqrestore(&zone->lock, flags); 315 313 316 314 trace_test_pages_isolated(start_pfn, end_pfn, pfn);
+9
mm/pgtable-generic.c
··· 24 24 pgd_clear(pgd); 25 25 } 26 26 27 + #ifndef __PAGETABLE_P4D_FOLDED 27 28 void p4d_clear_bad(p4d_t *p4d) 28 29 { 29 30 p4d_ERROR(*p4d); 30 31 p4d_clear(p4d); 31 32 } 33 + #endif 32 34 35 + #ifndef __PAGETABLE_PUD_FOLDED 33 36 void pud_clear_bad(pud_t *pud) 34 37 { 35 38 pud_ERROR(*pud); 36 39 pud_clear(pud); 37 40 } 41 + #endif 38 42 43 + /* 44 + * Note that the pmd variant below can't be stub'ed out just as for p4d/pud 45 + * above. pmd folding is special and typically pmd_* macros refer to upper 46 + * level even when folded 47 + */ 39 48 void pmd_clear_bad(pmd_t *pmd) 40 49 { 41 50 pmd_ERROR(*pmd);
+45 -20
mm/rmap.c
··· 251 251 * Attach the anon_vmas from src to dst. 252 252 * Returns 0 on success, -ENOMEM on failure. 253 253 * 254 - * If dst->anon_vma is NULL this function tries to find and reuse existing 255 - * anon_vma which has no vmas and only one child anon_vma. This prevents 256 - * degradation of anon_vma hierarchy to endless linear chain in case of 257 - * constantly forking task. On the other hand, an anon_vma with more than one 258 - * child isn't reused even if there was no alive vma, thus rmap walker has a 259 - * good chance of avoiding scanning the whole hierarchy when it searches where 260 - * page is mapped. 254 + * anon_vma_clone() is called by __vma_split(), __split_vma(), copy_vma() and 255 + * anon_vma_fork(). The first three want an exact copy of src, while the last 256 + * one, anon_vma_fork(), may try to reuse an existing anon_vma to prevent 257 + * endless growth of anon_vma. Since dst->anon_vma is set to NULL before call, 258 + * we can identify this case by checking (!dst->anon_vma && src->anon_vma). 259 + * 260 + * If (!dst->anon_vma && src->anon_vma) is true, this function tries to find 261 + * and reuse existing anon_vma which has no vmas and only one child anon_vma. 262 + * This prevents degradation of anon_vma hierarchy to endless linear chain in 263 + * case of constantly forking task. On the other hand, an anon_vma with more 264 + * than one child isn't reused even if there was no alive vma, thus rmap 265 + * walker has a good chance of avoiding scanning the whole hierarchy when it 266 + * searches where page is mapped. 261 267 */ 262 268 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) 263 269 { 264 270 struct anon_vma_chain *avc, *pavc; 265 271 struct anon_vma *root = NULL; 272 + struct vm_area_struct *prev = dst->vm_prev, *pprev = src->vm_prev; 273 + 274 + /* 275 + * If parent share anon_vma with its vm_prev, keep this sharing in in 276 + * child. 277 + * 278 + * 1. 
Parent has vm_prev, which implies we have vm_prev. 279 + * 2. Parent and its vm_prev have the same anon_vma. 280 + */ 281 + if (!dst->anon_vma && src->anon_vma && 282 + pprev && pprev->anon_vma == src->anon_vma) 283 + dst->anon_vma = prev->anon_vma; 284 + 266 285 267 286 list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { 268 287 struct anon_vma *anon_vma; ··· 306 287 * will always reuse it. Root anon_vma is never reused: 307 288 * it has self-parent reference and at least one child. 308 289 */ 309 - if (!dst->anon_vma && anon_vma != src->anon_vma && 310 - anon_vma->degree < 2) 290 + if (!dst->anon_vma && src->anon_vma && 291 + anon_vma != src->anon_vma && anon_vma->degree < 2) 311 292 dst->anon_vma = anon_vma; 312 293 } 313 294 if (dst->anon_vma) ··· 477 458 * chain and verify that the page in question is indeed mapped in it 478 459 * [ something equivalent to page_mapped_in_vma() ]. 479 460 * 480 - * Since anon_vma's slab is DESTROY_BY_RCU and we know from page_remove_rmap() 481 - * that the anon_vma pointer from page->mapping is valid if there is a 482 - * mapcount, we can dereference the anon_vma after observing those. 461 + * Since anon_vma's slab is SLAB_TYPESAFE_BY_RCU and we know from 462 + * page_remove_rmap() that the anon_vma pointer from page->mapping is valid 463 + * if there is a mapcount, we can dereference the anon_vma after observing 464 + * those. 483 465 */ 484 466 struct anon_vma *page_get_anon_vma(struct page *page) 485 467 { ··· 1075 1055 static void __page_check_anon_rmap(struct page *page, 1076 1056 struct vm_area_struct *vma, unsigned long address) 1077 1057 { 1078 - #ifdef CONFIG_DEBUG_VM 1079 1058 /* 1080 1059 * The page's anon-rmap details (mapping and index) are guaranteed to 1081 1060 * be set up correctly at this point. ··· 1087 1068 * are initially only visible via the pagetables, and the pte is locked 1088 1069 * over the call to page_add_new_anon_rmap. 
1089 1070 */ 1090 - BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root); 1091 - BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address)); 1092 - #endif 1071 + VM_BUG_ON_PAGE(page_anon_vma(page)->root != vma->anon_vma->root, page); 1072 + VM_BUG_ON_PAGE(page_to_pgoff(page) != linear_page_index(vma, address), 1073 + page); 1093 1074 } 1094 1075 1095 1076 /** ··· 1292 1273 if (TestClearPageDoubleMap(page)) { 1293 1274 /* 1294 1275 * Subpages can be mapped with PTEs too. Check how many of 1295 - * themi are still mapped. 1276 + * them are still mapped. 1296 1277 */ 1297 1278 for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) { 1298 1279 if (atomic_add_negative(-1, &page[i]._mapcount)) 1299 1280 nr++; 1300 1281 } 1282 + 1283 + /* 1284 + * Queue the page for deferred split if at least one small 1285 + * page of the compound page is unmapped, but at least one 1286 + * small page is still mapped. 1287 + */ 1288 + if (nr && nr < HPAGE_PMD_NR) 1289 + deferred_split_huge_page(page); 1301 1290 } else { 1302 1291 nr = HPAGE_PMD_NR; 1303 1292 } ··· 1313 1286 if (unlikely(PageMlocked(page))) 1314 1287 clear_page_mlock(page); 1315 1288 1316 - if (nr) { 1289 + if (nr) 1317 1290 __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr); 1318 - deferred_split_huge_page(page); 1319 - } 1320 1291 } 1321 1292 1322 1293 /**
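The rmap.c hunk moves deferred_split_huge_page() inside the PageDoubleMap branch and queues the split only when the compound page ends up partially mapped: at least one small page unmapped, at least one still mapped. A rough sketch of that decision over a hypothetical array of post-decrement subpage mapcounts (-1 meaning "no longer mapped", as after atomic_add_negative()):

```c
#include <assert.h>

#define HPAGE_NR 512 /* subpages per PMD-sized huge page on x86-64 */

/*
 * Decide whether to queue a compound page for deferred split: some
 * subpages are unmapped but not all of them. Fully-mapped and
 * fully-unmapped pages are left alone.
 */
int should_defer_split(const int *mapcount, int n)
{
    int i, nr_unmapped = 0;

    for (i = 0; i < n; i++)
        if (mapcount[i] < 0)
            nr_unmapped++;

    return nr_unmapped && nr_unmapped < n;
}
```

This is the `nr && nr < HPAGE_PMD_NR` condition in the hunk, abstracted away from the page flags and atomics.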
+17 -12
mm/shmem.c
··· 1369 1369 if (list_empty(&info->swaplist)) 1370 1370 list_add(&info->swaplist, &shmem_swaplist); 1371 1371 1372 - if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) { 1372 + if (add_to_swap_cache(page, swap, 1373 + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) { 1373 1374 spin_lock_irq(&info->lock); 1374 1375 shmem_recalc_inode(inode); 1375 1376 info->swapped++; ··· 2023 2022 shmem_falloc->waitq && 2024 2023 vmf->pgoff >= shmem_falloc->start && 2025 2024 vmf->pgoff < shmem_falloc->next) { 2025 + struct file *fpin; 2026 2026 wait_queue_head_t *shmem_falloc_waitq; 2027 2027 DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function); 2028 2028 2029 2029 ret = VM_FAULT_NOPAGE; 2030 - if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) && 2031 - !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) { 2032 - /* It's polite to up mmap_sem if we can */ 2033 - up_read(&vma->vm_mm->mmap_sem); 2030 + fpin = maybe_unlock_mmap_for_io(vmf, NULL); 2031 + if (fpin) 2034 2032 ret = VM_FAULT_RETRY; 2035 - } 2036 2033 2037 2034 shmem_falloc_waitq = shmem_falloc->waitq; 2038 2035 prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, ··· 2048 2049 spin_lock(&inode->i_lock); 2049 2050 finish_wait(shmem_falloc_waitq, &shmem_fault_wait); 2050 2051 spin_unlock(&inode->i_lock); 2052 + 2053 + if (fpin) 2054 + fput(fpin); 2051 2055 return ret; 2052 2056 } 2053 2057 spin_unlock(&inode->i_lock); ··· 2215 2213 return -EPERM; 2216 2214 2217 2215 /* 2218 - * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED 2219 - * read-only mapping, take care to not allow mprotect to revert 2220 - * protections. 2216 + * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as 2217 + * MAP_SHARED and read-only, take care to not allow mprotect to 2218 + * revert protections on such mappings. Do this only for shared 2219 + * mappings. For private mappings, don't need to mask 2220 + * VM_MAYWRITE as we still want them to be COW-writable. 
2221 2221 */ 2222 - vma->vm_flags &= ~(VM_MAYWRITE); 2222 + if (vma->vm_flags & VM_SHARED) 2223 + vma->vm_flags &= ~(VM_MAYWRITE); 2223 2224 } 2224 2225 2225 2226 file_accessed(file); ··· 2747 2742 } 2748 2743 2749 2744 shmem_falloc.waitq = &shmem_falloc_waitq; 2750 - shmem_falloc.start = unmap_start >> PAGE_SHIFT; 2745 + shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; 2751 2746 shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; 2752 2747 spin_lock(&inode->i_lock); 2753 2748 inode->i_private = &shmem_falloc; ··· 3933 3928 static ssize_t shmem_enabled_show(struct kobject *kobj, 3934 3929 struct kobj_attribute *attr, char *buf) 3935 3930 { 3936 - int values[] = { 3931 + static const int values[] = { 3937 3932 SHMEM_HUGE_ALWAYS, 3938 3933 SHMEM_HUGE_WITHIN_SIZE, 3939 3934 SHMEM_HUGE_ADVISE,
+4 -3
mm/slab.c
··· 1247 1247 * structures first. Without this, further allocations will bug. 1248 1248 */ 1249 1249 kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] = create_kmalloc_cache( 1250 - kmalloc_info[INDEX_NODE].name, 1251 - kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS, 1252 - 0, kmalloc_size(INDEX_NODE)); 1250 + kmalloc_info[INDEX_NODE].name[KMALLOC_NORMAL], 1251 + kmalloc_info[INDEX_NODE].size, 1252 + ARCH_KMALLOC_FLAGS, 0, 1253 + kmalloc_info[INDEX_NODE].size); 1253 1254 slab_state = PARTIAL_NODE; 1254 1255 setup_kmalloc_cache_index_table(); 1255 1256
+3 -3
mm/slab.h
··· 139 139 140 140 /* A table of kmalloc cache names and sizes */ 141 141 extern const struct kmalloc_info_struct { 142 - const char *name; 142 + const char *name[NR_KMALLOC_TYPES]; 143 143 unsigned int size; 144 144 } kmalloc_info[]; 145 145 ··· 369 369 if (ret) 370 370 goto out; 371 371 372 - lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); 372 + lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); 373 373 mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order); 374 374 375 375 /* transer try_charge() page references to kmem_cache */ ··· 393 393 rcu_read_lock(); 394 394 memcg = READ_ONCE(s->memcg_params.memcg); 395 395 if (likely(!mem_cgroup_is_root(memcg))) { 396 - lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); 396 + lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page)); 397 397 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order)); 398 398 memcg_kmem_uncharge_memcg(page, order, memcg); 399 399 } else {
+53 -46
mm/slab_common.c
··· 1139 1139 return kmalloc_caches[kmalloc_type(flags)][index]; 1140 1140 } 1141 1141 1142 + #ifdef CONFIG_ZONE_DMA 1143 + #define INIT_KMALLOC_INFO(__size, __short_size) \ 1144 + { \ 1145 + .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ 1146 + .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ 1147 + .name[KMALLOC_DMA] = "dma-kmalloc-" #__short_size, \ 1148 + .size = __size, \ 1149 + } 1150 + #else 1151 + #define INIT_KMALLOC_INFO(__size, __short_size) \ 1152 + { \ 1153 + .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ 1154 + .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #__short_size, \ 1155 + .size = __size, \ 1156 + } 1157 + #endif 1158 + 1142 1159 /* 1143 1160 * kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time. 1144 1161 * kmalloc_index() supports up to 2^26=64MB, so the final entry of the table is 1145 1162 * kmalloc-67108864. 1146 1163 */ 1147 1164 const struct kmalloc_info_struct kmalloc_info[] __initconst = { 1148 - {NULL, 0}, {"kmalloc-96", 96}, 1149 - {"kmalloc-192", 192}, {"kmalloc-8", 8}, 1150 - {"kmalloc-16", 16}, {"kmalloc-32", 32}, 1151 - {"kmalloc-64", 64}, {"kmalloc-128", 128}, 1152 - {"kmalloc-256", 256}, {"kmalloc-512", 512}, 1153 - {"kmalloc-1k", 1024}, {"kmalloc-2k", 2048}, 1154 - {"kmalloc-4k", 4096}, {"kmalloc-8k", 8192}, 1155 - {"kmalloc-16k", 16384}, {"kmalloc-32k", 32768}, 1156 - {"kmalloc-64k", 65536}, {"kmalloc-128k", 131072}, 1157 - {"kmalloc-256k", 262144}, {"kmalloc-512k", 524288}, 1158 - {"kmalloc-1M", 1048576}, {"kmalloc-2M", 2097152}, 1159 - {"kmalloc-4M", 4194304}, {"kmalloc-8M", 8388608}, 1160 - {"kmalloc-16M", 16777216}, {"kmalloc-32M", 33554432}, 1161 - {"kmalloc-64M", 67108864} 1165 + INIT_KMALLOC_INFO(0, 0), 1166 + INIT_KMALLOC_INFO(96, 96), 1167 + INIT_KMALLOC_INFO(192, 192), 1168 + INIT_KMALLOC_INFO(8, 8), 1169 + INIT_KMALLOC_INFO(16, 16), 1170 + INIT_KMALLOC_INFO(32, 32), 1171 + INIT_KMALLOC_INFO(64, 64), 1172 + INIT_KMALLOC_INFO(128, 128), 1173 + INIT_KMALLOC_INFO(256, 256), 1174 + 
INIT_KMALLOC_INFO(512, 512), 1175 + INIT_KMALLOC_INFO(1024, 1k), 1176 + INIT_KMALLOC_INFO(2048, 2k), 1177 + INIT_KMALLOC_INFO(4096, 4k), 1178 + INIT_KMALLOC_INFO(8192, 8k), 1179 + INIT_KMALLOC_INFO(16384, 16k), 1180 + INIT_KMALLOC_INFO(32768, 32k), 1181 + INIT_KMALLOC_INFO(65536, 64k), 1182 + INIT_KMALLOC_INFO(131072, 128k), 1183 + INIT_KMALLOC_INFO(262144, 256k), 1184 + INIT_KMALLOC_INFO(524288, 512k), 1185 + INIT_KMALLOC_INFO(1048576, 1M), 1186 + INIT_KMALLOC_INFO(2097152, 2M), 1187 + INIT_KMALLOC_INFO(4194304, 4M), 1188 + INIT_KMALLOC_INFO(8388608, 8M), 1189 + INIT_KMALLOC_INFO(16777216, 16M), 1190 + INIT_KMALLOC_INFO(33554432, 32M), 1191 + INIT_KMALLOC_INFO(67108864, 64M) 1162 1192 }; 1163 1193 1164 1194 /* ··· 1238 1208 } 1239 1209 } 1240 1210 1241 - static const char * 1242 - kmalloc_cache_name(const char *prefix, unsigned int size) 1243 - { 1244 - 1245 - static const char units[3] = "\0kM"; 1246 - int idx = 0; 1247 - 1248 - while (size >= 1024 && (size % 1024 == 0)) { 1249 - size /= 1024; 1250 - idx++; 1251 - } 1252 - 1253 - return kasprintf(GFP_NOWAIT, "%s-%u%c", prefix, size, units[idx]); 1254 - } 1255 - 1256 1211 static void __init 1257 - new_kmalloc_cache(int idx, int type, slab_flags_t flags) 1212 + new_kmalloc_cache(int idx, enum kmalloc_cache_type type, slab_flags_t flags) 1258 1213 { 1259 - const char *name; 1260 - 1261 - if (type == KMALLOC_RECLAIM) { 1214 + if (type == KMALLOC_RECLAIM) 1262 1215 flags |= SLAB_RECLAIM_ACCOUNT; 1263 - name = kmalloc_cache_name("kmalloc-rcl", 1264 - kmalloc_info[idx].size); 1265 - BUG_ON(!name); 1266 - } else { 1267 - name = kmalloc_info[idx].name; 1268 - } 1269 1216 1270 - kmalloc_caches[type][idx] = create_kmalloc_cache(name, 1217 + kmalloc_caches[type][idx] = create_kmalloc_cache( 1218 + kmalloc_info[idx].name[type], 1271 1219 kmalloc_info[idx].size, flags, 0, 1272 1220 kmalloc_info[idx].size); 1273 1221 } ··· 1257 1249 */ 1258 1250 void __init create_kmalloc_caches(slab_flags_t flags) 1259 1251 { 1260 - int i, 
type; 1252 + int i; 1253 + enum kmalloc_cache_type type; 1261 1254 1262 1255 for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) { 1263 1256 for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) { ··· 1287 1278 struct kmem_cache *s = kmalloc_caches[KMALLOC_NORMAL][i]; 1288 1279 1289 1280 if (s) { 1290 - unsigned int size = kmalloc_size(i); 1291 - const char *n = kmalloc_cache_name("dma-kmalloc", size); 1292 - 1293 - BUG_ON(!n); 1294 1281 kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache( 1295 - n, size, SLAB_CACHE_DMA | flags, 0, 0); 1282 + kmalloc_info[i].name[KMALLOC_DMA], 1283 + kmalloc_info[i].size, 1284 + SLAB_CACHE_DMA | flags, 0, 0); 1296 1285 } 1297 1286 } 1298 1287 #endif
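INIT_KMALLOC_INFO above builds every cache name at compile time by stringizing the short size into the string literals, replacing the old runtime kasprintf() in kmalloc_cache_name(). The preprocessor trick in isolation, with a hypothetical two-type table instead of the kernel's NR_KMALLOC_TYPES:

```c
#include <assert.h>
#include <string.h>

enum cache_type { CACHE_NORMAL, CACHE_RECLAIM, NR_CACHE_TYPES };

struct cache_info {
    const char *name[NR_CACHE_TYPES];
    unsigned int size;
};

/* #__short_size stringizes the macro argument, and adjacent string
 * literals concatenate, so "kmalloc-96", "kmalloc-1k", "kmalloc-1M"
 * land in .rodata with no runtime formatting at all. */
#define INIT_CACHE_INFO(__size, __short_size)               \
{                                                           \
    .name[CACHE_NORMAL]  = "kmalloc-" #__short_size,        \
    .name[CACHE_RECLAIM] = "kmalloc-rcl-" #__short_size,    \
    .size = __size,                                         \
}

static const struct cache_info cache_info[] = {
    INIT_CACHE_INFO(96, 96),
    INIT_CACHE_INFO(1024, 1k),
    INIT_CACHE_INFO(1048576, 1M),
};
```

Passing the human-readable suffix (`1k`, `1M`) as a separate macro argument is what lets the table keep the short names the old kasprintf() loop used to compute.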
+16 -20
mm/slub.c
··· 93 93 * minimal so we rely on the page allocators per cpu caches for 94 94 * fast frees and allocs. 95 95 * 96 - * Overloading of page flags that are otherwise used for LRU management. 97 - * 98 - * PageActive The slab is frozen and exempt from list processing. 96 + * page->frozen The slab is frozen and exempt from list processing. 99 97 * This means that the slab is dedicated to a purpose 100 98 * such as satisfying allocations for a specific 101 99 * processor. Objects may be freed in the slab while ··· 109 111 * free objects in addition to the regular freelist 110 112 * that requires the slab lock. 111 113 * 112 - * PageError Slab requires special handling due to debug 114 + * SLAB_DEBUG_FLAGS Slab requires special handling due to debug 113 115 * options set. This moves slab handling out of 114 116 * the fast path and disables lockless freelists. 115 117 */ ··· 734 736 { 735 737 u8 *fault; 736 738 u8 *end; 739 + u8 *addr = page_address(page); 737 740 738 741 metadata_access_enable(); 739 742 fault = memchr_inv(start, value, bytes); ··· 747 748 end--; 748 749 749 750 slab_bug(s, "%s overwritten", what); 750 - pr_err("INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x\n", 751 - fault, end - 1, fault[0], value); 751 + pr_err("INFO: 0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n", 752 + fault, end - 1, fault - addr, 753 + fault[0], value); 752 754 print_trailer(s, page, object); 753 755 754 756 restore_bytes(s, what, value, fault, end); ··· 844 844 while (end > fault && end[-1] == POISON_INUSE) 845 845 end--; 846 846 847 - slab_err(s, page, "Padding overwritten. 0x%p-0x%p", fault, end - 1); 847 + slab_err(s, page, "Padding overwritten. 
0x%p-0x%p @offset=%tu", 848 + fault, end - 1, fault - start); 848 849 print_section(KERN_ERR, "Padding ", pad, remainder); 849 850 850 851 restore_bytes(s, "slab padding", POISON_INUSE, fault, end); ··· 4384 4383 #endif 4385 4384 4386 4385 #ifdef CONFIG_SLUB_DEBUG 4387 - static int validate_slab(struct kmem_cache *s, struct page *page, 4386 + static void validate_slab(struct kmem_cache *s, struct page *page, 4388 4387 unsigned long *map) 4389 4388 { 4390 4389 void *p; 4391 4390 void *addr = page_address(page); 4392 4391 4393 - if (!check_slab(s, page) || 4394 - !on_freelist(s, page, NULL)) 4395 - return 0; 4392 + if (!check_slab(s, page) || !on_freelist(s, page, NULL)) 4393 + return; 4396 4394 4397 4395 /* Now we know that a valid freelist exists */ 4398 4396 bitmap_zero(map, page->objects); 4399 4397 4400 4398 get_map(s, page, map); 4401 4399 for_each_object(p, s, addr, page->objects) { 4402 - if (test_bit(slab_index(p, s, addr), map)) 4403 - if (!check_object(s, page, p, SLUB_RED_INACTIVE)) 4404 - return 0; 4405 - } 4400 + u8 val = test_bit(slab_index(p, s, addr), map) ? 4401 + SLUB_RED_INACTIVE : SLUB_RED_ACTIVE; 4406 4402 4407 - for_each_object(p, s, addr, page->objects) 4408 - if (!test_bit(slab_index(p, s, addr), map)) 4409 - if (!check_object(s, page, p, SLUB_RED_ACTIVE)) 4410 - return 0; 4411 - return 1; 4403 + if (!check_object(s, page, p, val)) 4404 + break; 4405 + } 4412 4406 } 4413 4407 4414 4408 static void validate_slab_slab(struct kmem_cache *s, struct page *page,
+10 -8
mm/sparse.c
··· 458 458 if (map) 459 459 return map; 460 460 461 - map = memblock_alloc_try_nid(size, 462 - PAGE_SIZE, addr, 461 + map = memblock_alloc_try_nid_raw(size, size, addr, 463 462 MEMBLOCK_ALLOC_ACCESSIBLE, nid); 464 463 if (!map) 465 464 panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n", ··· 481 482 { 482 483 phys_addr_t addr = __pa(MAX_DMA_ADDRESS); 483 484 WARN_ON(sparsemap_buf); /* forgot to call sparse_buffer_fini()? */ 484 - sparsemap_buf = 485 - memblock_alloc_try_nid_raw(size, PAGE_SIZE, 486 - addr, 487 - MEMBLOCK_ALLOC_ACCESSIBLE, nid); 485 + /* 486 + * Pre-allocated buffer is mainly used by __populate_section_memmap 487 + * and we want it to be properly aligned to the section size - this is 488 + * especially the case for VMEMMAP which maps memmap to PMDs 489 + */ 490 + sparsemap_buf = memblock_alloc_exact_nid_raw(size, section_map_size(), 491 + addr, MEMBLOCK_ALLOC_ACCESSIBLE, nid); 488 492 sparsemap_buf_end = sparsemap_buf + size; 489 493 } 490 494 ··· 649 647 #endif 650 648 651 649 #ifdef CONFIG_SPARSEMEM_VMEMMAP 652 - static struct page *populate_section_memmap(unsigned long pfn, 650 + static struct page * __meminit populate_section_memmap(unsigned long pfn, 653 651 unsigned long nr_pages, int nid, struct vmem_altmap *altmap) 654 652 { 655 653 return __populate_section_memmap(pfn, nr_pages, nid, altmap); ··· 671 669 vmemmap_free(start, end, NULL); 672 670 } 673 671 #else 674 - struct page *populate_section_memmap(unsigned long pfn, 672 + struct page * __meminit populate_section_memmap(unsigned long pfn, 675 673 unsigned long nr_pages, int nid, struct vmem_altmap *altmap) 676 674 { 677 675 struct page *page, *ret;
+24 -5
mm/swap.c
··· 373 373 void mark_page_accessed(struct page *page) 374 374 { 375 375 page = compound_head(page); 376 - if (!PageActive(page) && !PageUnevictable(page) && 377 - PageReferenced(page)) { 378 376 377 + if (!PageReferenced(page)) { 378 + SetPageReferenced(page); 379 + } else if (PageUnevictable(page)) { 380 + /* 381 + * Unevictable pages are on the "LRU_UNEVICTABLE" list. But, 382 + * this list is never rotated or maintained, so marking an 383 + * evictable page accessed has no effect. 384 + */ 385 + } else if (!PageActive(page)) { 379 386 /* 380 387 * If the page is on the LRU, queue it for activation via 381 388 * activate_page_pvecs. Otherwise, assume the page is on a ··· 396 389 ClearPageReferenced(page); 397 390 if (page_is_file_cache(page)) 398 391 workingset_activation(page); 399 - } else if (!PageReferenced(page)) { 400 - SetPageReferenced(page); 401 392 } 402 393 if (page_is_idle(page)) 403 394 clear_page_idle(page); ··· 713 708 */ 714 709 void lru_add_drain_all(void) 715 710 { 711 + static seqcount_t seqcount = SEQCNT_ZERO(seqcount); 716 712 static DEFINE_MUTEX(lock); 717 713 static struct cpumask has_work; 718 - int cpu; 714 + int cpu, seq; 719 715 720 716 /* 721 717 * Make sure nobody triggers this path before mm_percpu_wq is fully ··· 725 719 if (WARN_ON(!mm_percpu_wq)) 726 720 return; 727 721 722 + seq = raw_read_seqcount_latch(&seqcount); 723 + 728 724 mutex_lock(&lock); 725 + 726 + /* 727 + * Piggyback on drain started and finished while we waited for lock: 728 + * all pages pended at the time of our enter were drained from vectors. 729 + */ 730 + if (__read_seqcount_retry(&seqcount, seq)) 731 + goto done; 732 + 733 + raw_write_seqcount_latch(&seqcount); 734 + 729 735 cpumask_clear(&has_work); 730 736 731 737 for_each_online_cpu(cpu) { ··· 758 740 for_each_cpu(cpu, &has_work) 759 741 flush_work(&per_cpu(lru_add_drain_work, cpu)); 760 742 743 + done: 761 744 mutex_unlock(&lock); 762 745 } 763 746 #else
+7
mm/swapfile.c
··· 2887 2887 error = set_blocksize(p->bdev, PAGE_SIZE); 2888 2888 if (error < 0) 2889 2889 return error; 2890 + /* 2891 + * Zoned block devices contain zones that have a sequential 2892 + * write only restriction. Hence zoned block devices are not 2893 + * suitable for swapping. Disallow them here. 2894 + */ 2895 + if (blk_queue_is_zoned(p->bdev->bd_queue)) 2896 + return -EINVAL; 2890 2897 p->flags |= SWP_BLKDEV; 2891 2898 } else if (S_ISREG(inode->i_mode)) { 2892 2899 p->bdev = inode->i_sb->s_bdev;
+37 -36
mm/userfaultfd.c
··· 18 18 #include <asm/tlbflush.h> 19 19 #include "internal.h" 20 20 21 + static __always_inline 22 + struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, 23 + unsigned long dst_start, 24 + unsigned long len) 25 + { 26 + /* 27 + * Make sure that the dst range is both valid and fully within a 28 + * single existing vma. 29 + */ 30 + struct vm_area_struct *dst_vma; 31 + 32 + dst_vma = find_vma(dst_mm, dst_start); 33 + if (!dst_vma) 34 + return NULL; 35 + 36 + if (dst_start < dst_vma->vm_start || 37 + dst_start + len > dst_vma->vm_end) 38 + return NULL; 39 + 40 + /* 41 + * Check the vma is registered in uffd, this is required to 42 + * enforce the VM_MAYWRITE check done at uffd registration 43 + * time. 44 + */ 45 + if (!dst_vma->vm_userfaultfd_ctx.ctx) 46 + return NULL; 47 + 48 + return dst_vma; 49 + } 50 + 21 51 static int mcopy_atomic_pte(struct mm_struct *dst_mm, 22 52 pmd_t *dst_pmd, 23 53 struct vm_area_struct *dst_vma, ··· 90 60 91 61 /* 92 62 * The memory barrier inside __SetPageUptodate makes sure that 93 - * preceeding stores to the page contents become visible before 63 + * preceding stores to the page contents become visible before 94 64 * the set_pte_at() write. 95 65 */ 96 66 __SetPageUptodate(page); ··· 214 184 unsigned long src_addr, dst_addr; 215 185 long copied; 216 186 struct page *page; 217 - struct hstate *h; 218 187 unsigned long vma_hpagesize; 219 188 pgoff_t idx; 220 189 u32 hash; ··· 250 221 */ 251 222 if (!dst_vma) { 252 223 err = -ENOENT; 253 - dst_vma = find_vma(dst_mm, dst_start); 224 + dst_vma = find_dst_vma(dst_mm, dst_start, len); 254 225 if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) 255 - goto out_unlock; 256 - /* 257 - * Check the vma is registered in uffd, this is 258 - * required to enforce the VM_MAYWRITE check done at 259 - * uffd registration time. 
260 - */ 261 - if (!dst_vma->vm_userfaultfd_ctx.ctx) 262 - goto out_unlock; 263 - 264 - if (dst_start < dst_vma->vm_start || 265 - dst_start + len > dst_vma->vm_end) 266 226 goto out_unlock; 267 227 268 228 err = -EINVAL; ··· 260 242 261 243 vm_shared = dst_vma->vm_flags & VM_SHARED; 262 244 } 263 - 264 - if (WARN_ON(dst_addr & (vma_hpagesize - 1) || 265 - (len - copied) & (vma_hpagesize - 1))) 266 - goto out_unlock; 267 245 268 246 /* 269 247 * If not shared, ensure the dst_vma has a anon_vma. ··· 270 256 goto out_unlock; 271 257 } 272 258 273 - h = hstate_vma(dst_vma); 274 - 275 259 while (src_addr < src_start + len) { 276 260 pte_t dst_pteval; 277 261 278 262 BUG_ON(dst_addr >= dst_start + len); 279 - VM_BUG_ON(dst_addr & ~huge_page_mask(h)); 280 263 281 264 /* 282 265 * Serialize via hugetlb_fault_mutex 283 266 */ 284 267 idx = linear_page_index(dst_vma, dst_addr); 285 268 mapping = dst_vma->vm_file->f_mapping; 286 - hash = hugetlb_fault_mutex_hash(h, mapping, idx, dst_addr); 269 + hash = hugetlb_fault_mutex_hash(mapping, idx); 287 270 mutex_lock(&hugetlb_fault_mutex_table[hash]); 288 271 289 272 err = -ENOMEM; 290 - dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h)); 273 + dst_pte = huge_pte_alloc(dst_mm, dst_addr, vma_hpagesize); 291 274 if (!dst_pte) { 292 275 mutex_unlock(&hugetlb_fault_mutex_table[hash]); 293 276 goto out_unlock; ··· 311 300 312 301 err = copy_huge_page_from_user(page, 313 302 (const void __user *)src_addr, 314 - pages_per_huge_page(h), true); 303 + vma_hpagesize / PAGE_SIZE, 304 + true); 315 305 if (unlikely(err)) { 316 306 err = -EFAULT; 317 307 goto out; ··· 487 475 * both valid and fully within a single existing vma. 
488 476 */ 489 477 err = -ENOENT; 490 - dst_vma = find_vma(dst_mm, dst_start); 478 + dst_vma = find_dst_vma(dst_mm, dst_start, len); 491 479 if (!dst_vma) 492 - goto out_unlock; 493 - /* 494 - * Check the vma is registered in uffd, this is required to 495 - * enforce the VM_MAYWRITE check done at uffd registration 496 - * time. 497 - */ 498 - if (!dst_vma->vm_userfaultfd_ctx.ctx) 499 - goto out_unlock; 500 - 501 - if (dst_start < dst_vma->vm_start || 502 - dst_start + len > dst_vma->vm_end) 503 480 goto out_unlock; 504 481 505 482 err = -EINVAL;
+16 -6
mm/util.c
··· 271 271 EXPORT_SYMBOL(memdup_user_nul); 272 272 273 273 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma, 274 - struct vm_area_struct *prev, struct rb_node *rb_parent) 274 + struct vm_area_struct *prev) 275 275 { 276 276 struct vm_area_struct *next; 277 277 ··· 280 280 next = prev->vm_next; 281 281 prev->vm_next = vma; 282 282 } else { 283 + next = mm->mmap; 283 284 mm->mmap = vma; 284 - if (rb_parent) 285 - next = rb_entry(rb_parent, 286 - struct vm_area_struct, vm_rb); 287 - else 288 - next = NULL; 289 285 } 290 286 vma->vm_next = next; 291 287 if (next) 292 288 next->vm_prev = vma; 289 + } 290 + 291 + void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma) 292 + { 293 + struct vm_area_struct *prev, *next; 294 + 295 + next = vma->vm_next; 296 + prev = vma->vm_prev; 297 + if (prev) 298 + prev->vm_next = next; 299 + else 300 + mm->mmap = next; 301 + if (next) 302 + next->vm_prev = prev; 293 303 } 294 304 295 305 /* Check if the vma is being used as a stack by this task */
+134 -58
mm/vmalloc.c
··· 331 331 332 332 333 333 static DEFINE_SPINLOCK(vmap_area_lock); 334 + static DEFINE_SPINLOCK(free_vmap_area_lock); 334 335 /* Export for kexec only */ 335 336 LIST_HEAD(vmap_area_list); 336 337 static LLIST_HEAD(vmap_purge_list); ··· 683 682 * free area is inserted. If VA has been merged, it is 684 683 * freed. 685 684 */ 686 - static __always_inline void 685 + static __always_inline struct vmap_area * 687 686 merge_or_add_vmap_area(struct vmap_area *va, 688 687 struct rb_root *root, struct list_head *head) 689 688 { ··· 750 749 751 750 /* Free vmap_area object. */ 752 751 kmem_cache_free(vmap_area_cachep, va); 753 - return; 752 + 753 + /* Point to the new merged area. */ 754 + va = sibling; 755 + merged = true; 754 756 } 755 757 } 756 758 ··· 762 758 link_va(va, root, parent, link, head); 763 759 augment_tree_propagate_from(va); 764 760 } 761 + 762 + return va; 765 763 } 766 764 767 765 static __always_inline bool ··· 974 968 * There are a few exceptions though, as an example it is 975 969 * a first allocation (early boot up) when we have "one" 976 970 * big free space that has to be split. 971 + * 972 + * Also we can hit this path in case of regular "vmap" 973 + * allocations, if "this" current CPU was not preloaded. 974 + * See the comment in alloc_vmap_area() why. If so, then 975 + * GFP_NOWAIT is used instead to get an extra object for 976 + * split purpose. That is rare and most time does not 977 + * occur. 978 + * 979 + * What happens if an allocation gets failed. Basically, 980 + * an "overflow" path is triggered to purge lazily freed 981 + * areas to free some memory, then, the "retry" path is 982 + * triggered to repeat one more time. See more details 983 + * in alloc_vmap_area() function. 
977 984 */ 978 985 lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT); 979 986 if (!lva) ··· 1082 1063 return ERR_PTR(-EBUSY); 1083 1064 1084 1065 might_sleep(); 1066 + gfp_mask = gfp_mask & GFP_RECLAIM_MASK; 1085 1067 1086 - va = kmem_cache_alloc_node(vmap_area_cachep, 1087 - gfp_mask & GFP_RECLAIM_MASK, node); 1068 + va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); 1088 1069 if (unlikely(!va)) 1089 1070 return ERR_PTR(-ENOMEM); 1090 1071 ··· 1092 1073 * Only scan the relevant parts containing pointers to other objects 1093 1074 * to avoid false negatives. 1094 1075 */ 1095 - kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask & GFP_RECLAIM_MASK); 1076 + kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask); 1096 1077 1097 1078 retry: 1098 1079 /* 1099 - * Preload this CPU with one extra vmap_area object to ensure 1100 - * that we have it available when fit type of free area is 1101 - * NE_FIT_TYPE. 1080 + * Preload this CPU with one extra vmap_area object. It is used 1081 + * when fit type of free area is NE_FIT_TYPE. Please note, it 1082 + * does not guarantee that an allocation occurs on a CPU that 1083 + * is preloaded, instead we minimize the case when it is not. 1084 + * It can happen because of cpu migration, because there is a 1085 + * race until the below spinlock is taken. 1102 1086 * 1103 1087 * The preload is done in non-atomic context, thus it allows us 1104 1088 * to use more permissive allocation masks to be more stable under 1105 - * low memory condition and high memory pressure. 1089 + * low memory condition and high memory pressure. In rare case, 1090 + * if not preloaded, GFP_NOWAIT is used. 1106 1091 * 1107 - * Even if it fails we do not really care about that. Just proceed 1108 - * as it is. "overflow" path will refill the cache we allocate from. 1092 + * Set "pva" to NULL here, because of "retry" path. 
1109 1093 */ 1110 - preempt_disable(); 1111 - if (!__this_cpu_read(ne_fit_preload_node)) { 1112 - preempt_enable(); 1113 - pva = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, node); 1114 - preempt_disable(); 1094 + pva = NULL; 1115 1095 1116 - if (__this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva)) { 1117 - if (pva) 1118 - kmem_cache_free(vmap_area_cachep, pva); 1119 - } 1120 - } 1096 + if (!this_cpu_read(ne_fit_preload_node)) 1097 + /* 1098 + * Even if it fails we do not really care about that. 1099 + * Just proceed as it is. If needed "overflow" path 1100 + * will refill the cache we allocate from. 1101 + */ 1102 + pva = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node); 1121 1103 1122 - spin_lock(&vmap_area_lock); 1123 - preempt_enable(); 1104 + spin_lock(&free_vmap_area_lock); 1105 + 1106 + if (pva && __this_cpu_cmpxchg(ne_fit_preload_node, NULL, pva)) 1107 + kmem_cache_free(vmap_area_cachep, pva); 1124 1108 1125 1109 /* 1126 1110 * If an allocation fails, the "vend" address is 1127 1111 * returned. Therefore trigger the overflow path. 
1128 1112 */ 1129 1113 addr = __alloc_vmap_area(size, align, vstart, vend); 1114 + spin_unlock(&free_vmap_area_lock); 1115 + 1130 1116 if (unlikely(addr == vend)) 1131 1117 goto overflow; 1132 1118 1133 1119 va->va_start = addr; 1134 1120 va->va_end = addr + size; 1135 1121 va->vm = NULL; 1136 - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); 1137 1122 1123 + spin_lock(&vmap_area_lock); 1124 + insert_vmap_area(va, &vmap_area_root, &vmap_area_list); 1138 1125 spin_unlock(&vmap_area_lock); 1139 1126 1140 1127 BUG_ON(!IS_ALIGNED(va->va_start, align)); ··· 1150 1125 return va; 1151 1126 1152 1127 overflow: 1153 - spin_unlock(&vmap_area_lock); 1154 1128 if (!purged) { 1155 1129 purge_vmap_area_lazy(); 1156 1130 purged = 1; ··· 1185 1161 } 1186 1162 EXPORT_SYMBOL_GPL(unregister_vmap_purge_notifier); 1187 1163 1188 - static void __free_vmap_area(struct vmap_area *va) 1189 - { 1190 - /* 1191 - * Remove from the busy tree/list. 1192 - */ 1193 - unlink_va(va, &vmap_area_root); 1194 - 1195 - /* 1196 - * Merge VA with its neighbors, otherwise just add it. 1197 - */ 1198 - merge_or_add_vmap_area(va, 1199 - &free_vmap_area_root, &free_vmap_area_list); 1200 - } 1201 - 1202 1164 /* 1203 1165 * Free a region of KVA allocated by alloc_vmap_area 1204 1166 */ 1205 1167 static void free_vmap_area(struct vmap_area *va) 1206 1168 { 1169 + /* 1170 + * Remove from the busy tree/list. 1171 + */ 1207 1172 spin_lock(&vmap_area_lock); 1208 - __free_vmap_area(va); 1173 + unlink_va(va, &vmap_area_root); 1209 1174 spin_unlock(&vmap_area_lock); 1175 + 1176 + /* 1177 + * Insert/Merge it back to the free tree/list. 
1178 + */ 1179 + spin_lock(&free_vmap_area_lock); 1180 + merge_or_add_vmap_area(va, &free_vmap_area_root, &free_vmap_area_list); 1181 + spin_unlock(&free_vmap_area_lock); 1210 1182 } 1211 1183 1212 1184 /* ··· 1295 1275 flush_tlb_kernel_range(start, end); 1296 1276 resched_threshold = lazy_max_pages() << 1; 1297 1277 1298 - spin_lock(&vmap_area_lock); 1278 + spin_lock(&free_vmap_area_lock); 1299 1279 llist_for_each_entry_safe(va, n_va, valist, purge_list) { 1300 1280 unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; 1281 + unsigned long orig_start = va->va_start; 1282 + unsigned long orig_end = va->va_end; 1301 1283 1302 1284 /* 1303 1285 * Finally insert or merge lazily-freed area. It is 1304 1286 * detached and there is no need to "unlink" it from 1305 1287 * anything. 1306 1288 */ 1307 - merge_or_add_vmap_area(va, 1308 - &free_vmap_area_root, &free_vmap_area_list); 1289 + va = merge_or_add_vmap_area(va, &free_vmap_area_root, 1290 + &free_vmap_area_list); 1291 + 1292 + if (is_vmalloc_or_module_addr((void *)orig_start)) 1293 + kasan_release_vmalloc(orig_start, orig_end, 1294 + va->va_start, va->va_end); 1309 1295 1310 1296 atomic_long_sub(nr, &vmap_lazy_nr); 1311 1297 1312 1298 if (atomic_long_read(&vmap_lazy_nr) < resched_threshold) 1313 - cond_resched_lock(&vmap_area_lock); 1299 + cond_resched_lock(&free_vmap_area_lock); 1314 1300 } 1315 - spin_unlock(&vmap_area_lock); 1301 + spin_unlock(&free_vmap_area_lock); 1316 1302 return true; 1317 1303 } 1318 1304 ··· 2040 2014 } 2041 2015 EXPORT_SYMBOL_GPL(map_vm_area); 2042 2016 2043 - static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, 2044 - unsigned long flags, const void *caller) 2017 + static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, 2018 + struct vmap_area *va, unsigned long flags, const void *caller) 2045 2019 { 2046 - spin_lock(&vmap_area_lock); 2047 2020 vm->flags = flags; 2048 2021 vm->addr = (void *)va->va_start; 2049 2022 vm->size = va->va_end - va->va_start; 
2050 2023 vm->caller = caller; 2051 2024 va->vm = vm; 2025 + } 2026 + 2027 + static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, 2028 + unsigned long flags, const void *caller) 2029 + { 2030 + spin_lock(&vmap_area_lock); 2031 + setup_vmalloc_vm_locked(vm, va, flags, caller); 2052 2032 spin_unlock(&vmap_area_lock); 2053 2033 } 2054 2034 ··· 2099 2067 } 2100 2068 2101 2069 setup_vmalloc_vm(area, va, flags, caller); 2070 + 2071 + /* 2072 + * For KASAN, if we are in vmalloc space, we need to cover the shadow 2073 + * area with real memory. If we come here through VM_ALLOC, this is 2074 + * done by a higher level function that has access to the true size, 2075 + * which might not be a full page. 2076 + * 2077 + * We assume module space comes via VM_ALLOC path. 2078 + */ 2079 + if (is_vmalloc_addr(area->addr) && !(area->flags & VM_ALLOC)) { 2080 + if (kasan_populate_vmalloc(area->size, area)) { 2081 + unmap_vmap_area(va); 2082 + kfree(area); 2083 + return NULL; 2084 + } 2085 + } 2102 2086 2103 2087 return area; 2104 2088 } ··· 2292 2244 2293 2245 debug_check_no_locks_freed(area->addr, get_vm_area_size(area)); 2294 2246 debug_check_no_obj_freed(area->addr, get_vm_area_size(area)); 2247 + 2248 + if (area->flags & VM_KASAN) 2249 + kasan_poison_vmalloc(area->addr, area->size); 2295 2250 2296 2251 vm_remove_mappings(area, deallocate_pages); 2297 2252 ··· 2491 2440 goto fail; 2492 2441 } 2493 2442 area->pages[i] = page; 2494 - if (gfpflags_allow_blocking(gfp_mask|highmem_mask)) 2443 + if (gfpflags_allow_blocking(gfp_mask)) 2495 2444 cond_resched(); 2496 2445 } 2497 2446 atomic_long_add(area->nr_pages, &nr_vmalloc_pages); ··· 2547 2496 addr = __vmalloc_area_node(area, gfp_mask, prot, node); 2548 2497 if (!addr) 2549 2498 return NULL; 2499 + 2500 + if (is_vmalloc_or_module_addr(area->addr)) { 2501 + if (kasan_populate_vmalloc(real_size, area)) 2502 + return NULL; 2503 + } 2550 2504 2551 2505 /* 2552 2506 * In this function, newly allocated vm_struct has 
VM_UNINITIALIZED ··· 3338 3282 goto err_free; 3339 3283 } 3340 3284 retry: 3341 - spin_lock(&vmap_area_lock); 3285 + spin_lock(&free_vmap_area_lock); 3342 3286 3343 3287 /* start scanning - we scan from the top, begin with the last area */ 3344 3288 area = term_area = last_area; ··· 3420 3364 va = vas[area]; 3421 3365 va->va_start = start; 3422 3366 va->va_end = start + size; 3423 - 3424 - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); 3425 3367 } 3426 3368 3427 - spin_unlock(&vmap_area_lock); 3369 + spin_unlock(&free_vmap_area_lock); 3428 3370 3429 3371 /* insert all vm's */ 3430 - for (area = 0; area < nr_vms; area++) 3431 - setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC, 3372 + spin_lock(&vmap_area_lock); 3373 + for (area = 0; area < nr_vms; area++) { 3374 + insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list); 3375 + 3376 + setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, 3432 3377 pcpu_get_vm_areas); 3378 + } 3379 + spin_unlock(&vmap_area_lock); 3380 + 3381 + /* populate the shadow space outside of the lock */ 3382 + for (area = 0; area < nr_vms; area++) { 3383 + /* assume success here */ 3384 + kasan_populate_vmalloc(sizes[area], vms[area]); 3385 + } 3433 3386 3434 3387 kfree(vas); 3435 3388 return vms; 3436 3389 3437 3390 recovery: 3438 - /* Remove previously inserted areas. */ 3391 + /* 3392 + * Remove previously allocated areas. There is no 3393 + * need in removing these areas from the busy tree, 3394 + * because they are inserted only on the final step 3395 + * and when pcpu_get_vm_areas() is success. 
3396 + */ 3439 3397 while (area--) { 3440 - __free_vmap_area(vas[area]); 3398 + merge_or_add_vmap_area(vas[area], &free_vmap_area_root, 3399 + &free_vmap_area_list); 3441 3400 vas[area] = NULL; 3442 3401 } 3443 3402 3444 3403 overflow: 3445 - spin_unlock(&vmap_area_lock); 3404 + spin_unlock(&free_vmap_area_lock); 3446 3405 if (!purged) { 3447 3406 purge_vmap_area_lazy(); 3448 3407 purged = true; ··· 3508 3437 3509 3438 #ifdef CONFIG_PROC_FS 3510 3439 static void *s_start(struct seq_file *m, loff_t *pos) 3440 + __acquires(&vmap_purge_lock) 3511 3441 __acquires(&vmap_area_lock) 3512 3442 { 3443 + mutex_lock(&vmap_purge_lock); 3513 3444 spin_lock(&vmap_area_lock); 3445 + 3514 3446 return seq_list_start(&vmap_area_list, *pos); 3515 3447 } 3516 3448 ··· 3523 3449 } 3524 3450 3525 3451 static void s_stop(struct seq_file *m, void *p) 3452 + __releases(&vmap_purge_lock) 3526 3453 __releases(&vmap_area_lock) 3527 3454 { 3455 + mutex_unlock(&vmap_purge_lock); 3528 3456 spin_unlock(&vmap_area_lock); 3529 3457 } 3530 3458
+339 -347
mm/vmscan.c
··· 79 79 */ 80 80 struct mem_cgroup *target_mem_cgroup; 81 81 82 + /* Can active pages be deactivated as part of reclaim? */ 83 + #define DEACTIVATE_ANON 1 84 + #define DEACTIVATE_FILE 2 85 + unsigned int may_deactivate:2; 86 + unsigned int force_deactivate:1; 87 + unsigned int skipped_deactivate:1; 88 + 82 89 /* Writepage batching in laptop mode; RECLAIM_WRITE */ 83 90 unsigned int may_writepage:1; 84 91 ··· 107 100 108 101 /* One of the zones is ready for compaction */ 109 102 unsigned int compaction_ready:1; 103 + 104 + /* There is easily reclaimable cold cache in the current node */ 105 + unsigned int cache_trim_mode:1; 106 + 107 + /* The file pages on the current node are dangerously low */ 108 + unsigned int file_is_tiny:1; 110 109 111 110 /* Allocation order */ 112 111 s8 order; ··· 252 239 up_write(&shrinker_rwsem); 253 240 } 254 241 255 - static bool global_reclaim(struct scan_control *sc) 242 + static bool cgroup_reclaim(struct scan_control *sc) 256 243 { 257 - return !sc->target_mem_cgroup; 244 + return sc->target_mem_cgroup; 258 245 } 259 246 260 247 /** 261 - * sane_reclaim - is the usual dirty throttling mechanism operational? 248 + * writeback_throttling_sane - is the usual dirty throttling mechanism available? 262 249 * @sc: scan_control in question 263 250 * 264 251 * The normal page dirty throttling mechanism in balance_dirty_pages() is ··· 270 257 * This function tests whether the vmscan currently in progress can assume 271 258 * that the normal dirty throttling mechanism is operational. 
272 259 */ 273 - static bool sane_reclaim(struct scan_control *sc) 260 + static bool writeback_throttling_sane(struct scan_control *sc) 274 261 { 275 - struct mem_cgroup *memcg = sc->target_mem_cgroup; 276 - 277 - if (!memcg) 262 + if (!cgroup_reclaim(sc)) 278 263 return true; 279 264 #ifdef CONFIG_CGROUP_WRITEBACK 280 265 if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) 281 266 return true; 282 267 #endif 283 268 return false; 284 - } 285 - 286 - static void set_memcg_congestion(pg_data_t *pgdat, 287 - struct mem_cgroup *memcg, 288 - bool congested) 289 - { 290 - struct mem_cgroup_per_node *mn; 291 - 292 - if (!memcg) 293 - return; 294 - 295 - mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id); 296 - WRITE_ONCE(mn->congested, congested); 297 - } 298 - 299 - static bool memcg_congested(pg_data_t *pgdat, 300 - struct mem_cgroup *memcg) 301 - { 302 - struct mem_cgroup_per_node *mn; 303 - 304 - mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id); 305 - return READ_ONCE(mn->congested); 306 - 307 269 } 308 270 #else 309 271 static int prealloc_memcg_shrinker(struct shrinker *shrinker) ··· 290 302 { 291 303 } 292 304 293 - static bool global_reclaim(struct scan_control *sc) 294 - { 295 - return true; 296 - } 297 - 298 - static bool sane_reclaim(struct scan_control *sc) 299 - { 300 - return true; 301 - } 302 - 303 - static inline void set_memcg_congestion(struct pglist_data *pgdat, 304 - struct mem_cgroup *memcg, bool congested) 305 - { 306 - } 307 - 308 - static inline bool memcg_congested(struct pglist_data *pgdat, 309 - struct mem_cgroup *memcg) 305 + static bool cgroup_reclaim(struct scan_control *sc) 310 306 { 311 307 return false; 308 + } 312 309 310 + static bool writeback_throttling_sane(struct scan_control *sc) 311 + { 312 + return true; 313 313 } 314 314 #endif 315 315 ··· 327 351 */ 328 352 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx) 329 353 { 330 - unsigned long lru_size = 0; 354 + unsigned long size = 0; 331 355 int zid; 332 
356 333 - if (!mem_cgroup_disabled()) { 334 - for (zid = 0; zid < MAX_NR_ZONES; zid++) 335 - lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid); 336 - } else 337 - lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru); 338 - 339 - for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) { 357 + for (zid = 0; zid <= zone_idx && zid < MAX_NR_ZONES; zid++) { 340 358 struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid]; 341 - unsigned long size; 342 359 343 360 if (!managed_zone(zone)) 344 361 continue; 345 362 346 363 if (!mem_cgroup_disabled()) 347 - size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid); 364 + size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid); 348 365 else 349 - size = zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid], 350 - NR_ZONE_LRU_BASE + lru); 351 - lru_size -= min(size, lru_size); 366 + size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru); 352 367 } 353 - 354 - return lru_size; 355 - 368 + return size; 356 369 } 357 370 358 371 /* ··· 740 775 return page_count(page) - page_has_private(page) == 1 + page_cache_pins; 741 776 } 742 777 743 - static int may_write_to_inode(struct inode *inode, struct scan_control *sc) 778 + static int may_write_to_inode(struct inode *inode) 744 779 { 745 780 if (current->flags & PF_SWAPWRITE) 746 781 return 1; ··· 788 823 * pageout is called by shrink_page_list() for each dirty page. 789 824 * Calls ->writepage(). 
790 825 */ 791 - static pageout_t pageout(struct page *page, struct address_space *mapping, 792 - struct scan_control *sc) 826 + static pageout_t pageout(struct page *page, struct address_space *mapping) 793 827 { 794 828 /* 795 829 * If the page is dirty, only perform writeback if that write ··· 824 860 } 825 861 if (mapping->a_ops->writepage == NULL) 826 862 return PAGE_ACTIVATE; 827 - if (!may_write_to_inode(mapping->host, sc)) 863 + if (!may_write_to_inode(mapping->host)) 828 864 return PAGE_KEEP; 829 865 830 866 if (clear_page_dirty_for_io(page)) { ··· 863 899 * gets returned with a refcount of 0. 864 900 */ 865 901 static int __remove_mapping(struct address_space *mapping, struct page *page, 866 - bool reclaimed) 902 + bool reclaimed, struct mem_cgroup *target_memcg) 867 903 { 868 904 unsigned long flags; 869 905 int refcount; ··· 935 971 */ 936 972 if (reclaimed && page_is_file_cache(page) && 937 973 !mapping_exiting(mapping) && !dax_mapping(mapping)) 938 - shadow = workingset_eviction(page); 974 + shadow = workingset_eviction(page, target_memcg); 939 975 __delete_from_page_cache(page, shadow); 940 976 xa_unlock_irqrestore(&mapping->i_pages, flags); 941 977 ··· 958 994 */ 959 995 int remove_mapping(struct address_space *mapping, struct page *page) 960 996 { 961 - if (__remove_mapping(mapping, page, false)) { 997 + if (__remove_mapping(mapping, page, false, NULL)) { 962 998 /* 963 999 * Unfreezing the refcount with 1 rather than 2 effectively 964 1000 * drops the pagecache ref for us without requiring another ··· 1203 1239 goto activate_locked; 1204 1240 1205 1241 /* Case 2 above */ 1206 - } else if (sane_reclaim(sc) || 1242 + } else if (writeback_throttling_sane(sc) || 1207 1243 !PageReclaim(page) || !may_enter_fs) { 1208 1244 /* 1209 1245 * This is slightly racy - end_page_writeback() ··· 1358 1394 * starts and then write it out here. 
1359 1395 */ 1360 1396 try_to_unmap_flush_dirty(); 1361 - switch (pageout(page, mapping, sc)) { 1397 + switch (pageout(page, mapping)) { 1362 1398 case PAGE_KEEP: 1363 1399 goto keep_locked; 1364 1400 case PAGE_ACTIVATE: ··· 1436 1472 1437 1473 count_vm_event(PGLAZYFREED); 1438 1474 count_memcg_page_event(page, PGLAZYFREED); 1439 - } else if (!mapping || !__remove_mapping(mapping, page, true)) 1475 + } else if (!mapping || !__remove_mapping(mapping, page, true, 1476 + sc->target_mem_cgroup)) 1440 1477 goto keep_locked; 1441 1478 1442 1479 unlock_page(page); ··· 1785 1820 1786 1821 /* 1787 1822 * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and 1788 - * then get resheduled. When there are massive number of tasks doing page 1823 + * then get rescheduled. When there are massive number of tasks doing page 1789 1824 * allocation, such sleeping direct reclaimers may keep piling up on each CPU, 1790 1825 * the LRU list will go small and be scanned faster than necessary, leading to 1791 1826 * unnecessary swapping, thrashing and OOM. ··· 1798 1833 if (current_is_kswapd()) 1799 1834 return 0; 1800 1835 1801 - if (!sane_reclaim(sc)) 1836 + if (!writeback_throttling_sane(sc)) 1802 1837 return 0; 1803 1838 1804 1839 if (file) { ··· 1948 1983 reclaim_stat->recent_scanned[file] += nr_taken; 1949 1984 1950 1985 item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT; 1951 - if (global_reclaim(sc)) 1986 + if (!cgroup_reclaim(sc)) 1952 1987 __count_vm_events(item, nr_scanned); 1953 1988 __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); 1954 1989 spin_unlock_irq(&pgdat->lru_lock); ··· 1962 1997 spin_lock_irq(&pgdat->lru_lock); 1963 1998 1964 1999 item = current_is_kswapd() ? 
PGSTEAL_KSWAPD : PGSTEAL_DIRECT; 1965 - if (global_reclaim(sc)) 2000 + if (!cgroup_reclaim(sc)) 1966 2001 __count_vm_events(item, nr_reclaimed); 1967 2002 __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); 1968 2003 reclaim_stat->recent_rotated[0] += stat.nr_activate[0]; ··· 2164 2199 return nr_reclaimed; 2165 2200 } 2166 2201 2202 + static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, 2203 + struct lruvec *lruvec, struct scan_control *sc) 2204 + { 2205 + if (is_active_lru(lru)) { 2206 + if (sc->may_deactivate & (1 << is_file_lru(lru))) 2207 + shrink_active_list(nr_to_scan, lruvec, sc, lru); 2208 + else 2209 + sc->skipped_deactivate = 1; 2210 + return 0; 2211 + } 2212 + 2213 + return shrink_inactive_list(nr_to_scan, lruvec, sc, lru); 2214 + } 2215 + 2167 2216 /* 2168 2217 * The inactive anon list should be small enough that the VM never has 2169 2218 * to do too much work. ··· 2206 2227 * 1TB 101 10GB 2207 2228 * 10TB 320 32GB 2208 2229 */ 2209 - static bool inactive_list_is_low(struct lruvec *lruvec, bool file, 2210 - struct scan_control *sc, bool trace) 2230 + static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru) 2211 2231 { 2212 - enum lru_list active_lru = file * LRU_FILE + LRU_ACTIVE; 2213 - struct pglist_data *pgdat = lruvec_pgdat(lruvec); 2214 - enum lru_list inactive_lru = file * LRU_FILE; 2232 + enum lru_list active_lru = inactive_lru + LRU_ACTIVE; 2215 2233 unsigned long inactive, active; 2216 2234 unsigned long inactive_ratio; 2217 - unsigned long refaults; 2218 2235 unsigned long gb; 2219 2236 2220 - /* 2221 - * If we don't have swap space, anonymous page deactivation 2222 - * is pointless. 
2223 - */ 2224 - if (!file && !total_swap_pages) 2225 - return false; 2237 + inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru); 2238 + active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru); 2226 2239 2227 - inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx); 2228 - active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx); 2229 - 2230 - /* 2231 - * When refaults are being observed, it means a new workingset 2232 - * is being established. Disable active list protection to get 2233 - * rid of the stale workingset quickly. 2234 - */ 2235 - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); 2236 - if (file && lruvec->refaults != refaults) { 2237 - inactive_ratio = 0; 2238 - } else { 2239 - gb = (inactive + active) >> (30 - PAGE_SHIFT); 2240 - if (gb) 2241 - inactive_ratio = int_sqrt(10 * gb); 2242 - else 2243 - inactive_ratio = 1; 2244 - } 2245 - 2246 - if (trace) 2247 - trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx, 2248 - lruvec_lru_size(lruvec, inactive_lru, MAX_NR_ZONES), inactive, 2249 - lruvec_lru_size(lruvec, active_lru, MAX_NR_ZONES), active, 2250 - inactive_ratio, file); 2240 + gb = (inactive + active) >> (30 - PAGE_SHIFT); 2241 + if (gb) 2242 + inactive_ratio = int_sqrt(10 * gb); 2243 + else 2244 + inactive_ratio = 1; 2251 2245 2252 2246 return inactive * inactive_ratio < active; 2253 - } 2254 - 2255 - static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, 2256 - struct lruvec *lruvec, struct scan_control *sc) 2257 - { 2258 - if (is_active_lru(lru)) { 2259 - if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true)) 2260 - shrink_active_list(nr_to_scan, lruvec, sc, lru); 2261 - return 0; 2262 - } 2263 - 2264 - return shrink_inactive_list(nr_to_scan, lruvec, sc, lru); 2265 2247 } 2266 2248 2267 2249 enum scan_balance { ··· 2241 2301 * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan 2242 2302 * nr[2] = file inactive pages to scan; nr[3] 
= file active pages to scan 2243 2303 */ 2244 - static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, 2245 - struct scan_control *sc, unsigned long *nr, 2246 - unsigned long *lru_pages) 2304 + static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, 2305 + unsigned long *nr) 2247 2306 { 2307 + struct mem_cgroup *memcg = lruvec_memcg(lruvec); 2248 2308 int swappiness = mem_cgroup_swappiness(memcg); 2249 2309 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; 2250 2310 u64 fraction[2]; ··· 2269 2329 * using the memory controller's swap limit feature would be 2270 2330 * too expensive. 2271 2331 */ 2272 - if (!global_reclaim(sc) && !swappiness) { 2332 + if (cgroup_reclaim(sc) && !swappiness) { 2273 2333 scan_balance = SCAN_FILE; 2274 2334 goto out; 2275 2335 } ··· 2285 2345 } 2286 2346 2287 2347 /* 2288 - * Prevent the reclaimer from falling into the cache trap: as 2289 - * cache pages start out inactive, every cache fault will tip 2290 - * the scan balance towards the file LRU. And as the file LRU 2291 - * shrinks, so does the window for rotation from references. 2292 - * This means we have a runaway feedback loop where a tiny 2293 - * thrashing file LRU becomes infinitely more attractive than 2294 - * anon pages. Try to detect this based on file LRU size. 2348 + * If the system is almost out of file pages, force-scan anon. 
2295 2349 */ 2296 - if (global_reclaim(sc)) { 2297 - unsigned long pgdatfile; 2298 - unsigned long pgdatfree; 2299 - int z; 2300 - unsigned long total_high_wmark = 0; 2301 - 2302 - pgdatfree = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); 2303 - pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) + 2304 - node_page_state(pgdat, NR_INACTIVE_FILE); 2305 - 2306 - for (z = 0; z < MAX_NR_ZONES; z++) { 2307 - struct zone *zone = &pgdat->node_zones[z]; 2308 - if (!managed_zone(zone)) 2309 - continue; 2310 - 2311 - total_high_wmark += high_wmark_pages(zone); 2312 - } 2313 - 2314 - if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) { 2315 - /* 2316 - * Force SCAN_ANON if there are enough inactive 2317 - * anonymous pages on the LRU in eligible zones. 2318 - * Otherwise, the small LRU gets thrashed. 2319 - */ 2320 - if (!inactive_list_is_low(lruvec, false, sc, false) && 2321 - lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, sc->reclaim_idx) 2322 - >> sc->priority) { 2323 - scan_balance = SCAN_ANON; 2324 - goto out; 2325 - } 2326 - } 2350 + if (sc->file_is_tiny) { 2351 + scan_balance = SCAN_ANON; 2352 + goto out; 2327 2353 } 2328 2354 2329 2355 /* 2330 - * If there is enough inactive page cache, i.e. if the size of the 2331 - * inactive list is greater than that of the active list *and* the 2332 - * inactive list actually has some pages to scan on this priority, we 2333 - * do not reclaim anything from the anonymous working set right now. 2334 - * Without the second condition we could end up never scanning an 2335 - * lruvec even if it has plenty of old anonymous pages unless the 2336 - * system is under heavy pressure. 2356 + * If there is enough inactive page cache, we do not reclaim 2357 + * anything from the anonymous working right now. 
2337 2358 */ 2338 - if (!inactive_list_is_low(lruvec, true, sc, false) && 2339 - lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) { 2359 + if (sc->cache_trim_mode) { 2340 2360 scan_balance = SCAN_FILE; 2341 2361 goto out; 2342 2362 } ··· 2354 2454 fraction[1] = fp; 2355 2455 denominator = ap + fp + 1; 2356 2456 out: 2357 - *lru_pages = 0; 2358 2457 for_each_evictable_lru(lru) { 2359 2458 int file = is_file_lru(lru); 2360 2459 unsigned long lruvec_size; ··· 2448 2549 BUG(); 2449 2550 } 2450 2551 2451 - *lru_pages += lruvec_size; 2452 2552 nr[lru] = scan; 2453 2553 } 2454 2554 } 2455 2555 2456 - /* 2457 - * This is a basic per-node page freer. Used by both kswapd and direct reclaim. 2458 - */ 2459 - static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg, 2460 - struct scan_control *sc, unsigned long *lru_pages) 2556 + static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) 2461 2557 { 2462 - struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg); 2463 2558 unsigned long nr[NR_LRU_LISTS]; 2464 2559 unsigned long targets[NR_LRU_LISTS]; 2465 2560 unsigned long nr_to_scan; ··· 2463 2570 struct blk_plug plug; 2464 2571 bool scan_adjusted; 2465 2572 2466 - get_scan_count(lruvec, memcg, sc, nr, lru_pages); 2573 + get_scan_count(lruvec, sc, nr); 2467 2574 2468 2575 /* Record the original scan target for proportional adjustments later */ 2469 2576 memcpy(targets, nr, sizeof(nr)); ··· 2479 2586 * abort proportional reclaim if either the file or anon lru has already 2480 2587 * dropped to zero at the first pass. 2481 2588 */ 2482 - scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() && 2589 + scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() && 2483 2590 sc->priority == DEF_PRIORITY); 2484 2591 2485 2592 blk_start_plug(&plug); ··· 2561 2668 * Even if we did not try to evict anon pages at all, we want to 2562 2669 * rebalance the anon lru active/inactive ratio. 
2563 2670 */ 2564 - if (inactive_list_is_low(lruvec, false, sc, true)) 2671 + if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) 2565 2672 shrink_active_list(SWAP_CLUSTER_MAX, lruvec, 2566 2673 sc, LRU_ACTIVE_ANON); 2567 2674 } ··· 2637 2744 return inactive_lru_pages > pages_for_compaction; 2638 2745 } 2639 2746 2640 - static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg) 2747 + static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) 2641 2748 { 2642 - return test_bit(PGDAT_CONGESTED, &pgdat->flags) || 2643 - (memcg && memcg_congested(pgdat, memcg)); 2749 + struct mem_cgroup *target_memcg = sc->target_mem_cgroup; 2750 + struct mem_cgroup *memcg; 2751 + 2752 + memcg = mem_cgroup_iter(target_memcg, NULL, NULL); 2753 + do { 2754 + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); 2755 + unsigned long reclaimed; 2756 + unsigned long scanned; 2757 + 2758 + switch (mem_cgroup_protected(target_memcg, memcg)) { 2759 + case MEMCG_PROT_MIN: 2760 + /* 2761 + * Hard protection. 2762 + * If there is no reclaimable memory, OOM. 2763 + */ 2764 + continue; 2765 + case MEMCG_PROT_LOW: 2766 + /* 2767 + * Soft protection. 2768 + * Respect the protection only as long as 2769 + * there is an unprotected supply 2770 + * of reclaimable memory from other cgroups. 2771 + */ 2772 + if (!sc->memcg_low_reclaim) { 2773 + sc->memcg_low_skipped = 1; 2774 + continue; 2775 + } 2776 + memcg_memory_event(memcg, MEMCG_LOW); 2777 + break; 2778 + case MEMCG_PROT_NONE: 2779 + /* 2780 + * All protection thresholds breached. We may 2781 + * still choose to vary the scan pressure 2782 + * applied based on by how much the cgroup in 2783 + * question has exceeded its protection 2784 + * thresholds (see get_scan_count). 
2785 + */ 2786 + break; 2787 + } 2788 + 2789 + reclaimed = sc->nr_reclaimed; 2790 + scanned = sc->nr_scanned; 2791 + 2792 + shrink_lruvec(lruvec, sc); 2793 + 2794 + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, 2795 + sc->priority); 2796 + 2797 + /* Record the group's reclaim efficiency */ 2798 + vmpressure(sc->gfp_mask, memcg, false, 2799 + sc->nr_scanned - scanned, 2800 + sc->nr_reclaimed - reclaimed); 2801 + 2802 + } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); 2644 2803 } 2645 2804 2646 2805 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) 2647 2806 { 2648 2807 struct reclaim_state *reclaim_state = current->reclaim_state; 2649 2808 unsigned long nr_reclaimed, nr_scanned; 2809 + struct lruvec *target_lruvec; 2650 2810 bool reclaimable = false; 2811 + unsigned long file; 2651 2812 2652 - do { 2653 - struct mem_cgroup *root = sc->target_mem_cgroup; 2654 - unsigned long node_lru_pages = 0; 2655 - struct mem_cgroup *memcg; 2813 + target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); 2656 2814 2657 - memset(&sc->nr, 0, sizeof(sc->nr)); 2815 + again: 2816 + memset(&sc->nr, 0, sizeof(sc->nr)); 2658 2817 2659 - nr_reclaimed = sc->nr_reclaimed; 2660 - nr_scanned = sc->nr_scanned; 2818 + nr_reclaimed = sc->nr_reclaimed; 2819 + nr_scanned = sc->nr_scanned; 2661 2820 2662 - memcg = mem_cgroup_iter(root, NULL, NULL); 2663 - do { 2664 - unsigned long lru_pages; 2665 - unsigned long reclaimed; 2666 - unsigned long scanned; 2821 + /* 2822 + * Target desirable inactive:active list ratios for the anon 2823 + * and file LRU lists. 2824 + */ 2825 + if (!sc->force_deactivate) { 2826 + unsigned long refaults; 2667 2827 2668 - switch (mem_cgroup_protected(root, memcg)) { 2669 - case MEMCG_PROT_MIN: 2670 - /* 2671 - * Hard protection. 2672 - * If there is no reclaimable memory, OOM. 
2673 - */ 2828 + if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) 2829 + sc->may_deactivate |= DEACTIVATE_ANON; 2830 + else 2831 + sc->may_deactivate &= ~DEACTIVATE_ANON; 2832 + 2833 + /* 2834 + * When refaults are being observed, it means a new 2835 + * workingset is being established. Deactivate to get 2836 + * rid of any stale active pages quickly. 2837 + */ 2838 + refaults = lruvec_page_state(target_lruvec, 2839 + WORKINGSET_ACTIVATE); 2840 + if (refaults != target_lruvec->refaults || 2841 + inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) 2842 + sc->may_deactivate |= DEACTIVATE_FILE; 2843 + else 2844 + sc->may_deactivate &= ~DEACTIVATE_FILE; 2845 + } else 2846 + sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE; 2847 + 2848 + /* 2849 + * If we have plenty of inactive file pages that aren't 2850 + * thrashing, try to reclaim those first before touching 2851 + * anonymous pages. 2852 + */ 2853 + file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE); 2854 + if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE)) 2855 + sc->cache_trim_mode = 1; 2856 + else 2857 + sc->cache_trim_mode = 0; 2858 + 2859 + /* 2860 + * Prevent the reclaimer from falling into the cache trap: as 2861 + * cache pages start out inactive, every cache fault will tip 2862 + * the scan balance towards the file LRU. And as the file LRU 2863 + * shrinks, so does the window for rotation from references. 2864 + * This means we have a runaway feedback loop where a tiny 2865 + * thrashing file LRU becomes infinitely more attractive than 2866 + * anon pages. Try to detect this based on file LRU size. 
2867 + */ 2868 + if (!cgroup_reclaim(sc)) { 2869 + unsigned long total_high_wmark = 0; 2870 + unsigned long free, anon; 2871 + int z; 2872 + 2873 + free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES); 2874 + file = node_page_state(pgdat, NR_ACTIVE_FILE) + 2875 + node_page_state(pgdat, NR_INACTIVE_FILE); 2876 + 2877 + for (z = 0; z < MAX_NR_ZONES; z++) { 2878 + struct zone *zone = &pgdat->node_zones[z]; 2879 + if (!managed_zone(zone)) 2674 2880 continue; 2675 - case MEMCG_PROT_LOW: 2676 - /* 2677 - * Soft protection. 2678 - * Respect the protection only as long as 2679 - * there is an unprotected supply 2680 - * of reclaimable memory from other cgroups. 2681 - */ 2682 - if (!sc->memcg_low_reclaim) { 2683 - sc->memcg_low_skipped = 1; 2684 - continue; 2685 - } 2686 - memcg_memory_event(memcg, MEMCG_LOW); 2687 - break; 2688 - case MEMCG_PROT_NONE: 2689 - /* 2690 - * All protection thresholds breached. We may 2691 - * still choose to vary the scan pressure 2692 - * applied based on by how much the cgroup in 2693 - * question has exceeded its protection 2694 - * thresholds (see get_scan_count). 
2695 - */ 2696 - break; 2697 - } 2698 2881 2699 - reclaimed = sc->nr_reclaimed; 2700 - scanned = sc->nr_scanned; 2701 - shrink_node_memcg(pgdat, memcg, sc, &lru_pages); 2702 - node_lru_pages += lru_pages; 2703 - 2704 - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, 2705 - sc->priority); 2706 - 2707 - /* Record the group's reclaim efficiency */ 2708 - vmpressure(sc->gfp_mask, memcg, false, 2709 - sc->nr_scanned - scanned, 2710 - sc->nr_reclaimed - reclaimed); 2711 - 2712 - } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); 2713 - 2714 - if (reclaim_state) { 2715 - sc->nr_reclaimed += reclaim_state->reclaimed_slab; 2716 - reclaim_state->reclaimed_slab = 0; 2717 - } 2718 - 2719 - /* Record the subtree's reclaim efficiency */ 2720 - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, 2721 - sc->nr_scanned - nr_scanned, 2722 - sc->nr_reclaimed - nr_reclaimed); 2723 - 2724 - if (sc->nr_reclaimed - nr_reclaimed) 2725 - reclaimable = true; 2726 - 2727 - if (current_is_kswapd()) { 2728 - /* 2729 - * If reclaim is isolating dirty pages under writeback, 2730 - * it implies that the long-lived page allocation rate 2731 - * is exceeding the page laundering rate. Either the 2732 - * global limits are not being effective at throttling 2733 - * processes due to the page distribution throughout 2734 - * zones or there is heavy usage of a slow backing 2735 - * device. The only option is to throttle from reclaim 2736 - * context which is not ideal as there is no guarantee 2737 - * the dirtying process is throttled in the same way 2738 - * balance_dirty_pages() manages. 2739 - * 2740 - * Once a node is flagged PGDAT_WRITEBACK, kswapd will 2741 - * count the number of pages under pages flagged for 2742 - * immediate reclaim and stall if any are encountered 2743 - * in the nr_immediate check below. 
2744 - */ 2745 - if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) 2746 - set_bit(PGDAT_WRITEBACK, &pgdat->flags); 2747 - 2748 - /* 2749 - * Tag a node as congested if all the dirty pages 2750 - * scanned were backed by a congested BDI and 2751 - * wait_iff_congested will stall. 2752 - */ 2753 - if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested) 2754 - set_bit(PGDAT_CONGESTED, &pgdat->flags); 2755 - 2756 - /* Allow kswapd to start writing pages during reclaim.*/ 2757 - if (sc->nr.unqueued_dirty == sc->nr.file_taken) 2758 - set_bit(PGDAT_DIRTY, &pgdat->flags); 2759 - 2760 - /* 2761 - * If kswapd scans pages marked marked for immediate 2762 - * reclaim and under writeback (nr_immediate), it 2763 - * implies that pages are cycling through the LRU 2764 - * faster than they are written so also forcibly stall. 2765 - */ 2766 - if (sc->nr.immediate) 2767 - congestion_wait(BLK_RW_ASYNC, HZ/10); 2882 + total_high_wmark += high_wmark_pages(zone); 2768 2883 } 2769 2884 2770 2885 /* 2771 - * Legacy memcg will stall in page writeback so avoid forcibly 2772 - * stalling in wait_iff_congested(). 2886 + * Consider anon: if that's low too, this isn't a 2887 + * runaway file reclaim problem, but rather just 2888 + * extreme pressure. Reclaim as per usual then. 
2773 2889 */ 2774 - if (!global_reclaim(sc) && sane_reclaim(sc) && 2775 - sc->nr.dirty && sc->nr.dirty == sc->nr.congested) 2776 - set_memcg_congestion(pgdat, root, true); 2890 + anon = node_page_state(pgdat, NR_INACTIVE_ANON); 2891 + 2892 + sc->file_is_tiny = 2893 + file + free <= total_high_wmark && 2894 + !(sc->may_deactivate & DEACTIVATE_ANON) && 2895 + anon >> sc->priority; 2896 + } 2897 + 2898 + shrink_node_memcgs(pgdat, sc); 2899 + 2900 + if (reclaim_state) { 2901 + sc->nr_reclaimed += reclaim_state->reclaimed_slab; 2902 + reclaim_state->reclaimed_slab = 0; 2903 + } 2904 + 2905 + /* Record the subtree's reclaim efficiency */ 2906 + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, 2907 + sc->nr_scanned - nr_scanned, 2908 + sc->nr_reclaimed - nr_reclaimed); 2909 + 2910 + if (sc->nr_reclaimed - nr_reclaimed) 2911 + reclaimable = true; 2912 + 2913 + if (current_is_kswapd()) { 2914 + /* 2915 + * If reclaim is isolating dirty pages under writeback, 2916 + * it implies that the long-lived page allocation rate 2917 + * is exceeding the page laundering rate. Either the 2918 + * global limits are not being effective at throttling 2919 + * processes due to the page distribution throughout 2920 + * zones or there is heavy usage of a slow backing 2921 + * device. The only option is to throttle from reclaim 2922 + * context which is not ideal as there is no guarantee 2923 + * the dirtying process is throttled in the same way 2924 + * balance_dirty_pages() manages. 2925 + * 2926 + * Once a node is flagged PGDAT_WRITEBACK, kswapd will 2927 + * count the number of pages under pages flagged for 2928 + * immediate reclaim and stall if any are encountered 2929 + * in the nr_immediate check below. 
2930 + */ 2931 + if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken) 2932 + set_bit(PGDAT_WRITEBACK, &pgdat->flags); 2933 + 2934 + /* Allow kswapd to start writing pages during reclaim.*/ 2935 + if (sc->nr.unqueued_dirty == sc->nr.file_taken) 2936 + set_bit(PGDAT_DIRTY, &pgdat->flags); 2777 2937 2778 2938 /* 2779 - * Stall direct reclaim for IO completions if underlying BDIs 2780 - * and node is congested. Allow kswapd to continue until it 2781 - * starts encountering unqueued dirty pages or cycling through 2782 - * the LRU too quickly. 2939 + * If kswapd scans pages marked marked for immediate 2940 + * reclaim and under writeback (nr_immediate), it 2941 + * implies that pages are cycling through the LRU 2942 + * faster than they are written so also forcibly stall. 2783 2943 */ 2784 - if (!sc->hibernation_mode && !current_is_kswapd() && 2785 - current_may_throttle() && pgdat_memcg_congested(pgdat, root)) 2786 - wait_iff_congested(BLK_RW_ASYNC, HZ/10); 2944 + if (sc->nr.immediate) 2945 + congestion_wait(BLK_RW_ASYNC, HZ/10); 2946 + } 2787 2947 2788 - } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, 2789 - sc)); 2948 + /* 2949 + * Tag a node/memcg as congested if all the dirty pages 2950 + * scanned were backed by a congested BDI and 2951 + * wait_iff_congested will stall. 2952 + * 2953 + * Legacy memcg will stall in page writeback so avoid forcibly 2954 + * stalling in wait_iff_congested(). 2955 + */ 2956 + if ((current_is_kswapd() || 2957 + (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) && 2958 + sc->nr.dirty && sc->nr.dirty == sc->nr.congested) 2959 + set_bit(LRUVEC_CONGESTED, &target_lruvec->flags); 2960 + 2961 + /* 2962 + * Stall direct reclaim for IO completions if underlying BDIs 2963 + * and node is congested. Allow kswapd to continue until it 2964 + * starts encountering unqueued dirty pages or cycling through 2965 + * the LRU too quickly. 
2966 + */ 2967 + if (!current_is_kswapd() && current_may_throttle() && 2968 + !sc->hibernation_mode && 2969 + test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) 2970 + wait_iff_congested(BLK_RW_ASYNC, HZ/10); 2971 + 2972 + if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, 2973 + sc)) 2974 + goto again; 2790 2975 2791 2976 /* 2792 2977 * Kswapd gives up on balancing particular nodes after too ··· 2944 2973 * Take care memory controller reclaiming has small influence 2945 2974 * to global LRU. 2946 2975 */ 2947 - if (global_reclaim(sc)) { 2976 + if (!cgroup_reclaim(sc)) { 2948 2977 if (!cpuset_zone_allowed(zone, 2949 2978 GFP_KERNEL | __GFP_HARDWALL)) 2950 2979 continue; ··· 3003 3032 sc->gfp_mask = orig_mask; 3004 3033 } 3005 3034 3006 - static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat) 3035 + static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) 3007 3036 { 3008 - struct mem_cgroup *memcg; 3037 + struct lruvec *target_lruvec; 3038 + unsigned long refaults; 3009 3039 3010 - memcg = mem_cgroup_iter(root_memcg, NULL, NULL); 3011 - do { 3012 - unsigned long refaults; 3013 - struct lruvec *lruvec; 3014 - 3015 - lruvec = mem_cgroup_lruvec(pgdat, memcg); 3016 - refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE); 3017 - lruvec->refaults = refaults; 3018 - } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL))); 3040 + target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); 3041 + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); 3042 + target_lruvec->refaults = refaults; 3019 3043 } 3020 3044 3021 3045 /* ··· 3039 3073 retry: 3040 3074 delayacct_freepages_start(); 3041 3075 3042 - if (global_reclaim(sc)) 3076 + if (!cgroup_reclaim(sc)) 3043 3077 __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1); 3044 3078 3045 3079 do { ··· 3068 3102 if (zone->zone_pgdat == last_pgdat) 3069 3103 continue; 3070 3104 last_pgdat = zone->zone_pgdat; 3105 + 3071 3106 
snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat); 3072 - set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false); 3107 + 3108 + if (cgroup_reclaim(sc)) { 3109 + struct lruvec *lruvec; 3110 + 3111 + lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, 3112 + zone->zone_pgdat); 3113 + clear_bit(LRUVEC_CONGESTED, &lruvec->flags); 3114 + } 3073 3115 } 3074 3116 3075 3117 delayacct_freepages_end(); ··· 3089 3115 if (sc->compaction_ready) 3090 3116 return 1; 3091 3117 3118 + /* 3119 + * We make inactive:active ratio decisions based on the node's 3120 + * composition of memory, but a restrictive reclaim_idx or a 3121 + * memory.low cgroup setting can exempt large amounts of 3122 + * memory from reclaim. Neither of which are very common, so 3123 + * instead of doing costly eligibility calculations of the 3124 + * entire cgroup subtree up front, we assume the estimates are 3125 + * good, and retry with forcible deactivation if that fails. 3126 + */ 3127 + if (sc->skipped_deactivate) { 3128 + sc->priority = initial_priority; 3129 + sc->force_deactivate = 1; 3130 + sc->skipped_deactivate = 0; 3131 + goto retry; 3132 + } 3133 + 3092 3134 /* Untapped cgroup reserves? Don't OOM, retry. */ 3093 3135 if (sc->memcg_low_skipped) { 3094 3136 sc->priority = initial_priority; 3137 + sc->force_deactivate = 0; 3138 + sc->skipped_deactivate = 0; 3095 3139 sc->memcg_low_reclaim = 1; 3096 3140 sc->memcg_low_skipped = 0; 3097 3141 goto retry; ··· 3301 3309 pg_data_t *pgdat, 3302 3310 unsigned long *nr_scanned) 3303 3311 { 3312 + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); 3304 3313 struct scan_control sc = { 3305 3314 .nr_to_reclaim = SWAP_CLUSTER_MAX, 3306 3315 .target_mem_cgroup = memcg, ··· 3310 3317 .reclaim_idx = MAX_NR_ZONES - 1, 3311 3318 .may_swap = !noswap, 3312 3319 }; 3313 - unsigned long lru_pages; 3314 3320 3315 3321 WARN_ON_ONCE(!current->reclaim_state); 3316 3322 ··· 3326 3334 * will pick up pages from other mem cgroup's as well. 
We hack 3327 3335 * the priority and make it zero. 3328 3336 */ 3329 - shrink_node_memcg(pgdat, memcg, &sc, &lru_pages); 3337 + shrink_lruvec(lruvec, &sc); 3330 3338 3331 3339 trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); 3332 3340 ··· 3340 3348 gfp_t gfp_mask, 3341 3349 bool may_swap) 3342 3350 { 3343 - struct zonelist *zonelist; 3344 3351 unsigned long nr_reclaimed; 3345 3352 unsigned long pflags; 3346 - int nid; 3347 3353 unsigned int noreclaim_flag; 3348 3354 struct scan_control sc = { 3349 3355 .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), ··· 3354 3364 .may_unmap = 1, 3355 3365 .may_swap = may_swap, 3356 3366 }; 3367 + /* 3368 + * Traverse the ZONELIST_FALLBACK zonelist of the current node to put 3369 + * equal pressure on all the nodes. This is based on the assumption that 3370 + * the reclaim does not bail out early. 3371 + */ 3372 + struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); 3357 3373 3358 3374 set_task_reclaim_state(current, &sc.reclaim_state); 3359 - /* 3360 - * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't 3361 - * take care of from where we get pages. So the node where we start the 3362 - * scan does not need to be the current node. 
3362 - * scan does not need to be the current node.
3363 - */ 3364 - nid = mem_cgroup_select_victim_node(memcg); 3365 - 3366 - zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK]; 3367 3375 3368 3376 trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask); 3369 3377 ··· 3384 3396 struct scan_control *sc) 3385 3397 { 3386 3398 struct mem_cgroup *memcg; 3399 + struct lruvec *lruvec; 3387 3400 3388 3401 if (!total_swap_pages) 3389 3402 return; 3390 3403 3404 + lruvec = mem_cgroup_lruvec(NULL, pgdat); 3405 + if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON)) 3406 + return; 3407 + 3391 3408 memcg = mem_cgroup_iter(NULL, NULL, NULL); 3392 3409 do { 3393 - struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg); 3394 - 3395 - if (inactive_list_is_low(lruvec, false, sc, true)) 3396 - shrink_active_list(SWAP_CLUSTER_MAX, lruvec, 3397 - sc, LRU_ACTIVE_ANON); 3398 - 3410 + lruvec = mem_cgroup_lruvec(memcg, pgdat); 3411 + shrink_active_list(SWAP_CLUSTER_MAX, lruvec, 3412 + sc, LRU_ACTIVE_ANON); 3399 3413 memcg = mem_cgroup_iter(NULL, memcg, NULL); 3400 3414 } while (memcg); 3401 3415 } ··· 3465 3475 /* Clear pgdat state for congested, dirty or under writeback. */ 3466 3476 static void clear_pgdat_congested(pg_data_t *pgdat) 3467 3477 { 3468 - clear_bit(PGDAT_CONGESTED, &pgdat->flags); 3478 + struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat); 3479 + 3480 + clear_bit(LRUVEC_CONGESTED, &lruvec->flags); 3469 3481 clear_bit(PGDAT_DIRTY, &pgdat->flags); 3470 3482 clear_bit(PGDAT_WRITEBACK, &pgdat->flags); 3471 3483 }
+53 -16
mm/workingset.c
··· 213 213 *workingsetp = workingset; 214 214 } 215 215 216 + static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat) 217 + { 218 + /* 219 + * Reclaiming a cgroup means reclaiming all its children in a 220 + * round-robin fashion. That means that each cgroup has an LRU 221 + * order that is composed of the LRU orders of its child 222 + * cgroups; and every page has an LRU position not just in the 223 + * cgroup that owns it, but in all of that group's ancestors. 224 + * 225 + * So when the physical inactive list of a leaf cgroup ages, 226 + * the virtual inactive lists of all its parents, including 227 + * the root cgroup's, age as well. 228 + */ 229 + do { 230 + struct lruvec *lruvec; 231 + 232 + lruvec = mem_cgroup_lruvec(memcg, pgdat); 233 + atomic_long_inc(&lruvec->inactive_age); 234 + } while (memcg && (memcg = parent_mem_cgroup(memcg))); 235 + } 236 + 216 237 /** 217 238 * workingset_eviction - note the eviction of a page from memory 239 + * @target_memcg: the cgroup that is causing the reclaim 218 240 * @page: the page being evicted 219 241 * 220 242 * Returns a shadow entry to be stored in @page->mapping->i_pages in place 221 243 * of the evicted @page so that a later refault can be detected. 
222 244 */ 223 - void *workingset_eviction(struct page *page) 245 + void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) 224 246 { 225 247 struct pglist_data *pgdat = page_pgdat(page); 226 - struct mem_cgroup *memcg = page_memcg(page); 227 - int memcgid = mem_cgroup_id(memcg); 228 248 unsigned long eviction; 229 249 struct lruvec *lruvec; 250 + int memcgid; 230 251 231 252 /* Page is fully exclusive and pins page->mem_cgroup */ 232 253 VM_BUG_ON_PAGE(PageLRU(page), page); 233 254 VM_BUG_ON_PAGE(page_count(page), page); 234 255 VM_BUG_ON_PAGE(!PageLocked(page), page); 235 256 236 - lruvec = mem_cgroup_lruvec(pgdat, memcg); 237 - eviction = atomic_long_inc_return(&lruvec->inactive_age); 257 + advance_inactive_age(page_memcg(page), pgdat); 258 + 259 + lruvec = mem_cgroup_lruvec(target_memcg, pgdat); 260 + /* XXX: target_memcg can be NULL, go through lruvec */ 261 + memcgid = mem_cgroup_id(lruvec_memcg(lruvec)); 262 + eviction = atomic_long_read(&lruvec->inactive_age); 238 263 return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page)); 239 264 } 240 265 ··· 269 244 * @shadow: shadow entry of the evicted page 270 245 * 271 246 * Calculates and evaluates the refault distance of the previously 272 - * evicted page in the context of the node it was allocated in. 247 + * evicted page in the context of the node and the memcg whose memory 248 + * pressure caused the eviction. 273 249 */ 274 250 void workingset_refault(struct page *page, void *shadow) 275 251 { 252 + struct mem_cgroup *eviction_memcg; 253 + struct lruvec *eviction_lruvec; 276 254 unsigned long refault_distance; 277 255 struct pglist_data *pgdat; 278 256 unsigned long active_file; ··· 305 277 * would be better if the root_mem_cgroup existed in all 306 278 * configurations instead. 
307 279 */ 308 - memcg = mem_cgroup_from_id(memcgid); 309 - if (!mem_cgroup_disabled() && !memcg) 280 + eviction_memcg = mem_cgroup_from_id(memcgid); 281 + if (!mem_cgroup_disabled() && !eviction_memcg) 310 282 goto out; 311 - lruvec = mem_cgroup_lruvec(pgdat, memcg); 312 - refault = atomic_long_read(&lruvec->inactive_age); 313 - active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES); 283 + eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat); 284 + refault = atomic_long_read(&eviction_lruvec->inactive_age); 285 + active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); 314 286 315 287 /* 316 288 * Calculate the refault distance ··· 330 302 */ 331 303 refault_distance = (refault - eviction) & EVICTION_MASK; 332 304 305 + /* 306 + * The activation decision for this page is made at the level 307 + * where the eviction occurred, as that is where the LRU order 308 + * during page reclaim is being determined. 309 + * 310 + * However, the cgroup that will own the page is the one that 311 + * is actually experiencing the refault event. 
312 + */ 313 + memcg = page_memcg(page); 314 + lruvec = mem_cgroup_lruvec(memcg, pgdat); 315 + 333 316 inc_lruvec_state(lruvec, WORKINGSET_REFAULT); 334 317 335 318 /* ··· 352 313 goto out; 353 314 354 315 SetPageActive(page); 355 - atomic_long_inc(&lruvec->inactive_age); 316 + advance_inactive_age(memcg, pgdat); 356 317 inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); 357 318 358 319 /* Page was active prior to eviction */ ··· 371 332 void workingset_activation(struct page *page) 372 333 { 373 334 struct mem_cgroup *memcg; 374 - struct lruvec *lruvec; 375 335 376 336 rcu_read_lock(); 377 337 /* ··· 383 345 memcg = page_memcg_rcu(page); 384 346 if (!mem_cgroup_disabled() && !memcg) 385 347 goto out; 386 - lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg); 387 - atomic_long_inc(&lruvec->inactive_age); 348 + advance_inactive_age(memcg, page_pgdat(page)); 388 349 out: 389 350 rcu_read_unlock(); 390 351 } ··· 463 426 struct lruvec *lruvec; 464 427 int i; 465 428 466 - lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg); 429 + lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid)); 467 430 for (pages = 0, i = 0; i < NR_LRU_LISTS; i++) 468 431 pages += lruvec_page_state_local(lruvec, 469 432 NR_LRU_BASE + i);
+304 -73
mm/z3fold.c
··· 41 41 #include <linux/workqueue.h> 42 42 #include <linux/slab.h> 43 43 #include <linux/spinlock.h> 44 + #include <linux/rwlock.h> 44 45 #include <linux/zpool.h> 45 46 #include <linux/magic.h> 46 47 ··· 91 90 */ 92 91 unsigned long slot[BUDDY_MASK + 1]; 93 92 unsigned long pool; /* back link + flags */ 93 + rwlock_t lock; 94 94 }; 95 95 #define HANDLE_FLAG_MASK (0x03) 96 96 ··· 126 124 unsigned short start_middle; 127 125 unsigned short first_num:2; 128 126 unsigned short mapped_count:2; 127 + unsigned short foreign_handles:2; 129 128 }; 130 129 131 130 /** ··· 181 178 PAGE_CLAIMED, /* by either reclaim or free */ 182 179 }; 183 180 181 + /* 182 + * handle flags, go under HANDLE_FLAG_MASK 183 + */ 184 + enum z3fold_handle_flags { 185 + HANDLES_ORPHANED = 0, 186 + }; 187 + 188 + /* 189 + * Forward declarations 190 + */ 191 + static struct z3fold_header *__z3fold_alloc(struct z3fold_pool *, size_t, bool); 192 + static void compact_page_work(struct work_struct *w); 193 + 184 194 /***************** 185 195 * Helpers 186 196 *****************/ ··· 207 191 #define for_each_unbuddied_list(_iter, _begin) \ 208 192 for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) 209 193 210 - static void compact_page_work(struct work_struct *w); 211 - 212 194 static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool, 213 195 gfp_t gfp) 214 196 { ··· 218 204 if (slots) { 219 205 memset(slots->slot, 0, sizeof(slots->slot)); 220 206 slots->pool = (unsigned long)pool; 207 + rwlock_init(&slots->lock); 221 208 } 222 209 223 210 return slots; ··· 234 219 return (struct z3fold_buddy_slots *)(handle & ~(SLOTS_ALIGN - 1)); 235 220 } 236 221 222 + /* Lock a z3fold page */ 223 + static inline void z3fold_page_lock(struct z3fold_header *zhdr) 224 + { 225 + spin_lock(&zhdr->page_lock); 226 + } 227 + 228 + /* Try to lock a z3fold page */ 229 + static inline int z3fold_page_trylock(struct z3fold_header *zhdr) 230 + { 231 + return spin_trylock(&zhdr->page_lock); 232 + } 233 + 
234 + /* Unlock a z3fold page */ 235 + static inline void z3fold_page_unlock(struct z3fold_header *zhdr) 236 + { 237 + spin_unlock(&zhdr->page_lock); 238 + } 239 + 240 + 241 + static inline struct z3fold_header *__get_z3fold_header(unsigned long handle, 242 + bool lock) 243 + { 244 + struct z3fold_buddy_slots *slots; 245 + struct z3fold_header *zhdr; 246 + int locked = 0; 247 + 248 + if (!(handle & (1 << PAGE_HEADLESS))) { 249 + slots = handle_to_slots(handle); 250 + do { 251 + unsigned long addr; 252 + 253 + read_lock(&slots->lock); 254 + addr = *(unsigned long *)handle; 255 + zhdr = (struct z3fold_header *)(addr & PAGE_MASK); 256 + if (lock) 257 + locked = z3fold_page_trylock(zhdr); 258 + read_unlock(&slots->lock); 259 + if (locked) 260 + break; 261 + cpu_relax(); 262 + } while (lock); 263 + } else { 264 + zhdr = (struct z3fold_header *)(handle & PAGE_MASK); 265 + } 266 + 267 + return zhdr; 268 + } 269 + 270 + /* Returns the z3fold page where a given handle is stored */ 271 + static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h) 272 + { 273 + return __get_z3fold_header(h, false); 274 + } 275 + 276 + /* return locked z3fold page if it's not headless */ 277 + static inline struct z3fold_header *get_z3fold_header(unsigned long h) 278 + { 279 + return __get_z3fold_header(h, true); 280 + } 281 + 282 + static inline void put_z3fold_header(struct z3fold_header *zhdr) 283 + { 284 + struct page *page = virt_to_page(zhdr); 285 + 286 + if (!test_bit(PAGE_HEADLESS, &page->private)) 287 + z3fold_page_unlock(zhdr); 288 + } 289 + 237 290 static inline void free_handle(unsigned long handle) 238 291 { 239 292 struct z3fold_buddy_slots *slots; 293 + struct z3fold_header *zhdr; 240 294 int i; 241 295 bool is_free; 242 296 243 297 if (handle & (1 << PAGE_HEADLESS)) 244 298 return; 245 299 246 - WARN_ON(*(unsigned long *)handle == 0); 247 - *(unsigned long *)handle = 0; 300 + if (WARN_ON(*(unsigned long *)handle == 0)) 301 + return; 302 + 303 + zhdr = 
handle_to_z3fold_header(handle); 248 304 slots = handle_to_slots(handle); 305 + write_lock(&slots->lock); 306 + *(unsigned long *)handle = 0; 307 + write_unlock(&slots->lock); 308 + if (zhdr->slots == slots) 309 + return; /* simple case, nothing else to do */ 310 + 311 + /* we are freeing a foreign handle if we are here */ 312 + zhdr->foreign_handles--; 249 313 is_free = true; 314 + read_lock(&slots->lock); 315 + if (!test_bit(HANDLES_ORPHANED, &slots->pool)) { 316 + read_unlock(&slots->lock); 317 + return; 318 + } 250 319 for (i = 0; i <= BUDDY_MASK; i++) { 251 320 if (slots->slot[i]) { 252 321 is_free = false; 253 322 break; 254 323 } 255 324 } 325 + read_unlock(&slots->lock); 256 326 257 327 if (is_free) { 258 328 struct z3fold_pool *pool = slots_to_pool(slots); ··· 422 322 zhdr->first_num = 0; 423 323 zhdr->start_middle = 0; 424 324 zhdr->cpu = -1; 325 + zhdr->foreign_handles = 0; 425 326 zhdr->slots = slots; 426 327 zhdr->pool = pool; 427 328 INIT_LIST_HEAD(&zhdr->buddy); ··· 440 339 } 441 340 ClearPagePrivate(page); 442 341 __free_page(page); 443 - } 444 - 445 - /* Lock a z3fold page */ 446 - static inline void z3fold_page_lock(struct z3fold_header *zhdr) 447 - { 448 - spin_lock(&zhdr->page_lock); 449 - } 450 - 451 - /* Try to lock a z3fold page */ 452 - static inline int z3fold_page_trylock(struct z3fold_header *zhdr) 453 - { 454 - return spin_trylock(&zhdr->page_lock); 455 - } 456 - 457 - /* Unlock a z3fold page */ 458 - static inline void z3fold_page_unlock(struct z3fold_header *zhdr) 459 - { 460 - spin_unlock(&zhdr->page_lock); 461 342 } 462 343 463 344 /* Helper function to build the index */ ··· 472 389 if (bud == LAST) 473 390 h |= (zhdr->last_chunks << BUDDY_SHIFT); 474 391 392 + write_lock(&slots->lock); 475 393 slots->slot[idx] = h; 394 + write_unlock(&slots->lock); 476 395 return (unsigned long)&slots->slot[idx]; 477 396 } 478 397 ··· 483 398 return __encode_handle(zhdr, zhdr->slots, bud); 484 399 } 485 400 486 - /* Returns the z3fold page where a 
given handle is stored */ 487 - static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h) 488 - { 489 - unsigned long addr = h; 490 - 491 - if (!(addr & (1 << PAGE_HEADLESS))) 492 - addr = *(unsigned long *)h; 493 - 494 - return (struct z3fold_header *)(addr & PAGE_MASK); 495 - } 496 - 497 401 /* only for LAST bud, returns zero otherwise */ 498 402 static unsigned short handle_to_chunks(unsigned long handle) 499 403 { 500 - unsigned long addr = *(unsigned long *)handle; 404 + struct z3fold_buddy_slots *slots = handle_to_slots(handle); 405 + unsigned long addr; 501 406 407 + read_lock(&slots->lock); 408 + addr = *(unsigned long *)handle; 409 + read_unlock(&slots->lock); 502 410 return (addr & ~PAGE_MASK) >> BUDDY_SHIFT; 503 411 } 504 412 ··· 503 425 static enum buddy handle_to_buddy(unsigned long handle) 504 426 { 505 427 struct z3fold_header *zhdr; 428 + struct z3fold_buddy_slots *slots = handle_to_slots(handle); 506 429 unsigned long addr; 507 430 431 + read_lock(&slots->lock); 508 432 WARN_ON(handle & (1 << PAGE_HEADLESS)); 509 433 addr = *(unsigned long *)handle; 434 + read_unlock(&slots->lock); 510 435 zhdr = (struct z3fold_header *)(addr & PAGE_MASK); 511 436 return (addr - zhdr->first_num) & BUDDY_MASK; 512 437 } ··· 523 442 { 524 443 struct page *page = virt_to_page(zhdr); 525 444 struct z3fold_pool *pool = zhdr_to_pool(zhdr); 445 + bool is_free = true; 446 + int i; 526 447 527 448 WARN_ON(!list_empty(&zhdr->buddy)); 528 449 set_bit(PAGE_STALE, &page->private); ··· 533 450 if (!list_empty(&page->lru)) 534 451 list_del_init(&page->lru); 535 452 spin_unlock(&pool->lock); 453 + 454 + /* If there are no foreign handles, free the handles array */ 455 + read_lock(&zhdr->slots->lock); 456 + for (i = 0; i <= BUDDY_MASK; i++) { 457 + if (zhdr->slots->slot[i]) { 458 + is_free = false; 459 + break; 460 + } 461 + } 462 + if (!is_free) 463 + set_bit(HANDLES_ORPHANED, &zhdr->slots->pool); 464 + read_unlock(&zhdr->slots->lock); 465 + 466 + if (is_free) 
467 + kmem_cache_free(pool->c_handle, zhdr->slots); 468 + 536 469 if (locked) 537 470 z3fold_page_unlock(zhdr); 471 + 538 472 spin_lock(&pool->stale_lock); 539 473 list_add(&zhdr->buddy, &pool->stale); 540 474 queue_work(pool->release_wq, &pool->work); ··· 579 479 struct z3fold_header *zhdr = container_of(ref, struct z3fold_header, 580 480 refcount); 581 481 struct z3fold_pool *pool = zhdr_to_pool(zhdr); 482 + 582 483 spin_lock(&pool->lock); 583 484 list_del_init(&zhdr->buddy); 584 485 spin_unlock(&pool->lock); ··· 658 557 return memmove(beg + (dst_chunk << CHUNK_SHIFT), 659 558 beg + (zhdr->start_middle << CHUNK_SHIFT), 660 559 zhdr->middle_chunks << CHUNK_SHIFT); 560 + } 561 + 562 + static inline bool buddy_single(struct z3fold_header *zhdr) 563 + { 564 + return !((zhdr->first_chunks && zhdr->middle_chunks) || 565 + (zhdr->first_chunks && zhdr->last_chunks) || 566 + (zhdr->middle_chunks && zhdr->last_chunks)); 567 + } 568 + 569 + static struct z3fold_header *compact_single_buddy(struct z3fold_header *zhdr) 570 + { 571 + struct z3fold_pool *pool = zhdr_to_pool(zhdr); 572 + void *p = zhdr; 573 + unsigned long old_handle = 0; 574 + size_t sz = 0; 575 + struct z3fold_header *new_zhdr = NULL; 576 + int first_idx = __idx(zhdr, FIRST); 577 + int middle_idx = __idx(zhdr, MIDDLE); 578 + int last_idx = __idx(zhdr, LAST); 579 + unsigned short *moved_chunks = NULL; 580 + 581 + /* 582 + * No need to protect slots here -- all the slots are "local" and 583 + * the page lock is already taken 584 + */ 585 + if (zhdr->first_chunks && zhdr->slots->slot[first_idx]) { 586 + p += ZHDR_SIZE_ALIGNED; 587 + sz = zhdr->first_chunks << CHUNK_SHIFT; 588 + old_handle = (unsigned long)&zhdr->slots->slot[first_idx]; 589 + moved_chunks = &zhdr->first_chunks; 590 + } else if (zhdr->middle_chunks && zhdr->slots->slot[middle_idx]) { 591 + p += zhdr->start_middle << CHUNK_SHIFT; 592 + sz = zhdr->middle_chunks << CHUNK_SHIFT; 593 + old_handle = (unsigned long)&zhdr->slots->slot[middle_idx]; 594 + 
moved_chunks = &zhdr->middle_chunks; 595 + } else if (zhdr->last_chunks && zhdr->slots->slot[last_idx]) { 596 + p += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT); 597 + sz = zhdr->last_chunks << CHUNK_SHIFT; 598 + old_handle = (unsigned long)&zhdr->slots->slot[last_idx]; 599 + moved_chunks = &zhdr->last_chunks; 600 + } 601 + 602 + if (sz > 0) { 603 + enum buddy new_bud = HEADLESS; 604 + short chunks = size_to_chunks(sz); 605 + void *q; 606 + 607 + new_zhdr = __z3fold_alloc(pool, sz, false); 608 + if (!new_zhdr) 609 + return NULL; 610 + 611 + if (WARN_ON(new_zhdr == zhdr)) 612 + goto out_fail; 613 + 614 + if (new_zhdr->first_chunks == 0) { 615 + if (new_zhdr->middle_chunks != 0 && 616 + chunks >= new_zhdr->start_middle) { 617 + new_bud = LAST; 618 + } else { 619 + new_bud = FIRST; 620 + } 621 + } else if (new_zhdr->last_chunks == 0) { 622 + new_bud = LAST; 623 + } else if (new_zhdr->middle_chunks == 0) { 624 + new_bud = MIDDLE; 625 + } 626 + q = new_zhdr; 627 + switch (new_bud) { 628 + case FIRST: 629 + new_zhdr->first_chunks = chunks; 630 + q += ZHDR_SIZE_ALIGNED; 631 + break; 632 + case MIDDLE: 633 + new_zhdr->middle_chunks = chunks; 634 + new_zhdr->start_middle = 635 + new_zhdr->first_chunks + ZHDR_CHUNKS; 636 + q += new_zhdr->start_middle << CHUNK_SHIFT; 637 + break; 638 + case LAST: 639 + new_zhdr->last_chunks = chunks; 640 + q += PAGE_SIZE - (new_zhdr->last_chunks << CHUNK_SHIFT); 641 + break; 642 + default: 643 + goto out_fail; 644 + } 645 + new_zhdr->foreign_handles++; 646 + memcpy(q, p, sz); 647 + write_lock(&zhdr->slots->lock); 648 + *(unsigned long *)old_handle = (unsigned long)new_zhdr + 649 + __idx(new_zhdr, new_bud); 650 + if (new_bud == LAST) 651 + *(unsigned long *)old_handle |= 652 + (new_zhdr->last_chunks << BUDDY_SHIFT); 653 + write_unlock(&zhdr->slots->lock); 654 + add_to_unbuddied(pool, new_zhdr); 655 + z3fold_page_unlock(new_zhdr); 656 + 657 + *moved_chunks = 0; 658 + } 659 + 660 + return new_zhdr; 661 + 662 + out_fail: 663 + if (new_zhdr) { 
664 + if (kref_put(&new_zhdr->refcount, release_z3fold_page_locked)) 665 + atomic64_dec(&pool->pages_nr); 666 + else { 667 + add_to_unbuddied(pool, new_zhdr); 668 + z3fold_page_unlock(new_zhdr); 669 + } 670 + } 671 + return NULL; 672 + 661 673 } 662 674 663 675 #define BIG_CHUNK_GAP 3 ··· 852 638 return; 853 639 } 854 640 641 + if (!zhdr->foreign_handles && buddy_single(zhdr) && 642 + zhdr->mapped_count == 0 && compact_single_buddy(zhdr)) { 643 + if (kref_put(&zhdr->refcount, release_z3fold_page_locked)) 644 + atomic64_dec(&pool->pages_nr); 645 + else 646 + z3fold_page_unlock(zhdr); 647 + return; 648 + } 649 + 855 650 z3fold_compact_page(zhdr); 856 651 add_to_unbuddied(pool, zhdr); 857 652 z3fold_page_unlock(zhdr); ··· 913 690 spin_unlock(&pool->lock); 914 691 915 692 page = virt_to_page(zhdr); 916 - if (test_bit(NEEDS_COMPACTING, &page->private)) { 693 + if (test_bit(NEEDS_COMPACTING, &page->private) || 694 + test_bit(PAGE_CLAIMED, &page->private)) { 917 695 z3fold_page_unlock(zhdr); 918 696 zhdr = NULL; 919 697 put_cpu_ptr(pool->unbuddied); ··· 958 734 spin_unlock(&pool->lock); 959 735 960 736 page = virt_to_page(zhdr); 961 - if (test_bit(NEEDS_COMPACTING, &page->private)) { 737 + if (test_bit(NEEDS_COMPACTING, &page->private) || 738 + test_bit(PAGE_CLAIMED, &page->private)) { 962 739 z3fold_page_unlock(zhdr); 963 740 zhdr = NULL; 964 741 if (can_sleep) ··· 1225 1000 enum buddy bud; 1226 1001 bool page_claimed; 1227 1002 1228 - zhdr = handle_to_z3fold_header(handle); 1003 + zhdr = get_z3fold_header(handle); 1229 1004 page = virt_to_page(zhdr); 1230 1005 page_claimed = test_and_set_bit(PAGE_CLAIMED, &page->private); 1231 1006 ··· 1239 1014 spin_lock(&pool->lock); 1240 1015 list_del(&page->lru); 1241 1016 spin_unlock(&pool->lock); 1017 + put_z3fold_header(zhdr); 1242 1018 free_z3fold_page(page, true); 1243 1019 atomic64_dec(&pool->pages_nr); 1244 1020 } ··· 1247 1021 } 1248 1022 1249 1023 /* Non-headless case */ 1250 - z3fold_page_lock(zhdr); 1251 1024 bud = 
handle_to_buddy(handle); 1252 1025 1253 1026 switch (bud) { ··· 1262 1037 default: 1263 1038 pr_err("%s: unknown bud %d\n", __func__, bud); 1264 1039 WARN_ON(1); 1265 - z3fold_page_unlock(zhdr); 1040 + put_z3fold_header(zhdr); 1041 + clear_bit(PAGE_CLAIMED, &page->private); 1266 1042 return; 1267 1043 } 1268 1044 1269 - free_handle(handle); 1045 + if (!page_claimed) 1046 + free_handle(handle); 1270 1047 if (kref_put(&zhdr->refcount, release_z3fold_page_locked_list)) { 1271 1048 atomic64_dec(&pool->pages_nr); 1272 1049 return; ··· 1280 1053 } 1281 1054 if (unlikely(PageIsolated(page)) || 1282 1055 test_and_set_bit(NEEDS_COMPACTING, &page->private)) { 1283 - z3fold_page_unlock(zhdr); 1056 + put_z3fold_header(zhdr); 1284 1057 clear_bit(PAGE_CLAIMED, &page->private); 1285 1058 return; 1286 1059 } ··· 1290 1063 spin_unlock(&pool->lock); 1291 1064 zhdr->cpu = -1; 1292 1065 kref_get(&zhdr->refcount); 1293 - do_compact_page(zhdr, true); 1294 1066 clear_bit(PAGE_CLAIMED, &page->private); 1067 + do_compact_page(zhdr, true); 1295 1068 return; 1296 1069 } 1297 1070 kref_get(&zhdr->refcount); 1298 - queue_work_on(zhdr->cpu, pool->compact_wq, &zhdr->work); 1299 1071 clear_bit(PAGE_CLAIMED, &page->private); 1300 - z3fold_page_unlock(zhdr); 1072 + queue_work_on(zhdr->cpu, pool->compact_wq, &zhdr->work); 1073 + put_z3fold_header(zhdr); 1301 1074 } 1302 1075 1303 1076 /** ··· 1338 1111 */ 1339 1112 static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) 1340 1113 { 1341 - int i, ret = 0; 1114 + int i, ret = -1; 1342 1115 struct z3fold_header *zhdr = NULL; 1343 1116 struct page *page = NULL; 1344 1117 struct list_head *pos; 1345 - struct z3fold_buddy_slots slots; 1346 1118 unsigned long first_handle = 0, middle_handle = 0, last_handle = 0; 1347 1119 1348 1120 spin_lock(&pool->lock); ··· 1379 1153 zhdr = NULL; 1380 1154 continue; /* can't evict at this point */ 1381 1155 } 1156 + if (zhdr->foreign_handles) { 1157 + clear_bit(PAGE_CLAIMED, &page->private); 1158 + 
z3fold_page_unlock(zhdr); 1159 + zhdr = NULL; 1160 + continue; /* can't evict such page */ 1161 + } 1382 1162 kref_get(&zhdr->refcount); 1383 1163 list_del_init(&zhdr->buddy); 1384 1164 zhdr->cpu = -1; ··· 1408 1176 last_handle = 0; 1409 1177 middle_handle = 0; 1410 1178 if (zhdr->first_chunks) 1411 - first_handle = __encode_handle(zhdr, &slots, 1412 - FIRST); 1179 + first_handle = encode_handle(zhdr, FIRST); 1413 1180 if (zhdr->middle_chunks) 1414 - middle_handle = __encode_handle(zhdr, &slots, 1415 - MIDDLE); 1181 + middle_handle = encode_handle(zhdr, MIDDLE); 1416 1182 if (zhdr->last_chunks) 1417 - last_handle = __encode_handle(zhdr, &slots, 1418 - LAST); 1183 + last_handle = encode_handle(zhdr, LAST); 1419 1184 /* 1420 1185 * it's safe to unlock here because we hold a 1421 1186 * reference to this page 1422 1187 */ 1423 1188 z3fold_page_unlock(zhdr); 1424 1189 } else { 1425 - first_handle = __encode_handle(zhdr, &slots, HEADLESS); 1190 + first_handle = encode_handle(zhdr, HEADLESS); 1426 1191 last_handle = middle_handle = 0; 1427 1192 } 1428 - 1429 1193 /* Issue the eviction callback(s) */ 1430 1194 if (middle_handle) { 1431 1195 ret = pool->ops->evict(pool, middle_handle); 1432 1196 if (ret) 1433 1197 goto next; 1198 + free_handle(middle_handle); 1434 1199 } 1435 1200 if (first_handle) { 1436 1201 ret = pool->ops->evict(pool, first_handle); 1437 1202 if (ret) 1438 1203 goto next; 1204 + free_handle(first_handle); 1439 1205 } 1440 1206 if (last_handle) { 1441 1207 ret = pool->ops->evict(pool, last_handle); 1442 1208 if (ret) 1443 1209 goto next; 1210 + free_handle(last_handle); 1444 1211 } 1445 1212 next: 1446 1213 if (test_bit(PAGE_HEADLESS, &page->private)) { ··· 1495 1264 void *addr; 1496 1265 enum buddy buddy; 1497 1266 1498 - zhdr = handle_to_z3fold_header(handle); 1267 + zhdr = get_z3fold_header(handle); 1499 1268 addr = zhdr; 1500 1269 page = virt_to_page(zhdr); 1501 1270 1502 1271 if (test_bit(PAGE_HEADLESS, &page->private)) 1503 1272 goto out; 1504 
1273 1505 - z3fold_page_lock(zhdr); 1506 1274 buddy = handle_to_buddy(handle); 1507 1275 switch (buddy) { 1508 1276 case FIRST: ··· 1523 1293 1524 1294 if (addr) 1525 1295 zhdr->mapped_count++; 1526 - z3fold_page_unlock(zhdr); 1527 1296 out: 1297 + put_z3fold_header(zhdr); 1528 1298 return addr; 1529 1299 } 1530 1300 ··· 1539 1309 struct page *page; 1540 1310 enum buddy buddy; 1541 1311 1542 - zhdr = handle_to_z3fold_header(handle); 1312 + zhdr = get_z3fold_header(handle); 1543 1313 page = virt_to_page(zhdr); 1544 1314 1545 1315 if (test_bit(PAGE_HEADLESS, &page->private)) 1546 1316 return; 1547 1317 1548 - z3fold_page_lock(zhdr); 1549 1318 buddy = handle_to_buddy(handle); 1550 1319 if (buddy == MIDDLE) 1551 1320 clear_bit(MIDDLE_CHUNK_MAPPED, &page->private); 1552 1321 zhdr->mapped_count--; 1553 - z3fold_page_unlock(zhdr); 1322 + put_z3fold_header(zhdr); 1554 1323 } 1555 1324 1556 1325 /** ··· 1581 1352 test_bit(PAGE_STALE, &page->private)) 1582 1353 goto out; 1583 1354 1584 - pool = zhdr_to_pool(zhdr); 1355 + if (zhdr->mapped_count != 0 || zhdr->foreign_handles != 0) 1356 + goto out; 1585 1357 1586 - if (zhdr->mapped_count == 0) { 1587 - kref_get(&zhdr->refcount); 1588 - if (!list_empty(&zhdr->buddy)) 1589 - list_del_init(&zhdr->buddy); 1590 - spin_lock(&pool->lock); 1591 - if (!list_empty(&page->lru)) 1592 - list_del(&page->lru); 1593 - spin_unlock(&pool->lock); 1594 - z3fold_page_unlock(zhdr); 1595 - return true; 1596 - } 1358 + pool = zhdr_to_pool(zhdr); 1359 + spin_lock(&pool->lock); 1360 + if (!list_empty(&zhdr->buddy)) 1361 + list_del_init(&zhdr->buddy); 1362 + if (!list_empty(&page->lru)) 1363 + list_del_init(&page->lru); 1364 + spin_unlock(&pool->lock); 1365 + 1366 + kref_get(&zhdr->refcount); 1367 + z3fold_page_unlock(zhdr); 1368 + return true; 1369 + 1597 1370 out: 1598 1371 z3fold_page_unlock(zhdr); 1599 1372 return false; ··· 1618 1387 if (!z3fold_page_trylock(zhdr)) { 1619 1388 return -EAGAIN; 1620 1389 } 1621 - if (zhdr->mapped_count != 0) { 1390 + 
if (zhdr->mapped_count != 0 || zhdr->foreign_handles != 0) { 1622 1391 z3fold_page_unlock(zhdr); 1623 1392 return -EBUSY; 1624 1393 }
+28
scripts/spelling.txt
··· 87 87 algorithmical||algorithmically 88 88 algoritm||algorithm 89 89 algoritms||algorithms 90 + algorithmn||algorithm 90 91 algorrithm||algorithm 91 92 algorritm||algorithm 92 93 aligment||alignment ··· 110 109 altough||although 111 110 alue||value 112 111 ambigious||ambiguous 112 + ambigous||ambiguous 113 113 amoung||among 114 114 amout||amount 115 115 amplifer||amplifier ··· 181 179 attnetion||attention 182 180 attruibutes||attributes 183 181 authentification||authentication 182 + authenicated||authenticated 184 183 automaticaly||automatically 185 184 automaticly||automatically 186 185 automatize||automate ··· 289 286 clared||cleared 290 287 closeing||closing 291 288 clustred||clustered 289 + cnfiguration||configuration 292 290 coexistance||coexistence 293 291 colescing||coalescing 294 292 collapsable||collapsible ··· 329 325 comunication||communication 330 326 conbination||combination 331 327 conditionaly||conditionally 328 + conditon||condition 332 329 conected||connected 333 330 conector||connector 334 331 connecetd||connected 332 + configration||configuration 335 333 configuartion||configuration 336 334 configuation||configuration 337 335 configued||configured ··· 353 347 contaisn||contains 354 348 contant||contact 355 349 contence||contents 350 + contiguos||contiguous 356 351 continious||continuous 357 352 continous||continuous 358 353 continously||continuously ··· 387 380 dafault||default 388 381 deafult||default 389 382 deamon||daemon 383 + debouce||debounce 390 384 decompres||decompress 391 385 decsribed||described 392 386 decription||description ··· 456 448 differenciate||differentiate 457 449 diffrentiate||differentiate 458 450 difinition||definition 451 + digial||digital 459 452 dimention||dimension 460 453 dimesions||dimensions 461 454 dispalying||displaying ··· 498 489 druing||during 499 490 dynmaic||dynamic 500 491 eanable||enable 492 + eanble||enable 501 493 easilly||easily 502 494 ecspecially||especially 503 495 edditable||editable ··· 512 502 
eletronic||electronic 513 503 embeded||embedded 514 504 enabledi||enabled 505 + enbale||enable 515 506 enble||enable 516 507 enchanced||enhanced 517 508 encorporating||incorporating ··· 547 536 execeeded||exceeded 548 537 execeeds||exceeds 549 538 exeed||exceed 539 + exeuction||execution 550 540 existance||existence 551 541 existant||existent 552 542 exixt||exist ··· 613 601 framming||framing 614 602 framwork||framework 615 603 frequncy||frequency 604 + frequancy||frequency 616 605 frome||from 617 606 fucntion||function 618 607 fuction||function 619 608 fuctions||functions 609 + fullill||fulfill 620 610 funcation||function 621 611 funcion||function 622 612 functionallity||functionality ··· 656 642 harware||hardware 657 643 heirarchically||hierarchically 658 644 helpfull||helpful 645 + hexdecimal||hexadecimal 659 646 hybernate||hibernate 660 647 hierachy||hierarchy 661 648 hierarchie||hierarchy ··· 724 709 initation||initiation 725 710 initators||initiators 726 711 initialiazation||initialization 712 + initializationg||initialization 727 713 initializiation||initialization 728 714 initialze||initialize 729 715 initialzed||initialized 730 716 initialzing||initializing 731 717 initilization||initialization 732 718 initilize||initialize 719 + initliaze||initialize 733 720 inofficial||unofficial 734 721 inrerface||interface 735 722 insititute||institute ··· 796 779 itslef||itself 797 780 jave||java 798 781 jeffies||jiffies 782 + jumpimng||jumping 799 783 juse||just 800 784 jus||just 801 785 kown||known ··· 857 839 messgaes||messages 858 840 messsage||message 859 841 messsages||messages 842 + metdata||metadata 860 843 micropone||microphone 861 844 microprocesspr||microprocessor 862 845 migrateable||migratable ··· 876 857 missign||missing 877 858 missmanaged||mismanaged 878 859 missmatch||mismatch 860 + misssing||missing 879 861 miximum||maximum 880 862 mmnemonic||mnemonic 881 863 mnay||many ··· 932 912 occuring||occurring 933 913 offser||offset 934 914 offet||offset 915 
+ offlaod||offload 935 916 offloded||offloaded 936 917 offseting||offsetting 937 918 omited||omitted ··· 1014 993 posible||possible 1015 994 positon||position 1016 995 possibilites||possibilities 996 + potocol||protocol 1017 997 powerfull||powerful 1018 998 pramater||parameter 1019 999 preamle||preamble ··· 1083 1061 pwoer||power 1084 1062 queing||queuing 1085 1063 quering||querying 1064 + queus||queues 1086 1065 randomally||randomly 1087 1066 raoming||roaming 1088 1067 reasearcher||researcher 1089 1068 reasearchers||researchers 1090 1069 reasearch||research 1070 + receieve||receive 1091 1071 recepient||recipient 1092 1072 recevied||received 1093 1073 receving||receiving ··· 1190 1166 scaned||scanned 1191 1167 scaning||scanning 1192 1168 scarch||search 1169 + schdule||schedule 1193 1170 seach||search 1194 1171 searchs||searches 1195 1172 secquence||sequence ··· 1333 1308 teh||the 1334 1309 temorary||temporary 1335 1310 temproarily||temporarily 1311 + temperture||temperature 1336 1312 thead||thread 1337 1313 therfore||therefore 1338 1314 thier||their ··· 1380 1354 usupported||unsupported 1381 1355 uncommited||uncommitted 1382 1356 unconditionaly||unconditionally 1357 + undeflow||underflow 1383 1358 underun||underrun 1384 1359 unecessary||unnecessary 1385 1360 unexecpted||unexpected ··· 1441 1414 varient||variant 1442 1415 vaule||value 1443 1416 verbse||verbose 1417 + veify||verify 1444 1418 verisons||versions 1445 1419 verison||version 1446 1420 verson||version
+36
tools/testing/selftests/memfd/memfd_test.c
··· 290 290 munmap(p, mfd_def_size); 291 291 } 292 292 293 + static void mfd_assert_fork_private_write(int fd) 294 + { 295 + int *p; 296 + pid_t pid; 297 + 298 + p = mmap(NULL, 299 + mfd_def_size, 300 + PROT_READ | PROT_WRITE, 301 + MAP_PRIVATE, 302 + fd, 303 + 0); 304 + if (p == MAP_FAILED) { 305 + printf("mmap() failed: %m\n"); 306 + abort(); 307 + } 308 + 309 + p[0] = 22; 310 + 311 + pid = fork(); 312 + if (pid == 0) { 313 + p[0] = 33; 314 + exit(0); 315 + } else { 316 + waitpid(pid, NULL, 0); 317 + 318 + if (p[0] != 22) { 319 + printf("MAP_PRIVATE copy-on-write failed: %m\n"); 320 + abort(); 321 + } 322 + } 323 + 324 + munmap(p, mfd_def_size); 325 + } 326 + 293 327 static void mfd_assert_write(int fd) 294 328 { 295 329 ssize_t l; ··· 793 759 mfd_assert_read(fd2); 794 760 mfd_assert_read_shared(fd2); 795 761 mfd_fail_write(fd2); 762 + 763 + mfd_assert_fork_private_write(fd); 796 764 797 765 munmap(p, mfd_def_size); 798 766 close(fd2);
+1
tools/testing/selftests/vm/config
··· 1 1 CONFIG_SYSVIPC=y 2 2 CONFIG_USERFAULTFD=y 3 + CONFIG_TEST_VMALLOC=m