Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...

+2784 -1569
+7
Documentation/DMA-API.txt
··· 104 104 from this pool must not cross 4KByte boundaries. 105 105 106 106 107 + void *dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, 108 + dma_addr_t *handle) 109 + 110 + Wraps dma_pool_alloc() and also zeroes the returned memory if the 111 + allocation attempt succeeded. 112 + 113 + 107 114 void *dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, 108 115 dma_addr_t *dma_handle); 109 116
+2 -1
Documentation/blockdev/zram.txt
··· 144 144 store compressed data 145 145 mem_limit RW the maximum amount of memory ZRAM can use to store 146 146 the compressed data 147 - num_migrated RO the number of objects migrated migrated by compaction 147 + pages_compacted RO the number of pages freed during compaction 148 + (available only via zram<id>/mm_stat node) 148 149 compact WO trigger memory compaction 149 150 150 151 WARNING
+4 -3
Documentation/filesystems/dax.txt
··· 60 60 - implementing the direct_IO address space operation, and calling 61 61 dax_do_io() instead of blockdev_direct_IO() if S_DAX is set 62 62 - implementing an mmap file operation for DAX files which sets the 63 - VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers 64 - for fault and page_mkwrite (which should probably call dax_fault() and 65 - dax_mkwrite(), passing the appropriate get_block() callback) 63 + VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to 64 + include handlers for fault, pmd_fault and page_mkwrite (which should 65 + probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the 66 + appropriate get_block() callback) 66 67 - calling dax_truncate_page() instead of block_truncate_page() for DAX files 67 68 - calling dax_zero_page_range() instead of zero_user() for DAX files 68 69 - ensuring that there is sufficient locking between reads, writes,
+13 -5
Documentation/filesystems/proc.txt
··· 424 424 Referenced: 892 kB 425 425 Anonymous: 0 kB 426 426 Swap: 0 kB 427 + SwapPss: 0 kB 427 428 KernelPageSize: 4 kB 428 429 MMUPageSize: 4 kB 429 430 Locked: 374 kB ··· 434 433 mapping in /proc/PID/maps. The remaining lines show the size of the mapping 435 434 (size), the amount of the mapping that is currently resident in RAM (RSS), the 436 435 process' proportional share of this mapping (PSS), the number of clean and 437 - dirty private pages in the mapping. Note that even a page which is part of a 438 - MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used 439 - by only one process, is accounted as private and not as shared. "Referenced" 440 - indicates the amount of memory currently marked as referenced or accessed. 436 + dirty private pages in the mapping. 437 + 438 + The "proportional set size" (PSS) of a process is the count of pages it has 439 + in memory, where each page is divided by the number of processes sharing it. 440 + So if a process has 1000 pages all to itself, and 1000 shared with one other 441 + process, its PSS will be 1500. 442 + Note that even a page which is part of a MAP_SHARED mapping, but has only 443 + a single pte mapped, i.e. is currently used by only one process, is accounted 444 + as private and not as shared. 445 + "Referenced" indicates the amount of memory currently marked as referenced or 446 + accessed. 441 447 "Anonymous" shows the amount of memory that does not belong to any file. Even 442 448 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE 443 449 and a page is modified, the file page is replaced by a private anonymous copy. 444 450 "Swap" shows how much would-be-anonymous memory is also used, but out on 445 451 swap. 446 - 452 + "SwapPss" shows proportional swap share of this mapping. 447 453 "VmFlags" field deserves a separate description. This member represents the kernel 448 454 flags associated with the particular virtual memory area in two letter encoded 449 455 manner. The codes are the following:
+2 -2
Documentation/sysctl/vm.txt
··· 349 349 350 350 (i < j): 351 351 zone[i]->protection[j] 352 - = (total sums of present_pages from zone[i+1] to zone[j] on the node) 352 + = (total sums of managed_pages from zone[i+1] to zone[j] on the node) 353 353 / lowmem_reserve_ratio[i]; 354 354 (i = j): 355 355 (should not be protected. = 0; ··· 360 360 256 (if zone[i] means DMA or DMA32 zone) 361 361 32 (others). 362 362 As above expression, they are reciprocal number of ratio. 363 - 256 means 1/256. # of protection pages becomes about "0.39%" of total present 363 + 256 means 1/256. # of protection pages becomes about "0.39%" of total managed 364 364 pages of higher zones on the node. 365 365 366 366 If you would like to protect more pages, smaller values are effective.
+2 -1
Documentation/sysrq.txt
··· 75 75 76 76 'e' - Send a SIGTERM to all processes, except for init. 77 77 78 - 'f' - Will call oom_kill to kill a memory hog process. 78 + 'f' - Will call the oom killer to kill a memory hog process, but do not 79 + panic if nothing can be killed. 79 80 80 81 'g' - Used by kgdb (kernel debugger) 81 82
+11 -4
Documentation/vm/hugetlbpage.txt
··· 329 329 330 330 3) hugepage-mmap: see tools/testing/selftests/vm/hugepage-mmap.c 331 331 332 - 4) The libhugetlbfs (http://libhugetlbfs.sourceforge.net) library provides a 333 - wide range of userspace tools to help with huge page usability, environment 334 - setup, and control. Furthermore it provides useful test cases that should be 335 - used when modifying code to ensure no regressions are introduced. 332 + 4) The libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs) library 333 + provides a wide range of userspace tools to help with huge page usability, 334 + environment setup, and control. 335 + 336 + Kernel development regression testing 337 + ===================================== 338 + 339 + The most complete set of hugetlb tests are in the libhugetlbfs repository. 340 + If you modify any hugetlb related code, use the libhugetlbfs test suite 341 + to check for regressions. In addition, if you add any new hugetlb 342 + functionality, please add appropriate tests to libhugetlbfs.
+13 -2
Documentation/vm/pagemap.txt
··· 16 16 * Bits 0-4 swap type if swapped 17 17 * Bits 5-54 swap offset if swapped 18 18 * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt) 19 - * Bits 56-60 zero 20 - * Bit 61 page is file-page or shared-anon 19 + * Bit 56 page exclusively mapped (since 4.2) 20 + * Bits 57-60 zero 21 + * Bit 61 page is file-page or shared-anon (since 3.5) 21 22 * Bit 62 page swapped 22 23 * Bit 63 page present 24 + 25 + Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. 26 + In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from 27 + 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. 28 + Reason: information about PFNs helps in exploiting Rowhammer vulnerability. 23 29 24 30 If the page is not present but in swap, then the PFN contains an 25 31 encoding of the swap file number and the page's offset into the ··· 165 159 Reading from any of the files will return -EINVAL if you are not starting 166 160 the read on an 8-byte boundary (e.g., if you sought an odd number of bytes 167 161 into the file), or if the size of the read is not a multiple of 8 bytes. 162 + 163 + Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is 164 + always 12 at most architectures). Since Linux 3.11 their meaning changes 165 + after first clear of soft-dirty bits. Since Linux 4.2 they are used for 166 + flags unconditionally.
+62
arch/arm64/kernel/setup.c
··· 339 339 } 340 340 } 341 341 342 + #ifdef CONFIG_BLK_DEV_INITRD 343 + /* 344 + * Relocate initrd if it is not completely within the linear mapping. 345 + * This would be the case if mem= cuts out all or part of it. 346 + */ 347 + static void __init relocate_initrd(void) 348 + { 349 + phys_addr_t orig_start = __virt_to_phys(initrd_start); 350 + phys_addr_t orig_end = __virt_to_phys(initrd_end); 351 + phys_addr_t ram_end = memblock_end_of_DRAM(); 352 + phys_addr_t new_start; 353 + unsigned long size, to_free = 0; 354 + void *dest; 355 + 356 + if (orig_end <= ram_end) 357 + return; 358 + 359 + /* 360 + * Any of the original initrd which overlaps the linear map should 361 + * be freed after relocating. 362 + */ 363 + if (orig_start < ram_end) 364 + to_free = ram_end - orig_start; 365 + 366 + size = orig_end - orig_start; 367 + 368 + /* initrd needs to be relocated completely inside linear mapping */ 369 + new_start = memblock_find_in_range(0, PFN_PHYS(max_pfn), 370 + size, PAGE_SIZE); 371 + if (!new_start) 372 + panic("Cannot relocate initrd of size %ld\n", size); 373 + memblock_reserve(new_start, size); 374 + 375 + initrd_start = __phys_to_virt(new_start); 376 + initrd_end = initrd_start + size; 377 + 378 + pr_info("Moving initrd from [%llx-%llx] to [%llx-%llx]\n", 379 + orig_start, orig_start + size - 1, 380 + new_start, new_start + size - 1); 381 + 382 + dest = (void *)initrd_start; 383 + 384 + if (to_free) { 385 + memcpy(dest, (void *)__phys_to_virt(orig_start), to_free); 386 + dest += to_free; 387 + } 388 + 389 + copy_from_early_mem(dest, orig_start + to_free, size - to_free); 390 + 391 + if (to_free) { 392 + pr_info("Freeing original RAMDISK from [%llx-%llx]\n", 393 + orig_start, orig_start + to_free - 1); 394 + memblock_free(orig_start, to_free); 395 + } 396 + } 397 + #else 398 + static inline void __init relocate_initrd(void) 399 + { 400 + } 401 + #endif 402 + 342 403 u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID }; 343 404 344 405 void __init setup_arch(char **cmdline_p) ··· 433 372 acpi_boot_table_init(); 434 373 435 374 paging_init(); 375 + relocate_initrd(); 436 376 request_standard_resources(); 437 377 438 378 early_ioremap_reset();
+1 -5
arch/ia64/hp/common/sba_iommu.c
··· 1140 1140 1141 1141 #ifdef CONFIG_NUMA 1142 1142 { 1143 - int node = ioc->node; 1144 1143 struct page *page; 1145 1144 1146 - if (node == NUMA_NO_NODE) 1147 - node = numa_node_id(); 1148 - 1149 - page = alloc_pages_exact_node(node, flags, get_order(size)); 1145 + page = alloc_pages_node(ioc->node, flags, get_order(size)); 1150 1146 if (unlikely(!page)) 1151 1147 return NULL; 1152 1148
+1 -1
arch/ia64/kernel/uncached.c
··· 97 97 98 98 /* attempt to allocate a granule's worth of cached memory pages */ 99 99 100 - page = alloc_pages_exact_node(nid, 100 + page = __alloc_pages_node(nid, 101 101 GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, 102 102 IA64_GRANULE_SHIFT-PAGE_SHIFT); 103 103 if (!page) {
+1 -1
arch/ia64/sn/pci/pci_dma.c
··· 92 92 */ 93 93 node = pcibus_to_node(pdev->bus); 94 94 if (likely(node >=0)) { 95 - struct page *p = alloc_pages_exact_node(node, 95 + struct page *p = __alloc_pages_node(node, 96 96 flags, get_order(size)); 97 97 98 98 if (likely(p))
+1 -1
arch/powerpc/platforms/cell/ras.c
··· 123 123 124 124 area->nid = nid; 125 125 area->order = order; 126 - area->pages = alloc_pages_exact_node(area->nid, 126 + area->pages = __alloc_pages_node(area->nid, 127 127 GFP_KERNEL|__GFP_THISNODE, 128 128 area->order); 129 129
+1 -1
arch/sparc/include/asm/pgtable_32.h
··· 14 14 #include <asm-generic/4level-fixup.h> 15 15 16 16 #include <linux/spinlock.h> 17 - #include <linux/swap.h> 17 + #include <linux/mm_types.h> 18 18 #include <asm/types.h> 19 19 #include <asm/pgtsrmmu.h> 20 20 #include <asm/vaddrs.h>
+1 -21
arch/x86/kernel/setup.c
··· 317 317 return ramdisk_size; 318 318 } 319 319 320 - #define MAX_MAP_CHUNK (NR_FIX_BTMAPS << PAGE_SHIFT) 321 320 static void __init relocate_initrd(void) 322 321 { 323 322 /* Assume only end is not page aligned */ 324 323 u64 ramdisk_image = get_ramdisk_image(); 325 324 u64 ramdisk_size = get_ramdisk_size(); 326 325 u64 area_size = PAGE_ALIGN(ramdisk_size); 327 - unsigned long slop, clen, mapaddr; 328 - char *p, *q; 329 326 330 327 /* We need to move the initrd down into directly mapped mem */ 331 328 relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), ··· 340 343 printk(KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n", 341 344 relocated_ramdisk, relocated_ramdisk + ramdisk_size - 1); 342 345 343 - q = (char *)initrd_start; 346 + copy_from_early_mem((void *)initrd_start, ramdisk_image, ramdisk_size); 344 347 345 - /* Copy the initrd */ 346 - while (ramdisk_size) { 347 - slop = ramdisk_image & ~PAGE_MASK; 348 - clen = ramdisk_size; 349 - if (clen > MAX_MAP_CHUNK-slop) 350 - clen = MAX_MAP_CHUNK-slop; 351 - mapaddr = ramdisk_image & PAGE_MASK; 352 - p = early_memremap(mapaddr, clen+slop); 353 - memcpy(q, p+slop, clen); 354 - early_memunmap(p, clen+slop); 355 - q += clen; 356 - ramdisk_image += clen; 357 - ramdisk_size -= clen; 358 - } 359 - 360 - ramdisk_image = get_ramdisk_image(); 361 - ramdisk_size = get_ramdisk_size(); 362 348 printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to" 363 349 " [mem %#010llx-%#010llx]\n", 364 350 ramdisk_image, ramdisk_image + ramdisk_size - 1,
+1 -1
arch/x86/kvm/vmx.c
··· 3150 3150 struct page *pages; 3151 3151 struct vmcs *vmcs; 3152 3152 3153 - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order); 3153 + pages = __alloc_pages_node(node, GFP_KERNEL, vmcs_config.order); 3154 3154 if (!pages) 3155 3155 return NULL; 3156 3156 vmcs = page_address(pages);
+4 -2
arch/x86/mm/numa.c
··· 246 246 bi->start = max(bi->start, low); 247 247 bi->end = min(bi->end, high); 248 248 249 - /* and there's no empty block */ 250 - if (bi->start >= bi->end) 249 + /* and there's no empty or non-exist block */ 250 + if (bi->start >= bi->end || 251 + !memblock_overlaps_region(&memblock.memory, 252 + bi->start, bi->end - bi->start)) 251 253 numa_remove_memblk_from(i--, mi); 252 254 } 253 255
+17 -13
drivers/block/zram/zram_drv.c
··· 388 388 static ssize_t compact_store(struct device *dev, 389 389 struct device_attribute *attr, const char *buf, size_t len) 390 390 { 391 - unsigned long nr_migrated; 392 391 struct zram *zram = dev_to_zram(dev); 393 392 struct zram_meta *meta; 394 393 ··· 398 399 } 399 400 400 401 meta = zram->meta; 401 - nr_migrated = zs_compact(meta->mem_pool); 402 - atomic64_add(nr_migrated, &zram->stats.num_migrated); 402 + zs_compact(meta->mem_pool); 403 403 up_read(&zram->init_lock); 404 404 405 405 return len; ··· 426 428 struct device_attribute *attr, char *buf) 427 429 { 428 430 struct zram *zram = dev_to_zram(dev); 431 + struct zs_pool_stats pool_stats; 429 432 u64 orig_size, mem_used = 0; 430 433 long max_used; 431 434 ssize_t ret; 432 435 436 + memset(&pool_stats, 0x00, sizeof(struct zs_pool_stats)); 437 + 433 438 down_read(&zram->init_lock); 434 439 if (init_done(zram)) { 435 440 mem_used = zs_get_total_pages(zram->meta->mem_pool); 441 + zs_pool_stats(zram->meta->mem_pool, &pool_stats); 442 + } 436 443 437 444 orig_size = atomic64_read(&zram->stats.pages_stored); 438 445 max_used = atomic_long_read(&zram->stats.max_used_pages); 439 446 440 447 ret = scnprintf(buf, PAGE_SIZE, 441 - "%8llu %8llu %8llu %8lu %8ld %8llu %8llu\n", 448 + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n", 442 449 orig_size << PAGE_SHIFT, 443 450 (u64)atomic64_read(&zram->stats.compr_data_size), 444 451 mem_used << PAGE_SHIFT, 445 452 zram->limit_pages << PAGE_SHIFT, 446 453 max_used << PAGE_SHIFT, 447 454 (u64)atomic64_read(&zram->stats.zero_pages), 448 455 pool_stats.pages_compacted); 449 456 up_read(&zram->init_lock); 450 457 451 458 return ret; ··· 622 619 uncmem = user_mem; 623 620 624 621 if (!uncmem) { 625 - pr_info("Unable to allocate temp memory\n"); 622 + pr_err("Unable to allocate temp memory\n"); 626 623 ret = -ENOMEM; 627 624 goto out_cleanup; 628 625 } ··· 719 716 720 717 handle = zs_malloc(meta->mem_pool, clen); 721 718 if (!handle) { 722 - pr_info("Error allocating memory for compressed page: %u, size=%zu\n", 719 + pr_err("Error allocating memory for compressed page: %u, size=%zu\n", 723 720 index, clen); 724 721 ret = -ENOMEM; 725 722 goto out; ··· 1039 1036 1040 1037 comp = zcomp_create(zram->compressor, zram->max_comp_streams); 1041 1038 if (IS_ERR(comp)) { 1042 - pr_info("Cannot initialise %s compressing backend\n", 1039 + pr_err("Cannot initialise %s compressing backend\n", 1043 1040 zram->compressor); 1044 1041 err = PTR_ERR(comp); 1045 1042 goto out_free_meta; ··· 1217 1214 /* gendisk structure */ 1218 1215 zram->disk = alloc_disk(1); 1219 1216 if (!zram->disk) { 1220 - pr_warn("Error allocating disk structure for device %d\n", 1217 + pr_err("Error allocating disk structure for device %d\n", 1221 1218 device_id); 1222 1219 ret = -ENOMEM; 1223 1220 goto out_free_queue; ··· 1266 1263 ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj, 1267 1264 &zram_disk_attr_group); 1268 1265 if (ret < 0) { 1269 - pr_warn("Error creating sysfs group"); 1266 + pr_err("Error creating sysfs group for device %d\n", 1267 + device_id); 1270 1268 goto out_free_disk; 1271 1269 } 1272 1270 strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor)); ··· 1407 1403 1408 1404 ret = class_register(&zram_control_class); 1409 1405 if (ret) { 1410 - pr_warn("Unable to register zram-control class\n"); 1406 + pr_err("Unable to register zram-control class\n"); 1411 1407 return ret; 1412 1408 } 1413 1409 1414 1410 zram_major = register_blkdev(0, "zram"); 1415 1411 if (zram_major <= 0) { 1416 - pr_warn("Unable to get major number\n"); 1412 + pr_err("Unable to get major number\n"); 1417 1413 class_unregister(&zram_control_class); 1418 1414 return -EBUSY; 1419 1415 }
-1
drivers/block/zram/zram_drv.h
··· 78 78 atomic64_t compr_data_size; /* compressed size of pages stored */ 79 79 atomic64_t num_reads; /* failed + successful */ 80 80 atomic64_t num_writes; /* --do-- */ 81 - atomic64_t num_migrated; /* no. of migrated object */ 82 81 atomic64_t failed_reads; /* can happen when memory is too low */ 83 82 atomic64_t failed_writes; /* can happen when memory is too low */ 84 83 atomic64_t invalid_io; /* non-page-aligned I/O requests */
+1 -1
drivers/misc/sgi-xp/xpc_uv.c
··· 239 239 mq->mmr_blade = uv_cpu_to_blade_id(cpu); 240 240 241 241 nid = cpu_to_node(cpu); 242 - page = alloc_pages_exact_node(nid, 242 + page = __alloc_pages_node(nid, 243 243 GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, 244 244 pg_order); 245 245 if (page == NULL) {
+9 -2
drivers/tty/sysrq.c
··· 353 353 354 354 static void moom_callback(struct work_struct *ignored) 355 355 { 356 + const gfp_t gfp_mask = GFP_KERNEL; 357 + struct oom_control oc = { 358 + .zonelist = node_zonelist(first_memory_node, gfp_mask), 359 + .nodemask = NULL, 360 + .gfp_mask = gfp_mask, 361 + .order = -1, 362 + }; 363 + 356 364 mutex_lock(&oom_lock); 357 - if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), 358 - GFP_KERNEL, 0, NULL, true)) 365 + if (!out_of_memory(&oc)) 359 366 pr_info("OOM request ignored because killer is disabled\n"); 360 367 mutex_unlock(&oom_lock); 361 368 }
+1
fs/block_dev.c
··· 28 28 #include <linux/namei.h> 29 29 #include <linux/log2.h> 30 30 #include <linux/cleancache.h> 31 + #include <linux/dax.h> 31 32 #include <asm/uaccess.h> 32 33 #include "internal.h" 33 34
+184 -13
fs/dax.c
··· 283 283 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, 284 284 struct vm_area_struct *vma, struct vm_fault *vmf) 285 285 { 286 - struct address_space *mapping = inode->i_mapping; 287 286 sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); 288 287 unsigned long vaddr = (unsigned long)vmf->virtual_address; 289 288 void __pmem *addr; 290 289 unsigned long pfn; 291 290 pgoff_t size; 292 291 int error; 293 - 294 - i_mmap_lock_read(mapping); 295 292 296 293 /* 297 294 * Check truncate didn't happen while we were allocating a block. ··· 319 322 error = vm_insert_mixed(vma, vaddr, pfn); 320 323 321 324 out: 322 - i_mmap_unlock_read(mapping); 323 - 324 325 return error; 325 326 } 326 327 ··· 380 385 * from a read fault and we've raced with a truncate 381 386 */ 382 387 error = -EIO; 383 - goto unlock_page; 388 + goto unlock; 384 389 } 390 + } else { 391 + i_mmap_lock_write(mapping); 385 392 } 386 393 387 394 error = get_block(inode, block, &bh, 0); 388 395 if (!error && (bh.b_size < PAGE_SIZE)) 389 396 error = -EIO; /* fs corruption? */ 390 397 if (error) 391 - goto unlock_page; 398 + goto unlock; 392 399 393 400 if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) { 394 401 if (vmf->flags & FAULT_FLAG_WRITE) { ··· 401 404 if (!error && (bh.b_size < PAGE_SIZE)) 402 405 error = -EIO; 403 406 if (error) 404 - goto unlock_page; 407 + goto unlock; 405 408 } else { 409 + i_mmap_unlock_write(mapping); 406 410 return dax_load_hole(mapping, page, vmf); 407 411 } 408 412 } ··· 415 417 else 416 418 clear_user_highpage(new_page, vaddr); 417 419 if (error) 418 - goto unlock_page; 420 + goto unlock; 419 421 vmf->page = page; 420 422 if (!page) { 421 - i_mmap_lock_read(mapping); 422 423 /* Check we didn't race with truncate */ 423 424 size = (i_size_read(inode) + PAGE_SIZE - 1) >> 424 425 PAGE_SHIFT; 425 426 if (vmf->pgoff >= size) { 426 - i_mmap_unlock_read(mapping); 427 427 error = -EIO; 428 - goto out; 428 + goto unlock; 429 429 } 430 430 } 431 431 return VM_FAULT_LOCKED; ··· 459 463 WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE)); 460 464 } 461 465 466 + if (!page) 467 + i_mmap_unlock_write(mapping); 462 468 out: 463 469 if (error == -ENOMEM) 464 470 return VM_FAULT_OOM | major; ··· 469 471 return VM_FAULT_SIGBUS | major; 470 472 return VM_FAULT_NOPAGE | major; 471 473 472 - unlock_page: 474 + unlock: 473 475 if (page) { 474 476 unlock_page(page); 475 477 page_cache_release(page); 478 + } else { 479 + i_mmap_unlock_write(mapping); 476 480 } 481 + 477 482 goto out; 478 483 } 479 484 EXPORT_SYMBOL(__dax_fault); ··· 507 506 return result; 508 507 } 509 508 EXPORT_SYMBOL_GPL(dax_fault); 509 + 510 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 511 + /* 512 + * The 'colour' (ie low bits) within a PMD of a page offset. This comes up 513 + * more often than one might expect in the below function. 514 + */ 515 + #define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) 516 + 517 + int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, 518 + pmd_t *pmd, unsigned int flags, get_block_t get_block, 519 + dax_iodone_t complete_unwritten) 520 + { 521 + struct file *file = vma->vm_file; 522 + struct address_space *mapping = file->f_mapping; 523 + struct inode *inode = mapping->host; 524 + struct buffer_head bh; 525 + unsigned blkbits = inode->i_blkbits; 526 + unsigned long pmd_addr = address & PMD_MASK; 527 + bool write = flags & FAULT_FLAG_WRITE; 528 + long length; 529 + void *kaddr; 530 + pgoff_t size, pgoff; 531 + sector_t block, sector; 532 + unsigned long pfn; 533 + int result = 0; 534 + 535 + /* Fall back to PTEs if we're going to COW */ 536 + if (write && !(vma->vm_flags & VM_SHARED)) 537 + return VM_FAULT_FALLBACK; 538 + /* If the PMD would extend outside the VMA */ 539 + if (pmd_addr < vma->vm_start) 540 + return VM_FAULT_FALLBACK; 541 + if ((pmd_addr + PMD_SIZE) > vma->vm_end) 542 + return VM_FAULT_FALLBACK; 543 + 544 + pgoff = linear_page_index(vma, pmd_addr); 545 + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; 546 + if (pgoff >= size) 547 + return VM_FAULT_SIGBUS; 548 + /* If the PMD would cover blocks out of the file */ 549 + if ((pgoff | PG_PMD_COLOUR) >= size) 550 + return VM_FAULT_FALLBACK; 551 + 552 + memset(&bh, 0, sizeof(bh)); 553 + block = (sector_t)pgoff << (PAGE_SHIFT - blkbits); 554 + 555 + bh.b_size = PMD_SIZE; 556 + i_mmap_lock_write(mapping); 557 + length = get_block(inode, block, &bh, write); 558 + if (length) 559 + return VM_FAULT_SIGBUS; 560 + 561 + /* 562 + * If the filesystem isn't willing to tell us the length of a hole, 563 + * just fall back to PTEs. Calling get_block 512 times in a loop 564 + * would be silly. 565 + */ 566 + if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) 567 + goto fallback; 568 + 569 + if (buffer_unwritten(&bh) || buffer_new(&bh)) { 570 + int i; 571 + for (i = 0; i < PTRS_PER_PMD; i++) 572 + clear_page(kaddr + i * PAGE_SIZE); 573 + count_vm_event(PGMAJFAULT); 574 + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); 575 + result |= VM_FAULT_MAJOR; 576 + } 577 + 578 + /* 579 + * If we allocated new storage, make sure no process has any 580 + * zero pages covering this hole 581 + */ 582 + if (buffer_new(&bh)) { 583 + i_mmap_unlock_write(mapping); 584 + unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0); 585 + i_mmap_lock_write(mapping); 586 + } 587 + 588 + /* 589 + * If a truncate happened while we were allocating blocks, we may 590 + * leave blocks allocated to the file that are beyond EOF. We can't 591 + * take i_mutex here, so just leave them hanging; they'll be freed 592 + * when the file is deleted. 593 + */ 594 + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; 595 + if (pgoff >= size) { 596 + result = VM_FAULT_SIGBUS; 597 + goto out; 598 + } 599 + if ((pgoff | PG_PMD_COLOUR) >= size) 600 + goto fallback; 601 + 602 + if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) { 603 + spinlock_t *ptl; 604 + pmd_t entry; 605 + struct page *zero_page = get_huge_zero_page(); 606 + 607 + if (unlikely(!zero_page)) 608 + goto fallback; 609 + 610 + ptl = pmd_lock(vma->vm_mm, pmd); 611 + if (!pmd_none(*pmd)) { 612 + spin_unlock(ptl); 613 + goto fallback; 614 + } 615 + 616 + entry = mk_pmd(zero_page, vma->vm_page_prot); 617 + entry = pmd_mkhuge(entry); 618 + set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry); 619 + result = VM_FAULT_NOPAGE; 620 + spin_unlock(ptl); 621 + } else { 622 + sector = bh.b_blocknr << (blkbits - 9); 623 + length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, 624 + bh.b_size); 625 + if (length < 0) { 626 + result = VM_FAULT_SIGBUS; 627 + goto out; 628 + } 629 + if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) 630 + goto fallback; 631 + 632 + result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write); 633 + } 634 + 635 + out: 636 + if (buffer_unwritten(&bh)) 637 + complete_unwritten(&bh, !(result & VM_FAULT_ERROR)); 638 + 639 + i_mmap_unlock_write(mapping); 640 + 641 + return result; 642 + 643 + fallback: 644 + count_vm_event(THP_FAULT_FALLBACK); 645 + result = VM_FAULT_FALLBACK; 646 + goto out; 647 + } 648 + EXPORT_SYMBOL_GPL(__dax_pmd_fault); 649 + 650 + /** 651 + * dax_pmd_fault - handle a PMD fault on a DAX file 652 + * @vma: The virtual memory area where the fault occurred 653 + * @vmf: The description of the fault 654 + * @get_block: The filesystem method used to translate file offsets to blocks 655 + * 656 + * When a page fault occurs, filesystems may call this helper in their 657 + * pmd_fault handler for DAX files. 658 + */ 659 + int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, 660 + pmd_t *pmd, unsigned int flags, get_block_t get_block, 661 + dax_iodone_t complete_unwritten) 662 + { 663 + int result; 664 + struct super_block *sb = file_inode(vma->vm_file)->i_sb; 665 + 666 + if (flags & FAULT_FLAG_WRITE) { 667 + sb_start_pagefault(sb); 668 + file_update_time(vma->vm_file); 669 + } 670 + result = __dax_pmd_fault(vma, address, pmd, flags, get_block, 671 + complete_unwritten); 672 + if (flags & FAULT_FLAG_WRITE) 673 + sb_end_pagefault(sb); 674 + 675 + return result; 676 + } 677 + EXPORT_SYMBOL_GPL(dax_pmd_fault); 678 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 510 679 511 680 /** 512 681 * dax_pfn_mkwrite - handle first write to DAX page
+9 -1
fs/ext2/file.c
··· 20 20 21 21 #include <linux/time.h> 22 22 #include <linux/pagemap.h> 23 + #include <linux/dax.h> 23 24 #include <linux/quotaops.h> 24 25 #include "ext2.h" 25 26 #include "xattr.h" ··· 32 31 return dax_fault(vma, vmf, ext2_get_block, NULL); 33 32 } 34 33 34 + static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr, 35 + pmd_t *pmd, unsigned int flags) 36 + { 37 + return dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL); 38 + } 39 + 35 40 static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) 36 41 { 37 42 return dax_mkwrite(vma, vmf, ext2_get_block, NULL); ··· 45 38 46 39 static const struct vm_operations_struct ext2_dax_vm_ops = { 47 40 .fault = ext2_dax_fault, 41 + .pmd_fault = ext2_dax_pmd_fault, 48 42 .page_mkwrite = ext2_dax_mkwrite, 49 43 .pfn_mkwrite = dax_pfn_mkwrite, 50 44 }; ··· 57 49 58 50 file_accessed(file); 59 51 vma->vm_ops = &ext2_dax_vm_ops; 60 - vma->vm_flags |= VM_MIXEDMAP; 52 + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; 61 53 return 0; 62 54 } 63 55 #else
+1
fs/ext2/inode.c
··· 25 25 #include <linux/time.h> 26 26 #include <linux/highuid.h> 27 27 #include <linux/pagemap.h> 28 + #include <linux/dax.h> 28 29 #include <linux/quotaops.h> 29 30 #include <linux/writeback.h> 30 31 #include <linux/buffer_head.h>
+2
fs/ext4/ext4.h
··· 2272 2272 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int); 2273 2273 int ext4_get_block_write(struct inode *inode, sector_t iblock, 2274 2274 struct buffer_head *bh_result, int create); 2275 + int ext4_get_block_dax(struct inode *inode, sector_t iblock, 2276 + struct buffer_head *bh_result, int create); 2275 2277 int ext4_get_block(struct inode *inode, sector_t iblock, 2276 2278 struct buffer_head *bh_result, int create); 2277 2279 int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
+63 -5
fs/ext4/file.c
··· 22 22 #include <linux/fs.h> 23 23 #include <linux/mount.h> 24 24 #include <linux/path.h> 25 + #include <linux/dax.h> 25 26 #include <linux/quotaops.h> 26 27 #include <linux/pagevec.h> 27 28 #include <linux/uio.h> ··· 196 195 static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate) 197 196 { 198 197 struct inode *inode = bh->b_assoc_map->host; 199 - /* XXX: breaks on 32-bit > 16GB. Is that even supported? */ 198 + /* XXX: breaks on 32-bit > 16TB. Is that even supported? */ 200 199 loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits; 201 200 int err; 202 201 if (!uptodate) ··· 207 206 208 207 static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) 209 208 { 210 - return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten); 211 - /* Is this the right get_block? */ 209 + int result; 210 + handle_t *handle = NULL; 211 + struct super_block *sb = file_inode(vma->vm_file)->i_sb; 212 + bool write = vmf->flags & FAULT_FLAG_WRITE; 213 + 214 + if (write) { 215 + sb_start_pagefault(sb); 216 + file_update_time(vma->vm_file); 217 + handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, 218 + EXT4_DATA_TRANS_BLOCKS(sb)); 219 + } 220 + 221 + if (IS_ERR(handle)) 222 + result = VM_FAULT_SIGBUS; 223 + else 224 + result = __dax_fault(vma, vmf, ext4_get_block_dax, 225 + ext4_end_io_unwritten); 226 + 227 + if (write) { 228 + if (!IS_ERR(handle)) 229 + ext4_journal_stop(handle); 230 + sb_end_pagefault(sb); 231 + } 232 + 233 + return result; 234 + } 235 + 236 + static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr, 237 + pmd_t *pmd, unsigned int flags) 238 + { 239 + int result; 240 + handle_t *handle = NULL; 241 + struct inode *inode = file_inode(vma->vm_file); 242 + struct super_block *sb = inode->i_sb; 243 + bool write = flags & FAULT_FLAG_WRITE; 244 + 245 + if (write) { 246 + sb_start_pagefault(sb); 247 + file_update_time(vma->vm_file); 248 + handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, 249 + ext4_chunk_trans_blocks(inode, 250 + PMD_SIZE / PAGE_SIZE)); 251 + } 252 + 253 + if (IS_ERR(handle)) 254 + result = VM_FAULT_SIGBUS; 255 + else 256 + result = __dax_pmd_fault(vma, addr, pmd, flags, 257 + ext4_get_block_dax, ext4_end_io_unwritten); 258 + 259 + if (write) { 260 + if (!IS_ERR(handle)) 261 + ext4_journal_stop(handle); 262 + sb_end_pagefault(sb); 263 + } 264 + 265 + return result; 212 266 } 213 267 214 268 static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) 215 269 { 216 - return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten); 270 + return dax_mkwrite(vma, vmf, ext4_get_block_dax, 271 + ext4_end_io_unwritten); 217 272 } 218 273 219 274 static const struct vm_operations_struct ext4_dax_vm_ops = { 220 275 .fault = ext4_dax_fault, 276 + .pmd_fault = ext4_dax_pmd_fault, 221 277 .page_mkwrite = ext4_dax_mkwrite, 222 278 .pfn_mkwrite = dax_pfn_mkwrite, 223 279 }; ··· 302 244 file_accessed(file); 303 245 if (IS_DAX(file_inode(file))) { 304 246 vma->vm_ops = &ext4_dax_vm_ops; 305 247 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; 306 248 } else { 307 249 vma->vm_ops = &ext4_file_vm_ops; 308 250 }
+1
fs/ext4/indirect.c
··· 22 22 23 23 #include "ext4_jbd2.h" 24 24 #include "truncate.h" 25 + #include <linux/dax.h> 25 26 #include <linux/uio.h> 26 27 27 28 #include <trace/events/ext4.h>
+12
fs/ext4/inode.c
··· 22 22 #include <linux/time.h> 23 23 #include <linux/highuid.h> 24 24 #include <linux/pagemap.h> 25 + #include <linux/dax.h> 25 26 #include <linux/quotaops.h> 26 27 #include <linux/string.h> 27 28 #include <linux/buffer_head.h> ··· 3019 3018 inode->i_ino, create); 3020 3019 return _ext4_get_block(inode, iblock, bh_result, 3021 3020 EXT4_GET_BLOCKS_NO_LOCK); 3021 + } 3022 + 3023 + int ext4_get_block_dax(struct inode *inode, sector_t iblock, 3024 + struct buffer_head *bh_result, int create) 3025 + { 3026 + int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT; 3027 + if (create) 3028 + flags |= EXT4_GET_BLOCKS_CREATE; 3029 + ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n", 3030 + inode->i_ino, create); 3031 + return _ext4_get_block(inode, iblock, bh_result, flags); 3022 3032 } 3023 3033 3024 3034 static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
+284 -18
fs/hugetlbfs/inode.c
··· 12 12 #include <linux/thread_info.h> 13 13 #include <asm/current.h> 14 14 #include <linux/sched.h> /* remove ASAP */ 15 + #include <linux/falloc.h> 15 16 #include <linux/fs.h> 16 17 #include <linux/mount.h> 17 18 #include <linux/file.h> ··· 84 83 {Opt_min_size, "min_size=%s"}, 85 84 {Opt_err, NULL}, 86 85 }; 86 + 87 + #ifdef CONFIG_NUMA 88 + static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma, 89 + struct inode *inode, pgoff_t index) 90 + { 91 + vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, 92 + index); 93 + } 94 + 95 + static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma) 96 + { 97 + mpol_cond_put(vma->vm_policy); 98 + } 99 + #else 100 + static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma, 101 + struct inode *inode, pgoff_t index) 102 + { 103 + } 104 + 105 + static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma) 106 + { 107 + } 108 + #endif 87 109 88 110 static void huge_pagevec_release(struct pagevec *pvec) 89 111 { ··· 317 293 return -EINVAL; 318 294 } 319 295 320 - static void truncate_huge_page(struct page *page) 296 + static void remove_huge_page(struct page *page) 321 297 { 322 298 ClearPageDirty(page); 323 299 ClearPageUptodate(page); 324 300 delete_from_page_cache(page); 325 301 } 326 302 327 - static void truncate_hugepages(struct inode *inode, loff_t lstart) 303 + 304 + /* 305 + * remove_inode_hugepages handles two distinct cases: truncation and hole 306 + * punch. There are subtle differences in operation for each case. 307 + 308 + * truncation is indicated by end of range being LLONG_MAX 309 + * In this case, we first scan the range and release found pages. 310 + * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv 311 + * maps and global counts. 312 + * hole punch is indicated if end is not LLONG_MAX 313 + * In the hole punch case we scan the range and release found pages. 
314 + * Only when releasing a page is the associated region/reserv map 315 + * deleted. The region/reserv map for ranges without associated 316 + * pages are not modified. 317 + * Note: If the passed end of range value is beyond the end of file, but 318 + * not LLONG_MAX this routine still performs a hole punch operation. 319 + */ 320 + static void remove_inode_hugepages(struct inode *inode, loff_t lstart, 321 + loff_t lend) 328 322 { 329 323 struct hstate *h = hstate_inode(inode); 330 324 struct address_space *mapping = &inode->i_data; 331 325 const pgoff_t start = lstart >> huge_page_shift(h); 326 + const pgoff_t end = lend >> huge_page_shift(h); 327 + struct vm_area_struct pseudo_vma; 332 328 struct pagevec pvec; 333 329 pgoff_t next; 334 330 int i, freed = 0; 331 + long lookup_nr = PAGEVEC_SIZE; 332 + bool truncate_op = (lend == LLONG_MAX); 335 333 334 + memset(&pseudo_vma, 0, sizeof(struct vm_area_struct)); 335 + pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED); 336 336 pagevec_init(&pvec, 0); 337 337 next = start; 338 - while (1) { 339 - if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) { 338 + while (next < end) { 339 + /* 340 + * Make sure to never grab more pages that we 341 + * might possibly need. 342 + */ 343 + if (end - next < lookup_nr) 344 + lookup_nr = end - next; 345 + 346 + /* 347 + * This pagevec_lookup() may return pages past 'end', 348 + * so we must check for page->index > end. 
349 + */ 350 + if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) { 340 351 if (next == start) 341 352 break; 342 353 next = start; ··· 380 321 381 322 for (i = 0; i < pagevec_count(&pvec); ++i) { 382 323 struct page *page = pvec.pages[i]; 324 + u32 hash; 325 + 326 + hash = hugetlb_fault_mutex_hash(h, current->mm, 327 + &pseudo_vma, 328 + mapping, next, 0); 329 + mutex_lock(&hugetlb_fault_mutex_table[hash]); 383 330 384 331 lock_page(page); 332 + if (page->index >= end) { 333 + unlock_page(page); 334 + mutex_unlock(&hugetlb_fault_mutex_table[hash]); 335 + next = end; /* we are done */ 336 + break; 337 + } 338 + 339 + /* 340 + * If page is mapped, it was faulted in after being 341 + * unmapped. Do nothing in this race case. In the 342 + * normal case page is not mapped. 343 + */ 344 + if (!page_mapped(page)) { 345 + bool rsv_on_error = !PagePrivate(page); 346 + /* 347 + * We must free the huge page and remove 348 + * from page cache (remove_huge_page) BEFORE 349 + * removing the region/reserve map 350 + * (hugetlb_unreserve_pages). In rare out 351 + * of memory conditions, removal of the 352 + * region/reserve map could fail. Before 353 + * free'ing the page, note PagePrivate which 354 + * is used in case of error. 
355 + */ 356 + remove_huge_page(page); 357 + freed++; 358 + if (!truncate_op) { 359 + if (unlikely(hugetlb_unreserve_pages( 360 + inode, next, 361 + next + 1, 1))) 362 + hugetlb_fix_reserve_counts( 363 + inode, rsv_on_error); 364 + } 365 + } 366 + 385 367 if (page->index > next) 386 368 next = page->index; 369 + 387 370 ++next; 388 - truncate_huge_page(page); 389 371 unlock_page(page); 390 - freed++; 372 + 373 + mutex_unlock(&hugetlb_fault_mutex_table[hash]); 391 374 } 392 375 huge_pagevec_release(&pvec); 393 376 } 394 - BUG_ON(!lstart && mapping->nrpages); 395 - hugetlb_unreserve_pages(inode, start, freed); 377 + 378 + if (truncate_op) 379 + (void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed); 396 380 } 397 381 398 382 static void hugetlbfs_evict_inode(struct inode *inode) 399 383 { 400 384 struct resv_map *resv_map; 401 385 402 - truncate_hugepages(inode, 0); 386 + remove_inode_hugepages(inode, 0, LLONG_MAX); 403 387 resv_map = (struct resv_map *)inode->i_mapping->private_data; 404 388 /* root inode doesn't have the resv_map, so we should check it */ 405 389 if (resv_map) ··· 451 349 } 452 350 453 351 static inline void 454 - hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff) 352 + hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end) 455 353 { 456 354 struct vm_area_struct *vma; 457 355 458 - vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) { 356 + /* 357 + * end == 0 indicates that the entire range after 358 + * start should be unmapped. 359 + */ 360 + vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) { 459 361 unsigned long v_offset; 460 362 461 363 /* ··· 468 362 * which overlap the truncated area starting at pgoff, 469 363 * and no vma on a 32-bit arch can span beyond the 4GB. 
470 364 */ 471 - if (vma->vm_pgoff < pgoff) 472 - v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT; 365 + if (vma->vm_pgoff < start) 366 + v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT; 473 367 else 474 368 v_offset = 0; 475 369 476 - unmap_hugepage_range(vma, vma->vm_start + v_offset, 477 - vma->vm_end, NULL); 370 + if (end) { 371 + end = ((end - start) << PAGE_SHIFT) + 372 + vma->vm_start + v_offset; 373 + if (end > vma->vm_end) 374 + end = vma->vm_end; 375 + } else 376 + end = vma->vm_end; 377 + 378 + unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL); 478 379 } 479 380 } 480 381 ··· 497 384 i_size_write(inode, offset); 498 385 i_mmap_lock_write(mapping); 499 386 if (!RB_EMPTY_ROOT(&mapping->i_mmap)) 500 - hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); 387 + hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0); 501 388 i_mmap_unlock_write(mapping); 502 - truncate_hugepages(inode, offset); 389 + remove_inode_hugepages(inode, offset, LLONG_MAX); 503 390 return 0; 391 + } 392 + 393 + static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) 394 + { 395 + struct hstate *h = hstate_inode(inode); 396 + loff_t hpage_size = huge_page_size(h); 397 + loff_t hole_start, hole_end; 398 + 399 + /* 400 + * For hole punch round up the beginning offset of the hole and 401 + * round down the end. 
402 + */ 403 + hole_start = round_up(offset, hpage_size); 404 + hole_end = round_down(offset + len, hpage_size); 405 + 406 + if (hole_end > hole_start) { 407 + struct address_space *mapping = inode->i_mapping; 408 + 409 + mutex_lock(&inode->i_mutex); 410 + i_mmap_lock_write(mapping); 411 + if (!RB_EMPTY_ROOT(&mapping->i_mmap)) 412 + hugetlb_vmdelete_list(&mapping->i_mmap, 413 + hole_start >> PAGE_SHIFT, 414 + hole_end >> PAGE_SHIFT); 415 + i_mmap_unlock_write(mapping); 416 + remove_inode_hugepages(inode, hole_start, hole_end); 417 + mutex_unlock(&inode->i_mutex); 418 + } 419 + 420 + return 0; 421 + } 422 + 423 + static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset, 424 + loff_t len) 425 + { 426 + struct inode *inode = file_inode(file); 427 + struct address_space *mapping = inode->i_mapping; 428 + struct hstate *h = hstate_inode(inode); 429 + struct vm_area_struct pseudo_vma; 430 + struct mm_struct *mm = current->mm; 431 + loff_t hpage_size = huge_page_size(h); 432 + unsigned long hpage_shift = huge_page_shift(h); 433 + pgoff_t start, index, end; 434 + int error; 435 + u32 hash; 436 + 437 + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) 438 + return -EOPNOTSUPP; 439 + 440 + if (mode & FALLOC_FL_PUNCH_HOLE) 441 + return hugetlbfs_punch_hole(inode, offset, len); 442 + 443 + /* 444 + * Default preallocate case. 445 + * For this range, start is rounded down and end is rounded up 446 + * as well as being converted to page offsets. 447 + */ 448 + start = offset >> hpage_shift; 449 + end = (offset + len + hpage_size - 1) >> hpage_shift; 450 + 451 + mutex_lock(&inode->i_mutex); 452 + 453 + /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */ 454 + error = inode_newsize_ok(inode, offset + len); 455 + if (error) 456 + goto out; 457 + 458 + /* 459 + * Initialize a pseudo vma as this is required by the huge page 460 + * allocation routines. If NUMA is configured, use page index 461 + * as input to create an allocation policy. 
462 + */ 463 + memset(&pseudo_vma, 0, sizeof(struct vm_area_struct)); 464 + pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED); 465 + pseudo_vma.vm_file = file; 466 + 467 + for (index = start; index < end; index++) { 468 + /* 469 + * This is supposed to be the vaddr where the page is being 470 + * faulted in, but we have no vaddr here. 471 + */ 472 + struct page *page; 473 + unsigned long addr; 474 + int avoid_reserve = 0; 475 + 476 + cond_resched(); 477 + 478 + /* 479 + * fallocate(2) manpage permits EINTR; we may have been 480 + * interrupted because we are using up too much memory. 481 + */ 482 + if (signal_pending(current)) { 483 + error = -EINTR; 484 + break; 485 + } 486 + 487 + /* Set numa allocation policy based on index */ 488 + hugetlb_set_vma_policy(&pseudo_vma, inode, index); 489 + 490 + /* addr is the offset within the file (zero based) */ 491 + addr = index * hpage_size; 492 + 493 + /* mutex taken here, fault path and hole punch */ 494 + hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping, 495 + index, addr); 496 + mutex_lock(&hugetlb_fault_mutex_table[hash]); 497 + 498 + /* See if already present in mapping to avoid alloc/free */ 499 + page = find_get_page(mapping, index); 500 + if (page) { 501 + put_page(page); 502 + mutex_unlock(&hugetlb_fault_mutex_table[hash]); 503 + hugetlb_drop_vma_policy(&pseudo_vma); 504 + continue; 505 + } 506 + 507 + /* Allocate page and add to page cache */ 508 + page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve); 509 + hugetlb_drop_vma_policy(&pseudo_vma); 510 + if (IS_ERR(page)) { 511 + mutex_unlock(&hugetlb_fault_mutex_table[hash]); 512 + error = PTR_ERR(page); 513 + goto out; 514 + } 515 + clear_huge_page(page, addr, pages_per_huge_page(h)); 516 + __SetPageUptodate(page); 517 + error = huge_add_to_page_cache(page, mapping, index); 518 + if (unlikely(error)) { 519 + put_page(page); 520 + mutex_unlock(&hugetlb_fault_mutex_table[hash]); 521 + goto out; 522 + } 523 + 524 + 
mutex_unlock(&hugetlb_fault_mutex_table[hash]); 525 + 526 + /* 527 + * page_put due to reference from alloc_huge_page() 528 + * unlock_page because locked by add_to_page_cache() 529 + */ 530 + put_page(page); 531 + unlock_page(page); 532 + } 533 + 534 + if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) 535 + i_size_write(inode, offset + len); 536 + inode->i_ctime = CURRENT_TIME; 537 + spin_lock(&inode->i_lock); 538 + inode->i_private = NULL; 539 + spin_unlock(&inode->i_lock); 540 + out: 541 + mutex_unlock(&inode->i_mutex); 542 + return error; 504 543 } 505 544 506 545 static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr) ··· 966 701 .mmap = hugetlbfs_file_mmap, 967 702 .fsync = noop_fsync, 968 703 .get_unmapped_area = hugetlb_get_unmapped_area, 969 - .llseek = default_llseek, 704 + .llseek = default_llseek, 705 + .fallocate = hugetlbfs_fallocate, 970 706 }; 971 707 972 708 static const struct inode_operations hugetlbfs_dir_inode_operations = {
+132 -151
fs/proc/task_mmu.c
··· 446 446 unsigned long anonymous_thp; 447 447 unsigned long swap; 448 448 u64 pss; 449 + u64 swap_pss; 449 450 }; 450 451 451 452 static void smaps_account(struct mem_size_stats *mss, struct page *page, ··· 493 492 } else if (is_swap_pte(*pte)) { 494 493 swp_entry_t swpent = pte_to_swp_entry(*pte); 495 494 496 - if (!non_swap_entry(swpent)) 495 + if (!non_swap_entry(swpent)) { 496 + int mapcount; 497 + 497 498 mss->swap += PAGE_SIZE; 498 - else if (is_migration_entry(swpent)) 499 + mapcount = swp_swapcount(swpent); 500 + if (mapcount >= 2) { 501 + u64 pss_delta = (u64)PAGE_SIZE << PSS_SHIFT; 502 + 503 + do_div(pss_delta, mapcount); 504 + mss->swap_pss += pss_delta; 505 + } else { 506 + mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; 507 + } 508 + } else if (is_migration_entry(swpent)) 499 509 page = migration_entry_to_page(swpent); 500 510 } 501 511 ··· 652 640 "Anonymous: %8lu kB\n" 653 641 "AnonHugePages: %8lu kB\n" 654 642 "Swap: %8lu kB\n" 643 + "SwapPss: %8lu kB\n" 655 644 "KernelPageSize: %8lu kB\n" 656 645 "MMUPageSize: %8lu kB\n" 657 646 "Locked: %8lu kB\n", ··· 667 654 mss.anonymous >> 10, 668 655 mss.anonymous_thp >> 10, 669 656 mss.swap >> 10, 657 + (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)), 670 658 vma_kernel_pagesize(vma) >> 10, 671 659 vma_mmu_pagesize(vma) >> 10, 672 660 (vma->vm_flags & VM_LOCKED) ? ··· 725 711 .llseek = seq_lseek, 726 712 .release = proc_map_release, 727 713 }; 728 - 729 - /* 730 - * We do not want to have constant page-shift bits sitting in 731 - * pagemap entries and are about to reuse them some time soon. 732 - * 733 - * Here's the "migration strategy": 734 - * 1. when the system boots these bits remain what they are, 735 - * but a warning about future change is printed in log; 736 - * 2. once anyone clears soft-dirty bits via clear_refs file, 737 - * these flag is set to denote, that user is aware of the 738 - * new API and those page-shift bits change their meaning. 
739 - * The respective warning is printed in dmesg; 740 - * 3. In a couple of releases we will remove all the mentions 741 - * of page-shift in pagemap entries. 742 - */ 743 - 744 - static bool soft_dirty_cleared __read_mostly; 745 714 746 715 enum clear_refs_types { 747 716 CLEAR_REFS_ALL = 1, ··· 886 889 if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST) 887 890 return -EINVAL; 888 891 889 - if (type == CLEAR_REFS_SOFT_DIRTY) { 890 - soft_dirty_cleared = true; 891 - pr_warn_once("The pagemap bits 55-60 has changed their meaning!" 892 - " See the linux/Documentation/vm/pagemap.txt for " 893 - "details.\n"); 894 - } 895 - 896 892 task = get_proc_task(file_inode(file)); 897 893 if (!task) 898 894 return -ESRCH; ··· 953 963 struct pagemapread { 954 964 int pos, len; /* units: PM_ENTRY_BYTES, not bytes */ 955 965 pagemap_entry_t *buffer; 956 - bool v2; 966 + bool show_pfn; 957 967 }; 958 968 959 969 #define PAGEMAP_WALK_SIZE (PMD_SIZE) 960 970 #define PAGEMAP_WALK_MASK (PMD_MASK) 961 971 962 - #define PM_ENTRY_BYTES sizeof(pagemap_entry_t) 963 - #define PM_STATUS_BITS 3 964 - #define PM_STATUS_OFFSET (64 - PM_STATUS_BITS) 965 - #define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET) 966 - #define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK) 967 - #define PM_PSHIFT_BITS 6 968 - #define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS) 969 - #define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET) 970 - #define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK) 971 - #define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1) 972 - #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK) 973 - /* in "new" pagemap pshift bits are occupied with more status bits */ 974 - #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? 
x : PAGE_SHIFT)) 972 + #define PM_ENTRY_BYTES sizeof(pagemap_entry_t) 973 + #define PM_PFRAME_BITS 55 974 + #define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0) 975 + #define PM_SOFT_DIRTY BIT_ULL(55) 976 + #define PM_MMAP_EXCLUSIVE BIT_ULL(56) 977 + #define PM_FILE BIT_ULL(61) 978 + #define PM_SWAP BIT_ULL(62) 979 + #define PM_PRESENT BIT_ULL(63) 975 980 976 - #define __PM_SOFT_DIRTY (1LL) 977 - #define PM_PRESENT PM_STATUS(4LL) 978 - #define PM_SWAP PM_STATUS(2LL) 979 - #define PM_FILE PM_STATUS(1LL) 980 - #define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0) 981 981 #define PM_END_OF_BUFFER 1 982 982 983 - static inline pagemap_entry_t make_pme(u64 val) 983 + static inline pagemap_entry_t make_pme(u64 frame, u64 flags) 984 984 { 985 - return (pagemap_entry_t) { .pme = val }; 985 + return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags }; 986 986 } 987 987 988 988 static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme, ··· 993 1013 994 1014 while (addr < end) { 995 1015 struct vm_area_struct *vma = find_vma(walk->mm, addr); 996 - pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); 1016 + pagemap_entry_t pme = make_pme(0, 0); 997 1017 /* End of address space hole, which we mark as non-present. */ 998 1018 unsigned long hole_end; 999 1019 ··· 1013 1033 1014 1034 /* Addresses in the VMA. 
*/ 1015 1035 if (vma->vm_flags & VM_SOFTDIRTY) 1016 - pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY); 1036 + pme = make_pme(0, PM_SOFT_DIRTY); 1017 1037 for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) { 1018 1038 err = add_to_pagemap(addr, &pme, pm); 1019 1039 if (err) ··· 1024 1044 return err; 1025 1045 } 1026 1046 1027 - static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, 1047 + static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, 1028 1048 struct vm_area_struct *vma, unsigned long addr, pte_t pte) 1029 1049 { 1030 - u64 frame, flags; 1050 + u64 frame = 0, flags = 0; 1031 1051 struct page *page = NULL; 1032 - int flags2 = 0; 1033 1052 1034 1053 if (pte_present(pte)) { 1035 - frame = pte_pfn(pte); 1036 - flags = PM_PRESENT; 1054 + if (pm->show_pfn) 1055 + frame = pte_pfn(pte); 1056 + flags |= PM_PRESENT; 1037 1057 page = vm_normal_page(vma, addr, pte); 1038 1058 if (pte_soft_dirty(pte)) 1039 - flags2 |= __PM_SOFT_DIRTY; 1059 + flags |= PM_SOFT_DIRTY; 1040 1060 } else if (is_swap_pte(pte)) { 1041 1061 swp_entry_t entry; 1042 1062 if (pte_swp_soft_dirty(pte)) 1043 - flags2 |= __PM_SOFT_DIRTY; 1063 + flags |= PM_SOFT_DIRTY; 1044 1064 entry = pte_to_swp_entry(pte); 1045 1065 frame = swp_type(entry) | 1046 1066 (swp_offset(entry) << MAX_SWAPFILES_SHIFT); 1047 - flags = PM_SWAP; 1067 + flags |= PM_SWAP; 1048 1068 if (is_migration_entry(entry)) 1049 1069 page = migration_entry_to_page(entry); 1050 - } else { 1051 - if (vma->vm_flags & VM_SOFTDIRTY) 1052 - flags2 |= __PM_SOFT_DIRTY; 1053 - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2)); 1054 - return; 1055 1070 } 1056 1071 1057 1072 if (page && !PageAnon(page)) 1058 1073 flags |= PM_FILE; 1059 - if ((vma->vm_flags & VM_SOFTDIRTY)) 1060 - flags2 |= __PM_SOFT_DIRTY; 1074 + if (page && page_mapcount(page) == 1) 1075 + flags |= PM_MMAP_EXCLUSIVE; 1076 + if (vma->vm_flags & VM_SOFTDIRTY) 1077 + flags |= PM_SOFT_DIRTY; 1061 1078 1062 - *pme = 
make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags); 1079 + return make_pme(frame, flags); 1063 1080 } 1064 1081 1065 - #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1066 - static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, 1067 - pmd_t pmd, int offset, int pmd_flags2) 1068 - { 1069 - /* 1070 - * Currently pmd for thp is always present because thp can not be 1071 - * swapped-out, migrated, or HWPOISONed (split in such cases instead.) 1072 - * This if-check is just to prepare for future implementation. 1073 - */ 1074 - if (pmd_present(pmd)) 1075 - *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset) 1076 - | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT); 1077 - else 1078 - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2)); 1079 - } 1080 - #else 1081 - static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, 1082 - pmd_t pmd, int offset, int pmd_flags2) 1083 - { 1084 - } 1085 - #endif 1086 - 1087 - static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, 1082 + static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, 1088 1083 struct mm_walk *walk) 1089 1084 { 1090 1085 struct vm_area_struct *vma = walk->vma; ··· 1068 1113 pte_t *pte, *orig_pte; 1069 1114 int err = 0; 1070 1115 1071 - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { 1072 - int pmd_flags2; 1116 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 1117 + if (pmd_trans_huge_lock(pmdp, vma, &ptl) == 1) { 1118 + u64 flags = 0, frame = 0; 1119 + pmd_t pmd = *pmdp; 1073 1120 1074 - if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd)) 1075 - pmd_flags2 = __PM_SOFT_DIRTY; 1076 - else 1077 - pmd_flags2 = 0; 1121 + if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd)) 1122 + flags |= PM_SOFT_DIRTY; 1123 + 1124 + /* 1125 + * Currently pmd for thp is always present because thp 1126 + * can not be swapped-out, migrated, or HWPOISONed 1127 + * (split in such cases instead.) 
1128 + * This if-check is just to prepare for future implementation. 1129 + */ 1130 + if (pmd_present(pmd)) { 1131 + struct page *page = pmd_page(pmd); 1132 + 1133 + if (page_mapcount(page) == 1) 1134 + flags |= PM_MMAP_EXCLUSIVE; 1135 + 1136 + flags |= PM_PRESENT; 1137 + if (pm->show_pfn) 1138 + frame = pmd_pfn(pmd) + 1139 + ((addr & ~PMD_MASK) >> PAGE_SHIFT); 1140 + } 1078 1141 1079 1142 for (; addr != end; addr += PAGE_SIZE) { 1080 - unsigned long offset; 1081 - pagemap_entry_t pme; 1143 + pagemap_entry_t pme = make_pme(frame, flags); 1082 1144 1083 - offset = (addr & ~PAGEMAP_WALK_MASK) >> 1084 - PAGE_SHIFT; 1085 - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2); 1086 1145 err = add_to_pagemap(addr, &pme, pm); 1087 1146 if (err) 1088 1147 break; 1148 + if (pm->show_pfn && (flags & PM_PRESENT)) 1149 + frame++; 1089 1150 } 1090 1151 spin_unlock(ptl); 1091 1152 return err; 1092 1153 } 1093 1154 1094 - if (pmd_trans_unstable(pmd)) 1155 + if (pmd_trans_unstable(pmdp)) 1095 1156 return 0; 1157 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 1096 1158 1097 1159 /* 1098 1160 * We can assume that @vma always points to a valid one and @end never 1099 1161 * goes beyond vma->vm_end. 
1100 1162 */ 1101 - orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl); 1163 + orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl); 1102 1164 for (; addr < end; pte++, addr += PAGE_SIZE) { 1103 1165 pagemap_entry_t pme; 1104 1166 1105 - pte_to_pagemap_entry(&pme, pm, vma, addr, *pte); 1167 + pme = pte_to_pagemap_entry(pm, vma, addr, *pte); 1106 1168 err = add_to_pagemap(addr, &pme, pm); 1107 1169 if (err) 1108 1170 break; ··· 1132 1160 } 1133 1161 1134 1162 #ifdef CONFIG_HUGETLB_PAGE 1135 - static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm, 1136 - pte_t pte, int offset, int flags2) 1137 - { 1138 - if (pte_present(pte)) 1139 - *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) | 1140 - PM_STATUS2(pm->v2, flags2) | 1141 - PM_PRESENT); 1142 - else 1143 - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | 1144 - PM_STATUS2(pm->v2, flags2)); 1145 - } 1146 - 1147 1163 /* This function walks within one hugetlb entry in the single call */ 1148 - static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask, 1164 + static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask, 1149 1165 unsigned long addr, unsigned long end, 1150 1166 struct mm_walk *walk) 1151 1167 { 1152 1168 struct pagemapread *pm = walk->private; 1153 1169 struct vm_area_struct *vma = walk->vma; 1170 + u64 flags = 0, frame = 0; 1154 1171 int err = 0; 1155 - int flags2; 1156 - pagemap_entry_t pme; 1172 + pte_t pte; 1157 1173 1158 1174 if (vma->vm_flags & VM_SOFTDIRTY) 1159 - flags2 = __PM_SOFT_DIRTY; 1160 - else 1161 - flags2 = 0; 1175 + flags |= PM_SOFT_DIRTY; 1176 + 1177 + pte = huge_ptep_get(ptep); 1178 + if (pte_present(pte)) { 1179 + struct page *page = pte_page(pte); 1180 + 1181 + if (!PageAnon(page)) 1182 + flags |= PM_FILE; 1183 + 1184 + if (page_mapcount(page) == 1) 1185 + flags |= PM_MMAP_EXCLUSIVE; 1186 + 1187 + flags |= PM_PRESENT; 1188 + if (pm->show_pfn) 1189 + frame = pte_pfn(pte) + 1190 + ((addr & ~hmask) >> PAGE_SHIFT); 1191 + } 1162 
1192 1163 1193 for (; addr != end; addr += PAGE_SIZE) { 1164 - int offset = (addr & ~hmask) >> PAGE_SHIFT; 1165 - huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2); 1194 + pagemap_entry_t pme = make_pme(frame, flags); 1195 + 1166 1196 err = add_to_pagemap(addr, &pme, pm); 1167 1197 if (err) 1168 1198 return err; 1199 + if (pm->show_pfn && (flags & PM_PRESENT)) 1200 + frame++; 1169 1201 } 1170 1202 1171 1203 cond_resched(); ··· 1187 1211 * Bits 0-54 page frame number (PFN) if present 1188 1212 * Bits 0-4 swap type if swapped 1189 1213 * Bits 5-54 swap offset if swapped 1190 - * Bits 55-60 page shift (page size = 1<<page shift) 1214 + * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt) 1215 + * Bit 56 page exclusively mapped 1216 + * Bits 57-60 zero 1191 1217 * Bit 61 page is file-page or shared-anon 1192 1218 * Bit 62 page swapped 1193 1219 * Bit 63 page present ··· 1207 1229 static ssize_t pagemap_read(struct file *file, char __user *buf, 1208 1230 size_t count, loff_t *ppos) 1209 1231 { 1210 - struct task_struct *task = get_proc_task(file_inode(file)); 1211 - struct mm_struct *mm; 1232 + struct mm_struct *mm = file->private_data; 1212 1233 struct pagemapread pm; 1213 - int ret = -ESRCH; 1214 1234 struct mm_walk pagemap_walk = {}; 1215 1235 unsigned long src; 1216 1236 unsigned long svpfn; 1217 1237 unsigned long start_vaddr; 1218 1238 unsigned long end_vaddr; 1219 - int copied = 0; 1239 + int ret = 0, copied = 0; 1220 1240 1221 - if (!task) 1241 + if (!mm || !atomic_inc_not_zero(&mm->mm_users)) 1222 1242 goto out; 1223 1243 1224 1244 ret = -EINVAL; 1225 1245 /* file position must be aligned */ 1226 1246 if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES)) 1227 - goto out_task; 1247 + goto out_mm; 1228 1248 1229 1249 ret = 0; 1230 1250 if (!count) 1231 - goto out_task; 1251 + goto out_mm; 1232 1252 1233 - pm.v2 = soft_dirty_cleared; 1253 + /* do not disclose physical addresses: attack vector */ 1254 + pm.show_pfn = file_ns_capable(file, 
&init_user_ns, CAP_SYS_ADMIN); 1255 + 1234 1256 pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT); 1235 1257 pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY); 1236 1258 ret = -ENOMEM; 1237 1259 if (!pm.buffer) 1238 - goto out_task; 1260 + goto out_mm; 1239 1261 1240 - mm = mm_access(task, PTRACE_MODE_READ); 1241 - ret = PTR_ERR(mm); 1242 - if (!mm || IS_ERR(mm)) 1243 - goto out_free; 1244 - 1245 - pagemap_walk.pmd_entry = pagemap_pte_range; 1262 + pagemap_walk.pmd_entry = pagemap_pmd_range; 1246 1263 pagemap_walk.pte_hole = pagemap_pte_hole; 1247 1264 #ifdef CONFIG_HUGETLB_PAGE 1248 1265 pagemap_walk.hugetlb_entry = pagemap_hugetlb_range; ··· 1248 1275 src = *ppos; 1249 1276 svpfn = src / PM_ENTRY_BYTES; 1250 1277 start_vaddr = svpfn << PAGE_SHIFT; 1251 - end_vaddr = TASK_SIZE_OF(task); 1278 + end_vaddr = mm->task_size; 1252 1279 1253 1280 /* watch out for wraparound */ 1254 - if (svpfn > TASK_SIZE_OF(task) >> PAGE_SHIFT) 1281 + if (svpfn > mm->task_size >> PAGE_SHIFT) 1255 1282 start_vaddr = end_vaddr; 1256 1283 1257 1284 /* ··· 1278 1305 len = min(count, PM_ENTRY_BYTES * pm.pos); 1279 1306 if (copy_to_user(buf, pm.buffer, len)) { 1280 1307 ret = -EFAULT; 1281 - goto out_mm; 1308 + goto out_free; 1282 1309 } 1283 1310 copied += len; 1284 1311 buf += len; ··· 1288 1315 if (!ret || ret == PM_END_OF_BUFFER) 1289 1316 ret = copied; 1290 1317 1291 - out_mm: 1292 - mmput(mm); 1293 1318 out_free: 1294 1319 kfree(pm.buffer); 1295 - out_task: 1296 - put_task_struct(task); 1320 + out_mm: 1321 + mmput(mm); 1297 1322 out: 1298 1323 return ret; 1299 1324 } 1300 1325 1301 1326 static int pagemap_open(struct inode *inode, struct file *file) 1302 1327 { 1303 - /* do not disclose physical addresses: attack vector */ 1304 - if (!capable(CAP_SYS_ADMIN)) 1305 - return -EPERM; 1306 - pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about " 1307 - "to stop being page-shift some time soon. 
See the " 1308 - "linux/Documentation/vm/pagemap.txt for details.\n"); 1328 + struct mm_struct *mm; 1329 + 1330 + mm = proc_mem_open(inode, PTRACE_MODE_READ); 1331 + if (IS_ERR(mm)) 1332 + return PTR_ERR(mm); 1333 + file->private_data = mm; 1334 + return 0; 1335 + } 1336 + 1337 + static int pagemap_release(struct inode *inode, struct file *file) 1338 + { 1339 + struct mm_struct *mm = file->private_data; 1340 + 1341 + if (mm) 1342 + mmdrop(mm); 1309 1343 return 0; 1310 1344 } 1311 1345 ··· 1320 1340 .llseek = mem_lseek, /* borrow this */ 1321 1341 .read = pagemap_read, 1322 1342 .open = pagemap_open, 1343 + .release = pagemap_release, 1323 1344 }; 1324 1345 #endif /* CONFIG_PROC_PAGE_MONITOR */ 1325 1346
+1
fs/xfs/xfs_buf.h
··· 23 23 #include <linux/spinlock.h> 24 24 #include <linux/mm.h> 25 25 #include <linux/fs.h> 26 + #include <linux/dax.h> 26 27 #include <linux/buffer_head.h> 27 28 #include <linux/uio.h> 28 29 #include <linux/list_lru.h>
+29 -1
fs/xfs/xfs_file.c
··· 1546 1546 return ret; 1547 1547 } 1548 1548 1549 + STATIC int 1550 + xfs_filemap_pmd_fault( 1551 + struct vm_area_struct *vma, 1552 + unsigned long addr, 1553 + pmd_t *pmd, 1554 + unsigned int flags) 1555 + { 1556 + struct inode *inode = file_inode(vma->vm_file); 1557 + struct xfs_inode *ip = XFS_I(inode); 1558 + int ret; 1559 + 1560 + if (!IS_DAX(inode)) 1561 + return VM_FAULT_FALLBACK; 1562 + 1563 + trace_xfs_filemap_pmd_fault(ip); 1564 + 1565 + sb_start_pagefault(inode->i_sb); 1566 + file_update_time(vma->vm_file); 1567 + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); 1568 + ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_direct, 1569 + xfs_end_io_dax_write); 1570 + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); 1571 + sb_end_pagefault(inode->i_sb); 1572 + 1573 + return ret; 1574 + } 1575 + 1549 1576 static const struct vm_operations_struct xfs_file_vm_ops = { 1550 1577 .fault = xfs_filemap_fault, 1578 + .pmd_fault = xfs_filemap_pmd_fault, 1551 1579 .map_pages = filemap_map_pages, 1552 1580 .page_mkwrite = xfs_filemap_page_mkwrite, 1553 1581 }; ··· 1588 1560 file_accessed(filp); 1589 1561 vma->vm_ops = &xfs_file_vm_ops; 1590 1562 if (IS_DAX(file_inode(filp))) 1591 - vma->vm_flags |= VM_MIXEDMAP; 1563 + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; 1592 1564 return 0; 1593 1565 } 1594 1566
+1
fs/xfs/xfs_trace.h
··· 687 687 DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid); 688 688 689 689 DEFINE_INODE_EVENT(xfs_filemap_fault); 690 + DEFINE_INODE_EVENT(xfs_filemap_pmd_fault); 690 691 DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite); 691 692 692 693 DECLARE_EVENT_CLASS(xfs_iref_class,
+6
include/asm-generic/early_ioremap.h
··· 35 35 */ 36 36 extern void early_ioremap_reset(void); 37 37 38 + /* 39 + * Early copy from unmapped memory to kernel mapped memory. 40 + */ 41 + extern void copy_from_early_mem(void *dest, phys_addr_t src, 42 + unsigned long size); 43 + 38 44 #else 39 45 static inline void early_ioremap_init(void) { } 40 46 static inline void early_ioremap_setup(void) { }
+39
include/linux/dax.h
··· 1 + #ifndef _LINUX_DAX_H 2 + #define _LINUX_DAX_H 3 + 4 + #include <linux/fs.h> 5 + #include <linux/mm.h> 6 + #include <asm/pgtable.h> 7 + 8 + ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, 9 + get_block_t, dio_iodone_t, int flags); 10 + int dax_clear_blocks(struct inode *, sector_t block, long size); 11 + int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); 12 + int dax_truncate_page(struct inode *, loff_t from, get_block_t); 13 + int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, 14 + dax_iodone_t); 15 + int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, 16 + dax_iodone_t); 17 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 18 + int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *, 19 + unsigned int flags, get_block_t, dax_iodone_t); 20 + int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *, 21 + unsigned int flags, get_block_t, dax_iodone_t); 22 + #else 23 + static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr, 24 + pmd_t *pmd, unsigned int flags, get_block_t gb, 25 + dax_iodone_t di) 26 + { 27 + return VM_FAULT_FALLBACK; 28 + } 29 + #define __dax_pmd_fault dax_pmd_fault 30 + #endif 31 + int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); 32 + #define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod) 33 + #define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod) 34 + 35 + static inline bool vma_is_dax(struct vm_area_struct *vma) 36 + { 37 + return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); 38 + } 39 + #endif
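The new dax.h uses a common kernel idiom: when CONFIG_TRANSPARENT_HUGEPAGE is off, dax_pmd_fault() collapses to an inline stub returning VM_FAULT_FALLBACK, so callers need no #ifdef of their own. A minimal compile-time sketch of that fallback-stub pattern (`HAVE_PMD` and `pmd_fault` are illustrative stand-ins; the VM_FAULT_FALLBACK value is borrowed from the kernel headers):

```c
#include <assert.h>

#define VM_FAULT_FALLBACK 0x800   /* value as defined in the kernel's mm.h */

/* Flip this to 1 to model a CONFIG_TRANSPARENT_HUGEPAGE=y build. */
#define HAVE_PMD 0

#if HAVE_PMD
/* Real huge-page fault handling would live here. */
static int pmd_fault(void) { return 0; }
#else
/* Stub: callers unconditionally take the PTE-sized fallback path. */
static inline int pmd_fault(void) { return VM_FAULT_FALLBACK; }
#endif
```

Call sites stay identical in both configurations; only the linker-visible body changes, which is exactly what the `#define __dax_pmd_fault dax_pmd_fault` alias in the diff achieves for the two entry points.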
+6
include/linux/dmapool.h
··· 24 24 void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, 25 25 dma_addr_t *handle); 26 26 27 + static inline void *dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, 28 + dma_addr_t *handle) 29 + { 30 + return dma_pool_alloc(pool, mem_flags | __GFP_ZERO, handle); 31 + } 32 + 27 33 void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t addr); 28 34 29 35 /*
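dma_pool_zalloc() above is just dma_pool_alloc() with __GFP_ZERO OR-ed into the flags, so the zeroing is delegated to the underlying allocator. A hedged userspace sketch of the same flag-driven zalloc wrapper (`FLAG_ZERO` and `pool_alloc` are stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FLAG_ZERO 0x1u   /* stand-in for __GFP_ZERO */

/* Base allocator honours the zero flag, as the page allocator does. */
static void *pool_alloc(size_t size, unsigned int flags)
{
    void *p = malloc(size);

    if (p && (flags & FLAG_ZERO))
        memset(p, 0, size);
    return p;
}

/* The zalloc wrapper only adds the flag, mirroring dma_pool_zalloc(). */
static inline void *pool_zalloc(size_t size, unsigned int flags)
{
    return pool_alloc(size, flags | FLAG_ZERO);
}
```

Keeping the wrapper as a one-line static inline means callers get zeroed memory without a second memset pass and without the allocator growing a parallel entry point.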
-14
include/linux/fs.h
··· 52 52 struct seq_file; 53 53 struct workqueue_struct; 54 54 struct iov_iter; 55 - struct vm_fault; 56 55 57 56 extern void __init inode_init(void); 58 57 extern void __init inode_init_early(void); ··· 2676 2677 int whence, loff_t size); 2677 2678 extern int generic_file_open(struct inode * inode, struct file * filp); 2678 2679 extern int nonseekable_open(struct inode * inode, struct file * filp); 2679 - 2680 - ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t, 2681 - get_block_t, dio_iodone_t, int flags); 2682 - int dax_clear_blocks(struct inode *, sector_t block, long size); 2683 - int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t); 2684 - int dax_truncate_page(struct inode *, loff_t from, get_block_t); 2685 - int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, 2686 - dax_iodone_t); 2687 - int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t, 2688 - dax_iodone_t); 2689 - int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *); 2690 - #define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod) 2691 - #define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod) 2692 2680 2693 2681 #ifdef CONFIG_BLOCK 2694 2682 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
+21 -10
include/linux/gfp.h
··· 63 63 * but it is definitely preferable to use the flag rather than opencode endless 64 64 * loop around allocator. 65 65 * 66 - * __GFP_NORETRY: The VM implementation must not retry indefinitely. 66 + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will 67 + * return NULL when direct reclaim and memory compaction have failed to allow 68 + * the allocation to succeed. The OOM killer is not called with the current 69 + * implementation. 67 70 * 68 71 * __GFP_MOVABLE: Flag that this page will be movable by the page migration 69 72 * mechanism or reclaimed ··· 303 300 return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); 304 301 } 305 302 306 - static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, 307 - unsigned int order) 303 + /* 304 + * Allocate pages, preferring the node given as nid. The node must be valid and 305 + * online. For more general interface, see alloc_pages_node(). 306 + */ 307 + static inline struct page * 308 + __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) 308 309 { 309 - /* Unknown node is current node */ 310 - if (nid < 0) 311 - nid = numa_node_id(); 310 + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); 311 + VM_WARN_ON(!node_online(nid)); 312 312 313 313 return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); 314 314 } 315 315 316 - static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask, 316 + /* 317 + * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE, 318 + * prefer the current CPU's closest node. Otherwise node must be valid and 319 + * online. 
320 + */ 321 + static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, 317 322 unsigned int order) 318 323 { 319 - VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); 324 + if (nid == NUMA_NO_NODE) 325 + nid = numa_mem_id(); 320 326 321 - return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); 327 + return __alloc_pages_node(nid, gfp_mask, order); 322 328 } 323 329 324 330 #ifdef CONFIG_NUMA ··· 366 354 367 355 void *alloc_pages_exact(size_t size, gfp_t gfp_mask); 368 356 void free_pages_exact(void *virt, size_t size); 369 - /* This is different from alloc_pages_exact_node !!! */ 370 357 void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); 371 358 372 359 #define __get_free_page(gfp_mask) \
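The gfp.h change splits node-local allocation into a fast, validated inner helper (__alloc_pages_node(), which VM_BUG_ONs on a bad nid) and an outer alloc_pages_node() that first maps NUMA_NO_NODE to the caller's nearest node. A sketch of that two-tier validate-then-resolve pattern (`MAX_NODES`, `current_node`, and the int return are illustrative):

```c
#include <assert.h>

#define MAX_NODES    4
#define NUMA_NO_NODE (-1)

/* Stand-in for numa_mem_id(): the node closest to the current CPU. */
static int current_node(void)
{
    return 0;
}

/* Inner helper: nid must already be valid, as in __alloc_pages_node(). */
static int alloc_on_node(int nid)
{
    assert(nid >= 0 && nid < MAX_NODES);   /* models VM_BUG_ON() */
    return nid;                            /* pretend we allocated on nid */
}

/* Outer helper: resolves NUMA_NO_NODE first, as in alloc_pages_node(). */
static int alloc_pages_node_sketch(int nid)
{
    if (nid == NUMA_NO_NODE)
        nid = current_node();
    return alloc_on_node(nid);
}
```

Callers that already know a valid node can take the inner path and skip the branch, while generic callers keep the forgiving interface; that is the stated point of the refactor in the diff.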
+10 -10
include/linux/huge_mm.h
··· 33 33 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, 34 34 unsigned long addr, pgprot_t newprot, 35 35 int prot_numa); 36 + int vmf_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *, 37 + unsigned long pfn, bool write); 36 38 37 39 enum transparent_hugepage_flag { 38 40 TRANSPARENT_HUGEPAGE_FLAG, ··· 124 122 #endif 125 123 extern int hugepage_madvise(struct vm_area_struct *vma, 126 124 unsigned long *vm_flags, int advice); 127 - extern void __vma_adjust_trans_huge(struct vm_area_struct *vma, 125 + extern void vma_adjust_trans_huge(struct vm_area_struct *vma, 128 126 unsigned long start, 129 127 unsigned long end, 130 128 long adjust_next); ··· 139 137 return __pmd_trans_huge_lock(pmd, vma, ptl); 140 138 else 141 139 return 0; 142 - } 143 - static inline void vma_adjust_trans_huge(struct vm_area_struct *vma, 144 - unsigned long start, 145 - unsigned long end, 146 - long adjust_next) 147 - { 148 - if (!vma->anon_vma || vma->vm_ops) 149 - return; 150 - __vma_adjust_trans_huge(vma, start, end, adjust_next); 151 140 } 152 141 static inline int hpage_nr_pages(struct page *page) 153 142 { ··· 156 163 { 157 164 return ACCESS_ONCE(huge_zero_page) == page; 158 165 } 166 + 167 + static inline bool is_huge_zero_pmd(pmd_t pmd) 168 + { 169 + return is_huge_zero_page(pmd_page(pmd)); 170 + } 171 + 172 + struct page *get_huge_zero_page(void); 159 173 160 174 #else /* CONFIG_TRANSPARENT_HUGEPAGE */ 161 175 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
+16 -1
include/linux/hugetlb.h
··· 35 35 struct kref refs; 36 36 spinlock_t lock; 37 37 struct list_head regions; 38 + long adds_in_progress; 39 + struct list_head region_cache; 40 + long region_cache_count; 38 41 }; 39 42 extern struct resv_map *resv_map_alloc(void); 40 43 void resv_map_release(struct kref *ref); ··· 83 80 int hugetlb_reserve_pages(struct inode *inode, long from, long to, 84 81 struct vm_area_struct *vma, 85 82 vm_flags_t vm_flags); 86 - void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed); 83 + long hugetlb_unreserve_pages(struct inode *inode, long start, long end, 84 + long freed); 87 85 int dequeue_hwpoisoned_huge_page(struct page *page); 88 86 bool isolate_huge_page(struct page *page, struct list_head *list); 89 87 void putback_active_hugepage(struct page *page); 90 88 void free_huge_page(struct page *page); 89 + void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve); 90 + extern struct mutex *hugetlb_fault_mutex_table; 91 + u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm, 92 + struct vm_area_struct *vma, 93 + struct address_space *mapping, 94 + pgoff_t idx, unsigned long address); 91 95 92 96 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE 93 97 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud); ··· 330 320 #endif 331 321 }; 332 322 323 + struct page *alloc_huge_page(struct vm_area_struct *vma, 324 + unsigned long addr, int avoid_reserve); 333 325 struct page *alloc_huge_page_node(struct hstate *h, int nid); 334 326 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma, 335 327 unsigned long addr, int avoid_reserve); 328 + int huge_add_to_page_cache(struct page *page, struct address_space *mapping, 329 + pgoff_t idx); 336 330 337 331 /* arch callback */ 338 332 int __init alloc_bootmem_huge_page(struct hstate *h); ··· 485 471 486 472 #else /* CONFIG_HUGETLB_PAGE */ 487 473 struct hstate {}; 474 + #define alloc_huge_page(v, a, r) NULL 488 475 #define alloc_huge_page_node(h, nid) NULL 489 
476 #define alloc_huge_page_noerr(v, a, r) NULL 490 477 #define alloc_bootmem_huge_page(h) NULL
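hugetlb_fault_mutex_hash() above hashes identifying fault parameters down to one mutex out of hugetlb_fault_mutex_table, serialising faults on the same huge page without a single global lock. A hedged sketch of that hashed lock-table idea (the table size, mixing constant, and `fault_mutex_idx` are illustrative, not the kernel's actual hash):

```c
#include <assert.h>
#include <stdint.h>

#define FAULT_MUTEX_TABLE_SIZE 64   /* power of two, so masking works */

/*
 * Mix the mapping pointer and page index into a table slot. Faults on the
 * same (mapping, idx) pair always land on the same slot, hence contend on
 * the same lock; unrelated faults usually spread across the table.
 */
static unsigned int fault_mutex_idx(const void *mapping, uint64_t idx)
{
    uint64_t key = (uint64_t)(uintptr_t)mapping
                 ^ (idx * 0x9e3779b97f4a7c15ull);

    return (unsigned int)(key & (FAULT_MUTEX_TABLE_SIZE - 1));
}
```

The essential properties are determinism (same inputs, same slot) and a bounded range, so the index can safely select from a fixed array of mutexes.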
+3 -1
include/linux/memblock.h
··· 77 77 int memblock_free(phys_addr_t base, phys_addr_t size); 78 78 int memblock_reserve(phys_addr_t base, phys_addr_t size); 79 79 void memblock_trim_memory(phys_addr_t align); 80 + bool memblock_overlaps_region(struct memblock_type *type, 81 + phys_addr_t base, phys_addr_t size); 80 82 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size); 81 83 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size); 82 84 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); ··· 325 323 int memblock_is_memory(phys_addr_t addr); 326 324 int memblock_is_region_memory(phys_addr_t base, phys_addr_t size); 327 325 int memblock_is_reserved(phys_addr_t addr); 328 - int memblock_is_region_reserved(phys_addr_t base, phys_addr_t size); 326 + bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size); 329 327 330 328 extern void __memblock_dump_all(void); 331 329
+345 -43
include/linux/memcontrol.h
··· 23 23 #include <linux/vm_event_item.h> 24 24 #include <linux/hardirq.h> 25 25 #include <linux/jump_label.h> 26 + #include <linux/page_counter.h> 27 + #include <linux/vmpressure.h> 28 + #include <linux/eventfd.h> 29 + #include <linux/mmzone.h> 30 + #include <linux/writeback.h> 26 31 27 32 struct mem_cgroup; 28 33 struct page; ··· 72 67 MEMCG_NR_EVENTS, 73 68 }; 74 69 70 + /* 71 + * Per memcg event counter is incremented at every pagein/pageout. With THP, 72 + * it will be incremated by the number of pages. This counter is used for 73 + * for trigger some periodic events. This is straightforward and better 74 + * than using jiffies etc. to handle periodic memcg event. 75 + */ 76 + enum mem_cgroup_events_target { 77 + MEM_CGROUP_TARGET_THRESH, 78 + MEM_CGROUP_TARGET_SOFTLIMIT, 79 + MEM_CGROUP_TARGET_NUMAINFO, 80 + MEM_CGROUP_NTARGETS, 81 + }; 82 + 83 + /* 84 + * Bits in struct cg_proto.flags 85 + */ 86 + enum cg_proto_flags { 87 + /* Currently active and new sockets should be assigned to cgroups */ 88 + MEMCG_SOCK_ACTIVE, 89 + /* It was ever activated; we must disarm static keys on destruction */ 90 + MEMCG_SOCK_ACTIVATED, 91 + }; 92 + 93 + struct cg_proto { 94 + struct page_counter memory_allocated; /* Current allocated memory. */ 95 + struct percpu_counter sockets_allocated; /* Current number of sockets. */ 96 + int memory_pressure; 97 + long sysctl_mem[3]; 98 + unsigned long flags; 99 + /* 100 + * memcg field is used to find which memcg we belong directly 101 + * Each memcg struct can hold more than one cg_proto, so container_of 102 + * won't really cut. 103 + * 104 + * The elegant solution would be having an inverse function to 105 + * proto_cgroup in struct proto, but that means polluting the structure 106 + * for everybody, instead of just for memcg users. 
107 + */ 108 + struct mem_cgroup *memcg; 109 + }; 110 + 75 111 #ifdef CONFIG_MEMCG 112 + struct mem_cgroup_stat_cpu { 113 + long count[MEM_CGROUP_STAT_NSTATS]; 114 + unsigned long events[MEMCG_NR_EVENTS]; 115 + unsigned long nr_page_events; 116 + unsigned long targets[MEM_CGROUP_NTARGETS]; 117 + }; 118 + 119 + struct mem_cgroup_reclaim_iter { 120 + struct mem_cgroup *position; 121 + /* scan generation, increased every round-trip */ 122 + unsigned int generation; 123 + }; 124 + 125 + /* 126 + * per-zone information in memory controller. 127 + */ 128 + struct mem_cgroup_per_zone { 129 + struct lruvec lruvec; 130 + unsigned long lru_size[NR_LRU_LISTS]; 131 + 132 + struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1]; 133 + 134 + struct rb_node tree_node; /* RB tree node */ 135 + unsigned long usage_in_excess;/* Set to the value by which */ 136 + /* the soft limit is exceeded*/ 137 + bool on_tree; 138 + struct mem_cgroup *memcg; /* Back pointer, we cannot */ 139 + /* use container_of */ 140 + }; 141 + 142 + struct mem_cgroup_per_node { 143 + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; 144 + }; 145 + 146 + struct mem_cgroup_threshold { 147 + struct eventfd_ctx *eventfd; 148 + unsigned long threshold; 149 + }; 150 + 151 + /* For threshold */ 152 + struct mem_cgroup_threshold_ary { 153 + /* An array index points to threshold just below or equal to usage. */ 154 + int current_threshold; 155 + /* Size of entries[] */ 156 + unsigned int size; 157 + /* Array of thresholds */ 158 + struct mem_cgroup_threshold entries[0]; 159 + }; 160 + 161 + struct mem_cgroup_thresholds { 162 + /* Primary thresholds array */ 163 + struct mem_cgroup_threshold_ary *primary; 164 + /* 165 + * Spare threshold array. 166 + * This is needed to make mem_cgroup_unregister_event() "never fail". 167 + * It must be able to store at least primary->size - 1 entries. 168 + */ 169 + struct mem_cgroup_threshold_ary *spare; 170 + }; 171 + 172 + /* 173 + * The memory controller data structure. 
The memory controller controls both 174 + * page cache and RSS per cgroup. We would eventually like to provide 175 + * statistics based on the statistics developed by Rik Van Riel for clock-pro, 176 + * to help the administrator determine what knobs to tune. 177 + */ 178 + struct mem_cgroup { 179 + struct cgroup_subsys_state css; 180 + 181 + /* Accounted resources */ 182 + struct page_counter memory; 183 + struct page_counter memsw; 184 + struct page_counter kmem; 185 + 186 + /* Normal memory consumption range */ 187 + unsigned long low; 188 + unsigned long high; 189 + 190 + unsigned long soft_limit; 191 + 192 + /* vmpressure notifications */ 193 + struct vmpressure vmpressure; 194 + 195 + /* css_online() has been completed */ 196 + int initialized; 197 + 198 + /* 199 + * Should the accounting and control be hierarchical, per subtree? 200 + */ 201 + bool use_hierarchy; 202 + 203 + /* protected by memcg_oom_lock */ 204 + bool oom_lock; 205 + int under_oom; 206 + 207 + int swappiness; 208 + /* OOM-Killer disable */ 209 + int oom_kill_disable; 210 + 211 + /* protect arrays of thresholds */ 212 + struct mutex thresholds_lock; 213 + 214 + /* thresholds for memory usage. RCU-protected */ 215 + struct mem_cgroup_thresholds thresholds; 216 + 217 + /* thresholds for mem+swap usage. RCU-protected */ 218 + struct mem_cgroup_thresholds memsw_thresholds; 219 + 220 + /* For oom notifier event fd */ 221 + struct list_head oom_notify; 222 + 223 + /* 224 + * Should we move charges of a task when a task is moved into this 225 + * mem_cgroup ? And what type of charges should we move ? 226 + */ 227 + unsigned long move_charge_at_immigrate; 228 + /* 229 + * set > 0 if pages under this cgroup are moving to other cgroup. 230 + */ 231 + atomic_t moving_account; 232 + /* taken only while moving_account > 0 */ 233 + spinlock_t move_lock; 234 + struct task_struct *move_lock_task; 235 + unsigned long move_lock_flags; 236 + /* 237 + * percpu counter. 
238 + */ 239 + struct mem_cgroup_stat_cpu __percpu *stat; 240 + spinlock_t pcp_counter_lock; 241 + 242 + #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) 243 + struct cg_proto tcp_mem; 244 + #endif 245 + #if defined(CONFIG_MEMCG_KMEM) 246 + /* Index in the kmem_cache->memcg_params.memcg_caches array */ 247 + int kmemcg_id; 248 + bool kmem_acct_activated; 249 + bool kmem_acct_active; 250 + #endif 251 + 252 + int last_scanned_node; 253 + #if MAX_NUMNODES > 1 254 + nodemask_t scan_nodes; 255 + atomic_t numainfo_events; 256 + atomic_t numainfo_updating; 257 + #endif 258 + 259 + #ifdef CONFIG_CGROUP_WRITEBACK 260 + struct list_head cgwb_list; 261 + struct wb_domain cgwb_domain; 262 + #endif 263 + 264 + /* List of events which userspace want to receive */ 265 + struct list_head event_list; 266 + spinlock_t event_list_lock; 267 + 268 + struct mem_cgroup_per_node *nodeinfo[0]; 269 + /* WARNING: nodeinfo must be the last member here */ 270 + }; 76 271 extern struct cgroup_subsys_state *mem_cgroup_root_css; 77 272 78 - void mem_cgroup_events(struct mem_cgroup *memcg, 273 + /** 274 + * mem_cgroup_events - count memory events against a cgroup 275 + * @memcg: the memory cgroup 276 + * @idx: the event index 277 + * @nr: the number of events to account for 278 + */ 279 + static inline void mem_cgroup_events(struct mem_cgroup *memcg, 79 280 enum mem_cgroup_events_index idx, 80 - unsigned int nr); 281 + unsigned int nr) 282 + { 283 + this_cpu_add(memcg->stat->events[idx], nr); 284 + } 81 285 82 286 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg); 83 287 ··· 304 90 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); 305 91 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); 306 92 307 - bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, 308 - struct mem_cgroup *root); 309 93 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg); 310 94 311 - extern struct mem_cgroup 
*try_get_mem_cgroup_from_page(struct page *page); 312 - extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); 95 + struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page); 96 + struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); 313 97 314 - extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg); 315 - extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css); 98 + struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg); 99 + static inline 100 + struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ 101 + return css ? container_of(css, struct mem_cgroup, css) : NULL; 102 + } 103 + 104 + struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, 105 + struct mem_cgroup *, 106 + struct mem_cgroup_reclaim_cookie *); 107 + void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); 108 + 109 + static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, 110 + struct mem_cgroup *root) 111 + { 112 + if (root == memcg) 113 + return true; 114 + if (!root->use_hierarchy) 115 + return false; 116 + return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup); 117 + } 316 118 317 119 static inline bool mm_match_cgroup(struct mm_struct *mm, 318 120 struct mem_cgroup *memcg) ··· 344 114 return match; 345 115 } 346 116 347 - extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg); 348 - extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page); 117 + struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page); 349 118 350 - struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, 351 - struct mem_cgroup *, 352 - struct mem_cgroup_reclaim_cookie *); 353 - void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *); 119 + static inline bool mem_cgroup_disabled(void) 120 + { 121 + if (memory_cgrp_subsys.disabled) 122 + return true; 123 + return false; 124 + } 354 125 355 126 /* 356 127 * For memory 
reclaim. 357 128 */ 358 - int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec); 359 - bool mem_cgroup_lruvec_online(struct lruvec *lruvec); 360 129 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); 361 - unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); 362 - void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); 363 - extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 364 - struct task_struct *p); 130 + 131 + void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, 132 + int nr_pages); 133 + 134 + static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec) 135 + { 136 + struct mem_cgroup_per_zone *mz; 137 + struct mem_cgroup *memcg; 138 + 139 + if (mem_cgroup_disabled()) 140 + return true; 141 + 142 + mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec); 143 + memcg = mz->memcg; 144 + 145 + return !!(memcg->css.flags & CSS_ONLINE); 146 + } 147 + 148 + static inline 149 + unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) 150 + { 151 + struct mem_cgroup_per_zone *mz; 152 + 153 + mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec); 154 + return mz->lru_size[lru]; 155 + } 156 + 157 + static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) 158 + { 159 + unsigned long inactive_ratio; 160 + unsigned long inactive; 161 + unsigned long active; 162 + unsigned long gb; 163 + 164 + inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON); 165 + active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON); 166 + 167 + gb = (inactive + active) >> (30 - PAGE_SHIFT); 168 + if (gb) 169 + inactive_ratio = int_sqrt(10 * gb); 170 + else 171 + inactive_ratio = 1; 172 + 173 + return inactive * inactive_ratio < active; 174 + } 175 + 176 + void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 177 + struct task_struct *p); 365 178 366 179 static inline void mem_cgroup_oom_enable(void) 367 180 { ··· 429 156 extern int 
do_swap_account; 430 157 #endif 431 158 432 - static inline bool mem_cgroup_disabled(void) 433 - { 434 - if (memory_cgrp_subsys.disabled) 435 - return true; 436 - return false; 437 - } 438 - 439 159 struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page); 440 - void mem_cgroup_update_page_stat(struct mem_cgroup *memcg, 441 - enum mem_cgroup_stat_index idx, int val); 442 160 void mem_cgroup_end_page_stat(struct mem_cgroup *memcg); 161 + 162 + /** 163 + * mem_cgroup_update_page_stat - update page state statistics 164 + * @memcg: memcg to account against 165 + * @idx: page state item to account 166 + * @val: number of pages (positive or negative) 167 + * 168 + * See mem_cgroup_begin_page_stat() for locking requirements. 169 + */ 170 + static inline void mem_cgroup_update_page_stat(struct mem_cgroup *memcg, 171 + enum mem_cgroup_stat_index idx, int val) 172 + { 173 + VM_BUG_ON(!rcu_read_lock_held()); 174 + 175 + if (memcg) 176 + this_cpu_add(memcg->stat->count[idx], val); 177 + } 443 178 444 179 static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg, 445 180 enum mem_cgroup_stat_index idx) ··· 465 184 gfp_t gfp_mask, 466 185 unsigned long *total_scanned); 467 186 468 - void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); 469 187 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm, 470 188 enum vm_event_item idx) 471 189 { 190 + struct mem_cgroup *memcg; 191 + 472 192 if (mem_cgroup_disabled()) 473 193 return; 474 - __mem_cgroup_count_vm_event(mm, idx); 194 + 195 + rcu_read_lock(); 196 + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); 197 + if (unlikely(!memcg)) 198 + goto out; 199 + 200 + switch (idx) { 201 + case PGFAULT: 202 + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]); 203 + break; 204 + case PGMAJFAULT: 205 + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]); 206 + break; 207 + default: 208 + BUG(); 209 + } 210 + out: 211 + rcu_read_unlock(); 475 212 } 476 213 
#ifdef CONFIG_TRANSPARENT_HUGEPAGE 477 214 void mem_cgroup_split_huge_fixup(struct page *head); ··· 497 198 498 199 #else /* CONFIG_MEMCG */ 499 200 struct mem_cgroup; 500 - 501 - #define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL)) 502 201 503 202 static inline void mem_cgroup_events(struct mem_cgroup *memcg, 504 203 enum mem_cgroup_events_index idx, ··· 570 273 const struct mem_cgroup *memcg) 571 274 { 572 275 return true; 573 - } 574 - 575 - static inline struct cgroup_subsys_state 576 - *mem_cgroup_css(struct mem_cgroup *memcg) 577 - { 578 - return NULL; 579 276 } 580 277 581 278 static inline struct mem_cgroup * ··· 719 428 extern struct static_key memcg_kmem_enabled_key; 720 429 721 430 extern int memcg_nr_cache_ids; 722 - extern void memcg_get_cache_ids(void); 723 - extern void memcg_put_cache_ids(void); 431 + void memcg_get_cache_ids(void); 432 + void memcg_put_cache_ids(void); 724 433 725 434 /* 726 435 * Helper macro to loop through all memcg-specific caches. Callers must still ··· 735 444 return static_key_false(&memcg_kmem_enabled_key); 736 445 } 737 446 738 - bool memcg_kmem_is_active(struct mem_cgroup *memcg); 447 + static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg) 448 + { 449 + return memcg->kmem_acct_active; 450 + } 739 451 740 452 /* 741 453 * In general, we'll do everything in our power to not incur in any overhead ··· 757 463 struct mem_cgroup *memcg, int order); 758 464 void __memcg_kmem_uncharge_pages(struct page *page, int order); 759 465 760 - int memcg_cache_id(struct mem_cgroup *memcg); 466 + /* 467 + * helper for acessing a memcg's index. It will be used as an index in the 468 + * child cache array in kmem_cache, and also to derive its name. This function 469 + * will return -1 when this is not a kmem-limited memcg. 470 + */ 471 + static inline int memcg_cache_id(struct mem_cgroup *memcg) 472 + { 473 + return memcg ? 
memcg->kmemcg_id : -1; 474 + } 761 475 762 476 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); 763 477 void __memcg_kmem_put_cache(struct kmem_cache *cachep);
+8 -24
include/linux/mm.h
··· 249 249 void (*close)(struct vm_area_struct * area); 250 250 int (*mremap)(struct vm_area_struct * area); 251 251 int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf); 252 + int (*pmd_fault)(struct vm_area_struct *, unsigned long address, 253 + pmd_t *, unsigned int flags); 252 254 void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf); 253 255 254 256 /* notification that a previously read-only page is about to become ··· 309 307 #define page_private(page) ((page)->private) 310 308 #define set_page_private(page, v) ((page)->private = (v)) 311 309 312 - /* It's valid only if the page is free path or free_list */ 313 - static inline void set_freepage_migratetype(struct page *page, int migratetype) 314 - { 315 - page->index = migratetype; 316 - } 317 - 318 - /* It's valid only if the page is free path or free_list */ 319 - static inline int get_freepage_migratetype(struct page *page) 320 - { 321 - return page->index; 322 - } 323 - 324 310 /* 325 311 * FIXME: take this include out, include page-flags.h in 326 312 * files which need it (119 of them) ··· 347 357 static inline int get_page_unless_zero(struct page *page) 348 358 { 349 359 return atomic_inc_not_zero(&page->_count); 350 - } 351 - 352 - /* 353 - * Try to drop a ref unless the page has a refcount of one, return false if 354 - * that is the case. 355 - * This is to make sure that the refcount won't become zero after this drop. 356 - * This can be called when MMU is off so it must not access 357 - * any of the virtual mappings. 
358 - */ 359 - static inline int put_page_unless_one(struct page *page) 360 - { 361 - return atomic_add_unless(&page->_count, -1, 1); 362 360 } 363 361 364 362 extern int page_is_ram(unsigned long pfn); ··· 1243 1265 static inline int vma_growsdown(struct vm_area_struct *vma, unsigned long addr) 1244 1266 { 1245 1267 return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN); 1268 + } 1269 + 1270 + static inline bool vma_is_anonymous(struct vm_area_struct *vma) 1271 + { 1272 + return !vma->vm_ops; 1246 1273 } 1247 1274 1248 1275 static inline int stack_guard_page_start(struct vm_area_struct *vma, ··· 2176 2193 extern void memory_failure_queue(unsigned long pfn, int trapno, int flags); 2177 2194 extern int unpoison_memory(unsigned long pfn); 2178 2195 extern int get_hwpoison_page(struct page *page); 2196 + extern void put_hwpoison_page(struct page *page); 2179 2197 extern int sysctl_memory_failure_early_kill; 2180 2198 extern int sysctl_memory_failure_recovery; 2181 2199 extern void shake_page(struct page *p, int access);
+1 -1
include/linux/mm_types.h
··· 235 235 bool pfmemalloc; 236 236 }; 237 237 238 - typedef unsigned long __nocast vm_flags_t; 238 + typedef unsigned long vm_flags_t; 239 239 240 240 /* 241 241 * A region containing a mapping of a non-memory backed file under NOMMU
+28 -10
include/linux/oom.h
··· 13 13 struct task_struct; 14 14 15 15 /* 16 + * Details of the page allocation that triggered the oom killer that are used to 17 + * determine what should be killed. 18 + */ 19 + struct oom_control { 20 + /* Used to determine cpuset */ 21 + struct zonelist *zonelist; 22 + 23 + /* Used to determine mempolicy */ 24 + nodemask_t *nodemask; 25 + 26 + /* Used to determine cpuset and node locality requirement */ 27 + const gfp_t gfp_mask; 28 + 29 + /* 30 + * order == -1 means the oom kill is required by sysrq, otherwise only 31 + * for display purposes. 32 + */ 33 + const int order; 34 + }; 35 + 36 + /* 16 37 * Types of limitations to the nodes from which allocations may occur 17 38 */ 18 39 enum oom_constraint { ··· 78 57 79 58 extern int oom_kills_count(void); 80 59 extern void note_oom_kill(void); 81 - extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, 60 + extern void oom_kill_process(struct oom_control *oc, struct task_struct *p, 82 61 unsigned int points, unsigned long totalpages, 83 - struct mem_cgroup *memcg, nodemask_t *nodemask, 84 - const char *message); 62 + struct mem_cgroup *memcg, const char *message); 85 63 86 - extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 87 - int order, const nodemask_t *nodemask, 64 + extern void check_panic_on_oom(struct oom_control *oc, 65 + enum oom_constraint constraint, 88 66 struct mem_cgroup *memcg); 89 67 90 - extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, 91 - unsigned long totalpages, const nodemask_t *nodemask, 92 - bool force_kill); 68 + extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, 69 + struct task_struct *task, unsigned long totalpages); 93 70 94 - extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, 95 - int order, nodemask_t *mask, bool force_kill); 71 + extern bool out_of_memory(struct oom_control *oc); 96 72 97 73 extern void exit_oom_victim(void); 98 74
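The oom.h change bundles the zonelist, nodemask, gfp_mask, and order arguments that previously travelled separately through every OOM-killer function into one struct oom_control, shrinking the signatures shown later in the hunk. A small C sketch of that parameter-object refactor (the fields mirror the diff; `is_sysrq_oom` is an invented example consumer):

```c
#include <assert.h>

/* Condensed version of the new struct oom_control from the diff. */
struct oom_control_sketch {
    const void *zonelist;    /* used to determine cpuset */
    const void *nodemask;    /* used to determine mempolicy */
    unsigned int gfp_mask;
    int order;               /* -1 means the kill was demanded by sysrq */
};

/* Example consumer: one pointer travels instead of four loose arguments. */
static int is_sysrq_oom(const struct oom_control_sketch *oc)
{
    return oc->order == -1;
}
```

Beyond shorter signatures, the struct gives the sysrq convention (order == -1) a single documented home instead of repeating it at each call site.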
-5
include/linux/page-isolation.h
··· 65 65 int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, 66 66 bool skip_hwpoisoned_pages); 67 67 68 - /* 69 - * Internal functions. Changes pageblock's migrate type. 70 - */ 71 - int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages); 72 - void unset_migratetype_isolate(struct page *page, unsigned migratetype); 73 68 struct page *alloc_migrate_target(struct page *page, unsigned long private, 74 69 int **resultp); 75 70
+2
include/linux/pci.h
··· 1227 1227 dma_pool_create(name, &pdev->dev, size, align, allocation) 1228 1228 #define pci_pool_destroy(pool) dma_pool_destroy(pool) 1229 1229 #define pci_pool_alloc(pool, flags, handle) dma_pool_alloc(pool, flags, handle) 1230 + #define pci_pool_zalloc(pool, flags, handle) \ 1231 + dma_pool_zalloc(pool, flags, handle) 1230 1232 #define pci_pool_free(pool, vaddr, addr) dma_pool_free(pool, vaddr, addr) 1231 1233 1232 1234 struct msix_entry {
+18 -1
include/linux/swap.h
··· 351 351 extern int kswapd_run(int nid); 352 352 extern void kswapd_stop(int nid); 353 353 #ifdef CONFIG_MEMCG 354 - extern int mem_cgroup_swappiness(struct mem_cgroup *mem); 354 + static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) 355 + { 356 + /* root ? */ 357 + if (mem_cgroup_disabled() || !memcg->css.parent) 358 + return vm_swappiness; 359 + 360 + return memcg->swappiness; 361 + } 362 + 355 363 #else 356 364 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) 357 365 { ··· 406 398 extern struct page *lookup_swap_cache(swp_entry_t); 407 399 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, 408 400 struct vm_area_struct *vma, unsigned long addr); 401 + extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t, 402 + struct vm_area_struct *vma, unsigned long addr, 403 + bool *new_page_allocated); 409 404 extern struct page *swapin_readahead(swp_entry_t, gfp_t, 410 405 struct vm_area_struct *vma, unsigned long addr); 411 406 ··· 442 431 extern sector_t map_swap_page(struct page *, struct block_device **); 443 432 extern sector_t swapdev_block(int, pgoff_t); 444 433 extern int page_swapcount(struct page *); 434 + extern int swp_swapcount(swp_entry_t entry); 445 435 extern struct swap_info_struct *page_swap_info(struct page *); 446 436 extern int reuse_swap_page(struct page *); 447 437 extern int try_to_free_swap(struct page *); ··· 530 518 } 531 519 532 520 static inline int page_swapcount(struct page *page) 521 + { 522 + return 0; 523 + } 524 + 525 + static inline int swp_swapcount(swp_entry_t entry) 533 526 { 534 527 return 0; 535 528 }
+37
include/linux/swapops.h
··· 164 164 #endif 165 165 166 166 #ifdef CONFIG_MEMORY_FAILURE 167 + 168 + extern atomic_long_t num_poisoned_pages __read_mostly; 169 + 167 170 /* 168 171 * Support for hardware poisoned pages 169 172 */ ··· 180 177 { 181 178 return swp_type(entry) == SWP_HWPOISON; 182 179 } 180 + 181 + static inline bool test_set_page_hwpoison(struct page *page) 182 + { 183 + return TestSetPageHWPoison(page); 184 + } 185 + 186 + static inline void num_poisoned_pages_inc(void) 187 + { 188 + atomic_long_inc(&num_poisoned_pages); 189 + } 190 + 191 + static inline void num_poisoned_pages_dec(void) 192 + { 193 + atomic_long_dec(&num_poisoned_pages); 194 + } 195 + 196 + static inline void num_poisoned_pages_add(long num) 197 + { 198 + atomic_long_add(num, &num_poisoned_pages); 199 + } 200 + 201 + static inline void num_poisoned_pages_sub(long num) 202 + { 203 + atomic_long_sub(num, &num_poisoned_pages); 204 + } 183 205 #else 184 206 185 207 static inline swp_entry_t make_hwpoison_entry(struct page *page) ··· 215 187 static inline int is_hwpoison_entry(swp_entry_t swp) 216 188 { 217 189 return 0; 190 + } 191 + 192 + static inline bool test_set_page_hwpoison(struct page *page) 193 + { 194 + return false; 195 + } 196 + 197 + static inline void num_poisoned_pages_inc(void) 198 + { 218 199 } 219 200 #endif 220 201
+1 -1
include/linux/zbud.h
··· 9 9 int (*evict)(struct zbud_pool *pool, unsigned long handle); 10 10 }; 11 11 12 - struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops); 12 + struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops); 13 13 void zbud_destroy_pool(struct zbud_pool *pool); 14 14 int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, 15 15 unsigned long *handle);
+2 -2
include/linux/zpool.h
··· 37 37 }; 38 38 39 39 struct zpool *zpool_create_pool(char *type, char *name, 40 - gfp_t gfp, struct zpool_ops *ops); 40 + gfp_t gfp, const struct zpool_ops *ops); 41 41 42 42 char *zpool_get_type(struct zpool *pool); 43 43 ··· 81 81 atomic_t refcount; 82 82 struct list_head list; 83 83 84 - void *(*create)(char *name, gfp_t gfp, struct zpool_ops *ops, 84 + void *(*create)(char *name, gfp_t gfp, const struct zpool_ops *ops, 85 85 struct zpool *zpool); 86 86 void (*destroy)(void *pool); 87 87
+6
include/linux/zsmalloc.h
··· 34 34 */ 35 35 }; 36 36 37 + struct zs_pool_stats { 38 + /* How many pages were migrated (freed) */ 39 + unsigned long pages_compacted; 40 + }; 41 + 37 42 struct zs_pool; 38 43 39 44 struct zs_pool *zs_create_pool(char *name, gfp_t flags); ··· 54 49 unsigned long zs_get_total_pages(struct zs_pool *pool); 55 50 unsigned long zs_compact(struct zs_pool *pool); 56 51 52 + void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats); 57 53 #endif
-33
include/net/sock.h
··· 1042 1042 #endif 1043 1043 }; 1044 1044 1045 - /* 1046 - * Bits in struct cg_proto.flags 1047 - */ 1048 - enum cg_proto_flags { 1049 - /* Currently active and new sockets should be assigned to cgroups */ 1050 - MEMCG_SOCK_ACTIVE, 1051 - /* It was ever activated; we must disarm static keys on destruction */ 1052 - MEMCG_SOCK_ACTIVATED, 1053 - }; 1054 - 1055 - struct cg_proto { 1056 - struct page_counter memory_allocated; /* Current allocated memory. */ 1057 - struct percpu_counter sockets_allocated; /* Current number of sockets. */ 1058 - int memory_pressure; 1059 - long sysctl_mem[3]; 1060 - unsigned long flags; 1061 - /* 1062 - * memcg field is used to find which memcg we belong directly 1063 - * Each memcg struct can hold more than one cg_proto, so container_of 1064 - * won't really cut. 1065 - * 1066 - * The elegant solution would be having an inverse function to 1067 - * proto_cgroup in struct proto, but that means polluting the structure 1068 - * for everybody, instead of just for memcg users. 1069 - */ 1070 - struct mem_cgroup *memcg; 1071 - }; 1072 - 1073 1045 int proto_register(struct proto *prot, int alloc_slab); 1074 1046 void proto_unregister(struct proto *prot); 1075 - 1076 - static inline bool memcg_proto_active(struct cg_proto *cg_proto) 1077 - { 1078 - return test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); 1079 - } 1080 1047 1081 1048 #ifdef SOCK_REFCNT_DEBUG 1082 1049 static inline void sk_refcnt_debug_inc(struct sock *sk)
+1 -1
kernel/cgroup.c
··· 1342 1342 if (root != &cgrp_dfl_root) 1343 1343 for_each_subsys(ss, ssid) 1344 1344 if (root->subsys_mask & (1 << ssid)) 1345 - seq_show_option(seq, ss->name, NULL); 1345 + seq_show_option(seq, ss->legacy_name, NULL); 1346 1346 if (root->flags & CGRP_ROOT_NOPREFIX) 1347 1347 seq_puts(seq, ",noprefix"); 1348 1348 if (root->flags & CGRP_ROOT_XATTR)
+4 -4
kernel/profile.c
··· 339 339 node = cpu_to_mem(cpu); 340 340 per_cpu(cpu_profile_flip, cpu) = 0; 341 341 if (!per_cpu(cpu_profile_hits, cpu)[1]) { 342 - page = alloc_pages_exact_node(node, 342 + page = __alloc_pages_node(node, 343 343 GFP_KERNEL | __GFP_ZERO, 344 344 0); 345 345 if (!page) ··· 347 347 per_cpu(cpu_profile_hits, cpu)[1] = page_address(page); 348 348 } 349 349 if (!per_cpu(cpu_profile_hits, cpu)[0]) { 350 - page = alloc_pages_exact_node(node, 350 + page = __alloc_pages_node(node, 351 351 GFP_KERNEL | __GFP_ZERO, 352 352 0); 353 353 if (!page) ··· 543 543 int node = cpu_to_mem(cpu); 544 544 struct page *page; 545 545 546 - page = alloc_pages_exact_node(node, 546 + page = __alloc_pages_node(node, 547 547 GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, 548 548 0); 549 549 if (!page) 550 550 goto out_cleanup; 551 551 per_cpu(cpu_profile_hits, cpu)[1] 552 552 = (struct profile_hit *)page_address(page); 553 - page = alloc_pages_exact_node(node, 553 + page = __alloc_pages_node(node, 554 554 GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, 555 555 0); 556 556 if (!page)
+2 -4
lib/show_mem.c
··· 38 38 39 39 printk("%lu pages RAM\n", total); 40 40 printk("%lu pages HighMem/MovableOnly\n", highmem); 41 - #ifdef CONFIG_CMA 42 - printk("%lu pages reserved\n", (reserved - totalcma_pages)); 43 - printk("%lu pages cma reserved\n", totalcma_pages); 44 - #else 45 41 printk("%lu pages reserved\n", reserved); 42 + #ifdef CONFIG_CMA 43 + printk("%lu pages cma reserved\n", totalcma_pages); 46 44 #endif 47 45 #ifdef CONFIG_QUICKLIST 48 46 printk("%lu pages in pagetable cache\n",
+7
mm/bootmem.c
··· 236 236 count += pages; 237 237 while (pages--) 238 238 __free_pages_bootmem(page++, cur++, 0); 239 + bdata->node_bootmem_map = NULL; 239 240 240 241 bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count); 241 242 ··· 295 294 sidx + bdata->node_min_pfn, 296 295 eidx + bdata->node_min_pfn); 297 296 297 + if (WARN_ON(bdata->node_bootmem_map == NULL)) 298 + return; 299 + 298 300 if (bdata->hint_idx > sidx) 299 301 bdata->hint_idx = sidx; 300 302 ··· 317 313 sidx + bdata->node_min_pfn, 318 314 eidx + bdata->node_min_pfn, 319 315 flags); 316 + 317 + if (WARN_ON(bdata->node_bootmem_map == NULL)) 318 + return 0; 320 319 321 320 for (idx = sidx; idx < eidx; idx++) 322 321 if (test_and_set_bit(idx, bdata->node_bootmem_map)) {
+111 -64
mm/compaction.c
··· 207 207 return !get_pageblock_skip(page); 208 208 } 209 209 210 + static void reset_cached_positions(struct zone *zone) 211 + { 212 + zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn; 213 + zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn; 214 + zone->compact_cached_free_pfn = zone_end_pfn(zone); 215 + } 216 + 210 217 /* 211 218 * This function is called to clear all cached information on pageblocks that 212 219 * should be skipped for page isolation when the migrate and free page scanner ··· 225 218 unsigned long end_pfn = zone_end_pfn(zone); 226 219 unsigned long pfn; 227 220 228 - zone->compact_cached_migrate_pfn[0] = start_pfn; 229 - zone->compact_cached_migrate_pfn[1] = start_pfn; 230 - zone->compact_cached_free_pfn = end_pfn; 231 221 zone->compact_blockskip_flush = false; 232 222 233 223 /* Walk the zone and mark every pageblock as suitable for isolation */ ··· 242 238 243 239 clear_pageblock_skip(page); 244 240 } 241 + 242 + reset_cached_positions(zone); 245 243 } 246 244 247 245 void reset_isolation_suitable(pg_data_t *pgdat) ··· 437 431 438 432 if (!valid_page) 439 433 valid_page = page; 434 + 435 + /* 436 + * For compound pages such as THP and hugetlbfs, we can save 437 + * potentially a lot of iterations if we skip them at once. 438 + * The check is racy, but we can consider only valid values 439 + * and the only danger is skipping too much. 440 + */ 441 + if (PageCompound(page)) { 442 + unsigned int comp_order = compound_order(page); 443 + 444 + if (likely(comp_order < MAX_ORDER)) { 445 + blockpfn += (1UL << comp_order) - 1; 446 + cursor += (1UL << comp_order) - 1; 447 + } 448 + 449 + goto isolate_fail; 450 + } 451 + 440 452 if (!PageBuddy(page)) 441 453 goto isolate_fail; 442 454 ··· 513 489 continue; 514 490 515 491 } 492 + 493 + /* 494 + * There is a tiny chance that we have read bogus compound_order(), 495 + * so be careful to not go outside of the pageblock. 
496 + */ 497 + if (unlikely(blockpfn > end_pfn)) 498 + blockpfn = end_pfn; 516 499 517 500 trace_mm_compaction_isolate_freepages(*start_pfn, blockpfn, 518 501 nr_scanned, total_isolated); ··· 705 674 706 675 /* Time to isolate some pages for migration */ 707 676 for (; low_pfn < end_pfn; low_pfn++) { 677 + bool is_lru; 678 + 708 679 /* 709 680 * Periodically drop the lock (if held) regardless of its 710 681 * contention, to give chance to IRQs. Abort async compaction ··· 750 717 * It's possible to migrate LRU pages and balloon pages 751 718 * Skip any other type of page 752 719 */ 753 - if (!PageLRU(page)) { 720 + is_lru = PageLRU(page); 721 + if (!is_lru) { 754 722 if (unlikely(balloon_page_movable(page))) { 755 723 if (balloon_page_isolate(page)) { 756 724 /* Successfully isolated */ 757 725 goto isolate_success; 758 726 } 759 727 } 760 - continue; 761 728 } 762 729 763 730 /* 764 - * PageLRU is set. lru_lock normally excludes isolation 765 - * splitting and collapsing (collapsing has already happened 766 - * if PageLRU is set) but the lock is not necessarily taken 767 - * here and it is wasteful to take it just to check transhuge. 768 - * Check TransHuge without lock and skip the whole pageblock if 769 - * it's either a transhuge or hugetlbfs page, as calling 770 - * compound_order() without preventing THP from splitting the 771 - * page underneath us may return surprising results. 731 + * Regardless of being on LRU, compound pages such as THP and 732 + * hugetlbfs are not to be compacted. We can potentially save 733 + * a lot of iterations if we skip them at once. The check is 734 + * racy, but we can consider only valid values and the only 735 + * danger is skipping too much. 
772 736 */ 773 - if (PageTransHuge(page)) { 774 - if (!locked) 775 - low_pfn = ALIGN(low_pfn + 1, 776 - pageblock_nr_pages) - 1; 777 - else 778 - low_pfn += (1 << compound_order(page)) - 1; 737 + if (PageCompound(page)) { 738 + unsigned int comp_order = compound_order(page); 739 + 740 + if (likely(comp_order < MAX_ORDER)) 741 + low_pfn += (1UL << comp_order) - 1; 779 742 780 743 continue; 781 744 } 745 + 746 + if (!is_lru) 747 + continue; 782 748 783 749 /* 784 750 * Migration will fail if an anonymous page is pinned in memory, ··· 795 763 if (!locked) 796 764 break; 797 765 798 - /* Recheck PageLRU and PageTransHuge under lock */ 766 + /* Recheck PageLRU and PageCompound under lock */ 799 767 if (!PageLRU(page)) 800 768 continue; 801 - if (PageTransHuge(page)) { 802 - low_pfn += (1 << compound_order(page)) - 1; 769 + 770 + /* 771 + * Page become compound since the non-locked check, 772 + * and it's on LRU. It can only be a THP so the order 773 + * is safe to read and it's 0 for tail pages. 774 + */ 775 + if (unlikely(PageCompound(page))) { 776 + low_pfn += (1UL << compound_order(page)) - 1; 803 777 continue; 804 778 } 805 779 } ··· 816 778 if (__isolate_lru_page(page, isolate_mode) != 0) 817 779 continue; 818 780 819 - VM_BUG_ON_PAGE(PageTransCompound(page), page); 781 + VM_BUG_ON_PAGE(PageCompound(page), page); 820 782 821 783 /* Successfully isolated */ 822 784 del_page_from_lru_list(page, lruvec, page_lru(page)); ··· 936 898 } 937 899 938 900 /* 901 + * Test whether the free scanner has reached the same or lower pageblock than 902 + * the migration scanner, and compaction should thus terminate. 903 + */ 904 + static inline bool compact_scanners_met(struct compact_control *cc) 905 + { 906 + return (cc->free_pfn >> pageblock_order) 907 + <= (cc->migrate_pfn >> pageblock_order); 908 + } 909 + 910 + /* 939 911 * Based on information in the current compact_control, find blocks 940 912 * suitable for isolating free pages from and then isolate them. 
941 913 */ ··· 981 933 * pages on cc->migratepages. We stop searching if the migrate 982 934 * and free page scanners meet or enough free pages are isolated. 983 935 */ 984 - for (; block_start_pfn >= low_pfn && 985 - cc->nr_migratepages > cc->nr_freepages; 936 + for (; block_start_pfn >= low_pfn; 986 937 block_end_pfn = block_start_pfn, 987 938 block_start_pfn -= pageblock_nr_pages, 988 939 isolate_start_pfn = block_start_pfn) { ··· 1013 966 block_end_pfn, freelist, false); 1014 967 1015 968 /* 969 + * If we isolated enough freepages, or aborted due to async 970 + * compaction being contended, terminate the loop. 1016 971 * Remember where the free scanner should restart next time, 1017 972 * which is where isolate_freepages_block() left off. 1018 973 * But if it scanned the whole pageblock, isolate_start_pfn ··· 1023 974 * In that case we will however want to restart at the start 1024 975 * of the previous pageblock. 1025 976 */ 1026 - cc->free_pfn = (isolate_start_pfn < block_end_pfn) ? 1027 - isolate_start_pfn : 1028 - block_start_pfn - pageblock_nr_pages; 1029 - 1030 - /* 1031 - * isolate_freepages_block() might have aborted due to async 1032 - * compaction being contended 1033 - */ 1034 - if (cc->contended) 977 + if ((cc->nr_freepages >= cc->nr_migratepages) 978 + || cc->contended) { 979 + if (isolate_start_pfn >= block_end_pfn) 980 + isolate_start_pfn = 981 + block_start_pfn - pageblock_nr_pages; 1035 982 break; 983 + } else { 984 + /* 985 + * isolate_freepages_block() should not terminate 986 + * prematurely unless contended, or isolated enough 987 + */ 988 + VM_BUG_ON(isolate_start_pfn < block_end_pfn); 989 + } 1036 990 } 1037 991 1038 992 /* split_free_page does not map the pages */ 1039 993 map_pages(freelist); 1040 994 1041 995 /* 1042 - * If we crossed the migrate scanner, we want to keep it that way 1043 - * so that compact_finished() may detect this 996 + * Record where the free scanner will restart next time. 
Either we 997 + * broke from the loop and set isolate_start_pfn based on the last 998 + * call to isolate_freepages_block(), or we met the migration scanner 999 + * and the loop terminated due to isolate_start_pfn < low_pfn 1044 1000 */ 1045 - if (block_start_pfn < low_pfn) 1046 - cc->free_pfn = cc->migrate_pfn; 1001 + cc->free_pfn = isolate_start_pfn; 1047 1002 } 1048 1003 1049 1004 /* ··· 1115 1062 struct compact_control *cc) 1116 1063 { 1117 1064 unsigned long low_pfn, end_pfn; 1065 + unsigned long isolate_start_pfn; 1118 1066 struct page *page; 1119 1067 const isolate_mode_t isolate_mode = 1120 1068 (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) | ··· 1164 1110 continue; 1165 1111 1166 1112 /* Perform the isolation */ 1113 + isolate_start_pfn = low_pfn; 1167 1114 low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn, 1168 1115 isolate_mode); 1169 1116 ··· 1172 1117 acct_isolated(zone, cc); 1173 1118 return ISOLATE_ABORT; 1174 1119 } 1120 + 1121 + /* 1122 + * Record where we could have freed pages by migration and not 1123 + * yet flushed them to buddy allocator. 1124 + * - this is the lowest page that could have been isolated and 1125 + * then freed by migration. 1126 + */ 1127 + if (cc->nr_migratepages && !cc->last_migrated_pfn) 1128 + cc->last_migrated_pfn = isolate_start_pfn; 1175 1129 1176 1130 /* 1177 1131 * Either we isolated something and proceed with migration. Or ··· 1191 1127 } 1192 1128 1193 1129 acct_isolated(zone, cc); 1194 - /* 1195 - * Record where migration scanner will be restarted. If we end up in 1196 - * the same pageblock as the free scanner, make the scanners fully 1197 - * meet so that compact_finished() terminates compaction. 1198 - */ 1199 - cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn; 1130 + /* Record where migration scanner will be restarted. */ 1131 + cc->migrate_pfn = low_pfn; 1200 1132 1201 1133 return cc->nr_migratepages ? 
ISOLATE_SUCCESS : ISOLATE_NONE; 1202 1134 } ··· 1207 1147 return COMPACT_PARTIAL; 1208 1148 1209 1149 /* Compaction run completes if the migrate and free scanner meet */ 1210 - if (cc->free_pfn <= cc->migrate_pfn) { 1150 + if (compact_scanners_met(cc)) { 1211 1151 /* Let the next compaction start anew. */ 1212 - zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn; 1213 - zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn; 1214 - zone->compact_cached_free_pfn = zone_end_pfn(zone); 1152 + reset_cached_positions(zone); 1215 1153 1216 1154 /* 1217 1155 * Mark that the PG_migrate_skip information should be cleared ··· 1353 1295 unsigned long end_pfn = zone_end_pfn(zone); 1354 1296 const int migratetype = gfpflags_to_migratetype(cc->gfp_mask); 1355 1297 const bool sync = cc->mode != MIGRATE_ASYNC; 1356 - unsigned long last_migrated_pfn = 0; 1357 1298 1358 1299 ret = compaction_suitable(zone, cc->order, cc->alloc_flags, 1359 1300 cc->classzone_idx); ··· 1390 1333 zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn; 1391 1334 zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn; 1392 1335 } 1336 + cc->last_migrated_pfn = 0; 1393 1337 1394 1338 trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, 1395 1339 cc->free_pfn, end_pfn, sync); ··· 1400 1342 while ((ret = compact_finished(zone, cc, migratetype)) == 1401 1343 COMPACT_CONTINUE) { 1402 1344 int err; 1403 - unsigned long isolate_start_pfn = cc->migrate_pfn; 1404 1345 1405 1346 switch (isolate_migratepages(zone, cc)) { 1406 1347 case ISOLATE_ABORT: ··· 1433 1376 * migrate_pages() may return -ENOMEM when scanners meet 1434 1377 * and we want compact_finished() to detect it 1435 1378 */ 1436 - if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) { 1379 + if (err == -ENOMEM && !compact_scanners_met(cc)) { 1437 1380 ret = COMPACT_PARTIAL; 1438 1381 goto out; 1439 1382 } 1440 1383 } 1441 - 1442 - /* 1443 - * Record where we could have freed pages by migration and not 1444 - * yet flushed them to buddy 
allocator. We use the pfn that 1445 - * isolate_migratepages() started from in this loop iteration 1446 - * - this is the lowest page that could have been isolated and 1447 - * then freed by migration. 1448 - */ 1449 - if (!last_migrated_pfn) 1450 - last_migrated_pfn = isolate_start_pfn; 1451 1384 1452 1385 check_drain: 1453 1386 /* ··· 1447 1400 * compact_finished() can detect immediately if allocation 1448 1401 * would succeed. 1449 1402 */ 1450 - if (cc->order > 0 && last_migrated_pfn) { 1403 + if (cc->order > 0 && cc->last_migrated_pfn) { 1451 1404 int cpu; 1452 1405 unsigned long current_block_start = 1453 1406 cc->migrate_pfn & ~((1UL << cc->order) - 1); 1454 1407 1455 - if (last_migrated_pfn < current_block_start) { 1408 + if (cc->last_migrated_pfn < current_block_start) { 1456 1409 cpu = get_cpu(); 1457 1410 lru_add_drain_cpu(cpu); 1458 1411 drain_local_pages(zone); 1459 1412 put_cpu(); 1460 1413 /* No more flushing until we migrate again */ 1461 - last_migrated_pfn = 0; 1414 + cc->last_migrated_pfn = 0; 1462 1415 } 1463 1416 } 1464 1417
+10 -2
mm/dmapool.c
··· 271 271 { 272 272 bool empty = false; 273 273 274 + if (unlikely(!pool)) 275 + return; 276 + 274 277 mutex_lock(&pools_reg_lock); 275 278 mutex_lock(&pools_lock); 276 279 list_del(&pool->pools); ··· 337 334 /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */ 338 335 spin_unlock_irqrestore(&pool->lock, flags); 339 336 340 - page = pool_alloc_page(pool, mem_flags); 337 + page = pool_alloc_page(pool, mem_flags & (~__GFP_ZERO)); 341 338 if (!page) 342 339 return NULL; 343 340 ··· 375 372 break; 376 373 } 377 374 } 378 - memset(retval, POOL_POISON_ALLOCATED, pool->size); 375 + if (!(mem_flags & __GFP_ZERO)) 376 + memset(retval, POOL_POISON_ALLOCATED, pool->size); 379 377 #endif 380 378 spin_unlock_irqrestore(&pool->lock, flags); 379 + 380 + if (mem_flags & __GFP_ZERO) 381 + memset(retval, 0, pool->size); 382 + 381 383 return retval; 382 384 } 383 385 EXPORT_SYMBOL(dma_pool_alloc);
+22
mm/early_ioremap.c
··· 224 224 return (__force void *)__early_ioremap(phys_addr, size, FIXMAP_PAGE_RO); 225 225 } 226 226 #endif 227 + 228 + #define MAX_MAP_CHUNK (NR_FIX_BTMAPS << PAGE_SHIFT) 229 + 230 + void __init copy_from_early_mem(void *dest, phys_addr_t src, unsigned long size) 231 + { 232 + unsigned long slop, clen; 233 + char *p; 234 + 235 + while (size) { 236 + slop = src & ~PAGE_MASK; 237 + clen = size; 238 + if (clen > MAX_MAP_CHUNK - slop) 239 + clen = MAX_MAP_CHUNK - slop; 240 + p = early_memremap(src & PAGE_MASK, clen + slop); 241 + memcpy(dest, p + slop, clen); 242 + early_memunmap(p, clen + slop); 243 + dest += clen; 244 + src += clen; 245 + size -= clen; 246 + } 247 + } 248 + 227 249 #else /* CONFIG_MMU */ 228 250 229 251 void __init __iomem *
+19 -17
mm/filemap.c
··· 674 674 do { 675 675 cpuset_mems_cookie = read_mems_allowed_begin(); 676 676 n = cpuset_mem_spread_node(); 677 - page = alloc_pages_exact_node(n, gfp, 0); 677 + page = __alloc_pages_node(n, gfp, 0); 678 678 } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); 679 679 680 680 return page; ··· 2473 2473 iov_iter_count(i)); 2474 2474 2475 2475 again: 2476 - /* 2477 - * Bring in the user page that we will copy from _first_. 2478 - * Otherwise there's a nasty deadlock on copying from the 2479 - * same page as we're writing to, without it being marked 2480 - * up-to-date. 2481 - * 2482 - * Not only is this an optimisation, but it is also required 2483 - * to check that the address is actually valid, when atomic 2484 - * usercopies are used, below. 2485 - */ 2486 - if (unlikely(iov_iter_fault_in_readable(i, bytes))) { 2487 - status = -EFAULT; 2488 - break; 2489 - } 2490 - 2491 2476 status = a_ops->write_begin(file, mapping, pos, bytes, flags, 2492 2477 &page, &fsdata); 2493 2478 if (unlikely(status < 0)) ··· 2480 2495 2481 2496 if (mapping_writably_mapped(mapping)) 2482 2497 flush_dcache_page(page); 2483 - 2498 + /* 2499 + * 'page' is now locked. If we are trying to copy from a 2500 + * mapping of 'page' in userspace, the copy might fault and 2501 + * would need PageUptodate() to complete. But, page can not be 2502 + * made Uptodate without acquiring the page lock, which we hold. 2503 + * Deadlock. Avoid with pagefault_disable(). Fix up below with 2504 + * iov_iter_fault_in_readable(). 2505 + */ 2506 + pagefault_disable(); 2484 2507 copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); 2508 + pagefault_enable(); 2485 2509 flush_dcache_page(page); 2486 2510 2487 2511 status = a_ops->write_end(file, mapping, pos, bytes, copied, ··· 2513 2519 */ 2514 2520 bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset, 2515 2521 iov_iter_single_seg_count(i)); 2522 + /* 2523 + * This is the fallback to recover if the copy from 2524 + * userspace above faults. 
2525 + */ 2526 + if (unlikely(iov_iter_fault_in_readable(i, bytes))) { 2527 + status = -EFAULT; 2528 + break; 2529 + } 2516 2530 goto again; 2517 2531 } 2518 2532 pos += copied;
+105 -58
mm/huge_memory.c
··· 16 16 #include <linux/swap.h> 17 17 #include <linux/shrinker.h> 18 18 #include <linux/mm_inline.h> 19 + #include <linux/dax.h> 19 20 #include <linux/kthread.h> 20 21 #include <linux/khugepaged.h> 21 22 #include <linux/freezer.h> ··· 106 105 }; 107 106 108 107 109 - static int set_recommended_min_free_kbytes(void) 108 + static void set_recommended_min_free_kbytes(void) 110 109 { 111 110 struct zone *zone; 112 111 int nr_zones = 0; ··· 141 140 min_free_kbytes = recommended_min; 142 141 } 143 142 setup_per_zone_wmarks(); 144 - return 0; 145 143 } 146 144 147 145 static int start_stop_khugepaged(void) ··· 172 172 static atomic_t huge_zero_refcount; 173 173 struct page *huge_zero_page __read_mostly; 174 174 175 - static inline bool is_huge_zero_pmd(pmd_t pmd) 176 - { 177 - return is_huge_zero_page(pmd_page(pmd)); 178 - } 179 - 180 - static struct page *get_huge_zero_page(void) 175 + struct page *get_huge_zero_page(void) 181 176 { 182 177 struct page *zero_page; 183 178 retry: ··· 789 794 } 790 795 791 796 /* Caller must hold page table lock. 
*/ 792 - static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm, 797 + static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm, 793 798 struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd, 794 799 struct page *zero_page) 795 800 { 796 801 pmd_t entry; 802 + if (!pmd_none(*pmd)) 803 + return false; 797 804 entry = mk_pmd(zero_page, vma->vm_page_prot); 798 805 entry = pmd_mkhuge(entry); 799 806 pgtable_trans_huge_deposit(mm, pmd, pgtable); 800 807 set_pmd_at(mm, haddr, pmd, entry); 801 808 atomic_long_inc(&mm->nr_ptes); 809 + return true; 802 810 } 803 811 804 812 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, ··· 866 868 } 867 869 return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp, 868 870 flags); 871 + } 872 + 873 + static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, 874 + pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write) 875 + { 876 + struct mm_struct *mm = vma->vm_mm; 877 + pmd_t entry; 878 + spinlock_t *ptl; 879 + 880 + ptl = pmd_lock(mm, pmd); 881 + if (pmd_none(*pmd)) { 882 + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); 883 + if (write) { 884 + entry = pmd_mkyoung(pmd_mkdirty(entry)); 885 + entry = maybe_pmd_mkwrite(entry, vma); 886 + } 887 + set_pmd_at(mm, addr, pmd, entry); 888 + update_mmu_cache_pmd(vma, addr, pmd); 889 + } 890 + spin_unlock(ptl); 891 + } 892 + 893 + int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, 894 + pmd_t *pmd, unsigned long pfn, bool write) 895 + { 896 + pgprot_t pgprot = vma->vm_page_prot; 897 + /* 898 + * If we had pmd_special, we could avoid all these restrictions, 899 + * but we need to be consistent with PTEs and architectures that 900 + * can't support a 'special' bit. 
901 + */ 902 + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))); 903 + BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == 904 + (VM_PFNMAP|VM_MIXEDMAP)); 905 + BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); 906 + BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn)); 907 + 908 + if (addr < vma->vm_start || addr >= vma->vm_end) 909 + return VM_FAULT_SIGBUS; 910 + if (track_pfn_insert(vma, &pgprot, pfn)) 911 + return VM_FAULT_SIGBUS; 912 + insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write); 913 + return VM_FAULT_NOPAGE; 869 914 } 870 915 871 916 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, ··· 1455 1414 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, 1456 1415 pmd_t *pmd, unsigned long addr) 1457 1416 { 1417 + pmd_t orig_pmd; 1458 1418 spinlock_t *ptl; 1459 - int ret = 0; 1460 1419 1461 - if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) { 1462 - struct page *page; 1463 - pgtable_t pgtable; 1464 - pmd_t orig_pmd; 1465 - /* 1466 - * For architectures like ppc64 we look at deposited pgtable 1467 - * when calling pmdp_huge_get_and_clear. So do the 1468 - * pgtable_trans_huge_withdraw after finishing pmdp related 1469 - * operations. 1470 - */ 1471 - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd, 1472 - tlb->fullmm); 1473 - tlb_remove_pmd_tlb_entry(tlb, pmd, addr); 1474 - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd); 1475 - if (is_huge_zero_pmd(orig_pmd)) { 1476 - atomic_long_dec(&tlb->mm->nr_ptes); 1477 - spin_unlock(ptl); 1420 + if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1) 1421 + return 0; 1422 + /* 1423 + * For architectures like ppc64 we look at deposited pgtable 1424 + * when calling pmdp_huge_get_and_clear. So do the 1425 + * pgtable_trans_huge_withdraw after finishing pmdp related 1426 + * operations. 
1427 + */ 1428 + orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd, 1429 + tlb->fullmm); 1430 + tlb_remove_pmd_tlb_entry(tlb, pmd, addr); 1431 + if (vma_is_dax(vma)) { 1432 + spin_unlock(ptl); 1433 + if (is_huge_zero_pmd(orig_pmd)) 1478 1434 put_huge_zero_page(); 1479 - } else { 1480 - page = pmd_page(orig_pmd); 1481 - page_remove_rmap(page); 1482 - VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); 1483 - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); 1484 - VM_BUG_ON_PAGE(!PageHead(page), page); 1485 - atomic_long_dec(&tlb->mm->nr_ptes); 1486 - spin_unlock(ptl); 1487 - tlb_remove_page(tlb, page); 1488 - } 1489 - pte_free(tlb->mm, pgtable); 1490 - ret = 1; 1435 + } else if (is_huge_zero_pmd(orig_pmd)) { 1436 + pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd)); 1437 + atomic_long_dec(&tlb->mm->nr_ptes); 1438 + spin_unlock(ptl); 1439 + put_huge_zero_page(); 1440 + } else { 1441 + struct page *page = pmd_page(orig_pmd); 1442 + page_remove_rmap(page); 1443 + VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); 1444 + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR); 1445 + VM_BUG_ON_PAGE(!PageHead(page), page); 1446 + pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd)); 1447 + atomic_long_dec(&tlb->mm->nr_ptes); 1448 + spin_unlock(ptl); 1449 + tlb_remove_page(tlb, page); 1491 1450 } 1492 - return ret; 1451 + return 1; 1493 1452 } 1494 1453 1495 1454 int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma, ··· 2326 2285 2327 2286 static void khugepaged_alloc_sleep(void) 2328 2287 { 2329 - wait_event_freezable_timeout(khugepaged_wait, false, 2330 - msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); 2288 + DEFINE_WAIT(wait); 2289 + 2290 + add_wait_queue(&khugepaged_wait, &wait); 2291 + freezable_schedule_timeout_interruptible( 2292 + msecs_to_jiffies(khugepaged_alloc_sleep_millisecs)); 2293 + remove_wait_queue(&khugepaged_wait, &wait); 2331 2294 } 2332 2295 2333 2296 static int khugepaged_node_load[MAX_NUMNODES]; ··· 
2418 2373 */ 2419 2374 up_read(&mm->mmap_sem); 2420 2375 2421 - *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER); 2376 + *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER); 2422 2377 if (unlikely(!*hpage)) { 2423 2378 count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 2424 2379 *hpage = ERR_PTR(-ENOMEM); ··· 2956 2911 pmd_t *pmd) 2957 2912 { 2958 2913 spinlock_t *ptl; 2959 - struct page *page; 2914 + struct page *page = NULL; 2960 2915 struct mm_struct *mm = vma->vm_mm; 2961 2916 unsigned long haddr = address & HPAGE_PMD_MASK; 2962 2917 unsigned long mmun_start; /* For mmu_notifiers */ ··· 2969 2924 again: 2970 2925 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2971 2926 ptl = pmd_lock(mm, pmd); 2972 - if (unlikely(!pmd_trans_huge(*pmd))) { 2973 - spin_unlock(ptl); 2974 - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2975 - return; 2976 - } 2977 - if (is_huge_zero_pmd(*pmd)) { 2927 + if (unlikely(!pmd_trans_huge(*pmd))) 2928 + goto unlock; 2929 + if (vma_is_dax(vma)) { 2930 + pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd); 2931 + if (is_huge_zero_pmd(_pmd)) 2932 + put_huge_zero_page(); 2933 + } else if (is_huge_zero_pmd(*pmd)) { 2978 2934 __split_huge_zero_page_pmd(vma, haddr, pmd); 2979 - spin_unlock(ptl); 2980 - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2981 - return; 2935 + } else { 2936 + page = pmd_page(*pmd); 2937 + VM_BUG_ON_PAGE(!page_count(page), page); 2938 + get_page(page); 2982 2939 } 2983 - page = pmd_page(*pmd); 2984 - VM_BUG_ON_PAGE(!page_count(page), page); 2985 - get_page(page); 2940 + unlock: 2986 2941 spin_unlock(ptl); 2987 2942 mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2988 2943 2989 - split_huge_page(page); 2944 + if (!page) 2945 + return; 2990 2946 2947 + split_huge_page(page); 2991 2948 put_page(page); 2992 2949 2993 2950 /* ··· 3038 2991 split_huge_page_pmd_mm(mm, address, pmd); 3039 2992 } 3040 2993 3041 - void __vma_adjust_trans_huge(struct 
vm_area_struct *vma, 2994 + void vma_adjust_trans_huge(struct vm_area_struct *vma, 3042 2995 unsigned long start, 3043 2996 unsigned long end, 3044 2997 long adjust_next)
+346 -90
mm/hugetlb.c
··· 64 64 * prevent spurious OOMs when the hugepage pool is fully utilized. 65 65 */ 66 66 static int num_fault_mutexes; 67 - static struct mutex *htlb_fault_mutex_table ____cacheline_aligned_in_smp; 67 + struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp; 68 68 69 69 /* Forward declaration */ 70 70 static int hugetlb_acct_memory(struct hstate *h, long delta); ··· 240 240 241 241 /* 242 242 * Add the huge page range represented by [f, t) to the reserve 243 - * map. Existing regions will be expanded to accommodate the 244 - * specified range. We know only existing regions need to be 245 - * expanded, because region_add is only called after region_chg 246 - * with the same range. If a new file_region structure must 247 - * be allocated, it is done in region_chg. 243 + * map. In the normal case, existing regions will be expanded 244 + * to accommodate the specified range. Sufficient regions should 245 + * exist for expansion due to the previous call to region_chg 246 + * with the same range. However, it is possible that region_del 247 + * could have been called after region_chg and modified the map 248 + * in such a way that no region exists to be expanded. In this 249 + * case, pull a region descriptor from the cache associated with 250 + * the map and use that for the new range. 248 251 * 249 252 * Return the number of new huge pages added to the map. This 250 253 * number is greater than or equal to zero. ··· 263 260 list_for_each_entry(rg, head, link) 264 261 if (f <= rg->to) 265 262 break; 263 + 264 + /* 265 + * If no region exists which can be expanded to include the 266 + * specified range, the list must have been modified by an 267 + * interleaving call to region_del(). Pull a region descriptor 268 + * from the cache and use it for this range.
269 + */ 270 + if (&rg->link == head || t < rg->from) { 271 + VM_BUG_ON(resv->region_cache_count <= 0); 272 + 273 + resv->region_cache_count--; 274 + nrg = list_first_entry(&resv->region_cache, struct file_region, 275 + link); 276 + list_del(&nrg->link); 277 + 278 + nrg->from = f; 279 + nrg->to = t; 280 + list_add(&nrg->link, rg->link.prev); 281 + 282 + add += t - f; 283 + goto out_locked; 284 + } 266 285 267 286 /* Round our left edge to the current segment if it encloses us. */ 268 287 if (f > rg->from) ··· 319 294 add += t - nrg->to; /* Added to end of region */ 320 295 nrg->to = t; 321 296 297 + out_locked: 298 + resv->adds_in_progress--; 322 299 spin_unlock(&resv->lock); 323 300 VM_BUG_ON(add < 0); 324 301 return add; ··· 339 312 * so that the subsequent region_add call will have all the 340 313 * regions it needs and will not fail. 341 314 * 342 - * Returns the number of huge pages that need to be added 343 - * to the existing reservation map for the range [f, t). 344 - * This number is greater or equal to zero. -ENOMEM is 345 - * returned if a new file_region structure is needed and can 346 - * not be allocated. 315 + * Upon entry, region_chg will also examine the cache of region descriptors 316 + * associated with the map. If there are not enough descriptors cached, one 317 + * will be allocated for the in progress add operation. 318 + * 319 + * Returns the number of huge pages that need to be added to the existing 320 + * reservation map for the range [f, t). This number is greater or equal to 321 + * zero. -ENOMEM is returned if a new file_region structure or cache entry 322 + * is needed and can not be allocated. 347 323 */ 348 324 static long region_chg(struct resv_map *resv, long f, long t) 349 325 { ··· 356 326 357 327 retry: 358 328 spin_lock(&resv->lock); 329 + retry_locked: 330 + resv->adds_in_progress++; 331 + 332 + /* 333 + * Check for sufficient descriptors in the cache to accommodate 334 + * the number of in progress add operations. 
335 + */ 336 + if (resv->adds_in_progress > resv->region_cache_count) { 337 + struct file_region *trg; 338 + 339 + VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1); 340 + /* Must drop lock to allocate a new descriptor. */ 341 + resv->adds_in_progress--; 342 + spin_unlock(&resv->lock); 343 + 344 + trg = kmalloc(sizeof(*trg), GFP_KERNEL); 345 + if (!trg) 346 + return -ENOMEM; 347 + 348 + spin_lock(&resv->lock); 349 + list_add(&trg->link, &resv->region_cache); 350 + resv->region_cache_count++; 351 + goto retry_locked; 352 + } 353 + 359 354 /* Locate the region we are before or in. */ 360 355 list_for_each_entry(rg, head, link) 361 356 if (f <= rg->to) ··· 391 336 * size such that we can guarantee to record the reservation. */ 392 337 if (&rg->link == head || t < rg->from) { 393 338 if (!nrg) { 339 + resv->adds_in_progress--; 394 340 spin_unlock(&resv->lock); 395 341 nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); 396 342 if (!nrg) ··· 441 385 } 442 386 443 387 /* 444 - * Truncate the reserve map at index 'end'. Modify/truncate any 445 - * region which contains end. Delete any regions past end. 446 - * Return the number of huge pages removed from the map. 388 + * Abort the in progress add operation. The adds_in_progress field 389 + * of the resv_map keeps track of the operations in progress between 390 + * calls to region_chg and region_add. Operations are sometimes 391 + * aborted after the call to region_chg. In such cases, region_abort 392 + * is called to decrement the adds_in_progress counter. 393 + * 394 + * NOTE: The range arguments [f, t) are not needed or used in this 395 + * routine. They are kept to make reading the calling code easier as 396 + * arguments will match the associated region_chg call. 
447 397 */ 448 - static long region_truncate(struct resv_map *resv, long end) 398 + static void region_abort(struct resv_map *resv, long f, long t) 399 + { 400 + spin_lock(&resv->lock); 401 + VM_BUG_ON(!resv->region_cache_count); 402 + resv->adds_in_progress--; 403 + spin_unlock(&resv->lock); 404 + } 405 + 406 + /* 407 + * Delete the specified range [f, t) from the reserve map. If the 408 + * t parameter is LONG_MAX, this indicates that ALL regions after f 409 + * should be deleted. Locate the regions which intersect [f, t) 410 + * and either trim, delete or split the existing regions. 411 + * 412 + * Returns the number of huge pages deleted from the reserve map. 413 + * In the normal case, the return value is zero or more. In the 414 + * case where a region must be split, a new region descriptor must 415 + * be allocated. If the allocation fails, -ENOMEM will be returned. 416 + * NOTE: If the parameter t == LONG_MAX, then we will never split 417 + * a region and possibly return -ENOMEM. Callers specifying 418 + * t == LONG_MAX do not need to check for -ENOMEM error. 419 + */ 420 + static long region_del(struct resv_map *resv, long f, long t) 449 421 { 450 422 struct list_head *head = &resv->regions; 451 423 struct file_region *rg, *trg; 452 - long chg = 0; 424 + struct file_region *nrg = NULL; 425 + long del = 0; 453 426 427 + retry: 454 428 spin_lock(&resv->lock); 455 - /* Locate the region we are either in or before. */ 456 - list_for_each_entry(rg, head, link) 457 - if (end <= rg->to) 429 + list_for_each_entry_safe(rg, trg, head, link) { 430 + if (rg->to <= f) 431 + continue; 432 + if (rg->from >= t) 458 433 break; 459 - if (&rg->link == head) 460 - goto out; 461 434 462 - /* If we are in the middle of a region then adjust it. 
*/ 463 - chg = rg->to - end; 464 - rg->to = end; 465 - rg = list_entry(rg->link.next, typeof(*rg), link); 435 + if (f > rg->from && t < rg->to) { /* Must split region */ 436 + /* 437 + * Check for an entry in the cache before dropping 438 + * lock and attempting allocation. 439 + */ 440 + if (!nrg && 441 + resv->region_cache_count > resv->adds_in_progress) { 442 + nrg = list_first_entry(&resv->region_cache, 443 + struct file_region, 444 + link); 445 + list_del(&nrg->link); 446 + resv->region_cache_count--; 447 + } 448 + 449 + if (!nrg) { 450 + spin_unlock(&resv->lock); 451 + nrg = kmalloc(sizeof(*nrg), GFP_KERNEL); 452 + if (!nrg) 453 + return -ENOMEM; 454 + goto retry; 455 + } 456 + 457 + del += t - f; 458 + 459 + /* New entry for end of split region */ 460 + nrg->from = t; 461 + nrg->to = rg->to; 462 + INIT_LIST_HEAD(&nrg->link); 463 + 464 + /* Original entry is trimmed */ 465 + rg->to = f; 466 + 467 + list_add(&nrg->link, &rg->link); 468 + nrg = NULL; 469 + break; 470 + } 471 + 472 + if (f <= rg->from && t >= rg->to) { /* Remove entire region */ 473 + del += rg->to - rg->from; 474 + list_del(&rg->link); 475 + kfree(rg); 476 + continue; 477 + } 478 + 479 + if (f <= rg->from) { /* Trim beginning of region */ 480 + del += t - rg->from; 481 + rg->from = t; 482 + } else { /* Trim end of region */ 483 + del += rg->to - f; 484 + rg->to = f; 485 + } 467 486 } 468 487 469 - /* Drop any remaining regions. */ 470 - list_for_each_entry_safe(rg, trg, rg->link.prev, link) { 471 - if (&rg->link == head) 472 - break; 473 - chg += rg->to - rg->from; 474 - list_del(&rg->link); 475 - kfree(rg); 476 - } 477 - 478 - out: 479 488 spin_unlock(&resv->lock); 480 - return chg; 489 + kfree(nrg); 490 + return del; 491 + } 492 + 493 + /* 494 + * A rare out of memory error was encountered which prevented removal of 495 + * the reserve map region for a page. The huge page itself was freed 496 + * and removed from the page cache.
This routine will adjust the subpool 497 + * usage count, and the global reserve count if needed. By incrementing 498 + * these counts, the reserve map entry which could not be deleted will 499 + * appear as a "reserved" entry instead of simply dangling with incorrect 500 + * counts. 501 + */ 502 + void hugetlb_fix_reserve_counts(struct inode *inode, bool restore_reserve) 503 + { 504 + struct hugepage_subpool *spool = subpool_inode(inode); 505 + long rsv_adjust; 506 + 507 + rsv_adjust = hugepage_subpool_get_pages(spool, 1); 508 + if (restore_reserve && rsv_adjust) { 509 + struct hstate *h = hstate_inode(inode); 510 + 511 + hugetlb_acct_memory(h, 1); 512 + } 481 513 } 482 514 483 515 /* ··· 688 544 struct resv_map *resv_map_alloc(void) 689 545 { 690 546 struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL); 691 - if (!resv_map) 547 + struct file_region *rg = kmalloc(sizeof(*rg), GFP_KERNEL); 548 + 549 + if (!resv_map || !rg) { 550 + kfree(resv_map); 551 + kfree(rg); 692 552 return NULL; 553 + } 693 554 694 555 kref_init(&resv_map->refs); 695 556 spin_lock_init(&resv_map->lock); 696 557 INIT_LIST_HEAD(&resv_map->regions); 558 + 559 + resv_map->adds_in_progress = 0; 560 + 561 + INIT_LIST_HEAD(&resv_map->region_cache); 562 + list_add(&rg->link, &resv_map->region_cache); 563 + resv_map->region_cache_count = 1; 697 564 698 565 return resv_map; 699 566 } ··· 712 557 void resv_map_release(struct kref *ref) 713 558 { 714 559 struct resv_map *resv_map = container_of(ref, struct resv_map, refs); 560 + struct list_head *head = &resv_map->region_cache; 561 + struct file_region *rg, *trg; 715 562 716 563 /* Clear out any active regions before we release the map. */ 717 - region_truncate(resv_map, 0); 564 + region_del(resv_map, 0, LONG_MAX); 565 + 566 + /* ... 
and any entries left in the cache */ 567 + list_for_each_entry_safe(rg, trg, head, link) { 568 + list_del(&rg->link); 569 + kfree(rg); 570 + } 571 + 572 + VM_BUG_ON(resv_map->adds_in_progress); 573 + 718 574 kfree(resv_map); 719 575 } 720 576 ··· 801 635 } 802 636 803 637 /* Shared mappings always use reserves */ 804 - if (vma->vm_flags & VM_MAYSHARE) 805 - return true; 638 + if (vma->vm_flags & VM_MAYSHARE) { 639 + /* 640 + * We know VM_NORESERVE is not set. Therefore, there SHOULD 641 + * be a region map for all pages. The only situation where 642 + * there is no region map is if a hole was punched via 643 + * fallocate. In this case, there really are no reserves to 644 + * use. This situation is indicated if chg != 0. 645 + */ 646 + if (chg) 647 + return false; 648 + else 649 + return true; 650 + } 806 651 807 652 /* 808 653 * Only the process that called mmap() has reserves for ··· 1331 1154 { 1332 1155 struct page *page; 1333 1156 1334 - page = alloc_pages_exact_node(nid, 1157 + page = __alloc_pages_node(nid, 1335 1158 htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| 1336 1159 __GFP_REPEAT|__GFP_NOWARN, 1337 1160 huge_page_order(h)); ··· 1483 1306 __GFP_REPEAT|__GFP_NOWARN, 1484 1307 huge_page_order(h)); 1485 1308 else 1486 - page = alloc_pages_exact_node(nid, 1309 + page = __alloc_pages_node(nid, 1487 1310 htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| 1488 1311 __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h)); ··· 1650 1473 } 1651 1474 } 1652 1475 1476 + 1653 1477 /* 1654 - * vma_needs_reservation and vma_commit_reservation are used by the huge 1655 - * page allocation routines to manage reservations. 1478 + * vma_needs_reservation, vma_commit_reservation and vma_end_reservation 1479 + * are used by the huge page allocation routines to manage reservations. 1656 1480 * 1657 1481 * vma_needs_reservation is called to determine if the huge page at addr 1658 1482 * within the vma has an associated reservation.
If a reservation is 1659 1483 * needed, the value 1 is returned. The caller is then responsible for 1660 1484 * managing the global reservation and subpool usage counts. After 1661 1485 * the huge page has been allocated, vma_commit_reservation is called 1662 - * to add the page to the reservation map. 1486 + * to add the page to the reservation map. If the page allocation fails, 1487 + * the reservation must be ended instead of committed. vma_end_reservation 1488 + * is called in such cases. 1663 1489 * 1664 1490 * In the normal case, vma_commit_reservation returns the same value 1665 1491 * as the preceding vma_needs_reservation call. The only time this ··· 1670 1490 * is the responsibility of the caller to notice the difference and 1671 1491 * take appropriate action. 1672 1492 */ 1493 + enum vma_resv_mode { 1494 + VMA_NEEDS_RESV, 1495 + VMA_COMMIT_RESV, 1496 + VMA_END_RESV, 1497 + }; 1673 1498 static long __vma_reservation_common(struct hstate *h, 1674 1499 struct vm_area_struct *vma, unsigned long addr, 1675 - bool commit) 1500 + enum vma_resv_mode mode) 1676 1501 { 1677 1502 struct resv_map *resv; 1678 1503 pgoff_t idx; ··· 1688 1503 return 1; 1689 1504 1690 1505 idx = vma_hugecache_offset(h, vma, addr); 1691 - if (commit) 1692 - ret = region_add(resv, idx, idx + 1); 1693 - else 1506 + switch (mode) { 1507 + case VMA_NEEDS_RESV: 1694 1508 ret = region_chg(resv, idx, idx + 1); 1509 + break; 1510 + case VMA_COMMIT_RESV: 1511 + ret = region_add(resv, idx, idx + 1); 1512 + break; 1513 + case VMA_END_RESV: 1514 + region_abort(resv, idx, idx + 1); 1515 + ret = 0; 1516 + break; 1517 + default: 1518 + BUG(); 1519 + } 1695 1520 1696 1521 if (vma->vm_flags & VM_MAYSHARE) 1697 1522 return ret; ··· 1712 1517 static long vma_needs_reservation(struct hstate *h, 1713 1518 struct vm_area_struct *vma, unsigned long addr) 1714 1519 { 1715 - return __vma_reservation_common(h, vma, addr, false); 1520 + return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV); 1716 1521 } 
1717 1522 1718 1523 static long vma_commit_reservation(struct hstate *h, 1719 1524 struct vm_area_struct *vma, unsigned long addr) 1720 1525 { 1721 - return __vma_reservation_common(h, vma, addr, true); 1526 + return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV); 1722 1527 } 1723 1528 1724 - static struct page *alloc_huge_page(struct vm_area_struct *vma, 1529 + static void vma_end_reservation(struct hstate *h, 1530 + struct vm_area_struct *vma, unsigned long addr) 1531 + { 1532 + (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV); 1533 + } 1534 + 1535 + struct page *alloc_huge_page(struct vm_area_struct *vma, 1725 1536 unsigned long addr, int avoid_reserve) 1726 1537 { 1727 1538 struct hugepage_subpool *spool = subpool_vma(vma); 1728 1539 struct hstate *h = hstate_vma(vma); 1729 1540 struct page *page; 1730 - long chg, commit; 1541 + long map_chg, map_commit; 1542 + long gbl_chg; 1731 1543 int ret, idx; 1732 1544 struct hugetlb_cgroup *h_cg; 1733 1545 1734 1546 idx = hstate_index(h); 1735 1547 /* 1736 - * Processes that did not create the mapping will have no 1737 - * reserves and will not have accounted against subpool 1738 - * limit. Check that the subpool limit can be made before 1739 - * satisfying the allocation MAP_NORESERVE mappings may also 1740 - * need pages and subpool limit allocated allocated if no reserve 1741 - * mapping overlaps. 1548 + * Examine the region/reserve map to determine if the process 1549 + * has a reservation for the page to be allocated. A return 1550 + * code of zero indicates a reservation exists (no change). 
1742 1551 */ 1743 - chg = vma_needs_reservation(h, vma, addr); 1744 - if (chg < 0) 1552 + map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); 1553 + if (map_chg < 0) 1745 1554 return ERR_PTR(-ENOMEM); 1746 - if (chg || avoid_reserve) 1747 - if (hugepage_subpool_get_pages(spool, 1) < 0) 1555 + 1556 + /* 1557 + * Processes that did not create the mapping will have no 1558 + * reserves as indicated by the region/reserve map. Check 1559 + * that the allocation will not exceed the subpool limit. 1560 + * Allocations for MAP_NORESERVE mappings also need to be 1561 + * checked against any subpool limit. 1562 + */ 1563 + if (map_chg || avoid_reserve) { 1564 + gbl_chg = hugepage_subpool_get_pages(spool, 1); 1565 + if (gbl_chg < 0) { 1566 + vma_end_reservation(h, vma, addr); 1748 1567 return ERR_PTR(-ENOSPC); 1568 + } 1569 + 1570 + /* 1571 + * Even though there was no reservation in the region/reserve 1572 + * map, there could be reservations associated with the 1573 + * subpool that can be used. This would be indicated if the 1574 + * return value of hugepage_subpool_get_pages() is zero. 1575 + * However, if avoid_reserve is specified we still avoid even 1576 + * the subpool reservations. 1577 + */ 1578 + if (avoid_reserve) 1579 + gbl_chg = 1; 1580 + } 1749 1581 1750 1582 ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); 1751 1583 if (ret) 1752 1584 goto out_subpool_put; 1753 1585 1754 1586 spin_lock(&hugetlb_lock); 1755 - page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg); 1587 + /* 1588 + * gbl_chg is passed to indicate whether or not a page must be taken 1589 + * from the global free pool (global change). gbl_chg == 0 indicates 1590 + * a reservation exists for the allocation.
1591 + */ 1592 + page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg); 1756 1593 if (!page) { 1757 1594 spin_unlock(&hugetlb_lock); 1758 1595 page = alloc_buddy_huge_page(h, NUMA_NO_NODE); ··· 1800 1573 1801 1574 set_page_private(page, (unsigned long)spool); 1802 1575 1803 - commit = vma_commit_reservation(h, vma, addr); 1804 - if (unlikely(chg > commit)) { 1576 + map_commit = vma_commit_reservation(h, vma, addr); 1577 + if (unlikely(map_chg > map_commit)) { 1805 1578 /* 1806 1579 * The page was added to the reservation map between 1807 1580 * vma_needs_reservation and vma_commit_reservation. ··· 1821 1594 out_uncharge_cgroup: 1822 1595 hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg); 1823 1596 out_subpool_put: 1824 - if (chg || avoid_reserve) 1597 + if (map_chg || avoid_reserve) 1825 1598 hugepage_subpool_put_pages(spool, 1); 1599 + vma_end_reservation(h, vma, addr); 1826 1600 return ERR_PTR(-ENOSPC); 1827 1601 } 1828 1602 ··· 2539 2311 } 2540 2312 2541 2313 kobject_put(hugepages_kobj); 2542 - kfree(htlb_fault_mutex_table); 2314 + kfree(hugetlb_fault_mutex_table); 2543 2315 } 2544 2316 module_exit(hugetlb_exit); 2545 2317 ··· 2572 2344 #else 2573 2345 num_fault_mutexes = 1; 2574 2346 #endif 2575 - htlb_fault_mutex_table = 2347 + hugetlb_fault_mutex_table = 2576 2348 kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL); 2577 - BUG_ON(!htlb_fault_mutex_table); 2349 + BUG_ON(!hugetlb_fault_mutex_table); 2578 2350 2579 2351 for (i = 0; i < num_fault_mutexes; i++) 2580 - mutex_init(&htlb_fault_mutex_table[i]); 2352 + mutex_init(&hugetlb_fault_mutex_table[i]); 2581 2353 return 0; 2582 2354 } 2583 2355 module_init(hugetlb_init); ··· 3375 3147 return page != NULL; 3376 3148 } 3377 3149 3150 + int huge_add_to_page_cache(struct page *page, struct address_space *mapping, 3151 + pgoff_t idx) 3152 + { 3153 + struct inode *inode = mapping->host; 3154 + struct hstate *h = hstate_inode(inode); 3155 + int err = add_to_page_cache(page, 
mapping, idx, GFP_KERNEL); 3156 + 3157 + if (err) 3158 + return err; 3159 + ClearPagePrivate(page); 3160 + 3161 + spin_lock(&inode->i_lock); 3162 + inode->i_blocks += blocks_per_huge_page(h); 3163 + spin_unlock(&inode->i_lock); 3164 + return 0; 3165 + } 3166 + 3378 3167 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, 3379 3168 struct address_space *mapping, pgoff_t idx, 3380 3169 unsigned long address, pte_t *ptep, unsigned int flags) ··· 3439 3194 set_page_huge_active(page); 3440 3195 3441 3196 if (vma->vm_flags & VM_MAYSHARE) { 3442 - int err; 3443 - struct inode *inode = mapping->host; 3444 - 3445 - err = add_to_page_cache(page, mapping, idx, GFP_KERNEL); 3197 + int err = huge_add_to_page_cache(page, mapping, idx); 3446 3198 if (err) { 3447 3199 put_page(page); 3448 3200 if (err == -EEXIST) 3449 3201 goto retry; 3450 3202 goto out; 3451 3203 } 3452 - ClearPagePrivate(page); 3453 - 3454 - spin_lock(&inode->i_lock); 3455 - inode->i_blocks += blocks_per_huge_page(h); 3456 - spin_unlock(&inode->i_lock); 3457 3204 } else { 3458 3205 lock_page(page); 3459 3206 if (unlikely(anon_vma_prepare(vma))) { ··· 3473 3236 * any allocations necessary to record that reservation occur outside 3474 3237 * the spinlock. 
3475 3238 */ 3476 - if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) 3239 + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { 3477 3240 if (vma_needs_reservation(h, vma, address) < 0) { 3478 3241 ret = VM_FAULT_OOM; 3479 3242 goto backout_unlocked; 3480 3243 } 3244 + /* Just decrements count, does not deallocate */ 3245 + vma_end_reservation(h, vma, address); 3246 + } 3481 3247 3482 3248 ptl = huge_pte_lockptr(h, mm, ptep); 3483 3249 spin_lock(ptl); ··· 3520 3280 } 3521 3281 3522 3282 #ifdef CONFIG_SMP 3523 - static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm, 3283 + u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm, 3524 3284 struct vm_area_struct *vma, 3525 3285 struct address_space *mapping, 3526 3286 pgoff_t idx, unsigned long address) ··· 3545 3305 * For uniprocessor systems we always use a single mutex, so just 3546 3306 * return 0 and avoid the hashing overhead. 3547 3307 */ 3548 - static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm, 3308 + u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm, 3549 3309 struct vm_area_struct *vma, 3550 3310 struct address_space *mapping, 3551 3311 pgoff_t idx, unsigned long address) ··· 3593 3353 * get spurious allocation failures if two CPUs race to instantiate 3594 3354 * the same page in the page cache.
3595 3355 */ 3596 - hash = fault_mutex_hash(h, mm, vma, mapping, idx, address); 3597 - mutex_lock(&htlb_fault_mutex_table[hash]); 3356 + hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address); 3357 + mutex_lock(&hugetlb_fault_mutex_table[hash]); 3598 3358 3599 3359 entry = huge_ptep_get(ptep); 3600 3360 if (huge_pte_none(entry)) { ··· 3627 3387 ret = VM_FAULT_OOM; 3628 3388 goto out_mutex; 3629 3389 } 3390 + /* Just decrements count, does not deallocate */ 3391 + vma_end_reservation(h, vma, address); 3630 3392 3631 3393 if (!(vma->vm_flags & VM_MAYSHARE)) 3632 3394 pagecache_page = hugetlbfs_pagecache_page(h, ··· 3679 3437 put_page(pagecache_page); 3680 3438 } 3681 3439 out_mutex: 3682 - mutex_unlock(&htlb_fault_mutex_table[hash]); 3440 + mutex_unlock(&hugetlb_fault_mutex_table[hash]); 3683 3441 /* 3684 3442 * Generally it's safe to hold refcount during waiting page lock. But 3685 3443 * here we just wait to defer the next page fault to avoid busy loop and ··· 3968 3726 } 3969 3727 return 0; 3970 3728 out_err: 3729 + if (!vma || vma->vm_flags & VM_MAYSHARE) 3730 + region_abort(resv_map, from, to); 3971 3731 if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) 3972 3732 kref_put(&resv_map->refs, resv_map_release); 3973 3733 return ret; 3974 3734 } 3975 3735 3976 - void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed) 3736 + long hugetlb_unreserve_pages(struct inode *inode, long start, long end, 3737 + long freed) 3977 3738 { 3978 3739 struct hstate *h = hstate_inode(inode); 3979 3740 struct resv_map *resv_map = inode_resv_map(inode); ··· 3984 3739 struct hugepage_subpool *spool = subpool_inode(inode); 3985 3740 long gbl_reserve; 3986 3741 3987 - if (resv_map) 3988 - chg = region_truncate(resv_map, offset); 3742 + if (resv_map) { 3743 + chg = region_del(resv_map, start, end); 3744 + /* 3745 + * region_del() can fail in the rare case where a region 3746 + * must be split and another region descriptor can not be 3747 + * allocated. 
If end == LONG_MAX, it will not fail. 3748 + */ 3749 + if (chg < 0) 3750 + return chg; 3751 + } 3752 + 3989 3753 spin_lock(&inode->i_lock); 3990 3754 inode->i_blocks -= (blocks_per_huge_page(h) * freed); 3991 3755 spin_unlock(&inode->i_lock); ··· 4005 3751 */ 4006 3752 gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed)); 4007 3753 hugetlb_acct_memory(h, -gbl_reserve); 3754 + 3755 + return 0; 4008 3756 } 4009 3757 4010 3758 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+1 -1
mm/hwpoison-inject.c
··· 58 58 pr_info("Injecting memory failure at pfn %#lx\n", pfn); 59 59 return memory_failure(pfn, 18, MF_COUNT_INCREASED); 60 60 put_out: 61 - put_page(p); 61 + put_hwpoison_page(p); 62 62 return 0; 63 63 } 64 64
+1
mm/internal.h
··· 182 182 unsigned long nr_migratepages; /* Number of pages to migrate */ 183 183 unsigned long free_pfn; /* isolate_freepages search base */ 184 184 unsigned long migrate_pfn; /* isolate_migratepages search base */ 185 + unsigned long last_migrated_pfn;/* Not yet flushed page being freed */ 185 186 enum migrate_mode mode; /* Async or sync migration mode */ 186 187 bool ignore_skip_hint; /* Scan blocks even if marked skip */ 187 188 int order; /* order a direct compactor needs */
+2 -1
mm/kmemleak.c
··· 838 838 } 839 839 840 840 if (crt_early_log >= ARRAY_SIZE(early_log)) { 841 + crt_early_log++; 841 842 kmemleak_disable(); 842 843 return; 843 844 } ··· 1883 1882 object_cache = KMEM_CACHE(kmemleak_object, SLAB_NOLEAKTRACE); 1884 1883 scan_area_cache = KMEM_CACHE(kmemleak_scan_area, SLAB_NOLEAKTRACE); 1885 1884 1886 - if (crt_early_log >= ARRAY_SIZE(early_log)) 1885 + if (crt_early_log > ARRAY_SIZE(early_log)) 1887 1886 pr_warning("Early log buffer exceeded (%d), please increase " 1888 1887 "DEBUG_KMEMLEAK_EARLY_LOG_SIZE\n", crt_early_log); 1889 1888
+2 -2
mm/list_lru.c
··· 99 99 struct list_lru_one *l; 100 100 101 101 spin_lock(&nlru->lock); 102 - l = list_lru_from_kmem(nlru, item); 103 102 if (list_empty(item)) { 103 + l = list_lru_from_kmem(nlru, item); 104 104 list_add_tail(item, &l->list); 105 105 l->nr_items++; 106 106 spin_unlock(&nlru->lock); ··· 118 118 struct list_lru_one *l; 119 119 120 120 spin_lock(&nlru->lock); 121 - l = list_lru_from_kmem(nlru, item); 122 121 if (!list_empty(item)) { 122 + l = list_lru_from_kmem(nlru, item); 123 123 list_del_init(item); 124 124 l->nr_items--; 125 125 spin_unlock(&nlru->lock);
+1 -1
mm/madvise.c
··· 301 301 302 302 *prev = NULL; /* tell sys_madvise we drop mmap_sem */ 303 303 304 - if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB)) 304 + if (vma->vm_flags & VM_LOCKED) 305 305 return -EINVAL; 306 306 307 307 f = vma->vm_file;
+16 -15
mm/memblock.c
··· 91 91 return ((base1 < (base2 + size2)) && (base2 < (base1 + size1))); 92 92 } 93 93 94 - static long __init_memblock memblock_overlaps_region(struct memblock_type *type, 94 + bool __init_memblock memblock_overlaps_region(struct memblock_type *type, 95 95 phys_addr_t base, phys_addr_t size) 96 96 { 97 97 unsigned long i; ··· 103 103 break; 104 104 } 105 105 106 - return (i < type->cnt) ? i : -1; 106 + return i < type->cnt; 107 107 } 108 108 109 109 /* ··· 569 569 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP 570 570 WARN_ON(nid != memblock_get_region_node(rgn)); 571 571 #endif 572 + WARN_ON(flags != rgn->flags); 572 573 nr_new++; 573 574 if (insert) 574 575 memblock_insert_region(type, i++, base, ··· 615 614 int nid, 616 615 unsigned long flags) 617 616 { 618 - struct memblock_type *_rgn = &memblock.memory; 617 + struct memblock_type *type = &memblock.memory; 619 618 620 619 memblock_dbg("memblock_add: [%#016llx-%#016llx] flags %#02lx %pF\n", 621 620 (unsigned long long)base, 622 621 (unsigned long long)base + size - 1, 623 622 flags, (void *)_RET_IP_); 624 623 625 - return memblock_add_range(_rgn, base, size, nid, flags); 624 + return memblock_add_range(type, base, size, nid, flags); 626 625 } 627 626 628 627 int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size) ··· 762 761 * 763 762 * This function isolates region [@base, @base + @size), and sets/clears flag 764 763 * 765 - * Return 0 on succees, -errno on failure. 764 + * Return 0 on success, -errno on failure. 766 765 */ 767 766 static int __init_memblock memblock_setclr_flag(phys_addr_t base, 768 767 phys_addr_t size, int set, int flag) ··· 789 788 * @base: the base phys addr of the region 790 789 * @size: the size of the region 791 790 * 792 - * Return 0 on succees, -errno on failure. 791 + * Return 0 on success, -errno on failure. 
793 792 */ 794 793 int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size) 795 794 { ··· 801 800 * @base: the base phys addr of the region 802 801 * @size: the size of the region 803 802 * 804 - * Return 0 on succees, -errno on failure. 803 + * Return 0 on success, -errno on failure. 805 804 */ 806 805 int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) 807 806 { ··· 813 812 * @base: the base phys addr of the region 814 813 * @size: the size of the region 815 814 * 816 - * Return 0 on succees, -errno on failure. 815 + * Return 0 on success, -errno on failure. 817 816 */ 818 817 int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) 819 818 { ··· 835 834 phys_addr_t *out_start, 836 835 phys_addr_t *out_end) 837 836 { 838 - struct memblock_type *rsv = &memblock.reserved; 837 + struct memblock_type *type = &memblock.reserved; 839 838 840 - if (*idx >= 0 && *idx < rsv->cnt) { 841 - struct memblock_region *r = &rsv->regions[*idx]; 839 + if (*idx >= 0 && *idx < type->cnt) { 840 + struct memblock_region *r = &type->regions[*idx]; 842 841 phys_addr_t base = r->base; 843 842 phys_addr_t size = r->size; 844 843 ··· 976 975 * in type_b. 977 976 * 978 977 * @idx: pointer to u64 loop variable 979 - * @nid: nid: node selector, %NUMA_NO_NODE for all nodes 978 + * @nid: node selector, %NUMA_NO_NODE for all nodes 980 979 * @flags: pick from blocks based on memory attributes 981 980 * @type_a: pointer to memblock_type from where the range is taken 982 981 * @type_b: pointer to memblock_type which excludes memory from being taken ··· 1566 1565 * Check if the region [@base, @base+@size) intersects a reserved memory block. 1567 1566 * 1568 1567 * RETURNS: 1569 - * 0 if false, non-zero if true 1568 + * True if they intersect, false if not. 
1570 1569 */ 1571 - int __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size) 1570 + bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size) 1572 1571 { 1573 1572 memblock_cap_size(base, &size); 1574 - return memblock_overlaps_region(&memblock.reserved, base, size) >= 0; 1573 + return memblock_overlaps_region(&memblock.reserved, base, size); 1575 1574 } 1576 1575 1577 1576 void __init_memblock memblock_trim_memory(phys_addr_t align)
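The memblock change above converts an index-or-minus-one return convention into a plain boolean, which removes the easy-to-misread `>= 0` test at the call site. A minimal standalone sketch of the same interval-overlap predicate (the `struct region` type and function names here are invented for illustration, not the kernel's):

```c
#include <stdbool.h>
#include <stddef.h>

struct region {
    unsigned long base, size;
};

/* Two half-open ranges [base1, base1+size1) and [base2, base2+size2)
 * overlap iff each one starts before the other ends. */
static bool ranges_overlap(unsigned long base1, unsigned long size1,
                           unsigned long base2, unsigned long size2)
{
    return (base1 < base2 + size2) && (base2 < base1 + size1);
}

/* The shape memblock_overlaps_region() takes after this patch:
 * a plain yes/no answer instead of an index-or-minus-one. */
static bool any_overlap(const struct region *regs, size_t n,
                        unsigned long base, unsigned long size)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ranges_overlap(base, size, regs[i].base, regs[i].size))
            return true;
    return false;
}
```

The boolean form also lets `memblock_is_region_reserved()` simply forward the result, as the hunk above shows.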
+43 -347
mm/memcontrol.c
··· 111 111 "unevictable", 112 112 }; 113 113 114 - /* 115 - * Per memcg event counter is incremented at every pagein/pageout. With THP, 116 - * it will be incremated by the number of pages. This counter is used for 117 - * for trigger some periodic events. This is straightforward and better 118 - * than using jiffies etc. to handle periodic memcg event. 119 - */ 120 - enum mem_cgroup_events_target { 121 - MEM_CGROUP_TARGET_THRESH, 122 - MEM_CGROUP_TARGET_SOFTLIMIT, 123 - MEM_CGROUP_TARGET_NUMAINFO, 124 - MEM_CGROUP_NTARGETS, 125 - }; 126 114 #define THRESHOLDS_EVENTS_TARGET 128 127 115 #define SOFTLIMIT_EVENTS_TARGET 1024 128 116 #define NUMAINFO_EVENTS_TARGET 1024 129 - 130 - struct mem_cgroup_stat_cpu { 131 - long count[MEM_CGROUP_STAT_NSTATS]; 132 - unsigned long events[MEMCG_NR_EVENTS]; 133 - unsigned long nr_page_events; 134 - unsigned long targets[MEM_CGROUP_NTARGETS]; 135 - }; 136 - 137 - struct reclaim_iter { 138 - struct mem_cgroup *position; 139 - /* scan generation, increased every round-trip */ 140 - unsigned int generation; 141 - }; 142 - 143 - /* 144 - * per-zone information in memory controller. 
145 - */ 146 - struct mem_cgroup_per_zone { 147 - struct lruvec lruvec; 148 - unsigned long lru_size[NR_LRU_LISTS]; 149 - 150 - struct reclaim_iter iter[DEF_PRIORITY + 1]; 151 - 152 - struct rb_node tree_node; /* RB tree node */ 153 - unsigned long usage_in_excess;/* Set to the value by which */ 154 - /* the soft limit is exceeded*/ 155 - bool on_tree; 156 - struct mem_cgroup *memcg; /* Back pointer, we cannot */ 157 - /* use container_of */ 158 - }; 159 - 160 - struct mem_cgroup_per_node { 161 - struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES]; 162 - }; 163 117 164 118 /* 165 119 * Cgroups above their limits are maintained in a RB-Tree, independent of ··· 134 180 }; 135 181 136 182 static struct mem_cgroup_tree soft_limit_tree __read_mostly; 137 - 138 - struct mem_cgroup_threshold { 139 - struct eventfd_ctx *eventfd; 140 - unsigned long threshold; 141 - }; 142 - 143 - /* For threshold */ 144 - struct mem_cgroup_threshold_ary { 145 - /* An array index points to threshold just below or equal to usage. */ 146 - int current_threshold; 147 - /* Size of entries[] */ 148 - unsigned int size; 149 - /* Array of thresholds */ 150 - struct mem_cgroup_threshold entries[0]; 151 - }; 152 - 153 - struct mem_cgroup_thresholds { 154 - /* Primary thresholds array */ 155 - struct mem_cgroup_threshold_ary *primary; 156 - /* 157 - * Spare threshold array. 158 - * This is needed to make mem_cgroup_unregister_event() "never fail". 159 - * It must be able to store at least primary->size - 1 entries. 160 - */ 161 - struct mem_cgroup_threshold_ary *spare; 162 - }; 163 183 164 184 /* for OOM */ 165 185 struct mem_cgroup_eventfd_list { ··· 183 255 184 256 static void mem_cgroup_threshold(struct mem_cgroup *memcg); 185 257 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg); 186 - 187 - /* 188 - * The memory controller data structure. The memory controller controls both 189 - * page cache and RSS per cgroup. 
We would eventually like to provide 190 - * statistics based on the statistics developed by Rik Van Riel for clock-pro, 191 - * to help the administrator determine what knobs to tune. 192 - */ 193 - struct mem_cgroup { 194 - struct cgroup_subsys_state css; 195 - 196 - /* Accounted resources */ 197 - struct page_counter memory; 198 - struct page_counter memsw; 199 - struct page_counter kmem; 200 - 201 - /* Normal memory consumption range */ 202 - unsigned long low; 203 - unsigned long high; 204 - 205 - unsigned long soft_limit; 206 - 207 - /* vmpressure notifications */ 208 - struct vmpressure vmpressure; 209 - 210 - /* css_online() has been completed */ 211 - int initialized; 212 - 213 - /* 214 - * Should the accounting and control be hierarchical, per subtree? 215 - */ 216 - bool use_hierarchy; 217 - 218 - /* protected by memcg_oom_lock */ 219 - bool oom_lock; 220 - int under_oom; 221 - 222 - int swappiness; 223 - /* OOM-Killer disable */ 224 - int oom_kill_disable; 225 - 226 - /* protect arrays of thresholds */ 227 - struct mutex thresholds_lock; 228 - 229 - /* thresholds for memory usage. RCU-protected */ 230 - struct mem_cgroup_thresholds thresholds; 231 - 232 - /* thresholds for mem+swap usage. RCU-protected */ 233 - struct mem_cgroup_thresholds memsw_thresholds; 234 - 235 - /* For oom notifier event fd */ 236 - struct list_head oom_notify; 237 - 238 - /* 239 - * Should we move charges of a task when a task is moved into this 240 - * mem_cgroup ? And what type of charges should we move ? 241 - */ 242 - unsigned long move_charge_at_immigrate; 243 - /* 244 - * set > 0 if pages under this cgroup are moving to other cgroup. 245 - */ 246 - atomic_t moving_account; 247 - /* taken only while moving_account > 0 */ 248 - spinlock_t move_lock; 249 - struct task_struct *move_lock_task; 250 - unsigned long move_lock_flags; 251 - /* 252 - * percpu counter. 
253 - */ 254 - struct mem_cgroup_stat_cpu __percpu *stat; 255 - spinlock_t pcp_counter_lock; 256 - 257 - #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) 258 - struct cg_proto tcp_mem; 259 - #endif 260 - #if defined(CONFIG_MEMCG_KMEM) 261 - /* Index in the kmem_cache->memcg_params.memcg_caches array */ 262 - int kmemcg_id; 263 - bool kmem_acct_activated; 264 - bool kmem_acct_active; 265 - #endif 266 - 267 - int last_scanned_node; 268 - #if MAX_NUMNODES > 1 269 - nodemask_t scan_nodes; 270 - atomic_t numainfo_events; 271 - atomic_t numainfo_updating; 272 - #endif 273 - 274 - #ifdef CONFIG_CGROUP_WRITEBACK 275 - struct list_head cgwb_list; 276 - struct wb_domain cgwb_domain; 277 - #endif 278 - 279 - /* List of events which userspace want to receive */ 280 - struct list_head event_list; 281 - spinlock_t event_list_lock; 282 - 283 - struct mem_cgroup_per_node *nodeinfo[0]; 284 - /* WARNING: nodeinfo must be the last member here */ 285 - }; 286 - 287 - #ifdef CONFIG_MEMCG_KMEM 288 - bool memcg_kmem_is_active(struct mem_cgroup *memcg) 289 - { 290 - return memcg->kmem_acct_active; 291 - } 292 - #endif 293 258 294 259 /* Stuffs for move charges at task migration. */ 295 260 /* ··· 243 422 * appearing has to hold it as well. 244 423 */ 245 424 static DEFINE_MUTEX(memcg_create_mutex); 246 - 247 - struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) 248 - { 249 - return s ? container_of(s, struct mem_cgroup, css) : NULL; 250 - } 251 425 252 426 /* Some nice accessors for the vmpressure. 
*/ 253 427 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg) ··· 315 499 rcu_read_lock(); 316 500 memcg = mem_cgroup_from_task(current); 317 501 cg_proto = sk->sk_prot->proto_cgroup(memcg); 318 - if (!mem_cgroup_is_root(memcg) && 319 - memcg_proto_active(cg_proto) && 502 + if (cg_proto && test_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags) && 320 503 css_tryget_online(&memcg->css)) { 321 504 sk->sk_cgrp = cg_proto; 322 505 } ··· 406 591 int zid = zone_idx(zone); 407 592 408 593 return &memcg->nodeinfo[nid]->zoneinfo[zid]; 409 - } 410 - 411 - struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg) 412 - { 413 - return &memcg->css; 414 594 } 415 595 416 596 /** ··· 686 876 __this_cpu_add(memcg->stat->nr_page_events, nr_pages); 687 877 } 688 878 689 - unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru) 690 - { 691 - struct mem_cgroup_per_zone *mz; 692 - 693 - mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec); 694 - return mz->lru_size[lru]; 695 - } 696 - 697 879 static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg, 698 880 int nid, 699 881 unsigned int lru_mask) ··· 788 986 789 987 return mem_cgroup_from_css(task_css(p, memory_cgrp_id)); 790 988 } 989 + EXPORT_SYMBOL(mem_cgroup_from_task); 791 990 792 991 static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) 793 992 { ··· 834 1031 struct mem_cgroup *prev, 835 1032 struct mem_cgroup_reclaim_cookie *reclaim) 836 1033 { 837 - struct reclaim_iter *uninitialized_var(iter); 1034 + struct mem_cgroup_reclaim_iter *uninitialized_var(iter); 838 1035 struct cgroup_subsys_state *css = NULL; 839 1036 struct mem_cgroup *memcg = NULL; 840 1037 struct mem_cgroup *pos = NULL; ··· 976 1173 iter != NULL; \ 977 1174 iter = mem_cgroup_iter(NULL, iter, NULL)) 978 1175 979 - void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx) 980 - { 981 - struct mem_cgroup *memcg; 982 - 983 - rcu_read_lock(); 984 - memcg = 
mem_cgroup_from_task(rcu_dereference(mm->owner)); 985 - if (unlikely(!memcg)) 986 - goto out; 987 - 988 - switch (idx) { 989 - case PGFAULT: 990 - this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]); 991 - break; 992 - case PGMAJFAULT: 993 - this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]); 994 - break; 995 - default: 996 - BUG(); 997 - } 998 - out: 999 - rcu_read_unlock(); 1000 - } 1001 - EXPORT_SYMBOL(__mem_cgroup_count_vm_event); 1002 - 1003 1176 /** 1004 1177 * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg 1005 1178 * @zone: zone of the wanted lruvec ··· 1074 1295 VM_BUG_ON((long)(*lru_size) < 0); 1075 1296 } 1076 1297 1077 - bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root) 1078 - { 1079 - if (root == memcg) 1080 - return true; 1081 - if (!root->use_hierarchy) 1082 - return false; 1083 - return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup); 1084 - } 1085 - 1086 1298 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg) 1087 1299 { 1088 1300 struct mem_cgroup *task_memcg; ··· 1098 1328 ret = mem_cgroup_is_descendant(task_memcg, memcg); 1099 1329 css_put(&task_memcg->css); 1100 1330 return ret; 1101 - } 1102 - 1103 - int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) 1104 - { 1105 - unsigned long inactive_ratio; 1106 - unsigned long inactive; 1107 - unsigned long active; 1108 - unsigned long gb; 1109 - 1110 - inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON); 1111 - active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON); 1112 - 1113 - gb = (inactive + active) >> (30 - PAGE_SHIFT); 1114 - if (gb) 1115 - inactive_ratio = int_sqrt(10 * gb); 1116 - else 1117 - inactive_ratio = 1; 1118 - 1119 - return inactive * inactive_ratio < active; 1120 - } 1121 - 1122 - bool mem_cgroup_lruvec_online(struct lruvec *lruvec) 1123 - { 1124 - struct mem_cgroup_per_zone *mz; 1125 - struct mem_cgroup *memcg; 1126 - 1127 - if (mem_cgroup_disabled()) 
1128 - return true; 1129 - 1130 - mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec); 1131 - memcg = mz->memcg; 1132 - 1133 - return !!(memcg->css.flags & CSS_ONLINE); 1134 1331 } 1135 1332 1136 1333 #define mem_cgroup_from_counter(counter, member) \ ··· 1129 1392 } 1130 1393 1131 1394 return margin; 1132 - } 1133 - 1134 - int mem_cgroup_swappiness(struct mem_cgroup *memcg) 1135 - { 1136 - /* root ? */ 1137 - if (mem_cgroup_disabled() || !memcg->css.parent) 1138 - return vm_swappiness; 1139 - 1140 - return memcg->swappiness; 1141 1395 } 1142 1396 1143 1397 /* ··· 1273 1545 static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, 1274 1546 int order) 1275 1547 { 1548 + struct oom_control oc = { 1549 + .zonelist = NULL, 1550 + .nodemask = NULL, 1551 + .gfp_mask = gfp_mask, 1552 + .order = order, 1553 + }; 1276 1554 struct mem_cgroup *iter; 1277 1555 unsigned long chosen_points = 0; 1278 1556 unsigned long totalpages; ··· 1297 1563 goto unlock; 1298 1564 } 1299 1565 1300 - check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg); 1566 + check_panic_on_oom(&oc, CONSTRAINT_MEMCG, memcg); 1301 1567 totalpages = mem_cgroup_get_limit(memcg) ? 
: 1; 1302 1568 for_each_mem_cgroup_tree(iter, memcg) { 1303 1569 struct css_task_iter it; ··· 1305 1571 1306 1572 css_task_iter_start(&iter->css, &it); 1307 1573 while ((task = css_task_iter_next(&it))) { 1308 - switch (oom_scan_process_thread(task, totalpages, NULL, 1309 - false)) { 1574 + switch (oom_scan_process_thread(&oc, task, totalpages)) { 1310 1575 case OOM_SCAN_SELECT: 1311 1576 if (chosen) 1312 1577 put_task_struct(chosen); ··· 1343 1610 1344 1611 if (chosen) { 1345 1612 points = chosen_points * 1000 / totalpages; 1346 - oom_kill_process(chosen, gfp_mask, order, points, totalpages, 1347 - memcg, NULL, "Memory cgroup out of memory"); 1613 + oom_kill_process(&oc, chosen, points, totalpages, memcg, 1614 + "Memory cgroup out of memory"); 1348 1615 } 1349 1616 unlock: 1350 1617 mutex_unlock(&oom_lock); ··· 1795 2062 } 1796 2063 EXPORT_SYMBOL(mem_cgroup_end_page_stat); 1797 2064 1798 - /** 1799 - * mem_cgroup_update_page_stat - update page state statistics 1800 - * @memcg: memcg to account against 1801 - * @idx: page state item to account 1802 - * @val: number of pages (positive or negative) 1803 - * 1804 - * See mem_cgroup_begin_page_stat() for locking requirements. 1805 - */ 1806 - void mem_cgroup_update_page_stat(struct mem_cgroup *memcg, 1807 - enum mem_cgroup_stat_index idx, int val) 1808 - { 1809 - VM_BUG_ON(!rcu_read_lock_held()); 1810 - 1811 - if (memcg) 1812 - this_cpu_add(memcg->stat->count[idx], val); 1813 - } 1814 - 1815 2065 /* 1816 2066 * size of first charge trial. "32" comes from vmscan.c's magic value. 1817 2067 * TODO: maybe necessary to use big numbers in big irons. ··· 2218 2502 page_counter_uncharge(&memcg->kmem, nr_pages); 2219 2503 2220 2504 css_put_many(&memcg->css, nr_pages); 2221 - } 2222 - 2223 - /* 2224 - * helper for acessing a memcg's index. It will be used as an index in the 2225 - * child cache array in kmem_cache, and also to derive its name. This function 2226 - * will return -1 when this is not a kmem-limited memcg. 
2227 - */ 2228 - int memcg_cache_id(struct mem_cgroup *memcg) 2229 - { 2230 - return memcg ? memcg->kmemcg_id : -1; 2231 2505 } 2232 2506 2233 2507 static int memcg_alloc_cache_id(void) ··· 4833 5127 static int mem_cgroup_can_attach(struct cgroup_subsys_state *css, 4834 5128 struct cgroup_taskset *tset) 4835 5129 { 4836 - struct task_struct *p = cgroup_taskset_first(tset); 4837 - int ret = 0; 4838 5130 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 5131 + struct mem_cgroup *from; 5132 + struct task_struct *p; 5133 + struct mm_struct *mm; 4839 5134 unsigned long move_flags; 5135 + int ret = 0; 4840 5136 4841 5137 /* 4842 5138 * We are now commited to this value whatever it is. Changes in this ··· 4846 5138 * So we need to save it, and keep it going. 4847 5139 */ 4848 5140 move_flags = READ_ONCE(memcg->move_charge_at_immigrate); 4849 - if (move_flags) { 4850 - struct mm_struct *mm; 4851 - struct mem_cgroup *from = mem_cgroup_from_task(p); 5141 + if (!move_flags) 5142 + return 0; 4852 5143 4853 - VM_BUG_ON(from == memcg); 5144 + p = cgroup_taskset_first(tset); 5145 + from = mem_cgroup_from_task(p); 4854 5146 4855 - mm = get_task_mm(p); 4856 - if (!mm) 4857 - return 0; 4858 - /* We move charges only when we move a owner of the mm */ 4859 - if (mm->owner == p) { 4860 - VM_BUG_ON(mc.from); 4861 - VM_BUG_ON(mc.to); 4862 - VM_BUG_ON(mc.precharge); 4863 - VM_BUG_ON(mc.moved_charge); 4864 - VM_BUG_ON(mc.moved_swap); 5147 + VM_BUG_ON(from == memcg); 4865 5148 4866 - spin_lock(&mc.lock); 4867 - mc.from = from; 4868 - mc.to = memcg; 4869 - mc.flags = move_flags; 4870 - spin_unlock(&mc.lock); 4871 - /* We set mc.moving_task later */ 5149 + mm = get_task_mm(p); 5150 + if (!mm) 5151 + return 0; 5152 + /* We move charges only when we move a owner of the mm */ 5153 + if (mm->owner == p) { 5154 + VM_BUG_ON(mc.from); 5155 + VM_BUG_ON(mc.to); 5156 + VM_BUG_ON(mc.precharge); 5157 + VM_BUG_ON(mc.moved_charge); 5158 + VM_BUG_ON(mc.moved_swap); 4872 5159 4873 - ret = 
mem_cgroup_precharge_mc(mm); 4874 - if (ret) 4875 - mem_cgroup_clear_mc(); 4876 - } 4877 - mmput(mm); 5160 + spin_lock(&mc.lock); 5161 + mc.from = from; 5162 + mc.to = memcg; 5163 + mc.flags = move_flags; 5164 + spin_unlock(&mc.lock); 5165 + /* We set mc.moving_task later */ 5166 + 5167 + ret = mem_cgroup_precharge_mc(mm); 5168 + if (ret) 5169 + mem_cgroup_clear_mc(); 4878 5170 } 5171 + mmput(mm); 4879 5172 return ret; 4880 5173 } 4881 5174 ··· 5228 5519 .legacy_cftypes = mem_cgroup_legacy_files, 5229 5520 .early_init = 0, 5230 5521 }; 5231 - 5232 - /** 5233 - * mem_cgroup_events - count memory events against a cgroup 5234 - * @memcg: the memory cgroup 5235 - * @idx: the event index 5236 - * @nr: the number of events to account for 5237 - */ 5238 - void mem_cgroup_events(struct mem_cgroup *memcg, 5239 - enum mem_cgroup_events_index idx, 5240 - unsigned int nr) 5241 - { 5242 - this_cpu_add(memcg->stat->events[idx], nr); 5243 - } 5244 5522 5245 5523 /** 5246 5524 * mem_cgroup_low - check if memory consumption is below the normal range
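Most of the memcontrol.c removals above relocate struct definitions and trivial accessors (for example `mem_cgroup_from_css()`, a `container_of()` wrapper) into a header so other files can use and inline them. The `container_of()` idiom itself, in a self-contained userspace sketch — the macro here is the simplified form without the kernel's type-checking, and `struct mem_cgroup_like` is an invented stand-in:

```c
#include <stddef.h>

/* Simplified container_of: recover the enclosing object from a
 * pointer to one of its embedded members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct css {
    int refcnt;                  /* stand-in for cgroup_subsys_state */
};

struct mem_cgroup_like {
    long usage;
    struct css css;              /* embedded member */
};

/* Mirrors mem_cgroup_from_css(): NULL passes through, otherwise step
 * back from the embedded css to the enclosing object. */
static struct mem_cgroup_like *from_css(struct css *s)
{
    return s ? container_of(s, struct mem_cgroup_like, css) : NULL;
}
```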
+68 -35
mm/memory-failure.c
··· 146 146 if (!mem) 147 147 return -EINVAL; 148 148 149 - css = mem_cgroup_css(mem); 149 + css = &mem->css; 150 150 ino = cgroup_ino(css->cgroup); 151 151 css_put(css); 152 152 ··· 934 934 } 935 935 EXPORT_SYMBOL_GPL(get_hwpoison_page); 936 936 937 + /** 938 + * put_hwpoison_page() - Put refcount for memory error handling: 939 + * @page: raw error page (hit by memory error) 940 + */ 941 + void put_hwpoison_page(struct page *page) 942 + { 943 + struct page *head = compound_head(page); 944 + 945 + if (PageHuge(head)) { 946 + put_page(head); 947 + return; 948 + } 949 + 950 + if (PageTransHuge(head)) 951 + if (page != head) 952 + put_page(head); 953 + 954 + put_page(page); 955 + } 956 + EXPORT_SYMBOL_GPL(put_hwpoison_page); 957 + 937 958 /* 938 959 * Do all that is necessary to remove user space mappings. Unmap 939 960 * the pages and send SIGBUS to the processes if the data was dirty. ··· 1121 1100 nr_pages = 1 << compound_order(hpage); 1122 1101 else /* normal page or thp */ 1123 1102 nr_pages = 1; 1124 - atomic_long_add(nr_pages, &num_poisoned_pages); 1103 + num_poisoned_pages_add(nr_pages); 1125 1104 1126 1105 /* 1127 1106 * We need/can do nothing about count=0 pages. 
··· 1149 1128 if (PageHWPoison(hpage)) { 1150 1129 if ((hwpoison_filter(p) && TestClearPageHWPoison(p)) 1151 1130 || (p != hpage && TestSetPageHWPoison(hpage))) { 1152 - atomic_long_sub(nr_pages, &num_poisoned_pages); 1131 + num_poisoned_pages_sub(nr_pages); 1153 1132 unlock_page(hpage); 1154 1133 return 0; 1155 1134 } ··· 1173 1152 else 1174 1153 pr_err("MCE: %#lx: thp split failed\n", pfn); 1175 1154 if (TestClearPageHWPoison(p)) 1176 - atomic_long_sub(nr_pages, &num_poisoned_pages); 1177 - put_page(p); 1178 - if (p != hpage) 1179 - put_page(hpage); 1155 + num_poisoned_pages_sub(nr_pages); 1156 + put_hwpoison_page(p); 1180 1157 return -EBUSY; 1181 1158 } 1182 1159 VM_BUG_ON_PAGE(!page_count(p), p); ··· 1233 1214 */ 1234 1215 if (!PageHWPoison(p)) { 1235 1216 printk(KERN_ERR "MCE %#lx: just unpoisoned\n", pfn); 1236 - atomic_long_sub(nr_pages, &num_poisoned_pages); 1217 + num_poisoned_pages_sub(nr_pages); 1237 1218 unlock_page(hpage); 1238 - put_page(hpage); 1219 + put_hwpoison_page(hpage); 1239 1220 return 0; 1240 1221 } 1241 1222 if (hwpoison_filter(p)) { 1242 1223 if (TestClearPageHWPoison(p)) 1243 - atomic_long_sub(nr_pages, &num_poisoned_pages); 1224 + num_poisoned_pages_sub(nr_pages); 1244 1225 unlock_page(hpage); 1245 - put_page(hpage); 1226 + put_hwpoison_page(hpage); 1246 1227 return 0; 1247 1228 } 1248 1229 ··· 1256 1237 if (PageHuge(p) && PageTail(p) && TestSetPageHWPoison(hpage)) { 1257 1238 action_result(pfn, MF_MSG_POISONED_HUGE, MF_IGNORED); 1258 1239 unlock_page(hpage); 1259 - put_page(hpage); 1240 + put_hwpoison_page(hpage); 1260 1241 return 0; 1261 1242 } 1262 1243 /* ··· 1445 1426 return 0; 1446 1427 } 1447 1428 1429 + if (page_count(page) > 1) { 1430 + pr_info("MCE: Someone grabs the hwpoison page %#lx\n", pfn); 1431 + return 0; 1432 + } 1433 + 1434 + if (page_mapped(page)) { 1435 + pr_info("MCE: Someone maps the hwpoison page %#lx\n", pfn); 1436 + return 0; 1437 + } 1438 + 1439 + if (page_mapping(page)) { 1440 + pr_info("MCE: the hwpoison page 
has non-NULL mapping %#lx\n", 1441 + pfn); 1442 + return 0; 1443 + } 1444 + 1448 1445 /* 1449 1446 * unpoison_memory() can encounter thp only when the thp is being 1450 1447 * worked by memory_failure() and the page lock is not held yet. ··· 1485 1450 return 0; 1486 1451 } 1487 1452 if (TestClearPageHWPoison(p)) 1488 - atomic_long_dec(&num_poisoned_pages); 1453 + num_poisoned_pages_dec(); 1489 1454 pr_info("MCE: Software-unpoisoned free page %#lx\n", pfn); 1490 1455 return 0; 1491 1456 } ··· 1499 1464 */ 1500 1465 if (TestClearPageHWPoison(page)) { 1501 1466 pr_info("MCE: Software-unpoisoned page %#lx\n", pfn); 1502 - atomic_long_sub(nr_pages, &num_poisoned_pages); 1467 + num_poisoned_pages_sub(nr_pages); 1503 1468 freeit = 1; 1504 1469 if (PageHuge(page)) 1505 1470 clear_page_hwpoison_huge_page(page); 1506 1471 } 1507 1472 unlock_page(page); 1508 1473 1509 - put_page(page); 1474 + put_hwpoison_page(page); 1510 1475 if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1)) 1511 - put_page(page); 1476 + put_hwpoison_page(page); 1512 1477 1513 1478 return 0; 1514 1479 } ··· 1521 1486 return alloc_huge_page_node(page_hstate(compound_head(p)), 1522 1487 nid); 1523 1488 else 1524 - return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0); 1489 + return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0); 1525 1490 } 1526 1491 1527 1492 /* ··· 1568 1533 /* 1569 1534 * Try to free it. 
1570 1535 */ 1571 - put_page(page); 1536 + put_hwpoison_page(page); 1572 1537 shake_page(page, 1); 1573 1538 1574 1539 /* ··· 1577 1542 ret = __get_any_page(page, pfn, 0); 1578 1543 if (!PageLRU(page)) { 1579 1544 /* Drop page reference which is from __get_any_page() */ 1580 - put_page(page); 1545 + put_hwpoison_page(page); 1581 1546 pr_info("soft_offline: %#lx: unknown non LRU page type %lx\n", 1582 1547 pfn, page->flags); 1583 1548 return -EIO; ··· 1600 1565 lock_page(hpage); 1601 1566 if (PageHWPoison(hpage)) { 1602 1567 unlock_page(hpage); 1603 - put_page(hpage); 1568 + put_hwpoison_page(hpage); 1604 1569 pr_info("soft offline: %#lx hugepage already poisoned\n", pfn); 1605 1570 return -EBUSY; 1606 1571 } ··· 1611 1576 * get_any_page() and isolate_huge_page() takes a refcount each, 1612 1577 * so need to drop one here. 1613 1578 */ 1614 - put_page(hpage); 1579 + put_hwpoison_page(hpage); 1615 1580 if (!ret) { 1616 1581 pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn); 1617 1582 return -EBUSY; ··· 1635 1600 if (PageHuge(page)) { 1636 1601 set_page_hwpoison_huge_page(hpage); 1637 1602 dequeue_hwpoisoned_huge_page(hpage); 1638 - atomic_long_add(1 << compound_order(hpage), 1639 - &num_poisoned_pages); 1603 + num_poisoned_pages_add(1 << compound_order(hpage)); 1640 1604 } else { 1641 1605 SetPageHWPoison(page); 1642 - atomic_long_inc(&num_poisoned_pages); 1606 + num_poisoned_pages_inc(); 1643 1607 } 1644 1608 } 1645 1609 return ret; ··· 1659 1625 wait_on_page_writeback(page); 1660 1626 if (PageHWPoison(page)) { 1661 1627 unlock_page(page); 1662 - put_page(page); 1628 + put_hwpoison_page(page); 1663 1629 pr_info("soft offline: %#lx page already poisoned\n", pfn); 1664 1630 return -EBUSY; 1665 1631 } ··· 1674 1640 * would need to fix isolation locking first. 
1675 1641 */ 1676 1642 if (ret == 1) { 1677 - put_page(page); 1643 + put_hwpoison_page(page); 1678 1644 pr_info("soft_offline: %#lx: invalidated\n", pfn); 1679 1645 SetPageHWPoison(page); 1680 - atomic_long_inc(&num_poisoned_pages); 1646 + num_poisoned_pages_inc(); 1681 1647 return 0; 1682 1648 } 1683 1649 ··· 1691 1657 * Drop page reference which is came from get_any_page() 1692 1658 * successful isolate_lru_page() already took another one. 1693 1659 */ 1694 - put_page(page); 1660 + put_hwpoison_page(page); 1695 1661 if (!ret) { 1696 1662 LIST_HEAD(pagelist); 1697 1663 inc_zone_page_state(page, NR_ISOLATED_ANON + 1698 1664 page_is_file_cache(page)); 1699 1665 list_add(&page->lru, &pagelist); 1700 - if (!TestSetPageHWPoison(page)) 1701 - atomic_long_inc(&num_poisoned_pages); 1702 1666 ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, 1703 1667 MIGRATE_SYNC, MR_MEMORY_FAILURE); 1704 1668 if (ret) { ··· 1711 1679 pfn, ret, page->flags); 1712 1680 if (ret > 0) 1713 1681 ret = -EIO; 1714 - if (TestClearPageHWPoison(page)) 1715 - atomic_long_dec(&num_poisoned_pages); 1716 1682 } 1717 1683 } else { 1718 1684 pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx\n", ··· 1749 1719 1750 1720 if (PageHWPoison(page)) { 1751 1721 pr_info("soft offline: %#lx page already poisoned\n", pfn); 1722 + if (flags & MF_COUNT_INCREASED) 1723 + put_hwpoison_page(page); 1752 1724 return -EBUSY; 1753 1725 } 1754 1726 if (!PageHuge(page) && PageTransHuge(hpage)) { 1755 1727 if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) { 1756 1728 pr_info("soft offline: %#lx: failed to split THP\n", 1757 1729 pfn); 1730 + if (flags & MF_COUNT_INCREASED) 1731 + put_hwpoison_page(page); 1758 1732 return -EBUSY; 1759 1733 } 1760 1734 } ··· 1776 1742 if (PageHuge(page)) { 1777 1743 set_page_hwpoison_huge_page(hpage); 1778 1744 if (!dequeue_hwpoisoned_huge_page(hpage)) 1779 - atomic_long_add(1 << compound_order(hpage), 1780 - &num_poisoned_pages); 1745 + 
num_poisoned_pages_add(1 << compound_order(hpage)); 1781 1746 } else { 1782 1747 if (!TestSetPageHWPoison(page)) 1783 - atomic_long_inc(&num_poisoned_pages); 1748 + num_poisoned_pages_inc(); 1784 1749 } 1785 1750 } 1786 1751 return ret;
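The `num_poisoned_pages_*` helpers replacing the open-coded `atomic_long_add`/`atomic_long_sub`/`atomic_long_inc` calls throughout memory-failure.c keep the counter itself private and give every adjustment a self-describing name. A hedged userspace approximation of the same pattern, using C11 atomics as a stand-in for the kernel's `atomic_long_t`:

```c
#include <stdatomic.h>

static atomic_long num_poisoned_pages;   /* stand-in for the kernel's counter */

/* Named accessors so call sites read as intent, not mechanism. */
static inline void num_poisoned_pages_add(long n)
{
    atomic_fetch_add(&num_poisoned_pages, n);
}

static inline void num_poisoned_pages_sub(long n)
{
    atomic_fetch_sub(&num_poisoned_pages, n);
}

static inline void num_poisoned_pages_inc(void)
{
    num_poisoned_pages_add(1);
}

static inline void num_poisoned_pages_dec(void)
{
    num_poisoned_pages_sub(1);
}
```

Centralizing the accesses also makes it easy to later change the counter's representation without touching every call site.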
+32 -16
mm/memory.c
··· 2426 2426 if (details.last_index < details.first_index) 2427 2427 details.last_index = ULONG_MAX; 2428 2428 2429 - 2430 - /* DAX uses i_mmap_lock to serialise file truncate vs page fault */ 2431 2429 i_mmap_lock_write(mapping); 2432 2430 if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) 2433 2431 unmap_mapping_range_tree(&mapping->i_mmap, &details); ··· 3013 3015 } else { 3014 3016 /* 3015 3017 * The fault handler has no page to lock, so it holds 3016 - * i_mmap_lock for read to protect against truncate. 3018 + * i_mmap_lock for write to protect against truncate. 3017 3019 */ 3018 - i_mmap_unlock_read(vma->vm_file->f_mapping); 3020 + i_mmap_unlock_write(vma->vm_file->f_mapping); 3019 3021 } 3020 3022 goto uncharge_out; 3021 3023 } ··· 3029 3031 } else { 3030 3032 /* 3031 3033 * The fault handler has no page to lock, so it holds 3032 - * i_mmap_lock for read to protect against truncate. 3034 + * i_mmap_lock for write to protect against truncate. 3033 3035 */ 3034 - i_mmap_unlock_read(vma->vm_file->f_mapping); 3036 + i_mmap_unlock_write(vma->vm_file->f_mapping); 3035 3037 } 3036 3038 return ret; 3037 3039 uncharge_out: ··· 3230 3232 return 0; 3231 3233 } 3232 3234 3235 + static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma, 3236 + unsigned long address, pmd_t *pmd, unsigned int flags) 3237 + { 3238 + if (!vma->vm_ops) 3239 + return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags); 3240 + if (vma->vm_ops->pmd_fault) 3241 + return vma->vm_ops->pmd_fault(vma, address, pmd, flags); 3242 + return VM_FAULT_FALLBACK; 3243 + } 3244 + 3245 + static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma, 3246 + unsigned long address, pmd_t *pmd, pmd_t orig_pmd, 3247 + unsigned int flags) 3248 + { 3249 + if (!vma->vm_ops) 3250 + return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); 3251 + if (vma->vm_ops->pmd_fault) 3252 + return vma->vm_ops->pmd_fault(vma, address, pmd, flags); 3253 + return VM_FAULT_FALLBACK; 3254 + } 3255 + 
3233 3256 /* 3234 3257 * These routines also need to handle stuff like marking pages dirty 3235 3258 * and/or accessed for architectures that don't do it in hardware (most ··· 3286 3267 barrier(); 3287 3268 if (!pte_present(entry)) { 3288 3269 if (pte_none(entry)) { 3289 - if (vma->vm_ops) 3270 + if (vma_is_anonymous(vma)) 3271 + return do_anonymous_page(mm, vma, address, 3272 + pte, pmd, flags); 3273 + else 3290 3274 return do_fault(mm, vma, address, pte, pmd, 3291 3275 flags, entry); 3292 - 3293 - return do_anonymous_page(mm, vma, address, pte, pmd, 3294 - flags); 3295 3276 } 3296 3277 return do_swap_page(mm, vma, address, 3297 3278 pte, pmd, flags, entry); ··· 3353 3334 if (!pmd) 3354 3335 return VM_FAULT_OOM; 3355 3336 if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { 3356 - int ret = VM_FAULT_FALLBACK; 3357 - if (!vma->vm_ops) 3358 - ret = do_huge_pmd_anonymous_page(mm, vma, address, 3359 - pmd, flags); 3337 + int ret = create_huge_pmd(mm, vma, address, pmd, flags); 3360 3338 if (!(ret & VM_FAULT_FALLBACK)) 3361 3339 return ret; 3362 3340 } else { ··· 3377 3361 orig_pmd, pmd); 3378 3362 3379 3363 if (dirty && !pmd_write(orig_pmd)) { 3380 - ret = do_huge_pmd_wp_page(mm, vma, address, pmd, 3381 - orig_pmd); 3364 + ret = wp_huge_pmd(mm, vma, address, pmd, 3365 + orig_pmd, flags); 3382 3366 if (!(ret & VM_FAULT_FALLBACK)) 3383 3367 return ret; 3384 3368 } else {
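The new `create_huge_pmd()`/`wp_huge_pmd()` helpers factor a three-way dispatch out of the fault path: anonymous VMAs get the anonymous huge-page handler, VMAs whose ops table provides `pmd_fault` get that hook, and everything else falls back to small pages. The dispatch shape in isolation — all the types and names below are invented stand-ins, not the kernel's:

```c
#include <stddef.h>

#define FAULT_FALLBACK (-1)

struct vma;

struct vma_ops {
    int (*pmd_fault)(struct vma *vma);   /* optional huge-fault hook */
};

struct vma {
    const struct vma_ops *ops;           /* NULL for anonymous mappings */
};

static int anon_huge_fault(struct vma *v)
{
    (void)v;
    return 0;                            /* "handled" */
}

/* Same three-way dispatch as create_huge_pmd(): anonymous VMAs get
 * the anonymous handler, a driver-supplied hook wins if present, and
 * everything else falls back to small pages. */
static int create_huge(struct vma *v)
{
    if (!v->ops)
        return anon_huge_fault(v);
    if (v->ops->pmd_fault)
        return v->ops->pmd_fault(v);
    return FAULT_FALLBACK;
}

static int fake_driver_fault(struct vma *v)  /* hypothetical pmd_fault hook */
{
    (void)v;
    return 7;
}

static const struct vma_ops driver_ops = { .pmd_fault = fake_driver_fault };
static const struct vma_ops plain_ops  = { .pmd_fault = NULL };
```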
+2 -5
mm/mempolicy.c
··· 608 608 609 609 qp->prev = vma; 610 610 611 - if (vma->vm_flags & VM_PFNMAP) 612 - return 1; 613 - 614 611 if (flags & MPOL_MF_LAZY) { 615 612 /* Similar to task_numa_work, skip inaccessible VMAs */ 616 613 if (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)) ··· 942 945 return alloc_huge_page_node(page_hstate(compound_head(page)), 943 946 node); 944 947 else 945 - return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE | 948 + return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE | 946 949 __GFP_THISNODE, 0); 947 950 } 948 951 ··· 1998 2001 nmask = policy_nodemask(gfp, pol); 1999 2002 if (!nmask || node_isset(hpage_node, *nmask)) { 2000 2003 mpol_cond_put(pol); 2001 - page = alloc_pages_exact_node(hpage_node, 2004 + page = __alloc_pages_node(hpage_node, 2002 2005 gfp | __GFP_THISNODE, order); 2003 2006 goto out; 2004 2007 }
+3
mm/mempool.c
··· 150 150 */ 151 151 void mempool_destroy(mempool_t *pool) 152 152 { 153 + if (unlikely(!pool)) 154 + return; 155 + 153 156 while (pool->curr_nr) { 154 157 void *element = remove_element(pool); 155 158 pool->free(element, pool->pool_data);
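The three-line guard added to `mempool_destroy()` makes it a no-op on NULL, matching `kfree()` semantics so error-unwind paths can call it unconditionally instead of wrapping it in their own NULL checks. The same idiom in a standalone sketch (the pool layout and the `elements_freed` bookkeeping are invented for illustration):

```c
#include <stdlib.h>

struct pool {
    size_t curr_nr;
    void *elements[8];
};

static size_t elements_freed;    /* bookkeeping for the example only */

/* Destroy a pool, silently accepting NULL the way kfree() does, so
 * callers on error paths need no guard of their own. */
static void pool_destroy(struct pool *pool)
{
    if (!pool)                   /* the guard this patch adds */
        return;

    while (pool->curr_nr) {
        free(pool->elements[--pool->curr_nr]);
        elements_freed++;
    }
    free(pool);
}
```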
+10 -17
mm/memtest.c
··· 1 1 #include <linux/kernel.h> 2 - #include <linux/errno.h> 3 - #include <linux/string.h> 4 2 #include <linux/types.h> 5 - #include <linux/mm.h> 6 - #include <linux/smp.h> 7 3 #include <linux/init.h> 8 - #include <linux/pfn.h> 9 4 #include <linux/memblock.h> 10 5 11 6 static u64 patterns[] __initdata = { ··· 26 31 27 32 static void __init reserve_bad_mem(u64 pattern, phys_addr_t start_bad, phys_addr_t end_bad) 28 33 { 29 - printk(KERN_INFO " %016llx bad mem addr %010llx - %010llx reserved\n", 30 - (unsigned long long) pattern, 31 - (unsigned long long) start_bad, 32 - (unsigned long long) end_bad); 34 + pr_info(" %016llx bad mem addr %pa - %pa reserved\n", 35 + cpu_to_be64(pattern), &start_bad, &end_bad); 33 36 memblock_reserve(start_bad, end_bad - start_bad); 34 37 } 35 38 ··· 72 79 this_start = clamp(this_start, start, end); 73 80 this_end = clamp(this_end, start, end); 74 81 if (this_start < this_end) { 75 - printk(KERN_INFO " %010llx - %010llx pattern %016llx\n", 76 - (unsigned long long)this_start, 77 - (unsigned long long)this_end, 78 - (unsigned long long)cpu_to_be64(pattern)); 82 + pr_info(" %pa - %pa pattern %016llx\n", 83 + &this_start, &this_end, cpu_to_be64(pattern)); 79 84 memtest(pattern, this_start, this_end - this_start); 80 85 } 81 86 } 82 87 } 83 88 84 89 /* default is disabled */ 85 - static int memtest_pattern __initdata; 90 + static unsigned int memtest_pattern __initdata; 86 91 87 92 static int __init parse_memtest(char *arg) 88 93 { 94 + int ret = 0; 95 + 89 96 if (arg) 90 - memtest_pattern = simple_strtoul(arg, NULL, 0); 97 + ret = kstrtouint(arg, 0, &memtest_pattern); 91 98 else 92 99 memtest_pattern = ARRAY_SIZE(patterns); 93 100 94 - return 0; 101 + return ret; 95 102 } 96 103 97 104 early_param("memtest", parse_memtest); ··· 104 111 if (!memtest_pattern) 105 112 return; 106 113 107 - printk(KERN_INFO "early_memtest: # of tests: %d\n", memtest_pattern); 114 + pr_info("early_memtest: # of tests: %u\n", memtest_pattern); 108 115 for (i = 
memtest_pattern-1; i < UINT_MAX; --i) { 109 116 idx = i % ARRAY_SIZE(patterns); 110 117 do_one_pass(patterns[idx], start, end);
+7 -6
mm/migrate.c
··· 880 880 /* Establish migration ptes or remove ptes */ 881 881 if (page_mapped(page)) { 882 882 try_to_unmap(page, 883 - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS| 884 - TTU_IGNORE_HWPOISON); 883 + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); 885 884 page_was_mapped = 1; 886 885 } 887 886 ··· 951 952 dec_zone_page_state(page, NR_ISOLATED_ANON + 952 953 page_is_file_cache(page)); 953 954 /* Soft-offlined page shouldn't go through lru cache list */ 954 - if (reason == MR_MEMORY_FAILURE) 955 + if (reason == MR_MEMORY_FAILURE) { 955 956 put_page(page); 956 - else 957 + if (!test_set_page_hwpoison(page)) 958 + num_poisoned_pages_inc(); 959 + } else 957 960 putback_lru_page(page); 958 961 } 959 962 ··· 1195 1194 return alloc_huge_page_node(page_hstate(compound_head(p)), 1196 1195 pm->node); 1197 1196 else 1198 - return alloc_pages_exact_node(pm->node, 1197 + return __alloc_pages_node(pm->node, 1199 1198 GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0); 1200 1199 } 1201 1200 ··· 1555 1554 int nid = (int) data; 1556 1555 struct page *newpage; 1557 1556 1558 - newpage = alloc_pages_exact_node(nid, 1557 + newpage = __alloc_pages_node(nid, 1559 1558 (GFP_HIGHUSER_MOVABLE | 1560 1559 __GFP_THISNODE | __GFP_NOMEMALLOC | 1561 1560 __GFP_NORETRY | __GFP_NOWARN) &
+32 -39
mm/mmap.c
··· 2455 2455 unsigned long addr, int new_below) 2456 2456 { 2457 2457 struct vm_area_struct *new; 2458 - int err = -ENOMEM; 2458 + int err; 2459 2459 2460 2460 if (is_vm_hugetlb_page(vma) && (addr & 2461 2461 ~(huge_page_mask(hstate_vma(vma))))) ··· 2463 2463 2464 2464 new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); 2465 2465 if (!new) 2466 - goto out_err; 2466 + return -ENOMEM; 2467 2467 2468 2468 /* most fields are the same, copy all, and then fixup */ 2469 2469 *new = *vma; ··· 2511 2511 mpol_put(vma_policy(new)); 2512 2512 out_free_vma: 2513 2513 kmem_cache_free(vm_area_cachep, new); 2514 - out_err: 2515 2514 return err; 2516 2515 } 2517 2516 ··· 2871 2872 struct vm_area_struct *prev; 2872 2873 struct rb_node **rb_link, *rb_parent; 2873 2874 2875 + if (find_vma_links(mm, vma->vm_start, vma->vm_end, 2876 + &prev, &rb_link, &rb_parent)) 2877 + return -ENOMEM; 2878 + if ((vma->vm_flags & VM_ACCOUNT) && 2879 + security_vm_enough_memory_mm(mm, vma_pages(vma))) 2880 + return -ENOMEM; 2881 + 2874 2882 /* 2875 2883 * The vm_pgoff of a purely anonymous vma should be irrelevant 2876 2884 * until its first write fault, when page's anon_vma and index ··· 2890 2884 * using the existing file pgoff checks and manipulations. 2891 2885 * Similarly in do_mmap_pgoff and in do_brk. 2892 2886 */ 2893 - if (!vma->vm_file) { 2887 + if (vma_is_anonymous(vma)) { 2894 2888 BUG_ON(vma->anon_vma); 2895 2889 vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; 2896 2890 } 2897 - if (find_vma_links(mm, vma->vm_start, vma->vm_end, 2898 - &prev, &rb_link, &rb_parent)) 2899 - return -ENOMEM; 2900 - if ((vma->vm_flags & VM_ACCOUNT) && 2901 - security_vm_enough_memory_mm(mm, vma_pages(vma))) 2902 - return -ENOMEM; 2903 2891 2904 2892 vma_link(mm, vma, prev, rb_link, rb_parent); 2905 2893 return 0; ··· 2918 2918 * If anonymous vma has not yet been faulted, update new pgoff 2919 2919 * to match new location, to increase its chance of merging. 
2920 2920 */ 2921 - if (unlikely(!vma->vm_file && !vma->anon_vma)) { 2921 + if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) { 2922 2922 pgoff = addr >> PAGE_SHIFT; 2923 2923 faulted_in_anon_vma = false; 2924 2924 } ··· 2952 2952 *need_rmap_locks = (new_vma->vm_pgoff <= vma->vm_pgoff); 2953 2953 } else { 2954 2954 new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); 2955 - if (new_vma) { 2956 - *new_vma = *vma; 2957 - new_vma->vm_start = addr; 2958 - new_vma->vm_end = addr + len; 2959 - new_vma->vm_pgoff = pgoff; 2960 - if (vma_dup_policy(vma, new_vma)) 2961 - goto out_free_vma; 2962 - INIT_LIST_HEAD(&new_vma->anon_vma_chain); 2963 - if (anon_vma_clone(new_vma, vma)) 2964 - goto out_free_mempol; 2965 - if (new_vma->vm_file) 2966 - get_file(new_vma->vm_file); 2967 - if (new_vma->vm_ops && new_vma->vm_ops->open) 2968 - new_vma->vm_ops->open(new_vma); 2969 - vma_link(mm, new_vma, prev, rb_link, rb_parent); 2970 - *need_rmap_locks = false; 2971 - } 2955 + if (!new_vma) 2956 + goto out; 2957 + *new_vma = *vma; 2958 + new_vma->vm_start = addr; 2959 + new_vma->vm_end = addr + len; 2960 + new_vma->vm_pgoff = pgoff; 2961 + if (vma_dup_policy(vma, new_vma)) 2962 + goto out_free_vma; 2963 + INIT_LIST_HEAD(&new_vma->anon_vma_chain); 2964 + if (anon_vma_clone(new_vma, vma)) 2965 + goto out_free_mempol; 2966 + if (new_vma->vm_file) 2967 + get_file(new_vma->vm_file); 2968 + if (new_vma->vm_ops && new_vma->vm_ops->open) 2969 + new_vma->vm_ops->open(new_vma); 2970 + vma_link(mm, new_vma, prev, rb_link, rb_parent); 2971 + *need_rmap_locks = false; 2972 2972 } 2973 2973 return new_vma; 2974 2974 2975 - out_free_mempol: 2975 + out_free_mempol: 2976 2976 mpol_put(vma_policy(new_vma)); 2977 - out_free_vma: 2977 + out_free_vma: 2978 2978 kmem_cache_free(vm_area_cachep, new_vma); 2979 + out: 2979 2980 return NULL; 2980 2981 } 2981 2982 ··· 3028 3027 pgoff_t pgoff; 3029 3028 struct page **pages; 3030 3029 3031 - /* 3032 - * special mappings have no vm_file, and in that case, the 
mm 3033 - * uses vm_pgoff internally. So we have to subtract it from here. 3034 - * We are allowed to do this because we are the mm; do not copy 3035 - * this code into drivers! 3036 - */ 3037 - pgoff = vmf->pgoff - vma->vm_pgoff; 3038 - 3039 3030 if (vma->vm_ops == &legacy_special_mapping_vmops) 3040 3031 pages = vma->vm_private_data; 3041 3032 else 3042 3033 pages = ((struct vm_special_mapping *)vma->vm_private_data)-> 3043 3034 pages; 3044 3035 3045 - for (; pgoff && *pages; ++pages) 3036 + for (pgoff = vmf->pgoff; pgoff && *pages; ++pages) 3046 3037 pgoff--; 3047 3038 3048 3039 if (*pages) {
+66 -76
mm/oom_kill.c
··· 196 196 * Determine the type of allocation constraint. 197 197 */ 198 198 #ifdef CONFIG_NUMA 199 - static enum oom_constraint constrained_alloc(struct zonelist *zonelist, 200 - gfp_t gfp_mask, nodemask_t *nodemask, 201 - unsigned long *totalpages) 199 + static enum oom_constraint constrained_alloc(struct oom_control *oc, 200 + unsigned long *totalpages) 202 201 { 203 202 struct zone *zone; 204 203 struct zoneref *z; 205 - enum zone_type high_zoneidx = gfp_zone(gfp_mask); 204 + enum zone_type high_zoneidx = gfp_zone(oc->gfp_mask); 206 205 bool cpuset_limited = false; 207 206 int nid; 208 207 209 208 /* Default to all available memory */ 210 209 *totalpages = totalram_pages + total_swap_pages; 211 210 212 - if (!zonelist) 211 + if (!oc->zonelist) 213 212 return CONSTRAINT_NONE; 214 213 /* 215 214 * Reach here only when __GFP_NOFAIL is used. So, we should avoid 216 215 * to kill current.We have to random task kill in this case. 217 216 * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now. 218 217 */ 219 - if (gfp_mask & __GFP_THISNODE) 218 + if (oc->gfp_mask & __GFP_THISNODE) 220 219 return CONSTRAINT_NONE; 221 220 222 221 /* ··· 223 224 * the page allocator means a mempolicy is in effect. Cpuset policy 224 225 * is enforced in get_page_from_freelist(). 
225 226 */ 226 - if (nodemask && !nodes_subset(node_states[N_MEMORY], *nodemask)) { 227 + if (oc->nodemask && 228 + !nodes_subset(node_states[N_MEMORY], *oc->nodemask)) { 227 229 *totalpages = total_swap_pages; 228 - for_each_node_mask(nid, *nodemask) 230 + for_each_node_mask(nid, *oc->nodemask) 229 231 *totalpages += node_spanned_pages(nid); 230 232 return CONSTRAINT_MEMORY_POLICY; 231 233 } 232 234 233 235 /* Check this allocation failure is caused by cpuset's wall function */ 234 - for_each_zone_zonelist_nodemask(zone, z, zonelist, 235 - high_zoneidx, nodemask) 236 - if (!cpuset_zone_allowed(zone, gfp_mask)) 236 + for_each_zone_zonelist_nodemask(zone, z, oc->zonelist, 237 + high_zoneidx, oc->nodemask) 238 + if (!cpuset_zone_allowed(zone, oc->gfp_mask)) 237 239 cpuset_limited = true; 238 240 239 241 if (cpuset_limited) { ··· 246 246 return CONSTRAINT_NONE; 247 247 } 248 248 #else 249 - static enum oom_constraint constrained_alloc(struct zonelist *zonelist, 250 - gfp_t gfp_mask, nodemask_t *nodemask, 251 - unsigned long *totalpages) 249 + static enum oom_constraint constrained_alloc(struct oom_control *oc, 250 + unsigned long *totalpages) 252 251 { 253 252 *totalpages = totalram_pages + total_swap_pages; 254 253 return CONSTRAINT_NONE; 255 254 } 256 255 #endif 257 256 258 - enum oom_scan_t oom_scan_process_thread(struct task_struct *task, 259 - unsigned long totalpages, const nodemask_t *nodemask, 260 - bool force_kill) 257 + enum oom_scan_t oom_scan_process_thread(struct oom_control *oc, 258 + struct task_struct *task, unsigned long totalpages) 261 259 { 262 - if (oom_unkillable_task(task, NULL, nodemask)) 260 + if (oom_unkillable_task(task, NULL, oc->nodemask)) 263 261 return OOM_SCAN_CONTINUE; 264 262 265 263 /* ··· 265 267 * Don't allow any other task to have access to the reserves. 
266 268 */ 267 269 if (test_tsk_thread_flag(task, TIF_MEMDIE)) { 268 - if (!force_kill) 270 + if (oc->order != -1) 269 271 return OOM_SCAN_ABORT; 270 272 } 271 273 if (!task->mm) ··· 278 280 if (oom_task_origin(task)) 279 281 return OOM_SCAN_SELECT; 280 282 281 - if (task_will_free_mem(task) && !force_kill) 283 + if (task_will_free_mem(task) && oc->order != -1) 282 284 return OOM_SCAN_ABORT; 283 285 284 286 return OOM_SCAN_OK; ··· 287 289 /* 288 290 * Simple selection loop. We chose the process with the highest 289 291 * number of 'points'. Returns -1 on scan abort. 290 - * 291 - * (not docbooked, we don't want this one cluttering up the manual) 292 292 */ 293 - static struct task_struct *select_bad_process(unsigned int *ppoints, 294 - unsigned long totalpages, const nodemask_t *nodemask, 295 - bool force_kill) 293 + static struct task_struct *select_bad_process(struct oom_control *oc, 294 + unsigned int *ppoints, unsigned long totalpages) 296 295 { 297 296 struct task_struct *g, *p; 298 297 struct task_struct *chosen = NULL; ··· 299 304 for_each_process_thread(g, p) { 300 305 unsigned int points; 301 306 302 - switch (oom_scan_process_thread(p, totalpages, nodemask, 303 - force_kill)) { 307 + switch (oom_scan_process_thread(oc, p, totalpages)) { 304 308 case OOM_SCAN_SELECT: 305 309 chosen = p; 306 310 chosen_points = ULONG_MAX; ··· 312 318 case OOM_SCAN_OK: 313 319 break; 314 320 }; 315 - points = oom_badness(p, NULL, nodemask, totalpages); 321 + points = oom_badness(p, NULL, oc->nodemask, totalpages); 316 322 if (!points || points < chosen_points) 317 323 continue; 318 324 /* Prefer thread group leaders for display purposes */ ··· 374 380 rcu_read_unlock(); 375 381 } 376 382 377 - static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, 378 - struct mem_cgroup *memcg, const nodemask_t *nodemask) 383 + static void dump_header(struct oom_control *oc, struct task_struct *p, 384 + struct mem_cgroup *memcg) 379 385 { 380 386 task_lock(current); 381 
387 pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, " 382 388 "oom_score_adj=%hd\n", 383 - current->comm, gfp_mask, order, 389 + current->comm, oc->gfp_mask, oc->order, 384 390 current->signal->oom_score_adj); 385 391 cpuset_print_task_mems_allowed(current); 386 392 task_unlock(current); ··· 390 396 else 391 397 show_mem(SHOW_MEM_FILTER_NODES); 392 398 if (sysctl_oom_dump_tasks) 393 - dump_tasks(memcg, nodemask); 399 + dump_tasks(memcg, oc->nodemask); 394 400 } 395 401 396 402 /* ··· 481 487 * Must be called while holding a reference to p, which will be released upon 482 488 * returning. 483 489 */ 484 - void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, 490 + void oom_kill_process(struct oom_control *oc, struct task_struct *p, 485 491 unsigned int points, unsigned long totalpages, 486 - struct mem_cgroup *memcg, nodemask_t *nodemask, 487 - const char *message) 492 + struct mem_cgroup *memcg, const char *message) 488 493 { 489 494 struct task_struct *victim = p; 490 495 struct task_struct *child; ··· 507 514 task_unlock(p); 508 515 509 516 if (__ratelimit(&oom_rs)) 510 - dump_header(p, gfp_mask, order, memcg, nodemask); 517 + dump_header(oc, p, memcg); 511 518 512 519 task_lock(p); 513 520 pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ··· 530 537 /* 531 538 * oom_badness() returns 0 if the thread is unkillable 532 539 */ 533 - child_points = oom_badness(child, memcg, nodemask, 540 + child_points = oom_badness(child, memcg, oc->nodemask, 534 541 totalpages); 535 542 if (child_points > victim_points) { 536 543 put_task_struct(victim); ··· 593 600 /* 594 601 * Determines whether the kernel must panic because of the panic_on_oom sysctl. 
595 602 */ 596 - void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 597 - int order, const nodemask_t *nodemask, 603 + void check_panic_on_oom(struct oom_control *oc, enum oom_constraint constraint, 598 604 struct mem_cgroup *memcg) 599 605 { 600 606 if (likely(!sysctl_panic_on_oom)) ··· 607 615 if (constraint != CONSTRAINT_NONE) 608 616 return; 609 617 } 610 - dump_header(NULL, gfp_mask, order, memcg, nodemask); 618 + /* Do not panic for oom kills triggered by sysrq */ 619 + if (oc->order == -1) 620 + return; 621 + dump_header(oc, NULL, memcg); 611 622 panic("Out of memory: %s panic_on_oom is enabled\n", 612 623 sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); 613 624 } ··· 630 635 EXPORT_SYMBOL_GPL(unregister_oom_notifier); 631 636 632 637 /** 633 - * __out_of_memory - kill the "best" process when we run out of memory 634 - * @zonelist: zonelist pointer 635 - * @gfp_mask: memory allocation flags 636 - * @order: amount of memory being requested as a power of 2 637 - * @nodemask: nodemask passed to page allocator 638 - * @force_kill: true if a task must be killed, even if others are exiting 638 + * out_of_memory - kill the "best" process when we run out of memory 639 + * @oc: pointer to struct oom_control 639 640 * 640 641 * If we run out of memory, we have the choice between either 641 642 * killing a random task (bad), letting the system crash (worse) 642 643 * OR try to be smart about which process to kill. Note that we 643 644 * don't have to be perfect here, we just have to be good. 
644 645 */ 645 - bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, 646 - int order, nodemask_t *nodemask, bool force_kill) 646 + bool out_of_memory(struct oom_control *oc) 647 647 { 648 - const nodemask_t *mpol_mask; 649 648 struct task_struct *p; 650 649 unsigned long totalpages; 651 650 unsigned long freed = 0; 652 651 unsigned int uninitialized_var(points); 653 652 enum oom_constraint constraint = CONSTRAINT_NONE; 654 - int killed = 0; 655 653 656 654 if (oom_killer_disabled) 657 655 return false; ··· 652 664 blocking_notifier_call_chain(&oom_notify_list, 0, &freed); 653 665 if (freed > 0) 654 666 /* Got some memory back in the last second. */ 655 - goto out; 667 + return true; 656 668 657 669 /* 658 670 * If current has a pending SIGKILL or is exiting, then automatically ··· 665 677 if (current->mm && 666 678 (fatal_signal_pending(current) || task_will_free_mem(current))) { 667 679 mark_oom_victim(current); 668 - goto out; 680 + return true; 669 681 } 670 682 671 683 /* 672 684 * Check if there were limitations on the allocation (only relevant for 673 685 * NUMA) that may require different handling. 674 686 */ 675 - constraint = constrained_alloc(zonelist, gfp_mask, nodemask, 676 - &totalpages); 677 - mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? 
nodemask : NULL; 678 - check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL); 687 + constraint = constrained_alloc(oc, &totalpages); 688 + if (constraint != CONSTRAINT_MEMORY_POLICY) 689 + oc->nodemask = NULL; 690 + check_panic_on_oom(oc, constraint, NULL); 679 691 680 692 if (sysctl_oom_kill_allocating_task && current->mm && 681 - !oom_unkillable_task(current, NULL, nodemask) && 693 + !oom_unkillable_task(current, NULL, oc->nodemask) && 682 694 current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { 683 695 get_task_struct(current); 684 - oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL, 685 - nodemask, 696 + oom_kill_process(oc, current, 0, totalpages, NULL, 686 697 "Out of memory (oom_kill_allocating_task)"); 687 - goto out; 698 + return true; 688 699 } 689 700 690 - p = select_bad_process(&points, totalpages, mpol_mask, force_kill); 701 + p = select_bad_process(oc, &points, totalpages); 691 702 /* Found nothing?!?! Either we hang forever, or we panic. */ 692 - if (!p) { 693 - dump_header(NULL, gfp_mask, order, NULL, mpol_mask); 703 + if (!p && oc->order != -1) { 704 + dump_header(oc, NULL, NULL); 694 705 panic("Out of memory and no killable processes...\n"); 695 706 } 696 - if (p != (void *)-1UL) { 697 - oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, 698 - nodemask, "Out of memory"); 699 - killed = 1; 700 - } 701 - out: 702 - /* 703 - * Give the killed threads a good chance of exiting before trying to 704 - * allocate memory again. 705 - */ 706 - if (killed) 707 + if (p && p != (void *)-1UL) { 708 + oom_kill_process(oc, p, points, totalpages, NULL, 709 + "Out of memory"); 710 + /* 711 + * Give the killed process a good chance to exit before trying 712 + * to allocate memory again. 
713 + */ 707 714 schedule_timeout_killable(1); 708 - 715 + } 709 716 return true; 710 717 } 711 718 ··· 711 728 */ 712 729 void pagefault_out_of_memory(void) 713 730 { 731 + struct oom_control oc = { 732 + .zonelist = NULL, 733 + .nodemask = NULL, 734 + .gfp_mask = 0, 735 + .order = 0, 736 + }; 737 + 714 738 if (mem_cgroup_oom_synchronize(true)) 715 739 return; 716 740 717 741 if (!mutex_trylock(&oom_lock)) 718 742 return; 719 743 720 - if (!out_of_memory(NULL, 0, 0, NULL, false)) { 744 + if (!out_of_memory(&oc)) { 721 745 /* 722 746 * There shouldn't be any user tasks runnable while the 723 747 * OOM killer is disabled, so the current task has to
+50 -30
mm/page_alloc.c
··· 125 125 int percpu_pagelist_fraction; 126 126 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; 127 127 128 + /* 129 + * A cached value of the page's pageblock's migratetype, used when the page is 130 + * put on a pcplist. Used to avoid the pageblock migratetype lookup when 131 + * freeing from pcplists in most cases, at the cost of possibly becoming stale. 132 + * Also the migratetype set in the page does not necessarily match the pcplist 133 + * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any 134 + * other index - this ensures that it will be put on the correct CMA freelist. 135 + */ 136 + static inline int get_pcppage_migratetype(struct page *page) 137 + { 138 + return page->index; 139 + } 140 + 141 + static inline void set_pcppage_migratetype(struct page *page, int migratetype) 142 + { 143 + page->index = migratetype; 144 + } 145 + 128 146 #ifdef CONFIG_PM_SLEEP 129 147 /* 130 148 * The following functions are used by the suspend/hibernate code to temporarily ··· 809 791 page = list_entry(list->prev, struct page, lru); 810 792 /* must delete as __free_one_page list manipulates */ 811 793 list_del(&page->lru); 812 - mt = get_freepage_migratetype(page); 794 + 795 + mt = get_pcppage_migratetype(page); 796 + /* MIGRATE_ISOLATE page should not go to pcplists */ 797 + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page); 798 + /* Pageblock could have been isolated meanwhile */ 813 799 if (unlikely(has_isolate_pageblock(zone))) 814 800 mt = get_pageblock_migratetype(page); 815 801 ··· 977 955 migratetype = get_pfnblock_migratetype(page, pfn); 978 956 local_irq_save(flags); 979 957 __count_vm_events(PGFREE, 1 << order); 980 - set_freepage_migratetype(page, migratetype); 981 958 free_one_page(page_zone(page), page, pfn, order, migratetype); 982 959 local_irq_restore(flags); 983 960 } ··· 1404 1383 rmv_page_order(page); 1405 1384 area->nr_free--; 1406 1385 expand(zone, page, order, current_order, area, migratetype); 1407 - 
set_freepage_migratetype(page, migratetype); 1386 + set_pcppage_migratetype(page, migratetype); 1408 1387 return page; 1409 1388 } 1410 1389 ··· 1481 1460 order = page_order(page); 1482 1461 list_move(&page->lru, 1483 1462 &zone->free_area[order].free_list[migratetype]); 1484 - set_freepage_migratetype(page, migratetype); 1485 1463 page += 1 << order; 1486 1464 pages_moved += 1 << order; 1487 1465 } ··· 1650 1630 expand(zone, page, order, current_order, area, 1651 1631 start_migratetype); 1652 1632 /* 1653 - * The freepage_migratetype may differ from pageblock's 1633 + * The pcppage_migratetype may differ from pageblock's 1654 1634 * migratetype depending on the decisions in 1655 - * try_to_steal_freepages(). This is OK as long as it 1656 - * does not differ for MIGRATE_CMA pageblocks. For CMA 1657 - * we need to make sure unallocated pages flushed from 1658 - * pcp lists are returned to the correct freelist. 1635 + * find_suitable_fallback(). This is OK as long as it does not 1636 + * differ for MIGRATE_CMA pageblocks. 
Those can be used as 1637 + * fallback only via special __rmqueue_cma_fallback() function 1659 1638 */ 1660 - set_freepage_migratetype(page, start_migratetype); 1639 + set_pcppage_migratetype(page, start_migratetype); 1661 1640 1662 1641 trace_mm_page_alloc_extfrag(page, order, current_order, 1663 1642 start_migratetype, fallback_mt); ··· 1732 1713 else 1733 1714 list_add_tail(&page->lru, list); 1734 1715 list = &page->lru; 1735 - if (is_migrate_cma(get_freepage_migratetype(page))) 1716 + if (is_migrate_cma(get_pcppage_migratetype(page))) 1736 1717 __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1737 1718 -(1 << order)); 1738 1719 } ··· 1929 1910 return; 1930 1911 1931 1912 migratetype = get_pfnblock_migratetype(page, pfn); 1932 - set_freepage_migratetype(page, migratetype); 1913 + set_pcppage_migratetype(page, migratetype); 1933 1914 local_irq_save(flags); 1934 1915 __count_vm_event(PGFREE); 1935 1916 ··· 2134 2115 if (!page) 2135 2116 goto failed; 2136 2117 __mod_zone_freepage_state(zone, -(1 << order), 2137 - get_freepage_migratetype(page)); 2118 + get_pcppage_migratetype(page)); 2138 2119 } 2139 2120 2140 2121 __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); ··· 2715 2696 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, 2716 2697 const struct alloc_context *ac, unsigned long *did_some_progress) 2717 2698 { 2699 + struct oom_control oc = { 2700 + .zonelist = ac->zonelist, 2701 + .nodemask = ac->nodemask, 2702 + .gfp_mask = gfp_mask, 2703 + .order = order, 2704 + }; 2718 2705 struct page *page; 2719 2706 2720 2707 *did_some_progress = 0; ··· 2772 2747 goto out; 2773 2748 } 2774 2749 /* Exhausted what can be done so it's blamo time */ 2775 - if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false) 2776 - || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) 2750 + if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) 2777 2751 *did_some_progress = 1; 2778 2752 out: 2779 2753 mutex_unlock(&oom_lock); ··· 3514 3490 * 3515 3491 * Like 
alloc_pages_exact(), but try to allocate on node nid first before falling 3516 3492 * back. 3517 - * Note this is not alloc_pages_exact_node() which allocates on a specific node, 3518 - * but is not exact. 3519 3493 */ 3520 3494 void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) 3521 3495 { ··· 5088 5066 { 5089 5067 unsigned long zone_start_pfn, zone_end_pfn; 5090 5068 5091 - /* When hotadd a new node, the node should be empty */ 5069 + /* When hotadd a new node from cpu_up(), the node should be empty */ 5092 5070 if (!node_start_pfn && !node_end_pfn) 5093 5071 return 0; 5094 5072 ··· 5155 5133 unsigned long zone_high = arch_zone_highest_possible_pfn[zone_type]; 5156 5134 unsigned long zone_start_pfn, zone_end_pfn; 5157 5135 5158 - /* When hotadd a new node, the node should be empty */ 5136 + /* When hotadd a new node from cpu_up(), the node should be empty */ 5159 5137 if (!node_start_pfn && !node_end_pfn) 5160 5138 return 0; 5161 5139 ··· 5328 5306 * 5329 5307 * NOTE: pgdat should get zeroed by caller. 5330 5308 */ 5331 - static void __paginginit free_area_init_core(struct pglist_data *pgdat, 5332 - unsigned long node_start_pfn, unsigned long node_end_pfn) 5309 + static void __paginginit free_area_init_core(struct pglist_data *pgdat) 5333 5310 { 5334 5311 enum zone_type j; 5335 5312 int nid = pgdat->node_id; ··· 5479 5458 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP 5480 5459 get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); 5481 5460 pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid, 5482 - (u64)start_pfn << PAGE_SHIFT, ((u64)end_pfn << PAGE_SHIFT) - 1); 5461 + (u64)start_pfn << PAGE_SHIFT, 5462 + end_pfn ? 
((u64)end_pfn << PAGE_SHIFT) - 1 : 0); 5483 5463 #endif 5484 5464 calculate_node_totalpages(pgdat, start_pfn, end_pfn, 5485 5465 zones_size, zholes_size); ··· 5492 5470 (unsigned long)pgdat->node_mem_map); 5493 5471 #endif 5494 5472 5495 - free_area_init_core(pgdat, start_pfn, end_pfn); 5473 + free_area_init_core(pgdat); 5496 5474 } 5497 5475 5498 5476 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP ··· 5503 5481 */ 5504 5482 void __init setup_nr_node_ids(void) 5505 5483 { 5506 - unsigned int node; 5507 - unsigned int highest = 0; 5484 + unsigned int highest; 5508 5485 5509 - for_each_node_mask(node, node_possible_map) 5510 - highest = node; 5486 + highest = find_last_bit(node_possible_map.bits, MAX_NUMNODES); 5511 5487 nr_node_ids = highest + 1; 5512 5488 } 5513 5489 #endif ··· 6026 6006 * set_dma_reserve - set the specified number of pages reserved in the first zone 6027 6007 * @new_dma_reserve: The number of pages to mark reserved 6028 6008 * 6029 - * The per-cpu batchsize and zone watermarks are determined by present_pages. 6009 + * The per-cpu batchsize and zone watermarks are determined by managed_pages. 6030 6010 * In the DMA zone, a significant percentage may be consumed by kernel image 6031 6011 * and other unfreeable allocations which can skew the watermarks badly. This 6032 6012 * function may optionally be used to account for unfreeable pages in the ··· 6079 6059 } 6080 6060 6081 6061 /* 6082 - * calculate_totalreserve_pages - called when sysctl_lower_zone_reserve_ratio 6062 + * calculate_totalreserve_pages - called when sysctl_lowmem_reserve_ratio 6083 6063 * or min_free_kbytes changes. 6084 6064 */ 6085 6065 static void calculate_totalreserve_pages(void) ··· 6123 6103 6124 6104 /* 6125 6105 * setup_per_zone_lowmem_reserve - called whenever 6126 - * sysctl_lower_zone_reserve_ratio changes. Ensures that each zone 6106 + * sysctl_lowmem_reserve_ratio changes. 
Ensures that each zone 6127 6107 * has a correct pages reserved value, so an adequate number of 6128 6108 * pages are left in the zone after a successful __alloc_pages(). 6129 6109 */
+9 -26
mm/page_isolation.c
··· 9 9 #include <linux/hugetlb.h> 10 10 #include "internal.h" 11 11 12 - int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages) 12 + static int set_migratetype_isolate(struct page *page, 13 + bool skip_hwpoisoned_pages) 13 14 { 14 15 struct zone *zone; 15 16 unsigned long flags, pfn; ··· 73 72 return ret; 74 73 } 75 74 76 - void unset_migratetype_isolate(struct page *page, unsigned migratetype) 75 + static void unset_migratetype_isolate(struct page *page, unsigned migratetype) 77 76 { 78 77 struct zone *zone; 79 78 unsigned long flags, nr_pages; ··· 224 223 continue; 225 224 } 226 225 page = pfn_to_page(pfn); 227 - if (PageBuddy(page)) { 226 + if (PageBuddy(page)) 228 227 /* 229 - * If race between isolatation and allocation happens, 230 - * some free pages could be in MIGRATE_MOVABLE list 231 - * although pageblock's migratation type of the page 232 - * is MIGRATE_ISOLATE. Catch it and move the page into 233 - * MIGRATE_ISOLATE list. 228 + * If the page is on a free list, it has to be on 229 + * the correct MIGRATE_ISOLATE freelist. There is no 230 + * simple way to verify that as VM_BUG_ON(), though. 234 231 */ 235 - if (get_freepage_migratetype(page) != MIGRATE_ISOLATE) { 236 - struct page *end_page; 237 - 238 - end_page = page + (1 << page_order(page)) - 1; 239 - move_freepages(page_zone(page), page, end_page, 240 - MIGRATE_ISOLATE); 241 - } 242 232 pfn += 1 << page_order(page); 243 - } 244 - else if (page_count(page) == 0 && 245 - get_freepage_migratetype(page) == MIGRATE_ISOLATE) 246 - pfn += 1; 247 - else if (skip_hwpoisoned_pages && PageHWPoison(page)) { 248 - /* 249 - * The HWPoisoned page may be not in buddy 250 - * system, and page_count() is not 0. 251 - */ 233 + else if (skip_hwpoisoned_pages && PageHWPoison(page)) 234 + /* A HWPoisoned page cannot be also PageBuddy */ 252 235 pfn++; 253 - continue; 254 - } 255 236 else 256 237 break; 257 238 }
+16
mm/shmem.c
··· 542 542 } 543 543 EXPORT_SYMBOL_GPL(shmem_truncate_range); 544 544 545 + static int shmem_getattr(struct vfsmount *mnt, struct dentry *dentry, 546 + struct kstat *stat) 547 + { 548 + struct inode *inode = dentry->d_inode; 549 + struct shmem_inode_info *info = SHMEM_I(inode); 550 + 551 + spin_lock(&info->lock); 552 + shmem_recalc_inode(inode); 553 + spin_unlock(&info->lock); 554 + 555 + generic_fillattr(inode, stat); 556 + 557 + return 0; 558 + } 559 + 545 560 static int shmem_setattr(struct dentry *dentry, struct iattr *attr) 546 561 { 547 562 struct inode *inode = d_inode(dentry); ··· 3137 3122 }; 3138 3123 3139 3124 static const struct inode_operations shmem_inode_operations = { 3125 + .getattr = shmem_getattr, 3140 3126 .setattr = shmem_setattr, 3141 3127 #ifdef CONFIG_TMPFS_XATTR 3142 3128 .setxattr = shmem_setxattr,
+1 -1
mm/slab.c
··· 1595 1595 if (memcg_charge_slab(cachep, flags, cachep->gfporder)) 1596 1596 return NULL; 1597 1597 1598 - page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder); 1598 + page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder); 1599 1599 if (!page) { 1600 1600 memcg_uncharge_slab(cachep, cachep->gfporder); 1601 1601 slab_out_of_memory(cachep, flags, nodeid);
+4 -1
mm/slab_common.c
··· 500 500 struct kmem_cache *root_cache) 501 501 { 502 502 static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */ 503 - struct cgroup_subsys_state *css = mem_cgroup_css(memcg); 503 + struct cgroup_subsys_state *css = &memcg->css; 504 504 struct memcg_cache_array *arr; 505 505 struct kmem_cache *s = NULL; 506 506 char *cache_name; ··· 639 639 LIST_HEAD(release); 640 640 bool need_rcu_barrier = false; 641 641 bool busy = false; 642 + 643 + if (unlikely(!s)) 644 + return; 642 645 643 646 BUG_ON(!is_root_cache(s)); 644 647
+2 -2
mm/slob.c
··· 45 45 * NUMA support in SLOB is fairly simplistic, pushing most of the real 46 46 * logic down to the page allocator, and simply doing the node accounting 47 47 * on the upper levels. In the event that a node id is explicitly 48 - * provided, alloc_pages_exact_node() with the specified node id is used 48 + * provided, __alloc_pages_node() with the specified node id is used 49 49 * instead. The common case (or when the node id isn't explicitly provided) 50 50 * will default to the current node, as per numa_node_id(). 51 51 * ··· 193 193 194 194 #ifdef CONFIG_NUMA 195 195 if (node != NUMA_NO_NODE) 196 - page = alloc_pages_exact_node(node, gfp, order); 196 + page = __alloc_pages_node(node, gfp, order); 197 197 else 198 198 #endif 199 199 page = alloc_pages(gfp, order);
+1 -1
mm/slub.c
··· 1334 1334 if (node == NUMA_NO_NODE) 1335 1335 page = alloc_pages(flags, order); 1336 1336 else 1337 - page = alloc_pages_exact_node(node, flags, order); 1337 + page = __alloc_pages_node(node, flags, order); 1338 1338 1339 1339 if (!page) 1340 1340 memcg_uncharge_slab(s, order);
+26 -11
mm/swap_state.c
··· 288 288 return page; 289 289 } 290 290 291 - /* 292 - * Locate a page of swap in physical memory, reserving swap cache space 293 - * and reading the disk if it is not already cached. 294 - * A failure return means that either the page allocation failed or that 295 - * the swap entry is no longer in use. 296 - */ 297 - struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 298 - struct vm_area_struct *vma, unsigned long addr) 291 + struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 292 + struct vm_area_struct *vma, unsigned long addr, 293 + bool *new_page_allocated) 299 294 { 300 295 struct page *found_page, *new_page = NULL; 296 + struct address_space *swapper_space = swap_address_space(entry); 301 297 int err; 298 + *new_page_allocated = false; 302 299 303 300 do { 304 301 /* ··· 303 306 * called after lookup_swap_cache() failed, re-calling 304 307 * that would confuse statistics. 305 308 */ 306 - found_page = find_get_page(swap_address_space(entry), 307 - entry.val); 309 + found_page = find_get_page(swapper_space, entry.val); 308 310 if (found_page) 309 311 break; 310 312 ··· 362 366 * Initiate read into locked page and return. 363 367 */ 364 368 lru_cache_add_anon(new_page); 365 - swap_readpage(new_page); 369 + *new_page_allocated = true; 366 370 return new_page; 367 371 } 368 372 radix_tree_preload_end(); ··· 378 382 if (new_page) 379 383 page_cache_release(new_page); 380 384 return found_page; 385 + } 386 + 387 + /* 388 + * Locate a page of swap in physical memory, reserving swap cache space 389 + * and reading the disk if it is not already cached. 390 + * A failure return means that either the page allocation failed or that 391 + * the swap entry is no longer in use. 
392 + */ 393 + struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 394 + struct vm_area_struct *vma, unsigned long addr) 395 + { 396 + bool page_was_allocated; 397 + struct page *retpage = __read_swap_cache_async(entry, gfp_mask, 398 + vma, addr, &page_was_allocated); 399 + 400 + if (page_was_allocated) 401 + swap_readpage(retpage); 402 + 403 + return retpage; 381 404 } 382 405 383 406 static unsigned long swapin_nr_pages(unsigned long offset)
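The swap_state.c refactor above factors the cache-population logic into `__read_swap_cache_async()`, which reports through an out-parameter whether a fresh page was allocated; the caller (here `read_swap_cache_async()`, and zswap later in this series) then decides whether to kick off the read. A minimal userspace sketch of the same find-or-create-with-out-parameter pattern — the names and the malloc-backed "cache" are illustrative, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Core helper: find-or-create, reporting whether we created. */
static int *cache_lookup_or_alloc(bool *newly_allocated)
{
	static int *cached;		/* stands in for the swap cache */

	*newly_allocated = false;
	if (cached)
		return cached;

	cached = malloc(sizeof(*cached));
	if (cached) {
		*cached = 0;
		*newly_allocated = true;
	}
	return cached;
}

/* Wrapper: performs the expensive step only when the helper allocated. */
static int *cache_lookup(void)
{
	bool was_allocated;
	int *p = cache_lookup_or_alloc(&was_allocated);

	if (p && was_allocated)
		*p = 42;		/* stands in for swap_readpage() */
	return p;
}
```

The point of the split is that a second caller (zswap's `zswap_get_swap_cache_page()`, rewritten later in this diff) can reuse the lookup while handling the "freshly allocated" case differently.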
+42
mm/swapfile.c
··· 875 875 } 876 876 877 877 /* 878 + * How many references to @entry are currently swapped out? 879 + * This considers COUNT_CONTINUED so it returns exact answer. 880 + */ 881 + int swp_swapcount(swp_entry_t entry) 882 + { 883 + int count, tmp_count, n; 884 + struct swap_info_struct *p; 885 + struct page *page; 886 + pgoff_t offset; 887 + unsigned char *map; 888 + 889 + p = swap_info_get(entry); 890 + if (!p) 891 + return 0; 892 + 893 + count = swap_count(p->swap_map[swp_offset(entry)]); 894 + if (!(count & COUNT_CONTINUED)) 895 + goto out; 896 + 897 + count &= ~COUNT_CONTINUED; 898 + n = SWAP_MAP_MAX + 1; 899 + 900 + offset = swp_offset(entry); 901 + page = vmalloc_to_page(p->swap_map + offset); 902 + offset &= ~PAGE_MASK; 903 + VM_BUG_ON(page_private(page) != SWP_CONTINUED); 904 + 905 + do { 906 + page = list_entry(page->lru.next, struct page, lru); 907 + map = kmap_atomic(page); 908 + tmp_count = map[offset]; 909 + kunmap_atomic(map); 910 + 911 + count += (tmp_count & ~COUNT_CONTINUED) * n; 912 + n *= (SWAP_CONT_MAX + 1); 913 + } while (tmp_count & COUNT_CONTINUED); 914 + out: 915 + spin_unlock(&p->lock); 916 + return count; 917 + } 918 + 919 + /* 878 920 * We can write to an anon page without COW if there are no other references 879 921 * to it. And as a side-effect, free up its swap: because the old content 880 922 * on disk will never be read, and seeking back there to write new content
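The new `swp_swapcount()` above reconstructs an exact reference count from a positional encoding: the primary `swap_map` byte holds the low digit (base `SWAP_MAP_MAX + 1`), and each continuation page holds the next digit (base `SWAP_CONT_MAX + 1`), with `COUNT_CONTINUED` flagging that more digits follow. A self-contained sketch of that decode loop over a flat byte array — the constants mirror the kernel's but are restated here as assumptions:

```c
#include <assert.h>

#define COUNT_CONTINUED	0x80	/* "another digit follows" flag (assumed) */
#define SWAP_MAP_MAX	0x3e	/* max value of the first digit (assumed) */
#define SWAP_CONT_MAX	0x7f	/* max value of continuation digits (assumed) */

/* Decode a count spread across a primary byte plus continuation bytes. */
static int decode_swapcount(const unsigned char *digits)
{
	int count = digits[0] & ~COUNT_CONTINUED;
	int n = SWAP_MAP_MAX + 1;	/* place value of the next digit */
	int i = 0;

	while (digits[i] & COUNT_CONTINUED) {
		i++;
		count += (digits[i] & ~COUNT_CONTINUED) * n;
		n *= SWAP_CONT_MAX + 1;
	}
	return count;
}
```

In the real function the "next byte" lives on a chained continuation page (hence the `vmalloc_to_page()` / `list_entry(page->lru.next, ...)` walk under `p->lock`), but the arithmetic is the same.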
+9 -5
mm/vmscan.c
··· 175 175 if (!memcg) 176 176 return true; 177 177 #ifdef CONFIG_CGROUP_WRITEBACK 178 - if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup)) 178 + if (memcg->css.cgroup) 179 179 return true; 180 180 #endif 181 181 return false; ··· 985 985 * __GFP_IO|__GFP_FS for this reason); but more thought 986 986 * would probably show more reasons. 987 987 * 988 - * 3) Legacy memcg encounters a page that is not already marked 988 + * 3) Legacy memcg encounters a page that is already marked 989 989 * PageReclaim. memcg does not have any dirty pages 990 990 * throttling so we could easily OOM just because too many 991 991 * pages are in writeback and there is nothing else to ··· 1015 1015 */ 1016 1016 SetPageReclaim(page); 1017 1017 nr_writeback++; 1018 - 1019 1018 goto keep_locked; 1020 1019 1021 1020 /* Case 3 above */ 1022 1021 } else { 1022 + unlock_page(page); 1023 1023 wait_on_page_writeback(page); 1024 + /* then go back and try same page again */ 1025 + list_add_tail(&page->lru, page_list); 1026 + continue; 1024 1027 } 1025 1028 } 1026 1029 ··· 1199 1196 if (PageSwapCache(page)) 1200 1197 try_to_free_swap(page); 1201 1198 unlock_page(page); 1202 - putback_lru_page(page); 1199 + list_add(&page->lru, &ret_pages); 1203 1200 continue; 1204 1201 1205 1202 activate_locked: ··· 1362 1359 unsigned long nr_taken = 0; 1363 1360 unsigned long scan; 1364 1361 1365 - for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { 1362 + for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan && 1363 + !list_empty(src); scan++) { 1366 1364 struct page *page; 1367 1365 int nr_pages; 1368 1366
+5 -5
mm/zbud.c
··· 96 96 struct list_head buddied; 97 97 struct list_head lru; 98 98 u64 pages_nr; 99 - struct zbud_ops *ops; 99 + const struct zbud_ops *ops; 100 100 #ifdef CONFIG_ZPOOL 101 101 struct zpool *zpool; 102 - struct zpool_ops *zpool_ops; 102 + const struct zpool_ops *zpool_ops; 103 103 #endif 104 104 }; 105 105 ··· 133 133 return -ENOENT; 134 134 } 135 135 136 - static struct zbud_ops zbud_zpool_ops = { 136 + static const struct zbud_ops zbud_zpool_ops = { 137 137 .evict = zbud_zpool_evict 138 138 }; 139 139 140 140 static void *zbud_zpool_create(char *name, gfp_t gfp, 141 - struct zpool_ops *zpool_ops, 141 + const struct zpool_ops *zpool_ops, 142 142 struct zpool *zpool) 143 143 { 144 144 struct zbud_pool *pool; ··· 302 302 * Return: pointer to the new zbud pool or NULL if the metadata allocation 303 303 * failed. 304 304 */ 305 - struct zbud_pool *zbud_create_pool(gfp_t gfp, struct zbud_ops *ops) 305 + struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops) 306 306 { 307 307 struct zbud_pool *pool; 308 308 int i;
+2 -16
mm/zpool.c
··· 22 22 23 23 struct zpool_driver *driver; 24 24 void *pool; 25 - struct zpool_ops *ops; 25 + const struct zpool_ops *ops; 26 26 27 27 struct list_head list; 28 28 }; ··· 115 115 * Returns: New zpool on success, NULL on failure. 116 116 */ 117 117 struct zpool *zpool_create_pool(char *type, char *name, gfp_t gfp, 118 - struct zpool_ops *ops) 118 + const struct zpool_ops *ops) 119 119 { 120 120 struct zpool_driver *driver; 121 121 struct zpool *zpool; ··· 319 319 { 320 320 return zpool->driver->total_size(zpool->pool); 321 321 } 322 - 323 - static int __init init_zpool(void) 324 - { 325 - pr_info("loaded\n"); 326 - return 0; 327 - } 328 - 329 - static void __exit exit_zpool(void) 330 - { 331 - pr_info("unloaded\n"); 332 - } 333 - 334 - module_init(init_zpool); 335 - module_exit(exit_zpool); 336 322 337 323 MODULE_LICENSE("GPL"); 338 324 MODULE_AUTHOR("Dan Streetman <ddstreet@ieee.org>");
+163 -72
mm/zsmalloc.c
··· 169 169 NR_ZS_STAT_TYPE, 170 170 }; 171 171 172 - #ifdef CONFIG_ZSMALLOC_STAT 173 - 174 - static struct dentry *zs_stat_root; 175 - 176 172 struct zs_size_stat { 177 173 unsigned long objs[NR_ZS_STAT_TYPE]; 178 174 }; 179 175 176 + #ifdef CONFIG_ZSMALLOC_STAT 177 + static struct dentry *zs_stat_root; 180 178 #endif 181 179 182 180 /* ··· 199 201 static const int fullness_threshold_frac = 4; 200 202 201 203 struct size_class { 204 + spinlock_t lock; 205 + struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; 202 206 /* 203 207 * Size of objects stored in this class. Must be multiple 204 208 * of ZS_ALIGN. ··· 210 210 211 211 /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */ 212 212 int pages_per_zspage; 213 + struct zs_size_stat stats; 214 + 213 215 /* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */ 214 216 bool huge; 215 - 216 - #ifdef CONFIG_ZSMALLOC_STAT 217 - struct zs_size_stat stats; 218 - #endif 219 - 220 - spinlock_t lock; 221 - 222 - struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; 223 217 }; 224 218 225 219 /* ··· 245 251 gfp_t flags; /* allocation flags used when growing pool */ 246 252 atomic_long_t pages_allocated; 247 253 254 + struct zs_pool_stats stats; 255 + 256 + /* Compact classes */ 257 + struct shrinker shrinker; 258 + /* 259 + * To signify that register_shrinker() was successful 260 + * and unregister_shrinker() will not Oops. 
261 + */ 262 + bool shrinker_enabled; 248 263 #ifdef CONFIG_ZSMALLOC_STAT 249 264 struct dentry *stat_dentry; 250 265 #endif ··· 288 285 289 286 static void destroy_handle_cache(struct zs_pool *pool) 290 287 { 291 - if (pool->handle_cachep) 292 - kmem_cache_destroy(pool->handle_cachep); 288 + kmem_cache_destroy(pool->handle_cachep); 293 289 } 294 290 295 291 static unsigned long alloc_handle(struct zs_pool *pool) ··· 311 309 312 310 #ifdef CONFIG_ZPOOL 313 311 314 - static void *zs_zpool_create(char *name, gfp_t gfp, struct zpool_ops *zpool_ops, 312 + static void *zs_zpool_create(char *name, gfp_t gfp, 313 + const struct zpool_ops *zpool_ops, 315 314 struct zpool *zpool) 316 315 { 317 316 return zs_create_pool(name, gfp); ··· 444 441 return min(zs_size_classes - 1, idx); 445 442 } 446 443 447 - #ifdef CONFIG_ZSMALLOC_STAT 448 - 449 444 static inline void zs_stat_inc(struct size_class *class, 450 445 enum zs_stat_type type, unsigned long cnt) 451 446 { ··· 461 460 { 462 461 return class->stats.objs[type]; 463 462 } 463 + 464 + #ifdef CONFIG_ZSMALLOC_STAT 464 465 465 466 static int __init zs_stat_init(void) 466 467 { ··· 579 576 } 580 577 581 578 #else /* CONFIG_ZSMALLOC_STAT */ 582 - 583 - static inline void zs_stat_inc(struct size_class *class, 584 - enum zs_stat_type type, unsigned long cnt) 585 - { 586 - } 587 - 588 - static inline void zs_stat_dec(struct size_class *class, 589 - enum zs_stat_type type, unsigned long cnt) 590 - { 591 - } 592 - 593 - static inline unsigned long zs_stat_get(struct size_class *class, 594 - enum zs_stat_type type) 595 - { 596 - return 0; 597 - } 598 - 599 579 static int __init zs_stat_init(void) 600 580 { 601 581 return 0; ··· 596 610 static inline void zs_pool_stat_destroy(struct zs_pool *pool) 597 611 { 598 612 } 599 - 600 613 #endif 601 614 602 615 ··· 643 658 if (fullness >= _ZS_NR_FULLNESS_GROUPS) 644 659 return; 645 660 646 - head = &class->fullness_list[fullness]; 647 - if (*head) 648 - list_add_tail(&page->lru, 
&(*head)->lru); 649 - 650 - *head = page; 651 661 zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ? 652 662 CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1); 663 + 664 + head = &class->fullness_list[fullness]; 665 + if (!*head) { 666 + *head = page; 667 + return; 668 + } 669 + 670 + /* 671 + * We want to see more ZS_FULL pages and less almost 672 + * empty/full. Put pages with higher ->inuse first. 673 + */ 674 + list_add_tail(&page->lru, &(*head)->lru); 675 + if (page->inuse >= (*head)->inuse) 676 + *head = page; 653 677 } 654 678 655 679 /* ··· 1489 1495 } 1490 1496 EXPORT_SYMBOL_GPL(zs_free); 1491 1497 1492 - static void zs_object_copy(unsigned long src, unsigned long dst, 1498 + static void zs_object_copy(unsigned long dst, unsigned long src, 1493 1499 struct size_class *class) 1494 1500 { 1495 1501 struct page *s_page, *d_page; ··· 1596 1602 /* Starting object index within @s_page which used for live object 1597 1603 * in the subpage. */ 1598 1604 int index; 1599 - /* how many of objects are migrated */ 1600 - int nr_migrated; 1601 1605 }; 1602 1606 1603 1607 static int migrate_zspage(struct zs_pool *pool, struct size_class *class, ··· 1606 1614 struct page *s_page = cc->s_page; 1607 1615 struct page *d_page = cc->d_page; 1608 1616 unsigned long index = cc->index; 1609 - int nr_migrated = 0; 1610 1617 int ret = 0; 1611 1618 1612 1619 while (1) { ··· 1627 1636 1628 1637 used_obj = handle_to_obj(handle); 1629 1638 free_obj = obj_malloc(d_page, class, handle); 1630 - zs_object_copy(used_obj, free_obj, class); 1639 + zs_object_copy(free_obj, used_obj, class); 1631 1640 index++; 1632 1641 record_obj(handle, free_obj); 1633 1642 unpin_tag(handle); 1634 1643 obj_free(pool, class, used_obj); 1635 - nr_migrated++; 1636 1644 } 1637 1645 1638 1646 /* Remember last position in this iteration */ 1639 1647 cc->s_page = s_page; 1640 1648 cc->index = index; 1641 - cc->nr_migrated = nr_migrated; 1642 1649 1643 1650 return ret; 1644 1651 } 1645 1652 1646 - static struct page 
*alloc_target_page(struct size_class *class) 1653 + static struct page *isolate_target_page(struct size_class *class) 1647 1654 { 1648 1655 int i; 1649 1656 struct page *page; ··· 1657 1668 return page; 1658 1669 } 1659 1670 1660 - static void putback_zspage(struct zs_pool *pool, struct size_class *class, 1661 - struct page *first_page) 1671 + /* 1672 + * putback_zspage - add @first_page into right class's fullness list 1673 + * @pool: target pool 1674 + * @class: destination class 1675 + * @first_page: target page 1676 + * 1677 + * Return @fist_page's fullness_group 1678 + */ 1679 + static enum fullness_group putback_zspage(struct zs_pool *pool, 1680 + struct size_class *class, 1681 + struct page *first_page) 1662 1682 { 1663 1683 enum fullness_group fullness; 1664 1684 ··· 1685 1687 1686 1688 free_zspage(first_page); 1687 1689 } 1690 + 1691 + return fullness; 1688 1692 } 1689 1693 1690 1694 static struct page *isolate_source_page(struct size_class *class) 1691 1695 { 1692 - struct page *page; 1696 + int i; 1697 + struct page *page = NULL; 1693 1698 1694 - page = class->fullness_list[ZS_ALMOST_EMPTY]; 1695 - if (page) 1696 - remove_zspage(page, class, ZS_ALMOST_EMPTY); 1699 + for (i = ZS_ALMOST_EMPTY; i >= ZS_ALMOST_FULL; i--) { 1700 + page = class->fullness_list[i]; 1701 + if (!page) 1702 + continue; 1703 + 1704 + remove_zspage(page, class, i); 1705 + break; 1706 + } 1697 1707 1698 1708 return page; 1699 1709 } 1700 1710 1701 - static unsigned long __zs_compact(struct zs_pool *pool, 1702 - struct size_class *class) 1711 + /* 1712 + * 1713 + * Based on the number of unused allocated objects calculate 1714 + * and return the number of pages that we can free. 
1715 + */ 1716 + static unsigned long zs_can_compact(struct size_class *class) 1703 1717 { 1704 - int nr_to_migrate; 1718 + unsigned long obj_wasted; 1719 + 1720 + obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) - 1721 + zs_stat_get(class, OBJ_USED); 1722 + 1723 + obj_wasted /= get_maxobj_per_zspage(class->size, 1724 + class->pages_per_zspage); 1725 + 1726 + return obj_wasted * class->pages_per_zspage; 1727 + } 1728 + 1729 + static void __zs_compact(struct zs_pool *pool, struct size_class *class) 1730 + { 1705 1731 struct zs_compact_control cc; 1706 1732 struct page *src_page; 1707 1733 struct page *dst_page = NULL; 1708 - unsigned long nr_total_migrated = 0; 1709 1734 1710 1735 spin_lock(&class->lock); 1711 1736 while ((src_page = isolate_source_page(class))) { 1712 1737 1713 1738 BUG_ON(!is_first_page(src_page)); 1714 1739 1715 - /* The goal is to migrate all live objects in source page */ 1716 - nr_to_migrate = src_page->inuse; 1740 + if (!zs_can_compact(class)) 1741 + break; 1742 + 1717 1743 cc.index = 0; 1718 1744 cc.s_page = src_page; 1719 1745 1720 - while ((dst_page = alloc_target_page(class))) { 1746 + while ((dst_page = isolate_target_page(class))) { 1721 1747 cc.d_page = dst_page; 1722 1748 /* 1723 - * If there is no more space in dst_page, try to 1724 - * allocate another zspage. 1749 + * If there is no more space in dst_page, resched 1750 + * and see if anyone had allocated another zspage. 
1725 1751 */ 1726 1752 if (!migrate_zspage(pool, class, &cc)) 1727 1753 break; 1728 1754 1729 1755 putback_zspage(pool, class, dst_page); 1730 - nr_total_migrated += cc.nr_migrated; 1731 - nr_to_migrate -= cc.nr_migrated; 1732 1756 } 1733 1757 1734 1758 /* Stop if we couldn't find slot */ ··· 1758 1738 break; 1759 1739 1760 1740 putback_zspage(pool, class, dst_page); 1761 - putback_zspage(pool, class, src_page); 1741 + if (putback_zspage(pool, class, src_page) == ZS_EMPTY) 1742 + pool->stats.pages_compacted += class->pages_per_zspage; 1762 1743 spin_unlock(&class->lock); 1763 - nr_total_migrated += cc.nr_migrated; 1764 1744 cond_resched(); 1765 1745 spin_lock(&class->lock); 1766 1746 } ··· 1769 1749 putback_zspage(pool, class, src_page); 1770 1750 1771 1751 spin_unlock(&class->lock); 1772 - 1773 - return nr_total_migrated; 1774 1752 } 1775 1753 1776 1754 unsigned long zs_compact(struct zs_pool *pool) 1777 1755 { 1778 1756 int i; 1779 - unsigned long nr_migrated = 0; 1780 1757 struct size_class *class; 1781 1758 1782 1759 for (i = zs_size_classes - 1; i >= 0; i--) { ··· 1782 1765 continue; 1783 1766 if (class->index != i) 1784 1767 continue; 1785 - nr_migrated += __zs_compact(pool, class); 1768 + __zs_compact(pool, class); 1786 1769 } 1787 1770 1788 - return nr_migrated; 1771 + return pool->stats.pages_compacted; 1789 1772 } 1790 1773 EXPORT_SYMBOL_GPL(zs_compact); 1774 + 1775 + void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats) 1776 + { 1777 + memcpy(stats, &pool->stats, sizeof(struct zs_pool_stats)); 1778 + } 1779 + EXPORT_SYMBOL_GPL(zs_pool_stats); 1780 + 1781 + static unsigned long zs_shrinker_scan(struct shrinker *shrinker, 1782 + struct shrink_control *sc) 1783 + { 1784 + unsigned long pages_freed; 1785 + struct zs_pool *pool = container_of(shrinker, struct zs_pool, 1786 + shrinker); 1787 + 1788 + pages_freed = pool->stats.pages_compacted; 1789 + /* 1790 + * Compact classes and calculate compaction delta. 
1791 + * Can run concurrently with a manually triggered 1792 + * (by user) compaction. 1793 + */ 1794 + pages_freed = zs_compact(pool) - pages_freed; 1795 + 1796 + return pages_freed ? pages_freed : SHRINK_STOP; 1797 + } 1798 + 1799 + static unsigned long zs_shrinker_count(struct shrinker *shrinker, 1800 + struct shrink_control *sc) 1801 + { 1802 + int i; 1803 + struct size_class *class; 1804 + unsigned long pages_to_free = 0; 1805 + struct zs_pool *pool = container_of(shrinker, struct zs_pool, 1806 + shrinker); 1807 + 1808 + if (!pool->shrinker_enabled) 1809 + return 0; 1810 + 1811 + for (i = zs_size_classes - 1; i >= 0; i--) { 1812 + class = pool->size_class[i]; 1813 + if (!class) 1814 + continue; 1815 + if (class->index != i) 1816 + continue; 1817 + 1818 + pages_to_free += zs_can_compact(class); 1819 + } 1820 + 1821 + return pages_to_free; 1822 + } 1823 + 1824 + static void zs_unregister_shrinker(struct zs_pool *pool) 1825 + { 1826 + if (pool->shrinker_enabled) { 1827 + unregister_shrinker(&pool->shrinker); 1828 + pool->shrinker_enabled = false; 1829 + } 1830 + } 1831 + 1832 + static int zs_register_shrinker(struct zs_pool *pool) 1833 + { 1834 + pool->shrinker.scan_objects = zs_shrinker_scan; 1835 + pool->shrinker.count_objects = zs_shrinker_count; 1836 + pool->shrinker.batch = 0; 1837 + pool->shrinker.seeks = DEFAULT_SEEKS; 1838 + 1839 + return register_shrinker(&pool->shrinker); 1840 + } 1791 1841 1792 1842 /** 1793 1843 * zs_create_pool - Creates an allocation pool to work from. ··· 1941 1857 if (zs_pool_stat_create(name, pool)) 1942 1858 goto err; 1943 1859 1860 + /* 1861 + * Not critical, we still can use the pool 1862 + * and user can trigger compaction manually. 
1863 + */ 1864 + if (zs_register_shrinker(pool) == 0) 1865 + pool->shrinker_enabled = true; 1944 1866 return pool; 1945 1867 1946 1868 err: ··· 1959 1869 { 1960 1870 int i; 1961 1871 1872 + zs_unregister_shrinker(pool); 1962 1873 zs_pool_stat_destroy(pool); 1963 1874 1964 1875 for (i = 0; i < zs_size_classes; i++) {
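The `zs_can_compact()` helper introduced above (and reused by the new shrinker's count callback) estimates reclaimable pages per size class: allocated-but-unused objects, divided by objects per zspage, gives the number of whole zspages compaction could empty, each of which frees `pages_per_zspage` physical pages. The arithmetic as a standalone sketch with illustrative parameters:

```c
#include <assert.h>

/* Pages freeable by compacting one size class, given its counters. */
static unsigned long can_compact(unsigned long obj_allocated,
				 unsigned long obj_used,
				 unsigned long objs_per_zspage,
				 unsigned long pages_per_zspage)
{
	unsigned long obj_wasted = obj_allocated - obj_used;

	/* Whole zspages' worth of dead objects... */
	obj_wasted /= objs_per_zspage;
	/* ...each spanning pages_per_zspage physical pages. */
	return obj_wasted * pages_per_zspage;
}
```

Because the division truncates, fewer than one zspage's worth of slack reports zero, which is what lets `__zs_compact()` use it as an early-exit condition.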
+7 -68
mm/zswap.c
··· 446 446 static int zswap_get_swap_cache_page(swp_entry_t entry, 447 447 struct page **retpage) 448 448 { 449 - struct page *found_page, *new_page = NULL; 450 - struct address_space *swapper_space = swap_address_space(entry); 451 - int err; 449 + bool page_was_allocated; 452 450 453 - *retpage = NULL; 454 - do { 455 - /* 456 - * First check the swap cache. Since this is normally 457 - * called after lookup_swap_cache() failed, re-calling 458 - * that would confuse statistics. 459 - */ 460 - found_page = find_get_page(swapper_space, entry.val); 461 - if (found_page) 462 - break; 463 - 464 - /* 465 - * Get a new page to read into from swap. 466 - */ 467 - if (!new_page) { 468 - new_page = alloc_page(GFP_KERNEL); 469 - if (!new_page) 470 - break; /* Out of memory */ 471 - } 472 - 473 - /* 474 - * call radix_tree_preload() while we can wait. 475 - */ 476 - err = radix_tree_preload(GFP_KERNEL); 477 - if (err) 478 - break; 479 - 480 - /* 481 - * Swap entry may have been freed since our caller observed it. 482 - */ 483 - err = swapcache_prepare(entry); 484 - if (err == -EEXIST) { /* seems racy */ 485 - radix_tree_preload_end(); 486 - continue; 487 - } 488 - if (err) { /* swp entry is obsolete ? */ 489 - radix_tree_preload_end(); 490 - break; 491 - } 492 - 493 - /* May fail (-ENOMEM) if radix-tree node allocation failed. */ 494 - __set_page_locked(new_page); 495 - SetPageSwapBacked(new_page); 496 - err = __add_to_swap_cache(new_page, entry); 497 - if (likely(!err)) { 498 - radix_tree_preload_end(); 499 - lru_cache_add_anon(new_page); 500 - *retpage = new_page; 501 - return ZSWAP_SWAPCACHE_NEW; 502 - } 503 - radix_tree_preload_end(); 504 - ClearPageSwapBacked(new_page); 505 - __clear_page_locked(new_page); 506 - /* 507 - * add_to_swap_cache() doesn't return -EEXIST, so we can safely 508 - * clear SWAP_HAS_CACHE flag. 
509 - */ 510 - swapcache_free(entry); 511 - } while (err != -ENOMEM); 512 - 513 - if (new_page) 514 - page_cache_release(new_page); 515 - if (!found_page) 451 + *retpage = __read_swap_cache_async(entry, GFP_KERNEL, 452 + NULL, 0, &page_was_allocated); 453 + if (page_was_allocated) 454 + return ZSWAP_SWAPCACHE_NEW; 455 + if (!*retpage) 516 456 return ZSWAP_SWAPCACHE_FAIL; 517 - *retpage = found_page; 518 457 return ZSWAP_SWAPCACHE_EXIST; 519 458 } 520 459 ··· 755 816 zswap_trees[type] = NULL; 756 817 } 757 818 758 - static struct zpool_ops zswap_zpool_ops = { 819 + static const struct zpool_ops zswap_zpool_ops = { 759 820 .evict = zswap_writeback_entry 760 821 }; 761 822
+84
scripts/coccinelle/api/alloc/pool_zalloc-simple.cocci
··· 1 + /// 2 + /// Use *_pool_zalloc rather than *_pool_alloc followed by memset with 0 3 + /// 4 + // Copyright: (C) 2015 Intel Corp. GPLv2. 5 + // Options: --no-includes --include-headers 6 + // 7 + // Keywords: dma_pool_zalloc, pci_pool_zalloc 8 + // 9 + 10 + virtual context 11 + virtual patch 12 + virtual org 13 + virtual report 14 + 15 + //---------------------------------------------------------- 16 + // For context mode 17 + //---------------------------------------------------------- 18 + 19 + @depends on context@ 20 + expression x; 21 + statement S; 22 + @@ 23 + 24 + * x = \(dma_pool_alloc\|pci_pool_alloc\)(...); 25 + if ((x==NULL) || ...) S 26 + * memset(x,0, ...); 27 + 28 + //---------------------------------------------------------- 29 + // For patch mode 30 + //---------------------------------------------------------- 31 + 32 + @depends on patch@ 33 + expression x; 34 + expression a,b,c; 35 + statement S; 36 + @@ 37 + 38 + - x = dma_pool_alloc(a,b,c); 39 + + x = dma_pool_zalloc(a,b,c); 40 + if ((x==NULL) || ...) S 41 + - memset(x,0,...); 42 + 43 + @depends on patch@ 44 + expression x; 45 + expression a,b,c; 46 + statement S; 47 + @@ 48 + 49 + - x = pci_pool_alloc(a,b,c); 50 + + x = pci_pool_zalloc(a,b,c); 51 + if ((x==NULL) || ...) S 52 + - memset(x,0,...); 53 + 54 + //---------------------------------------------------------- 55 + // For org and report mode 56 + //---------------------------------------------------------- 57 + 58 + @r depends on org || report@ 59 + expression x; 60 + expression a,b,c; 61 + statement S; 62 + position p; 63 + @@ 64 + 65 + x = @p\(dma_pool_alloc\|pci_pool_alloc\)(a,b,c); 66 + if ((x==NULL) || ...) 
S 67 + memset(x,0, ...); 68 + 69 + @script:python depends on org@ 70 + p << r.p; 71 + x << r.x; 72 + @@ 73 + 74 + msg="%s" % (x) 75 + msg_safe=msg.replace("[","@(").replace("]",")") 76 + coccilib.org.print_todo(p[0], msg_safe) 77 + 78 + @script:python depends on report@ 79 + p << r.p; 80 + x << r.x; 81 + @@ 82 + 83 + msg="WARNING: *_pool_zalloc should be used for %s, instead of *_pool_alloc/memset" % (x) 84 + coccilib.report.print_report(p[0], msg)
-1
tools/testing/selftests/vm/Makefile
··· 4 4 BINARIES = compaction_test 5 5 BINARIES += hugepage-mmap 6 6 BINARIES += hugepage-shm 7 - BINARIES += hugetlbfstest 8 7 BINARIES += map_hugetlb 9 8 BINARIES += thuge-gen 10 9 BINARIES += transhuge-stress
-86
tools/testing/selftests/vm/hugetlbfstest.c
··· 1 - #define _GNU_SOURCE 2 - #include <assert.h> 3 - #include <fcntl.h> 4 - #include <stdio.h> 5 - #include <stdlib.h> 6 - #include <string.h> 7 - #include <sys/mman.h> 8 - #include <sys/stat.h> 9 - #include <sys/types.h> 10 - #include <unistd.h> 11 - 12 - typedef unsigned long long u64; 13 - 14 - static size_t length = 1 << 24; 15 - 16 - static u64 read_rss(void) 17 - { 18 - char buf[4096], *s = buf; 19 - int i, fd; 20 - u64 rss; 21 - 22 - fd = open("/proc/self/statm", O_RDONLY); 23 - assert(fd > 2); 24 - memset(buf, 0, sizeof(buf)); 25 - read(fd, buf, sizeof(buf) - 1); 26 - for (i = 0; i < 1; i++) 27 - s = strchr(s, ' ') + 1; 28 - rss = strtoull(s, NULL, 10); 29 - return rss << 12; /* assumes 4k pagesize */ 30 - } 31 - 32 - static void do_mmap(int fd, int extra_flags, int unmap) 33 - { 34 - int *p; 35 - int flags = MAP_PRIVATE | MAP_POPULATE | extra_flags; 36 - u64 before, after; 37 - int ret; 38 - 39 - before = read_rss(); 40 - p = mmap(NULL, length, PROT_READ | PROT_WRITE, flags, fd, 0); 41 - assert(p != MAP_FAILED || 42 - !"mmap returned an unexpected error"); 43 - after = read_rss(); 44 - assert(llabs(after - before - length) < 0x40000 || 45 - !"rss didn't grow as expected"); 46 - if (!unmap) 47 - return; 48 - ret = munmap(p, length); 49 - assert(!ret || !"munmap returned an unexpected error"); 50 - after = read_rss(); 51 - assert(llabs(after - before) < 0x40000 || 52 - !"rss didn't shrink as expected"); 53 - } 54 - 55 - static int open_file(const char *path) 56 - { 57 - int fd, err; 58 - 59 - unlink(path); 60 - fd = open(path, O_CREAT | O_RDWR | O_TRUNC | O_EXCL 61 - | O_LARGEFILE | O_CLOEXEC, 0600); 62 - assert(fd > 2); 63 - unlink(path); 64 - err = ftruncate(fd, length); 65 - assert(!err); 66 - return fd; 67 - } 68 - 69 - int main(void) 70 - { 71 - int hugefd, fd; 72 - 73 - fd = open_file("/dev/shm/hugetlbhog"); 74 - hugefd = open_file("/hugepages/hugetlbhog"); 75 - 76 - system("echo 100 > /proc/sys/vm/nr_hugepages"); 77 - do_mmap(-1, MAP_ANONYMOUS, 1); 
78 - do_mmap(fd, 0, 1); 79 - do_mmap(-1, MAP_ANONYMOUS | MAP_HUGETLB, 1); 80 - do_mmap(hugefd, 0, 1); 81 - do_mmap(hugefd, MAP_HUGETLB, 1); 82 - /* Leak the last one to test do_exit() */ 83 - do_mmap(-1, MAP_ANONYMOUS | MAP_HUGETLB, 0); 84 - printf("oll korrekt.\n"); 85 - return 0; 86 - }
+3 -10
tools/testing/selftests/vm/run_vmtests
··· 75 75 echo "[PASS]" 76 76 fi 77 77 78 - echo "--------------------" 79 - echo "running hugetlbfstest" 80 - echo "--------------------" 81 - ./hugetlbfstest 82 - if [ $? -ne 0 ]; then 83 - echo "[FAIL]" 84 - exitcode=1 85 - else 86 - echo "[PASS]" 87 - fi 78 + echo "NOTE: The above hugetlb tests provide minimal coverage. Use" 79 + echo " https://github.com/libhugetlbfs/libhugetlbfs.git for" 80 + echo " hugetlb regression testing." 88 81 89 82 echo "--------------------" 90 83 echo "running userfaultfd"
+6 -3
tools/testing/selftests/vm/userfaultfd.c
··· 147 147 if (sizeof(page_nr) > sizeof(rand_nr)) { 148 148 if (random_r(&rand, &rand_nr)) 149 149 fprintf(stderr, "random_r 2 error\n"), exit(1); 150 - page_nr |= ((unsigned long) rand_nr) << 32; 150 + page_nr |= (((unsigned long) rand_nr) << 16) << 151 + 16; 151 152 } 152 153 } else 153 154 page_nr += 1; ··· 291 290 msg.event), exit(1); 292 291 if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) 293 292 fprintf(stderr, "unexpected write fault\n"), exit(1); 294 - offset = (char *)msg.arg.pagefault.address - area_dst; 293 + offset = (char *)(unsigned long)msg.arg.pagefault.address - 294 + area_dst; 295 295 offset &= ~(page_size-1); 296 296 if (copy_page(offset)) 297 297 userfaults++; ··· 329 327 if (bounces & BOUNCE_VERIFY && 330 328 msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) 331 329 fprintf(stderr, "unexpected write fault\n"), exit(1); 332 - offset = (char *)msg.arg.pagefault.address - area_dst; 330 + offset = (char *)(unsigned long)msg.arg.pagefault.address - 331 + area_dst; 333 332 offset &= ~(page_size-1); 334 333 if (copy_page(offset)) 335 334 (*this_cpu_userfaults)++;
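The `(((unsigned long) rand_nr) << 16) << 16` change in userfaultfd.c above is a portability fix: when `unsigned long` is 32 bits wide, a single `<< 32` shift is undefined behaviour in C (shift count equal to the operand width), whereas two 16-bit shifts are each well defined and simply shift the bits out, yielding zero. A small demonstration with a fixed 32-bit type:

```c
#include <assert.h>
#include <stdint.h>

/* Two sub-width shifts: defined for 32-bit operands, zero when bits fall off. */
static uint32_t shift32_by_32(uint32_t x)
{
	return (x << 16) << 16;	/* each shift < width, so no UB */
}
```

On an LP64 build the guarded branch still behaves as intended, since the double shift on a 64-bit `unsigned long` is equivalent to `<< 32`.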
+18 -17
tools/vm/page-types.c
··· 57 57 * pagemap kernel ABI bits 58 58 */ 59 59 60 - #define PM_ENTRY_BYTES sizeof(uint64_t) 61 - #define PM_STATUS_BITS 3 62 - #define PM_STATUS_OFFSET (64 - PM_STATUS_BITS) 63 - #define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET) 64 - #define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK) 65 - #define PM_PSHIFT_BITS 6 66 - #define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS) 67 - #define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET) 68 - #define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK) 69 - #define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1) 70 - #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK) 71 - 72 - #define __PM_SOFT_DIRTY (1LL) 73 - #define PM_PRESENT PM_STATUS(4LL) 74 - #define PM_SWAP PM_STATUS(2LL) 75 - #define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY) 76 - 60 + #define PM_ENTRY_BYTES 8 61 + #define PM_PFRAME_BITS 55 62 + #define PM_PFRAME_MASK ((1LL << PM_PFRAME_BITS) - 1) 63 + #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK) 64 + #define PM_SOFT_DIRTY (1ULL << 55) 65 + #define PM_MMAP_EXCLUSIVE (1ULL << 56) 66 + #define PM_FILE (1ULL << 61) 67 + #define PM_SWAP (1ULL << 62) 68 + #define PM_PRESENT (1ULL << 63) 77 69 78 70 /* 79 71 * kernel page flags ··· 92 100 #define KPF_SLOB_FREE 49 93 101 #define KPF_SLUB_FROZEN 50 94 102 #define KPF_SLUB_DEBUG 51 103 + #define KPF_FILE 62 104 + #define KPF_MMAP_EXCLUSIVE 63 95 105 96 106 #define KPF_ALL_BITS ((uint64_t)~0ULL) 97 107 #define KPF_HACKERS_BITS (0xffffULL << 32) ··· 143 149 [KPF_SLOB_FREE] = "P:slob_free", 144 150 [KPF_SLUB_FROZEN] = "A:slub_frozen", 145 151 [KPF_SLUB_DEBUG] = "E:slub_debug", 152 + 153 + [KPF_FILE] = "F:file", 154 + [KPF_MMAP_EXCLUSIVE] = "1:mmap_exclusive", 146 155 }; 147 156 148 157 ··· 449 452 450 453 if (pme & PM_SOFT_DIRTY) 451 454 flags |= BIT(SOFTDIRTY); 455 + if (pme & PM_FILE) 456 + flags |= BIT(FILE); 457 + if (pme & PM_MMAP_EXCLUSIVE) 458 + flags |= BIT(MMAP_EXCLUSIVE); 452 459 453 
460 return flags; 454 461 }
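The rewritten page-types.c macros above track the current `/proc/pid/pagemap` entry layout: bits 0-54 hold the page frame number (for present pages), bit 55 is soft-dirty, bit 56 map-exclusive, bit 61 file-backed/shared-anon, bit 62 swap, bit 63 present. A decoder sketch using those bit positions (restated here from the diff; `pagemap_pfn` and `pagemap_is_present` are illustrative helper names, not part of the tool):

```c
#include <assert.h>
#include <stdint.h>

#define PM_PFRAME_BITS		55
#define PM_PFRAME_MASK		((1ULL << PM_PFRAME_BITS) - 1)
#define PM_SOFT_DIRTY		(1ULL << 55)
#define PM_MMAP_EXCLUSIVE	(1ULL << 56)
#define PM_FILE			(1ULL << 61)
#define PM_SWAP			(1ULL << 62)
#define PM_PRESENT		(1ULL << 63)

/* Extract the page frame number from a pagemap entry. */
static uint64_t pagemap_pfn(uint64_t entry)
{
	return entry & PM_PFRAME_MASK;
}

static int pagemap_is_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}
```

Note the old `PM_STATUS`/`PM_PSHIFT` encoding the patch deletes had the status in the top bits and a page-shift field below it; the flat per-bit layout is why the tool can now report `file` and `mmap_exclusive` directly.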