Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (fixes from Andrew Morton)

Merge patch-bomb from Andrew Morton:
- part of OCFS2 (review is laggy again)
- procfs
- slab
- all of MM
- zram, zbud
- various other random things: arch, filesystems.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
nosave: consolidate __nosave_{begin,end} in <asm/sections.h>
include/linux/screen_info.h: remove unused ORIG_* macros
kernel/sys.c: compat sysinfo syscall: fix undefined behavior
kernel/sys.c: whitespace fixes
acct: eliminate compile warning
kernel/async.c: switch to pr_foo()
include/linux/blkdev.h: use NULL instead of zero
include/linux/kernel.h: deduplicate code implementing clamp* macros
include/linux/kernel.h: rewrite min3, max3 and clamp using min and max
alpha: use Kbuild logic to include <asm-generic/sections.h>
frv: remove deprecated IRQF_DISABLED
frv: remove unused cpuinfo_frv and friends to fix future build error
zbud: avoid accessing last unused freelist
zsmalloc: simplify init_zspage free obj linking
mm/zsmalloc.c: correct comment for fullness group computation
zram: use notify_free to account all free notifications
zram: report maximum used memory
zram: zram memory size limitation
zsmalloc: change return value unit of zs_get_total_size_bytes
zsmalloc: move pages_allocated to zs_pool
...

+4109 -2872
-8
Documentation/ABI/stable/sysfs-devices-node
···
85 85	will be compacted. When it completes, memory will be freed
86 86	into blocks which have as many contiguous pages as possible
87 87
88 -	What:		/sys/devices/system/node/nodeX/scan_unevictable_pages
89 -	Date:		October 2008
90 -	Contact:	Lee Schermerhorn <lee.schermerhorn@hp.com>
91 -	Description:
92 -		When set, it triggers scanning the node's unevictable lists
93 -		and move any pages that have become evictable onto the respective
94 -		zone's inactive list. See mm/vmscan.c
95 -
96 88	What:		/sys/devices/system/node/nodeX/hugepages/hugepages-<size>/
97 89	Date:		December 2009
98 90	Contact:	Lee Schermerhorn <lee.schermerhorn@hp.com>
+27 -5
Documentation/ABI/testing/sysfs-block-zram
···
77 77	Date:		August 2010
78 78	Contact:	Nitin Gupta <ngupta@vflare.org>
79 79	Description:
80 -		The notify_free file is read-only and specifies the number of
81 -		swap slot free notifications received by this device. These
82 -		notifications are sent to a swap block device when a swap slot
83 -		is freed. This statistic is applicable only when this disk is
84 -		being used as a swap disk.
80 +		The notify_free file is read-only. Depending on device usage
81 +		scenario it may account a) the number of pages freed because
82 +		of swap slot free notifications or b) the number of pages freed
83 +		because of REQ_DISCARD requests sent by bio. The former ones
84 +		are sent to a swap block device when a swap slot is freed, which
85 +		implies that this disk is being used as a swap disk. The latter
86 +		ones are sent by filesystem mounted with discard option,
87 +		whenever some data blocks are getting discarded.
85 88
86 89	What:		/sys/block/zram<id>/zero_pages
87 90	Date:		August 2010
···
122 119		efficiency can be calculated using compr_data_size and this
123 120		statistic.
124 121		Unit: bytes
122 +
123 +	What:		/sys/block/zram<id>/mem_used_max
124 +	Date:		August 2014
125 +	Contact:	Minchan Kim <minchan@kernel.org>
126 +	Description:
127 +		The mem_used_max file is read/write and specifies the amount
128 +		of maximum memory zram have consumed to store compressed data.
129 +		For resetting the value, you should write "0". Otherwise,
130 +		you could see -EINVAL.
131 +		Unit: bytes
132 +
133 +	What:		/sys/block/zram<id>/mem_limit
134 +	Date:		August 2014
135 +	Contact:	Minchan Kim <minchan@kernel.org>
136 +	Description:
137 +		The mem_limit file is read/write and specifies the maximum
138 +		amount of memory ZRAM can use to store the compressed data. The
139 +		limit could be changed in run time and "0" means disable the
140 +		limit. No limit is the initial state. Unit: bytes
+8
Documentation/ABI/testing/sysfs-devices-memory
···
61 61		http://www.ibm.com/developerworks/wikis/display/LinuxP/powerpc-utils
62 62
63 63
64 +	What:		/sys/devices/system/memory/memoryX/valid_zones
65 +	Date:		July 2014
66 +	Contact:	Zhang Zhen <zhenzhang.zhang@huawei.com>
67 +	Description:
68 +		The file /sys/devices/system/memory/memoryX/valid_zones is
69 +		read-only and is designed to show which zone this memory
70 +		block can be onlined to.
71 +
64 72	What:		/sys/devices/system/memoryX/nodeY
65 73	Date:		October 2009
66 74	Contact:	Linux Memory Management list <linux-mm@kvack.org>
+21 -4
Documentation/blockdev/zram.txt
···
74 74	since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
75 75	size of the disk when not in use so a huge zram is wasteful.
76 76
77 -	5) Activate:
77 +	5) Set memory limit: Optional
78 +	Set memory limit by writing the value to sysfs node 'mem_limit'.
79 +	The value can be either in bytes or you can use mem suffixes.
80 +	In addition, you could change the value in runtime.
81 +	Examples:
82 +		# limit /dev/zram0 with 50MB memory
83 +		echo $((50*1024*1024)) > /sys/block/zram0/mem_limit
84 +
85 +		# Using mem suffixes
86 +		echo 256K > /sys/block/zram0/mem_limit
87 +		echo 512M > /sys/block/zram0/mem_limit
88 +		echo 1G > /sys/block/zram0/mem_limit
89 +
90 +		# To disable memory limit
91 +		echo 0 > /sys/block/zram0/mem_limit
92 +
93 +	6) Activate:
78 94		mkswap /dev/zram0
79 95		swapon /dev/zram0
80 96
81 97		mkfs.ext4 /dev/zram1
82 98		mount /dev/zram1 /tmp
83 99
84 -	6) Stats:
100 +	7) Stats:
85 101		Per-device statistics are exported as various nodes under
86 102		/sys/block/zram<id>/
87 103		disksize
···
111 95		orig_data_size
112 96		compr_data_size
113 97		mem_used_total
98 +		mem_used_max
114 99
115 -	7) Deactivate:
100 +	8) Deactivate:
116 101		swapoff /dev/zram0
117 102		umount /dev/zram1
118 103
119 -	8) Reset:
104 +	9) Reset:
120 105		Write any positive value to 'reset' sysfs node
121 106		echo 1 > /sys/block/zram0/reset
122 107		echo 1 > /sys/block/zram1/reset
+11 -6
Documentation/kernel-parameters.txt
···
656 656		Sets the size of kernel global memory area for
657 657		contiguous memory allocations and optionally the
658 658		placement constraint by the physical address range of
659 -		memory allocations. For more information, see
659 +		memory allocations. A value of 0 disables CMA
660 +		altogether. For more information, see
660 661		include/linux/dma-contiguous.h
661 662
662 663	cmo_free_hint=	[PPC]  Format: { yes | no }
···
3159 3158
3160 3159	slram=		[HW,MTD]
3161 3160
3161 +	slab_nomerge	[MM]
3162 +		Disable merging of slabs with similar size. May be
3163 +		necessary if there is some reason to distinguish
3164 +		allocs to different slabs. Debug options disable
3165 +		merging on their own.
3166 +		For more information see Documentation/vm/slub.txt.
3167 +
3162 3168	slab_max_order=	[MM, SLAB]
3163 3169		Determines the maximum allowed order for slabs.
3164 3170		A high setting may cause OOMs due to memory
···
3201 3193		For more information see Documentation/vm/slub.txt.
3202 3194
3203 3195	slub_nomerge	[MM, SLUB]
3204 -		Disable merging of slabs with similar size. May be
3205 -		necessary if there is some reason to distinguish
3206 -		allocs to different slabs. Debug options disable
3207 -		merging on their own.
3208 -		For more information see Documentation/vm/slub.txt.
3196 +		Same with slab_nomerge. This is supported for legacy.
3197 +		See slab_nomerge for more information.
3209 3198
3210 3199	smart2=		[HW]
3211 3200		Format: <io1>[,<io2>[,...,<io8>]]
+10 -1
Documentation/memory-hotplug.txt
···
155 155	/sys/devices/system/memory/memoryXXX/phys_device
156 156	/sys/devices/system/memory/memoryXXX/state
157 157	/sys/devices/system/memory/memoryXXX/removable
158 +	/sys/devices/system/memory/memoryXXX/valid_zones
158 159
159 160	'phys_index'  : read-only and contains memory block id, same as XXX.
160 161	'state'       : read-write
···
171 170		block is removable and a value of 0 indicates that
172 171		it is not removable. A memory block is removable only if
173 172		every section in the block is removable.
173 +	'valid_zones' : read-only: designed to show which zones this memory block
174 +		can be onlined to.
175 +		The first column shows it's default zone.
176 +		"memory6/valid_zones: Normal Movable" shows this memoryblock
177 +		can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
178 +		by online_movable.
179 +		"memory7/valid_zones: Movable Normal" shows this memoryblock
180 +		can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
181 +		by online_kernel.
174 182
175 183	NOTE:
176 184	These directories/files appear after physical memory hotplug phase.
···
418 408	- allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
419 409	  sysctl or new control file.
420 410	- showing memory block and physical device relationship.
421 -	- showing memory block is under ZONE_MOVABLE or not
422 411	- test and make it better memory offlining.
423 412	- support HugeTLB page migration and offlining.
424 413	- memmap removing at memory offline.
+1
arch/alpha/include/asm/Kbuild
···
 8  8	generic-y += mcs_spinlock.h
 9  9	generic-y += preempt.h
10 10	generic-y += scatterlist.h
11 +	generic-y += sections.h
11 12	generic-y += trace_clock.h
-7
arch/alpha/include/asm/sections.h
···
1 -	#ifndef _ALPHA_SECTIONS_H
2 -	#define _ALPHA_SECTIONS_H
3 -
4 -	/* nothing to see, move along */
5 -	#include <asm-generic/sections.h>
6 -
7 -	#endif
+6
arch/arm/Kconfig
···
14 14	select CLONE_BACKWARDS
15 15	select CPU_PM if (SUSPEND || CPU_IDLE)
16 16	select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS
17 +	select GENERIC_ALLOCATOR
17 18	select GENERIC_ATOMIC64 if (CPU_V7M || CPU_V6 || !CPU_32v6K || !AEABI)
18 19	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
19 20	select GENERIC_IDLE_POLL_SETUP
···
62 61	select HAVE_PERF_EVENTS
63 62	select HAVE_PERF_REGS
64 63	select HAVE_PERF_USER_STACK_DUMP
64 +	select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
65 65	select HAVE_REGS_AND_STACK_ACCESS_API
66 66	select HAVE_SYSCALL_TRACEPOINTS
67 67	select HAVE_UID16
···
1660 1658
1661 1659	config HAVE_ARCH_PFN_VALID
1662 1660		def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM
1661 +
1662 +	config HAVE_GENERIC_RCU_GUP
1663 +		def_bool y
1664 +		depends on ARM_LPAE
1663 1665
1664 1666	config HIGHMEM
1665 1667		bool "High Memory Support"
+2
arch/arm/include/asm/pgtable-2level.h
···
182 182	#define pmd_addr_end(addr,end) (end)
183 183
184 184	#define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
185 +	#define pte_special(pte)	(0)
186 +	static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
185 187
186 188	/*
187 189	 * We don't have huge page support for short descriptors, for the moment
+15
arch/arm/include/asm/pgtable-3level.h
···
213 213	#define pmd_isclear(pmd, val)	(!(pmd_val(pmd) & (val)))
214 214
215 215	#define pmd_young(pmd)		(pmd_isset((pmd), PMD_SECT_AF))
216 +	#define pte_special(pte)	(pte_isset((pte), L_PTE_SPECIAL))
217 +	static inline pte_t pte_mkspecial(pte_t pte)
218 +	{
219 +		pte_val(pte) |= L_PTE_SPECIAL;
220 +		return pte;
221 +	}
222 +	#define __HAVE_ARCH_PTE_SPECIAL
216 223
217 224	#define __HAVE_ARCH_PMD_WRITE
218 225	#define pmd_write(pmd)		(pmd_isclear((pmd), L_PMD_SECT_RDONLY))
219 226	#define pmd_dirty(pmd)		(pmd_isset((pmd), L_PMD_SECT_DIRTY))
227 +	#define pud_page(pud)		pmd_page(__pmd(pud_val(pud)))
228 +	#define pud_write(pud)		pmd_write(__pmd(pud_val(pud)))
220 229
221 230	#define pmd_hugewillfault(pmd)	(!pmd_young(pmd) || !pmd_write(pmd))
222 231	#define pmd_thp_or_huge(pmd)	(pmd_huge(pmd) || pmd_trans_huge(pmd))
···
233 224	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
234 225	#define pmd_trans_huge(pmd)	(pmd_val(pmd) && !pmd_table(pmd))
235 226	#define pmd_trans_splitting(pmd) (pmd_isset((pmd), L_PMD_SECT_SPLITTING))
227 +
228 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
229 +	#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
230 +	void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
231 +				  pmd_t *pmdp);
232 +	#endif
236 233	#endif
237 234
238 235	#define PMD_BIT_FUNC(fn,op) \
+2 -4
arch/arm/include/asm/pgtable.h
···
226 226	#define pte_dirty(pte)		(pte_isset((pte), L_PTE_DIRTY))
227 227	#define pte_young(pte)		(pte_isset((pte), L_PTE_YOUNG))
228 228	#define pte_exec(pte)		(pte_isclear((pte), L_PTE_XN))
229 -	#define pte_special(pte)	(0)
230 229
231 230	#define pte_valid_user(pte)	\
232 231		(pte_valid(pte) && pte_isset((pte), L_PTE_USER) && pte_young(pte))
···
244 245		unsigned long ext = 0;
245 246
246 247		if (addr < TASK_SIZE && pte_valid_user(pteval)) {
247 -			__sync_icache_dcache(pteval);
248 +			if (!pte_special(pteval))
249 +				__sync_icache_dcache(pteval);
248 250			ext |= PTE_EXT_NG;
249 251		}
···
263 263	PTE_BIT_FUNC(mkyoung,   |= L_PTE_YOUNG);
264 264	PTE_BIT_FUNC(mkexec,   &= ~L_PTE_XN);
265 265	PTE_BIT_FUNC(mknexec,   |= L_PTE_XN);
266 -
267 -	static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
268 266
269 267	static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
270 268	{
+36 -2
arch/arm/include/asm/tlb.h
···
35 35
36 36	#define MMU_GATHER_BUNDLE	8
37 37
38 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
39 +	static inline void __tlb_remove_table(void *_table)
40 +	{
41 +		free_page_and_swap_cache((struct page *)_table);
42 +	}
43 +
44 +	struct mmu_table_batch {
45 +		struct rcu_head		rcu;
46 +		unsigned int		nr;
47 +		void			*tables[0];
48 +	};
49 +
50 +	#define MAX_TABLE_BATCH		\
51 +		((PAGE_SIZE - sizeof(struct mmu_table_batch)) / sizeof(void *))
52 +
53 +	extern void tlb_table_flush(struct mmu_gather *tlb);
54 +	extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
55 +
56 +	#define tlb_remove_entry(tlb, entry)	tlb_remove_table(tlb, entry)
57 +	#else
58 +	#define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
59 +	#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
60 +
38 61	/*
39 62	 * TLB handling.  This allows us to remove pages from the page
40 63	 * tables, and efficiently handle the TLB issues.
41 64	 */
42 65	struct mmu_gather {
43 66		struct mm_struct	*mm;
67 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
68 +		struct mmu_table_batch	*batch;
69 +		unsigned int		need_flush;
70 +	#endif
44 71		unsigned int		fullmm;
45 72		struct vm_area_struct	*vma;
46 73		unsigned long		start, end;
···
128 101	static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
129 102	{
130 103		tlb_flush(tlb);
104 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
105 +		tlb_table_flush(tlb);
106 +	#endif
131 107	}
132 108
133 109	static inline void tlb_flush_mmu_free(struct mmu_gather *tlb)
···
159 129		tlb->pages = tlb->local;
160 130		tlb->nr = 0;
161 131		__tlb_alloc_page(tlb);
132 +
133 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
134 +		tlb->batch = NULL;
135 +	#endif
162 136	}
163 137
164 138	static inline void
···
239 205		tlb_add_flush(tlb, addr + SZ_1M);
240 206	#endif
241 207
242 -		tlb_remove_page(tlb, pte);
208 +		tlb_remove_entry(tlb, pte);
243 209	}
244 210
245 211	static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
···
247 213	{
248 214	#ifdef CONFIG_ARM_LPAE
249 215		tlb_add_flush(tlb, addr);
250 -		tlb_remove_page(tlb, virt_to_page(pmdp));
216 +		tlb_remove_entry(tlb, virt_to_page(pmdp));
251 217	#endif
252 218	}
253 219
+1 -2
arch/arm/kernel/hibernate.c
···
21 21	#include <asm/idmap.h>
22 22	#include <asm/suspend.h>
23 23	#include <asm/memory.h>
24 -
25 -	extern const void __nosave_begin, __nosave_end;
24 +	#include <asm/sections.h>
26 25
27 26	int pfn_is_nosave(unsigned long pfn)
28 27	{
+55 -149
arch/arm/mm/dma-mapping.c
···
12 12	#include <linux/bootmem.h>
13 13	#include <linux/module.h>
14 14	#include <linux/mm.h>
15 +	#include <linux/genalloc.h>
15 16	#include <linux/gfp.h>
16 17	#include <linux/errno.h>
17 18	#include <linux/list.h>
···
299 298	__dma_alloc_remap(struct page *page, size_t size, gfp_t gfp, pgprot_t prot,
300 299		const void *caller)
301 300	{
302 -		struct vm_struct *area;
303 -		unsigned long addr;
304 -
305 301		/*
306 302		 * DMA allocation can be mapped to user space, so lets
307 303		 * set VM_USERMAP flags too.
308 304		 */
309 -		area = get_vm_area_caller(size, VM_ARM_DMA_CONSISTENT | VM_USERMAP,
310 -					  caller);
311 -		if (!area)
312 -			return NULL;
313 -		addr = (unsigned long)area->addr;
314 -		area->phys_addr = __pfn_to_phys(page_to_pfn(page));
315 -
316 -		if (ioremap_page_range(addr, addr + size, area->phys_addr, prot)) {
317 -			vunmap((void *)addr);
318 -			return NULL;
319 -		}
320 -		return (void *)addr;
305 +		return dma_common_contiguous_remap(page, size,
306 +				VM_ARM_DMA_CONSISTENT | VM_USERMAP,
307 +				prot, caller);
321 308	}
322 309
323 310	static void __dma_free_remap(void *cpu_addr, size_t size)
324 311	{
325 -		unsigned int flags = VM_ARM_DMA_CONSISTENT | VM_USERMAP;
326 -		struct vm_struct *area = find_vm_area(cpu_addr);
327 -		if (!area || (area->flags & flags) != flags) {
328 -			WARN(1, "trying to free invalid coherent area: %p\n", cpu_addr);
329 -			return;
330 -		}
331 -		unmap_kernel_range((unsigned long)cpu_addr, size);
332 -		vunmap(cpu_addr);
312 +		dma_common_free_remap(cpu_addr, size,
313 +				VM_ARM_DMA_CONSISTENT | VM_USERMAP);
333 314	}
334 315
335 316	#define DEFAULT_DMA_COHERENT_POOL_SIZE	SZ_256K
317 +	static struct gen_pool *atomic_pool;
336 318
337 -	struct dma_pool {
338 -		size_t size;
339 -		spinlock_t lock;
340 -		unsigned long *bitmap;
341 -		unsigned long nr_pages;
342 -		void *vaddr;
343 -		struct page **pages;
344 -	};
345 -
346 -	static struct dma_pool atomic_pool = {
347 -		.size = DEFAULT_DMA_COHERENT_POOL_SIZE,
348 -	};
319 +	static size_t atomic_pool_size = DEFAULT_DMA_COHERENT_POOL_SIZE;
349 320
350 321	static int __init early_coherent_pool(char *p)
351 322	{
352 -		atomic_pool.size = memparse(p, &p);
323 +		atomic_pool_size = memparse(p, &p);
353 324		return 0;
354 325	}
355 326	early_param("coherent_pool", early_coherent_pool);
···
331 358		/*
332 359		 * Catch any attempt to set the pool size too late.
333 360		 */
334 -		BUG_ON(atomic_pool.vaddr);
361 +		BUG_ON(atomic_pool);
335 362
336 363		/*
337 364		 * Set architecture specific coherent pool size only if
338 365		 * it has not been changed by kernel command line parameter.
339 366		 */
340 -		if (atomic_pool.size == DEFAULT_DMA_COHERENT_POOL_SIZE)
341 -			atomic_pool.size = size;
367 +		if (atomic_pool_size == DEFAULT_DMA_COHERENT_POOL_SIZE)
368 +			atomic_pool_size = size;
342 369	}
343 370
344 371	/*
···
346 373	 */
347 374	static int __init atomic_pool_init(void)
348 375	{
349 -		struct dma_pool *pool = &atomic_pool;
350 376		pgprot_t prot = pgprot_dmacoherent(PAGE_KERNEL);
351 377		gfp_t gfp = GFP_KERNEL | GFP_DMA;
352 -		unsigned long nr_pages = pool->size >> PAGE_SHIFT;
353 -		unsigned long *bitmap;
354 378		struct page *page;
355 -		struct page **pages;
356 379		void *ptr;
357 -		int bitmap_size = BITS_TO_LONGS(nr_pages) * sizeof(long);
358 380
359 -		bitmap = kzalloc(bitmap_size, GFP_KERNEL);
360 -		if (!bitmap)
361 -			goto no_bitmap;
362 -
363 -		pages = kzalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
364 -		if (!pages)
365 -			goto no_pages;
381 +		atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
382 +		if (!atomic_pool)
383 +			goto out;
366 384
367 385		if (dev_get_cma_area(NULL))
368 -			ptr = __alloc_from_contiguous(NULL, pool->size, prot, &page,
369 -						      atomic_pool_init);
386 +			ptr = __alloc_from_contiguous(NULL, atomic_pool_size, prot,
387 +						      &page, atomic_pool_init);
370 388		else
371 -			ptr = __alloc_remap_buffer(NULL, pool->size, gfp, prot, &page,
372 -						   atomic_pool_init);
389 +			ptr = __alloc_remap_buffer(NULL, atomic_pool_size, gfp, prot,
390 +						   &page, atomic_pool_init);
373 391		if (ptr) {
374 -			int i;
392 +			int ret;
375 393
376 -			for (i = 0; i < nr_pages; i++)
377 -				pages[i] = page + i;
394 +			ret = gen_pool_add_virt(atomic_pool, (unsigned long)ptr,
395 +						page_to_phys(page),
396 +						atomic_pool_size, -1);
397 +			if (ret)
398 +				goto destroy_genpool;
378 399
379 -			spin_lock_init(&pool->lock);
380 -			pool->vaddr = ptr;
381 -			pool->pages = pages;
382 -			pool->bitmap = bitmap;
383 -			pool->nr_pages = nr_pages;
384 -			pr_info("DMA: preallocated %u KiB pool for atomic coherent allocations\n",
385 -				(unsigned)pool->size / 1024);
400 +			gen_pool_set_algo(atomic_pool,
401 +					  gen_pool_first_fit_order_align,
402 +					  (void *)PAGE_SHIFT);
403 +			pr_info("DMA: preallocated %zd KiB pool for atomic coherent allocations\n",
404 +				atomic_pool_size / 1024);
386 405			return 0;
387 406		}
388 407
389 -		kfree(pages);
390 -	no_pages:
391 -		kfree(bitmap);
392 -	no_bitmap:
393 -		pr_err("DMA: failed to allocate %u KiB pool for atomic coherent allocation\n",
394 -			(unsigned)pool->size / 1024);
408 +	destroy_genpool:
409 +		gen_pool_destroy(atomic_pool);
410 +		atomic_pool = NULL;
411 +	out:
412 +		pr_err("DMA: failed to allocate %zx KiB pool for atomic coherent allocation\n",
413 +			atomic_pool_size / 1024);
395 414		return -ENOMEM;
396 415	}
397 416	/*
···
487 522
488 523	static void *__alloc_from_pool(size_t size, struct page **ret_page)
489 524	{
490 -		struct dma_pool *pool = &atomic_pool;
491 -		unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
492 -		unsigned int pageno;
493 -		unsigned long flags;
525 +		unsigned long val;
494 526		void *ptr = NULL;
495 -		unsigned long align_mask;
496 527
497 -		if (!pool->vaddr) {
528 +		if (!atomic_pool) {
498 529			WARN(1, "coherent pool not initialised!\n");
499 530			return NULL;
500 531		}
501 532
502 -		/*
503 -		 * Align the region allocation - allocations from pool are rather
504 -		 * small, so align them to their order in pages, minimum is a page
505 -		 * size. This helps reduce fragmentation of the DMA space.
506 -		 */
507 -		align_mask = (1 << get_order(size)) - 1;
533 +		val = gen_pool_alloc(atomic_pool, size);
534 +		if (val) {
535 +			phys_addr_t phys = gen_pool_virt_to_phys(atomic_pool, val);
508 536
509 -		spin_lock_irqsave(&pool->lock, flags);
510 -		pageno = bitmap_find_next_zero_area(pool->bitmap, pool->nr_pages,
511 -						    0, count, align_mask);
512 -		if (pageno < pool->nr_pages) {
513 -			bitmap_set(pool->bitmap, pageno, count);
514 -			ptr = pool->vaddr + PAGE_SIZE * pageno;
515 -			*ret_page = pool->pages[pageno];
516 -		} else {
517 -			pr_err_once("ERROR: %u KiB atomic DMA coherent pool is too small!\n"
518 -				    "Please increase it with coherent_pool= kernel parameter!\n",
519 -				    (unsigned)pool->size / 1024);
537 +			*ret_page = phys_to_page(phys);
538 +			ptr = (void *)val;
520 539		}
521 -		spin_unlock_irqrestore(&pool->lock, flags);
522 540
523 541		return ptr;
524 542	}
525 543
526 544	static bool __in_atomic_pool(void *start, size_t size)
527 545	{
528 -		struct dma_pool *pool = &atomic_pool;
529 -		void *end = start + size;
530 -		void *pool_start = pool->vaddr;
531 -		void *pool_end = pool->vaddr + pool->size;
532 -
533 -		if (start < pool_start || start >= pool_end)
534 -			return false;
535 -
536 -		if (end <= pool_end)
537 -			return true;
538 -
539 -		WARN(1, "Wrong coherent size(%p-%p) from atomic pool(%p-%p)\n",
540 -		     start, end - 1, pool_start, pool_end - 1);
541 -
542 -		return false;
546 +		return addr_in_gen_pool(atomic_pool, (unsigned long)start, size);
543 547	}
544 548
545 549	static int __free_from_pool(void *start, size_t size)
546 550	{
547 -		struct dma_pool *pool = &atomic_pool;
548 -		unsigned long pageno, count;
549 -		unsigned long flags;
550 -
551 551		if (!__in_atomic_pool(start, size))
552 552			return 0;
553 553
554 -		pageno = (start - pool->vaddr) >> PAGE_SHIFT;
555 -		count = size >> PAGE_SHIFT;
556 -
557 -		spin_lock_irqsave(&pool->lock, flags);
558 -		bitmap_clear(pool->bitmap, pageno, count);
559 -		spin_unlock_irqrestore(&pool->lock, flags);
554 +		gen_pool_free(atomic_pool, (unsigned long)start, size);
560 555
561 556		return 1;
562 557	}
···
1196 1271	__iommu_alloc_remap(struct page **pages, size_t size, gfp_t gfp, pgprot_t prot,
1197 1272		const void *caller)
1198 1273	{
1199 -		unsigned int i, nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
1200 -		struct vm_struct *area;
1201 -		unsigned long p;
1202 -
1203 -		area = get_vm_area_caller(size, VM_ARM_DMA_CONSISTENT | VM_USERMAP,
1204 -					  caller);
1205 -		if (!area)
1206 -			return NULL;
1207 -
1208 -		area->pages = pages;
1209 -		area->nr_pages = nr_pages;
1210 -		p = (unsigned long)area->addr;
1211 -
1212 -		for (i = 0; i < nr_pages; i++) {
1213 -			phys_addr_t phys = __pfn_to_phys(page_to_pfn(pages[i]));
1214 -			if (ioremap_page_range(p, p + PAGE_SIZE, phys, prot))
1215 -				goto err;
1216 -			p += PAGE_SIZE;
1217 -		}
1218 -		return area->addr;
1219 -	err:
1220 -		unmap_kernel_range((unsigned long)area->addr, size);
1221 -		vunmap(area->addr);
1274 +		return dma_common_pages_remap(pages, size,
1275 +				VM_ARM_DMA_CONSISTENT | VM_USERMAP, prot, caller);
1222 1276		return NULL;
1223 1277	}
1224 1278
···
1259 1355
1260 1356	static struct page **__atomic_get_pages(void *addr)
1261 1357	{
1262 -		struct dma_pool *pool = &atomic_pool;
1263 -		struct page **pages = pool->pages;
1264 -		int offs = (addr - pool->vaddr) >> PAGE_SHIFT;
1358 +		struct page *page;
1359 +		phys_addr_t phys;
1265 1360
1266 -		return pages + offs;
1361 +		phys = gen_pool_virt_to_phys(atomic_pool, (unsigned long)addr);
1362 +		page = phys_to_page(phys);
1363 +
1364 +		return (struct page **)page;
1267 1365	}
1268 1366
1269 1367	static struct page **__iommu_get_pages(void *cpu_addr, struct dma_attrs *attrs)
···
1407 1501		}
1408 1502
1409 1503		if (!dma_get_attr(DMA_ATTR_NO_KERNEL_MAPPING, attrs)) {
1410 -			unmap_kernel_range((unsigned long)cpu_addr, size);
1411 -			vunmap(cpu_addr);
1504 +			dma_common_free_remap(cpu_addr, size,
1505 +					VM_ARM_DMA_CONSISTENT | VM_USERMAP);
1412 1506		}
1413 1507
1414 1508		__iommu_remove_mapping(dev, handle, size);
+15
arch/arm/mm/flush.c
···
400 400	 */
401 401		__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
402 402	}
403 +
404 +	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
405 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
406 +	void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
407 +				  pmd_t *pmdp)
408 +	{
409 +		pmd_t pmd = pmd_mksplitting(*pmdp);
410 +		VM_BUG_ON(address & ~PMD_MASK);
411 +		set_pmd_at(vma->vm_mm, address, pmdp, pmd);
412 +
413 +		/* dummy IPI to serialise against fast_gup */
414 +		kick_all_cpus_sync();
415 +	}
416 +	#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
417 +	#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+1 -1
arch/arm/mm/init.c
···
322 322	 * reserve memory for DMA contigouos allocations,
323 323	 * must come from DMA area inside low memory
324 324	 */
325 -	dma_contiguous_reserve(min(arm_dma_limit, arm_lowmem_limit));
325 +	dma_contiguous_reserve(arm_dma_limit);
326 326
327 327	arm_memblock_steal_permitted = false;
328 328	memblock_dump_all();
+5
arch/arm64/Kconfig
···
18 18	select COMMON_CLK
19 19	select CPU_PM if (SUSPEND || CPU_IDLE)
20 20	select DCACHE_WORD_ACCESS
21 +	select GENERIC_ALLOCATOR
21 22	select GENERIC_CLOCKEVENTS
22 23	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
23 24	select GENERIC_CPU_AUTOPROBE
···
57 56	select HAVE_PERF_EVENTS
58 57	select HAVE_PERF_REGS
59 58	select HAVE_PERF_USER_STACK_DUMP
59 +	select HAVE_RCU_TABLE_FREE
60 60	select HAVE_SYSCALL_TRACEPOINTS
61 61	select IRQ_DOMAIN
62 62	select MODULES_USE_ELF_RELA
···
109 107		def_bool y
110 108
111 109	config ZONE_DMA
110 +		def_bool y
111 +
112 +	config HAVE_GENERIC_RCU_GUP
112 113		def_bool y
113 114
114 115	config ARCH_DMA_ADDR_T_64BIT
+20 -1
arch/arm64/include/asm/pgtable.h
···
244 244
245 245	#define __HAVE_ARCH_PTE_SPECIAL
246 246
247 +	static inline pte_t pud_pte(pud_t pud)
248 +	{
249 +		return __pte(pud_val(pud));
250 +	}
251 +
252 +	static inline pmd_t pud_pmd(pud_t pud)
253 +	{
254 +		return __pmd(pud_val(pud));
255 +	}
256 +
247 257	static inline pte_t pmd_pte(pmd_t pmd)
248 258	{
249 259		return __pte(pmd_val(pmd));
···
271 261	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
272 262	#define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
273 263	#define pmd_trans_splitting(pmd)	pte_special(pmd_pte(pmd))
274 -	#endif
264 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
265 +	#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
266 +	struct vm_area_struct;
267 +	void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
268 +				  pmd_t *pmdp);
269 +	#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
270 +	#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
275 271
276 272	#define pmd_young(pmd)		pte_young(pmd_pte(pmd))
277 273	#define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
···
298 282	#define mk_pmd(page,prot)	pfn_pmd(page_to_pfn(page),prot)
299 283
300 284	#define pmd_page(pmd)		pfn_to_page(__phys_to_pfn(pmd_val(pmd) & PHYS_MASK))
285 +	#define pud_write(pud)		pte_write(pud_pte(pud))
301 286	#define pud_pfn(pud)		(((pud_val(pud) & PUD_MASK) & PHYS_MASK) >> PAGE_SHIFT)
302 287
303 288	#define set_pmd_at(mm, addr, pmdp, pmd)	set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))
···
399 382	{
400 383		return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(addr);
401 384	}
385 +
386 +	#define pud_page(pud)		pmd_page(pud_pmd(pud))
402 387
403 388	#endif /* CONFIG_ARM64_PGTABLE_LEVELS > 2 */
404 389
+17 -3
arch/arm64/include/asm/tlb.h
···
23 23
24 24	#include <asm-generic/tlb.h>
25 25
26 +	#include <linux/pagemap.h>
27 +	#include <linux/swap.h>
28 +
29 +	#ifdef CONFIG_HAVE_RCU_TABLE_FREE
30 +
31 +	#define tlb_remove_entry(tlb, entry)	tlb_remove_table(tlb, entry)
32 +	static inline void __tlb_remove_table(void *_table)
33 +	{
34 +		free_page_and_swap_cache((struct page *)_table);
35 +	}
36 +	#else
37 +	#define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
38 +	#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
39 +
26 40	/*
27 41	 * There's three ways the TLB shootdown code is used:
28 42	 * 1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
···
102 88	{
103 89		pgtable_page_dtor(pte);
104 90		tlb_add_flush(tlb, addr);
105 -		tlb_remove_page(tlb, pte);
91 +		tlb_remove_entry(tlb, pte);
106 92	}
107 93
108 94	#if CONFIG_ARM64_PGTABLE_LEVELS > 2
···
110 96		unsigned long addr)
111 97	{
112 98		tlb_add_flush(tlb, addr);
113 -		tlb_remove_page(tlb, virt_to_page(pmdp));
99 +		tlb_remove_entry(tlb, virt_to_page(pmdp));
114 100	}
115 101	#endif
···
119 105		unsigned long addr)
120 106	{
121 107		tlb_add_flush(tlb, addr);
122 -		tlb_remove_page(tlb, virt_to_page(pudp));
108 +		tlb_remove_entry(tlb, virt_to_page(pudp));
123 109	}
124 110	#endif
125 111
+145 -19
arch/arm64/mm/dma-mapping.c
···
20 20	#include <linux/gfp.h>
21 21	#include <linux/export.h>
22 22	#include <linux/slab.h>
23 +	#include <linux/genalloc.h>
23 24	#include <linux/dma-mapping.h>
24 25	#include <linux/dma-contiguous.h>
25 26	#include <linux/vmalloc.h>
···
39 38		return prot;
40 39	}
41 40
41 +	static struct gen_pool *atomic_pool;
42 +
43 +	#define DEFAULT_DMA_COHERENT_POOL_SIZE	SZ_256K
44 +	static size_t atomic_pool_size = DEFAULT_DMA_COHERENT_POOL_SIZE;
45 +
46 +	static int __init early_coherent_pool(char *p)
47 +	{
48 +		atomic_pool_size = memparse(p, &p);
49 +		return 0;
50 +	}
51 +	early_param("coherent_pool", early_coherent_pool);
52 +
53 +	static void *__alloc_from_pool(size_t size, struct page **ret_page)
54 +	{
55 +		unsigned long val;
56 +		void *ptr = NULL;
57 +
58 +		if (!atomic_pool) {
59 +			WARN(1, "coherent pool not initialised!\n");
60 +			return NULL;
61 +		}
62 +
63 +		val = gen_pool_alloc(atomic_pool, size);
64 +		if (val) {
65 +			phys_addr_t phys = gen_pool_virt_to_phys(atomic_pool, val);
66 +
67 +			*ret_page = phys_to_page(phys);
68 +			ptr = (void *)val;
69 +		}
70 +
71 +		return ptr;
72 +	}
73 +
74 +	static bool __in_atomic_pool(void *start, size_t size)
75 +	{
76 +		return addr_in_gen_pool(atomic_pool, (unsigned long)start, size);
77 +	}
78 +
79 +	static int __free_from_pool(void *start, size_t size)
80 +	{
81 +		if (!__in_atomic_pool(start, size))
82 +			return 0;
83 +
84 +		gen_pool_free(atomic_pool, (unsigned long)start, size);
85 +
86 +		return 1;
87 +	}
88 +
42 89	static void *__dma_alloc_coherent(struct device *dev, size_t size,
43 90		dma_addr_t *dma_handle, gfp_t flags,
44 91		struct dma_attrs *attrs)
···
99 50		if (IS_ENABLED(CONFIG_ZONE_DMA) &&
100 51		    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
101 52			flags |= GFP_DMA;
102 -		if (IS_ENABLED(CONFIG_DMA_CMA)) {
53 +		if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
103 54			struct page *page;
104 55
105 56			size = PAGE_ALIGN(size);
···
119 70		void *vaddr, dma_addr_t dma_handle,
120 71		struct dma_attrs *attrs)
121 72	{
73 +		bool freed;
74 +		phys_addr_t paddr = dma_to_phys(dev, dma_handle);
75 +
122 76		if (dev == NULL) {
123 77			WARN_ONCE(1, "Use an actual device structure for DMA allocation\n");
124 78			return;
125 79		}
126 80
127 -		if (IS_ENABLED(CONFIG_DMA_CMA)) {
128 -			phys_addr_t paddr = dma_to_phys(dev, dma_handle);
129 -
130 -			dma_release_from_contiguous(dev,
81 +		freed = dma_release_from_contiguous(dev,
131 82					phys_to_page(paddr),
132 83					size >> PAGE_SHIFT);
133 -		} else {
84 +		if (!freed)
134 85			swiotlb_free_coherent(dev, size, vaddr, dma_handle);
135 -		}
136 86	}
137 87
138 88	static void *__dma_alloc_noncoherent(struct device *dev, size_t size,
139 89		dma_addr_t *dma_handle, gfp_t flags,
140 90		struct dma_attrs *attrs)
141 91	{
142 -		struct page *page, **map;
92 +		struct page *page;
143 93		void *ptr, *coherent_ptr;
144 -		int order, i;
145 94
146 95		size = PAGE_ALIGN(size);
147 -		order = get_order(size);
96 +
97 +		if (!(flags & __GFP_WAIT)) {
98 +			struct page *page = NULL;
99 +			void *addr = __alloc_from_pool(size, &page);
100 +
101 +			if (addr)
102 +				*dma_handle = phys_to_dma(dev, page_to_phys(page));
103 +
104 +			return addr;
105 +
106 +		}
148 107
149 108		ptr = __dma_alloc_coherent(dev, size, dma_handle, flags, attrs);
150 109		if (!ptr)
151 110			goto no_mem;
152 -		map = kmalloc(sizeof(struct page *) << order, flags & ~GFP_DMA);
153 -		if (!map)
154 -			goto no_map;
155 111
156 112		/* remove any dirty cache lines on the kernel alias */
157 113		__dma_flush_range(ptr, ptr + size);
158 114
159 115		/* create a coherent mapping */
160 116		page = virt_to_page(ptr);
161 -		for (i = 0; i < (size >> PAGE_SHIFT); i++)
162 -			map[i] = page + i;
163 -		coherent_ptr = vmap(map, size >> PAGE_SHIFT, VM_MAP,
164 -			__get_dma_pgprot(attrs, __pgprot(PROT_NORMAL_NC), false));
165 -		kfree(map);
117 +		coherent_ptr = dma_common_contiguous_remap(page, size, VM_USERMAP,
118 +				__get_dma_pgprot(attrs,
119 +					__pgprot(PROT_NORMAL_NC), false),
120 +				NULL);
166 121		if (!coherent_ptr)
167 122			goto no_map;
168 123
···
185 132	{
186 133		void *swiotlb_addr = phys_to_virt(dma_to_phys(dev, dma_handle));
187 134
135 +		if (__free_from_pool(vaddr, size))
136 +			return;
188 137		vunmap(vaddr);
189 138		__dma_free_coherent(dev, size, swiotlb_addr, dma_handle, attrs);
190 139	}
···
362 307
363 308	extern int swiotlb_late_init_with_default_size(size_t default_size);
364 309
310 +	static int __init atomic_pool_init(void)
311 +	{
312 +		pgprot_t prot = __pgprot(PROT_NORMAL_NC);
313 +		unsigned long nr_pages = atomic_pool_size >> PAGE_SHIFT;
314 +		struct page *page;
315 +		void *addr;
316 +		unsigned int pool_size_order = get_order(atomic_pool_size);
317 +
318 +		if (dev_get_cma_area(NULL))
319 +			page = dma_alloc_from_contiguous(NULL, nr_pages,
320 +							 pool_size_order);
321 +		else
322 +			page = alloc_pages(GFP_DMA, pool_size_order);
323 +
324 +		if (page) {
325 +			int ret;
326 +			void *page_addr = page_address(page);
327 +
328 +			memset(page_addr, 0, atomic_pool_size);
329 +			__dma_flush_range(page_addr, page_addr + atomic_pool_size);
330 +
331 +			atomic_pool = gen_pool_create(PAGE_SHIFT, -1);
332 +			if (!atomic_pool)
333 +				goto free_page;
334 +
335 +			addr = dma_common_contiguous_remap(page, atomic_pool_size,
336 +						VM_USERMAP, prot, atomic_pool_init);
337 +
338 +			if (!addr)
339 +				goto destroy_genpool;
340 +
341 +			ret = gen_pool_add_virt(atomic_pool, (unsigned long)addr,
342 +						page_to_phys(page),
343 +						atomic_pool_size, -1);
344 +			if (ret)
345 +				goto remove_mapping;
346 +
347 +			gen_pool_set_algo(atomic_pool,
348 +					  gen_pool_first_fit_order_align,
349 +					  (void *)PAGE_SHIFT);
350 +
351 +			pr_info("DMA: preallocated %zu KiB pool for atomic allocations\n",
352 +				atomic_pool_size / 1024);
353 +			return 0;
354 +		}
355 +		goto out;
356 +
357 +	remove_mapping:
358 +		dma_common_free_remap(addr, atomic_pool_size, VM_USERMAP);
359 +	destroy_genpool:
360 +		gen_pool_destroy(atomic_pool);
361 +		atomic_pool = NULL;
362 +	free_page:
363 +		if (!dma_release_from_contiguous(NULL, page, nr_pages))
364 +			__free_pages(page,
pool_size_order); 365 + out: 366 + pr_err("DMA: failed to allocate %zu KiB pool for atomic coherent allocation\n", 367 + atomic_pool_size / 1024); 368 + return -ENOMEM; 369 + } 370 + 365 371 static int __init swiotlb_late_init(void) 366 372 { 367 373 size_t swiotlb_size = min(SZ_64M, MAX_ORDER_NR_PAGES << PAGE_SHIFT); ··· 431 315 432 316 return swiotlb_late_init_with_default_size(swiotlb_size); 433 317 } 434 - arch_initcall(swiotlb_late_init); 318 + 319 + static int __init arm64_dma_init(void) 320 + { 321 + int ret = 0; 322 + 323 + ret |= swiotlb_late_init(); 324 + ret |= atomic_pool_init(); 325 + 326 + return ret; 327 + } 328 + arch_initcall(arm64_dma_init); 435 329 436 330 #define PREALLOC_DMA_DEBUG_ENTRIES 4096 437 331
+16
arch/arm64/mm/flush.c
··· 104 104 */ 105 105 EXPORT_SYMBOL(flush_cache_all); 106 106 EXPORT_SYMBOL(flush_icache_range); 107 + 108 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 109 + #ifdef CONFIG_HAVE_RCU_TABLE_FREE 110 + void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address, 111 + pmd_t *pmdp) 112 + { 113 + pmd_t pmd = pmd_mksplitting(*pmdp); 114 + 115 + VM_BUG_ON(address & ~PMD_MASK); 116 + set_pmd_at(vma->vm_mm, address, pmdp, pmd); 117 + 118 + /* dummy IPI to serialise against fast_gup */ 119 + kick_all_cpus_sync(); 120 + } 121 + #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ 122 + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+1
arch/cris/include/asm/Kbuild
··· 15 15 generic-y += module.h 16 16 generic-y += preempt.h 17 17 generic-y += scatterlist.h 18 + generic-y += sections.h 18 19 generic-y += trace_clock.h 19 20 generic-y += vga.h 20 21 generic-y += xor.h
-7
arch/cris/include/asm/sections.h
··· 1 - #ifndef _CRIS_SECTIONS_H 2 - #define _CRIS_SECTIONS_H 3 - 4 - /* nothing to see, move along */ 5 - #include <asm-generic/sections.h> 6 - 7 - #endif
-16
arch/frv/include/asm/processor.h
··· 35 35 struct task_struct; 36 36 37 37 /* 38 - * CPU type and hardware bug flags. Kept separately for each CPU. 39 - */ 40 - struct cpuinfo_frv { 41 - #ifdef CONFIG_MMU 42 - unsigned long *pgd_quick; 43 - unsigned long *pte_quick; 44 - unsigned long pgtable_cache_sz; 45 - #endif 46 - } __cacheline_aligned; 47 - 48 - extern struct cpuinfo_frv __nongprelbss boot_cpu_data; 49 - 50 - #define cpu_data (&boot_cpu_data) 51 - #define current_cpu_data boot_cpu_data 52 - 53 - /* 54 38 * Bus types 55 39 */ 56 40 #define EISA_bus 0
+4 -4
arch/frv/kernel/irq-mb93091.c
··· 107 107 static struct irqaction fpga_irq[4] = { 108 108 [0] = { 109 109 .handler = fpga_interrupt, 110 - .flags = IRQF_DISABLED | IRQF_SHARED, 110 + .flags = IRQF_SHARED, 111 111 .name = "fpga.0", 112 112 .dev_id = (void *) 0x0028UL, 113 113 }, 114 114 [1] = { 115 115 .handler = fpga_interrupt, 116 - .flags = IRQF_DISABLED | IRQF_SHARED, 116 + .flags = IRQF_SHARED, 117 117 .name = "fpga.1", 118 118 .dev_id = (void *) 0x0050UL, 119 119 }, 120 120 [2] = { 121 121 .handler = fpga_interrupt, 122 - .flags = IRQF_DISABLED | IRQF_SHARED, 122 + .flags = IRQF_SHARED, 123 123 .name = "fpga.2", 124 124 .dev_id = (void *) 0x1c00UL, 125 125 }, 126 126 [3] = { 127 127 .handler = fpga_interrupt, 128 - .flags = IRQF_DISABLED | IRQF_SHARED, 128 + .flags = IRQF_SHARED, 129 129 .name = "fpga.3", 130 130 .dev_id = (void *) 0x6386UL, 131 131 }
-1
arch/frv/kernel/irq-mb93093.c
··· 105 105 static struct irqaction fpga_irq[1] = { 106 106 [0] = { 107 107 .handler = fpga_interrupt, 108 - .flags = IRQF_DISABLED, 109 108 .name = "fpga.0", 110 109 .dev_id = (void *) 0x0700UL, 111 110 }
+2 -2
arch/frv/kernel/irq-mb93493.c
··· 118 118 static struct irqaction mb93493_irq[2] = { 119 119 [0] = { 120 120 .handler = mb93493_interrupt, 121 - .flags = IRQF_DISABLED | IRQF_SHARED, 121 + .flags = IRQF_SHARED, 122 122 .name = "mb93493.0", 123 123 .dev_id = (void *) __addr_MB93493_IQSR(0), 124 124 }, 125 125 [1] = { 126 126 .handler = mb93493_interrupt, 127 - .flags = IRQF_DISABLED | IRQF_SHARED, 127 + .flags = IRQF_SHARED, 128 128 .name = "mb93493.1", 129 129 .dev_id = (void *) __addr_MB93493_IQSR(1), 130 130 }
-2
arch/frv/kernel/setup.c
··· 104 104 unsigned long __initdata __sdram_old_base; 105 105 unsigned long __initdata num_mappedpages; 106 106 107 - struct cpuinfo_frv __nongprelbss boot_cpu_data; 108 - 109 107 char __initdata command_line[COMMAND_LINE_SIZE]; 110 108 char __initdata redboot_command_line[COMMAND_LINE_SIZE]; 111 109
-1
arch/frv/kernel/time.c
··· 44 44 45 45 static struct irqaction timer_irq = { 46 46 .handler = timer_interrupt, 47 - .flags = IRQF_DISABLED, 48 47 .name = "timer", 49 48 }; 50 49
+1
arch/m32r/include/asm/Kbuild
··· 8 8 generic-y += module.h 9 9 generic-y += preempt.h 10 10 generic-y += scatterlist.h 11 + generic-y += sections.h 11 12 generic-y += trace_clock.h
-7
arch/m32r/include/asm/sections.h
··· 1 - #ifndef _M32R_SECTIONS_H 2 - #define _M32R_SECTIONS_H 3 - 4 - /* nothing to see, move along */ 5 - #include <asm-generic/sections.h> 6 - 7 - #endif /* _M32R_SECTIONS_H */
-1
arch/m32r/kernel/time.c
··· 134 134 135 135 static struct irqaction irq0 = { 136 136 .handler = timer_interrupt, 137 - .flags = IRQF_DISABLED, 138 137 .name = "MFT2", 139 138 }; 140 139
+13 -8
arch/m68k/kernel/sys_m68k.c
··· 376 376 asmlinkage int 377 377 sys_cacheflush (unsigned long addr, int scope, int cache, unsigned long len) 378 378 { 379 - struct vm_area_struct *vma; 380 379 int ret = -EINVAL; 381 380 382 381 if (scope < FLUSH_SCOPE_LINE || scope > FLUSH_SCOPE_ALL || ··· 388 389 if (!capable(CAP_SYS_ADMIN)) 389 390 goto out; 390 391 } else { 392 + struct vm_area_struct *vma; 393 + 394 + /* Check for overflow. */ 395 + if (addr + len < addr) 396 + goto out; 397 + 391 398 /* 392 399 * Verify that the specified address region actually belongs 393 400 * to this process. 394 401 */ 395 - vma = find_vma (current->mm, addr); 396 402 ret = -EINVAL; 397 - /* Check for overflow. */ 398 - if (addr + len < addr) 399 - goto out; 400 - if (vma == NULL || addr < vma->vm_start || addr + len > vma->vm_end) 401 - goto out; 403 + down_read(&current->mm->mmap_sem); 404 + vma = find_vma(current->mm, addr); 405 + if (!vma || addr < vma->vm_start || addr + len > vma->vm_end) 406 + goto out_unlock; 402 407 } 403 408 404 409 if (CPU_IS_020_OR_030) { ··· 432 429 __asm__ __volatile__ ("movec %0, %%cacr" : : "r" (cacr)); 433 430 } 434 431 ret = 0; 435 - goto out; 432 + goto out_unlock; 436 433 } else { 437 434 /* 438 435 * 040 or 060: don't blindly trust 'scope', someone could ··· 449 446 ret = cache_flush_060 (addr, scope, cache, len); 450 447 } 451 448 } 449 + out_unlock: 450 + up_read(&current->mm->mmap_sem); 452 451 out: 453 452 return ret; 454 453 }
-7
arch/mips/include/asm/suspend.h
··· 1 - #ifndef __ASM_SUSPEND_H 2 - #define __ASM_SUSPEND_H 3 - 4 - /* References to section boundaries */ 5 - extern const void __nosave_begin, __nosave_end; 6 - 7 - #endif /* __ASM_SUSPEND_H */
+1 -1
arch/mips/power/cpu.c
··· 7 7 * Author: Hu Hongbing <huhb@lemote.com> 8 8 * Wu Zhangjin <wuzhangjin@gmail.com> 9 9 */ 10 - #include <asm/suspend.h> 10 + #include <asm/sections.h> 11 11 #include <asm/fpu.h> 12 12 #include <asm/dsp.h> 13 13
+1
arch/mn10300/include/asm/Kbuild
··· 8 8 generic-y += mcs_spinlock.h 9 9 generic-y += preempt.h 10 10 generic-y += scatterlist.h 11 + generic-y += sections.h 11 12 generic-y += trace_clock.h
-1
arch/mn10300/include/asm/sections.h
··· 1 - #include <asm-generic/sections.h>
+12 -45
arch/powerpc/include/asm/pgtable.h
··· 38 38 static inline pgprot_t pte_pgprot(pte_t pte) { return __pgprot(pte_val(pte) & PAGE_PROT_BITS); } 39 39 40 40 #ifdef CONFIG_NUMA_BALANCING 41 - 42 41 static inline int pte_present(pte_t pte) 43 42 { 44 - return pte_val(pte) & (_PAGE_PRESENT | _PAGE_NUMA); 43 + return pte_val(pte) & _PAGE_NUMA_MASK; 45 44 } 46 45 47 46 #define pte_present_nonuma pte_present_nonuma 48 47 static inline int pte_present_nonuma(pte_t pte) 49 48 { 50 49 return pte_val(pte) & (_PAGE_PRESENT); 51 - } 52 - 53 - #define pte_numa pte_numa 54 - static inline int pte_numa(pte_t pte) 55 - { 56 - return (pte_val(pte) & 57 - (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA; 58 - } 59 - 60 - #define pte_mknonnuma pte_mknonnuma 61 - static inline pte_t pte_mknonnuma(pte_t pte) 62 - { 63 - pte_val(pte) &= ~_PAGE_NUMA; 64 - pte_val(pte) |= _PAGE_PRESENT | _PAGE_ACCESSED; 65 - return pte; 66 - } 67 - 68 - #define pte_mknuma pte_mknuma 69 - static inline pte_t pte_mknuma(pte_t pte) 70 - { 71 - /* 72 - * We should not set _PAGE_NUMA on non present ptes. Also clear the 73 - * present bit so that hash_page will return 1 and we collect this 74 - * as numa fault. 75 - */ 76 - if (pte_present(pte)) { 77 - pte_val(pte) |= _PAGE_NUMA; 78 - pte_val(pte) &= ~_PAGE_PRESENT; 79 - } else 80 - VM_BUG_ON(1); 81 - return pte; 82 50 } 83 51 84 52 #define ptep_set_numa ptep_set_numa ··· 60 92 return; 61 93 } 62 94 63 - #define pmd_numa pmd_numa 64 - static inline int pmd_numa(pmd_t pmd) 65 - { 66 - return pte_numa(pmd_pte(pmd)); 67 - } 68 - 69 95 #define pmdp_set_numa pmdp_set_numa 70 96 static inline void pmdp_set_numa(struct mm_struct *mm, unsigned long addr, 71 97 pmd_t *pmdp) ··· 71 109 return; 72 110 } 73 111 74 - #define pmd_mknonnuma pmd_mknonnuma 75 - static inline pmd_t pmd_mknonnuma(pmd_t pmd) 112 + /* 113 + * Generic NUMA pte helpers expect pteval_t and pmdval_t types to exist 114 + * which was inherited from x86. 
For the purposes of powerpc pte_basic_t and 115 + * pmd_t are equivalent 116 + */ 117 + #define pteval_t pte_basic_t 118 + #define pmdval_t pmd_t 119 + static inline pteval_t ptenuma_flags(pte_t pte) 76 120 { 77 - return pte_pmd(pte_mknonnuma(pmd_pte(pmd))); 121 + return pte_val(pte) & _PAGE_NUMA_MASK; 78 122 } 79 123 80 - #define pmd_mknuma pmd_mknuma 81 - static inline pmd_t pmd_mknuma(pmd_t pmd) 124 + static inline pmdval_t pmdnuma_flags(pmd_t pmd) 82 125 { 83 - return pte_pmd(pte_mknuma(pmd_pte(pmd))); 126 + return pmd_val(pmd) & _PAGE_NUMA_MASK; 84 127 } 85 128 86 129 # else
+5
arch/powerpc/include/asm/pte-common.h
··· 98 98 _PAGE_USER | _PAGE_ACCESSED | \ 99 99 _PAGE_RW | _PAGE_HWWRITE | _PAGE_DIRTY | _PAGE_EXEC) 100 100 101 + #ifdef CONFIG_NUMA_BALANCING 102 + /* Mask of bits that distinguish present and numa ptes */ 103 + #define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PRESENT) 104 + #endif 105 + 101 106 /* 102 107 * We define 2 sets of base prot bits, one for basic pages (ie, 103 108 * cacheable kernel and user pages) and one for non cacheable
+1 -3
arch/powerpc/kernel/suspend.c
··· 9 9 10 10 #include <linux/mm.h> 11 11 #include <asm/page.h> 12 - 13 - /* References to section boundaries */ 14 - extern const void __nosave_begin, __nosave_end; 12 + #include <asm/sections.h> 15 13 16 14 /* 17 15 * pfn_is_nosave - check if given pfn is in the 'nosave' section
+1 -5
arch/s390/kernel/suspend.c
··· 13 13 #include <asm/ipl.h> 14 14 #include <asm/cio.h> 15 15 #include <asm/pci.h> 16 + #include <asm/sections.h> 16 17 #include "entry.h" 17 - 18 - /* 19 - * References to section boundaries 20 - */ 21 - extern const void __nosave_begin, __nosave_end; 22 18 23 19 /* 24 20 * The restore of the saved pages in an hibernation image will set
+1
arch/score/include/asm/Kbuild
··· 10 10 generic-y += mcs_spinlock.h 11 11 generic-y += preempt.h 12 12 generic-y += scatterlist.h 13 + generic-y += sections.h 13 14 generic-y += trace_clock.h 14 15 generic-y += xor.h 15 16 generic-y += serial.h
-6
arch/score/include/asm/sections.h
··· 1 - #ifndef _ASM_SCORE_SECTIONS_H 2 - #define _ASM_SCORE_SECTIONS_H 3 - 4 - #include <asm-generic/sections.h> 5 - 6 - #endif /* _ASM_SCORE_SECTIONS_H */
-1
arch/sh/include/asm/sections.h
··· 3 3 4 4 #include <asm-generic/sections.h> 5 5 6 - extern long __nosave_begin, __nosave_end; 7 6 extern long __machvec_start, __machvec_end; 8 7 extern char __uncached_start, __uncached_end; 9 8 extern char __start_eh_frame[], __stop_eh_frame[];
+1 -3
arch/sparc/power/hibernate.c
··· 9 9 #include <asm/hibernate.h> 10 10 #include <asm/visasm.h> 11 11 #include <asm/page.h> 12 + #include <asm/sections.h> 12 13 #include <asm/tlb.h> 13 - 14 - /* References to section boundaries */ 15 - extern const void __nosave_begin, __nosave_end; 16 14 17 15 struct saved_context saved_context; 18 16
-3
arch/unicore32/include/mach/pm.h
··· 36 36 /* Defined in hibernate_asm.S */ 37 37 extern int restore_image(pgd_t *resume_pg_dir, struct pbe *restore_pblist); 38 38 39 - /* References to section boundaries */ 40 - extern const void __nosave_begin, __nosave_end; 41 - 42 39 extern struct pbe *restore_pblist; 43 40 #endif
+1
arch/unicore32/kernel/hibernate.c
··· 18 18 #include <asm/page.h> 19 19 #include <asm/pgtable.h> 20 20 #include <asm/pgalloc.h> 21 + #include <asm/sections.h> 21 22 #include <asm/suspend.h> 22 23 23 24 #include "mach/pm.h"
-1
arch/x86/Kconfig
··· 30 30 select HAVE_UNSTABLE_SCHED_CLOCK 31 31 select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 32 32 select ARCH_SUPPORTS_INT128 if X86_64 33 - select ARCH_WANTS_PROT_NUMA_PROT_NONE 34 33 select HAVE_IDE 35 34 select HAVE_OPROFILE 36 35 select HAVE_PCSPKR_PLATFORM
+14
arch/x86/include/asm/pgtable_types.h
··· 325 325 return native_pte_val(pte) & PTE_FLAGS_MASK; 326 326 } 327 327 328 + #ifdef CONFIG_NUMA_BALANCING 329 + /* Set of bits that distinguishes present, prot_none and numa ptes */ 330 + #define _PAGE_NUMA_MASK (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT) 331 + static inline pteval_t ptenuma_flags(pte_t pte) 332 + { 333 + return pte_flags(pte) & _PAGE_NUMA_MASK; 334 + } 335 + 336 + static inline pmdval_t pmdnuma_flags(pmd_t pmd) 337 + { 338 + return pmd_flags(pmd) & _PAGE_NUMA_MASK; 339 + } 340 + #endif /* CONFIG_NUMA_BALANCING */ 341 + 328 342 #define pgprot_val(x) ((x).pgprot) 329 343 #define __pgprot(x) ((pgprot_t) { (x) } ) 330 344
+1 -3
arch/x86/power/hibernate_32.c
··· 13 13 #include <asm/page.h> 14 14 #include <asm/pgtable.h> 15 15 #include <asm/mmzone.h> 16 + #include <asm/sections.h> 16 17 17 18 /* Defined in hibernate_asm_32.S */ 18 19 extern int restore_image(void); 19 - 20 - /* References to section boundaries */ 21 - extern const void __nosave_begin, __nosave_end; 22 20 23 21 /* Pointer to the temporary resume page tables */ 24 22 pgd_t *resume_pg_dir;
+1 -3
arch/x86/power/hibernate_64.c
··· 17 17 #include <asm/page.h> 18 18 #include <asm/pgtable.h> 19 19 #include <asm/mtrr.h> 20 + #include <asm/sections.h> 20 21 #include <asm/suspend.h> 21 - 22 - /* References to section boundaries */ 23 - extern __visible const void __nosave_begin, __nosave_end; 24 22 25 23 /* Defined in hibernate_asm_64.S */ 26 24 extern asmlinkage __visible int restore_image(void);
+3
drivers/base/Kconfig
··· 252 252 to allocate big physically-contiguous blocks of memory for use with 253 253 hardware components that do not support I/O map nor scatter-gather. 254 254 255 + You can disable CMA by specifying "cma=0" on the kernel's command 256 + line. 257 + 255 258 For more information see <include/linux/dma-contiguous.h>. 256 259 If unsure, say "n". 257 260
+72
drivers/base/dma-mapping.c
··· 10 10 #include <linux/dma-mapping.h> 11 11 #include <linux/export.h> 12 12 #include <linux/gfp.h> 13 + #include <linux/slab.h> 14 + #include <linux/vmalloc.h> 13 15 #include <asm-generic/dma-coherent.h> 14 16 15 17 /* ··· 269 267 return ret; 270 268 } 271 269 EXPORT_SYMBOL(dma_common_mmap); 270 + 271 + #ifdef CONFIG_MMU 272 + /* 273 + * remaps an array of PAGE_SIZE pages into another vm_area 274 + * Cannot be used in non-sleeping contexts 275 + */ 276 + void *dma_common_pages_remap(struct page **pages, size_t size, 277 + unsigned long vm_flags, pgprot_t prot, 278 + const void *caller) 279 + { 280 + struct vm_struct *area; 281 + 282 + area = get_vm_area_caller(size, vm_flags, caller); 283 + if (!area) 284 + return NULL; 285 + 286 + area->pages = pages; 287 + 288 + if (map_vm_area(area, prot, pages)) { 289 + vunmap(area->addr); 290 + return NULL; 291 + } 292 + 293 + return area->addr; 294 + } 295 + 296 + /* 297 + * remaps an allocated contiguous region into another vm_area. 298 + * Cannot be used in non-sleeping contexts 299 + */ 300 + 301 + void *dma_common_contiguous_remap(struct page *page, size_t size, 302 + unsigned long vm_flags, 303 + pgprot_t prot, const void *caller) 304 + { 305 + int i; 306 + struct page **pages; 307 + void *ptr; 308 + unsigned long pfn; 309 + 310 + pages = kmalloc(sizeof(struct page *) << get_order(size), GFP_KERNEL); 311 + if (!pages) 312 + return NULL; 313 + 314 + for (i = 0, pfn = page_to_pfn(page); i < (size >> PAGE_SHIFT); i++) 315 + pages[i] = pfn_to_page(pfn + i); 316 + 317 + ptr = dma_common_pages_remap(pages, size, vm_flags, prot, caller); 318 + 319 + kfree(pages); 320 + 321 + return ptr; 322 + } 323 + 324 + /* 325 + * unmaps a range previously mapped by dma_common_*_remap 326 + */ 327 + void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags) 328 + { 329 + struct vm_struct *area = find_vm_area(cpu_addr); 330 + 331 + if (!area || (area->flags & vm_flags) != vm_flags) { 332 + WARN(1, "trying to free 
invalid coherent area: %p\n", cpu_addr); 333 + return; 334 + } 335 + 336 + unmap_kernel_range((unsigned long)cpu_addr, size); 337 + vunmap(cpu_addr); 338 + } 339 + #endif
+42
drivers/base/memory.c
··· 373 373 return sprintf(buf, "%d\n", mem->phys_device); 374 374 } 375 375 376 + #ifdef CONFIG_MEMORY_HOTREMOVE 377 + static ssize_t show_valid_zones(struct device *dev, 378 + struct device_attribute *attr, char *buf) 379 + { 380 + struct memory_block *mem = to_memory_block(dev); 381 + unsigned long start_pfn, end_pfn; 382 + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; 383 + struct page *first_page; 384 + struct zone *zone; 385 + 386 + start_pfn = section_nr_to_pfn(mem->start_section_nr); 387 + end_pfn = start_pfn + nr_pages; 388 + first_page = pfn_to_page(start_pfn); 389 + 390 + /* The block contains more than one zone can not be offlined. */ 391 + if (!test_pages_in_a_zone(start_pfn, end_pfn)) 392 + return sprintf(buf, "none\n"); 393 + 394 + zone = page_zone(first_page); 395 + 396 + if (zone_idx(zone) == ZONE_MOVABLE - 1) { 397 + /*The mem block is the last memoryblock of this zone.*/ 398 + if (end_pfn == zone_end_pfn(zone)) 399 + return sprintf(buf, "%s %s\n", 400 + zone->name, (zone + 1)->name); 401 + } 402 + 403 + if (zone_idx(zone) == ZONE_MOVABLE) { 404 + /*The mem block is the first memoryblock of ZONE_MOVABLE.*/ 405 + if (start_pfn == zone->zone_start_pfn) 406 + return sprintf(buf, "%s %s\n", 407 + zone->name, (zone - 1)->name); 408 + } 409 + 410 + return sprintf(buf, "%s\n", zone->name); 411 + } 412 + static DEVICE_ATTR(valid_zones, 0444, show_valid_zones, NULL); 413 + #endif 414 + 376 415 static DEVICE_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL); 377 416 static DEVICE_ATTR(state, 0644, show_mem_state, store_mem_state); 378 417 static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL); ··· 562 523 &dev_attr_state.attr, 563 524 &dev_attr_phys_device.attr, 564 525 &dev_attr_removable.attr, 526 + #ifdef CONFIG_MEMORY_HOTREMOVE 527 + &dev_attr_valid_zones.attr, 528 + #endif 565 529 NULL 566 530 }; 567 531
-3
drivers/base/node.c
··· 289 289 device_create_file(&node->dev, &dev_attr_distance); 290 290 device_create_file(&node->dev, &dev_attr_vmstat); 291 291 292 - scan_unevictable_register_node(node); 293 - 294 292 hugetlb_register_node(node); 295 293 296 294 compaction_register_node(node); ··· 312 314 device_remove_file(&node->dev, &dev_attr_distance); 313 315 device_remove_file(&node->dev, &dev_attr_vmstat); 314 316 315 - scan_unevictable_unregister_node(node); 316 317 hugetlb_unregister_node(node); /* no-op, if memoryless node */ 317 318 318 319 device_unregister(&node->dev);
+104 -2
drivers/block/zram/zram_drv.c
··· 103 103 104 104 down_read(&zram->init_lock); 105 105 if (init_done(zram)) 106 - val = zs_get_total_size_bytes(meta->mem_pool); 106 + val = zs_get_total_pages(meta->mem_pool); 107 107 up_read(&zram->init_lock); 108 108 109 - return scnprintf(buf, PAGE_SIZE, "%llu\n", val); 109 + return scnprintf(buf, PAGE_SIZE, "%llu\n", val << PAGE_SHIFT); 110 110 } 111 111 112 112 static ssize_t max_comp_streams_show(struct device *dev, ··· 120 120 up_read(&zram->init_lock); 121 121 122 122 return scnprintf(buf, PAGE_SIZE, "%d\n", val); 123 + } 124 + 125 + static ssize_t mem_limit_show(struct device *dev, 126 + struct device_attribute *attr, char *buf) 127 + { 128 + u64 val; 129 + struct zram *zram = dev_to_zram(dev); 130 + 131 + down_read(&zram->init_lock); 132 + val = zram->limit_pages; 133 + up_read(&zram->init_lock); 134 + 135 + return scnprintf(buf, PAGE_SIZE, "%llu\n", val << PAGE_SHIFT); 136 + } 137 + 138 + static ssize_t mem_limit_store(struct device *dev, 139 + struct device_attribute *attr, const char *buf, size_t len) 140 + { 141 + u64 limit; 142 + char *tmp; 143 + struct zram *zram = dev_to_zram(dev); 144 + 145 + limit = memparse(buf, &tmp); 146 + if (buf == tmp) /* no chars parsed, invalid input */ 147 + return -EINVAL; 148 + 149 + down_write(&zram->init_lock); 150 + zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT; 151 + up_write(&zram->init_lock); 152 + 153 + return len; 154 + } 155 + 156 + static ssize_t mem_used_max_show(struct device *dev, 157 + struct device_attribute *attr, char *buf) 158 + { 159 + u64 val = 0; 160 + struct zram *zram = dev_to_zram(dev); 161 + 162 + down_read(&zram->init_lock); 163 + if (init_done(zram)) 164 + val = atomic_long_read(&zram->stats.max_used_pages); 165 + up_read(&zram->init_lock); 166 + 167 + return scnprintf(buf, PAGE_SIZE, "%llu\n", val << PAGE_SHIFT); 168 + } 169 + 170 + static ssize_t mem_used_max_store(struct device *dev, 171 + struct device_attribute *attr, const char *buf, size_t len) 172 + { 173 + int err; 174 + 
unsigned long val; 175 + struct zram *zram = dev_to_zram(dev); 176 + struct zram_meta *meta = zram->meta; 177 + 178 + err = kstrtoul(buf, 10, &val); 179 + if (err || val != 0) 180 + return -EINVAL; 181 + 182 + down_read(&zram->init_lock); 183 + if (init_done(zram)) 184 + atomic_long_set(&zram->stats.max_used_pages, 185 + zs_get_total_pages(meta->mem_pool)); 186 + up_read(&zram->init_lock); 187 + 188 + return len; 123 189 } 124 190 125 191 static ssize_t max_comp_streams_store(struct device *dev, ··· 500 434 return ret; 501 435 } 502 436 437 + static inline void update_used_max(struct zram *zram, 438 + const unsigned long pages) 439 + { 440 + int old_max, cur_max; 441 + 442 + old_max = atomic_long_read(&zram->stats.max_used_pages); 443 + 444 + do { 445 + cur_max = old_max; 446 + if (pages > cur_max) 447 + old_max = atomic_long_cmpxchg( 448 + &zram->stats.max_used_pages, cur_max, pages); 449 + } while (old_max != cur_max); 450 + } 451 + 503 452 static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, 504 453 int offset) 505 454 { ··· 526 445 struct zram_meta *meta = zram->meta; 527 446 struct zcomp_strm *zstrm; 528 447 bool locked = false; 448 + unsigned long alloced_pages; 529 449 530 450 page = bvec->bv_page; 531 451 if (is_partial_io(bvec)) { ··· 595 513 ret = -ENOMEM; 596 514 goto out; 597 515 } 516 + 517 + alloced_pages = zs_get_total_pages(meta->mem_pool); 518 + if (zram->limit_pages && alloced_pages > zram->limit_pages) { 519 + zs_free(meta->mem_pool, handle); 520 + ret = -ENOMEM; 521 + goto out; 522 + } 523 + 524 + update_used_max(zram, alloced_pages); 525 + 598 526 cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO); 599 527 600 528 if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) { ··· 698 606 bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value); 699 607 zram_free_page(zram, index); 700 608 bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value); 609 + atomic64_inc(&zram->stats.notify_free); 701 610 index++; 702 611 n -= PAGE_SIZE; 
703 612 } ··· 710 617 struct zram_meta *meta; 711 618 712 619 down_write(&zram->init_lock); 620 + 621 + zram->limit_pages = 0; 622 + 713 623 if (!init_done(zram)) { 714 624 up_write(&zram->init_lock); 715 625 return; ··· 953 857 static DEVICE_ATTR(reset, S_IWUSR, NULL, reset_store); 954 858 static DEVICE_ATTR(orig_data_size, S_IRUGO, orig_data_size_show, NULL); 955 859 static DEVICE_ATTR(mem_used_total, S_IRUGO, mem_used_total_show, NULL); 860 + static DEVICE_ATTR(mem_limit, S_IRUGO | S_IWUSR, mem_limit_show, 861 + mem_limit_store); 862 + static DEVICE_ATTR(mem_used_max, S_IRUGO | S_IWUSR, mem_used_max_show, 863 + mem_used_max_store); 956 864 static DEVICE_ATTR(max_comp_streams, S_IRUGO | S_IWUSR, 957 865 max_comp_streams_show, max_comp_streams_store); 958 866 static DEVICE_ATTR(comp_algorithm, S_IRUGO | S_IWUSR, ··· 985 885 &dev_attr_orig_data_size.attr, 986 886 &dev_attr_compr_data_size.attr, 987 887 &dev_attr_mem_used_total.attr, 888 + &dev_attr_mem_limit.attr, 889 + &dev_attr_mem_used_max.attr, 988 890 &dev_attr_max_comp_streams.attr, 989 891 &dev_attr_comp_algorithm.attr, 990 892 NULL,
+6
drivers/block/zram/zram_drv.h
··· 90 90 atomic64_t notify_free; /* no. of swap slot free notifications */ 91 91 atomic64_t zero_pages; /* no. of zero filled pages */ 92 92 atomic64_t pages_stored; /* no. of pages currently stored */ 93 + atomic_long_t max_used_pages; /* no. of maximum pages stored */ 93 94 }; 94 95 95 96 struct zram_meta { ··· 113 112 u64 disksize; /* bytes */ 114 113 int max_comp_streams; 115 114 struct zram_stats stats; 115 + /* 116 + * the number of pages zram can consume for storing compressed data 117 + */ 118 + unsigned long limit_pages; 119 + 116 120 char compressor[10]; 117 121 }; 118 122 #endif
+3
drivers/firmware/memmap.c
··· 184 184 static int map_entries_nr; 185 185 static struct kset *mmap_kset; 186 186 187 + if (entry->kobj.state_in_sysfs) 188 + return -EEXIST; 189 + 187 190 if (!mmap_kset) { 188 191 mmap_kset = kset_create_and_add("memmap", NULL, firmware_kobj); 189 192 if (!mmap_kset)
+1
drivers/virtio/Kconfig
··· 25 25 config VIRTIO_BALLOON 26 26 tristate "Virtio balloon driver" 27 27 depends on VIRTIO 28 + select MEMORY_BALLOON 28 29 ---help--- 29 30 This driver supports increasing and decreasing the amount 30 31 of memory within a KVM guest.
+21 -55
drivers/virtio/virtio_balloon.c
··· 59 59 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE 60 60 * to num_pages above. 61 61 */ 62 - struct balloon_dev_info *vb_dev_info; 62 + struct balloon_dev_info vb_dev_info; 63 63 64 64 /* Synchronize access/update to this struct virtio_balloon elements */ 65 65 struct mutex balloon_lock; ··· 127 127 128 128 static void fill_balloon(struct virtio_balloon *vb, size_t num) 129 129 { 130 - struct balloon_dev_info *vb_dev_info = vb->vb_dev_info; 130 + struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; 131 131 132 132 /* We can only do one array worth at a time. */ 133 133 num = min(num, ARRAY_SIZE(vb->pfns)); ··· 163 163 /* Find pfns pointing at start of each page, get pages and free them. */ 164 164 for (i = 0; i < num; i += VIRTIO_BALLOON_PAGES_PER_PAGE) { 165 165 struct page *page = balloon_pfn_to_page(pfns[i]); 166 - balloon_page_free(page); 167 166 adjust_managed_page_count(page, 1); 167 + put_page(page); /* balloon reference */ 168 168 } 169 169 } 170 170 171 171 static void leak_balloon(struct virtio_balloon *vb, size_t num) 172 172 { 173 173 struct page *page; 174 - struct balloon_dev_info *vb_dev_info = vb->vb_dev_info; 174 + struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info; 175 175 176 176 /* We can only do one array worth at a time. */ 177 177 num = min(num, ARRAY_SIZE(vb->pfns)); ··· 353 353 return 0; 354 354 } 355 355 356 - static const struct address_space_operations virtio_balloon_aops; 357 356 #ifdef CONFIG_BALLOON_COMPACTION 358 357 /* 359 358 * virtballoon_migratepage - perform the balloon page migration on behalf of 360 359 * a compation thread. (called under page lock) 361 - * @mapping: the page->mapping which will be assigned to the new migrated page. 360 + * @vb_dev_info: the balloon device 362 361 * @newpage: page that will replace the isolated page after migration finishes. 363 362 * @page : the isolated (old) page that is about to be migrated to newpage. 
364 363 * @mode : compaction mode -- not used for balloon page migration. ··· 372 373 * This function preforms the balloon page migration task. 373 374 * Called through balloon_mapping->a_ops->migratepage 374 375 */ 375 - static int virtballoon_migratepage(struct address_space *mapping, 376 + static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info, 376 377 struct page *newpage, struct page *page, enum migrate_mode mode) 377 378 { 378 - struct balloon_dev_info *vb_dev_info = balloon_page_device(page); 379 - struct virtio_balloon *vb; 379 + struct virtio_balloon *vb = container_of(vb_dev_info, 380 + struct virtio_balloon, vb_dev_info); 380 381 unsigned long flags; 381 - 382 - BUG_ON(!vb_dev_info); 383 - 384 - vb = vb_dev_info->balloon_device; 385 382 386 383 /* 387 384 * In order to avoid lock contention while migrating pages concurrently ··· 390 395 if (!mutex_trylock(&vb->balloon_lock)) 391 396 return -EAGAIN; 392 397 398 + get_page(newpage); /* balloon reference */ 399 + 393 400 /* balloon's page migration 1st step -- inflate "newpage" */ 394 401 spin_lock_irqsave(&vb_dev_info->pages_lock, flags); 395 - balloon_page_insert(newpage, mapping, &vb_dev_info->pages); 402 + balloon_page_insert(vb_dev_info, newpage); 396 403 vb_dev_info->isolated_pages--; 404 + __count_vm_event(BALLOON_MIGRATE); 397 405 spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags); 398 406 vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; 399 407 set_page_pfns(vb->pfns, newpage); 400 408 tell_host(vb, vb->inflate_vq); 401 409 402 - /* 403 - * balloon's page migration 2nd step -- deflate "page" 404 - * 405 - * It's safe to delete page->lru here because this page is at 406 - * an isolated migration list, and this step is expected to happen here 407 - */ 410 + /* balloon's page migration 2nd step -- deflate "page" */ 408 411 balloon_page_delete(page); 409 412 vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE; 410 413 set_page_pfns(vb->pfns, page); ··· 410 417 411 418 
mutex_unlock(&vb->balloon_lock); 412 419 413 - return MIGRATEPAGE_BALLOON_SUCCESS; 414 - } 420 + put_page(page); /* balloon reference */ 415 421 416 - /* define the balloon_mapping->a_ops callback to allow balloon page migration */ 417 - static const struct address_space_operations virtio_balloon_aops = { 418 - .migratepage = virtballoon_migratepage, 419 - }; 422 + return MIGRATEPAGE_SUCCESS; 423 + } 420 424 #endif /* CONFIG_BALLOON_COMPACTION */ 421 425 422 426 static int virtballoon_probe(struct virtio_device *vdev) 423 427 { 424 428 struct virtio_balloon *vb; 425 - struct address_space *vb_mapping; 426 - struct balloon_dev_info *vb_devinfo; 427 429 int err; 428 430 429 431 vdev->priv = vb = kmalloc(sizeof(*vb), GFP_KERNEL); ··· 434 446 vb->vdev = vdev; 435 447 vb->need_stats_update = 0; 436 448 437 - vb_devinfo = balloon_devinfo_alloc(vb); 438 - if (IS_ERR(vb_devinfo)) { 439 - err = PTR_ERR(vb_devinfo); 440 - goto out_free_vb; 441 - } 442 - 443 - vb_mapping = balloon_mapping_alloc(vb_devinfo, 444 - (balloon_compaction_check()) ? 445 - &virtio_balloon_aops : NULL); 446 - if (IS_ERR(vb_mapping)) { 447 - /* 448 - * IS_ERR(vb_mapping) && PTR_ERR(vb_mapping) == -EOPNOTSUPP 449 - * This means !CONFIG_BALLOON_COMPACTION, otherwise we get off. 
450 - */ 451 - err = PTR_ERR(vb_mapping); 452 - if (err != -EOPNOTSUPP) 453 - goto out_free_vb_devinfo; 454 - } 455 - 456 - vb->vb_dev_info = vb_devinfo; 449 + balloon_devinfo_init(&vb->vb_dev_info); 450 + #ifdef CONFIG_BALLOON_COMPACTION 451 + vb->vb_dev_info.migratepage = virtballoon_migratepage; 452 + #endif 457 453 458 454 err = init_vqs(vb); 459 455 if (err) 460 - goto out_free_vb_mapping; 456 + goto out_free_vb; 461 457 462 458 vb->thread = kthread_run(balloon, vb, "vballoon"); 463 459 if (IS_ERR(vb->thread)) { ··· 453 481 454 482 out_del_vqs: 455 483 vdev->config->del_vqs(vdev); 456 - out_free_vb_mapping: 457 - balloon_mapping_free(vb_mapping); 458 - out_free_vb_devinfo: 459 - balloon_devinfo_free(vb_devinfo); 460 484 out_free_vb: 461 485 kfree(vb); 462 486 out: ··· 478 510 479 511 kthread_stop(vb->thread); 480 512 remove_common(vb); 481 - balloon_mapping_free(vb->vb_dev_info->mapping); 482 - balloon_devinfo_free(vb->vb_dev_info); 483 513 kfree(vb); 484 514 } 485 515
+7
fs/block_dev.c
··· 304 304 return block_read_full_page(page, blkdev_get_block); 305 305 } 306 306 307 + static int blkdev_readpages(struct file *file, struct address_space *mapping, 308 + struct list_head *pages, unsigned nr_pages) 309 + { 310 + return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block); 311 + } 312 + 307 313 static int blkdev_write_begin(struct file *file, struct address_space *mapping, 308 314 loff_t pos, unsigned len, unsigned flags, 309 315 struct page **pagep, void **fsdata) ··· 1628 1622 1629 1623 static const struct address_space_operations def_blk_aops = { 1630 1624 .readpage = blkdev_readpage, 1625 + .readpages = blkdev_readpages, 1631 1626 .writepage = blkdev_writepage, 1632 1627 .write_begin = blkdev_write_begin, 1633 1628 .write_end = blkdev_write_end,
+13 -15
fs/buffer.c
··· 1253 1253 * a local interrupt disable for that. 1254 1254 */ 1255 1255 1256 - #define BH_LRU_SIZE 8 1256 + #define BH_LRU_SIZE 16 1257 1257 1258 1258 struct bh_lru { 1259 1259 struct buffer_head *bhs[BH_LRU_SIZE]; ··· 2956 2956 2957 2957 /* 2958 2958 * This allows us to do IO even on the odd last sectors 2959 - * of a device, even if the bh block size is some multiple 2959 + * of a device, even if the block size is some multiple 2960 2960 * of the physical sector size. 2961 2961 * 2962 2962 * We'll just truncate the bio to the size of the device, ··· 2966 2966 * errors, this only handles the "we need to be able to 2967 2967 * do IO at the final sector" case. 2968 2968 */ 2969 - static void guard_bh_eod(int rw, struct bio *bio, struct buffer_head *bh) 2969 + void guard_bio_eod(int rw, struct bio *bio) 2970 2970 { 2971 2971 sector_t maxsector; 2972 - unsigned bytes; 2972 + struct bio_vec *bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; 2973 + unsigned truncated_bytes; 2973 2974 2974 2975 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; 2975 2976 if (!maxsector) ··· 2985 2984 return; 2986 2985 2987 2986 maxsector -= bio->bi_iter.bi_sector; 2988 - bytes = bio->bi_iter.bi_size; 2989 - if (likely((bytes >> 9) <= maxsector)) 2987 + if (likely((bio->bi_iter.bi_size >> 9) <= maxsector)) 2990 2988 return; 2991 2989 2992 - /* Uhhuh. We've got a bh that straddles the device size! */ 2993 - bytes = maxsector << 9; 2990 + /* Uhhuh. We've got a bio that straddles the device size! */ 2991 + truncated_bytes = bio->bi_iter.bi_size - (maxsector << 9); 2994 2992 2995 2993 /* Truncate the bio.. 
*/ 2996 - bio->bi_iter.bi_size = bytes; 2997 - bio->bi_io_vec[0].bv_len = bytes; 2994 + bio->bi_iter.bi_size -= truncated_bytes; 2995 + bvec->bv_len -= truncated_bytes; 2998 2996 2999 2997 /* ..and clear the end of the buffer for reads */ 3000 2998 if ((rw & RW_MASK) == READ) { 3001 - void *kaddr = kmap_atomic(bh->b_page); 3002 - memset(kaddr + bh_offset(bh) + bytes, 0, bh->b_size - bytes); 3003 - kunmap_atomic(kaddr); 3004 - flush_dcache_page(bh->b_page); 2999 + zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len, 3000 + truncated_bytes); 3005 3001 } 3006 3002 } 3007 3003 ··· 3039 3041 bio->bi_flags |= bio_flags; 3040 3042 3041 3043 /* Take care of bh's that straddle the end of the device */ 3042 - guard_bh_eod(rw, bio, bh); 3044 + guard_bio_eod(rw, bio); 3043 3045 3044 3046 if (buffer_meta(bh)) 3045 3047 rw |= REQ_META;
+5
fs/internal.h
··· 35 35 #endif 36 36 37 37 /* 38 + * buffer.c 39 + */ 40 + extern void guard_bio_eod(int rw, struct bio *bio); 41 + 42 + /* 38 43 * char_dev.c 39 44 */ 40 45 extern void __init chrdev_init(void);
+2
fs/mpage.c
··· 28 28 #include <linux/backing-dev.h> 29 29 #include <linux/pagevec.h> 30 30 #include <linux/cleancache.h> 31 + #include "internal.h" 31 32 32 33 /* 33 34 * I/O completion handler for multipage BIOs. ··· 58 57 static struct bio *mpage_bio_submit(int rw, struct bio *bio) 59 58 { 60 59 bio->bi_end_io = mpage_end_io; 60 + guard_bio_eod(rw, bio); 61 61 submit_bio(rw, bio); 62 62 return NULL; 63 63 }
+1 -1
fs/notify/fanotify/fanotify_user.c
··· 78 78 79 79 pr_debug("%s: group=%p event=%p\n", __func__, group, event); 80 80 81 - client_fd = get_unused_fd(); 81 + client_fd = get_unused_fd_flags(group->fanotify_data.f_flags); 82 82 if (client_fd < 0) 83 83 return client_fd; 84 84
-3
fs/notify/fsnotify.h
··· 23 23 struct fsnotify_group *group, struct vfsmount *mnt, 24 24 int allow_dups); 25 25 26 - /* final kfree of a group */ 27 - extern void fsnotify_final_destroy_group(struct fsnotify_group *group); 28 - 29 26 /* vfsmount specific destruction of a mark */ 30 27 extern void fsnotify_destroy_vfsmount_mark(struct fsnotify_mark *mark); 31 28 /* inode specific destruction of a mark */
+1 -1
fs/notify/group.c
··· 31 31 /* 32 32 * Final freeing of a group 33 33 */ 34 - void fsnotify_final_destroy_group(struct fsnotify_group *group) 34 + static void fsnotify_final_destroy_group(struct fsnotify_group *group) 35 35 { 36 36 if (group->ops->free_group_priv) 37 37 group->ops->free_group_priv(group);
+4 -2
fs/notify/inotify/inotify_fsnotify.c
··· 165 165 /* ideally the idr is empty and we won't hit the BUG in the callback */ 166 166 idr_for_each(&group->inotify_data.idr, idr_callback, group); 167 167 idr_destroy(&group->inotify_data.idr); 168 - atomic_dec(&group->inotify_data.user->inotify_devs); 169 - free_uid(group->inotify_data.user); 168 + if (group->inotify_data.user) { 169 + atomic_dec(&group->inotify_data.user->inotify_devs); 170 + free_uid(group->inotify_data.user); 171 + } 170 172 } 171 173 172 174 static void inotify_free_event(struct fsnotify_event *fsn_event)
+1 -1
fs/ntfs/debug.c
··· 112 112 /* If 1, output debug messages, and if 0, don't. */ 113 113 int debug_msgs = 0; 114 114 115 - void __ntfs_debug (const char *file, int line, const char *function, 115 + void __ntfs_debug(const char *file, int line, const char *function, 116 116 const char *fmt, ...) 117 117 { 118 118 struct va_format vaf;
+3 -2
fs/ntfs/file.c
··· 1 1 /* 2 2 * file.c - NTFS kernel file operations. Part of the Linux-NTFS project. 3 3 * 4 - * Copyright (c) 2001-2011 Anton Altaparmakov and Tuxera Inc. 4 + * Copyright (c) 2001-2014 Anton Altaparmakov and Tuxera Inc. 5 5 * 6 6 * This program/include file is free software; you can redistribute it and/or 7 7 * modify it under the terms of the GNU General Public License as published ··· 410 410 BUG_ON(!nr_pages); 411 411 err = nr = 0; 412 412 do { 413 - pages[nr] = find_lock_page(mapping, index); 413 + pages[nr] = find_get_page_flags(mapping, index, FGP_LOCK | 414 + FGP_ACCESSED); 414 415 if (!pages[nr]) { 415 416 if (!*cached_page) { 416 417 *cached_page = page_cache_alloc(mapping);
+1 -1
fs/ntfs/super.c
··· 3208 3208 } 3209 3209 3210 3210 MODULE_AUTHOR("Anton Altaparmakov <anton@tuxera.com>"); 3211 - MODULE_DESCRIPTION("NTFS 1.2/3.x driver - Copyright (c) 2001-2011 Anton Altaparmakov and Tuxera Inc."); 3211 + MODULE_DESCRIPTION("NTFS 1.2/3.x driver - Copyright (c) 2001-2014 Anton Altaparmakov and Tuxera Inc."); 3212 3212 MODULE_VERSION(NTFS_VERSION); 3213 3213 MODULE_LICENSE("GPL"); 3214 3214 #ifdef DEBUG
+8 -7
fs/ocfs2/aops.c
··· 1481 1481 handle_t *handle; 1482 1482 struct ocfs2_dinode *di = (struct ocfs2_dinode *)wc->w_di_bh->b_data; 1483 1483 1484 + handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS); 1485 + if (IS_ERR(handle)) { 1486 + ret = PTR_ERR(handle); 1487 + mlog_errno(ret); 1488 + goto out; 1489 + } 1490 + 1484 1491 page = find_or_create_page(mapping, 0, GFP_NOFS); 1485 1492 if (!page) { 1493 + ocfs2_commit_trans(osb, handle); 1486 1494 ret = -ENOMEM; 1487 1495 mlog_errno(ret); 1488 1496 goto out; ··· 1501 1493 */ 1502 1494 wc->w_pages[0] = wc->w_target_page = page; 1503 1495 wc->w_num_pages = 1; 1504 - 1505 - handle = ocfs2_start_trans(osb, OCFS2_INODE_UPDATE_CREDITS); 1506 - if (IS_ERR(handle)) { 1507 - ret = PTR_ERR(handle); 1508 - mlog_errno(ret); 1509 - goto out; 1510 - } 1511 1496 1512 1497 ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), wc->w_di_bh, 1513 1498 OCFS2_JOURNAL_ACCESS_WRITE);
+19
fs/ocfs2/cluster/heartbeat.c
··· 2572 2572 } 2573 2573 EXPORT_SYMBOL_GPL(o2hb_check_node_heartbeating); 2574 2574 2575 + int o2hb_check_node_heartbeating_no_sem(u8 node_num) 2576 + { 2577 + unsigned long testing_map[BITS_TO_LONGS(O2NM_MAX_NODES)]; 2578 + unsigned long flags; 2579 + 2580 + spin_lock_irqsave(&o2hb_live_lock, flags); 2581 + o2hb_fill_node_map_from_callback(testing_map, sizeof(testing_map)); 2582 + spin_unlock_irqrestore(&o2hb_live_lock, flags); 2583 + if (!test_bit(node_num, testing_map)) { 2584 + mlog(ML_HEARTBEAT, 2585 + "node (%u) does not have heartbeating enabled.\n", 2586 + node_num); 2587 + return 0; 2588 + } 2589 + 2590 + return 1; 2591 + } 2592 + EXPORT_SYMBOL_GPL(o2hb_check_node_heartbeating_no_sem); 2593 + 2575 2594 int o2hb_check_node_heartbeating_from_callback(u8 node_num) 2576 2595 { 2577 2596 unsigned long testing_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
+1
fs/ocfs2/cluster/heartbeat.h
··· 80 80 void o2hb_exit(void); 81 81 int o2hb_init(void); 82 82 int o2hb_check_node_heartbeating(u8 node_num); 83 + int o2hb_check_node_heartbeating_no_sem(u8 node_num); 83 84 int o2hb_check_node_heartbeating_from_callback(u8 node_num); 84 85 int o2hb_check_local_node_heartbeating(void); 85 86 void o2hb_stop_all_regions(void);
+19 -59
fs/ocfs2/cluster/netdebug.c
··· 185 185 static int nst_fop_open(struct inode *inode, struct file *file) 186 186 { 187 187 struct o2net_send_tracking *dummy_nst; 188 - struct seq_file *seq; 189 - int ret; 190 188 191 - dummy_nst = kmalloc(sizeof(struct o2net_send_tracking), GFP_KERNEL); 192 - if (dummy_nst == NULL) { 193 - ret = -ENOMEM; 194 - goto out; 195 - } 196 - dummy_nst->st_task = NULL; 197 - 198 - ret = seq_open(file, &nst_seq_ops); 199 - if (ret) 200 - goto out; 201 - 202 - seq = file->private_data; 203 - seq->private = dummy_nst; 189 + dummy_nst = __seq_open_private(file, &nst_seq_ops, sizeof(*dummy_nst)); 190 + if (!dummy_nst) 191 + return -ENOMEM; 204 192 o2net_debug_add_nst(dummy_nst); 205 193 206 - dummy_nst = NULL; 207 - 208 - out: 209 - kfree(dummy_nst); 210 - return ret; 194 + return 0; 211 195 } 212 196 213 197 static int nst_fop_release(struct inode *inode, struct file *file) ··· 396 412 .show = sc_seq_show, 397 413 }; 398 414 399 - static int sc_common_open(struct file *file, struct o2net_sock_debug *sd) 415 + static int sc_common_open(struct file *file, int ctxt) 400 416 { 417 + struct o2net_sock_debug *sd; 401 418 struct o2net_sock_container *dummy_sc; 402 - struct seq_file *seq; 403 - int ret; 404 419 405 - dummy_sc = kmalloc(sizeof(struct o2net_sock_container), GFP_KERNEL); 406 - if (dummy_sc == NULL) { 407 - ret = -ENOMEM; 408 - goto out; 420 + dummy_sc = kzalloc(sizeof(*dummy_sc), GFP_KERNEL); 421 + if (!dummy_sc) 422 + return -ENOMEM; 423 + 424 + sd = __seq_open_private(file, &sc_seq_ops, sizeof(*sd)); 425 + if (!sd) { 426 + kfree(dummy_sc); 427 + return -ENOMEM; 409 428 } 410 - dummy_sc->sc_page = NULL; 411 429 412 - ret = seq_open(file, &sc_seq_ops); 413 - if (ret) 414 - goto out; 415 - 416 - seq = file->private_data; 417 - seq->private = sd; 430 + sd->dbg_ctxt = ctxt; 418 431 sd->dbg_sock = dummy_sc; 432 + 419 433 o2net_debug_add_sc(dummy_sc); 420 434 421 - dummy_sc = NULL; 422 - 423 - out: 424 - kfree(dummy_sc); 425 - return ret; 435 + return 0; 426 436 } 427 437 
428 438 static int sc_fop_release(struct inode *inode, struct file *file) ··· 431 453 432 454 static int stats_fop_open(struct inode *inode, struct file *file) 433 455 { 434 - struct o2net_sock_debug *sd; 435 - 436 - sd = kmalloc(sizeof(struct o2net_sock_debug), GFP_KERNEL); 437 - if (sd == NULL) 438 - return -ENOMEM; 439 - 440 - sd->dbg_ctxt = SHOW_SOCK_STATS; 441 - sd->dbg_sock = NULL; 442 - 443 - return sc_common_open(file, sd); 456 + return sc_common_open(file, SHOW_SOCK_STATS); 444 457 } 445 458 446 459 static const struct file_operations stats_seq_fops = { ··· 443 474 444 475 static int sc_fop_open(struct inode *inode, struct file *file) 445 476 { 446 - struct o2net_sock_debug *sd; 447 - 448 - sd = kmalloc(sizeof(struct o2net_sock_debug), GFP_KERNEL); 449 - if (sd == NULL) 450 - return -ENOMEM; 451 - 452 - sd->dbg_ctxt = SHOW_SOCK_CONTAINERS; 453 - sd->dbg_sock = NULL; 454 - 455 - return sc_common_open(file, sd); 477 + return sc_common_open(file, SHOW_SOCK_CONTAINERS); 456 478 } 457 479 458 480 static const struct file_operations sc_seq_fops = {
+34 -9
fs/ocfs2/cluster/tcp.c
··· 536 536 if (nn->nn_persistent_error || nn->nn_sc_valid) 537 537 wake_up(&nn->nn_sc_wq); 538 538 539 - if (!was_err && nn->nn_persistent_error) { 539 + if (was_valid && !was_err && nn->nn_persistent_error) { 540 540 o2quo_conn_err(o2net_num_from_nn(nn)); 541 541 queue_delayed_work(o2net_wq, &nn->nn_still_up, 542 542 msecs_to_jiffies(O2NET_QUORUM_DELAY_MS)); ··· 1601 1601 struct sockaddr_in myaddr = {0, }, remoteaddr = {0, }; 1602 1602 int ret = 0, stop; 1603 1603 unsigned int timeout; 1604 + unsigned int noio_flag; 1604 1605 1606 + /* 1607 + * sock_create allocates the sock with GFP_KERNEL. We must set 1608 + * per-process flag PF_MEMALLOC_NOIO so that all allocations done 1609 + * by this process are done as if GFP_NOIO was specified. So we 1610 + * are not reentering filesystem while doing memory reclaim. 1611 + */ 1612 + noio_flag = memalloc_noio_save(); 1605 1613 /* if we're greater we initiate tx, otherwise we accept */ 1606 1614 if (o2nm_this_node() <= o2net_num_from_nn(nn)) 1607 1615 goto out; ··· 1718 1710 if (mynode) 1719 1711 o2nm_node_put(mynode); 1720 1712 1713 + memalloc_noio_restore(noio_flag); 1721 1714 return; 1722 1715 } 1723 1716 ··· 1730 1721 spin_lock(&nn->nn_lock); 1731 1722 if (!nn->nn_sc_valid) { 1732 1723 printk(KERN_NOTICE "o2net: No connection established with " 1733 - "node %u after %u.%u seconds, giving up.\n", 1724 + "node %u after %u.%u seconds, check network and" 1725 + " cluster configuration.\n", 1734 1726 o2net_num_from_nn(nn), 1735 1727 o2net_idle_timeout() / 1000, 1736 1728 o2net_idle_timeout() % 1000); ··· 1845 1835 struct o2nm_node *local_node = NULL; 1846 1836 struct o2net_sock_container *sc = NULL; 1847 1837 struct o2net_node *nn; 1838 + unsigned int noio_flag; 1839 + 1840 + /* 1841 + * sock_create_lite allocates the sock with GFP_KERNEL. We must set 1842 + * per-process flag PF_MEMALLOC_NOIO so that all allocations done 1843 + * by this process are done as if GFP_NOIO was specified. 
So we 1844 + * are not reentering filesystem while doing memory reclaim. 1845 + */ 1846 + noio_flag = memalloc_noio_save(); 1848 1847 1849 1848 BUG_ON(sock == NULL); 1850 1849 *more = 0; ··· 1970 1951 o2nm_node_put(local_node); 1971 1952 if (sc) 1972 1953 sc_put(sc); 1954 + 1955 + memalloc_noio_restore(noio_flag); 1973 1956 return ret; 1974 1957 } 1975 1958 ··· 2167 2146 o2quo_init(); 2168 2147 2169 2148 if (o2net_debugfs_init()) 2170 - return -ENOMEM; 2149 + goto out; 2171 2150 2172 2151 o2net_hand = kzalloc(sizeof(struct o2net_handshake), GFP_KERNEL); 2173 2152 o2net_keep_req = kzalloc(sizeof(struct o2net_msg), GFP_KERNEL); 2174 2153 o2net_keep_resp = kzalloc(sizeof(struct o2net_msg), GFP_KERNEL); 2175 - if (!o2net_hand || !o2net_keep_req || !o2net_keep_resp) { 2176 - kfree(o2net_hand); 2177 - kfree(o2net_keep_req); 2178 - kfree(o2net_keep_resp); 2179 - return -ENOMEM; 2180 - } 2154 + if (!o2net_hand || !o2net_keep_req || !o2net_keep_resp) 2155 + goto out; 2181 2156 2182 2157 o2net_hand->protocol_version = cpu_to_be64(O2NET_PROTOCOL_VERSION); 2183 2158 o2net_hand->connector_id = cpu_to_be64(1); ··· 2198 2181 } 2199 2182 2200 2183 return 0; 2184 + 2185 + out: 2186 + kfree(o2net_hand); 2187 + kfree(o2net_keep_req); 2188 + kfree(o2net_keep_resp); 2189 + 2190 + o2quo_exit(); 2191 + return -ENOMEM; 2201 2192 } 2202 2193 2203 2194 void o2net_exit(void)
+14 -25
fs/ocfs2/dlm/dlmdebug.c
··· 647 647 static int debug_lockres_open(struct inode *inode, struct file *file) 648 648 { 649 649 struct dlm_ctxt *dlm = inode->i_private; 650 - int ret = -ENOMEM; 651 - struct seq_file *seq; 652 - struct debug_lockres *dl = NULL; 650 + struct debug_lockres *dl; 651 + void *buf; 653 652 654 - dl = kzalloc(sizeof(struct debug_lockres), GFP_KERNEL); 655 - if (!dl) { 656 - mlog_errno(ret); 653 + buf = kmalloc(PAGE_SIZE, GFP_KERNEL); 654 + if (!buf) 657 655 goto bail; 658 - } 656 + 657 + dl = __seq_open_private(file, &debug_lockres_ops, sizeof(*dl)); 658 + if (!dl) 659 + goto bailfree; 659 660 660 661 dl->dl_len = PAGE_SIZE; 661 - dl->dl_buf = kmalloc(dl->dl_len, GFP_KERNEL); 662 - if (!dl->dl_buf) { 663 - mlog_errno(ret); 664 - goto bail; 665 - } 666 - 667 - ret = seq_open(file, &debug_lockres_ops); 668 - if (ret) { 669 - mlog_errno(ret); 670 - goto bail; 671 - } 672 - 673 - seq = file->private_data; 674 - seq->private = dl; 662 + dl->dl_buf = buf; 675 663 676 664 dlm_grab(dlm); 677 665 dl->dl_ctxt = dlm; 678 666 679 667 return 0; 668 + 669 + bailfree: 670 + kfree(buf); 680 671 bail: 681 - if (dl) 682 - kfree(dl->dl_buf); 683 - kfree(dl); 684 - return ret; 672 + mlog_errno(-ENOMEM); 673 + return -ENOMEM; 685 674 } 686 675 687 676 static int debug_lockres_release(struct inode *inode, struct file *file)
+23 -21
fs/ocfs2/dlm/dlmdomain.c
··· 839 839 * to back off and try again. This gives heartbeat a chance 840 840 * to catch up. 841 841 */ 842 - if (!o2hb_check_node_heartbeating(query->node_idx)) { 842 + if (!o2hb_check_node_heartbeating_no_sem(query->node_idx)) { 843 843 mlog(0, "node %u is not in our live map yet\n", 844 844 query->node_idx); 845 845 ··· 1975 1975 1976 1976 dlm = kzalloc(sizeof(*dlm), GFP_KERNEL); 1977 1977 if (!dlm) { 1978 - mlog_errno(-ENOMEM); 1978 + ret = -ENOMEM; 1979 + mlog_errno(ret); 1979 1980 goto leave; 1980 1981 } 1981 1982 1982 1983 dlm->name = kstrdup(domain, GFP_KERNEL); 1983 1984 if (dlm->name == NULL) { 1984 - mlog_errno(-ENOMEM); 1985 - kfree(dlm); 1986 - dlm = NULL; 1985 + ret = -ENOMEM; 1986 + mlog_errno(ret); 1987 1987 goto leave; 1988 1988 } 1989 1989 1990 1990 dlm->lockres_hash = (struct hlist_head **)dlm_alloc_pagevec(DLM_HASH_PAGES); 1991 1991 if (!dlm->lockres_hash) { 1992 - mlog_errno(-ENOMEM); 1993 - kfree(dlm->name); 1994 - kfree(dlm); 1995 - dlm = NULL; 1992 + ret = -ENOMEM; 1993 + mlog_errno(ret); 1996 1994 goto leave; 1997 1995 } 1998 1996 ··· 2000 2002 dlm->master_hash = (struct hlist_head **) 2001 2003 dlm_alloc_pagevec(DLM_HASH_PAGES); 2002 2004 if (!dlm->master_hash) { 2003 - mlog_errno(-ENOMEM); 2004 - dlm_free_pagevec((void **)dlm->lockres_hash, DLM_HASH_PAGES); 2005 - kfree(dlm->name); 2006 - kfree(dlm); 2007 - dlm = NULL; 2005 + ret = -ENOMEM; 2006 + mlog_errno(ret); 2008 2007 goto leave; 2009 2008 } 2010 2009 ··· 2012 2017 dlm->node_num = o2nm_this_node(); 2013 2018 2014 2019 ret = dlm_create_debugfs_subroot(dlm); 2015 - if (ret < 0) { 2016 - dlm_free_pagevec((void **)dlm->master_hash, DLM_HASH_PAGES); 2017 - dlm_free_pagevec((void **)dlm->lockres_hash, DLM_HASH_PAGES); 2018 - kfree(dlm->name); 2019 - kfree(dlm); 2020 - dlm = NULL; 2020 + if (ret < 0) 2021 2021 goto leave; 2022 - } 2023 2022 2024 2023 spin_lock_init(&dlm->spinlock); 2025 2024 spin_lock_init(&dlm->master_lock); ··· 2074 2085 atomic_read(&dlm->dlm_refs.refcount)); 2075 2086 
2076 2087 leave: 2088 + if (ret < 0 && dlm) { 2089 + if (dlm->master_hash) 2090 + dlm_free_pagevec((void **)dlm->master_hash, 2091 + DLM_HASH_PAGES); 2092 + 2093 + if (dlm->lockres_hash) 2094 + dlm_free_pagevec((void **)dlm->lockres_hash, 2095 + DLM_HASH_PAGES); 2096 + 2097 + kfree(dlm->name); 2098 + kfree(dlm); 2099 + dlm = NULL; 2100 + } 2077 2101 return dlm; 2078 2102 } 2079 2103
-3
fs/ocfs2/dlm/dlmmaster.c
··· 625 625 return res; 626 626 627 627 error: 628 - if (res && res->lockname.name) 629 - kmem_cache_free(dlm_lockname_cache, (void *)res->lockname.name); 630 - 631 628 if (res) 632 629 kmem_cache_free(dlm_lockres_cache, res); 633 630 return NULL;
+5 -2
fs/ocfs2/dlm/dlmrecovery.c
··· 1710 1710 BUG(); 1711 1711 } else 1712 1712 __dlm_lockres_grab_inflight_worker(dlm, res); 1713 - } else /* put.. incase we are not the master */ 1713 + spin_unlock(&res->spinlock); 1714 + } else { 1715 + /* put.. incase we are not the master */ 1716 + spin_unlock(&res->spinlock); 1714 1717 dlm_lockres_put(res); 1715 - spin_unlock(&res->spinlock); 1718 + } 1716 1719 } 1717 1720 spin_unlock(&dlm->spinlock); 1718 1721
+5 -18
fs/ocfs2/dlmglue.c
··· 2892 2892 2893 2893 static int ocfs2_dlm_debug_open(struct inode *inode, struct file *file) 2894 2894 { 2895 - int ret; 2896 2895 struct ocfs2_dlm_seq_priv *priv; 2897 - struct seq_file *seq; 2898 2896 struct ocfs2_super *osb; 2899 2897 2900 - priv = kzalloc(sizeof(struct ocfs2_dlm_seq_priv), GFP_KERNEL); 2898 + priv = __seq_open_private(file, &ocfs2_dlm_seq_ops, sizeof(*priv)); 2901 2899 if (!priv) { 2902 - ret = -ENOMEM; 2903 - mlog_errno(ret); 2904 - goto out; 2900 + mlog_errno(-ENOMEM); 2901 + return -ENOMEM; 2905 2902 } 2903 + 2906 2904 osb = inode->i_private; 2907 2905 ocfs2_get_dlm_debug(osb->osb_dlm_debug); 2908 2906 priv->p_dlm_debug = osb->osb_dlm_debug; 2909 2907 INIT_LIST_HEAD(&priv->p_iter_res.l_debug_list); 2910 2908 2911 - ret = seq_open(file, &ocfs2_dlm_seq_ops); 2912 - if (ret) { 2913 - kfree(priv); 2914 - mlog_errno(ret); 2915 - goto out; 2916 - } 2917 - 2918 - seq = file->private_data; 2919 - seq->private = priv; 2920 - 2921 2909 ocfs2_add_lockres_tracking(&priv->p_iter_res, 2922 2910 priv->p_dlm_debug); 2923 2911 2924 - out: 2925 - return ret; 2912 + return 0; 2926 2913 } 2927 2914 2928 2915 static const struct file_operations ocfs2_dlm_debug_fops = {
+23 -24
fs/ocfs2/file.c
··· 760 760 struct address_space *mapping = inode->i_mapping; 761 761 struct page *page; 762 762 unsigned long index = abs_from >> PAGE_CACHE_SHIFT; 763 - handle_t *handle = NULL; 763 + handle_t *handle; 764 764 int ret = 0; 765 765 unsigned zero_from, zero_to, block_start, block_end; 766 766 struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; ··· 769 769 BUG_ON(abs_to > (((u64)index + 1) << PAGE_CACHE_SHIFT)); 770 770 BUG_ON(abs_from & (inode->i_blkbits - 1)); 771 771 772 + handle = ocfs2_zero_start_ordered_transaction(inode, di_bh); 773 + if (IS_ERR(handle)) { 774 + ret = PTR_ERR(handle); 775 + goto out; 776 + } 777 + 772 778 page = find_or_create_page(mapping, index, GFP_NOFS); 773 779 if (!page) { 774 780 ret = -ENOMEM; 775 781 mlog_errno(ret); 776 - goto out; 782 + goto out_commit_trans; 777 783 } 778 784 779 785 /* Get the offsets within the page that we want to zero */ ··· 811 805 goto out_unlock; 812 806 } 813 807 814 - if (!handle) { 815 - handle = ocfs2_zero_start_ordered_transaction(inode, 816 - di_bh); 817 - if (IS_ERR(handle)) { 818 - ret = PTR_ERR(handle); 819 - handle = NULL; 820 - break; 821 - } 822 - } 823 808 824 809 /* must not update i_size! */ 825 810 ret = block_commit_write(page, block_start + 1, ··· 821 824 ret = 0; 822 825 } 823 826 827 + /* 828 + * fs-writeback will release the dirty pages without page lock 829 + * whose offset are over inode size, the release happens at 830 + * block_write_full_page(). 
831 + */ 832 + i_size_write(inode, abs_to); 833 + inode->i_blocks = ocfs2_inode_sector_count(inode); 834 + di->i_size = cpu_to_le64((u64)i_size_read(inode)); 835 + inode->i_mtime = inode->i_ctime = CURRENT_TIME; 836 + di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec); 837 + di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec); 838 + di->i_mtime_nsec = di->i_ctime_nsec; 824 839 if (handle) { 825 - /* 826 - * fs-writeback will release the dirty pages without page lock 827 - * whose offset are over inode size, the release happens at 828 - * block_write_full_page(). 829 - */ 830 - i_size_write(inode, abs_to); 831 - inode->i_blocks = ocfs2_inode_sector_count(inode); 832 - di->i_size = cpu_to_le64((u64)i_size_read(inode)); 833 - inode->i_mtime = inode->i_ctime = CURRENT_TIME; 834 - di->i_mtime = di->i_ctime = cpu_to_le64(inode->i_mtime.tv_sec); 835 - di->i_ctime_nsec = cpu_to_le32(inode->i_mtime.tv_nsec); 836 - di->i_mtime_nsec = di->i_ctime_nsec; 837 840 ocfs2_journal_dirty(handle, di_bh); 838 841 ocfs2_update_inode_fsync_trans(handle, inode, 1); 839 - ocfs2_commit_trans(OCFS2_SB(inode->i_sb), handle); 840 842 } 841 843 842 844 out_unlock: 843 845 unlock_page(page); 844 846 page_cache_release(page); 847 + out_commit_trans: 848 + if (handle) 849 + ocfs2_commit_trans(OCFS2_SB(inode->i_sb), handle); 845 850 out: 846 851 return ret; 847 852 }
+1 -1
fs/ocfs2/inode.h
··· 162 162 { 163 163 int c_to_s_bits = OCFS2_SB(inode->i_sb)->s_clustersize_bits - 9; 164 164 165 - return (blkcnt_t)(OCFS2_I(inode)->ip_clusters << c_to_s_bits); 165 + return (blkcnt_t)OCFS2_I(inode)->ip_clusters << c_to_s_bits; 166 166 } 167 167 168 168 /* Validate that a bh contains a valid inode */
+1 -1
fs/ocfs2/move_extents.c
··· 404 404 * 'vict_blkno' was out of the valid range. 405 405 */ 406 406 if ((vict_blkno < le64_to_cpu(rec->c_blkno)) || 407 - (vict_blkno >= (le32_to_cpu(ac_dinode->id1.bitmap1.i_total) << 407 + (vict_blkno >= ((u64)le32_to_cpu(ac_dinode->id1.bitmap1.i_total) << 408 408 bits_per_unit))) { 409 409 ret = -EINVAL; 410 410 goto out;
+1 -1
fs/ocfs2/stack_user.c
··· 591 591 */ 592 592 ocfs2_control_this_node = -1; 593 593 running_proto.pv_major = 0; 594 - running_proto.pv_major = 0; 594 + running_proto.pv_minor = 0; 595 595 } 596 596 597 597 out:
+22 -16
fs/proc/base.c
··· 632 632 .release = single_release, 633 633 }; 634 634 635 + 636 + struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode) 637 + { 638 + struct task_struct *task = get_proc_task(inode); 639 + struct mm_struct *mm = ERR_PTR(-ESRCH); 640 + 641 + if (task) { 642 + mm = mm_access(task, mode); 643 + put_task_struct(task); 644 + 645 + if (!IS_ERR_OR_NULL(mm)) { 646 + /* ensure this mm_struct can't be freed */ 647 + atomic_inc(&mm->mm_count); 648 + /* but do not pin its memory */ 649 + mmput(mm); 650 + } 651 + } 652 + 653 + return mm; 654 + } 655 + 635 656 static int __mem_open(struct inode *inode, struct file *file, unsigned int mode) 636 657 { 637 - struct task_struct *task = get_proc_task(file_inode(file)); 638 - struct mm_struct *mm; 639 - 640 - if (!task) 641 - return -ESRCH; 642 - 643 - mm = mm_access(task, mode); 644 - put_task_struct(task); 658 + struct mm_struct *mm = proc_mem_open(inode, mode); 645 659 646 660 if (IS_ERR(mm)) 647 661 return PTR_ERR(mm); 648 662 649 - if (mm) { 650 - /* ensure this mm_struct can't be freed */ 651 - atomic_inc(&mm->mm_count); 652 - /* but do not pin its memory */ 653 - mmput(mm); 654 - } 655 - 656 663 file->private_data = mm; 657 - 658 664 return 0; 659 665 } 660 666
+4 -1
fs/proc/internal.h
··· 268 268 * task_[no]mmu.c 269 269 */ 270 270 struct proc_maps_private { 271 - struct pid *pid; 271 + struct inode *inode; 272 272 struct task_struct *task; 273 + struct mm_struct *mm; 273 274 #ifdef CONFIG_MMU 274 275 struct vm_area_struct *tail_vma; 275 276 #endif ··· 278 277 struct mempolicy *task_mempolicy; 279 278 #endif 280 279 }; 280 + 281 + struct mm_struct *proc_mem_open(struct inode *inode, unsigned int mode); 281 282 282 283 extern const struct file_operations proc_pid_maps_operations; 283 284 extern const struct file_operations proc_tid_maps_operations;
+3 -1
fs/proc/kcore.c
··· 610 610 struct kcore_list kcore_modules; 611 611 static void __init add_modules_range(void) 612 612 { 613 - kclist_add(&kcore_modules, (void *)MODULES_VADDR, 613 + if (MODULES_VADDR != VMALLOC_START && MODULES_END != VMALLOC_END) { 614 + kclist_add(&kcore_modules, (void *)MODULES_VADDR, 614 615 MODULES_END - MODULES_VADDR, KCORE_VMALLOC); 616 + } 615 617 } 616 618 #else 617 619 static void __init add_modules_range(void)
+3
fs/proc/page.c
··· 133 133 if (PageBuddy(page)) 134 134 u |= 1 << KPF_BUDDY; 135 135 136 + if (PageBalloon(page)) 137 + u |= 1 << KPF_BALLOON; 138 + 136 139 u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked); 137 140 138 141 u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
+168 -162
fs/proc/task_mmu.c
··· 87 87 88 88 #ifdef CONFIG_NUMA 89 89 /* 90 - * These functions are for numa_maps but called in generic **maps seq_file 91 - * ->start(), ->stop() ops. 92 - * 93 - * numa_maps scans all vmas under mmap_sem and checks their mempolicy. 94 - * Each mempolicy object is controlled by reference counting. The problem here 95 - * is how to avoid accessing dead mempolicy object. 96 - * 97 - * Because we're holding mmap_sem while reading seq_file, it's safe to access 98 - * each vma's mempolicy, no vma objects will never drop refs to mempolicy. 99 - * 100 - * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy 101 - * is set and replaced under mmap_sem but unrefed and cleared under task_lock(). 102 - * So, without task_lock(), we cannot trust get_vma_policy() because we cannot 103 - * gurantee the task never exits under us. But taking task_lock() around 104 - * get_vma_plicy() causes lock order problem. 105 - * 106 - * To access task->mempolicy without lock, we hold a reference count of an 107 - * object pointed by task->mempolicy and remember it. This will guarantee 108 - * that task->mempolicy points to an alive object or NULL in numa_maps accesses. 90 + * Save get_task_policy() for show_numa_map(). 
109 91 */ 110 92 static void hold_task_mempolicy(struct proc_maps_private *priv) 111 93 { 112 94 struct task_struct *task = priv->task; 113 95 114 96 task_lock(task); 115 - priv->task_mempolicy = task->mempolicy; 97 + priv->task_mempolicy = get_task_policy(task); 116 98 mpol_get(priv->task_mempolicy); 117 99 task_unlock(task); 118 100 } ··· 111 129 } 112 130 #endif 113 131 114 - static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma) 132 + static void vma_stop(struct proc_maps_private *priv) 115 133 { 116 - if (vma && vma != priv->tail_vma) { 117 - struct mm_struct *mm = vma->vm_mm; 118 - release_task_mempolicy(priv); 119 - up_read(&mm->mmap_sem); 120 - mmput(mm); 121 - } 134 + struct mm_struct *mm = priv->mm; 135 + 136 + release_task_mempolicy(priv); 137 + up_read(&mm->mmap_sem); 138 + mmput(mm); 122 139 } 123 140 124 - static void *m_start(struct seq_file *m, loff_t *pos) 141 + static struct vm_area_struct * 142 + m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) 143 + { 144 + if (vma == priv->tail_vma) 145 + return NULL; 146 + return vma->vm_next ?: priv->tail_vma; 147 + } 148 + 149 + static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma) 150 + { 151 + if (m->count < m->size) /* vma is copied successfully */ 152 + m->version = m_next_vma(m->private, vma) ? vma->vm_start : -1UL; 153 + } 154 + 155 + static void *m_start(struct seq_file *m, loff_t *ppos) 125 156 { 126 157 struct proc_maps_private *priv = m->private; 127 158 unsigned long last_addr = m->version; 128 159 struct mm_struct *mm; 129 - struct vm_area_struct *vma, *tail_vma = NULL; 130 - loff_t l = *pos; 160 + struct vm_area_struct *vma; 161 + unsigned int pos = *ppos; 131 162 132 - /* Clear the per syscall fields in priv */ 133 - priv->task = NULL; 134 - priv->tail_vma = NULL; 135 - 136 - /* 137 - * We remember last_addr rather than next_addr to hit with 138 - * vmacache most of the time. 
We have zero last_addr at 139 - * the beginning and also after lseek. We will have -1 last_addr 140 - * after the end of the vmas. 141 - */ 142 - 163 + /* See m_cache_vma(). Zero at the start or after lseek. */ 143 164 if (last_addr == -1UL) 144 165 return NULL; 145 166 146 - priv->task = get_pid_task(priv->pid, PIDTYPE_PID); 167 + priv->task = get_proc_task(priv->inode); 147 168 if (!priv->task) 148 169 return ERR_PTR(-ESRCH); 149 170 150 - mm = mm_access(priv->task, PTRACE_MODE_READ); 151 - if (!mm || IS_ERR(mm)) 152 - return mm; 171 + mm = priv->mm; 172 + if (!mm || !atomic_inc_not_zero(&mm->mm_users)) 173 + return NULL; 174 + 153 175 down_read(&mm->mmap_sem); 154 - 155 - tail_vma = get_gate_vma(priv->task->mm); 156 - priv->tail_vma = tail_vma; 157 176 hold_task_mempolicy(priv); 158 - /* Start with last addr hint */ 159 - vma = find_vma(mm, last_addr); 160 - if (last_addr && vma) { 161 - vma = vma->vm_next; 162 - goto out; 177 + priv->tail_vma = get_gate_vma(mm); 178 + 179 + if (last_addr) { 180 + vma = find_vma(mm, last_addr); 181 + if (vma && (vma = m_next_vma(priv, vma))) 182 + return vma; 163 183 } 164 184 165 - /* 166 - * Check the vma index is within the range and do 167 - * sequential scan until m_index. 168 - */ 169 - vma = NULL; 170 - if ((unsigned long)l < mm->map_count) { 171 - vma = mm->mmap; 172 - while (l-- && vma) 185 + m->version = 0; 186 + if (pos < mm->map_count) { 187 + for (vma = mm->mmap; pos; pos--) { 188 + m->version = vma->vm_start; 173 189 vma = vma->vm_next; 174 - goto out; 190 + } 191 + return vma; 175 192 } 176 193 177 - if (l != mm->map_count) 178 - tail_vma = NULL; /* After gate vma */ 194 + /* we do not bother to update m->version in this case */ 195 + if (pos == mm->map_count && priv->tail_vma) 196 + return priv->tail_vma; 179 197 180 - out: 181 - if (vma) 182 - return vma; 183 - 184 - release_task_mempolicy(priv); 185 - /* End of vmas has been reached */ 186 - m->version = (tail_vma != NULL)? 
0: -1UL; 187 - up_read(&mm->mmap_sem); 188 - mmput(mm); 189 - return tail_vma; 198 + vma_stop(priv); 199 + return NULL; 190 200 } 191 201 192 202 static void *m_next(struct seq_file *m, void *v, loff_t *pos) 193 203 { 194 204 struct proc_maps_private *priv = m->private; 195 - struct vm_area_struct *vma = v; 196 - struct vm_area_struct *tail_vma = priv->tail_vma; 205 + struct vm_area_struct *next; 197 206 198 207 (*pos)++; 199 - if (vma && (vma != tail_vma) && vma->vm_next) 200 - return vma->vm_next; 201 - vma_stop(priv, vma); 202 - return (vma != tail_vma)? tail_vma: NULL; 208 + next = m_next_vma(priv, v); 209 + if (!next) 210 + vma_stop(priv); 211 + return next; 203 212 } 204 213 205 214 static void m_stop(struct seq_file *m, void *v) 206 215 { 207 216 struct proc_maps_private *priv = m->private; 208 - struct vm_area_struct *vma = v; 209 217 210 - if (!IS_ERR(vma)) 211 - vma_stop(priv, vma); 212 - if (priv->task) 218 + if (!IS_ERR_OR_NULL(v)) 219 + vma_stop(priv); 220 + if (priv->task) { 213 221 put_task_struct(priv->task); 222 + priv->task = NULL; 223 + } 224 + } 225 + 226 + static int proc_maps_open(struct inode *inode, struct file *file, 227 + const struct seq_operations *ops, int psize) 228 + { 229 + struct proc_maps_private *priv = __seq_open_private(file, ops, psize); 230 + 231 + if (!priv) 232 + return -ENOMEM; 233 + 234 + priv->inode = inode; 235 + priv->mm = proc_mem_open(inode, PTRACE_MODE_READ); 236 + if (IS_ERR(priv->mm)) { 237 + int err = PTR_ERR(priv->mm); 238 + 239 + seq_release_private(inode, file); 240 + return err; 241 + } 242 + 243 + return 0; 244 + } 245 + 246 + static int proc_map_release(struct inode *inode, struct file *file) 247 + { 248 + struct seq_file *seq = file->private_data; 249 + struct proc_maps_private *priv = seq->private; 250 + 251 + if (priv->mm) 252 + mmdrop(priv->mm); 253 + 254 + return seq_release_private(inode, file); 214 255 } 215 256 216 257 static int do_maps_open(struct inode *inode, struct file *file, 217 258 const 
struct seq_operations *ops) 218 259 { 219 - struct proc_maps_private *priv; 220 - int ret = -ENOMEM; 221 - priv = kzalloc(sizeof(*priv), GFP_KERNEL); 222 - if (priv) { 223 - priv->pid = proc_pid(inode); 224 - ret = seq_open(file, ops); 225 - if (!ret) { 226 - struct seq_file *m = file->private_data; 227 - m->private = priv; 228 - } else { 229 - kfree(priv); 230 - } 260 + return proc_maps_open(inode, file, ops, 261 + sizeof(struct proc_maps_private)); 262 + } 263 + 264 + static pid_t pid_of_stack(struct proc_maps_private *priv, 265 + struct vm_area_struct *vma, bool is_pid) 266 + { 267 + struct inode *inode = priv->inode; 268 + struct task_struct *task; 269 + pid_t ret = 0; 270 + 271 + rcu_read_lock(); 272 + task = pid_task(proc_pid(inode), PIDTYPE_PID); 273 + if (task) { 274 + task = task_of_stack(task, vma, is_pid); 275 + if (task) 276 + ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info); 231 277 } 278 + rcu_read_unlock(); 279 + 232 280 return ret; 233 281 } 234 282 ··· 268 256 struct mm_struct *mm = vma->vm_mm; 269 257 struct file *file = vma->vm_file; 270 258 struct proc_maps_private *priv = m->private; 271 - struct task_struct *task = priv->task; 272 259 vm_flags_t flags = vma->vm_flags; 273 260 unsigned long ino = 0; 274 261 unsigned long long pgoff = 0; ··· 332 321 goto done; 333 322 } 334 323 335 - tid = vm_is_stack(task, vma, is_pid); 336 - 324 + tid = pid_of_stack(priv, vma, is_pid); 337 325 if (tid != 0) { 338 326 /* 339 327 * Thread stack in /proc/PID/task/TID/maps or ··· 359 349 360 350 static int show_map(struct seq_file *m, void *v, int is_pid) 361 351 { 362 - struct vm_area_struct *vma = v; 363 - struct proc_maps_private *priv = m->private; 364 - struct task_struct *task = priv->task; 365 - 366 - show_map_vma(m, vma, is_pid); 367 - 368 - if (m->count < m->size) /* vma is copied successfully */ 369 - m->version = (vma != get_gate_vma(task->mm)) 370 - ? 
vma->vm_start : 0; 352 + show_map_vma(m, v, is_pid); 353 + m_cache_vma(m, v); 371 354 return 0; 372 355 } 373 356 ··· 402 399 .open = pid_maps_open, 403 400 .read = seq_read, 404 401 .llseek = seq_lseek, 405 - .release = seq_release_private, 402 + .release = proc_map_release, 406 403 }; 407 404 408 405 const struct file_operations proc_tid_maps_operations = { 409 406 .open = tid_maps_open, 410 407 .read = seq_read, 411 408 .llseek = seq_lseek, 412 - .release = seq_release_private, 409 + .release = proc_map_release, 413 410 }; 414 411 415 412 /* ··· 586 583 587 584 static int show_smap(struct seq_file *m, void *v, int is_pid) 588 585 { 589 - struct proc_maps_private *priv = m->private; 590 - struct task_struct *task = priv->task; 591 586 struct vm_area_struct *vma = v; 592 587 struct mem_size_stats mss; 593 588 struct mm_walk smaps_walk = { ··· 638 637 mss.nonlinear >> 10); 639 638 640 639 show_smap_vma_flags(m, vma); 641 - 642 - if (m->count < m->size) /* vma is copied successfully */ 643 - m->version = (vma != get_gate_vma(task->mm)) 644 - ? 
vma->vm_start : 0; 640 + m_cache_vma(m, vma); 645 641 return 0; 646 642 } 647 643 ··· 680 682 .open = pid_smaps_open, 681 683 .read = seq_read, 682 684 .llseek = seq_lseek, 683 - .release = seq_release_private, 685 + .release = proc_map_release, 684 686 }; 685 687 686 688 const struct file_operations proc_tid_smaps_operations = { 687 689 .open = tid_smaps_open, 688 690 .read = seq_read, 689 691 .llseek = seq_lseek, 690 - .release = seq_release_private, 692 + .release = proc_map_release, 691 693 }; 692 694 693 695 /* ··· 1027 1029 spinlock_t *ptl; 1028 1030 pte_t *pte; 1029 1031 int err = 0; 1030 - pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); 1031 1032 1032 1033 /* find the first VMA at or above 'addr' */ 1033 1034 vma = find_vma(walk->mm, addr); ··· 1040 1043 1041 1044 for (; addr != end; addr += PAGE_SIZE) { 1042 1045 unsigned long offset; 1046 + pagemap_entry_t pme; 1043 1047 1044 1048 offset = (addr & ~PAGEMAP_WALK_MASK) >> 1045 1049 PAGE_SHIFT; ··· 1055 1057 1056 1058 if (pmd_trans_unstable(pmd)) 1057 1059 return 0; 1058 - for (; addr != end; addr += PAGE_SIZE) { 1059 - int flags2; 1060 1060 1061 - /* check to see if we've left 'vma' behind 1062 - * and need a new, higher one */ 1063 - if (vma && (addr >= vma->vm_end)) { 1064 - vma = find_vma(walk->mm, addr); 1065 - if (vma && (vma->vm_flags & VM_SOFTDIRTY)) 1066 - flags2 = __PM_SOFT_DIRTY; 1067 - else 1068 - flags2 = 0; 1069 - pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2)); 1061 + while (1) { 1062 + /* End of address space hole, which we mark as non-present. 
*/ 1063 + unsigned long hole_end; 1064 + 1065 + if (vma) 1066 + hole_end = min(end, vma->vm_start); 1067 + else 1068 + hole_end = end; 1069 + 1070 + for (; addr < hole_end; addr += PAGE_SIZE) { 1071 + pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2)); 1072 + 1073 + err = add_to_pagemap(addr, &pme, pm); 1074 + if (err) 1075 + return err; 1070 1076 } 1071 1077 1072 - /* check that 'vma' actually covers this address, 1073 - * and that it isn't a huge page vma */ 1074 - if (vma && (vma->vm_start <= addr) && 1075 - !is_vm_hugetlb_page(vma)) { 1078 + if (!vma || vma->vm_start >= end) 1079 + break; 1080 + /* 1081 + * We can't possibly be in a hugetlb VMA. In general, 1082 + * for a mm_walk with a pmd_entry and a hugetlb_entry, 1083 + * the pmd_entry can only be called on addresses in a 1084 + * hugetlb if the walk starts in a non-hugetlb VMA and 1085 + * spans a hugepage VMA. Since pagemap_read walks are 1086 + * PMD-sized and PMD-aligned, this will never be true. 1087 + */ 1088 + BUG_ON(is_vm_hugetlb_page(vma)); 1089 + 1090 + /* Addresses in the VMA. 
*/ 1091 + for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) { 1092 + pagemap_entry_t pme; 1076 1093 pte = pte_offset_map(pmd, addr); 1077 1094 pte_to_pagemap_entry(&pme, pm, vma, addr, *pte); 1078 - /* unmap before userspace copy */ 1079 1095 pte_unmap(pte); 1096 + err = add_to_pagemap(addr, &pme, pm); 1097 + if (err) 1098 + return err; 1080 1099 } 1081 - err = add_to_pagemap(addr, &pme, pm); 1082 - if (err) 1083 - return err; 1100 + 1101 + if (addr == end) 1102 + break; 1103 + 1104 + vma = find_vma(walk->mm, addr); 1084 1105 } 1085 1106 1086 1107 cond_resched(); ··· 1432 1415 struct vm_area_struct *vma = v; 1433 1416 struct numa_maps *md = &numa_priv->md; 1434 1417 struct file *file = vma->vm_file; 1435 - struct task_struct *task = proc_priv->task; 1436 1418 struct mm_struct *mm = vma->vm_mm; 1437 1419 struct mm_walk walk = {}; 1438 1420 struct mempolicy *pol; ··· 1451 1435 walk.private = md; 1452 1436 walk.mm = mm; 1453 1437 1454 - pol = get_vma_policy(task, vma, vma->vm_start); 1455 - mpol_to_str(buffer, sizeof(buffer), pol); 1456 - mpol_cond_put(pol); 1438 + pol = __get_vma_policy(vma, vma->vm_start); 1439 + if (pol) { 1440 + mpol_to_str(buffer, sizeof(buffer), pol); 1441 + mpol_cond_put(pol); 1442 + } else { 1443 + mpol_to_str(buffer, sizeof(buffer), proc_priv->task_mempolicy); 1444 + } 1457 1445 1458 1446 seq_printf(m, "%08lx %s", vma->vm_start, buffer); 1459 1447 ··· 1467 1447 } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { 1468 1448 seq_puts(m, " heap"); 1469 1449 } else { 1470 - pid_t tid = vm_is_stack(task, vma, is_pid); 1450 + pid_t tid = pid_of_stack(proc_priv, vma, is_pid); 1471 1451 if (tid != 0) { 1472 1452 /* 1473 1453 * Thread stack in /proc/PID/task/TID/maps or ··· 1515 1495 seq_printf(m, " N%d=%lu", nid, md->node[nid]); 1516 1496 out: 1517 1497 seq_putc(m, '\n'); 1518 - 1519 - if (m->count < m->size) 1520 - m->version = (vma != proc_priv->tail_vma) ? 
vma->vm_start : 0; 1498 + m_cache_vma(m, vma); 1521 1499 return 0; 1522 1500 } 1523 1501 ··· 1546 1528 static int numa_maps_open(struct inode *inode, struct file *file, 1547 1529 const struct seq_operations *ops) 1548 1530 { 1549 - struct numa_maps_private *priv; 1550 - int ret = -ENOMEM; 1551 - priv = kzalloc(sizeof(*priv), GFP_KERNEL); 1552 - if (priv) { 1553 - priv->proc_maps.pid = proc_pid(inode); 1554 - ret = seq_open(file, ops); 1555 - if (!ret) { 1556 - struct seq_file *m = file->private_data; 1557 - m->private = priv; 1558 - } else { 1559 - kfree(priv); 1560 - } 1561 - } 1562 - return ret; 1531 + return proc_maps_open(inode, file, ops, 1532 + sizeof(struct numa_maps_private)); 1563 1533 } 1564 1534 1565 1535 static int pid_numa_maps_open(struct inode *inode, struct file *file) ··· 1564 1558 .open = pid_numa_maps_open, 1565 1559 .read = seq_read, 1566 1560 .llseek = seq_lseek, 1567 - .release = seq_release_private, 1561 + .release = proc_map_release, 1568 1562 }; 1569 1563 1570 1564 const struct file_operations proc_tid_numa_maps_operations = { 1571 1565 .open = tid_numa_maps_open, 1572 1566 .read = seq_read, 1573 1567 .llseek = seq_lseek, 1574 - .release = seq_release_private, 1568 + .release = proc_map_release, 1575 1569 }; 1576 1570 #endif /* CONFIG_NUMA */
+60 -26
fs/proc/task_nommu.c
··· 123 123 return size; 124 124 } 125 125 126 + static pid_t pid_of_stack(struct proc_maps_private *priv, 127 + struct vm_area_struct *vma, bool is_pid) 128 + { 129 + struct inode *inode = priv->inode; 130 + struct task_struct *task; 131 + pid_t ret = 0; 132 + 133 + rcu_read_lock(); 134 + task = pid_task(proc_pid(inode), PIDTYPE_PID); 135 + if (task) { 136 + task = task_of_stack(task, vma, is_pid); 137 + if (task) 138 + ret = task_pid_nr_ns(task, inode->i_sb->s_fs_info); 139 + } 140 + rcu_read_unlock(); 141 + 142 + return ret; 143 + } 144 + 126 145 /* 127 146 * display a single VMA to a sequenced file 128 147 */ ··· 182 163 seq_pad(m, ' '); 183 164 seq_path(m, &file->f_path, ""); 184 165 } else if (mm) { 185 - pid_t tid = vm_is_stack(priv->task, vma, is_pid); 166 + pid_t tid = pid_of_stack(priv, vma, is_pid); 186 167 187 168 if (tid != 0) { 188 169 seq_pad(m, ' '); ··· 231 212 loff_t n = *pos; 232 213 233 214 /* pin the task and mm whilst we play with them */ 234 - priv->task = get_pid_task(priv->pid, PIDTYPE_PID); 215 + priv->task = get_proc_task(priv->inode); 235 216 if (!priv->task) 236 217 return ERR_PTR(-ESRCH); 237 218 238 - mm = mm_access(priv->task, PTRACE_MODE_READ); 239 - if (!mm || IS_ERR(mm)) { 240 - put_task_struct(priv->task); 241 - priv->task = NULL; 242 - return mm; 243 - } 244 - down_read(&mm->mmap_sem); 219 + mm = priv->mm; 220 + if (!mm || !atomic_inc_not_zero(&mm->mm_users)) 221 + return NULL; 245 222 223 + down_read(&mm->mmap_sem); 246 224 /* start from the Nth VMA */ 247 225 for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) 248 226 if (n-- == 0) 249 227 return p; 228 + 229 + up_read(&mm->mmap_sem); 230 + mmput(mm); 250 231 return NULL; 251 232 } 252 233 ··· 254 235 { 255 236 struct proc_maps_private *priv = m->private; 256 237 238 + if (!IS_ERR_OR_NULL(_vml)) { 239 + up_read(&priv->mm->mmap_sem); 240 + mmput(priv->mm); 241 + } 257 242 if (priv->task) { 258 - struct mm_struct *mm = priv->task->mm; 259 - up_read(&mm->mmap_sem); 260 - mmput(mm); 
261 243 put_task_struct(priv->task); 244 + priv->task = NULL; 262 245 } 263 246 } 264 247 ··· 290 269 const struct seq_operations *ops) 291 270 { 292 271 struct proc_maps_private *priv; 293 - int ret = -ENOMEM; 294 272 295 - priv = kzalloc(sizeof(*priv), GFP_KERNEL); 296 - if (priv) { 297 - priv->pid = proc_pid(inode); 298 - ret = seq_open(file, ops); 299 - if (!ret) { 300 - struct seq_file *m = file->private_data; 301 - m->private = priv; 302 - } else { 303 - kfree(priv); 304 - } 273 + priv = __seq_open_private(file, ops, sizeof(*priv)); 274 + if (!priv) 275 + return -ENOMEM; 276 + 277 + priv->inode = inode; 278 + priv->mm = proc_mem_open(inode, PTRACE_MODE_READ); 279 + if (IS_ERR(priv->mm)) { 280 + int err = PTR_ERR(priv->mm); 281 + 282 + seq_release_private(inode, file); 283 + return err; 305 284 } 306 - return ret; 285 + 286 + return 0; 287 + } 288 + 289 + 290 + static int map_release(struct inode *inode, struct file *file) 291 + { 292 + struct seq_file *seq = file->private_data; 293 + struct proc_maps_private *priv = seq->private; 294 + 295 + if (priv->mm) 296 + mmdrop(priv->mm); 297 + 298 + return seq_release_private(inode, file); 307 299 } 308 300 309 301 static int pid_maps_open(struct inode *inode, struct file *file) ··· 333 299 .open = pid_maps_open, 334 300 .read = seq_read, 335 301 .llseek = seq_lseek, 336 - .release = seq_release_private, 302 + .release = map_release, 337 303 }; 338 304 339 305 const struct file_operations proc_tid_maps_operations = { 340 306 .open = tid_maps_open, 341 307 .read = seq_read, 342 308 .llseek = seq_lseek, 343 - .release = seq_release_private, 309 + .release = map_release, 344 310 }; 345 311
+9
include/asm-generic/dma-mapping-common.h
··· 179 179 extern int dma_common_mmap(struct device *dev, struct vm_area_struct *vma, 180 180 void *cpu_addr, dma_addr_t dma_addr, size_t size); 181 181 182 + void *dma_common_contiguous_remap(struct page *page, size_t size, 183 + unsigned long vm_flags, 184 + pgprot_t prot, const void *caller); 185 + 186 + void *dma_common_pages_remap(struct page **pages, size_t size, 187 + unsigned long vm_flags, pgprot_t prot, 188 + const void *caller); 189 + void dma_common_free_remap(void *cpu_addr, size_t size, unsigned long vm_flags); 190 + 182 191 /** 183 192 * dma_mmap_attrs - map a coherent DMA allocation into user space 184 193 * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
+9 -18
include/asm-generic/pgtable.h
··· 664 664 } 665 665 666 666 #ifdef CONFIG_NUMA_BALANCING 667 - #ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE 668 667 /* 669 - * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the 670 - * same bit too). It's set only when _PAGE_PRESET is not set and it's 671 - * never set if _PAGE_PRESENT is set. 668 + * _PAGE_NUMA distinguishes between an unmapped page table entry, an entry that 669 + * is protected for PROT_NONE and a NUMA hinting fault entry. If the 670 + * architecture defines __PAGE_PROTNONE then it should take that into account 671 + * but those that do not can rely on the fact that the NUMA hinting scanner 672 + * skips inaccessible VMAs. 672 673 * 673 674 * pte/pmd_present() returns true if pte/pmd_numa returns true. Page 674 675 * fault triggers on those regions if pte/pmd_numa returns true ··· 678 677 #ifndef pte_numa 679 678 static inline int pte_numa(pte_t pte) 680 679 { 681 - return (pte_flags(pte) & 682 - (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA; 680 + return ptenuma_flags(pte) == _PAGE_NUMA; 683 681 } 684 682 #endif 685 683 686 684 #ifndef pmd_numa 687 685 static inline int pmd_numa(pmd_t pmd) 688 686 { 689 - return (pmd_flags(pmd) & 690 - (_PAGE_NUMA|_PAGE_PROTNONE|_PAGE_PRESENT)) == _PAGE_NUMA; 687 + return pmdnuma_flags(pmd) == _PAGE_NUMA; 691 688 } 692 689 #endif 693 690 ··· 724 725 static inline pte_t pte_mknuma(pte_t pte) 725 726 { 726 727 pteval_t val = pte_val(pte); 728 + 729 + VM_BUG_ON(!(val & _PAGE_PRESENT)); 727 730 728 731 val &= ~_PAGE_PRESENT; 729 732 val |= _PAGE_NUMA; ··· 769 768 return; 770 769 } 771 770 #endif 772 - #else 773 - extern int pte_numa(pte_t pte); 774 - extern int pmd_numa(pmd_t pmd); 775 - extern pte_t pte_mknonnuma(pte_t pte); 776 - extern pmd_t pmd_mknonnuma(pmd_t pmd); 777 - extern pte_t pte_mknuma(pte_t pte); 778 - extern pmd_t pmd_mknuma(pmd_t pmd); 779 - extern void ptep_set_numa(struct mm_struct *mm, unsigned long addr, pte_t *ptep); 780 - extern void pmdp_set_numa(struct mm_struct *mm, 
unsigned long addr, pmd_t *pmdp); 781 - #endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */ 782 771 #else 783 772 static inline int pmd_numa(pmd_t pmd) 784 773 {
+4
include/asm-generic/sections.h
··· 3 3 4 4 /* References to section boundaries */ 5 5 6 + #include <linux/compiler.h> 7 + 6 8 /* 7 9 * Usage guidelines: 8 10 * _text, _data: architecture specific, don't use them in arch-independent code ··· 38 36 39 37 /* Start and end of .ctors section - used for constructor calls. */ 40 38 extern char __ctors_start[], __ctors_end[]; 39 + 40 + extern __visible const void __nosave_begin, __nosave_end; 41 41 42 42 /* function descriptor handling (if any). Override 43 43 * in asm/sections.h */
+45 -126
include/linux/balloon_compaction.h
··· 27 27 * counter raised only while it is under our special handling; 28 28 * 29 29 * iii. after the lockless scan step have selected a potential balloon page for 30 - * isolation, re-test the page->mapping flags and the page ref counter 30 + * isolation, re-test the PageBalloon mark and the PagePrivate flag 31 31 * under the proper page lock, to ensure isolating a valid balloon page 32 32 * (not yet isolated, nor under release procedure) 33 + * 34 + * iv. isolation or dequeueing procedure must clear PagePrivate flag under 35 + * page lock together with removing page from balloon device page list. 33 36 * 34 37 * The functions provided by this interface are placed to help on coping with 35 38 * the aforementioned balloon page corner case, as well as to ensure the simple ··· 57 54 * balloon driver as a page book-keeper for its registered balloon devices. 58 55 */ 59 56 struct balloon_dev_info { 60 - void *balloon_device; /* balloon device descriptor */ 61 - struct address_space *mapping; /* balloon special page->mapping */ 62 57 unsigned long isolated_pages; /* # of isolated pages for migration */ 63 58 spinlock_t pages_lock; /* Protection to pages list */ 64 59 struct list_head pages; /* Pages enqueued & handled to Host */ 60 + int (*migratepage)(struct balloon_dev_info *, struct page *newpage, 61 + struct page *page, enum migrate_mode mode); 65 62 }; 66 63 67 64 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info); 68 65 extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info); 69 - extern struct balloon_dev_info *balloon_devinfo_alloc( 70 - void *balloon_dev_descriptor); 71 66 72 - static inline void balloon_devinfo_free(struct balloon_dev_info *b_dev_info) 67 + static inline void balloon_devinfo_init(struct balloon_dev_info *balloon) 73 68 { 74 - kfree(b_dev_info); 75 - } 76 - 77 - /* 78 - * balloon_page_free - release a balloon page back to the page free lists 79 - * @page: ballooned page to be set free 80 - * 81 
- * This function must be used to properly set free an isolated/dequeued balloon 82 - * page at the end of a sucessful page migration, or at the balloon driver's 83 - * page release procedure. 84 - */ 85 - static inline void balloon_page_free(struct page *page) 86 - { 87 - /* 88 - * Balloon pages always get an extra refcount before being isolated 89 - * and before being dequeued to help on sorting out fortuite colisions 90 - * between a thread attempting to isolate and another thread attempting 91 - * to release the very same balloon page. 92 - * 93 - * Before we handle the page back to Buddy, lets drop its extra refcnt. 94 - */ 95 - put_page(page); 96 - __free_page(page); 69 + balloon->isolated_pages = 0; 70 + spin_lock_init(&balloon->pages_lock); 71 + INIT_LIST_HEAD(&balloon->pages); 72 + balloon->migratepage = NULL; 97 73 } 98 74 99 75 #ifdef CONFIG_BALLOON_COMPACTION ··· 80 98 extern void balloon_page_putback(struct page *page); 81 99 extern int balloon_page_migrate(struct page *newpage, 82 100 struct page *page, enum migrate_mode mode); 83 - extern struct address_space 84 - *balloon_mapping_alloc(struct balloon_dev_info *b_dev_info, 85 - const struct address_space_operations *a_ops); 86 - 87 - static inline void balloon_mapping_free(struct address_space *balloon_mapping) 88 - { 89 - kfree(balloon_mapping); 90 - } 91 101 92 102 /* 93 - * page_flags_cleared - helper to perform balloon @page ->flags tests. 94 - * 95 - * As balloon pages are obtained from buddy and we do not play with page->flags 96 - * at driver level (exception made when we get the page lock for compaction), 97 - * we can safely identify a ballooned page by checking if the 98 - * PAGE_FLAGS_CHECK_AT_PREP page->flags are all cleared. 
This approach also 99 - * helps us skip ballooned pages that are locked for compaction or release, thus 100 - * mitigating their racy check at balloon_page_movable() 101 - */ 102 - static inline bool page_flags_cleared(struct page *page) 103 - { 104 - return !(page->flags & PAGE_FLAGS_CHECK_AT_PREP); 105 - } 106 - 107 - /* 108 - * __is_movable_balloon_page - helper to perform @page mapping->flags tests 103 + * __is_movable_balloon_page - helper to perform @page PageBalloon tests 109 104 */ 110 105 static inline bool __is_movable_balloon_page(struct page *page) 111 106 { 112 - struct address_space *mapping = page->mapping; 113 - return mapping_balloon(mapping); 107 + return PageBalloon(page); 114 108 } 115 109 116 110 /* 117 - * balloon_page_movable - test page->mapping->flags to identify balloon pages 118 - * that can be moved by compaction/migration. 119 - * 120 - * This function is used at core compaction's page isolation scheme, therefore 121 - * most pages exposed to it are not enlisted as balloon pages and so, to avoid 122 - * undesired side effects like racing against __free_pages(), we cannot afford 123 - * holding the page locked while testing page->mapping->flags here. 111 + * balloon_page_movable - test PageBalloon to identify balloon pages 112 + * and PagePrivate to check that the page is not 113 + * isolated and can be moved by compaction/migration. 124 114 * 125 115 * As we might return false positives in the case of a balloon page being just 126 - * released under us, the page->mapping->flags need to be re-tested later, 127 - * under the proper page lock, at the functions that will be coping with the 128 - * balloon page case. 116 + * released under us, this need to be re-tested later, under the page lock. 
129 117 */ 130 118 static inline bool balloon_page_movable(struct page *page) 131 119 { 132 - /* 133 - * Before dereferencing and testing mapping->flags, let's make sure 134 - * this is not a page that uses ->mapping in a different way 135 - */ 136 - if (page_flags_cleared(page) && !page_mapped(page) && 137 - page_count(page) == 1) 138 - return __is_movable_balloon_page(page); 139 - 140 - return false; 120 + return PageBalloon(page) && PagePrivate(page); 141 121 } 142 122 143 123 /* 144 124 * isolated_balloon_page - identify an isolated balloon page on private 145 125 * compaction/migration page lists. 146 - * 147 - * After a compaction thread isolates a balloon page for migration, it raises 148 - * the page refcount to prevent concurrent compaction threads from re-isolating 149 - * the same page. For that reason putback_movable_pages(), or other routines 150 - * that need to identify isolated balloon pages on private pagelists, cannot 151 - * rely on balloon_page_movable() to accomplish the task. 152 126 */ 153 127 static inline bool isolated_balloon_page(struct page *page) 154 128 { 155 - /* Already isolated balloon pages, by default, have a raised refcount */ 156 - if (page_flags_cleared(page) && !page_mapped(page) && 157 - page_count(page) >= 2) 158 - return __is_movable_balloon_page(page); 159 - 160 - return false; 129 + return PageBalloon(page); 161 130 } 162 131 163 132 /* 164 133 * balloon_page_insert - insert a page into the balloon's page list and make 165 - * the page->mapping assignment accordingly. 134 + * the page->private assignment accordingly. 135 + * @balloon : pointer to balloon device 166 136 * @page : page to be assigned as a 'balloon page' 167 - * @mapping : allocated special 'balloon_mapping' 168 - * @head : balloon's device page list head 169 137 * 170 138 * Caller must ensure the page is locked and the spin_lock protecting balloon 171 139 * pages list is held before inserting a page into the balloon device. 
172 140 */ 173 - static inline void balloon_page_insert(struct page *page, 174 - struct address_space *mapping, 175 - struct list_head *head) 141 + static inline void balloon_page_insert(struct balloon_dev_info *balloon, 142 + struct page *page) 176 143 { 177 - page->mapping = mapping; 178 - list_add(&page->lru, head); 144 + __SetPageBalloon(page); 145 + SetPagePrivate(page); 146 + set_page_private(page, (unsigned long)balloon); 147 + list_add(&page->lru, &balloon->pages); 179 148 } 180 149 181 150 /* 182 151 * balloon_page_delete - delete a page from balloon's page list and clear 183 - * the page->mapping assignement accordingly. 152 + * the page->private assignement accordingly. 184 153 * @page : page to be released from balloon's page list 185 154 * 186 155 * Caller must ensure the page is locked and the spin_lock protecting balloon ··· 139 206 */ 140 207 static inline void balloon_page_delete(struct page *page) 141 208 { 142 - page->mapping = NULL; 143 - list_del(&page->lru); 209 + __ClearPageBalloon(page); 210 + set_page_private(page, 0); 211 + if (PagePrivate(page)) { 212 + ClearPagePrivate(page); 213 + list_del(&page->lru); 214 + } 144 215 } 145 216 146 217 /* ··· 153 216 */ 154 217 static inline struct balloon_dev_info *balloon_page_device(struct page *page) 155 218 { 156 - struct address_space *mapping = page->mapping; 157 - if (likely(mapping)) 158 - return mapping->private_data; 159 - 160 - return NULL; 219 + return (struct balloon_dev_info *)page_private(page); 161 220 } 162 221 163 222 static inline gfp_t balloon_mapping_gfp_mask(void) ··· 161 228 return GFP_HIGHUSER_MOVABLE; 162 229 } 163 230 164 - static inline bool balloon_compaction_check(void) 165 - { 166 - return true; 167 - } 168 - 169 231 #else /* !CONFIG_BALLOON_COMPACTION */ 170 232 171 - static inline void *balloon_mapping_alloc(void *balloon_device, 172 - const struct address_space_operations *a_ops) 233 + static inline void balloon_page_insert(struct balloon_dev_info *balloon, 234 + struct 
page *page) 173 235 { 174 - return ERR_PTR(-EOPNOTSUPP); 175 - } 176 - 177 - static inline void balloon_mapping_free(struct address_space *balloon_mapping) 178 - { 179 - return; 180 - } 181 - 182 - static inline void balloon_page_insert(struct page *page, 183 - struct address_space *mapping, 184 - struct list_head *head) 185 - { 186 - list_add(&page->lru, head); 236 + __SetPageBalloon(page); 237 + list_add(&page->lru, &balloon->pages); 187 238 } 188 239 189 240 static inline void balloon_page_delete(struct page *page) 190 241 { 242 + __ClearPageBalloon(page); 191 243 list_del(&page->lru); 244 + } 245 + 246 + static inline bool __is_movable_balloon_page(struct page *page) 247 + { 248 + return false; 192 249 } 193 250 194 251 static inline bool balloon_page_movable(struct page *page) ··· 212 289 return GFP_HIGHUSER; 213 290 } 214 291 215 - static inline bool balloon_compaction_check(void) 216 - { 217 - return false; 218 - } 219 292 #endif /* CONFIG_BALLOON_COMPACTION */ 220 293 #endif /* _LINUX_BALLOON_COMPACTION_H */
+1 -1
include/linux/blkdev.h
··· 1564 1564 } 1565 1565 static inline struct blk_integrity *bdev_get_integrity(struct block_device *b) 1566 1566 { 1567 - return 0; 1567 + return NULL; 1568 1568 } 1569 1569 static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk) 1570 1570 {
+18 -6
include/linux/compaction.h
··· 2 2 #define _LINUX_COMPACTION_H 3 3 4 4 /* Return values for compact_zone() and try_to_compact_pages() */ 5 + /* compaction didn't start as it was deferred due to past failures */ 6 + #define COMPACT_DEFERRED 0 5 7 /* compaction didn't start as it was not possible or direct reclaim was more suitable */ 6 - #define COMPACT_SKIPPED 0 8 + #define COMPACT_SKIPPED 1 7 9 /* compaction should continue to another pageblock */ 8 - #define COMPACT_CONTINUE 1 10 + #define COMPACT_CONTINUE 2 9 11 /* direct compaction partially compacted a zone and there are suitable pages */ 10 - #define COMPACT_PARTIAL 2 12 + #define COMPACT_PARTIAL 3 11 13 /* The full zone was compacted */ 12 - #define COMPACT_COMPLETE 3 14 + #define COMPACT_COMPLETE 4 15 + 16 + /* Used to signal whether compaction detected need_sched() or lock contention */ 17 + /* No contention detected */ 18 + #define COMPACT_CONTENDED_NONE 0 19 + /* Either need_sched() was true or fatal signal pending */ 20 + #define COMPACT_CONTENDED_SCHED 1 21 + /* Zone lock or lru_lock was contended in async compaction */ 22 + #define COMPACT_CONTENDED_LOCK 2 13 23 14 24 #ifdef CONFIG_COMPACTION 15 25 extern int sysctl_compact_memory; ··· 32 22 extern int fragmentation_index(struct zone *zone, unsigned int order); 33 23 extern unsigned long try_to_compact_pages(struct zonelist *zonelist, 34 24 int order, gfp_t gfp_mask, nodemask_t *mask, 35 - enum migrate_mode mode, bool *contended); 25 + enum migrate_mode mode, int *contended, 26 + struct zone **candidate_zone); 36 27 extern void compact_pgdat(pg_data_t *pgdat, int order); 37 28 extern void reset_isolation_suitable(pg_data_t *pgdat); 38 29 extern unsigned long compaction_suitable(struct zone *zone, int order); ··· 102 91 #else 103 92 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist, 104 93 int order, gfp_t gfp_mask, nodemask_t *nodemask, 105 - enum migrate_mode mode, bool *contended) 94 + enum migrate_mode mode, int *contended, 95 + struct zone **candidate_zone) 106 96 { 107 97 return COMPACT_CONTINUE; 108 98 }
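The compaction.h hunk renumbers the return codes to make room for COMPACT_DEFERRED at 0 and replaces the boolean `contended` with a three-valued reason code. A user-space sketch of how a caller might act on the pair — the constants are copied from the hunk above, but `should_retry_compaction()` is a hypothetical policy helper for illustration, not a kernel function:

```c
#include <assert.h>

/* Constants copied from the hunk above. */
#define COMPACT_DEFERRED 0
#define COMPACT_SKIPPED  1
#define COMPACT_CONTINUE 2
#define COMPACT_PARTIAL  3
#define COMPACT_COMPLETE 4

#define COMPACT_CONTENDED_NONE  0
#define COMPACT_CONTENDED_SCHED 1
#define COMPACT_CONTENDED_LOCK  2

/* Hypothetical caller-side policy: deferred compaction and
 * need_resched()/fatal-signal contention both mean "stop trying";
 * async-mode lock contention does not by itself. */
static int should_retry_compaction(int result, int contended)
{
	if (result == COMPACT_DEFERRED)
		return 0;
	if (contended == COMPACT_CONTENDED_SCHED)
		return 0;
	return result == COMPACT_CONTINUE || result == COMPACT_PARTIAL;
}
```

Splitting the reason code out of a plain bool is what lets the page allocator distinguish "back off entirely" from "just retry synchronously".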
+7
include/linux/genalloc.h
··· 110 110 extern unsigned long gen_pool_first_fit(unsigned long *map, unsigned long size, 111 111 unsigned long start, unsigned int nr, void *data); 112 112 113 + extern unsigned long gen_pool_first_fit_order_align(unsigned long *map, 114 + unsigned long size, unsigned long start, unsigned int nr, 115 + void *data); 116 + 113 117 extern unsigned long gen_pool_best_fit(unsigned long *map, unsigned long size, 114 118 unsigned long start, unsigned int nr, void *data); 115 119 116 120 extern struct gen_pool *devm_gen_pool_create(struct device *dev, 117 121 int min_alloc_order, int nid); 118 122 extern struct gen_pool *dev_get_gen_pool(struct device *dev); 123 + 124 + bool addr_in_gen_pool(struct gen_pool *pool, unsigned long start, 125 + size_t size); 119 126 120 127 #ifdef CONFIG_OF 121 128 extern struct gen_pool *of_get_named_gen_pool(struct device_node *np,
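The new gen_pool_first_fit_order_align() allocator aligns the start of each allocation to the allocation size rounded up to a power of two. A simplified user-space sketch of that search over a byte-per-bit map — an illustration of the idea only; the kernel scans a real bitmap of unsigned longs:

```c
#include <assert.h>
#include <stddef.h>

/* Order-aligned first fit: candidate start positions step by the
 * request size rounded up to a power of two. */
static size_t first_fit_order_align(const unsigned char *used,
				    size_t size, unsigned int nr)
{
	size_t align = 1;

	while (align < nr)		/* round nr up to a power of two */
		align <<= 1;

	for (size_t start = 0; start + nr <= size; start += align) {
		unsigned int i;

		for (i = 0; i < nr && !used[start + i]; i++)
			;
		if (i == nr)
			return start;	/* aligned free run found */
	}
	return size;			/* no fit */
}
```

The companion addr_in_gen_pool() addition lets callers ask whether an address range belongs to a pool at all, without attempting an allocation.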
+1 -1
include/linux/gfp.h
··· 156 156 #define GFP_DMA32 __GFP_DMA32 157 157 158 158 /* Convert GFP flags to their corresponding migrate type */ 159 - static inline int allocflags_to_migratetype(gfp_t gfp_flags) 159 + static inline int gfpflags_to_migratetype(const gfp_t gfp_flags) 160 160 { 161 161 WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK); 162 162
+1 -1
include/linux/huge_mm.h
··· 132 132 static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma, 133 133 spinlock_t **ptl) 134 134 { 135 - VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem)); 135 + VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma); 136 136 if (pmd_trans_huge(*pmd)) 137 137 return __pmd_trans_huge_lock(pmd, vma, ptl); 138 138 else
+12 -44
include/linux/kernel.h
··· 715 715 (void) (&_max1 == &_max2); \ 716 716 _max1 > _max2 ? _max1 : _max2; }) 717 717 718 - #define min3(x, y, z) ({ \ 719 - typeof(x) _min1 = (x); \ 720 - typeof(y) _min2 = (y); \ 721 - typeof(z) _min3 = (z); \ 722 - (void) (&_min1 == &_min2); \ 723 - (void) (&_min1 == &_min3); \ 724 - _min1 < _min2 ? (_min1 < _min3 ? _min1 : _min3) : \ 725 - (_min2 < _min3 ? _min2 : _min3); }) 726 - 727 - #define max3(x, y, z) ({ \ 728 - typeof(x) _max1 = (x); \ 729 - typeof(y) _max2 = (y); \ 730 - typeof(z) _max3 = (z); \ 731 - (void) (&_max1 == &_max2); \ 732 - (void) (&_max1 == &_max3); \ 733 - _max1 > _max2 ? (_max1 > _max3 ? _max1 : _max3) : \ 734 - (_max2 > _max3 ? _max2 : _max3); }) 718 + #define min3(x, y, z) min((typeof(x))min(x, y), z) 719 + #define max3(x, y, z) max((typeof(x))max(x, y), z) 735 720 736 721 /** 737 722 * min_not_zero - return the minimum that is _not_ zero, unless both are zero ··· 731 746 /** 732 747 * clamp - return a value clamped to a given range with strict typechecking 733 748 * @val: current value 734 - * @min: minimum allowable value 735 - * @max: maximum allowable value 749 + * @lo: lowest allowable value 750 + * @hi: highest allowable value 736 751 * 737 - * This macro does strict typechecking of min/max to make sure they are of the 752 + * This macro does strict typechecking of lo/hi to make sure they are of the 738 753 * same type as val. See the unnecessary pointer comparisons. 739 754 */ 740 - #define clamp(val, min, max) ({ \ 741 - typeof(val) __val = (val); \ 742 - typeof(min) __min = (min); \ 743 - typeof(max) __max = (max); \ 744 - (void) (&__val == &__min); \ 745 - (void) (&__val == &__max); \ 746 - __val = __val < __min ? __min: __val; \ 747 - __val > __max ? __max: __val; }) 755 + #define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi) 748 756 749 757 /* 750 758 * ..and if you can't take the strict ··· 759 781 * clamp_t - return a value clamped to a given range using a given type 760 782 * @type: the type of variable to use 761 783 * @val: current value 762 - * @min: minimum allowable value 763 - * @max: maximum allowable value 784 + * @lo: minimum allowable value 785 + * @hi: maximum allowable value 764 786 * 765 787 * This macro does no typechecking and uses temporary variables of type 766 788 * 'type' to make all the comparisons. 767 789 */ 768 - #define clamp_t(type, val, min, max) ({ \ 769 - type __val = (val); \ 770 - type __min = (min); \ 771 - type __max = (max); \ 772 - __val = __val < __min ? __min: __val; \ 773 - __val > __max ? __max: __val; }) 790 + #define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi) 774 791 775 792 /** 776 793 * clamp_val - return a value clamped to a given range using val's type 777 794 * @val: current value 778 - * @min: minimum allowable value 779 - * @max: maximum allowable value 795 + * @lo: minimum allowable value 796 + * @hi: maximum allowable value 780 797 * 781 798 * This macro does no typechecking and uses temporary variables of whatever 782 799 * type the input argument 'val' is. This is useful when val is an unsigned 783 800 * type and min and max are literals that will otherwise be assigned a signed 784 801 * integer type. 785 802 */ 786 - #define clamp_val(val, min, max) ({ \ 787 - typeof(val) __val = (val); \ 788 - typeof(val) __min = (min); \ 789 - typeof(val) __max = (max); \ 790 - __val = __val < __min ? __min: __val; \ 791 - __val > __max ? __max: __val; }) 803 + #define clamp_val(val, lo, hi) clamp_t(typeof(val), val, lo, hi) 792 804 793 805 794 806 /*
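The kernel.h hunk deduplicates min3/max3 and the clamp family by expressing each as a composition of min and max. A quick user-space check that the composed forms behave as expected — MIN/MAX here are plain ternary stand-ins; the kernel versions additionally perform the pointer-comparison type check:

```c
#include <assert.h>

/* Plain stand-ins for the kernel's min()/max(). */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* The composed forms introduced by the patch:
 * clamp(val, lo, hi) == min(max(val, lo), hi), etc. */
#define MIN3(x, y, z)      MIN(MIN(x, y), z)
#define MAX3(x, y, z)      MAX(MAX(x, y), z)
#define CLAMP(val, lo, hi) MIN(MAX(val, lo), hi)
```

Because the inner min()/max() already do the strict type check on their operands, the one-line clamp() keeps the same typechecking guarantee as the old statement-expression version.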
-15
include/linux/memcontrol.h
··· 440 440 441 441 int memcg_cache_id(struct mem_cgroup *memcg); 442 442 443 - int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s, 444 - struct kmem_cache *root_cache); 445 - void memcg_free_cache_params(struct kmem_cache *s); 446 - 447 - int memcg_update_cache_size(struct kmem_cache *s, int num_groups); 448 443 void memcg_update_array_size(int num_groups); 449 444 450 445 struct kmem_cache * ··· 567 572 static inline int memcg_cache_id(struct mem_cgroup *memcg) 568 573 { 569 574 return -1; 570 - } 571 - 572 - static inline int memcg_alloc_cache_params(struct mem_cgroup *memcg, 573 - struct kmem_cache *s, struct kmem_cache *root_cache) 574 - { 575 - return 0; 576 - } 577 - 578 - static inline void memcg_free_cache_params(struct kmem_cache *s) 579 - { 580 575 } 581 576 582 577 static inline struct kmem_cache *
+1
include/linux/memory_hotplug.h
··· 84 84 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro); 85 85 /* VM interface that may be used by firmware interface */ 86 86 extern int online_pages(unsigned long, unsigned long, int); 87 + extern int test_pages_in_a_zone(unsigned long, unsigned long); 87 88 extern void __offline_isolated_pages(unsigned long, unsigned long); 88 89 89 90 typedef void (*online_page_callback_t)(struct page *page);
+4 -3
include/linux/mempolicy.h
··· 134 134 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, 135 135 unsigned long idx); 136 136 137 - struct mempolicy *get_vma_policy(struct task_struct *tsk, 138 - struct vm_area_struct *vma, unsigned long addr); 139 - bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma); 137 + struct mempolicy *get_task_policy(struct task_struct *p); 138 + struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, 139 + unsigned long addr); 140 + bool vma_policy_mof(struct vm_area_struct *vma); 140 141 141 142 extern void numa_default_policy(void); 142 143 extern void numa_policy_init(void);
+1 -13
include/linux/migrate.h
··· 13 13 * Return values from addresss_space_operations.migratepage(): 14 14 * - negative errno on page migration failure; 15 15 * - zero on page migration success; 16 - * 17 - * The balloon page migration introduces this special case where a 'distinct' 18 - * return code is used to flag a successful page migration to unmap_and_move(). 19 - * This approach is necessary because page migration can race against balloon 20 - * deflation procedure, and for such case we could introduce a nasty page leak 21 - * if a successfully migrated balloon page gets released concurrently with 22 - * migration's unmap_and_move() wrap-up steps. 23 16 */ 24 17 #define MIGRATEPAGE_SUCCESS 0 25 - #define MIGRATEPAGE_BALLOON_SUCCESS 1 /* special ret code for balloon page 26 - * sucessful migration case. 27 - */ 18 + 28 19 enum migrate_reason { 29 20 MR_COMPACTION, 30 21 MR_MEMORY_FAILURE, ··· 72 81 { 73 82 return -ENOSYS; 74 83 } 75 - 76 - /* Possible settings for the migrate_page() method in address_operations */ 77 - #define migrate_page NULL 78 84 79 85 #endif /* CONFIG_MIGRATION */ 80 86
+36 -2
include/linux/mm.h
··· 18 18 #include <linux/pfn.h> 19 19 #include <linux/bit_spinlock.h> 20 20 #include <linux/shrinker.h> 21 + #include <linux/resource.h> 21 22 22 23 struct mempolicy; 23 24 struct anon_vma; ··· 551 550 static inline void __ClearPageBuddy(struct page *page) 552 551 { 553 552 VM_BUG_ON_PAGE(!PageBuddy(page), page); 553 + atomic_set(&page->_mapcount, -1); 554 + } 555 + 556 + #define PAGE_BALLOON_MAPCOUNT_VALUE (-256) 557 + 558 + static inline int PageBalloon(struct page *page) 559 + { 560 + return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE; 561 + } 562 + 563 + static inline void __SetPageBalloon(struct page *page) 564 + { 565 + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); 566 + atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE); 567 + } 568 + 569 + static inline void __ClearPageBalloon(struct page *page) 570 + { 571 + VM_BUG_ON_PAGE(!PageBalloon(page), page); 554 572 atomic_set(&page->_mapcount, -1); 555 573 } 556 574 ··· 1267 1247 !vma_growsup(vma->vm_next, addr); 1268 1248 } 1269 1249 1270 - extern pid_t 1271 - vm_is_stack(struct task_struct *task, struct vm_area_struct *vma, int in_group); 1250 + extern struct task_struct *task_of_stack(struct task_struct *task, 1251 + struct vm_area_struct *vma, bool in_group); 1272 1252 1273 1253 extern unsigned long move_page_tables(struct vm_area_struct *vma, 1274 1254 unsigned long old_addr, struct vm_area_struct *new_vma, ··· 1799 1779 unsigned long addr, unsigned long len, pgoff_t pgoff, 1800 1780 bool *need_rmap_locks); 1801 1781 extern void exit_mmap(struct mm_struct *); 1782 + 1783 + static inline int check_data_rlimit(unsigned long rlim, 1784 + unsigned long new, 1785 + unsigned long start, 1786 + unsigned long end_data, 1787 + unsigned long start_data) 1788 + { 1789 + if (rlim < RLIM_INFINITY) { 1790 + if (((new - start) + (end_data - start_data)) > rlim) 1791 + return -ENOSPC; 1792 + } 1793 + 1794 + return 0; 1795 + } 1802 1796 1803 1797 extern int mm_take_all_locks(struct mm_struct *mm); 1804 1798 extern void mm_drop_all_locks(struct mm_struct *mm);
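The new check_data_rlimit() in mm.h centralizes the RLIMIT_DATA test shared by brk() and the PR_SET_MM paths. A stand-alone copy of the helper showing the arithmetic — RLIM_INFINITY is stubbed locally and the parameter `new` is renamed `new_brk` so the sketch also compiles as C++; otherwise the body matches the hunk above:

```c
#include <assert.h>
#include <errno.h>

#define RLIM_INFINITY (~0UL)	/* stand-in for the uapi definition */

/* Refuse a new brk if the growth (new_brk - start) plus the static
 * data segment (end_data - start_data) would exceed RLIMIT_DATA. */
static int check_data_rlimit(unsigned long rlim, unsigned long new_brk,
			     unsigned long start, unsigned long end_data,
			     unsigned long start_data)
{
	if (rlim < RLIM_INFINITY) {
		if (((new_brk - start) + (end_data - start_data)) > rlim)
			return -ENOSPC;
	}
	return 0;
}
```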
+20
include/linux/mmdebug.h
··· 4 4 #include <linux/stringify.h> 5 5 6 6 struct page; 7 + struct vm_area_struct; 8 + struct mm_struct; 7 9 8 10 extern void dump_page(struct page *page, const char *reason); 9 11 extern void dump_page_badflags(struct page *page, const char *reason, 10 12 unsigned long badflags); 13 + void dump_vma(const struct vm_area_struct *vma); 14 + void dump_mm(const struct mm_struct *mm); 11 15 12 16 #ifdef CONFIG_DEBUG_VM 13 17 #define VM_BUG_ON(cond) BUG_ON(cond) ··· 22 18 BUG(); \ 23 19 } \ 24 20 } while (0) 21 + #define VM_BUG_ON_VMA(cond, vma) \ 22 + do { \ 23 + if (unlikely(cond)) { \ 24 + dump_vma(vma); \ 25 + BUG(); \ 26 + } \ 27 + } while (0) 28 + #define VM_BUG_ON_MM(cond, mm) \ 29 + do { \ 30 + if (unlikely(cond)) { \ 31 + dump_mm(mm); \ 32 + BUG(); \ 33 + } \ 34 + } while (0) 25 35 #define VM_WARN_ON(cond) WARN_ON(cond) 26 36 #define VM_WARN_ON_ONCE(cond) WARN_ON_ONCE(cond) 27 37 #define VM_WARN_ONCE(cond, format...) WARN_ONCE(cond, format) 28 38 #else 29 39 #define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond) 30 40 #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond) 41 + #define VM_BUG_ON_VMA(cond, vma) VM_BUG_ON(cond) 42 + #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond) 31 43 #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond) 32 44 #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond) 33 45 #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
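VM_BUG_ON_VMA() and VM_BUG_ON_MM() extend VM_BUG_ON() by dumping the offending object before BUG(), so the oops carries context rather than just a file:line. A user-space analogue of the pattern — the struct is a minimal assumption (the kernel's dump_vma() prints far more state), and abort()/BUG() is replaced by a return code so the sketch can be exercised:

```c
#include <stdio.h>

/* Minimal stand-in for the real vm_area_struct. */
struct vma { unsigned long vm_start, vm_end; };

static void dump_vma(const struct vma *vma)
{
	fprintf(stderr, "vma: start %lx end %lx\n",
		vma->vm_start, vma->vm_end);
}

/* Dump the object that violated the invariant before failing;
 * the kernel macro calls BUG() where this returns 1. */
static int vma_check(const struct vma *vma)
{
	if (vma->vm_start > vma->vm_end) {
		dump_vma(vma);
		return 1;	/* invariant violated */
	}
	return 0;
}
```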
+3 -48
include/linux/mmzone.h
··· 521 521 atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; 522 522 } ____cacheline_internodealigned_in_smp; 523 523 524 - typedef enum { 524 + enum zone_flags { 525 525 ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */ 526 526 ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */ 527 527 ZONE_CONGESTED, /* zone has many dirty pages backed by 528 528 * a congested BDI 529 529 */ 530 - ZONE_TAIL_LRU_DIRTY, /* reclaim scanning has recently found 530 + ZONE_DIRTY, /* reclaim scanning has recently found 531 531 * many dirty file pages at the tail 532 532 * of the LRU. 533 533 */ ··· 535 535 * many pages under writeback 536 536 */ 537 537 ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */ 538 - } zone_flags_t; 539 - 540 - static inline void zone_set_flag(struct zone *zone, zone_flags_t flag) 541 - { 542 - set_bit(flag, &zone->flags); 543 - } 544 - 545 - static inline int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag) 546 - { 547 - return test_and_set_bit(flag, &zone->flags); 548 - } 549 - 550 - static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag) 551 - { 552 - clear_bit(flag, &zone->flags); 553 - } 554 - 555 - static inline int zone_is_reclaim_congested(const struct zone *zone) 556 - { 557 - return test_bit(ZONE_CONGESTED, &zone->flags); 558 - } 559 - 560 - static inline int zone_is_reclaim_dirty(const struct zone *zone) 561 - { 562 - return test_bit(ZONE_TAIL_LRU_DIRTY, &zone->flags); 563 - } 564 - 565 - static inline int zone_is_reclaim_writeback(const struct zone *zone) 566 - { 567 - return test_bit(ZONE_WRITEBACK, &zone->flags); 568 - } 569 - 570 - static inline int zone_is_reclaim_locked(const struct zone *zone) 571 - { 572 - return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags); 573 - } 574 - 575 - static inline int zone_is_fair_depleted(const struct zone *zone) 576 - { 577 - return test_bit(ZONE_FAIR_DEPLETED, &zone->flags); 578 - } 579 - 580 - static inline int zone_is_oom_locked(const struct zone *zone) 581 - { 582 - return test_bit(ZONE_OOM_LOCKED, &zone->flags); 583 - } 538 + }; 584 539 585 540 static inline unsigned long zone_end_pfn(const struct zone *zone) 586 541 {
+1 -17
include/linux/pagemap.h
··· 24 24 AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ 25 25 AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ 26 26 AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ 27 - AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */ 28 - AS_EXITING = __GFP_BITS_SHIFT + 5, /* final truncate in progress */ 27 + AS_EXITING = __GFP_BITS_SHIFT + 4, /* final truncate in progress */ 29 28 }; 30 29 31 30 static inline void mapping_set_error(struct address_space *mapping, int error) ··· 52 53 if (mapping) 53 54 return test_bit(AS_UNEVICTABLE, &mapping->flags); 54 55 return !!mapping; 55 - } 56 - 57 - static inline void mapping_set_balloon(struct address_space *mapping) 58 - { 59 - set_bit(AS_BALLOON_MAP, &mapping->flags); 60 - } 61 - 62 - static inline void mapping_clear_balloon(struct address_space *mapping) 63 - { 64 - clear_bit(AS_BALLOON_MAP, &mapping->flags); 65 - } 66 - 67 - static inline int mapping_balloon(struct address_space *mapping) 68 - { 69 - return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags); 70 56 } 71 57 72 58 static inline void mapping_set_exiting(struct address_space *mapping)
+1 -1
include/linux/rmap.h
··· 150 150 static inline void anon_vma_merge(struct vm_area_struct *vma, 151 151 struct vm_area_struct *next) 152 152 { 153 - VM_BUG_ON(vma->anon_vma != next->anon_vma); 153 + VM_BUG_ON_VMA(vma->anon_vma != next->anon_vma, vma); 154 154 unlink_anon_vmas(next); 155 155 } 156 156
+4 -2
include/linux/sched.h
··· 1935 1935 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH) 1936 1936 #define used_math() tsk_used_math(current) 1937 1937 1938 - /* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */ 1938 + /* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags 1939 + * __GFP_FS is also cleared as it implies __GFP_IO. 1940 + */ 1939 1941 static inline gfp_t memalloc_noio_flags(gfp_t flags) 1940 1942 { 1941 1943 if (unlikely(current->flags & PF_MEMALLOC_NOIO)) 1942 - flags &= ~__GFP_IO; 1944 + flags &= ~(__GFP_IO | __GFP_FS); 1943 1945 return flags; 1944 1946 } 1945 1947
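The sched.h fix clears __GFP_FS along with __GFP_IO for PF_MEMALLOC_NOIO tasks, because allowing filesystem reclaim alone would still let the FS layer issue I/O. A self-contained model of the masking — the flag values here are illustrative stand-ins for the real bits in <linux/gfp.h> and <linux/sched.h>:

```c
#include <assert.h>

/* Illustrative stand-ins for the real bit values. */
#define __GFP_IO         0x40u
#define __GFP_FS         0x80u
#define PF_MEMALLOC_NOIO 0x00080000u

/* After the fix, a PF_MEMALLOC_NOIO task loses both bits, since
 * __GFP_FS implies __GFP_IO. */
static unsigned int memalloc_noio_flags(unsigned int flags,
					unsigned int task_flags)
{
	if (task_flags & PF_MEMALLOC_NOIO)
		flags &= ~(__GFP_IO | __GFP_FS);
	return flags;
}
```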
-8
include/linux/screen_info.h
··· 5 5 6 6 extern struct screen_info screen_info; 7 7 8 - #define ORIG_X (screen_info.orig_x) 9 - #define ORIG_Y (screen_info.orig_y) 10 - #define ORIG_VIDEO_MODE (screen_info.orig_video_mode) 11 - #define ORIG_VIDEO_COLS (screen_info.orig_video_cols) 12 - #define ORIG_VIDEO_EGA_BX (screen_info.orig_video_ega_bx) 13 - #define ORIG_VIDEO_LINES (screen_info.orig_video_lines) 14 - #define ORIG_VIDEO_ISVGA (screen_info.orig_video_isVGA) 15 - #define ORIG_VIDEO_POINTS (screen_info.orig_video_points) 16 8 #endif /* _SCREEN_INFO_H */
+1 -63
include/linux/slab.h
··· 158 158 #define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long) 159 159 #endif 160 160 161 - #ifdef CONFIG_SLOB 162 - /* 163 - * Common fields provided in kmem_cache by all slab allocators 164 - * This struct is either used directly by the allocator (SLOB) 165 - * or the allocator must include definitions for all fields 166 - * provided in kmem_cache_common in their definition of kmem_cache. 167 - * 168 - * Once we can do anonymous structs (C11 standard) we could put a 169 - * anonymous struct definition in these allocators so that the 170 - * separate allocations in the kmem_cache structure of SLAB and 171 - * SLUB is no longer needed. 172 - */ 173 - struct kmem_cache { 174 - unsigned int object_size;/* The original size of the object */ 175 - unsigned int size; /* The aligned/padded/added on size */ 176 - unsigned int align; /* Alignment as calculated */ 177 - unsigned long flags; /* Active flags on the slab */ 178 - const char *name; /* Slab name for sysfs */ 179 - int refcount; /* Use counter */ 180 - void (*ctor)(void *); /* Called on object slot creation */ 181 - struct list_head list; /* List of all slab caches on the system */ 182 - }; 183 - 184 - #endif /* CONFIG_SLOB */ 185 - 186 161 /* 187 162 * Kmalloc array related definitions 188 163 */ ··· 337 362 return kmem_cache_alloc_node(s, gfpflags, node); 338 363 } 339 364 #endif /* CONFIG_TRACING */ 340 - 341 - #ifdef CONFIG_SLAB 342 - #include <linux/slab_def.h> 343 - #endif 344 - 345 - #ifdef CONFIG_SLUB 346 - #include <linux/slub_def.h> 347 - #endif 348 365 349 366 extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order); ··· 549 582 * allocator where we care about the real place the memory allocation request comes from. 551 584 */ 552 - #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || \ 553 - (defined(CONFIG_SLAB) && defined(CONFIG_TRACING)) || \ 554 - (defined(CONFIG_SLOB) && defined(CONFIG_TRACING)) 555 585 extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long); 556 586 #define kmalloc_track_caller(size, flags) \ 557 587 __kmalloc_track_caller(size, flags, _RET_IP_) 558 - #else 559 - #define kmalloc_track_caller(size, flags) \ 560 - __kmalloc(size, flags) 561 - #endif /* DEBUG_SLAB */ 562 588 563 589 #ifdef CONFIG_NUMA 564 - /* 565 - * kmalloc_node_track_caller is a special version of kmalloc_node that 566 - * records the calling function of the routine calling it for slab leak 567 - * tracking instead of just the calling function (confusing, eh?). 568 - * It's useful when the call to kmalloc_node comes from a widely-used 569 - * standard allocator where we care about the real place the memory 570 - * allocation request comes from. 571 - */ 572 - #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB) || \ 573 - (defined(CONFIG_SLAB) && defined(CONFIG_TRACING)) || \ 574 - (defined(CONFIG_SLOB) && defined(CONFIG_TRACING)) 575 590 extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long); 576 591 #define kmalloc_node_track_caller(size, flags, node) \ 577 592 __kmalloc_node_track_caller(size, flags, node, \ 578 593 _RET_IP_) 579 - #else 580 - #define kmalloc_node_track_caller(size, flags, node) \ 581 - __kmalloc_node(size, flags, node) 582 - #endif 583 594 584 595 #else /* CONFIG_NUMA */ 585 596 ··· 595 650 return kmalloc_node(size, flags | __GFP_ZERO, node); 596 651 } 597 652 598 - /* 599 - * Determine the size of a slab object 600 - */ 601 - static inline unsigned int kmem_cache_size(struct kmem_cache *s) 602 - { 603 - return s->object_size; 604 - } 605 - 653 + unsigned int kmem_cache_size(struct kmem_cache *s); 606 654 void __init kmem_cache_init_late(void); 607 655 608 656 #endif /* _LINUX_SLAB_H */
+3 -17
include/linux/slab_def.h
··· 8 8 */ 9 9 10 10 struct kmem_cache { 11 + struct array_cache __percpu *cpu_cache; 12 + 11 13 /* 1) Cache tunables. Protected by slab_mutex */ 12 14 unsigned int batchcount; 13 15 unsigned int limit; ··· 73 71 struct memcg_cache_params *memcg_params; 74 72 #endif 75 73 76 - /* 6) per-cpu/per-node data, touched during every alloc/free */ 77 - /* 78 - * We put array[] at the end of kmem_cache, because we want to size 79 - * this array to nr_cpu_ids slots instead of NR_CPUS 80 - * (see kmem_cache_init()) 81 - * We still use [NR_CPUS] and not [1] or [0] because cache_cache 82 - * is statically defined, so we reserve the max number of cpus. 83 - * 84 - * We also need to guarantee that the list is able to accomodate a 85 - * pointer for each node since "nodelists" uses the remainder of 86 - * available pointers. 87 - */ 88 - struct kmem_cache_node **node; 89 - struct array_cache *array[NR_CPUS + MAX_NUMNODES]; 90 - /* 91 - * Do not add fields after array[] 92 - */ 74 + struct kmem_cache_node *node[MAX_NUMNODES]; 93 75 }; 94 76 95 77 #endif /* _LINUX_SLAB_DEF_H */
+4 -18
include/linux/swap.h
··· 327 327 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, 328 328 gfp_t gfp_mask, nodemask_t *mask); 329 329 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); 330 - extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, 331 - gfp_t gfp_mask, bool noswap); 330 + extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, 331 + unsigned long nr_pages, 332 + gfp_t gfp_mask, 333 + bool may_swap); 332 334 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, 333 335 gfp_t gfp_mask, bool noswap, 334 336 struct zone *zone, ··· 355 353 356 354 extern int page_evictable(struct page *page); 357 355 extern void check_move_unevictable_pages(struct page **, int nr_pages); 358 - 359 - extern unsigned long scan_unevictable_pages; 360 - extern int scan_unevictable_handler(struct ctl_table *, int, 361 - void __user *, size_t *, loff_t *); 362 - #ifdef CONFIG_NUMA 363 - extern int scan_unevictable_register_node(struct node *node); 364 - extern void scan_unevictable_unregister_node(struct node *node); 365 - #else 366 - static inline int scan_unevictable_register_node(struct node *node) 367 - { 368 - return 0; 369 - } 370 - static inline void scan_unevictable_unregister_node(struct node *node) 371 - { 372 - } 373 - #endif 374 356 375 357 extern int kswapd_run(int nid); 376 358 extern void kswapd_stop(int nid);
+17
include/linux/topology.h
··· 119 119 * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem(). 120 120 */ 121 121 DECLARE_PER_CPU(int, _numa_mem_); 122 + extern int _node_numa_mem_[MAX_NUMNODES]; 122 123 123 124 #ifndef set_numa_mem 124 125 static inline void set_numa_mem(int node) 125 126 { 126 127 this_cpu_write(_numa_mem_, node); 128 + _node_numa_mem_[numa_node_id()] = node; 129 + } 130 + #endif 131 + 132 + #ifndef node_to_mem_node 133 + static inline int node_to_mem_node(int node) 134 + { 135 + return _node_numa_mem_[node]; 127 136 } 128 137 #endif 129 138 ··· 155 146 static inline void set_cpu_numa_mem(int cpu, int node) 156 147 { 157 148 per_cpu(_numa_mem_, cpu) = node; 149 + _node_numa_mem_[cpu_to_node(cpu)] = node; 158 150 } 159 151 #endif 160 152 ··· 166 156 static inline int numa_mem_id(void) 167 157 { 168 158 return numa_node_id(); 159 + } 160 + #endif 161 + 162 + #ifndef node_to_mem_node 163 + static inline int node_to_mem_node(int node) 164 + { 165 + return node; 169 166 } 170 167 #endif 171 168
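The topology.h hunk adds _node_numa_mem_[], giving every node, including memoryless ones, the id of the nearest node that actually has memory, with node_to_mem_node() as the lookup. A toy model with a made-up two-socket layout in which nodes 1 and 3 are memoryless, falling back to nodes 0 and 2 respectively:

```c
#include <assert.h>

#define MAX_NUMNODES 4

/* Toy _node_numa_mem_[] for a hypothetical layout: nodes 1 and 3
 * have no memory and fall back to nodes 0 and 2. */
static const int node_numa_mem[MAX_NUMNODES] = { 0, 0, 2, 2 };

/* Map a (possibly memoryless) node to its nearest node with memory. */
static int node_to_mem_node(int node)
{
	return node_numa_mem[node];
}
```

Slab allocators use this so a per-node cache lookup on a memoryless node lands on a node that can actually satisfy the allocation.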
+7
include/linux/vm_event_item.h
··· 72 72 THP_ZERO_PAGE_ALLOC, 73 73 THP_ZERO_PAGE_ALLOC_FAILED, 74 74 #endif 75 + #ifdef CONFIG_MEMORY_BALLOON 76 + BALLOON_INFLATE, 77 + BALLOON_DEFLATE, 78 + #ifdef CONFIG_BALLOON_COMPACTION 79 + BALLOON_MIGRATE, 80 + #endif 81 + #endif 75 82 #ifdef CONFIG_DEBUG_TLBFLUSH 76 83 #ifdef CONFIG_SMP 77 84 NR_TLB_REMOTE_FLUSH, /* cpu tried to flush others' tlbs */
+1 -1
include/linux/zsmalloc.h
··· 46 46 enum zs_mapmode mm); 47 47 void zs_unmap_object(struct zs_pool *pool, unsigned long handle); 48 48 49 - u64 zs_get_total_size_bytes(struct zs_pool *pool); 49 + unsigned long zs_get_total_pages(struct zs_pool *pool); 50 50 51 51 #endif
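zs_get_total_pages() replaces the bytes-returning zs_get_total_size_bytes(), so callers (zram's sysfs accounting, for instance) now convert at the edge. A one-liner showing the conversion, assuming the common 4 KiB page size:

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Former zs_get_total_size_bytes() callers now take a page count
 * from zs_get_total_pages() and convert themselves: */
static unsigned long long zs_pages_to_bytes(unsigned long pages)
{
	return (unsigned long long)pages << PAGE_SHIFT;
}
```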
+1
include/uapi/linux/kernel-page-flags.h
··· 31 31 32 32 #define KPF_KSM 21 33 33 #define KPF_THP 22 34 + #define KPF_BALLOON 23 34 35 35 36 36 37 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
+27
include/uapi/linux/prctl.h
··· 1 1 #ifndef _LINUX_PRCTL_H 2 2 #define _LINUX_PRCTL_H 3 3 4 + #include <linux/types.h> 5 + 4 6 /* Values to pass as first argument to prctl() */ 5 7 6 8 #define PR_SET_PDEATHSIG 1 /* Second arg is a signal */ ··· 121 119 # define PR_SET_MM_ENV_END 11 122 120 # define PR_SET_MM_AUXV 12 123 121 # define PR_SET_MM_EXE_FILE 13 122 + # define PR_SET_MM_MAP 14 123 + # define PR_SET_MM_MAP_SIZE 15 124 + 125 + /* 126 + * This structure provides new memory descriptor 127 + * map which mostly modifies /proc/pid/stat[m] 128 + * output for a task. This mostly done in a 129 + * sake of checkpoint/restore functionality. 130 + */ 131 + struct prctl_mm_map { 132 + __u64 start_code; /* code section bounds */ 133 + __u64 end_code; 134 + __u64 start_data; /* data section bounds */ 135 + __u64 end_data; 136 + __u64 start_brk; /* heap for brk() syscall */ 137 + __u64 brk; 138 + __u64 start_stack; /* stack starts at */ 139 + __u64 arg_start; /* command line arguments bounds */ 140 + __u64 arg_end; 141 + __u64 env_start; /* environment variables bounds */ 142 + __u64 env_end; 143 + __u64 *auxv; /* auxiliary vector */ 144 + __u32 auxv_size; /* vector size */ 145 + __u32 exe_fd; /* /proc/$pid/exe link file */ 146 + }; 124 147 125 148 /* 126 149 * Set specific pid that is allowed to ptrace the current task.
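PR_SET_MM_MAP lets checkpoint/restore set all the mm fields in a single validated call instead of thirteen separate PR_SET_MM_* operations. A local mirror of the new uapi struct using <stdint.h> types, layout only — actually issuing the call needs kernel support and privilege, and the exact call shape below is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Mirror of the uapi struct from the hunk above, with __u64/__u32
 * replaced by <stdint.h> types for a user-space layout check. */
struct prctl_mm_map {
	uint64_t start_code, end_code;	/* code section bounds */
	uint64_t start_data, end_data;	/* data section bounds */
	uint64_t start_brk, brk;	/* heap for brk() */
	uint64_t start_stack;
	uint64_t arg_start, arg_end;	/* command line bounds */
	uint64_t env_start, env_end;	/* environment bounds */
	uint64_t *auxv;			/* auxiliary vector */
	uint32_t auxv_size;
	uint32_t exe_fd;		/* /proc/$pid/exe link file */
};

/*
 * A restorer would fill the map and issue something like
 *
 *     prctl(PR_SET_MM, PR_SET_MM_MAP, (unsigned long)&map,
 *           sizeof(map), 0);
 *
 * PR_SET_MM_MAP_SIZE exists so userspace can first ask the kernel
 * which struct size it expects, making mismatched headers fail
 * cleanly instead of silently misparsing.
 */
```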
-11
init/Kconfig
··· 889 889 config ARCH_WANT_NUMA_VARIABLE_LOCALITY 890 890 bool 891 891 892 - # 893 - # For architectures that are willing to define _PAGE_NUMA as _PAGE_PROTNONE 894 - config ARCH_WANTS_PROT_NUMA_PROT_NONE 895 - bool 896 - 897 - config ARCH_USES_NUMA_PROT_NONE 898 - bool 899 - default y 900 - depends on ARCH_WANTS_PROT_NUMA_PROT_NONE 901 - depends on NUMA_BALANCING 902 - 903 892 config NUMA_BALANCING_DEFAULT_ENABLED 904 893 bool "Automatically enable NUMA aware memory/task placement" 905 894 default y
+9 -5
kernel/acct.c
··· 472 472 acct_t ac; 473 473 unsigned long flim; 474 474 const struct cred *orig_cred; 475 - struct pid_namespace *ns = acct->ns; 476 475 struct file *file = acct->file; 477 476 478 477 /* ··· 499 500 ac.ac_gid16 = ac.ac_gid; 500 501 #endif 501 502 #if ACCT_VERSION == 3 502 - ac.ac_pid = task_tgid_nr_ns(current, ns); 503 - rcu_read_lock(); 504 - ac.ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent), ns); 505 - rcu_read_unlock(); 503 + { 504 + struct pid_namespace *ns = acct->ns; 505 + 506 + ac.ac_pid = task_tgid_nr_ns(current, ns); 507 + rcu_read_lock(); 508 + ac.ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent), 509 + ns); 510 + rcu_read_unlock(); 511 + } 506 512 #endif 507 513 /* 508 514 * Get freeze protection. If the fs is frozen, just skip the write
+4 -4
kernel/async.c
··· 115 115 116 116 /* 1) run (and print duration) */ 117 117 if (initcall_debug && system_state == SYSTEM_BOOTING) { 118 - printk(KERN_DEBUG "calling %lli_%pF @ %i\n", 118 + pr_debug("calling %lli_%pF @ %i\n", 119 119 (long long)entry->cookie, 120 120 entry->func, task_pid_nr(current)); 121 121 calltime = ktime_get(); ··· 124 124 if (initcall_debug && system_state == SYSTEM_BOOTING) { 125 125 rettime = ktime_get(); 126 126 delta = ktime_sub(rettime, calltime); 127 - printk(KERN_DEBUG "initcall %lli_%pF returned 0 after %lld usecs\n", 127 + pr_debug("initcall %lli_%pF returned 0 after %lld usecs\n", 128 128 (long long)entry->cookie, 129 129 entry->func, 130 130 (long long)ktime_to_ns(delta) >> 10); ··· 285 285 ktime_t uninitialized_var(starttime), delta, endtime; 286 286 287 287 if (initcall_debug && system_state == SYSTEM_BOOTING) { 288 - printk(KERN_DEBUG "async_waiting @ %i\n", task_pid_nr(current)); 288 + pr_debug("async_waiting @ %i\n", task_pid_nr(current)); 289 289 starttime = ktime_get(); 290 290 } 291 291 ··· 295 295 endtime = ktime_get(); 296 296 delta = ktime_sub(endtime, starttime); 297 297 298 - printk(KERN_DEBUG "async_continuing @ %i after %lli usec\n", 298 + pr_debug("async_continuing @ %i after %lli usec\n", 299 299 task_pid_nr(current), 300 300 (long long)ktime_to_ns(delta) >> 10); 301 301 }
+1 -2
kernel/fork.c
··· 601 601 printk(KERN_ALERT "BUG: Bad rss-counter state " 602 602 "mm:%p idx:%d val:%ld\n", mm, i, x); 603 603 } 604 - 605 604 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS 606 - VM_BUG_ON(mm->pmd_huge_pte); 605 + VM_BUG_ON_MM(mm->pmd_huge_pte, mm); 607 606 #endif 608 607 } 609 608
+1 -1
kernel/kthread.c
··· 369 369 { 370 370 struct task_struct *p; 371 371 372 - p = kthread_create_on_node(threadfn, data, cpu_to_mem(cpu), namefmt, 372 + p = kthread_create_on_node(threadfn, data, cpu_to_node(cpu), namefmt, 373 373 cpu); 374 374 if (IS_ERR(p)) 375 375 return p;
+1 -1
kernel/sched/fair.c
··· 1946 1946 vma = mm->mmap; 1947 1947 } 1948 1948 for (; vma; vma = vma->vm_next) { 1949 - if (!vma_migratable(vma) || !vma_policy_mof(p, vma)) 1949 + if (!vma_migratable(vma) || !vma_policy_mof(vma)) 1950 1950 continue; 1951 1951 1952 1952 /*
+341 -146
kernel/sys.c
··· 62 62 #include <asm/unistd.h> 63 63 64 64 #ifndef SET_UNALIGN_CTL 65 - # define SET_UNALIGN_CTL(a,b) (-EINVAL) 65 + # define SET_UNALIGN_CTL(a, b) (-EINVAL) 66 66 #endif 67 67 #ifndef GET_UNALIGN_CTL 68 - # define GET_UNALIGN_CTL(a,b) (-EINVAL) 68 + # define GET_UNALIGN_CTL(a, b) (-EINVAL) 69 69 #endif 70 70 #ifndef SET_FPEMU_CTL 71 - # define SET_FPEMU_CTL(a,b) (-EINVAL) 71 + # define SET_FPEMU_CTL(a, b) (-EINVAL) 72 72 #endif 73 73 #ifndef GET_FPEMU_CTL 74 - # define GET_FPEMU_CTL(a,b) (-EINVAL) 74 + # define GET_FPEMU_CTL(a, b) (-EINVAL) 75 75 #endif 76 76 #ifndef SET_FPEXC_CTL 77 - # define SET_FPEXC_CTL(a,b) (-EINVAL) 77 + # define SET_FPEXC_CTL(a, b) (-EINVAL) 78 78 #endif 79 79 #ifndef GET_FPEXC_CTL 80 - # define GET_FPEXC_CTL(a,b) (-EINVAL) 80 + # define GET_FPEXC_CTL(a, b) (-EINVAL) 81 81 #endif 82 82 #ifndef GET_ENDIAN 83 - # define GET_ENDIAN(a,b) (-EINVAL) 83 + # define GET_ENDIAN(a, b) (-EINVAL) 84 84 #endif 85 85 #ifndef SET_ENDIAN 86 - # define SET_ENDIAN(a,b) (-EINVAL) 86 + # define SET_ENDIAN(a, b) (-EINVAL) 87 87 #endif 88 88 #ifndef GET_TSC_CTL 89 89 # define GET_TSC_CTL(a) (-EINVAL) ··· 182 182 rcu_read_lock(); 183 183 read_lock(&tasklist_lock); 184 184 switch (which) { 185 - case PRIO_PROCESS: 186 - if (who) 187 - p = find_task_by_vpid(who); 188 - else 189 - p = current; 190 - if (p) 191 - error = set_one_prio(p, niceval, error); 192 - break; 193 - case PRIO_PGRP: 194 - if (who) 195 - pgrp = find_vpid(who); 196 - else 197 - pgrp = task_pgrp(current); 198 - do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { 199 - error = set_one_prio(p, niceval, error); 200 - } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); 201 - break; 202 - case PRIO_USER: 203 - uid = make_kuid(cred->user_ns, who); 204 - user = cred->user; 205 - if (!who) 206 - uid = cred->uid; 207 - else if (!uid_eq(uid, cred->uid) && 208 - !(user = find_user(uid))) 185 + case PRIO_PROCESS: 186 + if (who) 187 + p = find_task_by_vpid(who); 188 + else 189 + p = current; 190 + if (p) 191 + error = set_one_prio(p, niceval, error); 192 + break; 193 + case PRIO_PGRP: 194 + if (who) 195 + pgrp = find_vpid(who); 196 + else 197 + pgrp = task_pgrp(current); 198 + do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { 199 + error = set_one_prio(p, niceval, error); 200 + } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); 201 + break; 202 + case PRIO_USER: 203 + uid = make_kuid(cred->user_ns, who); 204 + user = cred->user; 205 + if (!who) 206 + uid = cred->uid; 207 + else if (!uid_eq(uid, cred->uid)) { 208 + user = find_user(uid); 209 + if (!user) 209 210 goto out_unlock; /* No processes for this user */ 210 - 211 + } 212 + do_each_thread(g, p) { 213 + if (uid_eq(task_uid(p), uid)) 214 + error = set_one_prio(p, niceval, error); 215 + } while_each_thread(g, p); 216 + if (!uid_eq(uid, cred->uid)) 217 + free_uid(user); /* For find_user() */ 218 + break; 218 219 } 219 220 out_unlock: 220 221 read_unlock(&tasklist_lock); ··· 245 244 rcu_read_lock(); 246 245 read_lock(&tasklist_lock); 247 246 switch (which) { 248 - case PRIO_PROCESS: 249 - if (who) 250 - p = find_task_by_vpid(who); 251 - else 252 - p = current; 253 - if (p) { 247 + case PRIO_PROCESS: 248 + if (who) 249 + p = find_task_by_vpid(who); 250 + else 251 + p = current; 252 + if (p) { 253 + niceval = nice_to_rlimit(task_nice(p)); 254 + if (niceval > retval) 255 + retval = niceval; 256 + } 257 + break; 258 + case PRIO_PGRP: 259 + if (who) 260 + pgrp = find_vpid(who); 261 + else 262 + pgrp = task_pgrp(current); 263 + do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { 264 + niceval = nice_to_rlimit(task_nice(p)); 265 + if (niceval > retval) 266 + retval = niceval; 267 + } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); 268 + break; 269 + case PRIO_USER: 270 + uid = make_kuid(cred->user_ns, who); 271 + user = cred->user; 272 + if (!who) 273 + uid = cred->uid; 274 + else if (!uid_eq(uid, cred->uid)) { 275 + user = find_user(uid); 276 + if (!user) 277 + goto out_unlock; /* No processes for this user */ 278 + } 279 + do_each_thread(g, p) { 280 + if (uid_eq(task_uid(p), uid)) { 254 281 niceval = nice_to_rlimit(task_nice(p)); 255 282 if (niceval > retval) 256 283 retval = niceval; 257 284 } 258 - break; 259 - case PRIO_PGRP: 260 - if (who) 261 - pgrp = find_vpid(who); 262 - else 263 - pgrp = task_pgrp(current); 264 - do_each_pid_thread(pgrp, PIDTYPE_PGID, p) { 265 - niceval = nice_to_rlimit(task_nice(p)); 266 - if (niceval > retval) 267 - retval = niceval; 268 - } while_each_pid_thread(pgrp, PIDTYPE_PGID, p); 269 - break; 270 - case PRIO_USER: 271 - uid = make_kuid(cred->user_ns, who); 272 - user = cred->user; 273 - if (!who) 274 - uid = cred->uid; 275 - else if (!uid_eq(uid, cred->uid) && 276 - !(user = find_user(uid))) 277 - goto out_unlock; /* No processes for this user */ 278 - 279 - do_each_thread(g, p) { 280 - if (uid_eq(task_uid(p), uid)) { 281 - niceval = nice_to_rlimit(task_nice(p)); 282 - if (niceval > retval) 283 - retval = niceval; 284 - } 285 - } while_each_thread(g, p); 286 - if (!uid_eq(uid, cred->uid)) 287 - free_uid(user); /* for find_user() */ 288 - break; 285 + } while_each_thread(g, p); 286 + if (!uid_eq(uid, cred->uid)) 287 + free_uid(user); /* for find_user() */ 288 + break; 289 289 } 290 290 out_unlock: 291 291 read_unlock(&tasklist_lock); ··· 308 306 * 309 307 * The general idea is that a program which uses just setregid() will be 310 308 * 100% compatible with BSD. A program which uses just setgid() will be 311 - * 100% compatible with POSIX with saved IDs. 309 + * 100% compatible with POSIX with saved IDs. 312 310 * 313 311 * SMP: There are not races, the GIDs are checked only by filesystem 314 312 * operations (as far as semantic preservation is concerned).
··· 366 364 } 367 365 368 366 /* 369 - * setgid() is implemented like SysV w/ SAVED_IDS 367 + * setgid() is implemented like SysV w/ SAVED_IDS 370 368 * 371 369 * SMP: Same implicit races as above. 372 370 */ ··· 444 442 * 445 443 * The general idea is that a program which uses just setreuid() will be 446 444 * 100% compatible with BSD. A program which uses just setuid() will be 447 - * 100% compatible with POSIX with saved IDs. 445 + * 100% compatible with POSIX with saved IDs. 448 446 */ 449 447 SYSCALL_DEFINE2(setreuid, uid_t, ruid, uid_t, euid) 450 448 { ··· 505 503 abort_creds(new); 506 504 return retval; 507 505 } 508 - 506 + 509 507 /* 510 - * setuid() is implemented like SysV with SAVED_IDS 511 - * 508 + * setuid() is implemented like SysV with SAVED_IDS 509 + * 512 510 * Note that SAVED_ID's is deficient in that a setuid root program 513 - * like sendmail, for example, cannot set its uid to be a normal 511 + * like sendmail, for example, cannot set its uid to be a normal 514 512 * user and then switch back, because if you're root, setuid() sets 515 513 * the saved uid too. If you don't like this, blame the bright people 516 514 * in the POSIX committee and/or USG. Note that the BSD-style setreuid() 517 515 * will allow a root program to temporarily drop privileges and be able to 518 - * regain them by swapping the real and effective uid. 516 + * regain them by swapping the real and effective uid. 
519 517 */ 520 518 SYSCALL_DEFINE1(setuid, uid_t, uid) 521 519 { ··· 639 637 euid = from_kuid_munged(cred->user_ns, cred->euid); 640 638 suid = from_kuid_munged(cred->user_ns, cred->suid); 641 639 642 - if (!(retval = put_user(ruid, ruidp)) && 643 - !(retval = put_user(euid, euidp))) 644 - retval = put_user(suid, suidp); 645 - 640 + retval = put_user(ruid, ruidp); 641 + if (!retval) { 642 + retval = put_user(euid, euidp); 643 + if (!retval) 644 + return put_user(suid, suidp); 645 + } 646 646 return retval; 647 647 } 648 648 ··· 713 709 egid = from_kgid_munged(cred->user_ns, cred->egid); 714 710 sgid = from_kgid_munged(cred->user_ns, cred->sgid); 715 711 716 - if (!(retval = put_user(rgid, rgidp)) && 717 - !(retval = put_user(egid, egidp))) 718 - retval = put_user(sgid, sgidp); 712 + retval = put_user(rgid, rgidp); 713 + if (!retval) { 714 + retval = put_user(egid, egidp); 715 + if (!retval) 716 + retval = put_user(sgid, sgidp); 717 + } 719 718 720 719 return retval; 721 720 } ··· 1291 1284 /* 1292 1285 * Back compatibility for getrlimit. Needed for some apps. 1293 1286 */ 1294 - 1295 1287 SYSCALL_DEFINE2(old_getrlimit, unsigned int, resource, 1296 1288 struct rlimit __user *, rlim) 1297 1289 { ··· 1305 1299 x.rlim_cur = 0x7FFFFFFF; 1306 1300 if (x.rlim_max > 0x7FFFFFFF) 1307 1301 x.rlim_max = 0x7FFFFFFF; 1308 - return copy_to_user(rlim, &x, sizeof(x))?-EFAULT:0; 1302 + return copy_to_user(rlim, &x, sizeof(x)) ? 
-EFAULT : 0; 1309 1303 } 1310 1304 1311 1305 #endif ··· 1533 1527 cputime_t tgutime, tgstime, utime, stime; 1534 1528 unsigned long maxrss = 0; 1535 1529 1536 - memset((char *) r, 0, sizeof *r); 1530 + memset((char *)r, 0, sizeof (*r)); 1537 1531 utime = stime = 0; 1538 1532 1539 1533 if (who == RUSAGE_THREAD) { ··· 1547 1541 return; 1548 1542 1549 1543 switch (who) { 1550 - case RUSAGE_BOTH: 1551 - case RUSAGE_CHILDREN: 1552 - utime = p->signal->cutime; 1553 - stime = p->signal->cstime; 1554 - r->ru_nvcsw = p->signal->cnvcsw; 1555 - r->ru_nivcsw = p->signal->cnivcsw; 1556 - r->ru_minflt = p->signal->cmin_flt; 1557 - r->ru_majflt = p->signal->cmaj_flt; 1558 - r->ru_inblock = p->signal->cinblock; 1559 - r->ru_oublock = p->signal->coublock; 1560 - maxrss = p->signal->cmaxrss; 1544 + case RUSAGE_BOTH: 1545 + case RUSAGE_CHILDREN: 1546 + utime = p->signal->cutime; 1547 + stime = p->signal->cstime; 1548 + r->ru_nvcsw = p->signal->cnvcsw; 1549 + r->ru_nivcsw = p->signal->cnivcsw; 1550 + r->ru_minflt = p->signal->cmin_flt; 1551 + r->ru_majflt = p->signal->cmaj_flt; 1552 + r->ru_inblock = p->signal->cinblock; 1553 + r->ru_oublock = p->signal->coublock; 1554 + maxrss = p->signal->cmaxrss; 1561 1555 1562 - if (who == RUSAGE_CHILDREN) 1563 - break; 1564 - 1565 - case RUSAGE_SELF: 1566 - thread_group_cputime_adjusted(p, &tgutime, &tgstime); 1567 - utime += tgutime; 1568 - stime += tgstime; 1569 - r->ru_nvcsw += p->signal->nvcsw; 1570 - r->ru_nivcsw += p->signal->nivcsw; 1571 - r->ru_minflt += p->signal->min_flt; 1572 - r->ru_majflt += p->signal->maj_flt; 1573 - r->ru_inblock += p->signal->inblock; 1574 - r->ru_oublock += p->signal->oublock; 1575 - if (maxrss < p->signal->maxrss) 1576 - maxrss = p->signal->maxrss; 1577 - t = p; 1578 - do { 1579 - accumulate_thread_rusage(t, r); 1580 - } while_each_thread(p, t); 1556 + if (who == RUSAGE_CHILDREN) 1581 1557 break; 1582 1558 1583 - default: 1584 - BUG(); 1559 + case RUSAGE_SELF: 1560 + thread_group_cputime_adjusted(p, &tgutime, 
&tgstime); 1561 + utime += tgutime; 1562 + stime += tgstime; 1563 + r->ru_nvcsw += p->signal->nvcsw; 1564 + r->ru_nivcsw += p->signal->nivcsw; 1565 + r->ru_minflt += p->signal->min_flt; 1566 + r->ru_majflt += p->signal->maj_flt; 1567 + r->ru_inblock += p->signal->inblock; 1568 + r->ru_oublock += p->signal->oublock; 1569 + if (maxrss < p->signal->maxrss) 1570 + maxrss = p->signal->maxrss; 1571 + t = p; 1572 + do { 1573 + accumulate_thread_rusage(t, r); 1574 + } while_each_thread(p, t); 1575 + break; 1576 + 1577 + default: 1578 + BUG(); 1585 1579 } 1586 1580 unlock_task_sighand(p, &flags); 1587 1581 ··· 1591 1585 1592 1586 if (who != RUSAGE_CHILDREN) { 1593 1587 struct mm_struct *mm = get_task_mm(p); 1588 + 1594 1589 if (mm) { 1595 1590 setmax_mm_hiwater_rss(&maxrss, mm); 1596 1591 mmput(mm); ··· 1603 1596 int getrusage(struct task_struct *p, int who, struct rusage __user *ru) 1604 1597 { 1605 1598 struct rusage r; 1599 + 1606 1600 k_getrusage(p, who, &r); 1607 1601 return copy_to_user(ru, &r, sizeof(r)) ? -EFAULT : 0; 1608 1602 } ··· 1636 1628 return mask; 1637 1629 } 1638 1630 1639 - static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd) 1631 + static int prctl_set_mm_exe_file_locked(struct mm_struct *mm, unsigned int fd) 1640 1632 { 1641 1633 struct fd exe; 1642 1634 struct inode *inode; 1643 1635 int err; 1636 + 1637 + VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm); 1644 1638 1645 1639 exe = fdget(fd); 1646 1640 if (!exe.file) ··· 1664 1654 if (err) 1665 1655 goto exit; 1666 1656 1667 - down_write(&mm->mmap_sem); 1668 - 1669 1657 /* 1670 1658 * Forbid mm->exe_file change if old file still mapped. 
1671 1659 */ ··· 1675 1667 if (vma->vm_file && 1676 1668 path_equal(&vma->vm_file->f_path, 1677 1669 &mm->exe_file->f_path)) 1678 - goto exit_unlock; 1670 + goto exit; 1679 1671 } 1680 1672 1681 1673 /* ··· 1686 1678 */ 1687 1679 err = -EPERM; 1688 1680 if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags)) 1689 - goto exit_unlock; 1681 + goto exit; 1690 1682 1691 1683 err = 0; 1692 1684 set_mm_exe_file(mm, exe.file); /* this grabs a reference to exe.file */ 1693 - exit_unlock: 1694 - up_write(&mm->mmap_sem); 1695 - 1696 1685 exit: 1697 1686 fdput(exe); 1698 1687 return err; 1699 1688 } 1700 1689 1690 + #ifdef CONFIG_CHECKPOINT_RESTORE 1691 + /* 1692 + * WARNING: we don't require any capability here so be very careful 1693 + * in what is allowed for modification from userspace. 1694 + */ 1695 + static int validate_prctl_map(struct prctl_mm_map *prctl_map) 1696 + { 1697 + unsigned long mmap_max_addr = TASK_SIZE; 1698 + struct mm_struct *mm = current->mm; 1699 + int error = -EINVAL, i; 1700 + 1701 + static const unsigned char offsets[] = { 1702 + offsetof(struct prctl_mm_map, start_code), 1703 + offsetof(struct prctl_mm_map, end_code), 1704 + offsetof(struct prctl_mm_map, start_data), 1705 + offsetof(struct prctl_mm_map, end_data), 1706 + offsetof(struct prctl_mm_map, start_brk), 1707 + offsetof(struct prctl_mm_map, brk), 1708 + offsetof(struct prctl_mm_map, start_stack), 1709 + offsetof(struct prctl_mm_map, arg_start), 1710 + offsetof(struct prctl_mm_map, arg_end), 1711 + offsetof(struct prctl_mm_map, env_start), 1712 + offsetof(struct prctl_mm_map, env_end), 1713 + }; 1714 + 1715 + /* 1716 + * Make sure the members are not somewhere outside 1717 + * of allowed address space. 
1718 + */ 1719 + for (i = 0; i < ARRAY_SIZE(offsets); i++) { 1720 + u64 val = *(u64 *)((char *)prctl_map + offsets[i]); 1721 + 1722 + if ((unsigned long)val >= mmap_max_addr || 1723 + (unsigned long)val < mmap_min_addr) 1724 + goto out; 1725 + } 1726 + 1727 + /* 1728 + * Make sure the pairs are ordered. 1729 + */ 1730 + #define __prctl_check_order(__m1, __op, __m2) \ 1731 + ((unsigned long)prctl_map->__m1 __op \ 1732 + (unsigned long)prctl_map->__m2) ? 0 : -EINVAL 1733 + error = __prctl_check_order(start_code, <, end_code); 1734 + error |= __prctl_check_order(start_data, <, end_data); 1735 + error |= __prctl_check_order(start_brk, <=, brk); 1736 + error |= __prctl_check_order(arg_start, <=, arg_end); 1737 + error |= __prctl_check_order(env_start, <=, env_end); 1738 + if (error) 1739 + goto out; 1740 + #undef __prctl_check_order 1741 + 1742 + error = -EINVAL; 1743 + 1744 + /* 1745 + * @brk should be after @end_data in traditional maps. 1746 + */ 1747 + if (prctl_map->start_brk <= prctl_map->end_data || 1748 + prctl_map->brk <= prctl_map->end_data) 1749 + goto out; 1750 + 1751 + /* 1752 + * Neither we should allow to override limits if they set. 1753 + */ 1754 + if (check_data_rlimit(rlimit(RLIMIT_DATA), prctl_map->brk, 1755 + prctl_map->start_brk, prctl_map->end_data, 1756 + prctl_map->start_data)) 1757 + goto out; 1758 + 1759 + /* 1760 + * Someone is trying to cheat the auxv vector. 1761 + */ 1762 + if (prctl_map->auxv_size) { 1763 + if (!prctl_map->auxv || prctl_map->auxv_size > sizeof(mm->saved_auxv)) 1764 + goto out; 1765 + } 1766 + 1767 + /* 1768 + * Finally, make sure the caller has the rights to 1769 + * change /proc/pid/exe link: only local root should 1770 + * be allowed to. 
1771 + */ 1772 + if (prctl_map->exe_fd != (u32)-1) { 1773 + struct user_namespace *ns = current_user_ns(); 1774 + const struct cred *cred = current_cred(); 1775 + 1776 + if (!uid_eq(cred->uid, make_kuid(ns, 0)) || 1777 + !gid_eq(cred->gid, make_kgid(ns, 0))) 1778 + goto out; 1779 + } 1780 + 1781 + error = 0; 1782 + out: 1783 + return error; 1784 + } 1785 + 1786 + static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data_size) 1787 + { 1788 + struct prctl_mm_map prctl_map = { .exe_fd = (u32)-1, }; 1789 + unsigned long user_auxv[AT_VECTOR_SIZE]; 1790 + struct mm_struct *mm = current->mm; 1791 + int error; 1792 + 1793 + BUILD_BUG_ON(sizeof(user_auxv) != sizeof(mm->saved_auxv)); 1794 + BUILD_BUG_ON(sizeof(struct prctl_mm_map) > 256); 1795 + 1796 + if (opt == PR_SET_MM_MAP_SIZE) 1797 + return put_user((unsigned int)sizeof(prctl_map), 1798 + (unsigned int __user *)addr); 1799 + 1800 + if (data_size != sizeof(prctl_map)) 1801 + return -EINVAL; 1802 + 1803 + if (copy_from_user(&prctl_map, addr, sizeof(prctl_map))) 1804 + return -EFAULT; 1805 + 1806 + error = validate_prctl_map(&prctl_map); 1807 + if (error) 1808 + return error; 1809 + 1810 + if (prctl_map.auxv_size) { 1811 + memset(user_auxv, 0, sizeof(user_auxv)); 1812 + if (copy_from_user(user_auxv, 1813 + (const void __user *)prctl_map.auxv, 1814 + prctl_map.auxv_size)) 1815 + return -EFAULT; 1816 + 1817 + /* Last entry must be AT_NULL as specification requires */ 1818 + user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL; 1819 + user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL; 1820 + } 1821 + 1822 + down_write(&mm->mmap_sem); 1823 + if (prctl_map.exe_fd != (u32)-1) 1824 + error = prctl_set_mm_exe_file_locked(mm, prctl_map.exe_fd); 1825 + downgrade_write(&mm->mmap_sem); 1826 + if (error) 1827 + goto out; 1828 + 1829 + /* 1830 + * We don't validate if these members are pointing to 1831 + * real present VMAs because application may have correspond 1832 + * VMAs already unmapped and kernel uses these members for 
statistics 1833 + * output in procfs mostly, except 1834 + * 1835 + * - @start_brk/@brk which are used in do_brk but kernel lookups 1836 + * for VMAs when updating these memvers so anything wrong written 1837 + * here cause kernel to swear at userspace program but won't lead 1838 + * to any problem in kernel itself 1839 + */ 1840 + 1841 + mm->start_code = prctl_map.start_code; 1842 + mm->end_code = prctl_map.end_code; 1843 + mm->start_data = prctl_map.start_data; 1844 + mm->end_data = prctl_map.end_data; 1845 + mm->start_brk = prctl_map.start_brk; 1846 + mm->brk = prctl_map.brk; 1847 + mm->start_stack = prctl_map.start_stack; 1848 + mm->arg_start = prctl_map.arg_start; 1849 + mm->arg_end = prctl_map.arg_end; 1850 + mm->env_start = prctl_map.env_start; 1851 + mm->env_end = prctl_map.env_end; 1852 + 1853 + /* 1854 + * Note this update of @saved_auxv is lockless thus 1855 + * if someone reads this member in procfs while we're 1856 + * updating -- it may get partly updated results. It's 1857 + * known and acceptable trade off: we leave it as is to 1858 + * not introduce additional locks here making the kernel 1859 + * more complex. 
1860 + */ 1861 + if (prctl_map.auxv_size) 1862 + memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); 1863 + 1864 + error = 0; 1865 + out: 1866 + up_read(&mm->mmap_sem); 1867 + return error; 1868 + } 1869 + #endif /* CONFIG_CHECKPOINT_RESTORE */ 1870 + 1701 1871 static int prctl_set_mm(int opt, unsigned long addr, 1702 1872 unsigned long arg4, unsigned long arg5) 1703 1873 { 1704 - unsigned long rlim = rlimit(RLIMIT_DATA); 1705 1874 struct mm_struct *mm = current->mm; 1706 1875 struct vm_area_struct *vma; 1707 1876 int error; 1708 1877 1709 - if (arg5 || (arg4 && opt != PR_SET_MM_AUXV)) 1878 + if (arg5 || (arg4 && (opt != PR_SET_MM_AUXV && 1879 + opt != PR_SET_MM_MAP && 1880 + opt != PR_SET_MM_MAP_SIZE))) 1710 1881 return -EINVAL; 1882 + 1883 + #ifdef CONFIG_CHECKPOINT_RESTORE 1884 + if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE) 1885 + return prctl_set_mm_map(opt, (const void __user *)addr, arg4); 1886 + #endif 1711 1887 1712 1888 if (!capable(CAP_SYS_RESOURCE)) 1713 1889 return -EPERM; 1714 1890 1715 - if (opt == PR_SET_MM_EXE_FILE) 1716 - return prctl_set_mm_exe_file(mm, (unsigned int)addr); 1891 + if (opt == PR_SET_MM_EXE_FILE) { 1892 + down_write(&mm->mmap_sem); 1893 + error = prctl_set_mm_exe_file_locked(mm, (unsigned int)addr); 1894 + up_write(&mm->mmap_sem); 1895 + return error; 1896 + } 1717 1897 1718 1898 if (addr >= TASK_SIZE || addr < mmap_min_addr) 1719 1899 return -EINVAL; ··· 1929 1733 if (addr <= mm->end_data) 1930 1734 goto out; 1931 1735 1932 - if (rlim < RLIM_INFINITY && 1933 - (mm->brk - addr) + 1934 - (mm->end_data - mm->start_data) > rlim) 1736 + if (check_data_rlimit(rlimit(RLIMIT_DATA), mm->brk, addr, 1737 + mm->end_data, mm->start_data)) 1935 1738 goto out; 1936 1739 1937 1740 mm->start_brk = addr; ··· 1940 1745 if (addr <= mm->end_data) 1941 1746 goto out; 1942 1747 1943 - if (rlim < RLIM_INFINITY && 1944 - (addr - mm->start_brk) + 1945 - (mm->end_data - mm->start_data) > rlim) 1748 + if (check_data_rlimit(rlimit(RLIMIT_DATA), 
addr, mm->start_brk, 1749 + mm->end_data, mm->start_data)) 1946 1750 goto out; 1947 1751 1948 1752 mm->brk = addr; ··· 2217 2023 { 2218 2024 int err = 0; 2219 2025 int cpu = raw_smp_processor_id(); 2026 + 2220 2027 if (cpup) 2221 2028 err |= put_user(cpu, cpup); 2222 2029 if (nodep) ··· 2330 2135 /* Check to see if any memory value is too large for 32-bit and scale 2331 2136 * down if needed 2332 2137 */ 2333 - if ((s.totalram >> 32) || (s.totalswap >> 32)) { 2138 + if (upper_32_bits(s.totalram) || upper_32_bits(s.totalswap)) { 2334 2139 int bitcount = 0; 2335 2140 2336 2141 while (s.mem_unit < PAGE_SIZE) {
-7
kernel/sysctl.c
··· 1460 1460 .extra2 = &one, 1461 1461 }, 1462 1462 #endif 1463 - { 1464 - .procname = "scan_unevictable_pages", 1465 - .data = &scan_unevictable_pages, 1466 - .maxlen = sizeof(scan_unevictable_pages), 1467 - .mode = 0644, 1468 - .proc_handler = scan_unevictable_handler, 1469 - }, 1470 1463 #ifdef CONFIG_MEMORY_FAILURE 1471 1464 { 1472 1465 .procname = "memory_failure_early_kill",
+17 -1
kernel/watchdog.c
··· 47 47 static DEFINE_PER_CPU(bool, soft_watchdog_warn); 48 48 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts); 49 49 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt); 50 + static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved); 50 51 #ifdef CONFIG_HARDLOCKUP_DETECTOR 51 52 static DEFINE_PER_CPU(bool, hard_watchdog_warn); 52 53 static DEFINE_PER_CPU(bool, watchdog_nmi_touch); ··· 334 333 return HRTIMER_RESTART; 335 334 336 335 /* only warn once */ 337 - if (__this_cpu_read(soft_watchdog_warn) == true) 336 + if (__this_cpu_read(soft_watchdog_warn) == true) { 337 + /* 338 + * When multiple processes are causing softlockups the 339 + * softlockup detector only warns on the first one 340 + * because the code relies on a full quiet cycle to 341 + * re-arm. The second process prevents the quiet cycle 342 + * and never gets reported. Use task pointers to detect 343 + * this. 344 + */ 345 + if (__this_cpu_read(softlockup_task_ptr_saved) != 346 + current) { 347 + __this_cpu_write(soft_watchdog_warn, false); 348 + __touch_watchdog(); 349 + } 338 350 return HRTIMER_RESTART; 351 + } 339 352 340 353 if (softlockup_all_cpu_backtrace) { 341 354 /* Prevent multiple soft-lockup reports if one cpu is already ··· 365 350 pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n", 366 351 smp_processor_id(), duration, 367 352 current->comm, task_pid_nr(current)); 353 + __this_cpu_write(softlockup_task_ptr_saved, current); 368 354 print_modules(); 369 355 print_irqtrace_events(current); 370 356 if (regs)
+49
lib/genalloc.c
··· 403 403 EXPORT_SYMBOL(gen_pool_for_each_chunk); 404 404 405 405 /** 406 + * addr_in_gen_pool - checks if an address falls within the range of a pool 407 + * @pool: the generic memory pool 408 + * @start: start address 409 + * @size: size of the region 410 + * 411 + * Check if the range of addresses falls within the specified pool. Returns 412 + * true if the entire range is contained in the pool and false otherwise. 413 + */ 414 + bool addr_in_gen_pool(struct gen_pool *pool, unsigned long start, 415 + size_t size) 416 + { 417 + bool found = false; 418 + unsigned long end = start + size; 419 + struct gen_pool_chunk *chunk; 420 + 421 + rcu_read_lock(); 422 + list_for_each_entry_rcu(chunk, &(pool)->chunks, next_chunk) { 423 + if (start >= chunk->start_addr && start <= chunk->end_addr) { 424 + if (end <= chunk->end_addr) { 425 + found = true; 426 + break; 427 + } 428 + } 429 + } 430 + rcu_read_unlock(); 431 + return found; 432 + } 433 + 434 + /** 406 435 * gen_pool_avail - get available free space of the pool 407 436 * @pool: pool to get available free space 408 437 * ··· 508 479 return bitmap_find_next_zero_area(map, size, start, nr, 0); 509 480 } 510 481 EXPORT_SYMBOL(gen_pool_first_fit); 482 + 483 + /** 484 + * gen_pool_first_fit_order_align - find the first available region 485 + * of memory matching the size requirement. The region will be aligned 486 + * to the order of the size specified. 
487 + * @map: The address to base the search on 488 + * @size: The bitmap size in bits 489 + * @start: The bitnumber to start searching at 490 + * @nr: The number of zeroed bits we're looking for 491 + * @data: additional data - unused 492 + */ 493 + unsigned long gen_pool_first_fit_order_align(unsigned long *map, 494 + unsigned long size, unsigned long start, 495 + unsigned int nr, void *data) 496 + { 497 + unsigned long align_mask = roundup_pow_of_two(nr) - 1; 498 + 499 + return bitmap_find_next_zero_area(map, size, start, nr, align_mask); 500 + } 501 + EXPORT_SYMBOL(gen_pool_first_fit_order_align); 511 502 512 503 /** 513 504 * gen_pool_best_fit - find the best fitting region of memory
+9 -1
mm/Kconfig
··· 137 137 config HAVE_MEMBLOCK_PHYS_MAP 138 138 boolean 139 139 140 + config HAVE_GENERIC_RCU_GUP 141 + boolean 142 + 140 143 config ARCH_DISCARD_MEMBLOCK 141 144 boolean 142 145 ··· 231 228 boolean 232 229 233 230 # 231 + # support for memory balloon 232 + config MEMORY_BALLOON 233 + boolean 234 + 235 + # 234 236 # support for memory balloon compaction 235 237 config BALLOON_COMPACTION 236 238 bool "Allow for balloon memory compaction/migration" 237 239 def_bool y 238 - depends on COMPACTION && VIRTIO_BALLOON 240 + depends on COMPACTION && MEMORY_BALLOON 239 241 help 240 242 Memory fragmentation introduced by ballooning might reduce 241 243 significantly the number of 2MB contiguous memory blocks that can be
+3 -2
mm/Makefile
··· 16 16 readahead.o swap.o truncate.o vmscan.o shmem.o \ 17 17 util.o mmzone.o vmstat.o backing-dev.o \ 18 18 mm_init.o mmu_context.o percpu.o slab_common.o \ 19 - compaction.o balloon_compaction.o vmacache.o \ 19 + compaction.o vmacache.o \ 20 20 interval_tree.o list_lru.o workingset.o \ 21 - iov_iter.o $(mmu-y) 21 + iov_iter.o debug.o $(mmu-y) 22 22 23 23 obj-y += init-mm.o 24 24 ··· 67 67 obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 68 68 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 69 69 obj-$(CONFIG_CMA) += cma.o 70 + obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+1 -1
mm/backing-dev.c
··· 631 631 * of sleeping on the congestion queue 632 632 */ 633 633 if (atomic_read(&nr_bdi_congested[sync]) == 0 || 634 - !zone_is_reclaim_congested(zone)) { 634 + !test_bit(ZONE_CONGESTED, &zone->flags)) { 635 635 cond_resched(); 636 636 637 637 /* In case we scheduled, work out time remaining */
+20 -103
mm/balloon_compaction.c
··· 11 11 #include <linux/balloon_compaction.h> 12 12 13 13 /* 14 - * balloon_devinfo_alloc - allocates a balloon device information descriptor. 15 - * @balloon_dev_descriptor: pointer to reference the balloon device which 16 - * this struct balloon_dev_info will be servicing. 17 - * 18 - * Driver must call it to properly allocate and initialize an instance of 19 - * struct balloon_dev_info which will be used to reference a balloon device 20 - * as well as to keep track of the balloon device page list. 21 - */ 22 - struct balloon_dev_info *balloon_devinfo_alloc(void *balloon_dev_descriptor) 23 - { 24 - struct balloon_dev_info *b_dev_info; 25 - b_dev_info = kmalloc(sizeof(*b_dev_info), GFP_KERNEL); 26 - if (!b_dev_info) 27 - return ERR_PTR(-ENOMEM); 28 - 29 - b_dev_info->balloon_device = balloon_dev_descriptor; 30 - b_dev_info->mapping = NULL; 31 - b_dev_info->isolated_pages = 0; 32 - spin_lock_init(&b_dev_info->pages_lock); 33 - INIT_LIST_HEAD(&b_dev_info->pages); 34 - 35 - return b_dev_info; 36 - } 37 - EXPORT_SYMBOL_GPL(balloon_devinfo_alloc); 38 - 39 - /* 40 14 * balloon_page_enqueue - allocates a new page and inserts it into the balloon 41 15 * page list. 42 16 * @b_dev_info: balloon device descriptor where we will insert a new page to ··· 35 61 */ 36 62 BUG_ON(!trylock_page(page)); 37 63 spin_lock_irqsave(&b_dev_info->pages_lock, flags); 38 - balloon_page_insert(page, b_dev_info->mapping, &b_dev_info->pages); 64 + balloon_page_insert(b_dev_info, page); 65 + __count_vm_event(BALLOON_INFLATE); 39 66 spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); 40 67 unlock_page(page); 41 68 return page; ··· 68 93 * to be released by the balloon driver.
69 94 */ 70 95 if (trylock_page(page)) { 96 + if (!PagePrivate(page)) { 97 + /* raced with isolation */ 98 + unlock_page(page); 99 + continue; 100 + } 71 101 spin_lock_irqsave(&b_dev_info->pages_lock, flags); 72 - /* 73 - * Raise the page refcount here to prevent any wrong 74 - * attempt to isolate this page, in case of coliding 75 - * with balloon_page_isolate() just after we release 76 - * the page lock. 77 - * 78 - * balloon_page_free() will take care of dropping 79 - * this extra refcount later. 80 - */ 81 - get_page(page); 82 102 balloon_page_delete(page); 103 + __count_vm_event(BALLOON_DEFLATE); 83 104 spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); 84 105 unlock_page(page); 85 106 dequeued_page = true; ··· 103 132 EXPORT_SYMBOL_GPL(balloon_page_dequeue); 104 133 105 134 #ifdef CONFIG_BALLOON_COMPACTION 106 - /* 107 - * balloon_mapping_alloc - allocates a special ->mapping for ballooned pages. 108 - * @b_dev_info: holds the balloon device information descriptor. 109 - * @a_ops: balloon_mapping address_space_operations descriptor. 110 - * 111 - * Driver must call it to properly allocate and initialize an instance of 112 - * struct address_space which will be used as the special page->mapping for 113 - * balloon device enlisted page instances. 114 - */ 115 - struct address_space *balloon_mapping_alloc(struct balloon_dev_info *b_dev_info, 116 - const struct address_space_operations *a_ops) 117 - { 118 - struct address_space *mapping; 119 - 120 - mapping = kmalloc(sizeof(*mapping), GFP_KERNEL); 121 - if (!mapping) 122 - return ERR_PTR(-ENOMEM); 123 - 124 - /* 125 - * Give a clean 'zeroed' status to all elements of this special 126 - * balloon page->mapping struct address_space instance. 127 - */ 128 - address_space_init_once(mapping); 129 - 130 - /* 131 - * Set mapping->flags appropriately, to allow balloon pages 132 - * ->mapping identification. 
133 - */ 134 - mapping_set_balloon(mapping); 135 - mapping_set_gfp_mask(mapping, balloon_mapping_gfp_mask()); 136 - 137 - /* balloon's page->mapping->a_ops callback descriptor */ 138 - mapping->a_ops = a_ops; 139 - 140 - /* 141 - * Establish a pointer reference back to the balloon device descriptor 142 - * this particular page->mapping will be servicing. 143 - * This is used by compaction / migration procedures to identify and 144 - * access the balloon device pageset while isolating / migrating pages. 145 - * 146 - * As some balloon drivers can register multiple balloon devices 147 - * for a single guest, this also helps compaction / migration to 148 - * properly deal with multiple balloon pagesets, when required. 149 - */ 150 - mapping->private_data = b_dev_info; 151 - b_dev_info->mapping = mapping; 152 - 153 - return mapping; 154 - } 155 - EXPORT_SYMBOL_GPL(balloon_mapping_alloc); 156 135 157 136 static inline void __isolate_balloon_page(struct page *page) 158 137 { 159 - struct balloon_dev_info *b_dev_info = page->mapping->private_data; 138 + struct balloon_dev_info *b_dev_info = balloon_page_device(page); 160 139 unsigned long flags; 140 + 161 141 spin_lock_irqsave(&b_dev_info->pages_lock, flags); 142 + ClearPagePrivate(page); 162 143 list_del(&page->lru); 163 144 b_dev_info->isolated_pages++; 164 145 spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); ··· 118 195 119 196 static inline void __putback_balloon_page(struct page *page) 120 197 { 121 - struct balloon_dev_info *b_dev_info = page->mapping->private_data; 198 + struct balloon_dev_info *b_dev_info = balloon_page_device(page); 122 199 unsigned long flags; 200 + 123 201 spin_lock_irqsave(&b_dev_info->pages_lock, flags); 202 + SetPagePrivate(page); 124 203 list_add(&page->lru, &b_dev_info->pages); 125 204 b_dev_info->isolated_pages--; 126 205 spin_unlock_irqrestore(&b_dev_info->pages_lock, flags); 127 - } 128 - 129 - static inline int __migrate_balloon_page(struct address_space *mapping, 130 - struct 
page *newpage, struct page *page, enum migrate_mode mode) 131 - { 132 - return page->mapping->a_ops->migratepage(mapping, newpage, page, mode); 133 206 } 134 207 135 208 /* __isolate_lru_page() counterpart for a ballooned page */ ··· 154 235 */ 155 236 if (likely(trylock_page(page))) { 156 237 /* 157 - * A ballooned page, by default, has just one refcount. 238 + * A ballooned page, by default, has PagePrivate set. 158 239 * Prevent concurrent compaction threads from isolating 159 - * an already isolated balloon page by refcount check. 240 + * an already isolated balloon page by clearing it. 160 241 */ 161 - if (__is_movable_balloon_page(page) && 162 - page_count(page) == 2) { 242 + if (balloon_page_movable(page)) { 163 243 __isolate_balloon_page(page); 164 244 unlock_page(page); 165 245 return true; ··· 194 276 int balloon_page_migrate(struct page *newpage, 195 277 struct page *page, enum migrate_mode mode) 196 278 { 197 - struct address_space *mapping; 279 + struct balloon_dev_info *balloon = balloon_page_device(page); 198 280 int rc = -EAGAIN; 199 281 200 282 /* ··· 210 292 return rc; 211 293 } 212 294 213 - mapping = page->mapping; 214 - if (mapping) 215 - rc = __migrate_balloon_page(mapping, newpage, page, mode); 295 + if (balloon && balloon->migratepage) 296 + rc = balloon->migratepage(balloon, newpage, page, mode); 216 297 217 298 unlock_page(newpage); 218 299 return rc;
+2 -2
mm/bootmem.c
··· 16 16 #include <linux/kmemleak.h> 17 17 #include <linux/range.h> 18 18 #include <linux/memblock.h> 19 + #include <linux/bug.h> 20 + #include <linux/io.h> 19 21 20 - #include <asm/bug.h> 21 - #include <asm/io.h> 22 22 #include <asm/processor.h> 23 23 24 24 #include "internal.h"
+21
mm/cma.c
··· 32 32 #include <linux/slab.h> 33 33 #include <linux/log2.h> 34 34 #include <linux/cma.h> 35 + #include <linux/highmem.h> 35 36 36 37 struct cma { 37 38 unsigned long base_pfn; ··· 164 163 bool fixed, struct cma **res_cma) 165 164 { 166 165 struct cma *cma; 166 + phys_addr_t memblock_end = memblock_end_of_DRAM(); 167 + phys_addr_t highmem_start = __pa(high_memory); 167 168 int ret = 0; 168 169 169 170 pr_debug("%s(size %lx, base %08lx, limit %08lx alignment %08lx)\n", ··· 198 195 /* size should be aligned with order_per_bit */ 199 196 if (!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit)) 200 197 return -EINVAL; 198 + 199 + /* 200 + * adjust limit to avoid crossing low/high memory boundary for 201 + * automatically allocated regions 202 + */ 203 + if (((limit == 0 || limit > memblock_end) && 204 + (memblock_end - size < highmem_start && 205 + memblock_end > highmem_start)) || 206 + (!fixed && limit > highmem_start && limit - size < highmem_start)) { 207 + limit = highmem_start; 208 + } 209 + 210 + if (fixed && base < highmem_start && base+size > highmem_start) { 211 + ret = -EINVAL; 212 + pr_err("Region at %08lx defined on low/high memory boundary (%08lx)\n", 213 + (unsigned long)base, (unsigned long)highmem_start); 214 + goto err; 215 + } 201 216 202 217 /* Reserve memory */ 203 218 if (base && fixed) {
+444 -230
mm/compaction.c
··· 67 67 return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE; 68 68 } 69 69 70 + /* 71 + * Check that the whole (or subset of) a pageblock given by the interval of 72 + * [start_pfn, end_pfn) is valid and within the same zone, before scanning it 73 + * with the migration of free compaction scanner. The scanners then need to 74 + * use only pfn_valid_within() check for arches that allow holes within 75 + * pageblocks. 76 + * 77 + * Return struct page pointer of start_pfn, or NULL if checks were not passed. 78 + * 79 + * It's possible on some configurations to have a setup like node0 node1 node0 80 + * i.e. it's possible that all pages within a zones range of pages do not 81 + * belong to a single zone. We assume that a border between node0 and node1 82 + * can occur within a single pageblock, but not a node0 node1 node0 83 + * interleaving within a single pageblock. It is therefore sufficient to check 84 + * the first and last page of a pageblock and avoid checking each individual 85 + * page in a pageblock. 86 + */ 87 + static struct page *pageblock_pfn_to_page(unsigned long start_pfn, 88 + unsigned long end_pfn, struct zone *zone) 89 + { 90 + struct page *start_page; 91 + struct page *end_page; 92 + 93 + /* end_pfn is one past the range we are checking */ 94 + end_pfn--; 95 + 96 + if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) 97 + return NULL; 98 + 99 + start_page = pfn_to_page(start_pfn); 100 + 101 + if (page_zone(start_page) != zone) 102 + return NULL; 103 + 104 + end_page = pfn_to_page(end_pfn); 105 + 106 + /* This gives a shorter code than deriving page_zone(end_page) */ 107 + if (page_zone_id(start_page) != page_zone_id(end_page)) 108 + return NULL; 109 + 110 + return start_page; 111 + } 112 + 70 113 #ifdef CONFIG_COMPACTION 71 114 /* Returns true if the pageblock should be scanned for pages to isolate. 
*/ 72 115 static inline bool isolation_suitable(struct compact_control *cc, ··· 175 132 */ 176 133 static void update_pageblock_skip(struct compact_control *cc, 177 134 struct page *page, unsigned long nr_isolated, 178 - bool set_unsuitable, bool migrate_scanner) 135 + bool migrate_scanner) 179 136 { 180 137 struct zone *zone = cc->zone; 181 138 unsigned long pfn; ··· 189 146 if (nr_isolated) 190 147 return; 191 148 192 - /* 193 - * Only skip pageblocks when all forms of compaction will be known to 194 - * fail in the near future. 195 - */ 196 - if (set_unsuitable) 197 - set_pageblock_skip(page); 149 + set_pageblock_skip(page); 198 150 199 151 pfn = page_to_pfn(page); 200 152 ··· 218 180 219 181 static void update_pageblock_skip(struct compact_control *cc, 220 182 struct page *page, unsigned long nr_isolated, 221 - bool set_unsuitable, bool migrate_scanner) 183 + bool migrate_scanner) 222 184 { 223 185 } 224 186 #endif /* CONFIG_COMPACTION */ 225 187 226 - static inline bool should_release_lock(spinlock_t *lock) 188 + /* 189 + * Compaction requires the taking of some coarse locks that are potentially 190 + * very heavily contended. For async compaction, back out if the lock cannot 191 + * be taken immediately. For sync compaction, spin on the lock if needed. 192 + * 193 + * Returns true if the lock is held 194 + * Returns false if the lock is not held and compaction should abort 195 + */ 196 + static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags, 197 + struct compact_control *cc) 227 198 { 228 - return need_resched() || spin_is_contended(lock); 199 + if (cc->mode == MIGRATE_ASYNC) { 200 + if (!spin_trylock_irqsave(lock, *flags)) { 201 + cc->contended = COMPACT_CONTENDED_LOCK; 202 + return false; 203 + } 204 + } else { 205 + spin_lock_irqsave(lock, *flags); 206 + } 207 + 208 + return true; 229 209 } 230 210 231 211 /* 232 212 * Compaction requires the taking of some coarse locks that are potentially 233 - * very heavily contended. 
Check if the process needs to be scheduled or 234 - * if the lock is contended. For async compaction, back out in the event 235 - * if contention is severe. For sync compaction, schedule. 213 + * very heavily contended. The lock should be periodically unlocked to avoid 214 + * having disabled IRQs for a long time, even when there is nobody waiting on 215 + * the lock. It might also be that allowing the IRQs will result in 216 + * need_resched() becoming true. If scheduling is needed, async compaction 217 + * aborts. Sync compaction schedules. 218 + * Either compaction type will also abort if a fatal signal is pending. 219 + * In either case if the lock was locked, it is dropped and not regained. 236 220 * 237 - * Returns true if the lock is held. 238 - * Returns false if the lock is released and compaction should abort 221 + * Returns true if compaction should abort due to fatal signal pending, or 222 + * async compaction due to need_resched() 223 + * Returns false when compaction can continue (sync compaction might have 224 + * scheduled) 239 225 */ 240 - static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags, 241 - bool locked, struct compact_control *cc) 226 + static bool compact_unlock_should_abort(spinlock_t *lock, 227 + unsigned long flags, bool *locked, struct compact_control *cc) 242 228 { 243 - if (should_release_lock(lock)) { 244 - if (locked) { 245 - spin_unlock_irqrestore(lock, *flags); 246 - locked = false; 247 - } 229 + if (*locked) { 230 + spin_unlock_irqrestore(lock, flags); 231 + *locked = false; 232 + } 248 233 249 - /* async aborts if taking too long or contended */ 234 + if (fatal_signal_pending(current)) { 235 + cc->contended = COMPACT_CONTENDED_SCHED; 236 + return true; 237 + } 238 + 239 + if (need_resched()) { 250 240 if (cc->mode == MIGRATE_ASYNC) { 251 - cc->contended = true; 252 - return false; 241 + cc->contended = COMPACT_CONTENDED_SCHED; 242 + return true; 253 243 } 254 - 255 244 cond_resched(); 256 245 } 257 246 
258 - if (!locked) 259 - spin_lock_irqsave(lock, *flags); 260 - return true; 247 + return false; 261 248 } 262 249 263 250 /* 264 251 * Aside from avoiding lock contention, compaction also periodically checks 265 252 * need_resched() and either schedules in sync compaction or aborts async 266 - * compaction. This is similar to what compact_checklock_irqsave() does, but 253 + * compaction. This is similar to what compact_unlock_should_abort() does, but 267 254 * is used where no lock is concerned. 268 255 * 269 256 * Returns false when no scheduling was needed, or sync compaction scheduled. ··· 299 236 /* async compaction aborts if contended */ 300 237 if (need_resched()) { 301 238 if (cc->mode == MIGRATE_ASYNC) { 302 - cc->contended = true; 239 + cc->contended = COMPACT_CONTENDED_SCHED; 303 240 return true; 304 241 } 305 242 ··· 313 250 static bool suitable_migration_target(struct page *page) 314 251 { 315 252 /* If the page is a large free page, then disallow migration */ 316 - if (PageBuddy(page) && page_order(page) >= pageblock_order) 317 - return false; 253 + if (PageBuddy(page)) { 254 + /* 255 + * We are checking page_order without zone->lock taken. But 256 + * the only small danger is that we skip a potentially suitable 257 + * pageblock, so it's not worth to check order for valid range. 258 + */ 259 + if (page_order_unsafe(page) >= pageblock_order) 260 + return false; 261 + } 318 262 319 263 /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */ 320 264 if (migrate_async_suitable(get_pageblock_migratetype(page))) ··· 337 267 * (even though it may still end up isolating some pages). 
338 268 */ 339 269 static unsigned long isolate_freepages_block(struct compact_control *cc, 340 - unsigned long blockpfn, 270 + unsigned long *start_pfn, 341 271 unsigned long end_pfn, 342 272 struct list_head *freelist, 343 273 bool strict) 344 274 { 345 275 int nr_scanned = 0, total_isolated = 0; 346 276 struct page *cursor, *valid_page = NULL; 347 - unsigned long flags; 277 + unsigned long flags = 0; 348 278 bool locked = false; 349 - bool checked_pageblock = false; 279 + unsigned long blockpfn = *start_pfn; 350 280 351 281 cursor = pfn_to_page(blockpfn); 352 282 ··· 354 284 for (; blockpfn < end_pfn; blockpfn++, cursor++) { 355 285 int isolated, i; 356 286 struct page *page = cursor; 287 + 288 + /* 289 + * Periodically drop the lock (if held) regardless of its 290 + * contention, to give chance to IRQs. Abort if fatal signal 291 + * pending or async compaction detects need_resched() 292 + */ 293 + if (!(blockpfn % SWAP_CLUSTER_MAX) 294 + && compact_unlock_should_abort(&cc->zone->lock, flags, 295 + &locked, cc)) 296 + break; 357 297 358 298 nr_scanned++; 359 299 if (!pfn_valid_within(blockpfn)) ··· 375 295 goto isolate_fail; 376 296 377 297 /* 378 - * The zone lock must be held to isolate freepages. 379 - * Unfortunately this is a very coarse lock and can be 380 - * heavily contended if there are parallel allocations 381 - * or parallel compactions. For async compaction do not 382 - * spin on the lock and we acquire the lock as late as 383 - * possible. 298 + * If we already hold the lock, we can skip some rechecking. 299 + * Note that if we hold the lock now, checked_pageblock was 300 + * already set in some previous iteration (or strict is true), 301 + * so it is correct to skip the suitable migration target 302 + * recheck as well. 
384 303 */ 385 - locked = compact_checklock_irqsave(&cc->zone->lock, &flags, 386 - locked, cc); 387 - if (!locked) 388 - break; 389 - 390 - /* Recheck this is a suitable migration target under lock */ 391 - if (!strict && !checked_pageblock) { 304 + if (!locked) { 392 305 /* 393 - * We need to check suitability of pageblock only once 394 - * and this isolate_freepages_block() is called with 395 - * pageblock range, so just check once is sufficient. 306 + * The zone lock must be held to isolate freepages. 307 + * Unfortunately this is a very coarse lock and can be 308 + * heavily contended if there are parallel allocations 309 + * or parallel compactions. For async compaction do not 310 + * spin on the lock and we acquire the lock as late as 311 + * possible. 396 312 */ 397 - checked_pageblock = true; 398 - if (!suitable_migration_target(page)) 313 + locked = compact_trylock_irqsave(&cc->zone->lock, 314 + &flags, cc); 315 + if (!locked) 399 316 break; 400 - } 401 317 402 - /* Recheck this is a buddy page under lock */ 403 - if (!PageBuddy(page)) 404 - goto isolate_fail; 318 + /* Recheck this is a buddy page under lock */ 319 + if (!PageBuddy(page)) 320 + goto isolate_fail; 321 + } 405 322 406 323 /* Found a free page, break it into order-0 pages */ 407 324 isolated = split_free_page(page); ··· 423 346 424 347 } 425 348 349 + /* Record how far we have got within the block */ 350 + *start_pfn = blockpfn; 351 + 426 352 trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated); 427 353 428 354 /* ··· 441 361 442 362 /* Update the pageblock-skip if the whole pageblock was scanned */ 443 363 if (blockpfn == end_pfn) 444 - update_pageblock_skip(cc, valid_page, total_isolated, true, 445 - false); 364 + update_pageblock_skip(cc, valid_page, total_isolated, false); 446 365 447 366 count_compact_events(COMPACTFREE_SCANNED, nr_scanned); 448 367 if (total_isolated) ··· 469 390 unsigned long isolated, pfn, block_end_pfn; 470 391 LIST_HEAD(freelist); 471 392 472 - for (pfn 
= start_pfn; pfn < end_pfn; pfn += isolated) { 473 - if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn))) 474 - break; 393 + pfn = start_pfn; 394 + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); 475 395 476 - /* 477 - * On subsequent iterations ALIGN() is actually not needed, 478 - * but we keep it that we not to complicate the code. 479 - */ 480 - block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); 396 + for (; pfn < end_pfn; pfn += isolated, 397 + block_end_pfn += pageblock_nr_pages) { 398 + /* Protect pfn from changing by isolate_freepages_block */ 399 + unsigned long isolate_start_pfn = pfn; 400 + 481 401 block_end_pfn = min(block_end_pfn, end_pfn); 482 402 483 - isolated = isolate_freepages_block(cc, pfn, block_end_pfn, 484 - &freelist, true); 403 + if (!pageblock_pfn_to_page(pfn, block_end_pfn, cc->zone)) 404 + break; 405 + 406 + isolated = isolate_freepages_block(cc, &isolate_start_pfn, 407 + block_end_pfn, &freelist, true); 485 408 486 409 /* 487 410 * In strict mode, isolate_freepages_block() returns 0 if ··· 514 433 } 515 434 516 435 /* Update the number of anon and file isolated pages in the zone */ 517 - static void acct_isolated(struct zone *zone, bool locked, struct compact_control *cc) 436 + static void acct_isolated(struct zone *zone, struct compact_control *cc) 518 437 { 519 438 struct page *page; 520 439 unsigned int count[2] = { 0, }; 521 440 441 + if (list_empty(&cc->migratepages)) 442 + return; 443 + 522 444 list_for_each_entry(page, &cc->migratepages, lru) 523 445 count[!!page_is_file_cache(page)]++; 524 446 525 - /* If locked we can use the interrupt unsafe versions */ 526 - if (locked) { 527 - __mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]); 528 - __mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]); 529 - } else { 530 - mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]); 531 - mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]); 532 - } 447 + mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]); 448 + 
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]); 533 449 } 534 450 535 451 /* Similar to reclaim, but different enough that they don't share logic */ ··· 545 467 } 546 468 547 469 /** 548 - * isolate_migratepages_range() - isolate all migrate-able pages in range. 549 - * @zone: Zone pages are in. 470 + * isolate_migratepages_block() - isolate all migrate-able pages within 471 + * a single pageblock 550 472 * @cc: Compaction control structure. 551 - * @low_pfn: The first PFN of the range. 552 - * @end_pfn: The one-past-the-last PFN of the range. 553 - * @unevictable: true if it allows to isolate unevictable pages 473 + * @low_pfn: The first PFN to isolate 474 + * @end_pfn: The one-past-the-last PFN to isolate, within same pageblock 475 + * @isolate_mode: Isolation mode to be used. 554 476 * 555 477 * Isolate all pages that can be migrated from the range specified by 556 - * [low_pfn, end_pfn). Returns zero if there is a fatal signal 557 - * pending), otherwise PFN of the first page that was not scanned 558 - * (which may be both less, equal to or more then end_pfn). 478 + * [low_pfn, end_pfn). The range is expected to be within same pageblock. 479 + * Returns zero if there is a fatal signal pending, otherwise PFN of the 480 + * first page that was not scanned (which may be both less, equal to or more 481 + * than end_pfn). 559 482 * 560 - * Assumes that cc->migratepages is empty and cc->nr_migratepages is 561 - * zero. 562 - * 563 - * Apart from cc->migratepages and cc->nr_migratetypes this function 564 - * does not modify any cc's fields, in particular it does not modify 565 - * (or read for that matter) cc->migrate_pfn. 483 + * The pages are isolated on cc->migratepages list (not required to be empty), 484 + * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field 485 + * is neither read nor updated. 
566 486 */ 567 - unsigned long 568 - isolate_migratepages_range(struct zone *zone, struct compact_control *cc, 569 - unsigned long low_pfn, unsigned long end_pfn, bool unevictable) 487 + static unsigned long 488 + isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, 489 + unsigned long end_pfn, isolate_mode_t isolate_mode) 570 490 { 571 - unsigned long last_pageblock_nr = 0, pageblock_nr; 491 + struct zone *zone = cc->zone; 572 492 unsigned long nr_scanned = 0, nr_isolated = 0; 573 493 struct list_head *migratelist = &cc->migratepages; 574 494 struct lruvec *lruvec; 575 - unsigned long flags; 495 + unsigned long flags = 0; 576 496 bool locked = false; 577 497 struct page *page = NULL, *valid_page = NULL; 578 - bool set_unsuitable = true; 579 - const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ? 580 - ISOLATE_ASYNC_MIGRATE : 0) | 581 - (unevictable ? ISOLATE_UNEVICTABLE : 0); 582 498 583 499 /* 584 500 * Ensure that there are not too many pages isolated from the LRU ··· 595 523 596 524 /* Time to isolate some pages for migration */ 597 525 for (; low_pfn < end_pfn; low_pfn++) { 598 - /* give a chance to irqs before checking need_resched() */ 599 - if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) { 600 - if (should_release_lock(&zone->lru_lock)) { 601 - spin_unlock_irqrestore(&zone->lru_lock, flags); 602 - locked = false; 603 - } 604 - } 605 - 606 526 /* 607 - * migrate_pfn does not necessarily start aligned to a 608 - * pageblock. Ensure that pfn_valid is called when moving 609 - * into a new MAX_ORDER_NR_PAGES range in case of large 610 - * memory holes within the zone 527 + * Periodically drop the lock (if held) regardless of its 528 + * contention, to give chance to IRQs. Abort async compaction 529 + * if contended. 
611 530 */ 612 - if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) { 613 - if (!pfn_valid(low_pfn)) { 614 - low_pfn += MAX_ORDER_NR_PAGES - 1; 615 - continue; 616 - } 617 - } 531 + if (!(low_pfn % SWAP_CLUSTER_MAX) 532 + && compact_unlock_should_abort(&zone->lru_lock, flags, 533 + &locked, cc)) 534 + break; 618 535 619 536 if (!pfn_valid_within(low_pfn)) 620 537 continue; 621 538 nr_scanned++; 622 539 623 - /* 624 - * Get the page and ensure the page is within the same zone. 625 - * See the comment in isolate_freepages about overlapping 626 - * nodes. It is deliberate that the new zone lock is not taken 627 - * as memory compaction should not move pages between nodes. 628 - */ 629 540 page = pfn_to_page(low_pfn); 630 - if (page_zone(page) != zone) 631 - continue; 632 541 633 542 if (!valid_page) 634 543 valid_page = page; 635 544 636 - /* If isolation recently failed, do not retry */ 637 - pageblock_nr = low_pfn >> pageblock_order; 638 - if (last_pageblock_nr != pageblock_nr) { 639 - int mt; 640 - 641 - last_pageblock_nr = pageblock_nr; 642 - if (!isolation_suitable(cc, page)) 643 - goto next_pageblock; 545 + /* 546 + * Skip if free. We read page order here without zone lock 547 + * which is generally unsafe, but the race window is small and 548 + * the worst thing that can happen is that we skip some 549 + * potential isolation targets. 550 + */ 551 + if (PageBuddy(page)) { 552 + unsigned long freepage_order = page_order_unsafe(page); 644 553 645 554 /* 646 - * For async migration, also only scan in MOVABLE 647 - * blocks. Async migration is optimistic to see if 648 - * the minimum amount of work satisfies the allocation 555 + * Without lock, we cannot be sure that what we got is 556 + * a valid page order. Consider only values in the 557 + * valid order range to prevent low_pfn overflow. 
649 558 */ 650 - mt = get_pageblock_migratetype(page); 651 - if (cc->mode == MIGRATE_ASYNC && 652 - !migrate_async_suitable(mt)) { 653 - set_unsuitable = false; 654 - goto next_pageblock; 655 - } 656 - } 657 - 658 - /* 659 - * Skip if free. page_order cannot be used without zone->lock 660 - * as nothing prevents parallel allocations or buddy merging. 661 - */ 662 - if (PageBuddy(page)) 559 + if (freepage_order > 0 && freepage_order < MAX_ORDER) 560 + low_pfn += (1UL << freepage_order) - 1; 663 561 continue; 562 + } 664 563 665 564 /* 666 565 * Check may be lockless but that's ok as we recheck later. ··· 640 597 */ 641 598 if (!PageLRU(page)) { 642 599 if (unlikely(balloon_page_movable(page))) { 643 - if (locked && balloon_page_isolate(page)) { 600 + if (balloon_page_isolate(page)) { 644 601 /* Successfully isolated */ 645 602 goto isolate_success; 646 603 } ··· 660 617 */ 661 618 if (PageTransHuge(page)) { 662 619 if (!locked) 663 - goto next_pageblock; 664 - low_pfn += (1 << compound_order(page)) - 1; 620 + low_pfn = ALIGN(low_pfn + 1, 621 + pageblock_nr_pages) - 1; 622 + else 623 + low_pfn += (1 << compound_order(page)) - 1; 624 + 665 625 continue; 666 626 } 667 627 ··· 677 631 page_count(page) > page_mapcount(page)) 678 632 continue; 679 633 680 - /* Check if it is ok to still hold the lock */ 681 - locked = compact_checklock_irqsave(&zone->lru_lock, &flags, 682 - locked, cc); 683 - if (!locked || fatal_signal_pending(current)) 684 - break; 634 + /* If we already hold the lock, we can skip some rechecking */ 635 + if (!locked) { 636 + locked = compact_trylock_irqsave(&zone->lru_lock, 637 + &flags, cc); 638 + if (!locked) 639 + break; 685 640 686 - /* Recheck PageLRU and PageTransHuge under lock */ 687 - if (!PageLRU(page)) 688 - continue; 689 - if (PageTransHuge(page)) { 690 - low_pfn += (1 << compound_order(page)) - 1; 691 - continue; 641 + /* Recheck PageLRU and PageTransHuge under lock */ 642 + if (!PageLRU(page)) 643 + continue; 644 + if 
(PageTransHuge(page)) { 645 + low_pfn += (1 << compound_order(page)) - 1; 646 + continue; 647 + } 692 648 } 693 649 694 650 lruvec = mem_cgroup_page_lruvec(page, zone); 695 651 696 652 /* Try isolate the page */ 697 - if (__isolate_lru_page(page, mode) != 0) 653 + if (__isolate_lru_page(page, isolate_mode) != 0) 698 654 continue; 699 655 700 656 VM_BUG_ON_PAGE(PageTransCompound(page), page); ··· 715 667 ++low_pfn; 716 668 break; 717 669 } 718 - 719 - continue; 720 - 721 - next_pageblock: 722 - low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1; 723 670 } 724 671 725 - acct_isolated(zone, locked, cc); 672 + /* 673 + * The PageBuddy() check could have potentially brought us outside 674 + * the range to be scanned. 675 + */ 676 + if (unlikely(low_pfn > end_pfn)) 677 + low_pfn = end_pfn; 726 678 727 679 if (locked) 728 680 spin_unlock_irqrestore(&zone->lru_lock, flags); ··· 732 684 * if the whole pageblock was scanned without isolating any page. 733 685 */ 734 686 if (low_pfn == end_pfn) 735 - update_pageblock_skip(cc, valid_page, nr_isolated, 736 - set_unsuitable, true); 687 + update_pageblock_skip(cc, valid_page, nr_isolated, true); 737 688 738 689 trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated); 739 690 ··· 743 696 return low_pfn; 744 697 } 745 698 699 + /** 700 + * isolate_migratepages_range() - isolate migrate-able pages in a PFN range 701 + * @cc: Compaction control structure. 702 + * @start_pfn: The first PFN to start isolating. 703 + * @end_pfn: The one-past-last PFN. 704 + * 705 + * Returns zero if isolation fails fatally due to e.g. pending signal. 706 + * Otherwise, function returns one-past-the-last PFN of isolated page 707 + * (which may be greater than end_pfn if end fell in a middle of a THP page). 708 + */ 709 + unsigned long 710 + isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, 711 + unsigned long end_pfn) 712 + { 713 + unsigned long pfn, block_end_pfn; 714 + 715 + /* Scan block by block. 
First and last block may be incomplete */ 716 + pfn = start_pfn; 717 + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); 718 + 719 + for (; pfn < end_pfn; pfn = block_end_pfn, 720 + block_end_pfn += pageblock_nr_pages) { 721 + 722 + block_end_pfn = min(block_end_pfn, end_pfn); 723 + 724 + if (!pageblock_pfn_to_page(pfn, block_end_pfn, cc->zone)) 725 + continue; 726 + 727 + pfn = isolate_migratepages_block(cc, pfn, block_end_pfn, 728 + ISOLATE_UNEVICTABLE); 729 + 730 + /* 731 + * In case of fatal failure, release everything that might 732 + * have been isolated in the previous iteration, and signal 733 + * the failure back to caller. 734 + */ 735 + if (!pfn) { 736 + putback_movable_pages(&cc->migratepages); 737 + cc->nr_migratepages = 0; 738 + break; 739 + } 740 + } 741 + acct_isolated(cc->zone, cc); 742 + 743 + return pfn; 744 + } 745 + 746 746 #endif /* CONFIG_COMPACTION || CONFIG_CMA */ 747 747 #ifdef CONFIG_COMPACTION 748 748 /* 749 749 * Based on information in the current compact_control, find blocks 750 750 * suitable for isolating free pages from and then isolate them. 751 751 */ 752 - static void isolate_freepages(struct zone *zone, 753 - struct compact_control *cc) 752 + static void isolate_freepages(struct compact_control *cc) 754 753 { 754 + struct zone *zone = cc->zone; 755 755 struct page *page; 756 756 unsigned long block_start_pfn; /* start of current pageblock */ 757 + unsigned long isolate_start_pfn; /* exact pfn we start at */ 757 758 unsigned long block_end_pfn; /* end of current pageblock */ 758 759 unsigned long low_pfn; /* lowest pfn scanner is able to scan */ 759 760 int nr_freepages = cc->nr_freepages; ··· 810 715 /* 811 716 * Initialise the free scanner. The starting point is where we last 812 717 * successfully isolated from, zone-cached value, or the end of the 813 - * zone when isolating for the first time. We need this aligned to 814 - * the pageblock boundary, because we do 718 + * zone when isolating for the first time. 
For looping we also need 719 + * this pfn aligned down to the pageblock boundary, because we do 815 720 * block_start_pfn -= pageblock_nr_pages in the for loop. 816 721 * For ending point, take care when isolating in last pageblock of a 817 722 * a zone which ends in the middle of a pageblock. 818 723 * The low boundary is the end of the pageblock the migration scanner 819 724 * is using. 820 725 */ 726 + isolate_start_pfn = cc->free_pfn; 821 727 block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1); 822 728 block_end_pfn = min(block_start_pfn + pageblock_nr_pages, 823 729 zone_end_pfn(zone)); ··· 831 735 */ 832 736 for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages; 833 737 block_end_pfn = block_start_pfn, 834 - block_start_pfn -= pageblock_nr_pages) { 738 + block_start_pfn -= pageblock_nr_pages, 739 + isolate_start_pfn = block_start_pfn) { 835 740 unsigned long isolated; 836 741 837 742 /* ··· 844 747 && compact_should_abort(cc)) 845 748 break; 846 749 847 - if (!pfn_valid(block_start_pfn)) 848 - continue; 849 - 850 - /* 851 - * Check for overlapping nodes/zones. It's possible on some 852 - * configurations to have a setup like 853 - * node0 node1 node0 854 - * i.e. it's possible that all pages within a zones range of 855 - * pages do not belong to a single zone. 856 - */ 857 - page = pfn_to_page(block_start_pfn); 858 - if (page_zone(page) != zone) 750 + page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn, 751 + zone); 752 + if (!page) 859 753 continue; 860 754 861 755 /* Check the block is suitable for migration */ ··· 857 769 if (!isolation_suitable(cc, page)) 858 770 continue; 859 771 860 - /* Found a block suitable for isolating free pages from */ 861 - cc->free_pfn = block_start_pfn; 862 - isolated = isolate_freepages_block(cc, block_start_pfn, 772 + /* Found a block suitable for isolating free pages from. 
*/ 773 + isolated = isolate_freepages_block(cc, &isolate_start_pfn, 863 774 block_end_pfn, freelist, false); 864 775 nr_freepages += isolated; 776 + 777 + /* 778 + * Remember where the free scanner should restart next time, 779 + * which is where isolate_freepages_block() left off. 780 + * But if it scanned the whole pageblock, isolate_start_pfn 781 + * now points at block_end_pfn, which is the start of the next 782 + * pageblock. 783 + * In that case we will however want to restart at the start 784 + * of the previous pageblock. 785 + */ 786 + cc->free_pfn = (isolate_start_pfn < block_end_pfn) ? 787 + isolate_start_pfn : 788 + block_start_pfn - pageblock_nr_pages; 865 789 866 790 /* 867 791 * Set a flag that we successfully isolated in this pageblock. ··· 922 822 */ 923 823 if (list_empty(&cc->freepages)) { 924 824 if (!cc->contended) 925 - isolate_freepages(cc->zone, cc); 825 + isolate_freepages(cc); 926 826 927 827 if (list_empty(&cc->freepages)) 928 828 return NULL; ··· 956 856 } isolate_migrate_t; 957 857 958 858 /* 959 - * Isolate all pages that can be migrated from the block pointed to by 960 - * the migrate scanner within compact_control. 859 + * Isolate all pages that can be migrated from the first suitable block, 860 + * starting at the block pointed to by the migrate scanner pfn within 861 + * compact_control. 961 862 */ 962 863 static isolate_migrate_t isolate_migratepages(struct zone *zone, 963 864 struct compact_control *cc) 964 865 { 965 866 unsigned long low_pfn, end_pfn; 867 + struct page *page; 868 + const isolate_mode_t isolate_mode = 869 + (cc->mode == MIGRATE_ASYNC ? 
ISOLATE_ASYNC_MIGRATE : 0); 966 870 967 - /* Do not scan outside zone boundaries */ 968 - low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn); 871 + /* 872 + * Start at where we last stopped, or beginning of the zone as 873 + * initialized by compact_zone() 874 + */ 875 + low_pfn = cc->migrate_pfn; 969 876 970 877 /* Only scan within a pageblock boundary */ 971 878 end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages); 972 879 973 - /* Do not cross the free scanner or scan within a memory hole */ 974 - if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) { 975 - cc->migrate_pfn = end_pfn; 976 - return ISOLATE_NONE; 880 + /* 881 + * Iterate over whole pageblocks until we find the first suitable. 882 + * Do not cross the free scanner. 883 + */ 884 + for (; end_pfn <= cc->free_pfn; 885 + low_pfn = end_pfn, end_pfn += pageblock_nr_pages) { 886 + 887 + /* 888 + * This can potentially iterate a massively long zone with 889 + * many pageblocks unsuitable, so periodically check if we 890 + * need to schedule, or even abort async compaction. 891 + */ 892 + if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)) 893 + && compact_should_abort(cc)) 894 + break; 895 + 896 + page = pageblock_pfn_to_page(low_pfn, end_pfn, zone); 897 + if (!page) 898 + continue; 899 + 900 + /* If isolation recently failed, do not retry */ 901 + if (!isolation_suitable(cc, page)) 902 + continue; 903 + 904 + /* 905 + * For async compaction, also only scan in MOVABLE blocks. 906 + * Async compaction is optimistic to see if the minimum amount 907 + * of work satisfies the allocation. 908 + */ 909 + if (cc->mode == MIGRATE_ASYNC && 910 + !migrate_async_suitable(get_pageblock_migratetype(page))) 911 + continue; 912 + 913 + /* Perform the isolation */ 914 + low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn, 915 + isolate_mode); 916 + 917 + if (!low_pfn || cc->contended) 918 + return ISOLATE_ABORT; 919 + 920 + /* 921 + * Either we isolated something and proceed with migration. 
Or 922 + * we failed and compact_zone should decide if we should 923 + * continue or not. 924 + */ 925 + break; 977 926 } 978 927 979 - /* Perform the isolation */ 980 - low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false); 981 - if (!low_pfn || cc->contended) 982 - return ISOLATE_ABORT; 983 - 928 + acct_isolated(zone, cc); 929 + /* Record where migration scanner will be restarted */ 984 930 cc->migrate_pfn = low_pfn; 985 931 986 - return ISOLATE_SUCCESS; 932 + return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE; 987 933 } 988 934 989 - static int compact_finished(struct zone *zone, 990 - struct compact_control *cc) 935 + static int compact_finished(struct zone *zone, struct compact_control *cc, 936 + const int migratetype) 991 937 { 992 938 unsigned int order; 993 939 unsigned long watermark; ··· 1079 933 struct free_area *area = &zone->free_area[order]; 1080 934 1081 935 /* Job done if page is free of the right migratetype */ 1082 - if (!list_empty(&area->free_list[cc->migratetype])) 936 + if (!list_empty(&area->free_list[migratetype])) 1083 937 return COMPACT_PARTIAL; 1084 938 1085 939 /* Job done if allocation would set block type */ ··· 1145 999 int ret; 1146 1000 unsigned long start_pfn = zone->zone_start_pfn; 1147 1001 unsigned long end_pfn = zone_end_pfn(zone); 1002 + const int migratetype = gfpflags_to_migratetype(cc->gfp_mask); 1148 1003 const bool sync = cc->mode != MIGRATE_ASYNC; 1149 1004 1150 1005 ret = compaction_suitable(zone, cc->order); ··· 1188 1041 1189 1042 migrate_prep_local(); 1190 1043 1191 - while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) { 1044 + while ((ret = compact_finished(zone, cc, migratetype)) == 1045 + COMPACT_CONTINUE) { 1192 1046 int err; 1193 1047 1194 1048 switch (isolate_migratepages(zone, cc)) { ··· 1203 1055 case ISOLATE_SUCCESS: 1204 1056 ; 1205 1057 } 1206 - 1207 - if (!cc->nr_migratepages) 1208 - continue; 1209 1058 1210 1059 err = migrate_pages(&cc->migratepages, 
compaction_alloc, 1211 1060 compaction_free, (unsigned long)cc, cc->mode, ··· 1237 1092 } 1238 1093 1239 1094 static unsigned long compact_zone_order(struct zone *zone, int order, 1240 - gfp_t gfp_mask, enum migrate_mode mode, bool *contended) 1095 + gfp_t gfp_mask, enum migrate_mode mode, int *contended) 1241 1096 { 1242 1097 unsigned long ret; 1243 1098 struct compact_control cc = { 1244 1099 .nr_freepages = 0, 1245 1100 .nr_migratepages = 0, 1246 1101 .order = order, 1247 - .migratetype = allocflags_to_migratetype(gfp_mask), 1102 + .gfp_mask = gfp_mask, 1248 1103 .zone = zone, 1249 1104 .mode = mode, 1250 1105 }; ··· 1269 1124 * @gfp_mask: The GFP mask of the current allocation 1270 1125 * @nodemask: The allowed nodes to allocate from 1271 1126 * @mode: The migration mode for async, sync light, or sync migration 1272 - * @contended: Return value that is true if compaction was aborted due to lock contention 1273 - * @page: Optionally capture a free page of the requested order during compaction 1127 + * @contended: Return value that determines if compaction was aborted due to 1128 + * need_resched() or lock contention 1129 + * @candidate_zone: Return the zone where we think allocation should succeed 1274 1130 * 1275 1131 * This is the main entry point for direct page compaction. 
1276 1132 */ 1277 1133 unsigned long try_to_compact_pages(struct zonelist *zonelist, 1278 1134 int order, gfp_t gfp_mask, nodemask_t *nodemask, 1279 - enum migrate_mode mode, bool *contended) 1135 + enum migrate_mode mode, int *contended, 1136 + struct zone **candidate_zone) 1280 1137 { 1281 1138 enum zone_type high_zoneidx = gfp_zone(gfp_mask); 1282 1139 int may_enter_fs = gfp_mask & __GFP_FS; 1283 1140 int may_perform_io = gfp_mask & __GFP_IO; 1284 1141 struct zoneref *z; 1285 1142 struct zone *zone; 1286 - int rc = COMPACT_SKIPPED; 1143 + int rc = COMPACT_DEFERRED; 1287 1144 int alloc_flags = 0; 1145 + int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */ 1146 + 1147 + *contended = COMPACT_CONTENDED_NONE; 1288 1148 1289 1149 /* Check if the GFP flags allow compaction */ 1290 1150 if (!order || !may_enter_fs || !may_perform_io) 1291 - return rc; 1292 - 1293 - count_compact_event(COMPACTSTALL); 1151 + return COMPACT_SKIPPED; 1294 1152 1295 1153 #ifdef CONFIG_CMA 1296 - if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 1154 + if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 1297 1155 alloc_flags |= ALLOC_CMA; 1298 1156 #endif 1299 1157 /* Compact each zone in the list */ 1300 1158 for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, 1301 1159 nodemask) { 1302 1160 int status; 1161 + int zone_contended; 1162 + 1163 + if (compaction_deferred(zone, order)) 1164 + continue; 1303 1165 1304 1166 status = compact_zone_order(zone, order, gfp_mask, mode, 1305 - contended); 1167 + &zone_contended); 1306 1168 rc = max(status, rc); 1169 + /* 1170 + * It takes at least one zone that wasn't lock contended 1171 + * to clear all_zones_contended. 
1172 + */ 1173 + all_zones_contended &= zone_contended; 1307 1174 1308 1175 /* If a normal allocation would succeed, stop compacting */ 1309 1176 if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 1310 - alloc_flags)) 1311 - break; 1177 + alloc_flags)) { 1178 + *candidate_zone = zone; 1179 + /* 1180 + * We think the allocation will succeed in this zone, 1181 + * but it is not certain, hence the false. The caller 1182 + * will repeat this with true if allocation indeed 1183 + * succeeds in this zone. 1184 + */ 1185 + compaction_defer_reset(zone, order, false); 1186 + /* 1187 + * It is possible that async compaction aborted due to 1188 + * need_resched() and the watermarks were ok thanks to 1189 + * somebody else freeing memory. The allocation can 1190 + * however still fail so we better signal the 1191 + * need_resched() contention anyway (this will not 1192 + * prevent the allocation attempt). 1193 + */ 1194 + if (zone_contended == COMPACT_CONTENDED_SCHED) 1195 + *contended = COMPACT_CONTENDED_SCHED; 1196 + 1197 + goto break_loop; 1198 + } 1199 + 1200 + if (mode != MIGRATE_ASYNC) { 1201 + /* 1202 + * We think that allocation won't succeed in this zone 1203 + * so we defer compaction there. If it ends up 1204 + * succeeding after all, it will be reset. 1205 + */ 1206 + defer_compaction(zone, order); 1207 + } 1208 + 1209 + /* 1210 + * We might have stopped compacting due to need_resched() in 1211 + * async compaction, or due to a fatal signal detected. In that 1212 + * case do not try further zones and signal need_resched() 1213 + * contention. 1214 + */ 1215 + if ((zone_contended == COMPACT_CONTENDED_SCHED) 1216 + || fatal_signal_pending(current)) { 1217 + *contended = COMPACT_CONTENDED_SCHED; 1218 + goto break_loop; 1219 + } 1220 + 1221 + continue; 1222 + break_loop: 1223 + /* 1224 + * We might not have tried all the zones, so be conservative 1225 + * and assume they are not all lock contended. 
1226 + */ 1227 + all_zones_contended = 0; 1228 + break; 1312 1229 } 1230 + 1231 + /* 1232 + * If at least one zone wasn't deferred or skipped, we report if all 1233 + * zones that were tried were lock contended. 1234 + */ 1235 + if (rc > COMPACT_SKIPPED && all_zones_contended) 1236 + *contended = COMPACT_CONTENDED_LOCK; 1313 1237 1314 1238 return rc; 1315 1239 }
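The `all_zones_contended &= zone_contended` bookkeeping above is subtle: the accumulator starts at `COMPACT_CONTENDED_LOCK` ("init for &= op") so that a single zone that was not lock contended clears it. A minimal userspace sketch of that aggregation (enum values mirror the patch's `COMPACT_CONTENDED_*` codes, but this is illustrative, not kernel API):

```c
#include <assert.h>

/* Hypothetical stand-ins for the patch's COMPACT_CONTENDED_* codes. */
enum contended {
	CONTENDED_NONE  = 0,
	CONTENDED_LOCK  = 1,
	CONTENDED_SCHED = 2,
};

/*
 * Model of try_to_compact_pages()'s accumulation: start from
 * CONTENDED_LOCK so one uncontended (or sched-contended) zone clears
 * the accumulator; lock contention is only reported when every zone
 * that was tried hit it.
 */
static int aggregate_contention(const int *zone_contended, int nzones)
{
	int all = CONTENDED_LOCK; /* init for &= op */
	int i;

	for (i = 0; i < nzones; i++)
		all &= zone_contended[i];
	return all ? CONTENDED_LOCK : CONTENDED_NONE;
}
```

Note that `CONTENDED_SCHED` (2) also clears the bit-0 accumulator, matching the patch's intent that only uniform lock contention is reported via this path.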
+237
mm/debug.c
··· 1 + /* 2 + * mm/debug.c 3 + * 4 + * mm/ specific debug routines. 5 + * 6 + */ 7 + 8 + #include <linux/kernel.h> 9 + #include <linux/mm.h> 10 + #include <linux/ftrace_event.h> 11 + #include <linux/memcontrol.h> 12 + 13 + static const struct trace_print_flags pageflag_names[] = { 14 + {1UL << PG_locked, "locked" }, 15 + {1UL << PG_error, "error" }, 16 + {1UL << PG_referenced, "referenced" }, 17 + {1UL << PG_uptodate, "uptodate" }, 18 + {1UL << PG_dirty, "dirty" }, 19 + {1UL << PG_lru, "lru" }, 20 + {1UL << PG_active, "active" }, 21 + {1UL << PG_slab, "slab" }, 22 + {1UL << PG_owner_priv_1, "owner_priv_1" }, 23 + {1UL << PG_arch_1, "arch_1" }, 24 + {1UL << PG_reserved, "reserved" }, 25 + {1UL << PG_private, "private" }, 26 + {1UL << PG_private_2, "private_2" }, 27 + {1UL << PG_writeback, "writeback" }, 28 + #ifdef CONFIG_PAGEFLAGS_EXTENDED 29 + {1UL << PG_head, "head" }, 30 + {1UL << PG_tail, "tail" }, 31 + #else 32 + {1UL << PG_compound, "compound" }, 33 + #endif 34 + {1UL << PG_swapcache, "swapcache" }, 35 + {1UL << PG_mappedtodisk, "mappedtodisk" }, 36 + {1UL << PG_reclaim, "reclaim" }, 37 + {1UL << PG_swapbacked, "swapbacked" }, 38 + {1UL << PG_unevictable, "unevictable" }, 39 + #ifdef CONFIG_MMU 40 + {1UL << PG_mlocked, "mlocked" }, 41 + #endif 42 + #ifdef CONFIG_ARCH_USES_PG_UNCACHED 43 + {1UL << PG_uncached, "uncached" }, 44 + #endif 45 + #ifdef CONFIG_MEMORY_FAILURE 46 + {1UL << PG_hwpoison, "hwpoison" }, 47 + #endif 48 + #ifdef CONFIG_TRANSPARENT_HUGEPAGE 49 + {1UL << PG_compound_lock, "compound_lock" }, 50 + #endif 51 + }; 52 + 53 + static void dump_flags(unsigned long flags, 54 + const struct trace_print_flags *names, int count) 55 + { 56 + const char *delim = ""; 57 + unsigned long mask; 58 + int i; 59 + 60 + pr_emerg("flags: %#lx(", flags); 61 + 62 + /* remove zone id */ 63 + flags &= (1UL << NR_PAGEFLAGS) - 1; 64 + 65 + for (i = 0; i < count && flags; i++) { 66 + 67 + mask = names[i].mask; 68 + if ((flags & mask) != mask) 69 + continue; 70 + 71 + 
flags &= ~mask; 72 + pr_cont("%s%s", delim, names[i].name); 73 + delim = "|"; 74 + } 75 + 76 + /* check for left over flags */ 77 + if (flags) 78 + pr_cont("%s%#lx", delim, flags); 79 + 80 + pr_cont(")\n"); 81 + } 82 + 83 + void dump_page_badflags(struct page *page, const char *reason, 84 + unsigned long badflags) 85 + { 86 + pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n", 87 + page, atomic_read(&page->_count), page_mapcount(page), 88 + page->mapping, page->index); 89 + BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS); 90 + dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names)); 91 + if (reason) 92 + pr_alert("page dumped because: %s\n", reason); 93 + if (page->flags & badflags) { 94 + pr_alert("bad because of flags:\n"); 95 + dump_flags(page->flags & badflags, 96 + pageflag_names, ARRAY_SIZE(pageflag_names)); 97 + } 98 + mem_cgroup_print_bad_page(page); 99 + } 100 + 101 + void dump_page(struct page *page, const char *reason) 102 + { 103 + dump_page_badflags(page, reason, 0); 104 + } 105 + EXPORT_SYMBOL(dump_page); 106 + 107 + #ifdef CONFIG_DEBUG_VM 108 + 109 + static const struct trace_print_flags vmaflags_names[] = { 110 + {VM_READ, "read" }, 111 + {VM_WRITE, "write" }, 112 + {VM_EXEC, "exec" }, 113 + {VM_SHARED, "shared" }, 114 + {VM_MAYREAD, "mayread" }, 115 + {VM_MAYWRITE, "maywrite" }, 116 + {VM_MAYEXEC, "mayexec" }, 117 + {VM_MAYSHARE, "mayshare" }, 118 + {VM_GROWSDOWN, "growsdown" }, 119 + {VM_PFNMAP, "pfnmap" }, 120 + {VM_DENYWRITE, "denywrite" }, 121 + {VM_LOCKED, "locked" }, 122 + {VM_IO, "io" }, 123 + {VM_SEQ_READ, "seqread" }, 124 + {VM_RAND_READ, "randread" }, 125 + {VM_DONTCOPY, "dontcopy" }, 126 + {VM_DONTEXPAND, "dontexpand" }, 127 + {VM_ACCOUNT, "account" }, 128 + {VM_NORESERVE, "noreserve" }, 129 + {VM_HUGETLB, "hugetlb" }, 130 + {VM_NONLINEAR, "nonlinear" }, 131 + #if defined(CONFIG_X86) 132 + {VM_PAT, "pat" }, 133 + #elif defined(CONFIG_PPC) 134 + {VM_SAO, "sao" }, 135 + #elif defined(CONFIG_PARISC) || 
defined(CONFIG_METAG) || defined(CONFIG_IA64) 136 + {VM_GROWSUP, "growsup" }, 137 + #elif !defined(CONFIG_MMU) 138 + {VM_MAPPED_COPY, "mappedcopy" }, 139 + #else 140 + {VM_ARCH_1, "arch_1" }, 141 + #endif 142 + {VM_DONTDUMP, "dontdump" }, 143 + #ifdef CONFIG_MEM_SOFT_DIRTY 144 + {VM_SOFTDIRTY, "softdirty" }, 145 + #endif 146 + {VM_MIXEDMAP, "mixedmap" }, 147 + {VM_HUGEPAGE, "hugepage" }, 148 + {VM_NOHUGEPAGE, "nohugepage" }, 149 + {VM_MERGEABLE, "mergeable" }, 150 + }; 151 + 152 + void dump_vma(const struct vm_area_struct *vma) 153 + { 154 + pr_emerg("vma %p start %p end %p\n" 155 + "next %p prev %p mm %p\n" 156 + "prot %lx anon_vma %p vm_ops %p\n" 157 + "pgoff %lx file %p private_data %p\n", 158 + vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_next, 159 + vma->vm_prev, vma->vm_mm, 160 + (unsigned long)pgprot_val(vma->vm_page_prot), 161 + vma->anon_vma, vma->vm_ops, vma->vm_pgoff, 162 + vma->vm_file, vma->vm_private_data); 163 + dump_flags(vma->vm_flags, vmaflags_names, ARRAY_SIZE(vmaflags_names)); 164 + } 165 + EXPORT_SYMBOL(dump_vma); 166 + 167 + void dump_mm(const struct mm_struct *mm) 168 + { 169 + pr_emerg("mm %p mmap %p seqnum %d task_size %lu\n" 170 + #ifdef CONFIG_MMU 171 + "get_unmapped_area %p\n" 172 + #endif 173 + "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" 174 + "pgd %p mm_users %d mm_count %d nr_ptes %lu map_count %d\n" 175 + "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" 176 + "pinned_vm %lx shared_vm %lx exec_vm %lx stack_vm %lx\n" 177 + "start_code %lx end_code %lx start_data %lx end_data %lx\n" 178 + "start_brk %lx brk %lx start_stack %lx\n" 179 + "arg_start %lx arg_end %lx env_start %lx env_end %lx\n" 180 + "binfmt %p flags %lx core_state %p\n" 181 + #ifdef CONFIG_AIO 182 + "ioctx_table %p\n" 183 + #endif 184 + #ifdef CONFIG_MEMCG 185 + "owner %p " 186 + #endif 187 + "exe_file %p\n" 188 + #ifdef CONFIG_MMU_NOTIFIER 189 + "mmu_notifier_mm %p\n" 190 + #endif 191 + #ifdef CONFIG_NUMA_BALANCING 192 + 
"numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n" 193 + #endif 194 + #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) 195 + "tlb_flush_pending %d\n" 196 + #endif 197 + "%s", /* This is here to hold the comma */ 198 + 199 + mm, mm->mmap, mm->vmacache_seqnum, mm->task_size, 200 + #ifdef CONFIG_MMU 201 + mm->get_unmapped_area, 202 + #endif 203 + mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end, 204 + mm->pgd, atomic_read(&mm->mm_users), 205 + atomic_read(&mm->mm_count), 206 + atomic_long_read((atomic_long_t *)&mm->nr_ptes), 207 + mm->map_count, 208 + mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, 209 + mm->pinned_vm, mm->shared_vm, mm->exec_vm, mm->stack_vm, 210 + mm->start_code, mm->end_code, mm->start_data, mm->end_data, 211 + mm->start_brk, mm->brk, mm->start_stack, 212 + mm->arg_start, mm->arg_end, mm->env_start, mm->env_end, 213 + mm->binfmt, mm->flags, mm->core_state, 214 + #ifdef CONFIG_AIO 215 + mm->ioctx_table, 216 + #endif 217 + #ifdef CONFIG_MEMCG 218 + mm->owner, 219 + #endif 220 + mm->exe_file, 221 + #ifdef CONFIG_MMU_NOTIFIER 222 + mm->mmu_notifier_mm, 223 + #endif 224 + #ifdef CONFIG_NUMA_BALANCING 225 + mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq, 226 + #endif 227 + #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION) 228 + mm->tlb_flush_pending, 229 + #endif 230 + "" /* This is here to not have a comma! */ 231 + ); 232 + 233 + dump_flags(mm->def_flags, vmaflags_names, 234 + ARRAY_SIZE(vmaflags_names)); 235 + } 236 + 237 + #endif /* CONFIG_DEBUG_VM */
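The core of the new mm/debug.c is `dump_flags()`: walk a mask/name table, print matched names '|'-separated, clear each matched mask, then print any leftover bits in hex. A self-contained userspace model of that loop (the table contents and the `decode_flags` name are illustrative; the kernel version prints via `pr_emerg`/`pr_cont` instead of a buffer):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
	unsigned long mask;
	const char *name;
};

/*
 * Decode 'flags' into buf as "name|name|0xleftover", mirroring
 * dump_flags(): matched masks are cleared so the leftover check at
 * the end catches bits with no entry in the table. The buffer is
 * assumed large enough for this sketch.
 */
static int decode_flags(unsigned long flags, const struct flag_name *names,
			int count, char *buf, size_t len)
{
	const char *delim = "";
	size_t off = 0;
	int i;

	if (len)
		buf[0] = '\0';
	for (i = 0; i < count && flags; i++) {
		if ((flags & names[i].mask) != names[i].mask)
			continue;
		flags &= ~names[i].mask;
		off += snprintf(buf + off, len - off, "%s%s",
				delim, names[i].name);
		delim = "|";
	}
	if (flags) /* leftover bits with no name in the table */
		off += snprintf(buf + off, len - off, "%s%#lx", delim, flags);
	return (int)off;
}
```

The `flags &= ~mask` step is what makes the trailing `%#lx` meaningful: anything still set after the loop is genuinely unnamed.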
+40 -16
mm/dmapool.c
··· 62 62 }; 63 63 64 64 static DEFINE_MUTEX(pools_lock); 65 + static DEFINE_MUTEX(pools_reg_lock); 65 66 66 67 static ssize_t 67 68 show_pools(struct device *dev, struct device_attribute *attr, char *buf) ··· 133 132 { 134 133 struct dma_pool *retval; 135 134 size_t allocation; 135 + bool empty = false; 136 136 137 - if (align == 0) { 137 + if (align == 0) 138 138 align = 1; 139 - } else if (align & (align - 1)) { 139 + else if (align & (align - 1)) 140 140 return NULL; 141 - } 142 141 143 - if (size == 0) { 142 + if (size == 0) 144 143 return NULL; 145 - } else if (size < 4) { 144 + else if (size < 4) 146 145 size = 4; 147 - } 148 146 149 147 if ((size % align) != 0) 150 148 size = ALIGN(size, align); 151 149 152 150 allocation = max_t(size_t, size, PAGE_SIZE); 153 151 154 - if (!boundary) { 152 + if (!boundary) 155 153 boundary = allocation; 156 - } else if ((boundary < size) || (boundary & (boundary - 1))) { 154 + else if ((boundary < size) || (boundary & (boundary - 1))) 157 155 return NULL; 158 - } 159 156 160 157 retval = kmalloc_node(sizeof(*retval), GFP_KERNEL, dev_to_node(dev)); 161 158 if (!retval) ··· 171 172 172 173 INIT_LIST_HEAD(&retval->pools); 173 174 175 + /* 176 + * pools_lock ensures that the ->dma_pools list does not get corrupted. 177 + * pools_reg_lock ensures that there is not a race between 178 + * dma_pool_create() and dma_pool_destroy() or within dma_pool_create() 179 + * when the first invocation of dma_pool_create() failed on 180 + * device_create_file() and the second assumes that it has been done (I 181 + * know it is a short window). 
182 + */ 183 + mutex_lock(&pools_reg_lock); 174 184 mutex_lock(&pools_lock); 175 - if (list_empty(&dev->dma_pools) && 176 - device_create_file(dev, &dev_attr_pools)) { 177 - kfree(retval); 178 - retval = NULL; 179 - } else 180 - list_add(&retval->pools, &dev->dma_pools); 185 + if (list_empty(&dev->dma_pools)) 186 + empty = true; 187 + list_add(&retval->pools, &dev->dma_pools); 181 188 mutex_unlock(&pools_lock); 189 + if (empty) { 190 + int err; 182 191 192 + err = device_create_file(dev, &dev_attr_pools); 193 + if (err) { 194 + mutex_lock(&pools_lock); 195 + list_del(&retval->pools); 196 + mutex_unlock(&pools_lock); 197 + mutex_unlock(&pools_reg_lock); 198 + kfree(retval); 199 + return NULL; 200 + } 201 + } 202 + mutex_unlock(&pools_reg_lock); 183 203 return retval; 184 204 } 185 205 EXPORT_SYMBOL(dma_pool_create); ··· 269 251 */ 270 252 void dma_pool_destroy(struct dma_pool *pool) 271 253 { 254 + bool empty = false; 255 + 256 + mutex_lock(&pools_reg_lock); 272 257 mutex_lock(&pools_lock); 273 258 list_del(&pool->pools); 274 259 if (pool->dev && list_empty(&pool->dev->dma_pools)) 275 - device_remove_file(pool->dev, &dev_attr_pools); 260 + empty = true; 276 261 mutex_unlock(&pools_lock); 262 + if (empty) 263 + device_remove_file(pool->dev, &dev_attr_pools); 264 + mutex_unlock(&pools_reg_lock); 277 265 278 266 while (!list_empty(&pool->page_list)) { 279 267 struct dma_page *page;
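The parameter validation that the dmapool hunk restyles (dropping the braces) follows a common pattern: `align` and `boundary` must be powers of two, checked with `x & (x - 1)`, and `size` is rounded up to the alignment. A standalone sketch of those checks, under the simplifying assumption that `boundary == 0` means "no boundary" (the kernel instead substitutes the allocation size); function names here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Power-of-two test, same trick as the kernel's align & (align - 1). */
static bool is_pow2(size_t x)
{
	return x && (x & (x - 1)) == 0;
}

/* Round size up to a power-of-two alignment, like the kernel's ALIGN(). */
static size_t round_up_to(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/*
 * Model of dma_pool_create()'s argument checks: returns the effective
 * block size, or 0 for invalid parameters.
 */
static size_t pool_block_size(size_t size, size_t align, size_t boundary)
{
	if (align == 0)
		align = 1;
	else if (!is_pow2(align))
		return 0;

	if (size == 0)
		return 0;
	else if (size < 4)
		size = 4;

	if (size % align != 0)
		size = round_up_to(size, align);

	if (boundary && (boundary < size || !is_pow2(boundary)))
		return 0;
	return size;
}
```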
+2 -2
mm/filemap.c
··· 1753 1753 static int page_cache_read(struct file *file, pgoff_t offset) 1754 1754 { 1755 1755 struct address_space *mapping = file->f_mapping; 1756 - struct page *page; 1756 + struct page *page; 1757 1757 int ret; 1758 1758 1759 1759 do { ··· 1770 1770 page_cache_release(page); 1771 1771 1772 1772 } while (ret == AOP_TRUNCATED_PAGE); 1773 - 1773 + 1774 1774 return ret; 1775 1775 } 1776 1776
+354
mm/gup.c
··· 10 10 #include <linux/swap.h> 11 11 #include <linux/swapops.h> 12 12 13 + #include <linux/sched.h> 14 + #include <linux/rwsem.h> 15 + #include <asm/pgtable.h> 16 + 13 17 #include "internal.h" 14 18 15 19 static struct page *no_page_table(struct vm_area_struct *vma, ··· 680 676 return page; 681 677 } 682 678 #endif /* CONFIG_ELF_CORE */ 679 + 680 + /* 681 + * Generic RCU Fast GUP 682 + * 683 + * get_user_pages_fast attempts to pin user pages by walking the page 684 + * tables directly and avoids taking locks. Thus the walker needs to be 685 + * protected from page table pages being freed from under it, and should 686 + * block any THP splits. 687 + * 688 + * One way to achieve this is to have the walker disable interrupts, and 689 + * rely on IPIs from the TLB flushing code blocking before the page table 690 + * pages are freed. This is unsuitable for architectures that do not need 691 + * to broadcast an IPI when invalidating TLBs. 692 + * 693 + * Another way to achieve this is to batch up page table containing pages 694 + * belonging to more than one mm_user, then rcu_sched a callback to free those 695 + * pages. Disabling interrupts will allow the fast_gup walker to both block 696 + * the rcu_sched callback, and an IPI that we broadcast for splitting THPs 697 + * (which is a relatively rare event). The code below adopts this strategy. 698 + * 699 + * Before activating this code, please be aware that the following assumptions 700 + * are currently made: 701 + * 702 + * *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free 703 + * pages containing page tables. 704 + * 705 + * *) THP splits will broadcast an IPI, this can be achieved by overriding 706 + * pmdp_splitting_flush. 707 + * 708 + * *) ptes can be read atomically by the architecture. 709 + * 710 + * *) access_ok is sufficient to validate userspace address ranges. 711 + * 712 + * The last two assumptions can be relaxed by the addition of helper functions. 
713 + * 714 + * This code is based heavily on the PowerPC implementation by Nick Piggin. 715 + */ 716 + #ifdef CONFIG_HAVE_GENERIC_RCU_GUP 717 + 718 + #ifdef __HAVE_ARCH_PTE_SPECIAL 719 + static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, 720 + int write, struct page **pages, int *nr) 721 + { 722 + pte_t *ptep, *ptem; 723 + int ret = 0; 724 + 725 + ptem = ptep = pte_offset_map(&pmd, addr); 726 + do { 727 + /* 728 + * In the line below we are assuming that the pte can be read 729 + * atomically. If this is not the case for your architecture, 730 + * please wrap this in a helper function! 731 + * 732 + * for an example see gup_get_pte in arch/x86/mm/gup.c 733 + */ 734 + pte_t pte = ACCESS_ONCE(*ptep); 735 + struct page *page; 736 + 737 + /* 738 + * Similar to the PMD case below, NUMA hinting must take slow 739 + * path 740 + */ 741 + if (!pte_present(pte) || pte_special(pte) || 742 + pte_numa(pte) || (write && !pte_write(pte))) 743 + goto pte_unmap; 744 + 745 + VM_BUG_ON(!pfn_valid(pte_pfn(pte))); 746 + page = pte_page(pte); 747 + 748 + if (!page_cache_get_speculative(page)) 749 + goto pte_unmap; 750 + 751 + if (unlikely(pte_val(pte) != pte_val(*ptep))) { 752 + put_page(page); 753 + goto pte_unmap; 754 + } 755 + 756 + pages[*nr] = page; 757 + (*nr)++; 758 + 759 + } while (ptep++, addr += PAGE_SIZE, addr != end); 760 + 761 + ret = 1; 762 + 763 + pte_unmap: 764 + pte_unmap(ptem); 765 + return ret; 766 + } 767 + #else 768 + 769 + /* 770 + * If we can't determine whether or not a pte is special, then fail immediately 771 + * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not 772 + * to be special. 773 + * 774 + * For a futex to be placed on a THP tail page, get_futex_key requires a 775 + * __get_user_pages_fast implementation that can pin pages. Thus it's still 776 + * useful to have gup_huge_pmd even if we can't operate on ptes. 
777 + */ 778 + static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, 779 + int write, struct page **pages, int *nr) 780 + { 781 + return 0; 782 + } 783 + #endif /* __HAVE_ARCH_PTE_SPECIAL */ 784 + 785 + static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, 786 + unsigned long end, int write, struct page **pages, int *nr) 787 + { 788 + struct page *head, *page, *tail; 789 + int refs; 790 + 791 + if (write && !pmd_write(orig)) 792 + return 0; 793 + 794 + refs = 0; 795 + head = pmd_page(orig); 796 + page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); 797 + tail = page; 798 + do { 799 + VM_BUG_ON_PAGE(compound_head(page) != head, page); 800 + pages[*nr] = page; 801 + (*nr)++; 802 + page++; 803 + refs++; 804 + } while (addr += PAGE_SIZE, addr != end); 805 + 806 + if (!page_cache_add_speculative(head, refs)) { 807 + *nr -= refs; 808 + return 0; 809 + } 810 + 811 + if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { 812 + *nr -= refs; 813 + while (refs--) 814 + put_page(head); 815 + return 0; 816 + } 817 + 818 + /* 819 + * Any tail pages need their mapcount reference taken before we 820 + * return. (This allows the THP code to bump their ref count when 821 + * they are split into base pages). 
822 + */ 823 + while (refs--) { 824 + if (PageTail(tail)) 825 + get_huge_page_tail(tail); 826 + tail++; 827 + } 828 + 829 + return 1; 830 + } 831 + 832 + static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, 833 + unsigned long end, int write, struct page **pages, int *nr) 834 + { 835 + struct page *head, *page, *tail; 836 + int refs; 837 + 838 + if (write && !pud_write(orig)) 839 + return 0; 840 + 841 + refs = 0; 842 + head = pud_page(orig); 843 + page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT); 844 + tail = page; 845 + do { 846 + VM_BUG_ON_PAGE(compound_head(page) != head, page); 847 + pages[*nr] = page; 848 + (*nr)++; 849 + page++; 850 + refs++; 851 + } while (addr += PAGE_SIZE, addr != end); 852 + 853 + if (!page_cache_add_speculative(head, refs)) { 854 + *nr -= refs; 855 + return 0; 856 + } 857 + 858 + if (unlikely(pud_val(orig) != pud_val(*pudp))) { 859 + *nr -= refs; 860 + while (refs--) 861 + put_page(head); 862 + return 0; 863 + } 864 + 865 + while (refs--) { 866 + if (PageTail(tail)) 867 + get_huge_page_tail(tail); 868 + tail++; 869 + } 870 + 871 + return 1; 872 + } 873 + 874 + static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, 875 + int write, struct page **pages, int *nr) 876 + { 877 + unsigned long next; 878 + pmd_t *pmdp; 879 + 880 + pmdp = pmd_offset(&pud, addr); 881 + do { 882 + pmd_t pmd = ACCESS_ONCE(*pmdp); 883 + 884 + next = pmd_addr_end(addr, end); 885 + if (pmd_none(pmd) || pmd_trans_splitting(pmd)) 886 + return 0; 887 + 888 + if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) { 889 + /* 890 + * NUMA hinting faults need to be handled in the GUP 891 + * slowpath for accounting purposes and so that they 892 + * can be serialised against THP migration. 
893 + */ 894 + if (pmd_numa(pmd)) 895 + return 0; 896 + 897 + if (!gup_huge_pmd(pmd, pmdp, addr, next, write, 898 + pages, nr)) 899 + return 0; 900 + 901 + } else if (!gup_pte_range(pmd, addr, next, write, pages, nr)) 902 + return 0; 903 + } while (pmdp++, addr = next, addr != end); 904 + 905 + return 1; 906 + } 907 + 908 + static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end, 909 + int write, struct page **pages, int *nr) 910 + { 911 + unsigned long next; 912 + pud_t *pudp; 913 + 914 + pudp = pud_offset(pgdp, addr); 915 + do { 916 + pud_t pud = ACCESS_ONCE(*pudp); 917 + 918 + next = pud_addr_end(addr, end); 919 + if (pud_none(pud)) 920 + return 0; 921 + if (pud_huge(pud)) { 922 + if (!gup_huge_pud(pud, pudp, addr, next, write, 923 + pages, nr)) 924 + return 0; 925 + } else if (!gup_pmd_range(pud, addr, next, write, pages, nr)) 926 + return 0; 927 + } while (pudp++, addr = next, addr != end); 928 + 929 + return 1; 930 + } 931 + 932 + /* 933 + * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to 934 + * the regular GUP. It will only return non-negative values. 935 + */ 936 + int __get_user_pages_fast(unsigned long start, int nr_pages, int write, 937 + struct page **pages) 938 + { 939 + struct mm_struct *mm = current->mm; 940 + unsigned long addr, len, end; 941 + unsigned long next, flags; 942 + pgd_t *pgdp; 943 + int nr = 0; 944 + 945 + start &= PAGE_MASK; 946 + addr = start; 947 + len = (unsigned long) nr_pages << PAGE_SHIFT; 948 + end = start + len; 949 + 950 + if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ, 951 + start, len))) 952 + return 0; 953 + 954 + /* 955 + * Disable interrupts. We use the nested form as we can already have 956 + * interrupts disabled by get_futex_key. 957 + * 958 + * With interrupts disabled, we block page table pages from being 959 + * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h 960 + * for more details. 961 + * 962 + * We do not adopt an rcu_read_lock(.) 
here as we also want to 963 + * block IPIs that come from THPs splitting. 964 + */ 965 + 966 + local_irq_save(flags); 967 + pgdp = pgd_offset(mm, addr); 968 + do { 969 + next = pgd_addr_end(addr, end); 970 + if (pgd_none(*pgdp)) 971 + break; 972 + else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr)) 973 + break; 974 + } while (pgdp++, addr = next, addr != end); 975 + local_irq_restore(flags); 976 + 977 + return nr; 978 + } 979 + 980 + /** 981 + * get_user_pages_fast() - pin user pages in memory 982 + * @start: starting user address 983 + * @nr_pages: number of pages from start to pin 984 + * @write: whether pages will be written to 985 + * @pages: array that receives pointers to the pages pinned. 986 + * Should be at least nr_pages long. 987 + * 988 + * Attempt to pin user pages in memory without taking mm->mmap_sem. 989 + * If not successful, it will fall back to taking the lock and 990 + * calling get_user_pages(). 991 + * 992 + * Returns number of pages pinned. This may be fewer than the number 993 + * requested. If nr_pages is 0 or negative, returns 0. If no pages 994 + * were pinned, returns -errno. 
995 + */ 996 + int get_user_pages_fast(unsigned long start, int nr_pages, int write, 997 + struct page **pages) 998 + { 999 + struct mm_struct *mm = current->mm; 1000 + int nr, ret; 1001 + 1002 + start &= PAGE_MASK; 1003 + nr = __get_user_pages_fast(start, nr_pages, write, pages); 1004 + ret = nr; 1005 + 1006 + if (nr < nr_pages) { 1007 + /* Try to get the remaining pages with get_user_pages */ 1008 + start += nr << PAGE_SHIFT; 1009 + pages += nr; 1010 + 1011 + down_read(&mm->mmap_sem); 1012 + ret = get_user_pages(current, mm, start, 1013 + nr_pages - nr, write, 0, pages, NULL); 1014 + up_read(&mm->mmap_sem); 1015 + 1016 + /* Have to be a bit careful with return values */ 1017 + if (nr > 0) { 1018 + if (ret < 0) 1019 + ret = nr; 1020 + else 1021 + ret += nr; 1022 + } 1023 + } 1024 + 1025 + return ret; 1026 + } 1027 + 1028 + #endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
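The "have to be a bit careful with return values" comment in `get_user_pages_fast()` is worth unpacking: the fast path pins `nr` pages locklessly, and the slow path for the remainder returns either a count or `-errno`; pages already pinned must never be lost behind an error code. A standalone sketch of just that combination step (the function name is illustrative, not kernel API):

```c
#include <assert.h>

/*
 * Combine fast-path and slow-path GUP results: 'nr' pages were pinned
 * by the lockless walk, 'slow' is the slow path's result for the rest
 * (pinned count, or negative errno). If anything was pinned fast,
 * report a positive total even when the slow path failed.
 */
static int combine_gup_results(int nr, int slow)
{
	int ret = slow;

	if (nr > 0) {
		if (slow < 0)
			ret = nr;	/* keep what the fast path pinned */
		else
			ret = nr + slow;
	}
	return ret;
}
```

So a caller only ever sees `-errno` when zero pages were pinned, matching the documented contract "If no pages were pinned, returns -errno".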
+12 -18
mm/huge_memory.c
··· 1096 1096 unsigned long mmun_end; /* For mmu_notifiers */ 1097 1097 1098 1098 ptl = pmd_lockptr(mm, pmd); 1099 - VM_BUG_ON(!vma->anon_vma); 1099 + VM_BUG_ON_VMA(!vma->anon_vma, vma); 1100 1100 haddr = address & HPAGE_PMD_MASK; 1101 1101 if (is_huge_zero_pmd(orig_pmd)) 1102 1102 goto alloc; ··· 2048 2048 return -ENOMEM; 2049 2049 2050 2050 /* __khugepaged_exit() must not run from under us */ 2051 - VM_BUG_ON(khugepaged_test_exit(mm)); 2051 + VM_BUG_ON_MM(khugepaged_test_exit(mm), mm); 2052 2052 if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) { 2053 2053 free_mm_slot(mm_slot); 2054 2054 return 0; ··· 2083 2083 if (vma->vm_ops) 2084 2084 /* khugepaged not yet working on file or special mappings */ 2085 2085 return 0; 2086 - VM_BUG_ON(vma->vm_flags & VM_NO_THP); 2086 + VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma); 2087 2087 hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; 2088 2088 hend = vma->vm_end & HPAGE_PMD_MASK; 2089 2089 if (hstart < hend) ··· 2322 2322 int node) 2323 2323 { 2324 2324 VM_BUG_ON_PAGE(*hpage, *hpage); 2325 + 2325 2326 /* 2326 - * Allocate the page while the vma is still valid and under 2327 - * the mmap_sem read mode so there is no memory allocation 2328 - * later when we take the mmap_sem in write mode. This is more 2329 - * friendly behavior (OTOH it may actually hide bugs) to 2330 - * filesystems in userland with daemons allocating memory in 2331 - * the userland I/O paths. Allocating memory with the 2332 - * mmap_sem in read mode is good idea also to allow greater 2333 - * scalability. 2334 - */ 2335 - *hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask( 2336 - khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER); 2337 - /* 2338 - * After allocating the hugepage, release the mmap_sem read lock in 2339 - * preparation for taking it in write mode. 2327 + * Before allocating the hugepage, release the mmap_sem read lock. 
2328 + * The allocation can take potentially a long time if it involves 2329 + * sync compaction, and we do not need to hold the mmap_sem during 2330 + * that. We will recheck the vma after taking it again in write mode. 2340 2331 */ 2341 2332 up_read(&mm->mmap_sem); 2333 + 2334 + *hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask( 2335 + khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER); 2342 2336 if (unlikely(!*hpage)) { 2343 2337 count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 2344 2338 *hpage = ERR_PTR(-ENOMEM); ··· 2406 2412 return false; 2407 2413 if (is_vma_temporary_stack(vma)) 2408 2414 return false; 2409 - VM_BUG_ON(vma->vm_flags & VM_NO_THP); 2415 + VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma); 2410 2416 return true; 2411 2417 } 2412 2418
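The khugepaged hunk above reorders `up_read()` before the hugepage allocation, so `mmap_sem` is not held across a potentially long sync compaction. The ordering can be sketched with a simulated trace (lock and allocator are stand-ins; only the sequencing is the point):

```c
#include <assert.h>
#include <string.h>

/* Append an event tag to a trace buffer (buffer assumed large enough). */
static void trace(char *log, const char *tag)
{
	strcat(log, tag);
}

/*
 * Sketch of the patched ordering in khugepaged_alloc_page(): release
 * the read lock first, run the slow allocation unlocked, then retake
 * the lock in write mode (after which the vma must be revalidated).
 */
static void collapse_prepare(char *log)
{
	trace(log, "up_read;");	  /* drop mmap_sem before the slow part */
	trace(log, "alloc;");	  /* sync compaction may run here, unlocked */
	trace(log, "down_write;"); /* retake; caller rechecks the vma */
}
```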
+7 -7
mm/hugetlb.c
··· 434 434 435 435 static struct resv_map *vma_resv_map(struct vm_area_struct *vma) 436 436 { 437 - VM_BUG_ON(!is_vm_hugetlb_page(vma)); 437 + VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma); 438 438 if (vma->vm_flags & VM_MAYSHARE) { 439 439 struct address_space *mapping = vma->vm_file->f_mapping; 440 440 struct inode *inode = mapping->host; ··· 449 449 450 450 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map) 451 451 { 452 - VM_BUG_ON(!is_vm_hugetlb_page(vma)); 453 - VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); 452 + VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma); 453 + VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma); 454 454 455 455 set_vma_private_data(vma, (get_vma_private_data(vma) & 456 456 HPAGE_RESV_MASK) | (unsigned long)map); ··· 458 458 459 459 static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags) 460 460 { 461 - VM_BUG_ON(!is_vm_hugetlb_page(vma)); 462 - VM_BUG_ON(vma->vm_flags & VM_MAYSHARE); 461 + VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma); 462 + VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma); 463 463 464 464 set_vma_private_data(vma, get_vma_private_data(vma) | flags); 465 465 } 466 466 467 467 static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag) 468 468 { 469 - VM_BUG_ON(!is_vm_hugetlb_page(vma)); 469 + VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma); 470 470 471 471 return (get_vma_private_data(vma) & flag) != 0; 472 472 } ··· 474 474 /* Reset counters to 0 and clear all HPAGE_RESV_* flags */ 475 475 void reset_vma_resv_huge_pages(struct vm_area_struct *vma) 476 476 { 477 - VM_BUG_ON(!is_vm_hugetlb_page(vma)); 477 + VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma); 478 478 if (!(vma->vm_flags & VM_MAYSHARE)) 479 479 vma->vm_private_data = (void *)0; 480 480 }
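The `VM_BUG_ON` → `VM_BUG_ON_VMA`/`VM_BUG_ON_MM` conversions in huge_memory.c and hugetlb.c all serve one purpose: dump the offending object before dying, so the crash log carries context. A userspace model of the dump-before-fail idea (struct and function names are illustrative; the kernel macro calls `dump_vma()`/`dump_mm()` and then `BUG()`):

```c
#include <assert.h>
#include <stdio.h>

struct vma_like {
	unsigned long start, end, flags;
};

/*
 * Check an invariant; on failure, print the object's state first so a
 * log reader sees *which* vma was bad, then report the failure. Here
 * we return -1 where the kernel would BUG().
 */
static int vma_check(const struct vma_like *vma)
{
	if (vma->start >= vma->end) {
		fprintf(stderr, "bad vma %#lx-%#lx flags %#lx\n",
			vma->start, vma->end, vma->flags);
		return -1;
	}
	return 0;
}
```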
+20 -6
mm/internal.h
··· 142 142 bool finished_update_migrate; 143 143 144 144 int order; /* order a direct compactor needs */ 145 - int migratetype; /* MOVABLE, RECLAIMABLE etc */ 145 + const gfp_t gfp_mask; /* gfp mask of a direct compactor */ 146 146 struct zone *zone; 147 - bool contended; /* True if a lock was contended, or 148 - * need_resched() true during async 147 + int contended; /* Signal need_sched() or lock 148 + * contention detected during 149 149 * compaction 150 150 */ 151 151 }; ··· 154 154 isolate_freepages_range(struct compact_control *cc, 155 155 unsigned long start_pfn, unsigned long end_pfn); 156 156 unsigned long 157 - isolate_migratepages_range(struct zone *zone, struct compact_control *cc, 158 - unsigned long low_pfn, unsigned long end_pfn, bool unevictable); 157 + isolate_migratepages_range(struct compact_control *cc, 158 + unsigned long low_pfn, unsigned long end_pfn); 159 159 160 160 #endif 161 161 ··· 164 164 * general, page_zone(page)->lock must be held by the caller to prevent the 165 165 * page from being allocated in parallel and returning garbage as the order. 166 166 * If a caller does not hold page_zone(page)->lock, it must guarantee that the 167 - * page cannot be allocated or merged in parallel. 167 + * page cannot be allocated or merged in parallel. Alternatively, it must 168 + * handle invalid values gracefully, and use page_order_unsafe() below. 168 169 */ 169 170 static inline unsigned long page_order(struct page *page) 170 171 { 171 172 /* PageBuddy() must be checked by the caller */ 172 173 return page_private(page); 173 174 } 175 + 176 + /* 177 + * Like page_order(), but for callers who cannot afford to hold the zone lock. 178 + * PageBuddy() should be checked first by the caller to minimize race window, 179 + * and invalid values must be handled gracefully. 180 + * 181 + * ACCESS_ONCE is used so that if the caller assigns the result into a local 182 + * variable and e.g. 
tests it for valid range before using, the compiler cannot 183 + * decide to remove the variable and inline the page_private(page) multiple 184 + * times, potentially observing different values in the tests and the actual 185 + * use of the result. 186 + */ 187 + #define page_order_unsafe(page) ACCESS_ONCE(page_private(page)) 174 188 175 189 static inline bool is_cow_mapping(vm_flags_t flags) 176 190 {
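The `page_order_unsafe()` comment above explains why `ACCESS_ONCE` matters: without it, the compiler may inline `page_private(page)` at each use and observe different values in the validity test and the actual use. A userspace sketch of the single-load pattern, assuming an illustrative `READ_ONCE_UL` macro (not the kernel's):

```c
#include <assert.h>

/* Force exactly one load of x: the volatile cast stops the compiler
 * from re-reading the location at each use site. */
#define READ_ONCE_UL(x) (*(volatile unsigned long *)&(x))

#define MAX_ORDER 11

unsigned long shared_order;	/* imagine another thread updates this */

/* Copy once, validate the copy, use the same copy: the checked value
 * and the used value are guaranteed to be identical. */
int order_is_sane(void)
{
	unsigned long order = READ_ONCE_UL(shared_order);

	return order < MAX_ORDER;
}
```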
+1 -1
mm/interval_tree.c
··· 34 34 struct vm_area_struct *parent; 35 35 unsigned long last = vma_last_pgoff(node); 36 36 37 - VM_BUG_ON(vma_start_pgoff(node) != vma_start_pgoff(prev)); 37 + VM_BUG_ON_VMA(vma_start_pgoff(node) != vma_start_pgoff(prev), node); 38 38 39 39 if (!prev->shared.linear.rb.rb_right) { 40 40 parent = prev;
+1
mm/kmemcheck.c
··· 2 2 #include <linux/mm_types.h> 3 3 #include <linux/mm.h> 4 4 #include <linux/slab.h> 5 + #include "slab.h" 5 6 #include <linux/kmemcheck.h> 6 7 7 8 void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
+2 -2
mm/ksm.c
··· 2310 2310 2311 2311 ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd"); 2312 2312 if (IS_ERR(ksm_thread)) { 2313 - printk(KERN_ERR "ksm: creating kthread failed\n"); 2313 + pr_err("ksm: creating kthread failed\n"); 2314 2314 err = PTR_ERR(ksm_thread); 2315 2315 goto out_free; 2316 2316 } ··· 2318 2318 #ifdef CONFIG_SYSFS 2319 2319 err = sysfs_create_group(mm_kobj, &ksm_attr_group); 2320 2320 if (err) { 2321 - printk(KERN_ERR "ksm: register sysfs failed\n"); 2321 + pr_err("ksm: register sysfs failed\n"); 2322 2322 kthread_stop(ksm_thread); 2323 2323 goto out_free; 2324 2324 }
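The ksm.c hunks above replace `printk(KERN_ERR ...)` with the terser `pr_err(...)`. The general shape is a macro that stamps a severity prefix so call sites stay short; a userspace sketch, where the `log_err` name, the buffer, and the `<err>` prefix are all illustrative (the kernel encodes log levels differently):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

char log_buf[128];

/* Hypothetical pr_err()-style wrapper: prepend a severity marker via
 * string-literal concatenation, keeping the call site terse. */
#define log_err(fmt, ...) \
	snprintf(log_buf, sizeof(log_buf), "<err> " fmt, ##__VA_ARGS__)
```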
+73 -209
mm/memcontrol.c
··· 318 318 /* OOM-Killer disable */ 319 319 int oom_kill_disable; 320 320 321 - /* set when res.limit == memsw.limit */ 322 - bool memsw_is_minimum; 323 - 324 321 /* protect arrays of thresholds */ 325 322 struct mutex thresholds_lock; 326 323 ··· 481 484 #define OOM_CONTROL (0) 482 485 483 486 /* 484 - * Reclaim flags for mem_cgroup_hierarchical_reclaim 485 - */ 486 - #define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0 487 - #define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT) 488 - #define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1 489 - #define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT) 490 - 491 - /* 492 487 * The memcg_create_mutex will be held whenever a new cgroup is created. 493 488 * As a consequence, any change that needs to protect against new child cgroups 494 489 * appearing has to hold it as well. ··· 638 649 struct static_key memcg_kmem_enabled_key; 639 650 EXPORT_SYMBOL(memcg_kmem_enabled_key); 640 651 652 + static void memcg_free_cache_id(int id); 653 + 641 654 static void disarm_kmem_keys(struct mem_cgroup *memcg) 642 655 { 643 656 if (memcg_kmem_is_active(memcg)) { 644 657 static_key_slow_dec(&memcg_kmem_enabled_key); 645 - ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id); 658 + memcg_free_cache_id(memcg->kmemcg_id); 646 659 } 647 660 /* 648 661 * This check can't live in kmem destruction function, ··· 1797 1806 NULL, "Memory cgroup out of memory"); 1798 1807 } 1799 1808 1800 - static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, 1801 - gfp_t gfp_mask, 1802 - unsigned long flags) 1803 - { 1804 - unsigned long total = 0; 1805 - bool noswap = false; 1806 - int loop; 1807 - 1808 - if (flags & MEM_CGROUP_RECLAIM_NOSWAP) 1809 - noswap = true; 1810 - if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum) 1811 - noswap = true; 1812 - 1813 - for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) { 1814 - if (loop) 1815 - drain_all_stock_async(memcg); 1816 - total += 
try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap); 1817 - /* 1818 - * Allow limit shrinkers, which are triggered directly 1819 - * by userspace, to catch signals and stop reclaim 1820 - * after minimal progress, regardless of the margin. 1821 - */ 1822 - if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK)) 1823 - break; 1824 - if (mem_cgroup_margin(memcg)) 1825 - break; 1826 - /* 1827 - * If nothing was reclaimed after two attempts, there 1828 - * may be no reclaimable pages in this hierarchy. 1829 - */ 1830 - if (loop && !total) 1831 - break; 1832 - } 1833 - return total; 1834 - } 1835 - 1836 1809 /** 1837 1810 * test_mem_cgroup_node_reclaimable 1838 1811 * @memcg: the target memcg ··· 2499 2544 struct mem_cgroup *mem_over_limit; 2500 2545 struct res_counter *fail_res; 2501 2546 unsigned long nr_reclaimed; 2502 - unsigned long flags = 0; 2503 2547 unsigned long long size; 2548 + bool may_swap = true; 2549 + bool drained = false; 2504 2550 int ret = 0; 2505 2551 2506 2552 if (mem_cgroup_is_root(memcg)) ··· 2511 2555 goto done; 2512 2556 2513 2557 size = batch * PAGE_SIZE; 2514 - if (!res_counter_charge(&memcg->res, size, &fail_res)) { 2515 - if (!do_swap_account) 2558 + if (!do_swap_account || 2559 + !res_counter_charge(&memcg->memsw, size, &fail_res)) { 2560 + if (!res_counter_charge(&memcg->res, size, &fail_res)) 2516 2561 goto done_restock; 2517 - if (!res_counter_charge(&memcg->memsw, size, &fail_res)) 2518 - goto done_restock; 2519 - res_counter_uncharge(&memcg->res, size); 2520 - mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); 2521 - flags |= MEM_CGROUP_RECLAIM_NOSWAP; 2522 - } else 2562 + if (do_swap_account) 2563 + res_counter_uncharge(&memcg->memsw, size); 2523 2564 mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); 2565 + } else { 2566 + mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); 2567 + may_swap = false; 2568 + } 2524 2569 2525 2570 if (batch > nr_pages) { 2526 2571 batch = nr_pages; ··· 2545 2588 if 
(!(gfp_mask & __GFP_WAIT)) 2546 2589 goto nomem; 2547 2590 2548 - nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags); 2591 + nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, 2592 + gfp_mask, may_swap); 2549 2593 2550 2594 if (mem_cgroup_margin(mem_over_limit) >= nr_pages) 2551 2595 goto retry; 2596 + 2597 + if (!drained) { 2598 + drain_all_stock_async(mem_over_limit); 2599 + drained = true; 2600 + goto retry; 2601 + } 2552 2602 2553 2603 if (gfp_mask & __GFP_NORETRY) 2554 2604 goto nomem; ··· 2762 2798 2763 2799 static DEFINE_MUTEX(activate_kmem_mutex); 2764 2800 2765 - static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg) 2766 - { 2767 - return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) && 2768 - memcg_kmem_is_active(memcg); 2769 - } 2770 - 2771 2801 /* 2772 2802 * This is a bit cumbersome, but it is rarely used and avoids a backpointer 2773 2803 * in the memcg_cache_params struct. ··· 2781 2823 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); 2782 2824 struct memcg_cache_params *params; 2783 2825 2784 - if (!memcg_can_account_kmem(memcg)) 2826 + if (!memcg_kmem_is_active(memcg)) 2785 2827 return -EIO; 2786 2828 2787 2829 print_slabinfo_header(m); ··· 2864 2906 return memcg ? memcg->kmemcg_id : -1; 2865 2907 } 2866 2908 2867 - static size_t memcg_caches_array_size(int num_groups) 2909 + static int memcg_alloc_cache_id(void) 2868 2910 { 2869 - ssize_t size; 2870 - if (num_groups <= 0) 2871 - return 0; 2911 + int id, size; 2912 + int err; 2872 2913 2873 - size = 2 * num_groups; 2914 + id = ida_simple_get(&kmem_limited_groups, 2915 + 0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL); 2916 + if (id < 0) 2917 + return id; 2918 + 2919 + if (id < memcg_limited_groups_array_size) 2920 + return id; 2921 + 2922 + /* 2923 + * There's no space for the new id in memcg_caches arrays, 2924 + * so we have to grow them. 
2925 + */ 2926 + 2927 + size = 2 * (id + 1); 2874 2928 if (size < MEMCG_CACHES_MIN_SIZE) 2875 2929 size = MEMCG_CACHES_MIN_SIZE; 2876 2930 else if (size > MEMCG_CACHES_MAX_SIZE) 2877 2931 size = MEMCG_CACHES_MAX_SIZE; 2878 2932 2879 - return size; 2933 + mutex_lock(&memcg_slab_mutex); 2934 + err = memcg_update_all_caches(size); 2935 + mutex_unlock(&memcg_slab_mutex); 2936 + 2937 + if (err) { 2938 + ida_simple_remove(&kmem_limited_groups, id); 2939 + return err; 2940 + } 2941 + return id; 2942 + } 2943 + 2944 + static void memcg_free_cache_id(int id) 2945 + { 2946 + ida_simple_remove(&kmem_limited_groups, id); 2880 2947 } 2881 2948 2882 2949 /* ··· 2911 2928 */ 2912 2929 void memcg_update_array_size(int num) 2913 2930 { 2914 - if (num > memcg_limited_groups_array_size) 2915 - memcg_limited_groups_array_size = memcg_caches_array_size(num); 2916 - } 2917 - 2918 - int memcg_update_cache_size(struct kmem_cache *s, int num_groups) 2919 - { 2920 - struct memcg_cache_params *cur_params = s->memcg_params; 2921 - 2922 - VM_BUG_ON(!is_root_cache(s)); 2923 - 2924 - if (num_groups > memcg_limited_groups_array_size) { 2925 - int i; 2926 - struct memcg_cache_params *new_params; 2927 - ssize_t size = memcg_caches_array_size(num_groups); 2928 - 2929 - size *= sizeof(void *); 2930 - size += offsetof(struct memcg_cache_params, memcg_caches); 2931 - 2932 - new_params = kzalloc(size, GFP_KERNEL); 2933 - if (!new_params) 2934 - return -ENOMEM; 2935 - 2936 - new_params->is_root_cache = true; 2937 - 2938 - /* 2939 - * There is the chance it will be bigger than 2940 - * memcg_limited_groups_array_size, if we failed an allocation 2941 - * in a cache, in which case all caches updated before it, will 2942 - * have a bigger array. 
2943 - * 2944 - * But if that is the case, the data after 2945 - * memcg_limited_groups_array_size is certainly unused 2946 - */ 2947 - for (i = 0; i < memcg_limited_groups_array_size; i++) { 2948 - if (!cur_params->memcg_caches[i]) 2949 - continue; 2950 - new_params->memcg_caches[i] = 2951 - cur_params->memcg_caches[i]; 2952 - } 2953 - 2954 - /* 2955 - * Ideally, we would wait until all caches succeed, and only 2956 - * then free the old one. But this is not worth the extra 2957 - * pointer per-cache we'd have to have for this. 2958 - * 2959 - * It is not a big deal if some caches are left with a size 2960 - * bigger than the others. And all updates will reset this 2961 - * anyway. 2962 - */ 2963 - rcu_assign_pointer(s->memcg_params, new_params); 2964 - if (cur_params) 2965 - kfree_rcu(cur_params, rcu_head); 2966 - } 2967 - return 0; 2968 - } 2969 - 2970 - int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s, 2971 - struct kmem_cache *root_cache) 2972 - { 2973 - size_t size; 2974 - 2975 - if (!memcg_kmem_enabled()) 2976 - return 0; 2977 - 2978 - if (!memcg) { 2979 - size = offsetof(struct memcg_cache_params, memcg_caches); 2980 - size += memcg_limited_groups_array_size * sizeof(void *); 2981 - } else 2982 - size = sizeof(struct memcg_cache_params); 2983 - 2984 - s->memcg_params = kzalloc(size, GFP_KERNEL); 2985 - if (!s->memcg_params) 2986 - return -ENOMEM; 2987 - 2988 - if (memcg) { 2989 - s->memcg_params->memcg = memcg; 2990 - s->memcg_params->root_cache = root_cache; 2991 - css_get(&memcg->css); 2992 - } else 2993 - s->memcg_params->is_root_cache = true; 2994 - 2995 - return 0; 2996 - } 2997 - 2998 - void memcg_free_cache_params(struct kmem_cache *s) 2999 - { 3000 - if (!s->memcg_params) 3001 - return; 3002 - if (!s->memcg_params->is_root_cache) 3003 - css_put(&s->memcg_params->memcg->css); 3004 - kfree(s->memcg_params); 2931 + memcg_limited_groups_array_size = num; 3005 2932 } 3006 2933 3007 2934 static void memcg_register_cache(struct 
mem_cgroup *memcg, ··· 2944 3051 if (!cachep) 2945 3052 return; 2946 3053 3054 + css_get(&memcg->css); 2947 3055 list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches); 2948 3056 2949 3057 /* ··· 2978 3084 list_del(&cachep->memcg_params->list); 2979 3085 2980 3086 kmem_cache_destroy(cachep); 3087 + 3088 + /* drop the reference taken in memcg_register_cache */ 3089 + css_put(&memcg->css); 2981 3090 } 2982 3091 2983 3092 /* ··· 3158 3261 rcu_read_lock(); 3159 3262 memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner)); 3160 3263 3161 - if (!memcg_can_account_kmem(memcg)) 3264 + if (!memcg_kmem_is_active(memcg)) 3162 3265 goto out; 3163 3266 3164 3267 memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg)); ··· 3243 3346 3244 3347 memcg = get_mem_cgroup_from_mm(current->mm); 3245 3348 3246 - if (!memcg_can_account_kmem(memcg)) { 3349 + if (!memcg_kmem_is_active(memcg)) { 3247 3350 css_put(&memcg->css); 3248 3351 return true; 3249 3352 } ··· 3585 3688 unsigned long long val) 3586 3689 { 3587 3690 int retry_count; 3588 - u64 memswlimit, memlimit; 3589 3691 int ret = 0; 3590 3692 int children = mem_cgroup_count_children(memcg); 3591 3693 u64 curusage, oldusage; ··· 3611 3715 * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 
3612 3716 */ 3613 3717 mutex_lock(&set_limit_mutex); 3614 - memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT); 3615 - if (memswlimit < val) { 3718 + if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) { 3616 3719 ret = -EINVAL; 3617 3720 mutex_unlock(&set_limit_mutex); 3618 3721 break; 3619 3722 } 3620 3723 3621 - memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT); 3622 - if (memlimit < val) 3724 + if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val) 3623 3725 enlarge = 1; 3624 3726 3625 3727 ret = res_counter_set_limit(&memcg->res, val); 3626 - if (!ret) { 3627 - if (memswlimit == val) 3628 - memcg->memsw_is_minimum = true; 3629 - else 3630 - memcg->memsw_is_minimum = false; 3631 - } 3632 3728 mutex_unlock(&set_limit_mutex); 3633 3729 3634 3730 if (!ret) 3635 3731 break; 3636 3732 3637 - mem_cgroup_reclaim(memcg, GFP_KERNEL, 3638 - MEM_CGROUP_RECLAIM_SHRINK); 3733 + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true); 3734 + 3639 3735 curusage = res_counter_read_u64(&memcg->res, RES_USAGE); 3640 3736 /* Usage is reduced ? */ 3641 3737 if (curusage >= oldusage) ··· 3645 3757 unsigned long long val) 3646 3758 { 3647 3759 int retry_count; 3648 - u64 memlimit, memswlimit, oldusage, curusage; 3760 + u64 oldusage, curusage; 3649 3761 int children = mem_cgroup_count_children(memcg); 3650 3762 int ret = -EBUSY; 3651 3763 int enlarge = 0; ··· 3664 3776 * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 
3665 3777 */ 3666 3778 mutex_lock(&set_limit_mutex); 3667 - memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT); 3668 - if (memlimit > val) { 3779 + if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) { 3669 3780 ret = -EINVAL; 3670 3781 mutex_unlock(&set_limit_mutex); 3671 3782 break; 3672 3783 } 3673 - memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT); 3674 - if (memswlimit < val) 3784 + if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) 3675 3785 enlarge = 1; 3676 3786 ret = res_counter_set_limit(&memcg->memsw, val); 3677 - if (!ret) { 3678 - if (memlimit == val) 3679 - memcg->memsw_is_minimum = true; 3680 - else 3681 - memcg->memsw_is_minimum = false; 3682 - } 3683 3787 mutex_unlock(&set_limit_mutex); 3684 3788 3685 3789 if (!ret) 3686 3790 break; 3687 3791 3688 - mem_cgroup_reclaim(memcg, GFP_KERNEL, 3689 - MEM_CGROUP_RECLAIM_NOSWAP | 3690 - MEM_CGROUP_RECLAIM_SHRINK); 3792 + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false); 3793 + 3691 3794 curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE); 3692 3795 /* Usage is reduced ? */ 3693 3796 if (curusage >= oldusage) ··· 3927 4048 if (signal_pending(current)) 3928 4049 return -EINTR; 3929 4050 3930 - progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL, 3931 - false); 4051 + progress = try_to_free_mem_cgroup_pages(memcg, 1, 4052 + GFP_KERNEL, true); 3932 4053 if (!progress) { 3933 4054 nr_retries--; 3934 4055 /* maybe some writeback is necessary */ ··· 4093 4214 if (err) 4094 4215 goto out; 4095 4216 4096 - memcg_id = ida_simple_get(&kmem_limited_groups, 4097 - 0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL); 4217 + memcg_id = memcg_alloc_cache_id(); 4098 4218 if (memcg_id < 0) { 4099 4219 err = memcg_id; 4100 4220 goto out; 4101 4221 } 4102 - 4103 - /* 4104 - * Make sure we have enough space for this cgroup in each root cache's 4105 - * memcg_params. 
4106 - */ 4107 - mutex_lock(&memcg_slab_mutex); 4108 - err = memcg_update_all_caches(memcg_id + 1); 4109 - mutex_unlock(&memcg_slab_mutex); 4110 - if (err) 4111 - goto out_rmid; 4112 4222 4113 4223 memcg->kmemcg_id = memcg_id; 4114 4224 INIT_LIST_HEAD(&memcg->memcg_slab_caches); ··· 4119 4251 out: 4120 4252 memcg_resume_kmem_account(); 4121 4253 return err; 4122 - 4123 - out_rmid: 4124 - ida_simple_remove(&kmem_limited_groups, memcg_id); 4125 - goto out; 4126 4254 } 4127 4255 4128 4256 static int memcg_activate_kmem(struct mem_cgroup *memcg,
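The new `memcg_alloc_cache_id()` above folds id allocation and on-demand growth of the per-cache arrays into one helper: grab an id, and if it falls outside the current array size, grow the arrays before returning it (rolling the id back on failure). A userspace sketch of that pattern under stated assumptions — the bitmap scan and `realloc` stand in for the kernel's IDA and `memcg_update_all_caches()`:

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_IDS 64

static unsigned char used[MAX_IDS];	/* stand-in for the IDA */
static void **backing;			/* stand-in for memcg_caches arrays */
static int backing_size;

/* Allocate the lowest free id; if it doesn't fit in the backing array,
 * double the array (2 * (id + 1), as in the hunk above) first. */
int alloc_id(void)
{
	int id;

	for (id = 0; id < MAX_IDS; id++)
		if (!used[id])
			break;
	if (id == MAX_IDS)
		return -1;
	used[id] = 1;

	if (id >= backing_size) {
		int new_size = 2 * (id + 1);
		void **grown = realloc(backing, new_size * sizeof(*grown));

		if (!grown) {		/* roll back the id on failure */
			used[id] = 0;
			return -1;
		}
		backing = grown;
		backing_size = new_size;
	}
	return id;
}

void free_id(int id)
{
	used[id] = 0;
}
```

Freed ids are reused lowest-first, so the arrays only grow when a genuinely new high-water mark is reached.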
+1 -1
mm/memory_hotplug.c
··· 1307 1307 /* 1308 1308 * Confirm that all pages in a range [start, end) belong to the same zone. 1309 1309 */ 1310 - static int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn) 1310 + int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn) 1311 1311 { 1312 1312 unsigned long pfn; 1313 1313 struct zone *zone = NULL;

+65 -77
mm/mempolicy.c
··· 123 123 124 124 static struct mempolicy preferred_node_policy[MAX_NUMNODES]; 125 125 126 - static struct mempolicy *get_task_policy(struct task_struct *p) 126 + struct mempolicy *get_task_policy(struct task_struct *p) 127 127 { 128 128 struct mempolicy *pol = p->mempolicy; 129 + int node; 129 130 130 - if (!pol) { 131 - int node = numa_node_id(); 131 + if (pol) 132 + return pol; 132 133 133 - if (node != NUMA_NO_NODE) { 134 - pol = &preferred_node_policy[node]; 135 - /* 136 - * preferred_node_policy is not initialised early in 137 - * boot 138 - */ 139 - if (!pol->mode) 140 - pol = NULL; 141 - } 134 + node = numa_node_id(); 135 + if (node != NUMA_NO_NODE) { 136 + pol = &preferred_node_policy[node]; 137 + /* preferred_node_policy is not initialised early in boot */ 138 + if (pol->mode) 139 + return pol; 142 140 } 143 141 144 - return pol; 142 + return &default_policy; 145 143 } 146 144 147 145 static const struct mempolicy_operations { ··· 681 683 } 682 684 683 685 if (flags & MPOL_MF_LAZY) { 684 - change_prot_numa(vma, start, endvma); 686 + /* Similar to task_numa_work, skip inaccessible VMAs */ 687 + if (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)) 688 + change_prot_numa(vma, start, endvma); 685 689 goto next; 686 690 } 687 691 ··· 804 804 nodemask_t *nodes) 805 805 { 806 806 struct mempolicy *new, *old; 807 - struct mm_struct *mm = current->mm; 808 807 NODEMASK_SCRATCH(scratch); 809 808 int ret; 810 809 ··· 815 816 ret = PTR_ERR(new); 816 817 goto out; 817 818 } 818 - /* 819 - * prevent changing our mempolicy while show_numa_maps() 820 - * is using it. 821 - * Note: do_set_mempolicy() can be called at init time 822 - * with no 'mm'. 
823 - */ 824 - if (mm) 825 - down_write(&mm->mmap_sem); 819 + 826 820 task_lock(current); 827 821 ret = mpol_set_nodemask(new, nodes, scratch); 828 822 if (ret) { 829 823 task_unlock(current); 830 - if (mm) 831 - up_write(&mm->mmap_sem); 832 824 mpol_put(new); 833 825 goto out; 834 826 } ··· 829 839 nodes_weight(new->v.nodes)) 830 840 current->il_next = first_node(new->v.nodes); 831 841 task_unlock(current); 832 - if (mm) 833 - up_write(&mm->mmap_sem); 834 - 835 842 mpol_put(old); 836 843 ret = 0; 837 844 out: ··· 1592 1605 1593 1606 #endif 1594 1607 1595 - /* 1596 - * get_vma_policy(@task, @vma, @addr) 1597 - * @task: task for fallback if vma policy == default 1598 - * @vma: virtual memory area whose policy is sought 1599 - * @addr: address in @vma for shared policy lookup 1600 - * 1601 - * Returns effective policy for a VMA at specified address. 1602 - * Falls back to @task or system default policy, as necessary. 1603 - * Current or other task's task mempolicy and non-shared vma policies must be 1604 - * protected by task_lock(task) by the caller. 1605 - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference 1606 - * count--added by the get_policy() vm_op, as appropriate--to protect against 1607 - * freeing by another task. It is the caller's responsibility to free the 1608 - * extra reference for shared policies. 
1609 - */ 1610 - struct mempolicy *get_vma_policy(struct task_struct *task, 1611 - struct vm_area_struct *vma, unsigned long addr) 1608 + struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, 1609 + unsigned long addr) 1612 1610 { 1613 - struct mempolicy *pol = get_task_policy(task); 1611 + struct mempolicy *pol = NULL; 1614 1612 1615 1613 if (vma) { 1616 1614 if (vma->vm_ops && vma->vm_ops->get_policy) { 1617 - struct mempolicy *vpol = vma->vm_ops->get_policy(vma, 1618 - addr); 1619 - if (vpol) 1620 - pol = vpol; 1615 + pol = vma->vm_ops->get_policy(vma, addr); 1621 1616 } else if (vma->vm_policy) { 1622 1617 pol = vma->vm_policy; 1623 1618 ··· 1613 1644 mpol_get(pol); 1614 1645 } 1615 1646 } 1616 - if (!pol) 1617 - pol = &default_policy; 1647 + 1618 1648 return pol; 1619 1649 } 1620 1650 1621 - bool vma_policy_mof(struct task_struct *task, struct vm_area_struct *vma) 1651 + /* 1652 + * get_vma_policy(@vma, @addr) 1653 + * @vma: virtual memory area whose policy is sought 1654 + * @addr: address in @vma for shared policy lookup 1655 + * 1656 + * Returns effective policy for a VMA at specified address. 1657 + * Falls back to current->mempolicy or system default policy, as necessary. 1658 + * Shared policies [those marked as MPOL_F_SHARED] require an extra reference 1659 + * count--added by the get_policy() vm_op, as appropriate--to protect against 1660 + * freeing by another task. It is the caller's responsibility to free the 1661 + * extra reference for shared policies. 
1662 + */ 1663 + static struct mempolicy *get_vma_policy(struct vm_area_struct *vma, 1664 + unsigned long addr) 1622 1665 { 1623 - struct mempolicy *pol = get_task_policy(task); 1624 - if (vma) { 1625 - if (vma->vm_ops && vma->vm_ops->get_policy) { 1626 - bool ret = false; 1627 - 1628 - pol = vma->vm_ops->get_policy(vma, vma->vm_start); 1629 - if (pol && (pol->flags & MPOL_F_MOF)) 1630 - ret = true; 1631 - mpol_cond_put(pol); 1632 - 1633 - return ret; 1634 - } else if (vma->vm_policy) { 1635 - pol = vma->vm_policy; 1636 - } 1637 - } 1666 + struct mempolicy *pol = __get_vma_policy(vma, addr); 1638 1667 1639 1668 if (!pol) 1640 - return default_policy.flags & MPOL_F_MOF; 1669 + pol = get_task_policy(current); 1670 + 1671 + return pol; 1672 + } 1673 + 1674 + bool vma_policy_mof(struct vm_area_struct *vma) 1675 + { 1676 + struct mempolicy *pol; 1677 + 1678 + if (vma->vm_ops && vma->vm_ops->get_policy) { 1679 + bool ret = false; 1680 + 1681 + pol = vma->vm_ops->get_policy(vma, vma->vm_start); 1682 + if (pol && (pol->flags & MPOL_F_MOF)) 1683 + ret = true; 1684 + mpol_cond_put(pol); 1685 + 1686 + return ret; 1687 + } 1688 + 1689 + pol = vma->vm_policy; 1690 + if (!pol) 1691 + pol = get_task_policy(current); 1641 1692 1642 1693 return pol->flags & MPOL_F_MOF; 1643 1694 } ··· 1863 1874 { 1864 1875 struct zonelist *zl; 1865 1876 1866 - *mpol = get_vma_policy(current, vma, addr); 1877 + *mpol = get_vma_policy(vma, addr); 1867 1878 *nodemask = NULL; /* assume !MPOL_BIND */ 1868 1879 1869 1880 if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) { ··· 2018 2029 unsigned int cpuset_mems_cookie; 2019 2030 2020 2031 retry_cpuset: 2021 - pol = get_vma_policy(current, vma, addr); 2032 + pol = get_vma_policy(vma, addr); 2022 2033 cpuset_mems_cookie = read_mems_allowed_begin(); 2023 2034 2024 2035 if (unlikely(pol->mode == MPOL_INTERLEAVE)) { ··· 2035 2046 page = __alloc_pages_nodemask(gfp, order, 2036 2047 policy_zonelist(gfp, pol, node), 2037 2048 policy_nodemask(gfp, pol)); 2038 - if 
(unlikely(mpol_needs_cond_ref(pol))) 2039 - __mpol_put(pol); 2049 + mpol_cond_put(pol); 2040 2050 if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) 2041 2051 goto retry_cpuset; 2042 2052 return page; ··· 2062 2074 */ 2063 2075 struct page *alloc_pages_current(gfp_t gfp, unsigned order) 2064 2076 { 2065 - struct mempolicy *pol = get_task_policy(current); 2077 + struct mempolicy *pol = &default_policy; 2066 2078 struct page *page; 2067 2079 unsigned int cpuset_mems_cookie; 2068 2080 2069 - if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) 2070 - pol = &default_policy; 2081 + if (!in_interrupt() && !(gfp & __GFP_THISNODE)) 2082 + pol = get_task_policy(current); 2071 2083 2072 2084 retry_cpuset: 2073 2085 cpuset_mems_cookie = read_mems_allowed_begin(); ··· 2284 2296 2285 2297 BUG_ON(!vma); 2286 2298 2287 - pol = get_vma_policy(current, vma, addr); 2299 + pol = get_vma_policy(vma, addr); 2288 2300 if (!(pol->flags & MPOL_F_MOF)) 2289 2301 goto out; 2290 2302
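The mempolicy rework above splits lookup into "what this VMA says" (`__get_vma_policy()`) and a caller-side fallback to the task policy and then the system default. A generic sketch of that most-specific-wins chain, with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

struct policy { int mode; };

struct policy default_policy = { 0 };

/* Layered lookup: per-object policy beats per-task policy beats the
 * system default, mirroring get_vma_policy()'s fallback order. */
const struct policy *effective_policy(const struct policy *vma_pol,
				      const struct policy *task_pol)
{
	if (vma_pol)
		return vma_pol;		/* most specific wins */
	if (task_pol)
		return task_pol;
	return &default_policy;		/* system-wide fallback */
}
```

Splitting the layers this way lets hot paths that already know the fallback (e.g. `alloc_pages_current()` above) skip the per-object step entirely.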
+4 -12
mm/migrate.c
··· 876 876 } 877 877 } 878 878 879 - if (unlikely(balloon_page_movable(page))) { 879 + if (unlikely(isolated_balloon_page(page))) { 880 880 /* 881 881 * A ballooned page does not need any special attention from 882 882 * physical to virtual reverse mapping procedures. ··· 955 955 956 956 rc = __unmap_and_move(page, newpage, force, mode); 957 957 958 - if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) { 959 - /* 960 - * A ballooned page has been migrated already. 961 - * Now, it's the time to wrap-up counters, 962 - * handle the page back to Buddy and return. 963 - */ 964 - dec_zone_page_state(page, NR_ISOLATED_ANON + 965 - page_is_file_cache(page)); 966 - balloon_page_free(page); 967 - return MIGRATEPAGE_SUCCESS; 968 - } 969 958 out: 970 959 if (rc != -EAGAIN) { 971 960 /* ··· 977 988 if (rc != MIGRATEPAGE_SUCCESS && put_new_page) { 978 989 ClearPageSwapBacked(newpage); 979 990 put_new_page(newpage, private); 991 + } else if (unlikely(__is_movable_balloon_page(newpage))) { 992 + /* drop our reference, page already in the balloon */ 993 + put_page(newpage); 980 994 } else 981 995 putback_lru_page(newpage); 982 996
+3 -3
mm/mlock.c
··· 233 233 234 234 VM_BUG_ON(start & ~PAGE_MASK); 235 235 VM_BUG_ON(end & ~PAGE_MASK); 236 - VM_BUG_ON(start < vma->vm_start); 237 - VM_BUG_ON(end > vma->vm_end); 238 - VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); 236 + VM_BUG_ON_VMA(start < vma->vm_start, vma); 237 + VM_BUG_ON_VMA(end > vma->vm_end, vma); 238 + VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm); 239 239 240 240 gup_flags = FOLL_TOUCH | FOLL_MLOCK; 241 241 /*
+40 -34
mm/mmap.c
··· 70 70 * MAP_SHARED r: (no) no r: (yes) yes r: (no) yes r: (no) yes 71 71 * w: (no) no w: (no) no w: (yes) yes w: (no) no 72 72 * x: (no) no x: (no) yes x: (no) yes x: (yes) yes 73 - * 73 + * 74 74 * MAP_PRIVATE r: (no) no r: (yes) yes r: (no) yes r: (no) yes 75 75 * w: (no) no w: (no) no w: (copy) copy w: (no) no 76 76 * x: (no) no x: (no) yes x: (no) yes x: (yes) yes ··· 268 268 269 269 SYSCALL_DEFINE1(brk, unsigned long, brk) 270 270 { 271 - unsigned long rlim, retval; 271 + unsigned long retval; 272 272 unsigned long newbrk, oldbrk; 273 273 struct mm_struct *mm = current->mm; 274 274 unsigned long min_brk; ··· 298 298 * segment grow beyond its set limit the in case where the limit is 299 299 * not page aligned -Ram Gupta 300 300 */ 301 - rlim = rlimit(RLIMIT_DATA); 302 - if (rlim < RLIM_INFINITY && (brk - mm->start_brk) + 303 - (mm->end_data - mm->start_data) > rlim) 301 + if (check_data_rlimit(rlimit(RLIMIT_DATA), brk, mm->start_brk, 302 + mm->end_data, mm->start_data)) 304 303 goto out; 305 304 306 305 newbrk = PAGE_ALIGN(brk); ··· 368 369 struct vm_area_struct *vma; 369 370 vma = rb_entry(nd, struct vm_area_struct, vm_rb); 370 371 if (vma->vm_start < prev) { 371 - pr_emerg("vm_start %lx prev %lx\n", vma->vm_start, prev); 372 + pr_emerg("vm_start %lx < prev %lx\n", 373 + vma->vm_start, prev); 372 374 bug = 1; 373 375 } 374 376 if (vma->vm_start < pend) { 375 - pr_emerg("vm_start %lx pend %lx\n", vma->vm_start, pend); 377 + pr_emerg("vm_start %lx < pend %lx\n", 378 + vma->vm_start, pend); 376 379 bug = 1; 377 380 } 378 381 if (vma->vm_start > vma->vm_end) { 379 - pr_emerg("vm_end %lx < vm_start %lx\n", 380 - vma->vm_end, vma->vm_start); 382 + pr_emerg("vm_start %lx > vm_end %lx\n", 383 + vma->vm_start, vma->vm_end); 381 384 bug = 1; 382 385 } 383 386 if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) { ··· 410 409 for (nd = rb_first(root); nd; nd = rb_next(nd)) { 411 410 struct vm_area_struct *vma; 412 411 vma = rb_entry(nd, struct vm_area_struct, 
vm_rb); 413 - BUG_ON(vma != ignore && 414 - vma->rb_subtree_gap != vma_compute_subtree_gap(vma)); 412 + VM_BUG_ON_VMA(vma != ignore && 413 + vma->rb_subtree_gap != vma_compute_subtree_gap(vma), 414 + vma); 415 415 } 416 416 } 417 417 ··· 422 420 int i = 0; 423 421 unsigned long highest_address = 0; 424 422 struct vm_area_struct *vma = mm->mmap; 423 + 425 424 while (vma) { 426 425 struct anon_vma_chain *avc; 426 + 427 427 vma_lock_anon_vma(vma); 428 428 list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) 429 429 anon_vma_interval_tree_verify(avc); ··· 440 436 } 441 437 if (highest_address != mm->highest_vm_end) { 442 438 pr_emerg("mm->highest_vm_end %lx, found %lx\n", 443 - mm->highest_vm_end, highest_address); 439 + mm->highest_vm_end, highest_address); 444 440 bug = 1; 445 441 } 446 442 i = browse_rb(&mm->mm_rb); 447 443 if (i != mm->map_count) { 448 - pr_emerg("map_count %d rb %d\n", mm->map_count, i); 444 + if (i != -1) 445 + pr_emerg("map_count %d rb %d\n", mm->map_count, i); 449 446 bug = 1; 450 447 } 451 - BUG_ON(bug); 448 + VM_BUG_ON_MM(bug, mm); 452 449 } 453 450 #else 454 451 #define validate_mm_rb(root, ignore) do { } while (0) ··· 746 741 * split_vma inserting another: so it must be 747 742 * mprotect case 4 shifting the boundary down. 
748 743 */ 749 - adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT); 744 + adjust_next = -((vma->vm_end - end) >> PAGE_SHIFT); 750 745 exporter = vma; 751 746 importer = next; 752 747 } ··· 792 787 if (!anon_vma && adjust_next) 793 788 anon_vma = next->anon_vma; 794 789 if (anon_vma) { 795 - VM_BUG_ON(adjust_next && next->anon_vma && 796 - anon_vma != next->anon_vma); 790 + VM_BUG_ON_VMA(adjust_next && next->anon_vma && 791 + anon_vma != next->anon_vma, next); 797 792 anon_vma_lock_write(anon_vma); 798 793 anon_vma_interval_tree_pre_update_vma(vma); 799 794 if (adjust_next) ··· 1015 1010 struct vm_area_struct *vma_merge(struct mm_struct *mm, 1016 1011 struct vm_area_struct *prev, unsigned long addr, 1017 1012 unsigned long end, unsigned long vm_flags, 1018 - struct anon_vma *anon_vma, struct file *file, 1013 + struct anon_vma *anon_vma, struct file *file, 1019 1014 pgoff_t pgoff, struct mempolicy *policy) 1020 1015 { 1021 1016 pgoff_t pglen = (end - addr) >> PAGE_SHIFT; ··· 1041 1036 * Can it merge with the predecessor? 1042 1037 */ 1043 1038 if (prev && prev->vm_end == addr && 1044 - mpol_equal(vma_policy(prev), policy) && 1039 + mpol_equal(vma_policy(prev), policy) && 1045 1040 can_vma_merge_after(prev, vm_flags, 1046 1041 anon_vma, file, pgoff)) { 1047 1042 /* ··· 1069 1064 * Can this new request be merged in front of next? 1070 1065 */ 1071 1066 if (next && end == next->vm_start && 1072 - mpol_equal(policy, vma_policy(next)) && 1067 + mpol_equal(policy, vma_policy(next)) && 1073 1068 can_vma_merge_before(next, vm_flags, 1074 1069 anon_vma, file, pgoff+pglen)) { 1075 1070 if (prev && addr < prev->vm_end) /* case 4 */ ··· 1240 1235 unsigned long flags, unsigned long pgoff, 1241 1236 unsigned long *populate) 1242 1237 { 1243 - struct mm_struct * mm = current->mm; 1238 + struct mm_struct *mm = current->mm; 1244 1239 vm_flags_t vm_flags; 1245 1240 1246 1241 *populate = 0; ··· 1268 1263 1269 1264 /* offset overflow? 
*/ 1270 1265 if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) 1271 - return -EOVERFLOW; 1266 + return -EOVERFLOW; 1272 1267 1273 1268 /* Too many mappings? */ 1274 1269 if (mm->map_count > sysctl_max_map_count) ··· 1926 1921 info.align_mask = 0; 1927 1922 return vm_unmapped_area(&info); 1928 1923 } 1929 - #endif 1924 + #endif 1930 1925 1931 1926 /* 1932 1927 * This mmap-allocator allocates new areas top-down from below the ··· 2326 2321 } 2327 2322 2328 2323 struct vm_area_struct * 2329 - find_extend_vma(struct mm_struct * mm, unsigned long addr) 2324 + find_extend_vma(struct mm_struct *mm, unsigned long addr) 2330 2325 { 2331 - struct vm_area_struct * vma; 2326 + struct vm_area_struct *vma; 2332 2327 unsigned long start; 2333 2328 2334 2329 addr &= PAGE_MASK; 2335 - vma = find_vma(mm,addr); 2330 + vma = find_vma(mm, addr); 2336 2331 if (!vma) 2337 2332 return NULL; 2338 2333 if (vma->vm_start <= addr) ··· 2381 2376 struct vm_area_struct *vma, struct vm_area_struct *prev, 2382 2377 unsigned long start, unsigned long end) 2383 2378 { 2384 - struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; 2379 + struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap; 2385 2380 struct mmu_gather tlb; 2386 2381 2387 2382 lru_add_drain(); ··· 2428 2423 * __split_vma() bypasses sysctl_max_map_count checking. We use this on the 2429 2424 * munmap path where it doesn't make sense to fail. 
2430 2425 */ 2431 - static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, 2426 + static int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma, 2432 2427 unsigned long addr, int new_below) 2433 2428 { 2434 2429 struct vm_area_struct *new; ··· 2517 2512 if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start) 2518 2513 return -EINVAL; 2519 2514 2520 - if ((len = PAGE_ALIGN(len)) == 0) 2515 + len = PAGE_ALIGN(len); 2516 + if (len == 0) 2521 2517 return -EINVAL; 2522 2518 2523 2519 /* Find the first overlapping VMA */ ··· 2564 2558 if (error) 2565 2559 return error; 2566 2560 } 2567 - vma = prev? prev->vm_next: mm->mmap; 2561 + vma = prev ? prev->vm_next : mm->mmap; 2568 2562 2569 2563 /* 2570 2564 * unlock any mlock()ed ranges before detaching vmas ··· 2627 2621 */ 2628 2622 static unsigned long do_brk(unsigned long addr, unsigned long len) 2629 2623 { 2630 - struct mm_struct * mm = current->mm; 2631 - struct vm_area_struct * vma, * prev; 2624 + struct mm_struct *mm = current->mm; 2625 + struct vm_area_struct *vma, *prev; 2632 2626 unsigned long flags; 2633 - struct rb_node ** rb_link, * rb_parent; 2627 + struct rb_node **rb_link, *rb_parent; 2634 2628 pgoff_t pgoff = addr >> PAGE_SHIFT; 2635 2629 int error; 2636 2630 ··· 2854 2848 * safe. It is only safe to keep the vm_pgoff 2855 2849 * linear if there are no pages mapped yet. 2856 2850 */ 2857 - VM_BUG_ON(faulted_in_anon_vma); 2851 + VM_BUG_ON_VMA(faulted_in_anon_vma, new_vma); 2858 2852 *vmap = vma = new_vma; 2859 2853 } 2860 2854 *need_rmap_locks = (new_vma->vm_pgoff <= vma->vm_pgoff);
+3 -2
mm/mremap.c
··· 21 21 #include <linux/syscalls.h> 22 22 #include <linux/mmu_notifier.h> 23 23 #include <linux/sched/sysctl.h> 24 + #include <linux/uaccess.h> 24 25 25 - #include <asm/uaccess.h> 26 26 #include <asm/cacheflush.h> 27 27 #include <asm/tlbflush.h> 28 28 ··· 195 195 if (pmd_trans_huge(*old_pmd)) { 196 196 int err = 0; 197 197 if (extent == HPAGE_PMD_SIZE) { 198 - VM_BUG_ON(vma->vm_file || !vma->anon_vma); 198 + VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma, 199 + vma); 199 200 /* See comment in move_ptes() */ 200 201 if (need_rmap_locks) 201 202 anon_vma_lock_write(vma->anon_vma);
+3 -3
mm/oom_kill.c
··· 565 565 566 566 spin_lock(&zone_scan_lock); 567 567 for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) 568 - if (zone_is_oom_locked(zone)) { 568 + if (test_bit(ZONE_OOM_LOCKED, &zone->flags)) { 569 569 ret = false; 570 570 goto out; 571 571 } ··· 575 575 * call to oom_zonelist_trylock() doesn't succeed when it shouldn't. 576 576 */ 577 577 for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) 578 - zone_set_flag(zone, ZONE_OOM_LOCKED); 578 + set_bit(ZONE_OOM_LOCKED, &zone->flags); 579 579 580 580 out: 581 581 spin_unlock(&zone_scan_lock); ··· 594 594 595 595 spin_lock(&zone_scan_lock); 596 596 for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) 597 - zone_clear_flag(zone, ZONE_OOM_LOCKED); 597 + clear_bit(ZONE_OOM_LOCKED, &zone->flags); 598 598 spin_unlock(&zone_scan_lock); 599 599 } 600 600
+4 -4
mm/page-writeback.c
··· 1075 1075 } 1076 1076 1077 1077 if (dirty < setpoint) { 1078 - x = min(bdi->balanced_dirty_ratelimit, 1079 - min(balanced_dirty_ratelimit, task_ratelimit)); 1078 + x = min3(bdi->balanced_dirty_ratelimit, 1079 + balanced_dirty_ratelimit, task_ratelimit); 1080 1080 if (dirty_ratelimit < x) 1081 1081 step = x - dirty_ratelimit; 1082 1082 } else { 1083 - x = max(bdi->balanced_dirty_ratelimit, 1084 - max(balanced_dirty_ratelimit, task_ratelimit)); 1083 + x = max3(bdi->balanced_dirty_ratelimit, 1084 + balanced_dirty_ratelimit, task_ratelimit); 1085 1085 if (dirty_ratelimit > x) 1086 1086 step = dirty_ratelimit - x; 1087 1087 }
+129 -229
mm/page_alloc.c
··· 53 53 #include <linux/kmemleak.h> 54 54 #include <linux/compaction.h> 55 55 #include <trace/events/kmem.h> 56 - #include <linux/ftrace_event.h> 57 - #include <linux/memcontrol.h> 58 56 #include <linux/prefetch.h> 59 57 #include <linux/mm_inline.h> 60 58 #include <linux/migrate.h> ··· 83 85 */ 84 86 DEFINE_PER_CPU(int, _numa_mem_); /* Kernel "local memory" node */ 85 87 EXPORT_PER_CPU_SYMBOL(_numa_mem_); 88 + int _node_numa_mem_[MAX_NUMNODES]; 86 89 #endif 87 90 88 91 /* ··· 1013 1014 * Remove at a later date when no bug reports exist related to 1014 1015 * grouping pages by mobility 1015 1016 */ 1016 - BUG_ON(page_zone(start_page) != page_zone(end_page)); 1017 + VM_BUG_ON(page_zone(start_page) != page_zone(end_page)); 1017 1018 #endif 1018 1019 1019 1020 for (page = start_page; page <= end_page;) { ··· 1612 1613 1613 1614 __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order)); 1614 1615 if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0 && 1615 - !zone_is_fair_depleted(zone)) 1616 - zone_set_flag(zone, ZONE_FAIR_DEPLETED); 1616 + !test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) 1617 + set_bit(ZONE_FAIR_DEPLETED, &zone->flags); 1617 1618 1618 1619 __count_zone_vm_events(PGALLOC, zone, 1 << order); 1619 1620 zone_statistics(preferred_zone, zone, gfp_flags); ··· 1933 1934 mod_zone_page_state(zone, NR_ALLOC_BATCH, 1934 1935 high_wmark_pages(zone) - low_wmark_pages(zone) - 1935 1936 atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH])); 1936 - zone_clear_flag(zone, ZONE_FAIR_DEPLETED); 1937 + clear_bit(ZONE_FAIR_DEPLETED, &zone->flags); 1937 1938 } while (zone++ != preferred_zone); 1938 1939 } 1939 1940 ··· 1984 1985 if (alloc_flags & ALLOC_FAIR) { 1985 1986 if (!zone_local(preferred_zone, zone)) 1986 1987 break; 1987 - if (zone_is_fair_depleted(zone)) { 1988 + if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) { 1988 1989 nr_fair_skipped++; 1989 1990 continue; 1990 1991 } ··· 2295 2296 struct zonelist *zonelist, enum zone_type high_zoneidx, 2296 2297 nodemask_t 
*nodemask, int alloc_flags, struct zone *preferred_zone, 2297 2298 int classzone_idx, int migratetype, enum migrate_mode mode, 2298 - bool *contended_compaction, bool *deferred_compaction, 2299 - unsigned long *did_some_progress) 2299 + int *contended_compaction, bool *deferred_compaction) 2300 2300 { 2301 + struct zone *last_compact_zone = NULL; 2302 + unsigned long compact_result; 2303 + struct page *page; 2304 + 2301 2305 if (!order) 2302 2306 return NULL; 2303 2307 2304 - if (compaction_deferred(preferred_zone, order)) { 2305 - *deferred_compaction = true; 2306 - return NULL; 2307 - } 2308 - 2309 2308 current->flags |= PF_MEMALLOC; 2310 - *did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask, 2309 + compact_result = try_to_compact_pages(zonelist, order, gfp_mask, 2311 2310 nodemask, mode, 2312 - contended_compaction); 2311 + contended_compaction, 2312 + &last_compact_zone); 2313 2313 current->flags &= ~PF_MEMALLOC; 2314 2314 2315 - if (*did_some_progress != COMPACT_SKIPPED) { 2316 - struct page *page; 2317 - 2318 - /* Page migration frees to the PCP lists but we want merging */ 2319 - drain_pages(get_cpu()); 2320 - put_cpu(); 2321 - 2322 - page = get_page_from_freelist(gfp_mask, nodemask, 2323 - order, zonelist, high_zoneidx, 2324 - alloc_flags & ~ALLOC_NO_WATERMARKS, 2325 - preferred_zone, classzone_idx, migratetype); 2326 - if (page) { 2327 - preferred_zone->compact_blockskip_flush = false; 2328 - compaction_defer_reset(preferred_zone, order, true); 2329 - count_vm_event(COMPACTSUCCESS); 2330 - return page; 2331 - } 2332 - 2333 - /* 2334 - * It's bad if compaction run occurs and fails. 2335 - * The most likely reason is that pages exist, 2336 - * but not enough to satisfy watermarks. 2337 - */ 2338 - count_vm_event(COMPACTFAIL); 2339 - 2340 - /* 2341 - * As async compaction considers a subset of pageblocks, only 2342 - * defer if the failure was a sync compaction failure. 
2343 - */ 2344 - if (mode != MIGRATE_ASYNC) 2345 - defer_compaction(preferred_zone, order); 2346 - 2347 - cond_resched(); 2315 + switch (compact_result) { 2316 + case COMPACT_DEFERRED: 2317 + *deferred_compaction = true; 2318 + /* fall-through */ 2319 + case COMPACT_SKIPPED: 2320 + return NULL; 2321 + default: 2322 + break; 2348 2323 } 2324 + 2325 + /* 2326 + * At least in one zone compaction wasn't deferred or skipped, so let's 2327 + * count a compaction stall 2328 + */ 2329 + count_vm_event(COMPACTSTALL); 2330 + 2331 + /* Page migration frees to the PCP lists but we want merging */ 2332 + drain_pages(get_cpu()); 2333 + put_cpu(); 2334 + 2335 + page = get_page_from_freelist(gfp_mask, nodemask, 2336 + order, zonelist, high_zoneidx, 2337 + alloc_flags & ~ALLOC_NO_WATERMARKS, 2338 + preferred_zone, classzone_idx, migratetype); 2339 + 2340 + if (page) { 2341 + struct zone *zone = page_zone(page); 2342 + 2343 + zone->compact_blockskip_flush = false; 2344 + compaction_defer_reset(zone, order, true); 2345 + count_vm_event(COMPACTSUCCESS); 2346 + return page; 2347 + } 2348 + 2349 + /* 2350 + * last_compact_zone is where try_to_compact_pages thought allocation 2351 + * should succeed, so it did not defer compaction. But here we know 2352 + * that it didn't succeed, so we do the defer. 2353 + */ 2354 + if (last_compact_zone && mode != MIGRATE_ASYNC) 2355 + defer_compaction(last_compact_zone, order); 2356 + 2357 + /* 2358 + * It's bad if compaction run occurs and fails. The most likely reason 2359 + * is that pages exist, but not enough to satisfy watermarks. 
2360 + */ 2361 + count_vm_event(COMPACTFAIL); 2362 + 2363 + cond_resched(); 2349 2364 2350 2365 return NULL; 2351 2366 } ··· 2368 2355 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, 2369 2356 struct zonelist *zonelist, enum zone_type high_zoneidx, 2370 2357 nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone, 2371 - int classzone_idx, int migratetype, 2372 - enum migrate_mode mode, bool *contended_compaction, 2373 - bool *deferred_compaction, unsigned long *did_some_progress) 2358 + int classzone_idx, int migratetype, enum migrate_mode mode, 2359 + int *contended_compaction, bool *deferred_compaction) 2374 2360 { 2375 2361 return NULL; 2376 2362 } ··· 2469 2457 static void wake_all_kswapds(unsigned int order, 2470 2458 struct zonelist *zonelist, 2471 2459 enum zone_type high_zoneidx, 2472 - struct zone *preferred_zone) 2460 + struct zone *preferred_zone, 2461 + nodemask_t *nodemask) 2473 2462 { 2474 2463 struct zoneref *z; 2475 2464 struct zone *zone; 2476 2465 2477 - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) 2466 + for_each_zone_zonelist_nodemask(zone, z, zonelist, 2467 + high_zoneidx, nodemask) 2478 2468 wakeup_kswapd(zone, order, zone_idx(preferred_zone)); 2479 2469 } 2480 2470 ··· 2523 2509 alloc_flags |= ALLOC_NO_WATERMARKS; 2524 2510 } 2525 2511 #ifdef CONFIG_CMA 2526 - if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 2512 + if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 2527 2513 alloc_flags |= ALLOC_CMA; 2528 2514 #endif 2529 2515 return alloc_flags; ··· 2547 2533 unsigned long did_some_progress; 2548 2534 enum migrate_mode migration_mode = MIGRATE_ASYNC; 2549 2535 bool deferred_compaction = false; 2550 - bool contended_compaction = false; 2536 + int contended_compaction = COMPACT_CONTENDED_NONE; 2551 2537 2552 2538 /* 2553 2539 * In the slowpath, we sanity check order to avoid ever trying to ··· 2574 2560 2575 2561 restart: 2576 2562 if (!(gfp_mask & __GFP_NO_KSWAPD)) 2577 - 
wake_all_kswapds(order, zonelist, high_zoneidx, preferred_zone); 2563 + wake_all_kswapds(order, zonelist, high_zoneidx, 2564 + preferred_zone, nodemask); 2578 2565 2579 2566 /* 2580 2567 * OK, we're below the kswapd watermark and have kicked background ··· 2648 2633 preferred_zone, 2649 2634 classzone_idx, migratetype, 2650 2635 migration_mode, &contended_compaction, 2651 - &deferred_compaction, 2652 - &did_some_progress); 2636 + &deferred_compaction); 2653 2637 if (page) 2654 2638 goto got_pg; 2655 2639 2656 - /* 2657 - * If compaction is deferred for high-order allocations, it is because 2658 - * sync compaction recently failed. In this is the case and the caller 2659 - * requested a movable allocation that does not heavily disrupt the 2660 - * system then fail the allocation instead of entering direct reclaim. 2661 - */ 2662 - if ((deferred_compaction || contended_compaction) && 2663 - (gfp_mask & __GFP_NO_KSWAPD)) 2664 - goto nopage; 2640 + /* Checks for THP-specific high-order allocations */ 2641 + if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) { 2642 + /* 2643 + * If compaction is deferred for high-order allocations, it is 2644 + * because sync compaction recently failed. If this is the case 2645 + * and the caller requested a THP allocation, we do not want 2646 + * to heavily disrupt the system, so we fail the allocation 2647 + * instead of entering direct reclaim. 2648 + */ 2649 + if (deferred_compaction) 2650 + goto nopage; 2651 + 2652 + /* 2653 + * In all zones where compaction was attempted (and not 2654 + * deferred or skipped), lock contention has been detected. 2655 + * For THP allocation we do not want to disrupt the others 2656 + * so we fallback to base pages instead. 2657 + */ 2658 + if (contended_compaction == COMPACT_CONTENDED_LOCK) 2659 + goto nopage; 2660 + 2661 + /* 2662 + * If compaction was aborted due to need_resched(), we do not 2663 + * want to further increase allocation latency, unless it is 2664 + * khugepaged trying to collapse. 
2665 + */ 2666 + if (contended_compaction == COMPACT_CONTENDED_SCHED 2667 + && !(current->flags & PF_KTHREAD)) 2668 + goto nopage; 2669 + } 2665 2670 2666 2671 /* 2667 2672 * It can become very expensive to allocate transparent hugepages at ··· 2761 2726 preferred_zone, 2762 2727 classzone_idx, migratetype, 2763 2728 migration_mode, &contended_compaction, 2764 - &deferred_compaction, 2765 - &did_some_progress); 2729 + &deferred_compaction); 2766 2730 if (page) 2767 2731 goto got_pg; 2768 2732 } ··· 2787 2753 struct zone *preferred_zone; 2788 2754 struct zoneref *preferred_zoneref; 2789 2755 struct page *page = NULL; 2790 - int migratetype = allocflags_to_migratetype(gfp_mask); 2756 + int migratetype = gfpflags_to_migratetype(gfp_mask); 2791 2757 unsigned int cpuset_mems_cookie; 2792 2758 int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR; 2793 2759 int classzone_idx; ··· 2809 2775 if (unlikely(!zonelist->_zonerefs->zone)) 2810 2776 return NULL; 2811 2777 2778 + if (IS_ENABLED(CONFIG_CMA) && migratetype == MIGRATE_MOVABLE) 2779 + alloc_flags |= ALLOC_CMA; 2780 + 2812 2781 retry_cpuset: 2813 2782 cpuset_mems_cookie = read_mems_allowed_begin(); 2814 2783 ··· 2823 2786 goto out; 2824 2787 classzone_idx = zonelist_zone_idx(preferred_zoneref); 2825 2788 2826 - #ifdef CONFIG_CMA 2827 - if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 2828 - alloc_flags |= ALLOC_CMA; 2829 - #endif 2830 2789 /* First allocation attempt */ 2831 2790 page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, 2832 2791 zonelist, high_zoneidx, alloc_flags, ··· 3612 3579 zonelist->_zonerefs[pos].zone_idx = 0; 3613 3580 } 3614 3581 3582 + #if defined(CONFIG_64BIT) 3583 + /* 3584 + * Devices that require DMA32/DMA are relatively rare and do not justify a 3585 + * penalty to every machine in case the specialised case applies. 
Default 3586 + * to Node-ordering on 64-bit NUMA machines 3587 + */ 3615 3588 static int default_zonelist_order(void) 3616 3589 { 3617 - int nid, zone_type; 3618 - unsigned long low_kmem_size, total_size; 3619 - struct zone *z; 3620 - int average_size; 3621 - /* 3622 - * ZONE_DMA and ZONE_DMA32 can be very small area in the system. 3623 - * If they are really small and used heavily, the system can fall 3624 - * into OOM very easily. 3625 - * This function detect ZONE_DMA/DMA32 size and configures zone order. 3626 - */ 3627 - /* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */ 3628 - low_kmem_size = 0; 3629 - total_size = 0; 3630 - for_each_online_node(nid) { 3631 - for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { 3632 - z = &NODE_DATA(nid)->node_zones[zone_type]; 3633 - if (populated_zone(z)) { 3634 - if (zone_type < ZONE_NORMAL) 3635 - low_kmem_size += z->managed_pages; 3636 - total_size += z->managed_pages; 3637 - } else if (zone_type == ZONE_NORMAL) { 3638 - /* 3639 - * If any node has only lowmem, then node order 3640 - * is preferred to allow kernel allocations 3641 - * locally; otherwise, they can easily infringe 3642 - * on other nodes when there is an abundance of 3643 - * lowmem available to allocate from. 3644 - */ 3645 - return ZONELIST_ORDER_NODE; 3646 - } 3647 - } 3648 - } 3649 - if (!low_kmem_size || /* there are no DMA area. */ 3650 - low_kmem_size > total_size/2) /* DMA/DMA32 is big. */ 3651 - return ZONELIST_ORDER_NODE; 3652 - /* 3653 - * look into each node's config. 3654 - * If there is a node whose DMA/DMA32 memory is very big area on 3655 - * local memory, NODE_ORDER may be suitable. 
3656 - */ 3657 - average_size = total_size / 3658 - (nodes_weight(node_states[N_MEMORY]) + 1); 3659 - for_each_online_node(nid) { 3660 - low_kmem_size = 0; 3661 - total_size = 0; 3662 - for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) { 3663 - z = &NODE_DATA(nid)->node_zones[zone_type]; 3664 - if (populated_zone(z)) { 3665 - if (zone_type < ZONE_NORMAL) 3666 - low_kmem_size += z->present_pages; 3667 - total_size += z->present_pages; 3668 - } 3669 - } 3670 - if (low_kmem_size && 3671 - total_size > average_size && /* ignore small node */ 3672 - low_kmem_size > total_size * 70/100) 3673 - return ZONELIST_ORDER_NODE; 3674 - } 3590 + return ZONELIST_ORDER_NODE; 3591 + } 3592 + #else 3593 + /* 3594 + * On 32-bit, the Normal zone needs to be preserved for allocations accessible 3595 + * by the kernel. If processes running on node 0 deplete the low memory zone 3596 + * then reclaim will occur more frequency increasing stalls and potentially 3597 + * be easier to OOM if a large percentage of the zone is under writeback or 3598 + * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set. 3599 + * Hence, default to zone ordering on 32-bit. 
3600 + */ 3601 + static int default_zonelist_order(void) 3602 + { 3675 3603 return ZONELIST_ORDER_ZONE; 3676 3604 } 3605 + #endif /* CONFIG_64BIT */ 3677 3606 3678 3607 static void set_zonelist_order(void) 3679 3608 { ··· 6272 6277 6273 6278 if (list_empty(&cc->migratepages)) { 6274 6279 cc->nr_migratepages = 0; 6275 - pfn = isolate_migratepages_range(cc->zone, cc, 6276 - pfn, end, true); 6280 + pfn = isolate_migratepages_range(cc, pfn, end); 6277 6281 if (!pfn) { 6278 6282 ret = -EINTR; 6279 6283 break; ··· 6548 6554 return order < MAX_ORDER; 6549 6555 } 6550 6556 #endif 6551 - 6552 - static const struct trace_print_flags pageflag_names[] = { 6553 - {1UL << PG_locked, "locked" }, 6554 - {1UL << PG_error, "error" }, 6555 - {1UL << PG_referenced, "referenced" }, 6556 - {1UL << PG_uptodate, "uptodate" }, 6557 - {1UL << PG_dirty, "dirty" }, 6558 - {1UL << PG_lru, "lru" }, 6559 - {1UL << PG_active, "active" }, 6560 - {1UL << PG_slab, "slab" }, 6561 - {1UL << PG_owner_priv_1, "owner_priv_1" }, 6562 - {1UL << PG_arch_1, "arch_1" }, 6563 - {1UL << PG_reserved, "reserved" }, 6564 - {1UL << PG_private, "private" }, 6565 - {1UL << PG_private_2, "private_2" }, 6566 - {1UL << PG_writeback, "writeback" }, 6567 - #ifdef CONFIG_PAGEFLAGS_EXTENDED 6568 - {1UL << PG_head, "head" }, 6569 - {1UL << PG_tail, "tail" }, 6570 - #else 6571 - {1UL << PG_compound, "compound" }, 6572 - #endif 6573 - {1UL << PG_swapcache, "swapcache" }, 6574 - {1UL << PG_mappedtodisk, "mappedtodisk" }, 6575 - {1UL << PG_reclaim, "reclaim" }, 6576 - {1UL << PG_swapbacked, "swapbacked" }, 6577 - {1UL << PG_unevictable, "unevictable" }, 6578 - #ifdef CONFIG_MMU 6579 - {1UL << PG_mlocked, "mlocked" }, 6580 - #endif 6581 - #ifdef CONFIG_ARCH_USES_PG_UNCACHED 6582 - {1UL << PG_uncached, "uncached" }, 6583 - #endif 6584 - #ifdef CONFIG_MEMORY_FAILURE 6585 - {1UL << PG_hwpoison, "hwpoison" }, 6586 - #endif 6587 - #ifdef CONFIG_TRANSPARENT_HUGEPAGE 6588 - {1UL << PG_compound_lock, "compound_lock" }, 6589 - #endif 6590 
- }; 6591 - 6592 - static void dump_page_flags(unsigned long flags) 6593 - { 6594 - const char *delim = ""; 6595 - unsigned long mask; 6596 - int i; 6597 - 6598 - BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS); 6599 - 6600 - printk(KERN_ALERT "page flags: %#lx(", flags); 6601 - 6602 - /* remove zone id */ 6603 - flags &= (1UL << NR_PAGEFLAGS) - 1; 6604 - 6605 - for (i = 0; i < ARRAY_SIZE(pageflag_names) && flags; i++) { 6606 - 6607 - mask = pageflag_names[i].mask; 6608 - if ((flags & mask) != mask) 6609 - continue; 6610 - 6611 - flags &= ~mask; 6612 - printk("%s%s", delim, pageflag_names[i].name); 6613 - delim = "|"; 6614 - } 6615 - 6616 - /* check for left over flags */ 6617 - if (flags) 6618 - printk("%s%#lx", delim, flags); 6619 - 6620 - printk(")\n"); 6621 - } 6622 - 6623 - void dump_page_badflags(struct page *page, const char *reason, 6624 - unsigned long badflags) 6625 - { 6626 - printk(KERN_ALERT 6627 - "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n", 6628 - page, atomic_read(&page->_count), page_mapcount(page), 6629 - page->mapping, page->index); 6630 - dump_page_flags(page->flags); 6631 - if (reason) 6632 - pr_alert("page dumped because: %s\n", reason); 6633 - if (page->flags & badflags) { 6634 - pr_alert("bad because of flags:\n"); 6635 - dump_page_flags(page->flags & badflags); 6636 - } 6637 - mem_cgroup_print_bad_page(page); 6638 - } 6639 - 6640 - void dump_page(struct page *page, const char *reason) 6641 - { 6642 - dump_page_badflags(page, reason, 0); 6643 - } 6644 - EXPORT_SYMBOL(dump_page);
+1 -1
mm/pagewalk.c
··· 177 177 if (!walk->mm) 178 178 return -EINVAL; 179 179 180 - VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem)); 180 + VM_BUG_ON_MM(!rwsem_is_locked(&walk->mm->mmap_sem), walk->mm); 181 181 182 182 pgd = pgd_offset(walk->mm, addr); 183 183 do {
+4 -4
mm/rmap.c
··· 527 527 unsigned long address = __vma_address(page, vma); 528 528 529 529 /* page should be within @vma mapping range */ 530 - VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); 530 + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); 531 531 532 532 return address; 533 533 } ··· 897 897 struct anon_vma *anon_vma = vma->anon_vma; 898 898 899 899 VM_BUG_ON_PAGE(!PageLocked(page), page); 900 - VM_BUG_ON(!anon_vma); 900 + VM_BUG_ON_VMA(!anon_vma, vma); 901 901 VM_BUG_ON_PAGE(page->index != linear_page_index(vma, address), page); 902 902 903 903 anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; ··· 1024 1024 void page_add_new_anon_rmap(struct page *page, 1025 1025 struct vm_area_struct *vma, unsigned long address) 1026 1026 { 1027 - VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); 1027 + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); 1028 1028 SetPageSwapBacked(page); 1029 1029 atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */ 1030 1030 if (PageTransHuge(page)) ··· 1670 1670 * structure at mapping cannot be freed and reused yet, 1671 1671 * so we can safely take mapping->i_mmap_mutex. 1672 1672 */ 1673 - VM_BUG_ON(!PageLocked(page)); 1673 + VM_BUG_ON_PAGE(!PageLocked(page), page); 1674 1674 1675 1675 if (!mapping) 1676 1676 return ret;
+2
mm/shmem.c
··· 3077 3077 .write_begin = shmem_write_begin, 3078 3078 .write_end = shmem_write_end, 3079 3079 #endif 3080 + #ifdef CONFIG_MIGRATION 3080 3081 .migratepage = migrate_page, 3082 + #endif 3081 3083 .error_remove_page = generic_error_remove_page, 3082 3084 }; 3083 3085
+134 -215
mm/slab.c
··· 237 237 /* 238 238 * Need this for bootstrapping a per node allocator. 239 239 */ 240 - #define NUM_INIT_LISTS (3 * MAX_NUMNODES) 240 + #define NUM_INIT_LISTS (2 * MAX_NUMNODES) 241 241 static struct kmem_cache_node __initdata init_kmem_cache_node[NUM_INIT_LISTS]; 242 242 #define CACHE_CACHE 0 243 - #define SIZE_AC MAX_NUMNODES 244 - #define SIZE_NODE (2 * MAX_NUMNODES) 243 + #define SIZE_NODE (MAX_NUMNODES) 245 244 246 245 static int drain_freelist(struct kmem_cache *cache, 247 246 struct kmem_cache_node *n, int tofree); ··· 252 253 253 254 static int slab_early_init = 1; 254 255 255 - #define INDEX_AC kmalloc_index(sizeof(struct arraycache_init)) 256 256 #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node)) 257 257 258 258 static void kmem_cache_node_init(struct kmem_cache_node *parent) ··· 456 458 return reciprocal_divide(offset, cache->reciprocal_buffer_size); 457 459 } 458 460 459 - static struct arraycache_init initarray_generic = 460 - { {0, BOOT_CPUCACHE_ENTRIES, 1, 0} }; 461 - 462 461 /* internal cache of cache description objs */ 463 462 static struct kmem_cache kmem_cache_boot = { 464 463 .batchcount = 1, ··· 471 476 472 477 static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) 473 478 { 474 - return cachep->array[smp_processor_id()]; 479 + return this_cpu_ptr(cachep->cpu_cache); 475 480 } 476 481 477 482 static size_t calculate_freelist_size(int nr_objs, size_t align) ··· 780 785 return objp; 781 786 } 782 787 783 - static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac, 784 - void *objp) 788 + static noinline void *__ac_put_obj(struct kmem_cache *cachep, 789 + struct array_cache *ac, void *objp) 785 790 { 786 791 if (unlikely(pfmemalloc_active)) { 787 792 /* Some pfmemalloc slabs exist, check if this is one */ ··· 979 984 } 980 985 } 981 986 982 - static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) 987 + static int __cache_free_alien(struct kmem_cache *cachep, void *objp, 
988 + int node, int page_node) 983 989 { 984 - int nodeid = page_to_nid(virt_to_page(objp)); 985 990 struct kmem_cache_node *n; 986 991 struct alien_cache *alien = NULL; 987 992 struct array_cache *ac; 988 - int node; 989 993 LIST_HEAD(list); 990 - 991 - node = numa_mem_id(); 992 - 993 - /* 994 - * Make sure we are not freeing a object from another node to the array 995 - * cache on this cpu. 996 - */ 997 - if (likely(nodeid == node)) 998 - return 0; 999 994 1000 995 n = get_node(cachep, node); 1001 996 STATS_INC_NODEFREES(cachep); 1002 - if (n->alien && n->alien[nodeid]) { 1003 - alien = n->alien[nodeid]; 997 + if (n->alien && n->alien[page_node]) { 998 + alien = n->alien[page_node]; 1004 999 ac = &alien->ac; 1005 1000 spin_lock(&alien->lock); 1006 1001 if (unlikely(ac->avail == ac->limit)) { 1007 1002 STATS_INC_ACOVERFLOW(cachep); 1008 - __drain_alien_cache(cachep, ac, nodeid, &list); 1003 + __drain_alien_cache(cachep, ac, page_node, &list); 1009 1004 } 1010 1005 ac_put_obj(cachep, ac, objp); 1011 1006 spin_unlock(&alien->lock); 1012 1007 slabs_destroy(cachep, &list); 1013 1008 } else { 1014 - n = get_node(cachep, nodeid); 1009 + n = get_node(cachep, page_node); 1015 1010 spin_lock(&n->list_lock); 1016 - free_block(cachep, &objp, 1, nodeid, &list); 1011 + free_block(cachep, &objp, 1, page_node, &list); 1017 1012 spin_unlock(&n->list_lock); 1018 1013 slabs_destroy(cachep, &list); 1019 1014 } 1020 1015 return 1; 1016 + } 1017 + 1018 + static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) 1019 + { 1020 + int page_node = page_to_nid(virt_to_page(objp)); 1021 + int node = numa_mem_id(); 1022 + /* 1023 + * Make sure we are not freeing a object from another node to the array 1024 + * cache on this cpu. 
1025 + */ 1026 + if (likely(node == page_node)) 1027 + return 0; 1028 + 1029 + return __cache_free_alien(cachep, objp, node, page_node); 1021 1030 } 1022 1031 #endif 1023 1032 ··· 1091 1092 struct alien_cache **alien; 1092 1093 LIST_HEAD(list); 1093 1094 1094 - /* cpu is dead; no one can alloc from it. */ 1095 - nc = cachep->array[cpu]; 1096 - cachep->array[cpu] = NULL; 1097 1095 n = get_node(cachep, node); 1098 - 1099 1096 if (!n) 1100 - goto free_array_cache; 1097 + continue; 1101 1098 1102 1099 spin_lock_irq(&n->list_lock); 1103 1100 1104 1101 /* Free limit for this kmem_cache_node */ 1105 1102 n->free_limit -= cachep->batchcount; 1106 - if (nc) 1103 + 1104 + /* cpu is dead; no one can alloc from it. */ 1105 + nc = per_cpu_ptr(cachep->cpu_cache, cpu); 1106 + if (nc) { 1107 1107 free_block(cachep, nc->entry, nc->avail, node, &list); 1108 + nc->avail = 0; 1109 + } 1108 1110 1109 1111 if (!cpumask_empty(mask)) { 1110 1112 spin_unlock_irq(&n->list_lock); 1111 - goto free_array_cache; 1113 + goto free_slab; 1112 1114 } 1113 1115 1114 1116 shared = n->shared; ··· 1129 1129 drain_alien_cache(cachep, alien); 1130 1130 free_alien_cache(alien); 1131 1131 } 1132 - free_array_cache: 1132 + 1133 + free_slab: 1133 1134 slabs_destroy(cachep, &list); 1134 - kfree(nc); 1135 1135 } 1136 1136 /* 1137 1137 * In the previous loop, all the objects were freed to ··· 1168 1168 * array caches 1169 1169 */ 1170 1170 list_for_each_entry(cachep, &slab_caches, list) { 1171 - struct array_cache *nc; 1172 1171 struct array_cache *shared = NULL; 1173 1172 struct alien_cache **alien = NULL; 1174 1173 1175 - nc = alloc_arraycache(node, cachep->limit, 1176 - cachep->batchcount, GFP_KERNEL); 1177 - if (!nc) 1178 - goto bad; 1179 1174 if (cachep->shared) { 1180 1175 shared = alloc_arraycache(node, 1181 1176 cachep->shared * cachep->batchcount, 1182 1177 0xbaadf00d, GFP_KERNEL); 1183 - if (!shared) { 1184 - kfree(nc); 1178 + if (!shared) 1185 1179 goto bad; 1186 - } 1187 1180 } 1188 1181 if 
(use_alien_caches) { 1189 1182 alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL); 1190 1183 if (!alien) { 1191 1184 kfree(shared); 1192 - kfree(nc); 1193 1185 goto bad; 1194 1186 } 1195 1187 } 1196 - cachep->array[cpu] = nc; 1197 1188 n = get_node(cachep, node); 1198 1189 BUG_ON(!n); 1199 1190 ··· 1376 1385 } 1377 1386 1378 1387 /* 1379 - * The memory after the last cpu cache pointer is used for the 1380 - * the node pointer. 1381 - */ 1382 - static void setup_node_pointer(struct kmem_cache *cachep) 1383 - { 1384 - cachep->node = (struct kmem_cache_node **)&cachep->array[nr_cpu_ids]; 1385 - } 1386 - 1387 - /* 1388 1388 * Initialisation. Called after the page allocator have been initialised and 1389 1389 * before smp_init(). 1390 1390 */ ··· 1386 1404 BUILD_BUG_ON(sizeof(((struct page *)NULL)->lru) < 1387 1405 sizeof(struct rcu_head)); 1388 1406 kmem_cache = &kmem_cache_boot; 1389 - setup_node_pointer(kmem_cache); 1390 1407 1391 1408 if (num_possible_nodes() == 1) 1392 1409 use_alien_caches = 0; 1393 1410 1394 1411 for (i = 0; i < NUM_INIT_LISTS; i++) 1395 1412 kmem_cache_node_init(&init_kmem_cache_node[i]); 1396 - 1397 - set_up_node(kmem_cache, CACHE_CACHE); 1398 1413 1399 1414 /* 1400 1415 * Fragmentation resistance on low memory - only use bigger ··· 1427 1448 * struct kmem_cache size depends on nr_node_ids & nr_cpu_ids 1428 1449 */ 1429 1450 create_boot_cache(kmem_cache, "kmem_cache", 1430 - offsetof(struct kmem_cache, array[nr_cpu_ids]) + 1451 + offsetof(struct kmem_cache, node) + 1431 1452 nr_node_ids * sizeof(struct kmem_cache_node *), 1432 1453 SLAB_HWCACHE_ALIGN); 1433 1454 list_add(&kmem_cache->list, &slab_caches); 1434 - 1435 - /* 2+3) create the kmalloc caches */ 1455 + slab_state = PARTIAL; 1436 1456 1437 1457 /* 1438 - * Initialize the caches that provide memory for the array cache and the 1439 - * kmem_cache_node structures first. Without this, further allocations will 1440 - * bug. 
1458 + * Initialize the caches that provide memory for the kmem_cache_node 1459 + * structures first. Without this, further allocations will bug. 1441 1460 */ 1442 - 1443 - kmalloc_caches[INDEX_AC] = create_kmalloc_cache("kmalloc-ac", 1444 - kmalloc_size(INDEX_AC), ARCH_KMALLOC_FLAGS); 1445 - 1446 - if (INDEX_AC != INDEX_NODE) 1447 - kmalloc_caches[INDEX_NODE] = 1448 - create_kmalloc_cache("kmalloc-node", 1461 + kmalloc_caches[INDEX_NODE] = create_kmalloc_cache("kmalloc-node", 1449 1462 kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS); 1463 + slab_state = PARTIAL_NODE; 1450 1464 1451 1465 slab_early_init = 0; 1452 1466 1453 - /* 4) Replace the bootstrap head arrays */ 1454 - { 1455 - struct array_cache *ptr; 1456 - 1457 - ptr = kmalloc(sizeof(struct arraycache_init), GFP_NOWAIT); 1458 - 1459 - memcpy(ptr, cpu_cache_get(kmem_cache), 1460 - sizeof(struct arraycache_init)); 1461 - 1462 - kmem_cache->array[smp_processor_id()] = ptr; 1463 - 1464 - ptr = kmalloc(sizeof(struct arraycache_init), GFP_NOWAIT); 1465 - 1466 - BUG_ON(cpu_cache_get(kmalloc_caches[INDEX_AC]) 1467 - != &initarray_generic.cache); 1468 - memcpy(ptr, cpu_cache_get(kmalloc_caches[INDEX_AC]), 1469 - sizeof(struct arraycache_init)); 1470 - 1471 - kmalloc_caches[INDEX_AC]->array[smp_processor_id()] = ptr; 1472 - } 1473 1467 /* 5) Replace the bootstrap kmem_cache_node */ 1474 1468 { 1475 1469 int nid; ··· 1450 1498 for_each_online_node(nid) { 1451 1499 init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid); 1452 1500 1453 - init_list(kmalloc_caches[INDEX_AC], 1454 - &init_kmem_cache_node[SIZE_AC + nid], nid); 1455 - 1456 - if (INDEX_AC != INDEX_NODE) { 1457 - init_list(kmalloc_caches[INDEX_NODE], 1501 + init_list(kmalloc_caches[INDEX_NODE], 1458 1502 &init_kmem_cache_node[SIZE_NODE + nid], nid); 1459 - } 1460 1503 } 1461 1504 } 1462 1505 ··· 1984 2037 return left_over; 1985 2038 } 1986 2039 2040 + static struct array_cache __percpu *alloc_kmem_cache_cpus( 2041 + struct kmem_cache *cachep, int 
entries, int batchcount) 2042 + { 2043 + int cpu; 2044 + size_t size; 2045 + struct array_cache __percpu *cpu_cache; 2046 + 2047 + size = sizeof(void *) * entries + sizeof(struct array_cache); 2048 + cpu_cache = __alloc_percpu(size, 0); 2049 + 2050 + if (!cpu_cache) 2051 + return NULL; 2052 + 2053 + for_each_possible_cpu(cpu) { 2054 + init_arraycache(per_cpu_ptr(cpu_cache, cpu), 2055 + entries, batchcount); 2056 + } 2057 + 2058 + return cpu_cache; 2059 + } 2060 + 1987 2061 static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp) 1988 2062 { 1989 2063 if (slab_state >= FULL) 1990 2064 return enable_cpucache(cachep, gfp); 1991 2065 2066 + cachep->cpu_cache = alloc_kmem_cache_cpus(cachep, 1, 1); 2067 + if (!cachep->cpu_cache) 2068 + return 1; 2069 + 1992 2070 if (slab_state == DOWN) { 1993 - /* 1994 - * Note: Creation of first cache (kmem_cache). 1995 - * The setup_node is taken care 1996 - * of by the caller of __kmem_cache_create 1997 - */ 1998 - cachep->array[smp_processor_id()] = &initarray_generic.cache; 1999 - slab_state = PARTIAL; 2071 + /* Creation of first cache (kmem_cache). */ 2072 + set_up_node(kmem_cache, CACHE_CACHE); 2000 2073 } else if (slab_state == PARTIAL) { 2001 - /* 2002 - * Note: the second kmem_cache_create must create the cache 2003 - * that's used by kmalloc(24), otherwise the creation of 2004 - * further caches will BUG(). 2005 - */ 2006 - cachep->array[smp_processor_id()] = &initarray_generic.cache; 2007 - 2008 - /* 2009 - * If the cache that's used by kmalloc(sizeof(kmem_cache_node)) is 2010 - * the second cache, then we need to set up all its node/, 2011 - * otherwise the creation of further caches will BUG(). 
2012 - */ 2013 - set_up_node(cachep, SIZE_AC); 2014 - if (INDEX_AC == INDEX_NODE) 2015 - slab_state = PARTIAL_NODE; 2016 - else 2017 - slab_state = PARTIAL_ARRAYCACHE; 2074 + /* For kmem_cache_node */ 2075 + set_up_node(cachep, SIZE_NODE); 2018 2076 } else { 2019 - /* Remaining boot caches */ 2020 - cachep->array[smp_processor_id()] = 2021 - kmalloc(sizeof(struct arraycache_init), gfp); 2077 + int node; 2022 2078 2023 - if (slab_state == PARTIAL_ARRAYCACHE) { 2024 - set_up_node(cachep, SIZE_NODE); 2025 - slab_state = PARTIAL_NODE; 2026 - } else { 2027 - int node; 2028 - for_each_online_node(node) { 2029 - cachep->node[node] = 2030 - kmalloc_node(sizeof(struct kmem_cache_node), 2031 - gfp, node); 2032 - BUG_ON(!cachep->node[node]); 2033 - kmem_cache_node_init(cachep->node[node]); 2034 - } 2079 + for_each_online_node(node) { 2080 + cachep->node[node] = kmalloc_node( 2081 + sizeof(struct kmem_cache_node), gfp, node); 2082 + BUG_ON(!cachep->node[node]); 2083 + kmem_cache_node_init(cachep->node[node]); 2035 2084 } 2036 2085 } 2086 + 2037 2087 cachep->node[numa_mem_id()]->next_reap = 2038 2088 jiffies + REAPTIMEOUT_NODE + 2039 2089 ((unsigned long)cachep) % REAPTIMEOUT_NODE; ··· 2042 2098 cachep->batchcount = 1; 2043 2099 cachep->limit = BOOT_CPUCACHE_ENTRIES; 2044 2100 return 0; 2101 + } 2102 + 2103 + unsigned long kmem_cache_flags(unsigned long object_size, 2104 + unsigned long flags, const char *name, 2105 + void (*ctor)(void *)) 2106 + { 2107 + return flags; 2108 + } 2109 + 2110 + struct kmem_cache * 2111 + __kmem_cache_alias(const char *name, size_t size, size_t align, 2112 + unsigned long flags, void (*ctor)(void *)) 2113 + { 2114 + struct kmem_cache *cachep; 2115 + 2116 + cachep = find_mergeable(size, align, flags, name, ctor); 2117 + if (cachep) { 2118 + cachep->refcount++; 2119 + 2120 + /* 2121 + * Adjust the object sizes so that we clear 2122 + * the complete object on kzalloc. 
2123 + */ 2124 + cachep->object_size = max_t(int, cachep->object_size, size); 2125 + } 2126 + return cachep; 2045 2127 } 2046 2128 2047 2129 /** ··· 2153 2183 else 2154 2184 gfp = GFP_NOWAIT; 2155 2185 2156 - setup_node_pointer(cachep); 2157 2186 #if DEBUG 2158 2187 2159 2188 /* ··· 2409 2440 if (rc) 2410 2441 return rc; 2411 2442 2412 - for_each_online_cpu(i) 2413 - kfree(cachep->array[i]); 2443 + free_percpu(cachep->cpu_cache); 2414 2444 2415 2445 /* NUMA: free the node structures */ 2416 2446 for_each_kmem_cache_node(cachep, i, n) { ··· 3367 3399 if (nr_online_nodes > 1 && cache_free_alien(cachep, objp)) 3368 3400 return; 3369 3401 3370 - if (likely(ac->avail < ac->limit)) { 3402 + if (ac->avail < ac->limit) { 3371 3403 STATS_INC_FREEHIT(cachep); 3372 3404 } else { 3373 3405 STATS_INC_FREEMISS(cachep); ··· 3464 3496 return kmem_cache_alloc_node_trace(cachep, flags, node, size); 3465 3497 } 3466 3498 3467 - #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_TRACING) 3468 3499 void *__kmalloc_node(size_t size, gfp_t flags, int node) 3469 3500 { 3470 3501 return __do_kmalloc_node(size, flags, node, _RET_IP_); ··· 3476 3509 return __do_kmalloc_node(size, flags, node, caller); 3477 3510 } 3478 3511 EXPORT_SYMBOL(__kmalloc_node_track_caller); 3479 - #else 3480 - void *__kmalloc_node(size_t size, gfp_t flags, int node) 3481 - { 3482 - return __do_kmalloc_node(size, flags, node, 0); 3483 - } 3484 - EXPORT_SYMBOL(__kmalloc_node); 3485 - #endif /* CONFIG_DEBUG_SLAB || CONFIG_TRACING */ 3486 3512 #endif /* CONFIG_NUMA */ 3487 3513 3488 3514 /** ··· 3501 3541 return ret; 3502 3542 } 3503 3543 3504 - 3505 - #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_TRACING) 3506 3544 void *__kmalloc(size_t size, gfp_t flags) 3507 3545 { 3508 3546 return __do_kmalloc(size, flags, _RET_IP_); ··· 3512 3554 return __do_kmalloc(size, flags, caller); 3513 3555 } 3514 3556 EXPORT_SYMBOL(__kmalloc_track_caller); 3515 - 3516 - #else 3517 - void *__kmalloc(size_t size, gfp_t flags) 3518 - { 3519 
- return __do_kmalloc(size, flags, 0); 3520 - } 3521 - EXPORT_SYMBOL(__kmalloc); 3522 - #endif 3523 3557 3524 3558 /** 3525 3559 * kmem_cache_free - Deallocate an object ··· 3657 3707 return -ENOMEM; 3658 3708 } 3659 3709 3660 - struct ccupdate_struct { 3661 - struct kmem_cache *cachep; 3662 - struct array_cache *new[0]; 3663 - }; 3664 - 3665 - static void do_ccupdate_local(void *info) 3666 - { 3667 - struct ccupdate_struct *new = info; 3668 - struct array_cache *old; 3669 - 3670 - check_irq_off(); 3671 - old = cpu_cache_get(new->cachep); 3672 - 3673 - new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()]; 3674 - new->new[smp_processor_id()] = old; 3675 - } 3676 - 3677 3710 /* Always called with the slab_mutex held */ 3678 3711 static int __do_tune_cpucache(struct kmem_cache *cachep, int limit, 3679 3712 int batchcount, int shared, gfp_t gfp) 3680 3713 { 3681 - struct ccupdate_struct *new; 3682 - int i; 3714 + struct array_cache __percpu *cpu_cache, *prev; 3715 + int cpu; 3683 3716 3684 - new = kzalloc(sizeof(*new) + nr_cpu_ids * sizeof(struct array_cache *), 3685 - gfp); 3686 - if (!new) 3717 + cpu_cache = alloc_kmem_cache_cpus(cachep, limit, batchcount); 3718 + if (!cpu_cache) 3687 3719 return -ENOMEM; 3688 3720 3689 - for_each_online_cpu(i) { 3690 - new->new[i] = alloc_arraycache(cpu_to_mem(i), limit, 3691 - batchcount, gfp); 3692 - if (!new->new[i]) { 3693 - for (i--; i >= 0; i--) 3694 - kfree(new->new[i]); 3695 - kfree(new); 3696 - return -ENOMEM; 3697 - } 3698 - } 3699 - new->cachep = cachep; 3700 - 3701 - on_each_cpu(do_ccupdate_local, (void *)new, 1); 3721 + prev = cachep->cpu_cache; 3722 + cachep->cpu_cache = cpu_cache; 3723 + kick_all_cpus_sync(); 3702 3724 3703 3725 check_irq_on(); 3704 3726 cachep->batchcount = batchcount; 3705 3727 cachep->limit = limit; 3706 3728 cachep->shared = shared; 3707 3729 3708 - for_each_online_cpu(i) { 3730 + if (!prev) 3731 + goto alloc_node; 3732 + 3733 + for_each_online_cpu(cpu) { 3709 3734 
LIST_HEAD(list); 3710 - struct array_cache *ccold = new->new[i]; 3711 3735 int node; 3712 3736 struct kmem_cache_node *n; 3737 + struct array_cache *ac = per_cpu_ptr(prev, cpu); 3713 3738 3714 - if (!ccold) 3715 - continue; 3716 - 3717 - node = cpu_to_mem(i); 3739 + node = cpu_to_mem(cpu); 3718 3740 n = get_node(cachep, node); 3719 3741 spin_lock_irq(&n->list_lock); 3720 - free_block(cachep, ccold->entry, ccold->avail, node, &list); 3742 + free_block(cachep, ac->entry, ac->avail, node, &list); 3721 3743 spin_unlock_irq(&n->list_lock); 3722 3744 slabs_destroy(cachep, &list); 3723 - kfree(ccold); 3724 3745 } 3725 - kfree(new); 3746 + free_percpu(prev); 3747 + 3748 + alloc_node: 3726 3749 return alloc_kmem_cache_node(cachep, gfp); 3727 3750 } 3728 3751 ··· 4178 4255 4179 4256 static int slabstats_open(struct inode *inode, struct file *file) 4180 4257 { 4181 - unsigned long *n = kzalloc(PAGE_SIZE, GFP_KERNEL); 4182 - int ret = -ENOMEM; 4183 - if (n) { 4184 - ret = seq_open(file, &slabstats_op); 4185 - if (!ret) { 4186 - struct seq_file *m = file->private_data; 4187 - *n = PAGE_SIZE / (2 * sizeof(unsigned long)); 4188 - m->private = n; 4189 - n = NULL; 4190 - } 4191 - kfree(n); 4192 - } 4193 - return ret; 4258 + unsigned long *n; 4259 + 4260 + n = __seq_open_private(file, &slabstats_op, PAGE_SIZE); 4261 + if (!n) 4262 + return -ENOMEM; 4263 + 4264 + *n = PAGE_SIZE / (2 * sizeof(unsigned long)); 4265 + 4266 + return 0; 4194 4267 } 4195 4268 4196 4269 static const struct file_operations proc_slabstats_operations = {
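The core of the mm/slab.c changes above is the move from a per-CPU pointer array (`cachep->array[]`) to one percpu allocation (`cachep->cpu_cache`), sized as a header plus `entries` object pointers and initialized once per possible CPU. A minimal userspace sketch of that sizing and per-CPU init pattern — `NR_CPUS`, `cpu_caches`, and `alloc_cache_cpus()` are illustrative stand-ins for `__alloc_percpu()`/`per_cpu_ptr()`, not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4 /* stand-in for the number of possible CPUs */

/* Simplified model of struct array_cache: a header followed by a
 * flexible array of cached object pointers, as in mm/slab.c. */
struct array_cache {
	unsigned int avail;
	unsigned int limit;
	unsigned int batchcount;
	void *entry[];
};

/* Stand-in for the percpu area: one header+array allocation per CPU. */
static struct array_cache *cpu_caches[NR_CPUS];

static int alloc_cache_cpus(int entries, int batchcount)
{
	/* Same sizing rule as alloc_kmem_cache_cpus() in the patch. */
	size_t size = sizeof(struct array_cache) + sizeof(void *) * entries;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct array_cache *ac = calloc(1, size);

		if (!ac)
			return -1;
		ac->avail = 0;	/* init_arraycache() equivalent */
		ac->limit = entries;
		ac->batchcount = batchcount;
		cpu_caches[cpu] = ac;
	}
	return 0;
}
```

One allocation call replaces the old per-CPU `kmalloc` loop, and teardown becomes a single `free_percpu()` in the kernel (here, freeing each `cpu_caches[cpu]`).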
+53 -4
mm/slab.h
··· 4 4 * Internal slab definitions 5 5 */ 6 6 7 + #ifdef CONFIG_SLOB 8 + /* 9 + * Common fields provided in kmem_cache by all slab allocators 10 + * This struct is either used directly by the allocator (SLOB) 11 + * or the allocator must include definitions for all fields 12 + * provided in kmem_cache_common in their definition of kmem_cache. 13 + * 14 + * Once we can do anonymous structs (C11 standard) we could put a 15 + * anonymous struct definition in these allocators so that the 16 + * separate allocations in the kmem_cache structure of SLAB and 17 + * SLUB is no longer needed. 18 + */ 19 + struct kmem_cache { 20 + unsigned int object_size;/* The original size of the object */ 21 + unsigned int size; /* The aligned/padded/added on size */ 22 + unsigned int align; /* Alignment as calculated */ 23 + unsigned long flags; /* Active flags on the slab */ 24 + const char *name; /* Slab name for sysfs */ 25 + int refcount; /* Use counter */ 26 + void (*ctor)(void *); /* Called on object slot creation */ 27 + struct list_head list; /* List of all slab caches on the system */ 28 + }; 29 + 30 + #endif /* CONFIG_SLOB */ 31 + 32 + #ifdef CONFIG_SLAB 33 + #include <linux/slab_def.h> 34 + #endif 35 + 36 + #ifdef CONFIG_SLUB 37 + #include <linux/slub_def.h> 38 + #endif 39 + 40 + #include <linux/memcontrol.h> 41 + 7 42 /* 8 43 * State of the slab allocator. 
9 44 * ··· 50 15 enum slab_state { 51 16 DOWN, /* No slab functionality yet */ 52 17 PARTIAL, /* SLUB: kmem_cache_node available */ 53 - PARTIAL_ARRAYCACHE, /* SLAB: kmalloc size for arraycache available */ 54 18 PARTIAL_NODE, /* SLAB: kmalloc size for node struct available */ 55 19 UP, /* Slab caches usable but not all extras yet */ 56 20 FULL /* Everything is working */ ··· 87 53 size_t size, unsigned long flags); 88 54 89 55 struct mem_cgroup; 90 - #ifdef CONFIG_SLUB 56 + 57 + int slab_unmergeable(struct kmem_cache *s); 58 + struct kmem_cache *find_mergeable(size_t size, size_t align, 59 + unsigned long flags, const char *name, void (*ctor)(void *)); 60 + #ifndef CONFIG_SLOB 91 61 struct kmem_cache * 92 62 __kmem_cache_alias(const char *name, size_t size, size_t align, 93 63 unsigned long flags, void (*ctor)(void *)); 64 + 65 + unsigned long kmem_cache_flags(unsigned long object_size, 66 + unsigned long flags, const char *name, 67 + void (*ctor)(void *)); 94 68 #else 95 69 static inline struct kmem_cache * 96 70 __kmem_cache_alias(const char *name, size_t size, size_t align, 97 71 unsigned long flags, void (*ctor)(void *)) 98 72 { return NULL; } 73 + 74 + static inline unsigned long kmem_cache_flags(unsigned long object_size, 75 + unsigned long flags, const char *name, 76 + void (*ctor)(void *)) 77 + { 78 + return flags; 79 + } 99 80 #endif 100 81 101 82 ··· 352 303 * a kmem_cache_node structure allocated (which is true for all online nodes) 353 304 */ 354 305 #define for_each_kmem_cache_node(__s, __node, __n) \ 355 - for (__node = 0; __n = get_node(__s, __node), __node < nr_node_ids; __node++) \ 356 - if (__n) 306 + for (__node = 0; __node < nr_node_ids; __node++) \ 307 + if ((__n = get_node(__s, __node))) 357 308 358 309 #endif 359 310
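The `for_each_kmem_cache_node` fix at the end of the mm/slab.h hunk reorders the macro so the index bound is checked *before* the lookup runs, and the loop body executes only for nodes that actually have a `kmem_cache_node`. The same idiom in a userspace toy — `get_entry()`, `table`, and `for_each_entry` are made-up stand-ins:

```c
#include <assert.h>
#include <stddef.h>

#define NR_IDS 4

static int values[2] = { 10, 20 };
static int *table[NR_IDS]; /* sparse: most slots stay NULL */

static int *get_entry(int i)
{
	return table[i]; /* only safe for i < NR_IDS */
}

/* Bounds check first, then fetch; the body following the macro runs
 * only when the fetched pointer is non-NULL, mirroring the fixed
 * for_each_kmem_cache_node(). */
#define for_each_entry(__i, __p) \
	for (__i = 0; __i < NR_IDS; __i++) \
		if ((__p = get_entry(__i)))

static int sum_entries(void)
{
	int i, sum = 0;
	int *p;

	table[1] = &values[0];
	table[3] = &values[1];
	for_each_entry(i, p)
		sum += *p;
	return sum;
}
```

The old form used a comma expression (`__n = get_node(__s, __node), __node < nr_node_ids`), which performed the lookup even on the final out-of-bounds index before the comparison stopped the loop.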
+174 -4
mm/slab_common.c
··· 30 30 DEFINE_MUTEX(slab_mutex); 31 31 struct kmem_cache *kmem_cache; 32 32 33 + /* 34 + * Set of flags that will prevent slab merging 35 + */ 36 + #define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \ 37 + SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \ 38 + SLAB_FAILSLAB) 39 + 40 + #define SLAB_MERGE_SAME (SLAB_DEBUG_FREE | SLAB_RECLAIM_ACCOUNT | \ 41 + SLAB_CACHE_DMA | SLAB_NOTRACK) 42 + 43 + /* 44 + * Merge control. If this is set then no merging of slab caches will occur. 45 + * (Could be removed. This was introduced to pacify the merge skeptics.) 46 + */ 47 + static int slab_nomerge; 48 + 49 + static int __init setup_slab_nomerge(char *str) 50 + { 51 + slab_nomerge = 1; 52 + return 1; 53 + } 54 + 55 + #ifdef CONFIG_SLUB 56 + __setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0); 57 + #endif 58 + 59 + __setup("slab_nomerge", setup_slab_nomerge); 60 + 61 + /* 62 + * Determine the size of a slab object 63 + */ 64 + unsigned int kmem_cache_size(struct kmem_cache *s) 65 + { 66 + return s->object_size; 67 + } 68 + EXPORT_SYMBOL(kmem_cache_size); 69 + 33 70 #ifdef CONFIG_DEBUG_VM 34 71 static int kmem_cache_sanity_check(const char *name, size_t size) 35 72 { ··· 116 79 #endif 117 80 118 81 #ifdef CONFIG_MEMCG_KMEM 82 + static int memcg_alloc_cache_params(struct mem_cgroup *memcg, 83 + struct kmem_cache *s, struct kmem_cache *root_cache) 84 + { 85 + size_t size; 86 + 87 + if (!memcg_kmem_enabled()) 88 + return 0; 89 + 90 + if (!memcg) { 91 + size = offsetof(struct memcg_cache_params, memcg_caches); 92 + size += memcg_limited_groups_array_size * sizeof(void *); 93 + } else 94 + size = sizeof(struct memcg_cache_params); 95 + 96 + s->memcg_params = kzalloc(size, GFP_KERNEL); 97 + if (!s->memcg_params) 98 + return -ENOMEM; 99 + 100 + if (memcg) { 101 + s->memcg_params->memcg = memcg; 102 + s->memcg_params->root_cache = root_cache; 103 + } else 104 + s->memcg_params->is_root_cache = true; 105 + 106 + return 0; 107 + } 108 + 
109 + static void memcg_free_cache_params(struct kmem_cache *s) 110 + { 111 + kfree(s->memcg_params); 112 + } 113 + 114 + static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs) 115 + { 116 + int size; 117 + struct memcg_cache_params *new_params, *cur_params; 118 + 119 + BUG_ON(!is_root_cache(s)); 120 + 121 + size = offsetof(struct memcg_cache_params, memcg_caches); 122 + size += num_memcgs * sizeof(void *); 123 + 124 + new_params = kzalloc(size, GFP_KERNEL); 125 + if (!new_params) 126 + return -ENOMEM; 127 + 128 + cur_params = s->memcg_params; 129 + memcpy(new_params->memcg_caches, cur_params->memcg_caches, 130 + memcg_limited_groups_array_size * sizeof(void *)); 131 + 132 + new_params->is_root_cache = true; 133 + 134 + rcu_assign_pointer(s->memcg_params, new_params); 135 + if (cur_params) 136 + kfree_rcu(cur_params, rcu_head); 137 + 138 + return 0; 139 + } 140 + 119 141 int memcg_update_all_caches(int num_memcgs) 120 142 { 121 143 struct kmem_cache *s; ··· 185 89 if (!is_root_cache(s)) 186 90 continue; 187 91 188 - ret = memcg_update_cache_size(s, num_memcgs); 92 + ret = memcg_update_cache_params(s, num_memcgs); 189 93 /* 190 - * See comment in memcontrol.c, memcg_update_cache_size: 191 94 * Instead of freeing the memory, we'll just leave the caches 192 95 * up to this point in an updated state. 
193 96 */ ··· 199 104 mutex_unlock(&slab_mutex); 200 105 return ret; 201 106 } 202 - #endif 107 + #else 108 + static inline int memcg_alloc_cache_params(struct mem_cgroup *memcg, 109 + struct kmem_cache *s, struct kmem_cache *root_cache) 110 + { 111 + return 0; 112 + } 113 + 114 + static inline void memcg_free_cache_params(struct kmem_cache *s) 115 + { 116 + } 117 + #endif /* CONFIG_MEMCG_KMEM */ 118 + 119 + /* 120 + * Find a mergeable slab cache 121 + */ 122 + int slab_unmergeable(struct kmem_cache *s) 123 + { 124 + if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE)) 125 + return 1; 126 + 127 + if (!is_root_cache(s)) 128 + return 1; 129 + 130 + if (s->ctor) 131 + return 1; 132 + 133 + /* 134 + * We may have set a slab to be unmergeable during bootstrap. 135 + */ 136 + if (s->refcount < 0) 137 + return 1; 138 + 139 + return 0; 140 + } 141 + 142 + struct kmem_cache *find_mergeable(size_t size, size_t align, 143 + unsigned long flags, const char *name, void (*ctor)(void *)) 144 + { 145 + struct kmem_cache *s; 146 + 147 + if (slab_nomerge || (flags & SLAB_NEVER_MERGE)) 148 + return NULL; 149 + 150 + if (ctor) 151 + return NULL; 152 + 153 + size = ALIGN(size, sizeof(void *)); 154 + align = calculate_alignment(flags, align, size); 155 + size = ALIGN(size, align); 156 + flags = kmem_cache_flags(size, flags, name, NULL); 157 + 158 + list_for_each_entry(s, &slab_caches, list) { 159 + if (slab_unmergeable(s)) 160 + continue; 161 + 162 + if (size > s->size) 163 + continue; 164 + 165 + if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME)) 166 + continue; 167 + /* 168 + * Check if alignment is compatible. 
169 + * Courtesy of Adrian Drzewiecki 170 + */ 171 + if ((s->size & ~(align - 1)) != s->size) 172 + continue; 173 + 174 + if (s->size - size >= sizeof(void *)) 175 + continue; 176 + 177 + return s; 178 + } 179 + return NULL; 180 + } 203 181 204 182 /* 205 183 * Figure out what the alignment of the objects will be given a set of ··· 379 211 mutex_lock(&slab_mutex); 380 212 381 213 err = kmem_cache_sanity_check(name, size); 382 - if (err) 214 + if (err) { 215 + s = NULL; /* suppress uninit var warning */ 383 216 goto out_unlock; 217 + } 384 218 385 219 /* 386 220 * Some allocators will constraint the set of valid flags to a subset
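With merging hoisted into mm/slab_common.c, `find_mergeable()` accepts an existing cache only if it passes a chain of compatibility checks. A userspace model of that predicate — for brevity this folds the request-side and candidate-side flag tests into one function, where the kernel splits them between `find_mergeable()` and `slab_unmergeable()`, and `F_NEVER_MERGE`/`F_MERGE_SAME` are illustrative bits, not the real `SLAB_*` values:

```c
#include <assert.h>
#include <stddef.h>

#define F_NEVER_MERGE 0x1UL /* e.g. red-zoning, poisoning, tracing */
#define F_MERGE_SAME  0x2UL /* e.g. SLAB_CACHE_DMA: must match exactly */

struct cache {
	size_t size;		/* padded/aligned object size */
	unsigned long flags;
};

/* The checks applied to each candidate: no never-merge flags on either
 * side, candidate big enough, same must-match flags, alignment
 * compatible, and not too much wasted space per object. */
static int mergeable(const struct cache *s, size_t size, size_t align,
		     unsigned long flags)
{
	if ((s->flags | flags) & F_NEVER_MERGE)
		return 0;
	if (size > s->size)
		return 0;
	if ((flags & F_MERGE_SAME) != (s->flags & F_MERGE_SAME))
		return 0;
	if ((s->size & ~(align - 1)) != s->size)	/* alignment check */
		return 0;
	if (s->size - size >= sizeof(void *))		/* slack too large */
		return 0;
	return 1;
}

static struct cache sample = { 64, F_MERGE_SAME };

static int sample_mergeable(size_t size, size_t align, unsigned long flags)
{
	return mergeable(&sample, size, align, flags);
}
```

The slack test is why a 32-byte request will not merge into a 64-byte cache: every object would waste at least a pointer's worth of space.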
-2
mm/slob.c
··· 468 468 } 469 469 EXPORT_SYMBOL(__kmalloc); 470 470 471 - #ifdef CONFIG_TRACING 472 471 void *__kmalloc_track_caller(size_t size, gfp_t gfp, unsigned long caller) 473 472 { 474 473 return __do_kmalloc_node(size, gfp, NUMA_NO_NODE, caller); ··· 479 480 { 480 481 return __do_kmalloc_node(size, gfp, node, caller); 481 482 } 482 - #endif 483 483 #endif 484 484 485 485 void kfree(const void *block)
+31 -95
mm/slub.c
··· 169 169 */ 170 170 #define DEBUG_METADATA_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER) 171 171 172 - /* 173 - * Set of flags that will prevent slab merging 174 - */ 175 - #define SLUB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \ 176 - SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \ 177 - SLAB_FAILSLAB) 178 - 179 - #define SLUB_MERGE_SAME (SLAB_DEBUG_FREE | SLAB_RECLAIM_ACCOUNT | \ 180 - SLAB_CACHE_DMA | SLAB_NOTRACK) 181 - 182 172 #define OO_SHIFT 16 183 173 #define OO_MASK ((1 << OO_SHIFT) - 1) 184 174 #define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */ ··· 1166 1176 1167 1177 __setup("slub_debug", setup_slub_debug); 1168 1178 1169 - static unsigned long kmem_cache_flags(unsigned long object_size, 1179 + unsigned long kmem_cache_flags(unsigned long object_size, 1170 1180 unsigned long flags, const char *name, 1171 1181 void (*ctor)(void *)) 1172 1182 { ··· 1198 1208 struct page *page) {} 1199 1209 static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, 1200 1210 struct page *page) {} 1201 - static inline unsigned long kmem_cache_flags(unsigned long object_size, 1211 + unsigned long kmem_cache_flags(unsigned long object_size, 1202 1212 unsigned long flags, const char *name, 1203 1213 void (*ctor)(void *)) 1204 1214 { ··· 1689 1699 struct kmem_cache_cpu *c) 1690 1700 { 1691 1701 void *object; 1692 - int searchnode = (node == NUMA_NO_NODE) ? 
numa_mem_id() : node; 1702 + int searchnode = node; 1703 + 1704 + if (node == NUMA_NO_NODE) 1705 + searchnode = numa_mem_id(); 1706 + else if (!node_present_pages(node)) 1707 + searchnode = node_to_mem_node(node); 1693 1708 1694 1709 object = get_partial_node(s, get_node(s, searchnode), c, flags); 1695 1710 if (object || node != NUMA_NO_NODE) ··· 2275 2280 redo: 2276 2281 2277 2282 if (unlikely(!node_match(page, node))) { 2278 - stat(s, ALLOC_NODE_MISMATCH); 2279 - deactivate_slab(s, page, c->freelist); 2280 - c->page = NULL; 2281 - c->freelist = NULL; 2282 - goto new_slab; 2283 + int searchnode = node; 2284 + 2285 + if (node != NUMA_NO_NODE && !node_present_pages(node)) 2286 + searchnode = node_to_mem_node(node); 2287 + 2288 + if (unlikely(!node_match(page, searchnode))) { 2289 + stat(s, ALLOC_NODE_MISMATCH); 2290 + deactivate_slab(s, page, c->freelist); 2291 + c->page = NULL; 2292 + c->freelist = NULL; 2293 + goto new_slab; 2294 + } 2283 2295 } 2284 2296 2285 2297 /* ··· 2707 2705 static int slub_min_order; 2708 2706 static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; 2709 2707 static int slub_min_objects; 2710 - 2711 - /* 2712 - * Merge control. If this is set then no merging of slab caches will occur. 2713 - * (Could be removed. This was introduced to pacify the merge skeptics.) 2714 - */ 2715 - static int slub_nomerge; 2716 2708 2717 2709 /* 2718 2710 * Calculate the order of allocation given an slab object size. 
··· 3236 3240 3237 3241 __setup("slub_min_objects=", setup_slub_min_objects); 3238 3242 3239 - static int __init setup_slub_nomerge(char *str) 3240 - { 3241 - slub_nomerge = 1; 3242 - return 1; 3243 - } 3244 - 3245 - __setup("slub_nomerge", setup_slub_nomerge); 3246 - 3247 3243 void *__kmalloc(size_t size, gfp_t flags) 3248 3244 { 3249 3245 struct kmem_cache *s; ··· 3611 3623 3612 3624 void __init kmem_cache_init_late(void) 3613 3625 { 3614 - } 3615 - 3616 - /* 3617 - * Find a mergeable slab cache 3618 - */ 3619 - static int slab_unmergeable(struct kmem_cache *s) 3620 - { 3621 - if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE)) 3622 - return 1; 3623 - 3624 - if (!is_root_cache(s)) 3625 - return 1; 3626 - 3627 - if (s->ctor) 3628 - return 1; 3629 - 3630 - /* 3631 - * We may have set a slab to be unmergeable during bootstrap. 3632 - */ 3633 - if (s->refcount < 0) 3634 - return 1; 3635 - 3636 - return 0; 3637 - } 3638 - 3639 - static struct kmem_cache *find_mergeable(size_t size, size_t align, 3640 - unsigned long flags, const char *name, void (*ctor)(void *)) 3641 - { 3642 - struct kmem_cache *s; 3643 - 3644 - if (slub_nomerge || (flags & SLUB_NEVER_MERGE)) 3645 - return NULL; 3646 - 3647 - if (ctor) 3648 - return NULL; 3649 - 3650 - size = ALIGN(size, sizeof(void *)); 3651 - align = calculate_alignment(flags, align, size); 3652 - size = ALIGN(size, align); 3653 - flags = kmem_cache_flags(size, flags, name, NULL); 3654 - 3655 - list_for_each_entry(s, &slab_caches, list) { 3656 - if (slab_unmergeable(s)) 3657 - continue; 3658 - 3659 - if (size > s->size) 3660 - continue; 3661 - 3662 - if ((flags & SLUB_MERGE_SAME) != (s->flags & SLUB_MERGE_SAME)) 3663 - continue; 3664 - /* 3665 - * Check if alignment is compatible. 
3666 - * Courtesy of Adrian Drzewiecki 3667 - */ 3668 - if ((s->size & ~(align - 1)) != s->size) 3669 - continue; 3670 - 3671 - if (s->size - size >= sizeof(void *)) 3672 - continue; 3673 - 3674 - return s; 3675 - } 3676 - return NULL; 3677 3626 } 3678 3627 3679 3628 struct kmem_cache * ··· 4529 4604 static ssize_t trace_store(struct kmem_cache *s, const char *buf, 4530 4605 size_t length) 4531 4606 { 4607 + /* 4608 + * Tracing a merged cache is going to give confusing results 4609 + * as well as cause other issues like converting a mergeable 4610 + * cache into an unmergeable one. 4611 + */ 4612 + if (s->refcount > 1) 4613 + return -EINVAL; 4614 + 4532 4615 s->flags &= ~SLAB_TRACE; 4533 4616 if (buf[0] == '1') { 4534 4617 s->flags &= ~__CMPXCHG_DOUBLE; ··· 4654 4721 static ssize_t failslab_store(struct kmem_cache *s, const char *buf, 4655 4722 size_t length) 4656 4723 { 4724 + if (s->refcount > 1) 4725 + return -EINVAL; 4726 + 4657 4727 s->flags &= ~SLAB_FAILSLAB; 4658 4728 if (buf[0] == '1') 4659 4729 s->flags |= SLAB_FAILSLAB;
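The mm/slub.c hunks also add a memoryless-node fallback: when an allocation explicitly targets a node that has no present pages, `get_partial()` and the slowpath now redirect the search to `node_to_mem_node(node)` instead of failing the node match forever. A toy model of that redirect — the topology arrays, `local_node()`, and `pick_search_node()` are invented stand-ins for `node_present_pages()`, `numa_mem_id()`, and `node_to_mem_node()`:

```c
#include <assert.h>

#define NR_NODES 4
#define NUMA_NO_NODE (-1)

/* Toy topology: node 2 is memoryless. */
static const int node_pages[NR_NODES] = { 100, 100, 0, 100 };
/* Nearest node with memory, precomputed per node. */
static const int mem_fallback[NR_NODES] = { 0, 1, 1, 3 };

static int local_node(void)
{
	return 0; /* stand-in for numa_mem_id() */
}

/* Mirrors the patched lookup: no preference means local node; an
 * explicit request for a memoryless node is redirected to its
 * nearest node that actually has memory. */
static int pick_search_node(int node)
{
	if (node == NUMA_NO_NODE)
		return local_node();
	if (!node_pages[node])
		return mem_fallback[node];
	return node;
}
```

Without the redirect, a cpuset or mempolicy pinned to a memoryless node would deactivate the cpu slab on every allocation and fall back to the page allocator each time.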
+19 -11
mm/swap.c
··· 887 887 mutex_unlock(&lock); 888 888 } 889 889 890 - /* 891 - * Batched page_cache_release(). Decrement the reference count on all the 892 - * passed pages. If it fell to zero then remove the page from the LRU and 893 - * free it. 890 + /** 891 + * release_pages - batched page_cache_release() 892 + * @pages: array of pages to release 893 + * @nr: number of pages 894 + * @cold: whether the pages are cache cold 894 895 * 895 - * Avoid taking zone->lru_lock if possible, but if it is taken, retain it 896 - * for the remainder of the operation. 897 - * 898 - * The locking in this function is against shrink_inactive_list(): we recheck 899 - * the page count inside the lock to see whether shrink_inactive_list() 900 - * grabbed the page via the LRU. If it did, give up: shrink_inactive_list() 901 - * will free it. 896 + * Decrement the reference count on all the pages in @pages. If it 897 + * fell to zero, remove the page from the LRU and free it. 902 898 */ 903 899 void release_pages(struct page **pages, int nr, bool cold) 904 900 { ··· 903 907 struct zone *zone = NULL; 904 908 struct lruvec *lruvec; 905 909 unsigned long uninitialized_var(flags); 910 + unsigned int uninitialized_var(lock_batch); 906 911 907 912 for (i = 0; i < nr; i++) { 908 913 struct page *page = pages[i]; ··· 917 920 continue; 918 921 } 919 922 923 + /* 924 + * Make sure the IRQ-safe lock-holding time does not get 925 + * excessive with a continuous string of pages from the 926 + * same zone. The lock is held only if zone != NULL. 927 + */ 928 + if (zone && ++lock_batch == SWAP_CLUSTER_MAX) { 929 + spin_unlock_irqrestore(&zone->lru_lock, flags); 930 + zone = NULL; 931 + } 932 + 920 933 if (!put_page_testzero(page)) 921 934 continue; 922 935 ··· 937 930 if (zone) 938 931 spin_unlock_irqrestore(&zone->lru_lock, 939 932 flags); 933 + lock_batch = 0; 940 934 zone = pagezone; 941 935 spin_lock_irqsave(&zone->lru_lock, flags); 942 936 }
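The `release_pages()` change above keeps `zone->lru_lock` held across consecutive pages from the same zone, but now force-drops it every `SWAP_CLUSTER_MAX` pages so IRQ-off time stays bounded even for a long run of same-zone pages. The batching control flow, modeled in userspace — `BATCH_MAX`, `drain_items()`, and the held/batch flags are illustrative, with the lock reduced to a boolean:

```c
#include <assert.h>

#define BATCH_MAX 32 /* stand-in for SWAP_CLUSTER_MAX */

static int lock_drops; /* how many times the "lock" was released */

/* Hold the per-zone "lock" across consecutive items, dropping and
 * retaking it after every BATCH_MAX items processed under it. */
static void drain_items(int nr)
{
	int held = 0, batch = 0, i;

	lock_drops = 0;
	for (i = 0; i < nr; i++) {
		if (held && ++batch == BATCH_MAX) {
			held = 0;	/* spin_unlock_irqrestore() */
			lock_drops++;
		}
		if (!held) {
			held = 1;	/* spin_lock_irqsave() */
			batch = 0;	/* reset on (re)acquire */
		}
		/* ...free the item under the lock... */
	}
	if (held)
		lock_drops++;		/* final unlock */
}

static int count_lock_drops(int nr)
{
	drain_items(nr);
	return lock_drops;
}
```

Note the kernel version also resets `lock_batch` whenever the zone changes, since switching zones already releases the previous zone's lock.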
+6 -10
mm/swap_state.c
··· 28 28 static const struct address_space_operations swap_aops = { 29 29 .writepage = swap_writepage, 30 30 .set_page_dirty = swap_set_page_dirty, 31 + #ifdef CONFIG_MIGRATION 31 32 .migratepage = migrate_page, 33 + #endif 32 34 }; 33 35 34 36 static struct backing_dev_info swap_backing_dev_info = { ··· 265 263 void free_pages_and_swap_cache(struct page **pages, int nr) 266 264 { 267 265 struct page **pagep = pages; 266 + int i; 268 267 269 268 lru_add_drain(); 270 - while (nr) { 271 - int todo = min(nr, PAGEVEC_SIZE); 272 - int i; 273 - 274 - for (i = 0; i < todo; i++) 275 - free_swap_cache(pagep[i]); 276 - release_pages(pagep, todo, false); 277 - pagep += todo; 278 - nr -= todo; 279 - } 269 + for (i = 0; i < nr; i++) 270 + free_swap_cache(pagep[i]); 271 + release_pages(pagep, nr, false); 280 272 } 281 273 282 274 /*
+8 -15
mm/util.c
··· 170 170 /* 171 171 * Check if the vma is being used as a stack. 172 172 * If is_group is non-zero, check in the entire thread group or else 173 - * just check in the current task. Returns the pid of the task that 174 - * the vma is stack for. 173 + * just check in the current task. Returns the task_struct of the task 174 + * that the vma is stack for. Must be called under rcu_read_lock(). 175 175 */ 176 - pid_t vm_is_stack(struct task_struct *task, 177 - struct vm_area_struct *vma, int in_group) 176 + struct task_struct *task_of_stack(struct task_struct *task, 177 + struct vm_area_struct *vma, bool in_group) 178 178 { 179 - pid_t ret = 0; 180 - 181 179 if (vm_is_stack_for_task(task, vma)) 182 - return task->pid; 180 + return task; 183 181 184 182 if (in_group) { 185 183 struct task_struct *t; 186 184 187 - rcu_read_lock(); 188 185 for_each_thread(task, t) { 189 - if (vm_is_stack_for_task(t, vma)) { 190 - ret = t->pid; 191 - goto done; 192 - } 186 + if (vm_is_stack_for_task(t, vma)) 187 + return t; 193 188 } 194 - done: 195 - rcu_read_unlock(); 196 189 } 197 190 198 - return ret; 191 + return NULL; 199 192 } 200 193 201 194 #if defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT)
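The mm/util.c change turns `vm_is_stack()` into `task_of_stack()`: instead of copying out a pid, it returns the matching `task_struct` pointer (or NULL) and leaves the RCU locking to the caller, which is why the internal `rcu_read_lock()` pair disappears. The return-the-match-or-NULL shape in a userspace sketch — `struct task`, `task_of_region()`, and `id_of()` are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct task {
	int id;
	const void *stack_vma; /* region this task uses as its stack */
};

static int vma_is_stack_for(const struct task *t, const void *vma)
{
	return t->stack_vma == vma;
}

/* Return the group member whose stack is 'vma', or NULL. In the
 * kernel the caller must hold rcu_read_lock() while it walks the
 * thread group and while it uses the returned pointer. */
static const struct task *task_of_region(const struct task *group, int n,
					 const void *vma)
{
	int i;

	for (i = 0; i < n; i++)
		if (vma_is_stack_for(&group[i], vma))
			return &group[i];
	return NULL;
}

static char stack_a[16], stack_b[16];
static const struct task group[2] = { { 1, stack_a }, { 2, stack_b } };

static int id_of(const void *vma)
{
	const struct task *t = task_of_region(group, 2, vma);

	return t ? t->id : 0;
}
```

Returning the task rather than a pid lets callers (e.g. /proc's task_mmu code) read more than one field consistently under the same RCU read section.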
+5 -15
mm/vmalloc.c
··· 2646 2646 2647 2647 static int vmalloc_open(struct inode *inode, struct file *file) 2648 2648 { 2649 - unsigned int *ptr = NULL; 2650 - int ret; 2651 - 2652 - if (IS_ENABLED(CONFIG_NUMA)) { 2653 - ptr = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL); 2654 - if (ptr == NULL) 2655 - return -ENOMEM; 2656 - } 2657 - ret = seq_open(file, &vmalloc_op); 2658 - if (!ret) { 2659 - struct seq_file *m = file->private_data; 2660 - m->private = ptr; 2661 - } else 2662 - kfree(ptr); 2663 - return ret; 2649 + if (IS_ENABLED(CONFIG_NUMA)) 2650 + return seq_open_private(file, &vmalloc_op, 2651 + nr_node_ids * sizeof(unsigned int)); 2652 + else 2653 + return seq_open(file, &vmalloc_op); 2664 2654 } 2665 2655 2666 2656 static const struct file_operations proc_vmalloc_operations = {
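Both the mm/vmalloc.c hunk here and the slabstats hunk in mm/slab.c collapse the same boilerplate: allocate a private buffer, `seq_open()`, attach the buffer to `m->private`, free on failure. `seq_open_private()`/`__seq_open_private()` do all of that in one call. A userspace model of the helper's contract — the `seq_file`/`file` structs and both functions are simplified stand-ins, not the real seq_file API:

```c
#include <assert.h>
#include <stdlib.h>

struct seq_file {
	void *private; /* per-open iterator state */
};

struct file {
	struct seq_file *seq;
};

/* Stand-in for seq_open(): attach a fresh seq_file to the file. */
static int seq_open(struct file *f)
{
	f->seq = calloc(1, sizeof(*f->seq));
	return f->seq ? 0 : -1;
}

/* Stand-in for seq_open_private(): open the seq_file and hang a
 * zeroed 'size'-byte buffer off it, undoing the allocation if the
 * open itself fails -- exactly the sequence the patch deletes from
 * vmalloc_open() and slabstats_open(). */
static int seq_open_private(struct file *f, size_t size)
{
	void *priv = calloc(1, size);

	if (!priv)
		return -1;
	if (seq_open(f)) {
		free(priv);
		return -1;
	}
	f->seq->private = priv;
	return 0;
}

static int demo_open(void)
{
	struct file f;

	if (seq_open_private(&f, 64))
		return 0;
	return f.seq->private != NULL;
}
```

The `__seq_open_private()` variant used in slabstats_open() additionally returns the private buffer so the caller can initialize it, which is why that hunk writes `*n` directly after the call.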
+28 -84
mm/vmscan.c
··· 920 920 /* Case 1 above */ 921 921 if (current_is_kswapd() && 922 922 PageReclaim(page) && 923 - zone_is_reclaim_writeback(zone)) { 923 + test_bit(ZONE_WRITEBACK, &zone->flags)) { 924 924 nr_immediate++; 925 925 goto keep_locked; 926 926 ··· 1002 1002 */ 1003 1003 if (page_is_file_cache(page) && 1004 1004 (!current_is_kswapd() || 1005 - !zone_is_reclaim_dirty(zone))) { 1005 + !test_bit(ZONE_DIRTY, &zone->flags))) { 1006 1006 /* 1007 1007 * Immediately reclaim when written back. 1008 1008 * Similar in principal to deactivate_page() ··· 1563 1563 * are encountered in the nr_immediate check below. 1564 1564 */ 1565 1565 if (nr_writeback && nr_writeback == nr_taken) 1566 - zone_set_flag(zone, ZONE_WRITEBACK); 1566 + set_bit(ZONE_WRITEBACK, &zone->flags); 1567 1567 1568 1568 /* 1569 1569 * memcg will stall in page writeback so only consider forcibly ··· 1575 1575 * backed by a congested BDI and wait_iff_congested will stall. 1576 1576 */ 1577 1577 if (nr_dirty && nr_dirty == nr_congested) 1578 - zone_set_flag(zone, ZONE_CONGESTED); 1578 + set_bit(ZONE_CONGESTED, &zone->flags); 1579 1579 1580 1580 /* 1581 1581 * If dirty pages are scanned that are not queued for IO, it 1582 1582 * implies that flushers are not keeping up. In this case, flag 1583 - * the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing 1584 - * pages from reclaim context. 1583 + * the zone ZONE_DIRTY and kswapd will start writing pages from 1584 + * reclaim context. 1585 1585 */ 1586 1586 if (nr_unqueued_dirty == nr_taken) 1587 - zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); 1587 + set_bit(ZONE_DIRTY, &zone->flags); 1588 1588 1589 1589 /* 1590 1590 * If kswapd scans pages marked marked for immediate ··· 2315 2315 return reclaimable; 2316 2316 } 2317 2317 2318 - /* Returns true if compaction should go ahead for a high-order request */ 2318 + /* 2319 + * Returns true if compaction should go ahead for a high-order request, or 2320 + * the high-order allocation would succeed without compaction. 
2321 + */ 2319 2322 static inline bool compaction_ready(struct zone *zone, int order) 2320 2323 { 2321 2324 unsigned long balance_gap, watermark; ··· 2342 2339 if (compaction_deferred(zone, order)) 2343 2340 return watermark_ok; 2344 2341 2345 - /* If compaction is not ready to start, keep reclaiming */ 2346 - if (!compaction_suitable(zone, order)) 2342 + /* 2343 + * If compaction is not ready to start and allocation is not likely 2344 + * to succeed without it, then keep reclaiming. 2345 + */ 2346 + if (compaction_suitable(zone, order) == COMPACT_SKIPPED) 2347 2347 return false; 2348 2348 2349 2349 return watermark_ok; ··· 2759 2753 } 2760 2754 2761 2755 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, 2756 + unsigned long nr_pages, 2762 2757 gfp_t gfp_mask, 2763 - bool noswap) 2758 + bool may_swap) 2764 2759 { 2765 2760 struct zonelist *zonelist; 2766 2761 unsigned long nr_reclaimed; 2767 2762 int nid; 2768 2763 struct scan_control sc = { 2769 - .nr_to_reclaim = SWAP_CLUSTER_MAX, 2764 + .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), 2770 2765 .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | 2771 2766 (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), 2772 2767 .target_mem_cgroup = memcg, 2773 2768 .priority = DEF_PRIORITY, 2774 2769 .may_writepage = !laptop_mode, 2775 2770 .may_unmap = 1, 2776 - .may_swap = !noswap, 2771 + .may_swap = may_swap, 2777 2772 }; 2778 2773 2779 2774 /* ··· 2825 2818 return false; 2826 2819 2827 2820 if (IS_ENABLED(CONFIG_COMPACTION) && order && 2828 - !compaction_suitable(zone, order)) 2821 + compaction_suitable(zone, order) == COMPACT_SKIPPED) 2829 2822 return false; 2830 2823 2831 2824 return true; ··· 2985 2978 /* Account for the number of pages attempted to reclaim */ 2986 2979 *nr_attempted += sc->nr_to_reclaim; 2987 2980 2988 - zone_clear_flag(zone, ZONE_WRITEBACK); 2981 + clear_bit(ZONE_WRITEBACK, &zone->flags); 2989 2982 2990 2983 /* 2991 2984 * If a zone reaches its high watermark, consider it to be no longer ··· 
2995 2988 */ 2996 2989 if (zone_reclaimable(zone) && 2997 2990 zone_balanced(zone, testorder, 0, classzone_idx)) { 2998 - zone_clear_flag(zone, ZONE_CONGESTED); 2999 - zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); 2991 + clear_bit(ZONE_CONGESTED, &zone->flags); 2992 + clear_bit(ZONE_DIRTY, &zone->flags); 3000 2993 } 3001 2994 3002 2995 return sc->nr_scanned >= sc->nr_to_reclaim; ··· 3087 3080 * If balanced, clear the dirty and congested 3088 3081 * flags 3089 3082 */ 3090 - zone_clear_flag(zone, ZONE_CONGESTED); 3091 - zone_clear_flag(zone, ZONE_TAIL_LRU_DIRTY); 3083 + clear_bit(ZONE_CONGESTED, &zone->flags); 3084 + clear_bit(ZONE_DIRTY, &zone->flags); 3092 3085 } 3093 3086 } 3094 3087 ··· 3715 3708 if (node_state(node_id, N_CPU) && node_id != numa_node_id()) 3716 3709 return ZONE_RECLAIM_NOSCAN; 3717 3710 3718 - if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED)) 3711 + if (test_and_set_bit(ZONE_RECLAIM_LOCKED, &zone->flags)) 3719 3712 return ZONE_RECLAIM_NOSCAN; 3720 3713 3721 3714 ret = __zone_reclaim(zone, gfp_mask, order); 3722 - zone_clear_flag(zone, ZONE_RECLAIM_LOCKED); 3715 + clear_bit(ZONE_RECLAIM_LOCKED, &zone->flags); 3723 3716 3724 3717 if (!ret) 3725 3718 count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED); ··· 3798 3791 } 3799 3792 } 3800 3793 #endif /* CONFIG_SHMEM */ 3801 - 3802 - static void warn_scan_unevictable_pages(void) 3803 - { 3804 - printk_once(KERN_WARNING 3805 - "%s: The scan_unevictable_pages sysctl/node-interface has been " 3806 - "disabled for lack of a legitimate use case. If you have " 3807 - "one, please send an email to linux-mm@kvack.org.\n", 3808 - current->comm); 3809 - } 3810 - 3811 - /* 3812 - * scan_unevictable_pages [vm] sysctl handler. 
On demand re-scan of 3813 - * all nodes' unevictable lists for evictable pages 3814 - */ 3815 - unsigned long scan_unevictable_pages; 3816 - 3817 - int scan_unevictable_handler(struct ctl_table *table, int write, 3818 - void __user *buffer, 3819 - size_t *length, loff_t *ppos) 3820 - { 3821 - warn_scan_unevictable_pages(); 3822 - proc_doulongvec_minmax(table, write, buffer, length, ppos); 3823 - scan_unevictable_pages = 0; 3824 - return 0; 3825 - } 3826 - 3827 - #ifdef CONFIG_NUMA 3828 - /* 3829 - * per node 'scan_unevictable_pages' attribute. On demand re-scan of 3830 - * a specified node's per zone unevictable lists for evictable pages. 3831 - */ 3832 - 3833 - static ssize_t read_scan_unevictable_node(struct device *dev, 3834 - struct device_attribute *attr, 3835 - char *buf) 3836 - { 3837 - warn_scan_unevictable_pages(); 3838 - return sprintf(buf, "0\n"); /* always zero; should fit... */ 3839 - } 3840 - 3841 - static ssize_t write_scan_unevictable_node(struct device *dev, 3842 - struct device_attribute *attr, 3843 - const char *buf, size_t count) 3844 - { 3845 - warn_scan_unevictable_pages(); 3846 - return 1; 3847 - } 3848 - 3849 - 3850 - static DEVICE_ATTR(scan_unevictable_pages, S_IRUGO | S_IWUSR, 3851 - read_scan_unevictable_node, 3852 - write_scan_unevictable_node); 3853 - 3854 - int scan_unevictable_register_node(struct node *node) 3855 - { 3856 - return device_create_file(&node->dev, &dev_attr_scan_unevictable_pages); 3857 - } 3858 - 3859 - void scan_unevictable_unregister_node(struct node *node) 3860 - { 3861 - device_remove_file(&node->dev, &dev_attr_scan_unevictable_pages); 3862 - } 3863 - #endif
+132 -23
mm/vmstat.c
··· 7 7 * zoned VM statistics 8 8 * Copyright (C) 2006 Silicon Graphics, Inc., 9 9 * Christoph Lameter <christoph@lameter.com> 10 + * Copyright (C) 2008-2014 Christoph Lameter 10 11 */ 11 12 #include <linux/fs.h> 12 13 #include <linux/mm.h> ··· 15 14 #include <linux/module.h> 16 15 #include <linux/slab.h> 17 16 #include <linux/cpu.h> 17 + #include <linux/cpumask.h> 18 18 #include <linux/vmstat.h> 19 19 #include <linux/sched.h> 20 20 #include <linux/math64.h> ··· 421 419 EXPORT_SYMBOL(dec_zone_page_state); 422 420 #endif 423 421 424 - static inline void fold_diff(int *diff) 422 + 423 + /* 424 + * Fold a differential into the global counters. 425 + * Returns the number of counters updated. 426 + */ 427 + static int fold_diff(int *diff) 425 428 { 426 429 int i; 430 + int changes = 0; 427 431 428 432 for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) 429 - if (diff[i]) 433 + if (diff[i]) { 430 434 atomic_long_add(diff[i], &vm_stat[i]); 435 + changes++; 436 + } 437 + return changes; 431 438 } 432 439 433 440 /* ··· 452 441 * statistics in the remote zone struct as well as the global cachelines 453 442 * with the global counters. These could cause remote node cache line 454 443 * bouncing and will have to be only done when necessary. 444 + * 445 + * The function returns the number of global counters updated. 
455 446 */ 456 - static void refresh_cpu_vm_stats(void) 447 + static int refresh_cpu_vm_stats(void) 457 448 { 458 449 struct zone *zone; 459 450 int i; 460 451 int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, }; 452 + int changes = 0; 461 453 462 454 for_each_populated_zone(zone) { 463 455 struct per_cpu_pageset __percpu *p = zone->pageset; ··· 500 486 continue; 501 487 } 502 488 503 - 504 489 if (__this_cpu_dec_return(p->expire)) 505 490 continue; 506 491 507 - if (__this_cpu_read(p->pcp.count)) 492 + if (__this_cpu_read(p->pcp.count)) { 508 493 drain_zone_pages(zone, this_cpu_ptr(&p->pcp)); 494 + changes++; 495 + } 509 496 #endif 510 497 } 511 - fold_diff(global_diff); 498 + changes += fold_diff(global_diff); 499 + return changes; 512 500 } 513 501 514 502 /* ··· 751 735 TEXT_FOR_HIGHMEM(xx) xx "_movable", 752 736 753 737 const char * const vmstat_text[] = { 754 - /* Zoned VM counters */ 738 + /* enum zone_stat_item counters */ 755 739 "nr_free_pages", 756 740 "nr_alloc_batch", 757 741 "nr_inactive_anon", ··· 794 778 "workingset_nodereclaim", 795 779 "nr_anon_transparent_hugepages", 796 780 "nr_free_cma", 781 + 782 + /* enum writeback_stat_item counters */ 797 783 "nr_dirty_threshold", 798 784 "nr_dirty_background_threshold", 799 785 800 786 #ifdef CONFIG_VM_EVENT_COUNTERS 787 + /* enum vm_event_item counters */ 801 788 "pgpgin", 802 789 "pgpgout", 803 790 "pswpin", ··· 879 860 "thp_zero_page_alloc", 880 861 "thp_zero_page_alloc_failed", 881 862 #endif 863 + #ifdef CONFIG_MEMORY_BALLOON 864 + "balloon_inflate", 865 + "balloon_deflate", 866 + #ifdef CONFIG_BALLOON_COMPACTION 867 + "balloon_migrate", 868 + #endif 869 + #endif /* CONFIG_MEMORY_BALLOON */ 882 870 #ifdef CONFIG_DEBUG_TLBFLUSH 883 871 #ifdef CONFIG_SMP 884 872 "nr_tlb_remote_flush", ··· 1255 1229 #ifdef CONFIG_SMP 1256 1230 static DEFINE_PER_CPU(struct delayed_work, vmstat_work); 1257 1231 int sysctl_stat_interval __read_mostly = HZ; 1232 + static cpumask_var_t cpu_stat_off; 1258 1233 1259 1234 static void
vmstat_update(struct work_struct *w) 1260 1235 { 1261 - refresh_cpu_vm_stats(); 1262 - schedule_delayed_work(this_cpu_ptr(&vmstat_work), 1263 - round_jiffies_relative(sysctl_stat_interval)); 1236 + if (refresh_cpu_vm_stats()) 1237 + /* 1238 + * Counters were updated so we expect more updates 1239 + * to occur in the future. Keep on running the 1240 + * update worker thread. 1241 + */ 1242 + schedule_delayed_work(this_cpu_ptr(&vmstat_work), 1243 + round_jiffies_relative(sysctl_stat_interval)); 1244 + else { 1245 + /* 1246 + * We did not update any counters so the app may be in 1247 + * a mode where it does not cause counter updates. 1248 + * We may be uselessly running vmstat_update. 1249 + * Defer the checking for differentials to the 1250 + * shepherd thread on a different processor. 1251 + */ 1252 + int r; 1253 + /* 1254 + * Shepherd work thread does not race since it never 1255 + * changes the bit if it is zero, but the cpu 1256 + * online / offline code may race if 1257 + * worker threads are still allowed during 1258 + * shutdown / startup. 1259 + */ 1260 + r = cpumask_test_and_set_cpu(smp_processor_id(), 1261 + cpu_stat_off); 1262 + VM_BUG_ON(r); 1263 + } 1264 1264 } 1265 1265 1266 - static void start_cpu_timer(int cpu) 1266 + /* 1267 + * Check if the diffs for a certain cpu indicate that 1268 + * an update is needed. 1269 + */ 1270 + static bool need_update(int cpu) 1267 1271 { 1268 - struct delayed_work *work = &per_cpu(vmstat_work, cpu); 1272 + struct zone *zone; 1269 1273 1270 - INIT_DEFERRABLE_WORK(work, vmstat_update); 1271 - schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu)); 1274 + for_each_populated_zone(zone) { 1275 + struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu); 1276 + 1277 + BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1); 1278 + /* 1279 + * The fast way of checking if there are any vmstat diffs. 1280 + * This works because the diffs are byte-sized items.
1281 + */ 1282 + if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS)) 1283 + return true; 1284 + 1285 + } 1286 + return false; 1287 + } 1288 + 1289 + 1290 + /* 1291 + * Shepherd worker thread that checks the 1292 + * differentials of processors that have their worker 1293 + * threads for vm statistics updates disabled because of 1294 + * inactivity. 1295 + */ 1296 + static void vmstat_shepherd(struct work_struct *w); 1297 + 1298 + static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd); 1299 + 1300 + static void vmstat_shepherd(struct work_struct *w) 1301 + { 1302 + int cpu; 1303 + 1304 + get_online_cpus(); 1305 + /* Check processors whose vmstat worker threads have been disabled */ 1306 + for_each_cpu(cpu, cpu_stat_off) 1307 + if (need_update(cpu) && 1308 + cpumask_test_and_clear_cpu(cpu, cpu_stat_off)) 1309 + 1310 + schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu), 1311 + __round_jiffies_relative(sysctl_stat_interval, cpu)); 1312 + 1313 + put_online_cpus(); 1314 + 1315 + schedule_delayed_work(&shepherd, 1316 + round_jiffies_relative(sysctl_stat_interval)); 1317 + 1318 + } 1319 + 1320 + static void __init start_shepherd_timer(void) 1321 + { 1322 + int cpu; 1323 + 1324 + for_each_possible_cpu(cpu) 1325 + INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), 1326 + vmstat_update); 1327 + 1328 + if (!alloc_cpumask_var(&cpu_stat_off, GFP_KERNEL)) 1329 + BUG(); 1330 + cpumask_copy(cpu_stat_off, cpu_online_mask); 1331 + 1332 + schedule_delayed_work(&shepherd, 1333 + round_jiffies_relative(sysctl_stat_interval)); 1272 1334 } 1273 1335 1274 1336 static void vmstat_cpu_dead(int node) ··· 1387 1273 case CPU_ONLINE: 1388 1274 case CPU_ONLINE_FROZEN: 1389 1275 refresh_zone_stat_thresholds(); 1390 - start_cpu_timer(cpu); 1391 1276 node_set_state(cpu_to_node(cpu), N_CPU); 1277 + cpumask_set_cpu(cpu, cpu_stat_off); 1392 1278 break; 1393 1279 case CPU_DOWN_PREPARE: 1394 1280 case CPU_DOWN_PREPARE_FROZEN: 1395 1281 cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu)); 
1396 - per_cpu(vmstat_work, cpu).work.func = NULL; 1282 + cpumask_clear_cpu(cpu, cpu_stat_off); 1397 1283 break; 1398 1284 case CPU_DOWN_FAILED: 1399 1285 case CPU_DOWN_FAILED_FROZEN: 1400 - start_cpu_timer(cpu); 1286 + cpumask_set_cpu(cpu, cpu_stat_off); 1401 1287 break; 1402 1288 case CPU_DEAD: 1403 1289 case CPU_DEAD_FROZEN: ··· 1417 1303 static int __init setup_vmstat(void) 1418 1304 { 1419 1305 #ifdef CONFIG_SMP 1420 - int cpu; 1421 - 1422 1306 cpu_notifier_register_begin(); 1423 1307 __register_cpu_notifier(&vmstat_notifier); 1424 1308 1425 - for_each_online_cpu(cpu) { 1426 - start_cpu_timer(cpu); 1427 - node_set_state(cpu_to_node(cpu), N_CPU); 1428 - } 1309 + start_shepherd_timer(); 1429 1310 cpu_notifier_register_done(); 1430 1311 #endif 1431 1312 #ifdef CONFIG_PROC_FS
+7 -6
mm/zbud.c
··· 60 60 * NCHUNKS_ORDER determines the internal allocation granularity, effectively 61 61 * adjusting internal fragmentation. It also determines the number of 62 62 * freelists maintained in each pool. NCHUNKS_ORDER of 6 means that the 63 - * allocation granularity will be in chunks of size PAGE_SIZE/64, and there 64 - * will be 64 freelists per pool. 63 + * allocation granularity will be in chunks of size PAGE_SIZE/64. As one chunk 64 + * of each allocated page is occupied by the zbud header, NCHUNKS works out to 65 + * 63, the maximum number of free chunks in a zbud page; there will likewise be 66 + * 63 freelists per pool. 65 67 */ 66 68 #define NCHUNKS_ORDER 6 67 69 68 70 #define CHUNK_SHIFT (PAGE_SHIFT - NCHUNKS_ORDER) 69 71 #define CHUNK_SIZE (1 << CHUNK_SHIFT) 70 - #define NCHUNKS (PAGE_SIZE >> CHUNK_SHIFT) 71 72 #define ZHDR_SIZE_ALIGNED CHUNK_SIZE 73 + #define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT) 72 74 73 75 /** 74 76 * struct zbud_pool - stores metadata for each zbud pool ··· 270 268 { 271 269 /* 272 270 * Rather than branch for different situations, just use the fact that 273 - * free buddies have a length of zero to simplify everything. -1 at the 274 - * end for the zbud header. 271 + * free buddies have a length of zero to simplify everything. 275 272 */ 276 - return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks - 1; 273 + return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; 277 274 } 278 275 279 276 /*****************
+17 -29
mm/zsmalloc.c
··· 175 175 * n <= N / f, where 176 176 * n = number of allocated objects 177 177 * N = total number of objects zspage can store 178 - * f = 1/fullness_threshold_frac 178 + * f = fullness_threshold_frac 179 179 * 180 180 * Similarly, we assign zspage to: 181 181 * ZS_ALMOST_FULL when n > N / f ··· 199 199 200 200 spinlock_t lock; 201 201 202 - /* stats */ 203 - u64 pages_allocated; 204 - 205 202 struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS]; 206 203 }; 207 204 ··· 217 220 struct size_class size_class[ZS_SIZE_CLASSES]; 218 221 219 222 gfp_t flags; /* allocation flags used when growing pool */ 223 + atomic_long_t pages_allocated; 220 224 }; 221 225 222 226 /* ··· 297 299 298 300 static u64 zs_zpool_total_size(void *pool) 299 301 { 300 - return zs_get_total_size_bytes(pool); 302 + return zs_get_total_pages(pool) << PAGE_SHIFT; 301 303 } 302 304 303 305 static struct zpool_driver zs_zpool_driver = { ··· 628 630 while (page) { 629 631 struct page *next_page; 630 632 struct link_free *link; 631 - unsigned int i, objs_on_page; 633 + unsigned int i = 1; 632 634 633 635 /* 634 636 * page->index stores offset of first object starting ··· 641 643 642 644 link = (struct link_free *)kmap_atomic(page) + 643 645 off / sizeof(*link); 644 - objs_on_page = (PAGE_SIZE - off) / class->size; 645 646 646 - for (i = 1; i <= objs_on_page; i++) { 647 - off += class->size; 648 - if (off < PAGE_SIZE) { 649 - link->next = obj_location_to_handle(page, i); 650 - link += class->size / sizeof(*link); 651 - } 647 + while ((off += class->size) < PAGE_SIZE) { 648 + link->next = obj_location_to_handle(page, i++); 649 + link += class->size / sizeof(*link); 652 650 } 653 651 654 652 /* ··· 656 662 link->next = obj_location_to_handle(next_page, 0); 657 663 kunmap_atomic(link); 658 664 page = next_page; 659 - off = (off + class->size) % PAGE_SIZE; 665 + off %= PAGE_SIZE; 660 666 } 661 667 } 662 668 ··· 1022 1028 return 0; 1023 1029 1024 1030 set_zspage_mapping(first_page, class->index, ZS_EMPTY); 
1031 + atomic_long_add(class->pages_per_zspage, 1032 + &pool->pages_allocated); 1025 1033 spin_lock(&class->lock); 1026 - class->pages_allocated += class->pages_per_zspage; 1027 1034 } 1028 1035 1029 1036 obj = (unsigned long)first_page->freelist; ··· 1077 1082 1078 1083 first_page->inuse--; 1079 1084 fullness = fix_fullness_group(pool, first_page); 1080 - 1081 - if (fullness == ZS_EMPTY) 1082 - class->pages_allocated -= class->pages_per_zspage; 1083 - 1084 1085 spin_unlock(&class->lock); 1085 1086 1086 - if (fullness == ZS_EMPTY) 1087 + if (fullness == ZS_EMPTY) { 1088 + atomic_long_sub(class->pages_per_zspage, 1089 + &pool->pages_allocated); 1087 1090 free_zspage(first_page); 1091 + } 1088 1092 } 1089 1093 EXPORT_SYMBOL_GPL(zs_free); 1090 1094 ··· 1177 1183 } 1178 1184 EXPORT_SYMBOL_GPL(zs_unmap_object); 1179 1185 1180 - u64 zs_get_total_size_bytes(struct zs_pool *pool) 1186 + unsigned long zs_get_total_pages(struct zs_pool *pool) 1181 1187 { 1182 - int i; 1183 - u64 npages = 0; 1184 - 1185 - for (i = 0; i < ZS_SIZE_CLASSES; i++) 1186 - npages += pool->size_class[i].pages_allocated; 1187 - 1188 - return npages << PAGE_SHIFT; 1188 + return atomic_long_read(&pool->pages_allocated); 1189 1189 } 1190 - EXPORT_SYMBOL_GPL(zs_get_total_size_bytes); 1190 + EXPORT_SYMBOL_GPL(zs_get_total_pages); 1191 1191 1192 1192 module_init(zs_init); 1193 1193 module_exit(zs_exit);
+1
tools/testing/selftests/vm/Makefile
··· 3 3 CC = $(CROSS_COMPILE)gcc 4 4 CFLAGS = -Wall 5 5 BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest 6 + BINARIES += transhuge-stress 6 7 7 8 all: $(BINARIES) 8 9 %: %.c
+144
tools/testing/selftests/vm/transhuge-stress.c
··· 1 + /* 2 + * Stress test for transparent huge pages, memory compaction and migration. 3 + * 4 + * Authors: Konstantin Khlebnikov <koct9i@gmail.com> 5 + * 6 + * This is free and unencumbered software released into the public domain. 7 + */ 8 + 9 + #include <stdlib.h> 10 + #include <stdio.h> 11 + #include <stdint.h> 12 + #include <err.h> 13 + #include <time.h> 14 + #include <unistd.h> 15 + #include <fcntl.h> 16 + #include <string.h> 17 + #include <sys/mman.h> 18 + 19 + #define PAGE_SHIFT 12 20 + #define HPAGE_SHIFT 21 21 + 22 + #define PAGE_SIZE (1 << PAGE_SHIFT) 23 + #define HPAGE_SIZE (1 << HPAGE_SHIFT) 24 + 25 + #define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0) 26 + #define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1)) 27 + 28 + int pagemap_fd; 29 + 30 + int64_t allocate_transhuge(void *ptr) 31 + { 32 + uint64_t ent[2]; 33 + 34 + /* drop pmd */ 35 + if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE, 36 + MAP_FIXED | MAP_ANONYMOUS | 37 + MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr) 38 + errx(2, "mmap transhuge"); 39 + 40 + if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE)) 41 + err(2, "MADV_HUGEPAGE"); 42 + 43 + /* allocate transparent huge page */ 44 + *(volatile void **)ptr = ptr; 45 + 46 + if (pread(pagemap_fd, ent, sizeof(ent), 47 + (uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent)) 48 + err(2, "read pagemap"); 49 + 50 + if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) && 51 + PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) && 52 + !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1))) 53 + return PAGEMAP_PFN(ent[0]); 54 + 55 + return -1; 56 + } 57 + 58 + int main(int argc, char **argv) 59 + { 60 + size_t ram, len; 61 + void *ptr, *p; 62 + struct timespec a, b; 63 + double s; 64 + uint8_t *map; 65 + size_t map_len; 66 + 67 + ram = sysconf(_SC_PHYS_PAGES); 68 + if (ram > SIZE_MAX / sysconf(_SC_PAGESIZE) / 4) 69 + ram = SIZE_MAX / 4; 70 + else 71 + ram *= sysconf(_SC_PAGESIZE); 72 + 73 + if (argc == 1) 74 + len = ram; 75 + else if 
(!strcmp(argv[1], "-h")) 76 + errx(1, "usage: %s [size in MiB]", argv[0]); 77 + else 78 + len = atoll(argv[1]) << 20; 79 + 80 + warnx("allocate %zd transhuge pages, using %zd MiB virtual memory" 81 + " and %zd MiB of ram", len >> HPAGE_SHIFT, len >> 20, 82 + len >> (20 + HPAGE_SHIFT - PAGE_SHIFT - 1)); 83 + 84 + pagemap_fd = open("/proc/self/pagemap", O_RDONLY); 85 + if (pagemap_fd < 0) 86 + err(2, "open pagemap"); 87 + 88 + len -= len % HPAGE_SIZE; 89 + ptr = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE, 90 + MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0); 91 + if (ptr == MAP_FAILED) 92 + err(2, "initial mmap"); 93 + ptr += HPAGE_SIZE - (uintptr_t)ptr % HPAGE_SIZE; 94 + 95 + if (madvise(ptr, len, MADV_HUGEPAGE)) 96 + err(2, "MADV_HUGEPAGE"); 97 + 98 + map_len = ram >> (HPAGE_SHIFT - 1); 99 + map = malloc(map_len); 100 + if (!map) 101 + errx(2, "map malloc"); 102 + 103 + while (1) { 104 + int nr_succeed = 0, nr_failed = 0, nr_pages = 0; 105 + 106 + memset(map, 0, map_len); 107 + 108 + clock_gettime(CLOCK_MONOTONIC, &a); 109 + for (p = ptr; p < ptr + len; p += HPAGE_SIZE) { 110 + int64_t pfn; 111 + 112 + pfn = allocate_transhuge(p); 113 + 114 + if (pfn < 0) { 115 + nr_failed++; 116 + } else { 117 + size_t idx = pfn >> (HPAGE_SHIFT - PAGE_SHIFT); 118 + 119 + nr_succeed++; 120 + if (idx >= map_len) { 121 + map = realloc(map, idx + 1); 122 + if (!map) 123 + errx(2, "map realloc"); 124 + memset(map + map_len, 0, idx + 1 - map_len); 125 + map_len = idx + 1; 126 + } 127 + if (!map[idx]) 128 + nr_pages++; 129 + map[idx] = 1; 130 + } 131 + 132 + /* split transhuge page, keep last page */ 133 + if (madvise(p, HPAGE_SIZE - PAGE_SIZE, MADV_DONTNEED)) 134 + err(2, "MADV_DONTNEED"); 135 + } 136 + clock_gettime(CLOCK_MONOTONIC, &b); 137 + s = b.tv_sec - a.tv_sec + (b.tv_nsec - a.tv_nsec) / 1000000000.; 138 + 139 + warnx("%.3f s/loop, %.3f ms/page, %10.3f MiB/s\t" 140 + "%4d succeed, %4d failed, %4d different pages", 141 + s, s * 1000 / (len >> HPAGE_SHIFT), len / s / 
(1 << 20), 142 + nr_succeed, nr_failed, nr_pages); 143 + } 144 + }
+1
tools/vm/page-types.c
··· 132 132 [KPF_NOPAGE] = "n:nopage", 133 133 [KPF_KSM] = "x:ksm", 134 134 [KPF_THP] = "t:thp", 135 + [KPF_BALLOON] = "o:balloon", 135 136 136 137 [KPF_RESERVED] = "r:reserved", 137 138 [KPF_MLOCKED] = "m:mlocked",