Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge first patchbomb from Andrew Morton:

- arch/sh updates

- ocfs2 updates

- kernel/watchdog feature

- about half of mm/

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
Documentation: update arch list in the 'memtest' entry
Kconfig: memtest: update number of test patterns up to 17
arm: add support for memtest
arm64: add support for memtest
memtest: use phys_addr_t for physical addresses
mm: move memtest under mm
mm, hugetlb: abort __get_user_pages if current has been oom killed
mm, mempool: do not allow atomic resizing
memcg: print cgroup information when system panics due to panic_on_oom
mm: numa: remove migrate_ratelimited
mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
s390: standardize mmap_rnd() usage
powerpc: standardize mmap_rnd() usage
mips: extract logic for mmap_rnd()
arm64: standardize mmap_rnd() usage
x86: standardize mmap_rnd() usage
arm: factor out mmap ASLR into mmap_rnd
...

+2322 -1429
+21
Documentation/cma/debugfs.txt
···
+The CMA debugfs interface is useful to retrieve basic information out of the
+different CMA areas and to test allocation/release in each of the areas.
+
+Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
+kernel's CMA index. So the first CMA zone would be:
+
+	<debugfs>/cma/cma-0
+
+The structure of the files created under that directory is as follows:
+
+ - [RO] base_pfn: The base PFN (Page Frame Number) of the zone.
+ - [RO] count: Amount of memory in the CMA area.
+ - [RO] order_per_bit: Order of pages represented by one bit.
+ - [RO] bitmap: The bitmap of page states in the zone.
+ - [WO] alloc: Allocate N pages from that CMA area. For example:
+
+	echo 5 > <debugfs>/cma/cma-2/alloc
+
+would try to allocate 5 pages from the cma-2 area.
+
+ - [WO] free: Free N pages from that CMA area, similar to the above.
+7 -3
Documentation/kernel-parameters.txt
···
 			seconds. Use this parameter to check at some
 			other rate. 0 disables periodic checking.
 
-	memtest=	[KNL,X86] Enable memtest
+	memtest=	[KNL,X86,ARM] Enable memtest
 			Format: <integer>
 			default : 0 <disable>
 			Specifies the number of memtest passes to be
···
 
 	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
 			Format: [panic,][nopanic,][num]
-			Valid num: 0
+			Valid num: 0 or 1
 			0 - turn nmi_watchdog off
+			1 - turn nmi_watchdog on
 			When panic is specified, panic when an NMI watchdog
 			timeout occurs (or 'nopanic' to override the opposite
 			default).
···
 	nofxsr		[BUGS=X86-32] Disables x86 floating point extended
 			register save and restore. The kernel will only save
 			legacy floating-point registers on task switch.
+
+	nohugeiomap	[KNL,x86] Disable kernel huge I/O mappings.
 
 	noxsave		[BUGS=X86] Disables x86 extended register state save
 			and restore using xsave. The kernel will fallback to
···
 
 	nousb		[USB] Disable the USB subsystem
 
-	nowatchdog	[KNL] Disable the lockup detector (NMI watchdog).
+	nowatchdog	[KNL] Disable both lockup detectors, i.e.
+			soft-lockup and NMI watchdog (hard-lockup).
 
 	nowb		[ARM]
+53 -9
Documentation/sysctl/kernel.txt
···
 - shmmax                      [ sysv ipc ]
 - shmmni
 - softlockup_all_cpu_backtrace
+- soft_watchdog
 - stop-a                      [ SPARC only ]
 - sysrq                       ==> Documentation/sysrq.txt
 - sysctl_writes_strict
 - tainted
 - threads-max
 - unknown_nmi_panic
+- watchdog
 - watchdog_thresh
 - version
···
 
 nmi_watchdog:
 
-Enables/Disables the NMI watchdog on x86 systems. When the value is
-non-zero the NMI watchdog is enabled and will continuously test all
-online cpus to determine whether or not they are still functioning
-properly. Currently, passing "nmi_watchdog=" parameter at boot time is
-required for this function to work.
+This parameter can be used to control the NMI watchdog
+(i.e. the hard lockup detector) on x86 systems.
 
-If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel
-parameter), the NMI watchdog shares registers with oprofile. By
-disabling the NMI watchdog, oprofile may have more registers to
-utilize.
+   0 - disable the hard lockup detector
+   1 - enable the hard lockup detector
+
+The hard lockup detector monitors each CPU for its ability to respond to
+timer interrupts. The mechanism utilizes CPU performance counter registers
+that are programmed to generate Non-Maskable Interrupts (NMIs) periodically
+while a CPU is busy. Hence, the alternative name 'NMI watchdog'.
+
+The NMI watchdog is disabled by default if the kernel is running as a guest
+in a KVM virtual machine. This default can be overridden by adding
+
+   nmi_watchdog=1
+
+to the guest kernel command line (see Documentation/kernel-parameters.txt).
 
 ==============================================================
 
···
 
 ==============================================================
 
+soft_watchdog
+
+This parameter can be used to control the soft lockup detector.
+
+   0 - disable the soft lockup detector
+   1 - enable the soft lockup detector
+
+The soft lockup detector monitors CPUs for threads that are hogging the CPUs
+without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
+from running. The mechanism depends on the CPUs ability to respond to timer
+interrupts which are needed for the 'watchdog/N' threads to be woken up by
+the watchdog timer function, otherwise the NMI watchdog - if enabled - can
+detect a hard lockup condition.
+
+==============================================================
+
 tainted:
 
 Non-zero if the kernel has been tainted. Numeric values, which
···
 
 NMI switch that most IA32 servers have fires unknown NMI up, for
 example. If a system hangs up, try pressing the NMI switch.
+
+==============================================================
+
+watchdog:
+
+This parameter can be used to disable or enable the soft lockup detector
+_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time.
+
+   0 - disable both lockup detectors
+   1 - enable both lockup detectors
+
+The soft lockup detector and the NMI watchdog can also be disabled or
+enabled individually, using the soft_watchdog and nmi_watchdog parameters.
+If the watchdog parameter is read, for example by executing
+
+   cat /proc/sys/kernel/watchdog
+
+the output of this command (0 or 1) shows the logical OR of soft_watchdog
+and nmi_watchdog.
 
 ==============================================================
+1 -3
Documentation/vm/cleancache.txt
···
 A cleancache "backend" that provides transcendent memory registers itself
 to the kernel's cleancache "frontend" by calling cleancache_register_ops,
 passing a pointer to a cleancache_ops structure with funcs set appropriately.
-Note that cleancache_register_ops returns the previous settings so that
-chaining can be performed if desired. The functions provided must conform to
-certain semantics as follows:
+The functions provided must conform to certain semantics as follows:
 
 Most important, cleancache is "ephemeral". Pages which are copied into
 cleancache have an indefinite lifetime which is completely unknowable
+7 -17
Documentation/vm/unevictable-lru.txt
···
 below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
 off a subset of the VMA if the range does not cover the entire VMA. Once the
 VMA has been merged or split or neither, mlock_fixup() will call
-__mlock_vma_pages_range() to fault in the pages via get_user_pages() and to
+populate_vma_page_range() to fault in the pages via get_user_pages() and to
 mark the pages as mlocked via mlock_vma_page().
 
 Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
···
 
 Also note that a page returned by get_user_pages() could be truncated or
 migrated out from under us, while we're trying to mlock it. To detect this,
-__mlock_vma_pages_range() checks page_mapping() after acquiring the page lock.
+populate_vma_page_range() checks page_mapping() after acquiring the page lock.
 If the page is still associated with its mapping, we'll go ahead and call
 mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
 In the worst case, this will result in a page mapped in a VM_LOCKED VMA
···
 
 If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
 specified range. The range is then munlocked via the function
-__mlock_vma_pages_range() - the same function used to mlock a VMA range -
+populate_vma_page_range() - the same function used to mlock a VMA range -
 passing a flag to indicate that munlock() is being performed.
 
 Because the VMA access protections could have been changed to PROT_NONE after
···
 fetching the pages - all of which should be resident as a result of previous
 mlocking.
 
-For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
+For munlock(), populate_vma_page_range() unlocks individual pages by calling
 munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
 flag using TestClearPageMlocked(). As with mlock_vma_page(),
 munlock_vma_page() use the Test*PageMlocked() function to handle the case where
···
 
 To mlock a range of memory under the unevictable/mlock infrastructure, the
 mmap() handler and task address space expansion functions call
-mlock_vma_pages_range() specifying the vma and the address range to mlock.
-mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in
-"Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have
-already been set by the caller, in filtered VMAs. Thus these VMA's need not be
-visited for munlock when the region is unmapped.
+populate_vma_page_range() specifying the vma and the address range to mlock.
 
-For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
-fault/allocate the pages and mlock them. Again, like mlock_fixup(),
-mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
-attempting to fault/allocate and mlock the pages and "upgrades" the semaphore
-back to write mode before returning.
-
-The callers of mlock_vma_pages_range() will have already added the memory range
+The callers of populate_vma_page_range() will have already added the memory range
 to be mlocked to the task's "locked_vm". To account for filtered VMAs,
-mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
+populate_vma_page_range() returns the number of pages NOT mlocked. All of the
 callers then subtract a non-negative return value from the task's locked_vm. A
 negative return value represent an error - for example, from get_user_pages()
 attempting to fault in a VMA with PROT_NONE access. In this case, we leave the
+15
arch/Kconfig
···
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	bool
 
+config HAVE_ARCH_HUGE_VMAP
+	bool
+
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
···
 	  in the end of an hardirq.
 	  This spares a stack switch and improves cache usage on softirq
 	  processing.
+
+config PGTABLE_LEVELS
+	int
+	default 2
+
+config ARCH_HAS_ELF_RANDOMIZE
+	bool
+	help
+	  An architecture supports choosing randomized locations for
+	  stack, mmap, brk, and ET_DYN. Defined functions:
+	  - arch_mmap_rnd()
+	  - arch_randomize_brk()
 
 #
 # ABI hall of shame
+4
arch/alpha/Kconfig
···
 	bool
 	default y
 
+config PGTABLE_LEVELS
+	int
+	default 3
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
+6 -1
arch/arm/Kconfig
···
 config ARM
 	bool
 	default y
-	select ARCH_BINFMT_ELF_RANDOMIZE_PIE
 	select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
+	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAVE_CUSTOM_GPIO_H
 	select ARCH_HAS_GCOV_PROFILE_ALL
···
 config GENERIC_BUG
 	def_bool y
 	depends on BUG
+
+config PGTABLE_LEVELS
+	int
+	default 3 if ARM_LPAE
+	default 2
 
 source "init/Kconfig"
-4
arch/arm/include/asm/elf.h
···
 extern void elf_set_personality(const struct elf32_hdr *);
 #define SET_PERSONALITY(ex)	elf_set_personality(&(ex))
 
-struct mm_struct;
-extern unsigned long arch_randomize_brk(struct mm_struct *mm);
-#define arch_randomize_brk arch_randomize_brk
-
 #ifdef CONFIG_MMU
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
 struct linux_binprm;
+3
arch/arm/mm/init.c
···
 
 	find_limits(&min, &max_low, &max_high);
 
+	early_memtest((phys_addr_t)min << PAGE_SHIFT,
+		      (phys_addr_t)max_low << PAGE_SHIFT);
+
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(),
 	 * so must be done after the fixed reservations
+12 -4
arch/arm/mm/mmap.c
···
 	return addr;
 }
 
+unsigned long arch_mmap_rnd(void)
+{
+	unsigned long rnd;
+
+	/* 8 bits of randomness in 20 address space bits */
+	rnd = (unsigned long)get_random_int() % (1 << 8);
+
+	return rnd << PAGE_SHIFT;
+}
+
 void arch_pick_mmap_layout(struct mm_struct *mm)
 {
 	unsigned long random_factor = 0UL;
 
-	/* 8 bits of randomness in 20 address space bits */
-	if ((current->flags & PF_RANDOMIZE) &&
-	    !(current->personality & ADDR_NO_RANDOMIZE))
-		random_factor = (get_random_int() % (1 << 8)) << PAGE_SHIFT;
+	if (current->flags & PF_RANDOMIZE)
+		random_factor = arch_mmap_rnd();
 
 	if (mmap_is_legacy()) {
 		mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
+8 -8
arch/arm64/Kconfig
···
 config ARM64
 	def_bool y
-	select ARCH_BINFMT_ELF_RANDOMIZE_PIE
 	select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE
+	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
···
 
 config FIX_EARLYCON_MEM
 	def_bool y
+
+config PGTABLE_LEVELS
+	int
+	default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
+	default 3 if ARM64_64K_PAGES && ARM64_VA_BITS_48
+	default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
+	default 4 if ARM64_4K_PAGES && ARM64_VA_BITS_48
 
 source "init/Kconfig"
···
 	default 39 if ARM64_VA_BITS_39
 	default 42 if ARM64_VA_BITS_42
 	default 48 if ARM64_VA_BITS_48
-
-config ARM64_PGTABLE_LEVELS
-	int
-	default 2 if ARM64_64K_PAGES && ARM64_VA_BITS_42
-	default 3 if ARM64_64K_PAGES && ARM64_VA_BITS_48
-	default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
-	default 4 if ARM64_4K_PAGES && ARM64_VA_BITS_48
 
 config CPU_BIG_ENDIAN
 	bool "Build big-endian kernel"
-5
arch/arm64/include/asm/elf.h
···
  * the loader. We need to make sure that it is out of the way of the program
  * that it will "exec", and that there is sufficient room for the brk.
  */
-extern unsigned long randomize_et_dyn(unsigned long base);
 #define ELF_ET_DYN_BASE	(2 * TASK_SIZE_64 / 3)
 
 /*
···
 #else
 #define STACK_RND_MASK	(0x3ffff >> (PAGE_SHIFT - 12))
 #endif
-
-struct mm_struct;
-extern unsigned long arch_randomize_brk(struct mm_struct *mm);
-#define arch_randomize_brk arch_randomize_brk
 
 #ifdef CONFIG_COMPAT
+2 -2
arch/arm64/include/asm/kvm_mmu.h
···
 /*
  * If we are concatenating first level stage-2 page tables, we would have less
  * than or equal to 16 pointers in the fake PGD, because that's what the
- * architecture allows. In this case, (4 - CONFIG_ARM64_PGTABLE_LEVELS)
+ * architecture allows. In this case, (4 - CONFIG_PGTABLE_LEVELS)
  * represents the first level for the host, and we add 1 to go to the next
  * level (which uses contatenation) for the stage-2 tables.
  */
 #if PTRS_PER_S2_PGD <= 16
-#define KVM_PREALLOC_LEVEL	(4 - CONFIG_ARM64_PGTABLE_LEVELS + 1)
+#define KVM_PREALLOC_LEVEL	(4 - CONFIG_PGTABLE_LEVELS + 1)
 #else
 #define KVM_PREALLOC_LEVEL	(0)
 #endif
+2 -2
arch/arm64/include/asm/page.h
···
  * for more information).
  */
 #ifdef CONFIG_ARM64_64K_PAGES
-#define SWAPPER_PGTABLE_LEVELS	(CONFIG_ARM64_PGTABLE_LEVELS)
+#define SWAPPER_PGTABLE_LEVELS	(CONFIG_PGTABLE_LEVELS)
 #else
-#define SWAPPER_PGTABLE_LEVELS	(CONFIG_ARM64_PGTABLE_LEVELS - 1)
+#define SWAPPER_PGTABLE_LEVELS	(CONFIG_PGTABLE_LEVELS - 1)
 #endif
 
 #define SWAPPER_DIR_SIZE	(SWAPPER_PGTABLE_LEVELS * PAGE_SIZE)
+4 -4
arch/arm64/include/asm/pgalloc.h
···
 
 #define PGALLOC_GFP	(GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
···
 	set_pud(pud, __pud(__pa(pmd) | PMD_TYPE_TABLE));
 }
 
-#endif	/* CONFIG_ARM64_PGTABLE_LEVELS > 2 */
+#endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
···
 	set_pgd(pgd, __pgd(__pa(pud) | PUD_TYPE_TABLE));
 }
 
-#endif	/* CONFIG_ARM64_PGTABLE_LEVELS > 3 */
+#endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
+3 -3
arch/arm64/include/asm/pgtable-hwdef.h
···
 /*
  * PMD_SHIFT determines the size a level 2 page table entry can map.
  */
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 #define PMD_SHIFT	((PAGE_SHIFT - 3) * 2 + 3)
 #define PMD_SIZE	(_AC(1, UL) << PMD_SHIFT)
 #define PMD_MASK	(~(PMD_SIZE-1))
···
 /*
  * PUD_SHIFT determines the size a level 1 page table entry can map.
  */
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 #define PUD_SHIFT	((PAGE_SHIFT - 3) * 3 + 3)
 #define PUD_SIZE	(_AC(1, UL) << PUD_SHIFT)
 #define PUD_MASK	(~(PUD_SIZE-1))
···
  * PGDIR_SHIFT determines the size a top-level page table entry can map
  * (depending on the configuration, this level can be 0, 1 or 2).
  */
-#define PGDIR_SHIFT	((PAGE_SHIFT - 3) * CONFIG_ARM64_PGTABLE_LEVELS + 3)
+#define PGDIR_SHIFT	((PAGE_SHIFT - 3) * CONFIG_PGTABLE_LEVELS + 3)
 #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 #define PTRS_PER_PGD	(1 << (VA_BITS - PGDIR_SHIFT))
+6 -6
arch/arm64/include/asm/pgtable-types.h
···
 #define pte_val(x)	((x).pte)
 #define __pte(x)	((pte_t) { (x) } )
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 typedef struct { pmdval_t pmd; } pmd_t;
 #define pmd_val(x)	((x).pmd)
 #define __pmd(x)	((pmd_t) { (x) } )
 #endif
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 typedef struct { pudval_t pud; } pud_t;
 #define pud_val(x)	((x).pud)
 #define __pud(x)	((pud_t) { (x) } )
···
 #define pte_val(x)	(x)
 #define __pte(x)	(x)
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 typedef pmdval_t pmd_t;
 #define pmd_val(x)	(x)
 #define __pmd(x)	(x)
 #endif
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 typedef pudval_t pud_t;
 #define pud_val(x)	(x)
 #define __pud(x)	(x)
···
 
 #endif	/* STRICT_MM_TYPECHECKS */
 
-#if CONFIG_ARM64_PGTABLE_LEVELS == 2
+#if CONFIG_PGTABLE_LEVELS == 2
 #include <asm-generic/pgtable-nopmd.h>
-#elif CONFIG_ARM64_PGTABLE_LEVELS == 3
+#elif CONFIG_PGTABLE_LEVELS == 3
 #include <asm-generic/pgtable-nopud.h>
 #endif
+4 -4
arch/arm64/include/asm/pgtable.h
···
  */
 #define mk_pte(page,prot)	pfn_pte(page_to_pfn(page),prot)
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 
 #define pmd_ERROR(pmd)	__pmd_error(__FILE__, __LINE__, pmd_val(pmd))
 
···
 
 #define pud_page(pud)	pfn_to_page(__phys_to_pfn(pud_val(pud) & PHYS_MASK))
 
-#endif	/* CONFIG_ARM64_PGTABLE_LEVELS > 2 */
+#endif	/* CONFIG_PGTABLE_LEVELS > 2 */
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 
 #define pud_ERROR(pud)	__pud_error(__FILE__, __LINE__, pud_val(pud))
 
···
 
 #define pgd_page(pgd)	pfn_to_page(__phys_to_pfn(pgd_val(pgd) & PHYS_MASK))
 
-#endif	/* CONFIG_ARM64_PGTABLE_LEVELS > 3 */
+#endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 
 #define pgd_ERROR(pgd)	__pgd_error(__FILE__, __LINE__, pgd_val(pgd))
+2 -2
arch/arm64/include/asm/tlb.h
···
 	tlb_remove_entry(tlb, pte);
 }
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
···
 }
 #endif
 
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
+2
arch/arm64/mm/init.c
···
 	min = PFN_UP(memblock_start_of_DRAM());
 	max = PFN_DOWN(memblock_end_of_DRAM());
 
+	early_memtest(min << PAGE_SHIFT, max << PAGE_SHIFT);
+
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
 	 * done after the fixed reservations.
+12 -8
arch/arm64/mm/mmap.c
···
 	return sysctl_legacy_va_layout;
 }
 
-static unsigned long mmap_rnd(void)
+unsigned long arch_mmap_rnd(void)
 {
-	unsigned long rnd = 0;
+	unsigned long rnd;
 
-	if (current->flags & PF_RANDOMIZE)
-		rnd = (long)get_random_int() & STACK_RND_MASK;
+	rnd = (unsigned long)get_random_int() & STACK_RND_MASK;
 
 	return rnd << PAGE_SHIFT;
 }
 
-static unsigned long mmap_base(void)
+static unsigned long mmap_base(unsigned long rnd)
 {
 	unsigned long gap = rlimit(RLIMIT_STACK);
···
 	else if (gap > MAX_GAP)
 		gap = MAX_GAP;
 
-	return PAGE_ALIGN(STACK_TOP - gap - mmap_rnd());
+	return PAGE_ALIGN(STACK_TOP - gap - rnd);
 }
 
 /*
···
  */
 void arch_pick_mmap_layout(struct mm_struct *mm)
 {
+	unsigned long random_factor = 0UL;
+
+	if (current->flags & PF_RANDOMIZE)
+		random_factor = arch_mmap_rnd();
+
 	/*
 	 * Fall back to the standard layout if the personality bit is set, or
 	 * if the expected stack growth is unlimited:
 	 */
 	if (mmap_is_legacy()) {
-		mm->mmap_base = TASK_UNMAPPED_BASE;
+		mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
 		mm->get_unmapped_area = arch_get_unmapped_area;
 	} else {
-		mm->mmap_base = mmap_base();
+		mm->mmap_base = mmap_base(random_factor);
 		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
 	}
 }
+2 -2
arch/arm64/mm/mmu.c
···
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
 static pte_t bm_pte[PTRS_PER_PTE] __page_aligned_bss;
-#if CONFIG_ARM64_PGTABLE_LEVELS > 2
+#if CONFIG_PGTABLE_LEVELS > 2
 static pmd_t bm_pmd[PTRS_PER_PMD] __page_aligned_bss;
 #endif
-#if CONFIG_ARM64_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 3
 static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss;
 #endif
+5 -13
arch/ia64/Kconfig
···
+config PGTABLE_LEVELS
+	int "Page Table Levels" if !IA64_PAGE_SIZE_64KB
+	range 3 4 if !IA64_PAGE_SIZE_64KB
+	default 3
+
 source "init/Kconfig"
 
 source "kernel/Kconfig.freezer"
···
 config IA64_PAGE_SIZE_64KB
 	depends on !ITANIUM
 	bool "64KB"
-
-endchoice
-
-choice
-	prompt "Page Table Levels"
-	default PGTABLE_3
-
-config PGTABLE_3
-	bool "3 Levels"
-
-config PGTABLE_4
-	depends on !IA64_PAGE_SIZE_64KB
-	bool "4 Levels"
 
 endchoice
+2 -2
arch/ia64/include/asm/page.h
···
  */
 typedef struct { unsigned long pte; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 typedef struct { unsigned long pud; } pud_t;
 #endif
 typedef struct { unsigned long pgd; } pgd_t;
···
 
 # define pte_val(x)	((x).pte)
 # define pmd_val(x)	((x).pmd)
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 # define pud_val(x)	((x).pud)
 #endif
 # define pgd_val(x)	((x).pgd)
+2 -2
arch/ia64/include/asm/pgalloc.h
···
 	quicklist_free(0, NULL, pgd);
 }
 
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 static inline void
 pgd_populate(struct mm_struct *mm, pgd_t * pgd_entry, pud_t * pud)
 {
···
 	quicklist_free(0, NULL, pud);
 }
 #define __pud_free_tlb(tlb, pud, address)	pud_free((tlb)->mm, pud)
-#endif /* CONFIG_PGTABLE_4 */
+#endif /* CONFIG_PGTABLE_LEVELS == 4 */
 
 static inline void
 pud_populate(struct mm_struct *mm, pud_t * pud_entry, pmd_t * pmd)
+6 -6
arch/ia64/include/asm/pgtable.h
···
 #define PMD_MASK	(~(PMD_SIZE-1))
 #define PTRS_PER_PMD	(1UL << (PTRS_PER_PTD_SHIFT))
 
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 /*
  * Definitions for second level:
  *
···
  *
  * PGDIR_SHIFT determines what a first-level page table entry can map.
  */
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 #define PGDIR_SHIFT		(PUD_SHIFT + (PTRS_PER_PTD_SHIFT))
 #else
 #define PGDIR_SHIFT		(PMD_SHIFT + (PTRS_PER_PTD_SHIFT))
···
 #define __S111	__pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RWX)
 
 #define pgd_ERROR(e)	printk("%s:%d: bad pgd %016lx.\n", __FILE__, __LINE__, pgd_val(e))
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 #define pud_ERROR(e)	printk("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
 #endif
 #define pmd_ERROR(e)	printk("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
···
 #define pud_page_vaddr(pud)		((unsigned long) __va(pud_val(pud) & _PFN_MASK))
 #define pud_page(pud)			virt_to_page((pud_val(pud) + PAGE_OFFSET))
 
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 #define pgd_none(pgd)			(!pgd_val(pgd))
 #define pgd_bad(pgd)			(!ia64_phys_addr_valid(pgd_val(pgd)))
 #define pgd_present(pgd)		(pgd_val(pgd) != 0UL)
···
    here.  */
 #define pgd_offset_gate(mm, addr)	pgd_offset_k(addr)
 
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 /* Find an entry in the second-level page table.. */
 #define pud_offset(dir,addr) \
 	((pud_t *) pgd_page_vaddr(*(dir)) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1)))
···
 #define __HAVE_ARCH_PGD_OFFSET_GATE
 
 
-#ifndef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 3
 #include <asm-generic/pgtable-nopud.h>
 #endif
 #include <asm-generic/pgtable.h>
+6 -6
arch/ia64/kernel/ivt.S
···
 (p6)	dep r17=r18,r19,3,(PAGE_SHIFT-3)	// r17=pgd_offset for region 5
 (p7)	dep r17=r18,r17,3,(PAGE_SHIFT-6)	// r17=pgd_offset for region[0-4]
 	cmp.eq p7,p6=0,r21			// unused address bits all zeroes?
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 	shr.u r28=r22,PUD_SHIFT			// shift pud index into position
 #else
 	shr.u r18=r22,PMD_SHIFT			// shift pmd index into position
···
 	ld8 r17=[r17]				// get *pgd (may be 0)
 	;;
 (p7)	cmp.eq p6,p7=r17,r0			// was pgd_present(*pgd) == NULL?
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 	dep r28=r28,r17,3,(PAGE_SHIFT-3)	// r28=pud_offset(pgd,addr)
 	;;
 	shr.u r18=r22,PMD_SHIFT			// shift pmd index into position
···
 	 */
 	ld8 r25=[r21]				// read *pte again
 	ld8 r26=[r17]				// read *pmd again
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 	ld8 r19=[r28]				// read *pud again
 #endif
 	cmp.ne p6,p7=r0,r0
 	;;
 	cmp.ne.or.andcm p6,p7=r26,r20		// did *pmd change
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 	cmp.ne.or.andcm p6,p7=r19,r29		// did *pud change
 #endif
 	mov r27=PAGE_SHIFT<<2
···
 (p6)	dep r17=r18,r19,3,(PAGE_SHIFT-3)	// r17=pgd_offset for region 5
 (p7)	dep r17=r18,r17,3,(PAGE_SHIFT-6)	// r17=pgd_offset for region[0-4]
 	cmp.eq p7,p6=0,r21			// unused address bits all zeroes?
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 	shr.u r18=r22,PUD_SHIFT			// shift pud index into position
 #else
 	shr.u r18=r22,PMD_SHIFT			// shift pmd index into position
···
 (p7)	cmp.eq p6,p7=r17,r0			// was pgd_present(*pgd) == NULL?
 	dep r17=r18,r17,3,(PAGE_SHIFT-3)	// r17=p[u|m]d_offset(pgd,addr)
 	;;
-#ifdef CONFIG_PGTABLE_4
+#if CONFIG_PGTABLE_LEVELS == 4
 (p7)	ld8 r17=[r17]				// get *pud (may be 0)
 	shr.u r18=r22,PMD_SHIFT			// shift pmd index into position
 	;;
+2 -2
arch/ia64/kernel/machine_kexec.c
···
 	VMCOREINFO_OFFSET(node_memblk_s, start_paddr);
 	VMCOREINFO_OFFSET(node_memblk_s, size);
 #endif
-#ifdef CONFIG_PGTABLE_3
+#if CONFIG_PGTABLE_LEVELS == 3
 	VMCOREINFO_CONFIG(PGTABLE_3);
-#elif defined(CONFIG_PGTABLE_4)
+#elif CONFIG_PGTABLE_LEVELS == 4
 	VMCOREINFO_CONFIG(PGTABLE_4);
 #endif
 }
+4
arch/m68k/Kconfig
···
 	default 1000 if CLEOPATRA
 	default 100
 
+config PGTABLE_LEVELS
+	default 2 if SUN3 || COLDFIRE
+	default 3
+
 source "init/Kconfig"
 
 source "kernel/Kconfig.freezer"
+6 -1
arch/mips/Kconfig
···
 	select HAVE_KRETPROBES
 	select HAVE_DEBUG_KMEMLEAK
 	select HAVE_SYSCALL_TRACEPOINTS
-	select ARCH_BINFMT_ELF_RANDOMIZE_PIE
+	select ARCH_HAS_ELF_RANDOMIZE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if CPU_SUPPORTS_HUGEPAGES && 64BIT
 	select RTC_LIB if !MACH_LOONGSON
 	select GENERIC_ATOMIC64 if !64BIT
···
 config STACKTRACE_SUPPORT
 	bool
 	default y
+
+config PGTABLE_LEVELS
+	int
+	default 3 if 64BIT && !PAGE_SIZE_64KB
+	default 2
 
 source "init/Kconfig"
-4
arch/mips/include/asm/elf.h
··· 410 410 extern int arch_setup_additional_pages(struct linux_binprm *bprm, 411 411 int uses_interp); 412 412 413 - struct mm_struct; 414 - extern unsigned long arch_randomize_brk(struct mm_struct *mm); 415 - #define arch_randomize_brk arch_randomize_brk 416 - 417 413 struct arch_elf_state { 418 414 int fp_abi; 419 415 int interp_fp_abi;
+16 -8
arch/mips/mm/mmap.c
··· 142 142 addr0, len, pgoff, flags, DOWN); 143 143 } 144 144 145 + unsigned long arch_mmap_rnd(void) 146 + { 147 + unsigned long rnd; 148 + 149 + rnd = (unsigned long)get_random_int(); 150 + rnd <<= PAGE_SHIFT; 151 + if (TASK_IS_32BIT_ADDR) 152 + rnd &= 0xfffffful; 153 + else 154 + rnd &= 0xffffffful; 155 + 156 + return rnd; 157 + } 158 + 145 159 void arch_pick_mmap_layout(struct mm_struct *mm) 146 160 { 147 161 unsigned long random_factor = 0UL; 148 162 149 - if (current->flags & PF_RANDOMIZE) { 150 - random_factor = get_random_int(); 151 - random_factor = random_factor << PAGE_SHIFT; 152 - if (TASK_IS_32BIT_ADDR) 153 - random_factor &= 0xfffffful; 154 - else 155 - random_factor &= 0xffffffful; 156 - } 163 + if (current->flags & PF_RANDOMIZE) 164 + random_factor = arch_mmap_rnd(); 157 165 158 166 if (mmap_is_legacy()) { 159 167 mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
+5
arch/parisc/Kconfig
··· 103 103 depends on BROKEN 104 104 default y 105 105 106 + config PGTABLE_LEVELS 107 + int 108 + default 3 if 64BIT && PARISC_PAGE_SIZE_4KB 109 + default 2 110 + 106 111 source "init/Kconfig" 107 112 108 113 source "kernel/Kconfig.freezer"
+1 -1
arch/parisc/include/asm/pgalloc.h
··· 51 51 free_pages((unsigned long)pgd, PGD_ALLOC_ORDER); 52 52 } 53 53 54 - #if PT_NLEVELS == 3 54 + #if CONFIG_PGTABLE_LEVELS == 3 55 55 56 56 /* Three Level Page Table Support for pmd's */ 57 57
+7 -9
arch/parisc/include/asm/pgtable.h
··· 68 68 #define KERNEL_INITIAL_ORDER 24 /* 0 to 1<<24 = 16MB */ 69 69 #define KERNEL_INITIAL_SIZE (1 << KERNEL_INITIAL_ORDER) 70 70 71 - #if defined(CONFIG_64BIT) && defined(CONFIG_PARISC_PAGE_SIZE_4KB) 72 - #define PT_NLEVELS 3 71 + #if CONFIG_PGTABLE_LEVELS == 3 73 72 #define PGD_ORDER 1 /* Number of pages per pgd */ 74 73 #define PMD_ORDER 1 /* Number of pages per pmd */ 75 74 #define PGD_ALLOC_ORDER 2 /* first pgd contains pmd */ 76 75 #else 77 - #define PT_NLEVELS 2 78 76 #define PGD_ORDER 1 /* Number of pages per pgd */ 79 77 #define PGD_ALLOC_ORDER PGD_ORDER 80 78 #endif ··· 91 93 #define PMD_SHIFT (PLD_SHIFT + BITS_PER_PTE) 92 94 #define PMD_SIZE (1UL << PMD_SHIFT) 93 95 #define PMD_MASK (~(PMD_SIZE-1)) 94 - #if PT_NLEVELS == 3 96 + #if CONFIG_PGTABLE_LEVELS == 3 95 97 #define BITS_PER_PMD (PAGE_SHIFT + PMD_ORDER - BITS_PER_PMD_ENTRY) 96 98 #else 97 99 #define __PAGETABLE_PMD_FOLDED ··· 275 277 #define pgd_flag(x) (pgd_val(x) & PxD_FLAG_MASK) 276 278 #define pgd_address(x) ((unsigned long)(pgd_val(x) &~ PxD_FLAG_MASK) << PxD_VALUE_SHIFT) 277 279 278 - #if PT_NLEVELS == 3 280 + #if CONFIG_PGTABLE_LEVELS == 3 279 281 /* The first entry of the permanent pmd is not there if it contains 280 282 * the gateway marker */ 281 283 #define pmd_none(x) (!pmd_val(x) || pmd_flag(x) == PxD_FLAG_ATTACHED) ··· 285 287 #define pmd_bad(x) (!(pmd_flag(x) & PxD_FLAG_VALID)) 286 288 #define pmd_present(x) (pmd_flag(x) & PxD_FLAG_PRESENT) 287 289 static inline void pmd_clear(pmd_t *pmd) { 288 - #if PT_NLEVELS == 3 290 + #if CONFIG_PGTABLE_LEVELS == 3 289 291 if (pmd_flag(*pmd) & PxD_FLAG_ATTACHED) 290 292 /* This is the entry pointing to the permanent pmd 291 293 * attached to the pgd; cannot clear it */ ··· 297 299 298 300 299 301 300 - #if PT_NLEVELS == 3 302 + #if CONFIG_PGTABLE_LEVELS == 3 301 303 #define pgd_page_vaddr(pgd) ((unsigned long) __va(pgd_address(pgd))) 302 304 #define pgd_page(pgd) virt_to_page((void *)pgd_page_vaddr(pgd)) 303 305 ··· 307 309 #define pgd_bad(x) 
(!(pgd_flag(x) & PxD_FLAG_VALID)) 308 310 #define pgd_present(x) (pgd_flag(x) & PxD_FLAG_PRESENT) 309 311 static inline void pgd_clear(pgd_t *pgd) { 310 - #if PT_NLEVELS == 3 312 + #if CONFIG_PGTABLE_LEVELS == 3 311 313 if(pgd_flag(*pgd) & PxD_FLAG_ATTACHED) 312 314 /* This is the permanent pmd attached to the pgd; cannot 313 315 * free it */ ··· 391 393 392 394 /* Find an entry in the second-level page table.. */ 393 395 394 - #if PT_NLEVELS == 3 396 + #if CONFIG_PGTABLE_LEVELS == 3 395 397 #define pmd_offset(dir,address) \ 396 398 ((pmd_t *) pgd_page_vaddr(*(dir)) + (((address)>>PMD_SHIFT) & (PTRS_PER_PMD-1))) 397 399 #else
+2 -2
arch/parisc/kernel/entry.S
··· 398 398 * can address up to 1TB 399 399 */ 400 400 .macro L2_ptep pmd,pte,index,va,fault 401 - #if PT_NLEVELS == 3 401 + #if CONFIG_PGTABLE_LEVELS == 3 402 402 extru \va,31-ASM_PMD_SHIFT,ASM_BITS_PER_PMD,\index 403 403 #else 404 404 # if defined(CONFIG_64BIT) ··· 436 436 * all ILP32 processes and all the kernel for machines with 437 437 * under 4GB of memory) */ 438 438 .macro L3_ptep pgd,pte,index,va,fault 439 - #if PT_NLEVELS == 3 /* we might have a 2-Level scheme, e.g. with 16kb page size */ 439 + #if CONFIG_PGTABLE_LEVELS == 3 /* we might have a 2-Level scheme, e.g. with 16kb page size */ 440 440 extrd,u \va,63-ASM_PGDIR_SHIFT,ASM_BITS_PER_PGD,\index 441 441 copy %r0,\pte 442 442 extrd,u,*= \va,63-ASM_PGDIR_SHIFT,64-ASM_PGDIR_SHIFT,%r0
+2 -2
arch/parisc/kernel/head.S
··· 74 74 mtctl %r4,%cr24 /* Initialize kernel root pointer */ 75 75 mtctl %r4,%cr25 /* Initialize user root pointer */ 76 76 77 - #if PT_NLEVELS == 3 77 + #if CONFIG_PGTABLE_LEVELS == 3 78 78 /* Set pmd in pgd */ 79 79 load32 PA(pmd0),%r5 80 80 shrd %r5,PxD_VALUE_SHIFT,%r3 ··· 97 97 stw %r3,0(%r4) 98 98 ldo (PAGE_SIZE >> PxD_VALUE_SHIFT)(%r3),%r3 99 99 addib,> -1,%r1,1b 100 - #if PT_NLEVELS == 3 100 + #if CONFIG_PGTABLE_LEVELS == 3 101 101 ldo ASM_PMD_ENTRY_SIZE(%r4),%r4 102 102 #else 103 103 ldo ASM_PGD_ENTRY_SIZE(%r4),%r4
+1 -1
arch/parisc/mm/init.c
··· 34 34 extern int data_start; 35 35 extern void parisc_kernel_start(void); /* Kernel entry point in head.S */ 36 36 37 - #if PT_NLEVELS == 3 37 + #if CONFIG_PGTABLE_LEVELS == 3 38 38 /* NOTE: This layout exactly conforms to the hybrid L2/L3 page table layout 39 39 * with the first pmd adjacent to the pgd and below it. gcc doesn't actually 40 40 * guarantee that global objects will be laid out in memory in the same order
+7 -1
arch/powerpc/Kconfig
··· 88 88 select ARCH_MIGHT_HAVE_PC_PARPORT 89 89 select ARCH_MIGHT_HAVE_PC_SERIO 90 90 select BINFMT_ELF 91 - select ARCH_BINFMT_ELF_RANDOMIZE_PIE 91 + select ARCH_HAS_ELF_RANDOMIZE 92 92 select OF 93 93 select OF_EARLY_FLATTREE 94 94 select OF_RESERVED_MEM ··· 296 296 config ZONE_DMA32 297 297 bool 298 298 default y if PPC64 299 + 300 + config PGTABLE_LEVELS 301 + int 302 + default 2 if !PPC64 303 + default 3 if PPC_64K_PAGES 304 + default 4 299 305 300 306 source "init/Kconfig" 301 307
-4
arch/powerpc/include/asm/elf.h
··· 128 128 (0x7ff >> (PAGE_SHIFT - 12)) : \ 129 129 (0x3ffff >> (PAGE_SHIFT - 12))) 130 130 131 - extern unsigned long arch_randomize_brk(struct mm_struct *mm); 132 - #define arch_randomize_brk arch_randomize_brk 133 - 134 - 135 131 #ifdef CONFIG_SPU_BASE 136 132 /* Notes used in ET_CORE. Note name is "SPU/<fd>/<filename>". */ 137 133 #define NT_SPU 1
+16 -12
arch/powerpc/mm/mmap.c
··· 53 53 return sysctl_legacy_va_layout; 54 54 } 55 55 56 - static unsigned long mmap_rnd(void) 56 + unsigned long arch_mmap_rnd(void) 57 57 { 58 - unsigned long rnd = 0; 58 + unsigned long rnd; 59 59 60 - if (current->flags & PF_RANDOMIZE) { 61 - /* 8MB for 32bit, 1GB for 64bit */ 62 - if (is_32bit_task()) 63 - rnd = (long)(get_random_int() % (1<<(23-PAGE_SHIFT))); 64 - else 65 - rnd = (long)(get_random_int() % (1<<(30-PAGE_SHIFT))); 66 - } 60 + /* 8MB for 32bit, 1GB for 64bit */ 61 + if (is_32bit_task()) 62 + rnd = (unsigned long)get_random_int() % (1<<(23-PAGE_SHIFT)); 63 + else 64 + rnd = (unsigned long)get_random_int() % (1<<(30-PAGE_SHIFT)); 65 + 67 66 return rnd << PAGE_SHIFT; 68 67 } 69 68 70 - static inline unsigned long mmap_base(void) 69 + static inline unsigned long mmap_base(unsigned long rnd) 71 70 { 72 71 unsigned long gap = rlimit(RLIMIT_STACK); 73 72 ··· 75 76 else if (gap > MAX_GAP) 76 77 gap = MAX_GAP; 77 78 78 - return PAGE_ALIGN(TASK_SIZE - gap - mmap_rnd()); 79 + return PAGE_ALIGN(TASK_SIZE - gap - rnd); 79 80 } 80 81 81 82 /* ··· 84 85 */ 85 86 void arch_pick_mmap_layout(struct mm_struct *mm) 86 87 { 88 + unsigned long random_factor = 0UL; 89 + 90 + if (current->flags & PF_RANDOMIZE) 91 + random_factor = arch_mmap_rnd(); 92 + 87 93 /* 88 94 * Fall back to the standard layout if the personality 89 95 * bit is set, or if the expected stack growth is unlimited: ··· 97 93 mm->mmap_base = TASK_UNMAPPED_BASE; 98 94 mm->get_unmapped_area = arch_get_unmapped_area; 99 95 } else { 100 - mm->mmap_base = mmap_base(); 96 + mm->mmap_base = mmap_base(random_factor); 101 97 mm->get_unmapped_area = arch_get_unmapped_area_topdown; 102 98 } 103 99 }
+6
arch/s390/Kconfig
··· 65 65 def_bool y 66 66 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE 67 67 select ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS 68 + select ARCH_HAS_ELF_RANDOMIZE 68 69 select ARCH_HAS_GCOV_PROFILE_ALL 69 70 select ARCH_HAS_SG_CHAIN 70 71 select ARCH_HAVE_NMI_SAFE_CMPXCHG ··· 156 155 157 156 config SCHED_OMIT_FRAME_POINTER 158 157 def_bool y 158 + 159 + config PGTABLE_LEVELS 160 + int 161 + default 4 if 64BIT 162 + default 2 159 163 160 164 source "init/Kconfig" 161 165
+5 -7
arch/s390/include/asm/elf.h
··· 161 161 /* This is the location that an ET_DYN program is loaded if exec'ed. Typical 162 162 use of this is to invoke "./ld.so someprog" to test out a new version of 163 163 the loader. We need to make sure that it is out of the way of the program 164 - that it will "exec", and that there is sufficient room for the brk. */ 165 - 166 - extern unsigned long randomize_et_dyn(void); 167 - #define ELF_ET_DYN_BASE randomize_et_dyn() 164 + that it will "exec", and that there is sufficient room for the brk. 64-bit 165 + tasks are aligned to 4GB. */ 166 + #define ELF_ET_DYN_BASE (is_32bit_task() ? \ 167 + (STACK_TOP / 3 * 2) : \ 168 + (STACK_TOP / 3 * 2) & ~((1UL << 32) - 1)) 168 169 169 170 /* This yields a mask that user programs can use to figure out what 170 171 instruction set this CPU supports. */ ··· 225 224 226 225 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1 227 226 int arch_setup_additional_pages(struct linux_binprm *, int); 228 - 229 - extern unsigned long arch_randomize_brk(struct mm_struct *mm); 230 - #define arch_randomize_brk arch_randomize_brk 231 227 232 228 void *fill_cpu_elf_notes(void *ptr, struct save_area *sa, __vector128 *vxrs); 233 229
+19 -22
arch/s390/mm/mmap.c
··· 60 60 return sysctl_legacy_va_layout; 61 61 } 62 62 63 - static unsigned long mmap_rnd(void) 63 + unsigned long arch_mmap_rnd(void) 64 64 { 65 - if (!(current->flags & PF_RANDOMIZE)) 66 - return 0; 67 65 if (is_32bit_task()) 68 66 return (get_random_int() & 0x7ff) << PAGE_SHIFT; 69 67 else 70 68 return (get_random_int() & mmap_rnd_mask) << PAGE_SHIFT; 71 69 } 72 70 73 - static unsigned long mmap_base_legacy(void) 71 + static unsigned long mmap_base_legacy(unsigned long rnd) 74 72 { 75 - return TASK_UNMAPPED_BASE + mmap_rnd(); 73 + return TASK_UNMAPPED_BASE + rnd; 76 74 } 77 75 78 - static inline unsigned long mmap_base(void) 76 + static inline unsigned long mmap_base(unsigned long rnd) 79 77 { 80 78 unsigned long gap = rlimit(RLIMIT_STACK); 81 79 ··· 82 84 else if (gap > MAX_GAP) 83 85 gap = MAX_GAP; 84 86 gap &= PAGE_MASK; 85 - return STACK_TOP - stack_maxrandom_size() - mmap_rnd() - gap; 87 + return STACK_TOP - stack_maxrandom_size() - rnd - gap; 86 88 } 87 89 88 90 unsigned long ··· 177 179 return addr; 178 180 } 179 181 180 - unsigned long randomize_et_dyn(void) 181 - { 182 - unsigned long base; 183 - 184 - base = STACK_TOP / 3 * 2; 185 - if (!is_32bit_task()) 186 - /* Align to 4GB */ 187 - base &= ~((1UL << 32) - 1); 188 - return base + mmap_rnd(); 189 - } 190 - 191 182 #ifndef CONFIG_64BIT 192 183 193 184 /* ··· 185 198 */ 186 199 void arch_pick_mmap_layout(struct mm_struct *mm) 187 200 { 201 + unsigned long random_factor = 0UL; 202 + 203 + if (current->flags & PF_RANDOMIZE) 204 + random_factor = arch_mmap_rnd(); 205 + 188 206 /* 189 207 * Fall back to the standard layout if the personality 190 208 * bit is set, or if the expected stack growth is unlimited: 191 209 */ 192 210 if (mmap_is_legacy()) { 193 - mm->mmap_base = mmap_base_legacy(); 211 + mm->mmap_base = mmap_base_legacy(random_factor); 194 212 mm->get_unmapped_area = arch_get_unmapped_area; 195 213 } else { 196 - mm->mmap_base = mmap_base(); 214 + mm->mmap_base = mmap_base(random_factor); 197 215 
mm->get_unmapped_area = arch_get_unmapped_area_topdown; 198 216 } 199 217 } ··· 265 273 */ 266 274 void arch_pick_mmap_layout(struct mm_struct *mm) 267 275 { 276 + unsigned long random_factor = 0UL; 277 + 278 + if (current->flags & PF_RANDOMIZE) 279 + random_factor = arch_mmap_rnd(); 280 + 268 281 /* 269 282 * Fall back to the standard layout if the personality 270 283 * bit is set, or if the expected stack growth is unlimited: 271 284 */ 272 285 if (mmap_is_legacy()) { 273 - mm->mmap_base = mmap_base_legacy(); 286 + mm->mmap_base = mmap_base_legacy(random_factor); 274 287 mm->get_unmapped_area = s390_get_unmapped_area; 275 288 } else { 276 - mm->mmap_base = mmap_base(); 289 + mm->mmap_base = mmap_base(random_factor); 277 290 mm->get_unmapped_area = s390_get_unmapped_area_topdown; 278 291 } 279 292 }
+4
arch/sh/Kconfig
··· 162 162 config NEED_SG_DMA_LENGTH 163 163 def_bool y 164 164 165 + config PGTABLE_LEVELS 166 + default 3 if X2TLB 167 + default 2 168 + 165 169 source "init/Kconfig" 166 170 167 171 source "kernel/Kconfig.freezer"
+9 -9
arch/sh/kernel/dwarf.c
··· 993 993 .rating = 150, 994 994 }; 995 995 996 - static void dwarf_unwinder_cleanup(void) 996 + static void __init dwarf_unwinder_cleanup(void) 997 997 { 998 998 struct dwarf_fde *fde, *next_fde; 999 999 struct dwarf_cie *cie, *next_cie; ··· 1009 1009 rbtree_postorder_for_each_entry_safe(cie, next_cie, &cie_root, node) 1010 1010 kfree(cie); 1011 1011 1012 + if (dwarf_reg_pool) 1013 + mempool_destroy(dwarf_reg_pool); 1014 + if (dwarf_frame_pool) 1015 + mempool_destroy(dwarf_frame_pool); 1012 1016 kmem_cache_destroy(dwarf_reg_cachep); 1013 1017 kmem_cache_destroy(dwarf_frame_cachep); 1014 1018 } ··· 1180 1176 sizeof(struct dwarf_reg), 0, 1181 1177 SLAB_PANIC | SLAB_HWCACHE_ALIGN | SLAB_NOTRACK, NULL); 1182 1178 1183 - dwarf_frame_pool = mempool_create(DWARF_FRAME_MIN_REQ, 1184 - mempool_alloc_slab, 1185 - mempool_free_slab, 1186 - dwarf_frame_cachep); 1179 + dwarf_frame_pool = mempool_create_slab_pool(DWARF_FRAME_MIN_REQ, 1180 + dwarf_frame_cachep); 1187 1181 if (!dwarf_frame_pool) 1188 1182 goto out; 1189 1183 1190 - dwarf_reg_pool = mempool_create(DWARF_REG_MIN_REQ, 1191 - mempool_alloc_slab, 1192 - mempool_free_slab, 1193 - dwarf_reg_cachep); 1184 + dwarf_reg_pool = mempool_create_slab_pool(DWARF_REG_MIN_REQ, 1185 + dwarf_reg_cachep); 1194 1186 if (!dwarf_reg_pool) 1195 1187 goto out; 1196 1188
+4
arch/sparc/Kconfig
··· 146 146 config ARCH_SUPPORTS_DEBUG_PAGEALLOC 147 147 def_bool y if SPARC64 148 148 149 + config PGTABLE_LEVELS 150 + default 4 if 64BIT 151 + default 3 152 + 149 153 source "init/Kconfig" 150 154 151 155 source "kernel/Kconfig.freezer"
+11 -11
arch/sparc/kernel/mdesc.c
··· 130 130 static struct mdesc_handle *mdesc_kmalloc(unsigned int mdesc_size) 131 131 { 132 132 unsigned int handle_size; 133 + struct mdesc_handle *hp; 134 + unsigned long addr; 133 135 void *base; 134 136 135 137 handle_size = (sizeof(struct mdesc_handle) - 136 138 sizeof(struct mdesc_hdr) + 137 139 mdesc_size); 138 140 141 + /* 142 + * Allocation has to succeed because mdesc update would be missed 143 + * and such events are not retransmitted. 144 + */ 139 145 base = kmalloc(handle_size + 15, GFP_KERNEL | __GFP_NOFAIL); 140 - if (base) { 141 - struct mdesc_handle *hp; 142 - unsigned long addr; 146 + addr = (unsigned long)base; 147 + addr = (addr + 15UL) & ~15UL; 148 + hp = (struct mdesc_handle *) addr; 143 149 144 - addr = (unsigned long)base; 145 - addr = (addr + 15UL) & ~15UL; 146 - hp = (struct mdesc_handle *) addr; 150 + mdesc_handle_init(hp, handle_size, base); 147 151 148 - mdesc_handle_init(hp, handle_size, base); 149 - return hp; 150 - } 151 - 152 - return NULL; 152 + return hp; 153 153 } 154 154 155 155 static void mdesc_kfree(struct mdesc_handle *hp)
+5
arch/tile/Kconfig
··· 147 147 default "arch/tile/configs/tilepro_defconfig" if !TILEGX 148 148 default "arch/tile/configs/tilegx_defconfig" if TILEGX 149 149 150 + config PGTABLE_LEVELS 151 + int 152 + default 3 if 64BIT 153 + default 2 154 + 150 155 source "init/Kconfig" 151 156 152 157 source "kernel/Kconfig.freezer"
+5
arch/um/Kconfig.um
··· 155 155 156 156 config NO_DMA 157 157 def_bool y 158 + 159 + config PGTABLE_LEVELS 160 + int 161 + default 3 if 3_LEVEL_PGTABLES 162 + default 2
+8 -12
arch/x86/Kconfig
··· 87 87 select HAVE_ARCH_KMEMCHECK 88 88 select HAVE_ARCH_KASAN if X86_64 && SPARSEMEM_VMEMMAP 89 89 select HAVE_USER_RETURN_NOTIFIER 90 - select ARCH_BINFMT_ELF_RANDOMIZE_PIE 90 + select ARCH_HAS_ELF_RANDOMIZE 91 91 select HAVE_ARCH_JUMP_LABEL 92 92 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE 93 93 select SPARSE_IRQ ··· 99 99 select IRQ_FORCED_THREADING 100 100 select HAVE_BPF_JIT if X86_64 101 101 select HAVE_ARCH_TRANSPARENT_HUGEPAGE 102 + select HAVE_ARCH_HUGE_VMAP if X86_64 || (X86_32 && X86_PAE) 102 103 select ARCH_HAS_SG_CHAIN 103 104 select CLKEVT_I8253 104 105 select ARCH_HAVE_NMI_SAFE_CMPXCHG ··· 277 276 278 277 config FIX_EARLYCON_MEM 279 278 def_bool y 279 + 280 + config PGTABLE_LEVELS 281 + int 282 + default 4 if X86_64 283 + default 3 if X86_PAE 284 + default 2 280 285 281 286 source "init/Kconfig" 282 287 source "kernel/Kconfig.freezer" ··· 720 713 721 714 config NO_BOOTMEM 722 715 def_bool y 723 - 724 - config MEMTEST 725 - bool "Memtest" 726 - ---help--- 727 - This option adds a kernel parameter 'memtest', which allows memtest 728 - to be set. 729 - memtest=0, mean disabled; -- default 730 - memtest=1, mean do 1 test pattern; 731 - ... 732 - memtest=4, mean do 4 test patterns. 733 - If you are unsure how to answer this question, answer N. 734 716 735 717 source "arch/x86/Kconfig.cpu" 736 718
-8
arch/x86/include/asm/e820.h
··· 40 40 } 41 41 #endif 42 42 43 - #ifdef CONFIG_MEMTEST 44 - extern void early_memtest(unsigned long start, unsigned long end); 45 - #else 46 - static inline void early_memtest(unsigned long start, unsigned long end) 47 - { 48 - } 49 - #endif 50 - 51 43 extern unsigned long e820_end_of_ram_pfn(void); 52 44 extern unsigned long e820_end_of_low_ram_pfn(void); 53 45 extern u64 early_reserve_e820(u64 sizet, u64 align);
-3
arch/x86/include/asm/elf.h
··· 339 339 int uses_interp); 340 340 #define compat_arch_setup_additional_pages compat_arch_setup_additional_pages 341 341 342 - extern unsigned long arch_randomize_brk(struct mm_struct *mm); 343 - #define arch_randomize_brk arch_randomize_brk 344 - 345 342 /* 346 343 * True on X86_32 or when emulating IA32 on X86_64 347 344 */
+2
arch/x86/include/asm/page_types.h
··· 40 40 41 41 #ifdef CONFIG_X86_64 42 42 #include <asm/page_64_types.h> 43 + #define IOREMAP_MAX_ORDER (PUD_SHIFT) 43 44 #else 44 45 #include <asm/page_32_types.h> 46 + #define IOREMAP_MAX_ORDER (PMD_SHIFT) 45 47 #endif /* CONFIG_X86_64 */ 46 48 47 49 #ifndef __ASSEMBLY__
+4 -4
arch/x86/include/asm/paravirt.h
··· 545 545 PVOP_VCALL2(pv_mmu_ops.set_pmd, pmdp, val); 546 546 } 547 547 548 - #if PAGETABLE_LEVELS >= 3 548 + #if CONFIG_PGTABLE_LEVELS >= 3 549 549 static inline pmd_t __pmd(pmdval_t val) 550 550 { 551 551 pmdval_t ret; ··· 585 585 PVOP_VCALL2(pv_mmu_ops.set_pud, pudp, 586 586 val); 587 587 } 588 - #if PAGETABLE_LEVELS == 4 588 + #if CONFIG_PGTABLE_LEVELS == 4 589 589 static inline pud_t __pud(pudval_t val) 590 590 { 591 591 pudval_t ret; ··· 636 636 set_pud(pudp, __pud(0)); 637 637 } 638 638 639 - #endif /* PAGETABLE_LEVELS == 4 */ 639 + #endif /* CONFIG_PGTABLE_LEVELS == 4 */ 640 640 641 - #endif /* PAGETABLE_LEVELS >= 3 */ 641 + #endif /* CONFIG_PGTABLE_LEVELS >= 3 */ 642 642 643 643 #ifdef CONFIG_X86_PAE 644 644 /* Special-case pte-setting operations for PAE, which can't update a
+4 -4
arch/x86/include/asm/paravirt_types.h
··· 294 294 struct paravirt_callee_save pgd_val; 295 295 struct paravirt_callee_save make_pgd; 296 296 297 - #if PAGETABLE_LEVELS >= 3 297 + #if CONFIG_PGTABLE_LEVELS >= 3 298 298 #ifdef CONFIG_X86_PAE 299 299 void (*set_pte_atomic)(pte_t *ptep, pte_t pteval); 300 300 void (*pte_clear)(struct mm_struct *mm, unsigned long addr, ··· 308 308 struct paravirt_callee_save pmd_val; 309 309 struct paravirt_callee_save make_pmd; 310 310 311 - #if PAGETABLE_LEVELS == 4 311 + #if CONFIG_PGTABLE_LEVELS == 4 312 312 struct paravirt_callee_save pud_val; 313 313 struct paravirt_callee_save make_pud; 314 314 315 315 void (*set_pgd)(pgd_t *pudp, pgd_t pgdval); 316 - #endif /* PAGETABLE_LEVELS == 4 */ 317 - #endif /* PAGETABLE_LEVELS >= 3 */ 316 + #endif /* CONFIG_PGTABLE_LEVELS == 4 */ 317 + #endif /* CONFIG_PGTABLE_LEVELS >= 3 */ 318 318 319 319 struct pv_lazy_ops lazy_mode; 320 320
+4 -4
arch/x86/include/asm/pgalloc.h
··· 77 77 78 78 #define pmd_pgtable(pmd) pmd_page(pmd) 79 79 80 - #if PAGETABLE_LEVELS > 2 80 + #if CONFIG_PGTABLE_LEVELS > 2 81 81 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) 82 82 { 83 83 struct page *page; ··· 116 116 } 117 117 #endif /* CONFIG_X86_PAE */ 118 118 119 - #if PAGETABLE_LEVELS > 3 119 + #if CONFIG_PGTABLE_LEVELS > 3 120 120 static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) 121 121 { 122 122 paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT); ··· 142 142 ___pud_free_tlb(tlb, pud); 143 143 } 144 144 145 - #endif /* PAGETABLE_LEVELS > 3 */ 146 - #endif /* PAGETABLE_LEVELS > 2 */ 145 + #endif /* CONFIG_PGTABLE_LEVELS > 3 */ 146 + #endif /* CONFIG_PGTABLE_LEVELS > 2 */ 147 147 148 148 #endif /* _ASM_X86_PGALLOC_H */
-1
arch/x86/include/asm/pgtable-2level_types.h
··· 17 17 #endif /* !__ASSEMBLY__ */ 18 18 19 19 #define SHARED_KERNEL_PMD 0 20 - #define PAGETABLE_LEVELS 2 21 20 22 21 /* 23 22 * traditional i386 two-level paging structure:
-2
arch/x86/include/asm/pgtable-3level_types.h
··· 24 24 #define SHARED_KERNEL_PMD 1 25 25 #endif 26 26 27 - #define PAGETABLE_LEVELS 3 28 - 29 27 /* 30 28 * PGDIR_SHIFT determines what a top-level page table entry can map 31 29 */
+4 -4
arch/x86/include/asm/pgtable.h
··· 551 551 return npg >> (20 - PAGE_SHIFT); 552 552 } 553 553 554 - #if PAGETABLE_LEVELS > 2 554 + #if CONFIG_PGTABLE_LEVELS > 2 555 555 static inline int pud_none(pud_t pud) 556 556 { 557 557 return native_pud_val(pud) == 0; ··· 594 594 { 595 595 return 0; 596 596 } 597 - #endif /* PAGETABLE_LEVELS > 2 */ 597 + #endif /* CONFIG_PGTABLE_LEVELS > 2 */ 598 598 599 - #if PAGETABLE_LEVELS > 3 599 + #if CONFIG_PGTABLE_LEVELS > 3 600 600 static inline int pgd_present(pgd_t pgd) 601 601 { 602 602 return pgd_flags(pgd) & _PAGE_PRESENT; ··· 633 633 { 634 634 return !native_pgd_val(pgd); 635 635 } 636 - #endif /* PAGETABLE_LEVELS > 3 */ 636 + #endif /* CONFIG_PGTABLE_LEVELS > 3 */ 637 637 638 638 #endif /* __ASSEMBLY__ */ 639 639
-1
arch/x86/include/asm/pgtable_64_types.h
··· 20 20 #endif /* !__ASSEMBLY__ */ 21 21 22 22 #define SHARED_KERNEL_PMD 0 23 - #define PAGETABLE_LEVELS 4 24 23 25 24 /* 26 25 * PGDIR_SHIFT determines what a top-level page table entry can map
+2 -2
arch/x86/include/asm/pgtable_types.h
··· 234 234 return native_pgd_val(pgd) & PTE_FLAGS_MASK; 235 235 } 236 236 237 - #if PAGETABLE_LEVELS > 3 237 + #if CONFIG_PGTABLE_LEVELS > 3 238 238 typedef struct { pudval_t pud; } pud_t; 239 239 240 240 static inline pud_t native_make_pud(pmdval_t val) ··· 255 255 } 256 256 #endif 257 257 258 - #if PAGETABLE_LEVELS > 2 258 + #if CONFIG_PGTABLE_LEVELS > 2 259 259 typedef struct { pmdval_t pmd; } pmd_t; 260 260 261 261 static inline pmd_t native_make_pmd(pmdval_t val)
+1 -1
arch/x86/kernel/kvm.c
··· 513 513 * can get false positives too easily, for example if the host is 514 514 * overcommitted. 515 515 */ 516 - watchdog_enable_hardlockup_detector(false); 516 + hardlockup_detector_disable(); 517 517 } 518 518 519 519 static noinline uint32_t __kvm_cpuid_base(void)
+3 -3
arch/x86/kernel/paravirt.c
··· 443 443 .ptep_modify_prot_start = __ptep_modify_prot_start, 444 444 .ptep_modify_prot_commit = __ptep_modify_prot_commit, 445 445 446 - #if PAGETABLE_LEVELS >= 3 446 + #if CONFIG_PGTABLE_LEVELS >= 3 447 447 #ifdef CONFIG_X86_PAE 448 448 .set_pte_atomic = native_set_pte_atomic, 449 449 .pte_clear = native_pte_clear, ··· 454 454 .pmd_val = PTE_IDENT, 455 455 .make_pmd = PTE_IDENT, 456 456 457 - #if PAGETABLE_LEVELS == 4 457 + #if CONFIG_PGTABLE_LEVELS == 4 458 458 .pud_val = PTE_IDENT, 459 459 .make_pud = PTE_IDENT, 460 460 461 461 .set_pgd = native_set_pgd, 462 462 #endif 463 - #endif /* PAGETABLE_LEVELS >= 3 */ 463 + #endif /* CONFIG_PGTABLE_LEVELS >= 3 */ 464 464 465 465 .pte_val = PTE_IDENT, 466 466 .pgd_val = PTE_IDENT,
-2
arch/x86/mm/Makefile
··· 32 32 obj-$(CONFIG_ACPI_NUMA) += srat.o 33 33 obj-$(CONFIG_NUMA_EMU) += numa_emulation.o 34 34 35 - obj-$(CONFIG_MEMTEST) += memtest.o 36 - 37 35 obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
+21 -2
arch/x86/mm/ioremap.c
··· 67 67 68 68 /* 69 69 * Remap an arbitrary physical address space into the kernel virtual 70 - * address space. Needed when the kernel wants to access high addresses 71 - * directly. 70 + * address space. It transparently creates kernel huge I/O mapping when 71 + * the physical address is aligned by a huge page size (1GB or 2MB) and 72 + * the requested size is at least the huge page size. 73 + * 74 + * NOTE: MTRRs can override PAT memory types with a 4KB granularity. 75 + * Therefore, the mapping code falls back to use a smaller page toward 4KB 76 + * when a mapping range is covered by non-WB type of MTRRs. 72 77 * 73 78 * NOTE! We need to allow non-page-aligned mappings too: we will obviously 74 79 * have to convert them into an offset in a page-aligned mapping, but the ··· 330 325 kfree(p); 331 326 } 332 327 EXPORT_SYMBOL(iounmap); 328 + 329 + int arch_ioremap_pud_supported(void) 330 + { 331 + #ifdef CONFIG_X86_64 332 + return cpu_has_gbpages; 333 + #else 334 + return 0; 335 + #endif 336 + } 337 + 338 + int arch_ioremap_pmd_supported(void) 339 + { 340 + return cpu_has_pse; 341 + } 333 342 334 343 /* 335 344 * Convert a physical pointer to a virtual kernel pointer for /dev/mem
+8 -8
arch/x86/mm/memtest.c mm/memtest.c
··· 29 29 0x7a6c7258554e494cULL, /* yeah ;-) */ 30 30 }; 31 31 32 - static void __init reserve_bad_mem(u64 pattern, u64 start_bad, u64 end_bad) 32 + static void __init reserve_bad_mem(u64 pattern, phys_addr_t start_bad, phys_addr_t end_bad) 33 33 { 34 34 printk(KERN_INFO " %016llx bad mem addr %010llx - %010llx reserved\n", 35 35 (unsigned long long) pattern, ··· 38 38 memblock_reserve(start_bad, end_bad - start_bad); 39 39 } 40 40 41 - static void __init memtest(u64 pattern, u64 start_phys, u64 size) 41 + static void __init memtest(u64 pattern, phys_addr_t start_phys, phys_addr_t size) 42 42 { 43 43 u64 *p, *start, *end; 44 - u64 start_bad, last_bad; 45 - u64 start_phys_aligned; 44 + phys_addr_t start_bad, last_bad; 45 + phys_addr_t start_phys_aligned; 46 46 const size_t incr = sizeof(pattern); 47 47 48 48 start_phys_aligned = ALIGN(start_phys, incr); ··· 69 69 reserve_bad_mem(pattern, start_bad, last_bad + incr); 70 70 } 71 71 72 - static void __init do_one_pass(u64 pattern, u64 start, u64 end) 72 + static void __init do_one_pass(u64 pattern, phys_addr_t start, phys_addr_t end) 73 73 { 74 74 u64 i; 75 75 phys_addr_t this_start, this_end; 76 76 77 77 for_each_free_mem_range(i, NUMA_NO_NODE, &this_start, &this_end, NULL) { 78 - this_start = clamp_t(phys_addr_t, this_start, start, end); 79 - this_end = clamp_t(phys_addr_t, this_end, start, end); 78 + this_start = clamp(this_start, start, end); 79 + this_end = clamp(this_end, start, end); 80 80 if (this_start < this_end) { 81 81 printk(KERN_INFO " %010llx - %010llx pattern %016llx\n", 82 82 (unsigned long long)this_start, ··· 102 102 103 103 early_param("memtest", parse_memtest); 104 104 105 - void __init early_memtest(unsigned long start, unsigned long end) 105 + void __init early_memtest(phys_addr_t start, phys_addr_t end) 106 106 { 107 107 unsigned int i; 108 108 unsigned int idx = 0;
+21 -17
arch/x86/mm/mmap.c
··· 65 65 return sysctl_legacy_va_layout; 66 66 } 67 67 68 - static unsigned long mmap_rnd(void) 68 + unsigned long arch_mmap_rnd(void) 69 69 { 70 - unsigned long rnd = 0; 70 + unsigned long rnd; 71 71 72 72 /* 73 - * 8 bits of randomness in 32bit mmaps, 20 address space bits 74 - * 28 bits of randomness in 64bit mmaps, 40 address space bits 75 - */ 76 - if (current->flags & PF_RANDOMIZE) { 77 - if (mmap_is_ia32()) 78 - rnd = get_random_int() % (1<<8); 79 - else 80 - rnd = get_random_int() % (1<<28); 81 - } 73 + * 8 bits of randomness in 32bit mmaps, 20 address space bits 74 + * 28 bits of randomness in 64bit mmaps, 40 address space bits 75 + */ 76 + if (mmap_is_ia32()) 77 + rnd = (unsigned long)get_random_int() % (1<<8); 78 + else 79 + rnd = (unsigned long)get_random_int() % (1<<28); 80 + 82 81 return rnd << PAGE_SHIFT; 83 82 } 84 83 85 - static unsigned long mmap_base(void) 84 + static unsigned long mmap_base(unsigned long rnd) 86 85 { 87 86 unsigned long gap = rlimit(RLIMIT_STACK); 88 87 ··· 90 91 else if (gap > MAX_GAP) 91 92 gap = MAX_GAP; 92 93 93 - return PAGE_ALIGN(TASK_SIZE - gap - mmap_rnd()); 94 + return PAGE_ALIGN(TASK_SIZE - gap - rnd); 94 95 } 95 96 96 97 /* 97 98 * Bottom-up (legacy) layout on X86_32 did not support randomization, X86_64 98 99 * does, but not when emulating X86_32 99 100 */ 100 - static unsigned long mmap_legacy_base(void) 101 + static unsigned long mmap_legacy_base(unsigned long rnd) 101 102 { 102 103 if (mmap_is_ia32()) 103 104 return TASK_UNMAPPED_BASE; 104 105 else 105 - return TASK_UNMAPPED_BASE + mmap_rnd(); 106 + return TASK_UNMAPPED_BASE + rnd; 106 107 } 107 108 108 109 /* ··· 111 112 */ 112 113 void arch_pick_mmap_layout(struct mm_struct *mm) 113 114 { 114 - mm->mmap_legacy_base = mmap_legacy_base(); 115 - mm->mmap_base = mmap_base(); 115 + unsigned long random_factor = 0UL; 116 + 117 + if (current->flags & PF_RANDOMIZE) 118 + random_factor = arch_mmap_rnd(); 119 + 120 + mm->mmap_legacy_base = 
mmap_legacy_base(random_factor); 116 121 117 122 if (mmap_is_legacy()) { 118 123 mm->mmap_base = mm->mmap_legacy_base; 119 124 mm->get_unmapped_area = arch_get_unmapped_area; 120 125 } else { 126 + mm->mmap_base = mmap_base(random_factor); 121 127 mm->get_unmapped_area = arch_get_unmapped_area_topdown; 122 128 } 123 129 }
+72 -7
arch/x86/mm/pgtable.c
··· 4 4 #include <asm/pgtable.h> 5 5 #include <asm/tlb.h> 6 6 #include <asm/fixmap.h> 7 + #include <asm/mtrr.h> 7 8 8 9 #define PGALLOC_GFP GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO 9 10 ··· 59 58 tlb_remove_page(tlb, pte); 60 59 } 61 60 62 - #if PAGETABLE_LEVELS > 2 61 + #if CONFIG_PGTABLE_LEVELS > 2 63 62 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd) 64 63 { 65 64 struct page *page = virt_to_page(pmd); ··· 75 74 tlb_remove_page(tlb, page); 76 75 } 77 76 78 - #if PAGETABLE_LEVELS > 3 77 + #if CONFIG_PGTABLE_LEVELS > 3 79 78 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud) 80 79 { 81 80 paravirt_release_pud(__pa(pud) >> PAGE_SHIFT); 82 81 tlb_remove_page(tlb, virt_to_page(pud)); 83 82 } 84 - #endif /* PAGETABLE_LEVELS > 3 */ 85 - #endif /* PAGETABLE_LEVELS > 2 */ 83 + #endif /* CONFIG_PGTABLE_LEVELS > 3 */ 84 + #endif /* CONFIG_PGTABLE_LEVELS > 2 */ 86 85 87 86 static inline void pgd_list_add(pgd_t *pgd) 88 87 { ··· 118 117 /* If the pgd points to a shared pagetable level (either the 119 118 ptes in non-PAE, or shared PMD in PAE), then just copy the 120 119 references from swapper_pg_dir. */ 121 - if (PAGETABLE_LEVELS == 2 || 122 - (PAGETABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || 123 - PAGETABLE_LEVELS == 4) { 120 + if (CONFIG_PGTABLE_LEVELS == 2 || 121 + (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || 122 + CONFIG_PGTABLE_LEVELS == 4) { 124 123 clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY, 125 124 swapper_pg_dir + KERNEL_PGD_BOUNDARY, 126 125 KERNEL_PGD_PTRS); ··· 561 560 { 562 561 __native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags)); 563 562 } 563 + 564 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 565 + int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) 566 + { 567 + u8 mtrr; 568 + 569 + /* 570 + * Do not use a huge page when the range is covered by non-WB type 571 + * of MTRRs. 
572 + */ 573 + mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE); 574 + if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF)) 575 + return 0; 576 + 577 + prot = pgprot_4k_2_large(prot); 578 + 579 + set_pte((pte_t *)pud, pfn_pte( 580 + (u64)addr >> PAGE_SHIFT, 581 + __pgprot(pgprot_val(prot) | _PAGE_PSE))); 582 + 583 + return 1; 584 + } 585 + 586 + int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot) 587 + { 588 + u8 mtrr; 589 + 590 + /* 591 + * Do not use a huge page when the range is covered by non-WB type 592 + * of MTRRs. 593 + */ 594 + mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE); 595 + if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF)) 596 + return 0; 597 + 598 + prot = pgprot_4k_2_large(prot); 599 + 600 + set_pte((pte_t *)pmd, pfn_pte( 601 + (u64)addr >> PAGE_SHIFT, 602 + __pgprot(pgprot_val(prot) | _PAGE_PSE))); 603 + 604 + return 1; 605 + } 606 + 607 + int pud_clear_huge(pud_t *pud) 608 + { 609 + if (pud_large(*pud)) { 610 + pud_clear(pud); 611 + return 1; 612 + } 613 + 614 + return 0; 615 + } 616 + 617 + int pmd_clear_huge(pmd_t *pmd) 618 + { 619 + if (pmd_large(*pmd)) { 620 + pmd_clear(pmd); 621 + return 1; 622 + } 623 + 624 + return 0; 625 + } 626 + #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+7 -7
arch/x86/xen/mmu.c
··· 502 502 } 503 503 PV_CALLEE_SAVE_REGS_THUNK(xen_make_pmd); 504 504 505 - #if PAGETABLE_LEVELS == 4 505 + #if CONFIG_PGTABLE_LEVELS == 4 506 506 __visible pudval_t xen_pud_val(pud_t pud) 507 507 { 508 508 return pte_mfn_to_pfn(pud.pud); ··· 589 589 590 590 xen_mc_issue(PARAVIRT_LAZY_MMU); 591 591 } 592 - #endif /* PAGETABLE_LEVELS == 4 */ 592 + #endif /* CONFIG_PGTABLE_LEVELS == 4 */ 593 593 594 594 /* 595 595 * (Yet another) pagetable walker. This one is intended for pinning a ··· 1628 1628 xen_release_ptpage(pfn, PT_PMD); 1629 1629 } 1630 1630 1631 - #if PAGETABLE_LEVELS == 4 1631 + #if CONFIG_PGTABLE_LEVELS == 4 1632 1632 static void xen_alloc_pud(struct mm_struct *mm, unsigned long pfn) 1633 1633 { 1634 1634 xen_alloc_ptpage(mm, pfn, PT_PUD); ··· 2046 2046 pv_mmu_ops.set_pte = xen_set_pte; 2047 2047 pv_mmu_ops.set_pmd = xen_set_pmd; 2048 2048 pv_mmu_ops.set_pud = xen_set_pud; 2049 - #if PAGETABLE_LEVELS == 4 2049 + #if CONFIG_PGTABLE_LEVELS == 4 2050 2050 pv_mmu_ops.set_pgd = xen_set_pgd; 2051 2051 #endif 2052 2052 ··· 2056 2056 pv_mmu_ops.alloc_pmd = xen_alloc_pmd; 2057 2057 pv_mmu_ops.release_pte = xen_release_pte; 2058 2058 pv_mmu_ops.release_pmd = xen_release_pmd; 2059 - #if PAGETABLE_LEVELS == 4 2059 + #if CONFIG_PGTABLE_LEVELS == 4 2060 2060 pv_mmu_ops.alloc_pud = xen_alloc_pud; 2061 2061 pv_mmu_ops.release_pud = xen_release_pud; 2062 2062 #endif ··· 2122 2122 .make_pmd = PV_CALLEE_SAVE(xen_make_pmd), 2123 2123 .pmd_val = PV_CALLEE_SAVE(xen_pmd_val), 2124 2124 2125 - #if PAGETABLE_LEVELS == 4 2125 + #if CONFIG_PGTABLE_LEVELS == 4 2126 2126 .pud_val = PV_CALLEE_SAVE(xen_pud_val), 2127 2127 .make_pud = PV_CALLEE_SAVE(xen_make_pud), 2128 2128 .set_pgd = xen_set_pgd_hyper, 2129 2129 2130 2130 .alloc_pud = xen_alloc_pmd_init, 2131 2131 .release_pud = xen_release_pmd_init, 2132 - #endif /* PAGETABLE_LEVELS == 4 */ 2132 + #endif /* CONFIG_PGTABLE_LEVELS == 4 */ 2133 2133 2134 2134 .activate_mm = xen_activate_mm, 2135 2135 .dup_mmap = xen_dup_mmap,
+13 -8
drivers/base/memory.c
··· 219 219 /* 220 220 * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is 221 221 * OK to have direct references to sparsemem variables in here. 222 + * Must already be protected by mem_hotplug_begin(). 222 223 */ 223 224 static int 224 225 memory_block_action(unsigned long phys_index, unsigned long action, int online_type) ··· 229 228 struct page *first_page; 230 229 int ret; 231 230 232 - start_pfn = phys_index << PFN_SECTION_SHIFT; 231 + start_pfn = section_nr_to_pfn(phys_index); 233 232 first_page = pfn_to_page(start_pfn); 234 233 235 234 switch (action) { ··· 287 286 if (mem->online_type < 0) 288 287 mem->online_type = MMOP_ONLINE_KEEP; 289 288 289 + /* Already under protection of mem_hotplug_begin() */ 290 290 ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); 291 291 292 292 /* clear online_type */ ··· 330 328 goto err; 331 329 } 332 330 331 + /* 332 + * Memory hotplug needs to hold mem_hotplug_begin() for probe to find 333 + * the correct memory block to online before doing device_online(dev), 334 + * which will take dev->mutex. Take the lock early to prevent an 335 + * inversion, memory_subsys_online() callbacks will be implemented by 336 + * assuming it's already protected. 337 + */ 338 + mem_hotplug_begin(); 339 + 333 340 switch (online_type) { 334 341 case MMOP_ONLINE_KERNEL: 335 342 case MMOP_ONLINE_MOVABLE: 336 343 case MMOP_ONLINE_KEEP: 337 - /* 338 - * mem->online_type is not protected so there can be a 339 - * race here. However, when racing online, the first 340 - * will succeed and the second will just return as the 341 - * block will already be online. The online type 342 - * could be either one, but that is expected. 343 - */ 344 344 mem->online_type = online_type; 345 345 ret = device_online(&mem->dev); 346 346 break; ··· 353 349 ret = -EINVAL; /* should never happen */ 354 350 } 355 351 352 + mem_hotplug_done(); 356 353 err: 357 354 unlock_device_hotplug(); 358 355
+2 -2
drivers/s390/scsi/zfcp_erp.c
··· 738 738 return ZFCP_ERP_FAILED; 739 739 740 740 if (mempool_resize(act->adapter->pool.sr_data, 741 - act->adapter->stat_read_buf_num, GFP_KERNEL)) 741 + act->adapter->stat_read_buf_num)) 742 742 return ZFCP_ERP_FAILED; 743 743 744 744 if (mempool_resize(act->adapter->pool.status_read_req, 745 - act->adapter->stat_read_buf_num, GFP_KERNEL)) 745 + act->adapter->stat_read_buf_num)) 746 746 return ZFCP_ERP_FAILED; 747 747 748 748 atomic_set(&act->adapter->stat_miss, act->adapter->stat_read_buf_num);
+3 -1
drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
··· 55 55 if (PagePrivate(page)) 56 56 page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); 57 57 58 - cancel_dirty_page(page, PAGE_SIZE); 58 + if (TestClearPageDirty(page)) 59 + account_page_cleaned(page, mapping); 60 + 59 61 ClearPageMappedToDisk(page); 60 62 ll_delete_from_page_cache(page); 61 63 }
+9 -7
drivers/xen/tmem.c
··· 397 397 #ifdef CONFIG_CLEANCACHE 398 398 BUG_ON(sizeof(struct cleancache_filekey) != sizeof(struct tmem_oid)); 399 399 if (tmem_enabled && cleancache) { 400 - char *s = ""; 401 - struct cleancache_ops *old_ops = 402 - cleancache_register_ops(&tmem_cleancache_ops); 403 - if (old_ops) 404 - s = " (WARNING: cleancache_ops overridden)"; 405 - pr_info("cleancache enabled, RAM provided by Xen Transcendent Memory%s\n", 406 - s); 400 + int err; 401 + 402 + err = cleancache_register_ops(&tmem_cleancache_ops); 403 + if (err) 404 + pr_warn("xen-tmem: failed to enable cleancache: %d\n", 405 + err); 406 + else 407 + pr_info("cleancache enabled, RAM provided by " 408 + "Xen Transcendent Memory\n"); 407 409 } 408 410 #endif 409 411 #ifdef CONFIG_XEN_SELFBALLOONING
-3
fs/Kconfig.binfmt
··· 27 27 bool 28 28 depends on COMPAT && BINFMT_ELF 29 29 30 - config ARCH_BINFMT_ELF_RANDOMIZE_PIE 31 - bool 32 - 33 30 config ARCH_BINFMT_ELF_STATE 34 31 bool 35 32
+13 -18
fs/binfmt_elf.c
··· 31 31 #include <linux/security.h> 32 32 #include <linux/random.h> 33 33 #include <linux/elf.h> 34 + #include <linux/elf-randomize.h> 34 35 #include <linux/utsname.h> 35 36 #include <linux/coredump.h> 36 37 #include <linux/sched.h> ··· 863 862 i < loc->elf_ex.e_phnum; i++, elf_ppnt++) { 864 863 int elf_prot = 0, elf_flags; 865 864 unsigned long k, vaddr; 865 + unsigned long total_size = 0; 866 866 867 867 if (elf_ppnt->p_type != PT_LOAD) 868 868 continue; ··· 911 909 * default mmap base, as well as whatever program they 912 910 * might try to exec. This is because the brk will 913 911 * follow the loader, and is not movable. */ 914 - #ifdef CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE 915 - /* Memory randomization might have been switched off 916 - * in runtime via sysctl or explicit setting of 917 - * personality flags. 918 - * If that is the case, retain the original non-zero 919 - * load_bias value in order to establish proper 920 - * non-randomized mappings. 921 - */ 912 + load_bias = ELF_ET_DYN_BASE - vaddr; 922 913 if (current->flags & PF_RANDOMIZE) 923 - load_bias = 0; 924 - else 925 - load_bias = ELF_PAGESTART(ELF_ET_DYN_BASE - vaddr); 926 - #else 927 - load_bias = ELF_PAGESTART(ELF_ET_DYN_BASE - vaddr); 928 - #endif 914 + load_bias += arch_mmap_rnd(); 915 + load_bias = ELF_PAGESTART(load_bias); 916 + total_size = total_mapping_size(elf_phdata, 917 + loc->elf_ex.e_phnum); 918 + if (!total_size) { 919 + error = -EINVAL; 920 + goto out_free_dentry; 921 + } 929 922 } 930 923 931 924 error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt, 932 - elf_prot, elf_flags, 0); 925 + elf_prot, elf_flags, total_size); 933 926 if (BAD_ADDR(error)) { 934 927 retval = IS_ERR((void *)error) ? 
935 928 PTR_ERR((void*)error) : -EINVAL; ··· 1050 1053 current->mm->end_data = end_data; 1051 1054 current->mm->start_stack = bprm->p; 1052 1055 1053 - #ifdef arch_randomize_brk 1054 1056 if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) { 1055 1057 current->mm->brk = current->mm->start_brk = 1056 1058 arch_randomize_brk(current->mm); 1057 - #ifdef CONFIG_COMPAT_BRK 1059 + #ifdef compat_brk_randomized 1058 1060 current->brk_randomized = 1; 1059 1061 #endif 1060 1062 } 1061 - #endif 1062 1063 1063 1064 if (current->personality & MMAP_PAGE_ZERO) { 1064 1065 /* Why this, you ask??? Well SVr4 maps page 0 as read-only,
+2 -2
fs/buffer.c
··· 3243 3243 * to synchronise against __set_page_dirty_buffers and prevent the 3244 3244 * dirty bit from being lost. 3245 3245 */ 3246 - if (ret) 3247 - cancel_dirty_page(page, PAGE_CACHE_SIZE); 3246 + if (ret && TestClearPageDirty(page)) 3247 + account_page_cleaned(page, mapping); 3248 3248 spin_unlock(&mapping->private_lock); 3249 3249 out: 3250 3250 if (buffers_to_free) {
+2 -4
fs/cifs/connect.c
··· 773 773 774 774 length = atomic_dec_return(&tcpSesAllocCount); 775 775 if (length > 0) 776 - mempool_resize(cifs_req_poolp, length + cifs_min_rcv, 777 - GFP_KERNEL); 776 + mempool_resize(cifs_req_poolp, length + cifs_min_rcv); 778 777 } 779 778 780 779 static int ··· 847 848 848 849 length = atomic_inc_return(&tcpSesAllocCount); 849 850 if (length > 1) 850 - mempool_resize(cifs_req_poolp, length + cifs_min_rcv, 851 - GFP_KERNEL); 851 + mempool_resize(cifs_req_poolp, length + cifs_min_rcv); 852 852 853 853 set_freezable(); 854 854 while (server->tcpStatus != CifsExiting) {
+1 -1
fs/hugetlbfs/inode.c
··· 319 319 320 320 static void truncate_huge_page(struct page *page) 321 321 { 322 - cancel_dirty_page(page, /* No IO accounting for huge pages? */0); 322 + ClearPageDirty(page); 323 323 ClearPageUptodate(page); 324 324 delete_from_page_cache(page); 325 325 }
-5
fs/nfs/write.c
··· 1876 1876 * request from the inode / page_private pointer and 1877 1877 * release it */ 1878 1878 nfs_inode_remove_request(req); 1879 - /* 1880 - * In case nfs_inode_remove_request has marked the 1881 - * page as being dirty 1882 - */ 1883 - cancel_dirty_page(page, PAGE_CACHE_SIZE); 1884 1879 nfs_unlock_and_release_request(req); 1885 1880 } 1886 1881
+22 -26
fs/ocfs2/alloc.c
··· 3370 3370 ret = ocfs2_get_right_path(et, left_path, &right_path); 3371 3371 if (ret) { 3372 3372 mlog_errno(ret); 3373 - goto out; 3373 + return ret; 3374 3374 } 3375 3375 3376 3376 right_el = path_leaf_el(right_path); ··· 3453 3453 subtree_index); 3454 3454 } 3455 3455 out: 3456 - if (right_path) 3457 - ocfs2_free_path(right_path); 3456 + ocfs2_free_path(right_path); 3458 3457 return ret; 3459 3458 } 3460 3459 ··· 3535 3536 ret = ocfs2_get_left_path(et, right_path, &left_path); 3536 3537 if (ret) { 3537 3538 mlog_errno(ret); 3538 - goto out; 3539 + return ret; 3539 3540 } 3540 3541 3541 3542 left_el = path_leaf_el(left_path); ··· 3646 3647 right_path, subtree_index); 3647 3648 } 3648 3649 out: 3649 - if (left_path) 3650 - ocfs2_free_path(left_path); 3650 + ocfs2_free_path(left_path); 3651 3651 return ret; 3652 3652 } 3653 3653 ··· 4332 4334 } else if (path->p_tree_depth > 0) { 4333 4335 status = ocfs2_find_cpos_for_left_leaf(sb, path, &left_cpos); 4334 4336 if (status) 4335 - goto out; 4337 + goto exit; 4336 4338 4337 4339 if (left_cpos != 0) { 4338 4340 left_path = ocfs2_new_path_from_path(path); 4339 4341 if (!left_path) 4340 - goto out; 4342 + goto exit; 4341 4343 4342 4344 status = ocfs2_find_path(et->et_ci, left_path, 4343 4345 left_cpos); 4344 4346 if (status) 4345 - goto out; 4347 + goto free_left_path; 4346 4348 4347 4349 new_el = path_leaf_el(left_path); 4348 4350 ··· 4359 4361 le16_to_cpu(new_el->l_next_free_rec), 4360 4362 le16_to_cpu(new_el->l_count)); 4361 4363 status = -EINVAL; 4362 - goto out; 4364 + goto free_left_path; 4363 4365 } 4364 4366 rec = &new_el->l_recs[ 4365 4367 le16_to_cpu(new_el->l_next_free_rec) - 1]; ··· 4386 4388 path->p_tree_depth > 0) { 4387 4389 status = ocfs2_find_cpos_for_right_leaf(sb, path, &right_cpos); 4388 4390 if (status) 4389 - goto out; 4391 + goto free_left_path; 4390 4392 4391 4393 if (right_cpos == 0) 4392 - goto out; 4394 + goto free_left_path; 4393 4395 4394 4396 right_path = ocfs2_new_path_from_path(path); 
4395 4397 if (!right_path) 4396 - goto out; 4398 + goto free_left_path; 4397 4399 4398 4400 status = ocfs2_find_path(et->et_ci, right_path, right_cpos); 4399 4401 if (status) 4400 - goto out; 4402 + goto free_right_path; 4401 4403 4402 4404 new_el = path_leaf_el(right_path); 4403 4405 rec = &new_el->l_recs[0]; ··· 4411 4413 (unsigned long long)le64_to_cpu(eb->h_blkno), 4412 4414 le16_to_cpu(new_el->l_next_free_rec)); 4413 4415 status = -EINVAL; 4414 - goto out; 4416 + goto free_right_path; 4415 4417 } 4416 4418 rec = &new_el->l_recs[1]; 4417 4419 } ··· 4428 4430 ret = contig_type; 4429 4431 } 4430 4432 4431 - out: 4432 - if (left_path) 4433 - ocfs2_free_path(left_path); 4434 - if (right_path) 4435 - ocfs2_free_path(right_path); 4436 - 4433 + free_right_path: 4434 + ocfs2_free_path(right_path); 4435 + free_left_path: 4436 + ocfs2_free_path(left_path); 4437 + exit: 4437 4438 return ret; 4438 4439 } 4439 4440 ··· 6855 6858 if (pages == NULL) { 6856 6859 ret = -ENOMEM; 6857 6860 mlog_errno(ret); 6858 - goto out; 6861 + return ret; 6859 6862 } 6860 6863 6861 6864 ret = ocfs2_reserve_clusters(osb, 1, &data_ac); 6862 6865 if (ret) { 6863 6866 mlog_errno(ret); 6864 - goto out; 6867 + goto free_pages; 6865 6868 } 6866 6869 } 6867 6870 ··· 6993 6996 out: 6994 6997 if (data_ac) 6995 6998 ocfs2_free_alloc_context(data_ac); 6996 - if (pages) 6997 - kfree(pages); 6998 - 6999 + free_pages: 7000 + kfree(pages); 6999 7001 return ret; 7000 7002 } 7001 7003
+141 -14
fs/ocfs2/aops.c
··· 664 664 return 0; 665 665 } 666 666 667 + static int ocfs2_direct_IO_zero_extend(struct ocfs2_super *osb, 668 + struct inode *inode, loff_t offset, 669 + u64 zero_len, int cluster_align) 670 + { 671 + u32 p_cpos = 0; 672 + u32 v_cpos = ocfs2_bytes_to_clusters(osb->sb, i_size_read(inode)); 673 + unsigned int num_clusters = 0; 674 + unsigned int ext_flags = 0; 675 + int ret = 0; 676 + 677 + if (offset <= i_size_read(inode) || cluster_align) 678 + return 0; 679 + 680 + ret = ocfs2_get_clusters(inode, v_cpos, &p_cpos, &num_clusters, 681 + &ext_flags); 682 + if (ret < 0) { 683 + mlog_errno(ret); 684 + return ret; 685 + } 686 + 687 + if (p_cpos && !(ext_flags & OCFS2_EXT_UNWRITTEN)) { 688 + u64 s = i_size_read(inode); 689 + sector_t sector = (p_cpos << (osb->s_clustersize_bits - 9)) + 690 + (do_div(s, osb->s_clustersize) >> 9); 691 + 692 + ret = blkdev_issue_zeroout(osb->sb->s_bdev, sector, 693 + zero_len >> 9, GFP_NOFS, false); 694 + if (ret < 0) 695 + mlog_errno(ret); 696 + } 697 + 698 + return ret; 699 + } 700 + 701 + static int ocfs2_direct_IO_extend_no_holes(struct ocfs2_super *osb, 702 + struct inode *inode, loff_t offset) 703 + { 704 + u64 zero_start, zero_len, total_zero_len; 705 + u32 p_cpos = 0, clusters_to_add; 706 + u32 v_cpos = ocfs2_bytes_to_clusters(osb->sb, i_size_read(inode)); 707 + unsigned int num_clusters = 0; 708 + unsigned int ext_flags = 0; 709 + u32 size_div, offset_div; 710 + int ret = 0; 711 + 712 + { 713 + u64 o = offset; 714 + u64 s = i_size_read(inode); 715 + 716 + offset_div = do_div(o, osb->s_clustersize); 717 + size_div = do_div(s, osb->s_clustersize); 718 + } 719 + 720 + if (offset <= i_size_read(inode)) 721 + return 0; 722 + 723 + clusters_to_add = ocfs2_bytes_to_clusters(inode->i_sb, offset) - 724 + ocfs2_bytes_to_clusters(inode->i_sb, i_size_read(inode)); 725 + total_zero_len = offset - i_size_read(inode); 726 + if (clusters_to_add) 727 + total_zero_len -= offset_div; 728 + 729 + /* Allocate clusters to fill out holes, and this is 
only needed 730 + * when we add more than one clusters. Otherwise the cluster will 731 + * be allocated during direct IO */ 732 + if (clusters_to_add > 1) { 733 + ret = ocfs2_extend_allocation(inode, 734 + OCFS2_I(inode)->ip_clusters, 735 + clusters_to_add - 1, 0); 736 + if (ret) { 737 + mlog_errno(ret); 738 + goto out; 739 + } 740 + } 741 + 742 + while (total_zero_len) { 743 + ret = ocfs2_get_clusters(inode, v_cpos, &p_cpos, &num_clusters, 744 + &ext_flags); 745 + if (ret < 0) { 746 + mlog_errno(ret); 747 + goto out; 748 + } 749 + 750 + zero_start = ocfs2_clusters_to_bytes(osb->sb, p_cpos) + 751 + size_div; 752 + zero_len = ocfs2_clusters_to_bytes(osb->sb, num_clusters) - 753 + size_div; 754 + zero_len = min(total_zero_len, zero_len); 755 + 756 + if (p_cpos && !(ext_flags & OCFS2_EXT_UNWRITTEN)) { 757 + ret = blkdev_issue_zeroout(osb->sb->s_bdev, 758 + zero_start >> 9, zero_len >> 9, 759 + GFP_NOFS, false); 760 + if (ret < 0) { 761 + mlog_errno(ret); 762 + goto out; 763 + } 764 + } 765 + 766 + total_zero_len -= zero_len; 767 + v_cpos += ocfs2_bytes_to_clusters(osb->sb, zero_len + size_div); 768 + 769 + /* Only at first iteration can be cluster not aligned. 770 + * So set size_div to 0 for the rest */ 771 + size_div = 0; 772 + } 773 + 774 + out: 775 + return ret; 776 + } 777 + 667 778 static ssize_t ocfs2_direct_IO_write(struct kiocb *iocb, 668 779 struct iov_iter *iter, 669 780 loff_t offset) ··· 789 678 struct buffer_head *di_bh = NULL; 790 679 size_t count = iter->count; 791 680 journal_t *journal = osb->journal->j_journal; 792 - u32 zero_len; 793 - int cluster_align; 681 + u64 zero_len_head, zero_len_tail; 682 + int cluster_align_head, cluster_align_tail; 794 683 loff_t final_size = offset + count; 795 684 int append_write = offset >= i_size_read(inode) ? 
1 : 0; 796 685 unsigned int num_clusters = 0; ··· 798 687 799 688 { 800 689 u64 o = offset; 690 + u64 s = i_size_read(inode); 801 691 802 - zero_len = do_div(o, 1 << osb->s_clustersize_bits); 803 - cluster_align = !zero_len; 692 + zero_len_head = do_div(o, 1 << osb->s_clustersize_bits); 693 + cluster_align_head = !zero_len_head; 694 + 695 + zero_len_tail = osb->s_clustersize - 696 + do_div(s, osb->s_clustersize); 697 + if ((offset - i_size_read(inode)) < zero_len_tail) 698 + zero_len_tail = offset - i_size_read(inode); 699 + cluster_align_tail = !zero_len_tail; 804 700 } 805 701 806 702 /* ··· 825 707 } 826 708 827 709 if (append_write) { 828 - ret = ocfs2_inode_lock(inode, &di_bh, 1); 710 + ret = ocfs2_inode_lock(inode, NULL, 1); 829 711 if (ret < 0) { 830 712 mlog_errno(ret); 831 713 goto clean_orphan; 832 714 } 833 715 716 + /* zeroing out the previously allocated cluster tail 717 + * that but not zeroed */ 834 718 if (ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb))) 835 - ret = ocfs2_zero_extend(inode, di_bh, offset); 719 + ret = ocfs2_direct_IO_zero_extend(osb, inode, offset, 720 + zero_len_tail, cluster_align_tail); 836 721 else 837 - ret = ocfs2_extend_no_holes(inode, di_bh, offset, 722 + ret = ocfs2_direct_IO_extend_no_holes(osb, inode, 838 723 offset); 839 724 if (ret < 0) { 840 725 mlog_errno(ret); 841 726 ocfs2_inode_unlock(inode, 1); 842 - brelse(di_bh); 843 727 goto clean_orphan; 844 728 } 845 729 ··· 849 729 if (is_overwrite < 0) { 850 730 mlog_errno(is_overwrite); 851 731 ocfs2_inode_unlock(inode, 1); 852 - brelse(di_bh); 853 732 goto clean_orphan; 854 733 } 855 734 856 735 ocfs2_inode_unlock(inode, 1); 857 - brelse(di_bh); 858 - di_bh = NULL; 859 736 } 860 737 861 738 written = __blockdev_direct_IO(WRITE, iocb, inode, inode->i_sb->s_bdev, ··· 889 772 if (ret < 0) 890 773 mlog_errno(ret); 891 774 } 892 - } else if (written < 0 && append_write && !is_overwrite && 893 - !cluster_align) { 775 + } else if (written > 0 && append_write && !is_overwrite && 776 + 
!cluster_align_head) { 777 + /* zeroing out the allocated cluster head */ 894 778 u32 p_cpos = 0; 895 779 u32 v_cpos = ocfs2_bytes_to_clusters(osb->sb, offset); 780 + 781 + ret = ocfs2_inode_lock(inode, NULL, 0); 782 + if (ret < 0) { 783 + mlog_errno(ret); 784 + goto clean_orphan; 785 + } 896 786 897 787 ret = ocfs2_get_clusters(inode, v_cpos, &p_cpos, 898 788 &num_clusters, &ext_flags); 899 789 if (ret < 0) { 900 790 mlog_errno(ret); 791 + ocfs2_inode_unlock(inode, 0); 901 792 goto clean_orphan; 902 793 } 903 794 ··· 913 788 914 789 ret = blkdev_issue_zeroout(osb->sb->s_bdev, 915 790 p_cpos << (osb->s_clustersize_bits - 9), 916 - zero_len >> 9, GFP_KERNEL, false); 791 + zero_len_head >> 9, GFP_NOFS, false); 917 792 if (ret < 0) 918 793 mlog_errno(ret); 794 + 795 + ocfs2_inode_unlock(inode, 0); 919 796 } 920 797 921 798 clean_orphan:
+31 -11
fs/ocfs2/cluster/heartbeat.c
··· 1312 1312 int ret = -ENOMEM; 1313 1313 1314 1314 o2hb_debug_dir = debugfs_create_dir(O2HB_DEBUG_DIR, NULL); 1315 - if (!o2hb_debug_dir) { 1315 + if (IS_ERR_OR_NULL(o2hb_debug_dir)) { 1316 + ret = o2hb_debug_dir ? 1317 + PTR_ERR(o2hb_debug_dir) : -ENOMEM; 1316 1318 mlog_errno(ret); 1317 1319 goto bail; 1318 1320 } ··· 1327 1325 sizeof(o2hb_live_node_bitmap), 1328 1326 O2NM_MAX_NODES, 1329 1327 o2hb_live_node_bitmap); 1330 - if (!o2hb_debug_livenodes) { 1328 + if (IS_ERR_OR_NULL(o2hb_debug_livenodes)) { 1329 + ret = o2hb_debug_livenodes ? 1330 + PTR_ERR(o2hb_debug_livenodes) : -ENOMEM; 1331 1331 mlog_errno(ret); 1332 1332 goto bail; 1333 1333 } ··· 1342 1338 sizeof(o2hb_live_region_bitmap), 1343 1339 O2NM_MAX_REGIONS, 1344 1340 o2hb_live_region_bitmap); 1345 - if (!o2hb_debug_liveregions) { 1341 + if (IS_ERR_OR_NULL(o2hb_debug_liveregions)) { 1342 + ret = o2hb_debug_liveregions ? 1343 + PTR_ERR(o2hb_debug_liveregions) : -ENOMEM; 1346 1344 mlog_errno(ret); 1347 1345 goto bail; 1348 1346 } ··· 1358 1352 sizeof(o2hb_quorum_region_bitmap), 1359 1353 O2NM_MAX_REGIONS, 1360 1354 o2hb_quorum_region_bitmap); 1361 - if (!o2hb_debug_quorumregions) { 1355 + if (IS_ERR_OR_NULL(o2hb_debug_quorumregions)) { 1356 + ret = o2hb_debug_quorumregions ? 1357 + PTR_ERR(o2hb_debug_quorumregions) : -ENOMEM; 1362 1358 mlog_errno(ret); 1363 1359 goto bail; 1364 1360 } ··· 1374 1366 sizeof(o2hb_failed_region_bitmap), 1375 1367 O2NM_MAX_REGIONS, 1376 1368 o2hb_failed_region_bitmap); 1377 - if (!o2hb_debug_failedregions) { 1369 + if (IS_ERR_OR_NULL(o2hb_debug_failedregions)) { 1370 + ret = o2hb_debug_failedregions ? 1371 + PTR_ERR(o2hb_debug_failedregions) : -ENOMEM; 1378 1372 mlog_errno(ret); 1379 1373 goto bail; 1380 1374 } ··· 2010 2000 2011 2001 reg->hr_debug_dir = 2012 2002 debugfs_create_dir(config_item_name(&reg->hr_item), dir); 2013 - if (!reg->hr_debug_dir) { 2003 + if (IS_ERR_OR_NULL(reg->hr_debug_dir)) { 2004 + ret = reg->hr_debug_dir ? 
PTR_ERR(reg->hr_debug_dir) : -ENOMEM; 2014 2005 mlog_errno(ret); 2015 2006 goto bail; 2016 2007 } ··· 2024 2013 O2HB_DB_TYPE_REGION_LIVENODES, 2025 2014 sizeof(reg->hr_live_node_bitmap), 2026 2015 O2NM_MAX_NODES, reg); 2027 - if (!reg->hr_debug_livenodes) { 2016 + if (IS_ERR_OR_NULL(reg->hr_debug_livenodes)) { 2017 + ret = reg->hr_debug_livenodes ? 2018 + PTR_ERR(reg->hr_debug_livenodes) : -ENOMEM; 2028 2019 mlog_errno(ret); 2029 2020 goto bail; 2030 2021 } ··· 2038 2025 sizeof(*(reg->hr_db_regnum)), 2039 2026 O2HB_DB_TYPE_REGION_NUMBER, 2040 2027 0, O2NM_MAX_NODES, reg); 2041 - if (!reg->hr_debug_regnum) { 2028 + if (IS_ERR_OR_NULL(reg->hr_debug_regnum)) { 2029 + ret = reg->hr_debug_regnum ? 2030 + PTR_ERR(reg->hr_debug_regnum) : -ENOMEM; 2042 2031 mlog_errno(ret); 2043 2032 goto bail; 2044 2033 } ··· 2052 2037 sizeof(*(reg->hr_db_elapsed_time)), 2053 2038 O2HB_DB_TYPE_REGION_ELAPSED_TIME, 2054 2039 0, 0, reg); 2055 - if (!reg->hr_debug_elapsed_time) { 2040 + if (IS_ERR_OR_NULL(reg->hr_debug_elapsed_time)) { 2041 + ret = reg->hr_debug_elapsed_time ? 2042 + PTR_ERR(reg->hr_debug_elapsed_time) : -ENOMEM; 2056 2043 mlog_errno(ret); 2057 2044 goto bail; 2058 2045 } ··· 2066 2049 sizeof(*(reg->hr_db_pinned)), 2067 2050 O2HB_DB_TYPE_REGION_PINNED, 2068 2051 0, 0, reg); 2069 - if (!reg->hr_debug_pinned) { 2052 + if (IS_ERR_OR_NULL(reg->hr_debug_pinned)) { 2053 + ret = reg->hr_debug_pinned ? 2054 + PTR_ERR(reg->hr_debug_pinned) : -ENOMEM; 2070 2055 mlog_errno(ret); 2071 2056 goto bail; 2072 2057 } 2073 2058 2074 - ret = 0; 2059 + return 0; 2075 2060 bail: 2061 + debugfs_remove_recursive(reg->hr_debug_dir); 2076 2062 return ret; 2077 2063 } 2078 2064
+3 -2
fs/ocfs2/cluster/masklog.h
··· 196 196 } \ 197 197 } while (0) 198 198 199 - #define mlog_errno(st) do { \ 199 + #define mlog_errno(st) ({ \ 200 200 int _st = (st); \ 201 201 if (_st != -ERESTARTSYS && _st != -EINTR && \ 202 202 _st != AOP_TRUNCATED_PAGE && _st != -ENOSPC && \ 203 203 _st != -EDQUOT) \ 204 204 mlog(ML_ERROR, "status = %lld\n", (long long)_st); \ 205 - } while (0) 205 + _st; \ 206 + }) 206 207 207 208 #define mlog_bug_on_msg(cond, fmt, args...) do { \ 208 209 if (cond) { \
+6 -9
fs/ocfs2/dir.c
··· 18 18 * 19 19 * linux/fs/minix/dir.c 20 20 * 21 - * Copyright (C) 1991, 1992 Linux Torvalds 21 + * Copyright (C) 1991, 1992 Linus Torvalds 22 22 * 23 23 * This program is free software; you can redistribute it and/or 24 24 * modify it under the terms of the GNU General Public ··· 2047 2047 const char *name, 2048 2048 int namelen) 2049 2049 { 2050 - int ret; 2050 + int ret = 0; 2051 2051 struct ocfs2_dir_lookup_result lookup = { NULL, }; 2052 2052 2053 2053 trace_ocfs2_check_dir_for_entry( 2054 2054 (unsigned long long)OCFS2_I(dir)->ip_blkno, namelen, name); 2055 2055 2056 - ret = -EEXIST; 2057 - if (ocfs2_find_entry(name, namelen, dir, &lookup) == 0) 2058 - goto bail; 2056 + if (ocfs2_find_entry(name, namelen, dir, &lookup) == 0) { 2057 + ret = -EEXIST; 2058 + mlog_errno(ret); 2059 + } 2059 2060 2060 - ret = 0; 2061 - bail: 2062 2061 ocfs2_free_dir_lookup_result(&lookup); 2063 2062 2064 - if (ret) 2065 - mlog_errno(ret); 2066 2063 return ret; 2067 2064 } 2068 2065
+6 -1
fs/ocfs2/dlmglue.c
··· 1391 1391 int noqueue_attempted = 0; 1392 1392 int dlm_locked = 0; 1393 1393 1394 + if (!(lockres->l_flags & OCFS2_LOCK_INITIALIZED)) { 1395 + mlog_errno(-EINVAL); 1396 + return -EINVAL; 1397 + } 1398 + 1394 1399 ocfs2_init_mask_waiter(&mw); 1395 1400 1396 1401 if (lockres->l_ops->flags & LOCK_TYPE_USES_LVB) ··· 2959 2954 osb->osb_debug_root, 2960 2955 osb, 2961 2956 &ocfs2_dlm_debug_fops); 2962 - if (!dlm_debug->d_locking_state) { 2957 + if (IS_ERR_OR_NULL(dlm_debug->d_locking_state)) { 2963 2958 ret = -EINVAL; 2964 2959 mlog(ML_ERROR, 2965 2960 "Unable to create locking state debugfs file.\n");
+1 -1
fs/ocfs2/export.c
··· 82 82 } 83 83 84 84 status = ocfs2_test_inode_bit(osb, blkno, &set); 85 - trace_ocfs2_get_dentry_test_bit(status, set); 86 85 if (status < 0) { 87 86 if (status == -EINVAL) { 88 87 /* ··· 95 96 goto unlock_nfs_sync; 96 97 } 97 98 99 + trace_ocfs2_get_dentry_test_bit(status, set); 98 100 /* If the inode allocator bit is clear, this inode must be stale */ 99 101 if (!set) { 100 102 status = -ESTALE;
+2 -2
fs/ocfs2/inode.c
··· 624 624 ocfs2_get_system_file_inode(osb, INODE_ALLOC_SYSTEM_INODE, 625 625 le16_to_cpu(di->i_suballoc_slot)); 626 626 if (!inode_alloc_inode) { 627 - status = -EEXIST; 627 + status = -ENOENT; 628 628 mlog_errno(status); 629 629 goto bail; 630 630 } ··· 742 742 ORPHAN_DIR_SYSTEM_INODE, 743 743 orphaned_slot); 744 744 if (!orphan_dir_inode) { 745 - status = -EEXIST; 745 + status = -ENOENT; 746 746 mlog_errno(status); 747 747 goto bail; 748 748 }
+2 -2
fs/ocfs2/localalloc.c
··· 666 666 if (le32_to_cpu(alloc->id1.bitmap1.i_used) != 667 667 ocfs2_local_alloc_count_bits(alloc)) { 668 668 ocfs2_error(osb->sb, "local alloc inode %llu says it has " 669 - "%u free bits, but a count shows %u", 669 + "%u used bits, but a count shows %u", 670 670 (unsigned long long)le64_to_cpu(alloc->i_blkno), 671 671 le32_to_cpu(alloc->id1.bitmap1.i_used), 672 672 ocfs2_local_alloc_count_bits(alloc)); ··· 839 839 u32 *numbits, 840 840 struct ocfs2_alloc_reservation *resv) 841 841 { 842 - int numfound, bitoff, left, startoff, lastzero; 842 + int numfound = 0, bitoff, left, startoff, lastzero; 843 843 int local_resv = 0; 844 844 struct ocfs2_alloc_reservation r; 845 845 void *bitmap = NULL;
+3 -3
fs/ocfs2/namei.c
··· 2322 2322 2323 2323 trace_ocfs2_orphan_del( 2324 2324 (unsigned long long)OCFS2_I(orphan_dir_inode)->ip_blkno, 2325 - name, namelen); 2325 + name, strlen(name)); 2326 2326 2327 2327 /* find it's spot in the orphan directory */ 2328 - status = ocfs2_find_entry(name, namelen, orphan_dir_inode, 2328 + status = ocfs2_find_entry(name, strlen(name), orphan_dir_inode, 2329 2329 &lookup); 2330 2330 if (status) { 2331 2331 mlog_errno(status); ··· 2808 2808 ORPHAN_DIR_SYSTEM_INODE, 2809 2809 osb->slot_num); 2810 2810 if (!orphan_dir_inode) { 2811 - status = -EEXIST; 2811 + status = -ENOENT; 2812 2812 mlog_errno(status); 2813 2813 goto leave; 2814 2814 }
+1 -1
fs/ocfs2/refcounttree.c
··· 4276 4276 error = posix_acl_create(dir, &mode, &default_acl, &acl); 4277 4277 if (error) { 4278 4278 mlog_errno(error); 4279 - goto out; 4279 + return error; 4280 4280 } 4281 4281 4282 4282 error = ocfs2_create_inode_in_orphan(dir, mode,
+2 -2
fs/ocfs2/slot_map.c
··· 427 427 if (!si) { 428 428 status = -ENOMEM; 429 429 mlog_errno(status); 430 - goto bail; 430 + return status; 431 431 } 432 432 433 433 si->si_extended = ocfs2_uses_extended_slot_map(osb); ··· 452 452 453 453 osb->slot_info = (struct ocfs2_slot_info *)si; 454 454 bail: 455 - if (status < 0 && si) 455 + if (status < 0) 456 456 __ocfs2_free_slot_info(si); 457 457 458 458 return status;
+1 -1
fs/ocfs2/stack_o2cb.c
··· 295 295 set_bit(node_num, netmap); 296 296 if (!memcmp(hbmap, netmap, sizeof(hbmap))) 297 297 return 0; 298 - if (i < O2CB_MAP_STABILIZE_COUNT) 298 + if (i < O2CB_MAP_STABILIZE_COUNT - 1) 299 299 msleep(1000); 300 300 } 301 301
+3 -5
fs/ocfs2/stack_user.c
··· 1004 1004 BUG_ON(conn == NULL); 1005 1005 1006 1006 lc = kzalloc(sizeof(struct ocfs2_live_connection), GFP_KERNEL); 1007 - if (!lc) { 1008 - rc = -ENOMEM; 1009 - goto out; 1010 - } 1007 + if (!lc) 1008 + return -ENOMEM; 1011 1009 1012 1010 init_waitqueue_head(&lc->oc_wait); 1013 1011 init_completion(&lc->oc_sync_wait); ··· 1061 1063 } 1062 1064 1063 1065 out: 1064 - if (rc && lc) 1066 + if (rc) 1065 1067 kfree(lc); 1066 1068 return rc; 1067 1069 }
+2
fs/ocfs2/suballoc.c
··· 2499 2499 alloc_bh, OCFS2_JOURNAL_ACCESS_WRITE); 2500 2500 if (status < 0) { 2501 2501 mlog_errno(status); 2502 + ocfs2_block_group_set_bits(handle, alloc_inode, group, group_bh, 2503 + start_bit, count); 2502 2504 goto bail; 2503 2505 } 2504 2506
+26 -20
fs/ocfs2/super.c
··· 1112 1112 1113 1113 osb->osb_debug_root = debugfs_create_dir(osb->uuid_str, 1114 1114 ocfs2_debugfs_root); 1115 - if (!osb->osb_debug_root) { 1115 + if (IS_ERR_OR_NULL(osb->osb_debug_root)) { 1116 1116 status = -EINVAL; 1117 1117 mlog(ML_ERROR, "Unable to create per-mount debugfs root.\n"); 1118 1118 goto read_super_error; ··· 1122 1122 osb->osb_debug_root, 1123 1123 osb, 1124 1124 &ocfs2_osb_debug_fops); 1125 - if (!osb->osb_ctxt) { 1125 + if (IS_ERR_OR_NULL(osb->osb_ctxt)) { 1126 1126 status = -EINVAL; 1127 1127 mlog_errno(status); 1128 1128 goto read_super_error; ··· 1606 1606 } 1607 1607 1608 1608 ocfs2_debugfs_root = debugfs_create_dir("ocfs2", NULL); 1609 - if (!ocfs2_debugfs_root) { 1610 - status = -ENOMEM; 1609 + if (IS_ERR_OR_NULL(ocfs2_debugfs_root)) { 1610 + status = ocfs2_debugfs_root ? 1611 + PTR_ERR(ocfs2_debugfs_root) : -ENOMEM; 1611 1612 mlog(ML_ERROR, "Unable to create ocfs2 debugfs root.\n"); 1612 1613 goto out4; 1613 1614 } ··· 2070 2069 cbits = le32_to_cpu(di->id2.i_super.s_clustersize_bits); 2071 2070 bbits = le32_to_cpu(di->id2.i_super.s_blocksize_bits); 2072 2071 sb->s_maxbytes = ocfs2_max_file_offset(bbits, cbits); 2072 + memcpy(sb->s_uuid, di->id2.i_super.s_uuid, 2073 + sizeof(di->id2.i_super.s_uuid)); 2073 2074 2074 2075 osb->osb_dx_mask = (1 << (cbits - bbits)) - 1; 2075 2076 ··· 2336 2333 mlog_errno(status); 2337 2334 goto bail; 2338 2335 } 2339 - cleancache_init_shared_fs((char *)&di->id2.i_super.s_uuid, sb); 2336 + cleancache_init_shared_fs(sb); 2340 2337 2341 2338 bail: 2342 2339 return status; ··· 2566 2563 ocfs2_set_ro_flag(osb, 0); 2567 2564 } 2568 2565 2569 - static char error_buf[1024]; 2570 - 2571 - void __ocfs2_error(struct super_block *sb, 2572 - const char *function, 2573 - const char *fmt, ...) 2566 + void __ocfs2_error(struct super_block *sb, const char *function, 2567 + const char *fmt, ...) 
2574 2568 { 2569 + struct va_format vaf; 2575 2570 va_list args; 2576 2571 2577 2572 va_start(args, fmt); 2578 - vsnprintf(error_buf, sizeof(error_buf), fmt, args); 2579 - va_end(args); 2573 + vaf.fmt = fmt; 2574 + vaf.va = &args; 2580 2575 2581 2576 /* Not using mlog here because we want to show the actual 2582 2577 * function the error came from. */ 2583 - printk(KERN_CRIT "OCFS2: ERROR (device %s): %s: %s\n", 2584 - sb->s_id, function, error_buf); 2578 + printk(KERN_CRIT "OCFS2: ERROR (device %s): %s: %pV\n", 2579 + sb->s_id, function, &vaf); 2580 + 2581 + va_end(args); 2585 2582 2586 2583 ocfs2_handle_error(sb); 2587 2584 } ··· 2589 2586 /* Handle critical errors. This is intentionally more drastic than 2590 2587 * ocfs2_handle_error, so we only use for things like journal errors, 2591 2588 * etc. */ 2592 - void __ocfs2_abort(struct super_block* sb, 2593 - const char *function, 2589 + void __ocfs2_abort(struct super_block *sb, const char *function, 2594 2590 const char *fmt, ...) 2595 2591 { 2592 + struct va_format vaf; 2596 2593 va_list args; 2597 2594 2598 2595 va_start(args, fmt); 2599 - vsnprintf(error_buf, sizeof(error_buf), fmt, args); 2600 - va_end(args); 2601 2596 2602 - printk(KERN_CRIT "OCFS2: abort (device %s): %s: %s\n", 2603 - sb->s_id, function, error_buf); 2597 + vaf.fmt = fmt; 2598 + vaf.va = &args; 2599 + 2600 + printk(KERN_CRIT "OCFS2: abort (device %s): %s: %pV\n", 2601 + sb->s_id, function, &vaf); 2602 + 2603 + va_end(args); 2604 2604 2605 2605 /* We don't have the cluster support yet to go straight to 2606 2606 * hard readonly in here. Until then, we want to keep
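The super.c hunk above drops the shared static `error_buf` — which two CPUs reporting errors concurrently could scribble over — in favor of `struct va_format` and printk's `%pV` specifier, which formats the forwarded va_list inside a single printk call. `%pV` exists only in the kernel; the following is a hedged userspace sketch of the same idea (forward the va_list instead of sharing a buffer), with a made-up function name:

```c
#include <stdarg.h>
#include <stdio.h>

/* Userspace sketch: instead of formatting into one shared static buffer
 * (racy with concurrent callers), format with the forwarded va_list into
 * a buffer the caller owns. The kernel goes one step further and avoids
 * any intermediate buffer by handing a struct va_format to printk's %pV;
 * that specifier is kernel-only and not available to userspace printf. */
int format_error(char *buf, size_t len, const char *dev, const char *func,
                 const char *fmt, ...)
{
    va_list args;
    int n;

    n = snprintf(buf, len, "ERROR (device %s): %s: ", dev, func);
    if (n < 0 || (size_t)n >= len)
        return -1;

    va_start(args, fmt);
    vsnprintf(buf + n, len - n, fmt, args);
    va_end(args);
    return 0;
}
```

Each caller supplies its own buffer, so there is nothing shared to race on.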
+8
fs/ocfs2/xattr.c
··· 1238 1238 i, 1239 1239 &block_off, 1240 1240 &name_offset); 1241 + if (ret) { 1242 + mlog_errno(ret); 1243 + goto cleanup; 1244 + } 1241 1245 xs->base = bucket_block(xs->bucket, block_off); 1242 1246 } 1243 1247 if (ocfs2_xattr_is_local(xs->here)) { ··· 5669 5665 5670 5666 ret = ocfs2_get_xattr_tree_value_root(inode->i_sb, bucket, 5671 5667 i, &xv, NULL); 5668 + if (ret) { 5669 + mlog_errno(ret); 5670 + break; 5671 + } 5672 5672 5673 5673 ret = ocfs2_lock_xattr_remove_allocators(inode, xv, 5674 5674 args->ref_ci,
+1 -1
fs/super.c
··· 224 224 s->s_maxbytes = MAX_NON_LFS; 225 225 s->s_op = &default_op; 226 226 s->s_time_gran = 1000000000; 227 - s->cleancache_poolid = -1; 227 + s->cleancache_poolid = CLEANCACHE_NO_POOL; 228 228 229 229 s->s_shrink.seeks = DEFAULT_SEEKS; 230 230 s->s_shrink.scan_objects = super_cache_scan;
+30
include/asm-generic/pgtable.h
··· 6 6 7 7 #include <linux/mm_types.h> 8 8 #include <linux/bug.h> 9 + #include <linux/errno.h> 10 + 11 + #if 4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED) != \ 12 + CONFIG_PGTABLE_LEVELS 13 + #error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{PUD,PMD}_FOLDED 14 + #endif 9 15 10 16 /* 11 17 * On almost all architectures and configurations, 0 can be used as the ··· 696 690 #endif /* CONFIG_NUMA_BALANCING */ 697 691 698 692 #endif /* CONFIG_MMU */ 693 + 694 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 695 + int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot); 696 + int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); 697 + int pud_clear_huge(pud_t *pud); 698 + int pmd_clear_huge(pmd_t *pmd); 699 + #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ 700 + static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) 701 + { 702 + return 0; 703 + } 704 + static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot) 705 + { 706 + return 0; 707 + } 708 + static inline int pud_clear_huge(pud_t *pud) 709 + { 710 + return 0; 711 + } 712 + static inline int pmd_clear_huge(pmd_t *pmd) 713 + { 714 + return 0; 715 + } 716 + #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 699 717 700 718 #endif /* !__ASSEMBLY__ */ 701 719
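The new `#error` guard above leans on the fact that, inside an `#if` expression, `defined(X)` evaluates to the integer 1 or 0, so `4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED)` counts the active page-table levels. A minimal standalone demonstration of the trick, with hypothetical macro names:

```c
/* Inside an #if expression, defined(X) is an integer: 1 if X is defined,
 * 0 otherwise, so it can take part in arithmetic. Folding both the PUD
 * and PMD levels here should leave 4 - 1 - 1 = 2 levels, mirroring the
 * consistency check added to asm-generic/pgtable.h. */
#define DEMO_PUD_FOLDED
#define DEMO_PMD_FOLDED

#if 4 - defined(DEMO_PUD_FOLDED) - defined(DEMO_PMD_FOLDED) == 2
#define DEMO_LEVELS 2
#else
#define DEMO_LEVELS 4
#endif

int demo_pgtable_levels(void)
{
    return DEMO_LEVELS;
}
```

If an architecture set CONFIG_PGTABLE_LEVELS inconsistently with its folded-level macros, the real guard turns that mismatch into a build error rather than a subtle runtime bug.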
+8 -5
include/linux/cleancache.h
··· 5 5 #include <linux/exportfs.h> 6 6 #include <linux/mm.h> 7 7 8 + #define CLEANCACHE_NO_POOL -1 9 + #define CLEANCACHE_NO_BACKEND -2 10 + #define CLEANCACHE_NO_BACKEND_SHARED -3 11 + 8 12 #define CLEANCACHE_KEY_MAX 6 9 13 10 14 /* ··· 37 33 void (*invalidate_fs)(int); 38 34 }; 39 35 40 - extern struct cleancache_ops * 41 - cleancache_register_ops(struct cleancache_ops *ops); 36 + extern int cleancache_register_ops(struct cleancache_ops *ops); 42 37 extern void __cleancache_init_fs(struct super_block *); 43 - extern void __cleancache_init_shared_fs(char *, struct super_block *); 38 + extern void __cleancache_init_shared_fs(struct super_block *); 44 39 extern int __cleancache_get_page(struct page *); 45 40 extern void __cleancache_put_page(struct page *); 46 41 extern void __cleancache_invalidate_page(struct address_space *, struct page *); ··· 81 78 __cleancache_init_fs(sb); 82 79 } 83 80 84 - static inline void cleancache_init_shared_fs(char *uuid, struct super_block *sb) 81 + static inline void cleancache_init_shared_fs(struct super_block *sb) 85 82 { 86 83 if (cleancache_enabled) 87 - __cleancache_init_shared_fs(uuid, sb); 84 + __cleancache_init_shared_fs(sb); 88 85 } 89 86 90 87 static inline int cleancache_get_page(struct page *page)
+6 -6
include/linux/cma.h
··· 16 16 struct cma; 17 17 18 18 extern unsigned long totalcma_pages; 19 - extern phys_addr_t cma_get_base(struct cma *cma); 20 - extern unsigned long cma_get_size(struct cma *cma); 19 + extern phys_addr_t cma_get_base(const struct cma *cma); 20 + extern unsigned long cma_get_size(const struct cma *cma); 21 21 22 22 extern int __init cma_declare_contiguous(phys_addr_t base, 23 23 phys_addr_t size, phys_addr_t limit, 24 24 phys_addr_t alignment, unsigned int order_per_bit, 25 25 bool fixed, struct cma **res_cma); 26 - extern int cma_init_reserved_mem(phys_addr_t base, 27 - phys_addr_t size, int order_per_bit, 26 + extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, 27 + unsigned int order_per_bit, 28 28 struct cma **res_cma); 29 - extern struct page *cma_alloc(struct cma *cma, int count, unsigned int align); 30 - extern bool cma_release(struct cma *cma, struct page *pages, int count); 29 + extern struct page *cma_alloc(struct cma *cma, unsigned int count, unsigned int align); 30 + extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count); 31 31 #endif
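The `order_per_bit` parameter that cma.h now takes as `unsigned int` (and that the new debugfs documentation exposes read-only) means one bitmap bit tracks 2^order_per_bit pages. A hedged sketch of the bookkeeping, modeled on mm/cma.c's `cma_bitmap_maxno`; helper names here are illustrative, not the kernel's:

```c
/* One bitmap bit covers 2^order_per_bit pages, so a CMA area of 'count'
 * pages needs count >> order_per_bit bits. mm/cma.c rounds allocation
 * requests up to that granularity before searching the bitmap. */
unsigned long cma_bitmap_bits(unsigned long count, unsigned int order_per_bit)
{
    return count >> order_per_bit;
}

/* Number of bits an allocation of 'pages' pages consumes, rounded up. */
unsigned long cma_pages_to_bits(unsigned long pages, unsigned int order_per_bit)
{
    return (pages + (1UL << order_per_bit) - 1) >> order_per_bit;
}
```

With order_per_bit = 2, for example, a 5-page request occupies two bits (8 pages) of the area's bitmap.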
+22
include/linux/elf-randomize.h
··· 1 + #ifndef _ELF_RANDOMIZE_H 2 + #define _ELF_RANDOMIZE_H 3 + 4 + struct mm_struct; 5 + 6 + #ifndef CONFIG_ARCH_HAS_ELF_RANDOMIZE 7 + static inline unsigned long arch_mmap_rnd(void) { return 0; } 8 + # if defined(arch_randomize_brk) && defined(CONFIG_COMPAT_BRK) 9 + # define compat_brk_randomized 10 + # endif 11 + # ifndef arch_randomize_brk 12 + # define arch_randomize_brk(mm) (mm->brk) 13 + # endif 14 + #else 15 + extern unsigned long arch_mmap_rnd(void); 16 + extern unsigned long arch_randomize_brk(struct mm_struct *mm); 17 + # ifdef CONFIG_COMPAT_BRK 18 + # define compat_brk_randomized 19 + # endif 20 + #endif 21 + 22 + #endif
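This new header backs the "standardize mmap_rnd() usage" patches in the series: each architecture derives its ASLR offset by masking some random bits and expressing the result in pages. A hedged sketch of that common shape (the mask width and PAGE_SHIFT of 12 are assumptions for the example, not any particular arch's values):

```c
/* Sketch of the per-arch arch_mmap_rnd() pattern this series makes
 * uniform: keep 'rnd_bits' bits of entropy from a random source and
 * shift the result into page units. A 4 KiB page (PAGE_SHIFT == 12)
 * is assumed here purely for illustration. */
#define DEMO_PAGE_SHIFT 12

unsigned long demo_mmap_rnd(unsigned long random, unsigned int rnd_bits)
{
    unsigned long rnd = random & ((1UL << rnd_bits) - 1);

    return rnd << DEMO_PAGE_SHIFT;
}
```

Architectures without CONFIG_ARCH_HAS_ELF_RANDOMIZE fall through to the header's stub, which simply returns 0 (no randomization).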
+4 -12
include/linux/gfp.h
··· 57 57 * _might_ fail. This depends upon the particular VM implementation. 58 58 * 59 59 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller 60 - * cannot handle allocation failures. This modifier is deprecated and no new 61 - * users should be added. 60 + * cannot handle allocation failures. New users should be evaluated carefully 61 + * (and the flag should be used only when there is no reasonable failure policy) 62 + * but it is definitely preferable to use the flag rather than open-code an 63 + * endless loop around the allocator. 62 64 * 63 65 * __GFP_NORETRY: The VM implementation must not retry indefinitely. 64 66 * ··· 118 116 #define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ 119 117 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ 120 118 __GFP_NO_KSWAPD) 121 - 122 - /* 123 - * GFP_THISNODE does not perform any reclaim, you most likely want to 124 - * use __GFP_THISNODE to allocate from a given node without fallback! 125 - */ 126 - #ifdef CONFIG_NUMA 127 - #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY) 128 - #else 129 - #define GFP_THISNODE ((__force gfp_t)0) 130 - #endif 131 119 132 120 /* This mask makes up all the page movable related flags */ 133 121 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
+8
include/linux/io.h
··· 38 38 } 39 39 #endif 40 40 41 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 42 + void __init ioremap_huge_init(void); 43 + int arch_ioremap_pud_supported(void); 44 + int arch_ioremap_pmd_supported(void); 45 + #else 46 + static inline void ioremap_huge_init(void) { } 47 + #endif 48 + 41 49 /* 42 50 * Managed iomap interface 43 51 */
+8
include/linux/memblock.h
··· 365 365 #define __initdata_memblock 366 366 #endif 367 367 368 + #ifdef CONFIG_MEMTEST 369 + extern void early_memtest(phys_addr_t start, phys_addr_t end); 370 + #else 371 + static inline void early_memtest(phys_addr_t start, phys_addr_t end) 372 + { 373 + } 374 + #endif 375 + 368 376 #else 369 377 static inline phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align) 370 378 {
+6
include/linux/memory_hotplug.h
··· 192 192 void get_online_mems(void); 193 193 void put_online_mems(void); 194 194 195 + void mem_hotplug_begin(void); 196 + void mem_hotplug_done(void); 197 + 195 198 #else /* ! CONFIG_MEMORY_HOTPLUG */ 196 199 /* 197 200 * Stub functions for when hotplug is off ··· 233 230 234 231 static inline void get_online_mems(void) {} 235 232 static inline void put_online_mems(void) {} 233 + 234 + static inline void mem_hotplug_begin(void) {} 235 + static inline void mem_hotplug_done(void) {} 236 236 237 237 #endif /* ! CONFIG_MEMORY_HOTPLUG */ 238 238
+1 -1
include/linux/mempool.h
··· 29 29 mempool_free_t *free_fn, void *pool_data, 30 30 gfp_t gfp_mask, int nid); 31 31 32 - extern int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask); 32 + extern int mempool_resize(mempool_t *pool, int new_min_nr); 33 33 extern void mempool_destroy(mempool_t *pool); 34 34 extern void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask); 35 35 extern void mempool_free(void *element, mempool_t *pool);
-5
include/linux/migrate.h
··· 69 69 extern bool pmd_trans_migrating(pmd_t pmd); 70 70 extern int migrate_misplaced_page(struct page *page, 71 71 struct vm_area_struct *vma, int node); 72 - extern bool migrate_ratelimited(int node); 73 72 #else 74 73 static inline bool pmd_trans_migrating(pmd_t pmd) 75 74 { ··· 78 79 struct vm_area_struct *vma, int node) 79 80 { 80 81 return -EAGAIN; /* can't migrate now */ 81 - } 82 - static inline bool migrate_ratelimited(int node) 83 - { 84 - return false; 85 82 } 86 83 #endif /* CONFIG_NUMA_BALANCING */ 87 84
+3 -1
include/linux/mm.h
··· 1294 1294 int redirty_page_for_writepage(struct writeback_control *wbc, 1295 1295 struct page *page); 1296 1296 void account_page_dirtied(struct page *page, struct address_space *mapping); 1297 + void account_page_cleaned(struct page *page, struct address_space *mapping); 1297 1298 int set_page_dirty(struct page *page); 1298 1299 int set_page_dirty_lock(struct page *page); 1299 1300 int clear_page_dirty_for_io(struct page *page); 1301 + 1300 1302 int get_cmdline(struct task_struct *task, char *buffer, int buflen); 1301 1303 1302 1304 /* Is the vma a continuation of the stack vma above it? */ ··· 2111 2109 #define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ 2112 2110 #define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO 2113 2111 * and return without waiting upon it */ 2114 - #define FOLL_MLOCK 0x40 /* mark page as mlocked */ 2112 + #define FOLL_POPULATE 0x40 /* fault in page */ 2115 2113 #define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */ 2116 2114 #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ 2117 2115 #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
+2
include/linux/mm_types.h
··· 364 364 atomic_t mm_users; /* How many users with user space? */ 365 365 atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ 366 366 atomic_long_t nr_ptes; /* PTE page table pages */ 367 + #if CONFIG_PGTABLE_LEVELS > 2 367 368 atomic_long_t nr_pmds; /* PMD page table pages */ 369 + #endif 368 370 int map_count; /* number of VMAs */ 369 371 370 372 spinlock_t page_table_lock; /* Protects page tables and some counters */
+12 -9
include/linux/nmi.h
··· 25 25 #endif 26 26 27 27 #if defined(CONFIG_HARDLOCKUP_DETECTOR) 28 - extern void watchdog_enable_hardlockup_detector(bool val); 29 - extern bool watchdog_hardlockup_detector_is_enabled(void); 28 + extern void hardlockup_detector_disable(void); 30 29 #else 31 - static inline void watchdog_enable_hardlockup_detector(bool val) 30 + static inline void hardlockup_detector_disable(void) 32 31 { 33 - } 34 - static inline bool watchdog_hardlockup_detector_is_enabled(void) 35 - { 36 - return true; 37 32 } 38 33 #endif 39 34 ··· 63 68 #ifdef CONFIG_LOCKUP_DETECTOR 64 69 int hw_nmi_is_cpu_stuck(struct pt_regs *); 65 70 u64 hw_nmi_get_sample_period(int watchdog_thresh); 71 + extern int nmi_watchdog_enabled; 72 + extern int soft_watchdog_enabled; 66 73 extern int watchdog_user_enabled; 67 74 extern int watchdog_thresh; 68 75 extern int sysctl_softlockup_all_cpu_backtrace; 69 76 struct ctl_table; 70 - extern int proc_dowatchdog(struct ctl_table *, int , 71 - void __user *, size_t *, loff_t *); 77 + extern int proc_watchdog(struct ctl_table *, int , 78 + void __user *, size_t *, loff_t *); 79 + extern int proc_nmi_watchdog(struct ctl_table *, int , 80 + void __user *, size_t *, loff_t *); 81 + extern int proc_soft_watchdog(struct ctl_table *, int , 82 + void __user *, size_t *, loff_t *); 83 + extern int proc_watchdog_thresh(struct ctl_table *, int , 84 + void __user *, size_t *, loff_t *); 72 85 #endif 73 86 74 87 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
+2 -1
include/linux/oom.h
··· 66 66 extern void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_flags); 67 67 68 68 extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 69 - int order, const nodemask_t *nodemask); 69 + int order, const nodemask_t *nodemask, 70 + struct mem_cgroup *memcg); 70 71 71 72 extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, 72 73 unsigned long totalpages, const nodemask_t *nodemask,
-2
include/linux/page-flags.h
··· 328 328 329 329 CLEARPAGEFLAG(Uptodate, uptodate) 330 330 331 - extern void cancel_dirty_page(struct page *page, unsigned int account_size); 332 - 333 331 int test_clear_page_writeback(struct page *page); 334 332 int __test_set_page_writeback(struct page *page, bool keep_write); 335 333
+1 -1
include/linux/slab.h
··· 18 18 19 19 /* 20 20 * Flags to pass to kmem_cache_create(). 21 - * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set. 21 + * The ones marked DEBUG are only valid if CONFIG_DEBUG_SLAB is set. 22 22 */ 23 23 #define SLAB_DEBUG_FREE 0x00000100UL /* DEBUG: Perform (expensive) checks on free */ 24 24 #define SLAB_RED_ZONE 0x00000400UL /* DEBUG: Red zone objs in a cache */
+1 -1
include/trace/events/xen.h
··· 224 224 TP_printk("pmdp %p", __entry->pmdp) 225 225 ); 226 226 227 - #if PAGETABLE_LEVELS >= 4 227 + #if CONFIG_PGTABLE_LEVELS >= 4 228 228 229 229 TRACE_EVENT(xen_mmu_set_pud, 230 230 TP_PROTO(pud_t *pudp, pud_t pudval),
+2
init/main.c
··· 80 80 #include <linux/list.h> 81 81 #include <linux/integrity.h> 82 82 #include <linux/proc_ns.h> 83 + #include <linux/io.h> 83 84 84 85 #include <asm/io.h> 85 86 #include <asm/bugs.h> ··· 486 485 percpu_init_late(); 487 486 pgtable_init(); 488 487 vmalloc_init(); 488 + ioremap_huge_init(); 489 489 } 490 490 491 491 asmlinkage __visible void __init start_kernel(void)
+5 -13
kernel/cpuset.c
··· 2453 2453 * @node: is this an allowed node? 2454 2454 * @gfp_mask: memory allocation flags 2455 2455 * 2456 - * If we're in interrupt, yes, we can always allocate. If __GFP_THISNODE is 2457 - * set, yes, we can always allocate. If node is in our task's mems_allowed, 2458 - * yes. If it's not a __GFP_HARDWALL request and this node is in the nearest 2459 - * hardwalled cpuset ancestor to this task's cpuset, yes. If the task has been 2460 - * OOM killed and has access to memory reserves as specified by the TIF_MEMDIE 2461 - * flag, yes. 2456 + * If we're in interrupt, yes, we can always allocate. If @node is set in 2457 + * current's mems_allowed, yes. If it's not a __GFP_HARDWALL request and this 2458 + * node is set in the nearest hardwalled cpuset ancestor to current's cpuset, 2459 + * yes. If current has access to memory reserves due to TIF_MEMDIE, yes. 2462 2460 * Otherwise, no. 2463 - * 2464 - * The __GFP_THISNODE placement logic is really handled elsewhere, 2465 - * by forcibly using a zonelist starting at a specified node, and by 2466 - * (in get_page_from_freelist()) refusing to consider the zones for 2467 - * any node on the zonelist except the first. By the time any such 2468 - * calls get to this routine, we should just shut up and say 'yes'. 2469 2461 * 2470 2462 * GFP_USER allocations are marked with the __GFP_HARDWALL bit, 2471 2463 * and do not allow allocations outside the current tasks cpuset ··· 2494 2502 int allowed; /* is allocation in zone z allowed? */ 2495 2503 unsigned long flags; 2496 2504 2497 - if (in_interrupt() || (gfp_mask & __GFP_THISNODE)) 2505 + if (in_interrupt()) 2498 2506 return 1; 2499 2507 if (node_isset(node, current->mems_allowed)) 2500 2508 return 1;
+24 -11
kernel/sysctl.c
··· 847 847 .data = &watchdog_user_enabled, 848 848 .maxlen = sizeof (int), 849 849 .mode = 0644, 850 - .proc_handler = proc_dowatchdog, 850 + .proc_handler = proc_watchdog, 851 851 .extra1 = &zero, 852 852 .extra2 = &one, 853 853 }, ··· 856 856 .data = &watchdog_thresh, 857 857 .maxlen = sizeof(int), 858 858 .mode = 0644, 859 - .proc_handler = proc_dowatchdog, 859 + .proc_handler = proc_watchdog_thresh, 860 860 .extra1 = &zero, 861 861 .extra2 = &sixty, 862 + }, 863 + { 864 + .procname = "nmi_watchdog", 865 + .data = &nmi_watchdog_enabled, 866 + .maxlen = sizeof (int), 867 + .mode = 0644, 868 + .proc_handler = proc_nmi_watchdog, 869 + .extra1 = &zero, 870 + #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR) 871 + .extra2 = &one, 872 + #else 873 + .extra2 = &zero, 874 + #endif 875 + }, 876 + { 877 + .procname = "soft_watchdog", 878 + .data = &soft_watchdog_enabled, 879 + .maxlen = sizeof (int), 880 + .mode = 0644, 881 + .proc_handler = proc_soft_watchdog, 882 + .extra1 = &zero, 883 + .extra2 = &one, 862 884 }, 863 885 { 864 886 .procname = "softlockup_panic", ··· 902 880 .extra2 = &one, 903 881 }, 904 882 #endif /* CONFIG_SMP */ 905 - { 906 - .procname = "nmi_watchdog", 907 - .data = &watchdog_user_enabled, 908 - .maxlen = sizeof (int), 909 - .mode = 0644, 910 - .proc_handler = proc_dowatchdog, 911 - .extra1 = &zero, 912 - .extra2 = &one, 913 - }, 914 883 #endif 915 884 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) 916 885 {
+216 -75
kernel/watchdog.c
··· 24 24 #include <linux/kvm_para.h> 25 25 #include <linux/perf_event.h> 26 26 27 - int watchdog_user_enabled = 1; 27 + /* 28 + * The run state of the lockup detectors is controlled by the content of the 29 + * 'watchdog_enabled' variable. Each lockup detector has its dedicated bit - 30 + * bit 0 for the hard lockup detector and bit 1 for the soft lockup detector. 31 + * 32 + * 'watchdog_user_enabled', 'nmi_watchdog_enabled' and 'soft_watchdog_enabled' 33 + * are variables that are only used as an 'interface' between the parameters 34 + * in /proc/sys/kernel and the internal state bits in 'watchdog_enabled'. The 35 + * 'watchdog_thresh' variable is handled differently because its value is not 36 + * boolean, and the lockup detectors are 'suspended' while 'watchdog_thresh' 37 + * is equal zero. 38 + */ 39 + #define NMI_WATCHDOG_ENABLED_BIT 0 40 + #define SOFT_WATCHDOG_ENABLED_BIT 1 41 + #define NMI_WATCHDOG_ENABLED (1 << NMI_WATCHDOG_ENABLED_BIT) 42 + #define SOFT_WATCHDOG_ENABLED (1 << SOFT_WATCHDOG_ENABLED_BIT) 43 + 44 + #ifdef CONFIG_HARDLOCKUP_DETECTOR 45 + static unsigned long __read_mostly watchdog_enabled = SOFT_WATCHDOG_ENABLED|NMI_WATCHDOG_ENABLED; 46 + #else 47 + static unsigned long __read_mostly watchdog_enabled = SOFT_WATCHDOG_ENABLED; 48 + #endif 49 + int __read_mostly nmi_watchdog_enabled; 50 + int __read_mostly soft_watchdog_enabled; 51 + int __read_mostly watchdog_user_enabled; 28 52 int __read_mostly watchdog_thresh = 10; 53 + 29 54 #ifdef CONFIG_SMP 30 55 int __read_mostly sysctl_softlockup_all_cpu_backtrace; 31 56 #else ··· 83 58 #ifdef CONFIG_HARDLOCKUP_DETECTOR 84 59 static int hardlockup_panic = 85 60 CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE; 86 - 87 - static bool hardlockup_detector_enabled = true; 88 61 /* 89 62 * We may not want to enable hard lockup detection by default in all cases, 90 63 * for example when running the kernel as a guest on a hypervisor. 
In these ··· 91 68 * kernel command line parameters are parsed, because otherwise it is not 92 69 * possible to override this in hardlockup_panic_setup(). 93 70 */ 94 - void watchdog_enable_hardlockup_detector(bool val) 71 + void hardlockup_detector_disable(void) 95 72 { 96 - hardlockup_detector_enabled = val; 97 - } 98 - 99 - bool watchdog_hardlockup_detector_is_enabled(void) 100 - { 101 - return hardlockup_detector_enabled; 73 + watchdog_enabled &= ~NMI_WATCHDOG_ENABLED; 102 74 } 103 75 104 76 static int __init hardlockup_panic_setup(char *str) ··· 103 85 else if (!strncmp(str, "nopanic", 7)) 104 86 hardlockup_panic = 0; 105 87 else if (!strncmp(str, "0", 1)) 106 - watchdog_user_enabled = 0; 107 - else if (!strncmp(str, "1", 1) || !strncmp(str, "2", 1)) { 108 - /* 109 - * Setting 'nmi_watchdog=1' or 'nmi_watchdog=2' (legacy option) 110 - * has the same effect. 111 - */ 112 - watchdog_user_enabled = 1; 113 - watchdog_enable_hardlockup_detector(true); 114 - } 88 + watchdog_enabled &= ~NMI_WATCHDOG_ENABLED; 89 + else if (!strncmp(str, "1", 1)) 90 + watchdog_enabled |= NMI_WATCHDOG_ENABLED; 115 91 return 1; 116 92 } 117 93 __setup("nmi_watchdog=", hardlockup_panic_setup); ··· 124 112 125 113 static int __init nowatchdog_setup(char *str) 126 114 { 127 - watchdog_user_enabled = 0; 115 + watchdog_enabled = 0; 128 116 return 1; 129 117 } 130 118 __setup("nowatchdog", nowatchdog_setup); 131 119 132 - /* deprecated */ 133 120 static int __init nosoftlockup_setup(char *str) 134 121 { 135 - watchdog_user_enabled = 0; 122 + watchdog_enabled &= ~SOFT_WATCHDOG_ENABLED; 136 123 return 1; 137 124 } 138 125 __setup("nosoftlockup", nosoftlockup_setup); 139 - /* */ 126 + 140 127 #ifdef CONFIG_SMP 141 128 static int __init softlockup_all_cpu_backtrace_setup(char *str) 142 129 { ··· 250 239 { 251 240 unsigned long now = get_timestamp(); 252 241 253 - /* Warn about unreasonable delays: */ 254 - if (time_after(now, touch_ts + get_softlockup_thresh())) 255 - return now - touch_ts; 256 - 
242 + if (watchdog_enabled & SOFT_WATCHDOG_ENABLED) { 243 + /* Warn about unreasonable delays. */ 244 + if (time_after(now, touch_ts + get_softlockup_thresh())) 245 + return now - touch_ts; 246 + } 257 247 return 0; 258 248 } 259 249 ··· 489 477 __this_cpu_write(soft_lockup_hrtimer_cnt, 490 478 __this_cpu_read(hrtimer_interrupts)); 491 479 __touch_watchdog(); 480 + 481 + /* 482 + * watchdog_nmi_enable() clears the NMI_WATCHDOG_ENABLED bit in the 483 + * failure path. Check for failures that can occur asynchronously - 484 + * for example, when CPUs are on-lined - and shut down the hardware 485 + * perf event on each CPU accordingly. 486 + * 487 + * The only non-obvious place this bit can be cleared is through 488 + * watchdog_nmi_enable(), so a pr_info() is placed there. Placing a 489 + * pr_info here would be too noisy as it would result in a message 490 + * every few seconds if the hardlockup was disabled but the softlockup 491 + * enabled. 492 + */ 493 + if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED)) 494 + watchdog_nmi_disable(cpu); 492 495 } 493 496 494 497 #ifdef CONFIG_HARDLOCKUP_DETECTOR ··· 519 492 struct perf_event_attr *wd_attr; 520 493 struct perf_event *event = per_cpu(watchdog_ev, cpu); 521 494 522 - /* 523 - * Some kernels need to default hard lockup detection to 524 - * 'disabled', for example a guest on a hypervisor. 525 - */ 526 - if (!watchdog_hardlockup_detector_is_enabled()) { 527 - event = ERR_PTR(-ENOENT); 528 - goto handle_err; 529 - } 495 + /* nothing to do if the hard lockup detector is disabled */ 496 + if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED)) 497 + goto out; 530 498 531 499 /* is it already setup and enabled? 
*/ 532 500 if (event && event->state > PERF_EVENT_STATE_OFF) ··· 537 515 /* Try to register using hardware perf events */ 538 516 event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL); 539 517 540 - handle_err: 541 518 /* save cpu0 error for future comparison */ 542 519 if (cpu == 0 && IS_ERR(event)) 543 520 cpu0_err = PTR_ERR(event); ··· 547 526 pr_info("enabled on all CPUs, permanently consumes one hw-PMU counter.\n"); 548 527 goto out_save; 549 528 } 529 + 530 + /* 531 + * Disable the hard lockup detector if _any_ CPU fails to set up 532 + * the hardware perf event. The watchdog() function checks 533 + * the NMI_WATCHDOG_ENABLED bit periodically. 534 + * 535 + * The barriers are for syncing up watchdog_enabled across all the 536 + * cpus, as clear_bit() does not use barriers. 537 + */ 538 + smp_mb__before_atomic(); 539 + clear_bit(NMI_WATCHDOG_ENABLED_BIT, &watchdog_enabled); 540 + smp_mb__after_atomic(); 550 541 551 542 /* skip displaying the same error again */ 552 543 if (cpu > 0 && (PTR_ERR(event) == cpu0_err))
int watchdog_enable_all_cpus(void) 693 657 { 694 658 int err = 0; 695 659 ··· 699 663 pr_err("Failed to create watchdog threads, disabled\n"); 700 664 else 701 665 watchdog_running = 1; 702 - } else if (sample_period_changed) { 703 - update_timers_all_cpus(); 666 + } else { 667 + /* 668 + * Enable/disable the lockup detectors or 669 + * change the sample period 'on the fly'. 670 + */ 671 + update_watchdog_all_cpus(); 704 672 } 705 673 706 674 return err; ··· 722 682 } 723 683 724 684 /* 725 - * proc handler for /proc/sys/kernel/nmi_watchdog,watchdog_thresh 685 + * Update the run state of the lockup detectors. 726 686 */ 727 - 728 - int proc_dowatchdog(struct ctl_table *table, int write, 729 - void __user *buffer, size_t *lenp, loff_t *ppos) 687 + static int proc_watchdog_update(void) 730 688 { 731 - int err, old_thresh, old_enabled; 732 - bool old_hardlockup; 733 - static DEFINE_MUTEX(watchdog_proc_mutex); 689 + int err = 0; 690 + 691 + /* 692 + * Watchdog threads won't be started if they are already active. 693 + * The 'watchdog_running' variable in watchdog_*_all_cpus() takes 694 + * care of this. If those threads are already active, the sample 695 + * period will be updated and the lockup detectors will be enabled 696 + * or disabled 'on the fly'. 
697 + */ 698 + if (watchdog_enabled && watchdog_thresh) 699 + err = watchdog_enable_all_cpus(); 700 + else 701 + watchdog_disable_all_cpus(); 702 + 703 + return err; 704 + 705 + } 706 + 707 + static DEFINE_MUTEX(watchdog_proc_mutex); 708 + 709 + /* 710 + * common function for watchdog, nmi_watchdog and soft_watchdog parameter 711 + * 712 + * caller | table->data points to | 'which' contains the flag(s) 713 + * -------------------|-----------------------|----------------------------- 714 + * proc_watchdog | watchdog_user_enabled | NMI_WATCHDOG_ENABLED or'ed 715 + * | | with SOFT_WATCHDOG_ENABLED 716 + * -------------------|-----------------------|----------------------------- 717 + * proc_nmi_watchdog | nmi_watchdog_enabled | NMI_WATCHDOG_ENABLED 718 + * -------------------|-----------------------|----------------------------- 719 + * proc_soft_watchdog | soft_watchdog_enabled | SOFT_WATCHDOG_ENABLED 720 + */ 721 + static int proc_watchdog_common(int which, struct ctl_table *table, int write, 722 + void __user *buffer, size_t *lenp, loff_t *ppos) 723 + { 724 + int err, old, new; 725 + int *watchdog_param = (int *)table->data; 734 726 735 727 mutex_lock(&watchdog_proc_mutex); 736 - old_thresh = ACCESS_ONCE(watchdog_thresh); 737 - old_enabled = ACCESS_ONCE(watchdog_user_enabled); 738 - old_hardlockup = watchdog_hardlockup_detector_is_enabled(); 739 728 729 + /* 730 + * If the parameter is being read return the state of the corresponding 731 + * bit(s) in 'watchdog_enabled', else update 'watchdog_enabled' and the 732 + * run state of the lockup detectors. 733 + */ 734 + if (!write) { 735 + *watchdog_param = (watchdog_enabled & which) != 0; 736 + err = proc_dointvec_minmax(table, write, buffer, lenp, ppos); 737 + } else { 738 + err = proc_dointvec_minmax(table, write, buffer, lenp, ppos); 739 + if (err) 740 + goto out; 741 + 742 + /* 743 + * There is a race window between fetching the current value 744 + * from 'watchdog_enabled' and storing the new value. 
During 745 + * this race window, watchdog_nmi_enable() can sneak in and 746 + * clear the NMI_WATCHDOG_ENABLED bit in 'watchdog_enabled'. 747 + * The 'cmpxchg' detects this race and the loop retries. 748 + */ 749 + do { 750 + old = watchdog_enabled; 751 + /* 752 + * If the parameter value is not zero set the 753 + * corresponding bit(s), else clear it(them). 754 + */ 755 + if (*watchdog_param) 756 + new = old | which; 757 + else 758 + new = old & ~which; 759 + } while (cmpxchg(&watchdog_enabled, old, new) != old); 760 + 761 + /* 762 + * Update the run state of the lockup detectors. 763 + * Restore 'watchdog_enabled' on failure. 764 + */ 765 + err = proc_watchdog_update(); 766 + if (err) 767 + watchdog_enabled = old; 768 + } 769 + out: 770 + mutex_unlock(&watchdog_proc_mutex); 771 + return err; 772 + } 773 + 774 + /* 775 + * /proc/sys/kernel/watchdog 776 + */ 777 + int proc_watchdog(struct ctl_table *table, int write, 778 + void __user *buffer, size_t *lenp, loff_t *ppos) 779 + { 780 + return proc_watchdog_common(NMI_WATCHDOG_ENABLED|SOFT_WATCHDOG_ENABLED, 781 + table, write, buffer, lenp, ppos); 782 + } 783 + 784 + /* 785 + * /proc/sys/kernel/nmi_watchdog 786 + */ 787 + int proc_nmi_watchdog(struct ctl_table *table, int write, 788 + void __user *buffer, size_t *lenp, loff_t *ppos) 789 + { 790 + return proc_watchdog_common(NMI_WATCHDOG_ENABLED, 791 + table, write, buffer, lenp, ppos); 792 + } 793 + 794 + /* 795 + * /proc/sys/kernel/soft_watchdog 796 + */ 797 + int proc_soft_watchdog(struct ctl_table *table, int write, 798 + void __user *buffer, size_t *lenp, loff_t *ppos) 799 + { 800 + return proc_watchdog_common(SOFT_WATCHDOG_ENABLED, 801 + table, write, buffer, lenp, ppos); 802 + } 803 + 804 + /* 805 + * /proc/sys/kernel/watchdog_thresh 806 + */ 807 + int proc_watchdog_thresh(struct ctl_table *table, int write, 808 + void __user *buffer, size_t *lenp, loff_t *ppos) 809 + { 810 + int err, old; 811 + 812 + mutex_lock(&watchdog_proc_mutex); 813 + 814 + old = 
ACCESS_ONCE(watchdog_thresh); 740 815 err = proc_dointvec_minmax(table, write, buffer, lenp, ppos); 816 + 741 817 if (err || !write) 742 818 goto out; 743 819 744 - set_sample_period(); 745 820 /* 746 - * Watchdog threads shouldn't be enabled if they are 747 - * disabled. The 'watchdog_running' variable check in 748 - * watchdog_*_all_cpus() function takes care of this. 821 + * Update the sample period. 822 + * Restore 'watchdog_thresh' on failure. 749 823 */ 750 - if (watchdog_user_enabled && watchdog_thresh) { 751 - /* 752 - * Prevent a change in watchdog_thresh accidentally overriding 753 - * the enablement of the hardlockup detector. 754 - */ 755 - if (watchdog_user_enabled != old_enabled) 756 - watchdog_enable_hardlockup_detector(true); 757 - err = watchdog_enable_all_cpus(old_thresh != watchdog_thresh); 758 - } else 759 - watchdog_disable_all_cpus(); 760 - 761 - /* Restore old values on failure */ 762 - if (err) { 763 - watchdog_thresh = old_thresh; 764 - watchdog_user_enabled = old_enabled; 765 - watchdog_enable_hardlockup_detector(old_hardlockup); 766 - } 824 + set_sample_period(); 825 + err = proc_watchdog_update(); 826 + if (err) 827 + watchdog_thresh = old; 767 828 out: 768 829 mutex_unlock(&watchdog_proc_mutex); 769 830 return err; ··· 875 734 { 876 735 set_sample_period(); 877 736 878 - if (watchdog_user_enabled) 879 - watchdog_enable_all_cpus(false); 737 + if (watchdog_enabled) 738 + watchdog_enable_all_cpus(); 880 739 }
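The cmpxchg loop in proc_watchdog_common() above is the standard lock-free read-modify-write for flag bits: recompute the new value from the freshly observed old one and retry if another updater raced in between. A userspace sketch with C11 atomics (the kernel uses its own cmpxchg() primitive, not stdatomic; names here are illustrative):

```c
#include <stdatomic.h>

#define DEMO_NMI_ENABLED  (1UL << 0)
#define DEMO_SOFT_ENABLED (1UL << 1)

/* Set or clear flag bits in *flags without a lock. If a concurrent
 * updater changes *flags between our load and our compare-exchange,
 * the exchange fails, 'old' is refreshed with the current value, and
 * the loop recomputes 'new' and retries - exactly the race window the
 * comment in proc_watchdog_common() describes. */
unsigned long demo_update_flags(_Atomic unsigned long *flags,
                                unsigned long which, int enable)
{
    unsigned long old, new;

    old = atomic_load(flags);
    do {
        new = enable ? (old | which) : (old & ~which);
        /* on failure, atomic_compare_exchange_weak reloads 'old' */
    } while (!atomic_compare_exchange_weak(flags, &old, new));

    return new;
}
```

The same shape is why watchdog_nmi_enable() can safely clear NMI_WATCHDOG_ENABLED asynchronously: a racing proc write simply loops once more and applies its bits on top of the new state.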
+12
lib/Kconfig.debug
··· 1760 1760 1761 1761 If unsure, say N. 1762 1762 1763 + config MEMTEST 1764 + bool "Memtest" 1765 + depends on HAVE_MEMBLOCK 1766 + ---help--- 1767 + This option adds the kernel parameter 'memtest', which sets the 1768 + number of memory test patterns to run. 1769 + memtest=0, means disabled; -- default 1770 + memtest=1, means run 1 test pattern; 1771 + ... 1772 + memtest=17, means run 17 test patterns. 1773 + If you are unsure how to answer this question, answer N. 1774 + 1763 1775 source "samples/Kconfig" 1764 1776 1765 1777 source "lib/Kconfig.kgdb"
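Conceptually, each memtest pattern pass fills a region with one pattern and reads it back, flagging words that fail to hold the value. A hedged sketch of a single pass (this is not the kernel's memtest.c; the buffer and return convention here are illustrative):

```c
#include <stddef.h>

/* Fill `words` with `pattern`, then verify every word; returns the
 * number of mismatching words. Bad memory shows up as a non-zero
 * count (the kernel instead reserves the bad ranges via memblock). */
static size_t memtest_pass(unsigned long *words, size_t n,
			   unsigned long pattern)
{
	size_t i, bad = 0;

	for (i = 0; i < n; i++)
		words[i] = pattern;
	for (i = 0; i < n; i++)
		if (words[i] != pattern)
			bad++;
	return bad;
}
```

Running with memtest=N repeats this kind of pass with N different patterns (all-ones, all-zeros, alternating bits, and so on).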
+53
lib/ioremap.c
··· 13 13 #include <asm/cacheflush.h> 14 14 #include <asm/pgtable.h> 15 15 16 + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 17 + static int __read_mostly ioremap_pud_capable; 18 + static int __read_mostly ioremap_pmd_capable; 19 + static int __read_mostly ioremap_huge_disabled; 20 + 21 + static int __init set_nohugeiomap(char *str) 22 + { 23 + ioremap_huge_disabled = 1; 24 + return 0; 25 + } 26 + early_param("nohugeiomap", set_nohugeiomap); 27 + 28 + void __init ioremap_huge_init(void) 29 + { 30 + if (!ioremap_huge_disabled) { 31 + if (arch_ioremap_pud_supported()) 32 + ioremap_pud_capable = 1; 33 + if (arch_ioremap_pmd_supported()) 34 + ioremap_pmd_capable = 1; 35 + } 36 + } 37 + 38 + static inline int ioremap_pud_enabled(void) 39 + { 40 + return ioremap_pud_capable; 41 + } 42 + 43 + static inline int ioremap_pmd_enabled(void) 44 + { 45 + return ioremap_pmd_capable; 46 + } 47 + 48 + #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ 49 + static inline int ioremap_pud_enabled(void) { return 0; } 50 + static inline int ioremap_pmd_enabled(void) { return 0; } 51 + #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ 52 + 16 53 static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, 17 54 unsigned long end, phys_addr_t phys_addr, pgprot_t prot) 18 55 { ··· 80 43 return -ENOMEM; 81 44 do { 82 45 next = pmd_addr_end(addr, end); 46 + 47 + if (ioremap_pmd_enabled() && 48 + ((next - addr) == PMD_SIZE) && 49 + IS_ALIGNED(phys_addr + addr, PMD_SIZE)) { 50 + if (pmd_set_huge(pmd, phys_addr + addr, prot)) 51 + continue; 52 + } 53 + 83 54 if (ioremap_pte_range(pmd, addr, next, phys_addr + addr, prot)) 84 55 return -ENOMEM; 85 56 } while (pmd++, addr = next, addr != end); ··· 106 61 return -ENOMEM; 107 62 do { 108 63 next = pud_addr_end(addr, end); 64 + 65 + if (ioremap_pud_enabled() && 66 + ((next - addr) == PUD_SIZE) && 67 + IS_ALIGNED(phys_addr + addr, PUD_SIZE)) { 68 + if (pud_set_huge(pud, phys_addr + addr, prot)) 69 + continue; 70 + } 71 + 109 72 if (ioremap_pmd_range(pud, addr, next, phys_addr + 
addr, prot)) 110 73 return -ENOMEM; 111 74 } while (pud++, addr = next, addr != end);
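The new PMD/PUD paths in lib/ioremap.c take a huge mapping only when the remaining virtual range spans exactly one huge page and the backing physical address is suitably aligned. The condition reduces to (standalone sketch; a 2 MiB PMD size is assumed here for illustration):

```c
#include <stdbool.h>

#define PMD_SIZE	(1UL << 21)	/* 2 MiB, typical x86-64 value */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/* Mirror of the check in ioremap_pmd_range(): the virtual range
 * [addr, next) must cover one whole PMD, and the physical address
 * that would back it must be PMD-aligned. */
static bool pmd_mapping_possible(unsigned long addr, unsigned long next,
				 unsigned long phys_addr)
{
	return (next - addr) == PMD_SIZE &&
	       IS_ALIGNED(phys_addr + addr, PMD_SIZE);
}
```

When either condition fails, the code falls through to ioremap_pte_range() and maps the range with base pages instead.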
+6
mm/Kconfig
··· 517 517 processing calls such as dma_alloc_from_contiguous(). 518 518 This option does not affect warning and error messages. 519 519 520 + config CMA_DEBUGFS 521 + bool "CMA debugfs interface" 522 + depends on CMA && DEBUG_FS 523 + help 524 + Turns on the DebugFS interface for CMA. 525 + 520 526 config CMA_AREAS 521 527 int "Maximum count of the CMA areas" 522 528 depends on CMA
+2
mm/Makefile
··· 55 55 obj-$(CONFIG_KASAN) += kasan/ 56 56 obj-$(CONFIG_FAILSLAB) += failslab.o 57 57 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o 58 + obj-$(CONFIG_MEMTEST) += memtest.o 58 59 obj-$(CONFIG_MIGRATION) += migrate.o 59 60 obj-$(CONFIG_QUICKLIST) += quicklist.o 60 61 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o ··· 77 76 obj-$(CONFIG_CMA) += cma.o 78 77 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 79 78 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o 79 + obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
+96 -186
mm/cleancache.c
··· 19 19 #include <linux/cleancache.h> 20 20 21 21 /* 22 - * cleancache_ops is set by cleancache_ops_register to contain the pointers 22 + * cleancache_ops is set by cleancache_register_ops to contain the pointers 23 23 * to the cleancache "backend" implementation functions. 24 24 */ 25 25 static struct cleancache_ops *cleancache_ops __read_mostly; ··· 34 34 static u64 cleancache_puts; 35 35 static u64 cleancache_invalidates; 36 36 37 - /* 38 - * When no backend is registered all calls to init_fs and init_shared_fs 39 - * are registered and fake poolids (FAKE_FS_POOLID_OFFSET or 40 - * FAKE_SHARED_FS_POOLID_OFFSET, plus offset in the respective array 41 - * [shared_|]fs_poolid_map) are given to the respective super block 42 - * (sb->cleancache_poolid) and no tmem_pools are created. When a backend 43 - * registers with cleancache the previous calls to init_fs and init_shared_fs 44 - * are executed to create tmem_pools and set the respective poolids. While no 45 - * backend is registered all "puts", "gets" and "flushes" are ignored or failed. 46 - */ 47 - #define MAX_INITIALIZABLE_FS 32 48 - #define FAKE_FS_POOLID_OFFSET 1000 49 - #define FAKE_SHARED_FS_POOLID_OFFSET 2000 50 - 51 - #define FS_NO_BACKEND (-1) 52 - #define FS_UNKNOWN (-2) 53 - static int fs_poolid_map[MAX_INITIALIZABLE_FS]; 54 - static int shared_fs_poolid_map[MAX_INITIALIZABLE_FS]; 55 - static char *uuids[MAX_INITIALIZABLE_FS]; 56 - /* 57 - * Mutex for the [shared_|]fs_poolid_map to guard against multiple threads 58 - * invoking umount (and ending in __cleancache_invalidate_fs) and also multiple 59 - * threads calling mount (and ending up in __cleancache_init_[shared|]fs). 60 - */ 61 - static DEFINE_MUTEX(poolid_mutex); 62 - /* 63 - * When set to false (default) all calls to the cleancache functions, except 64 - * the __cleancache_invalidate_fs and __cleancache_init_[shared|]fs are guarded 65 - * by the if (!cleancache_ops) return. 
This means multiple threads (from 66 - * different filesystems) will be checking cleancache_ops. The usage of a 67 - * bool instead of a atomic_t or a bool guarded by a spinlock is OK - we are 68 - * OK if the time between the backend's have been initialized (and 69 - * cleancache_ops has been set to not NULL) and when the filesystems start 70 - * actually calling the backends. The inverse (when unloading) is obviously 71 - * not good - but this shim does not do that (yet). 72 - */ 73 - 74 - /* 75 - * The backends and filesystems work all asynchronously. This is b/c the 76 - * backends can be built as modules. 77 - * The usual sequence of events is: 78 - * a) mount / -> __cleancache_init_fs is called. We set the 79 - * [shared_|]fs_poolid_map and uuids for. 80 - * 81 - * b). user does I/Os -> we call the rest of __cleancache_* functions 82 - * which return immediately as cleancache_ops is false. 83 - * 84 - * c). modprobe zcache -> cleancache_register_ops. We init the backend 85 - * and set cleancache_ops to true, and for any fs_poolid_map 86 - * (which is set by __cleancache_init_fs) we initialize the poolid. 87 - * 88 - * d). user does I/Os -> now that cleancache_ops is true all the 89 - * __cleancache_* functions can call the backend. They all check 90 - * that fs_poolid_map is valid and if so invoke the backend. 91 - * 92 - * e). umount / -> __cleancache_invalidate_fs, the fs_poolid_map is 93 - * reset (which is the second check in the __cleancache_* ops 94 - * to call the backend). 95 - * 96 - * The sequence of event could also be c), followed by a), and d). and e). The 97 - * c) would not happen anymore. There is also the chance of c), and one thread 98 - * doing a) + d), and another doing e). For that case we depend on the 99 - * filesystem calling __cleancache_invalidate_fs in the proper sequence (so 100 - * that it handles all I/Os before it invalidates the fs (which is last part 101 - * of unmounting process). 
102 - * 103 - * Note: The acute reader will notice that there is no "rmmod zcache" case. 104 - * This is b/c the functionality for that is not yet implemented and when 105 - * done, will require some extra locking not yet devised. 106 - */ 107 - 108 - /* 109 - * Register operations for cleancache, returning previous thus allowing 110 - * detection of multiple backends and possible nesting. 111 - */ 112 - struct cleancache_ops *cleancache_register_ops(struct cleancache_ops *ops) 37 + static void cleancache_register_ops_sb(struct super_block *sb, void *unused) 113 38 { 114 - struct cleancache_ops *old = cleancache_ops; 115 - int i; 116 - 117 - mutex_lock(&poolid_mutex); 118 - for (i = 0; i < MAX_INITIALIZABLE_FS; i++) { 119 - if (fs_poolid_map[i] == FS_NO_BACKEND) 120 - fs_poolid_map[i] = ops->init_fs(PAGE_SIZE); 121 - if (shared_fs_poolid_map[i] == FS_NO_BACKEND) 122 - shared_fs_poolid_map[i] = ops->init_shared_fs 123 - (uuids[i], PAGE_SIZE); 39 + switch (sb->cleancache_poolid) { 40 + case CLEANCACHE_NO_BACKEND: 41 + __cleancache_init_fs(sb); 42 + break; 43 + case CLEANCACHE_NO_BACKEND_SHARED: 44 + __cleancache_init_shared_fs(sb); 45 + break; 124 46 } 47 + } 48 + 49 + /* 50 + * Register operations for cleancache. Returns 0 on success. 51 + */ 52 + int cleancache_register_ops(struct cleancache_ops *ops) 53 + { 54 + if (cmpxchg(&cleancache_ops, NULL, ops)) 55 + return -EBUSY; 56 + 125 57 /* 126 - * We MUST set cleancache_ops _after_ we have called the backends 127 - * init_fs or init_shared_fs functions. Otherwise the compiler might 128 - * re-order where cleancache_ops is set in this function. 58 + * A cleancache backend can be built as a module and hence loaded after 59 + * a cleancache enabled filesystem has called cleancache_init_fs. To 60 + * handle such a scenario, here we call ->init_fs or ->init_shared_fs 61 + * for each active super block. 
To differentiate between local and 62 + * shared filesystems, we temporarily initialize sb->cleancache_poolid 63 + * to CLEANCACHE_NO_BACKEND or CLEANCACHE_NO_BACKEND_SHARED 64 + * respectively in case there is no backend registered at the time 65 + * cleancache_init_fs or cleancache_init_shared_fs is called. 66 + * 67 + * Since filesystems can be mounted concurrently with cleancache 68 + * backend registration, we have to be careful to guarantee that all 69 + * cleancache enabled filesystems that have been mounted by the time 70 + * cleancache_register_ops is called have got, and all mounted later 71 + * will get, a cleancache_poolid. This is assured by the following 72 + * statements tied together: 73 + * 74 + * a) iterate_supers skips only those super blocks that have started 75 + * ->kill_sb 76 + * 77 + * b) if iterate_supers encounters a super block that has not finished 78 + * ->mount yet, it waits until it is finished 79 + * 80 + * c) cleancache_init_fs is called from ->mount and 81 + * cleancache_invalidate_fs is called from ->kill_sb 82 + * 83 + * d) we call iterate_supers after cleancache_ops has been set 84 + * 85 + * From a) it follows that if iterate_supers skips a super block, then 86 + * either the super block is already dead, in which case we do not need 87 + * to bother initializing cleancache for it, or it was mounted after we 88 + * initiated iterate_supers. In the latter case, it must have seen 89 + * cleancache_ops set according to d) and initialized cleancache from 90 + * ->mount by itself according to c). This proves that we call 91 + * ->init_fs at least once for each active super block. 92 + * 93 + * From b) and c) it follows that if iterate_supers encounters a super 94 + * block that has already started ->init_fs, it will wait until ->mount 95 + * and hence ->init_fs has finished, then check cleancache_poolid, see 96 + * that it has already been set and therefore do nothing.
This proves 97 + * that we call ->init_fs no more than once for each super block. 98 + * 99 + * Combined together, the last two paragraphs prove the function 100 + * correctness. 101 + * 102 + * Note that various cleancache callbacks may proceed before this 103 + * function is called or even concurrently with it, but since 104 + * CLEANCACHE_NO_BACKEND is negative, they will all result in a noop 105 + * until the corresponding ->init_fs has been actually called and 106 + * cleancache_ops has been set. 129 107 */ 130 - barrier(); 131 - cleancache_ops = ops; 132 - mutex_unlock(&poolid_mutex); 133 - return old; 108 + iterate_supers(cleancache_register_ops_sb, NULL); 109 + return 0; 134 110 } 135 111 EXPORT_SYMBOL(cleancache_register_ops); 136 112 137 113 /* Called by a cleancache-enabled filesystem at time of mount */ 138 114 void __cleancache_init_fs(struct super_block *sb) 139 115 { 140 - int i; 116 + int pool_id = CLEANCACHE_NO_BACKEND; 141 117 142 - mutex_lock(&poolid_mutex); 143 - for (i = 0; i < MAX_INITIALIZABLE_FS; i++) { 144 - if (fs_poolid_map[i] == FS_UNKNOWN) { 145 - sb->cleancache_poolid = i + FAKE_FS_POOLID_OFFSET; 146 - if (cleancache_ops) 147 - fs_poolid_map[i] = cleancache_ops->init_fs(PAGE_SIZE); 148 - else 149 - fs_poolid_map[i] = FS_NO_BACKEND; 150 - break; 151 - } 118 + if (cleancache_ops) { 119 + pool_id = cleancache_ops->init_fs(PAGE_SIZE); 120 + if (pool_id < 0) 121 + pool_id = CLEANCACHE_NO_POOL; 152 122 } 153 - mutex_unlock(&poolid_mutex); 123 + sb->cleancache_poolid = pool_id; 154 124 } 155 125 EXPORT_SYMBOL(__cleancache_init_fs); 156 126 157 127 /* Called by a cleancache-enabled clustered filesystem at time of mount */ 158 - void __cleancache_init_shared_fs(char *uuid, struct super_block *sb) 128 + void __cleancache_init_shared_fs(struct super_block *sb) 159 129 { 160 - int i; 130 + int pool_id = CLEANCACHE_NO_BACKEND_SHARED; 161 131 162 - mutex_lock(&poolid_mutex); 163 - for (i = 0; i < MAX_INITIALIZABLE_FS; i++) { 164 - if 
(shared_fs_poolid_map[i] == FS_UNKNOWN) { 165 - sb->cleancache_poolid = i + FAKE_SHARED_FS_POOLID_OFFSET; 166 - uuids[i] = uuid; 167 - if (cleancache_ops) 168 - shared_fs_poolid_map[i] = cleancache_ops->init_shared_fs 169 - (uuid, PAGE_SIZE); 170 - else 171 - shared_fs_poolid_map[i] = FS_NO_BACKEND; 172 - break; 173 - } 132 + if (cleancache_ops) { 133 + pool_id = cleancache_ops->init_shared_fs(sb->s_uuid, PAGE_SIZE); 134 + if (pool_id < 0) 135 + pool_id = CLEANCACHE_NO_POOL; 174 136 } 175 - mutex_unlock(&poolid_mutex); 137 + sb->cleancache_poolid = pool_id; 176 138 } 177 139 EXPORT_SYMBOL(__cleancache_init_shared_fs); 178 140 ··· 164 202 } 165 203 166 204 /* 167 - * Returns a pool_id that is associated with a given fake poolid. 168 - */ 169 - static int get_poolid_from_fake(int fake_pool_id) 170 - { 171 - if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET) 172 - return shared_fs_poolid_map[fake_pool_id - 173 - FAKE_SHARED_FS_POOLID_OFFSET]; 174 - else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET) 175 - return fs_poolid_map[fake_pool_id - FAKE_FS_POOLID_OFFSET]; 176 - return FS_NO_BACKEND; 177 - } 178 - 179 - /* 180 205 * "Get" data from cleancache associated with the poolid/inode/index 181 206 * that were specified when the data was put to cleancache and, if 182 207 * successful, use it to fill the specified page with data and return 0.
··· 178 229 { 179 230 int ret = -1; 180 231 int pool_id; 181 - int fake_pool_id; 182 232 struct cleancache_filekey key = { .u.key = { 0 } }; 183 233 184 234 if (!cleancache_ops) { ··· 186 238 } 187 239 188 240 VM_BUG_ON_PAGE(!PageLocked(page), page); 189 - fake_pool_id = page->mapping->host->i_sb->cleancache_poolid; 190 - if (fake_pool_id < 0) 241 + pool_id = page->mapping->host->i_sb->cleancache_poolid; 242 + if (pool_id < 0) 191 243 goto out; 192 - pool_id = get_poolid_from_fake(fake_pool_id); 193 244 194 245 if (cleancache_get_key(page->mapping->host, &key) < 0) 195 246 goto out; 196 247 197 - if (pool_id >= 0) 198 - ret = cleancache_ops->get_page(pool_id, 199 - key, page->index, page); 248 + ret = cleancache_ops->get_page(pool_id, key, page->index, page); 200 249 if (ret == 0) 201 250 cleancache_succ_gets++; 202 251 else ··· 216 271 void __cleancache_put_page(struct page *page) 217 272 { 218 273 int pool_id; 219 - int fake_pool_id; 220 274 struct cleancache_filekey key = { .u.key = { 0 } }; 221 275 222 276 if (!cleancache_ops) { ··· 224 280 } 225 281 226 282 VM_BUG_ON_PAGE(!PageLocked(page), page); 227 - fake_pool_id = page->mapping->host->i_sb->cleancache_poolid; 228 - if (fake_pool_id < 0) 229 - return; 230 - 231 - pool_id = get_poolid_from_fake(fake_pool_id); 232 - 283 + pool_id = page->mapping->host->i_sb->cleancache_poolid; 233 284 if (pool_id >= 0 && 234 285 cleancache_get_key(page->mapping->host, &key) >= 0) { 235 286 cleancache_ops->put_page(pool_id, key, page->index, page); ··· 245 306 struct page *page) 246 307 { 247 308 /* careful... 
page->mapping is NULL sometimes when this is called */ 248 - int pool_id; 249 - int fake_pool_id = mapping->host->i_sb->cleancache_poolid; 309 + int pool_id = mapping->host->i_sb->cleancache_poolid; 250 310 struct cleancache_filekey key = { .u.key = { 0 } }; 251 311 252 312 if (!cleancache_ops) 253 313 return; 254 314 255 - if (fake_pool_id >= 0) { 256 - pool_id = get_poolid_from_fake(fake_pool_id); 257 - if (pool_id < 0) 258 - return; 259 - 315 + if (pool_id >= 0) { 260 316 VM_BUG_ON_PAGE(!PageLocked(page), page); 261 317 if (cleancache_get_key(mapping->host, &key) >= 0) { 262 318 cleancache_ops->invalidate_page(pool_id, ··· 273 339 */ 274 340 void __cleancache_invalidate_inode(struct address_space *mapping) 275 341 { 276 - int pool_id; 277 - int fake_pool_id = mapping->host->i_sb->cleancache_poolid; 342 + int pool_id = mapping->host->i_sb->cleancache_poolid; 278 343 struct cleancache_filekey key = { .u.key = { 0 } }; 279 344 280 345 if (!cleancache_ops) 281 346 return; 282 - 283 - if (fake_pool_id < 0) 284 - return; 285 - 286 - pool_id = get_poolid_from_fake(fake_pool_id); 287 347 288 348 if (pool_id >= 0 && cleancache_get_key(mapping->host, &key) >= 0) 289 349 cleancache_ops->invalidate_inode(pool_id, key); ··· 291 363 */ 292 364 void __cleancache_invalidate_fs(struct super_block *sb) 293 365 { 294 - int index; 295 - int fake_pool_id = sb->cleancache_poolid; 296 - int old_poolid = fake_pool_id; 366 + int pool_id; 297 367 298 - mutex_lock(&poolid_mutex); 299 - if (fake_pool_id >= FAKE_SHARED_FS_POOLID_OFFSET) { 300 - index = fake_pool_id - FAKE_SHARED_FS_POOLID_OFFSET; 301 - old_poolid = shared_fs_poolid_map[index]; 302 - shared_fs_poolid_map[index] = FS_UNKNOWN; 303 - uuids[index] = NULL; 304 - } else if (fake_pool_id >= FAKE_FS_POOLID_OFFSET) { 305 - index = fake_pool_id - FAKE_FS_POOLID_OFFSET; 306 - old_poolid = fs_poolid_map[index]; 307 - fs_poolid_map[index] = FS_UNKNOWN; 308 - } 309 - sb->cleancache_poolid = -1; 310 - if (cleancache_ops) 311 - 
cleancache_ops->invalidate_fs(old_poolid); 312 - mutex_unlock(&poolid_mutex); 368 + pool_id = sb->cleancache_poolid; 369 + sb->cleancache_poolid = CLEANCACHE_NO_POOL; 370 + 371 + if (cleancache_ops && pool_id >= 0) 372 + cleancache_ops->invalidate_fs(pool_id); 313 373 } 314 374 EXPORT_SYMBOL(__cleancache_invalidate_fs); 315 375 316 376 static int __init init_cleancache(void) 317 377 { 318 - int i; 319 - 320 378 #ifdef CONFIG_DEBUG_FS 321 379 struct dentry *root = debugfs_create_dir("cleancache", NULL); 322 380 if (root == NULL) ··· 314 400 debugfs_create_u64("invalidates", S_IRUGO, 315 401 root, &cleancache_invalidates); 316 402 #endif 317 - for (i = 0; i < MAX_INITIALIZABLE_FS; i++) { 318 - fs_poolid_map[i] = FS_UNKNOWN; 319 - shared_fs_poolid_map[i] = FS_UNKNOWN; 320 - } 321 403 return 0; 322 404 } 323 405 module_init(init_cleancache)
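The single-backend guarantee in the rewritten cleancache_register_ops() comes from a lock-free publish: only the thread that swings the pointer from NULL wins, and every later registration fails with -EBUSY. A userspace sketch of that pattern (names here are illustrative stand-ins):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <errno.h>

struct backend_ops { int id; };

static _Atomic(struct backend_ops *) registered_ops;

/* First caller installs its ops and returns 0; any later caller sees
 * a non-NULL pointer, the CAS fails, and -EBUSY is returned — the
 * same shape as cmpxchg(&cleancache_ops, NULL, ops) in the patch. */
static int register_backend(struct backend_ops *ops)
{
	struct backend_ops *expected = NULL;

	if (!atomic_compare_exchange_strong(&registered_ops,
					    &expected, ops))
		return -EBUSY;
	return 0;
}
```

Publishing the pointer before walking the super blocks is what makes step d) of the comment's correctness argument hold: any filesystem mounted after the CAS observes a non-NULL ops pointer.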
+23 -24
mm/cma.c
··· 35 35 #include <linux/highmem.h> 36 36 #include <linux/io.h> 37 37 38 - struct cma { 39 - unsigned long base_pfn; 40 - unsigned long count; 41 - unsigned long *bitmap; 42 - unsigned int order_per_bit; /* Order of pages represented by one bit */ 43 - struct mutex lock; 44 - }; 38 + #include "cma.h" 45 39 46 - static struct cma cma_areas[MAX_CMA_AREAS]; 47 - static unsigned cma_area_count; 40 + struct cma cma_areas[MAX_CMA_AREAS]; 41 + unsigned cma_area_count; 48 42 static DEFINE_MUTEX(cma_mutex); 49 43 50 - phys_addr_t cma_get_base(struct cma *cma) 44 + phys_addr_t cma_get_base(const struct cma *cma) 51 45 { 52 46 return PFN_PHYS(cma->base_pfn); 53 47 } 54 48 55 - unsigned long cma_get_size(struct cma *cma) 49 + unsigned long cma_get_size(const struct cma *cma) 56 50 { 57 51 return cma->count << PAGE_SHIFT; 58 52 } 59 53 60 - static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order) 54 + static unsigned long cma_bitmap_aligned_mask(const struct cma *cma, 55 + int align_order) 61 56 { 62 57 if (align_order <= cma->order_per_bit) 63 58 return 0; ··· 63 68 * Find a PFN aligned to the specified order and return an offset represented in 64 69 * order_per_bits. 
65 70 */ 66 - static unsigned long cma_bitmap_aligned_offset(struct cma *cma, int align_order) 71 + static unsigned long cma_bitmap_aligned_offset(const struct cma *cma, 72 + int align_order) 67 73 { 68 74 if (align_order <= cma->order_per_bit) 69 75 return 0; ··· 73 77 - cma->base_pfn) >> cma->order_per_bit; 74 78 } 75 79 76 - static unsigned long cma_bitmap_maxno(struct cma *cma) 77 - { 78 - return cma->count >> cma->order_per_bit; 79 - } 80 - 81 - static unsigned long cma_bitmap_pages_to_bits(struct cma *cma, 82 - unsigned long pages) 80 + static unsigned long cma_bitmap_pages_to_bits(const struct cma *cma, 81 + unsigned long pages) 83 82 { 84 83 return ALIGN(pages, 1UL << cma->order_per_bit) >> cma->order_per_bit; 85 84 } 86 85 87 - static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count) 86 + static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, 87 + unsigned int count) 88 88 { 89 89 unsigned long bitmap_no, bitmap_count; 90 90 ··· 126 134 } while (--i); 127 135 128 136 mutex_init(&cma->lock); 137 + 138 + #ifdef CONFIG_CMA_DEBUGFS 139 + INIT_HLIST_HEAD(&cma->mem_head); 140 + spin_lock_init(&cma->mem_head_lock); 141 + #endif 142 + 129 143 return 0; 130 144 131 145 err: ··· 165 167 * This function creates custom contiguous area from already reserved memory. 166 168 */ 167 169 int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, 168 - int order_per_bit, struct cma **res_cma) 170 + unsigned int order_per_bit, 171 + struct cma **res_cma) 169 172 { 170 173 struct cma *cma; 171 174 phys_addr_t alignment; ··· 357 358 * This function allocates part of contiguous memory on specific 358 359 * contiguous memory area. 
359 360 */ 360 - struct page *cma_alloc(struct cma *cma, int count, unsigned int align) 361 + struct page *cma_alloc(struct cma *cma, unsigned int count, unsigned int align) 361 362 { 362 363 unsigned long mask, offset, pfn, start = 0; 363 364 unsigned long bitmap_maxno, bitmap_no, bitmap_count; ··· 428 429 * It returns false when provided pages do not belong to contiguous area and 429 430 * true otherwise. 430 431 */ 431 - bool cma_release(struct cma *cma, struct page *pages, int count) 432 + bool cma_release(struct cma *cma, const struct page *pages, unsigned int count) 432 433 { 433 434 unsigned long pfn; 434 435
+24
mm/cma.h
··· 1 + #ifndef __MM_CMA_H__ 2 + #define __MM_CMA_H__ 3 + 4 + struct cma { 5 + unsigned long base_pfn; 6 + unsigned long count; 7 + unsigned long *bitmap; 8 + unsigned int order_per_bit; /* Order of pages represented by one bit */ 9 + struct mutex lock; 10 + #ifdef CONFIG_CMA_DEBUGFS 11 + struct hlist_head mem_head; 12 + spinlock_t mem_head_lock; 13 + #endif 14 + }; 15 + 16 + extern struct cma cma_areas[MAX_CMA_AREAS]; 17 + extern unsigned cma_area_count; 18 + 19 + static unsigned long cma_bitmap_maxno(struct cma *cma) 20 + { 21 + return cma->count >> cma->order_per_bit; 22 + } 23 + 24 + #endif
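Since one bitmap bit covers 2^order_per_bit pages, the bitmap helpers moved into cma.h are just shifts and round-ups. The arithmetic, isolated (mirroring cma_bitmap_pages_to_bits() and cma_bitmap_maxno(); `ALIGN_UP` is a local stand-in for the kernel's ALIGN macro):

```c
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((unsigned long)(a) - 1))

/* Number of bitmap bits needed for an allocation of `pages` pages
 * when each bit stands for 2^order_per_bit pages (partial blocks
 * round up to a whole bit). */
static unsigned long pages_to_bits(unsigned long pages,
				   unsigned int order_per_bit)
{
	return ALIGN_UP(pages, 1UL << order_per_bit) >> order_per_bit;
}

/* Total number of bits in an area's bitmap, as cma_bitmap_maxno()
 * computes from the page count. */
static unsigned long bitmap_maxno(unsigned long count,
				  unsigned int order_per_bit)
{
	return count >> order_per_bit;
}
```

This rounding is also why cma_debug.c refuses to release a partial block when order_per_bit != 0: a single bit cannot represent a fraction of its block.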
+170
mm/cma_debug.c
··· 1 + /* 2 + * CMA DebugFS Interface 3 + * 4 + * Copyright (c) 2015 Sasha Levin <sasha.levin@oracle.com> 5 + */ 6 + 7 + 8 + #include <linux/debugfs.h> 9 + #include <linux/cma.h> 10 + #include <linux/list.h> 11 + #include <linux/kernel.h> 12 + #include <linux/slab.h> 13 + #include <linux/mm_types.h> 14 + 15 + #include "cma.h" 16 + 17 + struct cma_mem { 18 + struct hlist_node node; 19 + struct page *p; 20 + unsigned long n; 21 + }; 22 + 23 + static struct dentry *cma_debugfs_root; 24 + 25 + static int cma_debugfs_get(void *data, u64 *val) 26 + { 27 + unsigned long *p = data; 28 + 29 + *val = *p; 30 + 31 + return 0; 32 + } 33 + 34 + DEFINE_SIMPLE_ATTRIBUTE(cma_debugfs_fops, cma_debugfs_get, NULL, "%llu\n"); 35 + 36 + static void cma_add_to_cma_mem_list(struct cma *cma, struct cma_mem *mem) 37 + { 38 + spin_lock(&cma->mem_head_lock); 39 + hlist_add_head(&mem->node, &cma->mem_head); 40 + spin_unlock(&cma->mem_head_lock); 41 + } 42 + 43 + static struct cma_mem *cma_get_entry_from_list(struct cma *cma) 44 + { 45 + struct cma_mem *mem = NULL; 46 + 47 + spin_lock(&cma->mem_head_lock); 48 + if (!hlist_empty(&cma->mem_head)) { 49 + mem = hlist_entry(cma->mem_head.first, struct cma_mem, node); 50 + hlist_del_init(&mem->node); 51 + } 52 + spin_unlock(&cma->mem_head_lock); 53 + 54 + return mem; 55 + } 56 + 57 + static int cma_free_mem(struct cma *cma, int count) 58 + { 59 + struct cma_mem *mem = NULL; 60 + 61 + while (count) { 62 + mem = cma_get_entry_from_list(cma); 63 + if (mem == NULL) 64 + return 0; 65 + 66 + if (mem->n <= count) { 67 + cma_release(cma, mem->p, mem->n); 68 + count -= mem->n; 69 + kfree(mem); 70 + } else if (cma->order_per_bit == 0) { 71 + cma_release(cma, mem->p, count); 72 + mem->p += count; 73 + mem->n -= count; 74 + count = 0; 75 + cma_add_to_cma_mem_list(cma, mem); 76 + } else { 77 + pr_debug("cma: cannot release partial block when order_per_bit != 0\n"); 78 + cma_add_to_cma_mem_list(cma, mem); 79 + break; 80 + } 81 + } 82 + 83 + return 0; 84 + 85 + } 
86 + 87 + static int cma_free_write(void *data, u64 val) 88 + { 89 + int pages = val; 90 + struct cma *cma = data; 91 + 92 + return cma_free_mem(cma, pages); 93 + } 94 + 95 + DEFINE_SIMPLE_ATTRIBUTE(cma_free_fops, NULL, cma_free_write, "%llu\n"); 96 + 97 + static int cma_alloc_mem(struct cma *cma, int count) 98 + { 99 + struct cma_mem *mem; 100 + struct page *p; 101 + 102 + mem = kzalloc(sizeof(*mem), GFP_KERNEL); 103 + if (!mem) 104 + return -ENOMEM; 105 + 106 + p = cma_alloc(cma, count, 0); 107 + if (!p) { 108 + kfree(mem); 109 + return -ENOMEM; 110 + } 111 + 112 + mem->p = p; 113 + mem->n = count; 114 + 115 + cma_add_to_cma_mem_list(cma, mem); 116 + 117 + return 0; 118 + } 119 + 120 + static int cma_alloc_write(void *data, u64 val) 121 + { 122 + int pages = val; 123 + struct cma *cma = data; 124 + 125 + return cma_alloc_mem(cma, pages); 126 + } 127 + 128 + DEFINE_SIMPLE_ATTRIBUTE(cma_alloc_fops, NULL, cma_alloc_write, "%llu\n"); 129 + 130 + static void cma_debugfs_add_one(struct cma *cma, int idx) 131 + { 132 + struct dentry *tmp; 133 + char name[16]; 134 + int u32s; 135 + 136 + sprintf(name, "cma-%d", idx); 137 + 138 + tmp = debugfs_create_dir(name, cma_debugfs_root); 139 + 140 + debugfs_create_file("alloc", S_IWUSR, cma_debugfs_root, cma, 141 + &cma_alloc_fops); 142 + 143 + debugfs_create_file("free", S_IWUSR, cma_debugfs_root, cma, 144 + &cma_free_fops); 145 + 146 + debugfs_create_file("base_pfn", S_IRUGO, tmp, 147 + &cma->base_pfn, &cma_debugfs_fops); 148 + debugfs_create_file("count", S_IRUGO, tmp, 149 + &cma->count, &cma_debugfs_fops); 150 + debugfs_create_file("order_per_bit", S_IRUGO, tmp, 151 + &cma->order_per_bit, &cma_debugfs_fops); 152 + 153 + u32s = DIV_ROUND_UP(cma_bitmap_maxno(cma), BITS_PER_BYTE * sizeof(u32)); 154 + debugfs_create_u32_array("bitmap", S_IRUGO, tmp, (u32*)cma->bitmap, u32s); 155 + } 156 + 157 + static int __init cma_debugfs_init(void) 158 + { 159 + int i; 160 + 161 + cma_debugfs_root = debugfs_create_dir("cma", NULL); 162 + if 
(!cma_debugfs_root) 163 + return -ENOMEM; 164 + 165 + for (i = 0; i < cma_area_count; i++) 166 + cma_debugfs_add_one(&cma_areas[i], i); 167 + 168 + return 0; 169 + } 170 + late_initcall(cma_debugfs_init);
+13 -2
mm/compaction.c
··· 1174 1174 /* Direct compactor: Is a suitable page free? */ 1175 1175 for (order = cc->order; order < MAX_ORDER; order++) { 1176 1176 struct free_area *area = &zone->free_area[order]; 1177 + bool can_steal; 1177 1178 1178 1179 /* Job done if page is free of the right migratetype */ 1179 1180 if (!list_empty(&area->free_list[migratetype])) 1180 1181 return COMPACT_PARTIAL; 1181 1182 1182 - /* Job done if allocation would set block type */ 1183 - if (order >= pageblock_order && area->nr_free) 1183 + #ifdef CONFIG_CMA 1184 + /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */ 1185 + if (migratetype == MIGRATE_MOVABLE && 1186 + !list_empty(&area->free_list[MIGRATE_CMA])) 1187 + return COMPACT_PARTIAL; 1188 + #endif 1189 + /* 1190 + * Job done if allocation would steal freepages from 1191 + * other migratetype buddy lists. 1192 + */ 1193 + if (find_suitable_fallback(area, order, migratetype, 1194 + true, &can_steal) != -1) 1184 1195 return COMPACT_PARTIAL; 1185 1196 } 1186 1197
+7 -8
mm/filemap.c
··· 202 202 BUG_ON(page_mapped(page)); 203 203 204 204 /* 205 - * Some filesystems seem to re-dirty the page even after 206 - * the VM has canceled the dirty bit (eg ext3 journaling). 205 + * At this point page must be either written or cleaned by truncate. 206 + * Dirty page here signals a bug and loss of unwritten data. 207 207 * 208 - * Fix it up by doing a final dirty accounting check after 209 - * having removed the page entirely. 208 + * This fixes dirty accounting after removing the page entirely but 209 + * leaves PageDirty set: it has no effect for truncated page and 210 + * anyway will be cleared before returning page into buddy allocator. 210 211 */ 211 - if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { 212 - dec_zone_page_state(page, NR_FILE_DIRTY); 213 - dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE); 214 - } 212 + if (WARN_ON_ONCE(PageDirty(page))) 213 + account_page_cleaned(page, mapping); 215 214 } 216 215 217 216 /**
+121 -3
mm/gup.c
··· 92 92 */ 93 93 mark_page_accessed(page); 94 94 } 95 - if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { 95 + if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) { 96 96 /* 97 97 * The preliminary mapping check is mainly to avoid the 98 98 * pointless overhead of lock_page on the ZERO_PAGE ··· 265 265 unsigned int fault_flags = 0; 266 266 int ret; 267 267 268 - /* For mlock, just skip the stack guard page. */ 269 - if ((*flags & FOLL_MLOCK) && 268 + /* For mm_populate(), just skip the stack guard page. */ 269 + if ((*flags & FOLL_POPULATE) && 270 270 (stack_guard_page_start(vma, address) || 271 271 stack_guard_page_end(vma, address + PAGE_SIZE))) 272 272 return -ENOENT; ··· 817 817 pages, vmas, NULL, false, FOLL_TOUCH); 818 818 } 819 819 EXPORT_SYMBOL(get_user_pages); 820 + 821 + /** 822 + * populate_vma_page_range() - populate a range of pages in the vma. 823 + * @vma: target vma 824 + * @start: start address 825 + * @end: end address 826 + * @nonblocking: 827 + * 828 + * This takes care of mlocking the pages too if VM_LOCKED is set. 829 + * 830 + * return 0 on success, negative error code on error. 831 + * 832 + * vma->vm_mm->mmap_sem must be held. 833 + * 834 + * If @nonblocking is NULL, it may be held for read or write and will 835 + * be unperturbed. 836 + * 837 + * If @nonblocking is non-NULL, it must be held for read only and may be 838 + * released. If it's released, *@nonblocking will be set to 0.
839 + */ 840 + long populate_vma_page_range(struct vm_area_struct *vma, 841 + unsigned long start, unsigned long end, int *nonblocking) 842 + { 843 + struct mm_struct *mm = vma->vm_mm; 844 + unsigned long nr_pages = (end - start) / PAGE_SIZE; 845 + int gup_flags; 846 + 847 + VM_BUG_ON(start & ~PAGE_MASK); 848 + VM_BUG_ON(end & ~PAGE_MASK); 849 + VM_BUG_ON_VMA(start < vma->vm_start, vma); 850 + VM_BUG_ON_VMA(end > vma->vm_end, vma); 851 + VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm); 852 + 853 + gup_flags = FOLL_TOUCH | FOLL_POPULATE; 854 + /* 855 + * We want to touch writable mappings with a write fault in order 856 + * to break COW, except for shared mappings because these don't COW 857 + * and we would not want to dirty them for nothing. 858 + */ 859 + if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) 860 + gup_flags |= FOLL_WRITE; 861 + 862 + /* 863 + * We want mlock to succeed for regions that have any permissions 864 + * other than PROT_NONE. 865 + */ 866 + if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) 867 + gup_flags |= FOLL_FORCE; 868 + 869 + /* 870 + * We made sure addr is within a VMA, so the following will 871 + * not result in a stack expansion that recurses back here. 872 + */ 873 + return __get_user_pages(current, mm, start, nr_pages, gup_flags, 874 + NULL, NULL, nonblocking); 875 + } 876 + 877 + /* 878 + * __mm_populate - populate and/or mlock pages within a range of address space. 879 + * 880 + * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap 881 + * flags. VMAs must be already marked with the desired vm_flags, and 882 + * mmap_sem must not be held. 
883 + */ 884 + int __mm_populate(unsigned long start, unsigned long len, int ignore_errors) 885 + { 886 + struct mm_struct *mm = current->mm; 887 + unsigned long end, nstart, nend; 888 + struct vm_area_struct *vma = NULL; 889 + int locked = 0; 890 + long ret = 0; 891 + 892 + VM_BUG_ON(start & ~PAGE_MASK); 893 + VM_BUG_ON(len != PAGE_ALIGN(len)); 894 + end = start + len; 895 + 896 + for (nstart = start; nstart < end; nstart = nend) { 897 + /* 898 + * We want to fault in pages for [nstart; end) address range. 899 + * Find first corresponding VMA. 900 + */ 901 + if (!locked) { 902 + locked = 1; 903 + down_read(&mm->mmap_sem); 904 + vma = find_vma(mm, nstart); 905 + } else if (nstart >= vma->vm_end) 906 + vma = vma->vm_next; 907 + if (!vma || vma->vm_start >= end) 908 + break; 909 + /* 910 + * Set [nstart; nend) to intersection of desired address 911 + * range with the first VMA. Also, skip undesirable VMA types. 912 + */ 913 + nend = min(end, vma->vm_end); 914 + if (vma->vm_flags & (VM_IO | VM_PFNMAP)) 915 + continue; 916 + if (nstart < vma->vm_start) 917 + nstart = vma->vm_start; 918 + /* 919 + * Now fault in a range of pages. populate_vma_page_range() 920 + * double checks the vma flags, so that it won't mlock pages 921 + * if the vma was already munlocked. 922 + */ 923 + ret = populate_vma_page_range(vma, nstart, nend, &locked); 924 + if (ret < 0) { 925 + if (ignore_errors) { 926 + ret = 0; 927 + continue; /* continue at next VMA */ 928 + } 929 + break; 930 + } 931 + nend = nstart + ret * PAGE_SIZE; 932 + ret = 0; 933 + } 934 + if (locked) 935 + up_read(&mm->mmap_sem); 936 + return ret; /* 0 or negative error code */ 937 + } 820 938 821 939 /** 822 940 * get_dump_page() - pin user page in memory while writing it to core dump
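The new `__mm_populate()` above walks the VMA list and clamps the requested `[nstart, nend)` window to each VMA it intersects before faulting pages in. A minimal userspace sketch of just that interval walk, with a hypothetical simplified `struct vma` and all locking, gup and error handling omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's VMA list. */
struct vma {
	unsigned long vm_start, vm_end;
	struct vma *vm_next;
};

/*
 * Sketch of the interval walk in __mm_populate(): clamp the requested
 * [start, end) range to each VMA it intersects, and return how many
 * bytes of the range were covered by some VMA.
 */
static unsigned long populate_walk(struct vma *head,
				   unsigned long start, unsigned long end)
{
	unsigned long nstart, nend, covered = 0;
	struct vma *vma = head;

	for (nstart = start; nstart < end; nstart = nend) {
		/* find_vma() analogue: first VMA with vm_end > nstart */
		while (vma && vma->vm_end <= nstart)
			vma = vma->vm_next;
		if (!vma || vma->vm_start >= end)
			break;
		/* Intersect [nstart, end) with this VMA. */
		nend = end < vma->vm_end ? end : vma->vm_end;
		if (nstart < vma->vm_start)
			nstart = vma->vm_start;
		covered += nend - nstart;
	}
	return covered;
}
```

With VMAs at [0x1000, 0x3000) and [0x5000, 0x8000), populating [0x2000, 0x6000) covers the two intersections [0x2000, 0x3000) and [0x5000, 0x6000), i.e. 0x2000 bytes; the hole in between is skipped just as the kernel loop skips past gaps between VMAs.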
+28 -11
mm/huge_memory.c
··· 1231 1231 pmd, _pmd, 1)) 1232 1232 update_mmu_cache_pmd(vma, addr, pmd); 1233 1233 } 1234 - if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { 1234 + if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) { 1235 1235 if (page->mapping && trylock_page(page)) { 1236 1236 lru_add_drain(); 1237 1237 if (page->mapping) ··· 2109 2109 { 2110 2110 while (--_pte >= pte) { 2111 2111 pte_t pteval = *_pte; 2112 - if (!pte_none(pteval)) 2112 + if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval))) 2113 2113 release_pte_page(pte_page(pteval)); 2114 2114 } 2115 2115 } ··· 2120 2120 { 2121 2121 struct page *page; 2122 2122 pte_t *_pte; 2123 - int none = 0; 2123 + int none_or_zero = 0; 2124 2124 bool referenced = false, writable = false; 2125 2125 for (_pte = pte; _pte < pte+HPAGE_PMD_NR; 2126 2126 _pte++, address += PAGE_SIZE) { 2127 2127 pte_t pteval = *_pte; 2128 - if (pte_none(pteval)) { 2129 - if (++none <= khugepaged_max_ptes_none) 2128 + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { 2129 + if (++none_or_zero <= khugepaged_max_ptes_none) 2130 2130 continue; 2131 2131 else 2132 2132 goto out; ··· 2207 2207 pte_t pteval = *_pte; 2208 2208 struct page *src_page; 2209 2209 2210 - if (pte_none(pteval)) { 2210 + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { 2211 2211 clear_user_highpage(page, address); 2212 2212 add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1); 2213 + if (is_zero_pfn(pte_pfn(pteval))) { 2214 + /* 2215 + * ptl mostly unnecessary. 2216 + */ 2217 + spin_lock(ptl); 2218 + /* 2219 + * paravirt calls inside pte_clear here are 2220 + * superfluous. 
2221 + */ 2222 + pte_clear(vma->vm_mm, address, _pte); 2223 + spin_unlock(ptl); 2224 + } 2213 2225 } else { 2214 2226 src_page = pte_page(pteval); 2215 2227 copy_user_highpage(page, src_page, address, vma); ··· 2328 2316 struct vm_area_struct *vma, unsigned long address, 2329 2317 int node) 2330 2318 { 2319 + gfp_t flags; 2320 + 2331 2321 VM_BUG_ON_PAGE(*hpage, *hpage); 2322 + 2323 + /* Only allocate from the target node */ 2324 + flags = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) | 2325 + __GFP_THISNODE; 2332 2326 2333 2327 /* 2334 2328 * Before allocating the hugepage, release the mmap_sem read lock. ··· 2344 2326 */ 2345 2327 up_read(&mm->mmap_sem); 2346 2328 2347 - *hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask( 2348 - khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER); 2329 + *hpage = alloc_pages_exact_node(node, flags, HPAGE_PMD_ORDER); 2349 2330 if (unlikely(!*hpage)) { 2350 2331 count_vm_event(THP_COLLAPSE_ALLOC_FAILED); 2351 2332 *hpage = ERR_PTR(-ENOMEM); ··· 2560 2543 { 2561 2544 pmd_t *pmd; 2562 2545 pte_t *pte, *_pte; 2563 - int ret = 0, none = 0; 2546 + int ret = 0, none_or_zero = 0; 2564 2547 struct page *page; 2565 2548 unsigned long _address; 2566 2549 spinlock_t *ptl; ··· 2578 2561 for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; 2579 2562 _pte++, _address += PAGE_SIZE) { 2580 2563 pte_t pteval = *_pte; 2581 - if (pte_none(pteval)) { 2582 - if (++none <= khugepaged_max_ptes_none) 2564 + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { 2565 + if (++none_or_zero <= khugepaged_max_ptes_none) 2583 2566 continue; 2584 2567 else 2585 2568 goto out_unmap;
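The khugepaged hunks above change the collapse heuristic so that PTEs mapping the shared zero page are counted together with empty PTEs against `khugepaged_max_ptes_none`. A small sketch of that policy, using a mock PTE representation (a bare pfn, with `0` standing for `pte_none` and one designated pfn standing in for the zero page — both assumptions for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel standing in for the kernel's zero-page pfn. */
#define MOCK_ZERO_PFN 0xdeadUL

static int is_zero_pfn(unsigned long pfn)
{
	return pfn == MOCK_ZERO_PFN;
}

/*
 * Sketch of the changed khugepaged policy: empty ptes and zero-page
 * ptes share one budget (max_ptes_none), so a region backed mostly by
 * the zero page is still eligible for collapse into a huge page.
 */
static int collapse_suitable(const unsigned long *ptes, size_t n,
			     int max_ptes_none)
{
	int none_or_zero = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (ptes[i] == 0 || is_zero_pfn(ptes[i]))
			if (++none_or_zero > max_ptes_none)
				return 0;
	return 1;
}
```

Before this change only truly empty PTEs were budgeted, so zero-page mappings could block a collapse that the heuristic was meant to allow.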
+10 -2
mm/hugetlb.c
··· 3278 3278 struct page *page; 3279 3279 3280 3280 /* 3281 + * If we have a pending SIGKILL, don't keep faulting pages and 3282 + * potentially allocating memory. 3283 + */ 3284 + if (unlikely(fatal_signal_pending(current))) { 3285 + remainder = 0; 3286 + break; 3287 + } 3288 + 3289 + /* 3281 3290 * Some archs (sparc64, sh*) have multiple pte_ts to 3282 3291 * each hugepage. We have to make sure we get the 3283 3292 * first, for the page indexing below to work. ··· 3744 3735 if (!pmd_huge(*pmd)) 3745 3736 goto out; 3746 3737 if (pmd_present(*pmd)) { 3747 - page = pte_page(*(pte_t *)pmd) + 3748 - ((address & ~PMD_MASK) >> PAGE_SHIFT); 3738 + page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT); 3749 3739 if (flags & FOLL_GET) 3750 3740 get_page(page); 3751 3741 } else {
+3 -1
mm/internal.h
··· 200 200 unsigned long 201 201 isolate_migratepages_range(struct compact_control *cc, 202 202 unsigned long low_pfn, unsigned long end_pfn); 203 + int find_suitable_fallback(struct free_area *area, unsigned int order, 204 + int migratetype, bool only_stealable, bool *can_steal); 203 205 204 206 #endif 205 207 ··· 242 240 struct vm_area_struct *prev, struct rb_node *rb_parent); 243 241 244 242 #ifdef CONFIG_MMU 245 - extern long __mlock_vma_pages_range(struct vm_area_struct *vma, 243 + extern long populate_vma_page_range(struct vm_area_struct *vma, 246 244 unsigned long start, unsigned long end, int *nonblocking); 247 245 extern void munlock_vma_pages_range(struct vm_area_struct *vma, 248 246 unsigned long start, unsigned long end);
+2 -2
mm/memblock.c
··· 699 699 int nid, 700 700 unsigned long flags) 701 701 { 702 - struct memblock_type *_rgn = &memblock.reserved; 702 + struct memblock_type *type = &memblock.reserved; 703 703 704 704 memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n", 705 705 (unsigned long long)base, 706 706 (unsigned long long)base + size - 1, 707 707 flags, (void *)_RET_IP_); 708 708 709 - return memblock_add_range(_rgn, base, size, nid, flags); 709 + return memblock_add_range(type, base, size, nid, flags); 710 710 } 711 711 712 712 int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+101 -93
mm/memcontrol.c
··· 14 14 * Copyright (C) 2012 Parallels Inc. and Google Inc. 15 15 * Authors: Glauber Costa and Suleiman Souhlal 16 16 * 17 + * Native page reclaim 18 + * Charge lifetime sanitation 19 + * Lockless page tracking & accounting 20 + * Unified hierarchy configuration model 21 + * Copyright (C) 2015 Red Hat, Inc., Johannes Weiner 22 + * 17 23 * This program is free software; you can redistribute it and/or modify 18 24 * it under the terms of the GNU General Public License as published by 19 25 * the Free Software Foundation; either version 2 of the License, or ··· 1442 1436 struct mem_cgroup *iter; 1443 1437 unsigned int i; 1444 1438 1445 - if (!p) 1446 - return; 1447 - 1448 1439 mutex_lock(&oom_info_lock); 1449 1440 rcu_read_lock(); 1450 1441 1451 - pr_info("Task in "); 1452 - pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id)); 1453 - pr_cont(" killed as a result of limit of "); 1442 + if (p) { 1443 + pr_info("Task in "); 1444 + pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id)); 1445 + pr_cont(" killed as a result of limit of "); 1446 + } else { 1447 + pr_info("Memory limit reached of cgroup "); 1448 + } 1449 + 1454 1450 pr_cont_cgroup_path(memcg->css.cgroup); 1455 1451 pr_cont("\n"); 1456 1452 ··· 1539 1531 return; 1540 1532 } 1541 1533 1542 - check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL); 1534 + check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg); 1543 1535 totalpages = mem_cgroup_get_limit(memcg) ? : 1; 1544 1536 for_each_mem_cgroup_tree(iter, memcg) { 1545 1537 struct css_task_iter it; ··· 2786 2778 HPAGE_PMD_NR); 2787 2779 } 2788 2780 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 2789 - 2790 - /** 2791 - * mem_cgroup_move_account - move account of the page 2792 - * @page: the page 2793 - * @nr_pages: number of regular pages (>1 for huge pages) 2794 - * @from: mem_cgroup which the page is moved from. 2795 - * @to: mem_cgroup which the page is moved to. @from != @to. 2796 - * 2797 - * The caller must confirm following. 
2798 - * - page is not on LRU (isolate_page() is useful.) 2799 - * - compound_lock is held when nr_pages > 1 2800 - * 2801 - * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" 2802 - * from old cgroup. 2803 - */ 2804 - static int mem_cgroup_move_account(struct page *page, 2805 - unsigned int nr_pages, 2806 - struct mem_cgroup *from, 2807 - struct mem_cgroup *to) 2808 - { 2809 - unsigned long flags; 2810 - int ret; 2811 - 2812 - VM_BUG_ON(from == to); 2813 - VM_BUG_ON_PAGE(PageLRU(page), page); 2814 - /* 2815 - * The page is isolated from LRU. So, collapse function 2816 - * will not handle this page. But page splitting can happen. 2817 - * Do this check under compound_page_lock(). The caller should 2818 - * hold it. 2819 - */ 2820 - ret = -EBUSY; 2821 - if (nr_pages > 1 && !PageTransHuge(page)) 2822 - goto out; 2823 - 2824 - /* 2825 - * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup 2826 - * of its source page while we change it: page migration takes 2827 - * both pages off the LRU, but page cache replacement doesn't. 2828 - */ 2829 - if (!trylock_page(page)) 2830 - goto out; 2831 - 2832 - ret = -EINVAL; 2833 - if (page->mem_cgroup != from) 2834 - goto out_unlock; 2835 - 2836 - spin_lock_irqsave(&from->move_lock, flags); 2837 - 2838 - if (!PageAnon(page) && page_mapped(page)) { 2839 - __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], 2840 - nr_pages); 2841 - __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], 2842 - nr_pages); 2843 - } 2844 - 2845 - if (PageWriteback(page)) { 2846 - __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK], 2847 - nr_pages); 2848 - __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK], 2849 - nr_pages); 2850 - } 2851 - 2852 - /* 2853 - * It is safe to change page->mem_cgroup here because the page 2854 - * is referenced, charged, and isolated - we can't race with 2855 - * uncharging, charging, migration, or LRU putback. 
2856 - */ 2857 - 2858 - /* caller should have done css_get */ 2859 - page->mem_cgroup = to; 2860 - spin_unlock_irqrestore(&from->move_lock, flags); 2861 - 2862 - ret = 0; 2863 - 2864 - local_irq_disable(); 2865 - mem_cgroup_charge_statistics(to, page, nr_pages); 2866 - memcg_check_events(to, page); 2867 - mem_cgroup_charge_statistics(from, page, -nr_pages); 2868 - memcg_check_events(from, page); 2869 - local_irq_enable(); 2870 - out_unlock: 2871 - unlock_page(page); 2872 - out: 2873 - return ret; 2874 - } 2875 2781 2876 2782 #ifdef CONFIG_MEMCG_SWAP 2877 2783 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, ··· 4736 4814 page = find_get_page(mapping, pgoff); 4737 4815 #endif 4738 4816 return page; 4817 + } 4818 + 4819 + /** 4820 + * mem_cgroup_move_account - move account of the page 4821 + * @page: the page 4822 + * @nr_pages: number of regular pages (>1 for huge pages) 4823 + * @from: mem_cgroup which the page is moved from. 4824 + * @to: mem_cgroup which the page is moved to. @from != @to. 4825 + * 4826 + * The caller must confirm following. 4827 + * - page is not on LRU (isolate_page() is useful.) 4828 + * - compound_lock is held when nr_pages > 1 4829 + * 4830 + * This function doesn't do "charge" to new cgroup and doesn't do "uncharge" 4831 + * from old cgroup. 4832 + */ 4833 + static int mem_cgroup_move_account(struct page *page, 4834 + unsigned int nr_pages, 4835 + struct mem_cgroup *from, 4836 + struct mem_cgroup *to) 4837 + { 4838 + unsigned long flags; 4839 + int ret; 4840 + 4841 + VM_BUG_ON(from == to); 4842 + VM_BUG_ON_PAGE(PageLRU(page), page); 4843 + /* 4844 + * The page is isolated from LRU. So, collapse function 4845 + * will not handle this page. But page splitting can happen. 4846 + * Do this check under compound_page_lock(). The caller should 4847 + * hold it. 
4848 + */ 4849 + ret = -EBUSY; 4850 + if (nr_pages > 1 && !PageTransHuge(page)) 4851 + goto out; 4852 + 4853 + /* 4854 + * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup 4855 + * of its source page while we change it: page migration takes 4856 + * both pages off the LRU, but page cache replacement doesn't. 4857 + */ 4858 + if (!trylock_page(page)) 4859 + goto out; 4860 + 4861 + ret = -EINVAL; 4862 + if (page->mem_cgroup != from) 4863 + goto out_unlock; 4864 + 4865 + spin_lock_irqsave(&from->move_lock, flags); 4866 + 4867 + if (!PageAnon(page) && page_mapped(page)) { 4868 + __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], 4869 + nr_pages); 4870 + __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], 4871 + nr_pages); 4872 + } 4873 + 4874 + if (PageWriteback(page)) { 4875 + __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK], 4876 + nr_pages); 4877 + __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK], 4878 + nr_pages); 4879 + } 4880 + 4881 + /* 4882 + * It is safe to change page->mem_cgroup here because the page 4883 + * is referenced, charged, and isolated - we can't race with 4884 + * uncharging, charging, migration, or LRU putback. 4885 + */ 4886 + 4887 + /* caller should have done css_get */ 4888 + page->mem_cgroup = to; 4889 + spin_unlock_irqrestore(&from->move_lock, flags); 4890 + 4891 + ret = 0; 4892 + 4893 + local_irq_disable(); 4894 + mem_cgroup_charge_statistics(to, page, nr_pages); 4895 + memcg_check_events(to, page); 4896 + mem_cgroup_charge_statistics(from, page, -nr_pages); 4897 + memcg_check_events(from, page); 4898 + local_irq_enable(); 4899 + out_unlock: 4900 + unlock_page(page); 4901 + out: 4902 + return ret; 4739 4903 } 4740 4904 4741 4905 static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
+223 -162
mm/memory.c
··· 1983 1983 } 1984 1984 1985 1985 /* 1986 - * This routine handles present pages, when users try to write 1987 - * to a shared page. It is done by copying the page to a new address 1988 - * and decrementing the shared-page counter for the old page. 1986 + * Handle write page faults for pages that can be reused in the current vma 1989 1987 * 1990 - * Note that this routine assumes that the protection checks have been 1991 - * done by the caller (the low-level page fault routine in most cases). 1992 - * Thus we can safely just mark it writable once we've done any necessary 1993 - * COW. 1994 - * 1995 - * We also mark the page dirty at this point even though the page will 1996 - * change only once the write actually happens. This avoids a few races, 1997 - * and potentially makes it more efficient. 1998 - * 1999 - * We enter with non-exclusive mmap_sem (to exclude vma changes, 2000 - * but allow concurrent faults), with pte both mapped and locked. 2001 - * We return with mmap_sem still held, but pte unmapped and unlocked. 1988 + * This can happen either due to the mapping being with the VM_SHARED flag, 1989 + * or due to us being the last reference standing to the page. In either 1990 + * case, all we need to do here is to mark the page as writable and update 1991 + * any related book-keeping. 
2002 1992 */ 2003 - static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, 2004 - unsigned long address, pte_t *page_table, pmd_t *pmd, 2005 - spinlock_t *ptl, pte_t orig_pte) 1993 + static inline int wp_page_reuse(struct mm_struct *mm, 1994 + struct vm_area_struct *vma, unsigned long address, 1995 + pte_t *page_table, spinlock_t *ptl, pte_t orig_pte, 1996 + struct page *page, int page_mkwrite, 1997 + int dirty_shared) 2006 1998 __releases(ptl) 2007 1999 { 2008 - struct page *old_page, *new_page = NULL; 2009 2000 pte_t entry; 2010 - int ret = 0; 2011 - int page_mkwrite = 0; 2012 - bool dirty_shared = false; 2013 - unsigned long mmun_start = 0; /* For mmu_notifiers */ 2014 - unsigned long mmun_end = 0; /* For mmu_notifiers */ 2015 - struct mem_cgroup *memcg; 2016 - 2017 - old_page = vm_normal_page(vma, address, orig_pte); 2018 - if (!old_page) { 2019 - /* 2020 - * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a 2021 - * VM_PFNMAP VMA. 2022 - * 2023 - * We should not cow pages in a shared writeable mapping. 2024 - * Just mark the pages writable as we can't do any dirty 2025 - * accounting on raw pfn maps. 2026 - */ 2027 - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == 2028 - (VM_WRITE|VM_SHARED)) 2029 - goto reuse; 2030 - goto gotten; 2031 - } 2032 - 2033 2001 /* 2034 - * Take out anonymous pages first, anonymous shared vmas are 2035 - * not dirty accountable. 2002 + * Clear the pages cpupid information as the existing 2003 + * information potentially belongs to a now completely 2004 + * unrelated process. 
2036 2005 */ 2037 - if (PageAnon(old_page) && !PageKsm(old_page)) { 2038 - if (!trylock_page(old_page)) { 2039 - page_cache_get(old_page); 2040 - pte_unmap_unlock(page_table, ptl); 2041 - lock_page(old_page); 2042 - page_table = pte_offset_map_lock(mm, pmd, address, 2043 - &ptl); 2044 - if (!pte_same(*page_table, orig_pte)) { 2045 - unlock_page(old_page); 2046 - goto unlock; 2047 - } 2048 - page_cache_release(old_page); 2049 - } 2050 - if (reuse_swap_page(old_page)) { 2051 - /* 2052 - * The page is all ours. Move it to our anon_vma so 2053 - * the rmap code will not search our parent or siblings. 2054 - * Protected against the rmap code by the page lock. 2055 - */ 2056 - page_move_anon_rmap(old_page, vma, address); 2057 - unlock_page(old_page); 2058 - goto reuse; 2059 - } 2060 - unlock_page(old_page); 2061 - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == 2062 - (VM_WRITE|VM_SHARED))) { 2063 - page_cache_get(old_page); 2064 - /* 2065 - * Only catch write-faults on shared writable pages, 2066 - * read-only shared pages can get COWed by 2067 - * get_user_pages(.write=1, .force=1). 2068 - */ 2069 - if (vma->vm_ops && vma->vm_ops->page_mkwrite) { 2070 - int tmp; 2006 + if (page) 2007 + page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1); 2071 2008 2072 - pte_unmap_unlock(page_table, ptl); 2073 - tmp = do_page_mkwrite(vma, old_page, address); 2074 - if (unlikely(!tmp || (tmp & 2075 - (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { 2076 - page_cache_release(old_page); 2077 - return tmp; 2078 - } 2079 - /* 2080 - * Since we dropped the lock we need to revalidate 2081 - * the PTE as someone else may have changed it. If 2082 - * they did, we just return, as we can count on the 2083 - * MMU to tell us if they didn't also make it writable. 
2084 - */ 2085 - page_table = pte_offset_map_lock(mm, pmd, address, 2086 - &ptl); 2087 - if (!pte_same(*page_table, orig_pte)) { 2088 - unlock_page(old_page); 2089 - goto unlock; 2090 - } 2091 - page_mkwrite = 1; 2092 - } 2093 - 2094 - dirty_shared = true; 2095 - 2096 - reuse: 2097 - /* 2098 - * Clear the pages cpupid information as the existing 2099 - * information potentially belongs to a now completely 2100 - * unrelated process. 2101 - */ 2102 - if (old_page) 2103 - page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1); 2104 - 2105 - flush_cache_page(vma, address, pte_pfn(orig_pte)); 2106 - entry = pte_mkyoung(orig_pte); 2107 - entry = maybe_mkwrite(pte_mkdirty(entry), vma); 2108 - if (ptep_set_access_flags(vma, address, page_table, entry,1)) 2109 - update_mmu_cache(vma, address, page_table); 2110 - pte_unmap_unlock(page_table, ptl); 2111 - ret |= VM_FAULT_WRITE; 2112 - 2113 - if (dirty_shared) { 2114 - struct address_space *mapping; 2115 - int dirtied; 2116 - 2117 - if (!page_mkwrite) 2118 - lock_page(old_page); 2119 - 2120 - dirtied = set_page_dirty(old_page); 2121 - VM_BUG_ON_PAGE(PageAnon(old_page), old_page); 2122 - mapping = old_page->mapping; 2123 - unlock_page(old_page); 2124 - page_cache_release(old_page); 2125 - 2126 - if ((dirtied || page_mkwrite) && mapping) { 2127 - /* 2128 - * Some device drivers do not set page.mapping 2129 - * but still dirty their pages 2130 - */ 2131 - balance_dirty_pages_ratelimited(mapping); 2132 - } 2133 - 2134 - if (!page_mkwrite) 2135 - file_update_time(vma->vm_file); 2136 - } 2137 - 2138 - return ret; 2139 - } 2140 - 2141 - /* 2142 - * Ok, we need to copy. Oh, well.. 
2143 - */ 2144 - page_cache_get(old_page); 2145 - gotten: 2009 + flush_cache_page(vma, address, pte_pfn(orig_pte)); 2010 + entry = pte_mkyoung(orig_pte); 2011 + entry = maybe_mkwrite(pte_mkdirty(entry), vma); 2012 + if (ptep_set_access_flags(vma, address, page_table, entry, 1)) 2013 + update_mmu_cache(vma, address, page_table); 2146 2014 pte_unmap_unlock(page_table, ptl); 2015 + 2016 + if (dirty_shared) { 2017 + struct address_space *mapping; 2018 + int dirtied; 2019 + 2020 + if (!page_mkwrite) 2021 + lock_page(page); 2022 + 2023 + dirtied = set_page_dirty(page); 2024 + VM_BUG_ON_PAGE(PageAnon(page), page); 2025 + mapping = page->mapping; 2026 + unlock_page(page); 2027 + page_cache_release(page); 2028 + 2029 + if ((dirtied || page_mkwrite) && mapping) { 2030 + /* 2031 + * Some device drivers do not set page.mapping 2032 + * but still dirty their pages 2033 + */ 2034 + balance_dirty_pages_ratelimited(mapping); 2035 + } 2036 + 2037 + if (!page_mkwrite) 2038 + file_update_time(vma->vm_file); 2039 + } 2040 + 2041 + return VM_FAULT_WRITE; 2042 + } 2043 + 2044 + /* 2045 + * Handle the case of a page which we actually need to copy to a new page. 2046 + * 2047 + * Called with mmap_sem locked and the old page referenced, but 2048 + * without the ptl held. 2049 + * 2050 + * High level logic flow: 2051 + * 2052 + * - Allocate a page, copy the content of the old page to the new one. 2053 + * - Handle book keeping and accounting - cgroups, mmu-notifiers, etc. 2054 + * - Take the PTL. If the pte changed, bail out and release the allocated page 2055 + * - If the pte is still the way we remember it, update the page table and all 2056 + * relevant references. This includes dropping the reference the page-table 2057 + * held to the old page, as well as updating the rmap. 2058 + * - In any case, unlock the PTL and drop the reference we took to the old page. 
2059 + */ 2060 + static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma, 2061 + unsigned long address, pte_t *page_table, pmd_t *pmd, 2062 + pte_t orig_pte, struct page *old_page) 2063 + { 2064 + struct page *new_page = NULL; 2065 + spinlock_t *ptl = NULL; 2066 + pte_t entry; 2067 + int page_copied = 0; 2068 + const unsigned long mmun_start = address & PAGE_MASK; /* For mmu_notifiers */ 2069 + const unsigned long mmun_end = mmun_start + PAGE_SIZE; /* For mmu_notifiers */ 2070 + struct mem_cgroup *memcg; 2147 2071 2148 2072 if (unlikely(anon_vma_prepare(vma))) 2149 2073 goto oom; ··· 2087 2163 if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) 2088 2164 goto oom_free_new; 2089 2165 2090 - mmun_start = address & PAGE_MASK; 2091 - mmun_end = mmun_start + PAGE_SIZE; 2092 2166 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2093 2167 2094 2168 /* ··· 2099 2177 dec_mm_counter_fast(mm, MM_FILEPAGES); 2100 2178 inc_mm_counter_fast(mm, MM_ANONPAGES); 2101 2179 } 2102 - } else 2180 + } else { 2103 2181 inc_mm_counter_fast(mm, MM_ANONPAGES); 2182 + } 2104 2183 flush_cache_page(vma, address, pte_pfn(orig_pte)); 2105 2184 entry = mk_pte(new_page, vma->vm_page_prot); 2106 2185 entry = maybe_mkwrite(pte_mkdirty(entry), vma); ··· 2150 2227 2151 2228 /* Free the old page.. */ 2152 2229 new_page = old_page; 2153 - ret |= VM_FAULT_WRITE; 2154 - } else 2230 + page_copied = 1; 2231 + } else { 2155 2232 mem_cgroup_cancel_charge(new_page, memcg); 2233 + } 2156 2234 2157 2235 if (new_page) 2158 2236 page_cache_release(new_page); 2159 - unlock: 2237 + 2160 2238 pte_unmap_unlock(page_table, ptl); 2161 - if (mmun_end > mmun_start) 2162 - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2239 + mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 2163 2240 if (old_page) { 2164 2241 /* 2165 2242 * Don't let another task, with possibly unlocked vma, 2166 2243 * keep the mlocked page. 
2167 2244 */ 2168 - if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { 2245 + if (page_copied && (vma->vm_flags & VM_LOCKED)) { 2169 2246 lock_page(old_page); /* LRU manipulation */ 2170 2247 munlock_vma_page(old_page); 2171 2248 unlock_page(old_page); 2172 2249 } 2173 2250 page_cache_release(old_page); 2174 2251 } 2175 - return ret; 2252 + return page_copied ? VM_FAULT_WRITE : 0; 2176 2253 oom_free_new: 2177 2254 page_cache_release(new_page); 2178 2255 oom: 2179 2256 if (old_page) 2180 2257 page_cache_release(old_page); 2181 2258 return VM_FAULT_OOM; 2259 + } 2260 + 2261 + static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma, 2262 + unsigned long address, pte_t *page_table, 2263 + pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte, 2264 + struct page *old_page) 2265 + __releases(ptl) 2266 + { 2267 + int page_mkwrite = 0; 2268 + 2269 + page_cache_get(old_page); 2270 + 2271 + /* 2272 + * Only catch write-faults on shared writable pages, 2273 + * read-only shared pages can get COWed by 2274 + * get_user_pages(.write=1, .force=1). 2275 + */ 2276 + if (vma->vm_ops && vma->vm_ops->page_mkwrite) { 2277 + int tmp; 2278 + 2279 + pte_unmap_unlock(page_table, ptl); 2280 + tmp = do_page_mkwrite(vma, old_page, address); 2281 + if (unlikely(!tmp || (tmp & 2282 + (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) { 2283 + page_cache_release(old_page); 2284 + return tmp; 2285 + } 2286 + /* 2287 + * Since we dropped the lock we need to revalidate 2288 + * the PTE as someone else may have changed it. If 2289 + * they did, we just return, as we can count on the 2290 + * MMU to tell us if they didn't also make it writable. 
2291 + */ 2292 + page_table = pte_offset_map_lock(mm, pmd, address, 2293 + &ptl); 2294 + if (!pte_same(*page_table, orig_pte)) { 2295 + unlock_page(old_page); 2296 + pte_unmap_unlock(page_table, ptl); 2297 + page_cache_release(old_page); 2298 + return 0; 2299 + } 2300 + page_mkwrite = 1; 2301 + } 2302 + 2303 + return wp_page_reuse(mm, vma, address, page_table, ptl, 2304 + orig_pte, old_page, page_mkwrite, 1); 2305 + } 2306 + 2307 + /* 2308 + * This routine handles present pages, when users try to write 2309 + * to a shared page. It is done by copying the page to a new address 2310 + * and decrementing the shared-page counter for the old page. 2311 + * 2312 + * Note that this routine assumes that the protection checks have been 2313 + * done by the caller (the low-level page fault routine in most cases). 2314 + * Thus we can safely just mark it writable once we've done any necessary 2315 + * COW. 2316 + * 2317 + * We also mark the page dirty at this point even though the page will 2318 + * change only once the write actually happens. This avoids a few races, 2319 + * and potentially makes it more efficient. 2320 + * 2321 + * We enter with non-exclusive mmap_sem (to exclude vma changes, 2322 + * but allow concurrent faults), with pte both mapped and locked. 2323 + * We return with mmap_sem still held, but pte unmapped and unlocked. 2324 + */ 2325 + static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, 2326 + unsigned long address, pte_t *page_table, pmd_t *pmd, 2327 + spinlock_t *ptl, pte_t orig_pte) 2328 + __releases(ptl) 2329 + { 2330 + struct page *old_page; 2331 + 2332 + old_page = vm_normal_page(vma, address, orig_pte); 2333 + if (!old_page) { 2334 + /* 2335 + * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a 2336 + * VM_PFNMAP VMA. 2337 + * 2338 + * We should not cow pages in a shared writeable mapping. 2339 + * Just mark the pages writable as we can't do any dirty 2340 + * accounting on raw pfn maps. 
2341 + */ 2342 + if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == 2343 + (VM_WRITE|VM_SHARED)) 2344 + return wp_page_reuse(mm, vma, address, page_table, ptl, 2345 + orig_pte, old_page, 0, 0); 2346 + 2347 + pte_unmap_unlock(page_table, ptl); 2348 + return wp_page_copy(mm, vma, address, page_table, pmd, 2349 + orig_pte, old_page); 2350 + } 2351 + 2352 + /* 2353 + * Take out anonymous pages first, anonymous shared vmas are 2354 + * not dirty accountable. 2355 + */ 2356 + if (PageAnon(old_page) && !PageKsm(old_page)) { 2357 + if (!trylock_page(old_page)) { 2358 + page_cache_get(old_page); 2359 + pte_unmap_unlock(page_table, ptl); 2360 + lock_page(old_page); 2361 + page_table = pte_offset_map_lock(mm, pmd, address, 2362 + &ptl); 2363 + if (!pte_same(*page_table, orig_pte)) { 2364 + unlock_page(old_page); 2365 + pte_unmap_unlock(page_table, ptl); 2366 + page_cache_release(old_page); 2367 + return 0; 2368 + } 2369 + page_cache_release(old_page); 2370 + } 2371 + if (reuse_swap_page(old_page)) { 2372 + /* 2373 + * The page is all ours. Move it to our anon_vma so 2374 + * the rmap code will not search our parent or siblings. 2375 + * Protected against the rmap code by the page lock. 2376 + */ 2377 + page_move_anon_rmap(old_page, vma, address); 2378 + unlock_page(old_page); 2379 + return wp_page_reuse(mm, vma, address, page_table, ptl, 2380 + orig_pte, old_page, 0, 0); 2381 + } 2382 + unlock_page(old_page); 2383 + } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == 2384 + (VM_WRITE|VM_SHARED))) { 2385 + return wp_page_shared(mm, vma, address, page_table, pmd, 2386 + ptl, orig_pte, old_page); 2387 + } 2388 + 2389 + /* 2390 + * Ok, we need to copy. Oh, well.. 2391 + */ 2392 + page_cache_get(old_page); 2393 + 2394 + pte_unmap_unlock(page_table, ptl); 2395 + return wp_page_copy(mm, vma, address, page_table, pmd, 2396 + orig_pte, old_page); 2182 2397 } 2183 2398 2184 2399 static void unmap_mapping_range_vma(struct vm_area_struct *vma,
+13 -22
mm/memory_hotplug.c
··· 104 104 105 105 } 106 106 107 - static void mem_hotplug_begin(void) 107 + void mem_hotplug_begin(void) 108 108 { 109 109 mem_hotplug.active_writer = current; 110 110 ··· 119 119 } 120 120 } 121 121 122 - static void mem_hotplug_done(void) 122 + void mem_hotplug_done(void) 123 123 { 124 124 mem_hotplug.active_writer = NULL; 125 125 mutex_unlock(&mem_hotplug.lock); ··· 502 502 end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1); 503 503 504 504 for (i = start_sec; i <= end_sec; i++) { 505 - err = __add_section(nid, zone, i << PFN_SECTION_SHIFT); 505 + err = __add_section(nid, zone, section_nr_to_pfn(i)); 506 506 507 507 /* 508 508 * EEXIST is finally dealt with by ioresource collision ··· 959 959 } 960 960 961 961 962 + /* Must be protected by mem_hotplug_begin() */ 962 963 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type) 963 964 { 964 965 unsigned long flags; ··· 970 969 int ret; 971 970 struct memory_notify arg; 972 971 973 - mem_hotplug_begin(); 974 972 /* 975 973 * This doesn't need a lock to do pfn_to_page(). 
976 974 * The section can't be removed here because of the ··· 977 977 */ 978 978 zone = page_zone(pfn_to_page(pfn)); 979 979 980 - ret = -EINVAL; 981 980 if ((zone_idx(zone) > ZONE_NORMAL || 982 981 online_type == MMOP_ONLINE_MOVABLE) && 983 982 !can_online_high_movable(zone)) 984 - goto out; 983 + return -EINVAL; 985 984 986 985 if (online_type == MMOP_ONLINE_KERNEL && 987 986 zone_idx(zone) == ZONE_MOVABLE) { 988 987 if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages)) 989 - goto out; 988 + return -EINVAL; 990 989 } 991 990 if (online_type == MMOP_ONLINE_MOVABLE && 992 991 zone_idx(zone) == ZONE_MOVABLE - 1) { 993 992 if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages)) 994 - goto out; 993 + return -EINVAL; 995 994 } 996 995 997 996 /* Previous code may changed the zone of the pfn range */ ··· 1006 1007 ret = notifier_to_errno(ret); 1007 1008 if (ret) { 1008 1009 memory_notify(MEM_CANCEL_ONLINE, &arg); 1009 - goto out; 1010 + return ret; 1010 1011 } 1011 1012 /* 1012 1013 * If this zone is not populated, then it is not in zonelist. 
··· 1030 1031 (((unsigned long long) pfn + nr_pages) 1031 1032 << PAGE_SHIFT) - 1); 1032 1033 memory_notify(MEM_CANCEL_ONLINE, &arg); 1033 - goto out; 1034 + return ret; 1034 1035 } 1035 1036 1036 1037 zone->present_pages += onlined_pages; ··· 1060 1061 1061 1062 if (onlined_pages) 1062 1063 memory_notify(MEM_ONLINE, &arg); 1063 - out: 1064 - mem_hotplug_done(); 1065 - return ret; 1064 + return 0; 1066 1065 } 1067 1066 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ 1068 1067 ··· 1685 1688 if (!test_pages_in_a_zone(start_pfn, end_pfn)) 1686 1689 return -EINVAL; 1687 1690 1688 - mem_hotplug_begin(); 1689 - 1690 1691 zone = page_zone(pfn_to_page(start_pfn)); 1691 1692 node = zone_to_nid(zone); 1692 1693 nr_pages = end_pfn - start_pfn; 1693 1694 1694 - ret = -EINVAL; 1695 1695 if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages)) 1696 - goto out; 1696 + return -EINVAL; 1697 1697 1698 1698 /* set above range as isolated */ 1699 1699 ret = start_isolate_page_range(start_pfn, end_pfn, 1700 1700 MIGRATE_MOVABLE, true); 1701 1701 if (ret) 1702 - goto out; 1702 + return ret; 1703 1703 1704 1704 arg.start_pfn = start_pfn; 1705 1705 arg.nr_pages = nr_pages; ··· 1789 1795 writeback_set_ratelimit(); 1790 1796 1791 1797 memory_notify(MEM_OFFLINE, &arg); 1792 - mem_hotplug_done(); 1793 1798 return 0; 1794 1799 1795 1800 failed_removal: ··· 1798 1805 memory_notify(MEM_CANCEL_OFFLINE, &arg); 1799 1806 /* pushback to free area */ 1800 1807 undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); 1801 - 1802 - out: 1803 - mem_hotplug_done(); 1804 1808 return ret; 1805 1809 } 1806 1810 1811 + /* Must be protected by mem_hotplug_begin() */ 1807 1812 int offline_pages(unsigned long start_pfn, unsigned long nr_pages) 1808 1813 { 1809 1814 return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
+4 -2
mm/mempolicy.c
··· 945 945 return alloc_huge_page_node(page_hstate(compound_head(page)), 946 946 node); 947 947 else 948 - return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0); 948 + return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE | 949 + __GFP_THISNODE, 0); 949 950 } 950 951 951 952 /* ··· 1986 1985 nmask = policy_nodemask(gfp, pol); 1987 1986 if (!nmask || node_isset(node, *nmask)) { 1988 1987 mpol_cond_put(pol); 1989 - page = alloc_pages_exact_node(node, gfp, order); 1988 + page = alloc_pages_exact_node(node, 1989 + gfp | __GFP_THISNODE, order); 1990 1990 goto out; 1991 1991 } 1992 1992 }
+6 -4
mm/mempool.c
··· 113 113 * mempool_create(). 114 114 * @new_min_nr: the new minimum number of elements guaranteed to be 115 115 * allocated for this pool. 116 - * @gfp_mask: the usual allocation bitmask. 117 116 * 118 117 * This function shrinks/grows the pool. In the case of growing, 119 118 * it cannot be guaranteed that the pool will be grown to the new 120 119 * size immediately, but new mempool_free() calls will refill it. 120 + * This function may sleep. 121 121 * 122 122 * Note, the caller must guarantee that no mempool_destroy is called 123 123 * while this function is running. mempool_alloc() & mempool_free() 124 124 * might be called (eg. from IRQ contexts) while this function executes. 125 125 */ 126 - int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask) 126 + int mempool_resize(mempool_t *pool, int new_min_nr) 127 127 { 128 128 void *element; 129 129 void **new_elements; 130 130 unsigned long flags; 131 131 132 132 BUG_ON(new_min_nr <= 0); 133 + might_sleep(); 133 134 134 135 spin_lock_irqsave(&pool->lock, flags); 135 136 if (new_min_nr <= pool->min_nr) { ··· 146 145 spin_unlock_irqrestore(&pool->lock, flags); 147 146 148 147 /* Grow the pool */ 149 - new_elements = kmalloc(new_min_nr * sizeof(*new_elements), gfp_mask); 148 + new_elements = kmalloc_array(new_min_nr, sizeof(*new_elements), 149 + GFP_KERNEL); 150 150 if (!new_elements) 151 151 return -ENOMEM; 152 152 ··· 166 164 167 165 while (pool->curr_nr < pool->min_nr) { 168 166 spin_unlock_irqrestore(&pool->lock, flags); 169 - element = pool->alloc(gfp_mask, pool->pool_data); 167 + element = pool->alloc(GFP_KERNEL, pool->pool_data); 170 168 if (!element) 171 169 goto out; 172 170 spin_lock_irqsave(&pool->lock, flags);
+14 -23
mm/migrate.c
··· 901 901 }
902 902
903 903 /*
904 + * gcc 4.7 and 4.8 on arm get ICEs when inlining unmap_and_move(). Work
905 + * around it.
906 + */
907 + #if (GCC_VERSION >= 40700 && GCC_VERSION < 40900) && defined(CONFIG_ARM)
908 + #define ICE_noinline noinline
909 + #else
910 + #define ICE_noinline
911 + #endif
912 +
913 + /*
904 914 * Obtain the lock on page, remove all ptes and migrate the page
905 915 * to the newly allocated page in newpage.
906 916 */
907 - static int unmap_and_move(new_page_t get_new_page, free_page_t put_new_page,
908 - unsigned long private, struct page *page, int force,
909 - enum migrate_mode mode)
917 + static ICE_noinline int unmap_and_move(new_page_t get_new_page,
918 + free_page_t put_new_page,
919 + unsigned long private, struct page *page,
920 + int force, enum migrate_mode mode)
910 921 {
911 922 int rc = 0;
912 923 int *result = NULL;
··· 1565 1554 * page migration rate limiting control.
1566 1555 * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
1567 1556 * window of time. Default here says do not migrate more than 1280M per second.
1568 - * If a node is rate-limited then PTE NUMA updates are also rate-limited. However
1569 - * as it is faults that reset the window, pte updates will happen unconditionally
1570 - * if there has not been a fault since @pteupdate_interval_millisecs after the
1571 - * throttle window closed.
1572 1557 */ 1573 1558 static unsigned int migrate_interval_millisecs __read_mostly = 100; 1574 - static unsigned int pteupdate_interval_millisecs __read_mostly = 1000; 1575 1559 static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT); 1576 - 1577 - /* Returns true if NUMA migration is currently rate limited */ 1578 - bool migrate_ratelimited(int node) 1579 - { 1580 - pg_data_t *pgdat = NODE_DATA(node); 1581 - 1582 - if (time_after(jiffies, pgdat->numabalancing_migrate_next_window + 1583 - msecs_to_jiffies(pteupdate_interval_millisecs))) 1584 - return false; 1585 - 1586 - if (pgdat->numabalancing_migrate_nr_pages < ratelimit_pages) 1587 - return false; 1588 - 1589 - return true; 1590 - } 1591 1560 1592 1561 /* Returns true if the node is migrate rate-limited after the update */ 1593 1562 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
+8 -123
mm/mlock.c
··· 205 205 return nr_pages - 1; 206 206 } 207 207 208 - /** 209 - * __mlock_vma_pages_range() - mlock a range of pages in the vma. 210 - * @vma: target vma 211 - * @start: start address 212 - * @end: end address 213 - * @nonblocking: 214 - * 215 - * This takes care of making the pages present too. 216 - * 217 - * return 0 on success, negative error code on error. 218 - * 219 - * vma->vm_mm->mmap_sem must be held. 220 - * 221 - * If @nonblocking is NULL, it may be held for read or write and will 222 - * be unperturbed. 223 - * 224 - * If @nonblocking is non-NULL, it must held for read only and may be 225 - * released. If it's released, *@nonblocking will be set to 0. 226 - */ 227 - long __mlock_vma_pages_range(struct vm_area_struct *vma, 228 - unsigned long start, unsigned long end, int *nonblocking) 229 - { 230 - struct mm_struct *mm = vma->vm_mm; 231 - unsigned long nr_pages = (end - start) / PAGE_SIZE; 232 - int gup_flags; 233 - 234 - VM_BUG_ON(start & ~PAGE_MASK); 235 - VM_BUG_ON(end & ~PAGE_MASK); 236 - VM_BUG_ON_VMA(start < vma->vm_start, vma); 237 - VM_BUG_ON_VMA(end > vma->vm_end, vma); 238 - VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm); 239 - 240 - gup_flags = FOLL_TOUCH | FOLL_MLOCK; 241 - /* 242 - * We want to touch writable mappings with a write fault in order 243 - * to break COW, except for shared mappings because these don't COW 244 - * and we would not want to dirty them for nothing. 245 - */ 246 - if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) 247 - gup_flags |= FOLL_WRITE; 248 - 249 - /* 250 - * We want mlock to succeed for regions that have any permissions 251 - * other than PROT_NONE. 252 - */ 253 - if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) 254 - gup_flags |= FOLL_FORCE; 255 - 256 - /* 257 - * We made sure addr is within a VMA, so the following will 258 - * not result in a stack expansion that recurses back here. 
259 - */ 260 - return __get_user_pages(current, mm, start, nr_pages, gup_flags, 261 - NULL, NULL, nonblocking); 262 - } 263 - 264 208 /* 265 209 * convert get_user_pages() return value to posix mlock() error 266 210 */ ··· 540 596 /* 541 597 * vm_flags is protected by the mmap_sem held in write mode. 542 598 * It's okay if try_to_unmap_one unmaps a page just after we 543 - * set VM_LOCKED, __mlock_vma_pages_range will bring it back. 599 + * set VM_LOCKED, populate_vma_page_range will bring it back. 544 600 */ 545 601 546 602 if (lock) ··· 604 660 return error; 605 661 } 606 662 607 - /* 608 - * __mm_populate - populate and/or mlock pages within a range of address space. 609 - * 610 - * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap 611 - * flags. VMAs must be already marked with the desired vm_flags, and 612 - * mmap_sem must not be held. 613 - */ 614 - int __mm_populate(unsigned long start, unsigned long len, int ignore_errors) 615 - { 616 - struct mm_struct *mm = current->mm; 617 - unsigned long end, nstart, nend; 618 - struct vm_area_struct *vma = NULL; 619 - int locked = 0; 620 - long ret = 0; 621 - 622 - VM_BUG_ON(start & ~PAGE_MASK); 623 - VM_BUG_ON(len != PAGE_ALIGN(len)); 624 - end = start + len; 625 - 626 - for (nstart = start; nstart < end; nstart = nend) { 627 - /* 628 - * We want to fault in pages for [nstart; end) address range. 629 - * Find first corresponding VMA. 630 - */ 631 - if (!locked) { 632 - locked = 1; 633 - down_read(&mm->mmap_sem); 634 - vma = find_vma(mm, nstart); 635 - } else if (nstart >= vma->vm_end) 636 - vma = vma->vm_next; 637 - if (!vma || vma->vm_start >= end) 638 - break; 639 - /* 640 - * Set [nstart; nend) to intersection of desired address 641 - * range with the first VMA. Also, skip undesirable VMA types. 
642 - */ 643 - nend = min(end, vma->vm_end); 644 - if (vma->vm_flags & (VM_IO | VM_PFNMAP)) 645 - continue; 646 - if (nstart < vma->vm_start) 647 - nstart = vma->vm_start; 648 - /* 649 - * Now fault in a range of pages. __mlock_vma_pages_range() 650 - * double checks the vma flags, so that it won't mlock pages 651 - * if the vma was already munlocked. 652 - */ 653 - ret = __mlock_vma_pages_range(vma, nstart, nend, &locked); 654 - if (ret < 0) { 655 - if (ignore_errors) { 656 - ret = 0; 657 - continue; /* continue at next VMA */ 658 - } 659 - ret = __mlock_posix_error_return(ret); 660 - break; 661 - } 662 - nend = nstart + ret * PAGE_SIZE; 663 - ret = 0; 664 - } 665 - if (locked) 666 - up_read(&mm->mmap_sem); 667 - return ret; /* 0 or negative error code */ 668 - } 669 - 670 663 SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len) 671 664 { 672 665 unsigned long locked; ··· 631 750 error = do_mlock(start, len, 1); 632 751 633 752 up_write(&current->mm->mmap_sem); 634 - if (!error) 635 - error = __mm_populate(start, len, 0); 636 - return error; 753 + if (error) 754 + return error; 755 + 756 + error = __mm_populate(start, len, 0); 757 + if (error) 758 + return __mlock_posix_error_return(error); 759 + return 0; 637 760 } 638 761 639 762 SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+2 -2
mm/mmap.c
··· 2316 2316 if (!prev || expand_stack(prev, addr)) 2317 2317 return NULL; 2318 2318 if (prev->vm_flags & VM_LOCKED) 2319 - __mlock_vma_pages_range(prev, addr, prev->vm_end, NULL); 2319 + populate_vma_page_range(prev, addr, prev->vm_end, NULL); 2320 2320 return prev; 2321 2321 } 2322 2322 #else ··· 2351 2351 if (expand_stack(vma, addr)) 2352 2352 return NULL; 2353 2353 if (vma->vm_flags & VM_LOCKED) 2354 - __mlock_vma_pages_range(vma, addr, start, NULL); 2354 + populate_vma_page_range(vma, addr, start, NULL); 2355 2355 return vma; 2356 2356 } 2357 2357 #endif
+4 -3
mm/oom_kill.c
··· 612 612 * Determines whether the kernel must panic because of the panic_on_oom sysctl. 613 613 */ 614 614 void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 615 - int order, const nodemask_t *nodemask) 615 + int order, const nodemask_t *nodemask, 616 + struct mem_cgroup *memcg) 616 617 { 617 618 if (likely(!sysctl_panic_on_oom)) 618 619 return; ··· 626 625 if (constraint != CONSTRAINT_NONE) 627 626 return; 628 627 } 629 - dump_header(NULL, gfp_mask, order, NULL, nodemask); 628 + dump_header(NULL, gfp_mask, order, memcg, nodemask); 630 629 panic("Out of memory: %s panic_on_oom is enabled\n", 631 630 sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); 632 631 } ··· 741 740 constraint = constrained_alloc(zonelist, gfp_mask, nodemask, 742 741 &totalpages); 743 742 mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL; 744 - check_panic_on_oom(constraint, gfp_mask, order, mpol_mask); 743 + check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL); 745 744 746 745 if (sysctl_oom_kill_allocating_task && current->mm && 747 746 !oom_unkillable_task(current, NULL, nodemask) &&
+19
mm/page-writeback.c
··· 2111 2111 EXPORT_SYMBOL(account_page_dirtied); 2112 2112 2113 2113 /* 2114 + * Helper function for deaccounting dirty page without writeback. 2115 + * 2116 + * Doing this should *normally* only ever be done when a page 2117 + * is truncated, and is not actually mapped anywhere at all. However, 2118 + * fs/buffer.c does this when it notices that somebody has cleaned 2119 + * out all the buffers on a page without actually doing it through 2120 + * the VM. Can you say "ext3 is horribly ugly"? Thought you could. 2121 + */ 2122 + void account_page_cleaned(struct page *page, struct address_space *mapping) 2123 + { 2124 + if (mapping_cap_account_dirty(mapping)) { 2125 + dec_zone_page_state(page, NR_FILE_DIRTY); 2126 + dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE); 2127 + task_io_account_cancelled_write(PAGE_CACHE_SIZE); 2128 + } 2129 + } 2130 + EXPORT_SYMBOL(account_page_cleaned); 2131 + 2132 + /* 2114 2133 * For address_spaces which do not use buffers. Just tag the page as dirty in 2115 2134 * its radix tree. 2116 2135 *
+146 -103
mm/page_alloc.c
··· 1032 1032 static int fallbacks[MIGRATE_TYPES][4] = { 1033 1033 [MIGRATE_UNMOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, 1034 1034 [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE }, 1035 - #ifdef CONFIG_CMA 1036 - [MIGRATE_MOVABLE] = { MIGRATE_CMA, MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE }, 1037 - [MIGRATE_CMA] = { MIGRATE_RESERVE }, /* Never used */ 1038 - #else 1039 1035 [MIGRATE_MOVABLE] = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE }, 1036 + #ifdef CONFIG_CMA 1037 + [MIGRATE_CMA] = { MIGRATE_RESERVE }, /* Never used */ 1040 1038 #endif 1041 1039 [MIGRATE_RESERVE] = { MIGRATE_RESERVE }, /* Never used */ 1042 1040 #ifdef CONFIG_MEMORY_ISOLATION 1043 1041 [MIGRATE_ISOLATE] = { MIGRATE_RESERVE }, /* Never used */ 1044 1042 #endif 1045 1043 }; 1044 + 1045 + #ifdef CONFIG_CMA 1046 + static struct page *__rmqueue_cma_fallback(struct zone *zone, 1047 + unsigned int order) 1048 + { 1049 + return __rmqueue_smallest(zone, order, MIGRATE_CMA); 1050 + } 1051 + #else 1052 + static inline struct page *__rmqueue_cma_fallback(struct zone *zone, 1053 + unsigned int order) { return NULL; } 1054 + #endif 1046 1055 1047 1056 /* 1048 1057 * Move the free pages in a range to the free lists of the requested type. ··· 1145 1136 * as fragmentation caused by those allocations polluting movable pageblocks 1146 1137 * is worse than movable allocations stealing from unmovable and reclaimable 1147 1138 * pageblocks. 1148 - * 1149 - * If we claim more than half of the pageblock, change pageblock's migratetype 1150 - * as well. 1151 1139 */ 1152 - static void try_to_steal_freepages(struct zone *zone, struct page *page, 1153 - int start_type, int fallback_type) 1140 + static bool can_steal_fallback(unsigned int order, int start_mt) 1141 + { 1142 + /* 1143 + * Leaving this order check is intended, although there is 1144 + * relaxed order check in next check. 
The reason is that
1145 + * we can actually steal the whole pageblock if this condition is met,
1146 + * but the check below doesn't guarantee it; it is just a heuristic
1147 + * and so could be changed at any time.
1148 + */
1149 + if (order >= pageblock_order)
1150 + return true;
1151 +
1152 + if (order >= pageblock_order / 2 ||
1153 + start_mt == MIGRATE_RECLAIMABLE ||
1154 + start_mt == MIGRATE_UNMOVABLE ||
1155 + page_group_by_mobility_disabled)
1156 + return true;
1157 +
1158 + return false;
1159 + }
1160 +
1161 + /*
1162 + * This function implements the actual steal behaviour. If the order is large
1163 + * enough, we can steal the whole pageblock. If not, we first move the
1164 + * freepages in this pageblock and check whether half of the pages were moved.
1165 + * If they were, we can change the migratetype of the pageblock and permanently
1166 + * use its pages as the requested migratetype in the future.
1167 + */
1168 + static void steal_suitable_fallback(struct zone *zone, struct page *page,
1169 + int start_type)
1154 1170 {
1155 1171 int current_order = page_order(page);
1172 + int pages;
1156 1173
1157 1174 /* Take ownership for orders >= pageblock_order */
1158 1175 if (current_order >= pageblock_order) {
set_pageblock_migratetype(page, start_type); 1162 + /* 1163 + * Check whether there is a suitable fallback freepage with requested order. 1164 + * If only_stealable is true, this function returns fallback_mt only if 1165 + * we can steal other freepages all together. This would help to reduce 1166 + * fragmentation due to mixed migratetype pages in one pageblock. 1167 + */ 1168 + int find_suitable_fallback(struct free_area *area, unsigned int order, 1169 + int migratetype, bool only_stealable, bool *can_steal) 1170 + { 1171 + int i; 1172 + int fallback_mt; 1173 + 1174 + if (area->nr_free == 0) 1175 + return -1; 1176 + 1177 + *can_steal = false; 1178 + for (i = 0;; i++) { 1179 + fallback_mt = fallbacks[migratetype][i]; 1180 + if (fallback_mt == MIGRATE_RESERVE) 1181 + break; 1182 + 1183 + if (list_empty(&area->free_list[fallback_mt])) 1184 + continue; 1185 + 1186 + if (can_steal_fallback(order, migratetype)) 1187 + *can_steal = true; 1188 + 1189 + if (!only_stealable) 1190 + return fallback_mt; 1191 + 1192 + if (*can_steal) 1193 + return fallback_mt; 1201 1194 } 1195 + 1196 + return -1; 1202 1197 } 1203 1198 1204 1199 /* Remove an element from the buddy allocator from the fallback list */ ··· 1238 1173 struct free_area *area; 1239 1174 unsigned int current_order; 1240 1175 struct page *page; 1176 + int fallback_mt; 1177 + bool can_steal; 1241 1178 1242 1179 /* Find the largest possible block of pages in the other list */ 1243 1180 for (current_order = MAX_ORDER-1; 1244 1181 current_order >= order && current_order <= MAX_ORDER-1; 1245 1182 --current_order) { 1246 - int i; 1247 - for (i = 0;; i++) { 1248 - int migratetype = fallbacks[start_migratetype][i]; 1249 - int buddy_type = start_migratetype; 1183 + area = &(zone->free_area[current_order]); 1184 + fallback_mt = find_suitable_fallback(area, current_order, 1185 + start_migratetype, false, &can_steal); 1186 + if (fallback_mt == -1) 1187 + continue; 1250 1188 1251 - /* MIGRATE_RESERVE handled later if necessary */ 
1252 - if (migratetype == MIGRATE_RESERVE)
1253 - break;
1189 + page = list_entry(area->free_list[fallback_mt].next,
1190 + struct page, lru);
1191 + if (can_steal)
1192 + steal_suitable_fallback(zone, page, start_migratetype);
1254 1193
1255 - area = &(zone->free_area[current_order]);
1256 - if (list_empty(&area->free_list[migratetype]))
1257 - continue;
1194 + /* Remove the page from the freelists */
1195 + area->nr_free--;
1196 + list_del(&page->lru);
1197 + rmv_page_order(page);
1258 1198
1259 - page = list_entry(area->free_list[migratetype].next,
1260 - struct page, lru);
1261 - area->nr_free--;
1199 + expand(zone, page, order, current_order, area,
1200 + start_migratetype);
1201 + /*
1202 + * The freepage_migratetype may differ from pageblock's
1203 + * migratetype depending on the decisions in
1204 + * steal_suitable_fallback(). This is OK as long as it
1205 + * does not differ for MIGRATE_CMA pageblocks. For CMA
1206 + * we need to make sure unallocated pages flushed from
1207 + * pcp lists are returned to the correct freelist.
1208 + */
1209 + set_freepage_migratetype(page, start_migratetype);
1262 1210
1263 - if (!is_migrate_cma(migratetype)) {
1264 - try_to_steal_freepages(zone, page,
1265 - start_migratetype,
1266 - migratetype);
1267 - } else {
1268 - /*
1269 - * When borrowing from MIGRATE_CMA, we need to
1270 - * release the excess buddy pages to CMA
1271 - * itself, and we do not try to steal extra
1272 - * free pages.
1273 - */
1274 - buddy_type = migratetype;
1275 - }
1211 + trace_mm_page_alloc_extfrag(page, order, current_order,
1212 + start_migratetype, fallback_mt);
1276 1213
1277 - /* Remove the page from the freelists */
1278 - list_del(&page->lru);
1279 - rmv_page_order(page);
1280 -
1281 - expand(zone, page, order, current_order, area,
1282 - buddy_type);
1283 -
1284 - /*
1285 - * The freepage_migratetype may differ from pageblock's
1286 - * migratetype depending on the decisions in
1287 - * try_to_steal_freepages().
This is OK as long as it 1288 - * does not differ for MIGRATE_CMA pageblocks. For CMA 1289 - * we need to make sure unallocated pages flushed from 1290 - * pcp lists are returned to the correct freelist. 1291 - */ 1292 - set_freepage_migratetype(page, buddy_type); 1293 - 1294 - trace_mm_page_alloc_extfrag(page, order, current_order, 1295 - start_migratetype, migratetype); 1296 - 1297 - return page; 1298 - } 1214 + return page; 1299 1215 } 1300 1216 1301 1217 return NULL; ··· 1295 1249 page = __rmqueue_smallest(zone, order, migratetype); 1296 1250 1297 1251 if (unlikely(!page) && migratetype != MIGRATE_RESERVE) { 1298 - page = __rmqueue_fallback(zone, order, migratetype); 1252 + if (migratetype == MIGRATE_MOVABLE) 1253 + page = __rmqueue_cma_fallback(zone, order); 1254 + 1255 + if (!page) 1256 + page = __rmqueue_fallback(zone, order, migratetype); 1299 1257 1300 1258 /* 1301 1259 * Use MIGRATE_RESERVE rather than fail an allocation. goto ··· 2412 2362 *did_some_progress = 1; 2413 2363 goto out; 2414 2364 } 2415 - /* 2416 - * GFP_THISNODE contains __GFP_NORETRY and we never hit this. 2417 - * Sanity check for bare calls of __GFP_THISNODE, not real OOM. 2418 - * The caller should handle page allocation failure by itself if 2419 - * it specifies __GFP_THISNODE. 2420 - * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER. 2421 - */ 2365 + /* The OOM killer may not free memory on a specific node */ 2422 2366 if (gfp_mask & __GFP_THISNODE) 2423 2367 goto out; 2424 2368 } ··· 2667 2623 } 2668 2624 2669 2625 /* 2670 - * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and 2671 - * __GFP_NOWARN set) should not cause reclaim since the subsystem 2672 - * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim 2673 - * using a larger set of nodes after it has established that the 2674 - * allowed per node queues are empty and that nodes are 2675 - * over allocated. 2626 + * If this allocation cannot block and it is for a specific node, then 2627 + * fail early. 
There's no need to wakeup kswapd or retry for a 2628 + * speculative node-specific allocation. 2676 2629 */ 2677 - if (IS_ENABLED(CONFIG_NUMA) && 2678 - (gfp_mask & GFP_THISNODE) == GFP_THISNODE) 2630 + if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait) 2679 2631 goto nopage; 2680 2632 2681 2633 retry: ··· 2864 2824 /* 2865 2825 * Check the zones suitable for the gfp_mask contain at least one 2866 2826 * valid zone. It's possible to have an empty zonelist as a result 2867 - * of GFP_THISNODE and a memoryless node 2827 + * of __GFP_THISNODE and a memoryless node 2868 2828 */ 2869 2829 if (unlikely(!zonelist->_zonerefs->zone)) 2870 2830 return NULL; ··· 3241 3201 * Show free area list (used inside shift_scroll-lock stuff) 3242 3202 * We also calculate the percentage fragmentation. We do this by counting the 3243 3203 * memory on each free list with the exception of the first item on the list. 3244 - * Suppresses nodes that are not allowed by current's cpuset if 3245 - * SHOW_MEM_FILTER_NODES is passed. 3204 + * 3205 + * Bits in @filter: 3206 + * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's 3207 + * cpuset. 
3246 3208 */ 3247 3209 void show_free_areas(unsigned int filter) 3248 3210 { 3211 + unsigned long free_pcp = 0; 3249 3212 int cpu; 3250 3213 struct zone *zone; 3251 3214 3252 3215 for_each_populated_zone(zone) { 3253 3216 if (skip_free_areas_node(filter, zone_to_nid(zone))) 3254 3217 continue; 3255 - show_node(zone); 3256 - printk("%s per-cpu:\n", zone->name); 3257 3218 3258 - for_each_online_cpu(cpu) { 3259 - struct per_cpu_pageset *pageset; 3260 - 3261 - pageset = per_cpu_ptr(zone->pageset, cpu); 3262 - 3263 - printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n", 3264 - cpu, pageset->pcp.high, 3265 - pageset->pcp.batch, pageset->pcp.count); 3266 - } 3219 + for_each_online_cpu(cpu) 3220 + free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count; 3267 3221 } 3268 3222 3269 3223 printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n" 3270 3224 " active_file:%lu inactive_file:%lu isolated_file:%lu\n" 3271 - " unevictable:%lu" 3272 - " dirty:%lu writeback:%lu unstable:%lu\n" 3273 - " free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n" 3225 + " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n" 3226 + " slab_reclaimable:%lu slab_unreclaimable:%lu\n" 3274 3227 " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" 3275 - " free_cma:%lu\n", 3228 + " free:%lu free_pcp:%lu free_cma:%lu\n", 3276 3229 global_page_state(NR_ACTIVE_ANON), 3277 3230 global_page_state(NR_INACTIVE_ANON), 3278 3231 global_page_state(NR_ISOLATED_ANON), ··· 3276 3243 global_page_state(NR_FILE_DIRTY), 3277 3244 global_page_state(NR_WRITEBACK), 3278 3245 global_page_state(NR_UNSTABLE_NFS), 3279 - global_page_state(NR_FREE_PAGES), 3280 3246 global_page_state(NR_SLAB_RECLAIMABLE), 3281 3247 global_page_state(NR_SLAB_UNRECLAIMABLE), 3282 3248 global_page_state(NR_FILE_MAPPED), 3283 3249 global_page_state(NR_SHMEM), 3284 3250 global_page_state(NR_PAGETABLE), 3285 3251 global_page_state(NR_BOUNCE), 3252 + global_page_state(NR_FREE_PAGES), 3253 + free_pcp, 3286 3254 global_page_state(NR_FREE_CMA_PAGES)); 
3287 3255 3288 3256 for_each_populated_zone(zone) { ··· 3291 3257 3292 3258 if (skip_free_areas_node(filter, zone_to_nid(zone))) 3293 3259 continue; 3260 + 3261 + free_pcp = 0; 3262 + for_each_online_cpu(cpu) 3263 + free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count; 3264 + 3294 3265 show_node(zone); 3295 3266 printk("%s" 3296 3267 " free:%lukB" ··· 3322 3283 " pagetables:%lukB" 3323 3284 " unstable:%lukB" 3324 3285 " bounce:%lukB" 3286 + " free_pcp:%lukB" 3287 + " local_pcp:%ukB" 3325 3288 " free_cma:%lukB" 3326 3289 " writeback_tmp:%lukB" 3327 3290 " pages_scanned:%lu" ··· 3355 3314 K(zone_page_state(zone, NR_PAGETABLE)), 3356 3315 K(zone_page_state(zone, NR_UNSTABLE_NFS)), 3357 3316 K(zone_page_state(zone, NR_BOUNCE)), 3317 + K(free_pcp), 3318 + K(this_cpu_read(zone->pageset->pcp.count)), 3358 3319 K(zone_page_state(zone, NR_FREE_CMA_PAGES)), 3359 3320 K(zone_page_state(zone, NR_WRITEBACK_TEMP)), 3360 3321 K(zone_page_state(zone, NR_PAGES_SCANNED)), ··· 5760 5717 * value here. 5761 5718 * 5762 5719 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN) 5763 - * deltas controls asynch page reclaim, and so should 5720 + * deltas control asynch page reclaim, and so should 5764 5721 * not be capped for highmem. 5765 5722 */ 5766 5723 unsigned long min_pages;
+18 -4
mm/slab.c
··· 857 857 return NULL; 858 858 } 859 859 860 + static inline gfp_t gfp_exact_node(gfp_t flags) 861 + { 862 + return flags; 863 + } 864 + 860 865 #else /* CONFIG_NUMA */ 861 866 862 867 static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); ··· 1027 1022 return 0; 1028 1023 1029 1024 return __cache_free_alien(cachep, objp, node, page_node); 1025 + } 1026 + 1027 + /* 1028 + * Construct gfp mask to allocate from a specific node but do not invoke reclaim 1029 + * or warn about failures. 1030 + */ 1031 + static inline gfp_t gfp_exact_node(gfp_t flags) 1032 + { 1033 + return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT; 1030 1034 } 1031 1035 #endif 1032 1036 ··· 2839 2825 if (unlikely(!ac->avail)) { 2840 2826 int x; 2841 2827 force_grow: 2842 - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); 2828 + x = cache_grow(cachep, gfp_exact_node(flags), node, NULL); 2843 2829 2844 2830 /* cache_grow can reenable interrupts, then ac could change. */ 2845 2831 ac = cpu_cache_get(cachep); ··· 3033 3019 get_node(cache, nid) && 3034 3020 get_node(cache, nid)->free_objects) { 3035 3021 obj = ____cache_alloc_node(cache, 3036 - flags | GFP_THISNODE, nid); 3022 + gfp_exact_node(flags), nid); 3037 3023 if (obj) 3038 3024 break; 3039 3025 } ··· 3061 3047 nid = page_to_nid(page); 3062 3048 if (cache_grow(cache, flags, nid, page)) { 3063 3049 obj = ____cache_alloc_node(cache, 3064 - flags | GFP_THISNODE, nid); 3050 + gfp_exact_node(flags), nid); 3065 3051 if (!obj) 3066 3052 /* 3067 3053 * Another processor may allocate the ··· 3132 3118 3133 3119 must_grow: 3134 3120 spin_unlock(&n->list_lock); 3135 - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); 3121 + x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL); 3136 3122 if (x) 3137 3123 goto retry; 3138 3124
+1 -2
mm/slob.c
··· 532 532 return 0; 533 533 } 534 534 535 - void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node) 535 + static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node) 536 536 { 537 537 void *b; 538 538 ··· 558 558 kmemleak_alloc_recursive(b, c->size, 1, c->flags, flags); 559 559 return b; 560 560 } 561 - EXPORT_SYMBOL(slob_alloc_node); 562 561 563 562 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) 564 563 {
+13 -15
mm/slub.c
··· 374 374 if (cmpxchg_double(&page->freelist, &page->counters, 375 375 freelist_old, counters_old, 376 376 freelist_new, counters_new)) 377 - return 1; 377 + return true; 378 378 } else 379 379 #endif 380 380 { ··· 384 384 page->freelist = freelist_new; 385 385 set_page_slub_counters(page, counters_new); 386 386 slab_unlock(page); 387 - return 1; 387 + return true; 388 388 } 389 389 slab_unlock(page); 390 390 } ··· 396 396 pr_info("%s %s: cmpxchg double redo ", n, s->name); 397 397 #endif 398 398 399 - return 0; 399 + return false; 400 400 } 401 401 402 402 static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, ··· 410 410 if (cmpxchg_double(&page->freelist, &page->counters, 411 411 freelist_old, counters_old, 412 412 freelist_new, counters_new)) 413 - return 1; 413 + return true; 414 414 } else 415 415 #endif 416 416 { ··· 424 424 set_page_slub_counters(page, counters_new); 425 425 slab_unlock(page); 426 426 local_irq_restore(flags); 427 - return 1; 427 + return true; 428 428 } 429 429 slab_unlock(page); 430 430 local_irq_restore(flags); ··· 437 437 pr_info("%s %s: cmpxchg double redo ", n, s->name); 438 438 #endif 439 439 440 - return 0; 440 + return false; 441 441 } 442 442 443 443 #ifdef CONFIG_SLUB_DEBUG ··· 1137 1137 */ 1138 1138 goto check_slabs; 1139 1139 1140 - if (tolower(*str) == 'o') { 1141 - /* 1142 - * Avoid enabling debugging on caches if its minimum order 1143 - * would increase as a result. 1144 - */ 1145 - disable_higher_order_debug = 1; 1146 - goto out; 1147 - } 1148 - 1149 1140 slub_debug = 0; 1150 1141 if (*str == '-') 1151 1142 /* ··· 1166 1175 break; 1167 1176 case 'a': 1168 1177 slub_debug |= SLAB_FAILSLAB; 1178 + break; 1179 + case 'o': 1180 + /* 1181 + * Avoid enabling debugging on caches if its minimum 1182 + * order would increase as a result. 1183 + */ 1184 + disable_higher_order_debug = 1; 1169 1185 break; 1170 1186 default: 1171 1187 pr_err("slub_debug option '%c' unknown. skipped\n",
+7 -30
mm/truncate.c
··· 93 93 } 94 94 95 95 /* 96 - * This cancels just the dirty bit on the kernel page itself, it 97 - * does NOT actually remove dirty bits on any mmap's that may be 98 - * around. It also leaves the page tagged dirty, so any sync 99 - * activity will still find it on the dirty lists, and in particular, 100 - * clear_page_dirty_for_io() will still look at the dirty bits in 101 - * the VM. 102 - * 103 - * Doing this should *normally* only ever be done when a page 104 - * is truncated, and is not actually mapped anywhere at all. However, 105 - * fs/buffer.c does this when it notices that somebody has cleaned 106 - * out all the buffers on a page without actually doing it through 107 - * the VM. Can you say "ext3 is horribly ugly"? Tought you could. 108 - */ 109 - void cancel_dirty_page(struct page *page, unsigned int account_size) 110 - { 111 - if (TestClearPageDirty(page)) { 112 - struct address_space *mapping = page->mapping; 113 - if (mapping && mapping_cap_account_dirty(mapping)) { 114 - dec_zone_page_state(page, NR_FILE_DIRTY); 115 - dec_bdi_stat(inode_to_bdi(mapping->host), 116 - BDI_RECLAIMABLE); 117 - if (account_size) 118 - task_io_account_cancelled_write(account_size); 119 - } 120 - } 121 - } 122 - EXPORT_SYMBOL(cancel_dirty_page); 123 - 124 - /* 125 96 * If truncate cannot remove the fs-private metadata from the page, the page 126 97 * becomes orphaned. It will be left on the LRU and may even be mapped into 127 98 * user pagetables if we're racing with filemap_fault(). ··· 111 140 if (page_has_private(page)) 112 141 do_invalidatepage(page, 0, PAGE_CACHE_SIZE); 113 142 114 - cancel_dirty_page(page, PAGE_CACHE_SIZE); 143 + /* 144 + * Some filesystems seem to re-dirty the page even after 145 + * the VM has canceled the dirty bit (eg ext3 journaling). 146 + * Hence dirty accounting check is placed after invalidation. 
147 + */ 148 + if (TestClearPageDirty(page)) 149 + account_page_cleaned(page, mapping); 115 150 116 151 ClearPageMappedToDisk(page); 117 152 delete_from_page_cache(page);
+7 -1
mm/vmalloc.c
··· 29 29 #include <linux/atomic.h> 30 30 #include <linux/compiler.h> 31 31 #include <linux/llist.h> 32 + #include <linux/bitops.h> 32 33 33 34 #include <asm/uaccess.h> 34 35 #include <asm/tlbflush.h> ··· 75 74 pmd = pmd_offset(pud, addr); 76 75 do { 77 76 next = pmd_addr_end(addr, end); 77 + if (pmd_clear_huge(pmd)) 78 + continue; 78 79 if (pmd_none_or_clear_bad(pmd)) 79 80 continue; 80 81 vunmap_pte_range(pmd, addr, next); ··· 91 88 pud = pud_offset(pgd, addr); 92 89 do { 93 90 next = pud_addr_end(addr, end); 91 + if (pud_clear_huge(pud)) 92 + continue; 94 93 if (pud_none_or_clear_bad(pud)) 95 94 continue; 96 95 vunmap_pmd_range(pud, addr, next); ··· 1319 1314 1320 1315 BUG_ON(in_interrupt()); 1321 1316 if (flags & VM_IOREMAP) 1322 - align = 1ul << clamp(fls(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); 1317 + align = 1ul << clamp_t(int, fls_long(size), 1318 + PAGE_SHIFT, IOREMAP_MAX_ORDER); 1323 1319 1324 1320 size = PAGE_ALIGN(size); 1325 1321 if (unlikely(!size))
+3 -1
net/openvswitch/flow.c
··· 100 100 101 101 new_stats = 102 102 kmem_cache_alloc_node(flow_stats_cache, 103 - GFP_THISNODE | 103 + GFP_NOWAIT | 104 + __GFP_THISNODE | 105 + __GFP_NOWARN | 104 106 __GFP_NOMEMALLOC, 105 107 node); 106 108 if (likely(new_stats)) {
+1 -1
scripts/coccinelle/misc/bugon.cocci
··· 57 57 p << r.p; 58 58 @@ 59 59 60 - msg="WARNING: Use BUG_ON" 60 + msg="WARNING: Use BUG_ON instead of if condition followed by BUG.\nPlease make sure the condition has no side effects (see conditional BUG_ON definition in include/asm-generic/bug.h)" 61 61 coccilib.report.print_report(p[0], msg) 62 62