Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (second patch-bomb from Andrew)

Merge second patchbomb from Andrew Morton:
- the rest of MM
- misc fs fixes
- add execveat() syscall
- new ratelimit feature for fault-injection
- decompressor updates
- ipc/ updates
- fallocate feature creep
- fsnotify cleanups
- a few other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (99 commits)
cgroups: Documentation: fix trivial typos and wrong paragraph numberings
parisc: percpu: update comments referring to __get_cpu_var
percpu: update local_ops.txt to reflect this_cpu operations
percpu: remove __get_cpu_var and __raw_get_cpu_var macros
fsnotify: remove destroy_list from fsnotify_mark
fsnotify: unify inode and mount marks handling
fallocate: create FAN_MODIFY and IN_MODIFY events
mm/cma: make kmemleak ignore CMA regions
slub: fix cpuset check in get_any_partial
slab: fix cpuset check in fallback_alloc
shmdt: use i_size_read() instead of ->i_size
ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
ipc/msg: increase MSGMNI, remove scaling
ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
lib/decompress.c: consistency of compress formats for kernel image
decompress_bunzip2: off by one in get_next_block()
usr/Kconfig: make initrd compression algorithm selection not expert
fault-inject: add ratelimit option
ratelimit: add initialization macro
...

+3208 -1363
+3 -3
Documentation/cgroups/cpusets.txt
··· 445 445 that would be beyond our understanding. So if each of two partially 446 446 overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we 447 447 form a single sched domain that is a superset of both. We won't move 448 - a task to a CPU outside it cpuset, but the scheduler load balancing 448 + a task to a CPU outside its cpuset, but the scheduler load balancing 449 449 code might waste some compute cycles considering that possibility. 450 450 451 451 This mismatch is why there is not a simple one-to-one relation ··· 552 552 1 : search siblings (hyperthreads in a core). 553 553 2 : search cores in a package. 554 554 3 : search cpus in a node [= system wide on non-NUMA system] 555 - ( 4 : search nodes in a chunk of node [on NUMA system] ) 556 - ( 5 : search system wide [on NUMA system] ) 555 + 4 : search nodes in a chunk of node [on NUMA system] 556 + 5 : search system wide [on NUMA system] 557 557 558 558 The system default is architecture dependent. The system default 559 559 can be changed using the relax_domain_level= boot parameter.
+4 -4
Documentation/cgroups/memory.txt
··· 326 326 327 327 * tcp memory pressure: sockets memory pressure for the tcp protocol. 328 328 329 - 2.7.3 Common use cases 329 + 2.7.2 Common use cases 330 330 331 331 Because the "kmem" counter is fed to the main user counter, kernel memory can 332 332 never be limited completely independently of user memory. Say "U" is the user ··· 354 354 355 355 3. User Interface 356 356 357 - 0. Configuration 357 + 3.0. Configuration 358 358 359 359 a. Enable CONFIG_CGROUPS 360 360 b. Enable CONFIG_MEMCG 361 361 c. Enable CONFIG_MEMCG_SWAP (to use swap extension) 362 362 d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 363 363 364 - 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) 364 + 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) 365 365 # mount -t tmpfs none /sys/fs/cgroup 366 366 # mkdir /sys/fs/cgroup/memory 367 367 # mount -t cgroup none /sys/fs/cgroup/memory -o memory 368 368 369 - 2. Make the new group and move bash into it 369 + 3.2. Make the new group and move bash into it 370 370 # mkdir /sys/fs/cgroup/memory/0 371 371 # echo $$ > /sys/fs/cgroup/memory/0/tasks 372 372
+16 -3
Documentation/kernel-parameters.txt
··· 829 829 CONFIG_DEBUG_PAGEALLOC, hence this option will not help 830 830 tracking down these problems. 831 831 832 + debug_pagealloc= 833 + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this 834 + parameter enables the feature at boot time. In 835 + default, it is disabled. We can avoid allocating huge 836 + chunk of memory for debug pagealloc if we don't enable 837 + it at boot time and the system will work mostly same 838 + with the kernel built without CONFIG_DEBUG_PAGEALLOC. 839 + on: enable the feature 840 + 832 841 debugpat [X86] Enable PAT debugging 833 842 834 843 decnet.addr= [HW,NET] ··· 1237 1228 multiple times interleaved with hugepages= to reserve 1238 1229 huge pages of different sizes. Valid pages sizes on 1239 1230 x86-64 are 2M (when the CPU supports "pse") and 1G 1240 - (when the CPU supports the "pdpe1gb" cpuinfo flag) 1241 - Note that 1GB pages can only be allocated at boot time 1242 - using hugepages= and not freed afterwards. 1231 + (when the CPU supports the "pdpe1gb" cpuinfo flag). 1243 1232 1244 1233 hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) 1245 1234 terminal devices. Valid values: 0..8 ··· 2512 2505 2513 2506 OSS [HW,OSS] 2514 2507 See Documentation/sound/oss/oss-parameters.txt 2508 + 2509 + page_owner= [KNL] Boot-time page_owner enabling option. 2510 + Storage of the information about who allocated 2511 + each page is disabled in default. With this switch, 2512 + we can turn it on. 2513 + on: enable the feature 2515 2514 2516 2515 panic= [KNL] Kernel behaviour on panic: delay <timeout> 2517 2516 timeout > 0: seconds before rebooting
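Both new switches documented in this hunk (`debug_pagealloc=` and `page_owner=`) only take effect from the kernel boot command line. A hypothetical way to enable them on a GRUB-based distro — the file path and variable name follow common distro convention and are not part of this patch:

```shell
# Hypothetical /etc/default/grub fragment: turn on the two new boot-time
# debug switches from this hunk. Both default to off; enabling them has
# the memory costs described above.
GRUB_CMDLINE_LINUX="debug_pagealloc=on page_owner=on"
```

After editing, regenerate the GRUB config (e.g. `update-grub` or `grub2-mkconfig`, depending on the distro) and reboot for the parameters to apply.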
+9 -4
Documentation/local_ops.txt
··· 8 8 properly. It also stresses on the precautions that must be taken when reading 9 9 those local variables across CPUs when the order of memory writes matters. 10 10 11 + Note that local_t based operations are not recommended for general kernel use. 12 + Please use the this_cpu operations instead unless there is really a special purpose. 13 + Most uses of local_t in the kernel have been replaced by this_cpu operations. 14 + this_cpu operations combine the relocation with the local_t like semantics in 15 + a single instruction and yield more compact and faster executing code. 11 16 12 17 13 18 * Purpose of local atomic operations ··· 92 87 local_inc(&get_cpu_var(counters)); 93 88 put_cpu_var(counters); 94 89 95 - If you are already in a preemption-safe context, you can directly use 96 - __get_cpu_var() instead. 90 + If you are already in a preemption-safe context, you can use 91 + this_cpu_ptr() instead. 97 92 98 - local_inc(&__get_cpu_var(counters)); 93 + local_inc(this_cpu_ptr(&counters)); 99 94 100 95 101 96 ··· 139 134 { 140 135 /* Increment the counter from a non preemptible context */ 141 136 printk("Increment on cpu %d\n", smp_processor_id()); 142 - local_inc(&__get_cpu_var(counters)); 137 + local_inc(this_cpu_ptr(&counters)); 143 138 144 139 /* This is what incrementing the variable would look like within a 145 140 * preemptible context (it disables preemption) :
+6 -4
Documentation/sysctl/kernel.txt
··· 116 116 117 117 auto_msgmni: 118 118 119 - Enables/Disables automatic recomputing of msgmni upon memory add/remove 120 - or upon ipc namespace creation/removal (see the msgmni description 121 - above). Echoing "1" into this file enables msgmni automatic recomputing. 122 - Echoing "0" turns it off. auto_msgmni default value is 1. 119 + This variable has no effect and may be removed in future kernel 120 + releases. Reading it always returns 0. 121 + Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni 122 + upon memory add/remove or upon ipc namespace creation/removal. 123 + Echoing "1" into this file enabled msgmni automatic recomputing. 124 + Echoing "0" turned it off. auto_msgmni default value was 1. 123 125 124 126 125 127 ==============================================================
+81
Documentation/vm/page_owner.txt
··· 1 + page owner: Tracking about who allocated each page 2 + ----------------------------------------------------------- 3 + 4 + * Introduction 5 + 6 + page owner is for the tracking about who allocated each page. 7 + It can be used to debug memory leak or to find a memory hogger. 8 + When allocation happens, information about allocation such as call stack 9 + and order of pages is stored into certain storage for each page. 10 + When we need to know about status of all pages, we can get and analyze 11 + this information. 12 + 13 + Although we already have tracepoint for tracing page allocation/free, 14 + using it for analyzing who allocate each page is rather complex. We need 15 + to enlarge the trace buffer for preventing overlapping until userspace 16 + program launched. And, launched program continually dump out the trace 17 + buffer for later analysis and it would change system behviour with more 18 + possibility rather than just keeping it in memory, so bad for debugging. 19 + 20 + page owner can also be used for various purposes. For example, accurate 21 + fragmentation statistics can be obtained through gfp flag information of 22 + each page. It is already implemented and activated if page owner is 23 + enabled. Other usages are more than welcome. 24 + 25 + page owner is disabled in default. So, if you'd like to use it, you need 26 + to add "page_owner=on" into your boot cmdline. If the kernel is built 27 + with page owner and page owner is disabled in runtime due to no enabling 28 + boot option, runtime overhead is marginal. If disabled in runtime, it 29 + doesn't require memory to store owner information, so there is no runtime 30 + memory overhead. And, page owner inserts just two unlikely branches into 31 + the page allocator hotpath and if it returns false then allocation is 32 + done like as the kernel without page owner. These two unlikely branches 33 + would not affect to allocation performance. Following is the kernel's 34 + code size change due to this facility. 35 + 36 + - Without page owner 37 + text data bss dec hex filename 38 + 40662 1493 644 42799 a72f mm/page_alloc.o 39 + 40 + - With page owner 41 + text data bss dec hex filename 42 + 40892 1493 644 43029 a815 mm/page_alloc.o 43 + 1427 24 8 1459 5b3 mm/page_ext.o 44 + 2722 50 0 2772 ad4 mm/page_owner.o 45 + 46 + Although, roughly, 4 KB code is added in total, page_alloc.o increase by 47 + 230 bytes and only half of it is in hotpath. Building the kernel with 48 + page owner and turning it on if needed would be great option to debug 49 + kernel memory problem. 50 + 51 + There is one notice that is caused by implementation detail. page owner 52 + stores information into the memory from struct page extension. This memory 53 + is initialized some time later than that page allocator starts in sparse 54 + memory system, so, until initialization, many pages can be allocated and 55 + they would have no owner information. To fix it up, these early allocated 56 + pages are investigated and marked as allocated in initialization phase. 57 + Although it doesn't mean that they have the right owner information, 58 + at least, we can tell whether the page is allocated or not, 59 + more accurately. On 2GB memory x86-64 VM box, 13343 early allocated pages 60 + are catched and marked, although they are mostly allocated from struct 61 + page extension feature. Anyway, after that, no page is left in 62 + un-tracking state. 63 + 64 + * Usage 65 + 66 + 1) Build user-space helper 67 + cd tools/vm 68 + make page_owner_sort 69 + 70 + 2) Enable page owner 71 + Add "page_owner=on" to boot cmdline. 72 + 73 + 3) Do the job what you want to debug 74 + 75 + 4) Analyze information from page owner 76 + cat /sys/kernel/debug/page_owner > page_owner_full.txt 77 + grep -v ^PFN page_owner_full.txt > page_owner.txt 78 + ./page_owner_sort page_owner.txt sorted_page_owner.txt 79 + 80 + See the result about who allocated each page 81 + in the sorted_page_owner.txt.
+1 -1
MAINTAINERS
··· 4045 4045 FREESCALE SOC SOUND DRIVERS 4046 4046 M: Timur Tabi <timur@tabi.org> 4047 4047 M: Nicolin Chen <nicoleotsuka@gmail.com> 4048 - M: Xiubo Li <Li.Xiubo@freescale.com> 4048 + M: Xiubo Li <Xiubo.Lee@gmail.com> 4049 4049 L: alsa-devel@alsa-project.org (moderated for non-subscribers) 4050 4050 L: linuxppc-dev@lists.ozlabs.org 4051 4051 S: Maintained
+1
arch/arm/Kconfig
··· 5 5 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE 6 6 select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST 7 7 select ARCH_HAVE_CUSTOM_GPIO_H 8 + select ARCH_HAS_GCOV_PROFILE_ALL 8 9 select ARCH_MIGHT_HAVE_PC_PARPORT 9 10 select ARCH_SUPPORTS_ATOMIC_RMW 10 11 select ARCH_USE_BUILTIN_BSWAP
+1
arch/arm64/Kconfig
··· 2 2 def_bool y 3 3 select ARCH_BINFMT_ELF_RANDOMIZE_PIE 4 4 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE 5 + select ARCH_HAS_GCOV_PROFILE_ALL 5 6 select ARCH_HAS_SG_CHAIN 6 7 select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST 7 8 select ARCH_USE_CMPXCHG_LOCKREF
+1
arch/microblaze/Kconfig
··· 1 1 config MICROBLAZE 2 2 def_bool y 3 + select ARCH_HAS_GCOV_PROFILE_ALL 3 4 select ARCH_MIGHT_HAVE_PC_PARPORT 4 5 select ARCH_WANT_IPC_PARSE_VERSION 5 6 select ARCH_WANT_OPTIONAL_GPIOLIB
+2 -2
arch/parisc/lib/fixup.S
··· 38 38 LDREGX \t2(\t1),\t2 39 39 addil LT%exception_data,%r27 40 40 LDREG RT%exception_data(%r1),\t1 41 - /* t1 = &__get_cpu_var(exception_data) */ 41 + /* t1 = this_cpu_ptr(&exception_data) */ 42 42 add,l \t1,\t2,\t1 43 43 /* t1 = t1->fault_ip */ 44 44 LDREG EXCDATA_IP(\t1), \t1 45 45 .endm 46 46 #else 47 47 .macro get_fault_ip t1 t2 48 - /* t1 = &__get_cpu_var(exception_data) */ 48 + /* t1 = this_cpu_ptr(&exception_data) */ 49 49 addil LT%exception_data,%r27 50 50 LDREG RT%exception_data(%r1),\t2 51 51 /* t1 = t2->fault_ip */
+1
arch/powerpc/Kconfig
··· 129 129 select HAVE_BPF_JIT if PPC64 130 130 select HAVE_ARCH_JUMP_LABEL 131 131 select ARCH_HAVE_NMI_SAFE_CMPXCHG 132 + select ARCH_HAS_GCOV_PROFILE_ALL 132 133 select GENERIC_SMP_IDLE_THREAD 133 134 select GENERIC_CMOS_UPDATE 134 135 select GENERIC_TIME_VSYSCALL_OLD
+1 -1
arch/powerpc/mm/hash_utils_64.c
··· 1514 1514 mmu_kernel_ssize, 0); 1515 1515 } 1516 1516 1517 - void kernel_map_pages(struct page *page, int numpages, int enable) 1517 + void __kernel_map_pages(struct page *page, int numpages, int enable) 1518 1518 { 1519 1519 unsigned long flags, vaddr, lmi; 1520 1520 int i;
+1 -1
arch/powerpc/mm/pgtable_32.c
··· 429 429 } 430 430 431 431 432 - void kernel_map_pages(struct page *page, int numpages, int enable) 432 + void __kernel_map_pages(struct page *page, int numpages, int enable) 433 433 { 434 434 if (PageHighMem(page)) 435 435 return;
+1
arch/s390/Kconfig
··· 65 65 def_bool y 66 66 select ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE 67 67 select ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS 68 + select ARCH_HAS_GCOV_PROFILE_ALL 68 69 select ARCH_HAVE_NMI_SAFE_CMPXCHG 69 70 select ARCH_INLINE_READ_LOCK 70 71 select ARCH_INLINE_READ_LOCK_BH
+1 -1
arch/s390/mm/pageattr.c
··· 120 120 } 121 121 } 122 122 123 - void kernel_map_pages(struct page *page, int numpages, int enable) 123 + void __kernel_map_pages(struct page *page, int numpages, int enable) 124 124 { 125 125 unsigned long address; 126 126 int nr, i, j;
+1
arch/sh/Kconfig
··· 16 16 select HAVE_DEBUG_BUGVERBOSE 17 17 select ARCH_HAVE_CUSTOM_GPIO_H 18 18 select ARCH_HAVE_NMI_SAFE_CMPXCHG if (GUSA_RB || CPU_SH4A) 19 + select ARCH_HAS_GCOV_PROFILE_ALL 19 20 select PERF_USE_VMALLOC 20 21 select HAVE_DEBUG_KMEMLEAK 21 22 select HAVE_KERNEL_GZIP
+2 -1
arch/sparc/include/uapi/asm/unistd.h
··· 415 415 #define __NR_getrandom 347 416 416 #define __NR_memfd_create 348 417 417 #define __NR_bpf 349 418 + #define __NR_execveat 350 418 419 419 - #define NR_syscalls 350 420 + #define NR_syscalls 351 420 421 421 422 /* Bitmask values returned from kern_features system call. */ 422 423 #define KERN_FEATURE_MIXED_MODE_STACK 0x00000001
+10
arch/sparc/kernel/syscalls.S
··· 6 6 jmpl %g1, %g0 7 7 flushw 8 8 9 + sys64_execveat: 10 + set sys_execveat, %g1 11 + jmpl %g1, %g0 12 + flushw 13 + 9 14 #ifdef CONFIG_COMPAT 10 15 sunos_execv: 11 16 mov %g0, %o2 12 17 sys32_execve: 13 18 set compat_sys_execve, %g1 19 + jmpl %g1, %g0 20 + flushw 21 + 22 + sys32_execveat: 23 + set compat_sys_execveat, %g1 14 24 jmpl %g1, %g0 15 25 flushw 16 26 #endif
+1
arch/sparc/kernel/systbls_32.S
··· 87 87 /*335*/ .long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev 88 88 /*340*/ .long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr 89 89 /*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf 90 + /*350*/ .long sys_execveat
+2
arch/sparc/kernel/systbls_64.S
··· 88 88 .word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev 89 89 /*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr 90 90 .word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf 91 + /*350*/ .word sys32_execveat 91 92 92 93 #endif /* CONFIG_COMPAT */ 93 94 ··· 168 167 .word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev 169 168 /*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr 170 169 .word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf 170 + /*350*/ .word sys64_execveat
+1 -1
arch/sparc/mm/init_64.c
··· 1621 1621 } 1622 1622 1623 1623 #ifdef CONFIG_DEBUG_PAGEALLOC 1624 - void kernel_map_pages(struct page *page, int numpages, int enable) 1624 + void __kernel_map_pages(struct page *page, int numpages, int enable) 1625 1625 { 1626 1626 unsigned long phys_start = page_to_pfn(page) << PAGE_SHIFT; 1627 1627 unsigned long phys_end = phys_start + (numpages * PAGE_SIZE);
+1
arch/x86/Kconfig
··· 24 24 select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI 25 25 select ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS 26 26 select ARCH_HAS_FAST_MULTIPLIER 27 + select ARCH_HAS_GCOV_PROFILE_ALL 27 28 select ARCH_MIGHT_HAVE_PC_PARPORT 28 29 select ARCH_MIGHT_HAVE_PC_SERIO 29 30 select HAVE_AOUT if X86_32
+1
arch/x86/ia32/audit.c
··· 35 35 case __NR_socketcall: 36 36 return 4; 37 37 case __NR_execve: 38 + case __NR_execveat: 38 39 return 5; 39 40 default: 40 41 return 1;
+1
arch/x86/ia32/ia32entry.S
··· 480 480 PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn 481 481 PTREGSCALL stub32_sigreturn, sys32_sigreturn 482 482 PTREGSCALL stub32_execve, compat_sys_execve 483 + PTREGSCALL stub32_execveat, compat_sys_execveat 483 484 PTREGSCALL stub32_fork, sys_fork 484 485 PTREGSCALL stub32_vfork, sys_vfork 485 486
+1
arch/x86/kernel/audit_64.c
··· 50 50 case __NR_openat: 51 51 return 3; 52 52 case __NR_execve: 53 + case __NR_execveat: 53 54 return 5; 54 55 default: 55 56 return 0;
+28
arch/x86/kernel/entry_64.S
··· 652 652 CFI_ENDPROC 653 653 END(stub_execve) 654 654 655 + ENTRY(stub_execveat) 656 + CFI_STARTPROC 657 + addq $8, %rsp 658 + PARTIAL_FRAME 0 659 + SAVE_REST 660 + FIXUP_TOP_OF_STACK %r11 661 + call sys_execveat 662 + RESTORE_TOP_OF_STACK %r11 663 + movq %rax,RAX(%rsp) 664 + RESTORE_REST 665 + jmp int_ret_from_sys_call 666 + CFI_ENDPROC 667 + END(stub_execveat) 668 + 655 669 /* 656 670 * sigreturn is special because it needs to restore all registers on return. 657 671 * This cannot be done with SYSRET, so use the IRET return path instead. ··· 710 696 jmp int_ret_from_sys_call 711 697 CFI_ENDPROC 712 698 END(stub_x32_execve) 699 + 700 + ENTRY(stub_x32_execveat) 701 + CFI_STARTPROC 702 + addq $8, %rsp 703 + PARTIAL_FRAME 0 704 + SAVE_REST 705 + FIXUP_TOP_OF_STACK %r11 706 + call compat_sys_execveat 707 + RESTORE_TOP_OF_STACK %r11 708 + movq %rax,RAX(%rsp) 709 + RESTORE_REST 710 + jmp int_ret_from_sys_call 711 + CFI_ENDPROC 712 + END(stub_x32_execveat) 713 713 714 714 #endif 715 715
+1 -1
arch/x86/mm/pageattr.c
··· 1817 1817 return __change_page_attr_set_clr(&cpa, 0); 1818 1818 } 1819 1819 1820 - void kernel_map_pages(struct page *page, int numpages, int enable) 1820 + void __kernel_map_pages(struct page *page, int numpages, int enable) 1821 1821 { 1822 1822 if (PageHighMem(page)) 1823 1823 return;
+1
arch/x86/syscalls/syscall_32.tbl
··· 364 364 355 i386 getrandom sys_getrandom 365 365 356 i386 memfd_create sys_memfd_create 366 366 357 i386 bpf sys_bpf 367 + 358 i386 execveat sys_execveat stub32_execveat
+2
arch/x86/syscalls/syscall_64.tbl
··· 328 328 319 common memfd_create sys_memfd_create 329 329 320 common kexec_file_load sys_kexec_file_load 330 330 321 common bpf sys_bpf 331 + 322 64 execveat stub_execveat 331 332 332 333 # 333 334 # x32-specific system call numbers start at 512 to avoid cache impact ··· 367 366 542 x32 getsockopt compat_sys_getsockopt 368 367 543 x32 io_setup compat_sys_io_setup 369 368 544 x32 io_submit compat_sys_io_submit 369 + 545 x32 execveat stub_x32_execveat
+1
arch/x86/um/sys_call_table_64.c
··· 31 31 #define stub_fork sys_fork 32 32 #define stub_vfork sys_vfork 33 33 #define stub_execve sys_execve 34 + #define stub_execveat sys_execveat 34 35 #define stub_rt_sigreturn sys_rt_sigreturn 35 36 36 37 #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
+2 -2
drivers/base/memory.c
··· 228 228 struct page *first_page; 229 229 int ret; 230 230 231 - first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT); 232 - start_pfn = page_to_pfn(first_page); 231 + start_pfn = phys_index << PFN_SECTION_SHIFT; 232 + first_page = pfn_to_page(start_pfn); 233 233 234 234 switch (action) { 235 235 case MEM_ONLINE:
+71 -33
drivers/block/zram/zram_drv.c
··· 44 44 static unsigned int num_devices = 1; 45 45 46 46 #define ZRAM_ATTR_RO(name) \ 47 - static ssize_t zram_attr_##name##_show(struct device *d, \ 47 + static ssize_t name##_show(struct device *d, \ 48 48 struct device_attribute *attr, char *b) \ 49 49 { \ 50 50 struct zram *zram = dev_to_zram(d); \ 51 51 return scnprintf(b, PAGE_SIZE, "%llu\n", \ 52 52 (u64)atomic64_read(&zram->stats.name)); \ 53 53 } \ 54 - static struct device_attribute dev_attr_##name = \ 55 - __ATTR(name, S_IRUGO, zram_attr_##name##_show, NULL); 54 + static DEVICE_ATTR_RO(name); 56 55 57 56 static inline int init_done(struct zram *zram) 58 57 { ··· 286 287 /* 287 288 * Check if request is within bounds and aligned on zram logical blocks. 288 289 */ 289 - static inline int valid_io_request(struct zram *zram, struct bio *bio) 290 + static inline int valid_io_request(struct zram *zram, 291 + sector_t start, unsigned int size) 290 292 { 291 - u64 start, end, bound; 293 + u64 end, bound; 292 294 293 295 /* unaligned request */ 294 - if (unlikely(bio->bi_iter.bi_sector & 295 - (ZRAM_SECTOR_PER_LOGICAL_BLOCK - 1))) 296 + if (unlikely(start & (ZRAM_SECTOR_PER_LOGICAL_BLOCK - 1))) 296 297 return 0; 297 - if (unlikely(bio->bi_iter.bi_size & (ZRAM_LOGICAL_BLOCK_SIZE - 1))) 298 + if (unlikely(size & (ZRAM_LOGICAL_BLOCK_SIZE - 1))) 298 299 return 0; 299 300 300 - start = bio->bi_iter.bi_sector; 301 - end = start + (bio->bi_iter.bi_size >> SECTOR_SHIFT); 301 + end = start + (size >> SECTOR_SHIFT); 302 302 bound = zram->disksize >> SECTOR_SHIFT; 303 303 /* out of range range */ 304 304 if (unlikely(start >= bound || end > bound || start > end)) ··· 451 453 } 452 454 453 455 static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec, 454 - u32 index, int offset, struct bio *bio) 456 + u32 index, int offset) 455 457 { 456 458 int ret; 457 459 struct page *page; ··· 643 645 } 644 646 645 647 static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, 646 - int offset, struct bio *bio) 648 + int offset, int rw) 647 649 { 648 650 int ret; 649 - int rw = bio_data_dir(bio); 650 651 651 652 if (rw == READ) { 652 653 atomic64_inc(&zram->stats.num_reads); 653 - ret = zram_bvec_read(zram, bvec, index, offset, bio); 654 + ret = zram_bvec_read(zram, bvec, index, offset); 654 655 } else { 655 656 atomic64_inc(&zram->stats.num_writes); 656 657 ret = zram_bvec_write(zram, bvec, index, offset); ··· 850 853 851 854 static void __zram_make_request(struct zram *zram, struct bio *bio) 852 855 { 853 - int offset; 856 + int offset, rw; 854 857 u32 index; 855 858 struct bio_vec bvec; 856 859 struct bvec_iter iter; ··· 865 868 return; 866 869 } 867 870 871 + rw = bio_data_dir(bio); 868 872 bio_for_each_segment(bvec, bio, iter) { 869 873 int max_transfer_size = PAGE_SIZE - offset; 870 874 ··· 880 882 bv.bv_len = max_transfer_size; 881 883 bv.bv_offset = bvec.bv_offset; 882 884 883 - if (zram_bvec_rw(zram, &bv, index, offset, bio) < 0) 885 + if (zram_bvec_rw(zram, &bv, index, offset, rw) < 0) 884 886 goto out; 885 887 886 888 bv.bv_len = bvec.bv_len - max_transfer_size; 887 889 bv.bv_offset += max_transfer_size; 888 - if (zram_bvec_rw(zram, &bv, index + 1, 0, bio) < 0) 890 + if (zram_bvec_rw(zram, &bv, index + 1, 0, rw) < 0) 889 891 goto out; 890 892 } else 891 - if (zram_bvec_rw(zram, &bvec, index, offset, bio) < 0) 893 + if (zram_bvec_rw(zram, &bvec, index, offset, rw) < 0) 892 894 goto out; 893 895 894 896 update_position(&index, &offset, &bvec); ··· 913 915 if (unlikely(!init_done(zram))) 914 916 goto error; 915 917 916 - if (!valid_io_request(zram, bio)) { 918 + if (!valid_io_request(zram, bio->bi_iter.bi_sector, 919 + bio->bi_iter.bi_size)) { 917 920 atomic64_inc(&zram->stats.invalid_io); 918 921 goto error; 919 922 } ··· 944 945 atomic64_inc(&zram->stats.notify_free); 945 946 } 946 947 948 + static int zram_rw_page(struct block_device *bdev, sector_t sector, 949 + struct page *page, int rw) 950 + { 951 + int offset, err; 952 + u32 index; 953 + struct zram *zram; 954 + struct bio_vec bv; 955 + 956 + zram = bdev->bd_disk->private_data; 957 + if (!valid_io_request(zram, sector, PAGE_SIZE)) { 958 + atomic64_inc(&zram->stats.invalid_io); 959 + return -EINVAL; 960 + } 961 + 962 + down_read(&zram->init_lock); 963 + if (unlikely(!init_done(zram))) { 964 + err = -EIO; 965 + goto out_unlock; 966 + } 967 + 968 + index = sector >> SECTORS_PER_PAGE_SHIFT; 969 + offset = sector & (SECTORS_PER_PAGE - 1) << SECTOR_SHIFT; 970 + 971 + bv.bv_page = page; 972 + bv.bv_len = PAGE_SIZE; 973 + bv.bv_offset = 0; 974 + 975 + err = zram_bvec_rw(zram, &bv, index, offset, rw); 976 + out_unlock: 977 + up_read(&zram->init_lock); 978 + /* 979 + * If I/O fails, just return error(ie, non-zero) without 980 + * calling page_endio. 981 + * It causes resubmit the I/O with bio request by upper functions 982 + * of rw_page(e.g., swap_readpage, __swap_writepage) and 983 + * bio->bi_end_io does things to handle the error 984 + * (e.g., SetPageError, set_page_dirty and extra works). 985 + */ 986 + if (err == 0) 987 + page_endio(page, rw, 0); 988 + return err; 989 + } 990 + 947 991 static const struct block_device_operations zram_devops = { 948 992 .swap_slot_free_notify = zram_slot_free_notify, 993 + .rw_page = zram_rw_page, 949 994 .owner = THIS_MODULE 950 995 }; 951 996 952 - static DEVICE_ATTR(disksize, S_IRUGO | S_IWUSR, 953 - disksize_show, disksize_store); 954 - static DEVICE_ATTR(initstate, S_IRUGO, initstate_show, NULL); 955 - static DEVICE_ATTR(reset, S_IWUSR, NULL, reset_store); 956 - static DEVICE_ATTR(orig_data_size, S_IRUGO, orig_data_size_show, NULL); 957 - static DEVICE_ATTR(mem_used_total, S_IRUGO, mem_used_total_show, NULL); 958 - static DEVICE_ATTR(mem_limit, S_IRUGO | S_IWUSR, mem_limit_show, 959 - mem_limit_store); 960 - static DEVICE_ATTR(mem_used_max, S_IRUGO | S_IWUSR, mem_used_max_show, 961 - mem_used_max_store); 962 - static DEVICE_ATTR(max_comp_streams, S_IRUGO | S_IWUSR, 963 - max_comp_streams_show, max_comp_streams_store); 964 - static DEVICE_ATTR(comp_algorithm, S_IRUGO | S_IWUSR, 965 - comp_algorithm_show, comp_algorithm_store); 997 + static DEVICE_ATTR_RW(disksize); 998 + static DEVICE_ATTR_RO(initstate); 999 + static DEVICE_ATTR_WO(reset); 1000 + static DEVICE_ATTR_RO(orig_data_size); 1001 + static DEVICE_ATTR_RO(mem_used_total); 1002 + static DEVICE_ATTR_RW(mem_limit); 1003 + static DEVICE_ATTR_RW(mem_used_max); 1004 + static DEVICE_ATTR_RW(max_comp_streams); 1005 + static DEVICE_ATTR_RW(comp_algorithm); 966 1006 967 1007 ZRAM_ATTR_RO(num_reads); 968 1008 ZRAM_ATTR_RO(num_writes);
+2 -2
drivers/block/zram/zram_drv.h
··· 66 66 /* Flags for zram pages (table[page_no].value) */ 67 67 enum zram_pageflags { 68 68 /* Page consists entirely of zeros */ 69 - ZRAM_ZERO = ZRAM_FLAG_SHIFT + 1, 70 - ZRAM_ACCESS, /* page in now accessed */ 69 + ZRAM_ZERO = ZRAM_FLAG_SHIFT, 70 + ZRAM_ACCESS, /* page is now accessed */ 71 71 72 72 __NR_ZRAM_PAGEFLAGS, 73 73 };
+52 -30
drivers/iommu/amd_iommu_v2.c
··· 509 509 spin_unlock_irqrestore(&pasid_state->lock, flags); 510 510 } 511 511 512 + static void handle_fault_error(struct fault *fault) 513 + { 514 + int status; 515 + 516 + if (!fault->dev_state->inv_ppr_cb) { 517 + set_pri_tag_status(fault->state, fault->tag, PPR_INVALID); 518 + return; 519 + } 520 + 521 + status = fault->dev_state->inv_ppr_cb(fault->dev_state->pdev, 522 + fault->pasid, 523 + fault->address, 524 + fault->flags); 525 + switch (status) { 526 + case AMD_IOMMU_INV_PRI_RSP_SUCCESS: 527 + set_pri_tag_status(fault->state, fault->tag, PPR_SUCCESS); 528 + break; 529 + case AMD_IOMMU_INV_PRI_RSP_INVALID: 530 + set_pri_tag_status(fault->state, fault->tag, PPR_INVALID); 531 + break; 532 + case AMD_IOMMU_INV_PRI_RSP_FAIL: 533 + set_pri_tag_status(fault->state, fault->tag, PPR_FAILURE); 534 + break; 535 + default: 536 + BUG(); 537 + } 538 + } 539 + 512 540 static void do_fault(struct work_struct *work) 513 541 { 514 542 struct fault *fault = container_of(work, struct fault, work); 515 - int npages, write; 543 + struct mm_struct *mm; 516 - struct page *page; 544 + struct vm_area_struct *vma; 545 + u64 address; 546 + int ret, write; 517 547 518 548 write = !!(fault->flags & PPR_FAULT_WRITE); 519 549 520 - down_read(&fault->state->mm->mmap_sem); 521 - npages = get_user_pages(NULL, fault->state->mm, 522 - fault->address, 1, write, 0, &page, NULL); 523 - up_read(&fault->state->mm->mmap_sem); 550 + mm = fault->state->mm; 551 + address = fault->address; 524 552 525 - if (npages == 1) { 526 - put_page(page); 527 - } else if (fault->dev_state->inv_ppr_cb) { 528 - int status; 529 - 530 - status = fault->dev_state->inv_ppr_cb(fault->dev_state->pdev, 531 - fault->pasid, 532 - fault->address, 533 - fault->flags); 534 - switch (status) { 535 - case AMD_IOMMU_INV_PRI_RSP_SUCCESS: 536 - set_pri_tag_status(fault->state, fault->tag, PPR_SUCCESS); 537 - break; 538 - case AMD_IOMMU_INV_PRI_RSP_INVALID: 539 - set_pri_tag_status(fault->state, fault->tag, PPR_INVALID); 540 - break; 541 - case AMD_IOMMU_INV_PRI_RSP_FAIL: 542 - set_pri_tag_status(fault->state, fault->tag, PPR_FAILURE); 543 - break; 544 - default: 545 - BUG(); 546 - } 547 - } else { 548 - set_pri_tag_status(fault->state, fault->tag, PPR_INVALID); 553 + down_read(&mm->mmap_sem); 554 + vma = find_extend_vma(mm, address); 555 + if (!vma || address < vma->vm_start) { 556 + /* failed to get a vma in the right range */ 557 + up_read(&mm->mmap_sem); 558 + handle_fault_error(fault); 559 + goto out; 549 560 } 550 561 562 + ret = handle_mm_fault(mm, vma, address, write); 563 + if (ret & VM_FAULT_ERROR) { 564 + /* failed to service fault */ 565 + up_read(&mm->mmap_sem); 566 + handle_fault_error(fault); 567 + goto out; 568 + } 569 + 570 + up_read(&mm->mmap_sem); 571 + 572 + out: 551 573 finish_pri_tag(fault->dev_state, fault->state, fault->tag); 552 574 553 575 put_pasid_state(fault->state);
+9 -2
drivers/rtc/rtc-snvs.c
··· 344 344 345 345 return 0; 346 346 } 347 - #endif 348 347 349 348 static const struct dev_pm_ops snvs_rtc_pm_ops = { 350 349 .suspend_noirq = snvs_rtc_suspend, 351 350 .resume_noirq = snvs_rtc_resume, 352 351 }; 352 + 353 + #define SNVS_RTC_PM_OPS (&snvs_rtc_pm_ops) 354 + 355 + #else 356 + 357 + #define SNVS_RTC_PM_OPS NULL 358 + 359 + #endif 353 360 354 361 static const struct of_device_id snvs_dt_ids[] = { 355 362 { .compatible = "fsl,sec-v4.0-mon-rtc-lp", }, ··· 368 361 .driver = { 369 362 .name = "snvs_rtc", 370 363 .owner = THIS_MODULE, 371 - .pm = &snvs_rtc_pm_ops, 364 + .pm = SNVS_RTC_PM_OPS, 372 365 .of_match_table = snvs_dt_ids, 373 366 }, 374 367 .probe = snvs_rtc_probe,
+1 -2
drivers/staging/android/ashmem.c
··· 418 418 } 419 419 420 420 /* 421 - * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab 421 + * ashmem_shrink - our cache shrinker, called from mm/vmscan.c 422 422 * 423 423 * 'nr_to_scan' is the number of objects to scan for freeing. 424 424 * ··· 785 785 .nr_to_scan = LONG_MAX, 786 786 }; 787 787 ret = ashmem_shrink_count(&ashmem_shrinker, &sc); 788 - nodes_setall(sc.nodes_to_scan); 789 788 ashmem_shrink_scan(&ashmem_shrinker, &sc); 790 789 } 791 790 break;
+2
fs/affs/affs.h
··· 135 135 extern void secs_to_datestamp(time_t secs, struct affs_date *ds);
136 136 extern umode_t prot_to_mode(u32 prot);
137 137 extern void mode_to_prot(struct inode *inode);
138 + __printf(3, 4)
138 139 extern void affs_error(struct super_block *sb, const char *function,
139 140 const char *fmt, ...);
141 + __printf(3, 4)
140 142 extern void affs_warning(struct super_block *sb, const char *function,
141 143 const char *fmt, ...);
142 144 extern bool affs_nofilenametruncate(const struct dentry *dentry);
+13 -15
fs/affs/amigaffs.c
··· 10 10
11 11 #include "affs.h"
12 12
13 - static char ErrorBuffer[256];
14 -
15 13 /*
16 14 * Functions for accessing Amiga-FFS structures.
17 15 */
··· 442 444 void
443 445 affs_error(struct super_block *sb, const char *function, const char *fmt, ...)
444 446 {
445 - va_list args;
447 + struct va_format vaf;
448 + va_list args;
446 449
447 - va_start(args,fmt);
448 - vsnprintf(ErrorBuffer,sizeof(ErrorBuffer),fmt,args);
449 - va_end(args);
450 -
451 - pr_crit("error (device %s): %s(): %s\n", sb->s_id,
452 - function,ErrorBuffer);
450 + va_start(args, fmt);
451 + vaf.fmt = fmt;
452 + vaf.va = &args;
453 + pr_crit("error (device %s): %s(): %pV\n", sb->s_id, function, &vaf);
453 454 if (!(sb->s_flags & MS_RDONLY))
454 455 pr_warn("Remounting filesystem read-only\n");
455 456 sb->s_flags |= MS_RDONLY;
457 + va_end(args);
456 458 }
457 459
458 460 void
459 461 affs_warning(struct super_block *sb, const char *function, const char *fmt, ...)
460 462 {
461 - va_list args;
463 + struct va_format vaf;
464 + va_list args;
462 465
463 - va_start(args,fmt);
464 - vsnprintf(ErrorBuffer,sizeof(ErrorBuffer),fmt,args);
466 + va_start(args, fmt);
467 + vaf.fmt = fmt;
468 + vaf.va = &args;
469 + pr_warn("(device %s): %s(): %pV\n", sb->s_id, function, &vaf);
465 470 va_end(args);
466 -
467 - pr_warn("(device %s): %s(): %s\n", sb->s_id,
468 - function,ErrorBuffer);
469 471 }
470 472
471 473 bool
+44 -32
fs/affs/file.c
··· 12 12 * affs regular file handling primitives
13 13 */
14 14
15 + #include <linux/aio.h>
15 16 #include "affs.h"
16 17
17 - #if PAGE_SIZE < 4096
18 - #error PAGE_SIZE must be at least 4096
19 - #endif
20 -
21 - static int affs_grow_extcache(struct inode *inode, u32 lc_idx);
22 - static struct buffer_head *affs_alloc_extblock(struct inode *inode, struct buffer_head *bh, u32 ext);
23 - static inline struct buffer_head *affs_get_extblock(struct inode *inode, u32 ext);
24 18 static struct buffer_head *affs_get_extblock_slow(struct inode *inode, u32 ext);
25 - static int affs_file_open(struct inode *inode, struct file *filp);
26 - static int affs_file_release(struct inode *inode, struct file *filp);
27 -
28 - const struct file_operations affs_file_operations = {
29 - .llseek = generic_file_llseek,
30 - .read = new_sync_read,
31 - .read_iter = generic_file_read_iter,
32 - .write = new_sync_write,
33 - .write_iter = generic_file_write_iter,
34 - .mmap = generic_file_mmap,
35 - .open = affs_file_open,
36 - .release = affs_file_release,
37 - .fsync = affs_file_fsync,
38 - .splice_read = generic_file_splice_read,
39 - };
40 -
41 - const struct inode_operations affs_file_inode_operations = {
42 - .setattr = affs_notify_change,
43 - };
44 19
45 20 static int
46 21 affs_file_open(struct inode *inode, struct file *filp)
··· 330 355
331 356 /* store new block */
332 357 if (bh_result->b_blocknr)
333 - affs_warning(sb, "get_block", "block already set (%x)", bh_result->b_blocknr);
358 + affs_warning(sb, "get_block", "block already set (%lx)",
359 + (unsigned long)bh_result->b_blocknr);
334 360 AFFS_BLOCK(sb, ext_bh, block) = cpu_to_be32(blocknr);
335 361 AFFS_HEAD(ext_bh)->block_count = cpu_to_be32(block + 1);
336 362 affs_adjust_checksum(ext_bh, blocknr - bh_result->b_blocknr + 1);
··· 353 377 return 0;
354 378
355 379 err_big:
356 - affs_error(inode->i_sb,"get_block","strange block request %d", block);
380 + affs_error(inode->i_sb, "get_block", "strange block request %d",
381 + (int)block);
357 382 return -EIO;
358 383 err_ext:
359 384 // unlock cache
··· 389 412 }
390 413 }
391 414
415 + static ssize_t
416 + affs_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
417 + loff_t offset)
418 + {
419 + struct file *file = iocb->ki_filp;
420 + struct address_space *mapping = file->f_mapping;
421 + struct inode *inode = mapping->host;
422 + size_t count = iov_iter_count(iter);
423 + ssize_t ret;
424 +
425 + ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, affs_get_block);
426 + if (ret < 0 && (rw & WRITE))
427 + affs_write_failed(mapping, offset + count);
428 + return ret;
429 + }
430 +
392 431 static int affs_write_begin(struct file *file, struct address_space *mapping,
393 432 loff_t pos, unsigned len, unsigned flags,
394 433 struct page **pagep, void **fsdata)
··· 431 438 .writepage = affs_writepage,
432 439 .write_begin = affs_write_begin,
433 440 .write_end = generic_write_end,
441 + .direct_IO = affs_direct_IO,
434 442 .bmap = _affs_bmap
435 443 };
436 444
··· 861 867 // lock cache
862 868 ext_bh = affs_get_extblock(inode, ext);
863 869 if (IS_ERR(ext_bh)) {
864 - affs_warning(sb, "truncate", "unexpected read error for ext block %u (%d)",
865 - ext, PTR_ERR(ext_bh));
870 + affs_warning(sb, "truncate",
871 + "unexpected read error for ext block %u (%ld)",
872 + (unsigned int)ext, PTR_ERR(ext_bh));
866 873 return;
867 874 }
868 875 if (AFFS_I(inode)->i_lc) {
··· 909 914 struct buffer_head *bh = affs_bread_ino(inode, last_blk, 0);
910 915 u32 tmp;
911 916 if (IS_ERR(bh)) {
912 - affs_warning(sb, "truncate", "unexpected read error for last block %u (%d)",
913 - ext, PTR_ERR(bh));
917 + affs_warning(sb, "truncate",
918 + "unexpected read error for last block %u (%ld)",
919 + (unsigned int)ext, PTR_ERR(bh));
914 920 return;
915 921 }
916 922 tmp = be32_to_cpu(AFFS_DATA_HEAD(bh)->next);
··· 957 961 mutex_unlock(&inode->i_mutex);
958 962 return ret;
959 963 }
964 + const struct file_operations affs_file_operations = {
965 + .llseek = generic_file_llseek,
966 + .read = new_sync_read,
967 + .read_iter = generic_file_read_iter,
968 + .write = new_sync_write,
969 + .write_iter = generic_file_write_iter,
970 + .mmap = generic_file_mmap,
971 + .open = affs_file_open,
972 + .release = affs_file_release,
973 + .fsync = affs_file_fsync,
974 + .splice_read = generic_file_splice_read,
975 + };
976 +
977 + const struct inode_operations affs_file_inode_operations = {
978 + .setattr = affs_notify_change,
979 + };
-4
fs/befs/linuxvfs.c
··· 269 269 }
270 270 ctx->pos++;
271 271 goto more;
272 -
273 - befs_debug(sb, "<--- %s pos %lld", __func__, ctx->pos);
274 -
275 - return 0;
276 272 }
277 273
278 274 static struct inode *
+4
fs/binfmt_em86.c
··· 42 42 return -ENOEXEC;
43 43 }
44 44
45 + /* Need to be able to load the file after exec */
46 + if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
47 + return -ENOENT;
48 +
45 49 allow_write_access(bprm->file);
46 50 fput(bprm->file);
47 51 bprm->file = NULL;
+4
fs/binfmt_misc.c
··· 144 144 if (!fmt)
145 145 goto ret;
146 146
147 + /* Need to be able to load the file after exec */
148 + if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
149 + return -ENOENT;
150 +
147 151 if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) {
148 152 retval = remove_arg_zero(bprm);
149 153 if (retval)
+10
fs/binfmt_script.c
··· 24 24
25 25 if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
26 26 return -ENOEXEC;
27 +
28 + /*
29 + * If the script filename will be inaccessible after exec, typically
30 + * because it is a "/dev/fd/<fd>/.." path against an O_CLOEXEC fd, give
31 + * up now (on the assumption that the interpreter will want to load
32 + * this file).
33 + */
34 + if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
35 + return -ENOENT;
36 +
27 37 /*
28 38 * This section does the #! interpretation.
29 39 * Sorta complicated, but hopefully it will work. -TYT
+6 -5
fs/drop_caches.c
··· 40 40 static void drop_slab(void)
41 41 {
42 42 int nr_objects;
43 - struct shrink_control shrink = {
44 - .gfp_mask = GFP_KERNEL,
45 - };
46 43
47 - nodes_setall(shrink.nodes_to_scan);
48 44 do {
49 - nr_objects = shrink_slab(&shrink, 1000, 1000);
45 + int nid;
46 +
47 + nr_objects = 0;
48 + for_each_online_node(nid)
49 + nr_objects += shrink_node_slabs(GFP_KERNEL, nid,
50 + 1000, 1000);
50 51 } while (nr_objects > 10);
51 52 }
52 53
+100 -13
fs/exec.c
··· 748 748
749 749 #endif /* CONFIG_MMU */
750 750
751 - static struct file *do_open_exec(struct filename *name)
751 + static struct file *do_open_execat(int fd, struct filename *name, int flags)
752 752 {
753 753 struct file *file;
754 754 int err;
755 - static const struct open_flags open_exec_flags = {
755 + struct open_flags open_exec_flags = {
756 756 .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
757 757 .acc_mode = MAY_EXEC | MAY_OPEN,
758 758 .intent = LOOKUP_OPEN,
759 759 .lookup_flags = LOOKUP_FOLLOW,
760 760 };
761 761
762 - file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
762 + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
763 + return ERR_PTR(-EINVAL);
764 + if (flags & AT_SYMLINK_NOFOLLOW)
765 + open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
766 + if (flags & AT_EMPTY_PATH)
767 + open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
768 +
769 + file = do_filp_open(fd, name, &open_exec_flags);
763 770 if (IS_ERR(file))
764 771 goto out;
765 772
··· 777 770 if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
778 771 goto exit;
779 772
780 - fsnotify_open(file);
781 -
782 773 err = deny_write_access(file);
783 774 if (err)
784 775 goto exit;
776 +
777 + if (name->name[0] != '\0')
778 + fsnotify_open(file);
785 779
786 780 out:
787 781 return file;
··· 795 787 struct file *open_exec(const char *name)
796 788 {
797 789 struct filename tmp = { .name = name };
798 - return do_open_exec(&tmp);
790 + return do_open_execat(AT_FDCWD, &tmp, 0);
799 791 }
800 792 EXPORT_SYMBOL(open_exec);
801 793
··· 1436 1428 /*
1437 1429 * sys_execve() executes a new program.
1438 1430 */
1439 - static int do_execve_common(struct filename *filename,
1440 - struct user_arg_ptr argv,
1441 - struct user_arg_ptr envp)
1431 + static int do_execveat_common(int fd, struct filename *filename,
1432 + struct user_arg_ptr argv,
1433 + struct user_arg_ptr envp,
1434 + int flags)
1442 1435 {
1436 + char *pathbuf = NULL;
1443 1437 struct linux_binprm *bprm;
1444 1438 struct file *file;
1445 1439 struct files_struct *displaced;
··· 1482 1472 check_unsafe_exec(bprm);
1483 1473 current->in_execve = 1;
1484 1474
1485 - file = do_open_exec(filename);
1475 + file = do_open_execat(fd, filename, flags);
1486 1476 retval = PTR_ERR(file);
1487 1477 if (IS_ERR(file))
1488 1478 goto out_unmark;
··· 1490 1480 sched_exec();
1491 1481
1492 1482 bprm->file = file;
1493 - bprm->filename = bprm->interp = filename->name;
1483 + if (fd == AT_FDCWD || filename->name[0] == '/') {
1484 + bprm->filename = filename->name;
1485 + } else {
1486 + if (filename->name[0] == '\0')
1487 + pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
1488 + else
1489 + pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
1490 + fd, filename->name);
1491 + if (!pathbuf) {
1492 + retval = -ENOMEM;
1493 + goto out_unmark;
1494 + }
1495 + /*
1496 + * Record that a name derived from an O_CLOEXEC fd will be
1497 + * inaccessible after exec. Relies on having exclusive access to
1498 + * current->files (due to unshare_files above).
1499 + */
1500 + if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
1501 + bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
1502 + bprm->filename = pathbuf;
1503 + }
1504 + bprm->interp = bprm->filename;
1494 1505
1495 1506 retval = bprm_mm_init(bprm);
1496 1507 if (retval)
··· 1552 1521 acct_update_integrals(current);
1553 1522 task_numa_free(current);
1554 1523 free_bprm(bprm);
1524 + kfree(pathbuf);
1555 1525 putname(filename);
1556 1526 if (displaced)
1557 1527 put_files_struct(displaced);
··· 1570 1538
1571 1539 out_free:
1572 1540 free_bprm(bprm);
1541 + kfree(pathbuf);
1573 1542
1574 1543 out_files:
1575 1544 if (displaced)
··· 1586 1553 {
1587 1554 struct user_arg_ptr argv = { .ptr.native = __argv };
1588 1555 struct user_arg_ptr envp = { .ptr.native = __envp };
1589 - return do_execve_common(filename, argv, envp);
1556 + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
1557 + }
1558 +
1559 + int do_execveat(int fd, struct filename *filename,
1560 + const char __user *const __user *__argv,
1561 + const char __user *const __user *__envp,
1562 + int flags)
1563 + {
1564 + struct user_arg_ptr argv = { .ptr.native = __argv };
1565 + struct user_arg_ptr envp = { .ptr.native = __envp };
1566 +
1567 + return do_execveat_common(fd, filename, argv, envp, flags);
1590 1568 }
1591 1569
1592 1570 #ifdef CONFIG_COMPAT
··· 1613 1569 .is_compat = true,
1614 1570 .ptr.compat = __envp,
1615 1571 };
1616 - return do_execve_common(filename, argv, envp);
1572 + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
1573 + }
1574 +
1575 + static int compat_do_execveat(int fd, struct filename *filename,
1576 + const compat_uptr_t __user *__argv,
1577 + const compat_uptr_t __user *__envp,
1578 + int flags)
1579 + {
1580 + struct user_arg_ptr argv = {
1581 + .is_compat = true,
1582 + .ptr.compat = __argv,
1583 + };
1584 + struct user_arg_ptr envp = {
1585 + .is_compat = true,
1586 + .ptr.compat = __envp,
1587 + };
1588 + return do_execveat_common(fd, filename, argv, envp, flags);
1617 1589 }
1618 1590 #endif
··· 1669 1609 {
1670 1610 return do_execve(getname(filename), argv, envp);
1671 1611 }
1612 +
1613 + SYSCALL_DEFINE5(execveat,
1614 + int, fd, const char __user *, filename,
1615 + const char __user *const __user *, argv,
1616 + const char __user *const __user *, envp,
1617 + int, flags)
1618 + {
1619 + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
1620 +
1621 + return do_execveat(fd,
1622 + getname_flags(filename, lookup_flags, NULL),
1623 + argv, envp, flags);
1624 + }
1625 +
1672 1626 #ifdef CONFIG_COMPAT
1673 1627 COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
1674 1628 const compat_uptr_t __user *, argv,
1675 1629 const compat_uptr_t __user *, envp)
1676 1630 {
1677 1631 return compat_do_execve(getname(filename), argv, envp);
1632 + }
1633 +
1634 + COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
1635 + const char __user *, filename,
1636 + const compat_uptr_t __user *, argv,
1637 + const compat_uptr_t __user *, envp,
1638 + int, flags)
1639 + {
1640 + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
1641 +
1642 + return compat_do_execveat(fd,
1643 + getname_flags(filename, lookup_flags, NULL),
1644 + argv, envp, flags);
1678 1645 }
1679 1646 #endif
+1
fs/fat/fat.h
··· 370 370 int datasync);
371 371
372 372 /* fat/inode.c */
373 + extern int fat_block_truncate_page(struct inode *inode, loff_t from);
373 374 extern void fat_attach(struct inode *inode, loff_t i_pos);
374 375 extern void fat_detach(struct inode *inode);
375 376 extern struct inode *fat_iget(struct super_block *sb, loff_t i_pos);
+3
fs/fat/file.c
··· 443 443 }
444 444
445 445 if (attr->ia_valid & ATTR_SIZE) {
446 + error = fat_block_truncate_page(inode, attr->ia_size);
447 + if (error)
448 + goto out;
446 449 down_write(&MSDOS_I(inode)->truncate_lock);
447 450 truncate_setsize(inode, attr->ia_size);
448 451 fat_truncate_blocks(inode, attr->ia_size);
+12
fs/fat/inode.c
··· 294 294 return blocknr;
295 295 }
296 296
297 + /*
298 + * fat_block_truncate_page() zeroes out a mapping from file offset `from'
299 + * up to the end of the block which corresponds to `from'.
300 + * This is required during truncate to physically zeroout the tail end
301 + * of that block so it doesn't yield old data if the file is later grown.
302 + * Also, avoid causing failure from fsx for cases of "data past EOF"
303 + */
304 + int fat_block_truncate_page(struct inode *inode, loff_t from)
305 + {
306 + return block_truncate_page(inode->i_mapping, from, fat_get_block);
307 + }
308 +
297 309 static const struct address_space_operations fat_aops = {
298 310 .readpage = fat_readpage,
299 311 .readpages = fat_readpages,
+7 -7
fs/hugetlbfs/inode.c
··· 412 412 pgoff = offset >> PAGE_SHIFT;
413 413
414 414 i_size_write(inode, offset);
415 - mutex_lock(&mapping->i_mmap_mutex);
415 + i_mmap_lock_write(mapping);
416 416 if (!RB_EMPTY_ROOT(&mapping->i_mmap))
417 417 hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
418 - mutex_unlock(&mapping->i_mmap_mutex);
418 + i_mmap_unlock_write(mapping);
419 419 truncate_hugepages(inode, offset);
420 420 return 0;
421 421 }
··· 472 472 }
473 473
474 474 /*
475 - * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
475 + * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never
476 476 * be taken from reclaim -- unlike regular filesystems. This needs an
477 477 * annotation because huge_pmd_share() does an allocation under
478 - * i_mmap_mutex.
478 + * i_mmap_rwsem.
479 479 */
480 - static struct lock_class_key hugetlbfs_i_mmap_mutex_key;
480 + static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
481 481
482 482 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
483 483 struct inode *dir,
··· 495 495 struct hugetlbfs_inode_info *info;
496 496 inode->i_ino = get_next_ino();
497 497 inode_init_owner(inode, dir, mode);
498 - lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
499 - &hugetlbfs_i_mmap_mutex_key);
498 + lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
499 + &hugetlbfs_i_mmap_rwsem_key);
500 500 inode->i_mapping->a_ops = &hugetlbfs_aops;
501 501 inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
502 502 inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+1 -1
fs/inode.c
··· 346 346 memset(mapping, 0, sizeof(*mapping));
347 347 INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
348 348 spin_lock_init(&mapping->tree_lock);
349 - mutex_init(&mapping->i_mmap_mutex);
349 + init_rwsem(&mapping->i_mmap_rwsem);
350 350 INIT_LIST_HEAD(&mapping->private_list);
351 351 spin_lock_init(&mapping->private_lock);
352 352 mapping->i_mmap = RB_ROOT;
+1 -1
fs/namei.c
··· 130 130
131 131 #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename))
132 132
133 - static struct filename *
133 + struct filename *
134 134 getname_flags(const char __user *filename, int flags, int *empty)
135 135 {
136 136 struct filename *result, *err;
+2 -2
fs/notify/dnotify/dnotify.c
··· 69 69 if (old_mask == new_mask)
70 70 return;
71 71
72 - if (fsn_mark->i.inode)
73 - fsnotify_recalc_inode_mask(fsn_mark->i.inode);
72 + if (fsn_mark->inode)
73 + fsnotify_recalc_inode_mask(fsn_mark->inode);
74 74
75 75 /*
76 76
+3 -3
fs/notify/fdinfo.c
··· 80 80 return;
81 81
82 82 inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark);
83 - inode = igrab(mark->i.inode);
83 + inode = igrab(mark->inode);
84 84 if (inode) {
85 85 seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
86 86 inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
··· 112 112 mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
113 113
114 114 if (mark->flags & FSNOTIFY_MARK_FLAG_INODE) {
115 - inode = igrab(mark->i.inode);
115 + inode = igrab(mark->inode);
116 116 if (!inode)
117 117 return;
118 118 seq_printf(m, "fanotify ino:%lx sdev:%x mflags:%x mask:%x ignored_mask:%x ",
··· 122 122 seq_putc(m, '\n');
123 123 iput(inode);
124 124 } else if (mark->flags & FSNOTIFY_MARK_FLAG_VFSMOUNT) {
125 - struct mount *mnt = real_mount(mark->m.mnt);
125 + struct mount *mnt = real_mount(mark->mnt);
126 126
127 127 seq_printf(m, "fanotify mnt_id:%x mflags:%x mask:%x ignored_mask:%x\n",
128 128 mnt->mnt_id, mflags, mark->mask, mark->ignored_mask);
+2 -2
fs/notify/fsnotify.c
··· 242 242
243 243 if (inode_node) {
244 244 inode_mark = hlist_entry(srcu_dereference(inode_node, &fsnotify_mark_srcu),
245 - struct fsnotify_mark, i.i_list);
245 + struct fsnotify_mark, obj_list);
246 246 inode_group = inode_mark->group;
247 247 }
248 248
249 249 if (vfsmount_node) {
250 250 vfsmount_mark = hlist_entry(srcu_dereference(vfsmount_node, &fsnotify_mark_srcu),
251 - struct fsnotify_mark, m.m_list);
251 + struct fsnotify_mark, obj_list);
252 252 vfsmount_group = vfsmount_mark->group;
253 253 }
254 254
+12
fs/notify/fsnotify.h
··· 12 12 /* protects reads of inode and vfsmount marks list */
13 13 extern struct srcu_struct fsnotify_mark_srcu;
14 14
15 + /* Calculate mask of events for a list of marks */
16 + extern u32 fsnotify_recalc_mask(struct hlist_head *head);
17 +
15 18 /* compare two groups for sorting of marks lists */
16 19 extern int fsnotify_compare_groups(struct fsnotify_group *a,
17 20 struct fsnotify_group *b);
18 21
19 22 extern void fsnotify_set_inode_mark_mask_locked(struct fsnotify_mark *fsn_mark,
20 23 __u32 mask);
24 + /* Add mark to a proper place in mark list */
25 + extern int fsnotify_add_mark_list(struct hlist_head *head,
26 + struct fsnotify_mark *mark,
27 + int allow_dups);
21 28 /* add a mark to an inode */
22 29 extern int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
23 30 struct fsnotify_group *group, struct inode *inode,
··· 38 31 extern void fsnotify_destroy_vfsmount_mark(struct fsnotify_mark *mark);
39 32 /* inode specific destruction of a mark */
40 33 extern void fsnotify_destroy_inode_mark(struct fsnotify_mark *mark);
34 + /* Destroy all marks in the given list */
35 + extern void fsnotify_destroy_marks(struct list_head *to_free);
36 + /* Find mark belonging to given group in the list of marks */
37 + extern struct fsnotify_mark *fsnotify_find_mark(struct hlist_head *head,
38 + struct fsnotify_group *group);
41 39 /* run the list of all marks associated with inode and flag them to be freed */
42 40 extern void fsnotify_clear_marks_by_inode(struct inode *inode);
43 41 /* run the list of all marks associated with vfsmount and flag them to be freed */
+18 -95
fs/notify/inode_mark.c
··· 31 31 #include "../internal.h"
32 32
33 33 /*
34 - * Recalculate the mask of events relevant to a given inode locked.
35 - */
36 - static void fsnotify_recalc_inode_mask_locked(struct inode *inode)
37 - {
38 - struct fsnotify_mark *mark;
39 - __u32 new_mask = 0;
40 -
41 - assert_spin_locked(&inode->i_lock);
42 -
43 - hlist_for_each_entry(mark, &inode->i_fsnotify_marks, i.i_list)
44 - new_mask |= mark->mask;
45 - inode->i_fsnotify_mask = new_mask;
46 - }
47 -
48 - /*
49 34 * Recalculate the inode->i_fsnotify_mask, or the mask of all FS_* event types
50 35 * any notifier is interested in hearing for this inode.
51 36 */
52 37 void fsnotify_recalc_inode_mask(struct inode *inode)
53 38 {
54 39 spin_lock(&inode->i_lock);
55 - fsnotify_recalc_inode_mask_locked(inode);
40 + inode->i_fsnotify_mask = fsnotify_recalc_mask(&inode->i_fsnotify_marks);
56 41 spin_unlock(&inode->i_lock);
57 42
58 43 __fsnotify_update_child_dentry_flags(inode);
··· 45 60
46 61 void fsnotify_destroy_inode_mark(struct fsnotify_mark *mark)
47 62 {
48 - struct inode *inode = mark->i.inode;
63 + struct inode *inode = mark->inode;
49 64
50 65 BUG_ON(!mutex_is_locked(&mark->group->mark_mutex));
51 66 assert_spin_locked(&mark->lock);
52 67
53 68 spin_lock(&inode->i_lock);
54 69
55 - hlist_del_init_rcu(&mark->i.i_list);
56 - mark->i.inode = NULL;
70 + hlist_del_init_rcu(&mark->obj_list);
71 + mark->inode = NULL;
57 72
58 73 /*
59 74 * this mark is now off the inode->i_fsnotify_marks list and we
60 75 * hold the inode->i_lock, so this is the perfect time to update the
61 76 * inode->i_fsnotify_mask
62 77 */
63 - fsnotify_recalc_inode_mask_locked(inode);
64 -
78 + inode->i_fsnotify_mask = fsnotify_recalc_mask(&inode->i_fsnotify_marks);
65 79 spin_unlock(&inode->i_lock);
66 80 }
67 81
··· 69 85 */
70 86 void fsnotify_clear_marks_by_inode(struct inode *inode)
71 87 {
72 - struct fsnotify_mark *mark, *lmark;
88 + struct fsnotify_mark *mark;
73 89 struct hlist_node *n;
74 90 LIST_HEAD(free_list);
75 91
76 92 spin_lock(&inode->i_lock);
77 - hlist_for_each_entry_safe(mark, n, &inode->i_fsnotify_marks, i.i_list) {
78 - list_add(&mark->i.free_i_list, &free_list);
79 - hlist_del_init_rcu(&mark->i.i_list);
93 + hlist_for_each_entry_safe(mark, n, &inode->i_fsnotify_marks, obj_list) {
94 + list_add(&mark->free_list, &free_list);
95 + hlist_del_init_rcu(&mark->obj_list);
80 96 fsnotify_get_mark(mark);
81 97 }
82 98 spin_unlock(&inode->i_lock);
83 99
84 - list_for_each_entry_safe(mark, lmark, &free_list, i.free_i_list) {
85 - struct fsnotify_group *group;
86 -
87 - spin_lock(&mark->lock);
88 - fsnotify_get_group(mark->group);
89 - group = mark->group;
90 - spin_unlock(&mark->lock);
91 -
92 - fsnotify_destroy_mark(mark, group);
93 - fsnotify_put_mark(mark);
94 - fsnotify_put_group(group);
95 - }
100 + fsnotify_destroy_marks(&free_list);
96 101 }
97 102
98 103 /*
··· 96 123 * given a group and inode, find the mark associated with that combination.
97 124 * if found take a reference to that mark and return it, else return NULL
98 125 */
99 - static struct fsnotify_mark *fsnotify_find_inode_mark_locked(
100 - struct fsnotify_group *group,
101 - struct inode *inode)
102 - {
103 - struct fsnotify_mark *mark;
104 -
105 - assert_spin_locked(&inode->i_lock);
106 -
107 - hlist_for_each_entry(mark, &inode->i_fsnotify_marks, i.i_list) {
108 - if (mark->group == group) {
109 - fsnotify_get_mark(mark);
110 - return mark;
111 - }
112 - }
113 - return NULL;
114 - }
115 -
116 - /*
117 - * given a group and inode, find the mark associated with that combination.
118 - * if found take a reference to that mark and return it, else return NULL
119 - */
120 126 struct fsnotify_mark *fsnotify_find_inode_mark(struct fsnotify_group *group,
121 127 struct inode *inode)
122 128 {
123 129 struct fsnotify_mark *mark;
124 130
125 131 spin_lock(&inode->i_lock);
126 - mark = fsnotify_find_inode_mark_locked(group, inode);
132 + mark = fsnotify_find_mark(&inode->i_fsnotify_marks, group);
127 133 spin_unlock(&inode->i_lock);
128 134
129 135 return mark;
··· 120 168 assert_spin_locked(&mark->lock);
121 169
122 170 if (mask &&
123 - mark->i.inode &&
171 + mark->inode &&
124 172 !(mark->flags & FSNOTIFY_MARK_FLAG_OBJECT_PINNED)) {
125 173 mark->flags |= FSNOTIFY_MARK_FLAG_OBJECT_PINNED;
126 - inode = igrab(mark->i.inode);
174 + inode = igrab(mark->inode);
127 175 /*
128 176 * we shouldn't be able to get here if the inode wasn't
129 177 * already safely held in memory. But bug in case it
··· 144 192 struct fsnotify_group *group, struct inode *inode,
145 193 int allow_dups)
146 194 {
147 - struct fsnotify_mark *lmark, *last = NULL;
148 - int ret = 0;
149 - int cmp;
195 + int ret;
150 196
151 197 mark->flags |= FSNOTIFY_MARK_FLAG_INODE;
152 198
··· 152 202 assert_spin_locked(&mark->lock);
153 203
154 204 spin_lock(&inode->i_lock);
155 -
156 - mark->i.inode = inode;
157 -
158 - /* is mark the first mark? */
159 - if (hlist_empty(&inode->i_fsnotify_marks)) {
160 - hlist_add_head_rcu(&mark->i.i_list, &inode->i_fsnotify_marks);
161 - goto out;
162 - }
163 -
164 - /* should mark be in the middle of the current list? */
165 - hlist_for_each_entry(lmark, &inode->i_fsnotify_marks, i.i_list) {
166 - last = lmark;
167 -
168 - if ((lmark->group == group) && !allow_dups) {
169 - ret = -EEXIST;
170 - goto out;
171 - }
172 -
173 - cmp = fsnotify_compare_groups(lmark->group, mark->group);
174 - if (cmp < 0)
175 - continue;
176 -
177 - hlist_add_before_rcu(&mark->i.i_list, &lmark->i.i_list);
178 - goto out;
179 - }
180 -
181 - BUG_ON(last == NULL);
182 - /* mark should be the last entry. last is the current last entry */
183 - hlist_add_behind_rcu(&mark->i.i_list, &last->i.i_list);
184 - out:
185 - fsnotify_recalc_inode_mask_locked(inode);
205 + mark->inode = inode;
206 + ret = fsnotify_add_mark_list(&inode->i_fsnotify_marks, mark,
207 + allow_dups);
208 + inode->i_fsnotify_mask = fsnotify_recalc_mask(&inode->i_fsnotify_marks);
186 209 spin_unlock(&inode->i_lock);
187 210
188 211 return ret;
+1 -1
fs/notify/inotify/inotify_fsnotify.c
··· 156 156 */
157 157 if (fsn_mark)
158 158 printk(KERN_WARNING "fsn_mark->group=%p inode=%p wd=%d\n",
159 - fsn_mark->group, fsn_mark->i.inode, i_mark->wd);
159 + fsn_mark->group, fsn_mark->inode, i_mark->wd);
160 160 return 0;
161 161 }
162 162
+5 -5
fs/notify/inotify/inotify_user.c
··· 433 433 if (wd == -1) {
434 434 WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p"
435 435 " i_mark->inode=%p\n", __func__, i_mark, i_mark->wd,
436 - i_mark->fsn_mark.group, i_mark->fsn_mark.i.inode);
436 + i_mark->fsn_mark.group, i_mark->fsn_mark.inode);
437 437 goto out;
438 438 }
439 439
··· 442 442 if (unlikely(!found_i_mark)) {
443 443 WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p"
444 444 " i_mark->inode=%p\n", __func__, i_mark, i_mark->wd,
445 - i_mark->fsn_mark.group, i_mark->fsn_mark.i.inode);
445 + i_mark->fsn_mark.group, i_mark->fsn_mark.inode);
446 446 goto out;
447 447 }
448 448
··· 456 456 "mark->inode=%p found_i_mark=%p found_i_mark->wd=%d "
457 457 "found_i_mark->group=%p found_i_mark->inode=%p\n",
458 458 __func__, i_mark, i_mark->wd, i_mark->fsn_mark.group,
459 - i_mark->fsn_mark.i.inode, found_i_mark, found_i_mark->wd,
459 + i_mark->fsn_mark.inode, found_i_mark, found_i_mark->wd,
460 460 found_i_mark->fsn_mark.group,
461 - found_i_mark->fsn_mark.i.inode);
461 + found_i_mark->fsn_mark.inode);
462 462 goto out;
463 463 }
464 464
··· 470 470 if (unlikely(atomic_read(&i_mark->fsn_mark.refcnt) < 3)) {
471 471 printk(KERN_ERR "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p"
472 472 " i_mark->inode=%p\n", __func__, i_mark, i_mark->wd,
473 - i_mark->fsn_mark.group, i_mark->fsn_mark.i.inode);
473 + i_mark->fsn_mark.group, i_mark->fsn_mark.inode);
474 474 /* we can't really recover with bad ref cnting.. */
475 475 BUG();
476 476 }
+90 -7
fs/notify/mark.c
··· 110 110 } 111 111 } 112 112 113 + /* Calculate mask of events for a list of marks */ 114 + u32 fsnotify_recalc_mask(struct hlist_head *head) 115 + { 116 + u32 new_mask = 0; 117 + struct fsnotify_mark *mark; 118 + 119 + hlist_for_each_entry(mark, head, obj_list) 120 + new_mask |= mark->mask; 121 + return new_mask; 122 + } 123 + 113 124 /* 114 125 * Any time a mark is getting freed we end up here. 115 126 * The caller had better be holding a reference to this mark so we don't actually ··· 144 133 mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE; 145 134 146 135 if (mark->flags & FSNOTIFY_MARK_FLAG_INODE) { 147 - inode = mark->i.inode; 136 + inode = mark->inode; 148 137 fsnotify_destroy_inode_mark(mark); 149 138 } else if (mark->flags & FSNOTIFY_MARK_FLAG_VFSMOUNT) 150 139 fsnotify_destroy_vfsmount_mark(mark); ··· 161 150 mutex_unlock(&group->mark_mutex); 162 151 163 152 spin_lock(&destroy_lock); 164 - list_add(&mark->destroy_list, &destroy_list); 153 + list_add(&mark->g_list, &destroy_list); 165 154 spin_unlock(&destroy_lock); 166 155 wake_up(&destroy_waitq); 167 156 /* ··· 201 190 mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING); 202 191 fsnotify_destroy_mark_locked(mark, group); 203 192 mutex_unlock(&group->mark_mutex); 193 + } 194 + 195 + /* 196 + * Destroy all marks in the given list. The marks must be already detached from 197 + * the original inode / vfsmount. 
198 + */ 199 + void fsnotify_destroy_marks(struct list_head *to_free) 200 + { 201 + struct fsnotify_mark *mark, *lmark; 202 + struct fsnotify_group *group; 203 + 204 + list_for_each_entry_safe(mark, lmark, to_free, free_list) { 205 + spin_lock(&mark->lock); 206 + fsnotify_get_group(mark->group); 207 + group = mark->group; 208 + spin_unlock(&mark->lock); 209 + 210 + fsnotify_destroy_mark(mark, group); 211 + fsnotify_put_mark(mark); 212 + fsnotify_put_group(group); 213 + } 204 214 } 205 215 206 216 void fsnotify_set_mark_mask_locked(struct fsnotify_mark *mark, __u32 mask) ··· 275 243 if (a < b) 276 244 return 1; 277 245 return -1; 246 + } 247 + 248 + /* Add mark into proper place in given list of marks */ 249 + int fsnotify_add_mark_list(struct hlist_head *head, struct fsnotify_mark *mark, 250 + int allow_dups) 251 + { 252 + struct fsnotify_mark *lmark, *last = NULL; 253 + int cmp; 254 + 255 + /* is mark the first mark? */ 256 + if (hlist_empty(head)) { 257 + hlist_add_head_rcu(&mark->obj_list, head); 258 + return 0; 259 + } 260 + 261 + /* should mark be in the middle of the current list? */ 262 + hlist_for_each_entry(lmark, head, obj_list) { 263 + last = lmark; 264 + 265 + if ((lmark->group == mark->group) && !allow_dups) 266 + return -EEXIST; 267 + 268 + cmp = fsnotify_compare_groups(lmark->group, mark->group); 269 + if (cmp >= 0) { 270 + hlist_add_before_rcu(&mark->obj_list, &lmark->obj_list); 271 + return 0; 272 + } 273 + } 274 + 275 + BUG_ON(last == NULL); 276 + /* mark should be the last entry. 
last is the current last entry */ 277 + hlist_add_behind_rcu(&mark->obj_list, &last->obj_list); 278 + return 0; 278 279 } 279 280 280 281 /* ··· 370 305 spin_unlock(&mark->lock); 371 306 372 307 spin_lock(&destroy_lock); 373 - list_add(&mark->destroy_list, &destroy_list); 308 + list_add(&mark->g_list, &destroy_list); 374 309 spin_unlock(&destroy_lock); 375 310 wake_up(&destroy_waitq); 376 311 ··· 385 320 ret = fsnotify_add_mark_locked(mark, group, inode, mnt, allow_dups); 386 321 mutex_unlock(&group->mark_mutex); 387 322 return ret; 323 + } 324 + 325 + /* 326 + * Given a list of marks, find the mark associated with given group. If found 327 + * take a reference to that mark and return it, else return NULL. 328 + */ 329 + struct fsnotify_mark *fsnotify_find_mark(struct hlist_head *head, 330 + struct fsnotify_group *group) 331 + { 332 + struct fsnotify_mark *mark; 333 + 334 + hlist_for_each_entry(mark, head, obj_list) { 335 + if (mark->group == group) { 336 + fsnotify_get_mark(mark); 337 + return mark; 338 + } 339 + } 340 + return NULL; 388 341 } 389 342 390 343 /* ··· 435 352 void fsnotify_duplicate_mark(struct fsnotify_mark *new, struct fsnotify_mark *old) 436 353 { 437 354 assert_spin_locked(&old->lock); 438 - new->i.inode = old->i.inode; 439 - new->m.mnt = old->m.mnt; 355 + new->inode = old->inode; 356 + new->mnt = old->mnt; 440 357 if (old->group) 441 358 fsnotify_get_group(old->group); 442 359 new->group = old->group; ··· 469 386 470 387 synchronize_srcu(&fsnotify_mark_srcu); 471 388 472 - list_for_each_entry_safe(mark, next, &private_destroy_list, destroy_list) { 473 - list_del_init(&mark->destroy_list); 389 + list_for_each_entry_safe(mark, next, &private_destroy_list, g_list) { 390 + list_del_init(&mark->g_list); 474 391 fsnotify_put_mark(mark); 475 392 } 476 393
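The mark.c refactor above hoists the per-object list handling (recalc, sorted insert, find, destroy-list) into generic helpers keyed on `obj_list`. The sorted insert is the subtle part, so here is a userspace sketch of its logic under toy assumptions: `prio` and `group` below are illustrative stand-ins for the kernel's `fsnotify_compare_groups()` ordering, and a plain singly linked list replaces the RCU hlist.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Sketch of fsnotify_add_mark_list(): marks sit on one list per
 * object, sorted by group priority (highest first), and a second
 * mark from the same group is rejected unless allow_dups is set.
 */
struct mark {
	int group;		/* identifies the owning group */
	int prio;		/* toy stand-in for the group comparison */
	struct mark *next;
};

static int add_mark_sorted(struct mark **head, struct mark *m, int allow_dups)
{
	struct mark **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if ((*pp)->group == m->group && !allow_dups)
			return -EEXIST;
		if ((*pp)->prio <= m->prio)	/* insert before lower/equal prio */
			break;
	}
	m->next = *pp;
	*pp = m;
	return 0;
}

/* returns 0 when every property holds, a step number otherwise */
int demo(void)
{
	struct mark *head = NULL;
	struct mark m1 = { .group = 10, .prio = 1 };
	struct mark m2 = { .group = 20, .prio = 2 };
	struct mark m3 = { .group = 10, .prio = 1 };

	add_mark_sorted(&head, &m1, 0);
	add_mark_sorted(&head, &m2, 0);
	if (head != &m2 || head->next != &m1)
		return 1;			/* higher priority sorts first */
	if (add_mark_sorted(&head, &m3, 0) != -EEXIST)
		return 2;			/* duplicate group rejected */
	if (add_mark_sorted(&head, &m3, 1) != 0)
		return 3;			/* ...unless allow_dups */
	return 0;
}
```

The kernel version breaks ties between equal-priority groups by pointer address and inserts with `hlist_add_before_rcu()`/`hlist_add_behind_rcu()`; the ordering invariant is the same.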
+19 -90
fs/notify/vfsmount_mark.c
··· 32 32 33 33 void fsnotify_clear_marks_by_mount(struct vfsmount *mnt) 34 34 { 35 - struct fsnotify_mark *mark, *lmark; 35 + struct fsnotify_mark *mark; 36 36 struct hlist_node *n; 37 37 struct mount *m = real_mount(mnt); 38 38 LIST_HEAD(free_list); 39 39 40 40 spin_lock(&mnt->mnt_root->d_lock); 41 - hlist_for_each_entry_safe(mark, n, &m->mnt_fsnotify_marks, m.m_list) { 42 - list_add(&mark->m.free_m_list, &free_list); 43 - hlist_del_init_rcu(&mark->m.m_list); 41 + hlist_for_each_entry_safe(mark, n, &m->mnt_fsnotify_marks, obj_list) { 42 + list_add(&mark->free_list, &free_list); 43 + hlist_del_init_rcu(&mark->obj_list); 44 44 fsnotify_get_mark(mark); 45 45 } 46 46 spin_unlock(&mnt->mnt_root->d_lock); 47 47 48 - list_for_each_entry_safe(mark, lmark, &free_list, m.free_m_list) { 49 - struct fsnotify_group *group; 50 - 51 - spin_lock(&mark->lock); 52 - fsnotify_get_group(mark->group); 53 - group = mark->group; 54 - spin_unlock(&mark->lock); 55 - 56 - fsnotify_destroy_mark(mark, group); 57 - fsnotify_put_mark(mark); 58 - fsnotify_put_group(group); 59 - } 48 + fsnotify_destroy_marks(&free_list); 60 49 } 61 50 62 51 void fsnotify_clear_vfsmount_marks_by_group(struct fsnotify_group *group) ··· 54 65 } 55 66 56 67 /* 57 - * Recalculate the mask of events relevant to a given vfsmount locked. 
58 - */ 59 - static void fsnotify_recalc_vfsmount_mask_locked(struct vfsmount *mnt) 60 - { 61 - struct mount *m = real_mount(mnt); 62 - struct fsnotify_mark *mark; 63 - __u32 new_mask = 0; 64 - 65 - assert_spin_locked(&mnt->mnt_root->d_lock); 66 - 67 - hlist_for_each_entry(mark, &m->mnt_fsnotify_marks, m.m_list) 68 - new_mask |= mark->mask; 69 - m->mnt_fsnotify_mask = new_mask; 70 - } 71 - 72 - /* 73 68 * Recalculate the mnt->mnt_fsnotify_mask, or the mask of all FS_* event types 74 69 * any notifier is interested in hearing for this mount point 75 70 */ 76 71 void fsnotify_recalc_vfsmount_mask(struct vfsmount *mnt) 77 72 { 73 + struct mount *m = real_mount(mnt); 74 + 78 75 spin_lock(&mnt->mnt_root->d_lock); 79 - fsnotify_recalc_vfsmount_mask_locked(mnt); 76 + m->mnt_fsnotify_mask = fsnotify_recalc_mask(&m->mnt_fsnotify_marks); 80 77 spin_unlock(&mnt->mnt_root->d_lock); 81 78 } 82 79 83 80 void fsnotify_destroy_vfsmount_mark(struct fsnotify_mark *mark) 84 81 { 85 - struct vfsmount *mnt = mark->m.mnt; 82 + struct vfsmount *mnt = mark->mnt; 83 + struct mount *m = real_mount(mnt); 86 84 87 85 BUG_ON(!mutex_is_locked(&mark->group->mark_mutex)); 88 86 assert_spin_locked(&mark->lock); 89 87 90 88 spin_lock(&mnt->mnt_root->d_lock); 91 89 92 - hlist_del_init_rcu(&mark->m.m_list); 93 - mark->m.mnt = NULL; 90 + hlist_del_init_rcu(&mark->obj_list); 91 + mark->mnt = NULL; 94 92 95 - fsnotify_recalc_vfsmount_mask_locked(mnt); 96 - 93 + m->mnt_fsnotify_mask = fsnotify_recalc_mask(&m->mnt_fsnotify_marks); 97 94 spin_unlock(&mnt->mnt_root->d_lock); 98 - } 99 - 100 - static struct fsnotify_mark *fsnotify_find_vfsmount_mark_locked(struct fsnotify_group *group, 101 - struct vfsmount *mnt) 102 - { 103 - struct mount *m = real_mount(mnt); 104 - struct fsnotify_mark *mark; 105 - 106 - assert_spin_locked(&mnt->mnt_root->d_lock); 107 - 108 - hlist_for_each_entry(mark, &m->mnt_fsnotify_marks, m.m_list) { 109 - if (mark->group == group) { 110 - fsnotify_get_mark(mark); 111 - return mark; 
112 - } 113 - } 114 - return NULL; 115 95 } 116 96 117 97 /* ··· 90 132 struct fsnotify_mark *fsnotify_find_vfsmount_mark(struct fsnotify_group *group, 91 133 struct vfsmount *mnt) 92 134 { 135 + struct mount *m = real_mount(mnt); 93 136 struct fsnotify_mark *mark; 94 137 95 138 spin_lock(&mnt->mnt_root->d_lock); 96 - mark = fsnotify_find_vfsmount_mark_locked(group, mnt); 139 + mark = fsnotify_find_mark(&m->mnt_fsnotify_marks, group); 97 140 spin_unlock(&mnt->mnt_root->d_lock); 98 141 99 142 return mark; ··· 110 151 int allow_dups) 111 152 { 112 153 struct mount *m = real_mount(mnt); 113 - struct fsnotify_mark *lmark, *last = NULL; 114 - int ret = 0; 115 - int cmp; 154 + int ret; 116 155 117 156 mark->flags |= FSNOTIFY_MARK_FLAG_VFSMOUNT; 118 157 ··· 118 161 assert_spin_locked(&mark->lock); 119 162 120 163 spin_lock(&mnt->mnt_root->d_lock); 121 - 122 - mark->m.mnt = mnt; 123 - 124 - /* is mark the first mark? */ 125 - if (hlist_empty(&m->mnt_fsnotify_marks)) { 126 - hlist_add_head_rcu(&mark->m.m_list, &m->mnt_fsnotify_marks); 127 - goto out; 128 - } 129 - 130 - /* should mark be in the middle of the current list? */ 131 - hlist_for_each_entry(lmark, &m->mnt_fsnotify_marks, m.m_list) { 132 - last = lmark; 133 - 134 - if ((lmark->group == group) && !allow_dups) { 135 - ret = -EEXIST; 136 - goto out; 137 - } 138 - 139 - cmp = fsnotify_compare_groups(lmark->group, mark->group); 140 - if (cmp < 0) 141 - continue; 142 - 143 - hlist_add_before_rcu(&mark->m.m_list, &lmark->m.m_list); 144 - goto out; 145 - } 146 - 147 - BUG_ON(last == NULL); 148 - /* mark should be the last entry. 
last is the current last entry */ 149 - hlist_add_behind_rcu(&mark->m.m_list, &last->m.m_list); 150 - out: 151 - fsnotify_recalc_vfsmount_mask_locked(mnt); 164 + mark->mnt = mnt; 165 + ret = fsnotify_add_mark_list(&m->mnt_fsnotify_marks, mark, allow_dups); 166 + m->mnt_fsnotify_mask = fsnotify_recalc_mask(&m->mnt_fsnotify_marks); 152 167 spin_unlock(&mnt->mnt_root->d_lock); 153 168 154 169 return ret;
+11
fs/open.c
··· 295 295 296 296 sb_start_write(inode->i_sb); 297 297 ret = file->f_op->fallocate(file, mode, offset, len); 298 + 299 + /* 300 + * Create inotify and fanotify events. 301 + * 302 + * To keep the logic simple always create events if fallocate succeeds. 303 + * This implies that events are even created if the file size remains 304 + * unchanged, e.g. when using flag FALLOC_FL_KEEP_SIZE. 305 + */ 306 + if (ret == 0) 307 + fsnotify_modify(file); 308 + 298 309 sb_end_write(inode->i_sb); 299 310 return ret; 300 311 }
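With this change a successful fallocate() also raises IN_MODIFY / FAN_MODIFY, so an inotify or fanotify watcher sees preallocation just like an ordinary write, even when FALLOC_FL_KEEP_SIZE leaves the size unchanged. The userspace sketch below only exercises the syscall side of that path, using POSIX `posix_fallocate()` on a throwaway temp file; whether a native fallocate is used underneath depends on the filesystem (glibc falls back to writing zeros where it is unsupported).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* returns 0 on success, 1-4 to identify the failing step */
int demo(void)
{
	char path[] = "/tmp/falloc-demo-XXXXXX";
	struct stat st;
	int fd = mkstemp(path);

	if (fd < 0)
		return 1;
	unlink(path);			/* file lives only as long as fd */
	if (posix_fallocate(fd, 0, 4096) != 0) {
		close(fd);
		return 2;		/* allocation failed */
	}
	if (fstat(fd, &st) != 0) {
		close(fd);
		return 3;
	}
	close(fd);
	return st.st_size == 4096 ? 0 : 4;	/* default mode extends i_size */
}
```

A watcher registered with `inotify_add_watch(fd, path, IN_MODIFY)` would, after this patch, wake up on the `posix_fallocate()` call above.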
+5 -1
fs/seq_file.c
··· 25 25 { 26 26 void *buf; 27 27 28 - buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN); 28 + /* 29 + * __GFP_NORETRY to avoid oom-killings with high-order allocations - 30 + * it's better to fall back to vmalloc() than to kill things. 31 + */ 32 + buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); 29 33 if (!buf && size > PAGE_SIZE) 30 34 buf = vmalloc(size); 31 35 return buf;
+4
include/linux/binfmts.h
··· 53 53 #define BINPRM_FLAGS_EXECFD_BIT 1 54 54 #define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT) 55 55 56 + /* filename of the binary will be inaccessible after exec */ 57 + #define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2 58 + #define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT) 59 + 56 60 /* Function parameter for binfmt->coredump */ 57 61 struct coredump_params { 58 62 const siginfo_t *siginfo;
+31 -5
include/linux/bitmap.h
··· 45 45 * bitmap_set(dst, pos, nbits) Set specified bit area 46 46 * bitmap_clear(dst, pos, nbits) Clear specified bit area 47 47 * bitmap_find_next_zero_area(buf, len, pos, n, mask) Find bit free area 48 + * bitmap_find_next_zero_area_off(buf, len, pos, n, mask) as above 48 49 * bitmap_shift_right(dst, src, n, nbits) *dst = *src >> n 49 50 * bitmap_shift_left(dst, src, n, nbits) *dst = *src << n 50 51 * bitmap_remap(dst, src, old, new, nbits) *dst = map(old, new)(src) ··· 115 114 116 115 extern void bitmap_set(unsigned long *map, unsigned int start, int len); 117 116 extern void bitmap_clear(unsigned long *map, unsigned int start, int len); 118 - extern unsigned long bitmap_find_next_zero_area(unsigned long *map, 119 - unsigned long size, 120 - unsigned long start, 121 - unsigned int nr, 122 - unsigned long align_mask); 117 + 118 + extern unsigned long bitmap_find_next_zero_area_off(unsigned long *map, 119 + unsigned long size, 120 + unsigned long start, 121 + unsigned int nr, 122 + unsigned long align_mask, 123 + unsigned long align_offset); 124 + 125 + /** 126 + * bitmap_find_next_zero_area - find a contiguous aligned zero area 127 + * @map: The address to base the search on 128 + * @size: The bitmap size in bits 129 + * @start: The bitnumber to start searching at 130 + * @nr: The number of zeroed bits we're looking for 131 + * @align_mask: Alignment mask for zero area 132 + * 133 + * The @align_mask should be one less than a power of 2; the effect is that 134 + * the bit offset of all zero areas this function finds is multiples of that 135 + * power of 2. A @align_mask of 0 means no alignment is required. 
136 + */ 137 + static inline unsigned long 138 + bitmap_find_next_zero_area(unsigned long *map, 139 + unsigned long size, 140 + unsigned long start, 141 + unsigned int nr, 142 + unsigned long align_mask) 143 + { 144 + return bitmap_find_next_zero_area_off(map, size, start, nr, 145 + align_mask, 0); 146 + } 123 147 124 148 extern int bitmap_scnprintf(char *buf, unsigned int len, 125 149 const unsigned long *src, int nbits);
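The new `align_offset` parameter lets callers (CMA, in this series) ask for areas where `(index + align_offset)` is aligned rather than `index` itself, with the old function becoming the `align_offset == 0` wrapper shown above. A toy model of the search, storing one bit per byte for clarity; the alignment step mirrors the kernel's `__ALIGN_MASK(index + align_offset, align_mask) - align_offset`:

```c
#include <assert.h>

/*
 * Find nr contiguous zero "bits" starting at an index i such that
 * ((i + align_offset) & align_mask) == 0.  Returns `size` when no
 * such area exists.
 */
unsigned long find_zero_area_off(const unsigned char *map, unsigned long size,
				 unsigned long start, unsigned int nr,
				 unsigned long align_mask,
				 unsigned long align_offset)
{
	unsigned long i = start;

	while (i + nr <= size) {
		unsigned long a, j;

		/* round i up so (a + align_offset) is mask-aligned */
		a = ((i + align_offset + align_mask) & ~align_mask)
			- align_offset;
		if (a + nr > size)
			return size;
		for (j = 0; j < nr; j++)
			if (map[a + j])
				break;
		if (j == nr)
			return a;
		i = a + j + 1;	/* restart just past the set bit */
	}
	return size;
}

int demo(void)
{
	unsigned char map[16] = { 1, 1, 1, 0, 0, 1, 0, 0,
				  0, 0, 0, 0, 0, 0, 0, 0 };

	if (find_zero_area_off(map, 16, 0, 2, 0, 0) != 3)
		return 1;	/* first free pair, no alignment */
	if (find_zero_area_off(map, 16, 0, 2, 3, 0) != 8)
		return 2;	/* 4-aligned: index 4 is blocked by bit 5 */
	if (find_zero_area_off(map, 16, 0, 2, 3, 2) != 6)
		return 3;	/* offset 2: (6 + 2) % 4 == 0 */
	return 0;
}
```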
+3
include/linux/compat.h
··· 357 357 358 358 asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv, 359 359 const compat_uptr_t __user *envp); 360 + asmlinkage long compat_sys_execveat(int dfd, const char __user *filename, 361 + const compat_uptr_t __user *argv, 362 + const compat_uptr_t __user *envp, int flags); 360 363 361 364 asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp, 362 365 compat_ulong_t __user *outp, compat_ulong_t __user *exp,
+11 -6
include/linux/fault-inject.h
··· 5 5 6 6 #include <linux/types.h> 7 7 #include <linux/debugfs.h> 8 + #include <linux/ratelimit.h> 8 9 #include <linux/atomic.h> 9 10 10 11 /* ··· 26 25 unsigned long reject_end; 27 26 28 27 unsigned long count; 28 + struct ratelimit_state ratelimit_state; 29 + struct dentry *dname; 29 30 }; 30 31 31 - #define FAULT_ATTR_INITIALIZER { \ 32 - .interval = 1, \ 33 - .times = ATOMIC_INIT(1), \ 34 - .require_end = ULONG_MAX, \ 35 - .stacktrace_depth = 32, \ 36 - .verbose = 2, \ 32 + #define FAULT_ATTR_INITIALIZER { \ 33 + .interval = 1, \ 34 + .times = ATOMIC_INIT(1), \ 35 + .require_end = ULONG_MAX, \ 36 + .stacktrace_depth = 32, \ 37 + .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \ 38 + .verbose = 2, \ 39 + .dname = NULL, \ 37 40 } 38 41 39 42 #define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER
+23 -1
include/linux/fs.h
··· 18 18 #include <linux/pid.h> 19 19 #include <linux/bug.h> 20 20 #include <linux/mutex.h> 21 + #include <linux/rwsem.h> 21 22 #include <linux/capability.h> 22 23 #include <linux/semaphore.h> 23 24 #include <linux/fiemap.h> ··· 402 401 atomic_t i_mmap_writable;/* count VM_SHARED mappings */ 403 402 struct rb_root i_mmap; /* tree of private and shared mappings */ 404 403 struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ 405 - struct mutex i_mmap_mutex; /* protect tree, count, list */ 404 + struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */ 406 405 /* Protected by tree_lock together with the radix tree */ 407 406 unsigned long nrpages; /* number of total pages */ 408 407 unsigned long nrshadows; /* number of shadow entries */ ··· 467 466 #define PAGECACHE_TAG_TOWRITE 2 468 467 469 468 int mapping_tagged(struct address_space *mapping, int tag); 469 + 470 + static inline void i_mmap_lock_write(struct address_space *mapping) 471 + { 472 + down_write(&mapping->i_mmap_rwsem); 473 + } 474 + 475 + static inline void i_mmap_unlock_write(struct address_space *mapping) 476 + { 477 + up_write(&mapping->i_mmap_rwsem); 478 + } 479 + 480 + static inline void i_mmap_lock_read(struct address_space *mapping) 481 + { 482 + down_read(&mapping->i_mmap_rwsem); 483 + } 484 + 485 + static inline void i_mmap_unlock_read(struct address_space *mapping) 486 + { 487 + up_read(&mapping->i_mmap_rwsem); 488 + } 470 489 471 490 /* 472 491 * Might pages of this file be mapped into userspace? ··· 2096 2075 extern struct file * dentry_open(const struct path *, int, const struct cred *); 2097 2076 extern int filp_close(struct file *, fl_owner_t id); 2098 2077 2078 + extern struct filename *getname_flags(const char __user *, int, int *); 2099 2079 extern struct filename *getname(const char __user *); 2100 2080 extern struct filename *getname_kernel(const char *); 2101 2081
+9 -22
include/linux/fsnotify_backend.h
··· 197 197 #define FSNOTIFY_EVENT_INODE 2 198 198 199 199 /* 200 - * Inode specific fields in an fsnotify_mark 201 - */ 202 - struct fsnotify_inode_mark { 203 - struct inode *inode; /* inode this mark is associated with */ 204 - struct hlist_node i_list; /* list of marks by inode->i_fsnotify_marks */ 205 - struct list_head free_i_list; /* tmp list used when freeing this mark */ 206 - }; 207 - 208 - /* 209 - * Mount point specific fields in an fsnotify_mark 210 - */ 211 - struct fsnotify_vfsmount_mark { 212 - struct vfsmount *mnt; /* vfsmount this mark is associated with */ 213 - struct hlist_node m_list; /* list of marks by inode->i_fsnotify_marks */ 214 - struct list_head free_m_list; /* tmp list used when freeing this mark */ 215 - }; 216 - 217 - /* 218 200 * a mark is simply an object attached to an in core inode which allows an 219 201 * fsnotify listener to indicate they are either no longer interested in events 220 202 * of a type matching mask or only interested in those events. ··· 212 230 * in kernel that found and may be using this mark. 
*/ 213 231 atomic_t refcnt; /* active things looking at this mark */ 214 232 struct fsnotify_group *group; /* group this mark is for */ 215 - struct list_head g_list; /* list of marks by group->i_fsnotify_marks */ 233 + struct list_head g_list; /* list of marks by group->i_fsnotify_marks 234 + * Also reused for queueing mark into 235 + * destroy_list when it's waiting for 236 + * the end of SRCU period before it can 237 + * be freed */ 216 238 spinlock_t lock; /* protect group and inode */ 239 + struct hlist_node obj_list; /* list of marks for inode / vfsmount */ 240 + struct list_head free_list; /* tmp list used when freeing this mark */ 217 241 union { 218 - struct fsnotify_inode_mark i; 219 - struct fsnotify_vfsmount_mark m; 242 + struct inode *inode; /* inode this mark is associated with */ 243 + struct vfsmount *mnt; /* vfsmount this mark is associated with */ 220 244 }; 221 245 __u32 ignored_mask; /* events types to ignore */ 222 246 #define FSNOTIFY_MARK_FLAG_INODE 0x01 ··· 231 243 #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x08 232 244 #define FSNOTIFY_MARK_FLAG_ALIVE 0x10 233 245 unsigned int flags; /* vfsmount or inode mark? */ 234 - struct list_head destroy_list; 235 246 void (*free_mark)(struct fsnotify_mark *mark); /* called on final put+free */ 236 247 }; 237 248
+2 -5
include/linux/gfp.h
··· 110 110 #define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \ 111 111 __GFP_RECLAIMABLE) 112 112 #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL) 113 - #define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \ 114 - __GFP_HIGHMEM) 115 - #define GFP_HIGHUSER_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \ 116 - __GFP_HARDWALL | __GFP_HIGHMEM | \ 117 - __GFP_MOVABLE) 113 + #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM) 114 + #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE) 118 115 #define GFP_IOFS (__GFP_IO | __GFP_FS) 119 116 #define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ 120 117 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
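The gfp.h hunk is a pure refactor: `GFP_HIGHUSER` and `GFP_HIGHUSER_MOVABLE` are now built from `GFP_USER` instead of spelling every bit out. The sketch below checks that the two spellings expand to the same mask; the bit values are illustrative, not the kernel's real `__GFP_*` assignments.

```c
#include <assert.h>

/* illustrative bit positions, NOT the kernel's actual values */
enum {
	__GFP_WAIT	= 1u << 0,
	__GFP_IO	= 1u << 1,
	__GFP_FS	= 1u << 2,
	__GFP_HARDWALL	= 1u << 3,
	__GFP_HIGHMEM	= 1u << 4,
	__GFP_MOVABLE	= 1u << 5,
};

#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
/* the new, nested definitions */
#define GFP_HIGHUSER		(GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)

unsigned int new_highuser(void)		{ return GFP_HIGHUSER; }
unsigned int new_highuser_movable(void)	{ return GFP_HIGHUSER_MOVABLE; }

/* the old, fully spelled-out definitions */
unsigned int old_highuser(void)
{
	return __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL |
	       __GFP_HIGHMEM;
}
unsigned int old_highuser_movable(void)
{
	return __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL |
	       __GFP_HIGHMEM | __GFP_MOVABLE;
}
```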
-20
include/linux/ipc_namespace.h
··· 7 7 #include <linux/notifier.h> 8 8 #include <linux/nsproxy.h> 9 9 10 - /* 11 - * ipc namespace events 12 - */ 13 - #define IPCNS_MEMCHANGED 0x00000001 /* Notify lowmem size changed */ 14 - #define IPCNS_CREATED 0x00000002 /* Notify new ipc namespace created */ 15 - #define IPCNS_REMOVED 0x00000003 /* Notify ipc namespace removed */ 16 - 17 - #define IPCNS_CALLBACK_PRI 0 18 - 19 10 struct user_namespace; 20 11 21 12 struct ipc_ids { ··· 29 38 unsigned int msg_ctlmni; 30 39 atomic_t msg_bytes; 31 40 atomic_t msg_hdrs; 32 - int auto_msgmni; 33 41 34 42 size_t shm_ctlmax; 35 43 size_t shm_ctlall; ··· 67 77 extern spinlock_t mq_lock; 68 78 69 79 #ifdef CONFIG_SYSVIPC 70 - extern int register_ipcns_notifier(struct ipc_namespace *); 71 - extern int cond_register_ipcns_notifier(struct ipc_namespace *); 72 - extern void unregister_ipcns_notifier(struct ipc_namespace *); 73 - extern int ipcns_notify(unsigned long); 74 80 extern void shm_destroy_orphaned(struct ipc_namespace *ns); 75 81 #else /* CONFIG_SYSVIPC */ 76 - static inline int register_ipcns_notifier(struct ipc_namespace *ns) 77 - { return 0; } 78 - static inline int cond_register_ipcns_notifier(struct ipc_namespace *ns) 79 - { return 0; } 80 - static inline void unregister_ipcns_notifier(struct ipc_namespace *ns) { } 81 - static inline int ipcns_notify(unsigned long l) { return 0; } 82 82 static inline void shm_destroy_orphaned(struct ipc_namespace *ns) {} 83 83 #endif /* CONFIG_SYSVIPC */ 84 84
+2
include/linux/kmemleak.h
··· 21 21 #ifndef __KMEMLEAK_H 22 22 #define __KMEMLEAK_H 23 23 24 + #include <linux/slab.h> 25 + 24 26 #ifdef CONFIG_DEBUG_KMEMLEAK 25 27 26 28 extern void kmemleak_init(void) __ref;
+13 -3
include/linux/memcontrol.h
··· 400 400 401 401 void memcg_update_array_size(int num_groups); 402 402 403 - struct kmem_cache * 404 - __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp); 403 + struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); 404 + void __memcg_kmem_put_cache(struct kmem_cache *cachep); 405 405 406 406 int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order); 407 407 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order); ··· 492 492 if (unlikely(fatal_signal_pending(current))) 493 493 return cachep; 494 494 495 - return __memcg_kmem_get_cache(cachep, gfp); 495 + return __memcg_kmem_get_cache(cachep); 496 + } 497 + 498 + static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) 499 + { 500 + if (memcg_kmem_enabled()) 501 + __memcg_kmem_put_cache(cachep); 496 502 } 497 503 #else 498 504 #define for_each_memcg_cache_index(_idx) \ ··· 533 527 memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) 534 528 { 535 529 return cachep; 530 + } 531 + 532 + static inline void memcg_kmem_put_cache(struct kmem_cache *cachep) 533 + { 536 534 } 537 535 #endif /* CONFIG_MEMCG_KMEM */ 538 536 #endif /* _LINUX_MEMCONTROL_H */
+37 -5
include/linux/mm.h
··· 19 19 #include <linux/bit_spinlock.h> 20 20 #include <linux/shrinker.h> 21 21 #include <linux/resource.h> 22 + #include <linux/page_ext.h> 22 23 23 24 struct mempolicy; 24 25 struct anon_vma; ··· 2061 2060 #endif /* CONFIG_PROC_FS */ 2062 2061 2063 2062 #ifdef CONFIG_DEBUG_PAGEALLOC 2064 - extern void kernel_map_pages(struct page *page, int numpages, int enable); 2063 + extern bool _debug_pagealloc_enabled; 2064 + extern void __kernel_map_pages(struct page *page, int numpages, int enable); 2065 + 2066 + static inline bool debug_pagealloc_enabled(void) 2067 + { 2068 + return _debug_pagealloc_enabled; 2069 + } 2070 + 2071 + static inline void 2072 + kernel_map_pages(struct page *page, int numpages, int enable) 2073 + { 2074 + if (!debug_pagealloc_enabled()) 2075 + return; 2076 + 2077 + __kernel_map_pages(page, numpages, enable); 2078 + } 2065 2079 #ifdef CONFIG_HIBERNATION 2066 2080 extern bool kernel_page_present(struct page *page); 2067 2081 #endif /* CONFIG_HIBERNATION */ ··· 2110 2094 void __user *, size_t *, loff_t *); 2111 2095 #endif 2112 2096 2113 - unsigned long shrink_slab(struct shrink_control *shrink, 2114 - unsigned long nr_pages_scanned, 2115 - unsigned long lru_pages); 2097 + unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, 2098 + unsigned long nr_scanned, 2099 + unsigned long nr_eligible); 2116 2100 2117 2101 #ifndef CONFIG_MMU 2118 2102 #define randomize_va_space 0 ··· 2171 2155 unsigned int pages_per_huge_page); 2172 2156 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ 2173 2157 2158 + extern struct page_ext_operations debug_guardpage_ops; 2159 + extern struct page_ext_operations page_poisoning_ops; 2160 + 2174 2161 #ifdef CONFIG_DEBUG_PAGEALLOC 2175 2162 extern unsigned int _debug_guardpage_minorder; 2163 + extern bool _debug_guardpage_enabled; 2176 2164 2177 2165 static inline unsigned int debug_guardpage_minorder(void) 2178 2166 { 2179 2167 return _debug_guardpage_minorder; 2180 2168 } 2181 2169 2170 + static inline bool 
debug_guardpage_enabled(void) 2171 + { 2172 + return _debug_guardpage_enabled; 2173 + } 2174 + 2182 2175 static inline bool page_is_guard(struct page *page) 2183 2176 { 2184 - return test_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags); 2177 + struct page_ext *page_ext; 2178 + 2179 + if (!debug_guardpage_enabled()) 2180 + return false; 2181 + 2182 + page_ext = lookup_page_ext(page); 2183 + return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); 2185 2184 } 2186 2185 #else 2187 2186 static inline unsigned int debug_guardpage_minorder(void) { return 0; } 2187 + static inline bool debug_guardpage_enabled(void) { return false; } 2188 2188 static inline bool page_is_guard(struct page *page) { return false; } 2189 2189 #endif /* CONFIG_DEBUG_PAGEALLOC */ 2190 2190
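The mm.h changes are one face of the page_ext series: per-page debug state moves out of `struct page` into a parallel table, and every check is gated on a runtime flag so a disabled feature costs only a branch, not a lookup. A minimal userspace sketch of that pattern, where a fixed-size array stands in for the boot-time page_ext allocation:

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8
enum { PAGE_EXT_DEBUG_GUARD };		/* bit 0, as in page_ext.h */

struct page { char pad; };
struct page_ext { unsigned long flags; };

static struct page pages[NPAGES];
static struct page_ext page_ext_table[NPAGES];	/* parallel to pages[] */
static bool debug_guardpage;		/* _debug_guardpage_enabled */

static struct page_ext *lookup_page_ext(struct page *page)
{
	return &page_ext_table[page - pages];
}

static bool page_is_guard(struct page *page)
{
	if (!debug_guardpage)		/* feature off: no table lookup */
		return false;
	return lookup_page_ext(page)->flags & (1UL << PAGE_EXT_DEBUG_GUARD);
}

int demo(void)
{
	page_ext_table[3].flags |= 1UL << PAGE_EXT_DEBUG_GUARD;
	if (page_is_guard(&pages[3]))
		return 1;		/* disabled: always false */
	debug_guardpage = true;
	if (!page_is_guard(&pages[3]))
		return 2;		/* enabled: flag is visible */
	if (page_is_guard(&pages[2]))
		return 3;		/* other pages unaffected */
	return 0;
}
```

`kernel_map_pages()` gets the same treatment above: the inline wrapper tests `debug_pagealloc_enabled()` before calling the real `__kernel_map_pages()`.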
+8 -4
include/linux/mm_types.h
··· 10 10 #include <linux/rwsem.h> 11 11 #include <linux/completion.h> 12 12 #include <linux/cpumask.h> 13 - #include <linux/page-debug-flags.h> 14 13 #include <linux/uprobes.h> 15 14 #include <linux/page-flags-layout.h> 16 15 #include <asm/page.h> ··· 185 186 void *virtual; /* Kernel virtual address (NULL if 186 187 not kmapped, ie. highmem) */ 187 188 #endif /* WANT_PAGE_VIRTUAL */ 188 - #ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS 189 - unsigned long debug_flags; /* Use atomic bitops on this */ 190 - #endif 191 189 192 190 #ifdef CONFIG_KMEMCHECK 193 191 /* ··· 529 533 TLB_LOCAL_MM_SHOOTDOWN, 530 534 NR_TLB_FLUSH_REASONS, 531 535 }; 536 + 537 + /* 538 + * A swap entry has to fit into a "unsigned long", as the entry is hidden 539 + * in the "index" field of the swapper address space. 540 + */ 541 + typedef struct { 542 + unsigned long val; 543 + } swp_entry_t; 532 544 533 545 #endif /* _LINUX_MM_TYPES_H */
+1 -1
include/linux/mmu_notifier.h
··· 154 154 * Therefore notifier chains can only be traversed when either 155 155 * 156 156 * 1. mmap_sem is held. 157 - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem). 157 + * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem). 158 158 * 3. No other concurrent thread can access the list (release) 159 159 */ 160 160 struct mmu_notifier {
+12
include/linux/mmzone.h
··· 722 722 int nr_zones; 723 723 #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ 724 724 struct page *node_mem_map; 725 + #ifdef CONFIG_PAGE_EXTENSION 726 + struct page_ext *node_page_ext; 727 + #endif 725 728 #endif 726 729 #ifndef CONFIG_NO_BOOTMEM 727 730 struct bootmem_data *bdata; ··· 1078 1075 #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK) 1079 1076 1080 1077 struct page; 1078 + struct page_ext; 1081 1079 struct mem_section { 1082 1080 /* 1083 1081 * This is, logically, a pointer to an array of struct ··· 1096 1092 1097 1093 /* See declaration of similar field in struct zone */ 1098 1094 unsigned long *pageblock_flags; 1095 + #ifdef CONFIG_PAGE_EXTENSION 1096 + /* 1097 + * If !SPARSEMEM, pgdat doesn't have page_ext pointer. We use 1098 + * section. (see page_ext.h about this.) 1099 + */ 1100 + struct page_ext *page_ext; 1101 + unsigned long pad; 1102 + #endif 1099 1103 /* 1100 1104 * WARNING: mem_section must be a power-of-2 in size for the 1101 1105 * calculation and use of SECTION_ROOT_MASK to make sense.
+11
include/linux/oom.h
··· 92 92 93 93 extern struct task_struct *find_lock_task_mm(struct task_struct *p); 94 94 95 + static inline bool task_will_free_mem(struct task_struct *task) 96 + { 97 + /* 98 + * A coredumping process may sleep for an extended period in exit_mm(), 99 + * so the oom killer cannot assume that the process will promptly exit 100 + * and release memory. 101 + */ 102 + return (task->flags & PF_EXITING) && 103 + !(task->signal->flags & SIGNAL_GROUP_COREDUMP); 104 + } 105 + 95 106 /* sysctls */ 96 107 extern int sysctl_oom_dump_tasks; 97 108 extern int sysctl_oom_kill_allocating_task;
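The new `task_will_free_mem()` helper encodes the heuristic from the comment: an exiting task will normally release its memory soon, but not if it is busy coredumping. Condensed userspace version below; `PF_EXITING` uses its real value, while `SIGNAL_GROUP_COREDUMP` here is an illustrative bit and the structs are stubs.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_EXITING		0x00000004
#define SIGNAL_GROUP_COREDUMP	0x00000008	/* illustrative value */

struct signal_struct { unsigned int flags; };
struct task_struct {
	unsigned int flags;
	struct signal_struct *signal;
};

static bool task_will_free_mem(struct task_struct *task)
{
	return (task->flags & PF_EXITING) &&
	       !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
}

int demo(void)
{
	struct signal_struct sig = { 0 };
	struct task_struct task = { 0, &sig };

	if (task_will_free_mem(&task))
		return 1;		/* not exiting: no free expected */
	task.flags |= PF_EXITING;
	if (!task_will_free_mem(&task))
		return 2;		/* exiting, not coredumping */
	sig.flags |= SIGNAL_GROUP_COREDUMP;
	if (task_will_free_mem(&task))
		return 3;		/* coredump may stall in exit_mm() */
	return 0;
}
```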
-32
include/linux/page-debug-flags.h
··· 1 - #ifndef LINUX_PAGE_DEBUG_FLAGS_H 2 - #define LINUX_PAGE_DEBUG_FLAGS_H 3 - 4 - /* 5 - * page->debug_flags bits: 6 - * 7 - * PAGE_DEBUG_FLAG_POISON is set for poisoned pages. This is used to 8 - * implement generic debug pagealloc feature. The pages are filled with 9 - * poison patterns and set this flag after free_pages(). The poisoned 10 - * pages are verified whether the patterns are not corrupted and clear 11 - * the flag before alloc_pages(). 12 - */ 13 - 14 - enum page_debug_flags { 15 - PAGE_DEBUG_FLAG_POISON, /* Page is poisoned */ 16 - PAGE_DEBUG_FLAG_GUARD, 17 - }; 18 - 19 - /* 20 - * Ensure that CONFIG_WANT_PAGE_DEBUG_FLAGS reliably 21 - * gets turned off when no debug features are enabling it! 22 - */ 23 - 24 - #ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS 25 - #if !defined(CONFIG_PAGE_POISONING) && \ 26 - !defined(CONFIG_PAGE_GUARD) \ 27 - /* && !defined(CONFIG_PAGE_DEBUG_SOMETHING_ELSE) && ... */ 28 - #error WANT_PAGE_DEBUG_FLAGS is turned on with no debug features! 29 - #endif 30 - #endif /* CONFIG_WANT_PAGE_DEBUG_FLAGS */ 31 - 32 - #endif /* LINUX_PAGE_DEBUG_FLAGS_H */
+84
include/linux/page_ext.h
··· 1 + #ifndef __LINUX_PAGE_EXT_H 2 + #define __LINUX_PAGE_EXT_H 3 + 4 + #include <linux/types.h> 5 + #include <linux/stacktrace.h> 6 + 7 + struct pglist_data; 8 + struct page_ext_operations { 9 + bool (*need)(void); 10 + void (*init)(void); 11 + }; 12 + 13 + #ifdef CONFIG_PAGE_EXTENSION 14 + 15 + /* 16 + * page_ext->flags bits: 17 + * 18 + * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to 19 + * implement generic debug pagealloc feature. The pages are filled with 20 + * poison patterns and set this flag after free_pages(). The poisoned 21 + * pages are verified whether the patterns are not corrupted and clear 22 + * the flag before alloc_pages(). 23 + */ 24 + 25 + enum page_ext_flags { 26 + PAGE_EXT_DEBUG_POISON, /* Page is poisoned */ 27 + PAGE_EXT_DEBUG_GUARD, 28 + PAGE_EXT_OWNER, 29 + }; 30 + 31 + /* 32 + * Page Extension can be considered as an extended mem_map. 33 + * A page_ext page is associated with every page descriptor. The 34 + * page_ext helps us add more information about the page. 35 + * All page_ext are allocated at boot or memory hotplug event, 36 + * then the page_ext for pfn always exists. 
37 + */ 38 + struct page_ext { 39 + unsigned long flags; 40 + #ifdef CONFIG_PAGE_OWNER 41 + unsigned int order; 42 + gfp_t gfp_mask; 43 + struct stack_trace trace; 44 + unsigned long trace_entries[8]; 45 + #endif 46 + }; 47 + 48 + extern void pgdat_page_ext_init(struct pglist_data *pgdat); 49 + 50 + #ifdef CONFIG_SPARSEMEM 51 + static inline void page_ext_init_flatmem(void) 52 + { 53 + } 54 + extern void page_ext_init(void); 55 + #else 56 + extern void page_ext_init_flatmem(void); 57 + static inline void page_ext_init(void) 58 + { 59 + } 60 + #endif 61 + 62 + struct page_ext *lookup_page_ext(struct page *page); 63 + 64 + #else /* !CONFIG_PAGE_EXTENSION */ 65 + struct page_ext; 66 + 67 + static inline void pgdat_page_ext_init(struct pglist_data *pgdat) 68 + { 69 + } 70 + 71 + static inline struct page_ext *lookup_page_ext(struct page *page) 72 + { 73 + return NULL; 74 + } 75 + 76 + static inline void page_ext_init(void) 77 + { 78 + } 79 + 80 + static inline void page_ext_init_flatmem(void) 81 + { 82 + } 83 + #endif /* CONFIG_PAGE_EXTENSION */ 84 + #endif /* __LINUX_PAGE_EXT_H */
+38
include/linux/page_owner.h
··· 1 + #ifndef __LINUX_PAGE_OWNER_H 2 + #define __LINUX_PAGE_OWNER_H 3 + 4 + #ifdef CONFIG_PAGE_OWNER 5 + extern bool page_owner_inited; 6 + extern struct page_ext_operations page_owner_ops; 7 + 8 + extern void __reset_page_owner(struct page *page, unsigned int order); 9 + extern void __set_page_owner(struct page *page, 10 + unsigned int order, gfp_t gfp_mask); 11 + 12 + static inline void reset_page_owner(struct page *page, unsigned int order) 13 + { 14 + if (likely(!page_owner_inited)) 15 + return; 16 + 17 + __reset_page_owner(page, order); 18 + } 19 + 20 + static inline void set_page_owner(struct page *page, 21 + unsigned int order, gfp_t gfp_mask) 22 + { 23 + if (likely(!page_owner_inited)) 24 + return; 25 + 26 + __set_page_owner(page, order, gfp_mask); 27 + } 28 + #else 29 + static inline void reset_page_owner(struct page *page, unsigned int order) 30 + { 31 + } 32 + static inline void set_page_owner(struct page *page, 33 + unsigned int order, gfp_t gfp_mask) 34 + { 35 + } 36 + 37 + #endif /* CONFIG_PAGE_OWNER */ 38 + #endif /* __LINUX_PAGE_OWNER_H */
-2
include/linux/percpu-defs.h
··· 254 254 #endif /* CONFIG_SMP */ 255 255 256 256 #define per_cpu(var, cpu) (*per_cpu_ptr(&(var), cpu)) 257 - #define __raw_get_cpu_var(var) (*raw_cpu_ptr(&(var))) 258 - #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) 259 257 260 258 /* 261 259 * Must be an lvalue. Since @var must be a simple identifier,
+9 -3
include/linux/ratelimit.h
··· 17 17 unsigned long begin; 18 18 }; 19 19 20 - #define DEFINE_RATELIMIT_STATE(name, interval_init, burst_init) \ 21 - \ 22 - struct ratelimit_state name = { \ 20 + #define RATELIMIT_STATE_INIT(name, interval_init, burst_init) { \ 23 21 .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ 24 22 .interval = interval_init, \ 25 23 .burst = burst_init, \ 26 24 } 25 + 26 + #define RATELIMIT_STATE_INIT_DISABLED \ 27 + RATELIMIT_STATE_INIT(ratelimit_state, 0, DEFAULT_RATELIMIT_BURST) 28 + 29 + #define DEFINE_RATELIMIT_STATE(name, interval_init, burst_init) \ 30 + \ 31 + struct ratelimit_state name = \ 32 + RATELIMIT_STATE_INIT(name, interval_init, burst_init) \ 27 33 28 34 static inline void ratelimit_state_init(struct ratelimit_state *rs, 29 35 int interval, int burst)
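The reason for splitting out `RATELIMIT_STATE_INIT` is that a brace initializer, unlike `DEFINE_RATELIMIT_STATE` (which declares a whole variable), can be embedded inside another struct's initializer; that is exactly what the new `FAULT_ATTR_INITIALIZER` in fault-inject.h needs. A toy `ratelimit_state` without the spinlock shows the composition; the `name` macro argument is kept for shape but unused here.

```c
#include <assert.h>

#define DEFAULT_RATELIMIT_BURST 10

struct ratelimit_state {
	int interval;
	int burst;
};

#define RATELIMIT_STATE_INIT(name, interval_init, burst_init) {	\
		.interval	= (interval_init),		\
		.burst		= (burst_init),			\
	}

#define RATELIMIT_STATE_INIT_DISABLED \
	RATELIMIT_STATE_INIT(ratelimit_state, 0, DEFAULT_RATELIMIT_BURST)

struct fault_attr {
	unsigned long count;
	struct ratelimit_state ratelimit_state;
};

/* nested use, as in FAULT_ATTR_INITIALIZER */
static struct fault_attr attr = {
	.ratelimit_state = RATELIMIT_STATE_INIT_DISABLED,
};

int demo(void)
{
	struct ratelimit_state named = RATELIMIT_STATE_INIT(named, 5, 3);

	if (attr.ratelimit_state.interval != 0)
		return 1;	/* interval 0 == rate limiting disabled */
	if (attr.ratelimit_state.burst != DEFAULT_RATELIMIT_BURST)
		return 2;
	if (named.interval != 5 || named.burst != 3)
		return 3;	/* standalone use still works */
	return 0;
}
```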
+9 -2
include/linux/sched.h
··· 1364 1364 unsigned sched_reset_on_fork:1; 1365 1365 unsigned sched_contributes_to_load:1; 1366 1366 1367 + #ifdef CONFIG_MEMCG_KMEM 1368 + unsigned memcg_kmem_skip_account:1; 1369 + #endif 1370 + 1367 1371 unsigned long atomic_flags; /* Flags needing atomic access. */ 1368 1372 1369 1373 pid_t pid; ··· 1683 1679 /* bitmask and counter of trace recursion */ 1684 1680 unsigned long trace_recursion; 1685 1681 #endif /* CONFIG_TRACING */ 1686 - #ifdef CONFIG_MEMCG /* memcg uses this to do batch job */ 1687 - unsigned int memcg_kmem_skip_account; 1682 + #ifdef CONFIG_MEMCG 1688 1683 struct memcg_oom_info { 1689 1684 struct mem_cgroup *memcg; 1690 1685 gfp_t gfp_mask; ··· 2485 2482 extern int do_execve(struct filename *, 2486 2483 const char __user * const __user *, 2487 2484 const char __user * const __user *); 2485 + extern int do_execveat(int, struct filename *, 2486 + const char __user * const __user *, 2487 + const char __user * const __user *, 2488 + int); 2488 2489 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *); 2489 2490 struct task_struct *fork_idle(int); 2490 2491 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
-2
include/linux/shrinker.h
··· 18 18 */ 19 19 unsigned long nr_to_scan; 20 20 21 - /* shrink from these nodes */ 22 - nodemask_t nodes_to_scan; 23 21 /* current node being shrunk (for NUMA aware shrinkers) */ 24 22 int nid; 25 23 };
-2
include/linux/slab.h
··· 493 493 * @memcg: pointer to the memcg this cache belongs to 494 494 * @list: list_head for the list of all caches in this memcg 495 495 * @root_cache: pointer to the global, root cache, this cache was derived from 496 - * @nr_pages: number of pages that belongs to this cache. 497 496 */ 498 497 struct memcg_cache_params { 499 498 bool is_root_cache; ··· 505 506 struct mem_cgroup *memcg; 506 507 struct list_head list; 507 508 struct kmem_cache *root_cache; 508 - atomic_t nr_pages; 509 509 }; 510 510 }; 511 511 };
+5
include/linux/stacktrace.h
··· 1 1 #ifndef __LINUX_STACKTRACE_H 2 2 #define __LINUX_STACKTRACE_H 3 3 4 + #include <linux/types.h> 5 + 4 6 struct task_struct; 5 7 struct pt_regs; 6 8 ··· 22 20 struct stack_trace *trace); 23 21 24 22 extern void print_stack_trace(struct stack_trace *trace, int spaces); 23 + extern int snprint_stack_trace(char *buf, size_t size, 24 + struct stack_trace *trace, int spaces); 25 25 26 26 #ifdef CONFIG_USER_STACKTRACE_SUPPORT 27 27 extern void save_stack_trace_user(struct stack_trace *trace); ··· 36 32 # define save_stack_trace_tsk(tsk, trace) do { } while (0) 37 33 # define save_stack_trace_user(trace) do { } while (0) 38 34 # define print_stack_trace(trace, spaces) do { } while (0) 35 + # define snprint_stack_trace(buf, size, trace, spaces) do { } while (0) 39 36 #endif 40 37 41 38 #endif
-8
include/linux/swap.h
··· 102 102 } info; 103 103 }; 104 104 105 - /* A swap entry has to fit into a "unsigned long", as 106 - * the entry is hidden in the "index" field of the 107 - * swapper address space. 108 - */ 109 - typedef struct { 110 - unsigned long val; 111 - } swp_entry_t; 112 - 113 105 /* 114 106 * current->reclaim_state points to one of these when a task is running 115 107 * memory reclaim
+5
include/linux/syscalls.h
··· 877 877 asmlinkage long sys_getrandom(char __user *buf, size_t count, 878 878 unsigned int flags); 879 879 asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size); 880 + 881 + asmlinkage long sys_execveat(int dfd, const char __user *filename, 882 + const char __user *const __user *argv, 883 + const char __user *const __user *envp, int flags); 884 + 880 885 #endif
+1
include/linux/vm_event_item.h
··· 90 90 #ifdef CONFIG_DEBUG_VM_VMACACHE 91 91 VMACACHE_FIND_CALLS, 92 92 VMACACHE_FIND_HITS, 93 + VMACACHE_FULL_FLUSHES, 93 94 #endif 94 95 NR_VM_EVENT_ITEMS 95 96 };
+3 -1
include/uapi/asm-generic/unistd.h
··· 707 707 __SYSCALL(__NR_memfd_create, sys_memfd_create) 708 708 #define __NR_bpf 280 709 709 __SYSCALL(__NR_bpf, sys_bpf) 710 + #define __NR_execveat 281 711 + __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat) 710 712 711 713 #undef __NR_syscalls 712 - #define __NR_syscalls 281 714 + #define __NR_syscalls 282 713 715 714 716 /* 715 717 * All syscalls below here should go away really,
+20 -8
include/uapi/linux/msg.h
··· 51 51 };
52 52
53 53 /*
 54 - * Scaling factor to compute msgmni:
 55 - * the memory dedicated to msg queues (msgmni * msgmnb) should occupy
 56 - * at most 1/MSG_MEM_SCALE of the lowmem (see the formula in ipc/msg.c):
 57 - * up to 8MB : msgmni = 16 (MSGMNI)
 58 - * 4 GB : msgmni = 8K
 59 - * more than 16 GB : msgmni = 32K (IPCMNI)
 54 + * MSGMNI, MSGMAX and MSGMNB are default values which can be
 55 + * modified by sysctl.
 56 + *
 57 + * MSGMNI is the upper limit for the number of message queues per
 58 + * namespace.
 59 + * It has been chosen to be as large as possible without facilitating
 60 + * scenarios where userspace causes overflows when adjusting the limits via
 61 + * operations of the form "retrieve current limit; add X; update limit".
 62 + *
 63 + * MSGMNB is the default size of a new message queue. Non-root tasks can
 64 + * decrease the size with msgctl(IPC_SET), root tasks
 65 + * (actually: CAP_SYS_RESOURCE) can both increase and decrease the queue
 66 + * size. The optimal value is application dependent.
 67 + * 16384 is used because it was always used (since 0.99.10).
 68 + *
 69 + * MSGMAX is the maximum size of an individual message; it's a global
 70 + * (per-namespace) limit that applies for all message queues.
 71 + * It's set to 1/2 of MSGMNB, to ensure that at least two messages fit into
 72 + * the queue. This is also an arbitrary choice (since 2.6.0).
60 73 */
 61 - #define MSG_MEM_SCALE 32
62 74
 63 - #define MSGMNI 16 /* <= IPCMNI */ /* max # of msg queue identifiers */
 75 + #define MSGMNI 32000 /* <= IPCMNI */ /* max # of msg queue identifiers */
64 76 #define MSGMAX 8192 /* <= INT_MAX */ /* max size of message (bytes) */
65 77 #define MSGMNB 16384 /* <= INT_MAX */ /* default max size of a message queue */
66 78
+15 -3
include/uapi/linux/sem.h
··· 63 63 int semaem;
64 64 };
65 65
 66 - #define SEMMNI 128 /* <= IPCMNI max # of semaphore identifiers */
 67 - #define SEMMSL 250 /* <= 8 000 max num of semaphores per id */
 66 + /*
 67 + * SEMMNI, SEMMSL and SEMMNS are default values which can be
 68 + * modified by sysctl.
 69 + * The values have been chosen to be larger than necessary for any
 70 + * known configuration.
 71 + *
 72 + * SEMOPM should not be increased beyond 1000, otherwise there is the
 73 + * risk that semop()/semtimedop() fails due to kernel memory fragmentation when
 74 + * allocating the sop array.
 75 + */
 76 +
 77 +
 78 + #define SEMMNI 32000 /* <= IPCMNI max # of semaphore identifiers */
 79 + #define SEMMSL 32000 /* <= INT_MAX max num of semaphores per id */
68 80 #define SEMMNS (SEMMNI*SEMMSL) /* <= INT_MAX max # of semaphores in system */
 69 - #define SEMOPM 32 /* <= 1 000 max num of ops per semop call */
 81 + #define SEMOPM 500 /* <= 1 000 max num of ops per semop call */
70 82 #define SEMVMX 32767 /* <= 32767 semaphore maximum value */
71 83 #define SEMAEM SEMVMX /* adjust on exit max value */
72 84
+7
init/main.c
··· 51 51 #include <linux/mempolicy.h> 52 52 #include <linux/key.h> 53 53 #include <linux/buffer_head.h> 54 + #include <linux/page_ext.h> 54 55 #include <linux/debug_locks.h> 55 56 #include <linux/debugobjects.h> 56 57 #include <linux/lockdep.h> ··· 485 484 */ 486 485 static void __init mm_init(void) 487 486 { 487 + /* 488 + * page_ext requires contiguous pages, 489 + * bigger than MAX_ORDER unless SPARSEMEM. 490 + */ 491 + page_ext_init_flatmem(); 488 492 mem_init(); 489 493 kmem_cache_init(); 490 494 percpu_init_late(); ··· 627 621 initrd_start = 0; 628 622 } 629 623 #endif 624 + page_ext_init(); 630 625 debug_objects_mem_init(); 631 626 kmemleak_init(); 632 627 setup_per_cpu_pageset();
+1 -1
ipc/Makefile
··· 3 3 # 4 4 5 5 obj-$(CONFIG_SYSVIPC_COMPAT) += compat.o 6 - obj-$(CONFIG_SYSVIPC) += util.o msgutil.o msg.o sem.o shm.o ipcns_notifier.o syscall.o 6 + obj-$(CONFIG_SYSVIPC) += util.o msgutil.o msg.o sem.o shm.o syscall.o 7 7 obj-$(CONFIG_SYSVIPC_SYSCTL) += ipc_sysctl.o 8 8 obj_mq-$(CONFIG_COMPAT) += compat_mq.o 9 9 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
+17 -76
ipc/ipc_sysctl.c
··· 62 62 return err; 63 63 } 64 64 65 - static int proc_ipc_callback_dointvec_minmax(struct ctl_table *table, int write, 66 - void __user *buffer, size_t *lenp, loff_t *ppos) 67 - { 68 - struct ctl_table ipc_table; 69 - size_t lenp_bef = *lenp; 70 - int rc; 71 - 72 - memcpy(&ipc_table, table, sizeof(ipc_table)); 73 - ipc_table.data = get_ipc(table); 74 - 75 - rc = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos); 76 - 77 - if (write && !rc && lenp_bef == *lenp) 78 - /* 79 - * Tunable has successfully been changed by hand. Disable its 80 - * automatic adjustment. This simply requires unregistering 81 - * the notifiers that trigger recalculation. 82 - */ 83 - unregister_ipcns_notifier(current->nsproxy->ipc_ns); 84 - 85 - return rc; 86 - } 87 - 88 65 static int proc_ipc_doulongvec_minmax(struct ctl_table *table, int write, 89 66 void __user *buffer, size_t *lenp, loff_t *ppos) 90 67 { ··· 73 96 lenp, ppos); 74 97 } 75 98 76 - /* 77 - * Routine that is called when the file "auto_msgmni" has successfully been 78 - * written. 79 - * Two values are allowed: 80 - * 0: unregister msgmni's callback routine from the ipc namespace notifier 81 - * chain. This means that msgmni won't be recomputed anymore upon memory 82 - * add/remove or ipc namespace creation/removal. 83 - * 1: register back the callback routine. 84 - */ 85 - static void ipc_auto_callback(int val) 86 - { 87 - if (!val) 88 - unregister_ipcns_notifier(current->nsproxy->ipc_ns); 89 - else { 90 - /* 91 - * Re-enable automatic recomputing only if not already 92 - * enabled. 
93 - */ 94 - recompute_msgmni(current->nsproxy->ipc_ns); 95 - cond_register_ipcns_notifier(current->nsproxy->ipc_ns); 96 - } 97 - } 98 - 99 - static int proc_ipcauto_dointvec_minmax(struct ctl_table *table, int write, 99 + static int proc_ipc_auto_msgmni(struct ctl_table *table, int write, 100 100 void __user *buffer, size_t *lenp, loff_t *ppos) 101 101 { 102 102 struct ctl_table ipc_table; 103 - int oldval; 104 - int rc; 103 + int dummy = 0; 105 104 106 105 memcpy(&ipc_table, table, sizeof(ipc_table)); 107 - ipc_table.data = get_ipc(table); 108 - oldval = *((int *)(ipc_table.data)); 106 + ipc_table.data = &dummy; 109 107 110 - rc = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos); 108 + if (write) 109 + pr_info_once("writing to auto_msgmni has no effect"); 111 110 112 - if (write && !rc) { 113 - int newval = *((int *)(ipc_table.data)); 114 - /* 115 - * The file "auto_msgmni" has correctly been set. 116 - * React by (un)registering the corresponding tunable, if the 117 - * value has changed. 
118 - */ 119 - if (newval != oldval) 120 - ipc_auto_callback(newval); 121 - } 122 - 123 - return rc; 111 + return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos); 124 112 } 125 113 126 114 #else ··· 93 151 #define proc_ipc_dointvec NULL 94 152 #define proc_ipc_dointvec_minmax NULL 95 153 #define proc_ipc_dointvec_minmax_orphans NULL 96 - #define proc_ipc_callback_dointvec_minmax NULL 97 - #define proc_ipcauto_dointvec_minmax NULL 154 + #define proc_ipc_auto_msgmni NULL 98 155 #endif 99 156 100 157 static int zero; ··· 145 204 .data = &init_ipc_ns.msg_ctlmni, 146 205 .maxlen = sizeof(init_ipc_ns.msg_ctlmni), 147 206 .mode = 0644, 148 - .proc_handler = proc_ipc_callback_dointvec_minmax, 207 + .proc_handler = proc_ipc_dointvec_minmax, 149 208 .extra1 = &zero, 150 209 .extra2 = &int_max, 210 + }, 211 + { 212 + .procname = "auto_msgmni", 213 + .data = NULL, 214 + .maxlen = sizeof(int), 215 + .mode = 0644, 216 + .proc_handler = proc_ipc_auto_msgmni, 217 + .extra1 = &zero, 218 + .extra2 = &one, 151 219 }, 152 220 { 153 221 .procname = "msgmnb", ··· 173 223 .maxlen = 4*sizeof(int), 174 224 .mode = 0644, 175 225 .proc_handler = proc_ipc_dointvec, 176 - }, 177 - { 178 - .procname = "auto_msgmni", 179 - .data = &init_ipc_ns.auto_msgmni, 180 - .maxlen = sizeof(int), 181 - .mode = 0644, 182 - .proc_handler = proc_ipcauto_dointvec_minmax, 183 - .extra1 = &zero, 184 - .extra2 = &one, 185 226 }, 186 227 #ifdef CONFIG_CHECKPOINT_RESTORE 187 228 {
-92
ipc/ipcns_notifier.c
··· 1 - /* 2 - * linux/ipc/ipcns_notifier.c 3 - * Copyright (C) 2007 BULL SA. Nadia Derbey 4 - * 5 - * Notification mechanism for ipc namespaces: 6 - * The callback routine registered in the memory chain invokes the ipcns 7 - * notifier chain with the IPCNS_MEMCHANGED event. 8 - * Each callback routine registered in the ipcns namespace recomputes msgmni 9 - * for the owning namespace. 10 - */ 11 - 12 - #include <linux/msg.h> 13 - #include <linux/rcupdate.h> 14 - #include <linux/notifier.h> 15 - #include <linux/nsproxy.h> 16 - #include <linux/ipc_namespace.h> 17 - 18 - #include "util.h" 19 - 20 - 21 - 22 - static BLOCKING_NOTIFIER_HEAD(ipcns_chain); 23 - 24 - 25 - static int ipcns_callback(struct notifier_block *self, 26 - unsigned long action, void *arg) 27 - { 28 - struct ipc_namespace *ns; 29 - 30 - switch (action) { 31 - case IPCNS_MEMCHANGED: /* amount of lowmem has changed */ 32 - case IPCNS_CREATED: 33 - case IPCNS_REMOVED: 34 - /* 35 - * It's time to recompute msgmni 36 - */ 37 - ns = container_of(self, struct ipc_namespace, ipcns_nb); 38 - /* 39 - * No need to get a reference on the ns: the 1st job of 40 - * free_ipc_ns() is to unregister the callback routine. 41 - * blocking_notifier_chain_unregister takes the wr lock to do 42 - * it. 43 - * When this callback routine is called the rd lock is held by 44 - * blocking_notifier_call_chain. 45 - * So the ipc ns cannot be freed while we are here. 
46 - */ 47 - recompute_msgmni(ns); 48 - break; 49 - default: 50 - break; 51 - } 52 - 53 - return NOTIFY_OK; 54 - } 55 - 56 - int register_ipcns_notifier(struct ipc_namespace *ns) 57 - { 58 - int rc; 59 - 60 - memset(&ns->ipcns_nb, 0, sizeof(ns->ipcns_nb)); 61 - ns->ipcns_nb.notifier_call = ipcns_callback; 62 - ns->ipcns_nb.priority = IPCNS_CALLBACK_PRI; 63 - rc = blocking_notifier_chain_register(&ipcns_chain, &ns->ipcns_nb); 64 - if (!rc) 65 - ns->auto_msgmni = 1; 66 - return rc; 67 - } 68 - 69 - int cond_register_ipcns_notifier(struct ipc_namespace *ns) 70 - { 71 - int rc; 72 - 73 - memset(&ns->ipcns_nb, 0, sizeof(ns->ipcns_nb)); 74 - ns->ipcns_nb.notifier_call = ipcns_callback; 75 - ns->ipcns_nb.priority = IPCNS_CALLBACK_PRI; 76 - rc = blocking_notifier_chain_cond_register(&ipcns_chain, 77 - &ns->ipcns_nb); 78 - if (!rc) 79 - ns->auto_msgmni = 1; 80 - return rc; 81 - } 82 - 83 - void unregister_ipcns_notifier(struct ipc_namespace *ns) 84 - { 85 - blocking_notifier_chain_unregister(&ipcns_chain, &ns->ipcns_nb); 86 - ns->auto_msgmni = 0; 87 - } 88 - 89 - int ipcns_notify(unsigned long val) 90 - { 91 - return blocking_notifier_call_chain(&ipcns_chain, val, NULL); 92 - }
+1 -35
ipc/msg.c
··· 989 989 return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill); 990 990 } 991 991 992 - /* 993 - * Scale msgmni with the available lowmem size: the memory dedicated to msg 994 - * queues should occupy at most 1/MSG_MEM_SCALE of lowmem. 995 - * Also take into account the number of nsproxies created so far. 996 - * This should be done staying within the (MSGMNI , IPCMNI/nr_ipc_ns) range. 997 - */ 998 - void recompute_msgmni(struct ipc_namespace *ns) 999 - { 1000 - struct sysinfo i; 1001 - unsigned long allowed; 1002 - int nb_ns; 1003 - 1004 - si_meminfo(&i); 1005 - allowed = (((i.totalram - i.totalhigh) / MSG_MEM_SCALE) * i.mem_unit) 1006 - / MSGMNB; 1007 - nb_ns = atomic_read(&nr_ipc_ns); 1008 - allowed /= nb_ns; 1009 - 1010 - if (allowed < MSGMNI) { 1011 - ns->msg_ctlmni = MSGMNI; 1012 - return; 1013 - } 1014 - 1015 - if (allowed > IPCMNI / nb_ns) { 1016 - ns->msg_ctlmni = IPCMNI / nb_ns; 1017 - return; 1018 - } 1019 - 1020 - ns->msg_ctlmni = allowed; 1021 - } 1022 992 1023 993 void msg_init_ns(struct ipc_namespace *ns) 1024 994 { 1025 995 ns->msg_ctlmax = MSGMAX; 1026 996 ns->msg_ctlmnb = MSGMNB; 1027 - 1028 - recompute_msgmni(ns); 997 + ns->msg_ctlmni = MSGMNI; 1029 998 1030 999 atomic_set(&ns->msg_bytes, 0); 1031 1000 atomic_set(&ns->msg_hdrs, 0); ··· 1037 1068 void __init msg_init(void) 1038 1069 { 1039 1070 msg_init_ns(&init_ipc_ns); 1040 - 1041 - printk(KERN_INFO "msgmni has been set to %d\n", 1042 - init_ipc_ns.msg_ctlmni); 1043 1071 1044 1072 ipc_init_proc_interface("sysvipc/msg", 1045 1073 " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
-22
ipc/namespace.c
··· 45 45 msg_init_ns(ns); 46 46 shm_init_ns(ns); 47 47 48 - /* 49 - * msgmni has already been computed for the new ipc ns. 50 - * Thus, do the ipcns creation notification before registering that 51 - * new ipcns in the chain. 52 - */ 53 - ipcns_notify(IPCNS_CREATED); 54 - register_ipcns_notifier(ns); 55 - 56 48 ns->user_ns = get_user_ns(user_ns); 57 49 58 50 return ns; ··· 91 99 92 100 static void free_ipc_ns(struct ipc_namespace *ns) 93 101 { 94 - /* 95 - * Unregistering the hotplug notifier at the beginning guarantees 96 - * that the ipc namespace won't be freed while we are inside the 97 - * callback routine. Since the blocking_notifier_chain_XXX routines 98 - * hold a rw lock on the notifier list, unregister_ipcns_notifier() 99 - * won't take the rw lock before blocking_notifier_call_chain() has 100 - * released the rd lock. 101 - */ 102 - unregister_ipcns_notifier(ns); 103 102 sem_exit_ns(ns); 104 103 msg_exit_ns(ns); 105 104 shm_exit_ns(ns); 106 105 atomic_dec(&nr_ipc_ns); 107 106 108 - /* 109 - * Do the ipcns removal notification after decrementing nr_ipc_ns in 110 - * order to have a correct value when recomputing msgmni. 111 - */ 112 - ipcns_notify(IPCNS_REMOVED); 113 107 put_user_ns(ns->user_ns); 114 108 proc_free_inum(ns->proc_inum); 115 109 kfree(ns);
+10 -3
ipc/sem.c
··· 326 326 327 327 /* Then check that the global lock is free */ 328 328 if (!spin_is_locked(&sma->sem_perm.lock)) { 329 - /* spin_is_locked() is not a memory barrier */ 330 - smp_mb(); 329 + /* 330 + * The ipc object lock check must be visible on all 331 + * cores before rechecking the complex count. Otherwise 332 + * we can race with another thread that does: 333 + * complex_count++; 334 + * spin_unlock(sem_perm.lock); 335 + */ 336 + smp_rmb(); 331 337 332 - /* Now repeat the test of complex_count: 338 + /* 339 + * Now repeat the test of complex_count: 333 340 * It can't change anymore until we drop sem->lock. 334 341 * Thus: if is now 0, then it will stay 0. 335 342 */
+15 -6
ipc/shm.c
··· 219 219 if (!is_file_hugepages(shm_file))
220 220 shmem_lock(shm_file, 0, shp->mlock_user);
221 221 else if (shp->mlock_user)
 222 - user_shm_unlock(file_inode(shm_file)->i_size, shp->mlock_user);
 222 + user_shm_unlock(i_size_read(file_inode(shm_file)),
 223 + shp->mlock_user);
223 224 fput(shm_file);
224 225 ipc_rcu_putref(shp, shm_rcu_free);
225 226 }
··· 1230 1229 int retval = -EINVAL;
1231 1230 #ifdef CONFIG_MMU
1232 1231 loff_t size = 0;
 1232 + struct file *file;
1233 1233 struct vm_area_struct *next;
1234 1234 #endif
1235 1235
··· 1247 1245 * started at address shmaddr. It records it's size and then unmaps
1248 1246 * it.
1249 1247 * - Then it unmaps all shm vmas that started at shmaddr and that
 1250 - * are within the initially determined size.
 1248 + * are within the initially determined size and that are from the
 1249 + * same shm segment from which we determined the size.
1251 1250 * Errors from do_munmap are ignored: the function only fails if
1252 1251 * it's called with invalid parameters or if it's called to unmap
1253 1252 * a part of a vma. Both calls in this function are for full vmas,
··· 1274 1271 if ((vma->vm_ops == &shm_vm_ops) &&
1275 1272 (vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) {
1276 1273
 1277 -
 1278 - size = file_inode(vma->vm_file)->i_size;
 1274 + /*
 1275 + * Record the file of the shm segment being
 1276 + * unmapped. With mremap(), someone could place
 1277 + * a page from another segment but with equal offsets
 1278 + * in the range we are unmapping.
1279 + */ 1280 + file = vma->vm_file; 1281 + size = i_size_read(file_inode(vma->vm_file)); 1279 1282 do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start); 1280 1283 /* 1281 1284 * We discovered the size of the shm segment, so ··· 1307 1298 1308 1299 /* finding a matching vma now does not alter retval */ 1309 1300 if ((vma->vm_ops == &shm_vm_ops) && 1310 - (vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) 1311 - 1301 + ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) && 1302 + (vma->vm_file == file)) 1312 1303 do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start); 1313 1304 vma = next; 1314 1305 }
-40
ipc/util.c
··· 71 71 int (*show)(struct seq_file *, void *); 72 72 }; 73 73 74 - static void ipc_memory_notifier(struct work_struct *work) 75 - { 76 - ipcns_notify(IPCNS_MEMCHANGED); 77 - } 78 - 79 - static int ipc_memory_callback(struct notifier_block *self, 80 - unsigned long action, void *arg) 81 - { 82 - static DECLARE_WORK(ipc_memory_wq, ipc_memory_notifier); 83 - 84 - switch (action) { 85 - case MEM_ONLINE: /* memory successfully brought online */ 86 - case MEM_OFFLINE: /* or offline: it's time to recompute msgmni */ 87 - /* 88 - * This is done by invoking the ipcns notifier chain with the 89 - * IPC_MEMCHANGED event. 90 - * In order not to keep the lock on the hotplug memory chain 91 - * for too long, queue a work item that will, when waken up, 92 - * activate the ipcns notification chain. 93 - */ 94 - schedule_work(&ipc_memory_wq); 95 - break; 96 - case MEM_GOING_ONLINE: 97 - case MEM_GOING_OFFLINE: 98 - case MEM_CANCEL_ONLINE: 99 - case MEM_CANCEL_OFFLINE: 100 - default: 101 - break; 102 - } 103 - 104 - return NOTIFY_OK; 105 - } 106 - 107 - static struct notifier_block ipc_memory_nb = { 108 - .notifier_call = ipc_memory_callback, 109 - .priority = IPC_CALLBACK_PRI, 110 - }; 111 - 112 74 /** 113 75 * ipc_init - initialise ipc subsystem 114 76 * ··· 86 124 sem_init(); 87 125 msg_init(); 88 126 shm_init(); 89 - register_hotmemory_notifier(&ipc_memory_nb); 90 - register_ipcns_notifier(&init_ipc_ns); 91 127 return 0; 92 128 } 93 129 device_initcall(ipc_init);
+8 -8
kernel/audit_tree.c
··· 174 174 struct fsnotify_mark *entry = &chunk->mark; 175 175 struct list_head *list; 176 176 177 - if (!entry->i.inode) 177 + if (!entry->inode) 178 178 return; 179 - list = chunk_hash(entry->i.inode); 179 + list = chunk_hash(entry->inode); 180 180 list_add_rcu(&chunk->hash, list); 181 181 } 182 182 ··· 188 188 189 189 list_for_each_entry_rcu(p, list, hash) { 190 190 /* mark.inode may have gone NULL, but who cares? */ 191 - if (p->mark.i.inode == inode) { 191 + if (p->mark.inode == inode) { 192 192 atomic_long_inc(&p->refs); 193 193 return p; 194 194 } ··· 231 231 new = alloc_chunk(size); 232 232 233 233 spin_lock(&entry->lock); 234 - if (chunk->dead || !entry->i.inode) { 234 + if (chunk->dead || !entry->inode) { 235 235 spin_unlock(&entry->lock); 236 236 if (new) 237 237 free_chunk(new); ··· 258 258 goto Fallback; 259 259 260 260 fsnotify_duplicate_mark(&new->mark, entry); 261 - if (fsnotify_add_mark(&new->mark, new->mark.group, new->mark.i.inode, NULL, 1)) { 261 + if (fsnotify_add_mark(&new->mark, new->mark.group, new->mark.inode, NULL, 1)) { 262 262 fsnotify_put_mark(&new->mark); 263 263 goto Fallback; 264 264 } ··· 386 386 chunk_entry = &chunk->mark; 387 387 388 388 spin_lock(&old_entry->lock); 389 - if (!old_entry->i.inode) { 389 + if (!old_entry->inode) { 390 390 /* old_entry is being shot, lets just lie */ 391 391 spin_unlock(&old_entry->lock); 392 392 fsnotify_put_mark(old_entry); ··· 395 395 } 396 396 397 397 fsnotify_duplicate_mark(chunk_entry, old_entry); 398 - if (fsnotify_add_mark(chunk_entry, chunk_entry->group, chunk_entry->i.inode, NULL, 1)) { 398 + if (fsnotify_add_mark(chunk_entry, chunk_entry->group, chunk_entry->inode, NULL, 1)) { 399 399 spin_unlock(&old_entry->lock); 400 400 fsnotify_put_mark(chunk_entry); 401 401 fsnotify_put_mark(old_entry); ··· 611 611 list_for_each_entry(node, &tree->chunks, list) { 612 612 struct audit_chunk *chunk = find_chunk(node); 613 613 /* this could be NULL if the watch is dying else where... 
*/ 614 - struct inode *inode = chunk->mark.i.inode; 614 + struct inode *inode = chunk->mark.inode; 615 615 node->index |= 1U<<31; 616 616 if (iterate_mounts(compare_root, inode, root_mnt)) 617 617 node->index &= ~(1U<<31);
+3 -3
kernel/events/uprobes.c
··· 724 724 int more = 0; 725 725 726 726 again: 727 - mutex_lock(&mapping->i_mmap_mutex); 727 + i_mmap_lock_read(mapping); 728 728 vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 729 729 if (!valid_vma(vma, is_register)) 730 730 continue; 731 731 732 732 if (!prev && !more) { 733 733 /* 734 - * Needs GFP_NOWAIT to avoid i_mmap_mutex recursion through 734 + * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through 735 735 * reclaim. This is optimistic, no harm done if it fails. 736 736 */ 737 737 prev = kmalloc(sizeof(struct map_info), ··· 755 755 info->mm = vma->vm_mm; 756 756 info->vaddr = offset_to_vaddr(vma, offset); 757 757 } 758 - mutex_unlock(&mapping->i_mmap_mutex); 758 + i_mmap_unlock_read(mapping); 759 759 760 760 if (!more) 761 761 goto out;
+2 -2
kernel/fork.c
··· 433 433 get_file(file); 434 434 if (tmp->vm_flags & VM_DENYWRITE) 435 435 atomic_dec(&inode->i_writecount); 436 - mutex_lock(&mapping->i_mmap_mutex); 436 + i_mmap_lock_write(mapping); 437 437 if (tmp->vm_flags & VM_SHARED) 438 438 atomic_inc(&mapping->i_mmap_writable); 439 439 flush_dcache_mmap_lock(mapping); ··· 445 445 vma_interval_tree_insert_after(tmp, mpnt, 446 446 &mapping->i_mmap); 447 447 flush_dcache_mmap_unlock(mapping); 448 - mutex_unlock(&mapping->i_mmap_mutex); 448 + i_mmap_unlock_write(mapping); 449 449 } 450 450 451 451 /*
+4 -1
kernel/gcov/Kconfig
··· 32 32 Note that the debugfs filesystem has to be mounted to access 33 33 profiling data. 34 34 35 + config ARCH_HAS_GCOV_PROFILE_ALL 36 + def_bool n 37 + 35 38 config GCOV_PROFILE_ALL 36 39 bool "Profile entire Kernel" 37 40 depends on GCOV_KERNEL 38 - depends on SUPERH || S390 || X86 || PPC || MICROBLAZE || ARM || ARM64 41 + depends on ARCH_HAS_GCOV_PROFILE_ALL 39 42 default n 40 43 ---help--- 41 44 This options activates profiling for the entire kernel.
+1 -1
kernel/kexec.c
··· 600 600 if (!kexec_on_panic) { 601 601 image->swap_page = kimage_alloc_control_pages(image, 0); 602 602 if (!image->swap_page) { 603 - pr_err(KERN_ERR "Could not allocate swap buffer\n"); 603 + pr_err("Could not allocate swap buffer\n"); 604 604 goto out_free_control_pages; 605 605 } 606 606 }
+32
kernel/stacktrace.c
··· 25 25 } 26 26 EXPORT_SYMBOL_GPL(print_stack_trace); 27 27 28 + int snprint_stack_trace(char *buf, size_t size, 29 + struct stack_trace *trace, int spaces) 30 + { 31 + int i; 32 + unsigned long ip; 33 + int generated; 34 + int total = 0; 35 + 36 + if (WARN_ON(!trace->entries)) 37 + return 0; 38 + 39 + for (i = 0; i < trace->nr_entries; i++) { 40 + ip = trace->entries[i]; 41 + generated = snprintf(buf, size, "%*c[<%p>] %pS\n", 42 + 1 + spaces, ' ', (void *) ip, (void *) ip); 43 + 44 + total += generated; 45 + 46 + /* Assume that generated isn't a negative number */ 47 + if (generated >= size) { 48 + buf += size; 49 + size = 0; 50 + } else { 51 + buf += generated; 52 + size -= generated; 53 + } 54 + } 55 + 56 + return total; 57 + } 58 + EXPORT_SYMBOL_GPL(snprint_stack_trace); 59 + 28 60 /* 29 61 * Architectures that do not implement save_stack_trace_tsk or 30 62 * save_stack_trace_regs get this weak alias and a once-per-bootup warning
+3
kernel/sys_ni.c
··· 226 226 227 227 /* access BPF programs and maps */ 228 228 cond_syscall(sys_bpf); 229 + 230 + /* execveat */ 231 + cond_syscall(sys_execveat);
+16
lib/Kconfig.debug
··· 227 227 you really need it, and what the merge plan to the mainline kernel for
228 228 your module is.
229 229
 230 + config PAGE_OWNER
 231 + bool "Track page owner"
 232 + depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
 233 + select DEBUG_FS
 234 + select STACKTRACE
 235 + select PAGE_EXTENSION
 236 + help
 237 + This keeps track of the call chain that owns each page and may
 238 + help to find leaks from bare alloc_page(s) calls. Even if you
 239 + include this feature in your build, it is disabled by default.
 240 + Pass "page_owner=on" on the boot command line to enable it. Eats
 241 + a fair amount of memory if enabled. See tools/vm/page_owner_sort.c
 242 + for a user-space helper.
 243 +
 244 + If unsure, say N.
 245 +
230 246 config DEBUG_FS
231 247 bool "Debug Filesystem"
232 248 help
+3
lib/audit.c
··· 54 54 case __NR_socketcall: 55 55 return 4; 56 56 #endif 57 + #ifdef __NR_execveat 58 + case __NR_execveat: 59 + #endif 57 60 case __NR_execve: 58 61 return 5; 59 62 default:
+13 -11
lib/bitmap.c
··· 326 326 }
327 327 EXPORT_SYMBOL(bitmap_clear);
328 328
 329 - /*
 330 - * bitmap_find_next_zero_area - find a contiguous aligned zero area
 329 + /**
 330 + * bitmap_find_next_zero_area_off - find a contiguous aligned zero area
331 331 * @map: The address to base the search on
332 332 * @size: The bitmap size in bits
333 333 * @start: The bitnumber to start searching at
334 334 * @nr: The number of zeroed bits we're looking for
335 335 * @align_mask: Alignment mask for zero area
 336 + * @align_offset: Alignment offset for zero area.
336 337 *
337 338 * The @align_mask should be one less than a power of 2; the effect is that
 338 - * the bit offset of all zero areas this function finds is multiples of that
 339 - * power of 2. A @align_mask of 0 means no alignment is required.
 339 + * the bit offset of all zero areas this function finds plus @align_offset
 340 + * is a multiple of that power of 2.
340 341 */
 341 - unsigned long bitmap_find_next_zero_area(unsigned long *map,
 342 - unsigned long size,
 343 - unsigned long start,
 344 - unsigned int nr,
 345 - unsigned long align_mask)
 342 + unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
 343 + unsigned long size,
 344 + unsigned long start,
 345 + unsigned int nr,
 346 + unsigned long align_mask,
 347 + unsigned long align_offset)
346 348 {
347 349 unsigned long index, end, i;
348 350 again:
349 351 index = find_next_zero_bit(map, size, start);
350 352
351 353 /* Align allocation */
 352 - index = __ALIGN_MASK(index, align_mask);
 354 + index = __ALIGN_MASK(index + align_offset, align_mask) - align_offset;
353 355
354 356 end = index + nr;
355 357 if (end > size)
··· 363 361 }
364 362 return index;
365 363 }
 366 - EXPORT_SYMBOL(bitmap_find_next_zero_area);
 364 + EXPORT_SYMBOL(bitmap_find_next_zero_area_off);
367 365
368 366 /*
369 367 * Bitmap printing & parsing functions: first version by Nadia Yvette Chambers,
+2 -2
lib/decompress.c
··· 44 44 }; 45 45 46 46 static const struct compress_format compressed_formats[] __initconst = { 47 - { {037, 0213}, "gzip", gunzip }, 48 - { {037, 0236}, "gzip", gunzip }, 47 + { {0x1f, 0x8b}, "gzip", gunzip }, 48 + { {0x1f, 0x9e}, "gzip", gunzip }, 49 49 { {0x42, 0x5a}, "bzip2", bunzip2 }, 50 50 { {0x5d, 0x00}, "lzma", unlzma }, 51 51 { {0xfd, 0x37}, "xz", unxz },
+1 -1
lib/decompress_bunzip2.c
··· 184 184 if (get_bits(bd, 1)) 185 185 return RETVAL_OBSOLETE_INPUT; 186 186 origPtr = get_bits(bd, 24); 187 - if (origPtr > dbufSize) 187 + if (origPtr >= dbufSize) 188 188 return RETVAL_DATA_ERROR; 189 189 /* mapping table: if some byte values are never used (encoding things 190 190 like ascii text), the compression code removes the gaps to have fewer
+17 -4
lib/fault-inject.c
··· 40 40
41 41 static void fail_dump(struct fault_attr *attr)
42 42 {
 43 - if (attr->verbose > 0)
 44 - printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure\n");
 45 - if (attr->verbose > 1)
 46 - dump_stack();
 43 + if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) {
 44 + printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n"
 45 + "name %pd, interval %lu, probability %lu, "
 46 + "space %d, times %d\n", attr->dname,
 47 + attr->interval, attr->probability,
 48 + atomic_read(&attr->space),
 49 + atomic_read(&attr->times));
 50 + if (attr->verbose > 1)
 51 + dump_stack();
 52 + }
47 53 }
48 54
49 55 #define atomic_dec_not_zero(v) atomic_add_unless((v), -1, 0)
··· 208 202 goto fail;
209 203 if (!debugfs_create_ul("verbose", mode, dir, &attr->verbose))
210 204 goto fail;
 205 + if (!debugfs_create_u32("verbose_ratelimit_interval_ms", mode, dir,
 206 + &attr->ratelimit_state.interval))
 207 + goto fail;
 208 + if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir,
 209 + &attr->ratelimit_state.burst))
 210 + goto fail;
211 211 if (!debugfs_create_bool("task-filter", mode, dir, &attr->task_filter))
212 212 goto fail;
··· 234 222
235 223 #endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */
236 224
 225 + attr->dname = dget(dir);
237 226 return dir;
238 227 fail:
239 228 debugfs_remove_recursive(dir);
+10
mm/Kconfig.debug
··· 1 + config PAGE_EXTENSION
 2 + bool "Extend memmap on extra space for more information on page"
 3 + ---help---
 4 + Extend the memmap with extra space to store more information about
 5 + pages. This can be used by debugging features that need to attach
 6 + an extra field to every page. The extension saves memory by only
 7 + allocating the extra space when the boottime configuration
 8 + requests it.
 9 +
1 10 config DEBUG_PAGEALLOC
2 11 bool "Debug page memory allocations"
3 12 depends on DEBUG_KERNEL
4 13 depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC
5 14 depends on !KMEMCHECK
 15 + select PAGE_EXTENSION
6 16 select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
7 17 select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC
8 18 ---help---
+2
mm/Makefile
··· 63 63 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o 64 64 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o 65 65 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o 66 + obj-$(CONFIG_PAGE_OWNER) += page_owner.o 66 67 obj-$(CONFIG_CLEANCACHE) += cleancache.o 67 68 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o 68 69 obj-$(CONFIG_ZPOOL) += zpool.o ··· 72 71 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 73 72 obj-$(CONFIG_CMA) += cma.o 74 73 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 74 + obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+22 -3
mm/cma.c
··· 33 33 #include <linux/log2.h> 34 34 #include <linux/cma.h> 35 35 #include <linux/highmem.h> 36 + #include <linux/io.h> 36 37 37 38 struct cma { 38 39 unsigned long base_pfn; ··· 62 61 if (align_order <= cma->order_per_bit) 63 62 return 0; 64 63 return (1UL << (align_order - cma->order_per_bit)) - 1; 64 + } 65 + 66 + static unsigned long cma_bitmap_aligned_offset(struct cma *cma, int align_order) 67 + { 68 + unsigned int alignment; 69 + 70 + if (align_order <= cma->order_per_bit) 71 + return 0; 72 + alignment = 1UL << (align_order - cma->order_per_bit); 73 + return ALIGN(cma->base_pfn, alignment) - 74 + (cma->base_pfn >> cma->order_per_bit); 65 75 } 66 76 67 77 static unsigned long cma_bitmap_maxno(struct cma *cma) ··· 325 313 } 326 314 } 327 315 316 + /* 317 + * kmemleak scans/reads tracked objects for pointers to other 318 + * objects but this address isn't mapped and accessible 319 + */ 320 + kmemleak_ignore(phys_to_virt(addr)); 328 321 base = addr; 329 322 } 330 323 ··· 357 340 */ 358 341 struct page *cma_alloc(struct cma *cma, int count, unsigned int align) 359 342 { 360 - unsigned long mask, pfn, start = 0; 343 + unsigned long mask, offset, pfn, start = 0; 361 344 unsigned long bitmap_maxno, bitmap_no, bitmap_count; 362 345 struct page *page = NULL; 363 346 int ret; ··· 372 355 return NULL; 373 356 374 357 mask = cma_bitmap_aligned_mask(cma, align); 358 + offset = cma_bitmap_aligned_offset(cma, align); 375 359 bitmap_maxno = cma_bitmap_maxno(cma); 376 360 bitmap_count = cma_bitmap_pages_to_bits(cma, count); 377 361 378 362 for (;;) { 379 363 mutex_lock(&cma->lock); 380 - bitmap_no = bitmap_find_next_zero_area(cma->bitmap, 381 - bitmap_maxno, start, bitmap_count, mask); 364 + bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap, 365 + bitmap_maxno, start, bitmap_count, mask, 366 + offset); 382 367 if (bitmap_no >= bitmap_maxno) { 383 368 mutex_unlock(&cma->lock); 384 369 break;
+40 -5
mm/debug-pagealloc.c
··· 2 2 #include <linux/string.h> 3 3 #include <linux/mm.h> 4 4 #include <linux/highmem.h> 5 - #include <linux/page-debug-flags.h> 5 + #include <linux/page_ext.h> 6 6 #include <linux/poison.h> 7 7 #include <linux/ratelimit.h> 8 8 9 + static bool page_poisoning_enabled __read_mostly; 10 + 11 + static bool need_page_poisoning(void) 12 + { 13 + if (!debug_pagealloc_enabled()) 14 + return false; 15 + 16 + return true; 17 + } 18 + 19 + static void init_page_poisoning(void) 20 + { 21 + if (!debug_pagealloc_enabled()) 22 + return; 23 + 24 + page_poisoning_enabled = true; 25 + } 26 + 27 + struct page_ext_operations page_poisoning_ops = { 28 + .need = need_page_poisoning, 29 + .init = init_page_poisoning, 30 + }; 31 + 9 32 static inline void set_page_poison(struct page *page) 10 33 { 11 - __set_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags); 34 + struct page_ext *page_ext; 35 + 36 + page_ext = lookup_page_ext(page); 37 + __set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags); 12 38 } 13 39 14 40 static inline void clear_page_poison(struct page *page) 15 41 { 16 - __clear_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags); 42 + struct page_ext *page_ext; 43 + 44 + page_ext = lookup_page_ext(page); 45 + __clear_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags); 17 46 } 18 47 19 48 static inline bool page_poison(struct page *page) 20 49 { 21 - return test_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags); 50 + struct page_ext *page_ext; 51 + 52 + page_ext = lookup_page_ext(page); 53 + return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags); 22 54 } 23 55 24 56 static void poison_page(struct page *page) ··· 125 93 unpoison_page(page + i); 126 94 } 127 95 128 - void kernel_map_pages(struct page *page, int numpages, int enable) 96 + void __kernel_map_pages(struct page *page, int numpages, int enable) 129 97 { 98 + if (!page_poisoning_enabled) 99 + return; 100 + 130 101 if (enable) 131 102 unpoison_pages(page, numpages); 132 103 else
+5 -1
mm/fadvise.c
··· 117 117 __filemap_fdatawrite_range(mapping, offset, endbyte, 118 118 WB_SYNC_NONE); 119 119 120 - /* First and last FULL page! */ 120 + /* 121 + * First and last FULL page! Partial pages are deliberately 122 + * preserved on the expectation that it is better to preserve 123 + * needed memory than to discard unneeded memory. 124 + */ 121 125 start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT; 122 126 end_index = (endbyte >> PAGE_CACHE_SHIFT); 123 127
+5 -5
mm/filemap.c
··· 62 62 /* 63 63 * Lock ordering: 64 64 * 65 - * ->i_mmap_mutex (truncate_pagecache) 65 + * ->i_mmap_rwsem (truncate_pagecache) 66 66 * ->private_lock (__free_pte->__set_page_dirty_buffers) 67 67 * ->swap_lock (exclusive_swap_page, others) 68 68 * ->mapping->tree_lock 69 69 * 70 70 * ->i_mutex 71 - * ->i_mmap_mutex (truncate->unmap_mapping_range) 71 + * ->i_mmap_rwsem (truncate->unmap_mapping_range) 72 72 * 73 73 * ->mmap_sem 74 - * ->i_mmap_mutex 74 + * ->i_mmap_rwsem 75 75 * ->page_table_lock or pte_lock (various, mainly in memory.c) 76 76 * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) 77 77 * ··· 85 85 * sb_lock (fs/fs-writeback.c) 86 86 * ->mapping->tree_lock (__sync_single_inode) 87 87 * 88 - * ->i_mmap_mutex 88 + * ->i_mmap_rwsem 89 89 * ->anon_vma.lock (vma_adjust) 90 90 * 91 91 * ->anon_vma.lock ··· 105 105 * ->inode->i_lock (zap_pte_range->set_page_dirty) 106 106 * ->private_lock (zap_pte_range->__set_page_dirty_buffers) 107 107 * 108 - * ->i_mmap_mutex 108 + * ->i_mmap_rwsem 109 109 * ->tasklist_lock (memory_failure, collect_procs_ao) 110 110 */ 111 111
+9 -14
mm/filemap_xip.c
··· 155 155 EXPORT_SYMBOL_GPL(xip_file_read); 156 156 157 157 /* 158 - * __xip_unmap is invoked from xip_unmap and 159 - * xip_write 158 + * __xip_unmap is invoked from xip_unmap and xip_write 160 159 * 161 160 * This function walks all vmas of the address_space and unmaps the 162 161 * __xip_sparse_page when found at pgoff. 163 162 */ 164 - static void 165 - __xip_unmap (struct address_space * mapping, 166 - unsigned long pgoff) 163 + static void __xip_unmap(struct address_space * mapping, unsigned long pgoff) 167 164 { 168 165 struct vm_area_struct *vma; 169 - struct mm_struct *mm; 170 - unsigned long address; 171 - pte_t *pte; 172 - pte_t pteval; 173 - spinlock_t *ptl; 174 166 struct page *page; 175 167 unsigned count; 176 168 int locked = 0; ··· 174 182 return; 175 183 176 184 retry: 177 - mutex_lock(&mapping->i_mmap_mutex); 185 + i_mmap_lock_read(mapping); 178 186 vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 179 - mm = vma->vm_mm; 180 - address = vma->vm_start + 187 + pte_t *pte, pteval; 188 + spinlock_t *ptl; 189 + struct mm_struct *mm = vma->vm_mm; 190 + unsigned long address = vma->vm_start + 181 191 ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); 192 + 182 193 BUG_ON(address < vma->vm_start || address >= vma->vm_end); 183 194 pte = page_check_address(page, mm, address, &ptl, 1); 184 195 if (pte) { ··· 197 202 page_cache_release(page); 198 203 } 199 204 } 200 - mutex_unlock(&mapping->i_mmap_mutex); 205 + i_mmap_unlock_read(mapping); 201 206 202 207 if (locked) { 203 208 mutex_unlock(&xip_sparse_mutex);
+2 -2
mm/fremap.c
··· 238 238 } 239 239 goto out_freed; 240 240 } 241 - mutex_lock(&mapping->i_mmap_mutex); 241 + i_mmap_lock_write(mapping); 242 242 flush_dcache_mmap_lock(mapping); 243 243 vma->vm_flags |= VM_NONLINEAR; 244 244 vma_interval_tree_remove(vma, &mapping->i_mmap); 245 245 vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); 246 246 flush_dcache_mmap_unlock(mapping); 247 - mutex_unlock(&mapping->i_mmap_mutex); 247 + i_mmap_unlock_write(mapping); 248 248 } 249 249 250 250 if (vma->vm_flags & VM_LOCKED) {
+13 -13
mm/hugetlb.c
··· 1457 1457 return 0; 1458 1458 1459 1459 found: 1460 - BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1)); 1460 + BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h))); 1461 1461 /* Put them into a private list first because mem_map is not up yet */ 1462 1462 list_add(&m->list, &huge_boot_pages); 1463 1463 m->hstate = h; ··· 2083 2083 * devices of nodes that have memory. All on-line nodes should have 2084 2084 * registered their associated device by this time. 2085 2085 */ 2086 - static void hugetlb_register_all_nodes(void) 2086 + static void __init hugetlb_register_all_nodes(void) 2087 2087 { 2088 2088 int nid; 2089 2089 ··· 2726 2726 * on its way out. We're lucky that the flag has such an appropriate 2727 2727 * name, and can in fact be safely cleared here. We could clear it 2728 2728 * before the __unmap_hugepage_range above, but all that's necessary 2729 - * is to clear it before releasing the i_mmap_mutex. This works 2729 + * is to clear it before releasing the i_mmap_rwsem. This works 2730 2730 * because in the context this is called, the VMA is about to be 2731 - * destroyed and the i_mmap_mutex is held. 2731 + * destroyed and the i_mmap_rwsem is held. 
2732 2732 */ 2733 2733 vma->vm_flags &= ~VM_MAYSHARE; 2734 2734 } ··· 2774 2774 * this mapping should be shared between all the VMAs, 2775 2775 * __unmap_hugepage_range() is called as the lock is already held 2776 2776 */ 2777 - mutex_lock(&mapping->i_mmap_mutex); 2777 + i_mmap_lock_write(mapping); 2778 2778 vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) { 2779 2779 /* Do not unmap the current VMA */ 2780 2780 if (iter_vma == vma) ··· 2791 2791 unmap_hugepage_range(iter_vma, address, 2792 2792 address + huge_page_size(h), page); 2793 2793 } 2794 - mutex_unlock(&mapping->i_mmap_mutex); 2794 + i_mmap_unlock_write(mapping); 2795 2795 } 2796 2796 2797 2797 /* ··· 3348 3348 flush_cache_range(vma, address, end); 3349 3349 3350 3350 mmu_notifier_invalidate_range_start(mm, start, end); 3351 - mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex); 3351 + i_mmap_lock_write(vma->vm_file->f_mapping); 3352 3352 for (; address < end; address += huge_page_size(h)) { 3353 3353 spinlock_t *ptl; 3354 3354 ptep = huge_pte_offset(mm, address); ··· 3370 3370 spin_unlock(ptl); 3371 3371 } 3372 3372 /* 3373 - * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare 3373 + * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare 3374 3374 * may have cleared our pud entry and done put_page on the page table: 3375 - * once we release i_mmap_mutex, another task can do the final put_page 3375 + * once we release i_mmap_rwsem, another task can do the final put_page 3376 3376 * and that page table be reused and filled with junk. 3377 3377 */ 3378 3378 flush_tlb_range(vma, start, end); 3379 - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); 3379 + i_mmap_unlock_write(vma->vm_file->f_mapping); 3380 3380 mmu_notifier_invalidate_range_end(mm, start, end); 3381 3381 3382 3382 return pages << h->order; ··· 3525 3525 * and returns the corresponding pte. While this is not necessary for the
3526 3526 * !shared pmd case because we can allocate the pmd later as well, it makes the 3527 3527 * code much cleaner. pmd allocation is essential for the shared case because 3528 - * pud has to be populated inside the same i_mmap_mutex section - otherwise 3528 + * pud has to be populated inside the same i_mmap_rwsem section - otherwise 3529 3529 * racing tasks could either miss the sharing (see huge_pte_offset) or select a 3530 3530 * bad pmd for sharing. 3531 3531 */ ··· 3544 3544 if (!vma_shareable(vma, addr)) 3545 3545 return (pte_t *)pmd_alloc(mm, pud, addr); 3546 3546 3547 - mutex_lock(&mapping->i_mmap_mutex); 3547 + i_mmap_lock_write(mapping); 3548 3548 vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { 3549 3549 if (svma == vma) 3550 3550 continue; ··· 3572 3572 spin_unlock(ptl); 3573 3573 out: 3574 3574 pte = (pte_t *)pmd_alloc(mm, pud, addr); 3575 - mutex_unlock(&mapping->i_mmap_mutex); 3575 + i_mmap_unlock_write(mapping); 3576 3576 return pte; 3577 3577 } 3578 3578
+20 -23
mm/memblock.c
··· 715 715 } 716 716 717 717 /** 718 - * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG. 719 - * @base: the base phys addr of the region 720 - * @size: the size of the region 721 718 * 722 - * This function isolates region [@base, @base + @size), and mark it with flag 723 - * MEMBLOCK_HOTPLUG. 719 + * This function isolates region [@base, @base + @size), and sets/clears flag 724 720 * 725 721 * Return 0 on succees, -errno on failure. 726 722 */ 727 - int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size) 723 + static int __init_memblock memblock_setclr_flag(phys_addr_t base, 724 + phys_addr_t size, int set, int flag) 728 725 { 729 726 struct memblock_type *type = &memblock.memory; 730 727 int i, ret, start_rgn, end_rgn; ··· 731 734 return ret; 732 735 733 736 for (i = start_rgn; i < end_rgn; i++) 734 - memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG); 737 + if (set) 738 + memblock_set_region_flags(&type->regions[i], flag); 739 + else 740 + memblock_clear_region_flags(&type->regions[i], flag); 735 741 736 742 memblock_merge_regions(type); 737 743 return 0; 744 + } 745 + 746 + /** 747 + * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG. 748 + * @base: the base phys addr of the region 749 + * @size: the size of the region 750 + * 751 + * Return 0 on succees, -errno on failure. 752 + */ 753 + int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size) 754 + { 755 + return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG); 738 756 } 739 757 740 758 /** ··· 757 745 * @base: the base phys addr of the region 758 746 * @size: the size of the region 759 747 * 760 - * This function isolates region [@base, @base + @size), and clear flag 761 - * MEMBLOCK_HOTPLUG for the isolated regions. 762 - * 763 748 * Return 0 on succees, -errno on failure. 
764 749 */ 765 750 int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) 766 751 { 767 - struct memblock_type *type = &memblock.memory; 768 - int i, ret, start_rgn, end_rgn; 769 - 770 - ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn); 771 - if (ret) 772 - return ret; 773 - 774 - for (i = start_rgn; i < end_rgn; i++) 775 - memblock_clear_region_flags(&type->regions[i], 776 - MEMBLOCK_HOTPLUG); 777 - 778 - memblock_merge_regions(type); 779 - return 0; 752 + return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG); 780 753 } 781 754 782 755 /**
+35 -145
mm/memcontrol.c
··· 296 296 * Should the accounting and control be hierarchical, per subtree? 297 297 */ 298 298 bool use_hierarchy; 299 - unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */ 300 299 301 300 bool oom_lock; 302 301 atomic_t under_oom; ··· 365 366 /* WARNING: nodeinfo must be the last member here */ 366 367 }; 367 368 368 - /* internal only representation about the status of kmem accounting. */ 369 - enum { 370 - KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */ 371 - }; 372 - 373 369 #ifdef CONFIG_MEMCG_KMEM 374 - static inline void memcg_kmem_set_active(struct mem_cgroup *memcg) 375 - { 376 - set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags); 377 - } 378 - 379 370 static bool memcg_kmem_is_active(struct mem_cgroup *memcg) 380 371 { 381 - return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags); 372 + return memcg->kmemcg_id >= 0; 382 373 } 383 - 384 374 #endif 385 375 386 376 /* Stuffs for move charges at task migration. */ ··· 1559 1571 * select it. The goal is to allow it to allocate so that it may 1560 1572 * quickly exit and free its memory. 
1561 1573 */ 1562 - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { 1574 + if (fatal_signal_pending(current) || task_will_free_mem(current)) { 1563 1575 set_thread_flag(TIF_MEMDIE); 1564 1576 return; 1565 1577 } ··· 1616 1628 NULL, "Memory cgroup out of memory"); 1617 1629 } 1618 1630 1631 + #if MAX_NUMNODES > 1 1632 + 1619 1633 /** 1620 1634 * test_mem_cgroup_node_reclaimable 1621 1635 * @memcg: the target memcg ··· 1640 1650 return false; 1641 1651 1642 1652 } 1643 - #if MAX_NUMNODES > 1 1644 1653 1645 1654 /* 1646 1655 * Always updating the nodemask is not very good - even if we have an empty ··· 2635 2646 if (!cachep) 2636 2647 return; 2637 2648 2638 - css_get(&memcg->css); 2639 2649 list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches); 2640 2650 2641 2651 /* ··· 2668 2680 list_del(&cachep->memcg_params->list); 2669 2681 2670 2682 kmem_cache_destroy(cachep); 2671 - 2672 - /* drop the reference taken in memcg_register_cache */ 2673 - css_put(&memcg->css); 2674 - } 2675 - 2676 - /* 2677 - * During the creation a new cache, we need to disable our accounting mechanism 2678 - * altogether. This is true even if we are not creating, but rather just 2679 - * enqueing new caches to be created. 2680 - * 2681 - * This is because that process will trigger allocations; some visible, like 2682 - * explicit kmallocs to auxiliary data structures, name strings and internal 2683 - * cache structures; some well concealed, like INIT_WORK() that can allocate 2684 - * objects during debug. 2685 - * 2686 - * If any allocation happens during memcg_kmem_get_cache, we will recurse back 2687 - * to it. This may not be a bounded recursion: since the first cache creation 2688 - * failed to complete (waiting on the allocation), we'll just try to create the 2689 - * cache again, failing at the same point. 2690 - * 2691 - * memcg_kmem_get_cache is prepared to abort after seeing a positive count of 2692 - * memcg_kmem_skip_account. So we enclose anything that might allocate memory
2693 - * inside the following two functions. 2694 - */ 2695 - static inline void memcg_stop_kmem_account(void) 2696 - { 2697 - VM_BUG_ON(!current->mm); 2698 - current->memcg_kmem_skip_account++; 2699 - } 2700 - 2701 - static inline void memcg_resume_kmem_account(void) 2702 - { 2703 - VM_BUG_ON(!current->mm); 2704 - current->memcg_kmem_skip_account--; 2705 2683 } 2706 2684 2707 2685 int __memcg_cleanup_cache_params(struct kmem_cache *s) ··· 2701 2747 mutex_lock(&memcg_slab_mutex); 2702 2748 list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) { 2703 2749 cachep = memcg_params_to_cache(params); 2704 - kmem_cache_shrink(cachep); 2705 - if (atomic_read(&cachep->memcg_params->nr_pages) == 0) 2706 - memcg_unregister_cache(cachep); 2750 + memcg_unregister_cache(cachep); 2707 2751 } 2708 2752 mutex_unlock(&memcg_slab_mutex); 2709 2753 } ··· 2736 2784 struct memcg_register_cache_work *cw; 2737 2785 2738 2786 cw = kmalloc(sizeof(*cw), GFP_NOWAIT); 2739 - if (cw == NULL) { 2740 - css_put(&memcg->css); 2787 + if (!cw) 2741 2788 return; 2742 - } 2789 + 2790 + css_get(&memcg->css); 2743 2791 2744 2792 cw->memcg = memcg; 2745 2793 cw->cachep = cachep; ··· 2762 2810 * this point we can't allow ourselves back into memcg_kmem_get_cache, 2763 2811 * the safest choice is to do it like this, wrapping the whole function.
2764 2812 */ 2765 - memcg_stop_kmem_account(); 2813 + current->memcg_kmem_skip_account = 1; 2766 2814 __memcg_schedule_register_cache(memcg, cachep); 2767 - memcg_resume_kmem_account(); 2815 + current->memcg_kmem_skip_account = 0; 2768 2816 } 2769 2817 2770 2818 int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order) 2771 2819 { 2772 2820 unsigned int nr_pages = 1 << order; 2773 - int res; 2774 2821 2775 - res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages); 2776 - if (!res) 2777 - atomic_add(nr_pages, &cachep->memcg_params->nr_pages); 2778 - return res; 2822 + return memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages); 2779 2823 } 2780 2824 2781 2825 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order) ··· 2779 2831 unsigned int nr_pages = 1 << order; 2780 2832 2781 2833 memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages); 2782 - atomic_sub(nr_pages, &cachep->memcg_params->nr_pages); 2783 2834 } 2784 2835 2785 2836 /* ··· 2794 2847 * Can't be called in interrupt context or from kernel threads. 2795 2848 * This function needs to be called with rcu_read_lock() held. 
2796 2849 */ 2797 - struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, 2798 - gfp_t gfp) 2850 + struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) 2799 2851 { 2800 2852 struct mem_cgroup *memcg; 2801 2853 struct kmem_cache *memcg_cachep; ··· 2802 2856 VM_BUG_ON(!cachep->memcg_params); 2803 2857 VM_BUG_ON(!cachep->memcg_params->is_root_cache); 2804 2858 2805 - if (!current->mm || current->memcg_kmem_skip_account) 2859 + if (current->memcg_kmem_skip_account) 2806 2860 return cachep; 2807 2861 2808 - rcu_read_lock(); 2809 - memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner)); 2810 - 2862 + memcg = get_mem_cgroup_from_mm(current->mm); 2811 2863 if (!memcg_kmem_is_active(memcg)) 2812 2864 goto out; 2813 2865 2814 2866 memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg)); 2815 - if (likely(memcg_cachep)) { 2816 - cachep = memcg_cachep; 2817 - goto out; 2818 - } 2819 - 2820 - /* The corresponding put will be done in the workqueue. */ 2821 - if (!css_tryget_online(&memcg->css)) 2822 - goto out; 2823 - rcu_read_unlock(); 2867 + if (likely(memcg_cachep)) 2868 + return memcg_cachep; 2824 2869 2825 2870 /* 2826 2871 * If we are in a safe context (can wait, and not in interrupt ··· 2826 2889 * defer everything. 2827 2890 */ 2828 2891 memcg_schedule_register_cache(memcg, cachep); 2829 - return cachep; 2830 2892 out: 2831 - rcu_read_unlock(); 2893 + css_put(&memcg->css); 2832 2894 return cachep; 2895 + } 2896 + 2897 + void __memcg_kmem_put_cache(struct kmem_cache *cachep) 2898 + { 2899 + if (!is_root_cache(cachep)) 2900 + css_put(&cachep->memcg_params->memcg->css); 2833 2901 } 2834 2902 2835 2903 /* ··· 2858 2916 int ret; 2859 2917 2860 2918 *_memcg = NULL; 2861 - 2862 - /* 2863 - * Disabling accounting is only relevant for some specific memcg 2864 - * internal allocations. 
Therefore we would initially not have such 2865 - * check here, since direct calls to the page allocator that are 2866 - * accounted to kmemcg (alloc_kmem_pages and friends) only happen 2867 - * outside memcg core. We are mostly concerned with cache allocations, 2868 - * and by having this test at memcg_kmem_get_cache, we are already able 2869 - * to relay the allocation to the root cache and bypass the memcg cache 2870 - * altogether. 2871 - * 2872 - * There is one exception, though: the SLUB allocator does not create 2873 - * large order caches, but rather service large kmallocs directly from 2874 - * the page allocator. Therefore, the following sequence when backed by 2875 - * the SLUB allocator: 2876 - * 2877 - * memcg_stop_kmem_account(); 2878 - * kmalloc(<large_number>) 2879 - * memcg_resume_kmem_account(); 2880 - * 2881 - * would effectively ignore the fact that we should skip accounting, 2882 - * since it will drive us directly to this function without passing 2883 - * through the cache selector memcg_kmem_get_cache. Such large 2884 - * allocations are extremely rare but can happen, for instance, for the 2885 - * cache arrays. We bring this test here. 2886 - */ 2887 - if (!current->mm || current->memcg_kmem_skip_account) 2888 - return true; 2889 2919 2890 2920 memcg = get_mem_cgroup_from_mm(current->mm); 2891 2921 ··· 2898 2984 2899 2985 memcg_uncharge_kmem(memcg, 1 << order); 2900 2986 page->mem_cgroup = NULL; 2901 - } 2902 - #else 2903 - static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg) 2904 - { 2905 2987 } 2906 2988 #endif /* CONFIG_MEMCG_KMEM */ 2907 2989 ··· 3449 3539 return 0; 3450 3540 3451 3541 /* 3452 - * We are going to allocate memory for data shared by all memory 3453 - * cgroups so let's stop accounting here. 3454 - */ 3455 - memcg_stop_kmem_account(); 3456 - 3457 - /* 3458 3542 * For simplicity, we won't allow this to be disabled. 
It also can't 3459 3543 * be changed if the cgroup has children already, or if tasks had 3460 3544 * already joined. ··· 3474 3570 goto out; 3475 3571 } 3476 3572 3477 - memcg->kmemcg_id = memcg_id; 3478 - INIT_LIST_HEAD(&memcg->memcg_slab_caches); 3479 - 3480 3573 /* 3481 - * We couldn't have accounted to this cgroup, because it hasn't got the 3482 - * active bit set yet, so this should succeed. 3574 + * We couldn't have accounted to this cgroup, because it hasn't got 3575 + * activated yet, so this should succeed. 3483 3576 */ 3484 3577 err = page_counter_limit(&memcg->kmem, nr_pages); 3485 3578 VM_BUG_ON(err); 3486 3579 3487 3580 static_key_slow_inc(&memcg_kmem_enabled_key); 3488 3581 /* 3489 - * Setting the active bit after enabling static branching will 3582 + * A memory cgroup is considered kmem-active as soon as it gets 3583 + * kmemcg_id. Setting the id after enabling static branching will 3490 3584 * guarantee no one starts accounting before all call sites are 3491 3585 * patched. 
3492 3586 */ 3493 - memcg_kmem_set_active(memcg); 3587 + memcg->kmemcg_id = memcg_id; 3494 3588 out: 3495 - memcg_resume_kmem_account(); 3496 3589 return err; 3497 3590 } 3498 3591 ··· 3692 3791 } 3693 3792 #endif /* CONFIG_NUMA */ 3694 3793 3695 - static inline void mem_cgroup_lru_names_not_uptodate(void) 3696 - { 3697 - BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); 3698 - } 3699 - 3700 3794 static int memcg_stat_show(struct seq_file *m, void *v) 3701 3795 { 3702 3796 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); 3703 3797 unsigned long memory, memsw; 3704 3798 struct mem_cgroup *mi; 3705 3799 unsigned int i; 3800 + 3801 + BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); 3706 3802 3707 3803 for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { 3708 3804 if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) ··· 4157 4259 { 4158 4260 int ret; 4159 4261 4160 - memcg->kmemcg_id = -1; 4161 4262 ret = memcg_propagate_kmem(memcg); 4162 4263 if (ret) 4163 4264 return ret; ··· 4166 4269 4167 4270 static void memcg_destroy_kmem(struct mem_cgroup *memcg) 4168 4271 { 4272 + memcg_unregister_all_caches(memcg); 4169 4273 mem_cgroup_sockets_destroy(memcg); 4170 4274 } 4171 4275 #else ··· 4622 4724 4623 4725 free_percpu(memcg->stat); 4624 4726 4625 - /* 4626 - * We need to make sure that (at least for now), the jump label 4627 - * destruction code runs outside of the cgroup lock. This is because 4628 - * get_online_cpus(), which is called from the static_branch update, 4629 - * can't be called inside the cgroup_lock. cpusets are the ones 4630 - * enforcing this dependency, so if they ever change, we might as well. 4631 - * 4632 - * schedule_work() will guarantee this happens. Be careful if you need 4633 - * to move this code around, and make sure it is outside 4634 - * the cgroup_lock. 
4635 - */ 4636 4727 disarm_static_keys(memcg); 4637 4728 kfree(memcg); 4638 4729 } ··· 4691 4804 vmpressure_init(&memcg->vmpressure); 4692 4805 INIT_LIST_HEAD(&memcg->event_list); 4693 4806 spin_lock_init(&memcg->event_list_lock); 4807 + #ifdef CONFIG_MEMCG_KMEM 4808 + memcg->kmemcg_id = -1; 4809 + INIT_LIST_HEAD(&memcg->memcg_slab_caches); 4810 + #endif 4694 4811 4695 4812 return &memcg->css; 4696 4813 ··· 4776 4885 } 4777 4886 spin_unlock(&memcg->event_list_lock); 4778 4887 4779 - memcg_unregister_all_caches(memcg); 4780 4888 vmpressure_cleanup(&memcg->vmpressure); 4781 4889 } 4782 4890
+5 -10
mm/memory-failure.c
··· 239 239 } 240 240 241 241 /* 242 - * Only call shrink_slab here (which would also shrink other caches) if 243 - * access is not potentially fatal. 242 + * Only call shrink_node_slabs here (which would also shrink 243 + * other caches) if access is not potentially fatal. 244 244 */ 245 245 if (access) { 246 246 int nr; 247 247 int nid = page_to_nid(p); 248 248 do { 249 - struct shrink_control shrink = { 250 - .gfp_mask = GFP_KERNEL, 251 - }; 252 - node_set(nid, shrink.nodes_to_scan); 253 - 254 - nr = shrink_slab(&shrink, 1000, 1000); 249 + nr = shrink_node_slabs(GFP_KERNEL, nid, 1000, 1000); 255 250 if (page_count(p) == 1) 256 251 break; 257 252 } while (nr > 10); ··· 461 466 struct task_struct *tsk; 462 467 struct address_space *mapping = page->mapping; 463 468 464 - mutex_lock(&mapping->i_mmap_mutex); 469 + i_mmap_lock_read(mapping); 465 470 read_lock(&tasklist_lock); 466 471 for_each_process(tsk) { 467 472 pgoff_t pgoff = page_to_pgoff(page); ··· 483 488 } 484 489 } 485 490 read_unlock(&tasklist_lock); 486 - mutex_unlock(&mapping->i_mmap_mutex); 491 + i_mmap_unlock_read(mapping); 487 492 } 488 493 489 494 /*
+5 -4
mm/memory.c
··· 1326 1326 * safe to do nothing in this case. 1327 1327 */ 1328 1328 if (vma->vm_file) { 1329 - mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex); 1329 + i_mmap_lock_write(vma->vm_file->f_mapping); 1330 1330 __unmap_hugepage_range_final(tlb, vma, start, end, NULL); 1331 - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); 1331 + i_mmap_unlock_write(vma->vm_file->f_mapping); 1332 1332 } 1333 1333 } else 1334 1334 unmap_page_range(tlb, vma, start, end, details); ··· 2377 2377 details.last_index = ULONG_MAX; 2378 2378 2379 2379 2380 - mutex_lock(&mapping->i_mmap_mutex); 2380 + i_mmap_lock_read(mapping); 2381 2381 if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) 2382 2382 unmap_mapping_range_tree(&mapping->i_mmap, &details); 2383 2383 if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) 2384 2384 unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); 2385 - mutex_unlock(&mapping->i_mmap_mutex); 2385 + i_mmap_unlock_read(mapping); 2386 2386 } 2387 2387 EXPORT_SYMBOL(unmap_mapping_range); 2388 2388 ··· 3365 3365 3366 3366 return ret; 3367 3367 } 3368 + EXPORT_SYMBOL_GPL(handle_mm_fault); 3368 3369 3369 3370 #ifndef __PAGETABLE_PUD_FOLDED 3370 3371 /*
+18 -10
mm/migrate.c
··· 746 746 * MIGRATEPAGE_SUCCESS - success 747 747 */ 748 748 static int move_to_new_page(struct page *newpage, struct page *page, 749 - int remap_swapcache, enum migrate_mode mode) 749 + int page_was_mapped, enum migrate_mode mode) 750 750 { 751 751 struct address_space *mapping; 752 752 int rc; ··· 784 784 newpage->mapping = NULL; 785 785 } else { 786 786 mem_cgroup_migrate(page, newpage, false); 787 - if (remap_swapcache) 787 + if (page_was_mapped) 788 788 remove_migration_ptes(page, newpage); 789 789 page->mapping = NULL; 790 790 } ··· 798 798 int force, enum migrate_mode mode) 799 799 { 800 800 int rc = -EAGAIN; 801 - int remap_swapcache = 1; 801 + int page_was_mapped = 0; 802 802 struct anon_vma *anon_vma = NULL; 803 803 804 804 if (!trylock_page(page)) { ··· 870 870 * migrated but are not remapped when migration 871 871 * completes 872 872 */ 873 - remap_swapcache = 0; 874 873 } else { 875 874 goto out_unlock; 876 875 } ··· 909 910 } 910 911 911 912 /* Establish migration ptes or remove ptes */ 912 - try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); 913 + if (page_mapped(page)) { 914 + try_to_unmap(page, 915 + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); 916 + page_was_mapped = 1; 917 + } 913 918 914 919 skip_unmap: 915 920 if (!page_mapped(page)) 916 - rc = move_to_new_page(newpage, page, remap_swapcache, mode); 921 + rc = move_to_new_page(newpage, page, page_was_mapped, mode); 917 922 918 - if (rc && remap_swapcache) 923 + if (rc && page_was_mapped) 919 924 remove_migration_ptes(page, page); 920 925 921 926 /* Drop an anon_vma reference if we took one */ ··· 1020 1017 { 1021 1018 int rc = 0; 1022 1019 int *result = NULL; 1020 + int page_was_mapped = 0; 1023 1021 struct page *new_hpage; 1024 1022 struct anon_vma *anon_vma = NULL; 1025 1023 ··· 1051 1047 if (PageAnon(hpage)) 1052 1048 anon_vma = page_get_anon_vma(hpage); 1053 1049 1054 - try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); 1050 + if (page_mapped(hpage)) {
1051 + try_to_unmap(hpage, 1052 + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); 1053 + page_was_mapped = 1; 1054 + } 1055 1055 1056 1056 if (!page_mapped(hpage)) 1057 - rc = move_to_new_page(new_hpage, hpage, 1, mode); 1057 + rc = move_to_new_page(new_hpage, hpage, page_was_mapped, mode); 1058 1058 1059 - if (rc != MIGRATEPAGE_SUCCESS) 1059 + if (rc != MIGRATEPAGE_SUCCESS && page_was_mapped) 1060 1060 remove_migration_ptes(hpage, hpage); 1061 1061 1062 1062 if (anon_vma)
+5 -2
mm/mincore.c
··· 137 137 } else { /* pte is a swap entry */ 138 138 swp_entry_t entry = pte_to_swp_entry(pte); 139 139 140 - if (is_migration_entry(entry)) { 141 - /* migration entries are always uptodate */ 140 + if (non_swap_entry(entry)) { 141 + /* 142 + * migration or hwpoison entries are always 143 + * uptodate 144 + */ 142 145 *vec = 1; 143 146 } else { 144 147 #ifdef CONFIG_SWAP
+13 -11
mm/mmap.c
··· 232 232 } 233 233 234 234 /* 235 - * Requires inode->i_mapping->i_mmap_mutex 235 + * Requires inode->i_mapping->i_mmap_rwsem 236 236 */ 237 237 static void __remove_shared_vm_struct(struct vm_area_struct *vma, 238 238 struct file *file, struct address_space *mapping) ··· 260 260 261 261 if (file) { 262 262 struct address_space *mapping = file->f_mapping; 263 - mutex_lock(&mapping->i_mmap_mutex); 263 + i_mmap_lock_write(mapping); 264 264 __remove_shared_vm_struct(vma, file, mapping); 265 - mutex_unlock(&mapping->i_mmap_mutex); 265 + i_mmap_unlock_write(mapping); 266 266 } 267 267 } 268 268 ··· 674 674 675 675 if (vma->vm_file) { 676 676 mapping = vma->vm_file->f_mapping; 677 - mutex_lock(&mapping->i_mmap_mutex); 677 + i_mmap_lock_write(mapping); 678 678 } 679 679 680 680 __vma_link(mm, vma, prev, rb_link, rb_parent); 681 681 __vma_link_file(vma); 682 682 683 683 if (mapping) 684 - mutex_unlock(&mapping->i_mmap_mutex); 684 + i_mmap_unlock_write(mapping); 685 685 686 686 mm->map_count++; 687 687 validate_mm(mm); ··· 796 796 next->vm_end); 797 797 } 798 798 799 - mutex_lock(&mapping->i_mmap_mutex); 799 + i_mmap_lock_write(mapping); 800 800 if (insert) { 801 801 /* 802 802 * Put into interval tree now, so instantiated pages ··· 883 883 anon_vma_unlock_write(anon_vma); 884 884 } 885 885 if (mapping) 886 - mutex_unlock(&mapping->i_mmap_mutex); 886 + i_mmap_unlock_write(mapping); 887 887 888 888 if (root) { 889 889 uprobe_mmap(vma); ··· 2362 2362 } 2363 2363 #endif 2364 2364 2365 + EXPORT_SYMBOL_GPL(find_extend_vma); 2366 + 2365 2367 /* 2366 2368 * Ok - we have the memory areas we should free on the vma list, 2367 2369 * so release them, and do the vma updates. ··· 2793 2791 2794 2792 /* Insert vm structure into process list sorted by address 2795 2793 * and into the inode's i_mmap tree. If vm_file is non-NULL 2796 - * then i_mmap_mutex is taken here. 2794 + * then i_mmap_rwsem is taken here. 
2797 2795 */ 2798 2796 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) 2799 2797 { ··· 3088 3086 */ 3089 3087 if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags)) 3090 3088 BUG(); 3091 - mutex_lock_nest_lock(&mapping->i_mmap_mutex, &mm->mmap_sem); 3089 + down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_sem); 3092 3090 } 3093 3091 } 3094 3092 ··· 3115 3113 * vma in this mm is backed by the same anon_vma or address_space. 3116 3114 * 3117 3115 * We can take all the locks in random order because the VM code 3118 - * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never 3116 + * taking i_mmap_rwsem or anon_vma->rwsem outside the mmap_sem never 3119 3117 * takes more than one of them in a row. Secondly we're protected 3120 3118 * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. 3121 3119 * ··· 3184 3182 * AS_MM_ALL_LOCKS can't change to 0 from under us 3185 3183 * because we hold the mm_all_locks_mutex. 3186 3184 */ 3187 - mutex_unlock(&mapping->i_mmap_mutex); 3185 + i_mmap_unlock_write(mapping); 3188 3186 if (!test_and_clear_bit(AS_MM_ALL_LOCKS, 3189 3187 &mapping->flags)) 3190 3188 BUG();
+3 -3
mm/mremap.c
··· 99 99 spinlock_t *old_ptl, *new_ptl; 100 100 101 101 /* 102 - * When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma 102 + * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma 103 103 * locks to ensure that rmap will always observe either the old or the 104 104 * new ptes. This is the easiest way to avoid races with 105 105 * truncate_pagecache(), page migration, etc... ··· 119 119 if (need_rmap_locks) { 120 120 if (vma->vm_file) { 121 121 mapping = vma->vm_file->f_mapping; 122 - mutex_lock(&mapping->i_mmap_mutex); 122 + i_mmap_lock_write(mapping); 123 123 } 124 124 if (vma->anon_vma) { 125 125 anon_vma = vma->anon_vma; ··· 156 156 if (anon_vma) 157 157 anon_vma_unlock_write(anon_vma); 158 158 if (mapping) 159 - mutex_unlock(&mapping->i_mmap_mutex); 159 + i_mmap_unlock_write(mapping); 160 160 } 161 161 162 162 #define LATENCY_LIMIT (64 * PAGE_SIZE)
+19 -31
mm/nommu.c
··· 722 722 if (vma->vm_file) { 723 723 mapping = vma->vm_file->f_mapping; 724 724 725 - mutex_lock(&mapping->i_mmap_mutex); 725 + i_mmap_lock_write(mapping); 726 726 flush_dcache_mmap_lock(mapping); 727 727 vma_interval_tree_insert(vma, &mapping->i_mmap); 728 728 flush_dcache_mmap_unlock(mapping); 729 - mutex_unlock(&mapping->i_mmap_mutex); 729 + i_mmap_unlock_write(mapping); 730 730 } 731 731 732 732 /* add the VMA to the tree */ ··· 795 795 if (vma->vm_file) { 796 796 mapping = vma->vm_file->f_mapping; 797 797 798 - mutex_lock(&mapping->i_mmap_mutex); 798 + i_mmap_lock_write(mapping); 799 799 flush_dcache_mmap_lock(mapping); 800 800 vma_interval_tree_remove(vma, &mapping->i_mmap); 801 801 flush_dcache_mmap_unlock(mapping); 802 - mutex_unlock(&mapping->i_mmap_mutex); 802 + i_mmap_unlock_write(mapping); 803 803 } 804 804 805 805 /* remove from the MM's tree and list */ ··· 1149 1149 unsigned long len, 1150 1150 unsigned long capabilities) 1151 1151 { 1152 - struct page *pages; 1153 - unsigned long total, point, n; 1152 + unsigned long total, point; 1154 1153 void *base; 1155 1154 int ret, order; 1156 1155 ··· 1181 1182 order = get_order(len); 1182 1183 kdebug("alloc order %d for %lx", order, len); 1183 1184 1184 - pages = alloc_pages(GFP_KERNEL, order); 1185 - if (!pages) 1186 - goto enomem; 1187 - 1188 1185 total = 1 << order; 1189 - atomic_long_add(total, &mmap_pages_allocated); 1190 - 1191 1186 point = len >> PAGE_SHIFT; 1192 1187 1193 - /* we allocated a power-of-2 sized page set, so we may want to trim off 1194 - * the excess */ 1188 + /* we don't want to allocate a power-of-2 sized page set */ 1195 1189 if (sysctl_nr_trim_pages && total - point >= sysctl_nr_trim_pages) { 1196 - while (total > point) { 1197 - order = ilog2(total - point); 1198 - n = 1 << order; 1199 - kdebug("shave %lu/%lu @%lu", n, total - point, total); 1200 - atomic_long_sub(n, &mmap_pages_allocated); 1201 - total -= n; 1202 - set_page_refcounted(pages + total); 1203 - __free_pages(pages + 
total, order); 1204 - } 1190 + total = point; 1191 + kdebug("try to alloc exact %lu pages", total); 1192 + base = alloc_pages_exact(len, GFP_KERNEL); 1193 + } else { 1194 + base = (void *)__get_free_pages(GFP_KERNEL, order); 1205 1195 } 1206 1196 1207 - for (point = 1; point < total; point++) 1208 - set_page_refcounted(&pages[point]); 1197 + if (!base) 1198 + goto enomem; 1209 1199 1210 - base = page_address(pages); 1200 + atomic_long_add(total, &mmap_pages_allocated); 1201 + 1211 1202 region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY; 1212 1203 region->vm_start = (unsigned long) base; 1213 1204 region->vm_end = region->vm_start + len; ··· 2083 2094 high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; 2084 2095 2085 2096 down_write(&nommu_region_sem); 2086 - mutex_lock(&inode->i_mapping->i_mmap_mutex); 2097 + i_mmap_lock_read(inode->i_mapping); 2087 2098 2088 2099 /* search for VMAs that fall within the dead zone */ 2089 2100 vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) { 2090 2101 /* found one - only interested if it's shared out of the page 2091 2102 * cache */ 2092 2103 if (vma->vm_flags & VM_SHARED) { 2093 - mutex_unlock(&inode->i_mapping->i_mmap_mutex); 2104 + i_mmap_unlock_read(inode->i_mapping); 2094 2105 up_write(&nommu_region_sem); 2095 2106 return -ETXTBSY; /* not quite true, but near enough */ 2096 2107 } ··· 2102 2113 * we don't check for any regions that start beyond the EOF as there 2103 2114 * shouldn't be any 2104 2115 */ 2105 - vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 2106 - 0, ULONG_MAX) { 2116 + vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) { 2107 2117 if (!(vma->vm_flags & VM_SHARED)) 2108 2118 continue; 2109 2119 ··· 2117 2129 } 2118 2130 } 2119 2131 2120 - mutex_unlock(&inode->i_mapping->i_mmap_mutex); 2132 + i_mmap_unlock_read(inode->i_mapping); 2121 2133 up_write(&nommu_region_sem); 2122 2134 return 0; 2123 2135 }
+5 -10
mm/oom_kill.c
··· 281 281 if (oom_task_origin(task)) 282 282 return OOM_SCAN_SELECT; 283 283 284 - if (task->flags & PF_EXITING && !force_kill) { 285 - /* 286 - * If this task is not being ptraced on exit, then wait for it 287 - * to finish before killing some other task unnecessarily. 288 - */ 289 - if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) 290 - return OOM_SCAN_ABORT; 291 - } 284 + if (task_will_free_mem(task) && !force_kill) 285 + return OOM_SCAN_ABORT; 286 + 292 287 return OOM_SCAN_OK; 293 288 } 294 289 ··· 438 443 * If the task is already exiting, don't alarm the sysadmin or kill 439 444 * its children or threads, just set TIF_MEMDIE so it can die quickly 440 445 */ 441 - if (p->flags & PF_EXITING) { 446 + if (task_will_free_mem(p)) { 442 447 set_tsk_thread_flag(p, TIF_MEMDIE); 443 448 put_task_struct(p); 444 449 return; ··· 644 649 * select it. The goal is to allow it to allocate so that it may 645 650 * quickly exit and free its memory. 646 651 */ 647 - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { 652 + if (fatal_signal_pending(current) || task_will_free_mem(current)) { 648 653 set_thread_flag(TIF_MEMDIE); 649 654 return; 650 655 }
+101 -36
mm/page_alloc.c
··· 48 48 #include <linux/backing-dev.h> 49 49 #include <linux/fault-inject.h> 50 50 #include <linux/page-isolation.h> 51 + #include <linux/page_ext.h> 51 52 #include <linux/debugobjects.h> 52 53 #include <linux/kmemleak.h> 53 54 #include <linux/compaction.h> ··· 56 55 #include <linux/prefetch.h> 57 56 #include <linux/mm_inline.h> 58 57 #include <linux/migrate.h> 59 - #include <linux/page-debug-flags.h> 58 + #include <linux/page_ext.h> 60 59 #include <linux/hugetlb.h> 61 60 #include <linux/sched/rt.h> 61 + #include <linux/page_owner.h> 62 62 63 63 #include <asm/sections.h> 64 64 #include <asm/tlbflush.h> ··· 426 424 427 425 #ifdef CONFIG_DEBUG_PAGEALLOC 428 426 unsigned int _debug_guardpage_minorder; 427 + bool _debug_pagealloc_enabled __read_mostly; 428 + bool _debug_guardpage_enabled __read_mostly; 429 + 430 + static int __init early_debug_pagealloc(char *buf) 431 + { 432 + if (!buf) 433 + return -EINVAL; 434 + 435 + if (strcmp(buf, "on") == 0) 436 + _debug_pagealloc_enabled = true; 437 + 438 + return 0; 439 + } 440 + early_param("debug_pagealloc", early_debug_pagealloc); 441 + 442 + static bool need_debug_guardpage(void) 443 + { 444 + /* If we don't use debug_pagealloc, we don't need guard page */ 445 + if (!debug_pagealloc_enabled()) 446 + return false; 447 + 448 + return true; 449 + } 450 + 451 + static void init_debug_guardpage(void) 452 + { 453 + if (!debug_pagealloc_enabled()) 454 + return; 455 + 456 + _debug_guardpage_enabled = true; 457 + } 458 + 459 + struct page_ext_operations debug_guardpage_ops = { 460 + .need = need_debug_guardpage, 461 + .init = init_debug_guardpage, 462 + }; 429 463 430 464 static int __init debug_guardpage_minorder_setup(char *buf) 431 465 { ··· 477 439 } 478 440 __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup); 479 441 480 - static inline void set_page_guard_flag(struct page *page) 442 + static inline void set_page_guard(struct zone *zone, struct page *page, 443 + unsigned int order, int migratetype) 481 444 { 
482 - __set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags); 445 + struct page_ext *page_ext; 446 + 447 + if (!debug_guardpage_enabled()) 448 + return; 449 + 450 + page_ext = lookup_page_ext(page); 451 + __set_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); 452 + 453 + INIT_LIST_HEAD(&page->lru); 454 + set_page_private(page, order); 455 + /* Guard pages are not available for any usage */ 456 + __mod_zone_freepage_state(zone, -(1 << order), migratetype); 483 457 } 484 458 485 - static inline void clear_page_guard_flag(struct page *page) 459 + static inline void clear_page_guard(struct zone *zone, struct page *page, 460 + unsigned int order, int migratetype) 486 461 { 487 - __clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags); 462 + struct page_ext *page_ext; 463 + 464 + if (!debug_guardpage_enabled()) 465 + return; 466 + 467 + page_ext = lookup_page_ext(page); 468 + __clear_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); 469 + 470 + set_page_private(page, 0); 471 + if (!is_migrate_isolate(migratetype)) 472 + __mod_zone_freepage_state(zone, (1 << order), migratetype); 488 473 } 489 474 #else 490 - static inline void set_page_guard_flag(struct page *page) { } 491 - static inline void clear_page_guard_flag(struct page *page) { } 475 + struct page_ext_operations debug_guardpage_ops = { NULL, }; 476 + static inline void set_page_guard(struct zone *zone, struct page *page, 477 + unsigned int order, int migratetype) {} 478 + static inline void clear_page_guard(struct zone *zone, struct page *page, 479 + unsigned int order, int migratetype) {} 492 480 #endif 493 481 494 482 static inline void set_page_order(struct page *page, unsigned int order) ··· 645 581 * merge with it and move up one order. 
646 582 */ 647 583 if (page_is_guard(buddy)) { 648 - clear_page_guard_flag(buddy); 649 - set_page_private(buddy, 0); 650 - if (!is_migrate_isolate(migratetype)) { 651 - __mod_zone_freepage_state(zone, 1 << order, 652 - migratetype); 653 - } 584 + clear_page_guard(zone, buddy, order, migratetype); 654 585 } else { 655 586 list_del(&buddy->lru); 656 587 zone->free_area[order].nr_free--; ··· 814 755 if (bad) 815 756 return false; 816 757 758 + reset_page_owner(page, order); 759 + 817 760 if (!PageHighMem(page)) { 818 761 debug_check_no_locks_freed(page_address(page), 819 762 PAGE_SIZE << order); ··· 922 861 size >>= 1; 923 862 VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]); 924 863 925 - #ifdef CONFIG_DEBUG_PAGEALLOC 926 - if (high < debug_guardpage_minorder()) { 864 + if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) && 865 + debug_guardpage_enabled() && 866 + high < debug_guardpage_minorder()) { 927 867 /* 928 868 * Mark as guard pages (or page), that will allow to 929 869 * merge back to allocator when buddy will be freed. 
930 870 * Corresponding page table entries will not be touched, 931 871 * pages will stay not present in virtual address space 932 872 */ 933 - INIT_LIST_HEAD(&page[size].lru); 934 - set_page_guard_flag(&page[size]); 935 - set_page_private(&page[size], high); 936 - /* Guard pages are not available for any usage */ 937 - __mod_zone_freepage_state(zone, -(1 << high), 938 - migratetype); 873 + set_page_guard(zone, &page[size], high, migratetype); 939 874 continue; 940 875 } 941 - #endif 942 876 list_add(&page[size].lru, &area->free_list[migratetype]); 943 877 area->nr_free++; 944 878 set_page_order(&page[size], high); ··· 990 934 991 935 if (order && (gfp_flags & __GFP_COMP)) 992 936 prep_compound_page(page, order); 937 + 938 + set_page_owner(page, order, gfp_flags); 993 939 994 940 return 0; 995 941 } ··· 1565 1507 split_page(virt_to_page(page[0].shadow), order); 1566 1508 #endif 1567 1509 1568 - for (i = 1; i < (1 << order); i++) 1510 + set_page_owner(page, 0, 0); 1511 + for (i = 1; i < (1 << order); i++) { 1569 1512 set_page_refcounted(page + i); 1513 + set_page_owner(page + i, 0, 0); 1514 + } 1570 1515 } 1571 1516 EXPORT_SYMBOL_GPL(split_page); 1572 1517 ··· 1609 1548 } 1610 1549 } 1611 1550 1551 + set_page_owner(page, order, 0); 1612 1552 return 1UL << order; 1613 1553 } 1614 1554 ··· 4918 4856 #endif 4919 4857 init_waitqueue_head(&pgdat->kswapd_wait); 4920 4858 init_waitqueue_head(&pgdat->pfmemalloc_wait); 4859 + pgdat_page_ext_init(pgdat); 4921 4860 4922 4861 for (j = 0; j < MAX_NR_ZONES; j++) { 4923 4862 struct zone *zone = pgdat->node_zones + j; ··· 4937 4874 * and per-cpu initialisations 4938 4875 */ 4939 4876 memmap_pages = calc_memmap_size(size, realsize); 4940 - if (freesize >= memmap_pages) { 4941 - freesize -= memmap_pages; 4942 - if (memmap_pages) 4943 - printk(KERN_DEBUG 4944 - " %s zone: %lu pages used for memmap\n", 4945 - zone_names[j], memmap_pages); 4946 - } else 4947 - printk(KERN_WARNING 4948 - " %s zone: %lu pages exceeds freesize %lu\n", 4949 
- zone_names[j], memmap_pages, freesize); 4877 + if (!is_highmem_idx(j)) { 4878 + if (freesize >= memmap_pages) { 4879 + freesize -= memmap_pages; 4880 + if (memmap_pages) 4881 + printk(KERN_DEBUG 4882 + " %s zone: %lu pages used for memmap\n", 4883 + zone_names[j], memmap_pages); 4884 + } else 4885 + printk(KERN_WARNING 4886 + " %s zone: %lu pages exceeds freesize %lu\n", 4887 + zone_names[j], memmap_pages, freesize); 4888 + } 4950 4889 4951 4890 /* Account for reserved pages */ 4952 4891 if (j == 0 && freesize > dma_reserve) { ··· 6286 6221 if (!PageLRU(page)) 6287 6222 found++; 6288 6223 /* 6289 - * If there are RECLAIMABLE pages, we need to check it. 6290 - * But now, memory offline itself doesn't call shrink_slab() 6291 - * and it still to be fixed. 6224 + * If there are RECLAIMABLE pages, we need to check 6225 + * it. But now, memory offline itself doesn't call 6226 + * shrink_node_slabs() and this still needs to be fixed. 6292 6227 */ 6293 6228 /* 6294 6229 * If the page is not RAM, page_count()should be 0.
+403
mm/page_ext.c
··· 1 + #include <linux/mm.h> 2 + #include <linux/mmzone.h> 3 + #include <linux/bootmem.h> 4 + #include <linux/page_ext.h> 5 + #include <linux/memory.h> 6 + #include <linux/vmalloc.h> 7 + #include <linux/kmemleak.h> 8 + #include <linux/page_owner.h> 9 + 10 + /* 11 + * struct page extension 12 + * 13 + * This feature manages memory for extended data attached to each page. 14 + * 15 + * Until now, we have had to modify struct page itself to store extra data per page. 16 + * That requires rebuilding the kernel, which is a really time-consuming process. 17 + * And, sometimes, a rebuild is impossible due to third-party module dependencies. 18 + * Finally, enlarging struct page could cause unwanted changes in system behaviour. 19 + * 20 + * This feature is intended to overcome the problems mentioned above. It 21 + * allocates memory for extended data per page in a separate place rather than 22 + * in struct page itself. This memory can be accessed by the accessor 23 + * functions provided by this code. During the boot process, it checks whether 24 + * allocation of this huge chunk of memory is needed or not. If not, it avoids 25 + * allocating memory at all. With this advantage, we can include this feature 26 + * in the kernel by default and can avoid rebuilds and the related problems. 27 + * 28 + * To make all of this work well, there are two callbacks for clients. One 29 + * is the need callback, which is mandatory if a user wants to avoid useless 30 + * memory allocation at boot time. The other, the init callback, is optional and 31 + * is used to do proper initialization after memory is allocated. 32 + * 33 + * The need callback is used to decide whether extended memory allocation is 34 + * needed or not. Sometimes users want to deactivate some features on a given 35 + * boot, making the extra memory unnecessary. In that case, to avoid 36 + * allocating a huge chunk of memory, each client expresses its need for 37 + * extra memory through the need callback.
If one of the need callbacks 38 + * returns true, it means that someone needs extra memory, so the 39 + * page extension core should allocate memory for page extension. If 40 + * none of the need callbacks returns true, no memory is needed on this boot 41 + * and the page extension core can skip the allocation. As a result, 42 + * no memory is wasted. 43 + * 44 + * The init callback is used to do proper initialization after page extension 45 + * is completely initialized. In a sparse memory system, the extra memory is 46 + * allocated some time after the memmap is allocated. In other words, the lifetime 47 + * of the memory for page extension isn't the same as that of the memmap for struct page. 48 + * Therefore, clients can't store extra data until page extension is 49 + * initialized, even if pages are already allocated and used freely. This could 50 + * leave the extra data per page in an inadequate state, so, to prevent that, a client 51 + * can use this callback to initialize that state correctly. 52 + */ 53 + 54 + static struct page_ext_operations *page_ext_ops[] = { 55 + &debug_guardpage_ops, 56 + #ifdef CONFIG_PAGE_POISONING 57 + &page_poisoning_ops, 58 + #endif 59 + #ifdef CONFIG_PAGE_OWNER 60 + &page_owner_ops, 61 + #endif 62 + }; 63 + 64 + static unsigned long total_usage; 65 + 66 + static bool __init invoke_need_callbacks(void) 67 + { 68 + int i; 69 + int entries = ARRAY_SIZE(page_ext_ops); 70 + 71 + for (i = 0; i < entries; i++) { 72 + if (page_ext_ops[i]->need && page_ext_ops[i]->need()) 73 + return true; 74 + } 75 + 76 + return false; 77 + } 78 + 79 + static void __init invoke_init_callbacks(void) 80 + { 81 + int i; 82 + int entries = ARRAY_SIZE(page_ext_ops); 83 + 84 + for (i = 0; i < entries; i++) { 85 + if (page_ext_ops[i]->init) 86 + page_ext_ops[i]->init(); 87 + } 88 + } 89 + 90 + #if !defined(CONFIG_SPARSEMEM) 91 + 92 + 93 + void __meminit pgdat_page_ext_init(struct pglist_data *pgdat) 94 + { 95 + pgdat->node_page_ext = NULL; 96 + } 97 + 98 + struct page_ext
*lookup_page_ext(struct page *page) 99 + { 100 + unsigned long pfn = page_to_pfn(page); 101 + unsigned long offset; 102 + struct page_ext *base; 103 + 104 + base = NODE_DATA(page_to_nid(page))->node_page_ext; 105 + #ifdef CONFIG_DEBUG_VM 106 + /* 107 + * The sanity checks the page allocator does upon freeing a 108 + * page can reach here before the page_ext arrays are 109 + * allocated when feeding a range of pages to the allocator 110 + * for the first time during bootup or memory hotplug. 111 + */ 112 + if (unlikely(!base)) 113 + return NULL; 114 + #endif 115 + offset = pfn - round_down(node_start_pfn(page_to_nid(page)), 116 + MAX_ORDER_NR_PAGES); 117 + return base + offset; 118 + } 119 + 120 + static int __init alloc_node_page_ext(int nid) 121 + { 122 + struct page_ext *base; 123 + unsigned long table_size; 124 + unsigned long nr_pages; 125 + 126 + nr_pages = NODE_DATA(nid)->node_spanned_pages; 127 + if (!nr_pages) 128 + return 0; 129 + 130 + /* 131 + * Need extra space if node range is not aligned with 132 + * MAX_ORDER_NR_PAGES. When page allocator's buddy algorithm 133 + * checks buddy's status, range could be out of exact node range. 
134 + */ 135 + if (!IS_ALIGNED(node_start_pfn(nid), MAX_ORDER_NR_PAGES) || 136 + !IS_ALIGNED(node_end_pfn(nid), MAX_ORDER_NR_PAGES)) 137 + nr_pages += MAX_ORDER_NR_PAGES; 138 + 139 + table_size = sizeof(struct page_ext) * nr_pages; 140 + 141 + base = memblock_virt_alloc_try_nid_nopanic( 142 + table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), 143 + BOOTMEM_ALLOC_ACCESSIBLE, nid); 144 + if (!base) 145 + return -ENOMEM; 146 + NODE_DATA(nid)->node_page_ext = base; 147 + total_usage += table_size; 148 + return 0; 149 + } 150 + 151 + void __init page_ext_init_flatmem(void) 152 + { 153 + 154 + int nid, fail; 155 + 156 + if (!invoke_need_callbacks()) 157 + return; 158 + 159 + for_each_online_node(nid) { 160 + fail = alloc_node_page_ext(nid); 161 + if (fail) 162 + goto fail; 163 + } 164 + pr_info("allocated %ld bytes of page_ext\n", total_usage); 165 + invoke_init_callbacks(); 166 + return; 167 + 168 + fail: 169 + pr_crit("allocation of page_ext failed.\n"); 170 + panic("Out of memory"); 171 + } 172 + 173 + #else /* CONFIG_FLAT_NODE_MEM_MAP */ 174 + 175 + struct page_ext *lookup_page_ext(struct page *page) 176 + { 177 + unsigned long pfn = page_to_pfn(page); 178 + struct mem_section *section = __pfn_to_section(pfn); 179 + #ifdef CONFIG_DEBUG_VM 180 + /* 181 + * The sanity checks the page allocator does upon freeing a 182 + * page can reach here before the page_ext arrays are 183 + * allocated when feeding a range of pages to the allocator 184 + * for the first time during bootup or memory hotplug. 
185 + */ 186 + if (!section->page_ext) 187 + return NULL; 188 + #endif 189 + return section->page_ext + pfn; 190 + } 191 + 192 + static void *__meminit alloc_page_ext(size_t size, int nid) 193 + { 194 + gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN; 195 + void *addr = NULL; 196 + 197 + addr = alloc_pages_exact_nid(nid, size, flags); 198 + if (addr) { 199 + kmemleak_alloc(addr, size, 1, flags); 200 + return addr; 201 + } 202 + 203 + if (node_state(nid, N_HIGH_MEMORY)) 204 + addr = vzalloc_node(size, nid); 205 + else 206 + addr = vzalloc(size); 207 + 208 + return addr; 209 + } 210 + 211 + static int __meminit init_section_page_ext(unsigned long pfn, int nid) 212 + { 213 + struct mem_section *section; 214 + struct page_ext *base; 215 + unsigned long table_size; 216 + 217 + section = __pfn_to_section(pfn); 218 + 219 + if (section->page_ext) 220 + return 0; 221 + 222 + table_size = sizeof(struct page_ext) * PAGES_PER_SECTION; 223 + base = alloc_page_ext(table_size, nid); 224 + 225 + /* 226 + * The value stored in section->page_ext is (base - pfn) 227 + * and it does not point to the memory block allocated above, 228 + * causing kmemleak false positives. 229 + */ 230 + kmemleak_not_leak(base); 231 + 232 + if (!base) { 233 + pr_err("page ext allocation failure\n"); 234 + return -ENOMEM; 235 + } 236 + 237 + /* 238 + * The passed "pfn" may not be aligned to SECTION. For the calculation 239 + * we need to apply a mask. 
240 + */ 241 + pfn &= PAGE_SECTION_MASK; 242 + section->page_ext = base - pfn; 243 + total_usage += table_size; 244 + return 0; 245 + } 246 + #ifdef CONFIG_MEMORY_HOTPLUG 247 + static void free_page_ext(void *addr) 248 + { 249 + if (is_vmalloc_addr(addr)) { 250 + vfree(addr); 251 + } else { 252 + struct page *page = virt_to_page(addr); 253 + size_t table_size; 254 + 255 + table_size = sizeof(struct page_ext) * PAGES_PER_SECTION; 256 + 257 + BUG_ON(PageReserved(page)); 258 + free_pages_exact(addr, table_size); 259 + } 260 + } 261 + 262 + static void __free_page_ext(unsigned long pfn) 263 + { 264 + struct mem_section *ms; 265 + struct page_ext *base; 266 + 267 + ms = __pfn_to_section(pfn); 268 + if (!ms || !ms->page_ext) 269 + return; 270 + base = ms->page_ext + pfn; 271 + free_page_ext(base); 272 + ms->page_ext = NULL; 273 + } 274 + 275 + static int __meminit online_page_ext(unsigned long start_pfn, 276 + unsigned long nr_pages, 277 + int nid) 278 + { 279 + unsigned long start, end, pfn; 280 + int fail = 0; 281 + 282 + start = SECTION_ALIGN_DOWN(start_pfn); 283 + end = SECTION_ALIGN_UP(start_pfn + nr_pages); 284 + 285 + if (nid == -1) { 286 + /* 287 + * In this case, "nid" already exists and contains valid memory. 288 + * "start_pfn" passed to us is a pfn which is an arg for 289 + * online__pages(), and start_pfn should exist. 
290 + */ 291 + nid = pfn_to_nid(start_pfn); 292 + VM_BUG_ON(!node_state(nid, N_ONLINE)); 293 + } 294 + 295 + for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) { 296 + if (!pfn_present(pfn)) 297 + continue; 298 + fail = init_section_page_ext(pfn, nid); 299 + } 300 + if (!fail) 301 + return 0; 302 + 303 + /* rollback */ 304 + for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) 305 + __free_page_ext(pfn); 306 + 307 + return -ENOMEM; 308 + } 309 + 310 + static int __meminit offline_page_ext(unsigned long start_pfn, 311 + unsigned long nr_pages, int nid) 312 + { 313 + unsigned long start, end, pfn; 314 + 315 + start = SECTION_ALIGN_DOWN(start_pfn); 316 + end = SECTION_ALIGN_UP(start_pfn + nr_pages); 317 + 318 + for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) 319 + __free_page_ext(pfn); 320 + return 0; 321 + 322 + } 323 + 324 + static int __meminit page_ext_callback(struct notifier_block *self, 325 + unsigned long action, void *arg) 326 + { 327 + struct memory_notify *mn = arg; 328 + int ret = 0; 329 + 330 + switch (action) { 331 + case MEM_GOING_ONLINE: 332 + ret = online_page_ext(mn->start_pfn, 333 + mn->nr_pages, mn->status_change_nid); 334 + break; 335 + case MEM_OFFLINE: 336 + offline_page_ext(mn->start_pfn, 337 + mn->nr_pages, mn->status_change_nid); 338 + break; 339 + case MEM_CANCEL_ONLINE: 340 + offline_page_ext(mn->start_pfn, 341 + mn->nr_pages, mn->status_change_nid); 342 + break; 343 + case MEM_GOING_OFFLINE: 344 + break; 345 + case MEM_ONLINE: 346 + case MEM_CANCEL_OFFLINE: 347 + break; 348 + } 349 + 350 + return notifier_from_errno(ret); 351 + } 352 + 353 + #endif 354 + 355 + void __init page_ext_init(void) 356 + { 357 + unsigned long pfn; 358 + int nid; 359 + 360 + if (!invoke_need_callbacks()) 361 + return; 362 + 363 + for_each_node_state(nid, N_MEMORY) { 364 + unsigned long start_pfn, end_pfn; 365 + 366 + start_pfn = node_start_pfn(nid); 367 + end_pfn = node_end_pfn(nid); 368 + /* 369 + * start_pfn and end_pfn may not be aligned 
to SECTION and the 370 + * page->flags of out of node pages are not initialized. So we 371 + * scan [start_pfn, the biggest section's pfn < end_pfn) here. 372 + */ 373 + for (pfn = start_pfn; pfn < end_pfn; 374 + pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) { 375 + 376 + if (!pfn_valid(pfn)) 377 + continue; 378 + /* 379 + * Nodes's pfns can be overlapping. 380 + * We know some arch can have a nodes layout such as 381 + * -------------pfn--------------> 382 + * N0 | N1 | N2 | N0 | N1 | N2|.... 383 + */ 384 + if (pfn_to_nid(pfn) != nid) 385 + continue; 386 + if (init_section_page_ext(pfn, nid)) 387 + goto oom; 388 + } 389 + } 390 + hotplug_memory_notifier(page_ext_callback, 0); 391 + pr_info("allocated %ld bytes of page_ext\n", total_usage); 392 + invoke_init_callbacks(); 393 + return; 394 + 395 + oom: 396 + panic("Out of memory"); 397 + } 398 + 399 + void __meminit pgdat_page_ext_init(struct pglist_data *pgdat) 400 + { 401 + } 402 + 403 + #endif
+311
mm/page_owner.c
··· 1 + #include <linux/debugfs.h> 2 + #include <linux/mm.h> 3 + #include <linux/slab.h> 4 + #include <linux/uaccess.h> 5 + #include <linux/bootmem.h> 6 + #include <linux/stacktrace.h> 7 + #include <linux/page_owner.h> 8 + #include "internal.h" 9 + 10 + static bool page_owner_disabled = true; 11 + bool page_owner_inited __read_mostly; 12 + 13 + static void init_early_allocated_pages(void); 14 + 15 + static int early_page_owner_param(char *buf) 16 + { 17 + if (!buf) 18 + return -EINVAL; 19 + 20 + if (strcmp(buf, "on") == 0) 21 + page_owner_disabled = false; 22 + 23 + return 0; 24 + } 25 + early_param("page_owner", early_page_owner_param); 26 + 27 + static bool need_page_owner(void) 28 + { 29 + if (page_owner_disabled) 30 + return false; 31 + 32 + return true; 33 + } 34 + 35 + static void init_page_owner(void) 36 + { 37 + if (page_owner_disabled) 38 + return; 39 + 40 + page_owner_inited = true; 41 + init_early_allocated_pages(); 42 + } 43 + 44 + struct page_ext_operations page_owner_ops = { 45 + .need = need_page_owner, 46 + .init = init_page_owner, 47 + }; 48 + 49 + void __reset_page_owner(struct page *page, unsigned int order) 50 + { 51 + int i; 52 + struct page_ext *page_ext; 53 + 54 + for (i = 0; i < (1 << order); i++) { 55 + page_ext = lookup_page_ext(page + i); 56 + __clear_bit(PAGE_EXT_OWNER, &page_ext->flags); 57 + } 58 + } 59 + 60 + void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask) 61 + { 62 + struct page_ext *page_ext; 63 + struct stack_trace *trace; 64 + 65 + page_ext = lookup_page_ext(page); 66 + 67 + trace = &page_ext->trace; 68 + trace->nr_entries = 0; 69 + trace->max_entries = ARRAY_SIZE(page_ext->trace_entries); 70 + trace->entries = &page_ext->trace_entries[0]; 71 + trace->skip = 3; 72 + save_stack_trace(&page_ext->trace); 73 + 74 + page_ext->order = order; 75 + page_ext->gfp_mask = gfp_mask; 76 + 77 + __set_bit(PAGE_EXT_OWNER, &page_ext->flags); 78 + } 79 + 80 + static ssize_t 81 + print_page_owner(char __user *buf, size_t 
count, unsigned long pfn, 82 + struct page *page, struct page_ext *page_ext) 83 + { 84 + int ret; 85 + int pageblock_mt, page_mt; 86 + char *kbuf; 87 + 88 + kbuf = kmalloc(count, GFP_KERNEL); 89 + if (!kbuf) 90 + return -ENOMEM; 91 + 92 + ret = snprintf(kbuf, count, 93 + "Page allocated via order %u, mask 0x%x\n", 94 + page_ext->order, page_ext->gfp_mask); 95 + 96 + if (ret >= count) 97 + goto err; 98 + 99 + /* Print information relevant to grouping pages by mobility */ 100 + pageblock_mt = get_pfnblock_migratetype(page, pfn); 101 + page_mt = gfpflags_to_migratetype(page_ext->gfp_mask); 102 + ret += snprintf(kbuf + ret, count - ret, 103 + "PFN %lu Block %lu type %d %s Flags %s%s%s%s%s%s%s%s%s%s%s%s\n", 104 + pfn, 105 + pfn >> pageblock_order, 106 + pageblock_mt, 107 + pageblock_mt != page_mt ? "Fallback" : " ", 108 + PageLocked(page) ? "K" : " ", 109 + PageError(page) ? "E" : " ", 110 + PageReferenced(page) ? "R" : " ", 111 + PageUptodate(page) ? "U" : " ", 112 + PageDirty(page) ? "D" : " ", 113 + PageLRU(page) ? "L" : " ", 114 + PageActive(page) ? "A" : " ", 115 + PageSlab(page) ? "S" : " ", 116 + PageWriteback(page) ? "W" : " ", 117 + PageCompound(page) ? "C" : " ", 118 + PageSwapCache(page) ? "B" : " ", 119 + PageMappedToDisk(page) ? 
"M" : " "); 120 + 121 + if (ret >= count) 122 + goto err; 123 + 124 + ret += snprint_stack_trace(kbuf + ret, count - ret, 125 + &page_ext->trace, 0); 126 + if (ret >= count) 127 + goto err; 128 + 129 + ret += snprintf(kbuf + ret, count - ret, "\n"); 130 + if (ret >= count) 131 + goto err; 132 + 133 + if (copy_to_user(buf, kbuf, ret)) 134 + ret = -EFAULT; 135 + 136 + kfree(kbuf); 137 + return ret; 138 + 139 + err: 140 + kfree(kbuf); 141 + return -ENOMEM; 142 + } 143 + 144 + static ssize_t 145 + read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos) 146 + { 147 + unsigned long pfn; 148 + struct page *page; 149 + struct page_ext *page_ext; 150 + 151 + if (!page_owner_inited) 152 + return -EINVAL; 153 + 154 + page = NULL; 155 + pfn = min_low_pfn + *ppos; 156 + 157 + /* Find a valid PFN or the start of a MAX_ORDER_NR_PAGES area */ 158 + while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) 159 + pfn++; 160 + 161 + drain_all_pages(NULL); 162 + 163 + /* Find an allocated page */ 164 + for (; pfn < max_pfn; pfn++) { 165 + /* 166 + * If the new page is in a new MAX_ORDER_NR_PAGES area, 167 + * validate the area as existing, skip it if not 168 + */ 169 + if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(pfn)) { 170 + pfn += MAX_ORDER_NR_PAGES - 1; 171 + continue; 172 + } 173 + 174 + /* Check for holes within a MAX_ORDER area */ 175 + if (!pfn_valid_within(pfn)) 176 + continue; 177 + 178 + page = pfn_to_page(pfn); 179 + if (PageBuddy(page)) { 180 + unsigned long freepage_order = page_order_unsafe(page); 181 + 182 + if (freepage_order < MAX_ORDER) 183 + pfn += (1UL << freepage_order) - 1; 184 + continue; 185 + } 186 + 187 + page_ext = lookup_page_ext(page); 188 + 189 + /* 190 + * Some pages could be missed by concurrent allocation or free, 191 + * because we don't hold the zone lock. 
192 + */ 193 + if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) 194 + continue; 195 + 196 + /* Record the next PFN to read in the file offset */ 197 + *ppos = (pfn - min_low_pfn) + 1; 198 + 199 + return print_page_owner(buf, count, pfn, page, page_ext); 200 + } 201 + 202 + return 0; 203 + } 204 + 205 + static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone) 206 + { 207 + struct page *page; 208 + struct page_ext *page_ext; 209 + unsigned long pfn = zone->zone_start_pfn, block_end_pfn; 210 + unsigned long end_pfn = pfn + zone->spanned_pages; 211 + unsigned long count = 0; 212 + 213 + /* Scan block by block. First and last block may be incomplete */ 214 + pfn = zone->zone_start_pfn; 215 + 216 + /* 217 + * Walk the zone in pageblock_nr_pages steps. If a page block spans 218 + * a zone boundary, it will be double counted between zones. This does 219 + * not matter as the mixed block count will still be correct 220 + */ 221 + for (; pfn < end_pfn; ) { 222 + if (!pfn_valid(pfn)) { 223 + pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES); 224 + continue; 225 + } 226 + 227 + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); 228 + block_end_pfn = min(block_end_pfn, end_pfn); 229 + 230 + page = pfn_to_page(pfn); 231 + 232 + for (; pfn < block_end_pfn; pfn++) { 233 + if (!pfn_valid_within(pfn)) 234 + continue; 235 + 236 + page = pfn_to_page(pfn); 237 + 238 + /* 239 + * We are safe to check buddy flag and order, because 240 + * this is init stage and only single thread runs. 
241 + */ 242 + if (PageBuddy(page)) { 243 + pfn += (1UL << page_order(page)) - 1; 244 + continue; 245 + } 246 + 247 + if (PageReserved(page)) 248 + continue; 249 + 250 + page_ext = lookup_page_ext(page); 251 + 252 + /* Maybe overraping zone */ 253 + if (test_bit(PAGE_EXT_OWNER, &page_ext->flags)) 254 + continue; 255 + 256 + /* Found early allocated page */ 257 + set_page_owner(page, 0, 0); 258 + count++; 259 + } 260 + } 261 + 262 + pr_info("Node %d, zone %8s: page owner found early allocated %lu pages\n", 263 + pgdat->node_id, zone->name, count); 264 + } 265 + 266 + static void init_zones_in_node(pg_data_t *pgdat) 267 + { 268 + struct zone *zone; 269 + struct zone *node_zones = pgdat->node_zones; 270 + unsigned long flags; 271 + 272 + for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) { 273 + if (!populated_zone(zone)) 274 + continue; 275 + 276 + spin_lock_irqsave(&zone->lock, flags); 277 + init_pages_in_zone(pgdat, zone); 278 + spin_unlock_irqrestore(&zone->lock, flags); 279 + } 280 + } 281 + 282 + static void init_early_allocated_pages(void) 283 + { 284 + pg_data_t *pgdat; 285 + 286 + drain_all_pages(NULL); 287 + for_each_online_pgdat(pgdat) 288 + init_zones_in_node(pgdat); 289 + } 290 + 291 + static const struct file_operations proc_page_owner_operations = { 292 + .read = read_page_owner, 293 + }; 294 + 295 + static int __init pageowner_init(void) 296 + { 297 + struct dentry *dentry; 298 + 299 + if (!page_owner_inited) { 300 + pr_info("page_owner is disabled\n"); 301 + return 0; 302 + } 303 + 304 + dentry = debugfs_create_file("page_owner", S_IRUSR, NULL, 305 + NULL, &proc_page_owner_operations); 306 + if (IS_ERR(dentry)) 307 + return PTR_ERR(dentry); 308 + 309 + return 0; 310 + } 311 + module_init(pageowner_init)
+10 -8
mm/rmap.c
··· 23 23 * inode->i_mutex (while writing or truncating, not reading or faulting) 24 24 * mm->mmap_sem 25 25 * page->flags PG_locked (lock_page) 26 - * mapping->i_mmap_mutex 26 + * mapping->i_mmap_rwsem 27 27 * anon_vma->rwsem 28 28 * mm->page_table_lock or pte_lock 29 29 * zone->lru_lock (in mark_page_accessed, isolate_lru_page) ··· 1260 1260 /* 1261 1261 * We need mmap_sem locking, Otherwise VM_LOCKED check makes 1262 1262 * unstable result and race. Plus, We can't wait here because 1263 - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. 1263 + * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem. 1264 1264 * if trylock failed, the page remain in evictable lru and later 1265 1265 * vmscan could retry to move the page to unevictable lru if the 1266 1266 * page is actually mlocked. ··· 1635 1635 static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc) 1636 1636 { 1637 1637 struct anon_vma *anon_vma; 1638 - pgoff_t pgoff = page_to_pgoff(page); 1638 + pgoff_t pgoff; 1639 1639 struct anon_vma_chain *avc; 1640 1640 int ret = SWAP_AGAIN; 1641 1641 ··· 1643 1643 if (!anon_vma) 1644 1644 return ret; 1645 1645 1646 + pgoff = page_to_pgoff(page); 1646 1647 anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) { 1647 1648 struct vm_area_struct *vma = avc->vma; 1648 1649 unsigned long address = vma_address(page, vma); ··· 1677 1676 static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) 1678 1677 { 1679 1678 struct address_space *mapping = page->mapping; 1680 - pgoff_t pgoff = page_to_pgoff(page); 1679 + pgoff_t pgoff; 1681 1680 struct vm_area_struct *vma; 1682 1681 int ret = SWAP_AGAIN; 1683 1682 ··· 1685 1684 * The page lock not only makes sure that page->mapping cannot 1686 1685 * suddenly be NULLified by truncation, it makes sure that the 1687 1686 * structure at mapping cannot be freed and reused yet, 1688 - * so we can safely take mapping->i_mmap_mutex. 1687 + * so we can safely take mapping->i_mmap_rwsem. 
1689 1688 */ 1690 1689 VM_BUG_ON_PAGE(!PageLocked(page), page); 1691 1690 1692 1691 if (!mapping) 1693 1692 return ret; 1694 - mutex_lock(&mapping->i_mmap_mutex); 1693 + 1694 + pgoff = page_to_pgoff(page); 1695 + i_mmap_lock_read(mapping); 1695 1696 vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { 1696 1697 unsigned long address = vma_address(page, vma); 1697 1698 ··· 1714 1711 goto done; 1715 1712 1716 1713 ret = rwc->file_nonlinear(page, mapping, rwc->arg); 1717 - 1718 1714 done: 1719 - mutex_unlock(&mapping->i_mmap_mutex); 1715 + i_mmap_unlock_read(mapping); 1720 1716 return ret; 1721 1717 } 1722 1718
+3 -1
mm/slab.c
··· 3015 3015 for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { 3016 3016 nid = zone_to_nid(zone); 3017 3017 3018 - if (cpuset_zone_allowed(zone, flags | __GFP_HARDWALL) && 3018 + if (cpuset_zone_allowed(zone, flags) && 3019 3019 get_node(cache, nid) && 3020 3020 get_node(cache, nid)->free_objects) { 3021 3021 obj = ____cache_alloc_node(cache, ··· 3182 3182 memset(ptr, 0, cachep->object_size); 3183 3183 } 3184 3184 3185 + memcg_kmem_put_cache(cachep); 3185 3186 return ptr; 3186 3187 } 3187 3188 ··· 3248 3247 memset(objp, 0, cachep->object_size); 3249 3248 } 3250 3249 3250 + memcg_kmem_put_cache(cachep); 3251 3251 return objp; 3252 3252 } 3253 3253
+10 -7
mm/slub.c
··· 1233 1233 kmemleak_free(x); 1234 1234 } 1235 1235 1236 - static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) 1236 + static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, 1237 + gfp_t flags) 1237 1238 { 1238 1239 flags &= gfp_allowed_mask; 1239 1240 lockdep_trace_alloc(flags); 1240 1241 might_sleep_if(flags & __GFP_WAIT); 1241 1242 1242 - return should_failslab(s->object_size, flags, s->flags); 1243 + if (should_failslab(s->object_size, flags, s->flags)) 1244 + return NULL; 1245 + 1246 + return memcg_kmem_get_cache(s, flags); 1243 1247 } 1244 1248 1245 1249 static inline void slab_post_alloc_hook(struct kmem_cache *s, ··· 1252 1248 flags &= gfp_allowed_mask; 1253 1249 kmemcheck_slab_alloc(s, flags, object, slab_ksize(s)); 1254 1250 kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags); 1251 + memcg_kmem_put_cache(s); 1255 1252 } 1256 1253 1257 1254 static inline void slab_free_hook(struct kmem_cache *s, void *x) ··· 1670 1665 1671 1666 n = get_node(s, zone_to_nid(zone)); 1672 1667 1673 - if (n && cpuset_zone_allowed(zone, 1674 - flags | __GFP_HARDWALL) && 1668 + if (n && cpuset_zone_allowed(zone, flags) && 1675 1669 n->nr_partial > s->min_partial) { 1676 1670 object = get_partial_node(s, n, c, flags); 1677 1671 if (object) { ··· 2388 2384 struct page *page; 2389 2385 unsigned long tid; 2390 2386 2391 - if (slab_pre_alloc_hook(s, gfpflags)) 2387 + s = slab_pre_alloc_hook(s, gfpflags); 2388 + if (!s) 2392 2389 return NULL; 2393 - 2394 - s = memcg_kmem_get_cache(s, gfpflags); 2395 2390 redo: 2396 2391 /* 2397 2392 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
+2
mm/vmacache.c
··· 17 17 { 18 18 struct task_struct *g, *p; 19 19 20 + count_vm_vmacache_event(VMACACHE_FULL_FLUSHES); 21 + 20 22 /* 21 23 * Single threaded tasks need not iterate the entire 22 24 * list of process. We can avoid the flushing as well
+2 -2
mm/vmalloc.c
··· 2574 2574 if (!counters) 2575 2575 return; 2576 2576 2577 - /* Pair with smp_wmb() in clear_vm_uninitialized_flag() */ 2578 - smp_rmb(); 2579 2577 if (v->flags & VM_UNINITIALIZED) 2580 2578 return; 2579 + /* Pair with smp_wmb() in clear_vm_uninitialized_flag() */ 2580 + smp_rmb(); 2581 2581 2582 2582 memset(counters, 0, nr_node_ids * sizeof(unsigned int)); 2583 2583
+90 -126
mm/vmscan.c
··· 229 229 230 230 #define SHRINK_BATCH 128 231 231 232 - static unsigned long 233 - shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker, 234 - unsigned long nr_pages_scanned, unsigned long lru_pages) 232 + static unsigned long shrink_slabs(struct shrink_control *shrinkctl, 233 + struct shrinker *shrinker, 234 + unsigned long nr_scanned, 235 + unsigned long nr_eligible) 235 236 { 236 237 unsigned long freed = 0; 237 238 unsigned long long delta; ··· 256 255 nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); 257 256 258 257 total_scan = nr; 259 - delta = (4 * nr_pages_scanned) / shrinker->seeks; 258 + delta = (4 * nr_scanned) / shrinker->seeks; 260 259 delta *= freeable; 261 - do_div(delta, lru_pages + 1); 260 + do_div(delta, nr_eligible + 1); 262 261 total_scan += delta; 263 262 if (total_scan < 0) { 264 263 pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", ··· 290 289 total_scan = freeable * 2; 291 290 292 291 trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, 293 - nr_pages_scanned, lru_pages, 294 - freeable, delta, total_scan); 292 + nr_scanned, nr_eligible, 293 + freeable, delta, total_scan); 295 294 296 295 /* 297 296 * Normally, we should not scan less than batch_size objects in one ··· 340 339 return freed; 341 340 } 342 341 343 - /* 344 - * Call the shrink functions to age shrinkable caches 342 + /** 343 + * shrink_node_slabs - shrink slab caches of a given node 344 + * @gfp_mask: allocation context 345 + * @nid: node whose slab caches to target 346 + * @nr_scanned: pressure numerator 347 + * @nr_eligible: pressure denominator 345 348 * 346 - * Here we assume it costs one seek to replace a lru page and that it also 347 - * takes a seek to recreate a cache object. With this in mind we age equal 348 - * percentages of the lru and ageable caches. This should balance the seeks 349 - * generated by these structures. 349 + * Call the shrink functions to age shrinkable caches. 
350 350 * 351 - * If the vm encountered mapped pages on the LRU it increase the pressure on 352 - * slab to avoid swapping. 351 + * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set, 352 + * unaware shrinkers will receive a node id of 0 instead. 353 353 * 354 - * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits. 354 + * @nr_scanned and @nr_eligible form a ratio that indicate how much of 355 + * the available objects should be scanned. Page reclaim for example 356 + * passes the number of pages scanned and the number of pages on the 357 + * LRU lists that it considered on @nid, plus a bias in @nr_scanned 358 + * when it encountered mapped pages. The ratio is further biased by 359 + * the ->seeks setting of the shrink function, which indicates the 360 + * cost to recreate an object relative to that of an LRU page. 355 361 * 356 - * `lru_pages' represents the number of on-LRU pages in all the zones which 357 - * are eligible for the caller's allocation attempt. It is used for balancing 358 - * slab reclaim versus page reclaim. 359 - * 360 - * Returns the number of slab objects which we shrunk. 362 + * Returns the number of reclaimed slab objects. 
361 363 */ 362 - unsigned long shrink_slab(struct shrink_control *shrinkctl, 363 - unsigned long nr_pages_scanned, 364 - unsigned long lru_pages) 364 + unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, 365 + unsigned long nr_scanned, 366 + unsigned long nr_eligible) 365 367 { 366 368 struct shrinker *shrinker; 367 369 unsigned long freed = 0; 368 370 369 - if (nr_pages_scanned == 0) 370 - nr_pages_scanned = SWAP_CLUSTER_MAX; 371 + if (nr_scanned == 0) 372 + nr_scanned = SWAP_CLUSTER_MAX; 371 373 372 374 if (!down_read_trylock(&shrinker_rwsem)) { 373 375 /* ··· 384 380 } 385 381 386 382 list_for_each_entry(shrinker, &shrinker_list, list) { 387 - if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) { 388 - shrinkctl->nid = 0; 389 - freed += shrink_slab_node(shrinkctl, shrinker, 390 - nr_pages_scanned, lru_pages); 391 - continue; 392 - } 383 + struct shrink_control sc = { 384 + .gfp_mask = gfp_mask, 385 + .nid = nid, 386 + }; 393 387 394 - for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) { 395 - if (node_online(shrinkctl->nid)) 396 - freed += shrink_slab_node(shrinkctl, shrinker, 397 - nr_pages_scanned, lru_pages); 388 + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) 389 + sc.nid = 0; 398 390 399 - } 391 + freed += shrink_slabs(&sc, shrinker, nr_scanned, nr_eligible); 400 392 } 393 + 401 394 up_read(&shrinker_rwsem); 402 395 out: 403 396 cond_resched(); ··· 1877 1876 * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan 1878 1877 */ 1879 1878 static void get_scan_count(struct lruvec *lruvec, int swappiness, 1880 - struct scan_control *sc, unsigned long *nr) 1879 + struct scan_control *sc, unsigned long *nr, 1880 + unsigned long *lru_pages) 1881 1881 { 1882 1882 struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; 1883 1883 u64 fraction[2]; ··· 2024 2022 some_scanned = false; 2025 2023 /* Only use force_scan on second pass. 
*/ 2026 2024 for (pass = 0; !some_scanned && pass < 2; pass++) { 2025 + *lru_pages = 0; 2027 2026 for_each_evictable_lru(lru) { 2028 2027 int file = is_file_lru(lru); 2029 2028 unsigned long size; ··· 2051 2048 case SCAN_FILE: 2052 2049 case SCAN_ANON: 2053 2050 /* Scan one type exclusively */ 2054 - if ((scan_balance == SCAN_FILE) != file) 2051 + if ((scan_balance == SCAN_FILE) != file) { 2052 + size = 0; 2055 2053 scan = 0; 2054 + } 2056 2055 break; 2057 2056 default: 2058 2057 /* Look ma, no brain */ 2059 2058 BUG(); 2060 2059 } 2060 + 2061 + *lru_pages += size; 2061 2062 nr[lru] = scan; 2063 + 2062 2064 /* 2063 2065 * Skip the second pass and don't force_scan, 2064 2066 * if we found something to scan. ··· 2077 2069 * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. 2078 2070 */ 2079 2071 static void shrink_lruvec(struct lruvec *lruvec, int swappiness, 2080 - struct scan_control *sc) 2072 + struct scan_control *sc, unsigned long *lru_pages) 2081 2073 { 2082 2074 unsigned long nr[NR_LRU_LISTS]; 2083 2075 unsigned long targets[NR_LRU_LISTS]; ··· 2088 2080 struct blk_plug plug; 2089 2081 bool scan_adjusted; 2090 2082 2091 - get_scan_count(lruvec, swappiness, sc, nr); 2083 + get_scan_count(lruvec, swappiness, sc, nr, lru_pages); 2092 2084 2093 2085 /* Record the original scan target for proportional adjustments later */ 2094 2086 memcpy(targets, nr, sizeof(nr)); ··· 2266 2258 } 2267 2259 } 2268 2260 2269 - static bool shrink_zone(struct zone *zone, struct scan_control *sc) 2261 + static bool shrink_zone(struct zone *zone, struct scan_control *sc, 2262 + bool is_classzone) 2270 2263 { 2271 2264 unsigned long nr_reclaimed, nr_scanned; 2272 2265 bool reclaimable = false; ··· 2278 2269 .zone = zone, 2279 2270 .priority = sc->priority, 2280 2271 }; 2272 + unsigned long zone_lru_pages = 0; 2281 2273 struct mem_cgroup *memcg; 2282 2274 2283 2275 nr_reclaimed = sc->nr_reclaimed; ··· 2286 2276 2287 2277 memcg = mem_cgroup_iter(root, NULL, 
&reclaim); 2288 2278 do { 2279 + unsigned long lru_pages; 2289 2280 struct lruvec *lruvec; 2290 2281 int swappiness; 2291 2282 2292 2283 lruvec = mem_cgroup_zone_lruvec(zone, memcg); 2293 2284 swappiness = mem_cgroup_swappiness(memcg); 2294 2285 2295 - shrink_lruvec(lruvec, swappiness, sc); 2286 + shrink_lruvec(lruvec, swappiness, sc, &lru_pages); 2287 + zone_lru_pages += lru_pages; 2296 2288 2297 2289 /* 2298 2290 * Direct reclaim and kswapd have to scan all memory ··· 2313 2301 } 2314 2302 memcg = mem_cgroup_iter(root, memcg, &reclaim); 2315 2303 } while (memcg); 2304 + 2305 + /* 2306 + * Shrink the slab caches in the same proportion that 2307 + * the eligible LRU pages were scanned. 2308 + */ 2309 + if (global_reclaim(sc) && is_classzone) { 2310 + struct reclaim_state *reclaim_state; 2311 + 2312 + shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone), 2313 + sc->nr_scanned - nr_scanned, 2314 + zone_lru_pages); 2315 + 2316 + reclaim_state = current->reclaim_state; 2317 + if (reclaim_state) { 2318 + sc->nr_reclaimed += 2319 + reclaim_state->reclaimed_slab; 2320 + reclaim_state->reclaimed_slab = 0; 2321 + } 2322 + } 2316 2323 2317 2324 vmpressure(sc->gfp_mask, sc->target_mem_cgroup, 2318 2325 sc->nr_scanned - nr_scanned, ··· 2407 2376 struct zone *zone; 2408 2377 unsigned long nr_soft_reclaimed; 2409 2378 unsigned long nr_soft_scanned; 2410 - unsigned long lru_pages = 0; 2411 - struct reclaim_state *reclaim_state = current->reclaim_state; 2412 2379 gfp_t orig_mask; 2413 - struct shrink_control shrink = { 2414 - .gfp_mask = sc->gfp_mask, 2415 - }; 2416 2380 enum zone_type requested_highidx = gfp_zone(sc->gfp_mask); 2417 2381 bool reclaimable = false; 2418 2382 ··· 2420 2394 if (buffer_heads_over_limit) 2421 2395 sc->gfp_mask |= __GFP_HIGHMEM; 2422 2396 2423 - nodes_clear(shrink.nodes_to_scan); 2424 - 2425 2397 for_each_zone_zonelist_nodemask(zone, z, zonelist, 2426 - gfp_zone(sc->gfp_mask), sc->nodemask) { 2398 + requested_highidx, sc->nodemask) { 2399 + enum zone_type 
classzone_idx; 2400 + 2427 2401 if (!populated_zone(zone)) 2428 2402 continue; 2403 + 2404 + classzone_idx = requested_highidx; 2405 + while (!populated_zone(zone->zone_pgdat->node_zones + 2406 + classzone_idx)) 2407 + classzone_idx--; 2408 + 2429 2409 /* 2430 2410 * Take care memory controller reclaiming has small influence 2431 2411 * to global LRU. ··· 2440 2408 if (!cpuset_zone_allowed(zone, 2441 2409 GFP_KERNEL | __GFP_HARDWALL)) 2442 2410 continue; 2443 - 2444 - lru_pages += zone_reclaimable_pages(zone); 2445 - node_set(zone_to_nid(zone), shrink.nodes_to_scan); 2446 2411 2447 2412 if (sc->priority != DEF_PRIORITY && 2448 2413 !zone_reclaimable(zone)) ··· 2479 2450 /* need some check for avoid more shrink_zone() */ 2480 2451 } 2481 2452 2482 - if (shrink_zone(zone, sc)) 2453 + if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx)) 2483 2454 reclaimable = true; 2484 2455 2485 2456 if (global_reclaim(sc) && 2486 2457 !reclaimable && zone_reclaimable(zone)) 2487 2458 reclaimable = true; 2488 - } 2489 - 2490 - /* 2491 - * Don't shrink slabs when reclaiming memory from over limit cgroups 2492 - * but do shrink slab at least once when aborting reclaim for 2493 - * compaction to avoid unevenly scanning file/anon LRU pages over slab 2494 - * pages. 2495 - */ 2496 - if (global_reclaim(sc)) { 2497 - shrink_slab(&shrink, sc->nr_scanned, lru_pages); 2498 - if (reclaim_state) { 2499 - sc->nr_reclaimed += reclaim_state->reclaimed_slab; 2500 - reclaim_state->reclaimed_slab = 0; 2501 - } 2502 2459 } 2503 2460 2504 2461 /* ··· 2751 2736 }; 2752 2737 struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); 2753 2738 int swappiness = mem_cgroup_swappiness(memcg); 2739 + unsigned long lru_pages; 2754 2740 2755 2741 sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | 2756 2742 (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); ··· 2767 2751 * will pick up pages from other mem cgroup's as well. We hack 2768 2752 * the priority and make it zero. 
2769 2753 */ 2770 - shrink_lruvec(lruvec, swappiness, &sc); 2754 + shrink_lruvec(lruvec, swappiness, &sc, &lru_pages); 2771 2755 2772 2756 trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); 2773 2757 ··· 2948 2932 static bool kswapd_shrink_zone(struct zone *zone, 2949 2933 int classzone_idx, 2950 2934 struct scan_control *sc, 2951 - unsigned long lru_pages, 2952 2935 unsigned long *nr_attempted) 2953 2936 { 2954 2937 int testorder = sc->order; 2955 2938 unsigned long balance_gap; 2956 - struct reclaim_state *reclaim_state = current->reclaim_state; 2957 - struct shrink_control shrink = { 2958 - .gfp_mask = sc->gfp_mask, 2959 - }; 2960 2939 bool lowmem_pressure; 2961 2940 2962 2941 /* Reclaim above the high watermark. */ ··· 2986 2975 balance_gap, classzone_idx)) 2987 2976 return true; 2988 2977 2989 - shrink_zone(zone, sc); 2990 - nodes_clear(shrink.nodes_to_scan); 2991 - node_set(zone_to_nid(zone), shrink.nodes_to_scan); 2992 - 2993 - reclaim_state->reclaimed_slab = 0; 2994 - shrink_slab(&shrink, sc->nr_scanned, lru_pages); 2995 - sc->nr_reclaimed += reclaim_state->reclaimed_slab; 2978 + shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); 2996 2979 2997 2980 /* Account for the number of pages attempted to reclaim */ 2998 2981 *nr_attempted += sc->nr_to_reclaim; ··· 3047 3042 count_vm_event(PAGEOUTRUN); 3048 3043 3049 3044 do { 3050 - unsigned long lru_pages = 0; 3051 3045 unsigned long nr_attempted = 0; 3052 3046 bool raise_priority = true; 3053 3047 bool pgdat_needs_compaction = (order > 0); ··· 3106 3102 if (!populated_zone(zone)) 3107 3103 continue; 3108 3104 3109 - lru_pages += zone_reclaimable_pages(zone); 3110 - 3111 3105 /* 3112 3106 * If any zone is currently balanced then kswapd will 3113 3107 * not call compaction as it is expected that the ··· 3161 3159 * that that high watermark would be met at 100% 3162 3160 * efficiency. 
3163 3161 */ 3164 - if (kswapd_shrink_zone(zone, end_zone, &sc, 3165 - lru_pages, &nr_attempted)) 3162 + if (kswapd_shrink_zone(zone, end_zone, 3163 + &sc, &nr_attempted)) 3166 3164 raise_priority = false; 3167 3165 } 3168 3166 ··· 3614 3612 .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), 3615 3613 .may_swap = 1, 3616 3614 }; 3617 - struct shrink_control shrink = { 3618 - .gfp_mask = sc.gfp_mask, 3619 - }; 3620 - unsigned long nr_slab_pages0, nr_slab_pages1; 3621 3615 3622 3616 cond_resched(); 3623 3617 /* ··· 3632 3634 * priorities until we have enough memory freed. 3633 3635 */ 3634 3636 do { 3635 - shrink_zone(zone, &sc); 3637 + shrink_zone(zone, &sc, true); 3636 3638 } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0); 3637 - } 3638 - 3639 - nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); 3640 - if (nr_slab_pages0 > zone->min_slab_pages) { 3641 - /* 3642 - * shrink_slab() does not currently allow us to determine how 3643 - * many pages were freed in this zone. So we take the current 3644 - * number of slab pages and shake the slab until it is reduced 3645 - * by the same nr_pages that we used for reclaiming unmapped 3646 - * pages. 3647 - */ 3648 - nodes_clear(shrink.nodes_to_scan); 3649 - node_set(zone_to_nid(zone), shrink.nodes_to_scan); 3650 - for (;;) { 3651 - unsigned long lru_pages = zone_reclaimable_pages(zone); 3652 - 3653 - /* No reclaimable slab or very low memory pressure */ 3654 - if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages)) 3655 - break; 3656 - 3657 - /* Freed enough memory */ 3658 - nr_slab_pages1 = zone_page_state(zone, 3659 - NR_SLAB_RECLAIMABLE); 3660 - if (nr_slab_pages1 + nr_pages <= nr_slab_pages0) 3661 - break; 3662 - } 3663 - 3664 - /* 3665 - * Update nr_reclaimed by the number of slab pages we 3666 - * reclaimed from this zone. 
3667 - */ 3668 - nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); 3669 - if (nr_slab_pages1 < nr_slab_pages0) 3670 - sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1; 3671 3639 } 3672 3640 3673 3641 p->reclaim_state = NULL;
+102
mm/vmstat.c
··· 22 22 #include <linux/writeback.h> 23 23 #include <linux/compaction.h> 24 24 #include <linux/mm_inline.h> 25 + #include <linux/page_ext.h> 26 + #include <linux/page_owner.h> 25 27 26 28 #include "internal.h" 27 29 ··· 900 898 #ifdef CONFIG_DEBUG_VM_VMACACHE 901 899 "vmacache_find_calls", 902 900 "vmacache_find_hits", 901 + "vmacache_full_flushes", 903 902 #endif 904 903 #endif /* CONFIG_VM_EVENTS_COUNTERS */ 905 904 }; ··· 1020 1017 return 0; 1021 1018 } 1022 1019 1020 + #ifdef CONFIG_PAGE_OWNER 1021 + static void pagetypeinfo_showmixedcount_print(struct seq_file *m, 1022 + pg_data_t *pgdat, 1023 + struct zone *zone) 1024 + { 1025 + struct page *page; 1026 + struct page_ext *page_ext; 1027 + unsigned long pfn = zone->zone_start_pfn, block_end_pfn; 1028 + unsigned long end_pfn = pfn + zone->spanned_pages; 1029 + unsigned long count[MIGRATE_TYPES] = { 0, }; 1030 + int pageblock_mt, page_mt; 1031 + int i; 1032 + 1033 + /* Scan block by block. First and last block may be incomplete */ 1034 + pfn = zone->zone_start_pfn; 1035 + 1036 + /* 1037 + * Walk the zone in pageblock_nr_pages steps. If a page block spans 1038 + * a zone boundary, it will be double counted between zones. 
This does 1039 + * not matter as the mixed block count will still be correct 1040 + */ 1041 + for (; pfn < end_pfn; ) { 1042 + if (!pfn_valid(pfn)) { 1043 + pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES); 1044 + continue; 1045 + } 1046 + 1047 + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); 1048 + block_end_pfn = min(block_end_pfn, end_pfn); 1049 + 1050 + page = pfn_to_page(pfn); 1051 + pageblock_mt = get_pfnblock_migratetype(page, pfn); 1052 + 1053 + for (; pfn < block_end_pfn; pfn++) { 1054 + if (!pfn_valid_within(pfn)) 1055 + continue; 1056 + 1057 + page = pfn_to_page(pfn); 1058 + if (PageBuddy(page)) { 1059 + pfn += (1UL << page_order(page)) - 1; 1060 + continue; 1061 + } 1062 + 1063 + if (PageReserved(page)) 1064 + continue; 1065 + 1066 + page_ext = lookup_page_ext(page); 1067 + 1068 + if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) 1069 + continue; 1070 + 1071 + page_mt = gfpflags_to_migratetype(page_ext->gfp_mask); 1072 + if (pageblock_mt != page_mt) { 1073 + if (is_migrate_cma(pageblock_mt)) 1074 + count[MIGRATE_MOVABLE]++; 1075 + else 1076 + count[pageblock_mt]++; 1077 + 1078 + pfn = block_end_pfn; 1079 + break; 1080 + } 1081 + pfn += (1UL << page_ext->order) - 1; 1082 + } 1083 + } 1084 + 1085 + /* Print counts */ 1086 + seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 1087 + for (i = 0; i < MIGRATE_TYPES; i++) 1088 + seq_printf(m, "%12lu ", count[i]); 1089 + seq_putc(m, '\n'); 1090 + } 1091 + #endif /* CONFIG_PAGE_OWNER */ 1092 + 1093 + /* 1094 + * Print out the number of pageblocks for each migratetype that contain pages 1095 + * of other types. This gives an indication of how well fallbacks are being 1096 + * contained by rmqueue_fallback(). 
It requires information from PAGE_OWNER 1097 + * to determine what is going on 1098 + */ 1099 + static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat) 1100 + { 1101 + #ifdef CONFIG_PAGE_OWNER 1102 + int mtype; 1103 + 1104 + if (!page_owner_inited) 1105 + return; 1106 + 1107 + drain_all_pages(NULL); 1108 + 1109 + seq_printf(m, "\n%-23s", "Number of mixed blocks "); 1110 + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) 1111 + seq_printf(m, "%12s ", migratetype_names[mtype]); 1112 + seq_putc(m, '\n'); 1113 + 1114 + walk_zones_in_node(m, pgdat, pagetypeinfo_showmixedcount_print); 1115 + #endif /* CONFIG_PAGE_OWNER */ 1116 + } 1117 + 1023 1118 /* 1024 1119 * This prints out statistics in relation to grouping pages by mobility. 1025 1120 * It is expensive to collect so do not constantly read the file. ··· 1135 1034 seq_putc(m, '\n'); 1136 1035 pagetypeinfo_showfree(m, pgdat); 1137 1036 pagetypeinfo_showblockcount(m, pgdat); 1037 + pagetypeinfo_showmixedcount(m, pgdat); 1138 1038 1139 1039 return 0; 1140 1040 }
+1 -1
mm/zbud.c
··· 132 132 133 133 static void *zbud_zpool_create(gfp_t gfp, struct zpool_ops *zpool_ops) 134 134 { 135 - return zbud_create_pool(gfp, &zbud_zpool_ops); 135 + return zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL); 136 136 } 137 137 138 138 static void zbud_zpool_destroy(void *pool)
+134 -46
mm/zsmalloc.c
··· 155 155 * (reason above) 156 156 */ 157 157 #define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> 8) 158 - #define ZS_SIZE_CLASSES ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \ 159 - ZS_SIZE_CLASS_DELTA + 1) 160 158 161 159 /* 162 160 * We do not maintain any list for completely empty or full pages ··· 167 169 ZS_EMPTY, 168 170 ZS_FULL 169 171 }; 172 + 173 + /* 174 + * number of size_classes 175 + */ 176 + static int zs_size_classes; 170 177 171 178 /* 172 179 * We assign a page to ZS_ALMOST_EMPTY fullness group when: ··· 217 214 }; 218 215 219 216 struct zs_pool { 220 - struct size_class size_class[ZS_SIZE_CLASSES]; 217 + struct size_class **size_class; 221 218 222 219 gfp_t flags; /* allocation flags used when growing pool */ 223 220 atomic_long_t pages_allocated; ··· 471 468 if (newfg == currfg) 472 469 goto out; 473 470 474 - class = &pool->size_class[class_idx]; 471 + class = pool->size_class[class_idx]; 475 472 remove_zspage(page, class, currfg); 476 473 insert_zspage(page, class, newfg); 477 474 set_zspage_mapping(page, class_idx, newfg); ··· 632 629 struct page *next_page; 633 630 struct link_free *link; 634 631 unsigned int i = 1; 632 + void *vaddr; 635 633 636 634 /* 637 635 * page->index stores offset of first object starting ··· 643 639 if (page != first_page) 644 640 page->index = off; 645 641 646 - link = (struct link_free *)kmap_atomic(page) + 647 - off / sizeof(*link); 642 + vaddr = kmap_atomic(page); 643 + link = (struct link_free *)vaddr + off / sizeof(*link); 648 644 649 645 while ((off += class->size) < PAGE_SIZE) { 650 646 link->next = obj_location_to_handle(page, i++); ··· 658 654 */ 659 655 next_page = get_next_page(page); 660 656 link->next = obj_location_to_handle(next_page, 0); 661 - kunmap_atomic(link); 657 + kunmap_atomic(vaddr); 662 658 page = next_page; 663 659 off %= PAGE_SIZE; 664 660 } ··· 788 784 */ 789 785 if (area->vm_buf) 790 786 return 0; 791 - area->vm_buf = (char *)__get_free_page(GFP_KERNEL); 787 + area->vm_buf = 
kmalloc(ZS_MAX_ALLOC_SIZE, GFP_KERNEL); 792 788 if (!area->vm_buf) 793 789 return -ENOMEM; 794 790 return 0; ··· 796 792 797 793 static inline void __zs_cpu_down(struct mapping_area *area) 798 794 { 799 - if (area->vm_buf) 800 - free_page((unsigned long)area->vm_buf); 795 + kfree(area->vm_buf); 801 796 area->vm_buf = NULL; 802 797 } 803 798 ··· 884 881 .notifier_call = zs_cpu_notifier 885 882 }; 886 883 887 - static void zs_exit(void) 884 + static void zs_unregister_cpu_notifier(void) 888 885 { 889 886 int cpu; 890 - 891 - #ifdef CONFIG_ZPOOL 892 - zpool_unregister_driver(&zs_zpool_driver); 893 - #endif 894 887 895 888 cpu_notifier_register_begin(); 896 889 ··· 897 898 cpu_notifier_register_done(); 898 899 } 899 900 900 - static int zs_init(void) 901 + static int zs_register_cpu_notifier(void) 901 902 { 902 - int cpu, ret; 903 + int cpu, uninitialized_var(ret); 903 904 904 905 cpu_notifier_register_begin(); 905 906 906 907 __register_cpu_notifier(&zs_cpu_nb); 907 908 for_each_online_cpu(cpu) { 908 909 ret = zs_cpu_notifier(NULL, CPU_UP_PREPARE, (void *)(long)cpu); 909 - if (notifier_to_errno(ret)) { 910 - cpu_notifier_register_done(); 911 - goto fail; 912 - } 910 + if (notifier_to_errno(ret)) 911 + break; 913 912 } 914 913 915 914 cpu_notifier_register_done(); 915 + return notifier_to_errno(ret); 916 + } 917 + 918 + static void init_zs_size_classes(void) 919 + { 920 + int nr; 921 + 922 + nr = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1; 923 + if ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) % ZS_SIZE_CLASS_DELTA) 924 + nr += 1; 925 + 926 + zs_size_classes = nr; 927 + } 928 + 929 + static void __exit zs_exit(void) 930 + { 931 + #ifdef CONFIG_ZPOOL 932 + zpool_unregister_driver(&zs_zpool_driver); 933 + #endif 934 + zs_unregister_cpu_notifier(); 935 + } 936 + 937 + static int __init zs_init(void) 938 + { 939 + int ret = zs_register_cpu_notifier(); 940 + 941 + if (ret) { 942 + zs_unregister_cpu_notifier(); 943 + return ret; 944 + } 945 + 946 + 
init_zs_size_classes(); 916 947 917 948 #ifdef CONFIG_ZPOOL 918 949 zpool_register_driver(&zs_zpool_driver); 919 950 #endif 920 - 921 951 return 0; 922 - fail: 923 - zs_exit(); 924 - return notifier_to_errno(ret); 952 + } 953 + 954 + static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage) 955 + { 956 + return pages_per_zspage * PAGE_SIZE / size; 957 + } 958 + 959 + static bool can_merge(struct size_class *prev, int size, int pages_per_zspage) 960 + { 961 + if (prev->pages_per_zspage != pages_per_zspage) 962 + return false; 963 + 964 + if (get_maxobj_per_zspage(prev->size, prev->pages_per_zspage) 965 + != get_maxobj_per_zspage(size, pages_per_zspage)) 966 + return false; 967 + 968 + return true; 925 969 } 926 970 927 971 /** ··· 979 937 */ 980 938 struct zs_pool *zs_create_pool(gfp_t flags) 981 939 { 982 - int i, ovhd_size; 940 + int i; 983 941 struct zs_pool *pool; 942 + struct size_class *prev_class = NULL; 984 943 985 - ovhd_size = roundup(sizeof(*pool), PAGE_SIZE); 986 - pool = kzalloc(ovhd_size, GFP_KERNEL); 944 + pool = kzalloc(sizeof(*pool), GFP_KERNEL); 987 945 if (!pool) 988 946 return NULL; 989 947 990 - for (i = 0; i < ZS_SIZE_CLASSES; i++) { 948 + pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *), 949 + GFP_KERNEL); 950 + if (!pool->size_class) { 951 + kfree(pool); 952 + return NULL; 953 + } 954 + 955 + /* 956 + * Iterate reversly, because, size of size_class that we want to use 957 + * for merging should be larger or equal to current size. 
958 + */ 959 + for (i = zs_size_classes - 1; i >= 0; i--) { 991 960 int size; 961 + int pages_per_zspage; 992 962 struct size_class *class; 993 963 994 964 size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA; 995 965 if (size > ZS_MAX_ALLOC_SIZE) 996 966 size = ZS_MAX_ALLOC_SIZE; 967 + pages_per_zspage = get_pages_per_zspage(size); 997 968 998 - class = &pool->size_class[i]; 969 + /* 970 + * size_class is used for normal zsmalloc operation such 971 + * as alloc/free for that size. Although it is natural that we 972 + * have one size_class for each size, there is a chance that we 973 + * can get more memory utilization if we use one size_class for 974 + * many different sizes whose size_class have same 975 + * characteristics. So, we makes size_class point to 976 + * previous size_class if possible. 977 + */ 978 + if (prev_class) { 979 + if (can_merge(prev_class, size, pages_per_zspage)) { 980 + pool->size_class[i] = prev_class; 981 + continue; 982 + } 983 + } 984 + 985 + class = kzalloc(sizeof(struct size_class), GFP_KERNEL); 986 + if (!class) 987 + goto err; 988 + 999 989 class->size = size; 1000 990 class->index = i; 991 + class->pages_per_zspage = pages_per_zspage; 1001 992 spin_lock_init(&class->lock); 1002 - class->pages_per_zspage = get_pages_per_zspage(size); 993 + pool->size_class[i] = class; 1003 994 995 + prev_class = class; 1004 996 } 1005 997 1006 998 pool->flags = flags; 1007 999 1008 1000 return pool; 1001 + 1002 + err: 1003 + zs_destroy_pool(pool); 1004 + return NULL; 1009 1005 } 1010 1006 EXPORT_SYMBOL_GPL(zs_create_pool); 1011 1007 ··· 1051 971 { 1052 972 int i; 1053 973 1054 - for (i = 0; i < ZS_SIZE_CLASSES; i++) { 974 + for (i = 0; i < zs_size_classes; i++) { 1055 975 int fg; 1056 - struct size_class *class = &pool->size_class[i]; 976 + struct size_class *class = pool->size_class[i]; 977 + 978 + if (!class) 979 + continue; 980 + 981 + if (class->index != i) 982 + continue; 1057 983 1058 984 for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) { 1059 
985 if (class->fullness_list[fg]) { ··· 1067 981 class->size, fg); 1068 982 } 1069 983 } 984 + kfree(class); 1070 985 } 986 + 987 + kfree(pool->size_class); 1071 988 kfree(pool); 1072 989 } 1073 990 EXPORT_SYMBOL_GPL(zs_destroy_pool); ··· 1088 999 { 1089 1000 unsigned long obj; 1090 1001 struct link_free *link; 1091 - int class_idx; 1092 1002 struct size_class *class; 1003 + void *vaddr; 1093 1004 1094 1005 struct page *first_page, *m_page; 1095 1006 unsigned long m_objidx, m_offset; ··· 1097 1008 if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE)) 1098 1009 return 0; 1099 1010 1100 - class_idx = get_size_class_index(size); 1101 - class = &pool->size_class[class_idx]; 1102 - BUG_ON(class_idx != class->index); 1011 + class = pool->size_class[get_size_class_index(size)]; 1103 1012 1104 1013 spin_lock(&class->lock); 1105 1014 first_page = find_get_zspage(class); ··· 1118 1031 obj_handle_to_location(obj, &m_page, &m_objidx); 1119 1032 m_offset = obj_idx_to_offset(m_page, m_objidx, class->size); 1120 1033 1121 - link = (struct link_free *)kmap_atomic(m_page) + 1122 - m_offset / sizeof(*link); 1034 + vaddr = kmap_atomic(m_page); 1035 + link = (struct link_free *)vaddr + m_offset / sizeof(*link); 1123 1036 first_page->freelist = link->next; 1124 1037 memset(link, POISON_INUSE, sizeof(*link)); 1125 - kunmap_atomic(link); 1038 + kunmap_atomic(vaddr); 1126 1039 1127 1040 first_page->inuse++; 1128 1041 /* Now move the zspage to another fullness group, if required */ ··· 1138 1051 struct link_free *link; 1139 1052 struct page *first_page, *f_page; 1140 1053 unsigned long f_objidx, f_offset; 1054 + void *vaddr; 1141 1055 1142 1056 int class_idx; 1143 1057 struct size_class *class; ··· 1151 1063 first_page = get_first_page(f_page); 1152 1064 1153 1065 get_zspage_mapping(first_page, &class_idx, &fullness); 1154 - class = &pool->size_class[class_idx]; 1066 + class = pool->size_class[class_idx]; 1155 1067 f_offset = obj_idx_to_offset(f_page, f_objidx, class->size); 1156 1068 1157 1069 
spin_lock(&class->lock); 1158 1070 1159 1071 /* Insert this object in containing zspage's freelist */ 1160 - link = (struct link_free *)((unsigned char *)kmap_atomic(f_page) 1161 - + f_offset); 1072 + vaddr = kmap_atomic(f_page); 1073 + link = (struct link_free *)(vaddr + f_offset); 1162 1074 link->next = first_page->freelist; 1163 - kunmap_atomic(link); 1075 + kunmap_atomic(vaddr); 1164 1076 first_page->freelist = (void *)obj; 1165 1077 1166 1078 first_page->inuse--; ··· 1212 1124 1213 1125 obj_handle_to_location(handle, &page, &obj_idx); 1214 1126 get_zspage_mapping(get_first_page(page), &class_idx, &fg); 1215 - class = &pool->size_class[class_idx]; 1127 + class = pool->size_class[class_idx]; 1216 1128 off = obj_idx_to_offset(page, obj_idx, class->size); 1217 1129 1218 1130 area = &get_cpu_var(zs_map_area); ··· 1246 1158 1247 1159 obj_handle_to_location(handle, &page, &obj_idx); 1248 1160 get_zspage_mapping(get_first_page(page), &class_idx, &fg); 1249 - class = &pool->size_class[class_idx]; 1161 + class = pool->size_class[class_idx]; 1250 1162 off = obj_idx_to_offset(page, obj_idx, class->size); 1251 1163 1252 1164 area = this_cpu_ptr(&zs_map_area);
+4 -5
mm/zswap.c
··· 149 149 return 0; 150 150 } 151 151 152 - static void zswap_comp_exit(void) 152 + static void __init zswap_comp_exit(void) 153 153 { 154 154 /* free percpu transforms */ 155 - if (zswap_comp_pcpu_tfms) 156 - free_percpu(zswap_comp_pcpu_tfms); 155 + free_percpu(zswap_comp_pcpu_tfms); 157 156 } 158 157 159 158 /********************************* ··· 205 206 **********************************/ 206 207 static struct kmem_cache *zswap_entry_cache; 207 208 208 - static int zswap_entry_cache_create(void) 209 + static int __init zswap_entry_cache_create(void) 209 210 { 210 211 zswap_entry_cache = KMEM_CACHE(zswap_entry, 0); 211 212 return zswap_entry_cache == NULL; ··· 388 389 .notifier_call = zswap_cpu_notifier 389 390 }; 390 391 391 - static int zswap_cpu_init(void) 392 + static int __init zswap_cpu_init(void) 392 393 { 393 394 unsigned long cpu; 394 395
+1
tools/testing/selftests/Makefile
··· 15 15 TARGETS += sysctl 16 16 TARGETS += firmware 17 17 TARGETS += ftrace 18 + TARGETS += exec 18 19 19 20 TARGETS_HOTPLUG = cpu-hotplug 20 21 TARGETS_HOTPLUG += memory-hotplug
+9
tools/testing/selftests/exec/.gitignore
··· 1 + subdir* 2 + script* 3 + execveat 4 + execveat.symlink 5 + execveat.moved 6 + execveat.path.ephemeral 7 + execveat.ephemeral 8 + execveat.denatured 9 + xxxxxxxx*
+25
tools/testing/selftests/exec/Makefile
··· 1 + CC = $(CROSS_COMPILE)gcc 2 + CFLAGS = -Wall 3 + BINARIES = execveat 4 + DEPS = execveat.symlink execveat.denatured script subdir 5 + all: $(BINARIES) $(DEPS) 6 + 7 + subdir: 8 + mkdir -p $@ 9 + script: 10 + echo '#!/bin/sh' > $@ 11 + echo 'exit $$*' >> $@ 12 + chmod +x $@ 13 + execveat.symlink: execveat 14 + ln -s -f $< $@ 15 + execveat.denatured: execveat 16 + cp $< $@ 17 + chmod -x $@ 18 + %: %.c 19 + $(CC) $(CFLAGS) -o $@ $^ 20 + 21 + run_tests: all 22 + ./execveat 23 + 24 + clean: 25 + rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved xxxxx*
+397
tools/testing/selftests/exec/execveat.c
··· 1 + /* 2 + * Copyright (c) 2014 Google, Inc. 3 + * 4 + * Licensed under the terms of the GNU GPL License version 2 5 + * 6 + * Selftests for execveat(2). 7 + */ 8 + 9 + #define _GNU_SOURCE /* to get O_PATH, AT_EMPTY_PATH */ 10 + #include <sys/sendfile.h> 11 + #include <sys/stat.h> 12 + #include <sys/syscall.h> 13 + #include <sys/types.h> 14 + #include <sys/wait.h> 15 + #include <errno.h> 16 + #include <fcntl.h> 17 + #include <limits.h> 18 + #include <stdio.h> 19 + #include <stdlib.h> 20 + #include <string.h> 21 + #include <unistd.h> 22 + 23 + static char longpath[2 * PATH_MAX] = ""; 24 + static char *envp[] = { "IN_TEST=yes", NULL, NULL }; 25 + static char *argv[] = { "execveat", "99", NULL }; 26 + 27 + static int execveat_(int fd, const char *path, char **argv, char **envp, 28 + int flags) 29 + { 30 + #ifdef __NR_execveat 31 + return syscall(__NR_execveat, fd, path, argv, envp, flags); 32 + #else 33 + errno = -ENOSYS; 34 + return -1; 35 + #endif 36 + } 37 + 38 + #define check_execveat_fail(fd, path, flags, errno) \ 39 + _check_execveat_fail(fd, path, flags, errno, #errno) 40 + static int _check_execveat_fail(int fd, const char *path, int flags, 41 + int expected_errno, const char *errno_str) 42 + { 43 + int rc; 44 + 45 + errno = 0; 46 + printf("Check failure of execveat(%d, '%s', %d) with %s... ", 47 + fd, path?:"(null)", flags, errno_str); 48 + rc = execveat_(fd, path, argv, envp, flags); 49 + 50 + if (rc > 0) { 51 + printf("[FAIL] (unexpected success from execveat(2))\n"); 52 + return 1; 53 + } 54 + if (errno != expected_errno) { 55 + printf("[FAIL] (expected errno %d (%s) not %d (%s)\n", 56 + expected_errno, strerror(expected_errno), 57 + errno, strerror(errno)); 58 + return 1; 59 + } 60 + printf("[OK]\n"); 61 + return 0; 62 + } 63 + 64 + static int check_execveat_invoked_rc(int fd, const char *path, int flags, 65 + int expected_rc) 66 + { 67 + int status; 68 + int rc; 69 + pid_t child; 70 + int pathlen = path ? 
strlen(path) : 0; 71 + 72 + if (pathlen > 40) 73 + printf("Check success of execveat(%d, '%.20s...%s', %d)... ", 74 + fd, path, (path + pathlen - 20), flags); 75 + else 76 + printf("Check success of execveat(%d, '%s', %d)... ", 77 + fd, path?:"(null)", flags); 78 + child = fork(); 79 + if (child < 0) { 80 + printf("[FAIL] (fork() failed)\n"); 81 + return 1; 82 + } 83 + if (child == 0) { 84 + /* Child: do execveat(). */ 85 + rc = execveat_(fd, path, argv, envp, flags); 86 + printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n", 87 + rc, errno, strerror(errno)); 88 + exit(1); /* should not reach here */ 89 + } 90 + /* Parent: wait for & check child's exit status. */ 91 + rc = waitpid(child, &status, 0); 92 + if (rc != child) { 93 + printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc); 94 + return 1; 95 + } 96 + if (!WIFEXITED(status)) { 97 + printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n", 98 + child, status); 99 + return 1; 100 + } 101 + if (WEXITSTATUS(status) != expected_rc) { 102 + printf("[FAIL] (child %d exited with %d not %d)\n", 103 + child, WEXITSTATUS(status), expected_rc); 104 + return 1; 105 + } 106 + printf("[OK]\n"); 107 + return 0; 108 + } 109 + 110 + static int check_execveat(int fd, const char *path, int flags) 111 + { 112 + return check_execveat_invoked_rc(fd, path, flags, 99); 113 + } 114 + 115 + static char *concat(const char *left, const char *right) 116 + { 117 + char *result = malloc(strlen(left) + strlen(right) + 1); 118 + 119 + strcpy(result, left); 120 + strcat(result, right); 121 + return result; 122 + } 123 + 124 + static int open_or_die(const char *filename, int flags) 125 + { 126 + int fd = open(filename, flags); 127 + 128 + if (fd < 0) { 129 + printf("Failed to open '%s'; " 130 + "check prerequisites are available\n", filename); 131 + exit(1); 132 + } 133 + return fd; 134 + } 135 + 136 + static void exe_cp(const char *src, const char *dest) 137 + { 138 + int in_fd = open_or_die(src, O_RDONLY); 139 + int out_fd 
= open(dest, O_RDWR|O_CREAT|O_TRUNC, 0755); 140 + struct stat info; 141 + 142 + fstat(in_fd, &info); 143 + sendfile(out_fd, in_fd, NULL, info.st_size); 144 + close(in_fd); 145 + close(out_fd); 146 + } 147 + 148 + #define XX_DIR_LEN 200 149 + static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script) 150 + { 151 + int fail = 0; 152 + int ii, count, len; 153 + char longname[XX_DIR_LEN + 1]; 154 + int fd; 155 + 156 + if (*longpath == '\0') { 157 + /* Create a filename close to PATH_MAX in length */ 158 + memset(longname, 'x', XX_DIR_LEN - 1); 159 + longname[XX_DIR_LEN - 1] = '/'; 160 + longname[XX_DIR_LEN] = '\0'; 161 + count = (PATH_MAX - 3) / XX_DIR_LEN; 162 + for (ii = 0; ii < count; ii++) { 163 + strcat(longpath, longname); 164 + mkdir(longpath, 0755); 165 + } 166 + len = (PATH_MAX - 3) - (count * XX_DIR_LEN); 167 + if (len <= 0) 168 + len = 1; 169 + memset(longname, 'y', len); 170 + longname[len] = '\0'; 171 + strcat(longpath, longname); 172 + } 173 + exe_cp(src, longpath); 174 + 175 + /* 176 + * Execute as a pre-opened file descriptor, which works whether this is 177 + * a script or not (because the interpreter sees a filename like 178 + * "/dev/fd/20"). 179 + */ 180 + fd = open(longpath, O_RDONLY); 181 + if (fd > 0) { 182 + printf("Invoke copy of '%s' via filename of length %lu:\n", 183 + src, strlen(longpath)); 184 + fail += check_execveat(fd, "", AT_EMPTY_PATH); 185 + } else { 186 + printf("Failed to open length %lu filename, errno=%d (%s)\n", 187 + strlen(longpath), errno, strerror(errno)); 188 + fail++; 189 + } 190 + 191 + /* 192 + * Execute as a long pathname relative to ".". If this is a script, 193 + * the interpreter will launch but fail to open the script because its 194 + * name ("/dev/fd/5/xxx....") is bigger than PATH_MAX. 
195 + */ 196 + if (is_script) 197 + fail += check_execveat_invoked_rc(dot_dfd, longpath, 0, 127); 198 + else 199 + fail += check_execveat(dot_dfd, longpath, 0); 200 + 201 + return fail; 202 + } 203 + 204 + static int run_tests(void) 205 + { 206 + int fail = 0; 207 + char *fullname = realpath("execveat", NULL); 208 + char *fullname_script = realpath("script", NULL); 209 + char *fullname_symlink = concat(fullname, ".symlink"); 210 + int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY); 211 + int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral", 212 + O_DIRECTORY|O_RDONLY); 213 + int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY); 214 + int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH); 215 + int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC); 216 + int fd = open_or_die("execveat", O_RDONLY); 217 + int fd_path = open_or_die("execveat", O_RDONLY|O_PATH); 218 + int fd_symlink = open_or_die("execveat.symlink", O_RDONLY); 219 + int fd_denatured = open_or_die("execveat.denatured", O_RDONLY); 220 + int fd_denatured_path = open_or_die("execveat.denatured", 221 + O_RDONLY|O_PATH); 222 + int fd_script = open_or_die("script", O_RDONLY); 223 + int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY); 224 + int fd_ephemeral_path = open_or_die("execveat.path.ephemeral", 225 + O_RDONLY|O_PATH); 226 + int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY); 227 + int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC); 228 + int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC); 229 + 230 + /* Change file position to confirm it doesn't affect anything */ 231 + lseek(fd, 10, SEEK_SET); 232 + 233 + /* Normal executable file: */ 234 + /* dfd + path */ 235 + fail += check_execveat(subdir_dfd, "../execveat", 0); 236 + fail += check_execveat(dot_dfd, "execveat", 0); 237 + fail += check_execveat(dot_dfd_path, "execveat", 0); 238 + /* absolute path */ 239 + fail += check_execveat(AT_FDCWD, fullname, 0); 240 
+ /* absolute path with nonsense dfd */ 241 + fail += check_execveat(99, fullname, 0); 242 + /* fd + no path */ 243 + fail += check_execveat(fd, "", AT_EMPTY_PATH); 244 + /* O_CLOEXEC fd + no path */ 245 + fail += check_execveat(fd_cloexec, "", AT_EMPTY_PATH); 246 + /* O_PATH fd */ 247 + fail += check_execveat(fd_path, "", AT_EMPTY_PATH); 248 + 249 + /* Mess with executable file that's already open: */ 250 + /* fd + no path to a file that's been renamed */ 251 + rename("execveat.ephemeral", "execveat.moved"); 252 + fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH); 253 + /* fd + no path to a file that's been deleted */ 254 + unlink("execveat.moved"); /* remove the file now fd open */ 255 + fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH); 256 + 257 + /* Mess with executable file that's already open with O_PATH */ 258 + /* fd + no path to a file that's been deleted */ 259 + unlink("execveat.path.ephemeral"); 260 + fail += check_execveat(fd_ephemeral_path, "", AT_EMPTY_PATH); 261 + 262 + /* Invalid argument failures */ 263 + fail += check_execveat_fail(fd, "", 0, ENOENT); 264 + fail += check_execveat_fail(fd, NULL, AT_EMPTY_PATH, EFAULT); 265 + 266 + /* Symlink to executable file: */ 267 + /* dfd + path */ 268 + fail += check_execveat(dot_dfd, "execveat.symlink", 0); 269 + fail += check_execveat(dot_dfd_path, "execveat.symlink", 0); 270 + /* absolute path */ 271 + fail += check_execveat(AT_FDCWD, fullname_symlink, 0); 272 + /* fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */ 273 + fail += check_execveat(fd_symlink, "", AT_EMPTY_PATH); 274 + fail += check_execveat(fd_symlink, "", 275 + AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW); 276 + 277 + /* Symlink fails when AT_SYMLINK_NOFOLLOW set: */ 278 + /* dfd + path */ 279 + fail += check_execveat_fail(dot_dfd, "execveat.symlink", 280 + AT_SYMLINK_NOFOLLOW, ELOOP); 281 + fail += check_execveat_fail(dot_dfd_path, "execveat.symlink", 282 + AT_SYMLINK_NOFOLLOW, ELOOP); 283 + /* absolute path */ 284 + fail 
+= check_execveat_fail(AT_FDCWD, fullname_symlink, 285 + AT_SYMLINK_NOFOLLOW, ELOOP); 286 + 287 + /* Shell script wrapping executable file: */ 288 + /* dfd + path */ 289 + fail += check_execveat(subdir_dfd, "../script", 0); 290 + fail += check_execveat(dot_dfd, "script", 0); 291 + fail += check_execveat(dot_dfd_path, "script", 0); 292 + /* absolute path */ 293 + fail += check_execveat(AT_FDCWD, fullname_script, 0); 294 + /* fd + no path */ 295 + fail += check_execveat(fd_script, "", AT_EMPTY_PATH); 296 + fail += check_execveat(fd_script, "", 297 + AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW); 298 + /* O_CLOEXEC fd fails for a script (as script file inaccessible) */ 299 + fail += check_execveat_fail(fd_script_cloexec, "", AT_EMPTY_PATH, 300 + ENOENT); 301 + fail += check_execveat_fail(dot_dfd_cloexec, "script", 0, ENOENT); 302 + 303 + /* Mess with script file that's already open: */ 304 + /* fd + no path to a file that's been renamed */ 305 + rename("script.ephemeral", "script.moved"); 306 + fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH); 307 + /* fd + no path to a file that's been deleted */ 308 + unlink("script.moved"); /* remove the file while fd open */ 309 + fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH); 310 + 311 + /* Rename a subdirectory in the path: */ 312 + rename("subdir.ephemeral", "subdir.moved"); 313 + fail += check_execveat(subdir_dfd_ephemeral, "../script", 0); 314 + fail += check_execveat(subdir_dfd_ephemeral, "script", 0); 315 + /* Remove the subdir and its contents */ 316 + unlink("subdir.moved/script"); 317 + unlink("subdir.moved"); 318 + /* Shell loads via deleted subdir OK because name starts with .. 
*/ 319 + fail += check_execveat(subdir_dfd_ephemeral, "../script", 0); 320 + fail += check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT); 321 + 322 + /* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */ 323 + fail += check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL); 324 + /* Invalid path => ENOENT */ 325 + fail += check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT); 326 + fail += check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT); 327 + fail += check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT); 328 + /* Attempt to execute directory => EACCES */ 329 + fail += check_execveat_fail(dot_dfd, "", AT_EMPTY_PATH, EACCES); 330 + /* Attempt to execute non-executable => EACCES */ 331 + fail += check_execveat_fail(dot_dfd, "Makefile", 0, EACCES); 332 + fail += check_execveat_fail(fd_denatured, "", AT_EMPTY_PATH, EACCES); 333 + fail += check_execveat_fail(fd_denatured_path, "", AT_EMPTY_PATH, 334 + EACCES); 335 + /* Attempt to execute nonsense FD => EBADF */ 336 + fail += check_execveat_fail(99, "", AT_EMPTY_PATH, EBADF); 337 + fail += check_execveat_fail(99, "execveat", 0, EBADF); 338 + /* Attempt to execute relative to non-directory => ENOTDIR */ 339 + fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR); 340 + 341 + fail += check_execveat_pathmax(dot_dfd, "execveat", 0); 342 + fail += check_execveat_pathmax(dot_dfd, "script", 1); 343 + return fail; 344 + } 345 + 346 + static void prerequisites(void) 347 + { 348 + int fd; 349 + const char *script = "#!/bin/sh\nexit $*\n"; 350 + 351 + /* Create ephemeral copies of files */ 352 + exe_cp("execveat", "execveat.ephemeral"); 353 + exe_cp("execveat", "execveat.path.ephemeral"); 354 + exe_cp("script", "script.ephemeral"); 355 + mkdir("subdir.ephemeral", 0755); 356 + 357 + fd = open("subdir.ephemeral/script", O_RDWR|O_CREAT|O_TRUNC, 0755); 358 + write(fd, script, strlen(script)); 359 + close(fd); 360 + } 361 + 362 + int main(int argc, char **argv) 363 + { 364 + int ii; 365 + int rc; 366 
+ const char *verbose = getenv("VERBOSE"); 367 + 368 + if (argc >= 2) { 369 + /* If we are invoked with an argument, don't run tests. */ 370 + const char *in_test = getenv("IN_TEST"); 371 + 372 + if (verbose) { 373 + printf(" invoked with:"); 374 + for (ii = 0; ii < argc; ii++) 375 + printf(" [%d]='%s'", ii, argv[ii]); 376 + printf("\n"); 377 + } 378 + 379 + /* Check expected environment transferred. */ 380 + if (!in_test || strcmp(in_test, "yes") != 0) { 381 + printf("[FAIL] (no IN_TEST=yes in env)\n"); 382 + return 1; 383 + } 384 + 385 + /* Use the final argument as an exit code. */ 386 + rc = atoi(argv[argc - 1]); 387 + fflush(stdout); 388 + } else { 389 + prerequisites(); 390 + if (verbose) 391 + envp[1] = "VERBOSE=1"; 392 + rc = run_tests(); 393 + if (rc > 0) 394 + printf("%d tests failed\n", rc); 395 + } 396 + return rc; 397 + }
+2 -2
tools/vm/Makefile
··· 1 1 # Makefile for vm tools 2 2 # 3 - TARGETS=page-types slabinfo 3 + TARGETS=page-types slabinfo page_owner_sort 4 4 5 5 LIB_DIR = ../lib/api 6 6 LIBS = $(LIB_DIR)/libapikfs.a ··· 18 18 $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) 19 19 20 20 clean: 21 - $(RM) page-types slabinfo 21 + $(RM) page-types slabinfo page_owner_sort 22 22 make -C $(LIB_DIR) clean
+144
tools/vm/page_owner_sort.c
··· 1 + /* 2 + * User-space helper to sort the output of /sys/kernel/debug/page_owner 3 + * 4 + * Example use: 5 + * cat /sys/kernel/debug/page_owner > page_owner_full.txt 6 + * grep -v ^PFN page_owner_full.txt > page_owner.txt 7 + * ./sort page_owner.txt sorted_page_owner.txt 8 + */ 9 + 10 + #include <stdio.h> 11 + #include <stdlib.h> 12 + #include <sys/types.h> 13 + #include <sys/stat.h> 14 + #include <fcntl.h> 15 + #include <unistd.h> 16 + #include <string.h> 17 + 18 + struct block_list { 19 + char *txt; 20 + int len; 21 + int num; 22 + }; 23 + 24 + 25 + static struct block_list *list; 26 + static int list_size; 27 + static int max_size; 28 + 29 + struct block_list *block_head; 30 + 31 + int read_block(char *buf, int buf_size, FILE *fin) 32 + { 33 + char *curr = buf, *const buf_end = buf + buf_size; 34 + 35 + while (buf_end - curr > 1 && fgets(curr, buf_end - curr, fin)) { 36 + if (*curr == '\n') /* empty line */ 37 + return curr - buf; 38 + curr += strlen(curr); 39 + } 40 + 41 + return -1; /* EOF or no space left in buf. 
*/ 42 + } 43 + 44 + static int compare_txt(const void *p1, const void *p2) 45 + { 46 + const struct block_list *l1 = p1, *l2 = p2; 47 + 48 + return strcmp(l1->txt, l2->txt); 49 + } 50 + 51 + static int compare_num(const void *p1, const void *p2) 52 + { 53 + const struct block_list *l1 = p1, *l2 = p2; 54 + 55 + return l2->num - l1->num; 56 + } 57 + 58 + static void add_list(char *buf, int len) 59 + { 60 + if (list_size != 0 && 61 + len == list[list_size-1].len && 62 + memcmp(buf, list[list_size-1].txt, len) == 0) { 63 + list[list_size-1].num++; 64 + return; 65 + } 66 + if (list_size == max_size) { 67 + printf("max_size too small??\n"); 68 + exit(1); 69 + } 70 + list[list_size].txt = malloc(len+1); 71 + list[list_size].len = len; 72 + list[list_size].num = 1; 73 + memcpy(list[list_size].txt, buf, len); 74 + list[list_size].txt[len] = 0; 75 + list_size++; 76 + if (list_size % 1000 == 0) { 77 + printf("loaded %d\r", list_size); 78 + fflush(stdout); 79 + } 80 + } 81 + 82 + #define BUF_SIZE 1024 83 + 84 + int main(int argc, char **argv) 85 + { 86 + FILE *fin, *fout; 87 + char buf[BUF_SIZE]; 88 + int ret, i, count; 89 + struct block_list *list2; 90 + struct stat st; 91 + 92 + if (argc < 3) { 93 + printf("Usage: ./program <input> <output>\n"); 94 + perror("open: "); 95 + exit(1); 96 + } 97 + 98 + fin = fopen(argv[1], "r"); 99 + fout = fopen(argv[2], "w"); 100 + if (!fin || !fout) { 101 + printf("Usage: ./program <input> <output>\n"); 102 + perror("open: "); 103 + exit(1); 104 + } 105 + 106 + fstat(fileno(fin), &st); 107 + max_size = st.st_size / 100; /* hack ... 
*/ 108 + 109 + list = malloc(max_size * sizeof(*list)); 110 + 111 + for ( ; ; ) { 112 + ret = read_block(buf, BUF_SIZE, fin); 113 + if (ret < 0) 114 + break; 115 + 116 + add_list(buf, ret); 117 + } 118 + 119 + printf("loaded %d\n", list_size); 120 + 121 + printf("sorting ....\n"); 122 + 123 + qsort(list, list_size, sizeof(list[0]), compare_txt); 124 + 125 + list2 = malloc(sizeof(*list) * list_size); 126 + 127 + printf("culling\n"); 128 + 129 + for (i = count = 0; i < list_size; i++) { 130 + if (count == 0 || 131 + strcmp(list2[count-1].txt, list[i].txt) != 0) { 132 + list2[count++] = list[i]; 133 + } else { 134 + list2[count-1].num += list[i].num; 135 + } 136 + } 137 + 138 + qsort(list2, count, sizeof(list[0]), compare_num); 139 + 140 + for (i = 0; i < count; i++) 141 + fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt); 142 + 143 + return 0; 144 + }
+12 -12
usr/Kconfig
··· 46 46 If you are not sure, leave it set to "0". 47 47 48 48 config RD_GZIP 49 - bool "Support initial ramdisks compressed using gzip" if EXPERT 50 - default y 49 + bool "Support initial ramdisks compressed using gzip" 51 50 depends on BLK_DEV_INITRD 51 + default y 52 52 select DECOMPRESS_GZIP 53 53 help 54 54 Support loading of a gzip encoded initial ramdisk or cpio buffer. 55 55 If unsure, say Y. 56 56 57 57 config RD_BZIP2 58 - bool "Support initial ramdisks compressed using bzip2" if EXPERT 59 - default !EXPERT 58 + bool "Support initial ramdisks compressed using bzip2" 59 + default y 60 60 depends on BLK_DEV_INITRD 61 61 select DECOMPRESS_BZIP2 62 62 help ··· 64 64 If unsure, say N. 65 65 66 66 config RD_LZMA 67 - bool "Support initial ramdisks compressed using LZMA" if EXPERT 68 - default !EXPERT 67 + bool "Support initial ramdisks compressed using LZMA" 68 + default y 69 69 depends on BLK_DEV_INITRD 70 70 select DECOMPRESS_LZMA 71 71 help ··· 73 73 If unsure, say N. 74 74 75 75 config RD_XZ 76 - bool "Support initial ramdisks compressed using XZ" if EXPERT 77 - default !EXPERT 76 + bool "Support initial ramdisks compressed using XZ" 78 77 depends on BLK_DEV_INITRD 78 + default y 79 79 select DECOMPRESS_XZ 80 80 help 81 81 Support loading of a XZ encoded initial ramdisk or cpio buffer. 82 82 If unsure, say N. 83 83 84 84 config RD_LZO 85 - bool "Support initial ramdisks compressed using LZO" if EXPERT 86 - default !EXPERT 85 + bool "Support initial ramdisks compressed using LZO" 86 + default y 87 87 depends on BLK_DEV_INITRD 88 88 select DECOMPRESS_LZO 89 89 help ··· 91 91 If unsure, say N. 92 92 93 93 config RD_LZ4 94 - bool "Support initial ramdisks compressed using LZ4" if EXPERT 95 - default !EXPERT 94 + bool "Support initial ramdisks compressed using LZ4" 95 + default y 96 96 depends on BLK_DEV_INITRD 97 97 select DECOMPRESS_LZ4 98 98 help