Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge updates from Andrew Morton:

- a few misc things

- a few Y2038 fixes

- ntfs fixes

- arch/sh tweaks

- ocfs2 updates

- most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...

+2134 -1177
+5
CREDITS
···
 S: D-30625 Hannover
 S: Germany
 
+N: Ron Minnich
+E: rminnich@sandia.gov
+E: rminnich@gmail.com
+D: 9p filesystem development
+
 N: Corey Minyard
 E: minyard@wf-rch.cirr.com
 E: minyard@mvista.com
+5
Documentation/admin-guide/mm/idle_page_tracking.rst
···
 are not reclaimable, he or she can filter them out using
 ``/proc/kpageflags``.
 
+The page-types tool in the tools/vm directory can be used to assist in this.
+If the tool is run initially with the appropriate option, it will mark all the
+queried pages as idle. Subsequent runs of the tool can then show which pages have
+their idle flag cleared in the interim.
+
 See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
 information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
 ``/proc/kpagecgroup``.
+3
Documentation/admin-guide/mm/pagemap.rst
···
 * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
   times each page is mapped, indexed by PFN.
 
+The page-types tool in the tools/vm directory can be used to query the
+number of times a page is mapped.
+
 * ``/proc/kpageflags``.  This file contains a 64-bit set of flags for each
   page, indexed by PFN.
 
+42 -21
Documentation/filesystems/seq_file.txt
···
 
 The iterator interface
 
-Modules implementing a virtual file with seq_file must implement a simple
-iterator object that allows stepping through the data of interest.
-Iterators must be able to move to a specific position - like the file they
-implement - but the interpretation of that position is up to the iterator
-itself. A seq_file implementation that is formatting firewall rules, for
-example, could interpret position N as the Nth rule in the chain.
-Positioning can thus be done in whatever way makes the most sense for the
-generator of the data, which need not be aware of how a position translates
-to an offset in the virtual file. The one obvious exception is that a
-position of zero should indicate the beginning of the file.
+Modules implementing a virtual file with seq_file must implement an
+iterator object that allows stepping through the data of interest
+during a "session" (roughly one read() system call).  If the iterator
+is able to move to a specific position - like the file they implement,
+though with freedom to map the position number to a sequence location
+in whatever way is convenient - the iterator need only exist
+transiently during a session.  If the iterator cannot easily find a
+numerical position but works well with a first/next interface, the
+iterator can be stored in the private data area and continue from one
+session to the next.
+
+A seq_file implementation that is formatting firewall rules from a
+table, for example, could provide a simple iterator that interprets
+position N as the Nth rule in the chain.  A seq_file implementation
+that presents the content of a, potentially volatile, linked list
+might record a pointer into that list, providing that can be done
+without risk of the current location being removed.
+
+Positioning can thus be done in whatever way makes the most sense for
+the generator of the data, which need not be aware of how a position
+translates to an offset in the virtual file. The one obvious exception
+is that a position of zero should indicate the beginning of the file.
 
 The /proc/sequence iterator just uses the count of the next number it
 will output as its position.
 
-Four functions must be implemented to make the iterator work. The first,
-called start() takes a position as an argument and returns an iterator
-which will start reading at that position. For our simple sequence example,
+Four functions must be implemented to make the iterator work. The
+first, called start(), starts a session and takes a position as an
+argument, returning an iterator which will start reading at that
+position.  The pos passed to start() will always be either zero, or
+the most recent pos used in the previous session.
+
+For our simple sequence example,
 the start() function looks like:
 
 	static void *ct_seq_start(struct seq_file *s, loff_t *pos)
···
 "past end of file" condition and return NULL if need be.
 
 For more complicated applications, the private field of the seq_file
-structure can be used. There is also a special value which can be returned
-by the start() function called SEQ_START_TOKEN; it can be used if you wish
-to instruct your show() function (described below) to print a header at the
-top of the output. SEQ_START_TOKEN should only be used if the offset is
-zero, however.
+structure can be used to hold state from session to session.  There is
+also a special value which can be returned by the start() function
+called SEQ_START_TOKEN; it can be used if you wish to instruct your
+show() function (described below) to print a header at the top of the
+output.  SEQ_START_TOKEN should only be used if the offset is zero,
+however.
 
 The next function to implement is called, amazingly, next(); its job is to
 move the iterator forward to the next position in the sequence. The
···
 		return spos;
 	}
 
-The stop() function is called when iteration is complete; its job, of
-course, is to clean up. If dynamic memory is allocated for the iterator,
-stop() is the place to free it.
+The stop() function closes a session; its job, of course, is to clean
+up.  If dynamic memory is allocated for the iterator, stop() is the
+place to free it; if a lock was taken by start(), stop() must release
+that lock.  The value that *pos was set to by the last next() call
+before stop() is remembered, and used for the first start() call of
+the next session unless lseek() has been called on the file; in that
+case next start() will be asked to start at position zero.
 
 	static void ct_seq_stop(struct seq_file *s, void *v)
 	{
+2 -1
MAINTAINERS
···
 
 9P FILE SYSTEM
 M:	Eric Van Hensbergen <ericvh@gmail.com>
-M:	Ron Minnich <rminnich@sandia.gov>
 M:	Latchesar Ionkov <lucho@ionkov.net>
+M:	Dominique Martinet <asmadeus@codewreck.org>
 L:	v9fs-developer@lists.sourceforge.net
 W:	http://swik.net/v9fs
 Q:	http://patchwork.kernel.org/project/v9fs-devel/list/
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs.git
+T:	git git://github.com/martinetd/linux.git
 S:	Maintained
 F:	Documentation/filesystems/9p.txt
 F:	fs/9p/
+2 -1
arch/alpha/mm/fault.c
···
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
 	const struct exception_table_entry *fixup;
-	int fault, si_code = SEGV_MAPERR;
+	int si_code = SEGV_MAPERR;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
+3 -1
arch/arc/mm/fault.c
···
 #include <linux/uaccess.h>
 #include <linux/kdebug.h>
 #include <linux/perf_event.h>
+#include <linux/mm_types.h>
 #include <asm/pgalloc.h>
 #include <asm/mmu.h>
 
···
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	siginfo_t info;
-	int fault, ret;
+	int ret;
+	vm_fault_t fault;
 	int write = regs->ecr_cause & ECR_C_PROTV_STORE;  /* ST/EX */
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
+3 -2
arch/arm/mm/dma-mapping.c
···
 	struct page *page;
 	void *ptr = NULL;
 
-	page = dma_alloc_from_contiguous(dev, count, order, gfp);
+	page = dma_alloc_from_contiguous(dev, count, order, gfp & __GFP_NOWARN);
 	if (!page)
 		return NULL;
 
···
 	unsigned long order = get_order(size);
 	struct page *page;
 
-	page = dma_alloc_from_contiguous(dev, count, order, gfp);
+	page = dma_alloc_from_contiguous(dev, count, order,
+					 gfp & __GFP_NOWARN);
 	if (!page)
 		goto error;
 
+4 -3
arch/arm/mm/fault.c
···
 	return vma->vm_flags & mask ? false : true;
 }
 
-static int __kprobes
+static vm_fault_t __kprobes
 __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
 		unsigned int flags, struct task_struct *tsk)
 {
 	struct vm_area_struct *vma;
-	int fault;
+	vm_fault_t fault;
 
 	vma = find_vma(mm, addr);
 	fault = VM_FAULT_BADMAP;
···
 {
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault, sig, code;
+	int sig, code;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	if (notify_page_fault(regs, fsr))
+2 -2
arch/arm64/mm/dma-mapping.c
···
 
 	if (dev_get_cma_area(NULL))
 		page = dma_alloc_from_contiguous(NULL, nr_pages,
-						 pool_size_order, GFP_KERNEL);
+						 pool_size_order, false);
 	else
 		page = alloc_pages(GFP_DMA32, pool_size_order);
 
···
 	struct page *page;
 
 	page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
-					 get_order(size), gfp);
+					 get_order(size), gfp & __GFP_NOWARN);
 	if (!page)
 		return NULL;
 
+3 -3
arch/arm64/mm/fault.c
···
 #define VM_FAULT_BADMAP		0x010000
 #define VM_FAULT_BADACCESS	0x020000
 
-static int __do_page_fault(struct mm_struct *mm, unsigned long addr,
+static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
 			   unsigned int mm_flags, unsigned long vm_flags,
 			   struct task_struct *tsk)
 {
 	struct vm_area_struct *vma;
-	int fault;
+	vm_fault_t fault;
 
 	vma = find_vma(mm, addr);
 	fault = VM_FAULT_BADMAP;
···
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct siginfo si;
-	int fault, major = 0;
+	vm_fault_t fault, major = 0;
 	unsigned long vm_flags = VM_READ | VM_WRITE;
 	unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
+1 -1
arch/hexagon/mm/vm_fault.c
···
 	struct mm_struct *mm = current->mm;
 	int si_signo;
 	int si_code = SEGV_MAPERR;
-	int fault;
+	vm_fault_t fault;
 	const struct exception_table_entry *fixup;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
+1 -1
arch/ia64/mm/fault.c
···
 	struct vm_area_struct *vma, *prev_vma;
 	struct mm_struct *mm = current->mm;
 	unsigned long mask;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
+2 -2
arch/m68k/mm/fault.c
···
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	pr_debug("do page fault:\nregs->sr=%#x, regs->pc=%#lx, address=%#lx, %ld, %p\n",
···
 	 */
 
 	fault = handle_mm_fault(vma, address, flags);
-	pr_debug("handle_mm_fault returns %d\n", fault);
+	pr_debug("handle_mm_fault returns %x\n", fault);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return 0;
+1 -1
arch/microblaze/mm/fault.c
···
 	struct mm_struct *mm = current->mm;
 	int code = SEGV_MAPERR;
 	int is_write = error_code & ESR_S;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	regs->ear = address;
+1 -1
arch/mips/mm/fault.c
···
 	struct mm_struct *mm = tsk->mm;
 	const int field = sizeof(unsigned long) * 2;
 	int si_code;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
+1 -1
arch/nds32/mm/fault.c
···
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int si_code;
-	int fault;
+	vm_fault_t fault;
 	unsigned int mask = VM_READ | VM_WRITE | VM_EXEC;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
+1 -1
arch/nios2/mm/fault.c
···
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	int code = SEGV_MAPERR;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	cause >>= 2;
+1 -1
arch/openrisc/mm/fault.c
···
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int si_code;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
+1 -1
arch/parisc/mm/fault.c
···
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	unsigned long acc_type;
-	int fault = 0;
+	vm_fault_t fault = 0;
 	unsigned int flags;
 
 	if (faulthandler_disabled())
+3 -1
arch/powerpc/include/asm/copro.h
···
 #ifndef _ASM_POWERPC_COPRO_H
 #define _ASM_POWERPC_COPRO_H
 
+#include <linux/mm_types.h>
+
 struct copro_slb
 {
 	u64 esid, vsid;
 };
 
 int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
-			  unsigned long dsisr, unsigned *flt);
+			  unsigned long dsisr, vm_fault_t *flt);
 
 int copro_calculate_slb(struct mm_struct *mm, u64 ea, struct copro_slb *slb);
 
+1 -1
arch/powerpc/kvm/book3s_hv_builtin.c
···
 	VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
 
 	return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
-			 GFP_KERNEL);
+			 false);
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
 
+1 -1
arch/powerpc/mm/copro_fault.c
···
  * to handle fortunately.
  */
 int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
-		unsigned long dsisr, unsigned *flt)
+		unsigned long dsisr, vm_fault_t *flt)
 {
 	struct vm_area_struct *vma;
 	unsigned long is_write;
+4 -3
arch/powerpc/mm/fault.c
···
 }
 
 static int do_sigbus(struct pt_regs *regs, unsigned long address,
-		     unsigned int fault)
+		     vm_fault_t fault)
 {
 	siginfo_t info;
 	unsigned int lsb = 0;
···
 	return 0;
 }
 
-static int mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)
+static int mm_fault_error(struct pt_regs *regs, unsigned long addr,
+			  vm_fault_t fault)
 {
 	/*
 	 * Kernel page fault interrupted by SIGKILL. We have no reason to
···
 	int is_exec = TRAP(regs) == 0x400;
 	int is_user = user_mode(regs);
 	int is_write = page_fault_is_write(error_code);
-	int fault, major = 0;
+	vm_fault_t fault, major = 0;
 	bool must_retry = false;
 
 	if (notify_page_fault(regs))
+1 -1
arch/powerpc/platforms/cell/spufs/fault.c
···
 {
 	u64 ea, dsisr, access;
 	unsigned long flags;
-	unsigned flt = 0;
+	vm_fault_t flt = 0;
 	int ret;
 
 	/*
+2 -1
arch/riscv/mm/fault.c
···
 	struct mm_struct *mm;
 	unsigned long addr, cause;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
-	int fault, code = SEGV_MAPERR;
+	int code = SEGV_MAPERR;
+	vm_fault_t fault;
 
 	cause = regs->scause;
 	addr = regs->sbadaddr;
+8 -5
arch/s390/mm/fault.c
···
 	return -EACCES;
 }
 
-static noinline void do_fault_error(struct pt_regs *regs, int access, int fault)
+static noinline void do_fault_error(struct pt_regs *regs, int access,
+					vm_fault_t fault)
 {
 	int si_code;
 
···
  *   11       Page translation     ->  Not present       (nullification)
  *   3b       Region third trans.  ->  Not present       (nullification)
  */
-static inline int do_exception(struct pt_regs *regs, int access)
+static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 {
 	struct gmap *gmap;
 	struct task_struct *tsk;
···
 	unsigned long trans_exc_code;
 	unsigned long address;
 	unsigned int flags;
-	int fault;
+	vm_fault_t fault;
 
 	tsk = current;
 	/*
···
 void do_protection_exception(struct pt_regs *regs)
 {
 	unsigned long trans_exc_code;
-	int access, fault;
+	int access;
+	vm_fault_t fault;
 
 	trans_exc_code = regs->int_parm_long;
 	/*
···
 
 void do_dat_exception(struct pt_regs *regs)
 {
-	int access, fault;
+	int access;
+	vm_fault_t fault;
 
 	access = VM_READ | VM_EXEC | VM_WRITE;
 	fault = do_exception(regs, access);
+4 -3
arch/sh/boards/of-generic.c
···
 
 static void sh_of_smp_probe(void)
 {
-	struct device_node *np = 0;
-	const char *method = 0;
+	struct device_node *np;
+	const char *method = NULL;
 	const struct of_cpu_method *m = __cpu_method_of_table;
 
 	pr_info("SH generic board support: scanning for cpus\n");
 
 	init_cpu_possible(cpumask_of(0));
 
-	while ((np = of_find_node_by_type(np, "cpu"))) {
+	for_each_node_by_type(np, "cpu") {
 		const __be32 *cell = of_get_property(np, "reg", NULL);
 		u64 id = -1;
 		if (cell) id = of_read_number(cell, of_n_addr_cells(np));
···
 	if (!method) {
 		np = of_find_node_by_name(NULL, "cpus");
 		of_property_read_string(np, "enable-method", &method);
+		of_node_put(np);
 	}
 
 	pr_info("CPU enable method: %s\n", method);
+2 -1
arch/sh/include/asm/kexec.h
···
 
 #include <asm/ptrace.h>
 #include <asm/string.h>
+#include <linux/kernel.h>
 
 /*
  * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
···
 		__asm__ __volatile__ ("stc gbr, %0" : "=r" (newregs->gbr));
 		__asm__ __volatile__ ("stc sr, %0" : "=r" (newregs->sr));
 
-		newregs->pc = (unsigned long)current_text_addr();
+		newregs->pc = _THIS_IP_;
 	}
 }
 #else
+1 -1
arch/sh/kernel/dwarf.c
···
 	 * time this function makes its first function call.
 	 */
 	if (!pc || !prev)
-		pc = (unsigned long)current_text_addr();
+		pc = _THIS_IP_;
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	/*
+2 -2
arch/sh/mm/fault.c
···
 
 static noinline int
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, unsigned int fault)
+	       unsigned long address, vm_fault_t fault)
 {
 	/*
 	 * Pagefault was interrupted by SIGKILL. We have no reason to
···
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
+2 -1
arch/sparc/mm/fault_32.c
···
 	unsigned int fixup;
 	unsigned long g2;
 	int from_user = !(regs->psr & PSR_PS);
-	int fault, code;
+	int code;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	if (text_fault)
+2 -1
arch/sparc/mm/fault_64.c
···
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned int insn = 0;
-	int si_code, fault_code, fault;
+	int si_code, fault_code;
+	vm_fault_t fault;
 	unsigned long address, mm_rss;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
+1 -1
arch/um/kernel/trap.c
···
 	}
 
 	do {
-		int fault;
+		vm_fault_t fault;
 
 		fault = handle_mm_fault(vma, address, flags);
 
+5 -4
arch/unicore32/mm/fault.c
···
 	return vma->vm_flags & mask ? false : true;
 }
 
-static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		   unsigned int flags, struct task_struct *tsk)
+static vm_fault_t __do_pf(struct mm_struct *mm, unsigned long addr,
+		unsigned int fsr, unsigned int flags, struct task_struct *tsk)
 {
 	struct vm_area_struct *vma;
-	int fault;
+	vm_fault_t fault;
 
 	vma = find_vma(mm, addr);
 	fault = VM_FAULT_BADMAP;
···
 {
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault, sig, code;
+	int sig, code;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
+3 -2
arch/x86/mm/fault.c
···
 #include <linux/prefetch.h>		/* prefetchw			*/
 #include <linux/context_tracking.h>	/* exception_enter(), ...	*/
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
+#include <linux/mm_types.h>
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
···
 
 static noinline void
 mm_fault_error(struct pt_regs *regs, unsigned long error_code,
-	       unsigned long address, u32 *pkey, unsigned int fault)
+	       unsigned long address, u32 *pkey, vm_fault_t fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
 		no_context(regs, error_code, address, 0, 0);
···
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault, major = 0;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 	u32 pkey;
 
+1 -1
arch/xtensa/kernel/pci-dma.c
···
 
 	if (gfpflags_allow_blocking(flag))
 		page = dma_alloc_from_contiguous(dev, count, get_order(size),
-						 flag);
+						 flag & __GFP_NOWARN);
 
 	if (!page)
 		page = alloc_pages(flag, get_order(size));
+1 -1
arch/xtensa/mm/fault.c
···
 	int code;
 
 	int is_write, is_exec;
-	int fault;
+	vm_fault_t fault;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	code = SEGV_MAPERR;
-5
drivers/base/firmware_loader/fallback.c
···
 	return sprintf(buf, "%d\n", loading);
 }
 
-/* Some architectures don't have PAGE_KERNEL_RO */
-#ifndef PAGE_KERNEL_RO
-#define PAGE_KERNEL_RO PAGE_KERNEL
-#endif
-
 /* one pages buffer should be mapped/unmapped only once */
 static int map_fw_priv_pages(struct fw_priv *fw_priv)
 {
-2
drivers/base/memory.c
···
 		mem->section_count++;
 	}
 
-	if (mem->section_count == sections_per_block)
-		ret = register_mem_sect_under_node(mem, nid, false);
 out:
 	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
+6 -43
drivers/base/node.c
···
 }
 
 /* register memory section under specified node if it spans that node */
-int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
-				 bool check_nid)
+int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 {
-	int ret;
+	int ret, nid = *(int *)arg;
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
 
-	if (!mem_blk)
-		return -EFAULT;
-
 	mem_blk->nid = nid;
-	if (!node_online(nid))
-		return 0;
 
 	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
···
 		 * case, during hotplug we know that all pages in the memory
 		 * block belong to the same node.
 		 */
-		if (check_nid) {
+		if (system_state == SYSTEM_BOOTING) {
 			page_nid = get_nid_for_pfn(pfn);
 			if (page_nid < 0)
 				continue;
···
 	return 0;
 }
 
-int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages,
-		      bool check_nid)
+int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
 {
-	unsigned long end_pfn = start_pfn + nr_pages;
-	unsigned long pfn;
-	struct memory_block *mem_blk = NULL;
-	int err = 0;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		unsigned long section_nr = pfn_to_section_nr(pfn);
-		struct mem_section *mem_sect;
-		int ret;
-
-		if (!present_section_nr(section_nr))
-			continue;
-		mem_sect = __nr_to_section(section_nr);
-
-		/* same memblock ? */
-		if (mem_blk)
-			if ((section_nr >= mem_blk->start_section_nr) &&
-			    (section_nr <= mem_blk->end_section_nr))
-				continue;
-
-		mem_blk = find_memory_block_hinted(mem_sect, mem_blk);
-
-		ret = register_mem_sect_under_node(mem_blk, nid, check_nid);
-		if (!err)
-			err = ret;
-
-		/* discard ref obtained in find_memory_block() */
-	}
-
-	if (mem_blk)
-		kobject_put(&mem_blk->dev.kobj);
-	return err;
+	return walk_memory_range(start_pfn, end_pfn, (void *)&nid,
+				 register_mem_sect_under_node);
 }
 
 #ifdef CONFIG_HUGETLBFS
+1 -1
drivers/dax/device.c
···
 		return rc;
 
 	vma->vm_ops = &dax_vm_ops;
-	vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+	vma->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
+4 -4
drivers/firewire/core-cdev.c
···
 {
 	struct fw_cdev_get_cycle_timer2 *a = &arg->get_cycle_timer2;
 	struct fw_card *card = client->device->card;
-	struct timespec ts = {0, 0};
+	struct timespec64 ts = {0, 0};
 	u32 cycle_time;
 	int ret = 0;
 
···
 		cycle_time = card->driver->read_csr(card, CSR_CYCLE_TIME);
 
 	switch (a->clk_id) {
-	case CLOCK_REALTIME:      getnstimeofday(&ts); break;
-	case CLOCK_MONOTONIC:     ktime_get_ts(&ts); break;
-	case CLOCK_MONOTONIC_RAW: getrawmonotonic(&ts); break;
+	case CLOCK_REALTIME:      ktime_get_real_ts64(&ts); break;
+	case CLOCK_MONOTONIC:     ktime_get_ts64(&ts); break;
+	case CLOCK_MONOTONIC_RAW: ktime_get_raw_ts64(&ts); break;
 	default:
 		ret = -EINVAL;
 	}
+1 -1
drivers/iommu/amd_iommu.c
···
 			return NULL;
 
 		page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
-					get_order(size), flag);
+					get_order(size), flag & __GFP_NOWARN);
 		if (!page)
 			return NULL;
 	}
+1 -1
drivers/iommu/amd_iommu_v2.c
···
 {
 	struct fault *fault = container_of(work, struct fault, work);
 	struct vm_area_struct *vma;
-	int ret = VM_FAULT_ERROR;
+	vm_fault_t ret = VM_FAULT_ERROR;
 	unsigned int flags = 0;
 	struct mm_struct *mm;
 	u64 address;
+2 -1
drivers/iommu/intel-iommu.c
···
 	if (gfpflags_allow_blocking(flags)) {
 		unsigned int count = size >> PAGE_SHIFT;
 
-		page = dma_alloc_from_contiguous(dev, count, order, flags);
+		page = dma_alloc_from_contiguous(dev, count, order,
+						 flags & __GFP_NOWARN);
 		if (page && iommu_no_mapping(dev) &&
 		    page_to_phys(page) + size > dev->coherent_dma_mask) {
 			dma_release_from_contiguous(dev, page, count);
+3 -1
drivers/iommu/intel-svm.c
···
 #include <linux/pci-ats.h>
 #include <linux/dmar.h>
 #include <linux/interrupt.h>
+#include <linux/mm_types.h>
 #include <asm/page.h>
 
 #define PASID_ENTRY_P		BIT_ULL(0)
···
 	struct vm_area_struct *vma;
 	struct page_req_dsc *req;
 	struct qi_desc resp;
-	int ret, result;
+	int result;
+	vm_fault_t ret;
 	u64 address;
 
 	handled = 1;
+1 -1
drivers/misc/cxl/fault.c
···
 
 int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, u64 dar)
 {
-	unsigned flt = 0;
+	vm_fault_t flt = 0;
 	int result;
 	unsigned long access, flags, inv_flags = 0;
 
+2 -1
drivers/misc/ocxl/link.c
···
 // Copyright 2017 IBM Corp.
 #include <linux/sched/mm.h>
 #include <linux/mutex.h>
+#include <linux/mm_types.h>
 #include <linux/mmu_context.h>
 #include <asm/copro.h>
 #include <asm/pnv-ocxl.h>
···
 
 static void xsl_fault_handler_bh(struct work_struct *fault_work)
 {
-	unsigned int flt = 0;
+	vm_fault_t flt = 0;
 	unsigned long access, flags, inv_flags = 0;
 	enum xsl_response r;
 	struct xsl_fault *fault = container_of(fault_work, struct xsl_fault,
+1 -1
drivers/s390/char/vmcp.c
···
 	 * anymore the system won't work anyway.
 	 */
 	if (order > 2)
-		page = cma_alloc(vmcp_cma, nr_pages, 0, GFP_KERNEL);
+		page = cma_alloc(vmcp_cma, nr_pages, 0, false);
 	if (page) {
 		session->response = (char *)page_to_phys(page);
 		session->cma_alloc = 1;
+1 -1
drivers/staging/android/ion/ion_cma_heap.c
···
 	if (align > CONFIG_CMA_ALIGNMENT)
 		align = CONFIG_CMA_ALIGNMENT;
 
-	pages = cma_alloc(cma_heap->cma, nr_pages, align, GFP_KERNEL);
+	pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
 	if (!pages)
 		return -ENOMEM;
 
+1 -1
fs/btrfs/extent_io.c
···
 
 	for (index = 0; index < nr_pages; index++) {
 		__do_readpage(tree, pages[index], btrfs_get_extent, em_cached,
-				bio, 0, bio_flags, 0, prev_em_start);
+				bio, 0, bio_flags, REQ_RAHEAD, prev_em_start);
 		put_page(pages[index]);
 	}
 }
+10 -2
fs/buffer.c
···
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
 #include <linux/pagevec.h>
+#include <linux/sched/mm.h>
 #include <trace/events/block.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
···
 		bool retry)
 {
 	struct buffer_head *bh, *head;
-	gfp_t gfp = GFP_NOFS;
+	gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
 	long offset;
+	struct mem_cgroup *memcg;
 
 	if (retry)
 		gfp |= __GFP_NOFAIL;
+
+	memcg = get_mem_cgroup_from_page(page);
+	memalloc_use_memcg(memcg);
 
 	head = NULL;
 	offset = PAGE_SIZE;
···
 		/* Link the buffer to its page */
 		set_bh_page(bh, page, offset);
 	}
+out:
+	memalloc_unuse_memcg();
+	mem_cgroup_put(memcg);
 	return head;
 /*
  * In case anything failed, we just free everything we got.
···
 	} while (head);
 	}
 
-	return NULL;
+	goto out;
 }
 EXPORT_SYMBOL_GPL(alloc_page_buffers);
 
+2 -1
fs/dcache.c
··· 292 292 spin_unlock(&dentry->d_lock); 293 293 name->name = p->name; 294 294 } else { 295 - memcpy(name->inline_name, dentry->d_iname, DNAME_INLINE_LEN); 295 + memcpy(name->inline_name, dentry->d_iname, 296 + dentry->d_name.len + 1); 296 297 spin_unlock(&dentry->d_lock); 297 298 name->name = name->inline_name; 298 299 }
-1
fs/ext2/file.c
··· 126 126 127 127 file_accessed(file); 128 128 vma->vm_ops = &ext2_dax_vm_ops; 129 - vma->vm_flags |= VM_MIXEDMAP; 130 129 return 0; 131 130 } 132 131 #else
+1 -1
fs/ext4/ext4.h
··· 3062 3062 /* readpages.c */ 3063 3063 extern int ext4_mpage_readpages(struct address_space *mapping, 3064 3064 struct list_head *pages, struct page *page, 3065 - unsigned nr_pages); 3065 + unsigned nr_pages, bool is_readahead); 3066 3066 3067 3067 /* symlink.c */ 3068 3068 extern const struct inode_operations ext4_encrypted_symlink_inode_operations;
+1 -1
fs/ext4/file.c
··· 374 374 file_accessed(file); 375 375 if (IS_DAX(file_inode(file))) { 376 376 vma->vm_ops = &ext4_dax_vm_ops; 377 - vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; 377 + vma->vm_flags |= VM_HUGEPAGE; 378 378 } else { 379 379 vma->vm_ops = &ext4_file_vm_ops; 380 380 }
+3 -2
fs/ext4/inode.c
··· 3325 3325 ret = ext4_readpage_inline(inode, page); 3326 3326 3327 3327 if (ret == -EAGAIN) 3328 - return ext4_mpage_readpages(page->mapping, NULL, page, 1); 3328 + return ext4_mpage_readpages(page->mapping, NULL, page, 1, 3329 + false); 3329 3330 3330 3331 return ret; 3331 3332 } ··· 3341 3340 if (ext4_has_inline_data(inode)) 3342 3341 return 0; 3343 3342 3344 - return ext4_mpage_readpages(mapping, pages, NULL, nr_pages); 3343 + return ext4_mpage_readpages(mapping, pages, NULL, nr_pages, true); 3345 3344 } 3346 3345 3347 3346 static void ext4_invalidatepage(struct page *page, unsigned int offset,
+3 -2
fs/ext4/readpage.c
··· 98 98 99 99 int ext4_mpage_readpages(struct address_space *mapping, 100 100 struct list_head *pages, struct page *page, 101 - unsigned nr_pages) 101 + unsigned nr_pages, bool is_readahead) 102 102 { 103 103 struct bio *bio = NULL; 104 104 sector_t last_block_in_bio = 0; ··· 259 259 bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9); 260 260 bio->bi_end_io = mpage_end_io; 261 261 bio->bi_private = ctx; 262 - bio_set_op_attrs(bio, REQ_OP_READ, 0); 262 + bio_set_op_attrs(bio, REQ_OP_READ, 263 + is_readahead ? REQ_RAHEAD : 0); 263 264 } 264 265 265 266 length = first_hole << blkbits;
+5
fs/f2fs/data.c
··· 1421 1421 /* 1422 1422 * This function was originally taken from fs/mpage.c, and customized for f2fs. 1423 1423 * Major change was from block_size == page_size in f2fs by default. 1424 + * 1425 + * Note that the aops->readpages() function is ONLY used for read-ahead. If 1426 + * this function ever deviates from doing just read-ahead, it should either 1427 + * use ->readpage() or do the necessary surgery to decouple ->readpages() 1428 + * from read-ahead. 1424 1429 */ 1425 1430 static int f2fs_mpage_readpages(struct address_space *mapping, 1426 1431 struct list_head *pages, struct page *page,
+1 -1
fs/hostfs/hostfs.h
··· 19 19 #define HOSTFS_ATTR_ATIME_SET 128 20 20 #define HOSTFS_ATTR_MTIME_SET 256 21 21 22 - /* These two are unused by hostfs. */ 22 + /* This one is unused by hostfs. */ 23 23 #define HOSTFS_ATTR_FORCE 512 /* Not a change, but a change it */ 24 24 #define HOSTFS_ATTR_ATTR_FLAG 1024 25 25
+10 -3
fs/hpfs/hpfs_fn.h
··· 334 334 * local time (HPFS) to GMT (Unix) 335 335 */ 336 336 337 - static inline time_t local_to_gmt(struct super_block *s, time32_t t) 337 + static inline time64_t local_to_gmt(struct super_block *s, time32_t t) 338 338 { 339 339 extern struct timezone sys_tz; 340 340 return t + sys_tz.tz_minuteswest * 60 + hpfs_sb(s)->sb_timeshift; 341 341 } 342 342 343 - static inline time32_t gmt_to_local(struct super_block *s, time_t t) 343 + static inline time32_t gmt_to_local(struct super_block *s, time64_t t) 344 344 { 345 345 extern struct timezone sys_tz; 346 - return t - sys_tz.tz_minuteswest * 60 - hpfs_sb(s)->sb_timeshift; 346 + t = t - sys_tz.tz_minuteswest * 60 - hpfs_sb(s)->sb_timeshift; 347 + 348 + return clamp_t(time64_t, t, 0, U32_MAX); 349 + } 350 + 351 + static inline time32_t local_get_seconds(struct super_block *s) 352 + { 353 + return gmt_to_local(s, ktime_get_real_seconds()); 347 354 } 348 355 349 356 /*
+6 -6
fs/hpfs/namei.c
··· 11 11 12 12 static void hpfs_update_directory_times(struct inode *dir) 13 13 { 14 - time_t t = get_seconds(); 14 + time64_t t = local_to_gmt(dir->i_sb, local_get_seconds(dir->i_sb)); 15 15 if (t == dir->i_mtime.tv_sec && 16 16 t == dir->i_ctime.tv_sec) 17 17 return; ··· 50 50 /*dee.archive = 0;*/ 51 51 dee.hidden = name[0] == '.'; 52 52 dee.fnode = cpu_to_le32(fno); 53 - dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(gmt_to_local(dir->i_sb, get_seconds())); 53 + dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(local_get_seconds(dir->i_sb)); 54 54 result = new_inode(dir->i_sb); 55 55 if (!result) 56 56 goto bail2; ··· 91 91 dnode->root_dnode = 1; 92 92 dnode->up = cpu_to_le32(fno); 93 93 de = hpfs_add_de(dir->i_sb, dnode, "\001\001", 2, 0); 94 - de->creation_date = de->write_date = de->read_date = cpu_to_le32(gmt_to_local(dir->i_sb, get_seconds())); 94 + de->creation_date = de->write_date = de->read_date = cpu_to_le32(local_get_seconds(dir->i_sb)); 95 95 if (!(mode & 0222)) de->read_only = 1; 96 96 de->first = de->directory = 1; 97 97 /*de->hidden = de->system = 0;*/ ··· 151 151 dee.archive = 1; 152 152 dee.hidden = name[0] == '.'; 153 153 dee.fnode = cpu_to_le32(fno); 154 - dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(gmt_to_local(dir->i_sb, get_seconds())); 154 + dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(local_get_seconds(dir->i_sb)); 155 155 156 156 result = new_inode(dir->i_sb); 157 157 if (!result) ··· 238 238 dee.archive = 1; 239 239 dee.hidden = name[0] == '.'; 240 240 dee.fnode = cpu_to_le32(fno); 241 - dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(gmt_to_local(dir->i_sb, get_seconds())); 241 + dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(local_get_seconds(dir->i_sb)); 242 242 243 243 result = new_inode(dir->i_sb); 244 244 if (!result) ··· 314 314 dee.archive = 1; 315 315 dee.hidden = name[0] == '.'; 316 316 dee.fnode = cpu_to_le32(fno); 
317 - dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(gmt_to_local(dir->i_sb, get_seconds())); 317 + dee.creation_date = dee.write_date = dee.read_date = cpu_to_le32(local_get_seconds(dir->i_sb)); 318 318 319 319 result = new_inode(dir->i_sb); 320 320 if (!result)
+66 -52
fs/mpage.c
··· 133 133 } while (page_bh != head); 134 134 } 135 135 136 + struct mpage_readpage_args { 137 + struct bio *bio; 138 + struct page *page; 139 + unsigned int nr_pages; 140 + bool is_readahead; 141 + sector_t last_block_in_bio; 142 + struct buffer_head map_bh; 143 + unsigned long first_logical_block; 144 + get_block_t *get_block; 145 + }; 146 + 136 147 /* 137 148 * This is the worker routine which does all the work of mapping the disk 138 149 * blocks and constructs largest possible bios, submits them for IO if the ··· 153 142 * represent the validity of its disk mapping and to decide when to do the next 154 143 * get_block() call. 155 144 */ 156 - static struct bio * 157 - do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages, 158 - sector_t *last_block_in_bio, struct buffer_head *map_bh, 159 - unsigned long *first_logical_block, get_block_t get_block, 160 - gfp_t gfp) 145 + static struct bio *do_mpage_readpage(struct mpage_readpage_args *args) 161 146 { 147 + struct page *page = args->page; 162 148 struct inode *inode = page->mapping->host; 163 149 const unsigned blkbits = inode->i_blkbits; 164 150 const unsigned blocks_per_page = PAGE_SIZE >> blkbits; 165 151 const unsigned blocksize = 1 << blkbits; 152 + struct buffer_head *map_bh = &args->map_bh; 166 153 sector_t block_in_file; 167 154 sector_t last_block; 168 155 sector_t last_block_in_file; ··· 170 161 struct block_device *bdev = NULL; 171 162 int length; 172 163 int fully_mapped = 1; 164 + int op_flags; 173 165 unsigned nblocks; 174 166 unsigned relative_block; 167 + gfp_t gfp; 168 + 169 + if (args->is_readahead) { 170 + op_flags = REQ_RAHEAD; 171 + gfp = readahead_gfp_mask(page->mapping); 172 + } else { 173 + op_flags = 0; 174 + gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL); 175 + } 175 176 176 177 if (page_has_buffers(page)) 177 178 goto confused; 178 179 179 180 block_in_file = (sector_t)page->index << (PAGE_SHIFT - blkbits); 180 - last_block = block_in_file + nr_pages * 
blocks_per_page; 181 + last_block = block_in_file + args->nr_pages * blocks_per_page; 181 182 last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits; 182 183 if (last_block > last_block_in_file) 183 184 last_block = last_block_in_file; ··· 197 178 * Map blocks using the result from the previous get_blocks call first. 198 179 */ 199 180 nblocks = map_bh->b_size >> blkbits; 200 - if (buffer_mapped(map_bh) && block_in_file > *first_logical_block && 201 - block_in_file < (*first_logical_block + nblocks)) { 202 - unsigned map_offset = block_in_file - *first_logical_block; 181 + if (buffer_mapped(map_bh) && 182 + block_in_file > args->first_logical_block && 183 + block_in_file < (args->first_logical_block + nblocks)) { 184 + unsigned map_offset = block_in_file - args->first_logical_block; 203 185 unsigned last = nblocks - map_offset; 204 186 205 187 for (relative_block = 0; ; relative_block++) { ··· 228 208 229 209 if (block_in_file < last_block) { 230 210 map_bh->b_size = (last_block-block_in_file) << blkbits; 231 - if (get_block(inode, block_in_file, map_bh, 0)) 211 + if (args->get_block(inode, block_in_file, map_bh, 0)) 232 212 goto confused; 233 - *first_logical_block = block_in_file; 213 + args->first_logical_block = block_in_file; 234 214 } 235 215 236 216 if (!buffer_mapped(map_bh)) { ··· 293 273 /* 294 274 * This page will go to BIO. Do we need to send this BIO off first? 
295 275 */ 296 - if (bio && (*last_block_in_bio != blocks[0] - 1)) 297 - bio = mpage_bio_submit(REQ_OP_READ, 0, bio); 276 + if (args->bio && (args->last_block_in_bio != blocks[0] - 1)) 277 + args->bio = mpage_bio_submit(REQ_OP_READ, op_flags, args->bio); 298 278 299 279 alloc_new: 300 - if (bio == NULL) { 280 + if (args->bio == NULL) { 301 281 if (first_hole == blocks_per_page) { 302 282 if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9), 303 283 page)) 304 284 goto out; 305 285 } 306 - bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9), 307 - min_t(int, nr_pages, BIO_MAX_PAGES), gfp); 308 - if (bio == NULL) 286 + args->bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9), 287 + min_t(int, args->nr_pages, 288 + BIO_MAX_PAGES), 289 + gfp); 290 + if (args->bio == NULL) 309 291 goto confused; 310 292 } 311 293 312 294 length = first_hole << blkbits; 313 - if (bio_add_page(bio, page, length, 0) < length) { 314 - bio = mpage_bio_submit(REQ_OP_READ, 0, bio); 295 + if (bio_add_page(args->bio, page, length, 0) < length) { 296 + args->bio = mpage_bio_submit(REQ_OP_READ, op_flags, args->bio); 315 297 goto alloc_new; 316 298 } 317 299 318 - relative_block = block_in_file - *first_logical_block; 300 + relative_block = block_in_file - args->first_logical_block; 319 301 nblocks = map_bh->b_size >> blkbits; 320 302 if ((buffer_boundary(map_bh) && relative_block == nblocks) || 321 303 (first_hole != blocks_per_page)) 322 - bio = mpage_bio_submit(REQ_OP_READ, 0, bio); 304 + args->bio = mpage_bio_submit(REQ_OP_READ, op_flags, args->bio); 323 305 else 324 - *last_block_in_bio = blocks[blocks_per_page - 1]; 306 + args->last_block_in_bio = blocks[blocks_per_page - 1]; 325 307 out: 326 - return bio; 308 + return args->bio; 327 309 328 310 confused: 329 - if (bio) 330 - bio = mpage_bio_submit(REQ_OP_READ, 0, bio); 311 + if (args->bio) 312 + args->bio = mpage_bio_submit(REQ_OP_READ, op_flags, args->bio); 331 313 if (!PageUptodate(page)) 332 - block_read_full_page(page, get_block); 314 + 
block_read_full_page(page, args->get_block); 333 315 else 334 316 unlock_page(page); 335 317 goto out; ··· 385 363 mpage_readpages(struct address_space *mapping, struct list_head *pages, 386 364 unsigned nr_pages, get_block_t get_block) 387 365 { 388 - struct bio *bio = NULL; 366 + struct mpage_readpage_args args = { 367 + .get_block = get_block, 368 + .is_readahead = true, 369 + }; 389 370 unsigned page_idx; 390 - sector_t last_block_in_bio = 0; 391 - struct buffer_head map_bh; 392 - unsigned long first_logical_block = 0; 393 - gfp_t gfp = readahead_gfp_mask(mapping); 394 371 395 - map_bh.b_state = 0; 396 - map_bh.b_size = 0; 397 372 for (page_idx = 0; page_idx < nr_pages; page_idx++) { 398 373 struct page *page = lru_to_page(pages); 399 374 ··· 398 379 list_del(&page->lru); 399 380 if (!add_to_page_cache_lru(page, mapping, 400 381 page->index, 401 - gfp)) { 402 - bio = do_mpage_readpage(bio, page, 403 - nr_pages - page_idx, 404 - &last_block_in_bio, &map_bh, 405 - &first_logical_block, 406 - get_block, gfp); 382 + readahead_gfp_mask(mapping))) { 383 + args.page = page; 384 + args.nr_pages = nr_pages - page_idx; 385 + args.bio = do_mpage_readpage(&args); 407 386 } 408 387 put_page(page); 409 388 } 410 389 BUG_ON(!list_empty(pages)); 411 - if (bio) 412 - mpage_bio_submit(REQ_OP_READ, 0, bio); 390 + if (args.bio) 391 + mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio); 413 392 return 0; 414 393 } 415 394 EXPORT_SYMBOL(mpage_readpages); ··· 417 400 */ 418 401 int mpage_readpage(struct page *page, get_block_t get_block) 419 402 { 420 - struct bio *bio = NULL; 421 - sector_t last_block_in_bio = 0; 422 - struct buffer_head map_bh; 423 - unsigned long first_logical_block = 0; 424 - gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL); 403 + struct mpage_readpage_args args = { 404 + .page = page, 405 + .nr_pages = 1, 406 + .get_block = get_block, 407 + }; 425 408 426 - map_bh.b_state = 0; 427 - map_bh.b_size = 0; 428 - bio = do_mpage_readpage(bio, page, 1, 
&last_block_in_bio, 429 - &map_bh, &first_logical_block, get_block, gfp); 430 - if (bio) 431 - mpage_bio_submit(REQ_OP_READ, 0, bio); 409 + args.bio = do_mpage_readpage(&args); 410 + if (args.bio) 411 + mpage_bio_submit(REQ_OP_READ, 0, args.bio); 432 412 return 0; 433 413 } 434 414 EXPORT_SYMBOL(mpage_readpage);
+3 -2
fs/notify/dnotify/dnotify.c
··· 384 384 385 385 static int __init dnotify_init(void) 386 386 { 387 - dnotify_struct_cache = KMEM_CACHE(dnotify_struct, SLAB_PANIC); 388 - dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC); 387 + dnotify_struct_cache = KMEM_CACHE(dnotify_struct, 388 + SLAB_PANIC|SLAB_ACCOUNT); 389 + dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT); 389 390 390 391 dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops); 391 392 if (IS_ERR(dnotify_group))
+10 -4
fs/notify/fanotify/fanotify.c
··· 11 11 #include <linux/types.h> 12 12 #include <linux/wait.h> 13 13 #include <linux/audit.h> 14 + #include <linux/sched/mm.h> 14 15 15 16 #include "fanotify.h" 16 17 ··· 141 140 struct inode *inode, u32 mask, 142 141 const struct path *path) 143 142 { 144 - struct fanotify_event_info *event; 145 - gfp_t gfp = GFP_KERNEL; 143 + struct fanotify_event_info *event = NULL; 144 + gfp_t gfp = GFP_KERNEL_ACCOUNT; 146 145 147 146 /* 148 147 * For queues with unlimited length lost events are not expected and ··· 152 151 if (group->max_events == UINT_MAX) 153 152 gfp |= __GFP_NOFAIL; 154 153 154 + /* Whoever is interested in the event, pays for the allocation. */ 155 + memalloc_use_memcg(group->memcg); 156 + 155 157 if (fanotify_is_perm_event(mask)) { 156 158 struct fanotify_perm_event_info *pevent; 157 159 158 160 pevent = kmem_cache_alloc(fanotify_perm_event_cachep, gfp); 159 161 if (!pevent) 160 - return NULL; 162 + goto out; 161 163 event = &pevent->fae; 162 164 pevent->response = 0; 163 165 goto init; 164 166 } 165 167 event = kmem_cache_alloc(fanotify_event_cachep, gfp); 166 168 if (!event) 167 - return NULL; 169 + goto out; 168 170 init: __maybe_unused 169 171 fsnotify_init_event(&event->fse, inode, mask); 170 172 event->tgid = get_pid(task_tgid(current)); ··· 178 174 event->path.mnt = NULL; 179 175 event->path.dentry = NULL; 180 176 } 177 + out: 178 + memalloc_unuse_memcg(); 181 179 return event; 182 180 } 183 181
+4 -1
fs/notify/fanotify/fanotify_user.c
··· 16 16 #include <linux/uaccess.h> 17 17 #include <linux/compat.h> 18 18 #include <linux/sched/signal.h> 19 + #include <linux/memcontrol.h> 19 20 20 21 #include <asm/ioctls.h> 21 22 ··· 732 731 733 732 group->fanotify_data.user = user; 734 733 atomic_inc(&user->fanotify_listeners); 734 + group->memcg = get_mem_cgroup_from_mm(current->mm); 735 735 736 736 oevent = fanotify_alloc_event(group, NULL, FS_Q_OVERFLOW, NULL); 737 737 if (unlikely(!oevent)) { ··· 934 932 */ 935 933 static int __init fanotify_user_setup(void) 936 934 { 937 - fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC); 935 + fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, 936 + SLAB_PANIC|SLAB_ACCOUNT); 938 937 fanotify_event_cachep = KMEM_CACHE(fanotify_event_info, SLAB_PANIC); 939 938 if (IS_ENABLED(CONFIG_FANOTIFY_ACCESS_PERMISSIONS)) { 940 939 fanotify_perm_event_cachep =
+3
fs/notify/group.c
··· 22 22 #include <linux/srcu.h> 23 23 #include <linux/rculist.h> 24 24 #include <linux/wait.h> 25 + #include <linux/memcontrol.h> 25 26 26 27 #include <linux/fsnotify_backend.h> 27 28 #include "fsnotify.h" ··· 36 35 { 37 36 if (group->ops->free_group_priv) 38 37 group->ops->free_group_priv(group); 38 + 39 + mem_cgroup_put(group->memcg); 39 40 40 41 kfree(group); 41 42 }
+6 -1
fs/notify/inotify/inotify_fsnotify.c
··· 31 31 #include <linux/types.h> 32 32 #include <linux/sched.h> 33 33 #include <linux/sched/user.h> 34 + #include <linux/sched/mm.h> 34 35 35 36 #include "inotify.h" 36 37 ··· 99 98 i_mark = container_of(inode_mark, struct inotify_inode_mark, 100 99 fsn_mark); 101 100 102 - event = kmalloc(alloc_len, GFP_KERNEL); 101 + /* Whoever is interested in the event, pays for the allocation. */ 102 + memalloc_use_memcg(group->memcg); 103 + event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT); 104 + memalloc_unuse_memcg(); 105 + 103 106 if (unlikely(!event)) { 104 107 /* 105 108 * Treat lost event due to ENOMEM the same way as queue
+4 -1
fs/notify/inotify/inotify_user.c
··· 38 38 #include <linux/uaccess.h> 39 39 #include <linux/poll.h> 40 40 #include <linux/wait.h> 41 + #include <linux/memcontrol.h> 41 42 42 43 #include "inotify.h" 43 44 #include "../fdinfo.h" ··· 640 639 oevent->name_len = 0; 641 640 642 641 group->max_events = max_events; 642 + group->memcg = get_mem_cgroup_from_mm(current->mm); 643 643 644 644 spin_lock_init(&group->inotify_data.idr_lock); 645 645 idr_init(&group->inotify_data.idr); ··· 817 815 818 816 BUG_ON(hweight32(ALL_INOTIFY_BITS) != 22); 819 817 820 - inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC); 818 + inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, 819 + SLAB_PANIC|SLAB_ACCOUNT); 821 820 822 821 inotify_max_queued_events = 16384; 823 822 init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
+4 -5
fs/ntfs/aops.c
··· 93 93 ofs = 0; 94 94 if (file_ofs < init_size) 95 95 ofs = init_size - file_ofs; 96 - local_irq_save(flags); 97 96 kaddr = kmap_atomic(page); 98 97 memset(kaddr + bh_offset(bh) + ofs, 0, 99 98 bh->b_size - ofs); 100 99 flush_dcache_page(page); 101 100 kunmap_atomic(kaddr); 102 - local_irq_restore(flags); 103 101 } 104 102 } else { 105 103 clear_buffer_uptodate(bh); ··· 144 146 recs = PAGE_SIZE / rec_size; 145 147 /* Should have been verified before we got here... */ 146 148 BUG_ON(!recs); 147 - local_irq_save(flags); 148 149 kaddr = kmap_atomic(page); 149 150 for (i = 0; i < recs; i++) 150 151 post_read_mst_fixup((NTFS_RECORD*)(kaddr + 151 152 i * rec_size), rec_size); 152 153 kunmap_atomic(kaddr); 153 - local_irq_restore(flags); 154 154 flush_dcache_page(page); 155 155 if (likely(page_uptodate && !PageError(page))) 156 156 SetPageUptodate(page); ··· 922 926 ntfs_volume *vol = ni->vol; 923 927 u8 *kaddr; 924 928 unsigned int rec_size = ni->itype.index.block_size; 925 - ntfs_inode *locked_nis[PAGE_SIZE / rec_size]; 929 + ntfs_inode *locked_nis[PAGE_SIZE / NTFS_BLOCK_SIZE]; 926 930 struct buffer_head *bh, *head, *tbh, *rec_start_bh; 927 931 struct buffer_head *bhs[MAX_BUF_PER_PAGE]; 928 932 runlist_element *rl; ··· 930 934 unsigned bh_size, rec_size_bits; 931 935 bool sync, is_mft, page_is_dirty, rec_is_dirty; 932 936 unsigned char bh_size_bits; 937 + 938 + if (WARN_ON(rec_size < NTFS_BLOCK_SIZE)) 939 + return -EINVAL; 933 940 934 941 ntfs_debug("Entering for inode 0x%lx, attribute type 0x%x, page index " 935 942 "0x%lx.", vi->i_ino, ni->type, page->index);
+16 -12
fs/ntfs/compress.c
··· 128 128 /** 129 129 * ntfs_decompress - decompress a compression block into an array of pages 130 130 * @dest_pages: destination array of pages 131 + * @completed_pages: scratch space to track completed pages 131 132 * @dest_index: current index into @dest_pages (IN/OUT) 132 133 * @dest_ofs: current offset within @dest_pages[@dest_index] (IN/OUT) 133 134 * @dest_max_index: maximum index into @dest_pages (IN) ··· 163 162 * Note to hackers: This function may not sleep until it has finished accessing 164 163 * the compression block @cb_start as it is a per-CPU buffer. 165 164 */ 166 - static int ntfs_decompress(struct page *dest_pages[], int *dest_index, 167 - int *dest_ofs, const int dest_max_index, const int dest_max_ofs, 168 - const int xpage, char *xpage_done, u8 *const cb_start, 169 - const u32 cb_size, const loff_t i_size, 165 + static int ntfs_decompress(struct page *dest_pages[], int completed_pages[], 166 + int *dest_index, int *dest_ofs, const int dest_max_index, 167 + const int dest_max_ofs, const int xpage, char *xpage_done, 168 + u8 *const cb_start, const u32 cb_size, const loff_t i_size, 170 169 const s64 initialized_size) 171 170 { 172 171 /* ··· 191 190 /* Variables for tag and token parsing. */ 192 191 u8 tag; /* Current tag. */ 193 192 int token; /* Loop counter for the eight tokens in tag. */ 194 - 195 - /* Need this because we can't sleep, so need two stages. */ 196 - int completed_pages[dest_max_index - *dest_index + 1]; 197 193 int nr_completed_pages = 0; 198 194 199 195 /* Default error code. 
*/ ··· 514 516 unsigned int cb_clusters, cb_max_ofs; 515 517 int block, max_block, cb_max_page, bhs_size, nr_bhs, err = 0; 516 518 struct page **pages; 519 + int *completed_pages; 517 520 unsigned char xpage_done = 0; 518 521 519 522 ntfs_debug("Entering, page->index = 0x%lx, cb_size = 0x%x, nr_pages = " ··· 527 528 BUG_ON(ni->name_len); 528 529 529 530 pages = kmalloc_array(nr_pages, sizeof(struct page *), GFP_NOFS); 531 + completed_pages = kmalloc_array(nr_pages + 1, sizeof(int), GFP_NOFS); 530 532 531 533 /* Allocate memory to store the buffer heads we need. */ 532 534 bhs_size = cb_size / block_size * sizeof(struct buffer_head *); 533 535 bhs = kmalloc(bhs_size, GFP_NOFS); 534 536 535 - if (unlikely(!pages || !bhs)) { 537 + if (unlikely(!pages || !bhs || !completed_pages)) { 536 538 kfree(bhs); 537 539 kfree(pages); 540 + kfree(completed_pages); 538 541 unlock_page(page); 539 542 ntfs_error(vol->sb, "Failed to allocate internal buffers."); 540 543 return -ENOMEM; ··· 563 562 if (xpage >= max_page) { 564 563 kfree(bhs); 565 564 kfree(pages); 565 + kfree(completed_pages); 566 566 zero_user(page, 0, PAGE_SIZE); 567 567 ntfs_debug("Compressed read outside i_size - truncated?"); 568 568 SetPageUptodate(page); ··· 856 854 unsigned int prev_cur_page = cur_page; 857 855 858 856 ntfs_debug("Found compressed compression block."); 859 - err = ntfs_decompress(pages, &cur_page, &cur_ofs, 860 - cb_max_page, cb_max_ofs, xpage, &xpage_done, 861 - cb_pos, cb_size - (cb_pos - cb), i_size, 862 - initialized_size); 857 + err = ntfs_decompress(pages, completed_pages, &cur_page, 858 + &cur_ofs, cb_max_page, cb_max_ofs, xpage, 859 + &xpage_done, cb_pos, cb_size - (cb_pos - cb), 860 + i_size, initialized_size); 863 861 /* 864 862 * We can sleep from now on, lock already dropped by 865 863 * ntfs_decompress(). ··· 914 912 915 913 /* We no longer need the list of pages. 
*/ 916 914 kfree(pages); 915 + kfree(completed_pages); 917 916 918 917 /* If we have completed the requested page, we return success. */ 919 918 if (likely(xpage_done)) ··· 959 956 } 960 957 } 961 958 kfree(pages); 959 + kfree(completed_pages); 962 960 return -EIO; 963 961 }
+6 -6
fs/ntfs/inode.c
··· 667 667 * mtime is the last change of the data within the file. Not changed 668 668 * when only metadata is changed, e.g. a rename doesn't affect mtime. 669 669 */ 670 - vi->i_mtime = timespec_to_timespec64(ntfs2utc(si->last_data_change_time)); 670 + vi->i_mtime = ntfs2utc(si->last_data_change_time); 671 671 /* 672 672 * ctime is the last change of the metadata of the file. This obviously 673 673 * always changes, when mtime is changed. ctime can be changed on its 674 674 * own, mtime is then not changed, e.g. when a file is renamed. 675 675 */ 676 - vi->i_ctime = timespec_to_timespec64(ntfs2utc(si->last_mft_change_time)); 676 + vi->i_ctime = ntfs2utc(si->last_mft_change_time); 677 677 /* 678 678 * Last access to the data within the file. Not changed during a rename 679 679 * for example but changed whenever the file is written to. 680 680 */ 681 - vi->i_atime = timespec_to_timespec64(ntfs2utc(si->last_access_time)); 681 + vi->i_atime = ntfs2utc(si->last_access_time); 682 682 683 683 /* Find the attribute list attribute if present. */ 684 684 ntfs_attr_reinit_search_ctx(ctx); ··· 2997 2997 si = (STANDARD_INFORMATION*)((u8*)ctx->attr + 2998 2998 le16_to_cpu(ctx->attr->data.resident.value_offset)); 2999 2999 /* Update the access times if they have changed. 
*/ 3000 - nt = utc2ntfs(timespec64_to_timespec(vi->i_mtime)); 3000 + nt = utc2ntfs(vi->i_mtime); 3001 3001 if (si->last_data_change_time != nt) { 3002 3002 ntfs_debug("Updating mtime for inode 0x%lx: old = 0x%llx, " 3003 3003 "new = 0x%llx", vi->i_ino, (long long) ··· 3006 3006 si->last_data_change_time = nt; 3007 3007 modified = true; 3008 3008 } 3009 - nt = utc2ntfs(timespec64_to_timespec(vi->i_ctime)); 3009 + nt = utc2ntfs(vi->i_ctime); 3010 3010 if (si->last_mft_change_time != nt) { 3011 3011 ntfs_debug("Updating ctime for inode 0x%lx: old = 0x%llx, " 3012 3012 "new = 0x%llx", vi->i_ino, (long long) ··· 3015 3015 si->last_mft_change_time = nt; 3016 3016 modified = true; 3017 3017 } 3018 - nt = utc2ntfs(timespec64_to_timespec(vi->i_atime)); 3018 + nt = utc2ntfs(vi->i_atime); 3019 3019 if (si->last_access_time != nt) { 3020 3020 ntfs_debug("Updating atime for inode 0x%lx: old = 0x%llx, " 3021 3021 "new = 0x%llx", vi->i_ino,
+10 -2
fs/ntfs/mft.c
··· 35 35 #include "mft.h" 36 36 #include "ntfs.h" 37 37 38 + #define MAX_BHS (PAGE_SIZE / NTFS_BLOCK_SIZE) 39 + 38 40 /** 39 41 * map_mft_record_page - map the page in which a specific mft record resides 40 42 * @ni: ntfs inode whose mft record page to map ··· 471 469 struct page *page; 472 470 unsigned int blocksize = vol->sb->s_blocksize; 473 471 int max_bhs = vol->mft_record_size / blocksize; 474 - struct buffer_head *bhs[max_bhs]; 472 + struct buffer_head *bhs[MAX_BHS]; 475 473 struct buffer_head *bh, *head; 476 474 u8 *kmirr; 477 475 runlist_element *rl; ··· 481 479 482 480 ntfs_debug("Entering for inode 0x%lx.", mft_no); 483 481 BUG_ON(!max_bhs); 482 + if (WARN_ON(max_bhs > MAX_BHS)) 483 + return -EINVAL; 484 484 if (unlikely(!vol->mftmirr_ino)) { 485 485 /* This could happen during umount... */ 486 486 err = ntfs_sync_mft_mirror_umount(vol, mft_no, m); ··· 678 674 unsigned int blocksize = vol->sb->s_blocksize; 679 675 unsigned char blocksize_bits = vol->sb->s_blocksize_bits; 680 676 int max_bhs = vol->mft_record_size / blocksize; 681 - struct buffer_head *bhs[max_bhs]; 677 + struct buffer_head *bhs[MAX_BHS]; 682 678 struct buffer_head *bh, *head; 683 679 runlist_element *rl; 684 680 unsigned int block_start, block_end, m_start, m_end; ··· 688 684 BUG_ON(NInoAttr(ni)); 689 685 BUG_ON(!max_bhs); 690 686 BUG_ON(!PageLocked(page)); 687 + if (WARN_ON(max_bhs > MAX_BHS)) { 688 + err = -EINVAL; 689 + goto err_out; 690 + } 691 691 /* 692 692 * If the ntfs_inode is clean no need to do anything. If it is dirty, 693 693 * mark it as clean now so that it can be redirtied later on if needed.
+15 -12
fs/ntfs/time.h
··· 36 36 * Convert the Linux UTC time @ts to its corresponding NTFS time and return 37 37 * that in little endian format. 38 38 * 39 - * Linux stores time in a struct timespec consisting of a time_t (long at 40 - * present) tv_sec and a long tv_nsec where tv_sec is the number of 1-second 41 - * intervals since 1st January 1970, 00:00:00 UTC and tv_nsec is the number of 42 - * 1-nano-second intervals since the value of tv_sec. 39 + * Linux stores time in a struct timespec64 consisting of a time64_t tv_sec 40 + * and a long tv_nsec where tv_sec is the number of 1-second intervals since 41 + * 1st January 1970, 00:00:00 UTC and tv_nsec is the number of 1-nano-second 42 + * intervals since the value of tv_sec. 43 43 * 44 44 * NTFS uses Microsoft's standard time format which is stored in a s64 and is 45 45 * measured as the number of 100-nano-second intervals since 1st January 1601, 46 46 * 00:00:00 UTC. 47 47 */ 48 - static inline sle64 utc2ntfs(const struct timespec ts) 48 + static inline sle64 utc2ntfs(const struct timespec64 ts) 49 49 { 50 50 /* 51 51 * Convert the seconds to 100ns intervals, add the nano-seconds ··· 63 63 */ 64 64 static inline sle64 get_current_ntfs_time(void) 65 65 { 66 - return utc2ntfs(current_kernel_time()); 66 + struct timespec64 ts; 67 + 68 + ktime_get_coarse_real_ts64(&ts); 69 + return utc2ntfs(ts); 67 70 } 68 71 69 72 /** ··· 76 73 * Convert the little endian NTFS time @time to its corresponding Linux UTC 77 74 * time and return that in cpu format. 78 75 * 79 - * Linux stores time in a struct timespec consisting of a time_t (long at 80 - * present) tv_sec and a long tv_nsec where tv_sec is the number of 1-second 81 - * intervals since 1st January 1970, 00:00:00 UTC and tv_nsec is the number of 82 - * 1-nano-second intervals since the value of tv_sec. 
76 + * Linux stores time in a struct timespec64 consisting of a time64_t tv_sec 77 + * and a long tv_nsec where tv_sec is the number of 1-second intervals since 78 + * 1st January 1970, 00:00:00 UTC and tv_nsec is the number of 1-nano-second 79 + * intervals since the value of tv_sec. 83 80 * 84 81 * NTFS uses Microsoft's standard time format which is stored in a s64 and is 85 82 * measured as the number of 100 nano-second intervals since 1st January 1601, 86 83 * 00:00:00 UTC. 87 84 */ 88 - static inline struct timespec ntfs2utc(const sle64 time) 85 + static inline struct timespec64 ntfs2utc(const sle64 time) 89 86 { 90 - struct timespec ts; 87 + struct timespec64 ts; 91 88 92 89 /* Subtract the NTFS time offset. */ 93 90 u64 t = (u64)(sle64_to_cpu(time) - NTFS_TIME_OFFSET);
+23 -37
fs/ocfs2/alloc.c
··· 932 932 goto bail; 933 933 } 934 934 935 - if (le32_to_cpu(eb->h_fs_generation) != OCFS2_SB(sb)->fs_generation) { 935 + if (le32_to_cpu(eb->h_fs_generation) != OCFS2_SB(sb)->fs_generation) 936 936 rc = ocfs2_error(sb, 937 937 "Extent block #%llu has an invalid h_fs_generation of #%u\n", 938 938 (unsigned long long)bh->b_blocknr, 939 939 le32_to_cpu(eb->h_fs_generation)); 940 - goto bail; 941 - } 942 940 bail: 943 941 return rc; 944 942 } ··· 1479 1481 1480 1482 while(le16_to_cpu(el->l_tree_depth) > 1) { 1481 1483 if (le16_to_cpu(el->l_next_free_rec) == 0) { 1482 - ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci), 1483 - "Owner %llu has empty extent list (next_free_rec == 0)\n", 1484 - (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci)); 1485 - status = -EIO; 1484 + status = ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci), 1485 + "Owner %llu has empty extent list (next_free_rec == 0)\n", 1486 + (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci)); 1486 1487 goto bail; 1487 1488 } 1488 1489 i = le16_to_cpu(el->l_next_free_rec) - 1; 1489 1490 blkno = le64_to_cpu(el->l_recs[i].e_blkno); 1490 1491 if (!blkno) { 1491 - ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci), 1492 - "Owner %llu has extent list where extent # %d has no physical block start\n", 1493 - (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci), i); 1494 - status = -EIO; 1492 + status = ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci), 1493 + "Owner %llu has extent list where extent # %d has no physical block start\n", 1494 + (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci), i); 1495 1495 goto bail; 1496 1496 } 1497 1497 ··· 1594 1598 * the new data.
*/ 1595 1599 ret = ocfs2_add_branch(handle, et, bh, last_eb_bh, 1596 1600 meta_ac); 1597 - if (ret < 0) { 1601 + if (ret < 0) 1598 1602 mlog_errno(ret); 1599 - goto out; 1600 - } 1601 1603 1602 1604 out: 1603 1605 if (final_depth) ··· 3208 3214 goto rightmost_no_delete; 3209 3215 3210 3216 if (le16_to_cpu(el->l_next_free_rec) == 0) { 3211 - ret = -EIO; 3212 - ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci), 3213 - "Owner %llu has empty extent block at %llu\n", 3214 - (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci), 3215 - (unsigned long long)le64_to_cpu(eb->h_blkno)); 3217 + ret = ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci), 3218 + "Owner %llu has empty extent block at %llu\n", 3219 + (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci), 3220 + (unsigned long long)le64_to_cpu(eb->h_blkno)); 3216 3221 goto out; 3217 3222 } 3218 3223 ··· 4404 4411 le16_to_cpu(new_el->l_count)) { 4405 4412 bh = path_leaf_bh(left_path); 4406 4413 eb = (struct ocfs2_extent_block *)bh->b_data; 4407 - ocfs2_error(sb, 4408 - "Extent block #%llu has an invalid l_next_free_rec of %d. It should have matched the l_count of %d\n", 4409 - (unsigned long long)le64_to_cpu(eb->h_blkno), 4410 - le16_to_cpu(new_el->l_next_free_rec), 4411 - le16_to_cpu(new_el->l_count)); 4412 - status = -EINVAL; 4414 + status = ocfs2_error(sb, 4415 + "Extent block #%llu has an invalid l_next_free_rec of %d.
It should have matched the l_count of %d\n", 4416 + (unsigned long long)le64_to_cpu(eb->h_blkno), 4417 + le16_to_cpu(new_el->l_next_free_rec), 4418 + le16_to_cpu(new_el->l_count)); 4413 4419 goto free_left_path; 4414 4420 } 4415 4421 rec = &new_el->l_recs[ ··· 4458 4466 if (le16_to_cpu(new_el->l_next_free_rec) <= 1) { 4459 4467 bh = path_leaf_bh(right_path); 4460 4468 eb = (struct ocfs2_extent_block *)bh->b_data; 4461 - ocfs2_error(sb, 4462 - "Extent block #%llu has an invalid l_next_free_rec of %d\n", 4463 - (unsigned long long)le64_to_cpu(eb->h_blkno), 4464 - le16_to_cpu(new_el->l_next_free_rec)); 4465 - status = -EINVAL; 4469 + status = ocfs2_error(sb, 4470 + "Extent block #%llu has an invalid l_next_free_rec of %d\n", 4471 + (unsigned long long)le64_to_cpu(eb->h_blkno), 4472 + le16_to_cpu(new_el->l_next_free_rec)); 4466 4473 goto free_right_path; 4467 4474 } 4468 4475 rec = &new_el->l_recs[1]; ··· 5514 5523 ocfs2_journal_dirty(handle, path_leaf_bh(path)); 5515 5524 5516 5525 ret = ocfs2_rotate_tree_left(handle, et, path, dealloc); 5517 - if (ret) { 5526 + if (ret) 5518 5527 mlog_errno(ret); 5519 - goto out; 5520 - } 5521 5528 5522 5529 out: 5523 5530 ocfs2_free_path(left_path); ··· 5648 5659 5649 5660 ret = ocfs2_truncate_rec(handle, et, path, index, dealloc, 5650 5661 cpos, len); 5651 - if (ret) { 5662 + if (ret) 5652 5663 mlog_errno(ret); 5653 - goto out; 5654 - } 5655 5664 } 5656 5665 5657 5666 out: ··· 5694 5707 if (ret < 0) { 5695 5708 if (ret != -ENOSPC) 5696 5709 mlog_errno(ret); 5697 - goto out; 5698 5710 } 5699 5711 } 5700 5712
+6 -6
fs/ocfs2/cluster/heartbeat.c
··· 127 127 O2HB_HEARTBEAT_NUM_MODES, 128 128 }; 129 129 130 - char *o2hb_heartbeat_mode_desc[O2HB_HEARTBEAT_NUM_MODES] = { 131 - "local", /* O2HB_HEARTBEAT_LOCAL */ 132 - "global", /* O2HB_HEARTBEAT_GLOBAL */ 130 + static const char *o2hb_heartbeat_mode_desc[O2HB_HEARTBEAT_NUM_MODES] = { 131 + "local", /* O2HB_HEARTBEAT_LOCAL */ 132 + "global", /* O2HB_HEARTBEAT_GLOBAL */ 133 133 }; 134 134 135 135 unsigned int o2hb_dead_threshold = O2HB_DEFAULT_DEAD_THRESHOLD; 136 - unsigned int o2hb_heartbeat_mode = O2HB_HEARTBEAT_LOCAL; 136 + static unsigned int o2hb_heartbeat_mode = O2HB_HEARTBEAT_LOCAL; 137 137 138 138 /* 139 139 * o2hb_dependent_users tracks the number of registered callbacks that depend ··· 141 141 * However only o2dlm depends on the heartbeat. It does not want the heartbeat 142 142 * to stop while a dlm domain is still active. 143 143 */ 144 - unsigned int o2hb_dependent_users; 144 + static unsigned int o2hb_dependent_users; 145 145 146 146 /* 147 147 * In global heartbeat mode, all regions are pinned if there are one or more ··· 2486 2486 return ret; 2487 2487 } 2488 2488 2489 - void o2hb_region_dec_user(const char *region_uuid) 2489 + static void o2hb_region_dec_user(const char *region_uuid) 2490 2490 { 2491 2491 spin_lock(&o2hb_live_lock); 2492 2492
+3 -3
fs/ocfs2/cluster/nodemanager.c
··· 35 35 * cluster references throughout where nodes are looked up */ 36 36 struct o2nm_cluster *o2nm_single_cluster = NULL; 37 37 38 - char *o2nm_fence_method_desc[O2NM_FENCE_METHODS] = { 39 - "reset", /* O2NM_FENCE_RESET */ 40 - "panic", /* O2NM_FENCE_PANIC */ 38 + static const char *o2nm_fence_method_desc[O2NM_FENCE_METHODS] = { 39 + "reset", /* O2NM_FENCE_RESET */ 40 + "panic", /* O2NM_FENCE_PANIC */ 41 41 }; 42 42 43 43 static inline void o2nm_lock_subsystem(void);
-2
fs/ocfs2/cluster/tcp.c
··· 872 872 "for type %u key %08x\n", msg_type, key); 873 873 } 874 874 write_unlock(&o2net_handler_lock); 875 - if (ret) 876 - goto out; 877 875 878 876 out: 879 877 if (ret)
+1 -1
fs/ocfs2/dlmglue.c
··· 96 96 }; 97 97 98 98 /* Lockdep class keys */ 99 - struct lock_class_key lockdep_keys[OCFS2_NUM_LOCK_TYPES]; 99 + static struct lock_class_key lockdep_keys[OCFS2_NUM_LOCK_TYPES]; 100 100 101 101 static int ocfs2_check_meta_downconvert(struct ocfs2_lock_res *lockres, 102 102 int new_level);
+1 -4
fs/ocfs2/inode.c
··· 637 637 handle = NULL; 638 638 639 639 status = ocfs2_commit_truncate(osb, inode, fe_bh); 640 - if (status < 0) { 640 + if (status < 0) 641 641 mlog_errno(status); 642 - goto out; 643 - } 644 642 } 645 643 646 644 out: ··· 1497 1499 (unsigned long long)bh->b_blocknr, 1498 1500 le32_to_cpu(di->i_fs_generation)); 1499 1501 rc = -OCFS2_FILECHECK_ERR_GENERATION; 1500 - goto bail; 1501 1502 } 1502 1503 1503 1504 bail:
+4 -5
fs/ocfs2/localalloc.c
··· 663 663 #ifdef CONFIG_OCFS2_DEBUG_FS 664 664 if (le32_to_cpu(alloc->id1.bitmap1.i_used) != 665 665 ocfs2_local_alloc_count_bits(alloc)) { 666 - ocfs2_error(osb->sb, "local alloc inode %llu says it has %u used bits, but a count shows %u\n", 667 - (unsigned long long)le64_to_cpu(alloc->i_blkno), 668 - le32_to_cpu(alloc->id1.bitmap1.i_used), 669 - ocfs2_local_alloc_count_bits(alloc)); 670 - status = -EIO; 666 + status = ocfs2_error(osb->sb, "local alloc inode %llu says it has %u used bits, but a count shows %u\n", 667 + (unsigned long long)le64_to_cpu(alloc->i_blkno), 668 + le32_to_cpu(alloc->id1.bitmap1.i_used), 669 + ocfs2_local_alloc_count_bits(alloc)); 671 670 goto bail; 672 671 } 673 672 #endif
+7 -8
fs/ocfs2/quota_local.c
··· 137 137 int rc = 0; 138 138 struct buffer_head *tmp = *bh; 139 139 140 - if (i_size_read(inode) >> inode->i_sb->s_blocksize_bits <= v_block) { 141 - ocfs2_error(inode->i_sb, 142 - "Quota file %llu is probably corrupted! Requested to read block %Lu but file has size only %Lu\n", 143 - (unsigned long long)OCFS2_I(inode)->ip_blkno, 144 - (unsigned long long)v_block, 145 - (unsigned long long)i_size_read(inode)); 146 - return -EIO; 147 - } 140 + if (i_size_read(inode) >> inode->i_sb->s_blocksize_bits <= v_block) 141 + return ocfs2_error(inode->i_sb, 142 + "Quota file %llu is probably corrupted! Requested to read block %Lu but file has size only %Lu\n", 143 + (unsigned long long)OCFS2_I(inode)->ip_blkno, 144 + (unsigned long long)v_block, 145 + (unsigned long long)i_size_read(inode)); 146 + 148 147 rc = ocfs2_read_virt_blocks(inode, v_block, 1, &tmp, 0, 149 148 ocfs2_validate_quota_block); 150 149 if (rc)
+21 -33
fs/seq_file.c
··· 90 90 91 91 static int traverse(struct seq_file *m, loff_t offset) 92 92 { 93 - loff_t pos = 0, index; 93 + loff_t pos = 0; 94 94 int error = 0; 95 95 void *p; 96 96 97 97 m->version = 0; 98 - index = 0; 98 + m->index = 0; 99 99 m->count = m->from = 0; 100 - if (!offset) { 101 - m->index = index; 100 + if (!offset) 102 101 return 0; 103 - } 102 + 104 103 if (!m->buf) { 105 104 m->buf = seq_buf_alloc(m->size = PAGE_SIZE); 106 105 if (!m->buf) 107 106 return -ENOMEM; 108 107 } 109 - p = m->op->start(m, &index); 108 + p = m->op->start(m, &m->index); 110 109 while (p) { 111 110 error = PTR_ERR(p); 112 111 if (IS_ERR(p)) ··· 122 123 if (pos + m->count > offset) { 123 124 m->from = offset - pos; 124 125 m->count -= m->from; 125 - m->index = index; 126 126 break; 127 127 } 128 128 pos += m->count; 129 129 m->count = 0; 130 - if (pos == offset) { 131 - index++; 132 - m->index = index; 130 + p = m->op->next(m, p, &m->index); 131 + if (pos == offset) 133 132 break; 134 - } 135 - p = m->op->next(m, p, &index); 136 133 } 137 134 m->op->stop(m, p); 138 - m->index = index; 139 135 return error; 140 136 141 137 Eoverflow: ··· 154 160 { 155 161 struct seq_file *m = file->private_data; 156 162 size_t copied = 0; 157 - loff_t pos; 158 163 size_t n; 159 164 void *p; 160 165 int err = 0; ··· 216 223 size -= n; 217 224 buf += n; 218 225 copied += n; 219 - if (!m->count) { 220 - m->from = 0; 221 - m->index++; 222 - } 223 226 if (!size) 224 227 goto Done; 225 228 } 226 229 /* we need at least one record in buffer */ 227 - pos = m->index; 228 - p = m->op->start(m, &pos); 230 + m->from = 0; 231 + p = m->op->start(m, &m->index); 229 232 while (1) { 230 233 err = PTR_ERR(p); 231 234 if (!p || IS_ERR(p)) ··· 232 243 if (unlikely(err)) 233 244 m->count = 0; 234 245 if (unlikely(!m->count)) { 235 - p = m->op->next(m, p, &pos); 236 - m->index = pos; 246 + p = m->op->next(m, p, &m->index); 237 247 continue; 238 248 } 239 249 if (m->count < m->size) ··· 244 256 if (!m->buf) 245 257 goto
Enomem; 246 258 m->version = 0; 247 - pos = m->index; 248 - p = m->op->start(m, &pos); 259 + p = m->op->start(m, &m->index); 249 260 } 250 261 m->op->stop(m, p); 251 262 m->count = 0; 252 263 goto Done; 253 264 Fill: 254 265 /* they want more? let's try to get some more */ 255 - while (m->count < size) { 266 + while (1) { 256 267 size_t offs = m->count; 257 - loff_t next = pos; 258 - p = m->op->next(m, p, &next); 268 + loff_t pos = m->index; 269 + 270 + p = m->op->next(m, p, &m->index); 271 + if (pos == m->index) 272 + /* Buggy ->next function */ 273 + m->index++; 259 274 if (!p || IS_ERR(p)) { 260 275 err = PTR_ERR(p); 261 276 break; 262 277 } 278 + if (m->count >= size) 279 + break; 263 280 err = m->op->show(m, p); 264 281 if (seq_has_overflowed(m) || err) { 265 282 m->count = offs; 266 283 if (likely(err <= 0)) 267 284 break; 268 285 } 269 - pos = next; 270 286 } 271 287 m->op->stop(m, p); 272 288 n = min(m->count, size); ··· 279 287 goto Efault; 280 288 copied += n; 281 289 m->count -= n; 282 - if (m->count) 283 - m->from = n; 284 - else 285 - pos++; 286 - m->index = pos; 290 + m->from = n; 287 291 Done: 288 292 if (!copied) 289 293 copied = err;
+7 -4
fs/super.c
··· 144 144 total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc); 145 145 total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc); 146 146 147 + if (!total_objects) 148 + return SHRINK_EMPTY; 149 + 147 150 total_objects = vfs_pressure_ratio(total_objects); 148 151 return total_objects; 149 152 } ··· 247 244 INIT_LIST_HEAD(&s->s_inodes_wb); 248 245 spin_lock_init(&s->s_inode_wblist_lock); 249 246 250 - if (list_lru_init_memcg(&s->s_dentry_lru)) 251 - goto fail; 252 - if (list_lru_init_memcg(&s->s_inode_lru)) 253 - goto fail; 254 247 s->s_count = 1; 255 248 atomic_set(&s->s_active, 1); 256 249 mutex_init(&s->s_vfs_rename_mutex); ··· 263 264 s->s_shrink.batch = 1024; 264 265 s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE; 265 266 if (prealloc_shrinker(&s->s_shrink)) 267 + goto fail; 268 + if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink)) 269 + goto fail; 270 + if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink)) 266 271 goto fail; 267 272 return s; 268 273
+2 -2
fs/ufs/balloc.c
··· 547 547 /* 548 548 * Block can be extended 549 549 */ 550 - ucg->cg_time = cpu_to_fs32(sb, get_seconds()); 550 + ucg->cg_time = ufs_get_seconds(sb); 551 551 for (i = newcount; i < (uspi->s_fpb - fragoff); i++) 552 552 if (ubh_isclr (UCPI_UBH(ucpi), ucpi->c_freeoff, fragno + i)) 553 553 break; ··· 639 639 if (!ufs_cg_chkmagic(sb, ucg)) 640 640 ufs_panic (sb, "ufs_alloc_fragments", 641 641 "internal error, bad magic number on cg %u", cgno); 642 - ucg->cg_time = cpu_to_fs32(sb, get_seconds()); 642 + ucg->cg_time = ufs_get_seconds(sb); 643 643 644 644 if (count == uspi->s_fpb) { 645 645 result = ufs_alloccg_block (inode, ucpi, goal, err);
+1 -1
fs/ufs/ialloc.c
··· 89 89 if (!ufs_cg_chkmagic(sb, ucg)) 90 90 ufs_panic (sb, "ufs_free_fragments", "internal error, bad cg magic number"); 91 91 92 - ucg->cg_time = cpu_to_fs32(sb, get_seconds()); 92 + ucg->cg_time = ufs_get_seconds(sb); 93 93 94 94 is_directory = S_ISDIR(inode->i_mode); 95 95
+2 -2
fs/ufs/super.c
··· 698 698 usb1 = ubh_get_usb_first(uspi); 699 699 usb3 = ubh_get_usb_third(uspi); 700 700 701 - usb1->fs_time = cpu_to_fs32(sb, get_seconds()); 701 + usb1->fs_time = ufs_get_seconds(sb); 702 702 if ((flags & UFS_ST_MASK) == UFS_ST_SUN || 703 703 (flags & UFS_ST_MASK) == UFS_ST_SUNOS || 704 704 (flags & UFS_ST_MASK) == UFS_ST_SUNx86) ··· 1342 1342 */ 1343 1343 if (*mount_flags & SB_RDONLY) { 1344 1344 ufs_put_super_internal(sb); 1345 - usb1->fs_time = cpu_to_fs32(sb, get_seconds()); 1345 + usb1->fs_time = ufs_get_seconds(sb); 1346 1346 if ((flags & UFS_ST_MASK) == UFS_ST_SUN 1347 1347 || (flags & UFS_ST_MASK) == UFS_ST_SUNOS 1348 1348 || (flags & UFS_ST_MASK) == UFS_ST_SUNx86)
+14
fs/ufs/util.h
··· 590 590 else 591 591 return *(__fs32 *)p == 0; 592 592 } 593 + 594 + static inline __fs32 ufs_get_seconds(struct super_block *sbp) 595 + { 596 + time64_t now = ktime_get_real_seconds(); 597 + 598 + /* Signed 32-bit interpretation wraps around in 2038, which 599 + * happens in ufs1 inode stamps but not ufs2 using 64-bits 600 + * stamps. For superblock and blockgroup, let's assume 601 + * unsigned 32-bit stamps, which are good until y2106. 602 + * Wrap around rather than clamp here to make the dirty 603 + * file system detection work in the superblock stamp. 604 + */ 605 + return cpu_to_fs32(sbp, lower_32_bits(now)); 606 + }
-3
fs/userfaultfd.c
··· 1849 1849 { 1850 1850 struct userfaultfd_ctx *ctx = f->private_data; 1851 1851 wait_queue_entry_t *wq; 1852 - struct userfaultfd_wait_queue *uwq; 1853 1852 unsigned long pending = 0, total = 0; 1854 1853 1855 1854 spin_lock(&ctx->fault_pending_wqh.lock); 1856 1855 list_for_each_entry(wq, &ctx->fault_pending_wqh.head, entry) { 1857 - uwq = container_of(wq, struct userfaultfd_wait_queue, wq); 1858 1856 pending++; 1859 1857 total++; 1860 1858 } 1861 1859 list_for_each_entry(wq, &ctx->fault_wqh.head, entry) { 1862 - uwq = container_of(wq, struct userfaultfd_wait_queue, wq); 1863 1860 total++; 1864 1861 } 1865 1862 spin_unlock(&ctx->fault_pending_wqh.lock);
+1 -1
fs/xfs/xfs_file.c
··· 1169 1169 file_accessed(filp); 1170 1170 vma->vm_ops = &xfs_file_vm_ops; 1171 1171 if (IS_DAX(file_inode(filp))) 1172 - vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; 1172 + vma->vm_flags |= VM_HUGEPAGE; 1173 1173 return 0; 1174 1174 } 1175 1175
+18
include/asm-generic/pgtable.h
··· 1095 1095 } 1096 1096 #endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */ 1097 1097 1098 + /* 1099 + * Architecture PAGE_KERNEL_* fallbacks 1100 + * 1101 + * Some architectures don't define certain PAGE_KERNEL_* flags. This is either 1102 + * because they really don't support them, or the port needs to be updated to 1103 + * reflect the required functionality. Below are a set of relatively safe 1104 + * fallbacks, as best effort, which we can count on in lieu of the architectures 1105 + * not defining them on their own yet. 1106 + */ 1107 + 1108 + #ifndef PAGE_KERNEL_RO 1109 + # define PAGE_KERNEL_RO PAGE_KERNEL 1110 + #endif 1111 + 1112 + #ifndef PAGE_KERNEL_EXEC 1113 + # define PAGE_KERNEL_EXEC PAGE_KERNEL 1114 + #endif 1115 + 1098 1116 #endif /* !__ASSEMBLY__ */ 1099 1117 1100 1118 #ifndef io_remap_pfn_range
+1 -1
include/linux/bitfield.h
··· 53 53 ({ \ 54 54 BUILD_BUG_ON_MSG(!__builtin_constant_p(_mask), \ 55 55 _pfx "mask is not constant"); \ 56 - BUILD_BUG_ON_MSG(!(_mask), _pfx "mask is zero"); \ 56 + BUILD_BUG_ON_MSG((_mask) == 0, _pfx "mask is zero"); \ 57 57 BUILD_BUG_ON_MSG(__builtin_constant_p(_val) ? \ 58 58 ~((_mask) >> __bf_shf(_mask)) & (_val) : 0, \ 59 59 _pfx "value too large for the field"); \
+1 -1
include/linux/cma.h
··· 33 33 const char *name, 34 34 struct cma **res_cma); 35 35 extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, 36 - gfp_t gfp_mask); 36 + bool no_warn); 37 37 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count); 38 38 39 39 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
+2 -2
include/linux/dma-contiguous.h
··· 112 112 } 113 113 114 114 struct page *dma_alloc_from_contiguous(struct device *dev, size_t count, 115 - unsigned int order, gfp_t gfp_mask); 115 + unsigned int order, bool no_warn); 116 116 bool dma_release_from_contiguous(struct device *dev, struct page *pages, 117 117 int count); 118 118 ··· 145 145 146 146 static inline 147 147 struct page *dma_alloc_from_contiguous(struct device *dev, size_t count, 148 - unsigned int order, gfp_t gfp_mask) 148 + unsigned int order, bool no_warn) 149 149 { 150 150 return NULL; 151 151 }
+4 -1
include/linux/fs.h
··· 179 179 #define ATTR_ATIME_SET (1 << 7) 180 180 #define ATTR_MTIME_SET (1 << 8) 181 181 #define ATTR_FORCE (1 << 9) /* Not a change, but a change it */ 182 - #define ATTR_ATTR_FLAG (1 << 10) 183 182 #define ATTR_KILL_SUID (1 << 11) 184 183 #define ATTR_KILL_SGID (1 << 12) 185 184 #define ATTR_FILE (1 << 13) ··· 344 345 /* Set a page dirty. Return true if this dirtied it */ 345 346 int (*set_page_dirty)(struct page *page); 346 347 348 + /* 349 + * Reads in the requested pages. Unlike ->readpage(), this is 350 + * PURELY used for read-ahead!. 351 + */ 347 352 int (*readpages)(struct file *filp, struct address_space *mapping, 348 353 struct list_head *pages, unsigned nr_pages); 349 354
+8 -4
include/linux/fsnotify_backend.h
··· 84 84 struct fsnotify_fname; 85 85 struct fsnotify_iter_info; 86 86 87 + struct mem_cgroup; 88 + 87 89 /* 88 90 * Each group much define these ops. The fsnotify infrastructure will call 89 91 * these operations for each relevant group. ··· 129 127 * everything will be cleaned up. 130 128 */ 131 129 struct fsnotify_group { 130 + const struct fsnotify_ops *ops; /* how this group handles things */ 131 + 132 132 /* 133 133 * How the refcnt is used is up to each group. When the refcnt hits 0 134 134 * fsnotify will clean up all of the resources associated with this group. ··· 140 136 * closed. 141 137 */ 142 138 refcount_t refcnt; /* things with interest in this group */ 143 - 144 - const struct fsnotify_ops *ops; /* how this group handles things */ 145 139 146 140 /* needed to send notification to userspace */ 147 141 spinlock_t notification_lock; /* protect the notification_list */ ··· 162 160 atomic_t num_marks; /* 1 for each mark and 1 for not being 163 161 * past the point of no return when freeing 164 162 * a group */ 163 + atomic_t user_waits; /* Number of tasks waiting for user 164 + * response */ 165 165 struct list_head marks_list; /* all inode marks for this group */ 166 166 167 167 struct fasync_struct *fsn_fa; /* async notification */ ··· 171 167 struct fsnotify_event *overflow_event; /* Event we queue when the 172 168 * notification list is too 173 169 * full */ 174 - atomic_t user_waits; /* Number of tasks waiting for user 175 - * response */ 170 + 171 + struct mem_cgroup *memcg; /* memcg to charge allocations */ 176 172 177 173 /* groups can define private fields here or use the void *private */ 178 174 union {
-3
include/linux/hugetlb.h
··· 348 348 struct huge_bootmem_page { 349 349 struct list_head list; 350 350 struct hstate *hstate; 351 - #ifdef CONFIG_HIGHMEM 352 - phys_addr_t phys; 353 - #endif 354 351 }; 355 352 356 353 struct page *alloc_huge_page(struct vm_area_struct *vma,
+12 -1
include/linux/kasan.h
··· 20 20 extern pud_t kasan_zero_pud[PTRS_PER_PUD]; 21 21 extern p4d_t kasan_zero_p4d[MAX_PTRS_PER_P4D]; 22 22 23 - void kasan_populate_zero_shadow(const void *shadow_start, 23 + int kasan_populate_zero_shadow(const void *shadow_start, 24 24 const void *shadow_end); 25 25 26 26 static inline void *kasan_mem_to_shadow(const void *addr) ··· 70 70 71 71 int kasan_module_alloc(void *addr, size_t size); 72 72 void kasan_free_shadow(const struct vm_struct *vm); 73 + 74 + int kasan_add_zero_shadow(void *start, unsigned long size); 75 + void kasan_remove_zero_shadow(void *start, unsigned long size); 73 76 74 77 size_t ksize(const void *); 75 78 static inline void kasan_unpoison_slab(const void *ptr) { ksize(ptr); } ··· 126 123 127 124 static inline int kasan_module_alloc(void *addr, size_t size) { return 0; } 128 125 static inline void kasan_free_shadow(const struct vm_struct *vm) {} 126 + 127 + static inline int kasan_add_zero_shadow(void *start, unsigned long size) 128 + { 129 + return 0; 130 + } 131 + static inline void kasan_remove_zero_shadow(void *start, 132 + unsigned long size) 133 + {} 129 134 130 135 static inline void kasan_unpoison_slab(const void *ptr) { } 131 136 static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; }
+36 -7
include/linux/list_lru.h
··· 42 42 spinlock_t lock; 43 43 /* global list, used for the root cgroup in cgroup aware lrus */ 44 44 struct list_lru_one lru; 45 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 45 + #ifdef CONFIG_MEMCG_KMEM 46 46 /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */ 47 47 struct list_lru_memcg __rcu *memcg_lrus; 48 48 #endif ··· 51 51 52 52 struct list_lru { 53 53 struct list_lru_node *node; 54 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 54 + #ifdef CONFIG_MEMCG_KMEM 55 55 struct list_head list; 56 + int shrinker_id; 56 57 #endif 57 58 }; 58 59 59 60 void list_lru_destroy(struct list_lru *lru); 60 61 int __list_lru_init(struct list_lru *lru, bool memcg_aware, 61 - struct lock_class_key *key); 62 + struct lock_class_key *key, struct shrinker *shrinker); 62 63 63 - #define list_lru_init(lru) __list_lru_init((lru), false, NULL) 64 - #define list_lru_init_key(lru, key) __list_lru_init((lru), false, (key)) 65 - #define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL) 64 + #define list_lru_init(lru) \ 65 + __list_lru_init((lru), false, NULL, NULL) 66 + #define list_lru_init_key(lru, key) \ 67 + __list_lru_init((lru), false, (key), NULL) 68 + #define list_lru_init_memcg(lru, shrinker) \ 69 + __list_lru_init((lru), true, NULL, shrinker) 66 70 67 71 int memcg_update_all_list_lrus(int num_memcgs); 68 - void memcg_drain_all_list_lrus(int src_idx, int dst_idx); 72 + void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg); 69 73 70 74 /** 71 75 * list_lru_add: add an element to the lru list's tail ··· 166 162 int nid, struct mem_cgroup *memcg, 167 163 list_lru_walk_cb isolate, void *cb_arg, 168 164 unsigned long *nr_to_walk); 165 + /** 166 + * list_lru_walk_one_irq: walk a list_lru, isolating and disposing freeable items. 167 + * @lru: the lru pointer. 168 + * @nid: the node id to scan from. 169 + * @memcg: the cgroup to scan from.
170 + * @isolate: callback function that is resposible for deciding what to do with 171 + * the item currently being scanned 172 + * @cb_arg: opaque type that will be passed to @isolate 173 + * @nr_to_walk: how many items to scan. 174 + * 175 + * Same as @list_lru_walk_one except that the spinlock is acquired with 176 + * spin_lock_irq(). 177 + */ 178 + unsigned long list_lru_walk_one_irq(struct list_lru *lru, 179 + int nid, struct mem_cgroup *memcg, 180 + list_lru_walk_cb isolate, void *cb_arg, 181 + unsigned long *nr_to_walk); 169 182 unsigned long list_lru_walk_node(struct list_lru *lru, int nid, 170 183 list_lru_walk_cb isolate, void *cb_arg, 171 184 unsigned long *nr_to_walk); ··· 193 172 { 194 173 return list_lru_walk_one(lru, sc->nid, sc->memcg, isolate, cb_arg, 195 174 &sc->nr_to_scan); 175 + } 176 + 177 + static inline unsigned long 178 + list_lru_shrink_walk_irq(struct list_lru *lru, struct shrink_control *sc, 179 + list_lru_walk_cb isolate, void *cb_arg) 180 + { 181 + return list_lru_walk_one_irq(lru, sc->nid, sc->memcg, isolate, cb_arg, 182 + &sc->nr_to_scan); 196 183 } 197 184 198 185 static inline unsigned long
+63 -11
include/linux/memcontrol.h
··· 112 112 }; 113 113 114 114 /* 115 + * Bitmap of shrinker::id corresponding to memcg-aware shrinkers, 116 + * which have elements charged to this memcg. 117 + */ 118 + struct memcg_shrinker_map { 119 + struct rcu_head rcu; 120 + unsigned long map[0]; 121 + }; 122 + 123 + /* 115 124 * per-zone information in memory controller. 116 125 */ 117 126 struct mem_cgroup_per_node { ··· 133 124 134 125 struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1]; 135 126 127 + #ifdef CONFIG_MEMCG_KMEM 128 + struct memcg_shrinker_map __rcu *shrinker_map; 129 + #endif 136 130 struct rb_node tree_node; /* RB tree node */ 137 131 unsigned long usage_in_excess;/* Set to the value by which */ 138 132 /* the soft limit is exceeded*/ ··· 283 271 bool tcpmem_active; 284 272 int tcpmem_pressure; 285 273 286 - #ifndef CONFIG_SLOB 274 + #ifdef CONFIG_MEMCG_KMEM 287 275 /* Index in the kmem_cache->memcg_params.memcg_caches array */ 288 276 int kmemcg_id; 289 277 enum memcg_kmem_state kmem_state; ··· 317 305 #define MEMCG_CHARGE_BATCH 32U 318 306 319 307 extern struct mem_cgroup *root_mem_cgroup; 308 + 309 + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 310 + { 311 + return (memcg == root_mem_cgroup); 312 + } 320 313 321 314 static inline bool mem_cgroup_disabled(void) 322 315 { ··· 390 373 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg); 391 374 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); 392 375 376 + struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); 377 + 378 + struct mem_cgroup *get_mem_cgroup_from_page(struct page *page); 379 + 393 380 static inline 394 381 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){ 395 382 return css ?
container_of(css, struct mem_cgroup, css) : NULL; 383 + } 384 + 385 + static inline void mem_cgroup_put(struct mem_cgroup *memcg) 386 + { 387 + if (memcg) 388 + css_put(&memcg->css); 396 389 } 397 390 398 391 #define mem_cgroup_from_counter(counter, member) \ ··· 524 497 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 525 498 struct task_struct *p); 526 499 527 - static inline void mem_cgroup_oom_enable(void) 500 + static inline void mem_cgroup_enter_user_fault(void) 528 501 { 529 - WARN_ON(current->memcg_may_oom); 530 - current->memcg_may_oom = 1; 502 + WARN_ON(current->in_user_fault); 503 + current->in_user_fault = 1; 531 504 } 532 505 533 - static inline void mem_cgroup_oom_disable(void) 506 + static inline void mem_cgroup_exit_user_fault(void) 534 507 { 535 - WARN_ON(!current->memcg_may_oom); 536 - current->memcg_may_oom = 0; 508 + WARN_ON(!current->in_user_fault); 509 + current->in_user_fault = 0; 537 510 } 538 511 539 512 static inline bool task_in_memcg_oom(struct task_struct *p) ··· 789 762 790 763 struct mem_cgroup; 791 764 765 + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 766 + { 767 + return true; 768 + } 769 + 792 770 static inline bool mem_cgroup_disabled(void) 793 771 { 794 772 return true; ··· 880 848 const struct mem_cgroup *memcg) 881 849 { 882 850 return true; 851 + } 852 + 853 + static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) 854 + { 855 + return NULL; 856 + } 857 + 858 + static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page) 859 + { 860 + return NULL; 861 + } 862 + 863 + static inline void mem_cgroup_put(struct mem_cgroup *memcg) 864 + { 883 865 } 884 866 885 867 static inline struct mem_cgroup * ··· 983 937 { 984 938 } 985 939 986 - static inline void mem_cgroup_oom_enable(void) 940 + static inline void mem_cgroup_enter_user_fault(void) 987 941 { 988 942 } 989 943 990 - static inline void mem_cgroup_oom_disable(void) 944 + static inline void
mem_cgroup_exit_user_fault(void) 991 945 { 992 946 } 993 947 ··· 1253 1207 int memcg_kmem_charge(struct page *page, gfp_t gfp, int order); 1254 1208 void memcg_kmem_uncharge(struct page *page, int order); 1255 1209 1256 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 1210 + #ifdef CONFIG_MEMCG_KMEM 1257 1211 extern struct static_key_false memcg_kmem_enabled_key; 1258 1212 extern struct workqueue_struct *memcg_kmem_cache_wq; 1259 1213 ··· 1284 1238 return memcg ? memcg->kmemcg_id : -1; 1285 1239 } 1286 1240 1241 + extern int memcg_expand_shrinker_maps(int new_id); 1242 + 1243 + extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg, 1244 + int nid, int shrinker_id); 1287 1245 #else 1288 1246 #define for_each_memcg_cache_index(_idx) \ 1289 1247 for (; NULL; ) ··· 1310 1260 { 1311 1261 } 1312 1262 1313 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 1263 + static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg, 1264 + int nid, int shrinker_id) { } 1265 + #endif /* CONFIG_MEMCG_KMEM */ 1314 1266 1315 1267 #endif /* _LINUX_MEMCONTROL_H */
+3 -7
include/linux/mm.h
··· 2665 2665 const char * arch_vma_name(struct vm_area_struct *vma); 2666 2666 void print_vma_addr(char *prefix, unsigned long rip); 2667 2667 2668 - void sparse_mem_maps_populate_node(struct page **map_map, 2669 - unsigned long pnum_begin, 2670 - unsigned long pnum_end, 2671 - unsigned long map_count, 2672 - int nodeid); 2673 - 2668 + void *sparse_buffer_alloc(unsigned long size); 2674 2669 struct page *sparse_mem_map_populate(unsigned long pnum, int nid, 2675 2670 struct vmem_altmap *altmap); 2676 2671 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node); ··· 2747 2752 unsigned long addr_hint, 2748 2753 unsigned int pages_per_huge_page); 2749 2754 extern void copy_user_huge_page(struct page *dst, struct page *src, 2750 - unsigned long addr, struct vm_area_struct *vma, 2755 + unsigned long addr_hint, 2756 + struct vm_area_struct *vma, 2751 2757 unsigned int pages_per_huge_page); 2752 2758 extern long copy_huge_page_from_user(struct page *dst_page, 2753 2759 const void __user *usr_src,
+7 -5
include/linux/node.h
··· 33 33 34 34 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA) 35 35 extern int link_mem_sections(int nid, unsigned long start_pfn, 36 - unsigned long nr_pages, bool check_nid); 36 + unsigned long end_pfn); 37 37 #else 38 38 static inline int link_mem_sections(int nid, unsigned long start_pfn, 39 - unsigned long nr_pages, bool check_nid) 39 + unsigned long end_pfn) 40 40 { 41 41 return 0; 42 42 } ··· 54 54 55 55 if (node_online(nid)) { 56 56 struct pglist_data *pgdat = NODE_DATA(nid); 57 + unsigned long start_pfn = pgdat->node_start_pfn; 58 + unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages; 57 59 58 60 error = __register_one_node(nid); 59 61 if (error) 60 62 return error; 61 63 /* link memory sections under this node */ 62 - error = link_mem_sections(nid, pgdat->node_start_pfn, pgdat->node_spanned_pages, true); 64 + error = link_mem_sections(nid, start_pfn, end_pfn); 63 65 } 64 66 65 67 return error; ··· 71 69 extern int register_cpu_under_node(unsigned int cpu, unsigned int nid); 72 70 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); 73 71 extern int register_mem_sect_under_node(struct memory_block *mem_blk, 74 - int nid, bool check_nid); 72 + void *arg); 75 73 extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, 76 74 unsigned long phys_index); 77 75 ··· 101 99 return 0; 102 100 } 103 101 static inline int register_mem_sect_under_node(struct memory_block *mem_blk, 104 - int nid, bool check_nid) 102 + void *arg) 105 103 { 106 104 return 0; 107 105 }
+2 -13
include/linux/page_ext.h
··· 16 16 17 17 #ifdef CONFIG_PAGE_EXTENSION 18 18 19 - /* 20 - * page_ext->flags bits: 21 - * 22 - * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to 23 - * implement generic debug pagealloc feature. The pages are filled with 24 - * poison patterns and set this flag after free_pages(). The poisoned 25 - * pages are verified whether the patterns are not corrupted and clear 26 - * the flag before alloc_pages(). 27 - */ 28 - 29 19 enum page_ext_flags { 30 - PAGE_EXT_DEBUG_POISON, /* Page is poisoned */ 31 20 PAGE_EXT_DEBUG_GUARD, 32 21 PAGE_EXT_OWNER, 33 22 #if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT) ··· 50 61 } 51 62 #endif 52 63 53 - struct page_ext *lookup_page_ext(struct page *page); 64 + struct page_ext *lookup_page_ext(const struct page *page); 54 65 55 66 #else /* !CONFIG_PAGE_EXTENSION */ 56 67 struct page_ext; ··· 59 70 { 60 71 } 61 72 62 - static inline struct page_ext *lookup_page_ext(struct page *page) 73 + static inline struct page_ext *lookup_page_ext(const struct page *page) 63 74 { 64 75 return NULL; 65 76 }
+5 -2
include/linux/sched.h
··· 722 722 unsigned restore_sigmask:1; 723 723 #endif 724 724 #ifdef CONFIG_MEMCG 725 - unsigned memcg_may_oom:1; 726 - #ifndef CONFIG_SLOB 725 + unsigned in_user_fault:1; 726 + #ifdef CONFIG_MEMCG_KMEM 727 727 unsigned memcg_kmem_skip_account:1; 728 728 #endif 729 729 #endif ··· 1152 1152 1153 1153 /* Number of pages to reclaim on returning to userland: */ 1154 1154 unsigned int memcg_nr_pages_over_high; 1155 + 1156 + /* Used by memcontrol for targeted memcg charge: */ 1157 + struct mem_cgroup *active_memcg; 1155 1158 #endif 1156 1159 1157 1160 #ifdef CONFIG_BLK_CGROUP
+37
include/linux/sched/mm.h
··· 248 248 current->flags = (current->flags & ~PF_MEMALLOC) | flags; 249 249 } 250 250 251 + #ifdef CONFIG_MEMCG 252 + /** 253 + * memalloc_use_memcg - Starts the remote memcg charging scope. 254 + * @memcg: memcg to charge. 255 + * 256 + * This function marks the beginning of the remote memcg charging scope. All the 257 + * __GFP_ACCOUNT allocations till the end of the scope will be charged to the 258 + * given memcg. 259 + * 260 + * NOTE: This function is not nesting safe. 261 + */ 262 + static inline void memalloc_use_memcg(struct mem_cgroup *memcg) 263 + { 264 + WARN_ON_ONCE(current->active_memcg); 265 + current->active_memcg = memcg; 266 + } 267 + 268 + /** 269 + * memalloc_unuse_memcg - Ends the remote memcg charging scope. 270 + * 271 + * This function marks the end of the remote memcg charging scope started by 272 + * memalloc_use_memcg(). 273 + */ 274 + static inline void memalloc_unuse_memcg(void) 275 + { 276 + current->active_memcg = NULL; 277 + } 278 + #else 279 + static inline void memalloc_use_memcg(struct mem_cgroup *memcg) 280 + { 281 + } 282 + 283 + static inline void memalloc_unuse_memcg(void) 284 + { 285 + } 286 + #endif 287 + 251 288 #ifdef CONFIG_MEMBARRIER 252 289 enum { 253 290 MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = (1U << 0),
+9 -2
include/linux/shrinker.h
··· 34 34 }; 35 35 36 36 #define SHRINK_STOP (~0UL) 37 + #define SHRINK_EMPTY (~0UL - 1) 37 38 /* 38 39 * A callback you can register to apply pressure to ageable caches. 39 40 * 40 41 * @count_objects should return the number of freeable items in the cache. If 41 - * there are no objects to free or the number of freeable items cannot be 42 - * determined, it should return 0. No deadlock checks should be done during the 42 + * there are no objects to free, it should return SHRINK_EMPTY, while 0 43 + * should be returned when the number of freeable items cannot be determined 44 + * or when the shrinker should skip this cache for now (e.g., because their 45 + * number is below the shrinkable limit). No deadlock checks should be done during the 43 46 * count callback - the shrinker relies on aggregating scan counts that couldn't 44 47 * be executed due to potential deadlocks to be run at a later call when the 45 48 * deadlock condition is no longer pending. ··· 69 66 70 67 /* These are for internal use */ 71 68 struct list_head list; 69 + #ifdef CONFIG_MEMCG_KMEM 70 + /* ID in shrinker_idr */ 71 + int id; 72 + #endif 72 73 /* objs pending delete, per node */ 73 74 atomic_long_t *nr_deferred; 74 75 };
+1 -1
include/linux/slab.h
··· 97 97 # define SLAB_FAILSLAB 0 98 98 #endif 99 99 /* Account to memcg */ 100 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 100 + #ifdef CONFIG_MEMCG_KMEM 101 101 # define SLAB_ACCOUNT ((slab_flags_t __force)0x04000000U) 102 102 #else 103 103 # define SLAB_ACCOUNT 0
-6
include/linux/vmacache.h
··· 5 5 #include <linux/sched.h> 6 6 #include <linux/mm.h> 7 7 8 - /* 9 - * Hash based on the page number. Provides a good hit rate for 10 - * workloads with good locality and those with random accesses as well. 11 - */ 12 - #define VMACACHE_HASH(addr) ((addr >> PAGE_SHIFT) & VMACACHE_MASK) 13 - 14 8 static inline void vmacache_flush(struct task_struct *tsk) 15 9 { 16 10 memset(tsk->vmacache.vmas, 0, sizeof(tsk->vmacache.vmas));
+5
init/Kconfig
··· 708 708 select this option (if, for some reason, they need to disable it 709 709 then swapaccount=0 does the trick). 710 710 711 + config MEMCG_KMEM 712 + bool 713 + depends on MEMCG && !SLOB 714 + default y 715 + 711 716 config BLK_CGROUP 712 717 bool "IO controller" 713 718 depends on BLOCK
+3 -3
kernel/dma/contiguous.c
··· 178 178 * @dev: Pointer to device for which the allocation is performed. 179 179 * @count: Requested number of pages. 180 180 * @align: Requested alignment of pages (in PAGE_SIZE order). 181 - * @gfp_mask: GFP flags to use for this allocation. 181 + * @no_warn: Avoid printing message about failed allocation. 182 182 * 183 183 * This function allocates memory buffer for specified device. It uses 184 184 * device specific contiguous memory area if available or the default ··· 186 186 * function. 187 187 */ 188 188 struct page *dma_alloc_from_contiguous(struct device *dev, size_t count, 189 - unsigned int align, gfp_t gfp_mask) 189 + unsigned int align, bool no_warn) 190 190 { 191 191 if (align > CONFIG_CMA_ALIGNMENT) 192 192 align = CONFIG_CMA_ALIGNMENT; 193 193 194 - return cma_alloc(dev_get_cma_area(dev), count, align, gfp_mask); 194 + return cma_alloc(dev_get_cma_area(dev), count, align, no_warn); 195 195 } 196 196 197 197 /**
+2 -1
kernel/dma/direct.c
··· 78 78 again: 79 79 /* CMA can be used only in the context which permits sleeping */ 80 80 if (gfpflags_allow_blocking(gfp)) { 81 - page = dma_alloc_from_contiguous(dev, count, page_order, gfp); 81 + page = dma_alloc_from_contiguous(dev, count, page_order, 82 + gfp & __GFP_NOWARN); 82 83 if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) { 83 84 dma_release_from_contiguous(dev, page, count); 84 85 page = NULL;
+3
kernel/fork.c
··· 871 871 tsk->use_memdelay = 0; 872 872 #endif 873 873 874 + #ifdef CONFIG_MEMCG 875 + tsk->active_memcg = NULL; 876 + #endif 874 877 return tsk; 875 878 876 879 free_stack:
+10
kernel/memremap.c
··· 5 5 #include <linux/types.h> 6 6 #include <linux/pfn_t.h> 7 7 #include <linux/io.h> 8 + #include <linux/kasan.h> 8 9 #include <linux/mm.h> 9 10 #include <linux/memory_hotplug.h> 10 11 #include <linux/swap.h> ··· 138 137 mem_hotplug_begin(); 139 138 arch_remove_memory(align_start, align_size, pgmap->altmap_valid ? 140 139 &pgmap->altmap : NULL); 140 + kasan_remove_zero_shadow(__va(align_start), align_size); 141 141 mem_hotplug_done(); 142 142 143 143 untrack_pfn(NULL, PHYS_PFN(align_start), align_size); ··· 241 239 goto err_pfn_remap; 242 240 243 241 mem_hotplug_begin(); 242 + error = kasan_add_zero_shadow(__va(align_start), align_size); 243 + if (error) { 244 + mem_hotplug_done(); 245 + goto err_kasan; 246 + } 247 + 244 248 error = arch_add_memory(nid, align_start, align_size, altmap, false); 245 249 if (!error) 246 250 move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], ··· 275 267 return __va(res->start); 276 268 277 269 err_add_memory: 270 + kasan_remove_zero_shadow(__va(align_start), align_size); 271 + err_kasan: 278 272 untrack_pfn(NULL, PHYS_PFN(align_start), align_size); 279 273 err_pfn_remap: 280 274 err_radix:
+4 -7
mm/Kconfig
··· 118 118 config SPARSEMEM_VMEMMAP_ENABLE 119 119 bool 120 120 121 - config SPARSEMEM_ALLOC_MEM_MAP_TOGETHER 122 - def_bool y 123 - depends on SPARSEMEM && X86_64 124 - 125 121 config SPARSEMEM_VMEMMAP 126 122 bool "Sparse Memory virtual memmap" 127 123 depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE ··· 419 423 420 424 config THP_SWAP 421 425 def_bool y 422 - depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP 426 + depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP 423 427 help 424 428 Swap transparent huge pages in one piece, without splitting. 425 - XXX: For now this only does clustered swap space allocation. 429 + XXX: For now, swap cluster backing transparent huge page 430 + will be split after swapout. 426 431 427 432 For selection by architectures with reasonable THP sizes. 428 433 ··· 635 638 bool "Defer initialisation of struct pages to kthreads" 636 639 default n 637 640 depends on NO_BOOTMEM 638 - depends on !FLATMEM 641 + depends on SPARSEMEM 639 642 depends on !NEED_PER_CPU_KM 640 643 help 641 644 Ordinarily all struct pages are initialised during early boot in a
+4 -4
mm/cma.c
··· 395 395 * @cma: Contiguous memory region for which the allocation is performed. 396 396 * @count: Requested number of pages. 397 397 * @align: Requested alignment of pages (in PAGE_SIZE order). 398 - * @gfp_mask: GFP mask to use during compaction 398 + * @no_warn: Avoid printing message about failed allocation 399 399 * 400 400 * This function allocates part of contiguous memory on specific 401 401 * contiguous memory area. 402 402 */ 403 403 struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, 404 - gfp_t gfp_mask) 404 + bool no_warn) 405 405 { 406 406 unsigned long mask, offset; 407 407 unsigned long pfn = -1; ··· 447 447 pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit); 448 448 mutex_lock(&cma_mutex); 449 449 ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, 450 - gfp_mask); 450 + GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0)); 451 451 mutex_unlock(&cma_mutex); 452 452 if (ret == 0) { 453 453 page = pfn_to_page(pfn); ··· 466 466 467 467 trace_cma_alloc(pfn, page, count, align); 468 468 469 - if (ret && !(gfp_mask & __GFP_NOWARN)) { 469 + if (ret && !no_warn) { 470 470 pr_err("%s: alloc failed, req-size: %zu pages, ret: %d\n", 471 471 __func__, count, ret); 472 472 cma_debug_show_areas(cma);
+1 -1
mm/cma_debug.c
··· 139 139 if (!mem) 140 140 return -ENOMEM; 141 141 142 - p = cma_alloc(cma, count, 0, GFP_KERNEL); 142 + p = cma_alloc(cma, count, 0, false); 143 143 if (!p) { 144 144 kfree(mem); 145 145 return -ENOMEM;
+6 -2
mm/fadvise.c
··· 72 72 goto out; 73 73 } 74 74 75 - /* Careful about overflows. Len == 0 means "as much as possible" */ 76 - endbyte = offset + len; 75 + /* 76 + * Careful about overflows. Len == 0 means "as much as possible". Use 77 + * unsigned math because signed overflows are undefined and UBSan 78 + * complains. 79 + */ 80 + endbyte = (u64)offset + (u64)len; 77 81 if (!len || endbyte < len) 78 82 endbyte = -1; 79 83 else
+9 -10
mm/hmm.c
··· 299 299 struct hmm_vma_walk *hmm_vma_walk = walk->private; 300 300 struct hmm_range *range = hmm_vma_walk->range; 301 301 struct vm_area_struct *vma = walk->vma; 302 - int r; 302 + vm_fault_t ret; 303 303 304 304 flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY; 305 305 flags |= write_fault ? FAULT_FLAG_WRITE : 0; 306 - r = handle_mm_fault(vma, addr, flags); 307 - if (r & VM_FAULT_RETRY) 306 + ret = handle_mm_fault(vma, addr, flags); 307 + if (ret & VM_FAULT_RETRY) 308 308 return -EBUSY; 309 - if (r & VM_FAULT_ERROR) { 309 + if (ret & VM_FAULT_ERROR) { 310 310 *pfn = range->values[HMM_PFN_ERROR]; 311 311 return -EFAULT; 312 312 } ··· 676 676 return -EINVAL; 677 677 678 678 /* FIXME support hugetlb fs */ 679 - if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) { 679 + if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) || 680 + vma_is_dax(vma)) { 680 681 hmm_pfns_special(range); 681 682 return -EINVAL; 682 683 } ··· 850 849 return -EINVAL; 851 850 852 851 /* FIXME support hugetlb fs */ 853 - if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) { 852 + if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) || 853 + vma_is_dax(vma)) { 854 854 hmm_pfns_special(range); 855 855 return -EINVAL; 856 856 } ··· 973 971 974 972 static void hmm_devmem_radix_release(struct resource *resource) 975 973 { 976 - resource_size_t key, align_start, align_size; 977 - 978 - align_start = resource->start & ~(PA_SECTION_SIZE - 1); 979 - align_size = ALIGN(resource_size(resource), PA_SECTION_SIZE); 974 + resource_size_t key; 980 975 981 976 mutex_lock(&hmm_devmem_lock); 982 977 for (key = resource->start;
+6 -5
mm/huge_memory.c
··· 762 762 * but we need to be consistent with PTEs and architectures that 763 763 * can't support a 'special' bit. 764 764 */ 765 - BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))); 765 + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && 766 + !pfn_t_devmap(pfn)); 766 767 BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == 767 768 (VM_PFNMAP|VM_MIXEDMAP)); 768 769 BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); 769 - BUG_ON(!pfn_t_devmap(pfn)); 770 770 771 771 if (addr < vma->vm_start || addr >= vma->vm_end) 772 772 return VM_FAULT_SIGBUS; ··· 1328 1328 if (!page) 1329 1329 clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR); 1330 1330 else 1331 - copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); 1331 + copy_user_huge_page(new_page, page, vmf->address, 1332 + vma, HPAGE_PMD_NR); 1332 1333 __SetPageUptodate(new_page); 1333 1334 1334 1335 mmun_start = haddr; ··· 1741 1740 } else { 1742 1741 if (arch_needs_pgtable_deposit()) 1743 1742 zap_deposited_table(tlb->mm, pmd); 1744 - add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR); 1743 + add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR); 1745 1744 } 1746 1745 1747 1746 spin_unlock(ptl); ··· 2091 2090 SetPageReferenced(page); 2092 2091 page_remove_rmap(page, true); 2093 2092 put_page(page); 2094 - add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR); 2093 + add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); 2095 2094 return; 2096 2095 } else if (is_huge_zero_pmd(*pmd)) { 2097 2096 /*
+16 -23
mm/hugetlb.c
··· 2101 2101 for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) { 2102 2102 void *addr; 2103 2103 2104 - addr = memblock_virt_alloc_try_nid_nopanic( 2104 + addr = memblock_virt_alloc_try_nid_raw( 2105 2105 huge_page_size(h), huge_page_size(h), 2106 2106 0, BOOTMEM_ALLOC_ACCESSIBLE, node); 2107 2107 if (addr) { ··· 2119 2119 found: 2120 2120 BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h))); 2121 2121 /* Put them into a private list first because mem_map is not up yet */ 2122 + INIT_LIST_HEAD(&m->list); 2122 2123 list_add(&m->list, &huge_boot_pages); 2123 2124 m->hstate = h; 2124 2125 return 1; ··· 2140 2139 struct huge_bootmem_page *m; 2141 2140 2142 2141 list_for_each_entry(m, &huge_boot_pages, list) { 2142 + struct page *page = virt_to_page(m); 2143 2143 struct hstate *h = m->hstate; 2144 - struct page *page; 2145 2144 2146 - #ifdef CONFIG_HIGHMEM 2147 - page = pfn_to_page(m->phys >> PAGE_SHIFT); 2148 - memblock_free_late(__pa(m), 2149 - sizeof(struct huge_bootmem_page)); 2150 - #else 2151 - page = virt_to_page(m); 2152 - #endif 2153 2145 WARN_ON(page_count(page) != 1); 2154 2146 prep_compound_huge_page(page, h->order); 2155 2147 WARN_ON(PageReserved(page)); ··· 3512 3518 int ret = 0, outside_reserve = 0; 3513 3519 unsigned long mmun_start; /* For mmu_notifiers */ 3514 3520 unsigned long mmun_end; /* For mmu_notifiers */ 3521 + unsigned long haddr = address & huge_page_mask(h); 3515 3522 3516 3523 pte = huge_ptep_get(ptep); 3517 3524 old_page = pte_page(pte); ··· 3522 3527 * and just make the page writable */ 3523 3528 if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { 3524 3529 page_move_anon_rmap(old_page, vma); 3525 - set_huge_ptep_writable(vma, address, ptep); 3530 + set_huge_ptep_writable(vma, haddr, ptep); 3526 3531 return 0; 3527 3532 } 3528 3533 ··· 3546 3551 * be acquired again before returning to the caller, as expected. 
3547 3552 */ 3548 3553 spin_unlock(ptl); 3549 - new_page = alloc_huge_page(vma, address, outside_reserve); 3554 + new_page = alloc_huge_page(vma, haddr, outside_reserve); 3550 3555 3551 3556 if (IS_ERR(new_page)) { 3552 3557 /* ··· 3559 3564 if (outside_reserve) { 3560 3565 put_page(old_page); 3561 3566 BUG_ON(huge_pte_none(pte)); 3562 - unmap_ref_private(mm, vma, old_page, address); 3567 + unmap_ref_private(mm, vma, old_page, haddr); 3563 3568 BUG_ON(huge_pte_none(pte)); 3564 3569 spin_lock(ptl); 3565 - ptep = huge_pte_offset(mm, address & huge_page_mask(h), 3566 - huge_page_size(h)); 3570 + ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); 3567 3571 if (likely(ptep && 3568 3572 pte_same(huge_ptep_get(ptep), pte))) 3569 3573 goto retry_avoidcopy; ··· 3592 3598 __SetPageUptodate(new_page); 3593 3599 set_page_huge_active(new_page); 3594 3600 3595 - mmun_start = address & huge_page_mask(h); 3601 + mmun_start = haddr; 3596 3602 mmun_end = mmun_start + huge_page_size(h); 3597 3603 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 3598 3604 ··· 3601 3607 * before the page tables are altered 3602 3608 */ 3603 3609 spin_lock(ptl); 3604 - ptep = huge_pte_offset(mm, address & huge_page_mask(h), 3605 - huge_page_size(h)); 3610 + ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); 3606 3611 if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) { 3607 3612 ClearPagePrivate(new_page); 3608 3613 3609 3614 /* Break COW */ 3610 - huge_ptep_clear_flush(vma, address, ptep); 3615 + huge_ptep_clear_flush(vma, haddr, ptep); 3611 3616 mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); 3612 - set_huge_pte_at(mm, address, ptep, 3617 + set_huge_pte_at(mm, haddr, ptep, 3613 3618 make_huge_pte(vma, new_page, 1)); 3614 3619 page_remove_rmap(old_page, true); 3615 - hugepage_add_new_anon_rmap(new_page, vma, address); 3620 + hugepage_add_new_anon_rmap(new_page, vma, haddr); 3616 3621 /* Make the old page be freed below */ 3617 3622 new_page = old_page; 3618 3623 } 3619 
3624 spin_unlock(ptl); 3620 3625 mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); 3621 3626 out_release_all: 3622 - restore_reserve_on_error(h, vma, address, new_page); 3627 + restore_reserve_on_error(h, vma, haddr, new_page); 3623 3628 put_page(new_page); 3624 3629 out_release_old: 3625 3630 put_page(old_page); ··· 3821 3828 hugetlb_count_add(pages_per_huge_page(h), mm); 3822 3829 if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { 3823 3830 /* Optimization, do the COW without a second fault */ 3824 - ret = hugetlb_cow(mm, vma, haddr, ptep, page, ptl); 3831 + ret = hugetlb_cow(mm, vma, address, ptep, page, ptl); 3825 3832 } 3826 3833 3827 3834 spin_unlock(ptl); ··· 3975 3982 3976 3983 if (flags & FAULT_FLAG_WRITE) { 3977 3984 if (!huge_pte_write(entry)) { 3978 - ret = hugetlb_cow(mm, vma, haddr, ptep, 3985 + ret = hugetlb_cow(mm, vma, address, ptep, 3979 3986 pagecache_page, ptl); 3980 3987 goto out_put_page; 3981 3988 }
+303 -13
mm/kasan/kasan_init.c
··· 17 17 #include <linux/memblock.h> 18 18 #include <linux/mm.h> 19 19 #include <linux/pfn.h> 20 + #include <linux/slab.h> 20 21 21 22 #include <asm/page.h> 22 23 #include <asm/pgalloc.h> 24 + 25 + #include "kasan.h" 23 26 24 27 /* 25 28 * This page serves two purposes: ··· 35 32 36 33 #if CONFIG_PGTABLE_LEVELS > 4 37 34 p4d_t kasan_zero_p4d[MAX_PTRS_PER_P4D] __page_aligned_bss; 35 + static inline bool kasan_p4d_table(pgd_t pgd) 36 + { 37 + return pgd_page(pgd) == virt_to_page(lm_alias(kasan_zero_p4d)); 38 + } 39 + #else 40 + static inline bool kasan_p4d_table(pgd_t pgd) 41 + { 42 + return 0; 43 + } 38 44 #endif 39 45 #if CONFIG_PGTABLE_LEVELS > 3 40 46 pud_t kasan_zero_pud[PTRS_PER_PUD] __page_aligned_bss; 47 + static inline bool kasan_pud_table(p4d_t p4d) 48 + { 49 + return p4d_page(p4d) == virt_to_page(lm_alias(kasan_zero_pud)); 50 + } 51 + #else 52 + static inline bool kasan_pud_table(p4d_t p4d) 53 + { 54 + return 0; 55 + } 41 56 #endif 42 57 #if CONFIG_PGTABLE_LEVELS > 2 43 58 pmd_t kasan_zero_pmd[PTRS_PER_PMD] __page_aligned_bss; 59 + static inline bool kasan_pmd_table(pud_t pud) 60 + { 61 + return pud_page(pud) == virt_to_page(lm_alias(kasan_zero_pmd)); 62 + } 63 + #else 64 + static inline bool kasan_pmd_table(pud_t pud) 65 + { 66 + return 0; 67 + } 44 68 #endif 45 69 pte_t kasan_zero_pte[PTRS_PER_PTE] __page_aligned_bss; 70 + 71 + static inline bool kasan_pte_table(pmd_t pmd) 72 + { 73 + return pmd_page(pmd) == virt_to_page(lm_alias(kasan_zero_pte)); 74 + } 75 + 76 + static inline bool kasan_zero_page_entry(pte_t pte) 77 + { 78 + return pte_page(pte) == virt_to_page(lm_alias(kasan_zero_page)); 79 + } 46 80 47 81 static __init void *early_alloc(size_t size, int node) 48 82 { ··· 87 47 BOOTMEM_ALLOC_ACCESSIBLE, node); 88 48 } 89 49 90 - static void __init zero_pte_populate(pmd_t *pmd, unsigned long addr, 50 + static void __ref zero_pte_populate(pmd_t *pmd, unsigned long addr, 91 51 unsigned long end) 92 52 { 93 53 pte_t *pte = pte_offset_kernel(pmd, addr); 
··· 103 63 } 104 64 } 105 65 106 - static void __init zero_pmd_populate(pud_t *pud, unsigned long addr, 66 + static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr, 107 67 unsigned long end) 108 68 { 109 69 pmd_t *pmd = pmd_offset(pud, addr); ··· 118 78 } 119 79 120 80 if (pmd_none(*pmd)) { 121 - pmd_populate_kernel(&init_mm, pmd, 122 - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 81 + pte_t *p; 82 + 83 + if (slab_is_available()) 84 + p = pte_alloc_one_kernel(&init_mm, addr); 85 + else 86 + p = early_alloc(PAGE_SIZE, NUMA_NO_NODE); 87 + if (!p) 88 + return -ENOMEM; 89 + 90 + pmd_populate_kernel(&init_mm, pmd, p); 123 91 } 124 92 zero_pte_populate(pmd, addr, next); 125 93 } while (pmd++, addr = next, addr != end); 94 + 95 + return 0; 126 96 } 127 97 128 - static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr, 98 + static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr, 129 99 unsigned long end) 130 100 { 131 101 pud_t *pud = pud_offset(p4d, addr); ··· 153 103 } 154 104 155 105 if (pud_none(*pud)) { 156 - pud_populate(&init_mm, pud, 157 - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 106 + pmd_t *p; 107 + 108 + if (slab_is_available()) { 109 + p = pmd_alloc(&init_mm, pud, addr); 110 + if (!p) 111 + return -ENOMEM; 112 + } else { 113 + pud_populate(&init_mm, pud, 114 + early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 115 + } 158 116 } 159 117 zero_pmd_populate(pud, addr, next); 160 118 } while (pud++, addr = next, addr != end); 119 + 120 + return 0; 161 121 } 162 122 163 - static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr, 123 + static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr, 164 124 unsigned long end) 165 125 { 166 126 p4d_t *p4d = p4d_offset(pgd, addr); ··· 192 132 } 193 133 194 134 if (p4d_none(*p4d)) { 195 - p4d_populate(&init_mm, p4d, 196 - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 135 + pud_t *p; 136 + 137 + if (slab_is_available()) { 138 + p = pud_alloc(&init_mm, p4d, addr); 139 + if (!p) 140 + return -ENOMEM; 141 
+ } else { 142 + p4d_populate(&init_mm, p4d, 143 + early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 144 + } 197 145 } 198 146 zero_pud_populate(p4d, addr, next); 199 147 } while (p4d++, addr = next, addr != end); 148 + 149 + return 0; 200 150 } 201 151 202 152 /** ··· 215 145 * @shadow_start - start of the memory range to populate 216 146 * @shadow_end - end of the memory range to populate 217 147 */ 218 - void __init kasan_populate_zero_shadow(const void *shadow_start, 148 + int __ref kasan_populate_zero_shadow(const void *shadow_start, 219 149 const void *shadow_end) 220 150 { 221 151 unsigned long addr = (unsigned long)shadow_start; ··· 261 191 } 262 192 263 193 if (pgd_none(*pgd)) { 264 - pgd_populate(&init_mm, pgd, 265 - early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 194 + p4d_t *p; 195 + 196 + if (slab_is_available()) { 197 + p = p4d_alloc(&init_mm, pgd, addr); 198 + if (!p) 199 + return -ENOMEM; 200 + } else { 201 + pgd_populate(&init_mm, pgd, 202 + early_alloc(PAGE_SIZE, NUMA_NO_NODE)); 203 + } 266 204 } 267 205 zero_p4d_populate(pgd, addr, next); 268 206 } while (pgd++, addr = next, addr != end); 207 + 208 + return 0; 209 + } 210 + 211 + static void kasan_free_pte(pte_t *pte_start, pmd_t *pmd) 212 + { 213 + pte_t *pte; 214 + int i; 215 + 216 + for (i = 0; i < PTRS_PER_PTE; i++) { 217 + pte = pte_start + i; 218 + if (!pte_none(*pte)) 219 + return; 220 + } 221 + 222 + pte_free_kernel(&init_mm, (pte_t *)page_to_virt(pmd_page(*pmd))); 223 + pmd_clear(pmd); 224 + } 225 + 226 + static void kasan_free_pmd(pmd_t *pmd_start, pud_t *pud) 227 + { 228 + pmd_t *pmd; 229 + int i; 230 + 231 + for (i = 0; i < PTRS_PER_PMD; i++) { 232 + pmd = pmd_start + i; 233 + if (!pmd_none(*pmd)) 234 + return; 235 + } 236 + 237 + pmd_free(&init_mm, (pmd_t *)page_to_virt(pud_page(*pud))); 238 + pud_clear(pud); 239 + } 240 + 241 + static void kasan_free_pud(pud_t *pud_start, p4d_t *p4d) 242 + { 243 + pud_t *pud; 244 + int i; 245 + 246 + for (i = 0; i < PTRS_PER_PUD; i++) { 247 + pud = pud_start + i; 248 + 
if (!pud_none(*pud)) 249 + return; 250 + } 251 + 252 + pud_free(&init_mm, (pud_t *)page_to_virt(p4d_page(*p4d))); 253 + p4d_clear(p4d); 254 + } 255 + 256 + static void kasan_free_p4d(p4d_t *p4d_start, pgd_t *pgd) 257 + { 258 + p4d_t *p4d; 259 + int i; 260 + 261 + for (i = 0; i < PTRS_PER_P4D; i++) { 262 + p4d = p4d_start + i; 263 + if (!p4d_none(*p4d)) 264 + return; 265 + } 266 + 267 + p4d_free(&init_mm, (p4d_t *)page_to_virt(pgd_page(*pgd))); 268 + pgd_clear(pgd); 269 + } 270 + 271 + static void kasan_remove_pte_table(pte_t *pte, unsigned long addr, 272 + unsigned long end) 273 + { 274 + unsigned long next; 275 + 276 + for (; addr < end; addr = next, pte++) { 277 + next = (addr + PAGE_SIZE) & PAGE_MASK; 278 + if (next > end) 279 + next = end; 280 + 281 + if (!pte_present(*pte)) 282 + continue; 283 + 284 + if (WARN_ON(!kasan_zero_page_entry(*pte))) 285 + continue; 286 + pte_clear(&init_mm, addr, pte); 287 + } 288 + } 289 + 290 + static void kasan_remove_pmd_table(pmd_t *pmd, unsigned long addr, 291 + unsigned long end) 292 + { 293 + unsigned long next; 294 + 295 + for (; addr < end; addr = next, pmd++) { 296 + pte_t *pte; 297 + 298 + next = pmd_addr_end(addr, end); 299 + 300 + if (!pmd_present(*pmd)) 301 + continue; 302 + 303 + if (kasan_pte_table(*pmd)) { 304 + if (IS_ALIGNED(addr, PMD_SIZE) && 305 + IS_ALIGNED(next, PMD_SIZE)) 306 + pmd_clear(pmd); 307 + continue; 308 + } 309 + pte = pte_offset_kernel(pmd, addr); 310 + kasan_remove_pte_table(pte, addr, next); 311 + kasan_free_pte(pte_offset_kernel(pmd, 0), pmd); 312 + } 313 + } 314 + 315 + static void kasan_remove_pud_table(pud_t *pud, unsigned long addr, 316 + unsigned long end) 317 + { 318 + unsigned long next; 319 + 320 + for (; addr < end; addr = next, pud++) { 321 + pmd_t *pmd, *pmd_base; 322 + 323 + next = pud_addr_end(addr, end); 324 + 325 + if (!pud_present(*pud)) 326 + continue; 327 + 328 + if (kasan_pmd_table(*pud)) { 329 + if (IS_ALIGNED(addr, PUD_SIZE) && 330 + IS_ALIGNED(next, PUD_SIZE)) 331 + 
pud_clear(pud); 332 + continue; 333 + } 334 + pmd = pmd_offset(pud, addr); 335 + pmd_base = pmd_offset(pud, 0); 336 + kasan_remove_pmd_table(pmd, addr, next); 337 + kasan_free_pmd(pmd_base, pud); 338 + } 339 + } 340 + 341 + static void kasan_remove_p4d_table(p4d_t *p4d, unsigned long addr, 342 + unsigned long end) 343 + { 344 + unsigned long next; 345 + 346 + for (; addr < end; addr = next, p4d++) { 347 + pud_t *pud; 348 + 349 + next = p4d_addr_end(addr, end); 350 + 351 + if (!p4d_present(*p4d)) 352 + continue; 353 + 354 + if (kasan_pud_table(*p4d)) { 355 + if (IS_ALIGNED(addr, P4D_SIZE) && 356 + IS_ALIGNED(next, P4D_SIZE)) 357 + p4d_clear(p4d); 358 + continue; 359 + } 360 + pud = pud_offset(p4d, addr); 361 + kasan_remove_pud_table(pud, addr, next); 362 + kasan_free_pud(pud_offset(p4d, 0), p4d); 363 + } 364 + } 365 + 366 + void kasan_remove_zero_shadow(void *start, unsigned long size) 367 + { 368 + unsigned long addr, end, next; 369 + pgd_t *pgd; 370 + 371 + addr = (unsigned long)kasan_mem_to_shadow(start); 372 + end = addr + (size >> KASAN_SHADOW_SCALE_SHIFT); 373 + 374 + if (WARN_ON((unsigned long)start % 375 + (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE)) || 376 + WARN_ON(size % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE))) 377 + return; 378 + 379 + for (; addr < end; addr = next) { 380 + p4d_t *p4d; 381 + 382 + next = pgd_addr_end(addr, end); 383 + 384 + pgd = pgd_offset_k(addr); 385 + if (!pgd_present(*pgd)) 386 + continue; 387 + 388 + if (kasan_p4d_table(*pgd)) { 389 + if (IS_ALIGNED(addr, PGDIR_SIZE) && 390 + IS_ALIGNED(next, PGDIR_SIZE)) 391 + pgd_clear(pgd); 392 + continue; 393 + } 394 + 395 + p4d = p4d_offset(pgd, addr); 396 + kasan_remove_p4d_table(p4d, addr, next); 397 + kasan_free_p4d(p4d_offset(pgd, 0), pgd); 398 + } 399 + } 400 + 401 + int kasan_add_zero_shadow(void *start, unsigned long size) 402 + { 403 + int ret; 404 + void *shadow_start, *shadow_end; 405 + 406 + shadow_start = kasan_mem_to_shadow(start); 407 + shadow_end = shadow_start + (size >> 
KASAN_SHADOW_SCALE_SHIFT); 408 + 409 + if (WARN_ON((unsigned long)start % 410 + (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE)) || 411 + WARN_ON(size % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE))) 412 + return -EINVAL; 413 + 414 + ret = kasan_populate_zero_shadow(shadow_start, shadow_end); 415 + if (ret) 416 + kasan_remove_zero_shadow(shadow_start, 417 + size >> KASAN_SHADOW_SCALE_SHIFT); 418 + return ret; 269 419 }
+31 -29
mm/khugepaged.c
··· 397 397 return atomic_read(&mm->mm_users) == 0; 398 398 } 399 399 400 + static bool hugepage_vma_check(struct vm_area_struct *vma, 401 + unsigned long vm_flags) 402 + { 403 + if ((!(vm_flags & VM_HUGEPAGE) && !khugepaged_always()) || 404 + (vm_flags & VM_NOHUGEPAGE) || 405 + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 406 + return false; 407 + if (shmem_file(vma->vm_file)) { 408 + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) 409 + return false; 410 + return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, 411 + HPAGE_PMD_NR); 412 + } 413 + if (!vma->anon_vma || vma->vm_ops) 414 + return false; 415 + if (is_vma_temporary_stack(vma)) 416 + return false; 417 + return !(vm_flags & VM_NO_KHUGEPAGED); 418 + } 419 + 400 420 int __khugepaged_enter(struct mm_struct *mm) 401 421 { 402 422 struct mm_slot *mm_slot; ··· 454 434 unsigned long vm_flags) 455 435 { 456 436 unsigned long hstart, hend; 457 - if (!vma->anon_vma) 458 - /* 459 - * Not yet faulted in so we will register later in the 460 - * page fault if needed. 461 - */ 437 + 438 + /* 439 + * khugepaged does not yet work on non-shmem files or special 440 + * mappings. And file-private shmem THP is not supported. 
441 + */ 442 + if (!hugepage_vma_check(vma, vm_flags)) 462 443 return 0; 463 - if (vma->vm_ops || (vm_flags & VM_NO_KHUGEPAGED)) 464 - /* khugepaged not yet working on file or special mappings */ 465 - return 0; 444 + 466 445 hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK; 467 446 hend = vma->vm_end & HPAGE_PMD_MASK; 468 447 if (hstart < hend) ··· 838 819 } 839 820 #endif 840 821 841 - static bool hugepage_vma_check(struct vm_area_struct *vma) 842 - { 843 - if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) || 844 - (vma->vm_flags & VM_NOHUGEPAGE) || 845 - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) 846 - return false; 847 - if (shmem_file(vma->vm_file)) { 848 - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) 849 - return false; 850 - return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, 851 - HPAGE_PMD_NR); 852 - } 853 - if (!vma->anon_vma || vma->vm_ops) 854 - return false; 855 - if (is_vma_temporary_stack(vma)) 856 - return false; 857 - return !(vma->vm_flags & VM_NO_KHUGEPAGED); 858 - } 859 - 860 822 /* 861 823 * If mmap_sem temporarily dropped, revalidate vma 862 824 * before taking mmap_sem. ··· 862 862 hend = vma->vm_end & HPAGE_PMD_MASK; 863 863 if (address < hstart || address + HPAGE_PMD_SIZE > hend) 864 864 return SCAN_ADDRESS_RANGE; 865 - if (!hugepage_vma_check(vma)) 865 + if (!hugepage_vma_check(vma, vma->vm_flags)) 866 866 return SCAN_VMA_CHECK; 867 867 return 0; 868 868 } ··· 1517 1517 unlock_page(new_page); 1518 1518 1519 1519 *hpage = NULL; 1520 + 1521 + khugepaged_pages_collapsed++; 1520 1522 } else { 1521 1523 /* Something went wrong: rollback changes to the radix-tree */ 1522 1524 shmem_uncharge(mapping->host, nr_none); ··· 1696 1694 progress++; 1697 1695 break; 1698 1696 } 1699 - if (!hugepage_vma_check(vma)) { 1697 + if (!hugepage_vma_check(vma, vma->vm_flags)) { 1700 1698 skip: 1701 1699 progress++; 1702 1700 continue;
+4 -1
mm/ksm.c
··· 470 470 static int break_ksm(struct vm_area_struct *vma, unsigned long addr) 471 471 { 472 472 struct page *page; 473 - int ret = 0; 473 + vm_fault_t ret = 0; 474 474 475 475 do { 476 476 cond_resched(); ··· 2429 2429 VM_PFNMAP | VM_IO | VM_DONTEXPAND | 2430 2430 VM_HUGETLB | VM_MIXEDMAP)) 2431 2431 return 0; /* just ignore the advice */ 2432 + 2433 + if (vma_is_dax(vma)) 2434 + return 0; 2432 2435 2433 2436 #ifdef VM_SAO 2434 2437 if (*vm_flags & VM_SAO)
+101 -48
mm/list_lru.c
··· 12 12 #include <linux/mutex.h> 13 13 #include <linux/memcontrol.h> 14 14 15 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 15 + #ifdef CONFIG_MEMCG_KMEM 16 16 static LIST_HEAD(list_lrus); 17 17 static DEFINE_MUTEX(list_lrus_mutex); 18 18 ··· 29 29 list_del(&lru->list); 30 30 mutex_unlock(&list_lrus_mutex); 31 31 } 32 - #else 33 - static void list_lru_register(struct list_lru *lru) 32 + 33 + static int lru_shrinker_id(struct list_lru *lru) 34 34 { 35 + return lru->shrinker_id; 35 36 } 36 37 37 - static void list_lru_unregister(struct list_lru *lru) 38 - { 39 - } 40 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 41 - 42 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 43 38 static inline bool list_lru_memcg_aware(struct list_lru *lru) 44 39 { 45 40 /* ··· 70 75 } 71 76 72 77 static inline struct list_lru_one * 73 - list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) 78 + list_lru_from_kmem(struct list_lru_node *nlru, void *ptr, 79 + struct mem_cgroup **memcg_ptr) 74 80 { 75 - struct mem_cgroup *memcg; 81 + struct list_lru_one *l = &nlru->lru; 82 + struct mem_cgroup *memcg = NULL; 76 83 77 84 if (!nlru->memcg_lrus) 78 - return &nlru->lru; 85 + goto out; 79 86 80 87 memcg = mem_cgroup_from_kmem(ptr); 81 88 if (!memcg) 82 - return &nlru->lru; 89 + goto out; 83 90 84 - return list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); 91 + l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); 92 + out: 93 + if (memcg_ptr) 94 + *memcg_ptr = memcg; 95 + return l; 85 96 } 86 97 #else 98 + static void list_lru_register(struct list_lru *lru) 99 + { 100 + } 101 + 102 + static void list_lru_unregister(struct list_lru *lru) 103 + { 104 + } 105 + 106 + static int lru_shrinker_id(struct list_lru *lru) 107 + { 108 + return -1; 109 + } 110 + 87 111 static inline bool list_lru_memcg_aware(struct list_lru *lru) 88 112 { 89 113 return false; ··· 115 101 } 116 102 117 103 static inline struct list_lru_one * 118 - list_lru_from_kmem(struct list_lru_node *nlru, void *ptr) 
104 + list_lru_from_kmem(struct list_lru_node *nlru, void *ptr, 105 + struct mem_cgroup **memcg_ptr) 119 106 { 107 + if (memcg_ptr) 108 + *memcg_ptr = NULL; 120 109 return &nlru->lru; 121 110 } 122 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 111 + #endif /* CONFIG_MEMCG_KMEM */ 123 112 124 113 bool list_lru_add(struct list_lru *lru, struct list_head *item) 125 114 { 126 115 int nid = page_to_nid(virt_to_page(item)); 127 116 struct list_lru_node *nlru = &lru->node[nid]; 117 + struct mem_cgroup *memcg; 128 118 struct list_lru_one *l; 129 119 130 120 spin_lock(&nlru->lock); 131 121 if (list_empty(item)) { 132 - l = list_lru_from_kmem(nlru, item); 122 + l = list_lru_from_kmem(nlru, item, &memcg); 133 123 list_add_tail(item, &l->list); 134 - l->nr_items++; 124 + /* Set shrinker bit if the first element was added */ 125 + if (!l->nr_items++) 126 + memcg_set_shrinker_bit(memcg, nid, 127 + lru_shrinker_id(lru)); 135 128 nlru->nr_items++; 136 129 spin_unlock(&nlru->lock); 137 130 return true; ··· 156 135 157 136 spin_lock(&nlru->lock); 158 137 if (!list_empty(item)) { 159 - l = list_lru_from_kmem(nlru, item); 138 + l = list_lru_from_kmem(nlru, item, NULL); 160 139 list_del_init(item); 161 140 l->nr_items--; 162 141 nlru->nr_items--; ··· 183 162 } 184 163 EXPORT_SYMBOL_GPL(list_lru_isolate_move); 185 164 186 - static unsigned long __list_lru_count_one(struct list_lru *lru, 187 - int nid, int memcg_idx) 165 + unsigned long list_lru_count_one(struct list_lru *lru, 166 + int nid, struct mem_cgroup *memcg) 188 167 { 189 168 struct list_lru_node *nlru = &lru->node[nid]; 190 169 struct list_lru_one *l; 191 170 unsigned long count; 192 171 193 172 rcu_read_lock(); 194 - l = list_lru_from_memcg_idx(nlru, memcg_idx); 173 + l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg)); 195 174 count = l->nr_items; 196 175 rcu_read_unlock(); 197 176 198 177 return count; 199 - } 200 - 201 - unsigned long list_lru_count_one(struct list_lru *lru, 202 - int nid, struct mem_cgroup *memcg) 203 - 
{ 204 - return __list_lru_count_one(lru, nid, memcg_cache_id(memcg)); 205 178 } 206 179 EXPORT_SYMBOL_GPL(list_lru_count_one); 207 180 ··· 209 194 EXPORT_SYMBOL_GPL(list_lru_count_node); 210 195 211 196 static unsigned long 212 - __list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx, 197 + __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx, 213 198 list_lru_walk_cb isolate, void *cb_arg, 214 199 unsigned long *nr_to_walk) 215 200 { 216 201 217 - struct list_lru_node *nlru = &lru->node[nid]; 218 202 struct list_lru_one *l; 219 203 struct list_head *item, *n; 220 204 unsigned long isolated = 0; 221 205 222 - spin_lock(&nlru->lock); 223 206 l = list_lru_from_memcg_idx(nlru, memcg_idx); 224 207 restart: 225 208 list_for_each_safe(item, n, &l->list) { ··· 263 250 BUG(); 264 251 } 265 252 } 266 - 267 - spin_unlock(&nlru->lock); 268 253 return isolated; 269 254 } 270 255 ··· 271 260 list_lru_walk_cb isolate, void *cb_arg, 272 261 unsigned long *nr_to_walk) 273 262 { 274 - return __list_lru_walk_one(lru, nid, memcg_cache_id(memcg), 275 - isolate, cb_arg, nr_to_walk); 263 + struct list_lru_node *nlru = &lru->node[nid]; 264 + unsigned long ret; 265 + 266 + spin_lock(&nlru->lock); 267 + ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg, 268 + nr_to_walk); 269 + spin_unlock(&nlru->lock); 270 + return ret; 276 271 } 277 272 EXPORT_SYMBOL_GPL(list_lru_walk_one); 273 + 274 + unsigned long 275 + list_lru_walk_one_irq(struct list_lru *lru, int nid, struct mem_cgroup *memcg, 276 + list_lru_walk_cb isolate, void *cb_arg, 277 + unsigned long *nr_to_walk) 278 + { 279 + struct list_lru_node *nlru = &lru->node[nid]; 280 + unsigned long ret; 281 + 282 + spin_lock_irq(&nlru->lock); 283 + ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg, 284 + nr_to_walk); 285 + spin_unlock_irq(&nlru->lock); 286 + return ret; 287 + } 278 288 279 289 unsigned long list_lru_walk_node(struct list_lru *lru, int nid, 280 290 list_lru_walk_cb 
isolate, void *cb_arg, ··· 304 272 long isolated = 0; 305 273 int memcg_idx; 306 274 307 - isolated += __list_lru_walk_one(lru, nid, -1, isolate, cb_arg, 308 - nr_to_walk); 275 + isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg, 276 + nr_to_walk); 309 277 if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) { 310 278 for_each_memcg_cache_index(memcg_idx) { 311 - isolated += __list_lru_walk_one(lru, nid, memcg_idx, 312 - isolate, cb_arg, nr_to_walk); 279 + struct list_lru_node *nlru = &lru->node[nid]; 280 + 281 + spin_lock(&nlru->lock); 282 + isolated += __list_lru_walk_one(nlru, memcg_idx, 283 + isolate, cb_arg, 284 + nr_to_walk); 285 + spin_unlock(&nlru->lock); 286 + 313 287 if (*nr_to_walk <= 0) 314 288 break; 315 289 } ··· 330 292 l->nr_items = 0; 331 293 } 332 294 333 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 295 + #ifdef CONFIG_MEMCG_KMEM 334 296 static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus, 335 297 int begin, int end) 336 298 { ··· 538 500 goto out; 539 501 } 540 502 541 - static void memcg_drain_list_lru_node(struct list_lru_node *nlru, 542 - int src_idx, int dst_idx) 503 + static void memcg_drain_list_lru_node(struct list_lru *lru, int nid, 504 + int src_idx, struct mem_cgroup *dst_memcg) 543 505 { 506 + struct list_lru_node *nlru = &lru->node[nid]; 507 + int dst_idx = dst_memcg->kmemcg_id; 544 508 struct list_lru_one *src, *dst; 509 + bool set; 545 510 546 511 /* 547 512 * Since list_lru_{add,del} may be called under an IRQ-safe lock, ··· 556 515 dst = list_lru_from_memcg_idx(nlru, dst_idx); 557 516 558 517 list_splice_init(&src->list, &dst->list); 518 + set = (!dst->nr_items && src->nr_items); 559 519 dst->nr_items += src->nr_items; 520 + if (set) 521 + memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru)); 560 522 src->nr_items = 0; 561 523 562 524 spin_unlock_irq(&nlru->lock); 563 525 } 564 526 565 527 static void memcg_drain_list_lru(struct list_lru *lru, 566 - int src_idx, int dst_idx) 528 + 
int src_idx, struct mem_cgroup *dst_memcg) 567 529 { 568 530 int i; 569 531 ··· 574 530 return; 575 531 576 532 for_each_node(i) 577 - memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx); 533 + memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg); 578 534 } 579 535 580 - void memcg_drain_all_list_lrus(int src_idx, int dst_idx) 536 + void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg) 581 537 { 582 538 struct list_lru *lru; 583 539 584 540 mutex_lock(&list_lrus_mutex); 585 541 list_for_each_entry(lru, &list_lrus, list) 586 - memcg_drain_list_lru(lru, src_idx, dst_idx); 542 + memcg_drain_list_lru(lru, src_idx, dst_memcg); 587 543 mutex_unlock(&list_lrus_mutex); 588 544 } 589 545 #else ··· 595 551 static void memcg_destroy_list_lru(struct list_lru *lru) 596 552 { 597 553 } 598 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 554 + #endif /* CONFIG_MEMCG_KMEM */ 599 555 600 556 int __list_lru_init(struct list_lru *lru, bool memcg_aware, 601 - struct lock_class_key *key) 557 + struct lock_class_key *key, struct shrinker *shrinker) 602 558 { 603 559 int i; 604 560 size_t size = sizeof(*lru->node) * nr_node_ids; 605 561 int err = -ENOMEM; 606 562 563 + #ifdef CONFIG_MEMCG_KMEM 564 + if (shrinker) 565 + lru->shrinker_id = shrinker->id; 566 + else 567 + lru->shrinker_id = -1; 568 + #endif 607 569 memcg_get_cache_ids(); 608 570 609 571 lru->node = kzalloc(size, GFP_KERNEL); ··· 652 602 kfree(lru->node); 653 603 lru->node = NULL; 654 604 605 + #ifdef CONFIG_MEMCG_KMEM 606 + lru->shrinker_id = -1; 607 + #endif 655 608 memcg_put_cache_ids(); 656 609 } 657 610 EXPORT_SYMBOL_GPL(list_lru_destroy);
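The list_lru diff above makes list_lru_add() report the empty-to-non-empty transition to the memcg shrinker infrastructure, so shrink_slab() can skip memcgs whose LRUs are known to be empty. A minimal userspace sketch of just that transition pattern (all names are hypothetical mocks, not the kernel API):

```c
#include <assert.h>

/* Mock of the pattern added to list_lru_add(): the shrinker bit is set
 * only when the list goes from empty to non-empty; it is cleared lazily
 * later, when a shrinker walk finds the list empty again. */

static unsigned long shrinker_map;	/* one bit per shrinker id */
static long nr_items;			/* items on the mock LRU */

static void mock_memcg_set_shrinker_bit(int shrinker_id)
{
	shrinker_map |= 1UL << shrinker_id;
}

static void mock_list_lru_add(int shrinker_id)
{
	/* post-increment: true only on the 0 -> 1 transition */
	if (!nr_items++)
		mock_memcg_set_shrinker_bit(shrinker_id);
}

static void mock_list_lru_del(void)
{
	nr_items--;	/* no bit clearing here, by design */
}
```

Repeated adds touch the bitmap only once per empty-to-non-empty transition, which is why the kernel version can afford to do it under nlru->lock.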
+23 -23
mm/memblock.c
···
392 392 {
393 393 	struct memblock_region *new_array, *old_array;
394 394 	phys_addr_t old_alloc_size, new_alloc_size;
395 -	phys_addr_t old_size, new_size, addr;
395 +	phys_addr_t old_size, new_size, addr, new_end;
396 396 	int use_slab = slab_is_available();
397 397 	int *in_slab;
398 398 
···
453 453 		return -1;
454 454 	}
455 455 
456 -	memblock_dbg("memblock: %s is doubled to %ld at [%#010llx-%#010llx]",
457 -			type->name, type->max * 2, (u64)addr,
458 -			(u64)addr + new_size - 1);
456 +	new_end = addr + new_size - 1;
457 +	memblock_dbg("memblock: %s is doubled to %ld at [%pa-%pa]",
458 +			type->name, type->max * 2, &addr, &new_end);
459 459 
460 460 	/*
461 461 	 * Found space, we now need to move the array over before we add the
···
1438 1438 {
1439 1439 	void *ptr;
1440 1440 
1441 -	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
1442 -		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
1443 -		     (u64)max_addr, (void *)_RET_IP_);
1441 +	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pF\n",
1442 +		     __func__, (u64)size, (u64)align, nid, &min_addr,
1443 +		     &max_addr, (void *)_RET_IP_);
1444 1444 
1445 1445 	ptr = memblock_virt_alloc_internal(size, align,
1446 1446 					   min_addr, max_addr, nid);
···
1475 1475 {
1476 1476 	void *ptr;
1477 1477 
1478 -	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
1479 -		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
1480 -		     (u64)max_addr, (void *)_RET_IP_);
1478 +	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pF\n",
1479 +		     __func__, (u64)size, (u64)align, nid, &min_addr,
1480 +		     &max_addr, (void *)_RET_IP_);
1481 1481 
1482 1482 	ptr = memblock_virt_alloc_internal(size, align,
1483 1483 					   min_addr, max_addr, nid);
···
1511 1511 {
1512 1512 	void *ptr;
1513 1513 
1514 -	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
1515 -		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
1516 -		     (u64)max_addr, (void *)_RET_IP_);
1514 +	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pF\n",
1515 +		     __func__, (u64)size, (u64)align, nid, &min_addr,
1516 +		     &max_addr, (void *)_RET_IP_);
1517 1517 	ptr = memblock_virt_alloc_internal(size, align,
1518 1518 					   min_addr, max_addr, nid);
···
1521 1521 		return ptr;
1522 1522 	}
1523 1523 
1524 -	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx\n",
1525 -	      __func__, (u64)size, (u64)align, nid, (u64)min_addr,
1526 -	      (u64)max_addr);
1524 +	panic("%s: Failed to allocate %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa\n",
1525 +	      __func__, (u64)size, (u64)align, nid, &min_addr, &max_addr);
1527 1526 	return NULL;
1528 1527 }
1529 1528 #endif
···
1537 1538  */
1538 1539 void __init __memblock_free_early(phys_addr_t base, phys_addr_t size)
1539 1540 {
1540 -	memblock_dbg("%s: [%#016llx-%#016llx] %pF\n",
1541 -		     __func__, (u64)base, (u64)base + size - 1,
1542 -		     (void *)_RET_IP_);
1541 +	phys_addr_t end = base + size - 1;
1542 +
1543 +	memblock_dbg("%s: [%pa-%pa] %pF\n",
1544 +		     __func__, &base, &end, (void *)_RET_IP_);
1543 1545 	kmemleak_free_part_phys(base, size);
1544 1546 	memblock_remove_range(&memblock.reserved, base, size);
1545 1547 }
···
1556 1556  */
1557 1557 void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
1558 1558 {
1559 -	u64 cursor, end;
1559 +	phys_addr_t cursor, end;
1560 1560 
1561 -	memblock_dbg("%s: [%#016llx-%#016llx] %pF\n",
1562 -		     __func__, (u64)base, (u64)base + size - 1,
1563 -		     (void *)_RET_IP_);
1561 +	end = base + size - 1;
1562 +	memblock_dbg("%s: [%pa-%pa] %pF\n",
1563 +		     __func__, &base, &end, (void *)_RET_IP_);
1564 1564 	kmemleak_free_part_phys(base, size);
1565 1565 	cursor = PFN_UP(base);
1566 1566 	end = PFN_DOWN(base + size);
+278 -48
mm/memcontrol.c
··· 233 233 /* Used for OOM nofiier */ 234 234 #define OOM_CONTROL (0) 235 235 236 + /* 237 + * Iteration constructs for visiting all cgroups (under a tree). If 238 + * loops are exited prematurely (break), mem_cgroup_iter_break() must 239 + * be used for reference counting. 240 + */ 241 + #define for_each_mem_cgroup_tree(iter, root) \ 242 + for (iter = mem_cgroup_iter(root, NULL, NULL); \ 243 + iter != NULL; \ 244 + iter = mem_cgroup_iter(root, iter, NULL)) 245 + 246 + #define for_each_mem_cgroup(iter) \ 247 + for (iter = mem_cgroup_iter(NULL, NULL, NULL); \ 248 + iter != NULL; \ 249 + iter = mem_cgroup_iter(NULL, iter, NULL)) 250 + 236 251 /* Some nice accessors for the vmpressure. */ 237 252 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg) 238 253 { ··· 261 246 return &container_of(vmpr, struct mem_cgroup, vmpressure)->css; 262 247 } 263 248 264 - static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) 265 - { 266 - return (memcg == root_mem_cgroup); 267 - } 268 - 269 - #ifndef CONFIG_SLOB 249 + #ifdef CONFIG_MEMCG_KMEM 270 250 /* 271 251 * This will be the memcg's index in each cache's ->memcg_params.memcg_caches. 
272 252 * The main reason for not using cgroup id for this: ··· 315 305 316 306 struct workqueue_struct *memcg_kmem_cache_wq; 317 307 318 - #endif /* !CONFIG_SLOB */ 308 + static int memcg_shrinker_map_size; 309 + static DEFINE_MUTEX(memcg_shrinker_map_mutex); 310 + 311 + static void memcg_free_shrinker_map_rcu(struct rcu_head *head) 312 + { 313 + kvfree(container_of(head, struct memcg_shrinker_map, rcu)); 314 + } 315 + 316 + static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg, 317 + int size, int old_size) 318 + { 319 + struct memcg_shrinker_map *new, *old; 320 + int nid; 321 + 322 + lockdep_assert_held(&memcg_shrinker_map_mutex); 323 + 324 + for_each_node(nid) { 325 + old = rcu_dereference_protected( 326 + mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true); 327 + /* Not yet online memcg */ 328 + if (!old) 329 + return 0; 330 + 331 + new = kvmalloc(sizeof(*new) + size, GFP_KERNEL); 332 + if (!new) 333 + return -ENOMEM; 334 + 335 + /* Set all old bits, clear all new bits */ 336 + memset(new->map, (int)0xff, old_size); 337 + memset((void *)new->map + old_size, 0, size - old_size); 338 + 339 + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new); 340 + call_rcu(&old->rcu, memcg_free_shrinker_map_rcu); 341 + } 342 + 343 + return 0; 344 + } 345 + 346 + static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) 347 + { 348 + struct mem_cgroup_per_node *pn; 349 + struct memcg_shrinker_map *map; 350 + int nid; 351 + 352 + if (mem_cgroup_is_root(memcg)) 353 + return; 354 + 355 + for_each_node(nid) { 356 + pn = mem_cgroup_nodeinfo(memcg, nid); 357 + map = rcu_dereference_protected(pn->shrinker_map, true); 358 + if (map) 359 + kvfree(map); 360 + rcu_assign_pointer(pn->shrinker_map, NULL); 361 + } 362 + } 363 + 364 + static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg) 365 + { 366 + struct memcg_shrinker_map *map; 367 + int nid, size, ret = 0; 368 + 369 + if (mem_cgroup_is_root(memcg)) 370 + return 0; 371 + 372 + 
mutex_lock(&memcg_shrinker_map_mutex); 373 + size = memcg_shrinker_map_size; 374 + for_each_node(nid) { 375 + map = kvzalloc(sizeof(*map) + size, GFP_KERNEL); 376 + if (!map) { 377 + memcg_free_shrinker_maps(memcg); 378 + ret = -ENOMEM; 379 + break; 380 + } 381 + rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map); 382 + } 383 + mutex_unlock(&memcg_shrinker_map_mutex); 384 + 385 + return ret; 386 + } 387 + 388 + int memcg_expand_shrinker_maps(int new_id) 389 + { 390 + int size, old_size, ret = 0; 391 + struct mem_cgroup *memcg; 392 + 393 + size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long); 394 + old_size = memcg_shrinker_map_size; 395 + if (size <= old_size) 396 + return 0; 397 + 398 + mutex_lock(&memcg_shrinker_map_mutex); 399 + if (!root_mem_cgroup) 400 + goto unlock; 401 + 402 + for_each_mem_cgroup(memcg) { 403 + if (mem_cgroup_is_root(memcg)) 404 + continue; 405 + ret = memcg_expand_one_shrinker_map(memcg, size, old_size); 406 + if (ret) 407 + goto unlock; 408 + } 409 + unlock: 410 + if (!ret) 411 + memcg_shrinker_map_size = size; 412 + mutex_unlock(&memcg_shrinker_map_mutex); 413 + return ret; 414 + } 415 + 416 + void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) 417 + { 418 + if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) { 419 + struct memcg_shrinker_map *map; 420 + 421 + rcu_read_lock(); 422 + map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map); 423 + /* Pairs with smp mb in shrink_slab() */ 424 + smp_mb__before_atomic(); 425 + set_bit(shrinker_id, map->map); 426 + rcu_read_unlock(); 427 + } 428 + } 429 + 430 + #else /* CONFIG_MEMCG_KMEM */ 431 + static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg) 432 + { 433 + return 0; 434 + } 435 + static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { } 436 + #endif /* CONFIG_MEMCG_KMEM */ 319 437 320 438 /** 321 439 * mem_cgroup_css_from_page - css of the memcg associated with a page ··· 816 678 } 817 679 
EXPORT_SYMBOL(mem_cgroup_from_task); 818 680 819 - static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) 681 + /** 682 + * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg. 683 + * @mm: mm from which memcg should be extracted. It can be NULL. 684 + * 685 + * Obtain a reference on mm->memcg and returns it if successful. Otherwise 686 + * root_mem_cgroup is returned. However if mem_cgroup is disabled, NULL is 687 + * returned. 688 + */ 689 + struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) 820 690 { 821 - struct mem_cgroup *memcg = NULL; 691 + struct mem_cgroup *memcg; 692 + 693 + if (mem_cgroup_disabled()) 694 + return NULL; 822 695 823 696 rcu_read_lock(); 824 697 do { ··· 848 699 } while (!css_tryget_online(&memcg->css)); 849 700 rcu_read_unlock(); 850 701 return memcg; 702 + } 703 + EXPORT_SYMBOL(get_mem_cgroup_from_mm); 704 + 705 + /** 706 + * get_mem_cgroup_from_page: Obtain a reference on given page's memcg. 707 + * @page: page from which memcg should be extracted. 708 + * 709 + * Obtain a reference on page->memcg and returns it if successful. Otherwise 710 + * root_mem_cgroup is returned. 711 + */ 712 + struct mem_cgroup *get_mem_cgroup_from_page(struct page *page) 713 + { 714 + struct mem_cgroup *memcg = page->mem_cgroup; 715 + 716 + if (mem_cgroup_disabled()) 717 + return NULL; 718 + 719 + rcu_read_lock(); 720 + if (!memcg || !css_tryget_online(&memcg->css)) 721 + memcg = root_mem_cgroup; 722 + rcu_read_unlock(); 723 + return memcg; 724 + } 725 + EXPORT_SYMBOL(get_mem_cgroup_from_page); 726 + 727 + /** 728 + * If current->active_memcg is non-NULL, do not fallback to current->mm->memcg. 
729 + */ 730 + static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void) 731 + { 732 + if (unlikely(current->active_memcg)) { 733 + struct mem_cgroup *memcg = root_mem_cgroup; 734 + 735 + rcu_read_lock(); 736 + if (css_tryget_online(&current->active_memcg->css)) 737 + memcg = current->active_memcg; 738 + rcu_read_unlock(); 739 + return memcg; 740 + } 741 + return get_mem_cgroup_from_mm(current->mm); 851 742 } 852 743 853 744 /** ··· 1050 861 } 1051 862 } 1052 863 } 1053 - 1054 - /* 1055 - * Iteration constructs for visiting all cgroups (under a tree). If 1056 - * loops are exited prematurely (break), mem_cgroup_iter_break() must 1057 - * be used for reference counting. 1058 - */ 1059 - #define for_each_mem_cgroup_tree(iter, root) \ 1060 - for (iter = mem_cgroup_iter(root, NULL, NULL); \ 1061 - iter != NULL; \ 1062 - iter = mem_cgroup_iter(root, iter, NULL)) 1063 - 1064 - #define for_each_mem_cgroup(iter) \ 1065 - for (iter = mem_cgroup_iter(NULL, NULL, NULL); \ 1066 - iter != NULL; \ 1067 - iter = mem_cgroup_iter(NULL, iter, NULL)) 1068 864 1069 865 /** 1070 866 * mem_cgroup_scan_tasks - iterate over tasks of a memory cgroup hierarchy ··· 1657 1483 __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg); 1658 1484 } 1659 1485 1660 - static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) 1486 + enum oom_status { 1487 + OOM_SUCCESS, 1488 + OOM_FAILED, 1489 + OOM_ASYNC, 1490 + OOM_SKIPPED 1491 + }; 1492 + 1493 + static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) 1661 1494 { 1662 - if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER) 1663 - return; 1495 + if (order > PAGE_ALLOC_COSTLY_ORDER) 1496 + return OOM_SKIPPED; 1497 + 1664 1498 /* 1665 1499 * We are in the middle of the charge context here, so we 1666 1500 * don't want to block when potentially sitting on a callstack 1667 1501 * that holds all kinds of filesystem and mm locks. 
1668 1502 * 1669 - * Also, the caller may handle a failed allocation gracefully 1670 - * (like optional page cache readahead) and so an OOM killer 1671 - * invocation might not even be necessary. 1503 + * cgroup1 allows disabling the OOM killer and waiting for outside 1504 + * handling until the charge can succeed; remember the context and put 1505 + * the task to sleep at the end of the page fault when all locks are 1506 + * released. 1672 1507 * 1673 - * That's why we don't do anything here except remember the 1674 - * OOM context and then deal with it at the end of the page 1675 - * fault when the stack is unwound, the locks are released, 1676 - * and when we know whether the fault was overall successful. 1508 + * On the other hand, in-kernel OOM killer allows for an async victim 1509 + * memory reclaim (oom_reaper) and that means that we are not solely 1510 + * relying on the oom victim to make a forward progress and we can 1511 + * invoke the oom killer here. 1512 + * 1513 + * Please note that mem_cgroup_out_of_memory might fail to find a 1514 + * victim and then we have to bail out from the charge path. 1677 1515 */ 1678 - css_get(&memcg->css); 1679 - current->memcg_in_oom = memcg; 1680 - current->memcg_oom_gfp_mask = mask; 1681 - current->memcg_oom_order = order; 1516 + if (memcg->oom_kill_disable) { 1517 + if (!current->in_user_fault) 1518 + return OOM_SKIPPED; 1519 + css_get(&memcg->css); 1520 + current->memcg_in_oom = memcg; 1521 + current->memcg_oom_gfp_mask = mask; 1522 + current->memcg_oom_order = order; 1523 + 1524 + return OOM_ASYNC; 1525 + } 1526 + 1527 + if (mem_cgroup_out_of_memory(memcg, mask, order)) 1528 + return OOM_SUCCESS; 1529 + 1530 + WARN(1,"Memory cgroup charge failed because of no reclaimable memory! 
" 1531 + "This looks like a misconfiguration or a kernel bug."); 1532 + return OOM_FAILED; 1682 1533 } 1683 1534 1684 1535 /** ··· 2098 1899 unsigned long nr_reclaimed; 2099 1900 bool may_swap = true; 2100 1901 bool drained = false; 1902 + bool oomed = false; 1903 + enum oom_status oom_status; 2101 1904 2102 1905 if (mem_cgroup_is_root(memcg)) 2103 1906 return 0; ··· 2187 1986 if (nr_retries--) 2188 1987 goto retry; 2189 1988 1989 + if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed) 1990 + goto nomem; 1991 + 2190 1992 if (gfp_mask & __GFP_NOFAIL) 2191 1993 goto force; 2192 1994 ··· 2198 1994 2199 1995 memcg_memory_event(mem_over_limit, MEMCG_OOM); 2200 1996 2201 - mem_cgroup_oom(mem_over_limit, gfp_mask, 1997 + /* 1998 + * keep retrying as long as the memcg oom killer is able to make 1999 + * a forward progress or bypass the charge if the oom killer 2000 + * couldn't make any progress. 2001 + */ 2002 + oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, 2202 2003 get_order(nr_pages * PAGE_SIZE)); 2004 + switch (oom_status) { 2005 + case OOM_SUCCESS: 2006 + nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 2007 + oomed = true; 2008 + goto retry; 2009 + case OOM_FAILED: 2010 + goto force; 2011 + default: 2012 + goto nomem; 2013 + } 2203 2014 nomem: 2204 2015 if (!(gfp_mask & __GFP_NOFAIL)) 2205 2016 return -ENOMEM; ··· 2338 2119 unlock_page_lru(page, isolated); 2339 2120 } 2340 2121 2341 - #ifndef CONFIG_SLOB 2122 + #ifdef CONFIG_MEMCG_KMEM 2342 2123 static int memcg_alloc_cache_id(void) 2343 2124 { 2344 2125 int id, size; ··· 2480 2261 if (current->memcg_kmem_skip_account) 2481 2262 return cachep; 2482 2263 2483 - memcg = get_mem_cgroup_from_mm(current->mm); 2264 + memcg = get_mem_cgroup_from_current(); 2484 2265 kmemcg_id = READ_ONCE(memcg->kmemcg_id); 2485 2266 if (kmemcg_id < 0) 2486 2267 goto out; ··· 2564 2345 if (memcg_kmem_bypass()) 2565 2346 return 0; 2566 2347 2567 - memcg = get_mem_cgroup_from_mm(current->mm); 2348 + memcg = get_mem_cgroup_from_current(); 2568 2349 if 
(!mem_cgroup_is_root(memcg)) { 2569 2350 ret = memcg_kmem_charge_memcg(page, gfp, order, memcg); 2570 2351 if (!ret) ··· 2603 2384 2604 2385 css_put_many(&memcg->css, nr_pages); 2605 2386 } 2606 - #endif /* !CONFIG_SLOB */ 2387 + #endif /* CONFIG_MEMCG_KMEM */ 2607 2388 2608 2389 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 2609 2390 ··· 2998 2779 } 2999 2780 } 3000 2781 3001 - #ifndef CONFIG_SLOB 2782 + #ifdef CONFIG_MEMCG_KMEM 3002 2783 static int memcg_online_kmem(struct mem_cgroup *memcg) 3003 2784 { 3004 2785 int memcg_id; ··· 3070 2851 } 3071 2852 rcu_read_unlock(); 3072 2853 3073 - memcg_drain_all_list_lrus(kmemcg_id, parent->kmemcg_id); 2854 + memcg_drain_all_list_lrus(kmemcg_id, parent); 3074 2855 3075 2856 memcg_free_cache_id(kmemcg_id); 3076 2857 } ··· 3098 2879 static void memcg_free_kmem(struct mem_cgroup *memcg) 3099 2880 { 3100 2881 } 3101 - #endif /* !CONFIG_SLOB */ 2882 + #endif /* CONFIG_MEMCG_KMEM */ 3102 2883 3103 2884 static int memcg_update_kmem_max(struct mem_cgroup *memcg, 3104 2885 unsigned long max) ··· 4402 4183 INIT_LIST_HEAD(&memcg->event_list); 4403 4184 spin_lock_init(&memcg->event_list_lock); 4404 4185 memcg->socket_pressure = jiffies; 4405 - #ifndef CONFIG_SLOB 4186 + #ifdef CONFIG_MEMCG_KMEM 4406 4187 memcg->kmemcg_id = -1; 4407 4188 #endif 4408 4189 #ifdef CONFIG_CGROUP_WRITEBACK ··· 4479 4260 { 4480 4261 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4481 4262 4263 + /* 4264 + * A memcg must be visible for memcg_expand_shrinker_maps() 4265 + * by the time the maps are allocated. So, we allocate maps 4266 + * here, when for_each_mem_cgroup() can't skip it. 
4267 + */ 4268 + if (memcg_alloc_shrinker_maps(memcg)) { 4269 + mem_cgroup_id_remove(memcg); 4270 + return -ENOMEM; 4271 + } 4272 + 4482 4273 /* Online state pins memcg ID, memcg ID pins CSS */ 4483 4274 atomic_set(&memcg->id.ref, 1); 4484 4275 css_get(css); ··· 4541 4312 vmpressure_cleanup(&memcg->vmpressure); 4542 4313 cancel_work_sync(&memcg->high_work); 4543 4314 mem_cgroup_remove_from_trees(memcg); 4315 + memcg_free_shrinker_maps(memcg); 4544 4316 memcg_free_kmem(memcg); 4545 4317 mem_cgroup_free(memcg); 4546 4318 } ··· 6253 6023 { 6254 6024 int cpu, node; 6255 6025 6256 - #ifndef CONFIG_SLOB 6026 + #ifdef CONFIG_MEMCG_KMEM 6257 6027 /* 6258 6028 * Kmem cache creation is mostly done with the slab_mutex held, 6259 6029 * so use a workqueue with limited concurrency to avoid stalling
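The memcontrol diff above has mem_cgroup_oom() return an explicit oom_status instead of always deferring to the end of the page fault, and try_charge() now reacts to it: retry on progress, force the charge when the killer found no victim, bail out otherwise. A hypothetical userspace mock of that control flow (the names, retry budget, and page accounting here are invented for illustration, not the kernel's):

```c
#include <assert.h>
#include <errno.h>

/* Mock of the reworked charge tail: mem_cgroup_oom() reports whether
 * the in-kernel OOM killer made forward progress, and the charge path
 * retries, forces, or fails accordingly. */

enum oom_status { OOM_SUCCESS, OOM_FAILED, OOM_ASYNC, OOM_SKIPPED };

static long free_pages;		/* headroom below the mock limit */
static long victim_pages;	/* what killing a victim would release */
static int forced;		/* charges pushed through regardless */

static enum oom_status mock_mem_cgroup_oom(void)
{
	if (!victim_pages)
		return OOM_FAILED;	/* no reclaimable victim found */
	free_pages += victim_pages;	/* victim killed, memory returned */
	victim_pages = 0;
	return OOM_SUCCESS;
}

/* Returns 0 on success, -ENOMEM on failure, loosely mirroring try_charge() */
static int mock_try_charge(long nr_pages)
{
	int nr_retries = 5;

retry:
	if (free_pages >= nr_pages) {
		free_pages -= nr_pages;
		return 0;
	}
	if (nr_retries-- < 0)
		return -ENOMEM;
	switch (mock_mem_cgroup_oom()) {
	case OOM_SUCCESS:
		goto retry;	/* forward progress was made: try again */
	case OOM_FAILED:
		forced = 1;	/* mirror "goto force": charge anyway */
		return 0;
	default:
		return -ENOMEM;	/* OOM_ASYNC/OOM_SKIPPED: fail the charge */
	}
}
```

The point of the kernel change is visible even in the mock: an OOM kill that frees memory leads to a successful retry rather than a deferred fault-time kill.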
+93 -61
mm/memory.c
··· 859 859 return NULL; 860 860 } 861 861 } 862 + 863 + if (pte_devmap(pte)) 864 + return NULL; 865 + 862 866 print_bad_pte(vma, addr, pte, NULL); 863 867 return NULL; 864 868 } ··· 927 923 } 928 924 } 929 925 926 + if (pmd_devmap(pmd)) 927 + return NULL; 930 928 if (is_zero_pfn(pfn)) 931 929 return NULL; 932 930 if (unlikely(pfn > highest_memmap_pfn)) ··· 1613 1607 tlb_gather_mmu(&tlb, mm, start, end); 1614 1608 update_hiwater_rss(mm); 1615 1609 mmu_notifier_invalidate_range_start(mm, start, end); 1616 - for ( ; vma && vma->vm_start < end; vma = vma->vm_next) { 1610 + for ( ; vma && vma->vm_start < end; vma = vma->vm_next) 1617 1611 unmap_single_vma(&tlb, vma, start, end, NULL); 1618 - 1619 - /* 1620 - * zap_page_range does not specify whether mmap_sem should be 1621 - * held for read or write. That allows parallel zap_page_range 1622 - * operations to unmap a PTE and defer a flush meaning that 1623 - * this call observes pte_none and fails to flush the TLB. 1624 - * Rather than adding a complex API, ensure that no stale 1625 - * TLB entries exist when this call returns. 1626 - */ 1627 - flush_tlb_range(vma, start, end); 1628 - } 1629 - 1630 1612 mmu_notifier_invalidate_range_end(mm, start, end); 1631 1613 tlb_finish_mmu(&tlb, start, end); 1632 1614 } ··· 3388 3394 if (write) 3389 3395 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); 3390 3396 3391 - add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR); 3397 + add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); 3392 3398 page_add_file_rmap(page, true); 3393 3399 /* 3394 3400 * deposit and withdraw with pmd lock held ··· 4141 4147 * space. Kernel faults are handled more gracefully. 
4142 4148 */ 4143 4149 if (flags & FAULT_FLAG_USER) 4144 - mem_cgroup_oom_enable(); 4150 + mem_cgroup_enter_user_fault(); 4145 4151 4146 4152 if (unlikely(is_vm_hugetlb_page(vma))) 4147 4153 ret = hugetlb_fault(vma->vm_mm, vma, address, flags); ··· 4149 4155 ret = __handle_mm_fault(vma, address, flags); 4150 4156 4151 4157 if (flags & FAULT_FLAG_USER) { 4152 - mem_cgroup_oom_disable(); 4158 + mem_cgroup_exit_user_fault(); 4153 4159 /* 4154 4160 * The task may have entered a memcg OOM situation but 4155 4161 * if the allocation error was handled gracefully (no ··· 4587 4593 #endif 4588 4594 4589 4595 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS) 4596 + /* 4597 + * Process all subpages of the specified huge page with the specified 4598 + * operation. The target subpage will be processed last to keep its 4599 + * cache lines hot. 4600 + */ 4601 + static inline void process_huge_page( 4602 + unsigned long addr_hint, unsigned int pages_per_huge_page, 4603 + void (*process_subpage)(unsigned long addr, int idx, void *arg), 4604 + void *arg) 4605 + { 4606 + int i, n, base, l; 4607 + unsigned long addr = addr_hint & 4608 + ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); 4609 + 4610 + /* Process target subpage last to keep its cache lines hot */ 4611 + might_sleep(); 4612 + n = (addr_hint - addr) / PAGE_SIZE; 4613 + if (2 * n <= pages_per_huge_page) { 4614 + /* If target subpage in first half of huge page */ 4615 + base = 0; 4616 + l = n; 4617 + /* Process subpages at the end of huge page */ 4618 + for (i = pages_per_huge_page - 1; i >= 2 * n; i--) { 4619 + cond_resched(); 4620 + process_subpage(addr + i * PAGE_SIZE, i, arg); 4621 + } 4622 + } else { 4623 + /* If target subpage in second half of huge page */ 4624 + base = pages_per_huge_page - 2 * (pages_per_huge_page - n); 4625 + l = pages_per_huge_page - n; 4626 + /* Process subpages at the begin of huge page */ 4627 + for (i = 0; i < base; i++) { 4628 + cond_resched(); 4629 + 
process_subpage(addr + i * PAGE_SIZE, i, arg); 4630 + } 4631 + } 4632 + /* 4633 + * Process remaining subpages in left-right-left-right pattern 4634 + * towards the target subpage 4635 + */ 4636 + for (i = 0; i < l; i++) { 4637 + int left_idx = base + i; 4638 + int right_idx = base + 2 * l - 1 - i; 4639 + 4640 + cond_resched(); 4641 + process_subpage(addr + left_idx * PAGE_SIZE, left_idx, arg); 4642 + cond_resched(); 4643 + process_subpage(addr + right_idx * PAGE_SIZE, right_idx, arg); 4644 + } 4645 + } 4646 + 4590 4647 static void clear_gigantic_page(struct page *page, 4591 4648 unsigned long addr, 4592 4649 unsigned int pages_per_huge_page) ··· 4652 4607 clear_user_highpage(p, addr + i * PAGE_SIZE); 4653 4608 } 4654 4609 } 4610 + 4611 + static void clear_subpage(unsigned long addr, int idx, void *arg) 4612 + { 4613 + struct page *page = arg; 4614 + 4615 + clear_user_highpage(page + idx, addr); 4616 + } 4617 + 4655 4618 void clear_huge_page(struct page *page, 4656 4619 unsigned long addr_hint, unsigned int pages_per_huge_page) 4657 4620 { 4658 - int i, n, base, l; 4659 4621 unsigned long addr = addr_hint & 4660 4622 ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); 4661 4623 ··· 4671 4619 return; 4672 4620 } 4673 4621 4674 - /* Clear sub-page to access last to keep its cache lines hot */ 4675 - might_sleep(); 4676 - n = (addr_hint - addr) / PAGE_SIZE; 4677 - if (2 * n <= pages_per_huge_page) { 4678 - /* If sub-page to access in first half of huge page */ 4679 - base = 0; 4680 - l = n; 4681 - /* Clear sub-pages at the end of huge page */ 4682 - for (i = pages_per_huge_page - 1; i >= 2 * n; i--) { 4683 - cond_resched(); 4684 - clear_user_highpage(page + i, addr + i * PAGE_SIZE); 4685 - } 4686 - } else { 4687 - /* If sub-page to access in second half of huge page */ 4688 - base = pages_per_huge_page - 2 * (pages_per_huge_page - n); 4689 - l = pages_per_huge_page - n; 4690 - /* Clear sub-pages at the begin of huge page */ 4691 - for (i = 0; i < base; i++) { 
4692 - cond_resched(); 4693 - clear_user_highpage(page + i, addr + i * PAGE_SIZE); 4694 - } 4695 - } 4696 - /* 4697 - * Clear remaining sub-pages in left-right-left-right pattern 4698 - * towards the sub-page to access 4699 - */ 4700 - for (i = 0; i < l; i++) { 4701 - int left_idx = base + i; 4702 - int right_idx = base + 2 * l - 1 - i; 4703 - 4704 - cond_resched(); 4705 - clear_user_highpage(page + left_idx, 4706 - addr + left_idx * PAGE_SIZE); 4707 - cond_resched(); 4708 - clear_user_highpage(page + right_idx, 4709 - addr + right_idx * PAGE_SIZE); 4710 - } 4622 + process_huge_page(addr_hint, pages_per_huge_page, clear_subpage, page); 4711 4623 } 4712 4624 4713 4625 static void copy_user_gigantic_page(struct page *dst, struct page *src, ··· 4693 4677 } 4694 4678 } 4695 4679 4680 + struct copy_subpage_arg { 4681 + struct page *dst; 4682 + struct page *src; 4683 + struct vm_area_struct *vma; 4684 + }; 4685 + 4686 + static void copy_subpage(unsigned long addr, int idx, void *arg) 4687 + { 4688 + struct copy_subpage_arg *copy_arg = arg; 4689 + 4690 + copy_user_highpage(copy_arg->dst + idx, copy_arg->src + idx, 4691 + addr, copy_arg->vma); 4692 + } 4693 + 4696 4694 void copy_user_huge_page(struct page *dst, struct page *src, 4697 - unsigned long addr, struct vm_area_struct *vma, 4695 + unsigned long addr_hint, struct vm_area_struct *vma, 4698 4696 unsigned int pages_per_huge_page) 4699 4697 { 4700 - int i; 4698 + unsigned long addr = addr_hint & 4699 + ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1); 4700 + struct copy_subpage_arg arg = { 4701 + .dst = dst, 4702 + .src = src, 4703 + .vma = vma, 4704 + }; 4701 4705 4702 4706 if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) { 4703 4707 copy_user_gigantic_page(dst, src, addr, vma, ··· 4725 4689 return; 4726 4690 } 4727 4691 4728 - might_sleep(); 4729 - for (i = 0; i < pages_per_huge_page; i++) { 4730 - cond_resched(); 4731 - copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma); 4732 - } 4692 + 
process_huge_page(addr_hint, pages_per_huge_page, copy_subpage, &arg); 4733 4693 } 4734 4694 4735 4695 long copy_huge_page_from_user(struct page *dst_page,
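The clear_huge_page() ordering above is now shared with copy_user_huge_page() through the new process_huge_page() callback helper. Its visiting order can be illustrated with a small userspace sketch (subpage_order() is a hypothetical name, not a kernel function): every subpage index is emitted exactly once, converging left-right on the target so its cache lines are hottest when the fault returns.

```c
#include <assert.h>

/* Emit the order in which process_huge_page() would visit subpages
 * 0..npages-1 when the faulting address falls in subpage n.
 * Mirrors the kernel logic; the target index always comes last. */
static void subpage_order(int n, int npages, int *order)
{
	int i, base, l, k = 0;

	if (2 * n <= npages) {
		/* target in first half: sweep the tail downwards first */
		base = 0;
		l = n;
		for (i = npages - 1; i >= 2 * n; i--)
			order[k++] = i;
	} else {
		/* target in second half: sweep the head upwards first */
		base = npages - 2 * (npages - n);
		l = npages - n;
		for (i = 0; i < base; i++)
			order[k++] = i;
	}
	/* converge on the target in a left-right-left-right pattern */
	for (i = 0; i < l; i++) {
		order[k++] = base + i;			/* left */
		order[k++] = base + 2 * l - 1 - i;	/* right */
	}
}
```

For a 2MiB huge page (512 subpages of 4KiB) faulted at subpage 5, this sweeps 511 down to 10 first, then converges 0, 9, 1, 8, ... and finishes on 5.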
+47 -49
mm/memory_hotplug.c
··· 1034 1034 return pgdat; 1035 1035 } 1036 1036 1037 - static void rollback_node_hotadd(int nid, pg_data_t *pgdat) 1037 + static void rollback_node_hotadd(int nid) 1038 1038 { 1039 + pg_data_t *pgdat = NODE_DATA(nid); 1040 + 1039 1041 arch_refresh_nodedata(nid, NULL); 1040 1042 free_percpu(pgdat->per_cpu_nodestats); 1041 1043 arch_free_nodedata(pgdat); ··· 1048 1046 /** 1049 1047 * try_online_node - online a node if offlined 1050 1048 * @nid: the node ID 1051 - * 1049 + * @start: start addr of the node 1050 + * @set_node_online: Whether we want to online the node 1052 1051 * called by cpu_up() to online a node without onlined memory. 1052 + * 1053 + * Returns: 1054 + * 1 -> a new node has been allocated 1055 + * 0 -> the node is already online 1056 + * -ENOMEM -> the node could not be allocated 1053 1057 */ 1054 - int try_online_node(int nid) 1058 + static int __try_online_node(int nid, u64 start, bool set_node_online) 1055 1059 { 1056 - pg_data_t *pgdat; 1057 - int ret; 1060 + pg_data_t *pgdat; 1061 + int ret = 1; 1058 1062 1059 1063 if (node_online(nid)) 1060 1064 return 0; 1061 1065 1062 - mem_hotplug_begin(); 1063 - pgdat = hotadd_new_pgdat(nid, 0); 1066 + pgdat = hotadd_new_pgdat(nid, start); 1064 1067 if (!pgdat) { 1065 1068 pr_err("Cannot online node %d due to NULL pgdat\n", nid); 1066 1069 ret = -ENOMEM; 1067 1070 goto out; 1068 1071 } 1069 - node_set_online(nid); 1070 - ret = register_one_node(nid); 1071 - BUG_ON(ret); 1072 + 1073 + if (set_node_online) { 1074 + node_set_online(nid); 1075 + ret = register_one_node(nid); 1076 + BUG_ON(ret); 1077 + } 1072 1078 out: 1079 + return ret; 1080 + } 1081 + 1082 + /* 1083 + * Users of this function always want to online/register the node 1084 + */ 1085 + int try_online_node(int nid) 1086 + { 1087 + int ret; 1088 + 1089 + mem_hotplug_begin(); 1090 + ret = __try_online_node(nid, 0, true); 1073 1091 mem_hotplug_done(); 1074 1092 return ret; 1075 1093 } ··· 1121 1099 int __ref add_memory_resource(int nid, struct 
resource *res, bool online) 1122 1100 { 1123 1101 u64 start, size; 1124 - pg_data_t *pgdat = NULL; 1125 - bool new_pgdat; 1126 - bool new_node; 1102 + bool new_node = false; 1127 1103 int ret; 1128 1104 1129 1105 start = res->start; ··· 1130 1110 ret = check_hotplug_memory_range(start, size); 1131 1111 if (ret) 1132 1112 return ret; 1133 - 1134 - { /* Stupid hack to suppress address-never-null warning */ 1135 - void *p = NODE_DATA(nid); 1136 - new_pgdat = !p; 1137 - } 1138 1113 1139 1114 mem_hotplug_begin(); 1140 1115 ··· 1141 1126 */ 1142 1127 memblock_add_node(start, size, nid); 1143 1128 1144 - new_node = !node_online(nid); 1145 - if (new_node) { 1146 - pgdat = hotadd_new_pgdat(nid, start); 1147 - ret = -ENOMEM; 1148 - if (!pgdat) 1149 - goto error; 1150 - } 1129 + ret = __try_online_node(nid, start, false); 1130 + if (ret < 0) 1131 + goto error; 1132 + new_node = ret; 1151 1133 1152 1134 /* call arch's memory hotadd */ 1153 1135 ret = arch_add_memory(nid, start, size, NULL, true); 1154 - 1155 1136 if (ret < 0) 1156 1137 goto error; 1157 1138 1158 - /* we online node here. we can't roll back from here. */ 1159 - node_set_online(nid); 1160 - 1161 1139 if (new_node) { 1162 - unsigned long start_pfn = start >> PAGE_SHIFT; 1163 - unsigned long nr_pages = size >> PAGE_SHIFT; 1164 - 1165 - ret = __register_one_node(nid); 1166 - if (ret) 1167 - goto register_fail; 1168 - 1169 - /* 1170 - * link memory sections under this node. This is already 1171 - * done when creatig memory section in register_new_memory 1172 - * but that depends to have the node registered so offline 1173 - * nodes have to go through register_node. 1174 - * TODO clean up this mess. 1175 - */ 1176 - ret = link_mem_sections(nid, start_pfn, nr_pages, false); 1177 - register_fail: 1178 - /* 1179 - * If sysfs file of new node can't create, cpu on the node 1140 + /* If sysfs file of new node can't be created, cpu on the node 1180 1141 * can't be hot-added. There is no rollback way now. 
1181 1142 * So, check by BUG_ON() to catch it reluctantly.. 1143 + * We online node here. We can't roll back from here. 1182 1144 */ 1145 + node_set_online(nid); 1146 + ret = __register_one_node(nid); 1183 1147 BUG_ON(ret); 1184 1148 } 1149 + 1150 + /* link memory sections under this node.*/ 1151 + ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1)); 1152 + BUG_ON(ret); 1185 1153 1186 1154 /* create new memmap entry */ 1187 1155 firmware_map_add_hotplug(start, start + size, "System RAM"); ··· 1178 1180 1179 1181 error: 1180 1182 /* rollback pgdat allocation and others */ 1181 - if (new_pgdat && pgdat) 1182 - rollback_node_hotadd(nid, pgdat); 1183 + if (new_node) 1184 + rollback_node_hotadd(nid); 1183 1185 memblock_remove(start, size); 1184 1186 1185 1187 out:
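The memory_hotplug rework above folds node allocation into __try_online_node(), whose return contract (1: a new node was allocated, 0: node already online, -ENOMEM: allocation failed) is what lets add_memory_resource() drop its new_pgdat bookkeeping. A toy model of that contract (all toy_* names are invented for illustration, not kernel APIs):

```c
#include <assert.h>

#define TOY_MAX_NODES	8
#define TOY_ENOMEM	12

static int toy_online[TOY_MAX_NODES];
static int toy_alloc_fails;	/* simulate hotadd_new_pgdat() failing */

/* Same result convention as __try_online_node():
 * 1 = newly allocated, 0 = already online, -ENOMEM = failure. */
static int toy_try_online_node(int nid)
{
	if (toy_online[nid])
		return 0;
	if (toy_alloc_fails)
		return -TOY_ENOMEM;
	toy_online[nid] = 1;
	return 1;
}
```

add_memory_resource() then only rolls back the pgdat when the call returned 1, i.e. when `new_node = ret` ends up true.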
+6 -6
mm/mempool.c
··· 111 111 kasan_free_pages(element, (unsigned long)pool->pool_data); 112 112 } 113 113 114 - static void kasan_unpoison_element(mempool_t *pool, void *element, gfp_t flags) 114 + static void kasan_unpoison_element(mempool_t *pool, void *element) 115 115 { 116 116 if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) 117 117 kasan_unpoison_slab(element); ··· 127 127 pool->elements[pool->curr_nr++] = element; 128 128 } 129 129 130 - static void *remove_element(mempool_t *pool, gfp_t flags) 130 + static void *remove_element(mempool_t *pool) 131 131 { 132 132 void *element = pool->elements[--pool->curr_nr]; 133 133 134 134 BUG_ON(pool->curr_nr < 0); 135 - kasan_unpoison_element(pool, element, flags); 135 + kasan_unpoison_element(pool, element); 136 136 check_element(pool, element); 137 137 return element; 138 138 } ··· 151 151 void mempool_exit(mempool_t *pool) 152 152 { 153 153 while (pool->curr_nr) { 154 - void *element = remove_element(pool, GFP_KERNEL); 154 + void *element = remove_element(pool); 155 155 pool->free(element, pool->pool_data); 156 156 } 157 157 kfree(pool->elements); ··· 301 301 spin_lock_irqsave(&pool->lock, flags); 302 302 if (new_min_nr <= pool->min_nr) { 303 303 while (new_min_nr < pool->curr_nr) { 304 - element = remove_element(pool, GFP_KERNEL); 304 + element = remove_element(pool); 305 305 spin_unlock_irqrestore(&pool->lock, flags); 306 306 pool->free(element, pool->pool_data); 307 307 spin_lock_irqsave(&pool->lock, flags); ··· 387 387 388 388 spin_lock_irqsave(&pool->lock, flags); 389 389 if (likely(pool->curr_nr)) { 390 - element = remove_element(pool, gfp_temp); 390 + element = remove_element(pool); 391 391 spin_unlock_irqrestore(&pool->lock, flags); 392 392 /* paired with rmb in mempool_free(), read comment there */ 393 393 smp_wmb();
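The mempool cleanup works because neither check_element() nor the KASAN unpoisoning allocates memory, so remove_element() is a plain LIFO pop and the gfp mask it used to take was dead weight. In isolation (toy_* names invented, not the kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the mempool reserve stack: curr_nr indexes a
 * LIFO array, so popping an element needs no allocation context. */
struct toy_pool {
	int curr_nr;
	void *elements[8];
};

static void toy_add_element(struct toy_pool *pool, void *element)
{
	pool->elements[pool->curr_nr++] = element;
}

static void *toy_remove_element(struct toy_pool *pool)
{
	void *element = pool->elements[--pool->curr_nr];

	assert(pool->curr_nr >= 0);	/* BUG_ON() in the kernel */
	return element;
}
```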
+2 -1
mm/migrate.c
··· 2951 2951 /* Sanity check the arguments */ 2952 2952 start &= PAGE_MASK; 2953 2953 end &= PAGE_MASK; 2954 - if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) 2954 + if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL) || 2955 + vma_is_dax(vma)) 2955 2956 return -EINVAL; 2956 2957 if (start < vma->vm_start || start >= vma->vm_end) 2957 2958 return -EINVAL;
+2 -1
mm/mlock.c
··· 527 527 vm_flags_t old_flags = vma->vm_flags; 528 528 529 529 if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) || 530 - is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm)) 530 + is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || 531 + vma_is_dax(vma)) 531 532 /* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */ 532 533 goto out; 533 534
+5 -4
mm/mmap.c
··· 1796 1796 1797 1797 vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT); 1798 1798 if (vm_flags & VM_LOCKED) { 1799 - if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || 1800 - vma == get_gate_vma(current->mm))) 1801 - mm->locked_vm += (len >> PAGE_SHIFT); 1802 - else 1799 + if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) || 1800 + is_vm_hugetlb_page(vma) || 1801 + vma == get_gate_vma(current->mm)) 1803 1802 vma->vm_flags &= VM_LOCKED_CLEAR_MASK; 1803 + else 1804 + mm->locked_vm += (len >> PAGE_SHIFT); 1804 1805 } 1805 1806 1806 1807 if (file)
-4
mm/nommu.c
··· 364 364 } 365 365 EXPORT_SYMBOL(vzalloc_node); 366 366 367 - #ifndef PAGE_KERNEL_EXEC 368 - # define PAGE_KERNEL_EXEC PAGE_KERNEL 369 - #endif 370 - 371 367 /** 372 368 * vmalloc_exec - allocate virtually contiguous, executable memory 373 369 * @size: allocation size
+9 -7
mm/oom_kill.c
··· 53 53 int sysctl_oom_kill_allocating_task; 54 54 int sysctl_oom_dump_tasks = 1; 55 55 56 + /* 57 + * Serializes oom killer invocations (out_of_memory()) from all contexts to 58 + * prevent from over eager oom killing (e.g. when the oom killer is invoked 59 + * from different domains). 60 + * 61 + * oom_killer_disable() relies on this lock to stabilize oom_killer_disabled 62 + * and mark_oom_victim 63 + */ 56 64 DEFINE_MUTEX(oom_lock); 57 65 58 66 #ifdef CONFIG_NUMA ··· 1085 1077 dump_header(oc, NULL); 1086 1078 panic("Out of memory and no killable processes...\n"); 1087 1079 } 1088 - if (oc->chosen && oc->chosen != (void *)-1UL) { 1080 + if (oc->chosen && oc->chosen != (void *)-1UL) 1089 1081 oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : 1090 1082 "Memory cgroup out of memory"); 1091 - /* 1092 - * Give the killed process a good chance to exit before trying 1093 - * to allocate memory again. 1094 - */ 1095 - schedule_timeout_killable(1); 1096 - } 1097 1083 return !!oc->chosen; 1098 1084 } 1099 1085
+2 -2
mm/page-writeback.c
··· 2490 2490 2491 2491 /* 2492 2492 * Call this whenever redirtying a page, to de-account the dirty counters 2493 - * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written 2494 - * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to 2493 + * (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written 2494 + * counters (NR_WRITTEN, WB_WRITTEN) in long term. The mismatches will lead to 2495 2495 * systematic errors in balanced_dirty_ratelimit and the dirty pages position 2496 2496 * control. 2497 2497 */
+16 -17
mm/page_alloc.c
··· 4165 4165 alloc_flags = reserve_flags; 4166 4166 4167 4167 /* 4168 - * Reset the zonelist iterators if memory policies can be ignored. 4169 - * These allocations are high priority and system rather than user 4170 - * orientated. 4168 + * Reset the nodemask and zonelist iterators if memory policies can be 4169 + * ignored. These allocations are high priority and system rather than 4170 + * user oriented. 4171 4171 */ 4172 4172 if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) { 4173 + ac->nodemask = NULL; 4173 4174 ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, 4174 4175 ac->high_zoneidx, ac->nodemask); 4175 4176 } ··· 4404 4403 EXPORT_SYMBOL(__alloc_pages_nodemask); 4405 4404 4406 4405 /* 4407 - * Common helper functions. 4406 + * Common helper functions. Never use with __GFP_HIGHMEM because the returned 4407 + * address cannot represent highmem pages. Use alloc_pages and then kmap if 4408 + * you need to access high mem. 4408 4409 */ 4409 4410 unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order) 4410 4411 { 4411 4412 struct page *page; 4412 4413 4413 - /* 4414 - * __get_free_pages() returns a virtual address, which cannot represent 4415 - * a highmem page 4416 - */ 4417 - VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0); 4418 - 4419 - page = alloc_pages(gfp_mask, order); 4414 + page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order); 4420 4415 if (!page) 4421 4416 return 0; 4422 4417 return (unsigned long) page_address(page); ··· 5564 5567 5565 5568 /* 5566 5569 * The per-cpu-pages pools are set to around 1000th of the 5567 - * size of the zone. But no more than 1/2 of a meg. 5568 - * 5569 - * OK, so we don't know how big the cache is. So guess. 5570 + * size of the zone. 5570 5571 */ 5571 5572 batch = zone->managed_pages / 1024; 5572 - if (batch * PAGE_SIZE > 512 * 1024) 5573 - batch = (512 * 1024) / PAGE_SIZE; 5573 + /* But no more than a meg. 
*/ 5574 + if (batch * PAGE_SIZE > 1024 * 1024) 5575 + batch = (1024 * 1024) / PAGE_SIZE; 5574 5576 batch /= 4; /* We effectively *= 4 below */ 5575 5577 if (batch < 1) 5576 5578 batch = 1; ··· 6401 6405 pgcnt = 0; 6402 6406 for_each_resv_unavail_range(i, &start, &end) { 6403 6407 for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) { 6404 - if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) 6408 + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) { 6409 + pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) 6410 + + pageblock_nr_pages - 1; 6405 6411 continue; 6412 + } 6406 6413 mm_zero_struct_page(pfn_to_page(pfn)); 6407 6414 pgcnt++; 6408 6415 }
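The "double zone's batchsize" hunk above only moves the cap from 512KiB to 1MiB; the rest of the sizing is unchanged. Pulled out into a standalone sketch (assumes 4KiB pages; the real zone_batchsize() additionally rounds the result to a power of two):

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/* ~1/1024 of the zone's pages, capped at 1MiB worth (was 512KiB),
 * then quartered because the fastpath effectively multiplies by 4. */
static unsigned long toy_zone_batchsize(unsigned long managed_pages)
{
	unsigned long batch = managed_pages / 1024;

	if (batch * TOY_PAGE_SIZE > 1024 * 1024)
		batch = (1024 * 1024) / TOY_PAGE_SIZE;
	batch /= 4;
	if (batch < 1)
		batch = 1;
	return batch;
}
```

With 4KiB pages, any zone of 1GiB or more goes from a per-cpu batch of 32 pages to 64, matching the patch title.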
+2 -2
mm/page_ext.c
··· 120 120 pgdat->node_page_ext = NULL; 121 121 } 122 122 123 - struct page_ext *lookup_page_ext(struct page *page) 123 + struct page_ext *lookup_page_ext(const struct page *page) 124 124 { 125 125 unsigned long pfn = page_to_pfn(page); 126 126 unsigned long index; ··· 195 195 196 196 #else /* CONFIG_FLAT_NODE_MEM_MAP */ 197 197 198 - struct page_ext *lookup_page_ext(struct page *page) 198 + struct page_ext *lookup_page_ext(const struct page *page) 199 199 { 200 200 unsigned long pfn = page_to_pfn(page); 201 201 struct mem_section *section = __pfn_to_section(pfn);
+2 -1
mm/shmem.c
··· 29 29 #include <linux/pagemap.h> 30 30 #include <linux/file.h> 31 31 #include <linux/mm.h> 32 + #include <linux/random.h> 32 33 #include <linux/sched/signal.h> 33 34 #include <linux/export.h> 34 35 #include <linux/swap.h> ··· 2189 2188 inode_init_owner(inode, dir, mode); 2190 2189 inode->i_blocks = 0; 2191 2190 inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); 2192 - inode->i_generation = get_seconds(); 2191 + inode->i_generation = prandom_u32(); 2193 2192 info = SHMEM_I(inode); 2194 2193 memset(info, 0, (char *)inode - (char *)info); 2195 2194 spin_lock_init(&info->lock);
+3 -3
mm/slab.h
··· 203 203 void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **); 204 204 int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **); 205 205 206 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 206 + #ifdef CONFIG_MEMCG_KMEM 207 207 208 208 /* List of all root caches. */ 209 209 extern struct list_head slab_root_caches; ··· 296 296 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, 297 297 void (*deact_fn)(struct kmem_cache *)); 298 298 299 - #else /* CONFIG_MEMCG && !CONFIG_SLOB */ 299 + #else /* CONFIG_MEMCG_KMEM */ 300 300 301 301 /* If !memcg, all caches are root. */ 302 302 #define slab_root_caches slab_caches ··· 351 351 { 352 352 } 353 353 354 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 354 + #endif /* CONFIG_MEMCG_KMEM */ 355 355 356 356 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x) 357 357 {
+4 -4
mm/slab_common.c
··· 127 127 return i; 128 128 } 129 129 130 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 130 + #ifdef CONFIG_MEMCG_KMEM 131 131 132 132 LIST_HEAD(slab_root_caches); 133 133 ··· 256 256 static inline void memcg_unlink_cache(struct kmem_cache *s) 257 257 { 258 258 } 259 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 259 + #endif /* CONFIG_MEMCG_KMEM */ 260 260 261 261 /* 262 262 * Figure out what the alignment of the objects will be given a set of ··· 584 584 return 0; 585 585 } 586 586 587 - #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB) 587 + #ifdef CONFIG_MEMCG_KMEM 588 588 /* 589 589 * memcg_create_kmem_cache - Create a cache for a memory cgroup. 590 590 * @memcg: The memory cgroup the new cache is for. ··· 861 861 static inline void flush_memcg_workqueue(struct kmem_cache *s) 862 862 { 863 863 } 864 - #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ 864 + #endif /* CONFIG_MEMCG_KMEM */ 865 865 866 866 void slab_kmem_cache_release(struct kmem_cache *s) 867 867 {
+1 -2
mm/slub.c
··· 271 271 272 272 static void prefetch_freepointer(const struct kmem_cache *s, void *object) 273 273 { 274 - if (object) 275 - prefetch(freelist_dereference(s, object + s->offset)); 274 + prefetch(object + s->offset); 276 275 } 277 276 278 277 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
+4 -57
mm/sparse-vmemmap.c
··· 43 43 unsigned long goal) 44 44 { 45 45 return memblock_virt_alloc_try_nid_raw(size, align, goal, 46 - BOOTMEM_ALLOC_ACCESSIBLE, node); 46 + BOOTMEM_ALLOC_ACCESSIBLE, node); 47 47 } 48 - 49 - static void *vmemmap_buf; 50 - static void *vmemmap_buf_end; 51 48 52 49 void * __meminit vmemmap_alloc_block(unsigned long size, int node) 53 50 { ··· 73 76 /* need to make sure size is all the same during early stage */ 74 77 void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node) 75 78 { 76 - void *ptr; 79 + void *ptr = sparse_buffer_alloc(size); 77 80 78 - if (!vmemmap_buf) 79 - return vmemmap_alloc_block(size, node); 80 - 81 - /* take the from buf */ 82 - ptr = (void *)ALIGN((unsigned long)vmemmap_buf, size); 83 - if (ptr + size > vmemmap_buf_end) 84 - return vmemmap_alloc_block(size, node); 85 - 86 - vmemmap_buf = ptr + size; 87 - 81 + if (!ptr) 82 + ptr = vmemmap_alloc_block(size, node); 88 83 return ptr; 89 84 } 90 85 ··· 260 271 return NULL; 261 272 262 273 return map; 263 - } 264 - 265 - void __init sparse_mem_maps_populate_node(struct page **map_map, 266 - unsigned long pnum_begin, 267 - unsigned long pnum_end, 268 - unsigned long map_count, int nodeid) 269 - { 270 - unsigned long pnum; 271 - unsigned long size = sizeof(struct page) * PAGES_PER_SECTION; 272 - void *vmemmap_buf_start; 273 - 274 - size = ALIGN(size, PMD_SIZE); 275 - vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size * map_count, 276 - PMD_SIZE, __pa(MAX_DMA_ADDRESS)); 277 - 278 - if (vmemmap_buf_start) { 279 - vmemmap_buf = vmemmap_buf_start; 280 - vmemmap_buf_end = vmemmap_buf_start + size * map_count; 281 - } 282 - 283 - for (pnum = pnum_begin; pnum < pnum_end; pnum++) { 284 - struct mem_section *ms; 285 - 286 - if (!present_section_nr(pnum)) 287 - continue; 288 - 289 - map_map[pnum] = sparse_mem_map_populate(pnum, nodeid, NULL); 290 - if (map_map[pnum]) 291 - continue; 292 - ms = __nr_to_section(pnum); 293 - pr_err("%s: sparsemem memory map backing failed some memory will 
not be available\n", 294 - __func__); 295 - ms->section_mem_map = 0; 296 - } 297 - 298 - if (vmemmap_buf_start) { 299 - /* need to free left buf */ 300 - memblock_free_early(__pa(vmemmap_buf), 301 - vmemmap_buf_end - vmemmap_buf); 302 - vmemmap_buf = NULL; 303 - vmemmap_buf_end = NULL; 304 - } 305 274 }
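vmemmap_alloc_block_buf() now defers to sparse_buffer_alloc(), the shared bump allocator this series adds in mm/sparse.c, keeping vmemmap_alloc_block() purely as the fallback. A userspace sketch of that allocator (toy_* names invented; like the kernel's PTR_ALIGN use, it assumes power-of-two sizes):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

static char toy_buf[4096];
static char *toy_cur = toy_buf;
static char *toy_end = toy_buf + sizeof(toy_buf);

/* Carve a size-aligned chunk out of the preallocated buffer; NULL on
 * exhaustion, at which point the caller does a real allocation. */
static void *toy_sparse_buffer_alloc(uintptr_t size)
{
	uintptr_t p = ((uintptr_t)toy_cur + size - 1) & ~(size - 1);

	if ((char *)p + size > toy_end)
		return NULL;
	toy_cur = (char *)p + size;
	return (void *)p;
}
```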
+111 -193
mm/sparse.c
··· 200 200 (section_nr <= __highest_present_section_nr)); \ 201 201 section_nr = next_present_section_nr(section_nr)) 202 202 203 + static inline unsigned long first_present_section_nr(void) 204 + { 205 + return next_present_section_nr(-1); 206 + } 207 + 203 208 /* Record a memory area against a node. */ 204 209 void __init memory_present(int nid, unsigned long start, unsigned long end) 205 210 { ··· 262 257 return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum); 263 258 } 264 259 265 - static int __meminit sparse_init_one_section(struct mem_section *ms, 260 + static void __meminit sparse_init_one_section(struct mem_section *ms, 266 261 unsigned long pnum, struct page *mem_map, 267 262 unsigned long *pageblock_bitmap) 268 263 { 269 - if (!present_section(ms)) 270 - return -EINVAL; 271 - 272 264 ms->section_mem_map &= ~SECTION_MAP_MASK; 273 265 ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) | 274 266 SECTION_HAS_MEM_MAP; 275 267 ms->pageblock_flags = pageblock_bitmap; 276 - 277 - return 1; 278 268 } 279 269 280 270 unsigned long usemap_size(void) ··· 370 370 } 371 371 #endif /* CONFIG_MEMORY_HOTREMOVE */ 372 372 373 - static void __init sparse_early_usemaps_alloc_node(void *data, 374 - unsigned long pnum_begin, 375 - unsigned long pnum_end, 376 - unsigned long usemap_count, int nodeid) 373 + #ifdef CONFIG_SPARSEMEM_VMEMMAP 374 + static unsigned long __init section_map_size(void) 377 375 { 378 - void *usemap; 379 - unsigned long pnum; 380 - unsigned long **usemap_map = (unsigned long **)data; 381 - int size = usemap_size(); 382 - 383 - usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nodeid), 384 - size * usemap_count); 385 - if (!usemap) { 386 - pr_warn("%s: allocation failed\n", __func__); 387 - return; 388 - } 389 - 390 - for (pnum = pnum_begin; pnum < pnum_end; pnum++) { 391 - if (!present_section_nr(pnum)) 392 - continue; 393 - usemap_map[pnum] = usemap; 394 - usemap += size; 395 - check_usemap_section_nr(nodeid, 
usemap_map[pnum]); 396 - } 376 + return ALIGN(sizeof(struct page) * PAGES_PER_SECTION, PMD_SIZE); 397 377 } 398 378 399 - #ifndef CONFIG_SPARSEMEM_VMEMMAP 379 + #else 380 + static unsigned long __init section_map_size(void) 381 + { 382 + return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION); 383 + } 384 + 400 385 struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid, 401 386 struct vmem_altmap *altmap) 402 387 { 403 - struct page *map; 404 - unsigned long size; 388 + unsigned long size = section_map_size(); 389 + struct page *map = sparse_buffer_alloc(size); 405 390 406 - size = PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION); 391 + if (map) 392 + return map; 393 + 407 394 map = memblock_virt_alloc_try_nid(size, 408 395 PAGE_SIZE, __pa(MAX_DMA_ADDRESS), 409 396 BOOTMEM_ALLOC_ACCESSIBLE, nid); 410 397 return map; 411 398 } 412 - void __init sparse_mem_maps_populate_node(struct page **map_map, 413 - unsigned long pnum_begin, 414 - unsigned long pnum_end, 415 - unsigned long map_count, int nodeid) 416 - { 417 - void *map; 418 - unsigned long pnum; 419 - unsigned long size = sizeof(struct page) * PAGES_PER_SECTION; 420 - 421 - size = PAGE_ALIGN(size); 422 - map = memblock_virt_alloc_try_nid_raw(size * map_count, 423 - PAGE_SIZE, __pa(MAX_DMA_ADDRESS), 424 - BOOTMEM_ALLOC_ACCESSIBLE, nodeid); 425 - if (map) { 426 - for (pnum = pnum_begin; pnum < pnum_end; pnum++) { 427 - if (!present_section_nr(pnum)) 428 - continue; 429 - map_map[pnum] = map; 430 - map += size; 431 - } 432 - return; 433 - } 434 - 435 - /* fallback */ 436 - for (pnum = pnum_begin; pnum < pnum_end; pnum++) { 437 - struct mem_section *ms; 438 - 439 - if (!present_section_nr(pnum)) 440 - continue; 441 - map_map[pnum] = sparse_mem_map_populate(pnum, nodeid, NULL); 442 - if (map_map[pnum]) 443 - continue; 444 - ms = __nr_to_section(pnum); 445 - pr_err("%s: sparsemem memory map backing failed some memory will not be available\n", 446 - __func__); 447 - ms->section_mem_map = 0; 448 - 
} 449 - } 450 399 #endif /* !CONFIG_SPARSEMEM_VMEMMAP */ 451 400 452 - #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER 453 - static void __init sparse_early_mem_maps_alloc_node(void *data, 454 - unsigned long pnum_begin, 455 - unsigned long pnum_end, 456 - unsigned long map_count, int nodeid) 457 - { 458 - struct page **map_map = (struct page **)data; 459 - sparse_mem_maps_populate_node(map_map, pnum_begin, pnum_end, 460 - map_count, nodeid); 461 - } 462 - #else 463 - static struct page __init *sparse_early_mem_map_alloc(unsigned long pnum) 464 - { 465 - struct page *map; 466 - struct mem_section *ms = __nr_to_section(pnum); 467 - int nid = sparse_early_nid(ms); 401 + static void *sparsemap_buf __meminitdata; 402 + static void *sparsemap_buf_end __meminitdata; 468 403 469 - map = sparse_mem_map_populate(pnum, nid, NULL); 470 - if (map) 471 - return map; 472 - 473 - pr_err("%s: sparsemem memory map backing failed some memory will not be available\n", 474 - __func__); 475 - ms->section_mem_map = 0; 476 - return NULL; 404 + static void __init sparse_buffer_init(unsigned long size, int nid) 405 + { 406 + WARN_ON(sparsemap_buf); /* forgot to call sparse_buffer_fini()? 
*/ 407 + sparsemap_buf = 408 + memblock_virt_alloc_try_nid_raw(size, PAGE_SIZE, 409 + __pa(MAX_DMA_ADDRESS), 410 + BOOTMEM_ALLOC_ACCESSIBLE, nid); 411 + sparsemap_buf_end = sparsemap_buf + size; 477 412 } 478 - #endif 413 + 414 + static void __init sparse_buffer_fini(void) 415 + { 416 + unsigned long size = sparsemap_buf_end - sparsemap_buf; 417 + 418 + if (sparsemap_buf && size > 0) 419 + memblock_free_early(__pa(sparsemap_buf), size); 420 + sparsemap_buf = NULL; 421 + } 422 + 423 + void * __meminit sparse_buffer_alloc(unsigned long size) 424 + { 425 + void *ptr = NULL; 426 + 427 + if (sparsemap_buf) { 428 + ptr = PTR_ALIGN(sparsemap_buf, size); 429 + if (ptr + size > sparsemap_buf_end) 430 + ptr = NULL; 431 + else 432 + sparsemap_buf = ptr + size; 433 + } 434 + return ptr; 435 + } 479 436 480 437 void __weak __meminit vmemmap_populate_print_last(void) 481 438 { 482 439 } 483 440 484 - /** 485 - * alloc_usemap_and_memmap - memory alloction for pageblock flags and vmemmap 486 - * @map: usemap_map for pageblock flags or mmap_map for vmemmap 441 + /* 442 + * Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end) 443 + * And number of present sections in this node is map_count. 
487 444 */ 488 - static void __init alloc_usemap_and_memmap(void (*alloc_func) 489 - (void *, unsigned long, unsigned long, 490 - unsigned long, int), void *data) 445 + static void __init sparse_init_nid(int nid, unsigned long pnum_begin, 446 + unsigned long pnum_end, 447 + unsigned long map_count) 491 448 { 492 - unsigned long pnum; 493 - unsigned long map_count; 494 - int nodeid_begin = 0; 495 - unsigned long pnum_begin = 0; 449 + unsigned long pnum, usemap_longs, *usemap; 450 + struct page *map; 496 451 497 - for_each_present_section_nr(0, pnum) { 498 - struct mem_section *ms; 499 - 500 - ms = __nr_to_section(pnum); 501 - nodeid_begin = sparse_early_nid(ms); 502 - pnum_begin = pnum; 503 - break; 452 + usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS); 453 + usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid), 454 + usemap_size() * 455 + map_count); 456 + if (!usemap) { 457 + pr_err("%s: node[%d] usemap allocation failed", __func__, nid); 458 + goto failed; 504 459 } 505 - map_count = 1; 506 - for_each_present_section_nr(pnum_begin + 1, pnum) { 507 - struct mem_section *ms; 508 - int nodeid; 460 + sparse_buffer_init(map_count * section_map_size(), nid); 461 + for_each_present_section_nr(pnum_begin, pnum) { 462 + if (pnum >= pnum_end) 463 + break; 509 464 510 - ms = __nr_to_section(pnum); 511 - nodeid = sparse_early_nid(ms); 512 - if (nodeid == nodeid_begin) { 513 - map_count++; 514 - continue; 465 + map = sparse_mem_map_populate(pnum, nid, NULL); 466 + if (!map) { 467 + pr_err("%s: node[%d] memory map backing failed. 
Some memory will not be available.", 468 + __func__, nid); 469 + pnum_begin = pnum; 470 + goto failed; 515 471 } 516 - /* ok, we need to take cake of from pnum_begin to pnum - 1*/ 517 - alloc_func(data, pnum_begin, pnum, 518 - map_count, nodeid_begin); 519 - /* new start, update count etc*/ 520 - nodeid_begin = nodeid; 521 - pnum_begin = pnum; 522 - map_count = 1; 472 + check_usemap_section_nr(nid, usemap); 473 + sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap); 474 + usemap += usemap_longs; 523 475 } 524 - /* ok, last chunk */ 525 - alloc_func(data, pnum_begin, __highest_present_section_nr+1, 526 - map_count, nodeid_begin); 476 + sparse_buffer_fini(); 477 + return; 478 + failed: 479 + /* We failed to allocate, mark all the following pnums as not present */ 480 + for_each_present_section_nr(pnum_begin, pnum) { 481 + struct mem_section *ms; 482 + 483 + if (pnum >= pnum_end) 484 + break; 485 + ms = __nr_to_section(pnum); 486 + ms->section_mem_map = 0; 487 + } 527 488 } 528 489 529 490 /* ··· 493 532 */ 494 533 void __init sparse_init(void) 495 534 { 496 - unsigned long pnum; 497 - struct page *map; 498 - unsigned long *usemap; 499 - unsigned long **usemap_map; 500 - int size; 501 - #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER 502 - int size2; 503 - struct page **map_map; 504 - #endif 505 - 506 - /* see include/linux/mmzone.h 'struct mem_section' definition */ 507 - BUILD_BUG_ON(!is_power_of_2(sizeof(struct mem_section))); 535 + unsigned long pnum_begin = first_present_section_nr(); 536 + int nid_begin = sparse_early_nid(__nr_to_section(pnum_begin)); 537 + unsigned long pnum_end, map_count = 1; 508 538 509 539 /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */ 510 540 set_pageblock_order(); 511 541 512 - /* 513 - * map is using big page (aka 2M in x86 64 bit) 514 - * usemap is less one page (aka 24 bytes) 515 - * so alloc 2M (with 2M align) and 24 bytes in turn will 516 - * make next 2M slip to one more 2M later. 
517 - * then in big system, the memory will have a lot of holes... 518 - * here try to allocate 2M pages continuously. 519 - * 520 - * powerpc need to call sparse_init_one_section right after each 521 - * sparse_early_mem_map_alloc, so allocate usemap_map at first. 522 - */ 523 - size = sizeof(unsigned long *) * NR_MEM_SECTIONS; 524 - usemap_map = memblock_virt_alloc(size, 0); 525 - if (!usemap_map) 526 - panic("can not allocate usemap_map\n"); 527 - alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node, 528 - (void *)usemap_map); 542 + for_each_present_section_nr(pnum_begin + 1, pnum_end) { 543 + int nid = sparse_early_nid(__nr_to_section(pnum_end)); 529 544 530 - #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER 531 - size2 = sizeof(struct page *) * NR_MEM_SECTIONS; 532 - map_map = memblock_virt_alloc(size2, 0); 533 - if (!map_map) 534 - panic("can not allocate map_map\n"); 535 - alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node, 536 - (void *)map_map); 537 - #endif 538 - 539 - for_each_present_section_nr(0, pnum) { 540 - usemap = usemap_map[pnum]; 541 - if (!usemap) 545 + if (nid == nid_begin) { 546 + map_count++; 542 547 continue; 543 - 544 - #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER 545 - map = map_map[pnum]; 546 - #else 547 - map = sparse_early_mem_map_alloc(pnum); 548 - #endif 549 - if (!map) 550 - continue; 551 - 552 - sparse_init_one_section(__nr_to_section(pnum), pnum, map, 553 - usemap); 548 + } 549 + /* Init node with sections in range [pnum_begin, pnum_end) */ 550 + sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count); 551 + nid_begin = nid; 552 + pnum_begin = pnum_end; 553 + map_count = 1; 554 554 } 555 - 555 + /* cover the last node */ 556 + sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count); 556 557 vmemmap_populate_print_last(); 557 - 558 - #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER 559 - memblock_free_early(__pa(map_map), size2); 560 - #endif 561 - memblock_free_early(__pa(usemap_map), size); 562 558 } 563 559 564 
560 #ifdef CONFIG_MEMORY_HOTPLUG ··· 678 760 ret = sparse_index_init(section_nr, pgdat->node_id); 679 761 if (ret < 0 && ret != -EEXIST) 680 762 return ret; 763 + ret = 0; 681 764 memmap = kmalloc_section_memmap(section_nr, pgdat->node_id, altmap); 682 765 if (!memmap) 683 766 return -ENOMEM; ··· 705 786 #endif 706 787 707 788 section_mark_present(ms); 708 - 709 - ret = sparse_init_one_section(ms, section_nr, memmap, usemap); 789 + sparse_init_one_section(ms, section_nr, memmap, usemap); 710 790 711 791 out: 712 792 pgdat_resize_unlock(pgdat, &flags); 713 - if (ret <= 0) { 793 + if (ret < 0) { 714 794 kfree(usemap); 715 795 __kfree_section_memmap(memmap, altmap); 716 796 }
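The rewritten sparse_init() above drops the two-pass usemap_map/map_map scheme and instead walks the present sections once, counting how many consecutive sections belong to the current node and handing each contiguous run [pnum_begin, pnum_end) to sparse_init_nid(). A rough Python model of just that grouping loop, for intuition — `section_nid` and `plan_sparse_init` are illustrative names, not kernel code:

```python
# Model of the grouping walk in the new sparse_init(): section_nid maps a
# section number to its node id, or None if the section is not present.
# Returns the (nid, pnum_begin, pnum_end, map_count) tuples that would be
# passed to sparse_init_nid().

def plan_sparse_init(section_nid):
    present = [p for p, nid in enumerate(section_nid) if nid is not None]
    if not present:
        return []
    calls = []
    pnum_begin = present[0]
    nid_begin = section_nid[pnum_begin]
    map_count = 1
    for pnum_end in present[1:]:
        nid = section_nid[pnum_end]
        if nid == nid_begin:
            map_count += 1
            continue
        # init node with sections in range [pnum_begin, pnum_end)
        calls.append((nid_begin, pnum_begin, pnum_end, map_count))
        nid_begin = nid
        pnum_begin = pnum_end
        map_count = 1
    # cover the last node; pnum_end is one past the last present section
    calls.append((nid_begin, pnum_begin, present[-1] + 1, map_count))
    return calls
```

Note how a node's sections need not be contiguous in section-number space for correctness — only contiguous runs are batched, mirroring the kernel loop.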
+2 -2
mm/swap_slots.c
··· 38 38 static bool swap_slot_cache_active; 39 39 bool swap_slot_cache_enabled; 40 40 static bool swap_slot_cache_initialized; 41 - DEFINE_MUTEX(swap_slots_cache_mutex); 41 + static DEFINE_MUTEX(swap_slots_cache_mutex); 42 42 /* Serialize swap slots cache enable/disable operations */ 43 - DEFINE_MUTEX(swap_slots_cache_enable_mutex); 43 + static DEFINE_MUTEX(swap_slots_cache_enable_mutex); 44 44 45 45 static void __drain_swap_slots_cache(unsigned int type); 46 46 static void deactivate_swap_slots_cache(void);
+29 -9
mm/vmacache.c
··· 6 6 #include <linux/sched/task.h> 7 7 #include <linux/mm.h> 8 8 #include <linux/vmacache.h> 9 + #include <asm/pgtable.h> 10 + 11 + /* 12 + * Hash based on the pmd of addr if configured with MMU, which provides a good 13 + * hit rate for workloads with spatial locality. Otherwise, use pages. 14 + */ 15 + #ifdef CONFIG_MMU 16 + #define VMACACHE_SHIFT PMD_SHIFT 17 + #else 18 + #define VMACACHE_SHIFT PAGE_SHIFT 19 + #endif 20 + #define VMACACHE_HASH(addr) ((addr >> VMACACHE_SHIFT) & VMACACHE_MASK) 9 21 10 22 /* 11 23 * Flush vma caches for threads that share a given mm. ··· 99 87 100 88 struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr) 101 89 { 90 + int idx = VMACACHE_HASH(addr); 102 91 int i; 103 92 104 93 count_vm_vmacache_event(VMACACHE_FIND_CALLS); ··· 108 95 return NULL; 109 96 110 97 for (i = 0; i < VMACACHE_SIZE; i++) { 111 - struct vm_area_struct *vma = current->vmacache.vmas[i]; 98 + struct vm_area_struct *vma = current->vmacache.vmas[idx]; 112 99 113 - if (!vma) 114 - continue; 115 - if (WARN_ON_ONCE(vma->vm_mm != mm)) 116 - break; 117 - if (vma->vm_start <= addr && vma->vm_end > addr) { 118 - count_vm_vmacache_event(VMACACHE_FIND_HITS); 119 - return vma; 100 + if (vma) { 101 + #ifdef CONFIG_DEBUG_VM_VMACACHE 102 + if (WARN_ON_ONCE(vma->vm_mm != mm)) 103 + break; 104 + #endif 105 + if (vma->vm_start <= addr && vma->vm_end > addr) { 106 + count_vm_vmacache_event(VMACACHE_FIND_HITS); 107 + return vma; 108 + } 120 109 } 110 + if (++idx == VMACACHE_SIZE) 111 + idx = 0; 121 112 } 122 113 123 114 return NULL; ··· 132 115 unsigned long start, 133 116 unsigned long end) 134 117 { 118 + int idx = VMACACHE_HASH(start); 135 119 int i; 136 120 137 121 count_vm_vmacache_event(VMACACHE_FIND_CALLS); ··· 141 123 return NULL; 142 124 143 125 for (i = 0; i < VMACACHE_SIZE; i++) { 144 - struct vm_area_struct *vma = current->vmacache.vmas[i]; 126 + struct vm_area_struct *vma = current->vmacache.vmas[idx]; 145 127 146 128 if (vma && vma->vm_start 
== start && vma->vm_end == end) { 147 129 count_vm_vmacache_event(VMACACHE_FIND_HITS); 148 130 return vma; 149 131 } 132 + if (++idx == VMACACHE_SIZE) 133 + idx = 0; 150 134 } 151 135 152 136 return NULL;
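For intuition, here is a small userspace Python model of the lookup mm/vmacache.c now performs: hash the address by its pmd to pick a starting slot, then scan the fixed-size cache with wrap-around so entries hashed into other slots can still hit. The constants mirror a common x86-64 configuration (PMD_SHIFT = 21, four slots) and the tuple-based "vma" is purely illustrative:

```python
# Model of VMACACHE_HASH() plus the wrap-around scan in vmacache_find().

PMD_SHIFT = 21              # illustrative: x86-64 with 2M pmds
VMACACHE_SIZE = 4
VMACACHE_MASK = VMACACHE_SIZE - 1

def vmacache_hash(addr):
    return (addr >> PMD_SHIFT) & VMACACHE_MASK

def vmacache_find(slots, addr):
    """slots: list of (vm_start, vm_end) tuples or None, len VMACACHE_SIZE."""
    idx = vmacache_hash(addr)
    for _ in range(VMACACHE_SIZE):
        vma = slots[idx]
        # vm_start <= addr && vm_end > addr, as in the C code
        if vma and vma[0] <= addr < vma[1]:
            return vma
        idx = (idx + 1) % VMACACHE_SIZE   # the ++idx == VMACACHE_SIZE wrap
    return None
```

The point of hashing by pmd rather than indexing by loop counter is that repeated faults within the same 2M region keep landing on the same slot, which is what gives the hit-rate win the commit message cites.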
-4
mm/vmalloc.c
··· 1907 1907 } 1908 1908 EXPORT_SYMBOL(vzalloc_node); 1909 1909 1910 - #ifndef PAGE_KERNEL_EXEC 1911 - # define PAGE_KERNEL_EXEC PAGE_KERNEL 1912 - #endif 1913 - 1914 1910 /** 1915 1911 * vmalloc_exec - allocate virtually contiguous, executable memory 1916 1912 * @size: allocation size
+201 -40
mm/vmscan.c
··· 65 65 /* How many pages shrink_list() should reclaim */ 66 66 unsigned long nr_to_reclaim; 67 67 68 - /* This context's GFP mask */ 69 - gfp_t gfp_mask; 70 - 71 - /* Allocation order */ 72 - int order; 73 - 74 68 /* 75 69 * Nodemask of nodes allowed by the caller. If NULL, all nodes 76 70 * are scanned. ··· 76 82 * primary target of this reclaim invocation. 77 83 */ 78 84 struct mem_cgroup *target_mem_cgroup; 79 - 80 - /* Scan (total_size >> priority) pages at once */ 81 - int priority; 82 - 83 - /* The highest zone to isolate pages for reclaim from */ 84 - enum zone_type reclaim_idx; 85 85 86 86 /* Writepage batching in laptop mode; RECLAIM_WRITE */ 87 87 unsigned int may_writepage:1; ··· 98 110 99 111 /* One of the zones is ready for compaction */ 100 112 unsigned int compaction_ready:1; 113 + 114 + /* Allocation order */ 115 + s8 order; 116 + 117 + /* Scan (total_size >> priority) pages at once */ 118 + s8 priority; 119 + 120 + /* The highest zone to isolate pages for reclaim from */ 121 + s8 reclaim_idx; 122 + 123 + /* This context's GFP mask */ 124 + gfp_t gfp_mask; 101 125 102 126 /* Incremented by the number of inactive pages that were scanned */ 103 127 unsigned long nr_scanned; ··· 168 168 169 169 static LIST_HEAD(shrinker_list); 170 170 static DECLARE_RWSEM(shrinker_rwsem); 171 + 172 + #ifdef CONFIG_MEMCG_KMEM 173 + 174 + /* 175 + * We allow subsystems to populate their shrinker-related 176 + * LRU lists before register_shrinker_prepared() is called 177 + * for the shrinker, since we don't want to impose 178 + * restrictions on their internal registration order. 179 + * In this case shrink_slab_memcg() may find corresponding 180 + * bit is set in the shrinkers map. 181 + * 182 + * This value is used by the function to detect registering 183 + * shrinkers and to skip do_shrink_slab() calls for them. 
184 + */ 185 + #define SHRINKER_REGISTERING ((struct shrinker *)~0UL) 186 + 187 + static DEFINE_IDR(shrinker_idr); 188 + static int shrinker_nr_max; 189 + 190 + static int prealloc_memcg_shrinker(struct shrinker *shrinker) 191 + { 192 + int id, ret = -ENOMEM; 193 + 194 + down_write(&shrinker_rwsem); 195 + /* This may call shrinker, so it must use down_read_trylock() */ 196 + id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL); 197 + if (id < 0) 198 + goto unlock; 199 + 200 + if (id >= shrinker_nr_max) { 201 + if (memcg_expand_shrinker_maps(id)) { 202 + idr_remove(&shrinker_idr, id); 203 + goto unlock; 204 + } 205 + 206 + shrinker_nr_max = id + 1; 207 + } 208 + shrinker->id = id; 209 + ret = 0; 210 + unlock: 211 + up_write(&shrinker_rwsem); 212 + return ret; 213 + } 214 + 215 + static void unregister_memcg_shrinker(struct shrinker *shrinker) 216 + { 217 + int id = shrinker->id; 218 + 219 + BUG_ON(id < 0); 220 + 221 + down_write(&shrinker_rwsem); 222 + idr_remove(&shrinker_idr, id); 223 + up_write(&shrinker_rwsem); 224 + } 225 + #else /* CONFIG_MEMCG_KMEM */ 226 + static int prealloc_memcg_shrinker(struct shrinker *shrinker) 227 + { 228 + return 0; 229 + } 230 + 231 + static void unregister_memcg_shrinker(struct shrinker *shrinker) 232 + { 233 + } 234 + #endif /* CONFIG_MEMCG_KMEM */ 171 235 172 236 #ifdef CONFIG_MEMCG 173 237 static bool global_reclaim(struct scan_control *sc) ··· 377 313 shrinker->nr_deferred = kzalloc(size, GFP_KERNEL); 378 314 if (!shrinker->nr_deferred) 379 315 return -ENOMEM; 316 + 317 + if (shrinker->flags & SHRINKER_MEMCG_AWARE) { 318 + if (prealloc_memcg_shrinker(shrinker)) 319 + goto free_deferred; 320 + } 321 + 380 322 return 0; 323 + 324 + free_deferred: 325 + kfree(shrinker->nr_deferred); 326 + shrinker->nr_deferred = NULL; 327 + return -ENOMEM; 381 328 } 382 329 383 330 void free_prealloced_shrinker(struct shrinker *shrinker) 384 331 { 332 + if (!shrinker->nr_deferred) 333 + return; 334 + 335 + if (shrinker->flags & 
SHRINKER_MEMCG_AWARE) 336 + unregister_memcg_shrinker(shrinker); 337 + 385 338 kfree(shrinker->nr_deferred); 386 339 shrinker->nr_deferred = NULL; 387 340 } ··· 407 326 { 408 327 down_write(&shrinker_rwsem); 409 328 list_add_tail(&shrinker->list, &shrinker_list); 329 + #ifdef CONFIG_MEMCG_KMEM 330 + idr_replace(&shrinker_idr, shrinker, shrinker->id); 331 + #endif 410 332 up_write(&shrinker_rwsem); 411 333 } 412 334 ··· 431 347 { 432 348 if (!shrinker->nr_deferred) 433 349 return; 350 + if (shrinker->flags & SHRINKER_MEMCG_AWARE) 351 + unregister_memcg_shrinker(shrinker); 434 352 down_write(&shrinker_rwsem); 435 353 list_del(&shrinker->list); 436 354 up_write(&shrinker_rwsem); ··· 457 371 : SHRINK_BATCH; 458 372 long scanned = 0, next_deferred; 459 373 374 + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) 375 + nid = 0; 376 + 460 377 freeable = shrinker->count_objects(shrinker, shrinkctl); 461 - if (freeable == 0) 462 - return 0; 378 + if (freeable == 0 || freeable == SHRINK_EMPTY) 379 + return freeable; 463 380 464 381 /* 465 382 * copy the current shrinker scan count into a local variable ··· 563 474 return freed; 564 475 } 565 476 477 + #ifdef CONFIG_MEMCG_KMEM 478 + static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, 479 + struct mem_cgroup *memcg, int priority) 480 + { 481 + struct memcg_shrinker_map *map; 482 + unsigned long freed = 0; 483 + int ret, i; 484 + 485 + if (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)) 486 + return 0; 487 + 488 + if (!down_read_trylock(&shrinker_rwsem)) 489 + return 0; 490 + 491 + map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map, 492 + true); 493 + if (unlikely(!map)) 494 + goto unlock; 495 + 496 + for_each_set_bit(i, map->map, shrinker_nr_max) { 497 + struct shrink_control sc = { 498 + .gfp_mask = gfp_mask, 499 + .nid = nid, 500 + .memcg = memcg, 501 + }; 502 + struct shrinker *shrinker; 503 + 504 + shrinker = idr_find(&shrinker_idr, i); 505 + if (unlikely(!shrinker || shrinker == 
SHRINKER_REGISTERING)) { 506 + if (!shrinker) 507 + clear_bit(i, map->map); 508 + continue; 509 + } 510 + 511 + ret = do_shrink_slab(&sc, shrinker, priority); 512 + if (ret == SHRINK_EMPTY) { 513 + clear_bit(i, map->map); 514 + /* 515 + * After the shrinker reported that it had no objects to 516 + * free, but before we cleared the corresponding bit in 517 + * the memcg shrinker map, a new object might have been 518 + * added. To make sure, we have the bit set in this 519 + * case, we invoke the shrinker one more time and reset 520 + * the bit if it reports that it is not empty anymore. 521 + * The memory barrier here pairs with the barrier in 522 + * memcg_set_shrinker_bit(): 523 + * 524 + * list_lru_add() shrink_slab_memcg() 525 + * list_add_tail() clear_bit() 526 + * <MB> <MB> 527 + * set_bit() do_shrink_slab() 528 + */ 529 + smp_mb__after_atomic(); 530 + ret = do_shrink_slab(&sc, shrinker, priority); 531 + if (ret == SHRINK_EMPTY) 532 + ret = 0; 533 + else 534 + memcg_set_shrinker_bit(memcg, nid, i); 535 + } 536 + freed += ret; 537 + 538 + if (rwsem_is_contended(&shrinker_rwsem)) { 539 + freed = freed ? : 1; 540 + break; 541 + } 542 + } 543 + unlock: 544 + up_read(&shrinker_rwsem); 545 + return freed; 546 + } 547 + #else /* CONFIG_MEMCG_KMEM */ 548 + static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, 549 + struct mem_cgroup *memcg, int priority) 550 + { 551 + return 0; 552 + } 553 + #endif /* CONFIG_MEMCG_KMEM */ 554 + 566 555 /** 567 556 * shrink_slab - shrink slab caches 568 557 * @gfp_mask: allocation context ··· 653 486 * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set, 654 487 * unaware shrinkers will receive a node id of 0 instead. 655 488 * 656 - * @memcg specifies the memory cgroup to target. If it is not NULL, 657 - * only shrinkers with SHRINKER_MEMCG_AWARE set will be called to scan 658 - * objects from the memory cgroup specified. Otherwise, only unaware 659 - * shrinkers are called. 
489 + * @memcg specifies the memory cgroup to target. Unaware shrinkers 490 + * are called only if it is the root cgroup. 660 491 * 661 492 * @priority is sc->priority, we take the number of objects and >> by priority 662 493 * in order to get the scan target. ··· 667 502 { 668 503 struct shrinker *shrinker; 669 504 unsigned long freed = 0; 505 + int ret; 670 506 671 - if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))) 672 - return 0; 507 + if (!mem_cgroup_is_root(memcg)) 508 + return shrink_slab_memcg(gfp_mask, nid, memcg, priority); 673 509 674 510 if (!down_read_trylock(&shrinker_rwsem)) 675 511 goto out; ··· 682 516 .memcg = memcg, 683 517 }; 684 518 685 - /* 686 - * If kernel memory accounting is disabled, we ignore 687 - * SHRINKER_MEMCG_AWARE flag and call all shrinkers 688 - * passing NULL for memcg. 689 - */ 690 - if (memcg_kmem_enabled() && 691 - !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE)) 692 - continue; 693 - 694 - if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) 695 - sc.nid = 0; 696 - 697 - freed += do_shrink_slab(&sc, shrinker, priority); 519 + ret = do_shrink_slab(&sc, shrinker, priority); 520 + if (ret == SHRINK_EMPTY) 521 + ret = 0; 522 + freed += ret; 698 523 /* 699 524 * Bail out if someone want to register a new shrinker to 700 525 * prevent the regsitration from being stalled for long periods ··· 711 554 struct mem_cgroup *memcg = NULL; 712 555 713 556 freed = 0; 557 + memcg = mem_cgroup_iter(NULL, NULL, NULL); 714 558 do { 715 559 freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); 716 560 } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); ··· 2731 2573 shrink_node_memcg(pgdat, memcg, sc, &lru_pages); 2732 2574 node_lru_pages += lru_pages; 2733 2575 2734 - if (memcg) 2735 - shrink_slab(sc->gfp_mask, pgdat->node_id, 2736 - memcg, sc->priority); 2576 + shrink_slab(sc->gfp_mask, pgdat->node_id, 2577 + memcg, sc->priority); 2737 2578 2738 2579 /* Record the group's reclaim efficiency */ 2739 2580 
vmpressure(sc->gfp_mask, memcg, false, ··· 2755 2598 break; 2756 2599 } 2757 2600 } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); 2758 - 2759 - if (global_reclaim(sc)) 2760 - shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, 2761 - sc->priority); 2762 2601 2763 2602 if (reclaim_state) { 2764 2603 sc->nr_reclaimed += reclaim_state->reclaimed_slab; ··· 3215 3062 .may_unmap = 1, 3216 3063 .may_swap = 1, 3217 3064 }; 3065 + 3066 + /* 3067 + * scan_control uses s8 fields for order, priority, and reclaim_idx. 3068 + * Confirm they are large enough for max values. 3069 + */ 3070 + BUILD_BUG_ON(MAX_ORDER > S8_MAX); 3071 + BUILD_BUG_ON(DEF_PRIORITY > S8_MAX); 3072 + BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX); 3218 3073 3219 3074 /* 3220 3075 * Do not enter reclaim if fatal signal was delivered while throttled.
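The memcg-aware path added above keeps, per cgroup, a bitmap of shrinker ids that may have objects and walks only the set bits; a shrinker reporting SHRINK_EMPTY has its bit cleared and is then invoked once more, so an object added in the race window re-sets the bit rather than being lost (the barrier-pairing comment in the diff). A simplified Python model of that clear-and-recheck dance — a set stands in for the bitmap, and all names except SHRINK_EMPTY are illustrative:

```python
# Model of the shrink_slab_memcg() loop: iterate set bits, clear on empty,
# retry once to catch a racy list_lru_add(), re-set the bit if non-empty.

SHRINK_EMPTY = object()   # sentinel return, as in the patch

def shrink_slab_memcg(bits, shrinkers):
    """bits: set of shrinker ids; shrinkers: id -> callable returning a
    freed-object count or SHRINK_EMPTY."""
    freed = 0
    for i in sorted(bits.copy()):
        ret = shrinkers[i]()                  # do_shrink_slab()
        if ret is SHRINK_EMPTY:
            bits.discard(i)                   # clear_bit()
            ret = shrinkers[i]()              # retry after the clear
            if ret is SHRINK_EMPTY:
                ret = 0
            else:
                bits.add(i)                   # memcg_set_shrinker_bit()
        freed += ret
    return freed
```

Without the second call, an object queued between the empty report and the clear_bit() would leave the bit clear and the object unreclaimable until something else set the bit again.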
+13 -17
mm/workingset.c
··· 366 366 unsigned long nodes; 367 367 unsigned long cache; 368 368 369 - /* list_lru lock nests inside the IRQ-safe i_pages lock */ 370 - local_irq_disable(); 371 369 nodes = list_lru_shrink_count(&shadow_nodes, sc); 372 - local_irq_enable(); 373 370 374 371 /* 375 372 * Approximate a reasonable limit for the radix tree nodes ··· 398 401 node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE); 399 402 } 400 403 max_nodes = cache >> (RADIX_TREE_MAP_SHIFT - 3); 404 + 405 + if (!nodes) 406 + return SHRINK_EMPTY; 401 407 402 408 if (nodes <= max_nodes) 403 409 return 0; ··· 434 434 435 435 /* Coming from the list, invert the lock order */ 436 436 if (!xa_trylock(&mapping->i_pages)) { 437 - spin_unlock(lru_lock); 437 + spin_unlock_irq(lru_lock); 438 438 ret = LRU_RETRY; 439 439 goto out; 440 440 } ··· 472 472 workingset_lookup_update(mapping)); 473 473 474 474 out_invalid: 475 - xa_unlock(&mapping->i_pages); 475 + xa_unlock_irq(&mapping->i_pages); 476 476 ret = LRU_REMOVED_RETRY; 477 477 out: 478 - local_irq_enable(); 479 478 cond_resched(); 480 - local_irq_disable(); 481 - spin_lock(lru_lock); 479 + spin_lock_irq(lru_lock); 482 480 return ret; 483 481 } 484 482 485 483 static unsigned long scan_shadow_nodes(struct shrinker *shrinker, 486 484 struct shrink_control *sc) 487 485 { 488 - unsigned long ret; 489 - 490 486 /* list_lru lock nests inside the IRQ-safe i_pages lock */ 491 - local_irq_disable(); 492 - ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL); 493 - local_irq_enable(); 494 - return ret; 487 + return list_lru_shrink_walk_irq(&shadow_nodes, sc, shadow_lru_isolate, 488 + NULL); 495 489 } 496 490 497 491 static struct shrinker workingset_shadow_shrinker = { ··· 522 528 pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n", 523 529 timestamp_bits, max_order, bucket_order); 524 530 525 - ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key); 531 + ret = prealloc_shrinker(&workingset_shadow_shrinker); 526 532 if 
(ret) 527 533 goto err; 528 - ret = register_shrinker(&workingset_shadow_shrinker); 534 + ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key, 535 + &workingset_shadow_shrinker); 529 536 if (ret) 530 537 goto err_list_lru; 538 + register_shrinker_prepared(&workingset_shadow_shrinker); 531 539 return 0; 532 540 err_list_lru: 533 - list_lru_destroy(&shadow_nodes); 541 + free_prealloced_shrinker(&workingset_shadow_shrinker); 534 542 err: 535 543 return ret; 536 544 }
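The shadow_lru_isolate() path above takes the i_pages lock while already holding the list_lru lock — the inverse of the normal order — which is why it can only trylock the inner lock and must drop the lru lock and report LRU_RETRY on failure. A minimal Python sketch of that trylock-or-back-off shape, with threading.Lock standing in for the spinlocks and illustrative names throughout:

```python
# Model of the lock-order inversion handling in shadow_lru_isolate().

import threading

LRU_RETRY = "retry"                 # caller re-takes lru_lock and retries
LRU_REMOVED_RETRY = "removed-retry"

def isolate(lru_lock, i_pages_lock, do_isolate):
    """Called with lru_lock held; returns an LRU_* verdict."""
    if not i_pages_lock.acquire(blocking=False):   # xa_trylock()
        lru_lock.release()          # drop the outer lock to avoid deadlock
        return LRU_RETRY
    try:
        do_isolate()                # safe: both locks held
    finally:
        i_pages_lock.release()
    return LRU_REMOVED_RETRY
```

The actual patch additionally moves the irq-disable into list_lru_shrink_walk_irq() so the lru lock is taken and dropped with spin_lock_irq()/spin_unlock_irq() instead of a bare local_irq_disable() around the whole walk.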
+18 -18
mm/zsmalloc.c
··· 924 924 page->freelist = NULL; 925 925 } 926 926 927 - /* 928 - * To prevent zspage destroy during migration, zspage freeing should 929 - * hold locks of all pages in the zspage. 930 - */ 931 - void lock_zspage(struct zspage *zspage) 932 - { 933 - struct page *page = get_first_page(zspage); 934 - 935 - do { 936 - lock_page(page); 937 - } while ((page = get_next_page(page)) != NULL); 938 - } 939 - 940 - int trylock_zspage(struct zspage *zspage) 927 + static int trylock_zspage(struct zspage *zspage) 941 928 { 942 929 struct page *cursor, *fail; 943 930 ··· 1801 1814 } 1802 1815 1803 1816 #ifdef CONFIG_COMPACTION 1817 + /* 1818 + * To prevent zspage destroy during migration, zspage freeing should 1819 + * hold locks of all pages in the zspage. 1820 + */ 1821 + static void lock_zspage(struct zspage *zspage) 1822 + { 1823 + struct page *page = get_first_page(zspage); 1824 + 1825 + do { 1826 + lock_page(page); 1827 + } while ((page = get_next_page(page)) != NULL); 1828 + } 1829 + 1804 1830 static struct dentry *zs_mount(struct file_system_type *fs_type, 1805 1831 int flags, const char *dev_name, void *data) 1806 1832 { ··· 1905 1905 __SetPageMovable(newpage, page_mapping(oldpage)); 1906 1906 } 1907 1907 1908 - bool zs_page_isolate(struct page *page, isolate_mode_t mode) 1908 + static bool zs_page_isolate(struct page *page, isolate_mode_t mode) 1909 1909 { 1910 1910 struct zs_pool *pool; 1911 1911 struct size_class *class; ··· 1960 1960 return true; 1961 1961 } 1962 1962 1963 - int zs_page_migrate(struct address_space *mapping, struct page *newpage, 1963 + static int zs_page_migrate(struct address_space *mapping, struct page *newpage, 1964 1964 struct page *page, enum migrate_mode mode) 1965 1965 { 1966 1966 struct zs_pool *pool; ··· 2076 2076 return ret; 2077 2077 } 2078 2078 2079 - void zs_page_putback(struct page *page) 2079 + static void zs_page_putback(struct page *page) 2080 2080 { 2081 2081 struct zs_pool *pool; 2082 2082 struct size_class *class; ··· 2108 2108 
spin_unlock(&class->lock); 2109 2109 } 2110 2110 2111 - const struct address_space_operations zsmalloc_aops = { 2111 + static const struct address_space_operations zsmalloc_aops = { 2112 2112 .isolate_page = zs_page_isolate, 2113 2113 .migratepage = zs_page_migrate, 2114 2114 .putback_page = zs_page_putback,
+6 -5
scripts/spdxcheck.py
··· 4 4 5 5 from argparse import ArgumentParser 6 6 from ply import lex, yacc 7 + import locale 7 8 import traceback 8 9 import sys 9 10 import git ··· 33 32 34 33 # The subdirectories of LICENSES in the kernel source 35 34 license_dirs = [ "preferred", "other", "exceptions" ] 36 - lictree = repo.heads.master.commit.tree['LICENSES'] 35 + lictree = repo.head.commit.tree['LICENSES'] 37 36 38 37 spdx = SPDXdata() 39 38 ··· 103 102 raise ParserException(tok, 'Invalid License ID') 104 103 self.lastid = id 105 104 elif tok.type == 'EXC': 106 - if not self.spdx.exceptions.has_key(id): 105 + if id not in self.spdx.exceptions: 107 106 raise ParserException(tok, 'Invalid Exception ID') 108 107 if self.lastid not in self.spdx.exceptions[id]: 109 108 raise ParserException(tok, 'Exception not valid for license %s' %self.lastid) ··· 168 167 self.curline = 0 169 168 try: 170 169 for line in fd: 170 + line = line.decode(locale.getpreferredencoding(False), errors='ignore') 171 171 self.curline += 1 172 172 if self.curline > maxlines: 173 173 break ··· 201 199 continue 202 200 if el.path.find("license-rules.rst") >= 0: 203 201 continue 204 - if el.path == 'scripts/checkpatch.pl': 205 - continue 206 202 if not os.path.isfile(el.path): 207 203 continue 208 - parser.parse_lines(open(el.path), args.maxlines, el.path) 204 + with open(el.path, 'rb') as fd: 205 + parser.parse_lines(fd, args.maxlines, el.path) 209 206 210 207 def scan_git_subtree(tree, path): 211 208 for p in path.strip('/').split('/'):
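The spdxcheck.py changes above make the scanner Python 3-clean: files are now opened in binary mode and each line is decoded with the locale's preferred encoding, ignoring undecodable bytes, so a stray byte in a scanned file no longer aborts the run. The same idea in a tiny self-contained helper — `decode_lines` is an illustrative name, not part of the script:

```python
# Decode raw file bytes line by line, tolerating undecodable bytes, the way
# the patched parse_lines() does for each line read from a binary fd.

import locale

def decode_lines(raw: bytes):
    enc = locale.getpreferredencoding(False)
    return [line.decode(enc, errors='ignore')
            for line in raw.splitlines(keepends=True)]
```

With errors='ignore' a malformed byte simply disappears from the decoded line instead of raising UnicodeDecodeError, which matches the patch's goal of scanning every file in the tree without bailing out.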
+104 -14
tools/vm/page-types.c
··· 75 75 76 76 #define KPF_BYTES 8 77 77 #define PROC_KPAGEFLAGS "/proc/kpageflags" 78 + #define PROC_KPAGECOUNT "/proc/kpagecount" 78 79 #define PROC_KPAGECGROUP "/proc/kpagecgroup" 80 + 81 + #define SYS_KERNEL_MM_PAGE_IDLE "/sys/kernel/mm/page_idle/bitmap" 79 82 80 83 /* [32-] kernel hacking assistances */ 81 84 #define KPF_RESERVED 32 ··· 171 168 172 169 static int opt_raw; /* for kernel developers */ 173 170 static int opt_list; /* list pages (in ranges) */ 171 + static int opt_mark_idle; /* set accessed bit */ 174 172 static int opt_no_summary; /* don't show summary */ 175 173 static pid_t opt_pid; /* process to walk */ 176 174 const char *opt_file; /* file or directory path */ 177 175 static uint64_t opt_cgroup; /* cgroup inode */ 178 176 static int opt_list_cgroup;/* list page cgroup */ 177 + static int opt_list_mapcnt;/* list page map count */ 179 178 static const char *opt_kpageflags;/* kpageflags file to parse */ 180 179 181 180 #define MAX_ADDR_RANGES 1024 ··· 199 194 200 195 static int pagemap_fd; 201 196 static int kpageflags_fd; 197 + static int kpagecount_fd = -1; 202 198 static int kpagecgroup_fd = -1; 199 + static int page_idle_fd = -1; 203 200 204 201 static int opt_hwpoison; 205 202 static int opt_unpoison; ··· 305 298 return do_u64_read(kpagecgroup_fd, opt_kpageflags, buf, index, pages); 306 299 } 307 300 301 + static unsigned long kpagecount_read(uint64_t *buf, 302 + unsigned long index, 303 + unsigned long pages) 304 + { 305 + return kpagecount_fd < 0 ? 
pages : 306 + do_u64_read(kpagecount_fd, PROC_KPAGECOUNT, 307 + buf, index, pages); 308 + } 309 + 308 310 static unsigned long pagemap_read(uint64_t *buf, 309 311 unsigned long index, 310 312 unsigned long pages) ··· 386 370 */ 387 371 388 372 static void show_page_range(unsigned long voffset, unsigned long offset, 389 - unsigned long size, uint64_t flags, uint64_t cgroup) 373 + unsigned long size, uint64_t flags, 374 + uint64_t cgroup, uint64_t mapcnt) 390 375 { 391 376 static uint64_t flags0; 392 377 static uint64_t cgroup0; 378 + static uint64_t mapcnt0; 393 379 static unsigned long voff; 394 380 static unsigned long index; 395 381 static unsigned long count; 396 382 397 - if (flags == flags0 && cgroup == cgroup0 && offset == index + count && 398 - size && voffset == voff + count) { 383 + if (flags == flags0 && cgroup == cgroup0 && mapcnt == mapcnt0 && 384 + offset == index + count && size && voffset == voff + count) { 399 385 count += size; 400 386 return; 401 387 } ··· 409 391 printf("%lu\t", voff); 410 392 if (opt_list_cgroup) 411 393 printf("@%llu\t", (unsigned long long)cgroup0); 394 + if (opt_list_mapcnt) 395 + printf("%lu\t", mapcnt0); 412 396 printf("%lx\t%lx\t%s\n", 413 397 index, count, page_flag_name(flags0)); 414 398 } 415 399 416 400 flags0 = flags; 417 - cgroup0= cgroup; 401 + cgroup0 = cgroup; 402 + mapcnt0 = mapcnt; 418 403 index = offset; 419 404 voff = voffset; 420 405 count = size; ··· 425 404 426 405 static void flush_page_range(void) 427 406 { 428 - show_page_range(0, 0, 0, 0, 0); 407 + show_page_range(0, 0, 0, 0, 0, 0); 429 408 } 430 409 431 410 static void show_page(unsigned long voffset, unsigned long offset, 432 - uint64_t flags, uint64_t cgroup) 411 + uint64_t flags, uint64_t cgroup, uint64_t mapcnt) 433 412 { 434 413 if (opt_pid) 435 414 printf("%lx\t", voffset); ··· 437 416 printf("%lu\t", voffset); 438 417 if (opt_list_cgroup) 439 418 printf("@%llu\t", (unsigned long long)cgroup); 419 + if (opt_list_mapcnt) 420 + printf("%lu\t", 
mapcnt); 421 + 440 422 printf("%lx\t%s\n", offset, page_flag_name(flags)); 441 423 } 442 424 ··· 591 567 return 0; 592 568 } 593 569 570 + static int mark_page_idle(unsigned long offset) 571 + { 572 + static unsigned long off; 573 + static uint64_t buf; 574 + int len; 575 + 576 + if ((offset / 64 == off / 64) || buf == 0) { 577 + buf |= 1UL << (offset % 64); 578 + off = offset; 579 + return 0; 580 + } 581 + 582 + len = pwrite(page_idle_fd, &buf, 8, 8 * (off / 64)); 583 + if (len < 0) { 584 + perror("mark page idle"); 585 + return len; 586 + } 587 + 588 + buf = 1UL << (offset % 64); 589 + off = offset; 590 + 591 + return 0; 592 + } 593 + 594 594 /* 595 595 * page frame walker 596 596 */ ··· 647 599 } 648 600 649 601 static void add_page(unsigned long voffset, unsigned long offset, 650 - uint64_t flags, uint64_t cgroup, uint64_t pme) 602 + uint64_t flags, uint64_t cgroup, uint64_t mapcnt, 603 + uint64_t pme) 651 604 { 652 605 flags = kpageflags_flags(flags, pme); 653 606 ··· 663 614 if (opt_unpoison) 664 615 unpoison_page(offset); 665 616 617 + if (opt_mark_idle) 618 + mark_page_idle(offset); 619 + 666 620 if (opt_list == 1) 667 - show_page_range(voffset, offset, 1, flags, cgroup); 621 + show_page_range(voffset, offset, 1, flags, cgroup, mapcnt); 668 622 else if (opt_list == 2) 669 - show_page(voffset, offset, flags, cgroup); 623 + show_page(voffset, offset, flags, cgroup, mapcnt); 670 624 671 625 nr_pages[hash_slot(flags)]++; 672 626 total_pages++; ··· 683 631 { 684 632 uint64_t buf[KPAGEFLAGS_BATCH]; 685 633 uint64_t cgi[KPAGEFLAGS_BATCH]; 634 + uint64_t cnt[KPAGEFLAGS_BATCH]; 686 635 unsigned long batch; 687 636 unsigned long pages; 688 637 unsigned long i; ··· 707 654 if (kpagecgroup_read(cgi, index, pages) != pages) 708 655 fatal("kpagecgroup returned fewer pages than expected"); 709 656 657 + if (kpagecount_read(cnt, index, batch) != pages) 658 + fatal("kpagecount returned fewer pages than expected"); 659 + 710 660 for (i = 0; i < pages; i++) 711 - 
add_page(voffset + i, index + i, buf[i], cgi[i], pme); 661 + add_page(voffset + i, index + i, 662 + buf[i], cgi[i], cnt[i], pme); 712 663 713 664 index += pages; 714 665 count -= pages; ··· 730 673 return; 731 674 732 675 if (opt_list == 1) 733 - show_page_range(voffset, pagemap_swap_offset(pme), 1, flags, 0); 676 + show_page_range(voffset, pagemap_swap_offset(pme), 677 + 1, flags, 0, 0); 734 678 else if (opt_list == 2) 735 - show_page(voffset, pagemap_swap_offset(pme), flags, 0); 679 + show_page(voffset, pagemap_swap_offset(pme), flags, 0, 0); 736 680 737 681 nr_pages[hash_slot(flags)]++; 738 682 total_pages++; ··· 814 756 else 815 757 walk_task(opt_offset[i], opt_size[i]); 816 758 759 + if (opt_mark_idle) 760 + mark_page_idle(0); 761 + 817 762 close(kpageflags_fd); 818 763 } 819 764 ··· 847 786 " -c|--cgroup path|@inode Walk pages within memory cgroup\n" 848 787 " -p|--pid pid Walk process address space\n" 849 788 " -f|--file filename Walk file address space\n" 789 + " -i|--mark-idle Mark pages idle\n" 850 790 " -l|--list Show page details in ranges\n" 851 791 " -L|--list-each Show page details one by one\n" 852 792 " -C|--list-cgroup Show cgroup inode for pages\n" 793 + " -M|--list-mapcnt Show page map count\n" 853 794 " -N|--no-summary Don't show summary info\n" 854 795 " -X|--hwpoison hwpoison pages\n" 855 796 " -x|--unpoison unpoison pages\n" ··· 988 925 uint8_t vec[PAGEMAP_BATCH]; 989 926 uint64_t buf[PAGEMAP_BATCH], flags; 990 927 uint64_t cgroup = 0; 928 + uint64_t mapcnt = 0; 991 929 unsigned long nr_pages, pfn, i; 992 930 off_t off, end = st->st_size; 993 931 int fd; ··· 1048 984 continue; 1049 985 if (!kpagecgroup_read(&cgroup, pfn, 1)) 1050 986 fatal("kpagecgroup_read failed"); 987 + if (!kpagecount_read(&mapcnt, pfn, 1)) 988 + fatal("kpagecount_read failed"); 1051 989 if (first && opt_list) { 1052 990 first = 0; 1053 991 flush_page_range(); 1054 992 show_file(name, st); 1055 993 } 1056 994 add_page(off / page_size + i, pfn, 1057 - flags, cgroup, 
buf[i]); 995 + flags, cgroup, mapcnt, buf[i]); 1058 996 } 1059 997 } 1060 998 ··· 1256 1190 { "bits" , 1, NULL, 'b' }, 1257 1191 { "cgroup" , 1, NULL, 'c' }, 1258 1192 { "describe" , 1, NULL, 'd' }, 1193 + { "mark-idle" , 0, NULL, 'i' }, 1259 1194 { "list" , 0, NULL, 'l' }, 1260 1195 { "list-each" , 0, NULL, 'L' }, 1261 1196 { "list-cgroup", 0, NULL, 'C' }, 1197 + { "list-mapcnt", 0, NULL, 'M' }, 1262 1198 { "no-summary", 0, NULL, 'N' }, 1263 1199 { "hwpoison" , 0, NULL, 'X' }, 1264 1200 { "unpoison" , 0, NULL, 'x' }, ··· 1276 1208 page_size = getpagesize(); 1277 1209 1278 1210 while ((c = getopt_long(argc, argv, 1279 - "rp:f:a:b:d:c:ClLNXxF:h", opts, NULL)) != -1) { 1211 + "rp:f:a:b:d:c:CilLMNXxF:h", 1212 + opts, NULL)) != -1) { 1280 1213 switch (c) { 1281 1214 case 'r': 1282 1215 opt_raw = 1; ··· 1303 1234 case 'd': 1304 1235 describe_flags(optarg); 1305 1236 exit(0); 1237 + case 'i': 1238 + opt_mark_idle = 1; 1239 + break; 1306 1240 case 'l': 1307 1241 opt_list = 1; 1308 1242 break; 1309 1243 case 'L': 1310 1244 opt_list = 2; 1245 + break; 1246 + case 'M': 1247 + opt_list_mapcnt = 1; 1311 1248 break; 1312 1249 case 'N': 1313 1250 opt_no_summary = 1; ··· 1344 1269 if (opt_cgroup || opt_list_cgroup) 1345 1270 kpagecgroup_fd = checked_open(PROC_KPAGECGROUP, O_RDONLY); 1346 1271 1272 + if (opt_list && opt_list_mapcnt) 1273 + kpagecount_fd = checked_open(PROC_KPAGECOUNT, O_RDONLY); 1274 + 1275 + if (opt_mark_idle && opt_file) 1276 + page_idle_fd = checked_open(SYS_KERNEL_MM_PAGE_IDLE, O_RDWR); 1277 + 1347 1278 if (opt_list && opt_pid) 1348 1279 printf("voffset\t"); 1349 1280 if (opt_list && opt_file) 1350 1281 printf("foffset\t"); 1351 1282 if (opt_list && opt_list_cgroup) 1352 1283 printf("cgroup\t"); 1284 + if (opt_list && opt_list_mapcnt) 1285 + printf("map-cnt\t"); 1286 + 1353 1287 if (opt_list == 1) 1354 1288 printf("offset\tlen\tflags\n"); 1355 1289 if (opt_list == 2) ··· 1379 1295 printf("\n\n"); 1380 1296 1381 1297 show_summary(); 1298 + 1299 + if 
(opt_list_mapcnt) 1300 + close(kpagecount_fd); 1301 + 1302 + if (page_idle_fd >= 0) 1303 + close(page_idle_fd); 1382 1304 1383 1305 return 0; 1384 1306 }
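mark_page_idle() above batches bits because /sys/kernel/mm/page_idle/bitmap is accessed in 64-bit words: bits for PFNs in the same word are OR-ed into a buffer and written with a single pwrite() only when the walk crosses into a new word, and the trailing mark_page_idle(0) call in walk_pfn() pushes out the last batch. A Python model of that batching, with a list recording the (byte offset, word) writes in place of pwrite() — the class and its names are illustrative, not part of the tool:

```python
# Model of the write batching in page-types' mark_page_idle().

class IdleMarker:
    def __init__(self):
        self.buf = 0          # bits batched for the current 64-bit word
        self.off = 0          # last PFN marked
        self.writes = []      # (byte_offset, word) pairs, i.e. pwrite calls

    def mark(self, pfn):
        if pfn // 64 == self.off // 64 or self.buf == 0:
            self.buf |= 1 << (pfn % 64)   # same word: just batch the bit
            self.off = pfn
            return
        # crossed into a new word: flush the batched one in a single write
        self.writes.append((8 * (self.off // 64), self.buf))
        self.buf = 1 << (pfn % 64)
        self.off = pfn

    def flush(self):
        # explicit final flush; the tool itself reuses mark_page_idle(0)
        if self.buf:
            self.writes.append((8 * (self.off // 64), self.buf))
            self.buf = 0
```

Marking PFNs 3, 5 and 64 this way produces exactly two writes: word 0 with bits 3 and 5 set, then word 1 with bit 0 set, rather than one write per page.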