Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patches from Andrew)

Merge patch-bomb from Andrew Morton:

- inotify tweaks

- some ocfs2 updates (many more are awaiting review)

- various misc bits

- kernel/watchdog.c updates

- Some of mm. I have a huge number of MM patches this time and quite a
lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
selftests: vm: add tests for lock on fault
mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
mm: introduce VM_LOCKONFAULT
mm: mlock: add new mlock system call
mm: mlock: refactor mlock, munlock, and munlockall code
kasan: always taint kernel on report
mm, slub, kasan: enable user tracking by default with KASAN=y
kasan: use IS_ALIGNED in memory_is_poisoned_8()
kasan: Fix a type conversion error
lib: test_kasan: add some testcases
kasan: update reference to kasan prototype repo
kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
kasan: various fixes in documentation
kasan: update log messages
kasan: accurately determine the type of the bad access
kasan: update reported bug types for kernel memory accesses
kasan: update reported bug types for not user nor kernel memory accesses
mm/kasan: prevent deadlock in kasan reporting
mm/kasan: don't use kasan shadow pointer in generic functions
mm/kasan: MODULE_VADDR is not available on all archs
...

+3104 -1362
+14 -8
Documentation/filesystems/proc.txt
@@ -175,6 +175,7 @@
 VmLib:      1412 kB
 VmPTE:        20 kb
 VmSwap:        0 kB
+HugetlbPages:  0 kB
 Threads:        1
 SigQ:   0/28578
 SigPnd: 0000000000000000
@@ -239,6 +238,7 @@
 VmPTE         size of page table entries
 VmPMD         size of second level page tables
 VmSwap        size of swap usage (the number of referred swapents)
+HugetlbPages  size of hugetlb memory portions
 Threads       number of threads
 SigQ          number of signals queued/max. number for queue
 SigPnd        bitmap of pending signals for the thread
@@ -426,12 +424,15 @@
 Private_Dirty:         0 kB
 Referenced:          892 kB
 Anonymous:             0 kB
+AnonHugePages:         0 kB
+Shared_Hugetlb:        0 kB
+Private_Hugetlb:       0 kB
 Swap:                  0 kB
 SwapPss:               0 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
-Locked:              374 kB
-VmFlags: rd ex mr mw me de
+Locked:                0 kB
+VmFlags: rd ex mr mw me dw
 
 the first of these lines shows the same information as is displayed for the
 mapping in /proc/PID/maps. The remaining lines show the size of the mapping
@@ -454,9 +449,14 @@
 "Anonymous" shows the amount of memory that does not belong to any file. Even
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
+hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
+reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
 "SwapPss" shows proportional swap share of this mapping.
+"Locked" indicates whether the mapping is locked in memory or not.
+
 "VmFlags" field deserves a separate description. This member represents the kernel
 flags associated with the particular virtual memory area in two letter encoded
 manner. The codes are the following:
@@ -485,7 +475,6 @@
 ac  - area is accountable
 nr  - swap space is not reserved for the area
 ht  - area uses huge tlb pages
-nl  - non-linear mapping
 ar  - architecture specific flag
 dd  - do not include area into core dump
 sd  - soft-dirty flag
@@ -823,9 +814,6 @@
 16GB PIII, which has highmem enabled. You may not have all of these fields.
 
 > cat /proc/meminfo
-
-The "Locked" indicates whether the mapping is locked in memory or not.
-
 
 MemTotal:       16344972 kB
 MemFree:        13634064 kB
+23 -23
Documentation/kasan.txt
@@ -1,36 +1,34 @@
-Kernel address sanitizer
-================
+KernelAddressSanitizer (KASAN)
+==============================
 
 0. Overview
 ===========
 
-Kernel Address sanitizer (KASan) is a dynamic memory error detector. It provides
+KernelAddressSANitizer (KASAN) is a dynamic memory error detector. It provides
 a fast and comprehensive solution for finding use-after-free and out-of-bounds
 bugs.
 
-KASan uses compile-time instrumentation for checking every memory access,
-therefore you will need a gcc version of 4.9.2 or later. KASan could detect out
-of bounds accesses to stack or global variables, but only if gcc 5.0 or later was
-used to built the kernel.
+KASAN uses compile-time instrumentation for checking every memory access,
+therefore you will need a GCC version 4.9.2 or later. GCC 5.0 or later is
+required for detection of out-of-bounds accesses to stack or global variables.
 
-Currently KASan is supported only for x86_64 architecture and requires that the
-kernel be built with the SLUB allocator.
+Currently KASAN is supported only for x86_64 architecture and requires the
+kernel to be built with the SLUB allocator.
 
 1. Usage
-=========
+========
 
 To enable KASAN configure kernel with:
 
 	  CONFIG_KASAN = y
 
-and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline/inline
-is compiler instrumentation types. The former produces smaller binary the
-latter is 1.1 - 2 times faster. Inline instrumentation requires a gcc version
-of 5.0 or later.
+and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline and
+inline are compiler instrumentation types. The former produces smaller binary
+the latter is 1.1 - 2 times faster. Inline instrumentation requires a GCC
+version 5.0 or later.
 
 Currently KASAN works only with the SLUB memory allocator.
-For better bug detection and nicer report, enable CONFIG_STACKTRACE and put
-at least 'slub_debug=U' in the boot cmdline.
+For better bug detection and nicer reporting, enable CONFIG_STACKTRACE.
 
 To disable instrumentation for specific files or directories, add a line
 similar to the following to the respective kernel Makefile:
@@ -40,7 +42,7 @@
 		KASAN_SANITIZE := n
 
 1.1 Error reports
-==========
+=================
 
 A typical out of bounds access report looks like this:
 
@@ -117,14 +119,16 @@
 ffff8800693bc800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ==================================================================
 
-First sections describe slub object where bad access happened.
-See 'SLUB Debug output' section in Documentation/vm/slub.txt for details.
+The header of the report discribe what kind of bug happened and what kind of
+access caused it. It's followed by the description of the accessed slub object
+(see 'SLUB Debug output' section in Documentation/vm/slub.txt for details) and
+the description of the accessed memory page.
 
 In the last section the report shows memory state around the accessed address.
-Reading this part requires some more understanding of how KASAN works.
+Reading this part requires some understanding of how KASAN works.
 
-Each 8 bytes of memory are encoded in one shadow byte as accessible,
-partially accessible, freed or they can be part of a redzone.
+The state of each 8 aligned bytes of memory is encoded in one shadow byte.
+Those 8 bytes can be accessible, partially accessible, freed or be a redzone.
 We use the following encoding for each shadow byte: 0 means that all 8 bytes
 of the corresponding memory region are accessible; number N (1 <= N <= 7) means
 that the first N bytes are accessible, and other (8 - N) bytes are not;
@@ -139,7 +139,7 @@
 
 
 2. Implementation details
-========================
+=========================
 
 From a high level, our approach to memory error detection is similar to that
 of kmemcheck: use shadow memory to record whether each byte of memory is safe
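The shadow-byte encoding the kasan.txt changes describe can be modelled in a few lines. This is an illustrative user-space sketch, not kernel code; the shadow offset is arch specific and is treated here as a parameter:

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3	/* one shadow byte covers 8 bytes */

/* Model of the encoding described above: 0 means all 8 bytes of the
 * region are accessible, N (1 <= N <= 7) means only the first N bytes
 * are. Negative values (e.g. 0xfb for freed memory) mark whole
 * inaccessible regions and are not modelled here. */
static int shadow_byte(unsigned accessible_bytes)
{
	assert(accessible_bytes <= 8);
	return accessible_bytes == 8 ? 0 : (int)accessible_bytes;
}

/* Address-to-shadow translation used by the run-time checks:
 * one shadow byte per 8-byte granule, offset by an arch constant. */
static uintptr_t mem_to_shadow(uintptr_t addr, uintptr_t shadow_offset)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + shadow_offset;
}
```

This is why a `kmalloc(5)` object shows up in reports as a shadow byte of 05 followed by redzone bytes: only the first five bytes of its granule are accessible.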
+5
Documentation/kernel-parameters.txt
@@ -1275,6 +1275,11 @@
 			Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
 			Default: 1024
 
+	hardlockup_all_cpu_backtrace=
+			[KNL] Should the hard-lockup detector generate
+			backtraces on all cpus.
+			Format: <integer>
+
 	hashdist=	[KNL,NUMA] Large hashes allocated during boot
 			are distributed across NUMA nodes. Defaults on
 			for 64-bit NUMA, off otherwise.
+3 -2
Documentation/lockup-watchdogs.txt
@@ -20,8 +20,9 @@
 details), without letting other interrupts have a chance to run.
 Similarly to the softlockup case, the current stack trace is displayed
 upon detection and the system will stay locked up unless the default
-behavior is changed, which can be done through a compile time knob,
-"BOOTPARAM_HARDLOCKUP_PANIC", and a kernel parameter, "nmi_watchdog"
+behavior is changed, which can be done through a sysctl,
+'hardlockup_panic', a compile time knob, "BOOTPARAM_HARDLOCKUP_PANIC",
+and a kernel parameter, "nmi_watchdog"
 (see "Documentation/kernel-parameters.txt" for details).
 
 The panic option can be used in combination with panic_timeout (this
+12
Documentation/sysctl/kernel.txt
@@ -33,6 +33,7 @@
 - domainname
 - hostname
 - hotplug
+- hardlockup_all_cpu_backtrace
 - hung_task_panic
 - hung_task_check_count
 - hung_task_timeout_secs
@@ -293,6 +292,17 @@
 domain names are in general different. For a detailed discussion
 see the hostname(1) man page.
 
+==============================================================
+hardlockup_all_cpu_backtrace:
+
+This value controls the hard lockup detector behavior when a hard
+lockup condition is detected as to whether or not to gather further
+debug information. If enabled, arch-specific all-CPU stack dumping
+will be initiated.
+
+0: do nothing. This is the default behavior.
+
+1: on detection capture more debug information.
 ==============================================================
 
 hotplug:
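The new sysctl can be read without privileges through procfs (changing it requires root, e.g. via sysctl -w or a write to the same file). A small sketch; the helper name is made up for illustration, and the file only exists on kernels with the hard-lockup detector configured:

```c
#include <assert.h>
#include <stdio.h>

/* Read kernel.hardlockup_all_cpu_backtrace via procfs. Returns the
 * current value (0 or 1), or -1 if the file does not exist (kernel
 * built without the hard-lockup detector, or predating this patch). */
static int read_hardlockup_all_cpu_backtrace(void)
{
	FILE *f = fopen("/proc/sys/kernel/hardlockup_all_cpu_backtrace", "r");
	int val;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```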
+12 -15
Documentation/vm/page_migration
@@ -92,28 +92,25 @@
 
 2. Insure that writeback is complete.
 
-3. Prep the new page that we want to move to. It is locked
-   and set to not being uptodate so that all accesses to the new
-   page immediately lock while the move is in progress.
+3. Lock the new page that we want to move to. It is locked so that accesses to
+   this (not yet uptodate) page immediately lock while the move is in progress.
 
-4. The new page is prepped with some settings from the old page so that
-   accesses to the new page will discover a page with the correct settings.
+4. All the page table references to the page are converted to migration
+   entries. This decreases the mapcount of a page. If the resulting
+   mapcount is not zero then we do not migrate the page. All user space
+   processes that attempt to access the page will now wait on the page lock.
 
-5. All the page table references to the page are converted
-   to migration entries or dropped (nonlinear vmas).
-   This decrease the mapcount of a page. If the resulting
-   mapcount is not zero then we do not migrate the page.
-   All user space processes that attempt to access the page
-   will now wait on the page lock.
-
-6. The radix tree lock is taken. This will cause all processes trying
+5. The radix tree lock is taken. This will cause all processes trying
    to access the page via the mapping to block on the radix tree spinlock.
 
-7. The refcount of the page is examined and we back out if references remain
+6. The refcount of the page is examined and we back out if references remain
    otherwise we know that we are the only one referencing this page.
 
-8. The radix tree is checked and if it does not contain the pointer to this
+7. The radix tree is checked and if it does not contain the pointer to this
    page then we back out because someone else modified the radix tree.
+
+8. The new page is prepped with some settings from the old page so that
+   accesses to the new page will discover a page with the correct settings.
 
 9. The radix tree is changed to point to the new page.
 
+10
Documentation/vm/transhuge.txt
@@ -170,6 +170,16 @@
 max_ptes_none can waste cpu time very little, you can
 ignore it.
 
+max_ptes_swap specifies how many pages can be brought in from
+swap when collapsing a group of pages into a transparent huge page.
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap
+
+A higher value can cause excessive swap IO and waste
+memory. A lower value can prevent THPs from being
+collapsed, resulting fewer pages being collapsed into
+THPs, and lower memory access performance.
+
 == Boot parameter ==
 
 You can change the sysfs boot time defaults of Transparent Hugepage
+17 -99
Documentation/vm/unevictable-lru.txt
@@ -531,83 +531,20 @@
 
 try_to_unmap() is always called, by either vmscan for reclaim or for page
 migration, with the argument page locked and isolated from the LRU. Separate
-functions handle anonymous and mapped file pages, as these types of pages have
-different reverse map mechanisms.
+functions handle anonymous and mapped file and KSM pages, as these types of
+pages have different reverse map lookup mechanisms, with different locking.
+In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
+it will call try_to_unmap_one() for every VMA which might contain the page.
 
- (*) try_to_unmap_anon()
+When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
+VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
+and return SWAP_MLOCK to indicate that the page is unevictable: and the scan
+stops there.
 
-     To unmap anonymous pages, each VMA in the list anchored in the anon_vma
-     must be visited - at least until a VM_LOCKED VMA is encountered. If the
-     page is being unmapped for migration, VM_LOCKED VMAs do not stop the
-     process because mlocked pages are migratable. However, for reclaim, if
-     the page is mapped into a VM_LOCKED VMA, the scan stops.
-
-     try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
-     the mm_struct to which the VMA belongs. If this is successful, it will
-     mlock the page via mlock_vma_page() - we wouldn't have gotten to
-     try_to_unmap_anon() if the page were already mlocked - and will return
-     SWAP_MLOCK, indicating that the page is unevictable.
-
-     If the mmap semaphore cannot be acquired, we are not sure whether the page
-     is really unevictable or not. In this case, try_to_unmap_anon() will
-     return SWAP_AGAIN.
-
- (*) try_to_unmap_file() - linear mappings
-
-     Unmapping of a mapped file page works the same as for anonymous mappings,
-     except that the scan visits all VMAs that map the page's index/page offset
-     in the page's mapping's reverse map priority search tree. It also visits
-     each VMA in the page's mapping's non-linear list, if the list is
-     non-empty.
-
-     As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
-     page, try_to_unmap_file() will attempt to acquire the associated
-     mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
-     is successful, and SWAP_AGAIN, if not.
-
- (*) try_to_unmap_file() - non-linear mappings
-
-     If a page's mapping contains a non-empty non-linear mapping VMA list, then
-     try_to_un{map|lock}() must also visit each VMA in that list to determine
-     whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit
-     all VMAs in the non-linear list to ensure that the pages is not/should not
-     be mlocked.
-
-     If a VM_LOCKED VMA is found in the list, the scan could terminate.
-     However, there is no easy way to determine whether the page is actually
-     mapped in a given VMA - either for unmapping or testing whether the
-     VM_LOCKED VMA actually pins the page.
-
-     try_to_unmap_file() handles non-linear mappings by scanning a certain
-     number of pages - a "cluster" - in each non-linear VMA associated with the
-     page's mapping, for each file mapped page that vmscan tries to unmap. If
-     this happens to unmap the page we're trying to unmap, try_to_unmap() will
-     notice this on return (page_mapcount(page) will be 0) and return
-     SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to
-     recirculate this page. We take advantage of the cluster scan in
-     try_to_unmap_cluster() as follows:
-
-     For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the
-     mmap semaphore of the associated mm_struct for read without blocking.
-
-     If this attempt is successful and the VMA is VM_LOCKED,
-     try_to_unmap_cluster() will retain the mmap semaphore for the scan;
-     otherwise it drops it here.
-
-     Then, for each page in the cluster, if we're holding the mmap semaphore
-     for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to
-     mlock the page. This call is a no-op if the page is already locked,
-     but will mlock any pages in the non-linear mapping that happen to be
-     unlocked.
-
-     If one of the pages so mlocked is the page passed in to try_to_unmap(),
-     try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default
-     SWAP_AGAIN. This will allow vmscan to cull the page, rather than
-     recirculating it on the inactive list.
-
-     Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it
-     returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED
-     VMA, but couldn't be mlocked.
+mlock_vma_page() is called while holding the page table's lock (in addition
+to the page lock, and the rmap lock): to serialize against concurrent mlock or
+munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
+holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
 try_to_munlock() REVERSE MAP SCAN
@@ -623,29 +560,15 @@
 introduced a variant of try_to_unmap() called try_to_munlock().
 
 try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file pages with an additional argument specifying unlock versus unmap
+mapped file and KSM pages with a flag argument specifying unlock versus unmap
 processing. Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
-pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
-attempt to acquire the associated mmap semaphore, mlock the page via
-mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
-pre-clearing of the page's PG_mlocked done by munlock_vma_page.
-
-If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
-semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
-recycle the page on the inactive list and hope that it has better luck with the
-page next time.
-
-For file pages mapped into non-linear VMAs, the try_to_munlock() logic works
-slightly differently. On encountering a VM_LOCKED non-linear VMA that might
-map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the
-page. munlock_vma_page() will just leave the page unlocked and let vmscan deal
-with it - the usual fallback position.
+for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case,
+the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This
+undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
 
 Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
-However, the scan can terminate when it encounters a VM_LOCKED VMA and can
-successfully acquire the VMA's mmap semaphore for read and mlock the page.
+However, the scan can terminate when it encounters a VM_LOCKED VMA.
 Although try_to_munlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
@@ -672,11 +595,6 @@
 
 (3) mlocked pages that could not be isolated from the LRU and moved to the
     unevictable list in mlock_vma_page().
-
-(4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
-    acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
-    munlock_vma_page() was forced to let the page back on to the normal LRU
-    list for vmscan to handle.
 
 shrink_inactive_list() also diverts any unevictable pages that it finds on the
 inactive lists to the appropriate zone's unevictable list.
+3
arch/alpha/include/uapi/asm/mman.h
@@ -37,6 +37,9 @@
 
 #define MCL_CURRENT 8192 /* lock all currently mapped pages */
 #define MCL_FUTURE 16384 /* lock all additions to address space */
+#define MCL_ONFAULT 32768 /* lock all pages that are faulted in */
+
+#define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
 
 #define MADV_NORMAL 0 /* no further special treatment */
 #define MADV_RANDOM 1 /* expect random page references */
+1 -1
arch/arm/mm/alignment.c
@@ -803,7 +803,7 @@
 			}
 		}
 	} else {
-		fault = probe_kernel_address(instrptr, instr);
+		fault = probe_kernel_address((void *)instrptr, instr);
 		instr = __mem_to_opcode_arm(instr);
 	}
 
+6
arch/mips/include/uapi/asm/mman.h
@@ -61,6 +61,12 @@
  */
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+
+/*
+ * Flags for mlock
+ */
+#define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
 
 #define MADV_NORMAL 0 /* no further special treatment */
 #define MADV_RANDOM 1 /* expect random page references */
+3
arch/parisc/include/uapi/asm/mman.h
@@ -31,6 +31,9 @@
 
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+
+#define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
 
 #define MADV_NORMAL 0 /* no further special treatment */
 #define MADV_RANDOM 1 /* expect random page references */
+1
arch/powerpc/include/uapi/asm/mman.h
@@ -22,6 +22,7 @@
 
 #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
 #define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */
 
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
+1 -1
arch/powerpc/mm/numa.c
@@ -80,7 +80,7 @@
 	setup_nr_node_ids();
 
 	/* allocate the map */
-	for (node = 0; node < nr_node_ids; node++)
+	for_each_node(node)
 		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
 
 	/* cpumask_of_node() will now work */
+1 -1
arch/powerpc/sysdev/fsl_pci.c
@@ -999,7 +999,7 @@
 		ret = get_user(regs->nip, &inst);
 		pagefault_enable();
 	} else {
-		ret = probe_kernel_address(regs->nip, inst);
+		ret = probe_kernel_address((void *)regs->nip, inst);
 	}
 
 	if (mcheck_handle_load(regs, inst)) {
+1
arch/sparc/include/uapi/asm/mman.h
@@ -17,6 +17,7 @@
 
 #define MCL_CURRENT 0x2000 /* lock all currently mapped pages */
 #define MCL_FUTURE 0x4000 /* lock all additions to address space */
+#define MCL_ONFAULT 0x8000 /* lock all pages that are faulted in */
 
 #define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */
 #define MAP_NONBLOCK 0x10000 /* do not block on IO */
+1
arch/tile/include/uapi/asm/mman.h
@@ -36,6 +36,7 @@
  */
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
 
 
 #endif /* _ASM_TILE_MMAN_H */
+2 -2
arch/x86/boot/Makefile
@@ -9,12 +9,12 @@
 # Changed by many, many contributors over the years.
 #
 
+KASAN_SANITIZE := n
+
 # If you want to preset the SVGA mode, uncomment the next line and
 # set SVGA_MODE to whatever number you want.
 # Set it to -DSVGA_MODE=NORMAL_VGA if you just want the EGA/VGA mode.
 # The number is the same as you would ordinarily press at bootup.
-
-KASAN_SANITIZE := n
 
 SVGA_MODE := -DSVGA_MODE=NORMAL_VGA
 
+1
arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	mlock2			sys_mlock2
+1
arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	mlock2			sys_mlock2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
+1 -1
arch/x86/mm/kasan_init_64.c
@@ -126,5 +126,5 @@
 	__flush_tlb_all();
 	init_task.kasan_depth = 0;
 
-	pr_info("Kernel address sanitizer initialized\n");
+	pr_info("KernelAddressSanitizer initialized\n");
 }
+6
arch/xtensa/include/uapi/asm/mman.h
@@ -74,6 +74,12 @@
  */
 #define MCL_CURRENT 1 /* lock all current mappings */
 #define MCL_FUTURE 2 /* lock all future mappings */
+#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
+
+/*
+ * Flags for mlock
+ */
+#define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
 
 #define MADV_NORMAL 0 /* no further special treatment */
 #define MADV_RANDOM 1 /* expect random page references */
+2 -1
fs/9p/vfs_file.c
@@ -231,7 +231,8 @@
 	if (res < 0 && fl->fl_type != F_UNLCK) {
 		fl_type = fl->fl_type;
 		fl->fl_type = F_UNLCK;
-		res = locks_lock_file_wait(filp, fl);
+		/* Even if this fails we want to return the remote error */
+		locks_lock_file_wait(filp, fl);
 		fl->fl_type = fl_type;
 	}
 out:
+6 -1
fs/fs-writeback.c
@@ -2149,7 +2149,12 @@
 		iput(old_inode);
 		old_inode = inode;
 
-		filemap_fdatawait(mapping);
+		/*
+		 * We keep the error status of individual mapping so that
+		 * applications can catch the writeback error using fsync(2).
+		 * See filemap_fdatawait_keep_errors() for details.
+		 */
+		filemap_fdatawait_keep_errors(mapping);
 
 		cond_resched();
 
+2 -2
fs/logfs/dev_bdev.c
@@ -81,7 +81,7 @@
 	unsigned int max_pages;
 	int i;
 
-	max_pages = min(nr_pages, BIO_MAX_PAGES);
+	max_pages = min_t(size_t, nr_pages, BIO_MAX_PAGES);
 
 	bio = bio_alloc(GFP_NOFS, max_pages);
 	BUG_ON(!bio);
@@ -171,7 +171,7 @@
 	unsigned int max_pages;
 	int i;
 
-	max_pages = min(nr_pages, BIO_MAX_PAGES);
+	max_pages = min_t(size_t, nr_pages, BIO_MAX_PAGES);
 
 	bio = bio_alloc(GFP_NOFS, max_pages);
 	BUG_ON(!bio);
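The min() to min_t() change above is about mixed-type comparisons: nr_pages is a size_t while BIO_MAX_PAGES is a plain int, and the kernel's type-checking min() rejects that combination. A user-space sketch of the idea (simplified macros for illustration; the kernel's real definitions also guard against double evaluation):

```c
#include <assert.h>
#include <stddef.h>

/* min_t(type, a, b) casts both sides to one explicit type before
 * comparing, sidestepping signed/unsigned promotion surprises. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Why the cast matters: comparing a negative int against an unsigned
 * value promotes the int to unsigned, so -1 compares as a huge number. */
static int naive_min_is_wrong(void)
{
	int a = -1;
	unsigned int b = 1;

	return (a < b) ? 0 : 1;	/* 1: promotion made -1 "larger" than 1 */
}
```

With `min_t(size_t, nr_pages, BIO_MAX_PAGES)`, the int constant is converted to size_t up front, so the comparison is well defined for any nr_pages.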
+8 -1
fs/notify/fdinfo.c
@@ -83,9 +83,16 @@
 	inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark);
 	inode = igrab(mark->inode);
 	if (inode) {
+		/*
+		 * IN_ALL_EVENTS represents all of the mask bits
+		 * that we expose to userspace. There is at
+		 * least one bit (FS_EVENT_ON_CHILD) which is
+		 * used only internally to the kernel.
+		 */
+		u32 mask = mark->mask & IN_ALL_EVENTS;
 		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
 			   inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
-			   mark->mask, mark->ignored_mask);
+			   mask, mark->ignored_mask);
 		show_mark_fhandle(m, inode);
 		seq_putc(m, '\n');
 		iput(inode);
+13 -1
fs/notify/inotify/inotify_user.c
@@ -706,7 +706,19 @@
 	int ret;
 	unsigned flags = 0;
 
-	/* don't allow invalid bits: we don't want flags set */
+	/*
+	 * We share a lot of code with fs/dnotify. We also share
+	 * the bit layout between inotify's IN_* and the fsnotify
+	 * FS_*. This check ensures that only the inotify IN_*
+	 * bits get passed in and set in watches/events.
+	 */
+	if (unlikely(mask & ~ALL_INOTIFY_BITS))
+		return -EINVAL;
+	/*
+	 * Require at least one valid bit set in the mask.
+	 * Without _something_ set, we would have no events to
+	 * watch for.
+	 */
 	if (unlikely(!(mask & ALL_INOTIFY_BITS)))
 		return -EINVAL;
 
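The two checks added above can be mirrored in a tiny validation helper. ALL_INOTIFY_BITS below is a hypothetical stand-in value for illustration, not the kernel's actual constant (which is the OR of every IN_* flag exposed to userspace):

```c
#include <assert.h>
#include <stdint.h>

#define ALL_INOTIFY_BITS 0x00ffffffu	/* stand-in for the real mask */

/* Mirror of the logic above: reject masks carrying bits the kernel
 * does not expose, and reject masks with no event bits at all. */
static int inotify_mask_valid(uint32_t mask)
{
	if (mask & ~ALL_INOTIFY_BITS)	/* unknown bits -> -EINVAL */
		return 0;
	if (!(mask & ALL_INOTIFY_BITS))	/* no events requested -> -EINVAL */
		return 0;
	return 1;
}
```

The first check is the new, stricter one: previously a caller could smuggle in bits that are internal to fsnotify (such as FS_EVENT_ON_CHILD) and have them stored in the watch.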
+2
fs/ocfs2/aops.c
@@ -589,6 +589,7 @@
 		ret = -EIO;
 		goto bail;
 	}
+	set_buffer_new(bh_result);
 	up_write(&OCFS2_I(inode)->ip_alloc_sem);
 }
 
@@ -864,6 +865,7 @@
 	is_overwrite = ocfs2_is_overwrite(osb, inode, offset);
 	if (is_overwrite < 0) {
 		mlog_errno(is_overwrite);
+		ret = is_overwrite;
 		ocfs2_inode_unlock(inode, 1);
 		goto clean_orphan;
 	}
+16 -3
fs/ocfs2/cluster/heartbeat.c
@@ -219,7 +219,8 @@
 	unsigned	hr_unclean_stop:1,
 			hr_aborted_start:1,
 			hr_item_pinned:1,
-			hr_item_dropped:1;
+			hr_item_dropped:1,
+			hr_node_deleted:1;
 
 	/* protected by the hr_callback_sem */
 	struct task_struct *hr_task;
@@ -1078,7 +1079,13 @@
 	set_user_nice(current, MIN_NICE);
 
 	/* Pin node */
-	o2nm_depend_this_node();
+	ret = o2nm_depend_this_node();
+	if (ret) {
+		mlog(ML_ERROR, "Node has been deleted, ret = %d\n", ret);
+		reg->hr_node_deleted = 1;
+		wake_up(&o2hb_steady_queue);
+		return 0;
+	}
 
 	while (!kthread_should_stop() &&
 	       !reg->hr_unclean_stop && !reg->hr_aborted_start) {
@@ -1787,7 +1794,8 @@
 	spin_unlock(&o2hb_live_lock);
 
 	ret = wait_event_interruptible(o2hb_steady_queue,
-				atomic_read(&reg->hr_steady_iterations) == 0);
+				atomic_read(&reg->hr_steady_iterations) == 0 ||
+				reg->hr_node_deleted);
 	if (ret) {
 		atomic_set(&reg->hr_steady_iterations, 0);
 		reg->hr_aborted_start = 1;
@@ -1795,6 +1803,11 @@
 
 	if (reg->hr_aborted_start) {
 		ret = -EIO;
+		goto out3;
+	}
+
+	if (reg->hr_node_deleted) {
+		ret = -EINVAL;
 		goto out3;
 	}
 
+3 -1
fs/ocfs2/dlm/dlmdomain.c
@@ -1866,6 +1866,7 @@
 	int status;
 	unsigned int backoff;
 	unsigned int total_backoff = 0;
+	char wq_name[O2NM_MAX_NAME_LEN];
 
 	BUG_ON(!dlm);
 
@@ -1895,7 +1896,8 @@
 		goto bail;
 	}
 
-	dlm->dlm_worker = create_singlethread_workqueue("dlm_wq");
+	snprintf(wq_name, O2NM_MAX_NAME_LEN, "dlm_wq-%s", dlm->name);
+	dlm->dlm_worker = create_singlethread_workqueue(wq_name);
 	if (!dlm->dlm_worker) {
 		status = -ENOMEM;
 		mlog_errno(status);
+1 -1
fs/ocfs2/dlm/dlmrecovery.c
··· 205 205 mlog(0, "starting dlm recovery thread...\n"); 206 206 207 207 dlm->dlm_reco_thread_task = kthread_run(dlm_recovery_thread, dlm, 208 - "dlm_reco_thread"); 208 + "dlm_reco-%s", dlm->name); 209 209 if (IS_ERR(dlm->dlm_reco_thread_task)) { 210 210 mlog_errno(PTR_ERR(dlm->dlm_reco_thread_task)); 211 211 dlm->dlm_reco_thread_task = NULL;
+2 -1
fs/ocfs2/dlm/dlmthread.c
··· 493 493 { 494 494 mlog(0, "Starting dlm_thread...\n"); 495 495 496 - dlm->dlm_thread_task = kthread_run(dlm_thread, dlm, "dlm_thread"); 496 + dlm->dlm_thread_task = kthread_run(dlm_thread, dlm, "dlm-%s", 497 + dlm->name); 497 498 if (IS_ERR(dlm->dlm_thread_task)) { 498 499 mlog_errno(PTR_ERR(dlm->dlm_thread_task)); 499 500 dlm->dlm_thread_task = NULL;
+2 -1
fs/ocfs2/dlmglue.c
··· 2998 2998 } 2999 2999 3000 3000 /* launch downconvert thread */ 3001 - osb->dc_task = kthread_run(ocfs2_downconvert_thread, osb, "ocfs2dc"); 3001 + osb->dc_task = kthread_run(ocfs2_downconvert_thread, osb, "ocfs2dc-%s", 3002 + osb->uuid_str); 3002 3003 if (IS_ERR(osb->dc_task)) { 3003 3004 status = PTR_ERR(osb->dc_task); 3004 3005 osb->dc_task = NULL;
+2
fs/ocfs2/inode.h
··· 112 112 #define OCFS2_INODE_OPEN_DIRECT 0x00000020 113 113 /* Tell the inode wipe code it's not in orphan dir */ 114 114 #define OCFS2_INODE_SKIP_ORPHAN_DIR 0x00000040 115 + /* Entry in orphan dir with 'dio-' prefix */ 116 + #define OCFS2_INODE_DIO_ORPHAN_ENTRY 0x00000080 115 117 116 118 static inline struct ocfs2_inode_info *OCFS2_I(struct inode *inode) 117 119 {
+61 -44
fs/ocfs2/journal.c
··· 1090 1090 /* Launch the commit thread */ 1091 1091 if (!local) { 1092 1092 osb->commit_task = kthread_run(ocfs2_commit_thread, osb, 1093 - "ocfs2cmt"); 1093 + "ocfs2cmt-%s", osb->uuid_str); 1094 1094 if (IS_ERR(osb->commit_task)) { 1095 1095 status = PTR_ERR(osb->commit_task); 1096 1096 osb->commit_task = NULL; ··· 1507 1507 goto out; 1508 1508 1509 1509 osb->recovery_thread_task = kthread_run(__ocfs2_recovery_thread, osb, 1510 - "ocfs2rec"); 1510 + "ocfs2rec-%s", osb->uuid_str); 1511 1511 if (IS_ERR(osb->recovery_thread_task)) { 1512 1512 mlog_errno((int)PTR_ERR(osb->recovery_thread_task)); 1513 1513 osb->recovery_thread_task = NULL; ··· 2021 2021 struct dir_context ctx; 2022 2022 struct inode *head; 2023 2023 struct ocfs2_super *osb; 2024 + enum ocfs2_orphan_reco_type orphan_reco_type; 2024 2025 }; 2025 2026 2026 2027 static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name, ··· 2037 2036 if (name_len == 2 && !strncmp("..", name, 2)) 2038 2037 return 0; 2039 2038 2039 + /* do not include dio entry in case of orphan scan */ 2040 + if ((p->orphan_reco_type == ORPHAN_NO_NEED_TRUNCATE) && 2041 + (!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX, 2042 + OCFS2_DIO_ORPHAN_PREFIX_LEN))) 2043 + return 0; 2044 + 2040 2045 /* Skip bad inodes so that recovery can continue */ 2041 2046 iter = ocfs2_iget(p->osb, ino, 2042 2047 OCFS2_FI_FLAG_ORPHAN_RECOVERY, 0); 2043 2048 if (IS_ERR(iter)) 2044 2049 return 0; 2050 + 2051 + if (!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX, 2052 + OCFS2_DIO_ORPHAN_PREFIX_LEN)) 2053 + OCFS2_I(iter)->ip_flags |= OCFS2_INODE_DIO_ORPHAN_ENTRY; 2045 2054 2046 2055 /* Skip inodes which are already added to recover list, since dio may 2047 2056 * happen concurrently with unlink/rename */ ··· 2071 2060 2072 2061 static int ocfs2_queue_orphans(struct ocfs2_super *osb, 2073 2062 int slot, 2074 - struct inode **head) 2063 + struct inode **head, 2064 + enum ocfs2_orphan_reco_type orphan_reco_type) 2075 2065 { 2076 2066 int status; 2077 2067 struct inode 
*orphan_dir_inode = NULL; 2078 2068 struct ocfs2_orphan_filldir_priv priv = { 2079 2069 .ctx.actor = ocfs2_orphan_filldir, 2080 2070 .osb = osb, 2081 - .head = *head 2071 + .head = *head, 2072 + .orphan_reco_type = orphan_reco_type 2082 2073 }; 2083 2074 2084 2075 orphan_dir_inode = ocfs2_get_system_file_inode(osb, ··· 2183 2170 trace_ocfs2_recover_orphans(slot); 2184 2171 2185 2172 ocfs2_mark_recovering_orphan_dir(osb, slot); 2186 - ret = ocfs2_queue_orphans(osb, slot, &inode); 2173 + ret = ocfs2_queue_orphans(osb, slot, &inode, orphan_reco_type); 2187 2174 ocfs2_clear_recovering_orphan_dir(osb, slot); 2188 2175 2189 2176 /* Error here should be noted, but we want to continue with as ··· 2199 2186 iter = oi->ip_next_orphan; 2200 2187 oi->ip_next_orphan = NULL; 2201 2188 2202 - mutex_lock(&inode->i_mutex); 2203 - ret = ocfs2_rw_lock(inode, 1); 2204 - if (ret < 0) { 2205 - mlog_errno(ret); 2206 - goto next; 2207 - } 2208 - /* 2209 - * We need to take and drop the inode lock to 2210 - * force read inode from disk. 2211 - */ 2212 - ret = ocfs2_inode_lock(inode, &di_bh, 1); 2213 - if (ret) { 2214 - mlog_errno(ret); 2215 - goto unlock_rw; 2216 - } 2189 + if (oi->ip_flags & OCFS2_INODE_DIO_ORPHAN_ENTRY) { 2190 + mutex_lock(&inode->i_mutex); 2191 + ret = ocfs2_rw_lock(inode, 1); 2192 + if (ret < 0) { 2193 + mlog_errno(ret); 2194 + goto unlock_mutex; 2195 + } 2196 + /* 2197 + * We need to take and drop the inode lock to 2198 + * force read inode from disk. 
2199 + */ 2200 + ret = ocfs2_inode_lock(inode, &di_bh, 1); 2201 + if (ret) { 2202 + mlog_errno(ret); 2203 + goto unlock_rw; 2204 + } 2217 2205 2218 - di = (struct ocfs2_dinode *)di_bh->b_data; 2206 + di = (struct ocfs2_dinode *)di_bh->b_data; 2219 2207 2220 - if (inode->i_nlink == 0) { 2208 + if (di->i_flags & cpu_to_le32(OCFS2_DIO_ORPHANED_FL)) { 2209 + ret = ocfs2_truncate_file(inode, di_bh, 2210 + i_size_read(inode)); 2211 + if (ret < 0) { 2212 + if (ret != -ENOSPC) 2213 + mlog_errno(ret); 2214 + goto unlock_inode; 2215 + } 2216 + 2217 + ret = ocfs2_del_inode_from_orphan(osb, inode, 2218 + di_bh, 0, 0); 2219 + if (ret) 2220 + mlog_errno(ret); 2221 + } 2222 + unlock_inode: 2223 + ocfs2_inode_unlock(inode, 1); 2224 + brelse(di_bh); 2225 + di_bh = NULL; 2226 + unlock_rw: 2227 + ocfs2_rw_unlock(inode, 1); 2228 + unlock_mutex: 2229 + mutex_unlock(&inode->i_mutex); 2230 + 2231 + /* clear dio flag in ocfs2_inode_info */ 2232 + oi->ip_flags &= ~OCFS2_INODE_DIO_ORPHAN_ENTRY; 2233 + } else { 2221 2234 spin_lock(&oi->ip_lock); 2222 2235 /* Set the proper information to get us going into 2223 2236 * ocfs2_delete_inode. */ ··· 2251 2212 spin_unlock(&oi->ip_lock); 2252 2213 } 2253 2214 2254 - if ((orphan_reco_type == ORPHAN_NEED_TRUNCATE) && 2255 - (di->i_flags & cpu_to_le32(OCFS2_DIO_ORPHANED_FL))) { 2256 - ret = ocfs2_truncate_file(inode, di_bh, 2257 - i_size_read(inode)); 2258 - if (ret < 0) { 2259 - if (ret != -ENOSPC) 2260 - mlog_errno(ret); 2261 - goto unlock_inode; 2262 - } 2263 - 2264 - ret = ocfs2_del_inode_from_orphan(osb, inode, di_bh, 0, 0); 2265 - if (ret) 2266 - mlog_errno(ret); 2267 - } /* else if ORPHAN_NO_NEED_TRUNCATE, do nothing */ 2268 - unlock_inode: 2269 - ocfs2_inode_unlock(inode, 1); 2270 - brelse(di_bh); 2271 - di_bh = NULL; 2272 - unlock_rw: 2273 - ocfs2_rw_unlock(inode, 1); 2274 - next: 2275 - mutex_unlock(&inode->i_mutex); 2276 2215 iput(inode); 2277 2216 inode = iter; 2278 2217 }
+10 -3
fs/ocfs2/namei.c
··· 106 106 static void ocfs2_double_unlock(struct inode *inode1, struct inode *inode2); 107 107 /* An orphan dir name is an 8 byte value, printed as a hex string */ 108 108 #define OCFS2_ORPHAN_NAMELEN ((int)(2 * sizeof(u64))) 109 - #define OCFS2_DIO_ORPHAN_PREFIX "dio-" 110 - #define OCFS2_DIO_ORPHAN_PREFIX_LEN 4 111 109 112 110 static struct dentry *ocfs2_lookup(struct inode *dir, struct dentry *dentry, 113 111 unsigned int flags) ··· 655 657 return status; 656 658 } 657 659 658 - return __ocfs2_mknod_locked(dir, inode, dev, new_fe_bh, 660 + status = __ocfs2_mknod_locked(dir, inode, dev, new_fe_bh, 659 661 parent_fe_bh, handle, inode_ac, 660 662 fe_blkno, suballoc_loc, suballoc_bit); 663 + if (status < 0) { 664 + u64 bg_blkno = ocfs2_which_suballoc_group(fe_blkno, suballoc_bit); 665 + int tmp = ocfs2_free_suballoc_bits(handle, inode_ac->ac_inode, 666 + inode_ac->ac_bh, suballoc_bit, bg_blkno, 1); 667 + if (tmp) 668 + mlog_errno(tmp); 669 + } 670 + 671 + return status; 661 672 } 662 673 663 674 static int ocfs2_mkdir(struct inode *dir,
+3
fs/ocfs2/namei.h
··· 26 26 #ifndef OCFS2_NAMEI_H 27 27 #define OCFS2_NAMEI_H 28 28 29 + #define OCFS2_DIO_ORPHAN_PREFIX "dio-" 30 + #define OCFS2_DIO_ORPHAN_PREFIX_LEN 4 31 + 29 32 extern const struct inode_operations ocfs2_dir_iops; 30 33 31 34 struct dentry *ocfs2_get_parent(struct dentry *child);
+1 -4
fs/ocfs2/refcounttree.c
··· 2920 2920 u64 new_block = ocfs2_clusters_to_blocks(sb, new_cluster); 2921 2921 struct page *page; 2922 2922 pgoff_t page_index; 2923 - unsigned int from, to, readahead_pages; 2923 + unsigned int from, to; 2924 2924 loff_t offset, end, map_end; 2925 2925 struct address_space *mapping = inode->i_mapping; 2926 2926 2927 2927 trace_ocfs2_duplicate_clusters_by_page(cpos, old_cluster, 2928 2928 new_cluster, new_len); 2929 2929 2930 - readahead_pages = 2931 - (ocfs2_cow_contig_clusters(sb) << 2932 - OCFS2_SB(sb)->s_clustersize_bits) >> PAGE_CACHE_SHIFT; 2933 2930 offset = ((loff_t)cpos) << OCFS2_SB(sb)->s_clustersize_bits; 2934 2931 end = offset + (new_len << OCFS2_SB(sb)->s_clustersize_bits); 2935 2932 /*
+4 -1
fs/ocfs2/suballoc.c
··· 1920 1920 status = ocfs2_search_chain(ac, handle, bits_wanted, min_bits, 1921 1921 res, &bits_left); 1922 1922 if (!status) { 1923 - hint = ocfs2_group_from_res(res); 1923 + if (ocfs2_is_cluster_bitmap(ac->ac_inode)) 1924 + hint = res->sr_bg_blkno; 1925 + else 1926 + hint = ocfs2_group_from_res(res); 1924 1927 goto set_hint; 1925 1928 } 1926 1929 if (status < 0 && status != -ENOSPC) {
+10
fs/proc/base.c
··· 1032 1032 return simple_read_from_buffer(buf, count, ppos, buffer, len); 1033 1033 } 1034 1034 1035 + /* 1036 + * /proc/pid/oom_adj exists solely for backwards compatibility with previous 1037 + * kernels. The effective policy is defined by oom_score_adj, which has a 1038 + * different scale: oom_adj grew exponentially and oom_score_adj grows linearly. 1039 + * Values written to oom_adj are simply mapped linearly to oom_score_adj. 1040 + * Processes that become oom disabled via oom_adj will still be oom disabled 1041 + * with this implementation. 1042 + * 1043 + * oom_adj cannot be removed since existing userspace binaries use it. 1044 + */ 1035 1045 static ssize_t oom_adj_write(struct file *file, const char __user *buf, 1036 1046 size_t count, loff_t *ppos) 1037 1047 {
+50 -10
fs/proc/task_mmu.c
··· 70 70 ptes >> 10, 71 71 pmds >> 10, 72 72 swap << (PAGE_SHIFT-10)); 73 + hugetlb_report_usage(m, mm); 73 74 } 74 75 75 76 unsigned long task_vsize(struct mm_struct *mm) ··· 447 446 unsigned long anonymous; 448 447 unsigned long anonymous_thp; 449 448 unsigned long swap; 449 + unsigned long shared_hugetlb; 450 + unsigned long private_hugetlb; 450 451 u64 pss; 451 452 u64 swap_pss; 452 453 }; ··· 628 625 seq_putc(m, '\n'); 629 626 } 630 627 628 + #ifdef CONFIG_HUGETLB_PAGE 629 + static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, 630 + unsigned long addr, unsigned long end, 631 + struct mm_walk *walk) 632 + { 633 + struct mem_size_stats *mss = walk->private; 634 + struct vm_area_struct *vma = walk->vma; 635 + struct page *page = NULL; 636 + 637 + if (pte_present(*pte)) { 638 + page = vm_normal_page(vma, addr, *pte); 639 + } else if (is_swap_pte(*pte)) { 640 + swp_entry_t swpent = pte_to_swp_entry(*pte); 641 + 642 + if (is_migration_entry(swpent)) 643 + page = migration_entry_to_page(swpent); 644 + } 645 + if (page) { 646 + int mapcount = page_mapcount(page); 647 + 648 + if (mapcount >= 2) 649 + mss->shared_hugetlb += huge_page_size(hstate_vma(vma)); 650 + else 651 + mss->private_hugetlb += huge_page_size(hstate_vma(vma)); 652 + } 653 + return 0; 654 + } 655 + #endif /* HUGETLB_PAGE */ 656 + 631 657 static int show_smap(struct seq_file *m, void *v, int is_pid) 632 658 { 633 659 struct vm_area_struct *vma = v; 634 660 struct mem_size_stats mss; 635 661 struct mm_walk smaps_walk = { 636 662 .pmd_entry = smaps_pte_range, 663 + #ifdef CONFIG_HUGETLB_PAGE 664 + .hugetlb_entry = smaps_hugetlb_range, 665 + #endif 637 666 .mm = vma->vm_mm, 638 667 .private = &mss, 639 668 }; ··· 687 652 "Referenced: %8lu kB\n" 688 653 "Anonymous: %8lu kB\n" 689 654 "AnonHugePages: %8lu kB\n" 655 + "Shared_Hugetlb: %8lu kB\n" 656 + "Private_Hugetlb: %7lu kB\n" 690 657 "Swap: %8lu kB\n" 691 658 "SwapPss: %8lu kB\n" 692 659 "KernelPageSize: %8lu kB\n" ··· 704 667 mss.referenced 
>> 10, 705 668 mss.anonymous >> 10, 706 669 mss.anonymous_thp >> 10, 670 + mss.shared_hugetlb >> 10, 671 + mss.private_hugetlb >> 10, 707 672 mss.swap >> 10, 708 673 (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)), 709 674 vma_kernel_pagesize(vma) >> 10, ··· 792 753 pte_t ptent = *pte; 793 754 794 755 if (pte_present(ptent)) { 756 + ptent = ptep_modify_prot_start(vma->vm_mm, addr, pte); 795 757 ptent = pte_wrprotect(ptent); 796 758 ptent = pte_clear_soft_dirty(ptent); 759 + ptep_modify_prot_commit(vma->vm_mm, addr, pte, ptent); 797 760 } else if (is_swap_pte(ptent)) { 798 761 ptent = pte_swp_clear_soft_dirty(ptent); 762 + set_pte_at(vma->vm_mm, addr, pte, ptent); 799 763 } 800 - 801 - set_pte_at(vma->vm_mm, addr, pte, ptent); 802 764 } 765 + #else 766 + static inline void clear_soft_dirty(struct vm_area_struct *vma, 767 + unsigned long addr, pte_t *pte) 768 + { 769 + } 770 + #endif 803 771 772 + #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE) 804 773 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, 805 774 unsigned long addr, pmd_t *pmdp) 806 775 { 807 - pmd_t pmd = *pmdp; 776 + pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp); 808 777 809 778 pmd = pmd_wrprotect(pmd); 810 779 pmd = pmd_clear_soft_dirty(pmd); ··· 822 775 823 776 set_pmd_at(vma->vm_mm, addr, pmdp, pmd); 824 777 } 825 - 826 778 #else 827 - 828 - static inline void clear_soft_dirty(struct vm_area_struct *vma, 829 - unsigned long addr, pte_t *pte) 830 - { 831 - } 832 - 833 779 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma, 834 780 unsigned long addr, pmd_t *pmdp) 835 781 {
+6 -1
fs/sync.c
··· 86 86 87 87 static void fdatawait_one_bdev(struct block_device *bdev, void *arg) 88 88 { 89 - filemap_fdatawait(bdev->bd_inode->i_mapping); 89 + /* 90 + * We keep the error status of individual mapping so that 91 + * applications can catch the writeback error using fsync(2). 92 + * See filemap_fdatawait_keep_errors() for details. 93 + */ 94 + filemap_fdatawait_keep_errors(bdev->bd_inode->i_mapping); 90 95 } 91 96 92 97 /*
+2 -1
include/linux/compaction.h
··· 15 15 /* For more detailed tracepoint output */ 16 16 #define COMPACT_NO_SUITABLE_PAGE 5 17 17 #define COMPACT_NOT_SUITABLE_ZONE 6 18 - /* When adding new state, please change compaction_status_string, too */ 18 + #define COMPACT_CONTENDED 7 19 + /* When adding new states, please adjust include/trace/events/compaction.h */ 19 20 20 21 /* Used to signal whether compaction detected need_sched() or lock contention */ 21 22 /* No contention detected */
+17
include/linux/compiler-gcc.h
··· 210 210 #define __visible __attribute__((externally_visible)) 211 211 #endif 212 212 213 + 214 + #if GCC_VERSION >= 40900 && !defined(__CHECKER__) 215 + /* 216 + * __assume_aligned(n, k): Tell the optimizer that the returned 217 + * pointer can be assumed to be k modulo n. The second argument is 218 + * optional (default 0), so we use a variadic macro to make the 219 + * shorthand. 220 + * 221 + * Beware: Do not apply this to functions which may return 222 + * ERR_PTRs. Also, it is probably unwise to apply it to functions 223 + * returning extra information in the low bits (but in that case the 224 + * compiler should see some alignment anyway, when the return value is 225 + * massaged by 'flags = ptr & 3; ptr &= ~3;'). 226 + */ 227 + #define __assume_aligned(a, ...) __attribute__((__assume_aligned__(a, ## __VA_ARGS__))) 228 + #endif 229 + 213 230 /* 214 231 * GCC 'asm goto' miscompiles certain code sequences: 215 232 *
+8
include/linux/compiler.h
··· 417 417 #define __visible 418 418 #endif 419 419 420 + /* 421 + * Assume alignment of return value. 422 + */ 423 + #ifndef __assume_aligned 424 + #define __assume_aligned(a, ...) 425 + #endif 426 + 427 + 420 428 /* Are two types/vars the same type (ignoring qualifiers)? */ 421 429 #ifndef __same_type 422 430 # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
+2 -2
include/linux/cpuset.h
··· 93 93 94 94 extern void rebuild_sched_domains(void); 95 95 96 - extern void cpuset_print_task_mems_allowed(struct task_struct *p); 96 + extern void cpuset_print_current_mems_allowed(void); 97 97 98 98 /* 99 99 * read_mems_allowed_begin is required when making decisions involving ··· 219 219 partition_sched_domains(1, NULL, NULL); 220 220 } 221 221 222 - static inline void cpuset_print_task_mems_allowed(struct task_struct *p) 222 + static inline void cpuset_print_current_mems_allowed(void) 223 223 { 224 224 } 225 225
+1
include/linux/fs.h
··· 2409 2409 extern int filemap_fdatawrite(struct address_space *); 2410 2410 extern int filemap_flush(struct address_space *); 2411 2411 extern int filemap_fdatawait(struct address_space *); 2412 + extern void filemap_fdatawait_keep_errors(struct address_space *); 2412 2413 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart, 2413 2414 loff_t lend); 2414 2415 extern int filemap_write_and_wait(struct address_space *mapping);
+19
include/linux/hugetlb.h
··· 483 483 #define hugepages_supported() (HPAGE_SHIFT != 0) 484 484 #endif 485 485 486 + void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm); 487 + 488 + static inline void hugetlb_count_add(long l, struct mm_struct *mm) 489 + { 490 + atomic_long_add(l, &mm->hugetlb_usage); 491 + } 492 + 493 + static inline void hugetlb_count_sub(long l, struct mm_struct *mm) 494 + { 495 + atomic_long_sub(l, &mm->hugetlb_usage); 496 + } 486 497 #else /* CONFIG_HUGETLB_PAGE */ 487 498 struct hstate {}; 488 499 #define alloc_huge_page(v, a, r) NULL ··· 529 518 struct mm_struct *mm, pte_t *pte) 530 519 { 531 520 return &mm->page_table_lock; 521 + } 522 + 523 + static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m) 524 + { 525 + } 526 + 527 + static inline void hugetlb_count_sub(long l, struct mm_struct *mm) 528 + { 532 529 } 533 530 #endif /* CONFIG_HUGETLB_PAGE */ 534 531
-4
include/linux/memblock.h
··· 89 89 phys_addr_t base, phys_addr_t size, 90 90 int nid, unsigned long flags); 91 91 92 - int memblock_remove_range(struct memblock_type *type, 93 - phys_addr_t base, 94 - phys_addr_t size); 95 - 96 92 void __next_mem_range(u64 *idx, int nid, ulong flags, 97 93 struct memblock_type *type_a, 98 94 struct memblock_type *type_b, phys_addr_t *out_start,
+47 -108
include/linux/memcontrol.h
··· 301 301 void mem_cgroup_uncharge(struct page *page); 302 302 void mem_cgroup_uncharge_list(struct list_head *page_list); 303 303 304 - void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, 305 - bool lrucare); 304 + void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage); 306 305 307 306 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *); 308 307 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *); ··· 383 384 return mz->lru_size[lru]; 384 385 } 385 386 386 - static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) 387 + static inline bool mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) 387 388 { 388 389 unsigned long inactive_ratio; 389 390 unsigned long inactive; ··· 402 403 return inactive * inactive_ratio < active; 403 404 } 404 405 406 + void mem_cgroup_handle_over_high(void); 407 + 405 408 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 406 409 struct task_struct *p); 407 410 408 411 static inline void mem_cgroup_oom_enable(void) 409 412 { 410 - WARN_ON(current->memcg_oom.may_oom); 411 - current->memcg_oom.may_oom = 1; 413 + WARN_ON(current->memcg_may_oom); 414 + current->memcg_may_oom = 1; 412 415 } 413 416 414 417 static inline void mem_cgroup_oom_disable(void) 415 418 { 416 - WARN_ON(!current->memcg_oom.may_oom); 417 - current->memcg_oom.may_oom = 0; 419 + WARN_ON(!current->memcg_may_oom); 420 + current->memcg_may_oom = 0; 418 421 } 419 422 420 423 static inline bool task_in_memcg_oom(struct task_struct *p) 421 424 { 422 - return p->memcg_oom.memcg; 425 + return p->memcg_in_oom; 423 426 } 424 427 425 428 bool mem_cgroup_oom_synchronize(bool wait); ··· 538 537 { 539 538 } 540 539 541 - static inline void mem_cgroup_migrate(struct page *oldpage, 542 - struct page *newpage, 543 - bool lrucare) 540 + static inline void mem_cgroup_replace_page(struct page *old, struct page *new) 544 541 { 545 542 } 546 543 ··· 584 585 return true; 585 586 } 586 587 587 - 
static inline int 588 + static inline bool 588 589 mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) 589 590 { 590 - return 1; 591 + return true; 591 592 } 592 593 593 594 static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec) ··· 618 619 } 619 620 620 621 static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg) 622 + { 623 + } 624 + 625 + static inline void mem_cgroup_handle_over_high(void) 621 626 { 622 627 } 623 628 ··· 751 748 * conditions, but because they are pretty simple, they are expected to be 752 749 * fast. 753 750 */ 754 - bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, 755 - int order); 756 - void __memcg_kmem_commit_charge(struct page *page, 757 - struct mem_cgroup *memcg, int order); 758 - void __memcg_kmem_uncharge_pages(struct page *page, int order); 751 + int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, 752 + struct mem_cgroup *memcg); 753 + int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order); 754 + void __memcg_kmem_uncharge(struct page *page, int order); 759 755 760 756 /* 761 757 * helper for acessing a memcg's index. It will be used as an index in the ··· 769 767 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); 770 768 void __memcg_kmem_put_cache(struct kmem_cache *cachep); 771 769 772 - struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr); 773 - 774 - int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, 775 - unsigned long nr_pages); 776 - void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages); 777 - 778 - /** 779 - * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed. 780 - * @gfp: the gfp allocation flags. 781 - * @memcg: a pointer to the memcg this was charged against. 782 - * @order: allocation order. 783 - * 784 - * returns true if the memcg where the current task belongs can hold this 785 - * allocation. 
786 - * 787 - * We return true automatically if this allocation is not to be accounted to 788 - * any memcg. 789 - */ 790 - static inline bool 791 - memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) 770 + static inline bool __memcg_kmem_bypass(gfp_t gfp) 792 771 { 793 772 if (!memcg_kmem_enabled()) 794 773 return true; 795 - 796 774 if (gfp & __GFP_NOACCOUNT) 797 - return true; 798 - /* 799 - * __GFP_NOFAIL allocations will move on even if charging is not 800 - * possible. Therefore we don't even try, and have this allocation 801 - * unaccounted. We could in theory charge it forcibly, but we hope 802 - * those allocations are rare, and won't be worth the trouble. 803 - */ 804 - if (gfp & __GFP_NOFAIL) 805 775 return true; 806 776 if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) 807 777 return true; 808 - 809 - /* If the test is dying, just let it go. */ 810 - if (unlikely(fatal_signal_pending(current))) 811 - return true; 812 - 813 - return __memcg_kmem_newpage_charge(gfp, memcg, order); 778 + return false; 814 779 } 815 780 816 781 /** 817 - * memcg_kmem_uncharge_pages: uncharge pages from memcg 818 - * @page: pointer to struct page being freed 819 - * @order: allocation order. 782 + * memcg_kmem_charge: charge a kmem page 783 + * @page: page to charge 784 + * @gfp: reclaim mode 785 + * @order: allocation order 786 + * 787 + * Returns 0 on success, an error code on failure. 
820 788 */ 821 - static inline void 822 - memcg_kmem_uncharge_pages(struct page *page, int order) 789 + static __always_inline int memcg_kmem_charge(struct page *page, 790 + gfp_t gfp, int order) 791 + { 792 + if (__memcg_kmem_bypass(gfp)) 793 + return 0; 794 + return __memcg_kmem_charge(page, gfp, order); 795 + } 796 + 797 + /** 798 + * memcg_kmem_uncharge: uncharge a kmem page 799 + * @page: page to uncharge 800 + * @order: allocation order 801 + */ 802 + static __always_inline void memcg_kmem_uncharge(struct page *page, int order) 823 803 { 824 804 if (memcg_kmem_enabled()) 825 - __memcg_kmem_uncharge_pages(page, order); 826 - } 827 - 828 - /** 829 - * memcg_kmem_commit_charge: embeds correct memcg in a page 830 - * @page: pointer to struct page recently allocated 831 - * @memcg: the memcg structure we charged against 832 - * @order: allocation order. 833 - * 834 - * Needs to be called after memcg_kmem_newpage_charge, regardless of success or 835 - * failure of the allocation. if @page is NULL, this function will revert the 836 - * charges. Otherwise, it will commit @page to @memcg. 
837 - */ 838 - static inline void 839 - memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) 840 - { 841 - if (memcg_kmem_enabled() && memcg) 842 - __memcg_kmem_commit_charge(page, memcg, order); 805 + __memcg_kmem_uncharge(page, order); 843 806 } 844 807 845 808 /** ··· 817 850 static __always_inline struct kmem_cache * 818 851 memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) 819 852 { 820 - if (!memcg_kmem_enabled()) 853 + if (__memcg_kmem_bypass(gfp)) 821 854 return cachep; 822 - if (gfp & __GFP_NOACCOUNT) 823 - return cachep; 824 - if (gfp & __GFP_NOFAIL) 825 - return cachep; 826 - if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) 827 - return cachep; 828 - if (unlikely(fatal_signal_pending(current))) 829 - return cachep; 830 - 831 855 return __memcg_kmem_get_cache(cachep); 832 856 } 833 857 ··· 826 868 { 827 869 if (memcg_kmem_enabled()) 828 870 __memcg_kmem_put_cache(cachep); 829 - } 830 - 831 - static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) 832 - { 833 - if (!memcg_kmem_enabled()) 834 - return NULL; 835 - return __mem_cgroup_from_kmem(ptr); 836 871 } 837 872 #else 838 873 #define for_each_memcg_cache_index(_idx) \ ··· 841 890 return false; 842 891 } 843 892 844 - static inline bool 845 - memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) 893 + static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) 846 894 { 847 - return true; 895 + return 0; 848 896 } 849 897 850 - static inline void memcg_kmem_uncharge_pages(struct page *page, int order) 851 - { 852 - } 853 - 854 - static inline void 855 - memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order) 898 + static inline void memcg_kmem_uncharge(struct page *page, int order) 856 899 { 857 900 } 858 901 ··· 872 927 static inline void memcg_kmem_put_cache(struct kmem_cache *cachep) 873 928 { 874 929 } 875 - 876 - static inline struct mem_cgroup 
*mem_cgroup_from_kmem(void *ptr) 877 - { 878 - return NULL; 879 - } 880 930 #endif /* CONFIG_MEMCG_KMEM */ 881 931 #endif /* _LINUX_MEMCONTROL_H */ 882 -
+8 -3
include/linux/mm.h
··· 139 139 140 140 #define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */ 141 141 #define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */ 142 + #define VM_LOCKONFAULT 0x00080000 /* Lock the pages covered when they are faulted in */ 142 143 #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */ 143 144 #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ 144 145 #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */ ··· 202 201 203 202 /* This mask defines which mm->def_flags a process can inherit its parent */ 204 203 #define VM_INIT_DEF_MASK VM_NOHUGEPAGE 204 + 205 + /* This mask is used to clear all the VMA flags used by mlock */ 206 + #define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT)) 205 207 206 208 /* 207 209 * mapping from the currently active vm_flags protection bits (the ··· 1610 1606 1611 1607 static inline bool pgtable_page_ctor(struct page *page) 1612 1608 { 1609 + if (!ptlock_init(page)) 1610 + return false; 1613 1611 inc_zone_page_state(page, NR_PAGETABLE); 1614 - return ptlock_init(page); 1612 + return true; 1615 1613 } 1616 1614 1617 1615 static inline void pgtable_page_dtor(struct page *page) ··· 2042 2036 pgoff_t offset, 2043 2037 unsigned long size); 2044 2038 2045 - unsigned long max_sane_readahead(unsigned long nr); 2046 - 2047 2039 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ 2048 2040 extern int expand_stack(struct vm_area_struct *vma, unsigned long address); 2049 2041 ··· 2141 2137 #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ 2142 2138 #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ 2143 2139 #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ 2140 + #define FOLL_MLOCK 0x1000 /* lock present pages */ 2144 2141 2145 2142 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, 2146 2143 void *data);
+3
include/linux/mm_types.h
··· 486 486 /* address of the bounds directory */ 487 487 void __user *bd_addr; 488 488 #endif 489 + #ifdef CONFIG_HUGETLB_PAGE 490 + atomic_long_t hugetlb_usage; 491 + #endif 489 492 }; 490 493 491 494 static inline void mm_init_cpumask(struct mm_struct *mm)
+1 -2
include/linux/mmzone.h
··· 823 823 MEMMAP_HOTPLUG, 824 824 }; 825 825 extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn, 826 - unsigned long size, 827 - enum memmap_context context); 826 + unsigned long size); 828 827 829 828 extern void lruvec_init(struct lruvec *lruvec); 830 829
+1
include/linux/nmi.h
··· 73 73 extern int watchdog_thresh; 74 74 extern unsigned long *watchdog_cpumask_bits; 75 75 extern int sysctl_softlockup_all_cpu_backtrace; 76 + extern int sysctl_hardlockup_all_cpu_backtrace; 76 77 struct ctl_table; 77 78 extern int proc_watchdog(struct ctl_table *, int , 78 79 void __user *, size_t *, loff_t *);
+1 -1
include/linux/page-flags.h
··· 256 256 * Must use a macro here due to header dependency issues. page_zone() is not 257 257 * available at this point. 258 258 */ 259 - #define PageHighMem(__p) is_highmem(page_zone(__p)) 259 + #define PageHighMem(__p) is_highmem_idx(page_zonenum(__p)) 260 260 #else 261 261 PAGEFLAG_FALSE(HighMem) 262 262 #endif
+3 -3
include/linux/page_counter.h
··· 36 36 37 37 void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages); 38 38 void page_counter_charge(struct page_counter *counter, unsigned long nr_pages); 39 - int page_counter_try_charge(struct page_counter *counter, 40 - unsigned long nr_pages, 41 - struct page_counter **fail); 39 + bool page_counter_try_charge(struct page_counter *counter, 40 + unsigned long nr_pages, 41 + struct page_counter **fail); 42 42 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); 43 43 int page_counter_limit(struct page_counter *counter, unsigned long limit); 44 44 int page_counter_memparse(const char *buf, const char *max,
+10 -7
include/linux/sched.h
··· 384 384 void __user *buffer, 385 385 size_t *lenp, loff_t *ppos); 386 386 extern unsigned int softlockup_panic; 387 + extern unsigned int hardlockup_panic; 387 388 void lockup_detector_init(void); 388 389 #else 389 390 static inline void touch_softlockup_watchdog(void) ··· 1461 1460 unsigned sched_reset_on_fork:1; 1462 1461 unsigned sched_contributes_to_load:1; 1463 1462 unsigned sched_migrated:1; 1464 - 1463 + #ifdef CONFIG_MEMCG 1464 + unsigned memcg_may_oom:1; 1465 + #endif 1465 1466 #ifdef CONFIG_MEMCG_KMEM 1466 1467 unsigned memcg_kmem_skip_account:1; 1467 1468 #endif ··· 1794 1791 unsigned long trace_recursion; 1795 1792 #endif /* CONFIG_TRACING */ 1796 1793 #ifdef CONFIG_MEMCG 1797 - struct memcg_oom_info { 1798 - struct mem_cgroup *memcg; 1799 - gfp_t gfp_mask; 1800 - int order; 1801 - unsigned int may_oom:1; 1802 - } memcg_oom; 1794 + struct mem_cgroup *memcg_in_oom; 1795 + gfp_t memcg_oom_gfp_mask; 1796 + int memcg_oom_order; 1797 + 1798 + /* number of pages to reclaim on returning to userland */ 1799 + unsigned int memcg_nr_pages_over_high; 1803 1800 #endif 1804 1801 #ifdef CONFIG_UPROBES 1805 1802 struct uprobe_task *utask;
+1 -1
include/linux/slab.h
··· 111 111 * struct kmem_cache related prototypes 112 112 */ 113 113 void __init kmem_cache_init(void); 114 - int slab_is_available(void); 114 + bool slab_is_available(void); 115 115 116 116 struct kmem_cache *kmem_cache_create(const char *, size_t, size_t, 117 117 unsigned long,
+2
include/linux/syscalls.h
··· 887 887 888 888 asmlinkage long sys_membarrier(int cmd, int flags); 889 889 890 + asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags); 891 + 890 892 #endif
+3
include/linux/tracehook.h
··· 50 50 #include <linux/ptrace.h> 51 51 #include <linux/security.h> 52 52 #include <linux/task_work.h> 53 + #include <linux/memcontrol.h> 53 54 struct linux_binprm; 54 55 55 56 /* ··· 189 188 smp_mb__after_atomic(); 190 189 if (unlikely(current->task_works)) 191 190 task_work_run(); 191 + 192 + mem_cgroup_handle_over_high(); 192 193 } 193 194 194 195 #endif /* <linux/tracehook.h> */
+15 -1
include/linux/types.h
··· 205 205 * struct callback_head - callback structure for use with RCU and task_work 206 206 * @next: next update requests in a list 207 207 * @func: actual update function to call after the grace period. 208 + * 209 + * The struct is aligned to the size of a pointer. On most architectures it 210 + * happens naturally due to ABI requirements, but some architectures (like 211 + * CRIS) have a weird ABI and we need to ask for it explicitly. 212 + * 213 + * The alignment is required to guarantee that bits 0 and 1 of @next will be 214 + * clear under normal conditions -- as long as we use call_rcu(), 215 + * call_rcu_bh(), call_rcu_sched(), or call_srcu() to queue a callback. 216 + * 217 + * This guarantee is important for a few reasons: 218 + * - a future call_rcu_lazy() will make use of the lower bits in the pointer; 219 + * - the structure shares storage space in struct page with @compound_head, 220 + * which encodes PageTail() in bit 0. The guarantee is needed to avoid 221 + * false-positive PageTail(). 208 222 */ 209 223 struct callback_head { 210 224 struct callback_head *next; 211 225 void (*func)(struct callback_head *head); 212 - }; 226 + } __attribute__((aligned(sizeof(void *)))); 213 227 #define rcu_head callback_head 214 228 215 229 typedef void (*rcu_callback_t)(struct rcu_head *head);
+10 -30
include/linux/uaccess.h
··· 75 75 76 76 #endif /* ARCH_HAS_NOCACHE_UACCESS */ 77 77 78 - /** 79 - * probe_kernel_address(): safely attempt to read from a location 80 - * @addr: address to read from - its type is type typeof(retval)* 81 - * @retval: read into this variable 82 - * 83 - * Safely read from address @addr into variable @revtal. If a kernel fault 84 - * happens, handle that and return -EFAULT. 85 - * We ensure that the __get_user() is executed in atomic context so that 86 - * do_page_fault() doesn't attempt to take mmap_sem. This makes 87 - * probe_kernel_address() suitable for use within regions where the caller 88 - * already holds mmap_sem, or other locks which nest inside mmap_sem. 89 - * This must be a macro because __get_user() needs to know the types of the 90 - * args. 91 - * 92 - * We don't include enough header files to be able to do the set_fs(). We 93 - * require that the probe_kernel_address() caller will do that. 94 - */ 95 - #define probe_kernel_address(addr, retval) \ 96 - ({ \ 97 - long ret; \ 98 - mm_segment_t old_fs = get_fs(); \ 99 - \ 100 - set_fs(KERNEL_DS); \ 101 - pagefault_disable(); \ 102 - ret = __copy_from_user_inatomic(&(retval), (__force typeof(retval) __user *)(addr), sizeof(retval)); \ 103 - pagefault_enable(); \ 104 - set_fs(old_fs); \ 105 - ret; \ 106 - }) 107 - 108 78 /* 109 79 * probe_kernel_read(): safely attempt to read from a location 110 80 * @dst: pointer to the buffer that shall take the data ··· 100 130 extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size); 101 131 102 132 extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count); 133 + 134 + /** 135 + * probe_kernel_address(): safely attempt to read from a location 136 + * @addr: address to read from 137 + * @retval: read into this variable 138 + * 139 + * Returns 0 on success, or -EFAULT. 
140 + */ 141 + #define probe_kernel_address(addr, retval) \ 142 + probe_kernel_read(&retval, addr, sizeof(retval)) 103 143 104 144 #endif /* __LINUX_UACCESS_H__ */
+2 -2
include/linux/vm_event_item.h
··· 14 14 #endif 15 15 16 16 #ifdef CONFIG_HIGHMEM 17 - #define HIGHMEM_ZONE(xx) , xx##_HIGH 17 + #define HIGHMEM_ZONE(xx) xx##_HIGH, 18 18 #else 19 19 #define HIGHMEM_ZONE(xx) 20 20 #endif 21 21 22 - #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx) , xx##_MOVABLE 22 + #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, HIGHMEM_ZONE(xx) xx##_MOVABLE 23 23 24 24 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, 25 25 FOR_ALL_ZONES(PGALLOC),
+1 -24
include/linux/vmstat.h
··· 161 161 } 162 162 163 163 #ifdef CONFIG_NUMA 164 - /* 165 - * Determine the per node value of a stat item. This function 166 - * is called frequently in a NUMA machine, so try to be as 167 - * frugal as possible. 168 - */ 169 - static inline unsigned long node_page_state(int node, 170 - enum zone_stat_item item) 171 - { 172 - struct zone *zones = NODE_DATA(node)->node_zones; 173 164 174 - return 175 - #ifdef CONFIG_ZONE_DMA 176 - zone_page_state(&zones[ZONE_DMA], item) + 177 - #endif 178 - #ifdef CONFIG_ZONE_DMA32 179 - zone_page_state(&zones[ZONE_DMA32], item) + 180 - #endif 181 - #ifdef CONFIG_HIGHMEM 182 - zone_page_state(&zones[ZONE_HIGHMEM], item) + 183 - #endif 184 - zone_page_state(&zones[ZONE_NORMAL], item) + 185 - zone_page_state(&zones[ZONE_MOVABLE], item); 186 - } 187 - 165 + extern unsigned long node_page_state(int node, enum zone_stat_item item); 188 166 extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp); 189 167 190 168 #else ··· 247 269 248 270 #define set_pgdat_percpu_threshold(pgdat, callback) { } 249 271 250 - static inline void refresh_cpu_vm_stats(int cpu) { } 251 272 static inline void refresh_zone_stat_thresholds(void) { } 252 273 static inline void cpu_vm_stats_fold(int cpu) { } 253 274
+64 -8
include/trace/events/compaction.h
··· 9 9 #include <linux/tracepoint.h> 10 10 #include <trace/events/gfpflags.h> 11 11 12 + #define COMPACTION_STATUS \ 13 + EM( COMPACT_DEFERRED, "deferred") \ 14 + EM( COMPACT_SKIPPED, "skipped") \ 15 + EM( COMPACT_CONTINUE, "continue") \ 16 + EM( COMPACT_PARTIAL, "partial") \ 17 + EM( COMPACT_COMPLETE, "complete") \ 18 + EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \ 19 + EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \ 20 + EMe(COMPACT_CONTENDED, "contended") 21 + 22 + #ifdef CONFIG_ZONE_DMA 23 + #define IFDEF_ZONE_DMA(X) X 24 + #else 25 + #define IFDEF_ZONE_DMA(X) 26 + #endif 27 + 28 + #ifdef CONFIG_ZONE_DMA32 29 + #define IFDEF_ZONE_DMA32(X) X 30 + #else 31 + #define IFDEF_ZONE_DMA32(X) 32 + #endif 33 + 34 + #ifdef CONFIG_HIGHMEM 35 + #define IFDEF_ZONE_HIGHMEM(X) X 36 + #else 37 + #define IFDEF_ZONE_HIGHMEM(X) 38 + #endif 39 + 40 + #define ZONE_TYPE \ 41 + IFDEF_ZONE_DMA( EM (ZONE_DMA, "DMA")) \ 42 + IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ 43 + EM (ZONE_NORMAL, "Normal") \ 44 + IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ 45 + EMe(ZONE_MOVABLE,"Movable") 46 + 47 + /* 48 + * First define the enums in the above macros to be exported to userspace 49 + * via TRACE_DEFINE_ENUM(). 50 + */ 51 + #undef EM 52 + #undef EMe 53 + #define EM(a, b) TRACE_DEFINE_ENUM(a); 54 + #define EMe(a, b) TRACE_DEFINE_ENUM(a); 55 + 56 + COMPACTION_STATUS 57 + ZONE_TYPE 58 + 59 + /* 60 + * Now redefine the EM() and EMe() macros to map the enums to the strings 61 + * that will be printed in the output. 62 + */ 63 + #undef EM 64 + #undef EMe 65 + #define EM(a, b) {a, b}, 66 + #define EMe(a, b) {a, b} 67 + 12 68 DECLARE_EVENT_CLASS(mm_compaction_isolate_template, 13 69 14 70 TP_PROTO( ··· 217 161 __entry->free_pfn, 218 162 __entry->zone_end, 219 163 __entry->sync ? 
"sync" : "async", 220 - compaction_status_string[__entry->status]) 164 + __print_symbolic(__entry->status, COMPACTION_STATUS)) 221 165 ); 222 166 223 167 TRACE_EVENT(mm_compaction_try_to_compact_pages, ··· 257 201 258 202 TP_STRUCT__entry( 259 203 __field(int, nid) 260 - __field(char *, name) 204 + __field(enum zone_type, idx) 261 205 __field(int, order) 262 206 __field(int, ret) 263 207 ), 264 208 265 209 TP_fast_assign( 266 210 __entry->nid = zone_to_nid(zone); 267 - __entry->name = (char *)zone->name; 211 + __entry->idx = zone_idx(zone); 268 212 __entry->order = order; 269 213 __entry->ret = ret; 270 214 ), 271 215 272 216 TP_printk("node=%d zone=%-8s order=%d ret=%s", 273 217 __entry->nid, 274 - __entry->name, 218 + __print_symbolic(__entry->idx, ZONE_TYPE), 275 219 __entry->order, 276 - compaction_status_string[__entry->ret]) 220 + __print_symbolic(__entry->ret, COMPACTION_STATUS)) 277 221 ); 278 222 279 223 DEFINE_EVENT(mm_compaction_suitable_template, mm_compaction_finished, ··· 303 247 304 248 TP_STRUCT__entry( 305 249 __field(int, nid) 306 - __field(char *, name) 250 + __field(enum zone_type, idx) 307 251 __field(int, order) 308 252 __field(unsigned int, considered) 309 253 __field(unsigned int, defer_shift) ··· 312 256 313 257 TP_fast_assign( 314 258 __entry->nid = zone_to_nid(zone); 315 - __entry->name = (char *)zone->name; 259 + __entry->idx = zone_idx(zone); 316 260 __entry->order = order; 317 261 __entry->considered = zone->compact_considered; 318 262 __entry->defer_shift = zone->compact_defer_shift; ··· 321 265 322 266 TP_printk("node=%d zone=%-8s order=%d order_failed=%d consider=%u limit=%lu", 323 267 __entry->nid, 324 - __entry->name, 268 + __print_symbolic(__entry->idx, ZONE_TYPE), 325 269 __entry->order, 326 270 __entry->order_failed, 327 271 __entry->considered,
+5
include/uapi/asm-generic/mman-common.h
··· 25 25 # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */ 26 26 #endif 27 27 28 + /* 29 + * Flags for mlock 30 + */ 31 + #define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */ 32 + 28 33 #define MS_ASYNC 1 /* sync memory asynchronously */ 29 34 #define MS_INVALIDATE 2 /* invalidate the caches */ 30 35 #define MS_SYNC 4 /* synchronous memory sync */
+1
include/uapi/asm-generic/mman.h
··· 17 17 18 18 #define MCL_CURRENT 1 /* lock all current mappings */ 19 19 #define MCL_FUTURE 2 /* lock all future mappings */ 20 + #define MCL_ONFAULT 4 /* lock all pages that are faulted in */ 20 21 21 22 #endif /* __ASM_GENERIC_MMAN_H */
+3 -1
include/uapi/asm-generic/unistd.h
··· 713 713 __SYSCALL(__NR_userfaultfd, sys_userfaultfd) 714 714 #define __NR_membarrier 283 715 715 __SYSCALL(__NR_membarrier, sys_membarrier) 716 + #define __NR_mlock2 284 717 + __SYSCALL(__NR_mlock2, sys_mlock2) 716 718 717 719 #undef __NR_syscalls 718 - #define __NR_syscalls 284 720 + #define __NR_syscalls 285 719 721 720 722 /* 721 723 * All syscalls below here should go away really,
+7 -7
kernel/cpuset.c
··· 2598 2598 } 2599 2599 2600 2600 /** 2601 - * cpuset_print_task_mems_allowed - prints task's cpuset and mems_allowed 2602 - * @tsk: pointer to task_struct of some task. 2601 + * cpuset_print_current_mems_allowed - prints current's cpuset and mems_allowed 2603 2602 * 2604 - * Description: Prints @task's name, cpuset name, and cached copy of its 2603 + * Description: Prints current's name, cpuset name, and cached copy of its 2605 2604 * mems_allowed to the kernel log. 2606 2605 */ 2607 - void cpuset_print_task_mems_allowed(struct task_struct *tsk) 2606 + void cpuset_print_current_mems_allowed(void) 2608 2607 { 2609 2608 struct cgroup *cgrp; 2610 2609 2611 2610 rcu_read_lock(); 2612 2611 2613 - cgrp = task_cs(tsk)->css.cgroup; 2614 - pr_info("%s cpuset=", tsk->comm); 2612 + cgrp = task_cs(current)->css.cgroup; 2613 + pr_info("%s cpuset=", current->comm); 2615 2614 pr_cont_cgroup_name(cgrp); 2616 - pr_cont(" mems_allowed=%*pbl\n", nodemask_pr_args(&tsk->mems_allowed)); 2615 + pr_cont(" mems_allowed=%*pbl\n", 2616 + nodemask_pr_args(&current->mems_allowed)); 2617 2617 2618 2618 rcu_read_unlock(); 2619 2619 }
+2 -1
kernel/fork.c
··· 455 455 tmp->vm_mm = mm; 456 456 if (anon_vma_fork(tmp, mpnt)) 457 457 goto fail_nomem_anon_vma_fork; 458 - tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP); 458 + tmp->vm_flags &= 459 + ~(VM_LOCKED|VM_LOCKONFAULT|VM_UFFD_MISSING|VM_UFFD_WP); 459 460 tmp->vm_next = tmp->vm_prev = NULL; 460 461 tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; 461 462 file = tmp->vm_file;
+1
kernel/sys_ni.c
··· 194 194 cond_syscall(sys_munlock); 195 195 cond_syscall(sys_mlockall); 196 196 cond_syscall(sys_munlockall); 197 + cond_syscall(sys_mlock2); 197 198 cond_syscall(sys_mincore); 198 199 cond_syscall(sys_madvise); 199 200 cond_syscall(sys_mremap);
+20
kernel/sysctl.c
··· 888 888 .extra1 = &zero, 889 889 .extra2 = &one, 890 890 }, 891 + #ifdef CONFIG_HARDLOCKUP_DETECTOR 892 + { 893 + .procname = "hardlockup_panic", 894 + .data = &hardlockup_panic, 895 + .maxlen = sizeof(int), 896 + .mode = 0644, 897 + .proc_handler = proc_dointvec_minmax, 898 + .extra1 = &zero, 899 + .extra2 = &one, 900 + }, 901 + #endif 891 902 #ifdef CONFIG_SMP 892 903 { 893 904 .procname = "softlockup_all_cpu_backtrace", 894 905 .data = &sysctl_softlockup_all_cpu_backtrace, 906 + .maxlen = sizeof(int), 907 + .mode = 0644, 908 + .proc_handler = proc_dointvec_minmax, 909 + .extra1 = &zero, 910 + .extra2 = &one, 911 + }, 912 + { 913 + .procname = "hardlockup_all_cpu_backtrace", 914 + .data = &sysctl_hardlockup_all_cpu_backtrace, 895 915 .maxlen = sizeof(int), 896 916 .mode = 0644, 897 917 .proc_handler = proc_dointvec_minmax,
+90 -31
kernel/watchdog.c
··· 57 57 58 58 #ifdef CONFIG_SMP 59 59 int __read_mostly sysctl_softlockup_all_cpu_backtrace; 60 + int __read_mostly sysctl_hardlockup_all_cpu_backtrace; 60 61 #else 61 62 #define sysctl_softlockup_all_cpu_backtrace 0 63 + #define sysctl_hardlockup_all_cpu_backtrace 0 62 64 #endif 63 65 static struct cpumask watchdog_cpumask __read_mostly; 64 66 unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask); ··· 112 110 * Should we panic when a soft-lockup or hard-lockup occurs: 113 111 */ 114 112 #ifdef CONFIG_HARDLOCKUP_DETECTOR 115 - static int hardlockup_panic = 113 + unsigned int __read_mostly hardlockup_panic = 116 114 CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE; 115 + static unsigned long hardlockup_allcpu_dumped; 117 116 /* 118 117 * We may not want to enable hard lockup detection by default in all cases, 119 118 * for example when running the kernel as a guest on a hypervisor. In these ··· 176 173 return 1; 177 174 } 178 175 __setup("softlockup_all_cpu_backtrace=", softlockup_all_cpu_backtrace_setup); 176 + static int __init hardlockup_all_cpu_backtrace_setup(char *str) 177 + { 178 + sysctl_hardlockup_all_cpu_backtrace = 179 + !!simple_strtol(str, NULL, 0); 180 + return 1; 181 + } 182 + __setup("hardlockup_all_cpu_backtrace=", hardlockup_all_cpu_backtrace_setup); 179 183 #endif 180 184 181 185 /* ··· 273 263 274 264 #ifdef CONFIG_HARDLOCKUP_DETECTOR 275 265 /* watchdog detector functions */ 276 - static int is_hardlockup(void) 266 + static bool is_hardlockup(void) 277 267 { 278 268 unsigned long hrint = __this_cpu_read(hrtimer_interrupts); 279 269 280 270 if (__this_cpu_read(hrtimer_interrupts_saved) == hrint) 281 - return 1; 271 + return true; 282 272 283 273 __this_cpu_write(hrtimer_interrupts_saved, hrint); 284 - return 0; 274 + return false; 285 275 } 286 276 #endif 287 277 ··· 289 279 { 290 280 unsigned long now = get_timestamp(); 291 281 292 - if (watchdog_enabled & SOFT_WATCHDOG_ENABLED) { 282 + if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && 
watchdog_thresh){ 293 283 /* Warn about unreasonable delays. */ 294 284 if (time_after(now, touch_ts + get_softlockup_thresh())) 295 285 return now - touch_ts; ··· 328 318 */ 329 319 if (is_hardlockup()) { 330 320 int this_cpu = smp_processor_id(); 321 + struct pt_regs *regs = get_irq_regs(); 331 322 332 323 /* only print hardlockups once */ 333 324 if (__this_cpu_read(hard_watchdog_warn) == true) 334 325 return; 335 326 336 - if (hardlockup_panic) 337 - panic("Watchdog detected hard LOCKUP on cpu %d", 338 - this_cpu); 327 + pr_emerg("Watchdog detected hard LOCKUP on cpu %d", this_cpu); 328 + print_modules(); 329 + print_irqtrace_events(current); 330 + if (regs) 331 + show_regs(regs); 339 332 else 340 - WARN(1, "Watchdog detected hard LOCKUP on cpu %d", 341 - this_cpu); 333 + dump_stack(); 334 + 335 + /* 336 + * Perform all-CPU dump only once to avoid multiple hardlockups 337 + * generating interleaving traces 338 + */ 339 + if (sysctl_hardlockup_all_cpu_backtrace && 340 + !test_and_set_bit(0, &hardlockup_allcpu_dumped)) 341 + trigger_allbutself_cpu_backtrace(); 342 + 343 + if (hardlockup_panic) 344 + panic("Hard LOCKUP"); 342 345 343 346 __this_cpu_write(hard_watchdog_warn, true); 344 347 return; ··· 369 346 370 347 static int watchdog_nmi_enable(unsigned int cpu); 371 348 static void watchdog_nmi_disable(unsigned int cpu); 349 + 350 + static int watchdog_enable_all_cpus(void); 351 + static void watchdog_disable_all_cpus(void); 372 352 373 353 /* watchdog kicker functions */ 374 354 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) ··· 677 651 678 652 /* 679 653 * park all watchdog threads that are specified in 'watchdog_cpumask' 654 + * 655 + * This function returns an error if kthread_park() of a watchdog thread 656 + * fails. In this situation, the watchdog threads of some CPUs can already 657 + * be parked and the watchdog threads of other CPUs can still be runnable. 
658 + * Callers are expected to handle this special condition as appropriate in 659 + * their context. 660 + * 661 + * This function may only be called in a context that is protected against 662 + * races with CPU hotplug - for example, via get_online_cpus(). 680 663 */ 681 664 static int watchdog_park_threads(void) 682 665 { 683 666 int cpu, ret = 0; 684 667 685 - get_online_cpus(); 686 668 for_each_watchdog_cpu(cpu) { 687 669 ret = kthread_park(per_cpu(softlockup_watchdog, cpu)); 688 670 if (ret) 689 671 break; 690 672 } 691 - if (ret) { 692 - for_each_watchdog_cpu(cpu) 693 - kthread_unpark(per_cpu(softlockup_watchdog, cpu)); 694 - } 695 - put_online_cpus(); 696 673 697 674 return ret; 698 675 } 699 676 700 677 /* 701 678 * unpark all watchdog threads that are specified in 'watchdog_cpumask' 679 + * 680 + * This function may only be called in a context that is protected against 681 + * races with CPU hotplug - for example, via get_online_cpus(). 702 682 */ 703 683 static void watchdog_unpark_threads(void) 704 684 { 705 685 int cpu; 706 686 707 - get_online_cpus(); 708 687 for_each_watchdog_cpu(cpu) 709 688 kthread_unpark(per_cpu(softlockup_watchdog, cpu)); 710 - put_online_cpus(); 711 689 } 712 690 713 691 /* ··· 721 691 { 722 692 int ret = 0; 723 693 694 + get_online_cpus(); 724 695 mutex_lock(&watchdog_proc_mutex); 725 696 /* 726 697 * Multiple suspend requests can be active in parallel (counted by ··· 735 704 736 705 if (ret == 0) 737 706 watchdog_suspended++; 707 + else { 708 + watchdog_disable_all_cpus(); 709 + pr_err("Failed to suspend lockup detectors, disabled\n"); 710 + watchdog_enabled = 0; 711 + } 738 712 739 713 mutex_unlock(&watchdog_proc_mutex); 740 714 ··· 762 726 watchdog_unpark_threads(); 763 727 764 728 mutex_unlock(&watchdog_proc_mutex); 729 + put_online_cpus(); 765 730 } 766 731 767 - static void update_watchdog_all_cpus(void) 732 + static int update_watchdog_all_cpus(void) 768 733 { 769 - watchdog_park_threads(); 734 + int ret; 735 + 736 + 
ret = watchdog_park_threads(); 737 + if (ret) 738 + return ret; 739 + 770 740 watchdog_unpark_threads(); 741 + 742 + return 0; 771 743 } 772 744 773 745 static int watchdog_enable_all_cpus(void) ··· 794 750 * Enable/disable the lockup detectors or 795 751 * change the sample period 'on the fly'. 796 752 */ 797 - update_watchdog_all_cpus(); 753 + err = update_watchdog_all_cpus(); 754 + 755 + if (err) { 756 + watchdog_disable_all_cpus(); 757 + pr_err("Failed to update lockup detectors, disabled\n"); 758 + } 798 759 } 760 + 761 + if (err) 762 + watchdog_enabled = 0; 799 763 800 764 return err; 801 765 } 802 766 803 - /* prepare/enable/disable routines */ 804 - /* sysctl functions */ 805 - #ifdef CONFIG_SYSCTL 806 767 static void watchdog_disable_all_cpus(void) 807 768 { 808 769 if (watchdog_running) { ··· 815 766 smpboot_unregister_percpu_thread(&watchdog_threads); 816 767 } 817 768 } 769 + 770 + #ifdef CONFIG_SYSCTL 818 771 819 772 /* 820 773 * Update the run state of the lockup detectors. ··· 859 808 int err, old, new; 860 809 int *watchdog_param = (int *)table->data; 861 810 811 + get_online_cpus(); 862 812 mutex_lock(&watchdog_proc_mutex); 863 813 864 814 if (watchdog_suspended) { ··· 901 849 } while (cmpxchg(&watchdog_enabled, old, new) != old); 902 850 903 851 /* 904 - * Update the run state of the lockup detectors. 905 - * Restore 'watchdog_enabled' on failure. 852 + * Update the run state of the lockup detectors. There is _no_ 853 + * need to check the value returned by proc_watchdog_update() 854 + * and to restore the previous value of 'watchdog_enabled' as 855 + * both lockup detectors are disabled if proc_watchdog_update() 856 + * returns an error. 
906 857 */ 907 858 err = proc_watchdog_update(); 908 - if (err) 909 - watchdog_enabled = old; 910 859 } 911 860 out: 912 861 mutex_unlock(&watchdog_proc_mutex); 862 + put_online_cpus(); 913 863 return err; 914 864 } 915 865 ··· 953 899 { 954 900 int err, old; 955 901 902 + get_online_cpus(); 956 903 mutex_lock(&watchdog_proc_mutex); 957 904 958 905 if (watchdog_suspended) { ··· 969 914 goto out; 970 915 971 916 /* 972 - * Update the sample period. 973 - * Restore 'watchdog_thresh' on failure. 917 + * Update the sample period. Restore on failure. 974 918 */ 975 919 set_sample_period(); 976 920 err = proc_watchdog_update(); 977 - if (err) 921 + if (err) { 978 922 watchdog_thresh = old; 923 + set_sample_period(); 924 + } 979 925 out: 980 926 mutex_unlock(&watchdog_proc_mutex); 927 + put_online_cpus(); 981 928 return err; 982 929 } 983 930 ··· 994 937 { 995 938 int err; 996 939 940 + get_online_cpus(); 997 941 mutex_lock(&watchdog_proc_mutex); 998 942 999 943 if (watchdog_suspended) { ··· 1022 964 } 1023 965 out: 1024 966 mutex_unlock(&watchdog_proc_mutex); 967 + put_online_cpus(); 1025 968 return err; 1026 969 } 1027 970
+1 -2
lib/Kconfig.kasan
··· 15 15 global variables requires gcc 5.0 or later. 16 16 This feature consumes about 1/8 of available memory and brings about 17 17 ~x3 performance slowdown. 18 - For better error detection enable CONFIG_STACKTRACE, 19 - and add slub_debug=U to boot cmdline. 18 + For better error detection enable CONFIG_STACKTRACE. 20 19 21 20 choice 22 21 prompt "Instrumentation type"
+69
lib/test_kasan.c
··· 138 138 kfree(ptr2); 139 139 } 140 140 141 + static noinline void __init kmalloc_oob_memset_2(void) 142 + { 143 + char *ptr; 144 + size_t size = 8; 145 + 146 + pr_info("out-of-bounds in memset2\n"); 147 + ptr = kmalloc(size, GFP_KERNEL); 148 + if (!ptr) { 149 + pr_err("Allocation failed\n"); 150 + return; 151 + } 152 + 153 + memset(ptr+7, 0, 2); 154 + kfree(ptr); 155 + } 156 + 157 + static noinline void __init kmalloc_oob_memset_4(void) 158 + { 159 + char *ptr; 160 + size_t size = 8; 161 + 162 + pr_info("out-of-bounds in memset4\n"); 163 + ptr = kmalloc(size, GFP_KERNEL); 164 + if (!ptr) { 165 + pr_err("Allocation failed\n"); 166 + return; 167 + } 168 + 169 + memset(ptr+5, 0, 4); 170 + kfree(ptr); 171 + } 172 + 173 + 174 + static noinline void __init kmalloc_oob_memset_8(void) 175 + { 176 + char *ptr; 177 + size_t size = 8; 178 + 179 + pr_info("out-of-bounds in memset8\n"); 180 + ptr = kmalloc(size, GFP_KERNEL); 181 + if (!ptr) { 182 + pr_err("Allocation failed\n"); 183 + return; 184 + } 185 + 186 + memset(ptr+1, 0, 8); 187 + kfree(ptr); 188 + } 189 + 190 + static noinline void __init kmalloc_oob_memset_16(void) 191 + { 192 + char *ptr; 193 + size_t size = 16; 194 + 195 + pr_info("out-of-bounds in memset16\n"); 196 + ptr = kmalloc(size, GFP_KERNEL); 197 + if (!ptr) { 198 + pr_err("Allocation failed\n"); 199 + return; 200 + } 201 + 202 + memset(ptr+1, 0, 16); 203 + kfree(ptr); 204 + } 205 + 141 206 static noinline void __init kmalloc_oob_in_memset(void) 142 207 { 143 208 char *ptr; ··· 329 264 kmalloc_oob_krealloc_less(); 330 265 kmalloc_oob_16(); 331 266 kmalloc_oob_in_memset(); 267 + kmalloc_oob_memset_2(); 268 + kmalloc_oob_memset_4(); 269 + kmalloc_oob_memset_8(); 270 + kmalloc_oob_memset_16(); 332 271 kmalloc_uaf(); 333 272 kmalloc_uaf_memset(); 334 273 kmalloc_uaf2();
+2 -8
mm/balloon_compaction.c
··· 199 199 struct balloon_dev_info *balloon = balloon_page_device(page); 200 200 int rc = -EAGAIN; 201 201 202 - /* 203 - * Block others from accessing the 'newpage' when we get around to 204 - * establishing additional references. We should be the only one 205 - * holding a reference to the 'newpage' at this point. 206 - */ 207 - BUG_ON(!trylock_page(newpage)); 202 + VM_BUG_ON_PAGE(!PageLocked(page), page); 203 + VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); 208 204 209 205 if (WARN_ON(!__is_movable_balloon_page(page))) { 210 206 dump_page(page, "not movable balloon page"); 211 - unlock_page(newpage); 212 207 return rc; 213 208 } 214 209 215 210 if (balloon && balloon->migratepage) 216 211 rc = balloon->migratepage(balloon, newpage, page, mode); 217 212 218 - unlock_page(newpage); 219 213 return rc; 220 214 } 221 215 #endif /* CONFIG_BALLOON_COMPACTION */
+4 -2
mm/cma.c
··· 363 363 */ 364 364 struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align) 365 365 { 366 - unsigned long mask, offset, pfn, start = 0; 366 + unsigned long mask, offset; 367 + unsigned long pfn = -1; 368 + unsigned long start = 0; 367 369 unsigned long bitmap_maxno, bitmap_no, bitmap_count; 368 370 struct page *page = NULL; 369 371 int ret; ··· 420 418 start = bitmap_no + mask + 1; 421 419 } 422 420 423 - trace_cma_alloc(page ? pfn : -1UL, page, count, align); 421 + trace_cma_alloc(pfn, page, count, align); 424 422 425 423 pr_debug("%s(): returned %p\n", __func__, page); 426 424 return page;
+20 -26
mm/compaction.c
··· 35 35 #endif 36 36 37 37 #if defined CONFIG_COMPACTION || defined CONFIG_CMA 38 - #ifdef CONFIG_TRACEPOINTS 39 - static const char *const compaction_status_string[] = { 40 - "deferred", 41 - "skipped", 42 - "continue", 43 - "partial", 44 - "complete", 45 - "no_suitable_page", 46 - "not_suitable_zone", 47 - }; 48 - #endif 49 38 50 39 #define CREATE_TRACE_POINTS 51 40 #include <trace/events/compaction.h> ··· 1186 1197 return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE; 1187 1198 } 1188 1199 1200 + /* 1201 + * order == -1 is expected when compacting via 1202 + * /proc/sys/vm/compact_memory 1203 + */ 1204 + static inline bool is_via_compact_memory(int order) 1205 + { 1206 + return order == -1; 1207 + } 1208 + 1189 1209 static int __compact_finished(struct zone *zone, struct compact_control *cc, 1190 1210 const int migratetype) 1191 1211 { ··· 1202 1204 unsigned long watermark; 1203 1205 1204 1206 if (cc->contended || fatal_signal_pending(current)) 1205 - return COMPACT_PARTIAL; 1207 + return COMPACT_CONTENDED; 1206 1208 1207 1209 /* Compaction run completes if the migrate and free scanner meet */ 1208 1210 if (compact_scanners_met(cc)) { ··· 1221 1223 return COMPACT_COMPLETE; 1222 1224 } 1223 1225 1224 - /* 1225 - * order == -1 is expected when compacting via 1226 - * /proc/sys/vm/compact_memory 1227 - */ 1228 - if (cc->order == -1) 1226 + if (is_via_compact_memory(cc->order)) 1229 1227 return COMPACT_CONTINUE; 1230 1228 1231 1229 /* Compaction run is not finished if the watermark is not met */ ··· 1284 1290 int fragindex; 1285 1291 unsigned long watermark; 1286 1292 1287 - /* 1288 - * order == -1 is expected when compacting via 1289 - * /proc/sys/vm/compact_memory 1290 - */ 1291 - if (order == -1) 1293 + if (is_via_compact_memory(order)) 1292 1294 return COMPACT_CONTINUE; 1293 1295 1294 1296 watermark = low_wmark_pages(zone); ··· 1393 1403 1394 1404 switch (isolate_migratepages(zone, cc)) { 1395 1405 case ISOLATE_ABORT: 1396 - ret = COMPACT_PARTIAL; 1406 
+ ret = COMPACT_CONTENDED; 1397 1407 putback_movable_pages(&cc->migratepages); 1398 1408 cc->nr_migratepages = 0; 1399 1409 goto out; ··· 1424 1434 * and we want compact_finished() to detect it 1425 1435 */ 1426 1436 if (err == -ENOMEM && !compact_scanners_met(cc)) { 1427 - ret = COMPACT_PARTIAL; 1437 + ret = COMPACT_CONTENDED; 1428 1438 goto out; 1429 1439 } 1430 1440 } ··· 1476 1486 1477 1487 trace_mm_compaction_end(start_pfn, cc->migrate_pfn, 1478 1488 cc->free_pfn, end_pfn, sync, ret); 1489 + 1490 + if (ret == COMPACT_CONTENDED) 1491 + ret = COMPACT_PARTIAL; 1479 1492 1480 1493 return ret; 1481 1494 } ··· 1651 1658 * this makes sure we compact the whole zone regardless of 1652 1659 * cached scanner positions. 1653 1660 */ 1654 - if (cc->order == -1) 1661 + if (is_via_compact_memory(cc->order)) 1655 1662 __reset_isolation_suitable(zone); 1656 1663 1657 - if (cc->order == -1 || !compaction_deferred(zone, cc->order)) 1664 + if (is_via_compact_memory(cc->order) || 1665 + !compaction_deferred(zone, cc->order)) 1658 1666 compact_zone(zone, cc); 1659 1667 1660 1668 if (cc->order > 0) {
+1
mm/debug.c
··· 125 125 {VM_GROWSDOWN, "growsdown" }, 126 126 {VM_PFNMAP, "pfnmap" }, 127 127 {VM_DENYWRITE, "denywrite" }, 128 + {VM_LOCKONFAULT, "lockonfault" }, 128 129 {VM_LOCKED, "locked" }, 129 130 {VM_IO, "io" }, 130 131 {VM_SEQ_READ, "seqread" },
+3 -3
mm/early_ioremap.c
··· 126 126 /* 127 127 * Mappings have to be page-aligned 128 128 */ 129 - offset = phys_addr & ~PAGE_MASK; 129 + offset = offset_in_page(phys_addr); 130 130 phys_addr &= PAGE_MASK; 131 131 size = PAGE_ALIGN(last_addr + 1) - phys_addr; 132 132 ··· 189 189 if (WARN_ON(virt_addr < fix_to_virt(FIX_BTMAP_BEGIN))) 190 190 return; 191 191 192 - offset = virt_addr & ~PAGE_MASK; 192 + offset = offset_in_page(virt_addr); 193 193 nrpages = PAGE_ALIGN(offset + size) >> PAGE_SHIFT; 194 194 195 195 idx = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*slot; ··· 234 234 char *p; 235 235 236 236 while (size) { 237 - slop = src & ~PAGE_MASK; 237 + slop = offset_in_page(src); 238 238 clen = size; 239 239 if (clen > MAX_MAP_CHUNK - slop) 240 240 clen = MAX_MAP_CHUNK - slop;
+58 -19
mm/filemap.c
··· 331 331 } 332 332 EXPORT_SYMBOL(filemap_flush); 333 333 334 - /** 335 - * filemap_fdatawait_range - wait for writeback to complete 336 - * @mapping: address space structure to wait for 337 - * @start_byte: offset in bytes where the range starts 338 - * @end_byte: offset in bytes where the range ends (inclusive) 339 - * 340 - * Walk the list of under-writeback pages of the given address space 341 - * in the given range and wait for all of them. 342 - */ 343 - int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, 344 - loff_t end_byte) 334 + static int __filemap_fdatawait_range(struct address_space *mapping, 335 + loff_t start_byte, loff_t end_byte) 345 336 { 346 337 pgoff_t index = start_byte >> PAGE_CACHE_SHIFT; 347 338 pgoff_t end = end_byte >> PAGE_CACHE_SHIFT; 348 339 struct pagevec pvec; 349 340 int nr_pages; 350 - int ret2, ret = 0; 341 + int ret = 0; 351 342 352 343 if (end_byte < start_byte) 353 344 goto out; ··· 365 374 cond_resched(); 366 375 } 367 376 out: 377 + return ret; 378 + } 379 + 380 + /** 381 + * filemap_fdatawait_range - wait for writeback to complete 382 + * @mapping: address space structure to wait for 383 + * @start_byte: offset in bytes where the range starts 384 + * @end_byte: offset in bytes where the range ends (inclusive) 385 + * 386 + * Walk the list of under-writeback pages of the given address space 387 + * in the given range and wait for all of them. Check error status of 388 + * the address space and return it. 389 + * 390 + * Since the error status of the address space is cleared by this function, 391 + * callers are responsible for checking the return value and handling and/or 392 + * reporting the error. 
393 + */ 394 + int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, 395 + loff_t end_byte) 396 + { 397 + int ret, ret2; 398 + 399 + ret = __filemap_fdatawait_range(mapping, start_byte, end_byte); 368 400 ret2 = filemap_check_errors(mapping); 369 401 if (!ret) 370 402 ret = ret2; ··· 397 383 EXPORT_SYMBOL(filemap_fdatawait_range); 398 384 399 385 /** 386 + * filemap_fdatawait_keep_errors - wait for writeback without clearing errors 387 + * @mapping: address space structure to wait for 388 + * 389 + * Walk the list of under-writeback pages of the given address space 390 + * and wait for all of them. Unlike filemap_fdatawait(), this function 391 + * does not clear error status of the address space. 392 + * 393 + * Use this function if callers don't handle errors themselves. Expected 394 + * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2), 395 + * fsfreeze(8) 396 + */ 397 + void filemap_fdatawait_keep_errors(struct address_space *mapping) 398 + { 399 + loff_t i_size = i_size_read(mapping->host); 400 + 401 + if (i_size == 0) 402 + return; 403 + 404 + __filemap_fdatawait_range(mapping, 0, i_size - 1); 405 + } 406 + 407 + /** 400 408 * filemap_fdatawait - wait for all under-writeback pages to complete 401 409 * @mapping: address space structure to wait for 402 410 * 403 411 * Walk the list of under-writeback pages of the given address space 404 - * and wait for all of them. 412 + * and wait for all of them. Check error status of the address space 413 + * and return it. 414 + * 415 + * Since the error status of the address space is cleared by this function, 416 + * callers are responsible for checking the return value and handling and/or 417 + * reporting the error. 
405 418 */ 406 419 int filemap_fdatawait(struct address_space *mapping) 407 420 { ··· 551 510 __inc_zone_page_state(new, NR_SHMEM); 552 511 spin_unlock_irqrestore(&mapping->tree_lock, flags); 553 512 mem_cgroup_end_page_stat(memcg); 554 - mem_cgroup_migrate(old, new, true); 513 + mem_cgroup_replace_page(old, new); 555 514 radix_tree_preload_end(); 556 515 if (freepage) 557 516 freepage(old); ··· 1848 1807 struct file *file, 1849 1808 pgoff_t offset) 1850 1809 { 1851 - unsigned long ra_pages; 1852 1810 struct address_space *mapping = file->f_mapping; 1853 1811 1854 1812 /* If we don't want any read-ahead, don't bother */ ··· 1876 1836 /* 1877 1837 * mmap read-around 1878 1838 */ 1879 - ra_pages = max_sane_readahead(ra->ra_pages); 1880 - ra->start = max_t(long, 0, offset - ra_pages / 2); 1881 - ra->size = ra_pages; 1882 - ra->async_size = ra_pages / 4; 1839 + ra->start = max_t(long, 0, offset - ra->ra_pages / 2); 1840 + ra->size = ra->ra_pages; 1841 + ra->async_size = ra->ra_pages / 4; 1883 1842 ra_submit(ra, mapping, file); 1884 1843 } 1885 1844
+1 -1
mm/frame_vector.c
··· 7 7 #include <linux/pagemap.h> 8 8 #include <linux/sched.h> 9 9 10 - /* 10 + /** 11 11 * get_vaddr_frames() - map virtual addresses to pfns 12 12 * @start: starting user address 13 13 * @nr_frames: number of pages / pfns from start to map
+8 -2
mm/gup.c
··· 129 129 */ 130 130 mark_page_accessed(page); 131 131 } 132 - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) { 132 + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { 133 133 /* 134 134 * The preliminary mapping check is mainly to avoid the 135 135 * pointless overhead of lock_page on the ZERO_PAGE ··· 299 299 unsigned int fault_flags = 0; 300 300 int ret; 301 301 302 + /* mlock all present pages, but do not fault in new pages */ 303 + if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK) 304 + return -ENOENT; 302 305 /* For mm_populate(), just skip the stack guard page. */ 303 306 if ((*flags & FOLL_POPULATE) && 304 307 (stack_guard_page_start(vma, address) || ··· 893 890 VM_BUG_ON_VMA(end > vma->vm_end, vma); 894 891 VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm); 895 892 896 - gup_flags = FOLL_TOUCH | FOLL_POPULATE; 893 + gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK; 894 + if (vma->vm_flags & VM_LOCKONFAULT) 895 + gup_flags &= ~FOLL_POPULATE; 896 + 897 897 /* 898 898 * We want to touch writable mappings with a write fault in order 899 899 * to break COW, except for shared mappings because these don't COW
+1 -1
mm/huge_memory.c
··· 1307 1307 pmd, _pmd, 1)) 1308 1308 update_mmu_cache_pmd(vma, addr, pmd); 1309 1309 } 1310 - if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) { 1310 + if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { 1311 1311 if (page->mapping && trylock_page(page)) { 1312 1312 lru_add_drain(); 1313 1313 if (page->mapping)
+124 -15
mm/hugetlb.c
··· 1437 1437 dissolve_free_huge_page(pfn_to_page(pfn)); 1438 1438 } 1439 1439 1440 - static struct page *alloc_buddy_huge_page(struct hstate *h, int nid) 1440 + /* 1441 + * There are 3 ways this can get called: 1442 + * 1. With vma+addr: we use the VMA's memory policy 1443 + * 2. With !vma, but nid=NUMA_NO_NODE: We try to allocate a huge 1444 + * page from any node, and let the buddy allocator itself figure 1445 + * it out. 1446 + * 3. With !vma, but nid!=NUMA_NO_NODE. We allocate a huge page 1447 + * strictly from 'nid' 1448 + */ 1449 + static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, 1450 + struct vm_area_struct *vma, unsigned long addr, int nid) 1451 + { 1452 + int order = huge_page_order(h); 1453 + gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN; 1454 + unsigned int cpuset_mems_cookie; 1455 + 1456 + /* 1457 + * We need a VMA to get a memory policy. If we do not 1458 + * have one, we use the 'nid' argument. 1459 + * 1460 + * The mempolicy stuff below has some non-inlined bits 1461 + * and calls ->vm_ops. That makes it hard to optimize at 1462 + * compile-time, even when NUMA is off and it does 1463 + * nothing. This helps the compiler optimize it out. 1464 + */ 1465 + if (!IS_ENABLED(CONFIG_NUMA) || !vma) { 1466 + /* 1467 + * If a specific node is requested, make sure to 1468 + * get memory from there, but only when a node 1469 + * is explicitly specified. 1470 + */ 1471 + if (nid != NUMA_NO_NODE) 1472 + gfp |= __GFP_THISNODE; 1473 + /* 1474 + * Make sure to call something that can handle 1475 + * nid=NUMA_NO_NODE 1476 + */ 1477 + return alloc_pages_node(nid, gfp, order); 1478 + } 1479 + 1480 + /* 1481 + * OK, so we have a VMA. Fetch the mempolicy and try to 1482 + * allocate a huge page with it. We will only reach this 1483 + * when CONFIG_NUMA=y. 
1484 + */ 1485 + do { 1486 + struct page *page; 1487 + struct mempolicy *mpol; 1488 + struct zonelist *zl; 1489 + nodemask_t *nodemask; 1490 + 1491 + cpuset_mems_cookie = read_mems_allowed_begin(); 1492 + zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask); 1493 + mpol_cond_put(mpol); 1494 + page = __alloc_pages_nodemask(gfp, order, zl, nodemask); 1495 + if (page) 1496 + return page; 1497 + } while (read_mems_allowed_retry(cpuset_mems_cookie)); 1498 + 1499 + return NULL; 1500 + } 1501 + 1502 + /* 1503 + * There are two ways to allocate a huge page: 1504 + * 1. When you have a VMA and an address (like a fault) 1505 + * 2. When you have no VMA (like when setting /proc/.../nr_hugepages) 1506 + * 1507 + * 'vma' and 'addr' are only for (1). 'nid' is always NUMA_NO_NODE in 1508 + * this case which signifies that the allocation should be done with 1509 + * respect for the VMA's memory policy. 1510 + * 1511 + * For (2), we ignore 'vma' and 'addr' and use 'nid' exclusively. This 1512 + * implies that memory policies will not be taken in to account. 1513 + */ 1514 + static struct page *__alloc_buddy_huge_page(struct hstate *h, 1515 + struct vm_area_struct *vma, unsigned long addr, int nid) 1441 1516 { 1442 1517 struct page *page; 1443 1518 unsigned int r_nid; ··· 1520 1445 if (hstate_is_gigantic(h)) 1521 1446 return NULL; 1522 1447 1448 + /* 1449 + * Make sure that anyone specifying 'nid' is not also specifying a VMA. 1450 + * This makes sure the caller is picking _one_ of the modes with which 1451 + * we can call this function, not both. 
1452 + */ 1453 + if (vma || (addr != -1)) { 1454 + VM_WARN_ON_ONCE(addr == -1); 1455 + VM_WARN_ON_ONCE(nid != NUMA_NO_NODE); 1456 + } 1523 1457 /* 1524 1458 * Assume we will successfully allocate the surplus page to 1525 1459 * prevent racing processes from causing the surplus to exceed ··· 1562 1478 } 1563 1479 spin_unlock(&hugetlb_lock); 1564 1480 1565 - if (nid == NUMA_NO_NODE) 1566 - page = alloc_pages(htlb_alloc_mask(h)|__GFP_COMP| 1567 - __GFP_REPEAT|__GFP_NOWARN, 1568 - huge_page_order(h)); 1569 - else 1570 - page = __alloc_pages_node(nid, 1571 - htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| 1572 - __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h)); 1481 + page = __hugetlb_alloc_buddy_huge_page(h, vma, addr, nid); 1573 1482 1574 1483 spin_lock(&hugetlb_lock); 1575 1484 if (page) { ··· 1587 1510 } 1588 1511 1589 1512 /* 1513 + * Allocate a huge page from 'nid'. Note, 'nid' may be 1514 + * NUMA_NO_NODE, which means that it may be allocated 1515 + * anywhere. 1516 + */ 1517 + static 1518 + struct page *__alloc_buddy_huge_page_no_mpol(struct hstate *h, int nid) 1519 + { 1520 + unsigned long addr = -1; 1521 + 1522 + return __alloc_buddy_huge_page(h, NULL, addr, nid); 1523 + } 1524 + 1525 + /* 1526 + * Use the VMA's mpolicy to allocate a huge page from the buddy. 1527 + */ 1528 + static 1529 + struct page *__alloc_buddy_huge_page_with_mpol(struct hstate *h, 1530 + struct vm_area_struct *vma, unsigned long addr) 1531 + { 1532 + return __alloc_buddy_huge_page(h, vma, addr, NUMA_NO_NODE); 1533 + } 1534 + 1535 + /* 1590 1536 * This allocation function is useful in the context where vma is irrelevant. 1591 1537 * E.g. soft-offlining uses this function because it only cares physical 1592 1538 * address of error page. 
··· 1624 1524 spin_unlock(&hugetlb_lock); 1625 1525 1626 1526 if (!page) 1627 - page = alloc_buddy_huge_page(h, nid); 1527 + page = __alloc_buddy_huge_page_no_mpol(h, nid); 1628 1528 1629 1529 return page; 1630 1530 } ··· 1654 1554 retry: 1655 1555 spin_unlock(&hugetlb_lock); 1656 1556 for (i = 0; i < needed; i++) { 1657 - page = alloc_buddy_huge_page(h, NUMA_NO_NODE); 1557 + page = __alloc_buddy_huge_page_no_mpol(h, NUMA_NO_NODE); 1658 1558 if (!page) { 1659 1559 alloc_ok = false; 1660 1560 break; ··· 1887 1787 page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg); 1888 1788 if (!page) { 1889 1789 spin_unlock(&hugetlb_lock); 1890 - page = alloc_buddy_huge_page(h, NUMA_NO_NODE); 1790 + page = __alloc_buddy_huge_page_with_mpol(h, vma, addr); 1891 1791 if (!page) 1892 1792 goto out_uncharge_cgroup; 1893 1793 ··· 2476 2376 struct kobject *hugepages_kobj; 2477 2377 struct kobject *hstate_kobjs[HUGE_MAX_HSTATE]; 2478 2378 }; 2479 - struct node_hstate node_hstates[MAX_NUMNODES]; 2379 + static struct node_hstate node_hstates[MAX_NUMNODES]; 2480 2380 2481 2381 /* 2482 2382 * A subset of global hstate attributes for node devices ··· 2890 2790 1UL << (huge_page_order(h) + PAGE_SHIFT - 10)); 2891 2791 } 2892 2792 2793 + void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm) 2794 + { 2795 + seq_printf(m, "HugetlbPages:\t%8lu kB\n", 2796 + atomic_long_read(&mm->hugetlb_usage) << (PAGE_SHIFT - 10)); 2797 + } 2798 + 2893 2799 /* Return the number pages of memory we physically have, in PAGE_SIZE units. 
*/ 2894 2800 unsigned long hugetlb_total_pages(void) 2895 2801 { ··· 3131 3025 get_page(ptepage); 3132 3026 page_dup_rmap(ptepage); 3133 3027 set_huge_pte_at(dst, addr, dst_pte, entry); 3028 + hugetlb_count_add(pages_per_huge_page(h), dst); 3134 3029 } 3135 3030 spin_unlock(src_ptl); 3136 3031 spin_unlock(dst_ptl); ··· 3212 3105 if (huge_pte_dirty(pte)) 3213 3106 set_page_dirty(page); 3214 3107 3108 + hugetlb_count_sub(pages_per_huge_page(h), mm); 3215 3109 page_remove_rmap(page); 3216 3110 force_flush = !__tlb_remove_page(tlb, page); 3217 3111 if (force_flush) { ··· 3617 3509 && (vma->vm_flags & VM_SHARED))); 3618 3510 set_huge_pte_at(mm, address, ptep, new_pte); 3619 3511 3512 + hugetlb_count_add(pages_per_huge_page(h), mm); 3620 3513 if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { 3621 3514 /* Optimization, do the COW without a second fault */ 3622 3515 ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl); ··· 4137 4028 unsigned long s_end = sbase + PUD_SIZE; 4138 4029 4139 4030 /* Allow segments to share if only one is marked locked */ 4140 - unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED; 4141 - unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED; 4031 + unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK; 4032 + unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK; 4142 4033 4143 4034 /* 4144 4035 * match the virtual addresses, permission and the alignment of the
+2 -1
mm/hugetlb_cgroup.c
··· 186 186 } 187 187 rcu_read_unlock(); 188 188 189 - ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter); 189 + if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) 190 + ret = -ENOMEM; 190 191 css_put(&h_cg->css); 191 192 done: 192 193 *ptr = h_cg;
+4 -5
mm/internal.h
··· 271 271 extern void clear_page_mlock(struct page *page); 272 272 273 273 /* 274 - * mlock_migrate_page - called only from migrate_page_copy() to 275 - * migrate the Mlocked page flag; update statistics. 274 + * mlock_migrate_page - called only from migrate_misplaced_transhuge_page() 275 + * (because that does not go through the full procedure of migration ptes): 276 + * to migrate the Mlocked page flag; update statistics. 276 277 */ 277 278 static inline void mlock_migrate_page(struct page *newpage, struct page *page) 278 279 { 279 280 if (TestClearPageMlocked(page)) { 280 - unsigned long flags; 281 281 int nr_pages = hpage_nr_pages(page); 282 282 283 - local_irq_save(flags); 283 + /* Holding pmd lock, no change in irq context: __mod is safe */ 284 284 __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); 285 285 SetPageMlocked(newpage); 286 286 __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages); 287 - local_irq_restore(flags); 288 287 } 289 288 } 290 289
+26 -12
mm/kasan/kasan.c
··· 4 4 * Copyright (c) 2014 Samsung Electronics Co., Ltd. 5 5 * Author: Andrey Ryabinin <ryabinin.a.a@gmail.com> 6 6 * 7 - * Some of code borrowed from https://github.com/xairy/linux by 7 + * Some code borrowed from https://github.com/xairy/kasan-prototype by 8 8 * Andrey Konovalov <adech.fo@gmail.com> 9 9 * 10 10 * This program is free software; you can redistribute it and/or modify ··· 86 86 if (memory_is_poisoned_1(addr + 1)) 87 87 return true; 88 88 89 + /* 90 + * If single shadow byte covers 2-byte access, we don't 91 + * need to do anything more. Otherwise, test the first 92 + * shadow byte. 93 + */ 89 94 if (likely(((addr + 1) & KASAN_SHADOW_MASK) != 0)) 90 95 return false; 91 96 ··· 108 103 if (memory_is_poisoned_1(addr + 3)) 109 104 return true; 110 105 106 + /* 107 + * If single shadow byte covers 4-byte access, we don't 108 + * need to do anything more. Otherwise, test the first 109 + * shadow byte. 110 + */ 111 111 if (likely(((addr + 3) & KASAN_SHADOW_MASK) >= 3)) 112 112 return false; 113 113 ··· 130 120 if (memory_is_poisoned_1(addr + 7)) 131 121 return true; 132 122 133 - if (likely(((addr + 7) & KASAN_SHADOW_MASK) >= 7)) 123 + /* 124 + * If single shadow byte covers 8-byte access, we don't 125 + * need to do anything more. Otherwise, test the first 126 + * shadow byte. 127 + */ 128 + if (likely(IS_ALIGNED(addr, KASAN_SHADOW_SCALE_SIZE))) 134 129 return false; 135 130 136 131 return unlikely(*(u8 *)shadow_addr); ··· 154 139 if (unlikely(shadow_first_bytes)) 155 140 return true; 156 141 157 - if (likely(IS_ALIGNED(addr, 8))) 142 + /* 143 + * If two shadow bytes covers 16-byte access, we don't 144 + * need to do anything more. Otherwise, test the last 145 + * shadow byte. 
146 + */ 147 + if (likely(IS_ALIGNED(addr, KASAN_SHADOW_SCALE_SIZE))) 158 148 return false; 159 149 160 150 return memory_is_poisoned_1(addr + 15); ··· 223 203 s8 *last_shadow = (s8 *)kasan_mem_to_shadow((void *)last_byte); 224 204 225 205 if (unlikely(ret != (unsigned long)last_shadow || 226 - ((last_byte & KASAN_SHADOW_MASK) >= *last_shadow))) 206 + ((long)(last_byte & KASAN_SHADOW_MASK) >= *last_shadow))) 227 207 return true; 228 208 } 229 209 return false; ··· 255 235 static __always_inline void check_memory_region(unsigned long addr, 256 236 size_t size, bool write) 257 237 { 258 - struct kasan_access_info info; 259 - 260 238 if (unlikely(size == 0)) 261 239 return; 262 240 263 241 if (unlikely((void *)addr < 264 242 kasan_shadow_to_mem((void *)KASAN_SHADOW_START))) { 265 - info.access_addr = (void *)addr; 266 - info.access_size = size; 267 - info.is_write = write; 268 - info.ip = _RET_IP_; 269 - kasan_report_user_access(&info); 243 + kasan_report(addr, size, write, _RET_IP_); 270 244 return; 271 245 } 272 246 ··· 538 524 539 525 static int __init kasan_memhotplug_init(void) 540 526 { 541 - pr_err("WARNING: KASan doesn't support memory hot-add\n"); 527 + pr_err("WARNING: KASAN doesn't support memory hot-add\n"); 542 528 pr_err("Memory hot-add will be disabled\n"); 543 529 544 530 hotplug_memory_notifier(kasan_mem_notifier, 0);
+1 -4
mm/kasan/kasan.h
··· 54 54 #endif 55 55 }; 56 56 57 - void kasan_report_error(struct kasan_access_info *info); 58 - void kasan_report_user_access(struct kasan_access_info *info); 59 - 60 57 static inline const void *kasan_shadow_to_mem(const void *shadow_addr) 61 58 { 62 59 return (void *)(((unsigned long)shadow_addr - KASAN_SHADOW_OFFSET) 63 60 << KASAN_SHADOW_SCALE_SHIFT); 64 61 } 65 62 66 - static inline bool kasan_enabled(void) 63 + static inline bool kasan_report_enabled(void) 67 64 { 68 65 return !current->kasan_depth; 69 66 }
+71 -42
mm/kasan/report.c
··· 4 4 * Copyright (c) 2014 Samsung Electronics Co., Ltd. 5 5 * Author: Andrey Ryabinin <ryabinin.a.a@gmail.com> 6 6 * 7 - * Some of code borrowed from https://github.com/xairy/linux by 7 + * Some code borrowed from https://github.com/xairy/kasan-prototype by 8 8 * Andrey Konovalov <adech.fo@gmail.com> 9 9 * 10 10 * This program is free software; you can redistribute it and/or modify ··· 22 22 #include <linux/string.h> 23 23 #include <linux/types.h> 24 24 #include <linux/kasan.h> 25 + #include <linux/module.h> 25 26 26 27 #include <asm/sections.h> 27 28 ··· 49 48 50 49 static void print_error_description(struct kasan_access_info *info) 51 50 { 52 - const char *bug_type = "unknown crash"; 53 - u8 shadow_val; 51 + const char *bug_type = "unknown-crash"; 52 + u8 *shadow_addr; 54 53 55 54 info->first_bad_addr = find_first_bad_addr(info->access_addr, 56 55 info->access_size); 57 56 58 - shadow_val = *(u8 *)kasan_mem_to_shadow(info->first_bad_addr); 57 + shadow_addr = (u8 *)kasan_mem_to_shadow(info->first_bad_addr); 59 58 60 - switch (shadow_val) { 61 - case KASAN_FREE_PAGE: 62 - case KASAN_KMALLOC_FREE: 63 - bug_type = "use after free"; 59 + /* 60 + * If shadow byte value is in [0, KASAN_SHADOW_SCALE_SIZE) we can look 61 + * at the next shadow byte to determine the type of the bad access. 62 + */ 63 + if (*shadow_addr > 0 && *shadow_addr <= KASAN_SHADOW_SCALE_SIZE - 1) 64 + shadow_addr++; 65 + 66 + switch (*shadow_addr) { 67 + case 0 ... KASAN_SHADOW_SCALE_SIZE - 1: 68 + /* 69 + * In theory it's still possible to see these shadow values 70 + * due to a data race in the kernel code. 71 + */ 72 + bug_type = "out-of-bounds"; 64 73 break; 65 74 case KASAN_PAGE_REDZONE: 66 75 case KASAN_KMALLOC_REDZONE: 76 + bug_type = "slab-out-of-bounds"; 77 + break; 67 78 case KASAN_GLOBAL_REDZONE: 68 - case 0 ... 
KASAN_SHADOW_SCALE_SIZE - 1: 69 - bug_type = "out of bounds access"; 79 + bug_type = "global-out-of-bounds"; 70 80 break; 71 81 case KASAN_STACK_LEFT: 72 82 case KASAN_STACK_MID: 73 83 case KASAN_STACK_RIGHT: 74 84 case KASAN_STACK_PARTIAL: 75 - bug_type = "out of bounds on stack"; 85 + bug_type = "stack-out-of-bounds"; 86 + break; 87 + case KASAN_FREE_PAGE: 88 + case KASAN_KMALLOC_FREE: 89 + bug_type = "use-after-free"; 76 90 break; 77 91 } 78 92 79 - pr_err("BUG: KASan: %s in %pS at addr %p\n", 93 + pr_err("BUG: KASAN: %s in %pS at addr %p\n", 80 94 bug_type, (void *)info->ip, 81 95 info->access_addr); 82 96 pr_err("%s of size %zu by task %s/%d\n", ··· 101 85 102 86 static inline bool kernel_or_module_addr(const void *addr) 103 87 { 104 - return (addr >= (void *)_stext && addr < (void *)_end) 105 - || (addr >= (void *)MODULES_VADDR 106 - && addr < (void *)MODULES_END); 88 + if (addr >= (void *)_stext && addr < (void *)_end) 89 + return true; 90 + if (is_module_address((unsigned long)addr)) 91 + return true; 92 + return false; 107 93 } 108 94 109 95 static inline bool init_task_stack_addr(const void *addr) ··· 179 161 for (i = -SHADOW_ROWS_AROUND_ADDR; i <= SHADOW_ROWS_AROUND_ADDR; i++) { 180 162 const void *kaddr = kasan_shadow_to_mem(shadow_row); 181 163 char buffer[4 + (BITS_PER_LONG/8)*2]; 164 + char shadow_buf[SHADOW_BYTES_PER_ROW]; 182 165 183 166 snprintf(buffer, sizeof(buffer), 184 167 (i == 0) ? ">%p: " : " %p: ", kaddr); 185 - 186 - kasan_disable_current(); 168 + /* 169 + * We should not pass a shadow pointer to generic 170 + * function, because generic functions may try to 171 + * access kasan mapping for the passed address. 
172 + */ 173 + memcpy(shadow_buf, shadow_row, SHADOW_BYTES_PER_ROW); 187 174 print_hex_dump(KERN_ERR, buffer, 188 175 DUMP_PREFIX_NONE, SHADOW_BYTES_PER_ROW, 1, 189 - shadow_row, SHADOW_BYTES_PER_ROW, 0); 190 - kasan_enable_current(); 176 + shadow_buf, SHADOW_BYTES_PER_ROW, 0); 191 177 192 178 if (row_is_guilty(shadow_row, shadow)) 193 179 pr_err("%*c\n", ··· 204 182 205 183 static DEFINE_SPINLOCK(report_lock); 206 184 207 - void kasan_report_error(struct kasan_access_info *info) 185 + static void kasan_report_error(struct kasan_access_info *info) 208 186 { 209 187 unsigned long flags; 188 + const char *bug_type; 210 189 190 + /* 191 + * Make sure we don't end up in loop. 192 + */ 193 + kasan_disable_current(); 211 194 spin_lock_irqsave(&report_lock, flags); 212 195 pr_err("=================================" 213 196 "=================================\n"); 214 - print_error_description(info); 215 - print_address_description(info); 216 - print_shadow_for_address(info->first_bad_addr); 197 + if (info->access_addr < 198 + kasan_shadow_to_mem((void *)KASAN_SHADOW_START)) { 199 + if ((unsigned long)info->access_addr < PAGE_SIZE) 200 + bug_type = "null-ptr-deref"; 201 + else if ((unsigned long)info->access_addr < TASK_SIZE) 202 + bug_type = "user-memory-access"; 203 + else 204 + bug_type = "wild-memory-access"; 205 + pr_err("BUG: KASAN: %s on address %p\n", 206 + bug_type, info->access_addr); 207 + pr_err("%s of size %zu by task %s/%d\n", 208 + info->is_write ? 
"Write" : "Read", 209 + info->access_size, current->comm, 210 + task_pid_nr(current)); 211 + dump_stack(); 212 + } else { 213 + print_error_description(info); 214 + print_address_description(info); 215 + print_shadow_for_address(info->first_bad_addr); 216 + } 217 217 pr_err("=================================" 218 218 "=================================\n"); 219 + add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); 219 220 spin_unlock_irqrestore(&report_lock, flags); 220 - } 221 - 222 - void kasan_report_user_access(struct kasan_access_info *info) 223 - { 224 - unsigned long flags; 225 - 226 - spin_lock_irqsave(&report_lock, flags); 227 - pr_err("=================================" 228 - "=================================\n"); 229 - pr_err("BUG: KASan: user-memory-access on address %p\n", 230 - info->access_addr); 231 - pr_err("%s of size %zu by task %s/%d\n", 232 - info->is_write ? "Write" : "Read", 233 - info->access_size, current->comm, task_pid_nr(current)); 234 - dump_stack(); 235 - pr_err("=================================" 236 - "=================================\n"); 237 - spin_unlock_irqrestore(&report_lock, flags); 221 + kasan_enable_current(); 238 222 } 239 223 240 224 void kasan_report(unsigned long addr, size_t size, ··· 248 220 { 249 221 struct kasan_access_info info; 250 222 251 - if (likely(!kasan_enabled())) 223 + if (likely(!kasan_report_enabled())) 252 224 return; 253 225 254 226 info.access_addr = (void *)addr; 255 227 info.access_size = size; 256 228 info.is_write = is_write; 257 229 info.ip = ip; 230 + 258 231 kasan_report_error(&info); 259 232 } 260 233
+1 -1
mm/kmemleak.c
··· 479 479 static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias) 480 480 { 481 481 unsigned long flags; 482 - struct kmemleak_object *object = NULL; 482 + struct kmemleak_object *object; 483 483 484 484 rcu_read_lock(); 485 485 read_lock_irqsave(&kmemleak_lock, flags);
+35 -14
mm/ksm.c
··· 475 475 flush_dcache_page(page); 476 476 } else { 477 477 put_page(page); 478 - out: page = NULL; 478 + out: 479 + page = NULL; 479 480 } 480 481 up_read(&mm->mmap_sem); 481 482 return page; ··· 626 625 unlock_page(page); 627 626 put_page(page); 628 627 629 - if (stable_node->hlist.first) 628 + if (!hlist_empty(&stable_node->hlist)) 630 629 ksm_pages_sharing--; 631 630 else 632 631 ksm_pages_shared--; ··· 1022 1021 if (page == kpage) /* ksm page forked */ 1023 1022 return 0; 1024 1023 1025 - if (!(vma->vm_flags & VM_MERGEABLE)) 1026 - goto out; 1027 1024 if (PageTransCompound(page) && page_trans_compound_anon_split(page)) 1028 1025 goto out; 1029 1026 BUG_ON(PageTransCompound(page)); ··· 1086 1087 int err = -EFAULT; 1087 1088 1088 1089 down_read(&mm->mmap_sem); 1089 - if (ksm_test_exit(mm)) 1090 - goto out; 1091 - vma = find_vma(mm, rmap_item->address); 1092 - if (!vma || vma->vm_start > rmap_item->address) 1090 + vma = find_mergeable_vma(mm, rmap_item->address); 1091 + if (!vma) 1093 1092 goto out; 1094 1093 1095 1094 err = try_to_merge_one_page(vma, page, kpage); ··· 1174 1177 cond_resched(); 1175 1178 stable_node = rb_entry(*new, struct stable_node, node); 1176 1179 tree_page = get_ksm_page(stable_node, false); 1177 - if (!tree_page) 1178 - return NULL; 1180 + if (!tree_page) { 1181 + /* 1182 + * If we walked over a stale stable_node, 1183 + * get_ksm_page() will call rb_erase() and it 1184 + * may rebalance the tree from under us. So 1185 + * restart the search from scratch. Returning 1186 + * NULL would be safe too, but we'd generate 1187 + * false negative insertions just because some 1188 + * stable_node was stale. 
1189 + */ 1190 + goto again; 1191 + } 1179 1192 1180 1193 ret = memcmp_pages(page, tree_page); 1181 1194 put_page(tree_page); ··· 1261 1254 unsigned long kpfn; 1262 1255 struct rb_root *root; 1263 1256 struct rb_node **new; 1264 - struct rb_node *parent = NULL; 1257 + struct rb_node *parent; 1265 1258 struct stable_node *stable_node; 1266 1259 1267 1260 kpfn = page_to_pfn(kpage); 1268 1261 nid = get_kpfn_nid(kpfn); 1269 1262 root = root_stable_tree + nid; 1263 + again: 1264 + parent = NULL; 1270 1265 new = &root->rb_node; 1271 1266 1272 1267 while (*new) { ··· 1278 1269 cond_resched(); 1279 1270 stable_node = rb_entry(*new, struct stable_node, node); 1280 1271 tree_page = get_ksm_page(stable_node, false); 1281 - if (!tree_page) 1282 - return NULL; 1272 + if (!tree_page) { 1273 + /* 1274 + * If we walked over a stale stable_node, 1275 + * get_ksm_page() will call rb_erase() and it 1276 + * may rebalance the tree from under us. So 1277 + * restart the search from scratch. Returning 1278 + * NULL would be safe too, but we'd generate 1279 + * false negative insertions just because some 1280 + * stable_node was stale. 1281 + */ 1282 + goto again; 1283 + } 1283 1284 1284 1285 ret = memcmp_pages(kpage, tree_page); 1285 1286 put_page(tree_page); ··· 1359 1340 cond_resched(); 1360 1341 tree_rmap_item = rb_entry(*new, struct rmap_item, node); 1361 1342 tree_page = get_mergeable_page(tree_rmap_item); 1362 - if (IS_ERR_OR_NULL(tree_page)) 1343 + if (!tree_page) 1363 1344 return NULL; 1364 1345 1365 1346 /* ··· 1933 1914 struct anon_vma_chain *vmac; 1934 1915 struct vm_area_struct *vma; 1935 1916 1917 + cond_resched(); 1936 1918 anon_vma_lock_read(anon_vma); 1937 1919 anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 1938 1920 0, ULONG_MAX) { 1921 + cond_resched(); 1939 1922 vma = vmac->vma; 1940 1923 if (rmap_item->address < vma->vm_start || 1941 1924 rmap_item->address >= vma->vm_end)
+33 -11
mm/list_lru.c
··· 42 42 #ifdef CONFIG_MEMCG_KMEM 43 43 static inline bool list_lru_memcg_aware(struct list_lru *lru) 44 44 { 45 + /* 46 + * This needs node 0 to be always present, even 47 + * in the systems supporting sparse numa ids. 48 + */ 45 49 return !!lru->node[0].memcg_lrus; 46 50 } 47 51 ··· 61 57 return nlru->memcg_lrus->lru[idx]; 62 58 63 59 return &nlru->lru; 60 + } 61 + 62 + static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr) 63 + { 64 + struct page *page; 65 + 66 + if (!memcg_kmem_enabled()) 67 + return NULL; 68 + page = virt_to_head_page(ptr); 69 + return page->mem_cgroup; 64 70 } 65 71 66 72 static inline struct list_lru_one * ··· 391 377 { 392 378 int i; 393 379 394 - for (i = 0; i < nr_node_ids; i++) { 395 - if (!memcg_aware) 396 - lru->node[i].memcg_lrus = NULL; 397 - else if (memcg_init_list_lru_node(&lru->node[i])) 380 + if (!memcg_aware) 381 + return 0; 382 + 383 + for_each_node(i) { 384 + if (memcg_init_list_lru_node(&lru->node[i])) 398 385 goto fail; 399 386 } 400 387 return 0; 401 388 fail: 402 - for (i = i - 1; i >= 0; i--) 389 + for (i = i - 1; i >= 0; i--) { 390 + if (!lru->node[i].memcg_lrus) 391 + continue; 403 392 memcg_destroy_list_lru_node(&lru->node[i]); 393 + } 404 394 return -ENOMEM; 405 395 } 406 396 ··· 415 397 if (!list_lru_memcg_aware(lru)) 416 398 return; 417 399 418 - for (i = 0; i < nr_node_ids; i++) 400 + for_each_node(i) 419 401 memcg_destroy_list_lru_node(&lru->node[i]); 420 402 } 421 403 ··· 427 409 if (!list_lru_memcg_aware(lru)) 428 410 return 0; 429 411 430 - for (i = 0; i < nr_node_ids; i++) { 412 + for_each_node(i) { 431 413 if (memcg_update_list_lru_node(&lru->node[i], 432 414 old_size, new_size)) 433 415 goto fail; 434 416 } 435 417 return 0; 436 418 fail: 437 - for (i = i - 1; i >= 0; i--) 419 + for (i = i - 1; i >= 0; i--) { 420 + if (!lru->node[i].memcg_lrus) 421 + continue; 422 + 438 423 memcg_cancel_update_list_lru_node(&lru->node[i], 439 424 old_size, new_size); 425 + } 440 426 return -ENOMEM; 441 
427 } 442 428 ··· 452 430 if (!list_lru_memcg_aware(lru)) 453 431 return; 454 432 455 - for (i = 0; i < nr_node_ids; i++) 433 + for_each_node(i) 456 434 memcg_cancel_update_list_lru_node(&lru->node[i], 457 435 old_size, new_size); 458 436 } ··· 507 485 if (!list_lru_memcg_aware(lru)) 508 486 return; 509 487 510 - for (i = 0; i < nr_node_ids; i++) 488 + for_each_node(i) 511 489 memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx); 512 490 } 513 491 ··· 544 522 if (!lru->node) 545 523 goto out; 546 524 547 - for (i = 0; i < nr_node_ids; i++) { 525 + for_each_node(i) { 548 526 spin_lock_init(&lru->node[i].lock); 549 527 if (key) 550 528 lockdep_set_class(&lru->node[i].lock, key);
+6 -1
mm/maccess.c
··· 13 13 * 14 14 * Safely read from address @src to the buffer at @dst. If a kernel fault 15 15 * happens, handle that and return -EFAULT. 16 + * 17 + * We ensure that the copy_from_user is executed in atomic context so that 18 + * do_page_fault() doesn't attempt to take mmap_sem. This makes 19 + * probe_kernel_read() suitable for use within regions where the caller 20 + * already holds mmap_sem, or other locks which nest inside mmap_sem. 16 21 */ 17 22 18 23 long __weak probe_kernel_read(void *dst, const void *src, size_t size) ··· 104 99 pagefault_enable(); 105 100 set_fs(old_fs); 106 101 107 - return ret < 0 ? ret : src - unsafe_addr; 102 + return ret ? -EFAULT : src - unsafe_addr; 108 103 }
+1 -1
mm/memblock.c
··· 706 706 return 0; 707 707 } 708 708 709 - int __init_memblock memblock_remove_range(struct memblock_type *type, 709 + static int __init_memblock memblock_remove_range(struct memblock_type *type, 710 710 phys_addr_t base, phys_addr_t size) 711 711 { 712 712 int start_rgn, end_rgn;
+122 -185
mm/memcontrol.c
··· 62 62 #include <linux/oom.h> 63 63 #include <linux/lockdep.h> 64 64 #include <linux/file.h> 65 + #include <linux/tracehook.h> 65 66 #include "internal.h" 66 67 #include <net/sock.h> 67 68 #include <net/ip.h> ··· 1662 1661 1663 1662 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) 1664 1663 { 1665 - if (!current->memcg_oom.may_oom) 1664 + if (!current->memcg_may_oom) 1666 1665 return; 1667 1666 /* 1668 1667 * We are in the middle of the charge context here, so we ··· 1679 1678 * and when we know whether the fault was overall successful. 1680 1679 */ 1681 1680 css_get(&memcg->css); 1682 - current->memcg_oom.memcg = memcg; 1683 - current->memcg_oom.gfp_mask = mask; 1684 - current->memcg_oom.order = order; 1681 + current->memcg_in_oom = memcg; 1682 + current->memcg_oom_gfp_mask = mask; 1683 + current->memcg_oom_order = order; 1685 1684 } 1686 1685 1687 1686 /** ··· 1703 1702 */ 1704 1703 bool mem_cgroup_oom_synchronize(bool handle) 1705 1704 { 1706 - struct mem_cgroup *memcg = current->memcg_oom.memcg; 1705 + struct mem_cgroup *memcg = current->memcg_in_oom; 1707 1706 struct oom_wait_info owait; 1708 1707 bool locked; 1709 1708 ··· 1731 1730 if (locked && !memcg->oom_kill_disable) { 1732 1731 mem_cgroup_unmark_under_oom(memcg); 1733 1732 finish_wait(&memcg_oom_waitq, &owait.wait); 1734 - mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask, 1735 - current->memcg_oom.order); 1733 + mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask, 1734 + current->memcg_oom_order); 1736 1735 } else { 1737 1736 schedule(); 1738 1737 mem_cgroup_unmark_under_oom(memcg); ··· 1749 1748 memcg_oom_recover(memcg); 1750 1749 } 1751 1750 cleanup: 1752 - current->memcg_oom.memcg = NULL; 1751 + current->memcg_in_oom = NULL; 1753 1752 css_put(&memcg->css); 1754 1753 return true; 1755 1754 } ··· 1973 1972 return NOTIFY_OK; 1974 1973 } 1975 1974 1975 + /* 1976 + * Scheduled by try_charge() to be executed from the userland return path 1977 + * and 
reclaims memory over the high limit. 1978 + */ 1979 + void mem_cgroup_handle_over_high(void) 1980 + { 1981 + unsigned int nr_pages = current->memcg_nr_pages_over_high; 1982 + struct mem_cgroup *memcg, *pos; 1983 + 1984 + if (likely(!nr_pages)) 1985 + return; 1986 + 1987 + pos = memcg = get_mem_cgroup_from_mm(current->mm); 1988 + 1989 + do { 1990 + if (page_counter_read(&pos->memory) <= pos->high) 1991 + continue; 1992 + mem_cgroup_events(pos, MEMCG_HIGH, 1); 1993 + try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true); 1994 + } while ((pos = parent_mem_cgroup(pos))); 1995 + 1996 + css_put(&memcg->css); 1997 + current->memcg_nr_pages_over_high = 0; 1998 + } 1999 + 1976 2000 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, 1977 2001 unsigned int nr_pages) 1978 2002 { ··· 2008 1982 unsigned long nr_reclaimed; 2009 1983 bool may_swap = true; 2010 1984 bool drained = false; 2011 - int ret = 0; 2012 1985 2013 1986 if (mem_cgroup_is_root(memcg)) 2014 - goto done; 1987 + return 0; 2015 1988 retry: 2016 1989 if (consume_stock(memcg, nr_pages)) 2017 - goto done; 1990 + return 0; 2018 1991 2019 1992 if (!do_swap_account || 2020 - !page_counter_try_charge(&memcg->memsw, batch, &counter)) { 2021 - if (!page_counter_try_charge(&memcg->memory, batch, &counter)) 1993 + page_counter_try_charge(&memcg->memsw, batch, &counter)) { 1994 + if (page_counter_try_charge(&memcg->memory, batch, &counter)) 2022 1995 goto done_restock; 2023 1996 if (do_swap_account) 2024 1997 page_counter_uncharge(&memcg->memsw, batch); ··· 2041 2016 if (unlikely(test_thread_flag(TIF_MEMDIE) || 2042 2017 fatal_signal_pending(current) || 2043 2018 current->flags & PF_EXITING)) 2044 - goto bypass; 2019 + goto force; 2045 2020 2046 2021 if (unlikely(task_in_memcg_oom(current))) 2047 2022 goto nomem; ··· 2087 2062 goto retry; 2088 2063 2089 2064 if (gfp_mask & __GFP_NOFAIL) 2090 - goto bypass; 2065 + goto force; 2091 2066 2092 2067 if (fatal_signal_pending(current)) 2093 - goto bypass; 2068 
+ goto force; 2094 2069 2095 2070 mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1); 2096 2071 2097 - mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages)); 2072 + mem_cgroup_oom(mem_over_limit, gfp_mask, 2073 + get_order(nr_pages * PAGE_SIZE)); 2098 2074 nomem: 2099 2075 if (!(gfp_mask & __GFP_NOFAIL)) 2100 2076 return -ENOMEM; 2101 - bypass: 2102 - return -EINTR; 2077 + force: 2078 + /* 2079 + * The allocation either can't fail or will lead to more memory 2080 + * being freed very soon. Allow memory usage go over the limit 2081 + * temporarily by force charging it. 2082 + */ 2083 + page_counter_charge(&memcg->memory, nr_pages); 2084 + if (do_swap_account) 2085 + page_counter_charge(&memcg->memsw, nr_pages); 2086 + css_get_many(&memcg->css, nr_pages); 2087 + 2088 + return 0; 2103 2089 2104 2090 done_restock: 2105 2091 css_get_many(&memcg->css, batch); 2106 2092 if (batch > nr_pages) 2107 2093 refill_stock(memcg, batch - nr_pages); 2108 - if (!(gfp_mask & __GFP_WAIT)) 2109 - goto done; 2094 + 2110 2095 /* 2111 - * If the hierarchy is above the normal consumption range, 2112 - * make the charging task trim their excess contribution. 2096 + * If the hierarchy is above the normal consumption range, schedule 2097 + * reclaim on returning to userland. We can perform reclaim here 2098 + * if __GFP_WAIT but let's always punt for simplicity and so that 2099 + * GFP_KERNEL can consistently be used during reclaim. @memcg is 2100 + * not recorded as it most likely matches current's and won't 2101 + * change in the meantime. As high limit is checked again before 2102 + * reclaim, the cost of mismatch is negligible. 
2113 2103 */ 2114 2104 do { 2115 - if (page_counter_read(&memcg->memory) <= memcg->high) 2116 - continue; 2117 - mem_cgroup_events(memcg, MEMCG_HIGH, 1); 2118 - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); 2105 + if (page_counter_read(&memcg->memory) > memcg->high) { 2106 + current->memcg_nr_pages_over_high += nr_pages; 2107 + set_notify_resume(current); 2108 + break; 2109 + } 2119 2110 } while ((memcg = parent_mem_cgroup(memcg))); 2120 - done: 2121 - return ret; 2111 + 2112 + return 0; 2122 2113 } 2123 2114 2124 2115 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) ··· 2215 2174 } 2216 2175 2217 2176 #ifdef CONFIG_MEMCG_KMEM 2218 - int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, 2219 - unsigned long nr_pages) 2220 - { 2221 - struct page_counter *counter; 2222 - int ret = 0; 2223 - 2224 - ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter); 2225 - if (ret < 0) 2226 - return ret; 2227 - 2228 - ret = try_charge(memcg, gfp, nr_pages); 2229 - if (ret == -EINTR) { 2230 - /* 2231 - * try_charge() chose to bypass to root due to OOM kill or 2232 - * fatal signal. Since our only options are to either fail 2233 - * the allocation or charge it to this cgroup, do it as a 2234 - * temporary condition. But we can't fail. From a kmem/slab 2235 - * perspective, the cache has already been selected, by 2236 - * mem_cgroup_kmem_get_cache(), so it is too late to change 2237 - * our minds. 2238 - * 2239 - * This condition will only trigger if the task entered 2240 - * memcg_charge_kmem in a sane state, but was OOM-killed 2241 - * during try_charge() above. 
Tasks that were already dying 2242 - * when the allocation triggers should have been already 2243 - * directed to the root cgroup in memcontrol.h 2244 - */ 2245 - page_counter_charge(&memcg->memory, nr_pages); 2246 - if (do_swap_account) 2247 - page_counter_charge(&memcg->memsw, nr_pages); 2248 - css_get_many(&memcg->css, nr_pages); 2249 - ret = 0; 2250 - } else if (ret) 2251 - page_counter_uncharge(&memcg->kmem, nr_pages); 2252 - 2253 - return ret; 2254 - } 2255 - 2256 - void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages) 2257 - { 2258 - page_counter_uncharge(&memcg->memory, nr_pages); 2259 - if (do_swap_account) 2260 - page_counter_uncharge(&memcg->memsw, nr_pages); 2261 - 2262 - page_counter_uncharge(&memcg->kmem, nr_pages); 2263 - 2264 - css_put_many(&memcg->css, nr_pages); 2265 - } 2266 - 2267 2177 static int memcg_alloc_cache_id(void) 2268 2178 { 2269 2179 int id, size; ··· 2376 2384 css_put(&cachep->memcg_params.memcg->css); 2377 2385 } 2378 2386 2379 - /* 2380 - * We need to verify if the allocation against current->mm->owner's memcg is 2381 - * possible for the given order. But the page is not allocated yet, so we'll 2382 - * need a further commit step to do the final arrangements. 2383 - * 2384 - * It is possible for the task to switch cgroups in this mean time, so at 2385 - * commit time, we can't rely on task conversion any longer. We'll then use 2386 - * the handle argument to return to the caller which cgroup we should commit 2387 - * against. We could also return the memcg directly and avoid the pointer 2388 - * passing, but a boolean return value gives better semantics considering 2389 - * the compiled-out case as well. 2390 - * 2391 - * Returning true means the allocation is possible. 
2392 - */ 2393 - bool 2394 - __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order) 2387 + int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order, 2388 + struct mem_cgroup *memcg) 2389 + { 2390 + unsigned int nr_pages = 1 << order; 2391 + struct page_counter *counter; 2392 + int ret; 2393 + 2394 + if (!memcg_kmem_is_active(memcg)) 2395 + return 0; 2396 + 2397 + if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) 2398 + return -ENOMEM; 2399 + 2400 + ret = try_charge(memcg, gfp, nr_pages); 2401 + if (ret) { 2402 + page_counter_uncharge(&memcg->kmem, nr_pages); 2403 + return ret; 2404 + } 2405 + 2406 + page->mem_cgroup = memcg; 2407 + 2408 + return 0; 2409 + } 2410 + 2411 + int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order) 2395 2412 { 2396 2413 struct mem_cgroup *memcg; 2397 2414 int ret; 2398 2415 2399 - *_memcg = NULL; 2400 - 2401 2416 memcg = get_mem_cgroup_from_mm(current->mm); 2402 - 2403 - if (!memcg_kmem_is_active(memcg)) { 2404 - css_put(&memcg->css); 2405 - return true; 2406 - } 2407 - 2408 - ret = memcg_charge_kmem(memcg, gfp, 1 << order); 2409 - if (!ret) 2410 - *_memcg = memcg; 2411 - 2417 + ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg); 2412 2418 css_put(&memcg->css); 2413 - return (ret == 0); 2419 + return ret; 2414 2420 } 2415 2421 2416 - void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, 2417 - int order) 2418 - { 2419 - VM_BUG_ON(mem_cgroup_is_root(memcg)); 2420 - 2421 - /* The page allocation failed. 
Revert */ 2422 - if (!page) { 2423 - memcg_uncharge_kmem(memcg, 1 << order); 2424 - return; 2425 - } 2426 - page->mem_cgroup = memcg; 2427 - } 2428 - 2429 - void __memcg_kmem_uncharge_pages(struct page *page, int order) 2422 + void __memcg_kmem_uncharge(struct page *page, int order) 2430 2423 { 2431 2424 struct mem_cgroup *memcg = page->mem_cgroup; 2425 + unsigned int nr_pages = 1 << order; 2432 2426 2433 2427 if (!memcg) 2434 2428 return; 2435 2429 2436 2430 VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); 2437 2431 2438 - memcg_uncharge_kmem(memcg, 1 << order); 2432 + page_counter_uncharge(&memcg->kmem, nr_pages); 2433 + page_counter_uncharge(&memcg->memory, nr_pages); 2434 + if (do_swap_account) 2435 + page_counter_uncharge(&memcg->memsw, nr_pages); 2436 + 2439 2437 page->mem_cgroup = NULL; 2440 - } 2441 - 2442 - struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr) 2443 - { 2444 - struct mem_cgroup *memcg = NULL; 2445 - struct kmem_cache *cachep; 2446 - struct page *page; 2447 - 2448 - page = virt_to_head_page(ptr); 2449 - if (PageSlab(page)) { 2450 - cachep = page->slab_cache; 2451 - if (!is_root_cache(cachep)) 2452 - memcg = cachep->memcg_params.memcg; 2453 - } else 2454 - /* page allocated by alloc_kmem_pages */ 2455 - memcg = page->mem_cgroup; 2456 - 2457 - return memcg; 2438 + css_put_many(&memcg->css, nr_pages); 2458 2439 } 2459 2440 #endif /* CONFIG_MEMCG_KMEM */ 2460 2441 ··· 2801 2836 return val; 2802 2837 } 2803 2838 2804 - static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 2839 + static inline unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) 2805 2840 { 2806 - u64 val; 2841 + unsigned long val; 2807 2842 2808 2843 if (mem_cgroup_is_root(memcg)) { 2809 2844 val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE); ··· 2816 2851 else 2817 2852 val = page_counter_read(&memcg->memsw); 2818 2853 } 2819 - return val << PAGE_SHIFT; 2854 + return val; 2820 2855 } 2821 2856 2822 2857 enum { ··· 2850 2885 switch 
(MEMFILE_ATTR(cft->private)) { 2851 2886 case RES_USAGE: 2852 2887 if (counter == &memcg->memory) 2853 - return mem_cgroup_usage(memcg, false); 2888 + return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE; 2854 2889 if (counter == &memcg->memsw) 2855 - return mem_cgroup_usage(memcg, true); 2890 + return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE; 2856 2891 return (u64)page_counter_read(counter) * PAGE_SIZE; 2857 2892 case RES_LIMIT: 2858 2893 return (u64)counter->limit * PAGE_SIZE; ··· 3352 3387 ret = page_counter_memparse(args, "-1", &threshold); 3353 3388 if (ret) 3354 3389 return ret; 3355 - threshold <<= PAGE_SHIFT; 3356 3390 3357 3391 mutex_lock(&memcg->thresholds_lock); 3358 3392 ··· 4370 4406 mc.precharge += count; 4371 4407 return ret; 4372 4408 } 4373 - if (ret == -EINTR) { 4374 - cancel_charge(root_mem_cgroup, count); 4375 - return ret; 4376 - } 4377 4409 4378 4410 /* Try charges one by one with reclaim */ 4379 4411 while (count--) { 4380 4412 ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1); 4381 - /* 4382 - * In case of failure, any residual charges against 4383 - * mc.to will be dropped by mem_cgroup_clear_mc() 4384 - * later on. However, cancel any charges that are 4385 - * bypassed to root right away or they'll be lost. 4386 - */ 4387 - if (ret == -EINTR) 4388 - cancel_charge(root_mem_cgroup, 1); 4389 4413 if (ret) 4390 4414 return ret; 4391 4415 mc.precharge++; ··· 4528 4576 goto out; 4529 4577 4530 4578 /* 4531 - * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup 4532 - * of its source page while we change it: page migration takes 4533 - * both pages off the LRU, but page cache replacement doesn't. 4579 + * Prevent mem_cgroup_replace_page() from looking at 4580 + * page->mem_cgroup of its source page while we change it. 
4534 4581 */ 4535 4582 if (!trylock_page(page)) 4536 4583 goto out; ··· 5036 5085 static u64 memory_current_read(struct cgroup_subsys_state *css, 5037 5086 struct cftype *cft) 5038 5087 { 5039 - return mem_cgroup_usage(mem_cgroup_from_css(css), false); 5088 + struct mem_cgroup *memcg = mem_cgroup_from_css(css); 5089 + 5090 + return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE; 5040 5091 } 5041 5092 5042 5093 static int memory_low_show(struct seq_file *m, void *v) ··· 5150 5197 static struct cftype memory_files[] = { 5151 5198 { 5152 5199 .name = "current", 5200 + .flags = CFTYPE_NOT_ON_ROOT, 5153 5201 .read_u64 = memory_current_read, 5154 5202 }, 5155 5203 { ··· 5294 5340 ret = try_charge(memcg, gfp_mask, nr_pages); 5295 5341 5296 5342 css_put(&memcg->css); 5297 - 5298 - if (ret == -EINTR) { 5299 - memcg = root_mem_cgroup; 5300 - ret = 0; 5301 - } 5302 5343 out: 5303 5344 *memcgp = memcg; 5304 5345 return ret; ··· 5508 5559 } 5509 5560 5510 5561 /** 5511 - * mem_cgroup_migrate - migrate a charge to another page 5562 + * mem_cgroup_replace_page - migrate a charge to another page 5512 5563 * @oldpage: currently charged page 5513 5564 * @newpage: page to transfer the charge to 5514 5565 * @lrucare: either or both pages might be on the LRU already ··· 5517 5568 * 5518 5569 * Both pages must be locked, @newpage->mapping must be set up. 
5519 5570 */ 5520 - void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, 5521 - bool lrucare) 5571 + void mem_cgroup_replace_page(struct page *oldpage, struct page *newpage) 5522 5572 { 5523 5573 struct mem_cgroup *memcg; 5524 5574 int isolated; 5525 5575 5526 5576 VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); 5527 5577 VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); 5528 - VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage); 5529 - VM_BUG_ON_PAGE(!lrucare && PageLRU(newpage), newpage); 5530 5578 VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage); 5531 5579 VM_BUG_ON_PAGE(PageTransHuge(oldpage) != PageTransHuge(newpage), 5532 5580 newpage); ··· 5535 5589 if (newpage->mem_cgroup) 5536 5590 return; 5537 5591 5538 - /* 5539 - * Swapcache readahead pages can get migrated before being 5540 - * charged, and migration from compaction can happen to an 5541 - * uncharged page when the PFN walker finds a page that 5542 - * reclaim just put back on the LRU but has not released yet. 5543 - */ 5592 + /* Swapcache readahead pages can get replaced before being charged */ 5544 5593 memcg = oldpage->mem_cgroup; 5545 5594 if (!memcg) 5546 5595 return; 5547 5596 5548 - if (lrucare) 5549 - lock_page_lru(oldpage, &isolated); 5550 - 5597 + lock_page_lru(oldpage, &isolated); 5551 5598 oldpage->mem_cgroup = NULL; 5599 + unlock_page_lru(oldpage, isolated); 5552 5600 5553 - if (lrucare) 5554 - unlock_page_lru(oldpage, isolated); 5555 - 5556 - commit_charge(newpage, memcg, lrucare); 5601 + commit_charge(newpage, memcg, true); 5557 5602 } 5558 5603 5559 5604 /*
+25 -9
mm/memory-failure.c
··· 56 56 #include <linux/memory_hotplug.h> 57 57 #include <linux/mm_inline.h> 58 58 #include <linux/kfifo.h> 59 + #include <linux/ratelimit.h> 59 60 #include "internal.h" 60 61 #include "ras/ras_event.h" 61 62 ··· 1404 1403 } 1405 1404 core_initcall(memory_failure_init); 1406 1405 1406 + #define unpoison_pr_info(fmt, pfn, rs) \ 1407 + ({ \ 1408 + if (__ratelimit(rs)) \ 1409 + pr_info(fmt, pfn); \ 1410 + }) 1411 + 1407 1412 /** 1408 1413 * unpoison_memory - Unpoison a previously poisoned page 1409 1414 * @pfn: Page number of the to be unpoisoned page ··· 1428 1421 struct page *p; 1429 1422 int freeit = 0; 1430 1423 unsigned int nr_pages; 1424 + static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL, 1425 + DEFAULT_RATELIMIT_BURST); 1431 1426 1432 1427 if (!pfn_valid(pfn)) 1433 1428 return -ENXIO; ··· 1438 1429 page = compound_head(p); 1439 1430 1440 1431 if (!PageHWPoison(p)) { 1441 - pr_info("MCE: Page was already unpoisoned %#lx\n", pfn); 1432 + unpoison_pr_info("MCE: Page was already unpoisoned %#lx\n", 1433 + pfn, &unpoison_rs); 1442 1434 return 0; 1443 1435 } 1444 1436 1445 1437 if (page_count(page) > 1) { 1446 - pr_info("MCE: Someone grabs the hwpoison page %#lx\n", pfn); 1438 + unpoison_pr_info("MCE: Someone grabs the hwpoison page %#lx\n", 1439 + pfn, &unpoison_rs); 1447 1440 return 0; 1448 1441 } 1449 1442 1450 1443 if (page_mapped(page)) { 1451 - pr_info("MCE: Someone maps the hwpoison page %#lx\n", pfn); 1444 + unpoison_pr_info("MCE: Someone maps the hwpoison page %#lx\n", 1445 + pfn, &unpoison_rs); 1452 1446 return 0; 1453 1447 } 1454 1448 1455 1449 if (page_mapping(page)) { 1456 - pr_info("MCE: the hwpoison page has non-NULL mapping %#lx\n", 1457 - pfn); 1450 + unpoison_pr_info("MCE: the hwpoison page has non-NULL mapping %#lx\n", 1451 + pfn, &unpoison_rs); 1458 1452 return 0; 1459 1453 } 1460 1454 ··· 1467 1455 * In such case, we yield to memory_failure() and make unpoison fail. 
1468 1456 */ 1469 1457 if (!PageHuge(page) && PageTransHuge(page)) { 1470 - pr_info("MCE: Memory failure is now running on %#lx\n", pfn); 1458 + unpoison_pr_info("MCE: Memory failure is now running on %#lx\n", 1459 + pfn, &unpoison_rs); 1471 1460 return 0; 1472 1461 } 1473 1462 ··· 1482 1469 * to the end. 1483 1470 */ 1484 1471 if (PageHuge(page)) { 1485 - pr_info("MCE: Memory failure is now running on free hugepage %#lx\n", pfn); 1472 + unpoison_pr_info("MCE: Memory failure is now running on free hugepage %#lx\n", 1473 + pfn, &unpoison_rs); 1486 1474 return 0; 1487 1475 } 1488 1476 if (TestClearPageHWPoison(p)) 1489 1477 num_poisoned_pages_dec(); 1490 - pr_info("MCE: Software-unpoisoned free page %#lx\n", pfn); 1478 + unpoison_pr_info("MCE: Software-unpoisoned free page %#lx\n", 1479 + pfn, &unpoison_rs); 1491 1480 return 0; 1492 1481 } 1493 1482 ··· 1501 1486 * the free buddy page pool. 1502 1487 */ 1503 1488 if (TestClearPageHWPoison(page)) { 1504 - pr_info("MCE: Software-unpoisoned page %#lx\n", pfn); 1489 + unpoison_pr_info("MCE: Software-unpoisoned page %#lx\n", 1490 + pfn, &unpoison_rs); 1505 1491 num_poisoned_pages_sub(nr_pages); 1506 1492 freeit = 1; 1507 1493 if (PageHuge(page))
+2 -2
mm/memory_hotplug.c
··· 339 339 unsigned long start_pfn, unsigned long num_pages) 340 340 { 341 341 if (!zone_is_initialized(zone)) 342 - return init_currently_empty_zone(zone, start_pfn, num_pages, 343 - MEMMAP_HOTPLUG); 342 + return init_currently_empty_zone(zone, start_pfn, num_pages); 343 + 344 344 return 0; 345 345 } 346 346
+125 -122
mm/migrate.c
··· 1 1 /* 2 - * Memory Migration functionality - linux/mm/migration.c 2 + * Memory Migration functionality - linux/mm/migrate.c 3 3 * 4 4 * Copyright (C) 2006 Silicon Graphics, Inc., Christoph Lameter 5 5 * ··· 30 30 #include <linux/mempolicy.h> 31 31 #include <linux/vmalloc.h> 32 32 #include <linux/security.h> 33 - #include <linux/memcontrol.h> 33 + #include <linux/backing-dev.h> 34 34 #include <linux/syscalls.h> 35 35 #include <linux/hugetlb.h> 36 36 #include <linux/hugetlb_cgroup.h> ··· 170 170 page_add_anon_rmap(new, vma, addr); 171 171 else 172 172 page_add_file_rmap(new); 173 + 174 + if (vma->vm_flags & VM_LOCKED) 175 + mlock_vma_page(new); 173 176 174 177 /* No need to invalidate - it was non-present before */ 175 178 update_mmu_cache(vma, addr, ptep); ··· 314 311 struct buffer_head *head, enum migrate_mode mode, 315 312 int extra_count) 316 313 { 314 + struct zone *oldzone, *newzone; 315 + int dirty; 317 316 int expected_count = 1 + extra_count; 318 317 void **pslot; 319 318 ··· 323 318 /* Anonymous page without mapping */ 324 319 if (page_count(page) != expected_count) 325 320 return -EAGAIN; 321 + 322 + /* No turning back from here */ 323 + set_page_memcg(newpage, page_memcg(page)); 324 + newpage->index = page->index; 325 + newpage->mapping = page->mapping; 326 + if (PageSwapBacked(page)) 327 + SetPageSwapBacked(newpage); 328 + 326 329 return MIGRATEPAGE_SUCCESS; 327 330 } 331 + 332 + oldzone = page_zone(page); 333 + newzone = page_zone(newpage); 328 334 329 335 spin_lock_irq(&mapping->tree_lock); 330 336 ··· 369 353 } 370 354 371 355 /* 372 - * Now we know that no one else is looking at the page. 356 + * Now we know that no one else is looking at the page: 357 + * no turning back from here. 
373 358 */ 359 + set_page_memcg(newpage, page_memcg(page)); 360 + newpage->index = page->index; 361 + newpage->mapping = page->mapping; 362 + if (PageSwapBacked(page)) 363 + SetPageSwapBacked(newpage); 364 + 374 365 get_page(newpage); /* add cache reference */ 375 366 if (PageSwapCache(page)) { 376 367 SetPageSwapCache(newpage); 377 368 set_page_private(newpage, page_private(page)); 369 + } 370 + 371 + /* Move dirty while page refs frozen and newpage not yet exposed */ 372 + dirty = PageDirty(page); 373 + if (dirty) { 374 + ClearPageDirty(page); 375 + SetPageDirty(newpage); 378 376 } 379 377 380 378 radix_tree_replace_slot(pslot, newpage); ··· 400 370 */ 401 371 page_unfreeze_refs(page, expected_count - 1); 402 372 373 + spin_unlock(&mapping->tree_lock); 374 + /* Leave irq disabled to prevent preemption while updating stats */ 375 + 403 376 /* 404 377 * If moved to a different zone then also account 405 378 * the page for that zone. Other VM counters will be ··· 413 380 * via NR_FILE_PAGES and NR_ANON_PAGES if they 414 381 * are mapped to swap space. 
415 382 */ 416 - __dec_zone_page_state(page, NR_FILE_PAGES); 417 - __inc_zone_page_state(newpage, NR_FILE_PAGES); 418 - if (!PageSwapCache(page) && PageSwapBacked(page)) { 419 - __dec_zone_page_state(page, NR_SHMEM); 420 - __inc_zone_page_state(newpage, NR_SHMEM); 383 + if (newzone != oldzone) { 384 + __dec_zone_state(oldzone, NR_FILE_PAGES); 385 + __inc_zone_state(newzone, NR_FILE_PAGES); 386 + if (PageSwapBacked(page) && !PageSwapCache(page)) { 387 + __dec_zone_state(oldzone, NR_SHMEM); 388 + __inc_zone_state(newzone, NR_SHMEM); 389 + } 390 + if (dirty && mapping_cap_account_dirty(mapping)) { 391 + __dec_zone_state(oldzone, NR_FILE_DIRTY); 392 + __inc_zone_state(newzone, NR_FILE_DIRTY); 393 + } 421 394 } 422 - spin_unlock_irq(&mapping->tree_lock); 395 + local_irq_enable(); 423 396 424 397 return MIGRATEPAGE_SUCCESS; 425 398 } ··· 439 400 { 440 401 int expected_count; 441 402 void **pslot; 442 - 443 - if (!mapping) { 444 - if (page_count(page) != 1) 445 - return -EAGAIN; 446 - return MIGRATEPAGE_SUCCESS; 447 - } 448 403 449 404 spin_lock_irq(&mapping->tree_lock); 450 405 ··· 457 424 return -EAGAIN; 458 425 } 459 426 427 + set_page_memcg(newpage, page_memcg(page)); 428 + newpage->index = page->index; 429 + newpage->mapping = page->mapping; 460 430 get_page(newpage); 461 431 462 432 radix_tree_replace_slot(pslot, newpage); ··· 546 510 if (PageMappedToDisk(page)) 547 511 SetPageMappedToDisk(newpage); 548 512 549 - if (PageDirty(page)) { 550 - clear_page_dirty_for_io(page); 551 - /* 552 - * Want to mark the page and the radix tree as dirty, and 553 - * redo the accounting that clear_page_dirty_for_io undid, 554 - * but we can't use set_page_dirty because that function 555 - * is actually a signal that all of the page has become dirty. 556 - * Whereas only part of our page may be dirty. 
557 - */ 558 - if (PageSwapBacked(page)) 559 - SetPageDirty(newpage); 560 - else 561 - __set_page_dirty_nobuffers(newpage); 562 - } 513 + /* Move dirty on pages not done by migrate_page_move_mapping() */ 514 + if (PageDirty(page)) 515 + SetPageDirty(newpage); 563 516 564 517 if (page_is_young(page)) 565 518 set_page_young(newpage); ··· 562 537 cpupid = page_cpupid_xchg_last(page, -1); 563 538 page_cpupid_xchg_last(newpage, cpupid); 564 539 565 - mlock_migrate_page(newpage, page); 566 540 ksm_migrate_page(newpage, page); 567 541 /* 568 542 * Please do not reorder this without considering how mm/ksm.c's ··· 745 721 * MIGRATEPAGE_SUCCESS - success 746 722 */ 747 723 static int move_to_new_page(struct page *newpage, struct page *page, 748 - int page_was_mapped, enum migrate_mode mode) 724 + enum migrate_mode mode) 749 725 { 750 726 struct address_space *mapping; 751 727 int rc; 752 728 753 - /* 754 - * Block others from accessing the page when we get around to 755 - * establishing additional references. We are the only one 756 - * holding a reference to the new page at this point. 757 - */ 758 - if (!trylock_page(newpage)) 759 - BUG(); 760 - 761 - /* Prepare mapping for the new page.*/ 762 - newpage->index = page->index; 763 - newpage->mapping = page->mapping; 764 - if (PageSwapBacked(page)) 765 - SetPageSwapBacked(newpage); 766 - 767 - /* 768 - * Indirectly called below, migrate_page_copy() copies PG_dirty and thus 769 - * needs newpage's memcg set to transfer memcg dirty page accounting. 770 - * So perform memcg migration in two steps: 771 - * 1. set newpage->mem_cgroup (here) 772 - * 2. clear page->mem_cgroup (below) 773 - */ 774 - set_page_memcg(newpage, page_memcg(page)); 729 + VM_BUG_ON_PAGE(!PageLocked(page), page); 730 + VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); 775 731 776 732 mapping = page_mapping(page); 777 733 if (!mapping) ··· 763 759 * space which also has its own migratepage callback. This 764 760 * is the most common path for page migration. 
765 761 */ 766 - rc = mapping->a_ops->migratepage(mapping, 767 - newpage, page, mode); 762 + rc = mapping->a_ops->migratepage(mapping, newpage, page, mode); 768 763 else 769 764 rc = fallback_migrate_page(mapping, newpage, page, mode); 770 765 771 - if (rc != MIGRATEPAGE_SUCCESS) { 772 - set_page_memcg(newpage, NULL); 773 - newpage->mapping = NULL; 774 - } else { 766 + /* 767 + * When successful, old pagecache page->mapping must be cleared before 768 + * page is freed; but stats require that PageAnon be left as PageAnon. 769 + */ 770 + if (rc == MIGRATEPAGE_SUCCESS) { 775 771 set_page_memcg(page, NULL); 776 - if (page_was_mapped) 777 - remove_migration_ptes(page, newpage); 778 - page->mapping = NULL; 772 + if (!PageAnon(page)) 773 + page->mapping = NULL; 779 774 } 780 - 781 - unlock_page(newpage); 782 - 783 775 return rc; 784 776 } 785 777 ··· 824 824 goto out_unlock; 825 825 wait_on_page_writeback(page); 826 826 } 827 + 827 828 /* 828 829 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, 829 830 * we cannot notice that anon_vma is freed while we migrates a page. ··· 832 831 * of migration. File cache pages are no problem because of page_lock() 833 832 * File Caches may use write_page() or lock_page() in migration, then, 834 833 * just care Anon page here. 834 + * 835 + * Only page_get_anon_vma() understands the subtleties of 836 + * getting a hold on an anon_vma from outside one of its mms. 837 + * But if we cannot get anon_vma, then we won't need it anyway, 838 + * because that implies that the anon page is no longer mapped 839 + * (and cannot be remapped so long as we hold the page lock). 835 840 */ 836 - if (PageAnon(page) && !PageKsm(page)) { 837 - /* 838 - * Only page_lock_anon_vma_read() understands the subtleties of 839 - * getting a hold on an anon_vma from outside one of its mms. 
840 - */ 841 + if (PageAnon(page) && !PageKsm(page)) 841 842 anon_vma = page_get_anon_vma(page); 842 - if (anon_vma) { 843 - /* 844 - * Anon page 845 - */ 846 - } else if (PageSwapCache(page)) { 847 - /* 848 - * We cannot be sure that the anon_vma of an unmapped 849 - * swapcache page is safe to use because we don't 850 - * know in advance if the VMA that this page belonged 851 - * to still exists. If the VMA and others sharing the 852 - * data have been freed, then the anon_vma could 853 - * already be invalid. 854 - * 855 - * To avoid this possibility, swapcache pages get 856 - * migrated but are not remapped when migration 857 - * completes 858 - */ 859 - } else { 860 - goto out_unlock; 861 - } 862 - } 843 + 844 + /* 845 + * Block others from accessing the new page when we get around to 846 + * establishing additional references. We are usually the only one 847 + * holding a reference to newpage at this point. We used to have a BUG 848 + * here if trylock_page(newpage) fails, but would like to allow for 849 + * cases where there might be a race with the previous use of newpage. 850 + * This is much like races on refcount of oldpage: just don't BUG(). 851 + */ 852 + if (unlikely(!trylock_page(newpage))) 853 + goto out_unlock; 863 854 864 855 if (unlikely(isolated_balloon_page(page))) { 865 856 /* ··· 862 869 * the page migration right away (proteced by page lock). 
 	 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock;
+		goto out_unlock_both;
 	}
 
 	/*
···
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto out_unlock;
+			goto out_unlock_both;
 		}
-		goto skip_unmap;
-	}
-
-	/* Establish migration ptes or remove ptes */
-	if (page_mapped(page)) {
+	} else if (page_mapped(page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
+				page);
 		try_to_unmap(page,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 		page_was_mapped = 1;
 	}
 
-skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, page_was_mapped, mode);
+		rc = move_to_new_page(newpage, page, mode);
 
-	if (rc && page_was_mapped)
-		remove_migration_ptes(page, page);
+	if (page_was_mapped)
+		remove_migration_ptes(page,
+			rc == MIGRATEPAGE_SUCCESS ? newpage : page);
 
+out_unlock_both:
+	unlock_page(newpage);
+out_unlock:
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
 		put_anon_vma(anon_vma);
-
-out_unlock:
 	unlock_page(page);
out:
 	return rc;
···
 			int force, enum migrate_mode mode,
 			enum migrate_reason reason)
{
-	int rc = 0;
+	int rc = MIGRATEPAGE_SUCCESS;
 	int *result = NULL;
-	struct page *newpage = get_new_page(page, private, &result);
+	struct page *newpage;
 
+	newpage = get_new_page(page, private, &result);
 	if (!newpage)
 		return -ENOMEM;
···
 		goto out;
 
 	rc = __unmap_and_move(page, newpage, force, mode);
+	if (rc == MIGRATEPAGE_SUCCESS)
+		put_new_page = NULL;
 
out:
 	if (rc != -EAGAIN) {
···
 	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
 	 * during isolation.
 	 */
-	if (rc != MIGRATEPAGE_SUCCESS && put_new_page) {
-		ClearPageSwapBacked(newpage);
+	if (put_new_page)
 		put_new_page(newpage, private);
-	} else if (unlikely(__is_movable_balloon_page(newpage))) {
+	else if (unlikely(__is_movable_balloon_page(newpage))) {
 		/* drop our reference, page already in the balloon */
 		put_page(newpage);
 	} else
···
 				struct page *hpage, int force,
 				enum migrate_mode mode)
{
-	int rc = 0;
+	int rc = -EAGAIN;
 	int *result = NULL;
 	int page_was_mapped = 0;
 	struct page *new_hpage;
···
 	if (!new_hpage)
 		return -ENOMEM;
 
-	rc = -EAGAIN;
-
 	if (!trylock_page(hpage)) {
 		if (!force || mode != MIGRATE_SYNC)
 			goto out;
···
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
+	if (unlikely(!trylock_page(new_hpage)))
+		goto put_anon;
+
 	if (page_mapped(hpage)) {
 		try_to_unmap(hpage,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
···
 	}
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, page_was_mapped, mode);
+		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (rc != MIGRATEPAGE_SUCCESS && page_was_mapped)
-		remove_migration_ptes(hpage, hpage);
+	if (page_was_mapped)
+		remove_migration_ptes(hpage,
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage);
 
+	unlock_page(new_hpage);
+
+put_anon:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-	if (rc == MIGRATEPAGE_SUCCESS)
+	if (rc == MIGRATEPAGE_SUCCESS) {
 		hugetlb_cgroup_migrate(hpage, new_hpage);
+		put_new_page = NULL;
+	}
 
 	unlock_page(hpage);
out:
···
 	 * it.  Otherwise, put_page() will drop the reference grabbed during
 	 * isolation.
 	 */
-	if (rc != MIGRATEPAGE_SUCCESS && put_new_page)
+	if (put_new_page)
 		put_new_page(new_hpage, private);
 	else
 		putback_active_hugepage(new_hpage);
···
 *
 * The function returns after 10 attempts or if no pages are movable any more
 * because the list has become empty or no retryable pages exist any more.
- * The caller should call putback_lru_pages() to return pages to the LRU
+ * The caller should call putback_movable_pages() to return pages to the LRU
 * or free list only if ret != 0.
 *
 * Returns the number of pages that were not migrated, or an error code.
···
 			}
 		}
 	}
-	rc = nr_failed + retry;
+	nr_failed += retry;
+	rc = nr_failed;
out:
 	if (nr_succeeded)
 		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
···
 		SetPageActive(page);
 	if (TestClearPageUnevictable(new_page))
 		SetPageUnevictable(page);
-	mlock_migrate_page(page, new_page);
 
 	unlock_page(new_page);
 	put_page(new_page);		/* Free it */
···
 		goto fail_putback;
 	}
 
-	mem_cgroup_migrate(page, new_page, false);
-
+	mlock_migrate_page(new_page, page);
+	set_page_memcg(new_page, page_memcg(page));
+	set_page_memcg(page, NULL);
 	page_remove_rmap(page);
 
 	spin_unlock(ptl);
+1 -1
mm/mincore.c
···
 
 	/* This also avoids any overflows on PAGE_CACHE_ALIGN */
 	pages = len >> PAGE_SHIFT;
-	pages += (len & ~PAGE_MASK) != 0;
+	pages += (offset_in_page(len)) != 0;
 
 	if (!access_ok(VERIFY_WRITE, vec, pages))
 		return -EFAULT;
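Several files in this series convert the open-coded `addr & ~PAGE_MASK` idiom to `offset_in_page()`. A userspace sketch of the helper and the mincore()-style page-count computation above (4 KiB pages assumed; the real macro lives in include/linux/mm.h):

```c
#include <assert.h>

#define PAGE_SHIFT 12			/* assumed: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Mirrors the kernel helper: byte offset of a value within its page. */
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)

/* Page count for a request of 'len' bytes, counting a partial tail page. */
static unsigned long mincore_pages(unsigned long len)
{
	unsigned long pages = len >> PAGE_SHIFT;

	pages += offset_in_page(len) != 0;	/* partial trailing page */
	return pages;
}
```

The conversion is purely cosmetic: `offset_in_page(x)` expands to exactly the old `x & ~PAGE_MASK` expression.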
+68 -32
mm/mlock.c
···
void munlock_vma_pages_range(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
-	vma->vm_flags &= ~VM_LOCKED;
+	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 
 	while (start < end) {
 		struct page *page = NULL;
···
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
-		goto out;	/* don't set VM_LOCKED,  don't count */
+		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
+		goto out;
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
···
 	return ret;
}

-static int do_mlock(unsigned long start, size_t len, int on)
+static int apply_vma_lock_flags(unsigned long start, size_t len,
+				vm_flags_t flags)
{
 	unsigned long nstart, end, tmp;
 	struct vm_area_struct * vma, * prev;
 	int error;
 
-	VM_BUG_ON(start & ~PAGE_MASK);
+	VM_BUG_ON(offset_in_page(start));
 	VM_BUG_ON(len != PAGE_ALIGN(len));
 	end = start + len;
 	if (end < start)
···
 	prev = vma;
 
 	for (nstart = start ; ; ) {
-		vm_flags_t newflags;
+		vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+
+		newflags |= flags;
 
 		/* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
-
-		newflags = vma->vm_flags & ~VM_LOCKED;
-		if (on)
-			newflags |= VM_LOCKED;
-
 		tmp = vma->vm_end;
 		if (tmp > end)
 			tmp = end;
···
 	return error;
}

-SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+static int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
{
 	unsigned long locked;
 	unsigned long lock_limit;
···
 
 	lru_add_drain_all();	/* flush pagevec */
 
-	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
+	len = PAGE_ALIGN(len + (offset_in_page(start)));
 	start &= PAGE_MASK;
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
···
 
 	/* check against resource limits */
 	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
-		error = do_mlock(start, len, 1);
+		error = apply_vma_lock_flags(start, len, flags);
 
 	up_write(&current->mm->mmap_sem);
 	if (error)
···
 	return 0;
}

+SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
+{
+	return do_mlock(start, len, VM_LOCKED);
+}
+
+SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+{
+	vm_flags_t vm_flags = VM_LOCKED;
+
+	if (flags & ~MLOCK_ONFAULT)
+		return -EINVAL;
+
+	if (flags & MLOCK_ONFAULT)
+		vm_flags |= VM_LOCKONFAULT;
+
+	return do_mlock(start, len, vm_flags);
+}
+
SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
 	int ret;
 
-	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
+	len = PAGE_ALIGN(len + (offset_in_page(start)));
 	start &= PAGE_MASK;
 
 	down_write(&current->mm->mmap_sem);
-	ret = do_mlock(start, len, 0);
+	ret = apply_vma_lock_flags(start, len, 0);
 	up_write(&current->mm->mmap_sem);
 
 	return ret;
}

-static int do_mlockall(int flags)
+/*
+ * Take the MCL_* flags passed into mlockall (or 0 if called from munlockall)
+ * and translate into the appropriate modifications to mm->def_flags and/or the
+ * flags for all current VMAs.
+ *
+ * There are a couple of subtleties with this.  If mlockall() is called multiple
+ * times with different flags, the values do not necessarily stack.  If mlockall
+ * is called once including the MCL_FUTURE flag and then a second time without
+ * it, VM_LOCKED and VM_LOCKONFAULT will be cleared from mm->def_flags.
+ */
+static int apply_mlockall_flags(int flags)
{
 	struct vm_area_struct * vma, * prev = NULL;
+	vm_flags_t to_add = 0;
 
-	if (flags & MCL_FUTURE)
+	current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+	if (flags & MCL_FUTURE) {
 		current->mm->def_flags |= VM_LOCKED;
-	else
-		current->mm->def_flags &= ~VM_LOCKED;
-	if (flags == MCL_FUTURE)
-		goto out;
+
+		if (flags & MCL_ONFAULT)
+			current->mm->def_flags |= VM_LOCKONFAULT;
+
+		if (!(flags & MCL_CURRENT))
+			goto out;
+	}
+
+	if (flags & MCL_CURRENT) {
+		to_add |= VM_LOCKED;
+		if (flags & MCL_ONFAULT)
+			to_add |= VM_LOCKONFAULT;
+	}
 
 	for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
 		vm_flags_t newflags;
 
-		newflags = vma->vm_flags & ~VM_LOCKED;
-		if (flags & MCL_CURRENT)
-			newflags |= VM_LOCKED;
+		newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+		newflags |= to_add;
 
 		/* Ignore errors */
 		mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
···
SYSCALL_DEFINE1(mlockall, int, flags)
{
 	unsigned long lock_limit;
-	int ret = -EINVAL;
+	int ret;
 
-	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
-		goto out;
+	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)))
+		return -EINVAL;
 
-	ret = -EPERM;
 	if (!can_do_mlock())
-		goto out;
+		return -EPERM;
 
 	if (flags & MCL_CURRENT)
 		lru_add_drain_all();	/* flush pagevec */
···
 
 	if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
 	    capable(CAP_IPC_LOCK))
-		ret = do_mlockall(flags);
+		ret = apply_mlockall_flags(flags);
 	up_write(&current->mm->mmap_sem);
 	if (!ret && (flags & MCL_CURRENT))
 		mm_populate(0, TASK_SIZE);
-out:
+
 	return ret;
}
···
 	int ret;
 
 	down_write(&current->mm->mmap_sem);
-	ret = do_mlockall(0);
+	ret = apply_mlockall_flags(0);
 	up_write(&current->mm->mmap_sem);
 	return ret;
}
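With the new `mlock2()` syscall and `MLOCK_ONFAULT`, every path in mlock.c now composes a VMA's lock bits the same way: clear both lock flags via `VM_LOCKED_CLEAR_MASK`, then OR in the requested ones, so states never stack across calls. A minimal userspace sketch of that composition (the flag values here are illustrative, not the kernel's actual `VM_*` encodings):

```c
#include <assert.h>

/* Illustrative values only; the real flags live in include/linux/mm.h. */
#define VM_LOCKED		0x00002000UL
#define VM_LOCKONFAULT		0x00080000UL
#define VM_LOCKED_CLEAR_MASK	(~(VM_LOCKED | VM_LOCKONFAULT))

/* Mirrors apply_vma_lock_flags(): old lock state never leaks through. */
static unsigned long vma_lock_flags(unsigned long vm_flags, unsigned long req)
{
	return (vm_flags & VM_LOCKED_CLEAR_MASK) | req;
}
```

Under this scheme mlock() requests `VM_LOCKED`, mlock2(..., MLOCK_ONFAULT) requests `VM_LOCKED | VM_LOCKONFAULT`, and munlock() requests 0, which is exactly how the diff wires `do_mlock()` and `apply_vma_lock_flags()` together.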
+32 -29
mm/mmap.c
···
 	 * that it represents a valid section of the address space.
 	 */
 	addr = get_unmapped_area(file, addr, len, pgoff, flags);
-	if (addr & ~PAGE_MASK)
+	if (offset_in_page(addr))
 		return addr;
 
 	/* Do simple checking here so the lower-level routines won't have
···
 		unsigned long, fd, unsigned long, pgoff)
{
 	struct file *file = NULL;
-	unsigned long retval = -EBADF;
+	unsigned long retval;
 
 	if (!(flags & MAP_ANONYMOUS)) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
 		if (!file)
-			goto out;
+			return -EBADF;
 		if (is_file_hugepages(file))
 			len = ALIGN(len, huge_page_size(hstate_file(file)));
 		retval = -EINVAL;
···
out_fput:
 	if (file)
 		fput(file);
-out:
 	return retval;
}
···
 
 	if (copy_from_user(&a, arg, sizeof(a)))
 		return -EFAULT;
-	if (a.offset & ~PAGE_MASK)
+	if (offset_in_page(a.offset))
 		return -EINVAL;
 
 	return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
···
 	}
 
 	/* Clear old maps */
-	error = -ENOMEM;
 	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
 			      &rb_parent)) {
 		if (do_munmap(mm, addr, len))
···
 					vma == get_gate_vma(current->mm)))
 			mm->locked_vm += (len >> PAGE_SHIFT);
 		else
-			vma->vm_flags &= ~VM_LOCKED;
+			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 	}
 
 	if (file)
···
 	 * can happen with large stack limits and large mmap()
 	 * allocations.
 	 */
-	if (addr & ~PAGE_MASK) {
+	if (offset_in_page(addr)) {
 		VM_BUG_ON(addr != -ENOMEM);
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
···
 
 	if (addr > TASK_SIZE - len)
 		return -ENOMEM;
-	if (addr & ~PAGE_MASK)
+	if (offset_in_page(addr))
 		return -EINVAL;
 
 	addr = arch_rebalance_pgtables(addr, len);
···
 		return vma;
 
 	rb_node = mm->mm_rb.rb_node;
-	vma = NULL;
 
 	while (rb_node) {
 		struct vm_area_struct *tmp;
···
 	if (security_vm_enough_memory_mm(mm, grow))
 		return -ENOMEM;
 
-	/* Ok, everything looks good - let it rip */
-	if (vma->vm_flags & VM_LOCKED)
-		mm->locked_vm += grow;
-	vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow);
 	return 0;
}
···
 */
int expand_upwards(struct vm_area_struct *vma, unsigned long address)
{
+	struct mm_struct *mm = vma->vm_mm;
 	int error;
 
 	if (!(vma->vm_flags & VM_GROWSUP))
···
 			 * So, we reuse mm->page_table_lock to guard
 			 * against concurrent vma expansions.
 			 */
-			spin_lock(&vma->vm_mm->page_table_lock);
+			spin_lock(&mm->page_table_lock);
+			if (vma->vm_flags & VM_LOCKED)
+				mm->locked_vm += grow;
+			vm_stat_account(mm, vma->vm_flags,
+					vma->vm_file, grow);
 			anon_vma_interval_tree_pre_update_vma(vma);
 			vma->vm_end = address;
 			anon_vma_interval_tree_post_update_vma(vma);
 			if (vma->vm_next)
 				vma_gap_update(vma->vm_next);
 			else
-				vma->vm_mm->highest_vm_end = address;
-			spin_unlock(&vma->vm_mm->page_table_lock);
+				mm->highest_vm_end = address;
+			spin_unlock(&mm->page_table_lock);
 
 			perf_event_mmap(vma);
 		}
···
 	}
 	vma_unlock_anon_vma(vma);
 	khugepaged_enter_vma_merge(vma, vma->vm_flags);
-	validate_mm(vma->vm_mm);
+	validate_mm(mm);
 	return error;
}
#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
···
int expand_downwards(struct vm_area_struct *vma,
				   unsigned long address)
{
+	struct mm_struct *mm = vma->vm_mm;
 	int error;
 
 	/*
···
 			 * So, we reuse mm->page_table_lock to guard
 			 * against concurrent vma expansions.
 			 */
-			spin_lock(&vma->vm_mm->page_table_lock);
+			spin_lock(&mm->page_table_lock);
+			if (vma->vm_flags & VM_LOCKED)
+				mm->locked_vm += grow;
+			vm_stat_account(mm, vma->vm_flags,
+					vma->vm_file, grow);
 			anon_vma_interval_tree_pre_update_vma(vma);
 			vma->vm_start = address;
 			vma->vm_pgoff -= grow;
 			anon_vma_interval_tree_post_update_vma(vma);
 			vma_gap_update(vma);
-			spin_unlock(&vma->vm_mm->page_table_lock);
+			spin_unlock(&mm->page_table_lock);
 
 			perf_event_mmap(vma);
 		}
···
 	}
 	vma_unlock_anon_vma(vma);
 	khugepaged_enter_vma_merge(vma, vma->vm_flags);
-	validate_mm(vma->vm_mm);
+	validate_mm(mm);
 	return error;
}
···
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
 
-	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
 		return -EINVAL;
 
 	len = PAGE_ALIGN(len);
···
 	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
 
 	error = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED);
-	if (error & ~PAGE_MASK)
+	if (offset_in_page(error))
 		return error;
 
 	error = mlock_future_check(mm, mm->def_flags, len);
···
static struct vm_area_struct *__install_special_mapping(
 	struct mm_struct *mm,
 	unsigned long addr, unsigned long len,
-	unsigned long vm_flags, const struct vm_operations_struct *ops,
-	void *priv)
+	unsigned long vm_flags, void *priv,
+	const struct vm_operations_struct *ops)
{
 	int ret;
 	struct vm_area_struct *vma;
···
 	unsigned long addr, unsigned long len,
 	unsigned long vm_flags, const struct vm_special_mapping *spec)
{
-	return __install_special_mapping(mm, addr, len, vm_flags,
-					 &special_mapping_vmops, (void *)spec);
+	return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
+					&special_mapping_vmops);
}

int install_special_mapping(struct mm_struct *mm,
···
 	unsigned long vm_flags, struct page **pages)
{
 	struct vm_area_struct *vma = __install_special_mapping(
-		mm, addr, len, vm_flags, &legacy_special_mapping_vmops,
-		(void *)pages);
+		mm, addr, len, vm_flags, (void *)pages,
+		&legacy_special_mapping_vmops);

 	return PTR_ERR_OR_ZERO(vma);
}
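The stack-growth hunks move the `locked_vm` and VM stat accounting out of `acct_stack_growth()` and into the same `page_table_lock` section that publishes the new `vm_end`/`vm_start`, so the counters and the VMA size change atomically with respect to concurrent expanders. A toy pthread model of that invariant (all names here are invented for illustration, not kernel code):

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-in for mm_struct: size and accounting share one lock. */
struct toy_mm {
	pthread_mutex_t page_table_lock;
	unsigned long locked_vm;
	unsigned long vm_end;
};

/* Mirrors the fixed expand_upwards(): accounting happens inside the
 * same critical section that publishes the new size. */
static void toy_expand(struct toy_mm *mm, unsigned long grow)
{
	pthread_mutex_lock(&mm->page_table_lock);
	mm->locked_vm += grow;		/* accounting under the lock... */
	mm->vm_end += grow;		/* ...published with the new size */
	pthread_mutex_unlock(&mm->page_table_lock);
}

static void *grower(void *arg)
{
	for (int i = 0; i < 100000; i++)
		toy_expand(arg, 1);
	return NULL;
}

/* Two concurrent growers; the totals stay consistent. */
static unsigned long run_growers(void)
{
	struct toy_mm mm = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };
	pthread_t a, b;

	pthread_create(&a, NULL, grower, &mm);
	pthread_create(&b, NULL, grower, &mm);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	assert(mm.locked_vm == mm.vm_end);
	return mm.locked_vm;
}
```

With the accounting done outside the lock (as before the patch), a racing expander could observe a size/counter mismatch; folding both updates into one critical section removes that window.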
+6 -6
mm/mremap.c
···
 	unsigned long charged = 0;
 	unsigned long map_flags;
 
-	if (new_addr & ~PAGE_MASK)
+	if (offset_in_page(new_addr))
 		goto out;
 
 	if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
···
 	ret = get_unmapped_area(vma->vm_file, new_addr, new_len, vma->vm_pgoff +
 				((addr - vma->vm_start) >> PAGE_SHIFT),
 				map_flags);
-	if (ret & ~PAGE_MASK)
+	if (offset_in_page(ret))
 		goto out1;
 
 	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
-	if (!(ret & ~PAGE_MASK))
+	if (!(offset_in_page(ret)))
 		goto out;
out1:
 	vm_unacct_memory(charged);
···
 	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
 		return ret;
 
-	if (addr & ~PAGE_MASK)
+	if (offset_in_page(addr))
 		return ret;
 
 	old_len = PAGE_ALIGN(old_len);
···
 					vma->vm_pgoff +
 					((addr - vma->vm_start) >> PAGE_SHIFT),
 					map_flags);
-		if (new_addr & ~PAGE_MASK) {
+		if (offset_in_page(new_addr)) {
 			ret = new_addr;
 			goto out;
 		}
···
 		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
 	}
out:
-	if (ret & ~PAGE_MASK) {
+	if (offset_in_page(ret)) {
 		vm_unacct_memory(charged);
 		locked = 0;
 	}
+1 -1
mm/msync.c
···
 
 	if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
 		goto out;
-	if (start & ~PAGE_MASK)
+	if (offset_in_page(start))
 		goto out;
 	if ((flags & MS_ASYNC) && (flags & MS_SYNC))
 		goto out;
+9 -9
mm/nommu.c
···
 		return;
 
 	last = rb_entry(lastp, struct vm_region, vm_rb);
-	BUG_ON(unlikely(last->vm_end <= last->vm_start));
-	BUG_ON(unlikely(last->vm_top < last->vm_end));
+	BUG_ON(last->vm_end <= last->vm_start);
+	BUG_ON(last->vm_top < last->vm_end);
 
 	while ((p = rb_next(lastp))) {
 		region = rb_entry(p, struct vm_region, vm_rb);
 		last = rb_entry(lastp, struct vm_region, vm_rb);
 
-		BUG_ON(unlikely(region->vm_end <= region->vm_start));
-		BUG_ON(unlikely(region->vm_top < region->vm_end));
-		BUG_ON(unlikely(region->vm_start < last->vm_top));
+		BUG_ON(region->vm_end <= region->vm_start);
+		BUG_ON(region->vm_top < region->vm_end);
+		BUG_ON(region->vm_start < last->vm_top);
 
 		lastp = p;
 	}
···
 
 	if (copy_from_user(&a, arg, sizeof(a)))
 		return -EFAULT;
-	if (a.offset & ~PAGE_MASK)
+	if (offset_in_page(a.offset))
 		return -EINVAL;
 
 	return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
···
 		goto erase_whole_vma;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
-	if (start & ~PAGE_MASK)
+	if (offset_in_page(start))
 		return -EINVAL;
-	if (end != vma->vm_end && end & ~PAGE_MASK)
+	if (end != vma->vm_end && offset_in_page(end))
 		return -EINVAL;
 	if (start != vma->vm_start && end != vma->vm_end) {
 		ret = split_vma(mm, vma, start, 1);
···
 	if (old_len == 0 || new_len == 0)
 		return (unsigned long) -EINVAL;
 
-	if (addr & ~PAGE_MASK)
+	if (offset_in_page(addr))
 		return -EINVAL;
 
 	if (flags & MREMAP_FIXED && new_addr != addr)
+40 -19
mm/oom_kill.c
···
static void dump_header(struct oom_control *oc, struct task_struct *p,
			struct mem_cgroup *memcg)
{
-	task_lock(current);
 	pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
 		"oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, oc->order,
 		current->signal->oom_score_adj);
-	cpuset_print_task_mems_allowed(current);
-	task_unlock(current);
+	cpuset_print_current_mems_allowed();
 	dump_stack();
 	if (memcg)
 		mem_cgroup_print_oom_info(memcg, p);
···
 	oom_killer_disabled = false;
}

+/*
+ * task->mm can be NULL if the task is the exited group leader.  So to
+ * determine whether the task is using a particular mm, we examine all the
+ * task's threads: if one of those is using this mm then this task was also
+ * using it.
+ */
+static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+{
+	struct task_struct *t;
+
+	for_each_thread(p, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm)
+			return t_mm == mm;
+	}
+	return false;
+}
+
#define K(x) ((x) << (PAGE_SHIFT-10))
/*
 * Must be called while holding a reference to p, which will be released upon
···
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p, memcg);
 
-	task_lock(p);
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
 
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
···
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
-			if (child->mm == p->mm)
+			if (process_shares_mm(child, p->mm))
 				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
···
 		victim = p;
 	}
 
-	/* mm cannot safely be dereferenced after task_unlock(victim) */
+	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
+	atomic_inc(&mm->mm_count);
+	/*
+	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
+	 * the OOM victim from depleting the memory reserves from the user
+	 * space under its control.
+	 */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
···
 	 * pending fatal signal.
 	 */
 	rcu_read_lock();
-	for_each_process(p)
-		if (p->mm == mm && !same_thread_group(p, victim) &&
-		    !(p->flags & PF_KTHREAD)) {
-			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-				continue;
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(p, victim))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
 
-			task_lock(p);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
-				task_pid_nr(p), p->comm);
-			task_unlock(p);
-			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-		}
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+	}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	mmdrop(mm);
 	put_task_struct(victim);
}
#undef K
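`process_shares_mm()` exists because `task->mm` is NULL once the group leader has exited, so a plain `p->mm == mm` comparison misses processes whose surviving threads still use the mm. A toy userspace model of the thread walk (the structures are simplified stand-ins for `task_struct`, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified task: mm is NULL for an exited group leader. */
struct toy_task {
	void *mm;
	struct toy_task *next_thread;	/* NULL-terminated here, not circular */
};

/* Mirrors process_shares_mm(): the first live thread decides, since all
 * live threads of a process share one mm. */
static int shares_mm(struct toy_task *leader, void *mm)
{
	for (struct toy_task *t = leader; t; t = t->next_thread) {
		if (t->mm)
			return t->mm == mm;
	}
	return 0;	/* every thread has exited */
}
```

This is why both the child-sacrifice loop and the "kill other users of the same mm" loop in the diff switch to the helper instead of comparing `->mm` directly.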
+23 -18
mm/page_alloc.c
···
struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order)
{
 	struct page *page;
-	struct mem_cgroup *memcg = NULL;
 
-	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
 	page = alloc_pages(gfp_mask, order);
-	memcg_kmem_commit_charge(page, memcg, order);
+	if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
+		__free_pages(page, order);
+		page = NULL;
+	}
 	return page;
}

struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
 	struct page *page;
-	struct mem_cgroup *memcg = NULL;
 
-	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
 	page = alloc_pages_node(nid, gfp_mask, order);
-	memcg_kmem_commit_charge(page, memcg, order);
+	if (page && memcg_kmem_charge(page, gfp_mask, order) != 0) {
+		__free_pages(page, order);
+		page = NULL;
+	}
 	return page;
}
···
 */
void __free_kmem_pages(struct page *page, unsigned int order)
{
-	memcg_kmem_uncharge_pages(page, order);
+	memcg_kmem_uncharge(page, order);
 	__free_pages(page, order);
}
···
int __meminit init_currently_empty_zone(struct zone *zone,
					unsigned long zone_start_pfn,
-					unsigned long size,
-					enum memmap_context context)
+					unsigned long size)
{
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
···
 
 		set_pageblock_order();
 		setup_usemap(pgdat, zone, zone_start_pfn, size);
-		ret = init_currently_empty_zone(zone, zone_start_pfn,
-						size, MEMMAP_EARLY);
+		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
 		BUG_ON(ret);
 		memmap_init(size, nid, j, zone_start_pfn);
 		zone_start_pfn += size;
···
static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
{
+	unsigned long __maybe_unused offset = 0;
+
 	/* Skip empty nodes */
 	if (!pgdat->node_spanned_pages)
 		return;
···
 		 * for the buddy allocator to function correctly.
 		 */
 		start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
+		offset = pgdat->node_start_pfn - start;
 		end = pgdat_end_pfn(pgdat);
 		end = ALIGN(end, MAX_ORDER_NR_PAGES);
 		size =  (end - start) * sizeof(struct page);
···
 		if (!map)
 			map = memblock_virt_alloc_node_nopanic(size,
 							       pgdat->node_id);
-		pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
+		pgdat->node_mem_map = map + offset;
 	}
#ifndef CONFIG_NEED_MULTIPLE_NODES
 	/*
···
 	 */
 	if (pgdat == NODE_DATA(0)) {
 		mem_map = NODE_DATA(0)->node_mem_map;
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+#if defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) || defined(CONFIG_FLATMEM)
 		if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
-			mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET);
+			mem_map -= offset;
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 	}
#endif
···
 		 */
 		required_movablecore =
 			roundup(required_movablecore, MAX_ORDER_NR_PAGES);
+		required_movablecore = min(totalpages, required_movablecore);
 		corepages = totalpages - required_movablecore;
 
 		required_kernelcore = max(required_kernelcore, corepages);
 	}
 
-	/* If kernelcore was not specified, there is no ZONE_MOVABLE */
-	if (!required_kernelcore)
+	/*
+	 * If kernelcore was not specified or kernelcore size is larger
+	 * than totalpages, there is no ZONE_MOVABLE.
+	 */
+	if (!required_kernelcore || required_kernelcore >= totalpages)
 		goto out;
 
 	/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
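The kernelcore/movablecore hunk clamps an oversized `movablecore=` boot parameter so the unsigned `corepages` subtraction can no longer underflow. The arithmetic, sketched in userspace (the pageblock size of 1024 pages is an assumption for the example):

```c
#include <assert.h>

#define MAX_ORDER_NR_PAGES 1024UL	/* assumed pageblock granularity */

static unsigned long roundup_ul(unsigned long x, unsigned long to)
{
	return ((x + to - 1) / to) * to;
}

/* Mirrors the fixed find_zone_movable_pfns_for_nodes() arithmetic:
 * clamp movablecore to totalpages before subtracting. */
static unsigned long core_pages(unsigned long totalpages,
				unsigned long required_movablecore)
{
	required_movablecore = roundup_ul(required_movablecore,
					  MAX_ORDER_NR_PAGES);
	if (required_movablecore > totalpages)
		required_movablecore = totalpages;	/* the new clamp */
	return totalpages - required_movablecore;
}
```

Without the clamp, `movablecore=` larger than RAM produced a huge wrapped `corepages` value instead of 0, which is the misbehavior this hunk (together with the `required_kernelcore >= totalpages` check) removes.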
+7 -7
mm/page_counter.c
···
 * @nr_pages: number of pages to charge
 * @fail: points first counter to hit its limit, if any
 *
- * Returns 0 on success, or -ENOMEM and @fail if the counter or one of
- * its ancestors has hit its configured limit.
+ * Returns %true on success, or %false and @fail if the counter or one
+ * of its ancestors has hit its configured limit.
 */
-int page_counter_try_charge(struct page_counter *counter,
-			    unsigned long nr_pages,
-			    struct page_counter **fail)
+bool page_counter_try_charge(struct page_counter *counter,
+			     unsigned long nr_pages,
+			     struct page_counter **fail)
{
 	struct page_counter *c;
 
···
 		if (new > c->watermark)
 			c->watermark = new;
 	}
-	return 0;
+	return true;
 
failed:
 	for (c = counter; c != *fail; c = c->parent)
 		page_counter_cancel(c, nr_pages);
 
-	return false;
+	return false;
}

/**
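`page_counter_try_charge()` now reports success as a bool instead of 0/-ENOMEM. A simplified single-threaded model of its charge-up-the-hierarchy-then-roll-back behavior (stripped of the atomics and watermark tracking the real counter uses):

```c
#include <assert.h>
#include <stdbool.h>

struct counter {
	unsigned long count;
	unsigned long limit;
	struct counter *parent;
};

/* Charge every ancestor; on a limit hit, report the failing level via
 * *fail and undo the partial charges below it. */
static bool try_charge(struct counter *c, unsigned long nr,
		       struct counter **fail)
{
	struct counter *i;

	for (i = c; i; i = i->parent) {
		unsigned long new = i->count + nr;

		if (new > i->limit) {
			*fail = i;
			goto failed;
		}
		i->count = new;
	}
	return true;
failed:
	for (struct counter *j = c; j != *fail; j = j->parent)
		j->count -= nr;		/* undo partial charges */
	return false;
}
```

The bool return lets callers write `if (page_counter_try_charge(...))` directly, which is the point of the conversion.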
+5 -5
mm/percpu.c
···
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
#ifdef CONFIG_SMP
 	PCPU_SETUP_BUG_ON(!ai->static_size);
-	PCPU_SETUP_BUG_ON((unsigned long)__per_cpu_start & ~PAGE_MASK);
+	PCPU_SETUP_BUG_ON(offset_in_page(__per_cpu_start));
#endif
 	PCPU_SETUP_BUG_ON(!base_addr);
-	PCPU_SETUP_BUG_ON((unsigned long)base_addr & ~PAGE_MASK);
+	PCPU_SETUP_BUG_ON(offset_in_page(base_addr));
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
-	PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
+	PCPU_SETUP_BUG_ON(offset_in_page(ai->unit_size));
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
 	PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
 	PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);
···
 
 	alloc_size = roundup(min_unit_size, atom_size);
 	upa = alloc_size / min_unit_size;
-	while (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
+	while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
 		upa--;
 	max_upa = upa;
···
 	for (upa = max_upa; upa; upa--) {
 		int allocs = 0, wasted = 0;
 
-		if (alloc_size % upa || ((alloc_size / upa) & ~PAGE_MASK))
+		if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
 			continue;
 
 		for (group = 0; group < nr_groups; group++) {
+2 -12
mm/readahead.c
···
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = max_sane_readahead(nr_to_read);
+	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
 	while (nr_to_read) {
 		int err;
 
···
 		nr_to_read -= this_chunk;
 	}
 	return 0;
-}
-
-#define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
-/*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
- * sensible upper limit.
- */
-unsigned long max_sane_readahead(unsigned long nr)
-{
-	return min(nr, MAX_READAHEAD);
}

/*
···
 		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
{
-	unsigned long max = max_sane_readahead(ra->ra_pages);
+	unsigned long max = ra->ra_pages;
 	pgoff_t prev_offset;
 
 	/*
+51 -62
mm/rmap.c
···
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
+	/* munlock has nothing to gain from examining un-locked vmas */
+	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
+		goto out;
+
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
···
 	 * skipped over this mm) then we should reactivate it.
 	 */
 	if (!(flags & TTU_IGNORE_MLOCK)) {
-		if (vma->vm_flags & VM_LOCKED)
-			goto out_mlock;
-
+		if (vma->vm_flags & VM_LOCKED) {
+			/* Holding pte lock, we do *not* need mmap_sem here */
+			mlock_vma_page(page);
+			ret = SWAP_MLOCK;
+			goto out_unmap;
+		}
 		if (flags & TTU_MUNLOCK)
 			goto out_unmap;
 	}
···
 	update_hiwater_rss(mm);
 
 	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (!PageHuge(page)) {
+		if (PageHuge(page)) {
+			hugetlb_count_sub(1 << compound_order(page), mm);
+		} else {
 			if (PageAnon(page))
 				dec_mm_counter(mm, MM_ANONPAGES);
 			else
···
 			dec_mm_counter(mm, MM_ANONPAGES);
 		else
 			dec_mm_counter(mm, MM_FILEPAGES);
-	} else if (PageAnon(page)) {
-		swp_entry_t entry = { .val = page_private(page) };
+	} else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) {
+		swp_entry_t entry;
 		pte_t swp_pte;
-
-		if (PageSwapCache(page)) {
-			/*
-			 * Store the swap location in the pte.
-			 * See handle_pte_fault() ...
-			 */
-			if (swap_duplicate(entry) < 0) {
-				set_pte_at(mm, address, pte, pteval);
-				ret = SWAP_FAIL;
-				goto out_unmap;
-			}
-			if (list_empty(&mm->mmlist)) {
-				spin_lock(&mmlist_lock);
-				if (list_empty(&mm->mmlist))
-					list_add(&mm->mmlist, &init_mm.mmlist);
-				spin_unlock(&mmlist_lock);
-			}
-			dec_mm_counter(mm, MM_ANONPAGES);
-			inc_mm_counter(mm, MM_SWAPENTS);
-		} else if (IS_ENABLED(CONFIG_MIGRATION)) {
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			BUG_ON(!(flags & TTU_MIGRATION));
-			entry = make_migration_entry(page, pte_write(pteval));
-		}
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		entry = make_migration_entry(page, pte_write(pteval));
 		swp_pte = swp_entry_to_pte(entry);
 		if (pte_soft_dirty(pteval))
 			swp_pte = pte_swp_mksoft_dirty(swp_pte);
 		set_pte_at(mm, address, pte, swp_pte);
-	} else if (IS_ENABLED(CONFIG_MIGRATION) &&
-		   (flags & TTU_MIGRATION)) {
-		/* Establish migration entry for a file page */
-		swp_entry_t entry;
-		entry = make_migration_entry(page, pte_write(pteval));
-		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
+	} else if (PageAnon(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		pte_t swp_pte;
+		/*
+		 * Store the swap location in the pte.
+		 * See handle_pte_fault() ...
+		 */
+		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+		if (swap_duplicate(entry) < 0) {
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+		if (list_empty(&mm->mmlist)) {
+			spin_lock(&mmlist_lock);
+			if (list_empty(&mm->mmlist))
+				list_add(&mm->mmlist, &init_mm.mmlist);
+			spin_unlock(&mmlist_lock);
+		}
+		dec_mm_counter(mm, MM_ANONPAGES);
+		inc_mm_counter(mm, MM_SWAPENTS);
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(mm, address, pte, swp_pte);
 	} else
 		dec_mm_counter(mm, MM_FILEPAGES);
···
 
out_unmap:
 	pte_unmap_unlock(pte, ptl);
-	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
+	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
out:
-	return ret;
-
-out_mlock:
-	pte_unmap_unlock(pte, ptl);
-
-
-	/*
-	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
-	 * unstable result and race. Plus, We can't wait here because
-	 * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
-	 * if trylock failed, the page remain in evictable lru and later
-	 * vmscan could retry to move the page to unevictable lru if the
-	 * page is actually mlocked.
-	 */
-	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
-		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
-			ret = SWAP_MLOCK;
-		}
-		up_read(&vma->vm_mm->mmap_sem);
-	}
 	return ret;
}
···
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
 
+		cond_resched();
+
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 			continue;
···
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
+
+		cond_resched();
 
 		if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 			continue;
+15 -9
mm/shmem.c
··· 548 548 struct inode *inode = dentry->d_inode; 549 549 struct shmem_inode_info *info = SHMEM_I(inode); 550 550 551 - spin_lock(&info->lock); 552 - shmem_recalc_inode(inode); 553 - spin_unlock(&info->lock); 554 - 551 + if (info->alloced - info->swapped != inode->i_mapping->nrpages) { 552 + spin_lock(&info->lock); 553 + shmem_recalc_inode(inode); 554 + spin_unlock(&info->lock); 555 + } 555 556 generic_fillattr(inode, stat); 556 - 557 557 return 0; 558 558 } 559 559 ··· 586 586 } 587 587 if (newsize <= oldsize) { 588 588 loff_t holebegin = round_up(newsize, PAGE_SIZE); 589 - unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); 590 - shmem_truncate_range(inode, newsize, (loff_t)-1); 589 + if (oldsize > holebegin) 590 + unmap_mapping_range(inode->i_mapping, 591 + holebegin, 0, 1); 592 + if (info->alloced) 593 + shmem_truncate_range(inode, 594 + newsize, (loff_t)-1); 591 595 /* unmap again to remove racily COWed private pages */ 592 - unmap_mapping_range(inode->i_mapping, holebegin, 0, 1); 596 + if (oldsize > holebegin) 597 + unmap_mapping_range(inode->i_mapping, 598 + holebegin, 0, 1); 593 599 } 594 600 } 595 601 ··· 1029 1023 */ 1030 1024 oldpage = newpage; 1031 1025 } else { 1032 - mem_cgroup_migrate(oldpage, newpage, true); 1026 + mem_cgroup_replace_page(oldpage, newpage); 1033 1027 lru_cache_add_anon(newpage); 1034 1028 *pagep = newpage; 1035 1029 }
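The shmem_getattr() hunk above changes the function to take `info->lock` only when the cached counters look stale (`alloced - swapped != nrpages`), instead of unconditionally. A toy userspace C sketch of that check-before-locking pattern, with a counter standing in for the spinlock acquisition (the `toy_info` struct and `recalc()` are illustrative stand-ins, not the kernel's types):

```c
#include <assert.h>

static int lock_taken;          /* counts how often the "lock" was needed */

struct toy_info {
        long alloced;           /* pages accounted to the inode */
        long swapped;           /* of those, pages out on swap */
        long nrpages;           /* pages currently in the page cache */
};

/* Reconcile the counters under the lock; in shmem this is
 * shmem_recalc_inode() under spin_lock(&info->lock). */
static void recalc(struct toy_info *info)
{
        lock_taken++;
        info->alloced = info->nrpages + info->swapped;
}

/* Post-patch shape of shmem_getattr(): skip the lock entirely when
 * the counters are already consistent. */
static void getattr_sketch(struct toy_info *info)
{
        if (info->alloced - info->swapped != info->nrpages)
                recalc(info);
}
```

When the counters agree, stat() on a tmpfs file no longer contends on the per-inode lock at all; only a genuinely stale inode pays for the recalculation.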
+9 -8
mm/slab.c
··· 282 282 283 283 #define CFLGS_OFF_SLAB (0x80000000UL) 284 284 #define OFF_SLAB(x) ((x)->flags & CFLGS_OFF_SLAB) 285 + #define OFF_SLAB_MIN_SIZE (max_t(size_t, PAGE_SIZE >> 5, KMALLOC_MIN_SIZE + 1)) 285 286 286 287 #define BATCHREFILL_LIMIT 16 287 288 /* ··· 1593 1592 if (cachep->flags & SLAB_RECLAIM_ACCOUNT) 1594 1593 flags |= __GFP_RECLAIMABLE; 1595 1594 1596 - if (memcg_charge_slab(cachep, flags, cachep->gfporder)) 1597 - return NULL; 1598 - 1599 1595 page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder); 1600 1596 if (!page) { 1601 - memcg_uncharge_slab(cachep, cachep->gfporder); 1602 1597 slab_out_of_memory(cachep, flags, nodeid); 1598 + return NULL; 1599 + } 1600 + 1601 + if (memcg_charge_slab(page, flags, cachep->gfporder, cachep)) { 1602 + __free_pages(page, cachep->gfporder); 1603 1603 return NULL; 1604 1604 } 1605 1605 ··· 1655 1653 1656 1654 if (current->reclaim_state) 1657 1655 current->reclaim_state->reclaimed_slab += nr_freed; 1658 - __free_pages(page, cachep->gfporder); 1659 - memcg_uncharge_slab(cachep, cachep->gfporder); 1656 + __free_kmem_pages(page, cachep->gfporder); 1660 1657 } 1661 1658 1662 1659 static void kmem_rcu_free(struct rcu_head *head) ··· 2213 2212 * it too early on. Always use on-slab management when 2214 2213 * SLAB_NOLEAKTRACE to avoid recursive calls into kmemleak) 2215 2214 */ 2216 - if ((size >= (PAGE_SIZE >> 5)) && !slab_early_init && 2215 + if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init && 2217 2216 !(flags & SLAB_NOLEAKTRACE)) 2218 2217 /* 2219 2218 * Size is large, assume best to place the slab management obj ··· 2277 2276 /* 2278 2277 * This is a possibility for one of the kmalloc_{dma,}_caches. 2279 2278 * But since we go off slab only for object size greater than 2280 - * PAGE_SIZE/8, and kmalloc_{dma,}_caches get created 2279 + * OFF_SLAB_MIN_SIZE, and kmalloc_{dma,}_caches get created 2281 2280 * in ascending order,this should not happen at all. 
2282 2281 * But leave a BUG_ON for some lucky dude. 2283 2282 */
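The kmem_getpages() hunk above reorders memcg accounting: the page is allocated first, the memcg charge is attempted second, and the allocation is rolled back if the charge fails. A minimal userspace sketch of that allocate-then-charge-with-rollback ordering (`fake_alloc_page`, `fake_free_page` and the counter are toy stand-ins, not kernel APIs):

```c
#include <assert.h>
#include <stddef.h>

static int pages_outstanding;   /* visible effect of alloc/free pairing */

static int *fake_alloc_page(void)
{
        static int page;

        pages_outstanding++;
        return &page;
}

static void fake_free_page(int *page)
{
        (void)page;
        pages_outstanding--;
}

/* Post-patch ordering in kmem_getpages(): allocate, then charge the
 * page to the memcg, and free the page again if the charge fails. */
static int *alloc_slab_page(int (*charge)(int *page))
{
        int *page = fake_alloc_page();

        if (!page)
                return NULL;
        if (charge(page)) {             /* non-zero: charge failed */
                fake_free_page(page);
                return NULL;
        }
        return page;
}

static int charge_ok(int *page)   { (void)page; return 0; }
static int charge_fail(int *page) { (void)page; return -1; }
```

Either path leaves the accounting balanced: a failed charge never leaks the freshly allocated page, and a failed allocation never charges the memcg.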
+7 -23
mm/slab.h
··· 181 181 list_for_each_entry(iter, &(root)->memcg_params.list, \ 182 182 memcg_params.list) 183 183 184 - #define for_each_memcg_cache_safe(iter, tmp, root) \ 185 - list_for_each_entry_safe(iter, tmp, &(root)->memcg_params.list, \ 186 - memcg_params.list) 187 - 188 184 static inline bool is_root_cache(struct kmem_cache *s) 189 185 { 190 186 return s->memcg_params.is_root_cache; ··· 236 240 return s->memcg_params.root_cache; 237 241 } 238 242 239 - static __always_inline int memcg_charge_slab(struct kmem_cache *s, 240 - gfp_t gfp, int order) 243 + static __always_inline int memcg_charge_slab(struct page *page, 244 + gfp_t gfp, int order, 245 + struct kmem_cache *s) 241 246 { 242 247 if (!memcg_kmem_enabled()) 243 248 return 0; 244 249 if (is_root_cache(s)) 245 250 return 0; 246 - return memcg_charge_kmem(s->memcg_params.memcg, gfp, 1 << order); 247 - } 248 - 249 - static __always_inline void memcg_uncharge_slab(struct kmem_cache *s, int order) 250 - { 251 - if (!memcg_kmem_enabled()) 252 - return; 253 - if (is_root_cache(s)) 254 - return; 255 - memcg_uncharge_kmem(s->memcg_params.memcg, 1 << order); 251 + return __memcg_kmem_charge_memcg(page, gfp, order, 252 + s->memcg_params.memcg); 256 253 } 257 254 258 255 extern void slab_init_memcg_params(struct kmem_cache *); ··· 254 265 255 266 #define for_each_memcg_cache(iter, root) \ 256 267 for ((void)(iter), (void)(root); 0; ) 257 - #define for_each_memcg_cache_safe(iter, tmp, root) \ 258 - for ((void)(iter), (void)(tmp), (void)(root); 0; ) 259 268 260 269 static inline bool is_root_cache(struct kmem_cache *s) 261 270 { ··· 282 295 return s; 283 296 } 284 297 285 - static inline int memcg_charge_slab(struct kmem_cache *s, gfp_t gfp, int order) 298 + static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order, 299 + struct kmem_cache *s) 286 300 { 287 301 return 0; 288 - } 289 - 290 - static inline void memcg_uncharge_slab(struct kmem_cache *s, int order) 291 - { 292 302 } 293 303 294 304 static inline 
void slab_init_memcg_params(struct kmem_cache *s)
+102 -40
mm/slab_common.c
··· 316 316 return ALIGN(align, sizeof(void *)); 317 317 } 318 318 319 - static struct kmem_cache * 320 - do_kmem_cache_create(const char *name, size_t object_size, size_t size, 321 - size_t align, unsigned long flags, void (*ctor)(void *), 322 - struct mem_cgroup *memcg, struct kmem_cache *root_cache) 319 + static struct kmem_cache *create_cache(const char *name, 320 + size_t object_size, size_t size, size_t align, 321 + unsigned long flags, void (*ctor)(void *), 322 + struct mem_cgroup *memcg, struct kmem_cache *root_cache) 323 323 { 324 324 struct kmem_cache *s; 325 325 int err; ··· 384 384 kmem_cache_create(const char *name, size_t size, size_t align, 385 385 unsigned long flags, void (*ctor)(void *)) 386 386 { 387 - struct kmem_cache *s; 387 + struct kmem_cache *s = NULL; 388 388 const char *cache_name; 389 389 int err; 390 390 ··· 396 396 397 397 err = kmem_cache_sanity_check(name, size); 398 398 if (err) { 399 - s = NULL; /* suppress uninit var warning */ 400 399 goto out_unlock; 401 400 } 402 401 ··· 417 418 goto out_unlock; 418 419 } 419 420 420 - s = do_kmem_cache_create(cache_name, size, size, 421 - calculate_alignment(flags, align, size), 422 - flags, ctor, NULL, NULL); 421 + s = create_cache(cache_name, size, size, 422 + calculate_alignment(flags, align, size), 423 + flags, ctor, NULL, NULL); 423 424 if (IS_ERR(s)) { 424 425 err = PTR_ERR(s); 425 426 kfree_const(cache_name); ··· 447 448 } 448 449 EXPORT_SYMBOL(kmem_cache_create); 449 450 450 - static int do_kmem_cache_shutdown(struct kmem_cache *s, 451 + static int shutdown_cache(struct kmem_cache *s, 451 452 struct list_head *release, bool *need_rcu_barrier) 452 453 { 453 - if (__kmem_cache_shutdown(s) != 0) { 454 - printk(KERN_ERR "kmem_cache_destroy %s: " 455 - "Slab cache still has objects\n", s->name); 456 - dump_stack(); 454 + if (__kmem_cache_shutdown(s) != 0) 457 455 return -EBUSY; 458 - } 459 456 460 457 if (s->flags & SLAB_DESTROY_BY_RCU) 461 458 *need_rcu_barrier = true; 462 459 463 - #ifdef 
CONFIG_MEMCG_KMEM 464 - if (!is_root_cache(s)) 465 - list_del(&s->memcg_params.list); 466 - #endif 467 460 list_move(&s->list, release); 468 461 return 0; 469 462 } 470 463 471 - static void do_kmem_cache_release(struct list_head *release, 472 - bool need_rcu_barrier) 464 + static void release_caches(struct list_head *release, bool need_rcu_barrier) 473 465 { 474 466 struct kmem_cache *s, *s2; 475 467 ··· 526 536 if (!cache_name) 527 537 goto out_unlock; 528 538 529 - s = do_kmem_cache_create(cache_name, root_cache->object_size, 530 - root_cache->size, root_cache->align, 531 - root_cache->flags, root_cache->ctor, 532 - memcg, root_cache); 539 + s = create_cache(cache_name, root_cache->object_size, 540 + root_cache->size, root_cache->align, 541 + root_cache->flags, root_cache->ctor, 542 + memcg, root_cache); 533 543 /* 534 544 * If we could not create a memcg cache, do not complain, because 535 545 * that's not critical at all as we can always proceed with the root ··· 588 598 put_online_cpus(); 589 599 } 590 600 601 + static int __shutdown_memcg_cache(struct kmem_cache *s, 602 + struct list_head *release, bool *need_rcu_barrier) 603 + { 604 + BUG_ON(is_root_cache(s)); 605 + 606 + if (shutdown_cache(s, release, need_rcu_barrier)) 607 + return -EBUSY; 608 + 609 + list_del(&s->memcg_params.list); 610 + return 0; 611 + } 612 + 591 613 void memcg_destroy_kmem_caches(struct mem_cgroup *memcg) 592 614 { 593 615 LIST_HEAD(release); ··· 617 615 * The cgroup is about to be freed and therefore has no charges 618 616 * left. Hence, all its caches must be empty by now. 
619 617 */ 620 - BUG_ON(do_kmem_cache_shutdown(s, &release, &need_rcu_barrier)); 618 + BUG_ON(__shutdown_memcg_cache(s, &release, &need_rcu_barrier)); 621 619 } 622 620 mutex_unlock(&slab_mutex); 623 621 624 622 put_online_mems(); 625 623 put_online_cpus(); 626 624 627 - do_kmem_cache_release(&release, need_rcu_barrier); 625 + release_caches(&release, need_rcu_barrier); 626 + } 627 + 628 + static int shutdown_memcg_caches(struct kmem_cache *s, 629 + struct list_head *release, bool *need_rcu_barrier) 630 + { 631 + struct memcg_cache_array *arr; 632 + struct kmem_cache *c, *c2; 633 + LIST_HEAD(busy); 634 + int i; 635 + 636 + BUG_ON(!is_root_cache(s)); 637 + 638 + /* 639 + * First, shutdown active caches, i.e. caches that belong to online 640 + * memory cgroups. 641 + */ 642 + arr = rcu_dereference_protected(s->memcg_params.memcg_caches, 643 + lockdep_is_held(&slab_mutex)); 644 + for_each_memcg_cache_index(i) { 645 + c = arr->entries[i]; 646 + if (!c) 647 + continue; 648 + if (__shutdown_memcg_cache(c, release, need_rcu_barrier)) 649 + /* 650 + * The cache still has objects. Move it to a temporary 651 + * list so as not to try to destroy it for a second 652 + * time while iterating over inactive caches below. 653 + */ 654 + list_move(&c->memcg_params.list, &busy); 655 + else 656 + /* 657 + * The cache is empty and will be destroyed soon. Clear 658 + * the pointer to it in the memcg_caches array so that 659 + * it will never be accessed even if the root cache 660 + * stays alive. 661 + */ 662 + arr->entries[i] = NULL; 663 + } 664 + 665 + /* 666 + * Second, shutdown all caches left from memory cgroups that are now 667 + * offline. 668 + */ 669 + list_for_each_entry_safe(c, c2, &s->memcg_params.list, 670 + memcg_params.list) 671 + __shutdown_memcg_cache(c, release, need_rcu_barrier); 672 + 673 + list_splice(&busy, &s->memcg_params.list); 674 + 675 + /* 676 + * A cache being destroyed must be empty. 
In particular, this means 677 + * that all per memcg caches attached to it must be empty too. 678 + */ 679 + if (!list_empty(&s->memcg_params.list)) 680 + return -EBUSY; 681 + return 0; 682 + } 683 + #else 684 + static inline int shutdown_memcg_caches(struct kmem_cache *s, 685 + struct list_head *release, bool *need_rcu_barrier) 686 + { 687 + return 0; 628 688 } 629 689 #endif /* CONFIG_MEMCG_KMEM */ 630 690 ··· 699 635 700 636 void kmem_cache_destroy(struct kmem_cache *s) 701 637 { 702 - struct kmem_cache *c, *c2; 703 638 LIST_HEAD(release); 704 639 bool need_rcu_barrier = false; 705 - bool busy = false; 640 + int err; 706 641 707 642 if (unlikely(!s)) 708 643 return; 709 - 710 - BUG_ON(!is_root_cache(s)); 711 644 712 645 get_online_cpus(); 713 646 get_online_mems(); ··· 715 654 if (s->refcount) 716 655 goto out_unlock; 717 656 718 - for_each_memcg_cache_safe(c, c2, s) { 719 - if (do_kmem_cache_shutdown(c, &release, &need_rcu_barrier)) 720 - busy = true; 657 + err = shutdown_memcg_caches(s, &release, &need_rcu_barrier); 658 + if (!err) 659 + err = shutdown_cache(s, &release, &need_rcu_barrier); 660 + 661 + if (err) { 662 + pr_err("kmem_cache_destroy %s: " 663 + "Slab cache still has objects\n", s->name); 664 + dump_stack(); 721 665 } 722 - 723 - if (!busy) 724 - do_kmem_cache_shutdown(s, &release, &need_rcu_barrier); 725 - 726 666 out_unlock: 727 667 mutex_unlock(&slab_mutex); 728 668 729 669 put_online_mems(); 730 670 put_online_cpus(); 731 671 732 - do_kmem_cache_release(&release, need_rcu_barrier); 672 + release_caches(&release, need_rcu_barrier); 733 673 } 734 674 EXPORT_SYMBOL(kmem_cache_destroy); 735 675 ··· 754 692 } 755 693 EXPORT_SYMBOL(kmem_cache_shrink); 756 694 757 - int slab_is_available(void) 695 + bool slab_is_available(void) 758 696 { 759 697 return slab_state >= UP; 760 698 }
+10 -15
mm/slub.c
··· 459 459 /* 460 460 * Debug settings: 461 461 */ 462 - #ifdef CONFIG_SLUB_DEBUG_ON 462 + #if defined(CONFIG_SLUB_DEBUG_ON) 463 463 static int slub_debug = DEBUG_DEFAULT_FLAGS; 464 + #elif defined(CONFIG_KASAN) 465 + static int slub_debug = SLAB_STORE_USER; 464 466 #else 465 467 static int slub_debug; 466 468 #endif ··· 1330 1328 1331 1329 flags |= __GFP_NOTRACK; 1332 1330 1333 - if (memcg_charge_slab(s, flags, order)) 1334 - return NULL; 1335 - 1336 1331 if (node == NUMA_NO_NODE) 1337 1332 page = alloc_pages(flags, order); 1338 1333 else 1339 1334 page = __alloc_pages_node(node, flags, order); 1340 1335 1341 - if (!page) 1342 - memcg_uncharge_slab(s, order); 1336 + if (page && memcg_charge_slab(page, flags, order, s)) { 1337 + __free_pages(page, order); 1338 + page = NULL; 1339 + } 1343 1340 1344 1341 return page; 1345 1342 } ··· 1477 1476 page_mapcount_reset(page); 1478 1477 if (current->reclaim_state) 1479 1478 current->reclaim_state->reclaimed_slab += pages; 1480 - __free_pages(page, order); 1481 - memcg_uncharge_slab(s, order); 1479 + __free_kmem_pages(page, order); 1482 1480 } 1483 1481 1484 1482 #define need_reserve_slab_rcu \ ··· 2912 2912 if (order_objects(min_order, size, reserved) > MAX_OBJS_PER_PAGE) 2913 2913 return get_order(size * MAX_OBJS_PER_PAGE) - 1; 2914 2914 2915 - for (order = max(min_order, 2916 - fls(min_objects * size - 1) - PAGE_SHIFT); 2915 + for (order = max(min_order, get_order(min_objects * size + reserved)); 2917 2916 order <= max_order; order++) { 2918 2917 2919 2918 unsigned long slab_size = PAGE_SIZE << order; 2920 - 2921 - if (slab_size < min_objects * size + reserved) 2922 - continue; 2923 2919 2924 2920 rem = (slab_size - reserved) % size; 2925 2921 2926 2922 if (rem <= slab_size / fract_leftover) 2927 2923 break; 2928 - 2929 2924 } 2930 2925 2931 2926 return order; ··· 2938 2943 * works by first attempting to generate a layout with 2939 2944 * the best configuration and backing off gradually. 
2940 2945 * 2941 - * First we reduce the acceptable waste in a slab. Then 2946 + * First we increase the acceptable waste in a slab. Then 2942 2947 * we reduce the minimum objects required in a slab. 2943 2948 */ 2944 2949 min_objects = slub_min_objects;
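The slab_order() hunk above replaces the "start low and `continue` past orders that are too small" scan with a starting point computed directly via `get_order(min_objects * size + reserved)`. The two are equivalent: the first order the old loop accepted is exactly the smallest order whose slab covers the required bytes. A userspace C sketch checking that claim (the `_sketch` helpers approximate the kernel's `fls()` and `get_order()`; a 4 KiB page size is assumed):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static int fls_sketch(unsigned long x)  /* find-last-set, 1-based */
{
        int r = 0;

        while (x) {
                r++;
                x >>= 1;
        }
        return r;
}

/* Smallest order such that PAGE_SIZE << order >= size. */
static int get_order_sketch(unsigned long size)
{
        int order = 0;

        while ((PAGE_SIZE << order) < size)
                order++;
        return order;
}

/* First order accepted by the pre-patch loop: start from the fls()
 * estimate and skip ("continue") while the slab is too small. */
static int old_first_order(unsigned long min_objects, unsigned long size,
                           unsigned long reserved, int min_order)
{
        int start = fls_sketch(min_objects * size - 1) - PAGE_SHIFT;
        int order = start > min_order ? start : min_order;

        while ((PAGE_SIZE << order) < min_objects * size + reserved)
                order++;
        return order;
}

/* Post-patch starting point: jump straight to a covering order. */
static int new_first_order(unsigned long min_objects, unsigned long size,
                           unsigned long reserved, int min_order)
{
        int start = get_order_sketch(min_objects * size + reserved);

        return start > min_order ? start : min_order;
}
```

Folding the size check into the loop's lower bound removes the dead iterations without changing which order the fragmentation test is first applied to.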
+1 -1
mm/util.c
··· 309 309 { 310 310 if (unlikely(offset + PAGE_ALIGN(len) < offset)) 311 311 return -EINVAL; 312 - if (unlikely(offset & ~PAGE_MASK)) 312 + if (unlikely(offset_in_page(offset))) 313 313 return -EINVAL; 314 314 315 315 return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
+1 -1
mm/vmacache.c
··· 52 52 * Also handle the case where a kernel thread has adopted this mm via use_mm(). 53 53 * That kernel thread's vmacache is not applicable to this mm. 54 54 */ 55 - static bool vmacache_valid_mm(struct mm_struct *mm) 55 + static inline bool vmacache_valid_mm(struct mm_struct *mm) 56 56 { 57 57 return current->mm == mm && !(current->flags & PF_KTHREAD); 58 58 }
+6 -6
mm/vmalloc.c
··· 358 358 struct vmap_area *first; 359 359 360 360 BUG_ON(!size); 361 - BUG_ON(size & ~PAGE_MASK); 361 + BUG_ON(offset_in_page(size)); 362 362 BUG_ON(!is_power_of_2(align)); 363 363 364 364 va = kmalloc_node(sizeof(struct vmap_area), ··· 936 936 void *vaddr = NULL; 937 937 unsigned int order; 938 938 939 - BUG_ON(size & ~PAGE_MASK); 939 + BUG_ON(offset_in_page(size)); 940 940 BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC); 941 941 if (WARN_ON(size == 0)) { 942 942 /* ··· 989 989 unsigned int order; 990 990 struct vmap_block *vb; 991 991 992 - BUG_ON(size & ~PAGE_MASK); 992 + BUG_ON(offset_in_page(size)); 993 993 BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC); 994 994 995 995 flush_cache_vunmap((unsigned long)addr, (unsigned long)addr + size); ··· 1902 1902 while (count) { 1903 1903 unsigned long offset, length; 1904 1904 1905 - offset = (unsigned long)addr & ~PAGE_MASK; 1905 + offset = offset_in_page(addr); 1906 1906 length = PAGE_SIZE - offset; 1907 1907 if (length > count) 1908 1908 length = count; ··· 1941 1941 while (count) { 1942 1942 unsigned long offset, length; 1943 1943 1944 - offset = (unsigned long)addr & ~PAGE_MASK; 1944 + offset = offset_in_page(addr); 1945 1945 length = PAGE_SIZE - offset; 1946 1946 if (length > count) 1947 1947 length = count; ··· 2392 2392 bool purged = false; 2393 2393 2394 2394 /* verify parameters and allocate data structures */ 2395 - BUG_ON(align & ~PAGE_MASK || !is_power_of_2(align)); 2395 + BUG_ON(offset_in_page(align) || !is_power_of_2(align)); 2396 2396 for (last_area = 0, area = 0; area < nr_vms; area++) { 2397 2397 start = offsets[area]; 2398 2398 end = start + sizes[area];
+12 -15
mm/vmscan.c
··· 194 194 195 195 static unsigned long zone_reclaimable_pages(struct zone *zone) 196 196 { 197 - int nr; 197 + unsigned long nr; 198 198 199 199 nr = zone_page_state(zone, NR_ACTIVE_FILE) + 200 200 zone_page_state(zone, NR_INACTIVE_FILE); ··· 1859 1859 } 1860 1860 1861 1861 #ifdef CONFIG_SWAP 1862 - static int inactive_anon_is_low_global(struct zone *zone) 1862 + static bool inactive_anon_is_low_global(struct zone *zone) 1863 1863 { 1864 1864 unsigned long active, inactive; 1865 1865 1866 1866 active = zone_page_state(zone, NR_ACTIVE_ANON); 1867 1867 inactive = zone_page_state(zone, NR_INACTIVE_ANON); 1868 1868 1869 - if (inactive * zone->inactive_ratio < active) 1870 - return 1; 1871 - 1872 - return 0; 1869 + return inactive * zone->inactive_ratio < active; 1873 1870 } 1874 1871 1875 1872 /** ··· 1876 1879 * Returns true if the zone does not have enough inactive anon pages, 1877 1880 * meaning some active anon pages need to be deactivated. 1878 1881 */ 1879 - static int inactive_anon_is_low(struct lruvec *lruvec) 1882 + static bool inactive_anon_is_low(struct lruvec *lruvec) 1880 1883 { 1881 1884 /* 1882 1885 * If we don't have swap space, anonymous page deactivation 1883 1886 * is pointless. 1884 1887 */ 1885 1888 if (!total_swap_pages) 1886 - return 0; 1889 + return false; 1887 1890 1888 1891 if (!mem_cgroup_disabled()) 1889 1892 return mem_cgroup_inactive_anon_is_low(lruvec); ··· 1891 1894 return inactive_anon_is_low_global(lruvec_zone(lruvec)); 1892 1895 } 1893 1896 #else 1894 - static inline int inactive_anon_is_low(struct lruvec *lruvec) 1897 + static inline bool inactive_anon_is_low(struct lruvec *lruvec) 1895 1898 { 1896 - return 0; 1899 + return false; 1897 1900 } 1898 1901 #endif 1899 1902 ··· 1911 1914 * This uses a different ratio than the anonymous pages, because 1912 1915 * the page cache uses a use-once replacement algorithm. 
1913 1916 */ 1914 - static int inactive_file_is_low(struct lruvec *lruvec) 1917 + static bool inactive_file_is_low(struct lruvec *lruvec) 1915 1918 { 1916 1919 unsigned long inactive; 1917 1920 unsigned long active; ··· 1922 1925 return active > inactive; 1923 1926 } 1924 1927 1925 - static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru) 1928 + static bool inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru) 1926 1929 { 1927 1930 if (is_file_lru(lru)) 1928 1931 return inactive_file_is_low(lruvec); ··· 3693 3696 } 3694 3697 3695 3698 /* Work out how many page cache pages we can reclaim in this reclaim_mode */ 3696 - static long zone_pagecache_reclaimable(struct zone *zone) 3699 + static unsigned long zone_pagecache_reclaimable(struct zone *zone) 3697 3700 { 3698 - long nr_pagecache_reclaimable; 3699 - long delta = 0; 3701 + unsigned long nr_pagecache_reclaimable; 3702 + unsigned long delta = 0; 3700 3703 3701 3704 /* 3702 3705 * If RECLAIM_UNMAP is set, then all file pages are considered
+22
mm/vmstat.c
··· 591 591 else 592 592 __inc_zone_state(z, NUMA_OTHER); 593 593 } 594 + 595 + /* 596 + * Determine the per node value of a stat item. 597 + */ 598 + unsigned long node_page_state(int node, enum zone_stat_item item) 599 + { 600 + struct zone *zones = NODE_DATA(node)->node_zones; 601 + 602 + return 603 + #ifdef CONFIG_ZONE_DMA 604 + zone_page_state(&zones[ZONE_DMA], item) + 605 + #endif 606 + #ifdef CONFIG_ZONE_DMA32 607 + zone_page_state(&zones[ZONE_DMA32], item) + 608 + #endif 609 + #ifdef CONFIG_HIGHMEM 610 + zone_page_state(&zones[ZONE_HIGHMEM], item) + 611 + #endif 612 + zone_page_state(&zones[ZONE_NORMAL], item) + 613 + zone_page_state(&zones[ZONE_MOVABLE], item); 614 + } 615 + 594 616 #endif 595 617 596 618 #ifdef CONFIG_COMPACTION
+2
tools/testing/selftests/vm/Makefile
··· 5 5 BINARIES += hugepage-mmap 6 6 BINARIES += hugepage-shm 7 7 BINARIES += map_hugetlb 8 + BINARIES += mlock2-tests 9 + BINARIES += on-fault-limit 8 10 BINARIES += thuge-gen 9 11 BINARIES += transhuge-stress 10 12 BINARIES += userfaultfd
+736
tools/testing/selftests/vm/mlock2-tests.c
··· 1 + #include <sys/mman.h> 2 + #include <stdint.h> 3 + #include <stdio.h> 4 + #include <stdlib.h> 5 + #include <unistd.h> 6 + #include <string.h> 7 + #include <sys/time.h> 8 + #include <sys/resource.h> 9 + #include <syscall.h> 10 + #include <errno.h> 11 + #include <stdbool.h> 12 + 13 + #ifndef MLOCK_ONFAULT 14 + #define MLOCK_ONFAULT 1 15 + #endif 16 + 17 + #ifndef MCL_ONFAULT 18 + #define MCL_ONFAULT (MCL_FUTURE << 1) 19 + #endif 20 + 21 + static int mlock2_(void *start, size_t len, int flags) 22 + { 23 + #ifdef __NR_mlock2 24 + return syscall(__NR_mlock2, start, len, flags); 25 + #else 26 + errno = ENOSYS; 27 + return -1; 28 + #endif 29 + } 30 + 31 + struct vm_boundaries { 32 + unsigned long start; 33 + unsigned long end; 34 + }; 35 + 36 + static int get_vm_area(unsigned long addr, struct vm_boundaries *area) 37 + { 38 + FILE *file; 39 + int ret = 1; 40 + char line[1024] = {0}; 41 + char *end_addr; 42 + char *stop; 43 + unsigned long start; 44 + unsigned long end; 45 + 46 + if (!area) 47 + return ret; 48 + 49 + file = fopen("/proc/self/maps", "r"); 50 + if (!file) { 51 + perror("fopen"); 52 + return ret; 53 + } 54 + 55 + memset(area, 0, sizeof(struct vm_boundaries)); 56 + 57 + while(fgets(line, 1024, file)) { 58 + end_addr = strchr(line, '-'); 59 + if (!end_addr) { 60 + printf("cannot parse /proc/self/maps\n"); 61 + goto out; 62 + } 63 + *end_addr = '\0'; 64 + end_addr++; 65 + stop = strchr(end_addr, ' '); 66 + if (!stop) { 67 + printf("cannot parse /proc/self/maps\n"); 68 + goto out; 69 + } 70 + stop = '\0'; 71 + 72 + sscanf(line, "%lx", &start); 73 + sscanf(end_addr, "%lx", &end); 74 + 75 + if (start <= addr && end > addr) { 76 + area->start = start; 77 + area->end = end; 78 + ret = 0; 79 + goto out; 80 + } 81 + } 82 + out: 83 + fclose(file); 84 + return ret; 85 + } 86 + 87 + static uint64_t get_pageflags(unsigned long addr) 88 + { 89 + FILE *file; 90 + uint64_t pfn; 91 + unsigned long offset; 92 + 93 + file = fopen("/proc/self/pagemap", "r"); 94 + if 
(!file) { 95 + perror("fopen pagemap"); 96 + _exit(1); 97 + } 98 + 99 + offset = addr / getpagesize() * sizeof(pfn); 100 + 101 + if (fseek(file, offset, SEEK_SET)) { 102 + perror("fseek pagemap"); 103 + _exit(1); 104 + } 105 + 106 + if (fread(&pfn, sizeof(pfn), 1, file) != 1) { 107 + perror("fread pagemap"); 108 + _exit(1); 109 + } 110 + 111 + fclose(file); 112 + return pfn; 113 + } 114 + 115 + static uint64_t get_kpageflags(unsigned long pfn) 116 + { 117 + uint64_t flags; 118 + FILE *file; 119 + 120 + file = fopen("/proc/kpageflags", "r"); 121 + if (!file) { 122 + perror("fopen kpageflags"); 123 + _exit(1); 124 + } 125 + 126 + if (fseek(file, pfn * sizeof(flags), SEEK_SET)) { 127 + perror("fseek kpageflags"); 128 + _exit(1); 129 + } 130 + 131 + if (fread(&flags, sizeof(flags), 1, file) != 1) { 132 + perror("fread kpageflags"); 133 + _exit(1); 134 + } 135 + 136 + fclose(file); 137 + return flags; 138 + } 139 + 140 + static FILE *seek_to_smaps_entry(unsigned long addr) 141 + { 142 + FILE *file; 143 + char *line = NULL; 144 + size_t size = 0; 145 + unsigned long start, end; 146 + char perms[5]; 147 + unsigned long offset; 148 + char dev[32]; 149 + unsigned long inode; 150 + char path[BUFSIZ]; 151 + 152 + file = fopen("/proc/self/smaps", "r"); 153 + if (!file) { 154 + perror("fopen smaps"); 155 + _exit(1); 156 + } 157 + 158 + while (getline(&line, &size, file) > 0) { 159 + if (sscanf(line, "%lx-%lx %s %lx %s %lu %s\n", 160 + &start, &end, perms, &offset, dev, &inode, path) < 6) 161 + goto next; 162 + 163 + if (start <= addr && addr < end) 164 + goto out; 165 + 166 + next: 167 + free(line); 168 + line = NULL; 169 + size = 0; 170 + } 171 + 172 + fclose(file); 173 + file = NULL; 174 + 175 + out: 176 + free(line); 177 + return file; 178 + } 179 + 180 + #define VMFLAGS "VmFlags:" 181 + 182 + static bool is_vmflag_set(unsigned long addr, const char *vmflag) 183 + { 184 + char *line = NULL; 185 + char *flags; 186 + size_t size = 0; 187 + bool ret = false; 188 + FILE *smaps; 
189 + 190 + smaps = seek_to_smaps_entry(addr); 191 + if (!smaps) { 192 + printf("Unable to parse /proc/self/smaps\n"); 193 + goto out; 194 + } 195 + 196 + while (getline(&line, &size, smaps) > 0) { 197 + if (!strstr(line, VMFLAGS)) { 198 + free(line); 199 + line = NULL; 200 + size = 0; 201 + continue; 202 + } 203 + 204 + flags = line + strlen(VMFLAGS); 205 + ret = (strstr(flags, vmflag) != NULL); 206 + goto out; 207 + } 208 + 209 + out: 210 + free(line); 211 + fclose(smaps); 212 + return ret; 213 + } 214 + 215 + #define SIZE "Size:" 216 + #define RSS "Rss:" 217 + #define LOCKED "lo" 218 + 219 + static bool is_vma_lock_on_fault(unsigned long addr) 220 + { 221 + bool ret = false; 222 + bool locked; 223 + FILE *smaps = NULL; 224 + unsigned long vma_size, vma_rss; 225 + char *line = NULL; 226 + char *value; 227 + size_t size = 0; 228 + 229 + locked = is_vmflag_set(addr, LOCKED); 230 + if (!locked) 231 + goto out; 232 + 233 + smaps = seek_to_smaps_entry(addr); 234 + if (!smaps) { 235 + printf("Unable to parse /proc/self/smaps\n"); 236 + goto out; 237 + } 238 + 239 + while (getline(&line, &size, smaps) > 0) { 240 + if (!strstr(line, SIZE)) { 241 + free(line); 242 + line = NULL; 243 + size = 0; 244 + continue; 245 + } 246 + 247 + value = line + strlen(SIZE); 248 + if (sscanf(value, "%lu kB", &vma_size) < 1) { 249 + printf("Unable to parse smaps entry for Size\n"); 250 + goto out; 251 + } 252 + break; 253 + } 254 + 255 + while (getline(&line, &size, smaps) > 0) { 256 + if (!strstr(line, RSS)) { 257 + free(line); 258 + line = NULL; 259 + size = 0; 260 + continue; 261 + } 262 + 263 + value = line + strlen(RSS); 264 + if (sscanf(value, "%lu kB", &vma_rss) < 1) { 265 + printf("Unable to parse smaps entry for Rss\n"); 266 + goto out; 267 + } 268 + break; 269 + } 270 + 271 + ret = locked && (vma_rss < vma_size); 272 + out: 273 + free(line); 274 + if (smaps) 275 + fclose(smaps); 276 + return ret; 277 + } 278 + 279 + #define PRESENT_BIT 0x8000000000000000 280 + #define PFN_MASK 
0x007FFFFFFFFFFFFF
#define UNEVICTABLE_BIT	(1UL << 18)

static int lock_check(char *map)
{
	unsigned long page_size = getpagesize();
	uint64_t page1_flags, page2_flags;

	page1_flags = get_pageflags((unsigned long)map);
	page2_flags = get_pageflags((unsigned long)map + page_size);

	/* Both pages should be present */
	if (((page1_flags & PRESENT_BIT) == 0) ||
	    ((page2_flags & PRESENT_BIT) == 0)) {
		printf("Failed to make both pages present\n");
		return 1;
	}

	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
	page2_flags = get_kpageflags(page2_flags & PFN_MASK);

	/* Both pages should be unevictable */
	if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
	    ((page2_flags & UNEVICTABLE_BIT) == 0)) {
		printf("Failed to make both pages unevictable\n");
		return 1;
	}

	if (!is_vmflag_set((unsigned long)map, LOCKED)) {
		printf("VMA flag %s is missing on page 1\n", LOCKED);
		return 1;
	}

	if (!is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
		printf("VMA flag %s is missing on page 2\n", LOCKED);
		return 1;
	}

	return 0;
}

static int unlock_lock_check(char *map)
{
	unsigned long page_size = getpagesize();
	uint64_t page1_flags, page2_flags;

	page1_flags = get_pageflags((unsigned long)map);
	page2_flags = get_pageflags((unsigned long)map + page_size);
	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
	page2_flags = get_kpageflags(page2_flags & PFN_MASK);

	if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) {
		printf("A page is still marked unevictable after unlock\n");
		return 1;
	}

	if (is_vmflag_set((unsigned long)map, LOCKED)) {
		printf("VMA flag %s is present on page 1 after unlock\n", LOCKED);
		return 1;
	}

	if (is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
		printf("VMA flag %s is present on page 2 after unlock\n", LOCKED);
		return 1;
	}

	return 0;
}

static int test_mlock_lock()
{
	char *map;
	int ret = 1;
	unsigned long page_size = getpagesize();

	map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
	if (map == MAP_FAILED) {
		perror("test_mlock_locked mmap");
		goto out;
	}

	if (mlock2_(map, 2 * page_size, 0)) {
		if (errno == ENOSYS) {
			printf("Cannot call new mlock family, skipping test\n");
			_exit(0);
		}
		perror("mlock2(0)");
		goto unmap;
	}

	if (lock_check(map))
		goto unmap;

	/* Now unlock and recheck attributes */
	if (munlock(map, 2 * page_size)) {
		perror("munlock()");
		goto unmap;
	}

	ret = unlock_lock_check(map);

unmap:
	munmap(map, 2 * page_size);
out:
	return ret;
}

static int onfault_check(char *map)
{
	unsigned long page_size = getpagesize();
	uint64_t page1_flags, page2_flags;

	page1_flags = get_pageflags((unsigned long)map);
	page2_flags = get_pageflags((unsigned long)map + page_size);

	/* Neither page should be present */
	if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) {
		printf("Pages were made present by MLOCK_ONFAULT\n");
		return 1;
	}

	*map = 'a';
	page1_flags = get_pageflags((unsigned long)map);
	page2_flags = get_pageflags((unsigned long)map + page_size);

	/* Only page 1 should be present */
	if ((page1_flags & PRESENT_BIT) == 0) {
		printf("Page 1 is not present after fault\n");
		return 1;
	} else if (page2_flags & PRESENT_BIT) {
		printf("Page 2 was made present\n");
		return 1;
	}

	page1_flags = get_kpageflags(page1_flags & PFN_MASK);

	/* Page 1 should be unevictable */
	if ((page1_flags & UNEVICTABLE_BIT) == 0) {
		printf("Failed to make faulted page unevictable\n");
		return 1;
	}

	if (!is_vma_lock_on_fault((unsigned long)map)) {
		printf("VMA is not marked for lock on fault\n");
		return 1;
	}

	if (!is_vma_lock_on_fault((unsigned long)map + page_size)) {
		printf("VMA is not marked for lock on fault\n");
		return 1;
	}

	return 0;
}

static int unlock_onfault_check(char *map)
{
	unsigned long page_size = getpagesize();
	uint64_t page1_flags;

	page1_flags = get_pageflags((unsigned long)map);
	page1_flags = get_kpageflags(page1_flags & PFN_MASK);

	if (page1_flags & UNEVICTABLE_BIT) {
		printf("Page 1 is still marked unevictable after unlock\n");
		return 1;
	}

	if (is_vma_lock_on_fault((unsigned long)map) ||
	    is_vma_lock_on_fault((unsigned long)map + page_size)) {
		printf("VMA is still lock on fault after unlock\n");
		return 1;
	}

	return 0;
}

static int test_mlock_onfault()
{
	char *map;
	int ret = 1;
	unsigned long page_size = getpagesize();

	map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
	if (map == MAP_FAILED) {
		perror("test_mlock_locked mmap");
		goto out;
	}

	if (mlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
		if (errno == ENOSYS) {
			printf("Cannot call new mlock family, skipping test\n");
			_exit(0);
		}
		perror("mlock2(MLOCK_ONFAULT)");
		goto unmap;
	}

	if (onfault_check(map))
		goto unmap;

	/* Now unlock and recheck attributes */
	if (munlock(map, 2 * page_size)) {
		if (errno == ENOSYS) {
			printf("Cannot call new mlock family, skipping test\n");
			_exit(0);
		}
		perror("munlock()");
		goto unmap;
	}

	ret = unlock_onfault_check(map);
unmap:
	munmap(map, 2 * page_size);
out:
	return ret;
}

static int test_lock_onfault_of_present()
{
	char *map;
	int ret = 1;
	unsigned long page_size = getpagesize();
	uint64_t page1_flags, page2_flags;

	map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
	if (map == MAP_FAILED) {
		perror("test_mlock_locked mmap");
		goto out;
	}

	*map = 'a';

	if (mlock2_(map, 2 * page_size, MLOCK_ONFAULT)) {
		if (errno == ENOSYS) {
			printf("Cannot call new mlock family, skipping test\n");
			_exit(0);
		}
		perror("mlock2(MLOCK_ONFAULT)");
		goto unmap;
	}

	page1_flags = get_pageflags((unsigned long)map);
	page2_flags = get_pageflags((unsigned long)map + page_size);
	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
	page2_flags = get_kpageflags(page2_flags & PFN_MASK);

	/* Page 1 should be unevictable */
	if ((page1_flags & UNEVICTABLE_BIT) == 0) {
		printf("Failed to make present page unevictable\n");
		goto unmap;
	}

	if (!is_vma_lock_on_fault((unsigned long)map) ||
	    !is_vma_lock_on_fault((unsigned long)map + page_size)) {
		printf("VMA with present pages is not marked lock on fault\n");
		goto unmap;
	}
	ret = 0;
unmap:
	munmap(map, 2 * page_size);
out:
	return ret;
}

static int test_munlockall()
{
	char *map;
	int ret = 1;
	unsigned long page_size = getpagesize();

	map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);

	if (map == MAP_FAILED) {
		perror("test_munlockall mmap");
		goto out;
	}

	if (mlockall(MCL_CURRENT)) {
		perror("mlockall(MCL_CURRENT)");
		goto out;
	}

	if (lock_check(map))
		goto unmap;

	if (munlockall()) {
		perror("munlockall()");
		goto unmap;
	}

	if (unlock_lock_check(map))
		goto unmap;

	munmap(map, 2 * page_size);

	map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);

	if (map == MAP_FAILED) {
		perror("test_munlockall second mmap");
		goto out;
	}

	if (mlockall(MCL_CURRENT | MCL_ONFAULT)) {
		perror("mlockall(MCL_CURRENT | MCL_ONFAULT)");
		goto unmap;
	}

	if (onfault_check(map))
		goto unmap;

	if (munlockall()) {
		perror("munlockall()");
		goto unmap;
	}

	if (unlock_onfault_check(map))
		goto unmap;

	if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
		perror("mlockall(MCL_CURRENT | MCL_FUTURE)");
		goto out;
	}

	if (lock_check(map))
		goto unmap;

	if (munlockall()) {
		perror("munlockall()");
		goto unmap;
	}

	ret = unlock_lock_check(map);

unmap:
	munmap(map, 2 * page_size);
out:
	munlockall();
	return ret;
}

static int test_vma_management(bool call_mlock)
{
	int ret = 1;
	void *map;
	unsigned long page_size = getpagesize();
	struct vm_boundaries page1;
	struct vm_boundaries page2;
	struct vm_boundaries page3;

	map = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
	if (map == MAP_FAILED) {
		perror("mmap()");
		return ret;
	}

	if (call_mlock && mlock2_(map, 3 * page_size, MLOCK_ONFAULT)) {
		if (errno == ENOSYS) {
			printf("Cannot call new mlock family, skipping test\n");
			_exit(0);
		}
		perror("mlock(ONFAULT)\n");
		goto out;
	}

	if (get_vm_area((unsigned long)map, &page1) ||
	    get_vm_area((unsigned long)map + page_size, &page2) ||
	    get_vm_area((unsigned long)map + page_size * 2, &page3)) {
		printf("couldn't find mapping in /proc/self/maps\n");
		goto out;
	}

	/*
	 * Before we unlock a portion, we need to know that all three pages
	 * are in the same VMA.  If they are not we abort this test (Note
	 * that this is not a failure)
	 */
	if (page1.start != page2.start || page2.start != page3.start) {
		printf("VMAs are not merged to start, aborting test\n");
		ret = 0;
		goto out;
	}

	if (munlock(map + page_size, page_size)) {
		perror("munlock()");
		goto out;
	}

	if (get_vm_area((unsigned long)map, &page1) ||
	    get_vm_area((unsigned long)map + page_size, &page2) ||
	    get_vm_area((unsigned long)map + page_size * 2, &page3)) {
		printf("couldn't find mapping in /proc/self/maps\n");
		goto out;
	}

	/* All three VMAs should be different */
	if (page1.start == page2.start || page2.start == page3.start) {
		printf("failed to split VMA for munlock\n");
		goto out;
	}

	/* Now unlock the first and third page and check the VMAs again */
	if (munlock(map, page_size * 3)) {
		perror("munlock()");
		goto out;
	}

	if (get_vm_area((unsigned long)map, &page1) ||
	    get_vm_area((unsigned long)map + page_size, &page2) ||
	    get_vm_area((unsigned long)map + page_size * 2, &page3)) {
		printf("couldn't find mapping in /proc/self/maps\n");
		goto out;
	}

	/* Now all three VMAs should be the same */
	if (page1.start != page2.start || page2.start != page3.start) {
		printf("failed to merge VMAs after munlock\n");
		goto out;
	}

	ret = 0;
out:
	munmap(map, 3 * page_size);
	return ret;
}

static int test_mlockall(int (test_function)(bool call_mlock))
{
	int ret = 1;

	if (mlockall(MCL_CURRENT | MCL_ONFAULT | MCL_FUTURE)) {
		perror("mlockall");
		return ret;
	}

	ret = test_function(false);
	munlockall();
	return ret;
}

int main(int argc, char **argv)
{
	int ret = 0;
	ret += test_mlock_lock();
	ret += test_mlock_onfault();
	ret += test_munlockall();
	ret += test_lock_onfault_of_present();
	ret += test_vma_management(true);
	ret += test_mlockall(test_vma_management);
	return ret;
}
+47
tools/testing/selftests/vm/on-fault-limit.c
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

#ifndef MCL_ONFAULT
#define MCL_ONFAULT (MCL_FUTURE << 1)
#endif

static int test_limit(void)
{
	int ret = 1;
	struct rlimit lims;
	void *map;

	if (getrlimit(RLIMIT_MEMLOCK, &lims)) {
		perror("getrlimit");
		return ret;
	}

	if (mlockall(MCL_CURRENT | MCL_ONFAULT | MCL_FUTURE)) {
		perror("mlockall");
		return ret;
	}

	map = mmap(NULL, 2 * lims.rlim_max, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, 0, 0);
	if (map != MAP_FAILED)
		printf("mmap should have failed, but didn't\n");
	else {
		ret = 0;
		munmap(map, 2 * lims.rlim_max);
	}

	munlockall();
	return ret;
}

int main(int argc, char **argv)
{
	int ret = 0;

	ret += test_limit();
	return ret;
}
+22
tools/testing/selftests/vm/run_vmtests
···
 	echo "[PASS]"
 fi

+echo "--------------------"
+echo "running on-fault-limit"
+echo "--------------------"
+sudo -u nobody ./on-fault-limit
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "--------------------"
+echo "running mlock2-tests"
+echo "--------------------"
+./mlock2-tests
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 exit $exitcode
+275
tools/vm/slabinfo-gnuplot.sh
#!/bin/sh

# Sergey Senozhatsky, 2015
# sergey.senozhatsky.work@gmail.com
#
# This software is licensed under the terms of the GNU General Public
# License version 2, as published by the Free Software Foundation, and
# may be copied, distributed, and modified under those terms.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.


# This program is intended to plot `slabinfo -X' stats, collected,
# for example, using the following command:
#   while [ 1 ]; do slabinfo -X >> stats; sleep 1; done
#
# Use `slabinfo-gnuplot.sh stats' to pre-process collected records
# and generate graphs (totals, slabs sorted by size, slabs sorted
# by loss).
#
# Graphs can be [individually] regenerated with different ranges and
# sizes (-r %d,%d and -s %d,%d options).
#
# To visually compare N `totals' graphs, do
#   slabinfo-gnuplot.sh -t FILE1-totals FILE2-totals ... FILEN-totals
#

min_slab_name_size=11
xmin=0
xmax=0
width=1500
height=700
mode=preprocess

usage()
{
	echo "Usage: [-s W,H] [-r MIN,MAX] [-t|-l] FILE1 [FILE2 ..]"
	echo "FILEs must contain 'slabinfo -X' samples"
	echo "-t - plot totals for FILE(s)"
	echo "-l - plot slabs stats for FILE(s)"
	echo "-s %d,%d - set image width and height"
	echo "-r %d,%d - use data samples from a given range"
}

check_file_exist()
{
	if [ ! -f "$1" ]; then
		echo "File '$1' does not exist"
		exit 1
	fi
}

do_slabs_plotting()
{
	local file=$1
	local out_file
	local range="every ::$xmin"
	local xtic=""
	local xtic_rotate="norotate"
	local lines=2000000
	local wc_lines

	check_file_exist "$file"

	out_file=`basename "$file"`
	if [ $xmax -ne 0 ]; then
		range="$range::$xmax"
		lines=$((xmax-xmin))
	fi

	wc_lines=`cat "$file" | wc -l`
	if [ $? -ne 0 ] || [ "$wc_lines" -eq 0 ] ; then
		wc_lines=$lines
	fi

	if [ "$wc_lines" -lt "$lines" ]; then
		lines=$wc_lines
	fi

	if [ $((width / lines)) -gt $min_slab_name_size ]; then
		xtic=":xtic(1)"
		xtic_rotate=90
	fi

gnuplot -p << EOF
#!/usr/bin/env gnuplot

set terminal png enhanced size $width,$height large
set output '$out_file.png'
set autoscale xy
set xlabel 'samples'
set ylabel 'bytes'
set style histogram columnstacked title textcolor lt -1
set style fill solid 0.15
set xtics rotate $xtic_rotate
set key left above Left title reverse

plot "$file" $range u 2$xtic title 'SIZE' with boxes,\
	'' $range u 3 title 'LOSS' with boxes
EOF

	if [ $? -eq 0 ]; then
		echo "$out_file.png"
	fi
}

do_totals_plotting()
{
	local gnuplot_cmd=""
	local range="every ::$xmin"
	local file=""

	if [ $xmax -ne 0 ]; then
		range="$range::$xmax"
	fi

	for i in "${t_files[@]}"; do
		check_file_exist "$i"

		file="$file"`basename "$i"`
		gnuplot_cmd="$gnuplot_cmd '$i' $range using 1 title\
			'$i Memory usage' with lines,"
		gnuplot_cmd="$gnuplot_cmd '' $range using 2 title \
			'$i Loss' with lines,"
	done

gnuplot -p << EOF
#!/usr/bin/env gnuplot

set terminal png enhanced size $width,$height large
set autoscale xy
set output '$file.png'
set xlabel 'samples'
set ylabel 'bytes'
set key left above Left title reverse

plot $gnuplot_cmd
EOF

	if [ $? -eq 0 ]; then
		echo "$file.png"
	fi
}

do_preprocess()
{
	local out
	local lines
	local in=$1

	check_file_exist "$in"

	# use only 'TOP' slab (biggest memory usage or loss)
	let lines=3
	out=`basename "$in"`"-slabs-by-loss"
	`cat "$in" | grep -A "$lines" 'Slabs sorted by loss' |\
		egrep -iv '\-\-|Name|Slabs'\
		| awk '{print $1" "$4+$2*$3" "$4}' > "$out"`
	if [ $? -eq 0 ]; then
		do_slabs_plotting "$out"
	fi

	let lines=3
	out=`basename "$in"`"-slabs-by-size"
	`cat "$in" | grep -A "$lines" 'Slabs sorted by size' |\
		egrep -iv '\-\-|Name|Slabs'\
		| awk '{print $1" "$4" "$4-$2*$3}' > "$out"`
	if [ $? -eq 0 ]; then
		do_slabs_plotting "$out"
	fi

	out=`basename "$in"`"-totals"
	`cat "$in" | grep "Memory used" |\
		awk '{print $3" "$7}' > "$out"`
	if [ $? -eq 0 ]; then
		t_files[0]=$out
		do_totals_plotting
	fi
}

parse_opts()
{
	local opt

	while getopts "tlr::s::h" opt; do
		case $opt in
			t)
				mode=totals
				;;
			l)
				mode=slabs
				;;
			s)
				array=(${OPTARG//,/ })
				width=${array[0]}
				height=${array[1]}
				;;
			r)
				array=(${OPTARG//,/ })
				xmin=${array[0]}
				xmax=${array[1]}
				;;
			h)
				usage
				exit 0
				;;
			\?)
				echo "Invalid option: -$OPTARG" >&2
				exit 1
				;;
			:)
				echo "-$OPTARG requires an argument." >&2
				exit 1
				;;
		esac
	done

	return $OPTIND
}

parse_args()
{
	local idx=0
	local p

	for p in "$@"; do
		case $mode in
			preprocess)
				files[$idx]=$p
				idx=$idx+1
				;;
			totals)
				t_files[$idx]=$p
				idx=$idx+1
				;;
			slabs)
				files[$idx]=$p
				idx=$idx+1
				;;
		esac
	done
}

parse_opts "$@"
argstart=$?
parse_args "${@:$argstart}"

if [ ${#files[@]} -eq 0 ] && [ ${#t_files[@]} -eq 0 ]; then
	usage
	exit 1
fi

case $mode in
	preprocess)
		for i in "${files[@]}"; do
			do_preprocess "$i"
		done
		;;
	totals)
		do_totals_plotting
		;;
	slabs)
		for i in "${files[@]}"; do
			do_slabs_plotting "$i"
		done
		;;
	*)
		echo "Unknown mode $mode" >&2
		usage
		exit 1
		;;
esac
+167 -88
tools/vm/slabinfo.c
···
 	struct slabinfo *slab;
 } aliasinfo[MAX_ALIASES];

-int slabs = 0;
-int actual_slabs = 0;
-int aliases = 0;
-int alias_targets = 0;
-int highest_node = 0;
+int slabs;
+int actual_slabs;
+int aliases;
+int alias_targets;
+int highest_node;

 char buffer[4096];

-int show_empty = 0;
-int show_report = 0;
-int show_alias = 0;
-int show_slab = 0;
+int show_empty;
+int show_report;
+int show_alias;
+int show_slab;
 int skip_zero = 1;
-int show_numa = 0;
-int show_track = 0;
-int show_first_alias = 0;
-int validate = 0;
-int shrink = 0;
-int show_inverted = 0;
-int show_single_ref = 0;
-int show_totals = 0;
-int sort_size = 0;
-int sort_active = 0;
-int set_debug = 0;
-int show_ops = 0;
-int show_activity = 0;
+int show_numa;
+int show_track;
+int show_first_alias;
+int validate;
+int shrink;
+int show_inverted;
+int show_single_ref;
+int show_totals;
+int sort_size;
+int sort_active;
+int set_debug;
+int show_ops;
+int show_activity;
+int output_lines = -1;
+int sort_loss;
+int extended_totals;
+int show_bytes;

 /* Debug options */
-int sanity = 0;
-int redzone = 0;
-int poison = 0;
-int tracking = 0;
-int tracing = 0;
+int sanity;
+int redzone;
+int poison;
+int tracking;
+int tracing;

 int page_size;
···
 	"-v|--validate Validate slabs\n"
 	"-z|--zero Include empty slabs\n"
 	"-1|--1ref Single reference\n"
+	"-N|--lines=K Show the first K slabs\n"
+	"-L|--Loss Sort by loss\n"
+	"-X|--Xtotals Show extended summary information\n"
+	"-B|--Bytes Show size in bytes\n"
 	"\nValid debug options (FZPUT may be combined)\n"
 	"a / A Switch on all debug options (=FZUP)\n"
 	"- Switch off all debug options\n"
···
 	char trailer = 0;
 	int n;

-	if (value > 1000000000UL) {
-		divisor = 100000000UL;
-		trailer = 'G';
-	} else if (value > 1000000UL) {
-		divisor = 100000UL;
-		trailer = 'M';
-	} else if (value > 1000UL) {
-		divisor = 100;
-		trailer = 'K';
+	if (!show_bytes) {
+		if (value > 1000000000UL) {
+			divisor = 100000000UL;
+			trailer = 'G';
+		} else if (value > 1000000UL) {
+			divisor = 100000UL;
+			trailer = 'M';
+		} else if (value > 1000UL) {
+			divisor = 100;
+			trailer = 'K';
+		}
 	}

 	value /= divisor;
···
 static void first_line(void)
 {
 	if (show_activity)
-		printf("Name Objects Alloc Free %%Fast Fallb O CmpX UL\n");
+		printf("Name Objects Alloc Free"
+			" %%Fast Fallb O CmpX UL\n");
 	else
-		printf("Name Objects Objsize Space "
-			"Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
+		printf("Name Objects Objsize %s "
+			"Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n",
+			sort_loss ? " Loss" : "Space");
 }

 /*
···
 {
 	return s->alloc_fastpath + s->free_fastpath +
 		s->alloc_slowpath + s->free_slowpath;
+}
+
+static unsigned long slab_waste(struct slabinfo *s)
+{
+	return slab_size(s) - s->objects * s->object_size;
 }

 static void slab_numa(struct slabinfo *s, int mode)
···
 	if (strcmp(s->name, "*") == 0)
 		return;

-	printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n",
+	printf("\nSlabcache: %-15s Aliases: %2d Order : %2d Objects: %lu\n",
 		s->name, s->aliases, s->order, s->objects);
 	if (s->hwcache_align)
 		printf("** Hardware cacheline aligned\n");
···
 	if (show_empty && s->slabs)
 		return;

-	store_size(size_str, slab_size(s));
+	if (sort_loss == 0)
+		store_size(size_str, slab_size(s));
+	else
+		store_size(size_str, slab_waste(s));
 	snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
 		s->partial, s->cpu_slabs);
···
 		total_free ? (s->free_fastpath * 100 / total_free) : 0,
 		s->order_fallback, s->order, s->cmpxchg_double_fail,
 		s->cmpxchg_double_cpu_fail);
-	}
-	else
-		printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
+	} else {
+		printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
 			s->name, s->objects, s->object_size, size_str, dist_str,
 			s->objs_per_slab, s->order,
 			s->slabs ? (s->partial * 100) / s->slabs : 100,
 			s->slabs ? (s->objects * s->object_size * 100) /
 				(s->slabs * (page_size << s->order)) : 100,
 			flags);
+	}
 }

 /*
···
 	printf("Slabcache Totals\n");
 	printf("----------------\n");
-	printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n",
+	printf("Slabcaches : %15d Aliases : %11d->%-3d Active: %3d\n",
 		slabs, aliases, alias_targets, used_slabs);

 	store_size(b1, total_size);store_size(b2, total_waste);
 	store_size(b3, total_waste * 100 / total_used);
-	printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
+	printf("Memory used: %15s # Loss : %15s MRatio:%6s%%\n", b1, b2, b3);

 	store_size(b1, total_objects);store_size(b2, total_partobj);
 	store_size(b3, total_partobj * 100 / total_objects);
-	printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3);
+	printf("# Objects : %15s # PartObj: %15s ORatio:%6s%%\n", b1, b2, b3);

 	printf("\n");
-	printf("Per Cache Average Min Max Total\n");
-	printf("---------------------------------------------------------\n");
+	printf("Per Cache Average "
+		"Min Max Total\n");
+	printf("---------------------------------------"
+		"-------------------------------------\n");

 	store_size(b1, avg_objects);store_size(b2, min_objects);
 	store_size(b3, max_objects);store_size(b4, total_objects);
-	printf("#Objects %10s %10s %10s %10s\n",
+	printf("#Objects %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_slabs);store_size(b2, min_slabs);
 	store_size(b3, max_slabs);store_size(b4, total_slabs);
-	printf("#Slabs %10s %10s %10s %10s\n",
+	printf("#Slabs %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_partial);store_size(b2, min_partial);
 	store_size(b3, max_partial);store_size(b4, total_partial);
-	printf("#PartSlab %10s %10s %10s %10s\n",
+	printf("#PartSlab %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);
 	store_size(b1, avg_ppart);store_size(b2, min_ppart);
 	store_size(b3, max_ppart);
 	store_size(b4, total_partial * 100 / total_slabs);
-	printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n",
+	printf("%%PartSlab%15s%% %15s%% %15s%% %15s%%\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_partobj);store_size(b2, min_partobj);
 	store_size(b3, max_partobj);
 	store_size(b4, total_partobj);
-	printf("PartObjs %10s %10s %10s %10s\n",
+	printf("PartObjs %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj);
 	store_size(b3, max_ppartobj);
 	store_size(b4, total_partobj * 100 / total_objects);
-	printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n",
+	printf("%% PartObj%15s%% %15s%% %15s%% %15s%%\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_size);store_size(b2, min_size);
 	store_size(b3, max_size);store_size(b4, total_size);
-	printf("Memory %10s %10s %10s %10s\n",
+	printf("Memory %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_used);store_size(b2, min_used);
 	store_size(b3, max_used);store_size(b4, total_used);
-	printf("Used %10s %10s %10s %10s\n",
+	printf("Used %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);

 	store_size(b1, avg_waste);store_size(b2, min_waste);
 	store_size(b3, max_waste);store_size(b4, total_waste);
-	printf("Loss %10s %10s %10s %10s\n",
+	printf("Loss %15s %15s %15s %15s\n",
 		b1, b2, b3, b4);

 	printf("\n");
-	printf("Per Object Average Min Max\n");
-	printf("---------------------------------------------\n");
+	printf("Per Object Average "
+		"Min Max\n");
+	printf("---------------------------------------"
+		"--------------------\n");

 	store_size(b1, avg_memobj);store_size(b2, min_memobj);
 	store_size(b3, max_memobj);
-	printf("Memory %10s %10s %10s\n",
+	printf("Memory %15s %15s %15s\n",
 		b1, b2, b3);
 	store_size(b1, avg_objsize);store_size(b2, min_objsize);
 	store_size(b3, max_objsize);
-	printf("User %10s %10s %10s\n",
+	printf("User %15s %15s %15s\n",
 		b1, b2, b3);

 	store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
 	store_size(b3, max_objwaste);
-	printf("Loss %10s %10s %10s\n",
+	printf("Loss %15s %15s %15s\n",
 		b1, b2, b3);
 }
···
 		result = slab_size(s1) < slab_size(s2);
 	else if (sort_active)
 		result = slab_activity(s1) < slab_activity(s2);
+	else if (sort_loss)
+		result = slab_waste(s1) < slab_waste(s2);
 	else
 		result = strcasecmp(s1->name, s2->name);
···
 			active = a->slab->name;
 		}
 		else
-			printf("%-20s -> %s\n", a->name, a->slab->name);
+			printf("%-15s -> %s\n", a->name, a->slab->name);
 	}
 	if (active)
 		printf("\n");
···
 static void output_slabs(void)
 {
 	struct slabinfo *slab;
+	int lines = output_lines;

-	for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
+	for (slab = slabinfo; (slab < slabinfo + slabs) &&
+			lines != 0; slab++) {

 		if (slab->alias)
 			continue;

+		if (lines != -1)
+			lines--;

 		if (show_numa)
 			slab_numa(slab, 0);
···
 	}
 }

+static void xtotals(void)
+{
+	totals();
+
+	link_slabs();
+	rename_slabs();
+
+	printf("\nSlabs sorted by size\n");
+	printf("--------------------\n");
+	sort_loss = 0;
+	sort_size = 1;
+	sort_slabs();
+	output_slabs();
+
+	printf("\nSlabs sorted by loss\n");
+	printf("--------------------\n");
+	line = 0;
+	sort_loss = 1;
+	sort_size = 0;
+	sort_slabs();
+	output_slabs();
+	printf("\n");
+}
+
 struct option opts[] = {
-	{ "aliases", 0, NULL, 'a' },
-	{ "activity", 0, NULL, 'A' },
-	{ "debug", 2, NULL, 'd' },
-	{ "display-activity", 0, NULL, 'D' },
-	{ "empty", 0, NULL, 'e' },
-	{ "first-alias", 0, NULL, 'f' },
-	{ "help", 0, NULL, 'h' },
-	{ "inverted", 0, NULL, 'i'},
-	{ "numa", 0, NULL, 'n' },
-	{ "ops", 0, NULL, 'o' },
-	{ "report", 0, NULL, 'r' },
-	{ "shrink", 0, NULL, 's' },
-	{ "slabs", 0, NULL, 'l' },
-	{ "track", 0, NULL, 't'},
-	{ "validate", 0, NULL, 'v' },
-	{ "zero", 0, NULL, 'z' },
-	{ "1ref", 0, NULL, '1'},
+	{ "aliases", no_argument, NULL, 'a' },
+	{ "activity", no_argument, NULL, 'A' },
+	{ "debug", optional_argument, NULL, 'd' },
+	{ "display-activity", no_argument, NULL, 'D' },
+	{ "empty", no_argument, NULL, 'e' },
+	{ "first-alias", no_argument, NULL, 'f' },
+	{ "help", no_argument, NULL, 'h' },
+	{ "inverted", no_argument, NULL, 'i'},
+	{ "slabs", no_argument, NULL, 'l' },
+	{ "numa", no_argument, NULL, 'n' },
+	{ "ops", no_argument, NULL, 'o' },
+	{ "shrink", no_argument, NULL, 's' },
+	{ "report", no_argument, NULL, 'r' },
+	{ "Size", no_argument, NULL, 'S'},
+	{ "tracking", no_argument, NULL, 't'},
+	{ "Totals", no_argument, NULL, 'T'},
+	{ "validate", no_argument, NULL, 'v' },
+	{ "zero", no_argument, NULL, 'z' },
+	{ "1ref", no_argument, NULL, '1'},
+	{ "lines", required_argument, NULL, 'N'},
+	{ "Loss", no_argument, NULL, 'L'},
+	{ "Xtotals", no_argument, NULL, 'X'},
+	{ "Bytes", no_argument, NULL, 'B'},
 	{ NULL, 0, NULL, 0 }
 };
···
 	page_size = getpagesize();

-	while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
+	while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTSN:LXB",
 		opts, NULL)) != -1)
 		switch (c) {
 		case '1':
···
 		case 'S':
 			sort_size = 1;
 			break;
-
+		case 'N':
+			if (optarg) {
+				output_lines = atoi(optarg);
+				if (output_lines < 1)
+					output_lines = 1;
+			}
+			break;
+		case 'L':
+			sort_loss = 1;
+			break;
+		case 'X':
+			if (output_lines == -1)
+				output_lines = 1;
+			extended_totals = 1;
+			show_bytes = 1;
+			break;
+		case 'B':
+			show_bytes = 1;
+			break;
 		default:
 			fatal("%s: Invalid option '%c'\n", argv[0], optopt);
···
 		fatal("%s: Invalid pattern '%s' code %d\n",
 			argv[0], pattern_source, err);
 	read_slab_dir();
-	if (show_alias)
+	if (show_alias) {
 		alias();
-	else
-	if (show_totals)
+	} else if (extended_totals) {
+		xtotals();
+	} else if (show_totals) {
 		totals();
-	else {
+	} else {
 		link_slabs();
 		rename_slabs();
 		sort_slabs();