Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (Andrew's patch-bomb)

Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneeded devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_device_register fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: fix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...

+3173 -1059
+5
Documentation/ABI/obsolete/proc-sys-vm-nr_pdflush_threads
···
+What: /proc/sys/vm/nr_pdflush_threads
+Date: June 2012
+Contact: Wanpeng Li <liwp@linux.vnet.ibm.com>
+Description: Since pdflush is replaced by per-BDI flusher, the interface of old pdflush
+             exported in /proc/sys/vm/ should be removed.
+45
Documentation/cgroups/hugetlb.txt
···
+HugeTLB Controller
+-------------------
+
+The HugeTLB controller allows to limit the HugeTLB usage per control group and
+enforces the controller limit during page fault. Since HugeTLB doesn't
+support page reclaim, enforcing the limit at page fault time implies that,
+the application will get SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how much
+HugeTLB pages it would require for its use.
+
+HugeTLB controller can be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the parent HugeTLB group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+
+New groups can be created under the parent group /sys/fs/cgroup.
+
+# cd /sys/fs/cgroup
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it.
+
+Brief summary of control files
+
+ hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes     # show current res_counter usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt            # show the number of allocation failure due to HugeTLB limit
+
+For a system supporting two hugepage size (16M and 16G) the control
+files include:
+
+hugetlb.16GB.limit_in_bytes
+hugetlb.16GB.max_usage_in_bytes
+hugetlb.16GB.usage_in_bytes
+hugetlb.16GB.failcnt
+hugetlb.16MB.limit_in_bytes
+hugetlb.16MB.max_usage_in_bytes
+hugetlb.16MB.usage_in_bytes
+hugetlb.16MB.failcnt
+7 -5
Documentation/cgroups/memory.txt
···
 
 memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
+memory.kmem.tcp.failcnt            # show the number of tcp buf memory usage hits limits
+memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
 
 1. History
 
···
 But see section 8.2: when moving a task to another cgroup, its pages may
 be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
 
-Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used.
+Exception: If CONFIG_CGROUP_CGROUP_MEMCG_SWAP is not used.
 When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
 
-2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+2.4 Swap Extension (CONFIG_MEMCG_SWAP)
 
 Swap Extension allows you to record charge for swap. A swapped-in page is
 charged back to original page allocator if possible.
···
 per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
 zone->lru_lock, it has no lock of its own.
 
-2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM)
+2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
 With the Kernel memory extension, the Memory Controller is able to limit
 the amount of kernel memory used by the system. Kernel memory is fundamentally
···
 
 a. Enable CONFIG_CGROUPS
 b. Enable CONFIG_RESOURCE_COUNTERS
-c. Enable CONFIG_CGROUP_MEM_RES_CTLR
-d. Enable CONFIG_CGROUP_MEM_RES_CTLR_SWAP (to use swap extension)
+c. Enable CONFIG_MEMCG
+d. Enable CONFIG_MEMCG_SWAP (to use swap extension)
 
 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
 # mount -t tmpfs none /sys/fs/cgroup
+8
Documentation/feature-removal-schedule.txt
···
 
 ---------------------------
 
+What:	/proc/sys/vm/nr_pdflush_threads
+When:	2012
+Why:	Since pdflush is deprecated, the interface exported in /proc/sys/vm/
+	should be removed.
+Who:	Wanpeng Li <liwp@linux.vnet.ibm.com>
+
+---------------------------
+
 What:	CONFIG_APM_CPU_IDLE, and its ability to call APM BIOS in idle
 When:	2012
 Why:	This optional sub-feature of APM is of dubious reliability,
+13
Documentation/filesystems/Locking
···
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
+	int (*swap_activate)(struct file *);
+	int (*swap_deactivate)(struct file *);
 
 locking rules:
 	All except set_page_dirty and freepage may block
···
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
+swap_activate:		no
+swap_deactivate:	no
 
 	->write_begin(), ->write_end(), ->sync_page() and ->readpage()
 may be called from the request handler (/dev/loop).
···
 cleaned, or an error value if not. Note that in order to prevent the page
 getting mapped back in and redirtied, it needs to be kept locked
 across the entire operation.
+
+	->swap_activate will be called with a non-zero argument on
+files backing (non block device backed) swapfiles. A return value
+of zero indicates success, in which case this file can be used for
+backing swapspace. The swapspace operations will be proxied to the
+address space operations.
+
+	->swap_deactivate() will be called in the sys_swapoff()
+path after ->swap_activate() returned success.
 
 ----------------------- file_lock_operations ------------------------------
 prototypes:
+12
Documentation/filesystems/vfs.txt
···
 	int (*migratepage) (struct page *, struct page *);
 	int (*launder_page) (struct page *);
 	int (*error_remove_page) (struct mapping *mapping, struct page *page);
+	int (*swap_activate)(struct file *);
+	int (*swap_deactivate)(struct file *);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
···
 	is ok for this address space. Used for memory failure handling.
 	Setting this implies you deal with pages going away under you,
 	unless you have them locked or reference counts increased.
+
+  swap_activate: Called when swapon is used on a file to allocate
+	space if necessary and pin the block lookup information in
+	memory. A return value of zero indicates success,
+	in which case this file can be used to back swapspace. The
+	swapspace operations will be proxied to this address space's
+	->swap_{out,in} methods.
+
+  swap_deactivate: Called during swapoff on files where swap_activate
+	was successful.
 
 
 The File Object
+14 -16
Documentation/sysctl/vm.txt
···
 - mmap_min_addr
 - nr_hugepages
 - nr_overcommit_hugepages
-- nr_pdflush_threads
 - nr_trim_pages (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
···
 
 ==============================================================
 
-nr_pdflush_threads
-
-The current number of pdflush threads. This value is read-only.
-The value changes according to the number of dirty pages in the system.
-
-When necessary, additional pdflush threads are created, one per second, up to
-nr_pdflush_threads_max.
-
-==============================================================
-
 nr_trim_pages
 
 This is available only on NOMMU kernels.
···
 
 Enables a system-wide task dump (excluding kernel threads) to be
 produced when the kernel performs an OOM-killing and includes such
-information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
-name. This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
+information as pid, uid, tgid, vm size, rss, nr_ptes, swapents,
+oom_score_adj score, and name. This is helpful to determine why the
+OOM killer was invoked, to identify the rogue task that caused it,
+and to determine why the OOM killer chose the task it did to kill.
 
 If this is set to zero, this information is suppressed. On very
 large systems with thousands of tasks it may not be feasible to dump
···
 
 page-cluster
 
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
 
 It is a logarithmic value - setting it to zero means "1 page", setting
 it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
 
 The default value is three (eight pages at a time). There may be some
 small benefits in tuning this to a different value if your workload is
 swap-intensive.
+
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
 
 =============================================================
-1
arch/ia64/kernel/perfmon.c
···
 	 */
 	insert_vm_struct(mm, vma);
 
-	mm->total_vm  += size >> PAGE_SHIFT;
 	vm_stat_account(vma->vm_mm, vma->vm_flags, vma->vm_file,
 							vma_pages(vma));
 	up_write(&task->mm->mmap_sem);
+1
arch/mips/sgi-ip27/ip27-memory.c
···
 	 * Allocate the node data structures on the node first.
 	 */
 	__node_data[node] = __va(slot_freepfn << PAGE_SHIFT);
+	memset(__node_data[node], 0, PAGE_SIZE);
 
 	NODE_DATA(node)->bdata = &bootmem_node_data[node];
 	NODE_DATA(node)->node_start_pfn = start_pfn;
+2 -2
arch/powerpc/configs/chroma_defconfig
···
 CONFIG_CPUSETS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
-CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+CONFIG_MEMCG=y
+CONFIG_MEMCG_SWAP=y
 CONFIG_NAMESPACES=y
 CONFIG_RELAY=y
 CONFIG_BLK_DEV_INITRD=y
+1 -1
arch/s390/defconfig
···
 CONFIG_CPUSETS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
+CONFIG_MEMCG=y
 CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_RT_GROUP_SCHED=y
+1 -1
arch/sh/configs/apsh4ad0a_defconfig
···
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
+CONFIG_MEMCG=y
 CONFIG_BLK_CGROUP=y
 CONFIG_NAMESPACES=y
 CONFIG_BLK_DEV_INITRD=y
+2 -2
arch/sh/configs/sdk7786_defconfig
···
 # CONFIG_PROC_PID_CPUSET is not set
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
-CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+CONFIG_MEMCG=y
+CONFIG_MEMCG_SWAP=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_BLK_CGROUP=y
+1 -1
arch/sh/configs/se7206_defconfig
···
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
+CONFIG_MEMCG=y
 CONFIG_RELAY=y
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
+1 -1
arch/sh/configs/shx3_defconfig
···
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
+CONFIG_MEMCG=y
 CONFIG_RELAY=y
 CONFIG_NAMESPACES=y
 CONFIG_UTS_NS=y
+2 -2
arch/sh/configs/urquell_defconfig
···
 # CONFIG_PROC_PID_CPUSET is not set
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
-CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+CONFIG_MEMCG=y
+CONFIG_MEMCG_SWAP=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_BLK_DEV_INITRD=y
+2 -2
arch/tile/configs/tilegx_defconfig
···
 CONFIG_CPUSETS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
-CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+CONFIG_MEMCG=y
+CONFIG_MEMCG_SWAP=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_BLK_CGROUP=y
+2 -2
arch/tile/configs/tilepro_defconfig
···
 CONFIG_CPUSETS=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
-CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
+CONFIG_MEMCG=y
+CONFIG_MEMCG_SWAP=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_RT_GROUP_SCHED=y
 CONFIG_BLK_CGROUP=y
+4 -4
arch/um/defconfig
···
 CONFIG_PROC_PID_CPUSET=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_RESOURCE_COUNTERS=y
-CONFIG_CGROUP_MEM_RES_CTLR=y
-CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
-# CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is not set
-# CONFIG_CGROUP_MEM_RES_CTLR_KMEM is not set
+CONFIG_MEMCG=y
+CONFIG_MEMCG_SWAP=y
+# CONFIG_MEMCG_SWAP_ENABLED is not set
+# CONFIG_MEMCG_KMEM is not set
 CONFIG_CGROUP_SCHED=y
 CONFIG_FAIR_GROUP_SCHED=y
 # CONFIG_CFS_BANDWIDTH is not set
+1
arch/xtensa/Kconfig
···
 config XTENSA
 	def_bool y
 	select HAVE_IDE
+	select GENERIC_ATOMIC64
 	select HAVE_GENERIC_HARDIRQS
 	select GENERIC_IRQ_SHOW
 	select GENERIC_CPU_DEVICES
+1
drivers/base/Kconfig
···
 	bool "Contiguous Memory Allocator (EXPERIMENTAL)"
 	depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK && EXPERIMENTAL
 	select MIGRATION
+	select MEMORY_ISOLATION
 	help
 	  This enables the Contiguous Memory Allocator which allows drivers
 	  to allocate big physically-contiguous blocks of memory for use with
+5 -1
drivers/block/nbd.c
···
 	struct msghdr msg;
 	struct kvec iov;
 	sigset_t blocked, oldset;
+	unsigned long pflags = current->flags;
 
 	if (unlikely(!sock)) {
 		dev_err(disk_to_dev(nbd->disk),
···
 	siginitsetinv(&blocked, sigmask(SIGKILL));
 	sigprocmask(SIG_SETMASK, &blocked, &oldset);
 
+	current->flags |= PF_MEMALLOC;
 	do {
-		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
 		iov.iov_base = buf;
 		iov.iov_len = size;
 		msg.msg_name = NULL;
···
 	} while (size > 0);
 
 	sigprocmask(SIG_SETMASK, &oldset, NULL);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 
 	return result;
 }
···
 
 	BUG_ON(nbd->magic != NBD_MAGIC);
 
+	sk_set_memalloc(nbd->sock->sk);
 	nbd->pid = task_pid_nr(current);
 	ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
 	if (ret) {
+1 -1
drivers/net/ethernet/chelsio/cxgb4/sge.c
···
 #endif
 
 	while (n--) {
-		pg = alloc_page(gfp);
+		pg = __skb_alloc_page(gfp, NULL);
 		if (unlikely(!pg)) {
 			q->alloc_failed++;
 			break;
+1 -1
drivers/net/ethernet/chelsio/cxgb4vf/sge.c
···
 
 alloc_small_pages:
 	while (n--) {
-		page = alloc_page(gfp | __GFP_NOWARN | __GFP_COLD);
+		page = __skb_alloc_page(gfp | __GFP_NOWARN, NULL);
 		if (unlikely(!page)) {
 			fl->alloc_failed++;
 			break;
+1 -1
drivers/net/ethernet/intel/igb/igb_main.c
···
 		return true;
 
 	if (!page) {
-		page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		page = __skb_alloc_page(GFP_ATOMIC, bi->skb);
 		bi->page = page;
 		if (unlikely(!page)) {
 			rx_ring->rx_stats.alloc_failed++;
+2 -2
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
···
 
 	/* alloc new page for storage */
 	if (likely(!page)) {
-		page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
-				   ixgbe_rx_pg_order(rx_ring));
+		page = __skb_alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
+					 bi->skb, ixgbe_rx_pg_order(rx_ring));
 		if (unlikely(!page)) {
 			rx_ring->rx_stats.alloc_rx_page_failed++;
 			return false;
-1
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
···
 			adapter->alloc_rx_buff_failed++;
 			goto no_buffers;
 		}
-
 		bi->skb = skb;
 	}
 	if (!bi->dma) {
+1 -1
drivers/net/usb/cdc-phonet.c
···
 	struct page *page;
 	int err;
 
-	page = alloc_page(gfp_flags);
+	page = __skb_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
 	if (!page)
 		return -ENOMEM;
+1 -3
drivers/rtc/rtc-88pm80x.c
···
 
 	info->rtc_dev = rtc_device_register("88pm80x-rtc", &pdev->dev,
 					    &pm80x_rtc_ops, THIS_MODULE);
-	ret = PTR_ERR(info->rtc_dev);
 	if (IS_ERR(info->rtc_dev)) {
+		ret = PTR_ERR(info->rtc_dev);
 		dev_err(&pdev->dev, "Failed to register RTC device: %d\n", ret);
 		goto out_rtc;
 	}
···
 out_rtc:
 	pm80x_free_irq(chip, info->irq, info);
 out:
-	devm_kfree(&pdev->dev, info);
 	return ret;
 }
···
 	platform_set_drvdata(pdev, NULL);
 	rtc_device_unregister(info->rtc_dev);
 	pm80x_free_irq(info->chip, info->irq, info);
-	devm_kfree(&pdev->dev, info);
 	return 0;
 }
+1 -1
drivers/usb/gadget/f_phonet.c
···
 	struct page *page;
 	int err;
 
-	page = alloc_page(gfp_flags);
+	page = __skb_alloc_page(gfp_flags | __GFP_NOMEMALLOC, NULL);
 	if (!page)
 		return -ENOMEM;
-5
fs/fs-writeback.c
···
 	struct completion *done;	/* set if the caller waits */
 };
 
-/*
- * We don't actually have pdflush, but this one is exported though /proc...
- */
-int nr_pdflush_threads;
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
+2 -2
fs/hugetlbfs/inode.c
···
 		else
 			v_offset = 0;
 
-		__unmap_hugepage_range(vma,
-				vma->vm_start + v_offset, vma->vm_end, NULL);
+		unmap_hugepage_range(vma, vma->vm_start + v_offset,
+				vma->vm_end, NULL);
 	}
 }
+8
fs/nfs/Kconfig
···
 
 	  If unsure, say Y.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SUNRPC_SWAP
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
 config NFS_V4_1
 	bool "NFS client support for NFSv4.1 (EXPERIMENTAL)"
 	depends on NFS_V4 && EXPERIMENTAL
+54 -28
fs/nfs/direct.c
···
  * @nr_segs: size of iovec array
  *
  * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O. However, we shunt off direct
- * read and write requests before the VFS gets them, so this method
- * should never be called.
+ * the NFS client supports direct I/O. However, for most direct IO, we
+ * shunt off direct read and write requests before the VFS gets them,
+ * so this method is only ever called for swap.
  */
 ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t pos, unsigned long nr_segs)
 {
+#ifndef CONFIG_NFS_SWAP
 	dprintk("NFS: nfs_direct_IO (%s) off/no(%Ld/%lu) EINVAL\n",
 			iocb->ki_filp->f_path.dentry->d_name.name,
 			(long long) pos, nr_segs);
 
 	return -EINVAL;
+#else
+	VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
+	VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);
+
+	if (rw == READ || rw == KERNEL_READ)
+		return nfs_file_direct_read(iocb, iov, nr_segs, pos,
+				rw == READ ? true : false);
+	return nfs_file_direct_write(iocb, iov, nr_segs, pos,
+				rw == WRITE ? true : false);
+#endif /* CONFIG_NFS_SWAP */
 }
 
 static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
···
  */
 static ssize_t nfs_direct_read_schedule_segment(struct nfs_pageio_descriptor *desc,
 						const struct iovec *iov,
-						loff_t pos)
+						loff_t pos, bool uio)
 {
 	struct nfs_direct_req *dreq = desc->pg_dreq;
 	struct nfs_open_context *ctx = dreq->ctx;
···
 						GFP_KERNEL);
 		if (!pagevec)
 			break;
-		down_read(&current->mm->mmap_sem);
-		result = get_user_pages(current, current->mm, user_addr,
+		if (uio) {
+			down_read(&current->mm->mmap_sem);
+			result = get_user_pages(current, current->mm, user_addr,
 					npages, 1, 0, pagevec, NULL);
-		up_read(&current->mm->mmap_sem);
-		if (result < 0)
-			break;
+			up_read(&current->mm->mmap_sem);
+			if (result < 0)
+				break;
+		} else {
+			WARN_ON(npages != 1);
+			result = get_kernel_page(user_addr, 1, pagevec);
+			if (WARN_ON(result != 1))
+				break;
+		}
+
 		if ((unsigned)result < npages) {
 			bytes = result * PAGE_SIZE;
 			if (bytes <= pgbase) {
···
 static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 					      const struct iovec *iov,
 					      unsigned long nr_segs,
-					      loff_t pos)
+					      loff_t pos, bool uio)
 {
 	struct nfs_pageio_descriptor desc;
 	ssize_t result = -EINVAL;
···
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
-		result = nfs_direct_read_schedule_segment(&desc, vec, pos);
+		result = nfs_direct_read_schedule_segment(&desc, vec, pos, uio);
 		if (result < 0)
 			break;
 		requested_bytes += result;
···
 }
 
 static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
-			       unsigned long nr_segs, loff_t pos)
+			       unsigned long nr_segs, loff_t pos, bool uio)
 {
 	ssize_t result = -ENOMEM;
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
···
 	if (!is_sync_kiocb(iocb))
 		dreq->iocb = iocb;
 
-	result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos);
+	result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos, uio);
 	if (!result)
 		result = nfs_direct_wait(dreq);
 	NFS_I(inode)->read_io += result;
···
  */
 static ssize_t nfs_direct_write_schedule_segment(struct nfs_pageio_descriptor *desc,
 						 const struct iovec *iov,
-						 loff_t pos)
+						 loff_t pos, bool uio)
 {
 	struct nfs_direct_req *dreq = desc->pg_dreq;
 	struct nfs_open_context *ctx = dreq->ctx;
···
 		if (!pagevec)
 			break;
 
-		down_read(&current->mm->mmap_sem);
-		result = get_user_pages(current, current->mm, user_addr,
-					npages, 0, 0, pagevec, NULL);
-		up_read(&current->mm->mmap_sem);
-		if (result < 0)
-			break;
+		if (uio) {
+			down_read(&current->mm->mmap_sem);
+			result = get_user_pages(current, current->mm, user_addr,
+						npages, 0, 0, pagevec, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (result < 0)
+				break;
+		} else {
+			WARN_ON(npages != 1);
+			result = get_kernel_page(user_addr, 0, pagevec);
+			if (WARN_ON(result != 1))
+				break;
+		}
 
 		if ((unsigned)result < npages) {
 			bytes = result * PAGE_SIZE;
···
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       const struct iovec *iov,
 					       unsigned long nr_segs,
-					       loff_t pos)
+					       loff_t pos, bool uio)
 {
 	struct nfs_pageio_descriptor desc;
 	struct inode *inode = dreq->inode;
···
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
-		result = nfs_direct_write_schedule_segment(&desc, vec, pos);
+		result = nfs_direct_write_schedule_segment(&desc, vec, pos, uio);
 		if (result < 0)
 			break;
 		requested_bytes += result;
···
 
 static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos,
-				size_t count)
+				size_t count, bool uio)
 {
 	ssize_t result = -ENOMEM;
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
···
 	if (!is_sync_kiocb(iocb))
 		dreq->iocb = iocb;
 
-	result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos);
+	result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, uio);
 	if (!result)
 		result = nfs_direct_wait(dreq);
 out_release:
···
  * cache.
  */
 ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
-			     unsigned long nr_segs, loff_t pos)
+			     unsigned long nr_segs, loff_t pos, bool uio)
 {
 	ssize_t retval = -EINVAL;
 	struct file *file = iocb->ki_filp;
···
 
 	task_io_account_read(count);
 
-	retval = nfs_direct_read(iocb, iov, nr_segs, pos);
+	retval = nfs_direct_read(iocb, iov, nr_segs, pos, uio);
 	if (retval > 0)
 		iocb->ki_pos = pos + retval;
 
···
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
 ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
-			      unsigned long nr_segs, loff_t pos)
+			      unsigned long nr_segs, loff_t pos, bool uio)
 {
 	ssize_t retval = -EINVAL;
 	struct file *file = iocb->ki_filp;
···
 
 	task_io_account_write(count);
 
-	retval = nfs_direct_write(iocb, iov, nr_segs, pos, count);
+	retval = nfs_direct_write(iocb, iov, nr_segs, pos, count, uio);
 	if (retval > 0) {
 		struct inode *inode = mapping->host;
+23 -5
fs/nfs/file.c
···
 	ssize_t result;
 
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_read(iocb, iov, nr_segs, pos);
+		return nfs_file_direct_read(iocb, iov, nr_segs, pos, true);
 
 	dprintk("NFS: read(%s/%s, %lu@%lu)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
···
 	if (offset != 0)
 		return;
 	/* Cancel any unstarted writes on this page */
-	nfs_wb_page_cancel(page->mapping->host, page);
+	nfs_wb_page_cancel(page_file_mapping(page)->host, page);
 
 	nfs_fscache_invalidate_page(page, page->mapping->host);
 }
···
  */
 static int nfs_launder_page(struct page *page)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = page_file_mapping(page)->host;
 	struct nfs_inode *nfsi = NFS_I(inode);
 
 	dfprintk(PAGECACHE, "NFS: launder_page(%ld, %llu)\n",
···
 	nfs_fscache_wait_on_page_write(nfsi, page);
 	return nfs_wb_page(inode, page);
 }
+
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+			     sector_t *span)
+{
+	*span = sis->pages;
+	return xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 1);
+}
+
+static void nfs_swap_deactivate(struct file *file)
+{
+	xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 0);
+}
+#endif
 
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
···
 	.migratepage = nfs_migrate_page,
 	.launder_page = nfs_launder_page,
 	.error_remove_page = generic_error_remove_page,
+#ifdef CONFIG_NFS_SWAP
+	.swap_activate = nfs_swap_activate,
+	.swap_deactivate = nfs_swap_deactivate,
+#endif
 };
 
 /*
···
 	nfs_fscache_wait_on_page_write(NFS_I(dentry->d_inode), page);
 
 	lock_page(page);
-	mapping = page->mapping;
+	mapping = page_file_mapping(page);
 	if (mapping != dentry->d_inode->i_mapping)
 		goto out_unlock;
···
 	size_t count = iov_length(iov, nr_segs);
 
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_write(iocb, iov, nr_segs, pos);
+		return nfs_file_direct_write(iocb, iov, nr_segs, pos, true);
 
 	dprintk("NFS: write(%s/%s, %lu@%Ld)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
+4
fs/nfs/inode.c
···
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/* swapfiles are not supposed to be shared. */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if (nfs_mapping_need_revalidate_inode(inode)) {
 		ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
 		if (ret < 0)
+4 -3
fs/nfs/internal.h
···
 static inline
 unsigned int nfs_page_length(struct page *page)
 {
-	loff_t i_size = i_size_read(page->mapping->host);
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
 
 	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
 		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
-		if (page->index < end_index)
+		if (page_index < end_index)
 			return PAGE_CACHE_SIZE;
-		if (page->index == end_index)
+		if (page_index == end_index)
 			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
 	}
 	return 0;
+2 -2
fs/nfs/pagelist.c
···
 static inline struct nfs_page *
 nfs_page_alloc(void)
 {
-	struct nfs_page *p = kmem_cache_zalloc(nfs_page_cachep, GFP_KERNEL);
+	struct nfs_page *p = kmem_cache_zalloc(nfs_page_cachep, GFP_NOIO);
 	if (p)
 		INIT_LIST_HEAD(&p->wb_list);
 	return p;
···
 	 * long write-back delay. This will be adjusted in
 	 * update_nfs_request below if the region is not locked. */
 	req->wb_page    = page;
-	req->wb_index	= page->index;
+	req->wb_index	= page_file_index(page);
 	page_cache_get(page);
 	req->wb_offset  = offset;
 	req->wb_pgbase	= offset;
+3 -3
fs/nfs/read.c
··· 527 527 int nfs_readpage(struct file *file, struct page *page) 528 528 { 529 529 struct nfs_open_context *ctx; 530 - struct inode *inode = page->mapping->host; 530 + struct inode *inode = page_file_mapping(page)->host; 531 531 int error; 532 532 533 533 dprintk("NFS: nfs_readpage (%p %ld@%lu)\n", 534 - page, PAGE_CACHE_SIZE, page->index); 534 + page, PAGE_CACHE_SIZE, page_file_index(page)); 535 535 nfs_inc_stats(inode, NFSIOS_VFSREADPAGE); 536 536 nfs_add_stats(inode, NFSIOS_READPAGES, 1); 537 537 ··· 585 585 readpage_async_filler(void *data, struct page *page) 586 586 { 587 587 struct nfs_readdesc *desc = (struct nfs_readdesc *)data; 588 - struct inode *inode = page->mapping->host; 588 + struct inode *inode = page_file_mapping(page)->host; 589 589 struct nfs_page *new; 590 590 unsigned int len; 591 591 int error;
+55 -34
fs/nfs/write.c
··· 52 52 53 53 struct nfs_commit_data *nfs_commitdata_alloc(void) 54 54 { 55 - struct nfs_commit_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS); 55 + struct nfs_commit_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOIO); 56 56 57 57 if (p) { 58 58 memset(p, 0, sizeof(*p)); ··· 70 70 71 71 struct nfs_write_header *nfs_writehdr_alloc(void) 72 72 { 73 - struct nfs_write_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS); 73 + struct nfs_write_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO); 74 74 75 75 if (p) { 76 76 struct nfs_pgio_header *hdr = &p->header; ··· 142 142 set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags); 143 143 } 144 144 145 - static struct nfs_page *nfs_page_find_request_locked(struct page *page) 145 + static struct nfs_page * 146 + nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page) 146 147 { 147 148 struct nfs_page *req = NULL; 148 149 149 - if (PagePrivate(page)) { 150 + if (PagePrivate(page)) 150 151 req = (struct nfs_page *)page_private(page); 151 - if (req != NULL) 152 - kref_get(&req->wb_kref); 152 + else if (unlikely(PageSwapCache(page))) { 153 + struct nfs_page *freq, *t; 154 + 155 + /* Linearly search the commit list for the correct req */ 156 + list_for_each_entry_safe(freq, t, &nfsi->commit_info.list, wb_list) { 157 + if (freq->wb_page == page) { 158 + req = freq; 159 + break; 160 + } 161 + } 153 162 } 163 + 164 + if (req) 165 + kref_get(&req->wb_kref); 166 + 154 167 return req; 155 168 } 156 169 157 170 static struct nfs_page *nfs_page_find_request(struct page *page) 158 171 { 159 - struct inode *inode = page->mapping->host; 172 + struct inode *inode = page_file_mapping(page)->host; 160 173 struct nfs_page *req = NULL; 161 174 162 175 spin_lock(&inode->i_lock); 163 - req = nfs_page_find_request_locked(page); 176 + req = nfs_page_find_request_locked(NFS_I(inode), page); 164 177 spin_unlock(&inode->i_lock); 165 178 return req; 166 179 } ··· 181 168 /* Adjust the file length if we're writing beyond the 
end */ 182 169 static void nfs_grow_file(struct page *page, unsigned int offset, unsigned int count) 183 170 { 184 - struct inode *inode = page->mapping->host; 171 + struct inode *inode = page_file_mapping(page)->host; 185 172 loff_t end, i_size; 186 173 pgoff_t end_index; 187 174 188 175 spin_lock(&inode->i_lock); 189 176 i_size = i_size_read(inode); 190 177 end_index = (i_size - 1) >> PAGE_CACHE_SHIFT; 191 - if (i_size > 0 && page->index < end_index) 178 + if (i_size > 0 && page_file_index(page) < end_index) 192 179 goto out; 193 - end = ((loff_t)page->index << PAGE_CACHE_SHIFT) + ((loff_t)offset+count); 180 + end = page_file_offset(page) + ((loff_t)offset+count); 194 181 if (i_size >= end) 195 182 goto out; 196 183 i_size_write(inode, end); ··· 203 190 static void nfs_set_pageerror(struct page *page) 204 191 { 205 192 SetPageError(page); 206 - nfs_zap_mapping(page->mapping->host, page->mapping); 193 + nfs_zap_mapping(page_file_mapping(page)->host, page_file_mapping(page)); 207 194 } 208 195 209 196 /* We can set the PG_uptodate flag if we see that a write request ··· 244 231 int ret = test_set_page_writeback(page); 245 232 246 233 if (!ret) { 247 - struct inode *inode = page->mapping->host; 234 + struct inode *inode = page_file_mapping(page)->host; 248 235 struct nfs_server *nfss = NFS_SERVER(inode); 249 236 250 237 if (atomic_long_inc_return(&nfss->writeback) > ··· 258 245 259 246 static void nfs_end_page_writeback(struct page *page) 260 247 { 261 - struct inode *inode = page->mapping->host; 248 + struct inode *inode = page_file_mapping(page)->host; 262 249 struct nfs_server *nfss = NFS_SERVER(inode); 263 250 264 251 end_page_writeback(page); ··· 268 255 269 256 static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock) 270 257 { 271 - struct inode *inode = page->mapping->host; 258 + struct inode *inode = page_file_mapping(page)->host; 272 259 struct nfs_page *req; 273 260 int ret; 274 261 275 262 spin_lock(&inode->i_lock); 276 263 for 
(;;) { 277 - req = nfs_page_find_request_locked(page); 264 + req = nfs_page_find_request_locked(NFS_I(inode), page); 278 265 if (req == NULL) 279 266 break; 280 267 if (nfs_lock_request(req)) ··· 329 316 330 317 static int nfs_do_writepage(struct page *page, struct writeback_control *wbc, struct nfs_pageio_descriptor *pgio) 331 318 { 332 - struct inode *inode = page->mapping->host; 319 + struct inode *inode = page_file_mapping(page)->host; 333 320 int ret; 334 321 335 322 nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGE); 336 323 nfs_add_stats(inode, NFSIOS_WRITEPAGES, 1); 337 324 338 - nfs_pageio_cond_complete(pgio, page->index); 325 + nfs_pageio_cond_complete(pgio, page_file_index(page)); 339 326 ret = nfs_page_async_flush(pgio, page, wbc->sync_mode == WB_SYNC_NONE); 340 327 if (ret == -EAGAIN) { 341 328 redirty_page_for_writepage(wbc, page); ··· 352 339 struct nfs_pageio_descriptor pgio; 353 340 int err; 354 341 355 - NFS_PROTO(page->mapping->host)->write_pageio_init(&pgio, 342 + NFS_PROTO(page_file_mapping(page)->host)->write_pageio_init(&pgio, 356 343 page->mapping->host, 357 344 wb_priority(wbc), 358 345 &nfs_async_write_completion_ops); ··· 429 416 spin_lock(&inode->i_lock); 430 417 if (!nfsi->npages && NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) 431 418 inode->i_version++; 432 - set_bit(PG_MAPPED, &req->wb_flags); 433 - SetPagePrivate(req->wb_page); 434 - set_page_private(req->wb_page, (unsigned long)req); 419 + /* 420 + * Swap-space should not get truncated. Hence no need to plug the race 421 + * with invalidate/truncate. 
422 + */ 423 + if (likely(!PageSwapCache(req->wb_page))) { 424 + set_bit(PG_MAPPED, &req->wb_flags); 425 + SetPagePrivate(req->wb_page); 426 + set_page_private(req->wb_page, (unsigned long)req); 427 + } 435 428 nfsi->npages++; 436 429 kref_get(&req->wb_kref); 437 430 spin_unlock(&inode->i_lock); ··· 454 435 BUG_ON (!NFS_WBACK_BUSY(req)); 455 436 456 437 spin_lock(&inode->i_lock); 457 - set_page_private(req->wb_page, 0); 458 - ClearPagePrivate(req->wb_page); 459 - clear_bit(PG_MAPPED, &req->wb_flags); 438 + if (likely(!PageSwapCache(req->wb_page))) { 439 + set_page_private(req->wb_page, 0); 440 + ClearPagePrivate(req->wb_page); 441 + clear_bit(PG_MAPPED, &req->wb_flags); 442 + } 460 443 nfsi->npages--; 461 444 spin_unlock(&inode->i_lock); 462 445 nfs_release_request(req); ··· 495 474 spin_unlock(cinfo->lock); 496 475 if (!cinfo->dreq) { 497 476 inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); 498 - inc_bdi_stat(req->wb_page->mapping->backing_dev_info, 477 + inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info, 499 478 BDI_RECLAIMABLE); 500 479 __mark_inode_dirty(req->wb_context->dentry->d_inode, 501 480 I_DIRTY_DATASYNC); ··· 562 541 nfs_clear_page_commit(struct page *page) 563 542 { 564 543 dec_zone_page_state(page, NR_UNSTABLE_NFS); 565 - dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE); 544 + dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE); 566 545 } 567 546 568 547 static void ··· 754 733 spin_lock(&inode->i_lock); 755 734 756 735 for (;;) { 757 - req = nfs_page_find_request_locked(page); 736 + req = nfs_page_find_request_locked(NFS_I(inode), page); 758 737 if (req == NULL) 759 738 goto out_unlock; 760 739 ··· 813 792 static struct nfs_page * nfs_setup_write_request(struct nfs_open_context* ctx, 814 793 struct page *page, unsigned int offset, unsigned int bytes) 815 794 { 816 - struct inode *inode = page->mapping->host; 795 + struct inode *inode = page_file_mapping(page)->host; 817 796 struct nfs_page *req; 818 
797 819 798 req = nfs_try_to_update_request(inode, page, offset, bytes); ··· 866 845 nfs_release_request(req); 867 846 if (!do_flush) 868 847 return 0; 869 - status = nfs_wb_page(page->mapping->host, page); 848 + status = nfs_wb_page(page_file_mapping(page)->host, page); 870 849 } while (status == 0); 871 850 return status; 872 851 } ··· 896 875 unsigned int offset, unsigned int count) 897 876 { 898 877 struct nfs_open_context *ctx = nfs_file_open_context(file); 899 - struct inode *inode = page->mapping->host; 878 + struct inode *inode = page_file_mapping(page)->host; 900 879 int status = 0; 901 880 902 881 nfs_inc_stats(inode, NFSIOS_VFSUPDATEPAGE); ··· 904 883 dprintk("NFS: nfs_updatepage(%s/%s %d@%lld)\n", 905 884 file->f_path.dentry->d_parent->d_name.name, 906 885 file->f_path.dentry->d_name.name, count, 907 - (long long)(page_offset(page) + offset)); 886 + (long long)(page_file_offset(page) + offset)); 908 887 909 888 /* If we're not using byte range locks, and we know the page 910 889 * is up to date, it may be more efficient to extend the write ··· 1495 1474 nfs_mark_request_commit(req, lseg, cinfo); 1496 1475 if (!cinfo->dreq) { 1497 1476 dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); 1498 - dec_bdi_stat(req->wb_page->mapping->backing_dev_info, 1477 + dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info, 1499 1478 BDI_RECLAIMABLE); 1500 1479 } 1501 1480 nfs_unlock_and_release_request(req); ··· 1752 1731 */ 1753 1732 int nfs_wb_page(struct inode *inode, struct page *page) 1754 1733 { 1755 - loff_t range_start = page_offset(page); 1734 + loff_t range_start = page_file_offset(page); 1756 1735 loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1); 1757 1736 struct writeback_control wbc = { 1758 1737 .sync_mode = WB_SYNC_ALL,
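Editor's note on the fs/nfs/write.c hunks above: the allocations on the write path move from GFP_NOFS to GFP_NOIO because, with swap-over-NFS, these functions can run while the kernel is writing out swap pages, and an allocation that recursed into block I/O could deadlock. A minimal userspace sketch of the strictness ladder follows; the `sk_` names are mine, and the bit values mirror the 3.x-era `___GFP_*` constants purely for illustration:

```c
/* Userspace sketch of the GFP strictness ladder: GFP_NOIO forbids both
 * block I/O and filesystem re-entry during reclaim, GFP_NOFS allows
 * I/O but not FS, GFP_KERNEL allows both.  Bit values are assumed to
 * match the 3.x-era ___GFP_* constants. */
#include <assert.h>
#include <stdbool.h>

#define SK_GFP_WAIT 0x10u  /* allocation may sleep */
#define SK_GFP_IO   0x40u  /* reclaim may start block I/O */
#define SK_GFP_FS   0x80u  /* reclaim may call into filesystems */

#define SK_GFP_NOIO   (SK_GFP_WAIT)                       /* no I/O, no FS */
#define SK_GFP_NOFS   (SK_GFP_WAIT | SK_GFP_IO)           /* I/O ok, no FS */
#define SK_GFP_KERNEL (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

static bool sk_may_enter_io(unsigned int gfp) { return gfp & SK_GFP_IO; }
static bool sk_may_enter_fs(unsigned int gfp) { return gfp & SK_GFP_FS; }
```

With this model, switching the NFS write path to GFP_NOIO is exactly the move from "reclaim may issue I/O" to "reclaim may not", which is what makes the allocation safe under swap-out.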
+1 -1
fs/super.c
··· 62 62 return -1; 63 63 64 64 if (!grab_super_passive(sb)) 65 - return !sc->nr_to_scan ? 0 : -1; 65 + return -1; 66 66 67 67 if (sb->s_op && sb->s_op->nr_cached_objects) 68 68 fs_objects = sb->s_op->nr_cached_objects(sb);
+3
include/linux/backing-dev.h
··· 17 17 #include <linux/timer.h> 18 18 #include <linux/writeback.h> 19 19 #include <linux/atomic.h> 20 + #include <linux/sysctl.h> 20 21 21 22 struct page; 22 23 struct device; ··· 305 304 void set_bdi_congested(struct backing_dev_info *bdi, int sync); 306 305 long congestion_wait(int sync, long timeout); 307 306 long wait_iff_congested(struct zone *zone, int sync, long timeout); 307 + int pdflush_proc_obsolete(struct ctl_table *table, int write, 308 + void __user *buffer, size_t *lenp, loff_t *ppos); 308 309 309 310 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi) 310 311 {
+2
include/linux/blk_types.h
··· 160 160 __REQ_FLUSH_SEQ, /* request for flush sequence */ 161 161 __REQ_IO_STAT, /* account I/O stat */ 162 162 __REQ_MIXED_MERGE, /* merge of different types, fail separately */ 163 + __REQ_KERNEL, /* direct IO to kernel pages */ 163 164 __REQ_NR_BITS, /* stops here */ 164 165 }; 165 166 ··· 202 201 #define REQ_IO_STAT (1 << __REQ_IO_STAT) 203 202 #define REQ_MIXED_MERGE (1 << __REQ_MIXED_MERGE) 204 203 #define REQ_SECURE (1 << __REQ_SECURE) 204 + #define REQ_KERNEL (1 << __REQ_KERNEL) 205 205 206 206 #endif /* __LINUX_BLK_TYPES_H */
+7 -1
include/linux/cgroup_subsys.h
··· 31 31 32 32 /* */ 33 33 34 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 34 + #ifdef CONFIG_MEMCG 35 35 SUBSYS(mem_cgroup) 36 36 #endif 37 37 ··· 69 69 70 70 #ifdef CONFIG_NETPRIO_CGROUP 71 71 SUBSYS(net_prio) 72 + #endif 73 + 74 + /* */ 75 + 76 + #ifdef CONFIG_CGROUP_HUGETLB 77 + SUBSYS(hugetlb) 72 78 #endif 73 79 74 80 /* */
+2 -2
include/linux/compaction.h
··· 58 58 if (++zone->compact_considered > defer_limit) 59 59 zone->compact_considered = defer_limit; 60 60 61 - return zone->compact_considered < (1UL << zone->compact_defer_shift); 61 + return zone->compact_considered < defer_limit; 62 62 } 63 63 64 64 #else ··· 85 85 86 86 static inline bool compaction_deferred(struct zone *zone, int order) 87 87 { 88 - return 1; 88 + return true; 89 89 } 90 90 91 91 #endif /* CONFIG_COMPACTION */
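Editor's note on the include/linux/compaction.h hunk: `compaction_deferred()` implements exponential backoff, where each failed compaction widens the deferral window (capped by `compact_defer_shift`) and compaction is skipped until the window is consumed. A self-contained sketch of that logic, with `sk_` names standing in for the kernel's:

```c
/* Sketch of the exponential-backoff deferral in compaction_deferred():
 * the window is 1 << compact_defer_shift attempts; the counter is
 * capped at the window, and compaction is deferred while the counter
 * is still below it. */
#include <assert.h>
#include <stdbool.h>

struct sk_zone {
	unsigned long compact_considered;
	unsigned int compact_defer_shift;
};

static bool sk_compaction_deferred(struct sk_zone *z)
{
	unsigned long defer_limit = 1UL << z->compact_defer_shift;

	if (++z->compact_considered > defer_limit)
		z->compact_considered = defer_limit;

	return z->compact_considered < defer_limit;
}

/* helper: how many calls are deferred before compaction is allowed */
static unsigned int sk_deferred_count(unsigned int shift)
{
	struct sk_zone z = { 0, shift };
	unsigned int n = 0;

	while (sk_compaction_deferred(&z))
		n++;
	return n;
}
```

For a window of `1 << shift` attempts, the first `(1 << shift) - 1` calls are deferred and the next one proceeds, which is the behavior the hunk's `< defer_limit` comparison expresses directly.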
+8
include/linux/fs.h
··· 165 165 #define READ 0 166 166 #define WRITE RW_MASK 167 167 #define READA RWA_MASK 168 + #define KERNEL_READ (READ|REQ_KERNEL) 169 + #define KERNEL_WRITE (WRITE|REQ_KERNEL) 168 170 169 171 #define READ_SYNC (READ | REQ_SYNC) 170 172 #define WRITE_SYNC (WRITE | REQ_SYNC | REQ_NOIDLE) ··· 429 427 struct vm_area_struct; 430 428 struct vfsmount; 431 429 struct cred; 430 + struct swap_info_struct; 432 431 433 432 extern void __init inode_init(void); 434 433 extern void __init inode_init_early(void); ··· 639 636 int (*is_partially_uptodate) (struct page *, read_descriptor_t *, 640 637 unsigned long); 641 638 int (*error_remove_page)(struct address_space *, struct page *); 639 + 640 + /* swapfile support */ 641 + int (*swap_activate)(struct swap_info_struct *sis, struct file *file, 642 + sector_t *span); 643 + void (*swap_deactivate)(struct file *file); 642 644 }; 643 645 644 646 extern const struct address_space_operations empty_aops;
+11 -2
include/linux/gfp.h
··· 23 23 #define ___GFP_REPEAT 0x400u 24 24 #define ___GFP_NOFAIL 0x800u 25 25 #define ___GFP_NORETRY 0x1000u 26 + #define ___GFP_MEMALLOC 0x2000u 26 27 #define ___GFP_COMP 0x4000u 27 28 #define ___GFP_ZERO 0x8000u 28 29 #define ___GFP_NOMEMALLOC 0x10000u ··· 77 76 #define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT) /* See above */ 78 77 #define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL) /* See above */ 79 78 #define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) /* See above */ 79 + #define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */ 80 80 #define __GFP_COMP ((__force gfp_t)___GFP_COMP) /* Add compound page metadata */ 81 81 #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) /* Return zeroed page on success */ 82 - #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves */ 82 + #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves. 83 + * This takes precedence over the 84 + * __GFP_MEMALLOC flag if both are 85 + * set 86 + */ 83 87 #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */ 84 88 #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */ 85 89 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */ ··· 135 129 /* Control page allocator reclaim behavior */ 136 130 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\ 137 131 __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\ 138 - __GFP_NORETRY|__GFP_NOMEMALLOC) 132 + __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC) 139 133 140 134 /* Control slab gfp mask during early boot */ 141 135 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS)) ··· 384 378 * devices are suspended. 
385 379 */ 386 380 extern gfp_t gfp_allowed_mask; 381 + 382 + /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */ 383 + bool gfp_pfmemalloc_allowed(gfp_t gfp_mask); 387 384 388 385 extern void pm_restrict_gfp_mask(void); 389 386 extern void pm_restore_gfp_mask(void);
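Editor's note on the include/linux/gfp.h hunk: the new `__GFP_MEMALLOC` grants access to the emergency reserves, and the updated comment pins down precedence when both flags appear: `__GFP_NOMEMALLOC` wins. A sketch of just that flag logic (the real `gfp_pfmemalloc_allowed()` also consults process state such as PF_MEMALLOC, which is omitted here; bit values assumed from the hunk):

```c
/* Sketch of the __GFP_MEMALLOC / __GFP_NOMEMALLOC precedence rule:
 * NOMEMALLOC takes priority, so a caller can always force "never
 * touch the emergency reserves". */
#include <assert.h>
#include <stdbool.h>

#define SK_GFP_MEMALLOC   0x2000u   /* allow access to emergency reserves */
#define SK_GFP_NOMEMALLOC 0x10000u  /* forbid it; overrides MEMALLOC */

static bool sk_pfmemalloc_allowed(unsigned int gfp_mask)
{
	if (gfp_mask & SK_GFP_NOMEMALLOC)
		return false;
	return (gfp_mask & SK_GFP_MEMALLOC) != 0;
}
```

This precedence is why adding `__GFP_MEMALLOC` to GFP_RECLAIM_MASK (as the hunk does) is safe: a NOMEMALLOC caller still cannot dip into the reserves.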
+7
include/linux/highmem.h
··· 39 39 40 40 void kmap_flush_unused(void); 41 41 42 + struct page *kmap_to_page(void *addr); 43 + 42 44 #else /* CONFIG_HIGHMEM */ 43 45 44 46 static inline unsigned int nr_free_highpages(void) { return 0; } 47 + 48 + static inline struct page *kmap_to_page(void *addr) 49 + { 50 + return virt_to_page(addr); 51 + } 45 52 46 53 #define totalhigh_pages 0UL 47 54
+45 -5
include/linux/hugetlb.h
··· 4 4 #include <linux/mm_types.h> 5 5 #include <linux/fs.h> 6 6 #include <linux/hugetlb_inline.h> 7 + #include <linux/cgroup.h> 7 8 8 9 struct ctl_table; 9 10 struct user_struct; 11 + struct mmu_gather; 10 12 11 13 #ifdef CONFIG_HUGETLB_PAGE 12 14 ··· 21 19 long count; 22 20 long max_hpages, used_hpages; 23 21 }; 22 + 23 + extern spinlock_t hugetlb_lock; 24 + extern int hugetlb_max_hstate __read_mostly; 25 + #define for_each_hstate(h) \ 26 + for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++) 24 27 25 28 struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); 26 29 void hugepage_put_subpool(struct hugepage_subpool *spool); ··· 47 40 struct page **, struct vm_area_struct **, 48 41 unsigned long *, int *, int, unsigned int flags); 49 42 void unmap_hugepage_range(struct vm_area_struct *, 50 - unsigned long, unsigned long, struct page *); 51 - void __unmap_hugepage_range(struct vm_area_struct *, 52 - unsigned long, unsigned long, struct page *); 43 + unsigned long, unsigned long, struct page *); 44 + void __unmap_hugepage_range_final(struct mmu_gather *tlb, 45 + struct vm_area_struct *vma, 46 + unsigned long start, unsigned long end, 47 + struct page *ref_page); 48 + void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, 49 + unsigned long start, unsigned long end, 50 + struct page *ref_page); 53 51 int hugetlb_prefault(struct address_space *, struct vm_area_struct *); 54 52 void hugetlb_report_meminfo(struct seq_file *); 55 53 int hugetlb_report_node_meminfo(int, char *); ··· 110 98 #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL) 111 99 #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; }) 112 100 #define hugetlb_prefault(mapping, vma) ({ BUG(); 0; }) 113 - #define unmap_hugepage_range(vma, start, end, page) BUG() 114 101 static inline void hugetlb_report_meminfo(struct seq_file *m) 115 102 { 116 103 } ··· 123 112 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; }) 124 113 
#define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; }) 125 114 #define huge_pte_offset(mm, address) 0 126 - #define dequeue_hwpoisoned_huge_page(page) 0 115 + static inline int dequeue_hwpoisoned_huge_page(struct page *page) 116 + { 117 + return 0; 118 + } 119 + 127 120 static inline void copy_huge_page(struct page *dst, struct page *src) 128 121 { 129 122 } 130 123 131 124 #define hugetlb_change_protection(vma, address, end, newprot) 125 + 126 + static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb, 127 + struct vm_area_struct *vma, unsigned long start, 128 + unsigned long end, struct page *ref_page) 129 + { 130 + BUG(); 131 + } 132 + 133 + static inline void __unmap_hugepage_range(struct mmu_gather *tlb, 134 + struct vm_area_struct *vma, unsigned long start, 135 + unsigned long end, struct page *ref_page) 136 + { 137 + BUG(); 138 + } 132 139 133 140 #endif /* !CONFIG_HUGETLB_PAGE */ 134 141 ··· 228 199 unsigned long resv_huge_pages; 229 200 unsigned long surplus_huge_pages; 230 201 unsigned long nr_overcommit_huge_pages; 202 + struct list_head hugepage_activelist; 231 203 struct list_head hugepage_freelists[MAX_NUMNODES]; 232 204 unsigned int nr_huge_pages_node[MAX_NUMNODES]; 233 205 unsigned int free_huge_pages_node[MAX_NUMNODES]; 234 206 unsigned int surplus_huge_pages_node[MAX_NUMNODES]; 207 + #ifdef CONFIG_CGROUP_HUGETLB 208 + /* cgroup control files */ 209 + struct cftype cgroup_files[5]; 210 + #endif 235 211 char name[HSTATE_NAME_LEN]; 236 212 }; 237 213 ··· 336 302 return hstates[index].order + PAGE_SHIFT; 337 303 } 338 304 305 + static inline int hstate_index(struct hstate *h) 306 + { 307 + return h - hstates; 308 + } 309 + 339 310 #else 340 311 struct hstate {}; 341 312 #define alloc_huge_page_node(h, nid) NULL ··· 359 320 return 1; 360 321 } 361 322 #define hstate_index_to_shift(index) 0 323 + #define hstate_index(h) 0 362 324 #endif 363 325 364 326 #endif /* _LINUX_HUGETLB_H */
+126
include/linux/hugetlb_cgroup.h
··· 1 + /* 2 + * Copyright IBM Corporation, 2012 3 + * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> 4 + * 5 + * This program is free software; you can redistribute it and/or modify it 6 + * under the terms of version 2.1 of the GNU Lesser General Public License 7 + * as published by the Free Software Foundation. 8 + * 9 + * This program is distributed in the hope that it would be useful, but 10 + * WITHOUT ANY WARRANTY; without even the implied warranty of 11 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 12 + * 13 + */ 14 + 15 + #ifndef _LINUX_HUGETLB_CGROUP_H 16 + #define _LINUX_HUGETLB_CGROUP_H 17 + 18 + #include <linux/res_counter.h> 19 + 20 + struct hugetlb_cgroup; 21 + /* 22 + * Minimum page order trackable by hugetlb cgroup. 23 + * At least 3 pages are necessary for all the tracking information. 24 + */ 25 + #define HUGETLB_CGROUP_MIN_ORDER 2 26 + 27 + #ifdef CONFIG_CGROUP_HUGETLB 28 + 29 + static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) 30 + { 31 + VM_BUG_ON(!PageHuge(page)); 32 + 33 + if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) 34 + return NULL; 35 + return (struct hugetlb_cgroup *)page[2].lru.next; 36 + } 37 + 38 + static inline 39 + int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) 40 + { 41 + VM_BUG_ON(!PageHuge(page)); 42 + 43 + if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) 44 + return -1; 45 + page[2].lru.next = (void *)h_cg; 46 + return 0; 47 + } 48 + 49 + static inline bool hugetlb_cgroup_disabled(void) 50 + { 51 + if (hugetlb_subsys.disabled) 52 + return true; 53 + return false; 54 + } 55 + 56 + extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, 57 + struct hugetlb_cgroup **ptr); 58 + extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, 59 + struct hugetlb_cgroup *h_cg, 60 + struct page *page); 61 + extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, 62 + struct page *page); 63 + extern void 
hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, 64 + struct hugetlb_cgroup *h_cg); 65 + extern int hugetlb_cgroup_file_init(int idx) __init; 66 + extern void hugetlb_cgroup_migrate(struct page *oldhpage, 67 + struct page *newhpage); 68 + 69 + #else 70 + static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) 71 + { 72 + return NULL; 73 + } 74 + 75 + static inline 76 + int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) 77 + { 78 + return 0; 79 + } 80 + 81 + static inline bool hugetlb_cgroup_disabled(void) 82 + { 83 + return true; 84 + } 85 + 86 + static inline int 87 + hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, 88 + struct hugetlb_cgroup **ptr) 89 + { 90 + return 0; 91 + } 92 + 93 + static inline void 94 + hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, 95 + struct hugetlb_cgroup *h_cg, 96 + struct page *page) 97 + { 98 + return; 99 + } 100 + 101 + static inline void 102 + hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) 103 + { 104 + return; 105 + } 106 + 107 + static inline void 108 + hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, 109 + struct hugetlb_cgroup *h_cg) 110 + { 111 + return; 112 + } 113 + 114 + static inline int __init hugetlb_cgroup_file_init(int idx) 115 + { 116 + return 0; 117 + } 118 + 119 + static inline void hugetlb_cgroup_migrate(struct page *oldhpage, 120 + struct page *newhpage) 121 + { 122 + return; 123 + } 124 + 125 + #endif /* CONFIG_MEM_RES_CTLR_HUGETLB */ 126 + #endif
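Editor's note on the new include/linux/hugetlb_cgroup.h: the owning cgroup pointer is stashed in `page[2].lru.next` of the compound page, i.e. in an otherwise-unused field of the third `struct page`, which is why `HUGETLB_CGROUP_MIN_ORDER` is 2 (an order-2 compound page is the smallest that guarantees a `page[2]` exists). A userspace sketch of the round trip, with `sk_` stand-ins for the kernel types:

```c
/* Sketch of the page[2] trick: a compound page is an array of struct
 * page, and the unused lru.next of the third element stores the
 * hugetlb_cgroup pointer.  Orders below the minimum cannot use it. */
#include <assert.h>
#include <stddef.h>

struct sk_list_head { struct sk_list_head *next, *prev; };
struct sk_page { struct sk_list_head lru; };
struct sk_hugetlb_cgroup { int dummy; };

#define SK_HUGETLB_CGROUP_MIN_ORDER 2

static struct sk_hugetlb_cgroup *
sk_hugetlb_cgroup_from_page(struct sk_page *page, unsigned int order)
{
	if (order < SK_HUGETLB_CGROUP_MIN_ORDER)
		return NULL;
	return (struct sk_hugetlb_cgroup *)page[2].lru.next;
}

static int
sk_set_hugetlb_cgroup(struct sk_page *page, unsigned int order,
		      struct sk_hugetlb_cgroup *h_cg)
{
	if (order < SK_HUGETLB_CGROUP_MIN_ORDER)
		return -1;
	page[2].lru.next = (struct sk_list_head *)h_cg;
	return 0;
}

/* store-then-load round trip for a compound page of the given order */
static int sk_roundtrip_ok(unsigned int order)
{
	struct sk_page pages[8] = { 0 };
	struct sk_hugetlb_cgroup cg;

	if (sk_set_hugetlb_cgroup(pages, order, &cg))
		return 0;
	return sk_hugetlb_cgroup_from_page(pages, order) == &cg;
}
```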
+12 -22
include/linux/memcontrol.h
··· 38 38 unsigned int generation; 39 39 }; 40 40 41 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 41 + #ifdef CONFIG_MEMCG 42 42 /* 43 43 * All "charge" functions with gfp_mask should use GFP_KERNEL or 44 44 * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't ··· 72 72 extern void mem_cgroup_uncharge_page(struct page *page); 73 73 extern void mem_cgroup_uncharge_cache_page(struct page *page); 74 74 75 - extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, 76 - int order); 77 75 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg, 78 76 struct mem_cgroup *memcg); 79 77 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *memcg); ··· 98 100 99 101 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg); 100 102 101 - extern int 102 - mem_cgroup_prepare_migration(struct page *page, 103 - struct page *newpage, struct mem_cgroup **memcgp, gfp_t gfp_mask); 103 + extern void 104 + mem_cgroup_prepare_migration(struct page *page, struct page *newpage, 105 + struct mem_cgroup **memcgp); 104 106 extern void mem_cgroup_end_migration(struct mem_cgroup *memcg, 105 107 struct page *oldpage, struct page *newpage, bool migration_ok); 106 108 ··· 122 124 extern void mem_cgroup_replace_page_cache(struct page *oldpage, 123 125 struct page *newpage); 124 126 125 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 127 + #ifdef CONFIG_MEMCG_SWAP 126 128 extern int do_swap_account; 127 129 #endif 128 130 ··· 180 182 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, 181 183 gfp_t gfp_mask, 182 184 unsigned long *total_scanned); 183 - u64 mem_cgroup_get_limit(struct mem_cgroup *memcg); 184 185 185 186 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx); 186 187 #ifdef CONFIG_TRANSPARENT_HUGEPAGE ··· 190 193 bool mem_cgroup_bad_page_check(struct page *page); 191 194 void mem_cgroup_print_bad_page(struct page *page); 192 195 #endif 193 - #else /* 
CONFIG_CGROUP_MEM_RES_CTLR */ 196 + #else /* CONFIG_MEMCG */ 194 197 struct mem_cgroup; 195 198 196 199 static inline int mem_cgroup_newpage_charge(struct page *page, ··· 276 279 return NULL; 277 280 } 278 281 279 - static inline int 282 + static inline void 280 283 mem_cgroup_prepare_migration(struct page *page, struct page *newpage, 281 - struct mem_cgroup **memcgp, gfp_t gfp_mask) 284 + struct mem_cgroup **memcgp) 282 285 { 283 - return 0; 284 286 } 285 287 286 288 static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg, ··· 362 366 return 0; 363 367 } 364 368 365 - static inline 366 - u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) 367 - { 368 - return 0; 369 - } 370 - 371 369 static inline void mem_cgroup_split_huge_fixup(struct page *head) 372 370 { 373 371 } ··· 374 384 struct page *newpage) 375 385 { 376 386 } 377 - #endif /* CONFIG_CGROUP_MEM_RES_CTLR */ 387 + #endif /* CONFIG_MEMCG */ 378 388 379 - #if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM) 389 + #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM) 380 390 static inline bool 381 391 mem_cgroup_bad_page_check(struct page *page) 382 392 { ··· 396 406 }; 397 407 398 408 struct sock; 399 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 409 + #ifdef CONFIG_MEMCG_KMEM 400 410 void sock_update_memcg(struct sock *sk); 401 411 void sock_release_memcg(struct sock *sk); 402 412 #else ··· 406 416 static inline void sock_release_memcg(struct sock *sk) 407 417 { 408 418 } 409 - #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ 419 + #endif /* CONFIG_MEMCG_KMEM */ 410 420 #endif /* _LINUX_MEMCONTROL_H */ 411 421
+2 -2
include/linux/migrate.h
··· 15 15 extern int migrate_pages(struct list_head *l, new_page_t x, 16 16 unsigned long private, bool offlining, 17 17 enum migrate_mode mode); 18 - extern int migrate_huge_pages(struct list_head *l, new_page_t x, 18 + extern int migrate_huge_page(struct page *, new_page_t x, 19 19 unsigned long private, bool offlining, 20 20 enum migrate_mode mode); 21 21 ··· 36 36 static inline int migrate_pages(struct list_head *l, new_page_t x, 37 37 unsigned long private, bool offlining, 38 38 enum migrate_mode mode) { return -ENOSYS; } 39 - static inline int migrate_huge_pages(struct list_head *l, new_page_t x, 39 + static inline int migrate_huge_page(struct page *page, new_page_t x, 40 40 unsigned long private, bool offlining, 41 41 enum migrate_mode mode) { return -ENOSYS; } 42 42
+31
include/linux/mm.h
··· 805 805 return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); 806 806 } 807 807 808 + extern struct address_space *__page_file_mapping(struct page *); 809 + 810 + static inline 811 + struct address_space *page_file_mapping(struct page *page) 812 + { 813 + if (unlikely(PageSwapCache(page))) 814 + return __page_file_mapping(page); 815 + 816 + return page->mapping; 817 + } 818 + 808 819 static inline int PageAnon(struct page *page) 809 820 { 810 821 return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0; ··· 829 818 { 830 819 if (unlikely(PageSwapCache(page))) 831 820 return page_private(page); 821 + return page->index; 822 + } 823 + 824 + extern pgoff_t __page_file_index(struct page *page); 825 + 826 + /* 827 + * Return the file index of the page. Regular pagecache pages use ->index 828 + * whereas swapcache pages use swp_offset(->private) 829 + */ 830 + static inline pgoff_t page_file_index(struct page *page) 831 + { 832 + if (unlikely(PageSwapCache(page))) 833 + return __page_file_index(page); 834 + 832 835 return page->index; 833 836 } 834 837 ··· 1019 994 struct page **pages, struct vm_area_struct **vmas); 1020 995 int get_user_pages_fast(unsigned long start, int nr_pages, int write, 1021 996 struct page **pages); 997 + struct kvec; 998 + int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, 999 + struct page **pages); 1000 + int get_kernel_page(unsigned long start, int write, struct page **pages); 1022 1001 struct page *get_dump_page(unsigned long addr); 1023 1002 1024 1003 extern int try_to_release_page(struct page * page, gfp_t gfp_mask); ··· 1360 1331 extern void setup_per_cpu_pageset(void); 1361 1332 1362 1333 extern void zone_pcp_update(struct zone *zone); 1334 + extern void zone_pcp_reset(struct zone *zone); 1363 1335 1364 1336 /* nommu.c */ 1365 1337 extern atomic_long_t mmap_pages_allocated; ··· 1558 1528 static inline void vm_stat_account(struct mm_struct *mm, 1559 1529 unsigned long flags, struct file *file, 
long pages) 1560 1530 { 1531 + mm->total_vm += pages; 1561 1532 } 1562 1533 #endif /* CONFIG_PROC_FS */ 1563 1534
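Editor's note on the include/linux/mm.h hunk: `page_file_mapping()` and `page_file_index()` give NFS a single way to ask "which file offset does this page back?" for both regular pagecache pages (which use `->mapping` / `->index`) and swapcache pages (which derive the answer from the swap entry in `->private`). A simplified sketch of the dispatch, with `sk_` types standing in for the kernel's:

```c
/* Sketch of the swapcache dispatch: pagecache pages read ->index
 * directly; swapcache pages map through the swap entry in ->private
 * (swp_offset() in the kernel, modelled here as the raw value). */
#include <assert.h>
#include <stdbool.h>

struct sk_page {
	bool swapcache;         /* stand-in for PageSwapCache() */
	unsigned long index;    /* pagecache offset */
	unsigned long private;  /* swap entry when swapcache is set */
};

static unsigned long sk_page_file_index(const struct sk_page *page)
{
	if (page->swapcache)
		return page->private;  /* kernel: swp_offset(->private) */
	return page->index;
}

/* helper fixture for the assertions below */
static unsigned long sk_index_for(bool swap, unsigned long index,
				  unsigned long priv)
{
	struct sk_page p = { swap, index, priv };
	return sk_page_file_index(&p);
}
```

This is why the NFS hunks earlier in this merge replace every raw `page->index` / `page->mapping` with the `page_file_*` helpers: the same code path must now serve pages whose `->mapping` points at swap state rather than the NFS file.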
+9
include/linux/mm_types.h
··· 54 54 union { 55 55 pgoff_t index; /* Our offset within mapping. */ 56 56 void *freelist; /* slub/slob first free object */ 57 + bool pfmemalloc; /* If set by the page allocator, 58 + * ALLOC_NO_WATERMARKS was set 59 + * and the low watermark was not 60 + * met implying that the system 61 + * is under some pressure. The 62 + * caller should try ensure 63 + * this page is only used to 64 + * free other pages. 65 + */ 57 66 }; 58 67 59 68 union {
+19 -7
include/linux/mmzone.h
··· 201 201 struct lruvec { 202 202 struct list_head lists[NR_LRU_LISTS]; 203 203 struct zone_reclaim_stat reclaim_stat; 204 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 204 + #ifdef CONFIG_MEMCG 205 205 struct zone *zone; 206 206 #endif 207 207 }; ··· 209 209 /* Mask used at gathering information at once (see memcontrol.c) */ 210 210 #define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE)) 211 211 #define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON)) 212 - #define LRU_ALL_EVICTABLE (LRU_ALL_FILE | LRU_ALL_ANON) 213 212 #define LRU_ALL ((1 << NR_LRU_LISTS) - 1) 214 213 215 214 /* Isolate clean file */ ··· 368 369 */ 369 370 spinlock_t lock; 370 371 int all_unreclaimable; /* All pages pinned */ 372 + #if defined CONFIG_COMPACTION || defined CONFIG_CMA 373 + /* pfn where the last incremental compaction isolated free pages */ 374 + unsigned long compact_cached_free_pfn; 375 + #endif 371 376 #ifdef CONFIG_MEMORY_HOTPLUG 372 377 /* see spanned/present_pages for more description */ 373 378 seqlock_t span_seqlock; ··· 478 475 * rarely used fields: 479 476 */ 480 477 const char *name; 478 + #ifdef CONFIG_MEMORY_ISOLATION 479 + /* 480 + * the number of MIGRATE_ISOLATE *pageblock*. 481 + * We need this for free page counting. Look at zone_watermark_ok_safe. 
482 + * It's protected by zone->lock 483 + */ 484 + int nr_pageblock_isolate; 485 + #endif 481 486 } ____cacheline_internodealigned_in_smp; 482 487 483 488 typedef enum { ··· 682 671 int nr_zones; 683 672 #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ 684 673 struct page *node_mem_map; 685 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 674 + #ifdef CONFIG_MEMCG 686 675 struct page_cgroup *node_page_cgroup; 687 676 #endif 688 677 #endif ··· 705 694 range, including holes */ 706 695 int node_id; 707 696 wait_queue_head_t kswapd_wait; 697 + wait_queue_head_t pfmemalloc_wait; 708 698 struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */ 709 699 int kswapd_max_order; 710 700 enum zone_type classzone_idx; ··· 730 718 #include <linux/memory_hotplug.h> 731 719 732 720 extern struct mutex zonelists_mutex; 733 - void build_all_zonelists(void *data); 721 + void build_all_zonelists(pg_data_t *pgdat, struct zone *zone); 734 722 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx); 735 723 bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, 736 724 int classzone_idx, int alloc_flags); ··· 748 736 749 737 static inline struct zone *lruvec_zone(struct lruvec *lruvec) 750 738 { 751 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 739 + #ifdef CONFIG_MEMCG 752 740 return lruvec->zone; 753 741 #else 754 742 return container_of(lruvec, struct zone, lruvec); ··· 785 773 786 774 static inline int zone_movable_is_highmem(void) 787 775 { 788 - #if defined(CONFIG_HIGHMEM) && defined(CONFIG_HAVE_MEMBLOCK_NODE) 776 + #if defined(CONFIG_HIGHMEM) && defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) 789 777 return movable_zone == ZONE_HIGHMEM; 790 778 #else 791 779 return 0; ··· 1064 1052 1065 1053 /* See declaration of similar field in struct zone */ 1066 1054 unsigned long *pageblock_flags; 1067 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 1055 + #ifdef CONFIG_MEMCG 1068 1056 /* 1069 1057 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. 
We use 1070 1058 * section. (see memcontrol.h/page_cgroup.h about this.)
+2 -2
include/linux/nfs_fs.h
··· 473 473 unsigned long); 474 474 extern ssize_t nfs_file_direct_read(struct kiocb *iocb, 475 475 const struct iovec *iov, unsigned long nr_segs, 476 - loff_t pos); 476 + loff_t pos, bool uio); 477 477 extern ssize_t nfs_file_direct_write(struct kiocb *iocb, 478 478 const struct iovec *iov, unsigned long nr_segs, 479 - loff_t pos); 479 + loff_t pos, bool uio); 480 480 481 481 /* 482 482 * linux/fs/nfs/dir.c
+21
include/linux/oom.h
··· 40 40 CONSTRAINT_MEMCG, 41 41 }; 42 42 43 + enum oom_scan_t { 44 + OOM_SCAN_OK, /* scan thread and find its badness */ 45 + OOM_SCAN_CONTINUE, /* do not consider thread for oom kill */ 46 + OOM_SCAN_ABORT, /* abort the iteration and return */ 47 + OOM_SCAN_SELECT, /* always select this thread first */ 48 + }; 49 + 43 50 extern void compare_swap_oom_score_adj(int old_val, int new_val); 44 51 extern int test_set_oom_score_adj(int new_val); 45 52 46 53 extern unsigned long oom_badness(struct task_struct *p, 47 54 struct mem_cgroup *memcg, const nodemask_t *nodemask, 48 55 unsigned long totalpages); 56 + extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, 57 + unsigned int points, unsigned long totalpages, 58 + struct mem_cgroup *memcg, nodemask_t *nodemask, 59 + const char *message); 60 + 49 61 extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); 50 62 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags); 63 + 64 + extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 65 + int order, const nodemask_t *nodemask); 66 + 67 + extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, 68 + unsigned long totalpages, const nodemask_t *nodemask, 69 + bool force_kill); 70 + extern void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, 71 + int order); 51 72 52 73 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, 53 74 int order, nodemask_t *mask, bool force_kill);
+29
include/linux/page-flags.h
··· 7 7 8 8 #include <linux/types.h> 9 9 #include <linux/bug.h> 10 + #include <linux/mmdebug.h> 10 11 #ifndef __GENERATING_BOUNDS_H 11 12 #include <linux/mm_types.h> 12 13 #include <generated/bounds.h> ··· 453 452 return 0; 454 453 } 455 454 #endif 455 + 456 + /* 457 + * If network-based swap is enabled, sl*b must keep track of whether pages 458 + * were allocated from pfmemalloc reserves. 459 + */ 460 + static inline int PageSlabPfmemalloc(struct page *page) 461 + { 462 + VM_BUG_ON(!PageSlab(page)); 463 + return PageActive(page); 464 + } 465 + 466 + static inline void SetPageSlabPfmemalloc(struct page *page) 467 + { 468 + VM_BUG_ON(!PageSlab(page)); 469 + SetPageActive(page); 470 + } 471 + 472 + static inline void __ClearPageSlabPfmemalloc(struct page *page) 473 + { 474 + VM_BUG_ON(!PageSlab(page)); 475 + __ClearPageActive(page); 476 + } 477 + 478 + static inline void ClearPageSlabPfmemalloc(struct page *page) 479 + { 480 + VM_BUG_ON(!PageSlab(page)); 481 + ClearPageActive(page); 482 + } 456 483 457 484 #ifdef CONFIG_MMU 458 485 #define __PG_MLOCKED (1 << PG_mlocked)
+9 -4
include/linux/page-isolation.h
··· 1 1 #ifndef __LINUX_PAGEISOLATION_H 2 2 #define __LINUX_PAGEISOLATION_H 3 3 4 + 5 + bool has_unmovable_pages(struct zone *zone, struct page *page, int count); 6 + void set_pageblock_migratetype(struct page *page, int migratetype); 7 + int move_freepages_block(struct zone *zone, struct page *page, 8 + int migratetype); 4 9 /* 5 10 * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE. 6 11 * If specified range includes migrate types other than MOVABLE or CMA, ··· 15 10 * free all pages in the range. test_page_isolated() can be used for 16 11 * test it. 17 12 */ 18 - extern int 13 + int 19 14 start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, 20 15 unsigned migratetype); 21 16 ··· 23 18 * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. 24 19 * target range is [start_pfn, end_pfn) 25 20 */ 26 - extern int 21 + int 27 22 undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, 28 23 unsigned migratetype); 29 24 ··· 35 30 /* 36 31 * Internal functions. Changes pageblock's migrate type. 37 32 */ 38 - extern int set_migratetype_isolate(struct page *page); 39 - extern void unset_migratetype_isolate(struct page *page, unsigned migratetype); 33 + int set_migratetype_isolate(struct page *page); 34 + void unset_migratetype_isolate(struct page *page, unsigned migratetype); 40 35 41 36 42 37 #endif
+5 -5
include/linux/page_cgroup.h
··· 12 12 #ifndef __GENERATING_BOUNDS_H 13 13 #include <generated/bounds.h> 14 14 15 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 15 + #ifdef CONFIG_MEMCG 16 16 #include <linux/bit_spinlock.h> 17 17 18 18 /* ··· 82 82 bit_spin_unlock(PCG_LOCK, &pc->flags); 83 83 } 84 84 85 - #else /* CONFIG_CGROUP_MEM_RES_CTLR */ 85 + #else /* CONFIG_MEMCG */ 86 86 struct page_cgroup; 87 87 88 88 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat) ··· 102 102 { 103 103 } 104 104 105 - #endif /* CONFIG_CGROUP_MEM_RES_CTLR */ 105 + #endif /* CONFIG_MEMCG */ 106 106 107 107 #include <linux/swap.h> 108 108 109 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 109 + #ifdef CONFIG_MEMCG_SWAP 110 110 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, 111 111 unsigned short old, unsigned short new); 112 112 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); ··· 138 138 return; 139 139 } 140 140 141 - #endif /* CONFIG_CGROUP_MEM_RES_CTLR_SWAP */ 141 + #endif /* CONFIG_MEMCG_SWAP */ 142 142 143 143 #endif /* !__GENERATING_BOUNDS_H */ 144 144
+5
include/linux/pagemap.h
··· 286 286 return ((loff_t)page->index) << PAGE_CACHE_SHIFT; 287 287 } 288 288 289 + static inline loff_t page_file_offset(struct page *page) 290 + { 291 + return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT; 292 + } 293 + 289 294 extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, 290 295 unsigned long address); 291 296
+8 -1
include/linux/sched.h
··· 1584 1584 /* bitmask and counter of trace recursion */ 1585 1585 unsigned long trace_recursion; 1586 1586 #endif /* CONFIG_TRACING */ 1587 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */ 1587 + #ifdef CONFIG_MEMCG /* memcg uses this to do batch job */ 1588 1588 struct memcg_batch_info { 1589 1589 int do_batch; /* incremented when batch uncharge started */ 1590 1590 struct mem_cgroup *memcg; /* target memcg of uncharge */ ··· 1893 1893 } 1894 1894 1895 1895 #endif 1896 + 1897 + static inline void tsk_restore_flags(struct task_struct *task, 1898 + unsigned long orig_flags, unsigned long flags) 1899 + { 1900 + task->flags &= ~flags; 1901 + task->flags |= orig_flags & flags; 1902 + } 1896 1903 1897 1904 #ifdef CONFIG_SMP 1898 1905 extern void do_set_cpus_allowed(struct task_struct *p,
-1
include/linux/shrinker.h
··· 20 20 * 'nr_to_scan' entries and attempt to free them up. It should return 21 21 * the number of objects which remain in the cache. If it returns -1, it means 22 22 * it cannot do any scanning at this time (eg. there is a risk of deadlock). 23 - * The callback must not return -1 if nr_to_scan is zero. 24 23 * 25 24 * The 'gfpmask' refers to the allocation we are currently trying to 26 25 * fulfil.
+78 -2
include/linux/skbuff.h
··· 462 462 #ifdef CONFIG_IPV6_NDISC_NODETYPE 463 463 __u8 ndisc_nodetype:2; 464 464 #endif 465 + __u8 pfmemalloc:1; 465 466 __u8 ooo_okay:1; 466 467 __u8 l4_rxhash:1; 467 468 __u8 wifi_acked_valid:1; ··· 502 501 */ 503 502 #include <linux/slab.h> 504 503 504 + 505 + #define SKB_ALLOC_FCLONE 0x01 506 + #define SKB_ALLOC_RX 0x02 507 + 508 + /* Returns true if the skb was allocated from PFMEMALLOC reserves */ 509 + static inline bool skb_pfmemalloc(const struct sk_buff *skb) 510 + { 511 + return unlikely(skb->pfmemalloc); 512 + } 505 513 506 514 /* 507 515 * skb might have a dst pointer attached, refcounted or not. ··· 575 565 bool *fragstolen, int *delta_truesize); 576 566 577 567 extern struct sk_buff *__alloc_skb(unsigned int size, 578 - gfp_t priority, int fclone, int node); 568 + gfp_t priority, int flags, int node); 579 569 extern struct sk_buff *build_skb(void *data, unsigned int frag_size); 580 570 static inline struct sk_buff *alloc_skb(unsigned int size, 581 571 gfp_t priority) ··· 586 576 static inline struct sk_buff *alloc_skb_fclone(unsigned int size, 587 577 gfp_t priority) 588 578 { 589 - return __alloc_skb(size, priority, 1, NUMA_NO_NODE); 579 + return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE); 590 580 } 591 581 592 582 extern void skb_recycle(struct sk_buff *skb); ··· 1247 1237 { 1248 1238 skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; 1249 1239 1240 + /* 1241 + * Propagate page->pfmemalloc to the skb if we can. The problem is 1242 + * that not all callers have unique ownership of the page. If 1243 + * pfmemalloc is set, we check the mapping as a mapping implies 1244 + * page->index is set (index and pfmemalloc share space). 1245 + * If it's a valid mapping, we cannot use page->pfmemalloc but we 1246 + * do not lose pfmemalloc information as the pages would not be 1247 + * allocated using __GFP_MEMALLOC. 
1248 + */ 1249 + if (page->pfmemalloc && !page->mapping) 1250 + skb->pfmemalloc = true; 1250 1251 frag->page.p = page; 1251 1252 frag->page_offset = off; 1252 1253 skb_frag_size_set(frag, size); ··· 1772 1751 unsigned int length) 1773 1752 { 1774 1753 return __netdev_alloc_skb_ip_align(dev, length, GFP_ATOMIC); 1754 + } 1755 + 1756 + /** 1757 + * __skb_alloc_pages - allocate pages for ps-rx on a skb and preserve pfmemalloc data 1758 + * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX 1759 + * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used 1760 + * @order: size of the allocation 1761 + * 1762 + * Allocate a new page. 1763 + * 1764 + * %NULL is returned if there is no free memory. 1765 + */ 1766 + static inline struct page *__skb_alloc_pages(gfp_t gfp_mask, 1767 + struct sk_buff *skb, 1768 + unsigned int order) 1769 + { 1770 + struct page *page; 1771 + 1772 + gfp_mask |= __GFP_COLD; 1773 + 1774 + if (!(gfp_mask & __GFP_NOMEMALLOC)) 1775 + gfp_mask |= __GFP_MEMALLOC; 1776 + 1777 + page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order); 1778 + if (skb && page && page->pfmemalloc) 1779 + skb->pfmemalloc = true; 1780 + 1781 + return page; 1782 + } 1783 + 1784 + /** 1785 + * __skb_alloc_page - allocate a page for ps-rx for a given skb and preserve pfmemalloc data 1786 + * @gfp_mask: alloc_pages_node mask. Set __GFP_NOMEMALLOC if not for network packet RX 1787 + * @skb: skb to set pfmemalloc on if __GFP_MEMALLOC is used 1788 + * 1789 + * Allocate a new page. 1790 + * 1791 + * %NULL is returned if there is no free memory. 
1248 + */
1249 + if (page->pfmemalloc && !page->mapping)
1250 + skb->pfmemalloc = true;
1792 + */ 1793 + static inline struct page *__skb_alloc_page(gfp_t gfp_mask, 1794 + struct sk_buff *skb) 1795 + { 1796 + return __skb_alloc_pages(gfp_mask, skb, 0); 1797 + } 1798 + 1799 + /** 1800 + * skb_propagate_pfmemalloc - Propagate pfmemalloc if skb is allocated after RX page 1801 + * @page: The page that was allocated from skb_alloc_page 1802 + * @skb: The skb that may need pfmemalloc set 1803 + */ 1804 + static inline void skb_propagate_pfmemalloc(struct page *page, 1805 + struct sk_buff *skb) 1806 + { 1807 + if (page && page->pfmemalloc) 1808 + skb->pfmemalloc = true; 1775 1809 } 1776 1810 1777 1811 /**
+3
include/linux/sunrpc/xprt.h
··· 174 174 unsigned long state; /* transport state */ 175 175 unsigned char shutdown : 1, /* being shut down */ 176 176 resvport : 1; /* use a reserved port */ 177 + unsigned int swapper; /* we're swapping over this 178 + transport */ 177 179 unsigned int bind_index; /* bind function index */ 178 180 179 181 /* ··· 318 316 void xprt_disconnect_done(struct rpc_xprt *xprt); 319 317 void xprt_force_disconnect(struct rpc_xprt *xprt); 320 318 void xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie); 319 + int xs_swapper(struct rpc_xprt *xprt, int enable); 321 320 322 321 /* 323 322 * Reserved bit positions in xprt->state
+11 -3
include/linux/swap.h
··· 151 151 SWP_SOLIDSTATE = (1 << 4), /* blkdev seeks are cheap */ 152 152 SWP_CONTINUED = (1 << 5), /* swap_map has count continuation */ 153 153 SWP_BLKDEV = (1 << 6), /* its a block device */ 154 + SWP_FILE = (1 << 7), /* set after swap_activate success */ 154 155 /* add others here before... */ 155 156 SWP_SCANNING = (1 << 8), /* refcount in scan_swap_map */ 156 157 }; ··· 302 301 303 302 extern int kswapd_run(int nid); 304 303 extern void kswapd_stop(int nid); 305 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 304 + #ifdef CONFIG_MEMCG 306 305 extern int mem_cgroup_swappiness(struct mem_cgroup *mem); 307 306 #else 308 307 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) ··· 310 309 return vm_swappiness; 311 310 } 312 311 #endif 313 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 312 + #ifdef CONFIG_MEMCG_SWAP 314 313 extern void mem_cgroup_uncharge_swap(swp_entry_t ent); 315 314 #else 316 315 static inline void mem_cgroup_uncharge_swap(swp_entry_t ent) ··· 321 320 /* linux/mm/page_io.c */ 322 321 extern int swap_readpage(struct page *); 323 322 extern int swap_writepage(struct page *page, struct writeback_control *wbc); 323 + extern int swap_set_page_dirty(struct page *page); 324 324 extern void end_swap_bio_read(struct bio *bio, int err); 325 + 326 + int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, 327 + unsigned long nr_pages, sector_t start_block); 328 + int generic_swapfile_activate(struct swap_info_struct *, struct file *, 329 + sector_t *); 325 330 326 331 /* linux/mm/swap_state.c */ 327 332 extern struct address_space swapper_space; ··· 363 356 extern sector_t map_swap_page(struct page *, struct block_device **); 364 357 extern sector_t swapdev_block(int, pgoff_t); 365 358 extern int page_swapcount(struct page *); 359 + extern struct swap_info_struct *page_swap_info(struct page *); 366 360 extern int reuse_swap_page(struct page *); 367 361 extern int try_to_free_swap(struct page *); 368 362 struct backing_dev_info; 369 363 370 - 
#ifdef CONFIG_CGROUP_MEM_RES_CTLR 364 + #ifdef CONFIG_MEMCG 371 365 extern void 372 366 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout); 373 367 #else
+1
include/linux/vm_event_item.h
··· 30 30 FOR_ALL_ZONES(PGSTEAL_DIRECT), 31 31 FOR_ALL_ZONES(PGSCAN_KSWAPD), 32 32 FOR_ALL_ZONES(PGSCAN_DIRECT), 33 + PGSCAN_DIRECT_THROTTLE, 33 34 #ifdef CONFIG_NUMA 34 35 PGSCAN_ZONE_RECLAIM_FAILED, 35 36 #endif
-5
include/linux/vmstat.h
··· 179 179 #define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d) 180 180 #define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d)) 181 181 182 - static inline void zap_zone_vm_stats(struct zone *zone) 183 - { 184 - memset(zone->vm_stat, 0, sizeof(zone->vm_stat)); 185 - } 186 - 187 182 extern void inc_zone_state(struct zone *, enum zone_stat_item); 188 183 189 184 #ifdef CONFIG_SMP
-5
include/linux/writeback.h
··· 189 189 190 190 void account_page_redirty(struct page *page); 191 191 192 - /* pdflush.c */ 193 - extern int nr_pdflush_threads; /* Global so it can be exported to sysctl 194 - read-only. */ 195 - 196 - 197 192 #endif /* WRITEBACK_H */
+35 -5
include/net/sock.h
··· 621 621 SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */ 622 622 SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */ 623 623 SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */ 624 + SOCK_MEMALLOC, /* VM depends on this socket for swapping */ 624 625 SOCK_TIMESTAMPING_TX_HARDWARE, /* %SOF_TIMESTAMPING_TX_HARDWARE */ 625 626 SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SOF_TIMESTAMPING_TX_SOFTWARE */ 626 627 SOCK_TIMESTAMPING_RX_HARDWARE, /* %SOF_TIMESTAMPING_RX_HARDWARE */ ··· 657 656 static inline bool sock_flag(const struct sock *sk, enum sock_flags flag) 658 657 { 659 658 return test_bit(flag, &sk->sk_flags); 659 + } 660 + 661 + #ifdef CONFIG_NET 662 + extern struct static_key memalloc_socks; 663 + static inline int sk_memalloc_socks(void) 664 + { 665 + return static_key_false(&memalloc_socks); 666 + } 667 + #else 668 + 669 + static inline int sk_memalloc_socks(void) 670 + { 671 + return 0; 672 + } 673 + 674 + #endif 675 + 676 + static inline gfp_t sk_gfp_atomic(struct sock *sk, gfp_t gfp_mask) 677 + { 678 + return GFP_ATOMIC | (sk->sk_allocation & __GFP_MEMALLOC); 660 679 } 661 680 662 681 static inline void sk_acceptq_removed(struct sock *sk) ··· 754 733 return 0; 755 734 } 756 735 736 + extern int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb); 737 + 757 738 static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) 758 739 { 740 + if (sk_memalloc_socks() && skb_pfmemalloc(skb)) 741 + return __sk_backlog_rcv(sk, skb); 742 + 759 743 return sk->sk_backlog_rcv(sk, skb); 760 744 } 761 745 ··· 824 798 extern void sk_stream_wait_close(struct sock *sk, long timeo_p); 825 799 extern int sk_stream_error(struct sock *sk, int flags, int err); 826 800 extern void sk_stream_kill_queues(struct sock *sk); 801 + extern void sk_set_memalloc(struct sock *sk); 802 + extern void sk_clear_memalloc(struct sock *sk); 827 803 828 804 extern int sk_wait_data(struct sock *sk, long *timeo); 829 805 ··· 941 913 #ifdef SOCK_REFCNT_DEBUG 942 914 
atomic_t socks; 943 915 #endif 944 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 916 + #ifdef CONFIG_MEMCG_KMEM 945 917 /* 946 918 * cgroup specific init/deinit functions. Called once for all 947 919 * protocols that implement it, from cgroups populate function. ··· 1022 994 #define sk_refcnt_debug_release(sk) do { } while (0) 1023 995 #endif /* SOCK_REFCNT_DEBUG */ 1024 996 1025 - #if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_NET) 997 + #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET) 1026 998 extern struct static_key memcg_socket_limit_enabled; 1027 999 static inline struct cg_proto *parent_cg_proto(struct proto *proto, 1028 1000 struct cg_proto *cg_proto) ··· 1329 1301 __sk_mem_schedule(sk, size, SK_MEM_SEND); 1330 1302 } 1331 1303 1332 - static inline bool sk_rmem_schedule(struct sock *sk, int size) 1304 + static inline bool 1305 + sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, unsigned int size) 1333 1306 { 1334 1307 if (!sk_has_account(sk)) 1335 1308 return true; 1336 - return size <= sk->sk_forward_alloc || 1337 - __sk_mem_schedule(sk, size, SK_MEM_RECV); 1309 + return size <= sk->sk_forward_alloc || 1310 + __sk_mem_schedule(sk, size, SK_MEM_RECV) || 1311 + skb_pfmemalloc(skb); 1338 1312 } 1339 1313 1340 1314 static inline void sk_mem_reclaim(struct sock *sk)
+1
include/trace/events/gfpflags.h
··· 30 30 {(unsigned long)__GFP_COMP, "GFP_COMP"}, \ 31 31 {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \ 32 32 {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \ 33 + {(unsigned long)__GFP_MEMALLOC, "GFP_MEMALLOC"}, \ 33 34 {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \ 34 35 {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \ 35 36 {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
+22 -7
init/Kconfig
··· 686 686 This option enables controller independent resource accounting 687 687 infrastructure that works with cgroups. 688 688 689 - config CGROUP_MEM_RES_CTLR 689 + config MEMCG 690 690 bool "Memory Resource Controller for Control Groups" 691 691 depends on RESOURCE_COUNTERS 692 692 select MM_OWNER ··· 709 709 This config option also selects MM_OWNER config option, which 710 710 could in turn add some fork/exit overhead. 711 711 712 - config CGROUP_MEM_RES_CTLR_SWAP 712 + config MEMCG_SWAP 713 713 bool "Memory Resource Controller Swap Extension" 714 - depends on CGROUP_MEM_RES_CTLR && SWAP 714 + depends on MEMCG && SWAP 715 715 help 716 716 Add swap management feature to memory resource controller. When you 717 717 enable this, you can limit mem+swap usage per cgroup. In other words, ··· 726 726 if boot option "swapaccount=0" is set, swap will not be accounted. 727 727 Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page 728 728 size is 4096bytes, 512k per 1Gbytes of swap. 729 - config CGROUP_MEM_RES_CTLR_SWAP_ENABLED 729 + config MEMCG_SWAP_ENABLED 730 730 bool "Memory Resource Controller Swap Extension enabled by default" 731 - depends on CGROUP_MEM_RES_CTLR_SWAP 731 + depends on MEMCG_SWAP 732 732 default y 733 733 help 734 734 Memory Resource Controller Swap Extension comes with its price in ··· 739 739 For those who want to have the feature enabled by default should 740 740 select this option (if, for some reason, they need to disable it 741 741 then swapaccount=0 does the trick). 742 - config CGROUP_MEM_RES_CTLR_KMEM 742 + config MEMCG_KMEM 743 743 bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)" 744 - depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL 744 + depends on MEMCG && EXPERIMENTAL 745 745 default n 746 746 help 747 747 The Kernel Memory extension for Memory Resource Controller can limit ··· 750 750 Memory Controller, which are page-based, and can be swapped. 
Users of 751 751 the kmem extension can use it to guarantee that no group of processes 752 752 will ever exhaust kernel resources alone. 753 + 754 + config CGROUP_HUGETLB 755 + bool "HugeTLB Resource Controller for Control Groups" 756 + depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL 757 + default n 758 + help 759 + Provides a cgroup Resource Controller for HugeTLB pages. 760 + When you enable this, you can put a per cgroup limit on HugeTLB usage. 761 + The limit is enforced during page fault. Since HugeTLB doesn't 762 + support page reclaim, enforcing the limit at page fault time implies 763 + that the application will get a SIGBUS signal if it tries to access 764 + HugeTLB pages beyond its limit. This requires the application to know 765 + beforehand how many HugeTLB pages it would require for its use. The 766 + control group is tracked in the third page lru pointer. This means 767 + that we cannot use the controller with a huge page size of less than 3 pages. 753 768 754 769 config CGROUP_PERF 755 770 bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
+1 -1
init/main.c
··· 506 506 setup_per_cpu_areas(); 507 507 smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ 508 508 509 - build_all_zonelists(NULL); 509 + build_all_zonelists(NULL, NULL); 510 510 page_alloc_init(); 511 511 512 512 printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
+1 -1
kernel/cpu.c
··· 416 416 417 417 if (pgdat->node_zonelists->_zonerefs->zone == NULL) { 418 418 mutex_lock(&zonelists_mutex); 419 - build_all_zonelists(NULL); 419 + build_all_zonelists(NULL, NULL); 420 420 mutex_unlock(&zonelists_mutex); 421 421 } 422 422 #endif
+2 -4
kernel/fork.c
··· 381 381 struct file *file; 382 382 383 383 if (mpnt->vm_flags & VM_DONTCOPY) { 384 - long pages = vma_pages(mpnt); 385 - mm->total_vm -= pages; 386 384 vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file, 387 - -pages); 385 + -vma_pages(mpnt)); 388 386 continue; 389 387 } 390 388 charge = 0; ··· 1306 1308 #ifdef CONFIG_DEBUG_MUTEXES 1307 1309 p->blocked_on = NULL; /* not blocked yet */ 1308 1310 #endif 1309 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 1311 + #ifdef CONFIG_MEMCG 1310 1312 p->memcg_batch.do_batch = 0; 1311 1313 p->memcg_batch.memcg = NULL; 1312 1314 #endif
+9
kernel/softirq.c
··· 210 210 __u32 pending; 211 211 int max_restart = MAX_SOFTIRQ_RESTART; 212 212 int cpu; 213 + unsigned long old_flags = current->flags; 214 + 215 + /* 216 + * Mask out PF_MEMALLOC as the current task context is borrowed for the 217 + * softirq. A softirq handler such as network RX might set PF_MEMALLOC 218 + * again if the socket is related to swap. 219 + */ 220 + current->flags &= ~PF_MEMALLOC; 213 221 214 222 pending = local_softirq_pending(); 215 223 account_system_vtime(current); ··· 273 265 274 266 account_system_vtime(current); 275 267 __local_bh_enable(SOFTIRQ_OFFSET); 268 + tsk_restore_flags(current, old_flags, PF_MEMALLOC); 276 269 } 277 270 278 271 #ifndef __ARCH_HAS_DO_SOFTIRQ
+3 -5
kernel/sysctl.c
··· 1101 1101 .extra1 = &zero, 1102 1102 }, 1103 1103 { 1104 - .procname = "nr_pdflush_threads", 1105 - .data = &nr_pdflush_threads, 1106 - .maxlen = sizeof nr_pdflush_threads, 1107 - .mode = 0444 /* read-only*/, 1108 - .proc_handler = proc_dointvec, 1104 + .procname = "nr_pdflush_threads", 1105 + .mode = 0444 /* read-only */, 1106 + .proc_handler = pdflush_proc_obsolete, 1109 1107 }, 1110 1108 { 1111 1109 .procname = "swappiness",
+1 -1
kernel/sysctl_binary.c
··· 147 147 { CTL_INT, VM_DIRTY_RATIO, "dirty_ratio" }, 148 148 /* VM_DIRTY_WB_CS "dirty_writeback_centisecs" no longer used */ 149 149 /* VM_DIRTY_EXPIRE_CS "dirty_expire_centisecs" no longer used */ 150 - { CTL_INT, VM_NR_PDFLUSH_THREADS, "nr_pdflush_threads" }, 150 + /* VM_NR_PDFLUSH_THREADS "nr_pdflush_threads" no longer used */ 151 151 { CTL_INT, VM_OVERCOMMIT_RATIO, "overcommit_ratio" }, 152 152 /* VM_PAGEBUF unused */ 153 153 /* VM_HUGETLB_PAGES "nr_hugepages" no longer used */
+5
mm/Kconfig
··· 140 140 config NO_BOOTMEM 141 141 boolean 142 142 143 + config MEMORY_ISOLATION 144 + boolean 145 + 143 146 # eventually, we can have this option just 'select SPARSEMEM' 144 147 config MEMORY_HOTPLUG 145 148 bool "Allow for memory hot-add" 149 + select MEMORY_ISOLATION 146 150 depends on SPARSEMEM || X86_64_ACPI_NUMA 147 151 depends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUG 148 152 depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390) ··· 276 272 depends on MMU 277 273 depends on ARCH_SUPPORTS_MEMORY_FAILURE 278 274 bool "Enable recovery from hardware memory errors" 275 + select MEMORY_ISOLATION 279 276 help 280 277 Enables code to recover from some memory failures on systems 281 278 with MCA recovery. This allows a system to continue running
+5 -3
mm/Makefile
··· 15 15 maccess.o page_alloc.o page-writeback.o \ 16 16 readahead.o swap.o truncate.o vmscan.o shmem.o \ 17 17 prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ 18 - page_isolation.o mm_init.o mmu_context.o percpu.o \ 19 - compaction.o slab_common.o $(mmu-y) 18 + mm_init.o mmu_context.o percpu.o slab_common.o \ 19 + compaction.o $(mmu-y) 20 20 21 21 obj-y += init-mm.o 22 22 ··· 49 49 obj-$(CONFIG_MIGRATION) += migrate.o 50 50 obj-$(CONFIG_QUICKLIST) += quicklist.o 51 51 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o 52 - obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o 52 + obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o 53 + obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o 53 54 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o 54 55 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o 55 56 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o 56 57 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o 57 58 obj-$(CONFIG_CLEANCACHE) += cleancache.o 59 + obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
+20
mm/backing-dev.c
··· 886 886 return ret; 887 887 } 888 888 EXPORT_SYMBOL(wait_iff_congested); 889 + 890 + int pdflush_proc_obsolete(struct ctl_table *table, int write, 891 + void __user *buffer, size_t *lenp, loff_t *ppos) 892 + { 893 + char kbuf[] = "0\n"; 894 + 895 + if (*ppos) { 896 + *lenp = 0; 897 + return 0; 898 + } 899 + 900 + if (copy_to_user(buffer, kbuf, sizeof(kbuf))) 901 + return -EFAULT; 902 + printk_once(KERN_WARNING "%s exported in /proc is scheduled for removal\n", 903 + table->procname); 904 + 905 + *lenp = 2; 906 + *ppos += *lenp; 907 + return 2; 908 + }
+58 -5
mm/compaction.c
··· 422 422 pfn -= pageblock_nr_pages) { 423 423 unsigned long isolated; 424 424 425 + /* 426 + * Skip ahead if another thread is compacting in the area 427 + * simultaneously. If we wrapped around, we can only skip 428 + * ahead if zone->compact_cached_free_pfn also wrapped to 429 + * above our starting point. 430 + */ 431 + if (cc->order > 0 && (!cc->wrapped || 432 + zone->compact_cached_free_pfn > 433 + cc->start_free_pfn)) 434 + pfn = min(pfn, zone->compact_cached_free_pfn); 435 + 425 436 if (!pfn_valid(pfn)) 426 437 continue; 427 438 ··· 472 461 * looking for free pages, the search will restart here as 473 462 * page migration may have returned some pages to the allocator 474 463 */ 475 - if (isolated) 464 + if (isolated) { 476 465 high_pfn = max(high_pfn, pfn); 466 + if (cc->order > 0) 467 + zone->compact_cached_free_pfn = high_pfn; 468 + } 477 469 } 478 470 479 471 /* split_free_page does not map the pages */ ··· 570 556 return ISOLATE_SUCCESS; 571 557 } 572 558 559 + /* 560 + * Returns the start pfn of the last page block in a zone. This is the starting 561 + * point for full compaction of a zone. Compaction searches for free pages from 562 + * the end of each zone, while isolate_freepages_block scans forward inside each 563 + * page block. 564 + */ 565 + static unsigned long start_free_pfn(struct zone *zone) 566 + { 567 + unsigned long free_pfn; 568 + free_pfn = zone->zone_start_pfn + zone->spanned_pages; 569 + free_pfn &= ~(pageblock_nr_pages-1); 570 + return free_pfn; 571 + } 572 + 573 573 static int compact_finished(struct zone *zone, 574 574 struct compact_control *cc) 575 575 { ··· 593 565 if (fatal_signal_pending(current)) 594 566 return COMPACT_PARTIAL; 595 567 596 - /* Compaction run completes if the migrate and free scanner meet */ 597 - if (cc->free_pfn <= cc->migrate_pfn) 568 + /* 569 + * A full (order == -1) compaction run starts at the beginning and 570 + * end of a zone; it completes when the migrate and free scanner meet. 
571 + * A partial (order > 0) compaction can start with the free scanner 572 + * at a random point in the zone, and may have to restart. 573 + */ 574 + if (cc->free_pfn <= cc->migrate_pfn) { 575 + if (cc->order > 0 && !cc->wrapped) { 576 + /* We started partway through; restart at the end. */ 577 + unsigned long free_pfn = start_free_pfn(zone); 578 + zone->compact_cached_free_pfn = free_pfn; 579 + cc->free_pfn = free_pfn; 580 + cc->wrapped = 1; 581 + return COMPACT_CONTINUE; 582 + } 583 + return COMPACT_COMPLETE; 584 + } 585 + 586 + /* We wrapped around and ended up where we started. */ 587 + if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn) 598 588 return COMPACT_COMPLETE; 599 589 600 590 /* ··· 710 664 711 665 /* Setup to move all movable pages to the end of the zone */ 712 666 cc->migrate_pfn = zone->zone_start_pfn; 713 - cc->free_pfn = cc->migrate_pfn + zone->spanned_pages; 714 - cc->free_pfn &= ~(pageblock_nr_pages-1); 667 + 668 + if (cc->order > 0) { 669 + /* Incremental compaction. Start where the last one stopped. */ 670 + cc->free_pfn = zone->compact_cached_free_pfn; 671 + cc->start_free_pfn = cc->free_pfn; 672 + } else { 673 + /* Order == -1 starts at the end of the zone. */ 674 + cc->free_pfn = start_free_pfn(zone); 675 + } 715 676 716 677 migrate_prep_local(); 717 678
+7 -11
mm/fadvise.c
··· 93 93 spin_unlock(&file->f_lock); 94 94 break; 95 95 case POSIX_FADV_WILLNEED: 96 - if (!mapping->a_ops->readpage) { 97 - ret = -EINVAL; 98 - break; 99 - } 100 - 101 96 /* First and last PARTIAL page! */ 102 97 start_index = offset >> PAGE_CACHE_SHIFT; 103 98 end_index = endbyte >> PAGE_CACHE_SHIFT; ··· 101 106 nrpages = end_index - start_index + 1; 102 107 if (!nrpages) 103 108 nrpages = ~0UL; 104 - 105 - ret = force_page_cache_readahead(mapping, file, 106 - start_index, 107 - nrpages); 108 - if (ret > 0) 109 - ret = 0; 109 + 110 + /* 111 + * Ignore the return value because fadvise() shall return 112 + * success even if the filesystem can't retrieve a hint. 113 + */ 114 + force_page_cache_readahead(mapping, file, start_index, 115 + nrpages); 110 116 break; 111 117 case POSIX_FADV_NOREUSE: 112 118 break;
+12
mm/highmem.c
··· 94 94 do { spin_unlock(&kmap_lock); (void)(flags); } while (0) 95 95 #endif 96 96 97 + struct page *kmap_to_page(void *vaddr) 98 + { 99 + unsigned long addr = (unsigned long)vaddr; 100 + 101 + if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) { 102 + int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT; 103 + return pte_page(pkmap_page_table[i]); 104 + } 105 + 106 + return virt_to_page(addr); 107 + } 108 + 97 109 static void flush_all_zero_pkmaps(void) 98 110 { 99 111 int i;
+136 -59
mm/hugetlb.c
··· 24 24 25 25 #include <asm/page.h> 26 26 #include <asm/pgtable.h> 27 - #include <linux/io.h> 27 + #include <asm/tlb.h> 28 28 29 + #include <linux/io.h> 29 30 #include <linux/hugetlb.h> 31 + #include <linux/hugetlb_cgroup.h> 30 32 #include <linux/node.h> 33 + #include <linux/hugetlb_cgroup.h> 31 34 #include "internal.h" 32 35 33 36 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL; 34 37 static gfp_t htlb_alloc_mask = GFP_HIGHUSER; 35 38 unsigned long hugepages_treat_as_movable; 36 39 37 - static int max_hstate; 40 + int hugetlb_max_hstate __read_mostly; 38 41 unsigned int default_hstate_idx; 39 42 struct hstate hstates[HUGE_MAX_HSTATE]; 40 43 ··· 48 45 static unsigned long __initdata default_hstate_max_huge_pages; 49 46 static unsigned long __initdata default_hstate_size; 50 47 51 - #define for_each_hstate(h) \ 52 - for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++) 53 - 54 48 /* 55 49 * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages 56 50 */ 57 - static DEFINE_SPINLOCK(hugetlb_lock); 51 + DEFINE_SPINLOCK(hugetlb_lock); 58 52 59 53 static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) 60 54 { ··· 509 509 static void enqueue_huge_page(struct hstate *h, struct page *page) 510 510 { 511 511 int nid = page_to_nid(page); 512 - list_add(&page->lru, &h->hugepage_freelists[nid]); 512 + list_move(&page->lru, &h->hugepage_freelists[nid]); 513 513 h->free_huge_pages++; 514 514 h->free_huge_pages_node[nid]++; 515 515 } ··· 521 521 if (list_empty(&h->hugepage_freelists[nid])) 522 522 return NULL; 523 523 page = list_entry(h->hugepage_freelists[nid].next, struct page, lru); 524 - list_del(&page->lru); 524 + list_move(&page->lru, &h->hugepage_activelist); 525 525 set_page_refcounted(page); 526 526 h->free_huge_pages--; 527 527 h->free_huge_pages_node[nid]--; ··· 593 593 1 << PG_active | 1 << PG_reserved | 594 594 1 << PG_private | 1 << PG_writeback); 595 595 } 596 + VM_BUG_ON(hugetlb_cgroup_from_page(page)); 
596 597 set_compound_page_dtor(page, NULL); 597 598 set_page_refcounted(page); 598 599 arch_release_hugepage(page); ··· 626 625 page->mapping = NULL; 627 626 BUG_ON(page_count(page)); 628 627 BUG_ON(page_mapcount(page)); 629 - INIT_LIST_HEAD(&page->lru); 630 628 631 629 spin_lock(&hugetlb_lock); 630 + hugetlb_cgroup_uncharge_page(hstate_index(h), 631 + pages_per_huge_page(h), page); 632 632 if (h->surplus_huge_pages_node[nid] && huge_page_order(h) < MAX_ORDER) { 633 + /* remove the page from active list */ 634 + list_del(&page->lru); 633 635 update_and_free_page(h, page); 634 636 h->surplus_huge_pages--; 635 637 h->surplus_huge_pages_node[nid]--; ··· 645 641 646 642 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) 647 643 { 644 + INIT_LIST_HEAD(&page->lru); 648 645 set_compound_page_dtor(page, free_huge_page); 649 646 spin_lock(&hugetlb_lock); 647 + set_hugetlb_cgroup(page, NULL); 650 648 h->nr_huge_pages++; 651 649 h->nr_huge_pages_node[nid]++; 652 650 spin_unlock(&hugetlb_lock); ··· 895 889 896 890 spin_lock(&hugetlb_lock); 897 891 if (page) { 892 + INIT_LIST_HEAD(&page->lru); 898 893 r_nid = page_to_nid(page); 899 894 set_compound_page_dtor(page, free_huge_page); 895 + set_hugetlb_cgroup(page, NULL); 900 896 /* 901 897 * We incremented the global counters already 902 898 */ ··· 1001 993 list_for_each_entry_safe(page, tmp, &surplus_list, lru) { 1002 994 if ((--needed) < 0) 1003 995 break; 1004 - list_del(&page->lru); 1005 996 /* 1006 997 * This page is now managed by the hugetlb allocator and has 1007 998 * no users -- drop the buddy allocator's reference. 
··· 1015 1008 /* Free unnecessary surplus pages to the buddy allocator */ 1016 1009 if (!list_empty(&surplus_list)) { 1017 1010 list_for_each_entry_safe(page, tmp, &surplus_list, lru) { 1018 - list_del(&page->lru); 1019 1011 put_page(page); 1020 1012 } 1021 1013 } ··· 1118 1112 struct hstate *h = hstate_vma(vma); 1119 1113 struct page *page; 1120 1114 long chg; 1115 + int ret, idx; 1116 + struct hugetlb_cgroup *h_cg; 1121 1117 1118 + idx = hstate_index(h); 1122 1119 /* 1123 1120 * Processes that did not create the mapping will have no 1124 1121 * reserves and will not have accounted against subpool ··· 1132 1123 */ 1133 1124 chg = vma_needs_reservation(h, vma, addr); 1134 1125 if (chg < 0) 1135 - return ERR_PTR(-VM_FAULT_OOM); 1126 + return ERR_PTR(-ENOMEM); 1136 1127 if (chg) 1137 1128 if (hugepage_subpool_get_pages(spool, chg)) 1138 - return ERR_PTR(-VM_FAULT_SIGBUS); 1129 + return ERR_PTR(-ENOSPC); 1139 1130 1131 + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); 1132 + if (ret) { 1133 + hugepage_subpool_put_pages(spool, chg); 1134 + return ERR_PTR(-ENOSPC); 1135 + } 1140 1136 spin_lock(&hugetlb_lock); 1141 1137 page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve); 1142 - spin_unlock(&hugetlb_lock); 1143 - 1144 - if (!page) { 1138 + if (page) { 1139 + /* update page cgroup details */ 1140 + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), 1141 + h_cg, page); 1142 + spin_unlock(&hugetlb_lock); 1143 + } else { 1144 + spin_unlock(&hugetlb_lock); 1145 1145 page = alloc_buddy_huge_page(h, NUMA_NO_NODE); 1146 1146 if (!page) { 1147 + hugetlb_cgroup_uncharge_cgroup(idx, 1148 + pages_per_huge_page(h), 1149 + h_cg); 1147 1150 hugepage_subpool_put_pages(spool, chg); 1148 - return ERR_PTR(-VM_FAULT_SIGBUS); 1151 + return ERR_PTR(-ENOSPC); 1149 1152 } 1153 + spin_lock(&hugetlb_lock); 1154 + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), 1155 + h_cg, page); 1156 + list_move(&page->lru, &h->hugepage_activelist); 1157 + 
spin_unlock(&hugetlb_lock); 1150 1158 } 1151 1159 1152 1160 set_page_private(page, (unsigned long)spool); 1153 1161 1154 1162 vma_commit_reservation(h, vma, addr); 1155 - 1156 1163 return page; 1157 1164 } 1158 1165 ··· 1671 1646 struct attribute_group *hstate_attr_group) 1672 1647 { 1673 1648 int retval; 1674 - int hi = h - hstates; 1649 + int hi = hstate_index(h); 1675 1650 1676 1651 hstate_kobjs[hi] = kobject_create_and_add(h->name, parent); 1677 1652 if (!hstate_kobjs[hi]) ··· 1766 1741 if (!nhs->hugepages_kobj) 1767 1742 return; /* no hstate attributes */ 1768 1743 1769 - for_each_hstate(h) 1770 - if (nhs->hstate_kobjs[h - hstates]) { 1771 - kobject_put(nhs->hstate_kobjs[h - hstates]); 1772 - nhs->hstate_kobjs[h - hstates] = NULL; 1744 + for_each_hstate(h) { 1745 + int idx = hstate_index(h); 1746 + if (nhs->hstate_kobjs[idx]) { 1747 + kobject_put(nhs->hstate_kobjs[idx]); 1748 + nhs->hstate_kobjs[idx] = NULL; 1773 1749 } 1750 + } 1774 1751 1775 1752 kobject_put(nhs->hugepages_kobj); 1776 1753 nhs->hugepages_kobj = NULL; ··· 1875 1848 hugetlb_unregister_all_nodes(); 1876 1849 1877 1850 for_each_hstate(h) { 1878 - kobject_put(hstate_kobjs[h - hstates]); 1851 + kobject_put(hstate_kobjs[hstate_index(h)]); 1879 1852 } 1880 1853 1881 1854 kobject_put(hugepages_kobj); ··· 1896 1869 if (!size_to_hstate(default_hstate_size)) 1897 1870 hugetlb_add_hstate(HUGETLB_PAGE_ORDER); 1898 1871 } 1899 - default_hstate_idx = size_to_hstate(default_hstate_size) - hstates; 1872 + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size)); 1900 1873 if (default_hstate_max_huge_pages) 1901 1874 default_hstate.max_huge_pages = default_hstate_max_huge_pages; 1902 1875 ··· 1924 1897 printk(KERN_WARNING "hugepagesz= specified twice, ignoring\n"); 1925 1898 return; 1926 1899 } 1927 - BUG_ON(max_hstate >= HUGE_MAX_HSTATE); 1900 + BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); 1928 1901 BUG_ON(order == 0); 1929 - h = &hstates[max_hstate++]; 1902 + h = 
&hstates[hugetlb_max_hstate++]; 1930 1903 h->order = order; 1931 1904 h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1); 1932 1905 h->nr_huge_pages = 0; 1933 1906 h->free_huge_pages = 0; 1934 1907 for (i = 0; i < MAX_NUMNODES; ++i) 1935 1908 INIT_LIST_HEAD(&h->hugepage_freelists[i]); 1909 + INIT_LIST_HEAD(&h->hugepage_activelist); 1936 1910 h->next_nid_to_alloc = first_node(node_states[N_HIGH_MEMORY]); 1937 1911 h->next_nid_to_free = first_node(node_states[N_HIGH_MEMORY]); 1938 1912 snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", 1939 1913 huge_page_size(h)/1024); 1914 + /* 1915 + * Add cgroup control files only if the huge page consists 1916 + * of more than two normal pages. This is because we use 1917 + * page[2].lru.next for storing cgoup details. 1918 + */ 1919 + if (order >= HUGETLB_CGROUP_MIN_ORDER) 1920 + hugetlb_cgroup_file_init(hugetlb_max_hstate - 1); 1940 1921 1941 1922 parsed_hstate = h; 1942 1923 } ··· 1955 1920 static unsigned long *last_mhp; 1956 1921 1957 1922 /* 1958 - * !max_hstate means we haven't parsed a hugepagesz= parameter yet, 1923 + * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter yet, 1959 1924 * so this hugepages= parameter goes to the "default hstate". 1960 1925 */ 1961 - if (!max_hstate) 1926 + if (!hugetlb_max_hstate) 1962 1927 mhp = &default_hstate_max_huge_pages; 1963 1928 else 1964 1929 mhp = &parsed_hstate->max_huge_pages; ··· 1977 1942 * But we need to allocate >= MAX_ORDER hstates here early to still 1978 1943 * use the bootmem allocator. 
1979 1944 */ 1980 - if (max_hstate && parsed_hstate->order >= MAX_ORDER) 1945 + if (hugetlb_max_hstate && parsed_hstate->order >= MAX_ORDER) 1981 1946 hugetlb_hstate_alloc_pages(parsed_hstate); 1982 1947 1983 1948 last_mhp = mhp; ··· 2343 2308 return 0; 2344 2309 } 2345 2310 2346 - void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, 2347 - unsigned long end, struct page *ref_page) 2311 + void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, 2312 + unsigned long start, unsigned long end, 2313 + struct page *ref_page) 2348 2314 { 2315 + int force_flush = 0; 2349 2316 struct mm_struct *mm = vma->vm_mm; 2350 2317 unsigned long address; 2351 2318 pte_t *ptep; 2352 2319 pte_t pte; 2353 2320 struct page *page; 2354 - struct page *tmp; 2355 2321 struct hstate *h = hstate_vma(vma); 2356 2322 unsigned long sz = huge_page_size(h); 2357 - 2358 - /* 2359 - * A page gathering list, protected by per file i_mmap_mutex. The 2360 - * lock is used to avoid list corruption from multiple unmapping 2361 - * of the same page since we are using page->lru. 
2362 - */ 2363 - LIST_HEAD(page_list); 2364 2323 2365 2324 WARN_ON(!is_vm_hugetlb_page(vma)); 2366 2325 BUG_ON(start & ~huge_page_mask(h)); 2367 2326 BUG_ON(end & ~huge_page_mask(h)); 2368 2327 2328 + tlb_start_vma(tlb, vma); 2369 2329 mmu_notifier_invalidate_range_start(mm, start, end); 2330 + again: 2370 2331 spin_lock(&mm->page_table_lock); 2371 2332 for (address = start; address < end; address += sz) { 2372 2333 ptep = huge_pte_offset(mm, address); ··· 2401 2370 } 2402 2371 2403 2372 pte = huge_ptep_get_and_clear(mm, address, ptep); 2373 + tlb_remove_tlb_entry(tlb, ptep, address); 2404 2374 if (pte_dirty(pte)) 2405 2375 set_page_dirty(page); 2406 - list_add(&page->lru, &page_list); 2407 2376 2377 + page_remove_rmap(page); 2378 + force_flush = !__tlb_remove_page(tlb, page); 2379 + if (force_flush) 2380 + break; 2408 2381 /* Bail out after unmapping reference page if supplied */ 2409 2382 if (ref_page) 2410 2383 break; 2411 2384 } 2412 - flush_tlb_range(vma, start, end); 2413 2385 spin_unlock(&mm->page_table_lock); 2414 - mmu_notifier_invalidate_range_end(mm, start, end); 2415 - list_for_each_entry_safe(page, tmp, &page_list, lru) { 2416 - page_remove_rmap(page); 2417 - list_del(&page->lru); 2418 - put_page(page); 2386 + /* 2387 + * mmu_gather ran out of room to batch pages, we break out of 2388 + * the PTE lock to avoid doing the potential expensive TLB invalidate 2389 + * and page-free while holding it. 
2390 + */ 2391 + if (force_flush) { 2392 + force_flush = 0; 2393 + tlb_flush_mmu(tlb); 2394 + if (address < end && !ref_page) 2395 + goto again; 2419 2396 } 2397 + mmu_notifier_invalidate_range_end(mm, start, end); 2398 + tlb_end_vma(tlb, vma); 2399 + } 2400 + 2401 + void __unmap_hugepage_range_final(struct mmu_gather *tlb, 2402 + struct vm_area_struct *vma, unsigned long start, 2403 + unsigned long end, struct page *ref_page) 2404 + { 2405 + __unmap_hugepage_range(tlb, vma, start, end, ref_page); 2406 + 2407 + /* 2408 + * Clear this flag so that x86's huge_pmd_share page_table_shareable 2409 + * test will fail on a vma being torn down, and not grab a page table 2410 + * on its way out. We're lucky that the flag has such an appropriate 2411 + * name, and can in fact be safely cleared here. We could clear it 2412 + * before the __unmap_hugepage_range above, but all that's necessary 2413 + * is to clear it before releasing the i_mmap_mutex. This works 2414 + * because in the context this is called, the VMA is about to be 2415 + * destroyed and the i_mmap_mutex is held. 2416 + */ 2417 + vma->vm_flags &= ~VM_MAYSHARE; 2420 2418 } 2421 2419 2422 2420 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, 2423 2421 unsigned long end, struct page *ref_page) 2424 2422 { 2425 - mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex); 2426 - __unmap_hugepage_range(vma, start, end, ref_page); 2427 - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); 2423 + struct mm_struct *mm; 2424 + struct mmu_gather tlb; 2425 + 2426 + mm = vma->vm_mm; 2427 + 2428 + tlb_gather_mmu(&tlb, mm, 0); 2429 + __unmap_hugepage_range(&tlb, vma, start, end, ref_page); 2430 + tlb_finish_mmu(&tlb, start, end); 2428 2431 } 2429 2432 2430 2433 /* ··· 2503 2438 * from the time of fork. 
This would look like data corruption 2504 2439 */ 2505 2440 if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER)) 2506 - __unmap_hugepage_range(iter_vma, 2507 - address, address + huge_page_size(h), 2508 - page); 2441 + unmap_hugepage_range(iter_vma, address, 2442 + address + huge_page_size(h), page); 2509 2443 } 2510 2444 mutex_unlock(&mapping->i_mmap_mutex); 2511 2445 ··· 2560 2496 new_page = alloc_huge_page(vma, address, outside_reserve); 2561 2497 2562 2498 if (IS_ERR(new_page)) { 2499 + long err = PTR_ERR(new_page); 2563 2500 page_cache_release(old_page); 2564 2501 2565 2502 /* ··· 2589 2524 2590 2525 /* Caller expects lock to be held */ 2591 2526 spin_lock(&mm->page_table_lock); 2592 - return -PTR_ERR(new_page); 2527 + if (err == -ENOMEM) 2528 + return VM_FAULT_OOM; 2529 + else 2530 + return VM_FAULT_SIGBUS; 2593 2531 } 2594 2532 2595 2533 /* ··· 2710 2642 goto out; 2711 2643 page = alloc_huge_page(vma, address, 0); 2712 2644 if (IS_ERR(page)) { 2713 - ret = -PTR_ERR(page); 2645 + ret = PTR_ERR(page); 2646 + if (ret == -ENOMEM) 2647 + ret = VM_FAULT_OOM; 2648 + else 2649 + ret = VM_FAULT_SIGBUS; 2714 2650 goto out; 2715 2651 } 2716 2652 clear_huge_page(page, address, pages_per_huge_page(h)); ··· 2751 2679 */ 2752 2680 if (unlikely(PageHWPoison(page))) { 2753 2681 ret = VM_FAULT_HWPOISON | 2754 - VM_FAULT_SET_HINDEX(h - hstates); 2682 + VM_FAULT_SET_HINDEX(hstate_index(h)); 2755 2683 goto backout_unlocked; 2756 2684 } 2757 2685 } ··· 2824 2752 return 0; 2825 2753 } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) 2826 2754 return VM_FAULT_HWPOISON_LARGE | 2827 - VM_FAULT_SET_HINDEX(h - hstates); 2755 + VM_FAULT_SET_HINDEX(hstate_index(h)); 2828 2756 } 2829 2757 2830 2758 ptep = huge_pte_alloc(mm, address, huge_page_size(h)); ··· 3031 2959 } 3032 2960 } 3033 2961 spin_unlock(&mm->page_table_lock); 3034 - mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); 3035 - 2962 + /* 2963 + * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare 2964 + * 
may have cleared our pud entry and done put_page on the page table: 2965 + * once we release i_mmap_mutex, another task can do the final put_page 2966 + * and that page table be reused and filled with junk. 2967 + */ 3036 2968 flush_tlb_range(vma, start, end); 2969 + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); 3037 2970 } 3038 2971 3039 2972 int hugetlb_reserve_pages(struct inode *inode,
+418
mm/hugetlb_cgroup.c
··· 1 + /* 2 + * 3 + * Copyright IBM Corporation, 2012 4 + * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> 5 + * 6 + * This program is free software; you can redistribute it and/or modify it 7 + * under the terms of version 2.1 of the GNU Lesser General Public License 8 + * as published by the Free Software Foundation. 9 + * 10 + * This program is distributed in the hope that it would be useful, but 11 + * WITHOUT ANY WARRANTY; without even the implied warranty of 12 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 13 + * 14 + */ 15 + 16 + #include <linux/cgroup.h> 17 + #include <linux/slab.h> 18 + #include <linux/hugetlb.h> 19 + #include <linux/hugetlb_cgroup.h> 20 + 21 + struct hugetlb_cgroup { 22 + struct cgroup_subsys_state css; 23 + /* 24 + * the counter to account for hugepages from hugetlb. 25 + */ 26 + struct res_counter hugepage[HUGE_MAX_HSTATE]; 27 + }; 28 + 29 + #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) 30 + #define MEMFILE_IDX(val) (((val) >> 16) & 0xffff) 31 + #define MEMFILE_ATTR(val) ((val) & 0xffff) 32 + 33 + struct cgroup_subsys hugetlb_subsys __read_mostly; 34 + static struct hugetlb_cgroup *root_h_cgroup __read_mostly; 35 + 36 + static inline 37 + struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s) 38 + { 39 + return container_of(s, struct hugetlb_cgroup, css); 40 + } 41 + 42 + static inline 43 + struct hugetlb_cgroup *hugetlb_cgroup_from_cgroup(struct cgroup *cgroup) 44 + { 45 + return hugetlb_cgroup_from_css(cgroup_subsys_state(cgroup, 46 + hugetlb_subsys_id)); 47 + } 48 + 49 + static inline 50 + struct hugetlb_cgroup *hugetlb_cgroup_from_task(struct task_struct *task) 51 + { 52 + return hugetlb_cgroup_from_css(task_subsys_state(task, 53 + hugetlb_subsys_id)); 54 + } 55 + 56 + static inline bool hugetlb_cgroup_is_root(struct hugetlb_cgroup *h_cg) 57 + { 58 + return (h_cg == root_h_cgroup); 59 + } 60 + 61 + static inline struct hugetlb_cgroup *parent_hugetlb_cgroup(struct cgroup *cg) 62 + 
{ 63 + if (!cg->parent) 64 + return NULL; 65 + return hugetlb_cgroup_from_cgroup(cg->parent); 66 + } 67 + 68 + static inline bool hugetlb_cgroup_have_usage(struct cgroup *cg) 69 + { 70 + int idx; 71 + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cg); 72 + 73 + for (idx = 0; idx < hugetlb_max_hstate; idx++) { 74 + if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0) 75 + return true; 76 + } 77 + return false; 78 + } 79 + 80 + static struct cgroup_subsys_state *hugetlb_cgroup_create(struct cgroup *cgroup) 81 + { 82 + int idx; 83 + struct cgroup *parent_cgroup; 84 + struct hugetlb_cgroup *h_cgroup, *parent_h_cgroup; 85 + 86 + h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL); 87 + if (!h_cgroup) 88 + return ERR_PTR(-ENOMEM); 89 + 90 + parent_cgroup = cgroup->parent; 91 + if (parent_cgroup) { 92 + parent_h_cgroup = hugetlb_cgroup_from_cgroup(parent_cgroup); 93 + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) 94 + res_counter_init(&h_cgroup->hugepage[idx], 95 + &parent_h_cgroup->hugepage[idx]); 96 + } else { 97 + root_h_cgroup = h_cgroup; 98 + for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) 99 + res_counter_init(&h_cgroup->hugepage[idx], NULL); 100 + } 101 + return &h_cgroup->css; 102 + } 103 + 104 + static void hugetlb_cgroup_destroy(struct cgroup *cgroup) 105 + { 106 + struct hugetlb_cgroup *h_cgroup; 107 + 108 + h_cgroup = hugetlb_cgroup_from_cgroup(cgroup); 109 + kfree(h_cgroup); 110 + } 111 + 112 + 113 + /* 114 + * Should be called with hugetlb_lock held. 115 + * Since we are holding hugetlb_lock, pages cannot get moved from 116 + * active list or uncharged from the cgroup, So no need to get 117 + * page reference and test for page active here. This function 118 + * cannot fail. 
119 + */ 120 + static void hugetlb_cgroup_move_parent(int idx, struct cgroup *cgroup, 121 + struct page *page) 122 + { 123 + int csize; 124 + struct res_counter *counter; 125 + struct res_counter *fail_res; 126 + struct hugetlb_cgroup *page_hcg; 127 + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup); 128 + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(cgroup); 129 + 130 + page_hcg = hugetlb_cgroup_from_page(page); 131 + /* 132 + * We can have pages in active list without any cgroup 133 + * ie, hugepage with less than 3 pages. We can safely 134 + * ignore those pages. 135 + */ 136 + if (!page_hcg || page_hcg != h_cg) 137 + goto out; 138 + 139 + csize = PAGE_SIZE << compound_order(page); 140 + if (!parent) { 141 + parent = root_h_cgroup; 142 + /* root has no limit */ 143 + res_counter_charge_nofail(&parent->hugepage[idx], 144 + csize, &fail_res); 145 + } 146 + counter = &h_cg->hugepage[idx]; 147 + res_counter_uncharge_until(counter, counter->parent, csize); 148 + 149 + set_hugetlb_cgroup(page, parent); 150 + out: 151 + return; 152 + } 153 + 154 + /* 155 + * Force the hugetlb cgroup to empty the hugetlb resources by moving them to 156 + * the parent cgroup. 
157 + */ 158 + static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup) 159 + { 160 + struct hstate *h; 161 + struct page *page; 162 + int ret = 0, idx = 0; 163 + 164 + do { 165 + if (cgroup_task_count(cgroup) || 166 + !list_empty(&cgroup->children)) { 167 + ret = -EBUSY; 168 + goto out; 169 + } 170 + for_each_hstate(h) { 171 + spin_lock(&hugetlb_lock); 172 + list_for_each_entry(page, &h->hugepage_activelist, lru) 173 + hugetlb_cgroup_move_parent(idx, cgroup, page); 174 + 175 + spin_unlock(&hugetlb_lock); 176 + idx++; 177 + } 178 + cond_resched(); 179 + } while (hugetlb_cgroup_have_usage(cgroup)); 180 + out: 181 + return ret; 182 + } 183 + 184 + int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, 185 + struct hugetlb_cgroup **ptr) 186 + { 187 + int ret = 0; 188 + struct res_counter *fail_res; 189 + struct hugetlb_cgroup *h_cg = NULL; 190 + unsigned long csize = nr_pages * PAGE_SIZE; 191 + 192 + if (hugetlb_cgroup_disabled()) 193 + goto done; 194 + /* 195 + * We don't charge any cgroup if the compound page has fewer 196 + * than 3 pages. 
197 + */ 198 + if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) 199 + goto done; 200 + again: 201 + rcu_read_lock(); 202 + h_cg = hugetlb_cgroup_from_task(current); 203 + if (!css_tryget(&h_cg->css)) { 204 + rcu_read_unlock(); 205 + goto again; 206 + } 207 + rcu_read_unlock(); 208 + 209 + ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res); 210 + css_put(&h_cg->css); 211 + done: 212 + *ptr = h_cg; 213 + return ret; 214 + } 215 + 216 + /* Should be called with hugetlb_lock held */ 217 + void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, 218 + struct hugetlb_cgroup *h_cg, 219 + struct page *page) 220 + { 221 + if (hugetlb_cgroup_disabled() || !h_cg) 222 + return; 223 + 224 + set_hugetlb_cgroup(page, h_cg); 225 + return; 226 + } 227 + 228 + /* 229 + * Should be called with hugetlb_lock held 230 + */ 231 + void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, 232 + struct page *page) 233 + { 234 + struct hugetlb_cgroup *h_cg; 235 + unsigned long csize = nr_pages * PAGE_SIZE; 236 + 237 + if (hugetlb_cgroup_disabled()) 238 + return; 239 + VM_BUG_ON(!spin_is_locked(&hugetlb_lock)); 240 + h_cg = hugetlb_cgroup_from_page(page); 241 + if (unlikely(!h_cg)) 242 + return; 243 + set_hugetlb_cgroup(page, NULL); 244 + res_counter_uncharge(&h_cg->hugepage[idx], csize); 245 + return; 246 + } 247 + 248 + void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, 249 + struct hugetlb_cgroup *h_cg) 250 + { 251 + unsigned long csize = nr_pages * PAGE_SIZE; 252 + 253 + if (hugetlb_cgroup_disabled() || !h_cg) 254 + return; 255 + 256 + if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) 257 + return; 258 + 259 + res_counter_uncharge(&h_cg->hugepage[idx], csize); 260 + return; 261 + } 262 + 263 + static ssize_t hugetlb_cgroup_read(struct cgroup *cgroup, struct cftype *cft, 264 + struct file *file, char __user *buf, 265 + size_t nbytes, loff_t *ppos) 266 + { 267 + u64 val; 268 + char str[64]; 269 + int idx, name, 
len; 270 + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup); 271 + 272 + idx = MEMFILE_IDX(cft->private); 273 + name = MEMFILE_ATTR(cft->private); 274 + 275 + val = res_counter_read_u64(&h_cg->hugepage[idx], name); 276 + len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val); 277 + return simple_read_from_buffer(buf, nbytes, ppos, str, len); 278 + } 279 + 280 + static int hugetlb_cgroup_write(struct cgroup *cgroup, struct cftype *cft, 281 + const char *buffer) 282 + { 283 + int idx, name, ret; 284 + unsigned long long val; 285 + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup); 286 + 287 + idx = MEMFILE_IDX(cft->private); 288 + name = MEMFILE_ATTR(cft->private); 289 + 290 + switch (name) { 291 + case RES_LIMIT: 292 + if (hugetlb_cgroup_is_root(h_cg)) { 293 + /* Can't set limit on root */ 294 + ret = -EINVAL; 295 + break; 296 + } 297 + /* This function does all necessary parse...reuse it */ 298 + ret = res_counter_memparse_write_strategy(buffer, &val); 299 + if (ret) 300 + break; 301 + ret = res_counter_set_limit(&h_cg->hugepage[idx], val); 302 + break; 303 + default: 304 + ret = -EINVAL; 305 + break; 306 + } 307 + return ret; 308 + } 309 + 310 + static int hugetlb_cgroup_reset(struct cgroup *cgroup, unsigned int event) 311 + { 312 + int idx, name, ret = 0; 313 + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_cgroup(cgroup); 314 + 315 + idx = MEMFILE_IDX(event); 316 + name = MEMFILE_ATTR(event); 317 + 318 + switch (name) { 319 + case RES_MAX_USAGE: 320 + res_counter_reset_max(&h_cg->hugepage[idx]); 321 + break; 322 + case RES_FAILCNT: 323 + res_counter_reset_failcnt(&h_cg->hugepage[idx]); 324 + break; 325 + default: 326 + ret = -EINVAL; 327 + break; 328 + } 329 + return ret; 330 + } 331 + 332 + static char *mem_fmt(char *buf, int size, unsigned long hsize) 333 + { 334 + if (hsize >= (1UL << 30)) 335 + snprintf(buf, size, "%luGB", hsize >> 30); 336 + else if (hsize >= (1UL << 20)) 337 + snprintf(buf, size, "%luMB", 
hsize >> 20); 338 + else 339 + snprintf(buf, size, "%luKB", hsize >> 10); 340 + return buf; 341 + } 342 + 343 + int __init hugetlb_cgroup_file_init(int idx) 344 + { 345 + char buf[32]; 346 + struct cftype *cft; 347 + struct hstate *h = &hstates[idx]; 348 + 349 + /* format the size */ 350 + mem_fmt(buf, 32, huge_page_size(h)); 351 + 352 + /* Add the limit file */ 353 + cft = &h->cgroup_files[0]; 354 + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf); 355 + cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT); 356 + cft->read = hugetlb_cgroup_read; 357 + cft->write_string = hugetlb_cgroup_write; 358 + 359 + /* Add the usage file */ 360 + cft = &h->cgroup_files[1]; 361 + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf); 362 + cft->private = MEMFILE_PRIVATE(idx, RES_USAGE); 363 + cft->read = hugetlb_cgroup_read; 364 + 365 + /* Add the MAX usage file */ 366 + cft = &h->cgroup_files[2]; 367 + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf); 368 + cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE); 369 + cft->trigger = hugetlb_cgroup_reset; 370 + cft->read = hugetlb_cgroup_read; 371 + 372 + /* Add the failcnt file */ 373 + cft = &h->cgroup_files[3]; 374 + snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf); 375 + cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT); 376 + cft->trigger = hugetlb_cgroup_reset; 377 + cft->read = hugetlb_cgroup_read; 378 + 379 + /* NULL terminate the last cft */ 380 + cft = &h->cgroup_files[4]; 381 + memset(cft, 0, sizeof(*cft)); 382 + 383 + WARN_ON(cgroup_add_cftypes(&hugetlb_subsys, h->cgroup_files)); 384 + 385 + return 0; 386 + } 387 + 388 + /* 389 + * hugetlb_lock will make sure a parallel cgroup rmdir won't happen 390 + * when we migrate hugepages 391 + */ 392 + void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) 393 + { 394 + struct hugetlb_cgroup *h_cg; 395 + struct hstate *h = page_hstate(oldhpage); 396 + 397 + if (hugetlb_cgroup_disabled()) 398 + return; 399 + 400 + 
VM_BUG_ON(!PageHuge(oldhpage)); 401 + spin_lock(&hugetlb_lock); 402 + h_cg = hugetlb_cgroup_from_page(oldhpage); 403 + set_hugetlb_cgroup(oldhpage, NULL); 404 + 405 + /* move the h_cg details to new cgroup */ 406 + set_hugetlb_cgroup(newhpage, h_cg); 407 + list_move(&newhpage->lru, &h->hugepage_activelist); 408 + spin_unlock(&hugetlb_lock); 409 + return; 410 + } 411 + 412 + struct cgroup_subsys hugetlb_subsys = { 413 + .name = "hugetlb", 414 + .create = hugetlb_cgroup_create, 415 + .pre_destroy = hugetlb_cgroup_pre_destroy, 416 + .destroy = hugetlb_cgroup_destroy, 417 + .subsys_id = hugetlb_subsys_id, 418 + };
+1 -1
mm/hwpoison-inject.c
··· 123 123 if (!dentry) 124 124 goto fail; 125 125 126 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 126 + #ifdef CONFIG_MEMCG_SWAP 127 127 dentry = debugfs_create_u64("corrupt-filter-memcg", 0600, 128 128 hwpoison_dir, &hwpoison_filter_memcg); 129 129 if (!dentry)
+8
mm/internal.h
··· 118 118 unsigned long nr_freepages; /* Number of isolated free pages */ 119 119 unsigned long nr_migratepages; /* Number of pages to migrate */ 120 120 unsigned long free_pfn; /* isolate_freepages search base */ 121 + unsigned long start_free_pfn; /* where we started the search */ 121 122 unsigned long migrate_pfn; /* isolate_migratepages search base */ 122 123 bool sync; /* Synchronous migration */ 124 + bool wrapped; /* Order > 0 compactions are 125 + incremental, once free_pfn 126 + and migrate_pfn meet, we restart 127 + from the top of the zone; 128 + remember we wrapped around. */ 123 129 124 130 int order; /* order a direct compactor needs */ 125 131 int migratetype; /* MOVABLE, RECLAIMABLE etc */ ··· 353 347 extern unsigned long vm_mmap_pgoff(struct file *, unsigned long, 354 348 unsigned long, unsigned long, 355 349 unsigned long, unsigned long); 350 + 351 + extern void set_pageblock_order(void);
+18 -17
mm/memblock.c
··· 222 222 /* Try to find some space for it. 223 223 * 224 224 * WARNING: We assume that either slab_is_available() and we use it or 225 - * we use MEMBLOCK for allocations. That means that this is unsafe to use 226 - * when bootmem is currently active (unless bootmem itself is implemented 227 - * on top of MEMBLOCK which isn't the case yet) 225 + * we use MEMBLOCK for allocations. That means that this is unsafe to 226 + * use when bootmem is currently active (unless bootmem itself is 227 + * implemented on top of MEMBLOCK which isn't the case yet) 228 228 * 229 229 * This should however not be an issue for now, as we currently only 230 - * call into MEMBLOCK while it's still active, or much later when slab is 231 - * active for memory hotplug operations 230 + * call into MEMBLOCK while it's still active, or much later when slab 231 + * is active for memory hotplug operations 232 232 */ 233 233 if (use_slab) { 234 234 new_array = kmalloc(new_size, GFP_KERNEL); ··· 243 243 new_alloc_size, PAGE_SIZE); 244 244 if (!addr && new_area_size) 245 245 addr = memblock_find_in_range(0, 246 - min(new_area_start, memblock.current_limit), 247 - new_alloc_size, PAGE_SIZE); 246 + min(new_area_start, memblock.current_limit), 247 + new_alloc_size, PAGE_SIZE); 248 248 249 249 new_array = addr ? __va(addr) : 0; 250 250 } ··· 254 254 return -1; 255 255 } 256 256 257 - memblock_dbg("memblock: %s array is doubled to %ld at [%#010llx-%#010llx]", 258 - memblock_type_name(type), type->max * 2, (u64)addr, (u64)addr + new_size - 1); 257 + memblock_dbg("memblock: %s is doubled to %ld at [%#010llx-%#010llx]", 258 + memblock_type_name(type), type->max * 2, (u64)addr, 259 + (u64)addr + new_size - 1); 259 260 260 - /* Found space, we now need to move the array over before 261 - * we add the reserved region since it may be our reserved 262 - * array itself that is full. 
261 + /* 262 + * Found space, we now need to move the array over before we add the 263 + * reserved region since it may be our reserved array itself that is 264 + * full. 263 265 */ 264 266 memcpy(new_array, type->regions, old_size); 265 267 memset(new_array + type->max, 0, old_size); ··· 269 267 type->regions = new_array; 270 268 type->max <<= 1; 271 269 272 - /* Free old array. We needn't free it if the array is the 273 - * static one 274 - */ 270 + /* Free old array. We needn't free it if the array is the static one */ 275 271 if (*in_slab) 276 272 kfree(old_array); 277 273 else if (old_array != memblock_memory_init_regions && 278 274 old_array != memblock_reserved_init_regions) 279 275 memblock_free(__pa(old_array), old_alloc_size); 280 276 281 - /* Reserve the new array if that comes from the memblock. 282 - * Otherwise, we needn't do it 277 + /* 278 + * Reserve the new array if that comes from the memblock. Otherwise, we 279 + * needn't do it 283 280 */ 284 281 if (!use_slab) 285 282 BUG_ON(memblock_reserve(addr, new_alloc_size));
+229 -161
mm/memcontrol.c
··· 61 61 #define MEM_CGROUP_RECLAIM_RETRIES 5 62 62 static struct mem_cgroup *root_mem_cgroup __read_mostly; 63 63 64 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 64 + #ifdef CONFIG_MEMCG_SWAP 65 65 /* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */ 66 66 int do_swap_account __read_mostly; 67 67 68 68 /* for remember boot option*/ 69 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED 69 + #ifdef CONFIG_MEMCG_SWAP_ENABLED 70 70 static int really_do_swap_account __initdata = 1; 71 71 #else 72 72 static int really_do_swap_account __initdata = 0; ··· 87 87 MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */ 88 88 MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ 89 89 MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ 90 - MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ 90 + MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ 91 91 MEM_CGROUP_STAT_NSTATS, 92 92 }; 93 93 ··· 378 378 379 379 enum charge_type { 380 380 MEM_CGROUP_CHARGE_TYPE_CACHE = 0, 381 - MEM_CGROUP_CHARGE_TYPE_MAPPED, 382 - MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */ 383 - MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */ 381 + MEM_CGROUP_CHARGE_TYPE_ANON, 384 382 MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */ 385 383 MEM_CGROUP_CHARGE_TYPE_DROP, /* a page was unused swap cache */ 386 384 NR_CHARGE_TYPE, ··· 405 407 static void mem_cgroup_get(struct mem_cgroup *memcg); 406 408 static void mem_cgroup_put(struct mem_cgroup *memcg); 407 409 410 + static inline 411 + struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s) 412 + { 413 + return container_of(s, struct mem_cgroup, css); 414 + } 415 + 408 416 /* Writing them here to avoid exposing memcg's inner layout */ 409 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 417 + #ifdef CONFIG_MEMCG_KMEM 410 418 #include <net/sock.h> 411 419 #include <net/ip.h> 412 420 ··· 471 467 } 472 468 EXPORT_SYMBOL(tcp_proto_cgroup); 473 469 #endif /* 
CONFIG_INET */ 474 - #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ 470 + #endif /* CONFIG_MEMCG_KMEM */ 475 471 476 - #if defined(CONFIG_INET) && defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) 472 + #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) 477 473 static void disarm_sock_keys(struct mem_cgroup *memcg) 478 474 { 479 475 if (!memcg_proto_activated(&memcg->tcp_mem.cg_proto)) ··· 707 703 bool charge) 708 704 { 709 705 int val = (charge) ? 1 : -1; 710 - this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAPOUT], val); 706 + this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val); 711 707 } 712 708 713 709 static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg, ··· 868 864 869 865 struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont) 870 866 { 871 - return container_of(cgroup_subsys_state(cont, 872 - mem_cgroup_subsys_id), struct mem_cgroup, 873 - css); 867 + return mem_cgroup_from_css( 868 + cgroup_subsys_state(cont, mem_cgroup_subsys_id)); 874 869 } 875 870 876 871 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) ··· 882 879 if (unlikely(!p)) 883 880 return NULL; 884 881 885 - return container_of(task_subsys_state(p, mem_cgroup_subsys_id), 886 - struct mem_cgroup, css); 882 + return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id)); 887 883 } 888 884 889 885 struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm) ··· 968 966 css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id); 969 967 if (css) { 970 968 if (css == &root->css || css_tryget(css)) 971 - memcg = container_of(css, 972 - struct mem_cgroup, css); 969 + memcg = mem_cgroup_from_css(css); 973 970 } else 974 971 id = 0; 975 972 rcu_read_unlock(); ··· 1455 1454 /* 1456 1455 * Return the memory (and swap, if configured) limit for a memcg. 
1457 1456 */ 1458 - u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) 1457 + static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) 1459 1458 { 1460 1459 u64 limit; 1461 1460 u64 memsw; ··· 1469 1468 * to this memcg, return that limit. 1470 1469 */ 1471 1470 return min(limit, memsw); 1471 + } 1472 + 1473 + void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, 1474 + int order) 1475 + { 1476 + struct mem_cgroup *iter; 1477 + unsigned long chosen_points = 0; 1478 + unsigned long totalpages; 1479 + unsigned int points = 0; 1480 + struct task_struct *chosen = NULL; 1481 + 1482 + /* 1483 + * If current has a pending SIGKILL, then automatically select it. The 1484 + * goal is to allow it to allocate so that it may quickly exit and free 1485 + * its memory. 1486 + */ 1487 + if (fatal_signal_pending(current)) { 1488 + set_thread_flag(TIF_MEMDIE); 1489 + return; 1490 + } 1491 + 1492 + check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL); 1493 + totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? 
: 1; 1494 + for_each_mem_cgroup_tree(iter, memcg) { 1495 + struct cgroup *cgroup = iter->css.cgroup; 1496 + struct cgroup_iter it; 1497 + struct task_struct *task; 1498 + 1499 + cgroup_iter_start(cgroup, &it); 1500 + while ((task = cgroup_iter_next(cgroup, &it))) { 1501 + switch (oom_scan_process_thread(task, totalpages, NULL, 1502 + false)) { 1503 + case OOM_SCAN_SELECT: 1504 + if (chosen) 1505 + put_task_struct(chosen); 1506 + chosen = task; 1507 + chosen_points = ULONG_MAX; 1508 + get_task_struct(chosen); 1509 + /* fall through */ 1510 + case OOM_SCAN_CONTINUE: 1511 + continue; 1512 + case OOM_SCAN_ABORT: 1513 + cgroup_iter_end(cgroup, &it); 1514 + mem_cgroup_iter_break(memcg, iter); 1515 + if (chosen) 1516 + put_task_struct(chosen); 1517 + return; 1518 + case OOM_SCAN_OK: 1519 + break; 1520 + }; 1521 + points = oom_badness(task, memcg, NULL, totalpages); 1522 + if (points > chosen_points) { 1523 + if (chosen) 1524 + put_task_struct(chosen); 1525 + chosen = task; 1526 + chosen_points = points; 1527 + get_task_struct(chosen); 1528 + } 1529 + } 1530 + cgroup_iter_end(cgroup, &it); 1531 + } 1532 + 1533 + if (!chosen) 1534 + return; 1535 + points = chosen_points * 1000 / totalpages; 1536 + oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg, 1537 + NULL, "Memory cgroup out of memory"); 1472 1538 } 1473 1539 1474 1540 static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg, ··· 1967 1899 return; 1968 1900 /* 1969 1901 * If this memory cgroup is not under account moving, we don't 1970 - * need to take move_lock_page_cgroup(). Because we already hold 1902 + * need to take move_lock_mem_cgroup(). Because we already hold 1971 1903 * rcu_read_lock(), any calls to move_account will be delayed until 1972 1904 * rcu_read_unlock() if mem_cgroup_stolen() == true. 
1973 1905 */ ··· 1989 1921 /* 1990 1922 * It's guaranteed that pc->mem_cgroup never changes while 1991 1923 * lock is held because a routine modifies pc->mem_cgroup 1992 - * should take move_lock_page_cgroup(). 1924 + * should take move_lock_mem_cgroup(). 1993 1925 */ 1994 1926 move_unlock_mem_cgroup(pc->mem_cgroup, flags); 1995 1927 } ··· 2336 2268 * We always charge the cgroup the mm_struct belongs to. 2337 2269 * The mm_struct's mem_cgroup changes on task migration if the 2338 2270 * thread group leader migrates. It's possible that mm is not 2339 - * set, if so charge the init_mm (happens for pagecache usage). 2271 + * set, if so charge the root memcg (happens for pagecache usage). 2340 2272 */ 2341 2273 if (!*ptr && !mm) 2342 2274 *ptr = root_mem_cgroup; ··· 2497 2429 css = css_lookup(&mem_cgroup_subsys, id); 2498 2430 if (!css) 2499 2431 return NULL; 2500 - return container_of(css, struct mem_cgroup, css); 2432 + return mem_cgroup_from_css(css); 2501 2433 } 2502 2434 2503 2435 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) ··· 2541 2473 bool anon; 2542 2474 2543 2475 lock_page_cgroup(pc); 2544 - if (unlikely(PageCgroupUsed(pc))) { 2545 - unlock_page_cgroup(pc); 2546 - __mem_cgroup_cancel_charge(memcg, nr_pages); 2547 - return; 2548 - } 2476 + VM_BUG_ON(PageCgroupUsed(pc)); 2549 2477 /* 2550 2478 * we don't need page_cgroup_lock about tail pages, becase they are not 2551 2479 * accessed by any other context at this point. 
··· 2583 2519 spin_unlock_irq(&zone->lru_lock); 2584 2520 } 2585 2521 2586 - if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) 2522 + if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON) 2587 2523 anon = true; 2588 2524 else 2589 2525 anon = false; ··· 2708 2644 2709 2645 static int mem_cgroup_move_parent(struct page *page, 2710 2646 struct page_cgroup *pc, 2711 - struct mem_cgroup *child, 2712 - gfp_t gfp_mask) 2647 + struct mem_cgroup *child) 2713 2648 { 2714 2649 struct mem_cgroup *parent; 2715 2650 unsigned int nr_pages; ··· 2791 2728 VM_BUG_ON(page->mapping && !PageAnon(page)); 2792 2729 VM_BUG_ON(!mm); 2793 2730 return mem_cgroup_charge_common(page, mm, gfp_mask, 2794 - MEM_CGROUP_CHARGE_TYPE_MAPPED); 2795 - } 2796 - 2797 - static void 2798 - __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr, 2799 - enum charge_type ctype); 2800 - 2801 - int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, 2802 - gfp_t gfp_mask) 2803 - { 2804 - struct mem_cgroup *memcg = NULL; 2805 - enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE; 2806 - int ret; 2807 - 2808 - if (mem_cgroup_disabled()) 2809 - return 0; 2810 - if (PageCompound(page)) 2811 - return 0; 2812 - 2813 - if (unlikely(!mm)) 2814 - mm = &init_mm; 2815 - if (!page_is_file_cache(page)) 2816 - type = MEM_CGROUP_CHARGE_TYPE_SHMEM; 2817 - 2818 - if (!PageSwapCache(page)) 2819 - ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); 2820 - else { /* page is swapcache/shmem */ 2821 - ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg); 2822 - if (!ret) 2823 - __mem_cgroup_commit_charge_swapin(page, memcg, type); 2824 - } 2825 - return ret; 2731 + MEM_CGROUP_CHARGE_TYPE_ANON); 2826 2732 } 2827 2733 2828 2734 /* ··· 2800 2768 * struct page_cgroup is acquired. 
This refcnt will be consumed by 2801 2769 * "commit()" or removed by "cancel()" 2802 2770 */ 2803 - int mem_cgroup_try_charge_swapin(struct mm_struct *mm, 2804 - struct page *page, 2805 - gfp_t mask, struct mem_cgroup **memcgp) 2771 + static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm, 2772 + struct page *page, 2773 + gfp_t mask, 2774 + struct mem_cgroup **memcgp) 2806 2775 { 2807 2776 struct mem_cgroup *memcg; 2777 + struct page_cgroup *pc; 2808 2778 int ret; 2809 2779 2810 - *memcgp = NULL; 2811 - 2812 - if (mem_cgroup_disabled()) 2813 - return 0; 2814 - 2815 - if (!do_swap_account) 2816 - goto charge_cur_mm; 2780 + pc = lookup_page_cgroup(page); 2817 2781 /* 2818 - * A racing thread's fault, or swapoff, may have already updated 2819 - * the pte, and even removed page from swap cache: in those cases 2820 - * do_swap_page()'s pte_same() test will fail; but there's also a 2821 - * KSM case which does need to charge the page. 2782 + * Every swap fault against a single page tries to charge the 2783 + * page, bail as early as possible. shmem_unuse() encounters 2784 + * already charged pages, too. The USED bit is protected by 2785 + * the page lock, which serializes swap cache removal, which 2786 + * in turn serializes uncharging. 
2822 2787 */ 2823 - if (!PageSwapCache(page)) 2788 + if (PageCgroupUsed(pc)) 2789 + return 0; 2790 + if (!do_swap_account) 2824 2791 goto charge_cur_mm; 2825 2792 memcg = try_get_mem_cgroup_from_page(page); 2826 2793 if (!memcg) ··· 2831 2800 ret = 0; 2832 2801 return ret; 2833 2802 charge_cur_mm: 2834 - if (unlikely(!mm)) 2835 - mm = &init_mm; 2836 2803 ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true); 2837 2804 if (ret == -EINTR) 2838 2805 ret = 0; 2839 2806 return ret; 2807 + } 2808 + 2809 + int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page, 2810 + gfp_t gfp_mask, struct mem_cgroup **memcgp) 2811 + { 2812 + *memcgp = NULL; 2813 + if (mem_cgroup_disabled()) 2814 + return 0; 2815 + /* 2816 + * A racing thread's fault, or swapoff, may have already 2817 + * updated the pte, and even removed page from swap cache: in 2818 + * those cases unuse_pte()'s pte_same() test will fail; but 2819 + * there's also a KSM case which does need to charge the page. 2820 + */ 2821 + if (!PageSwapCache(page)) { 2822 + int ret; 2823 + 2824 + ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true); 2825 + if (ret == -EINTR) 2826 + ret = 0; 2827 + return ret; 2828 + } 2829 + return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp); 2830 + } 2831 + 2832 + void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) 2833 + { 2834 + if (mem_cgroup_disabled()) 2835 + return; 2836 + if (!memcg) 2837 + return; 2838 + __mem_cgroup_cancel_charge(memcg, 1); 2840 2839 } 2841 2840 2842 2841 static void ··· 2903 2842 struct mem_cgroup *memcg) 2904 2843 { 2905 2844 __mem_cgroup_commit_charge_swapin(page, memcg, 2906 - MEM_CGROUP_CHARGE_TYPE_MAPPED); 2845 + MEM_CGROUP_CHARGE_TYPE_ANON); 2907 2846 } 2908 2847 2909 - void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg) 2848 + int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, 2849 + gfp_t gfp_mask) 2910 2850 { 2851 + struct mem_cgroup *memcg = NULL; 2852 + enum charge_type type = 
MEM_CGROUP_CHARGE_TYPE_CACHE; 2853 + int ret; 2854 + 2911 2855 if (mem_cgroup_disabled()) 2912 - return; 2913 - if (!memcg) 2914 - return; 2915 - __mem_cgroup_cancel_charge(memcg, 1); 2856 + return 0; 2857 + if (PageCompound(page)) 2858 + return 0; 2859 + 2860 + if (!PageSwapCache(page)) 2861 + ret = mem_cgroup_charge_common(page, mm, gfp_mask, type); 2862 + else { /* page is swapcache/shmem */ 2863 + ret = __mem_cgroup_try_charge_swapin(mm, page, 2864 + gfp_mask, &memcg); 2865 + if (!ret) 2866 + __mem_cgroup_commit_charge_swapin(page, memcg, type); 2867 + } 2868 + return ret; 2916 2869 } 2917 2870 2918 2871 static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg, ··· 2986 2911 * uncharge if !page_mapped(page) 2987 2912 */ 2988 2913 static struct mem_cgroup * 2989 - __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype) 2914 + __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype, 2915 + bool end_migration) 2990 2916 { 2991 2917 struct mem_cgroup *memcg = NULL; 2992 2918 unsigned int nr_pages = 1; ··· 2997 2921 if (mem_cgroup_disabled()) 2998 2922 return NULL; 2999 2923 3000 - if (PageSwapCache(page)) 3001 - return NULL; 2924 + VM_BUG_ON(PageSwapCache(page)); 3002 2925 3003 2926 if (PageTransHuge(page)) { 3004 2927 nr_pages <<= compound_order(page); ··· 3020 2945 anon = PageAnon(page); 3021 2946 3022 2947 switch (ctype) { 3023 - case MEM_CGROUP_CHARGE_TYPE_MAPPED: 2948 + case MEM_CGROUP_CHARGE_TYPE_ANON: 3024 2949 /* 3025 2950 * Generally PageAnon tells if it's the anon statistics to be 3026 2951 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is ··· 3030 2955 /* fallthrough */ 3031 2956 case MEM_CGROUP_CHARGE_TYPE_DROP: 3032 2957 /* See mem_cgroup_prepare_migration() */ 3033 - if (page_mapped(page) || PageCgroupMigration(pc)) 2958 + if (page_mapped(page)) 2959 + goto unlock_out; 2960 + /* 2961 + * Pages under migration may not be uncharged. 
But 2962 + * end_migration() /must/ be the one uncharging the 2963 + * unused post-migration page and so it has to call 2964 + * here with the migration bit still set. See the 2965 + * res_counter handling below. 2966 + */ 2967 + if (!end_migration && PageCgroupMigration(pc)) 3034 2968 goto unlock_out; 3035 2969 break; 3036 2970 case MEM_CGROUP_CHARGE_TYPE_SWAPOUT: ··· 3073 2989 mem_cgroup_swap_statistics(memcg, true); 3074 2990 mem_cgroup_get(memcg); 3075 2991 } 3076 - if (!mem_cgroup_is_root(memcg)) 2992 + /* 2993 + * Migration does not charge the res_counter for the 2994 + * replacement page, so leave it alone when phasing out the 2995 + * page that is unused after the migration. 2996 + */ 2997 + if (!end_migration && !mem_cgroup_is_root(memcg)) 3077 2998 mem_cgroup_do_uncharge(memcg, nr_pages, ctype); 3078 2999 3079 3000 return memcg; ··· 3094 3005 if (page_mapped(page)) 3095 3006 return; 3096 3007 VM_BUG_ON(page->mapping && !PageAnon(page)); 3097 - __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED); 3008 + if (PageSwapCache(page)) 3009 + return; 3010 + __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false); 3098 3011 } 3099 3012 3100 3013 void mem_cgroup_uncharge_cache_page(struct page *page) 3101 3014 { 3102 3015 VM_BUG_ON(page_mapped(page)); 3103 3016 VM_BUG_ON(page->mapping); 3104 - __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE); 3017 + __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false); 3105 3018 } 3106 3019 3107 3020 /* ··· 3167 3076 if (!swapout) /* this was a swap cache but the swap is unused ! 
*/ 3168 3077 ctype = MEM_CGROUP_CHARGE_TYPE_DROP; 3169 3078 3170 - memcg = __mem_cgroup_uncharge_common(page, ctype); 3079 + memcg = __mem_cgroup_uncharge_common(page, ctype, false); 3171 3080 3172 3081 /* 3173 3082 * record memcg information, if swapout && memcg != NULL, ··· 3178 3087 } 3179 3088 #endif 3180 3089 3181 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 3090 + #ifdef CONFIG_MEMCG_SWAP 3182 3091 /* 3183 3092 * called from swap_entry_free(). remove record in swap_cgroup and 3184 3093 * uncharge "memsw" account. ··· 3257 3166 * Before starting migration, account PAGE_SIZE to mem_cgroup that the old 3258 3167 * page belongs to. 3259 3168 */ 3260 - int mem_cgroup_prepare_migration(struct page *page, 3261 - struct page *newpage, struct mem_cgroup **memcgp, gfp_t gfp_mask) 3169 + void mem_cgroup_prepare_migration(struct page *page, struct page *newpage, 3170 + struct mem_cgroup **memcgp) 3262 3171 { 3263 3172 struct mem_cgroup *memcg = NULL; 3264 3173 struct page_cgroup *pc; 3265 3174 enum charge_type ctype; 3266 - int ret = 0; 3267 3175 3268 3176 *memcgp = NULL; 3269 3177 3270 3178 VM_BUG_ON(PageTransHuge(page)); 3271 3179 if (mem_cgroup_disabled()) 3272 - return 0; 3180 + return; 3273 3181 3274 3182 pc = lookup_page_cgroup(page); 3275 3183 lock_page_cgroup(pc); ··· 3313 3223 * we return here. 3314 3224 */ 3315 3225 if (!memcg) 3316 - return 0; 3226 + return; 3317 3227 3318 3228 *memcgp = memcg; 3319 - ret = __mem_cgroup_try_charge(NULL, gfp_mask, 1, memcgp, false); 3320 - css_put(&memcg->css);/* drop extra refcnt */ 3321 - if (ret) { 3322 - if (PageAnon(page)) { 3323 - lock_page_cgroup(pc); 3324 - ClearPageCgroupMigration(pc); 3325 - unlock_page_cgroup(pc); 3326 - /* 3327 - * The old page may be fully unmapped while we kept it. 3328 - */ 3329 - mem_cgroup_uncharge_page(page); 3330 - } 3331 - /* we'll need to revisit this error code (we have -EINTR) */ 3332 - return -ENOMEM; 3333 - } 3334 3229 /* 3335 3230 * We charge new page before it's used/mapped. 
So, even if unlock_page() 3336 3231 * is called before end_migration, we can catch all events on this new ··· 3323 3248 * mapcount will be finally 0 and we call uncharge in end_migration(). 3324 3249 */ 3325 3250 if (PageAnon(page)) 3326 - ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED; 3327 - else if (page_is_file_cache(page)) 3328 - ctype = MEM_CGROUP_CHARGE_TYPE_CACHE; 3251 + ctype = MEM_CGROUP_CHARGE_TYPE_ANON; 3329 3252 else 3330 - ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM; 3253 + ctype = MEM_CGROUP_CHARGE_TYPE_CACHE; 3254 + /* 3255 + * The page is committed to the memcg, but it's not actually 3256 + * charged to the res_counter since we plan on replacing the 3257 + * old one and only one page is going to be left afterwards. 3258 + */ 3331 3259 __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false); 3332 - return ret; 3333 3260 } 3334 3261 3335 3262 /* remove redundant charge if migration failed*/ ··· 3353 3276 used = newpage; 3354 3277 unused = oldpage; 3355 3278 } 3279 + anon = PageAnon(used); 3280 + __mem_cgroup_uncharge_common(unused, 3281 + anon ? MEM_CGROUP_CHARGE_TYPE_ANON 3282 + : MEM_CGROUP_CHARGE_TYPE_CACHE, 3283 + true); 3284 + css_put(&memcg->css); 3356 3285 /* 3357 3286 * We disallowed uncharge of pages under migration because mapcount 3358 3287 * of the page goes down to zero, temporarly. ··· 3368 3285 lock_page_cgroup(pc); 3369 3286 ClearPageCgroupMigration(pc); 3370 3287 unlock_page_cgroup(pc); 3371 - anon = PageAnon(used); 3372 - __mem_cgroup_uncharge_common(unused, 3373 - anon ? MEM_CGROUP_CHARGE_TYPE_MAPPED 3374 - : MEM_CGROUP_CHARGE_TYPE_CACHE); 3375 3288 3376 3289 /* 3377 3290 * If a page is a file cache, radix-tree replacement is very atomic ··· 3419 3340 */ 3420 3341 if (!memcg) 3421 3342 return; 3422 - 3423 - if (PageSwapBacked(oldpage)) 3424 - type = MEM_CGROUP_CHARGE_TYPE_SHMEM; 3425 - 3426 3343 /* 3427 3344 * Even if newpage->mapping was NULL before starting replacement, 3428 3345 * the newpage may be on LRU(or pagevec for LRU) already. 
We lock ··· 3493 3418 /* 3494 3419 * Rather than hide all in some function, I do this in 3495 3420 * open coded manner. You see what this really does. 3496 - * We have to guarantee memcg->res.limit < memcg->memsw.limit. 3421 + * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 3497 3422 */ 3498 3423 mutex_lock(&set_limit_mutex); 3499 3424 memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT); ··· 3554 3479 /* 3555 3480 * Rather than hide all in some function, I do this in 3556 3481 * open coded manner. You see what this really does. 3557 - * We have to guarantee memcg->res.limit < memcg->memsw.limit. 3482 + * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 3558 3483 */ 3559 3484 mutex_lock(&set_limit_mutex); 3560 3485 memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT); ··· 3686 3611 } 3687 3612 3688 3613 /* 3689 - * This routine traverse page_cgroup in given list and drop them all. 3690 - * *And* this routine doesn't reclaim page itself, just removes page_cgroup. 3614 + * Traverse a specified page_cgroup list and try to drop them all. This doesn't 3615 + * reclaim the pages page themselves - it just removes the page_cgroups. 3616 + * Returns true if some page_cgroups were not freed, indicating that the caller 3617 + * must retry this operation. 
3691 3618 */ 3692 - static int mem_cgroup_force_empty_list(struct mem_cgroup *memcg, 3619 + static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg, 3693 3620 int node, int zid, enum lru_list lru) 3694 3621 { 3695 3622 struct mem_cgroup_per_zone *mz; ··· 3699 3622 struct list_head *list; 3700 3623 struct page *busy; 3701 3624 struct zone *zone; 3702 - int ret = 0; 3703 3625 3704 3626 zone = &NODE_DATA(node)->node_zones[zid]; 3705 3627 mz = mem_cgroup_zoneinfo(memcg, node, zid); ··· 3712 3636 struct page_cgroup *pc; 3713 3637 struct page *page; 3714 3638 3715 - ret = 0; 3716 3639 spin_lock_irqsave(&zone->lru_lock, flags); 3717 3640 if (list_empty(list)) { 3718 3641 spin_unlock_irqrestore(&zone->lru_lock, flags); ··· 3728 3653 3729 3654 pc = lookup_page_cgroup(page); 3730 3655 3731 - ret = mem_cgroup_move_parent(page, pc, memcg, GFP_KERNEL); 3732 - if (ret == -ENOMEM || ret == -EINTR) 3733 - break; 3734 - 3735 - if (ret == -EBUSY || ret == -EINVAL) { 3656 + if (mem_cgroup_move_parent(page, pc, memcg)) { 3736 3657 /* found lock contention or "pc" is obsolete. */ 3737 3658 busy = page; 3738 3659 cond_resched(); 3739 3660 } else 3740 3661 busy = NULL; 3741 3662 } 3742 - 3743 - if (!ret && !list_empty(list)) 3744 - return -EBUSY; 3745 - return ret; 3663 + return !list_empty(list); 3746 3664 } 3747 3665 3748 3666 /* ··· 3760 3692 ret = -EBUSY; 3761 3693 if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)) 3762 3694 goto out; 3763 - ret = -EINTR; 3764 - if (signal_pending(current)) 3765 - goto out; 3766 3695 /* This is for making all *used* pages to be on LRU. */ 3767 3696 lru_add_drain_all(); 3768 3697 drain_all_stock_sync(memcg); ··· 3780 3715 } 3781 3716 mem_cgroup_end_move(memcg); 3782 3717 memcg_oom_recover(memcg); 3783 - /* it seems parent cgroup doesn't have enough mem */ 3784 - if (ret == -ENOMEM) 3785 - goto try_to_free; 3786 3718 cond_resched(); 3787 3719 /* "ret" should also be checked to ensure all lists are empty. 
*/ 3788 3720 } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret); ··· 3841 3779 parent_memcg = mem_cgroup_from_cont(parent); 3842 3780 3843 3781 cgroup_lock(); 3782 + 3783 + if (memcg->use_hierarchy == val) 3784 + goto out; 3785 + 3844 3786 /* 3845 3787 * If parent's use_hierarchy is set, we can't make any modifications 3846 3788 * in the child subtrees. If it is unset, then the change can ··· 3861 3795 retval = -EBUSY; 3862 3796 } else 3863 3797 retval = -EINVAL; 3798 + 3799 + out: 3864 3800 cgroup_unlock(); 3865 3801 3866 3802 return retval; ··· 3899 3831 val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS); 3900 3832 3901 3833 if (swap) 3902 - val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAPOUT); 3834 + val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP); 3903 3835 3904 3836 return val << PAGE_SHIFT; 3905 3837 } ··· 4083 4015 #endif 4084 4016 4085 4017 #ifdef CONFIG_NUMA 4086 - static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft, 4018 + static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, 4087 4019 struct seq_file *m) 4088 4020 { 4089 4021 int nid; ··· 4142 4074 BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); 4143 4075 } 4144 4076 4145 - static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, 4077 + static int memcg_stat_show(struct cgroup *cont, struct cftype *cft, 4146 4078 struct seq_file *m) 4147 4079 { 4148 4080 struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); ··· 4150 4082 unsigned int i; 4151 4083 4152 4084 for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { 4153 - if (i == MEM_CGROUP_STAT_SWAPOUT && !do_swap_account) 4085 + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) 4154 4086 continue; 4155 4087 seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i], 4156 4088 mem_cgroup_read_stat(memcg, i) * PAGE_SIZE); ··· 4177 4109 for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { 4178 4110 long long val = 0; 4179 4111 4180 - if (i == 
MEM_CGROUP_STAT_SWAPOUT && !do_swap_account) 4112 + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) 4181 4113 continue; 4182 4114 for_each_mem_cgroup_tree(mi, memcg) 4183 4115 val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE; ··· 4601 4533 return 0; 4602 4534 } 4603 4535 4604 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 4536 + #ifdef CONFIG_MEMCG_KMEM 4605 4537 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) 4606 4538 { 4607 4539 return mem_cgroup_sockets_init(memcg, ss); ··· 4656 4588 }, 4657 4589 { 4658 4590 .name = "stat", 4659 - .read_seq_string = mem_control_stat_show, 4591 + .read_seq_string = memcg_stat_show, 4660 4592 }, 4661 4593 { 4662 4594 .name = "force_empty", ··· 4688 4620 #ifdef CONFIG_NUMA 4689 4621 { 4690 4622 .name = "numa_stat", 4691 - .read_seq_string = mem_control_numa_stat_show, 4623 + .read_seq_string = memcg_numa_stat_show, 4692 4624 }, 4693 4625 #endif 4694 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 4626 + #ifdef CONFIG_MEMCG_SWAP 4695 4627 { 4696 4628 .name = "memsw.usage_in_bytes", 4697 4629 .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE), ··· 4878 4810 } 4879 4811 EXPORT_SYMBOL(parent_mem_cgroup); 4880 4812 4881 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 4813 + #ifdef CONFIG_MEMCG_SWAP 4882 4814 static void __init enable_swap_cgroup(void) 4883 4815 { 4884 4816 if (!mem_cgroup_disabled() && really_do_swap_account) ··· 5609 5541 .__DEPRECATED_clear_css_refs = true, 5610 5542 }; 5611 5543 5612 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 5544 + #ifdef CONFIG_MEMCG_SWAP 5613 5545 static int __init enable_swap_account(char *s) 5614 5546 { 5615 5547 /* consider enabled if no parameter or 1 is given */
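One small cleanup threaded through the memcontrol.c diff is the new `mem_cgroup_from_css()` helper, which replaces several open-coded `container_of()` calls. The idiom recovers a pointer to an enclosing structure from a pointer to one of its embedded members by subtracting the member's offset. A self-contained sketch, with toy struct definitions standing in for the real (much larger) kernel types:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of the container_of() idiom behind the new
 * mem_cgroup_from_css() helper: given a pointer to an embedded member,
 * recover the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct css {			/* stand-in for struct cgroup_subsys_state */
	int refcnt;
};

struct mem_cgroup {		/* stand-in for the real struct */
	long limit;
	struct css css;		/* embedded member, as in the kernel */
};

static struct mem_cgroup *mem_cgroup_from_css(struct css *s)
{
	return container_of(s, struct mem_cgroup, css);
}
```

Wrapping the cast in a named helper keeps the `struct mem_cgroup` layout out of every call site, which is exactly what the patch does for `mem_cgroup_from_cont()`, `mem_cgroup_from_task()` and the css-lookup paths above.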
+5 -12
mm/memory-failure.c
··· 128 128 * can only guarantee that the page either belongs to the memcg tasks, or is 129 129 * a freed page. 130 130 */ 131 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 131 + #ifdef CONFIG_MEMCG_SWAP 132 132 u64 hwpoison_filter_memcg; 133 133 EXPORT_SYMBOL_GPL(hwpoison_filter_memcg); 134 134 static int hwpoison_filter_task(struct page *p) ··· 1416 1416 int ret; 1417 1417 unsigned long pfn = page_to_pfn(page); 1418 1418 struct page *hpage = compound_head(page); 1419 - LIST_HEAD(pagelist); 1420 1419 1421 1420 ret = get_any_page(page, pfn, flags); 1422 1421 if (ret < 0) ··· 1430 1431 } 1431 1432 1432 1433 /* Keep page count to indicate a given hugepage is isolated. */ 1433 - 1434 - list_add(&hpage->lru, &pagelist); 1435 - ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, false, 1434 + ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false, 1436 1435 MIGRATE_SYNC); 1436 + put_page(hpage); 1437 1437 if (ret) { 1438 - struct page *page1, *page2; 1439 - list_for_each_entry_safe(page1, page2, &pagelist, lru) 1440 - put_page(page1); 1441 - 1442 1438 pr_info("soft offline: %#lx: migration failed %d, type %lx\n", 1443 1439 pfn, ret, page->flags); 1444 - if (ret > 0) 1445 - ret = -EIO; 1446 1440 return ret; 1447 1441 } 1448 1442 done: 1449 1443 if (!PageHWPoison(hpage)) 1450 - atomic_long_add(1 << compound_trans_order(hpage), &mce_bad_pages); 1444 + atomic_long_add(1 << compound_trans_order(hpage), 1445 + &mce_bad_pages); 1451 1446 set_page_hwpoison_huge_page(hpage); 1452 1447 dequeue_hwpoisoned_huge_page(hpage); 1453 1448 /* keep elevated page count for bad page */
+6 -3
mm/memory.c
··· 1343 1343 * Since no pte has actually been setup, it is 1344 1344 * safe to do nothing in this case. 1345 1345 */ 1346 - if (vma->vm_file) 1347 - unmap_hugepage_range(vma, start, end, NULL); 1346 + if (vma->vm_file) { 1347 + mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex); 1348 + __unmap_hugepage_range_final(tlb, vma, start, end, NULL); 1349 + mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); 1350 + } 1348 1351 } else 1349 1352 unmap_page_range(tlb, vma, start, end, details); 1350 1353 } ··· 3941 3938 free_page((unsigned long)buf); 3942 3939 } 3943 3940 } 3944 - up_read(&current->mm->mmap_sem); 3941 + up_read(&mm->mmap_sem); 3945 3942 } 3946 3943 3947 3944 #ifdef CONFIG_PROVE_LOCKING
+12 -8
mm/memory_hotplug.c
··· 512 512 513 513 zone->present_pages += onlined_pages; 514 514 zone->zone_pgdat->node_present_pages += onlined_pages; 515 - if (need_zonelists_rebuild) 516 - build_all_zonelists(zone); 517 - else 518 - zone_pcp_update(zone); 515 + if (onlined_pages) { 516 + node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); 517 + if (need_zonelists_rebuild) 518 + build_all_zonelists(NULL, zone); 519 + else 520 + zone_pcp_update(zone); 521 + } 519 522 520 523 mutex_unlock(&zonelists_mutex); 521 524 522 525 init_per_zone_wmark_min(); 523 526 524 - if (onlined_pages) { 527 + if (onlined_pages) 525 528 kswapd_run(zone_to_nid(zone)); 526 - node_set_state(zone_to_nid(zone), N_HIGH_MEMORY); 527 - } 528 529 529 530 vm_total_pages = nr_free_pagecache_pages(); 530 531 ··· 563 562 * to access not-initialized zonelist, build here. 564 563 */ 565 564 mutex_lock(&zonelists_mutex); 566 - build_all_zonelists(NULL); 565 + build_all_zonelists(pgdat, NULL); 567 566 mutex_unlock(&zonelists_mutex); 568 567 569 568 return pgdat; ··· 965 964 totalram_pages -= offlined_pages; 966 965 967 966 init_per_zone_wmark_min(); 967 + 968 + if (!populated_zone(zone)) 969 + zone_pcp_reset(zone); 968 970 969 971 if (!node_present_pages(node)) { 970 972 node_clear_state(node, N_HIGH_MEMORY);
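The `online_pages()` change above reorders the hotplug path so that the node is marked as having memory *before* the zonelists are rebuilt, and skips both steps when no pages were actually onlined. A toy model of that publish-then-rebuild ordering (all names are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>

/* Sketch of the online_pages() ordering fix: derived structures (here a
 * flat "zonelist") are rebuilt only when pages were actually onlined, and
 * the node's presence bit is set before the rebuild so the rebuild sees
 * the newly populated node. */
#define MAX_NODES 4

static int node_has_memory[MAX_NODES];
static int zonelist[MAX_NODES];
static int zonelist_len;

static void build_zonelist(void)
{
	int nid;

	zonelist_len = 0;
	for (nid = 0; nid < MAX_NODES; nid++)
		if (node_has_memory[nid])
			zonelist[zonelist_len++] = nid;
}

static void online_pages(int nid, long onlined_pages)
{
	if (onlined_pages) {
		node_has_memory[nid] = 1;	/* publish first... */
		build_zonelist();		/* ...then rebuild */
	}
}
```

Setting the presence bit after the rebuild (the old order) would leave the derived structure stale until the next rebuild, which is the class of bug the reordering closes.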
+27 -52
mm/migrate.c
··· 33 33 #include <linux/memcontrol.h> 34 34 #include <linux/syscalls.h> 35 35 #include <linux/hugetlb.h> 36 + #include <linux/hugetlb_cgroup.h> 36 37 #include <linux/gfp.h> 37 38 38 39 #include <asm/tlbflush.h> ··· 683 682 { 684 683 int rc = -EAGAIN; 685 684 int remap_swapcache = 1; 686 - int charge = 0; 687 685 struct mem_cgroup *mem; 688 686 struct anon_vma *anon_vma = NULL; 689 687 ··· 724 724 } 725 725 726 726 /* charge against new page */ 727 - charge = mem_cgroup_prepare_migration(page, newpage, &mem, GFP_KERNEL); 728 - if (charge == -ENOMEM) { 729 - rc = -ENOMEM; 730 - goto unlock; 731 - } 732 - BUG_ON(charge); 727 + mem_cgroup_prepare_migration(page, newpage, &mem); 733 728 734 729 if (PageWriteback(page)) { 735 730 /* ··· 814 819 put_anon_vma(anon_vma); 815 820 816 821 uncharge: 817 - if (!charge) 818 - mem_cgroup_end_migration(mem, page, newpage, rc == 0); 822 + mem_cgroup_end_migration(mem, page, newpage, rc == 0); 819 823 unlock: 820 824 unlock_page(page); 821 825 out: ··· 925 931 926 932 if (anon_vma) 927 933 put_anon_vma(anon_vma); 934 + 935 + if (!rc) 936 + hugetlb_cgroup_migrate(hpage, new_hpage); 937 + 928 938 unlock_page(hpage); 929 - 930 939 out: 931 - if (rc != -EAGAIN) { 932 - list_del(&hpage->lru); 933 - put_page(hpage); 934 - } 935 - 936 940 put_page(new_hpage); 937 - 938 941 if (result) { 939 942 if (rc) 940 943 *result = rc; ··· 1007 1016 return nr_failed + retry; 1008 1017 } 1009 1018 1010 - int migrate_huge_pages(struct list_head *from, 1011 - new_page_t get_new_page, unsigned long private, bool offlining, 1012 - enum migrate_mode mode) 1019 + int migrate_huge_page(struct page *hpage, new_page_t get_new_page, 1020 + unsigned long private, bool offlining, 1021 + enum migrate_mode mode) 1013 1022 { 1014 - int retry = 1; 1015 - int nr_failed = 0; 1016 - int pass = 0; 1017 - struct page *page; 1018 - struct page *page2; 1019 - int rc; 1023 + int pass, rc; 1020 1024 1021 - for (pass = 0; pass < 10 && retry; pass++) { 1022 - retry = 0; 1023 - 
1024 - list_for_each_entry_safe(page, page2, from, lru) { 1025 + for (pass = 0; pass < 10; pass++) { 1026 + rc = unmap_and_move_huge_page(get_new_page, 1027 + private, hpage, pass > 2, offlining, 1028 + mode); 1029 + switch (rc) { 1030 + case -ENOMEM: 1031 + goto out; 1032 + case -EAGAIN: 1033 + /* try again */ 1025 1034 cond_resched(); 1026 - 1027 - rc = unmap_and_move_huge_page(get_new_page, 1028 - private, page, pass > 2, offlining, 1029 - mode); 1030 - 1031 - switch(rc) { 1032 - case -ENOMEM: 1033 - goto out; 1034 - case -EAGAIN: 1035 - retry++; 1036 - break; 1037 - case 0: 1038 - break; 1039 - default: 1040 - /* Permanent failure */ 1041 - nr_failed++; 1042 - break; 1043 - } 1035 + break; 1036 + case 0: 1037 + goto out; 1038 + default: 1039 + rc = -EIO; 1040 + goto out; 1044 1041 } 1045 1042 } 1046 - rc = 0; 1047 1043 out: 1048 - if (rc) 1049 - return rc; 1050 - 1051 - return nr_failed + retry; 1044 + return rc; 1052 1045 } 1053 1046 1054 1047 #ifdef CONFIG_NUMA
+2 -3
mm/mmap.c
··· 943 943 const unsigned long stack_flags 944 944 = VM_STACK_FLAGS & (VM_GROWSUP|VM_GROWSDOWN); 945 945 946 + mm->total_vm += pages; 947 + 946 948 if (file) { 947 949 mm->shared_vm += pages; 948 950 if ((flags & (VM_EXEC|VM_WRITE)) == VM_EXEC) ··· 1349 1347 out: 1350 1348 perf_event_mmap(vma); 1351 1349 1352 - mm->total_vm += len >> PAGE_SHIFT; 1353 1350 vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT); 1354 1351 if (vm_flags & VM_LOCKED) { 1355 1352 if (!mlock_vma_pages_range(vma, addr, addr + len)) ··· 1708 1707 return -ENOMEM; 1709 1708 1710 1709 /* Ok, everything looks good - let it rip */ 1711 - mm->total_vm += grow; 1712 1710 if (vma->vm_flags & VM_LOCKED) 1713 1711 mm->locked_vm += grow; 1714 1712 vm_stat_account(mm, vma->vm_flags, vma->vm_file, grow); ··· 1889 1889 1890 1890 if (vma->vm_flags & VM_ACCOUNT) 1891 1891 nr_accounted += nrpages; 1892 - mm->total_vm -= nrpages; 1893 1892 vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages); 1894 1893 vma = remove_vma(vma); 1895 1894 } while (vma);
+23 -22
mm/mmu_notifier.c
··· 33 33 void __mmu_notifier_release(struct mm_struct *mm) 34 34 { 35 35 struct mmu_notifier *mn; 36 + struct hlist_node *n; 37 + 38 + /* 39 + * RCU here will block mmu_notifier_unregister until 40 + * ->release returns. 41 + */ 42 + rcu_read_lock(); 43 + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) 44 + /* 45 + * if ->release runs before mmu_notifier_unregister it 46 + * must be handled as it's the only way for the driver 47 + * to flush all existing sptes and stop the driver 48 + * from establishing any more sptes before all the 49 + * pages in the mm are freed. 50 + */ 51 + if (mn->ops->release) 52 + mn->ops->release(mn, mm); 53 + rcu_read_unlock(); 36 54 37 55 spin_lock(&mm->mmu_notifier_mm->lock); 38 56 while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { ··· 64 46 * mmu_notifier_unregister to return. 65 47 */ 66 48 hlist_del_init_rcu(&mn->hlist); 67 - /* 68 - * RCU here will block mmu_notifier_unregister until 69 - * ->release returns. 70 - */ 71 - rcu_read_lock(); 72 - spin_unlock(&mm->mmu_notifier_mm->lock); 73 - /* 74 - * if ->release runs before mmu_notifier_unregister it 75 - * must be handled as it's the only way for the driver 76 - * to flush all existing sptes and stop the driver 77 - * from establishing any more sptes before all the 78 - * pages in the mm are freed. 79 - */ 80 - if (mn->ops->release) 81 - mn->ops->release(mn, mm); 82 - rcu_read_unlock(); 83 - spin_lock(&mm->mmu_notifier_mm->lock); 84 49 } 85 50 spin_unlock(&mm->mmu_notifier_mm->lock); 86 51 ··· 285 284 { 286 285 BUG_ON(atomic_read(&mm->mm_count) <= 0); 287 286 288 - spin_lock(&mm->mmu_notifier_mm->lock); 289 287 if (!hlist_unhashed(&mn->hlist)) { 290 - hlist_del_rcu(&mn->hlist); 291 - 292 288 /* 293 289 * RCU here will force exit_mmap to wait ->release to finish 294 290 * before freeing the pages. 
295 291 */ 296 292 rcu_read_lock(); 297 - spin_unlock(&mm->mmu_notifier_mm->lock); 293 + 298 294 /* 299 295 * exit_mmap will block in mmu_notifier_release to 300 296 * guarantee ->release is called before freeing the ··· 300 302 if (mn->ops->release) 301 303 mn->ops->release(mn, mm); 302 304 rcu_read_unlock(); 303 - } else 305 + 306 + spin_lock(&mm->mmu_notifier_mm->lock); 307 + hlist_del_rcu(&mn->hlist); 304 308 spin_unlock(&mm->mmu_notifier_mm->lock); 309 + } 305 310 306 311 /* 307 312 * Wait any running method to finish, of course including
+1 -1
mm/mmzone.c
··· 96 96 for_each_lru(lru) 97 97 INIT_LIST_HEAD(&lruvec->lists[lru]); 98 98 99 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 99 + #ifdef CONFIG_MEMCG 100 100 lruvec->zone = zone; 101 101 #endif 102 102 }
-2
mm/mremap.c
··· 260 260 * If this were a serious issue, we'd add a flag to do_munmap(). 261 261 */ 262 262 hiwater_vm = mm->hiwater_vm; 263 - mm->total_vm += new_len >> PAGE_SHIFT; 264 263 vm_stat_account(mm, vma->vm_flags, vma->vm_file, new_len>>PAGE_SHIFT); 265 264 266 265 if (do_munmap(mm, old_addr, old_len) < 0) { ··· 496 497 goto out; 497 498 } 498 499 499 - mm->total_vm += pages; 500 500 vm_stat_account(mm, vma->vm_flags, vma->vm_file, pages); 501 501 if (vma->vm_flags & VM_LOCKED) { 502 502 mm->locked_vm += pages;
+112 -111
mm/oom_kill.c
··· 288 288 } 289 289 #endif 290 290 291 + enum oom_scan_t oom_scan_process_thread(struct task_struct *task, 292 + unsigned long totalpages, const nodemask_t *nodemask, 293 + bool force_kill) 294 + { 295 + if (task->exit_state) 296 + return OOM_SCAN_CONTINUE; 297 + if (oom_unkillable_task(task, NULL, nodemask)) 298 + return OOM_SCAN_CONTINUE; 299 + 300 + /* 301 + * This task already has access to memory reserves and is being killed. 302 + * Don't allow any other task to have access to the reserves. 303 + */ 304 + if (test_tsk_thread_flag(task, TIF_MEMDIE)) { 305 + if (unlikely(frozen(task))) 306 + __thaw_task(task); 307 + if (!force_kill) 308 + return OOM_SCAN_ABORT; 309 + } 310 + if (!task->mm) 311 + return OOM_SCAN_CONTINUE; 312 + 313 + if (task->flags & PF_EXITING) { 314 + /* 315 + * If task is current and is in the process of releasing memory, 316 + * allow the "kill" to set TIF_MEMDIE, which will allow it to 317 + * access memory reserves. Otherwise, it may stall forever. 318 + * 319 + * The iteration isn't broken here, however, in case other 320 + * threads are found to have already been oom killed. 321 + */ 322 + if (task == current) 323 + return OOM_SCAN_SELECT; 324 + else if (!force_kill) { 325 + /* 326 + * If this task is not being ptraced on exit, then wait 327 + * for it to finish before killing some other task 328 + * unnecessarily. 329 + */ 330 + if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) 331 + return OOM_SCAN_ABORT; 332 + } 333 + } 334 + return OOM_SCAN_OK; 335 + } 336 + 291 337 /* 292 338 * Simple selection loop. We chose the process with the highest 293 - * number of 'points'. We expect the caller will lock the tasklist. 339 + * number of 'points'. 
294 340 * 295 341 * (not docbooked, we don't want this one cluttering up the manual) 296 342 */ 297 343 static struct task_struct *select_bad_process(unsigned int *ppoints, 298 - unsigned long totalpages, struct mem_cgroup *memcg, 299 - const nodemask_t *nodemask, bool force_kill) 344 + unsigned long totalpages, const nodemask_t *nodemask, 345 + bool force_kill) 300 346 { 301 347 struct task_struct *g, *p; 302 348 struct task_struct *chosen = NULL; 303 349 unsigned long chosen_points = 0; 304 350 351 + rcu_read_lock(); 305 352 do_each_thread(g, p) { 306 353 unsigned int points; 307 354 308 - if (p->exit_state) 355 + switch (oom_scan_process_thread(p, totalpages, nodemask, 356 + force_kill)) { 357 + case OOM_SCAN_SELECT: 358 + chosen = p; 359 + chosen_points = ULONG_MAX; 360 + /* fall through */ 361 + case OOM_SCAN_CONTINUE: 309 362 continue; 310 - if (oom_unkillable_task(p, memcg, nodemask)) 311 - continue; 312 - 313 - /* 314 - * This task already has access to memory reserves and is 315 - * being killed. Don't allow any other task access to the 316 - * memory reserve. 317 - * 318 - * Note: this may have a chance of deadlock if it gets 319 - * blocked waiting for another task which itself is waiting 320 - * for memory. Is there a better alternative? 321 - */ 322 - if (test_tsk_thread_flag(p, TIF_MEMDIE)) { 323 - if (unlikely(frozen(p))) 324 - __thaw_task(p); 325 - if (!force_kill) 326 - return ERR_PTR(-1UL); 327 - } 328 - if (!p->mm) 329 - continue; 330 - 331 - if (p->flags & PF_EXITING) { 332 - /* 333 - * If p is the current task and is in the process of 334 - * releasing memory, we allow the "kill" to set 335 - * TIF_MEMDIE, which will allow it to gain access to 336 - * memory reserves. Otherwise, it may stall forever. 337 - * 338 - * The loop isn't broken here, however, in case other 339 - * threads are found to have already been oom killed. 
340 - */ 341 - if (p == current) { 342 - chosen = p; 343 - chosen_points = ULONG_MAX; 344 - } else if (!force_kill) { 345 - /* 346 - * If this task is not being ptraced on exit, 347 - * then wait for it to finish before killing 348 - * some other task unnecessarily. 349 - */ 350 - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) 351 - return ERR_PTR(-1UL); 352 - } 353 - } 354 - 355 - points = oom_badness(p, memcg, nodemask, totalpages); 363 + case OOM_SCAN_ABORT: 364 + rcu_read_unlock(); 365 + return ERR_PTR(-1UL); 366 + case OOM_SCAN_OK: 367 + break; 368 + }; 369 + points = oom_badness(p, NULL, nodemask, totalpages); 356 370 if (points > chosen_points) { 357 371 chosen = p; 358 372 chosen_points = points; 359 373 } 360 374 } while_each_thread(g, p); 375 + if (chosen) 376 + get_task_struct(chosen); 377 + rcu_read_unlock(); 361 378 362 379 *ppoints = chosen_points * 1000 / totalpages; 363 380 return chosen; ··· 388 371 * Dumps the current memory state of all eligible tasks. Tasks not in the same 389 372 * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes 390 373 * are not shown. 391 - * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj 392 - * value, oom_score_adj value, and name. 393 - * 394 - * Call with tasklist_lock read-locked. 374 + * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes, 375 + * swapents, oom_score_adj value, and name. 
395 376 */ 396 377 static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) 397 378 { 398 379 struct task_struct *p; 399 380 struct task_struct *task; 400 381 401 - pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n"); 382 + pr_info("[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name\n"); 383 + rcu_read_lock(); 402 384 for_each_process(p) { 403 385 if (oom_unkillable_task(p, memcg, nodemask)) 404 386 continue; ··· 412 396 continue; 413 397 } 414 398 415 - pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n", 399 + pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5d %s\n", 416 400 task->pid, from_kuid(&init_user_ns, task_uid(task)), 417 401 task->tgid, task->mm->total_vm, get_mm_rss(task->mm), 418 - task_cpu(task), task->signal->oom_adj, 402 + task->mm->nr_ptes, 403 + get_mm_counter(task->mm, MM_SWAPENTS), 419 404 task->signal->oom_score_adj, task->comm); 420 405 task_unlock(task); 421 406 } 407 + rcu_read_unlock(); 422 408 } 423 409 424 410 static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, ··· 441 423 } 442 424 443 425 #define K(x) ((x) << (PAGE_SHIFT-10)) 444 - static void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, 445 - unsigned int points, unsigned long totalpages, 446 - struct mem_cgroup *memcg, nodemask_t *nodemask, 447 - const char *message) 426 + /* 427 + * Must be called while holding a reference to p, which will be released upon 428 + * returning. 429 + */ 430 + void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, 431 + unsigned int points, unsigned long totalpages, 432 + struct mem_cgroup *memcg, nodemask_t *nodemask, 433 + const char *message) 448 434 { 449 435 struct task_struct *victim = p; 450 436 struct task_struct *child; ··· 464 442 */ 465 443 if (p->flags & PF_EXITING) { 466 444 set_tsk_thread_flag(p, TIF_MEMDIE); 445 + put_task_struct(p); 467 446 return; 468 447 } 469 448 ··· 482 459 * parent. 
This attempts to lose the minimal amount of work done while 483 460 * still freeing memory. 484 461 */ 462 + read_lock(&tasklist_lock); 485 463 do { 486 464 list_for_each_entry(child, &t->children, sibling) { 487 465 unsigned int child_points; ··· 495 471 child_points = oom_badness(child, memcg, nodemask, 496 472 totalpages); 497 473 if (child_points > victim_points) { 474 + put_task_struct(victim); 498 475 victim = child; 499 476 victim_points = child_points; 477 + get_task_struct(victim); 500 478 } 501 479 } 502 480 } while_each_thread(p, t); 481 + read_unlock(&tasklist_lock); 503 482 504 - victim = find_lock_task_mm(victim); 505 - if (!victim) 483 + rcu_read_lock(); 484 + p = find_lock_task_mm(victim); 485 + if (!p) { 486 + rcu_read_unlock(); 487 + put_task_struct(victim); 506 488 return; 489 + } else if (victim != p) { 490 + get_task_struct(p); 491 + put_task_struct(victim); 492 + victim = p; 493 + } 507 494 508 495 /* mm cannot safely be dereferenced after task_unlock(victim) */ 509 496 mm = victim->mm; ··· 545 510 task_unlock(p); 546 511 do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); 547 512 } 513 + rcu_read_unlock(); 548 514 549 515 set_tsk_thread_flag(victim, TIF_MEMDIE); 550 516 do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); 517 + put_task_struct(victim); 551 518 } 552 519 #undef K 553 520 554 521 /* 555 522 * Determines whether the kernel must panic because of the panic_on_oom sysctl. 
556 523 */ 557 - static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 558 - int order, const nodemask_t *nodemask) 524 + void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask, 525 + int order, const nodemask_t *nodemask) 559 526 { 560 527 if (likely(!sysctl_panic_on_oom)) 561 528 return; ··· 570 533 if (constraint != CONSTRAINT_NONE) 571 534 return; 572 535 } 573 - read_lock(&tasklist_lock); 574 536 dump_header(NULL, gfp_mask, order, NULL, nodemask); 575 - read_unlock(&tasklist_lock); 576 537 panic("Out of memory: %s panic_on_oom is enabled\n", 577 538 sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide"); 578 539 } 579 - 580 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 581 - void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, 582 - int order) 583 - { 584 - unsigned long limit; 585 - unsigned int points = 0; 586 - struct task_struct *p; 587 - 588 - /* 589 - * If current has a pending SIGKILL, then automatically select it. The 590 - * goal is to allow it to allocate so that it may quickly exit and free 591 - * its memory. 592 - */ 593 - if (fatal_signal_pending(current)) { 594 - set_thread_flag(TIF_MEMDIE); 595 - return; 596 - } 597 - 598 - check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL); 599 - limit = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? 
: 1; 600 - read_lock(&tasklist_lock); 601 - p = select_bad_process(&points, limit, memcg, NULL, false); 602 - if (p && PTR_ERR(p) != -1UL) 603 - oom_kill_process(p, gfp_mask, order, points, limit, memcg, NULL, 604 - "Memory cgroup out of memory"); 605 - read_unlock(&tasklist_lock); 606 - } 607 - #endif 608 540 609 541 static BLOCKING_NOTIFIER_HEAD(oom_notify_list); 610 542 ··· 696 690 struct task_struct *p; 697 691 unsigned long totalpages; 698 692 unsigned long freed = 0; 699 - unsigned int points; 693 + unsigned int uninitialized_var(points); 700 694 enum oom_constraint constraint = CONSTRAINT_NONE; 701 695 int killed = 0; 702 696 ··· 724 718 mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL; 725 719 check_panic_on_oom(constraint, gfp_mask, order, mpol_mask); 726 720 727 - read_lock(&tasklist_lock); 728 - if (sysctl_oom_kill_allocating_task && 721 + if (sysctl_oom_kill_allocating_task && current->mm && 729 722 !oom_unkillable_task(current, NULL, nodemask) && 730 - current->mm) { 723 + current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { 724 + get_task_struct(current); 731 725 oom_kill_process(current, gfp_mask, order, 0, totalpages, NULL, 732 726 nodemask, 733 727 "Out of memory (oom_kill_allocating_task)"); 734 728 goto out; 735 729 } 736 730 737 - p = select_bad_process(&points, totalpages, NULL, mpol_mask, 738 - force_kill); 731 + p = select_bad_process(&points, totalpages, mpol_mask, force_kill); 739 732 /* Found nothing?!?! Either we hang forever, or we panic. 
*/ 740 733 if (!p) { 741 734 dump_header(NULL, gfp_mask, order, NULL, mpol_mask); 742 - read_unlock(&tasklist_lock); 743 735 panic("Out of memory and no killable processes...\n"); 744 736 } 745 737 if (PTR_ERR(p) != -1UL) { ··· 746 742 killed = 1; 747 743 } 748 744 out: 749 - read_unlock(&tasklist_lock); 750 - 751 745 /* 752 - * Give "p" a good chance of killing itself before we 753 - * retry to allocate memory unless "p" is current 746 + * Give the killed threads a good chance of exiting before trying to 747 + * allocate memory again. 754 748 */ 755 - if (killed && !test_thread_flag(TIF_MEMDIE)) 756 - schedule_timeout_uninterruptible(1); 749 + if (killed) 750 + schedule_timeout_killable(1); 757 751 } 758 752 759 753 /* ··· 766 764 out_of_memory(NULL, 0, 0, NULL, false); 767 765 clear_system_oom(); 768 766 } 769 - if (!test_thread_flag(TIF_MEMDIE)) 770 - schedule_timeout_uninterruptible(1); 767 + schedule_timeout_killable(1); 771 768 }
+171 -147
mm/page_alloc.c
··· 51 51 #include <linux/page_cgroup.h> 52 52 #include <linux/debugobjects.h> 53 53 #include <linux/kmemleak.h> 54 - #include <linux/memory.h> 55 54 #include <linux/compaction.h> 56 55 #include <trace/events/kmem.h> 57 56 #include <linux/ftrace_event.h> ··· 218 219 219 220 int page_group_by_mobility_disabled __read_mostly; 220 221 221 - static void set_pageblock_migratetype(struct page *page, int migratetype) 222 + /* 223 + * NOTE: 224 + * Don't use set_pageblock_migratetype(page, MIGRATE_ISOLATE) directly. 225 + * Instead, use {un}set_pageblock_isolate. 226 + */ 227 + void set_pageblock_migratetype(struct page *page, int migratetype) 222 228 { 223 229 224 230 if (unlikely(page_group_by_mobility_disabled)) ··· 958 954 return pages_moved; 959 955 } 960 956 961 - static int move_freepages_block(struct zone *zone, struct page *page, 957 + int move_freepages_block(struct zone *zone, struct page *page, 962 958 int migratetype) 963 959 { 964 960 unsigned long start_pfn, end_pfn; ··· 1162 1158 to_drain = pcp->batch; 1163 1159 else 1164 1160 to_drain = pcp->count; 1165 - free_pcppages_bulk(zone, to_drain, pcp); 1166 - pcp->count -= to_drain; 1161 + if (to_drain > 0) { 1162 + free_pcppages_bulk(zone, to_drain, pcp); 1163 + pcp->count -= to_drain; 1164 + } 1167 1165 local_irq_restore(flags); 1168 1166 } 1169 1167 #endif ··· 1535 1529 } 1536 1530 __setup("fail_page_alloc=", setup_fail_page_alloc); 1537 1531 1538 - static int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 1532 + static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 1539 1533 { 1540 1534 if (order < fail_page_alloc.min_order) 1541 - return 0; 1535 + return false; 1542 1536 if (gfp_mask & __GFP_NOFAIL) 1543 - return 0; 1537 + return false; 1544 1538 if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM)) 1545 - return 0; 1539 + return false; 1546 1540 if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT)) 1547 - return 0; 1541 + return false; 1548 1542 1549 
1543 return should_fail(&fail_page_alloc.attr, 1 << order); 1550 1544 } ··· 1584 1578 1585 1579 #else /* CONFIG_FAIL_PAGE_ALLOC */ 1586 1580 1587 - static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 1581 + static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) 1588 1582 { 1589 - return 0; 1583 + return false; 1590 1584 } 1591 1585 1592 1586 #endif /* CONFIG_FAIL_PAGE_ALLOC */ ··· 1600 1594 { 1601 1595 /* free_pages my go negative - that's OK */ 1602 1596 long min = mark; 1597 + long lowmem_reserve = z->lowmem_reserve[classzone_idx]; 1603 1598 int o; 1604 1599 1605 1600 free_pages -= (1 << order) - 1; ··· 1609 1602 if (alloc_flags & ALLOC_HARDER) 1610 1603 min -= min / 4; 1611 1604 1612 - if (free_pages <= min + z->lowmem_reserve[classzone_idx]) 1605 + if (free_pages <= min + lowmem_reserve) 1613 1606 return false; 1614 1607 for (o = 0; o < order; o++) { 1615 1608 /* At the next order, this order's pages become unavailable */ ··· 1623 1616 } 1624 1617 return true; 1625 1618 } 1619 + 1620 + #ifdef CONFIG_MEMORY_ISOLATION 1621 + static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) 1622 + { 1623 + if (unlikely(zone->nr_pageblock_isolate)) 1624 + return zone->nr_pageblock_isolate * pageblock_nr_pages; 1625 + return 0; 1626 + } 1627 + #else 1628 + static inline unsigned long nr_zone_isolate_freepages(struct zone *zone) 1629 + { 1630 + return 0; 1631 + } 1632 + #endif 1626 1633 1627 1634 bool zone_watermark_ok(struct zone *z, int order, unsigned long mark, 1628 1635 int classzone_idx, int alloc_flags) ··· 1653 1632 if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark) 1654 1633 free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES); 1655 1634 1635 + /* 1636 + * If the zone has MIGRATE_ISOLATE type free pages, we should consider 1637 + * it. nr_zone_isolate_freepages is never accurate so kswapd might not 1638 + * sleep although it could do so. 
But this is more desirable for memory 1639 + * hotplug than sleeping which can cause a livelock in the direct 1640 + * reclaim path. 1641 + */ 1642 + free_pages -= nr_zone_isolate_freepages(z); 1656 1643 return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags, 1657 1644 free_pages); 1658 1645 } ··· 2116 2087 2117 2088 page = get_page_from_freelist(gfp_mask, nodemask, 2118 2089 order, zonelist, high_zoneidx, 2119 - alloc_flags, preferred_zone, 2120 - migratetype); 2090 + alloc_flags & ~ALLOC_NO_WATERMARKS, 2091 + preferred_zone, migratetype); 2121 2092 if (page) { 2122 2093 preferred_zone->compact_considered = 0; 2123 2094 preferred_zone->compact_defer_shift = 0; ··· 2209 2180 retry: 2210 2181 page = get_page_from_freelist(gfp_mask, nodemask, order, 2211 2182 zonelist, high_zoneidx, 2212 - alloc_flags, preferred_zone, 2213 - migratetype); 2183 + alloc_flags & ~ALLOC_NO_WATERMARKS, 2184 + preferred_zone, migratetype); 2214 2185 2215 2186 /* 2216 2187 * If an allocation failed after direct reclaim, it could be because ··· 2294 2265 alloc_flags |= ALLOC_HARDER; 2295 2266 2296 2267 if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) { 2297 - if (!in_interrupt() && 2298 - ((current->flags & PF_MEMALLOC) || 2299 - unlikely(test_thread_flag(TIF_MEMDIE)))) 2268 + if (gfp_mask & __GFP_MEMALLOC) 2269 + alloc_flags |= ALLOC_NO_WATERMARKS; 2270 + else if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) 2271 + alloc_flags |= ALLOC_NO_WATERMARKS; 2272 + else if (!in_interrupt() && 2273 + ((current->flags & PF_MEMALLOC) || 2274 + unlikely(test_thread_flag(TIF_MEMDIE)))) 2300 2275 alloc_flags |= ALLOC_NO_WATERMARKS; 2301 2276 } 2302 2277 2303 2278 return alloc_flags; 2279 + } 2280 + 2281 + bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) 2282 + { 2283 + return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS); 2304 2284 } 2305 2285 2306 2286 static inline struct page * ··· 2378 2340 2379 2341 /* Allocate without watermarks if the context allows */ 2380 2342 if 
(alloc_flags & ALLOC_NO_WATERMARKS) { 2343 + /* 2344 + * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds 2345 + * the allocation is high priority and these type of 2346 + * allocations are system rather than user orientated 2347 + */ 2348 + zonelist = node_zonelist(numa_node_id(), gfp_mask); 2349 + 2381 2350 page = __alloc_pages_high_priority(gfp_mask, order, 2382 2351 zonelist, high_zoneidx, nodemask, 2383 2352 preferred_zone, migratetype); 2384 - if (page) 2353 + if (page) { 2354 + /* 2355 + * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was 2356 + * necessary to allocate the page. The expectation is 2357 + * that the caller is taking steps that will free more 2358 + * memory. The caller should avoid the page being used 2359 + * for !PFMEMALLOC purposes. 2360 + */ 2361 + page->pfmemalloc = true; 2385 2362 goto got_pg; 2363 + } 2386 2364 } 2387 2365 2388 2366 /* Atomic allocations - we can't balance anything */ ··· 2517 2463 got_pg: 2518 2464 if (kmemcheck_enabled) 2519 2465 kmemcheck_pagealloc_alloc(page, order, gfp_mask); 2520 - return page; 2521 2466 2467 + return page; 2522 2468 } 2523 2469 2524 2470 /* ··· 2569 2515 page = __alloc_pages_slowpath(gfp_mask, order, 2570 2516 zonelist, high_zoneidx, nodemask, 2571 2517 preferred_zone, migratetype); 2518 + else 2519 + page->pfmemalloc = false; 2572 2520 2573 2521 trace_mm_page_alloc(page, order, gfp_mask, migratetype); 2574 2522 ··· 3086 3030 user_zonelist_order = oldval; 3087 3031 } else if (oldval != user_zonelist_order) { 3088 3032 mutex_lock(&zonelists_mutex); 3089 - build_all_zonelists(NULL); 3033 + build_all_zonelists(NULL, NULL); 3090 3034 mutex_unlock(&zonelists_mutex); 3091 3035 } 3092 3036 } ··· 3465 3409 DEFINE_MUTEX(zonelists_mutex); 3466 3410 3467 3411 /* return values int ....just for stop_machine() */ 3468 - static __init_refok int __build_all_zonelists(void *data) 3412 + static int __build_all_zonelists(void *data) 3469 3413 { 3470 3414 int nid; 3471 3415 int cpu; 3416 + pg_data_t 
*self = data; 3472 3417 3473 3418 #ifdef CONFIG_NUMA 3474 3419 memset(node_load, 0, sizeof(node_load)); 3475 3420 #endif 3421 + 3422 + if (self && !node_online(self->node_id)) { 3423 + build_zonelists(self); 3424 + build_zonelist_cache(self); 3425 + } 3426 + 3476 3427 for_each_online_node(nid) { 3477 3428 pg_data_t *pgdat = NODE_DATA(nid); 3478 3429 ··· 3524 3461 * Called with zonelists_mutex held always 3525 3462 * unless system_state == SYSTEM_BOOTING. 3526 3463 */ 3527 - void __ref build_all_zonelists(void *data) 3464 + void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone) 3528 3465 { 3529 3466 set_zonelist_order(); 3530 3467 ··· 3536 3473 /* we have to stop all cpus to guarantee there is no user 3537 3474 of zonelist */ 3538 3475 #ifdef CONFIG_MEMORY_HOTPLUG 3539 - if (data) 3540 - setup_zone_pageset((struct zone *)data); 3476 + if (zone) 3477 + setup_zone_pageset(zone); 3541 3478 #endif 3542 - stop_machine(__build_all_zonelists, NULL, NULL); 3479 + stop_machine(__build_all_zonelists, pgdat, NULL); 3543 3480 /* cpuset refresh routine should be here */ 3544 3481 } 3545 3482 vm_total_pages = nr_free_pagecache_pages(); ··· 3809 3746 memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY) 3810 3747 #endif 3811 3748 3812 - static int zone_batchsize(struct zone *zone) 3749 + static int __meminit zone_batchsize(struct zone *zone) 3813 3750 { 3814 3751 #ifdef CONFIG_MMU 3815 3752 int batch; ··· 3891 3828 pcp->batch = PAGE_SHIFT * 8; 3892 3829 } 3893 3830 3894 - static void setup_zone_pageset(struct zone *zone) 3831 + static void __meminit setup_zone_pageset(struct zone *zone) 3895 3832 { 3896 3833 int cpu; 3897 3834 ··· 3964 3901 return 0; 3965 3902 } 3966 3903 3967 - static int __zone_pcp_update(void *data) 3968 - { 3969 - struct zone *zone = data; 3970 - int cpu; 3971 - unsigned long batch = zone_batchsize(zone), flags; 3972 - 3973 - for_each_possible_cpu(cpu) { 3974 - struct per_cpu_pageset *pset; 3975 - struct per_cpu_pages *pcp; 3976 - 
3977 - pset = per_cpu_ptr(zone->pageset, cpu); 3978 - pcp = &pset->pcp; 3979 - 3980 - local_irq_save(flags); 3981 - free_pcppages_bulk(zone, pcp->count, pcp); 3982 - setup_pageset(pset, batch); 3983 - local_irq_restore(flags); 3984 - } 3985 - return 0; 3986 - } 3987 - 3988 - void zone_pcp_update(struct zone *zone) 3989 - { 3990 - stop_machine(__zone_pcp_update, zone, NULL); 3991 - } 3992 - 3993 3904 static __meminit void zone_pcp_init(struct zone *zone) 3994 3905 { 3995 3906 /* ··· 3979 3942 zone_batchsize(zone)); 3980 3943 } 3981 3944 3982 - __meminit int init_currently_empty_zone(struct zone *zone, 3945 + int __meminit init_currently_empty_zone(struct zone *zone, 3983 3946 unsigned long zone_start_pfn, 3984 3947 unsigned long size, 3985 3948 enum memmap_context context) ··· 4338 4301 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE 4339 4302 4340 4303 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */ 4341 - static inline void __init set_pageblock_order(void) 4304 + void __init set_pageblock_order(void) 4342 4305 { 4343 4306 unsigned int order; 4344 4307 ··· 4366 4329 * include/linux/pageblock-flags.h for the values of pageblock_order based on 4367 4330 * the kernel config 4368 4331 */ 4369 - static inline void set_pageblock_order(void) 4332 + void __init set_pageblock_order(void) 4370 4333 { 4371 4334 } 4372 4335 ··· 4377 4340 * - mark all pages reserved 4378 4341 * - mark all memory queues empty 4379 4342 * - clear the memory bitmaps 4343 + * 4344 + * NOTE: pgdat should get zeroed by caller. 
4380 4345 */ 4381 4346 static void __paginginit free_area_init_core(struct pglist_data *pgdat, 4382 4347 unsigned long *zones_size, unsigned long *zholes_size) ··· 4389 4350 int ret; 4390 4351 4391 4352 pgdat_resize_init(pgdat); 4392 - pgdat->nr_zones = 0; 4393 4353 init_waitqueue_head(&pgdat->kswapd_wait); 4394 - pgdat->kswapd_max_order = 0; 4354 + init_waitqueue_head(&pgdat->pfmemalloc_wait); 4395 4355 pgdat_page_cgroup_init(pgdat); 4396 4356 4397 4357 for (j = 0; j < MAX_NR_ZONES; j++) { ··· 4432 4394 4433 4395 zone->spanned_pages = size; 4434 4396 zone->present_pages = realsize; 4397 + #if defined CONFIG_COMPACTION || defined CONFIG_CMA 4398 + zone->compact_cached_free_pfn = zone->zone_start_pfn + 4399 + zone->spanned_pages; 4400 + zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1); 4401 + #endif 4435 4402 #ifdef CONFIG_NUMA 4436 4403 zone->node = nid; 4437 4404 zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) ··· 4451 4408 4452 4409 zone_pcp_init(zone); 4453 4410 lruvec_init(&zone->lruvec, zone); 4454 - zap_zone_vm_stats(zone); 4455 - zone->flags = 0; 4456 4411 if (!size) 4457 4412 continue; 4458 4413 ··· 4509 4468 unsigned long node_start_pfn, unsigned long *zholes_size) 4510 4469 { 4511 4470 pg_data_t *pgdat = NODE_DATA(nid); 4471 + 4472 + /* pg_data_t should be reset to zero when it's allocated */ 4473 + WARN_ON(pgdat->nr_zones || pgdat->node_start_pfn || pgdat->classzone_idx); 4512 4474 4513 4475 pgdat->node_id = nid; 4514 4476 pgdat->node_start_pfn = node_start_pfn; ··· 4794 4750 } 4795 4751 4796 4752 /* Any regular memory on that node ? */ 4797 - static void check_for_regular_memory(pg_data_t *pgdat) 4753 + static void __init check_for_regular_memory(pg_data_t *pgdat) 4798 4754 { 4799 4755 #ifdef CONFIG_HIGHMEM 4800 4756 enum zone_type zone_type; ··· 5512 5468 } 5513 5469 5514 5470 /* 5515 - * This is designed as sub function...plz see page_isolation.c also. 5516 - * set/clear page block's type to be ISOLATE. 
5517 - * page allocater never alloc memory from ISOLATE block. 5471 + * This function checks whether pageblock includes unmovable pages or not. 5472 + * If @count is not zero, it is okay to include less @count unmovable pages 5473 + * 5474 + * PageLRU check wihtout isolation or lru_lock could race so that 5475 + * MIGRATE_MOVABLE block might include unmovable pages. It means you can't 5476 + * expect this function should be exact. 5518 5477 */ 5519 - 5520 - static int 5521 - __count_immobile_pages(struct zone *zone, struct page *page, int count) 5478 + bool has_unmovable_pages(struct zone *zone, struct page *page, int count) 5522 5479 { 5523 5480 unsigned long pfn, iter, found; 5524 5481 int mt; 5525 5482 5526 5483 /* 5527 5484 * For avoiding noise data, lru_add_drain_all() should be called 5528 - * If ZONE_MOVABLE, the zone never contains immobile pages 5485 + * If ZONE_MOVABLE, the zone never contains unmovable pages 5529 5486 */ 5530 5487 if (zone_idx(zone) == ZONE_MOVABLE) 5531 - return true; 5488 + return false; 5532 5489 mt = get_pageblock_migratetype(page); 5533 5490 if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt)) 5534 - return true; 5491 + return false; 5535 5492 5536 5493 pfn = page_to_pfn(page); 5537 5494 for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) { ··· 5542 5497 continue; 5543 5498 5544 5499 page = pfn_to_page(check); 5545 - if (!page_count(page)) { 5500 + /* 5501 + * We can't use page_count without pin a page 5502 + * because another CPU can free compound page. 5503 + * This check already skips compound tails of THP 5504 + * because their page->_count is zero at all time. 5505 + */ 5506 + if (!atomic_read(&page->_count)) { 5546 5507 if (PageBuddy(page)) 5547 5508 iter += (1 << page_order(page)) - 1; 5548 5509 continue; 5549 5510 } 5511 + 5550 5512 if (!PageLRU(page)) 5551 5513 found++; 5552 5514 /* ··· 5570 5518 * page at boot. 
5571 5519 */ 5572 5520 if (found > count) 5573 - return false; 5521 + return true; 5574 5522 } 5575 - return true; 5523 + return false; 5576 5524 } 5577 5525 5578 5526 bool is_pageblock_removable_nolock(struct page *page) ··· 5596 5544 zone->zone_start_pfn + zone->spanned_pages <= pfn) 5597 5545 return false; 5598 5546 5599 - return __count_immobile_pages(zone, page, 0); 5600 - } 5601 - 5602 - int set_migratetype_isolate(struct page *page) 5603 - { 5604 - struct zone *zone; 5605 - unsigned long flags, pfn; 5606 - struct memory_isolate_notify arg; 5607 - int notifier_ret; 5608 - int ret = -EBUSY; 5609 - 5610 - zone = page_zone(page); 5611 - 5612 - spin_lock_irqsave(&zone->lock, flags); 5613 - 5614 - pfn = page_to_pfn(page); 5615 - arg.start_pfn = pfn; 5616 - arg.nr_pages = pageblock_nr_pages; 5617 - arg.pages_found = 0; 5618 - 5619 - /* 5620 - * It may be possible to isolate a pageblock even if the 5621 - * migratetype is not MIGRATE_MOVABLE. The memory isolation 5622 - * notifier chain is used by balloon drivers to return the 5623 - * number of pages in a range that are held by the balloon 5624 - * driver to shrink memory. If all the pages are accounted for 5625 - * by balloons, are free, or on the LRU, isolation can continue. 5626 - * Later, for example, when memory hotplug notifier runs, these 5627 - * pages reported as "can be isolated" should be isolated(freed) 5628 - * by the balloon driver through the memory notifier chain. 5629 - */ 5630 - notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); 5631 - notifier_ret = notifier_to_errno(notifier_ret); 5632 - if (notifier_ret) 5633 - goto out; 5634 - /* 5635 - * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. 5636 - * We just check MOVABLE pages. 5637 - */ 5638 - if (__count_immobile_pages(zone, page, arg.pages_found)) 5639 - ret = 0; 5640 - 5641 - /* 5642 - * immobile means "not-on-lru" paes. If immobile is larger than 5643 - * removable-by-driver pages reported by notifier, we'll fail. 
5644 - */ 5645 - 5646 - out: 5647 - if (!ret) { 5648 - set_pageblock_migratetype(page, MIGRATE_ISOLATE); 5649 - move_freepages_block(zone, page, MIGRATE_ISOLATE); 5650 - } 5651 - 5652 - spin_unlock_irqrestore(&zone->lock, flags); 5653 - if (!ret) 5654 - drain_all_pages(); 5655 - return ret; 5656 - } 5657 - 5658 - void unset_migratetype_isolate(struct page *page, unsigned migratetype) 5659 - { 5660 - struct zone *zone; 5661 - unsigned long flags; 5662 - zone = page_zone(page); 5663 - spin_lock_irqsave(&zone->lock, flags); 5664 - if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) 5665 - goto out; 5666 - set_pageblock_migratetype(page, migratetype); 5667 - move_freepages_block(zone, page, migratetype); 5668 - out: 5669 - spin_unlock_irqrestore(&zone->lock, flags); 5547 + return !has_unmovable_pages(zone, page, 0); 5670 5548 } 5671 5549 5672 5550 #ifdef CONFIG_CMA ··· 5851 5869 } 5852 5870 #endif 5853 5871 5872 + #ifdef CONFIG_MEMORY_HOTPLUG 5873 + static int __meminit __zone_pcp_update(void *data) 5874 + { 5875 + struct zone *zone = data; 5876 + int cpu; 5877 + unsigned long batch = zone_batchsize(zone), flags; 5878 + 5879 + for_each_possible_cpu(cpu) { 5880 + struct per_cpu_pageset *pset; 5881 + struct per_cpu_pages *pcp; 5882 + 5883 + pset = per_cpu_ptr(zone->pageset, cpu); 5884 + pcp = &pset->pcp; 5885 + 5886 + local_irq_save(flags); 5887 + if (pcp->count > 0) 5888 + free_pcppages_bulk(zone, pcp->count, pcp); 5889 + setup_pageset(pset, batch); 5890 + local_irq_restore(flags); 5891 + } 5892 + return 0; 5893 + } 5894 + 5895 + void __meminit zone_pcp_update(struct zone *zone) 5896 + { 5897 + stop_machine(__zone_pcp_update, zone, NULL); 5898 + } 5899 + #endif 5900 + 5854 5901 #ifdef CONFIG_MEMORY_HOTREMOVE 5902 + void zone_pcp_reset(struct zone *zone) 5903 + { 5904 + unsigned long flags; 5905 + 5906 + /* avoid races with drain_pages() */ 5907 + local_irq_save(flags); 5908 + if (zone->pageset != &boot_pageset) { 5909 + free_percpu(zone->pageset); 5910 + 
zone->pageset = &boot_pageset; 5911 + } 5912 + local_irq_restore(flags); 5913 + } 5914 + 5855 5915 /* 5856 5916 * All pages in the range must be isolated before calling this. 5857 5917 */
+1 -1
mm/page_cgroup.c
··· 317 317 #endif 318 318 319 319 320 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP 320 + #ifdef CONFIG_MEMCG_SWAP 321 321 322 322 static DEFINE_MUTEX(swap_cgroup_mutex); 323 323 struct swap_cgroup_ctrl {
+145
mm/page_io.c
··· 17 17 #include <linux/swap.h> 18 18 #include <linux/bio.h> 19 19 #include <linux/swapops.h> 20 + #include <linux/buffer_head.h> 20 21 #include <linux/writeback.h> 21 22 #include <linux/frontswap.h> 22 23 #include <asm/pgtable.h> ··· 87 86 bio_put(bio); 88 87 } 89 88 89 + int generic_swapfile_activate(struct swap_info_struct *sis, 90 + struct file *swap_file, 91 + sector_t *span) 92 + { 93 + struct address_space *mapping = swap_file->f_mapping; 94 + struct inode *inode = mapping->host; 95 + unsigned blocks_per_page; 96 + unsigned long page_no; 97 + unsigned blkbits; 98 + sector_t probe_block; 99 + sector_t last_block; 100 + sector_t lowest_block = -1; 101 + sector_t highest_block = 0; 102 + int nr_extents = 0; 103 + int ret; 104 + 105 + blkbits = inode->i_blkbits; 106 + blocks_per_page = PAGE_SIZE >> blkbits; 107 + 108 + /* 109 + * Map all the blocks into the extent list. This code doesn't try 110 + * to be very smart. 111 + */ 112 + probe_block = 0; 113 + page_no = 0; 114 + last_block = i_size_read(inode) >> blkbits; 115 + while ((probe_block + blocks_per_page) <= last_block && 116 + page_no < sis->max) { 117 + unsigned block_in_page; 118 + sector_t first_block; 119 + 120 + first_block = bmap(inode, probe_block); 121 + if (first_block == 0) 122 + goto bad_bmap; 123 + 124 + /* 125 + * It must be PAGE_SIZE aligned on-disk 126 + */ 127 + if (first_block & (blocks_per_page - 1)) { 128 + probe_block++; 129 + goto reprobe; 130 + } 131 + 132 + for (block_in_page = 1; block_in_page < blocks_per_page; 133 + block_in_page++) { 134 + sector_t block; 135 + 136 + block = bmap(inode, probe_block + block_in_page); 137 + if (block == 0) 138 + goto bad_bmap; 139 + if (block != first_block + block_in_page) { 140 + /* Discontiguity */ 141 + probe_block++; 142 + goto reprobe; 143 + } 144 + } 145 + 146 + first_block >>= (PAGE_SHIFT - blkbits); 147 + if (page_no) { /* exclude the header page */ 148 + if (first_block < lowest_block) 149 + lowest_block = first_block; 150 + if 
(first_block > highest_block) 151 + highest_block = first_block; 152 + } 153 + 154 + /* 155 + * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks 156 + */ 157 + ret = add_swap_extent(sis, page_no, 1, first_block); 158 + if (ret < 0) 159 + goto out; 160 + nr_extents += ret; 161 + page_no++; 162 + probe_block += blocks_per_page; 163 + reprobe: 164 + continue; 165 + } 166 + ret = nr_extents; 167 + *span = 1 + highest_block - lowest_block; 168 + if (page_no == 0) 169 + page_no = 1; /* force Empty message */ 170 + sis->max = page_no; 171 + sis->pages = page_no - 1; 172 + sis->highest_bit = page_no - 1; 173 + out: 174 + return ret; 175 + bad_bmap: 176 + printk(KERN_ERR "swapon: swapfile has holes\n"); 177 + ret = -EINVAL; 178 + goto out; 179 + } 180 + 90 181 /* 91 182 * We may have stale swap cache pages in memory: notice 92 183 * them here and get rid of the unnecessary final write. ··· 187 94 { 188 95 struct bio *bio; 189 96 int ret = 0, rw = WRITE; 97 + struct swap_info_struct *sis = page_swap_info(page); 190 98 191 99 if (try_to_free_swap(page)) { 192 100 unlock_page(page); ··· 199 105 end_page_writeback(page); 200 106 goto out; 201 107 } 108 + 109 + if (sis->flags & SWP_FILE) { 110 + struct kiocb kiocb; 111 + struct file *swap_file = sis->swap_file; 112 + struct address_space *mapping = swap_file->f_mapping; 113 + struct iovec iov = { 114 + .iov_base = kmap(page), 115 + .iov_len = PAGE_SIZE, 116 + }; 117 + 118 + init_sync_kiocb(&kiocb, swap_file); 119 + kiocb.ki_pos = page_file_offset(page); 120 + kiocb.ki_left = PAGE_SIZE; 121 + kiocb.ki_nbytes = PAGE_SIZE; 122 + 123 + unlock_page(page); 124 + ret = mapping->a_ops->direct_IO(KERNEL_WRITE, 125 + &kiocb, &iov, 126 + kiocb.ki_pos, 1); 127 + kunmap(page); 128 + if (ret == PAGE_SIZE) { 129 + count_vm_event(PSWPOUT); 130 + ret = 0; 131 + } 132 + return ret; 133 + } 134 + 202 135 bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write); 203 136 if (bio == NULL) { 204 137 set_page_dirty(page); ··· 247 126 { 248 
127 struct bio *bio; 249 128 int ret = 0; 129 + struct swap_info_struct *sis = page_swap_info(page); 250 130 251 131 VM_BUG_ON(!PageLocked(page)); 252 132 VM_BUG_ON(PageUptodate(page)); ··· 256 134 unlock_page(page); 257 135 goto out; 258 136 } 137 + 138 + if (sis->flags & SWP_FILE) { 139 + struct file *swap_file = sis->swap_file; 140 + struct address_space *mapping = swap_file->f_mapping; 141 + 142 + ret = mapping->a_ops->readpage(swap_file, page); 143 + if (!ret) 144 + count_vm_event(PSWPIN); 145 + return ret; 146 + } 147 + 259 148 bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read); 260 149 if (bio == NULL) { 261 150 unlock_page(page); ··· 277 144 submit_bio(READ, bio); 278 145 out: 279 146 return ret; 147 + } 148 + 149 + int swap_set_page_dirty(struct page *page) 150 + { 151 + struct swap_info_struct *sis = page_swap_info(page); 152 + 153 + if (sis->flags & SWP_FILE) { 154 + struct address_space *mapping = sis->swap_file->f_mapping; 155 + return mapping->a_ops->set_page_dirty(page); 156 + } else { 157 + return __set_page_dirty_no_writeback(page); 158 + } 280 159 }
+93
mm/page_isolation.c
··· 5 5 #include <linux/mm.h> 6 6 #include <linux/page-isolation.h> 7 7 #include <linux/pageblock-flags.h> 8 + #include <linux/memory.h> 8 9 #include "internal.h" 10 + 11 + /* called while holding zone->lock */ 12 + static void set_pageblock_isolate(struct page *page) 13 + { 14 + if (get_pageblock_migratetype(page) == MIGRATE_ISOLATE) 15 + return; 16 + 17 + set_pageblock_migratetype(page, MIGRATE_ISOLATE); 18 + page_zone(page)->nr_pageblock_isolate++; 19 + } 20 + 21 + /* called while holding zone->lock */ 22 + static void restore_pageblock_isolate(struct page *page, int migratetype) 23 + { 24 + struct zone *zone = page_zone(page); 25 + if (WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE)) 26 + return; 27 + 28 + BUG_ON(zone->nr_pageblock_isolate <= 0); 29 + set_pageblock_migratetype(page, migratetype); 30 + zone->nr_pageblock_isolate--; 31 + } 32 + 33 + int set_migratetype_isolate(struct page *page) 34 + { 35 + struct zone *zone; 36 + unsigned long flags, pfn; 37 + struct memory_isolate_notify arg; 38 + int notifier_ret; 39 + int ret = -EBUSY; 40 + 41 + zone = page_zone(page); 42 + 43 + spin_lock_irqsave(&zone->lock, flags); 44 + 45 + pfn = page_to_pfn(page); 46 + arg.start_pfn = pfn; 47 + arg.nr_pages = pageblock_nr_pages; 48 + arg.pages_found = 0; 49 + 50 + /* 51 + * It may be possible to isolate a pageblock even if the 52 + * migratetype is not MIGRATE_MOVABLE. The memory isolation 53 + * notifier chain is used by balloon drivers to return the 54 + * number of pages in a range that are held by the balloon 55 + * driver to shrink memory. If all the pages are accounted for 56 + * by balloons, are free, or on the LRU, isolation can continue. 57 + * Later, for example, when memory hotplug notifier runs, these 58 + * pages reported as "can be isolated" should be isolated(freed) 59 + * by the balloon driver through the memory notifier chain. 
60 + */ 61 + notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg); 62 + notifier_ret = notifier_to_errno(notifier_ret); 63 + if (notifier_ret) 64 + goto out; 65 + /* 66 + * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself. 67 + * We just check MOVABLE pages. 68 + */ 69 + if (!has_unmovable_pages(zone, page, arg.pages_found)) 70 + ret = 0; 71 + 72 + /* 73 + * immobile means "not-on-lru" paes. If immobile is larger than 74 + * removable-by-driver pages reported by notifier, we'll fail. 75 + */ 76 + 77 + out: 78 + if (!ret) { 79 + set_pageblock_isolate(page); 80 + move_freepages_block(zone, page, MIGRATE_ISOLATE); 81 + } 82 + 83 + spin_unlock_irqrestore(&zone->lock, flags); 84 + if (!ret) 85 + drain_all_pages(); 86 + return ret; 87 + } 88 + 89 + void unset_migratetype_isolate(struct page *page, unsigned migratetype) 90 + { 91 + struct zone *zone; 92 + unsigned long flags; 93 + zone = page_zone(page); 94 + spin_lock_irqsave(&zone->lock, flags); 95 + if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE) 96 + goto out; 97 + move_freepages_block(zone, page, migratetype); 98 + restore_pageblock_isolate(page, migratetype); 99 + out: 100 + spin_unlock_irqrestore(&zone->lock, flags); 101 + } 9 102 10 103 static inline struct page * 11 104 __first_valid_page(unsigned long pfn, unsigned long nr_pages)
+4 -2
mm/shmem.c
··· 929 929 930 930 /* Create a pseudo vma that just contains the policy */ 931 931 pvma.vm_start = 0; 932 - pvma.vm_pgoff = index; 932 + /* Bias interleave by inode number to distribute better across nodes */ 933 + pvma.vm_pgoff = index + info->vfs_inode.i_ino; 933 934 pvma.vm_ops = NULL; 934 935 pvma.vm_policy = spol; 935 936 return swapin_readahead(swap, gfp, &pvma, 0); ··· 943 942 944 943 /* Create a pseudo vma that just contains the policy */ 945 944 pvma.vm_start = 0; 946 - pvma.vm_pgoff = index; 945 + /* Bias interleave by inode number to distribute better across nodes */ 946 + pvma.vm_pgoff = index + info->vfs_inode.i_ino; 947 947 pvma.vm_ops = NULL; 948 948 pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index); 949 949
+198 -18
mm/slab.c
··· 118 118 #include <linux/memory.h> 119 119 #include <linux/prefetch.h> 120 120 121 + #include <net/sock.h> 122 + 121 123 #include <asm/cacheflush.h> 122 124 #include <asm/tlbflush.h> 123 125 #include <asm/page.h> 124 126 125 127 #include <trace/events/kmem.h> 128 + 129 + #include "internal.h" 126 130 127 131 /* 128 132 * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON. ··· 155 151 #ifndef ARCH_KMALLOC_FLAGS 156 152 #define ARCH_KMALLOC_FLAGS SLAB_HWCACHE_ALIGN 157 153 #endif 154 + 155 + /* 156 + * true if a page was allocated from pfmemalloc reserves for network-based 157 + * swap 158 + */ 159 + static bool pfmemalloc_active __read_mostly; 158 160 159 161 /* Legal flag mask for kmem_cache_create(). */ 160 162 #if DEBUG ··· 267 257 * Must have this definition in here for the proper 268 258 * alignment of array_cache. Also simplifies accessing 269 259 * the entries. 260 + * 261 + * Entries should not be directly dereferenced as 262 + * entries belonging to slabs marked pfmemalloc will 263 + * have the lower bits set SLAB_OBJ_PFMEMALLOC 270 264 */ 271 265 }; 266 + 267 + #define SLAB_OBJ_PFMEMALLOC 1 268 + static inline bool is_obj_pfmemalloc(void *objp) 269 + { 270 + return (unsigned long)objp & SLAB_OBJ_PFMEMALLOC; 271 + } 272 + 273 + static inline void set_obj_pfmemalloc(void **objp) 274 + { 275 + *objp = (void *)((unsigned long)*objp | SLAB_OBJ_PFMEMALLOC); 276 + return; 277 + } 278 + 279 + static inline void clear_obj_pfmemalloc(void **objp) 280 + { 281 + *objp = (void *)((unsigned long)*objp & ~SLAB_OBJ_PFMEMALLOC); 282 + } 272 283 273 284 /* 274 285 * bootstrap: The caches do not work without cpuarrays anymore, but the ··· 931 900 return nc; 932 901 } 933 902 903 + static inline bool is_slab_pfmemalloc(struct slab *slabp) 904 + { 905 + struct page *page = virt_to_page(slabp->s_mem); 906 + 907 + return PageSlabPfmemalloc(page); 908 + } 909 + 910 + /* Clears pfmemalloc_active if no slabs have pfmalloc set */ 911 + static void 
recheck_pfmemalloc_active(struct kmem_cache *cachep, 912 + struct array_cache *ac) 913 + { 914 + struct kmem_list3 *l3 = cachep->nodelists[numa_mem_id()]; 915 + struct slab *slabp; 916 + unsigned long flags; 917 + 918 + if (!pfmemalloc_active) 919 + return; 920 + 921 + spin_lock_irqsave(&l3->list_lock, flags); 922 + list_for_each_entry(slabp, &l3->slabs_full, list) 923 + if (is_slab_pfmemalloc(slabp)) 924 + goto out; 925 + 926 + list_for_each_entry(slabp, &l3->slabs_partial, list) 927 + if (is_slab_pfmemalloc(slabp)) 928 + goto out; 929 + 930 + list_for_each_entry(slabp, &l3->slabs_free, list) 931 + if (is_slab_pfmemalloc(slabp)) 932 + goto out; 933 + 934 + pfmemalloc_active = false; 935 + out: 936 + spin_unlock_irqrestore(&l3->list_lock, flags); 937 + } 938 + 939 + static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac, 940 + gfp_t flags, bool force_refill) 941 + { 942 + int i; 943 + void *objp = ac->entry[--ac->avail]; 944 + 945 + /* Ensure the caller is allowed to use objects from PFMEMALLOC slab */ 946 + if (unlikely(is_obj_pfmemalloc(objp))) { 947 + struct kmem_list3 *l3; 948 + 949 + if (gfp_pfmemalloc_allowed(flags)) { 950 + clear_obj_pfmemalloc(&objp); 951 + return objp; 952 + } 953 + 954 + /* The caller cannot use PFMEMALLOC objects, find another one */ 955 + for (i = 1; i < ac->avail; i++) { 956 + /* If a !PFMEMALLOC object is found, swap them */ 957 + if (!is_obj_pfmemalloc(ac->entry[i])) { 958 + objp = ac->entry[i]; 959 + ac->entry[i] = ac->entry[ac->avail]; 960 + ac->entry[ac->avail] = objp; 961 + return objp; 962 + } 963 + } 964 + 965 + /* 966 + * If there are empty slabs on the slabs_free list and we are 967 + * being forced to refill the cache, mark this one !pfmemalloc. 
968 + */ 969 + l3 = cachep->nodelists[numa_mem_id()]; 970 + if (!list_empty(&l3->slabs_free) && force_refill) { 971 + struct slab *slabp = virt_to_slab(objp); 972 + ClearPageSlabPfmemalloc(virt_to_page(slabp->s_mem)); 973 + clear_obj_pfmemalloc(&objp); 974 + recheck_pfmemalloc_active(cachep, ac); 975 + return objp; 976 + } 977 + 978 + /* No !PFMEMALLOC objects available */ 979 + ac->avail++; 980 + objp = NULL; 981 + } 982 + 983 + return objp; 984 + } 985 + 986 + static inline void *ac_get_obj(struct kmem_cache *cachep, 987 + struct array_cache *ac, gfp_t flags, bool force_refill) 988 + { 989 + void *objp; 990 + 991 + if (unlikely(sk_memalloc_socks())) 992 + objp = __ac_get_obj(cachep, ac, flags, force_refill); 993 + else 994 + objp = ac->entry[--ac->avail]; 995 + 996 + return objp; 997 + } 998 + 999 + static void *__ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac, 1000 + void *objp) 1001 + { 1002 + if (unlikely(pfmemalloc_active)) { 1003 + /* Some pfmemalloc slabs exist, check if this is one */ 1004 + struct page *page = virt_to_page(objp); 1005 + if (PageSlabPfmemalloc(page)) 1006 + set_obj_pfmemalloc(&objp); 1007 + } 1008 + 1009 + return objp; 1010 + } 1011 + 1012 + static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac, 1013 + void *objp) 1014 + { 1015 + if (unlikely(sk_memalloc_socks())) 1016 + objp = __ac_put_obj(cachep, ac, objp); 1017 + 1018 + ac->entry[ac->avail++] = objp; 1019 + } 1020 + 934 1021 /* 935 1022 * Transfer objects in one arraycache to another. 936 1023 * Locking must be handled by the caller. 
··· 1225 1076 STATS_INC_ACOVERFLOW(cachep); 1226 1077 __drain_alien_cache(cachep, alien, nodeid); 1227 1078 } 1228 - alien->entry[alien->avail++] = objp; 1079 + ac_put_obj(cachep, alien, objp); 1229 1080 spin_unlock(&alien->lock); 1230 1081 } else { 1231 1082 spin_lock(&(cachep->nodelists[nodeid])->list_lock); ··· 1908 1759 return NULL; 1909 1760 } 1910 1761 1762 + /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */ 1763 + if (unlikely(page->pfmemalloc)) 1764 + pfmemalloc_active = true; 1765 + 1911 1766 nr_pages = (1 << cachep->gfporder); 1912 1767 if (cachep->flags & SLAB_RECLAIM_ACCOUNT) 1913 1768 add_zone_page_state(page_zone(page), ··· 1919 1766 else 1920 1767 add_zone_page_state(page_zone(page), 1921 1768 NR_SLAB_UNRECLAIMABLE, nr_pages); 1922 - for (i = 0; i < nr_pages; i++) 1769 + for (i = 0; i < nr_pages; i++) { 1923 1770 __SetPageSlab(page + i); 1771 + 1772 + if (page->pfmemalloc) 1773 + SetPageSlabPfmemalloc(page + i); 1774 + } 1924 1775 1925 1776 if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) { 1926 1777 kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid); ··· 1957 1800 NR_SLAB_UNRECLAIMABLE, nr_freed); 1958 1801 while (i--) { 1959 1802 BUG_ON(!PageSlab(page)); 1803 + __ClearPageSlabPfmemalloc(page); 1960 1804 __ClearPageSlab(page); 1961 1805 page++; 1962 1806 } ··· 3173 3015 #define check_slabp(x,y) do { } while(0) 3174 3016 #endif 3175 3017 3176 - static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) 3018 + static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, 3019 + bool force_refill) 3177 3020 { 3178 3021 int batchcount; 3179 3022 struct kmem_list3 *l3; 3180 3023 struct array_cache *ac; 3181 3024 int node; 3182 3025 3183 - retry: 3184 3026 check_irq_off(); 3185 3027 node = numa_mem_id(); 3028 + if (unlikely(force_refill)) 3029 + goto force_grow; 3030 + retry: 3186 3031 ac = cpu_cache_get(cachep); 3187 3032 batchcount = ac->batchcount; 3188 3033 if (!ac->touched && batchcount > 
BATCHREFILL_LIMIT) { ··· 3235 3074 STATS_INC_ACTIVE(cachep); 3236 3075 STATS_SET_HIGH(cachep); 3237 3076 3238 - ac->entry[ac->avail++] = slab_get_obj(cachep, slabp, 3239 - node); 3077 + ac_put_obj(cachep, ac, slab_get_obj(cachep, slabp, 3078 + node)); 3240 3079 } 3241 3080 check_slabp(cachep, slabp); 3242 3081 ··· 3255 3094 3256 3095 if (unlikely(!ac->avail)) { 3257 3096 int x; 3097 + force_grow: 3258 3098 x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); 3259 3099 3260 3100 /* cache_grow can reenable interrupts, then ac could change. */ 3261 3101 ac = cpu_cache_get(cachep); 3262 - if (!x && ac->avail == 0) /* no objects in sight? abort */ 3102 + 3103 + /* no objects in sight? abort */ 3104 + if (!x && (ac->avail == 0 || force_refill)) 3263 3105 return NULL; 3264 3106 3265 3107 if (!ac->avail) /* objects refilled by interrupt? */ 3266 3108 goto retry; 3267 3109 } 3268 3110 ac->touched = 1; 3269 - return ac->entry[--ac->avail]; 3111 + 3112 + return ac_get_obj(cachep, ac, flags, force_refill); 3270 3113 } 3271 3114 3272 3115 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep, ··· 3352 3187 { 3353 3188 void *objp; 3354 3189 struct array_cache *ac; 3190 + bool force_refill = false; 3355 3191 3356 3192 check_irq_off(); 3357 3193 3358 3194 ac = cpu_cache_get(cachep); 3359 3195 if (likely(ac->avail)) { 3360 - STATS_INC_ALLOCHIT(cachep); 3361 3196 ac->touched = 1; 3362 - objp = ac->entry[--ac->avail]; 3363 - } else { 3364 - STATS_INC_ALLOCMISS(cachep); 3365 - objp = cache_alloc_refill(cachep, flags); 3197 + objp = ac_get_obj(cachep, ac, flags, false); 3198 + 3366 3199 /* 3367 - * the 'ac' may be updated by cache_alloc_refill(), 3368 - * and kmemleak_erase() requires its correct value. 
3200 + * Allow for the possibility all avail objects are not allowed 3201 + * by the current flags 3369 3202 */ 3370 - ac = cpu_cache_get(cachep); 3203 + if (objp) { 3204 + STATS_INC_ALLOCHIT(cachep); 3205 + goto out; 3206 + } 3207 + force_refill = true; 3371 3208 } 3209 + 3210 + STATS_INC_ALLOCMISS(cachep); 3211 + objp = cache_alloc_refill(cachep, flags, force_refill); 3212 + /* 3213 + * the 'ac' may be updated by cache_alloc_refill(), 3214 + * and kmemleak_erase() requires its correct value. 3215 + */ 3216 + ac = cpu_cache_get(cachep); 3217 + 3218 + out: 3372 3219 /* 3373 3220 * To avoid a false negative, if an object that is in one of the 3374 3221 * per-CPU caches is leaked, we need to make sure kmemleak doesn't ··· 3702 3525 struct kmem_list3 *l3; 3703 3526 3704 3527 for (i = 0; i < nr_objects; i++) { 3705 - void *objp = objpp[i]; 3528 + void *objp; 3706 3529 struct slab *slabp; 3530 + 3531 + clear_obj_pfmemalloc(&objpp[i]); 3532 + objp = objpp[i]; 3707 3533 3708 3534 slabp = virt_to_slab(objp); 3709 3535 l3 = cachep->nodelists[node]; ··· 3825 3645 cache_flusharray(cachep, ac); 3826 3646 } 3827 3647 3828 - ac->entry[ac->avail++] = objp; 3648 + ac_put_obj(cachep, ac, objp); 3829 3649 } 3830 3650 3831 3651 /**
+27 -3
mm/slub.c
··· 34 34 35 35 #include <trace/events/kmem.h> 36 36 37 + #include "internal.h" 38 + 37 39 /* 38 40 * Lock order: 39 41 * 1. slab_mutex (Global Mutex) ··· 1356 1354 inc_slabs_node(s, page_to_nid(page), page->objects); 1357 1355 page->slab = s; 1358 1356 __SetPageSlab(page); 1357 + if (page->pfmemalloc) 1358 + SetPageSlabPfmemalloc(page); 1359 1359 1360 1360 start = page_address(page); 1361 1361 ··· 1401 1397 NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, 1402 1398 -pages); 1403 1399 1400 + __ClearPageSlabPfmemalloc(page); 1404 1401 __ClearPageSlab(page); 1405 1402 reset_page_mapcount(page); 1406 1403 if (current->reclaim_state) ··· 2131 2126 return freelist; 2132 2127 } 2133 2128 2129 + static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags) 2130 + { 2131 + if (unlikely(PageSlabPfmemalloc(page))) 2132 + return gfp_pfmemalloc_allowed(gfpflags); 2133 + 2134 + return true; 2135 + } 2136 + 2134 2137 /* 2135 2138 * Check the page->freelist of a page and either transfer the freelist to the per cpu freelist 2136 2139 * or deactivate the page. 
··· 2219 2206 goto new_slab; 2220 2207 } 2221 2208 2209 + /* 2210 + * By rights, we should be searching for a slab page that was 2211 + * PFMEMALLOC but right now, we are losing the pfmemalloc 2212 + * information when the page leaves the per-cpu allocator 2213 + */ 2214 + if (unlikely(!pfmemalloc_match(page, gfpflags))) { 2215 + deactivate_slab(s, page, c->freelist); 2216 + c->page = NULL; 2217 + c->freelist = NULL; 2218 + goto new_slab; 2219 + } 2220 + 2222 2221 /* must check again c->freelist in case of cpu migration or IRQ */ 2223 2222 freelist = c->freelist; 2224 2223 if (freelist) ··· 2281 2256 } 2282 2257 2283 2258 page = c->page; 2284 - if (likely(!kmem_cache_debug(s))) 2259 + if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags))) 2285 2260 goto load_freelist; 2286 2261 2287 2262 /* Only entered in the debug case */ 2288 - if (!alloc_debug_processing(s, page, freelist, addr)) 2263 + if (kmem_cache_debug(s) && !alloc_debug_processing(s, page, freelist, addr)) 2289 2264 goto new_slab; /* Slab failed checks. Next slab needed */ 2290 2265 2291 2266 deactivate_slab(s, page, get_freepointer(s, freelist)); ··· 2338 2313 object = c->freelist; 2339 2314 page = c->page; 2340 2315 if (unlikely(!object || !node_match(page, node))) 2341 - 2342 2316 object = __slab_alloc(s, gfpflags, node, addr, c); 2343 2317 2344 2318 else {
+10 -19
mm/sparse.c
··· 65 65 66 66 if (slab_is_available()) { 67 67 if (node_state(nid, N_HIGH_MEMORY)) 68 - section = kmalloc_node(array_size, GFP_KERNEL, nid); 68 + section = kzalloc_node(array_size, GFP_KERNEL, nid); 69 69 else 70 - section = kmalloc(array_size, GFP_KERNEL); 71 - } else 70 + section = kzalloc(array_size, GFP_KERNEL); 71 + } else { 72 72 section = alloc_bootmem_node(NODE_DATA(nid), array_size); 73 - 74 - if (section) 75 - memset(section, 0, array_size); 73 + } 76 74 77 75 return section; 78 76 } 79 77 80 78 static int __meminit sparse_index_init(unsigned long section_nr, int nid) 81 79 { 82 - static DEFINE_SPINLOCK(index_init_lock); 83 80 unsigned long root = SECTION_NR_TO_ROOT(section_nr); 84 81 struct mem_section *section; 85 82 int ret = 0; ··· 87 90 section = sparse_index_alloc(nid); 88 91 if (!section) 89 92 return -ENOMEM; 90 - /* 91 - * This lock keeps two different sections from 92 - * reallocating for the same index 93 - */ 94 - spin_lock(&index_init_lock); 95 - 96 - if (mem_section[root]) { 97 - ret = -EEXIST; 98 - goto out; 99 - } 100 93 101 94 mem_section[root] = section; 102 - out: 103 - spin_unlock(&index_init_lock); 95 + 104 96 return ret; 105 97 } 106 98 #else /* !SPARSEMEM_EXTREME */ ··· 117 131 if ((ms >= root) && (ms < (root + SECTIONS_PER_ROOT))) 118 132 break; 119 133 } 134 + 135 + VM_BUG_ON(root_nr == NR_SECTION_ROOTS); 120 136 121 137 return (root_nr * SECTIONS_PER_ROOT) + (ms - root); 122 138 } ··· 480 492 int size2; 481 493 struct page **map_map; 482 494 #endif 495 + 496 + /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */ 497 + set_pageblock_order(); 483 498 484 499 /* 485 500 * map is using big page (aka 2M in x86 64 bit)
+52
mm/swap.c
··· 236 236 } 237 237 EXPORT_SYMBOL(put_pages_list); 238 238 239 + /* 240 + * get_kernel_pages() - pin kernel pages in memory 241 + * @kiov: An array of struct kvec structures 242 + * @nr_segs: number of segments to pin 243 + * @write: pinning for read/write, currently ignored 244 + * @pages: array that receives pointers to the pages pinned. 245 + * Should be at least nr_segs long. 246 + * 247 + * Returns number of pages pinned. This may be fewer than the number 248 + * requested. If nr_pages is 0 or negative, returns 0. If no pages 249 + * were pinned, returns -errno. Each page returned must be released 250 + * with a put_page() call when it is finished with. 251 + */ 252 + int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write, 253 + struct page **pages) 254 + { 255 + int seg; 256 + 257 + for (seg = 0; seg < nr_segs; seg++) { 258 + if (WARN_ON(kiov[seg].iov_len != PAGE_SIZE)) 259 + return seg; 260 + 261 + pages[seg] = kmap_to_page(kiov[seg].iov_base); 262 + page_cache_get(pages[seg]); 263 + } 264 + 265 + return seg; 266 + } 267 + EXPORT_SYMBOL_GPL(get_kernel_pages); 268 + 269 + /* 270 + * get_kernel_page() - pin a kernel page in memory 271 + * @start: starting kernel address 272 + * @write: pinning for read/write, currently ignored 273 + * @pages: array that receives pointer to the page pinned. 274 + * Must be at least nr_segs long. 275 + * 276 + * Returns 1 if page is pinned. If the page was not pinned, returns 277 + * -errno. The page returned must be released with a put_page() call 278 + * when it is finished with. 
279 + */ 280 + int get_kernel_page(unsigned long start, int write, struct page **pages) 281 + { 282 + const struct kvec kiov = { 283 + .iov_base = (void *)start, 284 + .iov_len = PAGE_SIZE 285 + }; 286 + 287 + return get_kernel_pages(&kiov, 1, write, pages); 288 + } 289 + EXPORT_SYMBOL_GPL(get_kernel_page); 290 + 239 291 static void pagevec_lru_move_fn(struct pagevec *pvec, 240 292 void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg), 241 293 void *arg)
+6 -1
mm/swap_state.c
··· 14 14 #include <linux/init.h> 15 15 #include <linux/pagemap.h> 16 16 #include <linux/backing-dev.h> 17 + #include <linux/blkdev.h> 17 18 #include <linux/pagevec.h> 18 19 #include <linux/migrate.h> 19 20 #include <linux/page_cgroup.h> ··· 27 26 */ 28 27 static const struct address_space_operations swap_aops = { 29 28 .writepage = swap_writepage, 30 - .set_page_dirty = __set_page_dirty_no_writeback, 29 + .set_page_dirty = swap_set_page_dirty, 31 30 .migratepage = migrate_page, 32 31 }; 33 32 ··· 377 376 unsigned long offset = swp_offset(entry); 378 377 unsigned long start_offset, end_offset; 379 378 unsigned long mask = (1UL << page_cluster) - 1; 379 + struct blk_plug plug; 380 380 381 381 /* Read a page_cluster sized and aligned cluster around offset. */ 382 382 start_offset = offset & ~mask; ··· 385 383 if (!start_offset) /* First page is swap header. */ 386 384 start_offset++; 387 385 386 + blk_start_plug(&plug); 388 387 for (offset = start_offset; offset <= end_offset ; offset++) { 389 388 /* Ok, do the async read-ahead now */ 390 389 page = read_swap_cache_async(swp_entry(swp_type(entry), offset), ··· 394 391 continue; 395 392 page_cache_release(page); 396 393 } 394 + blk_finish_plug(&plug); 395 + 397 396 lru_add_drain(); /* Push any new pages onto the LRU now */ 398 397 return read_swap_cache_async(entry, gfp_mask, vma, addr); 399 398 }
+55 -90
mm/swapfile.c
··· 33 33 #include <linux/oom.h> 34 34 #include <linux/frontswap.h> 35 35 #include <linux/swapfile.h> 36 + #include <linux/export.h> 36 37 37 38 #include <asm/pgtable.h> 38 39 #include <asm/tlbflush.h> ··· 549 548 550 549 /* free if no reference */ 551 550 if (!usage) { 552 - struct gendisk *disk = p->bdev->bd_disk; 553 551 if (offset < p->lowest_bit) 554 552 p->lowest_bit = offset; 555 553 if (offset > p->highest_bit) ··· 559 559 nr_swap_pages++; 560 560 p->inuse_pages--; 561 561 frontswap_invalidate_page(p->type, offset); 562 - if ((p->flags & SWP_BLKDEV) && 563 - disk->fops->swap_slot_free_notify) 564 - disk->fops->swap_slot_free_notify(p->bdev, offset); 562 + if (p->flags & SWP_BLKDEV) { 563 + struct gendisk *disk = p->bdev->bd_disk; 564 + if (disk->fops->swap_slot_free_notify) 565 + disk->fops->swap_slot_free_notify(p->bdev, 566 + offset); 567 + } 565 568 } 566 569 567 570 return usage; ··· 835 832 836 833 pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); 837 834 if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) { 838 - if (ret > 0) 839 - mem_cgroup_cancel_charge_swapin(memcg); 835 + mem_cgroup_cancel_charge_swapin(memcg); 840 836 ret = 0; 841 837 goto out; 842 838 } ··· 1330 1328 list_del(&se->list); 1331 1329 kfree(se); 1332 1330 } 1331 + 1332 + if (sis->flags & SWP_FILE) { 1333 + struct file *swap_file = sis->swap_file; 1334 + struct address_space *mapping = swap_file->f_mapping; 1335 + 1336 + sis->flags &= ~SWP_FILE; 1337 + mapping->a_ops->swap_deactivate(swap_file); 1338 + } 1333 1339 } 1334 1340 1335 1341 /* ··· 1346 1336 * 1347 1337 * This function rather assumes that it is called in ascending page order. 
1348 1338 */ 1349 - static int 1339 + int 1350 1340 add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, 1351 1341 unsigned long nr_pages, sector_t start_block) 1352 1342 { ··· 1419 1409 */ 1420 1410 static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span) 1421 1411 { 1422 - struct inode *inode; 1423 - unsigned blocks_per_page; 1424 - unsigned long page_no; 1425 - unsigned blkbits; 1426 - sector_t probe_block; 1427 - sector_t last_block; 1428 - sector_t lowest_block = -1; 1429 - sector_t highest_block = 0; 1430 - int nr_extents = 0; 1412 + struct file *swap_file = sis->swap_file; 1413 + struct address_space *mapping = swap_file->f_mapping; 1414 + struct inode *inode = mapping->host; 1431 1415 int ret; 1432 1416 1433 - inode = sis->swap_file->f_mapping->host; 1434 1417 if (S_ISBLK(inode->i_mode)) { 1435 1418 ret = add_swap_extent(sis, 0, sis->max, 0); 1436 1419 *span = sis->pages; 1437 - goto out; 1420 + return ret; 1438 1421 } 1439 1422 1440 - blkbits = inode->i_blkbits; 1441 - blocks_per_page = PAGE_SIZE >> blkbits; 1442 - 1443 - /* 1444 - * Map all the blocks into the extent list. This code doesn't try 1445 - * to be very smart. 
1446 - */ 1447 - probe_block = 0; 1448 - page_no = 0; 1449 - last_block = i_size_read(inode) >> blkbits; 1450 - while ((probe_block + blocks_per_page) <= last_block && 1451 - page_no < sis->max) { 1452 - unsigned block_in_page; 1453 - sector_t first_block; 1454 - 1455 - first_block = bmap(inode, probe_block); 1456 - if (first_block == 0) 1457 - goto bad_bmap; 1458 - 1459 - /* 1460 - * It must be PAGE_SIZE aligned on-disk 1461 - */ 1462 - if (first_block & (blocks_per_page - 1)) { 1463 - probe_block++; 1464 - goto reprobe; 1423 + if (mapping->a_ops->swap_activate) { 1424 + ret = mapping->a_ops->swap_activate(sis, swap_file, span); 1425 + if (!ret) { 1426 + sis->flags |= SWP_FILE; 1427 + ret = add_swap_extent(sis, 0, sis->max, 0); 1428 + *span = sis->pages; 1465 1429 } 1466 - 1467 - for (block_in_page = 1; block_in_page < blocks_per_page; 1468 - block_in_page++) { 1469 - sector_t block; 1470 - 1471 - block = bmap(inode, probe_block + block_in_page); 1472 - if (block == 0) 1473 - goto bad_bmap; 1474 - if (block != first_block + block_in_page) { 1475 - /* Discontiguity */ 1476 - probe_block++; 1477 - goto reprobe; 1478 - } 1479 - } 1480 - 1481 - first_block >>= (PAGE_SHIFT - blkbits); 1482 - if (page_no) { /* exclude the header page */ 1483 - if (first_block < lowest_block) 1484 - lowest_block = first_block; 1485 - if (first_block > highest_block) 1486 - highest_block = first_block; 1487 - } 1488 - 1489 - /* 1490 - * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks 1491 - */ 1492 - ret = add_swap_extent(sis, page_no, 1, first_block); 1493 - if (ret < 0) 1494 - goto out; 1495 - nr_extents += ret; 1496 - page_no++; 1497 - probe_block += blocks_per_page; 1498 - reprobe: 1499 - continue; 1430 + return ret; 1500 1431 } 1501 - ret = nr_extents; 1502 - *span = 1 + highest_block - lowest_block; 1503 - if (page_no == 0) 1504 - page_no = 1; /* force Empty message */ 1505 - sis->max = page_no; 1506 - sis->pages = page_no - 1; 1507 - sis->highest_bit = page_no - 1; 
1508 - out: 1509 - return ret; 1510 - bad_bmap: 1511 - printk(KERN_ERR "swapon: swapfile has holes\n"); 1512 - ret = -EINVAL; 1513 - goto out; 1432 + 1433 + return generic_swapfile_activate(sis, swap_file, span); 1514 1434 } 1515 1435 1516 1436 static void enable_swap_info(struct swap_info_struct *p, int prio, ··· 2224 2284 { 2225 2285 return __swap_duplicate(entry, SWAP_HAS_CACHE); 2226 2286 } 2287 + 2288 + struct swap_info_struct *page_swap_info(struct page *page) 2289 + { 2290 + swp_entry_t swap = { .val = page_private(page) }; 2291 + BUG_ON(!PageSwapCache(page)); 2292 + return swap_info[swp_type(swap)]; 2293 + } 2294 + 2295 + /* 2296 + * out-of-line __page_file_ methods to avoid include hell. 2297 + */ 2298 + struct address_space *__page_file_mapping(struct page *page) 2299 + { 2300 + VM_BUG_ON(!PageSwapCache(page)); 2301 + return page_swap_info(page)->swap_file->f_mapping; 2302 + } 2303 + EXPORT_SYMBOL_GPL(__page_file_mapping); 2304 + 2305 + pgoff_t __page_file_index(struct page *page) 2306 + { 2307 + swp_entry_t swap = { .val = page_private(page) }; 2308 + VM_BUG_ON(!PageSwapCache(page)); 2309 + return swp_offset(swap); 2310 + } 2311 + EXPORT_SYMBOL_GPL(__page_file_index); 2227 2312 2228 2313 /* 2229 2314 * add_swap_count_continuation - called when a swap count is duplicated
+12 -4
mm/vmalloc.c
··· 413 413 if (addr + size - 1 < addr) 414 414 goto overflow; 415 415 416 - n = rb_next(&first->rb_node); 417 - if (n) 418 - first = rb_entry(n, struct vmap_area, rb_node); 419 - else 416 + if (list_is_last(&first->list, &vmap_area_list)) 420 417 goto found; 418 + 419 + first = list_entry(first->list.next, 420 + struct vmap_area, list); 421 421 } 422 422 423 423 found: ··· 904 904 905 905 BUG_ON(size & ~PAGE_MASK); 906 906 BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC); 907 + if (WARN_ON(size == 0)) { 908 + /* 909 + * Allocating 0 bytes isn't what caller wants since 910 + * get_order(0) returns funny result. Just warn and terminate 911 + * early. 912 + */ 913 + return NULL; 914 + } 907 915 order = get_order(size); 908 916 909 917 again:
+162 -13
mm/vmscan.c
··· 133 133 static LIST_HEAD(shrinker_list); 134 134 static DECLARE_RWSEM(shrinker_rwsem); 135 135 136 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 136 + #ifdef CONFIG_MEMCG 137 137 static bool global_reclaim(struct scan_control *sc) 138 138 { 139 139 return !sc->target_mem_cgroup; ··· 687 687 688 688 cond_resched(); 689 689 690 + mem_cgroup_uncharge_start(); 690 691 while (!list_empty(page_list)) { 691 692 enum page_references references; 692 693 struct address_space *mapping; ··· 721 720 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); 722 721 723 722 if (PageWriteback(page)) { 724 - nr_writeback++; 725 - unlock_page(page); 726 - goto keep; 723 + /* 724 + * memcg doesn't have any dirty pages throttling so we 725 + * could easily OOM just because too many pages are in 726 + * writeback and there is nothing else to reclaim. 727 + * 728 + * Check __GFP_IO, certainly because a loop driver 729 + * thread might enter reclaim, and deadlock if it waits 730 + * on a page for which it is needed to do the write 731 + * (loop masks off __GFP_IO|__GFP_FS for this reason); 732 + * but more thought would probably show more reasons. 733 + * 734 + * Don't require __GFP_FS, since we're not going into 735 + * the FS, just waiting on its writeback completion. 736 + * Worryingly, ext4 gfs2 and xfs allocate pages with 737 + * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so 738 + * testing may_enter_fs here is liable to OOM on them. 739 + */ 740 + if (global_reclaim(sc) || 741 + !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { 742 + /* 743 + * This is slightly racy - end_page_writeback() 744 + * might have just cleared PageReclaim, then 745 + * setting PageReclaim here ends up interpreted 746 + * as PageReadahead - but that does not matter 747 + * enough to care. 
What we do want is for this 748 + * page to have PageReclaim set next time memcg 749 + * reclaim reaches the tests above, so it will 750 + * then wait_on_page_writeback() to avoid OOM; 751 + * and it's also appropriate in global reclaim. 752 + */ 753 + SetPageReclaim(page); 754 + nr_writeback++; 755 + goto keep_locked; 756 + } 757 + wait_on_page_writeback(page); 727 758 } 728 759 729 760 references = page_check_references(page, sc); ··· 954 921 955 922 list_splice(&ret_pages, page_list); 956 923 count_vm_events(PGACTIVATE, pgactivate); 924 + mem_cgroup_uncharge_end(); 957 925 *ret_nr_dirty += nr_dirty; 958 926 *ret_nr_writeback += nr_writeback; 959 927 return nr_reclaimed; ··· 2146 2112 return 0; 2147 2113 } 2148 2114 2115 + static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) 2116 + { 2117 + struct zone *zone; 2118 + unsigned long pfmemalloc_reserve = 0; 2119 + unsigned long free_pages = 0; 2120 + int i; 2121 + bool wmark_ok; 2122 + 2123 + for (i = 0; i <= ZONE_NORMAL; i++) { 2124 + zone = &pgdat->node_zones[i]; 2125 + pfmemalloc_reserve += min_wmark_pages(zone); 2126 + free_pages += zone_page_state(zone, NR_FREE_PAGES); 2127 + } 2128 + 2129 + wmark_ok = free_pages > pfmemalloc_reserve / 2; 2130 + 2131 + /* kswapd must be awake if processes are being throttled */ 2132 + if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) { 2133 + pgdat->classzone_idx = min(pgdat->classzone_idx, 2134 + (enum zone_type)ZONE_NORMAL); 2135 + wake_up_interruptible(&pgdat->kswapd_wait); 2136 + } 2137 + 2138 + return wmark_ok; 2139 + } 2140 + 2141 + /* 2142 + * Throttle direct reclaimers if backing storage is backed by the network 2143 + * and the PFMEMALLOC reserve for the preferred node is getting dangerously 2144 + * depleted. 
kswapd will continue to make progress and wake the processes 2145 + * when the low watermark is reached 2146 + */ 2147 + static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, 2148 + nodemask_t *nodemask) 2149 + { 2150 + struct zone *zone; 2151 + int high_zoneidx = gfp_zone(gfp_mask); 2152 + pg_data_t *pgdat; 2153 + 2154 + /* 2155 + * Kernel threads should not be throttled as they may be indirectly 2156 + * responsible for cleaning pages necessary for reclaim to make forward 2157 + * progress. kjournald for example may enter direct reclaim while 2158 + * committing a transaction where throttling it could force other 2159 + * processes to block on log_wait_commit(). 2160 + */ 2161 + if (current->flags & PF_KTHREAD) 2162 + return; 2163 + 2164 + /* Check if the pfmemalloc reserves are ok */ 2165 + first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone); 2166 + pgdat = zone->zone_pgdat; 2167 + if (pfmemalloc_watermark_ok(pgdat)) 2168 + return; 2169 + 2170 + /* Account for the throttling */ 2171 + count_vm_event(PGSCAN_DIRECT_THROTTLE); 2172 + 2173 + /* 2174 + * If the caller cannot enter the filesystem, it's possible that it 2175 + * is due to the caller holding an FS lock or performing a journal 2176 + * transaction in the case of a filesystem like ext[3|4]. In this case, 2177 + * it is not safe to block on pfmemalloc_wait as kswapd could be 2178 + * blocked waiting on the same lock. Instead, throttle for up to a 2179 + * second before continuing. 
2180 + */ 2181 + if (!(gfp_mask & __GFP_FS)) { 2182 + wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, 2183 + pfmemalloc_watermark_ok(pgdat), HZ); 2184 + return; 2185 + } 2186 + 2187 + /* Throttle until kswapd wakes the process */ 2188 + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, 2189 + pfmemalloc_watermark_ok(pgdat)); 2190 + } 2191 + 2149 2192 unsigned long try_to_free_pages(struct zonelist *zonelist, int order, 2150 2193 gfp_t gfp_mask, nodemask_t *nodemask) 2151 2194 { ··· 2242 2131 .gfp_mask = sc.gfp_mask, 2243 2132 }; 2244 2133 2134 + throttle_direct_reclaim(gfp_mask, zonelist, nodemask); 2135 + 2136 + /* 2137 + * Do not enter reclaim if fatal signal is pending. 1 is returned so 2138 + * that the page allocator does not consider triggering OOM 2139 + */ 2140 + if (fatal_signal_pending(current)) 2141 + return 1; 2142 + 2245 2143 trace_mm_vmscan_direct_reclaim_begin(order, 2246 2144 sc.may_writepage, 2247 2145 gfp_mask); ··· 2262 2142 return nr_reclaimed; 2263 2143 } 2264 2144 2265 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR 2145 + #ifdef CONFIG_MEMCG 2266 2146 2267 2147 unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, 2268 2148 gfp_t gfp_mask, bool noswap, ··· 2395 2275 return balanced_pages >= (present_pages >> 2); 2396 2276 } 2397 2277 2398 - /* is kswapd sleeping prematurely? */ 2399 - static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, 2278 + /* 2279 + * Prepare kswapd for sleeping. This verifies that there are no processes 2280 + * waiting in throttle_direct_reclaim() and that watermarks have been met. 
2281 + * 2282 + * Returns true if kswapd is ready to sleep 2283 + */ 2284 + static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, 2400 2285 int classzone_idx) 2401 2286 { 2402 2287 int i; ··· 2410 2285 2411 2286 /* If a direct reclaimer woke kswapd within HZ/10, it's premature */ 2412 2287 if (remaining) 2413 - return true; 2288 + return false; 2289 + 2290 + /* 2291 + * There is a potential race between when kswapd checks its watermarks 2292 + * and a process gets throttled. There is also a potential race if 2293 + * processes get throttled, kswapd wakes, a large process exits therby 2294 + * balancing the zones that causes kswapd to miss a wakeup. If kswapd 2295 + * is going to sleep, no process should be sleeping on pfmemalloc_wait 2296 + * so wake them now if necessary. If necessary, processes will wake 2297 + * kswapd and get throttled again 2298 + */ 2299 + if (waitqueue_active(&pgdat->pfmemalloc_wait)) { 2300 + wake_up(&pgdat->pfmemalloc_wait); 2301 + return false; 2302 + } 2414 2303 2415 2304 /* Check the watermark levels */ 2416 2305 for (i = 0; i <= classzone_idx; i++) { ··· 2457 2318 * must be balanced 2458 2319 */ 2459 2320 if (order) 2460 - return !pgdat_balanced(pgdat, balanced, classzone_idx); 2321 + return pgdat_balanced(pgdat, balanced, classzone_idx); 2461 2322 else 2462 - return !all_zones_ok; 2323 + return all_zones_ok; 2463 2324 } 2464 2325 2465 2326 /* ··· 2685 2546 } 2686 2547 2687 2548 } 2549 + 2550 + /* 2551 + * If the low watermark is met there is no need for processes 2552 + * to be throttled on pfmemalloc_wait as they should not be 2553 + * able to safely make forward progress. 
Wake them 2554 + */ 2555 + if (waitqueue_active(&pgdat->pfmemalloc_wait) && 2556 + pfmemalloc_watermark_ok(pgdat)) 2557 + wake_up(&pgdat->pfmemalloc_wait); 2558 + 2688 2559 if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx))) 2689 2560 break; /* kswapd: all done */ 2690 2561 /* ··· 2796 2647 } 2797 2648 2798 2649 /* 2799 - * Return the order we were reclaiming at so sleeping_prematurely() 2650 + * Return the order we were reclaiming at so prepare_kswapd_sleep() 2800 2651 * makes a decision on the order we were last reclaiming at. However, 2801 2652 * if another caller entered the allocator slow path while kswapd 2802 2653 * was awake, order will remain at the higher level ··· 2816 2667 prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); 2817 2668 2818 2669 /* Try to sleep for a short interval */ 2819 - if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) { 2670 + if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) { 2820 2671 remaining = schedule_timeout(HZ/10); 2821 2672 finish_wait(&pgdat->kswapd_wait, &wait); 2822 2673 prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); ··· 2826 2677 * After a short sleep, check if it was a premature sleep. If not, then 2827 2678 * go fully to sleep until explicitly woken up. 2828 2679 */ 2829 - if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) { 2680 + if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) { 2830 2681 trace_mm_vmscan_kswapd_sleep(pgdat->node_id); 2831 2682 2832 2683 /*
+1
mm/vmstat.c
··· 745 745 TEXTS_FOR_ZONES("pgsteal_direct") 746 746 TEXTS_FOR_ZONES("pgscan_kswapd") 747 747 TEXTS_FOR_ZONES("pgscan_direct") 748 + "pgscan_direct_throttle", 748 749 749 750 #ifdef CONFIG_NUMA 750 751 "zone_reclaim_failed",
+1 -1
net/caif/caif_socket.c
··· 141 141 err = sk_filter(sk, skb); 142 142 if (err) 143 143 return err; 144 - if (!sk_rmem_schedule(sk, skb->truesize) && rx_flow_is_on(cf_sk)) { 144 + if (!sk_rmem_schedule(sk, skb, skb->truesize) && rx_flow_is_on(cf_sk)) { 145 145 set_rx_flow_off(cf_sk); 146 146 net_dbg_ratelimited("sending flow OFF due to rmem_schedule\n"); 147 147 caif_flow_ctrl(sk, CAIF_MODEMCMD_FLOW_OFF_REQ);
+47 -6
net/core/dev.c
··· 3156 3156 } 3157 3157 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister); 3158 3158 3159 + /* 3160 + * Limit the use of PFMEMALLOC reserves to those protocols that implement 3161 + * the special handling of PFMEMALLOC skbs. 3162 + */ 3163 + static bool skb_pfmemalloc_protocol(struct sk_buff *skb) 3164 + { 3165 + switch (skb->protocol) { 3166 + case __constant_htons(ETH_P_ARP): 3167 + case __constant_htons(ETH_P_IP): 3168 + case __constant_htons(ETH_P_IPV6): 3169 + case __constant_htons(ETH_P_8021Q): 3170 + return true; 3171 + default: 3172 + return false; 3173 + } 3174 + } 3175 + 3159 3176 static int __netif_receive_skb(struct sk_buff *skb) 3160 3177 { 3161 3178 struct packet_type *ptype, *pt_prev; ··· 3182 3165 bool deliver_exact = false; 3183 3166 int ret = NET_RX_DROP; 3184 3167 __be16 type; 3168 + unsigned long pflags = current->flags; 3185 3169 3186 3170 net_timestamp_check(!netdev_tstamp_prequeue, skb); 3187 3171 3188 3172 trace_netif_receive_skb(skb); 3189 3173 3174 + /* 3175 + * PFMEMALLOC skbs are special, they should 3176 + * - be delivered to SOCK_MEMALLOC sockets only 3177 + * - stay away from userspace 3178 + * - have bounded memory usage 3179 + * 3180 + * Use PF_MEMALLOC as this saves us from propagating the allocation 3181 + * context down to all allocation sites. 
3182 + */ 3183 + if (sk_memalloc_socks() && skb_pfmemalloc(skb)) 3184 + current->flags |= PF_MEMALLOC; 3185 + 3190 3186 /* if we've gotten here through NAPI, check netpoll */ 3191 3187 if (netpoll_receive_skb(skb)) 3192 - return NET_RX_DROP; 3188 + goto out; 3193 3189 3194 3190 orig_dev = skb->dev; 3195 3191 ··· 3222 3192 if (skb->protocol == cpu_to_be16(ETH_P_8021Q)) { 3223 3193 skb = vlan_untag(skb); 3224 3194 if (unlikely(!skb)) 3225 - goto out; 3195 + goto unlock; 3226 3196 } 3227 3197 3228 3198 #ifdef CONFIG_NET_CLS_ACT ··· 3232 3202 } 3233 3203 #endif 3234 3204 3205 + if (sk_memalloc_socks() && skb_pfmemalloc(skb)) 3206 + goto skip_taps; 3207 + 3235 3208 list_for_each_entry_rcu(ptype, &ptype_all, list) { 3236 3209 if (!ptype->dev || ptype->dev == skb->dev) { 3237 3210 if (pt_prev) ··· 3243 3210 } 3244 3211 } 3245 3212 3213 + skip_taps: 3246 3214 #ifdef CONFIG_NET_CLS_ACT 3247 3215 skb = handle_ing(skb, &pt_prev, &ret, orig_dev); 3248 3216 if (!skb) 3249 - goto out; 3217 + goto unlock; 3250 3218 ncls: 3251 3219 #endif 3220 + 3221 + if (sk_memalloc_socks() && skb_pfmemalloc(skb) 3222 + && !skb_pfmemalloc_protocol(skb)) 3223 + goto drop; 3252 3224 3253 3225 rx_handler = rcu_dereference(skb->dev->rx_handler); 3254 3226 if (vlan_tx_tag_present(skb)) { ··· 3264 3226 if (vlan_do_receive(&skb, !rx_handler)) 3265 3227 goto another_round; 3266 3228 else if (unlikely(!skb)) 3267 - goto out; 3229 + goto unlock; 3268 3230 } 3269 3231 3270 3232 if (rx_handler) { ··· 3274 3236 } 3275 3237 switch (rx_handler(&skb)) { 3276 3238 case RX_HANDLER_CONSUMED: 3277 - goto out; 3239 + goto unlock; 3278 3240 case RX_HANDLER_ANOTHER: 3279 3241 goto another_round; 3280 3242 case RX_HANDLER_EXACT: ··· 3307 3269 else 3308 3270 ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); 3309 3271 } else { 3272 + drop: 3310 3273 atomic_long_inc(&skb->dev->rx_dropped); 3311 3274 kfree_skb(skb); 3312 3275 /* Jamal, now you will not able to escape explaining ··· 3316 3277 ret = NET_RX_DROP; 3317 
3278 } 3318 3279 3319 - out: 3280 + unlock: 3320 3281 rcu_read_unlock(); 3282 + out: 3283 + tsk_restore_flags(current, pflags, PF_MEMALLOC); 3321 3284 return ret; 3322 3285 } 3323 3286
+8
net/core/filter.c
··· 83 83 int err; 84 84 struct sk_filter *filter; 85 85 86 + /* 87 + * If the skb was allocated from pfmemalloc reserves, only 88 + * allow SOCK_MEMALLOC sockets to use it as this socket is 89 + * helping free memory 90 + */ 91 + if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) 92 + return -ENOMEM; 93 + 86 94 err = security_sock_rcv_skb(sk, skb); 87 95 if (err) 88 96 return err;
+99 -25
net/core/skbuff.c
··· 145 145 BUG(); 146 146 } 147 147 148 + 149 + /* 150 + * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells 151 + * the caller if emergency pfmemalloc reserves are being used. If it is and 152 + * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves 153 + * may be used. Otherwise, the packet data may be discarded until enough 154 + * memory is free 155 + */ 156 + #define kmalloc_reserve(size, gfp, node, pfmemalloc) \ 157 + __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc) 158 + void *__kmalloc_reserve(size_t size, gfp_t flags, int node, unsigned long ip, 159 + bool *pfmemalloc) 160 + { 161 + void *obj; 162 + bool ret_pfmemalloc = false; 163 + 164 + /* 165 + * Try a regular allocation, when that fails and we're not entitled 166 + * to the reserves, fail. 167 + */ 168 + obj = kmalloc_node_track_caller(size, 169 + flags | __GFP_NOMEMALLOC | __GFP_NOWARN, 170 + node); 171 + if (obj || !(gfp_pfmemalloc_allowed(flags))) 172 + goto out; 173 + 174 + /* Try again but now we are using pfmemalloc reserves */ 175 + ret_pfmemalloc = true; 176 + obj = kmalloc_node_track_caller(size, flags, node); 177 + 178 + out: 179 + if (pfmemalloc) 180 + *pfmemalloc = ret_pfmemalloc; 181 + 182 + return obj; 183 + } 184 + 148 185 /* Allocate a new skbuff. We do this ourselves so we can fill in a few 149 186 * 'private' fields and also do memory statistics to find all the 150 187 * [BEEP] leaks. ··· 192 155 * __alloc_skb - allocate a network buffer 193 156 * @size: size to allocate 194 157 * @gfp_mask: allocation mask 195 - * @fclone: allocate from fclone cache instead of head cache 196 - * and allocate a cloned (child) skb 158 + * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache 159 + * instead of head cache and allocate a cloned (child) skb. 
160 + * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for 161 + * allocations in case the data is required for writeback 197 162 * @node: numa node to allocate memory on 198 163 * 199 164 * Allocate a new &sk_buff. The returned buffer has no headroom and a ··· 206 167 * %GFP_ATOMIC. 207 168 */ 208 169 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask, 209 - int fclone, int node) 170 + int flags, int node) 210 171 { 211 172 struct kmem_cache *cache; 212 173 struct skb_shared_info *shinfo; 213 174 struct sk_buff *skb; 214 175 u8 *data; 176 + bool pfmemalloc; 215 177 216 - cache = fclone ? skbuff_fclone_cache : skbuff_head_cache; 178 + cache = (flags & SKB_ALLOC_FCLONE) 179 + ? skbuff_fclone_cache : skbuff_head_cache; 180 + 181 + if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX)) 182 + gfp_mask |= __GFP_MEMALLOC; 217 183 218 184 /* Get the HEAD */ 219 185 skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node); ··· 233 189 */ 234 190 size = SKB_DATA_ALIGN(size); 235 191 size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); 236 - data = kmalloc_node_track_caller(size, gfp_mask, node); 192 + data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc); 237 193 if (!data) 238 194 goto nodata; 239 195 /* kmalloc(size) might give us more room than requested. 
··· 251 207 memset(skb, 0, offsetof(struct sk_buff, tail)); 252 208 /* Account for allocated memory : skb + skb->head */ 253 209 skb->truesize = SKB_TRUESIZE(size); 210 + skb->pfmemalloc = pfmemalloc; 254 211 atomic_set(&skb->users, 1); 255 212 skb->head = data; 256 213 skb->data = data; ··· 267 222 atomic_set(&shinfo->dataref, 1); 268 223 kmemcheck_annotate_variable(shinfo->destructor_arg); 269 224 270 - if (fclone) { 225 + if (flags & SKB_ALLOC_FCLONE) { 271 226 struct sk_buff *child = skb + 1; 272 227 atomic_t *fclone_ref = (atomic_t *) (child + 1); 273 228 ··· 277 232 atomic_set(fclone_ref, 1); 278 233 279 234 child->fclone = SKB_FCLONE_UNAVAILABLE; 235 + child->pfmemalloc = pfmemalloc; 280 236 } 281 237 out: 282 238 return skb; ··· 348 302 349 303 #define NETDEV_PAGECNT_BIAS (PAGE_SIZE / SMP_CACHE_BYTES) 350 304 351 - /** 352 - * netdev_alloc_frag - allocate a page fragment 353 - * @fragsz: fragment size 354 - * 355 - * Allocates a frag from a page for receive buffer. 356 - * Uses GFP_ATOMIC allocations. 357 - */ 358 - void *netdev_alloc_frag(unsigned int fragsz) 305 + static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask) 359 306 { 360 307 struct netdev_alloc_cache *nc; 361 308 void *data = NULL; ··· 358 319 nc = &__get_cpu_var(netdev_alloc_cache); 359 320 if (unlikely(!nc->page)) { 360 321 refill: 361 - nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD); 322 + nc->page = alloc_page(gfp_mask); 362 323 if (unlikely(!nc->page)) 363 324 goto end; 364 325 recycle: ··· 381 342 end: 382 343 local_irq_restore(flags); 383 344 return data; 345 + } 346 + 347 + /** 348 + * netdev_alloc_frag - allocate a page fragment 349 + * @fragsz: fragment size 350 + * 351 + * Allocates a frag from a page for receive buffer. 352 + * Uses GFP_ATOMIC allocations. 
353 + */ 354 + void *netdev_alloc_frag(unsigned int fragsz) 355 + { 356 + return __netdev_alloc_frag(fragsz, GFP_ATOMIC | __GFP_COLD); 384 357 } 385 358 EXPORT_SYMBOL(netdev_alloc_frag); 386 359 ··· 417 366 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); 418 367 419 368 if (fragsz <= PAGE_SIZE && !(gfp_mask & (__GFP_WAIT | GFP_DMA))) { 420 - void *data = netdev_alloc_frag(fragsz); 369 + void *data; 370 + 371 + if (sk_memalloc_socks()) 372 + gfp_mask |= __GFP_MEMALLOC; 373 + 374 + data = __netdev_alloc_frag(fragsz, gfp_mask); 421 375 422 376 if (likely(data)) { 423 377 skb = build_skb(data, fragsz); ··· 430 374 put_page(virt_to_head_page(data)); 431 375 } 432 376 } else { 433 - skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE); 377 + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 378 + SKB_ALLOC_RX, NUMA_NO_NODE); 434 379 } 435 380 if (likely(skb)) { 436 381 skb_reserve(skb, NET_SKB_PAD); ··· 713 656 #if IS_ENABLED(CONFIG_IP_VS) 714 657 new->ipvs_property = old->ipvs_property; 715 658 #endif 659 + new->pfmemalloc = old->pfmemalloc; 716 660 new->protocol = old->protocol; 717 661 new->mark = old->mark; 718 662 new->skb_iif = old->skb_iif; ··· 872 814 n->fclone = SKB_FCLONE_CLONE; 873 815 atomic_inc(fclone_ref); 874 816 } else { 817 + if (skb_pfmemalloc(skb)) 818 + gfp_mask |= __GFP_MEMALLOC; 819 + 875 820 n = kmem_cache_alloc(skbuff_head_cache, gfp_mask); 876 821 if (!n) 877 822 return NULL; ··· 911 850 skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type; 912 851 } 913 852 853 + static inline int skb_alloc_rx_flag(const struct sk_buff *skb) 854 + { 855 + if (skb_pfmemalloc(skb)) 856 + return SKB_ALLOC_RX; 857 + return 0; 858 + } 859 + 914 860 /** 915 861 * skb_copy - create private copy of an sk_buff 916 862 * @skb: buffer to copy ··· 939 871 { 940 872 int headerlen = skb_headroom(skb); 941 873 unsigned int size = skb_end_offset(skb) + skb->data_len; 942 - struct sk_buff *n = alloc_skb(size, gfp_mask); 874 + struct sk_buff *n = __alloc_skb(size, 
gfp_mask, 875 + skb_alloc_rx_flag(skb), NUMA_NO_NODE); 943 876 944 877 if (!n) 945 878 return NULL; ··· 975 906 struct sk_buff *__pskb_copy(struct sk_buff *skb, int headroom, gfp_t gfp_mask) 976 907 { 977 908 unsigned int size = skb_headlen(skb) + headroom; 978 - struct sk_buff *n = alloc_skb(size, gfp_mask); 909 + struct sk_buff *n = __alloc_skb(size, gfp_mask, 910 + skb_alloc_rx_flag(skb), NUMA_NO_NODE); 979 911 980 912 if (!n) 981 913 goto out; ··· 1049 979 1050 980 size = SKB_DATA_ALIGN(size); 1051 981 1052 - data = kmalloc(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), 1053 - gfp_mask); 982 + if (skb_pfmemalloc(skb)) 983 + gfp_mask |= __GFP_MEMALLOC; 984 + data = kmalloc_reserve(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), 985 + gfp_mask, NUMA_NO_NODE, NULL); 1054 986 if (!data) 1055 987 goto nodata; 1056 988 size = SKB_WITH_OVERHEAD(ksize(data)); ··· 1164 1092 /* 1165 1093 * Allocate the copy buffer 1166 1094 */ 1167 - struct sk_buff *n = alloc_skb(newheadroom + skb->len + newtailroom, 1168 - gfp_mask); 1095 + struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom, 1096 + gfp_mask, skb_alloc_rx_flag(skb), 1097 + NUMA_NO_NODE); 1169 1098 int oldheadroom = skb_headroom(skb); 1170 1099 int head_copy_len, head_copy_off; 1171 1100 int off; ··· 2848 2775 skb_release_head_state(nskb); 2849 2776 __skb_push(nskb, doffset); 2850 2777 } else { 2851 - nskb = alloc_skb(hsize + doffset + headroom, 2852 - GFP_ATOMIC); 2778 + nskb = __alloc_skb(hsize + doffset + headroom, 2779 + GFP_ATOMIC, skb_alloc_rx_flag(skb), 2780 + NUMA_NO_NODE); 2853 2781 2854 2782 if (unlikely(!nskb)) 2855 2783 goto err;
+57 -2
net/core/sock.c
··· 142 142 static DEFINE_MUTEX(proto_list_mutex); 143 143 static LIST_HEAD(proto_list); 144 144 145 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 145 + #ifdef CONFIG_MEMCG_KMEM 146 146 int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss) 147 147 { 148 148 struct proto *proto; ··· 271 271 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512); 272 272 EXPORT_SYMBOL(sysctl_optmem_max); 273 273 274 + struct static_key memalloc_socks = STATIC_KEY_INIT_FALSE; 275 + EXPORT_SYMBOL_GPL(memalloc_socks); 276 + 277 + /** 278 + * sk_set_memalloc - sets %SOCK_MEMALLOC 279 + * @sk: socket to set it on 280 + * 281 + * Set %SOCK_MEMALLOC on a socket for access to emergency reserves. 282 + * It's the responsibility of the admin to adjust min_free_kbytes 283 + * to meet the requirements 284 + */ 285 + void sk_set_memalloc(struct sock *sk) 286 + { 287 + sock_set_flag(sk, SOCK_MEMALLOC); 288 + sk->sk_allocation |= __GFP_MEMALLOC; 289 + static_key_slow_inc(&memalloc_socks); 290 + } 291 + EXPORT_SYMBOL_GPL(sk_set_memalloc); 292 + 293 + void sk_clear_memalloc(struct sock *sk) 294 + { 295 + sock_reset_flag(sk, SOCK_MEMALLOC); 296 + sk->sk_allocation &= ~__GFP_MEMALLOC; 297 + static_key_slow_dec(&memalloc_socks); 298 + 299 + /* 300 + * SOCK_MEMALLOC is allowed to ignore rmem limits to ensure forward 301 + * progress of swapping. However, if SOCK_MEMALLOC is cleared while 302 + * it has rmem allocations there is a risk that the user of the 303 + * socket cannot make forward progress due to exceeding the rmem 304 + * limits. By rights, sk_clear_memalloc() should only be called 305 + * on sockets being torn down but warn and reset the accounting if 306 + * that assumption breaks. 
307 + */ 308 + if (WARN_ON(sk->sk_forward_alloc)) 309 + sk_mem_reclaim(sk); 310 + } 311 + EXPORT_SYMBOL_GPL(sk_clear_memalloc); 312 + 313 + int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) 314 + { 315 + int ret; 316 + unsigned long pflags = current->flags; 317 + 318 + /* these should have been dropped before queueing */ 319 + BUG_ON(!sock_flag(sk, SOCK_MEMALLOC)); 320 + 321 + current->flags |= PF_MEMALLOC; 322 + ret = sk->sk_backlog_rcv(sk, skb); 323 + tsk_restore_flags(current, pflags, PF_MEMALLOC); 324 + 325 + return ret; 326 + } 327 + EXPORT_SYMBOL(__sk_backlog_rcv); 328 + 274 329 #if defined(CONFIG_CGROUPS) 275 330 #if !defined(CONFIG_NET_CLS_CGROUP) 276 331 int net_cls_subsys_id = -1; ··· 408 353 if (err) 409 354 return err; 410 355 411 - if (!sk_rmem_schedule(sk, skb->truesize)) { 356 + if (!sk_rmem_schedule(sk, skb, skb->truesize)) { 412 357 atomic_inc(&sk->sk_drops); 413 358 return -ENOBUFS; 414 359 }
+1 -1
net/ipv4/Makefile
··· 49 49 obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o 50 50 obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o 51 51 obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o 52 - obj-$(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) += tcp_memcontrol.o 52 + obj-$(CONFIG_MEMCG_KMEM) += tcp_memcontrol.o 53 53 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o 54 54 55 55 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
+2 -2
net/ipv4/sysctl_net_ipv4.c
··· 184 184 int ret; 185 185 unsigned long vec[3]; 186 186 struct net *net = current->nsproxy->net_ns; 187 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 187 + #ifdef CONFIG_MEMCG_KMEM 188 188 struct mem_cgroup *memcg; 189 189 #endif 190 190 ··· 203 203 if (ret) 204 204 return ret; 205 205 206 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM 206 + #ifdef CONFIG_MEMCG_KMEM 207 207 rcu_read_lock(); 208 208 memcg = mem_cgroup_from_task(current); 209 209
+11 -10
net/ipv4/tcp_input.c
··· 4351 4351 static bool tcp_prune_ofo_queue(struct sock *sk);
4352 4352 static int tcp_prune_queue(struct sock *sk);
4353 4353 
4354 - static int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
4354 + static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
4355 + unsigned int size)
4355 4356 {
4356 4357 if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
4357 - !sk_rmem_schedule(sk, size)) {
4358 + !sk_rmem_schedule(sk, skb, size)) {
4358 4359 
4359 4360 if (tcp_prune_queue(sk) < 0)
4360 4361 return -1;
4361 4362 
4362 - if (!sk_rmem_schedule(sk, size)) {
4363 + if (!sk_rmem_schedule(sk, skb, size)) {
4363 4364 if (!tcp_prune_ofo_queue(sk))
4364 4365 return -1;
4365 4366 
4366 - if (!sk_rmem_schedule(sk, size))
4367 + if (!sk_rmem_schedule(sk, skb, size))
4367 4368 return -1;
4368 4369 }
4369 4370 }
··· 4419 4418 
4420 4419 TCP_ECN_check_ce(tp, skb);
4421 4420 
4422 - if (unlikely(tcp_try_rmem_schedule(sk, skb->truesize))) {
4421 + if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
4423 4422 NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
4424 4423 __kfree_skb(skb);
4425 4424 return;
··· 4553 4552 
4554 4553 int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
4555 4554 {
4556 - struct sk_buff *skb;
4555 + struct sk_buff *skb = NULL;
4557 4556 struct tcphdr *th;
4558 4557 bool fragstolen;
4559 - 
4560 - if (tcp_try_rmem_schedule(sk, size + sizeof(*th)))
4561 - goto err;
4562 4558 
4563 4559 skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
4564 4560 if (!skb)
4565 4561 goto err;
4562 + 
4563 + if (tcp_try_rmem_schedule(sk, skb, size + sizeof(*th)))
4564 + goto err_free;
4566 4565 
4567 4566 th = (struct tcphdr *)skb_put(skb, sizeof(*th));
4568 4567 skb_reset_transport_header(skb);
··· 4634 4633 if (eaten <= 0) {
4635 4634 queue_and_out:
4636 4635 if (eaten < 0 &&
4637 - tcp_try_rmem_schedule(sk, skb->truesize))
4636 + tcp_try_rmem_schedule(sk, skb, skb->truesize))
4638 4637 goto drop;
4639 4638 
4640 4639 eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+1 -1
net/ipv4/tcp_ipv4.c
··· 2633 2633 .compat_setsockopt = compat_tcp_setsockopt,
2634 2634 .compat_getsockopt = compat_tcp_getsockopt,
2635 2635 #endif
2636 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
2636 + #ifdef CONFIG_MEMCG_KMEM
2637 2637 .init_cgroup = tcp_init_cgroup,
2638 2638 .destroy_cgroup = tcp_destroy_cgroup,
2639 2639 .proto_cgroup = tcp_proto_cgroup,
+7 -5
net/ipv4/tcp_output.c
··· 2045 2045 if (unlikely(sk->sk_state == TCP_CLOSE))
2046 2046 return;
2047 2047 
2048 - if (tcp_write_xmit(sk, cur_mss, nonagle, 0, GFP_ATOMIC))
2048 + if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
2049 + sk_gfp_atomic(sk, GFP_ATOMIC)))
2049 2050 tcp_check_probe_timer(sk);
2050 2051 }
2051 2052 
··· 2667 2666 
2668 2667 if (cvp != NULL && cvp->s_data_constant && cvp->s_data_desired)
2669 2668 s_data_desired = cvp->s_data_desired;
2670 - skb = alloc_skb(MAX_TCP_HEADER + 15 + s_data_desired, GFP_ATOMIC);
2669 + skb = alloc_skb(MAX_TCP_HEADER + 15 + s_data_desired,
2670 + sk_gfp_atomic(sk, GFP_ATOMIC));
2671 2671 if (unlikely(!skb)) {
2672 2672 dst_release(dst);
2673 2673 return NULL;
··· 3066 3064 * tcp_transmit_skb() will set the ownership to this
3067 3065 * sock.
3068 3066 */
3069 - buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
3067 + buff = alloc_skb(MAX_TCP_HEADER, sk_gfp_atomic(sk, GFP_ATOMIC));
3070 3068 if (buff == NULL) {
3071 3069 inet_csk_schedule_ack(sk);
3072 3070 inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
··· 3081 3079 
3082 3080 /* Send it off, this clears delayed acks for us. */
3083 3081 TCP_SKB_CB(buff)->when = tcp_time_stamp;
3084 - tcp_transmit_skb(sk, buff, 0, GFP_ATOMIC);
3082 + tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
3085 3083 }
3086 3084 
3087 3085 /* This routine sends a packet with an out of date sequence
··· 3101 3099 struct sk_buff *skb;
3102 3100 
3103 3101 /* We don't queue it, tcp_transmit_skb() sets ownership. */
3104 - skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
3102 + skb = alloc_skb(MAX_TCP_HEADER, sk_gfp_atomic(sk, GFP_ATOMIC));
3105 3103 if (skb == NULL)
3106 3104 return -1;
3107 3105 
+6 -4
net/ipv6/tcp_ipv6.c
··· 1299 1299 /* Clone pktoptions received with SYN */
1300 1300 newnp->pktoptions = NULL;
1301 1301 if (treq->pktopts != NULL) {
1302 - newnp->pktoptions = skb_clone(treq->pktopts, GFP_ATOMIC);
1302 + newnp->pktoptions = skb_clone(treq->pktopts,
1303 + sk_gfp_atomic(sk, GFP_ATOMIC));
1303 1304 consume_skb(treq->pktopts);
1304 1305 treq->pktopts = NULL;
1305 1306 if (newnp->pktoptions)
··· 1350 1349 * across. Shucks.
1351 1350 */
1352 1351 tcp_md5_do_add(newsk, (union tcp_md5_addr *)&newnp->daddr,
1353 - AF_INET6, key->key, key->keylen, GFP_ATOMIC);
1352 + AF_INET6, key->key, key->keylen,
1353 + sk_gfp_atomic(sk, GFP_ATOMIC));
1354 1354 }
1355 1355 #endif
1356 1356 
··· 1444 1442 --ANK (980728)
1445 1443 */
1446 1444 if (np->rxopt.all)
1447 - opt_skb = skb_clone(skb, GFP_ATOMIC);
1445 + opt_skb = skb_clone(skb, sk_gfp_atomic(sk, GFP_ATOMIC));
1448 1446 
1449 1447 if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
1450 1448 sock_rps_save_rxhash(sk, skb);
··· 2017 2015 .compat_setsockopt = compat_tcp_setsockopt,
2018 2016 .compat_getsockopt = compat_tcp_getsockopt,
2019 2017 #endif
2020 - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
2018 + #ifdef CONFIG_MEMCG_KMEM
2021 2019 .proto_cgroup = tcp_proto_cgroup,
2022 2020 #endif
2023 2021 };
+2 -1
net/sctp/ulpevent.c
··· 702 702 if (rx_count >= asoc->base.sk->sk_rcvbuf) {
703 703 
704 704 if ((asoc->base.sk->sk_userlocks & SOCK_RCVBUF_LOCK) ||
705 - (!sk_rmem_schedule(asoc->base.sk, chunk->skb->truesize)))
705 + (!sk_rmem_schedule(asoc->base.sk, chunk->skb,
706 + chunk->skb->truesize)))
706 707 goto fail;
707 708 }
708 709 
+5
net/sunrpc/Kconfig
··· 21 21 
22 22 If unsure, say N.
23 23 
24 + config SUNRPC_SWAP
25 + bool
26 + depends on SUNRPC
27 + select NETVM
28 + 
24 29 config RPCSEC_GSS_KRB5
25 30 tristate "Secure RPC: Kerberos V mechanism"
26 31 depends on SUNRPC && CRYPTO
+9
net/sunrpc/clnt.c
··· 717 717 atomic_inc(&clnt->cl_count);
718 718 if (clnt->cl_softrtry)
719 719 task->tk_flags |= RPC_TASK_SOFT;
720 + if (sk_memalloc_socks()) {
721 + struct rpc_xprt *xprt;
722 + 
723 + rcu_read_lock();
724 + xprt = rcu_dereference(clnt->cl_xprt);
725 + if (xprt->swapper)
726 + task->tk_flags |= RPC_TASK_SWAPPER;
727 + rcu_read_unlock();
728 + }
720 729 /* Add to the client's list of all tasks */
721 730 spin_lock(&clnt->cl_lock);
722 731 list_add_tail(&task->tk_task, &clnt->cl_tasks);
+5 -2
net/sunrpc/sched.c
··· 815 815 void *rpc_malloc(struct rpc_task *task, size_t size)
816 816 {
817 817 struct rpc_buffer *buf;
818 - gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
818 + gfp_t gfp = GFP_NOWAIT;
819 + 
820 + if (RPC_IS_SWAPPER(task))
821 + gfp |= __GFP_MEMALLOC;
819 822 
820 823 size += sizeof(struct rpc_buffer);
821 824 if (size <= RPC_BUFFER_MAXSIZE)
··· 892 889 static struct rpc_task *
893 890 rpc_alloc_task(void)
894 891 {
895 - return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
892 + return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
896 893 }
897 894 
898 895 /*
+43
net/sunrpc/xprtsock.c
··· 1930 1930 current->flags &= ~PF_FSTRANS;
1931 1931 }
1932 1932 
1933 + #ifdef CONFIG_SUNRPC_SWAP
1934 + static void xs_set_memalloc(struct rpc_xprt *xprt)
1935 + {
1936 + struct sock_xprt *transport = container_of(xprt, struct sock_xprt,
1937 + xprt);
1938 + 
1939 + if (xprt->swapper)
1940 + sk_set_memalloc(transport->inet);
1941 + }
1942 + 
1943 + /**
1944 + * xs_swapper - Tag this transport as being used for swap.
1945 + * @xprt: transport to tag
1946 + * @enable: enable/disable
1947 + *
1948 + */
1949 + int xs_swapper(struct rpc_xprt *xprt, int enable)
1950 + {
1951 + struct sock_xprt *transport = container_of(xprt, struct sock_xprt,
1952 + xprt);
1953 + int err = 0;
1954 + 
1955 + if (enable) {
1956 + xprt->swapper++;
1957 + xs_set_memalloc(xprt);
1958 + } else if (xprt->swapper) {
1959 + xprt->swapper--;
1960 + sk_clear_memalloc(transport->inet);
1961 + }
1962 + 
1963 + return err;
1964 + }
1965 + EXPORT_SYMBOL_GPL(xs_swapper);
1966 + #else
1967 + static void xs_set_memalloc(struct rpc_xprt *xprt)
1968 + {
1969 + }
1970 + #endif
1971 + 
1933 1972 static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
1934 1973 {
1935 1974 struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
··· 1992 1953 /* Reset to new socket */
1993 1954 transport->sock = sock;
1994 1955 transport->inet = sk;
1956 + 
1957 + xs_set_memalloc(xprt);
1995 1958 
1996 1959 write_unlock_bh(&sk->sk_callback_lock);
1997 1960 }
··· 2121 2080 
2122 2081 if (!xprt_bound(xprt))
2123 2082 goto out;
2083 + 
2084 + xs_set_memalloc(xprt);
2124 2085 
2125 2086 /* Tell the socket layer to start connecting... */
2126 2087 xprt->stat.connect_count++;
+1 -1
security/selinux/avc.c
··· 274 274 {
275 275 struct avc_node *node;
276 276 
277 - node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC);
277 + node = kmem_cache_zalloc(avc_node_cachep, GFP_ATOMIC|__GFP_NOMEMALLOC);
278 278 if (!node)
279 279 goto out;
280 280 
+1 -1
tools/testing/fault-injection/failcmd.sh
··· 206 206 esac
207 207 done
208 208 
209 - [ -z "$@" ] && exit 0
209 + [ -z "$1" ] && exit 0
210 210 
211 211 echo $oom_kill_allocating_task > /proc/sys/vm/oom_kill_allocating_task
212 212 echo $task_filter > $FAULTATTR/task-filter
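The failcmd.sh one-liner fixes a shell pitfall: inside `[ ]`, quoted `"$@"` expands to one word per remaining positional parameter, so with two or more arguments `test` receives too many operands and errors out, whereas `[ -z "$1" ]` checks exactly one word. A small standalone demonstration (the `check_old`/`check_new` helper names are invented for the demo):

```shell
#!/bin/sh
# "$@" inside [ ] becomes one word per argument; with several arguments
# the test degenerates into "[: too many arguments" (suppressed here)
# and its exit status no longer means "args are empty".
check_old() { [ -z "$@" ] 2>/dev/null && echo empty || echo not-empty; }
# "$1" is always a single word, so the test is well-formed.
check_new() { [ -z "$1" ] && echo empty || echo not-empty; }

check_old                 # prints "empty"
check_new                 # prints "empty"
check_new make -j2        # prints "not-empty"
check_old make -j2        # test errors out; "not-empty" only by accident
```

`[ $# -eq 0 ]` would express the intent ("no command given") even more directly; the patch keeps the minimal `"$1"` form.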