Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'akpm' (patchbomb from Andrew)

Merge first patchbomb from Andrew Morton:
- a few minor cifs fixes
- dma-debug updates
- ocfs2
- slab
- about half of MM
- procfs
- kernel/exit.c
- panic.c tweaks
- printk updates
- lib/ updates
- checkpatch updates
- fs/binfmt updates
- the drivers/rtc tree
- nilfs
- kmod fixes
- more kernel/exit.c
- various other misc tweaks and fixes

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
exit: pidns: fix/update the comments in zap_pid_ns_processes()
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
exit: exit_notify: re-use "dead" list to autoreap current
exit: reparent: call forget_original_parent() under tasklist_lock
exit: reparent: avoid find_new_reaper() if no children
exit: reparent: introduce find_alive_thread()
exit: reparent: introduce find_child_reaper()
exit: reparent: document the ->has_child_subreaper checks
exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
exit: proc: don't try to flush /proc/tgid/task/tgid
exit: release_task: fix the comment about group leader accounting
exit: wait: drop tasklist_lock before psig->c* accounting
exit: wait: don't use zombie->real_parent
exit: wait: cleanup the ptrace_reparented() checks
usermodehelper: kill the kmod_thread_locker logic
usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
...

+3681 -3966
+1 -1
Documentation/cgroups/hugetlb.txt
··· 29 29 30 30 hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage 31 31 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded 32 - hugetlb.<hugepagesize>.usage_in_bytes # show current res_counter usage for "hugepagesize" hugetlb 32 + hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb 33 33 hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit 34 34 35 35 For a system supporting two hugepage size (16M and 16G) the control
+15 -11
Documentation/cgroups/memory.txt
··· 1 1 Memory Resource Controller 2 2 3 + NOTE: This document is hopelessly outdated and it asks for a complete 4 + rewrite. It still contains a useful information so we are keeping it 5 + here but make sure to check the current code if you need a deeper 6 + understanding. 7 + 3 8 NOTE: The Memory Resource Controller has generically been referred to as the 4 9 memory controller in this document. Do not confuse memory controller 5 10 used here with the memory controller that is used in hardware. ··· 57 52 tasks # attach a task(thread) and show list of threads 58 53 cgroup.procs # show list of processes 59 54 cgroup.event_control # an interface for event_fd() 60 - memory.usage_in_bytes # show current res_counter usage for memory 55 + memory.usage_in_bytes # show current usage for memory 61 56 (See 5.5 for details) 62 - memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap 57 + memory.memsw.usage_in_bytes # show current usage for memory+Swap 63 58 (See 5.5 for details) 64 59 memory.limit_in_bytes # set/show limit of memory usage 65 60 memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage ··· 121 116 122 117 2.1. Design 123 118 124 - The core of the design is a counter called the res_counter. The res_counter 125 - tracks the current memory usage and limit of the group of processes associated 126 - with the controller. Each cgroup has a memory controller specific data 127 - structure (mem_cgroup) associated with it. 119 + The core of the design is a counter called the page_counter. The 120 + page_counter tracks the current memory usage and limit of the group of 121 + processes associated with the controller. Each cgroup has a memory controller 122 + specific data structure (mem_cgroup) associated with it. 128 123 129 124 2.2. 
Accounting 130 125 131 126 +--------------------+ 132 - | mem_cgroup | 133 - | (res_counter) | 127 + | mem_cgroup | 128 + | (page_counter) | 134 129 +--------------------+ 135 130 / ^ \ 136 131 / | \ ··· 357 352 0. Configuration 358 353 359 354 a. Enable CONFIG_CGROUPS 360 - b. Enable CONFIG_RESOURCE_COUNTERS 361 - c. Enable CONFIG_MEMCG 362 - d. Enable CONFIG_MEMCG_SWAP (to use swap extension) 355 + b. Enable CONFIG_MEMCG 356 + c. Enable CONFIG_MEMCG_SWAP (to use swap extension) 363 357 d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 364 358 365 359 1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-197
Documentation/cgroups/resource_counter.txt
··· 1 - 2 - The Resource Counter 3 - 4 - The resource counter, declared at include/linux/res_counter.h, 5 - is supposed to facilitate the resource management by controllers 6 - by providing common stuff for accounting. 7 - 8 - This "stuff" includes the res_counter structure and routines 9 - to work with it. 10 - 11 - 12 - 13 - 1. Crucial parts of the res_counter structure 14 - 15 - a. unsigned long long usage 16 - 17 - The usage value shows the amount of a resource that is consumed 18 - by a group at a given time. The units of measurement should be 19 - determined by the controller that uses this counter. E.g. it can 20 - be bytes, items or any other unit the controller operates on. 21 - 22 - b. unsigned long long max_usage 23 - 24 - The maximal value of the usage over time. 25 - 26 - This value is useful when gathering statistical information about 27 - the particular group, as it shows the actual resource requirements 28 - for a particular group, not just some usage snapshot. 29 - 30 - c. unsigned long long limit 31 - 32 - The maximal allowed amount of resource to consume by the group. In 33 - case the group requests for more resources, so that the usage value 34 - would exceed the limit, the resource allocation is rejected (see 35 - the next section). 36 - 37 - d. unsigned long long failcnt 38 - 39 - The failcnt stands for "failures counter". This is the number of 40 - resource allocation attempts that failed. 41 - 42 - c. spinlock_t lock 43 - 44 - Protects changes of the above values. 45 - 46 - 47 - 48 - 2. Basic accounting routines 49 - 50 - a. void res_counter_init(struct res_counter *rc, 51 - struct res_counter *rc_parent) 52 - 53 - Initializes the resource counter. As usual, should be the first 54 - routine called for a new counter. 55 - 56 - The struct res_counter *parent can be used to define a hierarchical 57 - child -> parent relationship directly in the res_counter structure, 58 - NULL can be used to define no relationship. 59 - 60 - c. 
int res_counter_charge(struct res_counter *rc, unsigned long val, 61 - struct res_counter **limit_fail_at) 62 - 63 - When a resource is about to be allocated it has to be accounted 64 - with the appropriate resource counter (controller should determine 65 - which one to use on its own). This operation is called "charging". 66 - 67 - This is not very important which operation - resource allocation 68 - or charging - is performed first, but 69 - * if the allocation is performed first, this may create a 70 - temporary resource over-usage by the time resource counter is 71 - charged; 72 - * if the charging is performed first, then it should be uncharged 73 - on error path (if the one is called). 74 - 75 - If the charging fails and a hierarchical dependency exists, the 76 - limit_fail_at parameter is set to the particular res_counter element 77 - where the charging failed. 78 - 79 - d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val) 80 - 81 - When a resource is released (freed) it should be de-accounted 82 - from the resource counter it was accounted to. This is called 83 - "uncharging". The return value of this function indicate the amount 84 - of charges still present in the counter. 85 - 86 - The _locked routines imply that the res_counter->lock is taken. 87 - 88 - e. u64 res_counter_uncharge_until 89 - (struct res_counter *rc, struct res_counter *top, 90 - unsigned long val) 91 - 92 - Almost same as res_counter_uncharge() but propagation of uncharge 93 - stops when rc == top. This is useful when kill a res_counter in 94 - child cgroup. 95 - 96 - 2.1 Other accounting routines 97 - 98 - There are more routines that may help you with common needs, like 99 - checking whether the limit is reached or resetting the max_usage 100 - value. They are all declared in include/linux/res_counter.h. 101 - 102 - 103 - 104 - 3. Analyzing the resource counter registrations 105 - 106 - a. 
If the failcnt value constantly grows, this means that the counter's 107 - limit is too tight. Either the group is misbehaving and consumes too 108 - many resources, or the configuration is not suitable for the group 109 - and the limit should be increased. 110 - 111 - b. The max_usage value can be used to quickly tune the group. One may 112 - set the limits to maximal values and either load the container with 113 - a common pattern or leave one for a while. After this the max_usage 114 - value shows the amount of memory the container would require during 115 - its common activity. 116 - 117 - Setting the limit a bit above this value gives a pretty good 118 - configuration that works in most of the cases. 119 - 120 - c. If the max_usage is much less than the limit, but the failcnt value 121 - is growing, then the group tries to allocate a big chunk of resource 122 - at once. 123 - 124 - d. If the max_usage is much less than the limit, but the failcnt value 125 - is 0, then this group is given too high limit, that it does not 126 - require. It is better to lower the limit a bit leaving more resource 127 - for other groups. 128 - 129 - 130 - 131 - 4. Communication with the control groups subsystem (cgroups) 132 - 133 - All the resource controllers that are using cgroups and resource counters 134 - should provide files (in the cgroup filesystem) to work with the resource 135 - counter fields. They are recommended to adhere to the following rules: 136 - 137 - a. File names 138 - 139 - Field name File name 140 - --------------------------------------------------- 141 - usage usage_in_<unit_of_measurement> 142 - max_usage max_usage_in_<unit_of_measurement> 143 - limit limit_in_<unit_of_measurement> 144 - failcnt failcnt 145 - lock no file :) 146 - 147 - b. Reading from file should show the corresponding field value in the 148 - appropriate format. 149 - 150 - c. 
Writing to file 151 - 152 - Field Expected behavior 153 - ---------------------------------- 154 - usage prohibited 155 - max_usage reset to usage 156 - limit set the limit 157 - failcnt reset to zero 158 - 159 - 160 - 161 - 5. Usage example 162 - 163 - a. Declare a task group (take a look at cgroups subsystem for this) and 164 - fold a res_counter into it 165 - 166 - struct my_group { 167 - struct res_counter res; 168 - 169 - <other fields> 170 - } 171 - 172 - b. Put hooks in resource allocation/release paths 173 - 174 - int alloc_something(...) 175 - { 176 - if (res_counter_charge(res_counter_ptr, amount) < 0) 177 - return -ENOMEM; 178 - 179 - <allocate the resource and return to the caller> 180 - } 181 - 182 - void release_something(...) 183 - { 184 - res_counter_uncharge(res_counter_ptr, amount); 185 - 186 - <release the resource> 187 - } 188 - 189 - In order to keep the usage value self-consistent, both the 190 - "res_counter_ptr" and the "amount" in release_something() should be 191 - the same as they were in the alloc_something() when the releasing 192 - resource was allocated. 193 - 194 - c. Provide the way to read res_counter values and set them (the cgroups 195 - still can help with it). 196 - 197 - c. Compile and run :)
+8 -1
Documentation/devicetree/bindings/rtc/rtc-omap.txt
··· 5 5 - "ti,da830-rtc" - for RTC IP used similar to that on DA8xx SoC family. 6 6 - "ti,am3352-rtc" - for RTC IP used similar to that on AM335x SoC family. 7 7 This RTC IP has special WAKE-EN Register to enable 8 - Wakeup generation for event Alarm. 8 + Wakeup generation for event Alarm. It can also be 9 + used to control an external PMIC via the 10 + pmic_power_en pin. 9 11 - reg: Address range of rtc register set 10 12 - interrupts: rtc timer, alarm interrupts in order 11 13 - interrupt-parent: phandle for the interrupt controller 14 + 15 + Optional properties: 16 + - system-power-controller: whether the rtc is controlling the system power 17 + through pmic_power_en 12 18 13 19 Example: 14 20 ··· 24 18 interrupts = <19 25 19 19>; 26 20 interrupt-parent = <&intc>; 21 + system-power-controller; 27 22 };
+1
Documentation/devicetree/bindings/vendor-prefixes.txt
··· 115 115 onnn ON Semiconductor Corp. 116 116 opencores OpenCores.org 117 117 panasonic Panasonic Corporation 118 + pericom Pericom Technology Inc. 118 119 phytec PHYTEC Messtechnik GmbH 119 120 picochip Picochip Ltd 120 121 plathome Plat'Home Co., Ltd.
+7
Documentation/kdump/kdump.txt
··· 471 471 472 472 http://people.redhat.com/~anderson/ 473 473 474 + Trigger Kdump on WARN() 475 + ======================= 476 + 477 + The kernel parameter, panic_on_warn, calls panic() in all WARN() paths. This 478 + will cause a kdump to occur at the panic() call. In cases where a user wants 479 + to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1 480 + to achieve the same behaviour. 474 481 475 482 Contact 476 483 =======
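The kdump.txt hunk above documents the new panic_on_warn knob. As a quick illustration (editorial, not part of the patch; run as root), the behaviour can be enabled either at runtime through procfs/sysctl or at boot via the kernel command line:

```shell
# Enable panic-on-WARN at runtime via procfs:
echo 1 > /proc/sys/kernel/panic_on_warn

# Equivalent, via sysctl(8):
sysctl -w kernel.panic_on_warn=1

# Or at boot, by adding the bare parameter to the kernel command line:
#   panic_on_warn

# Verify the current setting (0 = WARN() only, 1 = panic on WARN()):
cat /proc/sys/kernel/panic_on_warn
```

With the value set to 1, the next WARN() hit triggers panic(), which in turn triggers kdump if a crash kernel is loaded.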
+3
Documentation/kernel-parameters.txt
··· 2509 2509 timeout < 0: reboot immediately 2510 2510 Format: <timeout> 2511 2511 2512 + panic_on_warn panic() instead of WARN(). Useful to cause kdump 2513 + on a WARN(). 2514 + 2512 2515 crash_kexec_post_notifiers 2513 2516 Run kdump after running panic-notifiers and dumping 2514 2517 kmsg. This only for the users who doubt kdump always
+26 -14
Documentation/sysctl/kernel.txt
··· 54 54 - overflowuid 55 55 - panic 56 56 - panic_on_oops 57 - - panic_on_unrecovered_nmi 58 57 - panic_on_stackoverflow 58 + - panic_on_unrecovered_nmi 59 + - panic_on_warn 59 60 - pid_max 60 61 - powersave-nap [ PPC only ] 61 62 - printk ··· 528 527 529 528 ============================================================== 530 529 531 - panic_on_unrecovered_nmi: 532 - 533 - The default Linux behaviour on an NMI of either memory or unknown is 534 - to continue operation. For many environments such as scientific 535 - computing it is preferable that the box is taken out and the error 536 - dealt with than an uncorrected parity/ECC error get propagated. 537 - 538 - A small number of systems do generate NMI's for bizarre random reasons 539 - such as power management so the default is off. That sysctl works like 540 - the existing panic controls already in that directory. 541 - 542 - ============================================================== 543 - 544 530 panic_on_oops: 545 531 546 532 Controls the kernel's behaviour when an oops or BUG is encountered. ··· 548 560 0: try to continue operation. 549 561 550 562 1: panic immediately. 563 + 564 + ============================================================== 565 + 566 + panic_on_unrecovered_nmi: 567 + 568 + The default Linux behaviour on an NMI of either memory or unknown is 569 + to continue operation. For many environments such as scientific 570 + computing it is preferable that the box is taken out and the error 571 + dealt with than an uncorrected parity/ECC error get propagated. 572 + 573 + A small number of systems do generate NMI's for bizarre random reasons 574 + such as power management so the default is off. That sysctl works like 575 + the existing panic controls already in that directory. 576 + 577 + ============================================================== 578 + 579 + panic_on_warn: 580 + 581 + Calls panic() in the WARN() path when set to 1. 
This is useful to avoid 582 + a kernel rebuild when attempting to kdump at the location of a WARN(). 583 + 584 + 0: only WARN(), default behaviour. 585 + 586 + 1: call panic() after printing out WARN() location. 551 587 552 588 ============================================================== 553 589
+3 -3
MAINTAINERS
··· 2605 2605 L: linux-mm@kvack.org 2606 2606 S: Maintained 2607 2607 F: mm/memcontrol.c 2608 - F: mm/page_cgroup.c 2608 + F: mm/swap_cgroup.c 2609 2609 2610 2610 CORETEMP HARDWARE MONITORING DRIVER 2611 2611 M: Fenghua Yu <fenghua.yu@intel.com> ··· 2722 2722 2723 2723 CX18 VIDEO4LINUX DRIVER 2724 2724 M: Andy Walls <awalls@md.metrocast.net> 2725 - L: ivtv-devel@ivtvdriver.org (moderated for non-subscribers) 2725 + L: ivtv-devel@ivtvdriver.org (subscribers-only) 2726 2726 L: linux-media@vger.kernel.org 2727 2727 T: git git://linuxtv.org/media_tree.git 2728 2728 W: http://linuxtv.org ··· 5208 5208 5209 5209 IVTV VIDEO4LINUX DRIVER 5210 5210 M: Andy Walls <awalls@md.metrocast.net> 5211 - L: ivtv-devel@ivtvdriver.org (moderated for non-subscribers) 5211 + L: ivtv-devel@ivtvdriver.org (subscribers-only) 5212 5212 L: linux-media@vger.kernel.org 5213 5213 T: git git://linuxtv.org/media_tree.git 5214 5214 W: http://www.ivtvdriver.org
+4
arch/arm/boot/dts/am335x-boneblack.dts
··· 80 80 status = "okay"; 81 81 }; 82 82 }; 83 + 84 + &rtc { 85 + system-power-controller; 86 + };
+1 -1
arch/arm/boot/dts/am33xx.dtsi
··· 435 435 }; 436 436 437 437 rtc: rtc@44e3e000 { 438 - compatible = "ti,da830-rtc"; 438 + compatible = "ti,am3352-rtc", "ti,da830-rtc"; 439 439 reg = <0x44e3e000 0x1000>; 440 440 interrupts = <75 441 441 76>;
+1
arch/arm64/include/asm/pgtable.h
··· 279 279 #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ 280 280 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ 281 281 282 + #define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd)) 282 283 #define pmd_young(pmd) pte_young(pmd_pte(pmd)) 283 284 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd))) 284 285 #define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
+1 -1
arch/ia64/kernel/perfmon.c
··· 2662 2662 2663 2663 ret = -ENOMEM; 2664 2664 2665 - fd = get_unused_fd(); 2665 + fd = get_unused_fd_flags(0); 2666 2666 if (fd < 0) 2667 2667 return fd; 2668 2668
+1
arch/powerpc/include/asm/pgtable-ppc64.h
··· 467 467 } 468 468 469 469 #define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd)) 470 + #define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd)) 470 471 #define pmd_young(pmd) pte_young(pmd_pte(pmd)) 471 472 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd))) 472 473 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
+2 -2
arch/powerpc/platforms/cell/spufs/inode.c
··· 301 301 int ret; 302 302 struct file *filp; 303 303 304 - ret = get_unused_fd(); 304 + ret = get_unused_fd_flags(0); 305 305 if (ret < 0) 306 306 return ret; 307 307 ··· 518 518 int ret; 519 519 struct file *filp; 520 520 521 - ret = get_unused_fd(); 521 + ret = get_unused_fd_flags(0); 522 522 if (ret < 0) 523 523 return ret; 524 524
+1 -1
arch/sh/mm/numa.c
··· 31 31 unsigned long bootmem_paddr; 32 32 33 33 /* Don't allow bogus node assignment */ 34 - BUG_ON(nid > MAX_NUMNODES || nid <= 0); 34 + BUG_ON(nid >= MAX_NUMNODES || nid <= 0); 35 35 36 36 start_pfn = start >> PAGE_SHIFT; 37 37 end_pfn = end >> PAGE_SHIFT;
+7
arch/sparc/include/asm/pgtable_64.h
··· 667 667 } 668 668 669 669 #ifdef CONFIG_TRANSPARENT_HUGEPAGE 670 + static inline unsigned long pmd_dirty(pmd_t pmd) 671 + { 672 + pte_t pte = __pte(pmd_val(pmd)); 673 + 674 + return pte_dirty(pte); 675 + } 676 + 670 677 static inline unsigned long pmd_young(pmd_t pmd) 671 678 { 672 679 pte_t pte = __pte(pmd_val(pmd));
+13 -6
arch/tile/kernel/early_printk.c
··· 43 43 44 44 void early_panic(const char *fmt, ...) 45 45 { 46 - va_list ap; 46 + struct va_format vaf; 47 + va_list args; 48 + 47 49 arch_local_irq_disable_all(); 48 - va_start(ap, fmt); 49 - early_printk("Kernel panic - not syncing: "); 50 - early_vprintk(fmt, ap); 51 - early_printk("\n"); 52 - va_end(ap); 50 + 51 + va_start(args, fmt); 52 + 53 + vaf.fmt = fmt; 54 + vaf.va = &args; 55 + 56 + early_printk("Kernel panic - not syncing: %pV", &vaf); 57 + 58 + va_end(args); 59 + 53 60 dump_stack(); 54 61 hv_halt(); 55 62 }
+20 -25
arch/tile/kernel/setup.c
··· 534 534 } 535 535 } 536 536 physpages -= dropped_pages; 537 - pr_warning("Only using %ldMB memory;" 538 - " ignoring %ldMB.\n", 539 - physpages >> (20 - PAGE_SHIFT), 540 - dropped_pages >> (20 - PAGE_SHIFT)); 541 - pr_warning("Consider using a larger page size.\n"); 537 + pr_warn("Only using %ldMB memory - ignoring %ldMB\n", 538 + physpages >> (20 - PAGE_SHIFT), 539 + dropped_pages >> (20 - PAGE_SHIFT)); 540 + pr_warn("Consider using a larger page size\n"); 542 541 } 543 542 #endif 544 543 ··· 565 566 566 567 #ifndef __tilegx__ 567 568 if (node_end_pfn[0] > MAXMEM_PFN) { 568 - pr_warning("Only using %ldMB LOWMEM.\n", 569 - MAXMEM>>20); 570 - pr_warning("Use a HIGHMEM enabled kernel.\n"); 569 + pr_warn("Only using %ldMB LOWMEM\n", MAXMEM >> 20); 570 + pr_warn("Use a HIGHMEM enabled kernel\n"); 571 571 max_low_pfn = MAXMEM_PFN; 572 572 max_pfn = MAXMEM_PFN; 573 573 node_end_pfn[0] = MAXMEM_PFN; ··· 1110 1112 fd = hv_fs_findfile((HV_VirtAddr) initramfs_file); 1111 1113 if (fd == HV_ENOENT) { 1112 1114 if (set_initramfs_file) { 1113 - pr_warning("No such hvfs initramfs file '%s'\n", 1114 - initramfs_file); 1115 + pr_warn("No such hvfs initramfs file '%s'\n", 1116 + initramfs_file); 1115 1117 return; 1116 1118 } else { 1117 1119 /* Try old backwards-compatible name. 
*/ ··· 1124 1126 stat = hv_fs_fstat(fd); 1125 1127 BUG_ON(stat.size < 0); 1126 1128 if (stat.flags & HV_FS_ISDIR) { 1127 - pr_warning("Ignoring hvfs file '%s': it's a directory.\n", 1128 - initramfs_file); 1129 + pr_warn("Ignoring hvfs file '%s': it's a directory\n", 1130 + initramfs_file); 1129 1131 return; 1130 1132 } 1131 1133 initrd = alloc_bootmem_pages(stat.size); ··· 1183 1185 HV_Topology topology = hv_inquire_topology(); 1184 1186 BUG_ON(topology.coord.x != 0 || topology.coord.y != 0); 1185 1187 if (topology.width != 1 || topology.height != 1) { 1186 - pr_warning("Warning: booting UP kernel on %dx%d grid;" 1187 - " will ignore all but first tile.\n", 1188 - topology.width, topology.height); 1188 + pr_warn("Warning: booting UP kernel on %dx%d grid; will ignore all but first tile\n", 1189 + topology.width, topology.height); 1189 1190 } 1190 1191 #endif 1191 1192 ··· 1205 1208 * We use a struct cpumask for this, so it must be big enough. 1206 1209 */ 1207 1210 if ((smp_height * smp_width) > nr_cpu_ids) 1208 - early_panic("Hypervisor %d x %d grid too big for Linux" 1209 - " NR_CPUS %d\n", smp_height, smp_width, 1210 - nr_cpu_ids); 1211 + early_panic("Hypervisor %d x %d grid too big for Linux NR_CPUS %d\n", 1212 + smp_height, smp_width, nr_cpu_ids); 1211 1213 #endif 1212 1214 1213 1215 /* ··· 1261 1265 1262 1266 /* Kernel PCs must have their high bit set; see intvec.S. 
*/ 1263 1267 if ((long)VMALLOC_START >= 0) 1264 - early_panic( 1265 - "Linux VMALLOC region below the 2GB line (%#lx)!\n" 1266 - "Reconfigure the kernel with smaller VMALLOC_RESERVE.\n", 1267 - VMALLOC_START); 1268 + early_panic("Linux VMALLOC region below the 2GB line (%#lx)!\n" 1269 + "Reconfigure the kernel with smaller VMALLOC_RESERVE\n", 1270 + VMALLOC_START); 1268 1271 #endif 1269 1272 } 1270 1273 ··· 1390 1395 1391 1396 static int __init dataplane(char *str) 1392 1397 { 1393 - pr_warning("WARNING: dataplane support disabled in this kernel\n"); 1398 + pr_warn("WARNING: dataplane support disabled in this kernel\n"); 1394 1399 return 0; 1395 1400 } 1396 1401 ··· 1408 1413 len = hv_get_command_line((HV_VirtAddr) boot_command_line, 1409 1414 COMMAND_LINE_SIZE); 1410 1415 if (boot_command_line[0]) 1411 - pr_warning("WARNING: ignoring dynamic command line \"%s\"\n", 1412 - boot_command_line); 1416 + pr_warn("WARNING: ignoring dynamic command line \"%s\"\n", 1417 + boot_command_line); 1413 1418 strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE); 1414 1419 #else 1415 1420 char *hv_cmdline;
+5
arch/x86/include/asm/pgtable.h
··· 100 100 return pte_flags(pte) & _PAGE_ACCESSED; 101 101 } 102 102 103 + static inline int pmd_dirty(pmd_t pmd) 104 + { 105 + return pmd_flags(pmd) & _PAGE_DIRTY; 106 + } 107 + 103 108 static inline int pmd_young(pmd_t pmd) 104 109 { 105 110 return pmd_flags(pmd) & _PAGE_ACCESSED;
+7 -1
drivers/base/Kconfig
··· 267 267 config CMA_SIZE_MBYTES 268 268 int "Size in Mega Bytes" 269 269 depends on !CMA_SIZE_SEL_PERCENTAGE 270 + default 0 if X86 270 271 default 16 271 272 help 272 273 Defines the size (in MiB) of the default memory area for Contiguous 273 - Memory Allocator. 274 + Memory Allocator. If the size of 0 is selected, CMA is disabled by 275 + default, but it can be enabled by passing cma=size[MG] to the kernel. 276 + 274 277 275 278 config CMA_SIZE_PERCENTAGE 276 279 int "Percentage of total memory" 277 280 depends on !CMA_SIZE_SEL_MBYTES 281 + default 0 if X86 278 282 default 10 279 283 help 280 284 Defines the size of the default memory area for Contiguous Memory 281 285 Allocator as a percentage of the total memory in the system. 286 + If 0 percent is selected, CMA is disabled by default, but it can be 287 + enabled by passing cma=size[MG] to the kernel. 282 288 283 289 choice 284 290 prompt "Selected region size"
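The Kconfig hunk above makes CMA default to a zero-sized region on X86 while keeping the cma= escape hatch. As a sketch (the 64M size is an arbitrary example, not from the patch), enabling it looks like:

```shell
# Kernel command line fragment reserving a 64 MiB CMA area
# (append to the bootloader's kernel arguments):
#   cma=64M

# After boot, confirm the reservation in the kernel log:
dmesg | grep -i "cma:"
```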
+8
drivers/rtc/Kconfig
··· 192 192 This driver can also be built as a module. If so, the module 193 193 will be called rtc-ds1374. 194 194 195 + config RTC_DRV_DS1374_WDT 196 + bool "Dallas/Maxim DS1374 watchdog timer" 197 + depends on RTC_DRV_DS1374 198 + help 199 + If you say Y here you will get support for the 200 + watchdog timer in the Dallas Semiconductor DS1374 201 + real-time clock chips. 202 + 195 203 config RTC_DRV_DS1672 196 204 tristate "Dallas/Maxim DS1672" 197 205 help
+21
drivers/rtc/interface.c
··· 30 30 else { 31 31 memset(tm, 0, sizeof(struct rtc_time)); 32 32 err = rtc->ops->read_time(rtc->dev.parent, tm); 33 + if (err < 0) { 34 + dev_err(&rtc->dev, "read_time: fail to read\n"); 35 + return err; 36 + } 37 + 38 + err = rtc_valid_tm(tm); 39 + if (err < 0) 40 + dev_err(&rtc->dev, "read_time: rtc_time isn't valid\n"); 33 41 } 34 42 return err; 35 43 } ··· 899 891 if (next) { 900 892 struct rtc_wkalrm alarm; 901 893 int err; 894 + int retry = 3; 895 + 902 896 alarm.time = rtc_ktime_to_tm(next->expires); 903 897 alarm.enabled = 1; 898 + reprogram: 904 899 err = __rtc_set_alarm(rtc, &alarm); 905 900 if (err == -ETIME) 906 901 goto again; 902 + else if (err) { 903 + if (retry-- > 0) 904 + goto reprogram; 905 + 906 + timer = container_of(next, struct rtc_timer, node); 907 + timerqueue_del(&rtc->timerqueue, &timer->node); 908 + timer->enabled = 0; 909 + dev_err(&rtc->dev, "__rtc_set_alarm: err=%d\n", err); 910 + goto again; 911 + } 907 912 } else 908 913 rtc_alarm_disable(rtc); 909 914
+2
drivers/rtc/rtc-ab8500.c
··· 504 504 return err; 505 505 } 506 506 507 + rtc->uie_unsupported = 1; 508 + 507 509 return 0; 508 510 } 509 511
+63 -62
drivers/rtc/rtc-ds1307.c
··· 35 35 ds_1388, 36 36 ds_3231, 37 37 m41t00, 38 - mcp7941x, 38 + mcp794xx, 39 39 rx_8025, 40 40 last_ds_type /* always last */ 41 41 /* rs5c372 too? different address... */ ··· 46 46 #define DS1307_REG_SECS 0x00 /* 00-59 */ 47 47 # define DS1307_BIT_CH 0x80 48 48 # define DS1340_BIT_nEOSC 0x80 49 - # define MCP7941X_BIT_ST 0x80 49 + # define MCP794XX_BIT_ST 0x80 50 50 #define DS1307_REG_MIN 0x01 /* 00-59 */ 51 51 #define DS1307_REG_HOUR 0x02 /* 00-23, or 1-12{am,pm} */ 52 52 # define DS1307_BIT_12HR 0x40 /* in REG_HOUR */ ··· 54 54 # define DS1340_BIT_CENTURY_EN 0x80 /* in REG_HOUR */ 55 55 # define DS1340_BIT_CENTURY 0x40 /* in REG_HOUR */ 56 56 #define DS1307_REG_WDAY 0x03 /* 01-07 */ 57 - # define MCP7941X_BIT_VBATEN 0x08 57 + # define MCP794XX_BIT_VBATEN 0x08 58 58 #define DS1307_REG_MDAY 0x04 /* 01-31 */ 59 59 #define DS1307_REG_MONTH 0x05 /* 01-12 */ 60 60 # define DS1337_BIT_CENTURY 0x80 /* in REG_MONTH */ ··· 159 159 [ds_3231] = { 160 160 .alarm = 1, 161 161 }, 162 - [mcp7941x] = { 162 + [mcp794xx] = { 163 163 .alarm = 1, 164 164 /* this is battery backed SRAM */ 165 165 .nvram_offset = 0x20, ··· 176 176 { "ds1340", ds_1340 }, 177 177 { "ds3231", ds_3231 }, 178 178 { "m41t00", m41t00 }, 179 - { "mcp7941x", mcp7941x }, 179 + { "mcp7940x", mcp794xx }, 180 + { "mcp7941x", mcp794xx }, 180 181 { "pt7c4338", ds_1307 }, 181 182 { "rx8025", rx_8025 }, 182 183 { } ··· 440 439 buf[DS1307_REG_HOUR] |= DS1340_BIT_CENTURY_EN 441 440 | DS1340_BIT_CENTURY; 442 441 break; 443 - case mcp7941x: 442 + case mcp794xx: 444 443 /* 445 444 * these bits were cleared when preparing the date/time 446 445 * values and need to be set again before writing the 447 446 * buffer out to the device. 
448 447 */ 449 - buf[DS1307_REG_SECS] |= MCP7941X_BIT_ST; 450 - buf[DS1307_REG_WDAY] |= MCP7941X_BIT_VBATEN; 448 + buf[DS1307_REG_SECS] |= MCP794XX_BIT_ST; 449 + buf[DS1307_REG_WDAY] |= MCP794XX_BIT_VBATEN; 451 450 break; 452 451 default: 453 452 break; ··· 615 614 /*----------------------------------------------------------------------*/ 616 615 617 616 /* 618 - * Alarm support for mcp7941x devices. 617 + * Alarm support for mcp794xx devices. 619 618 */ 620 619 621 - #define MCP7941X_REG_CONTROL 0x07 622 - # define MCP7941X_BIT_ALM0_EN 0x10 623 - # define MCP7941X_BIT_ALM1_EN 0x20 624 - #define MCP7941X_REG_ALARM0_BASE 0x0a 625 - #define MCP7941X_REG_ALARM0_CTRL 0x0d 626 - #define MCP7941X_REG_ALARM1_BASE 0x11 627 - #define MCP7941X_REG_ALARM1_CTRL 0x14 628 - # define MCP7941X_BIT_ALMX_IF (1 << 3) 629 - # define MCP7941X_BIT_ALMX_C0 (1 << 4) 630 - # define MCP7941X_BIT_ALMX_C1 (1 << 5) 631 - # define MCP7941X_BIT_ALMX_C2 (1 << 6) 632 - # define MCP7941X_BIT_ALMX_POL (1 << 7) 633 - # define MCP7941X_MSK_ALMX_MATCH (MCP7941X_BIT_ALMX_C0 | \ 634 - MCP7941X_BIT_ALMX_C1 | \ 635 - MCP7941X_BIT_ALMX_C2) 620 + #define MCP794XX_REG_CONTROL 0x07 621 + # define MCP794XX_BIT_ALM0_EN 0x10 622 + # define MCP794XX_BIT_ALM1_EN 0x20 623 + #define MCP794XX_REG_ALARM0_BASE 0x0a 624 + #define MCP794XX_REG_ALARM0_CTRL 0x0d 625 + #define MCP794XX_REG_ALARM1_BASE 0x11 626 + #define MCP794XX_REG_ALARM1_CTRL 0x14 627 + # define MCP794XX_BIT_ALMX_IF (1 << 3) 628 + # define MCP794XX_BIT_ALMX_C0 (1 << 4) 629 + # define MCP794XX_BIT_ALMX_C1 (1 << 5) 630 + # define MCP794XX_BIT_ALMX_C2 (1 << 6) 631 + # define MCP794XX_BIT_ALMX_POL (1 << 7) 632 + # define MCP794XX_MSK_ALMX_MATCH (MCP794XX_BIT_ALMX_C0 | \ 633 + MCP794XX_BIT_ALMX_C1 | \ 634 + MCP794XX_BIT_ALMX_C2) 636 635 637 - static void mcp7941x_work(struct work_struct *work) 636 + static void mcp794xx_work(struct work_struct *work) 638 637 { 639 638 struct ds1307 *ds1307 = container_of(work, struct ds1307, work); 640 639 struct i2c_client 
*client = ds1307->client; ··· 643 642 mutex_lock(&ds1307->rtc->ops_lock); 644 643 645 644 /* Check and clear alarm 0 interrupt flag. */ 646 - reg = i2c_smbus_read_byte_data(client, MCP7941X_REG_ALARM0_CTRL); 645 + reg = i2c_smbus_read_byte_data(client, MCP794XX_REG_ALARM0_CTRL); 647 646 if (reg < 0) 648 647 goto out; 649 - if (!(reg & MCP7941X_BIT_ALMX_IF)) 648 + if (!(reg & MCP794XX_BIT_ALMX_IF)) 650 649 goto out; 651 - reg &= ~MCP7941X_BIT_ALMX_IF; 652 - ret = i2c_smbus_write_byte_data(client, MCP7941X_REG_ALARM0_CTRL, reg); 650 + reg &= ~MCP794XX_BIT_ALMX_IF; 651 + ret = i2c_smbus_write_byte_data(client, MCP794XX_REG_ALARM0_CTRL, reg); 653 652 if (ret < 0) 654 653 goto out; 655 654 656 655 /* Disable alarm 0. */ 657 - reg = i2c_smbus_read_byte_data(client, MCP7941X_REG_CONTROL); 656 + reg = i2c_smbus_read_byte_data(client, MCP794XX_REG_CONTROL); 658 657 if (reg < 0) 659 658 goto out; 660 - reg &= ~MCP7941X_BIT_ALM0_EN; 661 - ret = i2c_smbus_write_byte_data(client, MCP7941X_REG_CONTROL, reg); 659 + reg &= ~MCP794XX_BIT_ALM0_EN; 660 + ret = i2c_smbus_write_byte_data(client, MCP794XX_REG_CONTROL, reg); 662 661 if (ret < 0) 663 662 goto out; 664 663 ··· 670 669 mutex_unlock(&ds1307->rtc->ops_lock); 671 670 } 672 671 673 - static int mcp7941x_read_alarm(struct device *dev, struct rtc_wkalrm *t) 672 + static int mcp794xx_read_alarm(struct device *dev, struct rtc_wkalrm *t) 674 673 { 675 674 struct i2c_client *client = to_i2c_client(dev); 676 675 struct ds1307 *ds1307 = i2c_get_clientdata(client); ··· 681 680 return -EINVAL; 682 681 683 682 /* Read control and alarm 0 registers. */ 684 - ret = ds1307->read_block_data(client, MCP7941X_REG_CONTROL, 10, regs); 683 + ret = ds1307->read_block_data(client, MCP794XX_REG_CONTROL, 10, regs); 685 684 if (ret < 0) 686 685 return ret; 687 686 688 - t->enabled = !!(regs[0] & MCP7941X_BIT_ALM0_EN); 687 + t->enabled = !!(regs[0] & MCP794XX_BIT_ALM0_EN); 689 688 690 689 /* Report alarm 0 time assuming 24-hour and day-of-month modes. 
*/ 691 690 t->time.tm_sec = bcd2bin(ds1307->regs[3] & 0x7f); ··· 702 701 "enabled=%d polarity=%d irq=%d match=%d\n", __func__, 703 702 t->time.tm_sec, t->time.tm_min, t->time.tm_hour, 704 703 t->time.tm_wday, t->time.tm_mday, t->time.tm_mon, t->enabled, 705 - !!(ds1307->regs[6] & MCP7941X_BIT_ALMX_POL), 706 - !!(ds1307->regs[6] & MCP7941X_BIT_ALMX_IF), 707 - (ds1307->regs[6] & MCP7941X_MSK_ALMX_MATCH) >> 4); 704 + !!(ds1307->regs[6] & MCP794XX_BIT_ALMX_POL), 705 + !!(ds1307->regs[6] & MCP794XX_BIT_ALMX_IF), 706 + (ds1307->regs[6] & MCP794XX_MSK_ALMX_MATCH) >> 4); 708 707 709 708 return 0; 710 709 } 711 710 712 - static int mcp7941x_set_alarm(struct device *dev, struct rtc_wkalrm *t) 711 + static int mcp794xx_set_alarm(struct device *dev, struct rtc_wkalrm *t) 713 712 { 714 713 struct i2c_client *client = to_i2c_client(dev); 715 714 struct ds1307 *ds1307 = i2c_get_clientdata(client); ··· 726 725 t->enabled, t->pending); 727 726 728 727 /* Read control and alarm 0 registers. */ 729 - ret = ds1307->read_block_data(client, MCP7941X_REG_CONTROL, 10, regs); 728 + ret = ds1307->read_block_data(client, MCP794XX_REG_CONTROL, 10, regs); 730 729 if (ret < 0) 731 730 return ret; 732 731 ··· 739 738 regs[8] = bin2bcd(t->time.tm_mon) + 1; 740 739 741 740 /* Clear the alarm 0 interrupt flag. */ 742 - regs[6] &= ~MCP7941X_BIT_ALMX_IF; 741 + regs[6] &= ~MCP794XX_BIT_ALMX_IF; 743 742 /* Set alarm match: second, minute, hour, day, date, month. 
*/ 744 - regs[6] |= MCP7941X_MSK_ALMX_MATCH; 743 + regs[6] |= MCP794XX_MSK_ALMX_MATCH; 745 744 746 745 if (t->enabled) 747 - regs[0] |= MCP7941X_BIT_ALM0_EN; 746 + regs[0] |= MCP794XX_BIT_ALM0_EN; 748 747 else 749 - regs[0] &= ~MCP7941X_BIT_ALM0_EN; 748 + regs[0] &= ~MCP794XX_BIT_ALM0_EN; 750 749 751 - ret = ds1307->write_block_data(client, MCP7941X_REG_CONTROL, 10, regs); 750 + ret = ds1307->write_block_data(client, MCP794XX_REG_CONTROL, 10, regs); 752 751 if (ret < 0) 753 752 return ret; 754 753 755 754 return 0; 756 755 } 757 756 758 - static int mcp7941x_alarm_irq_enable(struct device *dev, unsigned int enabled) 757 + static int mcp794xx_alarm_irq_enable(struct device *dev, unsigned int enabled) 759 758 { 760 759 struct i2c_client *client = to_i2c_client(dev); 761 760 struct ds1307 *ds1307 = i2c_get_clientdata(client); ··· 764 763 if (!test_bit(HAS_ALARM, &ds1307->flags)) 765 764 return -EINVAL; 766 765 767 - reg = i2c_smbus_read_byte_data(client, MCP7941X_REG_CONTROL); 766 + reg = i2c_smbus_read_byte_data(client, MCP794XX_REG_CONTROL); 768 767 if (reg < 0) 769 768 return reg; 770 769 771 770 if (enabled) 772 - reg |= MCP7941X_BIT_ALM0_EN; 771 + reg |= MCP794XX_BIT_ALM0_EN; 773 772 else 774 - reg &= ~MCP7941X_BIT_ALM0_EN; 773 + reg &= ~MCP794XX_BIT_ALM0_EN; 775 774 776 - return i2c_smbus_write_byte_data(client, MCP7941X_REG_CONTROL, reg); 775 + return i2c_smbus_write_byte_data(client, MCP794XX_REG_CONTROL, reg); 777 776 } 778 777 779 - static const struct rtc_class_ops mcp7941x_rtc_ops = { 778 + static const struct rtc_class_ops mcp794xx_rtc_ops = { 780 779 .read_time = ds1307_get_time, 781 780 .set_time = ds1307_set_time, 782 - .read_alarm = mcp7941x_read_alarm, 783 - .set_alarm = mcp7941x_set_alarm, 784 - .alarm_irq_enable = mcp7941x_alarm_irq_enable, 781 + .read_alarm = mcp794xx_read_alarm, 782 + .set_alarm = mcp794xx_set_alarm, 783 + .alarm_irq_enable = mcp794xx_alarm_irq_enable, 785 784 }; 786 785 787 786 
/*----------------------------------------------------------------------*/ ··· 1050 1049 case ds_1388: 1051 1050 ds1307->offset = 1; /* Seconds starts at 1 */ 1052 1051 break; 1053 - case mcp7941x: 1054 - rtc_ops = &mcp7941x_rtc_ops; 1052 + case mcp794xx: 1053 + rtc_ops = &mcp794xx_rtc_ops; 1055 1054 if (ds1307->client->irq > 0 && chip->alarm) { 1056 - INIT_WORK(&ds1307->work, mcp7941x_work); 1055 + INIT_WORK(&ds1307->work, mcp794xx_work); 1057 1056 want_irq = true; 1058 1057 } 1059 1058 break; ··· 1118 1117 dev_warn(&client->dev, "SET TIME!\n"); 1119 1118 } 1120 1119 break; 1121 - case mcp7941x: 1120 + case mcp794xx: 1122 1121 /* make sure that the backup battery is enabled */ 1123 - if (!(ds1307->regs[DS1307_REG_WDAY] & MCP7941X_BIT_VBATEN)) { 1122 + if (!(ds1307->regs[DS1307_REG_WDAY] & MCP794XX_BIT_VBATEN)) { 1124 1123 i2c_smbus_write_byte_data(client, DS1307_REG_WDAY, 1125 1124 ds1307->regs[DS1307_REG_WDAY] 1126 - | MCP7941X_BIT_VBATEN); 1125 + | MCP794XX_BIT_VBATEN); 1127 1126 } 1128 1127 1129 1128 /* clock halted? turn it on, so clock can tick. */ 1130 - if (!(tmp & MCP7941X_BIT_ST)) { 1129 + if (!(tmp & MCP794XX_BIT_ST)) { 1131 1130 i2c_smbus_write_byte_data(client, DS1307_REG_SECS, 1132 - MCP7941X_BIT_ST); 1131 + MCP794XX_BIT_ST); 1133 1132 dev_warn(&client->dev, "SET TIME!\n"); 1134 1133 goto read_rtc; 1135 1134 }
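The MCP794XX alarm and time hunks above all pass register values through the kernel's BCD helpers. As a point of reference for reading the diff, here is a minimal user-space sketch of those helpers (modeled on `linux/bcd.h`; the standalone definitions here are illustrative, not the kernel's actual header):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the kernel's bcd2bin()/bin2bcd() helpers (cf. linux/bcd.h).
 * RTC registers such as DS1307_REG_SECS hold packed BCD: the high nibble
 * is the tens digit, the low nibble the ones digit. */
static uint8_t bcd2bin(uint8_t val)
{
	return (val & 0x0f) + (val >> 4) * 10;
}

static uint8_t bin2bcd(uint8_t val)
{
	return (uint8_t)(((val / 10) << 4) | (val % 10));
}
```

This is why the driver masks before converting (e.g. `bcd2bin(ds1307->regs[3] & 0x7f)`): control bits share the register with the BCD digits and must be stripped first.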
+285
drivers/rtc/rtc-ds1374.c
··· 4 4 * Based on code by Randy Vinson <rvinson@mvista.com>, 5 5 * which was based on the m41t00.c by Mark Greer <mgreer@mvista.com>. 6 6 * 7 + * Copyright (C) 2014 Rose Technology 7 8 * Copyright (C) 2006-2007 Freescale Semiconductor 8 9 * 9 10 * 2005 (c) MontaVista Software, Inc. This file is licensed under ··· 27 26 #include <linux/workqueue.h> 28 27 #include <linux/slab.h> 29 28 #include <linux/pm.h> 29 + #ifdef CONFIG_RTC_DRV_DS1374_WDT 30 + #include <linux/fs.h> 31 + #include <linux/ioctl.h> 32 + #include <linux/miscdevice.h> 33 + #include <linux/reboot.h> 34 + #include <linux/watchdog.h> 35 + #endif 30 36 31 37 #define DS1374_REG_TOD0 0x00 /* Time of Day */ 32 38 #define DS1374_REG_TOD1 0x01 ··· 56 48 { } 57 49 }; 58 50 MODULE_DEVICE_TABLE(i2c, ds1374_id); 51 + 52 + #ifdef CONFIG_OF 53 + static const struct of_device_id ds1374_of_match[] = { 54 + { .compatible = "dallas,ds1374" }, 55 + { } 56 + }; 57 + MODULE_DEVICE_TABLE(of, ds1374_of_match); 58 + #endif 59 59 60 60 struct ds1374 { 61 61 struct i2c_client *client; ··· 178 162 return ds1374_write_rtc(client, itime, DS1374_REG_TOD0, 4); 179 163 } 180 164 165 + #ifndef CONFIG_RTC_DRV_DS1374_WDT 181 166 /* The ds1374 has a decrementer for an alarm, rather than a comparator. 182 167 * If the time of day is changed, then the alarm will need to be 183 168 * reset. 
··· 280 263 mutex_unlock(&ds1374->mutex); 281 264 return ret; 282 265 } 266 + #endif 283 267 284 268 static irqreturn_t ds1374_irq(int irq, void *dev_id) 285 269 { ··· 325 307 mutex_unlock(&ds1374->mutex); 326 308 } 327 309 310 + #ifndef CONFIG_RTC_DRV_DS1374_WDT 328 311 static int ds1374_alarm_irq_enable(struct device *dev, unsigned int enabled) 329 312 { 330 313 struct i2c_client *client = to_i2c_client(dev); ··· 350 331 mutex_unlock(&ds1374->mutex); 351 332 return ret; 352 333 } 334 + #endif 353 335 354 336 static const struct rtc_class_ops ds1374_rtc_ops = { 355 337 .read_time = ds1374_read_time, 356 338 .set_time = ds1374_set_time, 339 + #ifndef CONFIG_RTC_DRV_DS1374_WDT 357 340 .read_alarm = ds1374_read_alarm, 358 341 .set_alarm = ds1374_set_alarm, 359 342 .alarm_irq_enable = ds1374_alarm_irq_enable, 343 + #endif 360 344 }; 361 345 346 + #ifdef CONFIG_RTC_DRV_DS1374_WDT 347 + /* 348 + ***************************************************************************** 349 + * 350 + * Watchdog Driver 351 + * 352 + ***************************************************************************** 353 + */ 354 + static struct i2c_client *save_client; 355 + /* Default margin */ 356 + #define WD_TIMO 131762 357 + 358 + #define DRV_NAME "DS1374 Watchdog" 359 + 360 + static int wdt_margin = WD_TIMO; 361 + static unsigned long wdt_is_open; 362 + module_param(wdt_margin, int, 0); 363 + MODULE_PARM_DESC(wdt_margin, "Watchdog timeout in seconds (default 32s)"); 364 + 365 + static const struct watchdog_info ds1374_wdt_info = { 366 + .identity = "DS1374 WTD", 367 + .options = WDIOF_SETTIMEOUT | WDIOF_KEEPALIVEPING | 368 + WDIOF_MAGICCLOSE, 369 + }; 370 + 371 + static int ds1374_wdt_settimeout(unsigned int timeout) 372 + { 373 + int ret = -ENOIOCTLCMD; 374 + int cr; 375 + 376 + ret = cr = i2c_smbus_read_byte_data(save_client, DS1374_REG_CR); 377 + if (ret < 0) 378 + goto out; 379 + 380 + /* Disable any existing watchdog/alarm before setting the new one */ 381 + cr &= 
~DS1374_REG_CR_WACE; 382 + 383 + ret = i2c_smbus_write_byte_data(save_client, DS1374_REG_CR, cr); 384 + if (ret < 0) 385 + goto out; 386 + 387 + /* Set new watchdog time */ 388 + ret = ds1374_write_rtc(save_client, timeout, DS1374_REG_WDALM0, 3); 389 + if (ret) { 390 + pr_info("rtc-ds1374 - couldn't set new watchdog time\n"); 391 + goto out; 392 + } 393 + 394 + /* Enable watchdog timer */ 395 + cr |= DS1374_REG_CR_WACE | DS1374_REG_CR_WDALM; 396 + cr &= ~DS1374_REG_CR_AIE; 397 + 398 + ret = i2c_smbus_write_byte_data(save_client, DS1374_REG_CR, cr); 399 + if (ret < 0) 400 + goto out; 401 + 402 + return 0; 403 + out: 404 + return ret; 405 + } 406 + 407 + 408 + /* 409 + * Reload the watchdog timer. (ie, pat the watchdog) 410 + */ 411 + static void ds1374_wdt_ping(void) 412 + { 413 + u32 val; 414 + int ret = 0; 415 + 416 + ret = ds1374_read_rtc(save_client, &val, DS1374_REG_WDALM0, 3); 417 + if (ret) 418 + pr_info("WD TICK FAIL!!!!!!!!!! %i\n", ret); 419 + } 420 + 421 + static void ds1374_wdt_disable(void) 422 + { 423 + int ret = -ENOIOCTLCMD; 424 + int cr; 425 + 426 + cr = i2c_smbus_read_byte_data(save_client, DS1374_REG_CR); 427 + /* Disable watchdog timer */ 428 + cr &= ~DS1374_REG_CR_WACE; 429 + 430 + ret = i2c_smbus_write_byte_data(save_client, DS1374_REG_CR, cr); 431 + } 432 + 433 + /* 434 + * Watchdog device is opened, and watchdog starts running. 435 + */ 436 + static int ds1374_wdt_open(struct inode *inode, struct file *file) 437 + { 438 + struct ds1374 *ds1374 = i2c_get_clientdata(save_client); 439 + 440 + if (MINOR(inode->i_rdev) == WATCHDOG_MINOR) { 441 + mutex_lock(&ds1374->mutex); 442 + if (test_and_set_bit(0, &wdt_is_open)) { 443 + mutex_unlock(&ds1374->mutex); 444 + return -EBUSY; 445 + } 446 + /* 447 + * Activate 448 + */ 449 + wdt_is_open = 1; 450 + mutex_unlock(&ds1374->mutex); 451 + return nonseekable_open(inode, file); 452 + } 453 + return -ENODEV; 454 + } 455 + 456 + /* 457 + * Close the watchdog device. 
458 + */ 459 + static int ds1374_wdt_release(struct inode *inode, struct file *file) 460 + { 461 + if (MINOR(inode->i_rdev) == WATCHDOG_MINOR) 462 + clear_bit(0, &wdt_is_open); 463 + 464 + return 0; 465 + } 466 + 467 + /* 468 + * Pat the watchdog whenever device is written to. 469 + */ 470 + static ssize_t ds1374_wdt_write(struct file *file, const char __user *data, 471 + size_t len, loff_t *ppos) 472 + { 473 + if (len) { 474 + ds1374_wdt_ping(); 475 + return 1; 476 + } 477 + return 0; 478 + } 479 + 480 + static ssize_t ds1374_wdt_read(struct file *file, char __user *data, 481 + size_t len, loff_t *ppos) 482 + { 483 + return 0; 484 + } 485 + 486 + /* 487 + * Handle commands from user-space. 488 + */ 489 + static long ds1374_wdt_ioctl(struct file *file, unsigned int cmd, 490 + unsigned long arg) 491 + { 492 + int new_margin, options; 493 + 494 + switch (cmd) { 495 + case WDIOC_GETSUPPORT: 496 + return copy_to_user((struct watchdog_info __user *)arg, 497 + &ds1374_wdt_info, sizeof(ds1374_wdt_info)) ? 
-EFAULT : 0; 498 + 499 + case WDIOC_GETSTATUS: 500 + case WDIOC_GETBOOTSTATUS: 501 + return put_user(0, (int __user *)arg); 502 + case WDIOC_KEEPALIVE: 503 + ds1374_wdt_ping(); 504 + return 0; 505 + case WDIOC_SETTIMEOUT: 506 + if (get_user(new_margin, (int __user *)arg)) 507 + return -EFAULT; 508 + 509 + if (new_margin < 1 || new_margin > 16777216) 510 + return -EINVAL; 511 + 512 + wdt_margin = new_margin; 513 + ds1374_wdt_settimeout(new_margin); 514 + ds1374_wdt_ping(); 515 + /* fallthrough */ 516 + case WDIOC_GETTIMEOUT: 517 + return put_user(wdt_margin, (int __user *)arg); 518 + case WDIOC_SETOPTIONS: 519 + if (copy_from_user(&options, (int __user *)arg, sizeof(int))) 520 + return -EFAULT; 521 + 522 + if (options & WDIOS_DISABLECARD) { 523 + pr_info("rtc-ds1374: disable watchdog\n"); 524 + ds1374_wdt_disable(); 525 + } 526 + 527 + if (options & WDIOS_ENABLECARD) { 528 + pr_info("rtc-ds1374: enable watchdog\n"); 529 + ds1374_wdt_settimeout(wdt_margin); 530 + ds1374_wdt_ping(); 531 + } 532 + 533 + return -EINVAL; 534 + } 535 + return -ENOTTY; 536 + } 537 + 538 + static long ds1374_wdt_unlocked_ioctl(struct file *file, unsigned int cmd, 539 + unsigned long arg) 540 + { 541 + int ret; 542 + struct ds1374 *ds1374 = i2c_get_clientdata(save_client); 543 + 544 + mutex_lock(&ds1374->mutex); 545 + ret = ds1374_wdt_ioctl(file, cmd, arg); 546 + mutex_unlock(&ds1374->mutex); 547 + 548 + return ret; 549 + } 550 + 551 + static int ds1374_wdt_notify_sys(struct notifier_block *this, 552 + unsigned long code, void *unused) 553 + { 554 + if (code == SYS_DOWN || code == SYS_HALT) 555 + /* Disable Watchdog */ 556 + ds1374_wdt_disable(); 557 + return NOTIFY_DONE; 558 + } 559 + 560 + static const struct file_operations ds1374_wdt_fops = { 561 + .owner = THIS_MODULE, 562 + .read = ds1374_wdt_read, 563 + .unlocked_ioctl = ds1374_wdt_unlocked_ioctl, 564 + .write = ds1374_wdt_write, 565 + .open = ds1374_wdt_open, 566 + .release = ds1374_wdt_release, 567 + .llseek = no_llseek, 568 + }; 
569 + 570 + static struct miscdevice ds1374_miscdev = { 571 + .minor = WATCHDOG_MINOR, 572 + .name = "watchdog", 573 + .fops = &ds1374_wdt_fops, 574 + }; 575 + 576 + static struct notifier_block ds1374_wdt_notifier = { 577 + .notifier_call = ds1374_wdt_notify_sys, 578 + }; 579 + 580 + #endif /*CONFIG_RTC_DRV_DS1374_WDT*/ 581 + /* 582 + ***************************************************************************** 583 + * 584 + * Driver Interface 585 + * 586 + ***************************************************************************** 587 + */ 362 588 static int ds1374_probe(struct i2c_client *client, 363 589 const struct i2c_device_id *id) 364 590 { ··· 642 378 return PTR_ERR(ds1374->rtc); 643 379 } 644 380 381 + #ifdef CONFIG_RTC_DRV_DS1374_WDT 382 + save_client = client; 383 + ret = misc_register(&ds1374_miscdev); 384 + if (ret) 385 + return ret; 386 + ret = register_reboot_notifier(&ds1374_wdt_notifier); 387 + if (ret) { 388 + misc_deregister(&ds1374_miscdev); 389 + return ret; 390 + } 391 + ds1374_wdt_settimeout(131072); 392 + #endif 393 + 645 394 return 0; 646 395 } 647 396 648 397 static int ds1374_remove(struct i2c_client *client) 649 398 { 650 399 struct ds1374 *ds1374 = i2c_get_clientdata(client); 400 + #ifdef CONFIG_RTC_DRV_DS1374_WDT 401 + int res; 402 + 403 + res = misc_deregister(&ds1374_miscdev); 404 + if (!res) 405 + ds1374_miscdev.parent = NULL; 406 + unregister_reboot_notifier(&ds1374_wdt_notifier); 407 + #endif 651 408 652 409 if (client->irq > 0) { 653 410 mutex_lock(&ds1374->mutex);
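The ds1374_wdt_settimeout() path above programs a raw 24-bit tick count into the WDALM registers (the probe hunk writes 131072 for the default 32 s margin). Assuming the 4096 Hz watchdog clock described in the DS1374 datasheet, the seconds-to-ticks relation can be sketched as:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged sketch, not driver code: the DS1374 watchdog counter is assumed
 * to decrement at 4096 Hz, so the 24-bit WDALM value is seconds * 4096.
 * 32 s * 4096 = 131072, the value the probe function above programs. */
#define DS1374_WD_CLK_HZ 4096u

static uint32_t ds1374_seconds_to_ticks(uint32_t seconds)
{
	return seconds * DS1374_WD_CLK_HZ;
}
```

Note the `WD_TIMO 131762` default in the hunk does not match this relation exactly (131072 would be 32 s), which is left as-is here since it is what the patch carries.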
+60 -23
drivers/rtc/rtc-isl12057.c
··· 41 41 #define ISL12057_REG_RTC_DW 0x03 /* Day of the Week */ 42 42 #define ISL12057_REG_RTC_DT 0x04 /* Date */ 43 43 #define ISL12057_REG_RTC_MO 0x05 /* Month */ 44 + #define ISL12057_REG_RTC_MO_CEN BIT(7) /* Century bit */ 44 45 #define ISL12057_REG_RTC_YR 0x06 /* Year */ 45 46 #define ISL12057_RTC_SEC_LEN 7 46 47 ··· 89 88 tm->tm_min = bcd2bin(regs[ISL12057_REG_RTC_MN]); 90 89 91 90 if (regs[ISL12057_REG_RTC_HR] & ISL12057_REG_RTC_HR_MIL) { /* AM/PM */ 92 - tm->tm_hour = bcd2bin(regs[ISL12057_REG_RTC_HR] & 0x0f); 91 + tm->tm_hour = bcd2bin(regs[ISL12057_REG_RTC_HR] & 0x1f); 93 92 if (regs[ISL12057_REG_RTC_HR] & ISL12057_REG_RTC_HR_PM) 94 93 tm->tm_hour += 12; 95 94 } else { /* 24 hour mode */ ··· 98 97 99 98 tm->tm_mday = bcd2bin(regs[ISL12057_REG_RTC_DT]); 100 99 tm->tm_wday = bcd2bin(regs[ISL12057_REG_RTC_DW]) - 1; /* starts at 1 */ 101 - tm->tm_mon = bcd2bin(regs[ISL12057_REG_RTC_MO]) - 1; /* starts at 1 */ 100 + tm->tm_mon = bcd2bin(regs[ISL12057_REG_RTC_MO] & 0x1f) - 1; /* ditto */ 102 101 tm->tm_year = bcd2bin(regs[ISL12057_REG_RTC_YR]) + 100; 102 + 103 + /* Check if years register has overflown from 99 to 00 */ 104 + if (regs[ISL12057_REG_RTC_MO] & ISL12057_REG_RTC_MO_CEN) 105 + tm->tm_year += 100; 103 106 } 104 107 105 108 static int isl12057_rtc_tm_to_regs(u8 *regs, struct rtc_time *tm) 106 109 { 110 + u8 century_bit; 111 + 107 112 /* 108 113 * The clock has an 8 bit wide bcd-coded register for the year. 114 + * It also has a century bit encoded in MO flag which provides 115 + * information about overflow of year register from 99 to 00. 109 116 * tm_year is an offset from 1900 and we are interested in the 110 - * 2000-2099 range, so any value less than 100 is invalid. 117 + * 2000-2199 range, so any value less than 100 or larger than 118 + * 299 is invalid. 111 119 */ 112 - if (tm->tm_year < 100) 120 + if (tm->tm_year < 100 || tm->tm_year > 299) 113 121 return -EINVAL; 122 + 123 + century_bit = (tm->tm_year > 199) ? 
ISL12057_REG_RTC_MO_CEN : 0; 114 124 115 125 regs[ISL12057_REG_RTC_SC] = bin2bcd(tm->tm_sec); 116 126 regs[ISL12057_REG_RTC_MN] = bin2bcd(tm->tm_min); 117 127 regs[ISL12057_REG_RTC_HR] = bin2bcd(tm->tm_hour); /* 24-hour format */ 118 128 regs[ISL12057_REG_RTC_DT] = bin2bcd(tm->tm_mday); 119 - regs[ISL12057_REG_RTC_MO] = bin2bcd(tm->tm_mon + 1); 120 - regs[ISL12057_REG_RTC_YR] = bin2bcd(tm->tm_year - 100); 129 + regs[ISL12057_REG_RTC_MO] = bin2bcd(tm->tm_mon + 1) | century_bit; 130 + regs[ISL12057_REG_RTC_YR] = bin2bcd(tm->tm_year % 100); 121 131 regs[ISL12057_REG_RTC_DW] = bin2bcd(tm->tm_wday + 1); 122 132 123 133 return 0; ··· 164 152 { 165 153 struct isl12057_rtc_data *data = dev_get_drvdata(dev); 166 154 u8 regs[ISL12057_RTC_SEC_LEN]; 155 + unsigned int sr; 167 156 int ret; 168 157 169 158 mutex_lock(&data->lock); 159 + ret = regmap_read(data->regmap, ISL12057_REG_SR, &sr); 160 + if (ret) { 161 + dev_err(dev, "%s: unable to read oscillator status flag (%d)\n", 162 + __func__, ret); 163 + goto out; 164 + } else { 165 + if (sr & ISL12057_REG_SR_OSF) { 166 + ret = -ENODATA; 167 + goto out; 168 + } 169 + } 170 + 170 171 ret = regmap_bulk_read(data->regmap, ISL12057_REG_RTC_SC, regs, 171 172 ISL12057_RTC_SEC_LEN); 173 + if (ret) 174 + dev_err(dev, "%s: unable to read RTC time section (%d)\n", 175 + __func__, ret); 176 + 177 + out: 172 178 mutex_unlock(&data->lock); 173 179 174 - if (ret) { 175 - dev_err(dev, "%s: RTC read failed\n", __func__); 180 + if (ret) 176 181 return ret; 177 - } 178 182 179 183 isl12057_rtc_regs_to_tm(tm, regs); 180 184 ··· 210 182 mutex_lock(&data->lock); 211 183 ret = regmap_bulk_write(data->regmap, ISL12057_REG_RTC_SC, regs, 212 184 ISL12057_RTC_SEC_LEN); 213 - mutex_unlock(&data->lock); 185 + if (ret) { 186 + dev_err(dev, "%s: unable to write RTC time section (%d)\n", 187 + __func__, ret); 188 + goto out; 189 + } 214 190 215 - if (ret) 216 - dev_err(dev, "%s: RTC write failed\n", __func__); 191 + /* 192 + * Now that RTC time has been 
updated, let's clear oscillator 193 + * failure flag, if needed. 194 + */ 195 + ret = regmap_update_bits(data->regmap, ISL12057_REG_SR, 196 + ISL12057_REG_SR_OSF, 0); 197 + if (ret < 0) 198 + dev_err(dev, "%s: unable to clear osc. failure bit (%d)\n", 199 + __func__, ret); 200 + 201 + out: 202 + mutex_unlock(&data->lock); 217 203 218 204 return ret; 219 205 } ··· 245 203 ret = regmap_update_bits(regmap, ISL12057_REG_INT, 246 204 ISL12057_REG_INT_EOSC, 0); 247 205 if (ret < 0) { 248 - dev_err(dev, "Unable to enable oscillator\n"); 249 - return ret; 250 - } 251 - 252 - /* Clear oscillator failure bit if needed */ 253 - ret = regmap_update_bits(regmap, ISL12057_REG_SR, 254 - ISL12057_REG_SR_OSF, 0); 255 - if (ret < 0) { 256 - dev_err(dev, "Unable to clear oscillator failure bit\n"); 206 + dev_err(dev, "%s: unable to enable oscillator (%d)\n", 207 + __func__, ret); 257 208 return ret; 258 209 } 259 210 ··· 254 219 ret = regmap_update_bits(regmap, ISL12057_REG_SR, 255 220 ISL12057_REG_SR_A1F, 0); 256 221 if (ret < 0) { 257 - dev_err(dev, "Unable to clear alarm bit\n"); 222 + dev_err(dev, "%s: unable to clear alarm bit (%d)\n", 223 + __func__, ret); 258 224 return ret; 259 225 } 260 226 ··· 289 253 regmap = devm_regmap_init_i2c(client, &isl12057_rtc_regmap_config); 290 254 if (IS_ERR(regmap)) { 291 255 ret = PTR_ERR(regmap); 292 - dev_err(dev, "regmap allocation failed: %d\n", ret); 256 + dev_err(dev, "%s: regmap allocation failed (%d)\n", 257 + __func__, ret); 293 258 return ret; 294 259 } 295 260
+361 -210
drivers/rtc/rtc-omap.c
··· 1 1 /* 2 - * TI OMAP1 Real Time Clock interface for Linux 2 + * TI OMAP Real Time Clock interface for Linux 3 3 * 4 4 * Copyright (C) 2003 MontaVista Software, Inc. 5 5 * Author: George G. Davis <gdavis@mvista.com> or <source@mvista.com> 6 6 * 7 7 * Copyright (C) 2006 David Brownell (new RTC framework) 8 + * Copyright (C) 2014 Johan Hovold <johan@kernel.org> 8 9 * 9 10 * This program is free software; you can redistribute it and/or 10 11 * modify it under the terms of the GNU General Public License ··· 26 25 #include <linux/pm_runtime.h> 27 26 #include <linux/io.h> 28 27 29 - /* The OMAP1 RTC is a year/month/day/hours/minutes/seconds BCD clock 28 + /* 29 + * The OMAP RTC is a year/month/day/hours/minutes/seconds BCD clock 30 30 * with century-range alarm matching, driven by the 32kHz clock. 31 31 * 32 32 * The main user-visible ways it differs from PC RTCs are by omitting ··· 40 38 * low power modes) for OMAP1 boards (OMAP-L138 has this built into 41 39 * the SoC). See the BOARD-SPECIFIC CUSTOMIZATION comment. 
42 40 */ 43 - 44 - #define DRIVER_NAME "omap_rtc" 45 - 46 - #define OMAP_RTC_BASE 0xfffb4800 47 41 48 42 /* RTC registers */ 49 43 #define OMAP_RTC_SECONDS_REG 0x00 ··· 70 72 71 73 #define OMAP_RTC_IRQWAKEEN 0x7c 72 74 75 + #define OMAP_RTC_ALARM2_SECONDS_REG 0x80 76 + #define OMAP_RTC_ALARM2_MINUTES_REG 0x84 77 + #define OMAP_RTC_ALARM2_HOURS_REG 0x88 78 + #define OMAP_RTC_ALARM2_DAYS_REG 0x8c 79 + #define OMAP_RTC_ALARM2_MONTHS_REG 0x90 80 + #define OMAP_RTC_ALARM2_YEARS_REG 0x94 81 + 82 + #define OMAP_RTC_PMIC_REG 0x98 83 + 73 84 /* OMAP_RTC_CTRL_REG bit fields: */ 74 85 #define OMAP_RTC_CTRL_SPLIT BIT(7) 75 86 #define OMAP_RTC_CTRL_DISABLE BIT(6) ··· 91 84 92 85 /* OMAP_RTC_STATUS_REG bit fields: */ 93 86 #define OMAP_RTC_STATUS_POWER_UP BIT(7) 87 + #define OMAP_RTC_STATUS_ALARM2 BIT(7) 94 88 #define OMAP_RTC_STATUS_ALARM BIT(6) 95 89 #define OMAP_RTC_STATUS_1D_EVENT BIT(5) 96 90 #define OMAP_RTC_STATUS_1H_EVENT BIT(4) ··· 101 93 #define OMAP_RTC_STATUS_BUSY BIT(0) 102 94 103 95 /* OMAP_RTC_INTERRUPTS_REG bit fields: */ 96 + #define OMAP_RTC_INTERRUPTS_IT_ALARM2 BIT(4) 104 97 #define OMAP_RTC_INTERRUPTS_IT_ALARM BIT(3) 105 98 #define OMAP_RTC_INTERRUPTS_IT_TIMER BIT(2) 106 99 ··· 111 102 /* OMAP_RTC_IRQWAKEEN bit fields: */ 112 103 #define OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN BIT(1) 113 104 105 + /* OMAP_RTC_PMIC bit fields: */ 106 + #define OMAP_RTC_PMIC_POWER_EN_EN BIT(16) 107 + 114 108 /* OMAP_RTC_KICKER values */ 115 109 #define KICK0_VALUE 0x83e70b13 116 110 #define KICK1_VALUE 0x95a4f1e0 117 111 118 - #define OMAP_RTC_HAS_KICKER BIT(0) 112 + struct omap_rtc_device_type { 113 + bool has_32kclk_en; 114 + bool has_kicker; 115 + bool has_irqwakeen; 116 + bool has_pmic_mode; 117 + bool has_power_up_reset; 118 + }; 119 + 120 + struct omap_rtc { 121 + struct rtc_device *rtc; 122 + void __iomem *base; 123 + int irq_alarm; 124 + int irq_timer; 125 + u8 interrupts_reg; 126 + bool is_pmic_controller; 127 + const struct omap_rtc_device_type *type; 128 + }; 129 + 130 + 
static inline u8 rtc_read(struct omap_rtc *rtc, unsigned int reg) 131 + { 132 + return readb(rtc->base + reg); 133 + } 134 + 135 + static inline u32 rtc_readl(struct omap_rtc *rtc, unsigned int reg) 136 + { 137 + return readl(rtc->base + reg); 138 + } 139 + 140 + static inline void rtc_write(struct omap_rtc *rtc, unsigned int reg, u8 val) 141 + { 142 + writeb(val, rtc->base + reg); 143 + } 144 + 145 + static inline void rtc_writel(struct omap_rtc *rtc, unsigned int reg, u32 val) 146 + { 147 + writel(val, rtc->base + reg); 148 + } 119 149 120 150 /* 121 - * Few RTC IP revisions has special WAKE-EN Register to enable Wakeup 122 - * generation for event Alarm. 123 - */ 124 - #define OMAP_RTC_HAS_IRQWAKEEN BIT(1) 125 - 126 - /* 127 - * Some RTC IP revisions (like those in AM335x and DRA7x) need 128 - * the 32KHz clock to be explicitly enabled. 129 - */ 130 - #define OMAP_RTC_HAS_32KCLK_EN BIT(2) 131 - 132 - static void __iomem *rtc_base; 133 - 134 - #define rtc_read(addr) readb(rtc_base + (addr)) 135 - #define rtc_write(val, addr) writeb(val, rtc_base + (addr)) 136 - 137 - #define rtc_writel(val, addr) writel(val, rtc_base + (addr)) 138 - 139 - 140 - /* we rely on the rtc framework to handle locking (rtc->ops_lock), 151 + * We rely on the rtc framework to handle locking (rtc->ops_lock), 141 152 * so the only other requirement is that register accesses which 142 153 * require BUSY to be clear are made with IRQs locally disabled 143 154 */ 144 - static void rtc_wait_not_busy(void) 155 + static void rtc_wait_not_busy(struct omap_rtc *rtc) 145 156 { 146 - int count = 0; 147 - u8 status; 157 + int count; 158 + u8 status; 148 159 149 160 /* BUSY may stay active for 1/32768 second (~30 usec) */ 150 161 for (count = 0; count < 50; count++) { 151 - status = rtc_read(OMAP_RTC_STATUS_REG); 152 - if ((status & (u8)OMAP_RTC_STATUS_BUSY) == 0) 162 + status = rtc_read(rtc, OMAP_RTC_STATUS_REG); 163 + if (!(status & OMAP_RTC_STATUS_BUSY)) 153 164 break; 154 165 udelay(1); 155 166 } 
156 167 /* now we have ~15 usec to read/write various registers */ 157 168 } 158 169 159 - static irqreturn_t rtc_irq(int irq, void *rtc) 170 + static irqreturn_t rtc_irq(int irq, void *dev_id) 160 171 { 161 - unsigned long events = 0; 162 - u8 irq_data; 172 + struct omap_rtc *rtc = dev_id; 173 + unsigned long events = 0; 174 + u8 irq_data; 163 175 164 - irq_data = rtc_read(OMAP_RTC_STATUS_REG); 176 + irq_data = rtc_read(rtc, OMAP_RTC_STATUS_REG); 165 177 166 178 /* alarm irq? */ 167 179 if (irq_data & OMAP_RTC_STATUS_ALARM) { 168 - rtc_write(OMAP_RTC_STATUS_ALARM, OMAP_RTC_STATUS_REG); 180 + rtc_write(rtc, OMAP_RTC_STATUS_REG, OMAP_RTC_STATUS_ALARM); 169 181 events |= RTC_IRQF | RTC_AF; 170 182 } 171 183 ··· 194 164 if (irq_data & OMAP_RTC_STATUS_1S_EVENT) 195 165 events |= RTC_IRQF | RTC_UF; 196 166 197 - rtc_update_irq(rtc, 1, events); 167 + rtc_update_irq(rtc->rtc, 1, events); 198 168 199 169 return IRQ_HANDLED; 200 170 } 201 171 202 172 static int omap_rtc_alarm_irq_enable(struct device *dev, unsigned int enabled) 203 173 { 174 + struct omap_rtc *rtc = dev_get_drvdata(dev); 204 175 u8 reg, irqwake_reg = 0; 205 - struct platform_device *pdev = to_platform_device(dev); 206 - const struct platform_device_id *id_entry = 207 - platform_get_device_id(pdev); 208 176 209 177 local_irq_disable(); 210 - rtc_wait_not_busy(); 211 - reg = rtc_read(OMAP_RTC_INTERRUPTS_REG); 212 - if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 213 - irqwake_reg = rtc_read(OMAP_RTC_IRQWAKEEN); 178 + rtc_wait_not_busy(rtc); 179 + reg = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); 180 + if (rtc->type->has_irqwakeen) 181 + irqwake_reg = rtc_read(rtc, OMAP_RTC_IRQWAKEEN); 214 182 215 183 if (enabled) { 216 184 reg |= OMAP_RTC_INTERRUPTS_IT_ALARM; ··· 217 189 reg &= ~OMAP_RTC_INTERRUPTS_IT_ALARM; 218 190 irqwake_reg &= ~OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN; 219 191 } 220 - rtc_wait_not_busy(); 221 - rtc_write(reg, OMAP_RTC_INTERRUPTS_REG); 222 - if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 223 - 
rtc_write(irqwake_reg, OMAP_RTC_IRQWAKEEN); 192 + rtc_wait_not_busy(rtc); 193 + rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, reg); 194 + if (rtc->type->has_irqwakeen) 195 + rtc_write(rtc, OMAP_RTC_IRQWAKEEN, irqwake_reg); 224 196 local_irq_enable(); 225 197 226 198 return 0; ··· 258 230 tm->tm_year = bcd2bin(tm->tm_year) + 100; 259 231 } 260 232 233 + static void omap_rtc_read_time_raw(struct omap_rtc *rtc, struct rtc_time *tm) 234 + { 235 + tm->tm_sec = rtc_read(rtc, OMAP_RTC_SECONDS_REG); 236 + tm->tm_min = rtc_read(rtc, OMAP_RTC_MINUTES_REG); 237 + tm->tm_hour = rtc_read(rtc, OMAP_RTC_HOURS_REG); 238 + tm->tm_mday = rtc_read(rtc, OMAP_RTC_DAYS_REG); 239 + tm->tm_mon = rtc_read(rtc, OMAP_RTC_MONTHS_REG); 240 + tm->tm_year = rtc_read(rtc, OMAP_RTC_YEARS_REG); 241 + } 261 242 262 243 static int omap_rtc_read_time(struct device *dev, struct rtc_time *tm) 263 244 { 245 + struct omap_rtc *rtc = dev_get_drvdata(dev); 246 + 264 247 /* we don't report wday/yday/isdst ... */ 265 248 local_irq_disable(); 266 - rtc_wait_not_busy(); 267 - 268 - tm->tm_sec = rtc_read(OMAP_RTC_SECONDS_REG); 269 - tm->tm_min = rtc_read(OMAP_RTC_MINUTES_REG); 270 - tm->tm_hour = rtc_read(OMAP_RTC_HOURS_REG); 271 - tm->tm_mday = rtc_read(OMAP_RTC_DAYS_REG); 272 - tm->tm_mon = rtc_read(OMAP_RTC_MONTHS_REG); 273 - tm->tm_year = rtc_read(OMAP_RTC_YEARS_REG); 274 - 249 + rtc_wait_not_busy(rtc); 250 + omap_rtc_read_time_raw(rtc, tm); 275 251 local_irq_enable(); 276 252 277 253 bcd2tm(tm); 254 + 278 255 return 0; 279 256 } 280 257 281 258 static int omap_rtc_set_time(struct device *dev, struct rtc_time *tm) 282 259 { 260 + struct omap_rtc *rtc = dev_get_drvdata(dev); 261 + 283 262 if (tm2bcd(tm) < 0) 284 263 return -EINVAL; 285 - local_irq_disable(); 286 - rtc_wait_not_busy(); 287 264 288 - rtc_write(tm->tm_year, OMAP_RTC_YEARS_REG); 289 - rtc_write(tm->tm_mon, OMAP_RTC_MONTHS_REG); 290 - rtc_write(tm->tm_mday, OMAP_RTC_DAYS_REG); 291 - rtc_write(tm->tm_hour, OMAP_RTC_HOURS_REG); 292 - rtc_write(tm->tm_min, 
OMAP_RTC_MINUTES_REG); 293 - rtc_write(tm->tm_sec, OMAP_RTC_SECONDS_REG); 265 + local_irq_disable(); 266 + rtc_wait_not_busy(rtc); 267 + 268 + rtc_write(rtc, OMAP_RTC_YEARS_REG, tm->tm_year); 269 + rtc_write(rtc, OMAP_RTC_MONTHS_REG, tm->tm_mon); 270 + rtc_write(rtc, OMAP_RTC_DAYS_REG, tm->tm_mday); 271 + rtc_write(rtc, OMAP_RTC_HOURS_REG, tm->tm_hour); 272 + rtc_write(rtc, OMAP_RTC_MINUTES_REG, tm->tm_min); 273 + rtc_write(rtc, OMAP_RTC_SECONDS_REG, tm->tm_sec); 294 274 295 275 local_irq_enable(); 296 276 ··· 307 271 308 272 static int omap_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alm) 309 273 { 310 - local_irq_disable(); 311 - rtc_wait_not_busy(); 274 + struct omap_rtc *rtc = dev_get_drvdata(dev); 275 + u8 interrupts; 312 276 313 - alm->time.tm_sec = rtc_read(OMAP_RTC_ALARM_SECONDS_REG); 314 - alm->time.tm_min = rtc_read(OMAP_RTC_ALARM_MINUTES_REG); 315 - alm->time.tm_hour = rtc_read(OMAP_RTC_ALARM_HOURS_REG); 316 - alm->time.tm_mday = rtc_read(OMAP_RTC_ALARM_DAYS_REG); 317 - alm->time.tm_mon = rtc_read(OMAP_RTC_ALARM_MONTHS_REG); 318 - alm->time.tm_year = rtc_read(OMAP_RTC_ALARM_YEARS_REG); 277 + local_irq_disable(); 278 + rtc_wait_not_busy(rtc); 279 + 280 + alm->time.tm_sec = rtc_read(rtc, OMAP_RTC_ALARM_SECONDS_REG); 281 + alm->time.tm_min = rtc_read(rtc, OMAP_RTC_ALARM_MINUTES_REG); 282 + alm->time.tm_hour = rtc_read(rtc, OMAP_RTC_ALARM_HOURS_REG); 283 + alm->time.tm_mday = rtc_read(rtc, OMAP_RTC_ALARM_DAYS_REG); 284 + alm->time.tm_mon = rtc_read(rtc, OMAP_RTC_ALARM_MONTHS_REG); 285 + alm->time.tm_year = rtc_read(rtc, OMAP_RTC_ALARM_YEARS_REG); 319 286 320 287 local_irq_enable(); 321 288 322 289 bcd2tm(&alm->time); 323 - alm->enabled = !!(rtc_read(OMAP_RTC_INTERRUPTS_REG) 324 - & OMAP_RTC_INTERRUPTS_IT_ALARM); 290 + 291 + interrupts = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); 292 + alm->enabled = !!(interrupts & OMAP_RTC_INTERRUPTS_IT_ALARM); 325 293 326 294 return 0; 327 295 } 328 296 329 297 static int omap_rtc_set_alarm(struct device *dev, 
struct rtc_wkalrm *alm) 330 298 { 299 + struct omap_rtc *rtc = dev_get_drvdata(dev); 331 300 u8 reg, irqwake_reg = 0; 332 - struct platform_device *pdev = to_platform_device(dev); 333 - const struct platform_device_id *id_entry = 334 - platform_get_device_id(pdev); 335 301 336 302 if (tm2bcd(&alm->time) < 0) 337 303 return -EINVAL; 338 304 339 305 local_irq_disable(); 340 - rtc_wait_not_busy(); 306 + rtc_wait_not_busy(rtc); 341 307 342 - rtc_write(alm->time.tm_year, OMAP_RTC_ALARM_YEARS_REG); 343 - rtc_write(alm->time.tm_mon, OMAP_RTC_ALARM_MONTHS_REG); 344 - rtc_write(alm->time.tm_mday, OMAP_RTC_ALARM_DAYS_REG); 345 - rtc_write(alm->time.tm_hour, OMAP_RTC_ALARM_HOURS_REG); 346 - rtc_write(alm->time.tm_min, OMAP_RTC_ALARM_MINUTES_REG); 347 - rtc_write(alm->time.tm_sec, OMAP_RTC_ALARM_SECONDS_REG); 308 + rtc_write(rtc, OMAP_RTC_ALARM_YEARS_REG, alm->time.tm_year); 309 + rtc_write(rtc, OMAP_RTC_ALARM_MONTHS_REG, alm->time.tm_mon); 310 + rtc_write(rtc, OMAP_RTC_ALARM_DAYS_REG, alm->time.tm_mday); 311 + rtc_write(rtc, OMAP_RTC_ALARM_HOURS_REG, alm->time.tm_hour); 312 + rtc_write(rtc, OMAP_RTC_ALARM_MINUTES_REG, alm->time.tm_min); 313 + rtc_write(rtc, OMAP_RTC_ALARM_SECONDS_REG, alm->time.tm_sec); 348 314 349 - reg = rtc_read(OMAP_RTC_INTERRUPTS_REG); 350 - if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 351 - irqwake_reg = rtc_read(OMAP_RTC_IRQWAKEEN); 315 + reg = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); 316 + if (rtc->type->has_irqwakeen) 317 + irqwake_reg = rtc_read(rtc, OMAP_RTC_IRQWAKEEN); 352 318 353 319 if (alm->enabled) { 354 320 reg |= OMAP_RTC_INTERRUPTS_IT_ALARM; ··· 359 321 reg &= ~OMAP_RTC_INTERRUPTS_IT_ALARM; 360 322 irqwake_reg &= ~OMAP_RTC_IRQWAKEEN_ALARM_WAKEEN; 361 323 } 362 - rtc_write(reg, OMAP_RTC_INTERRUPTS_REG); 363 - if (id_entry->driver_data & OMAP_RTC_HAS_IRQWAKEEN) 364 - rtc_write(irqwake_reg, OMAP_RTC_IRQWAKEEN); 324 + rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, reg); 325 + if (rtc->type->has_irqwakeen) 326 + rtc_write(rtc, OMAP_RTC_IRQWAKEEN, 
irqwake_reg); 365 327 366 328 local_irq_enable(); 367 329 368 330 return 0; 331 + } 332 + 333 + static struct omap_rtc *omap_rtc_power_off_rtc; 334 + 335 + /* 336 + * omap_rtc_poweroff: RTC-controlled power off 337 + * 338 + * The RTC can be used to control an external PMIC via the pmic_power_en pin, 339 + * which can be configured to transition to OFF on ALARM2 events. 340 + * 341 + * Notes: 342 + * The two-second alarm offset is the shortest offset possible as the alarm 343 + * registers must be set before the next timer update and the offset 344 + * calculation is too heavy for everything to be done within a single access 345 + * period (~15 us). 346 + * 347 + * Called with local interrupts disabled. 348 + */ 349 + static void omap_rtc_power_off(void) 350 + { 351 + struct omap_rtc *rtc = omap_rtc_power_off_rtc; 352 + struct rtc_time tm; 353 + unsigned long now; 354 + u32 val; 355 + 356 + /* enable pmic_power_en control */ 357 + val = rtc_readl(rtc, OMAP_RTC_PMIC_REG); 358 + rtc_writel(rtc, OMAP_RTC_PMIC_REG, val | OMAP_RTC_PMIC_POWER_EN_EN); 359 + 360 + /* set alarm two seconds from now */ 361 + omap_rtc_read_time_raw(rtc, &tm); 362 + bcd2tm(&tm); 363 + rtc_tm_to_time(&tm, &now); 364 + rtc_time_to_tm(now + 2, &tm); 365 + 366 + if (tm2bcd(&tm) < 0) { 367 + dev_err(&rtc->rtc->dev, "power off failed\n"); 368 + return; 369 + } 370 + 371 + rtc_wait_not_busy(rtc); 372 + 373 + rtc_write(rtc, OMAP_RTC_ALARM2_SECONDS_REG, tm.tm_sec); 374 + rtc_write(rtc, OMAP_RTC_ALARM2_MINUTES_REG, tm.tm_min); 375 + rtc_write(rtc, OMAP_RTC_ALARM2_HOURS_REG, tm.tm_hour); 376 + rtc_write(rtc, OMAP_RTC_ALARM2_DAYS_REG, tm.tm_mday); 377 + rtc_write(rtc, OMAP_RTC_ALARM2_MONTHS_REG, tm.tm_mon); 378 + rtc_write(rtc, OMAP_RTC_ALARM2_YEARS_REG, tm.tm_year); 379 + 380 + /* 381 + * enable ALARM2 interrupt 382 + * 383 + * NOTE: this fails on AM3352 if rtc_write (writeb) is used 384 + */ 385 + val = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); 386 + rtc_writel(rtc, OMAP_RTC_INTERRUPTS_REG, 387 + val | 
OMAP_RTC_INTERRUPTS_IT_ALARM2); 388 + 389 + /* 390 + * Wait for alarm to trigger (within two seconds) and external PMIC to 391 + * power off the system. Add a 500 ms margin for external latencies 392 + * (e.g. debounce circuits). 393 + */ 394 + mdelay(2500); 369 395 } 370 396 371 397 static struct rtc_class_ops omap_rtc_ops = { ··· 440 338 .alarm_irq_enable = omap_rtc_alarm_irq_enable, 441 339 }; 442 340 443 - static int omap_rtc_alarm; 444 - static int omap_rtc_timer; 445 - 446 - #define OMAP_RTC_DATA_AM3352_IDX 1 447 - #define OMAP_RTC_DATA_DA830_IDX 2 448 - 449 - static struct platform_device_id omap_rtc_devtype[] = { 450 - { 451 - .name = DRIVER_NAME, 452 - }, 453 - [OMAP_RTC_DATA_AM3352_IDX] = { 454 - .name = "am3352-rtc", 455 - .driver_data = OMAP_RTC_HAS_KICKER | OMAP_RTC_HAS_IRQWAKEEN | 456 - OMAP_RTC_HAS_32KCLK_EN, 457 - }, 458 - [OMAP_RTC_DATA_DA830_IDX] = { 459 - .name = "da830-rtc", 460 - .driver_data = OMAP_RTC_HAS_KICKER, 461 - }, 462 - {}, 341 + static const struct omap_rtc_device_type omap_rtc_default_type = { 342 + .has_power_up_reset = true, 463 343 }; 464 - MODULE_DEVICE_TABLE(platform, omap_rtc_devtype); 344 + 345 + static const struct omap_rtc_device_type omap_rtc_am3352_type = { 346 + .has_32kclk_en = true, 347 + .has_kicker = true, 348 + .has_irqwakeen = true, 349 + .has_pmic_mode = true, 350 + }; 351 + 352 + static const struct omap_rtc_device_type omap_rtc_da830_type = { 353 + .has_kicker = true, 354 + }; 355 + 356 + static const struct platform_device_id omap_rtc_id_table[] = { 357 + { 358 + .name = "omap_rtc", 359 + .driver_data = (kernel_ulong_t)&omap_rtc_default_type, 360 + }, { 361 + .name = "am3352-rtc", 362 + .driver_data = (kernel_ulong_t)&omap_rtc_am3352_type, 363 + }, { 364 + .name = "da830-rtc", 365 + .driver_data = (kernel_ulong_t)&omap_rtc_da830_type, 366 + }, { 367 + /* sentinel */ 368 + } 369 + }; 370 + MODULE_DEVICE_TABLE(platform, omap_rtc_id_table); 465 371 466 372 static const struct of_device_id omap_rtc_of_match[] = { 
467 - { .compatible = "ti,da830-rtc", 468 - .data = &omap_rtc_devtype[OMAP_RTC_DATA_DA830_IDX], 469 - }, 470 - { .compatible = "ti,am3352-rtc", 471 - .data = &omap_rtc_devtype[OMAP_RTC_DATA_AM3352_IDX], 472 - }, 473 - {}, 373 + { 374 + .compatible = "ti,am3352-rtc", 375 + .data = &omap_rtc_am3352_type, 376 + }, { 377 + .compatible = "ti,da830-rtc", 378 + .data = &omap_rtc_da830_type, 379 + }, { 380 + /* sentinel */ 381 + } 474 382 }; 475 383 MODULE_DEVICE_TABLE(of, omap_rtc_of_match); 476 384 477 385 static int __init omap_rtc_probe(struct platform_device *pdev) 478 386 { 479 - struct resource *res; 480 - struct rtc_device *rtc; 481 - u8 reg, new_ctrl; 387 + struct omap_rtc *rtc; 388 + struct resource *res; 389 + u8 reg, mask, new_ctrl; 482 390 const struct platform_device_id *id_entry; 483 391 const struct of_device_id *of_id; 392 + int ret; 393 + 394 + rtc = devm_kzalloc(&pdev->dev, sizeof(*rtc), GFP_KERNEL); 395 + if (!rtc) 396 + return -ENOMEM; 484 397 485 398 of_id = of_match_device(omap_rtc_of_match, &pdev->dev); 486 - if (of_id) 487 - pdev->id_entry = of_id->data; 488 - 489 - id_entry = platform_get_device_id(pdev); 490 - if (!id_entry) { 491 - dev_err(&pdev->dev, "no matching device entry\n"); 492 - return -ENODEV; 399 + if (of_id) { 400 + rtc->type = of_id->data; 401 + rtc->is_pmic_controller = rtc->type->has_pmic_mode && 402 + of_property_read_bool(pdev->dev.of_node, 403 + "system-power-controller"); 404 + } else { 405 + id_entry = platform_get_device_id(pdev); 406 + rtc->type = (void *)id_entry->driver_data; 493 407 } 494 408 495 - omap_rtc_timer = platform_get_irq(pdev, 0); 496 - if (omap_rtc_timer <= 0) { 497 - pr_debug("%s: no update irq?\n", pdev->name); 409 + rtc->irq_timer = platform_get_irq(pdev, 0); 410 + if (rtc->irq_timer <= 0) 498 411 return -ENOENT; 499 - } 500 412 501 - omap_rtc_alarm = platform_get_irq(pdev, 1); 502 - if (omap_rtc_alarm <= 0) { 503 - pr_debug("%s: no alarm irq?\n", pdev->name); 413 + rtc->irq_alarm = platform_get_irq(pdev, 
1); 414 + if (rtc->irq_alarm <= 0) 504 415 return -ENOENT; 505 - } 506 416 507 417 res = platform_get_resource(pdev, IORESOURCE_MEM, 0); 508 - rtc_base = devm_ioremap_resource(&pdev->dev, res); 509 - if (IS_ERR(rtc_base)) 510 - return PTR_ERR(rtc_base); 418 + rtc->base = devm_ioremap_resource(&pdev->dev, res); 419 + if (IS_ERR(rtc->base)) 420 + return PTR_ERR(rtc->base); 421 + 422 + platform_set_drvdata(pdev, rtc); 511 423 512 424 /* Enable the clock/module so that we can access the registers */ 513 425 pm_runtime_enable(&pdev->dev); 514 426 pm_runtime_get_sync(&pdev->dev); 515 427 516 - if (id_entry->driver_data & OMAP_RTC_HAS_KICKER) { 517 - rtc_writel(KICK0_VALUE, OMAP_RTC_KICK0_REG); 518 - rtc_writel(KICK1_VALUE, OMAP_RTC_KICK1_REG); 428 + if (rtc->type->has_kicker) { 429 + rtc_writel(rtc, OMAP_RTC_KICK0_REG, KICK0_VALUE); 430 + rtc_writel(rtc, OMAP_RTC_KICK1_REG, KICK1_VALUE); 519 431 } 520 432 521 - rtc = devm_rtc_device_register(&pdev->dev, pdev->name, 522 - &omap_rtc_ops, THIS_MODULE); 523 - if (IS_ERR(rtc)) { 524 - pr_debug("%s: can't register RTC device, err %ld\n", 525 - pdev->name, PTR_ERR(rtc)); 526 - goto fail0; 527 - } 528 - platform_set_drvdata(pdev, rtc); 529 - 530 - /* clear pending irqs, and set 1/second periodic, 531 - * which we'll use instead of update irqs 433 + /* 434 + * disable interrupts 435 + * 436 + * NOTE: ALARM2 is not cleared on AM3352 if rtc_write (writeb) is used 532 437 */ 533 - rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 438 + rtc_writel(rtc, OMAP_RTC_INTERRUPTS_REG, 0); 534 439 535 440 /* enable RTC functional clock */ 536 - if (id_entry->driver_data & OMAP_RTC_HAS_32KCLK_EN) 537 - rtc_writel(OMAP_RTC_OSC_32KCLK_EN, OMAP_RTC_OSC_REG); 441 + if (rtc->type->has_32kclk_en) { 442 + reg = rtc_read(rtc, OMAP_RTC_OSC_REG); 443 + rtc_writel(rtc, OMAP_RTC_OSC_REG, 444 + reg | OMAP_RTC_OSC_32KCLK_EN); 445 + } 538 446 539 447 /* clear old status */ 540 - reg = rtc_read(OMAP_RTC_STATUS_REG); 541 - if (reg & (u8) OMAP_RTC_STATUS_POWER_UP) { 542 - 
pr_info("%s: RTC power up reset detected\n", 543 - pdev->name); 544 - rtc_write(OMAP_RTC_STATUS_POWER_UP, OMAP_RTC_STATUS_REG); 545 - } 546 - if (reg & (u8) OMAP_RTC_STATUS_ALARM) 547 - rtc_write(OMAP_RTC_STATUS_ALARM, OMAP_RTC_STATUS_REG); 448 + reg = rtc_read(rtc, OMAP_RTC_STATUS_REG); 548 449 549 - /* handle periodic and alarm irqs */ 550 - if (devm_request_irq(&pdev->dev, omap_rtc_timer, rtc_irq, 0, 551 - dev_name(&rtc->dev), rtc)) { 552 - pr_debug("%s: RTC timer interrupt IRQ%d already claimed\n", 553 - pdev->name, omap_rtc_timer); 554 - goto fail0; 450 + mask = OMAP_RTC_STATUS_ALARM; 451 + 452 + if (rtc->type->has_pmic_mode) 453 + mask |= OMAP_RTC_STATUS_ALARM2; 454 + 455 + if (rtc->type->has_power_up_reset) { 456 + mask |= OMAP_RTC_STATUS_POWER_UP; 457 + if (reg & OMAP_RTC_STATUS_POWER_UP) 458 + dev_info(&pdev->dev, "RTC power up reset detected\n"); 555 459 } 556 - if ((omap_rtc_timer != omap_rtc_alarm) && 557 - (devm_request_irq(&pdev->dev, omap_rtc_alarm, rtc_irq, 0, 558 - dev_name(&rtc->dev), rtc))) { 559 - pr_debug("%s: RTC alarm interrupt IRQ%d already claimed\n", 560 - pdev->name, omap_rtc_alarm); 561 - goto fail0; 562 - } 460 + 461 + if (reg & mask) 462 + rtc_write(rtc, OMAP_RTC_STATUS_REG, reg & mask); 563 463 564 464 /* On boards with split power, RTC_ON_NOFF won't reset the RTC */ 565 - reg = rtc_read(OMAP_RTC_CTRL_REG); 566 - if (reg & (u8) OMAP_RTC_CTRL_STOP) 567 - pr_info("%s: already running\n", pdev->name); 465 + reg = rtc_read(rtc, OMAP_RTC_CTRL_REG); 466 + if (reg & OMAP_RTC_CTRL_STOP) 467 + dev_info(&pdev->dev, "already running\n"); 568 468 569 469 /* force to 24 hour mode */ 570 - new_ctrl = reg & (OMAP_RTC_CTRL_SPLIT|OMAP_RTC_CTRL_AUTO_COMP); 470 + new_ctrl = reg & (OMAP_RTC_CTRL_SPLIT | OMAP_RTC_CTRL_AUTO_COMP); 571 471 new_ctrl |= OMAP_RTC_CTRL_STOP; 572 472 573 - /* BOARD-SPECIFIC CUSTOMIZATION CAN GO HERE: 473 + /* 474 + * BOARD-SPECIFIC CUSTOMIZATION CAN GO HERE: 574 475 * 575 476 * - Device wake-up capability setting should come 
through chip 576 477 * init logic. OMAP1 boards should initialize the "wakeup capable" ··· 587 482 * is write-only, and always reads as zero...) 588 483 */ 589 484 590 - device_init_wakeup(&pdev->dev, true); 591 - 592 - if (new_ctrl & (u8) OMAP_RTC_CTRL_SPLIT) 593 - pr_info("%s: split power mode\n", pdev->name); 485 + if (new_ctrl & OMAP_RTC_CTRL_SPLIT) 486 + dev_info(&pdev->dev, "split power mode\n"); 594 487 595 488 if (reg != new_ctrl) 596 - rtc_write(new_ctrl, OMAP_RTC_CTRL_REG); 489 + rtc_write(rtc, OMAP_RTC_CTRL_REG, new_ctrl); 490 + 491 + device_init_wakeup(&pdev->dev, true); 492 + 493 + rtc->rtc = devm_rtc_device_register(&pdev->dev, pdev->name, 494 + &omap_rtc_ops, THIS_MODULE); 495 + if (IS_ERR(rtc->rtc)) { 496 + ret = PTR_ERR(rtc->rtc); 497 + goto err; 498 + } 499 + 500 + /* handle periodic and alarm irqs */ 501 + ret = devm_request_irq(&pdev->dev, rtc->irq_timer, rtc_irq, 0, 502 + dev_name(&rtc->rtc->dev), rtc); 503 + if (ret) 504 + goto err; 505 + 506 + if (rtc->irq_timer != rtc->irq_alarm) { 507 + ret = devm_request_irq(&pdev->dev, rtc->irq_alarm, rtc_irq, 0, 508 + dev_name(&rtc->rtc->dev), rtc); 509 + if (ret) 510 + goto err; 511 + } 512 + 513 + if (rtc->is_pmic_controller) { 514 + if (!pm_power_off) { 515 + omap_rtc_power_off_rtc = rtc; 516 + pm_power_off = omap_rtc_power_off; 517 + } 518 + } 597 519 598 520 return 0; 599 521 600 - fail0: 601 - if (id_entry->driver_data & OMAP_RTC_HAS_KICKER) 602 - rtc_writel(0, OMAP_RTC_KICK0_REG); 522 + err: 523 + device_init_wakeup(&pdev->dev, false); 524 + if (rtc->type->has_kicker) 525 + rtc_writel(rtc, OMAP_RTC_KICK0_REG, 0); 603 526 pm_runtime_put_sync(&pdev->dev); 604 527 pm_runtime_disable(&pdev->dev); 605 - return -EIO; 528 + 529 + return ret; 606 530 } 607 531 608 532 static int __exit omap_rtc_remove(struct platform_device *pdev) 609 533 { 610 - const struct platform_device_id *id_entry = 611 - platform_get_device_id(pdev); 534 + struct omap_rtc *rtc = platform_get_drvdata(pdev); 535 + 536 + if 
(pm_power_off == omap_rtc_power_off && 537 + omap_rtc_power_off_rtc == rtc) { 538 + pm_power_off = NULL; 539 + omap_rtc_power_off_rtc = NULL; 540 + } 612 541 613 542 device_init_wakeup(&pdev->dev, 0); 614 543 615 544 /* leave rtc running, but disable irqs */ 616 - rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 545 + rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, 0); 617 546 618 - if (id_entry->driver_data & OMAP_RTC_HAS_KICKER) 619 - rtc_writel(0, OMAP_RTC_KICK0_REG); 547 + if (rtc->type->has_kicker) 548 + rtc_writel(rtc, OMAP_RTC_KICK0_REG, 0); 620 549 621 550 /* Disable the clock/module */ 622 551 pm_runtime_put_sync(&pdev->dev); ··· 660 521 } 661 522 662 523 #ifdef CONFIG_PM_SLEEP 663 - static u8 irqstat; 664 - 665 524 static int omap_rtc_suspend(struct device *dev) 666 525 { 667 - irqstat = rtc_read(OMAP_RTC_INTERRUPTS_REG); 526 + struct omap_rtc *rtc = dev_get_drvdata(dev); 668 527 669 - /* FIXME the RTC alarm is not currently acting as a wakeup event 528 + rtc->interrupts_reg = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); 529 + 530 + /* 531 + * FIXME: the RTC alarm is not currently acting as a wakeup event 670 532 * source on some platforms, and in fact this enable() call is just 671 533 * saving a flag that's never used... 
672 534 */ 673 535 if (device_may_wakeup(dev)) 674 - enable_irq_wake(omap_rtc_alarm); 536 + enable_irq_wake(rtc->irq_alarm); 675 537 else 676 - rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 538 + rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, 0); 677 539 678 540 /* Disable the clock/module */ 679 541 pm_runtime_put_sync(dev); ··· 684 544 685 545 static int omap_rtc_resume(struct device *dev) 686 546 { 547 + struct omap_rtc *rtc = dev_get_drvdata(dev); 548 + 687 549 /* Enable the clock/module so that we can access the registers */ 688 550 pm_runtime_get_sync(dev); 689 551 690 552 if (device_may_wakeup(dev)) 691 - disable_irq_wake(omap_rtc_alarm); 553 + disable_irq_wake(rtc->irq_alarm); 692 554 else 693 - rtc_write(irqstat, OMAP_RTC_INTERRUPTS_REG); 555 + rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, rtc->interrupts_reg); 694 556 695 557 return 0; 696 558 } ··· 702 560 703 561 static void omap_rtc_shutdown(struct platform_device *pdev) 704 562 { 705 - rtc_write(0, OMAP_RTC_INTERRUPTS_REG); 563 + struct omap_rtc *rtc = platform_get_drvdata(pdev); 564 + u8 mask; 565 + 566 + /* 567 + * Keep the ALARM interrupt enabled to allow the system to power up on 568 + * alarm events. 569 + */ 570 + mask = rtc_read(rtc, OMAP_RTC_INTERRUPTS_REG); 571 + mask &= OMAP_RTC_INTERRUPTS_IT_ALARM; 572 + rtc_write(rtc, OMAP_RTC_INTERRUPTS_REG, mask); 706 573 } 707 574 708 - MODULE_ALIAS("platform:omap_rtc"); 709 575 static struct platform_driver omap_rtc_driver = { 710 576 .remove = __exit_p(omap_rtc_remove), 711 577 .shutdown = omap_rtc_shutdown, 712 578 .driver = { 713 - .name = DRIVER_NAME, 579 + .name = "omap_rtc", 714 580 .owner = THIS_MODULE, 715 581 .pm = &omap_rtc_pm_ops, 716 582 .of_match_table = omap_rtc_of_match, 717 583 }, 718 - .id_table = omap_rtc_devtype, 584 + .id_table = omap_rtc_id_table, 719 585 }; 720 586 721 587 module_platform_driver_probe(omap_rtc_driver, omap_rtc_probe); 722 588 589 + MODULE_ALIAS("platform:omap_rtc"); 723 590 MODULE_AUTHOR("George G. 
Davis (and others)"); 724 591 MODULE_LICENSE("GPL");
+46 -9
drivers/rtc/rtc-pcf8563.c
··· 28 28 #define PCF8563_REG_ST2 0x01 29 29 #define PCF8563_BIT_AIE (1 << 1) 30 30 #define PCF8563_BIT_AF (1 << 3) 31 + #define PCF8563_BITS_ST2_N (7 << 5) 31 32 32 33 #define PCF8563_REG_SC 0x02 /* datetime */ 33 34 #define PCF8563_REG_MN 0x03 ··· 42 41 43 42 #define PCF8563_REG_CLKO 0x0D /* clock out */ 44 43 #define PCF8563_REG_TMRC 0x0E /* timer control */ 44 + #define PCF8563_TMRC_ENABLE BIT(7) 45 + #define PCF8563_TMRC_4096 0 46 + #define PCF8563_TMRC_64 1 47 + #define PCF8563_TMRC_1 2 48 + #define PCF8563_TMRC_1_60 3 49 + #define PCF8563_TMRC_MASK 3 50 + 45 51 #define PCF8563_REG_TMR 0x0F /* timer */ 46 52 47 53 #define PCF8563_SC_LV 0x80 /* low voltage */ ··· 126 118 127 119 static int pcf8563_set_alarm_mode(struct i2c_client *client, bool on) 128 120 { 129 - unsigned char buf[2]; 121 + unsigned char buf; 130 122 int err; 131 123 132 - err = pcf8563_read_block_data(client, PCF8563_REG_ST2, 1, buf + 1); 124 + err = pcf8563_read_block_data(client, PCF8563_REG_ST2, 1, &buf); 133 125 if (err < 0) 134 126 return err; 135 127 136 128 if (on) 137 - buf[1] |= PCF8563_BIT_AIE; 129 + buf |= PCF8563_BIT_AIE; 138 130 else 139 - buf[1] &= ~PCF8563_BIT_AIE; 131 + buf &= ~PCF8563_BIT_AIE; 140 132 141 - buf[1] &= ~PCF8563_BIT_AF; 142 - buf[0] = PCF8563_REG_ST2; 133 + buf &= ~(PCF8563_BIT_AF | PCF8563_BITS_ST2_N); 143 134 144 - err = pcf8563_write_block_data(client, PCF8563_REG_ST2, 1, buf + 1); 135 + err = pcf8563_write_block_data(client, PCF8563_REG_ST2, 1, &buf); 145 136 if (err < 0) { 146 137 dev_err(&client->dev, "%s: write error\n", __func__); 147 138 return -EIO; ··· 343 336 __func__, buf[0], buf[1], buf[2], buf[3]); 344 337 345 338 tm->time.tm_min = bcd2bin(buf[0] & 0x7F); 346 - tm->time.tm_hour = bcd2bin(buf[1] & 0x7F); 347 - tm->time.tm_mday = bcd2bin(buf[2] & 0x1F); 339 + tm->time.tm_hour = bcd2bin(buf[1] & 0x3F); 340 + tm->time.tm_mday = bcd2bin(buf[2] & 0x3F); 348 341 tm->time.tm_wday = bcd2bin(buf[3] & 0x7); 349 342 tm->time.tm_mon = -1; 350 343 
tm->time.tm_year = -1; ··· 368 361 struct i2c_client *client = to_i2c_client(dev); 369 362 unsigned char buf[4]; 370 363 int err; 364 + unsigned long alarm_time; 365 + 366 + /* The alarm has no seconds, round up to nearest minute */ 367 + if (tm->time.tm_sec) { 368 + rtc_tm_to_time(&tm->time, &alarm_time); 369 + alarm_time += 60-tm->time.tm_sec; 370 + rtc_time_to_tm(alarm_time, &tm->time); 371 + } 371 372 372 373 dev_dbg(dev, "%s, min=%d hour=%d wday=%d mday=%d " 373 374 "enabled=%d pending=%d\n", __func__, ··· 396 381 397 382 static int pcf8563_irq_enable(struct device *dev, unsigned int enabled) 398 383 { 384 + dev_dbg(dev, "%s: en=%d\n", __func__, enabled); 399 385 return pcf8563_set_alarm_mode(to_i2c_client(dev), !!enabled); 400 386 } 401 387 ··· 414 398 { 415 399 struct pcf8563 *pcf8563; 416 400 int err; 401 + unsigned char buf; 402 + unsigned char alm_pending; 417 403 418 404 dev_dbg(&client->dev, "%s\n", __func__); 419 405 ··· 432 414 i2c_set_clientdata(client, pcf8563); 433 415 pcf8563->client = client; 434 416 device_set_wakeup_capable(&client->dev, 1); 417 + 418 + /* Set timer to lowest frequency to save power (ref Haoyu datasheet) */ 419 + buf = PCF8563_TMRC_1_60; 420 + err = pcf8563_write_block_data(client, PCF8563_REG_TMRC, 1, &buf); 421 + if (err < 0) { 422 + dev_err(&client->dev, "%s: write error\n", __func__); 423 + return err; 424 + } 425 + 426 + err = pcf8563_get_alarm_mode(client, NULL, &alm_pending); 427 + if (err < 0) { 428 + dev_err(&client->dev, "%s: read error\n", __func__); 429 + return err; 430 + } 431 + if (alm_pending) 432 + pcf8563_set_alarm_mode(client, 0); 435 433 436 434 pcf8563->rtc = devm_rtc_device_register(&client->dev, 437 435 pcf8563_driver.driver.name, ··· 468 434 } 469 435 470 436 } 437 + 438 + /* the pcf8563 alarm only supports a minute accuracy */ 439 + pcf8563->rtc->uie_unsupported = 1; 471 440 472 441 return 0; 473 442 }
+51 -15
drivers/rtc/rtc-sirfsoc.c
··· 47 47 unsigned irq_wake; 48 48 /* Overflow for every 8 years extra time */ 49 49 u32 overflow_rtc; 50 + spinlock_t lock; 50 51 #ifdef CONFIG_PM 51 52 u32 saved_counter; 52 53 u32 saved_overflow_rtc; ··· 62 61 63 62 rtcdrv = dev_get_drvdata(dev); 64 63 65 - local_irq_disable(); 64 + spin_lock_irq(&rtcdrv->lock); 66 65 67 66 rtc_count = sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_CN); 68 67 ··· 85 84 if (sirfsoc_rtc_iobrg_readl( 86 85 rtcdrv->rtc_base + RTC_STATUS) & SIRFSOC_RTC_AL0E) 87 86 alrm->enabled = 1; 88 - local_irq_enable(); 87 + 88 + spin_unlock_irq(&rtcdrv->lock); 89 89 90 90 return 0; 91 91 } ··· 101 99 if (alrm->enabled) { 102 100 rtc_tm_to_time(&(alrm->time), &rtc_alarm); 103 101 104 - local_irq_disable(); 102 + spin_lock_irq(&rtcdrv->lock); 105 103 106 104 rtc_status_reg = sirfsoc_rtc_iobrg_readl( 107 105 rtcdrv->rtc_base + RTC_STATUS); ··· 125 123 rtc_status_reg |= SIRFSOC_RTC_AL0E; 126 124 sirfsoc_rtc_iobrg_writel( 127 125 rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS); 128 - local_irq_enable(); 126 + 127 + spin_unlock_irq(&rtcdrv->lock); 129 128 } else { 130 129 /* 131 130 * if this function was called with enabled=0 132 131 * then it could mean that the application is 133 132 * trying to cancel an ongoing alarm 134 133 */ 135 - local_irq_disable(); 134 + spin_lock_irq(&rtcdrv->lock); 136 135 137 136 rtc_status_reg = sirfsoc_rtc_iobrg_readl( 138 137 rtcdrv->rtc_base + RTC_STATUS); ··· 149 146 rtcdrv->rtc_base + RTC_STATUS); 150 147 } 151 148 152 - local_irq_enable(); 149 + spin_unlock_irq(&rtcdrv->lock); 153 150 } 154 151 155 152 return 0; ··· 212 209 } 213 210 } 214 211 212 + static int sirfsoc_rtc_alarm_irq_enable(struct device *dev, 213 + unsigned int enabled) 214 + { 215 + unsigned long rtc_status_reg = 0x0; 216 + struct sirfsoc_rtc_drv *rtcdrv; 217 + 218 + rtcdrv = dev_get_drvdata(dev); 219 + 220 + spin_lock_irq(&rtcdrv->lock); 221 + 222 + rtc_status_reg = sirfsoc_rtc_iobrg_readl( 223 + rtcdrv->rtc_base + RTC_STATUS); 224 + if (enabled) 
225 + rtc_status_reg |= SIRFSOC_RTC_AL0E; 226 + else 227 + rtc_status_reg &= ~SIRFSOC_RTC_AL0E; 228 + 229 + sirfsoc_rtc_iobrg_writel(rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS); 230 + 231 + spin_unlock_irq(&rtcdrv->lock); 232 + 233 + return 0; 234 + 235 + } 236 + 215 237 static const struct rtc_class_ops sirfsoc_rtc_ops = { 216 238 .read_time = sirfsoc_rtc_read_time, 217 239 .set_time = sirfsoc_rtc_set_time, 218 240 .read_alarm = sirfsoc_rtc_read_alarm, 219 241 .set_alarm = sirfsoc_rtc_set_alarm, 220 - .ioctl = sirfsoc_rtc_ioctl 242 + .ioctl = sirfsoc_rtc_ioctl, 243 + .alarm_irq_enable = sirfsoc_rtc_alarm_irq_enable 221 244 }; 222 245 223 246 static irqreturn_t sirfsoc_rtc_irq_handler(int irq, void *pdata) ··· 251 222 struct sirfsoc_rtc_drv *rtcdrv = pdata; 252 223 unsigned long rtc_status_reg = 0x0; 253 224 unsigned long events = 0x0; 225 + 226 + spin_lock(&rtcdrv->lock); 254 227 255 228 rtc_status_reg = sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_STATUS); 256 229 /* this bit will be set ONLY if an alarm was active ··· 271 240 rtc_status_reg &= ~(SIRFSOC_RTC_AL0E); 272 241 } 273 242 sirfsoc_rtc_iobrg_writel(rtc_status_reg, rtcdrv->rtc_base + RTC_STATUS); 243 + 244 + spin_unlock(&rtcdrv->lock); 245 + 274 246 /* this should wake up any apps polling/waiting on the read 275 247 * after setting the alarm 276 248 */ ··· 301 267 if (rtcdrv == NULL) 302 268 return -ENOMEM; 303 269 270 + spin_lock_init(&rtcdrv->lock); 271 + 304 272 err = of_property_read_u32(np, "reg", &rtcdrv->rtc_base); 305 273 if (err) { 306 274 dev_err(&pdev->dev, "unable to find base address of rtc node in dtb\n"); ··· 322 286 rtc_div = ((32768 / RTC_HZ) / 2) - 1; 323 287 sirfsoc_rtc_iobrg_writel(rtc_div, rtcdrv->rtc_base + RTC_DIV); 324 288 325 - rtcdrv->rtc = devm_rtc_device_register(&pdev->dev, pdev->name, 326 - &sirfsoc_rtc_ops, THIS_MODULE); 327 - if (IS_ERR(rtcdrv->rtc)) { 328 - err = PTR_ERR(rtcdrv->rtc); 329 - dev_err(&pdev->dev, "can't register RTC device\n"); 330 - return err; 331 - } 
332 - 333 289 /* 0x3 -> RTC_CLK */ 334 290 sirfsoc_rtc_iobrg_writel(SIRFSOC_RTC_CLK, 335 291 rtcdrv->rtc_base + RTC_CLOCK_SWITCH); ··· 335 307 /* Restore RTC Overflow From Register After Command Reboot */ 336 308 rtcdrv->overflow_rtc = 337 309 sirfsoc_rtc_iobrg_readl(rtcdrv->rtc_base + RTC_SW_VALUE); 310 + 311 + rtcdrv->rtc = devm_rtc_device_register(&pdev->dev, pdev->name, 312 + &sirfsoc_rtc_ops, THIS_MODULE); 313 + if (IS_ERR(rtcdrv->rtc)) { 314 + err = PTR_ERR(rtcdrv->rtc); 315 + dev_err(&pdev->dev, "can't register RTC device\n"); 316 + return err; 317 + } 338 318 339 319 rtcdrv->irq = platform_get_irq(pdev, 0); 340 320 err = devm_request_irq(
+36 -3
drivers/rtc/rtc-snvs.c
··· 17 17 #include <linux/of_device.h> 18 18 #include <linux/platform_device.h> 19 19 #include <linux/rtc.h> 20 + #include <linux/clk.h> 20 21 21 22 /* These register offsets are relative to LP (Low Power) range */ 22 23 #define SNVS_LPCR 0x04 ··· 40 39 void __iomem *ioaddr; 41 40 int irq; 42 41 spinlock_t lock; 42 + struct clk *clk; 43 43 }; 44 44 45 45 static u32 rtc_read_lp_counter(void __iomem *ioaddr) ··· 262 260 if (data->irq < 0) 263 261 return data->irq; 264 262 263 + data->clk = devm_clk_get(&pdev->dev, "snvs-rtc"); 264 + if (IS_ERR(data->clk)) { 265 + data->clk = NULL; 266 + } else { 267 + ret = clk_prepare_enable(data->clk); 268 + if (ret) { 269 + dev_err(&pdev->dev, 270 + "Could not prepare or enable the snvs clock\n"); 271 + return ret; 272 + } 273 + } 274 + 265 275 platform_set_drvdata(pdev, data); 266 276 267 277 spin_lock_init(&data->lock); ··· 294 280 if (ret) { 295 281 dev_err(&pdev->dev, "failed to request irq %d: %d\n", 296 282 data->irq, ret); 297 - return ret; 283 + goto error_rtc_device_register; 298 284 } 299 285 300 286 data->rtc = devm_rtc_device_register(&pdev->dev, pdev->name, ··· 302 288 if (IS_ERR(data->rtc)) { 303 289 ret = PTR_ERR(data->rtc); 304 290 dev_err(&pdev->dev, "failed to register rtc: %d\n", ret); 305 - return ret; 291 + goto error_rtc_device_register; 306 292 } 307 293 308 294 return 0; 295 + 296 + error_rtc_device_register: 297 + if (data->clk) 298 + clk_disable_unprepare(data->clk); 299 + 300 + return ret; 309 301 } 310 302 311 303 #ifdef CONFIG_PM_SLEEP ··· 322 302 if (device_may_wakeup(dev)) 323 303 enable_irq_wake(data->irq); 324 304 305 + if (data->clk) 306 + clk_disable_unprepare(data->clk); 307 + 325 308 return 0; 326 309 } 327 310 328 311 static int snvs_rtc_resume(struct device *dev) 329 312 { 330 313 struct snvs_rtc_data *data = dev_get_drvdata(dev); 314 + int ret; 331 315 332 316 if (device_may_wakeup(dev)) 333 317 disable_irq_wake(data->irq); 318 + 319 + if (data->clk) { 320 + ret = 
clk_prepare_enable(data->clk); 321 + if (ret) 322 + return ret; 323 + } 334 324 335 325 return 0; 336 326 } 337 327 #endif 338 328 339 - static SIMPLE_DEV_PM_OPS(snvs_rtc_pm_ops, snvs_rtc_suspend, snvs_rtc_resume); 329 + static const struct dev_pm_ops snvs_rtc_pm_ops = { 330 + .suspend_noirq = snvs_rtc_suspend, 331 + .resume_noirq = snvs_rtc_resume, 332 + }; 340 333 341 334 static const struct of_device_id snvs_dt_ids[] = { 342 335 { .compatible = "fsl,sec-v4.0-mon-rtc-lp", },
+1 -1
drivers/usb/storage/debug.c
··· 188 188 189 189 va_start(args, fmt); 190 190 191 - r = dev_vprintk_emit(7, &us->pusb_dev->dev, fmt, args); 191 + r = dev_vprintk_emit(LOGLEVEL_DEBUG, &us->pusb_dev->dev, fmt, args); 192 192 193 193 va_end(args); 194 194
+22 -18
fs/binfmt_elf.c
··· 1994 1994 shdr4extnum->sh_info = segs; 1995 1995 } 1996 1996 1997 - static size_t elf_core_vma_data_size(struct vm_area_struct *gate_vma, 1998 - unsigned long mm_flags) 1999 - { 2000 - struct vm_area_struct *vma; 2001 - size_t size = 0; 2002 - 2003 - for (vma = first_vma(current, gate_vma); vma != NULL; 2004 - vma = next_vma(vma, gate_vma)) 2005 - size += vma_dump_size(vma, mm_flags); 2006 - return size; 2007 - } 2008 - 2009 1997 /* 2010 1998 * Actual dumper 2011 1999 * ··· 2005 2017 { 2006 2018 int has_dumped = 0; 2007 2019 mm_segment_t fs; 2008 - int segs; 2020 + int segs, i; 2021 + size_t vma_data_size = 0; 2009 2022 struct vm_area_struct *vma, *gate_vma; 2010 2023 struct elfhdr *elf = NULL; 2011 2024 loff_t offset = 0, dataoff; ··· 2015 2026 struct elf_shdr *shdr4extnum = NULL; 2016 2027 Elf_Half e_phnum; 2017 2028 elf_addr_t e_shoff; 2029 + elf_addr_t *vma_filesz = NULL; 2018 2030 2019 2031 /* 2020 2032 * We no longer stop all VM operations. ··· 2083 2093 2084 2094 dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE); 2085 2095 2086 - offset += elf_core_vma_data_size(gate_vma, cprm->mm_flags); 2096 + vma_filesz = kmalloc_array(segs - 1, sizeof(*vma_filesz), GFP_KERNEL); 2097 + if (!vma_filesz) 2098 + goto end_coredump; 2099 + 2100 + for (i = 0, vma = first_vma(current, gate_vma); vma != NULL; 2101 + vma = next_vma(vma, gate_vma)) { 2102 + unsigned long dump_size; 2103 + 2104 + dump_size = vma_dump_size(vma, cprm->mm_flags); 2105 + vma_filesz[i++] = dump_size; 2106 + vma_data_size += dump_size; 2107 + } 2108 + 2109 + offset += vma_data_size; 2087 2110 offset += elf_core_extra_data_size(); 2088 2111 e_shoff = offset; 2089 2112 ··· 2116 2113 goto end_coredump; 2117 2114 2118 2115 /* Write program headers for segments dump */ 2119 - for (vma = first_vma(current, gate_vma); vma != NULL; 2116 + for (i = 0, vma = first_vma(current, gate_vma); vma != NULL; 2120 2117 vma = next_vma(vma, gate_vma)) { 2121 2118 struct elf_phdr phdr; 2122 2119 ··· 2124 2121 
phdr.p_offset = offset; 2125 2122 phdr.p_vaddr = vma->vm_start; 2126 2123 phdr.p_paddr = 0; 2127 - phdr.p_filesz = vma_dump_size(vma, cprm->mm_flags); 2124 + phdr.p_filesz = vma_filesz[i++]; 2128 2125 phdr.p_memsz = vma->vm_end - vma->vm_start; 2129 2126 offset += phdr.p_filesz; 2130 2127 phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0; ··· 2152 2149 if (!dump_skip(cprm, dataoff - cprm->written)) 2153 2150 goto end_coredump; 2154 2151 2155 - for (vma = first_vma(current, gate_vma); vma != NULL; 2152 + for (i = 0, vma = first_vma(current, gate_vma); vma != NULL; 2156 2153 vma = next_vma(vma, gate_vma)) { 2157 2154 unsigned long addr; 2158 2155 unsigned long end; 2159 2156 2160 - end = vma->vm_start + vma_dump_size(vma, cprm->mm_flags); 2157 + end = vma->vm_start + vma_filesz[i++]; 2161 2158 2162 2159 for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) { 2163 2160 struct page *page; ··· 2190 2187 cleanup: 2191 2188 free_note_info(&info); 2192 2189 kfree(shdr4extnum); 2190 + kfree(vma_filesz); 2193 2191 kfree(phdr4note); 2194 2192 kfree(elf); 2195 2193 out:
+244 -141
fs/binfmt_misc.c
··· 1 1 /* 2 - * binfmt_misc.c 2 + * binfmt_misc.c 3 3 * 4 - * Copyright (C) 1997 Richard Günther 4 + * Copyright (C) 1997 Richard Günther 5 5 * 6 - * binfmt_misc detects binaries via a magic or filename extension and invokes 7 - * a specified wrapper. This should obsolete binfmt_java, binfmt_em86 and 8 - * binfmt_mz. 9 - * 10 - * 1997-04-25 first version 11 - * [...] 12 - * 1997-05-19 cleanup 13 - * 1997-06-26 hpa: pass the real filename rather than argv[0] 14 - * 1997-06-30 minor cleanup 15 - * 1997-08-09 removed extension stripping, locking cleanup 16 - * 2001-02-28 AV: rewritten into something that resembles C. Original didn't. 6 + * binfmt_misc detects binaries via a magic or filename extension and invokes 7 + * a specified wrapper. See Documentation/binfmt_misc.txt for more details. 17 8 */ 9 + 10 + #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 18 11 19 12 #include <linux/module.h> 20 13 #include <linux/init.h> ··· 23 30 #include <linux/mount.h> 24 31 #include <linux/syscalls.h> 25 32 #include <linux/fs.h> 33 + #include <linux/uaccess.h> 26 34 27 - #include <asm/uaccess.h> 35 + #ifdef DEBUG 36 + # define USE_DEBUG 1 37 + #else 38 + # define USE_DEBUG 0 39 + #endif 28 40 29 41 enum { 30 42 VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */ ··· 39 41 static int enabled = 1; 40 42 41 43 enum {Enabled, Magic}; 42 - #define MISC_FMT_PRESERVE_ARGV0 (1<<31) 43 - #define MISC_FMT_OPEN_BINARY (1<<30) 44 - #define MISC_FMT_CREDENTIALS (1<<29) 44 + #define MISC_FMT_PRESERVE_ARGV0 (1 << 31) 45 + #define MISC_FMT_OPEN_BINARY (1 << 30) 46 + #define MISC_FMT_CREDENTIALS (1 << 29) 45 47 46 48 typedef struct { 47 49 struct list_head list; ··· 85 87 char *p = strrchr(bprm->interp, '.'); 86 88 struct list_head *l; 87 89 90 + /* Walk all the registered handlers. */ 88 91 list_for_each(l, &entries) { 89 92 Node *e = list_entry(l, Node, list); 90 93 char *s; 91 94 int j; 92 95 96 + /* Make sure this one is currently enabled. 
*/ 93 97 if (!test_bit(Enabled, &e->flags)) 94 98 continue; 95 99 100 + /* Do matching based on extension if applicable. */ 96 101 if (!test_bit(Magic, &e->flags)) { 97 102 if (p && !strcmp(e->magic, p + 1)) 98 103 return e; 99 104 continue; 100 105 } 101 106 107 + /* Do matching based on magic & mask. */ 102 108 s = bprm->buf + e->offset; 103 109 if (e->mask) { 104 110 for (j = 0; j < e->size; j++) ··· 125 123 static int load_misc_binary(struct linux_binprm *bprm) 126 124 { 127 125 Node *fmt; 128 - struct file * interp_file = NULL; 126 + struct file *interp_file = NULL; 129 127 char iname[BINPRM_BUF_SIZE]; 130 128 const char *iname_addr = iname; 131 129 int retval; ··· 133 131 134 132 retval = -ENOEXEC; 135 133 if (!enabled) 136 - goto _ret; 134 + goto ret; 137 135 138 136 /* to keep locking time low, we copy the interpreter string */ 139 137 read_lock(&entries_lock); ··· 142 140 strlcpy(iname, fmt->interpreter, BINPRM_BUF_SIZE); 143 141 read_unlock(&entries_lock); 144 142 if (!fmt) 145 - goto _ret; 143 + goto ret; 146 144 147 145 if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) { 148 146 retval = remove_arg_zero(bprm); 149 147 if (retval) 150 - goto _ret; 148 + goto ret; 151 149 } 152 150 153 151 if (fmt->flags & MISC_FMT_OPEN_BINARY) { 154 152 155 153 /* if the binary should be opened on behalf of the 156 154 * interpreter than keep it open and assign descriptor 157 - * to it */ 158 - fd_binary = get_unused_fd(); 159 - if (fd_binary < 0) { 160 - retval = fd_binary; 161 - goto _ret; 162 - } 163 - fd_install(fd_binary, bprm->file); 155 + * to it 156 + */ 157 + fd_binary = get_unused_fd_flags(0); 158 + if (fd_binary < 0) { 159 + retval = fd_binary; 160 + goto ret; 161 + } 162 + fd_install(fd_binary, bprm->file); 164 163 165 164 /* if the binary is not readable than enforce mm->dumpable=0 166 165 regardless of the interpreter's permissions */ ··· 174 171 bprm->interp_flags |= BINPRM_FLAGS_EXECFD; 175 172 bprm->interp_data = fd_binary; 176 173 177 - } else { 178 - 
allow_write_access(bprm->file); 179 - fput(bprm->file); 180 - bprm->file = NULL; 181 - } 174 + } else { 175 + allow_write_access(bprm->file); 176 + fput(bprm->file); 177 + bprm->file = NULL; 178 + } 182 179 /* make argv[1] be the path to the binary */ 183 - retval = copy_strings_kernel (1, &bprm->interp, bprm); 180 + retval = copy_strings_kernel(1, &bprm->interp, bprm); 184 181 if (retval < 0) 185 - goto _error; 182 + goto error; 186 183 bprm->argc++; 187 184 188 185 /* add the interp as argv[0] */ 189 - retval = copy_strings_kernel (1, &iname_addr, bprm); 186 + retval = copy_strings_kernel(1, &iname_addr, bprm); 190 187 if (retval < 0) 191 - goto _error; 192 - bprm->argc ++; 188 + goto error; 189 + bprm->argc++; 193 190 194 191 /* Update interp in case binfmt_script needs it. */ 195 192 retval = bprm_change_interp(iname, bprm); 196 193 if (retval < 0) 197 - goto _error; 194 + goto error; 198 195 199 - interp_file = open_exec (iname); 200 - retval = PTR_ERR (interp_file); 201 - if (IS_ERR (interp_file)) 202 - goto _error; 196 + interp_file = open_exec(iname); 197 + retval = PTR_ERR(interp_file); 198 + if (IS_ERR(interp_file)) 199 + goto error; 203 200 204 201 bprm->file = interp_file; 205 202 if (fmt->flags & MISC_FMT_CREDENTIALS) { ··· 210 207 memset(bprm->buf, 0, BINPRM_BUF_SIZE); 211 208 retval = kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE); 212 209 } else 213 - retval = prepare_binprm (bprm); 210 + retval = prepare_binprm(bprm); 214 211 215 212 if (retval < 0) 216 - goto _error; 213 + goto error; 217 214 218 215 retval = search_binary_handler(bprm); 219 216 if (retval < 0) 220 - goto _error; 217 + goto error; 221 218 222 - _ret: 219 + ret: 223 220 return retval; 224 - _error: 221 + error: 225 222 if (fd_binary > 0) 226 223 sys_close(fd_binary); 227 224 bprm->interp_flags = 0; 228 225 bprm->interp_data = 0; 229 - goto _ret; 226 + goto ret; 230 227 } 231 228 232 229 /* Command parsers */ ··· 253 250 return s; 254 251 } 255 252 256 - static char * 
check_special_flags (char * sfs, Node * e) 253 + static char *check_special_flags(char *sfs, Node *e) 257 254 { 258 - char * p = sfs; 255 + char *p = sfs; 259 256 int cont = 1; 260 257 261 258 /* special flags */ 262 259 while (cont) { 263 260 switch (*p) { 264 - case 'P': 265 - p++; 266 - e->flags |= MISC_FMT_PRESERVE_ARGV0; 267 - break; 268 - case 'O': 269 - p++; 270 - e->flags |= MISC_FMT_OPEN_BINARY; 271 - break; 272 - case 'C': 273 - p++; 274 - /* this flags also implies the 275 - open-binary flag */ 276 - e->flags |= (MISC_FMT_CREDENTIALS | 277 - MISC_FMT_OPEN_BINARY); 278 - break; 279 - default: 280 - cont = 0; 261 + case 'P': 262 + pr_debug("register: flag: P (preserve argv0)\n"); 263 + p++; 264 + e->flags |= MISC_FMT_PRESERVE_ARGV0; 265 + break; 266 + case 'O': 267 + pr_debug("register: flag: O (open binary)\n"); 268 + p++; 269 + e->flags |= MISC_FMT_OPEN_BINARY; 270 + break; 271 + case 'C': 272 + pr_debug("register: flag: C (preserve creds)\n"); 273 + p++; 274 + /* this flags also implies the 275 + open-binary flag */ 276 + e->flags |= (MISC_FMT_CREDENTIALS | 277 + MISC_FMT_OPEN_BINARY); 278 + break; 279 + default: 280 + cont = 0; 281 281 } 282 282 } 283 283 284 284 return p; 285 285 } 286 + 286 287 /* 287 288 * This registers a new binary format, it recognises the syntax 288 289 * ':name:type:offset:magic:mask:interpreter:flags' ··· 299 292 char *buf, *p; 300 293 char del; 301 294 295 + pr_debug("register: received %zu bytes\n", count); 296 + 302 297 /* some sanity checks */ 303 298 err = -EINVAL; 304 299 if ((count < 11) || (count > MAX_REGISTER_LENGTH)) ··· 308 299 309 300 err = -ENOMEM; 310 301 memsize = sizeof(Node) + count + 8; 311 - e = kmalloc(memsize, GFP_USER); 302 + e = kmalloc(memsize, GFP_KERNEL); 312 303 if (!e) 313 304 goto out; 314 305 ··· 316 307 317 308 memset(e, 0, sizeof(Node)); 318 309 if (copy_from_user(buf, buffer, count)) 319 - goto Efault; 310 + goto efault; 320 311 321 312 del = *p++; /* delimeter */ 322 313 323 - 
memset(buf+count, del, 8); 314 + pr_debug("register: delim: %#x {%c}\n", del, del); 324 315 316 + /* Pad the buffer with the delim to simplify parsing below. */ 317 + memset(buf + count, del, 8); 318 + 319 + /* Parse the 'name' field. */ 325 320 e->name = p; 326 321 p = strchr(p, del); 327 322 if (!p) 328 - goto Einval; 323 + goto einval; 329 324 *p++ = '\0'; 330 325 if (!e->name[0] || 331 326 !strcmp(e->name, ".") || 332 327 !strcmp(e->name, "..") || 333 328 strchr(e->name, '/')) 334 - goto Einval; 329 + goto einval; 330 + 331 + pr_debug("register: name: {%s}\n", e->name); 332 + 333 + /* Parse the 'type' field. */ 335 334 switch (*p++) { 336 - case 'E': e->flags = 1<<Enabled; break; 337 - case 'M': e->flags = (1<<Enabled) | (1<<Magic); break; 338 - default: goto Einval; 335 + case 'E': 336 + pr_debug("register: type: E (extension)\n"); 337 + e->flags = 1 << Enabled; 338 + break; 339 + case 'M': 340 + pr_debug("register: type: M (magic)\n"); 341 + e->flags = (1 << Enabled) | (1 << Magic); 342 + break; 343 + default: 344 + goto einval; 339 345 } 340 346 if (*p++ != del) 341 - goto Einval; 347 + goto einval; 348 + 342 349 if (test_bit(Magic, &e->flags)) { 343 - char *s = strchr(p, del); 350 + /* Handle the 'M' (magic) format. */ 351 + char *s; 352 + 353 + /* Parse the 'offset' field. */ 354 + s = strchr(p, del); 344 355 if (!s) 345 - goto Einval; 356 + goto einval; 346 357 *s++ = '\0'; 347 358 e->offset = simple_strtoul(p, &p, 10); 348 359 if (*p++) 349 - goto Einval; 360 + goto einval; 361 + pr_debug("register: offset: %#x\n", e->offset); 362 + 363 + /* Parse the 'magic' field. 
*/ 350 364 e->magic = p; 351 365 p = scanarg(p, del); 352 366 if (!p) 353 - goto Einval; 367 + goto einval; 354 368 p[-1] = '\0'; 355 - if (!e->magic[0]) 356 - goto Einval; 369 + if (p == e->magic) 370 + goto einval; 371 + if (USE_DEBUG) 372 + print_hex_dump_bytes( 373 + KBUILD_MODNAME ": register: magic[raw]: ", 374 + DUMP_PREFIX_NONE, e->magic, p - e->magic); 375 + 376 + /* Parse the 'mask' field. */ 357 377 e->mask = p; 358 378 p = scanarg(p, del); 359 379 if (!p) 360 - goto Einval; 380 + goto einval; 361 381 p[-1] = '\0'; 362 - if (!e->mask[0]) 382 + if (p == e->mask) { 363 383 e->mask = NULL; 384 + pr_debug("register: mask[raw]: none\n"); 385 + } else if (USE_DEBUG) 386 + print_hex_dump_bytes( 387 + KBUILD_MODNAME ": register: mask[raw]: ", 388 + DUMP_PREFIX_NONE, e->mask, p - e->mask); 389 + 390 + /* 391 + * Decode the magic & mask fields. 392 + * Note: while we might have accepted embedded NUL bytes from 393 + * above, the unescape helpers here will stop at the first one 394 + * it encounters. 
395 + */ 364 396 e->size = string_unescape_inplace(e->magic, UNESCAPE_HEX); 365 397 if (e->mask && 366 398 string_unescape_inplace(e->mask, UNESCAPE_HEX) != e->size) 367 - goto Einval; 399 + goto einval; 368 400 if (e->size + e->offset > BINPRM_BUF_SIZE) 369 - goto Einval; 401 + goto einval; 402 + pr_debug("register: magic/mask length: %i\n", e->size); 403 + if (USE_DEBUG) { 404 + print_hex_dump_bytes( 405 + KBUILD_MODNAME ": register: magic[decoded]: ", 406 + DUMP_PREFIX_NONE, e->magic, e->size); 407 + 408 + if (e->mask) { 409 + int i; 410 + char *masked = kmalloc(e->size, GFP_KERNEL); 411 + 412 + print_hex_dump_bytes( 413 + KBUILD_MODNAME ": register: mask[decoded]: ", 414 + DUMP_PREFIX_NONE, e->mask, e->size); 415 + 416 + if (masked) { 417 + for (i = 0; i < e->size; ++i) 418 + masked[i] = e->magic[i] & e->mask[i]; 419 + print_hex_dump_bytes( 420 + KBUILD_MODNAME ": register: magic[masked]: ", 421 + DUMP_PREFIX_NONE, masked, e->size); 422 + 423 + kfree(masked); 424 + } 425 + } 426 + } 370 427 } else { 428 + /* Handle the 'E' (extension) format. */ 429 + 430 + /* Skip the 'offset' field. */ 371 431 p = strchr(p, del); 372 432 if (!p) 373 - goto Einval; 433 + goto einval; 374 434 *p++ = '\0'; 435 + 436 + /* Parse the 'magic' field. */ 375 437 e->magic = p; 376 438 p = strchr(p, del); 377 439 if (!p) 378 - goto Einval; 440 + goto einval; 379 441 *p++ = '\0'; 380 442 if (!e->magic[0] || strchr(e->magic, '/')) 381 - goto Einval; 443 + goto einval; 444 + pr_debug("register: extension: {%s}\n", e->magic); 445 + 446 + /* Skip the 'mask' field. */ 382 447 p = strchr(p, del); 383 448 if (!p) 384 - goto Einval; 449 + goto einval; 385 450 *p++ = '\0'; 386 451 } 452 + 453 + /* Parse the 'interpreter' field. 
*/ 387 454 e->interpreter = p; 388 455 p = strchr(p, del); 389 456 if (!p) 390 - goto Einval; 457 + goto einval; 391 458 *p++ = '\0'; 392 459 if (!e->interpreter[0]) 393 - goto Einval; 460 + goto einval; 461 + pr_debug("register: interpreter: {%s}\n", e->interpreter); 394 462 395 - 396 - p = check_special_flags (p, e); 397 - 463 + /* Parse the 'flags' field. */ 464 + p = check_special_flags(p, e); 398 465 if (*p == '\n') 399 466 p++; 400 467 if (p != buf + count) 401 - goto Einval; 468 + goto einval; 469 + 402 470 return e; 403 471 404 472 out: 405 473 return ERR_PTR(err); 406 474 407 - Efault: 475 + efault: 408 476 kfree(e); 409 477 return ERR_PTR(-EFAULT); 410 - Einval: 478 + einval: 411 479 kfree(e); 412 480 return ERR_PTR(-EINVAL); 413 481 } ··· 503 417 return -EFAULT; 504 418 if (!count) 505 419 return 0; 506 - if (s[count-1] == '\n') 420 + if (s[count - 1] == '\n') 507 421 count--; 508 422 if (count == 1 && s[0] == '0') 509 423 return 1; ··· 520 434 { 521 435 char *dp; 522 436 char *status = "disabled"; 523 - const char * flags = "flags: "; 437 + const char *flags = "flags: "; 524 438 525 439 if (test_bit(Enabled, &e->flags)) 526 440 status = "enabled"; ··· 534 448 dp = page + strlen(page); 535 449 536 450 /* print the special flags */ 537 - sprintf (dp, "%s", flags); 538 - dp += strlen (flags); 539 - if (e->flags & MISC_FMT_PRESERVE_ARGV0) { 540 - *dp ++ = 'P'; 541 - } 542 - if (e->flags & MISC_FMT_OPEN_BINARY) { 543 - *dp ++ = 'O'; 544 - } 545 - if (e->flags & MISC_FMT_CREDENTIALS) { 546 - *dp ++ = 'C'; 547 - } 548 - *dp ++ = '\n'; 549 - 451 + sprintf(dp, "%s", flags); 452 + dp += strlen(flags); 453 + if (e->flags & MISC_FMT_PRESERVE_ARGV0) 454 + *dp++ = 'P'; 455 + if (e->flags & MISC_FMT_OPEN_BINARY) 456 + *dp++ = 'O'; 457 + if (e->flags & MISC_FMT_CREDENTIALS) 458 + *dp++ = 'C'; 459 + *dp++ = '\n'; 550 460 551 461 if (!test_bit(Magic, &e->flags)) { 552 462 sprintf(dp, "extension .%s\n", e->magic); ··· 570 488 571 489 static struct inode 
*bm_get_inode(struct super_block *sb, int mode) 572 490 { 573 - struct inode * inode = new_inode(sb); 491 + struct inode *inode = new_inode(sb); 574 492 575 493 if (inode) { 576 494 inode->i_ino = get_next_ino(); ··· 610 528 /* /<entry> */ 611 529 612 530 static ssize_t 613 - bm_entry_read(struct file * file, char __user * buf, size_t nbytes, loff_t *ppos) 531 + bm_entry_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) 614 532 { 615 533 Node *e = file_inode(file)->i_private; 616 534 ssize_t res; 617 535 char *page; 618 536 619 - if (!(page = (char*) __get_free_page(GFP_KERNEL))) 537 + page = (char *) __get_free_page(GFP_KERNEL); 538 + if (!page) 620 539 return -ENOMEM; 621 540 622 541 entry_status(e, page); ··· 636 553 int res = parse_command(buffer, count); 637 554 638 555 switch (res) { 639 - case 1: clear_bit(Enabled, &e->flags); 640 - break; 641 - case 2: set_bit(Enabled, &e->flags); 642 - break; 643 - case 3: root = dget(file->f_path.dentry->d_sb->s_root); 644 - mutex_lock(&root->d_inode->i_mutex); 556 + case 1: 557 + /* Disable this handler. */ 558 + clear_bit(Enabled, &e->flags); 559 + break; 560 + case 2: 561 + /* Enable this handler. */ 562 + set_bit(Enabled, &e->flags); 563 + break; 564 + case 3: 565 + /* Delete this handler. 
*/ 566 + root = dget(file->f_path.dentry->d_sb->s_root); 567 + mutex_lock(&root->d_inode->i_mutex); 645 568 646 - kill_node(e); 569 + kill_node(e); 647 570 648 - mutex_unlock(&root->d_inode->i_mutex); 649 - dput(root); 650 - break; 651 - default: return res; 571 + mutex_unlock(&root->d_inode->i_mutex); 572 + dput(root); 573 + break; 574 + default: 575 + return res; 652 576 } 577 + 653 578 return count; 654 579 } 655 580 ··· 745 654 return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s)); 746 655 } 747 656 748 - static ssize_t bm_status_write(struct file * file, const char __user * buffer, 657 + static ssize_t bm_status_write(struct file *file, const char __user *buffer, 749 658 size_t count, loff_t *ppos) 750 659 { 751 660 int res = parse_command(buffer, count); 752 661 struct dentry *root; 753 662 754 663 switch (res) { 755 - case 1: enabled = 0; break; 756 - case 2: enabled = 1; break; 757 - case 3: root = dget(file->f_path.dentry->d_sb->s_root); 758 - mutex_lock(&root->d_inode->i_mutex); 664 + case 1: 665 + /* Disable all handlers. */ 666 + enabled = 0; 667 + break; 668 + case 2: 669 + /* Enable all handlers. */ 670 + enabled = 1; 671 + break; 672 + case 3: 673 + /* Delete all handlers. 
*/ 674 + root = dget(file->f_path.dentry->d_sb->s_root); 675 + mutex_lock(&root->d_inode->i_mutex); 759 676 760 - while (!list_empty(&entries)) 761 - kill_node(list_entry(entries.next, Node, list)); 677 + while (!list_empty(&entries)) 678 + kill_node(list_entry(entries.next, Node, list)); 762 679 763 - mutex_unlock(&root->d_inode->i_mutex); 764 - dput(root); 765 - break; 766 - default: return res; 680 + mutex_unlock(&root->d_inode->i_mutex); 681 + dput(root); 682 + break; 683 + default: 684 + return res; 767 685 } 686 + 768 687 return count; 769 688 } 770 689 ··· 791 690 .evict_inode = bm_evict_inode, 792 691 }; 793 692 794 - static int bm_fill_super(struct super_block * sb, void * data, int silent) 693 + static int bm_fill_super(struct super_block *sb, void *data, int silent) 795 694 { 695 + int err; 796 696 static struct tree_descr bm_files[] = { 797 697 [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO}, 798 698 [3] = {"register", &bm_register_operations, S_IWUSR}, 799 699 /* last one */ {""} 800 700 }; 801 - int err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files); 701 + 702 + err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files); 802 703 if (!err) 803 704 sb->s_op = &s_ops; 804 705 return err;
-1
fs/char_dev.c
··· 117 117 goto out; 118 118 } 119 119 major = i; 120 - ret = major; 121 120 } 122 121 123 122 cd->major = major;
+1 -1
fs/cifs/cifsacl.c
··· 38 38 1, 1, {0, 0, 0, 0, 0, 1}, {0} }; 39 39 /* security id for Authenticated Users system group */ 40 40 static const struct cifs_sid sid_authusers = { 41 - 1, 1, {0, 0, 0, 0, 0, 5}, {__constant_cpu_to_le32(11)} }; 41 + 1, 1, {0, 0, 0, 0, 0, 5}, {cpu_to_le32(11)} }; 42 42 /* group users */ 43 43 static const struct cifs_sid sid_user = {1, 2 , {0, 0, 0, 0, 0, 5}, {} }; 44 44
+10 -10
fs/cifs/cifssmb.c
··· 2477 2477 } 2478 2478 parm_data = (struct cifs_posix_lock *) 2479 2479 ((char *)&pSMBr->hdr.Protocol + data_offset); 2480 - if (parm_data->lock_type == __constant_cpu_to_le16(CIFS_UNLCK)) 2480 + if (parm_data->lock_type == cpu_to_le16(CIFS_UNLCK)) 2481 2481 pLockData->fl_type = F_UNLCK; 2482 2482 else { 2483 2483 if (parm_data->lock_type == 2484 - __constant_cpu_to_le16(CIFS_RDLCK)) 2484 + cpu_to_le16(CIFS_RDLCK)) 2485 2485 pLockData->fl_type = F_RDLCK; 2486 2486 else if (parm_data->lock_type == 2487 - __constant_cpu_to_le16(CIFS_WRLCK)) 2487 + cpu_to_le16(CIFS_WRLCK)) 2488 2488 pLockData->fl_type = F_WRLCK; 2489 2489 2490 2490 pLockData->fl_start = le64_to_cpu(parm_data->start); ··· 3276 3276 pSMB->compression_state = cpu_to_le16(COMPRESSION_FORMAT_DEFAULT); 3277 3277 3278 3278 pSMB->TotalParameterCount = 0; 3279 - pSMB->TotalDataCount = __constant_cpu_to_le32(2); 3279 + pSMB->TotalDataCount = cpu_to_le32(2); 3280 3280 pSMB->MaxParameterCount = 0; 3281 3281 pSMB->MaxDataCount = 0; 3282 3282 pSMB->MaxSetupCount = 4; 3283 3283 pSMB->Reserved = 0; 3284 3284 pSMB->ParameterOffset = 0; 3285 - pSMB->DataCount = __constant_cpu_to_le32(2); 3285 + pSMB->DataCount = cpu_to_le32(2); 3286 3286 pSMB->DataOffset = 3287 3287 cpu_to_le32(offsetof(struct smb_com_transaction_compr_ioctl_req, 3288 3288 compression_state) - 4); /* 84 */ 3289 3289 pSMB->SetupCount = 4; 3290 - pSMB->SubCommand = __constant_cpu_to_le16(NT_TRANSACT_IOCTL); 3290 + pSMB->SubCommand = cpu_to_le16(NT_TRANSACT_IOCTL); 3291 3291 pSMB->ParameterCount = 0; 3292 - pSMB->FunctionCode = __constant_cpu_to_le32(FSCTL_SET_COMPRESSION); 3292 + pSMB->FunctionCode = cpu_to_le32(FSCTL_SET_COMPRESSION); 3293 3293 pSMB->IsFsctl = 1; /* FSCTL */ 3294 3294 pSMB->IsRootFlag = 0; 3295 3295 pSMB->Fid = fid; /* file handle always le */ 3296 3296 /* 3 byte pad, followed by 2 byte compress state */ 3297 - pSMB->ByteCount = __constant_cpu_to_le16(5); 3297 + pSMB->ByteCount = cpu_to_le16(5); 3298 3298 inc_rfc1001_len(pSMB, 5); 
3299 3299 3300 3300 rc = SendReceive(xid, tcon->ses, (struct smb_hdr *) pSMB, ··· 3430 3430 cifs_acl->version = cpu_to_le16(1); 3431 3431 if (acl_type == ACL_TYPE_ACCESS) { 3432 3432 cifs_acl->access_entry_count = cpu_to_le16(count); 3433 - cifs_acl->default_entry_count = __constant_cpu_to_le16(0xFFFF); 3433 + cifs_acl->default_entry_count = cpu_to_le16(0xFFFF); 3434 3434 } else if (acl_type == ACL_TYPE_DEFAULT) { 3435 3435 cifs_acl->default_entry_count = cpu_to_le16(count); 3436 - cifs_acl->access_entry_count = __constant_cpu_to_le16(0xFFFF); 3436 + cifs_acl->access_entry_count = cpu_to_le16(0xFFFF); 3437 3437 } else { 3438 3438 cifs_dbg(FYI, "unknown ACL type %d\n", acl_type); 3439 3439 return 0;
+2 -2
fs/cifs/file.c
··· 1066 1066 1067 1067 max_num = (max_buf - sizeof(struct smb_hdr)) / 1068 1068 sizeof(LOCKING_ANDX_RANGE); 1069 - buf = kzalloc(max_num * sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL); 1069 + buf = kcalloc(max_num, sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL); 1070 1070 if (!buf) { 1071 1071 free_xid(xid); 1072 1072 return -ENOMEM; ··· 1401 1401 1402 1402 max_num = (max_buf - sizeof(struct smb_hdr)) / 1403 1403 sizeof(LOCKING_ANDX_RANGE); 1404 - buf = kzalloc(max_num * sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL); 1404 + buf = kcalloc(max_num, sizeof(LOCKING_ANDX_RANGE), GFP_KERNEL); 1405 1405 if (!buf) 1406 1406 return -ENOMEM; 1407 1407
+1 -1
fs/cifs/sess.c
··· 46 46 CIFSMaxBufSize + MAX_CIFS_HDR_SIZE - 4, 47 47 USHRT_MAX)); 48 48 pSMB->req.MaxMpxCount = cpu_to_le16(ses->server->maxReq); 49 - pSMB->req.VcNumber = __constant_cpu_to_le16(1); 49 + pSMB->req.VcNumber = cpu_to_le16(1); 50 50 51 51 /* Now no need to set SMBFLG_CASELESS or obsolete CANONICAL PATH */ 52 52
+2 -2
fs/cifs/smb2file.c
··· 111 111 return -EINVAL; 112 112 113 113 max_num = max_buf / sizeof(struct smb2_lock_element); 114 - buf = kzalloc(max_num * sizeof(struct smb2_lock_element), GFP_KERNEL); 114 + buf = kcalloc(max_num, sizeof(struct smb2_lock_element), GFP_KERNEL); 115 115 if (!buf) 116 116 return -ENOMEM; 117 117 ··· 247 247 } 248 248 249 249 max_num = max_buf / sizeof(struct smb2_lock_element); 250 - buf = kzalloc(max_num * sizeof(struct smb2_lock_element), GFP_KERNEL); 250 + buf = kcalloc(max_num, sizeof(struct smb2_lock_element), GFP_KERNEL); 251 251 if (!buf) { 252 252 free_xid(xid); 253 253 return -ENOMEM;
+19 -19
fs/cifs/smb2misc.c
··· 67 67 * indexed by command in host byte order 68 68 */ 69 69 static const __le16 smb2_rsp_struct_sizes[NUMBER_OF_SMB2_COMMANDS] = { 70 - /* SMB2_NEGOTIATE */ __constant_cpu_to_le16(65), 71 - /* SMB2_SESSION_SETUP */ __constant_cpu_to_le16(9), 72 - /* SMB2_LOGOFF */ __constant_cpu_to_le16(4), 73 - /* SMB2_TREE_CONNECT */ __constant_cpu_to_le16(16), 74 - /* SMB2_TREE_DISCONNECT */ __constant_cpu_to_le16(4), 75 - /* SMB2_CREATE */ __constant_cpu_to_le16(89), 76 - /* SMB2_CLOSE */ __constant_cpu_to_le16(60), 77 - /* SMB2_FLUSH */ __constant_cpu_to_le16(4), 78 - /* SMB2_READ */ __constant_cpu_to_le16(17), 79 - /* SMB2_WRITE */ __constant_cpu_to_le16(17), 80 - /* SMB2_LOCK */ __constant_cpu_to_le16(4), 81 - /* SMB2_IOCTL */ __constant_cpu_to_le16(49), 70 + /* SMB2_NEGOTIATE */ cpu_to_le16(65), 71 + /* SMB2_SESSION_SETUP */ cpu_to_le16(9), 72 + /* SMB2_LOGOFF */ cpu_to_le16(4), 73 + /* SMB2_TREE_CONNECT */ cpu_to_le16(16), 74 + /* SMB2_TREE_DISCONNECT */ cpu_to_le16(4), 75 + /* SMB2_CREATE */ cpu_to_le16(89), 76 + /* SMB2_CLOSE */ cpu_to_le16(60), 77 + /* SMB2_FLUSH */ cpu_to_le16(4), 78 + /* SMB2_READ */ cpu_to_le16(17), 79 + /* SMB2_WRITE */ cpu_to_le16(17), 80 + /* SMB2_LOCK */ cpu_to_le16(4), 81 + /* SMB2_IOCTL */ cpu_to_le16(49), 82 82 /* BB CHECK this ... 
not listed in documentation */ 83 - /* SMB2_CANCEL */ __constant_cpu_to_le16(0), 84 - /* SMB2_ECHO */ __constant_cpu_to_le16(4), 85 - /* SMB2_QUERY_DIRECTORY */ __constant_cpu_to_le16(9), 86 - /* SMB2_CHANGE_NOTIFY */ __constant_cpu_to_le16(9), 87 - /* SMB2_QUERY_INFO */ __constant_cpu_to_le16(9), 88 - /* SMB2_SET_INFO */ __constant_cpu_to_le16(2), 83 + /* SMB2_CANCEL */ cpu_to_le16(0), 84 + /* SMB2_ECHO */ cpu_to_le16(4), 85 + /* SMB2_QUERY_DIRECTORY */ cpu_to_le16(9), 86 + /* SMB2_CHANGE_NOTIFY */ cpu_to_le16(9), 87 + /* SMB2_QUERY_INFO */ cpu_to_le16(9), 88 + /* SMB2_SET_INFO */ cpu_to_le16(2), 89 89 /* BB FIXME can also be 44 for lease break */ 90 - /* SMB2_OPLOCK_BREAK */ __constant_cpu_to_le16(24) 90 + /* SMB2_OPLOCK_BREAK */ cpu_to_le16(24) 91 91 }; 92 92 93 93 int
+1 -1
fs/cifs/smb2ops.c
··· 600 600 goto cchunk_out; 601 601 602 602 /* For now array only one chunk long, will make more flexible later */ 603 - pcchunk->ChunkCount = __constant_cpu_to_le32(1); 603 + pcchunk->ChunkCount = cpu_to_le32(1); 604 604 pcchunk->Reserved = 0; 605 605 pcchunk->Reserved2 = 0; 606 606
+1 -1
fs/cifs/smb2pdu.c
··· 1358 1358 char *ret_data = NULL; 1359 1359 1360 1360 fsctl_input.CompressionState = 1361 - __constant_cpu_to_le16(COMPRESSION_FORMAT_DEFAULT); 1361 + cpu_to_le16(COMPRESSION_FORMAT_DEFAULT); 1362 1362 1363 1363 rc = SMB2_ioctl(xid, tcon, persistent_fid, volatile_fid, 1364 1364 FSCTL_SET_COMPRESSION, true /* is_fsctl */,
+14 -14
fs/cifs/smb2pdu.h
··· 85 85 /* BB FIXME - analyze following length BB */ 86 86 #define MAX_SMB2_HDR_SIZE 0x78 /* 4 len + 64 hdr + (2*24 wct) + 2 bct + 2 pad */ 87 87 88 - #define SMB2_PROTO_NUMBER __constant_cpu_to_le32(0x424d53fe) 88 + #define SMB2_PROTO_NUMBER cpu_to_le32(0x424d53fe) 89 89 90 90 /* 91 91 * SMB2 Header Definition ··· 96 96 * 97 97 */ 98 98 99 - #define SMB2_HEADER_STRUCTURE_SIZE __constant_cpu_to_le16(64) 99 + #define SMB2_HEADER_STRUCTURE_SIZE cpu_to_le16(64) 100 100 101 101 struct smb2_hdr { 102 102 __be32 smb2_buf_length; /* big endian on wire */ ··· 137 137 } __packed; 138 138 139 139 /* Encryption Algorithms */ 140 - #define SMB2_ENCRYPTION_AES128_CCM __constant_cpu_to_le16(0x0001) 140 + #define SMB2_ENCRYPTION_AES128_CCM cpu_to_le16(0x0001) 141 141 142 142 /* 143 143 * SMB2 flag definitions 144 144 */ 145 - #define SMB2_FLAGS_SERVER_TO_REDIR __constant_cpu_to_le32(0x00000001) 146 - #define SMB2_FLAGS_ASYNC_COMMAND __constant_cpu_to_le32(0x00000002) 147 - #define SMB2_FLAGS_RELATED_OPERATIONS __constant_cpu_to_le32(0x00000004) 148 - #define SMB2_FLAGS_SIGNED __constant_cpu_to_le32(0x00000008) 149 - #define SMB2_FLAGS_DFS_OPERATIONS __constant_cpu_to_le32(0x10000000) 145 + #define SMB2_FLAGS_SERVER_TO_REDIR cpu_to_le32(0x00000001) 146 + #define SMB2_FLAGS_ASYNC_COMMAND cpu_to_le32(0x00000002) 147 + #define SMB2_FLAGS_RELATED_OPERATIONS cpu_to_le32(0x00000004) 148 + #define SMB2_FLAGS_SIGNED cpu_to_le32(0x00000008) 149 + #define SMB2_FLAGS_DFS_OPERATIONS cpu_to_le32(0x10000000) 150 150 151 151 /* 152 152 * Definitions for SMB2 Protocol Data Units (network frames) ··· 157 157 * 158 158 */ 159 159 160 - #define SMB2_ERROR_STRUCTURE_SIZE2 __constant_cpu_to_le16(9) 160 + #define SMB2_ERROR_STRUCTURE_SIZE2 cpu_to_le16(9) 161 161 162 162 struct smb2_err_rsp { 163 163 struct smb2_hdr hdr; ··· 502 502 #define SMB2_LEASE_HANDLE_CACHING_HE 0x02 503 503 #define SMB2_LEASE_WRITE_CACHING_HE 0x04 504 504 505 - #define SMB2_LEASE_NONE __constant_cpu_to_le32(0x00) 506 - #define 
SMB2_LEASE_READ_CACHING __constant_cpu_to_le32(0x01) 507 - #define SMB2_LEASE_HANDLE_CACHING __constant_cpu_to_le32(0x02) 508 - #define SMB2_LEASE_WRITE_CACHING __constant_cpu_to_le32(0x04) 505 + #define SMB2_LEASE_NONE cpu_to_le32(0x00) 506 + #define SMB2_LEASE_READ_CACHING cpu_to_le32(0x01) 507 + #define SMB2_LEASE_HANDLE_CACHING cpu_to_le32(0x02) 508 + #define SMB2_LEASE_WRITE_CACHING cpu_to_le32(0x04) 509 509 510 - #define SMB2_LEASE_FLAG_BREAK_IN_PROGRESS __constant_cpu_to_le32(0x02) 510 + #define SMB2_LEASE_FLAG_BREAK_IN_PROGRESS cpu_to_le32(0x02) 511 511 512 512 #define SMB2_LEASE_KEY_SIZE 16 513 513
+1 -1
fs/file.c
··· 869 869 struct file *file = fget_raw(fildes); 870 870 871 871 if (file) { 872 - ret = get_unused_fd(); 872 + ret = get_unused_fd_flags(0); 873 873 if (ret >= 0) 874 874 fd_install(ret, file); 875 875 else
+8 -6
fs/hfs/catalog.c
··· 162 162 */ 163 163 int hfs_cat_keycmp(const btree_key *key1, const btree_key *key2) 164 164 { 165 - int retval; 165 + __be32 k1p, k2p; 166 166 167 - retval = be32_to_cpu(key1->cat.ParID) - be32_to_cpu(key2->cat.ParID); 168 - if (!retval) 169 - retval = hfs_strcmp(key1->cat.CName.name, key1->cat.CName.len, 170 - key2->cat.CName.name, key2->cat.CName.len); 167 + k1p = key1->cat.ParID; 168 + k2p = key2->cat.ParID; 171 169 172 - return retval; 170 + if (k1p != k2p) 171 + return be32_to_cpu(k1p) < be32_to_cpu(k2p) ? -1 : 1; 172 + 173 + return hfs_strcmp(key1->cat.CName.name, key1->cat.CName.len, 174 + key2->cat.CName.name, key2->cat.CName.len); 173 175 } 174 176 175 177 /* Try to get a catalog entry for given catalog id */
-1
fs/ncpfs/ioctl.c
··· 447 447 result = -EIO; 448 448 } 449 449 } 450 - result = 0; 451 450 } 452 451 mutex_unlock(&server->root_setup_lock); 453 452
+2 -8
fs/nilfs2/file.c
··· 39 39 */ 40 40 struct the_nilfs *nilfs; 41 41 struct inode *inode = file->f_mapping->host; 42 - int err; 43 - 44 - err = filemap_write_and_wait_range(inode->i_mapping, start, end); 45 - if (err) 46 - return err; 47 - mutex_lock(&inode->i_mutex); 42 + int err = 0; 48 43 49 44 if (nilfs_inode_dirty(inode)) { 50 45 if (datasync) 51 46 err = nilfs_construct_dsync_segment(inode->i_sb, inode, 52 - 0, LLONG_MAX); 47 + start, end); 53 48 else 54 49 err = nilfs_construct_segment(inode->i_sb); 55 50 } 56 - mutex_unlock(&inode->i_mutex); 57 51 58 52 nilfs = inode->i_sb->s_fs_info; 59 53 if (!err)
+24 -8
fs/nilfs2/inode.c
··· 49 49 int for_gc; 50 50 }; 51 51 52 + static int nilfs_iget_test(struct inode *inode, void *opaque); 53 + 52 54 void nilfs_inode_add_blocks(struct inode *inode, int n) 53 55 { 54 56 struct nilfs_root *root = NILFS_I(inode)->i_root; ··· 350 348 .is_partially_uptodate = block_is_partially_uptodate, 351 349 }; 352 350 351 + static int nilfs_insert_inode_locked(struct inode *inode, 352 + struct nilfs_root *root, 353 + unsigned long ino) 354 + { 355 + struct nilfs_iget_args args = { 356 + .ino = ino, .root = root, .cno = 0, .for_gc = 0 357 + }; 358 + 359 + return insert_inode_locked4(inode, ino, nilfs_iget_test, &args); 360 + } 361 + 353 362 struct inode *nilfs_new_inode(struct inode *dir, umode_t mode) 354 363 { 355 364 struct super_block *sb = dir->i_sb; ··· 396 383 if (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)) { 397 384 err = nilfs_bmap_read(ii->i_bmap, NULL); 398 385 if (err < 0) 399 - goto failed_bmap; 386 + goto failed_after_creation; 400 387 401 388 set_bit(NILFS_I_BMAP, &ii->i_state); 402 389 /* No lock is needed; iget() ensures it. */ ··· 412 399 spin_lock(&nilfs->ns_next_gen_lock); 413 400 inode->i_generation = nilfs->ns_next_generation++; 414 401 spin_unlock(&nilfs->ns_next_gen_lock); 415 - insert_inode_hash(inode); 402 + if (nilfs_insert_inode_locked(inode, root, ino) < 0) { 403 + err = -EIO; 404 + goto failed_after_creation; 405 + } 416 406 417 407 err = nilfs_init_acl(inode, dir); 418 408 if (unlikely(err)) 419 - goto failed_acl; /* never occur. When supporting 409 + goto failed_after_creation; /* never occur. 
When supporting 420 410 nilfs_init_acl(), proper cancellation of 421 411 above jobs should be considered */ 422 412 423 413 return inode; 424 414 425 - failed_acl: 426 - failed_bmap: 415 + failed_after_creation: 427 416 clear_nlink(inode); 417 + unlock_new_inode(inode); 428 418 iput(inode); /* raw_inode will be deleted through 429 - generic_delete_inode() */ 419 + nilfs_evict_inode() */ 430 420 goto failed; 431 421 432 422 failed_ifile_create_inode: ··· 477 461 inode->i_atime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec); 478 462 inode->i_ctime.tv_nsec = le32_to_cpu(raw_inode->i_ctime_nsec); 479 463 inode->i_mtime.tv_nsec = le32_to_cpu(raw_inode->i_mtime_nsec); 480 - if (inode->i_nlink == 0 && inode->i_mode == 0) 481 - return -EINVAL; /* this inode is deleted */ 464 + if (inode->i_nlink == 0) 465 + return -ESTALE; /* this inode is deleted */ 482 466 483 467 inode->i_blocks = le64_to_cpu(raw_inode->i_blocks); 484 468 ii->i_flags = le32_to_cpu(raw_inode->i_flags);
+12 -3
fs/nilfs2/namei.c
··· 51 51 int err = nilfs_add_link(dentry, inode); 52 52 if (!err) { 53 53 d_instantiate(dentry, inode); 54 + unlock_new_inode(inode); 54 55 return 0; 55 56 } 56 57 inode_dec_link_count(inode); 58 + unlock_new_inode(inode); 57 59 iput(inode); 58 60 return err; 59 61 } ··· 184 182 out_fail: 185 183 drop_nlink(inode); 186 184 nilfs_mark_inode_dirty(inode); 185 + unlock_new_inode(inode); 187 186 iput(inode); 188 187 goto out; 189 188 } ··· 204 201 inode_inc_link_count(inode); 205 202 ihold(inode); 206 203 207 - err = nilfs_add_nondir(dentry, inode); 208 - if (!err) 204 + err = nilfs_add_link(dentry, inode); 205 + if (!err) { 206 + d_instantiate(dentry, inode); 209 207 err = nilfs_transaction_commit(dir->i_sb); 210 - else 208 + } else { 209 + inode_dec_link_count(inode); 210 + iput(inode); 211 211 nilfs_transaction_abort(dir->i_sb); 212 + } 212 213 213 214 return err; 214 215 } ··· 250 243 251 244 nilfs_mark_inode_dirty(inode); 252 245 d_instantiate(dentry, inode); 246 + unlock_new_inode(inode); 253 247 out: 254 248 if (!err) 255 249 err = nilfs_transaction_commit(dir->i_sb); ··· 263 255 drop_nlink(inode); 264 256 drop_nlink(inode); 265 257 nilfs_mark_inode_dirty(inode); 258 + unlock_new_inode(inode); 266 259 iput(inode); 267 260 out_dir: 268 261 drop_nlink(dir);
+1 -2
fs/nilfs2/the_nilfs.c
··· 808 808 spin_lock(&nilfs->ns_cptree_lock); 809 809 rb_erase(&root->rb_node, &nilfs->ns_cptree); 810 810 spin_unlock(&nilfs->ns_cptree_lock); 811 - if (root->ifile) 812 - iput(root->ifile); 811 + iput(root->ifile); 813 812 814 813 kfree(root); 815 814 }
+1 -1
fs/ocfs2/aops.c
··· 1251 1251 ret = ocfs2_extent_map_get_blocks(inode, v_blkno, &p_blkno, NULL, 1252 1252 NULL); 1253 1253 if (ret < 0) { 1254 - ocfs2_error(inode->i_sb, "Corrupting extend for inode %llu, " 1254 + mlog(ML_ERROR, "Get physical blkno failed for inode %llu, " 1255 1255 "at logical block %llu", 1256 1256 (unsigned long long)OCFS2_I(inode)->ip_blkno, 1257 1257 (unsigned long long)v_blkno);
+2 -2
fs/ocfs2/cluster/heartbeat.c
··· 1127 1127 elapsed_msec = o2hb_elapsed_msecs(&before_hb, &after_hb); 1128 1128 1129 1129 mlog(ML_HEARTBEAT, 1130 - "start = %lu.%lu, end = %lu.%lu, msec = %u\n", 1130 + "start = %lu.%lu, end = %lu.%lu, msec = %u, ret = %d\n", 1131 1131 before_hb.tv_sec, (unsigned long) before_hb.tv_usec, 1132 1132 after_hb.tv_sec, (unsigned long) after_hb.tv_usec, 1133 - elapsed_msec); 1133 + elapsed_msec, ret); 1134 1134 1135 1135 if (!kthread_should_stop() && 1136 1136 elapsed_msec < reg->hr_timeout_ms) {
+1 -1
fs/ocfs2/cluster/tcp.c
··· 1736 1736 o2net_idle_timeout() / 1000, 1737 1737 o2net_idle_timeout() % 1000); 1738 1738 1739 - o2net_set_nn_state(nn, NULL, 0, -ENOTCONN); 1739 + o2net_set_nn_state(nn, NULL, 0, 0); 1740 1740 } 1741 1741 spin_unlock(&nn->nn_lock); 1742 1742 }
+1 -1
fs/ocfs2/dir.c
··· 744 744 if (ocfs2_read_dir_block(dir, block, &bh, 0)) { 745 745 /* read error, skip block & hope for the best. 746 746 * ocfs2_read_dir_block() has released the bh. */ 747 - ocfs2_error(dir->i_sb, "reading directory %llu, " 747 + mlog(ML_ERROR, "reading directory %llu, " 748 748 "offset %lu\n", 749 749 (unsigned long long)OCFS2_I(dir)->ip_blkno, 750 750 block);
+1 -1
fs/ocfs2/dlm/dlmdomain.c
··· 877 877 * to be put in someone's domain map. 878 878 * Also, explicitly disallow joining at certain troublesome 879 879 * times (ie. during recovery). */ 880 - if (dlm && dlm->dlm_state != DLM_CTXT_LEAVING) { 880 + if (dlm->dlm_state != DLM_CTXT_LEAVING) { 881 881 int bit = query->node_idx; 882 882 spin_lock(&dlm->spinlock); 883 883
+12
fs/ocfs2/dlm/dlmmaster.c
··· 1460 1460 1461 1461 /* take care of the easy cases up front */ 1462 1462 spin_lock(&res->spinlock); 1463 + 1464 + /* 1465 + * Right after dlm spinlock was released, dlm_thread could have 1466 + * purged the lockres. Check if lockres got unhashed. If so 1467 + * start over. 1468 + */ 1469 + if (hlist_unhashed(&res->hash_node)) { 1470 + spin_unlock(&res->spinlock); 1471 + dlm_lockres_put(res); 1472 + goto way_up_top; 1473 + } 1474 + 1463 1475 if (res->state & (DLM_LOCK_RES_RECOVERING| 1464 1476 DLM_LOCK_RES_MIGRATING)) { 1465 1477 spin_unlock(&res->spinlock);
+13 -5
fs/ocfs2/dlm/dlmrecovery.c
··· 1656 1656 req.namelen = res->lockname.len; 1657 1657 memcpy(req.name, res->lockname.name, res->lockname.len); 1658 1658 1659 + resend: 1659 1660 ret = o2net_send_message(DLM_MASTER_REQUERY_MSG, dlm->key, 1660 1661 &req, sizeof(req), nodenum, &status); 1661 - /* XXX: negative status not handled properly here. */ 1662 1662 if (ret < 0) 1663 1663 mlog(ML_ERROR, "Error %d when sending message %u (key " 1664 1664 "0x%x) to node %u\n", ret, DLM_MASTER_REQUERY_MSG, 1665 1665 dlm->key, nodenum); 1666 - else { 1666 + else if (status == -ENOMEM) { 1667 + mlog_errno(status); 1668 + msleep(50); 1669 + goto resend; 1670 + } else { 1667 1671 BUG_ON(status < 0); 1668 1672 BUG_ON(status > DLM_LOCK_RES_OWNER_UNKNOWN); 1669 1673 *real_master = (u8) (status & 0xff); ··· 1709 1705 int ret = dlm_dispatch_assert_master(dlm, res, 1710 1706 0, 0, flags); 1711 1707 if (ret < 0) { 1712 - mlog_errno(-ENOMEM); 1713 - /* retry!? */ 1714 - BUG(); 1708 + mlog_errno(ret); 1709 + spin_unlock(&res->spinlock); 1710 + dlm_lockres_put(res); 1711 + spin_unlock(&dlm->spinlock); 1712 + dlm_put(dlm); 1713 + /* sender will take care of this and retry */ 1714 + return ret; 1715 1715 } else 1716 1716 __dlm_lockres_grab_inflight_worker(dlm, res); 1717 1717 spin_unlock(&res->spinlock);
+31 -6
fs/ocfs2/dlmglue.c
···  861  861 	 * We set the OCFS2_LOCK_UPCONVERT_FINISHING flag before clearing
 862  862 	 * the OCFS2_LOCK_BUSY flag to prevent the dc thread from
 863  863 	 * downconverting the lock before the upconvert has fully completed.
 864    + 	 * Do not prevent the dc thread from downconverting if NONBLOCK lock
 865    + 	 * had already returned.
 864  866 	 */
 865    - 	lockres_or_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
 867    + 	if (!(lockres->l_flags & OCFS2_LOCK_NONBLOCK_FINISHED))
 868    + 		lockres_or_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
 869    + 	else
 870    + 		lockres_clear_flags(lockres, OCFS2_LOCK_NONBLOCK_FINISHED);
 866  871 
 867  872 	lockres_clear_flags(lockres, OCFS2_LOCK_BUSY);
 868  873 }
··· 1329 1324 
1330 1325 /* returns 0 if the mw that was removed was already satisfied, -EBUSY
1331 1326  * if the mask still hadn't reached its goal */
1332    - static int lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres,
1327    + static int __lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres,
1333 1328 				      struct ocfs2_mask_waiter *mw)
1334 1329 {
1335    - 	unsigned long flags;
1336 1330 	int ret = 0;
1337 1331 
1338    - 	spin_lock_irqsave(&lockres->l_lock, flags);
1332    + 	assert_spin_locked(&lockres->l_lock);
1339 1333 	if (!list_empty(&mw->mw_item)) {
1340 1334 		if ((lockres->l_flags & mw->mw_mask) != mw->mw_goal)
1341 1335 			ret = -EBUSY;
··· 1342 1338 		list_del_init(&mw->mw_item);
1343 1339 		init_completion(&mw->mw_complete);
1344 1340 	}
1341    + 
1342    + 	return ret;
1343    + }
1344    + 
1345    + static int lockres_remove_mask_waiter(struct ocfs2_lock_res *lockres,
1346    + 				      struct ocfs2_mask_waiter *mw)
1347    + {
1348    + 	unsigned long flags;
1349    + 	int ret = 0;
1350    + 
1351    + 	spin_lock_irqsave(&lockres->l_lock, flags);
1352    + 	ret = __lockres_remove_mask_waiter(lockres, mw);
1345 1353 	spin_unlock_irqrestore(&lockres->l_lock, flags);
1346 1354 
1347 1355 	return ret;
··· 1389 1373 	unsigned long flags;
1390 1374 	unsigned int gen;
1391 1375 	int noqueue_attempted = 0;
1376    + 	int dlm_locked = 0;
1392 1377 
1393 1378 	ocfs2_init_mask_waiter(&mw);
1394 1379 
··· 1498 1481 		ocfs2_recover_from_dlm_error(lockres, 1);
1499 1482 		goto out;
1500 1483 	}
1484    + 	dlm_locked = 1;
1501 1485 
1502 1486 	mlog(0, "lock %s, successful return from ocfs2_dlm_lock\n",
1503 1487 	     lockres->l_name);
··· 1532 1514 	if (wait && arg_flags & OCFS2_LOCK_NONBLOCK &&
1533 1515 	    mw.mw_mask & (OCFS2_LOCK_BUSY|OCFS2_LOCK_BLOCKED)) {
1534 1516 		wait = 0;
1535    - 		if (lockres_remove_mask_waiter(lockres, &mw))
1517    + 		spin_lock_irqsave(&lockres->l_lock, flags);
1518    + 		if (__lockres_remove_mask_waiter(lockres, &mw)) {
1519    + 			if (dlm_locked)
1520    + 				lockres_or_flags(lockres,
1521    + 						OCFS2_LOCK_NONBLOCK_FINISHED);
1522    + 			spin_unlock_irqrestore(&lockres->l_lock, flags);
1536 1523 			ret = -EAGAIN;
1537    - 		else
1524    + 		} else {
1525    + 			spin_unlock_irqrestore(&lockres->l_lock, flags);
1538 1526 			goto again;
1527    + 		}
1539 1528 	}
1540 1529 	if (wait) {
1541 1530 		ret = ocfs2_wait_for_mask(&mw);
+1 -3
fs/ocfs2/file.c
··· 2381 2381 if (ret < 0) 2382 2382 written = ret; 2383 2383 2384 - if (!ret && ((old_size != i_size_read(inode)) || 2385 - (old_clusters != OCFS2_I(inode)->ip_clusters) || 2386 - has_refcount)) { 2384 + if (!ret) { 2387 2385 ret = jbd2_journal_force_commit(osb->journal->j_journal); 2388 2386 if (ret < 0) 2389 2387 written = ret;
+1 -2
fs/ocfs2/inode.c
··· 540 540 if (status < 0) 541 541 make_bad_inode(inode); 542 542 543 - if (args && bh) 544 - brelse(bh); 543 + brelse(bh); 545 544 546 545 return status; 547 546 }
-3
fs/ocfs2/move_extents.c
··· 904 904 struct buffer_head *di_bh = NULL; 905 905 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); 906 906 907 - if (!inode) 908 - return -ENOENT; 909 - 910 907 if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb)) 911 908 return -EROFS; 912 909
+6
fs/ocfs2/ocfs2.h
··· 144 144 * before the upconvert 145 145 * has completed */ 146 146 147 + #define OCFS2_LOCK_NONBLOCK_FINISHED (0x00001000) /* NONBLOCK cluster 148 + * lock has already 149 + * returned, do not block 150 + * dc thread from 151 + * downconverting */ 152 + 147 153 struct ocfs2_lock_res_ops; 148 154 149 155 typedef void (*ocfs2_lock_callback)(int status, unsigned long data);
+1 -1
fs/ocfs2/slot_map.c
··· 306 306 assert_spin_locked(&osb->osb_lock); 307 307 308 308 BUG_ON(slot_num < 0); 309 - BUG_ON(slot_num > osb->max_slots); 309 + BUG_ON(slot_num >= osb->max_slots); 310 310 311 311 if (!si->si_slots[slot_num].sl_valid) 312 312 return -ENOENT;
+2 -1
fs/ocfs2/super.c
··· 1629 1629 1630 1630 ocfs2_debugfs_root = debugfs_create_dir("ocfs2", NULL); 1631 1631 if (!ocfs2_debugfs_root) { 1632 - status = -EFAULT; 1632 + status = -ENOMEM; 1633 1633 mlog(ML_ERROR, "Unable to create ocfs2 debugfs root.\n"); 1634 + goto out4; 1634 1635 } 1635 1636 1636 1637 ocfs2_set_locking_protocol();
+1 -1
fs/ocfs2/xattr.c
··· 1284 1284 return -EOPNOTSUPP; 1285 1285 1286 1286 if (!(oi->ip_dyn_features & OCFS2_HAS_XATTR_FL)) 1287 - ret = -ENODATA; 1287 + return -ENODATA; 1288 1288 1289 1289 xis.inode_bh = xbs.inode_bh = di_bh; 1290 1290 di = (struct ocfs2_dinode *)di_bh->b_data;
+22 -25
fs/proc/array.c
···  157  157 	struct user_namespace *user_ns = seq_user_ns(m);
 158  158 	struct group_info *group_info;
 159  159 	int g;
 160    - 	struct fdtable *fdt = NULL;
 160    + 	struct task_struct *tracer;
 161  161 	const struct cred *cred;
 162    - 	pid_t ppid, tpid;
 162    + 	pid_t ppid, tpid = 0, tgid, ngid;
 163    + 	unsigned int max_fds = 0;
 163  164 
 164  165 	rcu_read_lock();
 165  166 	ppid = pid_alive(p) ?
 166  167 		task_tgid_nr_ns(rcu_dereference(p->real_parent), ns) : 0;
 167    - 	tpid = 0;
 168    - 	if (pid_alive(p)) {
 169    - 		struct task_struct *tracer = ptrace_parent(p);
 170    - 		if (tracer)
 171    - 			tpid = task_pid_nr_ns(tracer, ns);
 172    - 	}
 168    + 
 169    + 	tracer = ptrace_parent(p);
 170    + 	if (tracer)
 171    + 		tpid = task_pid_nr_ns(tracer, ns);
 172    + 
 173    + 	tgid = task_tgid_nr_ns(p, ns);
 174    + 	ngid = task_numa_group_id(p);
 173  175 	cred = get_task_cred(p);
 176    + 
 177    + 	task_lock(p);
 178    + 	if (p->files)
 179    + 		max_fds = files_fdtable(p->files)->max_fds;
 180    + 	task_unlock(p);
 181    + 	rcu_read_unlock();
 182    + 
 174  183 	seq_printf(m,
 175  184 		"State:\t%s\n"
 176  185 		"Tgid:\t%d\n"
···  188  179 		"PPid:\t%d\n"
 189  180 		"TracerPid:\t%d\n"
 190  181 		"Uid:\t%d\t%d\t%d\t%d\n"
 191    - 		"Gid:\t%d\t%d\t%d\t%d\n",
 182    + 		"Gid:\t%d\t%d\t%d\t%d\n"
 183    + 		"FDSize:\t%d\nGroups:\t",
 192  184 		get_task_state(p),
 193    - 		task_tgid_nr_ns(p, ns),
 194    - 		task_numa_group_id(p),
 195    - 		pid_nr_ns(pid, ns),
 196    - 		ppid, tpid,
 185    + 		tgid, ngid, pid_nr_ns(pid, ns), ppid, tpid,
 197  186 		from_kuid_munged(user_ns, cred->uid),
 198  187 		from_kuid_munged(user_ns, cred->euid),
 199  188 		from_kuid_munged(user_ns, cred->suid),
···  199  192 		from_kgid_munged(user_ns, cred->gid),
 200  193 		from_kgid_munged(user_ns, cred->egid),
 201  194 		from_kgid_munged(user_ns, cred->sgid),
 202    - 		from_kgid_munged(user_ns, cred->fsgid));
 203    - 
 204    - 	task_lock(p);
 205    - 	if (p->files)
 206    - 		fdt = files_fdtable(p->files);
 207    - 	seq_printf(m,
 208    - 		"FDSize:\t%d\n"
 209    - 		"Groups:\t",
 210    - 		fdt ? fdt->max_fds : 0);
 211    - 	rcu_read_unlock();
 195    + 		from_kgid_munged(user_ns, cred->fsgid),
 196    + 		max_fds);
 212  197 
 213  198 	group_info = cred->group_info;
 214    - 	task_unlock(p);
 215    - 
 216  199 	for (g = 0; g < group_info->ngroups; g++)
 217  200 		seq_printf(m, "%d ",
 218  201 			from_kgid_munged(user_ns, GROUP_AT(group_info, g)));
+3
fs/proc/base.c
··· 2618 2618 dput(dentry); 2619 2619 } 2620 2620 2621 + if (pid == tgid) 2622 + return; 2623 + 2621 2624 name.name = buf; 2622 2625 name.len = snprintf(buf, sizeof(buf), "%d", tgid); 2623 2626 leader = d_hash_and_lookup(mnt->mnt_root, &name);
+104 -59
fs/proc/generic.c
···   31   31 
  32   32 static int proc_match(unsigned int len, const char *name, struct proc_dir_entry *de)
  33   33 {
  34    - 	if (de->namelen != len)
  35    - 		return 0;
  36    - 	return !memcmp(name, de->name, len);
  34    + 	if (len < de->namelen)
  35    + 		return -1;
  36    + 	if (len > de->namelen)
  37    + 		return 1;
  38    + 
  39    + 	return memcmp(name, de->name, len);
  40    + }
  41    + 
  42    + static struct proc_dir_entry *pde_subdir_first(struct proc_dir_entry *dir)
  43    + {
  44    + 	return rb_entry_safe(rb_first(&dir->subdir), struct proc_dir_entry,
  45    + 			     subdir_node);
  46    + }
  47    + 
  48    + static struct proc_dir_entry *pde_subdir_next(struct proc_dir_entry *dir)
  49    + {
  50    + 	return rb_entry_safe(rb_next(&dir->subdir_node), struct proc_dir_entry,
  51    + 			     subdir_node);
  52    + }
  53    + 
  54    + static struct proc_dir_entry *pde_subdir_find(struct proc_dir_entry *dir,
  55    + 					      const char *name,
  56    + 					      unsigned int len)
  57    + {
  58    + 	struct rb_node *node = dir->subdir.rb_node;
  59    + 
  60    + 	while (node) {
  61    + 		struct proc_dir_entry *de = container_of(node,
  62    + 							 struct proc_dir_entry,
  63    + 							 subdir_node);
  64    + 		int result = proc_match(len, name, de);
  65    + 
  66    + 		if (result < 0)
  67    + 			node = node->rb_left;
  68    + 		else if (result > 0)
  69    + 			node = node->rb_right;
  70    + 		else
  71    + 			return de;
  72    + 	}
  73    + 	return NULL;
  74    + }
  75    + 
  76    + static bool pde_subdir_insert(struct proc_dir_entry *dir,
  77    + 			      struct proc_dir_entry *de)
  78    + {
  79    + 	struct rb_root *root = &dir->subdir;
  80    + 	struct rb_node **new = &root->rb_node, *parent = NULL;
  81    + 
  82    + 	/* Figure out where to put new node */
  83    + 	while (*new) {
  84    + 		struct proc_dir_entry *this =
  85    + 			container_of(*new, struct proc_dir_entry, subdir_node);
  86    + 		int result = proc_match(de->namelen, de->name, this);
  87    + 
  88    + 		parent = *new;
  89    + 		if (result < 0)
  90    + 			new = &(*new)->rb_left;
  91    + 		else if (result > 0)
  92    + 			new = &(*new)->rb_right;
  93    + 		else
  94    + 			return false;
  95    + 	}
  96    + 
  97    + 	/* Add new node and rebalance tree. */
  98    + 	rb_link_node(&de->subdir_node, parent, new);
  99    + 	rb_insert_color(&de->subdir_node, root);
 100    + 	return true;
  37  101 }
  38  102 
  39  103 static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
···  156   92 			break;
 157   93 
 158   94 		len = next - cp;
 159    - 		for (de = de->subdir; de ; de = de->next) {
 160    - 			if (proc_match(len, cp, de))
 161    - 				break;
 162    - 		}
  95    + 		de = pde_subdir_find(de, cp, len);
 163   96 		if (!de) {
 164   97 			WARN(1, "name '%s'\n", name);
 165   98 			return -ENOENT;
···  244  183 	struct inode *inode;
 245  184 
 246  185 	spin_lock(&proc_subdir_lock);
 247    - 	for (de = de->subdir; de ; de = de->next) {
 248    - 		if (de->namelen != dentry->d_name.len)
 249    - 			continue;
 250    - 		if (!memcmp(dentry->d_name.name, de->name, de->namelen)) {
 251    - 			pde_get(de);
 252    - 			spin_unlock(&proc_subdir_lock);
 253    - 			inode = proc_get_inode(dir->i_sb, de);
 254    - 			if (!inode)
 255    - 				return ERR_PTR(-ENOMEM);
 256    - 			d_set_d_op(dentry, &simple_dentry_operations);
 257    - 			d_add(dentry, inode);
 258    - 			return NULL;
 259    - 		}
 186    + 	de = pde_subdir_find(de, dentry->d_name.name, dentry->d_name.len);
 187    + 	if (de) {
 188    + 		pde_get(de);
 189    + 		spin_unlock(&proc_subdir_lock);
 190    + 		inode = proc_get_inode(dir->i_sb, de);
 191    + 		if (!inode)
 192    + 			return ERR_PTR(-ENOMEM);
 193    + 		d_set_d_op(dentry, &simple_dentry_operations);
 194    + 		d_add(dentry, inode);
 195    + 		return NULL;
 196    + 	}
 260  197 	spin_unlock(&proc_subdir_lock);
 261  198 	return ERR_PTR(-ENOENT);
···  283  225 		return 0;
 284  226 
 285  227 	spin_lock(&proc_subdir_lock);
 286    - 	de = de->subdir;
 228    + 	de = pde_subdir_first(de);
 287  229 	i = ctx->pos - 2;
 288  230 	for (;;) {
 289  231 		if (!de) {
···  292  234 		}
 293  235 		if (!i)
 294  236 			break;
 295    - 		de = de->next;
 237    + 		de = pde_subdir_next(de);
 296  238 		i--;
 297  239 	}
 298  240 
···  307  249 		}
 308  250 		spin_lock(&proc_subdir_lock);
 309  251 		ctx->pos++;
 310    - 		next = de->next;
 252    + 		next = pde_subdir_next(de);
 311  253 		pde_put(de);
 312  254 		de = next;
 313  255 	} while (de);
···  344  286 
 345  287 static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp)
 346  288 {
 347    - 	struct proc_dir_entry *tmp;
 348  289 	int ret;
 349    - 
 290    + 
 350  291 	ret = proc_alloc_inum(&dp->low_ino);
 351  292 	if (ret)
 352  293 		return ret;
···  361  304 		dp->proc_iops = &proc_file_inode_operations;
 362  305 	} else {
 363  306 		WARN_ON(1);
 307    + 		proc_free_inum(dp->low_ino);
 364  308 		return -EINVAL;
 365  309 	}
 366  310 
 367  311 	spin_lock(&proc_subdir_lock);
 368    - 
 369    - 	for (tmp = dir->subdir; tmp; tmp = tmp->next)
 370    - 		if (strcmp(tmp->name, dp->name) == 0) {
 371    - 			WARN(1, "proc_dir_entry '%s/%s' already registered\n",
 372    - 				dir->name, dp->name);
 373    - 			break;
 374    - 		}
 375    - 
 376    - 	dp->next = dir->subdir;
 377  312 	dp->parent = dir;
 378    - 	dir->subdir = dp;
 313    + 	if (pde_subdir_insert(dir, dp) == false) {
 314    + 		WARN(1, "proc_dir_entry '%s/%s' already registered\n",
 315    + 		     dir->name, dp->name);
 316    + 		spin_unlock(&proc_subdir_lock);
 317    + 		if (S_ISDIR(dp->mode))
 318    + 			dir->nlink--;
 319    + 		proc_free_inum(dp->low_ino);
 320    + 		return -EEXIST;
 321    + 	}
 379  322 	spin_unlock(&proc_subdir_lock);
 380  323 
 381  324 	return 0;
···  411  354 	ent->namelen = qstr.len;
 412  355 	ent->mode = mode;
 413  356 	ent->nlink = nlink;
 357    + 	ent->subdir = RB_ROOT;
 414  358 	atomic_set(&ent->count, 1);
 415  359 	spin_lock_init(&ent->pde_unload_lock);
 416  360 	INIT_LIST_HEAD(&ent->pde_openers);
···  543  485  */
 544  486 void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
 545  487 {
 546    - 	struct proc_dir_entry **p;
 547  488 	struct proc_dir_entry *de = NULL;
 548  489 	const char *fn = name;
 549  490 	unsigned int len;
···  554  497 	}
 555  498 	len = strlen(fn);
 556  499 
 557    - 	for (p = &parent->subdir; *p; p=&(*p)->next ) {
 558    - 		if (proc_match(len, fn, *p)) {
 559    - 			de = *p;
 560    - 			*p = de->next;
 561    - 			de->next = NULL;
 562    - 			break;
 563    - 		}
 564    - 	}
 500    + 	de = pde_subdir_find(parent, fn, len);
 501    + 	if (de)
 502    + 		rb_erase(&de->subdir_node, &parent->subdir);
 565  503 	spin_unlock(&proc_subdir_lock);
 566  504 	if (!de) {
 567  505 		WARN(1, "name '%s'\n", name);
···  568  516 	if (S_ISDIR(de->mode))
 569  517 		parent->nlink--;
 570  518 	de->nlink = 0;
 571    - 	WARN(de->subdir, "%s: removing non-empty directory "
 572    - 			 "'%s/%s', leaking at least '%s'\n", __func__,
 573    - 			 de->parent->name, de->name, de->subdir->name);
 519    + 	WARN(pde_subdir_first(de),
 520    + 	     "%s: removing non-empty directory '%s/%s', leaking at least '%s'\n",
 521    + 	     __func__, de->parent->name, de->name, pde_subdir_first(de)->name);
 574  522 	pde_put(de);
 575  523 }
 576  524 EXPORT_SYMBOL(remove_proc_entry);
 577  525 
 578  526 int remove_proc_subtree(const char *name, struct proc_dir_entry *parent)
 579  527 {
 580    - 	struct proc_dir_entry **p;
 581  528 	struct proc_dir_entry *root = NULL, *de, *next;
 582  529 	const char *fn = name;
 583  530 	unsigned int len;
···  588  537 	}
 589  538 	len = strlen(fn);
 590  539 
 591    - 	for (p = &parent->subdir; *p; p=&(*p)->next ) {
 592    - 		if (proc_match(len, fn, *p)) {
 593    - 			root = *p;
 594    - 			*p = root->next;
 595    - 			root->next = NULL;
 596    - 			break;
 597    - 		}
 598    - 	}
 540    + 	root = pde_subdir_find(parent, fn, len);
 599  541 	if (!root) {
 600  542 		spin_unlock(&proc_subdir_lock);
 601  543 		return -ENOENT;
 602  544 	}
 545    + 	rb_erase(&root->subdir_node, &parent->subdir);
 546    + 
 603  547 	de = root;
 604  548 	while (1) {
 605  549 		next = pde_subdir_first(de);
 606  550 		if (next) {
 607    - 			de->subdir = next->next;
 608    - 			next->next = NULL;
 551    + 			rb_erase(&next->subdir_node, &de->subdir);
 609  552 			de = next;
 610  553 			continue;
 611  554 		}
+6 -5
fs/proc/internal.h
··· 24 24 * tree) of these proc_dir_entries, so that we can dynamically 25 25 * add new files to /proc. 26 26 * 27 - * The "next" pointer creates a linked list of one /proc directory, 28 - * while parent/subdir create the directory structure (every 29 - * /proc file has a parent, but "subdir" is NULL for all 30 - * non-directory entries). 27 + * parent/subdir are used for the directory structure (every /proc file has a 28 + * parent, but "subdir" is empty for all non-directory entries). 29 + * subdir_node is used to build the rb tree "subdir" of the parent. 31 30 */ 32 31 struct proc_dir_entry { 33 32 unsigned int low_ino; ··· 37 38 loff_t size; 38 39 const struct inode_operations *proc_iops; 39 40 const struct file_operations *proc_fops; 40 - struct proc_dir_entry *next, *parent, *subdir; 41 + struct proc_dir_entry *parent; 42 + struct rb_root subdir; 43 + struct rb_node subdir_node; 41 44 void *data; 42 45 atomic_t count; /* use count */ 43 46 atomic_t in_use; /* number of callers into module in progress; */
+1
fs/proc/proc_net.c
··· 192 192 if (!netd) 193 193 goto out; 194 194 195 + netd->subdir = RB_ROOT; 195 196 netd->data = net; 196 197 netd->nlink = 2; 197 198 netd->namelen = 3;
+1
fs/proc/root.c
··· 251 251 .proc_iops = &proc_root_inode_operations, 252 252 .proc_fops = &proc_root_operations, 253 253 .parent = &proc_root, 254 + .subdir = RB_ROOT, 254 255 .name = "/proc", 255 256 }; 256 257
+68 -36
fs/proc/task_mmu.c
···  447  447 	u64 pss;
 448  448 };
 449  449 
 450    + static void smaps_account(struct mem_size_stats *mss, struct page *page,
 451    + 		unsigned long size, bool young, bool dirty)
 452    + {
 453    + 	int mapcount;
 450  454 
 451    - static void smaps_pte_entry(pte_t ptent, unsigned long addr,
 452    - 		unsigned long ptent_size, struct mm_walk *walk)
 455    + 	if (PageAnon(page))
 456    + 		mss->anonymous += size;
 457    + 
 458    + 	mss->resident += size;
 459    + 	/* Accumulate the size in pages that have been accessed. */
 460    + 	if (young || PageReferenced(page))
 461    + 		mss->referenced += size;
 462    + 	mapcount = page_mapcount(page);
 463    + 	if (mapcount >= 2) {
 464    + 		u64 pss_delta;
 465    + 
 466    + 		if (dirty || PageDirty(page))
 467    + 			mss->shared_dirty += size;
 468    + 		else
 469    + 			mss->shared_clean += size;
 470    + 		pss_delta = (u64)size << PSS_SHIFT;
 471    + 		do_div(pss_delta, mapcount);
 472    + 		mss->pss += pss_delta;
 473    + 	} else {
 474    + 		if (dirty || PageDirty(page))
 475    + 			mss->private_dirty += size;
 476    + 		else
 477    + 			mss->private_clean += size;
 478    + 		mss->pss += (u64)size << PSS_SHIFT;
 479    + 	}
 480    + }
 481    + 
 482    + static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 483    + 		struct mm_walk *walk)
 453  484 {
 454  485 	struct mem_size_stats *mss = walk->private;
 455  486 	struct vm_area_struct *vma = mss->vma;
 456  487 	pgoff_t pgoff = linear_page_index(vma, addr);
 457  488 	struct page *page = NULL;
 458    - 	int mapcount;
 459  489 
 460    - 	if (pte_present(ptent)) {
 461    - 		page = vm_normal_page(vma, addr, ptent);
 462    - 	} else if (is_swap_pte(ptent)) {
 463    - 		swp_entry_t swpent = pte_to_swp_entry(ptent);
 490    + 	if (pte_present(*pte)) {
 491    + 		page = vm_normal_page(vma, addr, *pte);
 492    + 	} else if (is_swap_pte(*pte)) {
 493    + 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 464  494 
 465  495 		if (!non_swap_entry(swpent))
 466    - 			mss->swap += ptent_size;
 496    + 			mss->swap += PAGE_SIZE;
 467  497 		else if (is_migration_entry(swpent))
 468  498 			page = migration_entry_to_page(swpent);
 469    - 	} else if (pte_file(ptent)) {
 470    - 		if (pte_to_pgoff(ptent) != pgoff)
 471    - 			mss->nonlinear += ptent_size;
 499    + 	} else if (pte_file(*pte)) {
 500    + 		if (pte_to_pgoff(*pte) != pgoff)
 501    + 			mss->nonlinear += PAGE_SIZE;
 472  502 	}
 473  503 
 474  504 	if (!page)
 475  505 		return;
 476  506 
 477    - 	if (PageAnon(page))
 478    - 		mss->anonymous += ptent_size;
 479    - 
 480  507 	if (page->index != pgoff)
 481    - 		mss->nonlinear += ptent_size;
 508    + 		mss->nonlinear += PAGE_SIZE;
 482  509 
 483    - 	mss->resident += ptent_size;
 484    - 	/* Accumulate the size in pages that have been accessed. */
 485    - 	if (pte_young(ptent) || PageReferenced(page))
 486    - 		mss->referenced += ptent_size;
 487    - 	mapcount = page_mapcount(page);
 488    - 	if (mapcount >= 2) {
 489    - 		if (pte_dirty(ptent) || PageDirty(page))
 490    - 			mss->shared_dirty += ptent_size;
 491    - 		else
 492    - 			mss->shared_clean += ptent_size;
 493    - 		mss->pss += (ptent_size << PSS_SHIFT) / mapcount;
 494    - 	} else {
 495    - 		if (pte_dirty(ptent) || PageDirty(page))
 496    - 			mss->private_dirty += ptent_size;
 497    - 		else
 498    - 			mss->private_clean += ptent_size;
 499    - 		mss->pss += (ptent_size << PSS_SHIFT);
 500    - 	}
 510    + 	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
 501  511 }
 512    + 
 513    + #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 514    + static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 515    + 		struct mm_walk *walk)
 516    + {
 517    + 	struct mem_size_stats *mss = walk->private;
 518    + 	struct vm_area_struct *vma = mss->vma;
 519    + 	struct page *page;
 520    + 
 521    + 	/* FOLL_DUMP will return -EFAULT on huge zero page */
 522    + 	page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
 523    + 	if (IS_ERR_OR_NULL(page))
 524    + 		return;
 525    + 	mss->anonymous_thp += HPAGE_PMD_SIZE;
 526    + 	smaps_account(mss, page, HPAGE_PMD_SIZE,
 527    + 			pmd_young(*pmd), pmd_dirty(*pmd));
 528    + }
 529    + #else
 530    + static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 531    + 		struct mm_walk *walk)
 532    + {
 533    + }
 534    + #endif
 502  535 
 503  536 static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 504  537 		struct mm_walk *walk)
···  542  509 	spinlock_t *ptl;
 543  510 
 544  511 	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 545    - 		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
 512    + 		smaps_pmd_entry(pmd, addr, walk);
 546  513 		spin_unlock(ptl);
 547    - 		mss->anonymous_thp += HPAGE_PMD_SIZE;
 548  514 		return 0;
 549  515 	}
 550  516 
···  556  524 	 */
 557  525 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 558  526 	for (; addr != end; pte++, addr += PAGE_SIZE)
 559    - 		smaps_pte_entry(*pte, addr, PAGE_SIZE, walk);
 527    + 		smaps_pte_entry(pte, addr, walk);
 560  528 	pte_unmap_unlock(pte - 1, ptl);
 561  529 	cond_resched();
 562  530 	return 0;
+26
include/linux/cgroup.h
··· 113 113 } 114 114 115 115 /** 116 + * css_get_many - obtain references on the specified css 117 + * @css: target css 118 + * @n: number of references to get 119 + * 120 + * The caller must already have a reference. 121 + */ 122 + static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n) 123 + { 124 + if (!(css->flags & CSS_NO_REF)) 125 + percpu_ref_get_many(&css->refcnt, n); 126 + } 127 + 128 + /** 116 129 * css_tryget - try to obtain a reference on the specified css 117 130 * @css: target css 118 131 * ··· 170 157 { 171 158 if (!(css->flags & CSS_NO_REF)) 172 159 percpu_ref_put(&css->refcnt); 160 + } 161 + 162 + /** 163 + * css_put_many - put css references 164 + * @css: target css 165 + * @n: number of references to put 166 + * 167 + * Put references obtained via css_get() and css_tryget_online(). 168 + */ 169 + static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n) 170 + { 171 + if (!(css->flags & CSS_NO_REF)) 172 + percpu_ref_put_many(&css->refcnt, n); 173 173 } 174 174 175 175 /* bits in struct cgroup flags field */
+6 -4
include/linux/compaction.h
··· 33 33 extern unsigned long try_to_compact_pages(struct zonelist *zonelist, 34 34 int order, gfp_t gfp_mask, nodemask_t *mask, 35 35 enum migrate_mode mode, int *contended, 36 - struct zone **candidate_zone); 36 + int alloc_flags, int classzone_idx); 37 37 extern void compact_pgdat(pg_data_t *pgdat, int order); 38 38 extern void reset_isolation_suitable(pg_data_t *pgdat); 39 - extern unsigned long compaction_suitable(struct zone *zone, int order); 39 + extern unsigned long compaction_suitable(struct zone *zone, int order, 40 + int alloc_flags, int classzone_idx); 40 41 41 42 /* Do not skip compaction more than 64 times */ 42 43 #define COMPACT_MAX_DEFER_SHIFT 6 ··· 104 103 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist, 105 104 int order, gfp_t gfp_mask, nodemask_t *nodemask, 106 105 enum migrate_mode mode, int *contended, 107 - struct zone **candidate_zone) 106 + int alloc_flags, int classzone_idx) 108 107 { 109 108 return COMPACT_CONTINUE; 110 109 } ··· 117 116 { 118 117 } 119 118 120 - static inline unsigned long compaction_suitable(struct zone *zone, int order) 119 + static inline unsigned long compaction_suitable(struct zone *zone, int order, 120 + int alloc_flags, int classzone_idx) 121 121 { 122 122 return COMPACT_SKIPPED; 123 123 }
-1
include/linux/file.h
··· 66 66 extern bool get_close_on_exec(unsigned int fd); 67 67 extern void put_filp(struct file *); 68 68 extern int get_unused_fd_flags(unsigned flags); 69 - #define get_unused_fd() get_unused_fd_flags(0) 70 69 extern void put_unused_fd(unsigned int fd); 71 70 72 71 extern void fd_install(unsigned int fd, struct file *file);
+2 -2
include/linux/gfp.h
··· 381 381 382 382 void page_alloc_init(void); 383 383 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); 384 - void drain_all_pages(void); 385 - void drain_local_pages(void *dummy); 384 + void drain_all_pages(struct zone *zone); 385 + void drain_local_pages(struct zone *zone); 386 386 387 387 /* 388 388 * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
+2 -1
include/linux/hugetlb.h
··· 311 311 { 312 312 if (!page_size_log) 313 313 return &default_hstate; 314 - return size_to_hstate(1 << page_size_log); 314 + 315 + return size_to_hstate(1UL << page_size_log); 315 316 } 316 317 317 318 static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
-1
include/linux/hugetlb_cgroup.h
··· 16 16 #define _LINUX_HUGETLB_CGROUP_H 17 17 18 18 #include <linux/mmdebug.h> 19 - #include <linux/res_counter.h> 20 19 21 20 struct hugetlb_cgroup; 22 21 /*
+13
include/linux/kern_levels.h
··· 22 22 */ 23 23 #define KERN_CONT "" 24 24 25 + /* integer equivalents of KERN_<LEVEL> */ 26 + #define LOGLEVEL_SCHED -2 /* Deferred messages from sched code 27 + * are set to this special level */ 28 + #define LOGLEVEL_DEFAULT -1 /* default (or last) loglevel */ 29 + #define LOGLEVEL_EMERG 0 /* system is unusable */ 30 + #define LOGLEVEL_ALERT 1 /* action must be taken immediately */ 31 + #define LOGLEVEL_CRIT 2 /* critical conditions */ 32 + #define LOGLEVEL_ERR 3 /* error conditions */ 33 + #define LOGLEVEL_WARNING 4 /* warning conditions */ 34 + #define LOGLEVEL_NOTICE 5 /* normal but significant condition */ 35 + #define LOGLEVEL_INFO 6 /* informational */ 36 + #define LOGLEVEL_DEBUG 7 /* debug-level messages */ 37 + 25 38 #endif
+1
include/linux/kernel.h
··· 427 427 extern int panic_on_oops; 428 428 extern int panic_on_unrecovered_nmi; 429 429 extern int panic_on_io_nmi; 430 + extern int panic_on_warn; 430 431 extern int sysctl_panic_on_stackoverflow; 431 432 /* 432 433 * Only to be used by arch init code. If the user over-wrote the default
+14 -36
include/linux/memcontrol.h
···   25   25 #include <linux/jump_label.h>
  26   26 
  27   27 struct mem_cgroup;
  28    - struct page_cgroup;
  29   28 struct page;
  30   29 struct mm_struct;
  31   30 struct kmem_cache;
···   67   68 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
  68   69 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
  69   70 
  70    - bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
  71    - 				  struct mem_cgroup *memcg);
  72    - bool task_in_mem_cgroup(struct task_struct *task,
  73    - 			const struct mem_cgroup *memcg);
  71    + bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
  72    + 			      struct mem_cgroup *root);
  73    + bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
  74   74 
  75   75 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
  76   76 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
···   77   79 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
  78   80 extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css);
  79   81 
  80    - static inline
  81    - bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)
  82    + static inline bool mm_match_cgroup(struct mm_struct *mm,
  83    + 				   struct mem_cgroup *memcg)
  82   84 {
  83   85 	struct mem_cgroup *task_memcg;
  84    - 	bool match;
  86    + 	bool match = false;
  85   87 
  86   88 	rcu_read_lock();
  87   89 	task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
  88    - 	match = __mem_cgroup_same_or_subtree(memcg, task_memcg);
  90    + 	if (task_memcg)
  91    + 		match = mem_cgroup_is_descendant(task_memcg, memcg);
  89   92 	rcu_read_unlock();
  90   93 	return match;
  91   94 }
···  140  141 
 141  142 struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, bool *locked,
 142  143 					      unsigned long *flags);
 143    - void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool locked,
 144    - 			      unsigned long flags);
 144    + void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool *locked,
 145    + 			      unsigned long *flags);
 145  146 void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
 146  147 				 enum mem_cgroup_stat_index idx, int val);
 147  148 
···  173  174 void mem_cgroup_split_huge_fixup(struct page *head);
 174  175 #endif
 175  176 
 176    - #ifdef CONFIG_DEBUG_VM
 177    - bool mem_cgroup_bad_page_check(struct page *page);
 178    - void mem_cgroup_print_bad_page(struct page *page);
 179    - #endif
 180  177 #else /* CONFIG_MEMCG */
 181  178 struct mem_cgroup;
 182  179 
···  292  297 }
 293  298 
 294  299 static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg,
 295    - 					bool locked, unsigned long flags)
 300    + 					bool *locked, unsigned long *flags)
 296  301 {
 297  302 }
···  341  346 {
 342  347 }
 343  348 #endif /* CONFIG_MEMCG */
 344    - 
 345    - #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
 346    - static inline bool
 347    - mem_cgroup_bad_page_check(struct page *page)
 348    - {
 349    - 	return false;
 350    - }
 351    - 
 352    - static inline void
 353    - mem_cgroup_print_bad_page(struct page *page)
 354    - {
 355    - }
 356    - #endif
 357  349 
 358  350 enum {
 359  351 	UNDER_LIMIT,
···  429  447 	/*
 430  448 	 * __GFP_NOFAIL allocations will move on even if charging is not
 431  449 	 * possible. Therefore we don't even try, and have this allocation
 432    - 	 * unaccounted. We could in theory charge it with
 433    - 	 * res_counter_charge_nofail, but we hope those allocations are rare,
 434    - 	 * and won't be worth the trouble.
 450    + 	 * unaccounted. We could in theory charge it forcibly, but we hope
 451    + 	 * those allocations are rare, and won't be worth the trouble.
 435  452 	 */
 436  453 	if (gfp & __GFP_NOFAIL)
 437  454 		return true;
···  448  467  * memcg_kmem_uncharge_pages: uncharge pages from memcg
 449  468  * @page: pointer to struct page being freed
 450  469  * @order: allocation order.
 451    -  *
 452    -  * there is no need to specify memcg here, since it is embedded in page_cgroup
 453  470  */
 454  471 static inline void
 455  472 memcg_kmem_uncharge_pages(struct page *page, int order)
···  464  485  *
 465  486  * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
 466  487  * failure of the allocation. if @page is NULL, this function will revert the
 467    -  * charges. Otherwise, it will commit the memcg given by @memcg to the
 468    -  * corresponding page_cgroup.
 488    +  * charges. Otherwise, it will commit @page to @memcg.
 469  489  */
 470  490 static inline void
 471  491 memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
+5
include/linux/mm_types.h
··· 22 22 #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) 23 23 24 24 struct address_space; 25 + struct mem_cgroup; 25 26 26 27 #define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS) 27 28 #define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \ ··· 167 166 struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */ 168 167 struct page *first_page; /* Compound tail pages */ 169 168 }; 169 + 170 + #ifdef CONFIG_MEMCG 171 + struct mem_cgroup *mem_cgroup; 172 + #endif 170 173 171 174 /* 172 175 * On machines where all RAM is mapped into kernel address space,
-12
include/linux/mmzone.h
··· 722 722 int nr_zones; 723 723 #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ 724 724 struct page *node_mem_map; 725 - #ifdef CONFIG_MEMCG 726 - struct page_cgroup *node_page_cgroup; 727 - #endif 728 725 #endif 729 726 #ifndef CONFIG_NO_BOOTMEM 730 727 struct bootmem_data *bdata; ··· 1075 1078 #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK) 1076 1079 1077 1080 struct page; 1078 - struct page_cgroup; 1079 1081 struct mem_section { 1080 1082 /* 1081 1083 * This is, logically, a pointer to an array of struct ··· 1092 1096 1093 1097 /* See declaration of similar field in struct zone */ 1094 1098 unsigned long *pageblock_flags; 1095 - #ifdef CONFIG_MEMCG 1096 - /* 1097 - * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use 1098 - * section. (see memcontrol.h/page_cgroup.h about this.) 1099 - */ 1100 - struct page_cgroup *page_cgroup; 1101 - unsigned long pad; 1102 - #endif 1103 1099 /* 1104 1100 * WARNING: mem_section must be a power-of-2 in size for the 1105 1101 * calculation and use of SECTION_ROOT_MASK to make sense.
-105
include/linux/page_cgroup.h
··· 1 - #ifndef __LINUX_PAGE_CGROUP_H 2 - #define __LINUX_PAGE_CGROUP_H 3 - 4 - enum { 5 - /* flags for mem_cgroup */ 6 - PCG_USED = 0x01, /* This page is charged to a memcg */ 7 - PCG_MEM = 0x02, /* This page holds a memory charge */ 8 - PCG_MEMSW = 0x04, /* This page holds a memory+swap charge */ 9 - }; 10 - 11 - struct pglist_data; 12 - 13 - #ifdef CONFIG_MEMCG 14 - struct mem_cgroup; 15 - 16 - /* 17 - * Page Cgroup can be considered as an extended mem_map. 18 - * A page_cgroup page is associated with every page descriptor. The 19 - * page_cgroup helps us identify information about the cgroup 20 - * All page cgroups are allocated at boot or memory hotplug event, 21 - * then the page cgroup for pfn always exists. 22 - */ 23 - struct page_cgroup { 24 - unsigned long flags; 25 - struct mem_cgroup *mem_cgroup; 26 - }; 27 - 28 - extern void pgdat_page_cgroup_init(struct pglist_data *pgdat); 29 - 30 - #ifdef CONFIG_SPARSEMEM 31 - static inline void page_cgroup_init_flatmem(void) 32 - { 33 - } 34 - extern void page_cgroup_init(void); 35 - #else 36 - extern void page_cgroup_init_flatmem(void); 37 - static inline void page_cgroup_init(void) 38 - { 39 - } 40 - #endif 41 - 42 - struct page_cgroup *lookup_page_cgroup(struct page *page); 43 - 44 - static inline int PageCgroupUsed(struct page_cgroup *pc) 45 - { 46 - return !!(pc->flags & PCG_USED); 47 - } 48 - #else /* !CONFIG_MEMCG */ 49 - struct page_cgroup; 50 - 51 - static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat) 52 - { 53 - } 54 - 55 - static inline struct page_cgroup *lookup_page_cgroup(struct page *page) 56 - { 57 - return NULL; 58 - } 59 - 60 - static inline void page_cgroup_init(void) 61 - { 62 - } 63 - 64 - static inline void page_cgroup_init_flatmem(void) 65 - { 66 - } 67 - #endif /* CONFIG_MEMCG */ 68 - 69 - #include <linux/swap.h> 70 - 71 - #ifdef CONFIG_MEMCG_SWAP 72 - extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, 73 - unsigned short old, unsigned short new); 74 - extern 
unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); 75 - extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); 76 - extern int swap_cgroup_swapon(int type, unsigned long max_pages); 77 - extern void swap_cgroup_swapoff(int type); 78 - #else 79 - 80 - static inline 81 - unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) 82 - { 83 - return 0; 84 - } 85 - 86 - static inline 87 - unsigned short lookup_swap_cgroup_id(swp_entry_t ent) 88 - { 89 - return 0; 90 - } 91 - 92 - static inline int 93 - swap_cgroup_swapon(int type, unsigned long max_pages) 94 - { 95 - return 0; 96 - } 97 - 98 - static inline void swap_cgroup_swapoff(int type) 99 - { 100 - return; 101 - } 102 - 103 - #endif /* CONFIG_MEMCG_SWAP */ 104 - 105 - #endif /* __LINUX_PAGE_CGROUP_H */
+51
include/linux/page_counter.h
··· 1 + #ifndef _LINUX_PAGE_COUNTER_H 2 + #define _LINUX_PAGE_COUNTER_H 3 + 4 + #include <linux/atomic.h> 5 + #include <linux/kernel.h> 6 + #include <asm/page.h> 7 + 8 + struct page_counter { 9 + atomic_long_t count; 10 + unsigned long limit; 11 + struct page_counter *parent; 12 + 13 + /* legacy */ 14 + unsigned long watermark; 15 + unsigned long failcnt; 16 + }; 17 + 18 + #if BITS_PER_LONG == 32 19 + #define PAGE_COUNTER_MAX LONG_MAX 20 + #else 21 + #define PAGE_COUNTER_MAX (LONG_MAX / PAGE_SIZE) 22 + #endif 23 + 24 + static inline void page_counter_init(struct page_counter *counter, 25 + struct page_counter *parent) 26 + { 27 + atomic_long_set(&counter->count, 0); 28 + counter->limit = PAGE_COUNTER_MAX; 29 + counter->parent = parent; 30 + } 31 + 32 + static inline unsigned long page_counter_read(struct page_counter *counter) 33 + { 34 + return atomic_long_read(&counter->count); 35 + } 36 + 37 + void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages); 38 + void page_counter_charge(struct page_counter *counter, unsigned long nr_pages); 39 + int page_counter_try_charge(struct page_counter *counter, 40 + unsigned long nr_pages, 41 + struct page_counter **fail); 42 + void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages); 43 + int page_counter_limit(struct page_counter *counter, unsigned long limit); 44 + int page_counter_memparse(const char *buf, unsigned long *nr_pages); 45 + 46 + static inline void page_counter_reset_watermark(struct page_counter *counter) 47 + { 48 + counter->watermark = page_counter_read(counter); 49 + } 50 + 51 + #endif /* _LINUX_PAGE_COUNTER_H */
+49 -20
include/linux/percpu-refcount.h
··· 147 147 } 148 148 149 149 /** 150 + * percpu_ref_get_many - increment a percpu refcount 151 + * @ref: percpu_ref to get 152 + * @nr: number of references to get 153 + * 154 + * Analogous to atomic_long_add(). 155 + * 156 + * This function is safe to call as long as @ref is between init and exit. 157 + */ 158 + static inline void percpu_ref_get_many(struct percpu_ref *ref, unsigned long nr) 159 + { 160 + unsigned long __percpu *percpu_count; 161 + 162 + rcu_read_lock_sched(); 163 + 164 + if (__ref_is_percpu(ref, &percpu_count)) 165 + this_cpu_add(*percpu_count, nr); 166 + else 167 + atomic_long_add(nr, &ref->count); 168 + 169 + rcu_read_unlock_sched(); 170 + } 171 + 172 + /** 150 173 * percpu_ref_get - increment a percpu refcount 151 174 * @ref: percpu_ref to get 152 175 * ··· 179 156 */ 180 157 static inline void percpu_ref_get(struct percpu_ref *ref) 181 158 { 182 - unsigned long __percpu *percpu_count; 183 - 184 - rcu_read_lock_sched(); 185 - 186 - if (__ref_is_percpu(ref, &percpu_count)) 187 - this_cpu_inc(*percpu_count); 188 - else 189 - atomic_long_inc(&ref->count); 190 - 191 - rcu_read_unlock_sched(); 159 + percpu_ref_get_many(ref, 1); 192 160 } 193 161 194 162 /** ··· 245 231 } 246 232 247 233 /** 234 + * percpu_ref_put_many - decrement a percpu refcount 235 + * @ref: percpu_ref to put 236 + * @nr: number of references to put 237 + * 238 + * Decrement the refcount, and if 0, call the release function (which was passed 239 + * to percpu_ref_init()) 240 + * 241 + * This function is safe to call as long as @ref is between init and exit. 
242 + */ 243 + static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr) 244 + { 245 + unsigned long __percpu *percpu_count; 246 + 247 + rcu_read_lock_sched(); 248 + 249 + if (__ref_is_percpu(ref, &percpu_count)) 250 + this_cpu_sub(*percpu_count, nr); 251 + else if (unlikely(atomic_long_sub_and_test(nr, &ref->count))) 252 + ref->release(ref); 253 + 254 + rcu_read_unlock_sched(); 255 + } 256 + 257 + /** 248 258 * percpu_ref_put - decrement a percpu refcount 249 259 * @ref: percpu_ref to put 250 260 * ··· 279 241 */ 280 242 static inline void percpu_ref_put(struct percpu_ref *ref) 281 243 { 282 - unsigned long __percpu *percpu_count; 283 - 284 - rcu_read_lock_sched(); 285 - 286 - if (__ref_is_percpu(ref, &percpu_count)) 287 - this_cpu_dec(*percpu_count); 288 - else if (unlikely(atomic_long_dec_and_test(&ref->count))) 289 - ref->release(ref); 290 - 291 - rcu_read_unlock_sched(); 244 + percpu_ref_put_many(ref, 1); 292 245 } 293 246 294 247 /**
-1
include/linux/printk.h
··· 118 118 #ifdef CONFIG_EARLY_PRINTK 119 119 extern asmlinkage __printf(1, 2) 120 120 void early_printk(const char *fmt, ...); 121 - void early_vprintk(const char *fmt, va_list ap); 122 121 #else 123 122 static inline __printf(1, 2) __cold 124 123 void early_printk(const char *s, ...) { }
+1 -1
include/linux/ptrace.h
··· 52 52 extern void __ptrace_link(struct task_struct *child, 53 53 struct task_struct *new_parent); 54 54 extern void __ptrace_unlink(struct task_struct *child); 55 - extern void exit_ptrace(struct task_struct *tracer); 55 + extern void exit_ptrace(struct task_struct *tracer, struct list_head *dead); 56 56 #define PTRACE_MODE_READ 0x01 57 57 #define PTRACE_MODE_ATTACH 0x02 58 58 #define PTRACE_MODE_NOAUDIT 0x04
-223
include/linux/res_counter.h
··· 1 - #ifndef __RES_COUNTER_H__ 2 - #define __RES_COUNTER_H__ 3 - 4 - /* 5 - * Resource Counters 6 - * Contain common data types and routines for resource accounting 7 - * 8 - * Copyright 2007 OpenVZ SWsoft Inc 9 - * 10 - * Author: Pavel Emelianov <xemul@openvz.org> 11 - * 12 - * See Documentation/cgroups/resource_counter.txt for more 13 - * info about what this counter is. 14 - */ 15 - 16 - #include <linux/spinlock.h> 17 - #include <linux/errno.h> 18 - 19 - /* 20 - * The core object. the cgroup that wishes to account for some 21 - * resource may include this counter into its structures and use 22 - * the helpers described beyond 23 - */ 24 - 25 - struct res_counter { 26 - /* 27 - * the current resource consumption level 28 - */ 29 - unsigned long long usage; 30 - /* 31 - * the maximal value of the usage from the counter creation 32 - */ 33 - unsigned long long max_usage; 34 - /* 35 - * the limit that usage cannot exceed 36 - */ 37 - unsigned long long limit; 38 - /* 39 - * the limit that usage can be exceed 40 - */ 41 - unsigned long long soft_limit; 42 - /* 43 - * the number of unsuccessful attempts to consume the resource 44 - */ 45 - unsigned long long failcnt; 46 - /* 47 - * the lock to protect all of the above. 48 - * the routines below consider this to be IRQ-safe 49 - */ 50 - spinlock_t lock; 51 - /* 52 - * Parent counter, used for hierarchial resource accounting 53 - */ 54 - struct res_counter *parent; 55 - }; 56 - 57 - #define RES_COUNTER_MAX ULLONG_MAX 58 - 59 - /** 60 - * Helpers to interact with userspace 61 - * res_counter_read_u64() - returns the value of the specified member. 62 - * res_counter_read/_write - put/get the specified fields from the 63 - * res_counter struct to/from the user 64 - * 65 - * @counter: the counter in question 66 - * @member: the field to work with (see RES_xxx below) 67 - * @buf: the buffer to opeate on,... 68 - * @nbytes: its size... 69 - * @pos: and the offset. 
70 - */ 71 - 72 - u64 res_counter_read_u64(struct res_counter *counter, int member); 73 - 74 - ssize_t res_counter_read(struct res_counter *counter, int member, 75 - const char __user *buf, size_t nbytes, loff_t *pos, 76 - int (*read_strategy)(unsigned long long val, char *s)); 77 - 78 - int res_counter_memparse_write_strategy(const char *buf, 79 - unsigned long long *res); 80 - 81 - /* 82 - * the field descriptors. one for each member of res_counter 83 - */ 84 - 85 - enum { 86 - RES_USAGE, 87 - RES_MAX_USAGE, 88 - RES_LIMIT, 89 - RES_FAILCNT, 90 - RES_SOFT_LIMIT, 91 - }; 92 - 93 - /* 94 - * helpers for accounting 95 - */ 96 - 97 - void res_counter_init(struct res_counter *counter, struct res_counter *parent); 98 - 99 - /* 100 - * charge - try to consume more resource. 101 - * 102 - * @counter: the counter 103 - * @val: the amount of the resource. each controller defines its own 104 - * units, e.g. numbers, bytes, Kbytes, etc 105 - * 106 - * returns 0 on success and <0 if the counter->usage will exceed the 107 - * counter->limit 108 - * 109 - * charge_nofail works the same, except that it charges the resource 110 - * counter unconditionally, and returns < 0 if the after the current 111 - * charge we are over limit. 112 - */ 113 - 114 - int __must_check res_counter_charge(struct res_counter *counter, 115 - unsigned long val, struct res_counter **limit_fail_at); 116 - int res_counter_charge_nofail(struct res_counter *counter, 117 - unsigned long val, struct res_counter **limit_fail_at); 118 - 119 - /* 120 - * uncharge - tell that some portion of the resource is released 121 - * 122 - * @counter: the counter 123 - * @val: the amount of the resource 124 - * 125 - * these calls check for usage underflow and show a warning on the console 126 - * 127 - * returns the total charges still present in @counter. 
128 - */ 129 - 130 - u64 res_counter_uncharge(struct res_counter *counter, unsigned long val); 131 - 132 - u64 res_counter_uncharge_until(struct res_counter *counter, 133 - struct res_counter *top, 134 - unsigned long val); 135 - /** 136 - * res_counter_margin - calculate chargeable space of a counter 137 - * @cnt: the counter 138 - * 139 - * Returns the difference between the hard limit and the current usage 140 - * of resource counter @cnt. 141 - */ 142 - static inline unsigned long long res_counter_margin(struct res_counter *cnt) 143 - { 144 - unsigned long long margin; 145 - unsigned long flags; 146 - 147 - spin_lock_irqsave(&cnt->lock, flags); 148 - if (cnt->limit > cnt->usage) 149 - margin = cnt->limit - cnt->usage; 150 - else 151 - margin = 0; 152 - spin_unlock_irqrestore(&cnt->lock, flags); 153 - return margin; 154 - } 155 - 156 - /** 157 - * Get the difference between the usage and the soft limit 158 - * @cnt: The counter 159 - * 160 - * Returns 0 if usage is less than or equal to soft limit 161 - * The difference between usage and soft limit, otherwise. 
162 - */ 163 - static inline unsigned long long 164 - res_counter_soft_limit_excess(struct res_counter *cnt) 165 - { 166 - unsigned long long excess; 167 - unsigned long flags; 168 - 169 - spin_lock_irqsave(&cnt->lock, flags); 170 - if (cnt->usage <= cnt->soft_limit) 171 - excess = 0; 172 - else 173 - excess = cnt->usage - cnt->soft_limit; 174 - spin_unlock_irqrestore(&cnt->lock, flags); 175 - return excess; 176 - } 177 - 178 - static inline void res_counter_reset_max(struct res_counter *cnt) 179 - { 180 - unsigned long flags; 181 - 182 - spin_lock_irqsave(&cnt->lock, flags); 183 - cnt->max_usage = cnt->usage; 184 - spin_unlock_irqrestore(&cnt->lock, flags); 185 - } 186 - 187 - static inline void res_counter_reset_failcnt(struct res_counter *cnt) 188 - { 189 - unsigned long flags; 190 - 191 - spin_lock_irqsave(&cnt->lock, flags); 192 - cnt->failcnt = 0; 193 - spin_unlock_irqrestore(&cnt->lock, flags); 194 - } 195 - 196 - static inline int res_counter_set_limit(struct res_counter *cnt, 197 - unsigned long long limit) 198 - { 199 - unsigned long flags; 200 - int ret = -EBUSY; 201 - 202 - spin_lock_irqsave(&cnt->lock, flags); 203 - if (cnt->usage <= limit) { 204 - cnt->limit = limit; 205 - ret = 0; 206 - } 207 - spin_unlock_irqrestore(&cnt->lock, flags); 208 - return ret; 209 - } 210 - 211 - static inline int 212 - res_counter_set_soft_limit(struct res_counter *cnt, 213 - unsigned long long soft_limit) 214 - { 215 - unsigned long flags; 216 - 217 - spin_lock_irqsave(&cnt->lock, flags); 218 - cnt->soft_limit = soft_limit; 219 - spin_unlock_irqrestore(&cnt->lock, flags); 220 - return 0; 221 - } 222 - 223 - #endif
-4
include/linux/slab.h
··· 513 513 514 514 int memcg_update_all_caches(int num_memcgs); 515 515 516 - struct seq_file; 517 - int cache_show(struct kmem_cache *s, struct seq_file *m); 518 - void print_slabinfo_header(struct seq_file *m); 519 - 520 516 /** 521 517 * kmalloc_array - allocate memory for an array. 522 518 * @n: number of elements.
+42
include/linux/swap_cgroup.h
··· 1 + #ifndef __LINUX_SWAP_CGROUP_H 2 + #define __LINUX_SWAP_CGROUP_H 3 + 4 + #include <linux/swap.h> 5 + 6 + #ifdef CONFIG_MEMCG_SWAP 7 + 8 + extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, 9 + unsigned short old, unsigned short new); 10 + extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); 11 + extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); 12 + extern int swap_cgroup_swapon(int type, unsigned long max_pages); 13 + extern void swap_cgroup_swapoff(int type); 14 + 15 + #else 16 + 17 + static inline 18 + unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) 19 + { 20 + return 0; 21 + } 22 + 23 + static inline 24 + unsigned short lookup_swap_cgroup_id(swp_entry_t ent) 25 + { 26 + return 0; 27 + } 28 + 29 + static inline int 30 + swap_cgroup_swapon(int type, unsigned long max_pages) 31 + { 32 + return 0; 33 + } 34 + 35 + static inline void swap_cgroup_swapoff(int type) 36 + { 37 + return; 38 + } 39 + 40 + #endif /* CONFIG_MEMCG_SWAP */ 41 + 42 + #endif /* __LINUX_SWAP_CGROUP_H */
+9 -17
include/net/sock.h
··· 54 54 #include <linux/security.h> 55 55 #include <linux/slab.h> 56 56 #include <linux/uaccess.h> 57 + #include <linux/page_counter.h> 57 58 #include <linux/memcontrol.h> 58 - #include <linux/res_counter.h> 59 59 #include <linux/static_key.h> 60 60 #include <linux/aio.h> 61 61 #include <linux/sched.h> ··· 1062 1062 }; 1063 1063 1064 1064 struct cg_proto { 1065 - struct res_counter memory_allocated; /* Current allocated memory. */ 1065 + struct page_counter memory_allocated; /* Current allocated memory. */ 1066 1066 struct percpu_counter sockets_allocated; /* Current number of sockets. */ 1067 1067 int memory_pressure; 1068 1068 long sysctl_mem[3]; ··· 1214 1214 unsigned long amt, 1215 1215 int *parent_status) 1216 1216 { 1217 - struct res_counter *fail; 1218 - int ret; 1217 + page_counter_charge(&prot->memory_allocated, amt); 1219 1218 1220 - ret = res_counter_charge_nofail(&prot->memory_allocated, 1221 - amt << PAGE_SHIFT, &fail); 1222 - if (ret < 0) 1219 + if (page_counter_read(&prot->memory_allocated) > 1220 + prot->memory_allocated.limit) 1223 1221 *parent_status = OVER_LIMIT; 1224 1222 } 1225 1223 1226 1224 static inline void memcg_memory_allocated_sub(struct cg_proto *prot, 1227 1225 unsigned long amt) 1228 1226 { 1229 - res_counter_uncharge(&prot->memory_allocated, amt << PAGE_SHIFT); 1230 - } 1231 - 1232 - static inline u64 memcg_memory_allocated_read(struct cg_proto *prot) 1233 - { 1234 - u64 ret; 1235 - ret = res_counter_read_u64(&prot->memory_allocated, RES_USAGE); 1236 - return ret >> PAGE_SHIFT; 1227 + page_counter_uncharge(&prot->memory_allocated, amt); 1237 1228 } 1238 1229 1239 1230 static inline long 1240 1231 sk_memory_allocated(const struct sock *sk) 1241 1232 { 1242 1233 struct proto *prot = sk->sk_prot; 1234 + 1243 1235 if (mem_cgroup_sockets_enabled && sk->sk_cgrp) 1244 - return memcg_memory_allocated_read(sk->sk_cgrp); 1236 + return page_counter_read(&sk->sk_cgrp->memory_allocated); 1245 1237 1246 1238 return 
atomic_long_read(prot->memory_allocated); 1247 1239 } ··· 1247 1255 memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status); 1248 1256 /* update the root cgroup regardless */ 1249 1257 atomic_long_add_return(amt, prot->memory_allocated); 1250 - return memcg_memory_allocated_read(sk->sk_cgrp); 1258 + return page_counter_read(&sk->sk_cgrp->memory_allocated); 1251 1259 } 1252 1260 1253 1261 return atomic_long_add_return(amt, prot->memory_allocated);
+1
include/uapi/linux/sysctl.h
··· 153 153 KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */ 154 154 KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */ 155 155 KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */ 156 + KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */ 156 157 }; 157 158 158 159
+29 -27
init/Kconfig
··· 893 893 config ARCH_WANT_NUMA_VARIABLE_LOCALITY 894 894 bool 895 895 896 - config NUMA_BALANCING_DEFAULT_ENABLED 897 - bool "Automatically enable NUMA aware memory/task placement" 898 - default y 899 - depends on NUMA_BALANCING 900 - help 901 - If set, automatic NUMA balancing will be enabled if running on a NUMA 902 - machine. 903 - 904 896 config NUMA_BALANCING 905 897 bool "Memory placement aware NUMA scheduler" 906 898 depends on ARCH_SUPPORTS_NUMA_BALANCING ··· 904 912 it has references to the node the task is running on. 905 913 906 914 This system will be inactive on UMA systems. 915 + 916 + config NUMA_BALANCING_DEFAULT_ENABLED 917 + bool "Automatically enable NUMA aware memory/task placement" 918 + default y 919 + depends on NUMA_BALANCING 920 + help 921 + If set, automatic NUMA balancing will be enabled if running on a NUMA 922 + machine. 907 923 908 924 menuconfig CGROUPS 909 925 boolean "Control Group support" ··· 972 972 Provides a simple Resource Controller for monitoring the 973 973 total CPU consumed by the tasks in a cgroup. 974 974 975 - config RESOURCE_COUNTERS 976 - bool "Resource counters" 977 - help 978 - This option enables controller independent resource accounting 979 - infrastructure that works with cgroups. 975 + config PAGE_COUNTER 976 + bool 980 977 981 978 config MEMCG 982 979 bool "Memory Resource Controller for Control Groups" 983 - depends on RESOURCE_COUNTERS 980 + select PAGE_COUNTER 984 981 select EVENTFD 985 982 help 986 983 Provides a memory resource controller that manages both anonymous 987 984 memory and page cache. (See Documentation/cgroups/memory.txt) 988 - 989 - Note that setting this option increases fixed memory overhead 990 - associated with each page of memory in the system. By this, 991 - 8(16)bytes/PAGE_SIZE on 32(64)bit system will be occupied by memory 992 - usage tracking struct at boot. Total amount of this is printed out 993 - at boot. 
994 - 995 - Only enable when you're ok with these trade offs and really 996 - sure you need the memory resource controller. Even when you enable 997 - this, you can set "cgroup_disable=memory" at your boot option to 998 - disable memory resource controller and you can avoid overheads. 999 - (and lose benefits of memory resource controller) 1000 985 1001 986 config MEMCG_SWAP 1002 987 bool "Memory Resource Controller Swap Extension" ··· 1033 1048 1034 1049 config CGROUP_HUGETLB 1035 1050 bool "HugeTLB Resource Controller for Control Groups" 1036 - depends on RESOURCE_COUNTERS && HUGETLB_PAGE 1051 + depends on HUGETLB_PAGE 1052 + select PAGE_COUNTER 1037 1053 default n 1038 1054 help 1039 1055 Provides a cgroup Resource Controller for HugeTLB pages. ··· 1279 1293 source "usr/Kconfig" 1280 1294 1281 1295 endif 1296 + 1297 + config INIT_FALLBACK 1298 + bool "Fall back to defaults if init= parameter is bad" 1299 + default y 1300 + help 1301 + If enabled, the kernel will try the default init binaries if an 1302 + explicit request from the init= parameter fails. 1303 + 1304 + This can have unexpected effects. For example, booting 1305 + with init=/sbin/kiosk_app will run /sbin/init or even /bin/sh 1306 + if /sbin/kiosk_app cannot be executed. 1307 + 1308 + The default value of Y is consistent with historical behavior. 1309 + Selecting N is likely to be more appropriate for most uses, 1310 + especially on kiosks and on kernels that are intended to be 1311 + run under the control of a script. 1282 1312 1283 1313 config CC_OPTIMIZE_FOR_SIZE 1284 1314 bool "Optimize for size"
+6 -8
init/main.c
··· 51 51 #include <linux/mempolicy.h> 52 52 #include <linux/key.h> 53 53 #include <linux/buffer_head.h> 54 - #include <linux/page_cgroup.h> 55 54 #include <linux/debug_locks.h> 56 55 #include <linux/debugobjects.h> 57 56 #include <linux/lockdep.h> ··· 484 485 */ 485 486 static void __init mm_init(void) 486 487 { 487 - /* 488 - * page_cgroup requires contiguous pages, 489 - * bigger than MAX_ORDER unless SPARSEMEM. 490 - */ 491 - page_cgroup_init_flatmem(); 492 488 mem_init(); 493 489 kmem_cache_init(); 494 490 percpu_init_late(); ··· 621 627 initrd_start = 0; 622 628 } 623 629 #endif 624 - page_cgroup_init(); 625 630 debug_objects_mem_init(); 626 631 kmemleak_init(); 627 632 setup_per_cpu_pageset(); ··· 952 959 ret = run_init_process(execute_command); 953 960 if (!ret) 954 961 return 0; 962 + #ifndef CONFIG_INIT_FALLBACK 963 + panic("Requested init %s failed (error %d).", 964 + execute_command, ret); 965 + #else 955 966 pr_err("Failed to execute %s (error %d). Attempting defaults...\n", 956 - execute_command, ret); 967 + execute_command, ret); 968 + #endif 957 969 } 958 970 if (!try_to_run_init_process("/sbin/init") || 959 971 !try_to_run_init_process("/etc/init") ||
-1
kernel/Makefile
··· 57 57 obj-$(CONFIG_USER_NS) += user_namespace.o 58 58 obj-$(CONFIG_PID_NS) += pid_namespace.o 59 59 obj-$(CONFIG_IKCONFIG) += configs.o 60 - obj-$(CONFIG_RESOURCE_COUNTERS) += res_counter.o 61 60 obj-$(CONFIG_SMP) += stop_machine.o 62 61 obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o 63 62 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
+123 -122
kernel/exit.c
··· 118 118 } 119 119 120 120 /* 121 - * Accumulate here the counters for all threads but the group leader 122 - * as they die, so they can be added into the process-wide totals 123 - * when those are taken. The group leader stays around as a zombie as 124 - * long as there are other threads. When it gets reaped, the exit.c 125 - * code will add its counts into these totals. We won't ever get here 126 - * for the group leader, since it will have been the last reference on 127 - * the signal_struct. 121 + * Accumulate here the counters for all threads as they die. We could 122 + * skip the group leader because it is the last user of signal_struct, 123 + * but we want to avoid the race with thread_group_cputime() which can 124 + * see the empty ->thread_head list. 128 125 */ 129 126 task_cputime(tsk, &utime, &stime); 130 127 write_seqlock(&sig->stats_lock); ··· 459 462 clear_thread_flag(TIF_MEMDIE); 460 463 } 461 464 465 + static struct task_struct *find_alive_thread(struct task_struct *p) 466 + { 467 + struct task_struct *t; 468 + 469 + for_each_thread(p, t) { 470 + if (!(t->flags & PF_EXITING)) 471 + return t; 472 + } 473 + return NULL; 474 + } 475 + 476 + static struct task_struct *find_child_reaper(struct task_struct *father) 477 + __releases(&tasklist_lock) 478 + __acquires(&tasklist_lock) 479 + { 480 + struct pid_namespace *pid_ns = task_active_pid_ns(father); 481 + struct task_struct *reaper = pid_ns->child_reaper; 482 + 483 + if (likely(reaper != father)) 484 + return reaper; 485 + 486 + reaper = find_alive_thread(father); 487 + if (reaper) { 488 + pid_ns->child_reaper = reaper; 489 + return reaper; 490 + } 491 + 492 + write_unlock_irq(&tasklist_lock); 493 + if (unlikely(pid_ns == &init_pid_ns)) { 494 + panic("Attempted to kill init! 
exitcode=0x%08x\n", 495 + father->signal->group_exit_code ?: father->exit_code); 496 + } 497 + zap_pid_ns_processes(pid_ns); 498 + write_lock_irq(&tasklist_lock); 499 + 500 + return father; 501 + } 502 + 462 503 /* 463 504 * When we die, we re-parent all our children, and try to: 464 505 * 1. give them to another thread in our thread group, if such a member exists ··· 504 469 * child_subreaper for its children (like a service manager) 505 470 * 3. give it to the init process (PID 1) in our pid namespace 506 471 */ 507 - static struct task_struct *find_new_reaper(struct task_struct *father) 508 - __releases(&tasklist_lock) 509 - __acquires(&tasklist_lock) 472 + static struct task_struct *find_new_reaper(struct task_struct *father, 473 + struct task_struct *child_reaper) 510 474 { 511 - struct pid_namespace *pid_ns = task_active_pid_ns(father); 512 - struct task_struct *thread; 475 + struct task_struct *thread, *reaper; 513 476 514 - thread = father; 515 - while_each_thread(father, thread) { 516 - if (thread->flags & PF_EXITING) 517 - continue; 518 - if (unlikely(pid_ns->child_reaper == father)) 519 - pid_ns->child_reaper = thread; 477 + thread = find_alive_thread(father); 478 + if (thread) 520 479 return thread; 521 - } 522 480 523 - if (unlikely(pid_ns->child_reaper == father)) { 524 - write_unlock_irq(&tasklist_lock); 525 - if (unlikely(pid_ns == &init_pid_ns)) { 526 - panic("Attempted to kill init! exitcode=0x%08x\n", 527 - father->signal->group_exit_code ?: 528 - father->exit_code); 529 - } 530 - 531 - zap_pid_ns_processes(pid_ns); 532 - write_lock_irq(&tasklist_lock); 533 - } else if (father->signal->has_child_subreaper) { 534 - struct task_struct *reaper; 535 - 481 + if (father->signal->has_child_subreaper) { 536 482 /* 537 - * Find the first ancestor marked as child_subreaper. 538 - * Note that the code below checks same_thread_group(reaper, 539 - * pid_ns->child_reaper). This is what we need to DTRT in a 540 - * PID namespace. 
However we still need the check above, see 541 - * http://marc.info/?l=linux-kernel&m=131385460420380 483 + * Find the first ->is_child_subreaper ancestor in our pid_ns. 484 + * We start from father to ensure we can not look into another 485 + * namespace, this is safe because all its threads are dead. 542 486 */ 543 - for (reaper = father->real_parent; 544 - reaper != &init_task; 487 + for (reaper = father; 488 + !same_thread_group(reaper, child_reaper); 545 489 reaper = reaper->real_parent) { 546 - if (same_thread_group(reaper, pid_ns->child_reaper)) 490 + /* call_usermodehelper() descendants need this check */ 491 + if (reaper == &init_task) 547 492 break; 548 493 if (!reaper->signal->is_child_subreaper) 549 494 continue; 550 - thread = reaper; 551 - do { 552 - if (!(thread->flags & PF_EXITING)) 553 - return reaper; 554 - } while_each_thread(reaper, thread); 495 + thread = find_alive_thread(reaper); 496 + if (thread) 497 + return thread; 555 498 } 556 499 } 557 500 558 - return pid_ns->child_reaper; 501 + return child_reaper; 559 502 } 560 503 561 504 /* ··· 542 529 static void reparent_leader(struct task_struct *father, struct task_struct *p, 543 530 struct list_head *dead) 544 531 { 545 - list_move_tail(&p->sibling, &p->real_parent->children); 546 - 547 - if (p->exit_state == EXIT_DEAD) 548 - return; 549 - /* 550 - * If this is a threaded reparent there is no need to 551 - * notify anyone anything has happened. 552 - */ 553 - if (same_thread_group(p->real_parent, father)) 532 + if (unlikely(p->exit_state == EXIT_DEAD)) 554 533 return; 555 534 556 535 /* We don't want people slaying init. 
*/ ··· 553 548 p->exit_state == EXIT_ZOMBIE && thread_group_empty(p)) { 554 549 if (do_notify_parent(p, p->exit_signal)) { 555 550 p->exit_state = EXIT_DEAD; 556 - list_move_tail(&p->sibling, dead); 551 + list_add(&p->ptrace_entry, dead); 557 552 } 558 553 } 559 554 560 555 kill_orphaned_pgrp(p, father); 561 556 } 562 557 563 - static void forget_original_parent(struct task_struct *father) 558 + /* 559 + * This does two things: 560 + * 561 + * A. Make init inherit all the child processes 562 + * B. Check to see if any process groups have become orphaned 563 + * as a result of our exiting, and if they have any stopped 564 + * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2) 565 + */ 566 + static void forget_original_parent(struct task_struct *father, 567 + struct list_head *dead) 564 568 { 565 - struct task_struct *p, *n, *reaper; 566 - LIST_HEAD(dead_children); 569 + struct task_struct *p, *t, *reaper; 567 570 568 - write_lock_irq(&tasklist_lock); 569 - /* 570 - * Note that exit_ptrace() and find_new_reaper() might 571 - * drop tasklist_lock and reacquire it. 
572 - */ 573 - exit_ptrace(father); 574 - reaper = find_new_reaper(father); 571 + if (unlikely(!list_empty(&father->ptraced))) 572 + exit_ptrace(father, dead); 575 573 576 - list_for_each_entry_safe(p, n, &father->children, sibling) { 577 - struct task_struct *t = p; 574 + /* Can drop and reacquire tasklist_lock */ 575 + reaper = find_child_reaper(father); 576 + if (list_empty(&father->children)) 577 + return; 578 578 579 - do { 579 + reaper = find_new_reaper(father, reaper); 580 + list_for_each_entry(p, &father->children, sibling) { 581 + for_each_thread(p, t) { 580 582 t->real_parent = reaper; 581 - if (t->parent == father) { 582 - BUG_ON(t->ptrace); 583 + BUG_ON((!t->ptrace) != (t->parent == father)); 584 + if (likely(!t->ptrace)) 583 585 t->parent = t->real_parent; 584 - } 585 586 if (t->pdeath_signal) 586 587 group_send_sig_info(t->pdeath_signal, 587 588 SEND_SIG_NOINFO, t); 588 - } while_each_thread(p, t); 589 - reparent_leader(father, p, &dead_children); 589 + } 590 + /* 591 + * If this is a threaded reparent there is no need to 592 + * notify anyone anything has happened. 593 + */ 594 + if (!same_thread_group(reaper, father)) 595 + reparent_leader(father, p, dead); 590 596 } 591 - write_unlock_irq(&tasklist_lock); 592 - 593 - BUG_ON(!list_empty(&father->children)); 594 - 595 - list_for_each_entry_safe(p, n, &dead_children, sibling) { 596 - list_del_init(&p->sibling); 597 - release_task(p); 598 - } 597 + list_splice_tail_init(&father->children, &reaper->children); 599 598 } 600 599 601 600 /* ··· 609 600 static void exit_notify(struct task_struct *tsk, int group_dead) 610 601 { 611 602 bool autoreap; 612 - 613 - /* 614 - * This does two things: 615 - * 616 - * A. Make init inherit all the child processes 617 - * B. Check to see if any process groups have become orphaned 618 - * as a result of our exiting, and if they have any stopped 619 - * jobs, send them a SIGHUP and then a SIGCONT. 
(POSIX 3.2.2.2) 620 - */ 621 - forget_original_parent(tsk); 603 + struct task_struct *p, *n; 604 + LIST_HEAD(dead); 622 605 623 606 write_lock_irq(&tasklist_lock); 607 + forget_original_parent(tsk, &dead); 608 + 624 609 if (group_dead) 625 610 kill_orphaned_pgrp(tsk->group_leader, NULL); 626 611 ··· 632 629 } 633 630 634 631 tsk->exit_state = autoreap ? EXIT_DEAD : EXIT_ZOMBIE; 632 + if (tsk->exit_state == EXIT_DEAD) 633 + list_add(&tsk->ptrace_entry, &dead); 635 634 636 635 /* mt-exec, de_thread() is waiting for group leader */ 637 636 if (unlikely(tsk->signal->notify_count < 0)) 638 637 wake_up_process(tsk->signal->group_exit_task); 639 638 write_unlock_irq(&tasklist_lock); 640 639 641 - /* If the process is dead, release it - nobody will wait for it */ 642 - if (autoreap) 643 - release_task(tsk); 640 + list_for_each_entry_safe(p, n, &dead, ptrace_entry) { 641 + list_del_init(&p->ptrace_entry); 642 + release_task(p); 643 + } 644 644 } 645 645 646 646 #ifdef CONFIG_DEBUG_STACK_USAGE ··· 988 982 */ 989 983 static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p) 990 984 { 991 - unsigned long state; 992 - int retval, status, traced; 985 + int state, retval, status; 993 986 pid_t pid = task_pid_vnr(p); 994 987 uid_t uid = from_kuid_munged(current_user_ns(), task_uid(p)); 995 988 struct siginfo __user *infop; ··· 1013 1008 } 1014 1009 return wait_noreap_copyout(wo, p, pid, uid, why, status); 1015 1010 } 1016 - 1017 - traced = ptrace_reparented(p); 1018 1011 /* 1019 1012 * Move the task's state to DEAD/TRACE, only one thread can do this. 1020 1013 */ 1021 - state = traced && thread_group_leader(p) ? EXIT_TRACE : EXIT_DEAD; 1014 + state = (ptrace_reparented(p) && thread_group_leader(p)) ? 1015 + EXIT_TRACE : EXIT_DEAD; 1022 1016 if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE) 1023 1017 return 0; 1024 1018 /* 1025 - * It can be ptraced but not reparented, check 1026 - * thread_group_leader() to filter out sub-threads. 
1019 + * We own this thread, nobody else can reap it. 1027 1020 */ 1028 - if (likely(!traced) && thread_group_leader(p)) { 1029 - struct signal_struct *psig; 1030 - struct signal_struct *sig; 1021 + read_unlock(&tasklist_lock); 1022 + sched_annotate_sleep(); 1023 + 1024 + /* 1025 + * Check thread_group_leader() to exclude the traced sub-threads. 1026 + */ 1027 + if (state == EXIT_DEAD && thread_group_leader(p)) { 1028 + struct signal_struct *sig = p->signal; 1029 + struct signal_struct *psig = current->signal; 1031 1030 unsigned long maxrss; 1032 1031 cputime_t tgutime, tgstime; 1033 1032 ··· 1043 1034 * accumulate in the parent's signal_struct c* fields. 1044 1035 * 1045 1036 * We don't bother to take a lock here to protect these 1046 - * p->signal fields, because they are only touched by 1047 - * __exit_signal, which runs with tasklist_lock 1048 - * write-locked anyway, and so is excluded here. We do 1049 - * need to protect the access to parent->signal fields, 1050 - * as other threads in the parent group can be right 1051 - * here reaping other children at the same time. 1037 + * p->signal fields because the whole thread group is dead 1038 + * and nobody can change them. 1039 + * 1040 + * psig->stats_lock also protects us from our sub-threads 1041 + * which can reap other children at the same time. Until 1042 + * we change k_getrusage()-like users to rely on this lock 1043 + * we have to take ->siglock as well. 1052 1044 * 1053 1045 * We use thread_group_cputime_adjusted() to get times for 1054 1046 * the thread group, which consolidates times for all threads 1055 1047 * in the group including the group leader. 
1056 1048 */ 1057 1049 thread_group_cputime_adjusted(p, &tgutime, &tgstime); 1058 - spin_lock_irq(&p->real_parent->sighand->siglock); 1059 - psig = p->real_parent->signal; 1060 - sig = p->signal; 1050 + spin_lock_irq(&current->sighand->siglock); 1061 1051 write_seqlock(&psig->stats_lock); 1062 1052 psig->cutime += tgutime + sig->cutime; 1063 1053 psig->cstime += tgstime + sig->cstime; ··· 1081 1073 task_io_accounting_add(&psig->ioac, &p->ioac); 1082 1074 task_io_accounting_add(&psig->ioac, &sig->ioac); 1083 1075 write_sequnlock(&psig->stats_lock); 1084 - spin_unlock_irq(&p->real_parent->sighand->siglock); 1076 + spin_unlock_irq(&current->sighand->siglock); 1085 1077 } 1086 - 1087 - /* 1088 - * Now we are sure this task is interesting, and no other 1089 - * thread can reap it because we its state == DEAD/TRACE. 1090 - */ 1091 - read_unlock(&tasklist_lock); 1092 - sched_annotate_sleep(); 1093 1078 1094 1079 retval = wo->wo_rusage 1095 1080 ? getrusage(p, RUSAGE_BOTH, wo->wo_rusage) : 0;
+5 -38
kernel/kmod.c
··· 47 47 48 48 static struct workqueue_struct *khelper_wq; 49 49 50 - /* 51 - * kmod_thread_locker is used for deadlock avoidance. There is no explicit 52 - * locking to protect this global - it is private to the singleton khelper 53 - * thread and should only ever be modified by that thread. 54 - */ 55 - static const struct task_struct *kmod_thread_locker; 56 - 57 50 #define CAP_BSET (void *)1 58 51 #define CAP_PI (void *)2 59 52 ··· 216 223 static int ____call_usermodehelper(void *data) 217 224 { 218 225 struct subprocess_info *sub_info = data; 219 - int wait = sub_info->wait & ~UMH_KILLABLE; 220 226 struct cred *new; 221 227 int retval; 222 228 ··· 259 267 out: 260 268 sub_info->retval = retval; 261 269 /* wait_for_helper() will call umh_complete if UHM_WAIT_PROC. */ 262 - if (wait != UMH_WAIT_PROC) 270 + if (!(sub_info->wait & UMH_WAIT_PROC)) 263 271 umh_complete(sub_info); 264 272 if (!retval) 265 273 return 0; 266 274 do_exit(0); 267 - } 268 - 269 - static int call_helper(void *data) 270 - { 271 - /* Worker thread started blocking khelper thread. */ 272 - kmod_thread_locker = current; 273 - return ____call_usermodehelper(data); 274 275 } 275 276 276 277 /* Keventd can't block, but this (a child) can. */ ··· 308 323 { 309 324 struct subprocess_info *sub_info = 310 325 container_of(work, struct subprocess_info, work); 311 - int wait = sub_info->wait & ~UMH_KILLABLE; 312 326 pid_t pid; 313 327 314 - /* CLONE_VFORK: wait until the usermode helper has execve'd 315 - * successfully We need the data structures to stay around 316 - * until that is done. */ 317 - if (wait == UMH_WAIT_PROC) 328 + if (sub_info->wait & UMH_WAIT_PROC) 318 329 pid = kernel_thread(wait_for_helper, sub_info, 319 330 CLONE_FS | CLONE_FILES | SIGCHLD); 320 - else { 321 - pid = kernel_thread(call_helper, sub_info, 322 - CLONE_VFORK | SIGCHLD); 323 - /* Worker thread stopped blocking khelper thread. 
*/ 324 - kmod_thread_locker = NULL; 325 - } 331 + else 332 + pid = kernel_thread(____call_usermodehelper, sub_info, 333 + SIGCHLD); 326 334 327 335 if (pid < 0) { 328 336 sub_info->retval = pid; ··· 548 570 retval = -EBUSY; 549 571 goto out; 550 572 } 551 - /* 552 - * Worker thread must not wait for khelper thread at below 553 - * wait_for_completion() if the thread was created with CLONE_VFORK 554 - * flag, for khelper thread is already waiting for the thread at 555 - * wait_for_completion() in do_fork(). 556 - */ 557 - if (wait != UMH_NO_WAIT && current == kmod_thread_locker) { 558 - retval = -EBUSY; 559 - goto out; 560 - } 561 - 562 573 /* 563 574 * Set the completion pointer only if there is a waiter. 564 575 * This makes it possible to use umh_complete to free
+13
kernel/panic.c
··· 33 33 static int pause_on_oops_flag; 34 34 static DEFINE_SPINLOCK(pause_on_oops_lock); 35 35 static bool crash_kexec_post_notifiers; 36 + int panic_on_warn __read_mostly; 36 37 37 38 int panic_timeout = CONFIG_PANIC_TIMEOUT; 38 39 EXPORT_SYMBOL_GPL(panic_timeout); ··· 429 428 if (args) 430 429 vprintk(args->fmt, args->args); 431 430 431 + if (panic_on_warn) { 432 + /* 433 + * This thread may hit another WARN() in the panic path. 434 + * Resetting this prevents additional WARN() from panicking the 435 + * system on this thread. Other threads are blocked by the 436 + * panic_mutex in panic(). 437 + */ 438 + panic_on_warn = 0; 439 + panic("panic_on_warn set ...\n"); 440 + } 441 + 432 442 print_modules(); 433 443 dump_stack(); 434 444 print_oops_end_marker(); ··· 497 485 498 486 core_param(panic, panic_timeout, int, 0644); 499 487 core_param(pause_on_oops, pause_on_oops, int, 0644); 488 + core_param(panic_on_warn, panic_on_warn, int, 0644); 500 489 501 490 static int __init setup_crash_kexec_post_notifiers(char *s) 502 491 {
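The panic_on_warn switch added above is registered both as a core_param() boot parameter and (via the kernel/sysctl.c hunk further down) as a sysctl. A usage sketch, assuming the sysctl lands under the usual kernel. prefix; the file name here is hypothetical:

```
# /etc/sysctl.d/90-panic.conf (hypothetical example file)
# Turn the first WARN() into a panic, e.g. on test/CI machines:
kernel.panic_on_warn = 1

# Equivalent boot command-line setting, registered by core_param():
#   panic_on_warn=1
```

Note the WARN path clears the flag before calling panic(), so a second WARN() hit while panicking cannot recurse.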
+2
kernel/pid.c
··· 341 341 342 342 out_unlock: 343 343 spin_unlock_irq(&pidmap_lock); 344 + put_pid_ns(ns); 345 + 344 346 out_free: 345 347 while (++i <= ns->level) 346 348 free_pidmap(pid->numbers + i);
+24 -4
kernel/pid_namespace.c
··· 190 190 /* Don't allow any more processes into the pid namespace */ 191 191 disable_pid_allocation(pid_ns); 192 192 193 - /* Ignore SIGCHLD causing any terminated children to autoreap */ 193 + /* 194 + * Ignore SIGCHLD causing any terminated children to autoreap. 195 + * This speeds up the namespace shutdown, plus see the comment 196 + * below. 197 + */ 194 198 spin_lock_irq(&me->sighand->siglock); 195 199 me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN; 196 200 spin_unlock_irq(&me->sighand->siglock); ··· 227 223 } 228 224 read_unlock(&tasklist_lock); 229 225 230 - /* Firstly reap the EXIT_ZOMBIE children we may have. */ 226 + /* 227 + * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD. 228 + * sys_wait4() will also block until our children traced from the 229 + * parent namespace are detached and become EXIT_DEAD. 230 + */ 231 231 do { 232 232 clear_thread_flag(TIF_SIGPENDING); 233 233 rc = sys_wait4(-1, NULL, __WALL, NULL); 234 234 } while (rc != -ECHILD); 235 235 236 236 /* 237 - * sys_wait4() above can't reap the TASK_DEAD children. 238 - * Make sure they all go away, see free_pid(). 237 + * sys_wait4() above can't reap the EXIT_DEAD children but we do not 238 + * really care, we could reparent them to the global init. We could 239 + * exit and reap ->child_reaper even if it is not the last thread in 240 + * this pid_ns, free_pid(nr_hashed == 0) calls proc_cleanup_work(), 241 + * pid_ns can not go away until proc_kill_sb() drops the reference. 242 + * 243 + * But this ns can also have other tasks injected by setns()+fork(). 244 + * Again, ignoring the user visible semantics we do not really need 245 + * to wait until they are all reaped, but they can be reparented to 246 + * us and thus we need to ensure that pid->child_reaper stays valid 247 + * until they all go away. See free_pid()->wake_up_process(). 248 + * 249 + * We rely on ignored SIGCHLD, an injected zombie must be autoreaped 250 + * if reparented. 
239 251 */ 240 252 for (;;) { 241 253 set_current_state(TASK_UNINTERRUPTIBLE);
+22 -27
kernel/printk/printk.c
··· 62 62 CONSOLE_LOGLEVEL_DEFAULT, /* default_console_loglevel */ 63 63 }; 64 64 65 - /* Deferred messaged from sched code are marked by this special level */ 66 - #define SCHED_MESSAGE_LOGLEVEL -2 67 - 68 65 /* 69 66 * Low level drivers may need that to know if they can schedule in 70 67 * their unblank() callback or not. So let's export it. ··· 1256 1259 int do_syslog(int type, char __user *buf, int len, bool from_file) 1257 1260 { 1258 1261 bool clear = false; 1259 - static int saved_console_loglevel = -1; 1262 + static int saved_console_loglevel = LOGLEVEL_DEFAULT; 1260 1263 int error; 1261 1264 1262 1265 error = check_syslog_permissions(type, from_file); ··· 1313 1316 break; 1314 1317 /* Disable logging to console */ 1315 1318 case SYSLOG_ACTION_CONSOLE_OFF: 1316 - if (saved_console_loglevel == -1) 1319 + if (saved_console_loglevel == LOGLEVEL_DEFAULT) 1317 1320 saved_console_loglevel = console_loglevel; 1318 1321 console_loglevel = minimum_console_loglevel; 1319 1322 break; 1320 1323 /* Enable logging to console */ 1321 1324 case SYSLOG_ACTION_CONSOLE_ON: 1322 - if (saved_console_loglevel != -1) { 1325 + if (saved_console_loglevel != LOGLEVEL_DEFAULT) { 1323 1326 console_loglevel = saved_console_loglevel; 1324 - saved_console_loglevel = -1; 1327 + saved_console_loglevel = LOGLEVEL_DEFAULT; 1325 1328 } 1326 1329 break; 1327 1330 /* Set level of messages printed to console */ ··· 1333 1336 len = minimum_console_loglevel; 1334 1337 console_loglevel = len; 1335 1338 /* Implicitly re-enable logging to console */ 1336 - saved_console_loglevel = -1; 1339 + saved_console_loglevel = LOGLEVEL_DEFAULT; 1337 1340 error = 0; 1338 1341 break; 1339 1342 /* Number of chars in the log buffer */ ··· 1624 1627 int printed_len = 0; 1625 1628 bool in_sched = false; 1626 1629 /* cpu currently holding logbuf_lock in this function */ 1627 - static volatile unsigned int logbuf_cpu = UINT_MAX; 1630 + static unsigned int logbuf_cpu = UINT_MAX; 1628 1631 1629 - if (level == 
SCHED_MESSAGE_LOGLEVEL) { 1630 - level = -1; 1632 + if (level == LOGLEVEL_SCHED) { 1633 + level = LOGLEVEL_DEFAULT; 1631 1634 in_sched = true; 1632 1635 } 1633 1636 ··· 1692 1695 const char *end_of_header = printk_skip_level(text); 1693 1696 switch (kern_level) { 1694 1697 case '0' ... '7': 1695 - if (level == -1) 1698 + if (level == LOGLEVEL_DEFAULT) 1696 1699 level = kern_level - '0'; 1700 + /* fallthrough */ 1697 1701 case 'd': /* KERN_DEFAULT */ 1698 1702 lflags |= LOG_PREFIX; 1699 1703 } ··· 1708 1710 } 1709 1711 } 1710 1712 1711 - if (level == -1) 1713 + if (level == LOGLEVEL_DEFAULT) 1712 1714 level = default_message_loglevel; 1713 1715 1714 1716 if (dict) ··· 1786 1788 1787 1789 asmlinkage int vprintk(const char *fmt, va_list args) 1788 1790 { 1789 - return vprintk_emit(0, -1, NULL, 0, fmt, args); 1791 + return vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args); 1790 1792 } 1791 1793 EXPORT_SYMBOL(vprintk); 1792 1794 ··· 1840 1842 } 1841 1843 #endif 1842 1844 va_start(args, fmt); 1843 - r = vprintk_emit(0, -1, NULL, 0, fmt, args); 1845 + r = vprintk_emit(0, LOGLEVEL_DEFAULT, NULL, 0, fmt, args); 1844 1846 va_end(args); 1845 1847 1846 1848 return r; ··· 1879 1881 #ifdef CONFIG_EARLY_PRINTK 1880 1882 struct console *early_console; 1881 1883 1882 - void early_vprintk(const char *fmt, va_list ap) 1883 - { 1884 - if (early_console) { 1885 - char buf[512]; 1886 - int n = vscnprintf(buf, sizeof(buf), fmt, ap); 1887 - 1888 - early_console->write(early_console, buf, n); 1889 - } 1890 - } 1891 - 1892 1884 asmlinkage __visible void early_printk(const char *fmt, ...) 
1893 1885 { 1894 1886 va_list ap; 1887 + char buf[512]; 1888 + int n; 1889 + 1890 + if (!early_console) 1891 + return; 1895 1892 1896 1893 va_start(ap, fmt); 1897 - early_vprintk(fmt, ap); 1894 + n = vscnprintf(buf, sizeof(buf), fmt, ap); 1898 1895 va_end(ap); 1896 + 1897 + early_console->write(early_console, buf, n); 1899 1898 } 1900 1899 #endif 1901 1900 ··· 2629 2634 2630 2635 preempt_disable(); 2631 2636 va_start(args, fmt); 2632 - r = vprintk_emit(0, SCHED_MESSAGE_LOGLEVEL, NULL, 0, fmt, args); 2637 + r = vprintk_emit(0, LOGLEVEL_SCHED, NULL, 0, fmt, args); 2633 2638 va_end(args); 2634 2639 2635 2640 __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
+3 -20
kernel/ptrace.c
··· 485 485 486 486 /* 487 487 * Detach all tasks we were using ptrace on. Called with tasklist held 488 - * for writing, and returns with it held too. But note it can release 489 - * and reacquire the lock. 488 + * for writing. 490 489 */ 491 - void exit_ptrace(struct task_struct *tracer) 492 - __releases(&tasklist_lock) 493 - __acquires(&tasklist_lock) 490 + void exit_ptrace(struct task_struct *tracer, struct list_head *dead) 494 491 { 495 492 struct task_struct *p, *n; 496 - LIST_HEAD(ptrace_dead); 497 - 498 - if (likely(list_empty(&tracer->ptraced))) 499 - return; 500 493 501 494 list_for_each_entry_safe(p, n, &tracer->ptraced, ptrace_entry) { 502 495 if (unlikely(p->ptrace & PT_EXITKILL)) 503 496 send_sig_info(SIGKILL, SEND_SIG_FORCED, p); 504 497 505 498 if (__ptrace_detach(tracer, p)) 506 - list_add(&p->ptrace_entry, &ptrace_dead); 499 + list_add(&p->ptrace_entry, dead); 507 500 } 508 - 509 - write_unlock_irq(&tasklist_lock); 510 - BUG_ON(!list_empty(&tracer->ptraced)); 511 - 512 - list_for_each_entry_safe(p, n, &ptrace_dead, ptrace_entry) { 513 - list_del_init(&p->ptrace_entry); 514 - release_task(p); 515 - } 516 - 517 - write_lock_irq(&tasklist_lock); 518 501 } 519 502 520 503 int ptrace_readdata(struct task_struct *tsk, unsigned long src, char __user *dst, int len)
-211
kernel/res_counter.c
··· 1 - /* 2 - * resource cgroups 3 - * 4 - * Copyright 2007 OpenVZ SWsoft Inc 5 - * 6 - * Author: Pavel Emelianov <xemul@openvz.org> 7 - * 8 - */ 9 - 10 - #include <linux/types.h> 11 - #include <linux/parser.h> 12 - #include <linux/fs.h> 13 - #include <linux/res_counter.h> 14 - #include <linux/uaccess.h> 15 - #include <linux/mm.h> 16 - 17 - void res_counter_init(struct res_counter *counter, struct res_counter *parent) 18 - { 19 - spin_lock_init(&counter->lock); 20 - counter->limit = RES_COUNTER_MAX; 21 - counter->soft_limit = RES_COUNTER_MAX; 22 - counter->parent = parent; 23 - } 24 - 25 - static u64 res_counter_uncharge_locked(struct res_counter *counter, 26 - unsigned long val) 27 - { 28 - if (WARN_ON(counter->usage < val)) 29 - val = counter->usage; 30 - 31 - counter->usage -= val; 32 - return counter->usage; 33 - } 34 - 35 - static int res_counter_charge_locked(struct res_counter *counter, 36 - unsigned long val, bool force) 37 - { 38 - int ret = 0; 39 - 40 - if (counter->usage + val > counter->limit) { 41 - counter->failcnt++; 42 - ret = -ENOMEM; 43 - if (!force) 44 - return ret; 45 - } 46 - 47 - counter->usage += val; 48 - if (counter->usage > counter->max_usage) 49 - counter->max_usage = counter->usage; 50 - return ret; 51 - } 52 - 53 - static int __res_counter_charge(struct res_counter *counter, unsigned long val, 54 - struct res_counter **limit_fail_at, bool force) 55 - { 56 - int ret, r; 57 - unsigned long flags; 58 - struct res_counter *c, *u; 59 - 60 - r = ret = 0; 61 - *limit_fail_at = NULL; 62 - local_irq_save(flags); 63 - for (c = counter; c != NULL; c = c->parent) { 64 - spin_lock(&c->lock); 65 - r = res_counter_charge_locked(c, val, force); 66 - spin_unlock(&c->lock); 67 - if (r < 0 && !ret) { 68 - ret = r; 69 - *limit_fail_at = c; 70 - if (!force) 71 - break; 72 - } 73 - } 74 - 75 - if (ret < 0 && !force) { 76 - for (u = counter; u != c; u = u->parent) { 77 - spin_lock(&u->lock); 78 - res_counter_uncharge_locked(u, val); 79 - 
spin_unlock(&u->lock); 80 - } 81 - } 82 - local_irq_restore(flags); 83 - 84 - return ret; 85 - } 86 - 87 - int res_counter_charge(struct res_counter *counter, unsigned long val, 88 - struct res_counter **limit_fail_at) 89 - { 90 - return __res_counter_charge(counter, val, limit_fail_at, false); 91 - } 92 - 93 - int res_counter_charge_nofail(struct res_counter *counter, unsigned long val, 94 - struct res_counter **limit_fail_at) 95 - { 96 - return __res_counter_charge(counter, val, limit_fail_at, true); 97 - } 98 - 99 - u64 res_counter_uncharge_until(struct res_counter *counter, 100 - struct res_counter *top, 101 - unsigned long val) 102 - { 103 - unsigned long flags; 104 - struct res_counter *c; 105 - u64 ret = 0; 106 - 107 - local_irq_save(flags); 108 - for (c = counter; c != top; c = c->parent) { 109 - u64 r; 110 - spin_lock(&c->lock); 111 - r = res_counter_uncharge_locked(c, val); 112 - if (c == counter) 113 - ret = r; 114 - spin_unlock(&c->lock); 115 - } 116 - local_irq_restore(flags); 117 - return ret; 118 - } 119 - 120 - u64 res_counter_uncharge(struct res_counter *counter, unsigned long val) 121 - { 122 - return res_counter_uncharge_until(counter, NULL, val); 123 - } 124 - 125 - static inline unsigned long long * 126 - res_counter_member(struct res_counter *counter, int member) 127 - { 128 - switch (member) { 129 - case RES_USAGE: 130 - return &counter->usage; 131 - case RES_MAX_USAGE: 132 - return &counter->max_usage; 133 - case RES_LIMIT: 134 - return &counter->limit; 135 - case RES_FAILCNT: 136 - return &counter->failcnt; 137 - case RES_SOFT_LIMIT: 138 - return &counter->soft_limit; 139 - }; 140 - 141 - BUG(); 142 - return NULL; 143 - } 144 - 145 - ssize_t res_counter_read(struct res_counter *counter, int member, 146 - const char __user *userbuf, size_t nbytes, loff_t *pos, 147 - int (*read_strategy)(unsigned long long val, char *st_buf)) 148 - { 149 - unsigned long long *val; 150 - char buf[64], *s; 151 - 152 - s = buf; 153 - val = 
res_counter_member(counter, member); 154 - if (read_strategy) 155 - s += read_strategy(*val, s); 156 - else 157 - s += sprintf(s, "%llu\n", *val); 158 - return simple_read_from_buffer((void __user *)userbuf, nbytes, 159 - pos, buf, s - buf); 160 - } 161 - 162 - #if BITS_PER_LONG == 32 163 - u64 res_counter_read_u64(struct res_counter *counter, int member) 164 - { 165 - unsigned long flags; 166 - u64 ret; 167 - 168 - spin_lock_irqsave(&counter->lock, flags); 169 - ret = *res_counter_member(counter, member); 170 - spin_unlock_irqrestore(&counter->lock, flags); 171 - 172 - return ret; 173 - } 174 - #else 175 - u64 res_counter_read_u64(struct res_counter *counter, int member) 176 - { 177 - return *res_counter_member(counter, member); 178 - } 179 - #endif 180 - 181 - int res_counter_memparse_write_strategy(const char *buf, 182 - unsigned long long *resp) 183 - { 184 - char *end; 185 - unsigned long long res; 186 - 187 - /* return RES_COUNTER_MAX(unlimited) if "-1" is specified */ 188 - if (*buf == '-') { 189 - int rc = kstrtoull(buf + 1, 10, &res); 190 - 191 - if (rc) 192 - return rc; 193 - if (res != 1) 194 - return -EINVAL; 195 - *resp = RES_COUNTER_MAX; 196 - return 0; 197 - } 198 - 199 - res = memparse(buf, &end); 200 - if (*end != '\0') 201 - return -EINVAL; 202 - 203 - if (PAGE_ALIGN(res) >= res) 204 - res = PAGE_ALIGN(res); 205 - else 206 - res = RES_COUNTER_MAX; 207 - 208 - *resp = res; 209 - 210 - return 0; 211 - }
+3 -1
kernel/sched/core.c
··· 4527 4527 #ifdef CONFIG_DEBUG_STACK_USAGE 4528 4528 free = stack_not_used(p); 4529 4529 #endif 4530 + ppid = 0; 4530 4531 rcu_read_lock(); 4531 - ppid = task_pid_nr(rcu_dereference(p->real_parent)); 4532 + if (pid_alive(p)) 4533 + ppid = task_pid_nr(rcu_dereference(p->real_parent)); 4532 4534 rcu_read_unlock(); 4533 4535 printk(KERN_CONT "%5lu %5d %6d 0x%08lx\n", free, 4534 4536 task_pid_nr(p), ppid,
+9
kernel/sysctl.c
··· 1104 1104 .proc_handler = proc_dointvec, 1105 1105 }, 1106 1106 #endif 1107 + { 1108 + .procname = "panic_on_warn", 1109 + .data = &panic_on_warn, 1110 + .maxlen = sizeof(int), 1111 + .mode = 0644, 1112 + .proc_handler = proc_dointvec_minmax, 1113 + .extra1 = &zero, 1114 + .extra2 = &one, 1115 + }, 1107 1116 { } 1108 1117 }; 1109 1118
+1
kernel/sysctl_binary.c
··· 137 137 { CTL_INT, KERN_COMPAT_LOG, "compat-log" }, 138 138 { CTL_INT, KERN_MAX_LOCK_DEPTH, "max_lock_depth" }, 139 139 { CTL_INT, KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" }, 140 + { CTL_INT, KERN_PANIC_ON_WARN, "panic_on_warn" }, 140 141 {} 141 142 }; 142 143
+28 -15
lib/dma-debug.c
··· 102 102 /* Global disable flag - will be set in case of an error */ 103 103 static u32 global_disable __read_mostly; 104 104 105 + /* Early initialization disable flag, set at the end of dma_debug_init */ 106 + static bool dma_debug_initialized __read_mostly; 107 + 108 + static inline bool dma_debug_disabled(void) 109 + { 110 + return global_disable || !dma_debug_initialized; 111 + } 112 + 105 113 /* Global error count */ 106 114 static u32 error_count; 107 115 ··· 953 945 struct dma_debug_entry *uninitialized_var(entry); 954 946 int count; 955 947 956 - if (global_disable) 948 + if (dma_debug_disabled()) 957 949 return 0; 958 950 959 951 switch (action) { ··· 981 973 { 982 974 struct notifier_block *nb; 983 975 984 - if (global_disable) 976 + if (dma_debug_disabled()) 985 977 return; 986 978 987 979 nb = kzalloc(sizeof(struct notifier_block), GFP_KERNEL); ··· 1002 994 { 1003 995 int i; 1004 996 997 + /* Do not use dma_debug_initialized here, since we really want to be 998 + * called to set dma_debug_initialized 999 + */ 1005 1000 if (global_disable) 1006 1001 return; 1007 1002 ··· 1031 1020 } 1032 1021 1033 1022 nr_total_entries = num_free_entries; 1023 + 1024 + dma_debug_initialized = true; 1034 1025 1035 1026 pr_info("DMA-API: debugging enabled by kernel config\n"); 1036 1027 } ··· 1256 1243 { 1257 1244 struct dma_debug_entry *entry; 1258 1245 1259 - if (unlikely(global_disable)) 1246 + if (unlikely(dma_debug_disabled())) 1260 1247 return; 1261 1248 1262 1249 if (dma_mapping_error(dev, dma_addr)) ··· 1296 1283 struct hash_bucket *bucket; 1297 1284 unsigned long flags; 1298 1285 1299 - if (unlikely(global_disable)) 1286 + if (unlikely(dma_debug_disabled())) 1300 1287 return; 1301 1288 1302 1289 ref.dev = dev; ··· 1338 1325 .direction = direction, 1339 1326 }; 1340 1327 1341 - if (unlikely(global_disable)) 1328 + if (unlikely(dma_debug_disabled())) 1342 1329 return; 1343 1330 1344 1331 if (map_single) ··· 1355 1342 struct scatterlist *s; 1356 1343 int i; 1357 
1344 1358 - if (unlikely(global_disable)) 1345 + if (unlikely(dma_debug_disabled())) 1359 1346 return; 1360 1347 1361 1348 for_each_sg(sg, s, mapped_ents, i) { ··· 1408 1395 struct scatterlist *s; 1409 1396 int mapped_ents = 0, i; 1410 1397 1411 - if (unlikely(global_disable)) 1398 + if (unlikely(dma_debug_disabled())) 1412 1399 return; 1413 1400 1414 1401 for_each_sg(sglist, s, nelems, i) { ··· 1440 1427 { 1441 1428 struct dma_debug_entry *entry; 1442 1429 1443 - if (unlikely(global_disable)) 1430 + if (unlikely(dma_debug_disabled())) 1444 1431 return; 1445 1432 1446 1433 if (unlikely(virt == NULL)) ··· 1475 1462 .direction = DMA_BIDIRECTIONAL, 1476 1463 }; 1477 1464 1478 - if (unlikely(global_disable)) 1465 + if (unlikely(dma_debug_disabled())) 1479 1466 return; 1480 1467 1481 1468 check_unmap(&ref); ··· 1487 1474 { 1488 1475 struct dma_debug_entry ref; 1489 1476 1490 - if (unlikely(global_disable)) 1477 + if (unlikely(dma_debug_disabled())) 1491 1478 return; 1492 1479 1493 1480 ref.type = dma_debug_single; ··· 1507 1494 { 1508 1495 struct dma_debug_entry ref; 1509 1496 1510 - if (unlikely(global_disable)) 1497 + if (unlikely(dma_debug_disabled())) 1511 1498 return; 1512 1499 1513 1500 ref.type = dma_debug_single; ··· 1528 1515 { 1529 1516 struct dma_debug_entry ref; 1530 1517 1531 - if (unlikely(global_disable)) 1518 + if (unlikely(dma_debug_disabled())) 1532 1519 return; 1533 1520 1534 1521 ref.type = dma_debug_single; ··· 1549 1536 { 1550 1537 struct dma_debug_entry ref; 1551 1538 1552 - if (unlikely(global_disable)) 1539 + if (unlikely(dma_debug_disabled())) 1553 1540 return; 1554 1541 1555 1542 ref.type = dma_debug_single; ··· 1569 1556 struct scatterlist *s; 1570 1557 int mapped_ents = 0, i; 1571 1558 1572 - if (unlikely(global_disable)) 1559 + if (unlikely(dma_debug_disabled())) 1573 1560 return; 1574 1561 1575 1562 for_each_sg(sg, s, nelems, i) { ··· 1602 1589 struct scatterlist *s; 1603 1590 int mapped_ents = 0, i; 1604 1591 1605 - if 
(unlikely(global_disable)) 1592 + if (unlikely(dma_debug_disabled())) 1606 1593 return; 1607 1594 1608 1595 for_each_sg(sg, s, nelems, i) {
+2 -2
lib/dynamic_debug.c
··· 576 576 } else { 577 577 char buf[PREFIX_SIZE]; 578 578 579 - dev_printk_emit(7, dev, "%s%s %s: %pV", 579 + dev_printk_emit(LOGLEVEL_DEBUG, dev, "%s%s %s: %pV", 580 580 dynamic_emit_prefix(descriptor, buf), 581 581 dev_driver_string(dev), dev_name(dev), 582 582 &vaf); ··· 605 605 if (dev && dev->dev.parent) { 606 606 char buf[PREFIX_SIZE]; 607 607 608 - dev_printk_emit(7, dev->dev.parent, 608 + dev_printk_emit(LOGLEVEL_DEBUG, dev->dev.parent, 609 609 "%s%s %s %s%s: %pV", 610 610 dynamic_emit_prefix(descriptor, buf), 611 611 dev_driver_string(dev->dev.parent),
+3 -5
lib/lcm.c
··· 7 7 unsigned long lcm(unsigned long a, unsigned long b) 8 8 { 9 9 if (a && b) 10 - return (a * b) / gcd(a, b); 11 - else if (b) 12 - return b; 13 - 14 - return a; 10 + return (a / gcd(a, b)) * b; 11 + else 12 + return 0; 15 13 } 16 14 EXPORT_SYMBOL_GPL(lcm);
+3 -1
mm/Makefile
··· 55 55 obj-$(CONFIG_MIGRATION) += migrate.o 56 56 obj-$(CONFIG_QUICKLIST) += quicklist.o 57 57 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o 58 - obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o 58 + obj-$(CONFIG_PAGE_COUNTER) += page_counter.o 59 + obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o 60 + obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o 59 61 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o 60 62 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o 61 63 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
+13 -1
mm/cma.c
··· 215 215 bool fixed, struct cma **res_cma) 216 216 { 217 217 phys_addr_t memblock_end = memblock_end_of_DRAM(); 218 - phys_addr_t highmem_start = __pa(high_memory); 218 + phys_addr_t highmem_start; 219 219 int ret = 0; 220 220 221 + #ifdef CONFIG_X86 222 + /* 223 + * high_memory isn't direct mapped memory so retrieving its physical 224 + * address isn't appropriate. But it would be useful to check the 225 + * physical address of the highmem boundary so it's justifiable to get 226 + * the physical address from it. On x86 there is a validation check for 227 + * this case, so the following workaround is needed to avoid it. 228 + */ 229 + highmem_start = __pa_nodebug(high_memory); 230 + #else 231 + highmem_start = __pa(high_memory); 232 + #endif 221 233 pr_debug("%s(size %pa, base %pa, limit %pa alignment %pa)\n", 222 234 __func__, &size, &base, &limit, &alignment); 223 235
+93 -46
mm/compaction.c
··· 41 41 static unsigned long release_freepages(struct list_head *freelist) 42 42 { 43 43 struct page *page, *next; 44 - unsigned long count = 0; 44 + unsigned long high_pfn = 0; 45 45 46 46 list_for_each_entry_safe(page, next, freelist, lru) { 47 + unsigned long pfn = page_to_pfn(page); 47 48 list_del(&page->lru); 48 49 __free_page(page); 49 - count++; 50 + if (pfn > high_pfn) 51 + high_pfn = pfn; 50 52 } 51 53 52 - return count; 54 + return high_pfn; 53 55 } 54 56 55 57 static void map_pages(struct list_head *list) ··· 197 195 198 196 /* Update where async and sync compaction should restart */ 199 197 if (migrate_scanner) { 200 - if (cc->finished_update_migrate) 201 - return; 202 198 if (pfn > zone->compact_cached_migrate_pfn[0]) 203 199 zone->compact_cached_migrate_pfn[0] = pfn; 204 200 if (cc->mode != MIGRATE_ASYNC && 205 201 pfn > zone->compact_cached_migrate_pfn[1]) 206 202 zone->compact_cached_migrate_pfn[1] = pfn; 207 203 } else { 208 - if (cc->finished_update_free) 209 - return; 210 204 if (pfn < zone->compact_cached_free_pfn) 211 205 zone->compact_cached_free_pfn = pfn; 212 206 } ··· 713 715 del_page_from_lru_list(page, lruvec, page_lru(page)); 714 716 715 717 isolate_success: 716 - cc->finished_update_migrate = true; 717 718 list_add(&page->lru, migratelist); 718 719 cc->nr_migratepages++; 719 720 nr_isolated++; ··· 884 887 cc->free_pfn = (isolate_start_pfn < block_end_pfn) ? 885 888 isolate_start_pfn : 886 889 block_start_pfn - pageblock_nr_pages; 887 - 888 - /* 889 - * Set a flag that we successfully isolated in this pageblock. 890 - * In the next loop iteration, zone->compact_cached_free_pfn 891 - * will not be updated and thus it will effectively contain the 892 - * highest pageblock we isolated pages from. 
893 - */ 894 - if (isolated) 895 - cc->finished_update_free = true; 896 890 897 891 /* 898 892 * isolate_freepages_block() might have aborted due to async ··· 1074 1086 1075 1087 /* Compaction run is not finished if the watermark is not met */ 1076 1088 watermark = low_wmark_pages(zone); 1077 - watermark += (1 << cc->order); 1078 1089 1079 - if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0)) 1090 + if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx, 1091 + cc->alloc_flags)) 1080 1092 return COMPACT_CONTINUE; 1081 1093 1082 1094 /* Direct compactor: Is a suitable page free? */ ··· 1102 1114 * COMPACT_PARTIAL - If the allocation would succeed without compaction 1103 1115 * COMPACT_CONTINUE - If compaction should run now 1104 1116 */ 1105 - unsigned long compaction_suitable(struct zone *zone, int order) 1117 + unsigned long compaction_suitable(struct zone *zone, int order, 1118 + int alloc_flags, int classzone_idx) 1106 1119 { 1107 1120 int fragindex; 1108 1121 unsigned long watermark; ··· 1115 1126 if (order == -1) 1116 1127 return COMPACT_CONTINUE; 1117 1128 1129 + watermark = low_wmark_pages(zone); 1130 + /* 1131 + * If watermarks for high-order allocation are already met, there 1132 + * should be no need for compaction at all. 1133 + */ 1134 + if (zone_watermark_ok(zone, order, watermark, classzone_idx, 1135 + alloc_flags)) 1136 + return COMPACT_PARTIAL; 1137 + 1118 1138 /* 1119 1139 * Watermarks for order-0 must be met for compaction. Note the 2UL. 
1120 1140 * This is because during migration, copies of pages need to be 1121 1141 * allocated and for a short time, the footprint is higher 1122 1142 */ 1123 - watermark = low_wmark_pages(zone) + (2UL << order); 1124 - if (!zone_watermark_ok(zone, 0, watermark, 0, 0)) 1143 + watermark += (2UL << order); 1144 + if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags)) 1125 1145 return COMPACT_SKIPPED; 1126 1146 1127 1147 /* 1128 1148 * fragmentation index determines if allocation failures are due to 1129 1149 * low memory or external fragmentation 1130 1150 * 1131 - * index of -1000 implies allocations might succeed depending on 1132 - * watermarks 1151 + * index of -1000 would imply allocations might succeed depending on 1152 + * watermarks, but we already failed the high-order watermark check 1133 1153 * index towards 0 implies failure is due to lack of memory 1134 1154 * index towards 1000 implies failure is due to fragmentation 1135 1155 * ··· 1147 1149 fragindex = fragmentation_index(zone, order); 1148 1150 if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold) 1149 1151 return COMPACT_SKIPPED; 1150 - 1151 - if (fragindex == -1000 && zone_watermark_ok(zone, order, watermark, 1152 - 0, 0)) 1153 - return COMPACT_PARTIAL; 1154 1152 1155 1153 return COMPACT_CONTINUE; 1156 1154 } ··· 1158 1164 unsigned long end_pfn = zone_end_pfn(zone); 1159 1165 const int migratetype = gfpflags_to_migratetype(cc->gfp_mask); 1160 1166 const bool sync = cc->mode != MIGRATE_ASYNC; 1167 + unsigned long last_migrated_pfn = 0; 1161 1168 1162 - ret = compaction_suitable(zone, cc->order); 1169 + ret = compaction_suitable(zone, cc->order, cc->alloc_flags, 1170 + cc->classzone_idx); 1163 1171 switch (ret) { 1164 1172 case COMPACT_PARTIAL: 1165 1173 case COMPACT_SKIPPED: ··· 1204 1208 while ((ret = compact_finished(zone, cc, migratetype)) == 1205 1209 COMPACT_CONTINUE) { 1206 1210 int err; 1211 + unsigned long isolate_start_pfn = cc->migrate_pfn; 1207 1212 1208 1213 
switch (isolate_migratepages(zone, cc)) { 1209 1214 case ISOLATE_ABORT: ··· 1213 1216 cc->nr_migratepages = 0; 1214 1217 goto out; 1215 1218 case ISOLATE_NONE: 1216 - continue; 1219 + /* 1220 + * We haven't isolated and migrated anything, but 1221 + * there might still be unflushed migrations from 1222 + * previous cc->order aligned block. 1223 + */ 1224 + goto check_drain; 1217 1225 case ISOLATE_SUCCESS: 1218 1226 ; 1219 1227 } ··· 1243 1241 goto out; 1244 1242 } 1245 1243 } 1244 + 1245 + /* 1246 + * Record where we could have freed pages by migration and not 1247 + * yet flushed them to buddy allocator. We use the pfn that 1248 + * isolate_migratepages() started from in this loop iteration 1249 + * - this is the lowest page that could have been isolated and 1250 + * then freed by migration. 1251 + */ 1252 + if (!last_migrated_pfn) 1253 + last_migrated_pfn = isolate_start_pfn; 1254 + 1255 + check_drain: 1256 + /* 1257 + * Has the migration scanner moved away from the previous 1258 + * cc->order aligned block where we migrated from? If yes, 1259 + * flush the pages that were freed, so that they can merge and 1260 + * compact_finished() can detect immediately if allocation 1261 + * would succeed. 
1262 + */ 1263 + if (cc->order > 0 && last_migrated_pfn) { 1264 + int cpu; 1265 + unsigned long current_block_start = 1266 + cc->migrate_pfn & ~((1UL << cc->order) - 1); 1267 + 1268 + if (last_migrated_pfn < current_block_start) { 1269 + cpu = get_cpu(); 1270 + lru_add_drain_cpu(cpu); 1271 + drain_local_pages(zone); 1272 + put_cpu(); 1273 + /* No more flushing until we migrate again */ 1274 + last_migrated_pfn = 0; 1275 + } 1276 + } 1277 + 1246 1278 } 1247 1279 1248 1280 out: 1249 - /* Release free pages and check accounting */ 1250 - cc->nr_freepages -= release_freepages(&cc->freepages); 1251 - VM_BUG_ON(cc->nr_freepages != 0); 1281 + /* 1282 + * Release free pages and update where the free scanner should restart, 1283 + * so we don't leave any returned pages behind in the next attempt. 1284 + */ 1285 + if (cc->nr_freepages > 0) { 1286 + unsigned long free_pfn = release_freepages(&cc->freepages); 1287 + 1288 + cc->nr_freepages = 0; 1289 + VM_BUG_ON(free_pfn == 0); 1290 + /* The cached pfn is always the first in a pageblock */ 1291 + free_pfn &= ~(pageblock_nr_pages-1); 1292 + /* 1293 + * Only go back, not forward. 
The cached pfn might have been 1294 + * already reset to zone end in compact_finished() 1295 + */ 1296 + if (free_pfn > zone->compact_cached_free_pfn) 1297 + zone->compact_cached_free_pfn = free_pfn; 1298 + } 1252 1299 1253 1300 trace_mm_compaction_end(ret); 1254 1301 ··· 1305 1254 } 1306 1255 1307 1256 static unsigned long compact_zone_order(struct zone *zone, int order, 1308 - gfp_t gfp_mask, enum migrate_mode mode, int *contended) 1257 + gfp_t gfp_mask, enum migrate_mode mode, int *contended, 1258 + int alloc_flags, int classzone_idx) 1309 1259 { 1310 1260 unsigned long ret; 1311 1261 struct compact_control cc = { ··· 1316 1264 .gfp_mask = gfp_mask, 1317 1265 .zone = zone, 1318 1266 .mode = mode, 1267 + .alloc_flags = alloc_flags, 1268 + .classzone_idx = classzone_idx, 1319 1269 }; 1320 1270 INIT_LIST_HEAD(&cc.freepages); 1321 1271 INIT_LIST_HEAD(&cc.migratepages); ··· 1342 1288 * @mode: The migration mode for async, sync light, or sync migration 1343 1289 * @contended: Return value that determines if compaction was aborted due to 1344 1290 * need_resched() or lock contention 1345 - * @candidate_zone: Return the zone where we think allocation should succeed 1346 1291 * 1347 1292 * This is the main entry point for direct page compaction. 
1348 1293 */ 1349 1294 unsigned long try_to_compact_pages(struct zonelist *zonelist, 1350 1295 int order, gfp_t gfp_mask, nodemask_t *nodemask, 1351 1296 enum migrate_mode mode, int *contended, 1352 - struct zone **candidate_zone) 1297 + int alloc_flags, int classzone_idx) 1353 1298 { 1354 1299 enum zone_type high_zoneidx = gfp_zone(gfp_mask); 1355 1300 int may_enter_fs = gfp_mask & __GFP_FS; ··· 1356 1303 struct zoneref *z; 1357 1304 struct zone *zone; 1358 1305 int rc = COMPACT_DEFERRED; 1359 - int alloc_flags = 0; 1360 1306 int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */ 1361 1307 1362 1308 *contended = COMPACT_CONTENDED_NONE; ··· 1364 1312 if (!order || !may_enter_fs || !may_perform_io) 1365 1313 return COMPACT_SKIPPED; 1366 1314 1367 - #ifdef CONFIG_CMA 1368 - if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE) 1369 - alloc_flags |= ALLOC_CMA; 1370 - #endif 1371 1315 /* Compact each zone in the list */ 1372 1316 for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, 1373 1317 nodemask) { ··· 1374 1326 continue; 1375 1327 1376 1328 status = compact_zone_order(zone, order, gfp_mask, mode, 1377 - &zone_contended); 1329 + &zone_contended, alloc_flags, classzone_idx); 1378 1330 rc = max(status, rc); 1379 1331 /* 1380 1332 * It takes at least one zone that wasn't lock contended ··· 1383 1335 all_zones_contended &= zone_contended; 1384 1336 1385 1337 /* If a normal allocation would succeed, stop compacting */ 1386 - if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 1387 - alloc_flags)) { 1388 - *candidate_zone = zone; 1338 + if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 1339 + classzone_idx, alloc_flags)) { 1389 1340 /* 1390 1341 * We think the allocation will succeed in this zone, 1391 1342 * but it is not certain, hence the false. 
The caller ··· 1406 1359 goto break_loop; 1407 1360 } 1408 1361 1409 - if (mode != MIGRATE_ASYNC) { 1362 + if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) { 1410 1363 /* 1411 1364 * We think that allocation won't succeed in this zone 1412 1365 * so we defer compaction there. If it ends up
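The compaction changes above alter `release_freepages()` to return the highest released pfn rather than a count, so the exit path can rewind `zone->compact_cached_free_pfn` instead of relying on the removed `finished_update_free` flag. A minimal userspace sketch of that contract (not the kernel code; `PAGEBLOCK_NR_PAGES` is an illustrative constant here):

```c
#include <stddef.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* assumption: order-9 pageblocks */

/* Return the highest pfn released instead of a count, so the caller can
 * rewind the cached free-scanner position to that page's pageblock and
 * not leave returned pages behind on the next compaction attempt. */
unsigned long release_free_pfns(const unsigned long *pfns, size_t n)
{
	unsigned long high_pfn = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* __free_page(pfn_to_page(pfns[i])) would go here */
		if (pfns[i] > high_pfn)
			high_pfn = pfns[i];
	}
	return high_pfn;
}

/* The cached pfn is always the first in a pageblock */
unsigned long cached_free_pfn(unsigned long high_pfn)
{
	return high_pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}
```

With this, the `out:` path in `compact_zone()` only moves the cached pfn backwards, matching the "Only go back, not forward" comment in the diff.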
+4 -1
mm/debug.c
··· 95 95 dump_flags(page->flags & badflags, 96 96 pageflag_names, ARRAY_SIZE(pageflag_names)); 97 97 } 98 - mem_cgroup_print_bad_page(page); 98 + #ifdef CONFIG_MEMCG 99 + if (page->mem_cgroup) 100 + pr_alert("page->mem_cgroup:%p\n", page->mem_cgroup); 101 + #endif 99 102 } 100 103 101 104 void dump_page(struct page *page, const char *reason)
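The debug.c hunk replaces `mem_cgroup_print_bad_page()` with a direct read of `page->mem_cgroup`, which this series moves into `struct page` itself. A reduced sketch of the new dump behavior (`struct page` is cut down to the one field of interest, and a buffer stands in for `pr_alert()`):

```c
#include <stdio.h>

struct page { void *mem_cgroup; };	/* reduced for illustration */

/* With the per-page struct page_cgroup gone, bad-page dumps read the
 * memcg pointer straight off struct page. Prints nothing for an
 * uncharged page, mirroring the `if (page->mem_cgroup)` guard above. */
int format_memcg_line(char *buf, size_t len, const struct page *page)
{
	if (!page->mem_cgroup)
		return 0;	/* uncharged: nothing to report */
	return snprintf(buf, len, "page->mem_cgroup:%p", page->mem_cgroup);
}
```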
+1 -1
mm/frontswap.c
··· 182 182 if (frontswap_ops) 183 183 frontswap_ops->init(type); 184 184 else { 185 - BUG_ON(type > MAX_SWAPFILES); 185 + BUG_ON(type >= MAX_SWAPFILES); 186 186 set_bit(type, need_init); 187 187 } 188 188 }
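The one-character frontswap fix is a classic off-by-one: valid swap types run `0..MAX_SWAPFILES-1`, so the old `BUG_ON(type > MAX_SWAPFILES)` let `type == MAX_SWAPFILES` through, one past the end of the array. A sketch of both predicates (`MAX_SWAPFILES` is given an illustrative value; the real one derives from `MAX_SWAPFILES_SHIFT`):

```c
#include <stdbool.h>

#define MAX_SWAPFILES 32	/* illustrative value */

/* The fixed check: reject the first out-of-range index */
bool swap_type_valid(unsigned int type)
{
	return type < MAX_SWAPFILES;
}

/* The old, buggy predicate, kept for contrast: accepts one index
 * past the end of the need_init bitmap */
bool swap_type_valid_old(unsigned int type)
{
	return type <= MAX_SWAPFILES;
}
```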
-1
mm/huge_memory.c
··· 784 784 if (!pmd_none(*pmd)) 785 785 return false; 786 786 entry = mk_pmd(zero_page, vma->vm_page_prot); 787 - entry = pmd_wrprotect(entry); 788 787 entry = pmd_mkhuge(entry); 789 788 pgtable_trans_huge_deposit(mm, pmd, pgtable); 790 789 set_pmd_at(mm, haddr, pmd, entry);
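The huge_memory.c hunk drops the explicit `pmd_wrprotect()` when installing the huge zero page. The likely reasoning: the zero page is only mapped into private anonymous areas, whose `vm_page_prot` already omits the write bit (so COW can work), making the extra step a no-op. A toy bit-level model of that argument, with made-up flag values rather than the real arch definitions:

```c
#include <stdbool.h>

/* Illustrative bit values, not the real x86 ones */
#define PROT_PRESENT	0x1UL
#define PROT_WRITE	0x2UL
#define PROT_HUGE	0x80UL

typedef unsigned long pmdval_t;

/* mk_pmd() merges the pfn with the vma's protection bits, and
 * pmd_mkhuge() sets the huge bit */
pmdval_t mk_huge_pmd(unsigned long pfn, pmdval_t prot)
{
	return (pfn << 12) | prot | PROT_HUGE;
}

bool pmd_write_bit(pmdval_t pmd)
{
	return pmd & PROT_WRITE;
}

pmdval_t pmd_wrprotect_bit(pmdval_t pmd)
{
	return pmd & ~PROT_WRITE;
}
```

If `prot` never carries the write bit, `pmd_wrprotect_bit()` returns its input unchanged, which is the redundancy the patch removes.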
+3 -1
mm/hugetlb.c
··· 2638 2638 2639 2639 tlb_start_vma(tlb, vma); 2640 2640 mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); 2641 + address = start; 2641 2642 again: 2642 - for (address = start; address < end; address += sz) { 2643 + for (; address < end; address += sz) { 2643 2644 ptep = huge_pte_offset(mm, address); 2644 2645 if (!ptep) 2645 2646 continue; ··· 2687 2686 page_remove_rmap(page); 2688 2687 force_flush = !__tlb_remove_page(tlb, page); 2689 2688 if (force_flush) { 2689 + address += sz; 2690 2690 spin_unlock(ptl); 2691 2691 break; 2692 2692 }
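The hugetlb.c change hoists `address = start` out of the loop and advances `address += sz` before the force-flush break, so the retry at `again:` resumes after the page that triggered the flush instead of rescanning from `start` and revisiting already-unmapped pages. A self-contained sketch of that restart logic, with a `goto` standing in for the unlock-and-retry:

```c
#define SZ 4UL	/* illustrative huge-page step */

/* Walk [start, end) in SZ steps; when `flush_at` forces us to stop
 * mid-walk, resume from the *next* address. Before the fix, the loop
 * header reset `address = start` on every retry. */
unsigned long unmap_range(unsigned long start, unsigned long end,
			  unsigned long flush_at, unsigned int *visits)
{
	unsigned long address = start;	/* set once, survives the retry */
again:
	for (; address < end; address += SZ) {
		visits[(address - start) / SZ]++;
		if (address == flush_at) {
			address += SZ;		/* don't revisit this page */
			flush_at = (unsigned long)-1;
			goto again;		/* stands in for unlock+retry */
		}
	}
	return address;
}
```

The test below checks the property the fix restores: every page is visited exactly once even across a forced restart.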
+60 -47
mm/hugetlb_cgroup.c
··· 14 14 */ 15 15 16 16 #include <linux/cgroup.h> 17 + #include <linux/page_counter.h> 17 18 #include <linux/slab.h> 18 19 #include <linux/hugetlb.h> 19 20 #include <linux/hugetlb_cgroup.h> ··· 24 23 /* 25 24 * the counter to account for hugepages from hugetlb. 26 25 */ 27 - struct res_counter hugepage[HUGE_MAX_HSTATE]; 26 + struct page_counter hugepage[HUGE_MAX_HSTATE]; 28 27 }; 29 28 30 29 #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) ··· 61 60 int idx; 62 61 63 62 for (idx = 0; idx < hugetlb_max_hstate; idx++) { 64 - if ((res_counter_read_u64(&h_cg->hugepage[idx], RES_USAGE)) > 0) 63 + if (page_counter_read(&h_cg->hugepage[idx])) 65 64 return true; 66 65 } 67 66 return false; ··· 80 79 81 80 if (parent_h_cgroup) { 82 81 for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) 83 - res_counter_init(&h_cgroup->hugepage[idx], 84 - &parent_h_cgroup->hugepage[idx]); 82 + page_counter_init(&h_cgroup->hugepage[idx], 83 + &parent_h_cgroup->hugepage[idx]); 85 84 } else { 86 85 root_h_cgroup = h_cgroup; 87 86 for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) 88 - res_counter_init(&h_cgroup->hugepage[idx], NULL); 87 + page_counter_init(&h_cgroup->hugepage[idx], NULL); 89 88 } 90 89 return &h_cgroup->css; 91 90 } ··· 109 108 static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, 110 109 struct page *page) 111 110 { 112 - int csize; 113 - struct res_counter *counter; 114 - struct res_counter *fail_res; 111 + unsigned int nr_pages; 112 + struct page_counter *counter; 115 113 struct hugetlb_cgroup *page_hcg; 116 114 struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); 117 115 ··· 123 123 if (!page_hcg || page_hcg != h_cg) 124 124 goto out; 125 125 126 - csize = PAGE_SIZE << compound_order(page); 126 + nr_pages = 1 << compound_order(page); 127 127 if (!parent) { 128 128 parent = root_h_cgroup; 129 129 /* root has no limit */ 130 - res_counter_charge_nofail(&parent->hugepage[idx], 131 - csize, &fail_res); 130 + page_counter_charge(&parent->hugepage[idx], 
nr_pages); 132 131 } 133 132 counter = &h_cg->hugepage[idx]; 134 - res_counter_uncharge_until(counter, counter->parent, csize); 133 + /* Take the pages off the local counter */ 134 + page_counter_cancel(counter, nr_pages); 135 135 136 136 set_hugetlb_cgroup(page, parent); 137 137 out: ··· 166 166 struct hugetlb_cgroup **ptr) 167 167 { 168 168 int ret = 0; 169 - struct res_counter *fail_res; 169 + struct page_counter *counter; 170 170 struct hugetlb_cgroup *h_cg = NULL; 171 - unsigned long csize = nr_pages * PAGE_SIZE; 172 171 173 172 if (hugetlb_cgroup_disabled()) 174 173 goto done; ··· 186 187 } 187 188 rcu_read_unlock(); 188 189 189 - ret = res_counter_charge(&h_cg->hugepage[idx], csize, &fail_res); 190 + ret = page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter); 190 191 css_put(&h_cg->css); 191 192 done: 192 193 *ptr = h_cg; ··· 212 213 struct page *page) 213 214 { 214 215 struct hugetlb_cgroup *h_cg; 215 - unsigned long csize = nr_pages * PAGE_SIZE; 216 216 217 217 if (hugetlb_cgroup_disabled()) 218 218 return; ··· 220 222 if (unlikely(!h_cg)) 221 223 return; 222 224 set_hugetlb_cgroup(page, NULL); 223 - res_counter_uncharge(&h_cg->hugepage[idx], csize); 225 + page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); 224 226 return; 225 227 } 226 228 227 229 void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, 228 230 struct hugetlb_cgroup *h_cg) 229 231 { 230 - unsigned long csize = nr_pages * PAGE_SIZE; 231 - 232 232 if (hugetlb_cgroup_disabled() || !h_cg) 233 233 return; 234 234 235 235 if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) 236 236 return; 237 237 238 - res_counter_uncharge(&h_cg->hugepage[idx], csize); 238 + page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); 239 239 return; 240 240 } 241 + 242 + enum { 243 + RES_USAGE, 244 + RES_LIMIT, 245 + RES_MAX_USAGE, 246 + RES_FAILCNT, 247 + }; 241 248 242 249 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, 243 250 struct cftype *cft) 244 251 
{ 245 - int idx, name; 252 + struct page_counter *counter; 246 253 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css); 247 254 248 - idx = MEMFILE_IDX(cft->private); 249 - name = MEMFILE_ATTR(cft->private); 255 + counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)]; 250 256 251 - return res_counter_read_u64(&h_cg->hugepage[idx], name); 257 + switch (MEMFILE_ATTR(cft->private)) { 258 + case RES_USAGE: 259 + return (u64)page_counter_read(counter) * PAGE_SIZE; 260 + case RES_LIMIT: 261 + return (u64)counter->limit * PAGE_SIZE; 262 + case RES_MAX_USAGE: 263 + return (u64)counter->watermark * PAGE_SIZE; 264 + case RES_FAILCNT: 265 + return counter->failcnt; 266 + default: 267 + BUG(); 268 + } 252 269 } 270 + 271 + static DEFINE_MUTEX(hugetlb_limit_mutex); 253 272 254 273 static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, 255 274 char *buf, size_t nbytes, loff_t off) 256 275 { 257 - int idx, name, ret; 258 - unsigned long long val; 276 + int ret, idx; 277 + unsigned long nr_pages; 259 278 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); 260 279 261 - buf = strstrip(buf); 262 - idx = MEMFILE_IDX(of_cft(of)->private); 263 - name = MEMFILE_ATTR(of_cft(of)->private); 280 + if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */ 281 + return -EINVAL; 264 282 265 - switch (name) { 283 + buf = strstrip(buf); 284 + ret = page_counter_memparse(buf, &nr_pages); 285 + if (ret) 286 + return ret; 287 + 288 + idx = MEMFILE_IDX(of_cft(of)->private); 289 + 290 + switch (MEMFILE_ATTR(of_cft(of)->private)) { 266 291 case RES_LIMIT: 267 - if (hugetlb_cgroup_is_root(h_cg)) { 268 - /* Can't set limit on root */ 269 - ret = -EINVAL; 270 - break; 271 - } 272 - /* This function does all necessary parse...reuse it */ 273 - ret = res_counter_memparse_write_strategy(buf, &val); 274 - if (ret) 275 - break; 276 - val = ALIGN(val, 1ULL << huge_page_shift(&hstates[idx])); 277 - ret = res_counter_set_limit(&h_cg->hugepage[idx], val); 292 + 
mutex_lock(&hugetlb_limit_mutex); 293 + ret = page_counter_limit(&h_cg->hugepage[idx], nr_pages); 294 + mutex_unlock(&hugetlb_limit_mutex); 278 295 break; 279 296 default: 280 297 ret = -EINVAL; ··· 301 288 static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, 302 289 char *buf, size_t nbytes, loff_t off) 303 290 { 304 - int idx, name, ret = 0; 291 + int ret = 0; 292 + struct page_counter *counter; 305 293 struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of)); 306 294 307 - idx = MEMFILE_IDX(of_cft(of)->private); 308 - name = MEMFILE_ATTR(of_cft(of)->private); 295 + counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)]; 309 296 310 - switch (name) { 297 + switch (MEMFILE_ATTR(of_cft(of)->private)) { 311 298 case RES_MAX_USAGE: 312 - res_counter_reset_max(&h_cg->hugepage[idx]); 299 + page_counter_reset_watermark(counter); 313 300 break; 314 301 case RES_FAILCNT: 315 - res_counter_reset_failcnt(&h_cg->hugepage[idx]); 302 + counter->failcnt = 0; 316 303 break; 317 304 default: 318 305 ret = -EINVAL;
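The hugetlb_cgroup conversion swaps `res_counter` (which counted bytes) for the new `page_counter`, which counts pages and only multiplies by `PAGE_SIZE` on the read side for the cgroup files. A minimal single-level sketch of that API; the signatures are simplified (the kernel version is hierarchical, returns 0/-ENOMEM, and reports which ancestor counter failed):

```c
#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* Pages, not bytes; field names follow the diff. Locking omitted. */
struct page_counter {
	unsigned long count;
	unsigned long limit;
	unsigned long watermark;	/* feeds max_usage_in_bytes */
	unsigned long failcnt;
};

bool page_counter_try_charge_sketch(struct page_counter *c,
				    unsigned long nr_pages)
{
	if (c->count + nr_pages > c->limit) {
		c->failcnt++;
		return false;
	}
	c->count += nr_pages;
	if (c->count > c->watermark)
		c->watermark = c->count;
	return true;
}

void page_counter_uncharge_sketch(struct page_counter *c,
				  unsigned long nr_pages)
{
	c->count -= nr_pages;
}

/* What hugetlb_cgroup_read_u64() now reports for usage_in_bytes */
unsigned long long usage_in_bytes(const struct page_counter *c)
{
	return (unsigned long long)c->count * PAGE_SIZE;
}
```

This page-based accounting is why the write path above gains `page_counter_memparse()` plus a `hugetlb_limit_mutex` rather than the old `res_counter_set_limit()`.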
+2 -5
mm/internal.h
··· 161 161 unsigned long migrate_pfn; /* isolate_migratepages search base */ 162 162 enum migrate_mode mode; /* Async or sync migration mode */ 163 163 bool ignore_skip_hint; /* Scan blocks even if marked skip */ 164 - bool finished_update_free; /* True when the zone cached pfns are 165 - * no longer being updated 166 - */ 167 - bool finished_update_migrate; 168 - 169 164 int order; /* order a direct compactor needs */ 170 165 const gfp_t gfp_mask; /* gfp mask of a direct compactor */ 166 + const int alloc_flags; /* alloc flags of a direct compactor */ 167 + const int classzone_idx; /* zone index of a direct compactor */ 171 168 struct zone *zone; 172 169 int contended; /* Signal need_sched() or lock 173 170 * contention detected during
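The new `alloc_flags`/`classzone_idx` fields in `compact_control` exist so compaction judges success with the same watermark check the allocator will apply, instead of the hardcoded zeroes removed from compaction.c above. A simplified model of why the index matters (real `zone_watermark_ok()` also handles `alloc_flags` and per-order free counts; this keeps only the lowmem-reserve part):

```c
#include <stdbool.h>

#define MAX_NR_ZONES 4

struct zone_sketch {
	unsigned long free_pages;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/* The reserve subtracted from free pages depends on the allocation's
 * classzone_idx. If compaction checks with index 0 while the allocator
 * uses a higher index, compact_finished() can declare success for an
 * allocation that will still fail its own watermark check. */
bool watermark_ok(const struct zone_sketch *z, unsigned long mark,
		  int classzone_idx)
{
	return z->free_pages >= mark + z->lowmem_reserve[classzone_idx];
}
```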
+510 -1206
mm/memcontrol.c
··· 25 25 * GNU General Public License for more details. 26 26 */ 27 27 28 - #include <linux/res_counter.h> 28 + #include <linux/page_counter.h> 29 29 #include <linux/memcontrol.h> 30 30 #include <linux/cgroup.h> 31 31 #include <linux/mm.h> ··· 51 51 #include <linux/seq_file.h> 52 52 #include <linux/vmpressure.h> 53 53 #include <linux/mm_inline.h> 54 - #include <linux/page_cgroup.h> 54 + #include <linux/swap_cgroup.h> 55 55 #include <linux/cpu.h> 56 56 #include <linux/oom.h> 57 57 #include <linux/lockdep.h> ··· 143 143 unsigned long targets[MEM_CGROUP_NTARGETS]; 144 144 }; 145 145 146 - struct mem_cgroup_reclaim_iter { 147 - /* 148 - * last scanned hierarchy member. Valid only if last_dead_count 149 - * matches memcg->dead_count of the hierarchy root group. 150 - */ 151 - struct mem_cgroup *last_visited; 152 - int last_dead_count; 153 - 146 + struct reclaim_iter { 147 + struct mem_cgroup *position; 154 148 /* scan generation, increased every round-trip */ 155 149 unsigned int generation; 156 150 }; ··· 156 162 struct lruvec lruvec; 157 163 unsigned long lru_size[NR_LRU_LISTS]; 158 164 159 - struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1]; 165 + struct reclaim_iter iter[DEF_PRIORITY + 1]; 160 166 161 167 struct rb_node tree_node; /* RB tree node */ 162 - unsigned long long usage_in_excess;/* Set to the value by which */ 168 + unsigned long usage_in_excess;/* Set to the value by which */ 163 169 /* the soft limit is exceeded*/ 164 170 bool on_tree; 165 171 struct mem_cgroup *memcg; /* Back pointer, we cannot */ ··· 192 198 193 199 struct mem_cgroup_threshold { 194 200 struct eventfd_ctx *eventfd; 195 - u64 threshold; 201 + unsigned long threshold; 196 202 }; 197 203 198 204 /* For threshold */ ··· 278 284 */ 279 285 struct mem_cgroup { 280 286 struct cgroup_subsys_state css; 281 - /* 282 - * the counter to account for memory usage 283 - */ 284 - struct res_counter res; 287 + 288 + /* Accounted resources */ 289 + struct page_counter memory; 290 + struct 
page_counter memsw; 291 + struct page_counter kmem; 292 + 293 + unsigned long soft_limit; 285 294 286 295 /* vmpressure notifications */ 287 296 struct vmpressure vmpressure; ··· 292 295 /* css_online() has been completed */ 293 296 int initialized; 294 297 295 - /* 296 - * the counter to account for mem+swap usage. 297 - */ 298 - struct res_counter memsw; 299 - 300 - /* 301 - * the counter to account for kernel memory usage. 302 - */ 303 - struct res_counter kmem; 304 298 /* 305 299 * Should the accounting and control be hierarchical, per subtree? 306 300 */ ··· 340 352 struct mem_cgroup_stat_cpu nocpu_base; 341 353 spinlock_t pcp_counter_lock; 342 354 343 - atomic_t dead_count; 344 355 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) 345 356 struct cg_proto tcp_mem; 346 357 #endif ··· 369 382 /* internal only representation about the status of kmem accounting. */ 370 383 enum { 371 384 KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */ 372 - KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */ 373 385 }; 374 386 375 387 #ifdef CONFIG_MEMCG_KMEM ··· 382 396 return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags); 383 397 } 384 398 385 - static void memcg_kmem_mark_dead(struct mem_cgroup *memcg) 386 - { 387 - /* 388 - * Our caller must use css_get() first, because memcg_uncharge_kmem() 389 - * will call css_put() if it sees the memcg is dead. 390 - */ 391 - smp_wmb(); 392 - if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags)) 393 - set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags); 394 - } 395 - 396 - static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg) 397 - { 398 - return test_and_clear_bit(KMEM_ACCOUNTED_DEAD, 399 - &memcg->kmem_account_flags); 400 - } 401 399 #endif 402 400 403 401 /* Stuffs for move charges at task migration. 
*/ ··· 620 650 * This check can't live in kmem destruction function, 621 651 * since the charges will outlive the cgroup 622 652 */ 623 - WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0); 653 + WARN_ON(page_counter_read(&memcg->kmem)); 624 654 } 625 655 #else 626 656 static void disarm_kmem_keys(struct mem_cgroup *memcg) ··· 633 663 disarm_sock_keys(memcg); 634 664 disarm_kmem_keys(memcg); 635 665 } 636 - 637 - static void drain_all_stock_async(struct mem_cgroup *memcg); 638 666 639 667 static struct mem_cgroup_per_zone * 640 668 mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) ··· 674 706 675 707 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_zone *mz, 676 708 struct mem_cgroup_tree_per_zone *mctz, 677 - unsigned long long new_usage_in_excess) 709 + unsigned long new_usage_in_excess) 678 710 { 679 711 struct rb_node **p = &mctz->rb_root.rb_node; 680 712 struct rb_node *parent = NULL; ··· 723 755 spin_unlock_irqrestore(&mctz->lock, flags); 724 756 } 725 757 758 + static unsigned long soft_limit_excess(struct mem_cgroup *memcg) 759 + { 760 + unsigned long nr_pages = page_counter_read(&memcg->memory); 761 + unsigned long soft_limit = ACCESS_ONCE(memcg->soft_limit); 762 + unsigned long excess = 0; 763 + 764 + if (nr_pages > soft_limit) 765 + excess = nr_pages - soft_limit; 766 + 767 + return excess; 768 + } 726 769 727 770 static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) 728 771 { 729 - unsigned long long excess; 772 + unsigned long excess; 730 773 struct mem_cgroup_per_zone *mz; 731 774 struct mem_cgroup_tree_per_zone *mctz; 732 775 ··· 748 769 */ 749 770 for (; memcg; memcg = parent_mem_cgroup(memcg)) { 750 771 mz = mem_cgroup_page_zoneinfo(memcg, page); 751 - excess = res_counter_soft_limit_excess(&memcg->res); 772 + excess = soft_limit_excess(memcg); 752 773 /* 753 774 * We have to update the tree if mz is on RB-tree or 754 775 * mem is over its softlimit. 
··· 804 825 * position in the tree. 805 826 */ 806 827 __mem_cgroup_remove_exceeded(mz, mctz); 807 - if (!res_counter_soft_limit_excess(&mz->memcg->res) || 828 + if (!soft_limit_excess(mz->memcg) || 808 829 !css_tryget_online(&mz->memcg->css)) 809 830 goto retry; 810 831 done: ··· 1041 1062 return memcg; 1042 1063 } 1043 1064 1044 - /* 1045 - * Returns a next (in a pre-order walk) alive memcg (with elevated css 1046 - * ref. count) or NULL if the whole root's subtree has been visited. 1047 - * 1048 - * helper function to be used by mem_cgroup_iter 1049 - */ 1050 - static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root, 1051 - struct mem_cgroup *last_visited) 1052 - { 1053 - struct cgroup_subsys_state *prev_css, *next_css; 1054 - 1055 - prev_css = last_visited ? &last_visited->css : NULL; 1056 - skip_node: 1057 - next_css = css_next_descendant_pre(prev_css, &root->css); 1058 - 1059 - /* 1060 - * Even if we found a group we have to make sure it is 1061 - * alive. css && !memcg means that the groups should be 1062 - * skipped and we should continue the tree walk. 1063 - * last_visited css is safe to use because it is 1064 - * protected by css_get and the tree walk is rcu safe. 1065 - * 1066 - * We do not take a reference on the root of the tree walk 1067 - * because we might race with the root removal when it would 1068 - * be the only node in the iterated hierarchy and mem_cgroup_iter 1069 - * would end up in an endless loop because it expects that at 1070 - * least one valid node will be returned. Root cannot disappear 1071 - * because caller of the iterator should hold it already so 1072 - * skipping css reference should be safe. 
1073 - */ 1074 - if (next_css) { 1075 - struct mem_cgroup *memcg = mem_cgroup_from_css(next_css); 1076 - 1077 - if (next_css == &root->css) 1078 - return memcg; 1079 - 1080 - if (css_tryget_online(next_css)) { 1081 - /* 1082 - * Make sure the memcg is initialized: 1083 - * mem_cgroup_css_online() orders the the 1084 - * initialization against setting the flag. 1085 - */ 1086 - if (smp_load_acquire(&memcg->initialized)) 1087 - return memcg; 1088 - css_put(next_css); 1089 - } 1090 - 1091 - prev_css = next_css; 1092 - goto skip_node; 1093 - } 1094 - 1095 - return NULL; 1096 - } 1097 - 1098 - static void mem_cgroup_iter_invalidate(struct mem_cgroup *root) 1099 - { 1100 - /* 1101 - * When a group in the hierarchy below root is destroyed, the 1102 - * hierarchy iterator can no longer be trusted since it might 1103 - * have pointed to the destroyed group. Invalidate it. 1104 - */ 1105 - atomic_inc(&root->dead_count); 1106 - } 1107 - 1108 - static struct mem_cgroup * 1109 - mem_cgroup_iter_load(struct mem_cgroup_reclaim_iter *iter, 1110 - struct mem_cgroup *root, 1111 - int *sequence) 1112 - { 1113 - struct mem_cgroup *position = NULL; 1114 - /* 1115 - * A cgroup destruction happens in two stages: offlining and 1116 - * release. They are separated by a RCU grace period. 1117 - * 1118 - * If the iterator is valid, we may still race with an 1119 - * offlining. The RCU lock ensures the object won't be 1120 - * released, tryget will fail if we lost the race. 1121 - */ 1122 - *sequence = atomic_read(&root->dead_count); 1123 - if (iter->last_dead_count == *sequence) { 1124 - smp_rmb(); 1125 - position = iter->last_visited; 1126 - 1127 - /* 1128 - * We cannot take a reference to root because we might race 1129 - * with root removal and returning NULL would end up in 1130 - * an endless loop on the iterator user level when root 1131 - * would be returned all the time. 
1132 - */ 1133 - if (position && position != root && 1134 - !css_tryget_online(&position->css)) 1135 - position = NULL; 1136 - } 1137 - return position; 1138 - } 1139 - 1140 - static void mem_cgroup_iter_update(struct mem_cgroup_reclaim_iter *iter, 1141 - struct mem_cgroup *last_visited, 1142 - struct mem_cgroup *new_position, 1143 - struct mem_cgroup *root, 1144 - int sequence) 1145 - { 1146 - /* root reference counting symmetric to mem_cgroup_iter_load */ 1147 - if (last_visited && last_visited != root) 1148 - css_put(&last_visited->css); 1149 - /* 1150 - * We store the sequence count from the time @last_visited was 1151 - * loaded successfully instead of rereading it here so that we 1152 - * don't lose destruction events in between. We could have 1153 - * raced with the destruction of @new_position after all. 1154 - */ 1155 - iter->last_visited = new_position; 1156 - smp_wmb(); 1157 - iter->last_dead_count = sequence; 1158 - } 1159 - 1160 1065 /** 1161 1066 * mem_cgroup_iter - iterate over memory cgroup hierarchy 1162 1067 * @root: hierarchy root ··· 1062 1199 struct mem_cgroup *prev, 1063 1200 struct mem_cgroup_reclaim_cookie *reclaim) 1064 1201 { 1202 + struct reclaim_iter *uninitialized_var(iter); 1203 + struct cgroup_subsys_state *css = NULL; 1065 1204 struct mem_cgroup *memcg = NULL; 1066 - struct mem_cgroup *last_visited = NULL; 1205 + struct mem_cgroup *pos = NULL; 1067 1206 1068 1207 if (mem_cgroup_disabled()) 1069 1208 return NULL; ··· 1074 1209 root = root_mem_cgroup; 1075 1210 1076 1211 if (prev && !reclaim) 1077 - last_visited = prev; 1212 + pos = prev; 1078 1213 1079 1214 if (!root->use_hierarchy && root != root_mem_cgroup) { 1080 1215 if (prev) 1081 - goto out_css_put; 1216 + goto out; 1082 1217 return root; 1083 1218 } 1084 1219 1085 1220 rcu_read_lock(); 1086 - while (!memcg) { 1087 - struct mem_cgroup_reclaim_iter *uninitialized_var(iter); 1088 - int uninitialized_var(seq); 1089 1221 1090 - if (reclaim) { 1091 - struct mem_cgroup_per_zone *mz; 
1222 + if (reclaim) { 1223 + struct mem_cgroup_per_zone *mz; 1092 1224 1093 - mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone); 1094 - iter = &mz->reclaim_iter[reclaim->priority]; 1095 - if (prev && reclaim->generation != iter->generation) { 1096 - iter->last_visited = NULL; 1097 - goto out_unlock; 1098 - } 1225 + mz = mem_cgroup_zone_zoneinfo(root, reclaim->zone); 1226 + iter = &mz->iter[reclaim->priority]; 1099 1227 1100 - last_visited = mem_cgroup_iter_load(iter, root, &seq); 1101 - } 1102 - 1103 - memcg = __mem_cgroup_iter_next(root, last_visited); 1104 - 1105 - if (reclaim) { 1106 - mem_cgroup_iter_update(iter, last_visited, memcg, root, 1107 - seq); 1108 - 1109 - if (!memcg) 1110 - iter->generation++; 1111 - else if (!prev && memcg) 1112 - reclaim->generation = iter->generation; 1113 - } 1114 - 1115 - if (prev && !memcg) 1228 + if (prev && reclaim->generation != iter->generation) 1116 1229 goto out_unlock; 1230 + 1231 + do { 1232 + pos = ACCESS_ONCE(iter->position); 1233 + /* 1234 + * A racing update may change the position and 1235 + * put the last reference, hence css_tryget(), 1236 + * or retry to see the updated position. 1237 + */ 1238 + } while (pos && !css_tryget(&pos->css)); 1117 1239 } 1240 + 1241 + if (pos) 1242 + css = &pos->css; 1243 + 1244 + for (;;) { 1245 + css = css_next_descendant_pre(css, &root->css); 1246 + if (!css) { 1247 + /* 1248 + * Reclaimers share the hierarchy walk, and a 1249 + * new one might jump in right at the end of 1250 + * the hierarchy - make sure they see at least 1251 + * one group and restart from the beginning. 1252 + */ 1253 + if (!prev) 1254 + continue; 1255 + break; 1256 + } 1257 + 1258 + /* 1259 + * Verify the css and acquire a reference. The root 1260 + * is provided by the caller, so we know it's alive 1261 + * and kicking, and don't take an extra reference. 
1262 + */ 1263 + memcg = mem_cgroup_from_css(css); 1264 + 1265 + if (css == &root->css) 1266 + break; 1267 + 1268 + if (css_tryget(css)) { 1269 + /* 1270 + * Make sure the memcg is initialized: 1271 + * mem_cgroup_css_online() orders the the 1272 + * initialization against setting the flag. 1273 + */ 1274 + if (smp_load_acquire(&memcg->initialized)) 1275 + break; 1276 + 1277 + css_put(css); 1278 + } 1279 + 1280 + memcg = NULL; 1281 + } 1282 + 1283 + if (reclaim) { 1284 + if (cmpxchg(&iter->position, pos, memcg) == pos) { 1285 + if (memcg) 1286 + css_get(&memcg->css); 1287 + if (pos) 1288 + css_put(&pos->css); 1289 + } 1290 + 1291 + /* 1292 + * pairs with css_tryget when dereferencing iter->position 1293 + * above. 1294 + */ 1295 + if (pos) 1296 + css_put(&pos->css); 1297 + 1298 + if (!memcg) 1299 + iter->generation++; 1300 + else if (!prev) 1301 + reclaim->generation = iter->generation; 1302 + } 1303 + 1118 1304 out_unlock: 1119 1305 rcu_read_unlock(); 1120 - out_css_put: 1306 + out: 1121 1307 if (prev && prev != root) 1122 1308 css_put(&prev->css); 1123 1309 ··· 1262 1346 } 1263 1347 1264 1348 /** 1265 - * mem_cgroup_page_lruvec - return lruvec for adding an lru page 1349 + * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page 1266 1350 * @page: the page 1267 1351 * @zone: zone of the page 1352 + * 1353 + * This function is only safe when following the LRU page isolation 1354 + * and putback protocol: the LRU lock must be held, and the page must 1355 + * either be PageLRU() or the caller must have isolated/allocated it. 
1268 1356 */ 1269 1357 struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone) 1270 1358 { 1271 1359 struct mem_cgroup_per_zone *mz; 1272 1360 struct mem_cgroup *memcg; 1273 - struct page_cgroup *pc; 1274 1361 struct lruvec *lruvec; 1275 1362 1276 1363 if (mem_cgroup_disabled()) { ··· 1281 1362 goto out; 1282 1363 } 1283 1364 1284 - pc = lookup_page_cgroup(page); 1285 - memcg = pc->mem_cgroup; 1286 - 1365 + memcg = page->mem_cgroup; 1287 1366 /* 1288 - * Surreptitiously switch any uncharged offlist page to root: 1289 - * an uncharged page off lru does nothing to secure 1290 - * its former mem_cgroup from sudden removal. 1291 - * 1292 - * Our caller holds lru_lock, and PageCgroupUsed is updated 1293 - * under page_cgroup lock: between them, they make all uses 1294 - * of pc->mem_cgroup safe. 1367 + * Swapcache readahead pages are added to the LRU - and 1368 + * possibly migrated - before they are charged. 1295 1369 */ 1296 - if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup) 1297 - pc->mem_cgroup = memcg = root_mem_cgroup; 1370 + if (!memcg) 1371 + memcg = root_mem_cgroup; 1298 1372 1299 1373 mz = mem_cgroup_page_zoneinfo(memcg, page); 1300 1374 lruvec = &mz->lruvec; ··· 1326 1414 VM_BUG_ON((long)(*lru_size) < 0); 1327 1415 } 1328 1416 1329 - /* 1330 - * Checks whether given mem is same or in the root_mem_cgroup's 1331 - * hierarchy subtree 1332 - */ 1333 - bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg, 1334 - struct mem_cgroup *memcg) 1417 + bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root) 1335 1418 { 1336 - if (root_memcg == memcg) 1419 + if (root == memcg) 1337 1420 return true; 1338 - if (!root_memcg->use_hierarchy || !memcg) 1421 + if (!root->use_hierarchy) 1339 1422 return false; 1340 - return cgroup_is_descendant(memcg->css.cgroup, root_memcg->css.cgroup); 1423 + return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup); 1341 1424 } 1342 1425 1343 - static 
bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg, 1344 - struct mem_cgroup *memcg) 1426 + bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg) 1345 1427 { 1346 - bool ret; 1347 - 1348 - rcu_read_lock(); 1349 - ret = __mem_cgroup_same_or_subtree(root_memcg, memcg); 1350 - rcu_read_unlock(); 1351 - return ret; 1352 - } 1353 - 1354 - bool task_in_mem_cgroup(struct task_struct *task, 1355 - const struct mem_cgroup *memcg) 1356 - { 1357 - struct mem_cgroup *curr = NULL; 1428 + struct mem_cgroup *task_memcg; 1358 1429 struct task_struct *p; 1359 1430 bool ret; 1360 1431 1361 1432 p = find_lock_task_mm(task); 1362 1433 if (p) { 1363 - curr = get_mem_cgroup_from_mm(p->mm); 1434 + task_memcg = get_mem_cgroup_from_mm(p->mm); 1364 1435 task_unlock(p); 1365 1436 } else { 1366 1437 /* ··· 1352 1457 * killed to prevent needlessly killing additional tasks. 1353 1458 */ 1354 1459 rcu_read_lock(); 1355 - curr = mem_cgroup_from_task(task); 1356 - if (curr) 1357 - css_get(&curr->css); 1460 + task_memcg = mem_cgroup_from_task(task); 1461 + css_get(&task_memcg->css); 1358 1462 rcu_read_unlock(); 1359 1463 } 1360 - /* 1361 - * We should check use_hierarchy of "memcg" not "curr". Because checking 1362 - * use_hierarchy of "curr" here make this function true if hierarchy is 1363 - * enabled in "curr" and "curr" is a child of "memcg" in *cgroup* 1364 - * hierarchy(even if use_hierarchy is disabled in "memcg"). 
1365 - */ 1366 - ret = mem_cgroup_same_or_subtree(memcg, curr); 1367 - css_put(&curr->css); 1464 + ret = mem_cgroup_is_descendant(task_memcg, memcg); 1465 + css_put(&task_memcg->css); 1368 1466 return ret; 1369 1467 } 1370 1468 ··· 1380 1492 return inactive * inactive_ratio < active; 1381 1493 } 1382 1494 1383 - #define mem_cgroup_from_res_counter(counter, member) \ 1495 + #define mem_cgroup_from_counter(counter, member) \ 1384 1496 container_of(counter, struct mem_cgroup, member) 1385 1497 1386 1498 /** ··· 1392 1504 */ 1393 1505 static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg) 1394 1506 { 1395 - unsigned long long margin; 1507 + unsigned long margin = 0; 1508 + unsigned long count; 1509 + unsigned long limit; 1396 1510 1397 - margin = res_counter_margin(&memcg->res); 1398 - if (do_swap_account) 1399 - margin = min(margin, res_counter_margin(&memcg->memsw)); 1400 - return margin >> PAGE_SHIFT; 1511 + count = page_counter_read(&memcg->memory); 1512 + limit = ACCESS_ONCE(memcg->memory.limit); 1513 + if (count < limit) 1514 + margin = limit - count; 1515 + 1516 + if (do_swap_account) { 1517 + count = page_counter_read(&memcg->memsw); 1518 + limit = ACCESS_ONCE(memcg->memsw.limit); 1519 + if (count <= limit) 1520 + margin = min(margin, limit - count); 1521 + } 1522 + 1523 + return margin; 1401 1524 } 1402 1525 1403 1526 int mem_cgroup_swappiness(struct mem_cgroup *memcg) ··· 1418 1519 return vm_swappiness; 1419 1520 1420 1521 return memcg->swappiness; 1421 - } 1422 - 1423 - /* 1424 - * memcg->moving_account is used for checking possibility that some thread is 1425 - * calling move_account(). When a thread on CPU-A starts moving pages under 1426 - * a memcg, other threads should check memcg->moving_account under 1427 - * rcu_read_lock(), like this: 1428 - * 1429 - * CPU-A CPU-B 1430 - * rcu_read_lock() 1431 - * memcg->moving_account+1 if (memcg->mocing_account) 1432 - * take heavy locks. 1433 - * synchronize_rcu() update something. 
1434 - * rcu_read_unlock() 1435 - * start move here. 1436 - */ 1437 - 1438 - static void mem_cgroup_start_move(struct mem_cgroup *memcg) 1439 - { 1440 - atomic_inc(&memcg->moving_account); 1441 - synchronize_rcu(); 1442 - } 1443 - 1444 - static void mem_cgroup_end_move(struct mem_cgroup *memcg) 1445 - { 1446 - /* 1447 - * Now, mem_cgroup_clear_mc() may call this function with NULL. 1448 - * We check NULL in callee rather than caller. 1449 - */ 1450 - if (memcg) 1451 - atomic_dec(&memcg->moving_account); 1452 1522 } 1453 1523 1454 1524 /* ··· 1442 1574 if (!from) 1443 1575 goto unlock; 1444 1576 1445 - ret = mem_cgroup_same_or_subtree(memcg, from) 1446 - || mem_cgroup_same_or_subtree(memcg, to); 1577 + ret = mem_cgroup_is_descendant(from, memcg) || 1578 + mem_cgroup_is_descendant(to, memcg); 1447 1579 unlock: 1448 1580 spin_unlock(&mc.lock); 1449 1581 return ret; ··· 1463 1595 } 1464 1596 } 1465 1597 return false; 1466 - } 1467 - 1468 - /* 1469 - * Take this lock when 1470 - * - a code tries to modify page's memcg while it's USED. 1471 - * - a code tries to modify page state accounting in a memcg. 
1472 - */ 1473 - static void move_lock_mem_cgroup(struct mem_cgroup *memcg, 1474 - unsigned long *flags) 1475 - { 1476 - spin_lock_irqsave(&memcg->move_lock, *flags); 1477 - } 1478 - 1479 - static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, 1480 - unsigned long *flags) 1481 - { 1482 - spin_unlock_irqrestore(&memcg->move_lock, *flags); 1483 1598 } 1484 1599 1485 1600 #define K(x) ((x) << (PAGE_SHIFT-10)) ··· 1495 1644 1496 1645 rcu_read_unlock(); 1497 1646 1498 - pr_info("memory: usage %llukB, limit %llukB, failcnt %llu\n", 1499 - res_counter_read_u64(&memcg->res, RES_USAGE) >> 10, 1500 - res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10, 1501 - res_counter_read_u64(&memcg->res, RES_FAILCNT)); 1502 - pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %llu\n", 1503 - res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10, 1504 - res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10, 1505 - res_counter_read_u64(&memcg->memsw, RES_FAILCNT)); 1506 - pr_info("kmem: usage %llukB, limit %llukB, failcnt %llu\n", 1507 - res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10, 1508 - res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10, 1509 - res_counter_read_u64(&memcg->kmem, RES_FAILCNT)); 1647 + pr_info("memory: usage %llukB, limit %llukB, failcnt %lu\n", 1648 + K((u64)page_counter_read(&memcg->memory)), 1649 + K((u64)memcg->memory.limit), memcg->memory.failcnt); 1650 + pr_info("memory+swap: usage %llukB, limit %llukB, failcnt %lu\n", 1651 + K((u64)page_counter_read(&memcg->memsw)), 1652 + K((u64)memcg->memsw.limit), memcg->memsw.failcnt); 1653 + pr_info("kmem: usage %llukB, limit %llukB, failcnt %lu\n", 1654 + K((u64)page_counter_read(&memcg->kmem)), 1655 + K((u64)memcg->kmem.limit), memcg->kmem.failcnt); 1510 1656 1511 1657 for_each_mem_cgroup_tree(iter, memcg) { 1512 1658 pr_info("Memory cgroup stats for "); ··· 1543 1695 /* 1544 1696 * Return the memory (and swap, if configured) limit for a memcg. 
1545 1697 */ 1546 - static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) 1698 + static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg) 1547 1699 { 1548 - u64 limit; 1700 + unsigned long limit; 1549 1701 1550 - limit = res_counter_read_u64(&memcg->res, RES_LIMIT); 1551 - 1552 - /* 1553 - * Do not consider swap space if we cannot swap due to swappiness 1554 - */ 1702 + limit = memcg->memory.limit; 1555 1703 if (mem_cgroup_swappiness(memcg)) { 1556 - u64 memsw; 1704 + unsigned long memsw_limit; 1557 1705 1558 - limit += total_swap_pages << PAGE_SHIFT; 1559 - memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT); 1560 - 1561 - /* 1562 - * If memsw is finite and limits the amount of swap space 1563 - * available to this memcg, return that limit. 1564 - */ 1565 - limit = min(limit, memsw); 1706 + memsw_limit = memcg->memsw.limit; 1707 + limit = min(limit + total_swap_pages, memsw_limit); 1566 1708 } 1567 - 1568 1709 return limit; 1569 1710 } 1570 1711 ··· 1577 1740 } 1578 1741 1579 1742 check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL); 1580 - totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1; 1743 + totalpages = mem_cgroup_get_limit(memcg) ? : 1; 1581 1744 for_each_mem_cgroup_tree(iter, memcg) { 1582 1745 struct css_task_iter it; 1583 1746 struct task_struct *task; ··· 1717 1880 memcg->last_scanned_node = node; 1718 1881 return node; 1719 1882 } 1720 - 1721 - /* 1722 - * Check all nodes whether it contains reclaimable pages or not. 1723 - * For quick scan, we make use of scan_nodes. This will allow us to skip 1724 - * unused nodes. But scan_nodes is lazily updated and may not cotain 1725 - * enough new information. We need to do double check. 1726 - */ 1727 - static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) 1728 - { 1729 - int nid; 1730 - 1731 - /* 1732 - * quick check...making use of scan_node. 1733 - * We can skip unused nodes. 
1734 - */ 1735 - if (!nodes_empty(memcg->scan_nodes)) { 1736 - for (nid = first_node(memcg->scan_nodes); 1737 - nid < MAX_NUMNODES; 1738 - nid = next_node(nid, memcg->scan_nodes)) { 1739 - 1740 - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap)) 1741 - return true; 1742 - } 1743 - } 1744 - /* 1745 - * Check rest of nodes. 1746 - */ 1747 - for_each_node_state(nid, N_MEMORY) { 1748 - if (node_isset(nid, memcg->scan_nodes)) 1749 - continue; 1750 - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap)) 1751 - return true; 1752 - } 1753 - return false; 1754 - } 1755 - 1756 1883 #else 1757 1884 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg) 1758 1885 { 1759 1886 return 0; 1760 - } 1761 - 1762 - static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap) 1763 - { 1764 - return test_mem_cgroup_node_reclaimable(memcg, 0, noswap); 1765 1887 } 1766 1888 #endif 1767 1889 ··· 1739 1943 .priority = 0, 1740 1944 }; 1741 1945 1742 - excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT; 1946 + excess = soft_limit_excess(root_memcg); 1743 1947 1744 1948 while (1) { 1745 1949 victim = mem_cgroup_iter(root_memcg, victim, &reclaim); ··· 1765 1969 } 1766 1970 continue; 1767 1971 } 1768 - if (!mem_cgroup_reclaimable(victim, false)) 1769 - continue; 1770 1972 total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, 1771 1973 zone, &nr_scanned); 1772 1974 *total_scanned += nr_scanned; 1773 - if (!res_counter_soft_limit_excess(&root_memcg->res)) 1975 + if (!soft_limit_excess(root_memcg)) 1774 1976 break; 1775 1977 } 1776 1978 mem_cgroup_iter_break(root_memcg, victim); ··· 1875 2081 oom_wait_info = container_of(wait, struct oom_wait_info, wait); 1876 2082 oom_wait_memcg = oom_wait_info->memcg; 1877 2083 1878 - /* 1879 - * Both of oom_wait_info->memcg and wake_memcg are stable under us. 1880 - * Then we can use css_is_ancestor without taking care of RCU. 
1881 - */ 1882 - if (!mem_cgroup_same_or_subtree(oom_wait_memcg, wake_memcg) 1883 - && !mem_cgroup_same_or_subtree(wake_memcg, oom_wait_memcg)) 2084 + if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) && 2085 + !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg)) 1884 2086 return 0; 1885 2087 return autoremove_wake_function(wait, mode, sync, arg); 1886 2088 } ··· 2018 2228 unsigned long *flags) 2019 2229 { 2020 2230 struct mem_cgroup *memcg; 2021 - struct page_cgroup *pc; 2022 2231 2023 2232 rcu_read_lock(); 2024 2233 2025 2234 if (mem_cgroup_disabled()) 2026 2235 return NULL; 2027 - 2028 - pc = lookup_page_cgroup(page); 2029 2236 again: 2030 - memcg = pc->mem_cgroup; 2031 - if (unlikely(!memcg || !PageCgroupUsed(pc))) 2237 + memcg = page->mem_cgroup; 2238 + if (unlikely(!memcg)) 2032 2239 return NULL; 2033 2240 2034 2241 *locked = false; 2035 2242 if (atomic_read(&memcg->moving_account) <= 0) 2036 2243 return memcg; 2037 2244 2038 - move_lock_mem_cgroup(memcg, flags); 2039 - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { 2040 - move_unlock_mem_cgroup(memcg, flags); 2245 + spin_lock_irqsave(&memcg->move_lock, *flags); 2246 + if (memcg != page->mem_cgroup) { 2247 + spin_unlock_irqrestore(&memcg->move_lock, *flags); 2041 2248 goto again; 2042 2249 } 2043 2250 *locked = true; ··· 2048 2261 * @locked: value received from mem_cgroup_begin_page_stat() 2049 2262 * @flags: value received from mem_cgroup_begin_page_stat() 2050 2263 */ 2051 - void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool locked, 2052 - unsigned long flags) 2264 + void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool *locked, 2265 + unsigned long *flags) 2053 2266 { 2054 - if (memcg && locked) 2055 - move_unlock_mem_cgroup(memcg, &flags); 2267 + if (memcg && *locked) 2268 + spin_unlock_irqrestore(&memcg->move_lock, *flags); 2056 2269 2057 2270 rcu_read_unlock(); 2058 2271 } ··· 2103 2316 static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) 2104 2317 
{ 2105 2318 struct memcg_stock_pcp *stock; 2106 - bool ret = true; 2319 + bool ret = false; 2107 2320 2108 2321 if (nr_pages > CHARGE_BATCH) 2109 - return false; 2322 + return ret; 2110 2323 2111 2324 stock = &get_cpu_var(memcg_stock); 2112 - if (memcg == stock->cached && stock->nr_pages >= nr_pages) 2325 + if (memcg == stock->cached && stock->nr_pages >= nr_pages) { 2113 2326 stock->nr_pages -= nr_pages; 2114 - else /* need to call res_counter_charge */ 2115 - ret = false; 2327 + ret = true; 2328 + } 2116 2329 put_cpu_var(memcg_stock); 2117 2330 return ret; 2118 2331 } 2119 2332 2120 2333 /* 2121 - * Returns stocks cached in percpu to res_counter and reset cached information. 2334 + * Returns stocks cached in percpu and reset cached information. 2122 2335 */ 2123 2336 static void drain_stock(struct memcg_stock_pcp *stock) 2124 2337 { 2125 2338 struct mem_cgroup *old = stock->cached; 2126 2339 2127 2340 if (stock->nr_pages) { 2128 - unsigned long bytes = stock->nr_pages * PAGE_SIZE; 2129 - 2130 - res_counter_uncharge(&old->res, bytes); 2341 + page_counter_uncharge(&old->memory, stock->nr_pages); 2131 2342 if (do_swap_account) 2132 - res_counter_uncharge(&old->memsw, bytes); 2343 + page_counter_uncharge(&old->memsw, stock->nr_pages); 2344 + css_put_many(&old->css, stock->nr_pages); 2133 2345 stock->nr_pages = 0; 2134 2346 } 2135 2347 stock->cached = NULL; ··· 2157 2371 } 2158 2372 2159 2373 /* 2160 - * Cache charges(val) which is from res_counter, to local per_cpu area. 2374 + * Cache charges(val) to local per_cpu area. 2161 2375 * This will be consumed by consume_stock() function, later. 2162 2376 */ 2163 2377 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) ··· 2174 2388 2175 2389 /* 2176 2390 * Drains all per-CPU charge caches for given root_memcg resp. subtree 2177 - * of the hierarchy under it. sync flag says whether we should block 2178 - * until the work is done. 2391 + * of the hierarchy under it. 
2179 2392 */
2180 - static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
2393 + static void drain_all_stock(struct mem_cgroup *root_memcg)
2181 2394 {
2182 2395 int cpu, curcpu;
2183 2396
2397 + /* If someone's already draining, avoid running more workers. */
2398 + if (!mutex_trylock(&percpu_charge_mutex))
2399 + return;
2184 2400 /* Notify other cpus that system-wide "drain" is running */
2185 2401 get_online_cpus();
2186 2402 curcpu = get_cpu();
···
2193 2405 memcg = stock->cached;
2194 2406 if (!memcg || !stock->nr_pages)
2195 2407 continue;
2196 - if (!mem_cgroup_same_or_subtree(root_memcg, memcg))
2408 + if (!mem_cgroup_is_descendant(memcg, root_memcg))
2197 2409 continue;
2198 2410 if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
2199 2411 if (cpu == curcpu)
···
2203 2415 }
2204 2416 }
2205 2417 put_cpu();
2206 -
2207 - if (!sync)
2208 - goto out;
2209 -
2210 - for_each_online_cpu(cpu) {
2211 - struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
2212 - if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
2213 - flush_work(&stock->work);
2214 - }
2215 - out:
2216 2418 put_online_cpus();
2217 - }
2218 -
2219 - /*
2220 - * Tries to drain stocked charges in other cpus. This function is asynchronous
2221 - * and just put a work per cpu for draining localy on each cpu. Caller can
2222 - * expects some charges will be back to res_counter later but cannot wait for
2223 - * it.
2224 - */
2225 - static void drain_all_stock_async(struct mem_cgroup *root_memcg)
2226 - {
2227 - /*
2228 - * If someone calls draining, avoid adding more kworker runs.
2229 - */
2230 - if (!mutex_trylock(&percpu_charge_mutex))
2231 - return;
2232 - drain_all_stock(root_memcg, false);
2233 - mutex_unlock(&percpu_charge_mutex);
2234 - }
2235 -
2236 - /* This is a synchronous drain interface.
*/ 2237 - static void drain_all_stock_sync(struct mem_cgroup *root_memcg) 2238 - { 2239 - /* called when force_empty is called */ 2240 - mutex_lock(&percpu_charge_mutex); 2241 - drain_all_stock(root_memcg, true); 2242 2419 mutex_unlock(&percpu_charge_mutex); 2243 2420 } 2244 2421 ··· 2259 2506 unsigned int batch = max(CHARGE_BATCH, nr_pages); 2260 2507 int nr_retries = MEM_CGROUP_RECLAIM_RETRIES; 2261 2508 struct mem_cgroup *mem_over_limit; 2262 - struct res_counter *fail_res; 2509 + struct page_counter *counter; 2263 2510 unsigned long nr_reclaimed; 2264 - unsigned long long size; 2265 2511 bool may_swap = true; 2266 2512 bool drained = false; 2267 2513 int ret = 0; ··· 2271 2519 if (consume_stock(memcg, nr_pages)) 2272 2520 goto done; 2273 2521 2274 - size = batch * PAGE_SIZE; 2275 2522 if (!do_swap_account || 2276 - !res_counter_charge(&memcg->memsw, size, &fail_res)) { 2277 - if (!res_counter_charge(&memcg->res, size, &fail_res)) 2523 + !page_counter_try_charge(&memcg->memsw, batch, &counter)) { 2524 + if (!page_counter_try_charge(&memcg->memory, batch, &counter)) 2278 2525 goto done_restock; 2279 2526 if (do_swap_account) 2280 - res_counter_uncharge(&memcg->memsw, size); 2281 - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res); 2527 + page_counter_uncharge(&memcg->memsw, batch); 2528 + mem_over_limit = mem_cgroup_from_counter(counter, memory); 2282 2529 } else { 2283 - mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw); 2530 + mem_over_limit = mem_cgroup_from_counter(counter, memsw); 2284 2531 may_swap = false; 2285 2532 } 2286 2533 ··· 2312 2561 goto retry; 2313 2562 2314 2563 if (!drained) { 2315 - drain_all_stock_async(mem_over_limit); 2564 + drain_all_stock(mem_over_limit); 2316 2565 drained = true; 2317 2566 goto retry; 2318 2567 } ··· 2354 2603 return -EINTR; 2355 2604 2356 2605 done_restock: 2606 + css_get_many(&memcg->css, batch); 2357 2607 if (batch > nr_pages) 2358 2608 refill_stock(memcg, batch - nr_pages); 2359 2609 done: ··· 
2363 2611 2364 2612 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) 2365 2613 { 2366 - unsigned long bytes = nr_pages * PAGE_SIZE; 2367 - 2368 2614 if (mem_cgroup_is_root(memcg)) 2369 2615 return; 2370 2616 2371 - res_counter_uncharge(&memcg->res, bytes); 2617 + page_counter_uncharge(&memcg->memory, nr_pages); 2372 2618 if (do_swap_account) 2373 - res_counter_uncharge(&memcg->memsw, bytes); 2374 - } 2619 + page_counter_uncharge(&memcg->memsw, nr_pages); 2375 2620 2376 - /* 2377 - * Cancel chrages in this cgroup....doesn't propagate to parent cgroup. 2378 - * This is useful when moving usage to parent cgroup. 2379 - */ 2380 - static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg, 2381 - unsigned int nr_pages) 2382 - { 2383 - unsigned long bytes = nr_pages * PAGE_SIZE; 2384 - 2385 - if (mem_cgroup_is_root(memcg)) 2386 - return; 2387 - 2388 - res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes); 2389 - if (do_swap_account) 2390 - res_counter_uncharge_until(&memcg->memsw, 2391 - memcg->memsw.parent, bytes); 2621 + css_put_many(&memcg->css, nr_pages); 2392 2622 } 2393 2623 2394 2624 /* ··· 2399 2665 */ 2400 2666 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) 2401 2667 { 2402 - struct mem_cgroup *memcg = NULL; 2403 - struct page_cgroup *pc; 2668 + struct mem_cgroup *memcg; 2404 2669 unsigned short id; 2405 2670 swp_entry_t ent; 2406 2671 2407 2672 VM_BUG_ON_PAGE(!PageLocked(page), page); 2408 2673 2409 - pc = lookup_page_cgroup(page); 2410 - if (PageCgroupUsed(pc)) { 2411 - memcg = pc->mem_cgroup; 2412 - if (memcg && !css_tryget_online(&memcg->css)) 2674 + memcg = page->mem_cgroup; 2675 + if (memcg) { 2676 + if (!css_tryget_online(&memcg->css)) 2413 2677 memcg = NULL; 2414 2678 } else if (PageSwapCache(page)) { 2415 2679 ent.val = page_private(page); ··· 2455 2723 static void commit_charge(struct page *page, struct mem_cgroup *memcg, 2456 2724 bool lrucare) 2457 2725 { 2458 - struct page_cgroup 
*pc = lookup_page_cgroup(page); 2459 2726 int isolated; 2460 2727 2461 - VM_BUG_ON_PAGE(PageCgroupUsed(pc), page); 2462 - /* 2463 - * we don't need page_cgroup_lock about tail pages, becase they are not 2464 - * accessed by any other context at this point. 2465 - */ 2728 + VM_BUG_ON_PAGE(page->mem_cgroup, page); 2466 2729 2467 2730 /* 2468 2731 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page ··· 2468 2741 2469 2742 /* 2470 2743 * Nobody should be changing or seriously looking at 2471 - * pc->mem_cgroup and pc->flags at this point: 2744 + * page->mem_cgroup at this point: 2472 2745 * 2473 2746 * - the page is uncharged 2474 2747 * ··· 2480 2753 * - a page cache insertion, a swapin fault, or a migration 2481 2754 * have the page locked 2482 2755 */ 2483 - pc->mem_cgroup = memcg; 2484 - pc->flags = PCG_USED | PCG_MEM | (do_swap_account ? PCG_MEMSW : 0); 2756 + page->mem_cgroup = memcg; 2485 2757 2486 2758 if (lrucare) 2487 2759 unlock_page_lru(page, isolated); 2488 2760 } 2489 - 2490 - static DEFINE_MUTEX(set_limit_mutex); 2491 2761 2492 2762 #ifdef CONFIG_MEMCG_KMEM 2493 2763 /* ··· 2492 2768 * destroyed. It protects memcg_caches arrays and memcg_slab_caches lists. 
2493 2769 */ 2494 2770 static DEFINE_MUTEX(memcg_slab_mutex); 2495 - 2496 - static DEFINE_MUTEX(activate_kmem_mutex); 2497 2771 2498 2772 /* 2499 2773 * This is a bit cumbersome, but it is rarely used and avoids a backpointer ··· 2506 2784 return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg)); 2507 2785 } 2508 2786 2509 - #ifdef CONFIG_SLABINFO 2510 - static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v) 2787 + static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, 2788 + unsigned long nr_pages) 2511 2789 { 2512 - struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); 2513 - struct memcg_cache_params *params; 2514 - 2515 - if (!memcg_kmem_is_active(memcg)) 2516 - return -EIO; 2517 - 2518 - print_slabinfo_header(m); 2519 - 2520 - mutex_lock(&memcg_slab_mutex); 2521 - list_for_each_entry(params, &memcg->memcg_slab_caches, list) 2522 - cache_show(memcg_params_to_cache(params), m); 2523 - mutex_unlock(&memcg_slab_mutex); 2524 - 2525 - return 0; 2526 - } 2527 - #endif 2528 - 2529 - static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size) 2530 - { 2531 - struct res_counter *fail_res; 2790 + struct page_counter *counter; 2532 2791 int ret = 0; 2533 2792 2534 - ret = res_counter_charge(&memcg->kmem, size, &fail_res); 2535 - if (ret) 2793 + ret = page_counter_try_charge(&memcg->kmem, nr_pages, &counter); 2794 + if (ret < 0) 2536 2795 return ret; 2537 2796 2538 - ret = try_charge(memcg, gfp, size >> PAGE_SHIFT); 2797 + ret = try_charge(memcg, gfp, nr_pages); 2539 2798 if (ret == -EINTR) { 2540 2799 /* 2541 2800 * try_charge() chose to bypass to root due to OOM kill or ··· 2533 2830 * when the allocation triggers should have been already 2534 2831 * directed to the root cgroup in memcontrol.h 2535 2832 */ 2536 - res_counter_charge_nofail(&memcg->res, size, &fail_res); 2833 + page_counter_charge(&memcg->memory, nr_pages); 2537 2834 if (do_swap_account) 2538 - res_counter_charge_nofail(&memcg->memsw, size, 2539 - &fail_res); 
2835 + page_counter_charge(&memcg->memsw, nr_pages); 2836 + css_get_many(&memcg->css, nr_pages); 2540 2837 ret = 0; 2541 2838 } else if (ret) 2542 - res_counter_uncharge(&memcg->kmem, size); 2839 + page_counter_uncharge(&memcg->kmem, nr_pages); 2543 2840 2544 2841 return ret; 2545 2842 } 2546 2843 2547 - static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size) 2844 + static void memcg_uncharge_kmem(struct mem_cgroup *memcg, 2845 + unsigned long nr_pages) 2548 2846 { 2549 - res_counter_uncharge(&memcg->res, size); 2847 + page_counter_uncharge(&memcg->memory, nr_pages); 2550 2848 if (do_swap_account) 2551 - res_counter_uncharge(&memcg->memsw, size); 2849 + page_counter_uncharge(&memcg->memsw, nr_pages); 2552 2850 2553 - /* Not down to 0 */ 2554 - if (res_counter_uncharge(&memcg->kmem, size)) 2555 - return; 2851 + page_counter_uncharge(&memcg->kmem, nr_pages); 2556 2852 2557 - /* 2558 - * Releases a reference taken in kmem_cgroup_css_offline in case 2559 - * this last uncharge is racing with the offlining code or it is 2560 - * outliving the memcg existence. 2561 - * 2562 - * The memory barrier imposed by test&clear is paired with the 2563 - * explicit one in memcg_kmem_mark_dead(). 
2564 - */ 2565 - if (memcg_kmem_test_and_clear_dead(memcg)) 2566 - css_put(&memcg->css); 2853 + css_put_many(&memcg->css, nr_pages); 2567 2854 } 2568 2855 2569 2856 /* ··· 2817 3124 2818 3125 int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order) 2819 3126 { 3127 + unsigned int nr_pages = 1 << order; 2820 3128 int res; 2821 3129 2822 - res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, 2823 - PAGE_SIZE << order); 3130 + res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages); 2824 3131 if (!res) 2825 - atomic_add(1 << order, &cachep->memcg_params->nr_pages); 3132 + atomic_add(nr_pages, &cachep->memcg_params->nr_pages); 2826 3133 return res; 2827 3134 } 2828 3135 2829 3136 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order) 2830 3137 { 2831 - memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order); 2832 - atomic_sub(1 << order, &cachep->memcg_params->nr_pages); 3138 + unsigned int nr_pages = 1 << order; 3139 + 3140 + memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages); 3141 + atomic_sub(nr_pages, &cachep->memcg_params->nr_pages); 2833 3142 } 2834 3143 2835 3144 /* ··· 2952 3257 return true; 2953 3258 } 2954 3259 2955 - ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order); 3260 + ret = memcg_charge_kmem(memcg, gfp, 1 << order); 2956 3261 if (!ret) 2957 3262 *_memcg = memcg; 2958 3263 ··· 2963 3268 void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, 2964 3269 int order) 2965 3270 { 2966 - struct page_cgroup *pc; 2967 - 2968 3271 VM_BUG_ON(mem_cgroup_is_root(memcg)); 2969 3272 2970 3273 /* The page allocation failed. Revert */ 2971 3274 if (!page) { 2972 - memcg_uncharge_kmem(memcg, PAGE_SIZE << order); 3275 + memcg_uncharge_kmem(memcg, 1 << order); 2973 3276 return; 2974 3277 } 2975 - /* 2976 - * The page is freshly allocated and not visible to any 2977 - * outside callers yet. Set up pc non-atomically. 
2978 - */ 2979 - pc = lookup_page_cgroup(page); 2980 - pc->mem_cgroup = memcg; 2981 - pc->flags = PCG_USED; 3278 + page->mem_cgroup = memcg; 2982 3279 } 2983 3280 2984 3281 void __memcg_kmem_uncharge_pages(struct page *page, int order) 2985 3282 { 2986 - struct mem_cgroup *memcg = NULL; 2987 - struct page_cgroup *pc; 3283 + struct mem_cgroup *memcg = page->mem_cgroup; 2988 3284 2989 - 2990 - pc = lookup_page_cgroup(page); 2991 - if (!PageCgroupUsed(pc)) 2992 - return; 2993 - 2994 - memcg = pc->mem_cgroup; 2995 - pc->flags = 0; 2996 - 2997 - /* 2998 - * We trust that only if there is a memcg associated with the page, it 2999 - * is a valid allocation 3000 - */ 3001 3285 if (!memcg) 3002 3286 return; 3003 3287 3004 3288 VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page); 3005 - memcg_uncharge_kmem(memcg, PAGE_SIZE << order); 3289 + 3290 + memcg_uncharge_kmem(memcg, 1 << order); 3291 + page->mem_cgroup = NULL; 3006 3292 } 3007 3293 #else 3008 3294 static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg) ··· 3001 3325 */ 3002 3326 void mem_cgroup_split_huge_fixup(struct page *head) 3003 3327 { 3004 - struct page_cgroup *head_pc = lookup_page_cgroup(head); 3005 - struct page_cgroup *pc; 3006 - struct mem_cgroup *memcg; 3007 3328 int i; 3008 3329 3009 3330 if (mem_cgroup_disabled()) 3010 3331 return; 3011 3332 3012 - memcg = head_pc->mem_cgroup; 3013 - for (i = 1; i < HPAGE_PMD_NR; i++) { 3014 - pc = head_pc + i; 3015 - pc->mem_cgroup = memcg; 3016 - pc->flags = head_pc->flags; 3017 - } 3018 - __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], 3333 + for (i = 1; i < HPAGE_PMD_NR; i++) 3334 + head[i].mem_cgroup = head->mem_cgroup; 3335 + 3336 + __this_cpu_sub(head->mem_cgroup->stat->count[MEM_CGROUP_STAT_RSS_HUGE], 3019 3337 HPAGE_PMD_NR); 3020 3338 } 3021 3339 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ ··· 3018 3348 * mem_cgroup_move_account - move account of the page 3019 3349 * @page: the page 3020 3350 * @nr_pages: number of regular pages (>1 for 
huge pages) 3021 - * @pc: page_cgroup of the page. 3022 3351 * @from: mem_cgroup which the page is moved from. 3023 3352 * @to: mem_cgroup which the page is moved to. @from != @to. 3024 3353 * ··· 3030 3361 */ 3031 3362 static int mem_cgroup_move_account(struct page *page, 3032 3363 unsigned int nr_pages, 3033 - struct page_cgroup *pc, 3034 3364 struct mem_cgroup *from, 3035 3365 struct mem_cgroup *to) 3036 3366 { ··· 3049 3381 goto out; 3050 3382 3051 3383 /* 3052 - * Prevent mem_cgroup_migrate() from looking at pc->mem_cgroup 3384 + * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup 3053 3385 * of its source page while we change it: page migration takes 3054 3386 * both pages off the LRU, but page cache replacement doesn't. 3055 3387 */ ··· 3057 3389 goto out; 3058 3390 3059 3391 ret = -EINVAL; 3060 - if (!PageCgroupUsed(pc) || pc->mem_cgroup != from) 3392 + if (page->mem_cgroup != from) 3061 3393 goto out_unlock; 3062 3394 3063 - move_lock_mem_cgroup(from, &flags); 3395 + spin_lock_irqsave(&from->move_lock, flags); 3064 3396 3065 3397 if (!PageAnon(page) && page_mapped(page)) { 3066 3398 __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], ··· 3077 3409 } 3078 3410 3079 3411 /* 3080 - * It is safe to change pc->mem_cgroup here because the page 3412 + * It is safe to change page->mem_cgroup here because the page 3081 3413 * is referenced, charged, and isolated - we can't race with 3082 3414 * uncharging, charging, migration, or LRU putback. 
3083 3415 */ 3084 3416 3085 3417 /* caller should have done css_get */ 3086 - pc->mem_cgroup = to; 3087 - move_unlock_mem_cgroup(from, &flags); 3418 + page->mem_cgroup = to; 3419 + spin_unlock_irqrestore(&from->move_lock, flags); 3420 + 3088 3421 ret = 0; 3089 3422 3090 3423 local_irq_disable(); ··· 3096 3427 local_irq_enable(); 3097 3428 out_unlock: 3098 3429 unlock_page(page); 3099 - out: 3100 - return ret; 3101 - } 3102 - 3103 - /** 3104 - * mem_cgroup_move_parent - moves page to the parent group 3105 - * @page: the page to move 3106 - * @pc: page_cgroup of the page 3107 - * @child: page's cgroup 3108 - * 3109 - * move charges to its parent or the root cgroup if the group has no 3110 - * parent (aka use_hierarchy==0). 3111 - * Although this might fail (get_page_unless_zero, isolate_lru_page or 3112 - * mem_cgroup_move_account fails) the failure is always temporary and 3113 - * it signals a race with a page removal/uncharge or migration. In the 3114 - * first case the page is on the way out and it will vanish from the LRU 3115 - * on the next attempt and the call should be retried later. 3116 - * Isolation from the LRU fails only if page has been isolated from 3117 - * the LRU since we looked at it and that usually means either global 3118 - * reclaim or migration going on. The page will either get back to the 3119 - * LRU or vanish. 3120 - * Finaly mem_cgroup_move_account fails only if the page got uncharged 3121 - * (!PageCgroupUsed) or moved to a different group. The page will 3122 - * disappear in the next attempt. 
3123 - */ 3124 - static int mem_cgroup_move_parent(struct page *page, 3125 - struct page_cgroup *pc, 3126 - struct mem_cgroup *child) 3127 - { 3128 - struct mem_cgroup *parent; 3129 - unsigned int nr_pages; 3130 - unsigned long uninitialized_var(flags); 3131 - int ret; 3132 - 3133 - VM_BUG_ON(mem_cgroup_is_root(child)); 3134 - 3135 - ret = -EBUSY; 3136 - if (!get_page_unless_zero(page)) 3137 - goto out; 3138 - if (isolate_lru_page(page)) 3139 - goto put; 3140 - 3141 - nr_pages = hpage_nr_pages(page); 3142 - 3143 - parent = parent_mem_cgroup(child); 3144 - /* 3145 - * If no parent, move charges to root cgroup. 3146 - */ 3147 - if (!parent) 3148 - parent = root_mem_cgroup; 3149 - 3150 - if (nr_pages > 1) { 3151 - VM_BUG_ON_PAGE(!PageTransHuge(page), page); 3152 - flags = compound_lock_irqsave(page); 3153 - } 3154 - 3155 - ret = mem_cgroup_move_account(page, nr_pages, 3156 - pc, child, parent); 3157 - if (!ret) 3158 - __mem_cgroup_cancel_local_charge(child, nr_pages); 3159 - 3160 - if (nr_pages > 1) 3161 - compound_unlock_irqrestore(page, flags); 3162 - putback_lru_page(page); 3163 - put: 3164 - put_page(page); 3165 3430 out: 3166 3431 return ret; 3167 3432 } ··· 3119 3516 * 3120 3517 * Returns 0 on success, -EINVAL on failure. 3121 3518 * 3122 - * The caller must have charged to @to, IOW, called res_counter_charge() about 3519 + * The caller must have charged to @to, IOW, called page_counter_charge() about 3123 3520 * both res and memsw, and called css_get(). 3124 3521 */ 3125 3522 static int mem_cgroup_move_swap_account(swp_entry_t entry, ··· 3135 3532 mem_cgroup_swap_statistics(to, true); 3136 3533 /* 3137 3534 * This function is only called from task migration context now. 3138 - * It postpones res_counter and refcount handling till the end 3535 + * It postpones page_counter and refcount handling till the end 3139 3536 * of task migration(mem_cgroup_clear_mc()) for performance 3140 3537 * improvement. 
But we cannot postpone css_get(to) because if 3141 3538 * the process that has been moved to @to does swap-in, the ··· 3157 3554 } 3158 3555 #endif 3159 3556 3160 - #ifdef CONFIG_DEBUG_VM 3161 - static struct page_cgroup *lookup_page_cgroup_used(struct page *page) 3162 - { 3163 - struct page_cgroup *pc; 3164 - 3165 - pc = lookup_page_cgroup(page); 3166 - /* 3167 - * Can be NULL while feeding pages into the page allocator for 3168 - * the first time, i.e. during boot or memory hotplug; 3169 - * or when mem_cgroup_disabled(). 3170 - */ 3171 - if (likely(pc) && PageCgroupUsed(pc)) 3172 - return pc; 3173 - return NULL; 3174 - } 3175 - 3176 - bool mem_cgroup_bad_page_check(struct page *page) 3177 - { 3178 - if (mem_cgroup_disabled()) 3179 - return false; 3180 - 3181 - return lookup_page_cgroup_used(page) != NULL; 3182 - } 3183 - 3184 - void mem_cgroup_print_bad_page(struct page *page) 3185 - { 3186 - struct page_cgroup *pc; 3187 - 3188 - pc = lookup_page_cgroup_used(page); 3189 - if (pc) { 3190 - pr_alert("pc:%p pc->flags:%lx pc->mem_cgroup:%p\n", 3191 - pc, pc->flags, pc->mem_cgroup); 3192 - } 3193 - } 3194 - #endif 3557 + static DEFINE_MUTEX(memcg_limit_mutex); 3195 3558 3196 3559 static int mem_cgroup_resize_limit(struct mem_cgroup *memcg, 3197 - unsigned long long val) 3560 + unsigned long limit) 3198 3561 { 3562 + unsigned long curusage; 3563 + unsigned long oldusage; 3564 + bool enlarge = false; 3199 3565 int retry_count; 3200 - int ret = 0; 3201 - int children = mem_cgroup_count_children(memcg); 3202 - u64 curusage, oldusage; 3203 - int enlarge; 3566 + int ret; 3204 3567 3205 3568 /* 3206 3569 * For keeping hierarchical_reclaim simple, how long we should retry 3207 3570 * is depends on callers. We set our retry-count to be function 3208 3571 * of # of children which we should visit in this loop. 
3209 3572 */ 3210 - retry_count = MEM_CGROUP_RECLAIM_RETRIES * children; 3573 + retry_count = MEM_CGROUP_RECLAIM_RETRIES * 3574 + mem_cgroup_count_children(memcg); 3211 3575 3212 - oldusage = res_counter_read_u64(&memcg->res, RES_USAGE); 3576 + oldusage = page_counter_read(&memcg->memory); 3213 3577 3214 - enlarge = 0; 3215 - while (retry_count) { 3578 + do { 3216 3579 if (signal_pending(current)) { 3217 3580 ret = -EINTR; 3218 3581 break; 3219 3582 } 3220 - /* 3221 - * Rather than hide all in some function, I do this in 3222 - * open coded manner. You see what this really does. 3223 - * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 3224 - */ 3225 - mutex_lock(&set_limit_mutex); 3226 - if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) { 3583 + 3584 + mutex_lock(&memcg_limit_mutex); 3585 + if (limit > memcg->memsw.limit) { 3586 + mutex_unlock(&memcg_limit_mutex); 3227 3587 ret = -EINVAL; 3228 - mutex_unlock(&set_limit_mutex); 3229 3588 break; 3230 3589 } 3231 - 3232 - if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val) 3233 - enlarge = 1; 3234 - 3235 - ret = res_counter_set_limit(&memcg->res, val); 3236 - mutex_unlock(&set_limit_mutex); 3590 + if (limit > memcg->memory.limit) 3591 + enlarge = true; 3592 + ret = page_counter_limit(&memcg->memory, limit); 3593 + mutex_unlock(&memcg_limit_mutex); 3237 3594 3238 3595 if (!ret) 3239 3596 break; 3240 3597 3241 3598 try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true); 3242 3599 3243 - curusage = res_counter_read_u64(&memcg->res, RES_USAGE); 3600 + curusage = page_counter_read(&memcg->memory); 3244 3601 /* Usage is reduced ? 
*/ 3245 3602 if (curusage >= oldusage) 3246 3603 retry_count--; 3247 3604 else 3248 3605 oldusage = curusage; 3249 - } 3606 + } while (retry_count); 3607 + 3250 3608 if (!ret && enlarge) 3251 3609 memcg_oom_recover(memcg); 3252 3610 ··· 3215 3651 } 3216 3652 3217 3653 static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg, 3218 - unsigned long long val) 3654 + unsigned long limit) 3219 3655 { 3656 + unsigned long curusage; 3657 + unsigned long oldusage; 3658 + bool enlarge = false; 3220 3659 int retry_count; 3221 - u64 oldusage, curusage; 3222 - int children = mem_cgroup_count_children(memcg); 3223 - int ret = -EBUSY; 3224 - int enlarge = 0; 3660 + int ret; 3225 3661 3226 3662 /* see mem_cgroup_resize_res_limit */ 3227 - retry_count = children * MEM_CGROUP_RECLAIM_RETRIES; 3228 - oldusage = res_counter_read_u64(&memcg->memsw, RES_USAGE); 3229 - while (retry_count) { 3663 + retry_count = MEM_CGROUP_RECLAIM_RETRIES * 3664 + mem_cgroup_count_children(memcg); 3665 + 3666 + oldusage = page_counter_read(&memcg->memsw); 3667 + 3668 + do { 3230 3669 if (signal_pending(current)) { 3231 3670 ret = -EINTR; 3232 3671 break; 3233 3672 } 3234 - /* 3235 - * Rather than hide all in some function, I do this in 3236 - * open coded manner. You see what this really does. 3237 - * We have to guarantee memcg->res.limit <= memcg->memsw.limit. 
3238 - */ 3239 - mutex_lock(&set_limit_mutex); 3240 - if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) { 3673 + 3674 + mutex_lock(&memcg_limit_mutex); 3675 + if (limit < memcg->memory.limit) { 3676 + mutex_unlock(&memcg_limit_mutex); 3241 3677 ret = -EINVAL; 3242 - mutex_unlock(&set_limit_mutex); 3243 3678 break; 3244 3679 } 3245 - if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) 3246 - enlarge = 1; 3247 - ret = res_counter_set_limit(&memcg->memsw, val); 3248 - mutex_unlock(&set_limit_mutex); 3680 + if (limit > memcg->memsw.limit) 3681 + enlarge = true; 3682 + ret = page_counter_limit(&memcg->memsw, limit); 3683 + mutex_unlock(&memcg_limit_mutex); 3249 3684 3250 3685 if (!ret) 3251 3686 break; 3252 3687 3253 3688 try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false); 3254 3689 3255 - curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE); 3690 + curusage = page_counter_read(&memcg->memsw); 3256 3691 /* Usage is reduced ? */ 3257 3692 if (curusage >= oldusage) 3258 3693 retry_count--; 3259 3694 else 3260 3695 oldusage = curusage; 3261 - } 3696 + } while (retry_count); 3697 + 3262 3698 if (!ret && enlarge) 3263 3699 memcg_oom_recover(memcg); 3700 + 3264 3701 return ret; 3265 3702 } 3266 3703 ··· 3274 3709 unsigned long reclaimed; 3275 3710 int loop = 0; 3276 3711 struct mem_cgroup_tree_per_zone *mctz; 3277 - unsigned long long excess; 3712 + unsigned long excess; 3278 3713 unsigned long nr_scanned; 3279 3714 3280 3715 if (order > 0) ··· 3300 3735 nr_reclaimed += reclaimed; 3301 3736 *total_scanned += nr_scanned; 3302 3737 spin_lock_irq(&mctz->lock); 3738 + __mem_cgroup_remove_exceeded(mz, mctz); 3303 3739 3304 3740 /* 3305 3741 * If we failed to reclaim anything from this memory cgroup 3306 3742 * it is time to move on to the next cgroup 3307 3743 */ 3308 3744 next_mz = NULL; 3309 - if (!reclaimed) { 3310 - do { 3311 - /* 3312 - * Loop until we find yet another one. 
3313 - * 3314 - * By the time we get the soft_limit lock 3315 - * again, someone might have aded the 3316 - * group back on the RB tree. Iterate to 3317 - * make sure we get a different mem. 3318 - * mem_cgroup_largest_soft_limit_node returns 3319 - * NULL if no other cgroup is present on 3320 - * the tree 3321 - */ 3322 - next_mz = 3323 - __mem_cgroup_largest_soft_limit_node(mctz); 3324 - if (next_mz == mz) 3325 - css_put(&next_mz->memcg->css); 3326 - else /* next_mz == NULL or other memcg */ 3327 - break; 3328 - } while (1); 3329 - } 3330 - __mem_cgroup_remove_exceeded(mz, mctz); 3331 - excess = res_counter_soft_limit_excess(&mz->memcg->res); 3745 + if (!reclaimed) 3746 + next_mz = __mem_cgroup_largest_soft_limit_node(mctz); 3747 + 3748 + excess = soft_limit_excess(mz->memcg); 3332 3749 /* 3333 3750 * One school of thought says that we should not add 3334 3751 * back the node to the tree if reclaim returns 0. ··· 3337 3790 if (next_mz) 3338 3791 css_put(&next_mz->memcg->css); 3339 3792 return nr_reclaimed; 3340 - } 3341 - 3342 - /** 3343 - * mem_cgroup_force_empty_list - clears LRU of a group 3344 - * @memcg: group to clear 3345 - * @node: NUMA node 3346 - * @zid: zone id 3347 - * @lru: lru to to clear 3348 - * 3349 - * Traverse a specified page_cgroup list and try to drop them all. This doesn't 3350 - * reclaim the pages page themselves - pages are moved to the parent (or root) 3351 - * group. 
3352 - */ 3353 - static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg, 3354 - int node, int zid, enum lru_list lru) 3355 - { 3356 - struct lruvec *lruvec; 3357 - unsigned long flags; 3358 - struct list_head *list; 3359 - struct page *busy; 3360 - struct zone *zone; 3361 - 3362 - zone = &NODE_DATA(node)->node_zones[zid]; 3363 - lruvec = mem_cgroup_zone_lruvec(zone, memcg); 3364 - list = &lruvec->lists[lru]; 3365 - 3366 - busy = NULL; 3367 - do { 3368 - struct page_cgroup *pc; 3369 - struct page *page; 3370 - 3371 - spin_lock_irqsave(&zone->lru_lock, flags); 3372 - if (list_empty(list)) { 3373 - spin_unlock_irqrestore(&zone->lru_lock, flags); 3374 - break; 3375 - } 3376 - page = list_entry(list->prev, struct page, lru); 3377 - if (busy == page) { 3378 - list_move(&page->lru, list); 3379 - busy = NULL; 3380 - spin_unlock_irqrestore(&zone->lru_lock, flags); 3381 - continue; 3382 - } 3383 - spin_unlock_irqrestore(&zone->lru_lock, flags); 3384 - 3385 - pc = lookup_page_cgroup(page); 3386 - 3387 - if (mem_cgroup_move_parent(page, pc, memcg)) { 3388 - /* found lock contention or "pc" is obsolete. */ 3389 - busy = page; 3390 - } else 3391 - busy = NULL; 3392 - cond_resched(); 3393 - } while (!list_empty(list)); 3394 - } 3395 - 3396 - /* 3397 - * make mem_cgroup's charge to be 0 if there is no task by moving 3398 - * all the charges and pages to the parent. 3399 - * This enables deleting this mem_cgroup. 3400 - * 3401 - * Caller is responsible for holding css reference on the memcg. 3402 - */ 3403 - static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg) 3404 - { 3405 - int node, zid; 3406 - u64 usage; 3407 - 3408 - do { 3409 - /* This is for making all *used* pages to be on LRU. 
*/ 3410 - lru_add_drain_all(); 3411 - drain_all_stock_sync(memcg); 3412 - mem_cgroup_start_move(memcg); 3413 - for_each_node_state(node, N_MEMORY) { 3414 - for (zid = 0; zid < MAX_NR_ZONES; zid++) { 3415 - enum lru_list lru; 3416 - for_each_lru(lru) { 3417 - mem_cgroup_force_empty_list(memcg, 3418 - node, zid, lru); 3419 - } 3420 - } 3421 - } 3422 - mem_cgroup_end_move(memcg); 3423 - memcg_oom_recover(memcg); 3424 - cond_resched(); 3425 - 3426 - /* 3427 - * Kernel memory may not necessarily be trackable to a specific 3428 - * process. So they are not migrated, and therefore we can't 3429 - * expect their value to drop to 0 here. 3430 - * Having res filled up with kmem only is enough. 3431 - * 3432 - * This is a safety check because mem_cgroup_force_empty_list 3433 - * could have raced with mem_cgroup_replace_page_cache callers 3434 - * so the lru seemed empty but the page could have been added 3435 - * right after the check. RES_USAGE should be safe as we always 3436 - * charge before adding to the LRU. 
3437 - */ 3438 - usage = res_counter_read_u64(&memcg->res, RES_USAGE) - 3439 - res_counter_read_u64(&memcg->kmem, RES_USAGE); 3440 - } while (usage > 0); 3441 3793 } 3442 3794 3443 3795 /* ··· 3376 3930 /* we call try-to-free pages for make this cgroup empty */ 3377 3931 lru_add_drain_all(); 3378 3932 /* try to free all pages in this cgroup */ 3379 - while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) { 3933 + while (nr_retries && page_counter_read(&memcg->memory)) { 3380 3934 int progress; 3381 3935 3382 3936 if (signal_pending(current)) ··· 3447 4001 return retval; 3448 4002 } 3449 4003 3450 - static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg, 3451 - enum mem_cgroup_stat_index idx) 4004 + static unsigned long tree_stat(struct mem_cgroup *memcg, 4005 + enum mem_cgroup_stat_index idx) 3452 4006 { 3453 4007 struct mem_cgroup *iter; 3454 4008 long val = 0; ··· 3466 4020 { 3467 4021 u64 val; 3468 4022 3469 - if (!mem_cgroup_is_root(memcg)) { 4023 + if (mem_cgroup_is_root(memcg)) { 4024 + val = tree_stat(memcg, MEM_CGROUP_STAT_CACHE); 4025 + val += tree_stat(memcg, MEM_CGROUP_STAT_RSS); 4026 + if (swap) 4027 + val += tree_stat(memcg, MEM_CGROUP_STAT_SWAP); 4028 + } else { 3470 4029 if (!swap) 3471 - return res_counter_read_u64(&memcg->res, RES_USAGE); 4030 + val = page_counter_read(&memcg->memory); 3472 4031 else 3473 - return res_counter_read_u64(&memcg->memsw, RES_USAGE); 4032 + val = page_counter_read(&memcg->memsw); 3474 4033 } 3475 - 3476 - /* 3477 - * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS 3478 - * as well as in MEM_CGROUP_STAT_RSS_HUGE. 
3479 - */ 3480 - val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE); 3481 - val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS); 3482 - 3483 - if (swap) 3484 - val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP); 3485 - 3486 4034 return val << PAGE_SHIFT; 3487 4035 } 3488 4036 4037 + enum { 4038 + RES_USAGE, 4039 + RES_LIMIT, 4040 + RES_MAX_USAGE, 4041 + RES_FAILCNT, 4042 + RES_SOFT_LIMIT, 4043 + }; 3489 4044 3490 4045 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css, 3491 4046 struct cftype *cft) 3492 4047 { 3493 4048 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 3494 - enum res_type type = MEMFILE_TYPE(cft->private); 3495 - int name = MEMFILE_ATTR(cft->private); 4049 + struct page_counter *counter; 3496 4050 3497 - switch (type) { 4051 + switch (MEMFILE_TYPE(cft->private)) { 3498 4052 case _MEM: 3499 - if (name == RES_USAGE) 3500 - return mem_cgroup_usage(memcg, false); 3501 - return res_counter_read_u64(&memcg->res, name); 3502 - case _MEMSWAP: 3503 - if (name == RES_USAGE) 3504 - return mem_cgroup_usage(memcg, true); 3505 - return res_counter_read_u64(&memcg->memsw, name); 3506 - case _KMEM: 3507 - return res_counter_read_u64(&memcg->kmem, name); 4053 + counter = &memcg->memory; 3508 4054 break; 4055 + case _MEMSWAP: 4056 + counter = &memcg->memsw; 4057 + break; 4058 + case _KMEM: 4059 + counter = &memcg->kmem; 4060 + break; 4061 + default: 4062 + BUG(); 4063 + } 4064 + 4065 + switch (MEMFILE_ATTR(cft->private)) { 4066 + case RES_USAGE: 4067 + if (counter == &memcg->memory) 4068 + return mem_cgroup_usage(memcg, false); 4069 + if (counter == &memcg->memsw) 4070 + return mem_cgroup_usage(memcg, true); 4071 + return (u64)page_counter_read(counter) * PAGE_SIZE; 4072 + case RES_LIMIT: 4073 + return (u64)counter->limit * PAGE_SIZE; 4074 + case RES_MAX_USAGE: 4075 + return (u64)counter->watermark * PAGE_SIZE; 4076 + case RES_FAILCNT: 4077 + return counter->failcnt; 4078 + case RES_SOFT_LIMIT: 4079 + return 
(u64)memcg->soft_limit * PAGE_SIZE; 3509 4080 default: 3510 4081 BUG(); 3511 4082 } 3512 4083 } 3513 4084 3514 4085 #ifdef CONFIG_MEMCG_KMEM 3515 - /* should be called with activate_kmem_mutex held */ 3516 - static int __memcg_activate_kmem(struct mem_cgroup *memcg, 3517 - unsigned long long limit) 4086 + static int memcg_activate_kmem(struct mem_cgroup *memcg, 4087 + unsigned long nr_pages) 3518 4088 { 3519 4089 int err = 0; 3520 4090 int memcg_id; ··· 3577 4115 * We couldn't have accounted to this cgroup, because it hasn't got the 3578 4116 * active bit set yet, so this should succeed. 3579 4117 */ 3580 - err = res_counter_set_limit(&memcg->kmem, limit); 4118 + err = page_counter_limit(&memcg->kmem, nr_pages); 3581 4119 VM_BUG_ON(err); 3582 4120 3583 4121 static_key_slow_inc(&memcg_kmem_enabled_key); ··· 3592 4130 return err; 3593 4131 } 3594 4132 3595 - static int memcg_activate_kmem(struct mem_cgroup *memcg, 3596 - unsigned long long limit) 3597 - { 3598 - int ret; 3599 - 3600 - mutex_lock(&activate_kmem_mutex); 3601 - ret = __memcg_activate_kmem(memcg, limit); 3602 - mutex_unlock(&activate_kmem_mutex); 3603 - return ret; 3604 - } 3605 - 3606 4133 static int memcg_update_kmem_limit(struct mem_cgroup *memcg, 3607 - unsigned long long val) 4134 + unsigned long limit) 3608 4135 { 3609 4136 int ret; 3610 4137 4138 + mutex_lock(&memcg_limit_mutex); 3611 4139 if (!memcg_kmem_is_active(memcg)) 3612 - ret = memcg_activate_kmem(memcg, val); 4140 + ret = memcg_activate_kmem(memcg, limit); 3613 4141 else 3614 - ret = res_counter_set_limit(&memcg->kmem, val); 4142 + ret = page_counter_limit(&memcg->kmem, limit); 4143 + mutex_unlock(&memcg_limit_mutex); 3615 4144 return ret; 3616 4145 } 3617 4146 ··· 3614 4161 if (!parent) 3615 4162 return 0; 3616 4163 3617 - mutex_lock(&activate_kmem_mutex); 4164 + mutex_lock(&memcg_limit_mutex); 3618 4165 /* 3619 4166 * If the parent cgroup is not kmem-active now, it cannot be activated 3620 4167 * after this point, because it has at 
least one child already. 3621 4168 */ 3622 4169 if (memcg_kmem_is_active(parent)) 3623 - ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX); 3624 - mutex_unlock(&activate_kmem_mutex); 4170 + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX); 4171 + mutex_unlock(&memcg_limit_mutex); 3625 4172 return ret; 3626 4173 } 3627 4174 #else 3628 4175 static int memcg_update_kmem_limit(struct mem_cgroup *memcg, 3629 - unsigned long long val) 4176 + unsigned long limit) 3630 4177 { 3631 4178 return -EINVAL; 3632 4179 } ··· 3640 4187 char *buf, size_t nbytes, loff_t off) 3641 4188 { 3642 4189 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 3643 - enum res_type type; 3644 - int name; 3645 - unsigned long long val; 4190 + unsigned long nr_pages; 3646 4191 int ret; 3647 4192 3648 4193 buf = strstrip(buf); 3649 - type = MEMFILE_TYPE(of_cft(of)->private); 3650 - name = MEMFILE_ATTR(of_cft(of)->private); 4194 + ret = page_counter_memparse(buf, &nr_pages); 4195 + if (ret) 4196 + return ret; 3651 4197 3652 - switch (name) { 4198 + switch (MEMFILE_ATTR(of_cft(of)->private)) { 3653 4199 case RES_LIMIT: 3654 4200 if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */ 3655 4201 ret = -EINVAL; 3656 4202 break; 3657 4203 } 3658 - /* This function does all necessary parse...reuse it */ 3659 - ret = res_counter_memparse_write_strategy(buf, &val); 3660 - if (ret) 4204 + switch (MEMFILE_TYPE(of_cft(of)->private)) { 4205 + case _MEM: 4206 + ret = mem_cgroup_resize_limit(memcg, nr_pages); 3661 4207 break; 3662 - if (type == _MEM) 3663 - ret = mem_cgroup_resize_limit(memcg, val); 3664 - else if (type == _MEMSWAP) 3665 - ret = mem_cgroup_resize_memsw_limit(memcg, val); 3666 - else if (type == _KMEM) 3667 - ret = memcg_update_kmem_limit(memcg, val); 3668 - else 3669 - return -EINVAL; 4208 + case _MEMSWAP: 4209 + ret = mem_cgroup_resize_memsw_limit(memcg, nr_pages); 4210 + break; 4211 + case _KMEM: 4212 + ret = memcg_update_kmem_limit(memcg, nr_pages); 4213 + break; 4214 + } 3670 
4215 break; 3671 4216 case RES_SOFT_LIMIT: 3672 - ret = res_counter_memparse_write_strategy(buf, &val); 3673 - if (ret) 3674 - break; 3675 - /* 3676 - * For memsw, soft limits are hard to implement in terms 3677 - * of semantics, for now, we support soft limits for 3678 - * control without swap 3679 - */ 3680 - if (type == _MEM) 3681 - ret = res_counter_set_soft_limit(&memcg->res, val); 3682 - else 3683 - ret = -EINVAL; 3684 - break; 3685 - default: 3686 - ret = -EINVAL; /* should be BUG() ? */ 4217 + memcg->soft_limit = nr_pages; 4218 + ret = 0; 3687 4219 break; 3688 4220 } 3689 4221 return ret ?: nbytes; 3690 - } 3691 - 3692 - static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg, 3693 - unsigned long long *mem_limit, unsigned long long *memsw_limit) 3694 - { 3695 - unsigned long long min_limit, min_memsw_limit, tmp; 3696 - 3697 - min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT); 3698 - min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT); 3699 - if (!memcg->use_hierarchy) 3700 - goto out; 3701 - 3702 - while (memcg->css.parent) { 3703 - memcg = mem_cgroup_from_css(memcg->css.parent); 3704 - if (!memcg->use_hierarchy) 3705 - break; 3706 - tmp = res_counter_read_u64(&memcg->res, RES_LIMIT); 3707 - min_limit = min(min_limit, tmp); 3708 - tmp = res_counter_read_u64(&memcg->memsw, RES_LIMIT); 3709 - min_memsw_limit = min(min_memsw_limit, tmp); 3710 - } 3711 - out: 3712 - *mem_limit = min_limit; 3713 - *memsw_limit = min_memsw_limit; 3714 4222 } 3715 4223 3716 4224 static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf, 3717 4225 size_t nbytes, loff_t off) 3718 4226 { 3719 4227 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 3720 - int name; 3721 - enum res_type type; 4228 + struct page_counter *counter; 3722 4229 3723 - type = MEMFILE_TYPE(of_cft(of)->private); 3724 - name = MEMFILE_ATTR(of_cft(of)->private); 4230 + switch (MEMFILE_TYPE(of_cft(of)->private)) { 4231 + case _MEM: 4232 + counter = 
&memcg->memory; 4233 + break; 4234 + case _MEMSWAP: 4235 + counter = &memcg->memsw; 4236 + break; 4237 + case _KMEM: 4238 + counter = &memcg->kmem; 4239 + break; 4240 + default: 4241 + BUG(); 4242 + } 3725 4243 3726 - switch (name) { 4244 + switch (MEMFILE_ATTR(of_cft(of)->private)) { 3727 4245 case RES_MAX_USAGE: 3728 - if (type == _MEM) 3729 - res_counter_reset_max(&memcg->res); 3730 - else if (type == _MEMSWAP) 3731 - res_counter_reset_max(&memcg->memsw); 3732 - else if (type == _KMEM) 3733 - res_counter_reset_max(&memcg->kmem); 3734 - else 3735 - return -EINVAL; 4246 + page_counter_reset_watermark(counter); 3736 4247 break; 3737 4248 case RES_FAILCNT: 3738 - if (type == _MEM) 3739 - res_counter_reset_failcnt(&memcg->res); 3740 - else if (type == _MEMSWAP) 3741 - res_counter_reset_failcnt(&memcg->memsw); 3742 - else if (type == _KMEM) 3743 - res_counter_reset_failcnt(&memcg->kmem); 3744 - else 3745 - return -EINVAL; 4249 + counter->failcnt = 0; 3746 4250 break; 4251 + default: 4252 + BUG(); 3747 4253 } 3748 4254 3749 4255 return nbytes; ··· 3799 4387 static int memcg_stat_show(struct seq_file *m, void *v) 3800 4388 { 3801 4389 struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); 4390 + unsigned long memory, memsw; 3802 4391 struct mem_cgroup *mi; 3803 4392 unsigned int i; 3804 4393 ··· 3819 4406 mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE); 3820 4407 3821 4408 /* Hierarchical information */ 3822 - { 3823 - unsigned long long limit, memsw_limit; 3824 - memcg_get_hierarchical_limit(memcg, &limit, &memsw_limit); 3825 - seq_printf(m, "hierarchical_memory_limit %llu\n", limit); 3826 - if (do_swap_account) 3827 - seq_printf(m, "hierarchical_memsw_limit %llu\n", 3828 - memsw_limit); 4409 + memory = memsw = PAGE_COUNTER_MAX; 4410 + for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) { 4411 + memory = min(memory, mi->memory.limit); 4412 + memsw = min(memsw, mi->memsw.limit); 3829 4413 } 4414 + seq_printf(m, "hierarchical_memory_limit %llu\n", 4415 + 
(u64)memory * PAGE_SIZE); 4416 + if (do_swap_account) 4417 + seq_printf(m, "hierarchical_memsw_limit %llu\n", 4418 + (u64)memsw * PAGE_SIZE); 3830 4419 3831 4420 for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { 3832 4421 long long val = 0; ··· 3912 4497 static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) 3913 4498 { 3914 4499 struct mem_cgroup_threshold_ary *t; 3915 - u64 usage; 4500 + unsigned long usage; 3916 4501 int i; 3917 4502 3918 4503 rcu_read_lock(); ··· 4011 4596 { 4012 4597 struct mem_cgroup_thresholds *thresholds; 4013 4598 struct mem_cgroup_threshold_ary *new; 4014 - u64 threshold, usage; 4599 + unsigned long threshold; 4600 + unsigned long usage; 4015 4601 int i, size, ret; 4016 4602 4017 - ret = res_counter_memparse_write_strategy(args, &threshold); 4603 + ret = page_counter_memparse(args, &threshold); 4018 4604 if (ret) 4019 4605 return ret; 4020 4606 ··· 4105 4689 { 4106 4690 struct mem_cgroup_thresholds *thresholds; 4107 4691 struct mem_cgroup_threshold_ary *new; 4108 - u64 usage; 4692 + unsigned long usage; 4109 4693 int i, j, size; 4110 4694 4111 4695 mutex_lock(&memcg->thresholds_lock); ··· 4271 4855 { 4272 4856 mem_cgroup_sockets_destroy(memcg); 4273 4857 } 4274 - 4275 - static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) 4276 - { 4277 - if (!memcg_kmem_is_active(memcg)) 4278 - return; 4279 - 4280 - /* 4281 - * kmem charges can outlive the cgroup. In the case of slab 4282 - * pages, for instance, a page contain objects from various 4283 - * processes. As we prevent from taking a reference for every 4284 - * such allocation we have to be careful when doing uncharge 4285 - * (see memcg_uncharge_kmem) and here during offlining. 4286 - * 4287 - * The idea is that that only the _last_ uncharge which sees 4288 - * the dead memcg will drop the last reference. An additional 4289 - * reference is taken here before the group is marked dead 4290 - * which is then paired with css_put during uncharge resp. here. 
4291 - * 4292 - * Although this might sound strange as this path is called from 4293 - * css_offline() when the referencemight have dropped down to 0 and 4294 - * shouldn't be incremented anymore (css_tryget_online() would 4295 - * fail) we do not have other options because of the kmem 4296 - * allocations lifetime. 4297 - */ 4298 - css_get(&memcg->css); 4299 - 4300 - memcg_kmem_mark_dead(memcg); 4301 - 4302 - if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0) 4303 - return; 4304 - 4305 - if (memcg_kmem_test_and_clear_dead(memcg)) 4306 - css_put(&memcg->css); 4307 - } 4308 4858 #else 4309 4859 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) 4310 4860 { ··· 4278 4896 } 4279 4897 4280 4898 static void memcg_destroy_kmem(struct mem_cgroup *memcg) 4281 - { 4282 - } 4283 - 4284 - static void kmem_cgroup_css_offline(struct mem_cgroup *memcg) 4285 4899 { 4286 4900 } 4287 4901 #endif ··· 4606 5228 #ifdef CONFIG_SLABINFO 4607 5229 { 4608 5230 .name = "kmem.slabinfo", 4609 - .seq_show = mem_cgroup_slabinfo_read, 5231 + .seq_start = slab_start, 5232 + .seq_next = slab_next, 5233 + .seq_stop = slab_stop, 5234 + .seq_show = memcg_slab_show, 4610 5235 }, 4611 5236 #endif 4612 5237 #endif ··· 4744 5363 */ 4745 5364 struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) 4746 5365 { 4747 - if (!memcg->res.parent) 5366 + if (!memcg->memory.parent) 4748 5367 return NULL; 4749 - return mem_cgroup_from_res_counter(memcg->res.parent, res); 5368 + return mem_cgroup_from_counter(memcg->memory.parent, memory); 4750 5369 } 4751 5370 EXPORT_SYMBOL(parent_mem_cgroup); 4752 5371 ··· 4791 5410 /* root ? 
*/ 4792 5411 if (parent_css == NULL) { 4793 5412 root_mem_cgroup = memcg; 4794 - res_counter_init(&memcg->res, NULL); 4795 - res_counter_init(&memcg->memsw, NULL); 4796 - res_counter_init(&memcg->kmem, NULL); 5413 + page_counter_init(&memcg->memory, NULL); 5414 + page_counter_init(&memcg->memsw, NULL); 5415 + page_counter_init(&memcg->kmem, NULL); 4797 5416 } 4798 5417 4799 5418 memcg->last_scanned_node = MAX_NUMNODES; ··· 4832 5451 memcg->swappiness = mem_cgroup_swappiness(parent); 4833 5452 4834 5453 if (parent->use_hierarchy) { 4835 - res_counter_init(&memcg->res, &parent->res); 4836 - res_counter_init(&memcg->memsw, &parent->memsw); 4837 - res_counter_init(&memcg->kmem, &parent->kmem); 5454 + page_counter_init(&memcg->memory, &parent->memory); 5455 + page_counter_init(&memcg->memsw, &parent->memsw); 5456 + page_counter_init(&memcg->kmem, &parent->kmem); 4838 5457 4839 5458 /* 4840 5459 * No need to take a reference to the parent because cgroup 4841 5460 * core guarantees its existence. 4842 5461 */ 4843 5462 } else { 4844 - res_counter_init(&memcg->res, NULL); 4845 - res_counter_init(&memcg->memsw, NULL); 4846 - res_counter_init(&memcg->kmem, NULL); 5463 + page_counter_init(&memcg->memory, NULL); 5464 + page_counter_init(&memcg->memsw, NULL); 5465 + page_counter_init(&memcg->kmem, NULL); 4847 5466 /* 4848 5467 * Deeper hierachy with use_hierarchy == false doesn't make 4849 5468 * much sense so let cgroup subsystem know about this ··· 4868 5487 return 0; 4869 5488 } 4870 5489 4871 - /* 4872 - * Announce all parents that a group from their hierarchy is gone. 4873 - */ 4874 - static void mem_cgroup_invalidate_reclaim_iterators(struct mem_cgroup *memcg) 4875 - { 4876 - struct mem_cgroup *parent = memcg; 4877 - 4878 - while ((parent = parent_mem_cgroup(parent))) 4879 - mem_cgroup_iter_invalidate(parent); 4880 - 4881 - /* 4882 - * if the root memcg is not hierarchical we have to check it 4883 - * explicitely. 
4884 - */ 4885 - if (!root_mem_cgroup->use_hierarchy) 4886 - mem_cgroup_iter_invalidate(root_mem_cgroup); 4887 - } 4888 - 4889 5490 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) 4890 5491 { 4891 5492 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4892 5493 struct mem_cgroup_event *event, *tmp; 4893 - struct cgroup_subsys_state *iter; 4894 5494 4895 5495 /* 4896 5496 * Unregister events and notify userspace. ··· 4885 5523 } 4886 5524 spin_unlock(&memcg->event_list_lock); 4887 5525 4888 - kmem_cgroup_css_offline(memcg); 4889 - 4890 - mem_cgroup_invalidate_reclaim_iterators(memcg); 4891 - 4892 - /* 4893 - * This requires that offlining is serialized. Right now that is 4894 - * guaranteed because css_killed_work_fn() holds the cgroup_mutex. 4895 - */ 4896 - css_for_each_descendant_post(iter, css) 4897 - mem_cgroup_reparent_charges(mem_cgroup_from_css(iter)); 4898 - 4899 5526 memcg_unregister_all_caches(memcg); 4900 5527 vmpressure_cleanup(&memcg->vmpressure); 4901 5528 } ··· 4892 5541 static void mem_cgroup_css_free(struct cgroup_subsys_state *css) 4893 5542 { 4894 5543 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4895 - /* 4896 - * XXX: css_offline() would be where we should reparent all 4897 - * memory to prepare the cgroup for destruction. However, 4898 - * memcg does not do css_tryget_online() and res_counter charging 4899 - * under the same RCU lock region, which means that charging 4900 - * could race with offlining. Offlining only happens to 4901 - * cgroups with no tasks in them but charges can show up 4902 - * without any tasks from the swapin path when the target 4903 - * memcg is looked up from the swapout record and not from the 4904 - * current task as it usually is. 
A race like this can leak 4905 - * charges and put pages with stale cgroup pointers into 4906 - * circulation: 4907 - * 4908 - * #0 #1 4909 - * lookup_swap_cgroup_id() 4910 - * rcu_read_lock() 4911 - * mem_cgroup_lookup() 4912 - * css_tryget_online() 4913 - * rcu_read_unlock() 4914 - * disable css_tryget_online() 4915 - * call_rcu() 4916 - * offline_css() 4917 - * reparent_charges() 4918 - * res_counter_charge() 4919 - * css_put() 4920 - * css_free() 4921 - * pc->mem_cgroup = dead memcg 4922 - * add page to lru 4923 - * 4924 - * The bulk of the charges are still moved in offline_css() to 4925 - * avoid pinning a lot of pages in case a long-term reference 4926 - * like a swapout record is deferring the css_free() to long 4927 - * after offlining. But this makes sure we catch any charges 4928 - * made after offlining: 4929 - */ 4930 - mem_cgroup_reparent_charges(memcg); 4931 5544 4932 5545 memcg_destroy_kmem(memcg); 4933 5546 __mem_cgroup_free(memcg); ··· 4914 5599 { 4915 5600 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4916 5601 4917 - mem_cgroup_resize_limit(memcg, ULLONG_MAX); 4918 - mem_cgroup_resize_memsw_limit(memcg, ULLONG_MAX); 4919 - memcg_update_kmem_limit(memcg, ULLONG_MAX); 4920 - res_counter_set_soft_limit(&memcg->res, ULLONG_MAX); 5602 + mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX); 5603 + mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX); 5604 + memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX); 5605 + memcg->soft_limit = 0; 4921 5606 } 4922 5607 4923 5608 #ifdef CONFIG_MMU ··· 5073 5758 unsigned long addr, pte_t ptent, union mc_target *target) 5074 5759 { 5075 5760 struct page *page = NULL; 5076 - struct page_cgroup *pc; 5077 5761 enum mc_target_type ret = MC_TARGET_NONE; 5078 5762 swp_entry_t ent = { .val = 0 }; 5079 5763 ··· 5086 5772 if (!page && !ent.val) 5087 5773 return ret; 5088 5774 if (page) { 5089 - pc = lookup_page_cgroup(page); 5090 5775 /* 5091 5776 * Do only loose check w/o serialization. 
5092 - * mem_cgroup_move_account() checks the pc is valid or 5777 + * mem_cgroup_move_account() checks the page is valid or 5093 5778 * not under LRU exclusion. 5094 5779 */ 5095 - if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) { 5780 + if (page->mem_cgroup == mc.from) { 5096 5781 ret = MC_TARGET_PAGE; 5097 5782 if (target) 5098 5783 target->page = page; ··· 5119 5806 unsigned long addr, pmd_t pmd, union mc_target *target) 5120 5807 { 5121 5808 struct page *page = NULL; 5122 - struct page_cgroup *pc; 5123 5809 enum mc_target_type ret = MC_TARGET_NONE; 5124 5810 5125 5811 page = pmd_page(pmd); 5126 5812 VM_BUG_ON_PAGE(!page || !PageHead(page), page); 5127 5813 if (!move_anon()) 5128 5814 return ret; 5129 - pc = lookup_page_cgroup(page); 5130 - if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) { 5815 + if (page->mem_cgroup == mc.from) { 5131 5816 ret = MC_TARGET_PAGE; 5132 5817 if (target) { 5133 5818 get_page(page); ··· 5208 5897 { 5209 5898 struct mem_cgroup *from = mc.from; 5210 5899 struct mem_cgroup *to = mc.to; 5211 - int i; 5212 5900 5213 5901 /* we must uncharge all the leftover precharges from mc.to */ 5214 5902 if (mc.precharge) { ··· 5226 5916 if (mc.moved_swap) { 5227 5917 /* uncharge swap account from the old cgroup */ 5228 5918 if (!mem_cgroup_is_root(mc.from)) 5229 - res_counter_uncharge(&mc.from->memsw, 5230 - PAGE_SIZE * mc.moved_swap); 5231 - 5232 - for (i = 0; i < mc.moved_swap; i++) 5233 - css_put(&mc.from->css); 5919 + page_counter_uncharge(&mc.from->memsw, mc.moved_swap); 5234 5920 5235 5921 /* 5236 - * we charged both to->res and to->memsw, so we should 5237 - * uncharge to->res. 5922 + * we charged both to->memory and to->memsw, so we 5923 + * should uncharge to->memory. 
5238 5924 */ 5239 5925 if (!mem_cgroup_is_root(mc.to)) 5240 - res_counter_uncharge(&mc.to->res, 5241 - PAGE_SIZE * mc.moved_swap); 5926 + page_counter_uncharge(&mc.to->memory, mc.moved_swap); 5927 + 5928 + css_put_many(&mc.from->css, mc.moved_swap); 5929 + 5242 5930 /* we've already done css_get(mc.to) */ 5243 5931 mc.moved_swap = 0; 5244 5932 } ··· 5247 5939 5248 5940 static void mem_cgroup_clear_mc(void) 5249 5941 { 5250 - struct mem_cgroup *from = mc.from; 5251 - 5252 5942 /* 5253 5943 * we must clear moving_task before waking up waiters at the end of 5254 5944 * task migration. ··· 5257 5951 mc.from = NULL; 5258 5952 mc.to = NULL; 5259 5953 spin_unlock(&mc.lock); 5260 - mem_cgroup_end_move(from); 5261 5954 } 5262 5955 5263 5956 static int mem_cgroup_can_attach(struct cgroup_subsys_state *css, ··· 5289 5984 VM_BUG_ON(mc.precharge); 5290 5985 VM_BUG_ON(mc.moved_charge); 5291 5986 VM_BUG_ON(mc.moved_swap); 5292 - mem_cgroup_start_move(from); 5987 + 5293 5988 spin_lock(&mc.lock); 5294 5989 mc.from = from; 5295 5990 mc.to = memcg; ··· 5309 6004 static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css, 5310 6005 struct cgroup_taskset *tset) 5311 6006 { 5312 - mem_cgroup_clear_mc(); 6007 + if (mc.to) 6008 + mem_cgroup_clear_mc(); 5313 6009 } 5314 6010 5315 6011 static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, ··· 5324 6018 enum mc_target_type target_type; 5325 6019 union mc_target target; 5326 6020 struct page *page; 5327 - struct page_cgroup *pc; 5328 6021 5329 6022 /* 5330 6023 * We don't take compound_lock() here but no race with splitting thp ··· 5344 6039 if (target_type == MC_TARGET_PAGE) { 5345 6040 page = target.page; 5346 6041 if (!isolate_lru_page(page)) { 5347 - pc = lookup_page_cgroup(page); 5348 6042 if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, 5349 - pc, mc.from, mc.to)) { 6043 + mc.from, mc.to)) { 5350 6044 mc.precharge -= HPAGE_PMD_NR; 5351 6045 mc.moved_charge += HPAGE_PMD_NR; 5352 6046 } ··· 5373 6069 page = target.page; 5374 
6070 if (isolate_lru_page(page)) 5375 6071 goto put; 5376 - pc = lookup_page_cgroup(page); 5377 - if (!mem_cgroup_move_account(page, 1, pc, 5378 - mc.from, mc.to)) { 6072 + if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) { 5379 6073 mc.precharge--; 5380 6074 /* we uncharge from mc.from later. */ 5381 6075 mc.moved_charge++; ··· 5417 6115 struct vm_area_struct *vma; 5418 6116 5419 6117 lru_add_drain_all(); 6118 + /* 6119 + * Signal mem_cgroup_begin_page_stat() to take the memcg's 6120 + * move_lock while we're moving its pages to another memcg. 6121 + * Then wait for already started RCU-only updates to finish. 6122 + */ 6123 + atomic_inc(&mc.from->moving_account); 6124 + synchronize_rcu(); 5420 6125 retry: 5421 6126 if (unlikely(!down_read_trylock(&mm->mmap_sem))) { 5422 6127 /* ··· 5456 6147 break; 5457 6148 } 5458 6149 up_read(&mm->mmap_sem); 6150 + atomic_dec(&mc.from->moving_account); 5459 6151 } 5460 6152 5461 6153 static void mem_cgroup_move_task(struct cgroup_subsys_state *css, ··· 5560 6250 */ 5561 6251 void mem_cgroup_swapout(struct page *page, swp_entry_t entry) 5562 6252 { 5563 - struct page_cgroup *pc; 6253 + struct mem_cgroup *memcg; 5564 6254 unsigned short oldid; 5565 6255 5566 6256 VM_BUG_ON_PAGE(PageLRU(page), page); ··· 5569 6259 if (!do_swap_account) 5570 6260 return; 5571 6261 5572 - pc = lookup_page_cgroup(page); 6262 + memcg = page->mem_cgroup; 5573 6263 5574 6264 /* Readahead page, never charged */ 5575 - if (!PageCgroupUsed(pc)) 6265 + if (!memcg) 5576 6266 return; 5577 6267 5578 - VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page); 5579 - 5580 - oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup)); 6268 + oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg)); 5581 6269 VM_BUG_ON_PAGE(oldid, page); 6270 + mem_cgroup_swap_statistics(memcg, true); 5582 6271 5583 - pc->flags &= ~PCG_MEMSW; 5584 - css_get(&pc->mem_cgroup->css); 5585 - mem_cgroup_swap_statistics(pc->mem_cgroup, true); 6272 + page->mem_cgroup = NULL; 6273 + 6274 + 
if (!mem_cgroup_is_root(memcg)) 6275 + page_counter_uncharge(&memcg->memory, 1); 6276 + 6277 + /* XXX: caller holds IRQ-safe mapping->tree_lock */ 6278 + VM_BUG_ON(!irqs_disabled()); 6279 + 6280 + mem_cgroup_charge_statistics(memcg, page, -1); 6281 + memcg_check_events(memcg, page); 5586 6282 } 5587 6283 5588 6284 /** ··· 5610 6294 memcg = mem_cgroup_lookup(id); 5611 6295 if (memcg) { 5612 6296 if (!mem_cgroup_is_root(memcg)) 5613 - res_counter_uncharge(&memcg->memsw, PAGE_SIZE); 6297 + page_counter_uncharge(&memcg->memsw, 1); 5614 6298 mem_cgroup_swap_statistics(memcg, false); 5615 6299 css_put(&memcg->css); 5616 6300 } ··· 5646 6330 goto out; 5647 6331 5648 6332 if (PageSwapCache(page)) { 5649 - struct page_cgroup *pc = lookup_page_cgroup(page); 5650 6333 /* 5651 6334 * Every swap fault against a single page tries to charge the 5652 6335 * page, bail as early as possible. shmem_unuse() encounters ··· 5653 6338 * the page lock, which serializes swap cache removal, which 5654 6339 * in turn serializes uncharging. 
5655 6340 */ 5656 - if (PageCgroupUsed(pc)) 6341 + if (page->mem_cgroup) 5657 6342 goto out; 5658 6343 } 5659 6344 ··· 5767 6452 } 5768 6453 5769 6454 static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout, 5770 - unsigned long nr_mem, unsigned long nr_memsw, 5771 6455 unsigned long nr_anon, unsigned long nr_file, 5772 6456 unsigned long nr_huge, struct page *dummy_page) 5773 6457 { 6458 + unsigned long nr_pages = nr_anon + nr_file; 5774 6459 unsigned long flags; 5775 6460 5776 6461 if (!mem_cgroup_is_root(memcg)) { 5777 - if (nr_mem) 5778 - res_counter_uncharge(&memcg->res, 5779 - nr_mem * PAGE_SIZE); 5780 - if (nr_memsw) 5781 - res_counter_uncharge(&memcg->memsw, 5782 - nr_memsw * PAGE_SIZE); 6462 + page_counter_uncharge(&memcg->memory, nr_pages); 6463 + if (do_swap_account) 6464 + page_counter_uncharge(&memcg->memsw, nr_pages); 5783 6465 memcg_oom_recover(memcg); 5784 6466 } 5785 6467 ··· 5785 6473 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_CACHE], nr_file); 5786 6474 __this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE], nr_huge); 5787 6475 __this_cpu_add(memcg->stat->events[MEM_CGROUP_EVENTS_PGPGOUT], pgpgout); 5788 - __this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file); 6476 + __this_cpu_add(memcg->stat->nr_page_events, nr_pages); 5789 6477 memcg_check_events(memcg, dummy_page); 5790 6478 local_irq_restore(flags); 6479 + 6480 + if (!mem_cgroup_is_root(memcg)) 6481 + css_put_many(&memcg->css, nr_pages); 5791 6482 } 5792 6483 5793 6484 static void uncharge_list(struct list_head *page_list) 5794 6485 { 5795 6486 struct mem_cgroup *memcg = NULL; 5796 - unsigned long nr_memsw = 0; 5797 6487 unsigned long nr_anon = 0; 5798 6488 unsigned long nr_file = 0; 5799 6489 unsigned long nr_huge = 0; 5800 6490 unsigned long pgpgout = 0; 5801 - unsigned long nr_mem = 0; 5802 6491 struct list_head *next; 5803 6492 struct page *page; 5804 6493 5805 6494 next = page_list->next; 5806 6495 do { 5807 6496 unsigned int nr_pages = 1; 5808 - 
struct page_cgroup *pc; 5809 6497 5810 6498 page = list_entry(next, struct page, lru); 5811 6499 next = page->lru.next; ··· 5813 6501 VM_BUG_ON_PAGE(PageLRU(page), page); 5814 6502 VM_BUG_ON_PAGE(page_count(page), page); 5815 6503 5816 - pc = lookup_page_cgroup(page); 5817 - if (!PageCgroupUsed(pc)) 6504 + if (!page->mem_cgroup) 5818 6505 continue; 5819 6506 5820 6507 /* 5821 6508 * Nobody should be changing or seriously looking at 5822 - * pc->mem_cgroup and pc->flags at this point, we have 5823 - * fully exclusive access to the page. 6509 + * page->mem_cgroup at this point, we have fully 6510 + * exclusive access to the page. 5824 6511 */ 5825 6512 5826 - if (memcg != pc->mem_cgroup) { 6513 + if (memcg != page->mem_cgroup) { 5827 6514 if (memcg) { 5828 - uncharge_batch(memcg, pgpgout, nr_mem, nr_memsw, 5829 - nr_anon, nr_file, nr_huge, page); 5830 - pgpgout = nr_mem = nr_memsw = 0; 5831 - nr_anon = nr_file = nr_huge = 0; 6515 + uncharge_batch(memcg, pgpgout, nr_anon, nr_file, 6516 + nr_huge, page); 6517 + pgpgout = nr_anon = nr_file = nr_huge = 0; 5832 6518 } 5833 - memcg = pc->mem_cgroup; 6519 + memcg = page->mem_cgroup; 5834 6520 } 5835 6521 5836 6522 if (PageTransHuge(page)) { ··· 5842 6532 else 5843 6533 nr_file += nr_pages; 5844 6534 5845 - if (pc->flags & PCG_MEM) 5846 - nr_mem += nr_pages; 5847 - if (pc->flags & PCG_MEMSW) 5848 - nr_memsw += nr_pages; 5849 - pc->flags = 0; 6535 + page->mem_cgroup = NULL; 5850 6536 5851 6537 pgpgout++; 5852 6538 } while (next != page_list); 5853 6539 5854 6540 if (memcg) 5855 - uncharge_batch(memcg, pgpgout, nr_mem, nr_memsw, 5856 - nr_anon, nr_file, nr_huge, page); 6541 + uncharge_batch(memcg, pgpgout, nr_anon, nr_file, 6542 + nr_huge, page); 5857 6543 } 5858 6544 5859 6545 /** ··· 5861 6555 */ 5862 6556 void mem_cgroup_uncharge(struct page *page) 5863 6557 { 5864 - struct page_cgroup *pc; 5865 - 5866 6558 if (mem_cgroup_disabled()) 5867 6559 return; 5868 6560 5869 6561 /* Don't touch page->lru of any random page, 
pre-check: */ 5870 - pc = lookup_page_cgroup(page); 5871 - if (!PageCgroupUsed(pc)) 6562 + if (!page->mem_cgroup) 5872 6563 return; 5873 6564 5874 6565 INIT_LIST_HEAD(&page->lru); ··· 5901 6598 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage, 5902 6599 bool lrucare) 5903 6600 { 5904 - struct page_cgroup *pc; 6601 + struct mem_cgroup *memcg; 5905 6602 int isolated; 5906 6603 5907 6604 VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); ··· 5916 6613 return; 5917 6614 5918 6615 /* Page cache replacement: new page already charged? */ 5919 - pc = lookup_page_cgroup(newpage); 5920 - if (PageCgroupUsed(pc)) 6616 + if (newpage->mem_cgroup) 5921 6617 return; 5922 6618 5923 - /* Re-entrant migration: old page already uncharged? */ 5924 - pc = lookup_page_cgroup(oldpage); 5925 - if (!PageCgroupUsed(pc)) 6619 + /* 6620 + * Swapcache readahead pages can get migrated before being 6621 + * charged, and migration from compaction can happen to an 6622 + * uncharged page when the PFN walker finds a page that 6623 + * reclaim just put back on the LRU but has not released yet. 6624 + */ 6625 + memcg = oldpage->mem_cgroup; 6626 + if (!memcg) 5926 6627 return; 5927 - 5928 - VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage); 5929 - VM_BUG_ON_PAGE(do_swap_account && !(pc->flags & PCG_MEMSW), oldpage); 5930 6628 5931 6629 if (lrucare) 5932 6630 lock_page_lru(oldpage, &isolated); 5933 6631 5934 - pc->flags = 0; 6632 + oldpage->mem_cgroup = NULL; 5935 6633 5936 6634 if (lrucare) 5937 6635 unlock_page_lru(oldpage, isolated); 5938 6636 5939 - commit_charge(newpage, pc->mem_cgroup, lrucare); 6637 + commit_charge(newpage, memcg, lrucare); 5940 6638 } 5941 6639 5942 6640 /*
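The mm/memcontrol.c hunks above replace the byte-based res_counter with the new page_counter (limits and usage tracked in pages, e.g. `page_counter_uncharge(&memcg->memsw, mc.moved_swap)`) and drop the separate `struct page_cgroup` in favour of a direct `page->mem_cgroup` pointer. A minimal userspace sketch of the hierarchical charge/unwind idea is below; the names loosely mirror the kernel API, but this is an illustration under simplified assumptions (no atomics, no locking, a plain parent chain), not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

struct page_counter {
    long count;                   /* pages currently charged */
    long limit;                   /* hard limit, in pages */
    struct page_counter *parent;  /* NULL for the root */
};

/* Charge nr_pages at this counter and every ancestor.  Returns 0 on
 * success; on failure, *fail points at the counter that hit its limit
 * and no ancestor is left charged (already-charged levels are unwound). */
static int page_counter_try_charge(struct page_counter *counter,
                                   unsigned long nr_pages,
                                   struct page_counter **fail)
{
    struct page_counter *c;

    for (c = counter; c; c = c->parent) {
        if (c->count + (long)nr_pages > c->limit) {
            for (struct page_counter *u = counter; u != c; u = u->parent)
                u->count -= nr_pages;
            *fail = c;
            return -1;
        }
        c->count += nr_pages;
    }
    return 0;
}

/* Uncharge nr_pages from this counter and every ancestor. */
static void page_counter_uncharge(struct page_counter *counter,
                                  unsigned long nr_pages)
{
    for (struct page_counter *c = counter; c; c = c->parent)
        c->count -= nr_pages;
}
```

A failed charge leaves every level untouched, which is why the uncharge paths in the diff can simply walk the same hierarchy.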
+2 -2
mm/memory-failure.c
··· 233 233 lru_add_drain_all(); 234 234 if (PageLRU(p)) 235 235 return; 236 - drain_all_pages(); 236 + drain_all_pages(page_zone(p)); 237 237 if (PageLRU(p) || is_free_buddy_page(p)) 238 238 return; 239 239 } ··· 1661 1661 if (!is_free_buddy_page(page)) 1662 1662 lru_add_drain_all(); 1663 1663 if (!is_free_buddy_page(page)) 1664 - drain_all_pages(); 1664 + drain_all_pages(page_zone(page)); 1665 1665 SetPageHWPoison(page); 1666 1666 if (!is_free_buddy_page(page)) 1667 1667 pr_info("soft offline: %#lx: page leaked\n",
+2 -2
mm/memory_hotplug.c
··· 1725 1725 if (drain) { 1726 1726 lru_add_drain_all(); 1727 1727 cond_resched(); 1728 - drain_all_pages(); 1728 + drain_all_pages(zone); 1729 1729 } 1730 1730 1731 1731 pfn = scan_movable_pages(start_pfn, end_pfn); ··· 1747 1747 lru_add_drain_all(); 1748 1748 yield(); 1749 1749 /* drain pcp pages, this is synchronous. */ 1750 - drain_all_pages(); 1750 + drain_all_pages(zone); 1751 1751 /* 1752 1752 * dissolve free hugepages in the memory block before doing offlining 1753 1753 * actually in order to make hugetlbfs's object counting consistent.
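The memory-failure.c and memory_hotplug.c hunks switch callers from the old parameterless `drain_all_pages()` to the zone-aware variant, so hot-unplug and hwpoison handling flush only the zone they care about. A userspace sketch of the `drain_pages()`/`drain_pages_zone()` split follows; `NCPU`/`NZONE` and the flat arrays are illustrative stand-ins for the kernel's per-cpu pagesets:

```c
#include <assert.h>

#define NCPU  2
#define NZONE 3

static int pcp_count[NCPU][NZONE];   /* pages cached per cpu, per zone */
static int freed_to_buddy[NZONE];    /* pages returned to the allocator */

/* Flush one CPU's cache for a single zone (drain_pages_zone analogue). */
static void drain_pages_zone(int cpu, int zone)
{
    if (pcp_count[cpu][zone]) {
        freed_to_buddy[zone] += pcp_count[cpu][zone];
        pcp_count[cpu][zone] = 0;
    }
}

/* zone < 0 means "all zones", mirroring drain_local_pages(NULL). */
static void drain_local_pages(int cpu, int zone)
{
    if (zone >= 0) {
        drain_pages_zone(cpu, zone);
    } else {
        for (int z = 0; z < NZONE; z++)
            drain_pages_zone(cpu, z);
    }
}
```

Draining a single zone leaves the other zones' caches warm, which is the point of threading the zone parameter through these call sites.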
+2 -2
mm/oom_kill.c
··· 119 119 120 120 /* return true if the task is not adequate as candidate victim task. */ 121 121 static bool oom_unkillable_task(struct task_struct *p, 122 - const struct mem_cgroup *memcg, const nodemask_t *nodemask) 122 + struct mem_cgroup *memcg, const nodemask_t *nodemask) 123 123 { 124 124 if (is_global_init(p)) 125 125 return true; ··· 353 353 * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes, 354 354 * swapents, oom_score_adj value, and name. 355 355 */ 356 - static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemask) 356 + static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) 357 357 { 358 358 struct task_struct *p; 359 359 struct task_struct *task;
+2 -2
mm/page-writeback.c
··· 2357 2357 dec_zone_page_state(page, NR_WRITEBACK); 2358 2358 inc_zone_page_state(page, NR_WRITTEN); 2359 2359 } 2360 - mem_cgroup_end_page_stat(memcg, locked, memcg_flags); 2360 + mem_cgroup_end_page_stat(memcg, &locked, &memcg_flags); 2361 2361 return ret; 2362 2362 } 2363 2363 ··· 2399 2399 mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); 2400 2400 inc_zone_page_state(page, NR_WRITEBACK); 2401 2401 } 2402 - mem_cgroup_end_page_stat(memcg, locked, memcg_flags); 2402 + mem_cgroup_end_page_stat(memcg, &locked, &memcg_flags); 2403 2403 return ret; 2404 2404 2405 2405 }
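The page-writeback.c hunks pass `&locked` and `&memcg_flags` to `mem_cgroup_end_page_stat()` because begin() records whether it actually took the memcg's move_lock (and the saved IRQ flags) through out-parameters, and end() must consume exactly those values. A small sketch of that contract, with illustrative types and a counter standing in for the real lock:

```c
#include <assert.h>
#include <stdbool.h>

struct memcg {
    int move_lock;   /* stand-in for the memcg move_lock */
    int moving;      /* nonzero while charges are being moved */
};

/* Lock only when a charge move is in flight; report the decision and
 * the saved "irq flags" through the out-parameters. */
static void begin_page_stat(struct memcg *m, bool *locked,
                            unsigned long *flags)
{
    *locked = m->moving != 0;
    if (*locked) {
        m->move_lock++;
        *flags = 1;   /* stand-in for saved irq flags */
    }
}

/* Undo exactly what begin() did, based on the values it recorded. */
static void end_page_stat(struct memcg *m, bool *locked,
                          unsigned long *flags)
{
    if (*locked) {
        (void)*flags; /* the kernel restores the saved flags here */
        m->move_lock--;
    }
}
```

Passing the values by pointer keeps begin() and end() symmetric even when the fast path takes no lock at all.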
+80 -57
mm/page_alloc.c
··· 48 48 #include <linux/backing-dev.h> 49 49 #include <linux/fault-inject.h> 50 50 #include <linux/page-isolation.h> 51 - #include <linux/page_cgroup.h> 52 51 #include <linux/debugobjects.h> 53 52 #include <linux/kmemleak.h> 54 53 #include <linux/compaction.h> ··· 640 641 bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"; 641 642 bad_flags = PAGE_FLAGS_CHECK_AT_FREE; 642 643 } 643 - if (unlikely(mem_cgroup_bad_page_check(page))) 644 - bad_reason = "cgroup check failed"; 644 + #ifdef CONFIG_MEMCG 645 + if (unlikely(page->mem_cgroup)) 646 + bad_reason = "page still charged to cgroup"; 647 + #endif 645 648 if (unlikely(bad_reason)) { 646 649 bad_page(page, bad_reason, bad_flags); 647 650 return 1; ··· 741 740 { 742 741 int i; 743 742 int bad = 0; 743 + 744 + VM_BUG_ON_PAGE(PageTail(page), page); 745 + VM_BUG_ON_PAGE(PageHead(page) && compound_order(page) != order, page); 744 746 745 747 trace_mm_page_free(page, order); 746 748 kmemcheck_free_shadow(page, order); ··· 902 898 bad_reason = "PAGE_FLAGS_CHECK_AT_PREP flag set"; 903 899 bad_flags = PAGE_FLAGS_CHECK_AT_PREP; 904 900 } 905 - if (unlikely(mem_cgroup_bad_page_check(page))) 906 - bad_reason = "cgroup check failed"; 901 + #ifdef CONFIG_MEMCG 902 + if (unlikely(page->mem_cgroup)) 903 + bad_reason = "page still charged to cgroup"; 904 + #endif 907 905 if (unlikely(bad_reason)) { 908 906 bad_page(page, bad_reason, bad_flags); 909 907 return 1; ··· 1273 1267 #endif 1274 1268 1275 1269 /* 1276 - * Drain pages of the indicated processor. 1270 + * Drain pcplists of the indicated processor and zone. 1271 + * 1272 + * The processor must either be the current processor and the 1273 + * thread pinned to the current processor or a processor that 1274 + * is not online. 
1275 + */ 1276 + static void drain_pages_zone(unsigned int cpu, struct zone *zone) 1277 + { 1278 + unsigned long flags; 1279 + struct per_cpu_pageset *pset; 1280 + struct per_cpu_pages *pcp; 1281 + 1282 + local_irq_save(flags); 1283 + pset = per_cpu_ptr(zone->pageset, cpu); 1284 + 1285 + pcp = &pset->pcp; 1286 + if (pcp->count) { 1287 + free_pcppages_bulk(zone, pcp->count, pcp); 1288 + pcp->count = 0; 1289 + } 1290 + local_irq_restore(flags); 1291 + } 1292 + 1293 + /* 1294 + * Drain pcplists of all zones on the indicated processor. 1277 1295 * 1278 1296 * The processor must either be the current processor and the 1279 1297 * thread pinned to the current processor or a processor that ··· 1305 1275 */ 1306 1276 static void drain_pages(unsigned int cpu) 1307 1277 { 1308 - unsigned long flags; 1309 1278 struct zone *zone; 1310 1279 1311 1280 for_each_populated_zone(zone) { 1312 - struct per_cpu_pageset *pset; 1313 - struct per_cpu_pages *pcp; 1314 - 1315 - local_irq_save(flags); 1316 - pset = per_cpu_ptr(zone->pageset, cpu); 1317 - 1318 - pcp = &pset->pcp; 1319 - if (pcp->count) { 1320 - free_pcppages_bulk(zone, pcp->count, pcp); 1321 - pcp->count = 0; 1322 - } 1323 - local_irq_restore(flags); 1281 + drain_pages_zone(cpu, zone); 1324 1282 } 1325 1283 } 1326 1284 1327 1285 /* 1328 1286 * Spill all of this CPU's per-cpu pages back into the buddy allocator. 1287 + * 1288 + * The CPU has to be pinned. When zone parameter is non-NULL, spill just 1289 + * the single zone's pages. 1329 1290 */ 1330 - void drain_local_pages(void *arg) 1291 + void drain_local_pages(struct zone *zone) 1331 1292 { 1332 - drain_pages(smp_processor_id()); 1293 + int cpu = smp_processor_id(); 1294 + 1295 + if (zone) 1296 + drain_pages_zone(cpu, zone); 1297 + else 1298 + drain_pages(cpu); 1333 1299 } 1334 1300 1335 1301 /* 1336 1302 * Spill all the per-cpu pages from all CPUs back into the buddy allocator. 1303 + * 1304 + * When zone parameter is non-NULL, spill just the single zone's pages. 
1337 1305 * 1338 1306 * Note that this code is protected against sending an IPI to an offline 1339 1307 * CPU but does not guarantee sending an IPI to newly hotplugged CPUs: ··· 1339 1311 * nothing keeps CPUs from showing up after we populated the cpumask and 1340 1312 * before the call to on_each_cpu_mask(). 1341 1313 */ 1342 - void drain_all_pages(void) 1314 + void drain_all_pages(struct zone *zone) 1343 1315 { 1344 1316 int cpu; 1345 - struct per_cpu_pageset *pcp; 1346 - struct zone *zone; 1347 1317 1348 1318 /* 1349 1319 * Allocate in the BSS so we wont require allocation in ··· 1356 1330 * disables preemption as part of its processing 1357 1331 */ 1358 1332 for_each_online_cpu(cpu) { 1333 + struct per_cpu_pageset *pcp; 1334 + struct zone *z; 1359 1335 bool has_pcps = false; 1360 - for_each_populated_zone(zone) { 1336 + 1337 + if (zone) { 1361 1338 pcp = per_cpu_ptr(zone->pageset, cpu); 1362 - if (pcp->pcp.count) { 1339 + if (pcp->pcp.count) 1363 1340 has_pcps = true; 1364 - break; 1341 + } else { 1342 + for_each_populated_zone(z) { 1343 + pcp = per_cpu_ptr(z->pageset, cpu); 1344 + if (pcp->pcp.count) { 1345 + has_pcps = true; 1346 + break; 1347 + } 1365 1348 } 1366 1349 } 1350 + 1367 1351 if (has_pcps) 1368 1352 cpumask_set_cpu(cpu, &cpus_with_pcps); 1369 1353 else 1370 1354 cpumask_clear_cpu(cpu, &cpus_with_pcps); 1371 1355 } 1372 - on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1); 1356 + on_each_cpu_mask(&cpus_with_pcps, (smp_call_func_t) drain_local_pages, 1357 + zone, 1); 1373 1358 } 1374 1359 1375 1360 #ifdef CONFIG_HIBERNATION ··· 1742 1705 unsigned long mark, int classzone_idx, int alloc_flags, 1743 1706 long free_pages) 1744 1707 { 1745 - /* free_pages my go negative - that's OK */ 1708 + /* free_pages may go negative - that's OK */ 1746 1709 long min = mark; 1747 1710 int o; 1748 1711 long free_cma = 0; ··· 2333 2296 int classzone_idx, int migratetype, enum migrate_mode mode, 2334 2297 int *contended_compaction, bool *deferred_compaction) 
2335 2298 { 2336 - struct zone *last_compact_zone = NULL; 2337 2299 unsigned long compact_result; 2338 2300 struct page *page; 2339 2301 ··· 2343 2307 compact_result = try_to_compact_pages(zonelist, order, gfp_mask, 2344 2308 nodemask, mode, 2345 2309 contended_compaction, 2346 - &last_compact_zone); 2310 + alloc_flags, classzone_idx); 2347 2311 current->flags &= ~PF_MEMALLOC; 2348 2312 2349 2313 switch (compact_result) { ··· 2362 2326 */ 2363 2327 count_vm_event(COMPACTSTALL); 2364 2328 2365 - /* Page migration frees to the PCP lists but we want merging */ 2366 - drain_pages(get_cpu()); 2367 - put_cpu(); 2368 - 2369 2329 page = get_page_from_freelist(gfp_mask, nodemask, 2370 2330 order, zonelist, high_zoneidx, 2371 2331 alloc_flags & ~ALLOC_NO_WATERMARKS, ··· 2375 2343 count_vm_event(COMPACTSUCCESS); 2376 2344 return page; 2377 2345 } 2378 - 2379 - /* 2380 - * last_compact_zone is where try_to_compact_pages thought allocation 2381 - * should succeed, so it did not defer compaction. But here we know 2382 - * that it didn't succeed, so we do the defer. 2383 - */ 2384 - if (last_compact_zone && mode != MIGRATE_ASYNC) 2385 - defer_compaction(last_compact_zone, order); 2386 2346 2387 2347 /* 2388 2348 * It's bad if compaction run occurs and fails. The most likely reason ··· 2457 2433 * pages are pinned on the per-cpu lists. Drain them and try again 2458 2434 */ 2459 2435 if (!page && !drained) { 2460 - drain_all_pages(); 2436 + drain_all_pages(NULL); 2461 2437 drained = true; 2462 2438 goto retry; 2463 2439 } ··· 3917 3893 else 3918 3894 page_group_by_mobility_disabled = 0; 3919 3895 3920 - printk("Built %i zonelists in %s order, mobility grouping %s. " 3896 + pr_info("Built %i zonelists in %s order, mobility grouping %s. " 3921 3897 "Total pages: %ld\n", 3922 3898 nr_online_nodes, 3923 3899 zonelist_order_name[current_zonelist_order], 3924 3900 page_group_by_mobility_disabled ? 
"off" : "on", 3925 3901 vm_total_pages); 3926 3902 #ifdef CONFIG_NUMA 3927 - printk("Policy zone: %s\n", zone_names[policy_zone]); 3903 + pr_info("Policy zone: %s\n", zone_names[policy_zone]); 3928 3904 #endif 3929 3905 } 3930 3906 ··· 4856 4832 #endif 4857 4833 init_waitqueue_head(&pgdat->kswapd_wait); 4858 4834 init_waitqueue_head(&pgdat->pfmemalloc_wait); 4859 - pgdat_page_cgroup_init(pgdat); 4860 4835 4861 4836 for (j = 0; j < MAX_NR_ZONES; j++) { 4862 4837 struct zone *zone = pgdat->node_zones + j; ··· 5357 5334 find_zone_movable_pfns_for_nodes(); 5358 5335 5359 5336 /* Print out the zone ranges */ 5360 - printk("Zone ranges:\n"); 5337 + pr_info("Zone ranges:\n"); 5361 5338 for (i = 0; i < MAX_NR_ZONES; i++) { 5362 5339 if (i == ZONE_MOVABLE) 5363 5340 continue; 5364 - printk(KERN_CONT " %-8s ", zone_names[i]); 5341 + pr_info(" %-8s ", zone_names[i]); 5365 5342 if (arch_zone_lowest_possible_pfn[i] == 5366 5343 arch_zone_highest_possible_pfn[i]) 5367 - printk(KERN_CONT "empty\n"); 5344 + pr_cont("empty\n"); 5368 5345 else 5369 - printk(KERN_CONT "[mem %0#10lx-%0#10lx]\n", 5346 + pr_cont("[mem %0#10lx-%0#10lx]\n", 5370 5347 arch_zone_lowest_possible_pfn[i] << PAGE_SHIFT, 5371 5348 (arch_zone_highest_possible_pfn[i] 5372 5349 << PAGE_SHIFT) - 1); 5373 5350 } 5374 5351 5375 5352 /* Print out the PFNs ZONE_MOVABLE begins at in each node */ 5376 - printk("Movable zone start for each node\n"); 5353 + pr_info("Movable zone start for each node\n"); 5377 5354 for (i = 0; i < MAX_NUMNODES; i++) { 5378 5355 if (zone_movable_pfn[i]) 5379 - printk(" Node %d: %#010lx\n", i, 5356 + pr_info(" Node %d: %#010lx\n", i, 5380 5357 zone_movable_pfn[i] << PAGE_SHIFT); 5381 5358 } 5382 5359 5383 5360 /* Print out the early node map */ 5384 - printk("Early memory node ranges\n"); 5361 + pr_info("Early memory node ranges\n"); 5385 5362 for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) 5386 - printk(" node %3d: [mem %#010lx-%#010lx]\n", nid, 5363 + pr_info(" node %3d: 
[mem %#010lx-%#010lx]\n", nid, 5387 5364 start_pfn << PAGE_SHIFT, (end_pfn << PAGE_SHIFT) - 1); 5388 5365 5389 5366 /* Initialise every node */ ··· 5519 5496 5520 5497 #undef adj_init_size 5521 5498 5522 - printk("Memory: %luK/%luK available " 5499 + pr_info("Memory: %luK/%luK available " 5523 5500 "(%luK kernel code, %luK rwdata, %luK rodata, " 5524 5501 "%luK init, %luK bss, %luK reserved" 5525 5502 #ifdef CONFIG_HIGHMEM ··· 6408 6385 */ 6409 6386 6410 6387 lru_add_drain_all(); 6411 - drain_all_pages(); 6388 + drain_all_pages(cc.zone); 6412 6389 6413 6390 order = 0; 6414 6391 outer_start = start;
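The reworked `drain_all_pages()` in mm/page_alloc.c builds a cpumask of only those CPUs whose per-cpu lists actually hold pages — checking the single requested zone when one is given, or every populated zone otherwise — before sending IPIs. A sketch of that filtering; the bitmask and arrays are illustrative (the kernel uses a static cpumask in BSS so the path needs no allocation while memory is tight):

```c
#include <assert.h>

#define NCPU  4
#define NZONE 2

static int pcp_count[NCPU][NZONE];   /* per-cpu, per-zone cached pages */

/* Return a bitmask of CPUs that would need draining; zone < 0 means
 * "any zone", mirroring drain_all_pages(NULL). */
static unsigned int cpus_with_pcps(int zone)
{
    unsigned int mask = 0;

    for (int cpu = 0; cpu < NCPU; cpu++) {
        int has_pcps = 0;

        if (zone >= 0) {
            has_pcps = pcp_count[cpu][zone] != 0;
        } else {
            for (int z = 0; z < NZONE && !has_pcps; z++)
                has_pcps = pcp_count[cpu][z] != 0;
        }
        if (has_pcps)
            mask |= 1u << cpu;
    }
    return mask;
}
```

CPUs with empty caches never appear in the mask, so they are never interrupted — the same reason the kernel version feeds the mask to `on_each_cpu_mask()` instead of broadcasting.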
-530
mm/page_cgroup.c
··· 1 - #include <linux/mm.h> 2 - #include <linux/mmzone.h> 3 - #include <linux/bootmem.h> 4 - #include <linux/bit_spinlock.h> 5 - #include <linux/page_cgroup.h> 6 - #include <linux/hash.h> 7 - #include <linux/slab.h> 8 - #include <linux/memory.h> 9 - #include <linux/vmalloc.h> 10 - #include <linux/cgroup.h> 11 - #include <linux/swapops.h> 12 - #include <linux/kmemleak.h> 13 - 14 - static unsigned long total_usage; 15 - 16 - #if !defined(CONFIG_SPARSEMEM) 17 - 18 - 19 - void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat) 20 - { 21 - pgdat->node_page_cgroup = NULL; 22 - } 23 - 24 - struct page_cgroup *lookup_page_cgroup(struct page *page) 25 - { 26 - unsigned long pfn = page_to_pfn(page); 27 - unsigned long offset; 28 - struct page_cgroup *base; 29 - 30 - base = NODE_DATA(page_to_nid(page))->node_page_cgroup; 31 - #ifdef CONFIG_DEBUG_VM 32 - /* 33 - * The sanity checks the page allocator does upon freeing a 34 - * page can reach here before the page_cgroup arrays are 35 - * allocated when feeding a range of pages to the allocator 36 - * for the first time during bootup or memory hotplug. 
37 - */ 38 - if (unlikely(!base)) 39 - return NULL; 40 - #endif 41 - offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn; 42 - return base + offset; 43 - } 44 - 45 - static int __init alloc_node_page_cgroup(int nid) 46 - { 47 - struct page_cgroup *base; 48 - unsigned long table_size; 49 - unsigned long nr_pages; 50 - 51 - nr_pages = NODE_DATA(nid)->node_spanned_pages; 52 - if (!nr_pages) 53 - return 0; 54 - 55 - table_size = sizeof(struct page_cgroup) * nr_pages; 56 - 57 - base = memblock_virt_alloc_try_nid_nopanic( 58 - table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), 59 - BOOTMEM_ALLOC_ACCESSIBLE, nid); 60 - if (!base) 61 - return -ENOMEM; 62 - NODE_DATA(nid)->node_page_cgroup = base; 63 - total_usage += table_size; 64 - return 0; 65 - } 66 - 67 - void __init page_cgroup_init_flatmem(void) 68 - { 69 - 70 - int nid, fail; 71 - 72 - if (mem_cgroup_disabled()) 73 - return; 74 - 75 - for_each_online_node(nid) { 76 - fail = alloc_node_page_cgroup(nid); 77 - if (fail) 78 - goto fail; 79 - } 80 - printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage); 81 - printk(KERN_INFO "please try 'cgroup_disable=memory' option if you" 82 - " don't want memory cgroups\n"); 83 - return; 84 - fail: 85 - printk(KERN_CRIT "allocation of page_cgroup failed.\n"); 86 - printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n"); 87 - panic("Out of memory"); 88 - } 89 - 90 - #else /* CONFIG_FLAT_NODE_MEM_MAP */ 91 - 92 - struct page_cgroup *lookup_page_cgroup(struct page *page) 93 - { 94 - unsigned long pfn = page_to_pfn(page); 95 - struct mem_section *section = __pfn_to_section(pfn); 96 - #ifdef CONFIG_DEBUG_VM 97 - /* 98 - * The sanity checks the page allocator does upon freeing a 99 - * page can reach here before the page_cgroup arrays are 100 - * allocated when feeding a range of pages to the allocator 101 - * for the first time during bootup or memory hotplug. 
102 - */ 103 - if (!section->page_cgroup) 104 - return NULL; 105 - #endif 106 - return section->page_cgroup + pfn; 107 - } 108 - 109 - static void *__meminit alloc_page_cgroup(size_t size, int nid) 110 - { 111 - gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN; 112 - void *addr = NULL; 113 - 114 - addr = alloc_pages_exact_nid(nid, size, flags); 115 - if (addr) { 116 - kmemleak_alloc(addr, size, 1, flags); 117 - return addr; 118 - } 119 - 120 - if (node_state(nid, N_HIGH_MEMORY)) 121 - addr = vzalloc_node(size, nid); 122 - else 123 - addr = vzalloc(size); 124 - 125 - return addr; 126 - } 127 - 128 - static int __meminit init_section_page_cgroup(unsigned long pfn, int nid) 129 - { 130 - struct mem_section *section; 131 - struct page_cgroup *base; 132 - unsigned long table_size; 133 - 134 - section = __pfn_to_section(pfn); 135 - 136 - if (section->page_cgroup) 137 - return 0; 138 - 139 - table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION; 140 - base = alloc_page_cgroup(table_size, nid); 141 - 142 - /* 143 - * The value stored in section->page_cgroup is (base - pfn) 144 - * and it does not point to the memory block allocated above, 145 - * causing kmemleak false positives. 146 - */ 147 - kmemleak_not_leak(base); 148 - 149 - if (!base) { 150 - printk(KERN_ERR "page cgroup allocation failure\n"); 151 - return -ENOMEM; 152 - } 153 - 154 - /* 155 - * The passed "pfn" may not be aligned to SECTION. For the calculation 156 - * we need to apply a mask. 
157 - */ 158 - pfn &= PAGE_SECTION_MASK; 159 - section->page_cgroup = base - pfn; 160 - total_usage += table_size; 161 - return 0; 162 - } 163 - #ifdef CONFIG_MEMORY_HOTPLUG 164 - static void free_page_cgroup(void *addr) 165 - { 166 - if (is_vmalloc_addr(addr)) { 167 - vfree(addr); 168 - } else { 169 - struct page *page = virt_to_page(addr); 170 - size_t table_size = 171 - sizeof(struct page_cgroup) * PAGES_PER_SECTION; 172 - 173 - BUG_ON(PageReserved(page)); 174 - kmemleak_free(addr); 175 - free_pages_exact(addr, table_size); 176 - } 177 - } 178 - 179 - static void __free_page_cgroup(unsigned long pfn) 180 - { 181 - struct mem_section *ms; 182 - struct page_cgroup *base; 183 - 184 - ms = __pfn_to_section(pfn); 185 - if (!ms || !ms->page_cgroup) 186 - return; 187 - base = ms->page_cgroup + pfn; 188 - free_page_cgroup(base); 189 - ms->page_cgroup = NULL; 190 - } 191 - 192 - static int __meminit online_page_cgroup(unsigned long start_pfn, 193 - unsigned long nr_pages, 194 - int nid) 195 - { 196 - unsigned long start, end, pfn; 197 - int fail = 0; 198 - 199 - start = SECTION_ALIGN_DOWN(start_pfn); 200 - end = SECTION_ALIGN_UP(start_pfn + nr_pages); 201 - 202 - if (nid == -1) { 203 - /* 204 - * In this case, "nid" already exists and contains valid memory. 205 - * "start_pfn" passed to us is a pfn which is an arg for 206 - * online__pages(), and start_pfn should exist. 
207 - */ 208 - nid = pfn_to_nid(start_pfn); 209 - VM_BUG_ON(!node_state(nid, N_ONLINE)); 210 - } 211 - 212 - for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) { 213 - if (!pfn_present(pfn)) 214 - continue; 215 - fail = init_section_page_cgroup(pfn, nid); 216 - } 217 - if (!fail) 218 - return 0; 219 - 220 - /* rollback */ 221 - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) 222 - __free_page_cgroup(pfn); 223 - 224 - return -ENOMEM; 225 - } 226 - 227 - static int __meminit offline_page_cgroup(unsigned long start_pfn, 228 - unsigned long nr_pages, int nid) 229 - { 230 - unsigned long start, end, pfn; 231 - 232 - start = SECTION_ALIGN_DOWN(start_pfn); 233 - end = SECTION_ALIGN_UP(start_pfn + nr_pages); 234 - 235 - for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) 236 - __free_page_cgroup(pfn); 237 - return 0; 238 - 239 - } 240 - 241 - static int __meminit page_cgroup_callback(struct notifier_block *self, 242 - unsigned long action, void *arg) 243 - { 244 - struct memory_notify *mn = arg; 245 - int ret = 0; 246 - switch (action) { 247 - case MEM_GOING_ONLINE: 248 - ret = online_page_cgroup(mn->start_pfn, 249 - mn->nr_pages, mn->status_change_nid); 250 - break; 251 - case MEM_OFFLINE: 252 - offline_page_cgroup(mn->start_pfn, 253 - mn->nr_pages, mn->status_change_nid); 254 - break; 255 - case MEM_CANCEL_ONLINE: 256 - offline_page_cgroup(mn->start_pfn, 257 - mn->nr_pages, mn->status_change_nid); 258 - break; 259 - case MEM_GOING_OFFLINE: 260 - break; 261 - case MEM_ONLINE: 262 - case MEM_CANCEL_OFFLINE: 263 - break; 264 - } 265 - 266 - return notifier_from_errno(ret); 267 - } 268 - 269 - #endif 270 - 271 - void __init page_cgroup_init(void) 272 - { 273 - unsigned long pfn; 274 - int nid; 275 - 276 - if (mem_cgroup_disabled()) 277 - return; 278 - 279 - for_each_node_state(nid, N_MEMORY) { 280 - unsigned long start_pfn, end_pfn; 281 - 282 - start_pfn = node_start_pfn(nid); 283 - end_pfn = node_end_pfn(nid); 284 - /* 285 - * start_pfn and end_pfn 
may not be aligned to SECTION and the 286 - * page->flags of out of node pages are not initialized. So we 287 - * scan [start_pfn, the biggest section's pfn < end_pfn) here. 288 - */ 289 - for (pfn = start_pfn; 290 - pfn < end_pfn; 291 - pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) { 292 - 293 - if (!pfn_valid(pfn)) 294 - continue; 295 - /* 296 - * Nodes's pfns can be overlapping. 297 - * We know some arch can have a nodes layout such as 298 - * -------------pfn--------------> 299 - * N0 | N1 | N2 | N0 | N1 | N2|.... 300 - */ 301 - if (pfn_to_nid(pfn) != nid) 302 - continue; 303 - if (init_section_page_cgroup(pfn, nid)) 304 - goto oom; 305 - } 306 - } 307 - hotplug_memory_notifier(page_cgroup_callback, 0); 308 - printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage); 309 - printk(KERN_INFO "please try 'cgroup_disable=memory' option if you " 310 - "don't want memory cgroups\n"); 311 - return; 312 - oom: 313 - printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n"); 314 - panic("Out of memory"); 315 - } 316 - 317 - void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat) 318 - { 319 - return; 320 - } 321 - 322 - #endif 323 - 324 - 325 - #ifdef CONFIG_MEMCG_SWAP 326 - 327 - static DEFINE_MUTEX(swap_cgroup_mutex); 328 - struct swap_cgroup_ctrl { 329 - struct page **map; 330 - unsigned long length; 331 - spinlock_t lock; 332 - }; 333 - 334 - static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES]; 335 - 336 - struct swap_cgroup { 337 - unsigned short id; 338 - }; 339 - #define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup)) 340 - 341 - /* 342 - * SwapCgroup implements "lookup" and "exchange" operations. 343 - * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge 344 - * against SwapCache. At swap_free(), this is accessed directly from swap. 345 - * 346 - * This means, 347 - * - we have no race in "exchange" when we're accessed via SwapCache because 348 - * SwapCache(and its swp_entry) is under lock. 
349 - * - When called via swap_free(), there is no user of this entry and no race. 350 - * Then, we don't need lock around "exchange". 351 - * 352 - * TODO: we can push these buffers out to HIGHMEM. 353 - */ 354 - 355 - /* 356 - * allocate buffer for swap_cgroup. 357 - */ 358 - static int swap_cgroup_prepare(int type) 359 - { 360 - struct page *page; 361 - struct swap_cgroup_ctrl *ctrl; 362 - unsigned long idx, max; 363 - 364 - ctrl = &swap_cgroup_ctrl[type]; 365 - 366 - for (idx = 0; idx < ctrl->length; idx++) { 367 - page = alloc_page(GFP_KERNEL | __GFP_ZERO); 368 - if (!page) 369 - goto not_enough_page; 370 - ctrl->map[idx] = page; 371 - } 372 - return 0; 373 - not_enough_page: 374 - max = idx; 375 - for (idx = 0; idx < max; idx++) 376 - __free_page(ctrl->map[idx]); 377 - 378 - return -ENOMEM; 379 - } 380 - 381 - static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent, 382 - struct swap_cgroup_ctrl **ctrlp) 383 - { 384 - pgoff_t offset = swp_offset(ent); 385 - struct swap_cgroup_ctrl *ctrl; 386 - struct page *mappage; 387 - struct swap_cgroup *sc; 388 - 389 - ctrl = &swap_cgroup_ctrl[swp_type(ent)]; 390 - if (ctrlp) 391 - *ctrlp = ctrl; 392 - 393 - mappage = ctrl->map[offset / SC_PER_PAGE]; 394 - sc = page_address(mappage); 395 - return sc + offset % SC_PER_PAGE; 396 - } 397 - 398 - /** 399 - * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry. 400 - * @ent: swap entry to be cmpxchged 401 - * @old: old id 402 - * @new: new id 403 - * 404 - * Returns old id at success, 0 at failure. 
405 - * (There is no mem_cgroup using 0 as its id) 406 - */ 407 - unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, 408 - unsigned short old, unsigned short new) 409 - { 410 - struct swap_cgroup_ctrl *ctrl; 411 - struct swap_cgroup *sc; 412 - unsigned long flags; 413 - unsigned short retval; 414 - 415 - sc = lookup_swap_cgroup(ent, &ctrl); 416 - 417 - spin_lock_irqsave(&ctrl->lock, flags); 418 - retval = sc->id; 419 - if (retval == old) 420 - sc->id = new; 421 - else 422 - retval = 0; 423 - spin_unlock_irqrestore(&ctrl->lock, flags); 424 - return retval; 425 - } 426 - 427 - /** 428 - * swap_cgroup_record - record mem_cgroup for this swp_entry. 429 - * @ent: swap entry to be recorded into 430 - * @id: mem_cgroup to be recorded 431 - * 432 - * Returns old value at success, 0 at failure. 433 - * (Of course, old value can be 0.) 434 - */ 435 - unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) 436 - { 437 - struct swap_cgroup_ctrl *ctrl; 438 - struct swap_cgroup *sc; 439 - unsigned short old; 440 - unsigned long flags; 441 - 442 - sc = lookup_swap_cgroup(ent, &ctrl); 443 - 444 - spin_lock_irqsave(&ctrl->lock, flags); 445 - old = sc->id; 446 - sc->id = id; 447 - spin_unlock_irqrestore(&ctrl->lock, flags); 448 - 449 - return old; 450 - } 451 - 452 - /** 453 - * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry 454 - * @ent: swap entry to be looked up. 455 - * 456 - * Returns ID of mem_cgroup at success. 0 at failure. 
(0 is invalid ID) 457 - */ 458 - unsigned short lookup_swap_cgroup_id(swp_entry_t ent) 459 - { 460 - return lookup_swap_cgroup(ent, NULL)->id; 461 - } 462 - 463 - int swap_cgroup_swapon(int type, unsigned long max_pages) 464 - { 465 - void *array; 466 - unsigned long array_size; 467 - unsigned long length; 468 - struct swap_cgroup_ctrl *ctrl; 469 - 470 - if (!do_swap_account) 471 - return 0; 472 - 473 - length = DIV_ROUND_UP(max_pages, SC_PER_PAGE); 474 - array_size = length * sizeof(void *); 475 - 476 - array = vzalloc(array_size); 477 - if (!array) 478 - goto nomem; 479 - 480 - ctrl = &swap_cgroup_ctrl[type]; 481 - mutex_lock(&swap_cgroup_mutex); 482 - ctrl->length = length; 483 - ctrl->map = array; 484 - spin_lock_init(&ctrl->lock); 485 - if (swap_cgroup_prepare(type)) { 486 - /* memory shortage */ 487 - ctrl->map = NULL; 488 - ctrl->length = 0; 489 - mutex_unlock(&swap_cgroup_mutex); 490 - vfree(array); 491 - goto nomem; 492 - } 493 - mutex_unlock(&swap_cgroup_mutex); 494 - 495 - return 0; 496 - nomem: 497 - printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n"); 498 - printk(KERN_INFO 499 - "swap_cgroup can be disabled by swapaccount=0 boot option\n"); 500 - return -ENOMEM; 501 - } 502 - 503 - void swap_cgroup_swapoff(int type) 504 - { 505 - struct page **map; 506 - unsigned long i, length; 507 - struct swap_cgroup_ctrl *ctrl; 508 - 509 - if (!do_swap_account) 510 - return; 511 - 512 - mutex_lock(&swap_cgroup_mutex); 513 - ctrl = &swap_cgroup_ctrl[type]; 514 - map = ctrl->map; 515 - length = ctrl->length; 516 - ctrl->map = NULL; 517 - ctrl->length = 0; 518 - mutex_unlock(&swap_cgroup_mutex); 519 - 520 - if (map) { 521 - for (i = 0; i < length; i++) { 522 - struct page *page = map[i]; 523 - if (page) 524 - __free_page(page); 525 - } 526 - vfree(map); 527 - } 528 - } 529 - 530 - #endif
+192
mm/page_counter.c
··· 1 + /* 2 + * Lockless hierarchical page accounting & limiting 3 + * 4 + * Copyright (C) 2014 Red Hat, Inc., Johannes Weiner 5 + */ 6 + 7 + #include <linux/page_counter.h> 8 + #include <linux/atomic.h> 9 + #include <linux/kernel.h> 10 + #include <linux/string.h> 11 + #include <linux/sched.h> 12 + #include <linux/bug.h> 13 + #include <asm/page.h> 14 + 15 + /** 16 + * page_counter_cancel - take pages out of the local counter 17 + * @counter: counter 18 + * @nr_pages: number of pages to cancel 19 + */ 20 + void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages) 21 + { 22 + long new; 23 + 24 + new = atomic_long_sub_return(nr_pages, &counter->count); 25 + /* More uncharges than charges? */ 26 + WARN_ON_ONCE(new < 0); 27 + } 28 + 29 + /** 30 + * page_counter_charge - hierarchically charge pages 31 + * @counter: counter 32 + * @nr_pages: number of pages to charge 33 + * 34 + * NOTE: This does not consider any configured counter limits. 35 + */ 36 + void page_counter_charge(struct page_counter *counter, unsigned long nr_pages) 37 + { 38 + struct page_counter *c; 39 + 40 + for (c = counter; c; c = c->parent) { 41 + long new; 42 + 43 + new = atomic_long_add_return(nr_pages, &c->count); 44 + /* 45 + * This is indeed racy, but we can live with some 46 + * inaccuracy in the watermark. 47 + */ 48 + if (new > c->watermark) 49 + c->watermark = new; 50 + } 51 + } 52 + 53 + /** 54 + * page_counter_try_charge - try to hierarchically charge pages 55 + * @counter: counter 56 + * @nr_pages: number of pages to charge 57 + * @fail: points first counter to hit its limit, if any 58 + * 59 + * Returns 0 on success, or -ENOMEM and @fail if the counter or one of 60 + * its ancestors has hit its configured limit. 
61 + */ 62 + int page_counter_try_charge(struct page_counter *counter, 63 + unsigned long nr_pages, 64 + struct page_counter **fail) 65 + { 66 + struct page_counter *c; 67 + 68 + for (c = counter; c; c = c->parent) { 69 + long new; 70 + /* 71 + * Charge speculatively to avoid an expensive CAS. If 72 + * a bigger charge fails, it might falsely lock out a 73 + * racing smaller charge and send it into reclaim 74 + * early, but the error is limited to the difference 75 + * between the two sizes, which is less than 2M/4M in 76 + * case of a THP locking out a regular page charge. 77 + * 78 + * The atomic_long_add_return() implies a full memory 79 + * barrier between incrementing the count and reading 80 + * the limit. When racing with page_counter_limit(), 81 + * we either see the new limit or the setter sees the 82 + * counter has changed and retries. 83 + */ 84 + new = atomic_long_add_return(nr_pages, &c->count); 85 + if (new > c->limit) { 86 + atomic_long_sub(nr_pages, &c->count); 87 + /* 88 + * This is racy, but we can live with some 89 + * inaccuracy in the failcnt. 90 + */ 91 + c->failcnt++; 92 + *fail = c; 93 + goto failed; 94 + } 95 + /* 96 + * Just like with failcnt, we can live with some 97 + * inaccuracy in the watermark. 
98 + */ 99 + if (new > c->watermark) 100 + c->watermark = new; 101 + } 102 + return 0; 103 + 104 + failed: 105 + for (c = counter; c != *fail; c = c->parent) 106 + page_counter_cancel(c, nr_pages); 107 + 108 + return -ENOMEM; 109 + } 110 + 111 + /** 112 + * page_counter_uncharge - hierarchically uncharge pages 113 + * @counter: counter 114 + * @nr_pages: number of pages to uncharge 115 + */ 116 + void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages) 117 + { 118 + struct page_counter *c; 119 + 120 + for (c = counter; c; c = c->parent) 121 + page_counter_cancel(c, nr_pages); 122 + } 123 + 124 + /** 125 + * page_counter_limit - limit the number of pages allowed 126 + * @counter: counter 127 + * @limit: limit to set 128 + * 129 + * Returns 0 on success, -EBUSY if the current number of pages on the 130 + * counter already exceeds the specified limit. 131 + * 132 + * The caller must serialize invocations on the same counter. 133 + */ 134 + int page_counter_limit(struct page_counter *counter, unsigned long limit) 135 + { 136 + for (;;) { 137 + unsigned long old; 138 + long count; 139 + 140 + /* 141 + * Update the limit while making sure that it's not 142 + * below the concurrently-changing counter value. 143 + * 144 + * The xchg implies two full memory barriers before 145 + * and after, so the read-swap-read is ordered and 146 + * ensures coherency with page_counter_try_charge(): 147 + * that function modifies the count before checking 148 + * the limit, so if it sees the old limit, we see the 149 + * modified counter and retry. 
150 + */ 151 + count = atomic_long_read(&counter->count); 152 + 153 + if (count > limit) 154 + return -EBUSY; 155 + 156 + old = xchg(&counter->limit, limit); 157 + 158 + if (atomic_long_read(&counter->count) <= count) 159 + return 0; 160 + 161 + counter->limit = old; 162 + cond_resched(); 163 + } 164 + } 165 + 166 + /** 167 + * page_counter_memparse - memparse() for page counter limits 168 + * @buf: string to parse 169 + * @nr_pages: returns the result in number of pages 170 + * 171 + * Returns -EINVAL, or 0 and @nr_pages on success. @nr_pages will be 172 + * limited to %PAGE_COUNTER_MAX. 173 + */ 174 + int page_counter_memparse(const char *buf, unsigned long *nr_pages) 175 + { 176 + char unlimited[] = "-1"; 177 + char *end; 178 + u64 bytes; 179 + 180 + if (!strncmp(buf, unlimited, sizeof(unlimited))) { 181 + *nr_pages = PAGE_COUNTER_MAX; 182 + return 0; 183 + } 184 + 185 + bytes = memparse(buf, &end); 186 + if (*end != '\0') 187 + return -EINVAL; 188 + 189 + *nr_pages = min(bytes / PAGE_SIZE, (u64)PAGE_COUNTER_MAX); 190 + 191 + return 0; 192 + }
+1 -1
mm/page_isolation.c
··· 68 68 69 69 spin_unlock_irqrestore(&zone->lock, flags); 70 70 if (!ret) 71 - drain_all_pages(); 71 + drain_all_pages(zone); 72 72 return ret; 73 73 } 74 74
+2 -2
mm/rmap.c
··· 1053 1053 __inc_zone_page_state(page, NR_FILE_MAPPED); 1054 1054 mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED); 1055 1055 } 1056 - mem_cgroup_end_page_stat(memcg, locked, flags); 1056 + mem_cgroup_end_page_stat(memcg, &locked, &flags); 1057 1057 } 1058 1058 1059 1059 static void page_remove_file_rmap(struct page *page) ··· 1083 1083 if (unlikely(PageMlocked(page))) 1084 1084 clear_page_mlock(page); 1085 1085 out: 1086 - mem_cgroup_end_page_stat(memcg, locked, flags); 1086 + mem_cgroup_end_page_stat(memcg, &locked, &flags); 1087 1087 } 1088 1088 1089 1089 /**
+10 -13
mm/slab.c
··· 2590 2590 * Be lazy and only check for valid flags here, keeping it out of the 2591 2591 * critical path in kmem_cache_alloc(). 2592 2592 */ 2593 - BUG_ON(flags & GFP_SLAB_BUG_MASK); 2593 + if (unlikely(flags & GFP_SLAB_BUG_MASK)) { 2594 + pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK); 2595 + BUG(); 2596 + } 2594 2597 local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); 2595 2598 2596 2599 /* Take the node list lock to change the colour_next on this node */ ··· 3583 3580 3584 3581 for_each_online_node(node) { 3585 3582 3586 - if (use_alien_caches) { 3587 - new_alien = alloc_alien_cache(node, cachep->limit, gfp); 3588 - if (!new_alien) 3589 - goto fail; 3590 - } 3583 + if (use_alien_caches) { 3584 + new_alien = alloc_alien_cache(node, cachep->limit, gfp); 3585 + if (!new_alien) 3586 + goto fail; 3587 + } 3591 3588 3592 3589 new_shared = NULL; 3593 3590 if (cachep->shared) { ··· 4046 4043 4047 4044 #ifdef CONFIG_DEBUG_SLAB_LEAK 4048 4045 4049 - static void *leaks_start(struct seq_file *m, loff_t *pos) 4050 - { 4051 - mutex_lock(&slab_mutex); 4052 - return seq_list_start(&slab_caches, *pos); 4053 - } 4054 - 4055 4046 static inline int add_caller(unsigned long *n, unsigned long v) 4056 4047 { 4057 4048 unsigned long *p; ··· 4167 4170 } 4168 4171 4169 4172 static const struct seq_operations slabstats_op = { 4170 - .start = leaks_start, 4173 + .start = slab_start, 4171 4174 .next = slab_next, 4172 4175 .stop = slab_stop, 4173 4176 .show = leaks_show,
+5 -3
mm/slab.h
··· 209 209 210 210 rcu_read_lock(); 211 211 params = rcu_dereference(s->memcg_params); 212 - cachep = params->memcg_caches[idx]; 213 - rcu_read_unlock(); 214 212 215 213 /* 216 214 * Make sure we will access the up-to-date value. The code updating 217 215 * memcg_caches issues a write barrier to match this (see 218 216 * memcg_register_cache()). 219 217 */ 220 - smp_read_barrier_depends(); 218 + cachep = lockless_dereference(params->memcg_caches[idx]); 219 + rcu_read_unlock(); 220 + 221 221 return cachep; 222 222 } 223 223 ··· 357 357 358 358 #endif 359 359 360 + void *slab_start(struct seq_file *m, loff_t *pos); 360 361 void *slab_next(struct seq_file *m, void *p, loff_t *pos); 361 362 void slab_stop(struct seq_file *m, void *p); 363 + int memcg_slab_show(struct seq_file *m, void *p); 362 364 363 365 #endif /* MM_SLAB_H */
+26 -16
mm/slab_common.c
··· 240 240 size = ALIGN(size, align); 241 241 flags = kmem_cache_flags(size, flags, name, NULL); 242 242 243 - list_for_each_entry(s, &slab_caches, list) { 243 + list_for_each_entry_reverse(s, &slab_caches, list) { 244 244 if (slab_unmergeable(s)) 245 245 continue; 246 246 ··· 811 811 #define SLABINFO_RIGHTS S_IRUSR 812 812 #endif 813 813 814 - void print_slabinfo_header(struct seq_file *m) 814 + static void print_slabinfo_header(struct seq_file *m) 815 815 { 816 816 /* 817 817 * Output format version, so at least we can change it ··· 834 834 seq_putc(m, '\n'); 835 835 } 836 836 837 - static void *s_start(struct seq_file *m, loff_t *pos) 837 + void *slab_start(struct seq_file *m, loff_t *pos) 838 838 { 839 - loff_t n = *pos; 840 - 841 839 mutex_lock(&slab_mutex); 842 - if (!n) 843 - print_slabinfo_header(m); 844 - 845 840 return seq_list_start(&slab_caches, *pos); 846 841 } 847 842 ··· 876 881 } 877 882 } 878 883 879 - int cache_show(struct kmem_cache *s, struct seq_file *m) 884 + static void cache_show(struct kmem_cache *s, struct seq_file *m) 880 885 { 881 886 struct slabinfo sinfo; 882 887 ··· 895 900 sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail); 896 901 slabinfo_show_stats(m, s); 897 902 seq_putc(m, '\n'); 898 - return 0; 899 903 } 900 904 901 - static int s_show(struct seq_file *m, void *p) 905 + static int slab_show(struct seq_file *m, void *p) 902 906 { 903 907 struct kmem_cache *s = list_entry(p, struct kmem_cache, list); 904 908 905 - if (!is_root_cache(s)) 906 - return 0; 907 - return cache_show(s, m); 909 + if (p == slab_caches.next) 910 + print_slabinfo_header(m); 911 + if (is_root_cache(s)) 912 + cache_show(s, m); 913 + return 0; 908 914 } 915 + 916 + #ifdef CONFIG_MEMCG_KMEM 917 + int memcg_slab_show(struct seq_file *m, void *p) 918 + { 919 + struct kmem_cache *s = list_entry(p, struct kmem_cache, list); 920 + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); 921 + 922 + if (p == slab_caches.next) 923 + print_slabinfo_header(m); 
924 + if (!is_root_cache(s) && s->memcg_params->memcg == memcg) 925 + cache_show(s, m); 926 + return 0; 927 + } 928 + #endif 909 929 910 930 /* 911 931 * slabinfo_op - iterator that generates /proc/slabinfo ··· 936 926 * + further values on SMP and with statistics enabled 937 927 */ 938 928 static const struct seq_operations slabinfo_op = { 939 - .start = s_start, 929 + .start = slab_start, 940 930 .next = slab_next, 941 931 .stop = slab_stop, 942 - .show = s_show, 932 + .show = slab_show, 943 933 }; 944 934 945 935 static int slabinfo_open(struct inode *inode, struct file *file)
+12 -9
mm/slub.c
··· 849 849 maxobj = order_objects(compound_order(page), s->size, s->reserved); 850 850 if (page->objects > maxobj) { 851 851 slab_err(s, page, "objects %u > max %u", 852 - s->name, page->objects, maxobj); 852 + page->objects, maxobj); 853 853 return 0; 854 854 } 855 855 if (page->inuse > page->objects) { 856 856 slab_err(s, page, "inuse %u > max %u", 857 - s->name, page->inuse, page->objects); 857 + page->inuse, page->objects); 858 858 return 0; 859 859 } 860 860 /* Slab_pad_check fixes things up after itself */ ··· 871 871 int nr = 0; 872 872 void *fp; 873 873 void *object = NULL; 874 - unsigned long max_objects; 874 + int max_objects; 875 875 876 876 fp = page->freelist; 877 877 while (fp && nr <= page->objects) { ··· 1377 1377 int order; 1378 1378 int idx; 1379 1379 1380 - BUG_ON(flags & GFP_SLAB_BUG_MASK); 1380 + if (unlikely(flags & GFP_SLAB_BUG_MASK)) { 1381 + pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK); 1382 + BUG(); 1383 + } 1381 1384 1382 1385 page = allocate_slab(s, 1383 1386 flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node); ··· 2557 2554 2558 2555 } else { /* Needs to be taken off a list */ 2559 2556 2560 - n = get_node(s, page_to_nid(page)); 2557 + n = get_node(s, page_to_nid(page)); 2561 2558 /* 2562 2559 * Speculatively acquire the list_lock. 2563 2560 * If the cmpxchg does not succeed then we may ··· 2590 2587 * The list lock was not taken therefore no list 2591 2588 * activity can be necessary. 2592 2589 */ 2593 - if (was_frozen) 2594 - stat(s, FREE_FROZEN); 2595 - return; 2596 - } 2590 + if (was_frozen) 2591 + stat(s, FREE_FROZEN); 2592 + return; 2593 + } 2597 2594 2598 2595 if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) 2599 2596 goto slab_empty;
+208
mm/swap_cgroup.c
··· 1 + #include <linux/swap_cgroup.h> 2 + #include <linux/vmalloc.h> 3 + #include <linux/mm.h> 4 + 5 + #include <linux/swapops.h> /* depends on mm.h include */ 6 + 7 + static DEFINE_MUTEX(swap_cgroup_mutex); 8 + struct swap_cgroup_ctrl { 9 + struct page **map; 10 + unsigned long length; 11 + spinlock_t lock; 12 + }; 13 + 14 + static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES]; 15 + 16 + struct swap_cgroup { 17 + unsigned short id; 18 + }; 19 + #define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup)) 20 + 21 + /* 22 + * SwapCgroup implements "lookup" and "exchange" operations. 23 + * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge 24 + * against SwapCache. At swap_free(), this is accessed directly from swap. 25 + * 26 + * This means, 27 + * - we have no race in "exchange" when we're accessed via SwapCache because 28 + * SwapCache(and its swp_entry) is under lock. 29 + * - When called via swap_free(), there is no user of this entry and no race. 30 + * Then, we don't need lock around "exchange". 31 + * 32 + * TODO: we can push these buffers out to HIGHMEM. 33 + */ 34 + 35 + /* 36 + * allocate buffer for swap_cgroup. 
37 + */ 38 + static int swap_cgroup_prepare(int type) 39 + { 40 + struct page *page; 41 + struct swap_cgroup_ctrl *ctrl; 42 + unsigned long idx, max; 43 + 44 + ctrl = &swap_cgroup_ctrl[type]; 45 + 46 + for (idx = 0; idx < ctrl->length; idx++) { 47 + page = alloc_page(GFP_KERNEL | __GFP_ZERO); 48 + if (!page) 49 + goto not_enough_page; 50 + ctrl->map[idx] = page; 51 + } 52 + return 0; 53 + not_enough_page: 54 + max = idx; 55 + for (idx = 0; idx < max; idx++) 56 + __free_page(ctrl->map[idx]); 57 + 58 + return -ENOMEM; 59 + } 60 + 61 + static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent, 62 + struct swap_cgroup_ctrl **ctrlp) 63 + { 64 + pgoff_t offset = swp_offset(ent); 65 + struct swap_cgroup_ctrl *ctrl; 66 + struct page *mappage; 67 + struct swap_cgroup *sc; 68 + 69 + ctrl = &swap_cgroup_ctrl[swp_type(ent)]; 70 + if (ctrlp) 71 + *ctrlp = ctrl; 72 + 73 + mappage = ctrl->map[offset / SC_PER_PAGE]; 74 + sc = page_address(mappage); 75 + return sc + offset % SC_PER_PAGE; 76 + } 77 + 78 + /** 79 + * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry. 80 + * @ent: swap entry to be cmpxchged 81 + * @old: old id 82 + * @new: new id 83 + * 84 + * Returns old id at success, 0 at failure. 85 + * (There is no mem_cgroup using 0 as its id) 86 + */ 87 + unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, 88 + unsigned short old, unsigned short new) 89 + { 90 + struct swap_cgroup_ctrl *ctrl; 91 + struct swap_cgroup *sc; 92 + unsigned long flags; 93 + unsigned short retval; 94 + 95 + sc = lookup_swap_cgroup(ent, &ctrl); 96 + 97 + spin_lock_irqsave(&ctrl->lock, flags); 98 + retval = sc->id; 99 + if (retval == old) 100 + sc->id = new; 101 + else 102 + retval = 0; 103 + spin_unlock_irqrestore(&ctrl->lock, flags); 104 + return retval; 105 + } 106 + 107 + /** 108 + * swap_cgroup_record - record mem_cgroup for this swp_entry. 
109 + * @ent: swap entry to be recorded into 110 + * @id: mem_cgroup to be recorded 111 + * 112 + * Returns old value at success, 0 at failure. 113 + * (Of course, old value can be 0.) 114 + */ 115 + unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) 116 + { 117 + struct swap_cgroup_ctrl *ctrl; 118 + struct swap_cgroup *sc; 119 + unsigned short old; 120 + unsigned long flags; 121 + 122 + sc = lookup_swap_cgroup(ent, &ctrl); 123 + 124 + spin_lock_irqsave(&ctrl->lock, flags); 125 + old = sc->id; 126 + sc->id = id; 127 + spin_unlock_irqrestore(&ctrl->lock, flags); 128 + 129 + return old; 130 + } 131 + 132 + /** 133 + * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry 134 + * @ent: swap entry to be looked up. 135 + * 136 + * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID) 137 + */ 138 + unsigned short lookup_swap_cgroup_id(swp_entry_t ent) 139 + { 140 + return lookup_swap_cgroup(ent, NULL)->id; 141 + } 142 + 143 + int swap_cgroup_swapon(int type, unsigned long max_pages) 144 + { 145 + void *array; 146 + unsigned long array_size; 147 + unsigned long length; 148 + struct swap_cgroup_ctrl *ctrl; 149 + 150 + if (!do_swap_account) 151 + return 0; 152 + 153 + length = DIV_ROUND_UP(max_pages, SC_PER_PAGE); 154 + array_size = length * sizeof(void *); 155 + 156 + array = vzalloc(array_size); 157 + if (!array) 158 + goto nomem; 159 + 160 + ctrl = &swap_cgroup_ctrl[type]; 161 + mutex_lock(&swap_cgroup_mutex); 162 + ctrl->length = length; 163 + ctrl->map = array; 164 + spin_lock_init(&ctrl->lock); 165 + if (swap_cgroup_prepare(type)) { 166 + /* memory shortage */ 167 + ctrl->map = NULL; 168 + ctrl->length = 0; 169 + mutex_unlock(&swap_cgroup_mutex); 170 + vfree(array); 171 + goto nomem; 172 + } 173 + mutex_unlock(&swap_cgroup_mutex); 174 + 175 + return 0; 176 + nomem: 177 + printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n"); 178 + printk(KERN_INFO 179 + "swap_cgroup can be disabled by swapaccount=0 boot 
option\n"); 180 + return -ENOMEM; 181 + } 182 + 183 + void swap_cgroup_swapoff(int type) 184 + { 185 + struct page **map; 186 + unsigned long i, length; 187 + struct swap_cgroup_ctrl *ctrl; 188 + 189 + if (!do_swap_account) 190 + return; 191 + 192 + mutex_lock(&swap_cgroup_mutex); 193 + ctrl = &swap_cgroup_ctrl[type]; 194 + map = ctrl->map; 195 + length = ctrl->length; 196 + ctrl->map = NULL; 197 + ctrl->length = 0; 198 + mutex_unlock(&swap_cgroup_mutex); 199 + 200 + if (map) { 201 + for (i = 0; i < length; i++) { 202 + struct page *page = map[i]; 203 + if (page) 204 + __free_page(page); 205 + } 206 + vfree(map); 207 + } 208 + }
-1
mm/swap_state.c
··· 17 17 #include <linux/blkdev.h> 18 18 #include <linux/pagevec.h> 19 19 #include <linux/migrate.h> 20 - #include <linux/page_cgroup.h> 21 20 22 21 #include <asm/pgtable.h> 23 22
+1 -1
mm/swapfile.c
··· 38 38 #include <asm/pgtable.h> 39 39 #include <asm/tlbflush.h> 40 40 #include <linux/swapops.h> 41 - #include <linux/page_cgroup.h> 41 + #include <linux/swap_cgroup.h> 42 42 43 43 static bool swap_count_continued(struct swap_info_struct *, pgoff_t, 44 44 unsigned char);
+1 -2
mm/vmalloc.c
··· 463 463 goto retry; 464 464 } 465 465 if (printk_ratelimit()) 466 - printk(KERN_WARNING 467 - "vmap allocation for size %lu failed: " 466 + pr_warn("vmap allocation for size %lu failed: " 468 467 "use vmalloc=<size> to increase size.\n", size); 469 468 kfree(va); 470 469 return ERR_PTR(-EBUSY);
+9 -9
mm/vmscan.c
··· 260 260 do_div(delta, lru_pages + 1); 261 261 total_scan += delta; 262 262 if (total_scan < 0) { 263 - printk(KERN_ERR 264 - "shrink_slab: %pF negative objects to delete nr=%ld\n", 263 + pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", 265 264 shrinker->scan_objects, total_scan); 266 265 total_scan = freeable; 267 266 } ··· 874 875 * end of the LRU a second time. 875 876 */ 876 877 mapping = page_mapping(page); 877 - if ((mapping && bdi_write_congested(mapping->backing_dev_info)) || 878 + if (((dirty || writeback) && mapping && 879 + bdi_write_congested(mapping->backing_dev_info)) || 878 880 (writeback && PageReclaim(page))) 879 881 nr_congested++; 880 882 ··· 2249 2249 return true; 2250 2250 2251 2251 /* If compaction would go ahead or the allocation would succeed, stop */ 2252 - switch (compaction_suitable(zone, sc->order)) { 2252 + switch (compaction_suitable(zone, sc->order, 0, 0)) { 2253 2253 case COMPACT_PARTIAL: 2254 2254 case COMPACT_CONTINUE: 2255 2255 return false; ··· 2346 2346 * If compaction is not ready to start and allocation is not likely 2347 2347 * to succeed without it, then keep reclaiming. 2348 2348 */ 2349 - if (compaction_suitable(zone, order) == COMPACT_SKIPPED) 2349 + if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED) 2350 2350 return false; 2351 2351 2352 2352 return watermark_ok; ··· 2824 2824 balance_gap, classzone_idx, 0)) 2825 2825 return false; 2826 2826 2827 - if (IS_ENABLED(CONFIG_COMPACTION) && order && 2828 - compaction_suitable(zone, order) == COMPACT_SKIPPED) 2827 + if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone, 2828 + order, 0, classzone_idx) == COMPACT_SKIPPED) 2829 2829 return false; 2830 2830 2831 2831 return true; ··· 2952 2952 * from memory. Do not reclaim more than needed for compaction. 
2953 2953 */ 2954 2954 if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && 2955 - compaction_suitable(zone, sc->order) != 2956 - COMPACT_SKIPPED) 2955 + compaction_suitable(zone, sc->order, 0, classzone_idx) 2956 + != COMPACT_SKIPPED) 2957 2957 testorder = 0; 2958 2958 2959 2959 /*
+44 -43
net/ipv4/tcp_memcontrol.c
··· 9 9 int tcp_init_cgroup(struct mem_cgroup *memcg, struct cgroup_subsys *ss) 10 10 { 11 11 /* 12 - * The root cgroup does not use res_counters, but rather, 12 + * The root cgroup does not use page_counters, but rather, 13 13 * rely on the data already collected by the network 14 14 * subsystem 15 15 */ 16 - struct res_counter *res_parent = NULL; 17 - struct cg_proto *cg_proto, *parent_cg; 18 16 struct mem_cgroup *parent = parent_mem_cgroup(memcg); 17 + struct page_counter *counter_parent = NULL; 18 + struct cg_proto *cg_proto, *parent_cg; 19 19 20 20 cg_proto = tcp_prot.proto_cgroup(memcg); 21 21 if (!cg_proto) ··· 29 29 30 30 parent_cg = tcp_prot.proto_cgroup(parent); 31 31 if (parent_cg) 32 - res_parent = &parent_cg->memory_allocated; 32 + counter_parent = &parent_cg->memory_allocated; 33 33 34 - res_counter_init(&cg_proto->memory_allocated, res_parent); 34 + page_counter_init(&cg_proto->memory_allocated, counter_parent); 35 35 percpu_counter_init(&cg_proto->sockets_allocated, 0, GFP_KERNEL); 36 36 37 37 return 0; ··· 50 50 } 51 51 EXPORT_SYMBOL(tcp_destroy_cgroup); 52 52 53 - static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) 53 + static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages) 54 54 { 55 55 struct cg_proto *cg_proto; 56 56 int i; ··· 60 60 if (!cg_proto) 61 61 return -EINVAL; 62 62 63 - if (val > RES_COUNTER_MAX) 64 - val = RES_COUNTER_MAX; 65 - 66 - ret = res_counter_set_limit(&cg_proto->memory_allocated, val); 63 + ret = page_counter_limit(&cg_proto->memory_allocated, nr_pages); 67 64 if (ret) 68 65 return ret; 69 66 70 67 for (i = 0; i < 3; i++) 71 - cg_proto->sysctl_mem[i] = min_t(long, val >> PAGE_SHIFT, 68 + cg_proto->sysctl_mem[i] = min_t(long, nr_pages, 72 69 sysctl_tcp_mem[i]); 73 70 74 - if (val == RES_COUNTER_MAX) 71 + if (nr_pages == PAGE_COUNTER_MAX) 75 72 clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags); 76 - else if (val != RES_COUNTER_MAX) { 73 + else { 77 74 /* 78 75 * The active bit needs to be 
written after the static_key 79 76 * update. This is what guarantees that the socket activation ··· 99 102 return 0; 100 103 } 101 104 105 + enum { 106 + RES_USAGE, 107 + RES_LIMIT, 108 + RES_MAX_USAGE, 109 + RES_FAILCNT, 110 + }; 111 + 112 + static DEFINE_MUTEX(tcp_limit_mutex); 113 + 102 114 static ssize_t tcp_cgroup_write(struct kernfs_open_file *of, 103 115 char *buf, size_t nbytes, loff_t off) 104 116 { 105 117 struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 106 - unsigned long long val; 118 + unsigned long nr_pages; 107 119 int ret = 0; 108 120 109 121 buf = strstrip(buf); ··· 120 114 switch (of_cft(of)->private) { 121 115 case RES_LIMIT: 122 116 /* see memcontrol.c */ 123 - ret = res_counter_memparse_write_strategy(buf, &val); 117 + ret = page_counter_memparse(buf, &nr_pages); 124 118 if (ret) 125 119 break; 126 - ret = tcp_update_limit(memcg, val); 120 + mutex_lock(&tcp_limit_mutex); 121 + ret = tcp_update_limit(memcg, nr_pages); 122 + mutex_unlock(&tcp_limit_mutex); 127 123 break; 128 124 default: 129 125 ret = -EINVAL; ··· 134 126 return ret ?: nbytes; 135 127 } 136 128 137 - static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val) 138 - { 139 - struct cg_proto *cg_proto; 140 - 141 - cg_proto = tcp_prot.proto_cgroup(memcg); 142 - if (!cg_proto) 143 - return default_val; 144 - 145 - return res_counter_read_u64(&cg_proto->memory_allocated, type); 146 - } 147 - 148 - static u64 tcp_read_usage(struct mem_cgroup *memcg) 149 - { 150 - struct cg_proto *cg_proto; 151 - 152 - cg_proto = tcp_prot.proto_cgroup(memcg); 153 - if (!cg_proto) 154 - return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT; 155 - 156 - return res_counter_read_u64(&cg_proto->memory_allocated, RES_USAGE); 157 - } 158 - 159 129 static u64 tcp_cgroup_read(struct cgroup_subsys_state *css, struct cftype *cft) 160 130 { 161 131 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 132 + struct cg_proto *cg_proto = tcp_prot.proto_cgroup(memcg); 162 133 u64 val; 
163 134 164 135 switch (cft->private) { 165 136 case RES_LIMIT: 166 - val = tcp_read_stat(memcg, RES_LIMIT, RES_COUNTER_MAX); 137 + if (!cg_proto) 138 + return PAGE_COUNTER_MAX; 139 + val = cg_proto->memory_allocated.limit; 140 + val *= PAGE_SIZE; 167 141 break; 168 142 case RES_USAGE: 169 - val = tcp_read_usage(memcg); 143 + if (!cg_proto) 144 + val = atomic_long_read(&tcp_memory_allocated); 145 + else 146 + val = page_counter_read(&cg_proto->memory_allocated); 147 + val *= PAGE_SIZE; 170 148 break; 171 149 case RES_FAILCNT: 150 + if (!cg_proto) 151 + return 0; 152 + val = cg_proto->memory_allocated.failcnt; 153 + break; 172 154 case RES_MAX_USAGE: 173 - val = tcp_read_stat(memcg, cft->private, 0); 155 + if (!cg_proto) 156 + return 0; 157 + val = cg_proto->memory_allocated.watermark; 158 + val *= PAGE_SIZE; 174 159 break; 175 160 default: 176 161 BUG(); ··· 184 183 185 184 switch (of_cft(of)->private) { 186 185 case RES_MAX_USAGE: 187 - res_counter_reset_max(&cg_proto->memory_allocated); 186 + page_counter_reset_watermark(&cg_proto->memory_allocated); 188 187 break; 189 188 case RES_FAILCNT: 190 - res_counter_reset_failcnt(&cg_proto->memory_allocated); 189 + cg_proto->memory_allocated.failcnt = 0; 191 190 break; 192 191 } 193 192
+192 -71
scripts/checkpatch.pl
··· 7 7
8 8 use strict;
9 9 use POSIX;
10 + use File::Basename;
11 + use Cwd 'abs_path';
10 12
11 13 my $P = $0;
12 - $P =~ s@(.*)/@@g;
13 - my $D = $1;
14 + my $D = dirname(abs_path($P));
14 15
15 16 my $V = '0.32';
16 17
··· 439 438
440 439 # Load common spelling mistakes and build regular expression list.
441 440 my $misspellings;
442 - my @spelling_list;
443 441 my %spelling_fix;
444 - open(my $spelling, '<', $spelling_file)
445 -     or die "$P: Can't open $spelling_file for reading: $!\n";
446 - while (<$spelling>) {
447 - 	my $line = $_;
448 442
449 - 	$line =~ s/\s*\n?$//g;
450 - 	$line =~ s/^\s*//g;
443 + if (open(my $spelling, '<', $spelling_file)) {
444 + 	my @spelling_list;
445 + 	while (<$spelling>) {
446 + 		my $line = $_;
451 447
452 - 	next if ($line =~ m/^\s*#/);
453 - 	next if ($line =~ m/^\s*$/);
448 + 		$line =~ s/\s*\n?$//g;
449 + 		$line =~ s/^\s*//g;
454 450
455 - 	my ($suspect, $fix) = split(/\|\|/, $line);
451 + 		next if ($line =~ m/^\s*#/);
452 + 		next if ($line =~ m/^\s*$/);
456 453
457 - 	push(@spelling_list, $suspect);
458 - 	$spelling_fix{$suspect} = $fix;
454 + 		my ($suspect, $fix) = split(/\|\|/, $line);
455 +
456 + 		push(@spelling_list, $suspect);
457 + 		$spelling_fix{$suspect} = $fix;
458 + 	}
459 + 	close($spelling);
460 + 	$misspellings = join("|", @spelling_list);
461 + } else {
462 + 	warn "No typos will be found - file '$spelling_file': $!\n";
459 463 }
460 - close($spelling);
461 - $misspellings = join("|", @spelling_list);
462 464
463 465 sub build_types {
464 466 	my $mods = "(?x: \n" . join("|\n ", @modifierList) . "\n)";
··· 946 942 sub get_quoted_string {
947 943 	my ($line, $rawline) = @_;
948 944
949 - 	return "" if ($line !~ m/(\"[X]+\")/g);
945 + 	return "" if ($line !~ m/(\"[X\t]+\")/g);
950 946 	return substr($rawline, $-[0], $+[0] - $-[0]);
951 947 }
952 948
··· 1847 1843 	my $non_utf8_charset = 0;
1848 1844
1849 1845 	my $last_blank_line = 0;
1846 + 	my $last_coalesced_string_linenr = -1;
1850 1847
1851 1848 	our @report = ();
1852 1849 	our $cnt_lines = 0;
··· 2083 2078 			$in_commit_log = 0;
2084 2079 		}
2085 2080
2081 + # Check if MAINTAINERS is being updated. If so, there's probably no need to
2082 + # emit the "does MAINTAINERS need updating?" message on file add/move/delete
2083 + 		if ($line =~ /^\s*MAINTAINERS\s*\|/) {
2084 + 			$reported_maintainer_file = 1;
2085 + 		}
2086 +
2086 2087 # Check signature styles
2087 2088 		if (!$in_header_lines &&
2088 2089 		    $line =~ /^(\s*)([a-z0-9_-]+by:|$signature_tags)(\s*)(.*)/i) {
··· 2257 2246 	}
2258 2247
2259 2248 # Check for various typo / spelling mistakes
2260 - 	if ($in_commit_log || $line =~ /^\+/) {
2249 + 	if (defined($misspellings) && ($in_commit_log || $line =~ /^\+/)) {
2261 2250 		while ($rawline =~ /(?:^|[^a-z@])($misspellings)(?:$|[^a-z@])/gi) {
2262 2251 			my $typo = $1;
2263 2252 			my $typo_fix = $spelling_fix{lc($typo)};
··· 2414 2403 			     "line over $max_line_length characters\n" . $herecurr);
2415 2404 		}
2416 2405
2417 - # Check for user-visible strings broken across lines, which breaks the ability
2418 - # to grep for the string. Make exceptions when the previous string ends in a
2419 - # newline (multiple lines in one string constant) or '\t', '\r', ';', or '{'
2420 - # (common in inline assembly) or is a octal \123 or hexadecimal \xaf value
2421 - 		if ($line =~ /^\+\s*"/ &&
2422 - 		    $prevline =~ /"\s*$/ &&
2423 - 		    $prevrawline !~ /(?:\\(?:[ntr]|[0-7]{1,3}|x[0-9a-fA-F]{1,2})|;\s*|\{\s*)"\s*$/) {
2424 - 			WARN("SPLIT_STRING",
2425 - 			     "quoted string split across lines\n" . $hereprev);
2426 - 		}
2427 -
2428 - # check for missing a space in a string concatination
2429 - 		if ($prevrawline =~ /[^\\]\w"$/ && $rawline =~ /^\+[\t ]+"\w/) {
2430 - 			WARN('MISSING_SPACE',
2431 - 			     "break quoted strings at a space character\n" . $hereprev);
2432 - 		}
2433 -
2434 - # check for spaces before a quoted newline
2435 - 		if ($rawline =~ /^.*\".*\s\\n/) {
2436 - 			if (WARN("QUOTED_WHITESPACE_BEFORE_NEWLINE",
2437 - 				 "unnecessary whitespace before a quoted newline\n" . $herecurr) &&
2438 - 			    $fix) {
2439 - 				$fixed[$fixlinenr] =~ s/^(\+.*\".*)\s+\\n/$1\\n/;
2440 - 			}
2441 -
2442 - 		}
2443 -
2444 2406 # check for adding lines without a newline.
2445 2407 		if ($line =~ /^\+/ && defined $lines[$linenr] && $lines[$linenr] =~ /^\\ No newline at end of file/) {
2446 2408 			WARN("MISSING_EOF_NEWLINE",
··· 2499 2515 			}
2500 2516 		}
2501 2517
2502 - 		if ($line =~ /^\+.*\(\s*$Type\s*\)[ \t]+(?!$Assignment|$Arithmetic|{)/) {
2518 + 		if ($line =~ /^\+.*(\w+\s*)?\(\s*$Type\s*\)[ \t]+(?!$Assignment|$Arithmetic|[,;\({\[\<\>])/ &&
2519 + 		    (!defined($1) || $1 !~ /sizeof\s*/)) {
2503 2520 			if (CHK("SPACING",
2504 2521 				"No space is necessary after a cast\n" . $herecurr) &&
2505 2522 			    $fix) {
··· 3548 3563 		}
3549 3564 	}
3550 3565
3551 - # , must have a space on the right.
3566 + # , must not have a space before and must have a space on the right.
3552 3567 			} elsif ($op eq ',') {
3568 + 				my $rtrim_before = 0;
3569 + 				my $space_after = 0;
3570 + 				if ($ctx =~ /Wx./) {
3571 + 					if (ERROR("SPACING",
3572 + 						  "space prohibited before that '$op' $at\n" . $hereptr)) {
3573 + 						$line_fixed = 1;
3574 + 						$rtrim_before = 1;
3575 + 					}
3576 + 				}
3553 3577 				if ($ctx !~ /.x[WEC]/ && $cc !~ /^}/) {
3554 3578 					if (ERROR("SPACING",
3555 3579 						  "space required after that '$op' $at\n" . $hereptr)) {
3556 - 						$good = $fix_elements[$n] . trim($fix_elements[$n + 1]) . " ";
3557 3580 						$line_fixed = 1;
3558 3581 						$last_after = $n;
3582 + 						$space_after = 1;
3583 + 					}
3584 + 				}
3585 + 				if ($rtrim_before || $space_after) {
3586 + 					if ($rtrim_before) {
3587 + 						$good = rtrim($fix_elements[$n]) . trim($fix_elements[$n + 1]);
3588 + 					} else {
3589 + 						$good = $fix_elements[$n] . trim($fix_elements[$n + 1]);
3590 + 					}
3591 + 					if ($space_after) {
3592 + 						$good .= " ";
3559 3593 					}
3560 3594 				}
3561 3595
··· 3818 3814 # ie: &(foo->bar) should be &foo->bar and *(foo->bar) should be *foo->bar
3819 3815
3820 3816 		while ($line =~ /(?:[^&]&\s*|\*)\(\s*($Ident\s*(?:$Member\s*)+)\s*\)/g) {
3821 - 			CHK("UNNECESSARY_PARENTHESES",
3822 - 			    "Unnecessary parentheses around $1\n" . $herecurr);
3823 - 		}
3817 + 			my $var = $1;
3818 + 			if (CHK("UNNECESSARY_PARENTHESES",
3819 + 				"Unnecessary parentheses around $var\n" . $herecurr) &&
3820 + 			    $fix) {
3821 + 				$fixed[$fixlinenr] =~ s/\(\s*\Q$var\E\s*\)/$var/;
3822 + 			}
3823 + 		}
3824 +
3825 + # check for unnecessary parentheses around function pointer uses
3826 + # ie: (foo->bar)(); should be foo->bar();
3827 + # but not "if (foo->bar) (" to avoid some false positives
3828 + 		if ($line =~ /(\bif\s*|)(\(\s*$Ident\s*(?:$Member\s*)+\))[ \t]*\(/ && $1 !~ /^if/) {
3829 + 			my $var = $2;
3830 + 			if (CHK("UNNECESSARY_PARENTHESES",
3831 + 				"Unnecessary parentheses around function pointer $var\n" . $herecurr) &&
3832 + 			    $fix) {
3833 + 				my $var2 = deparenthesize($var);
3834 + 				$var2 =~ s/\s//g;
3835 + 				$fixed[$fixlinenr] =~ s/\Q$var\E/$var2/;
3836 + 			}
3837 + 		}
3824 3838
3825 3839 #goto labels aren't indented, allow a single space however
3826 3840 		if ($line=~/^.\s+[A-Za-z\d_]+:(?![0-9]+)/ and
··· 4078 4056 #Ignore Page<foo> variants
4079 4057 			    $var !~ /^(?:Clear|Set|TestClear|TestSet|)Page[A-Z]/ &&
4080 4058 #Ignore SI style variants like nS, mV and dB (ie: max_uV, regulator_min_uA_show)
4081 - 			    $var !~ /^(?:[a-z_]*?)_?[a-z][A-Z](?:_[a-z_]+)?$/) {
4059 + 			    $var !~ /^(?:[a-z_]*?)_?[a-z][A-Z](?:_[a-z_]+)?$/ &&
4060 + #Ignore some three character SI units explicitly, like MiB and KHz
4061 + 			    $var !~ /^(?:[a-z_]*?)_?(?:[KMGT]iB|[KMGT]?Hz)(?:_[a-z_]+)?$/) {
4082 4062 				while ($var =~ m{($Ident)}g) {
4083 4063 					my $word = $1;
4084 4064 					next if ($word !~ /[A-Z][a-z]|[a-z][A-Z]/);
··· 4432 4408 				"Use of volatile is usually wrong: see Documentation/volatile-considered-harmful.txt\n" . $herecurr);
4433 4409 		}
4434 4410
4411 + # Check for user-visible strings broken across lines, which breaks the ability
4412 + # to grep for the string. Make exceptions when the previous string ends in a
4413 + # newline (multiple lines in one string constant) or '\t', '\r', ';', or '{'
4414 + # (common in inline assembly) or is a octal \123 or hexadecimal \xaf value
4415 + 		if ($line =~ /^\+\s*"[X\t]*"/ &&
4416 + 		    $prevline =~ /"\s*$/ &&
4417 + 		    $prevrawline !~ /(?:\\(?:[ntr]|[0-7]{1,3}|x[0-9a-fA-F]{1,2})|;\s*|\{\s*)"\s*$/) {
4418 + 			if (WARN("SPLIT_STRING",
4419 + 				 "quoted string split across lines\n" . $hereprev) &&
4420 + 			    $fix &&
4421 + 			    $prevrawline =~ /^\+.*"\s*$/ &&
4422 + 			    $last_coalesced_string_linenr != $linenr - 1) {
4423 + 				my $extracted_string = get_quoted_string($line, $rawline);
4424 + 				my $comma_close = "";
4425 + 				if ($rawline =~ /\Q$extracted_string\E(\s*\)\s*;\s*$|\s*,\s*)/) {
4426 + 					$comma_close = $1;
4427 + 				}
4428 +
4429 + 				fix_delete_line($fixlinenr - 1, $prevrawline);
4430 + 				fix_delete_line($fixlinenr, $rawline);
4431 + 				my $fixedline = $prevrawline;
4432 + 				$fixedline =~ s/"\s*$//;
4433 + 				$fixedline .= substr($extracted_string, 1) . trim($comma_close);
4434 + 				fix_insert_line($fixlinenr - 1, $fixedline);
4435 + 				$fixedline = $rawline;
4436 + 				$fixedline =~ s/\Q$extracted_string\E\Q$comma_close\E//;
4437 + 				if ($fixedline !~ /\+\s*$/) {
4438 + 					fix_insert_line($fixlinenr, $fixedline);
4439 + 				}
4440 + 				$last_coalesced_string_linenr = $linenr;
4441 + 			}
4442 + 		}
4443 +
4444 + # check for missing a space in a string concatenation
4445 + 		if ($prevrawline =~ /[^\\]\w"$/ && $rawline =~ /^\+[\t ]+"\w/) {
4446 + 			WARN('MISSING_SPACE',
4447 + 			     "break quoted strings at a space character\n" . $hereprev);
4448 + 		}
4449 +
4450 + # check for spaces before a quoted newline
4451 + 		if ($rawline =~ /^.*\".*\s\\n/) {
4452 + 			if (WARN("QUOTED_WHITESPACE_BEFORE_NEWLINE",
4453 + 				 "unnecessary whitespace before a quoted newline\n" . $herecurr) &&
4454 + 			    $fix) {
4455 + 				$fixed[$fixlinenr] =~ s/^(\+.*\".*)\s+\\n/$1\\n/;
4456 + 			}
4457 +
4458 + 		}
4459 +
4435 4460 # concatenated string without spaces between elements
4436 4461 		if ($line =~ /"X+"[A-Z_]+/ || $line =~ /[A-Z_]+"X+"/) {
4437 4462 			CHK("CONCATENATED_STRING",
4438 4463 			    "Concatenated strings should use spaces between elements\n" . $herecurr);
4464 + 		}
4465 +
4466 + # uncoalesced string fragments
4467 + 		if ($line =~ /"X*"\s*"/) {
4468 + 			WARN("STRING_FRAGMENTS",
4469 + 			     "Consecutive strings are generally better as a single string\n" . $herecurr);
4470 + 		}
4471 +
4472 + # check for %L{u,d,i} in strings
4473 + 		my $string;
4474 + 		while ($line =~ /(?:^|")([X\t]*)(?:"|$)/g) {
4475 + 			$string = substr($rawline, $-[1], $+[1] - $-[1]);
4476 + 			$string =~ s/%%/__/g;
4477 + 			if ($string =~ /(?<!%)%L[udi]/) {
4478 + 				WARN("PRINTF_L",
4479 + 				     "\%Ld/%Lu are not-standard C, use %lld/%llu\n" . $herecurr);
4480 + 				last;
4481 + 			}
4482 + 		}
4483 +
4484 + # check for line continuations in quoted strings with odd counts of "
4485 + 		if ($rawline =~ /\\$/ && $rawline =~ tr/"/"/ % 2) {
4486 + 			WARN("LINE_CONTINUATIONS",
4487 + 			     "Avoid line continuations in quoted strings\n" . $herecurr);
4439 4488 		}
4440 4489
4441 4490 # warn about #if 0
··· 4523 4426 			my $expr = '\s*\(\s*' . quotemeta($1) . '\s*\)\s*;';
4524 4427 			if ($line =~ /\b(kfree|usb_free_urb|debugfs_remove(?:_recursive)?)$expr/) {
4525 4428 				WARN('NEEDLESS_IF',
4526 - 				     "$1(NULL) is safe this check is probably not required\n" . $hereprev);
4429 + 				     "$1(NULL) is safe and this check is probably not required\n" . $hereprev);
4527 4430 			}
4528 4431 		}
4529 4432
··· 4552 4455 			    "Possible unnecessary $level\n" . $herecurr) &&
4553 4456 			    $fix) {
4554 4457 				$fixed[$fixlinenr] =~ s/\s*$level\s*//;
4458 + 			}
4459 + 		}
4460 +
4461 + # check for mask then right shift without a parentheses
4462 + 		if ($^V && $^V ge 5.10.0 &&
4463 + 		    $line =~ /$LvalOrFunc\s*\&\s*($LvalOrFunc)\s*>>/ &&
4464 + 		    $4 !~ /^\&/) { # $LvalOrFunc may be &foo, ignore if so
4465 + 			WARN("MASK_THEN_SHIFT",
4466 + 			     "Possible precedence defect with mask then right shift - may need parentheses\n" . $herecurr);
4467 + 		}
4468 +
4469 + # check for pointer comparisons to NULL
4470 + 		if ($^V && $^V ge 5.10.0) {
4471 + 			while ($line =~ /\b$LvalOrFunc\s*(==|\!=)\s*NULL\b/g) {
4472 + 				my $val = $1;
4473 + 				my $equal = "!";
4474 + 				$equal = "" if ($4 eq "!=");
4475 + 				if (CHK("COMPARISON_TO_NULL",
4476 + 					"Comparison to NULL could be written \"${equal}${val}\"\n" . $herecurr) &&
4477 + 				    $fix) {
4478 + 					$fixed[$fixlinenr] =~ s/\b\Q$val\E\s*(?:==|\!=)\s*NULL\b/$equal$val/;
4479 + 				}
4555 4480 			}
4556 4481 		}
4557 4482
··· 4771 4652 		}
4772 4653 	}
4773 4654
4655 + # Check for __attribute__ weak, or __weak declarations (may have link issues)
4656 + 		if ($^V && $^V ge 5.10.0 &&
4657 + 		    $line =~ /(?:$Declare|$DeclareMisordered)\s*$Ident\s*$balanced_parens\s*(?:$Attribute)?\s*;/ &&
4658 + 		    ($line =~ /\b__attribute__\s*\(\s*\(.*\bweak\b/ ||
4659 + 		     $line =~ /\b__weak\b/)) {
4660 + 			ERROR("WEAK_DECLARATION",
4661 + 			      "Using weak declarations can have unintended link defects\n" . $herecurr);
4662 + 		}
4663 +
4774 4664 # check for sizeof(&)
4775 4665 		if ($line =~ /\bsizeof\s*\(\s*\&/) {
4776 4666 			WARN("SIZEOF_ADDRESS",
··· 4793 4665 			    $fix) {
4794 4666 				$fixed[$fixlinenr] =~ s/\bsizeof\s+((?:\*\s*|)$Lval|$Type(?:\s+$Lval|))/"sizeof(" . trim($1) . ")"/ex;
4795 4667 			}
4796 - 		}
4797 -
4798 - # check for line continuations in quoted strings with odd counts of "
4799 - 		if ($rawline =~ /\\$/ && $rawline =~ tr/"/"/ % 2) {
4800 - 			WARN("LINE_CONTINUATIONS",
4801 - 			     "Avoid line continuations in quoted strings\n" . $herecurr);
4802 4668 		}
4803 4669
4804 4670 # check for struct spinlock declarations
··· 5030 4908 			}
5031 4909 		}
5032 4910
4911 + # check for #defines like: 1 << <digit> that could be BIT(digit)
4912 + 		if ($line =~ /#\s*define\s+\w+\s+\(?\s*1\s*([ulUL]*)\s*\<\<\s*(?:\d+|$Ident)\s*\)?/) {
4913 + 			my $ull = "";
4914 + 			$ull = "_ULL" if (defined($1) && $1 =~ /ll/i);
4915 + 			if (CHK("BIT_MACRO",
4916 + 				"Prefer using the BIT$ull macro\n" . $herecurr) &&
4917 + 			    $fix) {
4918 + 				$fixed[$fixlinenr] =~ s/\(?\s*1\s*[ulUL]*\s*<<\s*(\d+|$Ident)\s*\)?/BIT${ull}($1)/;
4919 + 			}
4920 + 		}
4921 +
5033 4922 # check for case / default statements not preceded by break/fallthrough/switch
5034 4923 		if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) {
5035 4924 			my $has_break = 0;
··· 5202 5069 		if ($line =~ /\+\s*#\s*define\s+((?:__)?ARCH_(?:HAS|HAVE)\w*)\b/) {
5203 5070 			ERROR("DEFINE_ARCH_HAS",
5204 5071 			      "#define of '$1' is wrong - use Kconfig variables or standard guards instead\n" . $herecurr);
5205 5072 		}
5206 -
5207 - # check for %L{u,d,i} in strings
5208 - 		my $string;
5209 - 		while ($line =~ /(?:^|")([X\t]*)(?:"|$)/g) {
5210 - 			$string = substr($rawline, $-[1], $+[1] - $-[1]);
5211 - 			$string =~ s/%%/__/g;
5212 - 			if ($string =~ /(?<!%)%L[udi]/) {
5213 - 				WARN("PRINTF_L",
5214 - 				     "\%Ld/%Lu are not-standard C, use %lld/%llu\n" . $herecurr);
5215 - 				last;
5216 - 			}
5217 - 		}
5218 5073
5219 5074 # whine mightly about in_atomic
+1 -1
scripts/kernel-doc
··· 1753 1753 # strip kmemcheck_bitfield_{begin,end}.*;
1754 1754 	$members =~ s/kmemcheck_bitfield_.*?;//gos;
1755 1755 	# strip attributes
1756 - 	$members =~ s/__aligned\s*\(.+\)//gos;
1756 + 	$members =~ s/__aligned\s*\([^;]*\)//gos;
1757 1757
1758 1758 	create_parameterlist($members, ';', $file);
1759 1759 	check_sections($file, $declaration_name, "struct", $sectcheck, $struct_actual, $nested);