
Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
"Documentation updates and the addition of cgroup_parse_float() which
will be used by new controllers including blk-iocost"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: convert docs to ReST and rename to *.rst
cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
cgroup: add cgroup_parse_float()

+989 -627
+6
Documentation/admin-guide/cgroup-v2.rst
···
705 705  informational files on the root cgroup which end up showing global
706 706  information available elsewhere shouldn't exist.
707 707
708 +    - The default time unit is microseconds.  If a different unit is ever
709 +      used, an explicit unit suffix must be present.
710 +
711 +    - A parts-per quantity should use a percentage decimal with at least
712 +      two digit fractional part - e.g. 13.40.
713 +
708 714  - If a controller implements weight based resource distribution, its
709 715    interface file should be named "weight" and have the range [1,
710 716    10000] with 100 as the default.  The values are chosen to allow
+1 -1
Documentation/admin-guide/hw-vuln/l1tf.rst
···
241 241  For further information about confining guests to a single or to a group
242 242  of cores consult the cpusets documentation:
243 243
244 -      https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
244 +      https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.rst
245 245
246 246  .. _interrupt_isolation:
247 247
+2 -2
Documentation/admin-guide/kernel-parameters.txt
···
4084 4084
4085 4085      relax_domain_level=
4086 4086                  [KNL, SMP] Set scheduler's default relax_domain_level.
4087 -                    See Documentation/cgroup-v1/cpusets.txt.
4087 +                    See Documentation/cgroup-v1/cpusets.rst.
4088 4088
4089 4089      reserve=    [KNL,BUGS] Force kernel to ignore I/O ports or memory
4090 4090                  Format: <base1>,<size1>[,<base2>,<size2>,...]
···
4594 4594      swapaccount=[0|1]
4595 4595                  [KNL] Enable accounting of swap in memory resource
4596 4596                  controller if no parameter or 1 is given or disable
4597 -                    it if 0 is given (See Documentation/cgroup-v1/memory.txt)
4597 +                    it if 0 is given (See Documentation/cgroup-v1/memory.rst)
4598 4598
4599 4599      swiotlb=    [ARM,IA-64,PPC,MIPS,X86]
4600 4600                  Format: { <int> | force | noforce }
+1 -1
Documentation/admin-guide/mm/numa_memory_policy.rst
···
15 15  support.
16 16
17 17  Memory policies should not be confused with cpusets
18 -   (``Documentation/cgroup-v1/cpusets.txt``)
18 +   (``Documentation/cgroup-v1/cpusets.rst``)
19 19  which is an administrative mechanism for restricting the nodes from which
20 20  memory may be allocated by a set of processes. Memory policies are a
21 21  programming interface that a NUMA-aware application can take advantage of. When
+1 -1
Documentation/block/bfq-iosched.txt
···
539 539  created, and kept up-to-date by bfq, depends on whether
540 540  CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
541 541  the stat files documented in
542 -    Documentation/cgroup-v1/blkio-controller.txt. If, instead,
542 +    Documentation/cgroup-v1/blkio-controller.rst. If, instead,
543 543  CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
544 544  blkio.bfq.io_service_bytes
545 545  blkio.bfq.io_service_bytes_recursive
+44 -35
Documentation/cgroup-v1/blkio-controller.txt Documentation/cgroup-v1/blkio-controller.rst
···
1 -    Block IO Controller
2 -    ===================
1 +    ===================
2 +    Block IO Controller
3 +    ===================
4 +
3 5    Overview
4 6    ========
5 7    cgroup subsys "blkio" implements the block io controller. There seems to be
···
19 17   =====
20 18   Throttling/Upper Limit policy
21 19   -----------------------------
22 -    - Enable Block IO controller
20 +    - Enable Block IO controller::
21 +
23 22       CONFIG_BLK_CGROUP=y
24 23
25 -    - Enable throttling in block layer
24 +    - Enable throttling in block layer::
25 +
26 26       CONFIG_BLK_DEV_THROTTLING=y
27 27
28 -    - Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
28 +    - Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
29 +
29 30       mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
30 31
31 32   - Specify a bandwidth rate on particular device for root group. The format
32 -      for policy is "<major>:<minor> <bytes_per_second>".
33 +      for policy is "<major>:<minor> <bytes_per_second>"::
33 34
34 35       echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
35 36
36 37     Above will put a limit of 1MB/second on reads happening for root group
37 38     on device having major/minor number 8:16.
38 39
39 -    - Run dd to read a file and see if rate is throttled to 1MB/s or not.
40 +    - Run dd to read a file and see if rate is throttled to 1MB/s or not::
40 41
41 42       # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
42 43       1024+0 records in
···
56 51   enabled from cgroup side, which currently is a development option and
57 52   not publicly available.
58 53
59 -    If somebody created a hierarchy like as follows.
54 +    If somebody created a hierarchy like as follows::
60 55
61 56               root
62 57               /  \
···
71 66
72 67   Throttling without "sane_behavior" enabled from cgroup side will
73 68   practically treat all groups at same level as if it looks like the
74 -    following.
69 +    following::
75 70
76 71                  pivot
77 72               /  /  \  \
···
104 99   These rules override the default value of group weight as specified
105 100  by blkio.weight.
106 101
107 -    Following is the format.
102 +    Following is the format::
108 103
109 -    # echo dev_maj:dev_minor weight > blkio.weight_device
110 -    Configure weight=300 on /dev/sdb (8:16) in this cgroup
111 -    # echo 8:16 300 > blkio.weight_device
112 -    # cat blkio.weight_device
113 -    dev weight
114 -    8:16 300
104 +      # echo dev_maj:dev_minor weight > blkio.weight_device
115 105
116 -    Configure weight=500 on /dev/sda (8:0) in this cgroup
117 -    # echo 8:0 500 > blkio.weight_device
118 -    # cat blkio.weight_device
119 -    dev weight
120 -    8:0 500
121 -    8:16 300
106 +    Configure weight=300 on /dev/sdb (8:16) in this cgroup::
122 107
123 -    Remove specific weight for /dev/sda in this cgroup
124 -    # echo 8:0 0 > blkio.weight_device
125 -    # cat blkio.weight_device
126 -    dev weight
127 -    8:16 300
108 +      # echo 8:16 300 > blkio.weight_device
109 +      # cat blkio.weight_device
110 +      dev     weight
111 +      8:16    300
112 +
113 +    Configure weight=500 on /dev/sda (8:0) in this cgroup::
114 +
115 +      # echo 8:0 500 > blkio.weight_device
116 +      # cat blkio.weight_device
117 +      dev     weight
118 +      8:0     500
119 +      8:16    300
120 +
121 +    Remove specific weight for /dev/sda in this cgroup::
122 +
123 +      # echo 8:0 0 > blkio.weight_device
124 +      # cat blkio.weight_device
125 +      dev     weight
126 +      8:16    300
128 127
129 128  - blkio.leaf_weight[_device]
130 129    - Equivalents of blkio.weight[_device] for the purpose of
···
253 244  - blkio.throttle.read_bps_device
254 245    - Specifies upper limit on READ rate from the device. IO rate is
255 246      specified in bytes per second. Rules are per device. Following is
256 -      the format.
247 +      the format::
257 248
258 -    echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
249 +      echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
259 250
260 251  - blkio.throttle.write_bps_device
261 252    - Specifies upper limit on WRITE rate to the device. IO rate is
262 253      specified in bytes per second. Rules are per device. Following is
263 -      the format.
254 +      the format::
264 255
265 -    echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
256 +      echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
266 257
267 258  - blkio.throttle.read_iops_device
268 259    - Specifies upper limit on READ rate from the device. IO rate is
269 260      specified in IO per second. Rules are per device. Following is
270 -      the format.
261 +      the format::
271 262
272 -    echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
263 +      echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
273 264
274 265  - blkio.throttle.write_iops_device
275 266    - Specifies upper limit on WRITE rate to the device. IO rate is
276 267      specified in io per second. Rules are per device. Following is
277 -      the format.
268 +      the format::
278 269
279 -    echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
270 +      echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
280 271
281 272  Note: If both BW and IOPS rules are specified for a device, then IO is
282 273  subjected to both the constraints.
+101 -83
Documentation/cgroup-v1/cgroups.txt Documentation/cgroup-v1/cgroups.rst
···
1 -    CGROUPS
2 -    -------
1 +    ==============
2 +    Control Groups
3 +    ==============
3 4
4 5    Written by Paul Menage <menage@google.com> based on
5 -    Documentation/cgroup-v1/cpusets.txt
6 +    Documentation/cgroup-v1/cpusets.rst
6 7
7 8    Original copyright statements from cpusets.txt:
9 +
8 10   Portions Copyright (C) 2004 BULL SA.
11 +
9 12   Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
13 +
10 14  Modified by Paul Jackson <pj@sgi.com>
15 +
11 16  Modified by Christoph Lameter <cl@linux.com>
12 17
13 -   CONTENTS:
14 -   =========
18 +   .. CONTENTS:
15 19
16 -   1. Control Groups
17 -     1.1 What are cgroups ?
18 -     1.2 Why are cgroups needed ?
19 -     1.3 How are cgroups implemented ?
20 -     1.4 What does notify_on_release do ?
21 -     1.5 What does clone_children do ?
22 -     1.6 How do I use cgroups ?
23 -   2. Usage Examples and Syntax
24 -     2.1 Basic Usage
25 -     2.2 Attaching processes
26 -     2.3 Mounting hierarchies by name
27 -   3. Kernel API
28 -     3.1 Overview
29 -     3.2 Synchronization
30 -     3.3 Subsystem API
31 -   4. Extended attributes usage
32 -   5. Questions
20 +      1. Control Groups
21 +        1.1 What are cgroups ?
22 +        1.2 Why are cgroups needed ?
23 +        1.3 How are cgroups implemented ?
24 +        1.4 What does notify_on_release do ?
25 +        1.5 What does clone_children do ?
26 +        1.6 How do I use cgroups ?
27 +      2. Usage Examples and Syntax
28 +        2.1 Basic Usage
29 +        2.2 Attaching processes
30 +        2.3 Mounting hierarchies by name
31 +      3. Kernel API
32 +        3.1 Overview
33 +        3.2 Synchronization
34 +        3.3 Subsystem API
35 +      4. Extended attributes usage
36 +      5. Questions
33 37
34 38  1. Control Groups
35 39  =================
···
76 72  tracking. The intention is that other subsystems hook into the generic
77 73  cgroup support to provide new attributes for cgroups, such as
78 74  accounting/limiting the resources which processes in a cgroup can
79 -   access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow
75 +   access. For example, cpusets (see Documentation/cgroup-v1/cpusets.rst) allow
80 76  you to associate a set of CPUs and a set of memory nodes with the
81 77  tasks in each cgroup.
···
112 108  that can benefit from multiple hierarchies, consider a large
113 109  university server with various users - students, professors, system
114 110  tasks etc. The resource planning for this server could be along the
115 -    following lines:
111 +    following lines::
116 112
117 113     CPU : "Top cpuset"
118 114            /       \
···
140 136  With the ability to classify tasks differently for different resources
141 137  (by putting those resource subsystems in different hierarchies),
142 138  the admin can easily set up a script which receives exec notifications
143 -    and depending on who is launching the browser he can
139 +    and depending on who is launching the browser he can::
144 140
145 141      # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
146 142
···
155 151  apps enhanced CPU power.
156 152
157 153  With ability to write PIDs directly to resource classes, it's just a
158 -    matter of:
154 +    matter of::
159 155
160 156      # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
161 157      (after some time)
···
310 306  --------------------------
311 307
312 308  To start a new job that is to be contained within a cgroup, using
313 -    the "cpuset" cgroup subsystem, the steps are something like:
309 +    the "cpuset" cgroup subsystem, the steps are something like::
314 310
315 311   1) mount -t tmpfs cgroup_root /sys/fs/cgroup
316 312   2) mkdir /sys/fs/cgroup/cpuset
···
324 320
325 321  For example, the following sequence of commands will setup a cgroup
326 322  named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
327 -    and then start a subshell 'sh' in that cgroup:
323 +    and then start a subshell 'sh' in that cgroup::
328 324
329 325    mount -t tmpfs cgroup_root /sys/fs/cgroup
330 326    mkdir /sys/fs/cgroup/cpuset
···
349 345  Creating, modifying, using cgroups can be done through the cgroup
350 346  virtual filesystem.
351 347
352 -    To mount a cgroup hierarchy with all available subsystems, type:
353 -    # mount -t cgroup xxx /sys/fs/cgroup
348 +    To mount a cgroup hierarchy with all available subsystems, type::
349 +
350 +      # mount -t cgroup xxx /sys/fs/cgroup
354 351
355 352  The "xxx" is not interpreted by the cgroup code, but will appear in
356 353  /proc/mounts so may be any useful identifying string that you like.
···
360 355  if cpusets are enabled the user will have to populate the cpus and mems files
361 356  for each new cgroup created before that group can be used.
362 357
363 -    As explained in section `1.2 Why are cgroups needed?' you should create
358 +    As explained in section `1.2 Why are cgroups needed?` you should create
364 359  different hierarchies of cgroups for each single resource or group of
365 360  resources you want to control. Therefore, you should mount a tmpfs on
366 361  /sys/fs/cgroup and create directories for each cgroup resource or resource
367 -    group.
362 +    group::
368 363
369 -    # mount -t tmpfs cgroup_root /sys/fs/cgroup
370 -    # mkdir /sys/fs/cgroup/rg1
364 +      # mount -t tmpfs cgroup_root /sys/fs/cgroup
365 +      # mkdir /sys/fs/cgroup/rg1
371 366
372 367  To mount a cgroup hierarchy with just the cpuset and memory
373 -    subsystems, type:
374 -    # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
368 +    subsystems, type::
369 +
370 +      # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
375 371
376 372  While remounting cgroups is currently supported, it is not recommend
377 373  to use it. Remounting allows changing bound subsystems and
···
381 375  conventional fsnotify. The support for remounting will be removed in
382 376  the future.
383 377
384 -    To Specify a hierarchy's release_agent:
385 -    # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
386 -      xxx /sys/fs/cgroup/rg1
378 +    To Specify a hierarchy's release_agent::
379 +
380 +      # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
381 +        xxx /sys/fs/cgroup/rg1
387 382
388 383  Note that specifying 'release_agent' more than once will return failure.
389 384
···
397 390  tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
398 391  is the cgroup that holds the whole system.
399 392
400 -    If you want to change the value of release_agent:
401 -    # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
393 +    If you want to change the value of release_agent::
394 +
395 +      # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
402 396
403 397  It can also be changed via remount.
404 398
405 -    If you want to create a new cgroup under /sys/fs/cgroup/rg1:
406 -    # cd /sys/fs/cgroup/rg1
407 -    # mkdir my_cgroup
399 +    If you want to create a new cgroup under /sys/fs/cgroup/rg1::
408 400
409 -    Now you want to do something with this cgroup.
410 -    # cd my_cgroup
401 +      # cd /sys/fs/cgroup/rg1
402 +      # mkdir my_cgroup
411 403
412 -    In this directory you can find several files:
413 -    # ls
414 -    cgroup.procs notify_on_release tasks
415 -    (plus whatever files added by the attached subsystems)
404 +    Now you want to do something with this cgroup:
416 405
417 -    Now attach your shell to this cgroup:
418 -    # /bin/echo $$ > tasks
406 +      # cd my_cgroup
407 +
408 +    In this directory you can find several files::
409 +
410 +      # ls
411 +      cgroup.procs notify_on_release tasks
412 +      (plus whatever files added by the attached subsystems)
413 +
414 +    Now attach your shell to this cgroup::
415 +
416 +      # /bin/echo $$ > tasks
419 417
420 418  You can also create cgroups inside your cgroup by using mkdir in this
421 -    directory.
422 -    # mkdir my_sub_cs
419 +    directory::
423 420
424 -    To remove a cgroup, just use rmdir:
425 -    # rmdir my_sub_cs
421 +      # mkdir my_sub_cs
422 +
423 +    To remove a cgroup, just use rmdir::
424 +
425 +      # rmdir my_sub_cs
426 426
427 427  This will fail if the cgroup is in use (has cgroups inside, or
428 428  has processes attached, or is held alive by other subsystem-specific
···
438 424  2.2 Attaching processes
439 425  -----------------------
440 426
441 -    # /bin/echo PID > tasks
427 +    ::
428 +
429 +      # /bin/echo PID > tasks
442 430
443 431  Note that it is PID, not PIDs. You can only attach ONE task at a time.
444 -    If you have several tasks to attach, you have to do it one after another:
432 +    If you have several tasks to attach, you have to do it one after another::
445 433
446 -    # /bin/echo PID1 > tasks
447 -    # /bin/echo PID2 > tasks
448 -    ...
449 -    # /bin/echo PIDn > tasks
434 +      # /bin/echo PID1 > tasks
435 +      # /bin/echo PID2 > tasks
436 +      ...
437 +      # /bin/echo PIDn > tasks
450 438
451 -    You can attach the current shell task by echoing 0:
439 +    You can attach the current shell task by echoing 0::
452 440
453 -    # echo 0 > tasks
441 +      # echo 0 > tasks
454 442
455 443  You can use the cgroup.procs file instead of the tasks file to move all
456 444  threads in a threadgroup at once. Echoing the PID of any task in a
···
545 529  methods are css_alloc/free. Any others that are null are presumed to
546 530  be successful no-ops.
547 531
548 -    struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)
532 +    ``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
549 533  (cgroup_mutex held by caller)
550 534
551 535  Called to allocate a subsystem state object for a cgroup. The
···
560 544  it's the root of the hierarchy) and may be an appropriate place for
561 545  initialization code.
562 546
563 -    int css_online(struct cgroup *cgrp)
547 +    ``int css_online(struct cgroup *cgrp)``
564 548  (cgroup_mutex held by caller)
565 549
566 550  Called after @cgrp successfully completed all allocations and made
···
570 554  propagation along the hierarchy. See the comment on
571 555  cgroup_for_each_descendant_pre() for details.
572 556
573 -    void css_offline(struct cgroup *cgrp);
557 +    ``void css_offline(struct cgroup *cgrp);``
574 558  (cgroup_mutex held by caller)
575 559
576 560  This is the counterpart of css_online() and called iff css_online()
···
580 564  cgroup removal will proceed to the next step - css_free(). After this
581 565  callback, @cgrp should be considered dead to the subsystem.
582 566
583 -    void css_free(struct cgroup *cgrp)
567 +    ``void css_free(struct cgroup *cgrp)``
584 568  (cgroup_mutex held by caller)
585 569
586 570  The cgroup system is about to free @cgrp; the subsystem should free
···
589 573  be called for a newly-created cgroup if an error occurs after this
590 574  subsystem's create() method has been called for the new cgroup).
591 575
592 -    int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
576 +    ``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
593 577  (cgroup_mutex held by caller)
594 578
595 579  Called prior to moving one or more tasks into a cgroup; if the
···
610 594  while the caller holds cgroup_mutex and it is ensured that either
611 595  attach() or cancel_attach() will be called in future.
612 596
613 -    void css_reset(struct cgroup_subsys_state *css)
597 +    ``void css_reset(struct cgroup_subsys_state *css)``
614 598  (cgroup_mutex held by caller)
615 599
616 600  An optional operation which should restore @css's configuration to the
···
624 608  ensures that the configuration is in the initial state when it is made
625 609  visible again later.
626 610
627 -    void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
611 +    ``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
628 612  (cgroup_mutex held by caller)
629 613
630 614  Called when a task attach operation has failed after can_attach() has succeeded.
···
633 617  This will be called only about subsystems whose can_attach() operation have
634 618  succeeded. The parameters are identical to can_attach().
635 619
636 -    void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
620 +    ``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
637 621  (cgroup_mutex held by caller)
638 622
639 623  Called after the task has been attached to the cgroup, to allow any
640 624  post-attachment activity that requires memory allocations or blocking.
641 625  The parameters are identical to can_attach().
642 626
643 -    void fork(struct task_struct *task)
627 +    ``void fork(struct task_struct *task)``
644 628
645 629  Called when a task is forked into a cgroup.
646 630
647 -    void exit(struct task_struct *task)
631 +    ``void exit(struct task_struct *task)``
648 632
649 633  Called during task exit.
650 634
651 -    void free(struct task_struct *task)
635 +    ``void free(struct task_struct *task)``
652 636
653 637  Called when the task_struct is freed.
654 638
655 -    void bind(struct cgroup *root)
639 +    ``void bind(struct cgroup *root)``
656 640  (cgroup_mutex held by caller)
657 641
658 642  Called when a cgroup subsystem is rebound to a different hierarchy
···
665 649
666 650  cgroup filesystem supports certain types of extended attributes in its
667 651  directories and files. The current supported types are:
652 +
668 653  - Trusted (XATTR_TRUSTED)
669 654  - Security (XATTR_SECURITY)
···
683 666  5. Questions
684 667  ============
685 668
686 -    Q: what's up with this '/bin/echo' ?
687 -    A: bash's builtin 'echo' command does not check calls to write() against
688 -       errors. If you use it in the cgroup file system, you won't be
689 -       able to tell whether a command succeeded or failed.
669 +    ::
690 670
691 -    Q: When I attach processes, only the first of the line gets really attached !
692 -    A: We can only return one error code per call to write(). So you should also
693 -       put only ONE PID.
671 +      Q: what's up with this '/bin/echo' ?
672 +      A: bash's builtin 'echo' command does not check calls to write() against
673 +         errors. If you use it in the cgroup file system, you won't be
674 +         able to tell whether a command succeeded or failed.
694 675
676 +      Q: When I attach processes, only the first of the line gets really attached !
677 +      A: We can only return one error code per call to write(). So you should also
678 +         put only ONE PID.
+8 -7
Documentation/cgroup-v1/cpuacct.txt Documentation/cgroup-v1/cpuacct.rst
···
1 +    =========================
1 2    CPU Accounting Controller
2 -    -------------------------
3 +    =========================
3 4
4 5    The CPU accounting controller is used to group tasks using cgroups and
5 6    account the CPU usage of these groups of tasks.
···
9 8    group accumulates the CPU usage of all of its child groups and the tasks
10 9   directly present in its group.
11 10
12 -    Accounting groups can be created by first mounting the cgroup filesystem.
11 +    Accounting groups can be created by first mounting the cgroup filesystem::
13 12
14 -    # mount -t cgroup -ocpuacct none /sys/fs/cgroup
13 +      # mount -t cgroup -ocpuacct none /sys/fs/cgroup
15 14
16 15  With the above step, the initial or the parent accounting group becomes
17 16  visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
···
20 19  by this group which is essentially the CPU time obtained by all the tasks
21 20  in the system.
22 21
23 -    New accounting groups can be created under the parent group /sys/fs/cgroup.
22 +    New accounting groups can be created under the parent group /sys/fs/cgroup::
24 23
25 -    # cd /sys/fs/cgroup
26 -    # mkdir g1
27 -    # echo $$ > g1/tasks
24 +      # cd /sys/fs/cgroup
25 +      # mkdir g1
26 +      # echo $$ > g1/tasks
28 27
29 28  The above steps create a new group g1 and move the current shell
30 29  process (bash) into it. CPU time consumed by this bash and its children
+116 -89
Documentation/cgroup-v1/cpusets.txt Documentation/cgroup-v1/cpusets.rst
···
1 -    CPUSETS
2 -    -------
1 +    =======
2 +    CPUSETS
3 +    =======
3 4
4 5    Copyright (C) 2004 BULL SA.
6 +
5 7    Written by Simon.Derr@bull.net
6 8
7 -    Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
8 -    Modified by Paul Jackson <pj@sgi.com>
9 -    Modified by Christoph Lameter <cl@linux.com>
10 -   Modified by Paul Menage <menage@google.com>
11 -   Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
9 +    - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
10 +   - Modified by Paul Jackson <pj@sgi.com>
11 +   - Modified by Christoph Lameter <cl@linux.com>
12 +   - Modified by Paul Menage <menage@google.com>
13 +   - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
12 14
13 -   CONTENTS:
14 -   =========
15 +   .. CONTENTS:
15 16
16 -   1. Cpusets
17 -     1.1 What are cpusets ?
18 -     1.2 Why are cpusets needed ?
19 -     1.3 How are cpusets implemented ?
20 -     1.4 What are exclusive cpusets ?
21 -     1.5 What is memory_pressure ?
22 -     1.6 What is memory spread ?
23 -     1.7 What is sched_load_balance ?
24 -     1.8 What is sched_relax_domain_level ?
25 -     1.9 How do I use cpusets ?
26 -   2. Usage Examples and Syntax
27 -     2.1 Basic Usage
28 -     2.2 Adding/removing cpus
29 -     2.3 Setting flags
30 -     2.4 Attaching processes
31 -   3. Questions
32 -   4. Contact
17 +      1. Cpusets
18 +        1.1 What are cpusets ?
19 +        1.2 Why are cpusets needed ?
20 +        1.3 How are cpusets implemented ?
21 +        1.4 What are exclusive cpusets ?
22 +        1.5 What is memory_pressure ?
23 +        1.6 What is memory spread ?
24 +        1.7 What is sched_load_balance ?
25 +        1.8 What is sched_relax_domain_level ?
26 +        1.9 How do I use cpusets ?
27 +      2. Usage Examples and Syntax
28 +        2.1 Basic Usage
29 +        2.2 Adding/removing cpus
30 +        2.3 Setting flags
31 +        2.4 Attaching processes
32 +      3. Questions
33 +      4. Contact
33 34
34 35  1. Cpusets
35 36  ==========
···
49 48  job placement on large systems.
50 49
51 50  Cpusets use the generic cgroup subsystem described in
52 -    Documentation/cgroup-v1/cgroups.txt.
51 +    Documentation/cgroup-v1/cgroups.rst.
53 52
54 53  Requests by a task, using the sched_setaffinity(2) system call to
55 54  include CPUs in its CPU affinity mask, and using the mbind(2) and
···
158 157  The /proc/<pid>/status file for each task has four added lines,
159 158  displaying the task's cpus_allowed (on which CPUs it may be scheduled)
160 159  and mems_allowed (on which Memory Nodes it may obtain memory),
161 -    in the two formats seen in the following example:
160 +    in the two formats seen in the following example::
162 161
163 162    Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
164 163    Cpus_allowed_list:      0-127
···
182 181  - cpuset.sched_relax_domain_level: the searching range when migrating tasks
183 182
184 183  In addition, only the root cpuset has the following file:
184 +
185 185  - cpuset.memory_pressure_enabled flag: compute memory_pressure?
186 186
187 187  New cpusets are created using the mkdir system call or shell
···
268 266  batch manager or other user code to decide what to do about it and
269 267  take action.
270 268
271 -    ==> Unless this feature is enabled by writing "1" to the special file
269 +    ==>
270 +        Unless this feature is enabled by writing "1" to the special file
272 271      /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
273 272      code of __alloc_pages() for this metric reduces to simply noticing
274 273      that the cpuset_memory_pressure_enabled flag is zero. So only
···
402 399
403 400  This default load balancing across all CPUs is not well suited for
404 401  the following two situations:
402 +
405 403   1) On large systems, load balancing across many CPUs is expensive.
406 404      If the system is managed using cpusets to place independent jobs
407 405      on separate sets of CPUs, full load balancing is unnecessary.
···
505 501  The cpuset code builds a new such partition and passes it to the
506 502  scheduler sched domain setup code, to have the sched domains rebuilt
507 503  as necessary, whenever:
504 +
508 505  - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
509 506  - or CPUs come or go from a cpuset with this flag enabled,
510 507  - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
···
558 553  indicates size of searching range in levels ideally as follows,
559 554  otherwise initial value -1 that indicates the cpuset has no request.
560 555
561 -      -1 : no request. use system default or follow request of others.
562 -       0 : no search.
563 -       1 : search siblings (hyperthreads in a core).
564 -       2 : search cores in a package.
565 -       3 : search cpus in a node [= system wide on non-NUMA system]
566 -       4 : search nodes in a chunk of node [on NUMA system]
567 -       5 : search system wide [on NUMA system]
556 +    ====== ===========================================================
557 +      -1   no request. use system default or follow request of others.
558 +       0   no search.
559 +       1   search siblings (hyperthreads in a core).
560 +       2   search cores in a package.
561 +       3   search cpus in a node [= system wide on non-NUMA system]
562 +       4   search nodes in a chunk of node [on NUMA system]
563 +       5   search system wide [on NUMA system]
564 +    ====== ===========================================================
568 565
569 566  The system default is architecture dependent. The system default
570 567  can be changed using the relax_domain_level= boot parameter.
···
585 578  Don't modify this file if you are not sure.
586 579
587 580  If your situation is:
581 +
588 582  - The migration costs between each cpu can be assumed considerably
589 583    small(for you) due to your special application's behavior or
590 584    special hardware support for CPU cache etc.
591 585  - The searching cost doesn't have impact(for you) or you can make
592 586    the searching cost enough small by managing cpuset to compact etc.
593 587  - The latency is required even it sacrifices cache hit rate etc.
594 -    then increasing 'sched_relax_domain_level' would benefit you.
588 +      then increasing 'sched_relax_domain_level' would benefit you.
595 589
596 590
597 591  1.9 How do I use cpusets ?
···
686 678
687 679  For example, the following sequence of commands will setup a cpuset
688 680  named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
689 -    and then start a subshell 'sh' in that cpuset:
681 +    and then start a subshell 'sh' in that cpuset::
690 682
691 683    mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
692 684    cd /sys/fs/cgroup/cpuset
···
701 693    cat /proc/self/cpuset
702 694
703 695  There are ways to query or modify cpusets:
696 +
704 697  - via the cpuset file system directly, using the various cd, mkdir, echo,
705 698    cat, rmdir commands from the shell, or their equivalent from C.
706 699  - via the C library libcpuset.
···
731 722  tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
732 723  is the cpuset that holds the whole system.
733 724
734 -    If you want to create a new cpuset under /sys/fs/cgroup/cpuset:
735 -    # cd /sys/fs/cgroup/cpuset
736 -    # mkdir my_cpuset
725 +    If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
737 726
738 -    Now you want to do something with this cpuset.
739 -    # cd my_cpuset
727 +      # cd /sys/fs/cgroup/cpuset
728 +      # mkdir my_cpuset
740 729
741 -    In this directory you can find several files:
742 -    # ls
743 -    cgroup.clone_children cpuset.memory_pressure
744 -    cgroup.event_control cpuset.memory_spread_page
745 -    cgroup.procs cpuset.memory_spread_slab
746 -    cpuset.cpu_exclusive cpuset.mems
747 -    cpuset.cpus cpuset.sched_load_balance
748 -    cpuset.mem_exclusive cpuset.sched_relax_domain_level
749 -    cpuset.mem_hardwall notify_on_release
750 -    cpuset.memory_migrate tasks
730 +    Now you want to do something with this cpuset::
731 +
732 +      # cd my_cpuset
733 +
734 +    In this directory you can find several files::
735 +
736 +      # ls
737 +      cgroup.clone_children  cpuset.memory_pressure
738 +      cgroup.event_control   cpuset.memory_spread_page
739 +      cgroup.procs           cpuset.memory_spread_slab
740 +      cpuset.cpu_exclusive   cpuset.mems
741 +      cpuset.cpus            cpuset.sched_load_balance
742 +      cpuset.mem_exclusive   cpuset.sched_relax_domain_level
743 +      cpuset.mem_hardwall    notify_on_release
744 +      cpuset.memory_migrate  tasks
751 745
752 746  Reading them will give you information about the state of this cpuset:
753 747  the CPUs and Memory Nodes it can use, the processes that are using
754 748  it, its properties. By writing to these files you can manipulate
755 749  the cpuset.
756 750
757 -    Set some flags:
758 -    # /bin/echo 1 > cpuset.cpu_exclusive
751 +    Set some flags::
759 752
760 -    Add some cpus:
761 -    # /bin/echo 0-7 > cpuset.cpus
753 +      # /bin/echo 1 > cpuset.cpu_exclusive
762 754
763 -    Add some mems:
764 -    # /bin/echo 0-7 > cpuset.mems
755 +    Add some cpus::
765 756
766 -    Now attach your shell to this cpuset:
767 -    # /bin/echo $$ > tasks
757 +      # /bin/echo 0-7 > cpuset.cpus
758 +
759 +    Add some mems::
760 +
761 +      # /bin/echo 0-7 > cpuset.mems
762 +
763 +    Now attach your shell to this cpuset::
764 +
765 +      # /bin/echo $$ > tasks
768 766
769 767  You can also create cpusets inside your cpuset by using mkdir in this
770 -    directory.
771 -    # mkdir my_sub_cs
768 +    directory::
772 769
773 -    To remove a cpuset, just use rmdir:
774 -    # rmdir my_sub_cs
770 +      # mkdir my_sub_cs
771 +
772 +    To remove a cpuset, just use rmdir::
773 +
774 +      # rmdir my_sub_cs
775 +
775 776  This will fail if the cpuset is in use (has cpusets inside, or has
776 777  processes attached).
777 778
778 779  Note that for legacy reasons, the "cpuset" filesystem exists as a
779 780  wrapper around the cgroup filesystem.
780 781
781 -    The command
782 +    The command::
782 783
783 -    mount -t cpuset X /sys/fs/cgroup/cpuset
784 +      mount -t cpuset X /sys/fs/cgroup/cpuset
784 785
785 -    is equivalent to
786 +    is equivalent to::
786 787
787 -    mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
788 -    echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
788 +      mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
789 +      echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
789 790
790 791  2.2 Adding/removing cpus
791 792  ------------------------
792 793
793 794  This is the syntax to use when writing in the cpus or mems files
794 -    in cpuset directories:
795 +    in cpuset directories::
795 796
796 -    # /bin/echo 1-4 > cpuset.cpus     -> set cpus list to cpus 1,2,3,4
797 -    # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
797 +      # /bin/echo 1-4 > cpuset.cpus     -> set cpus list to cpus 1,2,3,4
798 +      # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
798 799
799 800  To add a CPU to a cpuset, write the new list of CPUs including the
800 -    CPU to be added. To add 6 to the above cpuset:
801 +    CPU to be added. To add 6 to the above cpuset::
801 802
802 -    # /bin/echo 1-4,6 > cpuset.cpus   -> set cpus list to cpus 1,2,3,4,6
803 +      # /bin/echo 1-4,6 > cpuset.cpus   -> set cpus list to cpus 1,2,3,4,6
803 804
804 805  Similarly to remove a CPU from a cpuset, write the new list of CPUs
805 806  without the CPU to be removed.
806 807
807 -    To remove all the CPUs:
808 +    To remove all the CPUs::
808 809
809 -    # /bin/echo "" > cpuset.cpus      -> clear cpus list
810 +      # /bin/echo "" > cpuset.cpus      -> clear cpus list
810 811
811 812  2.3 Setting flags
812 813  -----------------
813 814
814 -    The syntax is very simple:
815 +    The syntax is very simple::
815 816
816 -    # /bin/echo 1 > cpuset.cpu_exclusive  -> set flag 'cpuset.cpu_exclusive'
817 -    # /bin/echo 0 > cpuset.cpu_exclusive  -> unset flag 'cpuset.cpu_exclusive'
817 +      # /bin/echo 1 > cpuset.cpu_exclusive  -> set flag 'cpuset.cpu_exclusive'
818 +      # /bin/echo 0 > cpuset.cpu_exclusive  -> unset flag 'cpuset.cpu_exclusive'
818 819
819 820  2.4 Attaching processes
820 821  -----------------------
821 822
822 -    # /bin/echo PID > tasks
823 +    ::
824 +
825 +      # /bin/echo PID > tasks
823 826
824 827  Note that it is PID, not PIDs. You can only attach ONE task at a time.
825 -    If you have several tasks to attach, you have to do it one after another:
828 +    If you have several tasks to attach, you have to do it one after another::
826 829
827 -      # /bin/echo PID1 > tasks
828 -      # /bin/echo PID2 > tasks
830 +      # /bin/echo PID1 > tasks
831 +      # /bin/echo PID2 > tasks
829 832        ...
830 -      # /bin/echo PIDn > tasks
833 +      # /bin/echo PIDn > tasks
831 834
832 835
833 836  3. Questions
834 837  ============
835 838
836 -    Q: what's up with this '/bin/echo' ?
837 -    A: bash's builtin 'echo' command does not check calls to write() against
839 +    Q:
840 +       what's up with this '/bin/echo' ?
841 +
842 +    A:
843 +       bash's builtin 'echo' command does not check calls to write() against
838 844     errors. If you use it in the cpuset file system, you won't be
839 845     able to tell whether a command succeeded or failed.
840 846
841 -    Q: When I attach processes, only the first of the line gets really attached !
842 -    A: We can only return one error code per call to write(). So you should also
847 +    Q:
848 +       When I attach processes, only the first of the line gets really attached !
849 + 850 + A: 851 + We can only return one error code per call to write(). So you should also 843 852 put only ONE pid. 844 853 845 854 4. Contact
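The cpus/mems list syntax converted above (section 2.2: "1-4,6" selects CPUs 1, 2, 3, 4 and 6, and writing an empty string clears the list) is easy to mishandle in scripts. A minimal user-space sketch of a parser for that format — the function name is ours, not any kernel interface:

```python
# Sketch: expand a cpuset-style list string ("1-4,6") into the CPUs
# it selects, per the syntax described in section 2.2 of the doc.
def parse_cpulist(s):
    """Return a sorted list of ints for a cpus/mems list string."""
    s = s.strip()
    if not s:                      # "" clears the list (section 2.2)
        return []
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

print(parse_cpulist("1-4,6"))      # [1, 2, 3, 4, 6]
print(parse_cpulist("1,2,3,4"))    # [1, 2, 3, 4]
print(parse_cpulist(""))           # []
```

The same format is used by several other kernel interfaces (e.g. the `isolcpus=` boot parameter), so a helper like this is reusable across cpuset scripting.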
+28 -12
Documentation/cgroup-v1/devices.txt Documentation/cgroup-v1/devices.rst
··· 1 + =========================== 1 2 Device Whitelist Controller 3 + =========================== 2 4 3 - 1. Description: 5 + 1. Description 6 + ============== 4 7 5 8 Implement a cgroup to track and enforce open and mknod restrictions 6 9 on device files. A device cgroup associates a device access ··· 19 16 never receive a device access which is denied by its parent. 20 17 21 18 2. User Interface 19 + ================= 22 20 23 21 An entry is added using devices.allow, and removed using 24 - devices.deny. For instance 22 + devices.deny. For instance:: 25 23 26 24 echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow 27 25 28 26 allows cgroup 1 to read and mknod the device usually known as 29 - /dev/null. Doing 27 + /dev/null. Doing:: 30 28 31 29 echo a > /sys/fs/cgroup/1/devices.deny 32 30 33 - will remove the default 'a *:* rwm' entry. Doing 31 + will remove the default 'a *:* rwm' entry. Doing:: 34 32 35 33 echo a > /sys/fs/cgroup/1/devices.allow 36 34 37 35 will add the 'a *:* rwm' entry to the whitelist. 38 36 39 37 3. Security 38 + =========== 40 39 41 40 Any task can move itself between cgroups. This clearly won't 42 41 suffice, but we can decide the best way to adequately restrict ··· 55 50 parent has. 56 51 57 52 4. Hierarchy 53 + ============ 58 54 59 55 device cgroups maintain hierarchy by making sure a cgroup never has more 60 56 access permissions than its parent. Every time an entry is written to ··· 64 58 re-evaluated. In case one of the locally set whitelist entries would provide 65 59 more access than the cgroup's parent, it'll be removed from the whitelist. 
66 60 67 - Example: 61 + Example:: 62 + 68 63 A 69 64 / \ 70 65 B ··· 74 67 A allow "b 8:* rwm", "c 116:1 rw" 75 68 B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" 76 69 77 - If a device is denied in group A: 70 + If a device is denied in group A:: 71 + 78 72 # echo "c 116:* r" > A/devices.deny 73 + 79 74 it'll propagate down and after revalidating B's entries, the whitelist entry 80 - "c 116:2 rwm" will be removed: 75 + "c 116:2 rwm" will be removed:: 81 76 82 77 group whitelist entries denied devices 83 78 A all "b 8:* rwm", "c 116:* rw" ··· 88 79 In case parent's exceptions change and local exceptions are not allowed 89 80 anymore, they'll be deleted. 90 81 91 - Notice that new whitelist entries will not be propagated: 82 + Notice that new whitelist entries will not be propagated:: 83 + 92 84 A 93 85 / \ 94 86 B ··· 98 88 A "c 1:3 rwm", "c 1:5 r" all the rest 99 89 B "c 1:3 rwm", "c 1:5 r" all the rest 100 90 101 - when adding "c *:3 rwm": 91 + when adding ``c *:3 rwm``:: 92 + 102 93 # echo "c *:3 rwm" >A/devices.allow 103 94 104 - the result: 95 + the result:: 96 + 105 97 group whitelist entries denied devices 106 98 A "c *:3 rwm", "c 1:5 r" all the rest 107 99 B "c 1:3 rwm", "c 1:5 r" all the rest 108 100 109 - but now it'll be possible to add new entries to B: 101 + but now it'll be possible to add new entries to B:: 102 + 110 103 # echo "c 2:3 rwm" >B/devices.allow 111 104 # echo "c 50:3 r" >B/devices.allow 112 - or even 105 + 106 + or even:: 107 + 113 108 # echo "c *:3 rwm" >B/devices.allow 114 109 115 110 Allowing or denying all by writing 'a' to devices.allow or devices.deny will 116 111 not be possible once the device cgroups has children. 117 112 118 113 4.1 Hierarchy (internal implementation) 114 + --------------------------------------- 119 115 120 116 device cgroups is implemented internally using a behavior (ALLOW, DENY) and a 121 117 list of exceptions. The internal state is controlled using the same user
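The revalidation rule in section 4 above (a child cgroup never keeps a whitelist entry granting more access than its parent allows) can be modeled in a few lines. This is a deliberately simplified sketch of the idea — matching on type, wildcard-or-equal major/minor, and a superset of the access bits — not the kernel's actual implementation; see security/device_cgroup.c for the real rules:

```python
# Simplified model of section 4: a child whitelist entry survives a
# parent change only if some parent entry still covers it.
def covers(parent, child):
    """Does a parent (type, major, minor, access) entry cover a child's?"""
    ptype, pmaj, pmin, pacc = parent
    ctype, cmaj, cmin, cacc = child
    return (ptype in ("a", ctype)          # 'a' means all devices
            and pmaj in ("*", cmaj)
            and pmin in ("*", cmin)
            and set(cacc) <= set(pacc))    # r/w/m must be a subset

def revalidate(parent_entries, child_entries):
    """Drop child entries that no parent entry covers anymore."""
    return [c for c in child_entries
            if any(covers(p, c) for p in parent_entries)]

parent = [("b", "8", "*", "rwm"), ("c", "116", "*", "rw")]
child  = [("c", "116", "1", "rw"), ("c", "116", "2", "rwm")]
# "c 116:2 rwm" asks for mknod, which the parent no longer grants:
print(revalidate(parent, child))   # [('c', '116', '1', 'rw')]
```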
+9 -5
Documentation/cgroup-v1/freezer-subsystem.txt Documentation/cgroup-v1/freezer-subsystem.rst
··· 1 + ============== 2 + Cgroup Freezer 3 + ============== 4 + 1 5 The cgroup freezer is useful to batch job management system which start 2 6 and stop sets of tasks in order to schedule the resources of a machine 3 7 according to the desires of a system administrator. This sort of program ··· 27 23 SIGCONT is especially unsuitable since it can be caught by the task. Any 28 24 programs designed to watch for SIGSTOP and SIGCONT could be broken by 29 25 attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can 30 - demonstrate this problem using nested bash shells: 26 + demonstrate this problem using nested bash shells:: 31 27 32 28 $ echo $$ 33 29 16644 ··· 97 93 The root cgroup is non-freezable and the above interface files don't 98 94 exist. 99 95 100 - * Examples of usage : 96 + * Examples of usage:: 101 97 102 98 # mkdir /sys/fs/cgroup/freezer 103 99 # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer 104 100 # mkdir /sys/fs/cgroup/freezer/0 105 101 # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks 106 102 107 - to get status of the freezer subsystem : 103 + to get status of the freezer subsystem:: 108 104 109 105 # cat /sys/fs/cgroup/freezer/0/freezer.state 110 106 THAWED 111 107 112 - to freeze all tasks in the container : 108 + to freeze all tasks in the container:: 113 109 114 110 # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state 115 111 # cat /sys/fs/cgroup/freezer/0/freezer.state ··· 117 113 # cat /sys/fs/cgroup/freezer/0/freezer.state 118 114 FROZEN 119 115 120 - to unfreeze all tasks in the container : 116 + to unfreeze all tasks in the container:: 121 117 122 118 # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state 123 119 # cat /sys/fs/cgroup/freezer/0/freezer.state
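As the freezer example above shows, freezing is asynchronous: after writing FROZEN, freezer.state may read back FREEZING for a while before settling on FROZEN. A user-space sketch of the polling loop, with the state reader injected so the logic is testable without a mounted freezer hierarchy (the commented path follows the v1 example above):

```python
# Sketch: poll a freezer.state reader until it reports FROZEN.
import time

def wait_frozen(read_state, timeout=5.0, interval=0.01):
    """Return True once read_state() yields 'FROZEN', False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_state().strip() == "FROZEN":
            return True
        time.sleep(interval)
    return False

# Against a real v1 hierarchy this would be:
#   read_state = lambda: open("/sys/fs/cgroup/freezer/0/freezer.state").read()
states = iter(["FREEZING", "FREEZING", "FROZEN"])
print(wait_frozen(lambda: next(states)))   # True
```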
+20 -19
Documentation/cgroup-v1/hugetlb.txt Documentation/cgroup-v1/hugetlb.rst
··· 1 + ================== 1 2 HugeTLB Controller 2 - ------------------- 3 + ================== 3 4 4 5 The HugeTLB controller allows to limit the HugeTLB usage per control group and 5 6 enforces the controller limit during page fault. Since HugeTLB doesn't ··· 17 16 visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in 18 17 the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. 19 18 20 - New groups can be created under the parent group /sys/fs/cgroup. 19 + New groups can be created under the parent group /sys/fs/cgroup:: 21 20 22 - # cd /sys/fs/cgroup 23 - # mkdir g1 24 - # echo $$ > g1/tasks 21 + # cd /sys/fs/cgroup 22 + # mkdir g1 23 + # echo $$ > g1/tasks 25 24 26 25 The above steps create a new group g1 and move the current shell 27 26 process (bash) into it. 28 27 29 - Brief summary of control files 28 + Brief summary of control files:: 30 29 31 30 hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage 32 31 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded ··· 34 33 hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit 35 34 36 35 For a system supporting three hugepage sizes (64k, 32M and 1G), the control 37 - files include: 36 + files include:: 38 37 39 - hugetlb.1GB.limit_in_bytes 40 - hugetlb.1GB.max_usage_in_bytes 41 - hugetlb.1GB.usage_in_bytes 42 - hugetlb.1GB.failcnt 43 - hugetlb.64KB.limit_in_bytes 44 - hugetlb.64KB.max_usage_in_bytes 45 - hugetlb.64KB.usage_in_bytes 46 - hugetlb.64KB.failcnt 47 - hugetlb.32MB.limit_in_bytes 48 - hugetlb.32MB.max_usage_in_bytes 49 - hugetlb.32MB.usage_in_bytes 50 - hugetlb.32MB.failcnt 38 + hugetlb.1GB.limit_in_bytes 39 + hugetlb.1GB.max_usage_in_bytes 40 + hugetlb.1GB.usage_in_bytes 41 + hugetlb.1GB.failcnt 42 + hugetlb.64KB.limit_in_bytes 43 + hugetlb.64KB.max_usage_in_bytes 44 + hugetlb.64KB.usage_in_bytes 45 + hugetlb.64KB.failcnt 46 + hugetlb.32MB.limit_in_bytes 47 + 
hugetlb.32MB.max_usage_in_bytes 48 + hugetlb.32MB.usage_in_bytes 49 + hugetlb.32MB.failcnt
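The per-size control files above follow a fixed naming pattern, and the limits are expressed in bytes, so capping a group at N pages of a given size means writing N times the page size. A small sketch of that arithmetic — the size table mirrors the 64KB/32MB/1GB example and nothing here is a kernel API:

```python
# Sketch: derive hugetlb v1 control-file names and byte limits.
SIZES = {"64KB": 64 << 10, "32MB": 32 << 20, "1GB": 1 << 30}

def limit_file(size):
    """Name of the limit file for a hugepage size, e.g. '32MB'."""
    return "hugetlb.%s.limit_in_bytes" % size

def limit_for_pages(size, npages):
    """Byte value to write to cap the group at npages pages."""
    return SIZES[size] * npages

print(limit_file("32MB"))           # hugetlb.32MB.limit_in_bytes
print(limit_for_pages("32MB", 4))   # 134217728
```

Writing a value that is not a multiple of the page size is rounded by the kernel, so computing the limit from a page count as above keeps the intent explicit.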
+30
Documentation/cgroup-v1/index.rst
··· 1 + :orphan: 2 + 3 + ======================== 4 + Control Groups version 1 5 + ======================== 6 + 7 + .. toctree:: 8 + :maxdepth: 1 9 + 10 + cgroups 11 + 12 + blkio-controller 13 + cpuacct 14 + cpusets 15 + devices 16 + freezer-subsystem 17 + hugetlb 18 + memcg_test 19 + memory 20 + net_cls 21 + net_prio 22 + pids 23 + rdma 24 + 25 + .. only:: subproject and html 26 + 27 + Indices 28 + ======= 29 + 30 + * :ref:`genindex`
+170 -95
Documentation/cgroup-v1/memcg_test.txt Documentation/cgroup-v1/memcg_test.rst
··· 1 - Memory Resource Controller(Memcg) Implementation Memo. 1 + ===================================================== 2 + Memory Resource Controller(Memcg) Implementation Memo 3 + ===================================================== 4 + 2 5 Last Updated: 2010/2 6 + 3 7 Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). 4 8 5 9 Because VM is getting complex (one of reasons is memcg...), memcg's behavior 6 10 is complex. This is a document for memcg's internal behavior. 7 11 Please note that implementation details can be changed. 8 12 9 - (*) Topics on API should be in Documentation/cgroup-v1/memory.txt) 13 + (*) Topics on API should be in Documentation/cgroup-v1/memory.rst) 10 14 11 15 0. How to record usage ? 16 + ======================== 17 + 12 18 2 objects are used. 13 19 14 20 page_cgroup ....an object per page. 21 + 15 22 Allocated at boot or memory hotplug. Freed at memory hot removal. 16 23 17 24 swap_cgroup ... an entry per swp_entry. 25 + 18 26 Allocated at swapon(). Freed at swapoff(). 19 27 20 28 The page_cgroup has USED bit and double count against a page_cgroup never 21 29 occurs. swap_cgroup is used only when a charged page is swapped-out. 22 30 23 31 1. Charge 32 + ========= 24 33 25 34 a page/swp_entry may be charged (usage += PAGE_SIZE) at 26 35 27 36 mem_cgroup_try_charge() 28 37 29 38 2. Uncharge 39 + =========== 40 + 30 41 a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by 31 42 32 43 mem_cgroup_uncharge() ··· 48 37 disappears. 49 38 50 39 3. charge-commit-cancel 40 + ======================= 41 + 51 42 Memcg pages are charged in two steps: 52 - mem_cgroup_try_charge() 53 - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() 43 + 44 + - mem_cgroup_try_charge() 45 + - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() 54 46 55 47 At try_charge(), there are no flags to say "this page is charged". 56 48 at this point, usage += PAGE_SIZE. ··· 65 51 Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 
66 52 67 53 4. Anonymous 54 + ============ 55 + 68 56 Anonymous page is newly allocated at 69 57 - page fault into MAP_ANONYMOUS mapping. 70 58 - Copy-On-Write. ··· 94 78 (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. 95 79 96 80 5. Page Cache 97 - Page Cache is charged at 81 + ============= 82 + 83 + Page Cache is charged at 98 84 - add_to_page_cache_locked(). 99 85 100 86 The logic is very clear. (About migration, see below) 101 - Note: __remove_from_page_cache() is called by remove_from_page_cache() 102 - and __remove_mapping(). 87 + 88 + Note: 89 + __remove_from_page_cache() is called by remove_from_page_cache() 90 + and __remove_mapping(). 103 91 104 92 6. Shmem(tmpfs) Page Cache 93 + =========================== 94 + 105 95 The best way to understand shmem's page state transition is to read 106 96 mm/shmem.c. 97 + 107 98 But brief explanation of the behavior of memcg around shmem will be 108 99 helpful to understand the logic. 109 100 110 101 Shmem's page (just leaf page, not direct/indirect block) can be on 102 + 111 103 - radix-tree of shmem's inode. 112 104 - SwapCache. 113 105 - Both on radix-tree and SwapCache. This happens at swap-in 114 106 and swap-out, 115 107 116 108 It's charged when... 109 + 117 110 - A new page is added to shmem's radix-tree. 118 111 - A swp page is read. (move a charge from swap_cgroup to page_cgroup) 119 112 120 113 7. Page Migration 114 + ================= 121 115 122 116 mem_cgroup_migrate() 123 117 124 118 8. LRU 119 + ====== 125 120 Each memcg has its own private LRU. Now, its handling is under global 126 121 VM's control (means that it's handled under global pgdat->lru_lock). 127 122 Almost all routines around memcg's LRU is called by global LRU's ··· 141 114 A special function is mem_cgroup_isolate_pages(). This scans 142 115 memcg's private LRU and call __isolate_lru_page() to extract a page 143 116 from LRU. 117 + 144 118 (By __isolate_lru_page(), the page is removed from both of global and 145 - private LRU.) 
119 + private LRU.) 146 120 147 121 148 122 9. Typical Tests. 123 + ================= 149 124 150 125 Tests for racy cases. 151 126 152 - 9.1 Small limit to memcg. 127 + 9.1 Small limit to memcg. 128 + ------------------------- 129 + 153 130 When you do test to do racy case, it's good test to set memcg's limit 154 131 to be very small rather than GB. Many races found in the test under 155 132 xKB or xxMB limits. 156 - (Memory behavior under GB and Memory behavior under MB shows very 157 - different situation.) 158 133 159 - 9.2 Shmem 134 + (Memory behavior under GB and Memory behavior under MB shows very 135 + different situation.) 136 + 137 + 9.2 Shmem 138 + --------- 139 + 160 140 Historically, memcg's shmem handling was poor and we saw some amount 161 141 of troubles here. This is because shmem is page-cache but can be 162 142 SwapCache. Test with shmem/tmpfs is always good test. 163 143 164 - 9.3 Migration 144 + 9.3 Migration 145 + ------------- 146 + 165 147 For NUMA, migration is an another special case. To do easy test, cpuset 166 - is useful. Following is a sample script to do migration. 148 + is useful. 
Following is a sample script to do migration:: 167 149 168 - mount -t cgroup -o cpuset none /opt/cpuset 150 + mount -t cgroup -o cpuset none /opt/cpuset 169 151 170 - mkdir /opt/cpuset/01 171 - echo 1 > /opt/cpuset/01/cpuset.cpus 172 - echo 0 > /opt/cpuset/01/cpuset.mems 173 - echo 1 > /opt/cpuset/01/cpuset.memory_migrate 174 - mkdir /opt/cpuset/02 175 - echo 1 > /opt/cpuset/02/cpuset.cpus 176 - echo 1 > /opt/cpuset/02/cpuset.mems 177 - echo 1 > /opt/cpuset/02/cpuset.memory_migrate 152 + mkdir /opt/cpuset/01 153 + echo 1 > /opt/cpuset/01/cpuset.cpus 154 + echo 0 > /opt/cpuset/01/cpuset.mems 155 + echo 1 > /opt/cpuset/01/cpuset.memory_migrate 156 + mkdir /opt/cpuset/02 157 + echo 1 > /opt/cpuset/02/cpuset.cpus 158 + echo 1 > /opt/cpuset/02/cpuset.mems 159 + echo 1 > /opt/cpuset/02/cpuset.memory_migrate 178 160 179 161 In above set, when you moves a task from 01 to 02, page migration to 180 162 node 0 to node 1 will occur. Following is a script to migrate all 181 - under cpuset. 182 - -- 183 - move_task() 184 - { 185 - for pid in $1 186 - do 187 - /bin/echo $pid >$2/tasks 2>/dev/null 188 - echo -n $pid 189 - echo -n " " 190 - done 191 - echo END 192 - } 163 + under cpuset.:: 193 164 194 - G1_TASK=`cat ${G1}/tasks` 195 - G2_TASK=`cat ${G2}/tasks` 196 - move_task "${G1_TASK}" ${G2} & 197 - -- 198 - 9.4 Memory hotplug. 165 + -- 166 + move_task() 167 + { 168 + for pid in $1 169 + do 170 + /bin/echo $pid >$2/tasks 2>/dev/null 171 + echo -n $pid 172 + echo -n " " 173 + done 174 + echo END 175 + } 176 + 177 + G1_TASK=`cat ${G1}/tasks` 178 + G2_TASK=`cat ${G2}/tasks` 179 + move_task "${G1_TASK}" ${G2} & 180 + -- 181 + 182 + 9.4 Memory hotplug 183 + ------------------ 184 + 199 185 memory hotplug test is one of good test. 200 - to offline memory, do following. 
201 - # echo offline > /sys/devices/system/memory/memoryXXX/state 186 + 187 + to offline memory, do following:: 188 + 189 + # echo offline > /sys/devices/system/memory/memoryXXX/state 190 + 202 191 (XXX is the place of memory) 192 + 203 193 This is an easy way to test page migration, too. 204 194 205 - 9.5 mkdir/rmdir 195 + 9.5 mkdir/rmdir 196 + --------------- 197 + 206 198 When using hierarchy, mkdir/rmdir test should be done. 207 - Use tests like the following. 199 + Use tests like the following:: 208 200 209 - echo 1 >/opt/cgroup/01/memory/use_hierarchy 210 - mkdir /opt/cgroup/01/child_a 211 - mkdir /opt/cgroup/01/child_b 201 + echo 1 >/opt/cgroup/01/memory/use_hierarchy 202 + mkdir /opt/cgroup/01/child_a 203 + mkdir /opt/cgroup/01/child_b 212 204 213 - set limit to 01. 214 - add limit to 01/child_b 215 - run jobs under child_a and child_b 205 + set limit to 01. 206 + add limit to 01/child_b 207 + run jobs under child_a and child_b 216 208 217 - create/delete following groups at random while jobs are running. 218 - /opt/cgroup/01/child_a/child_aa 219 - /opt/cgroup/01/child_b/child_bb 220 - /opt/cgroup/01/child_c 209 + create/delete following groups at random while jobs are running:: 210 + 211 + /opt/cgroup/01/child_a/child_aa 212 + /opt/cgroup/01/child_b/child_bb 213 + /opt/cgroup/01/child_c 221 214 222 215 running new jobs in new group is also good. 223 216 224 - 9.6 Mount with other subsystems. 217 + 9.6 Mount with other subsystems 218 + ------------------------------- 219 + 225 220 Mounting with other subsystems is a good test because there is a 226 221 race and lock dependency with other cgroup subsystems. 227 222 228 - example) 229 - # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices 223 + example:: 224 + 225 + # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices 230 226 231 227 and do task move, mkdir, rmdir etc...under this. 232 228 233 - 9.7 swapoff. 
229 + 9.7 swapoff 230 + ----------- 231 + 234 232 Besides management of swap is one of complicated parts of memcg, 235 233 call path of swap-in at swapoff is not same as usual swap-in path.. 236 234 It's worth to be tested explicitly. 237 235 238 - For example, test like following is good. 239 - (Shell-A) 240 - # mount -t cgroup none /cgroup -o memory 241 - # mkdir /cgroup/test 242 - # echo 40M > /cgroup/test/memory.limit_in_bytes 243 - # echo 0 > /cgroup/test/tasks 236 + For example, test like following is good: 237 + 238 + (Shell-A):: 239 + 240 + # mount -t cgroup none /cgroup -o memory 241 + # mkdir /cgroup/test 242 + # echo 40M > /cgroup/test/memory.limit_in_bytes 243 + # echo 0 > /cgroup/test/tasks 244 + 244 245 Run malloc(100M) program under this. You'll see 60M of swaps. 245 - (Shell-B) 246 - # move all tasks in /cgroup/test to /cgroup 247 - # /sbin/swapoff -a 248 - # rmdir /cgroup/test 249 - # kill malloc task. 246 + 247 + (Shell-B):: 248 + 249 + # move all tasks in /cgroup/test to /cgroup 250 + # /sbin/swapoff -a 251 + # rmdir /cgroup/test 252 + # kill malloc task. 250 253 251 254 Of course, tmpfs v.s. swapoff test should be tested, too. 252 255 253 - 9.8 OOM-Killer 256 + 9.8 OOM-Killer 257 + -------------- 258 + 254 259 Out-of-memory caused by memcg's limit will kill tasks under 255 260 the memcg. When hierarchy is used, a task under hierarchy 256 261 will be killed by the kernel. 262 + 257 263 In this case, panic_on_oom shouldn't be invoked and tasks 258 264 in other groups shouldn't be killed. 259 265 260 266 It's not difficult to cause OOM under memcg as following. 261 - Case A) when you can swapoff 262 - #swapoff -a 263 - #echo 50M > /memory.limit_in_bytes 267 + 268 + Case A) when you can swapoff:: 269 + 270 + #swapoff -a 271 + #echo 50M > /memory.limit_in_bytes 272 + 264 273 run 51M of malloc 265 274 266 - Case B) when you use mem+swap limitation. 
267 - #echo 50M > memory.limit_in_bytes 268 - #echo 50M > memory.memsw.limit_in_bytes 275 + Case B) when you use mem+swap limitation:: 276 + 277 + #echo 50M > memory.limit_in_bytes 278 + #echo 50M > memory.memsw.limit_in_bytes 279 + 269 280 run 51M of malloc 270 281 271 - 9.9 Move charges at task migration 282 + 9.9 Move charges at task migration 283 + ---------------------------------- 284 + 272 285 Charges associated with a task can be moved along with task migration. 273 286 274 - (Shell-A) 275 - #mkdir /cgroup/A 276 - #echo $$ >/cgroup/A/tasks 287 + (Shell-A):: 288 + 289 + #mkdir /cgroup/A 290 + #echo $$ >/cgroup/A/tasks 291 + 277 292 run some programs which uses some amount of memory in /cgroup/A. 278 293 279 - (Shell-B) 280 - #mkdir /cgroup/B 281 - #echo 1 >/cgroup/B/memory.move_charge_at_immigrate 282 - #echo "pid of the program running in group A" >/cgroup/B/tasks 294 + (Shell-B):: 283 295 284 - You can see charges have been moved by reading *.usage_in_bytes or 296 + #mkdir /cgroup/B 297 + #echo 1 >/cgroup/B/memory.move_charge_at_immigrate 298 + #echo "pid of the program running in group A" >/cgroup/B/tasks 299 + 300 + You can see charges have been moved by reading ``*.usage_in_bytes`` or 285 301 memory.stat of both A and B. 286 - See 8.2 of Documentation/cgroup-v1/memory.txt to see what value should be 287 - written to move_charge_at_immigrate. 288 302 289 - 9.10 Memory thresholds 303 + See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should 304 + be written to move_charge_at_immigrate. 305 + 306 + 9.10 Memory thresholds 307 + ---------------------- 308 + 290 309 Memory controller implements memory thresholds using cgroups notification 291 310 API. You can use tools/cgroup/cgroup_event_listener.c to test it. 
292 311 293 - (Shell-A) Create cgroup and run event listener 294 - # mkdir /cgroup/A 295 - # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M 312 + (Shell-A) Create cgroup and run event listener:: 296 313 297 - (Shell-B) Add task to cgroup and try to allocate and free memory 298 - # echo $$ >/cgroup/A/tasks 299 - # a="$(dd if=/dev/zero bs=1M count=10)" 300 - # a= 314 + # mkdir /cgroup/A 315 + # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M 316 + 317 + (Shell-B) Add task to cgroup and try to allocate and free memory:: 318 + 319 + # echo $$ >/cgroup/A/tasks 320 + # a="$(dd if=/dev/zero bs=1M count=10)" 321 + # a= 301 322 302 323 You will see message from cgroup_event_listener every time you cross 303 324 the thresholds.
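The tests above write limits such as "40M" and "50M"; the kernel parses these strings with memparse() (power-of-two K/M/G/T suffixes). A user-space sketch of the same parsing, handy when scripting these tests — this mirrors memparse semantics as we understand them; lib/cmdline.c is authoritative:

```python
# Sketch: parse a memcg limit string like "40M" into bytes,
# following memparse()-style power-of-two suffixes.
SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def parse_size(s):
    """'40M' -> 41943040; a bare number is taken as bytes."""
    s = s.strip().upper()
    if s and s[-1] in SUFFIXES:
        return int(s[:-1]) * SUFFIXES[s[-1]]
    return int(s)

print(parse_size("40M"))   # 41943040
print(parse_size("512"))   # 512
```

Reading limit_in_bytes back always returns a plain byte count, so a helper like this also makes it easy to assert that a written "40M" round-trips as 41943040.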
+280 -169
Documentation/cgroup-v1/memory.txt Documentation/cgroup-v1/memory.rst
··· 1 + ========================== 1 2 Memory Resource Controller 3 + ========================== 2 4 3 - NOTE: This document is hopelessly outdated and it asks for a complete 5 + NOTE: 6 + This document is hopelessly outdated and it asks for a complete 4 7 rewrite. It still contains a useful information so we are keeping it 5 8 here but make sure to check the current code if you need a deeper 6 9 understanding. 7 10 8 - NOTE: The Memory Resource Controller has generically been referred to as the 11 + NOTE: 12 + The Memory Resource Controller has generically been referred to as the 9 13 memory controller in this document. Do not confuse memory controller 10 14 used here with the memory controller that is used in hardware. 11 15 12 - (For editors) 13 - In this document: 16 + (For editors) In this document: 14 17 When we mention a cgroup (cgroupfs's directory) with memory controller, 15 18 we call it "memory cgroup". When you see git-log and source code, you'll 16 19 see patch's title and function names tend to use "memcg". 17 20 In this document, we avoid using it. 18 21 19 22 Benefits and Purpose of the memory controller 23 + ============================================= 20 24 21 25 The memory controller isolates the memory behaviour of a group of tasks 22 26 from the rest of the system. The article on LWN [12] mentions some probable ··· 42 38 Current Status: linux-2.6.34-mmotm(development version of 2010/April) 43 39 44 40 Features: 41 + 45 42 - accounting anonymous pages, file caches, swap caches usage and limiting them. 46 43 - pages are linked to per-memcg LRU exclusively, and there is no global LRU. 47 44 - optionally, memory+swap usage can be accounted and limited. ··· 59 54 60 55 Brief summary of control files. 
61 56 62 - tasks # attach a task(thread) and show list of threads 63 - cgroup.procs # show list of processes 64 - cgroup.event_control # an interface for event_fd() 65 - memory.usage_in_bytes # show current usage for memory 66 - (See 5.5 for details) 67 - memory.memsw.usage_in_bytes # show current usage for memory+Swap 68 - (See 5.5 for details) 69 - memory.limit_in_bytes # set/show limit of memory usage 70 - memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage 71 - memory.failcnt # show the number of memory usage hits limits 72 - memory.memsw.failcnt # show the number of memory+Swap hits limits 73 - memory.max_usage_in_bytes # show max memory usage recorded 74 - memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded 75 - memory.soft_limit_in_bytes # set/show soft limit of memory usage 76 - memory.stat # show various statistics 77 - memory.use_hierarchy # set/show hierarchical account enabled 78 - memory.force_empty # trigger forced page reclaim 79 - memory.pressure_level # set memory pressure notifications 80 - memory.swappiness # set/show swappiness parameter of vmscan 81 - (See sysctl's vm.swappiness) 82 - memory.move_charge_at_immigrate # set/show controls of moving charges 83 - memory.oom_control # set/show oom controls. 
84 - memory.numa_stat # show the number of memory usage per numa node 57 + ==================================== ========================================== 58 + tasks attach a task(thread) and show list of 59 + threads 60 + cgroup.procs show list of processes 61 + cgroup.event_control an interface for event_fd() 62 + memory.usage_in_bytes show current usage for memory 63 + (See 5.5 for details) 64 + memory.memsw.usage_in_bytes show current usage for memory+Swap 65 + (See 5.5 for details) 66 + memory.limit_in_bytes set/show limit of memory usage 67 + memory.memsw.limit_in_bytes set/show limit of memory+Swap usage 68 + memory.failcnt show the number of memory usage hits limits 69 + memory.memsw.failcnt show the number of memory+Swap hits limits 70 + memory.max_usage_in_bytes show max memory usage recorded 71 + memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded 72 + memory.soft_limit_in_bytes set/show soft limit of memory usage 73 + memory.stat show various statistics 74 + memory.use_hierarchy set/show hierarchical account enabled 75 + memory.force_empty trigger forced page reclaim 76 + memory.pressure_level set memory pressure notifications 77 + memory.swappiness set/show swappiness parameter of vmscan 78 + (See sysctl's vm.swappiness) 79 + memory.move_charge_at_immigrate set/show controls of moving charges 80 + memory.oom_control set/show oom controls. 
81 + memory.numa_stat show the number of memory usage per numa 82 + node 85 83 86 - memory.kmem.limit_in_bytes # set/show hard limit for kernel memory 87 - memory.kmem.usage_in_bytes # show current kernel memory allocation 88 - memory.kmem.failcnt # show the number of kernel memory usage hits limits 89 - memory.kmem.max_usage_in_bytes # show max kernel memory usage recorded 84 + memory.kmem.limit_in_bytes set/show hard limit for kernel memory 85 + memory.kmem.usage_in_bytes show current kernel memory allocation 86 + memory.kmem.failcnt show the number of kernel memory usage 87 + hits limits 88 + memory.kmem.max_usage_in_bytes show max kernel memory usage recorded 90 89 91 - memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory 92 - memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation 93 - memory.kmem.tcp.failcnt # show the number of tcp buf memory usage hits limits 94 - memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded 90 + memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory 91 + memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation 92 + memory.kmem.tcp.failcnt show the number of tcp buf memory usage 93 + hits limits 94 + memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded 95 + ==================================== ========================================== 95 96 96 97 1. History 98 + ========== 97 99 98 100 The memory controller has a long history. A request for comments for the memory 99 101 controller was posted by Balbir Singh [1]. At the time the RFC was posted ··· 115 103 Cache Control [11]. 116 104 117 105 2. Memory Control 106 + ================= 118 107 119 108 Memory is a unique resource in the sense that it is present in a limited 120 109 amount. If a task requires a lot of CPU processing, the task can spread ··· 133 120 The memory controller is the first controller developed. 134 121 135 122 2.1. 
Design 123 + ----------- 136 124 137 125 The core of the design is a counter called the page_counter. The 138 126 page_counter tracks the current memory usage and limit of the group of ··· 141 127 specific data structure (mem_cgroup) associated with it. 142 128 143 129 2.2. Accounting 130 + --------------- 131 + 132 + :: 144 133 145 134 +--------------------+ 146 135 | mem_cgroup | ··· 182 165 (*) page_cgroup structure is allocated at boot/memory-hotplug time. 183 166 184 167 2.2.1 Accounting details 168 + ------------------------ 185 169 186 170 All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. 187 171 Some pages which are never reclaimable and will not be on the LRU ··· 209 191 of used pages; not-on-LRU pages tend to be out-of-control from VM view. 210 192 211 193 2.3 Shared Page Accounting 194 + -------------------------- 212 195 213 196 Shared pages are accounted on the basis of the first touch approach. The 214 197 cgroup that first touches a page is accounted for the page. The principle ··· 226 207 caller of swapoff rather than the users of shmem. 227 208 228 209 2.4 Swap Extension (CONFIG_MEMCG_SWAP) 210 + -------------------------------------- 229 211 230 212 Swap Extension allows you to record charge for swap. A swapped-in page is 231 213 charged back to original page allocator if possible. 232 214 233 215 When swap is accounted, following files are added. 216 + 234 217 - memory.memsw.usage_in_bytes. 235 218 - memory.memsw.limit_in_bytes. 236 219 ··· 245 224 By using the memsw limit, you can avoid system OOM which can be caused by swap 246 225 shortage. 247 226 248 - * why 'memory+swap' rather than swap. 227 + **why 'memory+swap' rather than swap** 228 + 249 229 The global LRU(kswapd) can swap out arbitrary pages. Swap-out means 250 230 to move account from memory to swap...there is no change in usage of 251 231 memory+swap. 
In other words, when we want to limit the usage of swap without 252 232 affecting global LRU, memory+swap limit is better than just limiting swap from 253 233 an OS point of view. 254 234 255 - * What happens when a cgroup hits memory.memsw.limit_in_bytes 235 + **What happens when a cgroup hits memory.memsw.limit_in_bytes** 236 + 256 237 When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out 257 238 in this cgroup. Then, swap-out will not be done by cgroup routine and file 258 239 caches are dropped. But as mentioned above, global LRU can do swapout memory ··· 262 239 it by cgroup. 263 240 264 241 2.5 Reclaim 242 + ----------- 265 243 266 244 Each cgroup maintains a per cgroup LRU which has the same structure as 267 245 global VM. When a cgroup goes over its limit, we first try ··· 275 251 pages that are selected for reclaiming come from the per-cgroup LRU 276 252 list. 277 253 278 - NOTE: Reclaim does not work for the root cgroup, since we cannot set any 279 - limits on the root cgroup. 254 + NOTE: 255 + Reclaim does not work for the root cgroup, since we cannot set any 256 + limits on the root cgroup. 280 257 281 - Note2: When panic_on_oom is set to "2", the whole system will panic. 258 + Note2: 259 + When panic_on_oom is set to "2", the whole system will panic. 282 260 283 261 When oom event notifier is registered, event will be delivered. 284 262 (See oom_control section) 285 263 286 264 2.6 Locking 265 + ----------- 287 266 288 267 lock_page_cgroup()/unlock_page_cgroup() should not be called under 289 268 the i_pages lock. 290 269 291 270 Other lock order is following: 271 + 292 272 PG_locked. 293 - mm->page_table_lock 294 - pgdat->lru_lock 295 - lock_page_cgroup. 273 + mm->page_table_lock 274 + pgdat->lru_lock 275 + lock_page_cgroup. 276 + 296 277 In many cases, just lock_page_cgroup() is called. 278 + 297 279 per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by 298 280 pgdat->lru_lock, it has no lock of its own. 
299 281 300 282 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) 283 + ----------------------------------------------- 301 284 302 285 With the Kernel memory extension, the Memory Controller is able to limit 303 286 the amount of kernel memory used by the system. Kernel memory is fundamentally ··· 319 288 cgroup may or may not be accounted. The memory used is accumulated into 320 289 memory.kmem.usage_in_bytes, or in a separate counter when it makes sense. 321 290 (currently only for tcp). 291 + 322 292 The main "kmem" counter is fed into the main counter, so kmem charges will 323 293 also be visible from the user counter. 324 294 ··· 327 295 to trigger slab reclaim when those limits are reached. 328 296 329 297 2.7.1 Current Kernel Memory resources accounted 298 + ----------------------------------------------- 330 299 331 - * stack pages: every process consumes some stack pages. By accounting into 332 - kernel memory, we prevent new processes from being created when the kernel 333 - memory usage is too high. 300 + stack pages: 301 + every process consumes some stack pages. By accounting into 302 + kernel memory, we prevent new processes from being created when the kernel 303 + memory usage is too high. 334 304 335 - * slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy 336 - of each kmem_cache is created every time the cache is touched by the first time 337 - from inside the memcg. The creation is done lazily, so some objects can still be 338 - skipped while the cache is being created. All objects in a slab page should 339 - belong to the same memcg. This only fails to hold when a task is migrated to a 340 - different memcg during the page allocation by the cache. 305 + slab pages: 306 + pages allocated by the SLAB or SLUB allocator are tracked. A copy 307 + of each kmem_cache is created the first time the cache is touched 308 + from inside the memcg.
The creation is done lazily, so some objects can still be 309 + skipped while the cache is being created. All objects in a slab page should 310 + belong to the same memcg. This only fails to hold when a task is migrated to a 311 + different memcg during the page allocation by the cache. 341 312 342 - * sockets memory pressure: some sockets protocols have memory pressure 343 - thresholds. The Memory Controller allows them to be controlled individually 344 - per cgroup, instead of globally. 313 + sockets memory pressure: 314 + some socket protocols have memory pressure 315 + thresholds. The Memory Controller allows them to be controlled individually 316 + per cgroup, instead of globally. 345 317 346 - * tcp memory pressure: sockets memory pressure for the tcp protocol. 318 + tcp memory pressure: 319 + sockets memory pressure for the tcp protocol. 347 320 348 321 2.7.2 Common use cases 322 + ---------------------- 349 323 350 324 Because the "kmem" counter is fed to the main user counter, kernel memory can 351 325 never be limited completely independently of user memory. Say "U" is the user 352 326 limit, and "K" the kernel limit. There are three possible ways limits can be 353 327 set: 354 328 355 - U != 0, K = unlimited: 329 + U != 0, K = unlimited: 356 330 This is the standard memcg limitation mechanism already present before kmem 357 331 accounting. Kernel memory is completely ignored. 358 332 359 - U != 0, K < U: 333 + U != 0, K < U: 360 334 Kernel memory is a subset of the user memory. This setup is useful in 361 335 deployments where the total amount of memory per-cgroup is overcommitted. 362 336 Overcommitting kernel memory limits is definitely not recommended, since the
373 - WARNING: In the current implementation, memory reclaim will NOT be 335 + 336 + WARNING: 337 + In the current implementation, memory reclaim will NOT be 374 338 triggered for a cgroup when it hits K while staying below U, which makes 375 339 this setup impractical. 376 340 377 - U != 0, K >= U: 341 + U != 0, K >= U: 378 342 Since kmem charges will also be fed to the user counter and reclaim will be 379 343 triggered for the cgroup for both kinds of memory. This setup gives the 380 344 admin a unified view of memory, and it is also useful for people who just 381 345 want to track kernel memory usage. 382 346 383 347 3. User Interface 348 + ================= 384 349 385 350 3.0. Configuration 351 + ------------------ 386 352 387 353 a. Enable CONFIG_CGROUPS 388 354 b. Enable CONFIG_MEMCG ··· 394 352 d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 395 353 396 354 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) 397 - # mount -t tmpfs none /sys/fs/cgroup 398 - # mkdir /sys/fs/cgroup/memory 399 - # mount -t cgroup none /sys/fs/cgroup/memory -o memory 355 + ------------------------------------------------------------------- 400 356 401 - 3.2. Make the new group and move bash into it 402 - # mkdir /sys/fs/cgroup/memory/0 403 - # echo $$ > /sys/fs/cgroup/memory/0/tasks 357 + :: 404 358 405 - Since now we're in the 0 cgroup, we can alter the memory limit: 406 - # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes 359 + # mount -t tmpfs none /sys/fs/cgroup 360 + # mkdir /sys/fs/cgroup/memory 361 + # mount -t cgroup none /sys/fs/cgroup/memory -o memory 407 362 408 - NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, 409 - mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.) 363 + 3.2. Make the new group and move bash into it:: 410 364 411 - NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited). 412 - NOTE: We cannot set limits on the root cgroup any more. 
365 + # mkdir /sys/fs/cgroup/memory/0 366 + # echo $$ > /sys/fs/cgroup/memory/0/tasks 413 367 414 - # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes 415 - 4194304 368 + Since now we're in the 0 cgroup, we can alter the memory limit:: 416 369 417 - We can check the usage: 418 - # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes 419 - 1216512 370 + # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes 371 + 372 + NOTE: 373 + We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, 374 + mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, 375 + Gibibytes.) 376 + 377 + NOTE: 378 + We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``. 379 + 380 + NOTE: 381 + We cannot set limits on the root cgroup any more. 382 + 383 + :: 384 + 385 + # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes 386 + 4194304 387 + 388 + We can check the usage:: 389 + 390 + # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes 391 + 1216512 420 392 421 393 A successful write to this file does not guarantee a successful setting of 422 394 this limit to the value written into the file. This can be due to a 423 395 number of factors, such as rounding up to page boundaries or the total 424 396 availability of memory on the system. The user is required to re-read 425 - this file after a write to guarantee the value committed by the kernel. 397 + this file after a write to guarantee the value committed by the kernel:: 426 398 427 - # echo 1 > memory.limit_in_bytes 428 - # cat memory.limit_in_bytes 429 - 4096 399 + # echo 1 > memory.limit_in_bytes 400 + # cat memory.limit_in_bytes 401 + 4096 430 402 431 403 The memory.failcnt field gives the number of times that the cgroup limit was 432 404 exceeded. ··· 449 393 caches, RSS and Active pages/Inactive pages are shown. 450 394 451 395 4. Testing 396 + ========== 452 397 453 398 For testing features and implementation, see memcg_test.txt. 
454 399 ··· 465 408 Trying usual test under memory controller is always helpful. 466 409 467 410 4.1 Troubleshooting 411 + ------------------- 468 412 469 413 Sometimes a user might find that the application under a cgroup is 470 414 terminated by the OOM killer. There are several causes for this: ··· 480 422 seeing what happens will be helpful. 481 423 482 424 4.2 Task migration 425 + ------------------ 483 426 484 427 When a task migrates from one cgroup to another, its charge is not 485 428 carried forward by default. The pages allocated from the original cgroup still ··· 491 432 See 8. "Move charges at task migration" 492 433 493 434 4.3 Removing a cgroup 435 + --------------------- 494 436 495 437 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a 496 438 cgroup might have some charge associated with it, even though all ··· 508 448 509 449 About use_hierarchy, see Section 6. 510 450 511 - 5. Misc. interfaces. 451 + 5. Misc. interfaces 452 + =================== 512 453 513 454 5.1 force_empty 455 + --------------- 514 456 memory.force_empty interface is provided to make cgroup's memory usage empty. 515 - When writing anything to this 457 + When writing anything to this:: 516 458 517 - # echo 0 > memory.force_empty 459 + # echo 0 > memory.force_empty 518 460 519 461 the cgroup will be reclaimed and as many pages reclaimed as possible. 520 462 ··· 533 471 About use_hierarchy, see Section 6. 534 472 535 473 5.2 stat file 474 + ------------- 536 475 537 476 memory.stat file includes following statistics 538 477 539 - # per-memory cgroup local status 540 - cache - # of bytes of page cache memory. 541 - rss - # of bytes of anonymous and swap cache memory (includes 478 + per-memory cgroup local status 479 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 480 + 481 + =============== =============================================================== 482 + cache # of bytes of page cache memory. 
483 + rss # of bytes of anonymous and swap cache memory (includes 542 484 transparent hugepages). 543 - rss_huge - # of bytes of anonymous transparent hugepages. 544 - mapped_file - # of bytes of mapped file (includes tmpfs/shmem) 545 - pgpgin - # of charging events to the memory cgroup. The charging 485 + rss_huge # of bytes of anonymous transparent hugepages. 486 + mapped_file # of bytes of mapped file (includes tmpfs/shmem) 487 + pgpgin # of charging events to the memory cgroup. The charging 546 488 event happens each time a page is accounted as either mapped 547 489 anon page(RSS) or cache page(Page Cache) to the cgroup. 548 - pgpgout - # of uncharging events to the memory cgroup. The uncharging 490 + pgpgout # of uncharging events to the memory cgroup. The uncharging 549 491 event happens each time a page is unaccounted from the cgroup. 550 - swap - # of bytes of swap usage 551 - dirty - # of bytes that are waiting to get written back to the disk. 552 - writeback - # of bytes of file/anon cache that are queued for syncing to 492 + swap # of bytes of swap usage 493 + dirty # of bytes that are waiting to get written back to the disk. 494 + writeback # of bytes of file/anon cache that are queued for syncing to 553 495 disk. 554 - inactive_anon - # of bytes of anonymous and swap cache memory on inactive 496 + inactive_anon # of bytes of anonymous and swap cache memory on inactive 555 497 LRU list. 556 - active_anon - # of bytes of anonymous and swap cache memory on active 498 + active_anon # of bytes of anonymous and swap cache memory on active 557 499 LRU list. 558 - inactive_file - # of bytes of file-backed memory on inactive LRU list. 559 - active_file - # of bytes of file-backed memory on active LRU list. 560 - unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). 500 + inactive_file # of bytes of file-backed memory on inactive LRU list. 501 + active_file # of bytes of file-backed memory on active LRU list. 
502 + unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). 503 + =============== =============================================================== 561 504 562 - # status considering hierarchy (see memory.use_hierarchy settings) 505 + status considering hierarchy (see memory.use_hierarchy settings) 506 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 563 507 564 - hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy 565 - under which the memory cgroup is 566 - hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to 567 - hierarchy under which memory cgroup is. 508 + ========================= =================================================== 509 + hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy 510 + under which the memory cgroup is 511 + hierarchical_memsw_limit # of bytes of memory+swap limit with regard to 512 + hierarchy under which memory cgroup is. 568 513 569 - total_<counter> - # hierarchical version of <counter>, which in 570 - addition to the cgroup's own value includes the 571 - sum of all hierarchical children's values of 572 - <counter>, i.e. total_cache 514 + total_<counter> # hierarchical version of <counter>, which in 515 + addition to the cgroup's own value includes the 516 + sum of all hierarchical children's values of 517 + <counter>, i.e. total_cache 518 + ========================= =================================================== 573 519 574 - # The following additional stats are dependent on CONFIG_DEBUG_VM. 520 + The following additional stats are dependent on CONFIG_DEBUG_VM 521 + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 575 522 576 - recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) 577 - recent_rotated_file - VM internal parameter. (see mm/vmscan.c) 578 - recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) 579 - recent_scanned_file - VM internal parameter. 
(see mm/vmscan.c) 523 + ========================= ======================================== 524 + recent_rotated_anon VM internal parameter. (see mm/vmscan.c) 525 + recent_rotated_file VM internal parameter. (see mm/vmscan.c) 526 + recent_scanned_anon VM internal parameter. (see mm/vmscan.c) 527 + recent_scanned_file VM internal parameter. (see mm/vmscan.c) 528 + ========================= ======================================== 580 529 581 530 Memo: 582 531 recent_rotated means recent frequency of LRU rotation. ··· 598 525 Only anonymous and swap cache memory is listed as part of 'rss' stat. 599 526 This should not be confused with the true 'resident set size' or the 600 527 amount of physical memory used by the cgroup. 528 + 601 529 'rss + mapped_file" will give you resident set size of cgroup. 530 + 602 531 (Note: file and shmem may be shared among other cgroups. In that case, 603 - mapped_file is accounted only when the memory cgroup is owner of page 604 - cache.) 532 + mapped_file is accounted only when the memory cgroup is owner of page 533 + cache.) 605 534 606 535 5.3 swappiness 536 + -------------- 607 537 608 538 Overrides /proc/sys/vm/swappiness for the particular group. The tunable 609 539 in the root cgroup corresponds to the global swappiness setting. ··· 617 541 if there are no file pages to reclaim. 618 542 619 543 5.4 failcnt 544 + ----------- 620 545 621 546 A memory cgroup provides memory.failcnt and memory.memsw.failcnt files. 622 547 This failcnt(== failure count) shows the number of times that a usage counter 623 548 hit its limit. When a memory cgroup hits a limit, failcnt increases and 624 549 memory under it will be reclaimed. 625 550 626 - You can reset failcnt by writing 0 to failcnt file. 
627 - # echo 0 > .../memory.failcnt 551 + You can reset failcnt by writing 0 to failcnt file:: 552 + 553 + # echo 0 > .../memory.failcnt 628 554 629 555 5.5 usage_in_bytes 556 + ------------------ 630 557 631 558 For efficiency, as other kernel components, memory cgroup uses some optimization 632 559 to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the ··· 639 560 value in memory.stat(see 5.2). 640 561 641 562 5.6 numa_stat 563 + ------------- 642 564 643 565 This is similar to numa_maps but operates on a per-memcg basis. This is 644 566 useful for providing visibility into the numa locality information within ··· 651 571 per-node page counts including "hierarchical_<counter>" which sums up all 652 572 hierarchical children's values in addition to the memcg's own value. 653 573 654 - The output format of memory.numa_stat is: 574 + The output format of memory.numa_stat is:: 655 575 656 - total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... 657 - file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... 658 - anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... 659 - unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... 660 - hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ... 576 + total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... 577 + file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... 578 + anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... 579 + unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... 580 + hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ... 661 581 662 582 The "total" count is sum of file + anon + unevictable. 663 583 664 584 6. Hierarchy support 585 + ==================== 665 586 666 587 The memory controller supports a deep hierarchy and hierarchical accounting. 667 588 The hierarchy is created by creating the appropriate cgroups in the 668 589 cgroup filesystem. 
Consider for example, the following cgroup filesystem 669 - hierarchy 590 + hierarchy:: 670 591 671 592 root 672 593 / | \ ··· 684 603 children of the ancestor. 685 604 686 605 6.1 Enabling hierarchical accounting and reclaim 606 + ------------------------------------------------ 687 607 688 608 A memory cgroup by default disables the hierarchy feature. Support 689 - can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup 609 + can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup:: 690 610 691 - # echo 1 > memory.use_hierarchy 611 + # echo 1 > memory.use_hierarchy 692 612 693 - The feature can be disabled by 613 + The feature can be disabled by:: 694 614 695 - # echo 0 > memory.use_hierarchy 615 + # echo 0 > memory.use_hierarchy 696 616 697 - NOTE1: Enabling/disabling will fail if either the cgroup already has other 617 + NOTE1: 618 + Enabling/disabling will fail if either the cgroup already has other 698 619 cgroups created below it, or if the parent cgroup has use_hierarchy 699 620 enabled. 700 621 701 - NOTE2: When panic_on_oom is set to "2", the whole system will panic in 622 + NOTE2: 623 + When panic_on_oom is set to "2", the whole system will panic in 702 624 case of an OOM event in any cgroup. 703 625 704 626 7. Soft limits 627 + ============== 705 628 706 629 Soft limits allow for greater sharing of memory. The idea behind soft limits 707 630 is to allow control groups to use as much of the memory as needed, provided ··· 725 640 it gets invoked from balance_pgdat (kswapd). 
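The victim-selection rule above (groups that exceed their soft limit the most get reclaimed first) can be observed from user space by comparing the two interface files per group. As a rough sketch, assuming the v1 memory controller is mounted at the path passed in (the `over_soft_limit` helper name is made up for illustration, not part of the kernel interface):

```shell
# Sketch only: list groups whose usage exceeds their soft limit, i.e. the
# preferred victims for soft limit reclaim under global memory pressure.
# The helper name and the mount-point argument are assumptions.
over_soft_limit() {
    root=$1                                  # e.g. /sys/fs/cgroup/memory
    for g in "$root"/*/; do
        [ -f "${g}memory.usage_in_bytes" ] || continue
        usage=$(cat "${g}memory.usage_in_bytes")
        soft=$(cat "${g}memory.soft_limit_in_bytes")
        if [ "$usage" -gt "$soft" ]; then
            echo "${g%/}"                    # group is over its soft limit
        fi
    done
}
```

This is only an observation aid; the kernel's own selection additionally weighs by how much each group exceeds its soft limit.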
726 641 727 642 7.1 Interface 643 + ------------- 728 644 729 645 Soft limits can be set up by using the following commands (in this example we 730 - assume a soft limit of 256 MiB) 646 + assume a soft limit of 256 MiB):: 731 647 732 - # echo 256M > memory.soft_limit_in_bytes 648 + # echo 256M > memory.soft_limit_in_bytes 733 649 734 - If we want to change this to 1G, we can at any time use 650 + If we want to change this to 1G, we can at any time use:: 735 651 736 - # echo 1G > memory.soft_limit_in_bytes 652 + # echo 1G > memory.soft_limit_in_bytes 737 653 738 - NOTE1: Soft limits take effect over a long period of time, since they involve 654 + NOTE1: 655 + Soft limits take effect over a long period of time, since they involve 739 656 reclaiming memory for balancing between memory cgroups. 740 - NOTE2: It is recommended to set the soft limit always below the hard limit, 657 + NOTE2: 658 + It is recommended to set the soft limit always below the hard limit, 741 659 otherwise the hard limit will take precedence. 742 660 743 661 8. Move charges at task migration 662 + ================================= 744 663 745 664 Users can move charges associated with a task along with task migration, that 746 665 is, uncharge task's pages from the old cgroup and charge them to the new cgroup. ··· 752 663 page tables. 753 664 754 665 8.1 Interface 666 + ------------- 755 667 756 668 This feature is disabled by default. It can be enabled (and disabled again) by 757 669 writing to memory.move_charge_at_immigrate of the destination cgroup. 758 670 759 - If you want to enable it: 671 + If you want to enable it:: 760 672 761 - # echo (some positive value) > memory.move_charge_at_immigrate 673 + # echo (some positive value) > memory.move_charge_at_immigrate 762 674 763 - Note: Each bits of move_charge_at_immigrate has its own meaning about what type 675 + Note: 676 + Each bit of move_charge_at_immigrate has its own meaning about what type 764 677 of charges should be moved.
See 8.2 for details. 765 - Note: 678 + Note: 679 + Charges are moved only when you move mm->owner, in other words, 766 680 a leader of a thread group. 767 - Note: 681 + Note: 682 + If we cannot find enough space for the task in the destination cgroup, we 768 683 try to make space by reclaiming memory. Task migration may fail if we 769 684 cannot make enough space. 770 - Note: 685 + Note: 686 + It can take several seconds if you move many charges. 771 687 772 - And if you want disable it again: 688 + And if you want to disable it again:: 773 689 774 - # echo 0 > memory.move_charge_at_immigrate 690 + # echo 0 > memory.move_charge_at_immigrate 775 691 776 692 8.2 Type of charges which can be moved 693 + -------------------------------------- 777 694 778 695 Each bit in move_charge_at_immigrate has its own meaning about what type of 779 696 charges should be moved. But in any case, it must be noted that an account of 780 697 a page or a swap can be moved only when it is charged to the task's current 781 698 (old) memory cgroup. 782 699 783 - bit | what type of charges would be moved ? 784 - -----+------------------------------------------------------------------------ 785 - 0 | A charge of an anonymous page (or swap of it) used by the target task. 786 - | You must enable Swap Extension (see 2.4) to enable move of swap charges. 787 - -----+------------------------------------------------------------------------ 788 - 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) 789 - | and swaps of tmpfs file) mmapped by the target task. Unlike the case of 790 - | anonymous pages, file pages (and swaps) in the range mmapped by the task 791 - | will be moved even if the task hasn't done page fault, i.e.
they might 792 - | not be the task's "RSS", but other task's "RSS" that maps the same file. 793 - | And mapcount of the page is ignored (the page can be moved even if 794 - | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to 795 - | enable move of swap charges. 700 + +---+--------------------------------------------------------------------------+ 701 + |bit| what type of charges would be moved ? | 702 + +===+==========================================================================+ 703 + | 0 | A charge of an anonymous page (or swap of it) used by the target task. | 704 + | | You must enable Swap Extension (see 2.4) to enable move of swap charges. | 705 + +---+--------------------------------------------------------------------------+ 706 + | 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) | 707 + | | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | 708 + | | anonymous pages, file pages (and swaps) in the range mmapped by the task | 709 + | | will be moved even if the task hasn't done page fault, i.e. they might | 710 + | | not be the task's "RSS", but other task's "RSS" that maps the same file. | 711 + | | And mapcount of the page is ignored (the page can be moved even if | 712 + | | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | 713 + | | enable move of swap charges. | 714 + +---+--------------------------------------------------------------------------+ 796 715 797 716 8.3 TODO 717 + -------- 798 718 799 719 - All moving charge operations are done under cgroup_mutex. It's not good 800 720 behavior to hold the mutex too long, so we may need some trick. 801 721 802 722 9. Memory thresholds 723 + ==================== 803 724 804 725 Memory cgroup implements memory thresholds using the cgroups notification 805 726 API (see cgroups.txt). It allows registering multiple memory and memsw 806 727 thresholds and getting notifications when a threshold is crossed.
807 728 808 729 To register a threshold, an application must: 730 + 809 731 - create an eventfd using eventfd(2); 810 732 - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; 811 733 - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to ··· 828 728 It's applicable for root and non-root cgroup. 829 729 830 730 10. OOM Control 731 + =============== 831 732 832 733 memory.oom_control file is for OOM notification and other controls. 833 734 ··· 837 736 delivery and gets notification when OOM happens. 838 737 839 738 To register a notifier, an application must: 739 + 840 740 - create an eventfd using eventfd(2) 841 741 - open memory.oom_control file 842 742 - write string like "<event_fd> <fd of memory.oom_control>" to ··· 854 752 in memory cgroup's OOM-waitqueue when they request accountable memory. 855 753 856 754 For running them, you have to relax the memory cgroup's OOM status by 755 + 857 756 * enlarge limit or reduce usage. 757 + 858 758 To reduce usage, 759 + 859 760 * kill some tasks. 860 761 * move some tasks to other group with account migration. 861 762 * remove some files (on tmpfs?) ··· 866 761 Then, stopped tasks will work again. 867 762 868 763 At reading, current status of OOM is shown. 869 - oom_kill_disable 0 or 1 (if 1, oom-killer is disabled) 870 - under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may 871 - be stopped.) 764 + 765 + - oom_kill_disable 0 or 1 766 + (if 1, oom-killer is disabled) 767 + - under_oom 0 or 1 768 + (if 1, the memory cgroup is under OOM, tasks may be stopped.) 872 769 873 770 11. 
Memory Pressure 771 + =================== 874 772 875 773 The pressure level notifications can be used to monitor the memory 876 774 allocation cost; based on the pressure, applications can implement ··· 948 840 949 841 Here is a small script example that makes a new cgroup, sets up a 950 842 memory limit, sets up a notification in the cgroup and then makes child 951 - cgroup experience a critical pressure: 843 + cgroup experience a critical pressure:: 952 844 953 - # cd /sys/fs/cgroup/memory/ 954 - # mkdir foo 955 - # cd foo 956 - # cgroup_event_listener memory.pressure_level low,hierarchy & 957 - # echo 8000000 > memory.limit_in_bytes 958 - # echo 8000000 > memory.memsw.limit_in_bytes 959 - # echo $$ > tasks 960 - # dd if=/dev/zero | read x 845 + # cd /sys/fs/cgroup/memory/ 846 + # mkdir foo 847 + # cd foo 848 + # cgroup_event_listener memory.pressure_level low,hierarchy & 849 + # echo 8000000 > memory.limit_in_bytes 850 + # echo 8000000 > memory.memsw.limit_in_bytes 851 + # echo $$ > tasks 852 + # dd if=/dev/zero | read x 961 853 962 854 (Expect a bunch of notifications, and eventually, the oom-killer will 963 855 trigger.) 964 856 965 857 12. TODO 858 + ======== 966 859 967 860 1. Make per-cgroup scanner reclaim not-shared pages first 968 861 2. Teach controller to account for shared-pages ··· 971 862 not yet hit but the usage is getting closer 972 863 973 864 Summary 865 + ======= 974 866 975 867 Overall, the memory controller has been a stable controller and has been 976 868 commented and discussed quite extensively in the community. 977 869 978 870 References 871 + ========== 979 872 980 873 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ 981 874 2. Singh, Balbir. Memory Controller (RSS Control),
+21 -16
Documentation/cgroup-v1/net_cls.txt Documentation/cgroup-v1/net_cls.rst
··· 1 + ========================= 1 2 Network classifier cgroup 2 - ------------------------- 3 + ========================= 3 4 4 5 The Network classifier cgroup provides an interface to 5 6 tag network packets with a class identifier (classid). ··· 18 17 is the minor handle number. 19 18 Reading net_cls.classid yields a decimal result. 20 19 21 - Example: 22 - mkdir /sys/fs/cgroup/net_cls 23 - mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls 24 - mkdir /sys/fs/cgroup/net_cls/0 25 - echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid 26 - - setting a 10:1 handle. 20 + Example:: 27 21 28 - cat /sys/fs/cgroup/net_cls/0/net_cls.classid 29 - 1048577 22 + mkdir /sys/fs/cgroup/net_cls 23 + mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls 24 + mkdir /sys/fs/cgroup/net_cls/0 25 + echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid 30 26 31 - configuring tc: 32 - tc qdisc add dev eth0 root handle 10: htb 27 + - setting a 10:1 handle:: 33 28 34 - tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit 35 - - creating traffic class 10:1 29 + cat /sys/fs/cgroup/net_cls/0/net_cls.classid 30 + 1048577 36 31 37 - tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup 32 + - configuring tc:: 38 33 39 - configuring iptables, basic example: 40 - iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP 34 + tc qdisc add dev eth0 root handle 10: htb 35 + tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit 36 + 37 + - creating traffic class 10:1:: 38 + 39 + tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup 40 + 41 + configuring iptables, basic example:: 42 + 43 + iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
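Since net_cls.classid stores the tc handle in the 0xAAAABBBB format described above, the decimal value read back from the file can be decoded with plain shell arithmetic. A small sketch for the 10:1 handle from the example:

```shell
# Decode a net_cls classid (format 0xAAAABBBB) back into its tc major:minor
# handle; 0x100001 is the 10:1 handle used in the example above.
classid=$(( 0x100001 ))
printf 'decimal: %d\n' "$classid"            # decimal: 1048577
printf 'handle: %x:%x\n' "$(( classid >> 16 ))" "$(( classid & 0xffff ))"
# handle: 10:1
```

The same arithmetic in reverse (major << 16 | minor) produces the value to write into net_cls.classid.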
+13 -11
Documentation/cgroup-v1/net_prio.txt Documentation/cgroup-v1/net_prio.rst
··· 1 + ======================= 1 2 Network priority cgroup 2 - ------------------------- 3 + ======================= 3 4 4 5 The Network priority cgroup provides an interface to allow an administrator to 5 6 dynamically set the priority of network traffic generated by various ··· 15 14 16 15 This cgroup allows an administrator to assign a process to a group which defines 17 16 the priority of egress traffic on a given interface. Network priority groups can 18 - be created by first mounting the cgroup filesystem. 17 + be created by first mounting the cgroup filesystem:: 19 18 20 - # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio 19 + # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio 21 20 22 21 With the above step, the initial group acting as the parent accounting group 23 22 becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in ··· 26 25 Each net_prio cgroup contains two files that are subsystem specific 27 26 28 27 net_prio.prioidx 29 - This file is read-only, and is simply informative. It contains a unique integer 30 - value that the kernel uses as an internal representation of this cgroup. 28 + This file is read-only, and is simply informative. It contains a unique 29 + integer value that the kernel uses as an internal representation of this 30 + cgroup. 31 31 32 32 net_prio.ifpriomap 33 - This file contains a map of the priorities assigned to traffic originating from 34 - processes in this group and egressing the system on various interfaces. It 35 - contains a list of tuples in the form <ifname priority>. Contents of this file 36 - can be modified by echoing a string into the file using the same tuple format. 37 - for example: 33 + This file contains a map of the priorities assigned to traffic originating 34 + from processes in this group and egressing the system on various interfaces. 35 + It contains a list of tuples in the form <ifname priority>. Contents of this 36 + file can be modified by echoing a string into the file using the same tuple 37 + format. For example:: 38 38 39 - echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap 39 + echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap 40 40 41 41 This command would force any traffic originating from processes belonging to the 42 42 iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
+41 -37
Documentation/cgroup-v1/pids.txt Documentation/cgroup-v1/pids.rst
··· 1 - Process Number Controller 2 - ========================= 1 + ========================= 2 + Process Number Controller 3 + ========================= 3 4 4 5 Abstract 5 6 -------- ··· 35 34 superset of parent/child/pids.current. 36 35 37 36 The pids.events file contains event counters: 37 + 38 38 - max: Number of times fork failed because limit was hit. 39 39 40 40 Example 41 41 ------- 42 42 43 - First, we mount the pids controller: 44 - # mkdir -p /sys/fs/cgroup/pids 45 - # mount -t cgroup -o pids none /sys/fs/cgroup/pids 43 + First, we mount the pids controller:: 46 44 47 - Then we create a hierarchy, set limits and attach processes to it: 48 - # mkdir -p /sys/fs/cgroup/pids/parent/child 49 - # echo 2 > /sys/fs/cgroup/pids/parent/pids.max 50 - # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs 51 - # cat /sys/fs/cgroup/pids/parent/pids.current 52 - 2 53 - # 45 + # mkdir -p /sys/fs/cgroup/pids 46 + # mount -t cgroup -o pids none /sys/fs/cgroup/pids 47 + 48 + Then we create a hierarchy, set limits and attach processes to it:: 49 + 50 + # mkdir -p /sys/fs/cgroup/pids/parent/child 51 + # echo 2 > /sys/fs/cgroup/pids/parent/pids.max 52 + # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs 53 + # cat /sys/fs/cgroup/pids/parent/pids.current 54 + 2 55 + # 54 56 55 57 It should be noted that attempts to overcome the set limit (2 in this case) will 56 - fail: 58 + fail:: 57 59 58 - # cat /sys/fs/cgroup/pids/parent/pids.current 59 - 2 60 - # ( /bin/echo "Here's some processes for you." | cat ) 61 - sh: fork: Resource temporary unavailable 62 - # 60 + # cat /sys/fs/cgroup/pids/parent/pids.current 61 + 2 62 + # ( /bin/echo "Here's some processes for you." | cat ) 63 + sh: fork: Resource temporary unavailable 64 + # 63 65 64 66 Even if we migrate to a child cgroup (which doesn't have a set limit), we will 65 67 not be able to overcome the most stringent limit in the hierarchy (in this case, 66 - parent's): 68 + parent's):: 67 69 68 - # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs 69 - # cat /sys/fs/cgroup/pids/parent/pids.current 70 - 2 71 - # cat /sys/fs/cgroup/pids/parent/child/pids.current 72 - 2 73 - # cat /sys/fs/cgroup/pids/parent/child/pids.max 74 - max 75 - # ( /bin/echo "Here's some processes for you." | cat ) 76 - sh: fork: Resource temporary unavailable 77 - # 70 + # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs 71 + # cat /sys/fs/cgroup/pids/parent/pids.current 72 + 2 73 + # cat /sys/fs/cgroup/pids/parent/child/pids.current 74 + 2 75 + # cat /sys/fs/cgroup/pids/parent/child/pids.max 76 + max 77 + # ( /bin/echo "Here's some processes for you." | cat ) 78 + sh: fork: Resource temporary unavailable 79 + # 78 80 79 81 We can set a limit that is smaller than pids.current, which will stop any new 80 82 processes from being forked at all (note that the shell itself counts towards 81 - pids.current): 83 + pids.current):: 82 84 83 - # echo 1 > /sys/fs/cgroup/pids/parent/pids.max 84 - # /bin/echo "We can't even spawn a single process now." 85 - sh: fork: Resource temporary unavailable 86 - # echo 0 > /sys/fs/cgroup/pids/parent/pids.max 87 - # /bin/echo "We can't even spawn a single process now." 88 - sh: fork: Resource temporary unavailable 89 - # 85 + # echo 1 > /sys/fs/cgroup/pids/parent/pids.max 86 + # /bin/echo "We can't even spawn a single process now." 87 + sh: fork: Resource temporary unavailable 88 + # echo 0 > /sys/fs/cgroup/pids/parent/pids.max 89 + # /bin/echo "We can't even spawn a single process now." 90 + sh: fork: Resource temporary unavailable 91 + #
+34 -26
Documentation/cgroup-v1/rdma.txt Documentation/cgroup-v1/rdma.rst
··· 1 - RDMA Controller 2 - ---------------- 1 + =============== 2 + RDMA Controller 3 + =============== 3 4 4 - Contents 5 - -------- 5 + .. Contents 6 + 7 + 1. Overview 8 + 1-1. What is RDMA controller? 9 + 1-2. Why RDMA controller needed? 10 + 1-3. How is RDMA controller implemented? 11 + 2. Usage Examples 6 12 7 13 1. Overview 8 - 1-1. What is RDMA controller? 9 - 1-2. Why RDMA controller needed? 10 - 1-3. How is RDMA controller implemented? 11 - 2. Usage Examples 12 - 13 - 1. Overview 14 + =========== 14 15 15 16 1-1. What is RDMA controller? 16 17 ----------------------------- ··· 84 83 IB device. 85 84 86 85 Following resources can be accounted by rdma controller. 86 + 87 + ========== ============================= 87 88 hca_handle Maximum number of HCA Handles 88 89 hca_object Maximum number of HCA Objects 90 + ========== ============================= 89 91 90 92 2. Usage Examples 91 - ----------------- 93 + ================= 92 94 93 - (a) Configure resource limit: 94 - echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max 95 - echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max 95 + (a) Configure resource limit:: 96 96 97 - (b) Query resource limit: 98 - cat /sys/fs/cgroup/rdma/2/rdma.max 99 - #Output: 100 - mlx4_0 hca_handle=2 hca_object=2000 101 - ocrdma1 hca_handle=3 hca_object=max 97 + echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max 98 + echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max 102 99 103 - (c) Query current usage: 104 - cat /sys/fs/cgroup/rdma/2/rdma.current 105 - #Output: 106 - mlx4_0 hca_handle=1 hca_object=20 107 - ocrdma1 hca_handle=1 hca_object=23 100 + (b) Query resource limit:: 108 101 109 - (d) Delete resource limit: 110 - echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max 102 + cat /sys/fs/cgroup/rdma/2/rdma.max 103 + #Output: 104 + mlx4_0 hca_handle=2 hca_object=2000 105 + ocrdma1 hca_handle=3 hca_object=max 106 + 107 + (c) Query current usage:: 108 + 109 + cat /sys/fs/cgroup/rdma/2/rdma.current 110 + #Output: 111 + mlx4_0 hca_handle=1 hca_object=20 112 + ocrdma1 hca_handle=1 hca_object=23 113 + 114 + (d) Delete resource limit:: 115 + 116 + echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
+1 -1
Documentation/filesystems/tmpfs.txt
··· 98 98 use at file creation time. When a task allocates a file in the file 99 99 system, the mount option memory policy will be applied with a NodeList, 100 100 if any, modified by the calling task's cpuset constraints 101 - [See Documentation/cgroup-v1/cpusets.txt] and any optional flags, listed 101 + [See Documentation/cgroup-v1/cpusets.rst] and any optional flags, listed 102 102 below. If the resulting NodeLists is the empty set, the effective memory 103 103 policy for the file will revert to "default" policy. 104 104
+1 -1
Documentation/scheduler/sched-deadline.txt
··· 652 652 653 653 -deadline tasks cannot have an affinity mask smaller that the entire 654 654 root_domain they are created on. However, affinities can be specified 655 - through the cpuset facility (Documentation/cgroup-v1/cpusets.txt). 655 + through the cpuset facility (Documentation/cgroup-v1/cpusets.rst). 656 656 657 657 5.1 SCHED_DEADLINE and cpusets HOWTO 658 658 ------------------------------------
+1 -1
Documentation/scheduler/sched-design-CFS.txt
··· 215 215 216 216 These options need CONFIG_CGROUPS to be defined, and let the administrator 217 217 create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See 218 - Documentation/cgroup-v1/cgroups.txt for more information about this filesystem. 218 + Documentation/cgroup-v1/cgroups.rst for more information about this filesystem. 219 219 220 220 When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each 221 221 group created using the pseudo filesystem. See example steps below to create
+1 -1
Documentation/scheduler/sched-rt-group.txt
··· 133 133 to control the CPU time reserved for each control group. 134 134 135 135 For more information on working with control groups, you should read 136 - Documentation/cgroup-v1/cgroups.txt as well. 136 + Documentation/cgroup-v1/cgroups.rst as well. 137 137 138 138 Group settings are checked against the following limits in order to keep the 139 139 configuration schedulable:
+2 -2
Documentation/vm/numa.rst
··· 67 67 physical memory. NUMA emluation is useful for testing NUMA kernel and 68 68 application features on non-NUMA platforms, and as a sort of memory resource 69 69 management mechanism when used together with cpusets. 70 - [see Documentation/cgroup-v1/cpusets.txt] 70 + [see Documentation/cgroup-v1/cpusets.rst] 71 71 72 72 For each node with memory, Linux constructs an independent memory management 73 73 subsystem, complete with its own free page lists, in-use page lists, usage ··· 114 114 115 115 System administrators can restrict the CPUs and nodes' memories that a non- 116 116 privileged user can specify in the scheduling or NUMA commands and functions 117 - using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.txt] 117 + using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.rst] 118 118 119 119 On architectures that do not hide memoryless nodes, Linux will include only 120 120 zones [nodes] with memory in the zonelists. This means that for a memoryless
+1 -1
Documentation/vm/page_migration.rst
··· 41 41 Larger installations usually partition the system using cpusets into 42 42 sections of nodes. Paul Jackson has equipped cpusets with the ability to 43 43 move pages when a task is moved to another cpuset (See 44 - Documentation/cgroup-v1/cpusets.txt). 44 + Documentation/cgroup-v1/cpusets.rst). 45 45 Cpusets allows the automation of process locality. If a task is moved to 46 46 a new cpuset then also all its pages are moved with it so that the 47 47 performance of the process does not sink dramatically. Also the pages
+1 -1
Documentation/vm/unevictable-lru.rst
··· 98 98 -------------------------------- 99 99 100 100 The unevictable LRU facility interacts with the memory control group [aka 101 - memory controller; see Documentation/cgroup-v1/memory.txt] by extending the 101 + memory controller; see Documentation/cgroup-v1/memory.rst] by extending the 102 102 lru_list enum. 103 103 104 104 The memory controller data structure automatically gets a per-zone unevictable
+2 -2
Documentation/x86/x86_64/fake-numa-for-cpusets.rst
··· 15 15 amount of system memory that are available to a certain class of tasks. 16 16 17 17 For more information on the features of cpusets, see 18 - Documentation/cgroup-v1/cpusets.txt. 18 + Documentation/cgroup-v1/cpusets.rst. 19 19 There are a number of different configurations you can use for your needs. For 20 20 more information on the numa=fake command line option and its various ways of 21 21 configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. ··· 40 40 On node 3 totalpages: 131072 41 41 42 42 Now following the instructions for mounting the cpusets filesystem from 43 - Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory 43 + Documentation/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory 44 44 address spaces) to individual cpusets:: 45 45 46 46 [root@xroads /]# mkdir exampleset
+1 -1
MAINTAINERS
··· 4122 4122 W: http://oss.sgi.com/projects/cpusets/ 4123 4123 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 4124 4124 S: Maintained 4125 - F: Documentation/cgroup-v1/cpusets.txt 4125 + F: Documentation/cgroup-v1/cpusets.rst 4126 4126 F: include/linux/cpuset.h 4127 4127 F: kernel/cgroup/cpuset.c 4128 4128
+1 -1
block/Kconfig
··· 89 89 one needs to mount and use blkio cgroup controller for creating 90 90 cgroups and specifying per device IO rate policies. 91 91 92 - See Documentation/cgroup-v1/blkio-controller.txt for more information. 92 + See Documentation/cgroup-v1/blkio-controller.rst for more information. 93 93 94 94 config BLK_DEV_THROTTLING_LOW 95 95 bool "Block throttling .low limit interface support (EXPERIMENTAL)"
+1 -1
include/linux/cgroup-defs.h
··· 624 624 625 625 /* 626 626 * Control Group subsystem type. 627 - * See Documentation/cgroup-v1/cgroups.txt for details 627 + * See Documentation/cgroup-v1/cgroups.rst for details 628 628 */ 629 629 struct cgroup_subsys { 630 630 struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);
+2
include/linux/cgroup.h
··· 131 131 int cgroup_init_early(void); 132 132 int cgroup_init(void); 133 133 134 + int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v); 135 + 134 136 /* 135 137 * Iteration helpers and macros. 136 138 */
+1 -1
include/uapi/linux/bpf.h
··· 785 785 * based on a user-provided identifier for all traffic coming from 786 786 * the tasks belonging to the related cgroup. See also the related 787 787 * kernel documentation, available from the Linux sources in file 788 - * *Documentation/cgroup-v1/net_cls.txt*. 788 + * *Documentation/cgroup-v1/net_cls.rst*. 789 789 * 790 790 * The Linux kernel has two versions for cgroups: there are 791 791 * cgroups v1 and cgroups v2. Both are available to users, who can
+1 -1
init/Kconfig
··· 850 850 CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set 851 851 CONFIG_BLK_DEV_THROTTLING=y. 852 852 853 - See Documentation/cgroup-v1/blkio-controller.txt for more information. 853 + See Documentation/cgroup-v1/blkio-controller.rst for more information. 854 854 855 855 config DEBUG_BLK_CGROUP 856 856 bool "IO controller debugging"
+43
kernel/cgroup/cgroup.c
··· 6240 6240 } 6241 6241 EXPORT_SYMBOL_GPL(cgroup_get_from_fd); 6242 6242 6243 + static u64 power_of_ten(int power) 6244 + { 6245 + u64 v = 1; 6246 + while (power--) 6247 + v *= 10; 6248 + return v; 6249 + } 6250 + 6251 + /** 6252 + * cgroup_parse_float - parse a floating number 6253 + * @input: input string 6254 + * @dec_shift: number of decimal digits to shift 6255 + * @v: output 6256 + * 6257 + * Parse a decimal floating point number in @input and store the result in 6258 + * @v with decimal point right shifted @dec_shift times. For example, if 6259 + * @input is "12.3456" and @dec_shift is 3, *@v will be set to 12345. 6260 + * Returns 0 on success, -errno otherwise. 6261 + * 6262 + * There's nothing cgroup specific about this function except that it's 6263 + * currently the only user. 6264 + */ 6265 + int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v) 6266 + { 6267 + s64 whole, frac = 0; 6268 + int fstart = 0, fend = 0, flen; 6269 + 6270 + if (!sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend)) 6271 + return -EINVAL; 6272 + if (frac < 0) 6273 + return -EINVAL; 6274 + 6275 + flen = fend > fstart ? fend - fstart : 0; 6276 + if (flen < dec_shift) 6277 + frac *= power_of_ten(dec_shift - flen); 6278 + else 6279 + frac = DIV_ROUND_CLOSEST_ULL(frac, power_of_ten(flen - dec_shift)); 6280 + 6281 + *v = whole * power_of_ten(dec_shift) + frac; 6282 + return 0; 6283 + } 6284 + 6243 6285 /* 6244 6286 * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data 6245 6287 * definition in cgroup-defs.h. ··· 6444 6402 return sysfs_create_group(kernel_kobj, &cgroup_sysfs_attr_group); 6445 6403 } 6446 6404 subsys_initcall(cgroup_sysfs_init); 6405 + 6447 6406 #endif /* CONFIG_SYSFS */
+1 -1
kernel/cgroup/cpuset.c
··· 729 729 * load balancing domains (sched domains) as specified by that partial 730 730 * partition. 731 731 * 732 - * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.txt 732 + * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.rst 733 733 * for a background explanation of this. 734 734 * 735 735 * Does not return errors, on the theory that the callers of this
+1 -1
security/device_cgroup.c
··· 509 509 * This is one of the three key functions for hierarchy implementation. 510 510 * This function is responsible for re-evaluating all the cgroup's active 511 511 * exceptions due to a parent's exception change. 512 - * Refer to Documentation/cgroup-v1/devices.txt for more details. 512 + * Refer to Documentation/cgroup-v1/devices.rst for more details. 513 513 */ 514 514 static void revalidate_active_exceptions(struct dev_cgroup *devcg) 515 515 {
+1 -1
tools/include/uapi/linux/bpf.h
··· 785 785 * based on a user-provided identifier for all traffic coming from 786 786 * the tasks belonging to the related cgroup. See also the related 787 787 * kernel documentation, available from the Linux sources in file 788 - * *Documentation/cgroup-v1/net_cls.txt*. 788 + * *Documentation/cgroup-v1/net_cls.rst*. 789 789 * 790 790 * The Linux kernel has two versions for cgroups: there are 791 791 * cgroups v1 and cgroups v2. Both are available to users, who can