Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cgroup: implement dynamic subtree controller enable/disable on the default hierarchy

cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - i.e. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.

This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.

root - A - B - C
            \ D

root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.

The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraint imposed by the cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.

When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.

* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.

* A white-space separated list of controller names, each prefixed with
  either '+' or '-', can be written to "cgroup.subtree_control". The
  controllers prefixed with '+' are enabled and the ones prefixed with
  '-' disabled.

* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.

* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.

* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.

* "cgroup.controllers" in all non-root cgroups is a read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".

This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.

v2: Non-root cgroups now also have "cgroup.controllers".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

Tejun Heo f8f22e53 f817de98

+370 -2
+5
include/linux/cgroup.h
··· 21 21 #include <linux/percpu-refcount.h> 22 22 #include <linux/seq_file.h> 23 23 #include <linux/kernfs.h> 24 + #include <linux/wait.h> 24 25 25 26 #ifdef CONFIG_CGROUPS 26 27 ··· 165 164 166 165 struct cgroup *parent; /* my parent */ 167 166 struct kernfs_node *kn; /* cgroup kernfs entry */ 167 + struct kernfs_node *control_kn; /* kn for "cgroup.subtree_control" */ 168 168 169 169 /* 170 170 * Monotonically increasing unique serial number which defines a ··· 218 216 /* For css percpu_ref killing and RCU-protected deletion */ 219 217 struct rcu_head rcu_head; 220 218 struct work_struct destroy_work; 219 + 220 + /* used to wait for offlining of csses */ 221 + wait_queue_head_t offline_waitq; 221 222 }; 222 223 223 224 #define MAX_CGROUP_ROOT_NAMELEN 64
+365 -2
kernel/cgroup.c
··· 182 182 unsigned long ss_mask); 183 183 static void cgroup_destroy_css_killed(struct cgroup *cgrp); 184 184 static int cgroup_destroy_locked(struct cgroup *cgrp); 185 + static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss); 186 + static void kill_css(struct cgroup_subsys_state *css); 185 187 static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[], 186 188 bool is_add); 187 189 static void cgroup_pidlist_destroy_all(struct cgroup *cgrp); ··· 339 337 /* iterate across the hierarchies */ 340 338 #define for_each_root(root) \ 341 339 list_for_each_entry((root), &cgroup_roots, root_list) 340 + 341 + /* iterate over child cgrps, lock should be held throughout iteration */ 342 + #define cgroup_for_each_live_child(child, cgrp) \ 343 + list_for_each_entry((child), &(cgrp)->children, sibling) \ 344 + if (({ lockdep_assert_held(&cgroup_tree_mutex); \ 345 + cgroup_is_dead(child); })) \ 346 + ; \ 347 + else 342 348 343 349 /** 344 350 * cgroup_lock_live_group - take cgroup_mutex and check that cgrp is alive. ··· 1460 1450 1461 1451 for_each_subsys(ss, ssid) 1462 1452 INIT_LIST_HEAD(&cgrp->e_csets[ssid]); 1453 + 1454 + init_waitqueue_head(&cgrp->offline_waitq); 1463 1455 } 1464 1456 1465 1457 static void init_cgroup_root(struct cgroup_root *root, ··· 1950 1938 1951 1939 lockdep_assert_held(&cgroup_mutex); 1952 1940 1941 + /* 1942 + * Except for the root, child_subsys_mask must be zero for a cgroup 1943 + * with tasks so that child cgroups don't compete against tasks. 
1944 + */ 1945 + if (dst_cgrp && cgroup_on_dfl(dst_cgrp) && dst_cgrp->parent && 1946 + dst_cgrp->child_subsys_mask) 1947 + return -EBUSY; 1948 + 1953 1949 /* look up the dst cset for each src cset and link it to src */ 1954 1950 list_for_each_entry_safe(src_cset, tmp_cset, preloaded_csets, mg_preload_node) { 1955 1951 struct css_set *dst_cset; ··· 2323 2303 return 0; 2324 2304 } 2325 2305 2306 + static void cgroup_print_ss_mask(struct seq_file *seq, unsigned int ss_mask) 2307 + { 2308 + struct cgroup_subsys *ss; 2309 + bool printed = false; 2310 + int ssid; 2311 + 2312 + for_each_subsys(ss, ssid) { 2313 + if (ss_mask & (1 << ssid)) { 2314 + if (printed) 2315 + seq_putc(seq, ' '); 2316 + seq_printf(seq, "%s", ss->name); 2317 + printed = true; 2318 + } 2319 + } 2320 + if (printed) 2321 + seq_putc(seq, '\n'); 2322 + } 2323 + 2324 + /* show controllers which are currently attached to the default hierarchy */ 2325 + static int cgroup_root_controllers_show(struct seq_file *seq, void *v) 2326 + { 2327 + struct cgroup *cgrp = seq_css(seq)->cgroup; 2328 + 2329 + cgroup_print_ss_mask(seq, cgrp->root->subsys_mask); 2330 + return 0; 2331 + } 2332 + 2333 + /* show controllers which are enabled from the parent */ 2334 + static int cgroup_controllers_show(struct seq_file *seq, void *v) 2335 + { 2336 + struct cgroup *cgrp = seq_css(seq)->cgroup; 2337 + 2338 + cgroup_print_ss_mask(seq, cgrp->parent->child_subsys_mask); 2339 + return 0; 2340 + } 2341 + 2342 + /* show controllers which are enabled for a given cgroup's children */ 2343 + static int cgroup_subtree_control_show(struct seq_file *seq, void *v) 2344 + { 2345 + struct cgroup *cgrp = seq_css(seq)->cgroup; 2346 + 2347 + cgroup_print_ss_mask(seq, cgrp->child_subsys_mask); 2348 + return 0; 2349 + } 2350 + 2351 + /** 2352 + * cgroup_update_dfl_csses - update css assoc of a subtree in default hierarchy 2353 + * @cgrp: root of the subtree to update csses for 2354 + * 2355 + * @cgrp's child_subsys_mask has changed and its subtree's 
(self excluded) 2356 + * css associations need to be updated accordingly. This function looks up 2357 + * all css_sets which are attached to the subtree, creates the matching 2358 + * updated css_sets and migrates the tasks to the new ones. 2359 + */ 2360 + static int cgroup_update_dfl_csses(struct cgroup *cgrp) 2361 + { 2362 + LIST_HEAD(preloaded_csets); 2363 + struct cgroup_subsys_state *css; 2364 + struct css_set *src_cset; 2365 + int ret; 2366 + 2367 + lockdep_assert_held(&cgroup_tree_mutex); 2368 + lockdep_assert_held(&cgroup_mutex); 2369 + 2370 + /* look up all csses currently attached to @cgrp's subtree */ 2371 + down_read(&css_set_rwsem); 2372 + css_for_each_descendant_pre(css, cgroup_css(cgrp, NULL)) { 2373 + struct cgrp_cset_link *link; 2374 + 2375 + /* self is not affected by child_subsys_mask change */ 2376 + if (css->cgroup == cgrp) 2377 + continue; 2378 + 2379 + list_for_each_entry(link, &css->cgroup->cset_links, cset_link) 2380 + cgroup_migrate_add_src(link->cset, cgrp, 2381 + &preloaded_csets); 2382 + } 2383 + up_read(&css_set_rwsem); 2384 + 2385 + /* NULL dst indicates self on default hierarchy */ 2386 + ret = cgroup_migrate_prepare_dst(NULL, &preloaded_csets); 2387 + if (ret) 2388 + goto out_finish; 2389 + 2390 + list_for_each_entry(src_cset, &preloaded_csets, mg_preload_node) { 2391 + struct task_struct *last_task = NULL, *task; 2392 + 2393 + /* src_csets precede dst_csets, break on the first dst_cset */ 2394 + if (!src_cset->mg_src_cgrp) 2395 + break; 2396 + 2397 + /* 2398 + * All tasks in src_cset need to be migrated to the 2399 + * matching dst_cset. Empty it process by process. We 2400 + * walk tasks but migrate processes. The leader might even 2401 + * belong to a different cset but such src_cset would also 2402 + * be among the target src_csets because the default 2403 + * hierarchy enforces per-process membership. 
2404 + */ 2405 + while (true) { 2406 + down_read(&css_set_rwsem); 2407 + task = list_first_entry_or_null(&src_cset->tasks, 2408 + struct task_struct, cg_list); 2409 + if (task) { 2410 + task = task->group_leader; 2411 + WARN_ON_ONCE(!task_css_set(task)->mg_src_cgrp); 2412 + get_task_struct(task); 2413 + } 2414 + up_read(&css_set_rwsem); 2415 + 2416 + if (!task) 2417 + break; 2418 + 2419 + /* guard against possible infinite loop */ 2420 + if (WARN(last_task == task, 2421 + "cgroup: update_dfl_csses failed to make progress, aborting in inconsistent state\n")) 2422 + goto out_finish; 2423 + last_task = task; 2424 + 2425 + threadgroup_lock(task); 2426 + /* raced against de_thread() from another thread? */ 2427 + if (!thread_group_leader(task)) { 2428 + threadgroup_unlock(task); 2429 + put_task_struct(task); 2430 + continue; 2431 + } 2432 + 2433 + ret = cgroup_migrate(src_cset->dfl_cgrp, task, true); 2434 + 2435 + threadgroup_unlock(task); 2436 + put_task_struct(task); 2437 + 2438 + if (WARN(ret, "cgroup: failed to update controllers for the default hierarchy (%d), further operations may crash or hang\n", ret)) 2439 + goto out_finish; 2440 + } 2441 + } 2442 + 2443 + out_finish: 2444 + cgroup_migrate_finish(&preloaded_csets); 2445 + return ret; 2446 + } 2447 + 2448 + /* change the enabled child controllers for a cgroup in the default hierarchy */ 2449 + static int cgroup_subtree_control_write(struct cgroup_subsys_state *dummy_css, 2450 + struct cftype *cft, char *buffer) 2451 + { 2452 + unsigned long enable_req = 0, disable_req = 0, enable, disable; 2453 + struct cgroup *cgrp = dummy_css->cgroup, *child; 2454 + struct cgroup_subsys *ss; 2455 + char *tok, *p; 2456 + int ssid, ret; 2457 + 2458 + /* 2459 + * Parse input - white space separated list of subsystem names 2460 + * prefixed with either + or -. 
2461 + */ 2462 + p = buffer; 2463 + while ((tok = strsep(&p, " \t\n"))) { 2464 + for_each_subsys(ss, ssid) { 2465 + if (ss->disabled || strcmp(tok + 1, ss->name)) 2466 + continue; 2467 + 2468 + if (*tok == '+') { 2469 + enable_req |= 1 << ssid; 2470 + disable_req &= ~(1 << ssid); 2471 + } else if (*tok == '-') { 2472 + disable_req |= 1 << ssid; 2473 + enable_req &= ~(1 << ssid); 2474 + } else { 2475 + return -EINVAL; 2476 + } 2477 + break; 2478 + } 2479 + if (ssid == CGROUP_SUBSYS_COUNT) 2480 + return -EINVAL; 2481 + } 2482 + 2483 + /* 2484 + * We're gonna grab cgroup_tree_mutex which nests outside kernfs 2485 + * active_ref. cgroup_lock_live_group() already provides enough 2486 + * protection. Ensure @cgrp stays accessible and break the 2487 + * active_ref protection. 2488 + */ 2489 + cgroup_get(cgrp); 2490 + kernfs_break_active_protection(cgrp->control_kn); 2491 + retry: 2492 + enable = enable_req; 2493 + disable = disable_req; 2494 + 2495 + mutex_lock(&cgroup_tree_mutex); 2496 + 2497 + for_each_subsys(ss, ssid) { 2498 + if (enable & (1 << ssid)) { 2499 + if (cgrp->child_subsys_mask & (1 << ssid)) { 2500 + enable &= ~(1 << ssid); 2501 + continue; 2502 + } 2503 + 2504 + /* 2505 + * Because css offlining is asynchronous, userland 2506 + * might try to re-enable the same controller while 2507 + * the previous instance is still around. In such 2508 + * cases, wait till it's gone using offline_waitq. 2509 + */ 2510 + cgroup_for_each_live_child(child, cgrp) { 2511 + wait_queue_t wait; 2512 + 2513 + if (!cgroup_css(child, ss)) 2514 + continue; 2515 + 2516 + prepare_to_wait(&child->offline_waitq, &wait, 2517 + TASK_UNINTERRUPTIBLE); 2518 + mutex_unlock(&cgroup_tree_mutex); 2519 + schedule(); 2520 + finish_wait(&child->offline_waitq, &wait); 2521 + goto retry; 2522 + } 2523 + 2524 + /* unavailable or not enabled on the parent? 
*/ 2525 + if (!(cgrp_dfl_root.subsys_mask & (1 << ssid)) || 2526 + (cgrp->parent && 2527 + !(cgrp->parent->child_subsys_mask & (1 << ssid)))) { 2528 + ret = -ENOENT; 2529 + goto out_unlock_tree; 2530 + } 2531 + } else if (disable & (1 << ssid)) { 2532 + if (!(cgrp->child_subsys_mask & (1 << ssid))) { 2533 + disable &= ~(1 << ssid); 2534 + continue; 2535 + } 2536 + 2537 + /* a child has it enabled? */ 2538 + cgroup_for_each_live_child(child, cgrp) { 2539 + if (child->child_subsys_mask & (1 << ssid)) { 2540 + ret = -EBUSY; 2541 + goto out_unlock_tree; 2542 + } 2543 + } 2544 + } 2545 + } 2546 + 2547 + if (!enable && !disable) { 2548 + ret = 0; 2549 + goto out_unlock_tree; 2550 + } 2551 + 2552 + if (!cgroup_lock_live_group(cgrp)) { 2553 + ret = -ENODEV; 2554 + goto out_unlock_tree; 2555 + } 2556 + 2557 + /* 2558 + * Except for the root, child_subsys_mask must be zero for a cgroup 2559 + * with tasks so that child cgroups don't compete against tasks. 2560 + */ 2561 + if (enable && cgrp->parent && !list_empty(&cgrp->cset_links)) { 2562 + ret = -EBUSY; 2563 + goto out_unlock; 2564 + } 2565 + 2566 + /* 2567 + * Create csses for enables and update child_subsys_mask. This 2568 + * changes cgroup_e_css() results which in turn makes the 2569 + * subsequent cgroup_update_dfl_csses() associate all tasks in the 2570 + * subtree to the updated csses. 
2571 + */ 2572 + for_each_subsys(ss, ssid) { 2573 + if (!(enable & (1 << ssid))) 2574 + continue; 2575 + 2576 + cgroup_for_each_live_child(child, cgrp) { 2577 + ret = create_css(child, ss); 2578 + if (ret) 2579 + goto err_undo_css; 2580 + } 2581 + } 2582 + 2583 + cgrp->child_subsys_mask |= enable; 2584 + cgrp->child_subsys_mask &= ~disable; 2585 + 2586 + ret = cgroup_update_dfl_csses(cgrp); 2587 + if (ret) 2588 + goto err_undo_css; 2589 + 2590 + /* all tasks are now migrated away from the old csses, kill them */ 2591 + for_each_subsys(ss, ssid) { 2592 + if (!(disable & (1 << ssid))) 2593 + continue; 2594 + 2595 + cgroup_for_each_live_child(child, cgrp) 2596 + kill_css(cgroup_css(child, ss)); 2597 + } 2598 + 2599 + kernfs_activate(cgrp->kn); 2600 + ret = 0; 2601 + out_unlock: 2602 + mutex_unlock(&cgroup_mutex); 2603 + out_unlock_tree: 2604 + mutex_unlock(&cgroup_tree_mutex); 2605 + kernfs_unbreak_active_protection(cgrp->control_kn); 2606 + cgroup_put(cgrp); 2607 + return ret; 2608 + 2609 + err_undo_css: 2610 + cgrp->child_subsys_mask &= ~enable; 2611 + cgrp->child_subsys_mask |= disable; 2612 + 2613 + for_each_subsys(ss, ssid) { 2614 + if (!(enable & (1 << ssid))) 2615 + continue; 2616 + 2617 + cgroup_for_each_live_child(child, cgrp) { 2618 + struct cgroup_subsys_state *css = cgroup_css(child, ss); 2619 + if (css) 2620 + kill_css(css); 2621 + } 2622 + } 2623 + goto out_unlock; 2624 + } 2625 + 2326 2626 static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, 2327 2627 size_t nbytes, loff_t off) 2328 2628 { ··· 2802 2462 return PTR_ERR(kn); 2803 2463 2804 2464 ret = cgroup_kn_set_ugid(kn); 2805 - if (ret) 2465 + if (ret) { 2806 2466 kernfs_remove(kn); 2807 - return ret; 2467 + return ret; 2468 + } 2469 + 2470 + if (cft->seq_show == cgroup_subtree_control_show) 2471 + cgrp->control_kn = kn; 2472 + return 0; 2808 2473 } 2809 2474 2810 2475 /** ··· 3902 3557 .flags = CFTYPE_ONLY_ON_ROOT, 3903 3558 .seq_show = cgroup_sane_behavior_show, 3904 3559 }, 3560 
+ { 3561 + .name = "cgroup.controllers", 3562 + .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_ONLY_ON_ROOT, 3563 + .seq_show = cgroup_root_controllers_show, 3564 + }, 3565 + { 3566 + .name = "cgroup.controllers", 3567 + .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT, 3568 + .seq_show = cgroup_controllers_show, 3569 + }, 3570 + { 3571 + .name = "cgroup.subtree_control", 3572 + .flags = CFTYPE_ONLY_ON_DFL, 3573 + .seq_show = cgroup_subtree_control_show, 3574 + .write_string = cgroup_subtree_control_write, 3575 + }, 3905 3576 3906 3577 /* 3907 3578 * Historical crazy stuff. These don't have "cgroup." prefix and ··· 4086 3725 css->flags &= ~CSS_ONLINE; 4087 3726 css->cgroup->nr_css--; 4088 3727 RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL); 3728 + 3729 + wake_up_all(&css->cgroup->offline_waitq); 4089 3730 } 4090 3731 4091 3732 /**