Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

cgroup: replace global percpu_rwsem with per threadgroup rwsem when writing to cgroup.procs

The static usage pattern of creating a cgroup, enabling controllers,
and then seeding it with CLONE_INTO_CGROUP doesn't require write
locking cgroup_threadgroup_rwsem and thus doesn't benefit from this
patch.

To avoid affecting other users, the per threadgroup rwsem is only used
when favordynmods is enabled.

As computer hardware advances, modern systems are typically equipped
with many CPU cores and large amounts of memory, enabling the deployment
of numerous applications. On such systems, container creation and
deletion become frequent operations, making cgroup process migration no
longer a cold path. This leads to noticeable contention with common
process operations such as fork, exec, and exit.

To alleviate the contention between cgroup process migration and
operations like process fork, this patch changes the locking to take the
write lock on signal_struct->group_rwsem when writing a pid to
cgroup.procs/threads instead of holding the global write lock.

Cgroup process migration has historically relied on
signal_struct->group_rwsem to protect thread group integrity. In commit
1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem"), this was changed to a global
cgroup_threadgroup_rwsem. The advantage of using a global lock was
simplified handling of process group migrations. This patch retains the
use of the global lock for protecting process group migration, while
reducing contention by using per thread group lock during
cgroup.procs/threads writes.

The locking behavior is as follows:

write cgroup.procs/threads | process fork,exec,exit | process group migration
------------------------------------------------------------------------------
cgroup_lock() | down_read(&g_rwsem) | cgroup_lock()
down_write(&p_rwsem) | down_read(&p_rwsem) | down_write(&g_rwsem)
critical section | critical section | critical section
up_write(&p_rwsem) | up_read(&p_rwsem) | up_write(&g_rwsem)
cgroup_unlock() | up_read(&g_rwsem) | cgroup_unlock()

g_rwsem denotes cgroup_threadgroup_rwsem, p_rwsem denotes
signal_struct->group_rwsem.

This patch eliminates contention between cgroup migration and fork
operations for threads that belong to different thread groups, thereby
reducing the long-tail latency of cgroup migrations and lowering system
load.

With this patch, under heavy fork and exec interference, the long-tail
latency of cgroup migration has been reduced from milliseconds to
microseconds. Under heavy cgroup migration interference, the multi-CPU
score of the spawn test case in UnixBench increased by 9%.

tj: Update comment in cgroup_favor_dynmods() and switch WARN_ONCE() to
pr_warn_once().

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

authored by

Yi Tao and committed by
Tejun Heo
0568f89d 477abc2e

+93 -22
+16 -1
include/linux/cgroup-defs.h
··· 91 91 * cgroup_threadgroup_rwsem. This makes hot path operations such as 92 92 * forks and exits into the slow path and more expensive. 93 93 * 94 + * Alleviate the contention between fork, exec, exit operations and 95 + * writing to cgroup.procs by taking a per threadgroup rwsem instead of 96 + * the global cgroup_threadgroup_rwsem. Fork and other operations 97 + * from threads in different thread groups no longer contend with 98 + * writing to cgroup.procs. 99 + * 94 100 * The static usage pattern of creating a cgroup, enabling controllers, 95 101 * and then seeding it with CLONE_INTO_CGROUP doesn't require write 96 102 * locking cgroup_threadgroup_rwsem and thus doesn't benefit from ··· 152 146 153 147 /* When pid=0 && threadgroup=false, see comments in cgroup_procs_write_start */ 154 148 CGRP_ATTACH_LOCK_NONE, 149 + 150 + /* When favordynmods is on, see comments above CGRP_ROOT_FAVOR_DYNMODS */ 151 + CGRP_ATTACH_LOCK_PER_THREADGROUP, 155 152 }; 156 153 157 154 /* ··· 855 846 }; 856 847 857 848 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem; 849 + extern bool cgroup_enable_per_threadgroup_rwsem; 858 850 859 851 struct cgroup_of_peak { 860 852 unsigned long value; ··· 867 857 * @tsk: target task 868 858 * 869 859 * Allows cgroup operations to synchronize against threadgroup changes 870 - * using a percpu_rw_semaphore. 860 + * using a global percpu_rw_semaphore and a per threadgroup rw_semaphore when 861 + * favordynmods is on. See the comment above CGRP_ROOT_FAVOR_DYNMODS definition. 
871 862 */ 872 863 static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) 873 864 { 874 865 percpu_down_read(&cgroup_threadgroup_rwsem); 866 + if (cgroup_enable_per_threadgroup_rwsem) 867 + down_read(&tsk->signal->cgroup_threadgroup_rwsem); 875 868 } 876 869 877 870 /** ··· 885 872 */ 886 873 static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) 887 874 { 875 + if (cgroup_enable_per_threadgroup_rwsem) 876 + up_read(&tsk->signal->cgroup_threadgroup_rwsem); 888 877 percpu_up_read(&cgroup_threadgroup_rwsem); 889 878 } 890 879
+4
include/linux/sched/signal.h
··· 226 226 struct tty_audit_buf *tty_audit_buf; 227 227 #endif 228 228 229 + #ifdef CONFIG_CGROUPS 230 + struct rw_semaphore cgroup_threadgroup_rwsem; 231 + #endif 232 + 229 233 /* 230 234 * Thread is the potential origin of an oom condition; kill first on 231 235 * oom
+3
init/init_task.c
··· 27 27 }, 28 28 .multiprocess = HLIST_HEAD_INIT, 29 29 .rlim = INIT_RLIMITS, 30 + #ifdef CONFIG_CGROUPS 31 + .cgroup_threadgroup_rwsem = __RWSEM_INITIALIZER(init_signals.cgroup_threadgroup_rwsem), 32 + #endif 30 33 .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), 31 34 .exec_update_lock = __RWSEM_INITIALIZER(init_signals.exec_update_lock), 32 35 #ifdef CONFIG_POSIX_TIMERS
+4 -2
kernel/cgroup/cgroup-internal.h
··· 249 249 250 250 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, 251 251 bool threadgroup); 252 - void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode); 253 - void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode); 252 + void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode, 253 + struct task_struct *tsk); 254 + void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode, 255 + struct task_struct *tsk); 254 256 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup, 255 257 enum cgroup_attach_lock_mode *lock_mode) 256 258 __acquires(&cgroup_threadgroup_rwsem);
+4 -4
kernel/cgroup/cgroup-v1.c
··· 69 69 int retval = 0; 70 70 71 71 cgroup_lock(); 72 - cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL); 72 + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); 73 73 for_each_root(root) { 74 74 struct cgroup *from_cgrp; 75 75 ··· 81 81 if (retval) 82 82 break; 83 83 } 84 - cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL); 84 + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); 85 85 cgroup_unlock(); 86 86 87 87 return retval; ··· 118 118 119 119 cgroup_lock(); 120 120 121 - cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL); 121 + cgroup_attach_lock(CGRP_ATTACH_LOCK_GLOBAL, NULL); 122 122 123 123 /* all tasks in @from are being moved, all csets are source */ 124 124 spin_lock_irq(&css_set_lock); ··· 154 154 } while (task && !ret); 155 155 out_err: 156 156 cgroup_migrate_finish(&mgctx); 157 - cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL); 157 + cgroup_attach_unlock(CGRP_ATTACH_LOCK_GLOBAL, NULL); 158 158 cgroup_unlock(); 159 159 return ret; 160 160 }
+58 -15
kernel/cgroup/cgroup.c
··· 239 239 240 240 static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS); 241 241 242 + /* 243 + * Write protected by cgroup_mutex and write-lock of cgroup_threadgroup_rwsem, 244 + * read protected by either. 245 + * 246 + * Can only be turned on, but not turned off. 247 + */ 248 + bool cgroup_enable_per_threadgroup_rwsem __read_mostly; 249 + 242 250 /* cgroup namespace for init task */ 243 251 struct cgroup_namespace init_cgroup_ns = { 244 252 .ns.count = REFCOUNT_INIT(2), ··· 1333 1325 { 1334 1326 bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS; 1335 1327 1336 - /* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */ 1328 + /* 1329 + * see the comment above CGRP_ROOT_FAVOR_DYNMODS definition. 1330 + * favordynmods can flip while task is between 1331 + * cgroup_threadgroup_change_begin() and end(), so down_write global 1332 + * cgroup_threadgroup_rwsem to synchronize them. 1333 + * 1334 + * Once cgroup_enable_per_threadgroup_rwsem is enabled, holding 1335 + * cgroup_threadgroup_rwsem doesn't exlude tasks between 1336 + * cgroup_thread_group_change_begin() and end() and thus it's unsafe to 1337 + * turn off. As the scenario is unlikely, simply disallow disabling once 1338 + * enabled and print out a warning. 
1339 + */ 1340 + percpu_down_write(&cgroup_threadgroup_rwsem); 1337 1341 if (favor && !favoring) { 1342 + cgroup_enable_per_threadgroup_rwsem = true; 1338 1343 rcu_sync_enter(&cgroup_threadgroup_rwsem.rss); 1339 1344 root->flags |= CGRP_ROOT_FAVOR_DYNMODS; 1340 1345 } else if (!favor && favoring) { 1346 + if (cgroup_enable_per_threadgroup_rwsem) 1347 + pr_warn_once("cgroup favordynmods: per threadgroup rwsem mechanism can't be disabled\n"); 1341 1348 rcu_sync_exit(&cgroup_threadgroup_rwsem.rss); 1342 1349 root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS; 1343 1350 } 1351 + percpu_up_write(&cgroup_threadgroup_rwsem); 1344 1352 } 1345 1353 1346 1354 static int cgroup_init_root_id(struct cgroup_root *root) ··· 2506 2482 2507 2483 /** 2508 2484 * cgroup_attach_lock - Lock for ->attach() 2509 - * @lock_mode: whether to down_write cgroup_threadgroup_rwsem 2485 + * @lock_mode: whether acquire and acquire which rwsem 2486 + * @tsk: thread group to lock 2510 2487 * 2511 2488 * cgroup migration sometimes needs to stabilize threadgroups against forks and 2512 2489 * exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach() ··· 2527 2502 * Resolve the situation by always acquiring cpus_read_lock() before optionally 2528 2503 * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that 2529 2504 * CPU hotplug is disabled on entry. 2505 + * 2506 + * When favordynmods is enabled, take per threadgroup rwsem to reduce overhead 2507 + * on dynamic cgroup modifications. see the comment above 2508 + * CGRP_ROOT_FAVOR_DYNMODS definition. 2509 + * 2510 + * tsk is not NULL only when writing to cgroup.procs. 
2530 2511 */ 2531 - void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode) 2512 + void cgroup_attach_lock(enum cgroup_attach_lock_mode lock_mode, 2513 + struct task_struct *tsk) 2532 2514 { 2533 2515 cpus_read_lock(); 2534 2516 ··· 2545 2513 case CGRP_ATTACH_LOCK_GLOBAL: 2546 2514 percpu_down_write(&cgroup_threadgroup_rwsem); 2547 2515 break; 2516 + case CGRP_ATTACH_LOCK_PER_THREADGROUP: 2517 + down_write(&tsk->signal->cgroup_threadgroup_rwsem); 2518 + break; 2548 2519 default: 2549 2520 pr_warn("cgroup: Unexpected attach lock mode."); 2550 2521 break; ··· 2556 2521 2557 2522 /** 2558 2523 * cgroup_attach_unlock - Undo cgroup_attach_lock() 2559 - * @lock_mode: whether to up_write cgroup_threadgroup_rwsem 2524 + * @lock_mode: whether release and release which rwsem 2525 + * @tsk: thread group to lock 2560 2526 */ 2561 - void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode) 2527 + void cgroup_attach_unlock(enum cgroup_attach_lock_mode lock_mode, 2528 + struct task_struct *tsk) 2562 2529 { 2563 2530 switch (lock_mode) { 2564 2531 case CGRP_ATTACH_LOCK_NONE: 2565 2532 break; 2566 2533 case CGRP_ATTACH_LOCK_GLOBAL: 2567 2534 percpu_up_write(&cgroup_threadgroup_rwsem); 2535 + break; 2536 + case CGRP_ATTACH_LOCK_PER_THREADGROUP: 2537 + up_write(&tsk->signal->cgroup_threadgroup_rwsem); 2568 2538 break; 2569 2539 default: 2570 2540 pr_warn("cgroup: Unexpected attach lock mode."); ··· 3082 3042 tsk = ERR_PTR(-EINVAL); 3083 3043 goto out_unlock_rcu; 3084 3044 } 3085 - 3086 3045 get_task_struct(tsk); 3087 3046 rcu_read_unlock(); 3088 3047 ··· 3094 3055 */ 3095 3056 lockdep_assert_held(&cgroup_mutex); 3096 3057 3097 - if (pid || threadgroup) 3098 - *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; 3099 - else 3058 + if (pid || threadgroup) { 3059 + if (cgroup_enable_per_threadgroup_rwsem) 3060 + *lock_mode = CGRP_ATTACH_LOCK_PER_THREADGROUP; 3061 + else 3062 + *lock_mode = CGRP_ATTACH_LOCK_GLOBAL; 3063 + } else { 3100 3064 *lock_mode = CGRP_ATTACH_LOCK_NONE; 3065 + } 
3101 3066 3102 - cgroup_attach_lock(*lock_mode); 3067 + cgroup_attach_lock(*lock_mode, tsk); 3103 3068 3104 3069 if (threadgroup) { 3105 3070 if (!thread_group_leader(tsk)) { ··· 3112 3069 * may strip us of our leadership. If this happens, 3113 3070 * throw this task away and try again. 3114 3071 */ 3115 - cgroup_attach_unlock(*lock_mode); 3072 + cgroup_attach_unlock(*lock_mode, tsk); 3116 3073 put_task_struct(tsk); 3117 3074 goto retry_find_task; 3118 3075 } ··· 3128 3085 void cgroup_procs_write_finish(struct task_struct *task, 3129 3086 enum cgroup_attach_lock_mode lock_mode) 3130 3087 { 3088 + cgroup_attach_unlock(lock_mode, task); 3089 + 3131 3090 /* release reference from cgroup_procs_write_start() */ 3132 3091 put_task_struct(task); 3133 - 3134 - cgroup_attach_unlock(lock_mode); 3135 3092 } 3136 3093 3137 3094 static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask) ··· 3221 3178 else 3222 3179 lock_mode = CGRP_ATTACH_LOCK_NONE; 3223 3180 3224 - cgroup_attach_lock(lock_mode); 3181 + cgroup_attach_lock(lock_mode, NULL); 3225 3182 3226 3183 /* NULL dst indicates self on default hierarchy */ 3227 3184 ret = cgroup_migrate_prepare_dst(&mgctx); ··· 3242 3199 ret = cgroup_migrate_execute(&mgctx); 3243 3200 out_finish: 3244 3201 cgroup_migrate_finish(&mgctx); 3245 - cgroup_attach_unlock(lock_mode); 3202 + cgroup_attach_unlock(lock_mode, NULL); 3246 3203 return ret; 3247 3204 } 3248 3205
+4
kernel/fork.c
··· 1688 1688 tty_audit_fork(sig); 1689 1689 sched_autogroup_fork(sig); 1690 1690 1691 + #ifdef CONFIG_CGROUPS 1692 + init_rwsem(&sig->cgroup_threadgroup_rwsem); 1693 + #endif 1694 + 1691 1695 sig->oom_score_adj = current->signal->oom_score_adj; 1692 1696 sig->oom_score_adj_min = current->signal->oom_score_adj_min; 1693 1697