Linux kernel mirror (for testing): git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cgroup: fix race between fork and cgroup.kill

Tejun reported the following race between fork() and cgroup.kill at [1].

Tejun:
I was looking at cgroup.kill implementation and wondering whether there
could be a race window. So, __cgroup_kill() does the following:

k1. Set CGRP_KILL.
k2. Iterate tasks and deliver SIGKILL.
k3. Clear CGRP_KILL.

The copy_process() does the following:

c1. Copy a bunch of stuff.
c2. Grab siglock.
c3. Check fatal_signal_pending().
c4. Commit to forking.
c5. Release siglock.
c6. Call cgroup_post_fork() which puts the task on the css_set and tests
CGRP_KILL.

The intention seems to be that either a forking task gets SIGKILL and
terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I
don't see what guarantees that k3 can't happen before c6. ie. After a
forking task passes c5, k2 can take place and then before the forking task
reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child.
What am I missing?

This is indeed a race. One way to fix it is to take cgroup_threadgroup_rwsem
in write mode in __cgroup_kill(), since the fork() side holds
cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to
cgroup_post_fork(). However, that would be heavy-handed, as it adds one more
potential stall scenario for cgroup.kill, which is usually invoked in extreme
situations such as memory pressure.

To fix this race, let's maintain a per-cgroup sequence number which gets
incremented on each __cgroup_kill() call. On the fork() side,
cgroup_can_fork() caches the sequence number locally and rechecks it against
the cgroup's sequence number at the cgroup_post_fork() site. If the sequence
numbers mismatch, __cgroup_kill() has been called in between and we should
send SIGKILL to the newly created task.

Reported-by: Tejun Heo <tj@kernel.org>
Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1]
Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Authored by Shakeel Butt, committed by Tejun Heo (b69bb476 a86bf228)

+16 -11
include/linux/cgroup-defs.h (+3 -3)
@@ -71,9 +71,6 @@

 	/* Cgroup is frozen. */
 	CGRP_FROZEN,
-
-	/* Control group has to be killed. */
-	CGRP_KILL,
 };

 /* cgroup_root->flags */
@@ -457,6 +460,9 @@
 	int nr_populated_threaded_children;

 	int nr_threaded_children;	/* # of live threaded child cgroups */
+
+	/* sequence number for cgroup.kill, serialized by css_set_lock. */
+	unsigned int kill_seq;

 	struct kernfs_node *kn;		/* cgroup kernfs entry */
 	struct cgroup_file procs_file;	/* handle for "cgroup.procs" */
include/linux/sched/task.h (+1)
@@ -43,6 +43,7 @@
 	void *fn_arg;
 	struct cgroup *cgrp;
 	struct css_set *cset;
+	unsigned int kill_seq;
 };

 /*
kernel/cgroup/cgroup.c (+12 -8)
@@ -4013,7 +4013,7 @@
 	lockdep_assert_held(&cgroup_mutex);

 	spin_lock_irq(&css_set_lock);
-	set_bit(CGRP_KILL, &cgrp->flags);
+	cgrp->kill_seq++;
 	spin_unlock_irq(&css_set_lock);

 	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
@@ -4029,10 +4029,6 @@
 		send_sig(SIGKILL, task, 0);
 	}
 	css_task_iter_end(&it);
-
-	spin_lock_irq(&css_set_lock);
-	clear_bit(CGRP_KILL, &cgrp->flags);
-	spin_unlock_irq(&css_set_lock);
 }

 static void cgroup_kill(struct cgroup *cgrp)
@@ -6484,6 +6488,10 @@
 	spin_lock_irq(&css_set_lock);
 	cset = task_css_set(current);
 	get_css_set(cset);
+	if (kargs->cgrp)
+		kargs->kill_seq = kargs->cgrp->kill_seq;
+	else
+		kargs->kill_seq = cset->dfl_cgrp->kill_seq;
 	spin_unlock_irq(&css_set_lock);

 	if (!(kargs->flags & CLONE_INTO_CGROUP)) {
@@ -6668,6 +6668,7 @@
 			   struct kernel_clone_args *kargs)
 	__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
 {
+	unsigned int cgrp_kill_seq = 0;
 	unsigned long cgrp_flags = 0;
 	bool kill = false;
 	struct cgroup_subsys *ss;
@@ -6682,10 +6681,13 @@

 	/* init tasks are special, only link regular threads */
 	if (likely(child->pid)) {
-		if (kargs->cgrp)
+		if (kargs->cgrp) {
 			cgrp_flags = kargs->cgrp->flags;
-		else
+			cgrp_kill_seq = kargs->cgrp->kill_seq;
+		} else {
 			cgrp_flags = cset->dfl_cgrp->flags;
+			cgrp_kill_seq = cset->dfl_cgrp->kill_seq;
+		}

 		WARN_ON_ONCE(!list_empty(&child->cg_list));
 		cset->nr_tasks++;
@@ -6723,7 +6719,7 @@
 		 * child down right after we finished preparing it for
 		 * userspace.
 		 */
-		kill = test_bit(CGRP_KILL, &cgrp_flags);
+		kill = kargs->kill_seq != cgrp_kill_seq;
 	}

 	spin_unlock_irq(&css_set_lock);