cgroup: use a dedicated workqueue for cgroup destruction

Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), the cgroup destruction path has made use of
workqueues. css freeing has been performed from a work item since that
point, and a later commit, ea15f8ccdb430 ("cgroup: split cgroup
destruction into two steps"), moved css offlining to a workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on system_wq; unfortunately, some
controllers may block in the destruction path for a considerable
duration while holding cgroup_mutex. As a large part of the
destruction path is synchronized through cgroup_mutex, when combined
with a high rate of cgroup removals, this has the potential to fill up
system_wq's max_active of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such an operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full, and the other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
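
The slot accounting behind this deadlock can be illustrated with a small
userspace toy model. This is only a sketch under simplifying assumptions,
not kernel code: struct toy_wq, toy_wq_start() and
toy_destroy_makes_progress() are hypothetical names invented for the
illustration, and the model tracks nothing but how many work items occupy
a queue's max_active slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a workqueue: only max_active accounting. */
struct toy_wq {
	int max_active;	/* concurrency limit of the queue */
	int active;	/* work items currently occupying a slot */
};

/* Try to start one work item; fails when the queue is saturated. */
static bool toy_wq_start(struct toy_wq *wq)
{
	assert(wq->max_active > 0);
	if (wq->active >= wq->max_active)
		return false;
	wq->active++;
	return true;
}

/*
 * Start as many of nr_destroy destruction items on destroy_wq as fit.
 * Assumption of the model: one running item holds the mutex and needs a
 * slot on nested_wq (standing in for work_on_cpu()), while every other
 * running item keeps its slot and waits for the mutex.
 *
 * Returns true if the nested item can start, i.e. the mutex holder can
 * eventually finish; false models the deadlock.
 */
static bool toy_destroy_makes_progress(struct toy_wq *destroy_wq,
				       struct toy_wq *nested_wq,
				       int nr_destroy)
{
	for (int i = 0; i < nr_destroy; i++) {
		if (!toy_wq_start(destroy_wq))
			break;	/* the remaining items stay queued */
	}
	return toy_wq_start(nested_wq);
}
```

Passing the same queue with max_active of 256 as both destroy_wq and
nested_wq, with 300 destruction items, makes the function return false
because no slot is left for the nested item. Passing a separate queue
with max_active of 1 for destruction leaves the other queue's slots free,
so the nested item can start.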

This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue,
cgroup_destroy_wq, for this purpose. As these work items shouldn't
have inter-dependencies and are mostly serialized by cgroup_mutex
anyway, a high concurrency level doesn't buy anything, so the
workqueue's @max_active is set to 1 so that destruction work items are
executed one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+

Tejun Heo e5fca243 6ce4eac1

+27 -3
kernel/cgroup.c
···
 static DEFINE_MUTEX(cgroup_root_mutex);
 
 /*
+ * cgroup destruction makes heavy use of work items and there can be a lot
+ * of concurrent destructions. Use a separate workqueue so that cgroup
+ * destruction work items don't end up filling up max_active of system_wq
+ * which may lead to deadlock.
+ */
+static struct workqueue_struct *cgroup_destroy_wq;
+
+/*
  * Generate an array of cgroup subsystem pointers. At boot time, this is
  * populated with the built in subsystems, and modular subsystems are
  * registered after that. The mutable section of this array is protected by
···
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
 	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
-	schedule_work(&cgrp->destroy_work);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
···
 	 * css_put(). dput() requires process context which we don't have.
 	 */
 	INIT_WORK(&css->destroy_work, css_free_work_fn);
-	schedule_work(&css->destroy_work);
+	queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
 
 static void css_release(struct percpu_ref *ref)
···
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	INIT_WORK(&css->destroy_work, css_killed_work_fn);
-	schedule_work(&css->destroy_work);
+	queue_work(cgroup_destroy_wq, &css->destroy_work);
 }
 
 /**
···
 
 	return err;
 }
+
+static int __init cgroup_wq_init(void)
+{
+	/*
+	 * There isn't much point in executing destruction path in
+	 * parallel. Good chunk is serialized with cgroup_mutex anyway.
+	 * Use 1 for @max_active.
+	 *
+	 * We would prefer to do this in cgroup_init() above, but that
+	 * is called before init_workqueues(): so leave this until after.
+	 */
+	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
+	BUG_ON(!cgroup_destroy_wq);
+	return 0;
+}
+core_initcall(cgroup_wq_init);
 
 /*
  * proc_cgroup_show()