Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cfq-iosched: enable full blkcg hierarchy support

With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support. The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs children of the root cfqg.

Replace it with cfqg_parent() which returns the real parent. This
enables full blkcg hierarchy support for cfq-iosched. For example,
consider the following hierarchy.

           root
         /      \
      A:500    B:250
     /     \
  AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on service tree. For each leaf node, vfraction would be

AA: (500 / 1500) * (500 / 750) =~ 0.2222
AB: (1000 / 1500) * (500 / 750) =~ 0.4444
B: (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly. For more detail,
please refer to Documentation/block/cfq-iosched.txt.
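
The arithmetic above can be reproduced with a small user-space sketch. It is not the kernel code: `struct grp`, `grp_attach()` and `grp_vfraction()` are hypothetical names made up for illustration, and, like the example above, the sketch ignores leaf_weight and assumes that only the leaf nodes have active tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a cfq group; not the kernel struct. */
struct grp {
	struct grp *parent;		/* NULL for the root */
	unsigned int weight;		/* weight at the parent's level */
	unsigned int children_weight;	/* sum of the active children's weights */
};

/* Link @child under @parent and account its weight as if it were active. */
void grp_attach(struct grp *child, struct grp *parent, unsigned int weight)
{
	child->parent = parent;
	child->weight = weight;
	child->children_weight = 0;
	parent->children_weight += weight;
}

/* Multiply the group's share at each level while walking up to the root. */
double grp_vfraction(const struct grp *pos)
{
	double vfr = 1.0;

	for (; pos->parent; pos = pos->parent) {
		assert(pos->parent->children_weight > 0);
		vfr *= (double)pos->weight / pos->parent->children_weight;
	}
	return vfr;
}
```

Attaching A:500 and B:250 under root and AA:500 and AB:1000 under A, then calling grp_vfraction() on AA, AB and B, gives roughly 0.2222, 0.4444 and 0.3333.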

v2: cfq-iosched.txt updated to describe group scheduling as suggested
by Vivek.

v3: blkio-controller.txt updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>

Tejun Heo d02f7aa8 41cad6ab

+88 -26
+58
Documentation/block/cfq-iosched.txt
···
 performace although this can cause the latency of some I/O to increase due
 to more number of requests.
 
+CFQ Group scheduling
+====================
+
+CFQ supports blkio cgroup and has "blkio." prefixed files in each
+blkio cgroup directory. It is weight-based and there are four knobs
+for configuration - weight[_device] and leaf_weight[_device].
+Internal cgroup nodes (the ones with children) can also have tasks in
+them, so the former two configure how much proportion the cgroup as a
+whole is entitled to at its parent's level while the latter two
+configure how much proportion the tasks in the cgroup have compared to
+its direct children.
+
+Another way to think about it is assuming that each internal node has
+an implicit leaf child node which hosts all the tasks whose weight is
+configured by leaf_weight[_device]. Let's assume a blkio hierarchy
+composed of five cgroups - root, A, B, AA and AB - with the following
+weights where the names represent the hierarchy.
+
+        weight leaf_weight
+ root :  125    125
+ A    :  500    750
+ B    :  250    500
+ AA   :  500    500
+ AB   : 1000    500
+
+root never has a parent making its weight is meaningless. For backward
+compatibility, weight is always kept in sync with leaf_weight. B, AA
+and AB have no child and thus its tasks have no children cgroup to
+compete with. They always get 100% of what the cgroup won at the
+parent level. Considering only the weights which matter, the hierarchy
+looks like the following.
+
+          root
+       /    |   \
+      A     B    leaf
+     500   250   125
+   /  |  \
+  AA  AB  leaf
+ 500 1000 750
+
+If all cgroups have active IOs and competing with each other, disk
+time will be distributed like the following.
+
+Distribution below root. The total active weight at this level is
+A:500 + B:250 + C:125 = 875.
+
+ root-leaf :   125 /  875      =~ 14%
+ A         :   500 /  875      =~ 57%
+ B(-leaf)  :   250 /  875      =~ 28%
+
+A has children and further distributes its 57% among the children and
+the implicit leaf node. The total active weight at this level is
+AA:500 + AB:1000 + A-leaf:750 = 2250.
+
+ A-leaf    : ( 750 / 2250) * A =~ 19%
+ AA(-leaf) : ( 500 / 2250) * A =~ 12%
+ AB(-leaf) : (1000 / 2250) * A =~ 25%
+
 CFQ IOPS Mode for group scheduling
 ===================================
 Basic CFQ design is to provide priority based time slices. Higher priority
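
The distribution described in the added cfq-iosched.txt text can be checked with a short sketch. `level_total()` and `level_share()` are hypothetical helpers written for this illustration, not kernel functions. They model each internal node as having an implicit leaf child whose weight is leaf_weight and assume every cgroup is active.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Total active weight at one level: the node's own leaf_weight (its
 * implicit leaf child) plus the weight of every active child cgroup.
 */
unsigned int level_total(unsigned int leaf_weight,
			 const unsigned int *child_weights, size_t nr)
{
	unsigned int total = leaf_weight;
	size_t i;

	for (i = 0; i < nr; i++)
		total += child_weights[i];
	return total;
}

/* Fraction of disk time for one competitor, scaled by its parent's share. */
double level_share(unsigned int weight, unsigned int total, double parent_share)
{
	assert(total > 0 && weight <= total);
	return parent_share * weight / total;
}
```

With the weights from the table, level_total(125, {500, 250}, 2) is 875 and level_total(750, {500, 1000}, 2) is 2250. level_share(500, 875, 1.0) gives about 0.571 for A, and passing that share down gives about 0.190 for A-leaf, in line with the 57% and 19% quoted in the text, which appears to round down.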
+24 -11
Documentation/cgroups/blkio-controller.txt
···
 
 Hierarchical Cgroups
 ====================
-- Currently none of the IO control policy supports hierarchical groups. But
-  cgroup interface does allow creation of hierarchical cgroups and internally
-  IO policies treat them as flat hierarchy.
+- Currently only CFQ supports hierarchical groups. For throttling,
+  cgroup interface does allow creation of hierarchical cgroups and
+  internally it treats them as flat hierarchy.
 
-  So this patch will allow creation of cgroup hierarchcy but at the backend
-  everything will be treated as flat. So if somebody created a hierarchy like
-  as follows.
+  If somebody created a hierarchy like as follows.
 
 			root
 			/  \
···
 			|
 		     test3
 
-  CFQ and throttling will practically treat all groups at same level.
+  CFQ will handle the hierarchy correctly but and throttling will
+  practically treat all groups at same level. For details on CFQ
+  hierarchy support, refer to Documentation/block/cfq-iosched.txt.
+  Throttling will treat the hierarchy as if it looks like the
+  following.
 
 				pivot
 			     /  /   \  \
 			root  test1 test2  test3
 
-  Down the line we can implement hierarchical accounting/control support
-  and also introduce a new cgroup file "use_hierarchy" which will control
-  whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
-  This is how memory controller also has implemented the things.
+  Nesting cgroups, while allowed, isn't officially supported and blkio
+  genereates warning when cgroups nest. Once throttling implements
+  hierarchy support, hierarchy will be supported and the warning will
+  be removed.
 
 Various user visible config options
 ===================================
···
 	  # cat blkio.weight_device
 	  dev     weight
 	  8:16    300
+
+- blkio.leaf_weight[_device]
+	- Equivalents of blkio.weight[_device] for the purpose of
+	  deciding how much weight tasks in the given cgroup has while
+	  competing with the cgroup's child cgroups. For details,
+	  please refer to Documentation/block/cfq-iosched.txt.
 
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
···
 	  from service tree of the device. First two fields specify the major
 	  and minor number of the device and third field specifies the number
 	  of times a group was dequeued from a particular device.
+
+- blkio.*_recursive
+	- Recursive version of various stats. These files show the
+	  same information as their non-recursive counterparts but
+	  include stats from all the descendant cgroups.
 
 Throttling/Upper limit policy files
 -----------------------------------
+6 -15
block/cfq-iosched.c
···
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
-/*
- * Determine the parent cfqg for weight calculation. Currently, cfqg
- * scheduling is flat and the root is the parent of everyone else.
- */
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
 {
-	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
-	struct cfq_group *root;
+	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
 
-	while (blkg->parent)
-		blkg = blkg->parent;
-	root = blkg_to_cfqg(blkg);
-
-	return root != cfqg ? root : NULL;
+	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
 }
 
 static inline void cfqg_get(struct cfq_group *cfqg)
···
 
 #else /* CONFIG_CFQ_GROUP_IOSCHED */
 
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
···
 	 * stops once an already activated node is met. vfraction
 	 * calculation should always continue to the root.
 	 */
-	while ((parent = cfqg_flat_parent(pos))) {
+	while ((parent = cfqg_parent(pos))) {
 		if (propagate) {
 			propagate = !parent->nr_active++;
 			parent->children_weight += pos->weight;
···
 	pos->children_weight -= pos->leaf_weight;
 
 	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
+		struct cfq_group *parent = cfqg_parent(pos);
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->children_weight);
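
The behavioural difference between the removed and the added helper can be seen in a stripped-down user-space model. This is only a sketch: `struct node`, `flat_parent()`, `real_parent()` and `levels_to_root()` are hypothetical stand-ins for the blkg/cfqg plumbing, kept just detailed enough to show why loops of the form `while ((parent = ...(pos)))` now visit every ancestor instead of jumping straight to the root.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a group that only knows its direct parent. */
struct node {
	struct node *parent;	/* NULL for the root */
};

/* Old behaviour: every non-root node reports the root as its parent. */
struct node *flat_parent(struct node *n)
{
	struct node *root = n;

	while (root->parent)
		root = root->parent;
	return root != n ? root : NULL;
}

/* New behaviour: report the direct parent, or NULL at the root. */
struct node *real_parent(struct node *n)
{
	return n->parent;
}

/* Count how many steps a parent-walking loop takes before reaching NULL. */
int levels_to_root(struct node *n, struct node *(*parent_of)(struct node *))
{
	int levels = 0;

	assert(n != NULL);
	while ((n = parent_of(n)) != NULL)
		levels++;
	return levels;
}
```

For a chain root <- A <- AA, levels_to_root() on AA takes one step with flat_parent() and two steps with real_parent(), so per-level accounting such as children_weight is only touched on A in the second case.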