Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block

Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:

- The big cfq/blkcg update from Tejun and Vivek.

- Additional block and writeback tracepoints from Tejun.

- Improvement of the "should sort" (based on queues) logic in the plug
flushing.

- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.

- Various little fixes.

You'll get two trivial merge conflicts, which should be easy enough to
fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...

Total: +1246 -258

Documentation/block/cfq-iosched.txt (+58)
···
 performace although this can cause the latency of some I/O to increase due
 to more number of requests.
 
+CFQ Group scheduling
+====================
+
+CFQ supports blkio cgroup and has "blkio." prefixed files in each
+blkio cgroup directory.  It is weight-based and there are four knobs
+for configuration - weight[_device] and leaf_weight[_device].
+Internal cgroup nodes (the ones with children) can also have tasks in
+them, so the former two configure how much proportion the cgroup as a
+whole is entitled to at its parent's level while the latter two
+configure how much proportion the tasks in the cgroup have compared to
+its direct children.
+
+Another way to think about it is assuming that each internal node has
+an implicit leaf child node which hosts all the tasks whose weight is
+configured by leaf_weight[_device].  Let's assume a blkio hierarchy
+composed of five cgroups - root, A, B, AA and AB - with the following
+weights where the names represent the hierarchy.
+
+        weight leaf_weight
+ root :  125    125
+ A    :  500    750
+ B    :  250    500
+ AA   :  500    500
+ AB   : 1000    500
+
+root never has a parent making its weight is meaningless.  For backward
+compatibility, weight is always kept in sync with leaf_weight.  B, AA
+and AB have no child and thus its tasks have no children cgroup to
+compete with.  They always get 100% of what the cgroup won at the
+parent level.  Considering only the weights which matter, the hierarchy
+looks like the following.
+
+          root
+       /   |   \
+      A    B    leaf
+     500  250   125
+   /  |  \
+  AA  AB  leaf
+ 500 1000  750
+
+If all cgroups have active IOs and competing with each other, disk
+time will be distributed like the following.
+
+Distribution below root.  The total active weight at this level is
+A:500 + B:250 + C:125 = 875.
+
+ root-leaf :   125 /  875      =~ 14%
+ A         :   500 /  875      =~ 57%
+ B(-leaf)  :   250 /  875      =~ 28%
+
+A has children and further distributes its 57% among the children and
+the implicit leaf node.  The total active weight at this level is
+AA:500 + AB:1000 + A-leaf:750 = 2250.
+
+ A-leaf    : ( 750 / 2250) * A =~ 19%
+ AA(-leaf) : ( 500 / 2250) * A =~ 12%
+ AB(-leaf) : (1000 / 2250) * A =~ 25%
+
 CFQ IOPS Mode for group scheduling
 ===================================
 Basic CFQ design is to provide priority based time slices. Higher priority
Documentation/cgroups/blkio-controller.txt (+24 -11)
···
 
 Hierarchical Cgroups
 ====================
--  Currently none of the IO control policy supports hierarchical groups. But
-  cgroup interface does allow creation of hierarchical cgroups and internally
-  IO policies treat them as flat hierarchy.
+-  Currently only CFQ supports hierarchical groups.  For throttling,
+  cgroup interface does allow creation of hierarchical cgroups and
+  internally it treats them as flat hierarchy.
 
-  So this patch will allow creation of cgroup hierarchcy but at the backend
-  everything will be treated as flat. So if somebody created a hierarchy like
-  as follows.
+  If somebody created a hierarchy like as follows.
 
 			root
 			/  \
···
 			|
 			test3
 
-  CFQ and throttling will practically treat all groups at same level.
+  CFQ will handle the hierarchy correctly but and throttling will
+  practically treat all groups at same level.  For details on CFQ
+  hierarchy support, refer to Documentation/block/cfq-iosched.txt.
+  Throttling will treat the hierarchy as if it looks like the
+  following.
 
 				pivot
 			     /  /   \  \
 			root  test1 test2  test3
 
-  Down the line we can implement hierarchical accounting/control support
-  and also introduce a new cgroup file "use_hierarchy" which will control
-  whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
-  This is how memory controller also has implemented the things.
+  Nesting cgroups, while allowed, isn't officially supported and blkio
+  genereates warning when cgroups nest.  Once throttling implements
+  hierarchy support, hierarchy will be supported and the warning will
+  be removed.
 
 Various user visible config options
 ===================================
···
 	# cat blkio.weight_device
 	dev     weight
 	8:16    300
+
+- blkio.leaf_weight[_device]
+	- Equivalents of blkio.weight[_device] for the purpose of
+	  deciding how much weight tasks in the given cgroup has while
+	  competing with the cgroup's child cgroups.  For details,
+	  please refer to Documentation/block/cfq-iosched.txt.
 
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
···
 	  from service tree of the device. First two fields specify the major
 	  and minor number of the device and third field specifies the number
 	  of times a group was dequeued from a particular device.
+
+- blkio.*_recursive
+	- Recursive version of various stats.  These files show the
+	  same information as their non-recursive counterparts but
+	  include stats from all the descendant cgroups.
 
 Throttling/Upper limit policy files
 -----------------------------------
block/Kconfig (-1)
···
 menuconfig BLOCK
 	bool "Enable the block layer" if EXPERT
 	default y
-	select PERCPU_RWSEM
 	help
 	 Provide block layer support for the kernel.
 
block/blk-cgroup.c (+236 -41)
···
 
 static DEFINE_MUTEX(blkcg_pol_mutex);
 
-struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT };
+struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT,
+			    .cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, };
 EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
+
+static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
+				      struct request_queue *q, bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @c_blkg through the descendants of @p_blkg.  Must be used with RCU
+ * read locked.  If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs.  The caller may
+ * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
+ * subtree.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp),	\
+					      (p_blkg)->q, false)))
 
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
···
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
+		pd->plid = i;
 
 		/* invoke per-policy init */
-		if (blkcg_policy_enabled(blkg->q, pol))
+		if (pol->pd_init_fn)
 			pol->pd_init_fn(blkg);
 	}
 
···
 	return NULL;
 }
 
+/**
+ * __blkg_lookup - internal version of blkg_lookup()
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ * @update_hint: whether to update lookup hint with the result or not
+ *
+ * This is internal version and shouldn't be used by policy
+ * implementations.  Looks up blkgs for the @blkcg - @q pair regardless of
+ * @q's bypass state.  If @update_hint is %true, the caller should be
+ * holding @q->queue_lock and lookup hint is updated on success.
+ */
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q)
+				      struct request_queue *q, bool update_hint)
 {
 	struct blkcg_gq *blkg;
 
···
 		return blkg;
 
 	/*
-	 * Hint didn't match.  Look up from the radix tree.  Note that we
-	 * may not be holding queue_lock and thus are not sure whether
-	 * @blkg from blkg_tree has already been removed or not, so we
-	 * can't update hint to the lookup result.  Leave it to the caller.
+	 * Hint didn't match.  Look up from the radix tree.  Note that the
+	 * hint can only be updated under queue_lock as otherwise @blkg
+	 * could have already been removed from blkg_tree.  The caller is
+	 * responsible for grabbing queue_lock if @update_hint.
 	 */
 	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
-	if (blkg && blkg->q == q)
+	if (blkg && blkg->q == q) {
+		if (update_hint) {
+			lockdep_assert_held(q->queue_lock);
+			rcu_assign_pointer(blkcg->blkg_hint, blkg);
+		}
 		return blkg;
+	}
 
 	return NULL;
 }
···
 
 	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
-	return __blkg_lookup(blkcg, q);
+	return __blkg_lookup(blkcg, q, false);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup);
···
 * If @new_blkg is %NULL, this function tries to allocate a new one as
 * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
 */
-static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
-					     struct request_queue *q,
-					     struct blkcg_gq *new_blkg)
+static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
+				    struct request_queue *q,
+				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
-	int ret;
+	int i, ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
 
-	/* lookup and update hint on success, see __blkg_lookup() for details */
-	blkg = __blkg_lookup(blkcg, q);
-	if (blkg) {
-		rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		goto out_free;
-	}
-
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
-		blkg = ERR_PTR(-EINVAL);
-		goto out_free;
+		ret = -EINVAL;
+		goto err_free_blkg;
 	}
 
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
 		if (unlikely(!new_blkg)) {
-			blkg = ERR_PTR(-ENOMEM);
-			goto out_put;
+			ret = -ENOMEM;
+			goto err_put_css;
 		}
 	}
 	blkg = new_blkg;
 
-	/* insert */
+	/* link parent and insert */
+	if (blkcg_parent(blkcg)) {
+		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
+		if (WARN_ON_ONCE(!blkg->parent)) {
+			blkg = ERR_PTR(-EINVAL);
+			goto err_put_css;
+		}
+		blkg_get(blkg->parent);
+	}
+
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
 		hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 		list_add(&blkg->q_node, &q->blkg_list);
+
+		for (i = 0; i < BLKCG_MAX_POLS; i++) {
+			struct blkcg_policy *pol = blkcg_policy[i];
+
+			if (blkg->pd[i] && pol->pd_online_fn)
+				pol->pd_online_fn(blkg);
+		}
 	}
+	blkg->online = true;
 	spin_unlock(&blkcg->lock);
 
 	if (!ret)
 		return blkg;
 
-	blkg = ERR_PTR(ret);
-out_put:
+	/* @blkg failed fully initialized, use the usual release path */
+	blkg_put(blkg);
+	return ERR_PTR(ret);
+
+err_put_css:
 	css_put(&blkcg->css);
-out_free:
+err_free_blkg:
 	blkg_free(new_blkg);
-	return blkg;
+	return ERR_PTR(ret);
 }
 
+/**
+ * blkg_lookup_create - lookup blkg, try to create one if not there
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ *
+ * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
+ * create one.  blkg creation is performed recursively from blkcg_root such
+ * that all non-root blkg's have access to the parent blkg.  This function
+ * should be called under RCU read lock and @q->queue_lock.
+ *
+ * Returns pointer to the looked up or created blkg on success, ERR_PTR()
+ * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
+ * dead and bypassing, returns ERR_PTR(-EBUSY).
+ */
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 				    struct request_queue *q)
 {
+	struct blkcg_gq *blkg;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert_held(q->queue_lock);
+
 	/*
 	 * This could be the first entry point of blkcg implementation and
 	 * we shouldn't allow anything to go through for a bypassing queue.
 	 */
 	if (unlikely(blk_queue_bypass(q)))
 		return ERR_PTR(blk_queue_dying(q) ? -EINVAL : -EBUSY);
-	return __blkg_lookup_create(blkcg, q, NULL);
+
+	blkg = __blkg_lookup(blkcg, q, true);
+	if (blkg)
+		return blkg;
+
+	/*
+	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
+	 * non-root blkgs have access to their parents.
+	 */
+	while (true) {
+		struct blkcg *pos = blkcg;
+		struct blkcg *parent = blkcg_parent(blkcg);
+
+		while (parent && !__blkg_lookup(parent, q, false)) {
+			pos = parent;
+			parent = blkcg_parent(parent);
+		}
+
+		blkg = blkg_create(pos, q, NULL);
+		if (pos == blkcg || IS_ERR(blkg))
+			return blkg;
+	}
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
 	struct blkcg *blkcg = blkg->blkcg;
+	int i;
 
 	lockdep_assert_held(blkg->q->queue_lock);
 	lockdep_assert_held(&blkcg->lock);
···
 	/* Something wrong if we are trying to remove same group twice */
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
 	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
+
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && pol->pd_offline_fn)
+			pol->pd_offline_fn(blkg);
+	}
+	blkg->online = false;
 
 	radix_tree_delete(&blkcg->blkg_tree, blkg->q->id);
 	list_del_init(&blkg->q_node);
···
 
 void __blkg_release(struct blkcg_gq *blkg)
 {
-	/* release the extra blkcg reference this blkg has been holding */
+	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
+	if (blkg->parent)
+		blkg_put(blkg->parent);
 
 	/*
 	 * A group is freed in rcu manner. But having an rcu lock does not
···
 *
 * This function invokes @prfill on each blkg of @blkcg if pd for the
 * policy specified by @pol exists.  @prfill is invoked with @sf, the
- * policy data and @data.  If @show_total is %true, the sum of the return
- * values from @prfill is printed with "Total" label at the end.
+ * policy data and @data and the matching queue lock held.  If @show_total
+ * is %true, the sum of the return values from @prfill is printed with
+ * "Total" label at the end.
 *
 * This is to be used to construct print functions for
 * cftype->read_seq_string method.
···
 	struct blkcg_gq *blkg;
 	u64 total = 0;
 
-	spin_lock_irq(&blkcg->lock);
-	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node)
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+		spin_lock_irq(blkg->q->queue_lock);
 		if (blkcg_policy_enabled(blkg->q, pol))
 			total += prfill(sf, blkg->pd[pol->plid], data);
-	spin_unlock_irq(&blkcg->lock);
+		spin_unlock_irq(blkg->q->queue_lock);
+	}
+	rcu_read_unlock();
 
 	if (show_total)
 		seq_printf(sf, "Total %llu\n", (unsigned long long)total);
···
 	seq_printf(sf, "%s Total %llu\n", dname, (unsigned long long)v);
 	return v;
 }
+EXPORT_SYMBOL_GPL(__blkg_prfill_rwstat);
 
 /**
 * blkg_prfill_stat - prfill callback for blkg_stat
···
 	return __blkg_prfill_rwstat(sf, pd, &rwstat);
 }
 EXPORT_SYMBOL_GPL(blkg_prfill_rwstat);
+
+/**
+ * blkg_stat_recursive_sum - collect hierarchical blkg_stat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_stat in @pd
+ *
+ * Collect the blkg_stat specified by @off from @pd and all its online
+ * descendants and return the sum.  The caller must be holding the queue
+ * lock for online tests.
+ */
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	u64 sum;
+
+	lockdep_assert_held(pd->blkg->q->queue_lock);
+
+	sum = blkg_stat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_stat *stat = (void *)pos_pd + off;
+
+		if (pos_blkg->online)
+			sum += blkg_stat_read(stat);
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(blkg_stat_recursive_sum);
+
+/**
+ * blkg_rwstat_recursive_sum - collect hierarchical blkg_rwstat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_stat in @pd
+ *
+ * Collect the blkg_rwstat specified by @off from @pd and all its online
+ * descendants and return the sum.  The caller must be holding the queue
+ * lock for online tests.
+ */
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+					     int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	struct blkg_rwstat sum;
+	int i;
+
+	lockdep_assert_held(pd->blkg->q->queue_lock);
+
+	sum = blkg_rwstat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_rwstat *rwstat = (void *)pos_pd + off;
+		struct blkg_rwstat tmp;
+
+		if (!pos_blkg->online)
+			continue;
+
+		tmp = blkg_rwstat_read(rwstat);
+
+		for (i = 0; i < BLKG_RWSTAT_NR; i++)
+			sum.cnt[i] += tmp.cnt[i];
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);
 
 /**
 * blkg_conf_prep - parse and prepare for per-blkg config update
···
 		return ERR_PTR(-ENOMEM);
 
 	blkcg->cfq_weight = CFQ_WEIGHT_DEFAULT;
+	blkcg->cfq_leaf_weight = CFQ_WEIGHT_DEFAULT;
 	blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
 done:
 	spin_lock_init(&blkcg->lock);
···
 			  const struct blkcg_policy *pol)
 {
 	LIST_HEAD(pds);
-	struct blkcg_gq *blkg;
+	struct blkcg_gq *blkg, *new_blkg;
 	struct blkg_policy_data *pd, *n;
 	int cnt = 0, ret;
 	bool preloaded;
···
 		return 0;
 
 	/* preallocations for root blkg */
-	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
-	if (!blkg)
+	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!new_blkg)
 		return -ENOMEM;
 
 	preloaded = !radix_tree_preload(GFP_KERNEL);
 
 	blk_queue_bypass_start(q);
 
-	/* make sure the root blkg exists and count the existing blkgs */
+	/*
+	 * Make sure the root blkg exists and count the existing blkgs.  As
+	 * @q is bypassing at this point, blkg_lookup_create() can't be
+	 * used.  Open code it.
+	 */
 	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
-	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
+	blkg = __blkg_lookup(&blkcg_root, q, false);
+	if (blkg)
+		blkg_free(new_blkg);
+	else
+		blkg = blkg_create(&blkcg_root, q, new_blkg);
 	rcu_read_unlock();
 
 	if (preloaded)
···
 
 		blkg->pd[pol->plid] = pd;
 		pd->blkg = blkg;
+		pd->plid = pol->plid;
 		pol->pd_init_fn(blkg);
 
 		spin_unlock(&blkg->blkcg->lock);
···
 	/* grab blkcg lock too while removing @pd from @blkg */
 	spin_lock(&blkg->blkcg->lock);
 
+	if (pol->pd_offline_fn)
+		pol->pd_offline_fn(blkg);
 	if (pol->pd_exit_fn)
 		pol->pd_exit_fn(blkg);
 
block/blk-cgroup.h (+65 -3)
···
 
 	/* TODO: per-policy storage in blkcg */
 	unsigned int cfq_weight;	/* belongs to cfq */
+	unsigned int cfq_leaf_weight;
 };
 
 struct blkg_stat {
···
 * beginning and pd_size can't be smaller than pd.
 */
 struct blkg_policy_data {
-	/* the blkg this per-policy data belongs to */
+	/* the blkg and policy id this per-policy data belongs to */
 	struct blkcg_gq *blkg;
+	int plid;
 
 	/* used during policy activation */
 	struct list_head alloc_node;
···
 	struct list_head q_node;
 	struct hlist_node blkcg_node;
 	struct blkcg *blkcg;
+
+	/* all non-root blkcg_gq's are guaranteed to have access to parent */
+	struct blkcg_gq *parent;
+
 	/* request allocation list for this blkcg-q pair */
 	struct request_list rl;
+
 	/* reference count */
 	int refcnt;
+
+	/* is this blkg online? protected by both blkcg and q locks */
+	bool online;
 
 	struct blkg_policy_data *pd[BLKCG_MAX_POLS];
···
 };
 
 typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
 typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
 typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
···
 
 	/* operations */
 	blkcg_pol_init_pd_fn *pd_init_fn;
+	blkcg_pol_online_pd_fn *pd_online_fn;
+	blkcg_pol_offline_pd_fn *pd_offline_fn;
 	blkcg_pol_exit_pd_fn *pd_exit_fn;
 	blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn;
 };
···
 u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 		       int off);
 
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+					     int off);
+
 struct blkg_conf_ctx {
 	struct gendisk *disk;
 	struct blkcg_gq *blkg;
···
 	if (bio && bio->bi_css)
 		return container_of(bio->bi_css, struct blkcg, css);
 	return task_blkcg(current);
 }
 
+/**
+ * blkcg_parent - get the parent of a blkcg
+ * @blkcg: blkcg of interest
+ *
+ * Return the parent blkcg of @blkcg.  Can be called anytime.
+ */
+static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
+{
+	struct cgroup *pcg = blkcg->css.cgroup->parent;
+
+	return pcg ? cgroup_to_blkcg(pcg) : NULL;
+}
 
 /**
···
 }
 
 /**
+ * blkg_stat_merge - merge a blkg_stat into another
+ * @to: the destination blkg_stat
+ * @from: the source
+ *
+ * Add @from's count to @to.
+ */
+static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
+{
+	blkg_stat_add(to, blkg_stat_read(from));
+}
+
+/**
 * blkg_rwstat_add - add a value to a blkg_rwstat
 * @rwstat: target blkg_rwstat
 * @rw: mask of REQ_{WRITE|SYNC}
···
 }
 
 /**
- * blkg_rwstat_sum - read the total count of a blkg_rwstat
+ * blkg_rwstat_total - read the total count of a blkg_rwstat
 * @rwstat: blkg_rwstat to read
 *
 * Return the total count of @rwstat regardless of the IO direction.  This
 * function can be called without synchronization and takes care of u64
 * atomicity.
 */
-static inline uint64_t blkg_rwstat_sum(struct blkg_rwstat *rwstat)
+static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
 {
 	struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
···
 static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
 {
 	memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
 }
+
+/**
+ * blkg_rwstat_merge - merge a blkg_rwstat into another
+ * @to: the destination blkg_rwstat
+ * @from: the source
+ *
+ * Add @from's counts to @to.
+ */
+static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
+				     struct blkg_rwstat *from)
+{
+	struct blkg_rwstat v = blkg_rwstat_read(from);
+	int i;
+
+	u64_stats_update_begin(&to->syncp);
+	for (i = 0; i < BLKG_RWSTAT_NR; i++)
+		to->cnt[i] += v.cnt[i];
+	u64_stats_update_end(&to->syncp);
+}
 
 #else	/* CONFIG_BLK_CGROUP */
block/blk-core.c (+3 -15)
···
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
-EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 
 DEFINE_IDA(blk_queue_ida);
···
 	if (!ll_back_merge_fn(q, req, bio))
 		return false;
 
-	trace_block_bio_backmerge(q, bio);
+	trace_block_bio_backmerge(q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
···
 	if (!ll_front_merge_fn(q, req, bio))
 		return false;
 
-	trace_block_bio_frontmerge(q, bio);
+	trace_block_bio_frontmerge(q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
···
 	if (list_empty(&plug->list))
 		trace_block_plug(q);
 	else {
-		if (!plug->should_sort) {
-			struct request *__rq;
-
-			__rq = list_entry_rq(plug->list.prev);
-			if (__rq->q != q)
-				plug->should_sort = 1;
-		}
 		if (request_count >= BLK_MAX_REQUEST_COUNT) {
 			blk_flush_plug_list(plug, false);
 			trace_block_plug(q);
···
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->cb_list);
-	plug->should_sort = 0;
 
 	/*
 	 * If this is a nested plug, don't actually assign it. It will be
···
 
 	list_splice_init(&plug->list, &list);
 
-	if (plug->should_sort) {
-		list_sort(NULL, &list, plug_rq_cmp);
-		plug->should_sort = 0;
-	}
+	list_sort(NULL, &list, plug_rq_cmp);
 
 	q = NULL;
 	depth = 0;
block/blk-exec.c (+2 -2)
···
 	/* Prevent hang_check timer from firing at us during very long I/O */
 	hang_check = sysctl_hung_task_timeout_secs;
 	if (hang_check)
-		while (!wait_for_completion_timeout(&wait, hang_check * (HZ/2)));
+		while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));
 	else
-		wait_for_completion(&wait);
+		wait_for_completion_io(&wait);
 
 	if (rq->errors)
 		err = -EIO;
block/blk-flush.c (+1 -1)
···
 
 	bio_get(bio);
 	submit_bio(WRITE_FLUSH, bio);
-	wait_for_completion(&wait);
+	wait_for_completion_io(&wait);
 
 	/*
 	 * The driver must store the error location in ->bi_sector, if
block/blk-lib.c (+3 -3)
···
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
-		wait_for_completion(&wait);
+		wait_for_completion_io(&wait);
 
 	if (!test_bit(BIO_UPTODATE, &bb.flags))
 		ret = -EIO;
···
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
-		wait_for_completion(&wait);
+		wait_for_completion_io(&wait);
 
 	if (!test_bit(BIO_UPTODATE, &bb.flags))
 		ret = -ENOTSUPP;
···
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
-		wait_for_completion(&wait);
+		wait_for_completion_io(&wait);
 
 	if (!test_bit(BIO_UPTODATE, &bb.flags))
 		/* One of bios in the batch was completed with error.*/
block/blk-sysfs.c (+8 -1)
···
 	return res;
 }
 
+static void blk_free_queue_rcu(struct rcu_head *rcu_head)
+{
+	struct request_queue *q = container_of(rcu_head, struct request_queue,
+					       rcu_head);
+	kmem_cache_free(blk_requestq_cachep, q);
+}
+
 /**
 * blk_release_queue: - release a &struct request_queue when it is no longer needed
 * @kobj:    the kobj belonging to the request queue to be released
···
 	bdi_destroy(&q->backing_dev_info);
 
 	ida_simple_remove(&blk_queue_ida, q->id);
-	kmem_cache_free(blk_requestq_cachep, q);
+	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
 static const struct sysfs_ops queue_sysfs_ops = {
block/blk.h (+1 -1)
···
 /*
 * Internal elevator interface
 */
-#define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
+#define ELV_ON_HASH(rq)		hash_hashed(&(rq)->hash)
 
 void blk_insert_flush(struct request *rq);
 void blk_abort_flushes(struct request_queue *q);
+511 -118
block/cfq-iosched.c
···
 	struct rb_root rb;
 	struct rb_node *left;
 	unsigned count;
-	unsigned total_weight;
 	u64 min_vdisktime;
 	struct cfq_ttime ttime;
 };
···
  * First index in the service_trees.
  * IDLE is handled separately, so it has negative index
  */
-enum wl_prio_t {
+enum wl_class_t {
 	BE_WORKLOAD = 0,
 	RT_WORKLOAD = 1,
 	IDLE_WORKLOAD = 2,
···

 	/* group service_tree key */
 	u64 vdisktime;
+
+	/*
+	 * The number of active cfqgs and sum of their weights under this
+	 * cfqg.  This covers this cfqg's leaf_weight and all children's
+	 * weights, but does not cover weights of further descendants.
+	 *
+	 * If a cfqg is on the service tree, it's active.  An active cfqg
+	 * also activates its parent and contributes to the children_weight
+	 * of the parent.
+	 */
+	int nr_active;
+	unsigned int children_weight;
+
+	/*
+	 * vfraction is the fraction of vdisktime that the tasks in this
+	 * cfqg are entitled to.  This is determined by compounding the
+	 * ratios walking up from this cfqg to the root.
+	 *
+	 * It is in fixed point w/ CFQ_SERVICE_SHIFT and the sum of all
+	 * vfractions on a service tree is approximately 1.  The sum may
+	 * deviate a bit due to rounding errors and fluctuations caused by
+	 * cfqgs entering and leaving the service tree.
+	 */
+	unsigned int vfraction;
+
+	/*
+	 * There are two weights - (internal) weight is the weight of this
+	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
+	 * this cfqg against the child cfqgs.  For the root cfqg, both
+	 * weights are kept in sync for backward compatibility.
+	 */
 	unsigned int weight;
 	unsigned int new_weight;
 	unsigned int dev_weight;
+
+	unsigned int leaf_weight;
+	unsigned int new_leaf_weight;
+	unsigned int dev_leaf_weight;

 	/* number of cfqq currently on this group */
 	int nr_cfqq;
···
 	struct cfq_rb_root service_trees[2][3];
 	struct cfq_rb_root service_tree_idle;

-	unsigned long saved_workload_slice;
-	enum wl_type_t saved_workload;
-	enum wl_prio_t saved_serving_prio;
+	unsigned long saved_wl_slice;
+	enum wl_type_t saved_wl_type;
+	enum wl_class_t saved_wl_class;

 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
 	struct cfq_ttime ttime;
-	struct cfqg_stats stats;
+	struct cfqg_stats stats;	/* stats for this cfqg */
+	struct cfqg_stats dead_stats;	/* stats pushed from dead children */
 };

 struct cfq_io_cq {
···
 	/*
 	 * The priority currently being served
 	 */
-	enum wl_prio_t serving_prio;
-	enum wl_type_t serving_type;
+	enum wl_class_t serving_wl_class;
+	enum wl_type_t serving_wl_type;
 	unsigned long workload_expires;
 	struct cfq_group *serving_group;

···

 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);

-static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
-					    enum wl_prio_t prio,
+static struct cfq_rb_root *st_for(struct cfq_group *cfqg,
+				  enum wl_class_t class,
 					    enum wl_type_t type)
 {
 	if (!cfqg)
 		return NULL;

-	if (prio == IDLE_WORKLOAD)
+	if (class == IDLE_WORKLOAD)
 		return &cfqg->service_tree_idle;

-	return &cfqg->service_trees[prio][type];
+	return &cfqg->service_trees[class][type];
 }

 enum cfqq_state_flags {
···
 {
 	struct cfqg_stats *stats = &cfqg->stats;

-	if (blkg_rwstat_sum(&stats->queued))
+	if (blkg_rwstat_total(&stats->queued))
 		return;

 	/*
···
 	struct cfqg_stats *stats = &cfqg->stats;

 	blkg_stat_add(&stats->avg_queue_size_sum,
-		      blkg_rwstat_sum(&stats->queued));
+		      blkg_rwstat_total(&stats->queued));
 	blkg_stat_add(&stats->avg_queue_size_samples, 1);
 	cfqg_stats_update_group_wait_time(stats);
 }
···
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }

+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
+{
+	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
+
+	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
+}
+
 static inline void cfqg_get(struct cfq_group *cfqg)
 {
 	return blkg_get(cfqg_to_blkg(cfqg));
···
 	char __pbuf[128];						\
 									\
 	blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
-			  cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+			  cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+			  cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
 			  __pbuf, ##args);				\
 } while (0)
···
 					io_start_time - start_time);
 }

-static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
+/* @stats = 0 */
+static void cfqg_stats_reset(struct cfqg_stats *stats)
 {
-	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
-	struct cfqg_stats *stats = &cfqg->stats;
-
 	/* queued stats shouldn't be cleared */
 	blkg_rwstat_reset(&stats->service_bytes);
 	blkg_rwstat_reset(&stats->serviced);
···
 #endif
 }

+/* @to += @from */
+static void cfqg_stats_merge(struct cfqg_stats *to, struct cfqg_stats *from)
+{
+	/* queued stats shouldn't be cleared */
+	blkg_rwstat_merge(&to->service_bytes, &from->service_bytes);
+	blkg_rwstat_merge(&to->serviced, &from->serviced);
+	blkg_rwstat_merge(&to->merged, &from->merged);
+	blkg_rwstat_merge(&to->service_time, &from->service_time);
+	blkg_rwstat_merge(&to->wait_time, &from->wait_time);
+	blkg_stat_merge(&from->time, &from->time);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	blkg_stat_merge(&to->unaccounted_time, &from->unaccounted_time);
+	blkg_stat_merge(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
+	blkg_stat_merge(&to->avg_queue_size_samples, &from->avg_queue_size_samples);
+	blkg_stat_merge(&to->dequeue, &from->dequeue);
+	blkg_stat_merge(&to->group_wait_time, &from->group_wait_time);
+	blkg_stat_merge(&to->idle_time, &from->idle_time);
+	blkg_stat_merge(&to->empty_time, &from->empty_time);
+#endif
+}
+
+/*
+ * Transfer @cfqg's stats to its parent's dead_stats so that the ancestors'
+ * recursive stats can still account for the amount used by this cfqg after
+ * it's gone.
+ */
+static void cfqg_stats_xfer_dead(struct cfq_group *cfqg)
+{
+	struct cfq_group *parent = cfqg_parent(cfqg);
+
+	lockdep_assert_held(cfqg_to_blkg(cfqg)->q->queue_lock);
+
+	if (unlikely(!parent))
+		return;
+
+	cfqg_stats_merge(&parent->dead_stats, &cfqg->stats);
+	cfqg_stats_merge(&parent->dead_stats, &cfqg->dead_stats);
+	cfqg_stats_reset(&cfqg->stats);
+	cfqg_stats_reset(&cfqg->dead_stats);
+}
+
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */

+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }

 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c " fmt, (cfqq)->pid,	\
+			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
+			##args)
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)		do {} while (0)

 static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
···
 	return false;
 }

-static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
+static inline enum wl_class_t cfqq_class(struct cfq_queue *cfqq)
 {
 	if (cfq_class_idle(cfqq))
 		return IDLE_WORKLOAD;
···
 	return SYNC_WORKLOAD;
 }

-static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+static inline int cfq_group_busy_queues_wl(enum wl_class_t wl_class,
 					   struct cfq_data *cfqd,
 					   struct cfq_group *cfqg)
 {
-	if (wl == IDLE_WORKLOAD)
+	if (wl_class == IDLE_WORKLOAD)
 		return cfqg->service_tree_idle.count;

-	return cfqg->service_trees[wl][ASYNC_WORKLOAD].count
-		+ cfqg->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
-		+ cfqg->service_trees[wl][SYNC_WORKLOAD].count;
+	return cfqg->service_trees[wl_class][ASYNC_WORKLOAD].count +
+		cfqg->service_trees[wl_class][SYNC_NOIDLE_WORKLOAD].count +
+		cfqg->service_trees[wl_class][SYNC_WORKLOAD].count;
 }

 static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 					 struct cfq_group *cfqg)
 {
-	return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
-		+ cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+	return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count +
+		cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
 }

 static void cfq_dispatch_insert(struct request_queue *, struct request *);
···
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }

-static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+/**
+ * cfqg_scale_charge - scale disk time charge according to cfqg weight
+ * @charge: disk time being charged
+ * @vfraction: vfraction of the cfqg, fixed point w/ CFQ_SERVICE_SHIFT
+ *
+ * Scale @charge according to @vfraction, which is in range (0, 1].  The
+ * scaling is inversely proportional.
+ *
+ *	scaled = charge / vfraction
+ *
+ * The result is also in fixed point w/ CFQ_SERVICE_SHIFT.
+ */
+static inline u64 cfqg_scale_charge(unsigned long charge,
+				    unsigned int vfraction)
 {
-	u64 d = delta << CFQ_SERVICE_SHIFT;
+	u64 c = charge << CFQ_SERVICE_SHIFT;	/* make it fixed point */

-	d = d * CFQ_WEIGHT_DEFAULT;
-	do_div(d, cfqg->weight);
-	return d;
+	/* charge / vfraction */
+	c <<= CFQ_SERVICE_SHIFT;
+	do_div(c, vfraction);
+	return c;
 }

 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
···
 static inline unsigned
 cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
-
-	return cfqd->cfq_target_latency * cfqg->weight / st->total_weight;
+	return cfqd->cfq_target_latency * cfqg->vfraction >> CFQ_SERVICE_SHIFT;
 }

 static inline unsigned
···
 cfq_update_group_weight(struct cfq_group *cfqg)
 {
 	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
 	if (cfqg->new_weight) {
 		cfqg->weight = cfqg->new_weight;
 		cfqg->new_weight = 0;
+	}
+
+	if (cfqg->new_leaf_weight) {
+		cfqg->leaf_weight = cfqg->new_leaf_weight;
+		cfqg->new_leaf_weight = 0;
 	}
 }

 static void
 cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	unsigned int vfr = 1 << CFQ_SERVICE_SHIFT;	/* start with 1 */
+	struct cfq_group *pos = cfqg;
+	struct cfq_group *parent;
+	bool propagate;
+
+	/* add to the service tree */
 	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));

 	cfq_update_group_weight(cfqg);
 	__cfq_group_service_tree_add(st, cfqg);
-	st->total_weight += cfqg->weight;
+
+	/*
+	 * Activate @cfqg and calculate the portion of vfraction @cfqg is
+	 * entitled to.  vfraction is calculated by walking the tree
+	 * towards the root calculating the fraction it has at each level.
+	 * The compounded ratio is how much vfraction @cfqg owns.
+	 *
+	 * Start with the proportion tasks in this cfqg has against active
+	 * children cfqgs - its leaf_weight against children_weight.
+	 */
+	propagate = !pos->nr_active++;
+	pos->children_weight += pos->leaf_weight;
+	vfr = vfr * pos->leaf_weight / pos->children_weight;
+
+	/*
+	 * Compound ->weight walking up the tree.  Both activation and
+	 * vfraction calculation are done in the same loop.  Propagation
+	 * stops once an already activated node is met.  vfraction
+	 * calculation should always continue to the root.
+	 */
+	while ((parent = cfqg_parent(pos))) {
+		if (propagate) {
+			propagate = !parent->nr_active++;
+			parent->children_weight += pos->weight;
+		}
+		vfr = vfr * pos->weight / parent->children_weight;
+		pos = parent;
+	}
+
+	cfqg->vfraction = max_t(unsigned, vfr, 1);
 }

 static void
···
 static void
 cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
-	st->total_weight -= cfqg->weight;
+	struct cfq_group *pos = cfqg;
+	bool propagate;
+
+	/*
+	 * Undo activation from cfq_group_service_tree_add().  Deactivate
+	 * @cfqg and propagate deactivation upwards.
+	 */
+	propagate = !--pos->nr_active;
+	pos->children_weight -= pos->leaf_weight;
+
+	while (propagate) {
+		struct cfq_group *parent = cfqg_parent(pos);
+
+		/* @pos has 0 nr_active at this point */
+		WARN_ON_ONCE(pos->children_weight);
+		pos->vfraction = 0;
+
+		if (!parent)
+			break;
+
+		propagate = !--parent->nr_active;
+		parent->children_weight -= pos->weight;
+		pos = parent;
+	}
+
+	/* remove from the service tree */
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		cfq_rb_erase(&cfqg->rb_node, st);
 }
···

 	cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
 	cfq_group_service_tree_del(st, cfqg);
-	cfqg->saved_workload_slice = 0;
+	cfqg->saved_wl_slice = 0;
 	cfqg_stats_update_dequeue(cfqg);
 }
···
 	unsigned int used_sl, charge, unaccounted_sl = 0;
 	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
 			- cfqg->service_tree_idle.count;
+	unsigned int vfr;

 	BUG_ON(nr_sync < 0);
 	used_sl = charge = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);
···
 	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
 		charge = cfqq->allocated_slice;

-	/* Can't update vdisktime while group is on service tree */
+	/*
+	 * Can't update vdisktime while on service tree and cfqg->vfraction
+	 * is valid only while on it.  Cache vfr, leave the service tree,
+	 * update vdisktime and go back on.  The re-addition to the tree
+	 * will also update the weights as necessary.
+	 */
+	vfr = cfqg->vfraction;
 	cfq_group_service_tree_del(st, cfqg);
-	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
-	/* If a new weight was requested, update now, off tree */
+	cfqg->vdisktime += cfqg_scale_charge(charge, vfr);
 	cfq_group_service_tree_add(st, cfqg);

 	/* This group is being expired. Save the context */
 	if (time_after(cfqd->workload_expires, jiffies)) {
-		cfqg->saved_workload_slice = cfqd->workload_expires
+		cfqg->saved_wl_slice = cfqd->workload_expires
 						- jiffies;
-		cfqg->saved_workload = cfqd->serving_type;
-		cfqg->saved_serving_prio = cfqd->serving_prio;
+		cfqg->saved_wl_type = cfqd->serving_wl_type;
+		cfqg->saved_wl_class = cfqd->serving_wl_class;
 	} else
-		cfqg->saved_workload_slice = 0;
+		cfqg->saved_wl_slice = 0;

 	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
 					st->min_vdisktime);
···

 	cfq_init_cfqg_base(cfqg);
 	cfqg->weight = blkg->blkcg->cfq_weight;
+	cfqg->leaf_weight = blkg->blkcg->cfq_leaf_weight;
+}
+
+static void cfq_pd_offline(struct blkcg_gq *blkg)
+{
+	/*
+	 * @blkg is going offline and will be ignored by
+	 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
+	 * that they don't get lost.  If IOs complete after this point, the
+	 * stats for them will be lost.  Oh well...
+	 */
+	cfqg_stats_xfer_dead(blkg_to_cfqg(blkg));
+}
+
+/* offset delta from cfqg->stats to cfqg->dead_stats */
+static const int dead_stats_off_delta = offsetof(struct cfq_group, dead_stats) -
+					offsetof(struct cfq_group, stats);
+
+/* to be used by recursive prfill, sums live and dead stats recursively */
+static u64 cfqg_stat_pd_recursive_sum(struct blkg_policy_data *pd, int off)
+{
+	u64 sum = 0;
+
+	sum += blkg_stat_recursive_sum(pd, off);
+	sum += blkg_stat_recursive_sum(pd, off + dead_stats_off_delta);
+	return sum;
+}
+
+/* to be used by recursive prfill, sums live and dead rwstats recursively */
+static struct blkg_rwstat cfqg_rwstat_pd_recursive_sum(struct blkg_policy_data *pd,
+						       int off)
+{
+	struct blkg_rwstat a, b;
+
+	a = blkg_rwstat_recursive_sum(pd, off);
+	b = blkg_rwstat_recursive_sum(pd, off + dead_stats_off_delta);
+	blkg_rwstat_merge(&a, &b);
+	return a;
+}
+
+static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
+{
+	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
+
+	cfqg_stats_reset(&cfqg->stats);
+	cfqg_stats_reset(&cfqg->dead_stats);
 }

 /*
···
 	return 0;
 }

+static u64 cfqg_prfill_leaf_weight_device(struct seq_file *sf,
+					  struct blkg_policy_data *pd, int off)
+{
+	struct cfq_group *cfqg = pd_to_cfqg(pd);
+
+	if (!cfqg->dev_leaf_weight)
+		return 0;
+	return __blkg_prfill_u64(sf, pd, cfqg->dev_leaf_weight);
+}
+
+static int cfqg_print_leaf_weight_device(struct cgroup *cgrp,
+					 struct cftype *cft,
+					 struct seq_file *sf)
+{
+	blkcg_print_blkgs(sf, cgroup_to_blkcg(cgrp),
+			  cfqg_prfill_leaf_weight_device, &blkcg_policy_cfq, 0,
+			  false);
+	return 0;
+}
+
 static int cfq_print_weight(struct cgroup *cgrp, struct cftype *cft,
 			    struct seq_file *sf)
 {
···
 	return 0;
 }

-static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
-				  const char *buf)
+static int cfq_print_leaf_weight(struct cgroup *cgrp, struct cftype *cft,
+				 struct seq_file *sf)
+{
+	seq_printf(sf, "%u\n",
+		   cgroup_to_blkcg(cgrp)->cfq_leaf_weight);
+	return 0;
+}
+
+static int __cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				    const char *buf, bool is_leaf_weight)
 {
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkg_conf_ctx ctx;
···
 	ret = -EINVAL;
 	cfqg = blkg_to_cfqg(ctx.blkg);
 	if (!ctx.v || (ctx.v >= CFQ_WEIGHT_MIN && ctx.v <= CFQ_WEIGHT_MAX)) {
-		cfqg->dev_weight = ctx.v;
-		cfqg->new_weight = cfqg->dev_weight ?: blkcg->cfq_weight;
+		if (!is_leaf_weight) {
+			cfqg->dev_weight = ctx.v;
+			cfqg->new_weight = ctx.v ?: blkcg->cfq_weight;
+		} else {
+			cfqg->dev_leaf_weight = ctx.v;
+			cfqg->new_leaf_weight = ctx.v ?: blkcg->cfq_leaf_weight;
+		}
 		ret = 0;
 	}
···
 	return ret;
 }

-static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				  const char *buf)
+{
+	return __cfqg_set_weight_device(cgrp, cft, buf, false);
+}
+
+static int cfqg_set_leaf_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				       const char *buf)
+{
+	return __cfqg_set_weight_device(cgrp, cft, buf, true);
+}
+
+static int __cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val,
+			    bool is_leaf_weight)
 {
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkcg_gq *blkg;
···
 		return -EINVAL;

 	spin_lock_irq(&blkcg->lock);
-	blkcg->cfq_weight = (unsigned int)val;
+
+	if (!is_leaf_weight)
+		blkcg->cfq_weight = val;
+	else
+		blkcg->cfq_leaf_weight = val;

 	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
 		struct cfq_group *cfqg = blkg_to_cfqg(blkg);

-		if (cfqg && !cfqg->dev_weight)
-			cfqg->new_weight = blkcg->cfq_weight;
+		if (!cfqg)
+			continue;
+
+		if (!is_leaf_weight) {
+			if (!cfqg->dev_weight)
+				cfqg->new_weight = blkcg->cfq_weight;
+		} else {
+			if (!cfqg->dev_leaf_weight)
+				cfqg->new_leaf_weight = blkcg->cfq_leaf_weight;
+		}
 	}

 	spin_unlock_irq(&blkcg->lock);
 	return 0;
+}
+
+static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	return __cfq_set_weight(cgrp, cft, val, false);
+}
+
+static int cfq_set_leaf_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	return __cfq_set_weight(cgrp, cft, val, true);
 }

 static int cfqg_print_stat(struct cgroup *cgrp, struct cftype *cft,
···

 	blkcg_print_blkgs(sf, blkcg, blkg_prfill_rwstat, &blkcg_policy_cfq,
 			  cft->private, true);
+	return 0;
+}
+
+static u64 cfqg_prfill_stat_recursive(struct seq_file *sf,
+				      struct blkg_policy_data *pd, int off)
+{
+	u64 sum = cfqg_stat_pd_recursive_sum(pd, off);
+
+	return __blkg_prfill_u64(sf, pd, sum);
+}
+
+static u64 cfqg_prfill_rwstat_recursive(struct seq_file *sf,
+					struct blkg_policy_data *pd, int off)
+{
+	struct blkg_rwstat sum = cfqg_rwstat_pd_recursive_sum(pd, off);
+
+	return __blkg_prfill_rwstat(sf, pd, &sum);
+}
+
+static int cfqg_print_stat_recursive(struct cgroup *cgrp, struct cftype *cft,
+				     struct seq_file *sf)
+{
+	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
+
+	blkcg_print_blkgs(sf, blkcg, cfqg_prfill_stat_recursive,
+			  &blkcg_policy_cfq, cft->private, false);
+	return 0;
+}
+
+static int cfqg_print_rwstat_recursive(struct cgroup *cgrp, struct cftype *cft,
+				       struct seq_file *sf)
+{
+	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
+
+	blkcg_print_blkgs(sf, blkcg, cfqg_prfill_rwstat_recursive,
+			  &blkcg_policy_cfq, cft->private, true);
 	return 0;
 }
···
 #endif	/* CONFIG_DEBUG_BLK_CGROUP */

 static struct cftype cfq_blkcg_files[] = {
+	/* on root, weight is mapped to leaf_weight */
 	{
 		.name = "weight_device",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfqg_print_leaf_weight_device,
+		.write_string = cfqg_set_leaf_weight_device,
+		.max_write_len = 256,
+	},
+	{
+		.name = "weight",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfq_print_leaf_weight,
+		.write_u64 = cfq_set_leaf_weight,
+	},
+
+	/* no such mapping necessary for !roots */
+	{
+		.name = "weight_device",
+		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfqg_print_weight_device,
 		.write_string = cfqg_set_weight_device,
 		.max_write_len = 256,
 	},
 	{
 		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfq_print_weight,
 		.write_u64 = cfq_set_weight,
 	},
+
+	{
+		.name = "leaf_weight_device",
+		.read_seq_string = cfqg_print_leaf_weight_device,
+		.write_string = cfqg_set_leaf_weight_device,
+		.max_write_len = 256,
+	},
+	{
+		.name = "leaf_weight",
+		.read_seq_string = cfq_print_leaf_weight,
+		.write_u64 = cfq_set_leaf_weight,
+	},
+
+	/* statistics, covers only the tasks in the cfqg */
 	{
 		.name = "time",
 		.private = offsetof(struct cfq_group, stats.time),
···
 		.name = "io_queued",
 		.private = offsetof(struct cfq_group, stats.queued),
 		.read_seq_string = cfqg_print_rwstat,
+	},
+
+	/* the same statistics which cover the cfqg and its descendants */
+	{
+		.name = "time_recursive",
+		.private = offsetof(struct cfq_group, stats.time),
+		.read_seq_string = cfqg_print_stat_recursive,
+	},
+	{
+		.name = "sectors_recursive",
+		.private = offsetof(struct cfq_group, stats.sectors),
+		.read_seq_string = cfqg_print_stat_recursive,
+	},
+	{
+		.name = "io_service_bytes_recursive",
+		.private = offsetof(struct cfq_group, stats.service_bytes),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_serviced_recursive",
+		.private = offsetof(struct cfq_group, stats.serviced),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_service_time_recursive",
+		.private = offsetof(struct cfq_group, stats.service_time),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_wait_time_recursive",
+		.private = offsetof(struct cfq_group, stats.wait_time),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_merged_recursive",
+		.private = offsetof(struct cfq_group, stats.merged),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_queued_recursive",
+		.private = offsetof(struct cfq_group, stats.queued),
+		.read_seq_string = cfqg_print_rwstat_recursive,
 	},
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	{
···
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
-	struct cfq_rb_root *service_tree;
+	struct cfq_rb_root *st;
 	int left;
 	int new_cfqq = 1;

-	service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
-						cfqq_type(cfqq));
+	st = st_for(cfqq->cfqg, cfqq_class(cfqq), cfqq_type(cfqq));
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&service_tree->rb);
+		parent = rb_last(&st->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
···
 		cfqq->slice_resid = 0;
 	} else {
 		rb_key = -HZ;
-		__cfqq = cfq_rb_first(service_tree);
+		__cfqq = cfq_rb_first(st);
 		rb_key += __cfqq ? __cfqq->rb_key : jiffies;
 	}
···
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key &&
-		    cfqq->service_tree == service_tree)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == st)
 			return;

 		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
···

 	left = 1;
 	parent = NULL;
-	cfqq->service_tree = service_tree;
-	p = &service_tree->rb.rb_node;
+	cfqq->service_tree = st;
+	p = &st->rb.rb_node;
 	while (*p) {
-		struct rb_node **n;
-
 		parent = *p;
 		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);

···
 		 * sort by key, that represents service time.
 		 */
 		if (time_before(rb_key, __cfqq->rb_key))
-			n = &(*p)->rb_left;
+			p = &parent->rb_left;
 		else {
-			n = &(*p)->rb_right;
+			p = &parent->rb_right;
 			left = 0;
 		}
-
-		p = n;
 	}

 	if (left)
-		service_tree->left = &cfqq->rb_node;
+		st->left = &cfqq->rb_node;

 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
-	service_tree->count++;
+	rb_insert_color(&cfqq->rb_node, &st->rb);
+	st->count++;
 	if (add_front || !new_cfqq)
 		return;
 	cfq_group_notify_queue_add(cfqd, cfqq->cfqg);
···
 					struct cfq_queue *cfqq)
 {
 	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active wl_prio:%d wl_type:%d",
-				cfqd->serving_prio, cfqd->serving_type);
+		cfq_log_cfqq(cfqd, cfqq, "set_active wl_class:%d wl_type:%d",
+				cfqd->serving_wl_class, cfqd->serving_wl_type);
 		cfqg_stats_update_avg_queue_size(cfqq->cfqg);
 		cfqq->slice_start = 0;
 		cfqq->dispatch_start = jiffies;
···
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	struct cfq_rb_root *service_tree =
-		service_tree_for(cfqd->serving_group, cfqd->serving_prio,
-					cfqd->serving_type);
+	struct cfq_rb_root *st = st_for(cfqd->serving_group,
+			cfqd->serving_wl_class, cfqd->serving_wl_type);

 	if (!cfqd->rq_queued)
 		return NULL;

 	/* There is nothing to dispatch */
-	if (!service_tree)
+	if (!st)
 		return NULL;
-	if (RB_EMPTY_ROOT(&service_tree->rb))
+	if (RB_EMPTY_ROOT(&st->rb))
 		return NULL;
-	return cfq_rb_first(service_tree);
+	return cfq_rb_first(st);
 }

 static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
···

 static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	enum wl_prio_t prio = cfqq_prio(cfqq);
-	struct cfq_rb_root *service_tree = cfqq->service_tree;
+	enum wl_class_t wl_class = cfqq_class(cfqq);
+	struct cfq_rb_root *st = cfqq->service_tree;

-	BUG_ON(!service_tree);
-	BUG_ON(!service_tree->count);
+	BUG_ON(!st);
+	BUG_ON(!st->count);

 	if (!cfqd->cfq_slice_idle)
 		return false;

 	/* We never do for idle class queues. */
-	if (prio == IDLE_WORKLOAD)
+	if (wl_class == IDLE_WORKLOAD)
 		return false;

 	/* We do for queues that were marked with idle window flag. */
···
 	 * Otherwise, we do only if they are the last ones
 	 * in their service tree.
 	 */
-	if (service_tree->count == 1 && cfq_cfqq_sync(cfqq) &&
-	    !cfq_io_thinktime_big(cfqd, &service_tree->ttime, false))
+	if (st->count == 1 && cfq_cfqq_sync(cfqq) &&
+	    !cfq_io_thinktime_big(cfqd, &st->ttime, false))
 		return true;
-	cfq_log_cfqq(cfqd, cfqq, "Not idling. st->count:%d",
-			service_tree->count);
+	cfq_log_cfqq(cfqd, cfqq, "Not idling. st->count:%d", st->count);
 	return false;
 }
···
 }


-static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
-				    struct cfq_group *cfqg, enum wl_prio_t prio)
+static enum wl_type_t cfq_choose_wl_type(struct cfq_data *cfqd,
+			struct cfq_group *cfqg, enum wl_class_t wl_class)
 {
 	struct cfq_queue *queue;
 	int i;
···

 	for (i = 0; i <= SYNC_WORKLOAD; ++i) {
 		/* select the one with lowest rb_key */
-		queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
+		queue = cfq_rb_first(st_for(cfqg, wl_class, i));
 		if (queue &&
 		    (!key_valid || time_before(queue->rb_key, lowest_key))) {
 			lowest_key = queue->rb_key;
···
 	return cur_best;
 }

-static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void
+choose_wl_class_and_type(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
 	unsigned slice;
 	unsigned count;
 	struct cfq_rb_root *st;
 	unsigned group_slice;
-	enum wl_prio_t original_prio = cfqd->serving_prio;
+	enum wl_class_t original_class = cfqd->serving_wl_class;

 	/* Choose next priority. RT > BE > IDLE */
 	if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
-		cfqd->serving_prio = RT_WORKLOAD;
+		cfqd->serving_wl_class = RT_WORKLOAD;
 	else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
-		cfqd->serving_prio = BE_WORKLOAD;
+		cfqd->serving_wl_class = BE_WORKLOAD;
 	else {
-		cfqd->serving_prio = IDLE_WORKLOAD;
+		cfqd->serving_wl_class = IDLE_WORKLOAD;
 		cfqd->workload_expires = jiffies + 1;
 		return;
 	}

-	if (original_prio != cfqd->serving_prio)
+	if (original_class != cfqd->serving_wl_class)
 		goto new_workload;

 	/*
···
 	 * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
 	 * expiration time
 	 */
-	st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+	st = st_for(cfqg, cfqd->serving_wl_class, cfqd->serving_wl_type);
 	count = st->count;

 	/*
···

 new_workload:
 	/* otherwise select new workload type */
-	cfqd->serving_type =
-		cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
-	st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+	cfqd->serving_wl_type = cfq_choose_wl_type(cfqd, cfqg,
+					cfqd->serving_wl_class);
+	st = st_for(cfqg, cfqd->serving_wl_class, cfqd->serving_wl_type);
 	count = st->count;

 	/*
···
 	group_slice = cfq_group_slice(cfqd, cfqg);

 	slice = group_slice * count /
-		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
-		      cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
+		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_wl_class],
+		      cfq_group_busy_queues_wl(cfqd->serving_wl_class, cfqd,
+					       cfqg));

-	if (cfqd->serving_type == ASYNC_WORKLOAD) {
+	if (cfqd->serving_wl_type == ASYNC_WORKLOAD) {
 		unsigned int tmp;

 		/*
···
 	cfqd->serving_group = cfqg;

 	/* Restore the workload type data */
-	if (cfqg->saved_workload_slice) {
-		cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
-		cfqd->serving_type = cfqg->saved_workload;
-		cfqd->serving_prio = cfqg->saved_serving_prio;
+	if (cfqg->saved_wl_slice) {
+		cfqd->workload_expires = jiffies + cfqg->saved_wl_slice;
+		cfqd->serving_wl_type = cfqg->saved_wl_type;
+		cfqd->serving_wl_class = cfqg->saved_wl_class;
 	} else
 		cfqd->workload_expires = jiffies - 1;

-	choose_service_tree(cfqd, cfqg);
+	choose_wl_class_and_type(cfqd, cfqg);
 }

 /*
···
 	spin_lock_irq(cfqd->queue->queue_lock);
 	if (new_cfqq)
 		goto retry;
+	else
+		return &cfqd->oom_cfqq;
 	} else {
 		cfqq = kmem_cache_alloc_node(cfq_pool,
 					     gfp_mask | __GFP_ZERO,
···
 		return true;

 	/* Allow preemption only if we are idling on sync-noidle tree */
-	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
+	if (cfqd->serving_wl_type == SYNC_NOIDLE_WORKLOAD &&
 	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
 	    new_cfqq->service_tree->count == 2 &&
 	    RB_EMPTY_ROOT(&cfqq->sort_list))
···
 	 * doesn't happen
 	 */
 	if (old_type != cfqq_type(cfqq))
-		cfqq->cfqg->saved_workload_slice = 0;
+		cfqq->cfqg->saved_wl_slice = 0;

 	/*
 	 * Put the new queue at the front of the current list,
···
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;

 	if (sync) {
-		struct cfq_rb_root *service_tree;
+		struct cfq_rb_root *st;

 		RQ_CIC(rq)->ttime.last_end_request = now;

 		if (cfq_cfqq_on_rr(cfqq))
-			service_tree = cfqq->service_tree;
+			st = cfqq->service_tree;
 		else
-			service_tree = service_tree_for(cfqq->cfqg,
-				cfqq_prio(cfqq), cfqq_type(cfqq));
-		service_tree->ttime.last_end_request = now;
+			st = st_for(cfqq->cfqg, cfqq_class(cfqq),
+					cfqq_type(cfqq));
+
+		st->ttime.last_end_request = now;
 		if (!time_after(rq->start_time + cfqd->cfq_fifo_expire[1], now))
 			cfqd->last_delayed_sync = now;
 	}
···
 	cfq_init_cfqg_base(cfqd->root_group);
 #endif
 	cfqd->root_group->weight = 2 * CFQ_WEIGHT_DEFAULT;
+	cfqd->root_group->leaf_weight = 2 * CFQ_WEIGHT_DEFAULT;

 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
···
 	.cftypes = cfq_blkcg_files,

 	.pd_init_fn = cfq_pd_init,
+	.pd_offline_fn = cfq_pd_offline,
 	.pd_reset_stats_fn = cfq_pd_reset_stats,
 };
 #endif
block/elevator.c (+4 -19)
···
 /*
  * Merge hash stuff.
  */
-static const int elv_hash_shift = 6;
-#define ELV_HASH_BLOCK(sec)    ((sec) >> 3)
-#define ELV_HASH_FN(sec)    \
-        (hash_long(ELV_HASH_BLOCK((sec)), elv_hash_shift))
-#define ELV_HASH_ENTRIES    (1 << elv_hash_shift)
 #define rq_hash_key(rq)        (blk_rq_pos(rq) + blk_rq_sectors(rq))

 /*
···
                   struct elevator_type *e)
 {
     struct elevator_queue *eq;
-    int i;

     eq = kmalloc_node(sizeof(*eq), GFP_KERNEL | __GFP_ZERO, q->node);
     if (unlikely(!eq))
···
     eq->type = e;
     kobject_init(&eq->kobj, &elv_ktype);
     mutex_init(&eq->sysfs_lock);
-
-    eq->hash = kmalloc_node(sizeof(struct hlist_head) * ELV_HASH_ENTRIES,
-                GFP_KERNEL, q->node);
-    if (!eq->hash)
-        goto err;
-
-    for (i = 0; i < ELV_HASH_ENTRIES; i++)
-        INIT_HLIST_HEAD(&eq->hash[i]);
+    hash_init(eq->hash);

     return eq;
 err:
···
     e = container_of(kobj, struct elevator_queue, kobj);
     elevator_put(e->type);
-    kfree(e->hash);
     kfree(e);
 }
···
 static inline void __elv_rqhash_del(struct request *rq)
 {
-    hlist_del_init(&rq->hash);
+    hash_del(&rq->hash);
 }

 static void elv_rqhash_del(struct request_queue *q, struct request *rq)
···
     struct elevator_queue *e = q->elevator;

     BUG_ON(ELV_ON_HASH(rq));
-    hlist_add_head(&rq->hash, &e->hash[ELV_HASH_FN(rq_hash_key(rq))]);
+    hash_add(e->hash, &rq->hash, rq_hash_key(rq));
 }

 static void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
···
 static struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 {
     struct elevator_queue *e = q->elevator;
-    struct hlist_head *hash_list = &e->hash[ELV_HASH_FN(offset)];
     struct hlist_node *next;
     struct request *rq;

-    hlist_for_each_entry_safe(rq, next, hash_list, hash) {
+    hash_for_each_possible_safe(e->hash, rq, next, hash, offset) {
         BUG_ON(!ELV_ON_HASH(rq));

         if (unlikely(!rq_mergeable(rq))) {
drivers/block/swim3.c (+4 -1)
···
 static void swim3_mb_event(struct macio_dev* mdev, int mb_state)
 {
     struct floppy_state *fs = macio_get_drvdata(mdev);
-    struct swim3 __iomem *sw = fs->swim3;
+    struct swim3 __iomem *sw;

     if (!fs)
         return;
+
+    sw = fs->swim3;
+
     if (mb_state != MB_FD)
         return;
drivers/md/dm.c (-1)
···
             queue_io(md, bio);
         } else {
             /* done with normal IO or empty flush */
-            trace_block_bio_complete(md->queue, bio, io_error);
             bio_endio(bio, io_error);
         }
     }
drivers/md/raid5.c (+1 -10)
···
         return_bi = bi->bi_next;
         bi->bi_next = NULL;
         bi->bi_size = 0;
-        trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-                     bi, 0);
         bio_endio(bi, 0);
         bi = return_bi;
     }
···
     rdev_dec_pending(rdev, conf->mddev);

     if (!error && uptodate) {
-        trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-                     raid_bi, 0);
         bio_endio(raid_bi, 0);
         if (atomic_dec_and_test(&conf->active_aligned_reads))
             wake_up(&conf->wait_for_stripe);
···
         if ( rw == WRITE )
             md_write_end(mddev);

-        trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-                     bi, 0);
         bio_endio(bi, 0);
     }
 }
···
         handled++;
     }
     remaining = raid5_dec_bi_active_stripes(raid_bio);
-    if (remaining == 0) {
-        trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-                     raid_bio, 0);
+    if (remaining == 0)
         bio_endio(raid_bio, 0);
-    }
     if (atomic_dec_and_test(&conf->active_aligned_reads))
         wake_up(&conf->wait_for_stripe);
     return handled;
fs/bio.c (+2)
···
     else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
         error = -EIO;

+    trace_block_bio_complete(bio, error);
+
     if (bio->bi_end_io)
         bio->bi_end_io(bio, error);
 }
fs/block_dev.c (+4 -2)
···
 {
     unsigned bsize = bdev_logical_block_size(bdev);

-    bdev->bd_inode->i_size = size;
+    mutex_lock(&bdev->bd_inode->i_mutex);
+    i_size_write(bdev->bd_inode, size);
+    mutex_unlock(&bdev->bd_inode->i_mutex);
     while (bsize < PAGE_CACHE_SIZE) {
         if (size & bsize)
             break;
···
         }
     }

-    if (!ret && !bdev->bd_openers) {
+    if (!ret) {
         bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
         bdi = blk_get_backing_dev_info(bdev);
         if (bdi == NULL)
fs/buffer.c (+10)
···
 #include <linux/bitops.h>
 #include <linux/mpage.h>
 #include <linux/bit_spinlock.h>
+#include <trace/events/block.h>

 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
···
     bh->b_private = private;
 }
 EXPORT_SYMBOL(init_buffer);
+
+inline void touch_buffer(struct buffer_head *bh)
+{
+    trace_block_touch_buffer(bh);
+    mark_page_accessed(bh->b_page);
+}
+EXPORT_SYMBOL(touch_buffer);

 static int sleep_on_buffer(void *word)
 {
···
 void mark_buffer_dirty(struct buffer_head *bh)
 {
     WARN_ON_ONCE(!buffer_uptodate(bh));
+
+    trace_block_dirty_buffer(bh);

     /*
      * Very *carefully* optimize the it-is-already-dirty case.
fs/fs-writeback.c (+14 -2)
···
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-    if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode))
-        return inode->i_sb->s_op->write_inode(inode, wbc);
+    int ret;
+
+    if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode)) {
+        trace_writeback_write_inode_start(inode, wbc);
+        ret = inode->i_sb->s_op->write_inode(inode, wbc);
+        trace_writeback_write_inode(inode, wbc);
+        return ret;
+    }
     return 0;
 }
···
     int ret;

     WARN_ON(!(inode->i_state & I_SYNC));
+
+    trace_writeback_single_inode_start(inode, wbc, nr_to_write);

     ret = do_writepages(mapping, wbc);
···
      * dirty the inode itself
      */
     if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+        trace_writeback_dirty_inode_start(inode, flags);
+
         if (sb->s_op->dirty_inode)
             sb->s_op->dirty_inode(inode, flags);
+
+        trace_writeback_dirty_inode(inode, flags);
     }

     /*
include/linux/blkdev.h (+2 -1)
···
 #include <linux/gfp.h>
 #include <linux/bsg.h>
 #include <linux/smp.h>
+#include <linux/rcupdate.h>

 #include <asm/scatterlist.h>
···
     /* Throttle data */
     struct throtl_data *td;
 #endif
+    struct rcu_head        rcu_head;
 };

 #define QUEUE_FLAG_QUEUED    1    /* uses generic tag queueing */
···
     unsigned long magic;        /* detect uninitialized use-cases */
     struct list_head list;        /* requests */
     struct list_head cb_list;    /* md requires an unplug callback */
-    unsigned int should_sort;    /* list to be sorted before flushing? */
 };
 #define BLK_MAX_REQUEST_COUNT 16
include/linux/blktrace_api.h (+1)
···
 struct blk_trace {
     int trace_state;
+    bool rq_based;
     struct rchan *rchan;
     unsigned long __percpu *sequence;
     unsigned char __percpu *msg_data;
include/linux/buffer_head.h (+1 -1)
···
 BUFFER_FNS(Unwritten, unwritten)

 #define bh_offset(bh)        ((unsigned long)(bh)->b_data & ~PAGE_MASK)
-#define touch_buffer(bh)    mark_page_accessed(bh->b_page)

 /* If we *know* page->private refers to buffer_heads */
 #define page_buffers(page)                    \
···
 void mark_buffer_dirty(struct buffer_head *bh);
 void init_buffer(struct buffer_head *, bh_end_io_t *, void *);
+void touch_buffer(struct buffer_head *bh);
 void set_bh_page(struct buffer_head *bh,
         struct page *page, unsigned long offset);
 int try_to_free_buffers(struct page *);
include/linux/completion.h (+3)
···
 }

 extern void wait_for_completion(struct completion *);
+extern void wait_for_completion_io(struct completion *);
 extern int wait_for_completion_interruptible(struct completion *x);
 extern int wait_for_completion_killable(struct completion *x);
 extern unsigned long wait_for_completion_timeout(struct completion *x,
                           unsigned long timeout);
+extern unsigned long wait_for_completion_io_timeout(struct completion *x,
+                            unsigned long timeout);
 extern long wait_for_completion_interruptible_timeout(
     struct completion *x, unsigned long timeout);
 extern long wait_for_completion_killable_timeout(
include/linux/elevator.h (+4 -1)
···
 #define _LINUX_ELEVATOR_H

 #include <linux/percpu.h>
+#include <linux/hashtable.h>

 #ifdef CONFIG_BLOCK
···
     struct list_head list;
 };

+#define ELV_HASH_BITS 6
+
 /*
  * each queue has an elevator_queue associated with it
  */
···
     void *elevator_data;
     struct kobject kobj;
     struct mutex sysfs_lock;
-    struct hlist_head *hash;
     unsigned int registered:1;
+    DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
 };
include/trace/events/block.h (+89 -15)
···
 #include <linux/blktrace_api.h>
 #include <linux/blkdev.h>
+#include <linux/buffer_head.h>
 #include <linux/tracepoint.h>

 #define RWBS_LEN    8
+
+DECLARE_EVENT_CLASS(block_buffer,
+
+    TP_PROTO(struct buffer_head *bh),
+
+    TP_ARGS(bh),
+
+    TP_STRUCT__entry (
+        __field(  dev_t,    dev            )
+        __field(  sector_t,    sector            )
+        __field(  size_t,    size            )
+    ),
+
+    TP_fast_assign(
+        __entry->dev        = bh->b_bdev->bd_dev;
+        __entry->sector        = bh->b_blocknr;
+        __entry->size        = bh->b_size;
+    ),
+
+    TP_printk("%d,%d sector=%llu size=%zu",
+        MAJOR(__entry->dev), MINOR(__entry->dev),
+        (unsigned long long)__entry->sector, __entry->size
+    )
+);
+
+/**
+ * block_touch_buffer - mark a buffer accessed
+ * @bh: buffer_head being touched
+ *
+ * Called from touch_buffer().
+ */
+DEFINE_EVENT(block_buffer, block_touch_buffer,
+
+    TP_PROTO(struct buffer_head *bh),
+
+    TP_ARGS(bh)
+);
+
+/**
+ * block_dirty_buffer - mark a buffer dirty
+ * @bh: buffer_head being dirtied
+ *
+ * Called from mark_buffer_dirty().
+ */
+DEFINE_EVENT(block_buffer, block_dirty_buffer,
+
+    TP_PROTO(struct buffer_head *bh),
+
+    TP_ARGS(bh)
+);

 DECLARE_EVENT_CLASS(block_rq_with_error,
···
 /**
  * block_bio_complete - completed all work on the block operation
- * @q: queue holding the block operation
  * @bio: block operation completed
  * @error: io error value
  *
···
  */
 TRACE_EVENT(block_bio_complete,

-    TP_PROTO(struct request_queue *q, struct bio *bio, int error),
+    TP_PROTO(struct bio *bio, int error),

-    TP_ARGS(q, bio, error),
+    TP_ARGS(bio, error),

     TP_STRUCT__entry(
         __field( dev_t,        dev        )
···
     ),

     TP_fast_assign(
-        __entry->dev        = bio->bi_bdev->bd_dev;
+        __entry->dev        = bio->bi_bdev ?
+                      bio->bi_bdev->bd_dev : 0;
         __entry->sector        = bio->bi_sector;
         __entry->nr_sector    = bio->bi_size >> 9;
         __entry->error        = error;
···
           __entry->nr_sector, __entry->error)
 );

-DECLARE_EVENT_CLASS(block_bio,
+DECLARE_EVENT_CLASS(block_bio_merge,

-    TP_PROTO(struct request_queue *q, struct bio *bio),
+    TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),

-    TP_ARGS(q, bio),
+    TP_ARGS(q, rq, bio),

     TP_STRUCT__entry(
         __field( dev_t,        dev            )
···
 /**
  * block_bio_backmerge - merging block operation to the end of an existing operation
  * @q: queue holding operation
+ * @rq: request bio is being merged into
  * @bio: new block operation to merge
  *
  * Merging block request @bio to the end of an existing block request
  * in queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_backmerge,
+DEFINE_EVENT(block_bio_merge, block_bio_backmerge,

-    TP_PROTO(struct request_queue *q, struct bio *bio),
+    TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),

-    TP_ARGS(q, bio)
+    TP_ARGS(q, rq, bio)
 );

 /**
  * block_bio_frontmerge - merging block operation to the beginning of an existing operation
  * @q: queue holding operation
+ * @rq: request bio is being merged into
  * @bio: new block operation to merge
  *
  * Merging block IO operation @bio to the beginning of an existing block
  * operation in queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_frontmerge,
+DEFINE_EVENT(block_bio_merge, block_bio_frontmerge,

-    TP_PROTO(struct request_queue *q, struct bio *bio),
+    TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),

-    TP_ARGS(q, bio)
+    TP_ARGS(q, rq, bio)
 );

 /**
···
  *
  * About to place the block IO operation @bio into queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_queue,
+TRACE_EVENT(block_bio_queue,

     TP_PROTO(struct request_queue *q, struct bio *bio),

-    TP_ARGS(q, bio)
+    TP_ARGS(q, bio),
+
+    TP_STRUCT__entry(
+        __field( dev_t,        dev            )
+        __field( sector_t,    sector            )
+        __field( unsigned int,    nr_sector        )
+        __array( char,        rwbs,    RWBS_LEN    )
+        __array( char,        comm,    TASK_COMM_LEN    )
+    ),
+
+    TP_fast_assign(
+        __entry->dev        = bio->bi_bdev->bd_dev;
+        __entry->sector        = bio->bi_sector;
+        __entry->nr_sector    = bio->bi_size >> 9;
+        blk_fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+        memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+    ),
+
+    TP_printk("%d,%d %s %llu + %u [%s]",
+          MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+          (unsigned long long)__entry->sector,
+          __entry->nr_sector, __entry->comm)
 );

 DECLARE_EVENT_CLASS(block_get_rq,
include/trace/events/writeback.h (+116)
···
 struct wb_writeback_work;

+TRACE_EVENT(writeback_dirty_page,
+
+    TP_PROTO(struct page *page, struct address_space *mapping),
+
+    TP_ARGS(page, mapping),
+
+    TP_STRUCT__entry (
+        __array(char, name, 32)
+        __field(unsigned long, ino)
+        __field(pgoff_t, index)
+    ),
+
+    TP_fast_assign(
+        strncpy(__entry->name,
+            mapping ? dev_name(mapping->backing_dev_info->dev) : "(unknown)", 32);
+        __entry->ino = mapping ? mapping->host->i_ino : 0;
+        __entry->index = page->index;
+    ),
+
+    TP_printk("bdi %s: ino=%lu index=%lu",
+        __entry->name,
+        __entry->ino,
+        __entry->index
+    )
+);
+
+DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
+
+    TP_PROTO(struct inode *inode, int flags),
+
+    TP_ARGS(inode, flags),
+
+    TP_STRUCT__entry (
+        __array(char, name, 32)
+        __field(unsigned long, ino)
+        __field(unsigned long, flags)
+    ),
+
+    TP_fast_assign(
+        struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+
+        /* may be called for files on pseudo FSes w/ unregistered bdi */
+        strncpy(__entry->name,
+            bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
+        __entry->ino = inode->i_ino;
+        __entry->flags = flags;
+    ),
+
+    TP_printk("bdi %s: ino=%lu flags=%s",
+        __entry->name,
+        __entry->ino,
+        show_inode_state(__entry->flags)
+    )
+);
+
+DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode_start,
+
+    TP_PROTO(struct inode *inode, int flags),
+
+    TP_ARGS(inode, flags)
+);
+
+DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode,
+
+    TP_PROTO(struct inode *inode, int flags),
+
+    TP_ARGS(inode, flags)
+);
+
+DECLARE_EVENT_CLASS(writeback_write_inode_template,
+
+    TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+    TP_ARGS(inode, wbc),
+
+    TP_STRUCT__entry (
+        __array(char, name, 32)
+        __field(unsigned long, ino)
+        __field(int, sync_mode)
+    ),
+
+    TP_fast_assign(
+        strncpy(__entry->name,
+            dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+        __entry->ino = inode->i_ino;
+        __entry->sync_mode = wbc->sync_mode;
+    ),
+
+    TP_printk("bdi %s: ino=%lu sync_mode=%d",
+        __entry->name,
+        __entry->ino,
+        __entry->sync_mode
+    )
+);
+
+DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode_start,
+
+    TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+    TP_ARGS(inode, wbc)
+);
+
+DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode,
+
+    TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+    TP_ARGS(inode, wbc)
+);
+
 DECLARE_EVENT_CLASS(writeback_work_class,
     TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
     TP_ARGS(bdi, work),
···
         __entry->nr_to_write,
         __entry->wrote
     )
 );
+
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_start,
+    TP_PROTO(struct inode *inode,
+         struct writeback_control *wbc,
+         unsigned long nr_to_write),
+    TP_ARGS(inode, wbc, nr_to_write)
+);

 DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
kernel/sched/core.c (+52 -5)
···
 EXPORT_SYMBOL(complete_all);

 static inline long __sched
-do_wait_for_common(struct completion *x, long timeout, int state)
+do_wait_for_common(struct completion *x,
+           long (*action)(long), long timeout, int state)
 {
     if (!x->done) {
         DECLARE_WAITQUEUE(wait, current);
···
             }
             __set_current_state(state);
             spin_unlock_irq(&x->wait.lock);
-            timeout = schedule_timeout(timeout);
+            timeout = action(timeout);
             spin_lock_irq(&x->wait.lock);
         } while (!x->done && timeout);
         __remove_wait_queue(&x->wait, &wait);
···
     return timeout ?: 1;
 }

-static long __sched
-wait_for_common(struct completion *x, long timeout, int state)
+static inline long __sched
+__wait_for_common(struct completion *x,
+          long (*action)(long), long timeout, int state)
 {
     might_sleep();

     spin_lock_irq(&x->wait.lock);
-    timeout = do_wait_for_common(x, timeout, state);
+    timeout = do_wait_for_common(x, action, timeout, state);
     spin_unlock_irq(&x->wait.lock);
     return timeout;
+}
+
+static long __sched
+wait_for_common(struct completion *x, long timeout, int state)
+{
+    return __wait_for_common(x, schedule_timeout, timeout, state);
+}
+
+static long __sched
+wait_for_common_io(struct completion *x, long timeout, int state)
+{
+    return __wait_for_common(x, io_schedule_timeout, timeout, state);
 }

 /**
···
     return wait_for_common(x, timeout, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_for_completion_timeout);
+
+/**
+ * wait_for_completion_io: - waits for completion of a task
+ * @x:  holds the state of this particular completion
+ *
+ * This waits to be signaled for completion of a specific task. It is NOT
+ * interruptible and there is no timeout. The caller is accounted as waiting
+ * for IO.
+ */
+void __sched wait_for_completion_io(struct completion *x)
+{
+    wait_for_common_io(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(wait_for_completion_io);
+
+/**
+ * wait_for_completion_io_timeout: - waits for completion of a task (w/timeout)
+ * @x:  holds the state of this particular completion
+ * @timeout:  timeout value in jiffies
+ *
+ * This waits for either a completion of a specific task to be signaled or for a
+ * specified timeout to expire. The timeout is in jiffies. It is not
+ * interruptible. The caller is accounted as waiting for IO.
+ *
+ * The return value is 0 if timed out, and positive (at least 1, or number of
+ * jiffies left till timeout) if completed.
+ */
+unsigned long __sched
+wait_for_completion_io_timeout(struct completion *x, unsigned long timeout)
+{
+    return wait_for_common_io(x, timeout, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(wait_for_completion_io_timeout);

 /**
  * wait_for_completion_interruptible: - waits for completion of a task (w/intr)
kernel/trace/blktrace.c (+25 -3)
···
                          struct request_queue *q,
                          struct request *rq)
 {
+    struct blk_trace *bt = q->blk_trace;
+
+    /* if control ever passes through here, it's a request based driver */
+    if (unlikely(bt && !bt->rq_based))
+        bt->rq_based = true;
+
     blk_add_trace_rq(q, rq, BLK_TA_COMPLETE);
 }
···
     blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
 }

-static void blk_add_trace_bio_complete(void *ignore,
-                       struct request_queue *q, struct bio *bio,
-                       int error)
+static void blk_add_trace_bio_complete(void *ignore, struct bio *bio, int error)
 {
+    struct request_queue *q;
+    struct blk_trace *bt;
+
+    if (!bio->bi_bdev)
+        return;
+
+    q = bdev_get_queue(bio->bi_bdev);
+    bt = q->blk_trace;
+
+    /*
+     * Request based drivers will generate both rq and bio completions.
+     * Ignore bio ones.
+     */
+    if (likely(!bt) || bt->rq_based)
+        return;
+
     blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error);
 }

 static void blk_add_trace_bio_backmerge(void *ignore,
                     struct request_queue *q,
+                    struct request *rq,
                     struct bio *bio)
 {
     blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
···
 static void blk_add_trace_bio_frontmerge(void *ignore,
                      struct request_queue *q,
+                     struct request *rq,
                      struct bio *bio)
 {
     blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
mm/page-writeback.c (+2)
···
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
+    trace_writeback_dirty_page(page, mapping);
+
     if (mapping_cap_account_dirty(mapping)) {
         __inc_zone_page_state(page, NR_FILE_DIRTY);
         __inc_zone_page_state(page, NR_DIRTIED);