Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block

Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.

This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade; finally there's a solution and code to go with it.

Also see last week's writeup on LWN:

http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...
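The "foreign cgroup inode detection" and bdi_writeback switching commits above track which cgroup's pages dominate an inode's writeback and migrate inode ownership accordingly. A rough userspace model of that heuristic (invented names, not kernel code):

```c
#include <assert.h>

/*
 * Illustrative model of foreign-inode detection: writeback tracks inode
 * ownership per cgroup, counts pages seen per cgroup during a writeback
 * pass, and migrates the inode to the dominant writer.  The kernel's
 * actual heuristic differs in detail; all names here are invented.
 */
#define NR_CGROUPS 4

struct inode_model {
	int owner;                      /* cgroup currently owning the inode */
	unsigned long seen[NR_CGROUPS]; /* pages seen per cgroup this pass */
};

/* account one written-back page charged to cgroup @cgrp */
static void wb_account_page(struct inode_model *inode, int cgrp)
{
	inode->seen[cgrp]++;
}

/* at the end of a writeback pass, switch owner if another cgroup dominates */
static void wb_maybe_switch(struct inode_model *inode)
{
	int i, best = inode->owner;

	for (i = 0; i < NR_CGROUPS; i++)
		if (inode->seen[i] > inode->seen[best])
			best = i;
	inode->owner = best;
	for (i = 0; i < NR_CGROUPS; i++)
		inode->seen[i] = 0;
}
```

The real implementation keeps the decision cheap by deferring it to writeback time, which is also why per-page ownership is not followed exactly (see the documentation hunk below).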

+3914 -1266
+78 -5
Documentation/cgroups/blkio-controller.txt
···
 IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
 on individual groups and throughput should improve.
 
-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
-  still system wide and not per group. Hence we will not see service
-  differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism. Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two. Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback. While memory
+controller tracks ownership per page, writeback operates on inode
+basis. cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient. The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported. Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+  Should be called for each bio carrying writeback data and associates
+  the bio with the inode's owner cgroup. Can be called anytime
+  between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+  Should be called for each data segment being written out. While
+  this function doesn't care exactly when it's called during the
+  writeback session, it's the easiest and most natural to call it as
+  data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting MS_CGROUPWB in ->s_flags. This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup. Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion. There is no one easy solution
+for the problem. Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
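The bandwidth-proportional distribution described above (absolute parameters like vm.dirty_bytes split across cgroups according to measured writeback bandwidth) can be sketched as a simple proportional split. This is an illustrative userspace model, not the kernel's wb_domain implementation; all names are invented:

```c
#include <assert.h>

/*
 * Model: split an absolute dirty limit across cgroups in proportion
 * to each cgroup's measured writeback bandwidth.
 */
struct wb_model {
	const char *name;
	unsigned long long write_bw;    /* measured writeback bandwidth */
	unsigned long long dirty_limit; /* computed share of dirty_bytes */
};

static void distribute_dirty_bytes(struct wb_model *wbs, int nr,
				   unsigned long long dirty_bytes)
{
	unsigned long long total_bw = 0;
	int i;

	for (i = 0; i < nr; i++)
		total_bw += wbs[i].write_bw;
	for (i = 0; i < nr; i++)
		wbs[i].dirty_limit = total_bw ?
			dirty_bytes * wbs[i].write_bw / total_bw : 0;
}
```

A cgroup writing at three times the bandwidth of another thus receives three times the share of the global limit; the kernel additionally smooths the bandwidth estimate over time.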
+1
Documentation/cgroups/memory.txt
···
 pgpgout - # of uncharging events to the memory cgroup. The uncharging
     event happens each time a page is unaccounted from the cgroup.
 swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
 writeback - # of bytes of file/anon cache that are queued for syncing to
     disk.
 inactive_anon - # of bytes of anonymous and swap cache memory on inactive
+24 -11
block/bio.c
···
 EXPORT_SYMBOL(bioset_create_nobvec);
 
 #ifdef CONFIG_BLK_CGROUP
+
+/**
+ * bio_associate_blkcg - associate a bio with the specified blkcg
+ * @bio: target bio
+ * @blkcg_css: css of the blkcg to associate
+ *
+ * Associate @bio with the blkcg specified by @blkcg_css. Block layer will
+ * treat @bio as if it were issued by a task which belongs to the blkcg.
+ *
+ * This function takes an extra reference of @blkcg_css which will be put
+ * when @bio is released. The caller must own @bio and is responsible for
+ * synchronizing calls to this function.
+ */
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
+{
+	if (unlikely(bio->bi_css))
+		return -EBUSY;
+	css_get(blkcg_css);
+	bio->bi_css = blkcg_css;
+	return 0;
+}
+
 /**
  * bio_associate_current - associate a bio with %current
  * @bio: target bio
···
 int bio_associate_current(struct bio *bio)
 {
 	struct io_context *ioc;
-	struct cgroup_subsys_state *css;
 
-	if (bio->bi_ioc)
+	if (bio->bi_css)
 		return -EBUSY;
 
 	ioc = current->io_context;
 	if (!ioc)
 		return -ENOENT;
 
-	/* acquire active ref on @ioc and associate */
 	get_io_context_active(ioc);
 	bio->bi_ioc = ioc;
-
-	/* associate blkcg if exists */
-	rcu_read_lock();
-	css = task_css(current, blkio_cgrp_id);
-	if (css && css_tryget_online(css))
-		bio->bi_css = css;
-	rcu_read_unlock();
-
+	bio->bi_css = task_get_css(current, blkio_cgrp_id);
 	return 0;
 }
 
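The new bio_associate_blkcg() shown above has a small contract: association takes an extra css reference, and a bio that is already associated reports -EBUSY. A userspace model of that contract with stand-in types (not kernel code):

```c
#include <assert.h>
#include <stddef.h>

#define EBUSY 16

/* stand-ins for struct cgroup_subsys_state and struct bio */
struct css_model { int refcnt; };
struct bio_model { struct css_model *bi_css; };

static int model_associate_blkcg(struct bio_model *bio, struct css_model *css)
{
	if (bio->bi_css)
		return -EBUSY;    /* already associated */
	css->refcnt++;            /* models css_get() */
	bio->bi_css = css;
	return 0;
}

/* dropping the bio puts the css reference, as the block layer does on release */
static void model_bio_release(struct bio_model *bio)
{
	if (bio->bi_css) {
		bio->bi_css->refcnt--;  /* models css_put() */
		bio->bi_css = NULL;
	}
}
```

Holding a reference for the bio's lifetime is what lets writeback submit the bio long after the dirtying task has exited or moved to another cgroup.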
+65 -58
block/blk-cgroup.c
···
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/slab.h>
 #include <linux/genhd.h>
 #include <linux/delay.h>
 #include <linux/atomic.h>
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
 #include "blk.h"
 
 #define MAX_KEY_LEN 100
···
 
 struct blkcg blkcg_root;
 EXPORT_SYMBOL_GPL(blkcg_root);
+
+struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
···
 			    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
+	struct bdi_writeback_congested *wb_congested;
 	int i, ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
···
 		goto err_free_blkg;
 	}
 
+	wb_congested = wb_congested_get_create(&q->backing_dev_info,
+					       blkcg->css.id, GFP_ATOMIC);
+	if (!wb_congested) {
+		ret = -ENOMEM;
+		goto err_put_css;
+	}
+
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto err_put_css;
+			goto err_put_congested;
 		}
 	}
 	blkg = new_blkg;
+	blkg->wb_congested = wb_congested;
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
 		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
 		if (WARN_ON_ONCE(!blkg->parent)) {
 			ret = -EINVAL;
-			goto err_put_css;
+			goto err_put_congested;
 		}
 		blkg_get(blkg->parent);
 	}
···
 	blkg->online = true;
 	spin_unlock(&blkcg->lock);
 
-	if (!ret) {
-		if (blkcg == &blkcg_root) {
-			q->root_blkg = blkg;
-			q->root_rl.blkg = blkg;
-		}
+	if (!ret)
 		return blkg;
-	}
 
 	/* @blkg failed fully initialized, use the usual release path */
 	blkg_put(blkg);
 	return ERR_PTR(ret);
 
+err_put_congested:
+	wb_congested_put(wb_congested);
 err_put_css:
 	css_put(&blkcg->css);
 err_free_blkg:
···
 	rcu_assign_pointer(blkcg->blkg_hint, NULL);
 
 	/*
-	 * If root blkg is destroyed. Just clear the pointer since root_rl
-	 * does not take reference on root blkg.
-	 */
-	if (blkcg == &blkcg_root) {
-		blkg->q->root_blkg = NULL;
-		blkg->q->root_rl.blkg = NULL;
-	}
-
-	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
 	 */
···
 	css_put(&blkg->blkcg->css);
 	if (blkg->parent)
 		blkg_put(blkg->parent);
+
+	wb_congested_put(blkg->wb_congested);
 
 	blkg_free(blkg);
 }
···
 	}
 
 	spin_unlock_irq(&blkcg->lock);
+
+	wb_blkcg_offline(blkcg);
 }
 
 static void blkcg_css_free(struct cgroup_subsys_state *css)
···
 	spin_lock_init(&blkcg->lock);
 	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+	INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
 	return &blkcg->css;
 
 free_pd_blkcg:
···
  */
 int blkcg_init_queue(struct request_queue *q)
 {
-	might_sleep();
+	struct blkcg_gq *new_blkg, *blkg;
+	bool preloaded;
+	int ret;
 
-	return blk_throtl_init(q);
+	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!new_blkg)
+		return -ENOMEM;
+
+	preloaded = !radix_tree_preload(GFP_KERNEL);
+
+	/*
+	 * Make sure the root blkg exists and count the existing blkgs. As
+	 * @q is bypassing at this point, blkg_lookup_create() can't be
+	 * used. Open code insertion.
+	 */
+	rcu_read_lock();
+	spin_lock_irq(q->queue_lock);
+	blkg = blkg_create(&blkcg_root, q, new_blkg);
+	spin_unlock_irq(q->queue_lock);
+	rcu_read_unlock();
+
+	if (preloaded)
+		radix_tree_preload_end();
+
+	if (IS_ERR(blkg)) {
+		kfree(new_blkg);
+		return PTR_ERR(blkg);
+	}
+
+	q->root_blkg = blkg;
+	q->root_rl.blkg = blkg;
+
+	ret = blk_throtl_init(q);
+	if (ret) {
+		spin_lock_irq(q->queue_lock);
+		blkg_destroy_all(q);
+		spin_unlock_irq(q->queue_lock);
+	}
+	return ret;
 }
 
 /**
···
 {
 	LIST_HEAD(pds);
 	LIST_HEAD(cpds);
-	struct blkcg_gq *blkg, *new_blkg;
+	struct blkcg_gq *blkg;
 	struct blkg_policy_data *pd, *nd;
 	struct blkcg_policy_data *cpd, *cnd;
 	int cnt = 0, ret;
-	bool preloaded;
 
 	if (blkcg_policy_enabled(q, pol))
 		return 0;
 
-	/* preallocations for root blkg */
-	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
-	if (!new_blkg)
-		return -ENOMEM;
-
+	/* count and allocate policy_data for all existing blkgs */
 	blk_queue_bypass_start(q);
-
-	preloaded = !radix_tree_preload(GFP_KERNEL);
-
-	/*
-	 * Make sure the root blkg exists and count the existing blkgs. As
-	 * @q is bypassing at this point, blkg_lookup_create() can't be
-	 * used. Open code it.
-	 */
 	spin_lock_irq(q->queue_lock);
-
-	rcu_read_lock();
-	blkg = __blkg_lookup(&blkcg_root, q, false);
-	if (blkg)
-		blkg_free(new_blkg);
-	else
-		blkg = blkg_create(&blkcg_root, q, new_blkg);
-	rcu_read_unlock();
-
-	if (preloaded)
-		radix_tree_preload_end();
-
-	if (IS_ERR(blkg)) {
-		ret = PTR_ERR(blkg);
-		goto out_unlock;
-	}
-
 	list_for_each_entry(blkg, &q->blkg_list, q_node)
 		cnt++;
-
 	spin_unlock_irq(q->queue_lock);
 
 	/*
···
 	spin_lock_irq(q->queue_lock);
 
 	__clear_bit(pol->plid, q->blkcg_pols);
-
-	/* if no policy is left, no need for blkgs - shoot them down */
-	if (bitmap_empty(q->blkcg_pols, BLKCG_MAX_POLS))
-		blkg_destroy_all(q);
 
 	list_for_each_entry(blkg, &q->blkg_list, q_node) {
 		/* grab blkcg lock too while removing @pd from @blkg */
+30 -2
block/blk-cgroup.h → include/linux/blk-cgroup.h
···
 	struct hlist_head blkg_list;
 
 	struct blkcg_policy_data *pd[BLKCG_MAX_POLS];
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct list_head cgwb_list;
+#endif
 };
 
 struct blkg_stat {
···
 	struct hlist_node blkcg_node;
 	struct blkcg *blkcg;
 
+	/*
+	 * Each blkg gets congested separately and the congestion state is
+	 * propagated to the matching bdi_writeback_congested.
+	 */
+	struct bdi_writeback_congested *wb_congested;
+
 	/* all non-root blkcg_gq's are guaranteed to have access to parent */
 	struct blkcg_gq *parent;
 
···
 };
 
 extern struct blkcg blkcg_root;
+extern struct cgroup_subsys_state * const blkcg_root_css;
 
 struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
···
 	if (bio && bio->bi_css)
 		return css_to_blkcg(bio->bi_css);
 	return task_blkcg(current);
+}
+
+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+	return task_get_css(task, blkio_cgrp_id);
 }
 
 /**
···
 
 #else	/* CONFIG_BLK_CGROUP */
 
-struct cgroup;
-struct blkcg;
+struct blkcg {
+};
 
 struct blkg_policy_data {
 };
···
 
 struct blkcg_policy {
 };
+
+#define blkcg_root_css	((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+	return NULL;
+}
+
+#ifdef CONFIG_BLOCK
 
 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
 static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
···
 #define blk_queue_for_each_rl(rl, q)	\
 	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
 
+#endif	/* CONFIG_BLOCK */
 #endif	/* CONFIG_BLK_CGROUP */
 #endif	/* _BLK_CGROUP_H */
+43 -29
block/blk-core.c
···
 #include <linux/delay.h>
 #include <linux/ratelimit.h>
 #include <linux/pm_runtime.h>
+#include <linux/blk-cgroup.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
 
 #include "blk.h"
-#include "blk-cgroup.h"
 #include "blk-mq.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
···
  * Controlling structure to kblockd
  */
 static struct workqueue_struct *kblockd_workqueue;
+
+static void blk_clear_congested(struct request_list *rl, int sync)
+{
+#ifdef CONFIG_CGROUP_WRITEBACK
+	clear_wb_congested(rl->blkg->wb_congested, sync);
+#else
+	/*
+	 * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
+	 * flip its congestion state for events on other blkcgs.
+	 */
+	if (rl == &rl->q->root_rl)
+		clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+#endif
+}
+
+static void blk_set_congested(struct request_list *rl, int sync)
+{
+#ifdef CONFIG_CGROUP_WRITEBACK
+	set_wb_congested(rl->blkg->wb_congested, sync);
+#else
+	/* see blk_clear_congested() */
+	if (rl == &rl->q->root_rl)
+		set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+#endif
+}
 
 void blk_queue_congestion_threshold(struct request_queue *q)
 {
···
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.state = 0;
-	q->backing_dev_info.capabilities = 0;
+	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
 	q->backing_dev_info.name = "block";
 	q->node = node_id;
 
···
 {
 	struct request_queue *q = rl->q;
 
-	/*
-	 * bdi isn't aware of blkcg yet. As all async IOs end up root
-	 * blkcg anyway, just use root blkcg state.
-	 */
-	if (rl == &q->root_rl &&
-	    rl->count[sync] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, sync);
+	if (rl->count[sync] < queue_congestion_off_threshold(q))
+		blk_clear_congested(rl, sync);
 
 	if (rl->count[sync] + 1 <= q->nr_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
···
 int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct request_list *rl;
+	int on_thresh, off_thresh;
 
 	spin_lock_irq(q->queue_lock);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
-
-	/* congestion isn't cgroup aware and follows root blkcg for now */
-	rl = &q->root_rl;
-
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
-		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, BLK_RW_SYNC);
-
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
-		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, BLK_RW_ASYNC);
+	on_thresh = queue_congestion_on_threshold(q);
+	off_thresh = queue_congestion_off_threshold(q);
 
 	blk_queue_for_each_rl(rl, q) {
+		if (rl->count[BLK_RW_SYNC] >= on_thresh)
+			blk_set_congested(rl, BLK_RW_SYNC);
+		else if (rl->count[BLK_RW_SYNC] < off_thresh)
+			blk_clear_congested(rl, BLK_RW_SYNC);
+
+		if (rl->count[BLK_RW_ASYNC] >= on_thresh)
+			blk_set_congested(rl, BLK_RW_ASYNC);
+		else if (rl->count[BLK_RW_ASYNC] < off_thresh)
+			blk_clear_congested(rl, BLK_RW_ASYNC);
+
 		if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
 			blk_set_rl_full(rl, BLK_RW_SYNC);
 		} else {
···
 			}
 		}
 	}
-	/*
-	 * bdi isn't aware of blkcg yet. As all async IOs end up
-	 * root blkcg anyway, just use root blkcg state.
-	 */
-	if (rl == &q->root_rl)
-		blk_set_queue_congested(q, is_sync);
+	blk_set_congested(rl, is_sync);
 
 	/*
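The reworked blk_update_nr_requests() above applies congestion thresholds per request_list with hysteresis: the congested state is set at the on-threshold and cleared only below the lower off-threshold, so it does not flap when the count hovers near the boundary. A minimal model of that threshold logic (invented names, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* hysteresis: set congested at on_thresh, clear only below off_thresh */
struct rl_model {
	int count;        /* allocated requests on this request_list */
	bool congested;
};

static void rl_update(struct rl_model *rl, int on_thresh, int off_thresh)
{
	if (rl->count >= on_thresh)
		rl->congested = true;
	else if (rl->count < off_thresh)
		rl->congested = false;
	/* between the two thresholds the previous state is kept */
}
```

With cgroup writeback, this state is now tracked per blkg via bdi_writeback_congested rather than once per queue, so one congested cgroup no longer throttles writeback for all of them.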
+1
block/blk-integrity.c
···
  */
 
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/mempool.h>
 #include <linux/bio.h>
 #include <linux/scatterlist.h>
+2 -1
block/blk-sysfs.c
···
 #include <linux/module.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
+#include <linux/blk-cgroup.h>
 
 #include "blk.h"
-#include "blk-cgroup.h"
 #include "blk-mq.h"
 
 struct queue_sysfs_entry {
+1 -1
block/blk-throttle.c
···
 #include <linux/blkdev.h>
 #include <linux/bio.h>
 #include <linux/blktrace_api.h>
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
 #include "blk.h"
 
 /* Max dispatch from a group in 1 round */
+1
block/bounce.c
···
 #include <linux/pagemap.h>
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
+1 -1
block/cfq-iosched.c
···
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include <linux/blk-cgroup.h>
 #include "blk.h"
-#include "blk-cgroup.h"
 
 /*
  * tunables
+1 -1
block/elevator.c
···
 #include <linux/hash.h>
 #include <linux/uaccess.h>
 #include <linux/pm_runtime.h>
+#include <linux/blk-cgroup.h>
 
 #include <trace/events/block.h>
 
 #include "blk.h"
-#include "blk-cgroup.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
+1
block/genhd.c
···
 #include <linux/kdev_t.h>
 #include <linux/kernel.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/proc_fs.h>
+1
drivers/block/drbd/drbd_int.h
···
 #include <linux/mutex.h>
 #include <linux/major.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/genhd.h>
 #include <linux/idr.h>
 #include <net/tcp.h>
+5 -5
drivers/block/drbd/drbd_main.c
···
  * @congested_data: User data
  * @bdi_bits: Bits the BDI flusher thread is currently interested in
  *
- * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
+ * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
  */
 static int drbd_congested(void *congested_data, int bdi_bits)
 {
···
 	}
 
 	if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
-		r |= (1 << BDI_async_congested);
+		r |= (1 << WB_async_congested);
 		/* Without good local data, we would need to read from remote,
 		 * and that would need the worker thread as well, which is
 		 * currently blocked waiting for that usermode helper to
 		 * finish.
 		 */
 		if (!get_ldev_if_state(device, D_UP_TO_DATE))
-			r |= (1 << BDI_sync_congested);
+			r |= (1 << WB_sync_congested);
 		else
 			put_ldev(device);
 		r &= bdi_bits;
···
 		reason = 'b';
 	}
 
-	if (bdi_bits & (1 << BDI_async_congested) &&
+	if (bdi_bits & (1 << WB_async_congested) &&
 	    test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
-		r |= (1 << BDI_async_congested);
+		r |= (1 << WB_async_congested);
 		reason = reason == 'b' ? 'a' : 'n';
 	}
 
+1
drivers/block/pktcdvd.c
···
 #include <linux/freezer.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/backing-dev.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_ioctl.h>
 #include <scsi/scsi.h>
+1
drivers/char/raw.c
···
 #include <linux/fs.h>
 #include <linux/major.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/module.h>
 #include <linux/raw.h>
 #include <linux/capability.h>
+1
drivers/md/bcache/request.c
···
 #include <linux/module.h>
 #include <linux/hash.h>
 #include <linux/random.h>
+#include <linux/backing-dev.h>
 
 #include <trace/events/bcache.h>
 
+1 -1
drivers/md/dm.c
···
 	 * the query about congestion status of request_queue
 	 */
 	if (dm_request_based(md))
-		r = md->queue->backing_dev_info.state &
+		r = md->queue->backing_dev_info.wb.state &
 		    bdi_bits;
 	else
 		r = dm_table_any_congested(map, bdi_bits);
+1
drivers/md/dm.h
···
 #include <linux/device-mapper.h>
 #include <linux/list.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/hdreg.h>
 #include <linux/completion.h>
 #include <linux/kobject.h>
+1
drivers/md/md.h
···
 #define _MD_MD_H
 
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mm.h>
+2 -2
drivers/md/raid1.c
···
 	struct r1conf *conf = mddev->private;
 	int i, ret = 0;
 
-	if ((bits & (1 << BDI_async_congested)) &&
+	if ((bits & (1 << WB_async_congested)) &&
 	    conf->pending_count >= max_queued_requests)
 		return 1;
 
···
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
-			if ((bits & (1<<BDI_async_congested)) || 1)
+			if ((bits & (1 << WB_async_congested)) || 1)
 				ret |= bdi_congested(&q->backing_dev_info, bits);
 			else
 				ret &= bdi_congested(&q->backing_dev_info, bits);
+1 -1
drivers/md/raid10.c
···
 	struct r10conf *conf = mddev->private;
 	int i, ret = 0;
 
-	if ((bits & (1 << BDI_async_congested)) &&
+	if ((bits & (1 << WB_async_congested)) &&
 	    conf->pending_count >= max_queued_requests)
 		return 1;
 
+1
drivers/mtd/devices/block2mtd.c
···
 #include <linux/delay.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/pagemap.h>
 #include <linux/list.h>
+1 -3
drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
···
 	if (PagePrivate(page))
 		page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
 
-	if (TestClearPageDirty(page))
-		account_page_cleaned(page, mapping);
-
+	cancel_dirty_page(page);
 	ClearPageMappedToDisk(page);
 	ll_delete_from_page_cache(page);
 }
+22 -28
fs/9p/v9fs.c
···
 struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
 		  const char *dev_name, char *data)
 {
-	int retval = -EINVAL;
 	struct p9_fid *fid;
-	int rc;
+	int rc = -ENOMEM;
 
 	v9ses->uname = kstrdup(V9FS_DEFUSER, GFP_KERNEL);
 	if (!v9ses->uname)
-		return ERR_PTR(-ENOMEM);
+		goto err_names;
 
 	v9ses->aname = kstrdup(V9FS_DEFANAME, GFP_KERNEL);
-	if (!v9ses->aname) {
-		kfree(v9ses->uname);
-		return ERR_PTR(-ENOMEM);
-	}
+	if (!v9ses->aname)
+		goto err_names;
 	init_rwsem(&v9ses->rename_sem);
 
 	rc = bdi_setup_and_register(&v9ses->bdi, "9p");
-	if (rc) {
-		kfree(v9ses->aname);
-		kfree(v9ses->uname);
-		return ERR_PTR(rc);
-	}
-
-	spin_lock(&v9fs_sessionlist_lock);
-	list_add(&v9ses->slist, &v9fs_sessionlist);
-	spin_unlock(&v9fs_sessionlist_lock);
+	if (rc)
+		goto err_names;
 
 	v9ses->uid = INVALID_UID;
 	v9ses->dfltuid = V9FS_DEFUID;
···
 
 	v9ses->clnt = p9_client_create(dev_name, data);
 	if (IS_ERR(v9ses->clnt)) {
-		retval = PTR_ERR(v9ses->clnt);
-		v9ses->clnt = NULL;
+		rc = PTR_ERR(v9ses->clnt);
 		p9_debug(P9_DEBUG_ERROR, "problem initializing 9p client\n");
-		goto error;
+		goto err_bdi;
 	}
 
 	v9ses->flags = V9FS_ACCESS_USER;
···
 	}
 
 	rc = v9fs_parse_options(v9ses, data);
-	if (rc < 0) {
-		retval = rc;
-		goto error;
-	}
+	if (rc < 0)
+		goto err_clnt;
 
 	v9ses->maxdata = v9ses->clnt->msize - P9_IOHDRSZ;
 
···
 	fid = p9_client_attach(v9ses->clnt, NULL, v9ses->uname, INVALID_UID,
 							v9ses->aname);
 	if (IS_ERR(fid)) {
-		retval = PTR_ERR(fid);
-		fid = NULL;
+		rc = PTR_ERR(fid);
 		p9_debug(P9_DEBUG_ERROR, "cannot attach\n");
-		goto error;
+		goto err_clnt;
 	}
 
 	if ((v9ses->flags & V9FS_ACCESS_MASK) == V9FS_ACCESS_SINGLE)
···
 	/* register the session for caching */
 	v9fs_cache_session_get_cookie(v9ses);
 #endif
+	spin_lock(&v9fs_sessionlist_lock);
+	list_add(&v9ses->slist, &v9fs_sessionlist);
+	spin_unlock(&v9fs_sessionlist_lock);
 
 	return fid;
 
-error:
+err_clnt:
+	p9_client_destroy(v9ses->clnt);
+err_bdi:
 	bdi_destroy(&v9ses->bdi);
-	return ERR_PTR(retval);
+err_names:
+	kfree(v9ses->uname);
+	kfree(v9ses->aname);
+	return ERR_PTR(rc);
 }
 
 /**
+2 -6
fs/9p/vfs_super.c
···
 	fid = v9fs_session_init(v9ses, dev_name, data);
 	if (IS_ERR(fid)) {
 		retval = PTR_ERR(fid);
-		/*
-		 * we need to call session_close to tear down some
-		 * of the data structure setup by session_init
-		 */
-		goto close_session;
+		goto free_session;
 	}
 
 	sb = sget(fs_type, NULL, v9fs_set_super, flags, v9ses);
···
 
 clunk_fid:
 	p9_client_clunk(fid);
-close_session:
 	v9fs_session_close(v9ses);
+free_session:
 	kfree(v9ses);
 	return ERR_PTR(retval);
 
+3 -6
fs/block_dev.c
···
 #include <linux/device_cgroup.h>
 #include <linux/highmem.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/module.h>
 #include <linux/blkpg.h>
 #include <linux/magic.h>
···
 	.kill_sb	= kill_anon_super,
 };
 
-static struct super_block *blockdev_superblock __read_mostly;
+struct super_block *blockdev_superblock __read_mostly;
+EXPORT_SYMBOL_GPL(blockdev_superblock);
 
 void __init bdev_cache_init(void)
 {
···
 		spin_unlock(&bdev_lock);
 	}
 	return bdev;
 }
 
-int sb_is_blkdev_sb(struct super_block *sb)
-{
-	return sb == blockdev_superblock;
-}
-
 /* Call when you free inode */
+49 -15
fs/buffer.c
··· 30 30 #include <linux/quotaops.h> 31 31 #include <linux/highmem.h> 32 32 #include <linux/export.h> 33 + #include <linux/backing-dev.h> 33 34 #include <linux/writeback.h> 34 35 #include <linux/hash.h> 35 36 #include <linux/suspend.h> ··· 45 44 #include <trace/events/block.h> 46 45 47 46 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); 47 + static int submit_bh_wbc(int rw, struct buffer_head *bh, 48 + unsigned long bio_flags, 49 + struct writeback_control *wbc); 48 50 49 51 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) 50 52 ··· 627 623 * 628 624 * If warn is true, then emit a warning if the page is not uptodate and has 629 625 * not been truncated. 626 + * 627 + * The caller must hold mem_cgroup_begin_page_stat() lock. 630 628 */ 631 - static void __set_page_dirty(struct page *page, 632 - struct address_space *mapping, int warn) 629 + static void __set_page_dirty(struct page *page, struct address_space *mapping, 630 + struct mem_cgroup *memcg, int warn) 633 631 { 634 632 unsigned long flags; 635 633 636 634 spin_lock_irqsave(&mapping->tree_lock, flags); 637 635 if (page->mapping) { /* Race with truncate? 
*/ 638 636 WARN_ON_ONCE(warn && !PageUptodate(page)); 639 - account_page_dirtied(page, mapping); 637 + account_page_dirtied(page, mapping, memcg); 640 638 radix_tree_tag_set(&mapping->page_tree, 641 639 page_index(page), PAGECACHE_TAG_DIRTY); 642 640 } 643 641 spin_unlock_irqrestore(&mapping->tree_lock, flags); 644 - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 645 642 } 646 643 647 644 /* ··· 673 668 int __set_page_dirty_buffers(struct page *page) 674 669 { 675 670 int newly_dirty; 671 + struct mem_cgroup *memcg; 676 672 struct address_space *mapping = page_mapping(page); 677 673 678 674 if (unlikely(!mapping)) ··· 689 683 bh = bh->b_this_page; 690 684 } while (bh != head); 691 685 } 686 + /* 687 + * Use mem_group_begin_page_stat() to keep PageDirty synchronized with 688 + * per-memcg dirty page counters. 689 + */ 690 + memcg = mem_cgroup_begin_page_stat(page); 692 691 newly_dirty = !TestSetPageDirty(page); 693 692 spin_unlock(&mapping->private_lock); 694 693 695 694 if (newly_dirty) 696 - __set_page_dirty(page, mapping, 1); 695 + __set_page_dirty(page, mapping, memcg, 1); 696 + 697 + mem_cgroup_end_page_stat(memcg); 698 + 699 + if (newly_dirty) 700 + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 701 + 697 702 return newly_dirty; 698 703 } 699 704 EXPORT_SYMBOL(__set_page_dirty_buffers); ··· 1175 1158 1176 1159 if (!test_set_buffer_dirty(bh)) { 1177 1160 struct page *page = bh->b_page; 1161 + struct address_space *mapping = NULL; 1162 + struct mem_cgroup *memcg; 1163 + 1164 + memcg = mem_cgroup_begin_page_stat(page); 1178 1165 if (!TestSetPageDirty(page)) { 1179 - struct address_space *mapping = page_mapping(page); 1166 + mapping = page_mapping(page); 1180 1167 if (mapping) 1181 - __set_page_dirty(page, mapping, 0); 1168 + __set_page_dirty(page, mapping, memcg, 0); 1182 1169 } 1170 + mem_cgroup_end_page_stat(memcg); 1171 + if (mapping) 1172 + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 1183 1173 } 1184 1174 } 1185 1175 
EXPORT_SYMBOL(mark_buffer_dirty); ··· 1708 1684 struct buffer_head *bh, *head; 1709 1685 unsigned int blocksize, bbits; 1710 1686 int nr_underway = 0; 1711 - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? 1712 - WRITE_SYNC : WRITE); 1687 + int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); 1713 1688 1714 1689 head = create_page_buffers(page, inode, 1715 1690 (1 << BH_Dirty)|(1 << BH_Uptodate)); ··· 1797 1774 do { 1798 1775 struct buffer_head *next = bh->b_this_page; 1799 1776 if (buffer_async_write(bh)) { 1800 - submit_bh(write_op, bh); 1777 + submit_bh_wbc(write_op, bh, 0, wbc); 1801 1778 nr_underway++; 1802 1779 } 1803 1780 bh = next; ··· 1851 1828 struct buffer_head *next = bh->b_this_page; 1852 1829 if (buffer_async_write(bh)) { 1853 1830 clear_buffer_dirty(bh); 1854 - submit_bh(write_op, bh); 1831 + submit_bh_wbc(write_op, bh, 0, wbc); 1855 1832 nr_underway++; 1856 1833 } 1857 1834 bh = next; ··· 3016 2993 } 3017 2994 } 3018 2995 3019 - int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) 2996 + static int submit_bh_wbc(int rw, struct buffer_head *bh, 2997 + unsigned long bio_flags, struct writeback_control *wbc) 3020 2998 { 3021 2999 struct bio *bio; 3022 3000 ··· 3038 3014 * submit_bio -> generic_make_request may further map this bio around 3039 3015 */ 3040 3016 bio = bio_alloc(GFP_NOIO, 1); 3017 + 3018 + if (wbc) { 3019 + wbc_init_bio(wbc, bio); 3020 + wbc_account_io(wbc, bh->b_page, bh->b_size); 3021 + } 3041 3022 3042 3023 bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); 3043 3024 bio->bi_bdev = bh->b_bdev; ··· 3068 3039 submit_bio(rw, bio); 3069 3040 return 0; 3070 3041 } 3042 + 3043 + int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) 3044 + { 3045 + return submit_bh_wbc(rw, bh, bio_flags, NULL); 3046 + } 3071 3047 EXPORT_SYMBOL_GPL(_submit_bh); 3072 3048 3073 3049 int submit_bh(int rw, struct buffer_head *bh) 3074 3050 { 3075 - return _submit_bh(rw, bh, 0); 3051 + return 
submit_bh_wbc(rw, bh, 0, NULL); 3076 3052 } 3077 3053 EXPORT_SYMBOL(submit_bh); 3078 3054 ··· 3266 3232 * to synchronise against __set_page_dirty_buffers and prevent the 3267 3233 * dirty bit from being lost. 3268 3234 */ 3269 - if (ret && TestClearPageDirty(page)) 3270 - account_page_cleaned(page, mapping); 3235 + if (ret) 3236 + cancel_dirty_page(page); 3271 3237 spin_unlock(&mapping->private_lock); 3272 3238 out: 3273 3239 if (buffers_to_free) {
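The fs/buffer.c hunk turns the exported submit_bh() and _submit_bh() into thin wrappers over a new internal submit_bh_wbc() that takes an optional writeback_control, so only the writeback path pays for cgroup IO accounting while existing callers keep their signatures. A hedged user-space sketch of that wrapper-with-optional-context shape (do_submit, submit_plain, submit_with_wbc and this struct wbc are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for writeback_control's accounting state. */
struct wbc {
	long bytes_accounted;
};

/* One internal worker takes the optional context... */
static int do_submit(int rw, long size, struct wbc *wbc)
{
	/* only account when a writeback context was supplied,
	 * mirroring the "if (wbc) wbc_account_io(...)" branch */
	if (wbc)
		wbc->bytes_accounted += size;
	(void)rw;
	return 0;
}

/* ...the old public entry point keeps its signature and passes NULL... */
static int submit_plain(int rw, long size)
{
	return do_submit(rw, size, NULL);
}

/* ...and the writeback path passes its context for accounting. */
static int submit_with_wbc(int rw, long size, struct wbc *wbc)
{
	return do_submit(rw, size, wbc);
}
```

The design choice keeps the accounting hook out of the common submit path for callers that have no writeback context.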
+1
fs/ext2/super.c
··· 882 882 sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | 883 883 ((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? 884 884 MS_POSIXACL : 0); 885 + sb->s_iflags |= SB_I_CGROUPWB; 885 886 886 887 if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV && 887 888 (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
+1
fs/ext4/extents.c
··· 39 39 #include <linux/slab.h> 40 40 #include <asm/uaccess.h> 41 41 #include <linux/fiemap.h> 42 + #include <linux/backing-dev.h> 42 43 #include "ext4_jbd2.h" 43 44 #include "ext4_extents.h" 44 45 #include "xattr.h"
+1
fs/ext4/mballoc.c
··· 26 26 #include <linux/log2.h> 27 27 #include <linux/module.h> 28 28 #include <linux/slab.h> 29 + #include <linux/backing-dev.h> 29 30 #include <trace/events/ext4.h> 30 31 31 32 #ifdef CONFIG_EXT4_DEBUG
+1
fs/ext4/super.c
··· 24 24 #include <linux/slab.h> 25 25 #include <linux/init.h> 26 26 #include <linux/blkdev.h> 27 + #include <linux/backing-dev.h> 27 28 #include <linux/parser.h> 28 29 #include <linux/buffer_head.h> 29 30 #include <linux/exportfs.h>
+2 -2
fs/f2fs/node.c
··· 53 53 PAGE_CACHE_SHIFT; 54 54 res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2); 55 55 } else if (type == DIRTY_DENTS) { 56 - if (sbi->sb->s_bdi->dirty_exceeded) 56 + if (sbi->sb->s_bdi->wb.dirty_exceeded) 57 57 return false; 58 58 mem_size = get_pages(sbi, F2FS_DIRTY_DENTS); 59 59 res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); ··· 70 70 sizeof(struct extent_node)) >> PAGE_CACHE_SHIFT; 71 71 res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); 72 72 } else { 73 - if (sbi->sb->s_bdi->dirty_exceeded) 73 + if (sbi->sb->s_bdi->wb.dirty_exceeded) 74 74 return false; 75 75 } 76 76 return res;
+2 -1
fs/f2fs/segment.h
··· 9 9 * published by the Free Software Foundation. 10 10 */ 11 11 #include <linux/blkdev.h> 12 + #include <linux/backing-dev.h> 12 13 13 14 /* constant macro */ 14 15 #define NULL_SEGNO ((unsigned int)(~0)) ··· 715 714 */ 716 715 static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type) 717 716 { 718 - if (sbi->sb->s_bdi->dirty_exceeded) 717 + if (sbi->sb->s_bdi->wb.dirty_exceeded) 719 718 return 0; 720 719 721 720 if (type == DATA)
+1
fs/fat/file.c
··· 11 11 #include <linux/compat.h> 12 12 #include <linux/mount.h> 13 13 #include <linux/blkdev.h> 14 + #include <linux/backing-dev.h> 14 15 #include <linux/fsnotify.h> 15 16 #include <linux/security.h> 16 17 #include "fat.h"
+1
fs/fat/inode.c
··· 18 18 #include <linux/parser.h> 19 19 #include <linux/uio.h> 20 20 #include <linux/blkdev.h> 21 + #include <linux/backing-dev.h> 21 22 #include <asm/unaligned.h> 22 23 #include "fat.h" 23 24
+962 -205
fs/fs-writeback.c
··· 27 27 #include <linux/backing-dev.h> 28 28 #include <linux/tracepoint.h> 29 29 #include <linux/device.h> 30 + #include <linux/memcontrol.h> 30 31 #include "internal.h" 31 32 32 33 /* 33 34 * 4MB minimal write chunk size 34 35 */ 35 36 #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10)) 37 + 38 + struct wb_completion { 39 + atomic_t cnt; 40 + }; 36 41 37 42 /* 38 43 * Passed into wb_writeback(), essentially a subset of writeback_control ··· 52 47 unsigned int range_cyclic:1; 53 48 unsigned int for_background:1; 54 49 unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ 50 + unsigned int auto_free:1; /* free on completion */ 51 + unsigned int single_wait:1; 52 + unsigned int single_done:1; 55 53 enum wb_reason reason; /* why was writeback initiated? */ 56 54 57 55 struct list_head list; /* pending work list */ 58 - struct completion *done; /* set if the caller waits */ 56 + struct wb_completion *done; /* set if the caller waits */ 59 57 }; 58 + 59 + /* 60 + * If one wants to wait for one or more wb_writeback_works, each work's 61 + * ->done should be set to a wb_completion defined using the following 62 + * macro. Once all work items are issued with wb_queue_work(), the caller 63 + * can wait for the completion of all using wb_wait_for_completion(). Work 64 + * items which are waited upon aren't freed automatically on completion. 65 + */ 66 + #define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \ 67 + struct wb_completion cmpl = { \ 68 + .cnt = ATOMIC_INIT(1), \ 69 + } 70 + 60 71 61 72 /* 62 73 * If an inode is constantly having its pages dirtied, but then the ··· 85 64 * few inodes might not their timestamps updated for 24 hours. 86 65 */ 87 66 unsigned int dirtytime_expire_interval = 12 * 60 * 60; 88 - 89 - /** 90 - * writeback_in_progress - determine whether there is writeback in progress 91 - * @bdi: the device's backing_dev_info structure. 92 - * 93 - * Determine whether there is writeback waiting to be handled against a 94 - * backing device. 
95 - */ 96 - int writeback_in_progress(struct backing_dev_info *bdi) 97 - { 98 - return test_bit(BDI_writeback_running, &bdi->state); 99 - } 100 - EXPORT_SYMBOL(writeback_in_progress); 101 - 102 - struct backing_dev_info *inode_to_bdi(struct inode *inode) 103 - { 104 - struct super_block *sb; 105 - 106 - if (!inode) 107 - return &noop_backing_dev_info; 108 - 109 - sb = inode->i_sb; 110 - #ifdef CONFIG_BLOCK 111 - if (sb_is_blkdev_sb(sb)) 112 - return blk_get_backing_dev_info(I_BDEV(inode)); 113 - #endif 114 - return sb->s_bdi; 115 - } 116 - EXPORT_SYMBOL_GPL(inode_to_bdi); 117 67 118 68 static inline struct inode *wb_inode(struct list_head *head) 119 69 { ··· 101 109 102 110 EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage); 103 111 104 - static void bdi_wakeup_thread(struct backing_dev_info *bdi) 112 + static bool wb_io_lists_populated(struct bdi_writeback *wb) 105 113 { 106 - spin_lock_bh(&bdi->wb_lock); 107 - if (test_bit(BDI_registered, &bdi->state)) 108 - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); 109 - spin_unlock_bh(&bdi->wb_lock); 114 + if (wb_has_dirty_io(wb)) { 115 + return false; 116 + } else { 117 + set_bit(WB_has_dirty_io, &wb->state); 118 + WARN_ON_ONCE(!wb->avg_write_bandwidth); 119 + atomic_long_add(wb->avg_write_bandwidth, 120 + &wb->bdi->tot_write_bandwidth); 121 + return true; 122 + } 110 123 } 111 124 112 - static void bdi_queue_work(struct backing_dev_info *bdi, 113 - struct wb_writeback_work *work) 125 + static void wb_io_lists_depopulated(struct bdi_writeback *wb) 114 126 { 115 - trace_writeback_queue(bdi, work); 127 + if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) && 128 + list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) { 129 + clear_bit(WB_has_dirty_io, &wb->state); 130 + WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth, 131 + &wb->bdi->tot_write_bandwidth) < 0); 132 + } 133 + } 116 134 117 - spin_lock_bh(&bdi->wb_lock); 118 - if (!test_bit(BDI_registered, &bdi->state)) { 119 - if (work->done) 120 - complete(work->done); 
135 + /** 136 + * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list 137 + * @inode: inode to be moved 138 + * @wb: target bdi_writeback 139 + * @head: one of @wb->b_{dirty|io|more_io} 140 + * 141 + * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io. 142 + * Returns %true if @inode is the first occupant of the !dirty_time IO 143 + * lists; otherwise, %false. 144 + */ 145 + static bool inode_wb_list_move_locked(struct inode *inode, 146 + struct bdi_writeback *wb, 147 + struct list_head *head) 148 + { 149 + assert_spin_locked(&wb->list_lock); 150 + 151 + list_move(&inode->i_wb_list, head); 152 + 153 + /* dirty_time doesn't count as dirty_io until expiration */ 154 + if (head != &wb->b_dirty_time) 155 + return wb_io_lists_populated(wb); 156 + 157 + wb_io_lists_depopulated(wb); 158 + return false; 159 + } 160 + 161 + /** 162 + * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list 163 + * @inode: inode to be removed 164 + * @wb: bdi_writeback @inode is being removed from 165 + * 166 + * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and 167 + * clear %WB_has_dirty_io if all are empty afterwards. 
168 + */ 169 + static void inode_wb_list_del_locked(struct inode *inode, 170 + struct bdi_writeback *wb) 171 + { 172 + assert_spin_locked(&wb->list_lock); 173 + 174 + list_del_init(&inode->i_wb_list); 175 + wb_io_lists_depopulated(wb); 176 + } 177 + 178 + static void wb_wakeup(struct bdi_writeback *wb) 179 + { 180 + spin_lock_bh(&wb->work_lock); 181 + if (test_bit(WB_registered, &wb->state)) 182 + mod_delayed_work(bdi_wq, &wb->dwork, 0); 183 + spin_unlock_bh(&wb->work_lock); 184 + } 185 + 186 + static void wb_queue_work(struct bdi_writeback *wb, 187 + struct wb_writeback_work *work) 188 + { 189 + trace_writeback_queue(wb->bdi, work); 190 + 191 + spin_lock_bh(&wb->work_lock); 192 + if (!test_bit(WB_registered, &wb->state)) { 193 + if (work->single_wait) 194 + work->single_done = 1; 121 195 goto out_unlock; 122 196 } 123 - list_add_tail(&work->list, &bdi->work_list); 124 - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); 197 + if (work->done) 198 + atomic_inc(&work->done->cnt); 199 + list_add_tail(&work->list, &wb->work_list); 200 + mod_delayed_work(bdi_wq, &wb->dwork, 0); 125 201 out_unlock: 126 - spin_unlock_bh(&bdi->wb_lock); 202 + spin_unlock_bh(&wb->work_lock); 127 203 } 128 204 129 - static void 130 - __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, 131 - bool range_cyclic, enum wb_reason reason) 205 + /** 206 + * wb_wait_for_completion - wait for completion of bdi_writeback_works 207 + * @bdi: bdi work items were issued to 208 + * @done: target wb_completion 209 + * 210 + * Wait for one or more work items issued to @bdi with their ->done field 211 + * set to @done, which should have been defined with 212 + * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such 213 + * work items are completed. Work items which are waited upon aren't freed 214 + * automatically on completion. 
215 + */ 216 + static void wb_wait_for_completion(struct backing_dev_info *bdi, 217 + struct wb_completion *done) 218 + { 219 + atomic_dec(&done->cnt); /* put down the initial count */ 220 + wait_event(bdi->wb_waitq, !atomic_read(&done->cnt)); 221 + } 222 + 223 + #ifdef CONFIG_CGROUP_WRITEBACK 224 + 225 + /* parameters for foreign inode detection, see wb_detach_inode() */ 226 + #define WB_FRN_TIME_SHIFT 13 /* 1s = 2^13, upto 8 secs w/ 16bit */ 227 + #define WB_FRN_TIME_AVG_SHIFT 3 /* avg = avg * 7/8 + new * 1/8 */ 228 + #define WB_FRN_TIME_CUT_DIV 2 /* ignore rounds < avg / 2 */ 229 + #define WB_FRN_TIME_PERIOD (2 * (1 << WB_FRN_TIME_SHIFT)) /* 2s */ 230 + 231 + #define WB_FRN_HIST_SLOTS 16 /* inode->i_wb_frn_history is 16bit */ 232 + #define WB_FRN_HIST_UNIT (WB_FRN_TIME_PERIOD / WB_FRN_HIST_SLOTS) 233 + /* each slot's duration is 2s / 16 */ 234 + #define WB_FRN_HIST_THR_SLOTS (WB_FRN_HIST_SLOTS / 2) 235 + /* if foreign slots >= 8, switch */ 236 + #define WB_FRN_HIST_MAX_SLOTS (WB_FRN_HIST_THR_SLOTS / 2 + 1) 237 + /* one round can affect upto 5 slots */ 238 + 239 + void __inode_attach_wb(struct inode *inode, struct page *page) 240 + { 241 + struct backing_dev_info *bdi = inode_to_bdi(inode); 242 + struct bdi_writeback *wb = NULL; 243 + 244 + if (inode_cgwb_enabled(inode)) { 245 + struct cgroup_subsys_state *memcg_css; 246 + 247 + if (page) { 248 + memcg_css = mem_cgroup_css_from_page(page); 249 + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); 250 + } else { 251 + /* must pin memcg_css, see wb_get_create() */ 252 + memcg_css = task_get_css(current, memory_cgrp_id); 253 + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); 254 + css_put(memcg_css); 255 + } 256 + } 257 + 258 + if (!wb) 259 + wb = &bdi->wb; 260 + 261 + /* 262 + * There may be multiple instances of this function racing to 263 + * update the same inode. Use cmpxchg() to tell the winner. 
264 + */ 265 + if (unlikely(cmpxchg(&inode->i_wb, NULL, wb))) 266 + wb_put(wb); 267 + } 268 + 269 + /** 270 + * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it 271 + * @inode: inode of interest with i_lock held 272 + * 273 + * Returns @inode's wb with its list_lock held. @inode->i_lock must be 274 + * held on entry and is released on return. The returned wb is guaranteed 275 + * to stay @inode's associated wb until its list_lock is released. 276 + */ 277 + static struct bdi_writeback * 278 + locked_inode_to_wb_and_lock_list(struct inode *inode) 279 + __releases(&inode->i_lock) 280 + __acquires(&wb->list_lock) 281 + { 282 + while (true) { 283 + struct bdi_writeback *wb = inode_to_wb(inode); 284 + 285 + /* 286 + * inode_to_wb() association is protected by both 287 + * @inode->i_lock and @wb->list_lock but list_lock nests 288 + * outside i_lock. Drop i_lock and verify that the 289 + * association hasn't changed after acquiring list_lock. 290 + */ 291 + wb_get(wb); 292 + spin_unlock(&inode->i_lock); 293 + spin_lock(&wb->list_lock); 294 + wb_put(wb); /* not gonna deref it anymore */ 295 + 296 + /* i_wb may have changed inbetween, can't use inode_to_wb() */ 297 + if (likely(wb == inode->i_wb)) 298 + return wb; /* @inode already has ref */ 299 + 300 + spin_unlock(&wb->list_lock); 301 + cpu_relax(); 302 + spin_lock(&inode->i_lock); 303 + } 304 + } 305 + 306 + /** 307 + * inode_to_wb_and_lock_list - determine an inode's wb and lock it 308 + * @inode: inode of interest 309 + * 310 + * Same as locked_inode_to_wb_and_lock_list() but @inode->i_lock isn't held 311 + * on entry. 
312 + */ 313 + static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode) 314 + __acquires(&wb->list_lock) 315 + { 316 + spin_lock(&inode->i_lock); 317 + return locked_inode_to_wb_and_lock_list(inode); 318 + } 319 + 320 + struct inode_switch_wbs_context { 321 + struct inode *inode; 322 + struct bdi_writeback *new_wb; 323 + 324 + struct rcu_head rcu_head; 325 + struct work_struct work; 326 + }; 327 + 328 + static void inode_switch_wbs_work_fn(struct work_struct *work) 329 + { 330 + struct inode_switch_wbs_context *isw = 331 + container_of(work, struct inode_switch_wbs_context, work); 332 + struct inode *inode = isw->inode; 333 + struct address_space *mapping = inode->i_mapping; 334 + struct bdi_writeback *old_wb = inode->i_wb; 335 + struct bdi_writeback *new_wb = isw->new_wb; 336 + struct radix_tree_iter iter; 337 + bool switched = false; 338 + void **slot; 339 + 340 + /* 341 + * By the time control reaches here, RCU grace period has passed 342 + * since I_WB_SWITCH assertion and all wb stat update transactions 343 + * between unlocked_inode_to_wb_begin/end() are guaranteed to be 344 + * synchronizing against mapping->tree_lock. 345 + * 346 + * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock 347 + * gives us exclusion against all wb related operations on @inode 348 + * including IO list manipulations and stat updates. 349 + */ 350 + if (old_wb < new_wb) { 351 + spin_lock(&old_wb->list_lock); 352 + spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING); 353 + } else { 354 + spin_lock(&new_wb->list_lock); 355 + spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING); 356 + } 357 + spin_lock(&inode->i_lock); 358 + spin_lock_irq(&mapping->tree_lock); 359 + 360 + /* 361 + * Once I_FREEING is visible under i_lock, the eviction path owns 362 + * the inode and we shouldn't modify ->i_wb_list. 363 + */ 364 + if (unlikely(inode->i_state & I_FREEING)) 365 + goto skip_switch; 366 + 367 + /* 368 + * Count and transfer stats. 
Note that PAGECACHE_TAG_DIRTY points 369 + * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to 370 + * pages actually under underwriteback. 371 + */ 372 + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, 373 + PAGECACHE_TAG_DIRTY) { 374 + struct page *page = radix_tree_deref_slot_protected(slot, 375 + &mapping->tree_lock); 376 + if (likely(page) && PageDirty(page)) { 377 + __dec_wb_stat(old_wb, WB_RECLAIMABLE); 378 + __inc_wb_stat(new_wb, WB_RECLAIMABLE); 379 + } 380 + } 381 + 382 + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, 383 + PAGECACHE_TAG_WRITEBACK) { 384 + struct page *page = radix_tree_deref_slot_protected(slot, 385 + &mapping->tree_lock); 386 + if (likely(page)) { 387 + WARN_ON_ONCE(!PageWriteback(page)); 388 + __dec_wb_stat(old_wb, WB_WRITEBACK); 389 + __inc_wb_stat(new_wb, WB_WRITEBACK); 390 + } 391 + } 392 + 393 + wb_get(new_wb); 394 + 395 + /* 396 + * Transfer to @new_wb's IO list if necessary. The specific list 397 + * @inode was on is ignored and the inode is put on ->b_dirty which 398 + * is always correct including from ->b_dirty_time. The transfer 399 + * preserves @inode->dirtied_when ordering. 400 + */ 401 + if (!list_empty(&inode->i_wb_list)) { 402 + struct inode *pos; 403 + 404 + inode_wb_list_del_locked(inode, old_wb); 405 + inode->i_wb = new_wb; 406 + list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list) 407 + if (time_after_eq(inode->dirtied_when, 408 + pos->dirtied_when)) 409 + break; 410 + inode_wb_list_move_locked(inode, new_wb, pos->i_wb_list.prev); 411 + } else { 412 + inode->i_wb = new_wb; 413 + } 414 + 415 + /* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */ 416 + inode->i_wb_frn_winner = 0; 417 + inode->i_wb_frn_avg_time = 0; 418 + inode->i_wb_frn_history = 0; 419 + switched = true; 420 + skip_switch: 421 + /* 422 + * Paired with load_acquire in unlocked_inode_to_wb_begin() and 423 + * ensures that the new wb is visible if they see !I_WB_SWITCH. 
424 + */ 425 + smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH); 426 + 427 + spin_unlock_irq(&mapping->tree_lock); 428 + spin_unlock(&inode->i_lock); 429 + spin_unlock(&new_wb->list_lock); 430 + spin_unlock(&old_wb->list_lock); 431 + 432 + if (switched) { 433 + wb_wakeup(new_wb); 434 + wb_put(old_wb); 435 + } 436 + wb_put(new_wb); 437 + 438 + iput(inode); 439 + kfree(isw); 440 + } 441 + 442 + static void inode_switch_wbs_rcu_fn(struct rcu_head *rcu_head) 443 + { 444 + struct inode_switch_wbs_context *isw = container_of(rcu_head, 445 + struct inode_switch_wbs_context, rcu_head); 446 + 447 + /* needs to grab bh-unsafe locks, bounce to work item */ 448 + INIT_WORK(&isw->work, inode_switch_wbs_work_fn); 449 + schedule_work(&isw->work); 450 + } 451 + 452 + /** 453 + * inode_switch_wbs - change the wb association of an inode 454 + * @inode: target inode 455 + * @new_wb_id: ID of the new wb 456 + * 457 + * Switch @inode's wb association to the wb identified by @new_wb_id. The 458 + * switching is performed asynchronously and may fail silently. 
459 + */ 460 + static void inode_switch_wbs(struct inode *inode, int new_wb_id) 461 + { 462 + struct backing_dev_info *bdi = inode_to_bdi(inode); 463 + struct cgroup_subsys_state *memcg_css; 464 + struct inode_switch_wbs_context *isw; 465 + 466 + /* noop if seems to be already in progress */ 467 + if (inode->i_state & I_WB_SWITCH) 468 + return; 469 + 470 + isw = kzalloc(sizeof(*isw), GFP_ATOMIC); 471 + if (!isw) 472 + return; 473 + 474 + /* find and pin the new wb */ 475 + rcu_read_lock(); 476 + memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys); 477 + if (memcg_css) 478 + isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); 479 + rcu_read_unlock(); 480 + if (!isw->new_wb) 481 + goto out_free; 482 + 483 + /* while holding I_WB_SWITCH, no one else can update the association */ 484 + spin_lock(&inode->i_lock); 485 + if (inode->i_state & (I_WB_SWITCH | I_FREEING) || 486 + inode_to_wb(inode) == isw->new_wb) { 487 + spin_unlock(&inode->i_lock); 488 + goto out_free; 489 + } 490 + inode->i_state |= I_WB_SWITCH; 491 + spin_unlock(&inode->i_lock); 492 + 493 + ihold(inode); 494 + isw->inode = inode; 495 + 496 + /* 497 + * In addition to synchronizing among switchers, I_WB_SWITCH tells 498 + * the RCU protected stat update paths to grab the mapping's 499 + * tree_lock so that stat transfer can synchronize against them. 500 + * Let's continue after I_WB_SWITCH is guaranteed to be visible. 501 + */ 502 + call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn); 503 + return; 504 + 505 + out_free: 506 + if (isw->new_wb) 507 + wb_put(isw->new_wb); 508 + kfree(isw); 509 + } 510 + 511 + /** 512 + * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it 513 + * @wbc: writeback_control of interest 514 + * @inode: target inode 515 + * 516 + * @inode is locked and about to be written back under the control of @wbc. 517 + * Record @inode's writeback context into @wbc and unlock the i_lock. On 518 + * writeback completion, wbc_detach_inode() should be called. 
This is used 519 + * to track the cgroup writeback context. 520 + */ 521 + void wbc_attach_and_unlock_inode(struct writeback_control *wbc, 522 + struct inode *inode) 523 + { 524 + if (!inode_cgwb_enabled(inode)) { 525 + spin_unlock(&inode->i_lock); 526 + return; 527 + } 528 + 529 + wbc->wb = inode_to_wb(inode); 530 + wbc->inode = inode; 531 + 532 + wbc->wb_id = wbc->wb->memcg_css->id; 533 + wbc->wb_lcand_id = inode->i_wb_frn_winner; 534 + wbc->wb_tcand_id = 0; 535 + wbc->wb_bytes = 0; 536 + wbc->wb_lcand_bytes = 0; 537 + wbc->wb_tcand_bytes = 0; 538 + 539 + wb_get(wbc->wb); 540 + spin_unlock(&inode->i_lock); 541 + 542 + /* 543 + * A dying wb indicates that the memcg-blkcg mapping has changed 544 + * and a new wb is already serving the memcg. Switch immediately. 545 + */ 546 + if (unlikely(wb_dying(wbc->wb))) 547 + inode_switch_wbs(inode, wbc->wb_id); 548 + } 549 + 550 + /** 551 + * wbc_detach_inode - disassociate wbc from inode and perform foreign detection 552 + * @wbc: writeback_control of the just finished writeback 553 + * 554 + * To be called after a writeback attempt of an inode finishes and undoes 555 + * wbc_attach_and_unlock_inode(). Can be called under any context. 556 + * 557 + * As concurrent write sharing of an inode is expected to be very rare and 558 + * memcg only tracks page ownership on first-use basis severely confining 559 + * the usefulness of such sharing, cgroup writeback tracks ownership 560 + * per-inode. While the support for concurrent write sharing of an inode 561 + * is deemed unnecessary, an inode being written to by different cgroups at 562 + * different points in time is a lot more common, and, more importantly, 563 + * charging only by first-use can too readily lead to grossly incorrect 564 + * behaviors (single foreign page can lead to gigabytes of writeback to be 565 + * incorrectly attributed). 
566 + * 567 + * To resolve this issue, cgroup writeback detects the majority dirtier of 568 + * an inode and transfers the ownership to it. To avoid unnecessary 569 + * oscillation, the detection mechanism keeps track of history and gives 570 + * out the switch verdict only if the foreign usage pattern is stable over 571 + * a certain amount of time and/or writeback attempts. 572 + * 573 + * On each writeback attempt, @wbc tries to detect the majority writer 574 + * using Boyer-Moore majority vote algorithm. In addition to the byte 575 + * count from the majority voting, it also counts the bytes written for the 576 + * current wb and the last round's winner wb (max of last round's current 577 + * wb, the winner from two rounds ago, and the last round's majority 578 + * candidate). Keeping track of the historical winner helps the algorithm 579 + * to semi-reliably detect the most active writer even when it's not the 580 + * absolute majority. 581 + * 582 + * Once the winner of the round is determined, whether the winner is 583 + * foreign or not and how much IO time the round consumed is recorded in 584 + * inode->i_wb_frn_history. If the amount of recorded foreign IO time is 585 + * over a certain threshold, the switch verdict is given. 
586 + */ 587 + void wbc_detach_inode(struct writeback_control *wbc) 588 + { 589 + struct bdi_writeback *wb = wbc->wb; 590 + struct inode *inode = wbc->inode; 591 + unsigned long avg_time, max_bytes, max_time; 592 + u16 history; 593 + int max_id; 594 + 595 + if (!wb) 596 + return; 597 + 598 + history = inode->i_wb_frn_history; 599 + avg_time = inode->i_wb_frn_avg_time; 600 + 601 + /* pick the winner of this round */ 602 + if (wbc->wb_bytes >= wbc->wb_lcand_bytes && 603 + wbc->wb_bytes >= wbc->wb_tcand_bytes) { 604 + max_id = wbc->wb_id; 605 + max_bytes = wbc->wb_bytes; 606 + } else if (wbc->wb_lcand_bytes >= wbc->wb_tcand_bytes) { 607 + max_id = wbc->wb_lcand_id; 608 + max_bytes = wbc->wb_lcand_bytes; 609 + } else { 610 + max_id = wbc->wb_tcand_id; 611 + max_bytes = wbc->wb_tcand_bytes; 612 + } 613 + 614 + /* 615 + * Calculate the amount of IO time the winner consumed and fold it 616 + * into the running average kept per inode. If the consumed IO 617 + * time is lower than avg_time / WB_FRN_TIME_CUT_DIV, ignore it for 618 + * deciding whether to switch or not. This is to prevent one-off 619 + * small dirtiers from skewing the verdict. 620 + */ 621 + max_time = DIV_ROUND_UP((max_bytes >> PAGE_SHIFT) << WB_FRN_TIME_SHIFT, 622 + wb->avg_write_bandwidth); 623 + if (avg_time) 624 + avg_time += (max_time >> WB_FRN_TIME_AVG_SHIFT) - 625 + (avg_time >> WB_FRN_TIME_AVG_SHIFT); 626 + else 627 + avg_time = max_time; /* immediate catch up on first run */ 628 + 629 + if (max_time >= avg_time / WB_FRN_TIME_CUT_DIV) { 630 + int slots; 631 + 632 + /* 633 + * The switch verdict is reached if foreign wb's consume 634 + * more than a certain proportion of IO time in a 635 + * WB_FRN_TIME_PERIOD. This is loosely tracked by 16 slot 636 + * history mask where each bit represents one sixteenth of 637 + * the period. Determine the number of slots to shift into 638 + * history from @max_time. 
639 + */ 640 + slots = min(DIV_ROUND_UP(max_time, WB_FRN_HIST_UNIT), 641 + (unsigned long)WB_FRN_HIST_MAX_SLOTS); 642 + history <<= slots; 643 + if (wbc->wb_id != max_id) 644 + history |= (1U << slots) - 1; 645 + 646 + /* 647 + * Switch if the current wb isn't the consistent winner. 648 + * If there are multiple closely competing dirtiers, the 649 + * inode may switch across them repeatedly over time, which 650 + * is okay. The main goal is avoiding keeping an inode on 651 + * the wrong wb for an extended period of time. 652 + */ 653 + if (hweight32(history) > WB_FRN_HIST_THR_SLOTS) 654 + inode_switch_wbs(inode, max_id); 655 + } 656 + 657 + /* 658 + * Multiple instances of this function may race to update the 659 + * following fields but we don't mind occasional inaccuracies. 660 + */ 661 + inode->i_wb_frn_winner = max_id; 662 + inode->i_wb_frn_avg_time = min(avg_time, (unsigned long)U16_MAX); 663 + inode->i_wb_frn_history = history; 664 + 665 + wb_put(wbc->wb); 666 + wbc->wb = NULL; 667 + } 668 + 669 + /** 670 + * wbc_account_io - account IO issued during writeback 671 + * @wbc: writeback_control of the writeback in progress 672 + * @page: page being written out 673 + * @bytes: number of bytes being written out 674 + * 675 + * @bytes from @page are about to be written out during the writeback 676 + * controlled by @wbc. Keep the book for foreign inode detection. See 677 + * wbc_detach_inode(). 678 + */ 679 + void wbc_account_io(struct writeback_control *wbc, struct page *page, 680 + size_t bytes) 681 + { 682 + int id; 683 + 684 + /* 685 + * pageout() path doesn't attach @wbc to the inode being written 686 + * out. This is intentional as we don't want the function to block 687 + * behind a slow cgroup. Ultimately, we want pageout() to kick off 688 + * regular writeback instead of writing things out itself. 
689 + */ 690 + if (!wbc->wb) 691 + return; 692 + 693 + rcu_read_lock(); 694 + id = mem_cgroup_css_from_page(page)->id; 695 + rcu_read_unlock(); 696 + 697 + if (id == wbc->wb_id) { 698 + wbc->wb_bytes += bytes; 699 + return; 700 + } 701 + 702 + if (id == wbc->wb_lcand_id) 703 + wbc->wb_lcand_bytes += bytes; 704 + 705 + /* Boyer-Moore majority vote algorithm */ 706 + if (!wbc->wb_tcand_bytes) 707 + wbc->wb_tcand_id = id; 708 + if (id == wbc->wb_tcand_id) 709 + wbc->wb_tcand_bytes += bytes; 710 + else 711 + wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes); 712 + } 713 + 714 + /** 715 + * inode_congested - test whether an inode is congested 716 + * @inode: inode to test for congestion 717 + * @cong_bits: mask of WB_[a]sync_congested bits to test 718 + * 719 + * Tests whether @inode is congested. @cong_bits is the mask of congestion 720 + * bits to test and the return value is the mask of set bits. 721 + * 722 + * If cgroup writeback is enabled for @inode, the congestion state is 723 + * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg 724 + * associated with @inode is congested; otherwise, the root wb's congestion 725 + * state is used. 726 + */ 727 + int inode_congested(struct inode *inode, int cong_bits) 728 + { 729 + /* 730 + * Once set, ->i_wb never becomes NULL while the inode is alive. 731 + * Start transaction iff ->i_wb is visible. 
732 + */ 733 + if (inode && inode_to_wb_is_valid(inode)) { 734 + struct bdi_writeback *wb; 735 + bool locked, congested; 736 + 737 + wb = unlocked_inode_to_wb_begin(inode, &locked); 738 + congested = wb_congested(wb, cong_bits); 739 + unlocked_inode_to_wb_end(inode, locked); 740 + return congested; 741 + } 742 + 743 + return wb_congested(&inode_to_bdi(inode)->wb, cong_bits); 744 + } 745 + EXPORT_SYMBOL_GPL(inode_congested); 746 + 747 + /** 748 + * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work 749 + * @bdi: bdi the work item was issued to 750 + * @work: work item to wait for 751 + * 752 + * Wait for the completion of @work which was issued to one of @bdi's 753 + * bdi_writeback's. The caller must have set @work->single_wait before 754 + * issuing it. This wait operates independently of 755 + * wb_wait_for_completion() and also disables automatic freeing of @work. 756 + */ 757 + static void wb_wait_for_single_work(struct backing_dev_info *bdi, 758 + struct wb_writeback_work *work) 759 + { 760 + if (WARN_ON_ONCE(!work->single_wait)) 761 + return; 762 + 763 + wait_event(bdi->wb_waitq, work->single_done); 764 + 765 + /* 766 + * Paired with smp_wmb() in wb_do_writeback() and ensures that all 767 + * modifications to @work prior to assertion of ->single_done are 768 + * visible to the caller once this function returns. 769 + */ 770 + smp_rmb(); 771 + } 772 + 773 + /** 774 + * wb_split_bdi_pages - split nr_pages to write according to bandwidth 775 + * @wb: target bdi_writeback to split @nr_pages to 776 + * @nr_pages: number of pages to write for the whole bdi 777 + * 778 + * Split @wb's portion of @nr_pages according to @wb's write bandwidth in 779 + * relation to the total write bandwidth of all wb's w/ dirty inodes on 780 + * @wb->bdi. 
781 + */ 782 + static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) 783 + { 784 + unsigned long this_bw = wb->avg_write_bandwidth; 785 + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth); 786 + 787 + if (nr_pages == LONG_MAX) 788 + return LONG_MAX; 789 + 790 + /* 791 + * This may be called on clean wb's and proportional distribution 792 + * may not make sense, just use the original @nr_pages in those 793 + * cases. In general, we wanna err on the side of writing more. 794 + */ 795 + if (!tot_bw || this_bw >= tot_bw) 796 + return nr_pages; 797 + else 798 + return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw); 799 + } 800 + 801 + /** 802 + * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb 803 + * @wb: target bdi_writeback 804 + * @base_work: source wb_writeback_work 805 + * 806 + * Try to make a clone of @base_work and issue it to @wb. If cloning 807 + * succeeds, %true is returned; otherwise, @base_work is issued directly 808 + * and %false is returned. In the latter case, the caller is required to 809 + * wait for @base_work's completion using wb_wait_for_single_work(). 810 + * 811 + * A clone is auto-freed on completion. @base_work never is. 
812 + */ 813 + static bool wb_clone_and_queue_work(struct bdi_writeback *wb, 814 + struct wb_writeback_work *base_work) 132 815 { 133 816 struct wb_writeback_work *work; 817 + 818 + work = kmalloc(sizeof(*work), GFP_ATOMIC); 819 + if (work) { 820 + *work = *base_work; 821 + work->auto_free = 1; 822 + work->single_wait = 0; 823 + } else { 824 + work = base_work; 825 + work->auto_free = 0; 826 + work->single_wait = 1; 827 + } 828 + work->single_done = 0; 829 + wb_queue_work(wb, work); 830 + return work != base_work; 831 + } 832 + 833 + /** 834 + * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi 835 + * @bdi: target backing_dev_info 836 + * @base_work: wb_writeback_work to issue 837 + * @skip_if_busy: skip wb's which already have writeback in progress 838 + * 839 + * Split and issue @base_work to all wb's (bdi_writeback's) of @bdi which 840 + * have dirty inodes. If @base_work->nr_pages isn't %LONG_MAX, it's 841 + * distributed to the busy wbs according to each wb's proportion in the 842 + * total active write bandwidth of @bdi. 
843 + */ 844 + static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, 845 + struct wb_writeback_work *base_work, 846 + bool skip_if_busy) 847 + { 848 + long nr_pages = base_work->nr_pages; 849 + int next_blkcg_id = 0; 850 + struct bdi_writeback *wb; 851 + struct wb_iter iter; 852 + 853 + might_sleep(); 854 + 855 + if (!bdi_has_dirty_io(bdi)) 856 + return; 857 + restart: 858 + rcu_read_lock(); 859 + bdi_for_each_wb(wb, bdi, &iter, next_blkcg_id) { 860 + if (!wb_has_dirty_io(wb) || 861 + (skip_if_busy && writeback_in_progress(wb))) 862 + continue; 863 + 864 + base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages); 865 + if (!wb_clone_and_queue_work(wb, base_work)) { 866 + next_blkcg_id = wb->blkcg_css->id + 1; 867 + rcu_read_unlock(); 868 + wb_wait_for_single_work(bdi, base_work); 869 + goto restart; 870 + } 871 + } 872 + rcu_read_unlock(); 873 + } 874 + 875 + #else /* CONFIG_CGROUP_WRITEBACK */ 876 + 877 + static struct bdi_writeback * 878 + locked_inode_to_wb_and_lock_list(struct inode *inode) 879 + __releases(&inode->i_lock) 880 + __acquires(&wb->list_lock) 881 + { 882 + struct bdi_writeback *wb = inode_to_wb(inode); 883 + 884 + spin_unlock(&inode->i_lock); 885 + spin_lock(&wb->list_lock); 886 + return wb; 887 + } 888 + 889 + static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode) 890 + __acquires(&wb->list_lock) 891 + { 892 + struct bdi_writeback *wb = inode_to_wb(inode); 893 + 894 + spin_lock(&wb->list_lock); 895 + return wb; 896 + } 897 + 898 + static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) 899 + { 900 + return nr_pages; 901 + } 902 + 903 + static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, 904 + struct wb_writeback_work *base_work, 905 + bool skip_if_busy) 906 + { 907 + might_sleep(); 908 + 909 + if (bdi_has_dirty_io(bdi) && 910 + (!skip_if_busy || !writeback_in_progress(&bdi->wb))) { 911 + base_work->auto_free = 0; 912 + base_work->single_wait = 0; 913 + base_work->single_done = 0; 914 + 
wb_queue_work(&bdi->wb, base_work); 915 + } 916 + } 917 + 918 + #endif /* CONFIG_CGROUP_WRITEBACK */ 919 + 920 + void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, 921 + bool range_cyclic, enum wb_reason reason) 922 + { 923 + struct wb_writeback_work *work; 924 + 925 + if (!wb_has_dirty_io(wb)) 926 + return; 134 927 135 928 /* 136 929 * This is WB_SYNC_NONE writeback, so if allocation fails just ··· 923 146 */ 924 147 work = kzalloc(sizeof(*work), GFP_ATOMIC); 925 148 if (!work) { 926 - trace_writeback_nowork(bdi); 927 - bdi_wakeup_thread(bdi); 149 + trace_writeback_nowork(wb->bdi); 150 + wb_wakeup(wb); 928 151 return; 929 152 } 930 153 ··· 932 155 work->nr_pages = nr_pages; 933 156 work->range_cyclic = range_cyclic; 934 157 work->reason = reason; 158 + work->auto_free = 1; 935 159 936 - bdi_queue_work(bdi, work); 160 + wb_queue_work(wb, work); 937 161 } 938 162 939 163 /** 940 - * bdi_start_writeback - start writeback 941 - * @bdi: the backing device to write from 942 - * @nr_pages: the number of pages to write 943 - * @reason: reason why some writeback work was initiated 944 - * 945 - * Description: 946 - * This does WB_SYNC_NONE opportunistic writeback. The IO is only 947 - * started when this function returns, we make no guarantees on 948 - * completion. Caller need not hold sb s_umount semaphore. 949 - * 950 - */ 951 - void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, 952 - enum wb_reason reason) 953 - { 954 - __bdi_start_writeback(bdi, nr_pages, true, reason); 955 - } 956 - 957 - /** 958 - * bdi_start_background_writeback - start background writeback 959 - * @bdi: the backing device to write from 164 + * wb_start_background_writeback - start background writeback 165 + * @wb: bdi_writeback to write from 960 166 * 961 167 * Description: 962 168 * This makes sure WB_SYNC_NONE background writeback happens. 
When 963 - * this function returns, it is only guaranteed that for given BDI 169 + * this function returns, it is only guaranteed that for given wb 964 170 * some IO is happening if we are over background dirty threshold. 965 171 * Caller need not hold sb s_umount semaphore. 966 172 */ 967 - void bdi_start_background_writeback(struct backing_dev_info *bdi) 173 + void wb_start_background_writeback(struct bdi_writeback *wb) 968 174 { 969 175 /* 970 176 * We just wake up the flusher thread. It will perform background 971 177 * writeback as soon as there is no other work to do. 972 178 */ 973 - trace_writeback_wake_background(bdi); 974 - bdi_wakeup_thread(bdi); 179 + trace_writeback_wake_background(wb->bdi); 180 + wb_wakeup(wb); 975 181 } 976 182 977 183 /* ··· 962 202 */ 963 203 void inode_wb_list_del(struct inode *inode) 964 204 { 965 - struct backing_dev_info *bdi = inode_to_bdi(inode); 205 + struct bdi_writeback *wb; 966 206 967 - spin_lock(&bdi->wb.list_lock); 968 - list_del_init(&inode->i_wb_list); 969 - spin_unlock(&bdi->wb.list_lock); 207 + wb = inode_to_wb_and_lock_list(inode); 208 + inode_wb_list_del_locked(inode, wb); 209 + spin_unlock(&wb->list_lock); 970 210 } 971 211 972 212 /* ··· 980 220 */ 981 221 static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) 982 222 { 983 - assert_spin_locked(&wb->list_lock); 984 223 if (!list_empty(&wb->b_dirty)) { 985 224 struct inode *tail; 986 225 ··· 987 228 if (time_before(inode->dirtied_when, tail->dirtied_when)) 988 229 inode->dirtied_when = jiffies; 989 230 } 990 - list_move(&inode->i_wb_list, &wb->b_dirty); 231 + inode_wb_list_move_locked(inode, wb, &wb->b_dirty); 991 232 } 992 233 993 234 /* ··· 995 236 */ 996 237 static void requeue_io(struct inode *inode, struct bdi_writeback *wb) 997 238 { 998 - assert_spin_locked(&wb->list_lock); 999 - list_move(&inode->i_wb_list, &wb->b_more_io); 239 + inode_wb_list_move_locked(inode, wb, &wb->b_more_io); 1000 240 } 1001 241 1002 242 static void 
inode_sync_complete(struct inode *inode) ··· 1104 346 moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work); 1105 347 moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, 1106 348 EXPIRE_DIRTY_ATIME, work); 349 + if (moved) 350 + wb_io_lists_populated(wb); 1107 351 trace_writeback_queue_io(wb, work, moved); 1108 352 } 1109 353 ··· 1231 471 redirty_tail(inode, wb); 1232 472 } else if (inode->i_state & I_DIRTY_TIME) { 1233 473 inode->dirtied_when = jiffies; 1234 - list_move(&inode->i_wb_list, &wb->b_dirty_time); 474 + inode_wb_list_move_locked(inode, wb, &wb->b_dirty_time); 1235 475 } else { 1236 476 /* The inode is clean. Remove from writeback lists. */ 1237 - list_del_init(&inode->i_wb_list); 477 + inode_wb_list_del_locked(inode, wb); 1238 478 } 1239 479 } 1240 480 ··· 1365 605 !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))) 1366 606 goto out; 1367 607 inode->i_state |= I_SYNC; 1368 - spin_unlock(&inode->i_lock); 608 + wbc_attach_and_unlock_inode(wbc, inode); 1369 609 1370 610 ret = __writeback_single_inode(inode, wbc); 1371 611 612 + wbc_detach_inode(wbc); 1372 613 spin_lock(&wb->list_lock); 1373 614 spin_lock(&inode->i_lock); 1374 615 /* ··· 1377 616 * touch it. See comment above for explanation. 
1378 617 */ 1379 618 if (!(inode->i_state & I_DIRTY_ALL)) 1380 - list_del_init(&inode->i_wb_list); 619 + inode_wb_list_del_locked(inode, wb); 1381 620 spin_unlock(&wb->list_lock); 1382 621 inode_sync_complete(inode); 1383 622 out: ··· 1385 624 return ret; 1386 625 } 1387 626 1388 - static long writeback_chunk_size(struct backing_dev_info *bdi, 627 + static long writeback_chunk_size(struct bdi_writeback *wb, 1389 628 struct wb_writeback_work *work) 1390 629 { 1391 630 long pages; ··· 1406 645 if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) 1407 646 pages = LONG_MAX; 1408 647 else { 1409 - pages = min(bdi->avg_write_bandwidth / 2, 1410 - global_dirty_limit / DIRTY_SCOPE); 648 + pages = min(wb->avg_write_bandwidth / 2, 649 + global_wb_domain.dirty_limit / DIRTY_SCOPE); 1411 650 pages = min(pages, work->nr_pages); 1412 651 pages = round_down(pages + MIN_WRITEBACK_PAGES, 1413 652 MIN_WRITEBACK_PAGES); ··· 1502 741 continue; 1503 742 } 1504 743 inode->i_state |= I_SYNC; 1505 - spin_unlock(&inode->i_lock); 744 + wbc_attach_and_unlock_inode(&wbc, inode); 1506 745 1507 - write_chunk = writeback_chunk_size(wb->bdi, work); 746 + write_chunk = writeback_chunk_size(wb, work); 1508 747 wbc.nr_to_write = write_chunk; 1509 748 wbc.pages_skipped = 0; 1510 749 ··· 1514 753 */ 1515 754 __writeback_single_inode(inode, &wbc); 1516 755 756 + wbc_detach_inode(&wbc); 1517 757 work->nr_pages -= write_chunk - wbc.nr_to_write; 1518 758 wrote += write_chunk - wbc.nr_to_write; 1519 759 spin_lock(&wb->list_lock); ··· 1592 830 return nr_pages - work.nr_pages; 1593 831 } 1594 832 1595 - static bool over_bground_thresh(struct backing_dev_info *bdi) 1596 - { 1597 - unsigned long background_thresh, dirty_thresh; 1598 - 1599 - global_dirty_limits(&background_thresh, &dirty_thresh); 1600 - 1601 - if (global_page_state(NR_FILE_DIRTY) + 1602 - global_page_state(NR_UNSTABLE_NFS) > background_thresh) 1603 - return true; 1604 - 1605 - if (bdi_stat(bdi, BDI_RECLAIMABLE) > 1606 - 
bdi_dirty_limit(bdi, background_thresh)) 1607 - return true; 1608 - 1609 - return false; 1610 - } 1611 - 1612 - /* 1613 - * Called under wb->list_lock. If there are multiple wb per bdi, 1614 - * only the flusher working on the first wb should do it. 1615 - */ 1616 - static void wb_update_bandwidth(struct bdi_writeback *wb, 1617 - unsigned long start_time) 1618 - { 1619 - __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time); 1620 - } 1621 - 1622 833 /* 1623 834 * Explicit flushing or periodic writeback of "old" data. 1624 835 * ··· 1634 899 * after the other works are all done. 1635 900 */ 1636 901 if ((work->for_background || work->for_kupdate) && 1637 - !list_empty(&wb->bdi->work_list)) 902 + !list_empty(&wb->work_list)) 1638 903 break; 1639 904 1640 905 /* 1641 906 * For background writeout, stop when we are below the 1642 907 * background dirty threshold 1643 908 */ 1644 - if (work->for_background && !over_bground_thresh(wb->bdi)) 909 + if (work->for_background && !wb_over_bg_thresh(wb)) 1645 910 break; 1646 911 1647 912 /* ··· 1705 970 /* 1706 971 * Return the next wb_writeback_work struct that hasn't been processed yet. 
1707 972 */ 1708 - static struct wb_writeback_work * 1709 - get_next_work_item(struct backing_dev_info *bdi) 973 + static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb) 1710 974 { 1711 975 struct wb_writeback_work *work = NULL; 1712 976 1713 - spin_lock_bh(&bdi->wb_lock); 1714 - if (!list_empty(&bdi->work_list)) { 1715 - work = list_entry(bdi->work_list.next, 977 + spin_lock_bh(&wb->work_lock); 978 + if (!list_empty(&wb->work_list)) { 979 + work = list_entry(wb->work_list.next, 1716 980 struct wb_writeback_work, list); 1717 981 list_del_init(&work->list); 1718 982 } 1719 - spin_unlock_bh(&bdi->wb_lock); 983 + spin_unlock_bh(&wb->work_lock); 1720 984 return work; 1721 985 } 1722 986 ··· 1732 998 1733 999 static long wb_check_background_flush(struct bdi_writeback *wb) 1734 1000 { 1735 - if (over_bground_thresh(wb->bdi)) { 1001 + if (wb_over_bg_thresh(wb)) { 1736 1002 1737 1003 struct wb_writeback_work work = { 1738 1004 .nr_pages = LONG_MAX, ··· 1787 1053 */ 1788 1054 static long wb_do_writeback(struct bdi_writeback *wb) 1789 1055 { 1790 - struct backing_dev_info *bdi = wb->bdi; 1791 1056 struct wb_writeback_work *work; 1792 1057 long wrote = 0; 1793 1058 1794 - set_bit(BDI_writeback_running, &wb->bdi->state); 1795 - while ((work = get_next_work_item(bdi)) != NULL) { 1059 + set_bit(WB_writeback_running, &wb->state); 1060 + while ((work = get_next_work_item(wb)) != NULL) { 1061 + struct wb_completion *done = work->done; 1062 + bool need_wake_up = false; 1796 1063 1797 - trace_writeback_exec(bdi, work); 1064 + trace_writeback_exec(wb->bdi, work); 1798 1065 1799 1066 wrote += wb_writeback(wb, work); 1800 1067 1801 - /* 1802 - * Notify the caller of completion if this is a synchronous 1803 - * work item, otherwise just free it. 
1804 - */ 1805 - if (work->done) 1806 - complete(work->done); 1807 - else 1068 + if (work->single_wait) { 1069 + WARN_ON_ONCE(work->auto_free); 1070 + /* paired w/ rmb in wb_wait_for_single_work() */ 1071 + smp_wmb(); 1072 + work->single_done = 1; 1073 + need_wake_up = true; 1074 + } else if (work->auto_free) { 1808 1075 kfree(work); 1076 + } 1077 + 1078 + if (done && atomic_dec_and_test(&done->cnt)) 1079 + need_wake_up = true; 1080 + 1081 + if (need_wake_up) 1082 + wake_up_all(&wb->bdi->wb_waitq); 1809 1083 } 1810 1084 1811 1085 /* ··· 1821 1079 */ 1822 1080 wrote += wb_check_old_data_flush(wb); 1823 1081 wrote += wb_check_background_flush(wb); 1824 - clear_bit(BDI_writeback_running, &wb->bdi->state); 1082 + clear_bit(WB_writeback_running, &wb->state); 1825 1083 1826 1084 return wrote; 1827 1085 } ··· 1830 1088 * Handle writeback of dirty data for the device backed by this bdi. Also 1831 1089 * reschedules periodically and does kupdated style flushing. 1832 1090 */ 1833 - void bdi_writeback_workfn(struct work_struct *work) 1091 + void wb_workfn(struct work_struct *work) 1834 1092 { 1835 1093 struct bdi_writeback *wb = container_of(to_delayed_work(work), 1836 1094 struct bdi_writeback, dwork); 1837 - struct backing_dev_info *bdi = wb->bdi; 1838 1095 long pages_written; 1839 1096 1840 - set_worker_desc("flush-%s", dev_name(bdi->dev)); 1097 + set_worker_desc("flush-%s", dev_name(wb->bdi->dev)); 1841 1098 current->flags |= PF_SWAPWRITE; 1842 1099 1843 1100 if (likely(!current_is_workqueue_rescuer() || 1844 - !test_bit(BDI_registered, &bdi->state))) { 1101 + !test_bit(WB_registered, &wb->state))) { 1845 1102 /* 1846 - * The normal path. Keep writing back @bdi until its 1103 + * The normal path. Keep writing back @wb until its 1847 1104 * work_list is empty. 
Note that this path is also taken 1848 - * if @bdi is shutting down even when we're running off the 1105 + * if @wb is shutting down even when we're running off the 1849 1106 * rescuer as work_list needs to be drained. 1850 1107 */ 1851 1108 do { 1852 1109 pages_written = wb_do_writeback(wb); 1853 1110 trace_writeback_pages_written(pages_written); 1854 - } while (!list_empty(&bdi->work_list)); 1111 + } while (!list_empty(&wb->work_list)); 1855 1112 } else { 1856 1113 /* 1857 1114 * bdi_wq can't get enough workers and we're running off 1858 1115 * the emergency worker. Don't hog it. Hopefully, 1024 is 1859 1116 * enough for efficient IO. 1860 1117 */ 1861 - pages_written = writeback_inodes_wb(&bdi->wb, 1024, 1118 + pages_written = writeback_inodes_wb(wb, 1024, 1862 1119 WB_REASON_FORKER_THREAD); 1863 1120 trace_writeback_pages_written(pages_written); 1864 1121 } 1865 1122 1866 - if (!list_empty(&bdi->work_list)) 1123 + if (!list_empty(&wb->work_list)) 1867 1124 mod_delayed_work(bdi_wq, &wb->dwork, 0); 1868 1125 else if (wb_has_dirty_io(wb) && dirty_writeback_interval) 1869 - bdi_wakeup_thread_delayed(bdi); 1126 + wb_wakeup_delayed(wb); 1870 1127 1871 1128 current->flags &= ~PF_SWAPWRITE; 1872 1129 } ··· 1883 1142 1884 1143 rcu_read_lock(); 1885 1144 list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { 1145 + struct bdi_writeback *wb; 1146 + struct wb_iter iter; 1147 + 1886 1148 if (!bdi_has_dirty_io(bdi)) 1887 1149 continue; 1888 - __bdi_start_writeback(bdi, nr_pages, false, reason); 1150 + 1151 + bdi_for_each_wb(wb, bdi, &iter, 0) 1152 + wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages), 1153 + false, reason); 1889 1154 } 1890 1155 rcu_read_unlock(); 1891 1156 } ··· 1920 1173 1921 1174 rcu_read_lock(); 1922 1175 list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { 1923 - if (list_empty(&bdi->wb.b_dirty_time)) 1924 - continue; 1925 - bdi_wakeup_thread(bdi); 1176 + struct bdi_writeback *wb; 1177 + struct wb_iter iter; 1178 + 1179 + bdi_for_each_wb(wb, bdi, &iter, 
0) 1180 + if (!list_empty(&bdi->wb.b_dirty_time)) 1181 + wb_wakeup(&bdi->wb); 1926 1182 } 1927 1183 rcu_read_unlock(); 1928 1184 schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ); ··· 1999 1249 void __mark_inode_dirty(struct inode *inode, int flags) 2000 1250 { 2001 1251 struct super_block *sb = inode->i_sb; 2002 - struct backing_dev_info *bdi = NULL; 2003 1252 int dirtytime; 2004 1253 2005 1254 trace_writeback_mark_inode_dirty(inode, flags); ··· 2038 1289 if ((inode->i_state & flags) != flags) { 2039 1290 const int was_dirty = inode->i_state & I_DIRTY; 2040 1291 1292 + inode_attach_wb(inode, NULL); 1293 + 2041 1294 if (flags & I_DIRTY_INODE) 2042 1295 inode->i_state &= ~I_DIRTY_TIME; 2043 1296 inode->i_state |= flags; ··· 2068 1317 * reposition it (that would break b_dirty time-ordering). 2069 1318 */ 2070 1319 if (!was_dirty) { 1320 + struct bdi_writeback *wb; 1321 + struct list_head *dirty_list; 2071 1322 bool wakeup_bdi = false; 2072 - bdi = inode_to_bdi(inode); 2073 1323 2074 - spin_unlock(&inode->i_lock); 2075 - spin_lock(&bdi->wb.list_lock); 2076 - if (bdi_cap_writeback_dirty(bdi)) { 2077 - WARN(!test_bit(BDI_registered, &bdi->state), 2078 - "bdi-%s not registered\n", bdi->name); 1324 + wb = locked_inode_to_wb_and_lock_list(inode); 2079 1325 2080 - /* 2081 - * If this is the first dirty inode for this 2082 - * bdi, we have to wake-up the corresponding 2083 - * bdi thread to make sure background 2084 - * write-back happens later. 
2085 - */ 2086 - if (!wb_has_dirty_io(&bdi->wb)) 2087 - wakeup_bdi = true; 2088 - } 1326 + WARN(bdi_cap_writeback_dirty(wb->bdi) && 1327 + !test_bit(WB_registered, &wb->state), 1328 + "bdi-%s not registered\n", wb->bdi->name); 2089 1329 2090 1330 inode->dirtied_when = jiffies; 2091 1331 if (dirtytime) 2092 1332 inode->dirtied_time_when = jiffies; 1333 + 2093 1334 if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES)) 2094 - list_move(&inode->i_wb_list, &bdi->wb.b_dirty); 1335 + dirty_list = &wb->b_dirty; 2095 1336 else 2096 - list_move(&inode->i_wb_list, 2097 - &bdi->wb.b_dirty_time); 2098 - spin_unlock(&bdi->wb.list_lock); 1337 + dirty_list = &wb->b_dirty_time; 1338 + 1339 + wakeup_bdi = inode_wb_list_move_locked(inode, wb, 1340 + dirty_list); 1341 + 1342 + spin_unlock(&wb->list_lock); 2099 1343 trace_writeback_dirty_inode_enqueue(inode); 2100 1344 2101 - if (wakeup_bdi) 2102 - bdi_wakeup_thread_delayed(bdi); 1345 + /* 1346 + * If this is the first dirty inode for this bdi, 1347 + * we have to wake-up the corresponding bdi thread 1348 + * to make sure background write-back happens 1349 + * later. 
1350 + */ 1351 + if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi) 1352 + wb_wakeup_delayed(wb); 2103 1353 return; 2104 1354 } 2105 1355 } ··· 2163 1411 iput(old_inode); 2164 1412 } 2165 1413 1414 + static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, 1415 + enum wb_reason reason, bool skip_if_busy) 1416 + { 1417 + DEFINE_WB_COMPLETION_ONSTACK(done); 1418 + struct wb_writeback_work work = { 1419 + .sb = sb, 1420 + .sync_mode = WB_SYNC_NONE, 1421 + .tagged_writepages = 1, 1422 + .done = &done, 1423 + .nr_pages = nr, 1424 + .reason = reason, 1425 + }; 1426 + struct backing_dev_info *bdi = sb->s_bdi; 1427 + 1428 + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) 1429 + return; 1430 + WARN_ON(!rwsem_is_locked(&sb->s_umount)); 1431 + 1432 + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); 1433 + wb_wait_for_completion(bdi, &done); 1434 + } 1435 + 2166 1436 /** 2167 1437 * writeback_inodes_sb_nr - writeback dirty inodes from given super_block 2168 1438 * @sb: the superblock ··· 2199 1425 unsigned long nr, 2200 1426 enum wb_reason reason) 2201 1427 { 2202 - DECLARE_COMPLETION_ONSTACK(done); 2203 - struct wb_writeback_work work = { 2204 - .sb = sb, 2205 - .sync_mode = WB_SYNC_NONE, 2206 - .tagged_writepages = 1, 2207 - .done = &done, 2208 - .nr_pages = nr, 2209 - .reason = reason, 2210 - }; 2211 - 2212 - if (sb->s_bdi == &noop_backing_dev_info) 2213 - return; 2214 - WARN_ON(!rwsem_is_locked(&sb->s_umount)); 2215 - bdi_queue_work(sb->s_bdi, &work); 2216 - wait_for_completion(&done); 1428 + __writeback_inodes_sb_nr(sb, nr, reason, false); 2217 1429 } 2218 1430 EXPORT_SYMBOL(writeback_inodes_sb_nr); 2219 1431 ··· 2227 1467 * Invoke writeback_inodes_sb_nr if no writeback is currently underway. 2228 1468 * Returns 1 if writeback was started, 0 if not. 
2229 1469 */ 2230 - int try_to_writeback_inodes_sb_nr(struct super_block *sb, 2231 - unsigned long nr, 2232 - enum wb_reason reason) 1470 + bool try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, 1471 + enum wb_reason reason) 2233 1472 { 2234 - if (writeback_in_progress(sb->s_bdi)) 2235 - return 1; 2236 - 2237 1473 if (!down_read_trylock(&sb->s_umount)) 2238 - return 0; 1474 + return false; 2239 1475 2240 - writeback_inodes_sb_nr(sb, nr, reason); 1476 + __writeback_inodes_sb_nr(sb, nr, reason, true); 2241 1477 up_read(&sb->s_umount); 2242 - return 1; 1478 + return true; 2243 1479 } 2244 1480 EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr); 2245 1481 ··· 2247 1491 * Implement by try_to_writeback_inodes_sb_nr() 2248 1492 * Returns 1 if writeback was started, 0 if not. 2249 1493 */ 2250 - int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) 1494 + bool try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) 2251 1495 { 2252 1496 return try_to_writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason); 2253 1497 } ··· 2262 1506 */ 2263 1507 void sync_inodes_sb(struct super_block *sb) 2264 1508 { 2265 - DECLARE_COMPLETION_ONSTACK(done); 1509 + DEFINE_WB_COMPLETION_ONSTACK(done); 2266 1510 struct wb_writeback_work work = { 2267 1511 .sb = sb, 2268 1512 .sync_mode = WB_SYNC_ALL, ··· 2272 1516 .reason = WB_REASON_SYNC, 2273 1517 .for_sync = 1, 2274 1518 }; 1519 + struct backing_dev_info *bdi = sb->s_bdi; 2275 1520 2276 1521 /* Nothing to do? */ 2277 - if (sb->s_bdi == &noop_backing_dev_info) 1522 + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) 2278 1523 return; 2279 1524 WARN_ON(!rwsem_is_locked(&sb->s_umount)); 2280 1525 2281 - bdi_queue_work(sb->s_bdi, &work); 2282 - wait_for_completion(&done); 1526 + bdi_split_work_to_wbs(bdi, &work, false); 1527 + wb_wait_for_completion(bdi, &done); 2283 1528 2284 1529 wait_sb_inodes(sb); 2285 1530 }
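The per-page accounting in wbc_account_io() above tracks the dominant cgroup with the Boyer-Moore majority vote algorithm: one candidate slot whose byte count is credited on a match and drained on a mismatch. A minimal user-space sketch of just that candidate tracking (the struct and helper names here are illustrative stand-ins, not the kernel's; the kernel additionally keeps the fast-path `wb_id` and last-round `lcand` counters):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for the tcand fields of struct writeback_control. */
struct vote {
	int tcand_id;        /* current majority candidate */
	size_t tcand_bytes;  /* candidate's net byte count */
};

static size_t min_size(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Credit @bytes written by cgroup @id, Boyer-Moore style. */
static void vote_account(struct vote *v, int id, size_t bytes)
{
	if (!v->tcand_bytes)
		v->tcand_id = id;        /* drained slot: adopt new candidate */
	if (id == v->tcand_id)
		v->tcand_bytes += bytes; /* match: strengthen candidate */
	else
		v->tcand_bytes -= min_size(bytes, v->tcand_bytes);
}
```

When the counter drains to zero the next writer becomes the candidate; by the end of a writeback pass the surviving candidate is the cgroup that wrote a majority of the bytes, if any single cgroup did.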
+6 -6
fs/fuse/file.c
··· 1445 1445 1446 1446 list_del(&req->writepages_entry); 1447 1447 for (i = 0; i < req->num_pages; i++) { 1448 - dec_bdi_stat(bdi, BDI_WRITEBACK); 1448 + dec_wb_stat(&bdi->wb, WB_WRITEBACK); 1449 1449 dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP); 1450 - bdi_writeout_inc(bdi); 1450 + wb_writeout_inc(&bdi->wb); 1451 1451 } 1452 1452 wake_up(&fi->page_waitq); 1453 1453 } ··· 1634 1634 req->end = fuse_writepage_end; 1635 1635 req->inode = inode; 1636 1636 1637 - inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK); 1637 + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); 1638 1638 inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); 1639 1639 1640 1640 spin_lock(&fc->lock); ··· 1749 1749 copy_highpage(old_req->pages[0], page); 1750 1750 spin_unlock(&fc->lock); 1751 1751 1752 - dec_bdi_stat(bdi, BDI_WRITEBACK); 1752 + dec_wb_stat(&bdi->wb, WB_WRITEBACK); 1753 1753 dec_zone_page_state(page, NR_WRITEBACK_TEMP); 1754 - bdi_writeout_inc(bdi); 1754 + wb_writeout_inc(&bdi->wb); 1755 1755 fuse_writepage_free(fc, new_req); 1756 1756 fuse_request_free(new_req); 1757 1757 goto out; ··· 1848 1848 req->page_descs[req->num_pages].offset = 0; 1849 1849 req->page_descs[req->num_pages].length = PAGE_SIZE; 1850 1850 1851 - inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK); 1851 + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); 1852 1852 inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); 1853 1853 1854 1854 err = 0;
+1 -1
fs/gfs2/super.c
··· 748 748 749 749 if (wbc->sync_mode == WB_SYNC_ALL) 750 750 gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH); 751 - if (bdi->dirty_exceeded) 751 + if (bdi->wb.dirty_exceeded) 752 752 gfs2_ail1_flush(sdp, wbc); 753 753 else 754 754 filemap_fdatawrite(metamapping);
+1
fs/hfs/super.c
··· 14 14 15 15 #include <linux/module.h> 16 16 #include <linux/blkdev.h> 17 + #include <linux/backing-dev.h> 17 18 #include <linux/mount.h> 18 19 #include <linux/init.h> 19 20 #include <linux/nls.h>
+1
fs/hfsplus/super.c
··· 11 11 #include <linux/init.h> 12 12 #include <linux/pagemap.h> 13 13 #include <linux/blkdev.h> 14 + #include <linux/backing-dev.h> 14 15 #include <linux/fs.h> 15 16 #include <linux/slab.h> 16 17 #include <linux/vfs.h>
+1
fs/inode.c
··· 224 224 void __destroy_inode(struct inode *inode) 225 225 { 226 226 BUG_ON(inode_has_buffers(inode)); 227 + inode_detach_wb(inode); 227 228 security_inode_free(inode); 228 229 fsnotify_inode_delete(inode); 229 230 locks_free_lock_context(inode->i_flctx);
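The hunk above severs the inode's link to its bdi_writeback before the inode is freed, so the wb's reference count can drop and a dying cgwb can eventually be released. A toy model of that attach/detach pairing (struct layout and helper bodies are illustrative, not the kernel's implementation):

```c
#include <assert.h>
#include <stddef.h>

struct wb_model    { int refcnt; };
struct inode_model { struct wb_model *i_wb; };

/* Stand-in for inode_attach_wb(): take a ref on first association. */
static void attach_wb(struct inode_model *inode, struct wb_model *wb)
{
	if (!inode->i_wb) {
		wb->refcnt++;
		inode->i_wb = wb;
	}
}

/* Stand-in for inode_detach_wb(), as called from __destroy_inode(). */
static void detach_wb(struct inode_model *inode)
{
	if (inode->i_wb) {
		inode->i_wb->refcnt--;  /* wb_put() */
		inode->i_wb = NULL;
	}
}
```

The invariant the real code maintains is the same: every inode that ever dirtied a page holds exactly one reference on its wb until destruction.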
+3
fs/mpage.c
··· 605 605 bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH); 606 606 if (bio == NULL) 607 607 goto confused; 608 + 609 + wbc_init_bio(wbc, bio); 608 610 } 609 611 610 612 /* ··· 614 612 * the confused fail path above (OOM) will be very confused when 615 613 * it finds all bh marked clean (i.e. it will not write anything) 616 614 */ 615 + wbc_account_io(wbc, page, PAGE_SIZE); 617 616 length = first_unmapped << blkbits; 618 617 if (bio_add_page(bio, page, length, 0) < length) { 619 618 bio = mpage_bio_submit(WRITE, bio);
+1
fs/nfs/filelayout/filelayout.c
··· 32 32 #include <linux/nfs_fs.h> 33 33 #include <linux/nfs_page.h> 34 34 #include <linux/module.h> 35 + #include <linux/backing-dev.h> 35 36 36 37 #include <linux/sunrpc/metrics.h> 37 38
+1 -1
fs/nfs/internal.h
··· 607 607 struct inode *inode = page_file_mapping(page)->host; 608 608 609 609 inc_zone_page_state(page, NR_UNSTABLE_NFS); 610 - inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE); 610 + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); 611 611 __mark_inode_dirty(inode, I_DIRTY_DATASYNC); 612 612 } 613 613
+2 -1
fs/nfs/write.c
··· 853 853 nfs_clear_page_commit(struct page *page) 854 854 { 855 855 dec_zone_page_state(page, NR_UNSTABLE_NFS); 856 - dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE); 856 + dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, 857 + WB_RECLAIMABLE); 857 858 } 858 859 859 860 /* Called holding inode (/cinfo) lock */
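These NFS hunks move unstable-page accounting from the old per-bdi counters (BDI_RECLAIMABLE) to the per-bdi_writeback WB_RECLAIMABLE counter, so each cgroup wb tracks its own reclaimable pages. A plain-C model of such a per-wb stat array (the enum mirrors wb_stat_item from backing-dev-defs.h later in this series; the struct is a simplified stand-in for the percpu counters):

```c
#include <assert.h>

enum wb_stat_item {
	WB_RECLAIMABLE,
	WB_WRITEBACK,
	WB_DIRTIED,
	WB_WRITTEN,
	NR_WB_STAT_ITEMS
};

/* Stand-in for the percpu counters embedded in struct bdi_writeback. */
struct wb_stats {
	long stat[NR_WB_STAT_ITEMS];
};

static void inc_wb_stat(struct wb_stats *wb, enum wb_stat_item item)
{
	wb->stat[item]++;
}

static void dec_wb_stat(struct wb_stats *wb, enum wb_stat_item item)
{
	wb->stat[item]--;
}
```

The inc in nfs_mark_request_commit() and the dec in nfs_clear_page_commit() must balance, exactly as the old bdi-wide pair did, but now the balance holds per wb rather than per device.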
+1
fs/ocfs2/file.c
··· 37 37 #include <linux/falloc.h> 38 38 #include <linux/quotaops.h> 39 39 #include <linux/blkdev.h> 40 + #include <linux/backing-dev.h> 40 41 41 42 #include <cluster/masklog.h> 42 43
+1
fs/reiserfs/super.c
··· 21 21 #include "xattr.h" 22 22 #include <linux/init.h> 23 23 #include <linux/blkdev.h> 24 + #include <linux/backing-dev.h> 24 25 #include <linux/buffer_head.h> 25 26 #include <linux/exportfs.h> 26 27 #include <linux/quotaops.h>
+1
fs/ufs/super.c
··· 80 80 #include <linux/stat.h> 81 81 #include <linux/string.h> 82 82 #include <linux/blkdev.h> 83 + #include <linux/backing-dev.h> 83 84 #include <linux/init.h> 84 85 #include <linux/parser.h> 85 86 #include <linux/buffer_head.h>
+10 -2
fs/xfs/xfs_aops.c
··· 1873 1873 loff_t end_offset; 1874 1874 loff_t offset; 1875 1875 int newly_dirty; 1876 + struct mem_cgroup *memcg; 1876 1877 1877 1878 if (unlikely(!mapping)) 1878 1879 return !TestSetPageDirty(page); ··· 1893 1892 offset += 1 << inode->i_blkbits; 1894 1893 } while (bh != head); 1895 1894 } 1895 + /* 1896 + * Use mem_cgroup_begin_page_stat() to keep PageDirty synchronized with 1897 + * per-memcg dirty page counters. 1898 + */ 1899 + memcg = mem_cgroup_begin_page_stat(page); 1896 1900 newly_dirty = !TestSetPageDirty(page); 1897 1901 spin_unlock(&mapping->private_lock); 1898 1902 ··· 1908 1902 spin_lock_irqsave(&mapping->tree_lock, flags); 1909 1903 if (page->mapping) { /* Race with truncate? */ 1910 1904 WARN_ON_ONCE(!PageUptodate(page)); 1911 - account_page_dirtied(page, mapping); 1905 + account_page_dirtied(page, mapping, memcg); 1912 1906 radix_tree_tag_set(&mapping->page_tree, 1913 1907 page_index(page), PAGECACHE_TAG_DIRTY); 1914 1908 } 1915 1909 spin_unlock_irqrestore(&mapping->tree_lock, flags); 1916 - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 1917 1910 } 1911 + mem_cgroup_end_page_stat(memcg); 1912 + if (newly_dirty) 1913 + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); 1918 1914 return newly_dirty; 1919 1915 } 1920 1916
+1
fs/xfs/xfs_file.c
··· 41 41 #include <linux/dcache.h> 42 42 #include <linux/falloc.h> 43 43 #include <linux/pagevec.h> 44 + #include <linux/backing-dev.h> 44 45 45 46 static const struct vm_operations_struct xfs_file_vm_ops; 46 47
+255
include/linux/backing-dev-defs.h
··· 1 + #ifndef __LINUX_BACKING_DEV_DEFS_H 2 + #define __LINUX_BACKING_DEV_DEFS_H 3 + 4 + #include <linux/list.h> 5 + #include <linux/radix-tree.h> 6 + #include <linux/rbtree.h> 7 + #include <linux/spinlock.h> 8 + #include <linux/percpu_counter.h> 9 + #include <linux/percpu-refcount.h> 10 + #include <linux/flex_proportions.h> 11 + #include <linux/timer.h> 12 + #include <linux/workqueue.h> 13 + 14 + struct page; 15 + struct device; 16 + struct dentry; 17 + 18 + /* 19 + * Bits in bdi_writeback.state 20 + */ 21 + enum wb_state { 22 + WB_registered, /* bdi_register() was done */ 23 + WB_writeback_running, /* Writeback is in progress */ 24 + WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */ 25 + }; 26 + 27 + enum wb_congested_state { 28 + WB_async_congested, /* The async (write) queue is getting full */ 29 + WB_sync_congested, /* The sync queue is getting full */ 30 + }; 31 + 32 + typedef int (congested_fn)(void *, int); 33 + 34 + enum wb_stat_item { 35 + WB_RECLAIMABLE, 36 + WB_WRITEBACK, 37 + WB_DIRTIED, 38 + WB_WRITTEN, 39 + NR_WB_STAT_ITEMS 40 + }; 41 + 42 + #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) 43 + 44 + /* 45 + * For cgroup writeback, multiple wb's may map to the same blkcg. Those 46 + * wb's can operate mostly independently but should share the congested 47 + * state. To facilitate such sharing, the congested state is tracked using 48 + * the following struct which is created on demand, indexed by blkcg ID on 49 + * its bdi, and refcounted. 
50 + */ 51 + struct bdi_writeback_congested { 52 + unsigned long state; /* WB_[a]sync_congested flags */ 53 + 54 + #ifdef CONFIG_CGROUP_WRITEBACK 55 + struct backing_dev_info *bdi; /* the associated bdi */ 56 + atomic_t refcnt; /* nr of attached wb's and blkg */ 57 + int blkcg_id; /* ID of the associated blkcg */ 58 + struct rb_node rb_node; /* on bdi->cgwb_congested_tree */ 59 + #endif 60 + }; 61 + 62 + /* 63 + * Each wb (bdi_writeback) can perform writeback operations, is measured 64 + * and throttled, independently. Without cgroup writeback, each bdi 65 + * (backing_dev_info) is served by its embedded bdi->wb. 66 + * 67 + * On the default hierarchy, blkcg implicitly enables memcg. This allows 68 + * using memcg's page ownership for attributing writeback IOs, and every 69 + * memcg - blkcg combination can be served by its own wb by assigning a 70 + * dedicated wb to each memcg, which enables isolation across different 71 + * cgroups and propagation of IO back pressure down from the IO layer up to 72 + * the tasks which are generating the dirty pages to be written back. 73 + * 74 + * A cgroup wb is indexed on its bdi by the ID of the associated memcg, 75 + * refcounted with the number of inodes attached to it, and pins the memcg 76 + * and the corresponding blkcg. As the corresponding blkcg for a memcg may 77 + * change as blkcg is disabled and enabled higher up in the hierarchy, a wb 78 + * is tested for blkcg after lookup and removed from index on mismatch so 79 + * that a new wb for the combination can be created. 
80 + */ 81 + struct bdi_writeback { 82 + struct backing_dev_info *bdi; /* our parent bdi */ 83 + 84 + unsigned long state; /* Always use atomic bitops on this */ 85 + unsigned long last_old_flush; /* last old data flush */ 86 + 87 + struct list_head b_dirty; /* dirty inodes */ 88 + struct list_head b_io; /* parked for writeback */ 89 + struct list_head b_more_io; /* parked for more writeback */ 90 + struct list_head b_dirty_time; /* time stamps are dirty */ 91 + spinlock_t list_lock; /* protects the b_* lists */ 92 + 93 + struct percpu_counter stat[NR_WB_STAT_ITEMS]; 94 + 95 + struct bdi_writeback_congested *congested; 96 + 97 + unsigned long bw_time_stamp; /* last time write bw is updated */ 98 + unsigned long dirtied_stamp; 99 + unsigned long written_stamp; /* pages written at bw_time_stamp */ 100 + unsigned long write_bandwidth; /* the estimated write bandwidth */ 101 + unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */ 102 + 103 + /* 104 + * The base dirty throttle rate, re-calculated on every 200ms. 105 + * All the bdi tasks' dirty rate will be curbed under it. 106 + * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit 107 + * in small steps and is much more smooth/stable than the latter. 
108 + */ 109 + unsigned long dirty_ratelimit; 110 + unsigned long balanced_dirty_ratelimit; 111 + 112 + struct fprop_local_percpu completions; 113 + int dirty_exceeded; 114 + 115 + spinlock_t work_lock; /* protects work_list & dwork scheduling */ 116 + struct list_head work_list; 117 + struct delayed_work dwork; /* work item used for writeback */ 118 + 119 + #ifdef CONFIG_CGROUP_WRITEBACK 120 + struct percpu_ref refcnt; /* used only for !root wb's */ 121 + struct fprop_local_percpu memcg_completions; 122 + struct cgroup_subsys_state *memcg_css; /* the associated memcg */ 123 + struct cgroup_subsys_state *blkcg_css; /* and blkcg */ 124 + struct list_head memcg_node; /* anchored at memcg->cgwb_list */ 125 + struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */ 126 + 127 + union { 128 + struct work_struct release_work; 129 + struct rcu_head rcu; 130 + }; 131 + #endif 132 + }; 133 + 134 + struct backing_dev_info { 135 + struct list_head bdi_list; 136 + unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ 137 + unsigned int capabilities; /* Device capabilities */ 138 + congested_fn *congested_fn; /* Function pointer if device is md/dm */ 139 + void *congested_data; /* Pointer to aux data for congested func */ 140 + 141 + char *name; 142 + 143 + unsigned int min_ratio; 144 + unsigned int max_ratio, max_prop_frac; 145 + 146 + /* 147 + * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are 148 + * any dirty wbs, which is depended upon by bdi_has_dirty_io(). 
149 + */ 150 + atomic_long_t tot_write_bandwidth; 151 + 152 + struct bdi_writeback wb; /* the root writeback info for this bdi */ 153 + struct bdi_writeback_congested wb_congested; /* its congested state */ 154 + #ifdef CONFIG_CGROUP_WRITEBACK 155 + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ 156 + struct rb_root cgwb_congested_tree; /* their congested states */ 157 + atomic_t usage_cnt; /* counts both cgwbs and cgwb_congested's */ 158 + #endif 159 + wait_queue_head_t wb_waitq; 160 + 161 + struct device *dev; 162 + 163 + struct timer_list laptop_mode_wb_timer; 164 + 165 + #ifdef CONFIG_DEBUG_FS 166 + struct dentry *debug_dir; 167 + struct dentry *debug_stats; 168 + #endif 169 + }; 170 + 171 + enum { 172 + BLK_RW_ASYNC = 0, 173 + BLK_RW_SYNC = 1, 174 + }; 175 + 176 + void clear_wb_congested(struct bdi_writeback_congested *congested, int sync); 177 + void set_wb_congested(struct bdi_writeback_congested *congested, int sync); 178 + 179 + static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync) 180 + { 181 + clear_wb_congested(bdi->wb.congested, sync); 182 + } 183 + 184 + static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync) 185 + { 186 + set_wb_congested(bdi->wb.congested, sync); 187 + } 188 + 189 + #ifdef CONFIG_CGROUP_WRITEBACK 190 + 191 + /** 192 + * wb_tryget - try to increment a wb's refcount 193 + * @wb: bdi_writeback to get 194 + */ 195 + static inline bool wb_tryget(struct bdi_writeback *wb) 196 + { 197 + if (wb != &wb->bdi->wb) 198 + return percpu_ref_tryget(&wb->refcnt); 199 + return true; 200 + } 201 + 202 + /** 203 + * wb_get - increment a wb's refcount 204 + * @wb: bdi_writeback to get 205 + */ 206 + static inline void wb_get(struct bdi_writeback *wb) 207 + { 208 + if (wb != &wb->bdi->wb) 209 + percpu_ref_get(&wb->refcnt); 210 + } 211 + 212 + /** 213 + * wb_put - decrement a wb's refcount 214 + * @wb: bdi_writeback to put 215 + */ 216 + static inline void wb_put(struct bdi_writeback
*wb) 217 + { 218 + if (wb != &wb->bdi->wb) 219 + percpu_ref_put(&wb->refcnt); 220 + } 221 + 222 + /** 223 + * wb_dying - is a wb dying? 224 + * @wb: bdi_writeback of interest 225 + * 226 + * Returns whether @wb is unlinked and being drained. 227 + */ 228 + static inline bool wb_dying(struct bdi_writeback *wb) 229 + { 230 + return percpu_ref_is_dying(&wb->refcnt); 231 + } 232 + 233 + #else /* CONFIG_CGROUP_WRITEBACK */ 234 + 235 + static inline bool wb_tryget(struct bdi_writeback *wb) 236 + { 237 + return true; 238 + } 239 + 240 + static inline void wb_get(struct bdi_writeback *wb) 241 + { 242 + } 243 + 244 + static inline void wb_put(struct bdi_writeback *wb) 245 + { 246 + } 247 + 248 + static inline bool wb_dying(struct bdi_writeback *wb) 249 + { 250 + return false; 251 + } 252 + 253 + #endif /* CONFIG_CGROUP_WRITEBACK */ 254 + 255 + #endif /* __LINUX_BACKING_DEV_DEFS_H */
+414 -169
include/linux/backing-dev.h
··· 8 8 #ifndef _LINUX_BACKING_DEV_H 9 9 #define _LINUX_BACKING_DEV_H 10 10 11 - #include <linux/percpu_counter.h> 12 - #include <linux/log2.h> 13 - #include <linux/flex_proportions.h> 14 11 #include <linux/kernel.h> 15 12 #include <linux/fs.h> 16 13 #include <linux/sched.h> 17 - #include <linux/timer.h> 14 + #include <linux/blkdev.h> 18 15 #include <linux/writeback.h> 19 - #include <linux/atomic.h> 20 - #include <linux/sysctl.h> 21 - #include <linux/workqueue.h> 22 - 23 - struct page; 24 - struct device; 25 - struct dentry; 26 - 27 - /* 28 - * Bits in backing_dev_info.state 29 - */ 30 - enum bdi_state { 31 - BDI_async_congested, /* The async (write) queue is getting full */ 32 - BDI_sync_congested, /* The sync queue is getting full */ 33 - BDI_registered, /* bdi_register() was done */ 34 - BDI_writeback_running, /* Writeback is in progress */ 35 - }; 36 - 37 - typedef int (congested_fn)(void *, int); 38 - 39 - enum bdi_stat_item { 40 - BDI_RECLAIMABLE, 41 - BDI_WRITEBACK, 42 - BDI_DIRTIED, 43 - BDI_WRITTEN, 44 - NR_BDI_STAT_ITEMS 45 - }; 46 - 47 - #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) 48 - 49 - struct bdi_writeback { 50 - struct backing_dev_info *bdi; /* our parent bdi */ 51 - 52 - unsigned long last_old_flush; /* last old data flush */ 53 - 54 - struct delayed_work dwork; /* work item used for writeback */ 55 - struct list_head b_dirty; /* dirty inodes */ 56 - struct list_head b_io; /* parked for writeback */ 57 - struct list_head b_more_io; /* parked for more writeback */ 58 - struct list_head b_dirty_time; /* time stamps are dirty */ 59 - spinlock_t list_lock; /* protects the b_* lists */ 60 - }; 61 - 62 - struct backing_dev_info { 63 - struct list_head bdi_list; 64 - unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ 65 - unsigned long state; /* Always use atomic bitops on this */ 66 - unsigned int capabilities; /* Device capabilities */ 67 - congested_fn *congested_fn; /* Function pointer if device is md/dm */ 68 - void 
*congested_data; /* Pointer to aux data for congested func */ 69 - 70 - char *name; 71 - 72 - struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS]; 73 - 74 - unsigned long bw_time_stamp; /* last time write bw is updated */ 75 - unsigned long dirtied_stamp; 76 - unsigned long written_stamp; /* pages written at bw_time_stamp */ 77 - unsigned long write_bandwidth; /* the estimated write bandwidth */ 78 - unsigned long avg_write_bandwidth; /* further smoothed write bw */ 79 - 80 - /* 81 - * The base dirty throttle rate, re-calculated on every 200ms. 82 - * All the bdi tasks' dirty rate will be curbed under it. 83 - * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit 84 - * in small steps and is much more smooth/stable than the latter. 85 - */ 86 - unsigned long dirty_ratelimit; 87 - unsigned long balanced_dirty_ratelimit; 88 - 89 - struct fprop_local_percpu completions; 90 - int dirty_exceeded; 91 - 92 - unsigned int min_ratio; 93 - unsigned int max_ratio, max_prop_frac; 94 - 95 - struct bdi_writeback wb; /* default writeback info for this bdi */ 96 - spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */ 97 - 98 - struct list_head work_list; 99 - 100 - struct device *dev; 101 - 102 - struct timer_list laptop_mode_wb_timer; 103 - 104 - #ifdef CONFIG_DEBUG_FS 105 - struct dentry *debug_dir; 106 - struct dentry *debug_stats; 107 - #endif 108 - }; 109 - 110 - struct backing_dev_info *inode_to_bdi(struct inode *inode); 16 + #include <linux/blk-cgroup.h> 17 + #include <linux/backing-dev-defs.h> 111 18 112 19 int __must_check bdi_init(struct backing_dev_info *bdi); 113 20 void bdi_destroy(struct backing_dev_info *bdi); ··· 24 117 const char *fmt, ...); 25 118 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev); 26 119 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *); 27 - void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, 28 - enum wb_reason reason); 29 - void bdi_start_background_writeback(struct 
backing_dev_info *bdi); 30 - void bdi_writeback_workfn(struct work_struct *work); 31 - int bdi_has_dirty_io(struct backing_dev_info *bdi); 32 - void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi); 120 + void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, 121 + bool range_cyclic, enum wb_reason reason); 122 + void wb_start_background_writeback(struct bdi_writeback *wb); 123 + void wb_workfn(struct work_struct *work); 124 + void wb_wakeup_delayed(struct bdi_writeback *wb); 33 125 34 126 extern spinlock_t bdi_lock; 35 127 extern struct list_head bdi_list; 36 128 37 129 extern struct workqueue_struct *bdi_wq; 38 130 39 - static inline int wb_has_dirty_io(struct bdi_writeback *wb) 131 + static inline bool wb_has_dirty_io(struct bdi_writeback *wb) 40 132 { 41 - return !list_empty(&wb->b_dirty) || 42 - !list_empty(&wb->b_io) || 43 - !list_empty(&wb->b_more_io); 133 + return test_bit(WB_has_dirty_io, &wb->state); 44 134 } 45 135 46 - static inline void __add_bdi_stat(struct backing_dev_info *bdi, 47 - enum bdi_stat_item item, s64 amount) 136 + static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi) 48 137 { 49 - __percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH); 138 + /* 139 + * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are 140 + * any dirty wbs. See wb_update_write_bandwidth(). 
141 + */ 142 + return atomic_long_read(&bdi->tot_write_bandwidth); 50 143 } 51 144 52 - static inline void __inc_bdi_stat(struct backing_dev_info *bdi, 53 - enum bdi_stat_item item) 145 + static inline void __add_wb_stat(struct bdi_writeback *wb, 146 + enum wb_stat_item item, s64 amount) 54 147 { 55 - __add_bdi_stat(bdi, item, 1); 148 + __percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH); 56 149 } 57 150 58 - static inline void inc_bdi_stat(struct backing_dev_info *bdi, 59 - enum bdi_stat_item item) 151 + static inline void __inc_wb_stat(struct bdi_writeback *wb, 152 + enum wb_stat_item item) 60 153 { 61 - unsigned long flags; 62 - 63 - local_irq_save(flags); 64 - __inc_bdi_stat(bdi, item); 65 - local_irq_restore(flags); 154 + __add_wb_stat(wb, item, 1); 66 155 } 67 156 68 - static inline void __dec_bdi_stat(struct backing_dev_info *bdi, 69 - enum bdi_stat_item item) 70 - { 71 - __add_bdi_stat(bdi, item, -1); 72 - } 73 - 74 - static inline void dec_bdi_stat(struct backing_dev_info *bdi, 75 - enum bdi_stat_item item) 157 + static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) 76 158 { 77 159 unsigned long flags; 78 160 79 161 local_irq_save(flags); 80 - __dec_bdi_stat(bdi, item); 162 + __inc_wb_stat(wb, item); 81 163 local_irq_restore(flags); 82 164 } 83 165 84 - static inline s64 bdi_stat(struct backing_dev_info *bdi, 85 - enum bdi_stat_item item) 166 + static inline void __dec_wb_stat(struct bdi_writeback *wb, 167 + enum wb_stat_item item) 86 168 { 87 - return percpu_counter_read_positive(&bdi->bdi_stat[item]); 169 + __add_wb_stat(wb, item, -1); 88 170 } 89 171 90 - static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi, 91 - enum bdi_stat_item item) 172 + static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) 92 173 { 93 - return percpu_counter_sum_positive(&bdi->bdi_stat[item]); 174 + unsigned long flags; 175 + 176 + local_irq_save(flags); 177 + __dec_wb_stat(wb, item); 178 + 
local_irq_restore(flags); 94 179 } 95 180 96 - static inline s64 bdi_stat_sum(struct backing_dev_info *bdi, 97 - enum bdi_stat_item item) 181 + static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) 182 + { 183 + return percpu_counter_read_positive(&wb->stat[item]); 184 + } 185 + 186 + static inline s64 __wb_stat_sum(struct bdi_writeback *wb, 187 + enum wb_stat_item item) 188 + { 189 + return percpu_counter_sum_positive(&wb->stat[item]); 190 + } 191 + 192 + static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item) 98 193 { 99 194 s64 sum; 100 195 unsigned long flags; 101 196 102 197 local_irq_save(flags); 103 - sum = __bdi_stat_sum(bdi, item); 198 + sum = __wb_stat_sum(wb, item); 104 199 local_irq_restore(flags); 105 200 106 201 return sum; 107 202 } 108 203 109 - extern void bdi_writeout_inc(struct backing_dev_info *bdi); 204 + extern void wb_writeout_inc(struct bdi_writeback *wb); 110 205 111 206 /* 112 207 * maximal error of a stat counter. 113 208 */ 114 - static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi) 209 + static inline unsigned long wb_stat_error(struct bdi_writeback *wb) 115 210 { 116 211 #ifdef CONFIG_SMP 117 - return nr_cpu_ids * BDI_STAT_BATCH; 212 + return nr_cpu_ids * WB_STAT_BATCH; 118 213 #else 119 214 return 1; 120 215 #endif ··· 140 231 * BDI_CAP_NO_WRITEBACK: Don't write pages back 141 232 * BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages 142 233 * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold. 234 + * 235 + * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback. 
143 236 */ 144 237 #define BDI_CAP_NO_ACCT_DIRTY 0x00000001 145 238 #define BDI_CAP_NO_WRITEBACK 0x00000002 146 239 #define BDI_CAP_NO_ACCT_WB 0x00000004 147 240 #define BDI_CAP_STABLE_WRITES 0x00000008 148 241 #define BDI_CAP_STRICTLIMIT 0x00000010 242 + #define BDI_CAP_CGROUP_WRITEBACK 0x00000020 149 243 150 244 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \ 151 245 (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB) 152 246 153 247 extern struct backing_dev_info noop_backing_dev_info; 154 248 155 - int writeback_in_progress(struct backing_dev_info *bdi); 156 - 157 - static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits) 249 + /** 250 + * writeback_in_progress - determine whether there is writeback in progress 251 + * @wb: bdi_writeback of interest 252 + * 253 + * Determine whether there is writeback waiting to be handled against a 254 + * bdi_writeback. 255 + */ 256 + static inline bool writeback_in_progress(struct bdi_writeback *wb) 158 257 { 258 + return test_bit(WB_writeback_running, &wb->state); 259 + } 260 + 261 + static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) 262 + { 263 + struct super_block *sb; 264 + 265 + if (!inode) 266 + return &noop_backing_dev_info; 267 + 268 + sb = inode->i_sb; 269 + #ifdef CONFIG_BLOCK 270 + if (sb_is_blkdev_sb(sb)) 271 + return blk_get_backing_dev_info(I_BDEV(inode)); 272 + #endif 273 + return sb->s_bdi; 274 + } 275 + 276 + static inline int wb_congested(struct bdi_writeback *wb, int cong_bits) 277 + { 278 + struct backing_dev_info *bdi = wb->bdi; 279 + 159 280 if (bdi->congested_fn) 160 - return bdi->congested_fn(bdi->congested_data, bdi_bits); 161 - return (bdi->state & bdi_bits); 281 + return bdi->congested_fn(bdi->congested_data, cong_bits); 282 + return wb->congested->state & cong_bits; 162 283 } 163 284 164 - static inline int bdi_read_congested(struct backing_dev_info *bdi) 165 - { 166 - return bdi_congested(bdi, 1 << BDI_sync_congested); 167 - } 168 - 169 - static 
inline int bdi_write_congested(struct backing_dev_info *bdi) 170 - { 171 - return bdi_congested(bdi, 1 << BDI_async_congested); 172 - } 173 - 174 - static inline int bdi_rw_congested(struct backing_dev_info *bdi) 175 - { 176 - return bdi_congested(bdi, (1 << BDI_sync_congested) | 177 - (1 << BDI_async_congested)); 178 - } 179 - 180 - enum { 181 - BLK_RW_ASYNC = 0, 182 - BLK_RW_SYNC = 1, 183 - }; 184 - 185 - void clear_bdi_congested(struct backing_dev_info *bdi, int sync); 186 - void set_bdi_congested(struct backing_dev_info *bdi, int sync); 187 285 long congestion_wait(int sync, long timeout); 188 286 long wait_iff_congested(struct zone *zone, int sync, long timeout); 189 287 int pdflush_proc_obsolete(struct ctl_table *table, int write, ··· 234 318 return 0; 235 319 } 236 320 237 - #endif /* _LINUX_BACKING_DEV_H */ 321 + #ifdef CONFIG_CGROUP_WRITEBACK 322 + 323 + struct bdi_writeback_congested * 324 + wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp); 325 + void wb_congested_put(struct bdi_writeback_congested *congested); 326 + struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, 327 + struct cgroup_subsys_state *memcg_css, 328 + gfp_t gfp); 329 + void wb_memcg_offline(struct mem_cgroup *memcg); 330 + void wb_blkcg_offline(struct blkcg *blkcg); 331 + int inode_congested(struct inode *inode, int cong_bits); 332 + 333 + /** 334 + * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode 335 + * @inode: inode of interest 336 + * 337 + * cgroup writeback requires support from both the bdi and filesystem. 338 + * Test whether @inode has both. 
339 + */ 340 + static inline bool inode_cgwb_enabled(struct inode *inode) 341 + { 342 + struct backing_dev_info *bdi = inode_to_bdi(inode); 343 + 344 + return bdi_cap_account_dirty(bdi) && 345 + (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) && 346 + (inode->i_sb->s_iflags & SB_I_CGROUPWB); 347 + } 348 + 349 + /** 350 + * wb_find_current - find wb for %current on a bdi 351 + * @bdi: bdi of interest 352 + * 353 + * Find the wb of @bdi which matches both the memcg and blkcg of %current. 354 + * Must be called under rcu_read_lock() which protects the returned wb. 355 + * Returns NULL if not found. 356 + */ 357 + static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) 358 + { 359 + struct cgroup_subsys_state *memcg_css; 360 + struct bdi_writeback *wb; 361 + 362 + memcg_css = task_css(current, memory_cgrp_id); 363 + if (!memcg_css->parent) 364 + return &bdi->wb; 365 + 366 + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); 367 + 368 + /* 369 + * %current's blkcg equals the effective blkcg of its memcg. No 370 + * need to use the relatively expensive cgroup_get_e_css(). 371 + */ 372 + if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id))) 373 + return wb; 374 + return NULL; 375 + } 376 + 377 + /** 378 + * wb_get_create_current - get or create wb for %current on a bdi 379 + * @bdi: bdi of interest 380 + * @gfp: allocation mask 381 + * 382 + * Equivalent to wb_get_create() on %current's memcg. This function is 383 + * called from a relatively hot path and optimizes the common cases using 384 + * wb_find_current(). 
385 + */ 386 + static inline struct bdi_writeback * 387 + wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) 388 + { 389 + struct bdi_writeback *wb; 390 + 391 + rcu_read_lock(); 392 + wb = wb_find_current(bdi); 393 + if (wb && unlikely(!wb_tryget(wb))) 394 + wb = NULL; 395 + rcu_read_unlock(); 396 + 397 + if (unlikely(!wb)) { 398 + struct cgroup_subsys_state *memcg_css; 399 + 400 + memcg_css = task_get_css(current, memory_cgrp_id); 401 + wb = wb_get_create(bdi, memcg_css, gfp); 402 + css_put(memcg_css); 403 + } 404 + return wb; 405 + } 406 + 407 + /** 408 + * inode_to_wb_is_valid - test whether an inode has a wb associated 409 + * @inode: inode of interest 410 + * 411 + * Returns %true if @inode has a wb associated. May be called without any 412 + * locking. 413 + */ 414 + static inline bool inode_to_wb_is_valid(struct inode *inode) 415 + { 416 + return inode->i_wb; 417 + } 418 + 419 + /** 420 + * inode_to_wb - determine the wb of an inode 421 + * @inode: inode of interest 422 + * 423 + * Returns the wb @inode is currently associated with. The caller must be 424 + * holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the 425 + * associated wb's list_lock. 426 + */ 427 + static inline struct bdi_writeback *inode_to_wb(struct inode *inode) 428 + { 429 + #ifdef CONFIG_LOCKDEP 430 + WARN_ON_ONCE(debug_locks && 431 + (!lockdep_is_held(&inode->i_lock) && 432 + !lockdep_is_held(&inode->i_mapping->tree_lock) && 433 + !lockdep_is_held(&inode->i_wb->list_lock))); 434 + #endif 435 + return inode->i_wb; 436 + } 437 + 438 + /** 439 + * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction 440 + * @inode: target inode 441 + * @lockedp: temp bool output param, to be passed to the end function 442 + * 443 + * The caller wants to access the wb associated with @inode but isn't 444 + * holding inode->i_lock, mapping->tree_lock or wb->list_lock. 
This 445 + * function determines the wb associated with @inode and ensures that the 446 + * association doesn't change until the transaction is finished with 447 + * unlocked_inode_to_wb_end(). 448 + * 449 + * The caller must call unlocked_inode_to_wb_end() with *@lockedp 450 + * afterwards and can't sleep during transaction. IRQ may or may not be 451 + * disabled on return. 452 + */ 453 + static inline struct bdi_writeback * 454 + unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp) 455 + { 456 + rcu_read_lock(); 457 + 458 + /* 459 + * Paired with store_release in inode_switch_wb_work_fn() and 460 + * ensures that we see the new wb if we see cleared I_WB_SWITCH. 461 + */ 462 + *lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH; 463 + 464 + if (unlikely(*lockedp)) 465 + spin_lock_irq(&inode->i_mapping->tree_lock); 466 + 467 + /* 468 + * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock. 469 + * inode_to_wb() will bark. Deref directly. 470 + */ 471 + return inode->i_wb; 472 + } 473 + 474 + /** 475 + * unlocked_inode_to_wb_end - end inode wb access transaction 476 + * @inode: target inode 477 + * @locked: *@lockedp from unlocked_inode_to_wb_begin() 478 + */ 479 + static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked) 480 + { 481 + if (unlikely(locked)) 482 + spin_unlock_irq(&inode->i_mapping->tree_lock); 483 + 484 + rcu_read_unlock(); 485 + } 486 + 487 + struct wb_iter { 488 + int start_blkcg_id; 489 + struct radix_tree_iter tree_iter; 490 + void **slot; 491 + }; 492 + 493 + static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter, 494 + struct backing_dev_info *bdi) 495 + { 496 + struct radix_tree_iter *titer = &iter->tree_iter; 497 + 498 + WARN_ON_ONCE(!rcu_read_lock_held()); 499 + 500 + if (iter->start_blkcg_id >= 0) { 501 + iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id); 502 + iter->start_blkcg_id = -1; 503 + } else { 504 + iter->slot = radix_tree_next_slot(iter->slot, titer, 
0); 505 + } 506 + 507 + if (!iter->slot) 508 + iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0); 509 + if (iter->slot) 510 + return *iter->slot; 511 + return NULL; 512 + } 513 + 514 + static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter, 515 + struct backing_dev_info *bdi, 516 + int start_blkcg_id) 517 + { 518 + iter->start_blkcg_id = start_blkcg_id; 519 + 520 + if (start_blkcg_id) 521 + return __wb_iter_next(iter, bdi); 522 + else 523 + return &bdi->wb; 524 + } 525 + 526 + /** 527 + * bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order 528 + * @wb_cur: cursor struct bdi_writeback pointer 529 + * @bdi: bdi to walk wb's of 530 + * @iter: pointer to struct wb_iter to be used as iteration buffer 531 + * @start_blkcg_id: blkcg ID to start iteration from 532 + * 533 + * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending 534 + * blkcg ID order starting from @start_blkcg_id. @iter is struct wb_iter 535 + * to be used as temp storage during iteration. rcu_read_lock() must be 536 + * held throughout iteration. 
537 + */ 538 + #define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \ 539 + for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id); \ 540 + (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi)) 541 + 542 + #else /* CONFIG_CGROUP_WRITEBACK */ 543 + 544 + static inline bool inode_cgwb_enabled(struct inode *inode) 545 + { 546 + return false; 547 + } 548 + 549 + static inline struct bdi_writeback_congested * 550 + wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp) 551 + { 552 + return bdi->wb.congested; 553 + } 554 + 555 + static inline void wb_congested_put(struct bdi_writeback_congested *congested) 556 + { 557 + } 558 + 559 + static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) 560 + { 561 + return &bdi->wb; 562 + } 563 + 564 + static inline struct bdi_writeback * 565 + wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) 566 + { 567 + return &bdi->wb; 568 + } 569 + 570 + static inline bool inode_to_wb_is_valid(struct inode *inode) 571 + { 572 + return true; 573 + } 574 + 575 + static inline struct bdi_writeback *inode_to_wb(struct inode *inode) 576 + { 577 + return &inode_to_bdi(inode)->wb; 578 + } 579 + 580 + static inline struct bdi_writeback * 581 + unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp) 582 + { 583 + return inode_to_wb(inode); 584 + } 585 + 586 + static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked) 587 + { 588 + } 589 + 590 + static inline void wb_memcg_offline(struct mem_cgroup *memcg) 591 + { 592 + } 593 + 594 + static inline void wb_blkcg_offline(struct blkcg *blkcg) 595 + { 596 + } 597 + 598 + struct wb_iter { 599 + int next_id; 600 + }; 601 + 602 + #define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \ 603 + for ((iter)->next_id = (start_blkcg_id); \ 604 + ({ (wb_cur) = !(iter)->next_id++ ? 
&(bdi)->wb : NULL; }); ) 605 + 606 + static inline int inode_congested(struct inode *inode, int cong_bits) 607 + { 608 + return wb_congested(&inode_to_bdi(inode)->wb, cong_bits); 609 + } 610 + 611 + #endif /* CONFIG_CGROUP_WRITEBACK */ 612 + 613 + static inline int inode_read_congested(struct inode *inode) 614 + { 615 + return inode_congested(inode, 1 << WB_sync_congested); 616 + } 617 + 618 + static inline int inode_write_congested(struct inode *inode) 619 + { 620 + return inode_congested(inode, 1 << WB_async_congested); 621 + } 622 + 623 + static inline int inode_rw_congested(struct inode *inode) 624 + { 625 + return inode_congested(inode, (1 << WB_sync_congested) | 626 + (1 << WB_async_congested)); 627 + } 628 + 629 + static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits) 630 + { 631 + return wb_congested(&bdi->wb, cong_bits); 632 + } 633 + 634 + static inline int bdi_read_congested(struct backing_dev_info *bdi) 635 + { 636 + return bdi_congested(bdi, 1 << WB_sync_congested); 637 + } 638 + 639 + static inline int bdi_write_congested(struct backing_dev_info *bdi) 640 + { 641 + return bdi_congested(bdi, 1 << WB_async_congested); 642 + } 643 + 644 + static inline int bdi_rw_congested(struct backing_dev_info *bdi) 645 + { 646 + return bdi_congested(bdi, (1 << WB_sync_congested) | 647 + (1 << WB_async_congested)); 648 + } 649 + 650 + #endif /* _LINUX_BACKING_DEV_H */
+3
include/linux/bio.h
··· 
482 482 extern unsigned int bvec_nr_vecs(unsigned short idx);
483 483 
484 484 #ifdef CONFIG_BLK_CGROUP
485 + int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
485 486 int bio_associate_current(struct bio *bio);
486 487 void bio_disassociate_task(struct bio *bio);
487 488 #else	/* CONFIG_BLK_CGROUP */
489 + static inline int bio_associate_blkcg(struct bio *bio,
490 + 			struct cgroup_subsys_state *blkcg_css) { return 0; }
488 491 static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
489 492 static inline void bio_disassociate_task(struct bio *bio) { }
490 493 #endif	/* CONFIG_BLK_CGROUP */
+1 -20
include/linux/blkdev.h
··· 
12 12 #include <linux/timer.h>
13 13 #include <linux/workqueue.h>
14 14 #include <linux/pagemap.h>
15 - #include <linux/backing-dev.h>
15 + #include <linux/backing-dev-defs.h>
16 16 #include <linux/wait.h>
17 17 #include <linux/mempool.h>
18 18 #include <linux/bio.h>
··· 
786 786 			 unsigned int, void __user *);
787 787 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
788 788 			 struct scsi_ioctl_command __user *);
789 - 
790 - /*
791 -  * A queue has just exitted congestion.  Note this in the global counter of
792 -  * congested queues, and wake up anyone who was waiting for requests to be
793 -  * put back.
794 -  */
795 - static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
796 - {
797 - 	clear_bdi_congested(&q->backing_dev_info, sync);
798 - }
799 - 
800 - /*
801 -  * A queue has just entered congestion.  Flag that in the queue's VM-visible
802 -  * state flags and increment the global gounter of congested queues.
803 -  */
804 - static inline void blk_set_queue_congested(struct request_queue *q, int sync)
805 - {
806 - 	set_bdi_congested(&q->backing_dev_info, sync);
807 - }
808 789 
809 790 extern void blk_start_queue(struct request_queue *q);
810 791 extern void blk_stop_queue(struct request_queue *q);
+25
include/linux/cgroup.h
··· 
774 774 }
775 775 
776 776 /**
777 +  * task_get_css - find and get the css for (task, subsys)
778 +  * @task: the target task
779 +  * @subsys_id: the target subsystem ID
780 +  *
781 +  * Find the css for the (@task, @subsys_id) combination, increment a
782 +  * reference on and return it.  This function is guaranteed to return a
783 +  * valid css.
784 +  */
785 + static inline struct cgroup_subsys_state *
786 + task_get_css(struct task_struct *task, int subsys_id)
787 + {
788 + 	struct cgroup_subsys_state *css;
789 + 
790 + 	rcu_read_lock();
791 + 	while (true) {
792 + 		css = task_css(task, subsys_id);
793 + 		if (likely(css_tryget_online(css)))
794 + 			break;
795 + 		cpu_relax();
796 + 	}
797 + 	rcu_read_unlock();
798 + 	return css;
799 + }
800 + 
801 + /**
777 802  * task_css_is_root - test whether a task belongs to the root css
778 803  * @task: the target task
779 804  * @subsys_id: the target subsystem ID
+25 -1
include/linux/fs.h
··· 
35 35 #include <uapi/linux/fs.h>
36 36 
37 37 struct backing_dev_info;
38 + struct bdi_writeback;
38 39 struct export_operations;
39 40 struct hd_geometry;
40 41 struct iovec;
··· 
635 634 
636 635 	struct hlist_node	i_hash;
637 636 	struct list_head	i_wb_list;	/* backing dev IO list */
637 + #ifdef CONFIG_CGROUP_WRITEBACK
638 + 	struct bdi_writeback	*i_wb;		/* the associated cgroup wb */
639 + 
640 + 	/* foreign inode detection, see wbc_detach_inode() */
641 + 	int			i_wb_frn_winner;
642 + 	u16			i_wb_frn_avg_time;
643 + 	u16			i_wb_frn_history;
644 + #endif
638 645 	struct list_head	i_lru;		/* inode LRU list */
639 646 	struct list_head	i_sb_list;
640 647 	union {
··· 
1241 1232 #define UMOUNT_NOFOLLOW	0x00000008	/* Don't follow symlink on umount */
1242 1233 #define UMOUNT_UNUSED	0x80000000	/* Flag guaranteed to be unused */
1243 1234 
1235 + /* sb->s_iflags */
1236 + #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
1244 1237 
1245 1238 /* Possible states of 'frozen' field */
1246 1239 enum {
··· 
1281 1270 	const struct quotactl_ops	*s_qcop;
1282 1271 	const struct export_operations *s_export_op;
1283 1272 	unsigned long		s_flags;
1273 + 	unsigned long		s_iflags;	/* internal SB_I_* flags */
1284 1274 	unsigned long		s_magic;
1285 1275 	struct dentry		*s_root;
1286 1276 	struct rw_semaphore	s_umount;
··· 
1818 1806  *
1819 1807  * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
1820 1808  *
1809 +  * I_WB_SWITCH		Cgroup bdi_writeback switching in progress.  Used to
1810 +  *			synchronize competing switching instances and to tell
1811 +  *			wb stat updates to grab mapping->tree_lock.  See
1812 +  *			inode_switch_wb_work_fn() for details.
1813 +  *
1821 1814  * Q: What is the difference between I_WILL_FREE and I_FREEING?
1822 1815  */
1823 1816 #define I_DIRTY_SYNC		(1 << 0)
··· 
1842 1825 #define I_DIRTY_TIME		(1 << 11)
1843 1826 #define __I_DIRTY_TIME_EXPIRED	12
1844 1827 #define I_DIRTY_TIME_EXPIRED	(1 << __I_DIRTY_TIME_EXPIRED)
1828 + #define I_WB_SWITCH		(1 << 13)
1845 1829 
1846 1830 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
1847 1831 #define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
··· 
2259 2241 extern void emergency_thaw_all(void);
2260 2242 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
2261 2243 extern int fsync_bdev(struct block_device *);
2262 - extern int sb_is_blkdev_sb(struct super_block *sb);
2244 + 
2245 + extern struct super_block *blockdev_superblock;
2246 + 
2247 + static inline bool sb_is_blkdev_sb(struct super_block *sb)
2248 + {
2249 + 	return sb == blockdev_superblock;
2250 + }
2263 2251 #else
2264 2252 static inline void bd_forget(struct inode *inode) {}
2265 2253 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
+29
include/linux/memcontrol.h
··· 
41 41 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
42 42 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
43 43 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
44 + 	MEM_CGROUP_STAT_DIRTY,		/* # of dirty pages in page cache */
44 45 	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
45 46 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
46 47 	MEM_CGROUP_STAT_NSTATS,
··· 
68 67 };
69 68 
70 69 #ifdef CONFIG_MEMCG
70 + extern struct cgroup_subsys_state *mem_cgroup_root_css;
71 + 
71 72 void mem_cgroup_events(struct mem_cgroup *memcg,
72 73 		       enum mem_cgroup_events_index idx,
73 74 		       unsigned int nr);
··· 
115 112 }
116 113 
117 114 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
115 + extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
118 116 
119 117 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
120 118 				   struct mem_cgroup *,
··· 
198 194 
199 195 #else	/* CONFIG_MEMCG */
200 196 struct mem_cgroup;
197 + 
198 + #define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
201 199 
202 200 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
203 201 				     enum mem_cgroup_events_index idx,
··· 
387 381 	SOFT_LIMIT,
388 382 	OVER_LIMIT,
389 383 };
384 + 
385 + #ifdef CONFIG_CGROUP_WRITEBACK
386 + 
387 + struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
388 + struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
389 + void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
390 + 			 unsigned long *pdirty, unsigned long *pwriteback);
391 + 
392 + #else	/* CONFIG_CGROUP_WRITEBACK */
393 + 
394 + static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
395 + {
396 + 	return NULL;
397 + }
398 + 
399 + static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
400 + 				       unsigned long *pavail,
401 + 				       unsigned long *pdirty,
402 + 				       unsigned long *pwriteback)
403 + {
404 + }
405 + 
406 + #endif	/* CONFIG_CGROUP_WRITEBACK */
390 407 
391 408 struct sock;
392 409 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+6 -2
include/linux/mm.h
··· 
27 27 struct file_ra_state;
28 28 struct user_struct;
29 29 struct writeback_control;
30 + struct bdi_writeback;
30 31 
31 32 #ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
32 33 extern unsigned long max_mapnr;
··· 
1212 1211 int __set_page_dirty_no_writeback(struct page *page);
1213 1212 int redirty_page_for_writepage(struct writeback_control *wbc,
1214 1213 				struct page *page);
1215 - void account_page_dirtied(struct page *page, struct address_space *mapping);
1216 - void account_page_cleaned(struct page *page, struct address_space *mapping);
1214 + void account_page_dirtied(struct page *page, struct address_space *mapping,
1215 + 			  struct mem_cgroup *memcg);
1216 + void account_page_cleaned(struct page *page, struct address_space *mapping,
1217 + 			  struct mem_cgroup *memcg, struct bdi_writeback *wb);
1217 1218 int set_page_dirty(struct page *page);
1218 1219 int set_page_dirty_lock(struct page *page);
1220 + void cancel_dirty_page(struct page *page);
1219 1221 int clear_page_dirty_for_io(struct page *page);
1220 1222 
1221 1223 int get_cmdline(struct task_struct *task, char *buffer, int buflen);
+2 -1
include/linux/pagemap.h
··· 
651 651 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
652 652 				pgoff_t index, gfp_t gfp_mask);
653 653 extern void delete_from_page_cache(struct page *page);
654 - extern void __delete_from_page_cache(struct page *page, void *shadow);
654 + extern void __delete_from_page_cache(struct page *page, void *shadow,
655 + 				     struct mem_cgroup *memcg);
655 656 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
656 657 
657 658 /*
+207 -14
include/linux/writeback.h
··· 7 7 #include <linux/sched.h> 8 8 #include <linux/workqueue.h> 9 9 #include <linux/fs.h> 10 + #include <linux/flex_proportions.h> 11 + #include <linux/backing-dev-defs.h> 10 12 11 13 DECLARE_PER_CPU(int, dirty_throttle_leaks); 12 14 ··· 86 84 unsigned for_reclaim:1; /* Invoked from the page allocator */ 87 85 unsigned range_cyclic:1; /* range_start is cyclic */ 88 86 unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ 87 + #ifdef CONFIG_CGROUP_WRITEBACK 88 + struct bdi_writeback *wb; /* wb this writeback is issued under */ 89 + struct inode *inode; /* inode being written out */ 90 + 91 + /* foreign inode detection, see wbc_detach_inode() */ 92 + int wb_id; /* current wb id */ 93 + int wb_lcand_id; /* last foreign candidate wb id */ 94 + int wb_tcand_id; /* this foreign candidate wb id */ 95 + size_t wb_bytes; /* bytes written by current wb */ 96 + size_t wb_lcand_bytes; /* bytes written by last candidate */ 97 + size_t wb_tcand_bytes; /* bytes written by this candidate */ 98 + #endif 89 99 }; 100 + 101 + /* 102 + * A wb_domain represents a domain that wb's (bdi_writeback's) belong to 103 + * and are measured against each other in. There always is one global 104 + * domain, global_wb_domain, that every wb in the system is a member of. 105 + * This allows measuring the relative bandwidth of each wb to distribute 106 + * dirtyable memory accordingly. 107 + */ 108 + struct wb_domain { 109 + spinlock_t lock; 110 + 111 + /* 112 + * Scale the writeback cache size proportional to the relative 113 + * writeout speed. 114 + * 115 + * We do this by keeping a floating proportion between BDIs, based 116 + * on page writeback completions [end_page_writeback()]. Those 117 + * devices that write out pages fastest will get the larger share, 118 + * while the slower will get a smaller share. 119 + * 120 + * We use page writeout completions because we are interested in 121 + * getting rid of dirty pages. Having them written out is the 122 + * primary goal. 
123 + * 124 + * We introduce a concept of time, a period over which we measure 125 + * these events, because demand can/will vary over time. The length 126 + * of this period itself is measured in page writeback completions. 127 + */ 128 + struct fprop_global completions; 129 + struct timer_list period_timer; /* timer for aging of completions */ 130 + unsigned long period_time; 131 + 132 + /* 133 + * The dirtyable memory and dirty threshold could be suddenly 134 + * knocked down by a large amount (eg. on the startup of KVM in a 135 + * swapless system). This may throw the system into deep dirty 136 + * exceeded state and throttle heavy/light dirtiers alike. To 137 + * retain good responsiveness, maintain global_dirty_limit for 138 + * tracking slowly down to the knocked down dirty threshold. 139 + * 140 + * Both fields are protected by ->lock. 141 + */ 142 + unsigned long dirty_limit_tstamp; 143 + unsigned long dirty_limit; 144 + }; 145 + 146 + /** 147 + * wb_domain_size_changed - memory available to a wb_domain has changed 148 + * @dom: wb_domain of interest 149 + * 150 + * This function should be called when the amount of memory available to 151 + * @dom has changed. It resets @dom's dirty limit parameters to prevent 152 + * the past values which don't match the current configuration from skewing 153 + * dirty throttling. Without this, when memory size of a wb_domain is 154 + * greatly reduced, the dirty throttling logic may allow too many pages to 155 + * be dirtied leading to consecutive unnecessary OOMs and may get stuck in 156 + * that situation. 
157 + */ 158 + static inline void wb_domain_size_changed(struct wb_domain *dom) 159 + { 160 + spin_lock(&dom->lock); 161 + dom->dirty_limit_tstamp = jiffies; 162 + dom->dirty_limit = 0; 163 + spin_unlock(&dom->lock); 164 + } 90 165 91 166 /* 92 167 * fs/fs-writeback.c ··· 172 93 void writeback_inodes_sb(struct super_block *, enum wb_reason reason); 173 94 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr, 174 95 enum wb_reason reason); 175 - int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason); 176 - int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr, 177 - enum wb_reason reason); 96 + bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason); 97 + bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr, 98 + enum wb_reason reason); 178 99 void sync_inodes_sb(struct super_block *); 179 100 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason); 180 101 void inode_wait_for_writeback(struct inode *inode); ··· 185 106 might_sleep(); 186 107 wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE); 187 108 } 109 + 110 + #ifdef CONFIG_CGROUP_WRITEBACK 111 + 112 + #include <linux/cgroup.h> 113 + #include <linux/bio.h> 114 + 115 + void __inode_attach_wb(struct inode *inode, struct page *page); 116 + void wbc_attach_and_unlock_inode(struct writeback_control *wbc, 117 + struct inode *inode) 118 + __releases(&inode->i_lock); 119 + void wbc_detach_inode(struct writeback_control *wbc); 120 + void wbc_account_io(struct writeback_control *wbc, struct page *page, 121 + size_t bytes); 122 + 123 + /** 124 + * inode_attach_wb - associate an inode with its wb 125 + * @inode: inode of interest 126 + * @page: page being dirtied (may be NULL) 127 + * 128 + * If @inode doesn't have its wb, associate it with the wb matching the 129 + * memcg of @page or, if @page is NULL, %current. May be called w/ or w/o 130 + * @inode->i_lock. 
131 + */ 132 + static inline void inode_attach_wb(struct inode *inode, struct page *page) 133 + { 134 + if (!inode->i_wb) 135 + __inode_attach_wb(inode, page); 136 + } 137 + 138 + /** 139 + * inode_detach_wb - disassociate an inode from its wb 140 + * @inode: inode of interest 141 + * 142 + * @inode is being freed. Detach from its wb. 143 + */ 144 + static inline void inode_detach_wb(struct inode *inode) 145 + { 146 + if (inode->i_wb) { 147 + wb_put(inode->i_wb); 148 + inode->i_wb = NULL; 149 + } 150 + } 151 + 152 + /** 153 + * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite 154 + * @wbc: writeback_control of interest 155 + * @inode: target inode 156 + * 157 + * This function is to be used by __filemap_fdatawrite_range(), which is an 158 + * alternative entry point into writeback code, and first ensures @inode is 159 + * associated with a bdi_writeback and attaches it to @wbc. 160 + */ 161 + static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc, 162 + struct inode *inode) 163 + { 164 + spin_lock(&inode->i_lock); 165 + inode_attach_wb(inode, NULL); 166 + wbc_attach_and_unlock_inode(wbc, inode); 167 + } 168 + 169 + /** 170 + * wbc_init_bio - writeback specific initializtion of bio 171 + * @wbc: writeback_control for the writeback in progress 172 + * @bio: bio to be initialized 173 + * 174 + * @bio is a part of the writeback in progress controlled by @wbc. Perform 175 + * writeback specific initialization. This is used to apply the cgroup 176 + * writeback context. 177 + */ 178 + static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio) 179 + { 180 + /* 181 + * pageout() path doesn't attach @wbc to the inode being written 182 + * out. This is intentional as we don't want the function to block 183 + * behind a slow cgroup. Ultimately, we want pageout() to kick off 184 + * regular writeback instead of writing things out itself. 
185 + */ 186 + if (wbc->wb) 187 + bio_associate_blkcg(bio, wbc->wb->blkcg_css); 188 + } 189 + 190 + #else /* CONFIG_CGROUP_WRITEBACK */ 191 + 192 + static inline void inode_attach_wb(struct inode *inode, struct page *page) 193 + { 194 + } 195 + 196 + static inline void inode_detach_wb(struct inode *inode) 197 + { 198 + } 199 + 200 + static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc, 201 + struct inode *inode) 202 + __releases(&inode->i_lock) 203 + { 204 + spin_unlock(&inode->i_lock); 205 + } 206 + 207 + static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc, 208 + struct inode *inode) 209 + { 210 + } 211 + 212 + static inline void wbc_detach_inode(struct writeback_control *wbc) 213 + { 214 + } 215 + 216 + static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio) 217 + { 218 + } 219 + 220 + static inline void wbc_account_io(struct writeback_control *wbc, 221 + struct page *page, size_t bytes) 222 + { 223 + } 224 + 225 + #endif /* CONFIG_CGROUP_WRITEBACK */ 188 226 189 227 /* 190 228 * mm/page-writeback.c ··· 316 120 #endif 317 121 void throttle_vm_writeout(gfp_t gfp_mask); 318 122 bool zone_dirty_ok(struct zone *zone); 123 + int wb_domain_init(struct wb_domain *dom, gfp_t gfp); 124 + #ifdef CONFIG_CGROUP_WRITEBACK 125 + void wb_domain_exit(struct wb_domain *dom); 126 + #endif 319 127 320 - extern unsigned long global_dirty_limit; 128 + extern struct wb_domain global_wb_domain; 321 129 322 130 /* These are exported to sysctl. 
*/ 323 131 extern int dirty_background_ratio; ··· 355 155 void __user *, size_t *, loff_t *); 356 156 357 157 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty); 358 - unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, 359 - unsigned long dirty); 158 + unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh); 360 159 361 - void __bdi_update_bandwidth(struct backing_dev_info *bdi, 362 - unsigned long thresh, 363 - unsigned long bg_thresh, 364 - unsigned long dirty, 365 - unsigned long bdi_thresh, 366 - unsigned long bdi_dirty, 367 - unsigned long start_time); 368 - 160 + void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time); 369 161 void page_writeback_init(void); 370 162 void balance_dirty_pages_ratelimited(struct address_space *mapping); 163 + bool wb_over_bg_thresh(struct bdi_writeback *wb); 371 164 372 165 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, 373 166 void *data);
+8 -7
include/trace/events/writeback.h
··· 
360 360 		__entry->nr_written = global_page_state(NR_WRITTEN);
361 361 		__entry->background_thresh = background_thresh;
362 362 		__entry->dirty_thresh = dirty_thresh;
363 - 		__entry->dirty_limit = global_dirty_limit;
363 + 		__entry->dirty_limit = global_wb_domain.dirty_limit;
364 364 	),
365 365 
366 366 	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
··· 
399 399 
400 400 	TP_fast_assign(
401 401 		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
402 - 		__entry->write_bw = KBps(bdi->write_bandwidth);
403 - 		__entry->avg_write_bw = KBps(bdi->avg_write_bandwidth);
402 + 		__entry->write_bw = KBps(bdi->wb.write_bandwidth);
403 + 		__entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
404 404 		__entry->dirty_rate = KBps(dirty_rate);
405 - 		__entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
405 + 		__entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
406 406 		__entry->task_ratelimit = KBps(task_ratelimit);
407 407 		__entry->balanced_dirty_ratelimit =
408 - 			KBps(bdi->balanced_dirty_ratelimit);
408 + 			KBps(bdi->wb.balanced_dirty_ratelimit);
409 409 	),
410 410 
411 411 	TP_printk("bdi %s: "
··· 
462 462 		unsigned long freerun = (thresh + bg_thresh) / 2;
463 463 		strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
464 464 
465 - 		__entry->limit = global_dirty_limit;
466 - 		__entry->setpoint = (global_dirty_limit + freerun) / 2;
465 + 		__entry->limit = global_wb_domain.dirty_limit;
466 + 		__entry->setpoint = (global_wb_domain.dirty_limit +
467 + 				     freerun) / 2;
467 468 		__entry->dirty = dirty;
468 469 		__entry->bdi_setpoint = __entry->setpoint *
469 470 			bdi_thresh / (thresh + 1);
+5
init/Kconfig
··· 
1127 1127 	  Enable some debugging help. Currently it exports additional stat
1128 1128 	  files in a cgroup which can be useful for debugging.
1129 1129 
1130 + config CGROUP_WRITEBACK
1131 + 	bool
1132 + 	depends on MEMCG && BLK_CGROUP
1133 + 	default y
1134 + 
1130 1135 endif	# CGROUPS
1131 1136 
1132 1137 config CHECKPOINT_RESTORE
+517 -135
mm/backing-dev.c
··· 18 18 .name = "noop", 19 19 .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK, 20 20 }; 21 + EXPORT_SYMBOL_GPL(noop_backing_dev_info); 21 22 22 23 static struct class *bdi_class; 23 24 ··· 49 48 struct bdi_writeback *wb = &bdi->wb; 50 49 unsigned long background_thresh; 51 50 unsigned long dirty_thresh; 52 - unsigned long bdi_thresh; 51 + unsigned long wb_thresh; 53 52 unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time; 54 53 struct inode *inode; 55 54 ··· 67 66 spin_unlock(&wb->list_lock); 68 67 69 68 global_dirty_limits(&background_thresh, &dirty_thresh); 70 - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); 69 + wb_thresh = wb_calc_thresh(wb, dirty_thresh); 71 70 72 71 #define K(x) ((x) << (PAGE_SHIFT - 10)) 73 72 seq_printf(m, ··· 85 84 "b_dirty_time: %10lu\n" 86 85 "bdi_list: %10u\n" 87 86 "state: %10lx\n", 88 - (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), 89 - (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)), 90 - K(bdi_thresh), 87 + (unsigned long) K(wb_stat(wb, WB_WRITEBACK)), 88 + (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)), 89 + K(wb_thresh), 91 90 K(dirty_thresh), 92 91 K(background_thresh), 93 - (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)), 94 - (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)), 95 - (unsigned long) K(bdi->write_bandwidth), 92 + (unsigned long) K(wb_stat(wb, WB_DIRTIED)), 93 + (unsigned long) K(wb_stat(wb, WB_WRITTEN)), 94 + (unsigned long) K(wb->write_bandwidth), 96 95 nr_dirty, 97 96 nr_io, 98 97 nr_more_io, 99 98 nr_dirty_time, 100 - !list_empty(&bdi->bdi_list), bdi->state); 99 + !list_empty(&bdi->bdi_list), bdi->wb.state); 101 100 #undef K 102 101 103 102 return 0; ··· 256 255 } 257 256 subsys_initcall(default_bdi_init); 258 257 259 - int bdi_has_dirty_io(struct backing_dev_info *bdi) 260 - { 261 - return wb_has_dirty_io(&bdi->wb); 262 - } 263 - 264 258 /* 265 - * This function is used when the first inode for this bdi is marked dirty. It 259 + * This function is used when the first inode for this wb is marked dirty. 
It 266 260 * wakes-up the corresponding bdi thread which should then take care of the 267 261 * periodic background write-out of dirty inodes. Since the write-out would 268 262 * starts only 'dirty_writeback_interval' centisecs from now anyway, we just ··· 270 274 * We have to be careful not to postpone flush work if it is scheduled for 271 275 * earlier. Thus we use queue_delayed_work(). 272 276 */ 273 - void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi) 277 + void wb_wakeup_delayed(struct bdi_writeback *wb) 274 278 { 275 279 unsigned long timeout; 276 280 277 281 timeout = msecs_to_jiffies(dirty_writeback_interval * 10); 278 - spin_lock_bh(&bdi->wb_lock); 279 - if (test_bit(BDI_registered, &bdi->state)) 280 - queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); 281 - spin_unlock_bh(&bdi->wb_lock); 282 + spin_lock_bh(&wb->work_lock); 283 + if (test_bit(WB_registered, &wb->state)) 284 + queue_delayed_work(bdi_wq, &wb->dwork, timeout); 285 + spin_unlock_bh(&wb->work_lock); 282 286 } 283 287 284 288 /* 285 - * Remove bdi from bdi_list, and ensure that it is no longer visible 289 + * Initial write bandwidth: 100 MB/s 286 290 */ 287 - static void bdi_remove_from_list(struct backing_dev_info *bdi) 288 - { 289 - spin_lock_bh(&bdi_lock); 290 - list_del_rcu(&bdi->bdi_list); 291 - spin_unlock_bh(&bdi_lock); 291 + #define INIT_BW (100 << (20 - PAGE_SHIFT)) 292 292 293 - synchronize_rcu_expedited(); 293 + static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, 294 + gfp_t gfp) 295 + { 296 + int i, err; 297 + 298 + memset(wb, 0, sizeof(*wb)); 299 + 300 + wb->bdi = bdi; 301 + wb->last_old_flush = jiffies; 302 + INIT_LIST_HEAD(&wb->b_dirty); 303 + INIT_LIST_HEAD(&wb->b_io); 304 + INIT_LIST_HEAD(&wb->b_more_io); 305 + INIT_LIST_HEAD(&wb->b_dirty_time); 306 + spin_lock_init(&wb->list_lock); 307 + 308 + wb->bw_time_stamp = jiffies; 309 + wb->balanced_dirty_ratelimit = INIT_BW; 310 + wb->dirty_ratelimit = INIT_BW; 311 + wb->write_bandwidth = INIT_BW; 312 
+ wb->avg_write_bandwidth = INIT_BW; 313 + 314 + spin_lock_init(&wb->work_lock); 315 + INIT_LIST_HEAD(&wb->work_list); 316 + INIT_DELAYED_WORK(&wb->dwork, wb_workfn); 317 + 318 + err = fprop_local_init_percpu(&wb->completions, gfp); 319 + if (err) 320 + return err; 321 + 322 + for (i = 0; i < NR_WB_STAT_ITEMS; i++) { 323 + err = percpu_counter_init(&wb->stat[i], 0, gfp); 324 + if (err) { 325 + while (--i) 326 + percpu_counter_destroy(&wb->stat[i]); 327 + fprop_local_destroy_percpu(&wb->completions); 328 + return err; 329 + } 330 + } 331 + 332 + return 0; 294 333 } 334 + 335 + /* 336 + * Remove bdi from the global list and shutdown any threads we have running 337 + */ 338 + static void wb_shutdown(struct bdi_writeback *wb) 339 + { 340 + /* Make sure nobody queues further work */ 341 + spin_lock_bh(&wb->work_lock); 342 + if (!test_and_clear_bit(WB_registered, &wb->state)) { 343 + spin_unlock_bh(&wb->work_lock); 344 + return; 345 + } 346 + spin_unlock_bh(&wb->work_lock); 347 + 348 + /* 349 + * Drain work list and shutdown the delayed_work. !WB_registered 350 + * tells wb_workfn() that @wb is dying and its work_list needs to 351 + * be drained no matter what. 352 + */ 353 + mod_delayed_work(bdi_wq, &wb->dwork, 0); 354 + flush_delayed_work(&wb->dwork); 355 + WARN_ON(!list_empty(&wb->work_list)); 356 + } 357 + 358 + static void wb_exit(struct bdi_writeback *wb) 359 + { 360 + int i; 361 + 362 + WARN_ON(delayed_work_pending(&wb->dwork)); 363 + 364 + for (i = 0; i < NR_WB_STAT_ITEMS; i++) 365 + percpu_counter_destroy(&wb->stat[i]); 366 + 367 + fprop_local_destroy_percpu(&wb->completions); 368 + } 369 + 370 + #ifdef CONFIG_CGROUP_WRITEBACK 371 + 372 + #include <linux/memcontrol.h> 373 + 374 + /* 375 + * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree, 376 + * blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU 377 + * protected. cgwb_release_wait is used to wait for the completion of cgwb 378 + * releases from bdi destruction path. 
379 + */ 380 + static DEFINE_SPINLOCK(cgwb_lock); 381 + static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait); 382 + 383 + /** 384 + * wb_congested_get_create - get or create a wb_congested 385 + * @bdi: associated bdi 386 + * @blkcg_id: ID of the associated blkcg 387 + * @gfp: allocation mask 388 + * 389 + * Look up the wb_congested for @blkcg_id on @bdi. If missing, create one. 390 + * The returned wb_congested has its reference count incremented. Returns 391 + * NULL on failure. 392 + */ 393 + struct bdi_writeback_congested * 394 + wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp) 395 + { 396 + struct bdi_writeback_congested *new_congested = NULL, *congested; 397 + struct rb_node **node, *parent; 398 + unsigned long flags; 399 + 400 + if (blkcg_id == 1) 401 + return &bdi->wb_congested; 402 + retry: 403 + spin_lock_irqsave(&cgwb_lock, flags); 404 + 405 + node = &bdi->cgwb_congested_tree.rb_node; 406 + parent = NULL; 407 + 408 + while (*node != NULL) { 409 + parent = *node; 410 + congested = container_of(parent, struct bdi_writeback_congested, 411 + rb_node); 412 + if (congested->blkcg_id < blkcg_id) 413 + node = &parent->rb_left; 414 + else if (congested->blkcg_id > blkcg_id) 415 + node = &parent->rb_right; 416 + else 417 + goto found; 418 + } 419 + 420 + if (new_congested) { 421 + /* !found and storage for new one already allocated, insert */ 422 + congested = new_congested; 423 + new_congested = NULL; 424 + rb_link_node(&congested->rb_node, parent, node); 425 + rb_insert_color(&congested->rb_node, &bdi->cgwb_congested_tree); 426 + atomic_inc(&bdi->usage_cnt); 427 + goto found; 428 + } 429 + 430 + spin_unlock_irqrestore(&cgwb_lock, flags); 431 + 432 + /* allocate storage for new one and retry */ 433 + new_congested = kzalloc(sizeof(*new_congested), gfp); 434 + if (!new_congested) 435 + return NULL; 436 + 437 + atomic_set(&new_congested->refcnt, 0); 438 + new_congested->bdi = bdi; 439 + new_congested->blkcg_id = blkcg_id; 440 + goto 
retry; 441 + 442 + found: 443 + atomic_inc(&congested->refcnt); 444 + spin_unlock_irqrestore(&cgwb_lock, flags); 445 + kfree(new_congested); 446 + return congested; 447 + } 448 + 449 + /** 450 + * wb_congested_put - put a wb_congested 451 + * @congested: wb_congested to put 452 + * 453 + * Put @congested and destroy it if the refcnt reaches zero. 454 + */ 455 + void wb_congested_put(struct bdi_writeback_congested *congested) 456 + { 457 + struct backing_dev_info *bdi = congested->bdi; 458 + unsigned long flags; 459 + 460 + if (congested->blkcg_id == 1) 461 + return; 462 + 463 + local_irq_save(flags); 464 + if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) { 465 + local_irq_restore(flags); 466 + return; 467 + } 468 + 469 + rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree); 470 + spin_unlock_irqrestore(&cgwb_lock, flags); 471 + kfree(congested); 472 + 473 + if (atomic_dec_and_test(&bdi->usage_cnt)) 474 + wake_up_all(&cgwb_release_wait); 475 + } 476 + 477 + static void cgwb_release_workfn(struct work_struct *work) 478 + { 479 + struct bdi_writeback *wb = container_of(work, struct bdi_writeback, 480 + release_work); 481 + struct backing_dev_info *bdi = wb->bdi; 482 + 483 + wb_shutdown(wb); 484 + 485 + css_put(wb->memcg_css); 486 + css_put(wb->blkcg_css); 487 + wb_congested_put(wb->congested); 488 + 489 + fprop_local_destroy_percpu(&wb->memcg_completions); 490 + percpu_ref_exit(&wb->refcnt); 491 + wb_exit(wb); 492 + kfree_rcu(wb, rcu); 493 + 494 + if (atomic_dec_and_test(&bdi->usage_cnt)) 495 + wake_up_all(&cgwb_release_wait); 496 + } 497 + 498 + static void cgwb_release(struct percpu_ref *refcnt) 499 + { 500 + struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback, 501 + refcnt); 502 + schedule_work(&wb->release_work); 503 + } 504 + 505 + static void cgwb_kill(struct bdi_writeback *wb) 506 + { 507 + lockdep_assert_held(&cgwb_lock); 508 + 509 + WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); 510 + 
+	list_del(&wb->memcg_node);
+	list_del(&wb->blkcg_node);
+	percpu_ref_kill(&wb->refcnt);
+}
+
+static int cgwb_create(struct backing_dev_info *bdi,
+		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
+{
+	struct mem_cgroup *memcg;
+	struct cgroup_subsys_state *blkcg_css;
+	struct blkcg *blkcg;
+	struct list_head *memcg_cgwb_list, *blkcg_cgwb_list;
+	struct bdi_writeback *wb;
+	unsigned long flags;
+	int ret = 0;
+
+	memcg = mem_cgroup_from_css(memcg_css);
+	blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &blkio_cgrp_subsys);
+	blkcg = css_to_blkcg(blkcg_css);
+	memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
+	blkcg_cgwb_list = &blkcg->cgwb_list;
+
+	/* look up again under lock and discard on blkcg mismatch */
+	spin_lock_irqsave(&cgwb_lock, flags);
+	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+	if (wb && wb->blkcg_css != blkcg_css) {
+		cgwb_kill(wb);
+		wb = NULL;
+	}
+	spin_unlock_irqrestore(&cgwb_lock, flags);
+	if (wb)
+		goto out_put;
+
+	/* need to create a new one */
+	wb = kmalloc(sizeof(*wb), gfp);
+	if (!wb)
+		return -ENOMEM;
+
+	ret = wb_init(wb, bdi, gfp);
+	if (ret)
+		goto err_free;
+
+	ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp);
+	if (ret)
+		goto err_wb_exit;
+
+	ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
+	if (ret)
+		goto err_ref_exit;
+
+	wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
+	if (!wb->congested) {
+		ret = -ENOMEM;
+		goto err_fprop_exit;
+	}
+
+	wb->memcg_css = memcg_css;
+	wb->blkcg_css = blkcg_css;
+	INIT_WORK(&wb->release_work, cgwb_release_workfn);
+	set_bit(WB_registered, &wb->state);
+
+	/*
+	 * The root wb determines the registered state of the whole bdi and
+	 * memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate
+	 * whether they're still online.  Don't link @wb if any is dead.
+	 * See wb_memcg_offline() and wb_blkcg_offline().
+	 */
+	ret = -ENODEV;
+	spin_lock_irqsave(&cgwb_lock, flags);
+	if (test_bit(WB_registered, &bdi->wb.state) &&
+	    blkcg_cgwb_list->next && memcg_cgwb_list->next) {
+		/* we might have raced another instance of this function */
+		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
+		if (!ret) {
+			atomic_inc(&bdi->usage_cnt);
+			list_add(&wb->memcg_node, memcg_cgwb_list);
+			list_add(&wb->blkcg_node, blkcg_cgwb_list);
+			css_get(memcg_css);
+			css_get(blkcg_css);
+		}
+	}
+	spin_unlock_irqrestore(&cgwb_lock, flags);
+	if (ret) {
+		if (ret == -EEXIST)
+			ret = 0;
+		goto err_put_congested;
+	}
+	goto out_put;
+
+err_put_congested:
+	wb_congested_put(wb->congested);
+err_fprop_exit:
+	fprop_local_destroy_percpu(&wb->memcg_completions);
+err_ref_exit:
+	percpu_ref_exit(&wb->refcnt);
+err_wb_exit:
+	wb_exit(wb);
+err_free:
+	kfree(wb);
+out_put:
+	css_put(blkcg_css);
+	return ret;
+}
+
+/**
+ * wb_get_create - get wb for a given memcg, create if necessary
+ * @bdi: target bdi
+ * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
+ * @gfp: allocation mask to use
+ *
+ * Try to get the wb for @memcg_css on @bdi.  If it doesn't exist, try to
+ * create one.  The returned wb has its refcount incremented.
+ *
+ * This function uses css_get() on @memcg_css and thus expects its refcnt
+ * to be positive on invocation.  IOW, rcu_read_lock() protection on
+ * @memcg_css isn't enough.  try_get it before calling this function.
+ *
+ * A wb is keyed by its associated memcg.  As blkcg implicitly enables
+ * memcg on the default hierarchy, memcg association is guaranteed to be
+ * more specific (equal or descendant to the associated blkcg) and thus can
+ * identify both the memcg and blkcg associations.
+ *
+ * Because the blkcg associated with a memcg may change as blkcg is enabled
+ * and disabled closer to root in the hierarchy, each wb keeps track of
+ * both the memcg and blkcg associated with it and verifies the blkcg on
+ * each lookup.  On mismatch, the existing wb is discarded and a new one is
+ * created.
+ */
+struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
+				    struct cgroup_subsys_state *memcg_css,
+				    gfp_t gfp)
+{
+	struct bdi_writeback *wb;
+
+	might_sleep_if(gfp & __GFP_WAIT);
+
+	if (!memcg_css->parent)
+		return &bdi->wb;
+
+	do {
+		rcu_read_lock();
+		wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+		if (wb) {
+			struct cgroup_subsys_state *blkcg_css;
+
+			/* see whether the blkcg association has changed */
+			blkcg_css = cgroup_get_e_css(memcg_css->cgroup,
+						     &blkio_cgrp_subsys);
+			if (unlikely(wb->blkcg_css != blkcg_css ||
+				     !wb_tryget(wb)))
+				wb = NULL;
+			css_put(blkcg_css);
+		}
+		rcu_read_unlock();
+	} while (!wb && !cgwb_create(bdi, memcg_css, gfp));
+
+	return wb;
+}
+
+static void cgwb_bdi_init(struct backing_dev_info *bdi)
+{
+	bdi->wb.memcg_css = mem_cgroup_root_css;
+	bdi->wb.blkcg_css = blkcg_root_css;
+	bdi->wb_congested.blkcg_id = 1;
+	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
+	bdi->cgwb_congested_tree = RB_ROOT;
+	atomic_set(&bdi->usage_cnt, 1);
+}
+
+static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+
+	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
+
+	spin_lock_irq(&cgwb_lock);
+	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
+		cgwb_kill(*slot);
+	spin_unlock_irq(&cgwb_lock);
+
+	/*
+	 * All cgwb's and their congested states must be shutdown and
+	 * released before returning.  Drain the usage counter to wait for
+	 * all cgwb's and cgwb_congested's ever created on @bdi.
+	 */
+	atomic_dec(&bdi->usage_cnt);
+	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
+}
+
+/**
+ * wb_memcg_offline - kill all wb's associated with a memcg being offlined
+ * @memcg: memcg being offlined
+ *
+ * Also prevents creation of any new wb's associated with @memcg.
+ */
+void wb_memcg_offline(struct mem_cgroup *memcg)
+{
+	LIST_HEAD(to_destroy);
+	struct list_head *memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
+	struct bdi_writeback *wb, *next;
+
+	spin_lock_irq(&cgwb_lock);
+	list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
+		cgwb_kill(wb);
+	memcg_cgwb_list->next = NULL;	/* prevent new wb's */
+	spin_unlock_irq(&cgwb_lock);
+}
+
+/**
+ * wb_blkcg_offline - kill all wb's associated with a blkcg being offlined
+ * @blkcg: blkcg being offlined
+ *
+ * Also prevents creation of any new wb's associated with @blkcg.
+ */
+void wb_blkcg_offline(struct blkcg *blkcg)
+{
+	LIST_HEAD(to_destroy);
+	struct bdi_writeback *wb, *next;
+
+	spin_lock_irq(&cgwb_lock);
+	list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
+		cgwb_kill(wb);
+	blkcg->cgwb_list.next = NULL;	/* prevent new wb's */
+	spin_unlock_irq(&cgwb_lock);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static void cgwb_bdi_init(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
+int bdi_init(struct backing_dev_info *bdi)
+{
+	int err;
+
+	bdi->dev = NULL;
+
+	bdi->min_ratio = 0;
+	bdi->max_ratio = 100;
+	bdi->max_prop_frac = FPROP_FRAC_BASE;
+	INIT_LIST_HEAD(&bdi->bdi_list);
+	init_waitqueue_head(&bdi->wb_waitq);
+
+	err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
+	if (err)
+		return err;
+
+	bdi->wb_congested.state = 0;
+	bdi->wb.congested = &bdi->wb_congested;
+
+	cgwb_bdi_init(bdi);
+	return 0;
+}
+EXPORT_SYMBOL(bdi_init);

 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
		const char *fmt, ...)
···
 	bdi->dev = dev;

 	bdi_debug_register(bdi, dev_name(dev));
-	set_bit(BDI_registered, &bdi->state);
+	set_bit(WB_registered, &bdi->wb.state);

 	spin_lock_bh(&bdi_lock);
 	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
···
 EXPORT_SYMBOL(bdi_register_dev);

 /*
- * Remove bdi from the global list and shutdown any threads we have running
+ * Remove bdi from bdi_list, and ensure that it is no longer visible
  */
-static void bdi_wb_shutdown(struct backing_dev_info *bdi)
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
 {
-	/* Make sure nobody queues further work */
-	spin_lock_bh(&bdi->wb_lock);
-	if (!test_and_clear_bit(BDI_registered, &bdi->state)) {
-		spin_unlock_bh(&bdi->wb_lock);
-		return;
-	}
-	spin_unlock_bh(&bdi->wb_lock);
+	spin_lock_bh(&bdi_lock);
+	list_del_rcu(&bdi->bdi_list);
+	spin_unlock_bh(&bdi_lock);

-	/*
-	 * Make sure nobody finds us on the bdi_list anymore
-	 */
-	bdi_remove_from_list(bdi);
-
-	/*
-	 * Drain work list and shutdown the delayed_work.  At this point,
-	 * @bdi->bdi_list is empty telling bdi_writeback_workfn() that @bdi
-	 * is dying and its work_list needs to be drained no matter what.
-	 */
-	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
-	flush_delayed_work(&bdi->wb.dwork);
+	synchronize_rcu_expedited();
 }
-
-static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
-{
-	memset(wb, 0, sizeof(*wb));
-
-	wb->bdi = bdi;
-	wb->last_old_flush = jiffies;
-	INIT_LIST_HEAD(&wb->b_dirty);
-	INIT_LIST_HEAD(&wb->b_io);
-	INIT_LIST_HEAD(&wb->b_more_io);
-	INIT_LIST_HEAD(&wb->b_dirty_time);
-	spin_lock_init(&wb->list_lock);
-	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
-}
-
-/*
- * Initial write bandwidth: 100 MB/s
- */
-#define INIT_BW		(100 << (20 - PAGE_SHIFT))
-
-int bdi_init(struct backing_dev_info *bdi)
-{
-	int i, err;
-
-	bdi->dev = NULL;
-
-	bdi->min_ratio = 0;
-	bdi->max_ratio = 100;
-	bdi->max_prop_frac = FPROP_FRAC_BASE;
-	spin_lock_init(&bdi->wb_lock);
-	INIT_LIST_HEAD(&bdi->bdi_list);
-	INIT_LIST_HEAD(&bdi->work_list);
-
-	bdi_wb_init(&bdi->wb, bdi);
-
-	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
-		err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
-		if (err)
-			goto err;
-	}
-
-	bdi->dirty_exceeded = 0;
-
-	bdi->bw_time_stamp = jiffies;
-	bdi->written_stamp = 0;
-
-	bdi->balanced_dirty_ratelimit = INIT_BW;
-	bdi->dirty_ratelimit = INIT_BW;
-	bdi->write_bandwidth = INIT_BW;
-	bdi->avg_write_bandwidth = INIT_BW;
-
-	err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
-
-	if (err) {
-err:
-		while (i--)
-			percpu_counter_destroy(&bdi->bdi_stat[i]);
-	}
-
-	return err;
-}
-EXPORT_SYMBOL(bdi_init);

 void bdi_destroy(struct backing_dev_info *bdi)
 {
-	int i;
-
-	bdi_wb_shutdown(bdi);
-	bdi_set_min_ratio(bdi, 0);
-
-	WARN_ON(!list_empty(&bdi->work_list));
-	WARN_ON(delayed_work_pending(&bdi->wb.dwork));
+	/* make sure nobody finds us on the bdi_list anymore */
+	bdi_remove_from_list(bdi);
+	wb_shutdown(&bdi->wb);
+	cgwb_bdi_destroy(bdi);

 	if (bdi->dev) {
 		bdi_debug_unregister(bdi);
···
 		bdi->dev = NULL;
 	}

-	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
-		percpu_counter_destroy(&bdi->bdi_stat[i]);
-	fprop_local_destroy_percpu(&bdi->completions);
+	wb_exit(&bdi->wb);
 }
 EXPORT_SYMBOL(bdi_destroy);
···
	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 };
-static atomic_t nr_bdi_congested[2];
+static atomic_t nr_wb_congested[2];

-void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
 {
-	enum bdi_state bit;
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	enum wb_state bit;

-	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	if (test_and_clear_bit(bit, &bdi->state))
-		atomic_dec(&nr_bdi_congested[sync]);
+	bit = sync ? WB_sync_congested : WB_async_congested;
+	if (test_and_clear_bit(bit, &congested->state))
+		atomic_dec(&nr_wb_congested[sync]);
 	smp_mb__after_atomic();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
-EXPORT_SYMBOL(clear_bdi_congested);
+EXPORT_SYMBOL(clear_wb_congested);

-void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+void set_wb_congested(struct bdi_writeback_congested *congested, int sync)
 {
-	enum bdi_state bit;
+	enum wb_state bit;

-	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	if (!test_and_set_bit(bit, &bdi->state))
-		atomic_inc(&nr_bdi_congested[sync]);
+	bit = sync ? WB_sync_congested : WB_async_congested;
+	if (!test_and_set_bit(bit, &congested->state))
+		atomic_inc(&nr_wb_congested[sync]);
 }
-EXPORT_SYMBOL(set_bdi_congested);
+EXPORT_SYMBOL(set_wb_congested);

 /**
  * congestion_wait - wait for a backing_dev to become uncongested
···
	 * encountered in the current zone, yield if necessary instead
	 * of sleeping on the congestion queue
	 */
-	if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
+	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
		cond_resched();
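The cgwb_create()/wb_get_create() pair above is an optimistic lookup-then-create pattern: look up without holding anything expensive, allocate and initialize outside the lock, link under the lock, and treat losing the insertion race (-EEXIST) as success so the caller's retry loop simply picks up the winner's copy. A minimal single-threaded userspace sketch of that control flow — the toy array stands in for bdi->cgwb_tree, and all locking/RCU/refcounting from the kernel version is deliberately omitted:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy "tree": one slot per memcg id, standing in for bdi->cgwb_tree. */
#define NR_IDS 8
static int *tree[NR_IDS];

/* Insert fails with -EEXIST when a concurrent creator already linked one. */
static int tree_insert(int id, int *wb)
{
	if (tree[id])
		return -EEXIST;
	tree[id] = wb;
	return 0;
}

/*
 * Mirrors the shape of cgwb_create(): allocate first, then try to link;
 * a lost race (-EEXIST) is reported as success so the caller retries the
 * lookup, exactly as wb_get_create() does in its do/while loop.
 */
static int create(int id)
{
	int *wb = malloc(sizeof(*wb));
	int ret;

	if (!wb)
		return -ENOMEM;
	*wb = id;

	ret = tree_insert(id, wb);
	if (ret) {
		free(wb);	/* discard our copy, keep the winner's */
		return ret == -EEXIST ? 0 : ret;
	}
	return 0;
}

static int *get_create(int id)
{
	int *wb;

	do {
		wb = tree[id];		/* optimistic lookup */
	} while (!wb && !create(id));	/* on miss, create and retry */

	return wb;
}
```

Repeated calls converge on a single shared object per id, which is the property the retry loop is there to guarantee.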
+1 -1
mm/fadvise.c
···
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(bdi))
+		if (!inode_write_congested(mapping->host))
 			__filemap_fdatawrite_range(mapping, offset, endbyte,
						   WB_SYNC_NONE);
+25 -9
mm/filemap.c
···
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock	(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
+ *    ->memcg->move_lock	(page_remove_rmap->mem_cgroup_begin_page_stat)
  *    bdi.wb->list_lock	(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
···
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe.  The caller must hold the mapping's tree_lock.
+ * is safe.  The caller must hold the mapping's tree_lock and
+ * mem_cgroup_begin_page_stat().
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+void __delete_from_page_cache(struct page *page, void *shadow,
+			      struct mem_cgroup *memcg)
 {
 	struct address_space *mapping = page->mapping;

···
	 * anyway will be cleared before returning page into buddy allocator.
	 */
 	if (WARN_ON_ONCE(PageDirty(page)))
-		account_page_cleaned(page, mapping);
+		account_page_cleaned(page, mapping, memcg,
+				     inode_to_wb(mapping->host));
 }

 /**
···
 void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
 	void (*freepage)(struct page *);

 	BUG_ON(!PageLocked(page));

 	freepage = mapping->a_ops->freepage;
-	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page, NULL);
-	spin_unlock_irq(&mapping->tree_lock);
+
+	memcg = mem_cgroup_begin_page_stat(page);
+	spin_lock_irqsave(&mapping->tree_lock, flags);
+	__delete_from_page_cache(page, NULL, memcg);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	mem_cgroup_end_page_stat(memcg);

 	if (freepage)
 		freepage(page);
···
 	if (!mapping_cap_writeback_dirty(mapping))
 		return 0;

+	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
 	ret = do_writepages(mapping, &wbc);
+	wbc_detach_inode(&wbc);
 	return ret;
 }
···
 	if (!error) {
 		struct address_space *mapping = old->mapping;
 		void (*freepage)(struct page *);
+		struct mem_cgroup *memcg;
+		unsigned long flags;

 		pgoff_t offset = old->index;
 		freepage = mapping->a_ops->freepage;
···
 		new->mapping = mapping;
 		new->index = offset;

-		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old, NULL);
+		memcg = mem_cgroup_begin_page_stat(old);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		__delete_from_page_cache(old, NULL, memcg);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
···
 		__inc_zone_page_state(new, NR_FILE_PAGES);
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		mem_cgroup_end_page_stat(memcg);
 		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
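delete_from_page_cache() and replace_page_cache_page() now bracket the tree_lock critical section with mem_cgroup_begin_page_stat()/mem_cgroup_end_page_stat(), matching the new ->memcg->move_lock entry added to the lock-ordering comment: the memcg page-stat transaction must enclose tree_lock because __delete_from_page_cache() updates memcg dirty stats while tree_lock is held. A toy sketch of just that bracketing order — the functions below are userspace stand-ins that only record the sequence, not kernel code:

```c
#include <assert.h>
#include <string.h>

/* Log the order in which the pseudo-locks are taken and released. */
static const char *log_ev[8];
static int nlog;

static void ev(const char *what) { log_ev[nlog++] = what; }

/* stand-ins for the real primitives, recording ordering only */
static void mem_cgroup_begin_page_stat(void) { ev("begin_page_stat"); }
static void mem_cgroup_end_page_stat(void)   { ev("end_page_stat"); }
static void tree_lock(void)                  { ev("tree_lock"); }
static void tree_unlock(void)                { ev("tree_unlock"); }

/*
 * Same bracketing as the new delete_from_page_cache(): the page-stat
 * transaction opens before tree_lock and closes after tree_unlock.
 */
static void delete_from_page_cache(void)
{
	mem_cgroup_begin_page_stat();
	tree_lock();
	/* __delete_from_page_cache(page, NULL, memcg) would run here */
	tree_unlock();
	mem_cgroup_end_page_stat();
}
```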
+1
mm/madvise.c
···
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+172 -51
mm/memcontrol.c
···
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 static struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;

 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
···
 	"rss",
 	"rss_huge",
 	"mapped_file",
+	"dirty",
 	"writeback",
 	"swap",
 };
···
	 * percpu counter.
	 */
 	struct mem_cgroup_stat_cpu __percpu *stat;
-	/*
-	 * used when a cpu is offlined or other synchronizations
-	 * See mem_cgroup_read_stat().
-	 */
-	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;

 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
···
 	nodemask_t scan_nodes;
 	atomic_t numainfo_events;
 	atomic_t numainfo_updating;
+#endif
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct list_head cgwb_list;
+	struct wb_domain cgwb_domain;
 #endif

 	/* List of events which userspace want to receive */
···
 	return &memcg->css;
 }

+/**
+ * mem_cgroup_css_from_page - css of the memcg associated with a page
+ * @page: page of interest
+ *
+ * If memcg is bound to the default hierarchy, css of the memcg associated
+ * with @page is returned.  The returned css remains associated with @page
+ * until it is released.
+ *
+ * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
+ * is returned.
+ *
+ * XXX: The above description of behavior on the default hierarchy isn't
+ * strictly true yet as replace_page_cache_page() can modify the
+ * association before @page is released even on the default hierarchy;
+ * however, the current and planned usages don't mix the two functions
+ * and replace_page_cache_page() will soon be updated to make the invariant
+ * actually true.
+ */
+struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+
+	memcg = page->mem_cgroup;
+
+	if (!memcg || !cgroup_on_dfl(memcg->css.cgroup))
+		memcg = root_mem_cgroup;
+
+	rcu_read_unlock();
+	return &memcg->css;
+}
+
 static struct mem_cgroup_per_zone *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
···
 	long val = 0;
 	int cpu;

-	get_online_cpus();
-	for_each_online_cpu(cpu)
+	for_each_possible_cpu(cpu)
 		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.count[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
 	return val;
 }
···
 	unsigned long val = 0;
 	int cpu;

-	get_online_cpus();
-	for_each_online_cpu(cpu)
+	for_each_possible_cpu(cpu)
 		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
-	spin_lock(&memcg->pcp_counter_lock);
-	val += memcg->nocpu_base.events[idx];
-	spin_unlock(&memcg->pcp_counter_lock);
-#endif
-	put_online_cpus();
 	return val;
 }
···
 	return memcg;
 }
+EXPORT_SYMBOL(mem_cgroup_begin_page_stat);

 /**
  * mem_cgroup_end_page_stat - finish a page state statistics transaction
···
 	rcu_read_unlock();
 }
+EXPORT_SYMBOL(mem_cgroup_end_page_stat);

 /**
  * mem_cgroup_update_page_stat - update page state statistics
···
 	mutex_unlock(&percpu_charge_mutex);
 }

-/*
- * This function drains percpu counter value from DEAD cpu and
- * move it to local cpu. Note that this function can be preempted.
- */
-static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
-{
-	int i;
-
-	spin_lock(&memcg->pcp_counter_lock);
-	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-		long x = per_cpu(memcg->stat->count[i], cpu);
-
-		per_cpu(memcg->stat->count[i], cpu) = 0;
-		memcg->nocpu_base.count[i] += x;
-	}
-	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
-		unsigned long x = per_cpu(memcg->stat->events[i], cpu);
-
-		per_cpu(memcg->stat->events[i], cpu) = 0;
-		memcg->nocpu_base.events[i] += x;
-	}
-	spin_unlock(&memcg->pcp_counter_lock);
-}
-
 static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
				      unsigned long action,
				      void *hcpu)
 {
 	int cpu = (unsigned long)hcpu;
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *iter;

 	if (action == CPU_ONLINE)
 		return NOTIFY_OK;

 	if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
 		return NOTIFY_OK;
-
-	for_each_mem_cgroup(iter)
-		mem_cgroup_drain_pcp_counter(iter, cpu);

 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);
···
 }
 #endif

+#ifdef CONFIG_CGROUP_WRITEBACK
+
+struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
+{
+	return &memcg->cgwb_list;
+}
+
+static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
+{
+	return wb_domain_init(&memcg->cgwb_domain, gfp);
+}
+
+static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
+{
+	wb_domain_exit(&memcg->cgwb_domain);
+}
+
+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+	wb_domain_size_changed(&memcg->cgwb_domain);
+}
+
+struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+
+	if (!memcg->css.parent)
+		return NULL;
+
+	return &memcg->cgwb_domain;
+}
+
+/**
+ * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
+ * @wb: bdi_writeback in question
+ * @pavail: out parameter for number of available pages
+ * @pdirty: out parameter for number of dirty pages
+ * @pwriteback: out parameter for number of pages under writeback
+ *
+ * Determine the numbers of available, dirty, and writeback pages in @wb's
+ * memcg.  Dirty and writeback are self-explanatory.  Available is a bit
+ * more involved.
+ *
+ * A memcg's headroom is "min(max, high) - used".  The available memory is
+ * calculated as the lowest headroom of itself and the ancestors plus the
+ * number of pages already being used for file pages.  Note that this
+ * doesn't consider the actual amount of available memory in the system.
+ * The caller should further cap *@pavail accordingly.
+ */
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
+			 unsigned long *pdirty, unsigned long *pwriteback)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+	struct mem_cgroup *parent;
+	unsigned long head_room = PAGE_COUNTER_MAX;
+	unsigned long file_pages;
+
+	*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+
+	/* this should eventually include NR_UNSTABLE_NFS */
+	*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+
+	file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+						    (1 << LRU_ACTIVE_FILE));
+	while ((parent = parent_mem_cgroup(memcg))) {
+		unsigned long ceiling = min(memcg->memory.limit, memcg->high);
+		unsigned long used = page_counter_read(&memcg->memory);
+
+		head_room = min(head_room, ceiling - min(ceiling, used));
+		memcg = parent;
+	}
+
+	*pavail = file_pages + head_room;
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
+{
+	return 0;
+}
+
+static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
+{
+}
+
+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 /*
  * DO NOT USE IN NEW FILES.
  *
···
 	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
 	if (!memcg->stat)
 		goto out_free;
+
+	if (memcg_wb_domain_init(memcg, GFP_KERNEL))
+		goto out_free_stat;
+
 	spin_lock_init(&memcg->pcp_counter_lock);
 	return memcg;

+out_free_stat:
+	free_percpu(memcg->stat);
 out_free:
 	kfree(memcg);
 	return NULL;
···
 		free_mem_cgroup_per_zone_info(memcg, node);

 	free_percpu(memcg->stat);
+	memcg_wb_domain_exit(memcg);
 	kfree(memcg);
 }
···
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
+		mem_cgroup_root_css = &memcg->css;
 		page_counter_init(&memcg->memory, NULL);
 		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
···
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
 #endif
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+	INIT_LIST_HEAD(&memcg->cgwb_list);
+#endif
 	return &memcg->css;

 free_out:
···
 	vmpressure_cleanup(&memcg->vmpressure);

 	memcg_deactivate_kmem(memcg);
+
+	wb_memcg_offline(memcg);
 }

 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
···
 	memcg->low = 0;
 	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
+	memcg_wb_domain_size_changed(memcg);
 }

 #ifdef CONFIG_MMU
···
 {
 	unsigned long flags;
 	int ret;
+	bool anon;

 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
···
 	if (page->mem_cgroup != from)
 		goto out_unlock;

+	anon = PageAnon(page);
+
 	spin_lock_irqsave(&from->move_lock, flags);

-	if (!PageAnon(page) && page_mapped(page)) {
+	if (!anon && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
			       nr_pages);
+	}
+
+	/*
+	 * move_lock grabbed above and caller set from->moving_account, so
+	 * mem_cgroup_update_page_stat() will serialize updates to PageDirty.
+	 * So mapping should be stable for dirty pages.
+	 */
+	if (!anon && PageDirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping_cap_account_dirty(mapping)) {
+			__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY],
+				       nr_pages);
+			__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
+				       nr_pages);
+		}
 	}

 	if (PageWriteback(page)) {
···
 	memcg->high = high;

+	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
···
 	if (err)
 		return err;

+	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
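mem_cgroup_wb_stats() above computes "available" as the smallest headroom along the ancestor chain plus the pages already used for file cache; the `ceiling - min(ceiling, used)` form clamps each cgroup's headroom at zero when it is already over its ceiling. A userspace sketch of that walk over a hypothetical array of (limit, high, used) triples, child first, root excluded:

```c
#include <assert.h>

#define PAGE_COUNTER_MAX (~0UL)

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* hypothetical stand-in for a memcg's page counters, all in pages */
struct cg {
	unsigned long limit;	/* hard limit ("max") */
	unsigned long high;	/* high watermark */
	unsigned long used;	/* pages currently charged */
};

/*
 * Walk from the wb's memcg up to (but not including) the root and take
 * the smallest headroom; "ceiling - min(ceiling, used)" clamps at zero
 * for a cgroup already over min(limit, high).
 */
static unsigned long avail_pages(const struct cg *path, int depth,
				 unsigned long file_pages)
{
	unsigned long head_room = PAGE_COUNTER_MAX;
	int i;

	for (i = 0; i < depth; i++) {
		unsigned long ceiling = min_ul(path[i].limit, path[i].high);
		unsigned long used = path[i].used;

		head_room = min_ul(head_room, ceiling - min_ul(ceiling, used));
	}
	return file_pages + head_room;
}
```

With an ancestor over its ceiling the headroom collapses to zero and only the existing file pages count as available, which is what lets the throttling code squeeze a domain whose parent is full.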
+801 -432
mm/page-writeback.c
··· 122 122 123 123 /* End of sysctl-exported parameters */ 124 124 125 - unsigned long global_dirty_limit; 125 + struct wb_domain global_wb_domain; 126 126 127 - /* 128 - * Scale the writeback cache size proportional to the relative writeout speeds. 129 - * 130 - * We do this by keeping a floating proportion between BDIs, based on page 131 - * writeback completions [end_page_writeback()]. Those devices that write out 132 - * pages fastest will get the larger share, while the slower will get a smaller 133 - * share. 134 - * 135 - * We use page writeout completions because we are interested in getting rid of 136 - * dirty pages. Having them written out is the primary goal. 137 - * 138 - * We introduce a concept of time, a period over which we measure these events, 139 - * because demand can/will vary over time. The length of this period itself is 140 - * measured in page writeback completions. 141 - * 142 - */ 143 - static struct fprop_global writeout_completions; 127 + /* consolidated parameters for balance_dirty_pages() and its subroutines */ 128 + struct dirty_throttle_control { 129 + #ifdef CONFIG_CGROUP_WRITEBACK 130 + struct wb_domain *dom; 131 + struct dirty_throttle_control *gdtc; /* only set in memcg dtc's */ 132 + #endif 133 + struct bdi_writeback *wb; 134 + struct fprop_local_percpu *wb_completions; 144 135 145 - static void writeout_period(unsigned long t); 146 - /* Timer for aging of writeout_completions */ 147 - static struct timer_list writeout_period_timer = 148 - TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0); 149 - static unsigned long writeout_period_time = 0; 136 + unsigned long avail; /* dirtyable */ 137 + unsigned long dirty; /* file_dirty + write + nfs */ 138 + unsigned long thresh; /* dirty threshold */ 139 + unsigned long bg_thresh; /* dirty background threshold */ 140 + 141 + unsigned long wb_dirty; /* per-wb counterparts */ 142 + unsigned long wb_thresh; 143 + unsigned long wb_bg_thresh; 144 + 145 + unsigned long pos_ratio; 146 + }; 147 
+ 148 + #define DTC_INIT_COMMON(__wb) .wb = (__wb), \ 149 + .wb_completions = &(__wb)->completions 150 150 151 151 /* 152 152 * Length of period for aging writeout fractions of bdis. This is an ··· 154 154 * reflect changes in current writeout rate. 155 155 */ 156 156 #define VM_COMPLETIONS_PERIOD_LEN (3*HZ) 157 + 158 + #ifdef CONFIG_CGROUP_WRITEBACK 159 + 160 + #define GDTC_INIT(__wb) .dom = &global_wb_domain, \ 161 + DTC_INIT_COMMON(__wb) 162 + #define GDTC_INIT_NO_WB .dom = &global_wb_domain 163 + #define MDTC_INIT(__wb, __gdtc) .dom = mem_cgroup_wb_domain(__wb), \ 164 + .gdtc = __gdtc, \ 165 + DTC_INIT_COMMON(__wb) 166 + 167 + static bool mdtc_valid(struct dirty_throttle_control *dtc) 168 + { 169 + return dtc->dom; 170 + } 171 + 172 + static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) 173 + { 174 + return dtc->dom; 175 + } 176 + 177 + static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) 178 + { 179 + return mdtc->gdtc; 180 + } 181 + 182 + static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb) 183 + { 184 + return &wb->memcg_completions; 185 + } 186 + 187 + static void wb_min_max_ratio(struct bdi_writeback *wb, 188 + unsigned long *minp, unsigned long *maxp) 189 + { 190 + unsigned long this_bw = wb->avg_write_bandwidth; 191 + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth); 192 + unsigned long long min = wb->bdi->min_ratio; 193 + unsigned long long max = wb->bdi->max_ratio; 194 + 195 + /* 196 + * @wb may already be clean by the time control reaches here and 197 + * the total may not include its bw. 
+	 */
+	if (this_bw < tot_bw) {
+		if (min) {
+			min *= this_bw;
+			do_div(min, tot_bw);
+		}
+		if (max < 100) {
+			max *= this_bw;
+			do_div(max, tot_bw);
+		}
+	}
+
+	*minp = min;
+	*maxp = max;
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+#define GDTC_INIT(__wb)		DTC_INIT_COMMON(__wb)
+#define GDTC_INIT_NO_WB
+#define MDTC_INIT(__wb, __gdtc)
+
+static bool mdtc_valid(struct dirty_throttle_control *dtc)
+{
+	return false;
+}
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+	return &global_wb_domain;
+}
+
+static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc)
+{
+	return NULL;
+}
+
+static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb)
+{
+	return NULL;
+}
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+			     unsigned long *minp, unsigned long *maxp)
+{
+	*minp = wb->bdi->min_ratio;
+	*maxp = wb->bdi->max_ratio;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /*
  * In a memory zone, there is a certain amount of pages we consider
···
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
+/**
+ * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
+ * @dtc: dirty_throttle_control of interest
  *
- * Calculate the dirty thresholds based on sysctl parameters
- * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
- * - vm.dirty_ratio             or  vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * Calculate @dtc->thresh and ->bg_thresh considering
+ * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}.  The caller
+ * must ensure that @dtc->avail is set before calling this function.  The
+ * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
  * real-time tasks.
+ */
+static void domain_dirty_limits(struct dirty_throttle_control *dtc)
+{
+	const unsigned long available_memory = dtc->avail;
+	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
+	unsigned long bytes = vm_dirty_bytes;
+	unsigned long bg_bytes = dirty_background_bytes;
+	unsigned long ratio = vm_dirty_ratio;
+	unsigned long bg_ratio = dirty_background_ratio;
+	unsigned long thresh;
+	unsigned long bg_thresh;
+	struct task_struct *tsk;
+
+	/* gdtc is !NULL iff @dtc is for memcg domain */
+	if (gdtc) {
+		unsigned long global_avail = gdtc->avail;
+
+		/*
+		 * The byte settings can't be applied directly to memcg
+		 * domains.  Convert them to ratios by scaling against
+		 * globally available memory.
+		 */
+		if (bytes)
+			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
+				    global_avail, 100UL);
+		if (bg_bytes)
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
+				       global_avail, 100UL);
+		bytes = bg_bytes = 0;
+	}
+
+	if (bytes)
+		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
+	else
+		thresh = (ratio * available_memory) / 100;
+
+	if (bg_bytes)
+		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
+	else
+		bg_thresh = (bg_ratio * available_memory) / 100;
+
+	if (bg_thresh >= thresh)
+		bg_thresh = thresh / 2;
+	tsk = current;
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		bg_thresh += bg_thresh / 4;
+		thresh += thresh / 4;
+	}
+	dtc->thresh = thresh;
+	dtc->bg_thresh = bg_thresh;
+
+	/* we should eventually report the domain in the TP */
+	if (!gdtc)
+		trace_global_dirty_state(bg_thresh, thresh);
+}
+
+/**
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ * @pbackground: out parameter for bg_thresh
+ * @pdirty: out parameter for thresh
+ *
+ * Calculate bg_thresh and thresh for global_wb_domain.  See
+ * domain_dirty_limits() for details.
  */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
-	const unsigned long available_memory = global_dirtyable_memory();
-	unsigned long background;
-	unsigned long dirty;
-	struct task_struct *tsk;
+	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
-	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
+	gdtc.avail = global_dirtyable_memory();
+	domain_dirty_limits(&gdtc);
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
-	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
-	if (background >= dirty)
-		background = dirty / 2;
-	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
-		background += background / 4;
-		dirty += dirty / 4;
-	}
-	*pbackground = background;
-	*pdirty = dirty;
-	trace_global_dirty_state(background, dirty);
+	*pbackground = gdtc.bg_thresh;
+	*pdirty = gdtc.thresh;
 }
 
 /**
···
 	return cur_time;
 }
 
-/*
- * Increment the BDI's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
- */
-static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
+static void wb_domain_writeout_inc(struct wb_domain *dom,
+				   struct fprop_local_percpu *completions,
+				   unsigned int max_prop_frac)
 {
-	__inc_bdi_stat(bdi, BDI_WRITTEN);
-	__fprop_inc_percpu_max(&writeout_completions, &bdi->completions,
-			       bdi->max_prop_frac);
+	__fprop_inc_percpu_max(&dom->completions, completions,
+			       max_prop_frac);
 	/* First event after period switching was turned off? */
-	if (!unlikely(writeout_period_time)) {
+	if (!unlikely(dom->period_time)) {
 		/*
 		 * We can race with other __bdi_writeout_inc calls here but
 		 * it does not cause any harm since the resulting time when
 		 * timer will fire and what is in writeout_period_time will be
 		 * roughly the same.
 		 */
-		writeout_period_time = wp_next_time(jiffies);
-		mod_timer(&writeout_period_timer, writeout_period_time);
+		dom->period_time = wp_next_time(jiffies);
+		mod_timer(&dom->period_timer, dom->period_time);
 	}
 }
 
-void bdi_writeout_inc(struct backing_dev_info *bdi)
+/*
+ * Increment @wb's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+{
+	struct wb_domain *cgdom;
+
+	__inc_wb_stat(wb, WB_WRITTEN);
+	wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
+			       wb->bdi->max_prop_frac);
+
+	cgdom = mem_cgroup_wb_domain(wb);
+	if (cgdom)
+		wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
+				       wb->bdi->max_prop_frac);
+}
+
+void wb_writeout_inc(struct bdi_writeback *wb)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__bdi_writeout_inc(bdi);
+	__wb_writeout_inc(wb);
 	local_irq_restore(flags);
 }
-EXPORT_SYMBOL_GPL(bdi_writeout_inc);
-
-/*
- * Obtain an accurate fraction of the BDI's portion.
- */
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
-				  long *numerator, long *denominator)
-{
-	fprop_fraction_percpu(&writeout_completions, &bdi->completions,
-			      numerator, denominator);
-}
+EXPORT_SYMBOL_GPL(wb_writeout_inc);
 
 /*
  * On idle system, we can be called long after we scheduled because we use
···
  */
 static void writeout_period(unsigned long t)
 {
-	int miss_periods = (jiffies - writeout_period_time) /
+	struct wb_domain *dom = (void *)t;
+	int miss_periods = (jiffies - dom->period_time) /
 					 VM_COMPLETIONS_PERIOD_LEN;
 
-	if (fprop_new_period(&writeout_completions, miss_periods + 1)) {
-		writeout_period_time = wp_next_time(writeout_period_time +
+	if (fprop_new_period(&dom->completions, miss_periods + 1)) {
+		dom->period_time = wp_next_time(dom->period_time +
 				miss_periods * VM_COMPLETIONS_PERIOD_LEN);
-		mod_timer(&writeout_period_timer, writeout_period_time);
+		mod_timer(&dom->period_timer, dom->period_time);
 	} else {
 		/*
 		 * Aging has zeroed all fractions. Stop wasting CPU on period
 		 * updates.
 		 */
-		writeout_period_time = 0;
+		dom->period_time = 0;
 	}
 }
+
+int wb_domain_init(struct wb_domain *dom, gfp_t gfp)
+{
+	memset(dom, 0, sizeof(*dom));
+
+	spin_lock_init(&dom->lock);
+
+	init_timer_deferrable(&dom->period_timer);
+	dom->period_timer.function = writeout_period;
+	dom->period_timer.data = (unsigned long)dom;
+
+	dom->dirty_limit_tstamp = jiffies;
+
+	return fprop_global_init(&dom->completions, gfp);
+}
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+void wb_domain_exit(struct wb_domain *dom)
+{
+	del_timer_sync(&dom->period_timer);
+	fprop_global_destroy(&dom->completions);
+}
+#endif
 
 /*
  * bdi_min_ratio keeps the sum of the minimum dirty shares of all
···
 	return (thresh + bg_thresh) / 2;
 }
 
-static unsigned long hard_dirty_limit(unsigned long thresh)
+static unsigned long hard_dirty_limit(struct wb_domain *dom,
+				      unsigned long thresh)
 {
-	return max(thresh, global_dirty_limit);
+	return max(thresh, dom->dirty_limit);
+}
+
+/* memory available to a memcg domain is capped by system-wide clean memory */
+static void mdtc_cap_avail(struct dirty_throttle_control *mdtc)
+{
+	struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
+	unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+
+	mdtc->avail = min(mdtc->avail, clean);
 }
 
 /**
- * bdi_dirty_limit - @bdi's share of dirty throttling threshold
- * @bdi: the backing_dev_info to query
- * @dirty: global dirty limit in pages
+ * __wb_calc_thresh - @wb's share of dirty throttling threshold
+ * @dtc: dirty_throttle_control of interest
  *
- * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * Returns @wb's dirty limit in pages.  The term "dirty" in the context of
  * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
  *
  * Note that balance_dirty_pages() will only seriously take it as a hard limit
···
  * control. For example, when the device is completely stalled due to some error
  * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
  * In the other normal situations, it acts more gently by throttling the tasks
- * more (rather than completely block them) when the bdi dirty pages go high.
+ * more (rather than completely block them) when the wb dirty pages go high.
  *
  * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
  * - piling up dirty pages (that will take long time to sync) on slow devices
  *
- * The bdi's share of dirty limit will be adapting to its throughput and
+ * The wb's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
  */
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
 {
-	u64 bdi_dirty;
+	struct wb_domain *dom = dtc_dom(dtc);
+	unsigned long thresh = dtc->thresh;
+	u64 wb_thresh;
 	long numerator, denominator;
+	unsigned long wb_min_ratio, wb_max_ratio;
 
 	/*
-	 * Calculate this BDI's share of the dirty ratio.
+	 * Calculate this BDI's share of the thresh ratio.
 	 */
-	bdi_writeout_fraction(bdi, &numerator, &denominator);
+	fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
+			      &numerator, &denominator);
 
-	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
-	bdi_dirty *= numerator;
-	do_div(bdi_dirty, denominator);
+	wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
+	wb_thresh *= numerator;
+	do_div(wb_thresh, denominator);
 
-	bdi_dirty += (dirty * bdi->min_ratio) / 100;
-	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
-		bdi_dirty = dirty * bdi->max_ratio / 100;
+	wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);
 
-	return bdi_dirty;
+	wb_thresh += (thresh * wb_min_ratio) / 100;
+	if (wb_thresh > (thresh * wb_max_ratio) / 100)
+		wb_thresh = thresh * wb_max_ratio / 100;
+
+	return wb_thresh;
+}
+
+unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
+{
+	struct dirty_throttle_control gdtc = { GDTC_INIT(wb),
+					       .thresh = thresh };
+	return __wb_calc_thresh(&gdtc);
 }
 
 /*
···
  *
  * (o) global/bdi setpoints
  *
- * We want the dirty pages be balanced around the global/bdi setpoints.
+ * We want the dirty pages to be balanced around the global/wb setpoints.
  * When the number of dirty pages is higher/lower than the setpoint, the
  * dirty position control ratio (and hence task dirty ratelimit) will be
  * decreased/increased to bring the dirty pages back to the setpoint.
···
  *   if (dirty < setpoint) scale up   pos_ratio
  *   if (dirty > setpoint) scale down pos_ratio
  *
- *   if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
- *   if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *   if (wb_dirty < wb_setpoint) scale up   pos_ratio
+ *   if (wb_dirty > wb_setpoint) scale down pos_ratio
  *
  *   task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
  *
···
  * 0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
- * (o) bdi control line
+ * (o) wb control line
  *
  *     ^ pos_ratio
  *     |
···
  *     |                      .                           .
  *     |                      .                           .
  * 0 +----------------------.-------------------------------.------------->
- *              bdi_setpoint^                    x_intercept^
+ *               wb_setpoint^                    x_intercept^
  *
- * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can
+ * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can
  * be smoothly throttled down to normal if it starts high in situations like
  * - start writing to a slow SD card and a fast disk at the same time. The SD
- *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
- * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ *   card's wb_dirty may rush to many times higher than wb_setpoint.
+ * - the wb dirty thresh drops quickly due to change of JBOD workload
  */
-static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
-					unsigned long thresh,
-					unsigned long bg_thresh,
-					unsigned long dirty,
-					unsigned long bdi_thresh,
-					unsigned long bdi_dirty)
+static void wb_position_ratio(struct dirty_throttle_control *dtc)
 {
-	unsigned long write_bw = bdi->avg_write_bandwidth;
-	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
-	unsigned long limit = hard_dirty_limit(thresh);
+	struct bdi_writeback *wb = dtc->wb;
+	unsigned long write_bw = wb->avg_write_bandwidth;
+	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
+	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
+	unsigned long wb_thresh = dtc->wb_thresh;
 	unsigned long x_intercept;
 	unsigned long setpoint;		/* dirty pages' target balance point */
-	unsigned long bdi_setpoint;
+	unsigned long wb_setpoint;
 	unsigned long span;
 	long long pos_ratio;		/* for scaling up/down the rate limit */
 	long x;
 
-	if (unlikely(dirty >= limit))
-		return 0;
+	dtc->pos_ratio = 0;
+
+	if (unlikely(dtc->dirty >= limit))
+		return;
 
 	/*
 	 * global setpoint
···
 	 * See comment for pos_ratio_polynom().
 	 */
 	setpoint = (freerun + limit) / 2;
-	pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
+	pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit);
 
 	/*
 	 * The strictlimit feature is a tool preventing mistrusted filesystems
 	 * from growing a large number of dirty pages before throttling. For
-	 * such filesystems balance_dirty_pages always checks bdi counters
-	 * against bdi limits. Even if global "nr_dirty" is under "freerun".
+	 * such filesystems balance_dirty_pages always checks wb counters
+	 * against wb limits. Even if global "nr_dirty" is under "freerun".
 	 * This is especially important for fuse which sets bdi->max_ratio to
 	 * 1% by default. Without strictlimit feature, fuse writeback may
 	 * consume arbitrary amount of RAM because it is accounted in
 	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
 	 *
-	 * Here, in bdi_position_ratio(), we calculate pos_ratio based on
-	 * two values: bdi_dirty and bdi_thresh. Let's consider an example:
+	 * Here, in wb_position_ratio(), we calculate pos_ratio based on
+	 * two values: wb_dirty and wb_thresh. Let's consider an example:
 	 * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
 	 * limits are set by default to 10% and 20% (background and throttle).
-	 * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
-	 * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
-	 * about ~6K pages (as the average of background and throttle bdi
+	 * Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
+	 * wb_calc_thresh(wb, bg_thresh) is about ~4K pages. wb_setpoint is
+	 * about ~6K pages (as the average of background and throttle wb
 	 * limits). The 3rd order polynomial will provide positive feedback if
-	 * bdi_dirty is under bdi_setpoint and vice versa.
+	 * wb_dirty is under wb_setpoint and vice versa.
 	 *
 	 * Note, that we cannot use global counters in these calculations
-	 * because we want to throttle process writing to a strictlimit BDI
+	 * because we want to throttle process writing to a strictlimit wb
 	 * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
 	 * in the example above).
 	 */
-	if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
-		long long bdi_pos_ratio;
-		unsigned long bdi_bg_thresh;
+	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+		long long wb_pos_ratio;
 
-		if (bdi_dirty < 8)
-			return min_t(long long, pos_ratio * 2,
-				     2 << RATELIMIT_CALC_SHIFT);
+		if (dtc->wb_dirty < 8) {
+			dtc->pos_ratio = min_t(long long, pos_ratio * 2,
+					       2 << RATELIMIT_CALC_SHIFT);
+			return;
+		}
 
-		if (bdi_dirty >= bdi_thresh)
-			return 0;
+		if (dtc->wb_dirty >= wb_thresh)
+			return;
 
-		bdi_bg_thresh = div_u64((u64)bdi_thresh * bg_thresh, thresh);
-		bdi_setpoint = dirty_freerun_ceiling(bdi_thresh,
-						     bdi_bg_thresh);
+		wb_setpoint = dirty_freerun_ceiling(wb_thresh,
+						    dtc->wb_bg_thresh);
 
-		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
-			return 0;
+		if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
+			return;
 
-		bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
-						  bdi_thresh);
+		wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty,
+						 wb_thresh);
 
 		/*
-		 * Typically, for strictlimit case, bdi_setpoint << setpoint
-		 * and pos_ratio >> bdi_pos_ratio. In the other words global
+		 * Typically, for strictlimit case, wb_setpoint << setpoint
+		 * and pos_ratio >> wb_pos_ratio. In other words, global
 		 * state ("dirty") is not limiting factor and we have to
-		 * make decision based on bdi counters. But there is an
+		 * make decision based on wb counters. But there is an
 		 * important case when global pos_ratio should get precedence:
 		 * global limits are exceeded (e.g. due to activities on other
-		 * BDIs) while given strictlimit BDI is below limit.
+		 * wb's) while given strictlimit wb is below limit.
 		 *
-		 * "pos_ratio * bdi_pos_ratio" would work for the case above,
+		 * "pos_ratio * wb_pos_ratio" would work for the case above,
 		 * but it would look too non-natural for the case of all
-		 * activity in the system coming from a single strictlimit BDI
+		 * activity in the system coming from a single strictlimit wb
 		 * with bdi->max_ratio == 100%.
 		 *
 		 * Note that min() below somewhat changes the dynamics of the
 		 * control system. Normally, pos_ratio value can be well over 3
-		 * (when globally we are at freerun and bdi is well below bdi
+		 * (when globally we are at freerun and wb is well below wb
 		 * setpoint). Now the maximum pos_ratio in the same situation
 		 * is 2. We might want to tweak this if we observe the control
 		 * system is too slow to adapt.
 		 */
-		return min(pos_ratio, bdi_pos_ratio);
+		dtc->pos_ratio = min(pos_ratio, wb_pos_ratio);
+		return;
 	}
 
 	/*
 	 * We have computed basic pos_ratio above based on global situation. If
-	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * the wb is over/under its share of dirty pages, we want to scale
 	 * pos_ratio further down/up. That is done by the following mechanism.
 	 */
 
 	/*
-	 * bdi setpoint
+	 * wb setpoint
 	 *
-	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *        f(wb_dirty) := 1.0 + k * (wb_dirty - wb_setpoint)
 	 *
-	 *                        x_intercept - bdi_dirty
+	 *                        x_intercept - wb_dirty
 	 *                     := --------------------------
-	 *                        x_intercept - bdi_setpoint
+	 *                        x_intercept - wb_setpoint
 	 *
-	 * The main bdi control line is a linear function that subjects to
+	 * The main wb control line is a linear function that subjects to
 	 *
-	 * (1) f(bdi_setpoint) = 1.0
-	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
-	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 * (1) f(wb_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single wb case)
+	 *     or equally: x_intercept = wb_setpoint + 8 * write_bw
 	 *
-	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * For single wb case, the dirty pages are observed to fluctuate
 	 * regularly within range
-	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 *        [wb_setpoint - write_bw/2, wb_setpoint + write_bw/2]
 	 * for various filesystems, where (2) can yield in a reasonable 12.5%
 	 * fluctuation range for pos_ratio.
 	 *
-	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * For JBOD case, wb_thresh (not wb_dirty!) could fluctuate up to its
 	 * own size, so move the slope over accordingly and choose a slope that
-	 * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
+	 * yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh.
 	 */
-	if (unlikely(bdi_thresh > thresh))
-		bdi_thresh = thresh;
+	if (unlikely(wb_thresh > dtc->thresh))
+		wb_thresh = dtc->thresh;
 	/*
-	 * It's very possible that bdi_thresh is close to 0 not because the
+	 * It's very possible that wb_thresh is close to 0 not because the
 	 * device is slow, but that it has remained inactive for long time.
 	 * Honour such devices a reasonable good (hopefully IO efficient)
 	 * threshold, so that the occasional writes won't be blocked and active
 	 * writes can rampup the threshold quickly.
 	 */
-	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
+	wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
 	/*
-	 * scale global setpoint to bdi's:
-	 *	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 * scale global setpoint to wb's:
+	 *	wb_setpoint = setpoint * wb_thresh / thresh
 	 */
-	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
-	bdi_setpoint = setpoint * (u64)x >> 16;
+	x = div_u64((u64)wb_thresh << 16, dtc->thresh | 1);
+	wb_setpoint = setpoint * (u64)x >> 16;
 	/*
-	 * Use span=(8*write_bw) in single bdi case as indicated by
-	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 * Use span=(8*write_bw) in single wb case as indicated by
+	 * (thresh - wb_thresh ~= 0) and transit to wb_thresh in JBOD case.
 	 *
-	 *        bdi_thresh                    thresh - bdi_thresh
-	 * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
-	 *          thresh                            thresh
+	 *        wb_thresh                    thresh - wb_thresh
+	 * span = --------- * (8 * write_bw) + ------------------ * wb_thresh
+	 *          thresh                           thresh
 	 */
-	span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
-	x_intercept = bdi_setpoint + span;
+	span = (dtc->thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16;
+	x_intercept = wb_setpoint + span;
 
-	if (bdi_dirty < x_intercept - span / 4) {
-		pos_ratio = div64_u64(pos_ratio * (x_intercept - bdi_dirty),
-				      (x_intercept - bdi_setpoint) | 1);
+	if (dtc->wb_dirty < x_intercept - span / 4) {
+		pos_ratio = div64_u64(pos_ratio * (x_intercept - dtc->wb_dirty),
+				      (x_intercept - wb_setpoint) | 1);
 	} else
 		pos_ratio /= 4;
 
 	/*
-	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 * wb reserve area, safeguard against dirty pool underrun and disk idle
 	 * It may push the desired control point of global dirty pages higher
 	 * than setpoint.
 	 */
-	x_intercept = bdi_thresh / 2;
-	if (bdi_dirty < x_intercept) {
-		if (bdi_dirty > x_intercept / 8)
-			pos_ratio = div_u64(pos_ratio * x_intercept, bdi_dirty);
+	x_intercept = wb_thresh / 2;
+	if (dtc->wb_dirty < x_intercept) {
+		if (dtc->wb_dirty > x_intercept / 8)
+			pos_ratio = div_u64(pos_ratio * x_intercept,
+					    dtc->wb_dirty);
 		else
 			pos_ratio *= 8;
 	}
 
-	return pos_ratio;
+	dtc->pos_ratio = pos_ratio;
 }
 
-static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
-				       unsigned long elapsed,
-				       unsigned long written)
+static void wb_update_write_bandwidth(struct bdi_writeback *wb,
+				      unsigned long elapsed,
+				      unsigned long written)
 {
 	const unsigned long period = roundup_pow_of_two(3 * HZ);
-	unsigned long avg = bdi->avg_write_bandwidth;
-	unsigned long old = bdi->write_bandwidth;
+	unsigned long avg = wb->avg_write_bandwidth;
+	unsigned long old = wb->write_bandwidth;
 	u64 bw;
 
 	/*
···
 	 * @written may have decreased due to account_page_redirty().
 	 * Avoid underflowing @bw calculation.
 	 */
-	bw = written - min(written, bdi->written_stamp);
+	bw = written - min(written, wb->written_stamp);
 	bw *= HZ;
 	if (unlikely(elapsed > period)) {
 		do_div(bw, elapsed);
 		avg = bw;
 		goto out;
 	}
-	bw += (u64)bdi->write_bandwidth * (period - elapsed);
+	bw += (u64)wb->write_bandwidth * (period - elapsed);
 	bw >>= ilog2(period);
 
 	/*
···
 		avg += (old - avg) >> 3;
 
 out:
-	bdi->write_bandwidth = bw;
-	bdi->avg_write_bandwidth = avg;
+	/* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */
+	avg = max(avg, 1LU);
+	if (wb_has_dirty_io(wb)) {
+		long delta = avg - wb->avg_write_bandwidth;
+		WARN_ON_ONCE(atomic_long_add_return(delta,
+					&wb->bdi->tot_write_bandwidth) <= 0);
+	}
+	wb->write_bandwidth = bw;
+	wb->avg_write_bandwidth = avg;
 }
 
-/*
- * The global dirtyable memory and dirty threshold could be suddenly knocked
- * down by a large amount (eg. on the startup of KVM in a swapless system).
- * This may throw the system into deep dirty exceeded state and throttle
- * heavy/light dirtiers alike. To retain good responsiveness, maintain
- * global_dirty_limit for tracking slowly down to the knocked down dirty
- * threshold.
- */
-static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
+static void update_dirty_limit(struct dirty_throttle_control *dtc)
 {
-	unsigned long limit = global_dirty_limit;
+	struct wb_domain *dom = dtc_dom(dtc);
+	unsigned long thresh = dtc->thresh;
+	unsigned long limit = dom->dirty_limit;
 
 	/*
 	 * Follow up in one step.
···
 	/*
 	 * Follow down slowly. Use the higher one as the target, because thresh
 	 * may drop below dirty.  This is exactly the reason to introduce
-	 * global_dirty_limit which is guaranteed to lie above the dirty pages.
+	 * dom->dirty_limit which is guaranteed to lie above the dirty pages.
 	 */
-	thresh = max(thresh, dirty);
+	thresh = max(thresh, dtc->dirty);
 	if (limit > thresh) {
 		limit -= (limit - thresh) >> 5;
 		goto update;
 	}
 	return;
 update:
-	global_dirty_limit = limit;
+	dom->dirty_limit = limit;
 }
 
-static void global_update_bandwidth(unsigned long thresh,
-				    unsigned long dirty,
+static void domain_update_bandwidth(struct dirty_throttle_control *dtc,
 				    unsigned long now)
 {
-	static DEFINE_SPINLOCK(dirty_lock);
-	static unsigned long update_time = INITIAL_JIFFIES;
+	struct wb_domain *dom = dtc_dom(dtc);
 
 	/*
 	 * check locklessly first to optimize away locking for the most time
 	 */
-	if (time_before(now, update_time + BANDWIDTH_INTERVAL))
+	if (time_before(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL))
 		return;
 
-	spin_lock(&dirty_lock);
-	if (time_after_eq(now, update_time + BANDWIDTH_INTERVAL)) {
-		update_dirty_limit(thresh, dirty);
-		update_time = now;
+	spin_lock(&dom->lock);
+	if (time_after_eq(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) {
+		update_dirty_limit(dtc);
+		dom->dirty_limit_tstamp = now;
 	}
-	spin_unlock(&dirty_lock);
+	spin_unlock(&dom->lock);
 }
 
 /*
- * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ * Maintain wb->dirty_ratelimit, the base dirty throttle rate.
  *
- * Normal bdi tasks will be curbed at or below it in long term.
+ * Normal wb tasks will be curbed at or below it in long term.
  * Obviously it should be around (write_bw / N) when there are N dd tasks.
  */
-static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
-				       unsigned long thresh,
-				       unsigned long bg_thresh,
-				       unsigned long dirty,
-				       unsigned long bdi_thresh,
-				       unsigned long bdi_dirty,
-				       unsigned long dirtied,
-				       unsigned long elapsed)
+static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
+				      unsigned long dirtied,
+				      unsigned long elapsed)
 {
-	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
-	unsigned long limit = hard_dirty_limit(thresh);
+	struct bdi_writeback *wb = dtc->wb;
+	unsigned long dirty = dtc->dirty;
+	unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
+	unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
 	unsigned long setpoint = (freerun + limit) / 2;
-	unsigned long write_bw = bdi->avg_write_bandwidth;
-	unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
+	unsigned long write_bw = wb->avg_write_bandwidth;
+	unsigned long dirty_ratelimit = wb->dirty_ratelimit;
 	unsigned long dirty_rate;
 	unsigned long task_ratelimit;
 	unsigned long balanced_dirty_ratelimit;
-	unsigned long pos_ratio;
 	unsigned long step;
 	unsigned long x;
 
···
 	 * The dirty rate will match the writeout rate in long term, except
 	 * when dirty pages are truncated by userspace or re-dirtied by FS.
 	 */
-	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+	dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;
 
-	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
-				       bdi_thresh, bdi_dirty);
 	/*
 	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
 	 */
 	task_ratelimit = (u64)dirty_ratelimit *
-					pos_ratio >> RATELIMIT_CALC_SHIFT;
+					dtc->pos_ratio >> RATELIMIT_CALC_SHIFT;
 	task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */
 
 	/*
 	 * A linear estimation of the "balanced" throttle rate. The theory is,
-	 * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
+	 * if there are N dd tasks, each throttled at task_ratelimit, the wb's
 	 * dirty_rate will be measured to be (N * task_ratelimit). So the below
 	 * formula will yield the balanced rate limit (write_bw / N).
 	 *
···
 	/*
 	 * We could safely do this and return immediately:
 	 *
-	 *	bdi->dirty_ratelimit = balanced_dirty_ratelimit;
+	 *	wb->dirty_ratelimit = balanced_dirty_ratelimit;
 	 *
 	 * However to get a more stable dirty_ratelimit, the below elaborated
 	 * code makes use of task_ratelimit to filter out singular points and
···
 		step = 0;
 
 	/*
-	 * For strictlimit case, calculations above were based on bdi counters
-	 * and limits (starting from pos_ratio = bdi_position_ratio() and up to
+	 * For strictlimit case, calculations above were based on wb counters
+	 * and limits (starting from pos_ratio = wb_position_ratio() and up to
 	 * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
-	 * Hence, to calculate "step" properly, we have to use bdi_dirty as
-	 * "dirty" and bdi_setpoint as "setpoint".
+	 * Hence, to calculate "step" properly, we have to use wb_dirty as
+	 * "dirty" and wb_setpoint as "setpoint".
 	 *
-	 * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
-	 * it's possible that bdi_thresh is close to zero due to inactivity
-	 * of backing device (see the implementation of bdi_dirty_limit()).
+	 * We rampup dirty_ratelimit forcibly if wb_dirty is low because
+	 * it's possible that wb_thresh is close to zero due to inactivity
+	 * of backing device.
 	 */
-	if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
-		dirty = bdi_dirty;
-		if (bdi_dirty < 8)
-			setpoint = bdi_dirty + 1;
+	if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+		dirty = dtc->wb_dirty;
+		if (dtc->wb_dirty < 8)
+			setpoint = dtc->wb_dirty + 1;
 		else
-			setpoint = (bdi_thresh +
-				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
+			setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
 	}
 
 	if (dirty < setpoint) {
-		x = min3(bdi->balanced_dirty_ratelimit,
+		x = min3(wb->balanced_dirty_ratelimit,
 			 balanced_dirty_ratelimit, task_ratelimit);
 		if (dirty_ratelimit < x)
 			step = x - dirty_ratelimit;
 	} else {
-		x = max3(bdi->balanced_dirty_ratelimit,
+		x = max3(wb->balanced_dirty_ratelimit,
 			 balanced_dirty_ratelimit, task_ratelimit);
 		if (dirty_ratelimit > x)
 			step = dirty_ratelimit - x;
···
 	else
 		dirty_ratelimit -= step;
 
-	bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
-	bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
+	wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
+	wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
 
-	trace_bdi_dirty_ratelimit(bdi, dirty_rate, task_ratelimit);
+	trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
 }
 
-void __bdi_update_bandwidth(struct backing_dev_info *bdi,
-			    unsigned long thresh,
-			    unsigned long bg_thresh,
-			    unsigned long dirty,
-			    unsigned long bdi_thresh,
-			    unsigned long bdi_dirty,
-			    unsigned long start_time)
+static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
+				  struct dirty_throttle_control *mdtc,
+				  unsigned long start_time,
+				  bool update_ratelimit)
 {
+	struct bdi_writeback *wb = gdtc->wb;
 	unsigned long now = jiffies;
-	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long elapsed = now - wb->bw_time_stamp;
 	unsigned long dirtied;
 	unsigned long written;
+
+	lockdep_assert_held(&wb->list_lock);
 
 	/*
 	 * rate-limit, only update once every 200ms.
···
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
-	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
-	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
+	written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
 
 	/*
 	 * Skip quiet periods when disk bandwidth is under-utilized.
 	 * (at least 1s idle time between two flusher runs)
 	 */
-	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
+	if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh) {
-		global_update_bandwidth(thresh, dirty, now);
-		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
-					   bdi_thresh, bdi_dirty,
-					   dirtied, elapsed);
+	if (update_ratelimit) {
+		domain_update_bandwidth(gdtc, now);
+		wb_update_dirty_ratelimit(gdtc, dirtied, elapsed);
+
+		/*
+		 * @mdtc is always NULL if !CGROUP_WRITEBACK but the
+		 * compiler has no way to figure that out.  Help it.
+		 */
+		if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {
+			domain_update_bandwidth(mdtc, now);
+			wb_update_dirty_ratelimit(mdtc, dirtied, elapsed);
+		}
 	}
-	bdi_update_write_bandwidth(bdi, elapsed, written);
+	wb_update_write_bandwidth(wb, elapsed, written);
 
 snapshot:
-	bdi->dirtied_stamp = dirtied;
-	bdi->written_stamp = written;
-	bdi->bw_time_stamp = now;
+	wb->dirtied_stamp = dirtied;
+	wb->written_stamp = written;
+	wb->bw_time_stamp = now;
 }
 
-static void bdi_update_bandwidth(struct backing_dev_info *bdi,
-				 unsigned long thresh,
-				 unsigned long bg_thresh,
-				 unsigned long dirty,
-				 unsigned long bdi_thresh,
-				 unsigned long bdi_dirty,
-				 unsigned long start_time)
+void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time)
 {
-	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
-		return;
-	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
-			       bdi_thresh, bdi_dirty, start_time);
-	spin_unlock(&bdi->wb.list_lock);
+	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
+
+	__wb_update_bandwidth(&gdtc, NULL, start_time, false);
 }
 
 /*
···
 	return 1;
 }
 
-static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
-				   unsigned long bdi_dirty)
+static unsigned long wb_max_pause(struct bdi_writeback *wb,
+				  unsigned long wb_dirty)
 {
-	unsigned long bw = bdi->avg_write_bandwidth;
+	unsigned long bw = wb->avg_write_bandwidth;
 	unsigned long t;
 
 	/*
···
 	 *
 	 * 8 serves as the safety ratio.
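The max-pause heuristic referenced above can be sketched standalone. This is an illustration only, not the kernel code: `SK_HZ`, `SK_MAX_PAUSE` and `sk_roundup_pow_of_two()` are local stand-ins for the kernel's `HZ`, `MAX_PAUSE` and `roundup_pow_of_two()`, whose real values are config-dependent.

```c
#include <assert.h>

#define SK_HZ		100		/* assumed tick rate */
#define SK_MAX_PAUSE	(SK_HZ / 5)	/* assumed 200ms hard cap */

static unsigned long sk_roundup_pow_of_two(unsigned long x)
{
	unsigned long p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/* pause grows with the wb's dirty backlog and shrinks with its write
 * bandwidth; 8 is the safety ratio mentioned in the comment above */
static unsigned long sk_wb_max_pause(unsigned long bw, unsigned long wb_dirty)
{
	unsigned long t = wb_dirty /
		(1 + bw / sk_roundup_pow_of_two(1 + SK_HZ / 8));

	t++;
	return t < SK_MAX_PAUSE ? t : SK_MAX_PAUSE;
}
```

With the assumed `SK_HZ = 100`, a wb writing 1600 pages/s never pauses longer than the 20-jiffy cap, however large its backlog.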
 	 */
-	t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));
+	t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));
 	t++;
 
 	return min_t(unsigned long, t, MAX_PAUSE);
 }
 
-static long bdi_min_pause(struct backing_dev_info *bdi,
-			  long max_pause,
-			  unsigned long task_ratelimit,
-			  unsigned long dirty_ratelimit,
-			  int *nr_dirtied_pause)
+static long wb_min_pause(struct bdi_writeback *wb,
+			 long max_pause,
+			 unsigned long task_ratelimit,
+			 unsigned long dirty_ratelimit,
+			 int *nr_dirtied_pause)
 {
-	long hi = ilog2(bdi->avg_write_bandwidth);
-	long lo = ilog2(bdi->dirty_ratelimit);
+	long hi = ilog2(wb->avg_write_bandwidth);
+	long lo = ilog2(wb->dirty_ratelimit);
 	long t;		/* target pause */
 	long pause;	/* estimated next pause */
 	int pages;	/* target nr_dirtied_pause */
···
 	return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
 }
 
-static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
-				    unsigned long dirty_thresh,
-				    unsigned long background_thresh,
-				    unsigned long *bdi_dirty,
-				    unsigned long *bdi_thresh,
-				    unsigned long *bdi_bg_thresh)
+static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 {
-	unsigned long bdi_reclaimable;
+	struct bdi_writeback *wb = dtc->wb;
+	unsigned long wb_reclaimable;
 
 	/*
-	 * bdi_thresh is not treated as some limiting factor as
+	 * wb_thresh is not treated as some limiting factor as
 	 * dirty_thresh, due to reasons
-	 * - in JBOD setup, bdi_thresh can fluctuate a lot
+	 * - in JBOD setup, wb_thresh can fluctuate a lot
 	 * - in a system with HDD and USB key, the USB key may somehow
-	 *   go into state (bdi_dirty >> bdi_thresh) either because
-	 *   bdi_dirty starts high, or because bdi_thresh drops low.
+	 *   go into state (wb_dirty >> wb_thresh) either because
+	 *   wb_dirty starts high, or because wb_thresh drops low.
 	 *   In this case we don't want to hard throttle the USB key
-	 *   dirtiers for 100 seconds until bdi_dirty drops under
-	 *   bdi_thresh. Instead the auxiliary bdi control line in
-	 *   bdi_position_ratio() will let the dirtier task progress
-	 *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+	 *   dirtiers for 100 seconds until wb_dirty drops under
+	 *   wb_thresh. Instead the auxiliary wb control line in
+	 *   wb_position_ratio() will let the dirtier task progress
+	 *   at some rate <= (write_bw / 2) for bringing down wb_dirty.
 	 */
-	*bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-
-	if (bdi_bg_thresh)
-		*bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh *
-							background_thresh,
-							dirty_thresh) : 0;
+	dtc->wb_thresh = __wb_calc_thresh(dtc);
+	dtc->wb_bg_thresh = dtc->thresh ?
+		div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
 
 	/*
 	 * In order to avoid the stacked BDI deadlock we need
···
 	 * actually dirty; with m+n sitting in the percpu
 	 * deltas.
 	 */
-	if (*bdi_thresh < 2 * bdi_stat_error(bdi)) {
-		bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-		*bdi_dirty = bdi_reclaimable +
-			     bdi_stat_sum(bdi, BDI_WRITEBACK);
+	if (dtc->wb_thresh < 2 * wb_stat_error(wb)) {
+		wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
+		dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
 	} else {
-		bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-		*bdi_dirty = bdi_reclaimable +
-			     bdi_stat(bdi, BDI_WRITEBACK);
+		wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
+		dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
 	}
 }
 
···
 * perform some writeout.
 */
 static void balance_dirty_pages(struct address_space *mapping,
+				struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
+	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+	struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
+	struct dirty_throttle_control * const gdtc = &gdtc_stor;
+	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
+						     &mdtc_stor : NULL;
+	struct dirty_throttle_control *sdtc;
 	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
-	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
-	unsigned long background_thresh;
-	unsigned long dirty_thresh;
 	long period;
 	long pause;
 	long max_pause;
···
 	bool dirty_exceeded = false;
 	unsigned long task_ratelimit;
 	unsigned long dirty_ratelimit;
-	unsigned long pos_ratio;
-	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+	struct backing_dev_info *bdi = wb->bdi;
 	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
 		unsigned long now = jiffies;
-		unsigned long uninitialized_var(bdi_thresh);
-		unsigned long thresh;
-		unsigned long uninitialized_var(bdi_dirty);
-		unsigned long dirty;
-		unsigned long bg_thresh;
+		unsigned long dirty, thresh, bg_thresh;
+		unsigned long m_dirty, m_thresh, m_bg_thresh;
 
 		/*
 		 * Unstable writes are a feature of certain networked
···
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
					global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		gdtc->avail = global_dirtyable_memory();
+		gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
-		global_dirty_limits(&background_thresh, &dirty_thresh);
+		domain_dirty_limits(gdtc);
 
 		if (unlikely(strictlimit)) {
-			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
-					 &bdi_dirty, &bdi_thresh, &bg_thresh);
+			wb_dirty_limits(gdtc);
 
-			dirty = bdi_dirty;
-			thresh = bdi_thresh;
+			dirty = gdtc->wb_dirty;
+			thresh = gdtc->wb_thresh;
+			bg_thresh = gdtc->wb_bg_thresh;
 		} else {
-			dirty = nr_dirty;
-			thresh = dirty_thresh;
-			bg_thresh = background_thresh;
+			dirty = gdtc->dirty;
+			thresh = gdtc->thresh;
+			bg_thresh = gdtc->bg_thresh;
+		}
+
+		if (mdtc) {
+			unsigned long writeback;
+
+			/*
+			 * If @wb belongs to !root memcg, repeat the same
+			 * basic calculations for the memcg domain.
+			 */
+			mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
+					    &writeback);
+			mdtc_cap_avail(mdtc);
+			mdtc->dirty += writeback;
+
+			domain_dirty_limits(mdtc);
+
+			if (unlikely(strictlimit)) {
+				wb_dirty_limits(mdtc);
+				m_dirty = mdtc->wb_dirty;
+				m_thresh = mdtc->wb_thresh;
+				m_bg_thresh = mdtc->wb_bg_thresh;
+			} else {
+				m_dirty = mdtc->dirty;
+				m_thresh = mdtc->thresh;
+				m_bg_thresh = mdtc->bg_thresh;
+			}
 		}
 
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
-		 * when the bdi limits are ramping up in case of !strictlimit.
+		 * when the wb limits are ramping up in case of !strictlimit.
 		 *
-		 * In strictlimit case make decision based on the bdi counters
-		 * and limits. Small writeouts when the bdi limits are ramping
+		 * In strictlimit case make decision based on the wb counters
+		 * and limits. Small writeouts when the wb limits are ramping
 		 * up are the price we consciously pay for strictlimit-ing.
+		 *
+		 * If memcg domain is in effect, @dirty should be under
+		 * both global and memcg freerun ceilings.
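The freerun test described above can be sketched standalone. This is an illustration under one assumption: `dirty_freerun_ceiling()` is the midpoint of the background and hard thresholds, as used elsewhere in this file. A task is exempt from throttling only while it is below the ceiling of the global domain and, when a memcg domain is in effect, below that domain's ceiling too.

```c
#include <assert.h>

/* midpoint between background and hard threshold (assumed definition) */
static unsigned long freerun_ceiling(unsigned long thresh,
				     unsigned long bg_thresh)
{
	return (thresh + bg_thresh) / 2;
}

/* 1 if the dirtier may run free, 0 if it must enter the throttle loop */
static int in_freerun(unsigned long dirty, unsigned long thresh,
		      unsigned long bg_thresh,
		      int has_memcg, unsigned long m_dirty,
		      unsigned long m_thresh, unsigned long m_bg_thresh)
{
	if (dirty > freerun_ceiling(thresh, bg_thresh))
		return 0;
	/* both ceilings must hold; the memcg one only if it exists */
	return !has_memcg ||
	       m_dirty <= freerun_ceiling(m_thresh, m_bg_thresh);
}
```

Note how a wb can be comfortably under the global ceiling yet still be throttled because its memcg domain is over its own, much smaller, ceiling.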
 		 */
-		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {
+		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
+		    (!mdtc ||
+		     m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
+			unsigned long intv = dirty_poll_interval(dirty, thresh);
+			unsigned long m_intv = ULONG_MAX;
+
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
-			current->nr_dirtied_pause =
-				dirty_poll_interval(dirty, thresh);
+			if (mdtc)
+				m_intv = dirty_poll_interval(m_dirty, m_thresh);
+			current->nr_dirtied_pause = min(intv, m_intv);
 			break;
 		}
 
-		if (unlikely(!writeback_in_progress(bdi)))
-			bdi_start_background_writeback(bdi);
+		if (unlikely(!writeback_in_progress(wb)))
+			wb_start_background_writeback(wb);
 
+		/*
+		 * Calculate global domain's pos_ratio and select the
+		 * global dtc by default.
+		 */
 		if (!strictlimit)
-			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
-					 &bdi_dirty, &bdi_thresh, NULL);
+			wb_dirty_limits(gdtc);
 
-		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
-				 ((nr_dirty > dirty_thresh) || strictlimit);
-		if (dirty_exceeded && !bdi->dirty_exceeded)
-			bdi->dirty_exceeded = 1;
+		dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
+			((gdtc->dirty > gdtc->thresh) || strictlimit);
 
-		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
-				     nr_dirty, bdi_thresh, bdi_dirty,
-				     start_time);
+		wb_position_ratio(gdtc);
+		sdtc = gdtc;
 
-		dirty_ratelimit = bdi->dirty_ratelimit;
-		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
-					       background_thresh, nr_dirty,
-					       bdi_thresh, bdi_dirty);
-		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
+		if (mdtc) {
+			/*
+			 * If memcg domain is in effect, calculate its
+			 * pos_ratio.  @wb should satisfy constraints from
+			 * both global and memcg domains.  Choose the one
+			 * w/ lower pos_ratio.
+			 */
+			if (!strictlimit)
+				wb_dirty_limits(mdtc);
+
+			dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
+				((mdtc->dirty > mdtc->thresh) || strictlimit);
+
+			wb_position_ratio(mdtc);
+			if (mdtc->pos_ratio < gdtc->pos_ratio)
+				sdtc = mdtc;
+		}
+
+		if (dirty_exceeded && !wb->dirty_exceeded)
+			wb->dirty_exceeded = 1;
+
+		if (time_is_before_jiffies(wb->bw_time_stamp +
+					   BANDWIDTH_INTERVAL)) {
+			spin_lock(&wb->list_lock);
+			__wb_update_bandwidth(gdtc, mdtc, start_time, true);
+			spin_unlock(&wb->list_lock);
+		}
+
+		/* throttle according to the chosen dtc */
+		dirty_ratelimit = wb->dirty_ratelimit;
+		task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
-		max_pause = bdi_max_pause(bdi, bdi_dirty);
-		min_pause = bdi_min_pause(bdi, max_pause,
-					  task_ratelimit, dirty_ratelimit,
-					  &nr_dirtied_pause);
+		max_pause = wb_max_pause(wb, sdtc->wb_dirty);
+		min_pause = wb_min_pause(wb, max_pause,
+					 task_ratelimit, dirty_ratelimit,
+					 &nr_dirtied_pause);
 
 		if (unlikely(task_ratelimit == 0)) {
 			period = max_pause;
···
 		 */
 		if (pause < min_pause) {
 			trace_balance_dirty_pages(bdi,
-						  dirty_thresh,
-						  background_thresh,
-						  nr_dirty,
-						  bdi_thresh,
-						  bdi_dirty,
+						  sdtc->thresh,
+						  sdtc->bg_thresh,
+						  sdtc->dirty,
+						  sdtc->wb_thresh,
+						  sdtc->wb_dirty,
 						  dirty_ratelimit,
 						  task_ratelimit,
 						  pages_dirtied,
···
 
 pause:
 		trace_balance_dirty_pages(bdi,
-					  dirty_thresh,
-					  background_thresh,
-					  nr_dirty,
-					  bdi_thresh,
-					  bdi_dirty,
+					  sdtc->thresh,
+					  sdtc->bg_thresh,
+					  sdtc->dirty,
+					  sdtc->wb_thresh,
+					  sdtc->wb_dirty,
 					  dirty_ratelimit,
 					  task_ratelimit,
 					  pages_dirtied,
···
 		current->nr_dirtied_pause = nr_dirtied_pause;
 
 		/*
-		 * This is typically equal to (nr_dirty < dirty_thresh) and can
-		 * also keep "1000+ dd on a slow USB stick" under control.
+		 * This is typically equal to (dirty < thresh) and can also
+		 * keep "1000+ dd on a slow USB stick" under control.
 		 */
 		if (task_ratelimit)
 			break;
 
 		/*
 		 * In the case of an unresponding NFS server and the NFS dirty
-		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
+		 * pages exceeds dirty_thresh, give the other good wb's a pipe
 		 * to go through, so that tasks on them still remain responsive.
 		 *
 		 * In theory 1 page is enough to keep the comsumer-producer
 		 * pipe going: the flusher cleans 1 page => the task dirties 1
-		 * more page. However bdi_dirty has accounting errors. So use
-		 * the larger and more IO friendly bdi_stat_error.
+		 * more page. However wb_dirty has accounting errors.  So use
+		 * the larger and more IO friendly wb_stat_error.
 		 */
-		if (bdi_dirty <= bdi_stat_error(bdi))
+		if (sdtc->wb_dirty <= wb_stat_error(wb))
 			break;
 
 		if (fatal_signal_pending(current))
 			break;
 	}
 
-	if (!dirty_exceeded && bdi->dirty_exceeded)
-		bdi->dirty_exceeded = 0;
+	if (!dirty_exceeded && wb->dirty_exceeded)
+		wb->dirty_exceeded = 0;
 
-	if (writeback_in_progress(bdi))
+	if (writeback_in_progress(wb))
 		return;
 
 	/*
···
 	if (laptop_mode)
 		return;
 
-	if (nr_reclaimable > background_thresh)
-		bdi_start_background_writeback(bdi);
+	if (nr_reclaimable > gdtc->bg_thresh)
+		wb_start_background_writeback(wb);
 }
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
···
 */
 void balance_dirty_pages_ratelimited(struct address_space *mapping)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
+	if (inode_cgwb_enabled(inode))
+		wb = wb_get_create_current(bdi, GFP_KERNEL);
+	if (!wb)
+		wb = &bdi->wb;
+
 	ratelimit = current->nr_dirtied_pause;
-	if (bdi->dirty_exceeded)
+	if (wb->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
 	preempt_disable();
···
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, current->nr_dirtied);
+		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+
+	wb_put(wb);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
+
+/**
+ * wb_over_bg_thresh - does @wb need to be written back?
+ * @wb: bdi_writeback of interest
+ *
+ * Determines whether background writeback should keep writing @wb or it's
+ * clean enough.  Returns %true if writeback should continue.
+ */
+bool wb_over_bg_thresh(struct bdi_writeback *wb)
+{
+	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+	struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
+	struct dirty_throttle_control * const gdtc = &gdtc_stor;
+	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
+						     &mdtc_stor : NULL;
+
+	/*
+	 * Similar to balance_dirty_pages() but ignores pages being written
+	 * as we're trying to decide whether to put more under writeback.
+	 */
+	gdtc->avail = global_dirtyable_memory();
+	gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
+		      global_page_state(NR_UNSTABLE_NFS);
+	domain_dirty_limits(gdtc);
+
+	if (gdtc->dirty > gdtc->bg_thresh)
+		return true;
+
+	if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(gdtc))
+		return true;
+
+	if (mdtc) {
+		unsigned long writeback;
+
+		mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
+		mdtc_cap_avail(mdtc);
+		domain_dirty_limits(mdtc);	/* ditto, ignore writeback */
+
+		if (mdtc->dirty > mdtc->bg_thresh)
+			return true;
+
+		if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(mdtc))
+			return true;
+	}
+
+	return false;
+}
 
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
···
 
 	for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
-		dirty_thresh = hard_dirty_limit(dirty_thresh);
+		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
···
 	struct request_queue *q = (struct request_queue *)data;
 	int nr_pages = global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS);
+	struct bdi_writeback *wb;
+	struct wb_iter iter;
 
 	/*
 	 * We want to write everything out, not just down to the dirty
 	 * threshold
 	 */
-	if (bdi_has_dirty_io(&q->backing_dev_info))
-		bdi_start_writeback(&q->backing_dev_info, nr_pages,
-					WB_REASON_LAPTOP_TIMER);
+	if (!bdi_has_dirty_io(&q->backing_dev_info))
+		return;
+
+	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+		if (wb_has_dirty_io(wb))
+			wb_start_writeback(wb, nr_pages, true,
+					   WB_REASON_LAPTOP_TIMER);
 }
 
 /*
···
 
 void writeback_set_ratelimit(void)
 {
+	struct wb_domain *dom = &global_wb_domain;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
+
 	global_dirty_limits(&background_thresh, &dirty_thresh);
-	global_dirty_limit = dirty_thresh;
+	dom->dirty_limit = dirty_thresh;
 	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
···
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
 
-	fprop_global_init(&writeout_completions, GFP_KERNEL);
+	BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));
 }
 
 /**
···
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * Caller must hold mem_cgroup_begin_page_stat().
+ *
  * NOTE: This relies on being atomic wrt interrupts.
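The `ratelimit_pages` computation in `writeback_set_ratelimit()` above scales each task's dirty poll budget with the global dirty threshold, spread across online CPUs, with a floor of 16 pages. A standalone sketch of that arithmetic (an illustration, not the kernel code):

```c
#include <assert.h>

/* mirror of the ratelimit_pages formula above, as a pure function */
static unsigned long sk_ratelimit_pages(unsigned long dirty_thresh,
					unsigned int online_cpus)
{
	unsigned long rl = dirty_thresh / (online_cpus * 32);

	return rl < 16 ? 16 : rl;	/* never poll more often than every 16 pages */
}
```

The floor matters on small systems: with a tiny dirty threshold the division would otherwise drive tasks into `balance_dirty_pages()` after every few dirtied pages.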
  */
-void account_page_dirtied(struct page *page, struct address_space *mapping)
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+			  struct mem_cgroup *memcg)
 {
+	struct inode *inode = mapping->host;
+
 	trace_writeback_dirty_page(page, mapping);
 
 	if (mapping_cap_account_dirty(mapping)) {
-		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+		struct bdi_writeback *wb;
 
+		inode_attach_wb(inode, page);
+		wb = inode_to_wb(inode);
+
+		mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
-		__inc_bdi_stat(bdi, BDI_RECLAIMABLE);
-		__inc_bdi_stat(bdi, BDI_DIRTIED);
+		__inc_wb_stat(wb, WB_RECLAIMABLE);
+		__inc_wb_stat(wb, WB_DIRTIED);
 		task_io_account_write(PAGE_CACHE_SIZE);
 		current->nr_dirtied++;
 		this_cpu_inc(bdp_ratelimits);
···
 /*
  * Helper function for deaccounting dirty page without writeback.
  *
- * Doing this should *normally* only ever be done when a page
- * is truncated, and is not actually mapped anywhere at all. However,
- * fs/buffer.c does this when it notices that somebody has cleaned
- * out all the buffers on a page without actually doing it through
- * the VM. Can you say "ext3 is horribly ugly"? Thought you could.
+ * Caller must hold mem_cgroup_begin_page_stat().
  */
-void account_page_cleaned(struct page *page, struct address_space *mapping)
+void account_page_cleaned(struct page *page, struct address_space *mapping,
+			  struct mem_cgroup *memcg, struct bdi_writeback *wb)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE);
+		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
 	}
 }
-EXPORT_SYMBOL(account_page_cleaned);
 
 /*
  * For address_spaces which do not use buffers. Just tag the page as dirty in
···
 */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_begin_page_stat(page);
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 		unsigned long flags;
 
-		if (!mapping)
+		if (!mapping) {
+			mem_cgroup_end_page_stat(memcg);
 			return 1;
+		}
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		BUG_ON(page_mapping(page) != mapping);
 		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-		account_page_dirtied(page, mapping);
+		account_page_dirtied(page, mapping, memcg);
 		radix_tree_tag_set(&mapping->page_tree, page_index(page),
 				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		mem_cgroup_end_page_stat(memcg);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+	mem_cgroup_end_page_stat(memcg);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
···
 void account_page_redirty(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+
 	if (mapping && mapping_cap_account_dirty(mapping)) {
+		struct inode *inode = mapping->host;
+		struct bdi_writeback *wb;
+		bool locked;
+
+		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		current->nr_dirtied--;
 		dec_zone_page_state(page, NR_DIRTIED);
-		dec_bdi_stat(inode_to_bdi(mapping->host), BDI_DIRTIED);
+		dec_wb_stat(wb, WB_DIRTIED);
+		unlocked_inode_to_wb_end(inode, locked);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
···
 EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
+ * This cancels just the dirty bit on the kernel page itself, it does NOT
+ * actually remove dirty bits on any mmap's that may be around. It also
+ * leaves the page tagged dirty, so any sync activity will still find it on
+ * the dirty lists, and in particular, clear_page_dirty_for_io() will still
+ * look at the dirty bits in the VM.
+ *
+ * Doing this should *normally* only ever be done when a page is truncated,
+ * and is not actually mapped anywhere at all. However, fs/buffer.c does
+ * this when it notices that somebody has cleaned out all the buffers on a
+ * page without actually doing it through the VM. Can you say "ext3 is
+ * horribly ugly"? Thought you could.
+ */
+void cancel_dirty_page(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (mapping_cap_account_dirty(mapping)) {
+		struct inode *inode = mapping->host;
+		struct bdi_writeback *wb;
+		struct mem_cgroup *memcg;
+		bool locked;
+
+		memcg = mem_cgroup_begin_page_stat(page);
+		wb = unlocked_inode_to_wb_begin(inode, &locked);
+
+		if (TestClearPageDirty(page))
+			account_page_cleaned(page, mapping, memcg, wb);
+
+		unlocked_inode_to_wb_end(inode, locked);
+		mem_cgroup_end_page_stat(memcg);
+	} else {
+		ClearPageDirty(page);
+	}
+}
+EXPORT_SYMBOL(cancel_dirty_page);
+
+/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
···
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
 	if (mapping && mapping_cap_account_dirty(mapping)) {
+		struct inode *inode = mapping->host;
+		struct bdi_writeback *wb;
+		struct mem_cgroup *memcg;
+		bool locked;
+
 		/*
 		 * Yes, Virginia, this is indeed insane.
 		 *
···
 		 * always locked coming in here, so we get the desired
 		 * exclusion.
 		 */
+		memcg = mem_cgroup_begin_page_stat(page);
+		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_bdi_stat(inode_to_bdi(mapping->host),
-					BDI_RECLAIMABLE);
-			return 1;
+			dec_wb_stat(wb, WB_RECLAIMABLE);
+			ret = 1;
 		}
-		return 0;
+		unlocked_inode_to_wb_end(inode, locked);
+		mem_cgroup_end_page_stat(memcg);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
···
 
 	memcg = mem_cgroup_begin_page_stat(page);
 	if (mapping) {
-		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+		struct inode *inode = mapping->host;
+		struct backing_dev_info *bdi = inode_to_bdi(inode);
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
···
 					page_index(page),
 					PAGECACHE_TAG_WRITEBACK);
 		if (bdi_cap_account_writeback(bdi)) {
-			__dec_bdi_stat(bdi, BDI_WRITEBACK);
-			__bdi_writeout_inc(bdi);
+			struct bdi_writeback *wb = inode_to_wb(inode);
+
+			__dec_wb_stat(wb, WB_WRITEBACK);
+			__wb_writeout_inc(wb);
 		}
 	}
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
···
 
 	memcg = mem_cgroup_begin_page_stat(page);
 	if (mapping) {
-		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+		struct inode *inode = mapping->host;
+		struct backing_dev_info *bdi = inode_to_bdi(inode);
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
···
 					page_index(page),
 					PAGECACHE_TAG_WRITEBACK);
 		if (bdi_cap_account_writeback(bdi))
-			__inc_bdi_stat(bdi, BDI_WRITEBACK);
+			__inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
 	}
 	if (!PageDirty(page))
 		radix_tree_tag_clear(&mapping->page_tree,
+1 -1
mm/readahead.c
···
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(inode_to_bdi(mapping->host)))
+	if (inode_read_congested(mapping->host))
 		return;
 
 	/* do read-ahead */
+2
mm/rmap.c
···
  *         swap_lock (in swap_duplicate, swap_info_get)
  *           mmlist_lock (in mmput, drain_mmlist and others)
  *           mapping->private_lock (in __set_page_dirty_buffers)
+ *             mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ *               mapping->tree_lock (widely used)
  *           inode->i_lock (in set_page_dirty's __mark_inode_dirty)
  *           bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
  *           sb_lock (within inode_lock in fs/fs-writeback.c)
+11 -7
mm/truncate.c
···
 	 * the VM has canceled the dirty bit (eg ext3 journaling).
 	 * Hence dirty accounting check is placed after invalidation.
 	 */
-	if (TestClearPageDirty(page))
-		account_page_cleaned(page, mapping);
-
+	cancel_dirty_page(page);
 	ClearPageMappedToDisk(page);
 	delete_from_page_cache(page);
 	return 0;
···
 static int
 invalidate_complete_page2(struct address_space *mapping, struct page *page)
 {
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
 	if (page->mapping != mapping)
 		return 0;
 
 	if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	spin_lock_irq(&mapping->tree_lock);
+	memcg = mem_cgroup_begin_page_stat(page);
+	spin_lock_irqsave(&mapping->tree_lock, flags);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(page_has_private(page));
-	__delete_from_page_cache(page, NULL);
-	spin_unlock_irq(&mapping->tree_lock);
+	__delete_from_page_cache(page, NULL, memcg);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	mem_cgroup_end_page_stat(memcg);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
···
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	mem_cgroup_end_page_stat(memcg);
 	return 0;
 }
 
+58 -21
mm/vmscan.c
···
 {
 	return !sc->target_mem_cgroup;
 }
+
+/**
+ * sane_reclaim - is the usual dirty throttling mechanism operational?
+ * @sc: scan_control in question
+ *
+ * The normal page dirty throttling mechanism in balance_dirty_pages() is
+ * completely broken with the legacy memcg and direct stalling in
+ * shrink_page_list() is used for throttling instead, which lacks all the
+ * niceties such as fairness, adaptive pausing, bandwidth proportional
+ * allocation and configurability.
+ *
+ * This function tests whether the vmscan currently in progress can assume
+ * that the normal dirty throttling mechanism is operational.
+ */
+static bool sane_reclaim(struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
+
+	if (!memcg)
+		return true;
+#ifdef CONFIG_CGROUP_WRITEBACK
+	if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup))
+		return true;
+#endif
+	return false;
+}
 #else
 static bool global_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static bool sane_reclaim(struct scan_control *sc)
 {
 	return true;
 }
···
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi,
-			      struct scan_control *sc)
+static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
 {
 	if (current->flags & PF_SWAPWRITE)
 		return 1;
-	if (!bdi_write_congested(bdi))
+	if (!inode_write_congested(inode))
 		return 1;
-	if (bdi == current->backing_dev_info)
+	if (inode_to_bdi(inode) == current->backing_dev_info)
 		return 1;
 	return 0;
 }
···
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(inode_to_bdi(mapping->host), sc))
+	if (!may_write_to_inode(mapping->host, sc))
 		return PAGE_KEEP;
 
 	if (clear_page_dirty_for_io(page)) {
···
 static int __remove_mapping(struct address_space *mapping, struct page *page,
 			    bool reclaimed)
 {
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	spin_lock_irq(&mapping->tree_lock);
+	memcg = mem_cgroup_begin_page_stat(page);
+	spin_lock_irqsave(&mapping->tree_lock, flags);
 	/*
 	 * The non racy check for a busy page.
 	 *
···
 		swp_entry_t swap = { .val = page_private(page) };
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		mem_cgroup_end_page_stat(memcg);
 		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
···
 		if (reclaimed && page_is_file_cache(page) &&
 		    !mapping_exiting(mapping))
 			shadow = workingset_eviction(mapping, page);
-		__delete_from_page_cache(page, shadow);
-		spin_unlock_irq(&mapping->tree_lock);
+		__delete_from_page_cache(page, shadow, memcg);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		mem_cgroup_end_page_stat(memcg);
 
 		if (freepage != NULL)
 			freepage(page);
···
 	return 1;
 
 cannot_free:
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	mem_cgroup_end_page_stat(memcg);
 	return 0;
 }
 
···
 	 */
 	mapping = page_mapping(page);
 	if (((dirty || writeback) && mapping &&
-	     bdi_write_congested(inode_to_bdi(mapping->host))) ||
+	     inode_write_congested(mapping->host)) ||
 	    (writeback && PageReclaim(page)))
 		nr_congested++;
 
···
 	 * note that the LRU is being scanned too quickly and the
 	 * caller can stall after page list has been processed.
 	 *
-	 * 2) Global reclaim encounters a page, memcg encounters a
-	 *    page that is not marked for immediate reclaim or
-	 *    the caller does not have __GFP_IO. In this case mark
-	 *    the page for immediate reclaim and continue scanning.
+	 * 2) Global or new memcg reclaim encounters a page that is
+	 *    not marked for immediate reclaim or the caller does not
+	 *    have __GFP_IO. In this case mark the page for immediate
+	 *    reclaim and continue scanning.
 	 *
 	 * __GFP_IO is checked because a loop driver thread might
 	 * enter reclaim, and deadlock if it waits on a page for
···
 	 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
 	 * may_enter_fs here is liable to OOM on them.
 	 *
-	 * 3) memcg encounters a page that is not already marked
+	 * 3) Legacy memcg encounters a page that is not already marked
 	 *    PageReclaim. memcg does not have any dirty pages
 	 *    throttling so we could easily OOM just because too many
 	 *    pages are in writeback and there is nothing else to
···
 			goto keep_locked;
 
 		/* Case 2 above */
 		} else if (sane_reclaim(sc) ||
-		} else if (global_reclaim(sc) ||
 		    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
 			/*
 			 * This is slightly racy - end_page_writeback()
···
 	if (current_is_kswapd())
 		return 0;
 
-	if (!global_reclaim(sc))
+	if (!sane_reclaim(sc))
 		return 0;
 
 	if (file) {
···
 		set_bit(ZONE_WRITEBACK, &zone->flags);
 
 	/*
-	 * memcg will stall in page writeback so only consider forcibly
-	 * stalling for global reclaim
+	 * Legacy memcg will stall in page writeback so avoid forcibly
+	 * stalling here.
 	 */
-	if (global_reclaim(sc)) {
+	if (sane_reclaim(sc)) {
 		/*
 		 * Tag a zone as congested if all the dirty pages scanned were
 		 * backed by a congested BDI and wait_iff_congested will stall.