Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

blk-throttle: Prevents the bps restricted io from entering the bps queue again

[BUG]
There has an issue of io delayed dispatch caused by io splitting. Consider
the following scenario:
1) If we set a BPS limit of 1MB/s and restrict the maximum IO size per
dispatch to 4KB, submitting -two- 1MB IO requests results in completion
times of 1s and 2s, which is expected.
2) However, if we additionally set an IOPS limit of 1,000,000/s with the
same BPS limit of 1MB/s, submitting -two- 1MB IO requests again results in
both completing in 2s, even though the IOPS constraint is being met.

[CAUSE]
This issue arises because BPS and IOPS currently share the same queue in
the blkthrotl mechanism:
1) This issue does not occur when only BPS is limited because the split IOs
return false in blk_should_throtl() and do not go through to throtl again.
2) For split IOs, even if they have been tagged with BIO_BPS_THROTTLED,
they still get queued alternately in the same list due to continuous
splitting and reordering. As a result, the two IO requests are both
completed at the 2-second mark, causing an unintended delay.
3) It is not difficult to imagine that in this scenario, if N 1MB IOs are
issued at once, all IOs will eventually complete together in N seconds.

[FIX]
With the queue separation introduced in the previous patches, we now have
separate BPS and IOPS queues. For IOs that have already passed the BPS
limitation, they do not need to re-enter the BPS queue and can directly
placed to the IOPS queue.

Since we have split the queues, when the IOPS queue is previously empty
and a new bio is added to the first qnode->bios_iops list in the
service_queue, we also need to update the disptime. This patch introduces
"THROTL_TG_IOPS_WAS_EMPTY" flag to mark it.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-8-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

authored by

Zizhi Wo and committed by
Jens Axboe
d1ba22ab 28ad83b7

+53 -11
+45 -8
block/blk-throttle.c
··· 163 163 { 164 164 bool rw = bio_data_dir(bio); 165 165 166 - if (bio_flagged(bio, BIO_TG_BPS_THROTTLED)) { 166 + /* 167 + * Split bios have already been throttled by bps, so they are 168 + * directly queued into the iops path. 169 + */ 170 + if (bio_flagged(bio, BIO_TG_BPS_THROTTLED) || 171 + bio_flagged(bio, BIO_BPS_THROTTLED)) { 167 172 bio_list_add(&qn->bios_iops, bio); 168 173 sq->nr_queued_iops[rw]++; 169 174 } else { ··· 951 946 952 947 throtl_qnode_add_bio(bio, qn, sq); 953 948 949 + /* 950 + * Since we have split the queues, when the iops queue is 951 + * previously empty and a new @bio is added into the first @qn, 952 + * we also need to update the @tg->disptime. 953 + */ 954 + if (bio_flagged(bio, BIO_BPS_THROTTLED) && 955 + bio == throtl_peek_queued(&sq->queued[rw])) 956 + tg->flags |= THROTL_TG_IOPS_WAS_EMPTY; 957 + 954 958 throtl_enqueue_tg(tg); 955 959 } 956 960 ··· 987 973 988 974 /* see throtl_add_bio_tg() */ 989 975 tg->flags &= ~THROTL_TG_WAS_EMPTY; 976 + tg->flags &= ~THROTL_TG_IOPS_WAS_EMPTY; 990 977 } 991 978 992 979 static void start_parent_slice_with_credit(struct throtl_grp *child_tg, ··· 1175 1160 1176 1161 if (parent_sq) { 1177 1162 /* @parent_sq is another throl_grp, propagate dispatch */ 1178 - if (tg->flags & THROTL_TG_WAS_EMPTY) { 1163 + if (tg->flags & THROTL_TG_WAS_EMPTY || 1164 + tg->flags & THROTL_TG_IOPS_WAS_EMPTY) { 1179 1165 tg_update_disptime(tg); 1180 1166 if (!throtl_schedule_next_dispatch(parent_sq, false)) { 1181 1167 /* window is already open, repeat dispatching */ ··· 1721 1705 1722 1706 static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw) 1723 1707 { 1724 - /* throtl is FIFO - if bios are already queued, should queue */ 1725 - if (sq_queued(&tg->service_queue, rw)) 1708 + struct throtl_service_queue *sq = &tg->service_queue; 1709 + 1710 + /* 1711 + * For a split bio, we need to specifically distinguish whether the 1712 + * iops queue is empty. 1713 + */ 1714 + if (bio_flagged(bio, BIO_BPS_THROTTLED)) 1715 + return sq->nr_queued_iops[rw] == 0 && 1716 + tg_dispatch_iops_time(tg, bio) == 0; 1717 + 1718 + /* 1719 + * Throtl is FIFO - if bios are already queued, should queue. 1720 + * If the bps queue is empty and @bio is within the bps limit, charge 1721 + * bps here for direct placement into the iops queue. 1722 + */ 1723 + if (sq_queued(&tg->service_queue, rw)) { 1724 + if (sq->nr_queued_bps[rw] == 0 && 1725 + tg_dispatch_bps_time(tg, bio) == 0) 1726 + throtl_charge_bps_bio(tg, bio); 1727 + 1726 1728 return false; 1729 + } 1727 1730 1728 1731 return tg_dispatch_time(tg, bio) == 0; 1729 1732 } ··· 1823 1788 1824 1789 /* 1825 1790 * Update @tg's dispatch time and force schedule dispatch if @tg 1826 - * was empty before @bio. The forced scheduling isn't likely to 1827 - * cause undue delay as @bio is likely to be dispatched directly if 1828 - * its @tg's disptime is not in the future. 1791 + * was empty before @bio, or the iops queue is empty and @bio will 1792 + * add to. The forced scheduling isn't likely to cause undue 1793 + * delay as @bio is likely to be dispatched directly if its @tg's 1794 + * disptime is not in the future. 1829 1795 */ 1830 - if (tg->flags & THROTL_TG_WAS_EMPTY) { 1796 + if (tg->flags & THROTL_TG_WAS_EMPTY || 1797 + tg->flags & THROTL_TG_IOPS_WAS_EMPTY) { 1831 1798 tg_update_disptime(tg); 1832 1799 throtl_schedule_next_dispatch(tg->service_queue.parent_sq, true); 1833 1800 }
+8 -3
block/blk-throttle.h
··· 56 56 }; 57 57 58 58 enum tg_state_flags { 59 - THROTL_TG_PENDING = 1 << 0, /* on parent's pending tree */ 60 - THROTL_TG_WAS_EMPTY = 1 << 1, /* bio_lists[] became non-empty */ 61 - THROTL_TG_CANCELING = 1 << 2, /* starts to cancel bio */ 59 + THROTL_TG_PENDING = 1 << 0, /* on parent's pending tree */ 60 + THROTL_TG_WAS_EMPTY = 1 << 1, /* bio_lists[] became non-empty */ 61 + /* 62 + * The sq's iops queue is empty, and a bio is about to be enqueued 63 + * to the first qnode's bios_iops list. 64 + */ 65 + THROTL_TG_IOPS_WAS_EMPTY = 1 << 2, 66 + THROTL_TG_CANCELING = 1 << 3, /* starts to cancel bio */ 62 67 }; 63 68 64 69 struct throtl_grp {