
block: make generic_make_request handle arbitrarily sized bios

The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrarily
sized bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

* nfhd_make_request (arch/m68k/emu/nfblock.c)
* axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
* simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
* brd_make_request (ramdisk - drivers/block/brd.c)
* mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
* loop_make_request
* null_queue_bio
* bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

Authored by Kent Overstreet, committed by Jens Axboe
54efd50b 41609892

+192 -22
block/blk-core.c | +9 -10

···
 	if (q->id < 0)
 		goto fail_q;
 
+	q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+	if (!q->bio_split)
+		goto fail_id;
+
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
···
 	err = bdi_init(&q->backing_dev_info);
 	if (err)
-		goto fail_id;
+		goto fail_split;
 
 	setup_timer(&q->backing_dev_info.laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, (unsigned long) q);
···
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
+fail_split:
+	bioset_free(q->bio_split);
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
···
 	struct request *req;
 	unsigned int request_count = 0;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
···
 		       "nonexistent block-device %s (%Lu)\n",
 		       bdevname(bio->bi_bdev, b),
 		       (long long) bio->bi_iter.bi_sector);
-		goto end_io;
-	}
-
-	if (likely(bio_is_rw(bio) &&
-		   nr_sectors > queue_max_hw_sectors(q))) {
-		printk(KERN_ERR "bio too big device %s (%u > %u)\n",
-		       bdevname(bio->bi_bdev, b),
-		       bio_sectors(bio),
-		       queue_max_hw_sectors(q));
 		goto end_io;
 	}
block/blk-merge.c | +149 -10

···
 
 #include "blk.h"
 
+static struct bio *blk_bio_discard_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	unsigned int max_discard_sectors, granularity;
+	int alignment;
+	sector_t tmp;
+	unsigned split_sectors;
+
+	/* Zero-sector (unknown) and one-sector granularities are the same. */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+
+	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	max_discard_sectors -= max_discard_sectors % granularity;
+
+	if (unlikely(!max_discard_sectors)) {
+		/* XXX: warn */
+		return NULL;
+	}
+
+	if (bio_sectors(bio) <= max_discard_sectors)
+		return NULL;
+
+	split_sectors = max_discard_sectors;
+
+	/*
+	 * If the next starting sector would be misaligned, stop the discard at
+	 * the previous aligned sector.
+	 */
+	alignment = (q->limits.discard_alignment >> 9) % granularity;
+
+	tmp = bio->bi_iter.bi_sector + split_sectors - alignment;
+	tmp = sector_div(tmp, granularity);
+
+	if (split_sectors > tmp)
+		split_sectors -= tmp;
+
+	return bio_split(bio, split_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_write_same_split(struct request_queue *q,
+					    struct bio *bio,
+					    struct bio_set *bs)
+{
+	if (!q->limits.max_write_same_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+}
+
+static struct bio *blk_bio_segment_split(struct request_queue *q,
+					 struct bio *bio,
+					 struct bio_set *bs)
+{
+	struct bio *split;
+	struct bio_vec bv, bvprv;
+	struct bvec_iter iter;
+	unsigned seg_size = 0, nsegs = 0;
+	int prev = 0;
+
+	struct bvec_merge_data bvm = {
+		.bi_bdev	= bio->bi_bdev,
+		.bi_sector	= bio->bi_iter.bi_sector,
+		.bi_size	= 0,
+		.bi_rw		= bio->bi_rw,
+	};
+
+	bio_for_each_segment(bv, bio, iter) {
+		if (q->merge_bvec_fn &&
+		    q->merge_bvec_fn(q, &bvm, &bv) < (int) bv.bv_len)
+			goto split;
+
+		bvm.bi_size += bv.bv_len;
+
+		if (bvm.bi_size >> 9 > queue_max_sectors(q))
+			goto split;
+
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (q->queue_flags & (1 << QUEUE_FLAG_SG_GAPS) &&
+		    prev && bvec_gap_to_prev(&bvprv, bv.bv_offset))
+			goto split;
+
+		if (prev && blk_queue_cluster(q)) {
+			if (seg_size + bv.bv_len > queue_max_segment_size(q))
+				goto new_segment;
+			if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bv))
+				goto new_segment;
+			if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bv))
+				goto new_segment;
+
+			seg_size += bv.bv_len;
+			bvprv = bv;
+			prev = 1;
+			continue;
+		}
+new_segment:
+		if (nsegs == queue_max_segments(q))
+			goto split;
+
+		nsegs++;
+		bvprv = bv;
+		prev = 1;
+		seg_size = bv.bv_len;
+	}
+
+	return NULL;
+split:
+	split = bio_clone_bioset(bio, GFP_NOIO, bs);
+
+	split->bi_iter.bi_size -= iter.bi_size;
+	bio->bi_iter = iter;
+
+	if (bio_integrity(bio)) {
+		bio_integrity_advance(bio, split->bi_iter.bi_size);
+		bio_integrity_trim(split, 0, bio_sectors(split));
+	}
+
+	return split;
+}
+
+void blk_queue_split(struct request_queue *q, struct bio **bio,
+		     struct bio_set *bs)
+{
+	struct bio *split;
+
+	if ((*bio)->bi_rw & REQ_DISCARD)
+		split = blk_bio_discard_split(q, *bio, bs);
+	else if ((*bio)->bi_rw & REQ_WRITE_SAME)
+		split = blk_bio_write_same_split(q, *bio, bs);
+	else
+		split = blk_bio_segment_split(q, *bio, q->bio_split);
+
+	if (split) {
+		bio_chain(split, *bio);
+		generic_make_request(*bio);
+		*bio = split;
+	}
+}
+EXPORT_SYMBOL(blk_queue_split);
+
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 					     struct bio *bio,
 					     bool no_sg_merge)
 {
 	struct bio_vec bv, bvprv = { NULL };
-	int cluster, high, highprv = 1;
+	int cluster, prev = 0;
 	unsigned int seg_size, nr_phys_segs;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
···
 	cluster = blk_queue_cluster(q);
 	seg_size = 0;
 	nr_phys_segs = 0;
-	high = 0;
 	for_each_bio(bio) {
 		bio_for_each_segment(bv, bio, iter) {
 			/*
···
 			if (no_sg_merge)
 				goto new_segment;
 
-			/*
-			 * the trick here is making sure that a high page is
-			 * never considered part of another segment, since
-			 * that might change with the bounce page.
-			 */
-			high = page_to_pfn(bv.bv_page) > queue_bounce_pfn(q);
-			if (!high && !highprv && cluster) {
+			if (prev && cluster) {
 				if (seg_size + bv.bv_len
 				    > queue_max_segment_size(q))
 					goto new_segment;
···
 
 			nr_phys_segs++;
 			bvprv = bv;
+			prev = 1;
 			seg_size = bv.bv_len;
-			highprv = high;
 		}
 		bbio = bio;
 	}
block/blk-mq.c | +4

···
 		return;
 	}
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
···
 		bio_io_error(bio);
 		return;
 	}
+
+	blk_queue_split(q, &bio, q->bio_split);
 
 	if (!is_flush_fua && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
block/blk-sysfs.c | +3

···
 
 	blk_trace_shutdown(q);
 
+	if (q->bio_split)
+		bioset_free(q->bio_split);
+
 	ida_simple_remove(&blk_queue_ida, q->id);
 	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
drivers/block/drbd/drbd_req.c | +2

···
 	struct drbd_device *device = (struct drbd_device *) q->queuedata;
 	unsigned long start_jif;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	start_jif = jiffies;
 
 	/*
drivers/block/pktcdvd.c | +4 -2

···
 	char b[BDEVNAME_SIZE];
 	struct bio *split;
 
+	blk_queue_bounce(q, &bio);
+
+	blk_queue_split(q, &bio, q->bio_split);
+
 	pd = q->queuedata;
 	if (!pd) {
 		pr_err("%s incorrect request queue\n",
···
 		pkt_err(pd, "wrong bio size\n");
 		goto end_io;
 	}
-
-	blk_queue_bounce(q, &bio);
 
 	do {
 		sector_t zone = get_zone(bio->bi_iter.bi_sector, pd);
drivers/block/ps3vram.c | +2

···
 
 	dev_dbg(&dev->core, "%s\n", __func__);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&priv->lock);
 	busy = !bio_list_empty(&priv->list);
 	bio_list_add(&priv->list, bio);
drivers/block/rsxx/dev.c | +2

···
 	struct rsxx_bio_meta *bio_meta;
 	int st = -EINVAL;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	might_sleep();
 
 	if (!card)
drivers/block/umem.c | +2

···
 		(unsigned long long)bio->bi_iter.bi_sector,
 		bio->bi_iter.bi_size);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	spin_lock_irq(&card->lock);
 	*card->biotail = bio;
 	bio->bi_next = NULL;
drivers/block/zram/zram_drv.c | +2

···
 	if (unlikely(!zram_meta_get(zram)))
 		goto error;
 
+	blk_queue_split(queue, &bio, queue->bio_split);
+
 	if (!valid_io_request(zram, bio->bi_iter.bi_sector,
 			      bio->bi_iter.bi_size)) {
 		atomic64_inc(&zram->stats.invalid_io);
drivers/md/dm.c | +2

···
 
 	map = dm_get_live_table(md, &srcu_idx);
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
 
 	/* if we're suspended, we have to queue this io for later */
drivers/md/md.c | +2

···
 	unsigned int sectors;
 	int cpu;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);
drivers/s390/block/dcssblk.c | +2

···
 	unsigned long source_addr;
 	unsigned long bytes_done;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	bytes_done = 0;
 	dev_info = bio->bi_bdev->bd_disk->private_data;
 	if (dev_info == NULL)
drivers/s390/block/xpram.c | +2

···
 	unsigned long page_addr;
 	unsigned long bytes;
 
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if ((bio->bi_iter.bi_sector & 7) != 0 ||
 	    (bio->bi_iter.bi_size & 4095) != 0)
 		/* Request is not page-aligned. */
drivers/staging/lustre/lustre/llite/lloop.c | +2

···
 	int rw = bio_rw(old_bio);
 	int inactive;
 
+	blk_queue_split(q, &old_bio, q->bio_split);
+
 	if (!lo)
 		goto err;
 
include/linux/blkdev.h | +3

···
 
 	struct blk_mq_tag_set	*tag_set;
 	struct list_head	tag_set_list;
+	struct bio_set		*bio_split;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
···
 extern int blk_insert_cloned_request(struct request_queue *q,
 				     struct request *rq);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
+extern void blk_queue_split(struct request_queue *, struct bio **,
+			    struct bio_set *);
 extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,