Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

block: implement bio helper to add iter bvec pages to bio

For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. For now, we grab a reference to those pages,
and release them normally on IO completion. This isn't really needed
for the normal case of O_DIRECT from/to a file, but some of the more
esoteric use cases (like splice(2)) will unconditionally put the
pipe buffer pages when the buffers are released. Until we can manage
that case properly, ITER_BVEC pages are treated like normal pages
in terms of reference counting.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
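
The reference counting argument in the message above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct model_page`, `model_get_page()`, `model_put_page()` and `model_page_survives_io()` are hypothetical names, and the model assumes the pipe buffer drops its reference before the IO has completed.

```c
#include <assert.h>

/* Hypothetical stand-in for struct page: only a reference count. */
struct model_page {
	int refcount;
};

void model_get_page(struct model_page *p)
{
	p->refcount++;
}

/* Drops one reference; returns 1 if the page would now be freed. */
int model_put_page(struct model_page *p)
{
	assert(p->refcount > 0);
	return --p->refcount == 0;
}

/*
 * Models a splice-style caller whose pipe buffer holds one reference and
 * drops it unconditionally (assumed here to happen while IO is in flight).
 * take_bio_ref selects whether the bio takes its own reference.
 * Returns 1 if the page stays alive until IO completion, 0 otherwise.
 */
int model_page_survives_io(int take_bio_ref)
{
	struct model_page page = { .refcount = 1 };	/* pipe buffer's reference */
	int freed;

	if (take_bio_ref)
		model_get_page(&page);		/* reference held for the bio */

	freed = model_put_page(&page);		/* pipe releases unconditionally */
	if (freed)
		return 0;			/* IO would touch a freed page */

	freed = model_put_page(&page);		/* IO completion drops the bio's reference */
	assert(freed);
	return 1;
}
```

In this model the page only outlives the pipe release when the bio holds its own reference, which is the reason the patch treats ITER_BVEC pages like normal pages for now.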

+54 -8
block/bio.c
···
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		struct page *page;
+		int i;
+
+		/*
+		 * For the normal O_DIRECT case, we could skip grabbing this
+		 * reference and then not have to put them again when IO
+		 * completes. But this breaks some in-kernel users, like
+		 * splicing to/from a loop device, where we release the pipe
+		 * pages unconditionally. If we can fix that case, we can
+		 * get rid of the get here and the need to call
+		 * bio_release_pages() at IO completion time.
+		 */
+		mp_bvec_for_each_page(page, bv, i)
+			get_page(page);
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
···
 }
 
 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. For now, when adding kernel pages, we still grab a reference to the
+ * page. This isn't strictly needed for the common case, but some call paths
+ * end up releasing pages from eg a pipe and we can't easily control these.
+ * See comment in __bio_iov_bvec_add_pages().
+ *
  * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;
 
 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);
 
 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
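
The partial-success rule in the kernel-doc comment ("Error is returned only if 0 pages could be pinned") can be modelled in userspace as follows. This is a hedged sketch rather than the kernel implementation: `struct model_bio`, `model_add_step()` and `model_get_pages()` are hypothetical names, and the loop bound is an assumption because the diff does not show the loop's terminating condition.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct bio: only the segment counters. */
struct model_bio {
	unsigned short vcnt;		/* segments added so far */
	unsigned short max_vecs;	/* assumed capacity of the bio */
};

/*
 * Hypothetical add step standing in for either helper in the patch:
 * fails with -EINVAL once fail_at segments are present.
 */
int model_add_step(struct model_bio *bio, unsigned short fail_at)
{
	if (bio->vcnt >= fail_at)
		return -EINVAL;
	bio->vcnt++;
	return 0;
}

/* Mirrors the return rule: an error is reported only if nothing was added. */
int model_get_pages(struct model_bio *bio, unsigned short fail_at)
{
	unsigned short orig_vcnt = bio->vcnt;

	do {
		int ret = model_add_step(bio, fail_at);

		if (ret)
			return bio->vcnt > orig_vcnt ? 0 : ret;
	} while (bio->vcnt < bio->max_vecs);	/* assumed bound, not taken from the diff */

	return 0;
}
```

A caller of such a function therefore has to check how much was actually added, since a return value of 0 can mean either complete or partial progress.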