Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
"Core block changes that have been queued up for this release:

- Remove dead blk-throttle and blk-wbt code (Guoqing)

- Include pid in blktrace note traces (Jan)

- Don't spew I/O errors on wouldblock termination (me)

- Zone append addition (Johannes, Keith, Damien)

- IO accounting improvements (Konstantin, Christoph)

- blk-mq hardware map update improvements (Ming)

- Scheduler dispatch improvement (Salman)

- Inline block encryption support (Satya)

- Request map fixes and improvements (Weiping)

- blk-iocost tweaks (Tejun)

- Fix for timeout failing with error injection (Keith)

- Queue re-run fixes (Douglas)

- CPU hotplug improvements (Christoph)

- Queue entry/exit improvements (Christoph)

- Move DMA drain handling to the few drivers that use it (Christoph)

- Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
block: mark bio_wouldblock_error() bio with BIO_QUIET
blk-wbt: rename __wbt_update_limits to wbt_update_limits
blk-wbt: remove wbt_update_limits
blk-throttle: remove tg_drain_bios
blk-throttle: remove blk_throtl_drain
null_blk: force complete for timeout request
blk-mq: drain I/O when all CPUs in a hctx are offline
blk-mq: add blk_mq_all_tag_iter
blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
blk-mq: use BLK_MQ_NO_TAG in more places
blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
blk-mq: move more request initialization to blk_mq_rq_ctx_init
blk-mq: simplify the blk_mq_get_request calling convention
blk-mq: remove the bio argument to ->prepare_request
nvme: force complete cancelled requests
blk-mq: blk-mq: provide forced completion method
block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
block: blk-crypto-fallback: remove redundant initialization of variable err
block: reduce part_stat_lock() scope
block: use __this_cpu_add() instead of access by smp_processor_id()
...

+4514 -1493
+1
Documentation/block/index.rst
··· 14 14 cmdline-partition 15 15 data-integrity 16 16 deadline-iosched 17 + inline-encryption 17 18 ioprio 18 19 kyber-iosched 19 20 null_blk
+263
Documentation/block/inline-encryption.rst
··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 3 + ================= 4 + Inline Encryption 5 + ================= 6 + 7 + Background 8 + ========== 9 + 10 + Inline encryption hardware sits logically between memory and the disk, and can 11 + en/decrypt data as it goes in/out of the disk. Inline encryption hardware has a 12 + fixed number of "keyslots" - slots into which encryption contexts (i.e. the 13 + encryption key, encryption algorithm, data unit size) can be programmed by the 14 + kernel at any time. Each request sent to the disk can be tagged with the index 15 + of a keyslot (and also a data unit number to act as an encryption tweak), and 16 + the inline encryption hardware will en/decrypt the data in the request with the 17 + encryption context programmed into that keyslot. This is very different from 18 + full disk encryption solutions like self encrypting drives/TCG OPAL/ATA 19 + Security standards, since with inline encryption, any block on disk could be 20 + encrypted with any encryption context the kernel chooses. 21 + 22 + 23 + Objective 24 + ========= 25 + 26 + We want to support inline encryption (IE) in the kernel. 27 + To allow for testing, we also want a crypto API fallback when actual 28 + IE hardware is absent. We also want IE to work with layered devices 29 + like dm and loopback (i.e. we want to be able to use the IE hardware 30 + of the underlying devices if present, or else fall back to crypto API 31 + en/decryption). 32 + 33 + 34 + Constraints and notes 35 + ===================== 36 + 37 + - IE hardware has a limited number of "keyslots" that can be programmed 38 + with an encryption context (key, algorithm, data unit size, etc.) at any time. 39 + One can specify a keyslot in a data request made to the device, and the 40 + device will en/decrypt the data using the encryption context programmed into 41 + that specified keyslot. When possible, we want to make multiple requests with 42 + the same encryption context share the same keyslot. 
43 + 44 + - We need a way for upper layers like filesystems to specify an encryption 45 + context to use for en/decrypting a struct bio, and a device driver (like UFS) 46 + needs to be able to use that encryption context when it processes the bio. 47 + 48 + - We need a way for device drivers to expose their inline encryption 49 + capabilities in a unified way to the upper layers. 50 + 51 + 52 + Design 53 + ====== 54 + 55 + We add a :c:type:`struct bio_crypt_ctx` to :c:type:`struct bio` that can 56 + represent an encryption context, because we need to be able to pass this 57 + encryption context from the upper layers (like the fs layer) to the 58 + device driver to act upon. 59 + 60 + While IE hardware works on the notion of keyslots, the FS layer has no 61 + knowledge of keyslots - it simply wants to specify an encryption context to 62 + use while en/decrypting a bio. 63 + 64 + We introduce a keyslot manager (KSM) that handles the translation from 65 + encryption contexts specified by the FS to keyslots on the IE hardware. 66 + This KSM also serves as the way IE hardware can expose its capabilities to 67 + upper layers. The generic mode of operation is: each device driver that wants 68 + to support IE will construct a KSM and set it up in its struct request_queue. 69 + Upper layers that want to use IE on this device can then use this KSM in 70 + the device's struct request_queue to translate an encryption context into 71 + a keyslot. The presence of the KSM in the request queue shall be used to mean 72 + that the device supports IE. 73 + 74 + The KSM uses refcounts to track which keyslots are idle (either they have no 75 + encryption context programmed, or there are no in-flight struct bios 76 + referencing that keyslot). 
When a new encryption context needs a keyslot, it 77 + tries to find a keyslot that has already been programmed with the same 78 + encryption context, and if there is no such keyslot, it evicts the least 79 + recently used idle keyslot and programs the new encryption context into that 80 + one. If no idle keyslots are available, then the caller will sleep until there 81 + is at least one. 82 + 83 + 84 + blk-mq changes, other block layer changes and blk-crypto-fallback 85 + ================================================================= 86 + 87 + We add a pointer to a ``bi_crypt_context`` and ``keyslot`` to 88 + :c:type:`struct request`. These will be referred to as the ``crypto fields`` 89 + for the request. This ``keyslot`` is the keyslot into which the 90 + ``bi_crypt_context`` has been programmed in the KSM of the ``request_queue`` 91 + that this request is being sent to. 92 + 93 + We introduce ``block/blk-crypto-fallback.c``, which allows upper layers to remain 94 + blissfully unaware of whether or not real inline encryption hardware is present 95 + underneath. When a bio is submitted with a target ``request_queue`` that doesn't 96 + support the encryption context specified with the bio, the block layer will 97 + en/decrypt the bio with the blk-crypto-fallback. 98 + 99 + If the bio is a ``WRITE`` bio, a bounce bio is allocated, and the data in the bio 100 + is encrypted and stored in the bounce bio - blk-mq will then proceed to process the 101 + bounce bio as if it were not encrypted at all (except when blk-integrity is 102 + concerned). ``blk-crypto-fallback`` sets the bounce bio's ``bi_end_io`` to an 103 + internal function that cleans up the bounce bio and ends the original bio. 104 + 105 + If the bio is a ``READ`` bio, the bio's ``bi_end_io`` (and also ``bi_private``) 106 + is saved and overwritten by ``blk-crypto-fallback`` to 107 + ``bio_crypto_fallback_decrypt_bio``. 
The bio's ``bi_crypt_context`` is also 108 + overwritten with ``NULL``, so that to the rest of the stack, the bio looks 109 + as if it was a regular bio that never had an encryption context specified. 110 + ``bio_crypto_fallback_decrypt_bio`` will decrypt the bio, restore the original 111 + ``bi_end_io`` (and also ``bi_private``) and end the bio again. 112 + 113 + Regardless of whether real inline encryption hardware is used or the 114 + blk-crypto-fallback is used, the ciphertext written to disk (and hence the 115 + on-disk format of data) will be the same (assuming the hardware's implementation 116 + of the algorithm being used adheres to spec and functions correctly). 117 + 118 + If a ``request_queue``'s inline encryption hardware claimed to support the 119 + encryption context specified with a bio, then it will not be handled by the 120 + ``blk-crypto-fallback``. We will eventually reach a point in blk-mq when a 121 + :c:type:`struct request` needs to be allocated for that bio. At that point, 122 + blk-mq tries to program the encryption context into the ``request_queue``'s 123 + keyslot_manager, and obtain a keyslot, which it stores in its newly added 124 + ``keyslot`` field. This keyslot is released when the request is completed. 125 + 126 + When the first bio is added to a request, ``blk_crypto_rq_bio_prep`` is called, 127 + which sets the request's ``crypt_ctx`` to a copy of the bio's 128 + ``bi_crypt_context``. ``bio_crypt_do_front_merge`` is called whenever a subsequent 129 + bio is merged to the front of the request, which updates the ``crypt_ctx`` of 130 + the request so that it matches the newly merged bio's ``bi_crypt_context``. In particular, the request keeps a copy of the ``bi_crypt_context`` of the first 131 + bio in its bio-list (blk-mq needs to be careful to maintain this invariant 132 + during bio and request merges). 
133 + 134 + To make it possible for inline encryption to work with request queue based 135 + layered devices, when a request is cloned, its ``crypto fields`` are cloned as 136 + well. When the cloned request is submitted, blk-mq programs the 137 + ``bi_crypt_context`` of the request into the clone's request_queue's keyslot 138 + manager, and stores the returned keyslot in the clone's ``keyslot``. 139 + 140 + 141 + API presented to users of the block layer 142 + ========================================= 143 + 144 + ``struct blk_crypto_key`` represents a crypto key (the raw key, size of the 145 + key, the crypto algorithm to use, the data unit size to use, and the number of 146 + bytes required to represent data unit numbers that will be specified with the 147 + ``bi_crypt_context``). 148 + 149 + ``blk_crypto_init_key`` allows upper layers to initialize such a 150 + ``blk_crypto_key``. 151 + 152 + ``bio_crypt_set_ctx`` should be called on any bio that a user of 153 + the block layer wants en/decrypted via inline encryption (or the 154 + blk-crypto-fallback, if hardware support isn't available for the desired 155 + crypto configuration). This function takes the ``blk_crypto_key`` and the 156 + data unit number (DUN) to use when en/decrypting the bio. 157 + 158 + ``blk_crypto_config_supported`` allows upper layers to query whether or not 159 + an encryption context passed to a request queue can be handled by blk-crypto 160 + (either by real inline encryption hardware, or by the blk-crypto-fallback). 161 + This is useful e.g. when blk-crypto-fallback is disabled, and the upper layer 162 + wants to use an algorithm that may not be supported by hardware - this function 163 + lets the upper layer know ahead of time that the algorithm isn't supported, 164 + and the upper layer can fall back to something else if appropriate. 
165 + 166 + ``blk_crypto_start_using_key`` - Upper layers must call this function on a 167 + ``blk_crypto_key`` and a ``request_queue`` before using the key with any bio 168 + headed for that ``request_queue``. This function ensures that either the 169 + hardware supports the key's crypto settings, or the crypto API fallback has 170 + transforms for the needed mode allocated and ready to go. Note that this 171 + function may allocate an ``skcipher``, and must not be called from the data 172 + path, since allocating ``skciphers`` from the data path can deadlock. 173 + 174 + ``blk_crypto_evict_key`` *must* be called by upper layers before a 175 + ``blk_crypto_key`` is freed. Further, it *must* only be called once 176 + there are no more in-flight requests that use that ``blk_crypto_key``. 177 + ``blk_crypto_evict_key`` will ensure that a key is removed from any keyslots in 178 + inline encryption hardware that the key might have been programmed into (or the blk-crypto-fallback). 179 + 180 + API presented to device drivers 181 + =============================== 182 + 183 + A :c:type:`struct blk_keyslot_manager` should be set up by device drivers in 184 + the ``request_queue`` of the device. The device driver needs to call 185 + ``blk_ksm_init`` on the ``blk_keyslot_manager``, specifying the number of 186 + keyslots supported by the hardware. 187 + 188 + The device driver also needs to tell the KSM how to actually manipulate the 189 + IE hardware in the device to do things like programming the crypto key into 190 + the IE hardware into a particular keyslot. All this is achieved through the 191 + :c:type:`struct blk_ksm_ll_ops` field in the KSM that the device driver 192 + must fill in after initializing the ``blk_keyslot_manager``. 193 + 194 + The KSM also handles runtime power management for the device when applicable 195 + (e.g. 
when it wants to program a crypto key into the IE hardware, the device 196 + must be runtime powered on) - so the device driver must also set the ``dev`` 197 + field in the ksm to point to the ``struct device`` for the KSM to use for runtime 198 + power management. 199 + 200 + ``blk_ksm_reprogram_all_keys`` can be called by device drivers if the device 201 + needs each and every one of its keyslots to be reprogrammed with the key it 202 + "should have" at the point in time when the function is called. This is useful 203 + e.g. if a device loses all its keys on runtime power down/up. 204 + 205 + ``blk_ksm_destroy`` should be called to free up all resources used by a keyslot 206 + manager set up with ``blk_ksm_init``, once the ``blk_keyslot_manager`` is no longer 207 + needed. 208 + 209 + 210 + Layered Devices 211 + =============== 212 + 213 + Request queue based layered devices like dm-rq that wish to support IE need to 214 + create their own keyslot manager for their request queue, and expose whatever 215 + functionality they choose. When a layered device wants to pass a clone of that 216 + request to another ``request_queue``, blk-crypto will initialize and prepare the 217 + clone as necessary - see ``blk_crypto_insert_cloned_request`` in 218 + ``blk-crypto.c``. 219 + 220 + 221 + Future Optimizations for layered devices 222 + ======================================== 223 + 224 + Creating a keyslot manager for a layered device uses up memory for each 225 + keyslot, and in general, a layered device merely passes the request on to a 226 + "child" device, so the keyslots in the layered device itself are completely 227 + unused, and don't need any refcounting or keyslot programming. We can instead 228 + define a new type of KSM: the "passthrough KSM", which layered devices can use 229 + to advertise an unlimited number of keyslots, and support for any encryption 230 + algorithms they choose, while not actually using any memory for each keyslot. 
231 + Another use case for the "passthrough KSM" is for IE devices that do not have a 232 + limited number of keyslots. 233 + 234 + 235 + Interaction between inline encryption and blk integrity 236 + ======================================================= 237 + 238 + At the time of this patch, there is no real hardware that supports both these 239 + features. However, these features do interact with each other, and it's not 240 + completely trivial to make them both work together properly. In particular, 241 + when a WRITE bio wants to use inline encryption on a device that supports both 242 + features, the bio will have an encryption context specified, after which 243 + its integrity information is calculated (using the plaintext data, since 244 + the encryption will happen while data is being written), and the data and 245 + integrity info are sent to the device. Obviously, the integrity info must be 246 + verified before the data is encrypted. After the data is encrypted, the device 247 + must not store the integrity info that it received with the plaintext data 248 + since that might reveal information about the plaintext data. As such, it must 249 + re-generate the integrity info from the ciphertext data and store that on disk 250 + instead. Another issue with storing the integrity info of the plaintext data is 251 + that it changes the on-disk format depending on whether hardware inline 252 + encryption support is present or the kernel crypto API fallback is used (since 253 + if the fallback is used, the device will receive the integrity info of the 254 + ciphertext, not that of the plaintext). 255 + 256 + Because there isn't any real hardware yet, it seems prudent to assume that 257 + hardware implementations might not implement both features together correctly, 258 + and disallow the combination for now. 
Whenever a device supports integrity, the 259 + kernel will pretend that the device does not support hardware inline encryption 260 + (by essentially setting the keyslot manager in the request_queue of the device 261 + to NULL). When the crypto API fallback is enabled, this means that all bios with 262 + an encryption context will use the fallback, and IO will complete as usual. 263 + When the fallback is disabled, a bio with an encryption context will be failed.
+18
block/Kconfig
··· 146 146 config BLK_CGROUP_IOCOST 147 147 bool "Enable support for cost model based cgroup IO controller" 148 148 depends on BLK_CGROUP=y 149 + select BLK_RQ_IO_DATA_LEN 149 150 select BLK_RQ_ALLOC_TIME 150 151 ---help--- 151 152 Enabling this option enables the .weight interface for cost ··· 185 184 Builds Logic for interfacing with Opal enabled controllers. 186 185 Enabling this option enables users to setup/unlock/lock 187 186 Locking ranges for SED devices using the Opal protocol. 187 + 188 + config BLK_INLINE_ENCRYPTION 189 + bool "Enable inline encryption support in block layer" 190 + help 191 + Build the blk-crypto subsystem. Enabling this lets the 192 + block layer handle encryption, so users can take 193 + advantage of inline encryption hardware if present. 194 + 195 + config BLK_INLINE_ENCRYPTION_FALLBACK 196 + bool "Enable crypto API fallback for blk-crypto" 197 + depends on BLK_INLINE_ENCRYPTION 198 + select CRYPTO 199 + select CRYPTO_SKCIPHER 200 + help 201 + Enabling this lets the block layer handle inline encryption 202 + by falling back to the kernel crypto API when inline 203 + encryption hardware is not present. 188 204 189 205 menu "Partition Types" 190 206
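As a usage sketch, a kernel ``.config`` enabling both new options from the hunk above would contain the following (``BLK_INLINE_ENCRYPTION_FALLBACK`` depends on ``BLK_INLINE_ENCRYPTION``, and pulls in ``CRYPTO`` and ``CRYPTO_SKCIPHER`` via ``select``):

```
CONFIG_BLK_INLINE_ENCRYPTION=y
CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y
```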
+2
block/Makefile
··· 36 36 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o 37 37 obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o 38 38 obj-$(CONFIG_BLK_PM) += blk-pm.o 39 + obj-$(CONFIG_BLK_INLINE_ENCRYPTION) += keyslot-manager.o blk-crypto.o 40 + obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) += blk-crypto-fallback.o
+1 -1
block/bfq-iosched.c
··· 6073 6073 * comments on bfq_init_rq for the reason behind this delayed 6074 6074 * preparation. 6075 6075 */ 6076 - static void bfq_prepare_request(struct request *rq, struct bio *bio) 6076 + static void bfq_prepare_request(struct request *rq) 6077 6077 { 6078 6078 /* 6079 6079 * Regardless of whether we have an icq attached, we have to
+3
block/bio-integrity.c
··· 42 42 struct bio_set *bs = bio->bi_pool; 43 43 unsigned inline_vecs; 44 44 45 + if (WARN_ON_ONCE(bio_has_crypt_ctx(bio))) 46 + return ERR_PTR(-EOPNOTSUPP); 47 + 45 48 if (!bs || !mempool_initialized(&bs->bio_integrity_pool)) { 46 49 bip = kmalloc(struct_size(bip, bip_inline_vecs, nr_vecs), gfp_mask); 47 50 inline_vecs = nr_vecs;
+106 -78
block/bio.c
··· 18 18 #include <linux/blk-cgroup.h> 19 19 #include <linux/highmem.h> 20 20 #include <linux/sched/sysctl.h> 21 + #include <linux/blk-crypto.h> 21 22 22 23 #include <trace/events/block.h> 23 24 #include "blk.h" ··· 238 237 239 238 if (bio_integrity(bio)) 240 239 bio_integrity_free(bio); 240 + 241 + bio_crypt_free_ctx(bio); 241 242 } 242 243 EXPORT_SYMBOL(bio_uninit); 243 244 ··· 711 708 712 709 __bio_clone_fast(b, bio); 713 710 711 + bio_crypt_clone(b, bio, gfp_mask); 712 + 714 713 if (bio_integrity(bio)) { 715 714 int ret; 716 715 ··· 753 748 return true; 754 749 } 755 750 756 - static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio, 757 - struct page *page, unsigned len, unsigned offset, 758 - bool *same_page) 751 + /* 752 + * Try to merge a page into a segment, while obeying the hardware segment 753 + * size limit. This is not for normal read/write bios, but for passthrough 754 + * or Zone Append operations that we can't split. 755 + */ 756 + static bool bio_try_merge_hw_seg(struct request_queue *q, struct bio *bio, 757 + struct page *page, unsigned len, 758 + unsigned offset, bool *same_page) 759 759 { 760 760 struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; 761 761 unsigned long mask = queue_segment_boundary(q); ··· 775 765 } 776 766 777 767 /** 778 - * __bio_add_pc_page - attempt to add page to passthrough bio 779 - * @q: the target queue 780 - * @bio: destination bio 781 - * @page: page to add 782 - * @len: vec entry length 783 - * @offset: vec entry offset 784 - * @same_page: return if the merge happen inside the same page 768 + * bio_add_hw_page - attempt to add a page to a bio with hw constraints 769 + * @q: the target queue 770 + * @bio: destination bio 771 + * @page: page to add 772 + * @len: vec entry length 773 + * @offset: vec entry offset 774 + * @max_sectors: maximum number of sectors that can be added 775 + * @same_page: return if the segment has been merged inside the same page 785 776 * 786 - * Attempt to add a page 
to the bio_vec maplist. This can fail for a 787 - * number of reasons, such as the bio being full or target block device 788 - * limitations. The target block device must allow bio's up to PAGE_SIZE, 789 - * so it is always possible to add a single page to an empty bio. 790 - * 791 - * This should only be used by passthrough bios. 777 + * Add a page to a bio while respecting the hardware max_sectors, max_segment 778 + * and gap limitations. 792 779 */ 793 - int __bio_add_pc_page(struct request_queue *q, struct bio *bio, 780 + int bio_add_hw_page(struct request_queue *q, struct bio *bio, 794 781 struct page *page, unsigned int len, unsigned int offset, 795 - bool *same_page) 782 + unsigned int max_sectors, bool *same_page) 796 783 { 797 784 struct bio_vec *bvec; 798 785 799 - /* 800 - * cloned bio must not modify vec list 801 - */ 802 - if (unlikely(bio_flagged(bio, BIO_CLONED))) 786 + if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) 803 787 return 0; 804 788 805 - if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q)) 789 + if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors) 806 790 return 0; 807 791 808 792 if (bio->bi_vcnt > 0) { 809 - if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page)) 793 + if (bio_try_merge_hw_seg(q, bio, page, len, offset, same_page)) 810 794 return len; 811 795 812 796 /* ··· 827 823 return len; 828 824 } 829 825 826 + /** 827 + * bio_add_pc_page - attempt to add page to passthrough bio 828 + * @q: the target queue 829 + * @bio: destination bio 830 + * @page: page to add 831 + * @len: vec entry length 832 + * @offset: vec entry offset 833 + * 834 + * Attempt to add a page to the bio_vec maplist. This can fail for a 835 + * number of reasons, such as the bio being full or target block device 836 + * limitations. The target block device must allow bio's up to PAGE_SIZE, 837 + * so it is always possible to add a single page to an empty bio. 838 + * 839 + * This should only be used by passthrough bios. 
840 + */ 830 841 int bio_add_pc_page(struct request_queue *q, struct bio *bio, 831 842 struct page *page, unsigned int len, unsigned int offset) 832 843 { 833 844 bool same_page = false; 834 - return __bio_add_pc_page(q, bio, page, len, offset, &same_page); 845 + return bio_add_hw_page(q, bio, page, len, offset, 846 + queue_max_hw_sectors(q), &same_page); 835 847 } 836 848 EXPORT_SYMBOL(bio_add_pc_page); 837 849 ··· 956 936 put_page(bvec->bv_page); 957 937 } 958 938 } 939 + EXPORT_SYMBOL_GPL(bio_release_pages); 959 940 960 941 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) 961 942 { ··· 1031 1010 return 0; 1032 1011 } 1033 1012 1013 + static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter) 1014 + { 1015 + unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt; 1016 + unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt; 1017 + struct request_queue *q = bio->bi_disk->queue; 1018 + unsigned int max_append_sectors = queue_max_zone_append_sectors(q); 1019 + struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt; 1020 + struct page **pages = (struct page **)bv; 1021 + ssize_t size, left; 1022 + unsigned len, i; 1023 + size_t offset; 1024 + 1025 + if (WARN_ON_ONCE(!max_append_sectors)) 1026 + return 0; 1027 + 1028 + /* 1029 + * Move page array up in the allocated memory for the bio vecs as far as 1030 + * possible so that we can start filling biovecs from the beginning 1031 + * without overwriting the temporary page array. 1032 + */ 1033 + BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2); 1034 + pages += entries_left * (PAGE_PTRS_PER_BVEC - 1); 1035 + 1036 + size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset); 1037 + if (unlikely(size <= 0)) 1038 + return size ? 
size : -EFAULT; 1039 + 1040 + for (left = size, i = 0; left > 0; left -= len, i++) { 1041 + struct page *page = pages[i]; 1042 + bool same_page = false; 1043 + 1044 + len = min_t(size_t, PAGE_SIZE - offset, left); 1045 + if (bio_add_hw_page(q, bio, page, len, offset, 1046 + max_append_sectors, &same_page) != len) 1047 + return -EINVAL; 1048 + if (same_page) 1049 + put_page(page); 1050 + offset = 0; 1051 + } 1052 + 1053 + iov_iter_advance(iter, size); 1054 + return 0; 1055 + } 1056 + 1034 1057 /** 1035 1058 * bio_iov_iter_get_pages - add user or kernel pages to a bio 1036 1059 * @bio: bio to add pages to ··· 1104 1039 return -EINVAL; 1105 1040 1106 1041 do { 1107 - if (is_bvec) 1108 - ret = __bio_iov_bvec_add_pages(bio, iter); 1109 - else 1110 - ret = __bio_iov_iter_get_pages(bio, iter); 1042 + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { 1043 + if (WARN_ON_ONCE(is_bvec)) 1044 + return -EINVAL; 1045 + ret = __bio_iov_append_get_pages(bio, iter); 1046 + } else { 1047 + if (is_bvec) 1048 + ret = __bio_iov_bvec_add_pages(bio, iter); 1049 + else 1050 + ret = __bio_iov_iter_get_pages(bio, iter); 1051 + } 1111 1052 } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0)); 1112 1053 1113 1054 if (is_bvec) 1114 1055 bio_set_flag(bio, BIO_NO_PAGE_REF); 1115 1056 return bio->bi_vcnt ? 
0 : ret; 1116 1057 } 1058 + EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages); 1117 1059 1118 1060 static void submit_bio_wait_endio(struct bio *bio) 1119 1061 { ··· 1177 1105 if (bio_integrity(bio)) 1178 1106 bio_integrity_advance(bio, bytes); 1179 1107 1108 + bio_crypt_advance(bio, bytes); 1180 1109 bio_advance_iter(bio, &bio->bi_iter, bytes); 1181 1110 } 1182 1111 EXPORT_SYMBOL(bio_advance); ··· 1376 1303 schedule_work(&bio_dirty_work); 1377 1304 } 1378 1305 1379 - void update_io_ticks(struct hd_struct *part, unsigned long now, bool end) 1380 - { 1381 - unsigned long stamp; 1382 - again: 1383 - stamp = READ_ONCE(part->stamp); 1384 - if (unlikely(stamp != now)) { 1385 - if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) { 1386 - __part_stat_add(part, io_ticks, end ? now - stamp : 1); 1387 - } 1388 - } 1389 - if (part->partno) { 1390 - part = &part_to_disk(part)->part0; 1391 - goto again; 1392 - } 1393 - } 1394 - 1395 - void generic_start_io_acct(struct request_queue *q, int op, 1396 - unsigned long sectors, struct hd_struct *part) 1397 - { 1398 - const int sgrp = op_stat_group(op); 1399 - 1400 - part_stat_lock(); 1401 - 1402 - update_io_ticks(part, jiffies, false); 1403 - part_stat_inc(part, ios[sgrp]); 1404 - part_stat_add(part, sectors[sgrp], sectors); 1405 - part_inc_in_flight(q, part, op_is_write(op)); 1406 - 1407 - part_stat_unlock(); 1408 - } 1409 - EXPORT_SYMBOL(generic_start_io_acct); 1410 - 1411 - void generic_end_io_acct(struct request_queue *q, int req_op, 1412 - struct hd_struct *part, unsigned long start_time) 1413 - { 1414 - unsigned long now = jiffies; 1415 - unsigned long duration = now - start_time; 1416 - const int sgrp = op_stat_group(req_op); 1417 - 1418 - part_stat_lock(); 1419 - 1420 - update_io_ticks(part, now, true); 1421 - part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration)); 1422 - part_dec_in_flight(q, part, op_is_write(req_op)); 1423 - 1424 - part_stat_unlock(); 1425 - } 1426 - EXPORT_SYMBOL(generic_end_io_acct); 1427 - 1428 1306 
static inline bool bio_remaining_done(struct bio *bio) 1429 1307 { 1430 1308 /* ··· 1468 1444 1469 1445 BUG_ON(sectors <= 0); 1470 1446 BUG_ON(sectors >= bio_sectors(bio)); 1447 + 1448 + /* Zone append commands cannot be split */ 1449 + if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND)) 1450 + return NULL; 1471 1451 1472 1452 split = bio_clone_fast(bio, gfp, bs); 1473 1453 if (!split)
+6
block/blk-cgroup.c
··· 1530 1530 { 1531 1531 u64 old = atomic64_read(&blkg->delay_start); 1532 1532 1533 + /* negative use_delay means no scaling, see blkcg_set_delay() */ 1534 + if (atomic_read(&blkg->use_delay) < 0) 1535 + return; 1536 + 1533 1537 /* 1534 1538 * We only want to scale down every second. The idea here is that we 1535 1539 * want to delay people for min(delay_nsec, NSEC_PER_SEC) in a certain ··· 1721 1717 */ 1722 1718 void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta) 1723 1719 { 1720 + if (WARN_ON_ONCE(atomic_read(&blkg->use_delay) < 0)) 1721 + return; 1724 1722 blkcg_scale_delay(blkg, now); 1725 1723 atomic64_add(delta, &blkg->delay_nsec); 1726 1724 }
+224 -111
block/blk-core.c
··· 39 39 #include <linux/debugfs.h> 40 40 #include <linux/bpf.h> 41 41 #include <linux/psi.h> 42 + #include <linux/sched/sysctl.h> 43 + #include <linux/blk-crypto.h> 42 44 43 45 #define CREATE_TRACE_POINTS 44 46 #include <trace/events/block.h> ··· 123 121 rq->start_time_ns = ktime_get_ns(); 124 122 rq->part = NULL; 125 123 refcount_set(&rq->ref, 1); 124 + blk_crypto_rq_set_defaults(rq); 126 125 } 127 126 EXPORT_SYMBOL(blk_rq_init); 128 127 ··· 139 136 REQ_OP_NAME(ZONE_OPEN), 140 137 REQ_OP_NAME(ZONE_CLOSE), 141 138 REQ_OP_NAME(ZONE_FINISH), 139 + REQ_OP_NAME(ZONE_APPEND), 142 140 REQ_OP_NAME(WRITE_SAME), 143 141 REQ_OP_NAME(WRITE_ZEROES), 144 142 REQ_OP_NAME(SCSI_IN), ··· 244 240 bio_set_flag(bio, BIO_QUIET); 245 241 246 242 bio_advance(bio, nbytes); 243 + 244 + if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) { 245 + /* 246 + * Partial zone append completions cannot be supported as the 247 + * BIO fragments may end up not being written sequentially. 248 + */ 249 + if (bio->bi_iter.bi_size) 250 + bio->bi_status = BLK_STS_IOERR; 251 + else 252 + bio->bi_iter.bi_sector = rq->__sector; 253 + } 247 254 248 255 /* don't actually finish bio if it's part of flush sequence */ 249 256 if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ)) ··· 456 441 } 457 442 } 458 443 444 + static inline int bio_queue_enter(struct bio *bio) 445 + { 446 + struct request_queue *q = bio->bi_disk->queue; 447 + bool nowait = bio->bi_opf & REQ_NOWAIT; 448 + int ret; 449 + 450 + ret = blk_queue_enter(q, nowait ? 
BLK_MQ_REQ_NOWAIT : 0); 451 + if (unlikely(ret)) { 452 + if (nowait && !blk_queue_dying(q)) 453 + bio_wouldblock_error(bio); 454 + else 455 + bio_io_error(bio); 456 + } 457 + 458 + return ret; 459 + } 460 + 459 461 void blk_queue_exit(struct request_queue *q) 460 462 { 461 463 percpu_ref_put(&q->q_usage_counter); ··· 517 485 if (ret) 518 486 goto fail_id; 519 487 520 - q->backing_dev_info = bdi_alloc_node(GFP_KERNEL, node_id); 488 + q->backing_dev_info = bdi_alloc(node_id); 521 489 if (!q->backing_dev_info) 522 490 goto fail_split; 523 491 ··· 527 495 528 496 q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES; 529 497 q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK; 530 - q->backing_dev_info->name = "block"; 531 498 q->node = node_id; 532 499 533 500 timer_setup(&q->backing_dev_info->laptop_mode_wb_timer, ··· 637 606 } 638 607 EXPORT_SYMBOL(blk_put_request); 639 608 609 + static void blk_account_io_merge_bio(struct request *req) 610 + { 611 + if (!blk_do_io_stat(req)) 612 + return; 613 + 614 + part_stat_lock(); 615 + part_stat_inc(req->part, merges[op_stat_group(req_op(req))]); 616 + part_stat_unlock(); 617 + } 618 + 640 619 bool bio_attempt_back_merge(struct request *req, struct bio *bio, 641 620 unsigned int nr_segs) 642 621 { ··· 665 624 req->biotail = bio; 666 625 req->__data_len += bio->bi_iter.bi_size; 667 626 668 - blk_account_io_start(req, false); 627 + bio_crypt_free_ctx(bio); 628 + 629 + blk_account_io_merge_bio(req); 669 630 return true; 670 631 } 671 632 ··· 691 648 req->__sector = bio->bi_iter.bi_sector; 692 649 req->__data_len += bio->bi_iter.bi_size; 693 650 694 - blk_account_io_start(req, false); 651 + bio_crypt_do_front_merge(req, bio); 652 + 653 + blk_account_io_merge_bio(req); 695 654 return true; 696 655 } 697 656 ··· 715 670 req->__data_len += bio->bi_iter.bi_size; 716 671 req->nr_phys_segments = segments + 1; 717 672 718 - blk_account_io_start(req, false); 673 + blk_account_io_merge_bio(req); 719 674 return true; 720 675 no_merge: 
721 676 req_set_nomerge(q, req); ··· 917 872 return ret; 918 873 } 919 874 875 + /* 876 + * Check write append to a zoned block device. 877 + */ 878 + static inline blk_status_t blk_check_zone_append(struct request_queue *q, 879 + struct bio *bio) 880 + { 881 + sector_t pos = bio->bi_iter.bi_sector; 882 + int nr_sectors = bio_sectors(bio); 883 + 884 + /* Only applicable to zoned block devices */ 885 + if (!blk_queue_is_zoned(q)) 886 + return BLK_STS_NOTSUPP; 887 + 888 + /* The bio sector must point to the start of a sequential zone */ 889 + if (pos & (blk_queue_zone_sectors(q) - 1) || 890 + !blk_queue_zone_is_seq(q, pos)) 891 + return BLK_STS_IOERR; 892 + 893 + /* 894 + * Not allowed to cross zone boundaries. Otherwise, the BIO will be 895 + * split and could result in non-contiguous sectors being written in 896 + * different zones. 897 + */ 898 + if (nr_sectors > q->limits.chunk_sectors) 899 + return BLK_STS_IOERR; 900 + 901 + /* Make sure the BIO is small enough and will not get split */ 902 + if (nr_sectors > q->limits.max_zone_append_sectors) 903 + return BLK_STS_IOERR; 904 + 905 + bio->bi_opf |= REQ_NOMERGE; 906 + 907 + return BLK_STS_OK; 908 + } 909 + 920 910 static noinline_for_stack bool 921 911 generic_make_request_checks(struct bio *bio) 922 912 { ··· 1021 941 if (!q->limits.max_write_same_sectors) 1022 942 goto not_supported; 1023 943 break; 944 + case REQ_OP_ZONE_APPEND: 945 + status = blk_check_zone_append(q, bio); 946 + if (status != BLK_STS_OK) 947 + goto end_io; 948 + break; 1024 949 case REQ_OP_ZONE_RESET: 1025 950 case REQ_OP_ZONE_OPEN: 1026 951 case REQ_OP_ZONE_CLOSE: ··· 1046 961 } 1047 962 1048 963 /* 1049 - * Various block parts want %current->io_context and lazy ioc 1050 - * allocation ends up trading a lot of pain for a small amount of 1051 - * memory. Just allocate it upfront. This may fail and block 1052 - * layer knows how to live with it. 
964 + * Various block parts want %current->io_context, so allocate it up 965 + * front rather than dealing with lots of pain to allocate it only 966 + * where needed. This may fail and the block layer knows how to live 967 + * with it. 1053 968 */ 1054 - create_io_context(GFP_ATOMIC, q->node); 969 + if (unlikely(!current->io_context)) 970 + create_task_io_context(current, GFP_ATOMIC, q->node); 1055 971 1056 972 if (!blkcg_bio_issue_check(q, bio)) 1057 973 return false; ··· 1074 988 return false; 1075 989 } 1076 990 991 + static blk_qc_t do_make_request(struct bio *bio) 992 + { 993 + struct request_queue *q = bio->bi_disk->queue; 994 + blk_qc_t ret = BLK_QC_T_NONE; 995 + 996 + if (blk_crypto_bio_prep(&bio)) { 997 + if (!q->make_request_fn) 998 + return blk_mq_make_request(q, bio); 999 + ret = q->make_request_fn(q, bio); 1000 + } 1001 + blk_queue_exit(q); 1002 + return ret; 1003 + } 1004 + 1077 1005 /** 1078 - * generic_make_request - hand a buffer to its device driver for I/O 1006 + * generic_make_request - re-submit a bio to the block device layer for I/O 1079 1007 * @bio: The bio describing the location in memory and on the device. 1080 1008 * 1081 - * generic_make_request() is used to make I/O requests of block 1082 - * devices. It is passed a &struct bio, which describes the I/O that needs 1083 - * to be done. 1084 - * 1085 - * generic_make_request() does not return any status. The 1086 - * success/failure status of the request, along with notification of 1087 - * completion, is delivered asynchronously through the bio->bi_end_io 1088 - * function described (one day) else where. 1089 - * 1090 - * The caller of generic_make_request must make sure that bi_io_vec 1091 - * are set to describe the memory buffer, and that bi_dev and bi_sector are 1092 - * set to describe the device address, and the 1093 - * bi_end_io and optionally bi_private are set to describe how 1094 - * completion notification should be signaled. 
1095 - * 1096 - * generic_make_request and the drivers it calls may use bi_next if this 1097 - * bio happens to be merged with someone else, and may resubmit the bio to 1098 - * a lower device by calling into generic_make_request recursively, which 1099 - * means the bio should NOT be touched after the call to ->make_request_fn. 1009 + * This is a version of submit_bio() that shall only be used for I/O that is 1010 + * resubmitted to lower level drivers by stacking block drivers. All file 1011 + * systems and other upper level users of the block layer should use 1012 + * submit_bio() instead. 1100 1013 */ 1101 1014 blk_qc_t generic_make_request(struct bio *bio) 1102 1015 { ··· 1146 1061 current->bio_list = bio_list_on_stack; 1147 1062 do { 1148 1063 struct request_queue *q = bio->bi_disk->queue; 1149 - blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ? 1150 - BLK_MQ_REQ_NOWAIT : 0; 1151 1064 1152 - if (likely(blk_queue_enter(q, flags) == 0)) { 1065 + if (likely(bio_queue_enter(bio) == 0)) { 1153 1066 struct bio_list lower, same; 1154 1067 1155 1068 /* Create a fresh bio_list for all subordinate requests */ 1156 1069 bio_list_on_stack[1] = bio_list_on_stack[0]; 1157 1070 bio_list_init(&bio_list_on_stack[0]); 1158 - ret = q->make_request_fn(q, bio); 1159 - 1160 - blk_queue_exit(q); 1071 + ret = do_make_request(bio); 1161 1072 1162 1073 /* sort new bios into those for a lower level 1163 1074 * and those for the same level ··· 1169 1088 bio_list_merge(&bio_list_on_stack[0], &lower); 1170 1089 bio_list_merge(&bio_list_on_stack[0], &same); 1171 1090 bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); 1172 - } else { 1173 - if (unlikely(!blk_queue_dying(q) && 1174 - (bio->bi_opf & REQ_NOWAIT))) 1175 - bio_wouldblock_error(bio); 1176 - else 1177 - bio_io_error(bio); 1178 1091 } 1179 1092 bio = bio_list_pop(&bio_list_on_stack[0]); 1180 1093 } while (bio); ··· 1185 1110 * 1186 1111 * This function behaves like generic_make_request(), but does not protect 1187 
1112 * against recursion. Must only be used if the called driver is known 1188 - * to not call generic_make_request (or direct_make_request) again from 1189 - * its make_request function. (Calling direct_make_request again from 1190 - * a workqueue is perfectly fine as that doesn't recurse). 1113 + * to be blk-mq based. 1191 1114 */ 1192 1115 blk_qc_t direct_make_request(struct bio *bio) 1193 1116 { 1194 1117 struct request_queue *q = bio->bi_disk->queue; 1195 - bool nowait = bio->bi_opf & REQ_NOWAIT; 1196 - blk_qc_t ret; 1197 1118 1198 - if (!generic_make_request_checks(bio)) 1199 - return BLK_QC_T_NONE; 1200 - 1201 - if (unlikely(blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0))) { 1202 - if (nowait && !blk_queue_dying(q)) 1203 - bio_wouldblock_error(bio); 1204 - else 1205 - bio_io_error(bio); 1119 + if (WARN_ON_ONCE(q->make_request_fn)) { 1120 + bio_io_error(bio); 1206 1121 return BLK_QC_T_NONE; 1207 1122 } 1208 - 1209 - ret = q->make_request_fn(q, bio); 1210 - blk_queue_exit(q); 1211 - return ret; 1123 + if (!generic_make_request_checks(bio)) 1124 + return BLK_QC_T_NONE; 1125 + if (unlikely(bio_queue_enter(bio))) 1126 + return BLK_QC_T_NONE; 1127 + if (!blk_crypto_bio_prep(&bio)) { 1128 + blk_queue_exit(q); 1129 + return BLK_QC_T_NONE; 1130 + } 1131 + return blk_mq_make_request(q, bio); 1212 1132 } 1213 1133 EXPORT_SYMBOL_GPL(direct_make_request); 1214 1134 ··· 1211 1141 * submit_bio - submit a bio to the block device layer for I/O 1212 1142 * @bio: The &struct bio which describes the I/O 1213 1143 * 1214 - * submit_bio() is very similar in purpose to generic_make_request(), and 1215 - * uses that function to do most of the work. Both are fairly rough 1216 - * interfaces; @bio must be presetup and ready for I/O. 1144 + * submit_bio() is used to submit I/O requests to block devices. It is passed a 1145 + * fully set up &struct bio that describes the I/O that needs to be done. 
The 1146 + * bio will be sent to the device described by the bi_disk and bi_partno fields. 1217 1147 * 1148 + * The success/failure status of the request, along with notification of 1149 + * completion, is delivered asynchronously through the ->bi_end_io() callback 1150 + * in @bio. The bio must NOT be touched by the caller until ->bi_end_io() has 1151 + * been called. 1218 1152 */ 1219 1153 blk_qc_t submit_bio(struct bio *bio) 1220 1154 { 1221 - bool workingset_read = false; 1222 - unsigned long pflags; 1223 - blk_qc_t ret; 1224 - 1225 1155 if (blkcg_punt_bio_submit(bio)) 1226 1156 return BLK_QC_T_NONE; 1227 1157 ··· 1240 1170 if (op_is_write(bio_op(bio))) { 1241 1171 count_vm_events(PGPGOUT, count); 1242 1172 } else { 1243 - if (bio_flagged(bio, BIO_WORKINGSET)) 1244 - workingset_read = true; 1245 1173 task_io_account_read(bio->bi_iter.bi_size); 1246 1174 count_vm_events(PGPGIN, count); 1247 1175 } ··· 1255 1187 } 1256 1188 1257 1189 /* 1258 - * If we're reading data that is part of the userspace 1259 - * workingset, count submission time as memory stall. When the 1260 - * device is congested, or the submitting cgroup IO-throttled, 1261 - * submission can be a significant part of overall IO time. 1190 + * If we're reading data that is part of the userspace workingset, count 1191 + * submission time as memory stall. When the device is congested, or 1192 + * the submitting cgroup IO-throttled, submission can be a significant 1193 + * part of overall IO time.
1262 1194 */ 1263 - if (workingset_read) 1195 + if (unlikely(bio_op(bio) == REQ_OP_READ && 1196 + bio_flagged(bio, BIO_WORKINGSET))) { 1197 + unsigned long pflags; 1198 + blk_qc_t ret; 1199 + 1264 1200 psi_memstall_enter(&pflags); 1265 - 1266 - ret = generic_make_request(bio); 1267 - 1268 - if (workingset_read) 1201 + ret = generic_make_request(bio); 1269 1202 psi_memstall_leave(&pflags); 1270 1203 1271 - return ret; 1204 + return ret; 1205 + } 1206 + 1207 + return generic_make_request(bio); 1272 1208 } 1273 1209 EXPORT_SYMBOL(submit_bio); 1274 1210 ··· 1333 1261 should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq))) 1334 1262 return BLK_STS_IOERR; 1335 1263 1264 + if (blk_crypto_insert_cloned_request(rq)) 1265 + return BLK_STS_IOERR; 1266 + 1336 1267 if (blk_queue_io_stat(q)) 1337 - blk_account_io_start(rq, true); 1268 + blk_account_io_start(rq); 1338 1269 1339 1270 /* 1340 1271 * Since we have a scheduler attached on the top device, ··· 1389 1314 } 1390 1315 EXPORT_SYMBOL_GPL(blk_rq_err_bytes); 1391 1316 1392 - void blk_account_io_completion(struct request *req, unsigned int bytes) 1317 + static void update_io_ticks(struct hd_struct *part, unsigned long now, bool end) 1318 + { 1319 + unsigned long stamp; 1320 + again: 1321 + stamp = READ_ONCE(part->stamp); 1322 + if (unlikely(stamp != now)) { 1323 + if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) 1324 + __part_stat_add(part, io_ticks, end ? 
now - stamp : 1); 1325 + } 1326 + if (part->partno) { 1327 + part = &part_to_disk(part)->part0; 1328 + goto again; 1329 + } 1330 + } 1331 + 1332 + static void blk_account_io_completion(struct request *req, unsigned int bytes) 1393 1333 { 1394 1334 if (req->part && blk_do_io_stat(req)) { 1395 1335 const int sgrp = op_stat_group(req_op(req)); ··· 1435 1345 update_io_ticks(part, jiffies, true); 1436 1346 part_stat_inc(part, ios[sgrp]); 1437 1347 part_stat_add(part, nsecs[sgrp], now - req->start_time_ns); 1438 - part_dec_in_flight(req->q, part, rq_data_dir(req)); 1348 + part_stat_unlock(); 1439 1349 1440 1350 hd_struct_put(part); 1441 - part_stat_unlock(); 1442 1351 } 1443 1352 } 1444 1353 1445 - void blk_account_io_start(struct request *rq, bool new_io) 1354 + void blk_account_io_start(struct request *rq) 1446 1355 { 1447 - struct hd_struct *part; 1448 - int rw = rq_data_dir(rq); 1449 - 1450 1356 if (!blk_do_io_stat(rq)) 1451 1357 return; 1452 1358 1359 + rq->part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); 1360 + 1453 1361 part_stat_lock(); 1454 - 1455 - if (!new_io) { 1456 - part = rq->part; 1457 - part_stat_inc(part, merges[rw]); 1458 - } else { 1459 - part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq)); 1460 - if (!hd_struct_try_get(part)) { 1461 - /* 1462 - * The partition is already being removed, 1463 - * the request will be accounted on the disk only 1464 - * 1465 - * We take a reference on disk->part0 although that 1466 - * partition will never be deleted, so we can treat 1467 - * it as any other partition. 
1468 - */ 1469 - part = &rq->rq_disk->part0; 1470 - hd_struct_get(part); 1471 - } 1472 - part_inc_in_flight(rq->q, part, rw); 1473 - rq->part = part; 1474 - } 1475 - 1476 - update_io_ticks(part, jiffies, false); 1477 - 1362 + update_io_ticks(rq->part, jiffies, false); 1478 1363 part_stat_unlock(); 1479 1364 } 1365 + 1366 + unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors, 1367 + unsigned int op) 1368 + { 1369 + struct hd_struct *part = &disk->part0; 1370 + const int sgrp = op_stat_group(op); 1371 + unsigned long now = READ_ONCE(jiffies); 1372 + 1373 + part_stat_lock(); 1374 + update_io_ticks(part, now, false); 1375 + part_stat_inc(part, ios[sgrp]); 1376 + part_stat_add(part, sectors[sgrp], sectors); 1377 + part_stat_local_inc(part, in_flight[op_is_write(op)]); 1378 + part_stat_unlock(); 1379 + 1380 + return now; 1381 + } 1382 + EXPORT_SYMBOL(disk_start_io_acct); 1383 + 1384 + void disk_end_io_acct(struct gendisk *disk, unsigned int op, 1385 + unsigned long start_time) 1386 + { 1387 + struct hd_struct *part = &disk->part0; 1388 + const int sgrp = op_stat_group(op); 1389 + unsigned long now = READ_ONCE(jiffies); 1390 + unsigned long duration = now - start_time; 1391 + 1392 + part_stat_lock(); 1393 + update_io_ticks(part, now, true); 1394 + part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration)); 1395 + part_stat_local_dec(part, in_flight[op_is_write(op)]); 1396 + part_stat_unlock(); 1397 + } 1398 + EXPORT_SYMBOL(disk_end_io_acct); 1480 1399 1481 1400 /* 1482 1401 * Steal bios from a request and add them to a bio list. 
··· 1735 1636 } 1736 1637 rq->nr_phys_segments = rq_src->nr_phys_segments; 1737 1638 rq->ioprio = rq_src->ioprio; 1738 - rq->extra_len = rq_src->extra_len; 1639 + 1640 + if (rq->bio) 1641 + blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask); 1739 1642 1740 1643 return 0; 1741 1644 ··· 1878 1777 current->plug = NULL; 1879 1778 } 1880 1779 EXPORT_SYMBOL(blk_finish_plug); 1780 + 1781 + void blk_io_schedule(void) 1782 + { 1783 + /* Prevent hang_check timer from firing at us during very long I/O */ 1784 + unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2; 1785 + 1786 + if (timeout) 1787 + io_schedule_timeout(timeout); 1788 + else 1789 + io_schedule(); 1790 + } 1791 + EXPORT_SYMBOL_GPL(blk_io_schedule); 1881 1792 1882 1793 int __init blk_dev_init(void) 1883 1794 {
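The new update_io_ticks() above accounts device busy time with a lockless compare-and-swap on a per-partition timestamp, so concurrent submissions and completions charge at most one tick per jiffy. A minimal standalone userspace sketch of that pattern, using C11 atomics in place of the kernel's cmpxchg() and with a hypothetical `part_stats` struct standing in for `struct hd_struct`:

```c
/* Userspace sketch of the update_io_ticks() CAS pattern; not kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct part_stats {
	_Atomic unsigned long stamp;    /* last jiffy that was accounted */
	_Atomic unsigned long io_ticks; /* accumulated busy ticks */
};

static void update_io_ticks(struct part_stats *p, unsigned long now, bool end)
{
	unsigned long stamp = atomic_load(&p->stamp);

	if (stamp != now) {
		/*
		 * Only the CAS winner charges this interval; losers simply
		 * skip, so io_ticks is bumped at most once per tick even
		 * under concurrent callers.
		 */
		if (atomic_compare_exchange_strong(&p->stamp, &stamp, now))
			atomic_fetch_add(&p->io_ticks, end ? now - stamp : 1);
	}
}
```

As in the kernel version, an I/O start charges a single tick while a completion charges the whole elapsed interval since the last stamp; a failed CAS is not retried.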
+657
block/blk-crypto-fallback.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright 2019 Google LLC 4 + */ 5 + 6 + /* 7 + * Refer to Documentation/block/inline-encryption.rst for detailed explanation. 8 + */ 9 + 10 + #define pr_fmt(fmt) "blk-crypto-fallback: " fmt 11 + 12 + #include <crypto/skcipher.h> 13 + #include <linux/blk-cgroup.h> 14 + #include <linux/blk-crypto.h> 15 + #include <linux/blkdev.h> 16 + #include <linux/crypto.h> 17 + #include <linux/keyslot-manager.h> 18 + #include <linux/mempool.h> 19 + #include <linux/module.h> 20 + #include <linux/random.h> 21 + 22 + #include "blk-crypto-internal.h" 23 + 24 + static unsigned int num_prealloc_bounce_pg = 32; 25 + module_param(num_prealloc_bounce_pg, uint, 0); 26 + MODULE_PARM_DESC(num_prealloc_bounce_pg, 27 + "Number of preallocated bounce pages for the blk-crypto crypto API fallback"); 28 + 29 + static unsigned int blk_crypto_num_keyslots = 100; 30 + module_param_named(num_keyslots, blk_crypto_num_keyslots, uint, 0); 31 + MODULE_PARM_DESC(num_keyslots, 32 + "Number of keyslots for the blk-crypto crypto API fallback"); 33 + 34 + static unsigned int num_prealloc_fallback_crypt_ctxs = 128; 35 + module_param(num_prealloc_fallback_crypt_ctxs, uint, 0); 36 + MODULE_PARM_DESC(num_prealloc_fallback_crypt_ctxs, 37 + "Number of preallocated bio fallback crypto contexts for blk-crypto to use during crypto API fallback"); 38 + 39 + struct bio_fallback_crypt_ctx { 40 + struct bio_crypt_ctx crypt_ctx; 41 + /* 42 + * Copy of the bvec_iter when this bio was submitted.
43 + * We only want to en/decrypt the part of the bio as described by the 44 + * bvec_iter upon submission because bio might be split before being 45 + * resubmitted 46 + */ 47 + struct bvec_iter crypt_iter; 48 + union { 49 + struct { 50 + struct work_struct work; 51 + struct bio *bio; 52 + }; 53 + struct { 54 + void *bi_private_orig; 55 + bio_end_io_t *bi_end_io_orig; 56 + }; 57 + }; 58 + }; 59 + 60 + static struct kmem_cache *bio_fallback_crypt_ctx_cache; 61 + static mempool_t *bio_fallback_crypt_ctx_pool; 62 + 63 + /* 64 + * Allocating a crypto tfm during I/O can deadlock, so we have to preallocate 65 + * all of a mode's tfms when that mode starts being used. Since each mode may 66 + * need all the keyslots at some point, each mode needs its own tfm for each 67 + * keyslot; thus, a keyslot may contain tfms for multiple modes. However, to 68 + * match the behavior of real inline encryption hardware (which only supports a 69 + * single encryption context per keyslot), we only allow one tfm per keyslot to 70 + * be used at a time - the rest of the unused tfms have their keys cleared. 71 + */ 72 + static DEFINE_MUTEX(tfms_init_lock); 73 + static bool tfms_inited[BLK_ENCRYPTION_MODE_MAX]; 74 + 75 + static struct blk_crypto_keyslot { 76 + enum blk_crypto_mode_num crypto_mode; 77 + struct crypto_skcipher *tfms[BLK_ENCRYPTION_MODE_MAX]; 78 + } *blk_crypto_keyslots; 79 + 80 + static struct blk_keyslot_manager blk_crypto_ksm; 81 + static struct workqueue_struct *blk_crypto_wq; 82 + static mempool_t *blk_crypto_bounce_page_pool; 83 + 84 + /* 85 + * This is the key we set when evicting a keyslot. This *should* be the all 0's 86 + * key, but AES-XTS rejects that key, so we use some random bytes instead. 
87 + */ 88 + static u8 blank_key[BLK_CRYPTO_MAX_KEY_SIZE]; 89 + 90 + static void blk_crypto_evict_keyslot(unsigned int slot) 91 + { 92 + struct blk_crypto_keyslot *slotp = &blk_crypto_keyslots[slot]; 93 + enum blk_crypto_mode_num crypto_mode = slotp->crypto_mode; 94 + int err; 95 + 96 + WARN_ON(slotp->crypto_mode == BLK_ENCRYPTION_MODE_INVALID); 97 + 98 + /* Clear the key in the skcipher */ 99 + err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], blank_key, 100 + blk_crypto_modes[crypto_mode].keysize); 101 + WARN_ON(err); 102 + slotp->crypto_mode = BLK_ENCRYPTION_MODE_INVALID; 103 + } 104 + 105 + static int blk_crypto_keyslot_program(struct blk_keyslot_manager *ksm, 106 + const struct blk_crypto_key *key, 107 + unsigned int slot) 108 + { 109 + struct blk_crypto_keyslot *slotp = &blk_crypto_keyslots[slot]; 110 + const enum blk_crypto_mode_num crypto_mode = 111 + key->crypto_cfg.crypto_mode; 112 + int err; 113 + 114 + if (crypto_mode != slotp->crypto_mode && 115 + slotp->crypto_mode != BLK_ENCRYPTION_MODE_INVALID) 116 + blk_crypto_evict_keyslot(slot); 117 + 118 + slotp->crypto_mode = crypto_mode; 119 + err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], key->raw, 120 + key->size); 121 + if (err) { 122 + blk_crypto_evict_keyslot(slot); 123 + return err; 124 + } 125 + return 0; 126 + } 127 + 128 + static int blk_crypto_keyslot_evict(struct blk_keyslot_manager *ksm, 129 + const struct blk_crypto_key *key, 130 + unsigned int slot) 131 + { 132 + blk_crypto_evict_keyslot(slot); 133 + return 0; 134 + } 135 + 136 + /* 137 + * The crypto API fallback KSM ops - only used for a bio when it specifies a 138 + * blk_crypto_key that was not supported by the device's inline encryption 139 + * hardware. 
140 + */ 141 + static const struct blk_ksm_ll_ops blk_crypto_ksm_ll_ops = { 142 + .keyslot_program = blk_crypto_keyslot_program, 143 + .keyslot_evict = blk_crypto_keyslot_evict, 144 + }; 145 + 146 + static void blk_crypto_fallback_encrypt_endio(struct bio *enc_bio) 147 + { 148 + struct bio *src_bio = enc_bio->bi_private; 149 + int i; 150 + 151 + for (i = 0; i < enc_bio->bi_vcnt; i++) 152 + mempool_free(enc_bio->bi_io_vec[i].bv_page, 153 + blk_crypto_bounce_page_pool); 154 + 155 + src_bio->bi_status = enc_bio->bi_status; 156 + 157 + bio_put(enc_bio); 158 + bio_endio(src_bio); 159 + } 160 + 161 + static struct bio *blk_crypto_clone_bio(struct bio *bio_src) 162 + { 163 + struct bvec_iter iter; 164 + struct bio_vec bv; 165 + struct bio *bio; 166 + 167 + bio = bio_alloc_bioset(GFP_NOIO, bio_segments(bio_src), NULL); 168 + if (!bio) 169 + return NULL; 170 + bio->bi_disk = bio_src->bi_disk; 171 + bio->bi_opf = bio_src->bi_opf; 172 + bio->bi_ioprio = bio_src->bi_ioprio; 173 + bio->bi_write_hint = bio_src->bi_write_hint; 174 + bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector; 175 + bio->bi_iter.bi_size = bio_src->bi_iter.bi_size; 176 + 177 + bio_for_each_segment(bv, bio_src, iter) 178 + bio->bi_io_vec[bio->bi_vcnt++] = bv; 179 + 180 + bio_clone_blkg_association(bio, bio_src); 181 + blkcg_bio_issue_init(bio); 182 + 183 + return bio; 184 + } 185 + 186 + static bool blk_crypto_alloc_cipher_req(struct blk_ksm_keyslot *slot, 187 + struct skcipher_request **ciph_req_ret, 188 + struct crypto_wait *wait) 189 + { 190 + struct skcipher_request *ciph_req; 191 + const struct blk_crypto_keyslot *slotp; 192 + int keyslot_idx = blk_ksm_get_slot_idx(slot); 193 + 194 + slotp = &blk_crypto_keyslots[keyslot_idx]; 195 + ciph_req = skcipher_request_alloc(slotp->tfms[slotp->crypto_mode], 196 + GFP_NOIO); 197 + if (!ciph_req) 198 + return false; 199 + 200 + skcipher_request_set_callback(ciph_req, 201 + CRYPTO_TFM_REQ_MAY_BACKLOG | 202 + CRYPTO_TFM_REQ_MAY_SLEEP, 203 + crypto_req_done, wait); 
204 + *ciph_req_ret = ciph_req; 205 + 206 + return true; 207 + } 208 + 209 + static bool blk_crypto_split_bio_if_needed(struct bio **bio_ptr) 210 + { 211 + struct bio *bio = *bio_ptr; 212 + unsigned int i = 0; 213 + unsigned int num_sectors = 0; 214 + struct bio_vec bv; 215 + struct bvec_iter iter; 216 + 217 + bio_for_each_segment(bv, bio, iter) { 218 + num_sectors += bv.bv_len >> SECTOR_SHIFT; 219 + if (++i == BIO_MAX_PAGES) 220 + break; 221 + } 222 + if (num_sectors < bio_sectors(bio)) { 223 + struct bio *split_bio; 224 + 225 + split_bio = bio_split(bio, num_sectors, GFP_NOIO, NULL); 226 + if (!split_bio) { 227 + bio->bi_status = BLK_STS_RESOURCE; 228 + return false; 229 + } 230 + bio_chain(split_bio, bio); 231 + generic_make_request(bio); 232 + *bio_ptr = split_bio; 233 + } 234 + 235 + return true; 236 + } 237 + 238 + union blk_crypto_iv { 239 + __le64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE]; 240 + u8 bytes[BLK_CRYPTO_MAX_IV_SIZE]; 241 + }; 242 + 243 + static void blk_crypto_dun_to_iv(const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], 244 + union blk_crypto_iv *iv) 245 + { 246 + int i; 247 + 248 + for (i = 0; i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) 249 + iv->dun[i] = cpu_to_le64(dun[i]); 250 + } 251 + 252 + /* 253 + * The crypto API fallback's encryption routine. 254 + * Allocate a bounce bio for encryption, encrypt the input bio using crypto API, 255 + * and replace *bio_ptr with the bounce bio. May split input bio if it's too 256 + * large. Returns true on success. Returns false and sets bio->bi_status on 257 + * error. 
258 + */ 259 + static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) 260 + { 261 + struct bio *src_bio, *enc_bio; 262 + struct bio_crypt_ctx *bc; 263 + struct blk_ksm_keyslot *slot; 264 + int data_unit_size; 265 + struct skcipher_request *ciph_req = NULL; 266 + DECLARE_CRYPTO_WAIT(wait); 267 + u64 curr_dun[BLK_CRYPTO_DUN_ARRAY_SIZE]; 268 + struct scatterlist src, dst; 269 + union blk_crypto_iv iv; 270 + unsigned int i, j; 271 + bool ret = false; 272 + blk_status_t blk_st; 273 + 274 + /* Split the bio if it's too big for single page bvec */ 275 + if (!blk_crypto_split_bio_if_needed(bio_ptr)) 276 + return false; 277 + 278 + src_bio = *bio_ptr; 279 + bc = src_bio->bi_crypt_context; 280 + data_unit_size = bc->bc_key->crypto_cfg.data_unit_size; 281 + 282 + /* Allocate bounce bio for encryption */ 283 + enc_bio = blk_crypto_clone_bio(src_bio); 284 + if (!enc_bio) { 285 + src_bio->bi_status = BLK_STS_RESOURCE; 286 + return false; 287 + } 288 + 289 + /* 290 + * Use the crypto API fallback keyslot manager to get a crypto_skcipher 291 + * for the algorithm and key specified for this bio. 
292 + */ 293 + blk_st = blk_ksm_get_slot_for_key(&blk_crypto_ksm, bc->bc_key, &slot); 294 + if (blk_st != BLK_STS_OK) { 295 + src_bio->bi_status = blk_st; 296 + goto out_put_enc_bio; 297 + } 298 + 299 + /* and then allocate an skcipher_request for it */ 300 + if (!blk_crypto_alloc_cipher_req(slot, &ciph_req, &wait)) { 301 + src_bio->bi_status = BLK_STS_RESOURCE; 302 + goto out_release_keyslot; 303 + } 304 + 305 + memcpy(curr_dun, bc->bc_dun, sizeof(curr_dun)); 306 + sg_init_table(&src, 1); 307 + sg_init_table(&dst, 1); 308 + 309 + skcipher_request_set_crypt(ciph_req, &src, &dst, data_unit_size, 310 + iv.bytes); 311 + 312 + /* Encrypt each page in the bounce bio */ 313 + for (i = 0; i < enc_bio->bi_vcnt; i++) { 314 + struct bio_vec *enc_bvec = &enc_bio->bi_io_vec[i]; 315 + struct page *plaintext_page = enc_bvec->bv_page; 316 + struct page *ciphertext_page = 317 + mempool_alloc(blk_crypto_bounce_page_pool, GFP_NOIO); 318 + 319 + enc_bvec->bv_page = ciphertext_page; 320 + 321 + if (!ciphertext_page) { 322 + src_bio->bi_status = BLK_STS_RESOURCE; 323 + goto out_free_bounce_pages; 324 + } 325 + 326 + sg_set_page(&src, plaintext_page, data_unit_size, 327 + enc_bvec->bv_offset); 328 + sg_set_page(&dst, ciphertext_page, data_unit_size, 329 + enc_bvec->bv_offset); 330 + 331 + /* Encrypt each data unit in this page */ 332 + for (j = 0; j < enc_bvec->bv_len; j += data_unit_size) { 333 + blk_crypto_dun_to_iv(curr_dun, &iv); 334 + if (crypto_wait_req(crypto_skcipher_encrypt(ciph_req), 335 + &wait)) { 336 + i++; 337 + src_bio->bi_status = BLK_STS_IOERR; 338 + goto out_free_bounce_pages; 339 + } 340 + bio_crypt_dun_increment(curr_dun, 1); 341 + src.offset += data_unit_size; 342 + dst.offset += data_unit_size; 343 + } 344 + } 345 + 346 + enc_bio->bi_private = src_bio; 347 + enc_bio->bi_end_io = blk_crypto_fallback_encrypt_endio; 348 + *bio_ptr = enc_bio; 349 + ret = true; 350 + 351 + enc_bio = NULL; 352 + goto out_free_ciph_req; 353 + 354 + out_free_bounce_pages: 355 + while (i > 
0) 356 + mempool_free(enc_bio->bi_io_vec[--i].bv_page, 357 + blk_crypto_bounce_page_pool); 358 + out_free_ciph_req: 359 + skcipher_request_free(ciph_req); 360 + out_release_keyslot: 361 + blk_ksm_put_slot(slot); 362 + out_put_enc_bio: 363 + if (enc_bio) 364 + bio_put(enc_bio); 365 + 366 + return ret; 367 + } 368 + 369 + /* 370 + * The crypto API fallback's main decryption routine. 371 + * Decrypts input bio in place, and calls bio_endio on the bio. 372 + */ 373 + static void blk_crypto_fallback_decrypt_bio(struct work_struct *work) 374 + { 375 + struct bio_fallback_crypt_ctx *f_ctx = 376 + container_of(work, struct bio_fallback_crypt_ctx, work); 377 + struct bio *bio = f_ctx->bio; 378 + struct bio_crypt_ctx *bc = &f_ctx->crypt_ctx; 379 + struct blk_ksm_keyslot *slot; 380 + struct skcipher_request *ciph_req = NULL; 381 + DECLARE_CRYPTO_WAIT(wait); 382 + u64 curr_dun[BLK_CRYPTO_DUN_ARRAY_SIZE]; 383 + union blk_crypto_iv iv; 384 + struct scatterlist sg; 385 + struct bio_vec bv; 386 + struct bvec_iter iter; 387 + const int data_unit_size = bc->bc_key->crypto_cfg.data_unit_size; 388 + unsigned int i; 389 + blk_status_t blk_st; 390 + 391 + /* 392 + * Use the crypto API fallback keyslot manager to get a crypto_skcipher 393 + * for the algorithm and key specified for this bio. 
394 + */ 395 + blk_st = blk_ksm_get_slot_for_key(&blk_crypto_ksm, bc->bc_key, &slot); 396 + if (blk_st != BLK_STS_OK) { 397 + bio->bi_status = blk_st; 398 + goto out_no_keyslot; 399 + } 400 + 401 + /* and then allocate an skcipher_request for it */ 402 + if (!blk_crypto_alloc_cipher_req(slot, &ciph_req, &wait)) { 403 + bio->bi_status = BLK_STS_RESOURCE; 404 + goto out; 405 + } 406 + 407 + memcpy(curr_dun, bc->bc_dun, sizeof(curr_dun)); 408 + sg_init_table(&sg, 1); 409 + skcipher_request_set_crypt(ciph_req, &sg, &sg, data_unit_size, 410 + iv.bytes); 411 + 412 + /* Decrypt each segment in the bio */ 413 + __bio_for_each_segment(bv, bio, iter, f_ctx->crypt_iter) { 414 + struct page *page = bv.bv_page; 415 + 416 + sg_set_page(&sg, page, data_unit_size, bv.bv_offset); 417 + 418 + /* Decrypt each data unit in the segment */ 419 + for (i = 0; i < bv.bv_len; i += data_unit_size) { 420 + blk_crypto_dun_to_iv(curr_dun, &iv); 421 + if (crypto_wait_req(crypto_skcipher_decrypt(ciph_req), 422 + &wait)) { 423 + bio->bi_status = BLK_STS_IOERR; 424 + goto out; 425 + } 426 + bio_crypt_dun_increment(curr_dun, 1); 427 + sg.offset += data_unit_size; 428 + } 429 + } 430 + 431 + out: 432 + skcipher_request_free(ciph_req); 433 + blk_ksm_put_slot(slot); 434 + out_no_keyslot: 435 + mempool_free(f_ctx, bio_fallback_crypt_ctx_pool); 436 + bio_endio(bio); 437 + } 438 + 439 + /** 440 + * blk_crypto_fallback_decrypt_endio - queue bio for fallback decryption 441 + * 442 + * @bio: the bio to queue 443 + * 444 + * Restore bi_private and bi_end_io, and queue the bio for decryption into a 445 + * workqueue, since this function will be called from an atomic context. 446 + */ 447 + static void blk_crypto_fallback_decrypt_endio(struct bio *bio) 448 + { 449 + struct bio_fallback_crypt_ctx *f_ctx = bio->bi_private; 450 + 451 + bio->bi_private = f_ctx->bi_private_orig; 452 + bio->bi_end_io = f_ctx->bi_end_io_orig; 453 + 454 + /* If there was an IO error, don't queue for decrypt. 
*/ 455 + if (bio->bi_status) { 456 + mempool_free(f_ctx, bio_fallback_crypt_ctx_pool); 457 + bio_endio(bio); 458 + return; 459 + } 460 + 461 + INIT_WORK(&f_ctx->work, blk_crypto_fallback_decrypt_bio); 462 + f_ctx->bio = bio; 463 + queue_work(blk_crypto_wq, &f_ctx->work); 464 + } 465 + 466 + /** 467 + * blk_crypto_fallback_bio_prep - Prepare a bio to use fallback en/decryption 468 + * 469 + * @bio_ptr: pointer to the bio to prepare 470 + * 471 + * If the bio is doing a WRITE operation, this splits the bio into two parts if 472 + * it's too big (see blk_crypto_split_bio_if_needed). It then allocates a bounce 473 + * bio for the first part, encrypts it, and updates bio_ptr to point to the 474 + * bounce bio. 475 + * 476 + * For a READ operation, we mark the bio for decryption by using bi_private and 477 + * bi_end_io. 478 + * 479 + * In either case, this function will make the bio look like a regular bio (i.e. 480 + * as if no encryption context was ever specified) for the purposes of the rest 481 + * of the stack except for blk-integrity (blk-integrity and blk-crypto are not 482 + * currently supported together). 483 + * 484 + * Return: true on success. Sets bio->bi_status and returns false on error.
485 + */ 486 + bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) 487 + { 488 + struct bio *bio = *bio_ptr; 489 + struct bio_crypt_ctx *bc = bio->bi_crypt_context; 490 + struct bio_fallback_crypt_ctx *f_ctx; 491 + 492 + if (WARN_ON_ONCE(!tfms_inited[bc->bc_key->crypto_cfg.crypto_mode])) { 493 + /* User didn't call blk_crypto_start_using_key() first */ 494 + bio->bi_status = BLK_STS_IOERR; 495 + return false; 496 + } 497 + 498 + if (!blk_ksm_crypto_cfg_supported(&blk_crypto_ksm, 499 + &bc->bc_key->crypto_cfg)) { 500 + bio->bi_status = BLK_STS_NOTSUPP; 501 + return false; 502 + } 503 + 504 + if (bio_data_dir(bio) == WRITE) 505 + return blk_crypto_fallback_encrypt_bio(bio_ptr); 506 + 507 + /* 508 + * bio READ case: Set up a f_ctx in the bio's bi_private and set the 509 + * bi_end_io appropriately to trigger decryption when the bio is ended. 510 + */ 511 + f_ctx = mempool_alloc(bio_fallback_crypt_ctx_pool, GFP_NOIO); 512 + f_ctx->crypt_ctx = *bc; 513 + f_ctx->crypt_iter = bio->bi_iter; 514 + f_ctx->bi_private_orig = bio->bi_private; 515 + f_ctx->bi_end_io_orig = bio->bi_end_io; 516 + bio->bi_private = (void *)f_ctx; 517 + bio->bi_end_io = blk_crypto_fallback_decrypt_endio; 518 + bio_crypt_free_ctx(bio); 519 + 520 + return true; 521 + } 522 + 523 + int blk_crypto_fallback_evict_key(const struct blk_crypto_key *key) 524 + { 525 + return blk_ksm_evict_key(&blk_crypto_ksm, key); 526 + } 527 + 528 + static bool blk_crypto_fallback_inited; 529 + static int blk_crypto_fallback_init(void) 530 + { 531 + int i; 532 + int err; 533 + 534 + if (blk_crypto_fallback_inited) 535 + return 0; 536 + 537 + prandom_bytes(blank_key, BLK_CRYPTO_MAX_KEY_SIZE); 538 + 539 + err = blk_ksm_init(&blk_crypto_ksm, blk_crypto_num_keyslots); 540 + if (err) 541 + goto out; 542 + err = -ENOMEM; 543 + 544 + blk_crypto_ksm.ksm_ll_ops = blk_crypto_ksm_ll_ops; 545 + blk_crypto_ksm.max_dun_bytes_supported = BLK_CRYPTO_MAX_IV_SIZE; 546 + 547 + /* All blk-crypto modes have a crypto API fallback. 
*/ 548 + for (i = 0; i < BLK_ENCRYPTION_MODE_MAX; i++) 549 + blk_crypto_ksm.crypto_modes_supported[i] = 0xFFFFFFFF; 550 + blk_crypto_ksm.crypto_modes_supported[BLK_ENCRYPTION_MODE_INVALID] = 0; 551 + 552 + blk_crypto_wq = alloc_workqueue("blk_crypto_wq", 553 + WQ_UNBOUND | WQ_HIGHPRI | 554 + WQ_MEM_RECLAIM, num_online_cpus()); 555 + if (!blk_crypto_wq) 556 + goto fail_free_ksm; 557 + 558 + blk_crypto_keyslots = kcalloc(blk_crypto_num_keyslots, 559 + sizeof(blk_crypto_keyslots[0]), 560 + GFP_KERNEL); 561 + if (!blk_crypto_keyslots) 562 + goto fail_free_wq; 563 + 564 + blk_crypto_bounce_page_pool = 565 + mempool_create_page_pool(num_prealloc_bounce_pg, 0); 566 + if (!blk_crypto_bounce_page_pool) 567 + goto fail_free_keyslots; 568 + 569 + bio_fallback_crypt_ctx_cache = KMEM_CACHE(bio_fallback_crypt_ctx, 0); 570 + if (!bio_fallback_crypt_ctx_cache) 571 + goto fail_free_bounce_page_pool; 572 + 573 + bio_fallback_crypt_ctx_pool = 574 + mempool_create_slab_pool(num_prealloc_fallback_crypt_ctxs, 575 + bio_fallback_crypt_ctx_cache); 576 + if (!bio_fallback_crypt_ctx_pool) 577 + goto fail_free_crypt_ctx_cache; 578 + 579 + blk_crypto_fallback_inited = true; 580 + 581 + return 0; 582 + fail_free_crypt_ctx_cache: 583 + kmem_cache_destroy(bio_fallback_crypt_ctx_cache); 584 + fail_free_bounce_page_pool: 585 + mempool_destroy(blk_crypto_bounce_page_pool); 586 + fail_free_keyslots: 587 + kfree(blk_crypto_keyslots); 588 + fail_free_wq: 589 + destroy_workqueue(blk_crypto_wq); 590 + fail_free_ksm: 591 + blk_ksm_destroy(&blk_crypto_ksm); 592 + out: 593 + return err; 594 + } 595 + 596 + /* 597 + * Prepare blk-crypto-fallback for the specified crypto mode. 598 + * Returns -ENOPKG if the needed crypto API support is missing. 
599 + */ 600 + int blk_crypto_fallback_start_using_mode(enum blk_crypto_mode_num mode_num) 601 + { 602 + const char *cipher_str = blk_crypto_modes[mode_num].cipher_str; 603 + struct blk_crypto_keyslot *slotp; 604 + unsigned int i; 605 + int err = 0; 606 + 607 + /* 608 + * Fast path 609 + * Ensure that updates to blk_crypto_keyslots[i].tfms[mode_num] 610 + * for each i are visible before we try to access them. 611 + */ 612 + if (likely(smp_load_acquire(&tfms_inited[mode_num]))) 613 + return 0; 614 + 615 + mutex_lock(&tfms_init_lock); 616 + if (tfms_inited[mode_num]) 617 + goto out; 618 + 619 + err = blk_crypto_fallback_init(); 620 + if (err) 621 + goto out; 622 + 623 + for (i = 0; i < blk_crypto_num_keyslots; i++) { 624 + slotp = &blk_crypto_keyslots[i]; 625 + slotp->tfms[mode_num] = crypto_alloc_skcipher(cipher_str, 0, 0); 626 + if (IS_ERR(slotp->tfms[mode_num])) { 627 + err = PTR_ERR(slotp->tfms[mode_num]); 628 + if (err == -ENOENT) { 629 + pr_warn_once("Missing crypto API support for \"%s\"\n", 630 + cipher_str); 631 + err = -ENOPKG; 632 + } 633 + slotp->tfms[mode_num] = NULL; 634 + goto out_free_tfms; 635 + } 636 + 637 + crypto_skcipher_set_flags(slotp->tfms[mode_num], 638 + CRYPTO_TFM_REQ_FORBID_WEAK_KEYS); 639 + } 640 + 641 + /* 642 + * Ensure that updates to blk_crypto_keyslots[i].tfms[mode_num] 643 + * for each i are visible before we set tfms_inited[mode_num]. 644 + */ 645 + smp_store_release(&tfms_inited[mode_num], true); 646 + goto out; 647 + 648 + out_free_tfms: 649 + for (i = 0; i < blk_crypto_num_keyslots; i++) { 650 + slotp = &blk_crypto_keyslots[i]; 651 + crypto_free_skcipher(slotp->tfms[mode_num]); 652 + slotp->tfms[mode_num] = NULL; 653 + } 654 + out: 655 + mutex_unlock(&tfms_init_lock); 656 + return err; 657 + }
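The fast path in `blk_crypto_fallback_start_using_mode()` relies on an acquire/release pairing: `smp_store_release()` publishes `tfms_inited[mode_num]` only after every keyslot's transform has been written, and `smp_load_acquire()` on the reader side guarantees those writes are visible before the transforms are dereferenced, so readers never need the mutex. A userspace sketch of the same double-checked pattern, using C11 atomics and a pthread mutex in place of the kernel primitives (all names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-mode state mirroring tfms[] / tfms_inited[]. */
static void *mode_tfm;                  /* set up under the lock        */
static atomic_bool mode_inited;         /* published with release order */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

static int start_using_mode(void)
{
	/*
	 * Fast path: this acquire pairs with the release store below,
	 * so once mode_inited is observed true, the write to mode_tfm
	 * is guaranteed to be visible as well.
	 */
	if (atomic_load_explicit(&mode_inited, memory_order_acquire))
		return 0;

	pthread_mutex_lock(&init_lock);
	if (!atomic_load_explicit(&mode_inited, memory_order_relaxed)) {
		mode_tfm = &mode_tfm;   /* stand-in for crypto_alloc_skcipher() */
		atomic_store_explicit(&mode_inited, true,
				      memory_order_release);
	}
	pthread_mutex_unlock(&init_lock);
	return 0;
}
```

The relaxed re-check under the mutex is enough because the lock itself orders the initializers; only the lock-free reader needs the acquire.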
block/blk-crypto-internal.h (+201)
··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * Copyright 2019 Google LLC 4 + */ 5 + 6 + #ifndef __LINUX_BLK_CRYPTO_INTERNAL_H 7 + #define __LINUX_BLK_CRYPTO_INTERNAL_H 8 + 9 + #include <linux/bio.h> 10 + #include <linux/blkdev.h> 11 + 12 + /* Represents a crypto mode supported by blk-crypto */ 13 + struct blk_crypto_mode { 14 + const char *cipher_str; /* crypto API name (for fallback case) */ 15 + unsigned int keysize; /* key size in bytes */ 16 + unsigned int ivsize; /* iv size in bytes */ 17 + }; 18 + 19 + extern const struct blk_crypto_mode blk_crypto_modes[]; 20 + 21 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION 22 + 23 + void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], 24 + unsigned int inc); 25 + 26 + bool bio_crypt_rq_ctx_compatible(struct request *rq, struct bio *bio); 27 + 28 + bool bio_crypt_ctx_mergeable(struct bio_crypt_ctx *bc1, unsigned int bc1_bytes, 29 + struct bio_crypt_ctx *bc2); 30 + 31 + static inline bool bio_crypt_ctx_back_mergeable(struct request *req, 32 + struct bio *bio) 33 + { 34 + return bio_crypt_ctx_mergeable(req->crypt_ctx, blk_rq_bytes(req), 35 + bio->bi_crypt_context); 36 + } 37 + 38 + static inline bool bio_crypt_ctx_front_mergeable(struct request *req, 39 + struct bio *bio) 40 + { 41 + return bio_crypt_ctx_mergeable(bio->bi_crypt_context, 42 + bio->bi_iter.bi_size, req->crypt_ctx); 43 + } 44 + 45 + static inline bool bio_crypt_ctx_merge_rq(struct request *req, 46 + struct request *next) 47 + { 48 + return bio_crypt_ctx_mergeable(req->crypt_ctx, blk_rq_bytes(req), 49 + next->crypt_ctx); 50 + } 51 + 52 + static inline void blk_crypto_rq_set_defaults(struct request *rq) 53 + { 54 + rq->crypt_ctx = NULL; 55 + rq->crypt_keyslot = NULL; 56 + } 57 + 58 + static inline bool blk_crypto_rq_is_encrypted(struct request *rq) 59 + { 60 + return rq->crypt_ctx; 61 + } 62 + 63 + #else /* CONFIG_BLK_INLINE_ENCRYPTION */ 64 + 65 + static inline bool bio_crypt_rq_ctx_compatible(struct request *rq, 66 + struct bio *bio) 67 + { 68 
+ return true; 69 + } 70 + 71 + static inline bool bio_crypt_ctx_front_mergeable(struct request *req, 72 + struct bio *bio) 73 + { 74 + return true; 75 + } 76 + 77 + static inline bool bio_crypt_ctx_back_mergeable(struct request *req, 78 + struct bio *bio) 79 + { 80 + return true; 81 + } 82 + 83 + static inline bool bio_crypt_ctx_merge_rq(struct request *req, 84 + struct request *next) 85 + { 86 + return true; 87 + } 88 + 89 + static inline void blk_crypto_rq_set_defaults(struct request *rq) { } 90 + 91 + static inline bool blk_crypto_rq_is_encrypted(struct request *rq) 92 + { 93 + return false; 94 + } 95 + 96 + #endif /* CONFIG_BLK_INLINE_ENCRYPTION */ 97 + 98 + void __bio_crypt_advance(struct bio *bio, unsigned int bytes); 99 + static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes) 100 + { 101 + if (bio_has_crypt_ctx(bio)) 102 + __bio_crypt_advance(bio, bytes); 103 + } 104 + 105 + void __bio_crypt_free_ctx(struct bio *bio); 106 + static inline void bio_crypt_free_ctx(struct bio *bio) 107 + { 108 + if (bio_has_crypt_ctx(bio)) 109 + __bio_crypt_free_ctx(bio); 110 + } 111 + 112 + static inline void bio_crypt_do_front_merge(struct request *rq, 113 + struct bio *bio) 114 + { 115 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION 116 + if (bio_has_crypt_ctx(bio)) 117 + memcpy(rq->crypt_ctx->bc_dun, bio->bi_crypt_context->bc_dun, 118 + sizeof(rq->crypt_ctx->bc_dun)); 119 + #endif 120 + } 121 + 122 + bool __blk_crypto_bio_prep(struct bio **bio_ptr); 123 + static inline bool blk_crypto_bio_prep(struct bio **bio_ptr) 124 + { 125 + if (bio_has_crypt_ctx(*bio_ptr)) 126 + return __blk_crypto_bio_prep(bio_ptr); 127 + return true; 128 + } 129 + 130 + blk_status_t __blk_crypto_init_request(struct request *rq); 131 + static inline blk_status_t blk_crypto_init_request(struct request *rq) 132 + { 133 + if (blk_crypto_rq_is_encrypted(rq)) 134 + return __blk_crypto_init_request(rq); 135 + return BLK_STS_OK; 136 + } 137 + 138 + void __blk_crypto_free_request(struct request *rq); 
139 + static inline void blk_crypto_free_request(struct request *rq) 140 + { 141 + if (blk_crypto_rq_is_encrypted(rq)) 142 + __blk_crypto_free_request(rq); 143 + } 144 + 145 + void __blk_crypto_rq_bio_prep(struct request *rq, struct bio *bio, 146 + gfp_t gfp_mask); 147 + static inline void blk_crypto_rq_bio_prep(struct request *rq, struct bio *bio, 148 + gfp_t gfp_mask) 149 + { 150 + if (bio_has_crypt_ctx(bio)) 151 + __blk_crypto_rq_bio_prep(rq, bio, gfp_mask); 152 + } 153 + 154 + /** 155 + * blk_crypto_insert_cloned_request - Prepare a cloned request to be inserted 156 + * into a request queue. 157 + * @rq: the request being queued 158 + * 159 + * Return: BLK_STS_OK on success, nonzero on error. 160 + */ 161 + static inline blk_status_t blk_crypto_insert_cloned_request(struct request *rq) 162 + { 163 + 164 + if (blk_crypto_rq_is_encrypted(rq)) 165 + return blk_crypto_init_request(rq); 166 + return BLK_STS_OK; 167 + } 168 + 169 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK 170 + 171 + int blk_crypto_fallback_start_using_mode(enum blk_crypto_mode_num mode_num); 172 + 173 + bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr); 174 + 175 + int blk_crypto_fallback_evict_key(const struct blk_crypto_key *key); 176 + 177 + #else /* CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK */ 178 + 179 + static inline int 180 + blk_crypto_fallback_start_using_mode(enum blk_crypto_mode_num mode_num) 181 + { 182 + pr_warn_once("crypto API fallback is disabled\n"); 183 + return -ENOPKG; 184 + } 185 + 186 + static inline bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) 187 + { 188 + pr_warn_once("crypto API fallback disabled; failing request.\n"); 189 + (*bio_ptr)->bi_status = BLK_STS_NOTSUPP; 190 + return false; 191 + } 192 + 193 + static inline int 194 + blk_crypto_fallback_evict_key(const struct blk_crypto_key *key) 195 + { 196 + return 0; 197 + } 198 + 199 + #endif /* CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK */ 200 + 201 + #endif /* __LINUX_BLK_CRYPTO_INTERNAL_H */
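A recurring idiom in this header is the two-level call: a `static inline` wrapper performs a cheap has-context test and only then calls the out-of-line `__`-prefixed slow path, so bios and requests without an encryption context pay nothing but a NULL check. A minimal userspace sketch of the idiom (the struct layout is invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

struct bio {
	void *bi_crypt_context;   /* NULL for unencrypted bios */
	unsigned int advanced;    /* stand-in for real crypto state */
};

static int bio_has_crypt_ctx(const struct bio *bio)
{
	return bio->bi_crypt_context != NULL;
}

/* Out-of-line slow path, only reached for encrypted bios. */
static void __bio_crypt_advance(struct bio *bio, unsigned int bytes)
{
	bio->advanced += bytes;
}

/* Inline fast path: unencrypted I/O costs a single NULL check. */
static inline void bio_crypt_advance(struct bio *bio, unsigned int bytes)
{
	if (bio_has_crypt_ctx(bio))
		__bio_crypt_advance(bio, bytes);
}
```

This keeps `CONFIG_BLK_INLINE_ENCRYPTION=n` and the common unencrypted path essentially free while the real work stays out of line.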
block/blk-crypto.c (+404)
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright 2019 Google LLC 4 + */ 5 + 6 + /* 7 + * Refer to Documentation/block/inline-encryption.rst for detailed explanation. 8 + */ 9 + 10 + #define pr_fmt(fmt) "blk-crypto: " fmt 11 + 12 + #include <linux/bio.h> 13 + #include <linux/blkdev.h> 14 + #include <linux/keyslot-manager.h> 15 + #include <linux/module.h> 16 + #include <linux/slab.h> 17 + 18 + #include "blk-crypto-internal.h" 19 + 20 + const struct blk_crypto_mode blk_crypto_modes[] = { 21 + [BLK_ENCRYPTION_MODE_AES_256_XTS] = { 22 + .cipher_str = "xts(aes)", 23 + .keysize = 64, 24 + .ivsize = 16, 25 + }, 26 + [BLK_ENCRYPTION_MODE_AES_128_CBC_ESSIV] = { 27 + .cipher_str = "essiv(cbc(aes),sha256)", 28 + .keysize = 16, 29 + .ivsize = 16, 30 + }, 31 + [BLK_ENCRYPTION_MODE_ADIANTUM] = { 32 + .cipher_str = "adiantum(xchacha12,aes)", 33 + .keysize = 32, 34 + .ivsize = 32, 35 + }, 36 + }; 37 + 38 + /* 39 + * This number needs to be at least (the number of threads doing IO 40 + * concurrently) * (maximum recursive depth of a bio), so that we don't 41 + * deadlock on crypt_ctx allocations. The default is chosen to be the same 42 + * as the default number of post read contexts in both EXT4 and F2FS. 43 + */ 44 + static int num_prealloc_crypt_ctxs = 128; 45 + 46 + module_param(num_prealloc_crypt_ctxs, int, 0444); 47 + MODULE_PARM_DESC(num_prealloc_crypt_ctxs, 48 + "Number of bio crypto contexts to preallocate"); 49 + 50 + static struct kmem_cache *bio_crypt_ctx_cache; 51 + static mempool_t *bio_crypt_ctx_pool; 52 + 53 + static int __init bio_crypt_ctx_init(void) 54 + { 55 + size_t i; 56 + 57 + bio_crypt_ctx_cache = KMEM_CACHE(bio_crypt_ctx, 0); 58 + if (!bio_crypt_ctx_cache) 59 + goto out_no_mem; 60 + 61 + bio_crypt_ctx_pool = mempool_create_slab_pool(num_prealloc_crypt_ctxs, 62 + bio_crypt_ctx_cache); 63 + if (!bio_crypt_ctx_pool) 64 + goto out_no_mem; 65 + 66 + /* This is assumed in various places. 
*/ 67 + BUILD_BUG_ON(BLK_ENCRYPTION_MODE_INVALID != 0); 68 + 69 + /* Sanity check that no algorithm exceeds the defined limits. */ 70 + for (i = 0; i < BLK_ENCRYPTION_MODE_MAX; i++) { 71 + BUG_ON(blk_crypto_modes[i].keysize > BLK_CRYPTO_MAX_KEY_SIZE); 72 + BUG_ON(blk_crypto_modes[i].ivsize > BLK_CRYPTO_MAX_IV_SIZE); 73 + } 74 + 75 + return 0; 76 + out_no_mem: 77 + panic("Failed to allocate mem for bio crypt ctxs\n"); 78 + } 79 + subsys_initcall(bio_crypt_ctx_init); 80 + 81 + void bio_crypt_set_ctx(struct bio *bio, const struct blk_crypto_key *key, 82 + const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], gfp_t gfp_mask) 83 + { 84 + struct bio_crypt_ctx *bc = mempool_alloc(bio_crypt_ctx_pool, gfp_mask); 85 + 86 + bc->bc_key = key; 87 + memcpy(bc->bc_dun, dun, sizeof(bc->bc_dun)); 88 + 89 + bio->bi_crypt_context = bc; 90 + } 91 + 92 + void __bio_crypt_free_ctx(struct bio *bio) 93 + { 94 + mempool_free(bio->bi_crypt_context, bio_crypt_ctx_pool); 95 + bio->bi_crypt_context = NULL; 96 + } 97 + 98 + void __bio_crypt_clone(struct bio *dst, struct bio *src, gfp_t gfp_mask) 99 + { 100 + dst->bi_crypt_context = mempool_alloc(bio_crypt_ctx_pool, gfp_mask); 101 + *dst->bi_crypt_context = *src->bi_crypt_context; 102 + } 103 + EXPORT_SYMBOL_GPL(__bio_crypt_clone); 104 + 105 + /* Increments @dun by @inc, treating @dun as a multi-limb integer. */ 106 + void bio_crypt_dun_increment(u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], 107 + unsigned int inc) 108 + { 109 + int i; 110 + 111 + for (i = 0; inc && i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) { 112 + dun[i] += inc; 113 + /* 114 + * If the addition in this limb overflowed, then we need to 115 + * carry 1 into the next limb. Else the carry is 0. 
116 + */ 117 + if (dun[i] < inc) 118 + inc = 1; 119 + else 120 + inc = 0; 121 + } 122 + } 123 + 124 + void __bio_crypt_advance(struct bio *bio, unsigned int bytes) 125 + { 126 + struct bio_crypt_ctx *bc = bio->bi_crypt_context; 127 + 128 + bio_crypt_dun_increment(bc->bc_dun, 129 + bytes >> bc->bc_key->data_unit_size_bits); 130 + } 131 + 132 + /* 133 + * Returns true if @bc->bc_dun plus @bytes converted to data units is equal to 134 + * @next_dun, treating the DUNs as multi-limb integers. 135 + */ 136 + bool bio_crypt_dun_is_contiguous(const struct bio_crypt_ctx *bc, 137 + unsigned int bytes, 138 + const u64 next_dun[BLK_CRYPTO_DUN_ARRAY_SIZE]) 139 + { 140 + int i; 141 + unsigned int carry = bytes >> bc->bc_key->data_unit_size_bits; 142 + 143 + for (i = 0; i < BLK_CRYPTO_DUN_ARRAY_SIZE; i++) { 144 + if (bc->bc_dun[i] + carry != next_dun[i]) 145 + return false; 146 + /* 147 + * If the addition in this limb overflowed, then we need to 148 + * carry 1 into the next limb. Else the carry is 0. 149 + */ 150 + if ((bc->bc_dun[i] + carry) < carry) 151 + carry = 1; 152 + else 153 + carry = 0; 154 + } 155 + 156 + /* If the DUN wrapped through 0, don't treat it as contiguous. */ 157 + return carry == 0; 158 + } 159 + 160 + /* 161 + * Checks that two bio crypt contexts are compatible - i.e. that 162 + * they are mergeable except for data_unit_num continuity. 163 + */ 164 + static bool bio_crypt_ctx_compatible(struct bio_crypt_ctx *bc1, 165 + struct bio_crypt_ctx *bc2) 166 + { 167 + if (!bc1) 168 + return !bc2; 169 + 170 + return bc2 && bc1->bc_key == bc2->bc_key; 171 + } 172 + 173 + bool bio_crypt_rq_ctx_compatible(struct request *rq, struct bio *bio) 174 + { 175 + return bio_crypt_ctx_compatible(rq->crypt_ctx, bio->bi_crypt_context); 176 + } 177 + 178 + /* 179 + * Checks that two bio crypt contexts are compatible, and also 180 + * that their data_unit_nums are continuous (and can hence be merged) 181 + * in the order @bc1 followed by @bc2. 
182 + */ 183 + bool bio_crypt_ctx_mergeable(struct bio_crypt_ctx *bc1, unsigned int bc1_bytes, 184 + struct bio_crypt_ctx *bc2) 185 + { 186 + if (!bio_crypt_ctx_compatible(bc1, bc2)) 187 + return false; 188 + 189 + return !bc1 || bio_crypt_dun_is_contiguous(bc1, bc1_bytes, bc2->bc_dun); 190 + } 191 + 192 + /* Check that all I/O segments are data unit aligned. */ 193 + static bool bio_crypt_check_alignment(struct bio *bio) 194 + { 195 + const unsigned int data_unit_size = 196 + bio->bi_crypt_context->bc_key->crypto_cfg.data_unit_size; 197 + struct bvec_iter iter; 198 + struct bio_vec bv; 199 + 200 + bio_for_each_segment(bv, bio, iter) { 201 + if (!IS_ALIGNED(bv.bv_len | bv.bv_offset, data_unit_size)) 202 + return false; 203 + } 204 + 205 + return true; 206 + } 207 + 208 + blk_status_t __blk_crypto_init_request(struct request *rq) 209 + { 210 + return blk_ksm_get_slot_for_key(rq->q->ksm, rq->crypt_ctx->bc_key, 211 + &rq->crypt_keyslot); 212 + } 213 + 214 + /** 215 + * __blk_crypto_free_request - Uninitialize the crypto fields of a request. 216 + * 217 + * @rq: The request whose crypto fields to uninitialize. 218 + * 219 + * Completely uninitializes the crypto fields of a request. If a keyslot has 220 + * been programmed into some inline encryption hardware, that keyslot is 221 + * released. The rq->crypt_ctx is also freed. 222 + */ 223 + void __blk_crypto_free_request(struct request *rq) 224 + { 225 + blk_ksm_put_slot(rq->crypt_keyslot); 226 + mempool_free(rq->crypt_ctx, bio_crypt_ctx_pool); 227 + blk_crypto_rq_set_defaults(rq); 228 + } 229 + 230 + /** 231 + * __blk_crypto_bio_prep - Prepare bio for inline encryption 232 + * 233 + * @bio_ptr: pointer to original bio pointer 234 + * 235 + * If the bio crypt context provided for the bio is supported by the underlying 236 + * device's inline encryption hardware, do nothing. 237 + * 238 + * Otherwise, try to perform en/decryption for this bio by falling back to the 239 + * kernel crypto API. 
When the crypto API fallback is used for encryption, 240 + * blk-crypto may choose to split the bio into 2 - the first one that will 241 + * continue to be processed and the second one that will be resubmitted via 242 + * generic_make_request. A bounce bio will be allocated to encrypt the contents 243 + * of the aforementioned "first one", and *bio_ptr will be updated to this 244 + * bounce bio. 245 + * 246 + * Caller must ensure bio has bio_crypt_ctx. 247 + * 248 + * Return: true on success; false on error (and bio->bi_status will be set 249 + * appropriately, and bio_endio() will have been called so bio 250 + * submission should abort). 251 + */ 252 + bool __blk_crypto_bio_prep(struct bio **bio_ptr) 253 + { 254 + struct bio *bio = *bio_ptr; 255 + const struct blk_crypto_key *bc_key = bio->bi_crypt_context->bc_key; 256 + 257 + /* Error if bio has no data. */ 258 + if (WARN_ON_ONCE(!bio_has_data(bio))) { 259 + bio->bi_status = BLK_STS_IOERR; 260 + goto fail; 261 + } 262 + 263 + if (!bio_crypt_check_alignment(bio)) { 264 + bio->bi_status = BLK_STS_IOERR; 265 + goto fail; 266 + } 267 + 268 + /* 269 + * Success if device supports the encryption context, or if we succeeded 270 + * in falling back to the crypto API. 
271 + */ 272 + if (blk_ksm_crypto_cfg_supported(bio->bi_disk->queue->ksm, 273 + &bc_key->crypto_cfg)) 274 + return true; 275 + 276 + if (blk_crypto_fallback_bio_prep(bio_ptr)) 277 + return true; 278 + fail: 279 + bio_endio(*bio_ptr); 280 + return false; 281 + } 282 + 283 + /** 284 + * __blk_crypto_rq_bio_prep - Prepare a request's crypt_ctx when its first bio 285 + * is inserted 286 + * 287 + * @rq: The request to prepare 288 + * @bio: The first bio being inserted into the request 289 + * @gfp_mask: gfp mask 290 + */ 291 + void __blk_crypto_rq_bio_prep(struct request *rq, struct bio *bio, 292 + gfp_t gfp_mask) 293 + { 294 + if (!rq->crypt_ctx) 295 + rq->crypt_ctx = mempool_alloc(bio_crypt_ctx_pool, gfp_mask); 296 + *rq->crypt_ctx = *bio->bi_crypt_context; 297 + } 298 + 299 + /** 300 + * blk_crypto_init_key() - Prepare a key for use with blk-crypto 301 + * @blk_key: Pointer to the blk_crypto_key to initialize. 302 + * @raw_key: Pointer to the raw key. Must be the correct length for the chosen 303 + * @crypto_mode; see blk_crypto_modes[]. 304 + * @crypto_mode: identifier for the encryption algorithm to use 305 + * @dun_bytes: number of bytes that will be used to specify the DUN when this 306 + * key is used 307 + * @data_unit_size: the data unit size to use for en/decryption 308 + * 309 + * Return: 0 on success, -errno on failure. The caller is responsible for 310 + * zeroizing both blk_key and raw_key when done with them. 
311 + */ 312 + int blk_crypto_init_key(struct blk_crypto_key *blk_key, const u8 *raw_key, 313 + enum blk_crypto_mode_num crypto_mode, 314 + unsigned int dun_bytes, 315 + unsigned int data_unit_size) 316 + { 317 + const struct blk_crypto_mode *mode; 318 + 319 + memset(blk_key, 0, sizeof(*blk_key)); 320 + 321 + if (crypto_mode >= ARRAY_SIZE(blk_crypto_modes)) 322 + return -EINVAL; 323 + 324 + mode = &blk_crypto_modes[crypto_mode]; 325 + if (mode->keysize == 0) 326 + return -EINVAL; 327 + 328 + if (dun_bytes == 0 || dun_bytes > BLK_CRYPTO_MAX_IV_SIZE) 329 + return -EINVAL; 330 + 331 + if (!is_power_of_2(data_unit_size)) 332 + return -EINVAL; 333 + 334 + blk_key->crypto_cfg.crypto_mode = crypto_mode; 335 + blk_key->crypto_cfg.dun_bytes = dun_bytes; 336 + blk_key->crypto_cfg.data_unit_size = data_unit_size; 337 + blk_key->data_unit_size_bits = ilog2(data_unit_size); 338 + blk_key->size = mode->keysize; 339 + memcpy(blk_key->raw, raw_key, mode->keysize); 340 + 341 + return 0; 342 + } 343 + 344 + /* 345 + * Check if bios with @cfg can be en/decrypted by blk-crypto (i.e. either the 346 + * request queue it's submitted to supports inline crypto, or the 347 + * blk-crypto-fallback is enabled and supports the cfg). 348 + */ 349 + bool blk_crypto_config_supported(struct request_queue *q, 350 + const struct blk_crypto_config *cfg) 351 + { 352 + return IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) || 353 + blk_ksm_crypto_cfg_supported(q->ksm, cfg); 354 + } 355 + 356 + /** 357 + * blk_crypto_start_using_key() - Start using a blk_crypto_key on a device 358 + * @key: A key to use on the device 359 + * @q: the request queue for the device 360 + * 361 + * Upper layers must call this function to ensure that either the hardware 362 + * supports the key's crypto settings, or the crypto API fallback has transforms 363 + * for the needed mode allocated and ready to go. 
This function may allocate 364 + * an skcipher, and *should not* be called from the data path, since that might 365 + * cause a deadlock 366 + * 367 + * Return: 0 on success; -ENOPKG if the hardware doesn't support the key and 368 + * blk-crypto-fallback is either disabled or the needed algorithm 369 + * is disabled in the crypto API; or another -errno code. 370 + */ 371 + int blk_crypto_start_using_key(const struct blk_crypto_key *key, 372 + struct request_queue *q) 373 + { 374 + if (blk_ksm_crypto_cfg_supported(q->ksm, &key->crypto_cfg)) 375 + return 0; 376 + return blk_crypto_fallback_start_using_mode(key->crypto_cfg.crypto_mode); 377 + } 378 + 379 + /** 380 + * blk_crypto_evict_key() - Evict a key from any inline encryption hardware 381 + * it may have been programmed into 382 + * @q: The request queue who's associated inline encryption hardware this key 383 + * might have been programmed into 384 + * @key: The key to evict 385 + * 386 + * Upper layers (filesystems) must call this function to ensure that a key is 387 + * evicted from any hardware that it might have been programmed into. The key 388 + * must not be in use by any in-flight IO when this function is called. 389 + * 390 + * Return: 0 on success or if key is not present in the q's ksm, -err on error. 391 + */ 392 + int blk_crypto_evict_key(struct request_queue *q, 393 + const struct blk_crypto_key *key) 394 + { 395 + if (blk_ksm_crypto_cfg_supported(q->ksm, &key->crypto_cfg)) 396 + return blk_ksm_evict_key(q->ksm, key); 397 + 398 + /* 399 + * If the request queue's associated inline encryption hardware didn't 400 + * have support for the key, then the key might have been programmed 401 + * into the fallback keyslot manager, so try to evict from there. 402 + */ 403 + return blk_crypto_fallback_evict_key(key); 404 + }
block/blk-exec.c (+1 -1)
··· 55 55 rq->rq_disk = bd_disk; 56 56 rq->end_io = done; 57 57 58 - blk_account_io_start(rq, true); 58 + blk_account_io_start(rq); 59 59 60 60 /* 61 61 * don't check dying flag for MQ because the request won't
block/blk-flush.c (+2 -24)
··· 258 258 blk_flush_complete_seq(rq, fq, seq, error); 259 259 } 260 260 261 - fq->flush_queue_delayed = 0; 262 261 spin_unlock_irqrestore(&fq->mq_flush_lock, flags); 263 262 } 264 263 ··· 432 433 * blkdev_issue_flush - queue a flush 433 434 * @bdev: blockdev to issue flush for 434 435 * @gfp_mask: memory allocation flags (for bio_alloc) 435 - * @error_sector: error sector 436 436 * 437 437 * Description: 438 - * Issue a flush for the block device in question. Caller can supply 439 - * room for storing the error offset in case of a flush error, if they 440 - * wish to. 438 + * Issue a flush for the block device in question. 441 439 */ 442 - int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask, 443 - sector_t *error_sector) 440 + int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask) 444 441 { 445 - struct request_queue *q; 446 442 struct bio *bio; 447 443 int ret = 0; 448 - 449 - if (bdev->bd_disk == NULL) 450 - return -ENXIO; 451 - 452 - q = bdev_get_queue(bdev); 453 - if (!q) 454 - return -ENXIO; 455 444 456 445 bio = bio_alloc(gfp_mask, 0); 457 446 bio_set_dev(bio, bdev); 458 447 bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH; 459 448 460 449 ret = submit_bio_wait(bio); 461 - 462 - /* 463 - * The driver must store the error location in ->bi_sector, if 464 - * it supports it. For non-stacked drivers, this should be 465 - * copied from blk_rq_pos(rq). 466 - */ 467 - if (error_sector) 468 - *error_sector = bio->bi_iter.bi_sector; 469 - 470 450 bio_put(bio); 471 451 return ret; 472 452 }
block/blk-integrity.c (+7)
··· 409 409 bi->tag_size = template->tag_size; 410 410 411 411 disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; 412 + 413 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION 414 + if (disk->queue->ksm) { 415 + pr_warn("blk-integrity: Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); 416 + blk_ksm_unregister(disk->queue); 417 + } 418 + #endif 412 419 } 413 420 EXPORT_SYMBOL(blk_integrity_register); 414 421
block/blk-iocost.c (+65 -21)
··· 260 260 VTIME_PER_SEC_SHIFT = 37, 261 261 VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT, 262 262 VTIME_PER_USEC = VTIME_PER_SEC / USEC_PER_SEC, 263 + VTIME_PER_NSEC = VTIME_PER_SEC / NSEC_PER_SEC, 263 264 264 265 /* bound vrate adjustments within two orders of magnitude */ 265 266 VRATE_MIN_PPM = 10000, /* 1% */ ··· 1207 1206 return HRTIMER_NORESTART; 1208 1207 } 1209 1208 1210 - static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now, u64 cost) 1209 + static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now) 1211 1210 { 1212 1211 struct ioc *ioc = iocg->ioc; 1213 1212 struct blkcg_gq *blkg = iocg_to_blkg(iocg); 1214 1213 u64 vtime = atomic64_read(&iocg->vtime); 1215 1214 u64 vmargin = ioc->margin_us * now->vrate; 1216 1215 u64 margin_ns = ioc->margin_us * NSEC_PER_USEC; 1217 - u64 expires, oexpires; 1216 + u64 delta_ns, expires, oexpires; 1218 1217 u32 hw_inuse; 1219 1218 1220 1219 lockdep_assert_held(&iocg->waitq.lock); ··· 1237 1236 return false; 1238 1237 1239 1238 /* use delay */ 1240 - if (cost) { 1241 - u64 cost_ns = DIV64_U64_ROUND_UP(cost * NSEC_PER_USEC, 1242 - now->vrate); 1243 - blkcg_add_delay(blkg, now->now_ns, cost_ns); 1244 - } 1245 - blkcg_use_delay(blkg); 1246 - 1247 - expires = now->now_ns + DIV64_U64_ROUND_UP(vtime - now->vnow, 1248 - now->vrate) * NSEC_PER_USEC; 1239 + delta_ns = DIV64_U64_ROUND_UP(vtime - now->vnow, 1240 + now->vrate) * NSEC_PER_USEC; 1241 + blkcg_set_delay(blkg, delta_ns); 1242 + expires = now->now_ns + delta_ns; 1249 1243 1250 1244 /* if already active and close enough, don't bother */ 1251 1245 oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer)); ··· 1261 1265 1262 1266 spin_lock_irqsave(&iocg->waitq.lock, flags); 1263 1267 ioc_now(iocg->ioc, &now); 1264 - iocg_kick_delay(iocg, &now, 0); 1268 + iocg_kick_delay(iocg, &now); 1265 1269 spin_unlock_irqrestore(&iocg->waitq.lock, flags); 1266 1270 1267 1271 return HRTIMER_NORESTART; ··· 1379 1383 if (waitqueue_active(&iocg->waitq) || 
iocg->abs_vdebt) { 1380 1384 /* might be oversleeping vtime / hweight changes, kick */ 1381 1385 iocg_kick_waitq(iocg, &now); 1382 - iocg_kick_delay(iocg, &now, 0); 1386 + iocg_kick_delay(iocg, &now); 1383 1387 } else if (iocg_is_idle(iocg)) { 1384 1388 /* no waiter and idle, deactivate */ 1385 1389 iocg->last_inuse = iocg->inuse; ··· 1539 1543 if (rq_wait_pct > RQ_WAIT_BUSY_PCT || 1540 1544 missed_ppm[READ] > ppm_rthr || 1541 1545 missed_ppm[WRITE] > ppm_wthr) { 1546 + /* clearly missing QoS targets, slow down vrate */ 1542 1547 ioc->busy_level = max(ioc->busy_level, 0); 1543 1548 ioc->busy_level++; 1544 1549 } else if (rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 && 1545 1550 missed_ppm[READ] <= ppm_rthr * UNBUSY_THR_PCT / 100 && 1546 1551 missed_ppm[WRITE] <= ppm_wthr * UNBUSY_THR_PCT / 100) { 1547 - /* take action iff there is contention */ 1548 - if (nr_shortages && !nr_lagging) { 1552 + /* QoS targets are being met with >25% margin */ 1553 + if (nr_shortages) { 1554 + /* 1555 + * We're throttling while the device has spare 1556 + * capacity. If vrate was being slowed down, stop. 1557 + */ 1549 1558 ioc->busy_level = min(ioc->busy_level, 0); 1550 - /* redistribute surpluses first */ 1551 - if (!nr_surpluses) 1559 + 1560 + /* 1561 + * If there are IOs spanning multiple periods, wait 1562 + * them out before pushing the device harder. If 1563 + * there are surpluses, let redistribution work it 1564 + * out first. 1565 + */ 1566 + if (!nr_lagging && !nr_surpluses) 1552 1567 ioc->busy_level--; 1568 + } else { 1569 + /* 1570 + * Nobody is being throttled and the users aren't 1571 + * issuing enough IOs to saturate the device. We 1572 + * simply don't know how close the device is to 1573 + * saturation. Coast. 
1574 + */ 1575 + ioc->busy_level = 0; 1553 1576 } 1554 1577 } else { 1578 + /* inside the hysterisis margin, we're good */ 1555 1579 ioc->busy_level = 0; 1556 1580 } 1557 1581 ··· 1694 1678 return cost; 1695 1679 } 1696 1680 1681 + static void calc_size_vtime_cost_builtin(struct request *rq, struct ioc *ioc, 1682 + u64 *costp) 1683 + { 1684 + unsigned int pages = blk_rq_stats_sectors(rq) >> IOC_SECT_TO_PAGE_SHIFT; 1685 + 1686 + switch (req_op(rq)) { 1687 + case REQ_OP_READ: 1688 + *costp = pages * ioc->params.lcoefs[LCOEF_RPAGE]; 1689 + break; 1690 + case REQ_OP_WRITE: 1691 + *costp = pages * ioc->params.lcoefs[LCOEF_WPAGE]; 1692 + break; 1693 + default: 1694 + *costp = 0; 1695 + } 1696 + } 1697 + 1698 + static u64 calc_size_vtime_cost(struct request *rq, struct ioc *ioc) 1699 + { 1700 + u64 cost; 1701 + 1702 + calc_size_vtime_cost_builtin(rq, ioc, &cost); 1703 + return cost; 1704 + } 1705 + 1697 1706 static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio) 1698 1707 { 1699 1708 struct blkcg_gq *blkg = bio->bi_blkg; ··· 1803 1762 */ 1804 1763 if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) { 1805 1764 iocg->abs_vdebt += abs_cost; 1806 - if (iocg_kick_delay(iocg, &now, cost)) 1765 + if (iocg_kick_delay(iocg, &now)) 1807 1766 blkcg_schedule_throttle(rqos->q, 1808 1767 (bio->bi_opf & REQ_SWAP) == REQ_SWAP); 1809 1768 spin_unlock_irq(&iocg->waitq.lock); ··· 1891 1850 spin_lock_irqsave(&iocg->waitq.lock, flags); 1892 1851 if (likely(!list_empty(&iocg->active_list))) { 1893 1852 iocg->abs_vdebt += abs_cost; 1894 - iocg_kick_delay(iocg, &now, cost); 1853 + iocg_kick_delay(iocg, &now); 1895 1854 } else { 1896 1855 iocg_commit_bio(iocg, bio, cost); 1897 1856 } ··· 1909 1868 static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq) 1910 1869 { 1911 1870 struct ioc *ioc = rqos_to_ioc(rqos); 1912 - u64 on_q_ns, rq_wait_ns; 1871 + u64 on_q_ns, rq_wait_ns, size_nsec; 1913 1872 int pidx, rw; 1914 1873 1915 1874 if (!ioc->enabled || 
!rq->alloc_time_ns || !rq->start_time_ns) ··· 1930 1889 1931 1890 on_q_ns = ktime_get_ns() - rq->alloc_time_ns; 1932 1891 rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns; 1892 + size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC); 1933 1893 1934 - if (on_q_ns <= ioc->params.qos[pidx] * NSEC_PER_USEC) 1894 + if (on_q_ns <= size_nsec || 1895 + on_q_ns - size_nsec <= ioc->params.qos[pidx] * NSEC_PER_USEC) 1935 1896 this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_met); 1936 1897 else 1937 1898 this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_missed); ··· 2340 2297 spin_lock_irq(&ioc->lock); 2341 2298 2342 2299 if (enable) { 2300 + blk_stat_enable_accounting(ioc->rqos.q); 2343 2301 blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q); 2344 2302 ioc->enabled = true; 2345 2303 } else {
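The new `calc_size_vtime_cost()` charges a request a cost linear in its size, a per-operation coefficient times the page count, and zero for ops with no data payload; `ioc_rqos_done()` then subtracts that size-derived time from the measured on-queue time, so large I/Os are not scored as latency misses merely for being large. A sketch of the per-op cost table with made-up coefficient values (the real ones come from `ioc->params.lcoefs[]`):

```c
#include <assert.h>
#include <stdint.h>

enum req_op { OP_READ, OP_WRITE, OP_FLUSH };

/* Illustrative linear coefficients: vtime cost per page. */
#define LCOEF_RPAGE 100u
#define LCOEF_WPAGE 250u

/*
 * Size-proportional cost of a request, mirroring the shape of
 * calc_size_vtime_cost_builtin(): pages times a per-op coefficient,
 * zero for ops that carry no payload cost.
 */
static uint64_t size_vtime_cost(enum req_op op, unsigned int pages)
{
	switch (op) {
	case OP_READ:
		return (uint64_t)pages * LCOEF_RPAGE;
	case OP_WRITE:
		return (uint64_t)pages * LCOEF_WPAGE;
	default:
		return 0;
	}
}
```

Dividing this cost by `VTIME_PER_NSEC` (the constant this patch adds) converts it to nanoseconds, which is the `size_nsec` deducted in the met/missed comparison.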
+5 -10
block/blk-map.c
··· 257 257 static struct bio *bio_map_user_iov(struct request_queue *q, 258 258 struct iov_iter *iter, gfp_t gfp_mask) 259 259 { 260 + unsigned int max_sectors = queue_max_hw_sectors(q); 260 261 int j; 261 262 struct bio *bio; 262 263 int ret; ··· 295 294 if (n > bytes) 296 295 n = bytes; 297 296 298 - if (!__bio_add_pc_page(q, bio, page, n, offs, 299 - &same_page)) { 297 + if (!bio_add_hw_page(q, bio, page, n, offs, 298 + max_sectors, &same_page)) { 300 299 if (same_page) 301 300 put_page(page); 302 301 break; ··· 550 549 rq->biotail->bi_next = *bio; 551 550 rq->biotail = *bio; 552 551 rq->__data_len += (*bio)->bi_iter.bi_size; 552 + bio_crypt_free_ctx(*bio); 553 553 } 554 554 555 555 return 0; ··· 656 654 bio = rq->bio; 657 655 } while (iov_iter_count(&i)); 658 656 659 - if (!bio_flagged(bio, BIO_USER_MAPPED)) 660 - rq->rq_flags |= RQF_COPY_USER; 661 657 return 0; 662 658 663 659 unmap_rq: ··· 731 731 { 732 732 int reading = rq_data_dir(rq) == READ; 733 733 unsigned long addr = (unsigned long) kbuf; 734 - int do_copy = 0; 735 734 struct bio *bio, *orig_bio; 736 735 int ret; 737 736 ··· 739 740 if (!len || !kbuf) 740 741 return -EINVAL; 741 742 742 - do_copy = !blk_rq_aligned(q, addr, len) || object_is_on_stack(kbuf); 743 - if (do_copy) 743 + if (!blk_rq_aligned(q, addr, len) || object_is_on_stack(kbuf)) 744 744 bio = bio_copy_kern(q, kbuf, len, gfp_mask, reading); 745 745 else 746 746 bio = bio_map_kern(q, kbuf, len, gfp_mask); ··· 749 751 750 752 bio->bi_opf &= ~REQ_OP_MASK; 751 753 bio->bi_opf |= req_op(rq); 752 - 753 - if (do_copy) 754 - rq->rq_flags |= RQF_COPY_USER; 755 754 756 755 orig_bio = bio; 757 756 ret = blk_rq_append_bio(rq, &bio);
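The blk-map.c hunk drops the `do_copy` local and the `RQF_COPY_USER` flag: the copy-vs-map decision now feeds directly into the call choice. A small sketch of that decision shape (the alignment mask and the on-stack flag are illustrative; the kernel checks `blk_rq_aligned()` and `object_is_on_stack()`):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative DMA alignment mask; a real queue carries its own. */
#define DMA_ALIGN_MASK 0x3UL

/* Buffer is unusable for direct mapping if address or length is
 * misaligned for the device's DMA constraints. */
static int buffer_unaligned(uintptr_t addr, unsigned long len)
{
    return (addr & DMA_ALIGN_MASK) || (len & DMA_ALIGN_MASK);
}

/*
 * Decision in the shape of the cleaned-up blk_rq_map_kern(): no flag
 * is recorded any more, the result simply selects bio_copy_kern()
 * (returns 1) versus bio_map_kern() (returns 0).
 */
static int must_copy(uintptr_t addr, unsigned long len, int on_stack)
{
    return buffer_unaligned(addr, len) || on_stack;
}
```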
+25 -51
block/blk-merge.c
··· 336 336 /* there isn't chance to merge the splitted bio */ 337 337 split->bi_opf |= REQ_NOMERGE; 338 338 339 - /* 340 - * Since we're recursing into make_request here, ensure 341 - * that we mark this bio as already having entered the queue. 342 - * If not, and the queue is going away, we can get stuck 343 - * forever on waiting for the queue reference to drop. But 344 - * that will never happen, as we're already holding a 345 - * reference to it. 346 - */ 347 - bio_set_flag(*bio, BIO_QUEUE_ENTERED); 348 - 349 339 bio_chain(split, *bio); 350 340 trace_block_split(q, split, (*bio)->bi_iter.bi_sector); 351 341 generic_make_request(*bio); ··· 509 519 * map a request to scatterlist, return number of sg entries setup. Caller 510 520 * must make sure sg can hold rq->nr_phys_segments entries 511 521 */ 512 - int blk_rq_map_sg(struct request_queue *q, struct request *rq, 513 - struct scatterlist *sglist) 522 + int __blk_rq_map_sg(struct request_queue *q, struct request *rq, 523 + struct scatterlist *sglist, struct scatterlist **last_sg) 514 524 { 515 - struct scatterlist *sg = NULL; 516 525 int nsegs = 0; 517 526 518 527 if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) 519 - nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, &sg); 528 + nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, last_sg); 520 529 else if (rq->bio && bio_op(rq->bio) == REQ_OP_WRITE_SAME) 521 - nsegs = __blk_bvec_map_sg(bio_iovec(rq->bio), sglist, &sg); 530 + nsegs = __blk_bvec_map_sg(bio_iovec(rq->bio), sglist, last_sg); 522 531 else if (rq->bio) 523 - nsegs = __blk_bios_map_sg(q, rq->bio, sglist, &sg); 532 + nsegs = __blk_bios_map_sg(q, rq->bio, sglist, last_sg); 524 533 525 - if (unlikely(rq->rq_flags & RQF_COPY_USER) && 526 - (blk_rq_bytes(rq) & q->dma_pad_mask)) { 527 - unsigned int pad_len = 528 - (q->dma_pad_mask & ~blk_rq_bytes(rq)) + 1; 529 - 530 - sg->length += pad_len; 531 - rq->extra_len += pad_len; 532 - } 533 - 534 - if (q->dma_drain_size && q->dma_drain_needed(rq)) { 535 - if 
(op_is_write(req_op(rq))) 536 - memset(q->dma_drain_buffer, 0, q->dma_drain_size); 537 - 538 - sg_unmark_end(sg); 539 - sg = sg_next(sg); 540 - sg_set_page(sg, virt_to_page(q->dma_drain_buffer), 541 - q->dma_drain_size, 542 - ((unsigned long)q->dma_drain_buffer) & 543 - (PAGE_SIZE - 1)); 544 - nsegs++; 545 - rq->extra_len += q->dma_drain_size; 546 - } 547 - 548 - if (sg) 549 - sg_mark_end(sg); 534 + if (*last_sg) 535 + sg_mark_end(*last_sg); 550 536 551 537 /* 552 538 * Something must have been wrong if the figured number of ··· 532 566 533 567 return nsegs; 534 568 } 535 - EXPORT_SYMBOL(blk_rq_map_sg); 569 + EXPORT_SYMBOL(__blk_rq_map_sg); 536 570 537 571 static inline int ll_new_hw_segment(struct request *req, struct bio *bio, 538 572 unsigned int nr_phys_segs) ··· 562 596 if (blk_integrity_rq(req) && 563 597 integrity_req_gap_back_merge(req, bio)) 564 598 return 0; 599 + if (!bio_crypt_ctx_back_mergeable(req, bio)) 600 + return 0; 565 601 if (blk_rq_sectors(req) + bio_sectors(bio) > 566 602 blk_rq_get_max_sectors(req, blk_rq_pos(req))) { 567 603 req_set_nomerge(req->q, req); ··· 579 611 return 0; 580 612 if (blk_integrity_rq(req) && 581 613 integrity_req_gap_front_merge(req, bio)) 614 + return 0; 615 + if (!bio_crypt_ctx_front_mergeable(req, bio)) 582 616 return 0; 583 617 if (blk_rq_sectors(req) + bio_sectors(bio) > 584 618 blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) { ··· 631 661 if (blk_integrity_merge_rq(q, req, next) == false) 632 662 return 0; 633 663 664 + if (!bio_crypt_ctx_merge_rq(req, next)) 665 + return 0; 666 + 634 667 /* Merge is OK... 
*/ 635 668 req->nr_phys_segments = total_phys_segments; 636 669 return 1; ··· 669 696 rq->rq_flags |= RQF_MIXED_MERGE; 670 697 } 671 698 672 - static void blk_account_io_merge(struct request *req) 699 + static void blk_account_io_merge_request(struct request *req) 673 700 { 674 701 if (blk_do_io_stat(req)) { 675 - struct hd_struct *part; 676 - 677 702 part_stat_lock(); 678 - part = req->part; 679 - 680 - part_dec_in_flight(req->q, part, rq_data_dir(req)); 681 - 682 - hd_struct_put(part); 703 + part_stat_inc(req->part, merges[op_stat_group(req_op(req))]); 683 704 part_stat_unlock(); 705 + 706 + hd_struct_put(req->part); 684 707 } 685 708 } 709 + 686 710 /* 687 711 * Two cases of handling DISCARD merge: 688 712 * If max_discard_segments > 1, the driver takes every bio ··· 791 821 /* 792 822 * 'next' is going away, so update stats accordingly 793 823 */ 794 - blk_account_io_merge(next); 824 + blk_account_io_merge_request(next); 795 825 796 826 /* 797 827 * ownership of bio passed from next to req, return 'next' for ··· 853 883 854 884 /* only merge integrity protected bio into ditto rq */ 855 885 if (blk_integrity_merge_bio(rq->q, rq, bio) == false) 886 + return false; 887 + 888 + /* Only merge if the crypt contexts are compatible */ 889 + if (!bio_crypt_rq_ctx_compatible(rq, bio)) 856 890 return false; 857 891 858 892 /* must be using the same buffer */
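With DMA drain handling moved out of the core, `__blk_rq_map_sg()` reports the last mapped entry through a `**last_sg` out-parameter so a driver can append its own drain entry before marking the end. A toy sketch of that out-parameter pattern over a plain array (the `toy_sg` type and lengths are made up; the kernel uses `struct scatterlist` and `sg_mark_end()`):

```c
#include <assert.h>
#include <stddef.h>

struct toy_sg { int len; int end; };

/* Map segment lengths into sglist; report the last written entry via
 * *last_sg instead of terminating the table unconditionally. */
static int toy_map_sg(const int *segs, int nsegs,
                      struct toy_sg *sglist, struct toy_sg **last_sg)
{
    int i;

    for (i = 0; i < nsegs; i++) {
        sglist[i].len = segs[i];
        sglist[i].end = 0;
        *last_sg = &sglist[i];
    }
    return nsegs;
}

/* Driver-side use: map, append an optional drain entry after the last
 * mapped element, then mark the end. Returns the final entry count. */
static int toy_map_with_drain(const int *segs, int nsegs,
                              struct toy_sg *sglist, int drain_len)
{
    struct toy_sg *last = NULL;
    int n = toy_map_sg(segs, nsegs, sglist, &last);

    if (drain_len) {
        sglist[n].len = drain_len;
        sglist[n].end = 0;
        last = &sglist[n];
        n++;
    }
    if (last)
        last->end = 1;  /* sg_mark_end() equivalent */
    return n;
}

/* Exercise the pattern: two data segments plus a 256-byte drain. */
static int demo_entries(void)
{
    const int segs[2] = {4096, 512};
    struct toy_sg sgl[4];
    int n = toy_map_with_drain(segs, 2, sgl, 256);

    if (sgl[2].len != 256 || sgl[2].end != 1 || sgl[1].end != 0)
        return -1;
    return n;
}
```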
+2 -1
block/blk-mq-debugfs.c
··· 213 213 HCTX_STATE_NAME(STOPPED), 214 214 HCTX_STATE_NAME(TAG_ACTIVE), 215 215 HCTX_STATE_NAME(SCHED_RESTART), 216 + HCTX_STATE_NAME(INACTIVE), 216 217 }; 217 218 #undef HCTX_STATE_NAME 218 219 ··· 240 239 HCTX_FLAG_NAME(TAG_SHARED), 241 240 HCTX_FLAG_NAME(BLOCKING), 242 241 HCTX_FLAG_NAME(NO_SCHED), 242 + HCTX_FLAG_NAME(STACKING), 243 243 }; 244 244 #undef HCTX_FLAG_NAME 245 245 ··· 294 292 RQF_NAME(MQ_INFLIGHT), 295 293 RQF_NAME(DONTPREP), 296 294 RQF_NAME(PREEMPT), 297 - RQF_NAME(COPY_USER), 298 295 RQF_NAME(FAILED), 299 296 RQF_NAME(QUIET), 300 297 RQF_NAME(ELVPRIV),
+69 -13
block/blk-mq-sched.c
··· 80 80 blk_mq_run_hw_queue(hctx, true); 81 81 } 82 82 83 + #define BLK_MQ_BUDGET_DELAY 3 /* ms units */ 84 + 83 85 /* 84 86 * Only SCSI implements .get_budget and .put_budget, and SCSI restarts 85 87 * its queue by itself in its completion handler, so we don't need to 86 88 * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE. 89 + * 90 + * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to 91 + * be run again. This is necessary to avoid starving flushes. 87 92 */ 88 - static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) 93 + static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) 89 94 { 90 95 struct request_queue *q = hctx->queue; 91 96 struct elevator_queue *e = q->elevator; 92 97 LIST_HEAD(rq_list); 98 + int ret = 0; 93 99 94 100 do { 95 101 struct request *rq; ··· 103 97 if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) 104 98 break; 105 99 100 + if (!list_empty_careful(&hctx->dispatch)) { 101 + ret = -EAGAIN; 102 + break; 103 + } 104 + 106 105 if (!blk_mq_get_dispatch_budget(hctx)) 107 106 break; 108 107 109 108 rq = e->type->ops.dispatch_request(hctx); 110 109 if (!rq) { 111 110 blk_mq_put_dispatch_budget(hctx); 111 + /* 112 + * We're releasing without dispatching. Holding the 113 + * budget could have blocked any "hctx"s with the 114 + * same queue and if we didn't dispatch then there's 115 + * no guarantee anyone will kick the queue. Kick it 116 + * ourselves. 
117 + */ 118 + blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY); 112 119 break; 113 120 } 114 121 ··· 132 113 */ 133 114 list_add(&rq->queuelist, &rq_list); 134 115 } while (blk_mq_dispatch_rq_list(q, &rq_list, true)); 116 + 117 + return ret; 135 118 } 136 119 137 120 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx, ··· 151 130 * Only SCSI implements .get_budget and .put_budget, and SCSI restarts 152 131 * its queue by itself in its completion handler, so we don't need to 153 132 * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE. 133 + * 134 + * Returns -EAGAIN if hctx->dispatch was found non-empty and run_work has to 135 + * be run again. This is necessary to avoid starving flushes. 154 136 */ 155 - static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) 137 + static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) 156 138 { 157 139 struct request_queue *q = hctx->queue; 158 140 LIST_HEAD(rq_list); 159 141 struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from); 142 + int ret = 0; 160 143 161 144 do { 162 145 struct request *rq; 146 + 147 + if (!list_empty_careful(&hctx->dispatch)) { 148 + ret = -EAGAIN; 149 + break; 150 + } 163 151 164 152 if (!sbitmap_any_bit_set(&hctx->ctx_map)) 165 153 break; ··· 179 149 rq = blk_mq_dequeue_from_ctx(hctx, ctx); 180 150 if (!rq) { 181 151 blk_mq_put_dispatch_budget(hctx); 152 + /* 153 + * We're releasing without dispatching. Holding the 154 + * budget could have blocked any "hctx"s with the 155 + * same queue and if we didn't dispatch then there's 156 + * no guarantee anyone will kick the queue. Kick it 157 + * ourselves. 
158 + */ 159 + blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY); 182 160 break; 183 161 } 184 162 ··· 203 165 } while (blk_mq_dispatch_rq_list(q, &rq_list, true)); 204 166 205 167 WRITE_ONCE(hctx->dispatch_from, ctx); 168 + return ret; 206 169 } 207 170 208 - void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) 171 + static int __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) 209 172 { 210 173 struct request_queue *q = hctx->queue; 211 174 struct elevator_queue *e = q->elevator; 212 175 const bool has_sched_dispatch = e && e->type->ops.dispatch_request; 176 + int ret = 0; 213 177 LIST_HEAD(rq_list); 214 - 215 - /* RCU or SRCU read lock is needed before checking quiesced flag */ 216 - if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))) 217 - return; 218 - 219 - hctx->run++; 220 178 221 179 /* 222 180 * If we have previous entries on our dispatch list, grab them first for ··· 242 208 blk_mq_sched_mark_restart_hctx(hctx); 243 209 if (blk_mq_dispatch_rq_list(q, &rq_list, false)) { 244 210 if (has_sched_dispatch) 245 - blk_mq_do_dispatch_sched(hctx); 211 + ret = blk_mq_do_dispatch_sched(hctx); 246 212 else 247 - blk_mq_do_dispatch_ctx(hctx); 213 + ret = blk_mq_do_dispatch_ctx(hctx); 248 214 } 249 215 } else if (has_sched_dispatch) { 250 - blk_mq_do_dispatch_sched(hctx); 216 + ret = blk_mq_do_dispatch_sched(hctx); 251 217 } else if (hctx->dispatch_busy) { 252 218 /* dequeue request one by one from sw queue if queue is busy */ 253 - blk_mq_do_dispatch_ctx(hctx); 219 + ret = blk_mq_do_dispatch_ctx(hctx); 254 220 } else { 255 221 blk_mq_flush_busy_ctxs(hctx, &rq_list); 256 222 blk_mq_dispatch_rq_list(q, &rq_list, false); 223 + } 224 + 225 + return ret; 226 + } 227 + 228 + void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) 229 + { 230 + struct request_queue *q = hctx->queue; 231 + 232 + /* RCU or SRCU read lock is needed before checking quiesced flag */ 233 + if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))) 
234 + return; 235 + 236 + hctx->run++; 237 + 238 + /* 239 + * A return of -EAGAIN is an indication that hctx->dispatch is not 240 + * empty and we must run again in order to avoid starving flushes. 241 + */ 242 + if (__blk_mq_sched_dispatch_requests(hctx) == -EAGAIN) { 243 + if (__blk_mq_sched_dispatch_requests(hctx) == -EAGAIN) 244 + blk_mq_run_hw_queue(hctx, true); 257 245 } 258 246 } 259 247
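The blk-mq-sched.c rework makes the dispatch helpers return `-EAGAIN` when `hctx->dispatch` turned out non-empty, and `blk_mq_sched_dispatch_requests()` retries once synchronously before punting to an async queue run, so flushes aren't starved. A compact userspace sketch of that retry policy (the simulated dispatcher and kick counter are stand-ins for the real dispatch path and `blk_mq_run_hw_queue()`):

```c
#include <assert.h>

#define EAGAIN_RC (-11)

static int async_kicks;
static int eagain_budget;

/* Stand-in for blk_mq_run_hw_queue(hctx, true). */
static void kick_queue_async(void)
{
    async_kicks++;
}

/* Simulated dispatch pass: returns -EAGAIN for the first
 * `eagain_budget` calls, then succeeds. */
static int dispatch_once(void)
{
    return eagain_budget-- > 0 ? EAGAIN_RC : 0;
}

/*
 * Retry policy from blk_mq_sched_dispatch_requests(): one synchronous
 * retry on -EAGAIN, then defer to an async run rather than looping
 * forever in the current context.
 */
static void dispatch_requests(void)
{
    if (dispatch_once() == EAGAIN_RC) {
        if (dispatch_once() == EAGAIN_RC)
            kick_queue_async();
    }
}

/* Helper for exercising the policy: run with N pending -EAGAINs and
 * report how many async kicks were issued. */
static int kicks_after(int eagains)
{
    eagain_budget = eagains;
    async_kicks = 0;
    dispatch_requests();
    return async_kicks;
}
```

The bounded retry keeps the common case cheap (one extra pass) while guaranteeing forward progress for the dispatch list via the async run.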
+46 -24
block/blk-mq-tag.c
··· 92 92 { 93 93 if (!(data->flags & BLK_MQ_REQ_INTERNAL) && 94 94 !hctx_may_queue(data->hctx, bt)) 95 - return -1; 95 + return BLK_MQ_NO_TAG; 96 96 if (data->shallow_depth) 97 97 return __sbitmap_queue_get_shallow(bt, data->shallow_depth); 98 98 else ··· 111 111 if (data->flags & BLK_MQ_REQ_RESERVED) { 112 112 if (unlikely(!tags->nr_reserved_tags)) { 113 113 WARN_ON_ONCE(1); 114 - return BLK_MQ_TAG_FAIL; 114 + return BLK_MQ_NO_TAG; 115 115 } 116 116 bt = &tags->breserved_tags; 117 117 tag_offset = 0; ··· 121 121 } 122 122 123 123 tag = __blk_mq_get_tag(data, bt); 124 - if (tag != -1) 124 + if (tag != BLK_MQ_NO_TAG) 125 125 goto found_tag; 126 126 127 127 if (data->flags & BLK_MQ_REQ_NOWAIT) 128 - return BLK_MQ_TAG_FAIL; 128 + return BLK_MQ_NO_TAG; 129 129 130 130 ws = bt_wait_ptr(bt, data->hctx); 131 131 do { ··· 143 143 * as running the queue may also have found completions. 144 144 */ 145 145 tag = __blk_mq_get_tag(data, bt); 146 - if (tag != -1) 146 + if (tag != BLK_MQ_NO_TAG) 147 147 break; 148 148 149 149 sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE); 150 150 151 151 tag = __blk_mq_get_tag(data, bt); 152 - if (tag != -1) 152 + if (tag != BLK_MQ_NO_TAG) 153 153 break; 154 154 155 155 bt_prev = bt; ··· 180 180 sbitmap_finish_wait(bt, ws, &wait); 181 181 182 182 found_tag: 183 + /* 184 + * Give up this allocation if the hctx is inactive. The caller will 185 + * retry on an active hctx. 
186 + */ 187 + if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) { 188 + blk_mq_put_tag(tags, data->ctx, tag + tag_offset); 189 + return BLK_MQ_NO_TAG; 190 + } 183 191 return tag + tag_offset; 184 192 } 185 193 ··· 264 256 struct blk_mq_tags *tags; 265 257 busy_tag_iter_fn *fn; 266 258 void *data; 267 - bool reserved; 259 + unsigned int flags; 268 260 }; 261 + 262 + #define BT_TAG_ITER_RESERVED (1 << 0) 263 + #define BT_TAG_ITER_STARTED (1 << 1) 269 264 270 265 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) 271 266 { 272 267 struct bt_tags_iter_data *iter_data = data; 273 268 struct blk_mq_tags *tags = iter_data->tags; 274 - bool reserved = iter_data->reserved; 269 + bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED; 275 270 struct request *rq; 276 271 277 272 if (!reserved) ··· 285 274 * test and set the bit before assigning ->rqs[]. 286 275 */ 287 276 rq = tags->rqs[bitnr]; 288 - if (rq && blk_mq_request_started(rq)) 289 - return iter_data->fn(rq, iter_data->data, reserved); 290 - 291 - return true; 277 + if (!rq) 278 + return true; 279 + if ((iter_data->flags & BT_TAG_ITER_STARTED) && 280 + !blk_mq_request_started(rq)) 281 + return true; 282 + return iter_data->fn(rq, iter_data->data, reserved); 292 283 } 293 284 294 285 /** ··· 303 290 * @reserved) where rq is a pointer to a request. Return true 304 291 * to continue iterating tags, false to stop. 305 292 * @data: Will be passed as second argument to @fn. 306 - * @reserved: Indicates whether @bt is the breserved_tags member or the 307 - * bitmap_tags member of struct blk_mq_tags. 
293 + * @flags: BT_TAG_ITER_* 308 294 */ 309 295 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt, 310 - busy_tag_iter_fn *fn, void *data, bool reserved) 296 + busy_tag_iter_fn *fn, void *data, unsigned int flags) 311 297 { 312 298 struct bt_tags_iter_data iter_data = { 313 299 .tags = tags, 314 300 .fn = fn, 315 301 .data = data, 316 - .reserved = reserved, 302 + .flags = flags, 317 303 }; 318 304 319 305 if (tags->rqs) 320 306 sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data); 321 307 } 322 308 309 + static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags, 310 + busy_tag_iter_fn *fn, void *priv, unsigned int flags) 311 + { 312 + WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED); 313 + 314 + if (tags->nr_reserved_tags) 315 + bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, 316 + flags | BT_TAG_ITER_RESERVED); 317 + bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags); 318 + } 319 + 323 320 /** 324 - * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map 321 + * blk_mq_all_tag_iter - iterate over all requests in a tag map 325 322 * @tags: Tag map to iterate over. 326 - * @fn: Pointer to the function that will be called for each started 323 + * @fn: Pointer to the function that will be called for each 327 324 * request. @fn will be called as follows: @fn(rq, @priv, 328 325 * reserved) where rq is a pointer to a request. 'reserved' 329 326 * indicates whether or not @rq is a reserved request. Return 330 327 * true to continue iterating tags, false to stop. 331 328 * @priv: Will be passed as second argument to @fn. 
332 329 */ 333 - static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags, 334 - busy_tag_iter_fn *fn, void *priv) 330 + void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, 331 + void *priv) 335 332 { 336 - if (tags->nr_reserved_tags) 337 - bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true); 338 - bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false); 333 + return __blk_mq_all_tag_iter(tags, fn, priv, 0); 339 334 } 340 335 341 336 /** ··· 363 342 364 343 for (i = 0; i < tagset->nr_hw_queues; i++) { 365 344 if (tagset->tags && tagset->tags[i]) 366 - blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv); 345 + __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, 346 + BT_TAG_ITER_STARTED); 367 347 } 368 348 } 369 349 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
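The blk-mq-tag.c change turns the iterator's `bool reserved` into a flags word so one walker can serve both `blk_mq_all_tag_iter()` (all allocated requests) and the started-only tagset iteration. A sketch of the resulting per-slot filter (the `toy_rq` type is a stand-in for `struct request`):

```c
#include <assert.h>
#include <stddef.h>

#define ITER_RESERVED (1 << 0)
#define ITER_STARTED  (1 << 1)

struct toy_rq { int started; };

/*
 * Per-bit visit logic in the shape of bt_tags_iter(): skip empty
 * slots, and skip not-yet-started requests only when the caller
 * asked for started ones. Returns 1 when the callback should run.
 */
static int should_visit(const struct toy_rq *rq, unsigned int flags)
{
    if (!rq)
        return 0;
    if ((flags & ITER_STARTED) && !rq->started)
        return 0;
    return 1;
}

/* Walk a small slot array and count visits under the given flags. */
static int visited_count(unsigned int flags)
{
    struct toy_rq started = {1}, idle = {0};
    const struct toy_rq *slots[3] = { &started, NULL, &idle };
    int i, n = 0;

    for (i = 0; i < 3; i++)
        n += should_visit(slots[i], flags);
    return n;
}
```

With flags instead of a bool, adding the "iterate everything, started or not" mode needed by the new CPU-hotplug draining did not require duplicating the walker.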
+4 -2
block/blk-mq-tag.h
··· 34 34 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool); 35 35 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn, 36 36 void *priv); 37 + void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, 38 + void *priv); 37 39 38 40 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt, 39 41 struct blk_mq_hw_ctx *hctx) ··· 46 44 } 47 45 48 46 enum { 49 - BLK_MQ_TAG_FAIL = -1U, 47 + BLK_MQ_NO_TAG = -1U, 50 48 BLK_MQ_TAG_MIN = 1, 51 - BLK_MQ_TAG_MAX = BLK_MQ_TAG_FAIL - 1, 49 + BLK_MQ_TAG_MAX = BLK_MQ_NO_TAG - 1, 52 50 }; 53 51 54 52 extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
+295 -112
block/blk-mq.c
··· 26 26 #include <linux/delay.h> 27 27 #include <linux/crash_dump.h> 28 28 #include <linux/prefetch.h> 29 + #include <linux/blk-crypto.h> 29 30 30 31 #include <trace/events/block.h> 31 32 ··· 271 270 } 272 271 273 272 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, 274 - unsigned int tag, unsigned int op, u64 alloc_time_ns) 273 + unsigned int tag, u64 alloc_time_ns) 275 274 { 276 275 struct blk_mq_tags *tags = blk_mq_tags_from_data(data); 277 276 struct request *rq = tags->static_rqs[tag]; 278 277 req_flags_t rq_flags = 0; 279 278 280 279 if (data->flags & BLK_MQ_REQ_INTERNAL) { 281 - rq->tag = -1; 280 + rq->tag = BLK_MQ_NO_TAG; 282 281 rq->internal_tag = tag; 283 282 } else { 284 283 if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) { ··· 286 285 atomic_inc(&data->hctx->nr_active); 287 286 } 288 287 rq->tag = tag; 289 - rq->internal_tag = -1; 288 + rq->internal_tag = BLK_MQ_NO_TAG; 290 289 data->hctx->tags->rqs[rq->tag] = rq; 291 290 } 292 291 ··· 295 294 rq->mq_ctx = data->ctx; 296 295 rq->mq_hctx = data->hctx; 297 296 rq->rq_flags = rq_flags; 298 - rq->cmd_flags = op; 297 + rq->cmd_flags = data->cmd_flags; 299 298 if (data->flags & BLK_MQ_REQ_PREEMPT) 300 299 rq->rq_flags |= RQF_PREEMPT; 301 300 if (blk_queue_io_stat(data->q)) ··· 318 317 #if defined(CONFIG_BLK_DEV_INTEGRITY) 319 318 rq->nr_integrity_segments = 0; 320 319 #endif 320 + blk_crypto_rq_set_defaults(rq); 321 321 /* tag was already set */ 322 - rq->extra_len = 0; 323 322 WRITE_ONCE(rq->deadline, 0); 324 323 325 324 rq->timeout = 0; ··· 327 326 rq->end_io = NULL; 328 327 rq->end_io_data = NULL; 329 328 330 - data->ctx->rq_dispatched[op_is_sync(op)]++; 329 + data->ctx->rq_dispatched[op_is_sync(data->cmd_flags)]++; 331 330 refcount_set(&rq->ref, 1); 331 + 332 + if (!op_is_flush(data->cmd_flags)) { 333 + struct elevator_queue *e = data->q->elevator; 334 + 335 + rq->elv.icq = NULL; 336 + if (e && e->type->ops.prepare_request) { 337 + if (e->type->icq_cache) 338 + 
blk_mq_sched_assign_ioc(rq); 339 + 340 + e->type->ops.prepare_request(rq); 341 + rq->rq_flags |= RQF_ELVPRIV; 342 + } 343 + } 344 + 345 + data->hctx->queued++; 332 346 return rq; 333 347 } 334 348 335 - static struct request *blk_mq_get_request(struct request_queue *q, 336 - struct bio *bio, 337 - struct blk_mq_alloc_data *data) 349 + static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data) 338 350 { 351 + struct request_queue *q = data->q; 339 352 struct elevator_queue *e = q->elevator; 340 - struct request *rq; 341 - unsigned int tag; 342 - bool clear_ctx_on_error = false; 343 353 u64 alloc_time_ns = 0; 344 - 345 - blk_queue_enter_live(q); 354 + unsigned int tag; 346 355 347 356 /* alloc_time includes depth and tag waits */ 348 357 if (blk_queue_rq_alloc_time(q)) 349 358 alloc_time_ns = ktime_get_ns(); 350 359 351 - data->q = q; 352 - if (likely(!data->ctx)) { 353 - data->ctx = blk_mq_get_ctx(q); 354 - clear_ctx_on_error = true; 355 - } 356 - if (likely(!data->hctx)) 357 - data->hctx = blk_mq_map_queue(q, data->cmd_flags, 358 - data->ctx); 359 360 if (data->cmd_flags & REQ_NOWAIT) 360 361 data->flags |= BLK_MQ_REQ_NOWAIT; 361 362 ··· 373 370 e->type->ops.limit_depth && 374 371 !(data->flags & BLK_MQ_REQ_RESERVED)) 375 372 e->type->ops.limit_depth(data->cmd_flags, data); 376 - } else { 373 + } 374 + 375 + retry: 376 + data->ctx = blk_mq_get_ctx(q); 377 + data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx); 378 + if (!(data->flags & BLK_MQ_REQ_INTERNAL)) 377 379 blk_mq_tag_busy(data->hctx); 378 - } 379 380 381 + /* 382 + * Waiting allocations only fail because of an inactive hctx. In that 383 + * case just retry the hctx assignment and tag allocation as CPU hotplug 384 + * should have migrated us to an online CPU by now. 
385 + */ 380 386 tag = blk_mq_get_tag(data); 381 - if (tag == BLK_MQ_TAG_FAIL) { 382 - if (clear_ctx_on_error) 383 - data->ctx = NULL; 384 - blk_queue_exit(q); 385 - return NULL; 386 - } 387 + if (tag == BLK_MQ_NO_TAG) { 388 + if (data->flags & BLK_MQ_REQ_NOWAIT) 389 + return NULL; 387 390 388 - rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns); 389 - if (!op_is_flush(data->cmd_flags)) { 390 - rq->elv.icq = NULL; 391 - if (e && e->type->ops.prepare_request) { 392 - if (e->type->icq_cache) 393 - blk_mq_sched_assign_ioc(rq); 394 - 395 - e->type->ops.prepare_request(rq, bio); 396 - rq->rq_flags |= RQF_ELVPRIV; 397 - } 391 + /* 392 + * Give up the CPU and sleep for a random short time to ensure 393 + * that threads using a realtime scheduling class are migrated 394 + * off the CPU, and thus off the hctx that is going away. 395 + */ 396 + msleep(3); 397 + goto retry; 398 398 } 399 - data->hctx->queued++; 400 - return rq; 399 + return blk_mq_rq_ctx_init(data, tag, alloc_time_ns); 401 400 } 402 401 403 402 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, 404 403 blk_mq_req_flags_t flags) 405 404 { 406 - struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op }; 405 + struct blk_mq_alloc_data data = { 406 + .q = q, 407 + .flags = flags, 408 + .cmd_flags = op, 409 + }; 407 410 struct request *rq; 408 411 int ret; 409 412 ··· 417 408 if (ret) 418 409 return ERR_PTR(ret); 419 410 420 - rq = blk_mq_get_request(q, NULL, &alloc_data); 421 - blk_queue_exit(q); 422 - 411 + rq = __blk_mq_alloc_request(&data); 423 412 if (!rq) 424 - return ERR_PTR(-EWOULDBLOCK); 425 - 413 + goto out_queue_exit; 426 414 rq->__data_len = 0; 427 415 rq->__sector = (sector_t) -1; 428 416 rq->bio = rq->biotail = NULL; 429 417 return rq; 418 + out_queue_exit: 419 + blk_queue_exit(q); 420 + return ERR_PTR(-EWOULDBLOCK); 430 421 } 431 422 EXPORT_SYMBOL(blk_mq_alloc_request); 432 423 433 424 struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q, 434 425 unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx) 435 426 { 436 - struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op }; 437 - struct request *rq; 427 + struct blk_mq_alloc_data data = { 428 + .q = q, 429 + .flags = flags, 430 + .cmd_flags = op, 431 + }; 432 + u64 alloc_time_ns = 0; 438 433 unsigned int cpu; 434 + unsigned int tag; 439 435 int ret; 436 + 437 + /* alloc_time includes depth and tag waits */ 438 + if (blk_queue_rq_alloc_time(q)) 439 + alloc_time_ns = ktime_get_ns(); 440 440 441 441 /* 442 442 * If the tag allocator sleeps we could get an allocation for a ··· 453 435 * allocator for this for the rare use case of a command tied to 454 436 * a specific queue. 455 437 */ 456 - if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT))) 438 + if (WARN_ON_ONCE(!(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED)))) 457 439 return ERR_PTR(-EINVAL); 458 440 459 441 if (hctx_idx >= q->nr_hw_queues) ··· 467 449 * Check if the hardware context is actually mapped to anything. 468 450 * If not tell the caller that it should skip this queue. 
469 451 */ 470 - alloc_data.hctx = q->queue_hw_ctx[hctx_idx]; 471 - if (!blk_mq_hw_queue_mapped(alloc_data.hctx)) { 472 - blk_queue_exit(q); 473 - return ERR_PTR(-EXDEV); 474 - } 475 - cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); 476 - alloc_data.ctx = __blk_mq_get_ctx(q, cpu); 452 + ret = -EXDEV; 453 + data.hctx = q->queue_hw_ctx[hctx_idx]; 454 + if (!blk_mq_hw_queue_mapped(data.hctx)) 455 + goto out_queue_exit; 456 + cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask); 457 + data.ctx = __blk_mq_get_ctx(q, cpu); 477 458 478 - rq = blk_mq_get_request(q, NULL, &alloc_data); 459 + if (q->elevator) 460 + data.flags |= BLK_MQ_REQ_INTERNAL; 461 + else 462 + blk_mq_tag_busy(data.hctx); 463 + 464 + ret = -EWOULDBLOCK; 465 + tag = blk_mq_get_tag(&data); 466 + if (tag == BLK_MQ_NO_TAG) 467 + goto out_queue_exit; 468 + return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns); 469 + 470 + out_queue_exit: 479 471 blk_queue_exit(q); 480 - 481 - if (!rq) 482 - return ERR_PTR(-EWOULDBLOCK); 483 - 484 - return rq; 472 + return ERR_PTR(ret); 485 473 } 486 474 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx); 487 475 ··· 498 474 struct blk_mq_hw_ctx *hctx = rq->mq_hctx; 499 475 const int sched_tag = rq->internal_tag; 500 476 477 + blk_crypto_free_request(rq); 501 478 blk_pm_mark_last_busy(rq); 502 479 rq->mq_hctx = NULL; 503 - if (rq->tag != -1) 480 + if (rq->tag != BLK_MQ_NO_TAG) 504 481 blk_mq_put_tag(hctx->tags, ctx, rq->tag); 505 - if (sched_tag != -1) 482 + if (sched_tag != BLK_MQ_NO_TAG) 506 483 blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag); 507 484 blk_mq_sched_restart(hctx); 508 485 blk_queue_exit(q); ··· 552 527 blk_stat_add(rq, now); 553 528 } 554 529 555 - if (rq->internal_tag != -1) 530 + if (rq->internal_tag != BLK_MQ_NO_TAG) 556 531 blk_mq_sched_completed_request(rq, now); 557 532 558 533 blk_account_io_done(rq, now); ··· 582 557 q->mq_ops->complete(rq); 583 558 } 584 559 585 - static void __blk_mq_complete_request(struct request *rq) 560 + 
/** 561 + * blk_mq_force_complete_rq() - Force complete the request, bypassing any error 562 + * injection that could drop the completion. 563 + * @rq: Request to be force completed 564 + * 565 + * Drivers should use blk_mq_complete_request() to complete requests in their 566 + * normal IO path. For timeout error recovery, drivers may call this forced 567 + * completion routine after they've reclaimed timed out requests to bypass 568 + * potentially subsequent fake timeouts. 569 + */ 570 + void blk_mq_force_complete_rq(struct request *rq) 586 571 { 587 572 struct blk_mq_ctx *ctx = rq->mq_ctx; 588 573 struct request_queue *q = rq->q; ··· 638 603 } 639 604 put_cpu(); 640 605 } 606 + EXPORT_SYMBOL_GPL(blk_mq_force_complete_rq); 641 607 642 608 static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx) 643 609 __releases(hctx->srcu) ··· 672 636 { 673 637 if (unlikely(blk_should_fake_timeout(rq->q))) 674 638 return false; 675 - __blk_mq_complete_request(rq); 639 + blk_mq_force_complete_rq(rq); 676 640 return true; 677 641 } 678 642 EXPORT_SYMBOL(blk_mq_complete_request); ··· 703 667 blk_add_timer(rq); 704 668 WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT); 705 669 706 - if (q->dma_drain_size && blk_rq_bytes(rq)) { 707 - /* 708 - * Make sure space for the drain appears. We know we can do 709 - * this because max_hw_segments has been adjusted to be one 710 - * fewer than the device can handle. 
711 - */ 712 - rq->nr_phys_segments++; 713 - } 714 - 715 670 #ifdef CONFIG_BLK_DEV_INTEGRITY 716 671 if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE) 717 672 q->integrity.profile->prepare_fn(rq); ··· 722 695 if (blk_mq_request_started(rq)) { 723 696 WRITE_ONCE(rq->state, MQ_RQ_IDLE); 724 697 rq->rq_flags &= ~RQF_TIMED_OUT; 725 - if (q->dma_drain_size && blk_rq_bytes(rq)) 726 - rq->nr_phys_segments--; 727 698 } 728 699 } 729 700 ··· 1062 1037 }; 1063 1038 bool shared; 1064 1039 1065 - if (rq->tag != -1) 1040 + if (rq->tag != BLK_MQ_NO_TAG) 1066 1041 return true; 1067 1042 1068 1043 if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag)) ··· 1078 1053 data.hctx->tags->rqs[rq->tag] = rq; 1079 1054 } 1080 1055 1081 - return rq->tag != -1; 1056 + return rq->tag != BLK_MQ_NO_TAG; 1082 1057 } 1083 1058 1084 1059 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, ··· 1220 1195 __blk_mq_requeue_request(rq); 1221 1196 } 1222 1197 1198 + static void blk_mq_handle_zone_resource(struct request *rq, 1199 + struct list_head *zone_list) 1200 + { 1201 + /* 1202 + * If we end up here it is because we cannot dispatch a request to a 1203 + * specific zone due to LLD level zone-write locking or other zone 1204 + * related resource not being available. In this case, set the request 1205 + * aside in zone_list for retrying it later. 1206 + */ 1207 + list_add(&rq->queuelist, zone_list); 1208 + __blk_mq_requeue_request(rq); 1209 + } 1210 + 1223 1211 /* 1224 1212 * Returns true if we did some work AND can potentially do more. 
1225 1213 */ ··· 1244 1206 bool no_tag = false; 1245 1207 int errors, queued; 1246 1208 blk_status_t ret = BLK_STS_OK; 1209 + bool no_budget_avail = false; 1210 + LIST_HEAD(zone_list); 1247 1211 1248 1212 if (list_empty(list)) 1249 1213 return false; ··· 1264 1224 hctx = rq->mq_hctx; 1265 1225 if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) { 1266 1226 blk_mq_put_driver_tag(rq); 1227 + no_budget_avail = true; 1267 1228 break; 1268 1229 } 1269 1230 ··· 1307 1266 if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { 1308 1267 blk_mq_handle_dev_resource(rq, list); 1309 1268 break; 1269 + } else if (ret == BLK_STS_ZONE_RESOURCE) { 1270 + /* 1271 + * Move the request to zone_list and keep going through 1272 + * the dispatch list to find more requests the drive can 1273 + * accept. 1274 + */ 1275 + blk_mq_handle_zone_resource(rq, &zone_list); 1276 + if (list_empty(list)) 1277 + break; 1278 + continue; 1310 1279 } 1311 1280 1312 1281 if (unlikely(ret != BLK_STS_OK)) { ··· 1327 1276 1328 1277 queued++; 1329 1278 } while (!list_empty(list)); 1279 + 1280 + if (!list_empty(&zone_list)) 1281 + list_splice_tail_init(&zone_list, list); 1330 1282 1331 1283 hctx->dispatched[queued_to_index(queued)]++; 1332 1284 ··· 1374 1320 * 1375 1321 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART 1376 1322 * bit is set, run queue after a delay to avoid IO stalls 1377 - * that could otherwise occur if the queue is idle. 1323 + * that could otherwise occur if the queue is idle. We'll do 1324 + * similar if we couldn't get budget and SCHED_RESTART is set. 
1378 1325 */ 1379 1326 needs_restart = blk_mq_sched_needs_restart(hctx); 1380 1327 if (!needs_restart || 1381 1328 (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) 1382 1329 blk_mq_run_hw_queue(hctx, true); 1383 - else if (needs_restart && (ret == BLK_STS_RESOURCE)) 1330 + else if (needs_restart && (ret == BLK_STS_RESOURCE || 1331 + no_budget_avail)) 1384 1332 blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY); 1385 1333 1386 1334 blk_mq_update_dispatch_busy(hctx, true); ··· 1596 1540 } 1597 1541 } 1598 1542 EXPORT_SYMBOL(blk_mq_run_hw_queues); 1543 + 1544 + /** 1545 + * blk_mq_delay_run_hw_queues - Run all hardware queues asynchronously. 1546 + * @q: Pointer to the request queue to run. 1547 + * @msecs: Milliseconds of delay to wait before running the queues. 1548 + */ 1549 + void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs) 1550 + { 1551 + struct blk_mq_hw_ctx *hctx; 1552 + int i; 1553 + 1554 + queue_for_each_hw_ctx(q, hctx, i) { 1555 + if (blk_mq_hctx_stopped(hctx)) 1556 + continue; 1557 + 1558 + blk_mq_delay_run_hw_queue(hctx, msecs); 1559 + } 1560 + } 1561 + EXPORT_SYMBOL(blk_mq_delay_run_hw_queues); 1599 1562 1600 1563 /** 1601 1564 * blk_mq_queue_stopped() - check whether one or more hctxs have been stopped ··· 1857 1782 rq->__sector = bio->bi_iter.bi_sector; 1858 1783 rq->write_hint = bio->bi_write_hint; 1859 1784 blk_rq_bio_prep(rq, bio, nr_segs); 1785 + blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO); 1860 1786 1861 - blk_account_io_start(rq, true); 1787 + blk_account_io_start(rq); 1862 1788 } 1863 1789 1864 1790 static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, ··· 2049 1973 * 2050 1974 * Returns: Request queue cookie.
2051 1975 */ 2052 - static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) 1976 + blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) 2053 1977 { 2054 1978 const int is_sync = op_is_sync(bio->bi_opf); 2055 1979 const int is_flush_fua = op_is_flush(bio->bi_opf); 2056 - struct blk_mq_alloc_data data = { .flags = 0}; 1980 + struct blk_mq_alloc_data data = { 1981 + .q = q, 1982 + }; 2057 1983 struct request *rq; 2058 1984 struct blk_plug *plug; 2059 1985 struct request *same_queue_rq = NULL; 2060 1986 unsigned int nr_segs; 2061 1987 blk_qc_t cookie; 1988 + blk_status_t ret; 2062 1989 2063 1990 blk_queue_bounce(q, &bio); 2064 1991 __blk_queue_split(q, &bio, &nr_segs); 2065 1992 2066 1993 if (!bio_integrity_prep(bio)) 2067 - return BLK_QC_T_NONE; 1994 + goto queue_exit; 2068 1995 2069 1996 if (!is_flush_fua && !blk_queue_nomerges(q) && 2070 1997 blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq)) 2071 - return BLK_QC_T_NONE; 1998 + goto queue_exit; 2072 1999 2073 2000 if (blk_mq_sched_bio_merge(q, bio, nr_segs)) 2074 - return BLK_QC_T_NONE; 2001 + goto queue_exit; 2075 2002 2076 2003 rq_qos_throttle(q, bio); 2077 2004 2078 2005 data.cmd_flags = bio->bi_opf; 2079 - rq = blk_mq_get_request(q, bio, &data); 2006 + rq = __blk_mq_alloc_request(&data); 2080 2007 if (unlikely(!rq)) { 2081 2008 rq_qos_cleanup(q, bio); 2082 2009 if (bio->bi_opf & REQ_NOWAIT) 2083 2010 bio_wouldblock_error(bio); 2084 - return BLK_QC_T_NONE; 2011 + goto queue_exit; 2085 2012 } 2086 2013 2087 2014 trace_block_getrq(q, bio, bio->bi_opf); ··· 2094 2015 cookie = request_to_qc_t(data.hctx, rq); 2095 2016 2096 2017 blk_mq_bio_to_request(rq, bio, nr_segs); 2018 + 2019 + ret = blk_crypto_init_request(rq); 2020 + if (ret != BLK_STS_OK) { 2021 + bio->bi_status = ret; 2022 + bio_endio(bio); 2023 + blk_mq_free_request(rq); 2024 + return BLK_QC_T_NONE; 2025 + } 2097 2026 2098 2027 plug = blk_mq_plug(q, bio); 2099 2028 if (unlikely(is_flush_fua)) { ··· 2171 2084 } 
2172 2085 2173 2086 return cookie; 2087 + queue_exit: 2088 + blk_queue_exit(q); 2089 + return BLK_QC_T_NONE; 2174 2090 } 2091 + EXPORT_SYMBOL_GPL(blk_mq_make_request); /* only for request based dm */ 2175 2092 2176 2093 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, 2177 2094 unsigned int hctx_idx) ··· 2351 2260 return -ENOMEM; 2352 2261 } 2353 2262 2263 + struct rq_iter_data { 2264 + struct blk_mq_hw_ctx *hctx; 2265 + bool has_rq; 2266 + }; 2267 + 2268 + static bool blk_mq_has_request(struct request *rq, void *data, bool reserved) 2269 + { 2270 + struct rq_iter_data *iter_data = data; 2271 + 2272 + if (rq->mq_hctx != iter_data->hctx) 2273 + return true; 2274 + iter_data->has_rq = true; 2275 + return false; 2276 + } 2277 + 2278 + static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx) 2279 + { 2280 + struct blk_mq_tags *tags = hctx->sched_tags ? 2281 + hctx->sched_tags : hctx->tags; 2282 + struct rq_iter_data data = { 2283 + .hctx = hctx, 2284 + }; 2285 + 2286 + blk_mq_all_tag_iter(tags, blk_mq_has_request, &data); 2287 + return data.has_rq; 2288 + } 2289 + 2290 + static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu, 2291 + struct blk_mq_hw_ctx *hctx) 2292 + { 2293 + if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) 2294 + return false; 2295 + if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids) 2296 + return false; 2297 + return true; 2298 + } 2299 + 2300 + static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node) 2301 + { 2302 + struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node, 2303 + struct blk_mq_hw_ctx, cpuhp_online); 2304 + 2305 + if (!cpumask_test_cpu(cpu, hctx->cpumask) || 2306 + !blk_mq_last_cpu_in_hctx(cpu, hctx)) 2307 + return 0; 2308 + 2309 + /* 2310 + * Prevent new request from being allocated on the current hctx. 2311 + * 2312 + * The smp_mb__after_atomic() Pairs with the implied barrier in 2313 + * test_and_set_bit_lock in sbitmap_get(). 
Ensures the inactive flag is 2314 + * seen once we return from the tag allocator. 2315 + */ 2316 + set_bit(BLK_MQ_S_INACTIVE, &hctx->state); 2317 + smp_mb__after_atomic(); 2318 + 2319 + /* 2320 + * Try to grab a reference to the queue and wait for any outstanding 2321 + * requests. If we could not grab a reference the queue has been 2322 + * frozen and there are no requests. 2323 + */ 2324 + if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) { 2325 + while (blk_mq_hctx_has_requests(hctx)) 2326 + msleep(5); 2327 + percpu_ref_put(&hctx->queue->q_usage_counter); 2328 + } 2329 + 2330 + return 0; 2331 + } 2332 + 2333 + static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node) 2334 + { 2335 + struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node, 2336 + struct blk_mq_hw_ctx, cpuhp_online); 2337 + 2338 + if (cpumask_test_cpu(cpu, hctx->cpumask)) 2339 + clear_bit(BLK_MQ_S_INACTIVE, &hctx->state); 2340 + return 0; 2341 + } 2342 + 2354 2343 /* 2355 2344 * 'cpu' is going away. 
splice any existing rq_list entries from this 2356 2345 * software queue to the hw queue dispatch list, and ensure that it ··· 2444 2273 enum hctx_type type; 2445 2274 2446 2275 hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead); 2276 + if (!cpumask_test_cpu(cpu, hctx->cpumask)) 2277 + return 0; 2278 + 2447 2279 ctx = __blk_mq_get_ctx(hctx->queue, cpu); 2448 2280 type = hctx->type; 2449 2281 ··· 2470 2296 2471 2297 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx) 2472 2298 { 2299 + if (!(hctx->flags & BLK_MQ_F_STACKING)) 2300 + cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE, 2301 + &hctx->cpuhp_online); 2473 2302 cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD, 2474 2303 &hctx->cpuhp_dead); 2475 2304 } ··· 2532 2355 { 2533 2356 hctx->queue_num = hctx_idx; 2534 2357 2358 + if (!(hctx->flags & BLK_MQ_F_STACKING)) 2359 + cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE, 2360 + &hctx->cpuhp_online); 2535 2361 cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead); 2536 2362 2537 2363 hctx->tags = set->tags[hctx_idx]; ··· 2653 2473 } 2654 2474 } 2655 2475 2656 - static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx) 2476 + static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set, 2477 + int hctx_idx) 2657 2478 { 2658 2479 int ret = 0; 2659 2480 ··· 2702 2521 * If the cpu isn't present, the cpu is mapped to first hctx. 2703 2522 */ 2704 2523 for_each_possible_cpu(i) { 2705 - hctx_idx = set->map[HCTX_TYPE_DEFAULT].mq_map[i]; 2706 - /* unmapped hw queue can be remapped after CPU topo changed */ 2707 - if (!set->tags[hctx_idx] && 2708 - !__blk_mq_alloc_rq_map(set, hctx_idx)) { 2709 - /* 2710 - * If tags initialization fail for some hctx, 2711 - * that hctx won't be brought online. 
In this 2712 - * case, remap the current ctx to hctx[0] which 2713 - * is guaranteed to always have tags allocated 2714 - */ 2715 - set->map[HCTX_TYPE_DEFAULT].mq_map[i] = 0; 2716 - } 2717 2524 2718 2525 ctx = per_cpu_ptr(q->queue_ctx, i); 2719 2526 for (j = 0; j < set->nr_maps; j++) { ··· 2709 2540 ctx->hctxs[j] = blk_mq_map_queue_type(q, 2710 2541 HCTX_TYPE_DEFAULT, i); 2711 2542 continue; 2543 + } 2544 + hctx_idx = set->map[j].mq_map[i]; 2545 + /* unmapped hw queue can be remapped after CPU topo changed */ 2546 + if (!set->tags[hctx_idx] && 2547 + !__blk_mq_alloc_map_and_request(set, hctx_idx)) { 2548 + /* 2549 + * If tags initialization fail for some hctx, 2550 + * that hctx won't be brought online. In this 2551 + * case, remap the current ctx to hctx[0] which 2552 + * is guaranteed to always have tags allocated 2553 + */ 2554 + set->map[j].mq_map[i] = 0; 2712 2555 } 2713 2556 2714 2557 hctx = blk_mq_map_queue_type(q, j, i); ··· 3125 2944 INIT_LIST_HEAD(&q->requeue_list); 3126 2945 spin_lock_init(&q->requeue_lock); 3127 2946 3128 - q->make_request_fn = blk_mq_make_request; 3129 2947 q->nr_requests = set->queue_depth; 3130 2948 3131 2949 /* ··· 3168 2988 int i; 3169 2989 3170 2990 for (i = 0; i < set->nr_hw_queues; i++) 3171 - if (!__blk_mq_alloc_rq_map(set, i)) 2991 + if (!__blk_mq_alloc_map_and_request(set, i)) 3172 2992 goto out_unwind; 3173 2993 3174 2994 return 0; 3175 2995 3176 2996 out_unwind: 3177 2997 while (--i >= 0) 3178 - blk_mq_free_rq_map(set->tags[i]); 2998 + blk_mq_free_map_and_requests(set, i); 3179 2999 3180 3000 return -ENOMEM; 3181 3001 } ··· 3185 3005 * may reduce the depth asked for, if memory is tight. set->queue_depth 3186 3006 * will be updated to reflect the allocated depth. 
3187 3007 */ 3188 - static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set) 3008 + static int blk_mq_alloc_map_and_requests(struct blk_mq_tag_set *set) 3189 3009 { 3190 3010 unsigned int depth; 3191 3011 int err; ··· 3345 3165 if (ret) 3346 3166 goto out_free_mq_map; 3347 3167 3348 - ret = blk_mq_alloc_rq_maps(set); 3168 + ret = blk_mq_alloc_map_and_requests(set); 3349 3169 if (ret) 3350 3170 goto out_free_mq_map; 3351 3171 ··· 3527 3347 blk_mq_sysfs_unregister(q); 3528 3348 } 3529 3349 3350 + prev_nr_hw_queues = set->nr_hw_queues; 3530 3351 if (blk_mq_realloc_tag_set_tags(set, set->nr_hw_queues, nr_hw_queues) < 3531 3352 0) 3532 3353 goto reregister; 3533 3354 3534 - prev_nr_hw_queues = set->nr_hw_queues; 3535 3355 set->nr_hw_queues = nr_hw_queues; 3536 - blk_mq_update_queue_map(set); 3537 3356 fallback: 3357 + blk_mq_update_queue_map(set); 3538 3358 list_for_each_entry(q, &set->tag_list, tag_set_list) { 3539 3359 blk_mq_realloc_hw_ctxs(set, q); 3540 3360 if (q->nr_hw_queues != set->nr_hw_queues) { ··· 3789 3609 { 3790 3610 cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL, 3791 3611 blk_mq_hctx_notify_dead); 3612 + cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online", 3613 + blk_mq_hctx_notify_online, 3614 + blk_mq_hctx_notify_offline); 3792 3615 return 0; 3793 3616 } 3794 3617 subsys_initcall(blk_mq_init);
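The new `blk_mq_hctx_notify_offline()` handler in the blk-mq.c hunks above follows a classic quiesce pattern: publish an "inactive" flag with a full memory barrier, then wait for any requests that slipped in before the flag was visible. A minimal userspace sketch of that pattern, assuming nothing beyond C11 atomics; `hctx_t`, `hctx_has_requests()` and `hctx_quiesce()` are illustrative stand-ins, not kernel APIs:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool inactive;   /* stands in for BLK_MQ_S_INACTIVE */
    atomic_int  in_flight;  /* stands in for the tag-set iteration */
} hctx_t;

static bool hctx_has_requests(hctx_t *h)
{
    return atomic_load(&h->in_flight) > 0;
}

static void hctx_quiesce(hctx_t *h)
{
    /* A seq_cst store plays the role of set_bit() followed by
     * smp_mb__after_atomic() in the patch: tag allocators running
     * after this point are guaranteed to observe the flag. */
    atomic_store(&h->inactive, true);

    /* Wait out requests that got a tag before the flag was set;
     * the kernel sleeps between checks (msleep(5)). */
    while (hctx_has_requests(h))
        ;
}
```

The kernel additionally takes a `percpu_ref` on `q_usage_counter` before waiting, so a frozen queue (no requests possible) is skipped entirely.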
+2 -2
block/blk-mq.h
··· 201 201 struct request *rq) 202 202 { 203 203 blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag); 204 - rq->tag = -1; 204 + rq->tag = BLK_MQ_NO_TAG; 205 205 206 206 if (rq->rq_flags & RQF_MQ_INFLIGHT) { 207 207 rq->rq_flags &= ~RQF_MQ_INFLIGHT; ··· 211 211 212 212 static inline void blk_mq_put_driver_tag(struct request *rq) 213 213 { 214 - if (rq->tag == -1 || rq->internal_tag == -1) 214 + if (rq->tag == BLK_MQ_NO_TAG || rq->internal_tag == BLK_MQ_NO_TAG) 215 215 return; 216 216 217 217 __blk_mq_put_driver_tag(rq->mq_hctx, rq);
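The blk-mq.h hunk above continues the series-wide replacement of bare `-1` tag comparisons with the named `BLK_MQ_NO_TAG` sentinel. A tiny sketch of why the named constant is safer for an unsigned tag field; the value `-1U` is assumed from the `BLK_MQ_TAG_FAIL` rename in this series, and `rq_has_driver_tag()` is a hypothetical helper, not a kernel function:

```c
/* Named sentinel instead of a magic -1; tags are unsigned, so the
 * sentinel is the all-ones value and the comparison is well defined. */
#define BLK_MQ_NO_TAG (-1U)

static int rq_has_driver_tag(unsigned int tag)
{
    return tag != BLK_MQ_NO_TAG;
}
```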
+31 -37
block/blk-settings.c
··· 48 48 lim->chunk_sectors = 0; 49 49 lim->max_write_same_sectors = 0; 50 50 lim->max_write_zeroes_sectors = 0; 51 + lim->max_zone_append_sectors = 0; 51 52 lim->max_discard_sectors = 0; 52 53 lim->max_hw_discard_sectors = 0; 53 54 lim->discard_granularity = 0; ··· 84 83 lim->max_dev_sectors = UINT_MAX; 85 84 lim->max_write_same_sectors = UINT_MAX; 86 85 lim->max_write_zeroes_sectors = UINT_MAX; 86 + lim->max_zone_append_sectors = UINT_MAX; 87 87 } 88 88 EXPORT_SYMBOL(blk_set_stacking_limits); 89 89 ··· 222 220 q->limits.max_write_zeroes_sectors = max_write_zeroes_sectors; 223 221 } 224 222 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors); 223 + 224 + /** 225 + * blk_queue_max_zone_append_sectors - set max sectors for a single zone append 226 + * @q: the request queue for the device 227 + * @max_zone_append_sectors: maximum number of sectors to write per command 228 + **/ 229 + void blk_queue_max_zone_append_sectors(struct request_queue *q, 230 + unsigned int max_zone_append_sectors) 231 + { 232 + unsigned int max_sectors; 233 + 234 + if (WARN_ON(!blk_queue_is_zoned(q))) 235 + return; 236 + 237 + max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors); 238 + max_sectors = min(q->limits.chunk_sectors, max_sectors); 239 + 240 + /* 241 + * Signal eventual driver bugs resulting in the max_zone_append sectors limit 242 + * being 0 due to a 0 argument, the chunk_sectors limit (zone size) not set, 243 + * or the max_hw_sectors limit not set. 
244 + */ 245 + WARN_ON(!max_sectors); 246 + 247 + q->limits.max_zone_append_sectors = max_sectors; 248 + } 249 + EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors); 225 250 226 251 /** 227 252 * blk_queue_max_segments - set max hw segments for a request for this queue ··· 499 470 b->max_write_same_sectors); 500 471 t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors, 501 472 b->max_write_zeroes_sectors); 473 + t->max_zone_append_sectors = min(t->max_zone_append_sectors, 474 + b->max_zone_append_sectors); 502 475 t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn); 503 476 504 477 t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask, ··· 681 650 q->dma_pad_mask = mask; 682 651 } 683 652 EXPORT_SYMBOL(blk_queue_update_dma_pad); 684 - 685 - /** 686 - * blk_queue_dma_drain - Set up a drain buffer for excess dma. 687 - * @q: the request queue for the device 688 - * @dma_drain_needed: fn which returns non-zero if drain is necessary 689 - * @buf: physically contiguous buffer 690 - * @size: size of the buffer in bytes 691 - * 692 - * Some devices have excess DMA problems and can't simply discard (or 693 - * zero fill) the unwanted piece of the transfer. They have to have a 694 - * real area of memory to transfer it into. The use case for this is 695 - * ATAPI devices in DMA mode. If the packet command causes a transfer 696 - * bigger than the transfer size some HBAs will lock up if there 697 - * aren't DMA elements to contain the excess transfer. What this API 698 - * does is adjust the queue so that the buf is always appended 699 - * silently to the scatterlist. 700 - * 701 - * Note: This routine adjusts max_hw_segments to make room for appending 702 - * the drain buffer. If you call blk_queue_max_segments() after calling 703 - * this routine, you must set the limit to one fewer than your device 704 - * can support otherwise there won't be room for the drain buffer. 
705 - */ 706 - int blk_queue_dma_drain(struct request_queue *q, 707 - dma_drain_needed_fn *dma_drain_needed, 708 - void *buf, unsigned int size) 709 - { 710 - if (queue_max_segments(q) < 2) 711 - return -EINVAL; 712 - /* make room for appending the drain */ 713 - blk_queue_max_segments(q, queue_max_segments(q) - 1); 714 - q->dma_drain_needed = dma_drain_needed; 715 - q->dma_drain_buffer = buf; 716 - q->dma_drain_size = size; 717 - 718 - return 0; 719 - } 720 - EXPORT_SYMBOL_GPL(blk_queue_dma_drain); 721 653 722 654 /** 723 655 * blk_queue_segment_boundary - set boundary rules for segment merging
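The new `blk_queue_max_zone_append_sectors()` above clamps the driver-supplied value against both the hardware transfer limit and the zone size. The arithmetic can be sketched as follows; `clamp_zone_append()` is an illustrative name, not a kernel symbol:

```c
/* Mirror of the clamping in blk_queue_max_zone_append_sectors(): the
 * effective limit is the smallest of the caller's value, the hardware
 * limit (max_hw_sectors) and the zone size (chunk_sectors), since a
 * zone append must never cross a zone boundary. */
static unsigned int clamp_zone_append(unsigned int requested,
                                      unsigned int max_hw_sectors,
                                      unsigned int chunk_sectors)
{
    unsigned int max_sectors;

    max_sectors = requested < max_hw_sectors ? requested : max_hw_sectors;
    return chunk_sectors < max_sectors ? chunk_sectors : max_sectors;
}
```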
+13
block/blk-sysfs.c
··· 218 218 (unsigned long long)q->limits.max_write_zeroes_sectors << 9); 219 219 } 220 220 221 + static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page) 222 + { 223 + unsigned long long max_sectors = q->limits.max_zone_append_sectors; 224 + 225 + return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT); 226 + } 227 + 221 228 static ssize_t 222 229 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count) 223 230 { ··· 646 639 .show = queue_write_zeroes_max_show, 647 640 }; 648 641 642 + static struct queue_sysfs_entry queue_zone_append_max_entry = { 643 + .attr = {.name = "zone_append_max_bytes", .mode = 0444 }, 644 + .show = queue_zone_append_max_show, 645 + }; 646 + 649 647 static struct queue_sysfs_entry queue_nonrot_entry = { 650 648 .attr = {.name = "rotational", .mode = 0644 }, 651 649 .show = queue_show_nonrot, ··· 761 749 &queue_discard_zeroes_data_entry.attr, 762 750 &queue_write_same_max_entry.attr, 763 751 &queue_write_zeroes_max_entry.attr, 752 + &queue_zone_append_max_entry.attr, 764 753 &queue_nonrot_entry.attr, 765 754 &queue_zoned_entry.attr, 766 755 &queue_nr_zones_entry.attr,
-63
block/blk-throttle.c
··· 2358 2358 } 2359 2359 #endif 2360 2360 2361 - /* 2362 - * Dispatch all bios from all children tg's queued on @parent_sq. On 2363 - * return, @parent_sq is guaranteed to not have any active children tg's 2364 - * and all bios from previously active tg's are on @parent_sq->bio_lists[]. 2365 - */ 2366 - static void tg_drain_bios(struct throtl_service_queue *parent_sq) 2367 - { 2368 - struct throtl_grp *tg; 2369 - 2370 - while ((tg = throtl_rb_first(parent_sq))) { 2371 - struct throtl_service_queue *sq = &tg->service_queue; 2372 - struct bio *bio; 2373 - 2374 - throtl_dequeue_tg(tg); 2375 - 2376 - while ((bio = throtl_peek_queued(&sq->queued[READ]))) 2377 - tg_dispatch_one_bio(tg, bio_data_dir(bio)); 2378 - while ((bio = throtl_peek_queued(&sq->queued[WRITE]))) 2379 - tg_dispatch_one_bio(tg, bio_data_dir(bio)); 2380 - } 2381 - } 2382 - 2383 - /** 2384 - * blk_throtl_drain - drain throttled bios 2385 - * @q: request_queue to drain throttled bios for 2386 - * 2387 - * Dispatch all currently throttled bios on @q through ->make_request_fn(). 2388 - */ 2389 - void blk_throtl_drain(struct request_queue *q) 2390 - __releases(&q->queue_lock) __acquires(&q->queue_lock) 2391 - { 2392 - struct throtl_data *td = q->td; 2393 - struct blkcg_gq *blkg; 2394 - struct cgroup_subsys_state *pos_css; 2395 - struct bio *bio; 2396 - int rw; 2397 - 2398 - rcu_read_lock(); 2399 - 2400 - /* 2401 - * Drain each tg while doing post-order walk on the blkg tree, so 2402 - * that all bios are propagated to td->service_queue. It'd be 2403 - * better to walk service_queue tree directly but blkg walk is 2404 - * easier. 
2405 - */ 2406 - blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) 2407 - tg_drain_bios(&blkg_to_tg(blkg)->service_queue); 2408 - 2409 - /* finally, transfer bios from top-level tg's into the td */ 2410 - tg_drain_bios(&td->service_queue); 2411 - 2412 - rcu_read_unlock(); 2413 - spin_unlock_irq(&q->queue_lock); 2414 - 2415 - /* all bios now should be in td->service_queue, issue them */ 2416 - for (rw = READ; rw <= WRITE; rw++) 2417 - while ((bio = throtl_pop_queued(&td->service_queue.queued[rw], 2418 - NULL))) 2419 - generic_make_request(bio); 2420 - 2421 - spin_lock_irq(&q->queue_lock); 2422 - } 2423 - 2424 2361 int blk_throtl_init(struct request_queue *q) 2425 2362 { 2426 2363 struct throtl_data *td;
+4 -12
block/blk-wbt.c
··· 405 405 rwb_arm_timer(rwb); 406 406 } 407 407 408 - static void __wbt_update_limits(struct rq_wb *rwb) 408 + static void wbt_update_limits(struct rq_wb *rwb) 409 409 { 410 410 struct rq_depth *rqd = &rwb->rq_depth; 411 411 ··· 416 416 calc_wb_limits(rwb); 417 417 418 418 rwb_wake_all(rwb); 419 - } 420 - 421 - void wbt_update_limits(struct request_queue *q) 422 - { 423 - struct rq_qos *rqos = wbt_rq_qos(q); 424 - if (!rqos) 425 - return; 426 - __wbt_update_limits(RQWB(rqos)); 427 419 } 428 420 429 421 u64 wbt_get_min_lat(struct request_queue *q) ··· 433 441 return; 434 442 RQWB(rqos)->min_lat_nsec = val; 435 443 RQWB(rqos)->enable_state = WBT_STATE_ON_MANUAL; 436 - __wbt_update_limits(RQWB(rqos)); 444 + wbt_update_limits(RQWB(rqos)); 437 445 } 438 446 439 447 ··· 677 685 static void wbt_queue_depth_changed(struct rq_qos *rqos) 678 686 { 679 687 RQWB(rqos)->rq_depth.queue_depth = blk_queue_depth(rqos->q); 680 - __wbt_update_limits(RQWB(rqos)); 688 + wbt_update_limits(RQWB(rqos)); 681 689 } 682 690 683 691 static void wbt_exit(struct rq_qos *rqos) ··· 835 843 rwb->enable_state = WBT_STATE_ON_DEFAULT; 836 844 rwb->wc = 1; 837 845 rwb->rq_depth.default_depth = RWB_DEF_DEPTH; 838 - __wbt_update_limits(rwb); 846 + wbt_update_limits(rwb); 839 847 840 848 /* 841 849 * Assign rwb and add the stats callback.
-4
block/blk-wbt.h
··· 88 88 #ifdef CONFIG_BLK_WBT 89 89 90 90 int wbt_init(struct request_queue *); 91 - void wbt_update_limits(struct request_queue *); 92 91 void wbt_disable_default(struct request_queue *); 93 92 void wbt_enable_default(struct request_queue *); 94 93 ··· 106 107 static inline int wbt_init(struct request_queue *q) 107 108 { 108 109 return -EINVAL; 109 - } 110 - static inline void wbt_update_limits(struct request_queue *q) 111 - { 112 110 } 113 111 static inline void wbt_disable_default(struct request_queue *q) 114 112 {
+22 -1
block/blk-zoned.c
··· 82 82 } 83 83 EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock); 84 84 85 + bool blk_req_zone_write_trylock(struct request *rq) 86 + { 87 + unsigned int zno = blk_rq_zone_no(rq); 88 + 89 + if (test_and_set_bit(zno, rq->q->seq_zones_wlock)) 90 + return false; 91 + 92 + WARN_ON_ONCE(rq->rq_flags & RQF_ZONE_WRITE_LOCKED); 93 + rq->rq_flags |= RQF_ZONE_WRITE_LOCKED; 94 + 95 + return true; 96 + } 97 + EXPORT_SYMBOL_GPL(blk_req_zone_write_trylock); 98 + 85 99 void __blk_req_zone_write_lock(struct request *rq) 86 100 { 87 101 if (WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq), ··· 471 457 /** 472 458 * blk_revalidate_disk_zones - (re)allocate and initialize zone bitmaps 473 459 * @disk: Target disk 460 + * @update_driver_data: Callback to update driver data on the frozen disk 474 461 * 475 462 * Helper function for low-level device drivers to (re) allocate and initialize 476 463 * a disk request queue's zone bitmaps. This function should normally be called 477 464 * within the disk ->revalidate method for blk-mq based drivers. For BIO based 478 465 * drivers only q->nr_zones needs to be updated so that the sysfs exposed value 479 466 * is correct. 467 + * If the @update_driver_data callback function is not NULL, the callback is 468 + * executed with the device request queue frozen after all zones have been 469 + * checked. 480 470 */ 481 - int blk_revalidate_disk_zones(struct gendisk *disk) 471 + int blk_revalidate_disk_zones(struct gendisk *disk, 472 + void (*update_driver_data)(struct gendisk *disk)) 482 473 { 483 474 struct request_queue *q = disk->queue; 484 475 struct blk_revalidate_zone_args args = { ··· 517 498 q->nr_zones = args.nr_zones; 518 499 swap(q->seq_zones_wlock, args.seq_zones_wlock); 519 500 swap(q->conv_zones_bitmap, args.conv_zones_bitmap); 501 + if (update_driver_data) 502 + update_driver_data(disk); 520 503 ret = 0; 521 504 } else { 522 505 pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
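The new `blk_req_zone_write_trylock()` above keys the per-zone write lock off one bit per zone in `seq_zones_wlock`: whoever test-and-sets the bit first wins, and an already-set bit means another request holds the zone. A minimal userspace sketch of the bitmap protocol, using plain (non-atomic) bit operations for illustration where the kernel uses `test_and_set_bit()`:

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* One bit per zone; returns true if we won the zone write lock. */
static bool zone_write_trylock(unsigned long *wlock, unsigned int zno)
{
    unsigned long mask = 1UL << (zno % BITS_PER_LONG);
    unsigned long *word = &wlock[zno / BITS_PER_LONG];

    if (*word & mask)
        return false;   /* zone already write-locked */
    *word |= mask;
    return true;
}

static void zone_write_unlock(unsigned long *wlock, unsigned int zno)
{
    wlock[zno / BITS_PER_LONG] &= ~(1UL << (zno % BITS_PER_LONG));
}
```

The trylock variant (as opposed to the existing sleeping `__blk_req_zone_write_lock()`) lets the dispatch path fail fast and requeue, which is what the new `BLK_STS_ZONE_RESOURCE` handling in blk-mq.c relies on.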
+19 -69
block/blk.h
··· 5 5 #include <linux/idr.h> 6 6 #include <linux/blk-mq.h> 7 7 #include <linux/part_stat.h> 8 + #include <linux/blk-crypto.h> 8 9 #include <xen/xen.h> 10 + #include "blk-crypto-internal.h" 9 11 #include "blk-mq.h" 10 12 #include "blk-mq-sched.h" 11 13 ··· 19 17 #endif 20 18 21 19 struct blk_flush_queue { 22 - unsigned int flush_queue_delayed:1; 23 20 unsigned int flush_pending_idx:1; 24 21 unsigned int flush_running_idx:1; 25 22 blk_status_t rq_status; ··· 62 61 void blk_free_flush_queue(struct blk_flush_queue *q); 63 62 64 63 void blk_freeze_queue(struct request_queue *q); 65 - 66 - static inline void blk_queue_enter_live(struct request_queue *q) 67 - { 68 - /* 69 - * Given that running in generic_make_request() context 70 - * guarantees that a live reference against q_usage_counter has 71 - * been established, further references under that same context 72 - * need not check that the queue has been frozen (marked dead). 73 - */ 74 - percpu_ref_get(&q->q_usage_counter); 75 - } 76 64 77 65 static inline bool biovec_phys_mergeable(struct request_queue *q, 78 66 struct bio_vec *vec1, struct bio_vec *vec2) ··· 185 195 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, 186 196 unsigned int nr_segs, struct request **same_queue_rq); 187 197 188 - void blk_account_io_start(struct request *req, bool new_io); 189 - void blk_account_io_completion(struct request *req, unsigned int bytes); 198 + void blk_account_io_start(struct request *req); 190 199 void blk_account_io_done(struct request *req, u64 now); 191 200 192 201 /* ··· 292 303 293 304 int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node); 294 305 295 - /** 296 - * create_io_context - try to create task->io_context 297 - * @gfp_mask: allocation mask 298 - * @node: allocation node 299 - * 300 - * If %current->io_context is %NULL, allocate a new io_context and install 301 - * it. Returns the current %current->io_context which may be %NULL if 302 - * allocation failed. 
303 - * 304 - * Note that this function can't be called with IRQ disabled because 305 - * task_lock which protects %current->io_context is IRQ-unsafe. 306 - */ 307 - static inline struct io_context *create_io_context(gfp_t gfp_mask, int node) 308 - { 309 - WARN_ON_ONCE(irqs_disabled()); 310 - if (unlikely(!current->io_context)) 311 - create_task_io_context(current, gfp_mask, node); 312 - return current->io_context; 313 - } 314 - 315 306 /* 316 307 * Internal throttling interface 317 308 */ 318 309 #ifdef CONFIG_BLK_DEV_THROTTLING 319 - extern void blk_throtl_drain(struct request_queue *q); 320 310 extern int blk_throtl_init(struct request_queue *q); 321 311 extern void blk_throtl_exit(struct request_queue *q); 322 312 extern void blk_throtl_register_queue(struct request_queue *q); 323 313 #else /* CONFIG_BLK_DEV_THROTTLING */ 324 - static inline void blk_throtl_drain(struct request_queue *q) { } 325 314 static inline int blk_throtl_init(struct request_queue *q) { return 0; } 326 315 static inline void blk_throtl_exit(struct request_queue *q) { } 327 316 static inline void blk_throtl_register_queue(struct request_queue *q) { } ··· 342 375 static inline void blk_queue_free_zone_bitmaps(struct request_queue *q) {} 343 376 #endif 344 377 345 - void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, 346 - int rw); 347 - void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, 348 - int rw); 349 - void update_io_ticks(struct hd_struct *part, unsigned long now, bool end); 350 378 struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector); 351 379 352 380 int blk_alloc_devt(struct hd_struct *part, dev_t *devt); ··· 351 389 #define ADDPART_FLAG_NONE 0 352 390 #define ADDPART_FLAG_RAID 1 353 391 #define ADDPART_FLAG_WHOLEDISK 2 354 - struct hd_struct *__must_check add_partition(struct gendisk *disk, int partno, 355 - sector_t start, sector_t len, int flags, 356 - struct partition_meta_info *info); 357 - void 
__delete_partition(struct percpu_ref *ref); 358 - void delete_partition(struct gendisk *disk, int partno); 392 + void delete_partition(struct gendisk *disk, struct hd_struct *part); 393 + int bdev_add_partition(struct block_device *bdev, int partno, 394 + sector_t start, sector_t length); 395 + int bdev_del_partition(struct block_device *bdev, int partno); 396 + int bdev_resize_partition(struct block_device *bdev, int partno, 397 + sector_t start, sector_t length); 359 398 int disk_expand_part_tbl(struct gendisk *disk, int target); 399 + int hd_ref_init(struct hd_struct *part); 360 400 361 - static inline int hd_ref_init(struct hd_struct *part) 362 - { 363 - if (percpu_ref_init(&part->ref, __delete_partition, 0, 364 - GFP_KERNEL)) 365 - return -ENOMEM; 366 - return 0; 367 - } 368 - 369 - static inline void hd_struct_get(struct hd_struct *part) 370 - { 371 - percpu_ref_get(&part->ref); 372 - } 373 - 401 + /* no need to get/put refcount of part0 */ 374 402 static inline int hd_struct_try_get(struct hd_struct *part) 375 403 { 376 - return percpu_ref_tryget_live(&part->ref); 404 + if (part->partno) 405 + return percpu_ref_tryget_live(&part->ref); 406 + return 1; 377 407 } 378 408 379 409 static inline void hd_struct_put(struct hd_struct *part) 380 410 { 381 - percpu_ref_put(&part->ref); 382 - } 383 - 384 - static inline void hd_struct_kill(struct hd_struct *part) 385 - { 386 - percpu_ref_kill(&part->ref); 411 + if (part->partno) 412 + percpu_ref_put(&part->ref); 387 413 } 388 414 389 415 static inline void hd_free_part(struct hd_struct *part) 390 416 { 391 - free_part_stats(part); 417 + free_percpu(part->dkstats); 392 418 kfree(part->info); 393 419 percpu_ref_exit(&part->ref); 394 420 } ··· 434 484 435 485 struct request_queue *__blk_alloc_queue(int node_id); 436 486 437 - int __bio_add_pc_page(struct request_queue *q, struct bio *bio, 487 + int bio_add_hw_page(struct request_queue *q, struct bio *bio, 438 488 struct page *page, unsigned int len, unsigned int offset, 
439 - bool *same_page); 489 + unsigned int max_sectors, bool *same_page); 440 490 441 491 #endif /* BLK_INTERNAL_H */
+2
block/bounce.c
··· 267 267 break; 268 268 } 269 269 270 + bio_crypt_clone(bio, bio_src, gfp_mask); 271 + 270 272 if (bio_integrity(bio_src)) { 271 273 int ret; 272 274
+63 -70
block/genhd.c
··· 92 92 } 93 93 EXPORT_SYMBOL(bdevname); 94 94 95 - #ifdef CONFIG_SMP 96 95 static void part_stat_read_all(struct hd_struct *part, struct disk_stats *stat) 97 96 { 98 97 int cpu; ··· 111 112 stat->io_ticks += ptr->io_ticks; 112 113 } 113 114 } 114 - #else /* CONFIG_SMP */ 115 - static void part_stat_read_all(struct hd_struct *part, struct disk_stats *stat) 116 - { 117 - memcpy(stat, &part->dkstats, sizeof(struct disk_stats)); 118 - } 119 - #endif /* CONFIG_SMP */ 120 - 121 - void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw) 122 - { 123 - if (queue_is_mq(q)) 124 - return; 125 - 126 - part_stat_local_inc(part, in_flight[rw]); 127 - if (part->partno) 128 - part_stat_local_inc(&part_to_disk(part)->part0, in_flight[rw]); 129 - } 130 - 131 - void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw) 132 - { 133 - if (queue_is_mq(q)) 134 - return; 135 - 136 - part_stat_local_dec(part, in_flight[rw]); 137 - if (part->partno) 138 - part_stat_local_dec(&part_to_disk(part)->part0, in_flight[rw]); 139 - } 140 115 141 116 static unsigned int part_in_flight(struct request_queue *q, 142 117 struct hd_struct *part) 143 118 { 119 + unsigned int inflight = 0; 144 120 int cpu; 145 - unsigned int inflight; 146 121 147 - if (queue_is_mq(q)) { 148 - return blk_mq_in_flight(q, part); 149 - } 150 - 151 - inflight = 0; 152 122 for_each_possible_cpu(cpu) { 153 123 inflight += part_stat_local_read_cpu(part, in_flight[0], cpu) + 154 124 part_stat_local_read_cpu(part, in_flight[1], cpu); ··· 132 164 unsigned int inflight[2]) 133 165 { 134 166 int cpu; 135 - 136 - if (queue_is_mq(q)) { 137 - blk_mq_in_flight_rw(q, part, inflight); 138 - return; 139 - } 140 167 141 168 inflight[0] = 0; 142 169 inflight[1] = 0; ··· 307 344 * primarily used for stats accounting. 308 345 * 309 346 * CONTEXT: 310 - * RCU read locked. The returned partition pointer is valid only 311 - * while preemption is disabled. 347 + * RCU read locked. 
The returned partition pointer is always valid 348 + * because its refcount is grabbed except for part0, which lifetime 349 + * is same with the disk. 312 350 * 313 351 * RETURNS: 314 352 * Found partition on success, part0 is returned if no partition matches 353 + * or the matched partition is being deleted. 315 354 */ 316 355 struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector) 317 356 { ··· 321 356 struct hd_struct *part; 322 357 int i; 323 358 359 + rcu_read_lock(); 324 360 ptbl = rcu_dereference(disk->part_tbl); 325 361 326 362 part = rcu_dereference(ptbl->last_lookup); 327 - if (part && sector_in_part(part, sector)) 328 - return part; 363 + if (part && sector_in_part(part, sector) && hd_struct_try_get(part)) 364 + goto out_unlock; 329 365 330 366 for (i = 1; i < ptbl->len; i++) { 331 367 part = rcu_dereference(ptbl->part[i]); 332 368 333 369 if (part && sector_in_part(part, sector)) { 370 + /* 371 + * only live partition can be cached for lookup, 372 + * so use-after-free on cached & deleting partition 373 + * can be avoided 374 + */ 375 + if (!hd_struct_try_get(part)) 376 + break; 334 377 rcu_assign_pointer(ptbl->last_lookup, part); 335 - return part; 378 + goto out_unlock; 336 379 } 337 380 } 338 - return &disk->part0; 381 + 382 + part = &disk->part0; 383 + out_unlock: 384 + rcu_read_unlock(); 385 + return part; 339 386 } 340 387 341 388 /** ··· 817 840 disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO; 818 841 disk->flags |= GENHD_FL_NO_PART_SCAN; 819 842 } else { 843 + struct backing_dev_info *bdi = disk->queue->backing_dev_info; 844 + struct device *dev = disk_to_dev(disk); 820 845 int ret; 821 846 822 847 /* Register BDI before referencing it from bdev */ 823 - disk_to_dev(disk)->devt = devt; 824 - ret = bdi_register_owner(disk->queue->backing_dev_info, 825 - disk_to_dev(disk)); 848 + dev->devt = devt; 849 + ret = bdi_register(bdi, "%u:%u", MAJOR(devt), MINOR(devt)); 826 850 WARN_ON(ret); 851 + bdi_set_owner(bdi, dev); 827 852 
blk_register_region(disk_devt(disk), disk->minors, NULL, 828 853 exact_match, exact_lock, disk); 829 854 } ··· 857 878 } 858 879 EXPORT_SYMBOL(device_add_disk_no_queue_reg); 859 880 881 + static void invalidate_partition(struct gendisk *disk, int partno) 882 + { 883 + struct block_device *bdev; 884 + 885 + bdev = bdget_disk(disk, partno); 886 + if (!bdev) 887 + return; 888 + 889 + fsync_bdev(bdev); 890 + __invalidate_device(bdev, true); 891 + 892 + /* 893 + * Unhash the bdev inode for this device so that it gets evicted as soon 894 + * as last inode reference is dropped. 895 + */ 896 + remove_inode_hash(bdev->bd_inode); 897 + bdput(bdev); 898 + } 899 + 860 900 void del_gendisk(struct gendisk *disk) 861 901 { 862 902 struct disk_part_iter piter; ··· 894 896 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); 895 897 while ((part = disk_part_iter_next(&piter))) { 896 898 invalidate_partition(disk, part->partno); 897 - bdev_unhash_inode(part_devt(part)); 898 - delete_partition(disk, part->partno); 899 + delete_partition(disk, part); 899 900 } 900 901 disk_part_iter_exit(&piter); 901 902 902 903 invalidate_partition(disk, 0); 903 - bdev_unhash_inode(disk_devt(disk)); 904 904 set_capacity(disk, 0); 905 905 disk->flags &= ~GENHD_FL_UP; 906 906 up_write(&disk->lookup_sem); ··· 1275 1279 unsigned int inflight; 1276 1280 1277 1281 part_stat_read_all(p, &stat); 1278 - inflight = part_in_flight(q, p); 1282 + if (queue_is_mq(q)) 1283 + inflight = blk_mq_in_flight(q, p); 1284 + else 1285 + inflight = part_in_flight(q, p); 1279 1286 1280 1287 return sprintf(buf, 1281 1288 "%8lu %8lu %8llu %8u " ··· 1317 1318 struct request_queue *q = part_to_disk(p)->queue; 1318 1319 unsigned int inflight[2]; 1319 1320 1320 - part_in_flight_rw(q, p, inflight); 1321 + if (queue_is_mq(q)) 1322 + blk_mq_in_flight_rw(q, p, inflight); 1323 + else 1324 + part_in_flight_rw(q, p, inflight); 1325 + 1321 1326 return sprintf(buf, "%8u %8u\n", inflight[0], inflight[1]); 1322 1327 } 1323 1328 ··· 1576 1573 
disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0); 1577 1574 while ((hd = disk_part_iter_next(&piter))) { 1578 1575 part_stat_read_all(hd, &stat); 1579 - inflight = part_in_flight(gp->queue, hd); 1576 + if (queue_is_mq(gp->queue)) 1577 + inflight = blk_mq_in_flight(gp->queue, hd); 1578 + else 1579 + inflight = part_in_flight(gp->queue, hd); 1580 1580 1581 1581 seq_printf(seqf, "%4d %7d %s " 1582 1582 "%lu %lu %lu %u " ··· 1686 1680 1687 1681 disk = kzalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id); 1688 1682 if (disk) { 1689 - if (!init_part_stats(&disk->part0)) { 1683 + disk->part0.dkstats = alloc_percpu(struct disk_stats); 1684 + if (!disk->part0.dkstats) { 1690 1685 kfree(disk); 1691 1686 return NULL; 1692 1687 } 1693 1688 init_rwsem(&disk->lookup_sem); 1694 1689 disk->node_id = node_id; 1695 1690 if (disk_expand_part_tbl(disk, 0)) { 1696 - free_part_stats(&disk->part0); 1691 + free_percpu(disk->part0.dkstats); 1697 1692 kfree(disk); 1698 1693 return NULL; 1699 1694 } ··· 1710 1703 * TODO: Ideally set_capacity() and get_capacity() should be 1711 1704 * converted to make use of bd_mutex and sequence counters. 1712 1705 */ 1713 - seqcount_init(&disk->part0.nr_sects_seq); 1706 + hd_sects_seq_init(&disk->part0); 1714 1707 if (hd_ref_init(&disk->part0)) { 1715 1708 hd_free_part(&disk->part0); 1716 1709 kfree(disk); ··· 1812 1805 } 1813 1806 1814 1807 EXPORT_SYMBOL(bdev_read_only); 1815 - 1816 - int invalidate_partition(struct gendisk *disk, int partno) 1817 - { 1818 - int res = 0; 1819 - struct block_device *bdev = bdget_disk(disk, partno); 1820 - if (bdev) { 1821 - fsync_bdev(bdev); 1822 - res = __invalidate_device(bdev, true); 1823 - bdput(bdev); 1824 - } 1825 - return res; 1826 - } 1827 - 1828 - EXPORT_SYMBOL(invalidate_partition); 1829 1808 1830 1809 /* 1831 1810 * Disk events - monitor disk events like media change and eject request.
+25 -123
block/ioctl.c
··· 16 16 static int blkpg_do_ioctl(struct block_device *bdev, 17 17 struct blkpg_partition __user *upart, int op) 18 18 { 19 - struct block_device *bdevp; 20 - struct gendisk *disk; 21 - struct hd_struct *part, *lpart; 22 19 struct blkpg_partition p; 23 - struct disk_part_iter piter; 24 20 long long start, length; 25 - int partno; 26 21 27 22 if (!capable(CAP_SYS_ADMIN)) 28 23 return -EACCES; 29 24 if (copy_from_user(&p, upart, sizeof(struct blkpg_partition))) 30 25 return -EFAULT; 31 - disk = bdev->bd_disk; 32 26 if (bdev != bdev->bd_contains) 33 27 return -EINVAL; 34 - partno = p.pno; 35 - if (partno <= 0) 28 + 29 + if (p.pno <= 0) 36 30 return -EINVAL; 37 - switch (op) { 38 - case BLKPG_ADD_PARTITION: 39 - start = p.start >> 9; 40 - length = p.length >> 9; 41 - /* check for fit in a hd_struct */ 42 - if (sizeof(sector_t) == sizeof(long) && 43 - sizeof(long long) > sizeof(long)) { 44 - long pstart = start, plength = length; 45 - if (pstart != start || plength != length 46 - || pstart < 0 || plength < 0 || partno > 65535) 47 - return -EINVAL; 48 - } 49 - /* check if partition is aligned to blocksize */ 50 - if (p.start & (bdev_logical_block_size(bdev) - 1)) 51 - return -EINVAL; 52 31 53 - mutex_lock(&bdev->bd_mutex); 32 + if (op == BLKPG_DEL_PARTITION) 33 + return bdev_del_partition(bdev, p.pno); 54 34 55 - /* overlap? 
*/ 56 - disk_part_iter_init(&piter, disk, 57 - DISK_PITER_INCL_EMPTY); 58 - while ((part = disk_part_iter_next(&piter))) { 59 - if (!(start + length <= part->start_sect || 60 - start >= part->start_sect + part->nr_sects)) { 61 - disk_part_iter_exit(&piter); 62 - mutex_unlock(&bdev->bd_mutex); 63 - return -EBUSY; 64 - } 65 - } 66 - disk_part_iter_exit(&piter); 35 + start = p.start >> SECTOR_SHIFT; 36 + length = p.length >> SECTOR_SHIFT; 67 37 68 - /* all seems OK */ 69 - part = add_partition(disk, partno, start, length, 70 - ADDPART_FLAG_NONE, NULL); 71 - mutex_unlock(&bdev->bd_mutex); 72 - return PTR_ERR_OR_ZERO(part); 73 - case BLKPG_DEL_PARTITION: 74 - part = disk_get_part(disk, partno); 75 - if (!part) 76 - return -ENXIO; 38 + /* check for fit in a hd_struct */ 39 + if (sizeof(sector_t) < sizeof(long long)) { 40 + long pstart = start, plength = length; 77 41 78 - bdevp = bdget(part_devt(part)); 79 - disk_put_part(part); 80 - if (!bdevp) 81 - return -ENOMEM; 82 - 83 - mutex_lock(&bdevp->bd_mutex); 84 - if (bdevp->bd_openers) { 85 - mutex_unlock(&bdevp->bd_mutex); 86 - bdput(bdevp); 87 - return -EBUSY; 88 - } 89 - /* all seems OK */ 90 - fsync_bdev(bdevp); 91 - invalidate_bdev(bdevp); 92 - 93 - mutex_lock_nested(&bdev->bd_mutex, 1); 94 - delete_partition(disk, partno); 95 - mutex_unlock(&bdev->bd_mutex); 96 - mutex_unlock(&bdevp->bd_mutex); 97 - bdput(bdevp); 98 - 99 - return 0; 100 - case BLKPG_RESIZE_PARTITION: 101 - start = p.start >> 9; 102 - /* new length of partition in bytes */ 103 - length = p.length >> 9; 104 - /* check for fit in a hd_struct */ 105 - if (sizeof(sector_t) == sizeof(long) && 106 - sizeof(long long) > sizeof(long)) { 107 - long pstart = start, plength = length; 108 - if (pstart != start || plength != length 109 - || pstart < 0 || plength < 0) 110 - return -EINVAL; 111 - } 112 - part = disk_get_part(disk, partno); 113 - if (!part) 114 - return -ENXIO; 115 - bdevp = bdget(part_devt(part)); 116 - if (!bdevp) { 117 - disk_put_part(part); 118 - 
return -ENOMEM; 119 - } 120 - mutex_lock(&bdevp->bd_mutex); 121 - mutex_lock_nested(&bdev->bd_mutex, 1); 122 - if (start != part->start_sect) { 123 - mutex_unlock(&bdevp->bd_mutex); 124 - mutex_unlock(&bdev->bd_mutex); 125 - bdput(bdevp); 126 - disk_put_part(part); 127 - return -EINVAL; 128 - } 129 - /* overlap? */ 130 - disk_part_iter_init(&piter, disk, 131 - DISK_PITER_INCL_EMPTY); 132 - while ((lpart = disk_part_iter_next(&piter))) { 133 - if (lpart->partno != partno && 134 - !(start + length <= lpart->start_sect || 135 - start >= lpart->start_sect + lpart->nr_sects) 136 - ) { 137 - disk_part_iter_exit(&piter); 138 - mutex_unlock(&bdevp->bd_mutex); 139 - mutex_unlock(&bdev->bd_mutex); 140 - bdput(bdevp); 141 - disk_put_part(part); 142 - return -EBUSY; 143 - } 144 - } 145 - disk_part_iter_exit(&piter); 146 - part_nr_sects_write(part, (sector_t)length); 147 - i_size_write(bdevp->bd_inode, p.length); 148 - mutex_unlock(&bdevp->bd_mutex); 149 - mutex_unlock(&bdev->bd_mutex); 150 - bdput(bdevp); 151 - disk_put_part(part); 152 - return 0; 153 - default: 42 + if (pstart != start || plength != length || pstart < 0 || 43 + plength < 0 || p.pno > 65535) 154 44 return -EINVAL; 45 + } 46 + 47 + switch (op) { 48 + case BLKPG_ADD_PARTITION: 49 + /* check if partition is aligned to blocksize */ 50 + if (p.start & (bdev_logical_block_size(bdev) - 1)) 51 + return -EINVAL; 52 + return bdev_add_partition(bdev, p.pno, start, length); 53 + case BLKPG_RESIZE_PARTITION: 54 + return bdev_resize_partition(bdev, p.pno, start, length); 55 + default: 56 + return -EINVAL; 155 57 } 156 58 } 157 59 ··· 204 302 } 205 303 206 304 #ifdef CONFIG_COMPAT 207 - static int compat_put_long(compat_long_t *argp, long val) 305 + static int compat_put_long(compat_long_t __user *argp, long val) 208 306 { 209 307 return put_user(val, argp); 210 308 } 211 309 212 - static int compat_put_ulong(compat_ulong_t *argp, compat_ulong_t val) 310 + static int compat_put_ulong(compat_ulong_t __user *argp, 
compat_ulong_t val) 213 311 { 214 312 return put_user(val, argp); 215 313 }
+397
block/keyslot-manager.c
··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * Copyright 2019 Google LLC 4 + */ 5 + 6 + /** 7 + * DOC: The Keyslot Manager 8 + * 9 + * Many devices with inline encryption support have a limited number of "slots" 10 + * into which encryption contexts may be programmed, and requests can be tagged 11 + * with a slot number to specify the key to use for en/decryption. 12 + * 13 + * As the number of slots is limited, and programming keys is expensive on 14 + * many inline encryption hardware, we don't want to program the same key into 15 + * multiple slots - if multiple requests are using the same key, we want to 16 + * program just one slot with that key and use that slot for all requests. 17 + * 18 + * The keyslot manager manages these keyslots appropriately, and also acts as 19 + * an abstraction between the inline encryption hardware and the upper layers. 20 + * 21 + * Lower layer devices will set up a keyslot manager in their request queue 22 + * and tell it how to perform device specific operations like programming/ 23 + * evicting keys from keyslots. 24 + * 25 + * Upper layers will call blk_ksm_get_slot_for_key() to program a 26 + * key into some slot in the inline encryption hardware. 27 + */ 28 + 29 + #define pr_fmt(fmt) "blk-crypto: " fmt 30 + 31 + #include <linux/keyslot-manager.h> 32 + #include <linux/atomic.h> 33 + #include <linux/mutex.h> 34 + #include <linux/pm_runtime.h> 35 + #include <linux/wait.h> 36 + #include <linux/blkdev.h> 37 + 38 + struct blk_ksm_keyslot { 39 + atomic_t slot_refs; 40 + struct list_head idle_slot_node; 41 + struct hlist_node hash_node; 42 + const struct blk_crypto_key *key; 43 + struct blk_keyslot_manager *ksm; 44 + }; 45 + 46 + static inline void blk_ksm_hw_enter(struct blk_keyslot_manager *ksm) 47 + { 48 + /* 49 + * Calling into the driver requires ksm->lock held and the device 50 + * resumed. But we must resume the device first, since that can acquire 51 + * and release ksm->lock via blk_ksm_reprogram_all_keys(). 
52 + */ 53 + if (ksm->dev) 54 + pm_runtime_get_sync(ksm->dev); 55 + down_write(&ksm->lock); 56 + } 57 + 58 + static inline void blk_ksm_hw_exit(struct blk_keyslot_manager *ksm) 59 + { 60 + up_write(&ksm->lock); 61 + if (ksm->dev) 62 + pm_runtime_put_sync(ksm->dev); 63 + } 64 + 65 + /** 66 + * blk_ksm_init() - Initialize a keyslot manager 67 + * @ksm: The keyslot_manager to initialize. 68 + * @num_slots: The number of key slots to manage. 69 + * 70 + * Allocate memory for keyslots and initialize a keyslot manager. Called by 71 + * e.g. storage drivers to set up a keyslot manager in their request_queue. 72 + * 73 + * Return: 0 on success, or else a negative error code. 74 + */ 75 + int blk_ksm_init(struct blk_keyslot_manager *ksm, unsigned int num_slots) 76 + { 77 + unsigned int slot; 78 + unsigned int i; 79 + unsigned int slot_hashtable_size; 80 + 81 + memset(ksm, 0, sizeof(*ksm)); 82 + 83 + if (num_slots == 0) 84 + return -EINVAL; 85 + 86 + ksm->slots = kvcalloc(num_slots, sizeof(ksm->slots[0]), GFP_KERNEL); 87 + if (!ksm->slots) 88 + return -ENOMEM; 89 + 90 + ksm->num_slots = num_slots; 91 + 92 + init_rwsem(&ksm->lock); 93 + 94 + init_waitqueue_head(&ksm->idle_slots_wait_queue); 95 + INIT_LIST_HEAD(&ksm->idle_slots); 96 + 97 + for (slot = 0; slot < num_slots; slot++) { 98 + ksm->slots[slot].ksm = ksm; 99 + list_add_tail(&ksm->slots[slot].idle_slot_node, 100 + &ksm->idle_slots); 101 + } 102 + 103 + spin_lock_init(&ksm->idle_slots_lock); 104 + 105 + slot_hashtable_size = roundup_pow_of_two(num_slots); 106 + ksm->log_slot_ht_size = ilog2(slot_hashtable_size); 107 + ksm->slot_hashtable = kvmalloc_array(slot_hashtable_size, 108 + sizeof(ksm->slot_hashtable[0]), 109 + GFP_KERNEL); 110 + if (!ksm->slot_hashtable) 111 + goto err_destroy_ksm; 112 + for (i = 0; i < slot_hashtable_size; i++) 113 + INIT_HLIST_HEAD(&ksm->slot_hashtable[i]); 114 + 115 + return 0; 116 + 117 + err_destroy_ksm: 118 + blk_ksm_destroy(ksm); 119 + return -ENOMEM; 120 + } 121 + 
EXPORT_SYMBOL_GPL(blk_ksm_init); 122 + 123 + static inline struct hlist_head * 124 + blk_ksm_hash_bucket_for_key(struct blk_keyslot_manager *ksm, 125 + const struct blk_crypto_key *key) 126 + { 127 + return &ksm->slot_hashtable[hash_ptr(key, ksm->log_slot_ht_size)]; 128 + } 129 + 130 + static void blk_ksm_remove_slot_from_lru_list(struct blk_ksm_keyslot *slot) 131 + { 132 + struct blk_keyslot_manager *ksm = slot->ksm; 133 + unsigned long flags; 134 + 135 + spin_lock_irqsave(&ksm->idle_slots_lock, flags); 136 + list_del(&slot->idle_slot_node); 137 + spin_unlock_irqrestore(&ksm->idle_slots_lock, flags); 138 + } 139 + 140 + static struct blk_ksm_keyslot *blk_ksm_find_keyslot( 141 + struct blk_keyslot_manager *ksm, 142 + const struct blk_crypto_key *key) 143 + { 144 + const struct hlist_head *head = blk_ksm_hash_bucket_for_key(ksm, key); 145 + struct blk_ksm_keyslot *slotp; 146 + 147 + hlist_for_each_entry(slotp, head, hash_node) { 148 + if (slotp->key == key) 149 + return slotp; 150 + } 151 + return NULL; 152 + } 153 + 154 + static struct blk_ksm_keyslot *blk_ksm_find_and_grab_keyslot( 155 + struct blk_keyslot_manager *ksm, 156 + const struct blk_crypto_key *key) 157 + { 158 + struct blk_ksm_keyslot *slot; 159 + 160 + slot = blk_ksm_find_keyslot(ksm, key); 161 + if (!slot) 162 + return NULL; 163 + if (atomic_inc_return(&slot->slot_refs) == 1) { 164 + /* Took first reference to this slot; remove it from LRU list */ 165 + blk_ksm_remove_slot_from_lru_list(slot); 166 + } 167 + return slot; 168 + } 169 + 170 + unsigned int blk_ksm_get_slot_idx(struct blk_ksm_keyslot *slot) 171 + { 172 + return slot - slot->ksm->slots; 173 + } 174 + EXPORT_SYMBOL_GPL(blk_ksm_get_slot_idx); 175 + 176 + /** 177 + * blk_ksm_get_slot_for_key() - Program a key into a keyslot. 178 + * @ksm: The keyslot manager to program the key into. 179 + * @key: Pointer to the key object to program, including the raw key, crypto 180 + * mode, and data unit size. 
181 + * @slot_ptr: A pointer to return the pointer of the allocated keyslot. 182 + * 183 + * Get a keyslot that's been programmed with the specified key. If one already 184 + * exists, return it with incremented refcount. Otherwise, wait for a keyslot 185 + * to become idle and program it. 186 + * 187 + * Context: Process context. Takes and releases ksm->lock. 188 + * Return: BLK_STS_OK on success (and keyslot is set to the pointer of the 189 + * allocated keyslot), or some other blk_status_t otherwise (and 190 + * keyslot is set to NULL). 191 + */ 192 + blk_status_t blk_ksm_get_slot_for_key(struct blk_keyslot_manager *ksm, 193 + const struct blk_crypto_key *key, 194 + struct blk_ksm_keyslot **slot_ptr) 195 + { 196 + struct blk_ksm_keyslot *slot; 197 + int slot_idx; 198 + int err; 199 + 200 + *slot_ptr = NULL; 201 + down_read(&ksm->lock); 202 + slot = blk_ksm_find_and_grab_keyslot(ksm, key); 203 + up_read(&ksm->lock); 204 + if (slot) 205 + goto success; 206 + 207 + for (;;) { 208 + blk_ksm_hw_enter(ksm); 209 + slot = blk_ksm_find_and_grab_keyslot(ksm, key); 210 + if (slot) { 211 + blk_ksm_hw_exit(ksm); 212 + goto success; 213 + } 214 + 215 + /* 216 + * If we're here, that means there wasn't a slot that was 217 + * already programmed with the key. So try to program it. 218 + */ 219 + if (!list_empty(&ksm->idle_slots)) 220 + break; 221 + 222 + blk_ksm_hw_exit(ksm); 223 + wait_event(ksm->idle_slots_wait_queue, 224 + !list_empty(&ksm->idle_slots)); 225 + } 226 + 227 + slot = list_first_entry(&ksm->idle_slots, struct blk_ksm_keyslot, 228 + idle_slot_node); 229 + slot_idx = blk_ksm_get_slot_idx(slot); 230 + 231 + err = ksm->ksm_ll_ops.keyslot_program(ksm, key, slot_idx); 232 + if (err) { 233 + wake_up(&ksm->idle_slots_wait_queue); 234 + blk_ksm_hw_exit(ksm); 235 + return errno_to_blk_status(err); 236 + } 237 + 238 + /* Move this slot to the hash list for the new key. 
*/ 239 + if (slot->key) 240 + hlist_del(&slot->hash_node); 241 + slot->key = key; 242 + hlist_add_head(&slot->hash_node, blk_ksm_hash_bucket_for_key(ksm, key)); 243 + 244 + atomic_set(&slot->slot_refs, 1); 245 + 246 + blk_ksm_remove_slot_from_lru_list(slot); 247 + 248 + blk_ksm_hw_exit(ksm); 249 + success: 250 + *slot_ptr = slot; 251 + return BLK_STS_OK; 252 + } 253 + 254 + /** 255 + * blk_ksm_put_slot() - Release a reference to a slot 256 + * @slot: The keyslot to release the reference of. 257 + * 258 + * Context: Any context. 259 + */ 260 + void blk_ksm_put_slot(struct blk_ksm_keyslot *slot) 261 + { 262 + struct blk_keyslot_manager *ksm; 263 + unsigned long flags; 264 + 265 + if (!slot) 266 + return; 267 + 268 + ksm = slot->ksm; 269 + 270 + if (atomic_dec_and_lock_irqsave(&slot->slot_refs, 271 + &ksm->idle_slots_lock, flags)) { 272 + list_add_tail(&slot->idle_slot_node, &ksm->idle_slots); 273 + spin_unlock_irqrestore(&ksm->idle_slots_lock, flags); 274 + wake_up(&ksm->idle_slots_wait_queue); 275 + } 276 + } 277 + 278 + /** 279 + * blk_ksm_crypto_cfg_supported() - Find out if a crypto configuration is 280 + * supported by a ksm. 281 + * @ksm: The keyslot manager to check 282 + * @cfg: The crypto configuration to check for. 283 + * 284 + * Checks for crypto_mode/data unit size/dun bytes support. 285 + * 286 + * Return: Whether or not this ksm supports the specified crypto config. 287 + */ 288 + bool blk_ksm_crypto_cfg_supported(struct blk_keyslot_manager *ksm, 289 + const struct blk_crypto_config *cfg) 290 + { 291 + if (!ksm) 292 + return false; 293 + if (!(ksm->crypto_modes_supported[cfg->crypto_mode] & 294 + cfg->data_unit_size)) 295 + return false; 296 + if (ksm->max_dun_bytes_supported < cfg->dun_bytes) 297 + return false; 298 + return true; 299 + } 300 + 301 + /** 302 + * blk_ksm_evict_key() - Evict a key from the lower layer device. 
303 + * @ksm: The keyslot manager to evict from 304 + * @key: The key to evict 305 + * 306 + * Find the keyslot that the specified key was programmed into, and evict that 307 + * slot from the lower layer device. The slot must not be in use by any 308 + * in-flight IO when this function is called. 309 + * 310 + * Context: Process context. Takes and releases ksm->lock. 311 + * Return: 0 on success or if there's no keyslot with the specified key, -EBUSY 312 + * if the keyslot is still in use, or another -errno value on other 313 + * error. 314 + */ 315 + int blk_ksm_evict_key(struct blk_keyslot_manager *ksm, 316 + const struct blk_crypto_key *key) 317 + { 318 + struct blk_ksm_keyslot *slot; 319 + int err = 0; 320 + 321 + blk_ksm_hw_enter(ksm); 322 + slot = blk_ksm_find_keyslot(ksm, key); 323 + if (!slot) 324 + goto out_unlock; 325 + 326 + if (WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)) { 327 + err = -EBUSY; 328 + goto out_unlock; 329 + } 330 + err = ksm->ksm_ll_ops.keyslot_evict(ksm, key, 331 + blk_ksm_get_slot_idx(slot)); 332 + if (err) 333 + goto out_unlock; 334 + 335 + hlist_del(&slot->hash_node); 336 + slot->key = NULL; 337 + err = 0; 338 + out_unlock: 339 + blk_ksm_hw_exit(ksm); 340 + return err; 341 + } 342 + 343 + /** 344 + * blk_ksm_reprogram_all_keys() - Re-program all keyslots. 345 + * @ksm: The keyslot manager 346 + * 347 + * Re-program all keyslots that are supposed to have a key programmed. This is 348 + * intended only for use by drivers for hardware that loses its keys on reset. 349 + * 350 + * Context: Process context. Takes and releases ksm->lock. 
351 + */ 352 + void blk_ksm_reprogram_all_keys(struct blk_keyslot_manager *ksm) 353 + { 354 + unsigned int slot; 355 + 356 + /* This is for device initialization, so don't resume the device */ 357 + down_write(&ksm->lock); 358 + for (slot = 0; slot < ksm->num_slots; slot++) { 359 + const struct blk_crypto_key *key = ksm->slots[slot].key; 360 + int err; 361 + 362 + if (!key) 363 + continue; 364 + 365 + err = ksm->ksm_ll_ops.keyslot_program(ksm, key, slot); 366 + WARN_ON(err); 367 + } 368 + up_write(&ksm->lock); 369 + } 370 + EXPORT_SYMBOL_GPL(blk_ksm_reprogram_all_keys); 371 + 372 + void blk_ksm_destroy(struct blk_keyslot_manager *ksm) 373 + { 374 + if (!ksm) 375 + return; 376 + kvfree(ksm->slot_hashtable); 377 + memzero_explicit(ksm->slots, sizeof(ksm->slots[0]) * ksm->num_slots); 378 + kvfree(ksm->slots); 379 + memzero_explicit(ksm, sizeof(*ksm)); 380 + } 381 + EXPORT_SYMBOL_GPL(blk_ksm_destroy); 382 + 383 + bool blk_ksm_register(struct blk_keyslot_manager *ksm, struct request_queue *q) 384 + { 385 + if (blk_integrity_queue_supports_integrity(q)) { 386 + pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); 387 + return false; 388 + } 389 + q->ksm = ksm; 390 + return true; 391 + } 392 + EXPORT_SYMBOL_GPL(blk_ksm_register); 393 + 394 + void blk_ksm_unregister(struct request_queue *q) 395 + { 396 + q->ksm = NULL; 397 + }
+1 -1
block/kyber-iosched.c
··· 579 579 return merged; 580 580 } 581 581 582 - static void kyber_prepare_request(struct request *rq, struct bio *bio) 582 + static void kyber_prepare_request(struct request *rq) 583 583 { 584 584 rq_set_domain_token(rq, -1); 585 585 }
+1 -1
block/mq-deadline.c
··· 541 541 * Nothing to do here. This is defined only to ensure that .finish_request 542 542 * method is called upon request completion. 543 543 */ 544 - static void dd_prepare_request(struct request *rq, struct bio *bio) 544 + static void dd_prepare_request(struct request *rq) 545 545 { 546 546 } 547 547
+158 -29
block/partitions/core.c
··· 274 274 .uevent = part_uevent, 275 275 }; 276 276 277 - static void delete_partition_work_fn(struct work_struct *work) 277 + static void hd_struct_free_work(struct work_struct *work) 278 278 { 279 - struct hd_struct *part = container_of(to_rcu_work(work), struct hd_struct, 280 - rcu_work); 279 + struct hd_struct *part = 280 + container_of(to_rcu_work(work), struct hd_struct, rcu_work); 281 281 282 282 part->start_sect = 0; 283 283 part->nr_sects = 0; ··· 285 285 put_device(part_to_dev(part)); 286 286 } 287 287 288 - void __delete_partition(struct percpu_ref *ref) 288 + static void hd_struct_free(struct percpu_ref *ref) 289 289 { 290 290 struct hd_struct *part = container_of(ref, struct hd_struct, ref); 291 - INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn); 291 + struct gendisk *disk = part_to_disk(part); 292 + struct disk_part_tbl *ptbl = 293 + rcu_dereference_protected(disk->part_tbl, 1); 294 + 295 + rcu_assign_pointer(ptbl->last_lookup, NULL); 296 + put_device(disk_to_dev(disk)); 297 + 298 + INIT_RCU_WORK(&part->rcu_work, hd_struct_free_work); 292 299 queue_rcu_work(system_wq, &part->rcu_work); 300 + } 301 + 302 + int hd_ref_init(struct hd_struct *part) 303 + { 304 + if (percpu_ref_init(&part->ref, hd_struct_free, 0, GFP_KERNEL)) 305 + return -ENOMEM; 306 + return 0; 293 307 } 294 308 295 309 /* 296 310 * Must be called either with bd_mutex held, before a disk can be opened or 297 311 * after all disk users are gone. 
298 312 */ 299 - void delete_partition(struct gendisk *disk, int partno) 313 + void delete_partition(struct gendisk *disk, struct hd_struct *part) 300 314 { 301 315 struct disk_part_tbl *ptbl = 302 316 rcu_dereference_protected(disk->part_tbl, 1); 303 - struct hd_struct *part; 304 317 305 - if (partno >= ptbl->len) 306 - return; 307 - 308 - part = rcu_dereference_protected(ptbl->part[partno], 1); 309 - if (!part) 310 - return; 311 - 312 - rcu_assign_pointer(ptbl->part[partno], NULL); 313 - rcu_assign_pointer(ptbl->last_lookup, NULL); 318 + /* 319 + * ->part_tbl is referenced in this part's release handler, so 320 + * we have to hold the disk device 321 + */ 322 + get_device(disk_to_dev(part_to_disk(part))); 323 + rcu_assign_pointer(ptbl->part[part->partno], NULL); 314 324 kobject_put(part->holder_dir); 315 325 device_del(part_to_dev(part)); 316 326 ··· 331 321 * "in-use" until we really free the gendisk. 332 322 */ 333 323 blk_invalidate_devt(part_devt(part)); 334 - hd_struct_kill(part); 324 + percpu_ref_kill(&part->ref); 335 325 } 336 326 337 327 static ssize_t whole_disk_show(struct device *dev, ··· 345 335 * Must be called either with bd_mutex held, before a disk can be opened or 346 336 * after all disk users are gone. 
347 337 */ 348 - struct hd_struct *add_partition(struct gendisk *disk, int partno, 338 + static struct hd_struct *add_partition(struct gendisk *disk, int partno, 349 339 sector_t start, sector_t len, int flags, 350 340 struct partition_meta_info *info) 351 341 { ··· 387 377 if (!p) 388 378 return ERR_PTR(-EBUSY); 389 379 390 - if (!init_part_stats(p)) { 380 + p->dkstats = alloc_percpu(struct disk_stats); 381 + if (!p->dkstats) { 391 382 err = -ENOMEM; 392 383 goto out_free; 393 384 } 394 385 395 - seqcount_init(&p->nr_sects_seq); 386 + hd_sects_seq_init(p); 396 387 pdev = part_to_dev(p); 397 388 398 389 p->start_sect = start; ··· 469 458 out_free_info: 470 459 kfree(p->info); 471 460 out_free_stats: 472 - free_part_stats(p); 461 + free_percpu(p->dkstats); 473 462 out_free: 474 463 kfree(p); 475 464 return ERR_PTR(err); ··· 481 470 out_put: 482 471 put_device(pdev); 483 472 return ERR_PTR(err); 473 + } 474 + 475 + static bool partition_overlaps(struct gendisk *disk, sector_t start, 476 + sector_t length, int skip_partno) 477 + { 478 + struct disk_part_iter piter; 479 + struct hd_struct *part; 480 + bool overlap = false; 481 + 482 + disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY); 483 + while ((part = disk_part_iter_next(&piter))) { 484 + if (part->partno == skip_partno || 485 + start >= part->start_sect + part->nr_sects || 486 + start + length <= part->start_sect) 487 + continue; 488 + overlap = true; 489 + break; 490 + } 491 + 492 + disk_part_iter_exit(&piter); 493 + return overlap; 494 + } 495 + 496 + int bdev_add_partition(struct block_device *bdev, int partno, 497 + sector_t start, sector_t length) 498 + { 499 + struct hd_struct *part; 500 + 501 + mutex_lock(&bdev->bd_mutex); 502 + if (partition_overlaps(bdev->bd_disk, start, length, -1)) { 503 + mutex_unlock(&bdev->bd_mutex); 504 + return -EBUSY; 505 + } 506 + 507 + part = add_partition(bdev->bd_disk, partno, start, length, 508 + ADDPART_FLAG_NONE, NULL); 509 + mutex_unlock(&bdev->bd_mutex); 510 + 
return PTR_ERR_OR_ZERO(part); 511 + } 512 + 513 + int bdev_del_partition(struct block_device *bdev, int partno) 514 + { 515 + struct block_device *bdevp; 516 + struct hd_struct *part; 517 + int ret = 0; 518 + 519 + part = disk_get_part(bdev->bd_disk, partno); 520 + if (!part) 521 + return -ENXIO; 522 + 523 + ret = -ENOMEM; 524 + bdevp = bdget(part_devt(part)); 525 + if (!bdevp) 526 + goto out_put_part; 527 + 528 + mutex_lock(&bdevp->bd_mutex); 529 + 530 + ret = -EBUSY; 531 + if (bdevp->bd_openers) 532 + goto out_unlock; 533 + 534 + sync_blockdev(bdevp); 535 + invalidate_bdev(bdevp); 536 + 537 + mutex_lock_nested(&bdev->bd_mutex, 1); 538 + delete_partition(bdev->bd_disk, part); 539 + mutex_unlock(&bdev->bd_mutex); 540 + 541 + ret = 0; 542 + out_unlock: 543 + mutex_unlock(&bdevp->bd_mutex); 544 + bdput(bdevp); 545 + out_put_part: 546 + disk_put_part(part); 547 + return ret; 548 + } 549 + 550 + int bdev_resize_partition(struct block_device *bdev, int partno, 551 + sector_t start, sector_t length) 552 + { 553 + struct block_device *bdevp; 554 + struct hd_struct *part; 555 + int ret = 0; 556 + 557 + part = disk_get_part(bdev->bd_disk, partno); 558 + if (!part) 559 + return -ENXIO; 560 + 561 + ret = -ENOMEM; 562 + bdevp = bdget(part_devt(part)); 563 + if (!bdevp) 564 + goto out_put_part; 565 + 566 + mutex_lock(&bdevp->bd_mutex); 567 + mutex_lock_nested(&bdev->bd_mutex, 1); 568 + 569 + ret = -EINVAL; 570 + if (start != part->start_sect) 571 + goto out_unlock; 572 + 573 + ret = -EBUSY; 574 + if (partition_overlaps(bdev->bd_disk, start, length, partno)) 575 + goto out_unlock; 576 + 577 + part_nr_sects_write(part, (sector_t)length); 578 + i_size_write(bdevp->bd_inode, length << SECTOR_SHIFT); 579 + 580 + ret = 0; 581 + out_unlock: 582 + mutex_unlock(&bdevp->bd_mutex); 583 + mutex_unlock(&bdev->bd_mutex); 584 + bdput(bdevp); 585 + out_put_part: 586 + disk_put_part(part); 587 + return ret; 484 588 } 485 589 486 590 static bool disk_unlock_native_capacity(struct gendisk *disk) 
··· 614 488 } 615 489 } 616 490 617 - int blk_drop_partitions(struct gendisk *disk, struct block_device *bdev) 491 + int blk_drop_partitions(struct block_device *bdev) 618 492 { 619 493 struct disk_part_iter piter; 620 494 struct hd_struct *part; 621 - int res; 622 495 623 - if (!disk_part_scan_enabled(disk)) 496 + if (!disk_part_scan_enabled(bdev->bd_disk)) 624 497 return 0; 625 498 if (bdev->bd_part_count) 626 499 return -EBUSY; 627 - res = invalidate_partition(disk, 0); 628 - if (res) 629 - return res; 630 500 631 - disk_part_iter_init(&piter, disk, DISK_PITER_INCL_EMPTY); 501 + sync_blockdev(bdev); 502 + invalidate_bdev(bdev); 503 + 504 + disk_part_iter_init(&piter, bdev->bd_disk, DISK_PITER_INCL_EMPTY); 632 505 while ((part = disk_part_iter_next(&piter))) 633 - delete_partition(disk, part->partno); 506 + delete_partition(bdev->bd_disk, part); 634 507 disk_part_iter_exit(&piter); 635 508 636 509 return 0; 637 510 } 511 + #ifdef CONFIG_S390 512 + /* for historic reasons in the DASD driver */ 513 + EXPORT_SYMBOL_GPL(blk_drop_partitions); 514 + #endif 638 515 639 516 static bool blk_add_partition(struct gendisk *disk, struct block_device *bdev, 640 517 struct parsed_partitions *state, int p)
+11 -19
drivers/ata/libata-scsi.c
···
 {
 	struct scsi_cmnd *scmd = qc->scsicmd;
 
-	qc->extrabytes = scmd->request->extra_len;
+	qc->extrabytes = scmd->extra_len;
 	qc->nbytes = scsi_bufflen(scmd) + qc->extrabytes;
 }
···
  * RETURNS:
  * 1 if ; otherwise, 0.
  */
-static int atapi_drain_needed(struct request *rq)
+bool ata_scsi_dma_need_drain(struct request *rq)
 {
-	if (likely(!blk_rq_is_passthrough(rq)))
-		return 0;
-
-	if (!blk_rq_bytes(rq) || op_is_write(req_op(rq)))
-		return 0;
-
 	return atapi_cmd_type(scsi_req(rq)->cmd[0]) == ATAPI_MISC;
 }
+EXPORT_SYMBOL_GPL(ata_scsi_dma_need_drain);
 
 int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 {
···
 	blk_queue_max_hw_sectors(q, dev->max_sectors);
 
 	if (dev->class == ATA_DEV_ATAPI) {
-		void *buf;
-
 		sdev->sector_size = ATA_SECT_SIZE;
 
 		/* set DMA padding */
 		blk_queue_update_dma_pad(q, ATA_DMA_PAD_SZ - 1);
 
-		/* configure draining */
-		buf = kmalloc(ATAPI_MAX_DRAIN, q->bounce_gfp | GFP_KERNEL);
-		if (!buf) {
+		/* make room for appending the drain */
+		blk_queue_max_segments(q, queue_max_segments(q) - 1);
+
+		sdev->dma_drain_len = ATAPI_MAX_DRAIN;
+		sdev->dma_drain_buf = kmalloc(sdev->dma_drain_len,
+					      q->bounce_gfp | GFP_KERNEL);
+		if (!sdev->dma_drain_buf) {
 			ata_dev_err(dev, "drain buffer allocation failed\n");
 			return -ENOMEM;
 		}
-
-		blk_queue_dma_drain(q, atapi_drain_needed, buf, ATAPI_MAX_DRAIN);
 	} else {
 		sdev->sector_size = ata_id_logical_sector_size(dev->id);
 		sdev->manage_start_stop = 1;
···
 void ata_scsi_slave_destroy(struct scsi_device *sdev)
 {
 	struct ata_port *ap = ata_shost_to_port(sdev->host);
-	struct request_queue *q = sdev->request_queue;
 	unsigned long flags;
 	struct ata_device *dev;
···
 	}
 	spin_unlock_irqrestore(ap->lock, flags);
 
-	kfree(q->dma_drain_buffer);
-	q->dma_drain_buffer = NULL;
-	q->dma_drain_size = 0;
+	kfree(sdev->dma_drain_buf);
 }
 EXPORT_SYMBOL_GPL(ata_scsi_slave_destroy);
+2 -35
drivers/base/core.c
···
 }
 
 /**
- * device_create_vargs - creates a device and registers it with sysfs
- * @class: pointer to the struct class that this device should be registered to
- * @parent: pointer to the parent struct device of this new device, if any
- * @devt: the dev_t for the char device to be added
- * @drvdata: the data to be added to the device for callbacks
- * @fmt: string for the device's name
- * @args: va_list for the device's name
- *
- * This function can be used by char device classes.  A struct device
- * will be created in sysfs, registered to the specified class.
- *
- * A "dev" file will be created, showing the dev_t for the device, if
- * the dev_t is not 0,0.
- * If a pointer to a parent struct device is passed in, the newly created
- * struct device will be a child of that device in sysfs.
- * The pointer to the struct device will be returned from the call.
- * Any further sysfs files that might be required can be created using this
- * pointer.
- *
- * Returns &struct device pointer on success, or ERR_PTR() on error.
- *
- * Note: the struct class passed to this function must have previously
- * been created with a call to class_create().
- */
-struct device *device_create_vargs(struct class *class, struct device *parent,
-				   dev_t devt, void *drvdata, const char *fmt,
-				   va_list args)
-{
-	return device_create_groups_vargs(class, parent, devt, drvdata, NULL,
-					  fmt, args);
-}
-EXPORT_SYMBOL_GPL(device_create_vargs);
-
-/**
  * device_create - creates a device and registers it with sysfs
  * @class: pointer to the struct class that this device should be registered to
  * @parent: pointer to the parent struct device of this new device, if any
···
 	struct device *dev;
 
 	va_start(vargs, fmt);
-	dev = device_create_vargs(class, parent, devt, drvdata, fmt, vargs);
+	dev = device_create_groups_vargs(class, parent, devt, drvdata, NULL,
+					 fmt, vargs);
 	va_end(vargs);
 	return dev;
 }
-1
drivers/block/aoe/aoeblk.c
···
 	WARN_ON(d->gd);
 	WARN_ON(d->flags & DEVFL_UP);
 	blk_queue_max_hw_sectors(q, BLK_DEF_MAX_SECTORS);
-	q->backing_dev_info->name = "aoe";
 	q->backing_dev_info->ra_pages = READ_AHEAD / PAGE_SIZE;
 	d->bufpool = mp;
 	d->blkq = gd->queue = q;
+4 -23
drivers/block/drbd/drbd_req.c
···
 
 static bool drbd_may_do_local_read(struct drbd_device *device, sector_t sector, int size);
 
-/* Update disk stats at start of I/O request */
-static void _drbd_start_io_acct(struct drbd_device *device, struct drbd_request *req)
-{
-	struct request_queue *q = device->rq_queue;
-
-	generic_start_io_acct(q, bio_op(req->master_bio),
-			      req->i.size >> 9, &device->vdisk->part0);
-}
-
-/* Update disk stats when completing request upwards */
-static void _drbd_end_io_acct(struct drbd_device *device, struct drbd_request *req)
-{
-	struct request_queue *q = device->rq_queue;
-
-	generic_end_io_acct(q, bio_op(req->master_bio),
-			    &device->vdisk->part0, req->start_jif);
-}
-
 static struct drbd_request *drbd_req_new(struct drbd_device *device, struct bio *bio_src)
 {
 	struct drbd_request *req;
···
 		start_new_tl_epoch(first_peer_device(device)->connection);
 
 	/* Update disk stats */
-	_drbd_end_io_acct(device, req);
+	bio_end_io_acct(req->master_bio, req->start_jif);
 
 	/* If READ failed,
 	 * have it be pushed back to the retry work queue,
···
 		bio_endio(bio);
 		return ERR_PTR(-ENOMEM);
 	}
-	req->start_jif = start_jif;
+
+	/* Update disk stats */
+	req->start_jif = bio_start_io_acct(req->master_bio);
 
 	if (!get_ldev(device)) {
 		bio_put(req->private_bio);
 		req->private_bio = NULL;
 	}
-
-	/* Update disk stats */
-	_drbd_start_io_acct(device, req);
 
 	/* process discards always from our submitter thread */
 	if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+1 -1
drivers/block/loop.c
···
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
+27 -1
drivers/block/null_blk_main.c
···
 	return errno_to_blk_status(err);
 }
 
+static void nullb_zero_read_cmd_buffer(struct nullb_cmd *cmd)
+{
+	struct nullb_device *dev = cmd->nq->dev;
+	struct bio *bio;
+
+	if (dev->memory_backed)
+		return;
+
+	if (dev->queue_mode == NULL_Q_BIO && bio_op(cmd->bio) == REQ_OP_READ) {
+		zero_fill_bio(cmd->bio);
+	} else if (req_op(cmd->rq) == REQ_OP_READ) {
+		__rq_for_each_bio(bio, cmd->rq)
+			zero_fill_bio(bio);
+	}
+}
+
 static inline void nullb_complete_cmd(struct nullb_cmd *cmd)
 {
+	/*
+	 * Since root privileges are required to configure the null_blk
+	 * driver, it is fine that this driver does not initialize the
+	 * data buffers of read commands. Zero-initialize these buffers
+	 * anyway if KMSAN is enabled to prevent that KMSAN complains
+	 * about null_blk not initializing read data buffers.
+	 */
+	if (IS_ENABLED(CONFIG_KMSAN))
+		nullb_zero_read_cmd_buffer(cmd);
+
 	/* Complete IO by inline, softirq or timer */
 	switch (cmd->nq->dev->irqmode) {
 	case NULL_IRQ_SOFTIRQ:
···
 static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res)
 {
 	pr_info("rq %p timed out\n", rq);
-	blk_mq_complete_request(rq);
+	blk_mq_force_complete_rq(rq);
 	return BLK_EH_DONE;
 }
+29 -8
drivers/block/null_blk_zoned.c
···
 
 int null_register_zoned_dev(struct nullb *nullb)
 {
+	struct nullb_device *dev = nullb->dev;
 	struct request_queue *q = nullb->q;
 
-	if (queue_is_mq(q))
-		return blk_revalidate_disk_zones(nullb->disk);
+	if (queue_is_mq(q)) {
+		int ret = blk_revalidate_disk_zones(nullb->disk, NULL);
 
-	blk_queue_chunk_sectors(q, nullb->dev->zone_size_sects);
-	q->nr_zones = blkdev_nr_zones(nullb->disk);
+		if (ret)
+			return ret;
+	} else {
+		blk_queue_chunk_sectors(q, dev->zone_size_sects);
+		q->nr_zones = blkdev_nr_zones(nullb->disk);
+	}
+
+	blk_queue_max_zone_append_sectors(q, dev->zone_size_sects);
 
 	return 0;
 }
···
 }
 
 static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-	unsigned int nr_sectors)
+	unsigned int nr_sectors, bool append)
 {
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
···
 	case BLK_ZONE_COND_IMP_OPEN:
 	case BLK_ZONE_COND_EXP_OPEN:
 	case BLK_ZONE_COND_CLOSED:
-		/* Writes must be at the write pointer position */
-		if (sector != zone->wp)
+		/*
+		 * Regular writes must be at the write pointer position.
+		 * Zone append writes are automatically issued at the write
+		 * pointer and the position returned using the request or BIO
+		 * sector.
+		 */
+		if (append) {
+			sector = zone->wp;
+			if (cmd->bio)
+				cmd->bio->bi_iter.bi_sector = sector;
+			else
+				cmd->rq->__sector = sector;
+		} else if (sector != zone->wp) {
 			return BLK_STS_IOERR;
+		}
 
 		if (zone->cond != BLK_ZONE_COND_EXP_OPEN)
 			zone->cond = BLK_ZONE_COND_IMP_OPEN;
···
 {
 	switch (op) {
 	case REQ_OP_WRITE:
-		return null_zone_write(cmd, sector, nr_sectors, false);
+		return null_zone_write(cmd, sector, nr_sectors, false);
+	case REQ_OP_ZONE_APPEND:
+		return null_zone_write(cmd, sector, nr_sectors, true);
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_RESET_ALL:
 	case REQ_OP_ZONE_OPEN:
+1 -1
drivers/block/paride/pcd.c
···
 
 	for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) {
 		if (cd->present) {
-			register_cdrom(&cd->info);
+			register_cdrom(cd->disk, &cd->info);
 			cd->disk->private_data = cd;
 			add_disk(cd->disk);
 		}
+2 -17
drivers/block/rsxx/dev.c
···
 	.ioctl		= rsxx_blkdev_ioctl,
 };
 
-static void disk_stats_start(struct rsxx_cardinfo *card, struct bio *bio)
-{
-	generic_start_io_acct(card->queue, bio_op(bio), bio_sectors(bio),
-			      &card->gendisk->part0);
-}
-
-static void disk_stats_complete(struct rsxx_cardinfo *card,
-				struct bio *bio,
-				unsigned long start_time)
-{
-	generic_end_io_acct(card->queue, bio_op(bio),
-			    &card->gendisk->part0, start_time);
-}
-
 static void bio_dma_done_cb(struct rsxx_cardinfo *card,
 			    void *cb_data,
 			    unsigned int error)
···
 
 	if (atomic_dec_and_test(&meta->pending_dmas)) {
 		if (!card->eeh_state && card->gendisk)
-			disk_stats_complete(card, meta->bio, meta->start_time);
+			bio_end_io_acct(meta->bio, meta->start_time);
 
 		if (atomic_read(&meta->error))
 			bio_io_error(meta->bio);
···
 	bio_meta->bio = bio;
 	atomic_set(&bio_meta->error, 0);
 	atomic_set(&bio_meta->pending_dmas, 0);
-	bio_meta->start_time = jiffies;
 
 	if (!unlikely(card->halt))
-		disk_stats_start(card, bio);
+		bio_meta->start_time = bio_start_io_acct(bio);
 
 	dev_dbg(CARD_TO_DEV(card), "BIO[%c]: meta: %p addr8: x%llx size: %d\n",
 		bio_data_dir(bio) ? 'W' : 'R', bio_meta,
+10 -14
drivers/block/zram/zram_drv.c
···
 static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
 			int offset, unsigned int op, struct bio *bio)
 {
-	unsigned long start_time = jiffies;
-	struct request_queue *q = zram->disk->queue;
 	int ret;
-
-	generic_start_io_acct(q, op, bvec->bv_len >> SECTOR_SHIFT,
-			      &zram->disk->part0);
 
 	if (!op_is_write(op)) {
 		atomic64_inc(&zram->stats.num_reads);
···
 		atomic64_inc(&zram->stats.num_writes);
 		ret = zram_bvec_write(zram, bvec, index, offset, bio);
 	}
-
-	generic_end_io_acct(q, op, &zram->disk->part0, start_time);
 
 	zram_slot_lock(zram, index);
 	zram_accessed(zram, index);
···
 	u32 index;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
+	unsigned long start_time;
 
 	index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
 	offset = (bio->bi_iter.bi_sector &
···
 		break;
 	}
 
+	start_time = bio_start_io_acct(bio);
 	bio_for_each_segment(bvec, bio, iter) {
 		struct bio_vec bv = bvec;
 		unsigned int unwritten = bvec.bv_len;
···
 			bv.bv_len = min_t(unsigned int, PAGE_SIZE - offset,
 					  unwritten);
 			if (zram_bvec_rw(zram, &bv, index, offset,
-					 bio_op(bio), bio) < 0)
-				goto out;
+					 bio_op(bio), bio) < 0) {
+				bio->bi_status = BLK_STS_IOERR;
+				break;
+			}
 
 			bv.bv_offset += bv.bv_len;
 			unwritten -= bv.bv_len;
···
 			update_position(&index, &offset, &bv);
 		} while (unwritten);
 	}
-
+	bio_end_io_acct(bio, start_time);
 	bio_endio(bio);
-	return;
-
-out:
-	bio_io_error(bio);
 }
 
 /*
···
 	u32 index;
 	struct zram *zram;
 	struct bio_vec bv;
+	unsigned long start_time;
 
 	if (PageTransHuge(page))
 		return -ENOTSUPP;
···
 	bv.bv_len = PAGE_SIZE;
 	bv.bv_offset = 0;
 
+	start_time = disk_start_io_acct(bdev->bd_disk, SECTORS_PER_PAGE, op);
 	ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
+	disk_end_io_acct(bdev->bd_disk, op, start_time);
 out:
 	/*
 	 * If I/O fails, just return error(ie, non-zero) without
+51 -34
drivers/cdrom/cdrom.c
···
 	return 0;
 }
 
-int register_cdrom(struct cdrom_device_info *cdi)
+int register_cdrom(struct gendisk *disk, struct cdrom_device_info *cdi)
 {
 	static char banner_printed;
 	const struct cdrom_device_ops *cdo = cdi->ops;
···
 		banner_printed = 1;
 		cdrom_sysctl_register();
 	}
+
+	cdi->disk = disk;
+	disk->cdi = cdi;
 
 	ENSURE(cdo, drive_status, CDC_DRIVE_STATUS);
 	if (cdo->check_events == NULL && cdo->media_changed == NULL)
···
 	return cdrom_read_cdda_old(cdi, ubuf, lba, nframes);
 }
 
-static int cdrom_ioctl_multisession(struct cdrom_device_info *cdi,
-		void __user *argp)
+int cdrom_multisession(struct cdrom_device_info *cdi,
+		struct cdrom_multisession *info)
 {
-	struct cdrom_multisession ms_info;
 	u8 requested_format;
 	int ret;
 
-	cd_dbg(CD_DO_IOCTL, "entering CDROMMULTISESSION\n");
-
 	if (!(cdi->ops->capability & CDC_MULTI_SESSION))
 		return -ENOSYS;
 
-	if (copy_from_user(&ms_info, argp, sizeof(ms_info)))
-		return -EFAULT;
-
-	requested_format = ms_info.addr_format;
+	requested_format = info->addr_format;
 	if (requested_format != CDROM_MSF && requested_format != CDROM_LBA)
 		return -EINVAL;
-	ms_info.addr_format = CDROM_LBA;
+	info->addr_format = CDROM_LBA;
 
-	ret = cdi->ops->get_last_session(cdi, &ms_info);
+	ret = cdi->ops->get_last_session(cdi, info);
+	if (!ret)
+		sanitize_format(&info->addr, &info->addr_format,
+				requested_format);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cdrom_multisession);
+
+static int cdrom_ioctl_multisession(struct cdrom_device_info *cdi,
+		void __user *argp)
+{
+	struct cdrom_multisession info;
+	int ret;
+
+	cd_dbg(CD_DO_IOCTL, "entering CDROMMULTISESSION\n");
+
+	if (copy_from_user(&info, argp, sizeof(info)))
+		return -EFAULT;
+	ret = cdrom_multisession(cdi, &info);
 	if (ret)
 		return ret;
-
-	sanitize_format(&ms_info.addr, &ms_info.addr_format, requested_format);
-
-	if (copy_to_user(argp, &ms_info, sizeof(ms_info)))
+	if (copy_to_user(argp, &info, sizeof(info)))
 		return -EFAULT;
 
 	cd_dbg(CD_DO_IOCTL, "CDROMMULTISESSION successful\n");
-	return 0;
+	return ret;
 }
 
 static int cdrom_ioctl_eject(struct cdrom_device_info *cdi)
···
 	return 0;
 }
 
+int cdrom_read_tocentry(struct cdrom_device_info *cdi,
+		struct cdrom_tocentry *entry)
+{
+	u8 requested_format = entry->cdte_format;
+	int ret;
+
+	if (requested_format != CDROM_MSF && requested_format != CDROM_LBA)
+		return -EINVAL;
+
+	/* make interface to low-level uniform */
+	entry->cdte_format = CDROM_MSF;
+	ret = cdi->ops->audio_ioctl(cdi, CDROMREADTOCENTRY, entry);
+	if (!ret)
+		sanitize_format(&entry->cdte_addr, &entry->cdte_format,
+				requested_format);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cdrom_read_tocentry);
+
 static int cdrom_ioctl_read_tocentry(struct cdrom_device_info *cdi,
 		void __user *argp)
 {
 	struct cdrom_tocentry entry;
-	u8 requested_format;
 	int ret;
-
-	/* cd_dbg(CD_DO_IOCTL, "entering CDROMREADTOCENTRY\n"); */
 
 	if (copy_from_user(&entry, argp, sizeof(entry)))
 		return -EFAULT;
-
-	requested_format = entry.cdte_format;
-	if (requested_format != CDROM_MSF && requested_format != CDROM_LBA)
-		return -EINVAL;
-	/* make interface to low-level uniform */
-	entry.cdte_format = CDROM_MSF;
-	ret = cdi->ops->audio_ioctl(cdi, CDROMREADTOCENTRY, &entry);
-	if (ret)
-		return ret;
-
-	sanitize_format(&entry.cdte_addr, &entry.cdte_format, requested_format);
-
-	if (copy_to_user(argp, &entry, sizeof(entry)))
+	ret = cdrom_read_tocentry(cdi, &entry);
+	if (!ret && copy_to_user(argp, &entry, sizeof(entry)))
 		return -EFAULT;
-	/* cd_dbg(CD_DO_IOCTL, "CDROMREADTOCENTRY successful\n"); */
-	return 0;
+	return ret;
 }
 
 static int cdrom_ioctl_play_msf(struct cdrom_device_info *cdi,
+1 -1
drivers/cdrom/gdrom.c
···
 		goto probe_fail_no_disk;
 	}
 	probe_gdrom_setupdisk();
-	if (register_cdrom(gd.cd_info)) {
+	if (register_cdrom(gd.disk, gd.cd_info)) {
 		err = -ENODEV;
 		goto probe_fail_cdrom_register;
 	}
+8 -9
drivers/ide/ide-cd.c
···
 	return 0;
 }
 
-static int cdrom_read_tocentry(ide_drive_t *drive, int trackno, int msf_flag,
-				int format, char *buf, int buflen)
+static int ide_cdrom_read_tocentry(ide_drive_t *drive, int trackno,
+		int msf_flag, int format, char *buf, int buflen)
 {
 	unsigned char cmd[BLK_MAX_CDB];
···
 			     sectors_per_frame << SECTOR_SHIFT);
 
 	/* first read just the header, so we know how long the TOC is */
-	stat = cdrom_read_tocentry(drive, 0, 1, 0, (char *) &toc->hdr,
+	stat = ide_cdrom_read_tocentry(drive, 0, 1, 0, (char *) &toc->hdr,
 				    sizeof(struct atapi_toc_header));
 	if (stat)
 		return stat;
···
 		ntracks = MAX_TRACKS;
 
 	/* now read the whole schmeer */
-	stat = cdrom_read_tocentry(drive, toc->hdr.first_track, 1, 0,
+	stat = ide_cdrom_read_tocentry(drive, toc->hdr.first_track, 1, 0,
 				  (char *)&toc->hdr,
 				   sizeof(struct atapi_toc_header) +
 				   (ntracks + 1) *
···
 	 * Heiko Eißfeldt.
 	 */
 	ntracks = 0;
-	stat = cdrom_read_tocentry(drive, CDROM_LEADOUT, 1, 0,
+	stat = ide_cdrom_read_tocentry(drive, CDROM_LEADOUT, 1, 0,
 				  (char *)&toc->hdr,
 				   sizeof(struct atapi_toc_header) +
 				   (ntracks + 1) *
···
 
 	if (toc->hdr.first_track != CDROM_LEADOUT) {
 		/* read the multisession information */
-		stat = cdrom_read_tocentry(drive, 0, 0, 1, (char *)&ms_tmp,
+		stat = ide_cdrom_read_tocentry(drive, 0, 0, 1, (char *)&ms_tmp,
 					   sizeof(ms_tmp));
 		if (stat)
 			return stat;
···
 
 	if (drive->atapi_flags & IDE_AFLAG_TOCADDR_AS_BCD) {
 		/* re-read multisession information using MSF format */
-		stat = cdrom_read_tocentry(drive, 0, 1, 1, (char *)&ms_tmp,
+		stat = ide_cdrom_read_tocentry(drive, 0, 1, 1, (char *)&ms_tmp,
 					   sizeof(ms_tmp));
 		if (stat)
 			return stat;
···
 	if (drive->atapi_flags & IDE_AFLAG_NO_SPEED_SELECT)
 		devinfo->mask |= CDC_SELECT_SPEED;
 
-	devinfo->disk = info->disk;
-	return register_cdrom(devinfo);
+	return register_cdrom(info->disk, devinfo);
 }
 
 static int ide_cdrom_probe_capabilities(ide_drive_t *drive)
+5 -2
drivers/ide/ide-io.c
···
 void ide_map_sg(ide_drive_t *drive, struct ide_cmd *cmd)
 {
 	ide_hwif_t *hwif = drive->hwif;
-	struct scatterlist *sg = hwif->sg_table;
+	struct scatterlist *sg = hwif->sg_table, *last_sg = NULL;
 	struct request *rq = cmd->rq;
 
-	cmd->sg_nents = blk_rq_map_sg(drive->queue, rq, sg);
+	cmd->sg_nents = __blk_rq_map_sg(drive->queue, rq, sg, &last_sg);
+	if (blk_rq_bytes(rq) && (blk_rq_bytes(rq) & rq->q->dma_pad_mask))
+		last_sg->length +=
+			(rq->q->dma_pad_mask & ~blk_rq_bytes(rq)) + 1;
 }
 EXPORT_SYMBOL_GPL(ide_map_sg);
+3 -5
drivers/lightnvm/pblk-cache.c
···
 void pblk_write_to_cache(struct pblk *pblk, struct bio *bio,
 			 unsigned long flags)
 {
-	struct request_queue *q = pblk->dev->q;
 	struct pblk_w_ctx w_ctx;
 	sector_t lba = pblk_get_lba(bio);
-	unsigned long start_time = jiffies;
+	unsigned long start_time;
 	unsigned int bpos, pos;
 	int nr_entries = pblk_get_secs(bio);
 	int i, ret;
 
-	generic_start_io_acct(q, REQ_OP_WRITE, bio_sectors(bio),
-			      &pblk->disk->part0);
+	start_time = bio_start_io_acct(bio);
 
 	/* Update the write buffer head (mem) with the entries that we can
 	 * write. The write in itself cannot fail, so there is no need to
···
 	pblk_rl_inserted(&pblk->rl, nr_entries);
 
 out:
-	generic_end_io_acct(q, REQ_OP_WRITE, &pblk->disk->part0, start_time);
+	bio_end_io_acct(bio, start_time);
 	pblk_write_should_kick(pblk);
 
 	if (ret == NVM_IO_DONE)
+4 -7
drivers/lightnvm/pblk-read.c
···
 static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
 			       bool put_line)
 {
-	struct nvm_tgt_dev *dev = pblk->dev;
 	struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
 	struct bio *int_bio = rqd->bio;
 	unsigned long start_time = r_ctx->start_time;
 
-	generic_end_io_acct(dev->q, REQ_OP_READ, &pblk->disk->part0, start_time);
+	bio_end_io_acct(int_bio, start_time);
 
 	if (rqd->error)
 		pblk_log_read_err(pblk, rqd);
···
 
 void pblk_submit_read(struct pblk *pblk, struct bio *bio)
 {
-	struct nvm_tgt_dev *dev = pblk->dev;
-	struct request_queue *q = dev->q;
 	sector_t blba = pblk_get_lba(bio);
 	unsigned int nr_secs = pblk_get_secs(bio);
 	bool from_cache;
 	struct pblk_g_ctx *r_ctx;
 	struct nvm_rq *rqd;
 	struct bio *int_bio, *split_bio;
+	unsigned long start_time;
 
-	generic_start_io_acct(q, REQ_OP_READ, bio_sectors(bio),
-			      &pblk->disk->part0);
+	start_time = bio_start_io_acct(bio);
 
 	rqd = pblk_alloc_rqd(pblk, PBLK_READ);
···
 	rqd->end_io = pblk_end_io_read;
 
 	r_ctx = nvm_rq_to_pdu(rqd);
-	r_ctx->start_time = jiffies;
+	r_ctx->start_time = start_time;
 	r_ctx->lba = blba;
 
 	if (pblk_alloc_rqd_meta(pblk, rqd)) {
+4 -15
drivers/md/bcache/request.c
···
 static void bio_complete(struct search *s)
 {
 	if (s->orig_bio) {
-		generic_end_io_acct(s->d->disk->queue, bio_op(s->orig_bio),
-				    &s->d->disk->part0, s->start_time);
-
+		bio_end_io_acct(s->orig_bio, s->start_time);
 		trace_bcache_request_end(s->d, s->orig_bio);
 		s->orig_bio->bi_status = s->iop.status;
 		bio_endio(s->orig_bio);
···
 	s->recoverable		= 1;
 	s->write		= op_is_write(bio_op(bio));
 	s->read_dirty_data	= 0;
-	s->start_time		= jiffies;
+	s->start_time		= bio_start_io_acct(bio);
 
 	s->iop.c		= d->c;
 	s->iop.bio		= NULL;
···
 	bio->bi_end_io = ddip->bi_end_io;
 	bio->bi_private = ddip->bi_private;
 
-	generic_end_io_acct(ddip->d->disk->queue, bio_op(bio),
-			    &ddip->d->disk->part0, ddip->start_time);
+	bio_end_io_acct(bio, ddip->start_time);
 
 	if (bio->bi_status) {
 		struct cached_dev *dc = container_of(ddip->d,
···
 	 */
 	ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
 	ddip->d = d;
-	ddip->start_time = jiffies;
+	ddip->start_time = bio_start_io_acct(bio);
 	ddip->bi_end_io = bio->bi_end_io;
 	ddip->bi_private = bio->bi_private;
 	bio->bi_end_io = detached_dev_end_io;
···
 			quit_max_writeback_rate(d->c, dc);
 		}
 	}
-
-	generic_start_io_acct(q,
-			      bio_op(bio),
-			      bio_sectors(bio),
-			      &d->disk->part0);
 
 	bio_set_dev(bio, dc->bdev);
 	bio->bi_iter.bi_sector += dc->sb.data_offset;
···
 		return BLK_QC_T_NONE;
 	}
 
-	generic_start_io_acct(q, bio_op(bio), bio_sectors(bio), &d->disk->part0);
-
 	s = search_alloc(bio, d);
 	cl = &s->cl;
 	bio = &s->bio.bio;
···
 {
 	struct gendisk *g = d->disk;
 
-	g->queue->make_request_fn = flash_dev_make_request;
 	g->queue->backing_dev_info->congested_fn = flash_dev_congested;
 	d->cache_miss = flash_dev_cache_miss;
 	d->ioctl = flash_dev_ioctl;
+1 -1
drivers/md/dm-integrity.c
···
 
 	dm_integrity_flush_buffers(ic);
 	if (ic->meta_dev)
-		blkdev_issue_flush(ic->dev->bdev, GFP_NOIO, NULL);
+		blkdev_issue_flush(ic->dev->bdev, GFP_NOIO);
 
 	limit = ic->provided_data_sectors;
 	if (ic->sb->flags & cpu_to_le32(SB_FLAG_RECALCULATING)) {
+1 -1
drivers/md/dm-rq.c
···
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
-17
drivers/md/dm-table.c
···
 static int device_area_is_invalid(struct dm_target *ti, struct dm_dev *dev,
 				  sector_t start, sector_t len, void *data)
 {
-	struct request_queue *q;
 	struct queue_limits *limits = data;
 	struct block_device *bdev = dev->bdev;
 	sector_t dev_size =
···
 	unsigned short logical_block_size_sectors =
 		limits->logical_block_size >> SECTOR_SHIFT;
 	char b[BDEVNAME_SIZE];
-
-	/*
-	 * Some devices exist without request functions,
-	 * such as loop devices not yet bound to backing files.
-	 * Forbid the use of such devices.
-	 */
-	q = bdev_get_queue(bdev);
-	if (!q || !q->make_request_fn) {
-		DMWARN("%s: %s is not yet initialised: "
-		       "start=%llu, len=%llu, dev_size=%llu",
-		       dm_device_name(ti->table->md), bdevname(bdev, b),
-		       (unsigned long long)start,
-		       (unsigned long long)len,
-		       (unsigned long long)dev_size);
-		return 1;
-	}
 
 	if (!dev_size)
 		return 0;
+3 -3
drivers/md/dm-zoned-metadata.c
···
 
 	ret = dmz_rdwr_block(zmd, REQ_OP_WRITE, block, mblk->page);
 	if (ret == 0)
-		ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO, NULL);
+		ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO);
 
 	return ret;
 }
···
 
 	/* Flush drive cache (this will also sync data) */
 	if (ret == 0)
-		ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO, NULL);
+		ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO);
 
 	return ret;
 }
···
 
 	/* If there are no dirty metadata blocks, just flush the device cache */
 	if (list_empty(&write_list)) {
-		ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO, NULL);
+		ret = blkdev_issue_flush(zmd->dev->bdev, GFP_NOIO);
 		goto err;
 	}
+17 -7
drivers/md/dm.c
···
 #include <linux/pr.h>
 #include <linux/refcount.h>
 #include <linux/part_stat.h>
+#include <linux/blk-crypto.h>
 
 #define DM_MSG_PREFIX "core"
···
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->orig_bio;
 
-	io->start_time = jiffies;
-
-	generic_start_io_acct(md->queue, bio_op(bio), bio_sectors(bio),
-			      &dm_disk(md)->part0);
-
+	io->start_time = bio_start_io_acct(bio);
 	if (unlikely(dm_stats_used(&md->stats)))
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
 				    bio->bi_iter.bi_sector, bio_sectors(bio),
···
 	struct bio *bio = io->orig_bio;
 	unsigned long duration = jiffies - io->start_time;
 
-	generic_end_io_acct(md->queue, bio_op(bio), &dm_disk(md)->part0,
-			    io->start_time);
+	bio_end_io_acct(bio, io->start_time);
 
 	if (unlikely(dm_stats_used(&md->stats)))
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
···
 
 	__bio_clone_fast(clone, bio);
 
+	bio_crypt_clone(clone, bio, GFP_NOIO);
+
 	if (bio_integrity(bio)) {
 		int r;
···
 	blk_qc_t ret = BLK_QC_T_NONE;
 	int srcu_idx;
 	struct dm_table *map;
+
+	if (dm_get_md_type(md) == DM_TYPE_REQUEST_BASED) {
+		/*
+		 * We are called with a live reference on q_usage_counter, but
+		 * that one will be released as soon as we return.  Grab an
+		 * extra one as blk_mq_make_request expects to be able to
+		 * consume a reference (which lives until the request is freed
+		 * in case a request is allocated).
+		 */
+		percpu_ref_get(&q->q_usage_counter);
+		return blk_mq_make_request(q, bio);
+	}
 
 	map = dm_get_live_table(md, &srcu_idx);
+1 -1
drivers/md/raid5-ppl.c
···
 	}
 
 	/* flush the disk cache after recovery if necessary */
-	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);
+	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL);
 out:
 	__free_page(page);
 	return ret;
+1 -2
drivers/mtd/mtdcore.c
···
 	struct backing_dev_info *bdi;
 	int ret;
 
-	bdi = bdi_alloc(GFP_KERNEL);
+	bdi = bdi_alloc(NUMA_NO_NODE);
 	if (!bdi)
 		return ERR_PTR(-ENOMEM);
 
-	bdi->name = name;
 	/*
 	 * We put '-0' suffix to the name to get the same name format as we
 	 * used to get. Since this is called only once, we get a unique name.
+4 -2
drivers/nvdimm/blk.c
···
 	bip = bio_integrity(bio);
 	nsblk = q->queuedata;
 	rw = bio_data_dir(bio);
-	do_acct = nd_iostat_start(bio, &start);
+	do_acct = blk_queue_io_stat(bio->bi_disk->queue);
+	if (do_acct)
+		start = bio_start_io_acct(bio);
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
···
 		}
 	}
 	if (do_acct)
-		nd_iostat_end(bio, start);
+		bio_end_io_acct(bio, start);
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
+4 -2
drivers/nvdimm/btt.c
···
 	if (!bio_integrity_prep(bio))
 		return BLK_QC_T_NONE;
 
-	do_acct = nd_iostat_start(bio, &start);
+	do_acct = blk_queue_io_stat(bio->bi_disk->queue);
+	if (do_acct)
+		start = bio_start_io_acct(bio);
 	bio_for_each_segment(bvec, bio, iter) {
 		unsigned int len = bvec.bv_len;
···
 		}
 	}
 	if (do_acct)
-		nd_iostat_end(bio, start);
+		bio_end_io_acct(bio, start);
 
 	bio_endio(bio);
 	return BLK_QC_T_NONE;
-19
drivers/nvdimm/nd.h
···
 #endif
 int nd_blk_region_init(struct nd_region *nd_region);
 int nd_region_activate(struct nd_region *nd_region);
-void __nd_iostat_start(struct bio *bio, unsigned long *start);
-static inline bool nd_iostat_start(struct bio *bio, unsigned long *start)
-{
-	struct gendisk *disk = bio->bi_disk;
-
-	if (!blk_queue_io_stat(disk->queue))
-		return false;
-
-	*start = jiffies;
-	generic_start_io_acct(disk->queue, bio_op(bio),
-			      bio_sectors(bio), &disk->part0);
-	return true;
-}
-static inline void nd_iostat_end(struct bio *bio, unsigned long start)
-{
-	struct gendisk *disk = bio->bi_disk;
-
-	generic_end_io_acct(disk->queue, bio_op(bio), &disk->part0, start);
-}
 static inline bool is_bad_pmem(struct badblocks *bb, sector_t sector,
 		unsigned int len)
 {
+4 -2
drivers/nvdimm/pmem.c
··· 202 202 if (bio->bi_opf & REQ_PREFLUSH) 203 203 ret = nvdimm_flush(nd_region, bio); 204 204 205 - do_acct = nd_iostat_start(bio, &start); 205 + do_acct = blk_queue_io_stat(bio->bi_disk->queue); 206 + if (do_acct) 207 + start = bio_start_io_acct(bio); 206 208 bio_for_each_segment(bvec, bio, iter) { 207 209 if (op_is_write(bio_op(bio))) 208 210 rc = pmem_do_write(pmem, bvec.bv_page, bvec.bv_offset, ··· 218 216 } 219 217 } 220 218 if (do_acct) 221 - nd_iostat_end(bio, start); 219 + bio_end_io_acct(bio, start); 222 220 223 221 if (bio->bi_opf & REQ_FUA) 224 222 ret = nvdimm_flush(nd_region, bio);
+1 -1
drivers/nvme/host/core.c
··· 310 310 return true; 311 311 312 312 nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD; 313 - blk_mq_complete_request(req); 313 + blk_mq_force_complete_rq(req); 314 314 return true; 315 315 } 316 316 EXPORT_SYMBOL_GPL(nvme_cancel_request);
+1 -1
drivers/nvme/target/io-cmd-bdev.c
··· 226 226 227 227 u16 nvmet_bdev_flush(struct nvmet_req *req) 228 228 { 229 - if (blkdev_issue_flush(req->ns->bdev, GFP_KERNEL, NULL)) 229 + if (blkdev_issue_flush(req->ns->bdev, GFP_KERNEL)) 230 230 return NVME_SC_INTERNAL | NVME_SC_DNR; 231 231 return 0; 232 232 }
+3 -15
drivers/s390/block/dasd_genhd.c
··· 143 143 */ 144 144 void dasd_destroy_partitions(struct dasd_block *block) 145 145 { 146 - /* The two structs have 168/176 byte on 31/64 bit. */ 147 - struct blkpg_partition bpart; 148 - struct blkpg_ioctl_arg barg; 149 146 struct block_device *bdev; 150 147 151 148 /* ··· 152 155 bdev = block->bdev; 153 156 block->bdev = NULL; 154 157 155 - /* 156 - * See fs/partition/check.c:delete_partition 157 - * Can't call delete_partitions directly. Use ioctl. 158 - * The ioctl also does locking and invalidation. 159 - */ 160 - memset(&bpart, 0, sizeof(struct blkpg_partition)); 161 - memset(&barg, 0, sizeof(struct blkpg_ioctl_arg)); 162 - barg.data = (void __force __user *) &bpart; 163 - barg.op = BLKPG_DEL_PARTITION; 164 - for (bpart.pno = block->gdp->minors - 1; bpart.pno > 0; bpart.pno--) 165 - ioctl_by_bdev(bdev, BLKPG, (unsigned long) &barg); 158 + mutex_lock(&bdev->bd_mutex); 159 + blk_drop_partitions(bdev); 160 + mutex_unlock(&bdev->bd_mutex); 166 161 167 - invalidate_partition(block->gdp, 0); 168 162 /* Matching blkdev_put to the blkdev_get in dasd_scan_partitions. */ 169 163 blkdev_put(bdev, FMODE_READ); 170 164 set_capacity(block->gdp, 0);
+55 -32
drivers/scsi/scsi_lib.c
··· 978 978 scsi_io_completion_action(cmd, result); 979 979 } 980 980 981 - static blk_status_t scsi_init_sgtable(struct request *req, 982 - struct scsi_data_buffer *sdb) 981 + static inline bool scsi_cmd_needs_dma_drain(struct scsi_device *sdev, 982 + struct request *rq) 983 983 { 984 - int count; 985 - 986 - /* 987 - * If sg table allocation fails, requeue request later. 988 - */ 989 - if (unlikely(sg_alloc_table_chained(&sdb->table, 990 - blk_rq_nr_phys_segments(req), sdb->table.sgl, 991 - SCSI_INLINE_SG_CNT))) 992 - return BLK_STS_RESOURCE; 993 - 994 - /* 995 - * Next, walk the list, and fill in the addresses and sizes of 996 - * each segment. 997 - */ 998 - count = blk_rq_map_sg(req->q, req, sdb->table.sgl); 999 - BUG_ON(count > sdb->table.nents); 1000 - sdb->table.nents = count; 1001 - sdb->length = blk_rq_payload_bytes(req); 1002 - return BLK_STS_OK; 984 + return sdev->dma_drain_len && blk_rq_is_passthrough(rq) && 985 + !op_is_write(req_op(rq)) && 986 + sdev->host->hostt->dma_need_drain(rq); 1003 987 } 1004 988 1005 989 /* ··· 999 1015 */ 1000 1016 blk_status_t scsi_init_io(struct scsi_cmnd *cmd) 1001 1017 { 1018 + struct scsi_device *sdev = cmd->device; 1002 1019 struct request *rq = cmd->request; 1020 + unsigned short nr_segs = blk_rq_nr_phys_segments(rq); 1021 + struct scatterlist *last_sg = NULL; 1003 1022 blk_status_t ret; 1023 + bool need_drain = scsi_cmd_needs_dma_drain(sdev, rq); 1024 + int count; 1004 1025 1005 - if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq))) 1026 + if (WARN_ON_ONCE(!nr_segs)) 1006 1027 return BLK_STS_IOERR; 1007 1028 1008 - ret = scsi_init_sgtable(rq, &cmd->sdb); 1009 - if (ret) 1010 - return ret; 1029 + /* 1030 + * Make sure there is space for the drain. The driver must adjust 1031 + * max_hw_segments to be prepared for this. 1032 + */ 1033 + if (need_drain) 1034 + nr_segs++; 1035 + 1036 + /* 1037 + * If sg table allocation fails, requeue request later. 
1038 + */ 1039 + if (unlikely(sg_alloc_table_chained(&cmd->sdb.table, nr_segs, 1040 + cmd->sdb.table.sgl, SCSI_INLINE_SG_CNT))) 1041 + return BLK_STS_RESOURCE; 1042 + 1043 + /* 1044 + * Next, walk the list, and fill in the addresses and sizes of 1045 + * each segment. 1046 + */ 1047 + count = __blk_rq_map_sg(rq->q, rq, cmd->sdb.table.sgl, &last_sg); 1048 + 1049 + if (blk_rq_bytes(rq) & rq->q->dma_pad_mask) { 1050 + unsigned int pad_len = 1051 + (rq->q->dma_pad_mask & ~blk_rq_bytes(rq)) + 1; 1052 + 1053 + last_sg->length += pad_len; 1054 + cmd->extra_len += pad_len; 1055 + } 1056 + 1057 + if (need_drain) { 1058 + sg_unmark_end(last_sg); 1059 + last_sg = sg_next(last_sg); 1060 + sg_set_buf(last_sg, sdev->dma_drain_buf, sdev->dma_drain_len); 1061 + sg_mark_end(last_sg); 1062 + 1063 + cmd->extra_len += sdev->dma_drain_len; 1064 + count++; 1065 + } 1066 + 1067 + BUG_ON(count > cmd->sdb.table.nents); 1068 + cmd->sdb.table.nents = count; 1069 + cmd->sdb.length = blk_rq_payload_bytes(rq); 1011 1070 1012 1071 if (blk_integrity_rq(rq)) { 1013 1072 struct scsi_data_buffer *prot_sdb = cmd->prot_sdb; 1014 - int ivecs, count; 1073 + int ivecs; 1015 1074 1016 1075 if (WARN_ON_ONCE(!prot_sdb)) { 1017 1076 /* ··· 1637 1610 struct request_queue *q = hctx->queue; 1638 1611 struct scsi_device *sdev = q->queuedata; 1639 1612 1640 - if (scsi_dev_queue_ready(q, sdev)) 1641 - return true; 1642 - 1643 - if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev)) 1644 - blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY); 1645 - return false; 1613 + return scsi_dev_queue_ready(q, sdev); 1646 1614 } 1647 1615 1648 1616 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, ··· 1706 1684 case BLK_STS_OK: 1707 1685 break; 1708 1686 case BLK_STS_RESOURCE: 1687 + case BLK_STS_ZONE_RESOURCE: 1709 1688 if (atomic_read(&sdev->device_busy) || 1710 1689 scsi_device_blocked(sdev)) 1711 1690 ret = BLK_STS_DEV_RESOURCE;
+15 -1
drivers/scsi/sd.c
··· 1206 1206 } 1207 1207 } 1208 1208 1209 + if (req_op(rq) == REQ_OP_ZONE_APPEND) { 1210 + ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks); 1211 + if (ret) 1212 + return ret; 1213 + } 1214 + 1209 1215 fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0; 1210 1216 dix = scsi_prot_sg_count(cmd); 1211 1217 dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type); ··· 1293 1287 return sd_setup_flush_cmnd(cmd); 1294 1288 case REQ_OP_READ: 1295 1289 case REQ_OP_WRITE: 1290 + case REQ_OP_ZONE_APPEND: 1296 1291 return sd_setup_read_write_cmnd(cmd); 1297 1292 case REQ_OP_ZONE_RESET: 1298 1293 return sd_zbc_setup_zone_mgmt_cmnd(cmd, ZO_RESET_WRITE_POINTER, ··· 2062 2055 2063 2056 out: 2064 2057 if (sd_is_zoned(sdkp)) 2065 - sd_zbc_complete(SCpnt, good_bytes, &sshdr); 2058 + good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr); 2066 2059 2067 2060 SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, SCpnt, 2068 2061 "sd_done: completed %d of %d bytes\n", ··· 3379 3372 sdkp->first_scan = 1; 3380 3373 sdkp->max_medium_access_timeouts = SD_MAX_MEDIUM_TIMEOUTS; 3381 3374 3375 + error = sd_zbc_init_disk(sdkp); 3376 + if (error) 3377 + goto out_free_index; 3378 + 3382 3379 sd_revalidate_disk(gd); 3383 3380 3384 3381 gd->flags = GENHD_FL_EXT_DEVT; ··· 3420 3409 out_put: 3421 3410 put_disk(gd); 3422 3411 out_free: 3412 + sd_zbc_release_disk(sdkp); 3423 3413 kfree(sdkp); 3424 3414 out: 3425 3415 scsi_autopm_put_device(sdp); ··· 3496 3484 disk->private_data = NULL; 3497 3485 put_disk(disk); 3498 3486 put_device(&sdkp->device->sdev_gendev); 3487 + 3488 + sd_zbc_release_disk(sdkp); 3499 3489 3500 3490 kfree(sdkp); 3501 3491 }
+38 -5
drivers/scsi/sd.h
··· 79 79 u32 zones_optimal_open; 80 80 u32 zones_optimal_nonseq; 81 81 u32 zones_max_open; 82 + u32 *zones_wp_offset; 83 + spinlock_t zones_wp_offset_lock; 84 + u32 *rev_wp_offset; 85 + struct mutex rev_mutex; 86 + struct work_struct zone_wp_offset_work; 87 + char *zone_wp_update_buf; 82 88 #endif 83 89 atomic_t openers; 84 90 sector_t capacity; /* size in logical blocks */ ··· 213 207 214 208 #ifdef CONFIG_BLK_DEV_ZONED 215 209 210 + int sd_zbc_init_disk(struct scsi_disk *sdkp); 211 + void sd_zbc_release_disk(struct scsi_disk *sdkp); 216 212 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer); 217 213 extern void sd_zbc_print_zones(struct scsi_disk *sdkp); 218 214 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd, 219 215 unsigned char op, bool all); 220 - extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, 221 - struct scsi_sense_hdr *sshdr); 216 + unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, 217 + struct scsi_sense_hdr *sshdr); 222 218 int sd_zbc_report_zones(struct gendisk *disk, sector_t sector, 223 219 unsigned int nr_zones, report_zones_cb cb, void *data); 224 220 221 + blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba, 222 + unsigned int nr_blocks); 223 + 225 224 #else /* CONFIG_BLK_DEV_ZONED */ 225 + 226 + static inline int sd_zbc_init(void) 227 + { 228 + return 0; 229 + } 230 + 231 + static inline int sd_zbc_init_disk(struct scsi_disk *sdkp) 232 + { 233 + return 0; 234 + } 235 + 236 + static inline void sd_zbc_exit(void) {} 237 + static inline void sd_zbc_release_disk(struct scsi_disk *sdkp) {} 226 238 227 239 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp, 228 240 unsigned char *buf) ··· 257 233 return BLK_STS_TARGET; 258 234 } 259 235 260 - static inline void sd_zbc_complete(struct scsi_cmnd *cmd, 261 - unsigned int good_bytes, 262 - struct scsi_sense_hdr *sshdr) {} 236 + static inline unsigned int sd_zbc_complete(struct 
scsi_cmnd *cmd, 237 + unsigned int good_bytes, struct scsi_sense_hdr *sshdr) 238 + { 239 + return 0; 240 + } 241 + 242 + static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, 243 + sector_t *lba, 244 + unsigned int nr_blocks) 245 + { 246 + return BLK_STS_TARGET; 247 + } 263 248 264 249 #define sd_zbc_report_zones NULL 265 250
+367 -38
drivers/scsi/sd_zbc.c
··· 11 11 #include <linux/blkdev.h> 12 12 #include <linux/vmalloc.h> 13 13 #include <linux/sched/mm.h> 14 + #include <linux/mutex.h> 14 15 15 16 #include <asm/unaligned.h> 16 17 ··· 20 19 21 20 #include "sd.h" 22 21 22 + static unsigned int sd_zbc_get_zone_wp_offset(struct blk_zone *zone) 23 + { 24 + if (zone->type == ZBC_ZONE_TYPE_CONV) 25 + return 0; 26 + 27 + switch (zone->cond) { 28 + case BLK_ZONE_COND_IMP_OPEN: 29 + case BLK_ZONE_COND_EXP_OPEN: 30 + case BLK_ZONE_COND_CLOSED: 31 + return zone->wp - zone->start; 32 + case BLK_ZONE_COND_FULL: 33 + return zone->len; 34 + case BLK_ZONE_COND_EMPTY: 35 + case BLK_ZONE_COND_OFFLINE: 36 + case BLK_ZONE_COND_READONLY: 37 + default: 38 + /* 39 + * Offline and read-only zones do not have a valid 40 + * write pointer. Use 0 as for an empty zone. 41 + */ 42 + return 0; 43 + } 44 + } 45 + 23 46 static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf, 24 47 unsigned int idx, report_zones_cb cb, void *data) 25 48 { 26 49 struct scsi_device *sdp = sdkp->device; 27 50 struct blk_zone zone = { 0 }; 51 + int ret; 28 52 29 53 zone.type = buf[0] & 0x0f; 30 54 zone.cond = (buf[1] >> 4) & 0xf; ··· 65 39 zone.cond == ZBC_ZONE_COND_FULL) 66 40 zone.wp = zone.start + zone.len; 67 41 68 - return cb(&zone, idx, data); 42 + ret = cb(&zone, idx, data); 43 + if (ret) 44 + return ret; 45 + 46 + if (sdkp->rev_wp_offset) 47 + sdkp->rev_wp_offset[idx] = sd_zbc_get_zone_wp_offset(&zone); 48 + 49 + return 0; 69 50 } 70 51 71 52 /** ··· 241 208 return ret; 242 209 } 243 210 211 + static blk_status_t sd_zbc_cmnd_checks(struct scsi_cmnd *cmd) 212 + { 213 + struct request *rq = cmd->request; 214 + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); 215 + sector_t sector = blk_rq_pos(rq); 216 + 217 + if (!sd_is_zoned(sdkp)) 218 + /* Not a zoned device */ 219 + return BLK_STS_IOERR; 220 + 221 + if (sdkp->device->changed) 222 + return BLK_STS_IOERR; 223 + 224 + if (sector & (sd_zbc_zone_sectors(sdkp) - 1)) 225 + /* Unaligned request */ 226 + return 
BLK_STS_IOERR; 227 + 228 + return BLK_STS_OK; 229 + } 230 + 231 + #define SD_ZBC_INVALID_WP_OFST (~0u) 232 + #define SD_ZBC_UPDATING_WP_OFST (SD_ZBC_INVALID_WP_OFST - 1) 233 + 234 + static int sd_zbc_update_wp_offset_cb(struct blk_zone *zone, unsigned int idx, 235 + void *data) 236 + { 237 + struct scsi_disk *sdkp = data; 238 + 239 + lockdep_assert_held(&sdkp->zones_wp_offset_lock); 240 + 241 + sdkp->zones_wp_offset[idx] = sd_zbc_get_zone_wp_offset(zone); 242 + 243 + return 0; 244 + } 245 + 246 + static void sd_zbc_update_wp_offset_workfn(struct work_struct *work) 247 + { 248 + struct scsi_disk *sdkp; 249 + unsigned int zno; 250 + int ret; 251 + 252 + sdkp = container_of(work, struct scsi_disk, zone_wp_offset_work); 253 + 254 + spin_lock_bh(&sdkp->zones_wp_offset_lock); 255 + for (zno = 0; zno < sdkp->nr_zones; zno++) { 256 + if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST) 257 + continue; 258 + 259 + spin_unlock_bh(&sdkp->zones_wp_offset_lock); 260 + ret = sd_zbc_do_report_zones(sdkp, sdkp->zone_wp_update_buf, 261 + SD_BUF_SIZE, 262 + zno * sdkp->zone_blocks, true); 263 + spin_lock_bh(&sdkp->zones_wp_offset_lock); 264 + if (!ret) 265 + sd_zbc_parse_report(sdkp, sdkp->zone_wp_update_buf + 64, 266 + zno, sd_zbc_update_wp_offset_cb, 267 + sdkp); 268 + } 269 + spin_unlock_bh(&sdkp->zones_wp_offset_lock); 270 + 271 + scsi_device_put(sdkp->device); 272 + } 273 + 274 + /** 275 + * sd_zbc_prepare_zone_append() - Prepare an emulated ZONE_APPEND command. 276 + * @cmd: the command to setup 277 + * @lba: the LBA to patch 278 + * @nr_blocks: the number of LBAs to be written 279 + * 280 + * Called from sd_setup_read_write_cmnd() for REQ_OP_ZONE_APPEND. 281 + * @sd_zbc_prepare_zone_append() handles the necessary zone write locking and 282 + * patching of the lba for an emulated ZONE_APPEND command. 283 + * 284 + * In case the cached write pointer offset is %SD_ZBC_INVALID_WP_OFST it will 285 + * schedule a REPORT ZONES command and return BLK_STS_IOERR. 
286 + */ 287 + blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd, sector_t *lba, 288 + unsigned int nr_blocks) 289 + { 290 + struct request *rq = cmd->request; 291 + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); 292 + unsigned int wp_offset, zno = blk_rq_zone_no(rq); 293 + blk_status_t ret; 294 + 295 + ret = sd_zbc_cmnd_checks(cmd); 296 + if (ret != BLK_STS_OK) 297 + return ret; 298 + 299 + if (!blk_rq_zone_is_seq(rq)) 300 + return BLK_STS_IOERR; 301 + 302 + /* Unlock of the write lock will happen in sd_zbc_complete() */ 303 + if (!blk_req_zone_write_trylock(rq)) 304 + return BLK_STS_ZONE_RESOURCE; 305 + 306 + spin_lock_bh(&sdkp->zones_wp_offset_lock); 307 + wp_offset = sdkp->zones_wp_offset[zno]; 308 + switch (wp_offset) { 309 + case SD_ZBC_INVALID_WP_OFST: 310 + /* 311 + * We are about to schedule work to update a zone write pointer 312 + * offset, which will cause the zone append command to be 313 + * requeued. So make sure that the scsi device does not go away 314 + * while the work is being processed. 315 + */ 316 + if (scsi_device_get(sdkp->device)) { 317 + ret = BLK_STS_IOERR; 318 + break; 319 + } 320 + sdkp->zones_wp_offset[zno] = SD_ZBC_UPDATING_WP_OFST; 321 + schedule_work(&sdkp->zone_wp_offset_work); 322 + fallthrough; 323 + case SD_ZBC_UPDATING_WP_OFST: 324 + ret = BLK_STS_DEV_RESOURCE; 325 + break; 326 + default: 327 + wp_offset = sectors_to_logical(sdkp->device, wp_offset); 328 + if (wp_offset + nr_blocks > sdkp->zone_blocks) { 329 + ret = BLK_STS_IOERR; 330 + break; 331 + } 332 + 333 + *lba += wp_offset; 334 + } 335 + spin_unlock_bh(&sdkp->zones_wp_offset_lock); 336 + if (ret) 337 + blk_req_zone_write_unlock(rq); 338 + return ret; 339 + } 340 + 244 341 /** 245 342 * sd_zbc_setup_zone_mgmt_cmnd - Prepare a zone ZBC_OUT command. The operations 246 343 * can be RESET WRITE POINTER, OPEN, CLOSE or FINISH. 
··· 385 222 unsigned char op, bool all) 386 223 { 387 224 struct request *rq = cmd->request; 388 - struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); 389 225 sector_t sector = blk_rq_pos(rq); 226 + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); 390 227 sector_t block = sectors_to_logical(sdkp->device, sector); 228 + blk_status_t ret; 391 229 392 - if (!sd_is_zoned(sdkp)) 393 - /* Not a zoned device */ 394 - return BLK_STS_IOERR; 395 - 396 - if (sdkp->device->changed) 397 - return BLK_STS_IOERR; 398 - 399 - if (sector & (sd_zbc_zone_sectors(sdkp) - 1)) 400 - /* Unaligned request */ 401 - return BLK_STS_IOERR; 230 + ret = sd_zbc_cmnd_checks(cmd); 231 + if (ret != BLK_STS_OK) 232 + return ret; 402 233 403 234 cmd->cmd_len = 16; 404 235 memset(cmd->cmnd, 0, cmd->cmd_len); ··· 411 254 return BLK_STS_OK; 412 255 } 413 256 257 + static bool sd_zbc_need_zone_wp_update(struct request *rq) 258 + { 259 + switch (req_op(rq)) { 260 + case REQ_OP_ZONE_APPEND: 261 + case REQ_OP_ZONE_FINISH: 262 + case REQ_OP_ZONE_RESET: 263 + case REQ_OP_ZONE_RESET_ALL: 264 + return true; 265 + case REQ_OP_WRITE: 266 + case REQ_OP_WRITE_ZEROES: 267 + case REQ_OP_WRITE_SAME: 268 + return blk_rq_zone_is_seq(rq); 269 + default: 270 + return false; 271 + } 272 + } 273 + 274 + /** 275 + * sd_zbc_zone_wp_update - Update cached zone write pointer upon cmd completion 276 + * @cmd: Completed command 277 + * @good_bytes: Command reply bytes 278 + * 279 + * Called from sd_zbc_complete() to handle the update of the cached zone write 280 + * pointer value in case an update is needed. 
281 + */ 282 + static unsigned int sd_zbc_zone_wp_update(struct scsi_cmnd *cmd, 283 + unsigned int good_bytes) 284 + { 285 + int result = cmd->result; 286 + struct request *rq = cmd->request; 287 + struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); 288 + unsigned int zno = blk_rq_zone_no(rq); 289 + enum req_opf op = req_op(rq); 290 + 291 + /* 292 + * If we got an error for a command that needs updating the write 293 + * pointer offset cache, we must mark the zone wp offset entry as 294 + * invalid to force an update from disk the next time a zone append 295 + * command is issued. 296 + */ 297 + spin_lock_bh(&sdkp->zones_wp_offset_lock); 298 + 299 + if (result && op != REQ_OP_ZONE_RESET_ALL) { 300 + if (op == REQ_OP_ZONE_APPEND) { 301 + /* Force complete completion (no retry) */ 302 + good_bytes = 0; 303 + scsi_set_resid(cmd, blk_rq_bytes(rq)); 304 + } 305 + 306 + /* 307 + * Force an update of the zone write pointer offset on 308 + * the next zone append access. 309 + */ 310 + if (sdkp->zones_wp_offset[zno] != SD_ZBC_UPDATING_WP_OFST) 311 + sdkp->zones_wp_offset[zno] = SD_ZBC_INVALID_WP_OFST; 312 + goto unlock_wp_offset; 313 + } 314 + 315 + switch (op) { 316 + case REQ_OP_ZONE_APPEND: 317 + rq->__sector += sdkp->zones_wp_offset[zno]; 318 + fallthrough; 319 + case REQ_OP_WRITE_ZEROES: 320 + case REQ_OP_WRITE_SAME: 321 + case REQ_OP_WRITE: 322 + if (sdkp->zones_wp_offset[zno] < sd_zbc_zone_sectors(sdkp)) 323 + sdkp->zones_wp_offset[zno] += 324 + good_bytes >> SECTOR_SHIFT; 325 + break; 326 + case REQ_OP_ZONE_RESET: 327 + sdkp->zones_wp_offset[zno] = 0; 328 + break; 329 + case REQ_OP_ZONE_FINISH: 330 + sdkp->zones_wp_offset[zno] = sd_zbc_zone_sectors(sdkp); 331 + break; 332 + case REQ_OP_ZONE_RESET_ALL: 333 + memset(sdkp->zones_wp_offset, 0, 334 + sdkp->nr_zones * sizeof(unsigned int)); 335 + break; 336 + default: 337 + break; 338 + } 339 + 340 + unlock_wp_offset: 341 + spin_unlock_bh(&sdkp->zones_wp_offset_lock); 342 + 343 + return good_bytes; 344 + } 345 + 414 346 /** 
415 347 * sd_zbc_complete - ZBC command post processing. 416 348 * @cmd: Completed command 417 349 * @good_bytes: Command reply bytes 418 350 * @sshdr: command sense header 419 351 * 420 - * Called from sd_done(). Process report zones reply and handle reset zone 421 - * and write commands errors. 352 + * Called from sd_done() to handle zone command errors and updates to the 353 + * device queue zone write pointer offset cache. 422 354 */ 423 - void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, 355 + unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, 424 356 struct scsi_sense_hdr *sshdr) 425 357 { 426 358 int result = cmd->result; ··· 525 279 * so be quiet about the error. 526 280 */ 527 281 rq->rq_flags |= RQF_QUIET; 528 - } 282 + } else if (sd_zbc_need_zone_wp_update(rq)) 283 + good_bytes = sd_zbc_zone_wp_update(cmd, good_bytes); 284 + 285 + if (req_op(rq) == REQ_OP_ZONE_APPEND) 286 + blk_req_zone_write_unlock(rq); 287 + 288 + return good_bytes; 529 289 } 530 290 531 291 /** ··· 633 381 return 0; 634 382 } 635 383 384 + static void sd_zbc_revalidate_zones_cb(struct gendisk *disk) 385 + { 386 + struct scsi_disk *sdkp = scsi_disk(disk); 387 + 388 + swap(sdkp->zones_wp_offset, sdkp->rev_wp_offset); 389 + } 390 + 391 + static int sd_zbc_revalidate_zones(struct scsi_disk *sdkp, 392 + u32 zone_blocks, 393 + unsigned int nr_zones) 394 + { 395 + struct gendisk *disk = sdkp->disk; 396 + int ret = 0; 397 + 398 + /* 399 + * Make sure revalidate zones are serialized to ensure exclusive 400 + * updates of the scsi disk data. 401 + */ 402 + mutex_lock(&sdkp->rev_mutex); 403 + 404 + /* 405 + * Revalidate the disk zones to update the device request queue zone 406 + * bitmaps and the zone write pointer offset array. Do this only once 407 + * the device capacity is set on the second revalidate execution for 408 + * disk scan or if something changed when executing a normal revalidate. 
409 + */ 410 + if (sdkp->first_scan) { 411 + sdkp->zone_blocks = zone_blocks; 412 + sdkp->nr_zones = nr_zones; 413 + goto unlock; 414 + } 415 + 416 + if (sdkp->zone_blocks == zone_blocks && 417 + sdkp->nr_zones == nr_zones && 418 + disk->queue->nr_zones == nr_zones) 419 + goto unlock; 420 + 421 + sdkp->rev_wp_offset = kvcalloc(nr_zones, sizeof(u32), GFP_NOIO); 422 + if (!sdkp->rev_wp_offset) { 423 + ret = -ENOMEM; 424 + goto unlock; 425 + } 426 + 427 + ret = blk_revalidate_disk_zones(disk, sd_zbc_revalidate_zones_cb); 428 + 429 + kvfree(sdkp->rev_wp_offset); 430 + sdkp->rev_wp_offset = NULL; 431 + 432 + unlock: 433 + mutex_unlock(&sdkp->rev_mutex); 434 + 435 + return ret; 436 + } 437 + 636 438 int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf) 637 439 { 638 440 struct gendisk *disk = sdkp->disk; 441 + struct request_queue *q = disk->queue; 639 442 unsigned int nr_zones; 640 443 u32 zone_blocks = 0; 444 + u32 max_append; 641 445 int ret; 642 446 643 447 if (!sd_is_zoned(sdkp)) ··· 714 406 goto err; 715 407 716 408 /* The drive satisfies the kernel restrictions: set it up */ 717 - blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue); 718 - blk_queue_required_elevator_features(sdkp->disk->queue, 719 - ELEVATOR_F_ZBD_SEQ_WRITE); 409 + blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q); 410 + blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE); 720 411 nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks); 721 412 722 413 /* READ16/WRITE16 is mandatory for ZBC disks */ 723 414 sdkp->device->use_16_for_rw = 1; 724 415 sdkp->device->use_10_for_rw = 0; 725 416 726 - /* 727 - * Revalidate the disk zone bitmaps once the block device capacity is 728 - * set on the second revalidate execution during disk scan and if 729 - * something changed when executing a normal revalidate. 
730 - */ 731 - if (sdkp->first_scan) { 732 - sdkp->zone_blocks = zone_blocks; 733 - sdkp->nr_zones = nr_zones; 734 - return 0; 735 - } 417 + ret = sd_zbc_revalidate_zones(sdkp, zone_blocks, nr_zones); 418 + if (ret) 419 + goto err; 736 420 737 - if (sdkp->zone_blocks != zone_blocks || 738 - sdkp->nr_zones != nr_zones || 739 - disk->queue->nr_zones != nr_zones) { 740 - ret = blk_revalidate_disk_zones(disk); 741 - if (ret != 0) 742 - goto err; 743 - sdkp->zone_blocks = zone_blocks; 744 - sdkp->nr_zones = nr_zones; 745 - } 421 + /* 422 + * On the first scan 'chunk_sectors' isn't setup yet, so calling 423 + * blk_queue_max_zone_append_sectors() will result in a WARN(). Defer 424 + * this setting to the second scan. 425 + */ 426 + if (sdkp->first_scan) 427 + return 0; 428 + 429 + max_append = min_t(u32, logical_to_sectors(sdkp->device, zone_blocks), 430 + q->limits.max_segments << (PAGE_SHIFT - 9)); 431 + max_append = min_t(u32, max_append, queue_max_hw_sectors(q)); 432 + 433 + blk_queue_max_zone_append_sectors(q, max_append); 746 434 747 435 return 0; 748 436 ··· 763 459 "%u zones of %u logical blocks\n", 764 460 sdkp->nr_zones, 765 461 sdkp->zone_blocks); 462 + } 463 + 464 + int sd_zbc_init_disk(struct scsi_disk *sdkp) 465 + { 466 + if (!sd_is_zoned(sdkp)) 467 + return 0; 468 + 469 + sdkp->zones_wp_offset = NULL; 470 + spin_lock_init(&sdkp->zones_wp_offset_lock); 471 + sdkp->rev_wp_offset = NULL; 472 + mutex_init(&sdkp->rev_mutex); 473 + INIT_WORK(&sdkp->zone_wp_offset_work, sd_zbc_update_wp_offset_workfn); 474 + sdkp->zone_wp_update_buf = kzalloc(SD_BUF_SIZE, GFP_KERNEL); 475 + if (!sdkp->zone_wp_update_buf) 476 + return -ENOMEM; 477 + 478 + return 0; 479 + } 480 + 481 + void sd_zbc_release_disk(struct scsi_disk *sdkp) 482 + { 483 + kvfree(sdkp->zones_wp_offset); 484 + sdkp->zones_wp_offset = NULL; 485 + kfree(sdkp->zone_wp_update_buf); 486 + sdkp->zone_wp_update_buf = NULL; 766 487 }
+1 -2
drivers/scsi/sr.c
··· 794 794 set_capacity(disk, cd->capacity); 795 795 disk->private_data = &cd->driver; 796 796 disk->queue = sdev->request_queue; 797 - cd->cdi.disk = disk; 798 797 799 - if (register_cdrom(&cd->cdi)) 798 + if (register_cdrom(disk, &cd->cdi)) 800 799 goto fail_put; 801 800 802 801 /*
+4 -21
fs/block_dev.c
··· 255 255 break; 256 256 if (!(iocb->ki_flags & IOCB_HIPRI) || 257 257 !blk_poll(bdev_get_queue(bdev), qc, true)) 258 - io_schedule(); 258 + blk_io_schedule(); 259 259 } 260 260 __set_current_state(TASK_RUNNING); 261 261 ··· 449 449 450 450 if (!(iocb->ki_flags & IOCB_HIPRI) || 451 451 !blk_poll(bdev_get_queue(bdev), qc, true)) 452 - io_schedule(); 452 + blk_io_schedule(); 453 453 } 454 454 __set_current_state(TASK_RUNNING); 455 455 ··· 671 671 * i_mutex and doing so causes performance issues with concurrent 672 672 * O_SYNC writers to a block device. 673 673 */ 674 - error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL); 674 + error = blkdev_issue_flush(bdev, GFP_KERNEL); 675 675 if (error == -EOPNOTSUPP) 676 676 error = 0; 677 677 ··· 712 712 blk_queue_exit(bdev->bd_queue); 713 713 return result; 714 714 } 715 - EXPORT_SYMBOL_GPL(bdev_read_page); 716 715 717 716 /** 718 717 * bdev_write_page() - Start writing a page to a block device ··· 756 757 blk_queue_exit(bdev->bd_queue); 757 758 return result; 758 759 } 759 - EXPORT_SYMBOL_GPL(bdev_write_page); 760 760 761 761 /* 762 762 * pseudo-fs ··· 878 880 } 879 881 880 882 static LIST_HEAD(all_bdevs); 881 - 882 - /* 883 - * If there is a bdev inode for this device, unhash it so that it gets evicted 884 - * as soon as last inode reference is dropped. 885 - */ 886 - void bdev_unhash_inode(dev_t dev) 887 - { 888 - struct inode *inode; 889 - 890 - inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev); 891 - if (inode) { 892 - remove_inode_hash(inode); 893 - iput(inode); 894 - } 895 - } 896 883 897 884 struct block_device *bdget(dev_t dev) 898 885 { ··· 1498 1515 lockdep_assert_held(&bdev->bd_mutex); 1499 1516 1500 1517 rescan: 1501 - ret = blk_drop_partitions(disk, bdev); 1518 + ret = blk_drop_partitions(bdev); 1502 1519 if (ret) 1503 1520 return ret; 1504 1521
+1 -1
fs/direct-io.c
··· 500 500 spin_unlock_irqrestore(&dio->bio_lock, flags); 501 501 if (!(dio->iocb->ki_flags & IOCB_HIPRI) || 502 502 !blk_poll(dio->bio_disk->queue, dio->bio_cookie, true)) 503 - io_schedule(); 503 + blk_io_schedule(); 504 504 /* wake up sets us TASK_RUNNING */ 505 505 spin_lock_irqsave(&dio->bio_lock, flags); 506 506 dio->waiter = NULL;
+1 -1
fs/ext4/fsync.c
··· 176 176 ret = ext4_fsync_journal(inode, datasync, &needs_barrier); 177 177 178 178 if (needs_barrier) { 179 - err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); 179 + err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL); 180 180 if (!ret) 181 181 ret = err; 182 182 }
+1 -1
fs/ext4/ialloc.c
··· 1440 1440 if (ret < 0) 1441 1441 goto err_out; 1442 1442 if (barrier) 1443 - blkdev_issue_flush(sb->s_bdev, GFP_NOFS, NULL); 1443 + blkdev_issue_flush(sb->s_bdev, GFP_NOFS); 1444 1444 1445 1445 skip_zeroout: 1446 1446 ext4_lock_group(sb, group);
+1 -1
fs/ext4/super.c
··· 5296 5296 needs_barrier = true; 5297 5297 if (needs_barrier) { 5298 5298 int err; 5299 - err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL); 5299 + err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL); 5300 5300 if (!ret) 5301 5301 ret = err; 5302 5302 }
+1 -1
fs/fat/file.c
··· 195 195 if (err) 196 196 return err; 197 197 198 - return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); 198 + return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL); 199 199 } 200 200 201 201
+1 -1
fs/fs-writeback.c
··· 2319 2319 2320 2320 WARN(bdi_cap_writeback_dirty(wb->bdi) && 2321 2321 !test_bit(WB_registered, &wb->state), 2322 - "bdi-%s not registered\n", wb->bdi->name); 2322 + "bdi-%s not registered\n", bdi_dev_name(wb->bdi)); 2323 2323 2324 2324 inode->dirtied_when = jiffies; 2325 2325 if (dirtytime)
+19 -13
fs/hfs/mdb.c
··· 32 32 static int hfs_get_last_session(struct super_block *sb, 33 33 sector_t *start, sector_t *size) 34 34 { 35 - struct cdrom_multisession ms_info; 36 - struct cdrom_tocentry te; 37 - int res; 35 + struct cdrom_device_info *cdi = disk_to_cdi(sb->s_bdev->bd_disk); 38 36 39 37 /* default values */ 40 38 *start = 0; 41 39 *size = i_size_read(sb->s_bdev->bd_inode) >> 9; 42 40 43 41 if (HFS_SB(sb)->session >= 0) { 42 + struct cdrom_tocentry te; 43 + 44 + if (!cdi) 45 + return -EINVAL; 46 + 44 47 te.cdte_track = HFS_SB(sb)->session; 45 48 te.cdte_format = CDROM_LBA; 46 - res = ioctl_by_bdev(sb->s_bdev, CDROMREADTOCENTRY, (unsigned long)&te); 47 - if (!res && (te.cdte_ctrl & CDROM_DATA_TRACK) == 4) { 48 - *start = (sector_t)te.cdte_addr.lba << 2; 49 - return 0; 49 + if (cdrom_read_tocentry(cdi, &te) || 50 + (te.cdte_ctrl & CDROM_DATA_TRACK) != 4) { 51 + pr_err("invalid session number or type of track\n"); 52 + return -EINVAL; 50 53 } 51 - pr_err("invalid session number or type of track\n"); 52 - return -EINVAL; 54 + 55 + *start = (sector_t)te.cdte_addr.lba << 2; 56 + } else if (cdi) { 57 + struct cdrom_multisession ms_info; 58 + 59 + ms_info.addr_format = CDROM_LBA; 60 + if (cdrom_multisession(cdi, &ms_info) == 0 && ms_info.xa_flag) 61 + *start = (sector_t)ms_info.addr.lba << 2; 53 62 } 54 - ms_info.addr_format = CDROM_LBA; 55 - res = ioctl_by_bdev(sb->s_bdev, CDROMMULTISESSION, (unsigned long)&ms_info); 56 - if (!res && ms_info.xa_flag) 57 - *start = (sector_t)ms_info.addr.lba << 2; 63 + 58 64 return 0; 59 65 } 60 66
+1 -1
fs/hfsplus/inode.c
··· 340 340 } 341 341 342 342 if (!test_bit(HFSPLUS_SB_NOBARRIER, &sbi->flags)) 343 - blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL); 343 + blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL); 344 344 345 345 inode_unlock(inode); 346 346
+1 -1
fs/hfsplus/super.c
··· 239 239 mutex_unlock(&sbi->vh_mutex); 240 240 241 241 if (!test_bit(HFSPLUS_SB_NOBARRIER, &sbi->flags)) 242 - blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL); 242 + blkdev_issue_flush(sb->s_bdev, GFP_KERNEL); 243 243 244 244 return error; 245 245 }
+18 -15
fs/hfsplus/wrapper.c
··· 127 127 static int hfsplus_get_last_session(struct super_block *sb, 128 128 sector_t *start, sector_t *size) 129 129 { 130 - struct cdrom_multisession ms_info; 131 - struct cdrom_tocentry te; 132 - int res; 130 + struct cdrom_device_info *cdi = disk_to_cdi(sb->s_bdev->bd_disk); 133 131 134 132 /* default values */ 135 133 *start = 0; 136 134 *size = i_size_read(sb->s_bdev->bd_inode) >> 9; 137 135 138 136 if (HFSPLUS_SB(sb)->session >= 0) { 137 + struct cdrom_tocentry te; 138 + 139 + if (!cdi) 140 + return -EINVAL; 141 + 139 142 te.cdte_track = HFSPLUS_SB(sb)->session; 140 143 te.cdte_format = CDROM_LBA; 141 - res = ioctl_by_bdev(sb->s_bdev, 142 - CDROMREADTOCENTRY, (unsigned long)&te); 143 - if (!res && (te.cdte_ctrl & CDROM_DATA_TRACK) == 4) { 144 - *start = (sector_t)te.cdte_addr.lba << 2; 145 - return 0; 144 + if (cdrom_read_tocentry(cdi, &te) || 145 + (te.cdte_ctrl & CDROM_DATA_TRACK) != 4) { 146 + pr_err("invalid session number or type of track\n"); 147 + return -EINVAL; 146 148 } 147 - pr_err("invalid session number or type of track\n"); 148 - return -EINVAL; 149 + *start = (sector_t)te.cdte_addr.lba << 2; 150 + } else if (cdi) { 151 + struct cdrom_multisession ms_info; 152 + 153 + ms_info.addr_format = CDROM_LBA; 154 + if (cdrom_multisession(cdi, &ms_info) == 0 && ms_info.xa_flag) 155 + *start = (sector_t)ms_info.addr.lba << 2; 149 156 } 150 - ms_info.addr_format = CDROM_LBA; 151 - res = ioctl_by_bdev(sb->s_bdev, CDROMMULTISESSION, 152 - (unsigned long)&ms_info); 153 - if (!res && ms_info.xa_flag) 154 - *start = (sector_t)ms_info.addr.lba << 2; 157 + 155 158 return 0; 156 159 } 157 160
+1 -1
fs/iomap/direct-io.c
···
561 561                 !dio->submit.last_queue ||
562 562                 !blk_poll(dio->submit.last_queue,
563 563                       dio->submit.cookie, true))
564 -                 io_schedule();
564 +                 blk_io_schedule();
565 565         }
566 566         __set_current_state(TASK_RUNNING);
567 567     }
+26 -28
fs/isofs/inode.c
···
544 544 
545 545 static unsigned int isofs_get_last_session(struct super_block *sb, s32 session)
546 546 {
547 -     struct cdrom_multisession ms_info;
548 -     unsigned int vol_desc_start;
549 -     struct block_device *bdev = sb->s_bdev;
550 -     int i;
547 +     struct cdrom_device_info *cdi = disk_to_cdi(sb->s_bdev->bd_disk);
548 +     unsigned int vol_desc_start = 0;
551 549 
552 -     vol_desc_start=0;
553 -     ms_info.addr_format=CDROM_LBA;
554 550     if (session > 0) {
555 -         struct cdrom_tocentry Te;
556 -         Te.cdte_track=session;
557 -         Te.cdte_format=CDROM_LBA;
558 -         i = ioctl_by_bdev(bdev, CDROMREADTOCENTRY, (unsigned long) &Te);
559 -         if (!i) {
551 +         struct cdrom_tocentry te;
552 + 
553 +         if (!cdi)
554 +             return 0;
555 + 
556 +         te.cdte_track = session;
557 +         te.cdte_format = CDROM_LBA;
558 +         if (cdrom_read_tocentry(cdi, &te) == 0) {
560 559             printk(KERN_DEBUG "ISOFS: Session %d start %d type %d\n",
561 -                 session, Te.cdte_addr.lba,
562 -                 Te.cdte_ctrl&CDROM_DATA_TRACK);
563 -             if ((Te.cdte_ctrl&CDROM_DATA_TRACK) == 4)
564 -                 return Te.cdte_addr.lba;
560 +                 session, te.cdte_addr.lba,
561 +                 te.cdte_ctrl & CDROM_DATA_TRACK);
562 +             if ((te.cdte_ctrl & CDROM_DATA_TRACK) == 4)
563 +                 return te.cdte_addr.lba;
565 564         }
566 565 
567 566         printk(KERN_ERR "ISOFS: Invalid session number or type of track\n");
568 567     }
569 -     i = ioctl_by_bdev(bdev, CDROMMULTISESSION, (unsigned long) &ms_info);
570 -     if (session > 0)
571 -         printk(KERN_ERR "ISOFS: Invalid session number\n");
572 - #if 0
573 -     printk(KERN_DEBUG "isofs.inode: CDROMMULTISESSION: rc=%d\n",i);
574 -     if (i==0) {
575 -         printk(KERN_DEBUG "isofs.inode: XA disk: %s\n",ms_info.xa_flag?"yes":"no");
576 -         printk(KERN_DEBUG "isofs.inode: vol_desc_start = %d\n", ms_info.addr.lba);
577 -     }
578 - #endif
579 -     if (i==0)
568 + 
569 +     if (cdi) {
570 +         struct cdrom_multisession ms_info;
571 + 
572 +         ms_info.addr_format = CDROM_LBA;
573 +         if (cdrom_multisession(cdi, &ms_info) == 0) {
580 574 #if WE_OBEY_THE_WRITTEN_STANDARDS
581 -         if (ms_info.xa_flag) /* necessary for a valid ms_info.addr */
575 +             /* necessary for a valid ms_info.addr */
576 +             if (ms_info.xa_flag)
582 577 #endif
583 -             vol_desc_start=ms_info.addr.lba;
578 +                 vol_desc_start = ms_info.addr.lba;
579 +         }
580 +     }
581 + 
584 582     return vol_desc_start;
585 583 }
586 584 
+1 -1
fs/jbd2/checkpoint.c
···
414 414      * jbd2_cleanup_journal_tail() doesn't get called all that often.
415 415      */
416 416     if (journal->j_flags & JBD2_BARRIER)
417 -         blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS, NULL);
417 +         blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS);
418 418 
419 419     return __jbd2_update_log_tail(journal, first_tid, blocknr);
420 420 }
+2 -2
fs/jbd2/commit.c
···
775 775     if (commit_transaction->t_need_data_flush &&
776 776         (journal->j_fs_dev != journal->j_dev) &&
777 777         (journal->j_flags & JBD2_BARRIER))
778 -         blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS, NULL);
778 +         blkdev_issue_flush(journal->j_fs_dev, GFP_NOFS);
779 779 
780 780     /* Done it all: now write the commit record asynchronously. */
781 781     if (jbd2_has_feature_async_commit(journal)) {
···
882 882     stats.run.rs_blocks_logged++;
883 883     if (jbd2_has_feature_async_commit(journal) &&
884 884         journal->j_flags & JBD2_BARRIER) {
885 -         blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);
885 +         blkdev_issue_flush(journal->j_dev, GFP_NOFS);
886 886     }
887 887 
888 888     if (err)
+1 -1
fs/jbd2/recovery.c
···
286 286         err = err2;
287 287     /* Make sure all replayed data is on permanent storage */
288 288     if (journal->j_flags & JBD2_BARRIER) {
289 -         err2 = blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
289 +         err2 = blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL);
290 290         if (!err)
291 291             err = err2;
292 292     }
+1 -1
fs/libfs.c
···
1113 1113     err = __generic_file_fsync(file, start, end, datasync);
1114 1114     if (err)
1115 1115         return err;
1116 -     return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
1116 +     return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);
1117 1117 }
1118 1118 EXPORT_SYMBOL(generic_file_fsync);
1119 1119 
+1 -1
fs/nilfs2/the_nilfs.h
···
375 375      */
376 376     smp_wmb();
377 377 
378 -     err = blkdev_issue_flush(nilfs->ns_bdev, GFP_KERNEL, NULL);
378 +     err = blkdev_issue_flush(nilfs->ns_bdev, GFP_KERNEL);
379 379     if (err != -EIO)
380 380         err = 0;
381 381     return err;
+1 -1
fs/ocfs2/file.c
···
194 194         needs_barrier = true;
195 195     err = jbd2_complete_transaction(journal, commit_tid);
196 196     if (needs_barrier) {
197 -         ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
197 +         ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);
198 198         if (!err)
199 199             err = ret;
200 200     }
+1 -1
fs/reiserfs/file.c
···
159 159     barrier_done = reiserfs_commit_for_inode(inode);
160 160     reiserfs_write_unlock(inode->i_sb);
161 161     if (barrier_done != 1 && reiserfs_barrier_flush(inode->i_sb))
162 -         blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
162 +         blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);
163 163     inode_unlock(inode);
164 164     if (barrier_done < 0)
165 165         return barrier_done;
+1 -3
fs/super.c
···
1598 1598     int err;
1599 1599     va_list args;
1600 1600 
1601 -     bdi = bdi_alloc(GFP_KERNEL);
1601 +     bdi = bdi_alloc(NUMA_NO_NODE);
1602 1602     if (!bdi)
1603 1603         return -ENOMEM;
1604 - 
1605 -     bdi->name = sb->s_type->name;
1606 1604 
1607 1605     va_start(args, fmt);
1608 1606     err = bdi_register_va(bdi, fmt, args);
+13 -16
fs/udf/lowlevel.c
···
27 27 
28 28 unsigned int udf_get_last_session(struct super_block *sb)
29 29 {
30 +     struct cdrom_device_info *cdi = disk_to_cdi(sb->s_bdev->bd_disk);
30 31     struct cdrom_multisession ms_info;
31 -     unsigned int vol_desc_start;
32 -     struct block_device *bdev = sb->s_bdev;
33 -     int i;
34 32 
35 -     vol_desc_start = 0;
33 +     if (!cdi) {
34 +         udf_debug("CDROMMULTISESSION not supported.\n");
35 +         return 0;
36 +     }
37 + 
36 38     ms_info.addr_format = CDROM_LBA;
37 -     i = ioctl_by_bdev(bdev, CDROMMULTISESSION, (unsigned long)&ms_info);
38 - 
39 -     if (i == 0) {
39 +     if (cdrom_multisession(cdi, &ms_info) == 0) {
40 40         udf_debug("XA disk: %s, vol_desc_start=%d\n",
41 41               ms_info.xa_flag ? "yes" : "no", ms_info.addr.lba);
42 42         if (ms_info.xa_flag) /* necessary for a valid ms_info.addr */
43 -             vol_desc_start = ms_info.addr.lba;
44 -     } else {
45 -         udf_debug("CDROMMULTISESSION not supported: rc=%d\n", i);
43 +             return ms_info.addr.lba;
46 44     }
47 -     return vol_desc_start;
45 +     return 0;
48 46 }
49 47 
50 48 unsigned long udf_get_last_block(struct super_block *sb)
51 49 {
52 50     struct block_device *bdev = sb->s_bdev;
51 +     struct cdrom_device_info *cdi = disk_to_cdi(bdev->bd_disk);
53 52     unsigned long lblock = 0;
54 53 
55 54     /*
56 -      * ioctl failed or returned obviously bogus value?
55 +      * The cdrom layer call failed or returned obviously bogus value?
57 56      * Try using the device size...
58 57      */
59 -     if (ioctl_by_bdev(bdev, CDROM_LAST_WRITTEN, (unsigned long) &lblock) ||
60 -         lblock == 0)
58 +     if (!cdi || cdrom_get_last_written(cdi, &lblock) || lblock == 0)
61 59         lblock = i_size_read(bdev->bd_inode) >> sb->s_blocksize_bits;
62 60 
63 61     if (lblock)
64 62         return lblock - 1;
65 -     else
66 -         return 0;
63 +     return 0;
67 64 }
+1 -1
fs/xfs/xfs_super.c
···
305 305 xfs_blkdev_issue_flush(
306 306     xfs_buftarg_t       *buftarg)
307 307 {
308 -     blkdev_issue_flush(buftarg->bt_bdev, GFP_NOFS, NULL);
308 +     blkdev_issue_flush(buftarg->bt_bdev, GFP_NOFS);
309 309 }
310 310 
311 311 STATIC void
+73 -9
fs/zonefs/super.c
···
20 20 #include <linux/mman.h>
21 21 #include <linux/sched/mm.h>
22 22 #include <linux/crc32.h>
23 + #include <linux/task_io_accounting_ops.h>
23 24 
24 25 #include "zonefs.h"
25 26 
···
478 477     if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
479 478         ret = file_write_and_wait_range(file, start, end);
480 479     if (!ret)
481 -         ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
480 +         ret = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);
482 481 
483 482     if (ret)
484 483         zonefs_io_error(inode, true);
···
596 595     .end_io         = zonefs_file_write_dio_end_io,
597 596 };
598 597 
598 + static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
599 + {
600 +     struct inode *inode = file_inode(iocb->ki_filp);
601 +     struct zonefs_inode_info *zi = ZONEFS_I(inode);
602 +     struct block_device *bdev = inode->i_sb->s_bdev;
603 +     unsigned int max;
604 +     struct bio *bio;
605 +     ssize_t size;
606 +     int nr_pages;
607 +     ssize_t ret;
608 + 
609 +     nr_pages = iov_iter_npages(from, BIO_MAX_PAGES);
610 +     if (!nr_pages)
611 +         return 0;
612 + 
613 +     max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
614 +     max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
615 +     iov_iter_truncate(from, max);
616 + 
617 +     bio = bio_alloc_bioset(GFP_NOFS, nr_pages, &fs_bio_set);
618 +     if (!bio)
619 +         return -ENOMEM;
620 + 
621 +     bio_set_dev(bio, bdev);
622 +     bio->bi_iter.bi_sector = zi->i_zsector;
623 +     bio->bi_write_hint = iocb->ki_hint;
624 +     bio->bi_ioprio = iocb->ki_ioprio;
625 +     bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
626 +     if (iocb->ki_flags & IOCB_DSYNC)
627 +         bio->bi_opf |= REQ_FUA;
628 + 
629 +     ret = bio_iov_iter_get_pages(bio, from);
630 +     if (unlikely(ret)) {
631 +         bio_io_error(bio);
632 +         return ret;
633 +     }
634 +     size = bio->bi_iter.bi_size;
635 +     task_io_account_write(ret);
636 + 
637 +     if (iocb->ki_flags & IOCB_HIPRI)
638 +         bio_set_polled(bio, iocb);
639 + 
640 +     ret = submit_bio_wait(bio);
641 + 
642 +     bio_put(bio);
643 + 
644 +     zonefs_file_write_dio_end_io(iocb, size, ret, 0);
645 +     if (ret >= 0) {
646 +         iocb->ki_pos += size;
647 +         return size;
648 +     }
649 + 
650 +     return ret;
651 + }
652 + 
599 653 /*
600 654  * Handle direct writes. For sequential zone files, this is the only possible
601 655  * write path. For these files, check that the user is issuing writes
···
666 610     struct inode *inode = file_inode(iocb->ki_filp);
667 611     struct zonefs_inode_info *zi = ZONEFS_I(inode);
668 612     struct super_block *sb = inode->i_sb;
613 +     bool sync = is_sync_kiocb(iocb);
614 +     bool append = false;
669 615     size_t count;
670 616     ssize_t ret;
···
676 618      * as this can cause write reordering (e.g. the first aio gets EAGAIN
677 619      * on the inode lock but the second goes through but is now unaligned).
678 620      */
679 -     if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) &&
621 +     if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !sync &&
680 622         (iocb->ki_flags & IOCB_NOWAIT))
681 623         return -EOPNOTSUPP;
···
700 642     }
701 643 
702 644     /* Enforce sequential writes (append only) in sequential zones */
703 -     mutex_lock(&zi->i_truncate_mutex);
704 -     if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && iocb->ki_pos != zi->i_wpoffset) {
645 +     if (zi->i_ztype == ZONEFS_ZTYPE_SEQ) {
646 +         mutex_lock(&zi->i_truncate_mutex);
647 +         if (iocb->ki_pos != zi->i_wpoffset) {
648 +             mutex_unlock(&zi->i_truncate_mutex);
649 +             ret = -EINVAL;
650 +             goto inode_unlock;
651 +         }
705 652         mutex_unlock(&zi->i_truncate_mutex);
706 -         ret = -EINVAL;
707 -         goto inode_unlock;
653 +         append = sync;
708 654     }
709 -     mutex_unlock(&zi->i_truncate_mutex);
710 655 
711 -     ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
712 -                &zonefs_write_dio_ops, is_sync_kiocb(iocb));
656 +     if (append)
657 +         ret = zonefs_file_dio_append(iocb, from);
658 +     else
659 +         ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
660 +                    &zonefs_write_dio_ops, sync);
713 661     if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
714 662         (ret > 0 || ret == -EIOCBQUEUED)) {
715 663         if (ret > 0)
-2
include/linux/backing-dev-defs.h
···
193 193     congested_fn *congested_fn; /* Function pointer if device is md/dm */
194 194     void *congested_data;   /* Pointer to aux data for congested func */
195 195 
196 -     const char *name;
197 - 
198 196     struct kref refcnt; /* Reference counter for the structure */
199 197     unsigned int capabilities; /* Device capabilities */
200 198     unsigned int min_ratio;
+2 -6
include/linux/backing-dev.h
···
33 33 __printf(2, 0)
34 34 int bdi_register_va(struct backing_dev_info *bdi, const char *fmt,
35 35             va_list args);
36 - int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner);
36 + void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner);
37 37 void bdi_unregister(struct backing_dev_info *bdi);
38 38 
39 - struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id);
40 - static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
41 - {
42 -     return bdi_alloc_node(gfp_mask, NUMA_NO_NODE);
43 - }
39 + struct backing_dev_info *bdi_alloc(int node_id);
44 40 
45 41 void wb_start_background_writeback(struct bdi_writeback *wb);
46 42 void wb_workfn(struct work_struct *work);
+4 -9
include/linux/bio.h
···
70 70     return false;
71 71 }
72 72 
73 - static inline bool bio_no_advance_iter(struct bio *bio)
73 + static inline bool bio_no_advance_iter(const struct bio *bio)
74 74 {
75 75     return bio_op(bio) == REQ_OP_DISCARD ||
76 76            bio_op(bio) == REQ_OP_SECURE_ERASE ||
···
138 138 #define bio_for_each_segment_all(bvl, bio, iter) \
139 139     for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
140 140 
141 - static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
142 -                     unsigned bytes)
141 + static inline void bio_advance_iter(const struct bio *bio,
142 +                     struct bvec_iter *iter, unsigned int bytes)
143 143 {
144 144     iter->bi_sector += bytes >> 9;
145 145 
···
417 417 
418 418 static inline void bio_wouldblock_error(struct bio *bio)
419 419 {
420 +     bio_set_flag(bio, BIO_QUIET);
420 421     bio->bi_status = BLK_STS_AGAIN;
421 422     bio_endio(bio);
422 423 }
···
444 443 void bio_release_pages(struct bio *bio, bool mark_dirty);
445 444 extern void bio_set_pages_dirty(struct bio *bio);
446 445 extern void bio_check_pages_dirty(struct bio *bio);
447 - 
448 - void generic_start_io_acct(struct request_queue *q, int op,
449 -         unsigned long sectors, struct hd_struct *part);
450 - void generic_end_io_acct(struct request_queue *q, int op,
451 -         struct hd_struct *part,
452 -         unsigned long start_time);
453 446 
454 447 extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
455 448                    struct bio *src, struct bvec_iter *src_iter);
+39 -14
include/linux/blk-cgroup.h
···
607 607     u64_stats_update_begin(&bis->sync);
608 608 
609 609     /*
610 -      * If the bio is flagged with BIO_QUEUE_ENTERED it means this
611 -      * is a split bio and we would have already accounted for the
612 -      * size of the bio.
610 +      * If the bio is flagged with BIO_CGROUP_ACCT it means this is a
611 +      * split bio and we would have already accounted for the size of
612 +      * the bio.
613 613      */
614 -     if (!bio_flagged(bio, BIO_QUEUE_ENTERED))
614 +     if (!bio_flagged(bio, BIO_CGROUP_ACCT)) {
615 +         bio_set_flag(bio, BIO_CGROUP_ACCT);
615 616         bis->cur.bytes[rwd] += bio->bi_iter.bi_size;
617 +     }
616 618     bis->cur.ios[rwd]++;
617 619 
618 620     u64_stats_update_end(&bis->sync);
···
631 629 
632 630 static inline void blkcg_use_delay(struct blkcg_gq *blkg)
633 631 {
632 +     if (WARN_ON_ONCE(atomic_read(&blkg->use_delay) < 0))
633 +         return;
634 634     if (atomic_add_return(1, &blkg->use_delay) == 1)
635 635         atomic_inc(&blkg->blkcg->css.cgroup->congestion_count);
636 636 }
···
641 637 {
642 638     int old = atomic_read(&blkg->use_delay);
643 639 
640 +     if (WARN_ON_ONCE(old < 0))
641 +         return 0;
644 642     if (old == 0)
645 643         return 0;
646 644 
···
667 661     return 1;
668 662 }
669 663 
664 + /**
665 +  * blkcg_set_delay - Enable allocator delay mechanism with the specified delay amount
666 +  * @blkg: target blkg
667 +  * @delay: delay duration in nsecs
668 +  *
669 +  * When enabled with this function, the delay is not decayed and must be
670 +  * explicitly cleared with blkcg_clear_delay(). Must not be mixed with
671 +  * blkcg_[un]use_delay() and blkcg_add_delay() usages.
672 +  */
673 + static inline void blkcg_set_delay(struct blkcg_gq *blkg, u64 delay)
674 + {
675 +     int old = atomic_read(&blkg->use_delay);
676 + 
677 +     /* We only want 1 person setting the congestion count for this blkg. */
678 +     if (!old && atomic_cmpxchg(&blkg->use_delay, old, -1) == old)
679 +         atomic_inc(&blkg->blkcg->css.cgroup->congestion_count);
680 + 
681 +     atomic64_set(&blkg->delay_nsec, delay);
682 + }
683 + 
684 + /**
685 +  * blkcg_clear_delay - Disable allocator delay mechanism
686 +  * @blkg: target blkg
687 +  *
688 +  * Disable use_delay mechanism. See blkcg_set_delay().
689 +  */
670 690 static inline void blkcg_clear_delay(struct blkcg_gq *blkg)
671 691 {
672 692     int old = atomic_read(&blkg->use_delay);
673 -     if (!old)
674 -         return;
693 + 
675 694     /* We only want 1 person clearing the congestion count for this blkg. */
676 -     while (old) {
677 -         int cur = atomic_cmpxchg(&blkg->use_delay, old, 0);
678 -         if (cur == old) {
679 -             atomic_dec(&blkg->blkcg->css.cgroup->congestion_count);
680 -             break;
681 -         }
682 -         old = cur;
683 -     }
695 +     if (old && atomic_cmpxchg(&blkg->use_delay, old, 0) == old)
696 +         atomic_dec(&blkg->blkcg->css.cgroup->congestion_count);
684 697 }
685 698 
686 699 void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);
+123
include/linux/blk-crypto.h
···
1 + /* SPDX-License-Identifier: GPL-2.0 */
2 + /*
3 +  * Copyright 2019 Google LLC
4 +  */
5 + 
6 + #ifndef __LINUX_BLK_CRYPTO_H
7 + #define __LINUX_BLK_CRYPTO_H
8 + 
9 + #include <linux/types.h>
10 + 
11 + enum blk_crypto_mode_num {
12 +     BLK_ENCRYPTION_MODE_INVALID,
13 +     BLK_ENCRYPTION_MODE_AES_256_XTS,
14 +     BLK_ENCRYPTION_MODE_AES_128_CBC_ESSIV,
15 +     BLK_ENCRYPTION_MODE_ADIANTUM,
16 +     BLK_ENCRYPTION_MODE_MAX,
17 + };
18 + 
19 + #define BLK_CRYPTO_MAX_KEY_SIZE 64
20 + /**
21 +  * struct blk_crypto_config - an inline encryption key's crypto configuration
22 +  * @crypto_mode: encryption algorithm this key is for
23 +  * @data_unit_size: the data unit size for all encryption/decryptions with this
24 +  *  key. This is the size in bytes of each individual plaintext and
25 +  *  ciphertext. This is always a power of 2. It might be e.g. the
26 +  *  filesystem block size or the disk sector size.
27 +  * @dun_bytes: the maximum number of bytes of DUN used when using this key
28 +  */
29 + struct blk_crypto_config {
30 +     enum blk_crypto_mode_num crypto_mode;
31 +     unsigned int data_unit_size;
32 +     unsigned int dun_bytes;
33 + };
34 + 
35 + /**
36 +  * struct blk_crypto_key - an inline encryption key
37 +  * @crypto_cfg: the crypto configuration (like crypto_mode, key size) for this
38 +  *  key
39 +  * @data_unit_size_bits: log2 of data_unit_size
40 +  * @size: size of this key in bytes (determined by @crypto_cfg.crypto_mode)
41 +  * @raw: the raw bytes of this key. Only the first @size bytes are used.
42 +  *
43 +  * A blk_crypto_key is immutable once created, and many bios can reference it at
44 +  * the same time. It must not be freed until all bios using it have completed
45 +  * and it has been evicted from all devices on which it may have been used.
46 +  */
47 + struct blk_crypto_key {
48 +     struct blk_crypto_config crypto_cfg;
49 +     unsigned int data_unit_size_bits;
50 +     unsigned int size;
51 +     u8 raw[BLK_CRYPTO_MAX_KEY_SIZE];
52 + };
53 + 
54 + #define BLK_CRYPTO_MAX_IV_SIZE 32
55 + #define BLK_CRYPTO_DUN_ARRAY_SIZE (BLK_CRYPTO_MAX_IV_SIZE / sizeof(u64))
56 + 
57 + /**
58 +  * struct bio_crypt_ctx - an inline encryption context
59 +  * @bc_key: the key, algorithm, and data unit size to use
60 +  * @bc_dun: the data unit number (starting IV) to use
61 +  *
62 +  * A bio_crypt_ctx specifies that the contents of the bio will be encrypted (for
63 +  * write requests) or decrypted (for read requests) inline by the storage device
64 +  * or controller, or by the crypto API fallback.
65 +  */
66 + struct bio_crypt_ctx {
67 +     const struct blk_crypto_key *bc_key;
68 +     u64 bc_dun[BLK_CRYPTO_DUN_ARRAY_SIZE];
69 + };
70 + 
71 + #include <linux/blk_types.h>
72 + #include <linux/blkdev.h>
73 + 
74 + struct request;
75 + struct request_queue;
76 + 
77 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION
78 + 
79 + static inline bool bio_has_crypt_ctx(struct bio *bio)
80 + {
81 +     return bio->bi_crypt_context;
82 + }
83 + 
84 + void bio_crypt_set_ctx(struct bio *bio, const struct blk_crypto_key *key,
85 +                const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
86 +                gfp_t gfp_mask);
87 + 
88 + bool bio_crypt_dun_is_contiguous(const struct bio_crypt_ctx *bc,
89 +                  unsigned int bytes,
90 +                  const u64 next_dun[BLK_CRYPTO_DUN_ARRAY_SIZE]);
91 + 
92 + int blk_crypto_init_key(struct blk_crypto_key *blk_key, const u8 *raw_key,
93 +             enum blk_crypto_mode_num crypto_mode,
94 +             unsigned int dun_bytes,
95 +             unsigned int data_unit_size);
96 + 
97 + int blk_crypto_start_using_key(const struct blk_crypto_key *key,
98 +                    struct request_queue *q);
99 + 
100 + int blk_crypto_evict_key(struct request_queue *q,
101 +              const struct blk_crypto_key *key);
102 + 
103 + bool blk_crypto_config_supported(struct request_queue *q,
104 +                  const struct blk_crypto_config *cfg);
105 + 
106 + #else /* CONFIG_BLK_INLINE_ENCRYPTION */
107 + 
108 + static inline bool bio_has_crypt_ctx(struct bio *bio)
109 + {
110 +     return false;
111 + }
112 + 
113 + #endif /* CONFIG_BLK_INLINE_ENCRYPTION */
114 + 
115 + void __bio_crypt_clone(struct bio *dst, struct bio *src, gfp_t gfp_mask);
116 + static inline void bio_crypt_clone(struct bio *dst, struct bio *src,
117 +                    gfp_t gfp_mask)
118 + {
119 +     if (bio_has_crypt_ctx(src))
120 +         __bio_crypt_clone(dst, src, gfp_mask);
121 + }
122 + 
123 + #endif /* __LINUX_BLK_CRYPTO_H */
+14
include/linux/blk-mq.h
···
140 140      */
141 141     atomic_t        nr_active;
142 142 
143 +     /** @cpuhp_online: List to store request if CPU is going to die */
144 +     struct hlist_node   cpuhp_online;
143 145     /** @cpuhp_dead: List to store request if some CPU die. */
144 146     struct hlist_node   cpuhp_dead;
145 147     /** @kobj: Kernel object for sysfs. */
···
393 391 enum {
394 392     BLK_MQ_F_SHOULD_MERGE   = 1 << 0,
395 393     BLK_MQ_F_TAG_SHARED = 1 << 1,
394 +     /*
395 +      * Set when this device requires underlying blk-mq device for
396 +      * completing IO:
397 +      */
398 +     BLK_MQ_F_STACKING   = 1 << 2,
396 399     BLK_MQ_F_BLOCKING   = 1 << 5,
397 400     BLK_MQ_F_NO_SCHED   = 1 << 6,
398 401     BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
···
406 399     BLK_MQ_S_STOPPED    = 0,
407 400     BLK_MQ_S_TAG_ACTIVE = 1,
408 401     BLK_MQ_S_SCHED_RESTART  = 2,
402 + 
403 +     /* hw queue is inactive after all its CPUs become offline */
404 +     BLK_MQ_S_INACTIVE   = 3,
409 405 
410 406     BLK_MQ_MAX_DEPTH    = 10240,
411 407 
···
504 494 void blk_mq_kick_requeue_list(struct request_queue *q);
505 495 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
506 496 bool blk_mq_complete_request(struct request *rq);
497 + void blk_mq_force_complete_rq(struct request *rq);
507 498 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
508 499                struct bio *bio, unsigned int nr_segs);
509 500 bool blk_mq_queue_stopped(struct request_queue *q);
···
519 508 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
520 509 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
521 510 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
511 + void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs);
522 512 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
523 513         busy_tag_iter_fn *fn, void *priv);
524 514 void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
···
588 576     if (rq->q->mq_ops->cleanup_rq)
589 577         rq->q->mq_ops->cleanup_rq(rq);
590 578 }
579 + 
580 + blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio);
591 581 
592 582 #endif
+21 -3
include/linux/blk_types.h
···
18 18 struct io_context;
19 19 struct cgroup_subsys_state;
20 20 typedef void (bio_end_io_t) (struct bio *);
21 + struct bio_crypt_ctx;
21 22 
22 23 /*
23 24  * Block error status values. See block/blk-core:blk_errors for the details.
···
63 62  * any other system wide resources.
64 63  */
65 64 #define BLK_STS_DEV_RESOURCE    ((__force blk_status_t)13)
65 + 
66 + /*
67 +  * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
68 +  * related resources are unavailable, but the driver can guarantee the queue
69 +  * will be rerun in the future once the resources become available again.
70 +  *
71 +  * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
72 +  * a zone specific resource and IO to a different zone on the same device could
73 +  * still be served. Examples of that are zones that are write-locked, but a read
74 +  * to the same zone could be served.
75 +  */
76 + #define BLK_STS_ZONE_RESOURCE   ((__force blk_status_t)14)
66 77 
67 78 /**
68 79  * blk_path_error - returns true if error may be path related
···
186 173     u64 bi_iocost_cost;
187 174 #endif
188 175 #endif
176 + 
177 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION
178 +     struct bio_crypt_ctx    *bi_crypt_context;
179 + #endif
180 + 
189 181     union {
190 182 #if defined(CONFIG_BLK_DEV_INTEGRITY)
191 183         struct bio_integrity_payload *bi_integrity; /* data integrity */
···
238 220                  * throttling rules. Don't do it again. */
239 221     BIO_TRACE_COMPLETION,   /* bio_endio() should trace the final completion
240 222                  * of this bio. */
241 -     BIO_QUEUE_ENTERED,  /* can use blk_queue_enter_live() */
223 +     BIO_CGROUP_ACCT,    /* has been accounted to a cgroup */
242 224     BIO_TRACKED,        /* set if bio goes through the rq_qos path */
243 225     BIO_FLAG_LAST
244 226 };
···
314 296     REQ_OP_ZONE_CLOSE   = 11,
315 297     /* Transition a zone to full */
316 298     REQ_OP_ZONE_FINISH  = 12,
299 +     /* write data at the current zone write pointer */
300 +     REQ_OP_ZONE_APPEND  = 13,
317 301 
318 302     /* SCSI passthrough using struct scsi_request */
319 303     REQ_OP_SCSI_IN      = 32,
···
343 323     __REQ_RAHEAD,       /* read ahead, can fail anytime */
344 324     __REQ_BACKGROUND,   /* background IO */
345 325     __REQ_NOWAIT,       /* Don't wait if request will block */
346 -     __REQ_NOWAIT_INLINE,    /* Return would-block error inline */
347 326     /*
348 327      * When a shared kthread needs to issue a bio for a cgroup, doing
349 328      * so synchronously can lead to priority inversions as the kthread
···
377 358 #define REQ_RAHEAD      (1ULL << __REQ_RAHEAD)
378 359 #define REQ_BACKGROUND      (1ULL << __REQ_BACKGROUND)
379 360 #define REQ_NOWAIT      (1ULL << __REQ_NOWAIT)
380 - #define REQ_NOWAIT_INLINE   (1ULL << __REQ_NOWAIT_INLINE)
381 361 #define REQ_CGROUP_PUNT     (1ULL << __REQ_CGROUP_PUNT)
382 362 
383 363 #define REQ_NOUNMAP     (1ULL << __REQ_NOUNMAP)
+106 -16
include/linux/blkdev.h
···
43 43 struct rq_qos;
44 44 struct blk_queue_stats;
45 45 struct blk_stat_callback;
46 + struct blk_keyslot_manager;
46 47 
47 48 #define BLKDEV_MIN_RQ   4
48 49 #define BLKDEV_MAX_RQ   128 /* Default maximum */
···
83 82 /* set for "ide_preempt" requests and also for requests for which the SCSI
84 83    "quiesce" state must be ignored. */
85 84 #define RQF_PREEMPT     ((__force req_flags_t)(1 << 8))
86 - /* contains copies of user pages */
87 - #define RQF_COPY_USER       ((__force req_flags_t)(1 << 9))
88 85 /* vaguely specified driver internal error. Ignored by the block layer */
89 86 #define RQF_FAILED      ((__force req_flags_t)(1 << 10))
90 87 /* don't warn about errors */
···
222 223     unsigned short nr_integrity_segments;
223 224 #endif
224 225 
226 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION
227 +     struct bio_crypt_ctx *crypt_ctx;
228 +     struct blk_ksm_keyslot *crypt_keyslot;
229 + #endif
230 + 
225 231     unsigned short write_hint;
226 232     unsigned short ioprio;
227 - 
228 -     unsigned int extra_len; /* length of alignment and padding */
229 233 
230 234     enum mq_rq_state state;
231 235     refcount_t ref;
···
292 290 typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
293 291 
294 292 struct bio_vec;
295 - typedef int (dma_drain_needed_fn)(struct request *);
296 293 
297 294 enum blk_eh_timer_return {
298 295     BLK_EH_DONE,        /* drivers has completed the command */
···
337 336     unsigned int        max_hw_discard_sectors;
338 337     unsigned int        max_write_same_sectors;
339 338     unsigned int        max_write_zeroes_sectors;
339 +     unsigned int        max_zone_append_sectors;
340 340     unsigned int        discard_granularity;
341 341     unsigned int        discard_alignment;
342 342 
···
363 361 extern int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
364 362                 sector_t sectors, sector_t nr_sectors,
365 363                 gfp_t gfp_mask);
366 - extern int blk_revalidate_disk_zones(struct gendisk *disk);
364 + int blk_revalidate_disk_zones(struct gendisk *disk,
365 +         void (*update_driver_data)(struct gendisk *disk));
367 366 
368 367 extern int blkdev_report_zones_ioctl(struct block_device *bdev, fmode_t mode,
369 368                      unsigned int cmd, unsigned long arg);
···
402 399     struct rq_qos       *rq_qos;
403 400 
404 401     make_request_fn     *make_request_fn;
405 -     dma_drain_needed_fn *dma_drain_needed;
406 402 
407 403     const struct blk_mq_ops *mq_ops;
408 404 
···
471 469      */
472 470     unsigned long       nr_requests;    /* Max # of requests */
473 471 
474 -     unsigned int        dma_drain_size;
475 -     void            *dma_drain_buffer;
476 472     unsigned int        dma_pad_mask;
477 473     unsigned int        dma_alignment;
474 + 
475 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION
476 +     /* Inline crypto capabilities */
477 +     struct blk_keyslot_manager *ksm;
478 + #endif
478 479 
479 480     unsigned int        rq_timeout;
480 481     int         poll_nsec;
···
734 729 {
735 730     return 0;
736 731 }
732 + static inline bool blk_queue_zone_is_seq(struct request_queue *q,
733 +                      sector_t sector)
734 + {
735 +     return false;
736 + }
737 + static inline unsigned int blk_queue_zone_no(struct request_queue *q,
738 +                          sector_t sector)
739 + {
740 +     return 0;
741 + }
737 742 #endif /* CONFIG_BLK_DEV_ZONED */
738 743 
739 744 static inline bool rq_is_sync(struct request *rq)
···
760 745         return false;
761 746 
762 747     if (req_op(rq) == REQ_OP_WRITE_ZEROES)
748 +         return false;
749 + 
750 +     if (req_op(rq) == REQ_OP_ZONE_APPEND)
763 751         return false;
764 752 
765 753     if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
···
1099 1081 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
1100 1082         unsigned int max_write_same_sectors);
1101 1083 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
1084 + extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
1085 +         unsigned int max_zone_append_sectors);
1102 1086 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
1103 1087 extern void blk_queue_alignment_offset(struct request_queue *q,
1104 1088                        unsigned int alignment);
···
1119 1099                  sector_t offset);
1120 1100 extern void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b);
1121 1101 extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int);
1122 - extern int blk_queue_dma_drain(struct request_queue *q,
1123 -                    dma_drain_needed_fn *dma_drain_needed,
1124 -                    void *buf, unsigned int size);
1125 1102 extern void blk_queue_segment_boundary(struct request_queue *, unsigned long);
1126 1103 extern void blk_queue_virt_boundary(struct request_queue *, unsigned long);
1127 1104 extern void blk_queue_dma_alignment(struct request_queue *, int);
···
1155 1138     return max_t(unsigned short, rq->nr_phys_segments, 1);
1156 1139 }
1157 1140 
1158 - extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
1141 + int __blk_rq_map_sg(struct request_queue *q, struct request *rq,
1142 +         struct scatterlist *sglist, struct scatterlist **last_sg);
1143 + static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
1144 +         struct scatterlist *sglist)
1145 + {
1146 +     struct scatterlist *last_sg = NULL;
1147 + 
1148 +     return __blk_rq_map_sg(q, rq, sglist, &last_sg);
1149 + }
1159 1150 extern void blk_dump_rq_flags(struct request *, char *);
1160 1151 extern long nr_blockdev_pages(void);
1161 1152 
···
1231 1206            !list_empty(&plug->cb_list));
1232 1207 }
1233 1208 
1234 - extern int blkdev_issue_flush(struct block_device *, gfp_t, sector_t *);
1209 + extern void blk_io_schedule(void);
1210 + 
1211 + int blkdev_issue_flush(struct block_device *, gfp_t);
1235 1212 extern int blkdev_issue_write_same(struct block_device *bdev, sector_t sector,
1236 1213         sector_t nr_sects, gfp_t gfp_mask, struct page *page);
···
1318 1291 static inline unsigned int queue_max_segment_size(const struct request_queue *q)
1319 1292 {
1320 1293     return q->limits.max_segment_size;
1294 + }
1295 + 
1296 + static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
1297 + {
1298 +     return q->limits.max_zone_append_sectors;
1321 1299 }
1322 1300 
1323 1301 static inline unsigned queue_logical_block_size(const struct request_queue *q)
···
1583 1551     return blk_get_integrity(bdev->bd_disk);
1584 1552 }
1585 1553 
1554 + static inline bool
1555 + blk_integrity_queue_supports_integrity(struct request_queue *q)
1556 + {
1557 +     return q->integrity.profile;
1558 + }
1559 + 
1586 1560 static inline bool blk_integrity_rq(struct request *rq)
1587 1561 {
1588 1562     return rq->cmd_flags & REQ_INTEGRITY;
···
1669 1631 {
1670 1632     return NULL;
1671 1633 }
1634 + static inline bool
1635 + blk_integrity_queue_supports_integrity(struct request_queue *q)
1636 + {
1637 +     return false;
1638 + }
1672 1639 static inline int blk_integrity_compare(struct gendisk *a, struct gendisk *b)
1673 1640 {
1674 1641     return 0;
···
1725 1682 
1726 1683 #endif /* CONFIG_BLK_DEV_INTEGRITY */
1727 1684 
1685 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION
1686 + 
1687 + bool blk_ksm_register(struct blk_keyslot_manager *ksm, struct request_queue *q);
1688 + 
1689 + void blk_ksm_unregister(struct request_queue *q);
1690 + 
1691 + #else /* CONFIG_BLK_INLINE_ENCRYPTION */
1692 + 
1693 + static inline bool blk_ksm_register(struct blk_keyslot_manager *ksm,
1694 +                     struct request_queue *q)
1695 + {
1696 +     return true;
1697 + }
1698 + 
1699 + static inline void blk_ksm_unregister(struct request_queue *q) { }
1700 + 
1701 + #endif /* CONFIG_BLK_INLINE_ENCRYPTION */
1702 + 
1703 + 
1728 1704 struct block_device_operations {
1729 1705     int (*open) (struct block_device *, fmode_t);
1730 1706     void (*release) (struct gendisk *, fmode_t);
···
1781 1719 
1782 1720 #ifdef CONFIG_BLK_DEV_ZONED
1783 1721 bool blk_req_needs_zone_write_lock(struct request *rq);
1722 + bool blk_req_zone_write_trylock(struct request *rq);
1784 1723 void __blk_req_zone_write_lock(struct request *rq);
1785 1724 void __blk_req_zone_write_unlock(struct request *rq);
1786 1725 
···
1872 1809     return false;
1873 1810 }
1874 1811 
1875 -  static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
1876 -                      sector_t *error_sector)
1812 + static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask)
1877 1813 {
1878 1814     return 0;
1879 1815 }
···
1891 1829     else
1892 1830         wake_up_process(waiter);
1893 1831 }
1832 + 
1833 + #ifdef CONFIG_BLOCK
1834 + unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
1835 +         unsigned int op);
1836 + void disk_end_io_acct(struct gendisk *disk, unsigned int op,
1837 +         unsigned long start_time);
1838 + 
1839 + /**
1840 +  * bio_start_io_acct - start I/O accounting for bio based drivers
1841 +  * @bio: bio to start account for
1842 +  *
1843 +  * Returns the start time that should be passed back to bio_end_io_acct().
1844 +  */
1845 + static inline unsigned long bio_start_io_acct(struct bio *bio)
1846 + {
1847 +     return disk_start_io_acct(bio->bi_disk, bio_sectors(bio), bio_op(bio));
1848 + }
1849 + 
1850 + /**
1851 +  * bio_end_io_acct - end I/O accounting for bio based drivers
1852 +  * @bio: bio to end account for
1853 +  * @start: start time returned by bio_start_io_acct()
1854 +  */
1855 + static inline void bio_end_io_acct(struct bio *bio, unsigned long start_time)
1856 + {
1857 +     return disk_end_io_acct(bio->bi_disk, bio_op(bio), start_time);
1858 + }
1859 + #endif /* CONFIG_BLOCK */
1894 1860 
1895 1861 #endif
+11 -2
include/linux/bvec.h
···
 #include <linux/errno.h>
 #include <linux/mm.h>

-/*
- * was unsigned short, but we might as well be ready for > 64kB I/O pages
+/**
+ * struct bio_vec - a contiguous range of physical memory addresses
+ * @bv_page:   First page associated with the address range.
+ * @bv_len:    Number of bytes in the address range.
+ * @bv_offset: Start of the address range relative to the start of @bv_page.
+ *
+ * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
+ *
+ *   nth_page(@bv_page, n) == @bv_page + n
+ *
+ * This holds because page_is_mergeable() checks the above property.
  */
 struct bio_vec {
 	struct page	*bv_page;
+6 -1
include/linux/cdrom.h
···
 				    struct packet_command *);
 };

+int cdrom_multisession(struct cdrom_device_info *cdi,
+		struct cdrom_multisession *info);
+int cdrom_read_tocentry(struct cdrom_device_info *cdi,
+		struct cdrom_tocentry *entry);
+
 /* the general block_device operations structure: */
 extern int cdrom_open(struct cdrom_device_info *cdi, struct block_device *bdev,
 		      fmode_t mode);
···
 			       unsigned int clearing);
 extern int cdrom_media_changed(struct cdrom_device_info *);

-extern int register_cdrom(struct cdrom_device_info *cdi);
+extern int register_cdrom(struct gendisk *disk, struct cdrom_device_info *cdi);
 extern void unregister_cdrom(struct cdrom_device_info *cdi);

 typedef struct {
+1
include/linux/cpuhotplug.h
···
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-4
include/linux/device.h
···
 /*
  * Easy functions for dynamically creating devices on the fly
  */
-extern __printf(5, 0)
-struct device *device_create_vargs(struct class *cls, struct device *parent,
-				   dev_t devt, void *drvdata,
-				   const char *fmt, va_list vargs);
 extern __printf(5, 6)
 struct device *device_create(struct class *cls, struct device *parent,
 			     dev_t devt, void *drvdata,
+1 -1
include/linux/elevator.h
···
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
 	void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
-	void (*prepare_request)(struct request *, struct bio *bio);
+	void (*prepare_request)(struct request *);
 	void (*finish_request)(struct request *);
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
-2
include/linux/fs.h
···
 #ifdef CONFIG_BLOCK
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
-extern void bdev_unhash_inode(dev_t dev);
 extern struct block_device *bdget(dev_t);
 extern struct block_device *bdgrab(struct block_device *bdev);
 extern void bd_set_size(struct block_device *, loff_t size);
···
 extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
-extern int invalidate_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
+23 -17
include/linux/genhd.h
···
 #include <linux/fs.h>
 #include <linux/workqueue.h>

-struct disk_stats {
-	u64 nsecs[NR_STAT_GROUPS];
-	unsigned long sectors[NR_STAT_GROUPS];
-	unsigned long ios[NR_STAT_GROUPS];
-	unsigned long merges[NR_STAT_GROUPS];
-	unsigned long io_ticks;
-	local_t in_flight[2];
-};
-
 #define PARTITION_META_INFO_VOLNAMELTH	64
 /*
  * Enough for the string representation of any kind of UUID plus NULL.
···
 	 * can be non-atomic on 32bit machines with 64bit sector_t.
 	 */
 	sector_t nr_sects;
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
 	seqcount_t nr_sects_seq;
+#endif
+	unsigned long stamp;
+	struct disk_stats __percpu *dkstats;
+	struct percpu_ref ref;
+
 	sector_t alignment_offset;
 	unsigned int discard_alignment;
 	struct device __dev;
···
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	int make_it_fail;
 #endif
-	unsigned long stamp;
-#ifdef	CONFIG_SMP
-	struct disk_stats __percpu *dkstats;
-#else
-	struct disk_stats dkstats;
-#endif
-	struct percpu_ref ref;
 	struct rcu_work rcu_work;
 };
···
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	struct kobject integrity_kobj;
 #endif	/* CONFIG_BLK_DEV_INTEGRITY */
+#if IS_ENABLED(CONFIG_CDROM)
+	struct cdrom_device_info *cdi;
+#endif
 	int node_id;
 	struct badblocks *bb;
 	struct lockdep_map lockdep_map;
 };
+
+#if IS_REACHABLE(CONFIG_CDROM)
+#define disk_to_cdi(disk)	((disk)->cdi)
+#else
+#define disk_to_cdi(disk)	NULL
+#endif

 static inline struct gendisk *part_to_disk(struct hd_struct *part)
 {
···
 {
 	if (likely(part))
 		put_device(part_to_dev(part));
+}
+
+static inline void hd_sects_seq_init(struct hd_struct *p)
+{
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+	seqcount_init(&p->nr_sects_seq);
+#endif
 }

 /*
···

 int bdev_disk_changed(struct block_device *bdev, bool invalidate);
 int blk_add_partitions(struct gendisk *disk, struct block_device *bdev);
-int blk_drop_partitions(struct gendisk *disk, struct block_device *bdev);
+int blk_drop_partitions(struct block_device *bdev);
 extern void printk_all_partitions(void);

 extern struct gendisk *__alloc_disk_node(int minors, int node_id);
+106
include/linux/keyslot-manager.h
···
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2019 Google LLC
+ */
+
+#ifndef __LINUX_KEYSLOT_MANAGER_H
+#define __LINUX_KEYSLOT_MANAGER_H
+
+#include <linux/bio.h>
+#include <linux/blk-crypto.h>
+
+struct blk_keyslot_manager;
+
+/**
+ * struct blk_ksm_ll_ops - functions to manage keyslots in hardware
+ * @keyslot_program:	Program the specified key into the specified slot in the
+ *			inline encryption hardware.
+ * @keyslot_evict:	Evict key from the specified keyslot in the hardware.
+ *			The key is provided so that e.g. dm layers can evict
+ *			keys from the devices that they map over.
+ *			Returns 0 on success, -errno otherwise.
+ *
+ * This structure should be provided by storage device drivers when they set up
+ * a keyslot manager - this structure holds the function ptrs that the keyslot
+ * manager will use to manipulate keyslots in the hardware.
+ */
+struct blk_ksm_ll_ops {
+	int (*keyslot_program)(struct blk_keyslot_manager *ksm,
+			       const struct blk_crypto_key *key,
+			       unsigned int slot);
+	int (*keyslot_evict)(struct blk_keyslot_manager *ksm,
+			     const struct blk_crypto_key *key,
+			     unsigned int slot);
+};
+
+struct blk_keyslot_manager {
+	/*
+	 * The struct blk_ksm_ll_ops that this keyslot manager will use
+	 * to perform operations like programming and evicting keys on the
+	 * device
+	 */
+	struct blk_ksm_ll_ops ksm_ll_ops;
+
+	/*
+	 * The maximum number of bytes supported for specifying the data unit
+	 * number.
+	 */
+	unsigned int max_dun_bytes_supported;
+
+	/*
+	 * Array of size BLK_ENCRYPTION_MODE_MAX of bitmasks that represents
+	 * whether a crypto mode and data unit size are supported. The i'th
+	 * bit of crypto_mode_supported[crypto_mode] is set iff a data unit
+	 * size of (1 << i) is supported. We only support data unit sizes
+	 * that are powers of 2.
+	 */
+	unsigned int crypto_modes_supported[BLK_ENCRYPTION_MODE_MAX];
+
+	/* Device for runtime power management (NULL if none) */
+	struct device *dev;
+
+	/* Here onwards are *private* fields for internal keyslot manager use */
+
+	unsigned int num_slots;
+
+	/* Protects programming and evicting keys from the device */
+	struct rw_semaphore lock;
+
+	/* List of idle slots, with least recently used slot at front */
+	wait_queue_head_t idle_slots_wait_queue;
+	struct list_head idle_slots;
+	spinlock_t idle_slots_lock;
+
+	/*
+	 * Hash table which maps struct *blk_crypto_key to keyslots, so that we
+	 * can find a key's keyslot in O(1) time rather than O(num_slots).
+	 * Protected by 'lock'.
+	 */
+	struct hlist_head *slot_hashtable;
+	unsigned int log_slot_ht_size;
+
+	/* Per-keyslot data */
+	struct blk_ksm_keyslot *slots;
+};
+
+int blk_ksm_init(struct blk_keyslot_manager *ksm, unsigned int num_slots);
+
+blk_status_t blk_ksm_get_slot_for_key(struct blk_keyslot_manager *ksm,
+				      const struct blk_crypto_key *key,
+				      struct blk_ksm_keyslot **slot_ptr);
+
+unsigned int blk_ksm_get_slot_idx(struct blk_ksm_keyslot *slot);
+
+void blk_ksm_put_slot(struct blk_ksm_keyslot *slot);
+
+bool blk_ksm_crypto_cfg_supported(struct blk_keyslot_manager *ksm,
+				  const struct blk_crypto_config *cfg);
+
+int blk_ksm_evict_key(struct blk_keyslot_manager *ksm,
+		      const struct blk_crypto_key *key);
+
+void blk_ksm_reprogram_all_keys(struct blk_keyslot_manager *ksm);
+
+void blk_ksm_destroy(struct blk_keyslot_manager *ksm);
+
+#endif /* __LINUX_KEYSLOT_MANAGER_H */
+2
include/linux/libata.h
···
 #define ATA_SCSI_COMPAT_IOCTL /* empty */
 #endif
 extern int ata_scsi_queuecmd(struct Scsi_Host *h, struct scsi_cmnd *cmd);
+bool ata_scsi_dma_need_drain(struct request *rq);
 extern int ata_sas_scsi_ioctl(struct ata_port *ap, struct scsi_device *dev,
 			    unsigned int cmd, void __user *arg);
 extern bool ata_link_online(struct ata_link *link);
···
 	.ioctl			= ata_scsi_ioctl,		\
 	ATA_SCSI_COMPAT_IOCTL					\
 	.queuecommand		= ata_scsi_queuecmd,		\
+	.dma_need_drain		= ata_scsi_dma_need_drain,	\
 	.can_queue		= ATA_DEF_QUEUE,		\
 	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,		\
 	.this_id		= ATA_SHT_THIS_ID,		\
+14 -47
include/linux/part_stat.h
···

 #include <linux/genhd.h>

+struct disk_stats {
+	u64 nsecs[NR_STAT_GROUPS];
+	unsigned long sectors[NR_STAT_GROUPS];
+	unsigned long ios[NR_STAT_GROUPS];
+	unsigned long merges[NR_STAT_GROUPS];
+	unsigned long io_ticks;
+	local_t in_flight[2];
+};
+
 /*
  * Macros to operate on percpu disk statistics:
  *
- * {disk|part|all}_stat_{add|sub|inc|dec}() modify the stat counters
- * and should be called between disk_stat_lock() and
- * disk_stat_unlock().
+ * {disk|part|all}_stat_{add|sub|inc|dec}() modify the stat counters and should
+ * be called between disk_stat_lock() and disk_stat_unlock().
  *
  * part_stat_read() can be called at any time.
- *
- * part_stat_{add|set_all}() and {init|free}_part_stats are for
- * internal use only.
  */
-#ifdef	CONFIG_SMP
-#define part_stat_lock()	({ rcu_read_lock(); get_cpu(); })
-#define part_stat_unlock()	do { put_cpu(); rcu_read_unlock(); } while (0)
+#define part_stat_lock()	preempt_disable()
+#define part_stat_unlock()	preempt_enable()

 #define part_stat_get_cpu(part, field, cpu)				\
 	(per_cpu_ptr((part)->dkstats, (cpu))->field)
···
 				sizeof(struct disk_stats));
 }

-static inline int init_part_stats(struct hd_struct *part)
-{
-	part->dkstats = alloc_percpu(struct disk_stats);
-	if (!part->dkstats)
-		return 0;
-	return 1;
-}
-
-static inline void free_part_stats(struct hd_struct *part)
-{
-	free_percpu(part->dkstats);
-}
-
-#else /* !CONFIG_SMP */
-#define part_stat_lock()	({ rcu_read_lock(); 0; })
-#define part_stat_unlock()	rcu_read_unlock()
-
-#define part_stat_get(part, field)		((part)->dkstats.field)
-#define part_stat_get_cpu(part, field, cpu)	part_stat_get(part, field)
-#define part_stat_read(part, field)		part_stat_get(part, field)
-
-static inline void part_stat_set_all(struct hd_struct *part, int value)
-{
-	memset(&part->dkstats, value, sizeof(struct disk_stats));
-}
-
-static inline int init_part_stats(struct hd_struct *part)
-{
-	return 1;
-}
-
-static inline void free_part_stats(struct hd_struct *part)
-{
-}
-
-#endif	/* CONFIG_SMP */
-
 #define part_stat_read_accum(part, field)				\
 	(part_stat_read(part, field[STAT_READ]) +			\
 	 part_stat_read(part, field[STAT_WRITE]) +			\
 	 part_stat_read(part, field[STAT_DISCARD]))

 #define __part_stat_add(part, field, addnd)				\
-	(part_stat_get(part, field) += (addnd))
+	__this_cpu_add((part)->dkstats->field, addnd)

 #define part_stat_add(part, field, addnd)	do {			\
 	__part_stat_add((part), field, addnd);				\
+1
include/scsi/scsi_cmnd.h
···
 	unsigned long state;	/* Command completion state */

 	unsigned char tag;	/* SCSI-II queued command tag */
+	unsigned int extra_len;	/* length of alignment and padding */
 };

 /*
+3
include/scsi/scsi_device.h
···
 	struct scsi_device_handler *handler;
 	void			*handler_data;

+	size_t			dma_drain_len;
+	void			*dma_drain_buf;
+
 	unsigned char		access_state;
 	struct mutex		state_mutex;
 	enum scsi_device_state sdev_state;
+7
include/scsi/scsi_host.h
···
 	int (* map_queues)(struct Scsi_Host *shost);

 	/*
+	 * Check if scatterlists need to be padded for DMA draining.
+	 *
+	 * Status: OPTIONAL
+	 */
+	bool (* dma_need_drain)(struct request *rq);
+
+	/*
 	 * This function determines the BIOS parameters for a given
 	 * harddisk.  These tend to be numbers that are made up by
 	 * the host adapter.  Parameters:
+2 -2
kernel/trace/blktrace.c
···
 	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
 		blkcg = NULL;
 #ifdef CONFIG_BLK_CGROUP
-	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
+	trace_note(bt, current->pid, BLK_TN_MESSAGE, buf, n,
 		blkcg ? cgroup_id(blkcg->css.cgroup) : 1);
 #else
-	trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, 0);
+	trace_note(bt, current->pid, BLK_TN_MESSAGE, buf, n, 0);
 #endif
 	local_irq_restore(flags);
 }
+5 -16
mm/backing-dev.c
···
 #include <trace/events/writeback.h>

 struct backing_dev_info noop_backing_dev_info = {
-	.name		= "noop",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 EXPORT_SYMBOL_GPL(noop_backing_dev_info);
···
 	return ret;
 }

-struct backing_dev_info *bdi_alloc_node(gfp_t gfp_mask, int node_id)
+struct backing_dev_info *bdi_alloc(int node_id)
 {
 	struct backing_dev_info *bdi;

-	bdi = kmalloc_node(sizeof(struct backing_dev_info),
-			   gfp_mask | __GFP_ZERO, node_id);
+	bdi = kzalloc_node(sizeof(*bdi), GFP_KERNEL, node_id);
 	if (!bdi)
 		return NULL;
···
 	}
 	return bdi;
 }
-EXPORT_SYMBOL(bdi_alloc_node);
+EXPORT_SYMBOL(bdi_alloc);

 static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
 {
···
 	trace_writeback_bdi_register(bdi);
 	return 0;
 }
-EXPORT_SYMBOL(bdi_register_va);

 int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)
 {
···
 }
 EXPORT_SYMBOL(bdi_register);

-int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
+void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner)
 {
-	int rc;
-
-	rc = bdi_register(bdi, "%u:%u", MAJOR(owner->devt), MINOR(owner->devt));
-	if (rc)
-		return rc;
-	/* Leaking owner reference... */
-	WARN_ON(bdi->owner);
+	WARN_ON_ONCE(bdi->owner);
 	bdi->owner = owner;
 	get_device(owner);
-	return 0;
 }
-EXPORT_SYMBOL(bdi_register_owner);

 /*
  * Remove bdi from bdi_list, and ensure that it is no longer visible
+26 -22
tools/cgroup/iocost_monitor.py
···
 parser.add_argument('--cgroup', action='append', metavar='REGEX',
                     help='Regex for target cgroups, ')
 parser.add_argument('--interval', '-i', metavar='SECONDS', type=float, default=1,
-                    help='Monitoring interval in seconds')
+                    help='Monitoring interval in seconds (0 exits immediately '
+                    'after checking requirements)')
 parser.add_argument('--json', action='store_true',
                     help='Output in json')
 args = parser.parse_args()
···
     def dict(self, now):
         return { 'device'               : devname,
-                 'timestamp'            : str(now),
-                 'enabled'              : str(int(self.enabled)),
-                 'running'              : str(int(self.running)),
-                 'period_ms'            : str(self.period_ms),
-                 'period_at'            : str(self.period_at),
-                 'period_vtime_at'      : str(self.vperiod_at),
-                 'busy_level'           : str(self.busy_level),
-                 'vrate_pct'            : str(self.vrate_pct), }
+                 'timestamp'            : now,
+                 'enabled'              : self.enabled,
+                 'running'              : self.running,
+                 'period_ms'            : self.period_ms,
+                 'period_at'            : self.period_at,
+                 'period_vtime_at'      : self.vperiod_at,
+                 'busy_level'           : self.busy_level,
+                 'vrate_pct'            : self.vrate_pct, }

     def table_preamble_str(self):
         state = ('RUN' if self.running else 'IDLE') if self.enabled else 'OFF'
···
     def dict(self, now, path):
         out = { 'cgroup'                : path,
-                'timestamp'            : str(now),
-                'is_active'            : str(int(self.is_active)),
-                'weight'               : str(self.weight),
-                'weight_active'        : str(self.active),
-                'weight_inuse'         : str(self.inuse),
-                'hweight_active_pct'   : str(self.hwa_pct),
-                'hweight_inuse_pct'    : str(self.hwi_pct),
-                'inflight_pct'         : str(self.inflight_pct),
-                'debt_ms'              : str(self.debt_ms),
-                'use_delay'            : str(self.use_delay),
-                'delay_ms'             : str(self.delay_ms),
-                'usage_pct'            : str(self.usage),
-                'address'              : str(hex(self.address)) }
+                'timestamp'            : now,
+                'is_active'            : self.is_active,
+                'weight'               : self.weight,
+                'weight_active'        : self.active,
+                'weight_inuse'         : self.inuse,
+                'hweight_active_pct'   : self.hwa_pct,
+                'hweight_inuse_pct'    : self.hwi_pct,
+                'inflight_pct'         : self.inflight_pct,
+                'debt_ms'              : self.debt_ms,
+                'use_delay'            : self.use_delay,
+                'delay_ms'             : self.delay_ms,
+                'usage_pct'            : self.usage,
+                'address'              : self.address }
         for i in range(len(self.usages)):
             out[f'usage_pct_{i}'] = str(self.usages[i])
         return out
···
     if ioc is None:
         err(f'Could not find ioc for {devname}');
+
+    if interval == 0:
+        sys.exit(0)

     # Keep printing
     while True: